All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: cem@kernel.org, djwong@kernel.org
Cc: Christoph Hellwig <hch@lst.de>,
	Bill O'Donnell <bodonnel@redhat.com>,
	linux-xfs@vger.kernel.org
Subject: [PATCH 29/67] xfs: force all buffers to be written during btree bulk load
Date: Wed, 17 Apr 2024 14:29:20 -0700	[thread overview]
Message-ID: <171338842775.1853449.4418216174854196307.stgit@frogsfrogsfrogs> (raw)
In-Reply-To: <171338842269.1853449.4066376212453408283.stgit@frogsfrogsfrogs>

From: Darrick J. Wong <djwong@kernel.org>

Source kernel commit: 13ae04d8d45227c2ba51e188daf9fc13d08a1b12

While stress-testing online repair of btrees, I noticed periodic
assertion failures from the buffer cache about buffers with incorrect
DELWRI_Q state.  Looking further, I observed this race between the AIL
trying to write out a btree block and repair zapping a btree block after
the fact:

AIL:    Repair0:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

stale buf X:
clear DELWRI_Q
does not clear b_list
free space X

delwri_submit   # oops

Worse yet, I discovered that running the same repair over and over in a
tight loop can result in a second race that cause data integrity
problems with the repair:

AIL:    Repair0:        Repair1:

pin buffer X
delwri_queue:
set DELWRI_Q
add to delwri list

stale buf X:
clear DELWRI_Q
does not clear b_list
free space X

find free space X
get buffer
rewrite buffer
delwri_queue:
set DELWRI_Q
already on a list, do not add

BAD: committed tree root before all blocks written

delwri_submit   # too late now

I traced this to my own misunderstanding of how the delwri lists work,
particularly with regards to the AIL's buffer list.  If a buffer is
logged and committed, the buffer can end up on that AIL buffer list.  If
btree repairs are run twice in rapid succession, it's possible that the
first repair will invalidate the buffer and free it before the next time
the AIL wakes up.  Marking the buffer stale clears DELWRI_Q from the
buffer state without removing the buffer from its delwri list.  The
buffer doesn't know which list it's on, so it cannot know which lock to
take to protect the list for a removal.

If the second repair allocates the same block, it will then recycle the
buffer to start writing the new btree block.  Meanwhile, if the AIL
wakes up and walks the buffer list, it will ignore the buffer because it
can't lock it, and go back to sleep.

When the second repair calls delwri_queue to put the buffer on the
list of buffers to write before committing the new btree, it will set
DELWRI_Q again, but since the buffer hasn't been removed from the AIL's
buffer list, it won't add it to the bulkload buffer's list.

This is incorrect, because the bulkload caller relies on delwri_submit
to ensure that all the buffers have been sent to disk /before/
required for data consistency.

Worse, the AIL won't clear DELWRI_Q from the buffer when it does finally
drop it, so the next thread to walk through the btree will trip over a
debug assertion on that flag.

To fix this, create a new function that waits for the buffer to be
removed from any other delwri lists before adding the buffer to the
caller's delwri list.  By waiting for the buffer to clear both the
delwri list and any potential delwri wait list, we can be sure that
repair will initiate writes of all buffers and report all write errors
back to userspace instead of committing the new structure.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
---
 libxfs/libxfs_io.h         |   11 +++++++++++
 libxfs/xfs_btree_staging.c |    4 +---
 2 files changed, 12 insertions(+), 3 deletions(-)


diff --git a/libxfs/libxfs_io.h b/libxfs/libxfs_io.h
index 267ea9796..259c6a7cf 100644
--- a/libxfs/libxfs_io.h
+++ b/libxfs/libxfs_io.h
@@ -244,6 +244,17 @@ xfs_buf_delwri_queue(struct xfs_buf *bp, struct list_head *buffer_list)
 	return true;
 }
 
+static inline void
+xfs_buf_delwri_queue_here(struct xfs_buf *bp, struct list_head *buffer_list)
+{
+	ASSERT(list_empty(&bp->b_list));
+
+	/* This buffer is uptodate; don't let it get reread. */
+	libxfs_buf_mark_dirty(bp);
+
+	xfs_buf_delwri_queue(bp, buffer_list);
+}
+
 int xfs_buf_delwri_submit(struct list_head *buffer_list);
 void xfs_buf_delwri_cancel(struct list_head *list);
 
diff --git a/libxfs/xfs_btree_staging.c b/libxfs/xfs_btree_staging.c
index a6a907916..baf7f4226 100644
--- a/libxfs/xfs_btree_staging.c
+++ b/libxfs/xfs_btree_staging.c
@@ -342,9 +342,7 @@ xfs_btree_bload_drop_buf(
 	if (*bpp == NULL)
 		return;
 
-	if (!xfs_buf_delwri_queue(*bpp, buffers_list))
-		ASSERT(0);
-
+	xfs_buf_delwri_queue_here(*bpp, buffers_list);
 	xfs_buf_relse(*bpp);
 	*bpp = NULL;
 }


  parent reply	other threads:[~2024-04-17 21:29 UTC|newest]

Thread overview: 126+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-17 21:11 [PATCHBOMB] xfsprogs: catch us up to 6.8, at least Darrick J. Wong
2024-04-17 21:15 ` [PATCHSET 01/11] xfsprogs: packaging fixes for 6.7 Darrick J. Wong
2024-04-17 21:18   ` [PATCH 1/2] debian: fix package configuration after removing platform_defs.h.in Darrick J. Wong
2024-05-01 18:00     ` Bill O'Donnell
2024-05-01 18:06       ` Bill O'Donnell
2024-04-17 21:18   ` [PATCH 2/2] libxfs: fix incorrect porting to 6.7 Darrick J. Wong
2024-04-17 21:15 ` [PATCHSET 02/11] xfsprogs: minor fixes for 6.7 Darrick J. Wong
2024-04-17 21:18   ` [PATCH 1/1] mkfs: fix log sunit rounding when external logs are in use Darrick J. Wong
2024-04-17 21:16 ` [PATCHSET 03/11] xfsprogs: convert utilities to use new rt helpers Darrick J. Wong
2024-04-17 21:19   ` [PATCH 01/11] xfs_repair: fix confusing rt space units in the duplicate detection code Darrick J. Wong
2024-04-17 21:19   ` [PATCH 02/11] libxfs: create a helper to compute leftovers of realtime extents Darrick J. Wong
2024-04-17 21:19   ` [PATCH 03/11] libxfs: use helpers to convert rt block numbers to rt extent numbers Darrick J. Wong
2024-04-17 21:19   ` [PATCH 04/11] xfs_repair: convert utility to use new rt extent helpers and types Darrick J. Wong
2024-04-17 21:20   ` [PATCH 05/11] mkfs: " Darrick J. Wong
2024-04-17 21:20   ` [PATCH 06/11] xfs_{db,repair}: convert open-coded xfs_rtword_t pointer accesses to helper Darrick J. Wong
2024-04-17 21:20   ` [PATCH 07/11] xfs_repair: convert helpers for rtbitmap block/wordcount computations Darrick J. Wong
2024-04-17 21:20   ` [PATCH 08/11] xfs_{db,repair}: use accessor functions for bitmap words Darrick J. Wong
2024-04-17 21:21   ` [PATCH 09/11] xfs_{db,repair}: use helpers for rtsummary block/wordcount computations Darrick J. Wong
2024-04-17 21:21   ` [PATCH 10/11] xfs_{db,repair}: use accessor functions for summary info words Darrick J. Wong
2024-04-17 21:21   ` [PATCH 11/11] xfs_{db,repair}: use m_blockwsize instead of sb_blocksize for rt blocks Darrick J. Wong
2024-04-17 21:16 ` [PATCHSET 04/11] libxfs: sync with 6.8 Darrick J. Wong
2024-04-17 21:22   ` [PATCH 01/67] xfs: use xfs_defer_pending objects to recover intent items Darrick J. Wong
2024-04-17 21:22   ` [PATCH 02/67] xfs: recreate work items when recovering " Darrick J. Wong
2024-04-17 21:22   ` [PATCH 03/67] xfs: use xfs_defer_finish_one to finish recovered work items Darrick J. Wong
2024-04-17 21:22   ` [PATCH 04/67] xfs: move ->iop_recover to xfs_defer_op_type Darrick J. Wong
2024-04-17 21:23   ` [PATCH 05/67] xfs: hoist intent done flag setting to ->finish_item callsite Darrick J. Wong
2024-04-17 21:23   ` [PATCH 06/67] xfs: hoist ->create_intent boilerplate to its callsite Darrick J. Wong
2024-04-17 21:23   ` [PATCH 07/67] xfs: use xfs_defer_create_done for the relogging operation Darrick J. Wong
2024-04-17 21:23   ` [PATCH 08/67] xfs: clean out XFS_LI_DIRTY setting boilerplate from ->iop_relog Darrick J. Wong
2024-04-17 21:24   ` [PATCH 09/67] xfs: hoist xfs_trans_add_item calls to defer ops functions Darrick J. Wong
2024-04-17 21:24   ` [PATCH 10/67] xfs: move ->iop_relog to struct xfs_defer_op_type Darrick J. Wong
2024-04-17 21:24   ` [PATCH 11/67] xfs: make rextslog computation consistent with mkfs Darrick J. Wong
2024-04-17 21:24   ` [PATCH 12/67] xfs: fix 32-bit truncation in xfs_compute_rextslog Darrick J. Wong
2024-04-17 21:25   ` [PATCH 13/67] xfs: don't allow overly small or large realtime volumes Darrick J. Wong
2024-04-17 21:25   ` [PATCH 14/67] xfs: elide ->create_done calls for unlogged deferred work Darrick J. Wong
2024-04-17 21:25   ` [PATCH 15/67] xfs: don't append work items to logged xfs_defer_pending objects Darrick J. Wong
2024-04-17 21:25   ` [PATCH 16/67] xfs: allow pausing of pending deferred work items Darrick J. Wong
2024-04-17 21:26   ` [PATCH 17/67] xfs: remove __xfs_free_extent_later Darrick J. Wong
2024-04-17 21:26   ` [PATCH 18/67] xfs: automatic freeing of freshly allocated unwritten space Darrick J. Wong
2024-04-17 21:26   ` [PATCH 19/67] xfs: remove unused fields from struct xbtree_ifakeroot Darrick J. Wong
2024-04-17 21:26   ` [PATCH 20/67] xfs: force small EFIs for reaping btree extents Darrick J. Wong
2024-04-17 21:27   ` [PATCH 21/67] xfs: ensure logflagsp is initialized in xfs_bmap_del_extent_real Darrick J. Wong
2024-04-17 21:27   ` [PATCH 22/67] xfs: update dir3 leaf block metadata after swap Darrick J. Wong
2024-04-17 21:27   ` [PATCH 23/67] xfs: extract xfs_da_buf_copy() helper function Darrick J. Wong
2024-04-17 21:28   ` [PATCH 24/67] xfs: move xfs_ondisk.h to libxfs/ Darrick J. Wong
2024-04-17 21:28   ` [PATCH 25/67] xfs: consolidate the xfs_attr_defer_* helpers Darrick J. Wong
2024-04-17 21:28   ` [PATCH 26/67] xfs: store an ops pointer in struct xfs_defer_pending Darrick J. Wong
2024-04-17 21:28   ` [PATCH 27/67] xfs: pass the defer ops instead of type to xfs_defer_start_recovery Darrick J. Wong
2024-04-17 21:29   ` [PATCH 28/67] xfs: pass the defer ops directly to xfs_defer_add Darrick J. Wong
2024-04-17 21:29   ` Darrick J. Wong [this message]
2024-04-17 21:29   ` [PATCH 30/67] xfs: set XBF_DONE on newly formatted btree block that are ready for writing Darrick J. Wong
2024-04-17 21:29   ` [PATCH 31/67] xfs: read leaf blocks when computing keys for bulkloading into node blocks Darrick J. Wong
2024-04-17 21:30   ` [PATCH 32/67] xfs: move btree bulkload record initialization to ->get_record implementations Darrick J. Wong
2024-04-17 21:30   ` [PATCH 33/67] xfs: constrain dirty buffers while formatting a staged btree Darrick J. Wong
2024-04-17 21:30   ` [PATCH 34/67] xfs: repair free space btrees Darrick J. Wong
2024-04-17 21:30   ` [PATCH 35/67] xfs: repair inode btrees Darrick J. Wong
2024-04-17 21:31   ` [PATCH 36/67] xfs: repair refcount btrees Darrick J. Wong
2024-04-17 21:31   ` [PATCH 37/67] xfs: dont cast to char * for XFS_DFORK_*PTR macros Darrick J. Wong
2024-04-17 21:31   ` [PATCH 38/67] xfs: set inode sick state flags when we zap either ondisk fork Darrick J. Wong
2024-04-17 21:31   ` [PATCH 39/67] xfs: zap broken inode forks Darrick J. Wong
2024-04-17 21:32   ` [PATCH 40/67] xfs: repair inode fork block mapping data structures Darrick J. Wong
2024-04-17 21:32   ` [PATCH 41/67] xfs: create a ranged query function for refcount btrees Darrick J. Wong
2024-04-17 21:32   ` [PATCH 42/67] xfs: create a new inode fork block unmap helper Darrick J. Wong
2024-04-17 21:32   ` [PATCH 43/67] xfs: improve dquot iteration for scrub Darrick J. Wong
2024-04-17 21:33   ` [PATCH 44/67] xfs: add lock protection when remove perag from radix tree Darrick J. Wong
2024-04-17 21:33   ` [PATCH 45/67] xfs: fix perag leak when growfs fails Darrick J. Wong
2024-04-17 21:33   ` [PATCH 46/67] xfs: remove the xfs_alloc_arg argument to xfs_bmap_btalloc_accounting Darrick J. Wong
2024-04-17 21:34   ` [PATCH 47/67] xfs: also use xfs_bmap_btalloc_accounting for RT allocations Darrick J. Wong
2024-04-17 21:34   ` [PATCH 48/67] xfs: return -ENOSPC from xfs_rtallocate_* Darrick J. Wong
2024-04-17 21:34   ` [PATCH 49/67] xfs: indicate if xfs_bmap_adjacent changed ap->blkno Darrick J. Wong
2024-04-17 21:34   ` [PATCH 50/67] xfs: move xfs_rtget_summary to xfs_rtbitmap.c Darrick J. Wong
2024-04-17 21:35   ` [PATCH 51/67] xfs: split xfs_rtmodify_summary_int Darrick J. Wong
2024-04-17 21:35   ` [PATCH 52/67] xfs: remove rt-wrappers from xfs_format.h Darrick J. Wong
2024-04-17 21:35   ` [PATCH 53/67] xfs: remove XFS_RTMIN/XFS_RTMAX Darrick J. Wong
2024-04-17 21:35   ` [PATCH 54/67] xfs: make if_data a void pointer Darrick J. Wong
2024-04-17 21:36   ` [PATCH 55/67] xfs: return if_data from xfs_idata_realloc Darrick J. Wong
2024-04-17 21:36   ` [PATCH 56/67] xfs: move the xfs_attr_sf_lookup tracepoint Darrick J. Wong
2024-04-17 21:36   ` [PATCH 57/67] xfs: simplify xfs_attr_sf_findname Darrick J. Wong
2024-04-17 21:36   ` [PATCH 58/67] xfs: remove xfs_attr_shortform_lookup Darrick J. Wong
2024-04-17 21:37   ` [PATCH 59/67] xfs: use xfs_attr_sf_findname in xfs_attr_shortform_getvalue Darrick J. Wong
2024-04-17 21:37   ` [PATCH 60/67] xfs: remove struct xfs_attr_shortform Darrick J. Wong
2024-04-17 21:37   ` [PATCH 61/67] xfs: remove xfs_attr_sf_hdr_t Darrick J. Wong
2024-04-17 21:37   ` [PATCH 62/67] xfs: turn the XFS_DA_OP_REPLACE checks in xfs_attr_shortform_addname into asserts Darrick J. Wong
2024-04-17 21:38   ` [PATCH 63/67] xfs: fix a use after free in xfs_defer_finish_recovery Darrick J. Wong
2024-04-17 21:38   ` [PATCH 64/67] xfs: use the op name in trace_xlog_intent_recovery_failed Darrick J. Wong
2024-04-17 21:38   ` [PATCH 65/67] xfs: fix backwards logic in xfs_bmap_alloc_account Darrick J. Wong
2024-04-17 21:38   ` [PATCH 66/67] xfs: reset XFS_ATTR_INCOMPLETE filter on node removal Darrick J. Wong
2024-04-17 21:39   ` [PATCH 67/67] xfs: remove conditional building of rt geometry validator functions Darrick J. Wong
2024-04-17 21:16 ` [PATCHSET 05/11] xfs_repair: faster btree bulkloading Darrick J. Wong
2024-04-17 21:39   ` [PATCH 1/2] xfs_repair: adjust btree bulkloading slack computations to match online repair Darrick J. Wong
2024-04-17 21:39   ` [PATCH 2/2] xfs_repair: bulk load records into new btree blocks Darrick J. Wong
2024-04-17 21:16 ` [PATCHSET 06/11] xfsprogs: bug fixes for 6.8 Darrick J. Wong
2024-04-17 21:40   ` [PATCH 1/5] xfs_repair: double-check with shortform attr verifiers Darrick J. Wong
2024-04-17 21:40   ` [PATCH 2/5] xfs_db: improve number extraction in getbitval Darrick J. Wong
2024-04-17 21:40   ` [PATCH 3/5] xfs_scrub: fix threadcount estimates for phase 6 Darrick J. Wong
2024-04-17 21:40   ` [PATCH 4/5] xfs_scrub: don't fail while reporting media scan errors Darrick J. Wong
2024-04-17 21:41   ` [PATCH 5/5] xfs_io: add linux madvise advice codes Darrick J. Wong
2024-04-17 21:17 ` [PATCHSET V3 07/11] xfsprogs: fix log sector size detection Darrick J. Wong
2024-04-17 21:41   ` [PATCH 1/5] libxfs: remove the unused fs_topology_t typedef Darrick J. Wong
2024-04-17 21:41   ` [PATCH 2/5] libxfs: refactor the fs_topology structure Darrick J. Wong
2024-04-17 21:41   ` [PATCH 3/5] libxfs: remove the S_ISREG check from blkid_get_topology Darrick J. Wong
2024-04-17 21:42   ` [PATCH 4/5] libxfs: also query log device topology in get_topology Darrick J. Wong
2024-04-17 21:42   ` [PATCH 5/5] mkfs: use a sensible log sector size default Darrick J. Wong
2024-04-17 21:17 ` [PATCHSET 08/11] mkfs: scale shards on ssds Darrick J. Wong
2024-04-17 21:42   ` [PATCH 1/2] mkfs: allow sizing allocation groups for concurrency Darrick J. Wong
2024-04-17 21:42   ` [PATCH 2/2] mkfs: allow sizing internal logs " Darrick J. Wong
2024-04-17 21:17 ` [PATCHSET v30.3 09/11] xfs_scrub: scan metadata files in parallel Darrick J. Wong
2024-04-17 21:43   ` [PATCH 1/3] libfrog: rename XFROG_SCRUB_TYPE_* to XFROG_SCRUB_GROUP_* Darrick J. Wong
2024-04-17 21:43   ` [PATCH 2/3] libfrog: promote XFROG_SCRUB_DESCR_SUMMARY to a scrub type Darrick J. Wong
2024-04-17 21:43   ` [PATCH 3/3] xfs_scrub: scan whole-fs metadata files in parallel Darrick J. Wong
2024-04-17 21:17 ` [PATCHSET v30.3 10/11] xfs_repair: rebuild inode fork mappings Darrick J. Wong
2024-04-17 21:43   ` [PATCH 1/3] xfs_repair: push inode buf and dinode pointers all the way to inode fork processing Darrick J. Wong
2024-04-17 21:44   ` [PATCH 2/3] xfs_repair: sync bulkload data structures with kernel newbt code Darrick J. Wong
2024-04-17 21:44   ` [PATCH 3/3] xfs_repair: rebuild block mappings from rmapbt data Darrick J. Wong
2024-04-17 21:18 ` [PATCHSET 11/11] xfs_repair: support more than 4 billion records Darrick J. Wong
2024-04-17 21:44   ` [PATCH 1/8] xfs_db: add a bmbt inflation command Darrick J. Wong
2024-04-17 21:45   ` [PATCH 2/8] xfs_repair: slab and bag structs need to track more than 2^32 items Darrick J. Wong
2024-04-17 21:45   ` [PATCH 3/8] xfs_repair: support more than 2^32 rmapbt records per AG Darrick J. Wong
2024-04-17 21:45   ` [PATCH 4/8] xfs_repair: support more than 2^32 owners per physical block Darrick J. Wong
2024-04-17 21:45   ` [PATCH 5/8] xfs_repair: clean up lock resources Darrick J. Wong
2024-04-17 21:46   ` [PATCH 6/8] xfs_repair: constrain attr fork extent count Darrick J. Wong
2024-04-17 21:46   ` [PATCH 7/8] xfs_repair: don't create block maps for data files Darrick J. Wong
2024-04-17 21:46   ` [PATCH 8/8] xfs_repair: support more than INT_MAX block maps Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2024-04-22 16:25 [PATCH 00/67] libxfs: Sync to Linux 6.8 cem
2024-04-22 16:25 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load cem
2024-03-26  2:55 [PATCHSET 02/18] libxfs: sync with 6.8 Darrick J. Wong
2024-03-26  3:10 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load Darrick J. Wong
2024-03-13  1:47 [PATCHSET 02/10] libxfs: sync with 6.8 Darrick J. Wong
2024-03-13  2:00 ` [PATCH 29/67] xfs: force all buffers to be written during btree bulk load Darrick J. Wong

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=171338842775.1853449.4418216174854196307.stgit@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=bodonnel@redhat.com \
    --cc=cem@kernel.org \
    --cc=hch@lst.de \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.