* [PATCHBOMB v30.3] xfs: online repair, part 1 is done
From: Darrick J. Wong @ 2024-04-15 23:28 UTC
To: Chandan Babu R; +Cc: xfs, linux-fsdevel

Hi everyone,

I'm about to send pull requests to Chandan for all the fully reviewed patchsets that I have in my development tree. Due to all the recent design changes, I have decided to resend all patches so that the list can record the final versions of these patches with complete tagging.

--D

* [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling
From: Darrick J. Wong @ 2024-04-15 23:33 UTC
To: chandanbabu, djwong
Cc: Christoph Hellwig, Dave Chinner, Dan Carpenter, hch, linux-xfs

Hi all,

This patchset improves the performance of log incompat feature bit handling by making a few changes to how the filesystem handles them. First, we now only clear the bits during a clean unmount to reduce calls to the (expensive) upgrade function to once per bit per mount. Second, we now only allow incompat feature upgrades for sysadmins or if the sysadmin explicitly allows it via mount option. Currently the only log incompat user is logged xattrs, which requires CONFIG_XFS_DEBUG=y, so there should be no user visible impact to this change.

If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=log-incompat-permissions-6.10
---
Commits in this patchset:
 * xfs: pass xfs_buf lookup flags to xfs_*read_agi
 * xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
 * xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory
 * xfs: fix error bailout in xrep_abt_build_new_trees
 * xfs: only clear log incompat flags at clean unmount
---
 .../filesystems/xfs/xfs-online-fsck-design.rst |    3 -
 fs/xfs/libxfs/xfs_ag.c                         |    8 ++-
 fs/xfs/libxfs/xfs_ialloc.c                     |   16 ++++--
 fs/xfs/libxfs/xfs_ialloc.h                     |    5 +-
 fs/xfs/libxfs/xfs_ialloc_btree.c               |    4 +-
 fs/xfs/scrub/alloc_repair.c                    |    2 -
 fs/xfs/scrub/common.c                          |    4 +-
 fs/xfs/scrub/fscounters.c                      |    2 -
 fs/xfs/scrub/inode_repair.c                    |   50 ++++++++++++++++++++
 fs/xfs/scrub/iscan.c                           |   36 ++++++++++++++
 fs/xfs/scrub/iscan.h                           |   15 ++++++
 fs/xfs/scrub/repair.c                          |    6 +-
 fs/xfs/scrub/trace.h                           |   10 +++-
 fs/xfs/xfs_inode.c                             |    8 ++-
 fs/xfs/xfs_iwalk.c                             |    4 +-
 fs/xfs/xfs_log.c                               |   28 -----------
 fs/xfs/xfs_log.h                               |    2 -
 fs/xfs/xfs_log_priv.h                          |    3 -
 fs/xfs/xfs_log_recover.c                       |   19 +-------
 fs/xfs/xfs_mount.c                             |    8 +++
 fs/xfs/xfs_mount.h                             |    6 ++
 fs/xfs/xfs_xattr.c                             |   42 ++---------------
 22 files changed, 160 insertions(+), 121 deletions(-)

* [PATCH 1/5] xfs: pass xfs_buf lookup flags to xfs_*read_agi 2024-04-15 23:33 ` [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling Darrick J. Wong @ 2024-04-15 23:37 ` Darrick J. Wong 2024-04-15 23:38 ` [PATCH 2/5] xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:37 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Allow callers to pass buffer lookup flags to xfs_read_agi and xfs_ialloc_read_agi. This will be used in the next patch to fix a deadlock in the online fsck inode scanner. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_ag.c | 8 ++++---- fs/xfs/libxfs/xfs_ialloc.c | 16 ++++++++++------ fs/xfs/libxfs/xfs_ialloc.h | 5 +++-- fs/xfs/libxfs/xfs_ialloc_btree.c | 4 ++-- fs/xfs/scrub/common.c | 4 ++-- fs/xfs/scrub/fscounters.c | 2 +- fs/xfs/scrub/iscan.c | 2 +- fs/xfs/scrub/repair.c | 6 +++--- fs/xfs/xfs_inode.c | 8 ++++---- fs/xfs/xfs_iwalk.c | 4 ++-- fs/xfs/xfs_log_recover.c | 4 ++-- 11 files changed, 34 insertions(+), 29 deletions(-) diff --git a/fs/xfs/libxfs/xfs_ag.c b/fs/xfs/libxfs/xfs_ag.c index dc1873f76bff..09fe9412eab4 100644 --- a/fs/xfs/libxfs/xfs_ag.c +++ b/fs/xfs/libxfs/xfs_ag.c @@ -194,7 +194,7 @@ xfs_initialize_perag_data( pag = xfs_perag_get(mp, index); error = xfs_alloc_read_agf(pag, NULL, 0, NULL); if (!error) - error = xfs_ialloc_read_agi(pag, NULL, NULL); + error = xfs_ialloc_read_agi(pag, NULL, 0, NULL); if (error) { xfs_perag_put(pag); return error; @@ -931,7 +931,7 @@ xfs_ag_shrink_space( int error, err2; ASSERT(pag->pag_agno == mp->m_sb.sb_agcount - 1); - error = xfs_ialloc_read_agi(pag, *tpp, &agibp); + error = xfs_ialloc_read_agi(pag, *tpp, 0, &agibp); if (error) return error; @@ -1062,7 +1062,7 @@ xfs_ag_extend_space( 
ASSERT(pag->pag_agno == pag->pag_mount->m_sb.sb_agcount - 1); - error = xfs_ialloc_read_agi(pag, tp, &bp); + error = xfs_ialloc_read_agi(pag, tp, 0, &bp); if (error) return error; @@ -1119,7 +1119,7 @@ xfs_ag_get_geometry( int error; /* Lock the AG headers. */ - error = xfs_ialloc_read_agi(pag, NULL, &agi_bp); + error = xfs_ialloc_read_agi(pag, NULL, 0, &agi_bp); if (error) return error; error = xfs_alloc_read_agf(pag, NULL, 0, &agf_bp); diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index e5ac3e5430c4..cb37f0007731 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -1699,7 +1699,7 @@ xfs_dialloc_good_ag( return false; if (!xfs_perag_initialised_agi(pag)) { - error = xfs_ialloc_read_agi(pag, tp, NULL); + error = xfs_ialloc_read_agi(pag, tp, 0, NULL); if (error) return false; } @@ -1768,7 +1768,7 @@ xfs_dialloc_try_ag( * Then read in the AGI buffer and recheck with the AGI buffer * lock held. */ - error = xfs_ialloc_read_agi(pag, *tpp, &agbp); + error = xfs_ialloc_read_agi(pag, *tpp, 0, &agbp); if (error) return error; @@ -2286,7 +2286,7 @@ xfs_difree( /* * Get the allocation group header. 
*/ - error = xfs_ialloc_read_agi(pag, tp, &agbp); + error = xfs_ialloc_read_agi(pag, tp, 0, &agbp); if (error) { xfs_warn(mp, "%s: xfs_ialloc_read_agi() returned error %d.", __func__, error); @@ -2332,7 +2332,7 @@ xfs_imap_lookup( int error; int i; - error = xfs_ialloc_read_agi(pag, tp, &agbp); + error = xfs_ialloc_read_agi(pag, tp, 0, &agbp); if (error) { xfs_alert(mp, "%s: xfs_ialloc_read_agi() returned error %d, agno %d", @@ -2675,6 +2675,7 @@ int xfs_read_agi( struct xfs_perag *pag, struct xfs_trans *tp, + xfs_buf_flags_t flags, struct xfs_buf **agibpp) { struct xfs_mount *mp = pag->pag_mount; @@ -2684,7 +2685,7 @@ xfs_read_agi( error = xfs_trans_read_buf(mp, tp, mp->m_ddev_targp, XFS_AG_DADDR(mp, pag->pag_agno, XFS_AGI_DADDR(mp)), - XFS_FSS_TO_BB(mp, 1), 0, agibpp, &xfs_agi_buf_ops); + XFS_FSS_TO_BB(mp, 1), flags, agibpp, &xfs_agi_buf_ops); if (xfs_metadata_is_sick(error)) xfs_ag_mark_sick(pag, XFS_SICK_AG_AGI); if (error) @@ -2704,6 +2705,7 @@ int xfs_ialloc_read_agi( struct xfs_perag *pag, struct xfs_trans *tp, + int flags, struct xfs_buf **agibpp) { struct xfs_buf *agibp; @@ -2712,7 +2714,9 @@ xfs_ialloc_read_agi( trace_xfs_ialloc_read_agi(pag->pag_mount, pag->pag_agno); - error = xfs_read_agi(pag, tp, &agibp); + error = xfs_read_agi(pag, tp, + (flags & XFS_IALLOC_FLAG_TRYLOCK) ? 
XBF_TRYLOCK : 0, + &agibp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_ialloc.h b/fs/xfs/libxfs/xfs_ialloc.h index f1412183bb44..b549627e3a61 100644 --- a/fs/xfs/libxfs/xfs_ialloc.h +++ b/fs/xfs/libxfs/xfs_ialloc.h @@ -63,10 +63,11 @@ xfs_ialloc_log_agi( struct xfs_buf *bp, /* allocation group header buffer */ uint32_t fields); /* bitmask of fields to log */ -int xfs_read_agi(struct xfs_perag *pag, struct xfs_trans *tp, +int xfs_read_agi(struct xfs_perag *pag, struct xfs_trans *tp, xfs_buf_flags_t flags, struct xfs_buf **agibpp); int xfs_ialloc_read_agi(struct xfs_perag *pag, struct xfs_trans *tp, - struct xfs_buf **agibpp); + int flags, struct xfs_buf **agibpp); +#define XFS_IALLOC_FLAG_TRYLOCK (1U << 0) /* use trylock for buffer locking */ /* * Lookup a record by ino in the btree given by cur. diff --git a/fs/xfs/libxfs/xfs_ialloc_btree.c b/fs/xfs/libxfs/xfs_ialloc_btree.c index cc661fca6ff5..42e9fd47f6c7 100644 --- a/fs/xfs/libxfs/xfs_ialloc_btree.c +++ b/fs/xfs/libxfs/xfs_ialloc_btree.c @@ -745,7 +745,7 @@ xfs_finobt_count_blocks( struct xfs_btree_cur *cur; int error; - error = xfs_ialloc_read_agi(pag, tp, &agbp); + error = xfs_ialloc_read_agi(pag, tp, 0, &agbp); if (error) return error; @@ -768,7 +768,7 @@ xfs_finobt_read_blocks( struct xfs_agi *agi; int error; - error = xfs_ialloc_read_agi(pag, tp, &agbp); + error = xfs_ialloc_read_agi(pag, tp, 0, &agbp); if (error) return error; diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index 47a20cf5205f..a27d33b6f464 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -445,7 +445,7 @@ xchk_perag_read_headers( { int error; - error = xfs_ialloc_read_agi(sa->pag, sc->tp, &sa->agi_bp); + error = xfs_ialloc_read_agi(sa->pag, sc->tp, 0, &sa->agi_bp); if (error && want_ag_read_header_failure(sc, XFS_SCRUB_TYPE_AGI)) return error; @@ -827,7 +827,7 @@ xchk_iget_agi( * in the iget cache miss path. 
*/ pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inum)); - error = xfs_ialloc_read_agi(pag, tp, agi_bpp); + error = xfs_ialloc_read_agi(pag, tp, 0, agi_bpp); xfs_perag_put(pag); if (error) return error; diff --git a/fs/xfs/scrub/fscounters.c b/fs/xfs/scrub/fscounters.c index d310737c8823..da2f6729699d 100644 --- a/fs/xfs/scrub/fscounters.c +++ b/fs/xfs/scrub/fscounters.c @@ -85,7 +85,7 @@ xchk_fscount_warmup( continue; /* Lock both AG headers. */ - error = xfs_ialloc_read_agi(pag, sc->tp, &agi_bp); + error = xfs_ialloc_read_agi(pag, sc->tp, 0, &agi_bp); if (error) break; error = xfs_alloc_read_agf(pag, sc->tp, 0, &agf_bp); diff --git a/fs/xfs/scrub/iscan.c b/fs/xfs/scrub/iscan.c index ec3478bc505e..66ba0fbd059e 100644 --- a/fs/xfs/scrub/iscan.c +++ b/fs/xfs/scrub/iscan.c @@ -281,7 +281,7 @@ xchk_iscan_advance( if (!pag) return -ECANCELED; - ret = xfs_ialloc_read_agi(pag, sc->tp, &agi_bp); + ret = xfs_ialloc_read_agi(pag, sc->tp, 0, &agi_bp); if (ret) goto out_pag; diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index f43dce771cdd..443e62f72481 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -290,7 +290,7 @@ xrep_calc_ag_resblks( icount = pag->pagi_count; } else { /* Try to get the actual counters from disk. 
*/ - error = xfs_ialloc_read_agi(pag, NULL, &bp); + error = xfs_ialloc_read_agi(pag, NULL, 0, &bp); if (!error) { icount = pag->pagi_count; xfs_buf_relse(bp); @@ -908,7 +908,7 @@ xrep_reinit_pagi( ASSERT(xfs_perag_initialised_agi(pag)); clear_bit(XFS_AGSTATE_AGI_INIT, &pag->pag_opstate); - error = xfs_ialloc_read_agi(pag, sc->tp, &bp); + error = xfs_ialloc_read_agi(pag, sc->tp, 0, &bp); if (error) return error; @@ -934,7 +934,7 @@ xrep_ag_init( ASSERT(!sa->pag); - error = xfs_ialloc_read_agi(pag, sc->tp, &sa->agi_bp); + error = xfs_ialloc_read_agi(pag, sc->tp, 0, &sa->agi_bp); if (error) return error; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index d55b42b2480d..3e667a19b80b 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2167,7 +2167,7 @@ xfs_iunlink( pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, ip->i_ino)); /* Get the agi buffer first. It ensures lock ordering on the list. */ - error = xfs_read_agi(pag, tp, &agibp); + error = xfs_read_agi(pag, tp, 0, &agibp); if (error) goto out; @@ -2264,7 +2264,7 @@ xfs_iunlink_remove( trace_xfs_iunlink_remove(ip); /* Get the agi buffer first. It ensures lock ordering on the list. 
*/ - error = xfs_read_agi(pag, tp, &agibp); + error = xfs_read_agi(pag, tp, 0, &agibp); if (error) return error; @@ -3142,7 +3142,7 @@ xfs_rename( pag = xfs_perag_get(mp, XFS_INO_TO_AGNO(mp, inodes[i]->i_ino)); - error = xfs_read_agi(pag, tp, &bp); + error = xfs_read_agi(pag, tp, 0, &bp); xfs_perag_put(pag); if (error) goto out_trans_cancel; @@ -3814,7 +3814,7 @@ xfs_inode_reload_unlinked_bucket( /* Grab the first inode in the list */ pag = xfs_perag_get(mp, agno); - error = xfs_ialloc_read_agi(pag, tp, &agibp); + error = xfs_ialloc_read_agi(pag, tp, 0, &agibp); xfs_perag_put(pag); if (error) return error; diff --git a/fs/xfs/xfs_iwalk.c b/fs/xfs/xfs_iwalk.c index 01b55f03a102..730c8d48da28 100644 --- a/fs/xfs/xfs_iwalk.c +++ b/fs/xfs/xfs_iwalk.c @@ -268,7 +268,7 @@ xfs_iwalk_ag_start( /* Set up a fresh cursor and empty the inobt cache. */ iwag->nr_recs = 0; - error = xfs_ialloc_read_agi(pag, tp, agi_bpp); + error = xfs_ialloc_read_agi(pag, tp, 0, agi_bpp); if (error) return error; *curpp = xfs_inobt_init_cursor(pag, tp, *agi_bpp); @@ -386,7 +386,7 @@ xfs_iwalk_run_callbacks( } /* ...and recreate the cursor just past where we left off. */ - error = xfs_ialloc_read_agi(iwag->pag, iwag->tp, agi_bpp); + error = xfs_ialloc_read_agi(iwag->pag, iwag->tp, 0, agi_bpp); if (error) return error; *curpp = xfs_inobt_init_cursor(iwag->pag, iwag->tp, *agi_bpp); diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 13f1d2e91540..1b1f0a4cd494 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -2656,7 +2656,7 @@ xlog_recover_clear_agi_bucket( if (error) goto out_error; - error = xfs_read_agi(pag, tp, &agibp); + error = xfs_read_agi(pag, tp, 0, &agibp); if (error) goto out_abort; @@ -2772,7 +2772,7 @@ xlog_recover_iunlink_ag( int bucket; int error; - error = xfs_read_agi(pag, NULL, &agibp); + error = xfs_read_agi(pag, NULL, 0, &agibp); if (error) { /* * AGI is b0rked. Don't process it. 
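
A note for readers tracing the new argument in patch 1/5: xfs_ialloc_read_agi takes flags from its own small namespace and translates them into a buffer lookup flag before calling xfs_read_agi. The userspace sketch below models only that translation step. The SKETCH_* names and the numeric value picked for the buffer flag are illustrative assumptions, not the kernel's definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for XFS_IALLOC_FLAG_TRYLOCK (caller-facing flag). */
#define SKETCH_IALLOC_FLAG_TRYLOCK	(1U << 0)

/*
 * Hypothetical stand-in for XBF_TRYLOCK (buffer cache flag).  The value is
 * made up for this sketch; the real one lives in the kernel headers.
 */
#define SKETCH_XBF_TRYLOCK		(1U << 16)

/*
 * Translate the caller-facing flag into the buffer lookup flag, mirroring
 * the conditional expression that the patch adds to xfs_ialloc_read_agi.
 * Unknown bits are ignored rather than passed through.
 */
static unsigned int
sketch_ialloc_to_buf_flags(
	unsigned int	ialloc_flags)
{
	return (ialloc_flags & SKETCH_IALLOC_FLAG_TRYLOCK) ?
			SKETCH_XBF_TRYLOCK : 0;
}
```

Keeping the two namespaces separate means callers of the higher-level helper never need to know the buffer cache's flag values; passing 0, as every converted caller in this patch does, preserves the old blocking behaviour.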
* [PATCH 2/5] xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode
From: Darrick J. Wong @ 2024-04-15 23:38 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While reviewing the next patch, which fixes an ABBA deadlock between the AGI and a directory ILOCK, someone asked why we're holding the AGI in the first place. The reason for that is to quiesce the inode structures for that AG while we do a repair.

I then realized that xrep_dinode_findmode invokes xchk_iscan_iter, which walks the inobts (and hence the AGIs) to find all the inodes. This itself is also an ABBA vector, since the damaged inode could be in AG 5, which we hold while we scan AG 0 for directories. 5 -> 0 is not allowed, because AGI buffers are supposed to be locked in increasing AG order.

To address this, modify the iscan to allow trylock of the AGI buffer using the flags argument to xfs_ialloc_read_agi that the previous patch added.

Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/inode_repair.c | 1 + fs/xfs/scrub/iscan.c | 36 +++++++++++++++++++++++++++++++++++- fs/xfs/scrub/iscan.h | 15 +++++++++++++++ fs/xfs/scrub/trace.h | 10 ++++++++-- 4 files changed, 59 insertions(+), 3 deletions(-) diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index eab380e95ef4..35da0193c919 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -356,6 +356,7 @@ xrep_dinode_find_mode( * so there's a real possibility that _iscan_iter can return EBUSY. */ xchk_iscan_start(sc, 5000, 100, &ri->ftype_iscan); + xchk_iscan_set_agi_trylock(&ri->ftype_iscan); ri->ftype_iscan.skip_ino = sc->sm->sm_ino; ri->alleged_ftype = XFS_DIR3_FT_UNKNOWN; while ((error = xchk_iscan_iter(&ri->ftype_iscan, &dp)) == 1) { diff --git a/fs/xfs/scrub/iscan.c b/fs/xfs/scrub/iscan.c index 66ba0fbd059e..c643b7d79b60 100644 --- a/fs/xfs/scrub/iscan.c +++ b/fs/xfs/scrub/iscan.c @@ -243,6 +243,40 @@ xchk_iscan_finish( mutex_unlock(&iscan->lock); } +/* + * Grab the AGI to advance the inode scan. Returns 0 if *agi_bpp is now set, + * -ECANCELED if the live scan aborted, -EBUSY if the AGI could not be grabbed, + * or the usual negative errno. 
+ */ +STATIC int +xchk_iscan_read_agi( + struct xchk_iscan *iscan, + struct xfs_perag *pag, + struct xfs_buf **agi_bpp) +{ + struct xfs_scrub *sc = iscan->sc; + unsigned long relax; + int ret; + + if (!xchk_iscan_agi_needs_trylock(iscan)) + return xfs_ialloc_read_agi(pag, sc->tp, 0, agi_bpp); + + relax = msecs_to_jiffies(iscan->iget_retry_delay); + do { + ret = xfs_ialloc_read_agi(pag, sc->tp, XFS_IALLOC_FLAG_TRYLOCK, + agi_bpp); + if (ret != -EAGAIN) + return ret; + if (!iscan->iget_timeout || + time_is_before_jiffies(iscan->__iget_deadline)) + return -EBUSY; + + trace_xchk_iscan_agi_retry_wait(iscan); + } while (!schedule_timeout_killable(relax) && + !xchk_iscan_aborted(iscan)); + return -ECANCELED; +} + /* * Advance ino to the next inode that the inobt thinks is allocated, being * careful to jump to the next AG if we've reached the right end of this AG's @@ -281,7 +315,7 @@ xchk_iscan_advance( if (!pag) return -ECANCELED; - ret = xfs_ialloc_read_agi(pag, sc->tp, 0, &agi_bp); + ret = xchk_iscan_read_agi(iscan, pag, &agi_bp); if (ret) goto out_pag; diff --git a/fs/xfs/scrub/iscan.h b/fs/xfs/scrub/iscan.h index 71f657552dfa..5e0e4ed9dea6 100644 --- a/fs/xfs/scrub/iscan.h +++ b/fs/xfs/scrub/iscan.h @@ -59,6 +59,9 @@ struct xchk_iscan { /* Set if the scan has been aborted due to some event in the fs. 
*/ #define XCHK_ISCAN_OPSTATE_ABORTED (1) +/* Use trylock to acquire the AGI */ +#define XCHK_ISCAN_OPSTATE_TRYLOCK_AGI (2) + static inline bool xchk_iscan_aborted(const struct xchk_iscan *iscan) { @@ -71,6 +74,18 @@ xchk_iscan_abort(struct xchk_iscan *iscan) set_bit(XCHK_ISCAN_OPSTATE_ABORTED, &iscan->__opstate); } +static inline bool +xchk_iscan_agi_needs_trylock(const struct xchk_iscan *iscan) +{ + return test_bit(XCHK_ISCAN_OPSTATE_TRYLOCK_AGI, &iscan->__opstate); +} + +static inline void +xchk_iscan_set_agi_trylock(struct xchk_iscan *iscan) +{ + set_bit(XCHK_ISCAN_OPSTATE_TRYLOCK_AGI, &iscan->__opstate); +} + void xchk_iscan_start(struct xfs_scrub *sc, unsigned int iget_timeout, unsigned int iget_retry_delay, struct xchk_iscan *iscan); void xchk_iscan_teardown(struct xchk_iscan *iscan); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 5b294be52c55..b1c7c79760d4 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1300,7 +1300,7 @@ TRACE_EVENT(xchk_iscan_iget_batch, __entry->unavail) ); -TRACE_EVENT(xchk_iscan_iget_retry_wait, +DECLARE_EVENT_CLASS(xchk_iscan_retry_wait_class, TP_PROTO(struct xchk_iscan *iscan), TP_ARGS(iscan), TP_STRUCT__entry( @@ -1326,7 +1326,13 @@ TRACE_EVENT(xchk_iscan_iget_retry_wait, __entry->remaining, __entry->iget_timeout, __entry->retry_delay) -); +) +#define DEFINE_ISCAN_RETRY_WAIT_EVENT(name) \ +DEFINE_EVENT(xchk_iscan_retry_wait_class, name, \ + TP_PROTO(struct xchk_iscan *iscan), \ + TP_ARGS(iscan)) +DEFINE_ISCAN_RETRY_WAIT_EVENT(xchk_iscan_iget_retry_wait); +DEFINE_ISCAN_RETRY_WAIT_EVENT(xchk_iscan_agi_retry_wait); TRACE_EVENT(xchk_nlinks_collect_dirent, TP_PROTO(struct xfs_mount *mp, struct xfs_inode *dp, ^ permalink raw reply related [flat|nested] 100+ messages in thread
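
For anyone who wants to experiment with the retry pattern in patch 2/5 outside the kernel, here is a rough userspace sketch of the control flow in xchk_iscan_read_agi: try the lock, give up with -EBUSY once the deadline has passed, sleep for the retry delay otherwise, and return -ECANCELED if the scan is aborted while waiting. Everything named sketch_* is hypothetical. An atomic_flag stands in for the AGI buffer lock, a monotonic millisecond clock stands in for jiffies, and the deadline is computed per call, whereas the patch reuses the deadline already stored in the iscan.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Monotonic clock in milliseconds; stands in for jiffies. */
static long long
sketch_now_ms(void)
{
	struct timespec	ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (long long)ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
}

/* Sleep between attempts; stands in for schedule_timeout_killable(). */
static void
sketch_relax_ms(
	unsigned int	ms)
{
	struct timespec	ts = {
		.tv_sec		= ms / 1000,
		.tv_nsec	= (long)(ms % 1000) * 1000000L,
	};

	nanosleep(&ts, NULL);
}

/*
 * Returns 0 with the lock held, -EBUSY if the lock could not be taken before
 * the deadline (or immediately, if timeout_ms is zero), or -ECANCELED if
 * *aborted was set while we were waiting.
 */
static int
sketch_lock_with_deadline(
	atomic_flag	*lock,
	unsigned int	timeout_ms,
	unsigned int	retry_delay_ms,
	atomic_bool	*aborted)
{
	long long	deadline = sketch_now_ms() + timeout_ms;

	assert(lock != NULL && aborted != NULL);

	do {
		/* test_and_set returns the old value: false means we won. */
		if (!atomic_flag_test_and_set(lock))
			return 0;
		if (!timeout_ms || sketch_now_ms() > deadline)
			return -EBUSY;
		sketch_relax_ms(retry_delay_ms);
	} while (!atomic_load(aborted));
	return -ECANCELED;
}
```

The point of the pattern is that the caller never sleeps inside the lock primitive itself, so a lock order violation degrades into a bounded wait and an error code instead of a hang.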
* [PATCH 3/5] xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory 2024-04-15 23:33 ` [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling Darrick J. Wong 2024-04-15 23:37 ` [PATCH 1/5] xfs: pass xfs_buf lookup flags to xfs_*read_agi Darrick J. Wong 2024-04-15 23:38 ` [PATCH 2/5] xfs: fix an AGI lock acquisition ordering problem in xrep_dinode_findmode Darrick J. Wong @ 2024-04-15 23:38 ` Darrick J. Wong 2024-04-15 23:38 ` [PATCH 4/5] xfs: fix error bailout in xrep_abt_build_new_trees Darrick J. Wong 2024-04-15 23:38 ` [PATCH 5/5] xfs: only clear log incompat flags at clean unmount Darrick J. Wong 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:38 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> xfs/399 found the following deadlock when fuzzing core.mode = ones: /proc/20506/task/20558/stack : [<0>] xfs_ilock+0xa0/0x240 [xfs] [<0>] xfs_ilock_data_map_shared+0x1b/0x20 [xfs] [<0>] xrep_dinode_findmode_walk_directory+0x69/0xe0 [xfs] [<0>] xrep_dinode_find_mode+0x103/0x2a0 [xfs] [<0>] xrep_dinode_mode+0x7c/0x120 [xfs] [<0>] xrep_dinode_core+0xed/0x2b0 [xfs] [<0>] xrep_dinode_problems+0x10/0x80 [xfs] [<0>] xrep_inode+0x6c/0xc0 [xfs] [<0>] xrep_attempt+0x64/0x1d0 [xfs] [<0>] xfs_scrub_metadata+0x365/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] /proc/20506/task/20559/stack : [<0>] xfs_buf_lock+0x3b/0x110 [xfs] [<0>] xfs_buf_find_lock+0x66/0x1c0 [xfs] [<0>] xfs_buf_get_map+0x208/0xc00 [xfs] [<0>] xfs_buf_read_map+0x5d/0x2c0 [xfs] [<0>] xfs_trans_read_buf_map+0x1b0/0x4c0 [xfs] [<0>] xfs_read_agi+0xbd/0x190 [xfs] [<0>] xfs_ialloc_read_agi+0x47/0x160 [xfs] [<0>] xfs_imap_lookup+0x69/0x1f0 [xfs] [<0>] xfs_imap+0x1fc/0x3d0 [xfs] [<0>] xfs_iget+0x357/0xd50 [xfs] [<0>] xchk_dir_actor+0x16e/0x330 [xfs] [<0>] 
xchk_dir_walk_block+0x164/0x1e0 [xfs] [<0>] xchk_dir_walk+0x13a/0x190 [xfs] [<0>] xchk_directory+0x1a2/0x2b0 [xfs] [<0>] xfs_scrub_metadata+0x2f4/0x840 [xfs] [<0>] xfs_scrubv_metadata+0x282/0x430 [xfs] [<0>] xfs_ioc_scrubv_metadata+0x149/0x1a0 [xfs] [<0>] xfs_file_ioctl+0xc68/0x1780 [xfs] Thread 20558 holds an AGI buffer and is trying to grab the ILOCK of the root directory. Thread 20559 holds the root directory ILOCK and is trying to grab the AGI of an inode that is one of the root directory's children. The AGI held by 20558 is the same buffer that 20559 is trying to acquire. In other words, this is an ABBA deadlock. In general, the lock order is ILOCK and then AGI -- rename does this while preparing for an operation involving whiteouts or renaming files out of existence; and unlink does this when moving an inode to the unlinked list. The only place where we do it in the opposite order is on the child during an icreate, but at that point the child is marked INEW and is not visible to other threads. Work around this deadlock by replacing the blocking ilock attempt with a nonblocking loop that aborts after 30 seconds. Relax for a jiffy after a failed lock attempt. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/inode_repair.c | 49 ++++++++++++++++++++++++++++++++++++++++++- 1 file changed, 48 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index 35da0193c919..097afba3043f 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -282,6 +282,51 @@ xrep_dinode_findmode_dirent( return 0; } +/* Try to lock a directory, or wait a jiffy. */ +static inline int +xrep_dinode_ilock_nowait( + struct xfs_inode *dp, + unsigned int lock_mode) +{ + if (xfs_ilock_nowait(dp, lock_mode)) + return true; + + schedule_timeout_killable(1); + return false; +} + +/* + * Try to lock a directory to look for ftype hints. 
Since we already hold the + * AGI buffer, we cannot block waiting for the ILOCK because rename can take + * the ILOCK and then try to lock AGIs. + */ +STATIC int +xrep_dinode_trylock_directory( + struct xrep_inode *ri, + struct xfs_inode *dp, + unsigned int *lock_modep) +{ + unsigned long deadline = jiffies + msecs_to_jiffies(30000); + unsigned int lock_mode; + int error = 0; + + do { + if (xchk_should_terminate(ri->sc, &error)) + return error; + + if (xfs_need_iread_extents(&dp->i_df)) + lock_mode = XFS_ILOCK_EXCL; + else + lock_mode = XFS_ILOCK_SHARED; + + if (xrep_dinode_ilock_nowait(dp, lock_mode)) { + *lock_modep = lock_mode; + return 0; + } + } while (!time_is_before_jiffies(deadline)); + return -EBUSY; +} + /* * If this is a directory, walk the dirents looking for any that point to the * scrub target inode. @@ -299,7 +344,9 @@ xrep_dinode_findmode_walk_directory( * Scan the directory to see if there it contains an entry pointing to * the directory that we are repairing. */ - lock_mode = xfs_ilock_data_map_shared(dp); + error = xrep_dinode_trylock_directory(ri, dp, &lock_mode); + if (error) + return error; /* * If this directory is known to be sick, we cannot scan it reliably ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 4/5] xfs: fix error bailout in xrep_abt_build_new_trees 2024-04-15 23:33 ` [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:38 ` [PATCH 3/5] xfs: fix potential AGI <-> ILOCK ABBA deadlock in xrep_dinode_findmode_walk_directory Darrick J. Wong @ 2024-04-15 23:38 ` Darrick J. Wong 2024-04-15 23:38 ` [PATCH 5/5] xfs: only clear log incompat flags at clean unmount Darrick J. Wong 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:38 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Dan Carpenter, Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Dan Carpenter reports: "Commit 4bdfd7d15747 ("xfs: repair free space btrees") from Dec 15, 2023 (linux-next), leads to the following Smatch static checker warning: fs/xfs/scrub/alloc_repair.c:781 xrep_abt_build_new_trees() warn: missing unwind goto?" That's a bug, so let's fix it. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: 4bdfd7d15747 ("xfs: repair free space btrees") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/alloc_repair.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/scrub/alloc_repair.c b/fs/xfs/scrub/alloc_repair.c index d421b253923e..30295898cc8a 100644 --- a/fs/xfs/scrub/alloc_repair.c +++ b/fs/xfs/scrub/alloc_repair.c @@ -778,7 +778,7 @@ xrep_abt_build_new_trees( error = xrep_bnobt_sort_records(ra); if (error) - return error; + goto err_levels; /* Load the free space by block number tree. */ ra->array_cur = XFARRAY_CURSOR_INIT; ^ permalink raw reply related [flat|nested] 100+ messages in thread
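
The class of bug that Smatch flagged in patch 4/5 is easy to reproduce in miniature. The sketch below is a generic illustration of the "missing unwind goto" pattern and is not a model of xrep_abt_build_new_trees: the struct, the helper names, and the err_scratch label are all hypothetical. Once a function has acquired a resource, every later failure has to leave through the cleanup label instead of returning directly.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical repair context for this sketch only. */
struct sketch_repair {
	int	*scratch;	/* resource acquired early in the function */
	int	nr_released;	/* how many times the unwind path ran */
};

/* Hypothetical step that can fail after the resource is held. */
static int
sketch_sort_records(
	bool	fail)
{
	return fail ? -EINVAL : 0;
}

static int
sketch_build_new_trees(
	struct sketch_repair	*ra,
	bool			fail_sort)
{
	int			error;

	ra->scratch = calloc(16, sizeof(*ra->scratch));
	if (!ra->scratch)
		return -ENOMEM;		/* nothing to unwind yet */

	error = sketch_sort_records(fail_sort);
	if (error)
		goto err_scratch;	/* a bare "return error" would leak */

	return 0;			/* success: caller owns ra->scratch */

err_scratch:
	free(ra->scratch);
	ra->scratch = NULL;
	ra->nr_released++;
	return error;
}
```

With fail_sort set, the function returns -EINVAL and leaves ra->scratch NULL; replacing the goto with a direct return would hand back the same error code while leaking the allocation, which is why a checker has to look at the control flow and not at the return value.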
* [PATCH 5/5] xfs: only clear log incompat flags at clean unmount
From: Darrick J. Wong @ 2024-04-15 23:38 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, Dave Chinner, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

While reviewing the online fsck patchset, someone spied the xfs_swapext_can_use_without_log_assistance function and wondered why we go through this inverted-bitmask dance to avoid setting the XFS_SB_FEAT_INCOMPAT_LOG_SWAPEXT feature. (The same principles apply to the logged extended attribute update feature bit in the since-merged LARP series.)

The reason for this dance is that xfs_add_incompat_log_feature is an expensive operation -- it forces the log, pushes the AIL, and then if nobody's beaten us to it, sets the feature bit and issues a synchronous write of the primary superblock. That could be a one-time cost amortized over the life of the filesystem, but the log quiesce and cover operations call xfs_clear_incompat_log_features to remove feature bits opportunistically. On a moderately loaded filesystem this leads to us cycling those bits on and off over and over, which hurts performance.

Why do we clear the log incompat bits? Back in ~2020 I think Dave and I had a conversation on IRC[2] about what the log incompat bits represent. IIRC in that conversation we decided that the log incompat bits protect unrecovered log items so that old kernels won't try to recover them and barf. Since a clean log has no protected log items, we could clear the bits at cover/quiesce time.

As Dave Chinner pointed out in the thread, clearing log incompat bits at unmount time has positive effects for golden root disk image generator setups, since the generator could be running a newer kernel than what gets written to the golden image -- if there are log incompat fields set in the golden image that was generated by a newer kernel/OS image builder then the provisioning host cannot mount the filesystem even though the log is clean and recovery is unnecessary to mount the filesystem.

Given that it's expensive to set log incompat bits, we really only want to do that once per bit per mount. Therefore, I propose that we only clear log incompat bits as part of writing a clean unmount record. Do this by adding an operational state flag to the xfs mount that guards whether or not the feature bit clearing can actually take place.

This eliminates the l_incompat_users rwsem that we use to protect a log cleaning operation from clearing a feature bit that a frontend thread is trying to set -- this lock adds another way to fail w.r.t. locking. For the swapext series, I shard that into multiple locks just to work around the lockdep complaints, and that's fugly.

Link: https://lore.kernel.org/linux-xfs/20240131230043.GA6180@frogsfrogsfrogs/
Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Dave Chinner <dchinner@redhat.com> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 3 - fs/xfs/xfs_log.c | 28 ------------- fs/xfs/xfs_log.h | 2 - fs/xfs/xfs_log_priv.h | 3 - fs/xfs/xfs_log_recover.c | 15 ------- fs/xfs/xfs_mount.c | 8 +++- fs/xfs/xfs_mount.h | 6 ++- fs/xfs/xfs_xattr.c | 42 +++----------------- 8 files changed, 19 insertions(+), 88 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 6333697ba3e8..1d161752f09e 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4047,9 +4047,6 @@ series. | one ``struct rw_semaphore`` for each feature. | | The log cleaning code tries to take this rwsem in exclusive mode to | | clear the bit; if the lock attempt fails, the feature bit remains set. | -| Filesystem code signals its intention to use a log incompat feature in a | -| transaction by calling ``xlog_use_incompat_feat``, which takes the rwsem | -| in shared mode. | | The code supporting a log incompat feature should create wrapper | | functions to obtain the log feature and call | | ``xfs_add_incompat_log_feature`` to set the feature bits in the primary | diff --git a/fs/xfs/xfs_log.c b/fs/xfs/xfs_log.c index 5004f23d344e..416c15494983 100644 --- a/fs/xfs/xfs_log.c +++ b/fs/xfs/xfs_log.c @@ -1448,7 +1448,7 @@ xfs_log_work_queue( * Clear the log incompat flags if we have the opportunity. * * This only happens if we're about to log the second dummy transaction as part - * of covering the log and we can get the log incompat feature usage lock. + * of covering the log. 
*/ static inline void xlog_clear_incompat( @@ -1463,11 +1463,7 @@ xlog_clear_incompat( if (log->l_covered_state != XLOG_STATE_COVER_DONE2) return; - if (!down_write_trylock(&log->l_incompat_users)) - return; - xfs_clear_incompat_log_features(mp); - up_write(&log->l_incompat_users); } /* @@ -1585,8 +1581,6 @@ xlog_alloc_log( } log->l_sectBBsize = 1 << log2_size; - init_rwsem(&log->l_incompat_users); - xlog_get_iclog_buffer_size(mp, log); spin_lock_init(&log->l_icloglock); @@ -3871,23 +3865,3 @@ xfs_log_check_lsn( return valid; } - -/* - * Notify the log that we're about to start using a feature that is protected - * by a log incompat feature flag. This will prevent log covering from - * clearing those flags. - */ -void -xlog_use_incompat_feat( - struct xlog *log) -{ - down_read(&log->l_incompat_users); -} - -/* Notify the log that we've finished using log incompat features. */ -void -xlog_drop_incompat_feat( - struct xlog *log) -{ - up_read(&log->l_incompat_users); -} diff --git a/fs/xfs/xfs_log.h b/fs/xfs/xfs_log.h index 2728886c2963..d69acf881153 100644 --- a/fs/xfs/xfs_log.h +++ b/fs/xfs/xfs_log.h @@ -159,8 +159,6 @@ bool xfs_log_check_lsn(struct xfs_mount *, xfs_lsn_t); xfs_lsn_t xlog_grant_push_threshold(struct xlog *log, int need_bytes); bool xlog_force_shutdown(struct xlog *log, uint32_t shutdown_flags); -void xlog_use_incompat_feat(struct xlog *log); -void xlog_drop_incompat_feat(struct xlog *log); int xfs_attr_use_log_assist(struct xfs_mount *mp); #endif /* __XFS_LOG_H__ */ diff --git a/fs/xfs/xfs_log_priv.h b/fs/xfs/xfs_log_priv.h index e30c06ec20e3..43881575cd49 100644 --- a/fs/xfs/xfs_log_priv.h +++ b/fs/xfs/xfs_log_priv.h @@ -450,9 +450,6 @@ struct xlog { xfs_lsn_t l_recovery_lsn; uint32_t l_iclog_roundoff;/* padding roundoff */ - - /* Users of log incompat features should take a read lock. 
*/ - struct rw_semaphore l_incompat_users; }; /* diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 1b1f0a4cd494..41aec991433c 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -3496,21 +3496,6 @@ xlog_recover_finish( */ xfs_log_force(log->l_mp, XFS_LOG_SYNC); - /* - * Now that we've recovered the log and all the intents, we can clear - * the log incompat feature bits in the superblock because there's no - * longer anything to protect. We rely on the AIL push to write out the - * updated superblock after everything else. - */ - if (xfs_clear_incompat_log_features(log->l_mp)) { - error = xfs_sync_sb(log->l_mp, false); - if (error < 0) { - xfs_alert(log->l_mp, - "Failed to clear log incompat features on recovery"); - goto out_error; - } - } - xlog_recover_process_iunlinks(log); /* diff --git a/fs/xfs/xfs_mount.c b/fs/xfs/xfs_mount.c index df370eb5dc15..d37ba10f5fa3 100644 --- a/fs/xfs/xfs_mount.c +++ b/fs/xfs/xfs_mount.c @@ -1095,6 +1095,11 @@ xfs_unmountfs( "Freespace may not be correct on next mount."); xfs_unmount_check(mp); + /* + * Indicate that it's ok to clear log incompat bits before cleaning + * the log and writing the unmount record. + */ + xfs_set_done_with_log_incompat(mp); xfs_log_unmount(mp); xfs_da_unmount(mp); xfs_uuid_unmount(mp); @@ -1364,7 +1369,8 @@ xfs_clear_incompat_log_features( if (!xfs_has_crc(mp) || !xfs_sb_has_incompat_log_feature(&mp->m_sb, XFS_SB_FEAT_INCOMPAT_LOG_ALL) || - xfs_is_shutdown(mp)) + xfs_is_shutdown(mp) || + !xfs_is_done_with_log_incompat(mp)) return false; /* diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index e880aa48de68..6ec038b88454 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -412,6 +412,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID) #define XFS_OPSTATE_WARNED_LARP 9 /* Mount time quotacheck is running */ #define XFS_OPSTATE_QUOTACHECK_RUNNING 10 +/* Do we want to clear log incompat flags? 
*/ +#define XFS_OPSTATE_UNSET_LOG_INCOMPAT 11 #define __XFS_IS_OPSTATE(name, NAME) \ static inline bool xfs_is_ ## name (struct xfs_mount *mp) \ @@ -439,6 +441,7 @@ __XFS_IS_OPSTATE(quotacheck_running, QUOTACHECK_RUNNING) #else # define xfs_is_quotacheck_running(mp) (false) #endif +__XFS_IS_OPSTATE(done_with_log_incompat, UNSET_LOG_INCOMPAT) static inline bool xfs_should_warn(struct xfs_mount *mp, long nr) @@ -457,7 +460,8 @@ xfs_should_warn(struct xfs_mount *mp, long nr) { (1UL << XFS_OPSTATE_WARNED_SCRUB), "wscrub" }, \ { (1UL << XFS_OPSTATE_WARNED_SHRINK), "wshrink" }, \ { (1UL << XFS_OPSTATE_WARNED_LARP), "wlarp" }, \ - { (1UL << XFS_OPSTATE_QUOTACHECK_RUNNING), "quotacheck" } + { (1UL << XFS_OPSTATE_QUOTACHECK_RUNNING), "quotacheck" }, \ + { (1UL << XFS_OPSTATE_UNSET_LOG_INCOMPAT), "unset_log_incompat" } /* * Max and min values for mount-option defined I/O diff --git a/fs/xfs/xfs_xattr.c b/fs/xfs/xfs_xattr.c index 364104e1b38a..4ebf7052eb67 100644 --- a/fs/xfs/xfs_xattr.c +++ b/fs/xfs/xfs_xattr.c @@ -22,10 +22,7 @@ /* * Get permission to use log-assisted atomic exchange of file extents. - * - * Callers must not be running any transactions or hold any inode locks, and - * they must release the permission by calling xlog_drop_incompat_feat - * when they're done. + * Callers must not be running any transactions or hold any ILOCKs. */ static inline int xfs_attr_grab_log_assist( @@ -33,16 +30,7 @@ xfs_attr_grab_log_assist( { int error = 0; - /* - * Protect ourselves from an idle log clearing the logged xattrs log - * incompat feature bit. - */ - xlog_use_incompat_feat(mp->m_log); - - /* - * If log-assisted xattrs are already enabled, the caller can use the - * log assisted swap functions with the log-incompat reference we got. 
- */ + /* xattr update log intent items are already enabled */ if (xfs_sb_version_haslogxattrs(&mp->m_sb)) return 0; @@ -52,31 +40,19 @@ xfs_attr_grab_log_assist( * a V5 filesystem for the superblock field, but we'll require rmap * or reflink to avoid having to deal with really old kernels. */ - if (!xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) { - error = -EOPNOTSUPP; - goto drop_incompat; - } + if (!xfs_has_reflink(mp) && !xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; /* Enable log-assisted xattrs. */ error = xfs_add_incompat_log_feature(mp, XFS_SB_FEAT_INCOMPAT_LOG_XATTRS); if (error) - goto drop_incompat; + return error; xfs_warn_mount(mp, XFS_OPSTATE_WARNED_LARP, "EXPERIMENTAL logged extended attributes feature in use. Use at your own risk!"); return 0; -drop_incompat: - xlog_drop_incompat_feat(mp->m_log); - return error; -} - -static inline void -xfs_attr_rele_log_assist( - struct xfs_mount *mp) -{ - xlog_drop_incompat_feat(mp->m_log); } static inline bool @@ -100,7 +76,6 @@ xfs_attr_change( struct xfs_da_args *args) { struct xfs_mount *mp = args->dp->i_mount; - bool use_logging = false; int error; ASSERT(!(args->op_flags & XFS_DA_OP_LOGGED)); @@ -111,14 +86,9 @@ xfs_attr_change( return error; args->op_flags |= XFS_DA_OP_LOGGED; - use_logging = true; } - error = xfs_attr_set(args); - - if (use_logging) - xfs_attr_rele_log_assist(mp); - return error; + return xfs_attr_set(args); } ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong 2024-04-15 23:33 ` [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling Darrick J. Wong @ 2024-04-15 23:34 ` Darrick J. Wong 2024-04-15 23:39 ` [PATCH 1/7] xfs: move inode lease breaking functions to xfs_inode.c Darrick J. Wong ` (6 more replies) 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (13 subsequent siblings) 15 siblings, 7 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:34 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs Hi all, This series applies various cleanups and refactorings to file IO handling code ahead of the main series to implement atomic file content exchanges. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=file-exchange-refactorings-6.10
---
Commits in this patchset:
 * xfs: move inode lease breaking functions to xfs_inode.c
 * xfs: move xfs_iops.c declarations out of xfs_inode.h
 * xfs: declare xfs_file.c symbols in xfs_file.h
 * xfs: create a new helper to return a file's allocation unit
 * xfs: hoist multi-fsb allocation unit detection to a helper
 * xfs: refactor non-power-of-two alignment checks
 * xfs: constify xfs_bmap_is_written_extent
---
 fs/xfs/libxfs/xfs_bmap.h |    2 +
 fs/xfs/xfs_bmap_util.c   |    4 +-
 fs/xfs/xfs_file.c        |   88 ++++------------------------------------------
 fs/xfs/xfs_file.h        |   15 ++++++++
 fs/xfs/xfs_inode.c       |   75 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_inode.h       |   16 +++++---
 fs/xfs/xfs_ioctl.c       |    1 +
 fs/xfs/xfs_iops.c        |    1 +
 fs/xfs/xfs_iops.h        |    7 ++--
 fs/xfs/xfs_linux.h       |    5 +++
 10 files changed, 121 insertions(+), 93 deletions(-)
 create mode 100644 fs/xfs/xfs_file.h

^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/7] xfs: move inode lease breaking functions to xfs_inode.c 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong @ 2024-04-15 23:39 ` Darrick J. Wong 2024-04-15 23:39 ` [PATCH 2/7] xfs: move xfs_iops.c declarations out of xfs_inode.h Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:39 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> The lease breaking functions operate at the scope of the entire VFS inode, not subranges of a file. Move them to xfs_inode.c since they're already declared in xfs_inode.h. This cleanup moves us closer to having xfs_FOO.h declare only the symbols in xfs_FOO.c. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 61 --------------------------------------------------- fs/xfs/xfs_inode.c | 62 ++++++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_inode.h | 1 - 3 files changed, 62 insertions(+), 62 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 632653e00906..40b778415f5f 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -861,67 +861,6 @@ xfs_file_write_iter( return xfs_file_buffered_write(iocb, from); } -static void -xfs_wait_dax_page( - struct inode *inode) -{ - struct xfs_inode *ip = XFS_I(inode); - - xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); - schedule(); - xfs_ilock(ip, XFS_MMAPLOCK_EXCL); -} - -int -xfs_break_dax_layouts( - struct inode *inode, - bool *retry) -{ - struct page *page; - - xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL); - - page = dax_layout_busy_page(inode->i_mapping); - if (!page) - return 0; - - *retry = true; - return ___wait_var_event(&page->_refcount, - atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, - 0, 0, xfs_wait_dax_page(inode)); -} - -int -xfs_break_layouts( - struct 
inode *inode, - uint *iolock, - enum layout_break_reason reason) -{ - bool retry; - int error; - - xfs_assert_ilocked(XFS_I(inode), XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL); - - do { - retry = false; - switch (reason) { - case BREAK_UNMAP: - error = xfs_break_dax_layouts(inode, &retry); - if (error || retry) - break; - fallthrough; - case BREAK_WRITE: - error = xfs_break_leased_layouts(inode, iolock, &retry); - break; - default: - WARN_ON_ONCE(1); - error = -EINVAL; - } - } while (error == 0 && retry); - - return error; -} - /* Does this file, inode, or mount want synchronous writes? */ static inline bool xfs_file_sync_writes(struct file *filp) { diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 3e667a19b80b..39e6f88e9691 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -38,6 +38,7 @@ #include "xfs_ag.h" #include "xfs_log_priv.h" #include "xfs_health.h" +#include "xfs_pnfs.h" struct kmem_cache *xfs_inode_cache; @@ -3946,3 +3947,64 @@ xfs_inode_count_blocks( xfs_bmap_count_leaves(ifp, rblocks); *dblocks = ip->i_nblocks - *rblocks; } + +static void +xfs_wait_dax_page( + struct inode *inode) +{ + struct xfs_inode *ip = XFS_I(inode); + + xfs_iunlock(ip, XFS_MMAPLOCK_EXCL); + schedule(); + xfs_ilock(ip, XFS_MMAPLOCK_EXCL); +} + +int +xfs_break_dax_layouts( + struct inode *inode, + bool *retry) +{ + struct page *page; + + xfs_assert_ilocked(XFS_I(inode), XFS_MMAPLOCK_EXCL); + + page = dax_layout_busy_page(inode->i_mapping); + if (!page) + return 0; + + *retry = true; + return ___wait_var_event(&page->_refcount, + atomic_read(&page->_refcount) == 1, TASK_INTERRUPTIBLE, + 0, 0, xfs_wait_dax_page(inode)); +} + +int +xfs_break_layouts( + struct inode *inode, + uint *iolock, + enum layout_break_reason reason) +{ + bool retry; + int error; + + xfs_assert_ilocked(XFS_I(inode), XFS_IOLOCK_SHARED | XFS_IOLOCK_EXCL); + + do { + retry = false; + switch (reason) { + case BREAK_UNMAP: + error = xfs_break_dax_layouts(inode, &retry); + if (error || retry) + break; + 
fallthrough; + case BREAK_WRITE: + error = xfs_break_leased_layouts(inode, iolock, &retry); + break; + default: + WARN_ON_ONCE(1); + error = -EINVAL; + } + } while (error == 0 && retry); + + return error; +} diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index ab46ffb3ac19..5164c5d3e549 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -565,7 +565,6 @@ xfs_itruncate_extents( return xfs_itruncate_extents_flags(tpp, ip, whichfork, new_size, 0); } -/* from xfs_file.c */ int xfs_break_dax_layouts(struct inode *inode, bool *retry); int xfs_break_layouts(struct inode *inode, uint *iolock, enum layout_break_reason reason); ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 2/7] xfs: move xfs_iops.c declarations out of xfs_inode.h 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong 2024-04-15 23:39 ` [PATCH 1/7] xfs: move inode lease breaking functions to xfs_inode.c Darrick J. Wong @ 2024-04-15 23:39 ` Darrick J. Wong 2024-04-15 23:39 ` [PATCH 3/7] xfs: declare xfs_file.c symbols in xfs_file.h Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:39 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Similarly, move declarations of public symbols of xfs_iops.c from xfs_inode.h to xfs_iops.h. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_inode.h | 5 ----- fs/xfs/xfs_iops.h | 4 ++++ 2 files changed, 4 insertions(+), 5 deletions(-) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 5164c5d3e549..b2dde0e0f265 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -569,11 +569,6 @@ int xfs_break_dax_layouts(struct inode *inode, bool *retry); int xfs_break_layouts(struct inode *inode, uint *iolock, enum layout_break_reason reason); -/* from xfs_iops.c */ -extern void xfs_setup_inode(struct xfs_inode *ip); -extern void xfs_setup_iops(struct xfs_inode *ip); -extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init); - static inline void xfs_update_stable_writes(struct xfs_inode *ip) { if (bdev_stable_writes(xfs_inode_buftarg(ip)->bt_bdev)) diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h index 7f84a0843b24..8a38c3e2ed0e 100644 --- a/fs/xfs/xfs_iops.h +++ b/fs/xfs/xfs_iops.h @@ -19,4 +19,8 @@ int xfs_vn_setattr_size(struct mnt_idmap *idmap, int xfs_inode_init_security(struct inode *inode, struct inode *dir, const struct qstr *qstr); +extern void xfs_setup_inode(struct xfs_inode *ip); +extern void xfs_setup_iops(struct xfs_inode 
*ip); +extern void xfs_diflags_to_iflags(struct xfs_inode *ip, bool init); + #endif /* __XFS_IOPS_H__ */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 3/7] xfs: declare xfs_file.c symbols in xfs_file.h 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong 2024-04-15 23:39 ` [PATCH 1/7] xfs: move inode lease breaking functions to xfs_inode.c Darrick J. Wong 2024-04-15 23:39 ` [PATCH 2/7] xfs: move xfs_iops.c declarations out of xfs_inode.h Darrick J. Wong @ 2024-04-15 23:39 ` Darrick J. Wong 2024-04-15 23:40 ` [PATCH 4/7] xfs: create a new helper to return a file's allocation unit Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:39 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Move the two public symbols in xfs_file.c to xfs_file.h. We're about to add more public symbols in that source file, so let's finally create the header file. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 1 + fs/xfs/xfs_file.h | 12 ++++++++++++ fs/xfs/xfs_ioctl.c | 1 + fs/xfs/xfs_iops.c | 1 + fs/xfs/xfs_iops.h | 3 --- 5 files changed, 15 insertions(+), 3 deletions(-) create mode 100644 fs/xfs/xfs_file.h diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 40b778415f5f..9961d4b5efbe 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -24,6 +24,7 @@ #include "xfs_pnfs.h" #include "xfs_iomap.h" #include "xfs_reflink.h" +#include "xfs_file.h" #include <linux/dax.h> #include <linux/falloc.h> diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h new file mode 100644 index 000000000000..7d39e3eca56d --- /dev/null +++ b/fs/xfs/xfs_file.h @@ -0,0 +1,12 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Copyright (c) 2000-2005 Silicon Graphics, Inc. + * All Rights Reserved. 
+ */ +#ifndef __XFS_FILE_H__ +#define __XFS_FILE_H__ + +extern const struct file_operations xfs_file_operations; +extern const struct file_operations xfs_dir_file_operations; + +#endif /* __XFS_FILE_H__ */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index d0e2cec6210d..1397edea20f1 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -39,6 +39,7 @@ #include "xfs_ioctl.h" #include "xfs_xattr.h" #include "xfs_rtbitmap.h" +#include "xfs_file.h" #include <linux/mount.h> #include <linux/namei.h> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 66f8c47642e8..55ed2d1023d6 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -25,6 +25,7 @@ #include "xfs_error.h" #include "xfs_ioctl.h" #include "xfs_xattr.h" +#include "xfs_file.h" #include <linux/posix_acl.h> #include <linux/security.h> diff --git a/fs/xfs/xfs_iops.h b/fs/xfs/xfs_iops.h index 8a38c3e2ed0e..3c1a2605ffd2 100644 --- a/fs/xfs/xfs_iops.h +++ b/fs/xfs/xfs_iops.h @@ -8,9 +8,6 @@ struct xfs_inode; -extern const struct file_operations xfs_file_operations; -extern const struct file_operations xfs_dir_file_operations; - extern ssize_t xfs_vn_listxattr(struct dentry *, char *data, size_t size); int xfs_vn_setattr_size(struct mnt_idmap *idmap, ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 4/7] xfs: create a new helper to return a file's allocation unit 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:39 ` [PATCH 3/7] xfs: declare xfs_file.c symbols in xfs_file.h Darrick J. Wong @ 2024-04-15 23:40 ` Darrick J. Wong 2024-04-15 23:40 ` [PATCH 5/7] xfs: hoist multi-fsb allocation unit detection to a helper Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:40 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a new helper function to calculate the fundamental allocation unit (i.e. the smallest unit of space we can allocate) of a file. Things are going to get hairy with range-exchange on the realtime device, so prepare for this now. Remove the static attribute from xfs_is_falloc_aligned since the next patch will need it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 28 ++++++++++------------------ fs/xfs/xfs_file.h | 3 +++ fs/xfs/xfs_inode.c | 13 +++++++++++++ fs/xfs/xfs_inode.h | 1 + 4 files changed, 27 insertions(+), 18 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 9961d4b5efbe..64278f8acaee 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -39,33 +39,25 @@ static const struct vm_operations_struct xfs_file_vm_ops; * Decide if the given file range is aligned to the size of the fundamental * allocation unit for the file. 
*/ -static bool +bool xfs_is_falloc_aligned( struct xfs_inode *ip, loff_t pos, long long int len) { - struct xfs_mount *mp = ip->i_mount; - uint64_t mask; + unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip); - if (XFS_IS_REALTIME_INODE(ip)) { - if (!is_power_of_2(mp->m_sb.sb_rextsize)) { - u64 rextbytes; - u32 mod; + if (!is_power_of_2(alloc_unit)) { + u32 mod; - rextbytes = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize); - div_u64_rem(pos, rextbytes, &mod); - if (mod) - return false; - div_u64_rem(len, rextbytes, &mod); - return mod == 0; - } - mask = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize) - 1; - } else { - mask = mp->m_sb.sb_blocksize - 1; + div_u64_rem(pos, alloc_unit, &mod); + if (mod) + return false; + div_u64_rem(len, alloc_unit, &mod); + return mod == 0; } - return !((pos | len) & mask); + return !((pos | len) & (alloc_unit - 1)); } /* diff --git a/fs/xfs/xfs_file.h b/fs/xfs/xfs_file.h index 7d39e3eca56d..2ad91f755caf 100644 --- a/fs/xfs/xfs_file.h +++ b/fs/xfs/xfs_file.h @@ -9,4 +9,7 @@ extern const struct file_operations xfs_file_operations; extern const struct file_operations xfs_dir_file_operations; +bool xfs_is_falloc_aligned(struct xfs_inode *ip, loff_t pos, + long long int len); + #endif /* __XFS_FILE_H__ */ diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 39e6f88e9691..492dae0efad2 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -4008,3 +4008,16 @@ xfs_break_layouts( return error; } + +/* Returns the size of fundamental allocation unit for a file, in bytes. 
*/ +unsigned int +xfs_inode_alloc_unitsize( + struct xfs_inode *ip) +{ + unsigned int blocks = 1; + + if (XFS_IS_REALTIME_INODE(ip)) + blocks = ip->i_mount->m_sb.sb_rextsize; + + return XFS_FSB_TO_B(ip->i_mount, blocks); +} diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index b2dde0e0f265..fa3e605901e2 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -625,6 +625,7 @@ int xfs_inode_reload_unlinked(struct xfs_inode *ip); bool xfs_ifork_zapped(const struct xfs_inode *ip, int whichfork); void xfs_inode_count_blocks(struct xfs_trans *tp, struct xfs_inode *ip, xfs_filblks_t *dblocks, xfs_filblks_t *rblocks); +unsigned int xfs_inode_alloc_unitsize(struct xfs_inode *ip); struct xfs_dir_update_params { const struct xfs_inode *dp; ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 5/7] xfs: hoist multi-fsb allocation unit detection to a helper 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong ` (3 preceding siblings ...) 2024-04-15 23:40 ` [PATCH 4/7] xfs: create a new helper to return a file's allocation unit Darrick J. Wong @ 2024-04-15 23:40 ` Darrick J. Wong 2024-04-15 23:40 ` [PATCH 6/7] xfs: refactor non-power-of-two alignment checks Darrick J. Wong 2024-04-15 23:40 ` [PATCH 7/7] xfs: constify xfs_bmap_is_written_extent Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:40 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Replace the open-coded logic to decide if a file has a multi-fsb allocation unit with a helper to make the code easier to read. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_bmap_util.c | 4 ++-- fs/xfs/xfs_inode.h | 9 +++++++++ 2 files changed, 11 insertions(+), 2 deletions(-) diff --git a/fs/xfs/xfs_bmap_util.c b/fs/xfs/xfs_bmap_util.c index 19e11d1da660..53aa90a0ee3a 100644 --- a/fs/xfs/xfs_bmap_util.c +++ b/fs/xfs/xfs_bmap_util.c @@ -542,7 +542,7 @@ xfs_can_free_eofblocks( * forever. */ end_fsb = XFS_B_TO_FSB(mp, (xfs_ufsize_t)XFS_ISIZE(ip)); - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) + if (xfs_inode_has_bigrtalloc(ip)) end_fsb = xfs_rtb_roundup_rtx(mp, end_fsb); last_fsb = XFS_B_TO_FSB(mp, mp->m_super->s_maxbytes); if (last_fsb <= end_fsb) @@ -843,7 +843,7 @@ xfs_free_file_space( endoffset_fsb = XFS_B_TO_FSBT(mp, offset + len); /* We can only free complete realtime extents.
*/ - if (XFS_IS_REALTIME_INODE(ip) && mp->m_sb.sb_rextsize > 1) { + if (xfs_inode_has_bigrtalloc(ip)) { startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb); endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb); } diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index fa3e605901e2..f559e68ee707 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -311,6 +311,15 @@ static inline bool xfs_inode_has_large_extent_counts(struct xfs_inode *ip) return ip->i_diflags2 & XFS_DIFLAG2_NREXT64; } +/* + * Decide if this file is a realtime file whose data allocation unit is larger + * than a single filesystem block. + */ +static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip) +{ + return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1; +} + /* * Return the buftarg used for data allocations on a given inode. */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 6/7] xfs: refactor non-power-of-two alignment checks 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong ` (4 preceding siblings ...) 2024-04-15 23:40 ` [PATCH 5/7] xfs: hoist multi-fsb allocation unit detection to a helper Darrick J. Wong @ 2024-04-15 23:40 ` Darrick J. Wong 2024-04-15 23:40 ` [PATCH 7/7] xfs: constify xfs_bmap_is_written_extent Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:40 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a helper function that can compute if a 64-bit number is an integer multiple of a 32-bit number, where the 32-bit number is not required to be an even power of two. This is needed for some new code for the realtime device, where we can set 37k allocation units and then have to remap them. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_file.c | 12 +++--------- fs/xfs/xfs_linux.h | 5 +++++ 2 files changed, 8 insertions(+), 9 deletions(-) diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c index 64278f8acaee..d1d4158441bd 100644 --- a/fs/xfs/xfs_file.c +++ b/fs/xfs/xfs_file.c @@ -47,15 +47,9 @@ xfs_is_falloc_aligned( { unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip); - if (!is_power_of_2(alloc_unit)) { - u32 mod; - - div_u64_rem(pos, alloc_unit, &mod); - if (mod) - return false; - div_u64_rem(len, alloc_unit, &mod); - return mod == 0; - } + if (!is_power_of_2(alloc_unit)) + return isaligned_64(pos, alloc_unit) && + isaligned_64(len, alloc_unit); return !((pos | len) & (alloc_unit - 1)); } diff --git a/fs/xfs/xfs_linux.h b/fs/xfs/xfs_linux.h index 8f07c9f6157f..ac355328121a 100644 --- a/fs/xfs/xfs_linux.h +++ b/fs/xfs/xfs_linux.h @@ -198,6 +198,11 @@ static inline uint64_t howmany_64(uint64_t x, uint32_t y) return x; } +static inline bool 
isaligned_64(uint64_t x, uint32_t y) +{ + return do_div(x, y) == 0; +} + /* If @b is a power of 2, return log2(b). Else return -1. */ static inline int8_t log2_if_power2(unsigned long b) { ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 7/7] xfs: constify xfs_bmap_is_written_extent 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong ` (5 preceding siblings ...) 2024-04-15 23:40 ` [PATCH 6/7] xfs: refactor non-power-of-two alignment checks Darrick J. Wong @ 2024-04-15 23:40 ` Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:40 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> This predicate doesn't modify the structure that's being passed in, so we can mark it const. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_bmap.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index f7662595309d..b8bdbf1560e6 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -158,7 +158,7 @@ static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec) * Return true if the extent is a real, allocated extent, or false if it is a * delayed allocation, and unwritten extent or a hole. */ -static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec) +static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec) { return xfs_bmap_is_real_extent(irec) && irec->br_state != XFS_EXT_UNWRITTEN; ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCHSET v30.3 03/16] xfs: atomic file content exchanges 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong 2024-04-15 23:33 ` [PATCHSET v30.3 01/16] xfs: improve log incompat feature handling Darrick J. Wong 2024-04-15 23:34 ` [PATCHSET v30.3 02/16] xfs: refactorings for atomic file content exchanges Darrick J. Wong @ 2024-04-15 23:34 ` Darrick J. Wong 2024-04-15 23:41 ` [PATCH 01/15] vfs: export remap and write check helpers Darrick J. Wong ` (14 more replies) 2024-04-15 23:34 ` [PATCHSET v30.3 04/16] xfs: create temporary files for online repair Darrick J. Wong ` (12 subsequent siblings) 15 siblings, 15 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:34 UTC (permalink / raw To: chandanbabu, djwong Cc: Christoph Hellwig, linux-fsdevel, hch, linux-fsdevel, linux-xfs

Hi all,

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange ranges of bytes between two files atomically.

This new functionality enables data storage programs to stage and commit file updates such that reader programs will see either the old contents or the new contents in their entirety, with no chance of torn writes. A successful call completion guarantees that the new contents will be seen even if the system fails.

The ability to exchange file fork mappings between files in this manner is critical to supporting online filesystem repair, which is built upon the strategy of constructing a clean copy of a damaged structure and committing the new structure into the metadata file atomically. The ioctls exist to facilitate testing of the new functionality and to enable future application program designs.

User programs will be able to update files atomically by opening an O_TMPFILE, reflinking the source file to it, making whatever updates they want to make, and exchanging the relevant ranges of the temp file with the original file.
If the updates are aligned with the file block size, a new (since v2) flag provides for exchanging only the written areas. Note that application software must quiesce writes to the file while it stages an atomic update. This will be addressed by a subsequent series.

This mechanism solves the clunkiness of two existing atomic file update mechanisms: for O_TRUNC + rewrite, this eliminates the brief period where other programs can see an empty file. For create tempfile + rename, the need to copy file attributes and extended attributes for each file update is eliminated.

However, this method introduces its own awkwardness -- any program initiating an exchange now needs to have a way to signal to other programs that the file contents have changed. For file access mediated via read and write, fanotify or inotify are probably sufficient. For mmaped files, that may not be fast enough.

Here is the proposed manual page:

IOCTL-XFS-EXCHANGE-RANGE(2)  System Calls Manual  IOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range - exchange the contents of parts of two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs.h>

       int ioctl(int file2_fd, XFS_IOC_EXCHANGE_RANGE,
                 struct xfs_exchange_range *arg);

DESCRIPTION
       Given a range of bytes in a first file file1_fd and a second range
       of bytes in a second file file2_fd, this ioctl(2) exchanges the
       contents of the two ranges.

       Exchanges are atomic with regards to concurrent file operations.
       Implementations must guarantee that readers see either the old
       contents or the new contents in their entirety, even if the system
       fails.

       The system call parameters are conveyed in structures of the
       following form:

           struct xfs_exchange_range {
               __s32    file1_fd;
               __u32    pad;
               __u64    file1_offset;
               __u64    file2_offset;
               __u64    length;
               __u64    flags;
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the first
       range of bytes to be exchanged.
The fields file2_fd, file2_offset, and length define the second range of bytes to be exchanged. Both files must be from the same filesystem mount. If the two file descriptors represent the same file, the byte ranges must not overlap. Most disk-based filesystems require that the starts of both ranges must be aligned to the file block size. If this is the case, the ends of the ranges must also be so aligned unless the XFS_EXCHANGE_RANGE_TO_EOF flag is set. The field flags control the behavior of the exchange operation. XFS_EXCHANGE_RANGE_TO_EOF Ignore the length parameter. All bytes in file1_fd from file1_offset to EOF are moved to file2_fd, and file2's size is set to (file2_offset+(file1_length- file1_offset)). Meanwhile, all bytes in file2 from file2_offset to EOF are moved to file1 and file1's size is set to (file1_offset+(file2_length- file2_offset)). XFS_EXCHANGE_RANGE_DSYNC Ensure that all modified in-core data in both file ranges and all metadata updates pertaining to the exchange operation are flushed to persistent storage before the call returns. Opening either file de‐ scriptor with O_SYNC or O_DSYNC will have the same effect. XFS_EXCHANGE_RANGE_FILE1_WRITTEN Only exchange sub-ranges of file1_fd that are known to contain data written by application software. Each sub-range may be expanded (both upwards and downwards) to align with the file allocation unit. For files on the data device, this is one filesystem block. For files on the realtime device, this is the realtime extent size. This facility can be used to implement fast atomic scatter-gather writes of any complexity for software-defined storage targets if all writes are aligned to the file allocation unit. XFS_EXCHANGE_RANGE_DRY_RUN Check the parameters and the feasibility of the op‐ eration, but do not change anything. RETURN VALUE On error, -1 is returned, and errno is set to indicate the er‐ ror. 
ERRORS
       Error codes can be one of, but are not limited to, the
       following:

       EBADF  file1_fd is not open for reading and writing or is open
              for append-only writes; or file2_fd is not open for
              reading and writing or is open for append-only writes.

       EINVAL The parameters are not correct for these files.  This
              error can also appear if either file descriptor
              represents a device, FIFO, or socket.  Disk filesystems
              generally require the offset and length arguments to be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The kernel was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space in the filesystem to
              exchange the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd is immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and file2_fd are not on the same mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system call.  In all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The first is a data storage program that wants to commit
       non-contiguous updates to a file atomically and coordinates
       write access to that file.  This can be done by creating a
       temporary file, calling FICLONE(2) to share the contents, and
       staging the updates into the temporary file.  The
       XFS_EXCHANGE_RANGE_TO_EOF flag is recommended for this purpose.
       The temporary file can be deleted or punched out afterwards.
       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHANGE_RANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The second is a software-defined storage host (e.g. a disk
       jukebox) which implements an atomic scatter-gather write
       command.  Provided the exported disk's logical block size
       matches the file's allocation unit size, this can be done by
       creating a temporary file and writing the data at the
       appropriate offsets.  It is recommended that the temporary file
       be truncated to the size of the regular file before any writes
       are staged to the temporary file to avoid issues with zeroing
       during EOF extension.  Use this call with the FILE1_WRITTEN
       flag to exchange only the file allocation units involved in the
       emulated device's write command.  The temporary file should be
       truncated or punched out completely before being reused to
       stage another write.
       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR, 0600);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exchange_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length = blksz * 400,
               .flags = XFS_EXCHANGE_RANGE_FILE1_WRITTEN |
                        XFS_EXCHANGE_RANGE_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                          2024-02-10      IOCTL-XFS-EXCHANGE-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.
This enables repair functions to construct a clean copy of a
directory, xattr information, symbolic links, realtime bitmaps, and
realtime summary information in a temporary inode.  If this completes
successfully, the new contents can be committed atomically into the
inode being repaired.  This is essential to avoid making corruption
problems worse if the system goes down in the middle of running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates-6.10
---
Commits in this patchset:
 * vfs: export remap and write check helpers
 * xfs: introduce new file range exchange ioctl
 * xfs: create a incompat flag for atomic file mapping exchanges
 * xfs: introduce a file mapping exchange log intent item
 * xfs: create deferred log items for file mapping exchanges
 * xfs: bind together the front and back ends of the file range exchange code
 * xfs: add error injection to test file mapping exchange recovery
 * xfs: condense extended attributes after a mapping exchange operation
 * xfs: condense directories after a mapping exchange operation
 * xfs: condense symbolic links after a mapping exchange operation
 * xfs: make file range exchange support realtime files
 * xfs: support non-power-of-two rtextsize with exchange-range
 * xfs: capture inode generation numbers in the ondisk exchmaps log item
 * docs: update swapext -> exchmaps language
 * xfs: enable logged file mapping exchange feature
---
 .../filesystems/xfs/xfs-online-fsck-design.rst |  259 ++--
 fs/read_write.c                                |    1 
 fs/remap_range.c                               |    4 
 fs/xfs/Makefile                                |    3 
 fs/xfs/libxfs/xfs_defer.c                      |    6 
 fs/xfs/libxfs/xfs_defer.h                      |    2 
 fs/xfs/libxfs/xfs_errortag.h                   |    4 
 fs/xfs/libxfs/xfs_exchmaps.c                   | 1237 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_exchmaps.h                   |  123 ++
 fs/xfs/libxfs/xfs_format.h                     |   26 
 fs/xfs/libxfs/xfs_fs.h                         |   42 +
 fs/xfs/libxfs/xfs_log_format.h                 |   66 +
 fs/xfs/libxfs/xfs_log_recover.h                |    4 
 fs/xfs/libxfs/xfs_sb.c                         |    5 
 fs/xfs/libxfs/xfs_symlink_remote.c             |   47 +
 fs/xfs/libxfs/xfs_symlink_remote.h             |    1 
 fs/xfs/libxfs/xfs_trans_space.h                |    4 
 fs/xfs/xfs_error.c                             |    3 
 fs/xfs/xfs_exchmaps_item.c                     |  614 ++++++++++
 fs/xfs/xfs_exchmaps_item.h                     |   64 +
 fs/xfs/xfs_exchrange.c                         |  804 +++++++++++++
 fs/xfs/xfs_exchrange.h                         |   38 +
 fs/xfs/xfs_ioctl.c                             |    4 
 fs/xfs/xfs_log_recover.c                       |   33 +
 fs/xfs/xfs_mount.h                             |    2 
 fs/xfs/xfs_super.c                             |   23 
 fs/xfs/xfs_symlink.c                           |   49 -
 fs/xfs/xfs_trace.c                             |    2 
 fs/xfs/xfs_trace.h                             |  327 +++++
 include/linux/fs.h                             |    1 
 30 files changed, 3613 insertions(+), 185 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h
 create mode 100644 fs/xfs/xfs_exchmaps_item.c
 create mode 100644 fs/xfs/xfs_exchmaps_item.h
 create mode 100644 fs/xfs/xfs_exchrange.c
 create mode 100644 fs/xfs/xfs_exchrange.h

^ permalink raw reply	[flat|nested] 100+ messages in thread
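The manual page examples in the cover letter omit error handling. A minimal sketch of a commit helper that validates with XFS_EXCHANGE_RANGE_DRY_RUN before committing might look like the following. The helper name commit_staged_file is hypothetical, and the structure layout, flag bits, and ioctl number are local copies of the definitions added to xfs_fs.h by this series; a real program would presumably include <xfs/xfs_fs.h> instead. Note that the dry run and the commit are two separate calls, so the result of the first is only advisory.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <sys/ioctl.h>

/*
 * Local copies of the definitions this series adds to xfs_fs.h, so that the
 * sketch builds without the XFS headers.  Assumption: values match the patch.
 */
struct xfs_exchange_range {
	int32_t		file1_fd;
	uint32_t	pad;		/* must be zero */
	uint64_t	file1_offset;
	uint64_t	file2_offset;
	uint64_t	length;
	uint64_t	flags;
};
static_assert(sizeof(struct xfs_exchange_range) == 40, "unexpected layout");

#define XFS_EXCHANGE_RANGE_TO_EOF	(1ULL << 0)
#define XFS_EXCHANGE_RANGE_DSYNC	(1ULL << 1)
#define XFS_EXCHANGE_RANGE_DRY_RUN	(1ULL << 2)
#define XFS_IOC_EXCHANGE_RANGE	_IOWR('X', 129, struct xfs_exchange_range)

/*
 * Hypothetical helper: replace the contents of fd with whatever was staged
 * in temp_fd.  Returns 0 on success or a negative errno value.
 */
static int commit_staged_file(int fd, int temp_fd)
{
	struct xfs_exchange_range args = {
		.file1_fd = temp_fd,
		.flags = XFS_EXCHANGE_RANGE_TO_EOF | XFS_EXCHANGE_RANGE_DRY_RUN,
	};

	/* First pass: check parameters and feasibility, change nothing. */
	if (ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args) < 0)
		return -errno;

	/* Second pass: exchange to EOF and flush before returning. */
	args.flags = XFS_EXCHANGE_RANGE_TO_EOF | XFS_EXCHANGE_RANGE_DSYNC;
	if (ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args) < 0)
		return -errno;

	return 0;
}
```

A caller would likely want to distinguish -EOPNOTSUPP (fall back to tempfile + rename) from errors such as -ENOSPC or -EIO, which the fallback is unlikely to fix.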
* [PATCH 01/15] vfs: export remap and write check helpers 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong @ 2024-04-15 23:41 ` Darrick J. Wong 2024-04-15 23:41 ` [PATCH 02/15] xfs: introduce new file range exchange ioctl Darrick J. Wong ` (13 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:41 UTC (permalink / raw To: chandanbabu, djwong Cc: linux-fsdevel, Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Export these functions so that the next patch can use them to check the file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation. Cc: linux-fsdevel@vger.kernel.org Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/read_write.c | 1 + fs/remap_range.c | 4 ++-- include/linux/fs.h | 1 + 3 files changed, 4 insertions(+), 2 deletions(-) diff --git a/fs/read_write.c b/fs/read_write.c index d4c036e82b6c..85c096f2c0d0 100644 --- a/fs/read_write.c +++ b/fs/read_write.c @@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count) return 0; } +EXPORT_SYMBOL_GPL(generic_write_check_limits); /* Like generic_write_checks(), but takes size of write instead of iter. */ int generic_write_checks_count(struct kiocb *iocb, loff_t *count) diff --git a/fs/remap_range.c b/fs/remap_range.c index de07f978ce3e..28246dfc8485 100644 --- a/fs/remap_range.c +++ b/fs/remap_range.c @@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in, return 0; } -static int remap_verify_area(struct file *file, loff_t pos, loff_t len, - bool write) +int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write) { int mask = write ? 
MAY_WRITE : MAY_READ; loff_t tmp; @@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len, return fsnotify_file_area_perm(file, mask, &pos, len); } +EXPORT_SYMBOL_GPL(remap_verify_area); /* * Ensure that we don't remap a partial EOF block in the middle of something diff --git a/include/linux/fs.h b/include/linux/fs.h index 8dfd53b52744..0835faeebe7b 100644 --- a/include/linux/fs.h +++ b/include/linux/fs.h @@ -2119,6 +2119,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *); extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *); extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *, loff_t, size_t, unsigned int); +int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write); int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in, struct file *file_out, loff_t pos_out, loff_t *len, unsigned int remap_flags, ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 02/15] xfs: introduce new file range exchange ioctl 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong 2024-04-15 23:41 ` [PATCH 01/15] vfs: export remap and write check helpers Darrick J. Wong @ 2024-04-15 23:41 ` Darrick J. Wong 2024-04-15 23:41 ` [PATCH 03/15] xfs: create a incompat flag for atomic file mapping exchanges Darrick J. Wong ` (12 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:41 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Introduce a new ioctl to handle exchanging ranges of bytes between files. The goal here is to perform the exchange atomically with respect to applications -- either they see the file contents before the exchange or they see that A-B is now B-A, even if the kernel crashes. My original goal with all this code was to make it so that online repair can build a replacement directory or xattr structure in a temporary file and commit the repair by atomically exchanging all the data blocks between the two files. However, I needed a way to test this mechanism thoroughly, so I've been evolving an ioctl interface since then. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_fs.h | 41 ++++++ fs/xfs/xfs_exchrange.c | 339 ++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_exchrange.h | 30 ++++ fs/xfs/xfs_ioctl.c | 4 + 5 files changed, 415 insertions(+) create mode 100644 fs/xfs/xfs_exchrange.c create mode 100644 fs/xfs/xfs_exchrange.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 76674ad5833e..2474242f5a05 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -67,6 +67,7 @@ xfs-y += xfs_aops.o \ xfs_dir2_readdir.o \ xfs_discard.o \ xfs_error.o \ + xfs_exchrange.o \ xfs_export.o \ xfs_extent_busy.o \ xfs_file.o \ diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index ca1b17d01437..8a1e30cf4dc8 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -772,6 +772,46 @@ struct xfs_scrub_metadata { # define XFS_XATTR_LIST_MAX 65536 #endif +/* + * Exchange part of file1 with part of the file that this ioctl that is being + * called against (which we'll call file2). Filesystems must be able to + * restart and complete the operation even after the system goes down. + */ +struct xfs_exchange_range { + __s32 file1_fd; + __u32 pad; /* must be zeroes */ + __u64 file1_offset; /* file1 offset, bytes */ + __u64 file2_offset; /* file2 offset, bytes */ + __u64 length; /* bytes to exchange */ + + __u64 flags; /* see XFS_EXCHANGE_RANGE_* below */ +}; + +/* + * Exchange file data all the way to the ends of both files, and then exchange + * the file sizes. This flag can be used to replace a file's contents with a + * different amount of data. length will be ignored. + */ +#define XFS_EXCHANGE_RANGE_TO_EOF (1ULL << 0) + +/* Flush all changes in file data and file metadata to disk before returning. */ +#define XFS_EXCHANGE_RANGE_DSYNC (1ULL << 1) + +/* Dry run; do all the parameter verification but do not change anything. 
*/ +#define XFS_EXCHANGE_RANGE_DRY_RUN (1ULL << 2) + +/* + * Exchange only the parts of the two files where the file allocation units + * mapped to file1's range have been written to. This can accelerate + * scatter-gather atomic writes with a temp file if all writes are aligned to + * the file allocation unit. + */ +#define XFS_EXCHANGE_RANGE_FILE1_WRITTEN (1ULL << 3) + +#define XFS_EXCHANGE_RANGE_ALL_FLAGS (XFS_EXCHANGE_RANGE_TO_EOF | \ + XFS_EXCHANGE_RANGE_DSYNC | \ + XFS_EXCHANGE_RANGE_DRY_RUN | \ + XFS_EXCHANGE_RANGE_FILE1_WRITTEN) /* * ioctl commands that are used by Linux filesystems @@ -843,6 +883,7 @@ struct xfs_scrub_metadata { #define XFS_IOC_FSGEOMETRY _IOR ('X', 126, struct xfs_fsop_geom) #define XFS_IOC_BULKSTAT _IOR ('X', 127, struct xfs_bulkstat_req) #define XFS_IOC_INUMBERS _IOR ('X', 128, struct xfs_inumbers_req) +#define XFS_IOC_EXCHANGE_RANGE _IOWR('X', 129, struct xfs_exchange_range) /* XFS_IOC_GETFSUUID ---------- deprecated 140 */ diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c new file mode 100644 index 000000000000..4cd824e47f75 --- /dev/null +++ b/fs/xfs/xfs_exchrange.c @@ -0,0 +1,339 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_inode.h" +#include "xfs_trans.h" +#include "xfs_exchrange.h" +#include <linux/fsnotify.h> + +/* + * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE. + * This part deals with struct file objects and byte ranges and does not deal + * with XFS-specific data structures such as xfs_inodes and block ranges. This + * separation may some day facilitate porting to another filesystem. 
+ * + * The goal is to exchange fxr.length bytes starting at fxr.file1_offset in + * file1 with the same number of bytes starting at fxr.file2_offset in file2. + * Implementations must call xfs_exchange_range_prep to prepare the two + * files prior to taking locks; and they must update the inode change and mod + * times of both files as part of the metadata update. The timestamp update + * and freshness checks must be done atomically as part of the data exchange + * operation to ensure correctness of the freshness check. + * xfs_exchange_range_finish must be called after the operation completes + * successfully but before locks are dropped. + */ + +/* Verify that we have security clearance to perform this operation. */ +static int +xfs_exchange_range_verify_area( + struct xfs_exchrange *fxr) +{ + int ret; + + ret = remap_verify_area(fxr->file1, fxr->file1_offset, fxr->length, + true); + if (ret) + return ret; + + return remap_verify_area(fxr->file2, fxr->file2_offset, fxr->length, + true); +} + +/* + * Performs necessary checks before doing a range exchange, having stabilized + * mutable inode attributes via i_rwsem. + */ +static inline int +xfs_exchange_range_checks( + struct xfs_exchrange *fxr, + unsigned int alloc_unit) +{ + struct inode *inode1 = file_inode(fxr->file1); + struct inode *inode2 = file_inode(fxr->file2); + uint64_t allocmask = alloc_unit - 1; + int64_t test_len; + uint64_t blen; + loff_t size1, size2, tmp; + int error; + + /* Don't touch certain kinds of inodes */ + if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2)) + return -EPERM; + if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2)) + return -ETXTBSY; + + size1 = i_size_read(inode1); + size2 = i_size_read(inode2); + + /* Ranges cannot start after EOF. */ + if (fxr->file1_offset > size1 || fxr->file2_offset > size2) + return -EINVAL; + + /* + * If the caller said to exchange to EOF, we set the length of the + * request large enough to cover everything to the end of both files. 
+ */ + if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) { + fxr->length = max_t(int64_t, size1 - fxr->file1_offset, + size2 - fxr->file2_offset); + + error = xfs_exchange_range_verify_area(fxr); + if (error) + return error; + } + + /* + * The start of both ranges must be aligned to the file allocation + * unit. + */ + if (!IS_ALIGNED(fxr->file1_offset, alloc_unit) || + !IS_ALIGNED(fxr->file2_offset, alloc_unit)) + return -EINVAL; + + /* Ensure offsets don't wrap. */ + if (check_add_overflow(fxr->file1_offset, fxr->length, &tmp) || + check_add_overflow(fxr->file2_offset, fxr->length, &tmp)) + return -EINVAL; + + /* + * We require both ranges to end within EOF, unless we're exchanging + * to EOF. + */ + if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) && + (fxr->file1_offset + fxr->length > size1 || + fxr->file2_offset + fxr->length > size2)) + return -EINVAL; + + /* + * Make sure we don't hit any file size limits. If we hit any size + * limits such that test_length was adjusted, we abort the whole + * operation. + */ + test_len = fxr->length; + error = generic_write_check_limits(fxr->file2, fxr->file2_offset, + &test_len); + if (error) + return error; + error = generic_write_check_limits(fxr->file1, fxr->file1_offset, + &test_len); + if (error) + return error; + if (test_len != fxr->length) + return -EINVAL; + + /* + * If the user wanted us to exchange up to the infile's EOF, round up + * to the next allocation unit boundary for this check. Do the same + * for the outfile. + * + * Otherwise, reject the range length if it's not aligned to an + * allocation unit. + */ + if (fxr->file1_offset + fxr->length == size1) + blen = ALIGN(size1, alloc_unit) - fxr->file1_offset; + else if (fxr->file2_offset + fxr->length == size2) + blen = ALIGN(size2, alloc_unit) - fxr->file2_offset; + else if (!IS_ALIGNED(fxr->length, alloc_unit)) + return -EINVAL; + else + blen = fxr->length; + + /* Don't allow overlapped exchanges within the same file. 
*/ + if (inode1 == inode2 && + fxr->file2_offset + blen > fxr->file1_offset && + fxr->file1_offset + blen > fxr->file2_offset) + return -EINVAL; + + /* + * Ensure that we don't exchange a partial EOF block into the middle of + * another file. + */ + if ((fxr->length & allocmask) == 0) + return 0; + + blen = fxr->length; + if (fxr->file2_offset + blen < size2) + blen &= ~allocmask; + + if (fxr->file1_offset + blen < size1) + blen &= ~allocmask; + + return blen == fxr->length ? 0 : -EINVAL; +} + +/* + * Check that the two inodes are eligible for range exchanges, the ranges make + * sense, and then flush all dirty data. Caller must ensure that the inodes + * have been locked against any other modifications. + */ +static inline int +xfs_exchange_range_prep( + struct xfs_exchrange *fxr, + unsigned int alloc_unit) +{ + struct inode *inode1 = file_inode(fxr->file1); + struct inode *inode2 = file_inode(fxr->file2); + bool same_inode = (inode1 == inode2); + int error; + + /* Check that we don't violate system file offset limits. */ + error = xfs_exchange_range_checks(fxr, alloc_unit); + if (error || fxr->length == 0) + return error; + + /* Wait for the completion of any pending IOs on both files */ + inode_dio_wait(inode1); + if (!same_inode) + inode_dio_wait(inode2); + + error = filemap_write_and_wait_range(inode1->i_mapping, + fxr->file1_offset, + fxr->file1_offset + fxr->length - 1); + if (error) + return error; + + error = filemap_write_and_wait_range(inode2->i_mapping, + fxr->file2_offset, + fxr->file2_offset + fxr->length - 1); + if (error) + return error; + + /* + * If the files or inodes involved require synchronous writes, amend + * the request to force the filesystem to flush all data and metadata + * to disk after the operation completes. 
+ */ + if (((fxr->file1->f_flags | fxr->file2->f_flags) & O_SYNC) || + IS_SYNC(inode1) || IS_SYNC(inode2)) + fxr->flags |= XFS_EXCHANGE_RANGE_DSYNC; + + return 0; +} + +/* + * Finish a range exchange operation, if it was successful. Caller must ensure + * that the inodes are still locked against any other modifications. + */ +static inline int +xfs_exchange_range_finish( + struct xfs_exchrange *fxr) +{ + int error; + + error = file_remove_privs(fxr->file1); + if (error) + return error; + if (file_inode(fxr->file1) == file_inode(fxr->file2)) + return 0; + + return file_remove_privs(fxr->file2); +} + +/* Exchange parts of two files. */ +static int +xfs_exchange_range( + struct xfs_exchrange *fxr) +{ + struct inode *inode1 = file_inode(fxr->file1); + struct inode *inode2 = file_inode(fxr->file2); + int ret; + + BUILD_BUG_ON(XFS_EXCHANGE_RANGE_ALL_FLAGS & + XFS_EXCHANGE_RANGE_PRIV_FLAGS); + + /* Both files must be on the same mount/filesystem. */ + if (fxr->file1->f_path.mnt != fxr->file2->f_path.mnt) + return -EXDEV; + + if (fxr->flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS) + return -EINVAL; + + /* Userspace requests only honored for regular files. */ + if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode)) + return -EISDIR; + if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode)) + return -EINVAL; + + /* Both files must be opened for read and write. */ + if (!(fxr->file1->f_mode & FMODE_READ) || + !(fxr->file1->f_mode & FMODE_WRITE) || + !(fxr->file2->f_mode & FMODE_READ) || + !(fxr->file2->f_mode & FMODE_WRITE)) + return -EBADF; + + /* Neither file can be opened append-only. */ + if ((fxr->file1->f_flags & O_APPEND) || + (fxr->file2->f_flags & O_APPEND)) + return -EBADF; + + /* + * If we're not exchanging to EOF, we can check the areas before + * stabilizing both files' i_size. + */ + if (!(fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)) { + ret = xfs_exchange_range_verify_area(fxr); + if (ret) + return ret; + } + + /* Update cmtime if the fd/inode don't forbid it. 
*/ + if (!(fxr->file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1)) + fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME1; + if (!(fxr->file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2)) + fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME2; + + file_start_write(fxr->file2); + ret = -EOPNOTSUPP; /* XXX call out to lower level code */ + file_end_write(fxr->file2); + if (ret) + return ret; + + fsnotify_modify(fxr->file1); + if (fxr->file2 != fxr->file1) + fsnotify_modify(fxr->file2); + return 0; +} + +/* Collect exchange-range arguments from userspace. */ +long +xfs_ioc_exchange_range( + struct file *file, + struct xfs_exchange_range __user *argp) +{ + struct xfs_exchrange fxr = { + .file2 = file, + }; + struct xfs_exchange_range args; + struct fd file1; + int error; + + if (copy_from_user(&args, argp, sizeof(args))) + return -EFAULT; + if (memchr_inv(&args.pad, 0, sizeof(args.pad))) + return -EINVAL; + if (args.flags & ~XFS_EXCHANGE_RANGE_ALL_FLAGS) + return -EINVAL; + + fxr.file1_offset = args.file1_offset; + fxr.file2_offset = args.file2_offset; + fxr.length = args.length; + fxr.flags = args.flags; + + file1 = fdget(args.file1_fd); + if (!file1.file) + return -EBADF; + fxr.file1 = file1.file; + + error = xfs_exchange_range(&fxr); + fdput(file1); + return error; +} diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h new file mode 100644 index 000000000000..f80369c7df5d --- /dev/null +++ b/fs/xfs/xfs_exchrange.h @@ -0,0 +1,30 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_EXCHRANGE_H__ +#define __XFS_EXCHRANGE_H__ + +/* Update the mtime/cmtime of file1 and file2 */ +#define __XFS_EXCHANGE_RANGE_UPD_CMTIME1 (1ULL << 63) +#define __XFS_EXCHANGE_RANGE_UPD_CMTIME2 (1ULL << 62) + +#define XFS_EXCHANGE_RANGE_PRIV_FLAGS (__XFS_EXCHANGE_RANGE_UPD_CMTIME1 | \ + __XFS_EXCHANGE_RANGE_UPD_CMTIME2) + +struct xfs_exchrange { + struct file *file1; + struct file *file2; + + loff_t file1_offset; + loff_t file2_offset; + u64 length; + + u64 flags; /* XFS_EXCHANGE_RANGE flags */ +}; + +long xfs_ioc_exchange_range(struct file *file, + struct xfs_exchange_range __user *argp); + +#endif /* __XFS_EXCHRANGE_H__ */ diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c index 1397edea20f1..efa95892655d 100644 --- a/fs/xfs/xfs_ioctl.c +++ b/fs/xfs/xfs_ioctl.c @@ -40,6 +40,7 @@ #include "xfs_xattr.h" #include "xfs_rtbitmap.h" #include "xfs_file.h" +#include "xfs_exchrange.h" #include <linux/mount.h> #include <linux/namei.h> @@ -2170,6 +2171,9 @@ xfs_file_ioctl( return error; } + case XFS_IOC_EXCHANGE_RANGE: + return xfs_ioc_exchange_range(filp, arg); + default: return -ENOTTY; } ^ permalink raw reply related [flat|nested] 100+ messages in thread
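The alignment, overlap, and partial-EOF-block rules applied by xfs_exchange_range_checks() in the patch above can be modelled in userspace roughly as follows. This is a simplified sketch and not the kernel code: it covers only the path taken when XFS_EXCHANGE_RANGE_TO_EOF is not set, skips the immutable/swapfile and file size limit checks, uses unsigned arithmetic with a modulo where the kernel uses loff_t and a mask, and the model_* names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Round x up to a multiple of unit; unit must be nonzero. */
static uint64_t model_align_up(uint64_t x, uint64_t unit)
{
	assert(unit != 0);
	return (x + unit - 1) / unit * unit;
}

/*
 * Simplified model of the range checks for a request without TO_EOF.
 * Returns true if the request would pass the modelled checks.
 */
static bool
model_range_ok(uint64_t off1, uint64_t size1, uint64_t off2, uint64_t size2,
	       uint64_t len, uint64_t unit, bool same_file)
{
	uint64_t blen;

	assert(unit != 0);

	/* Ranges cannot start after EOF, and both starts must be aligned. */
	if (off1 > size1 || off2 > size2)
		return false;
	if (off1 % unit || off2 % unit)
		return false;

	/* Both ranges must end within EOF; this also rules out wraparound. */
	if (len > size1 - off1 || len > size2 - off2)
		return false;

	/*
	 * If a range ends exactly at EOF, round it up to the allocation
	 * unit for the overlap check.  Otherwise the length must be aligned.
	 */
	if (off1 + len == size1)
		blen = model_align_up(size1, unit) - off1;
	else if (off2 + len == size2)
		blen = model_align_up(size2, unit) - off2;
	else if (len % unit)
		return false;
	else
		blen = len;

	/* Don't allow overlapped exchanges within the same file. */
	if (same_file && off2 + blen > off1 && off1 + blen > off2)
		return false;

	/*
	 * A partial EOF block must not land in the middle of the other
	 * file, so an unaligned length requires both ranges to end at EOF.
	 */
	if (len % unit == 0)
		return true;
	return off1 + len == size1 && off2 + len == size2;
}
```

With a 4096-byte allocation unit, exchanging the final 904 bytes of two 5000-byte files passes, but the same request against a 10000-byte second file is rejected because the partial block would end up in the middle of that file.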
* [PATCH 03/15] xfs: create a incompat flag for atomic file mapping exchanges 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong 2024-04-15 23:41 ` [PATCH 01/15] vfs: export remap and write check helpers Darrick J. Wong 2024-04-15 23:41 ` [PATCH 02/15] xfs: introduce new file range exchange ioctl Darrick J. Wong @ 2024-04-15 23:41 ` Darrick J. Wong 2024-04-15 23:41 ` [PATCH 04/15] xfs: introduce a file mapping exchange log intent item Darrick J. Wong ` (11 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:41 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a incompat flag so that we only attempt to process file mapping exchange log items if the filesystem supports it, and a geometry flag to advertise support if it's present. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_format.h | 23 ++++++++++++----------- fs/xfs/libxfs/xfs_fs.h | 1 + fs/xfs/libxfs/xfs_sb.c | 5 +++++ fs/xfs/xfs_mount.h | 2 ++ fs/xfs/xfs_super.c | 4 ++++ 5 files changed, 24 insertions(+), 11 deletions(-) diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index 2b2f9050fbfb..ff1e28316e1b 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -367,18 +367,19 @@ xfs_sb_has_ro_compat_feature( return (sbp->sb_features_ro_compat & feature) != 0; } -#define XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ -#define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */ -#define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */ -#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */ -#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */ -#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */ +#define 
XFS_SB_FEAT_INCOMPAT_FTYPE (1 << 0) /* filetype in dirent */ +#define XFS_SB_FEAT_INCOMPAT_SPINODES (1 << 1) /* sparse inode chunks */ +#define XFS_SB_FEAT_INCOMPAT_META_UUID (1 << 2) /* metadata UUID */ +#define XFS_SB_FEAT_INCOMPAT_BIGTIME (1 << 3) /* large timestamps */ +#define XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR (1 << 4) /* needs xfs_repair */ +#define XFS_SB_FEAT_INCOMPAT_NREXT64 (1 << 5) /* large extent counters */ +#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE (1 << 6) /* exchangerange supported */ #define XFS_SB_FEAT_INCOMPAT_ALL \ - (XFS_SB_FEAT_INCOMPAT_FTYPE| \ - XFS_SB_FEAT_INCOMPAT_SPINODES| \ - XFS_SB_FEAT_INCOMPAT_META_UUID| \ - XFS_SB_FEAT_INCOMPAT_BIGTIME| \ - XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR| \ + (XFS_SB_FEAT_INCOMPAT_FTYPE | \ + XFS_SB_FEAT_INCOMPAT_SPINODES | \ + XFS_SB_FEAT_INCOMPAT_META_UUID | \ + XFS_SB_FEAT_INCOMPAT_BIGTIME | \ + XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \ XFS_SB_FEAT_INCOMPAT_NREXT64) #define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h index 8a1e30cf4dc8..53526fca7386 100644 --- a/fs/xfs/libxfs/xfs_fs.h +++ b/fs/xfs/libxfs/xfs_fs.h @@ -239,6 +239,7 @@ typedef struct xfs_fsop_resblks { #define XFS_FSOP_GEOM_FLAGS_BIGTIME (1 << 21) /* 64-bit nsec timestamps */ #define XFS_FSOP_GEOM_FLAGS_INOBTCNT (1 << 22) /* inobt btree counter */ #define XFS_FSOP_GEOM_FLAGS_NREXT64 (1 << 23) /* large extent counters */ +#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE (1 << 24) /* exchange range */ /* * Minimum and maximum sizes need for growth checks. diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c index 73a4b895de67..c350e259b685 100644 --- a/fs/xfs/libxfs/xfs_sb.c +++ b/fs/xfs/libxfs/xfs_sb.c @@ -26,6 +26,7 @@ #include "xfs_health.h" #include "xfs_ag.h" #include "xfs_rtbitmap.h" +#include "xfs_exchrange.h" /* * Physical superblock buffer manipulations. Shared with libxfs in userspace. 
@@ -175,6 +176,8 @@ xfs_sb_version_to_features( features |= XFS_FEAT_NEEDSREPAIR; if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_NREXT64) features |= XFS_FEAT_NREXT64; + if (sbp->sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE) + features |= XFS_FEAT_EXCHANGE_RANGE; return features; } @@ -1259,6 +1262,8 @@ xfs_fs_geometry( } if (xfs_has_large_extent_counts(mp)) geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64; + if (xfs_has_exchange_range(mp)) + geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE; geo->rtsectsize = sbp->sb_blocksize; geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp); diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h index 6ec038b88454..b022e5120dc4 100644 --- a/fs/xfs/xfs_mount.h +++ b/fs/xfs/xfs_mount.h @@ -292,6 +292,7 @@ typedef struct xfs_mount { #define XFS_FEAT_BIGTIME (1ULL << 24) /* large timestamps */ #define XFS_FEAT_NEEDSREPAIR (1ULL << 25) /* needs xfs_repair */ #define XFS_FEAT_NREXT64 (1ULL << 26) /* large extent counters */ +#define XFS_FEAT_EXCHANGE_RANGE (1ULL << 27) /* exchange range */ /* Mount features */ #define XFS_FEAT_NOATTR2 (1ULL << 48) /* disable attr2 creation */ @@ -355,6 +356,7 @@ __XFS_HAS_FEAT(inobtcounts, INOBTCNT) __XFS_HAS_FEAT(bigtime, BIGTIME) __XFS_HAS_FEAT(needsrepair, NEEDSREPAIR) __XFS_HAS_FEAT(large_extent_counts, NREXT64) +__XFS_HAS_FEAT(exchange_range, EXCHANGE_RANGE) /* * Mount features diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index bce020374c5e..dbda72df3419 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -1727,6 +1727,10 @@ xfs_fs_fill_super( goto out_filestream_unmount; } + if (xfs_has_exchange_range(mp)) + xfs_warn(mp, + "EXPERIMENTAL exchange-range feature enabled. Use at your own risk!"); + error = xfs_mountfs(mp); if (error) goto out_filestream_unmount; ^ permalink raw reply related [flat|nested] 100+ messages in thread
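The patch above defines one bit in the superblock incompat field and one bit in the geometry flags. A minimal sketch of how each could be tested is shown below; the helper names are hypothetical, and the two constants are local copies of the values in the hunks above so that the example builds on its own. In a real program the geometry flags would presumably come from an XFS_IOC_FSGEOMETRY call rather than being passed in by hand.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local copies of the values added by the patch above. */
#define XFS_SB_FEAT_INCOMPAT_EXCHRANGE		(1 << 6)
#define XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE	(1 << 24)

/* Sanity-check the copied constants against the expected bit positions. */
static_assert(XFS_SB_FEAT_INCOMPAT_EXCHRANGE == 0x40, "incompat bit 6");
static_assert(XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE == 0x1000000, "geom bit 24");

/* Hypothetical helper: does the ondisk incompat field carry the new bit? */
static bool sb_has_exchrange(uint32_t sb_features_incompat)
{
	return (sb_features_incompat & XFS_SB_FEAT_INCOMPAT_EXCHRANGE) != 0;
}

/* Hypothetical helper: do the reported geometry flags advertise support? */
static bool geom_has_exchange_range(uint32_t geom_flags)
{
	return (geom_flags & XFS_FSOP_GEOM_FLAGS_EXCHANGE_RANGE) != 0;
}
```

Userspace would normally rely on the geometry flag only; the superblock bit is shown to illustrate what xfs_sb_version_to_features() translates into XFS_FEAT_EXCHANGE_RANGE.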
* [PATCH 04/15] xfs: introduce a file mapping exchange log intent item 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:41 ` [PATCH 03/15] xfs: create a incompat flag for atomic file mapping exchanges Darrick J. Wong @ 2024-04-15 23:41 ` Darrick J. Wong 2024-04-15 23:42 ` [PATCH 05/15] xfs: create deferred log items for file mapping exchanges Darrick J. Wong ` (10 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:41 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Introduce a new intent log item to handle exchanging mappings between the forks of two files. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_log_format.h | 42 ++++++- fs/xfs/libxfs/xfs_log_recover.h | 2 fs/xfs/xfs_exchmaps_item.c | 235 +++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_exchmaps_item.h | 59 ++++++++++ fs/xfs/xfs_log_recover.c | 2 fs/xfs/xfs_super.c | 19 +++ 7 files changed, 357 insertions(+), 3 deletions(-) create mode 100644 fs/xfs/xfs_exchmaps_item.c create mode 100644 fs/xfs/xfs_exchmaps_item.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 2474242f5a05..68ca9726e7b7 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -102,6 +102,7 @@ xfs-y += xfs_log.o \ xfs_buf_item.o \ xfs_buf_item_recover.o \ xfs_dquot_item_recover.o \ + xfs_exchmaps_item.o \ xfs_extfree_item.o \ xfs_attr_item.o \ xfs_icreate_item.o \ diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 16872972e1e9..09024431cae9 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -117,8 +117,9 @@ struct xfs_unmount_log_format { #define XLOG_REG_TYPE_ATTRD_FORMAT 28 #define XLOG_REG_TYPE_ATTR_NAME 29 #define XLOG_REG_TYPE_ATTR_VALUE 30 
-#define XLOG_REG_TYPE_MAX 30 - +#define XLOG_REG_TYPE_XMI_FORMAT 31 +#define XLOG_REG_TYPE_XMD_FORMAT 32 +#define XLOG_REG_TYPE_MAX 32 /* * Flags to log operation header @@ -243,6 +244,8 @@ typedef struct xfs_trans_header { #define XFS_LI_BUD 0x1245 #define XFS_LI_ATTRI 0x1246 /* attr set/remove intent*/ #define XFS_LI_ATTRD 0x1247 /* attr set/remove done */ +#define XFS_LI_XMI 0x1248 /* mapping exchange intent */ +#define XFS_LI_XMD 0x1249 /* mapping exchange done */ #define XFS_LI_TYPE_DESC \ { XFS_LI_EFI, "XFS_LI_EFI" }, \ @@ -260,7 +263,9 @@ typedef struct xfs_trans_header { { XFS_LI_BUI, "XFS_LI_BUI" }, \ { XFS_LI_BUD, "XFS_LI_BUD" }, \ { XFS_LI_ATTRI, "XFS_LI_ATTRI" }, \ - { XFS_LI_ATTRD, "XFS_LI_ATTRD" } + { XFS_LI_ATTRD, "XFS_LI_ATTRD" }, \ + { XFS_LI_XMI, "XFS_LI_XMI" }, \ + { XFS_LI_XMD, "XFS_LI_XMD" } /* * Inode Log Item Format definitions. @@ -878,6 +883,37 @@ struct xfs_bud_log_format { uint64_t bud_bui_id; /* id of corresponding bui */ }; +/* + * XMI/XMD (file mapping exchange) log format definitions + */ + +/* This is the structure used to lay out a mapping exchange log item. */ +struct xfs_xmi_log_format { + uint16_t xmi_type; /* xmi log item type */ + uint16_t xmi_size; /* size of this item */ + uint32_t __pad; /* must be zero */ + uint64_t xmi_id; /* xmi identifier */ + + uint64_t xmi_inode1; /* inumber of first file */ + uint64_t xmi_inode2; /* inumber of second file */ + uint64_t xmi_startoff1; /* block offset into file1 */ + uint64_t xmi_startoff2; /* block offset into file2 */ + uint64_t xmi_blockcount; /* number of blocks */ + uint64_t xmi_flags; /* XFS_EXCHMAPS_* */ + uint64_t xmi_isize1; /* intended file1 size */ + uint64_t xmi_isize2; /* intended file2 size */ +}; + +#define XFS_EXCHMAPS_LOGGED_FLAGS (0) + +/* This is the structure used to lay out a mapping exchange done log item.
*/ +struct xfs_xmd_log_format { + uint16_t xmd_type; /* xmd log item type */ + uint16_t xmd_size; /* size of this item */ + uint32_t __pad; + uint64_t xmd_xmi_id; /* id of corresponding xmi */ +}; + /* * Dquot Log format definitions. * diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h index 9fe7a9564bca..47b758b49cb3 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -75,6 +75,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops; extern const struct xlog_recover_item_ops xlog_cud_item_ops; extern const struct xlog_recover_item_ops xlog_attri_item_ops; extern const struct xlog_recover_item_ops xlog_attrd_item_ops; +extern const struct xlog_recover_item_ops xlog_xmi_item_ops; +extern const struct xlog_recover_item_ops xlog_xmd_item_ops; /* * Macros, structures, prototypes for internal log manager use. diff --git a/fs/xfs/xfs_exchmaps_item.c b/fs/xfs/xfs_exchmaps_item.c new file mode 100644 index 000000000000..65b0ade41b3d --- /dev/null +++ b/fs/xfs/xfs_exchmaps_item.c @@ -0,0 +1,235 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_bit.h" +#include "xfs_shared.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_inode.h" +#include "xfs_trans.h" +#include "xfs_trans_priv.h" +#include "xfs_exchmaps_item.h" +#include "xfs_log.h" +#include "xfs_bmap.h" +#include "xfs_icache.h" +#include "xfs_trans_space.h" +#include "xfs_error.h" +#include "xfs_log_priv.h" +#include "xfs_log_recover.h" + +struct kmem_cache *xfs_xmi_cache; +struct kmem_cache *xfs_xmd_cache; + +static const struct xfs_item_ops xfs_xmi_item_ops; + +static inline struct xfs_xmi_log_item *XMI_ITEM(struct xfs_log_item *lip) +{ + return container_of(lip, struct xfs_xmi_log_item, xmi_item); +} + +STATIC void +xfs_xmi_item_free( + struct xfs_xmi_log_item *xmi_lip) +{ + kvfree(xmi_lip->xmi_item.li_lv_shadow); + kmem_cache_free(xfs_xmi_cache, xmi_lip); +} + +/* + * Freeing the XMI requires that we remove it from the AIL if it has already + * been placed there. However, the XMI may not yet have been placed in the AIL + * when called by xfs_xmi_release() from XMD processing due to the ordering of + * committed vs unpin operations in bulk insert operations. Hence the reference + * count to ensure only the last caller frees the XMI. + */ +STATIC void +xfs_xmi_release( + struct xfs_xmi_log_item *xmi_lip) +{ + ASSERT(atomic_read(&xmi_lip->xmi_refcount) > 0); + if (atomic_dec_and_test(&xmi_lip->xmi_refcount)) { + xfs_trans_ail_delete(&xmi_lip->xmi_item, 0); + xfs_xmi_item_free(xmi_lip); + } +} + + +STATIC void +xfs_xmi_item_size( + struct xfs_log_item *lip, + int *nvecs, + int *nbytes) +{ + *nvecs += 1; + *nbytes += sizeof(struct xfs_xmi_log_format); +} + +/* + * This is called to fill in the vector of log iovecs for the given xmi log + * item. We use only 1 iovec, and we point that at the xmi_log_format structure + * embedded in the xmi item. 
+ */ +STATIC void +xfs_xmi_item_format( + struct xfs_log_item *lip, + struct xfs_log_vec *lv) +{ + struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip); + struct xfs_log_iovec *vecp = NULL; + + xmi_lip->xmi_format.xmi_type = XFS_LI_XMI; + xmi_lip->xmi_format.xmi_size = 1; + + xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMI_FORMAT, + &xmi_lip->xmi_format, + sizeof(struct xfs_xmi_log_format)); +} + +/* + * The unpin operation is the last place an XMI is manipulated in the log. It + * is either inserted in the AIL or aborted in the event of a log I/O error. In + * either case, the XMI transaction has been successfully committed to make it + * this far. Therefore, we expect whoever committed the XMI to either construct + * and commit the XMD or drop the XMD's reference in the event of error. Simply + * drop the log's XMI reference now that the log is done with it. + */ +STATIC void +xfs_xmi_item_unpin( + struct xfs_log_item *lip, + int remove) +{ + struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip); + + xfs_xmi_release(xmi_lip); +} + +/* + * The XMI has been either committed or aborted if the transaction has been + * cancelled. If the transaction was cancelled, an XMD isn't going to be + * constructed and thus we free the XMI here directly. + */ +STATIC void +xfs_xmi_item_release( + struct xfs_log_item *lip) +{ + xfs_xmi_release(XMI_ITEM(lip)); +} + +/* Allocate and initialize an xmi item. 
*/ +STATIC struct xfs_xmi_log_item * +xfs_xmi_init( + struct xfs_mount *mp) + +{ + struct xfs_xmi_log_item *xmi_lip; + + xmi_lip = kmem_cache_zalloc(xfs_xmi_cache, GFP_KERNEL | __GFP_NOFAIL); + + xfs_log_item_init(mp, &xmi_lip->xmi_item, XFS_LI_XMI, &xfs_xmi_item_ops); + xmi_lip->xmi_format.xmi_id = (uintptr_t)(void *)xmi_lip; + atomic_set(&xmi_lip->xmi_refcount, 2); + + return xmi_lip; +} + +static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip) +{ + return container_of(lip, struct xfs_xmd_log_item, xmd_item); +} + +STATIC bool +xfs_xmi_item_match( + struct xfs_log_item *lip, + uint64_t intent_id) +{ + return XMI_ITEM(lip)->xmi_format.xmi_id == intent_id; +} + +static const struct xfs_item_ops xfs_xmi_item_ops = { + .flags = XFS_ITEM_INTENT, + .iop_size = xfs_xmi_item_size, + .iop_format = xfs_xmi_item_format, + .iop_unpin = xfs_xmi_item_unpin, + .iop_release = xfs_xmi_item_release, + .iop_match = xfs_xmi_item_match, +}; + +/* + * This routine is called to create an in-core file mapping exchange item from + * the xmi format structure which was logged on disk. It allocates an in-core + * xmi, copies the exchange information from the format structure into it, and + * adds the xmi to the AIL with the given LSN. 
+ */ +STATIC int +xlog_recover_xmi_commit_pass2( + struct xlog *log, + struct list_head *buffer_list, + struct xlog_recover_item *item, + xfs_lsn_t lsn) +{ + struct xfs_mount *mp = log->l_mp; + struct xfs_xmi_log_item *xmi_lip; + struct xfs_xmi_log_format *xmi_formatp; + size_t len; + + len = sizeof(struct xfs_xmi_log_format); + if (item->ri_buf[0].i_len != len) { + XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp); + return -EFSCORRUPTED; + } + + xmi_formatp = item->ri_buf[0].i_addr; + if (xmi_formatp->__pad != 0) { + XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp); + return -EFSCORRUPTED; + } + + xmi_lip = xfs_xmi_init(mp); + memcpy(&xmi_lip->xmi_format, xmi_formatp, len); + + /* not implemented yet */ + return -EIO; +} + +const struct xlog_recover_item_ops xlog_xmi_item_ops = { + .item_type = XFS_LI_XMI, + .commit_pass2 = xlog_recover_xmi_commit_pass2, +}; + +/* + * This routine is called when an XMD format structure is found in a committed + * transaction in the log. Its purpose is to cancel the corresponding XMI if it + * was still in the log. To do this it searches the AIL for the XMI with an id + * equal to that in the XMD format structure. If we find it we drop the XMD + * reference, which removes the XMI from the AIL and frees it. 
+ */ +STATIC int +xlog_recover_xmd_commit_pass2( + struct xlog *log, + struct list_head *buffer_list, + struct xlog_recover_item *item, + xfs_lsn_t lsn) +{ + struct xfs_xmd_log_format *xmd_formatp; + + xmd_formatp = item->ri_buf[0].i_addr; + if (item->ri_buf[0].i_len != sizeof(struct xfs_xmd_log_format)) { + XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp); + return -EFSCORRUPTED; + } + + xlog_recover_release_intent(log, XFS_LI_XMI, xmd_formatp->xmd_xmi_id); + return 0; +} + +const struct xlog_recover_item_ops xlog_xmd_item_ops = { + .item_type = XFS_LI_XMD, + .commit_pass2 = xlog_recover_xmd_commit_pass2, +}; diff --git a/fs/xfs/xfs_exchmaps_item.h b/fs/xfs/xfs_exchmaps_item.h new file mode 100644 index 000000000000..ada1eb314e65 --- /dev/null +++ b/fs/xfs/xfs_exchmaps_item.h @@ -0,0 +1,59 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_EXCHMAPS_ITEM_H__ +#define __XFS_EXCHMAPS_ITEM_H__ + +/* + * The file mapping exchange intent item helps us exchange multiple file + * mappings between two inode forks. It does this by tracking the range of + * file block offsets that still need to be exchanged, and relogs as progress + * happens. + * + * *I items should be recorded in the *first* of a series of rolled + * transactions, and the *D items should be recorded in the same transaction + * that records the associated bmbt updates. + * + * Should the system crash after the commit of the first transaction but + * before the commit of the final transaction in a series, log recovery will + * use the redo information recorded by the intent items to replay the + * rest of the mapping exchanges. + */ + +/* kernel only XMI/XMD definitions */ + +struct xfs_mount; +struct kmem_cache; + +/* + * This is the incore file mapping exchange intent log item. It is used to log + * the fact that we are exchanging mappings between two files. 
It is used in + * conjunction with the incore file mapping exchange done log item described + * below. + * + * These log items follow the same rules as struct xfs_efi_log_item; see the + * comments about that structure (in xfs_extfree_item.h) for more details. + */ +struct xfs_xmi_log_item { + struct xfs_log_item xmi_item; + atomic_t xmi_refcount; + struct xfs_xmi_log_format xmi_format; +}; + +/* + * This is the incore file mapping exchange done log item. It is used to log + * the fact that an exchange mentioned in an earlier xmi item has been + * performed. + */ +struct xfs_xmd_log_item { + struct xfs_log_item xmd_item; + struct xfs_xmi_log_item *xmd_intent_log_item; + struct xfs_xmd_log_format xmd_format; +}; + +extern struct kmem_cache *xfs_xmi_cache; +extern struct kmem_cache *xfs_xmd_cache; + +#endif /* __XFS_EXCHMAPS_ITEM_H__ */ diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 41aec991433c..1e5ba95adf2c 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1789,6 +1789,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = { &xlog_bud_item_ops, &xlog_attri_item_ops, &xlog_attrd_item_ops, + &xlog_xmi_item_ops, + &xlog_xmd_item_ops, }; static const struct xlog_recover_item_ops * diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c index dbda72df3419..5c9ba974252d 100644 --- a/fs/xfs/xfs_super.c +++ b/fs/xfs/xfs_super.c @@ -43,6 +43,7 @@ #include "xfs_iunlink_item.h" #include "xfs_dahash_test.h" #include "xfs_rtbitmap.h" +#include "xfs_exchmaps_item.h" #include "scrub/stats.h" #include "scrub/rcbag_btree.h" @@ -2189,8 +2190,24 @@ xfs_init_caches(void) if (!xfs_iunlink_cache) goto out_destroy_attri_cache; + xfs_xmd_cache = kmem_cache_create("xfs_xmd_item", + sizeof(struct xfs_xmd_log_item), + 0, 0, NULL); + if (!xfs_xmd_cache) + goto out_destroy_iul_cache; + + xfs_xmi_cache = kmem_cache_create("xfs_xmi_item", + sizeof(struct xfs_xmi_log_item), + 0, 0, NULL); + if (!xfs_xmi_cache) + goto
out_destroy_xmd_cache; + return 0; + out_destroy_xmd_cache: + kmem_cache_destroy(xfs_xmd_cache); + out_destroy_iul_cache: + kmem_cache_destroy(xfs_iunlink_cache); out_destroy_attri_cache: kmem_cache_destroy(xfs_attri_cache); out_destroy_attrd_cache: @@ -2247,6 +2264,8 @@ xfs_destroy_caches(void) * destroy caches. */ rcu_barrier(); + kmem_cache_destroy(xfs_xmd_cache); + kmem_cache_destroy(xfs_xmi_cache); kmem_cache_destroy(xfs_iunlink_cache); kmem_cache_destroy(xfs_attri_cache); kmem_cache_destroy(xfs_attrd_cache); ^ permalink raw reply related [flat|nested] 100+ messages in thread
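As a reading aid for the patch above, the XMI reference counting can be modelled in a few lines of userspace C. This is a rough sketch, not kernel code: the demo_* names are hypothetical, a plain int stands in for the kernel's atomic_t, and the AIL removal done by xfs_xmi_release() is omitted. It assumes only what the patch shows: xfs_xmi_init() starts the count at 2, and the item is freed when the last reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct xfs_xmi_log_item. */
struct demo_xmi {
	int	refcount;	/* the kernel uses atomic_t here */
	bool	*freed;		/* set once the last reference is dropped */
};

/*
 * Like xfs_xmi_init(): start with two references, one owned by the log
 * (dropped at unpin time) and one dropped by the XMD or on cancel.
 */
static struct demo_xmi *demo_xmi_init(bool *freed)
{
	struct demo_xmi *xmi = calloc(1, sizeof(*xmi));

	if (!xmi)
		return NULL;
	xmi->refcount = 2;
	xmi->freed = freed;
	*freed = false;
	return xmi;
}

/* Like xfs_xmi_release(): only the last caller frees the item. */
static void demo_xmi_release(struct demo_xmi *xmi)
{
	assert(xmi->refcount > 0);
	if (--xmi->refcount == 0) {
		*xmi->freed = true;
		free(xmi);
	}
}
```

Calling demo_xmi_release() twice, in either order, frees the item exactly once. That is the property the comment above xfs_xmi_release() relies on when unpin and XMD processing race.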
* [PATCH 05/15] xfs: create deferred log items for file mapping exchanges 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (3 preceding siblings ...) 2024-04-15 23:41 ` [PATCH 04/15] xfs: introduce a file mapping exchange log intent item Darrick J. Wong @ 2024-04-15 23:42 ` Darrick J. Wong 2024-04-15 23:42 ` [PATCH 06/15] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong ` (9 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:42 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we've created the skeleton of a log intent item to track and restart file mapping exchange operations, add the upper level logic to commit intent items and turn them into concrete work recorded in the log. This builds on the existing bmap update intent items that have been around for a while now. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_defer.c | 6 fs/xfs/libxfs/xfs_defer.h | 2 fs/xfs/libxfs/xfs_exchmaps.c | 1045 +++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_exchmaps.h | 118 ++++ fs/xfs/libxfs/xfs_log_format.h | 24 + fs/xfs/libxfs/xfs_trans_space.h | 4 fs/xfs/xfs_exchmaps_item.c | 368 ++++++++++++++ fs/xfs/xfs_exchmaps_item.h | 5 fs/xfs/xfs_exchrange.c | 49 ++ fs/xfs/xfs_exchrange.h | 8 fs/xfs/xfs_trace.c | 1 fs/xfs/xfs_trace.h | 217 ++++++++ 13 files changed, 1844 insertions(+), 4 deletions(-) create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 68ca9726e7b7..b547a3dc03f8 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -34,6 +34,7 @@ xfs-y += $(addprefix libxfs/, \ xfs_dir2_node.o \ xfs_dir2_sf.o \ xfs_dquot_buf.o \ + xfs_exchmaps.o \ xfs_ialloc.o \ xfs_ialloc_btree.o \ xfs_iext_tree.o \ diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index c13276095cc0..061cc01245a9 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -27,6 +27,7 @@ #include "xfs_da_btree.h" #include "xfs_attr.h" #include "xfs_trans_priv.h" +#include "xfs_exchmaps.h" static struct kmem_cache *xfs_defer_pending_cache; @@ -1176,6 +1177,10 @@ xfs_defer_init_item_caches(void) error = xfs_attr_intent_init_cache(); if (error) goto err; + error = xfs_exchmaps_intent_init_cache(); + if (error) + goto err; + return 0; err: xfs_defer_destroy_item_caches(); @@ -1186,6 +1191,7 @@ xfs_defer_init_item_caches(void) void xfs_defer_destroy_item_caches(void) { + xfs_exchmaps_intent_destroy_cache(); xfs_attr_intent_destroy_cache(); xfs_extfree_intent_destroy_cache(); xfs_bmap_intent_destroy_cache(); diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 18a9fb92dde8..81cca60d70a3 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -72,7 +72,7 @@ 
extern const struct xfs_defer_op_type xfs_rmap_update_defer_type; extern const struct xfs_defer_op_type xfs_extent_free_defer_type; extern const struct xfs_defer_op_type xfs_agfl_free_defer_type; extern const struct xfs_defer_op_type xfs_attr_defer_type; - +extern const struct xfs_defer_op_type xfs_exchmaps_defer_type; /* * Deferred operation item relogging limits. diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c new file mode 100644 index 000000000000..b8e9450cc175 --- /dev/null +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -0,0 +1,1045 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_inode.h" +#include "xfs_trans.h" +#include "xfs_bmap.h" +#include "xfs_icache.h" +#include "xfs_quota.h" +#include "xfs_exchmaps.h" +#include "xfs_trace.h" +#include "xfs_bmap_btree.h" +#include "xfs_trans_space.h" +#include "xfs_error.h" +#include "xfs_errortag.h" +#include "xfs_health.h" +#include "xfs_exchmaps_item.h" + +struct kmem_cache *xfs_exchmaps_intent_cache; + +/* bmbt mappings adjacent to a pair of records. */ +struct xfs_exchmaps_adjacent { + struct xfs_bmbt_irec left1; + struct xfs_bmbt_irec right1; + struct xfs_bmbt_irec left2; + struct xfs_bmbt_irec right2; +}; + +#define ADJACENT_INIT { \ + .left1 = { .br_startblock = HOLESTARTBLOCK }, \ + .right1 = { .br_startblock = HOLESTARTBLOCK }, \ + .left2 = { .br_startblock = HOLESTARTBLOCK }, \ + .right2 = { .br_startblock = HOLESTARTBLOCK }, \ +} + +/* Information to reset reflink flag / CoW fork state after an exchange. */ + +/* + * If the reflink flag is set on either inode, make sure it has an incore CoW + * fork, since all reflink inodes must have them. 
If there's a CoW fork and it + * has mappings in it, make sure the inodes are tagged appropriately so that + * speculative preallocations can be GC'd if we run low of space. + */ +static inline void +xfs_exchmaps_ensure_cowfork( + struct xfs_inode *ip) +{ + struct xfs_ifork *cfork; + + if (xfs_is_reflink_inode(ip)) + xfs_ifork_init_cow(ip); + + cfork = xfs_ifork_ptr(ip, XFS_COW_FORK); + if (!cfork) + return; + if (cfork->if_bytes > 0) + xfs_inode_set_cowblocks_tag(ip); + else + xfs_inode_clear_cowblocks_tag(ip); +} + +/* + * Adjust the on-disk inode size upwards if needed so that we never add + * mappings into the file past EOF. This is crucial so that log recovery won't + * get confused by the sudden appearance of post-eof mappings. + */ +STATIC void +xfs_exchmaps_update_size( + struct xfs_trans *tp, + struct xfs_inode *ip, + struct xfs_bmbt_irec *imap, + xfs_fsize_t new_isize) +{ + struct xfs_mount *mp = tp->t_mountp; + xfs_fsize_t len; + + if (new_isize < 0) + return; + + len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount), + new_isize); + + if (len <= ip->i_disk_size) + return; + + trace_xfs_exchmaps_update_inode_size(ip, len); + + ip->i_disk_size = len; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); +} + +/* Advance the incore state tracking after exchanging a mapping. */ +static inline void +xmi_advance( + struct xfs_exchmaps_intent *xmi, + const struct xfs_bmbt_irec *irec) +{ + xmi->xmi_startoff1 += irec->br_blockcount; + xmi->xmi_startoff2 += irec->br_blockcount; + xmi->xmi_blockcount -= irec->br_blockcount; +} + +/* Do we still have more mappings to exchange? */ +static inline bool +xmi_has_more_exchange_work(const struct xfs_exchmaps_intent *xmi) +{ + return xmi->xmi_blockcount > 0; +} + +/* Do we have post-operation cleanups to perform? 
*/ +static inline bool +xmi_has_postop_work(const struct xfs_exchmaps_intent *xmi) +{ + return xmi->xmi_flags & (XFS_EXCHMAPS_CLEAR_INO1_REFLINK | + XFS_EXCHMAPS_CLEAR_INO2_REFLINK); +} + +/* Check all mappings to make sure we can actually exchange them. */ +int +xfs_exchmaps_check_forks( + struct xfs_mount *mp, + const struct xfs_exchmaps_req *req) +{ + struct xfs_ifork *ifp1, *ifp2; + int whichfork = xfs_exchmaps_reqfork(req); + + /* No fork? */ + ifp1 = xfs_ifork_ptr(req->ip1, whichfork); + ifp2 = xfs_ifork_ptr(req->ip2, whichfork); + if (!ifp1 || !ifp2) + return -EINVAL; + + /* We don't know how to exchange local format forks. */ + if (ifp1->if_format == XFS_DINODE_FMT_LOCAL || + ifp2->if_format == XFS_DINODE_FMT_LOCAL) + return -EINVAL; + + /* We don't support realtime data forks yet. */ + if (!XFS_IS_REALTIME_INODE(req->ip1)) + return 0; + if (whichfork == XFS_ATTR_FORK) + return 0; + return -EINVAL; +} + +#ifdef CONFIG_XFS_QUOTA +/* Log the actual updates to the quota accounting. */ +static inline void +xfs_exchmaps_update_quota( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi, + struct xfs_bmbt_irec *irec1, + struct xfs_bmbt_irec *irec2) +{ + int64_t ip1_delta = 0, ip2_delta = 0; + unsigned int qflag; + + qflag = XFS_IS_REALTIME_INODE(xmi->xmi_ip1) ? XFS_TRANS_DQ_RTBCOUNT : + XFS_TRANS_DQ_BCOUNT; + + if (xfs_bmap_is_real_extent(irec1)) { + ip1_delta -= irec1->br_blockcount; + ip2_delta += irec1->br_blockcount; + } + + if (xfs_bmap_is_real_extent(irec2)) { + ip1_delta += irec2->br_blockcount; + ip2_delta -= irec2->br_blockcount; + } + + xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip1, qflag, ip1_delta); + xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip2, qflag, ip2_delta); +} +#else +# define xfs_exchmaps_update_quota(tp, xmi, irec1, irec2) ((void)0) +#endif + +/* Decide if we want to skip this mapping from file1. 
*/ +static inline bool +xfs_exchmaps_can_skip_mapping( + struct xfs_exchmaps_intent *xmi, + struct xfs_bmbt_irec *irec) +{ + /* Do not skip this mapping if the caller did not tell us to. */ + if (!(xmi->xmi_flags & XFS_EXCHMAPS_INO1_WRITTEN)) + return false; + + /* Do not skip mapped, written mappings. */ + if (xfs_bmap_is_written_extent(irec)) + return false; + + /* + * The mapping is unwritten or a hole. It cannot be a delalloc + * reservation because we already excluded those. It cannot be an + * unwritten mapping with dirty page cache because we flushed the page + * cache. We don't support realtime files yet, so we needn't (yet) + * deal with them. + */ + return true; +} + +/* + * Walk forward through the file ranges in @xmi until we find two different + * mappings to exchange. If there is work to do, return the mappings; + * otherwise we've reached the end of the range and xmi_blockcount will be + * zero. + * + * If the walk skips over a pair of mappings to the same storage, save them as + * the left records in @adj (if provided) so that the simulation phase can + * avoid an extra lookup. + */ +static int +xfs_exchmaps_find_mappings( + struct xfs_exchmaps_intent *xmi, + struct xfs_bmbt_irec *irec1, + struct xfs_bmbt_irec *irec2, + struct xfs_exchmaps_adjacent *adj) +{ + int nimaps; + int bmap_flags; + int error; + + bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_whichfork(xmi)); + + for (; xmi_has_more_exchange_work(xmi); xmi_advance(xmi, irec1)) { + /* Read mapping from the first file */ + nimaps = 1; + error = xfs_bmapi_read(xmi->xmi_ip1, xmi->xmi_startoff1, + xmi->xmi_blockcount, irec1, &nimaps, + bmap_flags); + if (error) + return error; + if (nimaps != 1 || + irec1->br_startblock == DELAYSTARTBLOCK || + irec1->br_startoff != xmi->xmi_startoff1) { + /* + * We should never get no mapping or a delalloc mapping + * or something that doesn't match what we asked for, + * since the caller flushed both inodes and we hold the + * ILOCKs for both inodes. 
+ */ + ASSERT(0); + return -EINVAL; + } + + if (xfs_exchmaps_can_skip_mapping(xmi, irec1)) { + trace_xfs_exchmaps_mapping1_skip(xmi->xmi_ip1, irec1); + continue; + } + + /* Read mapping from the second file */ + nimaps = 1; + error = xfs_bmapi_read(xmi->xmi_ip2, xmi->xmi_startoff2, + irec1->br_blockcount, irec2, &nimaps, + bmap_flags); + if (error) + return error; + if (nimaps != 1 || + irec2->br_startblock == DELAYSTARTBLOCK || + irec2->br_startoff != xmi->xmi_startoff2) { + /* + * We should never get no mapping or a delalloc mapping + * or something that doesn't match what we asked for, + * since the caller flushed both inodes and we hold the + * ILOCKs for both inodes. + */ + ASSERT(0); + return -EINVAL; + } + + /* + * We can only exchange as many blocks as the smaller of the + * two mapping maps. + */ + irec1->br_blockcount = min(irec1->br_blockcount, + irec2->br_blockcount); + + trace_xfs_exchmaps_mapping1(xmi->xmi_ip1, irec1); + trace_xfs_exchmaps_mapping2(xmi->xmi_ip2, irec2); + + /* We found something to exchange, so return it. */ + if (irec1->br_startblock != irec2->br_startblock) + return 0; + + /* + * Two mappings pointing to the same physical block must not + * have different states; that's filesystem corruption. Move + * on to the next mapping if they're both holes or both point + * to the same physical space extent. + */ + if (irec1->br_state != irec2->br_state) { + xfs_bmap_mark_sick(xmi->xmi_ip1, + xfs_exchmaps_whichfork(xmi)); + xfs_bmap_mark_sick(xmi->xmi_ip2, + xfs_exchmaps_whichfork(xmi)); + return -EFSCORRUPTED; + } + + /* + * Save the mappings if we're estimating work and skipping + * these identical mappings. + */ + if (adj) { + memcpy(&adj->left1, irec1, sizeof(*irec1)); + memcpy(&adj->left2, irec2, sizeof(*irec2)); + } + } + + return 0; +} + +/* Exchange these two mappings. 
*/ +static void +xfs_exchmaps_one_step( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi, + struct xfs_bmbt_irec *irec1, + struct xfs_bmbt_irec *irec2) +{ + int whichfork = xfs_exchmaps_whichfork(xmi); + + xfs_exchmaps_update_quota(tp, xmi, irec1, irec2); + + /* Remove both mappings. */ + xfs_bmap_unmap_extent(tp, xmi->xmi_ip1, whichfork, irec1); + xfs_bmap_unmap_extent(tp, xmi->xmi_ip2, whichfork, irec2); + + /* + * Re-add both mappings. We exchange the file offsets between the two + * maps and add the opposite map, which has the effect of filling the + * logical offsets we just unmapped, but with the physical mapping + * information exchanged. + */ + swap(irec1->br_startoff, irec2->br_startoff); + xfs_bmap_map_extent(tp, xmi->xmi_ip1, whichfork, irec2); + xfs_bmap_map_extent(tp, xmi->xmi_ip2, whichfork, irec1); + + /* Make sure we're not adding mappings past EOF. */ + if (whichfork == XFS_DATA_FORK) { + xfs_exchmaps_update_size(tp, xmi->xmi_ip1, irec2, + xmi->xmi_isize1); + xfs_exchmaps_update_size(tp, xmi->xmi_ip2, irec1, + xmi->xmi_isize2); + } + + /* + * Advance our cursor and exit. The caller (either defer ops or log + * recovery) will log the XMD item, and if *blockcount is nonzero, it + * will log a new XMI item for the remainder and call us back. + */ + xmi_advance(xmi, irec1); +} + +/* Clear the reflink flag after an exchange. */ +static inline void +xfs_exchmaps_clear_reflink( + struct xfs_trans *tp, + struct xfs_inode *ip) +{ + trace_xfs_reflink_unset_inode_flag(ip); + + ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); +} + +/* Finish whatever work might come after an exchange operation.
*/ +static int +xfs_exchmaps_do_postop_work( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO1_REFLINK) { + xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip1); + xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO1_REFLINK; + } + + if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO2_REFLINK) { + xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip2); + xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO2_REFLINK; + } + + return 0; +} + +/* Finish one step in a mapping exchange operation, possibly relogging. */ +int +xfs_exchmaps_finish_one( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + struct xfs_bmbt_irec irec1, irec2; + int error; + + if (xmi_has_more_exchange_work(xmi)) { + /* + * If the operation state says that some range of the files + * have not yet been exchanged, look for mappings in that range + * to exchange. If we find some mappings, exchange them. + */ + error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, NULL); + if (error) + return error; + + if (xmi_has_more_exchange_work(xmi)) + xfs_exchmaps_one_step(tp, xmi, &irec1, &irec2); + + /* + * If the caller asked us to exchange the file sizes after the + * exchange and either we just exchanged the last mappings in + * the range or we didn't find anything to exchange, update the + * ondisk file sizes. + */ + if ((xmi->xmi_flags & XFS_EXCHMAPS_SET_SIZES) && + !xmi_has_more_exchange_work(xmi)) { + xmi->xmi_ip1->i_disk_size = xmi->xmi_isize1; + xmi->xmi_ip2->i_disk_size = xmi->xmi_isize2; + + xfs_trans_log_inode(tp, xmi->xmi_ip1, XFS_ILOG_CORE); + xfs_trans_log_inode(tp, xmi->xmi_ip2, XFS_ILOG_CORE); + } + } else if (xmi_has_postop_work(xmi)) { + /* + * Now that we're finished with the exchange operation, + * complete the post-op cleanup work. + */ + error = xfs_exchmaps_do_postop_work(tp, xmi); + if (error) + return error; + } + + /* If we still have work to do, ask for a new transaction. 
*/ + if (xmi_has_more_exchange_work(xmi) || xmi_has_postop_work(xmi)) { + trace_xfs_exchmaps_defer(tp->t_mountp, xmi); + return -EAGAIN; + } + + /* + * If we reach here, we've finished all the exchange work and the post + * operation work. The last thing we need to do before returning to + * the caller is to make sure that COW forks are set up correctly. + */ + if (!(xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)) { + xfs_exchmaps_ensure_cowfork(xmi->xmi_ip1); + xfs_exchmaps_ensure_cowfork(xmi->xmi_ip2); + } + + return 0; +} + +/* + * Compute the amount of bmbt blocks we should reserve for each file. In the + * worst case, each exchange will fill a hole with a new mapping, which could + * result in a btree split every time we add a new leaf block. + */ +static inline uint64_t +xfs_exchmaps_bmbt_blocks( + struct xfs_mount *mp, + const struct xfs_exchmaps_req *req) +{ + return howmany_64(req->nr_exchanges, + XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) * + XFS_EXTENTADD_SPACE_RES(mp, xfs_exchmaps_reqfork(req)); +} + +/* Compute the space we should reserve for the rmap btree expansions. */ +static inline uint64_t +xfs_exchmaps_rmapbt_blocks( + struct xfs_mount *mp, + const struct xfs_exchmaps_req *req) +{ + if (!xfs_has_rmapbt(mp)) + return 0; + if (XFS_IS_REALTIME_INODE(req->ip1)) + return 0; + + return howmany_64(req->nr_exchanges, + XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) * + XFS_RMAPADD_SPACE_RES(mp); +} + +/* Estimate the bmbt and rmapbt overhead required to exchange mappings. */ +static int +xfs_exchmaps_estimate_overhead( + struct xfs_exchmaps_req *req) +{ + struct xfs_mount *mp = req->ip1->i_mount; + xfs_filblks_t bmbt_blocks; + xfs_filblks_t rmapbt_blocks; + xfs_filblks_t resblks = req->resblks; + + /* + * Compute the number of bmbt and rmapbt blocks we might need to handle + * the estimated number of exchanges. 
+ */ + bmbt_blocks = xfs_exchmaps_bmbt_blocks(mp, req); + rmapbt_blocks = xfs_exchmaps_rmapbt_blocks(mp, req); + + trace_xfs_exchmaps_overhead(mp, bmbt_blocks, rmapbt_blocks); + + /* Make sure the change in file block count doesn't overflow. */ + if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount)) + return -EFBIG; + if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount)) + return -EFBIG; + + /* + * Add together the number of blocks we need to handle btree growth, + * then add it to the number of blocks we need to reserve for this + * transaction. + */ + if (check_add_overflow(resblks, bmbt_blocks, &resblks)) + return -ENOSPC; + if (check_add_overflow(resblks, bmbt_blocks, &resblks)) + return -ENOSPC; + if (check_add_overflow(resblks, rmapbt_blocks, &resblks)) + return -ENOSPC; + if (check_add_overflow(resblks, rmapbt_blocks, &resblks)) + return -ENOSPC; + + /* Can't actually reserve more than UINT_MAX blocks. */ + if (resblks > UINT_MAX) + return -ENOSPC; + + req->resblks = resblks; + trace_xfs_exchmaps_final_estimate(req); + return 0; +} + +/* Decide if we can merge two real mappings. */ +static inline bool +xmi_can_merge( + const struct xfs_bmbt_irec *b1, + const struct xfs_bmbt_irec *b2) +{ + /* Don't merge holes. */ + if (b1->br_startblock == HOLESTARTBLOCK || + b2->br_startblock == HOLESTARTBLOCK) + return false; + + /* We only merge real mappings. */ + if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2)) + return false; + + if (b1->br_startoff + b1->br_blockcount == b2->br_startoff && + b1->br_startblock + b1->br_blockcount == b2->br_startblock && + b1->br_state == b2->br_state && + b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN) + return true; + + return false; +} + +/* + * Decide if we can merge three mappings. Caller must ensure all three + * mappings are not holes or delalloc reservations. 
+ */ +static inline bool +xmi_can_merge_all( + const struct xfs_bmbt_irec *l, + const struct xfs_bmbt_irec *m, + const struct xfs_bmbt_irec *r) +{ + xfs_filblks_t new_len; + + new_len = l->br_blockcount + m->br_blockcount + r->br_blockcount; + return new_len <= XFS_MAX_BMBT_EXTLEN; +} + +#define CLEFT_CONTIG 0x01 +#define CRIGHT_CONTIG 0x02 +#define CHOLE 0x04 +#define CBOTH_CONTIG (CLEFT_CONTIG | CRIGHT_CONTIG) + +#define NLEFT_CONTIG 0x10 +#define NRIGHT_CONTIG 0x20 +#define NHOLE 0x40 +#define NBOTH_CONTIG (NLEFT_CONTIG | NRIGHT_CONTIG) + +/* Estimate the effect of a single exchange on mapping count. */ +static inline int +xmi_delta_nextents_step( + struct xfs_mount *mp, + const struct xfs_bmbt_irec *left, + const struct xfs_bmbt_irec *curr, + const struct xfs_bmbt_irec *new, + const struct xfs_bmbt_irec *right) +{ + bool lhole, rhole, chole, nhole; + unsigned int state = 0; + int ret = 0; + + lhole = left->br_startblock == HOLESTARTBLOCK; + rhole = right->br_startblock == HOLESTARTBLOCK; + chole = curr->br_startblock == HOLESTARTBLOCK; + nhole = new->br_startblock == HOLESTARTBLOCK; + + if (chole) + state |= CHOLE; + if (!lhole && !chole && xmi_can_merge(left, curr)) + state |= CLEFT_CONTIG; + if (!rhole && !chole && xmi_can_merge(curr, right)) + state |= CRIGHT_CONTIG; + if ((state & CBOTH_CONTIG) == CBOTH_CONTIG && + !xmi_can_merge_all(left, curr, right)) + state &= ~CRIGHT_CONTIG; + + if (nhole) + state |= NHOLE; + if (!lhole && !nhole && xmi_can_merge(left, new)) + state |= NLEFT_CONTIG; + if (!rhole && !nhole && xmi_can_merge(new, right)) + state |= NRIGHT_CONTIG; + if ((state & NBOTH_CONTIG) == NBOTH_CONTIG && + !xmi_can_merge_all(left, new, right)) + state &= ~NRIGHT_CONTIG; + + switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) { + case CLEFT_CONTIG | CRIGHT_CONTIG: + /* + * left/curr/right are the same mapping, so deleting curr + * causes 2 new mappings to be created. 
+ */ + ret += 2; + break; + case 0: + /* + * curr is not contiguous with any mapping, so we remove curr + * completely + */ + ret--; + break; + case CHOLE: + /* hole, do nothing */ + break; + case CLEFT_CONTIG: + case CRIGHT_CONTIG: + /* trim either left or right, no change */ + break; + } + + switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) { + case NLEFT_CONTIG | NRIGHT_CONTIG: + /* + * left/new/right will become the same mapping, so adding + * new causes the deletion of right. + */ + ret--; + break; + case 0: + /* new is not contiguous with any mapping */ + ret++; + break; + case NHOLE: + /* hole, do nothing. */ + break; + case NLEFT_CONTIG: + case NRIGHT_CONTIG: + /* new is absorbed into left or right, no change */ + break; + } + + trace_xfs_exchmaps_delta_nextents_step(mp, left, curr, new, right, ret, + state); + return ret; +} + +/* Make sure we don't overflow the extent (mapping) counters. */ +static inline int +xmi_ensure_delta_nextents( + struct xfs_exchmaps_req *req, + struct xfs_inode *ip, + int64_t delta) +{ + struct xfs_mount *mp = ip->i_mount; + int whichfork = xfs_exchmaps_reqfork(req); + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, whichfork); + uint64_t new_nextents; + xfs_extnum_t max_nextents; + + if (delta < 0) + return 0; + + /* + * It's always an error if the delta causes integer overflow. delta + * needs an explicit cast here to avoid warnings about implicit casts + * coded into the overflow check. + */ + if (check_add_overflow(ifp->if_nextents, (uint64_t)delta, + &new_nextents)) + return -EFBIG; + + if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS) && + new_nextents > 10) + return -EFBIG; + + /* + * We always promote both inodes to have large extent counts if the + * superblock feature is enabled, so we only need to check against the + * theoretical maximum. 
+ */ + max_nextents = xfs_iext_max_nextents(xfs_has_large_extent_counts(mp), + whichfork); + if (new_nextents > max_nextents) + return -EFBIG; + + return 0; +} + +/* Find the next mapping after irec. */ +static inline int +xmi_next( + struct xfs_inode *ip, + int bmap_flags, + const struct xfs_bmbt_irec *irec, + struct xfs_bmbt_irec *nrec) +{ + xfs_fileoff_t off; + xfs_filblks_t blockcount; + int nimaps = 1; + int error; + + off = irec->br_startoff + irec->br_blockcount; + blockcount = XFS_MAX_FILEOFF - off; + error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags); + if (error) + return error; + if (nrec->br_startblock == DELAYSTARTBLOCK || + nrec->br_startoff != off) { + /* + * If we don't get the mapping we want, return a zero-length + * mapping, which our estimator function will pretend is a hole. + * We shouldn't get delalloc reservations. + */ + nrec->br_startblock = HOLESTARTBLOCK; + } + + return 0; +} + +int __init +xfs_exchmaps_intent_init_cache(void) +{ + xfs_exchmaps_intent_cache = kmem_cache_create("xfs_exchmaps_intent", + sizeof(struct xfs_exchmaps_intent), + 0, 0, NULL); + + return xfs_exchmaps_intent_cache != NULL ? 0 : -ENOMEM; +} + +void +xfs_exchmaps_intent_destroy_cache(void) +{ + kmem_cache_destroy(xfs_exchmaps_intent_cache); + xfs_exchmaps_intent_cache = NULL; +} + +/* + * Decide if we will exchange the reflink flags between the two files after the + * exchange. The only time we want to do this is if we're exchanging all + * mappings under EOF and the inode reflink flags have different states. 
+ */ +static inline bool +xmi_can_exchange_reflink_flags( + const struct xfs_exchmaps_req *req, + unsigned int reflink_state) +{ + struct xfs_mount *mp = req->ip1->i_mount; + + if (hweight32(reflink_state) != 1) + return false; + if (req->startoff1 != 0 || req->startoff2 != 0) + return false; + if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size)) + return false; + if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size)) + return false; + return true; +} + + +/* Allocate and initialize a new incore intent item from a request. */ +struct xfs_exchmaps_intent * +xfs_exchmaps_init_intent( + const struct xfs_exchmaps_req *req) +{ + struct xfs_exchmaps_intent *xmi; + unsigned int rs = 0; + + xmi = kmem_cache_zalloc(xfs_exchmaps_intent_cache, + GFP_NOFS | __GFP_NOFAIL); + INIT_LIST_HEAD(&xmi->xmi_list); + xmi->xmi_ip1 = req->ip1; + xmi->xmi_ip2 = req->ip2; + xmi->xmi_startoff1 = req->startoff1; + xmi->xmi_startoff2 = req->startoff2; + xmi->xmi_blockcount = req->blockcount; + xmi->xmi_isize1 = xmi->xmi_isize2 = -1; + xmi->xmi_flags = req->flags & XFS_EXCHMAPS_PARAMS; + + if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK) + return xmi; + + if (req->flags & XFS_EXCHMAPS_SET_SIZES) { + xmi->xmi_flags |= XFS_EXCHMAPS_SET_SIZES; + xmi->xmi_isize1 = req->ip2->i_disk_size; + xmi->xmi_isize2 = req->ip1->i_disk_size; + } + + /* Record the state of each inode's reflink flag before the op. */ + if (xfs_is_reflink_inode(req->ip1)) + rs |= 1; + if (xfs_is_reflink_inode(req->ip2)) + rs |= 2; + + /* + * Figure out if we're clearing the reflink flags (which effectively + * exchanges them) after the operation. + */ + if (xmi_can_exchange_reflink_flags(req, rs)) { + if (rs & 1) + xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO1_REFLINK; + if (rs & 2) + xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK; + } + + return xmi; +} + +/* + * Estimate the number of exchange operations and the number of file blocks + * in each file that will be affected by the exchange operation. 
+ */ +int +xfs_exchmaps_estimate( + struct xfs_exchmaps_req *req) +{ + struct xfs_exchmaps_intent *xmi; + struct xfs_bmbt_irec irec1, irec2; + struct xfs_exchmaps_adjacent adj = ADJACENT_INIT; + xfs_filblks_t ip1_blocks = 0, ip2_blocks = 0; + int64_t d_nexts1, d_nexts2; + int bmap_flags; + int error; + + ASSERT(!(req->flags & ~XFS_EXCHMAPS_PARAMS)); + + bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_reqfork(req)); + xmi = xfs_exchmaps_init_intent(req); + + /* + * To guard against the possibility of overflowing the extent counters, + * we have to estimate an upper bound on the potential increase in that + * counter. We can split the mapping at each end of the range, and for + * each step of the exchange we can split the mapping that we're + * working on if the mappings do not align. + */ + d_nexts1 = d_nexts2 = 3; + + while (xmi_has_more_exchange_work(xmi)) { + /* + * Walk through the file ranges until we find something to + * exchange. Because we're simulating the exchange, pass in + * adj to capture skipped mappings for correct estimation of + * bmbt record merges. + */ + error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, &adj); + if (error) + goto out_free; + if (!xmi_has_more_exchange_work(xmi)) + break; + + /* Update accounting. */ + if (xfs_bmap_is_real_extent(&irec1)) + ip1_blocks += irec1.br_blockcount; + if (xfs_bmap_is_real_extent(&irec2)) + ip2_blocks += irec2.br_blockcount; + req->nr_exchanges++; + + /* Read the next mappings from both files. */ + error = xmi_next(req->ip1, bmap_flags, &irec1, &adj.right1); + if (error) + goto out_free; + + error = xmi_next(req->ip2, bmap_flags, &irec2, &adj.right2); + if (error) + goto out_free; + + /* Update extent count deltas. */ + d_nexts1 += xmi_delta_nextents_step(req->ip1->i_mount, + &adj.left1, &irec1, &irec2, &adj.right1); + + d_nexts2 += xmi_delta_nextents_step(req->ip1->i_mount, + &adj.left2, &irec2, &irec1, &adj.right2); + + /* Now pretend we exchanged the mappings. 
*/ + if (xmi_can_merge(&adj.left2, &irec1)) + adj.left2.br_blockcount += irec1.br_blockcount; + else + memcpy(&adj.left2, &irec1, sizeof(irec1)); + + if (xmi_can_merge(&adj.left1, &irec2)) + adj.left1.br_blockcount += irec2.br_blockcount; + else + memcpy(&adj.left1, &irec2, sizeof(irec2)); + + xmi_advance(xmi, &irec1); + } + + /* Account for the blocks that are being exchanged. */ + if (XFS_IS_REALTIME_INODE(req->ip1) && + xfs_exchmaps_reqfork(req) == XFS_DATA_FORK) { + req->ip1_rtbcount = ip1_blocks; + req->ip2_rtbcount = ip2_blocks; + } else { + req->ip1_bcount = ip1_blocks; + req->ip2_bcount = ip2_blocks; + } + + /* + * Make sure that both forks have enough slack left in their extent + * counters that the exchange operation will not overflow. + */ + trace_xfs_exchmaps_delta_nextents(req, d_nexts1, d_nexts2); + if (req->ip1 == req->ip2) { + error = xmi_ensure_delta_nextents(req, req->ip1, + d_nexts1 + d_nexts2); + } else { + error = xmi_ensure_delta_nextents(req, req->ip1, d_nexts1); + if (error) + goto out_free; + error = xmi_ensure_delta_nextents(req, req->ip2, d_nexts2); + } + if (error) + goto out_free; + + trace_xfs_exchmaps_initial_estimate(req); + error = xfs_exchmaps_estimate_overhead(req); +out_free: + kmem_cache_free(xfs_exchmaps_intent_cache, xmi); + return error; +} + +/* Set the reflink flag before an operation. */ +static inline void +xfs_exchmaps_set_reflink( + struct xfs_trans *tp, + struct xfs_inode *ip) +{ + trace_xfs_reflink_set_inode_flag(ip); + + ip->i_diflags2 |= XFS_DIFLAG2_REFLINK; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); +} + +/* + * If either file has shared blocks and we're exchanging data forks, we must + * flag the other file as having shared blocks so that we get the shared-block + * rmap functions if we need to fix up the rmaps. 
+ */ +void +xfs_exchmaps_ensure_reflink( + struct xfs_trans *tp, + const struct xfs_exchmaps_intent *xmi) +{ + unsigned int rs = 0; + + if (xfs_is_reflink_inode(xmi->xmi_ip1)) + rs |= 1; + if (xfs_is_reflink_inode(xmi->xmi_ip2)) + rs |= 2; + + if ((rs & 1) && !xfs_is_reflink_inode(xmi->xmi_ip2)) + xfs_exchmaps_set_reflink(tp, xmi->xmi_ip2); + + if ((rs & 2) && !xfs_is_reflink_inode(xmi->xmi_ip1)) + xfs_exchmaps_set_reflink(tp, xmi->xmi_ip1); +} + +/* Set the large extent count flag before an operation if needed. */ +static inline void +xfs_exchmaps_ensure_large_extent_counts( + struct xfs_trans *tp, + struct xfs_inode *ip) +{ + if (xfs_inode_has_large_extent_counts(ip)) + return; + + ip->i_diflags2 |= XFS_DIFLAG2_NREXT64; + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); +} + +/* Widen the extent counter fields of both inodes if necessary. */ +void +xfs_exchmaps_upgrade_extent_counts( + struct xfs_trans *tp, + const struct xfs_exchmaps_intent *xmi) +{ + if (!xfs_has_large_extent_counts(tp->t_mountp)) + return; + + xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip1); + xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip2); +} + +/* + * Schedule an exchange of a range of mappings from one inode to another. + * + * The use of file mapping exchange log intent items ensures the operation can + * be resumed even if the system goes down. The caller must commit the + * transaction to start the work. + * + * The caller must ensure that the inodes are joined to the transaction and + * ILOCKd; they will still be joined to the transaction at exit. 
+ */ +void +xfs_exchange_mappings( + struct xfs_trans *tp, + const struct xfs_exchmaps_req *req) +{ + struct xfs_exchmaps_intent *xmi; + + xfs_assert_ilocked(req->ip1, XFS_ILOCK_EXCL); + xfs_assert_ilocked(req->ip2, XFS_ILOCK_EXCL); + ASSERT(!(req->flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)); + if (req->flags & XFS_EXCHMAPS_SET_SIZES) + ASSERT(!(req->flags & XFS_EXCHMAPS_ATTR_FORK)); + ASSERT(xfs_has_exchange_range(tp->t_mountp)); + + if (req->blockcount == 0) + return; + + xmi = xfs_exchmaps_init_intent(req); + xfs_exchmaps_defer_add(tp, xmi); + xfs_exchmaps_ensure_reflink(tp, xmi); + xfs_exchmaps_upgrade_extent_counts(tp, xmi); +} diff --git a/fs/xfs/libxfs/xfs_exchmaps.h b/fs/xfs/libxfs/xfs_exchmaps.h new file mode 100644 index 000000000000..e8fc3f80c68c --- /dev/null +++ b/fs/xfs/libxfs/xfs_exchmaps.h @@ -0,0 +1,118 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_EXCHMAPS_H__ +#define __XFS_EXCHMAPS_H__ + +/* In-core deferred operation info about a file mapping exchange request. */ +struct xfs_exchmaps_intent { + /* List of other incore deferred work. */ + struct list_head xmi_list; + + /* Inodes participating in the operation. */ + struct xfs_inode *xmi_ip1; + struct xfs_inode *xmi_ip2; + + /* File offset range information. */ + xfs_fileoff_t xmi_startoff1; + xfs_fileoff_t xmi_startoff2; + xfs_filblks_t xmi_blockcount; + + /* Set these file sizes after the operation, unless negative. 
*/ + xfs_fsize_t xmi_isize1; + xfs_fsize_t xmi_isize2; + + uint64_t xmi_flags; /* XFS_EXCHMAPS_* flags */ +}; + +/* flags that can be passed to xfs_exchmaps_{estimate,mappings} */ +#define XFS_EXCHMAPS_PARAMS (XFS_EXCHMAPS_ATTR_FORK | \ + XFS_EXCHMAPS_SET_SIZES | \ + XFS_EXCHMAPS_INO1_WRITTEN) + +static inline int +xfs_exchmaps_whichfork(const struct xfs_exchmaps_intent *xmi) +{ + if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK) + return XFS_ATTR_FORK; + return XFS_DATA_FORK; +} + +/* Parameters for a mapping exchange request. */ +struct xfs_exchmaps_req { + /* Inodes participating in the operation. */ + struct xfs_inode *ip1; + struct xfs_inode *ip2; + + /* File offset range information. */ + xfs_fileoff_t startoff1; + xfs_fileoff_t startoff2; + xfs_filblks_t blockcount; + + /* XFS_EXCHMAPS_* operation flags */ + uint64_t flags; + + /* + * Fields below this line are filled out by xfs_exchmaps_estimate; + * callers should initialize this part of the struct to zero. + */ + + /* + * Data device blocks to be moved out of ip1, and free space needed to + * handle the bmbt changes. + */ + xfs_filblks_t ip1_bcount; + + /* + * Data device blocks to be moved out of ip2, and free space needed to + * handle the bmbt changes. + */ + xfs_filblks_t ip2_bcount; + + /* rt blocks to be moved out of ip1. */ + xfs_filblks_t ip1_rtbcount; + + /* rt blocks to be moved out of ip2. 
*/ + xfs_filblks_t ip2_rtbcount; + + /* Free space needed to handle the bmbt changes */ + unsigned long long resblks; + + /* Number of exchanges needed to complete the operation */ + unsigned long long nr_exchanges; +}; + +static inline int +xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req) +{ + if (req->flags & XFS_EXCHMAPS_ATTR_FORK) + return XFS_ATTR_FORK; + return XFS_DATA_FORK; +} + +int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req); + +extern struct kmem_cache *xfs_exchmaps_intent_cache; + +int __init xfs_exchmaps_intent_init_cache(void); +void xfs_exchmaps_intent_destroy_cache(void); + +struct xfs_exchmaps_intent *xfs_exchmaps_init_intent( + const struct xfs_exchmaps_req *req); +void xfs_exchmaps_ensure_reflink(struct xfs_trans *tp, + const struct xfs_exchmaps_intent *xmi); +void xfs_exchmaps_upgrade_extent_counts(struct xfs_trans *tp, + const struct xfs_exchmaps_intent *xmi); + +int xfs_exchmaps_finish_one(struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi); + +int xfs_exchmaps_check_forks(struct xfs_mount *mp, + const struct xfs_exchmaps_req *req); + +void xfs_exchange_mappings(struct xfs_trans *tp, + const struct xfs_exchmaps_req *req); + +#endif /* __XFS_EXCHMAPS_H__ */ diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 09024431cae9..8dbe1f997dfd 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -904,7 +904,29 @@ struct xfs_xmi_log_format { uint64_t xmi_isize2; /* intended file2 size */ }; -#define XFS_EXCHMAPS_LOGGED_FLAGS (0) +/* Exchange mappings between extended attribute forks instead of data forks. */ +#define XFS_EXCHMAPS_ATTR_FORK (1ULL << 0) + +/* Set the file sizes when finished. */ +#define XFS_EXCHMAPS_SET_SIZES (1ULL << 1) + +/* + * Exchange the mappings of the two files only if the file allocation units + * mapped to file1's range have been written. 
+ */ +#define XFS_EXCHMAPS_INO1_WRITTEN (1ULL << 2) + +/* Clear the reflink flag from inode1 after the operation. */ +#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK (1ULL << 3) + +/* Clear the reflink flag from inode2 after the operation. */ +#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK (1ULL << 4) + +#define XFS_EXCHMAPS_LOGGED_FLAGS (XFS_EXCHMAPS_ATTR_FORK | \ + XFS_EXCHMAPS_SET_SIZES | \ + XFS_EXCHMAPS_INO1_WRITTEN | \ + XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \ + XFS_EXCHMAPS_CLEAR_INO2_REFLINK) /* This is the structure used to lay out an mapping exchange done log item. */ struct xfs_xmd_log_format { diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h index 87b31c69a773..9640fc232c14 100644 --- a/fs/xfs/libxfs/xfs_trans_space.h +++ b/fs/xfs/libxfs/xfs_trans_space.h @@ -10,6 +10,10 @@ * Components of space reservations. */ +/* Worst case number of bmaps that can be held in a block. */ +#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp) \ + (((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0])) + /* Worst case number of rmaps that can be held in a block. 
*/ #define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp) \ (((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0])) diff --git a/fs/xfs/xfs_exchmaps_item.c b/fs/xfs/xfs_exchmaps_item.c index 65b0ade41b3d..a40216f33214 100644 --- a/fs/xfs/xfs_exchmaps_item.c +++ b/fs/xfs/xfs_exchmaps_item.c @@ -16,13 +16,17 @@ #include "xfs_trans.h" #include "xfs_trans_priv.h" #include "xfs_exchmaps_item.h" +#include "xfs_exchmaps.h" #include "xfs_log.h" #include "xfs_bmap.h" #include "xfs_icache.h" +#include "xfs_bmap_btree.h" #include "xfs_trans_space.h" #include "xfs_error.h" #include "xfs_log_priv.h" #include "xfs_log_recover.h" +#include "xfs_exchrange.h" +#include "xfs_trace.h" struct kmem_cache *xfs_xmi_cache; struct kmem_cache *xfs_xmd_cache; @@ -144,6 +148,365 @@ static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip) return container_of(lip, struct xfs_xmd_log_item, xmd_item); } +STATIC void +xfs_xmd_item_size( + struct xfs_log_item *lip, + int *nvecs, + int *nbytes) +{ + *nvecs += 1; + *nbytes += sizeof(struct xfs_xmd_log_format); +} + +/* + * This is called to fill in the vector of log iovecs for the given xmd log + * item. We use only 1 iovec, and we point that at the xmd_log_format structure + * embedded in the xmd item. + */ +STATIC void +xfs_xmd_item_format( + struct xfs_log_item *lip, + struct xfs_log_vec *lv) +{ + struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip); + struct xfs_log_iovec *vecp = NULL; + + xmd_lip->xmd_format.xmd_type = XFS_LI_XMD; + xmd_lip->xmd_format.xmd_size = 1; + + xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMD_FORMAT, &xmd_lip->xmd_format, + sizeof(struct xfs_xmd_log_format)); +} + +/* + * The XMD is either committed or aborted if the transaction is cancelled. If + * the transaction is cancelled, drop our reference to the XMI and free the + * XMD. 
+ */ +STATIC void +xfs_xmd_item_release( + struct xfs_log_item *lip) +{ + struct xfs_xmd_log_item *xmd_lip = XMD_ITEM(lip); + + xfs_xmi_release(xmd_lip->xmd_intent_log_item); + kvfree(xmd_lip->xmd_item.li_lv_shadow); + kmem_cache_free(xfs_xmd_cache, xmd_lip); +} + +static struct xfs_log_item * +xfs_xmd_item_intent( + struct xfs_log_item *lip) +{ + return &XMD_ITEM(lip)->xmd_intent_log_item->xmi_item; +} + +static const struct xfs_item_ops xfs_xmd_item_ops = { + .flags = XFS_ITEM_RELEASE_WHEN_COMMITTED | + XFS_ITEM_INTENT_DONE, + .iop_size = xfs_xmd_item_size, + .iop_format = xfs_xmd_item_format, + .iop_release = xfs_xmd_item_release, + .iop_intent = xfs_xmd_item_intent, +}; + +/* Log file mapping exchange information in the intent item. */ +STATIC struct xfs_log_item * +xfs_exchmaps_create_intent( + struct xfs_trans *tp, + struct list_head *items, + unsigned int count, + bool sort) +{ + struct xfs_xmi_log_item *xmi_lip; + struct xfs_exchmaps_intent *xmi; + struct xfs_xmi_log_format *xlf; + + ASSERT(count == 1); + + xmi = list_first_entry_or_null(items, struct xfs_exchmaps_intent, + xmi_list); + + xmi_lip = xfs_xmi_init(tp->t_mountp); + xlf = &xmi_lip->xmi_format; + + xlf->xmi_inode1 = xmi->xmi_ip1->i_ino; + xlf->xmi_inode2 = xmi->xmi_ip2->i_ino; + xlf->xmi_startoff1 = xmi->xmi_startoff1; + xlf->xmi_startoff2 = xmi->xmi_startoff2; + xlf->xmi_blockcount = xmi->xmi_blockcount; + xlf->xmi_isize1 = xmi->xmi_isize1; + xlf->xmi_isize2 = xmi->xmi_isize2; + xlf->xmi_flags = xmi->xmi_flags & XFS_EXCHMAPS_LOGGED_FLAGS; + + return &xmi_lip->xmi_item; +} + +STATIC struct xfs_log_item * +xfs_exchmaps_create_done( + struct xfs_trans *tp, + struct xfs_log_item *intent, + unsigned int count) +{ + struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(intent); + struct xfs_xmd_log_item *xmd_lip; + + xmd_lip = kmem_cache_zalloc(xfs_xmd_cache, GFP_KERNEL | __GFP_NOFAIL); + xfs_log_item_init(tp->t_mountp, &xmd_lip->xmd_item, XFS_LI_XMD, + &xfs_xmd_item_ops); + xmd_lip->xmd_intent_log_item = 
xmi_lip; + xmd_lip->xmd_format.xmd_xmi_id = xmi_lip->xmi_format.xmi_id; + + return &xmd_lip->xmd_item; +} + +/* Add this deferred XMI to the transaction. */ +void +xfs_exchmaps_defer_add( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + trace_xfs_exchmaps_defer(tp->t_mountp, xmi); + + xfs_defer_add(tp, &xmi->xmi_list, &xfs_exchmaps_defer_type); +} + +static inline struct xfs_exchmaps_intent *xmi_entry(const struct list_head *e) +{ + return list_entry(e, struct xfs_exchmaps_intent, xmi_list); +} + +/* Cancel a deferred file mapping exchange. */ +STATIC void +xfs_exchmaps_cancel_item( + struct list_head *item) +{ + struct xfs_exchmaps_intent *xmi = xmi_entry(item); + + kmem_cache_free(xfs_exchmaps_intent_cache, xmi); +} + +/* Process a deferred file mapping exchange. */ +STATIC int +xfs_exchmaps_finish_item( + struct xfs_trans *tp, + struct xfs_log_item *done, + struct list_head *item, + struct xfs_btree_cur **state) +{ + struct xfs_exchmaps_intent *xmi = xmi_entry(item); + int error; + + /* + * Exchange one more mapping between two files. If there's still more + * work to do, we want to requeue ourselves after all other pending + * deferred operations have finished. This includes all of the dfops + * that we queued directly as well as any new ones created in the + * process of finishing the others. Doing so prevents us from queuing + * a large number of XMI log items in kernel memory, which in turn + * prevents us from pinning the tail of the log (while logging those + * new XMI items) until the first XMI items can be processed. + */ + error = xfs_exchmaps_finish_one(tp, xmi); + if (error != -EAGAIN) + xfs_exchmaps_cancel_item(item); + return error; +} + +/* Abort all pending XMIs. */ +STATIC void +xfs_exchmaps_abort_intent( + struct xfs_log_item *intent) +{ + xfs_xmi_release(XMI_ITEM(intent)); +} + +/* Is this recovered XMI ok? 
*/ +static inline bool +xfs_xmi_validate( + struct xfs_mount *mp, + struct xfs_xmi_log_item *xmi_lip) +{ + struct xfs_xmi_log_format *xlf = &xmi_lip->xmi_format; + + if (!xfs_has_exchange_range(mp)) + return false; + + if (xmi_lip->xmi_format.__pad != 0) + return false; + + if (xlf->xmi_flags & ~XFS_EXCHMAPS_LOGGED_FLAGS) + return false; + + if (!xfs_verify_ino(mp, xlf->xmi_inode1) || + !xfs_verify_ino(mp, xlf->xmi_inode2)) + return false; + + if (!xfs_verify_fileext(mp, xlf->xmi_startoff1, xlf->xmi_blockcount)) + return false; + + return xfs_verify_fileext(mp, xlf->xmi_startoff2, xlf->xmi_blockcount); +} + +/* + * Use the recovered log state to create a new request, estimate resource + * requirements, and create a new incore intent state. + */ +STATIC struct xfs_exchmaps_intent * +xfs_xmi_item_recover_intent( + struct xfs_mount *mp, + struct xfs_defer_pending *dfp, + const struct xfs_xmi_log_format *xlf, + struct xfs_exchmaps_req *req, + struct xfs_inode **ipp1, + struct xfs_inode **ipp2) +{ + struct xfs_inode *ip1, *ip2; + struct xfs_exchmaps_intent *xmi; + int error; + + /* + * Grab both inodes and set IRECOVERY to prevent trimming of post-eof + * mappings and freeing of unlinked inodes until we're totally done + * processing files. 
+ */ + error = xlog_recover_iget(mp, xlf->xmi_inode1, &ip1); + if (error) + return ERR_PTR(error); + error = xlog_recover_iget(mp, xlf->xmi_inode2, &ip2); + if (error) + goto err_rele1; + + req->ip1 = ip1; + req->ip2 = ip2; + req->startoff1 = xlf->xmi_startoff1; + req->startoff2 = xlf->xmi_startoff2; + req->blockcount = xlf->xmi_blockcount; + req->flags = xlf->xmi_flags & XFS_EXCHMAPS_PARAMS; + + xfs_exchrange_ilock(NULL, ip1, ip2); + error = xfs_exchmaps_estimate(req); + xfs_exchrange_iunlock(ip1, ip2); + if (error) + goto err_rele2; + + *ipp1 = ip1; + *ipp2 = ip2; + xmi = xfs_exchmaps_init_intent(req); + xfs_defer_add_item(dfp, &xmi->xmi_list); + return xmi; + +err_rele2: + xfs_irele(ip2); +err_rele1: + xfs_irele(ip1); + req->ip2 = req->ip1 = NULL; + return ERR_PTR(error); +} + +/* Process a file mapping exchange item that was recovered from the log. */ +STATIC int +xfs_exchmaps_recover_work( + struct xfs_defer_pending *dfp, + struct list_head *capture_list) +{ + struct xfs_exchmaps_req req = { .flags = 0 }; + struct xfs_trans_res resv; + struct xfs_exchmaps_intent *xmi; + struct xfs_log_item *lip = dfp->dfp_intent; + struct xfs_xmi_log_item *xmi_lip = XMI_ITEM(lip); + struct xfs_mount *mp = lip->li_log->l_mp; + struct xfs_trans *tp; + struct xfs_inode *ip1, *ip2; + int error = 0; + + if (!xfs_xmi_validate(mp, xmi_lip)) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, + &xmi_lip->xmi_format, + sizeof(xmi_lip->xmi_format)); + return -EFSCORRUPTED; + } + + xmi = xfs_xmi_item_recover_intent(mp, dfp, &xmi_lip->xmi_format, &req, + &ip1, &ip2); + if (IS_ERR(xmi)) + return PTR_ERR(xmi); + + trace_xfs_exchmaps_recover(mp, xmi); + + resv = xlog_recover_resv(&M_RES(mp)->tr_write); + error = xfs_trans_alloc(mp, &resv, req.resblks, 0, 0, &tp); + if (error) + goto err_rele; + + xfs_exchrange_ilock(tp, ip1, ip2); + + xfs_exchmaps_ensure_reflink(tp, xmi); + xfs_exchmaps_upgrade_extent_counts(tp, xmi); + error = xlog_recover_finish_intent(tp, dfp); + if (error == 
-EFSCORRUPTED)
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				&xmi_lip->xmi_format,
+				sizeof(xmi_lip->xmi_format));
+	if (error)
+		goto err_cancel;
+
+	/*
+	 * Commit transaction, which frees the transaction and saves the inodes
+	 * for later replay activities.
+	 */
+	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
+	goto err_unlock;
+
+err_cancel:
+	xfs_trans_cancel(tp);
+err_unlock:
+	xfs_exchrange_iunlock(ip1, ip2);
+err_rele:
+	xfs_irele(ip2);
+	xfs_irele(ip1);
+	return error;
+}
+
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_exchmaps_relog_intent(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	struct xfs_log_item		*done_item)
+{
+	struct xfs_xmi_log_item		*xmi_lip;
+	struct xfs_xmi_log_format	*old_xlf, *new_xlf;
+
+	old_xlf = &XMI_ITEM(intent)->xmi_format;
+
+	xmi_lip = xfs_xmi_init(tp->t_mountp);
+	new_xlf = &xmi_lip->xmi_format;
+
+	new_xlf->xmi_inode1	= old_xlf->xmi_inode1;
+	new_xlf->xmi_inode2	= old_xlf->xmi_inode2;
+	new_xlf->xmi_startoff1	= old_xlf->xmi_startoff1;
+	new_xlf->xmi_startoff2	= old_xlf->xmi_startoff2;
+	new_xlf->xmi_blockcount	= old_xlf->xmi_blockcount;
+	new_xlf->xmi_flags	= old_xlf->xmi_flags;
+	new_xlf->xmi_isize1	= old_xlf->xmi_isize1;
+	new_xlf->xmi_isize2	= old_xlf->xmi_isize2;
+
+	return &xmi_lip->xmi_item;
+}
+
+const struct xfs_defer_op_type xfs_exchmaps_defer_type = {
+	.name		= "exchmaps",
+	.max_items	= 1,
+	.create_intent	= xfs_exchmaps_create_intent,
+	.abort_intent	= xfs_exchmaps_abort_intent,
+	.create_done	= xfs_exchmaps_create_done,
+	.finish_item	= xfs_exchmaps_finish_item,
+	.cancel_item	= xfs_exchmaps_cancel_item,
+	.recover_work	= xfs_exchmaps_recover_work,
+	.relog_intent	= xfs_exchmaps_relog_intent,
+};
+
 STATIC bool
 xfs_xmi_item_match(
 	struct xfs_log_item	*lip,
@@ -194,8 +557,9 @@ xlog_recover_xmi_commit_pass2(
 	xmi_lip = xfs_xmi_init(mp);
 	memcpy(&xmi_lip->xmi_format, xmi_formatp, len);
 
-	/* not implemented yet */
-	return -EIO;
+	xlog_recover_intent_item(log, &xmi_lip->xmi_item, lsn,
+			&xfs_exchmaps_defer_type);
+	return 0;
 }
 
 const struct xlog_recover_item_ops xlog_xmi_item_ops = {
diff --git a/fs/xfs/xfs_exchmaps_item.h b/fs/xfs/xfs_exchmaps_item.h
index ada1eb314e65..efa368d25d09 100644
--- a/fs/xfs/xfs_exchmaps_item.h
+++ b/fs/xfs/xfs_exchmaps_item.h
@@ -56,4 +56,9 @@ struct xfs_xmd_log_item {
 extern struct kmem_cache	*xfs_xmi_cache;
 extern struct kmem_cache	*xfs_xmd_cache;
 
+struct xfs_exchmaps_intent;
+
+void xfs_exchmaps_defer_add(struct xfs_trans *tp,
+		struct xfs_exchmaps_intent *xmi);
+
 #endif	/* __XFS_EXCHMAPS_ITEM_H__ */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index 4cd824e47f75..35351b973521 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -13,8 +13,57 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_exchrange.h"
+#include "xfs_exchmaps.h"
 #include <linux/fsnotify.h>
 
+/* Lock (and optionally join) two inodes for a file range exchange. */
+void
+xfs_exchrange_ilock(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip1 != ip2)
+		xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
+				    ip2, XFS_ILOCK_EXCL);
+	else
+		xfs_ilock(ip1, XFS_ILOCK_EXCL);
+	if (tp) {
+		xfs_trans_ijoin(tp, ip1, 0);
+		if (ip2 != ip1)
+			xfs_trans_ijoin(tp, ip2, 0);
+	}
+
+}
+
+/* Unlock two inodes after a file range exchange operation. */
+void
+xfs_exchrange_iunlock(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip2 != ip1)
+		xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Estimate the resource requirements to exchange file contents between the two
+ * files. The caller is required to hold the IOLOCK and the MMAPLOCK and to
+ * have flushed both inodes' pagecache and active direct-ios.
+ */
+int
+xfs_exchrange_estimate(
+	struct xfs_exchmaps_req	*req)
+{
+	int			error;
+
+	xfs_exchrange_ilock(NULL, req->ip1, req->ip2);
+	error = xfs_exchmaps_estimate(req);
+	xfs_exchrange_iunlock(req->ip1, req->ip2);
+	return error;
+}
+
 /*
  * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
  * This part deals with struct file objects and byte ranges and does not deal
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index f80369c7df5d..039abcca546e 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -27,4 +27,12 @@ struct xfs_exchrange {
 long xfs_ioc_exchange_range(struct file *file,
 		struct xfs_exchange_range __user *argp);
 
+struct xfs_exchmaps_req;
+
+void xfs_exchrange_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2);
+void xfs_exchrange_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
+int xfs_exchrange_estimate(struct xfs_exchmaps_req *req);
+
 #endif /* __XFS_EXCHRANGE_H__ */
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 1a963382e5e9..9f38e69f1ce4 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -39,6 +39,7 @@
 #include "xfs_buf_mem.h"
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
+#include "xfs_exchmaps.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index aea97fc074f8..7c17d1f80fec 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -82,6 +82,8 @@ struct xfs_perag;
 struct xfbtree;
 struct xfs_btree_ops;
 struct xfs_bmap_intent;
+struct xfs_exchmaps_intent;
+struct xfs_exchmaps_req;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -4770,6 +4772,221 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
 DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
 #endif /* CONFIG_XFS_BTREE_IN_MEM */
 
+/* exchmaps tracepoints */
+#define XFS_EXCHMAPS_STRINGS \
+	{ XFS_EXCHMAPS_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_EXCHMAPS_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_EXCHMAPS_INO1_WRITTEN,		"INO1_WRITTEN" }, \
+	{ XFS_EXCHMAPS_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
+	{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
+
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip);
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
+DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
+
+TRACE_EVENT(xfs_exchmaps_overhead,
+	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
+		 unsigned long long rmapbt_blocks),
+	TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, bmbt_blocks)
+		__field(unsigned long long, rmapbt_blocks)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->bmbt_blocks = bmbt_blocks;
+		__entry->rmapbt_blocks = rmapbt_blocks;
+	),
+	TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->bmbt_blocks,
+		  __entry->rmapbt_blocks)
+);
+
+DECLARE_EVENT_CLASS(xfs_exchmaps_estimate_class,
+	TP_PROTO(const struct xfs_exchmaps_req *req),
+	TP_ARGS(req),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(uint64_t, flags)
+		__field(xfs_filblks_t, ip1_bcount)
+		__field(xfs_filblks_t, ip2_bcount)
+		__field(xfs_filblks_t, ip1_rtbcount)
+		__field(xfs_filblks_t, ip2_rtbcount)
+		__field(unsigned long long, resblks)
+		__field(unsigned long long, nr_exchanges)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->startoff1 = req->startoff1;
+		__entry->startoff2 = req->startoff2;
+		__entry->blockcount = req->blockcount;
+		__entry->flags = req->flags;
+		__entry->ip1_bcount = req->ip1_bcount;
+		__entry->ip2_bcount = req->ip2_bcount;
+		__entry->ip1_rtbcount = req->ip1_rtbcount;
+		__entry->ip2_rtbcount = req->ip2_rtbcount;
+		__entry->resblks = req->resblks;
+		__entry->nr_exchanges = req->nr_exchanges;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
+		  __entry->ip1_bcount,
+		  __entry->ip1_rtbcount,
+		  __entry->ip2_bcount,
+		  __entry->ip2_rtbcount,
+		  __entry->resblks,
+		  __entry->nr_exchanges)
+);
+
+#define DEFINE_EXCHMAPS_ESTIMATE_EVENT(name)	\
+DEFINE_EVENT(xfs_exchmaps_estimate_class, name,	\
+	TP_PROTO(const struct xfs_exchmaps_req *req), \
+	TP_ARGS(req))
+DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_initial_estimate);
+DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_final_estimate);
+
+DECLARE_EVENT_CLASS(xfs_exchmaps_intent_class,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi),
+	TP_ARGS(mp, xmi),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(uint64_t, flags)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_fsize_t, isize1)
+		__field(xfs_fsize_t, isize2)
+		__field(xfs_fsize_t, new_isize1)
+		__field(xfs_fsize_t, new_isize2)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino1 = xmi->xmi_ip1->i_ino;
+		__entry->ino2 = xmi->xmi_ip2->i_ino;
+		__entry->flags = xmi->xmi_flags;
+		__entry->startoff1 = xmi->xmi_startoff1;
+		__entry->startoff2 = xmi->xmi_startoff2;
+		__entry->blockcount = xmi->xmi_blockcount;
+		__entry->isize1 = xmi->xmi_ip1->i_disk_size;
+		__entry->isize2 = xmi->xmi_ip2->i_disk_size;
+		__entry->new_isize1 = xmi->xmi_isize1;
+		__entry->new_isize2 = xmi->xmi_isize2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
+		  __entry->isize1, __entry->new_isize1,
+		  __entry->isize2, __entry->new_isize2)
+);
+
+#define DEFINE_EXCHMAPS_INTENT_EVENT(name)	\
+DEFINE_EVENT(xfs_exchmaps_intent_class, name,	\
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi), \
+	TP_ARGS(mp, xmi))
+DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_defer);
+DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_recover);
+
+TRACE_EVENT(xfs_exchmaps_delta_nextents_step,
+	TP_PROTO(struct xfs_mount *mp,
+		 const struct xfs_bmbt_irec *left,
+		 const struct xfs_bmbt_irec *curr,
+		 const struct xfs_bmbt_irec *new,
+		 const struct xfs_bmbt_irec *right,
+		 int delta, unsigned int state),
+	TP_ARGS(mp, left, curr, new, right, delta, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_fileoff_t, loff)
+		__field(xfs_fsblock_t, lstart)
+		__field(xfs_filblks_t, lcount)
+		__field(xfs_fileoff_t, coff)
+		__field(xfs_fsblock_t, cstart)
+		__field(xfs_filblks_t, ccount)
+		__field(xfs_fileoff_t, noff)
+		__field(xfs_fsblock_t, nstart)
+		__field(xfs_filblks_t, ncount)
+		__field(xfs_fileoff_t, roff)
+		__field(xfs_fsblock_t, rstart)
+		__field(xfs_filblks_t, rcount)
+		__field(int, delta)
+		__field(unsigned int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->loff = left->br_startoff;
+		__entry->lstart = left->br_startblock;
+		__entry->lcount = left->br_blockcount;
+		__entry->coff = curr->br_startoff;
+		__entry->cstart = curr->br_startblock;
+		__entry->ccount = curr->br_blockcount;
+		__entry->noff = new->br_startoff;
+		__entry->nstart = new->br_startblock;
+		__entry->ncount = new->br_blockcount;
+		__entry->roff = right->br_startoff;
+		__entry->rstart = right->br_startblock;
+		__entry->rcount = right->br_blockcount;
+		__entry->delta = delta;
+		__entry->state = state;
+	),
+	TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->loff, __entry->lstart, __entry->lcount,
+		  __entry->coff, __entry->cstart, __entry->ccount,
+		  __entry->noff, __entry->nstart, __entry->ncount,
+		  __entry->roff, __entry->rstart, __entry->rcount,
+		  __entry->delta, __entry->state)
+);
+
+TRACE_EVENT(xfs_exchmaps_delta_nextents,
+	TP_PROTO(const struct xfs_exchmaps_req *req, int64_t d_nexts1,
+		 int64_t d_nexts2),
+	TP_ARGS(req, d_nexts1, d_nexts2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_extnum_t, nexts1)
+		__field(xfs_extnum_t, nexts2)
+		__field(int64_t, d_nexts1)
+		__field(int64_t, d_nexts2)
+	),
+	TP_fast_assign(
+		int whichfork = xfs_exchmaps_reqfork(req);
+
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->nexts1 = xfs_ifork_ptr(req->ip1, whichfork)->if_nextents;
+		__entry->nexts2 = xfs_ifork_ptr(req->ip2, whichfork)->if_nextents;
+		__entry->d_nexts1 = d_nexts1;
+		__entry->d_nexts2 = d_nexts2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->nexts1,
+		  __entry->ino2, __entry->nexts2,
+		  __entry->d_nexts1, __entry->d_nexts2)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH

^ permalink raw reply related	[flat|nested] 100+ messages in thread
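Annotation on the locking helpers in the patch above: xfs_exchrange_ilock and xfs_exchrange_iunlock take one lock when both arguments name the same inode and two locks otherwise. The following is a minimal userspace sketch of that shape, not kernel code. Every demo_* name is hypothetical, and the lowest-inode-number-first ordering is an assumption about what xfs_lock_two_inodes does; the patch itself does not show it.

```c
#include <assert.h>

/* Hypothetical stand-in for an inode; "locked" models an exclusive lock. */
struct demo_inode {
	unsigned long	ino;
	int		locked;
};

static void demo_lock(struct demo_inode *ip)
{
	assert(!ip->locked);	/* would self-deadlock on a real lock */
	ip->locked = 1;
}

static void demo_unlock(struct demo_inode *ip)
{
	assert(ip->locked);
	ip->locked = 0;
}

/* Lock both inodes, or just one if the two pointers are the same inode. */
static void demo_exch_lock(struct demo_inode *ip1, struct demo_inode *ip2)
{
	if (ip1 == ip2) {
		demo_lock(ip1);
		return;
	}

	/* Assumed rule: always take the lower inode number first. */
	if (ip1->ino < ip2->ino) {
		demo_lock(ip1);
		demo_lock(ip2);
	} else {
		demo_lock(ip2);
		demo_lock(ip1);
	}
}

/* Mirror of the lock helper: never unlock the same inode twice. */
static void demo_exch_unlock(struct demo_inode *ip1, struct demo_inode *ip2)
{
	if (ip2 != ip1)
		demo_unlock(ip2);
	demo_unlock(ip1);
}
```

The same-inode check matters because an exchange within a single file would otherwise try to take one exclusive lock twice.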
* [PATCH 06/15] xfs: bind together the front and back ends of the file range exchange code
  2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-04-15 23:42   ` [PATCH 05/15] xfs: create deferred log items for file mapping exchanges Darrick J. Wong
@ 2024-04-15 23:42   ` Darrick J. Wong
  2024-04-15 23:42   ` [PATCH 07/15] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong
                     ` (8 subsequent siblings)
  14 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:42 UTC (permalink / raw)
  To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

So far, we've constructed the front end of the file range exchange code
that does all the checking; and the back end of the file mapping
exchange code that actually does the work. Glue these two pieces
together so that we can turn on the functionality.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/xfs_exchrange.c |  334 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_trace.c     |    1
 fs/xfs/xfs_trace.h     |  109 ++++++++++++++++
 3 files changed, 443 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index 35351b973521..0fc95e6471cb 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -12,8 +12,15 @@
 #include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_util.h"
+#include "xfs_reflink.h"
+#include "xfs_trace.h"
 #include "xfs_exchrange.h"
 #include "xfs_exchmaps.h"
+#include "xfs_sb.h"
+#include "xfs_icache.h"
+#include "xfs_log.h"
 #include <linux/fsnotify.h>
 
 /* Lock (and optionally join) two inodes for a file range exchange. */
@@ -64,6 +71,207 @@ xfs_exchrange_estimate(
 	return error;
 }
 
+#define QRETRY_IP1	(0x1)
+#define QRETRY_IP2	(0x2)
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same. The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xfs_exchrange_reserve_quota(
+	struct xfs_trans		*tp,
+	const struct xfs_exchmaps_req	*req,
+	unsigned int			*qretry)
+{
+	int64_t				ddelta, rdelta;
+	int				ip1_error = 0;
+	int				error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+	    (req->ip1->i_udquot == req->ip2->i_udquot &&
+	     req->ip1->i_gdquot == req->ip2->i_gdquot &&
+	     req->ip1->i_pdquot == req->ip2->i_pdquot))
+		return 0;
+
+	*qretry = 0;
+
+	/*
+	 * For each file, compute the net gain in the number of regular blocks
+	 * that will be mapped into that file and reserve that much quota. The
+	 * quota counts must be able to absorb at least that much space.
+	 */
+	ddelta = req->ip2_bcount - req->ip1_bcount;
+	rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
+	if (ddelta > 0 || rdelta > 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+				ddelta > 0 ? ddelta : 0,
+				rdelta > 0 ? rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC) {
+			/*
+			 * Save this error and see what happens if we try to
+			 * reserve quota for ip2. Then report both.
+			 */
+			*qretry |= QRETRY_IP1;
+			ip1_error = error;
+			error = 0;
+		}
+		if (error)
+			return error;
+	}
+	if (ddelta < 0 || rdelta < 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
+				ddelta < 0 ? -ddelta : 0,
+				rdelta < 0 ? -rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC)
+			*qretry |= QRETRY_IP2;
+		if (error)
+			return error;
+	}
+	if (ip1_error)
+		return ip1_error;
+
+	/*
+	 * For each file, forcibly reserve the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.
+	 * We must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
+			req->ip1_rtbcount, true);
+	if (error)
+		return error;
+
+	return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
+			req->ip2_rtbcount, true);
+}
+
+/* Exchange the mappings (and hence the contents) of two files' forks. */
+STATIC int
+xfs_exchrange_mappings(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2)
+{
+	struct xfs_mount		*mp = ip1->i_mount;
+	struct xfs_exchmaps_req		req = {
+		.ip1			= ip1,
+		.ip2			= ip2,
+		.startoff1		= XFS_B_TO_FSBT(mp, fxr->file1_offset),
+		.startoff2		= XFS_B_TO_FSBT(mp, fxr->file2_offset),
+		.blockcount		= XFS_B_TO_FSB(mp, fxr->length),
+	};
+	struct xfs_trans		*tp;
+	unsigned int			qretry;
+	bool				retried = false;
+	int				error;
+
+	trace_xfs_exchrange_mappings(fxr, ip1, ip2);
+
+	if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF)
+		req.flags |= XFS_EXCHMAPS_SET_SIZES;
+	if (fxr->flags & XFS_EXCHANGE_RANGE_FILE1_WRITTEN)
+		req.flags |= XFS_EXCHMAPS_INO1_WRITTEN;
+
+	error = xfs_exchrange_estimate(&req);
+	if (error)
+		return error;
+
+retry:
+	/* Allocate the transaction, lock the inodes, and join them. */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
+			XFS_TRANS_RES_FDBLKS, &tp);
+	if (error)
+		return error;
+
+	xfs_exchrange_ilock(tp, ip1, ip2);
+
+	trace_xfs_exchrange_before(ip2, 2);
+	trace_xfs_exchrange_before(ip1, 1);
+
+	error = xfs_exchmaps_check_forks(mp, &req);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * Reserve ourselves some quota if any of them are in enforcing mode.
+	 * In theory we only need enough to satisfy the change in the number
+	 * of blocks between the two ranges being remapped.
+	 */
+	error = xfs_exchrange_reserve_quota(tp, &req, &qretry);
+	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
+		xfs_trans_cancel(tp);
+		xfs_exchrange_iunlock(ip1, ip2);
+		if (qretry & QRETRY_IP1)
+			xfs_blockgc_free_quota(ip1, 0);
+		if (qretry & QRETRY_IP2)
+			xfs_blockgc_free_quota(ip2, 0);
+		retried = true;
+		goto retry;
+	}
+	if (error)
+		goto out_trans_cancel;
+
+	/* If we got this far on a dry run, all parameters are ok. */
+	if (fxr->flags & XFS_EXCHANGE_RANGE_DRY_RUN)
+		goto out_trans_cancel;
+
+	/* Update the mtime and ctime of both files. */
+	if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME1)
+		xfs_trans_ichgtime(tp, ip1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+	if (fxr->flags & __XFS_EXCHANGE_RANGE_UPD_CMTIME2)
+		xfs_trans_ichgtime(tp, ip2, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+	xfs_exchange_mappings(tp, &req);
+
+	/*
+	 * Force the log to persist metadata updates if the caller or the
+	 * administrator requires this. The generic prep function already
+	 * flushed the relevant parts of the page cache.
+	 */
+	if (xfs_has_wsync(mp) || (fxr->flags & XFS_EXCHANGE_RANGE_DSYNC))
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+
+	trace_xfs_exchrange_after(ip2, 2);
+	trace_xfs_exchrange_after(ip1, 1);
+
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * If the caller wanted us to exchange the contents of two complete
+	 * files of unequal length, exchange the incore sizes now. This should
+	 * be safe because we flushed both files' page caches, exchanged all
+	 * the mappings, and updated the ondisk sizes.
+	 */
+	if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(ip2));
+		i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+		i_size_write(VFS_I(ip1), temp);
+	}
+
+out_unlock:
+	xfs_exchrange_iunlock(ip1, ip2);
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
+
 /*
  * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
  * This part deals with struct file objects and byte ranges and does not deal
@@ -287,6 +495,130 @@ xfs_exchange_range_finish(
 	return file_remove_privs(fxr->file2);
 }
 
+/* Prepare two files to have their data exchanged. */
+STATIC int
+xfs_exchrange_prep(
+	struct xfs_exchrange	*fxr,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip2);
+	int			error;
+
+	trace_xfs_exchrange_prep(fxr, ip1, ip2);
+
+	/* Verify both files are either real-time or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+		return -EINVAL;
+
+	/*
+	 * The alignment checks in the generic helpers cannot deal with
+	 * allocation units that are not powers of 2. This can happen with the
+	 * realtime volume if the extent size is set.
+	 */
+	if (!is_power_of_2(alloc_unit))
+		return -EOPNOTSUPP;
+
+	error = xfs_exchange_range_prep(fxr, alloc_unit);
+	if (error || fxr->length == 0)
+		return error;
+
+	/* Attach dquots to both inodes before changing block maps. */
+	error = xfs_qm_dqattach(ip2);
+	if (error)
+		return error;
+	error = xfs_qm_dqattach(ip1);
+	if (error)
+		return error;
+
+	trace_xfs_exchrange_flush(fxr, ip1, ip2);
+
+	/* Flush the relevant ranges of both files. */
+	error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
+	if (error)
+		return error;
+	error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
+	if (error)
+		return error;
+
+	/*
+	 * Cancel CoW fork preallocations for the ranges of both files. The
+	 * prep function should have flushed all the dirty data, so the only
+	 * CoW mappings remaining should be speculative.
+	 */
+	if (xfs_inode_has_cow_data(ip1)) {
+		error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	if (xfs_inode_has_cow_data(ip2)) {
+		error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Exchange contents of files. This is the binding between the generic
+ * file-level concepts and the XFS inode-specific implementation.
+ */
+STATIC int
+xfs_exchrange_contents(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode1 = file_inode(fxr->file1);
+	struct inode		*inode2 = file_inode(fxr->file2);
+	struct xfs_inode	*ip1 = XFS_I(inode1);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+	struct xfs_mount	*mp = ip1->i_mount;
+	int			error;
+
+	if (!xfs_has_exchange_range(mp))
+		return -EOPNOTSUPP;
+
+	if (fxr->flags & ~(XFS_EXCHANGE_RANGE_ALL_FLAGS |
+			   XFS_EXCHANGE_RANGE_PRIV_FLAGS))
+		return -EINVAL;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	/* Lock both files against IO */
+	error = xfs_ilock2_io_mmap(ip1, ip2);
+	if (error)
+		goto out_err;
+
+	/* Prepare and then exchange file contents. */
+	error = xfs_exchrange_prep(fxr, ip1, ip2);
+	if (error)
+		goto out_unlock;
+
+	error = xfs_exchrange_mappings(fxr, ip1, ip2);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Finish the exchange by removing special file privileges like any
+	 * other file write would do. This may involve turning on support for
+	 * logged xattrs if either file has security capabilities.
+	 */
+	error = xfs_exchange_range_finish(fxr);
+	if (error)
+		goto out_unlock;
+
+out_unlock:
+	xfs_iunlock2_io_mmap(ip1, ip2);
+out_err:
+	if (error)
+		trace_xfs_exchrange_error(ip2, error, _RET_IP_);
+	return error;
+}
+
 /* Exchange parts of two files. */
 static int
 xfs_exchange_range(
@@ -341,7 +673,7 @@ xfs_exchange_range(
 		fxr->flags |= __XFS_EXCHANGE_RANGE_UPD_CMTIME2;
 
 	file_start_write(fxr->file2);
-	ret = -EOPNOTSUPP; /* XXX call out to lower level code */
+	ret = xfs_exchrange_contents(fxr);
 	file_end_write(fxr->file2);
 	if (ret)
 		return ret;
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9f38e69f1ce4..cf92a3bd56c7 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -40,6 +40,7 @@
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
 #include "xfs_exchmaps.h"
+#include "xfs_exchrange.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 7c17d1f80fec..729e728c2076 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -84,6 +84,7 @@ struct xfs_btree_ops;
 struct xfs_bmap_intent;
 struct xfs_exchmaps_intent;
 struct xfs_exchmaps_req;
+struct xfs_exchrange;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -4785,6 +4786,114 @@ DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
 DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
 DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
 
+#define XFS_EXCHRANGE_INODES \
+	{ 1,		"file1" }, \
+	{ 2,		"file2" }
+
+DECLARE_EVENT_CLASS(xfs_exchrange_inode_class,
+	TP_PROTO(struct xfs_inode *ip, int whichfile),
+	TP_ARGS(ip, whichfile),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, whichfile)
+		__field(xfs_ino_t, ino)
+		__field(int, format)
+		__field(xfs_extnum_t, nex)
+		__field(int, broot_size)
+		__field(int, fork_off)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->whichfile = whichfile;
+		__entry->ino = ip->i_ino;
+		__entry->format = ip->i_df.if_format;
+		__entry->nex = ip->i_df.if_nextents;
+		__entry->fork_off = xfs_inode_fork_boff(ip);
+	),
+	TP_printk("dev %d:%d ino 0x%llx whichfile %s format %s num_extents %llu forkoff 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfile, XFS_EXCHRANGE_INODES),
+		  __print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
+		  __entry->nex,
+		  __entry->fork_off)
+)
+
+#define DEFINE_EXCHRANGE_INODE_EVENT(name) \
+DEFINE_EVENT(xfs_exchrange_inode_class, name, \
+	TP_PROTO(struct xfs_inode *ip, int whichfile), \
+	TP_ARGS(ip, whichfile))
+
+DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_before);
+DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_after);
+DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
+
+#define XFS_EXCHANGE_RANGE_FLAGS_STRS \
+	{ XFS_EXCHANGE_RANGE_TO_EOF,		"TO_EOF" }, \
+	{ XFS_EXCHANGE_RANGE_DSYNC ,		"DSYNC" }, \
+	{ XFS_EXCHANGE_RANGE_DRY_RUN,		"DRY_RUN" }, \
+	{ XFS_EXCHANGE_RANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
+	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME1,	"CMTIME1" }, \
+	{ __XFS_EXCHANGE_RANGE_UPD_CMTIME2,	"CMTIME2" }
+
+/* file exchange-range tracepoint class */
+DECLARE_EVENT_CLASS(xfs_exchrange_class,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1,
+		 struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip1, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip1_ino)
+		__field(loff_t, ip1_isize)
+		__field(loff_t, ip1_disize)
+		__field(xfs_ino_t, ip2_ino)
+		__field(loff_t, ip2_isize)
+		__field(loff_t, ip2_disize)
+
+		__field(loff_t, file1_offset)
+		__field(loff_t, file2_offset)
+		__field(unsigned long long, length)
+		__field(unsigned long long, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip1)->i_sb->s_dev;
+		__entry->ip1_ino = ip1->i_ino;
+		__entry->ip1_isize = VFS_I(ip1)->i_size;
+		__entry->ip1_disize = ip1->i_disk_size;
+		__entry->ip2_ino = ip2->i_ino;
+		__entry->ip2_isize = VFS_I(ip2)->i_size;
+		__entry->ip2_disize = ip2->i_disk_size;
+
+		__entry->file1_offset = fxr->file1_offset;
+		__entry->file2_offset = fxr->file2_offset;
+		__entry->length = fxr->length;
+		__entry->flags = fxr->flags;
+	),
+	TP_printk("dev %d:%d flags %s bytecount 0x%llx "
+		  "ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
+		  "ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHANGE_RANGE_FLAGS_STRS),
+		  __entry->length,
+		  __entry->ip1_ino,
+		  __entry->ip1_isize,
+		  __entry->ip1_disize,
+		  __entry->file1_offset,
+		  __entry->ip2_ino,
+		  __entry->ip2_isize,
+		  __entry->ip2_disize,
+		  __entry->file2_offset)
+)
+
+#define DEFINE_EXCHRANGE_EVENT(name)	\
+DEFINE_EVENT(xfs_exchrange_class, name,	\
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1, \
+		 struct xfs_inode *ip2), \
+	TP_ARGS(fxr, ip1, ip2))
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+
 TRACE_EVENT(xfs_exchmaps_overhead,
 	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
 		 unsigned long long rmapbt_blocks),

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* [PATCH 07/15] xfs: add error injection to test file mapping exchange recovery
  2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong
                     ` (5 preceding siblings ...)
  2024-04-15 23:42   ` [PATCH 06/15] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong
@ 2024-04-15 23:42   ` Darrick J. Wong
  2024-04-15 23:42   ` [PATCH 08/15] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong
                     ` (7 subsequent siblings)
  14 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:42 UTC (permalink / raw)
  To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Add an errortag so that we can test recovery of exchmaps log items.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/libxfs/xfs_exchmaps.c |    3 +++
 fs/xfs/xfs_error.c           |    3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 01a9e86b3037..7002d7676a78 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -63,7 +63,8 @@
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
 #define XFS_ERRTAG_WB_DELAY_MS				42
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
-#define XFS_ERRTAG_MAX					44
+#define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
+#define XFS_ERRTAG_MAX					45
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
 #define XFS_RANDOM_WB_DELAY_MS				3000
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
+#define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index b8e9450cc175..3b1f29e95fea 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -437,6 +437,9 @@ xfs_exchmaps_finish_one(
 		return error;
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_EXCHMAPS_FINISH_ONE))
+		return -EIO;
+
 	/* If we still have work to do, ask for a new transaction. */
 	if (xmi_has_more_exchange_work(xmi) || xmi_has_postop_work(xmi)) {
 		trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index 7ad0e92c6b5b..78cdc5064a8c 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_ATTR_LEAF_TO_NODE,
 	XFS_RANDOM_WB_DELAY_MS,
 	XFS_RANDOM_WRITE_DELAY_MS,
+	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
 };
 
 struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split,	XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node,	XFS_ERRTAG_ATTR_LEAF_TO_NODE);
 XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
 	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
+	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);

^ permalink raw reply related	[flat|nested] 100+ messages in thread
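Annotation on the errortag above: the header comment says a random factor of 1 means "always" and 2 means "half the time", and the new tag defaults to 1, so once armed it fails every call to xfs_exchmaps_finish_one with -EIO. The sketch below is a simplified userspace model of that frequency rule, not the kernel's XFS_TEST_ERROR implementation. The demo_* names are hypothetical, the caller supplies the random draw so the behaviour is reproducible, and treating a factor of 0 as "disabled" is an assumption.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-tag state: 0 = disabled, N = fire about 1 in N times. */
struct demo_errtag {
	uint32_t	random_factor;
};

/* Decide whether to inject, given a caller-supplied random draw. */
static bool demo_should_inject(const struct demo_errtag *tag, uint32_t draw)
{
	assert(tag != NULL);

	if (tag->random_factor == 0)
		return false;
	return draw % tag->random_factor == 0;
}

/* Model of one unit of deferred work with an injection point. */
static int demo_finish_one(const struct demo_errtag *tag, uint32_t draw)
{
	if (demo_should_inject(tag, draw))
		return -EIO;	/* simulated failure; recovery must finish the job */
	return 0;
}
```

With a factor of 1 every draw is a multiple of the factor, which gives the deterministic "fail on the first step" behaviour a recovery test wants.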
* [PATCH 08/15] xfs: condense extended attributes after a mapping exchange operation 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (6 preceding siblings ...) 2024-04-15 23:42 ` [PATCH 07/15] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong @ 2024-04-15 23:42 ` Darrick J. Wong 2024-04-15 23:43 ` [PATCH 09/15] xfs: condense directories " Darrick J. Wong ` (6 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:42 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. If we were swapping mappings between extended attribute forks, we want to be able to convert file2's attr fork from block to inline format. (This implies that all fork contents are exchanged.) This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online xattr repair feature can create salvaged attrs in a temporary file and exchange the attr fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed file's attr fork back down to inline format if possible. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_exchmaps.c | 53 ++++++++++++++++++++++++++++++++++++++++-- fs/xfs/libxfs/xfs_exchmaps.h | 5 ++++ fs/xfs/xfs_trace.h | 3 ++ 3 files changed, 58 insertions(+), 3 deletions(-) diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index 3b1f29e95fea..e46b314fa0cf 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -24,6 +24,10 @@ #include "xfs_errortag.h" #include "xfs_health.h" #include "xfs_exchmaps_item.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_attr_leaf.h" +#include "xfs_attr.h" struct kmem_cache *xfs_exchmaps_intent_cache; @@ -121,7 +125,8 @@ static inline bool xmi_has_postop_work(const struct xfs_exchmaps_intent *xmi) { return xmi->xmi_flags & (XFS_EXCHMAPS_CLEAR_INO1_REFLINK | - XFS_EXCHMAPS_CLEAR_INO2_REFLINK); + XFS_EXCHMAPS_CLEAR_INO2_REFLINK | + __XFS_EXCHMAPS_INO2_SHORTFORM); } /* Check all mappings to make sure we can actually exchange them. */ @@ -360,6 +365,36 @@ xfs_exchmaps_one_step( xmi_advance(xmi, irec1); } +/* Convert inode2's leaf attr fork back to shortform, if possible.. */ +STATIC int +xfs_exchmaps_attr_to_sf( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + struct xfs_da_args args = { + .dp = xmi->xmi_ip2, + .geo = tp->t_mountp->m_attr_geo, + .whichfork = XFS_ATTR_FORK, + .trans = tp, + }; + struct xfs_buf *bp; + int forkoff; + int error; + + if (!xfs_attr_is_leaf(xmi->xmi_ip2)) + return 0; + + error = xfs_attr3_leaf_read(tp, xmi->xmi_ip2, 0, &bp); + if (error) + return error; + + forkoff = xfs_attr_shortform_allfit(bp, xmi->xmi_ip2); + if (forkoff == 0) + return 0; + + return xfs_attr3_leaf_to_shortform(bp, &args, forkoff); +} + /* Clear the reflink flag after an exchange. 
*/ static inline void xfs_exchmaps_clear_reflink( @@ -378,6 +413,16 @@ xfs_exchmaps_do_postop_work( struct xfs_trans *tp, struct xfs_exchmaps_intent *xmi) { + if (xmi->xmi_flags & __XFS_EXCHMAPS_INO2_SHORTFORM) { + int error = 0; + + if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK) + error = xfs_exchmaps_attr_to_sf(tp, xmi); + xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM; + if (error) + return error; + } + if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO1_REFLINK) { xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip1); xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO1_REFLINK; @@ -809,8 +854,10 @@ xfs_exchmaps_init_intent( xmi->xmi_isize1 = xmi->xmi_isize2 = -1; xmi->xmi_flags = req->flags & XFS_EXCHMAPS_PARAMS; - if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK) + if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK) { + xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM; return xmi; + } if (req->flags & XFS_EXCHMAPS_SET_SIZES) { xmi->xmi_flags |= XFS_EXCHMAPS_SET_SIZES; @@ -1031,6 +1078,8 @@ xfs_exchange_mappings( { struct xfs_exchmaps_intent *xmi; + BUILD_BUG_ON(XFS_EXCHMAPS_INTERNAL_FLAGS & XFS_EXCHMAPS_LOGGED_FLAGS); + xfs_assert_ilocked(req->ip1, XFS_ILOCK_EXCL); xfs_assert_ilocked(req->ip2, XFS_ILOCK_EXCL); ASSERT(!(req->flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)); diff --git a/fs/xfs/libxfs/xfs_exchmaps.h b/fs/xfs/libxfs/xfs_exchmaps.h index e8fc3f80c68c..d8718fca606e 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.h +++ b/fs/xfs/libxfs/xfs_exchmaps.h @@ -27,6 +27,11 @@ struct xfs_exchmaps_intent { uint64_t xmi_flags; /* XFS_EXCHMAPS_* flags */ }; +/* Try to convert inode2 from block to short format at the end, if possible. 
*/ +#define __XFS_EXCHMAPS_INO2_SHORTFORM (1ULL << 63) + +#define XFS_EXCHMAPS_INTERNAL_FLAGS (__XFS_EXCHMAPS_INO2_SHORTFORM) + /* flags that can be passed to xfs_exchmaps_{estimate,mappings} */ #define XFS_EXCHMAPS_PARAMS (XFS_EXCHMAPS_ATTR_FORK | \ XFS_EXCHMAPS_SET_SIZES | \ diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 729e728c2076..caef95f2c87c 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -4779,7 +4779,8 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block); { XFS_EXCHMAPS_SET_SIZES, "SETSIZES" }, \ { XFS_EXCHMAPS_INO1_WRITTEN, "INO1_WRITTEN" }, \ { XFS_EXCHMAPS_CLEAR_INO1_REFLINK, "CLEAR_INO1_REFLINK" }, \ - { XFS_EXCHMAPS_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" } + { XFS_EXCHMAPS_CLEAR_INO2_REFLINK, "CLEAR_INO2_REFLINK" }, \ + { __XFS_EXCHMAPS_INO2_SHORTFORM, "INO2_SF" } DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip); DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1); ^ permalink raw reply related [flat|nested] 100+ messages in thread
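Editorial sketch for patch 08/15: the flag handling described above can be modelled in userspace roughly as follows. This is only an illustration under stated assumptions, not the kernel code. The TOY_* names are hypothetical and the bit positions chosen for the logged flags are assumed for the example; what comes from the patch is the use of bit 63 for the in-memory-only flag, the compile-time check that internal and logged flags never overlap, and the pattern of clearing each flag once its post-exchange work has been attempted.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical logged flags; the real bit positions are an assumption. */
#define TOY_ATTR_FORK		(1ULL << 0)
#define TOY_CLEAR_INO1_REFLINK	(1ULL << 3)
#define TOY_CLEAR_INO2_REFLINK	(1ULL << 4)
#define TOY_LOGGED_FLAGS	(TOY_ATTR_FORK | \
				 TOY_CLEAR_INO1_REFLINK | \
				 TOY_CLEAR_INO2_REFLINK)

/* In-memory-only flag, parked in the top bit as the patch does. */
#define TOY_INO2_SHORTFORM	(1ULL << 63)
#define TOY_INTERNAL_FLAGS	(TOY_INO2_SHORTFORM)

/* Userspace analogue of the BUILD_BUG_ON() added to xfs_exchange_mappings(). */
_Static_assert((TOY_INTERNAL_FLAGS & TOY_LOGGED_FLAGS) == 0,
	       "internal flags must not overlap logged flags");

/* Is there any work left to do once the mappings have been exchanged? */
static bool toy_has_postop_work(uint64_t flags)
{
	return (flags & (TOY_CLEAR_INO1_REFLINK |
			 TOY_CLEAR_INO2_REFLINK |
			 TOY_INO2_SHORTFORM)) != 0;
}

/*
 * Run the post-exchange steps in the same order as the patch and clear each
 * flag as its step is handled.  The steps themselves are elided here; only
 * the flag bookkeeping is modelled.  Returns the remaining flags.
 */
static uint64_t toy_do_postop_work(uint64_t flags)
{
	if (flags & TOY_INO2_SHORTFORM)
		flags &= ~TOY_INO2_SHORTFORM;	  /* try to condense file2 */
	if (flags & TOY_CLEAR_INO1_REFLINK)
		flags &= ~TOY_CLEAR_INO1_REFLINK; /* clear reflink on file1 */
	if (flags & TOY_CLEAR_INO2_REFLINK)
		flags &= ~TOY_CLEAR_INO2_REFLINK; /* clear reflink on file2 */
	return flags;
}
```

Keeping the internal flag in the highest bit leaves the low bits free for flags that are written to the log, and the static assertion catches any future collision at build time rather than at log recovery time.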
* [PATCH 09/15] xfs: condense directories after a mapping exchange operation 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (7 preceding siblings ...) 2024-04-15 23:42 ` [PATCH 08/15] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong @ 2024-04-15 23:43 ` Darrick J. Wong 2024-04-15 23:43 ` [PATCH 10/15] xfs: condense symbolic links " Darrick J. Wong ` (5 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:43 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> The previous commit added a new file mapping exchange flag that enables us to perform post-swap processing on file2 once we're done exchanging extent mappings. Now add this ability for directories. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online directory repair feature can create salvaged dirents in a temporary directory and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed directory down to inline format if possible. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_exchmaps.c | 43 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 43 insertions(+) diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index e46b314fa0cf..f199629adbf0 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -28,6 +28,8 @@ #include "xfs_da_btree.h" #include "xfs_attr_leaf.h" #include "xfs_attr.h" +#include "xfs_dir2_priv.h" +#include "xfs_dir2.h" struct kmem_cache *xfs_exchmaps_intent_cache; @@ -395,6 +397,42 @@ xfs_exchmaps_attr_to_sf( return xfs_attr3_leaf_to_shortform(bp, &args, forkoff); } +/* Convert inode2's block dir fork back to shortform, if possible. */ +STATIC int +xfs_exchmaps_dir_to_sf( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + struct xfs_da_args args = { + .dp = xmi->xmi_ip2, + .geo = tp->t_mountp->m_dir_geo, + .whichfork = XFS_DATA_FORK, + .trans = tp, + }; + struct xfs_dir2_sf_hdr sfh; + struct xfs_buf *bp; + bool isblock; + int size; + int error; + + error = xfs_dir2_isblock(&args, &isblock); + if (error) + return error; + + if (!isblock) + return 0; + + error = xfs_dir3_block_read(tp, xmi->xmi_ip2, &bp); + if (error) + return error; + + size = xfs_dir2_block_sfsize(xmi->xmi_ip2, bp->b_addr, &sfh); + if (size > xfs_inode_data_fork_size(xmi->xmi_ip2)) + return 0; + + return xfs_dir2_block_to_sf(&args, bp, size, &sfh); +} + /* Clear the reflink flag after an exchange. 
*/ static inline void xfs_exchmaps_clear_reflink( @@ -418,6 +456,8 @@ xfs_exchmaps_do_postop_work( if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK) error = xfs_exchmaps_attr_to_sf(tp, xmi); + else if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode)) + error = xfs_exchmaps_dir_to_sf(tp, xmi); xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM; if (error) return error; @@ -882,6 +922,9 @@ xfs_exchmaps_init_intent( xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK; } + if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode)) + xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM; + return xmi; } ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 10/15] xfs: condense symbolic links after a mapping exchange operation 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (8 preceding siblings ...) 2024-04-15 23:43 ` [PATCH 09/15] xfs: condense directories " Darrick J. Wong @ 2024-04-15 23:43 ` Darrick J. Wong 2024-04-15 23:43 ` [PATCH 11/15] xfs: make file range exchange support realtime files Darrick J. Wong ` (4 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:43 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> The previous commit added a new file mapping exchange flag that enables us to perform post-exchange processing on file2 once we're done exchanging the extent mappings. Now add this ability for symlinks. This isn't used anywhere right now, but we need to have the basic ondisk flags in place so that a future online symlink repair feature can salvage the remote target in a temporary link and exchange the data fork mappings when ready. If one file is in extents format and the other is inline, we will have to promote both to extents format to perform the exchange. After the exchange, we can try to condense the fixed symlink down to inline format if possible. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_exchmaps.c | 49 +++++++++++++++++++++++++++++++++++- fs/xfs/libxfs/xfs_symlink_remote.c | 47 +++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_symlink_remote.h | 1 + fs/xfs/xfs_symlink.c | 49 ++++-------------------------------- 4 files changed, 102 insertions(+), 44 deletions(-) diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index f199629adbf0..f58240466b1c 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -30,6 +30,7 @@ #include "xfs_attr.h" #include "xfs_dir2_priv.h" #include "xfs_dir2.h" +#include "xfs_symlink_remote.h" struct kmem_cache *xfs_exchmaps_intent_cache; @@ -433,6 +434,49 @@ xfs_exchmaps_dir_to_sf( return xfs_dir2_block_to_sf(&args, bp, size, &sfh); } +/* Convert inode2's remote symlink target back to shortform, if possible. */ +STATIC int +xfs_exchmaps_link_to_sf( + struct xfs_trans *tp, + struct xfs_exchmaps_intent *xmi) +{ + struct xfs_inode *ip = xmi->xmi_ip2; + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); + char *buf; + int error; + + if (ifp->if_format == XFS_DINODE_FMT_LOCAL || + ip->i_disk_size > xfs_inode_data_fork_size(ip)) + return 0; + + /* Read the current symlink target into a buffer. */ + buf = kmalloc(ip->i_disk_size + 1, + GFP_KERNEL | __GFP_NOLOCKDEP | __GFP_NOFAIL); + if (!buf) { + ASSERT(0); + return -ENOMEM; + } + + error = xfs_symlink_remote_read(ip, buf); + if (error) + goto free; + + /* Remove the blocks. */ + error = xfs_symlink_remote_truncate(tp, ip); + if (error) + goto free; + + /* Convert fork to local format and log our changes. */ + xfs_idestroy_fork(ifp); + ifp->if_bytes = 0; + ifp->if_format = XFS_DINODE_FMT_LOCAL; + xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size); + xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE); +free: + kfree(buf); + return error; +} + /* Clear the reflink flag after an exchange. 
*/ static inline void xfs_exchmaps_clear_reflink( @@ -458,6 +502,8 @@ xfs_exchmaps_do_postop_work( error = xfs_exchmaps_attr_to_sf(tp, xmi); else if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode)) error = xfs_exchmaps_dir_to_sf(tp, xmi); + else if (S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode)) + error = xfs_exchmaps_link_to_sf(tp, xmi); xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM; if (error) return error; @@ -922,7 +968,8 @@ xfs_exchmaps_init_intent( xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK; } - if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode)) + if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode) || + S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode)) xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM; return xmi; diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c index ffb1317a9212..8f0d5c584f46 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.c +++ b/fs/xfs/libxfs/xfs_symlink_remote.c @@ -380,3 +380,50 @@ xfs_symlink_write_target( ASSERT(pathlen == 0); return 0; } + +/* Remove all the blocks from a symlink and invalidate buffers. */ +int +xfs_symlink_remote_truncate( + struct xfs_trans *tp, + struct xfs_inode *ip) +{ + struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS]; + struct xfs_mount *mp = tp->t_mountp; + struct xfs_buf *bp; + int nmaps = XFS_SYMLINK_MAPS; + int done = 0; + int i; + int error; + + /* Read mappings and invalidate buffers. */ + error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0); + if (error) + return error; + + for (i = 0; i < nmaps; i++) { + if (!xfs_bmap_is_real_extent(&mval[i])) + break; + + error = xfs_trans_get_buf(tp, mp->m_ddev_targp, + XFS_FSB_TO_DADDR(mp, mval[i].br_startblock), + XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0, + &bp); + if (error) + return error; + + xfs_trans_binval(tp, bp); + } + + /* Unmap the remote blocks. 
*/ + error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done); + if (error) + return error; + if (!done) { + ASSERT(done); + xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK); + return -EFSCORRUPTED; + } + + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); + return 0; +} diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h index a63bd38ae4fa..ac3dac8f617e 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.h +++ b/fs/xfs/libxfs/xfs_symlink_remote.h @@ -22,5 +22,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link); int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip, const char *target_path, int pathlen, xfs_fsblock_t fs_blocks, uint resblks); +int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip); #endif /* __XFS_SYMLINK_REMOTE_H */ diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index 3e376d24c7c1..3daeebff4bb4 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -250,19 +250,12 @@ xfs_symlink( */ STATIC int xfs_inactive_symlink_rmt( - struct xfs_inode *ip) + struct xfs_inode *ip) { - struct xfs_buf *bp; - int done; - int error; - int i; - xfs_mount_t *mp; - xfs_bmbt_irec_t mval[XFS_SYMLINK_MAPS]; - int nmaps; - int size; - xfs_trans_t *tp; + struct xfs_mount *mp = ip->i_mount; + struct xfs_trans *tp; + int error; - mp = ip->i_mount; ASSERT(!xfs_need_iread_extents(&ip->i_df)); /* * We're freeing a symlink that has some @@ -286,44 +279,14 @@ xfs_inactive_symlink_rmt( * locked for the second transaction. In the error paths we need it * held so the cancel won't rele it, see below. */ - size = (int)ip->i_disk_size; ip->i_disk_size = 0; VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG; xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); - /* - * Find the block(s) so we can inval and unmap them. 
- */ - done = 0; - nmaps = ARRAY_SIZE(mval); - error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size), - mval, &nmaps, 0); - if (error) - goto error_trans_cancel; - /* - * Invalidate the block(s). No validation is done. - */ - for (i = 0; i < nmaps; i++) { - error = xfs_trans_get_buf(tp, mp->m_ddev_targp, - XFS_FSB_TO_DADDR(mp, mval[i].br_startblock), - XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0, - &bp); - if (error) - goto error_trans_cancel; - xfs_trans_binval(tp, bp); - } - /* - * Unmap the dead block(s) to the dfops. - */ - error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done); + + error = xfs_symlink_remote_truncate(tp, ip); if (error) goto error_trans_cancel; - ASSERT(done); - /* - * Commit the transaction. This first logs the EFI and the inode, then - * rolls and commits the transaction that frees the extents. - */ - xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); error = xfs_trans_commit(tp); if (error) { ASSERT(xfs_is_shutdown(mp)); ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 11/15] xfs: make file range exchange support realtime files 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (9 preceding siblings ...) 2024-04-15 23:43 ` [PATCH 10/15] xfs: condense symbolic links " Darrick J. Wong @ 2024-04-15 23:43 ` Darrick J. Wong 2024-04-15 23:43 ` [PATCH 12/15] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong ` (3 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:43 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that bmap items support the realtime device, we can add the necessary pieces to the file range exchange code to support exchanging mappings. All we really need to do here is adjust the blockcount upwards to the end of the rt extent and remove the inode checks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_exchmaps.c | 70 ++++++++++++++++++++++++++++++++++++------ fs/xfs/xfs_exchrange.c | 9 +++++ 2 files changed, 69 insertions(+), 10 deletions(-) diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index f58240466b1c..7fa244228750 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -152,12 +152,7 @@ xfs_exchmaps_check_forks( ifp2->if_format == XFS_DINODE_FMT_LOCAL) return -EINVAL; - /* We don't support realtime data forks yet. */ - if (!XFS_IS_REALTIME_INODE(req->ip1)) - return 0; - if (whichfork == XFS_ATTR_FORK) - return 0; - return -EINVAL; + return 0; } #ifdef CONFIG_XFS_QUOTA @@ -198,6 +193,8 @@ xfs_exchmaps_can_skip_mapping( struct xfs_exchmaps_intent *xmi, struct xfs_bmbt_irec *irec) { + struct xfs_mount *mp = xmi->xmi_ip1->i_mount; + /* Do not skip this mapping if the caller did not tell us to. 
*/ if (!(xmi->xmi_flags & XFS_EXCHMAPS_INO1_WRITTEN)) return false; @@ -209,11 +206,64 @@ xfs_exchmaps_can_skip_mapping( /* * The mapping is unwritten or a hole. It cannot be a delalloc * reservation because we already excluded those. It cannot be an - * unwritten mapping with dirty page cache because we flushed the page - * cache. We don't support realtime files yet, so we needn't (yet) - * deal with them. + * unwritten extent with dirty page cache because we flushed the page + * cache. For files where the allocation unit is 1FSB (files on the + * data dev, rt files if the extent size is 1FSB), we can safely + * skip this mapping. */ - return true; + if (!xfs_inode_has_bigrtalloc(xmi->xmi_ip1)) + return true; + + /* + * For a realtime file with a multi-fsb allocation unit, the decision + * is trickier because we can only swap full allocation units. + * Unwritten mappings can appear in the middle of an rtx if the rtx is + * partially written, but they can also appear for preallocations. + * + * If the mapping is a hole, skip it entirely. Holes should align with + * rtx boundaries. + */ + if (!xfs_bmap_is_real_extent(irec)) + return true; + + /* + * All mappings below this point are unwritten. + * + * - If the beginning is not aligned to an rtx, trim the end of the + * mapping so that it does not cross an rtx boundary, and swap it. + * + * - If both ends are aligned to an rtx, skip the entire mapping. + */ + if (!isaligned_64(irec->br_startoff, mp->m_sb.sb_rextsize)) { + xfs_fileoff_t new_end; + + new_end = roundup_64(irec->br_startoff, mp->m_sb.sb_rextsize); + irec->br_blockcount = min(irec->br_blockcount, + new_end - irec->br_startoff); + return false; + } + if (isaligned_64(irec->br_blockcount, mp->m_sb.sb_rextsize)) + return true; + + /* + * All mappings below this point are unwritten, start on an rtx + * boundary, and do not end on an rtx boundary. 
+ * + * - If the mapping is longer than one rtx, trim the end of the mapping + * down to an rtx boundary and skip it. + * + * - The mapping is shorter than one rtx. Swap it. + */ + if (irec->br_blockcount > mp->m_sb.sb_rextsize) { + xfs_fileoff_t new_end; + + new_end = rounddown_64(irec->br_startoff + irec->br_blockcount, + mp->m_sb.sb_rextsize); + irec->br_blockcount = new_end - irec->br_startoff; + return true; + } + + return false; } /* diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c index 0fc95e6471cb..90baf12bd97f 100644 --- a/fs/xfs/xfs_exchrange.c +++ b/fs/xfs/xfs_exchrange.c @@ -21,6 +21,7 @@ #include "xfs_sb.h" #include "xfs_icache.h" #include "xfs_log.h" +#include "xfs_rtbitmap.h" #include <linux/fsnotify.h> /* Lock (and optionally join) two inodes for a file range exchange. */ @@ -182,6 +183,14 @@ xfs_exchrange_mappings( if (fxr->flags & XFS_EXCHANGE_RANGE_FILE1_WRITTEN) req.flags |= XFS_EXCHMAPS_INO1_WRITTEN; + /* + * Round the request length up to the nearest file allocation unit. + * The prep function already checked that the request offsets and + * length in @fxr are safe to round up. + */ + if (xfs_inode_has_bigrtalloc(ip2)) + req.blockcount = xfs_rtb_roundup_rtx(mp, req.blockcount); + error = xfs_exchrange_estimate(&req); if (error) return error; ^ permalink raw reply related [flat|nested] 100+ messages in thread
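Editorial sketch for patch 11/15: the rt extent trimming rules in xfs_exchmaps_can_skip_mapping() are easier to follow with concrete numbers, so here is a simplified userspace model. It only covers the path where the caller asked to skip unwritten ranges and the file has a multi-block allocation unit. The struct and the toy_* helpers are hypothetical stand-ins, and all offsets and lengths are in filesystem blocks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, cut-down stand-in for an unwritten mapping or a hole. */
struct toy_mapping {
	uint64_t startoff;	/* file offset, in fs blocks */
	uint64_t blockcount;	/* length, in fs blocks */
	bool	 is_hole;	/* true if no blocks are mapped */
};

/* Division-based rounding so that the unit need not be a power of two. */
static uint64_t toy_rounddown(uint64_t x, uint64_t unit)
{
	return x - (x % unit);
}

static uint64_t toy_roundup(uint64_t x, uint64_t unit)
{
	return toy_rounddown(x + unit - 1, unit);
}

/*
 * Decide whether an unwritten mapping (or hole) can be skipped.  May trim
 * m->blockcount so that the piece acted upon never crosses an rt extent
 * boundary.  Returns true to skip, false to exchange.
 */
static bool toy_can_skip_unwritten(struct toy_mapping *m, uint64_t rextsize)
{
	uint64_t new_end;

	/* Holes are expected to align with rt extents; skip them. */
	if (m->is_hole)
		return true;

	/* Unaligned start: exchange only up to the next rt extent boundary. */
	if (m->startoff % rextsize != 0) {
		new_end = toy_roundup(m->startoff, rextsize);
		if (new_end - m->startoff < m->blockcount)
			m->blockcount = new_end - m->startoff;
		return false;
	}

	/* Both ends aligned: nothing written here, skip all of it. */
	if (m->blockcount % rextsize == 0)
		return true;

	/* Aligned start, ragged end, longer than one rtx: skip whole rtxs. */
	if (m->blockcount > rextsize) {
		new_end = toy_rounddown(m->startoff + m->blockcount, rextsize);
		m->blockcount = new_end - m->startoff;
		return true;
	}

	/* Shorter than one rt extent: must be exchanged. */
	return false;
}
```

With a 3-block rt extent, a mapping at offset 4 of length 10 is trimmed to 2 blocks and exchanged, whereas a mapping at offset 6 of length 8 is trimmed to 6 blocks and skipped; the remaining 2 blocks are then seen as a short mapping and exchanged.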
* [PATCH 12/15] xfs: support non-power-of-two rtextsize with exchange-range 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (10 preceding siblings ...) 2024-04-15 23:43 ` [PATCH 11/15] xfs: make file range exchange support realtime files Darrick J. Wong @ 2024-04-15 23:43 ` Darrick J. Wong 2024-04-15 23:44 ` [PATCH 13/15] xfs: capture inode generation numbers in the ondisk exchmaps log item Darrick J. Wong ` (2 subsequent siblings) 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:43 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> The generic exchange-range alignment checks use (fast) bitmasking operations to perform block alignment checks on the exchange parameters. Unfortunately, bitmasks require that the alignment size be a power of two. This isn't true for realtime devices with a non-power-of-two extent size, so we have to copy-pasta the generic checks using long division for this to work properly. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_exchrange.c | 89 ++++++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 82 insertions(+), 7 deletions(-) diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c index 90baf12bd97f..c8a655c92c92 100644 --- a/fs/xfs/xfs_exchrange.c +++ b/fs/xfs/xfs_exchrange.c @@ -504,6 +504,75 @@ xfs_exchange_range_finish( return file_remove_privs(fxr->file2); } +/* + * Check the alignment of an exchange request when the allocation unit size + * isn't a power of two. The generic file-level helpers use (fast) + * bitmask-based alignment checks, but here we have to use slow long division. 
+ */ +static int +xfs_exchrange_check_rtalign( + const struct xfs_exchrange *fxr, + struct xfs_inode *ip1, + struct xfs_inode *ip2, + unsigned int alloc_unit) +{ + uint64_t length = fxr->length; + uint64_t blen; + loff_t size1, size2; + + size1 = i_size_read(VFS_I(ip1)); + size2 = i_size_read(VFS_I(ip2)); + + /* The start of both ranges must be aligned to a rt extent. */ + if (!isaligned_64(fxr->file1_offset, alloc_unit) || + !isaligned_64(fxr->file2_offset, alloc_unit)) + return -EINVAL; + + if (fxr->flags & XFS_EXCHANGE_RANGE_TO_EOF) + length = max_t(int64_t, size1 - fxr->file1_offset, + size2 - fxr->file2_offset); + + /* + * If the user wanted us to exchange up to the infile's EOF, round up + * to the next rt extent boundary for this check. Do the same for the + * outfile. + * + * Otherwise, reject the range length if it's not rt extent aligned. + * We already confirmed the starting offsets' rt extent block + * alignment. + */ + if (fxr->file1_offset + length == size1) + blen = roundup_64(size1, alloc_unit) - fxr->file1_offset; + else if (fxr->file2_offset + length == size2) + blen = roundup_64(size2, alloc_unit) - fxr->file2_offset; + else if (!isaligned_64(length, alloc_unit)) + return -EINVAL; + else + blen = length; + + /* Don't allow overlapped exchanges within the same file. */ + if (ip1 == ip2 && + fxr->file2_offset + blen > fxr->file1_offset && + fxr->file1_offset + blen > fxr->file2_offset) + return -EINVAL; + + /* + * Ensure that we don't exchange a partial EOF rt extent into the + * middle of another file. + */ + if (isaligned_64(length, alloc_unit)) + return 0; + + blen = length; + if (fxr->file2_offset + length < size2) + blen = rounddown_64(blen, alloc_unit); + + if (fxr->file1_offset + blen < size1) + blen = rounddown_64(blen, alloc_unit); + + return blen == length ? 0 : -EINVAL; +} + /* Prepare two files to have their data exchanged. 
*/ STATIC int xfs_exchrange_prep( @@ -511,6 +580,7 @@ xfs_exchrange_prep( struct xfs_inode *ip1, struct xfs_inode *ip2) { + struct xfs_mount *mp = ip2->i_mount; unsigned int alloc_unit = xfs_inode_alloc_unitsize(ip2); int error; @@ -520,13 +590,18 @@ xfs_exchrange_prep( if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2)) return -EINVAL; - /* - * The alignment checks in the generic helpers cannot deal with - * allocation units that are not powers of 2. This can happen with the - * realtime volume if the extent size is set. - */ - if (!is_power_of_2(alloc_unit)) - return -EOPNOTSUPP; + /* Check non-power of two alignment issues, if necessary. */ + if (!is_power_of_2(alloc_unit)) { + error = xfs_exchrange_check_rtalign(fxr, ip1, ip2, alloc_unit); + if (error) + return error; + + /* + * Do the generic file-level checks with the regular block + * alignment. + */ + alloc_unit = mp->m_sb.sb_blocksize; + } error = xfs_exchange_range_prep(fxr, alloc_unit); if (error || fxr->length == 0) ^ permalink raw reply related [flat|nested] 100+ messages in thread
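Editorial sketch for patch 12/15: the commit message says that bitmask alignment checks only work for power-of-two sizes. The following userspace example shows why, using a hypothetical allocation unit of 12288 bytes (three 4096-byte blocks). The toy_* helpers are illustrative names and not kernel functions; toy_check_offsets() models only the start-offset test from xfs_exchrange_check_rtalign(), not the EOF handling.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Bitmask test: only gives correct answers when unit is a power of two. */
static bool toy_aligned_mask(uint64_t x, uint64_t unit)
{
	return (x & (unit - 1)) == 0;
}

/* Division test: correct for any nonzero unit, but slower. */
static bool toy_aligned_div(uint64_t x, uint64_t unit)
{
	return x % unit == 0;
}

static bool toy_is_power_of_2(uint64_t n)
{
	return n != 0 && (n & (n - 1)) == 0;
}

/* Both range starts must sit on an allocation unit boundary. */
static int toy_check_offsets(uint64_t off1, uint64_t off2, uint64_t unit)
{
	if (!toy_aligned_div(off1, unit) || !toy_aligned_div(off2, unit))
		return -EINVAL;
	return 0;
}
```

For a unit of 12288, the mask test rejects offset 12288 (which is aligned) and accepts offset 16384 (which is not), so the division-based test is the only safe choice whenever toy_is_power_of_2(unit) is false. For a unit of 4096 the two tests agree.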
* [PATCH 13/15] xfs: capture inode generation numbers in the ondisk exchmaps log item 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (11 preceding siblings ...) 2024-04-15 23:43 ` [PATCH 12/15] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong @ 2024-04-15 23:44 ` Darrick J. Wong 2024-04-15 23:44 ` [PATCH 14/15] docs: update swapext -> exchmaps language Darrick J. Wong 2024-04-15 23:44 ` [PATCH 15/15] xfs: enable logged file mapping exchange feature Darrick J. Wong 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:44 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Per some very late review comments, capture the generation numbers of both inodes involved in a file content exchange operation so that we don't accidentally target files that have been reallocated. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_log_format.h | 2 ++ fs/xfs/libxfs/xfs_log_recover.h | 2 ++ fs/xfs/xfs_exchmaps_item.c | 25 ++++++++++++++++++++----- fs/xfs/xfs_log_recover.c | 31 +++++++++++++++++++++++++++++++ 4 files changed, 55 insertions(+), 5 deletions(-) diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h index 8dbe1f997dfd..accba2acd623 100644 --- a/fs/xfs/libxfs/xfs_log_format.h +++ b/fs/xfs/libxfs/xfs_log_format.h @@ -896,6 +896,8 @@ struct xfs_xmi_log_format { uint64_t xmi_inode1; /* inumber of first file */ uint64_t xmi_inode2; /* inumber of second file */ + uint32_t xmi_igen1; /* generation of first file */ + uint32_t xmi_igen2; /* generation of second file */ uint64_t xmi_startoff1; /* block offset into file1 */ uint64_t xmi_startoff2; /* block offset into file2 */ uint64_t xmi_blockcount; /* number of blocks */ diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h 
index 47b758b49cb3..521d327e4c89 100644 --- a/fs/xfs/libxfs/xfs_log_recover.h +++ b/fs/xfs/libxfs/xfs_log_recover.h @@ -123,6 +123,8 @@ bool xlog_is_buffer_cancelled(struct xlog *log, xfs_daddr_t blkno, uint len); int xlog_recover_iget(struct xfs_mount *mp, xfs_ino_t ino, struct xfs_inode **ipp); +int xlog_recover_iget_handle(struct xfs_mount *mp, xfs_ino_t ino, uint32_t gen, + struct xfs_inode **ipp); void xlog_recover_release_intent(struct xlog *log, unsigned short intent_type, uint64_t intent_id); int xlog_alloc_buf_cancel_table(struct xlog *log); diff --git a/fs/xfs/xfs_exchmaps_item.c b/fs/xfs/xfs_exchmaps_item.c index a40216f33214..264a121c5e16 100644 --- a/fs/xfs/xfs_exchmaps_item.c +++ b/fs/xfs/xfs_exchmaps_item.c @@ -231,7 +231,9 @@ xfs_exchmaps_create_intent( xlf = &xmi_lip->xmi_format; xlf->xmi_inode1 = xmi->xmi_ip1->i_ino; + xlf->xmi_igen1 = VFS_I(xmi->xmi_ip1)->i_generation; xlf->xmi_inode2 = xmi->xmi_ip2->i_ino; + xlf->xmi_igen2 = VFS_I(xmi->xmi_ip2)->i_generation; xlf->xmi_startoff1 = xmi->xmi_startoff1; xlf->xmi_startoff2 = xmi->xmi_startoff2; xlf->xmi_blockcount = xmi->xmi_blockcount; @@ -368,14 +370,25 @@ xfs_xmi_item_recover_intent( /* * Grab both inodes and set IRECOVERY to prevent trimming of post-eof * mappings and freeing of unlinked inodes until we're totally done - * processing files. + * processing files. The ondisk format of this new log item contains + * file handle information, which is why recovery for other items does + * not check the inode generation number. 
*/ - error = xlog_recover_iget(mp, xlf->xmi_inode1, &ip1); - if (error) + error = xlog_recover_iget_handle(mp, xlf->xmi_inode1, xlf->xmi_igen1, + &ip1); + if (error) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf, + sizeof(*xlf)); return ERR_PTR(error); - error = xlog_recover_iget(mp, xlf->xmi_inode2, &ip2); - if (error) + } + + error = xlog_recover_iget_handle(mp, xlf->xmi_inode2, xlf->xmi_igen2, + &ip2); + if (error) { + XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp, xlf, + sizeof(*xlf)); goto err_rele1; + } req->ip1 = ip1; req->ip2 = ip2; @@ -485,6 +498,8 @@ xfs_exchmaps_relog_intent( new_xlf->xmi_inode1 = old_xlf->xmi_inode1; new_xlf->xmi_inode2 = old_xlf->xmi_inode2; + new_xlf->xmi_igen1 = old_xlf->xmi_igen1; + new_xlf->xmi_igen2 = old_xlf->xmi_igen2; new_xlf->xmi_startoff1 = old_xlf->xmi_startoff1; new_xlf->xmi_startoff2 = old_xlf->xmi_startoff2; new_xlf->xmi_blockcount = old_xlf->xmi_blockcount; diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c index 1e5ba95adf2c..b445e8ce4a7d 100644 --- a/fs/xfs/xfs_log_recover.c +++ b/fs/xfs/xfs_log_recover.c @@ -1767,6 +1767,37 @@ xlog_recover_iget( return 0; } +/* + * Get an inode so that we can recover a log operation. + * + * Log intent items that target inodes effectively contain a file handle. + * Check that the generation number matches the intent item like we do for + * other file handles. Log intent items defined after this validation weakness + * was identified must use this function. 
+ */ +int +xlog_recover_iget_handle( + struct xfs_mount *mp, + xfs_ino_t ino, + uint32_t gen, + struct xfs_inode **ipp) +{ + struct xfs_inode *ip; + int error; + + error = xlog_recover_iget(mp, ino, &ip); + if (error) + return error; + + if (VFS_I(ip)->i_generation != gen) { + xfs_irele(ip); + return -EFSCORRUPTED; + } + + *ipp = ip; + return 0; +} + /****************************************************************************** * * Log recover routines ^ permalink raw reply related [flat|nested] 100+ messages in thread
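Editorial sketch for patch 13/15: the point of recording the generation number is that an inode number alone does not identify a file once the inode has been freed and reallocated. A minimal userspace model of the handle check follows. The toy_* names, the inode table, and the numeric error values are all assumptions made for illustration; only the compare-generation-then-fail logic mirrors xlog_recover_iget_handle().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in error numbers; the values are assumptions for this example. */
#define TOY_ENOENT		2
#define TOY_EFSCORRUPTED	117

/* Hypothetical in-memory inode: just a number and a generation. */
struct toy_inode {
	uint64_t ino;
	uint32_t gen;
};

/* A tiny fake inode cache. */
static struct toy_inode toy_icache[] = {
	{ 128, 1 },
	{ 129, 7 },	/* inode 129 has been reallocated several times */
};

/* Look up an inode by number only, as recovery of older intents does. */
static int toy_iget(uint64_t ino, struct toy_inode **ipp)
{
	size_t i;

	for (i = 0; i < sizeof(toy_icache) / sizeof(toy_icache[0]); i++) {
		if (toy_icache[i].ino == ino) {
			*ipp = &toy_icache[i];
			return 0;
		}
	}
	return -TOY_ENOENT;
}

/* Look up an inode by (number, generation), i.e. by file handle. */
static int toy_iget_handle(uint64_t ino, uint32_t gen, struct toy_inode **ipp)
{
	struct toy_inode *ip;
	int error;

	error = toy_iget(ino, &ip);
	if (error)
		return error;

	/* A stale generation means the logged item targets a different file. */
	if (ip->gen != gen)
		return -TOY_EFSCORRUPTED;

	*ipp = ip;
	return 0;
}
```

In the patch, the failing path also releases the inode reference before returning; the toy cache has no reference counting, so that step is omitted here.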
* [PATCH 14/15] docs: update swapext -> exchmaps language 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (12 preceding siblings ...) 2024-04-15 23:44 ` [PATCH 13/15] xfs: capture inode generation numbers in the ondisk exchmaps log item Darrick J. Wong @ 2024-04-15 23:44 ` Darrick J. Wong 2024-04-15 23:44 ` [PATCH 15/15] xfs: enable logged file mapping exchange feature Darrick J. Wong 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:44 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Start reworking the atomic swapext design documentation to refer to its new file contents/mapping exchange name. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 259 +++++++++++--------- 1 file changed, 136 insertions(+), 123 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 1d161752f09e..3afa1bc5f47c 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate`` function frees them all because compaction is not needed. The details of repairing directories and extended attributes will be discussed -in a subsequent section about atomic extent swapping. +in a subsequent section about atomic file content exchanges. However, it should be noted that these repair functions only use blob storage to cache a small number of entries before adding them to a temporary ondisk file, which is why compaction is not required. 
@@ -2802,7 +2802,8 @@ follows this format: Repairs for file-based metadata such as extended attributes, directories, symbolic links, quota files and realtime bitmaps are performed by building a -new structure attached to a temporary file and swapping the forks. +new structure attached to a temporary file and exchanging all mappings in the +file forks. Afterward, the mappings in the old file fork are the candidate blocks for disposal. @@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs cannot be staged in memory, even when a paging scheme is available. Therefore, online repair of file-based metadata createas a temporary file in the XFS filesystem, writes a new structure at the correct offsets into the -temporary file, and atomically swaps the fork mappings (and hence the fork -contents) to commit the repair. +temporary file, and atomically exchanges all file fork mappings (and hence the +fork contents) to commit the repair. Once the repair is complete, the old fork can be reaped as necessary; if the system goes down during the reap, the iunlink code will delete the blocks during log recovery. @@ -3862,10 +3863,11 @@ consistent to use a temporary file safely! This dependency is the reason why online repair can only use pageable kernel memory to stage ondisk space usage information. -Swapping metadata extents with a temporary file requires the owner field of the -block headers to match the file being repaired and not the temporary file. The -directory, extended attribute, and symbolic link functions were all modified to -allow callers to specify owner numbers explicitly. +Exchanging metadata file mappings with a temporary file requires the owner +field of the block headers to match the file being repaired and not the +temporary file. +The directory, extended attribute, and symbolic link functions were all +modified to allow callers to specify owner numbers explicitly. 
There is a downside to the reaping process -- if the system crashes during the reap phase and the fork extents are crosslinked, the iunlink processing will @@ -3974,8 +3976,8 @@ The proposed patches are in the <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_ series. -Atomic Extent Swapping ----------------------- +Logged File Content Exchanges +----------------------------- Once repair builds a temporary file with a new data structure written into it, it must commit the new changes into the existing file. @@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must These problems are overcome by creating a new deferred operation and a new type of log intent item to track the progress of an operation to exchange two file ranges. -The new deferred operation type chains together the same transactions used by -the reverse-mapping extent swap code. +The new exchange operation type chains together the same transactions used by +the reverse-mapping extent swap code, but records intermediate progress in the +log so that operations can be restarted after a crash. +This new functionality is called the file contents exchange (xfs_exchrange) +code. +The underlying implementation exchanges file fork mappings (xfs_exchmaps). The new log item records the progress of the exchange to ensure that once an exchange begins, it will always run to completion, even there are interruptions. -The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag +The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag in the superblock protects these new log item records from being replayed on old kernels. The proposed patchset is the -`atomic extent swap +`file contents exchange <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_ series. @@ -4061,72 +4067,73 @@ series. 
| The feature bit will not be cleared from the superblock until the log | | becomes clean. | | | -| Log-assisted extended attribute updates and atomic extent swaps both use | -| log incompat features and provide convenience wrappers around the | +| Log-assisted extended attribute updates and file content exchanges both | +| use log incompat features and provide convenience wrappers around the | | functionality. | +--------------------------------------------------------------------------+ -Mechanics of an Atomic Extent Swap -`````````````````````````````````` +Mechanics of a Logged File Content Exchange +``````````````````````````````````````````` -Swapping entire file forks is a complex task. +Exchanging contents between file forks is a complex task. The goal is to exchange all file fork mappings between two file fork offset ranges. There are likely to be many extent mappings in each fork, and the edges of the mappings aren't necessarily aligned. -Furthermore, there may be other updates that need to happen after the swap, +Furthermore, there may be other updates that need to happen after the exchange, such as exchanging file sizes, inode flags, or conversion of fork data to local format. -This is roughly the format of the new deferred extent swap work item: +This is roughly the format of the new deferred exchange-mapping work item: .. code-block:: c - struct xfs_swapext_intent { + struct xfs_exchmaps_intent { /* Inodes participating in the operation. */ - struct xfs_inode *sxi_ip1; - struct xfs_inode *sxi_ip2; + struct xfs_inode *xmi_ip1; + struct xfs_inode *xmi_ip2; /* File offset range information. */ - xfs_fileoff_t sxi_startoff1; - xfs_fileoff_t sxi_startoff2; - xfs_filblks_t sxi_blockcount; + xfs_fileoff_t xmi_startoff1; + xfs_fileoff_t xmi_startoff2; + xfs_filblks_t xmi_blockcount; /* Set these file sizes after the operation, unless negative. 
*/ - xfs_fsize_t sxi_isize1; - xfs_fsize_t sxi_isize2; + xfs_fsize_t xmi_isize1; + xfs_fsize_t xmi_isize2; - /* XFS_SWAP_EXT_* log operation flags */ - uint64_t sxi_flags; + /* XFS_EXCHMAPS_* log operation flags */ + uint64_t xmi_flags; }; The new log intent item contains enough information to track two logical fork offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2, blockcount)``. -Each step of a swap operation exchanges the largest file range mapping possible -from one file to the other. -After each step in the swap operation, the two startoff fields are incremented -and the blockcount field is decremented to reflect the progress made. -The flags field captures behavioral parameters such as swapping the attr fork -instead of the data fork and other work to be done after the extent swap. -The two isize fields are used to swap the file size at the end of the operation -if the file data fork is the target of the swap operation. +Each step of an exchange operation exchanges the largest file range mapping +possible from one file to the other. +After each step in the exchange operation, the two startoff fields are +incremented and the blockcount field is decremented to reflect the progress +made. +The flags field captures behavioral parameters such as exchanging attr fork +mappings instead of the data fork and other work to be done after the exchange. +The two isize fields are used to exchange the file sizes at the end of the +operation if the file data fork is the target of the operation. -When the extent swap is initiated, the sequence of operations is as follows: +When the exchange is initiated, the sequence of operations is as follows: -1. Create a deferred work item for the extent swap. - At the start, it should contain the entirety of the file ranges to be - swapped. +1. Create a deferred work item for the file mapping exchange. + At the start, it should contain the entirety of the file block ranges to be + exchanged. 2. 
Call ``xfs_defer_finish`` to process the exchange. - This is encapsulated in ``xrep_tempswap_contents`` for scrub operations. + This is encapsulated in ``xrep_tempexch_contents`` for scrub operations. This will log an extent swap intent item to the transaction for the deferred - extent swap work item. + mapping exchange work item. -3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero, +3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero, - a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and - ``sxi_startoff2``, respectively, and compute the longest extent that can - be swapped in a single step. + a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and + ``xmi_startoff2``, respectively, and compute the longest extent that can + be exchanged in a single step. This is the minimum of the two ``br_blockcount`` s in the mappings. Keep advancing through the file forks until at least one of the mappings contains written blocks. @@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows: g. Extend the ondisk size of either file if necessary. - h. Log an extent swap done log item for the extent swap intent log item - that was read at the start of step 3. + h. Log a mapping exchange done log item for the mapping exchange intent log + item that was read at the start of step 3. i. Compute the amount of file range that has just been covered. This quantity is ``(map1.br_startoff + map1.br_blockcount - - sxi_startoff1)``, because step 3a could have skipped holes. + xmi_startoff1)``, because step 3a could have skipped holes. - j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2`` + j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2`` by the number of blocks computed in the previous step, and decrease - ``sxi_blockcount`` by the same quantity. + ``xmi_blockcount`` by the same quantity. 
This advances the cursor. - k. Log a new extent swap intent log item reflecting the advanced state of - the work item. + k. Log a new mapping exchange intent log item reflecting the advanced state + of the work item. l. Return the proper error code (EAGAIN) to the deferred operation manager to inform it that there is more work to be done. @@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows: This will be discussed in more detail in subsequent sections. If the filesystem goes down in the middle of an operation, log recovery will -find the most recent unfinished extent swap log intent item and restart from -there. -This is how extent swapping guarantees that an outside observer will either see -the old broken structure or the new one, and never a mismash of both. +find the most recent unfinished mapping exchange log intent item and restart +from there. +This is how atomic file mapping exchanges guarantee that an outside observer +will either see the old broken structure or the new one, and never a mishmash of +both. -Preparation for Extent Swapping -``````````````````````````````` +Preparation for File Content Exchanges +`````````````````````````````````````` There are a few things that need to be taken care of before initiating an -atomic extent swap operation. +atomic file mapping exchange operation. First, regular files require the page cache to be flushed to disk before the operation begins, and directio writes to be quiesced. -Like any filesystem operation, extent swapping must determine the maximum -amount of disk space and quota that can be consumed on behalf of both files in -the operation, and reserve that quantity of resources to avoid an unrecoverable -out of space failure once it starts dirtying metadata. 
+Like any filesystem operation, file mapping exchanges must determine the +maximum amount of disk space and quota that can be consumed on behalf of both +files in the operation, and reserve that quantity of resources to avoid an +unrecoverable out of space failure once it starts dirtying metadata. The preparation step scans the ranges of both files to estimate: - Data device blocks needed to handle the repeated updates to the fork @@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate: to different extents on the realtime volume, which could happen if the operation fails to run to completion. -The need for precise estimation increases the run time of the swap operation, -but it is very important to maintain correct accounting. -The filesystem must not run completely out of free space, nor can the extent -swap ever add more extent mappings to a fork than it can support. +The need for precise estimation increases the run time of the exchange +operation, but it is very important to maintain correct accounting. +The filesystem must not run completely out of free space, nor can the mapping +exchange ever add more extent mappings to a fork than it can support. Regular users are required to abide the quota limits, though metadata repairs may exceed quota to resolve inconsistent metadata elsewhere. -Special Features for Swapping Metadata File Extents -``````````````````````````````````````````````````` +Special Features for Exchanging Metadata File Contents +`````````````````````````````````````````````````````` Extended attributes, symbolic links, and directories can set the fork format to "local" and treat the fork as a literal area for data storage. Metadata repairs must take extra steps to support these cases: - If both forks are in local format and the fork areas are large enough, the - swap is performed by copying the incore fork contents, logging both forks, - and committing. 
- The atomic extent swap mechanism is not necessary, since this can be done - with a single transaction. + exchange is performed by copying the incore fork contents, logging both + forks, and committing. + The atomic file mapping exchange mechanism is not necessary, since this can + be done with a single transaction. -- If both forks map blocks, then the regular atomic extent swap is used. +- If both forks map blocks, then the regular atomic file mapping exchange is + used. - Otherwise, only one fork is in local format. The contents of the local format fork are converted to a block to perform the - swap. + exchange. The conversion to block format must be done in the same transaction that - logs the initial extent swap intent log item. - The regular atomic extent swap is used to exchange the mappings. - Special flags are set on the swap operation so that the transaction can be - rolled one more time to convert the second file's fork back to local format - so that the second file will be ready to go as soon as the ILOCK is dropped. + logs the initial mapping exchange intent log item. + The regular atomic mapping exchange is used to exchange the metadata file + mappings. + Special flags are set on the exchange operation so that the transaction can + be rolled one more time to convert the second file's fork back to local + format so that the second file will be ready to go as soon as the ILOCK is + dropped. Extended attributes and directories stamp the owning inode into every block, but the buffer verifiers do not actually check the inode number! Although there is no verification, it is still important to maintain -referential integrity, so prior to performing the extent swap, online repair -builds every block in the new data structure with the owner field of the file -being repaired. +referential integrity, so prior to performing the mapping exchange, online +repair builds every block in the new data structure with the owner field of the +file being repaired. 
-After a successful swap operation, the repair operation must reap the old fork -blocks by processing each fork mapping through the standard :ref:`file extent -reaping <reaping>` mechanism that is done post-repair. +After a successful exchange operation, the repair operation must reap the old +fork blocks by processing each fork mapping through the standard :ref:`file +extent reaping <reaping>` mechanism that is done post-repair. If the filesystem should go down during the reap part of the repair, the iunlink processing at the end of recovery will free both the temporary file and whatever blocks were not reaped. However, this iunlink processing omits the cross-link detection of online repair, and is not completely foolproof. -Swapping Temporary File Extents -``````````````````````````````` +Exchanging Temporary File Contents +`````````````````````````````````` To repair a metadata file, online repair proceeds as follows: @@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows: file. The same fork must be written to as is being repaired. -3. Commit the scrub transaction, since the swap estimation step must be - completed before transaction reservations are made. +3. Commit the scrub transaction, since the exchange resource estimation step + must be completed before transaction reservations are made. -4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with +4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with the appropriate resource reservations, locks, and fill out a ``struct - xfs_swapext_req`` with the details of the swap operation. + xfs_exchmaps_req`` with the details of the exchange operation. -5. Call ``xrep_tempswap_contents`` to swap the contents. +5. Call ``xrep_tempexch_contents`` to exchange the contents. 6. Commit the transaction to complete the repair. @@ -4309,7 +4320,7 @@ To check the summary file against the bitmap: 3. 
Compare the contents of the xfile against the ondisk file. To repair the summary file, write the xfile contents into the temporary file -and use atomic extent swap to commit the new contents. +and use atomic mapping exchange to commit the new contents. The temporary file is then reaped. The proposed patchset is the @@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows: memory or there are no more attr fork blocks to examine, unlock the file and add the staged extended attributes to the temporary file. -3. Use atomic extent swapping to exchange the new and old extended attribute - structures. +3. Use atomic file mapping exchange to exchange the new and old extended + attribute structures. The old attribute blocks are now attached to the temporary file. 4. Reap the temporary file. @@ -4410,7 +4421,8 @@ salvaging directories is straightforward: directory and add the staged dirents into the temporary directory. Truncate the staging files. -4. Use atomic extent swapping to exchange the new and old directory structures. +4. Use atomic file mapping exchange to exchange the new and old directory + structures. The old directory blocks are now attached to the temporary file. 5. Reap the temporary file. @@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows: Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed updates to the temporary directory. -5. When the scan is complete, atomically swap the contents of the temporary +5. When the scan is complete, atomically exchange the contents of the temporary directory and the directory being repaired. The temporary directory now contains the damaged directory structure. @@ -4629,8 +4641,8 @@ directory reconstruction: 5. Copy all non-parent pointer extended attributes to the temporary file. -6. When the scan is complete, atomically swap the attribute fork of the - temporary file and the file being repaired. +6. 
When the scan is complete, atomically exchange the mappings of the attribute + forks of the temporary file and the file being repaired. The temporary file now contains the damaged extended attribute structure. 7. Reap the temporary file. @@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it has been built, and why. Please feel free to contact the XFS mailing list with questions. -FIEXCHANGE_RANGE ----------------- +XFS_IOC_EXCHANGE_RANGE +---------------------- -As discussed earlier, a second frontend to the atomic extent swap mechanism is -a new ioctl call that userspace programs can use to commit updates to files -atomically. +As discussed earlier, a second frontend to the atomic file mapping exchange +mechanism is a new ioctl call that userspace programs can use to commit updates +to files atomically. This frontend has been out for review for several years now, though the necessary refinements to online repair and lack of customer demand mean that the proposal has not been pushed very hard. -Extent Swapping with Regular User Files -``````````````````````````````````````` +File Content Exchanges with Regular User Files +`````````````````````````````````````````````` As mentioned earlier, XFS has long had the ability to swap extents between files, which is used almost exclusively by ``xfs_fsr`` to defragment files. @@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to develop an iterative mechanism that used deferred bmap and rmap operations to swap mappings one at a time. This mechanism is identical to steps 2-3 from the procedure above except for -the new tracking items, because the atomic extent swap mechanism is an -iteration of an existing mechanism and not something totally novel. +the new tracking items, because the atomic file mapping exchange mechanism is +an iteration of an existing mechanism and not something totally novel. 
For the narrow case of file defragmentation, the file contents must be identical, so the recovery guarantees are not much of a gain. -Atomic extent swapping is much more flexible than the existing swapext +Atomic file content exchanges are much more flexible than the existing swapext implementations because it can guarantee that the caller never sees a mix of old and new contents even after a crash, and it can operate on two arbitrary file fork ranges. @@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases: Next, it opens a temporary file and calls the file clone operation to reflink the first file's contents into the temporary file. Writes to the original file should instead be written to the temporary file. - Finally, the process calls the atomic extent swap system call - (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all - of the updates to the original file, or none of them. + Finally, the process calls the atomic file mapping exchange system call + (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby + committing all of the updates to the original file, or none of them. -.. _swapext_if_unchanged: +.. _exchrange_if_unchanged: - **Transactional file updates**: The same mechanism as above, but the caller only wants the commit to occur if the original file's contents have not @@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases: change timestamps of the original file before reflinking its data to the temporary file. When the program is ready to commit the changes, it passes the timestamps - into the kernel as arguments to the atomic extent swap system call. + into the kernel as arguments to the atomic file mapping exchange system call. The kernel only commits the changes if the provided timestamps match the original file. + A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this. 
- **Emulation of atomic block device writes**: Export a block device with a logical sector size matching the filesystem block size to force all writes to be aligned to the filesystem block size. Stage all writes to a temporary file, and when that is complete, call the - atomic extent swap system call with a flag to indicate that holes in the - temporary file should be ignored. + atomic file mapping exchange system call with a flag to indicate that holes + in the temporary file should be ignored. This emulates an atomic device write in software, and can support arbitrary scattered writes. @@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file. Cloning the extent means that the original owners cannot overwrite the contents; any changes will be written somewhere else via copy-on-write. Clearspace makes its own copy of the frozen extent in an area that is not being -cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap -<swapext_if_unchanged>` feature) to change the target file's data extent +cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic file content exchanges +<exchrange_if_unchanged>` feature) to change the target file's data extent mapping away from the area being cleared. When all other mappings have been moved, clearspace reflinks the space into the space collector file so that it becomes unavailable. ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 15/15] xfs: enable logged file mapping exchange feature 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong ` (13 preceding siblings ...) 2024-04-15 23:44 ` [PATCH 14/15] docs: update swapext -> exchmaps language Darrick J. Wong @ 2024-04-15 23:44 ` Darrick J. Wong 14 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:44 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add the XFS_SB_FEAT_INCOMPAT_EXCHRANGE feature to the set of features that we will permit when mounting a filesystem. This turns on support for the file range exchange feature. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_format.h | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index ff1e28316e1b..10153ce116d4 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -380,7 +380,8 @@ xfs_sb_has_ro_compat_feature( XFS_SB_FEAT_INCOMPAT_META_UUID | \ XFS_SB_FEAT_INCOMPAT_BIGTIME | \ XFS_SB_FEAT_INCOMPAT_NEEDSREPAIR | \ - XFS_SB_FEAT_INCOMPAT_NREXT64) + XFS_SB_FEAT_INCOMPAT_NREXT64 | \ + XFS_SB_FEAT_INCOMPAT_EXCHRANGE) #define XFS_SB_FEAT_INCOMPAT_UNKNOWN ~XFS_SB_FEAT_INCOMPAT_ALL static inline bool ^ permalink raw reply related [flat|nested] 100+ messages in thread
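The effect of adding a bit to the "all known features" mask in the patch above can be shown with a small standalone sketch. The bit values and the `DEMO_` names here are illustrative assumptions, not the definitions from `fs/xfs/libxfs/xfs_format.h`; the only idea taken from the patch is that the unknown mask is the complement of the known mask.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative bit positions only; see xfs_format.h for the real ones. */
#define DEMO_FEAT_INCOMPAT_NREXT64	(1u << 0)
#define DEMO_FEAT_INCOMPAT_EXCHRANGE	(1u << 1)

/* Known-feature masks before and after a patch like the one above. */
#define DEMO_FEAT_INCOMPAT_ALL_OLD	(DEMO_FEAT_INCOMPAT_NREXT64)
#define DEMO_FEAT_INCOMPAT_ALL_NEW	(DEMO_FEAT_INCOMPAT_NREXT64 | \
					 DEMO_FEAT_INCOMPAT_EXCHRANGE)

/*
 * A superblock carries an unknown incompat feature when any of its
 * feature bits falls outside the mask of features this code understands.
 */
static bool demo_has_unknown_incompat(uint32_t sb_features,
				      uint32_t known_mask)
{
	return (sb_features & ~known_mask) != 0;
}
```

With the old mask, a superblock that has the exchange-range bit set is reported as carrying an unknown feature; with the new mask the same superblock passes the check.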
* [PATCHSET v30.3 04/16] xfs: create temporary files for online repair 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:34 ` [PATCHSET v30.3 03/16] xfs: atomic file content exchanges Darrick J. Wong @ 2024-04-15 23:34 ` Darrick J. Wong 2024-04-15 23:44 ` [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions Darrick J. Wong ` (3 more replies) 2024-04-15 23:34 ` [PATCHSET v30.3 05/16] xfs: online repair of realtime summaries Darrick J. Wong ` (11 subsequent siblings) 15 siblings, 4 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:34 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs Hi all, As mentioned earlier, the repair strategy for file-based metadata is to build a new copy in a temporary file and swap the file fork mappings with the metadata inode. We've built the atomic extent swap facility, so now we need to build a facility for handling private temporary files. The first step is to teach the filesystem to ignore the temporary files. We'll mark them as PRIVATE in the VFS so that the kernel security modules will leave them alone. The second step is to add to the online repair code the ability to create a temporary file and reap extents from the temporary file after the extent swap. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles-6.10 --- Commits in this patchset: * xfs: hide private inodes from bulkstat and handle functions * xfs: create temporary files and directories for online repair * xfs: refactor live buffer invalidation for repairs * xfs: add the ability to reap entire inode forks --- fs/xfs/Makefile | 1 fs/xfs/scrub/parent.c | 2 fs/xfs/scrub/reap.c | 445 +++++++++++++++++++++++++++++++++++++++++++++-- fs/xfs/scrub/reap.h | 21 ++ fs/xfs/scrub/scrub.c | 3 fs/xfs/scrub/scrub.h | 4 fs/xfs/scrub/tempfile.c | 251 +++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.h | 28 +++ fs/xfs/scrub/trace.h | 96 ++++++++++ fs/xfs/xfs_export.c | 2 fs/xfs/xfs_inode.c | 3 fs/xfs/xfs_inode.h | 2 fs/xfs/xfs_iops.c | 3 fs/xfs/xfs_itable.c | 8 + 14 files changed, 843 insertions(+), 26 deletions(-) create mode 100644 fs/xfs/scrub/tempfile.c create mode 100644 fs/xfs/scrub/tempfile.h ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions 2024-04-15 23:34 ` [PATCHSET v30.3 04/16] xfs: create temporary files for online repair Darrick J. Wong @ 2024-04-15 23:44 ` Darrick J. Wong 2024-04-15 23:45 ` [PATCH 2/4] xfs: create temporary files and directories for online repair Darrick J. Wong ` (2 subsequent siblings) 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:44 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs From: Darrick J. Wong <djwong@kernel.org> We're about to start adding functionality that uses internal inodes that are private to XFS. What this means is that userspace should never be able to access any information about these files, and should not be able to open these files by handle. To prevent users from ever finding the file, and to avoid mis-interactions with the security apparatus, set S_PRIVATE on the inode. Don't allow bulkstat, open-by-handle, or linking of S_PRIVATE files into the directory tree. This should keep private inodes actually private. Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_export.c | 2 +- fs/xfs/xfs_iops.c | 3 +++ fs/xfs/xfs_itable.c | 8 ++++++++ 3 files changed, 12 insertions(+), 1 deletion(-) diff --git a/fs/xfs/xfs_export.c b/fs/xfs/xfs_export.c index 7cd09c3a82cb..4b03221351c0 100644 --- a/fs/xfs/xfs_export.c +++ b/fs/xfs/xfs_export.c @@ -160,7 +160,7 @@ xfs_nfs_get_inode( } } - if (VFS_I(ip)->i_generation != generation) { + if (VFS_I(ip)->i_generation != generation || IS_PRIVATE(VFS_I(ip))) { xfs_irele(ip); return ERR_PTR(-ESTALE); } diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c index 55ed2d1023d6..7f0c840f0fd2 100644 --- a/fs/xfs/xfs_iops.c +++ b/fs/xfs/xfs_iops.c @@ -365,6 +365,9 @@ xfs_vn_link( if (unlikely(error)) return error; + if (IS_PRIVATE(inode)) + return -EPERM; + error = xfs_link(XFS_I(dir), XFS_I(inode), &name); if (unlikely(error)) return error; diff --git a/fs/xfs/xfs_itable.c b/fs/xfs/xfs_itable.c index 95fc31b9f87d..c0757ab99495 100644 --- a/fs/xfs/xfs_itable.c +++ b/fs/xfs/xfs_itable.c @@ -97,6 +97,14 @@ xfs_bulkstat_one_int( vfsuid = i_uid_into_vfsuid(idmap, inode); vfsgid = i_gid_into_vfsgid(idmap, inode); + /* If this is a private inode, don't leak its details to userspace. */ + if (IS_PRIVATE(inode)) { + xfs_iunlock(ip, XFS_ILOCK_SHARED); + xfs_irele(ip); + error = -EINVAL; + goto out_advance; + } + /* xfs_iget returns the following without needing * further change. */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
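The three gates added by the patch above (handle lookup, link, bulkstat) can be summarized in a userspace model. This is a sketch only: `struct demo_inode` and the `demo_*` functions are hypothetical stand-ins, and the real code operates on `struct inode` via `IS_PRIVATE()` with locking and reference handling that is omitted here. The error codes are the ones that appear in the diff.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the inode state the three checks look at. */
struct demo_inode {
	uint32_t generation;	/* models i_generation */
	bool is_private;	/* models the S_PRIVATE flag */
};

/* Handle lookup: a private inode looks like a stale handle. */
static int demo_open_by_handle(const struct demo_inode *ip,
			       uint32_t generation)
{
	if (ip->generation != generation || ip->is_private)
		return -ESTALE;
	return 0;
}

/* Link: a private inode may not be linked into the directory tree. */
static int demo_link(const struct demo_inode *ip)
{
	if (ip->is_private)
		return -EPERM;
	return 0;
}

/* Bulkstat: details of a private inode are not reported. */
static int demo_bulkstat_one(const struct demo_inode *ip)
{
	if (ip->is_private)
		return -EINVAL;
	return 0;
}
```

Returning a stale-handle error for the lookup case means a caller cannot distinguish a private inode from one whose handle has simply expired, which fits the stated goal that userspace should never learn anything about these files.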
* [PATCH 2/4] xfs: create temporary files and directories for online repair 2024-04-15 23:34 ` [PATCHSET v30.3 04/16] xfs: create temporary files for online repair Darrick J. Wong 2024-04-15 23:44 ` [PATCH 1/4] xfs: hide private inodes from bulkstat and handle functions Darrick J. Wong @ 2024-04-15 23:45 ` Darrick J. Wong 2024-04-15 23:45 ` [PATCH 3/4] xfs: refactor live buffer invalidation for repairs Darrick J. Wong 2024-04-15 23:45 ` [PATCH 4/4] xfs: add the ability to reap entire inode forks Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:45 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Teach the online repair code how to create temporary files or directories. These temporary files can be used to stage reconstructed information until we're ready to perform an atomic extent swap to commit the new metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/parent.c | 2 fs/xfs/scrub/scrub.c | 3 + fs/xfs/scrub/scrub.h | 4 + fs/xfs/scrub/tempfile.c | 251 +++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.h | 28 +++++ fs/xfs/scrub/trace.h | 33 ++++++ fs/xfs/xfs_inode.c | 3 - fs/xfs/xfs_inode.h | 2 9 files changed, 324 insertions(+), 3 deletions(-) create mode 100644 fs/xfs/scrub/tempfile.c create mode 100644 fs/xfs/scrub/tempfile.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index b547a3dc03f8..ae8488ab4d6b 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -207,6 +207,7 @@ xfs-y += $(addprefix scrub/, \ refcount_repair.o \ repair.o \ rmap_repair.o \ + tempfile.o \ ) xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index 7db873672146..5da10ed1fe8c 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -143,7 +143,7 @@ xchk_parent_validate( } if 
(!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) return error; - if (dp == sc->ip || !S_ISDIR(VFS_I(dp)->i_mode)) { + if (dp == sc->ip || dp == sc->tempip || !S_ISDIR(VFS_I(dp)->i_mode)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out_rele; } diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 20fac9723c08..d9012e9a6afd 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -17,6 +17,7 @@ #include "xfs_scrub.h" #include "xfs_buf_mem.h" #include "xfs_rmap.h" +#include "xfs_exchrange.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -24,6 +25,7 @@ #include "scrub/health.h" #include "scrub/stats.h" #include "scrub/xfile.h" +#include "scrub/tempfile.h" /* * Online Scrub and Repair @@ -211,6 +213,7 @@ xchk_teardown( sc->buf = NULL; } + xrep_tempfile_rele(sc); xchk_fsgates_disable(sc); return error; } diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 9ad65b604fe1..e37d8599718e 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -105,6 +105,10 @@ struct xfs_scrub { /* Lock flags for @ip. */ uint ilock_flags; + /* A temporary file on this filesystem, for staging new metadata. */ + struct xfs_inode *tempip; + uint temp_ilock_flags; + /* See the XCHK/XREP state flags below. */ unsigned int flags; diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c new file mode 100644 index 000000000000..68d245749bc1 --- /dev/null +++ b/fs/xfs/scrub/tempfile.c @@ -0,0 +1,251 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_inode.h" +#include "xfs_ialloc.h" +#include "xfs_quota.h" +#include "xfs_bmap_btree.h" +#include "xfs_trans_space.h" +#include "xfs_dir2.h" +#include "xfs_exchrange.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/tempfile.h" + +/* + * Create a temporary file for reconstructing metadata, with the intention of + * atomically exchanging the temporary file's contents with the file that's + * being repaired. + */ +int +xrep_tempfile_create( + struct xfs_scrub *sc, + uint16_t mode) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = NULL; + struct xfs_dquot *udqp = NULL; + struct xfs_dquot *gdqp = NULL; + struct xfs_dquot *pdqp = NULL; + struct xfs_trans_res *tres; + struct xfs_inode *dp = mp->m_rootip; + xfs_ino_t ino; + unsigned int resblks; + bool is_dir = S_ISDIR(mode); + int error; + + if (xfs_is_shutdown(mp)) + return -EIO; + if (xfs_is_readonly(mp)) + return -EROFS; + + ASSERT(sc->tp == NULL); + ASSERT(sc->tempip == NULL); + + /* + * Make sure that we have allocated dquot(s) on disk. The temporary + * inode should be completely root owned so that we don't fail due to + * quota limits. + */ + error = xfs_qm_vop_dqalloc(dp, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + XFS_QMOPT_QUOTALL, &udqp, &gdqp, &pdqp); + if (error) + return error; + + if (is_dir) { + resblks = XFS_MKDIR_SPACE_RES(mp, 0); + tres = &M_RES(mp)->tr_mkdir; + } else { + resblks = XFS_IALLOC_SPACE_RES(mp); + tres = &M_RES(mp)->tr_create_tmpfile; + } + + error = xfs_trans_alloc_icreate(mp, tres, udqp, gdqp, pdqp, resblks, + &tp); + if (error) + goto out_release_dquots; + + /* Allocate inode, set up directory. 
*/ + error = xfs_dialloc(&tp, dp->i_ino, mode, &ino); + if (error) + goto out_trans_cancel; + error = xfs_init_new_inode(&nop_mnt_idmap, tp, dp, ino, mode, 0, 0, + 0, false, &sc->tempip); + if (error) + goto out_trans_cancel; + + /* Change the ownership of the inode to root. */ + VFS_I(sc->tempip)->i_uid = GLOBAL_ROOT_UID; + VFS_I(sc->tempip)->i_gid = GLOBAL_ROOT_GID; + sc->tempip->i_diflags &= ~(XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT); + xfs_trans_log_inode(tp, sc->tempip, XFS_ILOG_CORE); + + /* + * Mark our temporary file as private so that LSMs and the ACL code + * don't try to add their own metadata or reason about these files. + * The file should never be exposed to userspace. + */ + VFS_I(sc->tempip)->i_flags |= S_PRIVATE; + VFS_I(sc->tempip)->i_opflags &= ~IOP_XATTR; + + if (is_dir) { + error = xfs_dir_init(tp, sc->tempip, dp); + if (error) + goto out_trans_cancel; + } + + /* + * Attach the dquot(s) to the inodes and modify them incore. + * These ids of the inode couldn't have changed since the new + * inode has been locked ever since it was created. + */ + xfs_qm_vop_create_dqattach(tp, sc->tempip, udqp, gdqp, pdqp); + + /* + * Put our temp file on the unlinked list so it's purged automatically. + * All file-based metadata being reconstructed using this file must be + * atomically exchanged with the original file because the contents + * here will be purged when the inode is dropped or log recovery cleans + * out the unlinked list. + */ + error = xfs_iunlink(tp, sc->tempip); + if (error) + goto out_trans_cancel; + + error = xfs_trans_commit(tp); + if (error) + goto out_release_inode; + + trace_xrep_tempfile_create(sc); + + xfs_qm_dqrele(udqp); + xfs_qm_dqrele(gdqp); + xfs_qm_dqrele(pdqp); + + /* Finish setting up the incore / vfs context. 
*/ + xfs_setup_iops(sc->tempip); + xfs_finish_inode_setup(sc->tempip); + + sc->temp_ilock_flags = 0; + return error; + +out_trans_cancel: + xfs_trans_cancel(tp); +out_release_inode: + /* + * Wait until after the current transaction is aborted to finish the + * setup of the inode and release the inode. This prevents recursive + * transactions and deadlocks from xfs_inactive. + */ + if (sc->tempip) { + xfs_finish_inode_setup(sc->tempip); + xchk_irele(sc, sc->tempip); + } +out_release_dquots: + xfs_qm_dqrele(udqp); + xfs_qm_dqrele(gdqp); + xfs_qm_dqrele(pdqp); + + return error; +} + +/* Take IOLOCK_EXCL on the temporary file, maybe. */ +bool +xrep_tempfile_iolock_nowait( + struct xfs_scrub *sc) +{ + if (xfs_ilock_nowait(sc->tempip, XFS_IOLOCK_EXCL)) { + sc->temp_ilock_flags |= XFS_IOLOCK_EXCL; + return true; + } + + return false; +} + +/* + * Take the temporary file's IOLOCK while holding a different inode's IOLOCK. + * In theory nobody else should hold the tempfile's IOLOCK, but we use trylock + * to avoid deadlocks and lockdep complaints. + */ +int +xrep_tempfile_iolock_polled( + struct xfs_scrub *sc) +{ + int error = 0; + + while (!xrep_tempfile_iolock_nowait(sc)) { + if (xchk_should_terminate(sc, &error)) + return error; + delay(1); + } + + return 0; +} + +/* Release IOLOCK_EXCL on the temporary file. */ +void +xrep_tempfile_iounlock( + struct xfs_scrub *sc) +{ + xfs_iunlock(sc->tempip, XFS_IOLOCK_EXCL); + sc->temp_ilock_flags &= ~XFS_IOLOCK_EXCL; +} + +/* Prepare the temporary file for metadata updates by grabbing ILOCK_EXCL. */ +void +xrep_tempfile_ilock( + struct xfs_scrub *sc) +{ + sc->temp_ilock_flags |= XFS_ILOCK_EXCL; + xfs_ilock(sc->tempip, XFS_ILOCK_EXCL); +} + +/* Try to grab ILOCK_EXCL on the temporary file. 
*/ +bool +xrep_tempfile_ilock_nowait( + struct xfs_scrub *sc) +{ + if (xfs_ilock_nowait(sc->tempip, XFS_ILOCK_EXCL)) { + sc->temp_ilock_flags |= XFS_ILOCK_EXCL; + return true; + } + + return false; +} + +/* Unlock ILOCK_EXCL on the temporary file after an update. */ +void +xrep_tempfile_iunlock( + struct xfs_scrub *sc) +{ + xfs_iunlock(sc->tempip, XFS_ILOCK_EXCL); + sc->temp_ilock_flags &= ~XFS_ILOCK_EXCL; +} + +/* Release the temporary file. */ +void +xrep_tempfile_rele( + struct xfs_scrub *sc) +{ + if (!sc->tempip) + return; + + if (sc->temp_ilock_flags) { + xfs_iunlock(sc->tempip, sc->temp_ilock_flags); + sc->temp_ilock_flags = 0; + } + + xchk_irele(sc, sc->tempip); + sc->tempip = NULL; +} diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h new file mode 100644 index 000000000000..e165e0a3faf6 --- /dev/null +++ b/fs/xfs/scrub/tempfile.h @@ -0,0 +1,28 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_TEMPFILE_H__ +#define __XFS_SCRUB_TEMPFILE_H__ + +#ifdef CONFIG_XFS_ONLINE_REPAIR +int xrep_tempfile_create(struct xfs_scrub *sc, uint16_t mode); +void xrep_tempfile_rele(struct xfs_scrub *sc); + +bool xrep_tempfile_iolock_nowait(struct xfs_scrub *sc); +int xrep_tempfile_iolock_polled(struct xfs_scrub *sc); +void xrep_tempfile_iounlock(struct xfs_scrub *sc); + +void xrep_tempfile_ilock(struct xfs_scrub *sc); +bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc); +void xrep_tempfile_iunlock(struct xfs_scrub *sc); +#else +static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc) +{ + xchk_ilock(sc, XFS_IOLOCK_EXCL); +} +# define xrep_tempfile_rele(sc) +#endif /* CONFIG_XFS_ONLINE_REPAIR */ + +#endif /* __XFS_SCRUB_TEMPFILE_H__ */ diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index b1c7c79760d4..020b029b7988 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2279,6 +2279,39 @@ TRACE_EVENT(xrep_rmap_live_update, __entry->flags) ); +TRACE_EVENT(xrep_tempfile_create, + TP_PROTO(struct xfs_scrub *sc), + TP_ARGS(sc), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(unsigned int, type) + __field(xfs_agnumber_t, agno) + __field(xfs_ino_t, inum) + __field(unsigned int, gen) + __field(unsigned int, flags) + __field(xfs_ino_t, temp_inum) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->ino = sc->file ? 
XFS_I(file_inode(sc->file))->i_ino : 0; + __entry->type = sc->sm->sm_type; + __entry->agno = sc->sm->sm_agno; + __entry->inum = sc->sm->sm_ino; + __entry->gen = sc->sm->sm_gen; + __entry->flags = sc->sm->sm_flags; + __entry->temp_inum = sc->tempip->i_ino; + ), + TP_printk("dev %d:%d ino 0x%llx type %s inum 0x%llx gen 0x%x flags 0x%x temp_inum 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __print_symbolic(__entry->type, XFS_SCRUB_TYPE_STRINGS), + __entry->inum, + __entry->gen, + __entry->flags, + __entry->temp_inum) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 492dae0efad2..ac92c0525d9b 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -42,7 +42,6 @@ struct kmem_cache *xfs_inode_cache; -STATIC int xfs_iunlink(struct xfs_trans *, struct xfs_inode *); STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag, struct xfs_inode *); @@ -2151,7 +2150,7 @@ xfs_iunlink_insert_inode( * We place the on-disk inode on a list in the AGI. It will be pulled from this * list when the inode is freed. */ -STATIC int +int xfs_iunlink( struct xfs_trans *tp, struct xfs_inode *ip) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index f559e68ee707..596eec715675 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -616,6 +616,8 @@ extern struct kmem_cache *xfs_inode_cache; bool xfs_inode_needs_inactive(struct xfs_inode *ip); +int xfs_iunlink(struct xfs_trans *tp, struct xfs_inode *ip); + void xfs_end_io(struct work_struct *work); int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); ^ permalink raw reply related [flat|nested] 100+ messages in thread
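The locking helpers in this patch record every lock they take in sc->temp_ilock_flags, which lets xrep_tempfile_rele drop whatever is still held at teardown. The following is only a userspace sketch of that bookkeeping, not kernel code: the toy_* names and the TOY_*_EXCL flag values are made up for illustration, and a plain bitmask stands in for the real inode locks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for XFS_IOLOCK_EXCL and XFS_ILOCK_EXCL. */
#define TOY_IOLOCK_EXCL	(1u << 0)
#define TOY_ILOCK_EXCL	(1u << 1)

/* Toy inode: one bit per lock that somebody currently holds. */
struct toy_inode {
	unsigned int	held;
};

/* Toy scrub context, mirroring sc->tempip and sc->temp_ilock_flags. */
struct toy_scrub {
	struct toy_inode	*tempip;
	unsigned int		temp_ilock_flags;
};

/* Try to take a lock; remember it in the context only on success. */
static bool toy_tempfile_trylock(struct toy_scrub *sc, unsigned int flag)
{
	if (sc->tempip->held & flag)
		return false;

	sc->tempip->held |= flag;
	sc->temp_ilock_flags |= flag;
	return true;
}

/* Drop one lock and forget that we held it. */
static void toy_tempfile_unlock(struct toy_scrub *sc, unsigned int flag)
{
	sc->tempip->held &= ~flag;
	sc->temp_ilock_flags &= ~flag;
}

/* Release the temp file: drop every lock still recorded, then detach. */
static void toy_tempfile_rele(struct toy_scrub *sc)
{
	if (!sc->tempip)
		return;

	if (sc->temp_ilock_flags) {
		sc->tempip->held &= ~sc->temp_ilock_flags;
		sc->temp_ilock_flags = 0;
	}

	sc->tempip = NULL;
}
```

Because the flags live in the scrub context, the teardown path does not need to know which repair function took which lock, and calling the release helper twice is harmless.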
* [PATCH 3/4] xfs: refactor live buffer invalidation for repairs
From: Darrick J. Wong @ 2024-04-15 23:45 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In an upcoming patch, we will need to be able to look for xfs_buf
objects caching file-based metadata blocks without needing to walk the
(possibly corrupt) structures to find all the buffers. Repair already
has most of the code needed to scan the buffer cache, so hoist these
utility functions.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/scrub/reap.c |   73 ++++++++++++++++++++++++++++++++++++---------------
 fs/xfs/scrub/reap.h |   20 ++++++++++++++
 2 files changed, 71 insertions(+), 22 deletions(-)

diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 0252a3b5b65a..7ae6253395e7 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -211,6 +211,48 @@ static inline void xreap_defer_finish_reset(struct xreap_state *rs)
 	rs->force_roll = false;
 }
 
+/*
+ * Compute the maximum length of a buffer cache scan (in units of sectors),
+ * given a quantity of fs blocks.
+ */
+xfs_daddr_t
+xrep_bufscan_max_sectors(
+	struct xfs_mount	*mp,
+	xfs_extlen_t		fsblocks)
+{
+	int			max_fsbs;
+
+	/* Remote xattr values are the largest buffers that we support. */
+	max_fsbs = xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX);
+
+	return XFS_FSB_TO_BB(mp, min_t(xfs_extlen_t, fsblocks, max_fsbs));
+}
+
+/*
+ * Return an incore buffer from a sector scan, or NULL if there are no buffers
+ * left to return.
+ */
+struct xfs_buf *
+xrep_bufscan_advance(
+	struct xfs_mount	*mp,
+	struct xrep_bufscan	*scan)
+{
+	scan->__sector_count += scan->daddr_step;
+	while (scan->__sector_count <= scan->max_sectors) {
+		struct xfs_buf	*bp = NULL;
+		int		error;
+
+		error = xfs_buf_incore(mp->m_ddev_targp, scan->daddr,
+				scan->__sector_count, XBF_LIVESCAN, &bp);
+		if (!error)
+			return bp;
+
+		scan->__sector_count += scan->daddr_step;
+	}
+
+	return NULL;
+}
+
 /* Try to invalidate the incore buffers for an extent that we're freeing. */
 STATIC void
 xreap_agextent_binval(
@@ -241,28 +283,15 @@ xreap_agextent_binval(
 	 * of any plausible size.
 	 */
 	while (bno < agbno_next) {
-		xfs_agblock_t	fsbcount;
-		xfs_agblock_t	max_fsbs;
-
-		/*
-		 * Max buffer size is the max remote xattr buffer size, which
-		 * is one fs block larger than 64k.
-		 */
-		max_fsbs = min_t(xfs_agblock_t, agbno_next - bno,
-				xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX));
-
-		for (fsbcount = 1; fsbcount <= max_fsbs; fsbcount++) {
-			struct xfs_buf	*bp = NULL;
-			xfs_daddr_t	daddr;
-			int		error;
-
-			daddr = XFS_AGB_TO_DADDR(mp, agno, bno);
-			error = xfs_buf_incore(mp->m_ddev_targp, daddr,
-					XFS_FSB_TO_BB(mp, fsbcount),
-					XBF_LIVESCAN, &bp);
-			if (error)
-				continue;
-
+		struct xrep_bufscan	scan = {
+			.daddr		= XFS_AGB_TO_DADDR(mp, agno, bno),
+			.max_sectors	= xrep_bufscan_max_sectors(mp,
+							agbno_next - bno),
+			.daddr_step	= XFS_FSB_TO_BB(mp, 1),
+		};
+		struct xfs_buf	*bp;
+
+		while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) {
 			xfs_trans_bjoin(sc->tp, bp);
 			xfs_trans_binval(sc->tp, bp);
 			rs->invalidated++;
diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h
index 0b69f16dd98f..bb09e21fcb17 100644
--- a/fs/xfs/scrub/reap.h
+++ b/fs/xfs/scrub/reap.h
@@ -14,4 +14,24 @@ int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap,
 int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap,
 		const struct xfs_owner_info *oinfo);
 
+/* Buffer cache scan context. */
+struct xrep_bufscan {
+	/* Disk address for the buffers we want to scan. */
+	xfs_daddr_t		daddr;
+
+	/* Maximum number of sectors to scan. */
+	xfs_daddr_t		max_sectors;
+
+	/* Each round, increment the search length by this number of sectors. */
+	xfs_daddr_t		daddr_step;
+
+	/* Internal scan state; initialize to zero. */
+	xfs_daddr_t		__sector_count;
+};
+
+xfs_daddr_t xrep_bufscan_max_sectors(struct xfs_mount *mp,
+		xfs_extlen_t fsblocks);
+struct xfs_buf *xrep_bufscan_advance(struct xfs_mount *mp,
+		struct xrep_bufscan *scan);
+
 #endif /* __XFS_SCRUB_REAP_H__ */
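The scan hoisted by this patch probes the buffer cache once per candidate length (one step, two steps, and so on up to max_sectors) because a cached buffer is only found when both its start address and its length match. A minimal userspace model of that iteration might look like the following; the toy_* types and the array-backed "cache" are assumptions made for illustration and are not part of the kernel code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical cached buffer, keyed by start address and length in sectors. */
struct toy_buf {
	long	daddr;
	long	len;
};

/* Toy version of struct xrep_bufscan; sector_count must start at zero. */
struct toy_bufscan {
	long	daddr;
	long	max_sectors;
	long	daddr_step;
	long	sector_count;
};

/* Exact-match lookup, standing in for the real buffer cache. */
static const struct toy_buf *
toy_incore(const struct toy_buf *cache, size_t n, long daddr, long len)
{
	size_t	i;

	for (i = 0; i < n; i++)
		if (cache[i].daddr == daddr && cache[i].len == len)
			return &cache[i];
	return NULL;
}

/* Return the next cached buffer at scan->daddr, or NULL when done. */
static const struct toy_buf *
toy_bufscan_advance(const struct toy_buf *cache, size_t n,
		struct toy_bufscan *scan)
{
	scan->sector_count += scan->daddr_step;
	while (scan->sector_count <= scan->max_sectors) {
		const struct toy_buf *bp;

		bp = toy_incore(cache, n, scan->daddr, scan->sector_count);
		if (bp)
			return bp;
		scan->sector_count += scan->daddr_step;
	}
	return NULL;
}

/* Count every cached buffer that starts at daddr and fits the scan limit. */
static int
toy_count_buffers(const struct toy_buf *cache, size_t n, long daddr,
		long max_sectors, long step)
{
	struct toy_bufscan scan = {
		.daddr		= daddr,
		.max_sectors	= max_sectors,
		.daddr_step	= step,
	};
	int	found = 0;

	while (toy_bufscan_advance(cache, n, &scan) != NULL)
		found++;
	return found;
}
```

Keeping the probe length inside the scan structure is what lets the caller write a simple `while ((bp = advance(...)) != NULL)` loop and resume where the previous call left off.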
* [PATCH 4/4] xfs: add the ability to reap entire inode forks
From: Darrick J. Wong @ 2024-04-15 23:45 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

In preparation for supporting repair of indexed file-based metadata
(such as realtime bitmaps, directories, and extended attribute data),
add a function to reap the old blocks after a metadata repair finishes.

IOWs, this is an elaborate bunmapi call that deals with crosslinked
blocks by unmapping them without freeing them, and also scans for incore
buffers to invalidate.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/scrub/reap.c  |  372 ++++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/reap.h  |    1 
 fs/xfs/scrub/trace.h |   63 ++++++++
 3 files changed, 436 insertions(+)

diff --git a/fs/xfs/scrub/reap.c b/fs/xfs/scrub/reap.c
index 7ae6253395e7..01ceaa4efa16 100644
--- a/fs/xfs/scrub/reap.c
+++ b/fs/xfs/scrub/reap.c
@@ -675,3 +675,375 @@ xrep_reap_fsblocks(
 
 	return 0;
 }
+
+/*
+ * Metadata files are not supposed to share blocks with anything else.
+ * If blocks are shared, we remove the reverse mapping (thus reducing the
+ * crosslink factor); if blocks are not shared, we also need to free them.
+ *
+ * This first step determines the longest subset of the passed-in imap
+ * (starting at its beginning) that is either crosslinked or not crosslinked.
+ * The blockcount will be adjusted down as needed.
+ */ +STATIC int +xreap_bmapi_select( + struct xfs_scrub *sc, + struct xfs_inode *ip, + int whichfork, + struct xfs_bmbt_irec *imap, + bool *crosslinked) +{ + struct xfs_owner_info oinfo; + struct xfs_btree_cur *cur; + xfs_filblks_t len = 1; + xfs_agblock_t bno; + xfs_agblock_t agbno; + xfs_agblock_t agbno_next; + int error; + + agbno = XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock); + agbno_next = agbno + imap->br_blockcount; + + cur = xfs_rmapbt_init_cursor(sc->mp, sc->tp, sc->sa.agf_bp, + sc->sa.pag); + + xfs_rmap_ino_owner(&oinfo, ip->i_ino, whichfork, imap->br_startoff); + error = xfs_rmap_has_other_keys(cur, agbno, 1, &oinfo, crosslinked); + if (error) + goto out_cur; + + bno = agbno + 1; + while (bno < agbno_next) { + bool also_crosslinked; + + oinfo.oi_offset++; + error = xfs_rmap_has_other_keys(cur, bno, 1, &oinfo, + &also_crosslinked); + if (error) + goto out_cur; + + if (also_crosslinked != *crosslinked) + break; + + len++; + bno++; + } + + imap->br_blockcount = len; + trace_xreap_bmapi_select(sc->sa.pag, agbno, len, *crosslinked); +out_cur: + xfs_btree_del_cursor(cur, error); + return error; +} + +/* + * Decide if this buffer can be joined to a transaction. This is true for most + * buffers, but there are two cases that we want to catch: large remote xattr + * value buffers are not logged and can overflow the buffer log item dirty + * bitmap size; and oversized cached buffers if things have really gone + * haywire. + */ +static inline bool +xreap_buf_loggable( + const struct xfs_buf *bp) +{ + int i; + + for (i = 0; i < bp->b_map_count; i++) { + int chunks; + int map_size; + + chunks = DIV_ROUND_UP(BBTOB(bp->b_maps[i].bm_len), + XFS_BLF_CHUNK); + map_size = DIV_ROUND_UP(chunks, NBWORD); + if (map_size > XFS_BLF_DATAMAP_SIZE) + return false; + } + + return true; +} + +/* + * Invalidate any buffers for this file mapping. The @imap blockcount may be + * adjusted downward if we need to roll the transaction. 
+ */ +STATIC int +xreap_bmapi_binval( + struct xfs_scrub *sc, + struct xfs_inode *ip, + int whichfork, + struct xfs_bmbt_irec *imap) +{ + struct xfs_mount *mp = sc->mp; + struct xfs_perag *pag = sc->sa.pag; + int bmap_flags = xfs_bmapi_aflag(whichfork); + xfs_fileoff_t off; + xfs_fileoff_t max_off; + xfs_extlen_t scan_blocks; + xfs_agnumber_t agno = sc->sa.pag->pag_agno; + xfs_agblock_t bno; + xfs_agblock_t agbno; + xfs_agblock_t agbno_next; + unsigned int invalidated = 0; + int error; + + /* + * Avoid invalidating AG headers and post-EOFS blocks because we never + * own those. + */ + agbno = bno = XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock); + agbno_next = agbno + imap->br_blockcount; + if (!xfs_verify_agbno(pag, agbno) || + !xfs_verify_agbno(pag, agbno_next - 1)) + return 0; + + /* + * Buffers for file blocks can span multiple contiguous mappings. This + * means that for each block in the mapping, there could exist an + * xfs_buf indexed by that block with any length up to the maximum + * buffer size (remote xattr values) or to the next hole in the fork. + * To set up our binval scan, first we need to figure out the location + * of the next hole. + */ + off = imap->br_startoff + imap->br_blockcount; + max_off = off + xfs_attr3_rmt_blocks(mp, XFS_XATTR_SIZE_MAX); + while (off < max_off) { + struct xfs_bmbt_irec hmap; + int nhmaps = 1; + + error = xfs_bmapi_read(ip, off, max_off - off, &hmap, + &nhmaps, bmap_flags); + if (error) + return error; + if (nhmaps != 1 || hmap.br_startblock == DELAYSTARTBLOCK) { + ASSERT(0); + return -EFSCORRUPTED; + } + + if (!xfs_bmap_is_real_extent(&hmap)) + break; + + off = hmap.br_startoff + hmap.br_blockcount; + } + scan_blocks = off - imap->br_startoff; + + trace_xreap_bmapi_binval_scan(sc, imap, scan_blocks); + + /* + * If there are incore buffers for these blocks, invalidate them. If + * we can't (try)lock the buffer we assume it's owned by someone else + * and leave it alone. 
The buffer cache cannot detect aliasing, so + * employ nested loops to detect incore buffers of any plausible size. + */ + while (bno < agbno_next) { + struct xrep_bufscan scan = { + .daddr = XFS_AGB_TO_DADDR(mp, agno, bno), + .max_sectors = xrep_bufscan_max_sectors(mp, + scan_blocks), + .daddr_step = XFS_FSB_TO_BB(mp, 1), + }; + struct xfs_buf *bp; + + while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) { + if (xreap_buf_loggable(bp)) { + xfs_trans_bjoin(sc->tp, bp); + xfs_trans_binval(sc->tp, bp); + } else { + xfs_buf_stale(bp); + xfs_buf_relse(bp); + } + invalidated++; + + /* + * Stop invalidating if we've hit the limit; we should + * still have enough reservation left to free however + * much of the mapping we've seen so far. + */ + if (invalidated > XREAP_MAX_BINVAL) { + imap->br_blockcount = agbno_next - bno; + goto out; + } + } + + bno++; + scan_blocks--; + } + +out: + trace_xreap_bmapi_binval(sc->sa.pag, agbno, imap->br_blockcount); + return 0; +} + +/* + * Dispose of as much of the beginning of this file fork mapping as possible. + * The number of blocks disposed of is returned in @imap->br_blockcount. + */ +STATIC int +xrep_reap_bmapi_iter( + struct xfs_scrub *sc, + struct xfs_inode *ip, + int whichfork, + struct xfs_bmbt_irec *imap, + bool crosslinked) +{ + int error; + + if (crosslinked) { + /* + * If there are other rmappings, this block is cross linked and + * must not be freed. Remove the reverse mapping, leave the + * buffer cache in its possibly confused state, and move on. + * We don't want to risk discarding valid data buffers from + * anybody else who thinks they own the block, even though that + * runs the risk of stale buffer warnings in the future. + */ + trace_xreap_dispose_unmap_extent(sc->sa.pag, + XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock), + imap->br_blockcount); + + /* + * Schedule removal of the mapping from the fork. We use + * deferred log intents in this function to control the exact + * sequence of metadata updates. 
+ */ + xfs_bmap_unmap_extent(sc->tp, ip, whichfork, imap); + xfs_trans_mod_dquot_byino(sc->tp, ip, XFS_TRANS_DQ_BCOUNT, + -(int64_t)imap->br_blockcount); + xfs_rmap_unmap_extent(sc->tp, ip, whichfork, imap); + return 0; + } + + /* + * If the block is not crosslinked, we can invalidate all the incore + * buffers for the extent, and then free the extent. This is a bit of + * a mess since we don't detect discontiguous buffers that are indexed + * by a block starting before the first block of the extent but overlap + * anyway. + */ + trace_xreap_dispose_free_extent(sc->sa.pag, + XFS_FSB_TO_AGBNO(sc->mp, imap->br_startblock), + imap->br_blockcount); + + /* + * Invalidate as many buffers as we can, starting at the beginning of + * this mapping. If this function sets blockcount to zero, the + * transaction is full of logged buffer invalidations, so we need to + * return early so that we can roll and retry. + */ + error = xreap_bmapi_binval(sc, ip, whichfork, imap); + if (error || imap->br_blockcount == 0) + return error; + + /* + * Schedule removal of the mapping from the fork. We use deferred log + * intents in this function to control the exact sequence of metadata + * updates. + */ + xfs_bmap_unmap_extent(sc->tp, ip, whichfork, imap); + xfs_trans_mod_dquot_byino(sc->tp, ip, XFS_TRANS_DQ_BCOUNT, + -(int64_t)imap->br_blockcount); + return xfs_free_extent_later(sc->tp, imap->br_startblock, + imap->br_blockcount, NULL, XFS_AG_RESV_NONE, true); +} + +/* + * Dispose of as much of this file extent as we can. Upon successful return, + * the imap will reflect the mapping that was removed from the fork. 
+ */ +STATIC int +xreap_ifork_extent( + struct xfs_scrub *sc, + struct xfs_inode *ip, + int whichfork, + struct xfs_bmbt_irec *imap) +{ + xfs_agnumber_t agno; + bool crosslinked; + int error; + + ASSERT(sc->sa.pag == NULL); + + trace_xreap_ifork_extent(sc, ip, whichfork, imap); + + agno = XFS_FSB_TO_AGNO(sc->mp, imap->br_startblock); + sc->sa.pag = xfs_perag_get(sc->mp, agno); + if (!sc->sa.pag) + return -EFSCORRUPTED; + + error = xfs_alloc_read_agf(sc->sa.pag, sc->tp, 0, &sc->sa.agf_bp); + if (error) + goto out_pag; + + /* + * Decide the fate of the blocks at the beginning of the mapping, then + * update the mapping to use it with the unmap calls. + */ + error = xreap_bmapi_select(sc, ip, whichfork, imap, &crosslinked); + if (error) + goto out_agf; + + error = xrep_reap_bmapi_iter(sc, ip, whichfork, imap, crosslinked); + if (error) + goto out_agf; + +out_agf: + xfs_trans_brelse(sc->tp, sc->sa.agf_bp); + sc->sa.agf_bp = NULL; +out_pag: + xfs_perag_put(sc->sa.pag); + sc->sa.pag = NULL; + return error; +} + +/* + * Dispose of each block mapped to the given fork of the given file. Callers + * must hold ILOCK_EXCL, and ip can only be sc->ip or sc->tempip. The fork + * must not have any delalloc reservations. + */ +int +xrep_reap_ifork( + struct xfs_scrub *sc, + struct xfs_inode *ip, + int whichfork) +{ + xfs_fileoff_t off = 0; + int bmap_flags = xfs_bmapi_aflag(whichfork); + int error; + + ASSERT(xfs_has_rmapbt(sc->mp)); + ASSERT(ip == sc->ip || ip == sc->tempip); + ASSERT(whichfork == XFS_ATTR_FORK || !XFS_IS_REALTIME_INODE(ip)); + + while (off < XFS_MAX_FILEOFF) { + struct xfs_bmbt_irec imap; + int nimaps = 1; + + /* Read the next extent, skip past holes and delalloc. 
*/ + error = xfs_bmapi_read(ip, off, XFS_MAX_FILEOFF - off, &imap, + &nimaps, bmap_flags); + if (error) + return error; + if (nimaps != 1 || imap.br_startblock == DELAYSTARTBLOCK) { + ASSERT(0); + return -EFSCORRUPTED; + } + + /* + * If this is a real space mapping, reap as much of it as we + * can in a single transaction. + */ + if (xfs_bmap_is_real_extent(&imap)) { + error = xreap_ifork_extent(sc, ip, whichfork, &imap); + if (error) + return error; + + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + } + + off = imap.br_startoff + imap.br_blockcount; + } + + return 0; +} diff --git a/fs/xfs/scrub/reap.h b/fs/xfs/scrub/reap.h index bb09e21fcb17..3f2f1775e29d 100644 --- a/fs/xfs/scrub/reap.h +++ b/fs/xfs/scrub/reap.h @@ -13,6 +13,7 @@ int xrep_reap_agblocks(struct xfs_scrub *sc, struct xagb_bitmap *bitmap, const struct xfs_owner_info *oinfo, enum xfs_ag_resv_type type); int xrep_reap_fsblocks(struct xfs_scrub *sc, struct xfsb_bitmap *bitmap, const struct xfs_owner_info *oinfo); +int xrep_reap_ifork(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork); /* Buffer cache scan context. 
*/ struct xrep_bufscan { diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 020b029b7988..cbd70ecd3011 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -1539,6 +1539,7 @@ DEFINE_EVENT(xrep_extent_class, name, \ DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_unmap_extent); DEFINE_REPAIR_EXTENT_EVENT(xreap_dispose_free_extent); DEFINE_REPAIR_EXTENT_EVENT(xreap_agextent_binval); +DEFINE_REPAIR_EXTENT_EVENT(xreap_bmapi_binval); DEFINE_REPAIR_EXTENT_EVENT(xrep_agfl_insert); DECLARE_EVENT_CLASS(xrep_reap_find_class, @@ -1572,6 +1573,7 @@ DEFINE_EVENT(xrep_reap_find_class, name, \ bool crosslinked), \ TP_ARGS(pag, agbno, len, crosslinked)) DEFINE_REPAIR_REAP_FIND_EVENT(xreap_agextent_select); +DEFINE_REPAIR_REAP_FIND_EVENT(xreap_bmapi_select); DECLARE_EVENT_CLASS(xrep_rmap_class, TP_PROTO(struct xfs_mount *mp, xfs_agnumber_t agno, @@ -2312,6 +2314,67 @@ TRACE_EVENT(xrep_tempfile_create, __entry->temp_inum) ); +TRACE_EVENT(xreap_ifork_extent, + TP_PROTO(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork, + const struct xfs_bmbt_irec *irec), + TP_ARGS(sc, ip, whichfork, irec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(int, whichfork) + __field(xfs_fileoff_t, fileoff) + __field(xfs_filblks_t, len) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + __field(int, state) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->whichfork = whichfork; + __entry->fileoff = irec->br_startoff; + __entry->len = irec->br_blockcount; + __entry->agno = XFS_FSB_TO_AGNO(sc->mp, irec->br_startblock); + __entry->agbno = XFS_FSB_TO_AGBNO(sc->mp, irec->br_startblock); + __entry->state = irec->br_state; + ), + TP_printk("dev %d:%d ip 0x%llx whichfork %s agno 0x%x agbno 0x%x fileoff 0x%llx fsbcount 0x%llx state 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS), + __entry->agno, + __entry->agbno, 
+ __entry->fileoff, + __entry->len, + __entry->state) +); + +TRACE_EVENT(xreap_bmapi_binval_scan, + TP_PROTO(struct xfs_scrub *sc, const struct xfs_bmbt_irec *irec, + xfs_extlen_t scan_blocks), + TP_ARGS(sc, irec, scan_blocks), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_filblks_t, len) + __field(xfs_agnumber_t, agno) + __field(xfs_agblock_t, agbno) + __field(xfs_extlen_t, scan_blocks) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->len = irec->br_blockcount; + __entry->agno = XFS_FSB_TO_AGNO(sc->mp, irec->br_startblock); + __entry->agbno = XFS_FSB_TO_AGBNO(sc->mp, irec->br_startblock); + __entry->scan_blocks = scan_blocks; + ), + TP_printk("dev %d:%d agno 0x%x agbno 0x%x fsbcount 0x%llx scan_blocks 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agbno, + __entry->len, + __entry->scan_blocks) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
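The selection step in this patch trims each mapping to its longest leading run of blocks that are either all crosslinked or all not crosslinked, so that the whole run can be handled the same way (unmap only, or unmap and free). A small userspace sketch of that idea follows. It assumes the per-block answer is already available as a bool array, whereas the kernel code queries the rmap btree one block at a time; the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Return the length of the leading run of blocks that share the first
 * block's state, and report that state in *crosslinked.  The caller must
 * pass nblocks >= 1.
 */
static size_t
toy_select_prefix(const bool *shared, size_t nblocks, bool *crosslinked)
{
	size_t	len = 1;

	*crosslinked = shared[0];
	while (len < nblocks && shared[len] == *crosslinked)
		len++;
	return len;
}

/*
 * Walk a whole mapping run by run.  Crosslinked runs are only unmapped;
 * unshared runs are unmapped and freed.
 */
static void
toy_reap_mapping(const bool *shared, size_t nblocks, size_t *freed,
		size_t *unmapped_only)
{
	size_t	off = 0;

	*freed = 0;
	*unmapped_only = 0;
	while (off < nblocks) {
		bool	crosslinked;
		size_t	len;

		len = toy_select_prefix(shared + off, nblocks - off,
				&crosslinked);
		if (crosslinked)
			*unmapped_only += len;
		else
			*freed += len;
		off += len;
	}
}
```

Working in uniform runs means each deferred operation covers a single extent with a single disposition, which mirrors how the patch shortens imap->br_blockcount before issuing the unmap.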
* [PATCHSET v30.3 05/16] xfs: online repair of realtime summaries
From: Darrick J. Wong @ 2024-04-15 23:34 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs

Hi all,

We now have all the infrastructure we need to repair file metadata.
We'll begin with the realtime summary file, because it is the least
complex data structure. To support this we need to add three more
pieces to the temporary file code from the previous patchset --
preallocating space in the temp file, formatting metadata into that
space and writing the blocks to disk, and swapping the fork mappings
atomically.

After that, the actual reconstruction of the realtime summary
information is pretty simple, since we can simply write the incore copy
computed by the rtsummary scrubber to the temporary file, swap the
contents, and reap the old blocks.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-rtsummary-6.10
---
Commits in this patchset:
 * xfs: support preallocating and copying content into temporary files
 * xfs: teach the tempfile to set up atomic file content exchanges
 * xfs: online repair of realtime summaries
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/scrub/common.c           |    1 
 fs/xfs/scrub/repair.h           |    3 
 fs/xfs/scrub/rtsummary.c        |   33 ++-
 fs/xfs/scrub/rtsummary.h        |   37 ++++
 fs/xfs/scrub/rtsummary_repair.c |  177 ++++++++++++++++++
 fs/xfs/scrub/scrub.c            |   11 +
 fs/xfs/scrub/scrub.h            |    7 +
 fs/xfs/scrub/tempexch.h         |   21 ++
 fs/xfs/scrub/tempfile.c         |  388 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/tempfile.h         |   15 ++
 fs/xfs/scrub/trace.h            |   40 ++++
 12 files changed, 715 insertions(+), 19 deletions(-)
 create mode 100644 fs/xfs/scrub/rtsummary.h
 create mode 100644 fs/xfs/scrub/rtsummary_repair.c
 create mode 100644 fs/xfs/scrub/tempexch.h
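The repair flow described in this cover letter (stage the rebuilt contents in a temporary file, exchange them with the live file, then reap the old blocks) can be pictured with a very small userspace analogy. This is only a sketch under loud assumptions: the toy_file type and the toy_* helpers are invented here, heap memory stands in for disk blocks, and a plain struct swap stands in for the logged, crash-safe exchange that the kernel performs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical toy "file": an array of summary counters. */
struct toy_file {
	unsigned int	*data;
	size_t		count;
};

/* Exchange the contents of two toy files in one step. */
static void toy_exchange(struct toy_file *a, struct toy_file *b)
{
	struct toy_file	tmp = *a;

	*a = *b;
	*b = tmp;
}

/*
 * Replace target's contents with a fresh copy of incore[0..count).
 * count must be nonzero.  Returns 0 on success, or -1 if the staging
 * space could not be allocated, in which case target is left untouched.
 */
static int
toy_repair(struct toy_file *target, const unsigned int *incore, size_t count)
{
	struct toy_file	temp;

	/* 1. Preallocate space in the temporary file. */
	temp.data = malloc(count * sizeof(*temp.data));
	if (!temp.data)
		return -1;
	temp.count = count;

	/* 2. Copy the rebuilt contents into the staging area. */
	memcpy(temp.data, incore, count * sizeof(*temp.data));

	/* 3. Commit by swapping the staged and live contents. */
	toy_exchange(target, &temp);

	/* 4. Reap: temp now holds the old contents, so dispose of them. */
	free(temp.data);
	return 0;
}
```

The point of staging first is that a failure before step 3 leaves the original file exactly as it was, and after step 3 the only remaining work is cleanup.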
* [PATCH 1/3] xfs: support preallocating and copying content into temporary files 2024-04-15 23:34 ` [PATCHSET v30.3 05/16] xfs: online repair of realtime summaries Darrick J. Wong @ 2024-04-15 23:46 ` Darrick J. Wong 2024-04-15 23:46 ` [PATCH 2/3] xfs: teach the tempfile to set up atomic file content exchanges Darrick J. Wong 2024-04-15 23:46 ` [PATCH 3/3] xfs: online repair of realtime summaries Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:46 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create the routines we need to preallocate space in a temporary ondisk file and then copy the contents of an xfile into the tempfile. The upcoming rtsummary repair feature will construct the contents of a realtime summary file in memory, after which it will want to copy all that into the ondisk temporary file before atomically committing the new rtsummary contents. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/tempfile.c | 197 +++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.h | 15 ++++ fs/xfs/scrub/trace.h | 39 +++++++++ 3 files changed, 251 insertions(+) diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index 68d245749bc1..83e683e16561 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -14,14 +14,18 @@ #include "xfs_inode.h" #include "xfs_ialloc.h" #include "xfs_quota.h" +#include "xfs_bmap.h" #include "xfs_bmap_btree.h" #include "xfs_trans_space.h" #include "xfs_dir2.h" #include "xfs_exchrange.h" +#include "xfs_defer.h" #include "scrub/scrub.h" #include "scrub/common.h" +#include "scrub/repair.h" #include "scrub/trace.h" #include "scrub/tempfile.h" +#include "scrub/xfile.h" /* * Create a temporary file for reconstructing metadata, with the intention of @@ -249,3 +253,196 @@ xrep_tempfile_rele( xchk_irele(sc, sc->tempip); sc->tempip = NULL; } + +/* + * Make sure that the given range of the data fork of the temporary file is + * mapped to written blocks. The caller must ensure that both inodes are + * joined to the transaction. + */ +int +xrep_tempfile_prealloc( + struct xfs_scrub *sc, + xfs_fileoff_t off, + xfs_filblks_t len) +{ + struct xfs_bmbt_irec map; + xfs_fileoff_t end = off + len; + int error; + + ASSERT(sc->tempip != NULL); + ASSERT(!XFS_NOT_DQATTACHED(sc->mp, sc->tempip)); + + for (; off < end; off = map.br_startoff + map.br_blockcount) { + int nmaps = 1; + + /* + * If we have a real extent mapping this block then we're + * in ok shape. + */ + error = xfs_bmapi_read(sc->tempip, off, end - off, &map, &nmaps, + XFS_DATA_FORK); + if (error) + return error; + if (nmaps == 0) { + ASSERT(nmaps != 0); + return -EFSCORRUPTED; + } + + if (xfs_bmap_is_written_extent(&map)) + continue; + + /* + * If we find a delalloc reservation then something is very + * very wrong. Bail out. 
+ */ + if (map.br_startblock == DELAYSTARTBLOCK) + return -EFSCORRUPTED; + + /* + * Make sure this block has a real zeroed extent allocated to + * it. + */ + nmaps = 1; + error = xfs_bmapi_write(sc->tp, sc->tempip, off, end - off, + XFS_BMAPI_CONVERT | XFS_BMAPI_ZERO, 0, &map, + &nmaps); + if (error) + return error; + if (nmaps != 1) + return -EFSCORRUPTED; + + trace_xrep_tempfile_prealloc(sc, XFS_DATA_FORK, &map); + + /* Commit new extent and all deferred work. */ + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + } + + return 0; +} + +/* + * Write data to each block of a file. The given range of the tempfile's data + * fork must already be populated with written extents. + */ +int +xrep_tempfile_copyin( + struct xfs_scrub *sc, + xfs_fileoff_t off, + xfs_filblks_t len, + xrep_tempfile_copyin_fn prep_fn, + void *data) +{ + LIST_HEAD(buffers_list); + struct xfs_mount *mp = sc->mp; + struct xfs_buf *bp; + xfs_fileoff_t flush_mask; + xfs_fileoff_t end = off + len; + loff_t pos = XFS_FSB_TO_B(mp, off); + int error = 0; + + ASSERT(S_ISREG(VFS_I(sc->tempip)->i_mode)); + + /* Flush buffers to disk every 512K */ + flush_mask = XFS_B_TO_FSBT(mp, (1U << 19)) - 1; + + for (; off < end; off++, pos += mp->m_sb.sb_blocksize) { + struct xfs_bmbt_irec map; + int nmaps = 1; + + /* Read block mapping for this file block. */ + error = xfs_bmapi_read(sc->tempip, off, 1, &map, &nmaps, 0); + if (error) + goto out_err; + if (nmaps == 0 || !xfs_bmap_is_written_extent(&map)) { + error = -EFSCORRUPTED; + goto out_err; + } + + /* Get the metadata buffer for this offset in the file. */ + error = xfs_trans_get_buf(sc->tp, mp->m_ddev_targp, + XFS_FSB_TO_DADDR(mp, map.br_startblock), + mp->m_bsize, 0, &bp); + if (error) + goto out_err; + + trace_xrep_tempfile_copyin(sc, XFS_DATA_FORK, &map); + + /* Read in a block's worth of data from the xfile. 
+ */ + error = prep_fn(sc, bp, data); + if (error) { + xfs_trans_brelse(sc->tp, bp); + goto out_err; + } + + /* Queue buffer, and flush if we have too much dirty data. */ + xfs_buf_delwri_queue_here(bp, &buffers_list); + xfs_trans_brelse(sc->tp, bp); + + if (!(off & flush_mask)) { + error = xfs_buf_delwri_submit(&buffers_list); + if (error) + goto out_err; + } + } + + /* + * Write the new blocks to disk. If the ordered list isn't empty after + * that, then something went wrong and we have to fail. This should + * never happen, but we'll check anyway. + */ + error = xfs_buf_delwri_submit(&buffers_list); + if (error) + goto out_err; + + if (!list_empty(&buffers_list)) { + ASSERT(list_empty(&buffers_list)); + error = -EIO; + goto out_err; + } + + return 0; + +out_err: + xfs_buf_delwri_cancel(&buffers_list); + return error; +} + +/* + * Set the temporary file's size. Caller must join the tempfile to the scrub + * transaction and is responsible for adjusting block mappings as needed. + */ +int +xrep_tempfile_set_isize( + struct xfs_scrub *sc, + unsigned long long isize) +{ + if (sc->tempip->i_disk_size == isize) + return 0; + + sc->tempip->i_disk_size = isize; + i_size_write(VFS_I(sc->tempip), isize); + return xrep_tempfile_roll_trans(sc); +} + +/* + * Roll a repair transaction involving the temporary file. Caller must join + * both the temporary file and the file being scrubbed to the transaction. + * This function returns with both inodes joined to a new scrub transaction, + * or the usual negative errno. 
+ */ +int +xrep_tempfile_roll_trans( + struct xfs_scrub *sc) +{ + int error; + + xfs_trans_log_inode(sc->tp, sc->tempip, XFS_ILOG_CORE); + error = xrep_roll_trans(sc); + if (error) + return error; + + xfs_trans_ijoin(sc->tp, sc->tempip, 0); + return 0; +} diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h index e165e0a3faf6..7980f9c4de55 100644 --- a/fs/xfs/scrub/tempfile.h +++ b/fs/xfs/scrub/tempfile.h @@ -17,6 +17,21 @@ void xrep_tempfile_iounlock(struct xfs_scrub *sc); void xrep_tempfile_ilock(struct xfs_scrub *sc); bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc); void xrep_tempfile_iunlock(struct xfs_scrub *sc); + +int xrep_tempfile_prealloc(struct xfs_scrub *sc, xfs_fileoff_t off, + xfs_filblks_t len); + +enum xfs_blft; + +typedef int (*xrep_tempfile_copyin_fn)(struct xfs_scrub *sc, + struct xfs_buf *bp, void *data); + +int xrep_tempfile_copyin(struct xfs_scrub *sc, xfs_fileoff_t off, + xfs_filblks_t len, xrep_tempfile_copyin_fn fn, void *data); + +int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize); + +int xrep_tempfile_roll_trans(struct xfs_scrub *sc); #else static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc) { diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index cbd70ecd3011..ae90731bf6ad 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2314,6 +2314,45 @@ TRACE_EVENT(xrep_tempfile_create, __entry->temp_inum) ); +DECLARE_EVENT_CLASS(xrep_tempfile_class, + TP_PROTO(struct xfs_scrub *sc, int whichfork, + struct xfs_bmbt_irec *irec), + TP_ARGS(sc, whichfork, irec), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(int, whichfork) + __field(xfs_fileoff_t, lblk) + __field(xfs_filblks_t, len) + __field(xfs_fsblock_t, pblk) + __field(int, state) + ), + TP_fast_assign( + __entry->dev = sc->mp->m_super->s_dev; + __entry->ino = sc->tempip->i_ino; + __entry->whichfork = whichfork; + __entry->lblk = irec->br_startoff; + __entry->len = irec->br_blockcount; + 
__entry->pblk = irec->br_startblock; + __entry->state = irec->br_state; + ), + TP_printk("dev %d:%d ino 0x%llx whichfork %s fileoff 0x%llx fsbcount 0x%llx startblock 0x%llx state %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __print_symbolic(__entry->whichfork, XFS_WHICHFORK_STRINGS), + __entry->lblk, + __entry->len, + __entry->pblk, + __entry->state) +); +#define DEFINE_XREP_TEMPFILE_EVENT(name) \ +DEFINE_EVENT(xrep_tempfile_class, name, \ + TP_PROTO(struct xfs_scrub *sc, int whichfork, \ + struct xfs_bmbt_irec *irec), \ + TP_ARGS(sc, whichfork, irec)) +DEFINE_XREP_TEMPFILE_EVENT(xrep_tempfile_prealloc); +DEFINE_XREP_TEMPFILE_EVENT(xrep_tempfile_copyin); + TRACE_EVENT(xreap_ifork_extent, TP_PROTO(struct xfs_scrub *sc, struct xfs_inode *ip, int whichfork, const struct xfs_bmbt_irec *irec), ^ permalink raw reply related [flat|nested] 100+ messages in thread
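The flush interval used by xrep_tempfile_copyin in the patch above comes down to a little mask arithmetic. The following is a minimal userspace sketch of that arithmetic only, assuming a power-of-two block size no larger than 512K; toy_flush_mask and toy_count_flushes are hypothetical names, and blocklog stands in for the superblock's block size log2.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Number of filesystem blocks in 512K, minus one.  This models
 * XFS_B_TO_FSBT(mp, 1U << 19) - 1 as a plain right shift.
 */
uint64_t toy_flush_mask(unsigned int blocklog)
{
	assert(blocklog <= 19);
	return ((1ULL << 19) >> blocklog) - 1;
}

/* Count the in-loop flushes triggered while copying [off, off + len). */
unsigned int toy_count_flushes(uint64_t off, uint64_t len,
			       unsigned int blocklog)
{
	uint64_t	mask = toy_flush_mask(blocklog);
	uint64_t	end = off + len;
	unsigned int	flushes = 0;

	for (; off < end; off++) {
		/* Same test as the patch: off is a multiple of the interval. */
		if (!(off & mask))
			flushes++;
	}
	return flushes;
}
```

With 4k blocks the mask is 127, so a submit happens at file offsets 0, 128, 256 and so on. A copy that starts at offset 0 therefore submits right after queuing its first block, and whatever is still queued when the loop ends is picked up by the final xfs_buf_delwri_submit call.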
* [PATCH 2/3] xfs: teach the tempfile to set up atomic file content exchanges 2024-04-15 23:34 ` [PATCHSET v30.3 05/16] xfs: online repair of realtime summaries Darrick J. Wong 2024-04-15 23:46 ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong @ 2024-04-15 23:46 ` Darrick J. Wong 2024-04-15 23:46 ` [PATCH 3/3] xfs: online repair of realtime summaries Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:46 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create some new routines to exchange the contents of a temporary file created to stage a repair with another ondisk file. This will be used by the realtime summary repair function to commit atomically the new rtsummary data, which will be staged in the tempfile. The rest of XFS coordinates access to the realtime metadata inodes solely through the ILOCK. For repair to hold its exclusive access to the realtime summary file, it has to allocate a single large transaction and roll it repeatedly throughout the repair while holding the ILOCK. In turn, this means that for now there's only a partial file mapping exchange implementation for the temporary file because we can only work within an existing transaction. For now, the only tempswap functions needed here are to estimate the resource requirements of the exchange, reserve more space/quota to an existing transaction, and kick off the actual exchange. The rest will be added in a later patch in preparation for repairing xattrs and directories. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/scrub.c | 8 +- fs/xfs/scrub/scrub.h | 7 ++ fs/xfs/scrub/tempexch.h | 21 +++++ fs/xfs/scrub/tempfile.c | 191 +++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/trace.h | 1 5 files changed, 225 insertions(+), 3 deletions(-) create mode 100644 fs/xfs/scrub/tempexch.h diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index d9012e9a6afd..ff156edf49a0 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -149,14 +149,15 @@ xchk_probe( /* Scrub setup and teardown */ +#define FSGATES_MASK (XCHK_FSGATES_ALL | XREP_FSGATES_ALL) static inline void xchk_fsgates_disable( struct xfs_scrub *sc) { - if (!(sc->flags & XCHK_FSGATES_ALL)) + if (!(sc->flags & FSGATES_MASK)) return; - trace_xchk_fsgates_disable(sc, sc->flags & XCHK_FSGATES_ALL); + trace_xchk_fsgates_disable(sc, sc->flags & FSGATES_MASK); if (sc->flags & XCHK_FSGATES_DRAIN) xfs_drain_wait_disable(); @@ -170,8 +171,9 @@ xchk_fsgates_disable( if (sc->flags & XCHK_FSGATES_RMAP) xfs_rmap_hook_disable(); - sc->flags &= ~XCHK_FSGATES_ALL; + sc->flags &= ~FSGATES_MASK; } +#undef FSGATES_MASK /* Free all the resources and finish the transactions. 
*/ STATIC int diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index e37d8599718e..d38f0b30416c 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -131,6 +131,7 @@ struct xfs_scrub { #define XCHK_FSGATES_QUOTA (1U << 4) /* quota live update enabled */ #define XCHK_FSGATES_DIRENTS (1U << 5) /* directory live update enabled */ #define XCHK_FSGATES_RMAP (1U << 6) /* rmapbt live update enabled */ +#define XREP_FSGATES_EXCHANGE_RANGE (1U << 29) /* uses file content exchange */ #define XREP_RESET_PERAG_RESV (1U << 30) /* must reset AG space reservation */ #define XREP_ALREADY_FIXED (1U << 31) /* checking our repair work */ @@ -145,6 +146,12 @@ struct xfs_scrub { XCHK_FSGATES_DIRENTS | \ XCHK_FSGATES_RMAP) +/* + * The sole XREP_FSGATES* flag reflects a log intent item that is protected + * by a log-incompat feature flag. No code patching in use here. + */ +#define XREP_FSGATES_ALL (XREP_FSGATES_EXCHANGE_RANGE) + /* Metadata scrubbers */ int xchk_tester(struct xfs_scrub *sc); int xchk_superblock(struct xfs_scrub *sc); diff --git a/fs/xfs/scrub/tempexch.h b/fs/xfs/scrub/tempexch.h new file mode 100644 index 000000000000..98222b684b6a --- /dev/null +++ b/fs/xfs/scrub/tempexch.h @@ -0,0 +1,21 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_TEMPEXCH_H__ +#define __XFS_SCRUB_TEMPEXCH_H__ + +#ifdef CONFIG_XFS_ONLINE_REPAIR +struct xrep_tempexch { + struct xfs_exchmaps_req req; +}; + +int xrep_tempexch_enable(struct xfs_scrub *sc); +int xrep_tempexch_trans_reserve(struct xfs_scrub *sc, int whichfork, + struct xrep_tempexch *ti); + +int xrep_tempexch_contents(struct xfs_scrub *sc, struct xrep_tempexch *ti); +#endif /* CONFIG_XFS_ONLINE_REPAIR */ + +#endif /* __XFS_SCRUB_TEMPEXCH_H__ */ diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index 83e683e16561..7791336ca820 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -19,12 +19,14 @@ #include "xfs_trans_space.h" #include "xfs_dir2.h" #include "xfs_exchrange.h" +#include "xfs_exchmaps.h" #include "xfs_defer.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/repair.h" #include "scrub/trace.h" #include "scrub/tempfile.h" +#include "scrub/tempexch.h" #include "scrub/xfile.h" /* @@ -446,3 +448,192 @@ xrep_tempfile_roll_trans( xfs_trans_ijoin(sc->tp, sc->tempip, 0); return 0; } + +/* Enable file content exchanges. */ +int +xrep_tempexch_enable( + struct xfs_scrub *sc) +{ + if (sc->flags & XREP_FSGATES_EXCHANGE_RANGE) + return 0; + + if (!xfs_has_exchange_range(sc->mp)) + return -EOPNOTSUPP; + + trace_xchk_fsgates_enable(sc, XREP_FSGATES_EXCHANGE_RANGE); + + sc->flags |= XREP_FSGATES_EXCHANGE_RANGE; + return 0; +} + +/* + * Fill out the mapping exchange request in preparation for atomically + * committing the contents of a metadata file that we've rebuilt in the temp + * file. + */ +STATIC int +xrep_tempexch_prep_request( + struct xfs_scrub *sc, + int whichfork, + struct xrep_tempexch *tx) +{ + struct xfs_exchmaps_req *req = &tx->req; + + memset(tx, 0, sizeof(struct xrep_tempexch)); + + /* COW forks don't exist on disk. */ + if (whichfork == XFS_COW_FORK) { + ASSERT(0); + return -EINVAL; + } + + /* Both files should have the relevant forks. 
*/ + if (!xfs_ifork_ptr(sc->ip, whichfork) || + !xfs_ifork_ptr(sc->tempip, whichfork)) { + ASSERT(xfs_ifork_ptr(sc->ip, whichfork) != NULL); + ASSERT(xfs_ifork_ptr(sc->tempip, whichfork) != NULL); + return -EINVAL; + } + + /* Exchange all mappings in both forks. */ + req->ip1 = sc->tempip; + req->ip2 = sc->ip; + req->startoff1 = 0; + req->startoff2 = 0; + switch (whichfork) { + case XFS_ATTR_FORK: + req->flags |= XFS_EXCHMAPS_ATTR_FORK; + break; + case XFS_DATA_FORK: + /* Always exchange sizes when exchanging data fork mappings. */ + req->flags |= XFS_EXCHMAPS_SET_SIZES; + break; + } + req->blockcount = XFS_MAX_FILEOFF; + + return 0; +} + +/* + * Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip + * this if quota enforcement is disabled or if both inodes' dquots are the + * same. The qretry structure must be initialized to zeroes before the first + * call to this function. + */ +STATIC int +xrep_tempexch_reserve_quota( + struct xfs_scrub *sc, + const struct xrep_tempexch *tx) +{ + struct xfs_trans *tp = sc->tp; + const struct xfs_exchmaps_req *req = &tx->req; + int64_t ddelta, rdelta; + int error; + + /* + * Don't bother with a quota reservation if we're not enforcing them + * or the two inodes have the same dquots. + */ + if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 || + (req->ip1->i_udquot == req->ip2->i_udquot && + req->ip1->i_gdquot == req->ip2->i_gdquot && + req->ip1->i_pdquot == req->ip2->i_pdquot)) + return 0; + + /* + * Quota reservation for each file comes from two sources. First, we + * need to account for any net gain in mapped blocks during the + * exchange. Second, we need reservation for the gross gain in mapped + * blocks so that we don't trip over any quota block reservation + * assertions. We must reserve the gross gain because the quota code + * subtracts from bcount the number of blocks that we unmap; it does + * not add that quantity back to the quota block reservation. 
+ */ + ddelta = max_t(int64_t, 0, req->ip2_bcount - req->ip1_bcount); + rdelta = max_t(int64_t, 0, req->ip2_rtbcount - req->ip1_rtbcount); + error = xfs_trans_reserve_quota_nblks(tp, req->ip1, + ddelta + req->ip1_bcount, rdelta + req->ip1_rtbcount, + true); + if (error) + return error; + + ddelta = max_t(int64_t, 0, req->ip1_bcount - req->ip2_bcount); + rdelta = max_t(int64_t, 0, req->ip1_rtbcount - req->ip2_rtbcount); + return xfs_trans_reserve_quota_nblks(tp, req->ip2, + ddelta + req->ip2_bcount, rdelta + req->ip2_rtbcount, + true); +} + +/* + * Prepare an existing transaction for an atomic file contents exchange. + * + * This function fills out the mapping exchange request and resource estimation + * structures in preparation for exchanging the contents of a metadata file + * that has been rebuilt in the temp file. Next, it reserves space and quota + * for the transaction. + * + * The caller must hold ILOCK_EXCL of the scrub target file and the temporary + * file. The caller must join both inodes to the transaction with no unlock + * flags, and is responsible for dropping both ILOCKs when appropriate. Only + * use this when those ILOCKs cannot be dropped. + */ +int +xrep_tempexch_trans_reserve( + struct xfs_scrub *sc, + int whichfork, + struct xrep_tempexch *tx) +{ + int error; + + ASSERT(sc->tp != NULL); + xfs_assert_ilocked(sc->ip, XFS_ILOCK_EXCL); + xfs_assert_ilocked(sc->tempip, XFS_ILOCK_EXCL); + + error = xrep_tempexch_prep_request(sc, whichfork, tx); + if (error) + return error; + + error = xfs_exchmaps_estimate(&tx->req); + if (error) + return error; + + error = xfs_trans_reserve_more(sc->tp, tx->req.resblks, 0); + if (error) + return error; + + return xrep_tempexch_reserve_quota(sc, tx); +} + +/* + * Exchange file mappings (and hence file contents) between the file being + * repaired and the temporary file. Returns with both inodes locked and joined + * to a clean scrub transaction. 
+ */ +int +xrep_tempexch_contents( + struct xfs_scrub *sc, + struct xrep_tempexch *tx) +{ + int error; + + ASSERT(sc->flags & XREP_FSGATES_EXCHANGE_RANGE); + + xfs_exchange_mappings(sc->tp, &tx->req); + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + + /* + * If we exchanged the ondisk sizes of two metadata files, we must + * exchange the incore sizes as well. + */ + if (tx->req.flags & XFS_EXCHMAPS_SET_SIZES) { + loff_t temp; + + temp = i_size_read(VFS_I(sc->ip)); + i_size_write(VFS_I(sc->ip), i_size_read(VFS_I(sc->tempip))); + i_size_write(VFS_I(sc->tempip), temp); + } + + return 0; +} diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index ae90731bf6ad..8d05f2adae3d 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -114,6 +114,7 @@ TRACE_DEFINE_ENUM(XFS_SCRUB_TYPE_HEALTHY); { XCHK_FSGATES_QUOTA, "fsgates_quota" }, \ { XCHK_FSGATES_DIRENTS, "fsgates_dirents" }, \ { XCHK_FSGATES_RMAP, "fsgates_rmap" }, \ + { XREP_FSGATES_EXCHANGE_RANGE, "fsgates_exchrange" }, \ { XREP_RESET_PERAG_RESV, "reset_perag_resv" }, \ { XREP_ALREADY_FIXED, "already_fixed" }
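The block counts that xrep_tempexch_reserve_quota passes to the quota code can be worked through in isolation. The sketch below reproduces only the data-device arithmetic from the patch in userspace; struct toy_req and the toy_* functions are hypothetical stand-ins, signed 64-bit counts are an assumption made for simplicity, and the realtime counts follow the same pattern.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the exchange request: mapped blocks per file. */
struct toy_req {
	int64_t	ip1_bcount;
	int64_t	ip2_bcount;
};

/* Clamp negative deltas to zero, like max_t(int64_t, 0, ...). */
static int64_t toy_max0(int64_t v)
{
	return v > 0 ? v : 0;
}

/* Data blocks reserved against ip1's dquots. */
int64_t toy_ip1_resv(const struct toy_req *req)
{
	assert(req->ip1_bcount >= 0 && req->ip2_bcount >= 0);
	return toy_max0(req->ip2_bcount - req->ip1_bcount) + req->ip1_bcount;
}

/* Data blocks reserved against ip2's dquots. */
int64_t toy_ip2_resv(const struct toy_req *req)
{
	assert(req->ip1_bcount >= 0 && req->ip2_bcount >= 0);
	return toy_max0(req->ip1_bcount - req->ip2_bcount) + req->ip2_bcount;
}
```

For example, with ip1_bcount = 10 and ip2_bcount = 25, ip1 reserves 15 + 10 = 25 blocks and ip2 reserves 0 + 25 = 25 blocks. As written, each sum works out to the larger of the two block counts.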
* [PATCH 3/3] xfs: online repair of realtime summaries 2024-04-15 23:34 ` [PATCHSET v30.3 05/16] xfs: online repair of realtime summaries Darrick J. Wong 2024-04-15 23:46 ` [PATCH 1/3] xfs: support preallocating and copying content into temporary files Darrick J. Wong 2024-04-15 23:46 ` [PATCH 2/3] xfs: teach the tempfile to set up atomic file content exchanges Darrick J. Wong @ 2024-04-15 23:46 ` Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:46 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Repair the realtime summary data by constructing a new rtsummary file in the scrub temporary file, then atomically swapping the contents. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/common.c | 1 fs/xfs/scrub/repair.h | 3 + fs/xfs/scrub/rtsummary.c | 33 ++++--- fs/xfs/scrub/rtsummary.h | 37 ++++++++ fs/xfs/scrub/rtsummary_repair.c | 177 +++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/scrub.c | 3 - 7 files changed, 239 insertions(+), 16 deletions(-) create mode 100644 fs/xfs/scrub/rtsummary.h create mode 100644 fs/xfs/scrub/rtsummary_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index ae8488ab4d6b..5e3ac7ec8fa5 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -212,6 +212,7 @@ xfs-y += $(addprefix scrub/, \ xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ rtbitmap_repair.o \ + rtsummary_repair.o \ ) xfs-$(CONFIG_XFS_QUOTA) += $(addprefix scrub/, \ diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index a27d33b6f464..a2da2bef509a 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -31,6 +31,7 @@ #include "xfs_ag.h" #include "xfs_error.h" #include "xfs_quota.h" +#include "xfs_exchmaps.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 
ce082d941459..0e2b695ab8f6 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -126,8 +126,10 @@ int xrep_fscounters(struct xfs_scrub *sc); #ifdef CONFIG_XFS_RT int xrep_rtbitmap(struct xfs_scrub *sc); +int xrep_rtsummary(struct xfs_scrub *sc); #else # define xrep_rtbitmap xrep_notsupported +# define xrep_rtsummary xrep_notsupported #endif /* CONFIG_XFS_RT */ #ifdef CONFIG_XFS_QUOTA @@ -212,6 +214,7 @@ xrep_setup_nothing( #define xrep_quotacheck xrep_notsupported #define xrep_nlinks xrep_notsupported #define xrep_fscounters xrep_notsupported +#define xrep_rtsummary xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/rtsummary.c b/fs/xfs/scrub/rtsummary.c index 5055092bd9e8..3fee603f5244 100644 --- a/fs/xfs/scrub/rtsummary.c +++ b/fs/xfs/scrub/rtsummary.c @@ -17,10 +17,14 @@ #include "xfs_bit.h" #include "xfs_bmap.h" #include "xfs_sb.h" +#include "xfs_exchmaps.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" #include "scrub/xfile.h" +#include "scrub/repair.h" +#include "scrub/tempexch.h" +#include "scrub/rtsummary.h" /* * Realtime Summary @@ -32,18 +36,6 @@ * (potentially large) amount of data in pageable memory. */ -struct xchk_rtsummary { - struct xfs_rtalloc_args args; - - uint64_t rextents; - uint64_t rbmblocks; - uint64_t rsumsize; - unsigned int rsumlevels; - - /* Memory buffer for the summary comparison. */ - union xfs_suminfo_raw words[]; -}; - /* Set us up to check the rtsummary file. */ int xchk_setup_rtsummary( @@ -60,6 +52,12 @@ xchk_setup_rtsummary( return -ENOMEM; sc->buf = rts; + if (xchk_could_repair(sc)) { + error = xrep_setup_rtsummary(sc, rts); + if (error) + return error; + } + /* * Create an xfile to construct a new rtsummary file. The xfile allows * us to avoid pinning kernel memory for this purpose. 
@@ -70,7 +68,7 @@ xchk_setup_rtsummary( if (error) return error; - error = xchk_trans_alloc(sc, 0); + error = xchk_trans_alloc(sc, rts->resblks); if (error) return error; @@ -135,7 +133,7 @@ xfsum_store( sumoff << XFS_WORDLOG); } -static inline int +inline int xfsum_copyout( struct xfs_scrub *sc, xfs_rtsumoff_t sumoff, @@ -362,7 +360,12 @@ xchk_rtsummary( error = xchk_rtsum_compare(sc); out_rbm: - /* Unlock the rtbitmap since we're done with it. */ + /* + * Unlock the rtbitmap since we're done with it. All other writers of + * the rt free space metadata grab the bitmap and summary ILOCKs in + * that order, so we're still protected against allocation activities + * even if we continue on to the repair function. + */ xfs_iunlock(mp->m_rbmip, XFS_ILOCK_SHARED | XFS_ILOCK_RTBITMAP); return error; } diff --git a/fs/xfs/scrub/rtsummary.h b/fs/xfs/scrub/rtsummary.h new file mode 100644 index 000000000000..e1d50304d8d4 --- /dev/null +++ b/fs/xfs/scrub/rtsummary.h @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_RTSUMMARY_H__ +#define __XFS_SCRUB_RTSUMMARY_H__ + +struct xchk_rtsummary { +#ifdef CONFIG_XFS_ONLINE_REPAIR + struct xrep_tempexch tempexch; +#endif + struct xfs_rtalloc_args args; + + uint64_t rextents; + uint64_t rbmblocks; + uint64_t rsumsize; + unsigned int rsumlevels; + unsigned int resblks; + + /* suminfo position of xfile as we write buffers to disk. */ + xfs_rtsumoff_t prep_wordoff; + + /* Memory buffer for the summary comparison. 
*/ + union xfs_suminfo_raw words[]; +}; + +int xfsum_copyout(struct xfs_scrub *sc, xfs_rtsumoff_t sumoff, + union xfs_suminfo_raw *rawinfo, unsigned int nr_words); + +#ifdef CONFIG_XFS_ONLINE_REPAIR +int xrep_setup_rtsummary(struct xfs_scrub *sc, struct xchk_rtsummary *rts); +#else +# define xrep_setup_rtsummary(sc, rts) (0) +#endif /* CONFIG_XFS_ONLINE_REPAIR */ + +#endif /* __XFS_SCRUB_RTSUMMARY_H__ */ diff --git a/fs/xfs/scrub/rtsummary_repair.c b/fs/xfs/scrub/rtsummary_repair.c new file mode 100644 index 000000000000..c8bb6c4f15d0 --- /dev/null +++ b/fs/xfs/scrub/rtsummary_repair.c @@ -0,0 +1,177 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_btree.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_rtalloc.h" +#include "xfs_inode.h" +#include "xfs_bit.h" +#include "xfs_bmap.h" +#include "xfs_bmap_btree.h" +#include "xfs_exchmaps.h" +#include "xfs_rtbitmap.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/tempfile.h" +#include "scrub/tempexch.h" +#include "scrub/reap.h" +#include "scrub/xfile.h" +#include "scrub/rtsummary.h" + +/* Set us up to repair the rtsummary file. */ +int +xrep_setup_rtsummary( + struct xfs_scrub *sc, + struct xchk_rtsummary *rts) +{ + struct xfs_mount *mp = sc->mp; + unsigned long long blocks; + int error; + + error = xrep_tempfile_create(sc, S_IFREG); + if (error) + return error; + + /* + * If we're doing a repair, we reserve enough blocks to write out a + * completely new summary file, plus twice as many blocks as we would + * need if we can only allocate one block per data fork mapping. 
This + * should cover the preallocation of the temporary file and exchanging + * the extent mappings. + * + * We cannot use xfs_exchmaps_estimate because we have not yet + * constructed the replacement rtsummary and therefore do not know how + * many extents it will use. By the time we do, we will have a dirty + * transaction (which we cannot drop because we cannot drop the + * rtsummary ILOCK) and cannot ask for more reservation. + */ + blocks = XFS_B_TO_FSB(mp, mp->m_rsumsize); + blocks += xfs_bmbt_calc_size(mp, blocks) * 2; + if (blocks > UINT_MAX) + return -EOPNOTSUPP; + + rts->resblks += blocks; + + /* + * Grab support for atomic file content exchanges before we allocate + * any transactions or grab ILOCKs. + */ + return xrep_tempexch_enable(sc); +} + +static int +xrep_rtsummary_prep_buf( + struct xfs_scrub *sc, + struct xfs_buf *bp, + void *data) +{ + struct xchk_rtsummary *rts = data; + struct xfs_mount *mp = sc->mp; + union xfs_suminfo_raw *ondisk; + int error; + + rts->args.mp = sc->mp; + rts->args.tp = sc->tp; + rts->args.sumbp = bp; + ondisk = xfs_rsumblock_infoptr(&rts->args, 0); + rts->args.sumbp = NULL; + + bp->b_ops = &xfs_rtbuf_ops; + + error = xfsum_copyout(sc, rts->prep_wordoff, ondisk, mp->m_blockwsize); + if (error) + return error; + + rts->prep_wordoff += mp->m_blockwsize; + xfs_trans_buf_set_type(sc->tp, bp, XFS_BLFT_RTSUMMARY_BUF); + return 0; +} + +/* Repair the realtime summary. */ +int +xrep_rtsummary( + struct xfs_scrub *sc) +{ + struct xchk_rtsummary *rts = sc->buf; + struct xfs_mount *mp = sc->mp; + xfs_filblks_t rsumblocks; + int error; + + /* We require the rmapbt to rebuild anything. */ + if (!xfs_has_rmapbt(mp)) + return -EOPNOTSUPP; + + /* Walk away if we disagree on the size of the rt bitmap. */ + if (rts->rbmblocks != mp->m_sb.sb_rbmblocks) + return 0; + + /* Make sure any problems with the fork are fixed. 
*/ + error = xrep_metadata_inode_forks(sc); + if (error) + return error; + + /* + * Try to take ILOCK_EXCL of the temporary file. We had better be the + * only ones holding onto this inode, but we can't block while holding + * the rtsummary file's ILOCK_EXCL. + */ + while (!xrep_tempfile_ilock_nowait(sc)) { + if (xchk_should_terminate(sc, &error)) + return error; + delay(1); + } + + /* Make sure we have space allocated for the entire summary file. */ + rsumblocks = XFS_B_TO_FSB(mp, rts->rsumsize); + xfs_trans_ijoin(sc->tp, sc->ip, 0); + xfs_trans_ijoin(sc->tp, sc->tempip, 0); + error = xrep_tempfile_prealloc(sc, 0, rsumblocks); + if (error) + return error; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + return error; + + /* Copy the rtsummary file that we generated. */ + error = xrep_tempfile_copyin(sc, 0, rsumblocks, + xrep_rtsummary_prep_buf, rts); + if (error) + return error; + error = xrep_tempfile_set_isize(sc, rts->rsumsize); + if (error) + return error; + + /* + * Now exchange the contents. Nothing in repair uses the temporary + * buffer, so we can reuse it for the tempfile exchrange information. + */ + error = xrep_tempexch_trans_reserve(sc, XFS_DATA_FORK, &rts->tempexch); + if (error) + return error; + + error = xrep_tempexch_contents(sc, &rts->tempexch); + if (error) + return error; + + /* Reset incore state and blow out the summary cache. */ + if (mp->m_rsum_cache) + memset(mp->m_rsum_cache, 0xFF, mp->m_sb.sb_rbmblocks); + + mp->m_rsumlevels = rts->rsumlevels; + mp->m_rsumsize = rts->rsumsize; + + /* Free the old rtsummary blocks if they're not in use. 
*/ + return xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK); +} diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index ff156edf49a0..62a064c1a5d3 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -18,6 +18,7 @@ #include "xfs_buf_mem.h" #include "xfs_rmap.h" #include "xfs_exchrange.h" +#include "xfs_exchmaps.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -354,7 +355,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_FS, .setup = xchk_setup_rtsummary, .scrub = xchk_rtsummary, - .repair = xrep_notsupported, + .repair = xrep_rtsummary, }, [XFS_SCRUB_TYPE_UQUOTA] = { /* user quota */ .type = ST_FS, ^ permalink raw reply related [flat|nested] 100+ messages in thread
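The way xrep_rtsummary sizes the tempfile and xrep_rtsummary_prep_buf walks the in-memory summary can be sketched with two small helpers. This is a userspace illustration under stated assumptions: summary words are taken to be 4 bytes (TOY_WORDLOG is a stand-in for XFS_WORDLOG), a block is assumed to hold blocksize / 4 words with no per-block header, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define TOY_WORDLOG	2	/* assumption: 4-byte summary words */

/* Round a byte count up to whole blocks, in the manner of XFS_B_TO_FSB. */
uint64_t toy_bytes_to_blocks(uint64_t bytes, unsigned int blocklog)
{
	uint64_t	blocksize = 1ULL << blocklog;

	return (bytes + blocksize - 1) >> blocklog;
}

/*
 * Word offset into the in-memory summary at which tempfile block blkno
 * starts.  This is the value prep_wordoff would hold before that block is
 * formatted, since it advances by one block's worth of words per buffer.
 */
uint64_t toy_wordoff(uint64_t blkno, unsigned int blocklog)
{
	assert(blocklog >= TOY_WORDLOG);
	return blkno << (blocklog - TOY_WORDLOG);
}
```

With 4k blocks, a 10000-byte summary needs three tempfile blocks, and the third block (blkno 2) is filled starting from word 2048 of the in-memory copy.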
* [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners
From: Darrick J. Wong @ 2024-04-15 23:35 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

Hi all,

There are a couple of significant changes that need to be made to the directory and xattr code before we can support online repairs of those data structures.

The first change is needed because online repair is designed to use libxfs to create a replacement dir/xattr structure in a temporary file, and to use atomic extent swapping to commit the corrected structure. To avoid the performance hit of walking every block of the new structure to rewrite the owner number before the swap, we instead change libxfs so that callers of the dir and xattr code can set an explicit owner number to be written into the header fields of any new blocks that are created. For regular operation this will be the directory inode number.

The second change is to update the dir/xattr code to actually *check* the owner number in each block that is read off the disk, since we don't currently do that.

If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome.
--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=dirattr-validate-owners-6.10
---
Commits in this patchset:
 * xfs: add an explicit owner field to xfs_da_args
 * xfs: use the xfs_da_args owner field to set new dir/attr block owner
 * xfs: reduce indenting in xfs_attr_node_list
 * xfs: validate attr leaf buffer owners
 * xfs: validate attr remote value buffer owners
 * xfs: validate dabtree node buffer owners
 * xfs: validate directory leaf buffer owners
 * xfs: validate explicit directory data buffer owners
 * xfs: validate explicit directory block buffer owners
 * xfs: validate explicit directory free block owners
---
 fs/xfs/libxfs/xfs_attr.c        |   14 ++-
 fs/xfs/libxfs/xfs_attr_leaf.c   |   60 ++++++++++++--
 fs/xfs/libxfs/xfs_attr_leaf.h   |    4 +
 fs/xfs/libxfs/xfs_attr_remote.c |   13 +--
 fs/xfs/libxfs/xfs_bmap.c        |    1
 fs/xfs/libxfs/xfs_da_btree.c    |  169 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_da_btree.h    |    3 +
 fs/xfs/libxfs/xfs_dir2.c        |    5 +
 fs/xfs/libxfs/xfs_dir2.h        |    4 +
 fs/xfs/libxfs/xfs_dir2_block.c  |   42 ++++++----
 fs/xfs/libxfs/xfs_dir2_data.c   |   18 +++-
 fs/xfs/libxfs/xfs_dir2_leaf.c   |  100 ++++++++++++++++++-----
 fs/xfs/libxfs/xfs_dir2_node.c   |   44 ++++++----
 fs/xfs/libxfs/xfs_dir2_priv.h   |   15 ++-
 fs/xfs/libxfs/xfs_exchmaps.c    |    7 +-
 fs/xfs/scrub/attr.c             |    1
 fs/xfs/scrub/dabtree.c          |    8 ++
 fs/xfs/scrub/dir.c              |   23 +++--
 fs/xfs/scrub/readdir.c          |    6 +
 fs/xfs/xfs_attr_item.c          |    1
 fs/xfs/xfs_attr_list.c          |   89 ++++++++++++++-------
 fs/xfs/xfs_dir2_readdir.c       |    6 +
 fs/xfs/xfs_trace.h              |    7 +-
 23 files changed, 492 insertions(+), 148 deletions(-)
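The motivation in the cover letter can be shown with a toy model: if the replacement blocks are stamped with the owner of the file being repaired while they still live in the temporary file, the exchange needs no header rewrite pass. The sketch below is only an illustration under that assumption. The toy_* types and functions are hypothetical, and a real dir/attr block header carries far more than one field.

```c
#include <stdint.h>
#include <stddef.h>

#define TOY_NBLOCKS	4

/* Hypothetical one-field block header. */
struct toy_block {
	uint64_t	owner;	/* inode number recorded in the block header */
};

/* Hypothetical file: an inode number plus the blocks mapped to it. */
struct toy_file {
	uint64_t		ino;
	struct toy_block	blocks[TOY_NBLOCKS];
};

/* Build every block of @file with an explicit owner. */
void toy_build(struct toy_file *file, uint64_t owner)
{
	for (size_t i = 0; i < TOY_NBLOCKS; i++)
		file->blocks[i].owner = owner;
}

/* Exchange block contents between two files; headers are not touched. */
void toy_exchange(struct toy_file *a, struct toy_file *b)
{
	for (size_t i = 0; i < TOY_NBLOCKS; i++) {
		struct toy_block	tmp = a->blocks[i];

		a->blocks[i] = b->blocks[i];
		b->blocks[i] = tmp;
	}
}

/* Count blocks whose recorded owner does not match the file's inode. */
size_t toy_count_bad_owners(const struct toy_file *file)
{
	size_t	bad = 0;

	for (size_t i = 0; i < TOY_NBLOCKS; i++)
		if (file->blocks[i].owner != file->ino)
			bad++;
	return bad;
}
```

Building the temporary file with toy_build(&tmp, dir.ino) leaves it "wrong" while it is staged, but after toy_exchange() the repaired file passes the owner check with no extra walk. Building it with tmp.ino instead would leave every exchanged block mismatched.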
* [PATCH 01/10] xfs: add an explicit owner field to xfs_da_args 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong @ 2024-04-15 23:46 ` Darrick J. Wong 2024-04-15 23:47 ` [PATCH 02/10] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong ` (8 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:46 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Add an explicit owner field to xfs_da_args, which will make it easier for online fsck to set the owner field of the temporary directory and xattr structures that it builds to repair damaged metadata. Note: I hopefully found all the xfs_da_args definitions by looking for automatic stack variable declarations and xfs_da_args.dp assignments: git grep -E '(args.*dp =|struct xfs_da_args[[:space:]]*[a-z0-9][a-z0-9]*)' Note that callers of xfs_attr_{get,set,change} can set the owner to zero (or leave it unset) to have the default set to args->dp. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_attr.c | 4 ++++ fs/xfs/libxfs/xfs_attr_leaf.c | 2 ++ fs/xfs/libxfs/xfs_bmap.c | 1 + fs/xfs/libxfs/xfs_da_btree.h | 1 + fs/xfs/libxfs/xfs_dir2.c | 5 +++++ fs/xfs/libxfs/xfs_exchmaps.c | 2 ++ fs/xfs/scrub/attr.c | 1 + fs/xfs/scrub/dabtree.c | 1 + fs/xfs/scrub/dir.c | 3 ++- fs/xfs/scrub/readdir.c | 2 ++ fs/xfs/xfs_attr_item.c | 1 + fs/xfs/xfs_dir2_readdir.c | 1 + fs/xfs/xfs_trace.h | 7 +++++-- 13 files changed, 28 insertions(+), 3 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 673a4b6d2e8d..74d769461443 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -264,6 +264,8 @@ xfs_attr_get( if (xfs_is_shutdown(args->dp->i_mount)) return -EIO; + if (!args->owner) + args->owner = args->dp->i_ino; args->geo = args->dp->i_mount->m_attr_geo; args->whichfork = XFS_ATTR_FORK; args->hashval = xfs_da_hashname(args->name, args->namelen); @@ -937,6 +939,8 @@ xfs_attr_set( if (error) return error; + if (!args->owner) + args->owner = args->dp->i_ino; args->geo = mp->m_attr_geo; args->whichfork = XFS_ATTR_FORK; args->hashval = xfs_da_hashname(args->name, args->namelen); diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index ac904cc1a97b..e606eae8d377 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -904,6 +904,7 @@ xfs_attr_shortform_to_leaf( nargs.whichfork = XFS_ATTR_FORK; nargs.trans = args->trans; nargs.op_flags = XFS_DA_OP_OKNOENT; + nargs.owner = args->owner; sfe = xfs_attr_sf_firstentry(sf); for (i = 0; i < sf->count; i++) { @@ -1106,6 +1107,7 @@ xfs_attr3_leaf_to_shortform( nargs.whichfork = XFS_ATTR_FORK; nargs.trans = args->trans; nargs.op_flags = XFS_DA_OP_OKNOENT; + nargs.owner = args->owner; for (i = 0; i < ichdr.count; entry++, i++) { if (entry->flags & XFS_ATTR_INCOMPLETE) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 656c95a22f2e..46bbc9f0a117 
100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -976,6 +976,7 @@ xfs_bmap_add_attrfork_local( dargs.total = dargs.geo->fsbcount; dargs.whichfork = XFS_DATA_FORK; dargs.trans = tp; + dargs.owner = ip->i_ino; return xfs_dir2_sf_to_block(&dargs); } diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index 706baf36e175..7fb13f26edaa 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -79,6 +79,7 @@ typedef struct xfs_da_args { int rmtvaluelen2; /* remote attr value length in bytes */ uint32_t op_flags; /* operation flags */ enum xfs_dacmp cmpresult; /* name compare result for lookups */ + xfs_ino_t owner; /* inode that owns the dir/attr data */ } xfs_da_args_t; /* diff --git a/fs/xfs/libxfs/xfs_dir2.c b/fs/xfs/libxfs/xfs_dir2.c index 4821519efad4..9da99fa20c75 100644 --- a/fs/xfs/libxfs/xfs_dir2.c +++ b/fs/xfs/libxfs/xfs_dir2.c @@ -250,6 +250,7 @@ xfs_dir_init( args->geo = dp->i_mount->m_dir_geo; args->dp = dp; args->trans = tp; + args->owner = dp->i_ino; error = xfs_dir2_sf_create(args, pdp->i_ino); kfree(args); return error; @@ -295,6 +296,7 @@ xfs_dir_createname( args->whichfork = XFS_DATA_FORK; args->trans = tp; args->op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + args->owner = dp->i_ino; if (!inum) args->op_flags |= XFS_DA_OP_JUSTCHECK; @@ -383,6 +385,7 @@ xfs_dir_lookup( args->whichfork = XFS_DATA_FORK; args->trans = tp; args->op_flags = XFS_DA_OP_OKNOENT; + args->owner = dp->i_ino; if (ci_name) args->op_flags |= XFS_DA_OP_CILOOKUP; @@ -456,6 +459,7 @@ xfs_dir_removename( args->total = total; args->whichfork = XFS_DATA_FORK; args->trans = tp; + args->owner = dp->i_ino; if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_removename(args); @@ -517,6 +521,7 @@ xfs_dir_replace( args->total = total; args->whichfork = XFS_DATA_FORK; args->trans = tp; + args->owner = dp->i_ino; if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { rval = xfs_dir2_sf_replace(args); diff --git 
a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index 7fa244228750..8d28e8cce5e9 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -429,6 +429,7 @@ xfs_exchmaps_attr_to_sf( .geo = tp->t_mountp->m_attr_geo, .whichfork = XFS_ATTR_FORK, .trans = tp, + .owner = xmi->xmi_ip2->i_ino, }; struct xfs_buf *bp; int forkoff; @@ -459,6 +460,7 @@ xfs_exchmaps_dir_to_sf( .geo = tp->t_mountp->m_dir_geo, .whichfork = XFS_DATA_FORK, .trans = tp, + .owner = xmi->xmi_ip2->i_ino, }; struct xfs_dir2_sf_hdr sfh; struct xfs_buf *bp; diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 83c7feb38714..0c467f4f8e77 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -169,6 +169,7 @@ xchk_xattr_listent( .hashval = xfs_da_hashname(name, namelen), .trans = context->tp, .valuelen = valuelen, + .owner = context->dp->i_ino, }; struct xchk_xattr_buf *ab; struct xchk_xattr *sx; diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index 82b150d3b8b7..fa6385a99ac4 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -494,6 +494,7 @@ xchk_da_btree( ds->dargs.whichfork = whichfork; ds->dargs.trans = sc->tp; ds->dargs.op_flags = XFS_DA_OP_OKNOENT; + ds->dargs.owner = sc->ip->i_ino; ds->state = xfs_da_state_alloc(&ds->dargs); ds->sc = sc; ds->private = private; diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 076a310b8eb0..042e28547e04 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -621,10 +621,11 @@ xchk_directory_blocks( { struct xfs_bmbt_irec got; struct xfs_da_args args = { - .dp = sc ->ip, + .dp = sc->ip, .whichfork = XFS_DATA_FORK, .geo = sc->mp->m_dir_geo, .trans = sc->tp, + .owner = sc->ip->i_ino, }; struct xfs_ifork *ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); struct xfs_mount *mp = sc->mp; diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c index dfdcb96b6c16..fb98b7624994 100644 --- a/fs/xfs/scrub/readdir.c +++ b/fs/xfs/scrub/readdir.c @@ -273,6 +273,7 @@ xchk_dir_walk( .dp = dp, 
.geo = dp->i_mount->m_dir_geo, .trans = sc->tp, + .owner = dp->i_ino, }; bool isblock; int error; @@ -324,6 +325,7 @@ xchk_dir_lookup( .hashval = xfs_dir2_hashname(dp->i_mount, name), .whichfork = XFS_DATA_FORK, .op_flags = XFS_DA_OP_OKNOENT, + .owner = dp->i_ino, }; bool isblock, isleaf; int error; diff --git a/fs/xfs/xfs_attr_item.c b/fs/xfs/xfs_attr_item.c index 9b4c61e1c22e..d46034705694 100644 --- a/fs/xfs/xfs_attr_item.c +++ b/fs/xfs/xfs_attr_item.c @@ -540,6 +540,7 @@ xfs_attri_recover_work( args->attr_filter = attrp->alfi_attr_filter & XFS_ATTRI_FILTER_MASK; args->op_flags = XFS_DA_OP_RECOVERY | XFS_DA_OP_OKNOENT | XFS_DA_OP_LOGGED; + args->owner = args->dp->i_ino; ASSERT(xfs_sb_version_haslogxattrs(&mp->m_sb)); diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c index cf9296b7e06f..4e811fa393ad 100644 --- a/fs/xfs/xfs_dir2_readdir.c +++ b/fs/xfs/xfs_dir2_readdir.c @@ -532,6 +532,7 @@ xfs_readdir( args.dp = dp; args.geo = dp->i_mount->m_dir_geo; args.trans = tp; + args.owner = dp->i_ino; if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) return xfs_dir2_sf_getdents(&args, ctx); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index caef95f2c87c..939baf08331b 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -1931,6 +1931,7 @@ DECLARE_EVENT_CLASS(xfs_da_class, __field(xfs_dahash_t, hashval) __field(xfs_ino_t, inumber) __field(uint32_t, op_flags) + __field(xfs_ino_t, owner) ), TP_fast_assign( __entry->dev = VFS_I(args->dp)->i_sb->s_dev; @@ -1941,9 +1942,10 @@ DECLARE_EVENT_CLASS(xfs_da_class, __entry->hashval = args->hashval; __entry->inumber = args->inumber; __entry->op_flags = args->op_flags; + __entry->owner = args->owner; ), TP_printk("dev %d:%d ino 0x%llx name %.*s namelen %d hashval 0x%x " - "inumber 0x%llx op_flags %s", + "inumber 0x%llx op_flags %s owner 0x%llx", MAJOR(__entry->dev), MINOR(__entry->dev), __entry->ino, __entry->namelen, @@ -1951,7 +1953,8 @@ DECLARE_EVENT_CLASS(xfs_da_class, __entry->namelen, __entry->hashval, 
__entry->inumber, - __print_flags(__entry->op_flags, "|", XFS_DA_OP_FLAGS)) + __print_flags(__entry->op_flags, "|", XFS_DA_OP_FLAGS), + __entry->owner) ) #define DEFINE_DIR2_EVENT(name) \ ^ permalink raw reply related [flat|nested] 100+ messages in thread
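The defaulting rule this patch adds to xfs_attr_get() and xfs_attr_set() is small enough to model directly: an owner of zero means "not set", and is replaced by the inode number of args->dp. The sketch below uses hypothetical toy_* types in place of struct xfs_inode and struct xfs_da_args, and assumes (as the patch's convention implies) that zero is never a real owner.

```c
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_inode. */
struct toy_inode {
	uint64_t	i_ino;
};

/* Hypothetical stand-in for struct xfs_da_args. */
struct toy_da_args {
	struct toy_inode	*dp;	/* inode being operated on */
	uint64_t		owner;	/* 0 means "caller did not set an owner" */
};

/* Fill in the owner only if the caller left it unset. */
void toy_args_default_owner(struct toy_da_args *args)
{
	if (!args->owner)
		args->owner = args->dp->i_ino;
}
```

Regular callers can therefore leave the field zeroed, while a repair caller working on a temporary file can preset it to the inode number of the file being repaired.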
* [PATCH 02/10] xfs: use the xfs_da_args owner field to set new dir/attr block owner 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong 2024-04-15 23:46 ` [PATCH 01/10] xfs: add an explicit owner field to xfs_da_args Darrick J. Wong @ 2024-04-15 23:47 ` Darrick J. Wong 2024-04-15 23:47 ` [PATCH 03/10] xfs: reduce indenting in xfs_attr_node_list Darrick J. Wong ` (7 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:47 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When we're creating leaf, data, freespace, or dabtree blocks for directories and xattrs, use the explicit owner field (instead of the xfs_inode) to set the owner field. This will enable online repair to construct replacement data structures in a temporary file without having to change the owner fields prior to swapping the new and old structures. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_attr_leaf.c | 2 +- fs/xfs/libxfs/xfs_attr_remote.c | 4 ++-- fs/xfs/libxfs/xfs_da_btree.c | 2 +- fs/xfs/libxfs/xfs_dir2_block.c | 19 ++++++++++--------- fs/xfs/libxfs/xfs_dir2_data.c | 2 +- fs/xfs/libxfs/xfs_dir2_leaf.c | 11 +++++------ fs/xfs/libxfs/xfs_dir2_node.c | 2 +- 7 files changed, 21 insertions(+), 21 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index e606eae8d377..8937c034b330 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -1239,7 +1239,7 @@ xfs_attr3_leaf_create( ichdr.magic = XFS_ATTR3_LEAF_MAGIC; hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp)); - hdr3->owner = cpu_to_be64(dp->i_ino); + hdr3->owner = cpu_to_be64(args->owner); uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid); ichdr.freemap[0].base = sizeof(struct xfs_attr3_leaf_hdr); diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c index ff0412828772..024895cc7029 100644 --- a/fs/xfs/libxfs/xfs_attr_remote.c +++ b/fs/xfs/libxfs/xfs_attr_remote.c @@ -522,8 +522,8 @@ xfs_attr_rmtval_set_value( return error; bp->b_ops = &xfs_attr3_rmt_buf_ops; - xfs_attr_rmtval_copyin(mp, bp, args->dp->i_ino, &offset, - &valuelen, &src); + xfs_attr_rmtval_copyin(mp, bp, args->owner, &offset, &valuelen, + &src); error = xfs_bwrite(bp); /* GROT: NOTE: synchronous write */ xfs_buf_relse(bp); diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index 718d071bb21a..743f6421cc04 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -486,7 +486,7 @@ xfs_da3_node_create( memset(hdr3, 0, sizeof(struct xfs_da3_node_hdr)); ichdr.magic = XFS_DA3_NODE_MAGIC; hdr3->info.blkno = cpu_to_be64(xfs_buf_daddr(bp)); - hdr3->info.owner = cpu_to_be64(args->dp->i_ino); + hdr3->info.owner = cpu_to_be64(args->owner); uuid_copy(&hdr3->info.uuid, &mp->m_sb.sb_meta_uuid); } else { ichdr.magic = XFS_DA_NODE_MAGIC; diff 
--git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c index a2da007adb46..61cbc668f228 100644 --- a/fs/xfs/libxfs/xfs_dir2_block.c +++ b/fs/xfs/libxfs/xfs_dir2_block.c @@ -163,12 +163,13 @@ xfs_dir3_block_read( static void xfs_dir3_block_init( - struct xfs_mount *mp, - struct xfs_trans *tp, - struct xfs_buf *bp, - struct xfs_inode *dp) + struct xfs_da_args *args, + struct xfs_buf *bp) { - struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr; + struct xfs_trans *tp = args->trans; + struct xfs_inode *dp = args->dp; + struct xfs_mount *mp = dp->i_mount; + struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr; bp->b_ops = &xfs_dir3_block_buf_ops; xfs_trans_buf_set_type(tp, bp, XFS_BLFT_DIR_BLOCK_BUF); @@ -177,7 +178,7 @@ xfs_dir3_block_init( memset(hdr3, 0, sizeof(*hdr3)); hdr3->magic = cpu_to_be32(XFS_DIR3_BLOCK_MAGIC); hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp)); - hdr3->owner = cpu_to_be64(dp->i_ino); + hdr3->owner = cpu_to_be64(args->owner); uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid); return; @@ -1009,7 +1010,7 @@ xfs_dir2_leaf_to_block( /* * Start converting it to block form. */ - xfs_dir3_block_init(mp, tp, dbp, dp); + xfs_dir3_block_init(args, dbp); needlog = 1; needscan = 0; @@ -1129,7 +1130,7 @@ xfs_dir2_sf_to_block( error = xfs_dir3_data_init(args, blkno, &bp); if (error) goto out_free; - xfs_dir3_block_init(mp, tp, bp, dp); + xfs_dir3_block_init(args, bp); hdr = bp->b_addr; /* @@ -1169,7 +1170,7 @@ xfs_dir2_sf_to_block( * Create entry for . 
*/ dep = bp->b_addr + offset; - dep->inumber = cpu_to_be64(dp->i_ino); + dep->inumber = cpu_to_be64(args->owner); dep->namelen = 1; dep->name[0] = '.'; xfs_dir2_data_put_ftype(mp, dep, XFS_DIR3_FT_DIR); diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c index 7a6d965bea71..c3ef720b5ff6 100644 --- a/fs/xfs/libxfs/xfs_dir2_data.c +++ b/fs/xfs/libxfs/xfs_dir2_data.c @@ -725,7 +725,7 @@ xfs_dir3_data_init( memset(hdr3, 0, sizeof(*hdr3)); hdr3->magic = cpu_to_be32(XFS_DIR3_DATA_MAGIC); hdr3->blkno = cpu_to_be64(xfs_buf_daddr(bp)); - hdr3->owner = cpu_to_be64(dp->i_ino); + hdr3->owner = cpu_to_be64(args->owner); uuid_copy(&hdr3->uuid, &mp->m_sb.sb_meta_uuid); } else diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 08dda5ce9d91..20ce057d12e8 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -304,12 +304,12 @@ xfs_dir3_leafn_read( */ static void xfs_dir3_leaf_init( - struct xfs_mount *mp, - struct xfs_trans *tp, + struct xfs_da_args *args, struct xfs_buf *bp, - xfs_ino_t owner, uint16_t type) { + struct xfs_mount *mp = args->dp->i_mount; + struct xfs_trans *tp = args->trans; struct xfs_dir2_leaf *leaf = bp->b_addr; ASSERT(type == XFS_DIR2_LEAF1_MAGIC || type == XFS_DIR2_LEAFN_MAGIC); @@ -323,7 +323,7 @@ xfs_dir3_leaf_init( ? 
cpu_to_be16(XFS_DIR3_LEAF1_MAGIC) : cpu_to_be16(XFS_DIR3_LEAFN_MAGIC); leaf3->info.blkno = cpu_to_be64(xfs_buf_daddr(bp)); - leaf3->info.owner = cpu_to_be64(owner); + leaf3->info.owner = cpu_to_be64(args->owner); uuid_copy(&leaf3->info.uuid, &mp->m_sb.sb_meta_uuid); } else { memset(leaf, 0, sizeof(*leaf)); @@ -356,7 +356,6 @@ xfs_dir3_leaf_get_buf( { struct xfs_inode *dp = args->dp; struct xfs_trans *tp = args->trans; - struct xfs_mount *mp = dp->i_mount; struct xfs_buf *bp; int error; @@ -369,7 +368,7 @@ xfs_dir3_leaf_get_buf( if (error) return error; - xfs_dir3_leaf_init(mp, tp, bp, dp->i_ino, magic); + xfs_dir3_leaf_init(args, bp, magic); xfs_dir3_leaf_log_header(args, bp); if (magic == XFS_DIR2_LEAF1_MAGIC) xfs_dir3_leaf_log_tail(args, bp); diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index be0b8834028c..1ad7405f9c38 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -349,7 +349,7 @@ xfs_dir3_free_get_buf( hdr.magic = XFS_DIR3_FREE_MAGIC; hdr3->hdr.blkno = cpu_to_be64(xfs_buf_daddr(bp)); - hdr3->hdr.owner = cpu_to_be64(dp->i_ino); + hdr3->hdr.owner = cpu_to_be64(args->owner); uuid_copy(&hdr3->hdr.uuid, &mp->m_sb.sb_meta_uuid); } else hdr.magic = XFS_DIR2_FREE_MAGIC; ^ permalink raw reply related [flat|nested] 100+ messages in thread
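Each hunk in this patch makes the same substitution: the big-endian owner field in a new block header is filled from args->owner rather than from the inode that the block is attached to. A userspace sketch of that stamping step follows. The toy_* names are hypothetical, and toy_put_be64() is a portable stand-in for the kernel's cpu_to_be64().

```c
#include <stdint.h>
#include <string.h>

/* Store @v into @out in big-endian byte order. */
void toy_put_be64(uint8_t out[8], uint64_t v)
{
	for (int i = 7; i >= 0; i--) {
		out[i] = (uint8_t)(v & 0xff);
		v >>= 8;
	}
}

/* Load a big-endian 64-bit value from @in. */
uint64_t toy_get_be64(const uint8_t in[8])
{
	uint64_t	v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | in[i];
	return v;
}

/* Hypothetical stand-ins for struct xfs_inode and struct xfs_da_args. */
struct toy_inode {
	uint64_t	i_ino;
};

struct toy_da_args {
	struct toy_inode	*dp;	/* file the new block is attached to */
	uint64_t		owner;	/* owner to record in the header */
};

/* Hypothetical on-disk header holding only the owner field. */
struct toy_hdr {
	uint8_t		owner[8];
};

/* Initialize a new header using the explicit owner, not dp->i_ino. */
void toy_hdr_init(struct toy_hdr *hdr, const struct toy_da_args *args)
{
	memset(hdr, 0, sizeof(*hdr));
	toy_put_be64(hdr->owner, args->owner);
}
```

With dp pointing at a temporary file and owner set to the file under repair, the header records the latter, which is the decoupling the commit message describes.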
* [PATCH 03/10] xfs: reduce indenting in xfs_attr_node_list 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong 2024-04-15 23:46 ` [PATCH 01/10] xfs: add an explicit owner field to xfs_da_args Darrick J. Wong 2024-04-15 23:47 ` [PATCH 02/10] xfs: use the xfs_da_args owner field to set new dir/attr block owner Darrick J. Wong @ 2024-04-15 23:47 ` Darrick J. Wong 2024-04-15 23:47 ` [PATCH 04/10] xfs: validate attr leaf buffer owners Darrick J. Wong ` (6 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:47 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Reduce the indentation here so that we can add some things in the next patch without going over the column limits. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_attr_list.c | 56 +++++++++++++++++++++++++----------------------- 1 file changed, 29 insertions(+), 27 deletions(-) diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c index a6819a642cc0..42a575db7267 100644 --- a/fs/xfs/xfs_attr_list.c +++ b/fs/xfs/xfs_attr_list.c @@ -310,46 +310,47 @@ xfs_attr_node_list( */ bp = NULL; if (cursor->blkno > 0) { + struct xfs_attr_leaf_entry *entries; + error = xfs_da3_node_read(context->tp, dp, cursor->blkno, &bp, XFS_ATTR_FORK); if (xfs_metadata_is_sick(error)) xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK); - if ((error != 0) && (error != -EFSCORRUPTED)) + if (error != 0 && error != -EFSCORRUPTED) return error; - if (bp) { - struct xfs_attr_leaf_entry *entries; + if (!bp) + goto need_lookup; - node = bp->b_addr; - switch (be16_to_cpu(node->hdr.info.magic)) { - case XFS_DA_NODE_MAGIC: - case XFS_DA3_NODE_MAGIC: + node = bp->b_addr; + switch (be16_to_cpu(node->hdr.info.magic)) { + case XFS_DA_NODE_MAGIC: + case 
XFS_DA3_NODE_MAGIC: + trace_xfs_attr_list_wrong_blk(context); + xfs_trans_brelse(context->tp, bp); + bp = NULL; + break; + case XFS_ATTR_LEAF_MAGIC: + case XFS_ATTR3_LEAF_MAGIC: + leaf = bp->b_addr; + xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, + &leafhdr, leaf); + entries = xfs_attr3_leaf_entryp(leaf); + if (cursor->hashval > be32_to_cpu( + entries[leafhdr.count - 1].hashval)) { trace_xfs_attr_list_wrong_blk(context); xfs_trans_brelse(context->tp, bp); bp = NULL; - break; - case XFS_ATTR_LEAF_MAGIC: - case XFS_ATTR3_LEAF_MAGIC: - leaf = bp->b_addr; - xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, - &leafhdr, leaf); - entries = xfs_attr3_leaf_entryp(leaf); - if (cursor->hashval > be32_to_cpu( - entries[leafhdr.count - 1].hashval)) { - trace_xfs_attr_list_wrong_blk(context); - xfs_trans_brelse(context->tp, bp); - bp = NULL; - } else if (cursor->hashval <= be32_to_cpu( - entries[0].hashval)) { - trace_xfs_attr_list_wrong_blk(context); - xfs_trans_brelse(context->tp, bp); - bp = NULL; - } - break; - default: + } else if (cursor->hashval <= be32_to_cpu( + entries[0].hashval)) { trace_xfs_attr_list_wrong_blk(context); xfs_trans_brelse(context->tp, bp); bp = NULL; } + break; + default: + trace_xfs_attr_list_wrong_blk(context); + xfs_trans_brelse(context->tp, bp); + bp = NULL; } } @@ -359,6 +360,7 @@ xfs_attr_node_list( * Note that start of node block is same as start of leaf block. */ if (bp == NULL) { +need_lookup: error = xfs_attr_node_list_lookup(context, cursor, &bp); if (error || !bp) return error; ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 04/10] xfs: validate attr leaf buffer owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:47 ` [PATCH 03/10] xfs: reduce indenting in xfs_attr_node_list Darrick J. Wong @ 2024-04-15 23:47 ` Darrick J. Wong 2024-04-15 23:47 ` [PATCH 05/10] xfs: validate attr remote value " Darrick J. Wong ` (5 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:47 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a leaf block header checking function to validate the owner field of xattr leaf blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_attr.c | 10 ++++--- fs/xfs/libxfs/xfs_attr_leaf.c | 56 ++++++++++++++++++++++++++++++++++------- fs/xfs/libxfs/xfs_attr_leaf.h | 4 ++- fs/xfs/libxfs/xfs_da_btree.c | 42 +++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_da_btree.h | 1 + fs/xfs/libxfs/xfs_exchmaps.c | 3 +- fs/xfs/scrub/dabtree.c | 7 +++++ fs/xfs/xfs_attr_list.c | 24 +++++++++++++++--- 8 files changed, 128 insertions(+), 19 deletions(-) diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index 74d769461443..b3c9666cd011 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -649,8 +649,8 @@ xfs_attr_leaf_remove_attr( int forkoff; int error; - error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, - &bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, + args->blkno, &bp); if (error) return error; @@ -681,7 +681,7 @@ xfs_attr_leaf_shrink( if (!xfs_attr_is_leaf(dp)) return 0; - error = xfs_attr3_leaf_read(args->trans, args->dp, 0, &bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, &bp); if (error) return error; @@ -1158,7 +1158,7 @@ xfs_attr_leaf_try_add( struct xfs_buf *bp; int error; - 
error = xfs_attr3_leaf_read(args->trans, args->dp, 0, &bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, &bp); if (error) return error; @@ -1206,7 +1206,7 @@ xfs_attr_leaf_hasname( { int error = 0; - error = xfs_attr3_leaf_read(args->trans, args->dp, 0, bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, 0, bp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_attr_leaf.c b/fs/xfs/libxfs/xfs_attr_leaf.c index 8937c034b330..17ec5ff5a4e3 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.c +++ b/fs/xfs/libxfs/xfs_attr_leaf.c @@ -388,6 +388,27 @@ xfs_attr3_leaf_verify( return NULL; } +xfs_failaddr_t +xfs_attr3_leaf_header_check( + struct xfs_buf *bp, + xfs_ino_t owner) +{ + struct xfs_mount *mp = bp->b_mount; + + if (xfs_has_crc(mp)) { + struct xfs_attr3_leafblock *hdr3 = bp->b_addr; + + if (hdr3->hdr.info.hdr.magic != + cpu_to_be16(XFS_ATTR3_LEAF_MAGIC)) + return __this_address; + + if (be64_to_cpu(hdr3->hdr.info.owner) != owner) + return __this_address; + } + + return NULL; +} + static void xfs_attr3_leaf_write_verify( struct xfs_buf *bp) @@ -448,16 +469,30 @@ int xfs_attr3_leaf_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t bno, struct xfs_buf **bpp) { + xfs_failaddr_t fa; int err; err = xfs_da_read_buf(tp, dp, bno, 0, bpp, XFS_ATTR_FORK, &xfs_attr3_leaf_buf_ops); - if (!err && tp && *bpp) + if (err || !(*bpp)) + return err; + + fa = xfs_attr3_leaf_header_check(*bpp, owner); + if (fa) { + __xfs_buf_mark_corrupt(*bpp, fa); + xfs_trans_brelse(tp, *bpp); + *bpp = NULL; + xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK); + return -EFSCORRUPTED; + } + + if (tp) xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_ATTR_LEAF_BUF); - return err; + return 0; } /*======================================================================== @@ -1160,7 +1195,7 @@ xfs_attr3_leaf_to_node( error = xfs_da_grow_inode(args, &blkno); if (error) goto out; - error = xfs_attr3_leaf_read(args->trans, dp, 0, &bp1); + error = 
xfs_attr3_leaf_read(args->trans, dp, args->owner, 0, &bp1); if (error) goto out; @@ -1995,7 +2030,7 @@ xfs_attr3_leaf_toosmall( if (blkno == 0) continue; error = xfs_attr3_leaf_read(state->args->trans, state->args->dp, - blkno, &bp); + state->args->owner, blkno, &bp); if (error) return error; @@ -2717,7 +2752,8 @@ xfs_attr3_leaf_clearflag( /* * Set up the operation. */ - error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, + args->blkno, &bp); if (error) return error; @@ -2781,7 +2817,8 @@ xfs_attr3_leaf_setflag( /* * Set up the operation. */ - error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, + args->blkno, &bp); if (error) return error; @@ -2840,7 +2877,8 @@ xfs_attr3_leaf_flipflags( /* * Read the block containing the "old" attr */ - error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno, &bp1); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, + args->blkno, &bp1); if (error) return error; @@ -2848,8 +2886,8 @@ xfs_attr3_leaf_flipflags( * Read the block containing the "new" attr, if it is different */ if (args->blkno2 != args->blkno) { - error = xfs_attr3_leaf_read(args->trans, args->dp, args->blkno2, - &bp2); + error = xfs_attr3_leaf_read(args->trans, args->dp, args->owner, + args->blkno2, &bp2); if (error) return error; } else { diff --git a/fs/xfs/libxfs/xfs_attr_leaf.h b/fs/xfs/libxfs/xfs_attr_leaf.h index 9b9948639c0f..bac219589896 100644 --- a/fs/xfs/libxfs/xfs_attr_leaf.h +++ b/fs/xfs/libxfs/xfs_attr_leaf.h @@ -98,12 +98,14 @@ int xfs_attr_leaf_order(struct xfs_buf *leaf1_bp, struct xfs_buf *leaf2_bp); int xfs_attr_leaf_newentsize(struct xfs_da_args *args, int *local); int xfs_attr3_leaf_read(struct xfs_trans *tp, struct xfs_inode *dp, - xfs_dablk_t bno, struct xfs_buf **bpp); + xfs_ino_t owner, xfs_dablk_t bno, struct xfs_buf **bpp); void 
xfs_attr3_leaf_hdr_from_disk(struct xfs_da_geometry *geo, struct xfs_attr3_icleaf_hdr *to, struct xfs_attr_leafblock *from); void xfs_attr3_leaf_hdr_to_disk(struct xfs_da_geometry *geo, struct xfs_attr_leafblock *to, struct xfs_attr3_icleaf_hdr *from); +xfs_failaddr_t xfs_attr3_leaf_header_check(struct xfs_buf *bp, + xfs_ino_t owner); #endif /* __XFS_ATTR_LEAF_H__ */ diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index 743f6421cc04..65eef8775187 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -252,6 +252,25 @@ xfs_da3_node_verify( return NULL; } +xfs_failaddr_t +xfs_da3_header_check( + struct xfs_buf *bp, + xfs_ino_t owner) +{ + struct xfs_mount *mp = bp->b_mount; + struct xfs_da_blkinfo *hdr = bp->b_addr; + + if (!xfs_has_crc(mp)) + return NULL; + + switch (hdr->magic) { + case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC): + return xfs_attr3_leaf_header_check(bp, owner); + } + + return NULL; +} + static void xfs_da3_node_write_verify( struct xfs_buf *bp) @@ -1591,6 +1610,7 @@ xfs_da3_node_lookup_int( struct xfs_da_node_entry *btree; struct xfs_da3_icnode_hdr nodehdr; struct xfs_da_args *args; + xfs_failaddr_t fa; xfs_dablk_t blkno; xfs_dahash_t hashval; xfs_dahash_t btreehashval; @@ -1629,6 +1649,12 @@ xfs_da3_node_lookup_int( if (magic == XFS_ATTR_LEAF_MAGIC || magic == XFS_ATTR3_LEAF_MAGIC) { + fa = xfs_attr3_leaf_header_check(blk->bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } blk->magic = XFS_ATTR_LEAF_MAGIC; blk->hashval = xfs_attr_leaf_lasthash(blk->bp, NULL); break; @@ -1996,6 +2022,7 @@ xfs_da3_path_shift( struct xfs_da_node_entry *btree; struct xfs_da3_icnode_hdr nodehdr; struct xfs_buf *bp; + xfs_failaddr_t fa; xfs_dablk_t blkno = 0; int level; int error; @@ -2087,6 +2114,12 @@ xfs_da3_path_shift( break; case XFS_ATTR_LEAF_MAGIC: case XFS_ATTR3_LEAF_MAGIC: + fa = xfs_attr3_leaf_header_check(blk->bp, args->owner); + if (fa) { + 
__xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } blk->magic = XFS_ATTR_LEAF_MAGIC; ASSERT(level == path->active-1); blk->index = 0; @@ -2290,6 +2323,7 @@ xfs_da3_swap_lastblock( struct xfs_buf *last_buf; struct xfs_buf *sib_buf; struct xfs_buf *par_buf; + xfs_failaddr_t fa; xfs_dahash_t dead_hash; xfs_fileoff_t lastoff; xfs_dablk_t dead_blkno; @@ -2326,6 +2360,14 @@ xfs_da3_swap_lastblock( error = xfs_da3_node_read(tp, dp, last_blkno, &last_buf, w); if (error) return error; + fa = xfs_da3_header_check(last_buf, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(last_buf, fa); + xfs_trans_brelse(tp, last_buf); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } + /* * Copy the last block into the dead buffer and log it. */ diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index 7fb13f26edaa..99618e0c8a72 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -236,6 +236,7 @@ void xfs_da3_node_hdr_from_disk(struct xfs_mount *mp, struct xfs_da3_icnode_hdr *to, struct xfs_da_intnode *from); void xfs_da3_node_hdr_to_disk(struct xfs_mount *mp, struct xfs_da_intnode *to, struct xfs_da3_icnode_hdr *from); +xfs_failaddr_t xfs_da3_header_check(struct xfs_buf *bp, xfs_ino_t owner); extern struct kmem_cache *xfs_da_state_cache; diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index 8d28e8cce5e9..9c9cf2e998b2 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -438,7 +438,8 @@ xfs_exchmaps_attr_to_sf( if (!xfs_attr_is_leaf(xmi->xmi_ip2)) return 0; - error = xfs_attr3_leaf_read(tp, xmi->xmi_ip2, 0, &bp); + error = xfs_attr3_leaf_read(tp, xmi->xmi_ip2, xmi->xmi_ip2->i_ino, 0, + &bp); if (error) return error; diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c index fa6385a99ac4..c71254088dff 100644 --- a/fs/xfs/scrub/dabtree.c +++ b/fs/xfs/scrub/dabtree.c @@ -320,6 +320,7 @@ xchk_da_btree_block( struct xfs_da3_blkinfo *hdr3; 
struct xfs_da_args *dargs = &ds->dargs; struct xfs_inode *ip = ds->dargs.dp; + xfs_failaddr_t fa; xfs_ino_t owner; int *pmaxrecs; struct xfs_da3_icnode_hdr nodehdr; @@ -442,6 +443,12 @@ xchk_da_btree_block( goto out_freebp; } + fa = xfs_da3_header_check(blk->bp, dargs->owner); + if (fa) { + xchk_da_set_corrupt(ds, level); + goto out_freebp; + } + /* * If we've been handed a block that is below the dabtree root, does * its hashval match what the parent block expected to see? diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c index 42a575db7267..f6496e33ff91 100644 --- a/fs/xfs/xfs_attr_list.c +++ b/fs/xfs/xfs_attr_list.c @@ -214,6 +214,7 @@ xfs_attr_node_list_lookup( struct xfs_mount *mp = dp->i_mount; struct xfs_trans *tp = context->tp; struct xfs_buf *bp; + xfs_failaddr_t fa; int i; int error = 0; unsigned int expected_level = 0; @@ -273,6 +274,12 @@ xfs_attr_node_list_lookup( } } + fa = xfs_attr3_leaf_header_check(bp, dp->i_ino); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + goto out_releasebuf; + } + if (expected_level != 0) goto out_corruptbuf; @@ -281,6 +288,7 @@ xfs_attr_node_list_lookup( out_corruptbuf: xfs_buf_mark_corrupt(bp); +out_releasebuf: xfs_trans_brelse(tp, bp); xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK); return -EFSCORRUPTED; @@ -297,6 +305,7 @@ xfs_attr_node_list( struct xfs_buf *bp; struct xfs_inode *dp = context->dp; struct xfs_mount *mp = dp->i_mount; + xfs_failaddr_t fa; int error = 0; trace_xfs_attr_node_list(context); @@ -332,6 +341,14 @@ xfs_attr_node_list( case XFS_ATTR_LEAF_MAGIC: case XFS_ATTR3_LEAF_MAGIC: leaf = bp->b_addr; + fa = xfs_attr3_leaf_header_check(bp, dp->i_ino); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(context->tp, bp); + xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK); + bp = NULL; + break; + } xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf); entries = xfs_attr3_leaf_entryp(leaf); @@ -382,8 +399,8 @@ xfs_attr_node_list( break; cursor->blkno = leafhdr.forw; 
xfs_trans_brelse(context->tp, bp); - error = xfs_attr3_leaf_read(context->tp, dp, cursor->blkno, - &bp); + error = xfs_attr3_leaf_read(context->tp, dp, dp->i_ino, + cursor->blkno, &bp); if (error) return error; } @@ -503,7 +520,8 @@ xfs_attr_leaf_list( trace_xfs_attr_leaf_list(context); context->cursor.blkno = 0; - error = xfs_attr3_leaf_read(context->tp, context->dp, 0, &bp); + error = xfs_attr3_leaf_read(context->tp, context->dp, + context->dp->i_ino, 0, &bp); if (error) return error; ^ permalink raw reply related [flat|nested] 100+ messages in thread
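For anyone reading along outside the kernel tree, the dispatch in xfs_da3_header_check() above can be approximated in plain userspace C. This is only a rough sketch: every demo_ name, the magic values, and the header layout are made up for illustration and do not match the XFS on-disk format, and a string stands in for the xfs_failaddr_t return value. It shows the shape of the helper: decode the magic, hand the buffer to the per-type checker, and accept block types that have no checker yet.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical magic numbers for this sketch; not the real XFS values. */
#define DEMO_ATTR_LEAF_MAGIC	0x1001u
#define DEMO_DA_NODE_MAGIC	0x1002u

/* Hypothetical header: big-endian magic followed by big-endian owner. */
struct demo_blkhdr {
	uint8_t		magic[2];
	uint8_t		owner[8];
};

/* Decode a big-endian 16-bit field. */
static uint16_t demo_get_be16(const uint8_t *p)
{
	return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decode a big-endian 64-bit field. */
static uint64_t demo_get_be64(const uint8_t *p)
{
	uint64_t	v = 0;
	int		i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Build a header in "disk" byte order; only used to exercise the checks. */
static struct demo_blkhdr demo_make_hdr(uint16_t magic, uint64_t owner)
{
	struct demo_blkhdr	h;
	int			i;

	h.magic[0] = (uint8_t)(magic >> 8);
	h.magic[1] = (uint8_t)(magic & 0xff);
	for (i = 0; i < 8; i++)
		h.owner[i] = (uint8_t)(owner >> (56 - 8 * i));
	return h;
}

/* Per-type check: NULL on success, otherwise a description of the failure. */
static const char *demo_attr_leaf_check(const struct demo_blkhdr *h,
					uint64_t owner)
{
	if (demo_get_be64(h->owner) != owner)
		return "attr leaf owner mismatch";
	return NULL;
}

/* Pick the checker that matches the block's magic number. */
static const char *demo_header_check(const struct demo_blkhdr *h,
				     uint64_t owner)
{
	assert(h != NULL);

	switch (demo_get_be16(h->magic)) {
	case DEMO_ATTR_LEAF_MAGIC:
		return demo_attr_leaf_check(h, owner);
	}

	/* Block types without a checker yet are accepted unchanged. */
	return NULL;
}
```

In this sketch a node block passes regardless of owner because only the attr leaf case is wired up, which mirrors the state of the switch statement at this point in the series.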
* [PATCH 05/10] xfs: validate attr remote value buffer owners
  2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-04-15 23:47   ` [PATCH 04/10] xfs: validate attr leaf buffer owners Darrick J. Wong
@ 2024-04-15 23:47   ` Darrick J. Wong
  2024-04-15 23:48   ` [PATCH 06/10] xfs: validate dabtree node " Darrick J. Wong
                     ` (4 subsequent siblings)
  9 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:47 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Check the owner field of xattr remote value blocks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/libxfs/xfs_attr_remote.c |    9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_attr_remote.c b/fs/xfs/libxfs/xfs_attr_remote.c
index 024895cc7029..a8de9dc1e998 100644
--- a/fs/xfs/libxfs/xfs_attr_remote.c
+++ b/fs/xfs/libxfs/xfs_attr_remote.c
@@ -280,12 +280,12 @@ xfs_attr_rmtval_copyout(
 	struct xfs_mount	*mp,
 	struct xfs_buf		*bp,
 	struct xfs_inode	*dp,
+	xfs_ino_t		owner,
 	int			*offset,
 	int			*valuelen,
 	uint8_t			**dst)
 {
 	char			*src = bp->b_addr;
-	xfs_ino_t		ino = dp->i_ino;
 	xfs_daddr_t		bno = xfs_buf_daddr(bp);
 	int			len = BBTOB(bp->b_length);
 	int			blksize = mp->m_attr_geo->blksize;
@@ -299,11 +299,11 @@ xfs_attr_rmtval_copyout(
 		byte_cnt = min(*valuelen, byte_cnt);
 
 		if (xfs_has_crc(mp)) {
-			if (xfs_attr3_rmt_hdr_ok(src, ino, *offset,
+			if (xfs_attr3_rmt_hdr_ok(src, owner, *offset,
 						  byte_cnt, bno)) {
 				xfs_alert(mp,
 "remote attribute header mismatch bno/off/len/owner (0x%llx/0x%x/Ox%x/0x%llx)",
-					bno, *offset, byte_cnt, ino);
+					bno, *offset, byte_cnt, owner);
 				xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK);
 				return -EFSCORRUPTED;
 			}
@@ -427,8 +427,7 @@ xfs_attr_rmtval_get(
 				return error;
 
 			error = xfs_attr_rmtval_copyout(mp, bp, args->dp,
-							&offset, &valuelen,
-							&dst);
+					args->owner, &offset, &valuelen, &dst);
 			xfs_buf_relse(bp);
 			if (error)
 				return error;
* [PATCH 06/10] xfs: validate dabtree node buffer owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (4 preceding siblings ...) 2024-04-15 23:47 ` [PATCH 05/10] xfs: validate attr remote value " Darrick J. Wong @ 2024-04-15 23:48 ` Darrick J. Wong 2024-04-15 23:48 ` [PATCH 07/10] xfs: validate directory leaf " Darrick J. Wong ` (3 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:48 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Check the owner field of dabtree node blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_da_btree.c | 109 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/libxfs/xfs_da_btree.h | 1 fs/xfs/xfs_attr_list.c | 9 +++ 3 files changed, 119 insertions(+) diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index 65eef8775187..e6c28bccdbc0 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -252,6 +252,26 @@ xfs_da3_node_verify( return NULL; } +xfs_failaddr_t +xfs_da3_node_header_check( + struct xfs_buf *bp, + xfs_ino_t owner) +{ + struct xfs_mount *mp = bp->b_mount; + + if (xfs_has_crc(mp)) { + struct xfs_da3_blkinfo *hdr3 = bp->b_addr; + + if (hdr3->hdr.magic != cpu_to_be16(XFS_DA3_NODE_MAGIC)) + return __this_address; + + if (be64_to_cpu(hdr3->owner) != owner) + return __this_address; + } + + return NULL; +} + xfs_failaddr_t xfs_da3_header_check( struct xfs_buf *bp, @@ -266,6 +286,8 @@ xfs_da3_header_check( switch (hdr->magic) { case cpu_to_be16(XFS_ATTR3_LEAF_MAGIC): return xfs_attr3_leaf_header_check(bp, owner); + case cpu_to_be16(XFS_DA3_NODE_MAGIC): + return xfs_da3_node_header_check(bp, owner); } return NULL; @@ -1218,6 +1240,7 @@ xfs_da3_root_join( struct xfs_da3_icnode_hdr oldroothdr; int error; struct xfs_inode *dp = 
state->args->dp; + xfs_failaddr_t fa; trace_xfs_da_root_join(state->args); @@ -1244,6 +1267,13 @@ xfs_da3_root_join( error = xfs_da3_node_read(args->trans, dp, child, &bp, args->whichfork); if (error) return error; + fa = xfs_da3_header_check(bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(args->trans, bp); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } xfs_da_blkinfo_onlychild_validate(bp->b_addr, oldroothdr.level); /* @@ -1278,6 +1308,7 @@ xfs_da3_node_toosmall( struct xfs_da_blkinfo *info; xfs_dablk_t blkno; struct xfs_buf *bp; + xfs_failaddr_t fa; struct xfs_da3_icnode_hdr nodehdr; int count; int forward; @@ -1352,6 +1383,13 @@ xfs_da3_node_toosmall( state->args->whichfork); if (error) return error; + fa = xfs_da3_node_header_check(bp, state->args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(state->args->trans, bp); + xfs_da_mark_sick(state->args); + return -EFSCORRUPTED; + } node = bp->b_addr; xfs_da3_node_hdr_from_disk(dp->i_mount, &thdr, node); @@ -1674,6 +1712,13 @@ xfs_da3_node_lookup_int( return -EFSCORRUPTED; } + fa = xfs_da3_node_header_check(blk->bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } + blk->magic = XFS_DA_NODE_MAGIC; /* @@ -1846,6 +1891,7 @@ xfs_da3_blk_link( struct xfs_da_blkinfo *tmp_info; struct xfs_da_args *args; struct xfs_buf *bp; + xfs_failaddr_t fa; int before = 0; int error; struct xfs_inode *dp = state->args->dp; @@ -1889,6 +1935,13 @@ xfs_da3_blk_link( &bp, args->whichfork); if (error) return error; + fa = xfs_da3_header_check(bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(args->trans, bp); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } ASSERT(bp != NULL); tmp_info = bp->b_addr; ASSERT(tmp_info->magic == old_info->magic); @@ -1910,6 +1963,13 @@ xfs_da3_blk_link( &bp, args->whichfork); if (error) return error; + fa = xfs_da3_header_check(bp, 
args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(args->trans, bp); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } ASSERT(bp != NULL); tmp_info = bp->b_addr; ASSERT(tmp_info->magic == old_info->magic); @@ -1939,6 +1999,7 @@ xfs_da3_blk_unlink( struct xfs_da_blkinfo *tmp_info; struct xfs_da_args *args; struct xfs_buf *bp; + xfs_failaddr_t fa; int error; /* @@ -1969,6 +2030,13 @@ xfs_da3_blk_unlink( &bp, args->whichfork); if (error) return error; + fa = xfs_da3_header_check(bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(args->trans, bp); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } ASSERT(bp != NULL); tmp_info = bp->b_addr; ASSERT(tmp_info->magic == save_info->magic); @@ -1986,6 +2054,13 @@ xfs_da3_blk_unlink( &bp, args->whichfork); if (error) return error; + fa = xfs_da3_header_check(bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_trans_brelse(args->trans, bp); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } ASSERT(bp != NULL); tmp_info = bp->b_addr; ASSERT(tmp_info->magic == save_info->magic); @@ -2101,6 +2176,12 @@ xfs_da3_path_shift( switch (be16_to_cpu(info->magic)) { case XFS_DA_NODE_MAGIC: case XFS_DA3_NODE_MAGIC: + fa = xfs_da3_node_header_check(blk->bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } blk->magic = XFS_DA_NODE_MAGIC; xfs_da3_node_hdr_from_disk(dp->i_mount, &nodehdr, bp->b_addr); @@ -2406,6 +2487,13 @@ xfs_da3_swap_lastblock( error = xfs_da3_node_read(tp, dp, sib_blkno, &sib_buf, w); if (error) goto done; + fa = xfs_da3_header_check(sib_buf, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(sib_buf, fa); + xfs_da_mark_sick(args); + error = -EFSCORRUPTED; + goto done; + } sib_info = sib_buf->b_addr; if (XFS_IS_CORRUPT(mp, be32_to_cpu(sib_info->forw) != last_blkno || @@ -2427,6 +2515,13 @@ xfs_da3_swap_lastblock( error = xfs_da3_node_read(tp, dp, sib_blkno, &sib_buf, 
w); if (error) goto done; + fa = xfs_da3_header_check(sib_buf, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(sib_buf, fa); + xfs_da_mark_sick(args); + error = -EFSCORRUPTED; + goto done; + } sib_info = sib_buf->b_addr; if (XFS_IS_CORRUPT(mp, be32_to_cpu(sib_info->back) != last_blkno || @@ -2450,6 +2545,13 @@ xfs_da3_swap_lastblock( error = xfs_da3_node_read(tp, dp, par_blkno, &par_buf, w); if (error) goto done; + fa = xfs_da3_node_header_check(par_buf, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(par_buf, fa); + xfs_da_mark_sick(args); + error = -EFSCORRUPTED; + goto done; + } par_node = par_buf->b_addr; xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node); if (XFS_IS_CORRUPT(mp, @@ -2499,6 +2601,13 @@ xfs_da3_swap_lastblock( error = xfs_da3_node_read(tp, dp, par_blkno, &par_buf, w); if (error) goto done; + fa = xfs_da3_node_header_check(par_buf, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(par_buf, fa); + xfs_da_mark_sick(args); + error = -EFSCORRUPTED; + goto done; + } par_node = par_buf->b_addr; xfs_da3_node_hdr_from_disk(dp->i_mount, &par_hdr, par_node); if (XFS_IS_CORRUPT(mp, par_hdr.level != level)) { diff --git a/fs/xfs/libxfs/xfs_da_btree.h b/fs/xfs/libxfs/xfs_da_btree.h index 99618e0c8a72..7a004786ee0a 100644 --- a/fs/xfs/libxfs/xfs_da_btree.h +++ b/fs/xfs/libxfs/xfs_da_btree.h @@ -237,6 +237,7 @@ void xfs_da3_node_hdr_from_disk(struct xfs_mount *mp, void xfs_da3_node_hdr_to_disk(struct xfs_mount *mp, struct xfs_da_intnode *to, struct xfs_da3_icnode_hdr *from); xfs_failaddr_t xfs_da3_header_check(struct xfs_buf *bp, xfs_ino_t owner); +xfs_failaddr_t xfs_da3_node_header_check(struct xfs_buf *bp, xfs_ino_t owner); extern struct kmem_cache *xfs_da_state_cache; diff --git a/fs/xfs/xfs_attr_list.c b/fs/xfs/xfs_attr_list.c index f6496e33ff91..6a621f016f04 100644 --- a/fs/xfs/xfs_attr_list.c +++ b/fs/xfs/xfs_attr_list.c @@ -239,6 +239,10 @@ xfs_attr_node_list_lookup( goto out_corruptbuf; } + fa = xfs_da3_node_header_check(bp, dp->i_ino); + 
if (fa) + goto out_corruptbuf; + xfs_da3_node_hdr_from_disk(mp, &nodehdr, node); /* Tree taller than we can handle; bail out! */ @@ -335,6 +339,11 @@ xfs_attr_node_list( case XFS_DA_NODE_MAGIC: case XFS_DA3_NODE_MAGIC: trace_xfs_attr_list_wrong_blk(context); + fa = xfs_da3_node_header_check(bp, dp->i_ino); + if (fa) { + __xfs_buf_mark_corrupt(bp, fa); + xfs_dirattr_mark_sick(dp, XFS_ATTR_FORK); + } xfs_trans_brelse(context->tp, bp); bp = NULL; break; ^ permalink raw reply related [flat|nested] 100+ messages in thread
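The two-step test in xfs_da3_node_header_check() above (magic first, then owner, and only when xfs_has_crc() is true) can be tried out in userspace with something like the following. Treat it as a sketch under stated assumptions: the demo_ names, the 10-byte header layout, and the magic value are all invented here, the has_owner flag merely stands in for xfs_has_crc(), and a string stands in for the __this_address failure cookie.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical magic number; not the real XFS_DA3_NODE_MAGIC value. */
#define DEMO_NODE_MAGIC		0x2002u

/* Hypothetical buffer: bytes 0-1 are a be16 magic, bytes 2-9 a be64 owner. */
struct demo_buf {
	bool		has_owner;	/* stand-in for xfs_has_crc(mp) */
	uint8_t		data[10];
};

/* Decode a big-endian 16-bit field. */
static uint16_t demo_get_be16(const uint8_t *p)
{
	return (uint16_t)(((unsigned)p[0] << 8) | p[1]);
}

/* Decode a big-endian 64-bit field. */
static uint64_t demo_get_be64(const uint8_t *p)
{
	uint64_t	v = 0;
	int		i;

	for (i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/* Fill a buffer in "disk" byte order; only used to exercise the check. */
static void demo_fill(struct demo_buf *bp, bool has_owner, uint16_t magic,
		      uint64_t owner)
{
	int	i;

	memset(bp, 0, sizeof(*bp));
	bp->has_owner = has_owner;
	bp->data[0] = (uint8_t)(magic >> 8);
	bp->data[1] = (uint8_t)(magic & 0xff);
	for (i = 0; i < 8; i++)
		bp->data[2 + i] = (uint8_t)(owner >> (56 - 8 * i));
}

/* NULL means the header is acceptable; otherwise name the failed check. */
static const char *demo_node_header_check(const struct demo_buf *bp,
					  uint64_t owner)
{
	assert(bp != NULL);

	/* Formats without an owner field have nothing to compare against. */
	if (!bp->has_owner)
		return NULL;

	if (demo_get_be16(bp->data) != DEMO_NODE_MAGIC)
		return "bad magic";

	if (demo_get_be64(bp->data + 2) != owner)
		return "owner mismatch";

	return NULL;
}
```

Checking the magic before the owner matters in this sketch because the owner offset is only meaningful once the block is known to be of the expected type.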
* [PATCH 07/10] xfs: validate directory leaf buffer owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (5 preceding siblings ...) 2024-04-15 23:48 ` [PATCH 06/10] xfs: validate dabtree node " Darrick J. Wong @ 2024-04-15 23:48 ` Darrick J. Wong 2024-04-15 23:48 ` [PATCH 08/10] xfs: validate explicit directory data " Darrick J. Wong ` (2 subsequent siblings) 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:48 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Check the owner field of directory leaf blocks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_da_btree.c | 16 ++++++++++ fs/xfs/libxfs/xfs_dir2.h | 2 + fs/xfs/libxfs/xfs_dir2_leaf.c | 65 +++++++++++++++++++++++++++++++++++++---- fs/xfs/libxfs/xfs_dir2_node.c | 3 +- fs/xfs/libxfs/xfs_dir2_priv.h | 4 +-- fs/xfs/scrub/dir.c | 2 + 6 files changed, 82 insertions(+), 10 deletions(-) diff --git a/fs/xfs/libxfs/xfs_da_btree.c b/fs/xfs/libxfs/xfs_da_btree.c index e6c28bccdbc0..b13796629e22 100644 --- a/fs/xfs/libxfs/xfs_da_btree.c +++ b/fs/xfs/libxfs/xfs_da_btree.c @@ -288,8 +288,12 @@ xfs_da3_header_check( return xfs_attr3_leaf_header_check(bp, owner); case cpu_to_be16(XFS_DA3_NODE_MAGIC): return xfs_da3_node_header_check(bp, owner); + case cpu_to_be16(XFS_DIR3_LEAF1_MAGIC): + case cpu_to_be16(XFS_DIR3_LEAFN_MAGIC): + return xfs_dir3_leaf_header_check(bp, owner); } + ASSERT(0); return NULL; } @@ -1700,6 +1704,12 @@ xfs_da3_node_lookup_int( if (magic == XFS_DIR2_LEAFN_MAGIC || magic == XFS_DIR3_LEAFN_MAGIC) { + fa = xfs_dir3_leaf_header_check(blk->bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } blk->magic = XFS_DIR2_LEAFN_MAGIC; blk->hashval = xfs_dir2_leaf_lasthash(args->dp, blk->bp, NULL); @@ -2208,6 
+2218,12 @@ xfs_da3_path_shift( break; case XFS_DIR2_LEAFN_MAGIC: case XFS_DIR3_LEAFN_MAGIC: + fa = xfs_dir3_leaf_header_check(blk->bp, args->owner); + if (fa) { + __xfs_buf_mark_corrupt(blk->bp, fa); + xfs_da_mark_sick(args); + return -EFSCORRUPTED; + } blk->magic = XFS_DIR2_LEAFN_MAGIC; ASSERT(level == path->active-1); blk->index = 0; diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index 8497d041f316..2f728c26a416 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -101,6 +101,8 @@ extern struct xfs_dir2_data_free *xfs_dir2_data_freefind( extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino); +xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner); + extern const struct xfs_buf_ops xfs_dir3_block_buf_ops; extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops; extern const struct xfs_buf_ops xfs_dir3_leaf1_buf_ops; diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 20ce057d12e8..53b808e2a5f0 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -208,6 +208,29 @@ xfs_dir3_leaf_verify( return xfs_dir3_leaf_check_int(mp, &leafhdr, bp->b_addr, true); } +xfs_failaddr_t +xfs_dir3_leaf_header_check( + struct xfs_buf *bp, + xfs_ino_t owner) +{ + struct xfs_mount *mp = bp->b_mount; + + if (xfs_has_crc(mp)) { + struct xfs_dir3_leaf *hdr3 = bp->b_addr; + + if (hdr3->hdr.info.hdr.magic != + cpu_to_be16(XFS_DIR3_LEAF1_MAGIC) && + hdr3->hdr.info.hdr.magic != + cpu_to_be16(XFS_DIR3_LEAFN_MAGIC)) + return __this_address; + + if (be64_to_cpu(hdr3->hdr.info.owner) != owner) + return __this_address; + } + + return NULL; +} + static void xfs_dir3_leaf_read_verify( struct xfs_buf *bp) @@ -271,32 +294,60 @@ int xfs_dir3_leaf_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp) { + xfs_failaddr_t fa; int err; err = xfs_da_read_buf(tp, dp, fbno, 0, bpp, XFS_DATA_FORK, &xfs_dir3_leaf1_buf_ops); - if 
(!err && tp && *bpp) + if (err || !(*bpp)) + return err; + + fa = xfs_dir3_leaf_header_check(*bpp, owner); + if (fa) { + __xfs_buf_mark_corrupt(*bpp, fa); + xfs_trans_brelse(tp, *bpp); + *bpp = NULL; + xfs_dirattr_mark_sick(dp, XFS_DATA_FORK); + return -EFSCORRUPTED; + } + + if (tp) xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_DIR_LEAF1_BUF); - return err; + return 0; } int xfs_dir3_leafn_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp) { + xfs_failaddr_t fa; int err; err = xfs_da_read_buf(tp, dp, fbno, 0, bpp, XFS_DATA_FORK, &xfs_dir3_leafn_buf_ops); - if (!err && tp && *bpp) + if (err || !(*bpp)) + return err; + + fa = xfs_dir3_leaf_header_check(*bpp, owner); + if (fa) { + __xfs_buf_mark_corrupt(*bpp, fa); + xfs_trans_brelse(tp, *bpp); + *bpp = NULL; + xfs_dirattr_mark_sick(dp, XFS_DATA_FORK); + return -EFSCORRUPTED; + } + + if (tp) xfs_trans_buf_set_type(tp, *bpp, XFS_BLFT_DIR_LEAFN_BUF); - return err; + return 0; } /* @@ -646,7 +697,8 @@ xfs_dir2_leaf_addname( trace_xfs_dir2_leaf_addname(args); - error = xfs_dir3_leaf_read(tp, dp, args->geo->leafblk, &lbp); + error = xfs_dir3_leaf_read(tp, dp, args->owner, args->geo->leafblk, + &lbp); if (error) return error; @@ -1237,7 +1289,8 @@ xfs_dir2_leaf_lookup_int( tp = args->trans; mp = dp->i_mount; - error = xfs_dir3_leaf_read(tp, dp, args->geo->leafblk, &lbp); + error = xfs_dir3_leaf_read(tp, dp, args->owner, args->geo->leafblk, + &lbp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index 1ad7405f9c38..e21965788188 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -1562,7 +1562,8 @@ xfs_dir2_leafn_toosmall( /* * Read the sibling leaf block. 
*/ - error = xfs_dir3_leafn_read(state->args->trans, dp, blkno, &bp); + error = xfs_dir3_leafn_read(state->args->trans, dp, + state->args->owner, blkno, &bp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index 1db2e60ba827..2f0e3ad47b37 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -95,9 +95,9 @@ void xfs_dir2_leaf_hdr_from_disk(struct xfs_mount *mp, void xfs_dir2_leaf_hdr_to_disk(struct xfs_mount *mp, struct xfs_dir2_leaf *to, struct xfs_dir3_icleaf_hdr *from); int xfs_dir3_leaf_read(struct xfs_trans *tp, struct xfs_inode *dp, - xfs_dablk_t fbno, struct xfs_buf **bpp); + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp); int xfs_dir3_leafn_read(struct xfs_trans *tp, struct xfs_inode *dp, - xfs_dablk_t fbno, struct xfs_buf **bpp); + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp); extern int xfs_dir2_block_to_leaf(struct xfs_da_args *args, struct xfs_buf *dbp); extern int xfs_dir2_leaf_addname(struct xfs_da_args *args); diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 042e28547e04..d94e265a8e1f 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -470,7 +470,7 @@ xchk_directory_leaf1_bestfree( int error; /* Read the free space block. */ - error = xfs_dir3_leaf_read(sc->tp, sc->ip, lblk, &bp); + error = xfs_dir3_leaf_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, &bp); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error)) return error; xchk_buffer_recheck(sc, bp); ^ permalink raw reply related [flat|nested] 100+ messages in thread
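The reworked xfs_dir3_leaf_read() and xfs_dir3_leafn_read() above follow a read, check, release-on-failure pattern that may be easier to see in isolation. Below is a loose userspace sketch of that control flow, assuming a malloc'ed structure in place of an xfs_buf; the demo_ names and the DEMO_EFSCORRUPTED value are placeholders invented for this example, and nothing here is the kernel's buffer API.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Placeholder error code for this sketch only. */
#define DEMO_EFSCORRUPTED	117

/* Hypothetical buffer whose header owner has already been decoded. */
struct demo_buf {
	uint64_t	hdr_owner;
};

/* Stand-in for the disk read: the "block" claims disk_owner as its owner. */
static int demo_read_raw(uint64_t disk_owner, struct demo_buf **bpp)
{
	struct demo_buf	*bp = malloc(sizeof(*bp));

	if (!bp)
		return -ENOMEM;
	bp->hdr_owner = disk_owner;
	*bpp = bp;
	return 0;
}

/*
 * Read a block and verify its owner. On a mismatch the buffer is released
 * and *bpp is reset, so the caller never sees a buffer that failed the check.
 */
static int demo_leaf_read(uint64_t disk_owner, uint64_t owner,
			  struct demo_buf **bpp)
{
	int	err;

	assert(bpp != NULL);
	*bpp = NULL;

	err = demo_read_raw(disk_owner, bpp);
	if (err || !*bpp)
		return err;

	if ((*bpp)->hdr_owner != owner) {
		free(*bpp);
		*bpp = NULL;
		return -DEMO_EFSCORRUPTED;
	}

	return 0;
}
```

A caller only has to test the return value; in this sketch, as in the patch, a failed owner check leaves the output pointer NULL so there is nothing left for the caller to release.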
* [PATCH 08/10] xfs: validate explicit directory data buffer owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (6 preceding siblings ...) 2024-04-15 23:48 ` [PATCH 07/10] xfs: validate directory leaf " Darrick J. Wong @ 2024-04-15 23:48 ` Darrick J. Wong 2024-04-15 23:48 ` [PATCH 09/10] xfs: validate explicit directory block " Darrick J. Wong 2024-04-15 23:49 ` [PATCH 10/10] xfs: validate explicit directory free block owners Darrick J. Wong 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:48 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Port the existing directory data header checking function to accept an owner number instead of an xfs_inode, then update the callsites to use xfs_da_args.owner when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_dir2.h | 1 + fs/xfs/libxfs/xfs_dir2_block.c | 3 ++- fs/xfs/libxfs/xfs_dir2_data.c | 16 ++++++++++------ fs/xfs/libxfs/xfs_dir2_leaf.c | 21 +++++++++++---------- fs/xfs/libxfs/xfs_dir2_node.c | 7 +++---- fs/xfs/libxfs/xfs_dir2_priv.h | 3 ++- fs/xfs/scrub/dir.c | 14 +++++++------- fs/xfs/scrub/readdir.c | 2 +- fs/xfs/xfs_dir2_readdir.c | 3 ++- 9 files changed, 39 insertions(+), 31 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index 2f728c26a416..d623bfdcd421 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -102,6 +102,7 @@ extern struct xfs_dir2_data_free *xfs_dir2_data_freefind( extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino); xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner); +xfs_failaddr_t xfs_dir3_data_header_check(struct xfs_buf *bp, xfs_ino_t owner); extern const struct xfs_buf_ops xfs_dir3_block_buf_ops; extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops; 
diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c index 61cbc668f228..b20b08394aa0 100644 --- a/fs/xfs/libxfs/xfs_dir2_block.c +++ b/fs/xfs/libxfs/xfs_dir2_block.c @@ -982,7 +982,8 @@ xfs_dir2_leaf_to_block( * Read the data block if we don't already have it, give up if it fails. */ if (!dbp) { - error = xfs_dir3_data_read(tp, dp, args->geo->datablk, 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + args->geo->datablk, 0, &dbp); if (error) return error; } diff --git a/fs/xfs/libxfs/xfs_dir2_data.c b/fs/xfs/libxfs/xfs_dir2_data.c index c3ef720b5ff6..ea0b9628df18 100644 --- a/fs/xfs/libxfs/xfs_dir2_data.c +++ b/fs/xfs/libxfs/xfs_dir2_data.c @@ -395,17 +395,20 @@ static const struct xfs_buf_ops xfs_dir3_data_reada_buf_ops = { .verify_write = xfs_dir3_data_write_verify, }; -static xfs_failaddr_t +xfs_failaddr_t xfs_dir3_data_header_check( - struct xfs_inode *dp, - struct xfs_buf *bp) + struct xfs_buf *bp, + xfs_ino_t owner) { - struct xfs_mount *mp = dp->i_mount; + struct xfs_mount *mp = bp->b_mount; if (xfs_has_crc(mp)) { struct xfs_dir3_data_hdr *hdr3 = bp->b_addr; - if (be64_to_cpu(hdr3->hdr.owner) != dp->i_ino) + if (hdr3->hdr.magic != cpu_to_be32(XFS_DIR3_DATA_MAGIC)) + return __this_address; + + if (be64_to_cpu(hdr3->hdr.owner) != owner) return __this_address; } @@ -416,6 +419,7 @@ int xfs_dir3_data_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t bno, unsigned int flags, struct xfs_buf **bpp) @@ -429,7 +433,7 @@ xfs_dir3_data_read( return err; /* Check things that we can't do in the verifier. 
*/ - fa = xfs_dir3_data_header_check(dp, *bpp); + fa = xfs_dir3_data_header_check(*bpp, owner); if (fa) { __xfs_buf_mark_corrupt(*bpp, fa); xfs_trans_brelse(tp, *bpp); diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 53b808e2a5f0..0b1b852f6178 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -885,9 +885,9 @@ xfs_dir2_leaf_addname( * Already had space in some data block. * Just read that one in. */ - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(args->geo, use_block), - 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + xfs_dir2_db_to_da(args->geo, use_block), 0, + &dbp); if (error) { xfs_trans_brelse(tp, lbp); return error; @@ -1328,9 +1328,9 @@ xfs_dir2_leaf_lookup_int( if (newdb != curdb) { if (dbp) xfs_trans_brelse(tp, dbp); - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(args->geo, newdb), - 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + xfs_dir2_db_to_da(args->geo, newdb), 0, + &dbp); if (error) { xfs_trans_brelse(tp, lbp); return error; @@ -1370,9 +1370,9 @@ xfs_dir2_leaf_lookup_int( ASSERT(cidb != -1); if (cidb != curdb) { xfs_trans_brelse(tp, dbp); - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(args->geo, cidb), - 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + xfs_dir2_db_to_da(args->geo, cidb), 0, + &dbp); if (error) { xfs_trans_brelse(tp, lbp); return error; @@ -1666,7 +1666,8 @@ xfs_dir2_leaf_trim_data( /* * Read the offending data block. We need its buffer. 
*/ - error = xfs_dir3_data_read(tp, dp, xfs_dir2_db_to_da(geo, db), 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + xfs_dir2_db_to_da(geo, db), 0, &dbp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index e21965788188..dc85197b8448 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -863,7 +863,7 @@ xfs_dir2_leafn_lookup_for_entry( ASSERT(state->extravalid); curbp = state->extrablk.bp; } else { - error = xfs_dir3_data_read(tp, dp, + error = xfs_dir3_data_read(tp, dp, args->owner, xfs_dir2_db_to_da(args->geo, newdb), 0, &curbp); @@ -1949,9 +1949,8 @@ xfs_dir2_node_addname_int( &freehdr, &findex); } else { /* Read the data block in. */ - error = xfs_dir3_data_read(tp, dp, - xfs_dir2_db_to_da(args->geo, dbno), - 0, &dbp); + error = xfs_dir3_data_read(tp, dp, args->owner, + xfs_dir2_db_to_da(args->geo, dbno), 0, &dbp); } if (error) return error; diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index 2f0e3ad47b37..879aa2e9fd73 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -78,7 +78,8 @@ extern void xfs_dir3_data_check(struct xfs_inode *dp, struct xfs_buf *bp); extern xfs_failaddr_t __xfs_dir3_data_check(struct xfs_inode *dp, struct xfs_buf *bp); int xfs_dir3_data_read(struct xfs_trans *tp, struct xfs_inode *dp, - xfs_dablk_t bno, unsigned int flags, struct xfs_buf **bpp); + xfs_ino_t owner, xfs_dablk_t bno, unsigned int flags, + struct xfs_buf **bpp); int xfs_dir3_data_readahead(struct xfs_inode *dp, xfs_dablk_t bno, unsigned int flags); diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index d94e265a8e1f..6b572196bb43 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -196,8 +196,8 @@ xchk_dir_rec( xchk_da_set_corrupt(ds, level); goto out; } - error = xfs_dir3_data_read(ds->dargs.trans, dp, rec_bno, - XFS_DABUF_MAP_HOLE_OK, &bp); + error = xfs_dir3_data_read(ds->dargs.trans, dp, ds->dargs.owner, + 
rec_bno, XFS_DABUF_MAP_HOLE_OK, &bp); if (!xchk_fblock_process_error(ds->sc, XFS_DATA_FORK, rec_bno, &error)) goto out; @@ -318,7 +318,8 @@ xchk_directory_data_bestfree( error = xfs_dir3_block_read(sc->tp, sc->ip, &bp); } else { /* dir data format */ - error = xfs_dir3_data_read(sc->tp, sc->ip, lblk, 0, &bp); + error = xfs_dir3_data_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, + 0, &bp); } if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error)) goto out; @@ -531,10 +532,9 @@ xchk_directory_leaf1_bestfree( /* Check all the bestfree entries. */ for (i = 0; i < bestcount; i++, bestp++) { best = be16_to_cpu(*bestp); - error = xfs_dir3_data_read(sc->tp, sc->ip, + error = xfs_dir3_data_read(sc->tp, sc->ip, args->owner, xfs_dir2_db_to_da(args->geo, i), - XFS_DABUF_MAP_HOLE_OK, - &dbp); + XFS_DABUF_MAP_HOLE_OK, &dbp); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error)) break; @@ -597,7 +597,7 @@ xchk_directory_free_bestfree( stale++; continue; } - error = xfs_dir3_data_read(sc->tp, sc->ip, + error = xfs_dir3_data_read(sc->tp, sc->ip, args->owner, (freehdr.firstdb + i) * args->geo->fsbcount, 0, &dbp); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c index fb98b7624994..bed15a9524a2 100644 --- a/fs/xfs/scrub/readdir.c +++ b/fs/xfs/scrub/readdir.c @@ -175,7 +175,7 @@ xchk_read_leaf_dir_buf( if (new_off > *curoff) *curoff = new_off; - return xfs_dir3_data_read(tp, dp, map.br_startoff, 0, bpp); + return xfs_dir3_data_read(tp, dp, dp->i_ino, map.br_startoff, 0, bpp); } /* Call a function for every entry in a leaf directory. 
*/ diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c index 4e811fa393ad..2c03371b542a 100644 --- a/fs/xfs/xfs_dir2_readdir.c +++ b/fs/xfs/xfs_dir2_readdir.c @@ -282,7 +282,8 @@ xfs_dir2_leaf_readbuf( new_off = xfs_dir2_da_to_byte(geo, map.br_startoff); if (new_off > *cur_off) *cur_off = new_off; - error = xfs_dir3_data_read(args->trans, dp, map.br_startoff, 0, &bp); + error = xfs_dir3_data_read(args->trans, dp, args->owner, + map.br_startoff, 0, &bp); if (error) goto out; ^ permalink raw reply related [flat|nested] 100+ messages in thread
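The signature change in this patch (checker takes a buffer and an owner number rather than an xfs_inode) leaves the choice of owner to each callsite: libxfs callers pass args->owner, while callers with only an inode in hand pass its i_ino. A stripped-down userspace sketch of that arrangement follows; the demo_ structures are hypothetical stand-ins and carry none of the real xfs_inode, xfs_da_args, or xfs_buf fields beyond what the example needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for the inode, da_args, and buffer structures. */
struct demo_inode {
	uint64_t		ino;
};

struct demo_args {
	struct demo_inode	*dp;
	uint64_t		owner;
};

struct demo_buf {
	uint64_t		hdr_owner;	/* owner recorded in the block */
};

/* The checker knows nothing about inodes; it compares plain numbers. */
static const char *demo_data_header_check(const struct demo_buf *bp,
					  uint64_t owner)
{
	assert(bp != NULL);

	if (bp->hdr_owner != owner)
		return "owner mismatch";
	return NULL;
}

/* Callsite that has an args structure: use the owner recorded there. */
static const char *demo_check_with_args(const struct demo_args *args,
					const struct demo_buf *bp)
{
	return demo_data_header_check(bp, args->owner);
}

/* Callsite that only has an inode: fall back to its inode number. */
static const char *demo_check_with_inode(const struct demo_inode *ip,
					 const struct demo_buf *bp)
{
	return demo_data_header_check(bp, ip->ino);
}
```

Both wrappers funnel into the same comparison, so the only thing that differs between callsites in this sketch is where the expected owner number comes from.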
* [PATCH 09/10] xfs: validate explicit directory block buffer owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (7 preceding siblings ...) 2024-04-15 23:48 ` [PATCH 08/10] xfs: validate explicit directory data " Darrick J. Wong @ 2024-04-15 23:48 ` Darrick J. Wong 2024-04-15 23:49 ` [PATCH 10/10] xfs: validate explicit directory free block owners Darrick J. Wong 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:48 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Port the existing directory block header checking function to accept an owner number instead of an xfs_inode, then update the callsites to use xfs_da_args.owner when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_dir2.h | 1 + fs/xfs/libxfs/xfs_dir2_block.c | 20 ++++++++++++-------- fs/xfs/libxfs/xfs_dir2_priv.h | 4 ++-- fs/xfs/libxfs/xfs_exchmaps.c | 2 +- fs/xfs/scrub/dir.c | 2 +- fs/xfs/scrub/readdir.c | 2 +- fs/xfs/xfs_dir2_readdir.c | 2 +- 7 files changed, 19 insertions(+), 14 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2.h b/fs/xfs/libxfs/xfs_dir2.h index d623bfdcd421..eb3a5c35025b 100644 --- a/fs/xfs/libxfs/xfs_dir2.h +++ b/fs/xfs/libxfs/xfs_dir2.h @@ -103,6 +103,7 @@ extern int xfs_dir_ino_validate(struct xfs_mount *mp, xfs_ino_t ino); xfs_failaddr_t xfs_dir3_leaf_header_check(struct xfs_buf *bp, xfs_ino_t owner); xfs_failaddr_t xfs_dir3_data_header_check(struct xfs_buf *bp, xfs_ino_t owner); +xfs_failaddr_t xfs_dir3_block_header_check(struct xfs_buf *bp, xfs_ino_t owner); extern const struct xfs_buf_ops xfs_dir3_block_buf_ops; extern const struct xfs_buf_ops xfs_dir3_leafn_buf_ops; diff --git a/fs/xfs/libxfs/xfs_dir2_block.c b/fs/xfs/libxfs/xfs_dir2_block.c index b20b08394aa0..0f93ed1a4a74 100644 --- a/fs/xfs/libxfs/xfs_dir2_block.c +++ 
b/fs/xfs/libxfs/xfs_dir2_block.c @@ -115,17 +115,20 @@ const struct xfs_buf_ops xfs_dir3_block_buf_ops = { .verify_struct = xfs_dir3_block_verify, }; -static xfs_failaddr_t +xfs_failaddr_t xfs_dir3_block_header_check( - struct xfs_inode *dp, - struct xfs_buf *bp) + struct xfs_buf *bp, + xfs_ino_t owner) { - struct xfs_mount *mp = dp->i_mount; + struct xfs_mount *mp = bp->b_mount; if (xfs_has_crc(mp)) { struct xfs_dir3_blk_hdr *hdr3 = bp->b_addr; - if (be64_to_cpu(hdr3->owner) != dp->i_ino) + if (hdr3->magic != cpu_to_be32(XFS_DIR3_BLOCK_MAGIC)) + return __this_address; + + if (be64_to_cpu(hdr3->owner) != owner) return __this_address; } @@ -136,6 +139,7 @@ int xfs_dir3_block_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, struct xfs_buf **bpp) { struct xfs_mount *mp = dp->i_mount; @@ -148,7 +152,7 @@ xfs_dir3_block_read( return err; /* Check things that we can't do in the verifier. */ - fa = xfs_dir3_block_header_check(dp, *bpp); + fa = xfs_dir3_block_header_check(*bpp, owner); if (fa) { __xfs_buf_mark_corrupt(*bpp, fa); xfs_trans_brelse(tp, *bpp); @@ -383,7 +387,7 @@ xfs_dir2_block_addname( tp = args->trans; /* Read the (one and only) directory block into bp. 
*/ - error = xfs_dir3_block_read(tp, dp, &bp); + error = xfs_dir3_block_read(tp, dp, args->owner, &bp); if (error) return error; @@ -698,7 +702,7 @@ xfs_dir2_block_lookup_int( dp = args->dp; tp = args->trans; - error = xfs_dir3_block_read(tp, dp, &bp); + error = xfs_dir3_block_read(tp, dp, args->owner, &bp); if (error) return error; diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index 879aa2e9fd73..adbc544c9bef 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -50,8 +50,8 @@ extern int xfs_dir_cilookup_result(struct xfs_da_args *args, /* xfs_dir2_block.c */ -extern int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp, - struct xfs_buf **bpp); +int xfs_dir3_block_read(struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, struct xfs_buf **bpp); extern int xfs_dir2_block_addname(struct xfs_da_args *args); extern int xfs_dir2_block_lookup(struct xfs_da_args *args); extern int xfs_dir2_block_removename(struct xfs_da_args *args); diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index 9c9cf2e998b2..3880ae32eecf 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -476,7 +476,7 @@ xfs_exchmaps_dir_to_sf( if (!isblock) return 0; - error = xfs_dir3_block_read(tp, xmi->xmi_ip2, &bp); + error = xfs_dir3_block_read(tp, xmi->xmi_ip2, xmi->xmi_ip2->i_ino, &bp); if (error) return error; diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 6b572196bb43..43f5bc8ce0d4 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -315,7 +315,7 @@ xchk_directory_data_bestfree( /* dir block format */ if (lblk != XFS_B_TO_FSBT(mp, XFS_DIR2_DATA_OFFSET)) xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, lblk); - error = xfs_dir3_block_read(sc->tp, sc->ip, &bp); + error = xfs_dir3_block_read(sc->tp, sc->ip, sc->ip->i_ino, &bp); } else { /* dir data format */ error = xfs_dir3_data_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, diff --git a/fs/xfs/scrub/readdir.c 
b/fs/xfs/scrub/readdir.c index bed15a9524a2..e94080469315 100644 --- a/fs/xfs/scrub/readdir.c +++ b/fs/xfs/scrub/readdir.c @@ -99,7 +99,7 @@ xchk_dir_walk_block( unsigned int off, next_off, end; int error; - error = xfs_dir3_block_read(sc->tp, dp, &bp); + error = xfs_dir3_block_read(sc->tp, dp, dp->i_ino, &bp); if (error) return error; diff --git a/fs/xfs/xfs_dir2_readdir.c b/fs/xfs/xfs_dir2_readdir.c index 2c03371b542a..b3abad5a6cd8 100644 --- a/fs/xfs/xfs_dir2_readdir.c +++ b/fs/xfs/xfs_dir2_readdir.c @@ -157,7 +157,7 @@ xfs_dir2_block_getdents( if (xfs_dir2_dataptr_to_db(geo, ctx->pos) > geo->datablk) return 0; - error = xfs_dir3_block_read(args->trans, dp, &bp); + error = xfs_dir3_block_read(args->trans, dp, args->owner, &bp); if (error) return error; ^ permalink raw reply related [flat|nested] 100+ messages in thread
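Reader's note on the patch above: the reworked xfs_dir3_block_header_check() no longer needs an inode, only the buffer and the owner number the caller expects. The userspace sketch below models that idea under stated assumptions; it is not the kernel code. Every demo_* name is hypothetical, the header layout is simplified to just the two fields being checked, and the integer return codes stand in for the kernel's xfs_failaddr_t.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the v5 directory block header. */
struct demo_dir3_blk_hdr {
	uint8_t magic[4];	/* big-endian magic number */
	uint8_t owner[8];	/* big-endian owner inode number */
};

/* Illustrative magic value ("XDB3"); treat it as an assumption here. */
#define DEMO_DIR3_BLOCK_MAGIC	0x58444233u

/* Decode a big-endian 32-bit value from raw bytes. */
static uint32_t demo_be32(const uint8_t *p)
{
	return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
	       ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Decode a big-endian 64-bit value from raw bytes. */
static uint64_t demo_be64(const uint8_t *p)
{
	uint64_t v = 0;

	for (int i = 0; i < 8; i++)
		v = (v << 8) | p[i];
	return v;
}

/*
 * Check the header against the owner the caller expects.
 * Returns 0 if the header is fine, 1 for a bad magic, 2 for a bad owner.
 */
static int demo_block_header_check(const struct demo_dir3_blk_hdr *hdr,
				   uint64_t owner)
{
	if (demo_be32(hdr->magic) != DEMO_DIR3_BLOCK_MAGIC)
		return 1;
	if (demo_be64(hdr->owner) != owner)
		return 2;
	return 0;
}

/* Build a header with the given values and run the check on it. */
static int demo_run(uint32_t magic, uint64_t disk_owner, uint64_t expected)
{
	struct demo_dir3_blk_hdr hdr;

	for (int i = 0; i < 4; i++)
		hdr.magic[i] = (uint8_t)(magic >> (24 - 8 * i));
	for (int i = 0; i < 8; i++)
		hdr.owner[i] = (uint8_t)(disk_owner >> (56 - 8 * i));
	return demo_block_header_check(&hdr, expected);
}
```

With this shape, demo_run(DEMO_DIR3_BLOCK_MAGIC, 128, 128) passes, while a mismatched owner or magic is reported. Passing the owner explicitly is what lets a caller validate blocks that belong to some inode other than the one it is reading through, which is presumably why the callsites pass args->owner where they can.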
* [PATCH 10/10] xfs: validate explicit directory free block owners 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong ` (8 preceding siblings ...) 2024-04-15 23:48 ` [PATCH 09/10] xfs: validate explicit directory block " Darrick J. Wong @ 2024-04-15 23:49 ` Darrick J. Wong 9 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:49 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Port the existing directory freespace block header checking function to accept an owner number instead of an xfs_inode, then update the callsites to use xfs_da_args.owner when possible. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_dir2_leaf.c | 3 ++- fs/xfs/libxfs/xfs_dir2_node.c | 32 ++++++++++++++++++-------------- fs/xfs/libxfs/xfs_dir2_priv.h | 4 ++-- fs/xfs/scrub/dir.c | 2 +- 4 files changed, 23 insertions(+), 18 deletions(-) diff --git a/fs/xfs/libxfs/xfs_dir2_leaf.c b/fs/xfs/libxfs/xfs_dir2_leaf.c index 0b1b852f6178..71c2f22a3f6e 100644 --- a/fs/xfs/libxfs/xfs_dir2_leaf.c +++ b/fs/xfs/libxfs/xfs_dir2_leaf.c @@ -1806,7 +1806,8 @@ xfs_dir2_node_to_leaf( /* * Read the freespace block. */ - error = xfs_dir2_free_read(tp, dp, args->geo->freeblk, &fbp); + error = xfs_dir2_free_read(tp, dp, args->owner, args->geo->freeblk, + &fbp); if (error) return error; xfs_dir2_free_hdr_from_disk(mp, &freehdr, fbp->b_addr); diff --git a/fs/xfs/libxfs/xfs_dir2_node.c b/fs/xfs/libxfs/xfs_dir2_node.c index dc85197b8448..fe8d4fa13128 100644 --- a/fs/xfs/libxfs/xfs_dir2_node.c +++ b/fs/xfs/libxfs/xfs_dir2_node.c @@ -175,11 +175,11 @@ const struct xfs_buf_ops xfs_dir3_free_buf_ops = { /* Everything ok in the free block header? 
*/ static xfs_failaddr_t xfs_dir3_free_header_check( - struct xfs_inode *dp, - xfs_dablk_t fbno, - struct xfs_buf *bp) + struct xfs_buf *bp, + xfs_ino_t owner, + xfs_dablk_t fbno) { - struct xfs_mount *mp = dp->i_mount; + struct xfs_mount *mp = bp->b_mount; int maxbests = mp->m_dir_geo->free_max_bests; unsigned int firstdb; @@ -195,7 +195,7 @@ xfs_dir3_free_header_check( return __this_address; if (be32_to_cpu(hdr3->nvalid) < be32_to_cpu(hdr3->nused)) return __this_address; - if (be64_to_cpu(hdr3->hdr.owner) != dp->i_ino) + if (be64_to_cpu(hdr3->hdr.owner) != owner) return __this_address; } else { struct xfs_dir2_free_hdr *hdr = bp->b_addr; @@ -214,6 +214,7 @@ static int __xfs_dir3_free_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, unsigned int flags, struct xfs_buf **bpp) @@ -227,7 +228,7 @@ __xfs_dir3_free_read( return err; /* Check things that we can't do in the verifier. */ - fa = xfs_dir3_free_header_check(dp, fbno, *bpp); + fa = xfs_dir3_free_header_check(*bpp, owner, fbno); if (fa) { __xfs_buf_mark_corrupt(*bpp, fa); xfs_trans_brelse(tp, *bpp); @@ -299,20 +300,23 @@ int xfs_dir2_free_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp) { - return __xfs_dir3_free_read(tp, dp, fbno, 0, bpp); + return __xfs_dir3_free_read(tp, dp, owner, fbno, 0, bpp); } static int xfs_dir2_free_try_read( struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp) { - return __xfs_dir3_free_read(tp, dp, fbno, XFS_DABUF_MAP_HOLE_OK, bpp); + return __xfs_dir3_free_read(tp, dp, owner, fbno, XFS_DABUF_MAP_HOLE_OK, + bpp); } static int @@ -717,7 +721,7 @@ xfs_dir2_leafn_lookup_for_addname( if (curbp) xfs_trans_brelse(tp, curbp); - error = xfs_dir2_free_read(tp, dp, + error = xfs_dir2_free_read(tp, dp, args->owner, xfs_dir2_db_to_da(args->geo, newfdb), &curbp); @@ -1356,8 +1360,8 @@ xfs_dir2_leafn_remove( * read in the free block. 
*/ fdb = xfs_dir2_db_to_fdb(geo, db); - error = xfs_dir2_free_read(tp, dp, xfs_dir2_db_to_da(geo, fdb), - &fbp); + error = xfs_dir2_free_read(tp, dp, args->owner, + xfs_dir2_db_to_da(geo, fdb), &fbp); if (error) return error; free = fbp->b_addr; @@ -1716,7 +1720,7 @@ xfs_dir2_node_add_datablk( * that was just allocated. */ fbno = xfs_dir2_db_to_fdb(args->geo, *dbno); - error = xfs_dir2_free_try_read(tp, dp, + error = xfs_dir2_free_try_read(tp, dp, args->owner, xfs_dir2_db_to_da(args->geo, fbno), &fbp); if (error) return error; @@ -1863,7 +1867,7 @@ xfs_dir2_node_find_freeblk( * so this might not succeed. This should be really rare, so * there's no reason to avoid it. */ - error = xfs_dir2_free_try_read(tp, dp, + error = xfs_dir2_free_try_read(tp, dp, args->owner, xfs_dir2_db_to_da(args->geo, fbno), &fbp); if (error) @@ -2302,7 +2306,7 @@ xfs_dir2_node_trim_free( /* * Read the freespace block. */ - error = xfs_dir2_free_try_read(tp, dp, fo, &bp); + error = xfs_dir2_free_try_read(tp, dp, args->owner, fo, &bp); if (error) return error; /* diff --git a/fs/xfs/libxfs/xfs_dir2_priv.h b/fs/xfs/libxfs/xfs_dir2_priv.h index adbc544c9bef..3befb32509fa 100644 --- a/fs/xfs/libxfs/xfs_dir2_priv.h +++ b/fs/xfs/libxfs/xfs_dir2_priv.h @@ -155,8 +155,8 @@ extern int xfs_dir2_node_removename(struct xfs_da_args *args); extern int xfs_dir2_node_replace(struct xfs_da_args *args); extern int xfs_dir2_node_trim_free(struct xfs_da_args *args, xfs_fileoff_t fo, int *rvalp); -extern int xfs_dir2_free_read(struct xfs_trans *tp, struct xfs_inode *dp, - xfs_dablk_t fbno, struct xfs_buf **bpp); +int xfs_dir2_free_read(struct xfs_trans *tp, struct xfs_inode *dp, + xfs_ino_t owner, xfs_dablk_t fbno, struct xfs_buf **bpp); /* xfs_dir2_sf.c */ xfs_ino_t xfs_dir2_sf_get_ino(struct xfs_mount *mp, struct xfs_dir2_sf_hdr *hdr, diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 43f5bc8ce0d4..7bac74621af7 100644 --- a/fs/xfs/scrub/dir.c +++ b/fs/xfs/scrub/dir.c @@ -577,7 +577,7 @@ 
xchk_directory_free_bestfree( int error; /* Read the free space block */ - error = xfs_dir2_free_read(sc->tp, sc->ip, lblk, &bp); + error = xfs_dir2_free_read(sc->tp, sc->ip, sc->ip->i_ino, lblk, &bp); if (!xchk_fblock_process_error(sc, XFS_DATA_FORK, lblk, &error)) return error; xchk_buffer_recheck(sc, bp); ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCHSET v30.3 07/16] xfs: online repair of extended attributes 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (5 preceding siblings ...) 2024-04-15 23:35 ` [PATCHSET v30.3 06/16] xfs: set and validate dir/attr block owners Darrick J. Wong @ 2024-04-15 23:35 ` Darrick J. Wong 2024-04-15 23:49 ` [PATCH 1/7] xfs: enable discarding of folios backing an xfile Darrick J. Wong ` (6 more replies) 2024-04-15 23:35 ` [PATCHSET v30.3 08/16] xfs: online repair of inode unlinked state Darrick J. Wong ` (8 subsequent siblings) 15 siblings, 7 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:35 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, Dave Chinner, hch, linux-xfs Hi all, This series employs atomic extent swapping to enable safe reconstruction of extended attribute data attached to a file. Because xattrs do not have any redundant information to draw off of, we can at best salvage as much data as we can and build a new structure. Rebuilding an extended attribute structure consists of these three steps: First, we walk the existing attributes to salvage as many of them as we can, by adding them as new attributes attached to the repair tempfile. We need to add a new xfile-based data structure to hold blobs of arbitrary length to stage the xattr names and values. Second, we write the salvaged attributes to a temporary file, and use atomic extent swaps to exchange the entire attribute fork between the two files. Finally, we reap the old xattr blocks (which are now in the temporary file) as carefully as we can. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-xattrs-6.10 --- Commits in this patchset: * xfs: enable discarding of folios backing an xfile * xfs: create a blob array data structure * xfs: use atomic extent swapping to fix user file fork data * xfs: repair extended attributes * xfs: scrub should set preen if attr leaf has holes * xfs: flag empty xattr leaf blocks for optimization * xfs: create an xattr iteration function for scrub --- fs/xfs/Makefile | 3 fs/xfs/libxfs/xfs_attr.c | 2 fs/xfs/libxfs/xfs_attr.h | 2 fs/xfs/libxfs/xfs_da_format.h | 5 fs/xfs/libxfs/xfs_exchmaps.c | 2 fs/xfs/libxfs/xfs_exchmaps.h | 1 fs/xfs/scrub/attr.c | 158 +++-- fs/xfs/scrub/attr.h | 7 fs/xfs/scrub/attr_repair.c | 1207 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/attr_repair.h | 11 fs/xfs/scrub/dab_bitmap.h | 37 + fs/xfs/scrub/dabtree.c | 16 + fs/xfs/scrub/dabtree.h | 3 fs/xfs/scrub/listxattr.c | 312 +++++++++++ fs/xfs/scrub/listxattr.h | 17 + fs/xfs/scrub/repair.c | 46 ++ fs/xfs/scrub/repair.h | 6 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/tempexch.h | 2 fs/xfs/scrub/tempfile.c | 204 +++++++ fs/xfs/scrub/tempfile.h | 3 fs/xfs/scrub/trace.h | 85 +++ fs/xfs/scrub/xfarray.c | 17 + fs/xfs/scrub/xfarray.h | 2 fs/xfs/scrub/xfblob.c | 168 ++++++ fs/xfs/scrub/xfblob.h | 26 + fs/xfs/scrub/xfile.c | 12 fs/xfs/scrub/xfile.h | 6 fs/xfs/xfs_buf.c | 3 fs/xfs/xfs_trace.h | 2 30 files changed, 2284 insertions(+), 83 deletions(-) create mode 100644 fs/xfs/scrub/attr_repair.c create mode 100644 fs/xfs/scrub/attr_repair.h create mode 100644 fs/xfs/scrub/dab_bitmap.h create mode 100644 fs/xfs/scrub/listxattr.c create mode 100644 fs/xfs/scrub/listxattr.h create mode 100644 fs/xfs/scrub/xfblob.c create mode 100644 fs/xfs/scrub/xfblob.h ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/7] xfs: enable discarding of folios backing an xfile 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong @ 2024-04-15 23:49 ` Darrick J. Wong 2024-04-15 23:49 ` [PATCH 2/7] xfs: create a blob array data structure Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:49 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a new xfile function to discard the page cache that's backing part of an xfile. The next patch will use this to drop parts of an xfile that aren't needed anymore. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/trace.h | 1 + fs/xfs/scrub/xfile.c | 12 ++++++++++++ fs/xfs/scrub/xfile.h | 1 + 3 files changed, 14 insertions(+) diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 8d05f2adae3d..7d07912d8f75 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -948,6 +948,7 @@ DEFINE_XFILE_EVENT(xfile_store); DEFINE_XFILE_EVENT(xfile_seek_data); DEFINE_XFILE_EVENT(xfile_get_folio); DEFINE_XFILE_EVENT(xfile_put_folio); +DEFINE_XFILE_EVENT(xfile_discard); TRACE_EVENT(xfarray_create, TP_PROTO(struct xfarray *xfa, unsigned long long required_capacity), diff --git a/fs/xfs/scrub/xfile.c b/fs/xfs/scrub/xfile.c index 8cdd863db585..4e254a0ba003 100644 --- a/fs/xfs/scrub/xfile.c +++ b/fs/xfs/scrub/xfile.c @@ -310,3 +310,15 @@ xfile_put_folio( folio_unlock(folio); folio_put(folio); } + +/* Discard the page cache that's backing a range of the xfile. 
*/ +void +xfile_discard( + struct xfile *xf, + loff_t pos, + u64 count) +{ + trace_xfile_discard(xf, pos, count); + + shmem_truncate_range(file_inode(xf->file), pos, pos + count - 1); +} diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h index 76d78dba7e34..8dfbae1fe33a 100644 --- a/fs/xfs/scrub/xfile.h +++ b/fs/xfs/scrub/xfile.h @@ -17,6 +17,7 @@ int xfile_load(struct xfile *xf, void *buf, size_t count, loff_t pos); int xfile_store(struct xfile *xf, const void *buf, size_t count, loff_t pos); +void xfile_discard(struct xfile *xf, loff_t pos, u64 count); loff_t xfile_seek_data(struct xfile *xf, loff_t pos); #define XFILE_MAX_FOLIO_SIZE (PAGE_SIZE << MAX_PAGECACHE_ORDER) ^ permalink raw reply related [flat|nested] 100+ messages in thread
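Reader's note on the patch above: xfile_discard() takes a (pos, count) pair but hands shmem_truncate_range() a first and last byte, hence the "pos + count - 1". The sketch below only illustrates that conversion from a length to an inclusive end offset; the demo_* names are hypothetical and nothing here touches shmem.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inclusive byte range, as passed to a truncate-range call. */
struct demo_range {
	int64_t first;	/* first byte to discard */
	int64_t last;	/* last byte to discard (inclusive) */
};

/*
 * Convert a (pos, count) pair into an inclusive [first, last] range.
 * The caller is assumed to pass count > 0, as the patch's callers appear
 * to do; a zero count would yield last < first.
 */
static struct demo_range demo_discard_range(int64_t pos, uint64_t count)
{
	struct demo_range r = {
		.first = pos,
		.last = pos + (int64_t)count - 1,
	};

	return r;
}
```

For example, discarding 4096 bytes at offset 8192 covers bytes 8192 through 12287.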
* [PATCH 2/7] xfs: create a blob array data structure 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong 2024-04-15 23:49 ` [PATCH 1/7] xfs: enable discarding of folios backing an xfile Darrick J. Wong @ 2024-04-15 23:49 ` Darrick J. Wong 2024-04-15 23:49 ` [PATCH 3/7] xfs: use atomic extent swapping to fix user file fork data Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:49 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a simple 'blob array' data structure for storage of arbitrarily sized metadata objects that will be used to reconstruct metadata. For the intended usage (temporarily storing extended attribute names and values) we only have to support storing objects and retrieving them. Use the xfile abstraction to store the attribute information in memory that can be swapped out. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/xfblob.c | 151 +++++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/xfblob.h | 24 ++++++++ 3 files changed, 176 insertions(+) create mode 100644 fs/xfs/scrub/xfblob.c create mode 100644 fs/xfs/scrub/xfblob.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 5e3ac7ec8fa5..bc27757702fe 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -208,6 +208,7 @@ xfs-y += $(addprefix scrub/, \ repair.o \ rmap_repair.o \ tempfile.o \ + xfblob.o \ ) xfs-$(CONFIG_XFS_RT) += $(addprefix scrub/, \ diff --git a/fs/xfs/scrub/xfblob.c b/fs/xfs/scrub/xfblob.c new file mode 100644 index 000000000000..cec668debce5 --- /dev/null +++ b/fs/xfs/scrub/xfblob.c @@ -0,0 +1,151 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "scrub/scrub.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/xfblob.h" + +/* + * XFS Blob Storage + * ================ + * Stores and retrieves blobs using an xfile. Objects are appended to the file + * and the offset is returned as a magic cookie for retrieval. + */ + +#define XB_KEY_MAGIC 0xABAADDAD +struct xb_key { + uint32_t xb_magic; /* XB_KEY_MAGIC */ + uint32_t xb_size; /* size of the blob, in bytes */ + loff_t xb_offset; /* byte offset of this key */ + /* blob comes after here */ +} __packed; + +/* Initialize a blob storage object. */ +int +xfblob_create( + const char *description, + struct xfblob **blobp) +{ + struct xfblob *blob; + struct xfile *xfile; + int error; + + error = xfile_create(description, 0, &xfile); + if (error) + return error; + + blob = kmalloc(sizeof(struct xfblob), XCHK_GFP_FLAGS); + if (!blob) { + error = -ENOMEM; + goto out_xfile; + } + + blob->xfile = xfile; + blob->last_offset = PAGE_SIZE; + + *blobp = blob; + return 0; + +out_xfile: + xfile_destroy(xfile); + return error; +} + +/* Destroy a blob storage object. */ +void +xfblob_destroy( + struct xfblob *blob) +{ + xfile_destroy(blob->xfile); + kfree(blob); +} + +/* Retrieve a blob. */ +int +xfblob_load( + struct xfblob *blob, + xfblob_cookie cookie, + void *ptr, + uint32_t size) +{ + struct xb_key key; + int error; + + error = xfile_load(blob->xfile, &key, sizeof(key), cookie); + if (error) + return error; + + if (key.xb_magic != XB_KEY_MAGIC || key.xb_offset != cookie) { + ASSERT(0); + return -ENODATA; + } + if (size < key.xb_size) { + ASSERT(0); + return -EFBIG; + } + + return xfile_load(blob->xfile, ptr, key.xb_size, + cookie + sizeof(key)); +} + +/* Store a blob. 
*/ +int +xfblob_store( + struct xfblob *blob, + xfblob_cookie *cookie, + const void *ptr, + uint32_t size) +{ + struct xb_key key = { + .xb_offset = blob->last_offset, + .xb_magic = XB_KEY_MAGIC, + .xb_size = size, + }; + loff_t pos = blob->last_offset; + int error; + + error = xfile_store(blob->xfile, &key, sizeof(key), pos); + if (error) + return error; + + pos += sizeof(key); + error = xfile_store(blob->xfile, ptr, size, pos); + if (error) + goto out_err; + + *cookie = blob->last_offset; + blob->last_offset += sizeof(key) + size; + return 0; +out_err: + xfile_discard(blob->xfile, blob->last_offset, sizeof(key)); + return error; +} + +/* Free a blob. */ +int +xfblob_free( + struct xfblob *blob, + xfblob_cookie cookie) +{ + struct xb_key key; + int error; + + error = xfile_load(blob->xfile, &key, sizeof(key), cookie); + if (error) + return error; + + if (key.xb_magic != XB_KEY_MAGIC || key.xb_offset != cookie) { + ASSERT(0); + return -ENODATA; + } + + xfile_discard(blob->xfile, cookie, sizeof(key) + key.xb_size); + return 0; +} diff --git a/fs/xfs/scrub/xfblob.h b/fs/xfs/scrub/xfblob.h new file mode 100644 index 000000000000..bd98647407f1 --- /dev/null +++ b/fs/xfs/scrub/xfblob.h @@ -0,0 +1,24 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_XFBLOB_H__ +#define __XFS_SCRUB_XFBLOB_H__ + +struct xfblob { + struct xfile *xfile; + loff_t last_offset; +}; + +typedef loff_t xfblob_cookie; + +int xfblob_create(const char *descr, struct xfblob **blobp); +void xfblob_destroy(struct xfblob *blob); +int xfblob_load(struct xfblob *blob, xfblob_cookie cookie, void *ptr, + uint32_t size); +int xfblob_store(struct xfblob *blob, xfblob_cookie *cookie, const void *ptr, + uint32_t size); +int xfblob_free(struct xfblob *blob, xfblob_cookie cookie); + +#endif /* __XFS_SCRUB_XFBLOB_H__ */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
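Reader's note on the patch above: the blob store appends a small key header followed by the blob bytes, and hands back the starting offset as a cookie. On load, the header's magic and recorded offset are checked against the cookie before any data is copied out. The sketch below models that scheme in userspace with a plain heap buffer in place of an xfile. All demo_* names, the negative return codes, and the starting offset of 64 are assumptions for illustration (the patch starts at PAGE_SIZE and returns errnos).

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_KEY_MAGIC	0xABAADDADu

typedef int64_t demo_cookie;

/* Header written in front of every blob. */
struct demo_key {
	uint32_t magic;		/* DEMO_KEY_MAGIC */
	uint32_t size;		/* size of the blob, in bytes */
	int64_t offset;		/* byte offset of this key */
};

/* Hypothetical in-memory stand-in for the xfile-backed store. */
struct demo_blob {
	unsigned char *buf;
	size_t cap;
	int64_t last_offset;	/* where the next key will be written */
};

static int demo_blob_init(struct demo_blob *b, size_t cap)
{
	b->buf = calloc(1, cap);
	if (!b->buf)
		return -1;
	b->cap = cap;
	b->last_offset = 64;	/* arbitrary nonzero start for this sketch */
	return 0;
}

/* Append key + blob; the key's offset becomes the cookie. */
static int demo_blob_store(struct demo_blob *b, demo_cookie *cookie,
			   const void *ptr, uint32_t size)
{
	struct demo_key key = {
		.magic = DEMO_KEY_MAGIC,
		.size = size,
		.offset = b->last_offset,
	};

	if ((size_t)b->last_offset + sizeof(key) + size > b->cap)
		return -1;
	memcpy(b->buf + b->last_offset, &key, sizeof(key));
	memcpy(b->buf + b->last_offset + sizeof(key), ptr, size);
	*cookie = b->last_offset;
	b->last_offset += sizeof(key) + size;
	return 0;
}

/* Validate the key at @cookie, then copy the blob out. */
static int demo_blob_load(const struct demo_blob *b, demo_cookie cookie,
			  void *ptr, uint32_t size)
{
	struct demo_key key;

	if (cookie < 0 || (size_t)cookie + sizeof(key) > b->cap)
		return -1;
	memcpy(&key, b->buf + cookie, sizeof(key));
	if (key.magic != DEMO_KEY_MAGIC || key.offset != cookie)
		return -2;	/* cookie does not point at a key */
	if (size < key.size)
		return -3;	/* caller's buffer is too small */
	if ((size_t)cookie + sizeof(key) + key.size > b->cap)
		return -1;
	memcpy(ptr, b->buf + cookie + sizeof(key), key.size);
	return 0;
}

/* Store two blobs, read them back, and probe the two failure paths. */
static int demo_blob_selftest(void)
{
	struct demo_blob b;
	demo_cookie c1, c2;
	char out[16];
	int ret = -1;

	if (demo_blob_init(&b, 4096))
		return -1;
	if (demo_blob_store(&b, &c1, "user.name", 10) ||
	    demo_blob_store(&b, &c2, "value", 6))
		goto out;
	if (demo_blob_load(&b, c2, out, sizeof(out)) || strcmp(out, "value"))
		goto out;
	if (demo_blob_load(&b, c1, out, sizeof(out)) || strcmp(out, "user.name"))
		goto out;
	if (demo_blob_load(&b, c1 + 1, out, sizeof(out)) != -2)
		goto out;
	if (demo_blob_load(&b, c1, out, 4) != -3)
		goto out;
	ret = 0;
out:
	free(b.buf);
	return ret;
}
```

Storing the key's own offset inside the header is what makes a stale or miscomputed cookie detectable: a cookie that lands in the middle of a blob will almost certainly fail the magic or offset comparison.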
* [PATCH 3/7] xfs: use atomic extent swapping to fix user file fork data 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong 2024-04-15 23:49 ` [PATCH 1/7] xfs: enable discarding of folios backing an xfile Darrick J. Wong 2024-04-15 23:49 ` [PATCH 2/7] xfs: create a blob array data structure Darrick J. Wong @ 2024-04-15 23:49 ` Darrick J. Wong 2024-04-15 23:50 ` [PATCH 4/7] xfs: repair extended attributes Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:49 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Build on the code that was recently added to the temporary repair file code so that we can atomically switch the contents of any file fork, even if the fork is in local format. The upcoming functions to repair xattrs, directories, and symlinks will need that capability. Repair can lock out access to these user files by holding IOLOCK_EXCL on these user files. Therefore, it is safe to drop the ILOCK of both the file being repaired and the tempfile being used for staging, and cancel the scrub transaction. We do this so that we can reuse the resource estimation and transaction allocation functions used by a regular file exchange operation. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_exchmaps.c | 2 fs/xfs/libxfs/xfs_exchmaps.h | 1 fs/xfs/scrub/tempexch.h | 2 fs/xfs/scrub/tempfile.c | 204 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.h | 3 + 5 files changed, 211 insertions(+), 1 deletion(-) diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c index 3880ae32eecf..44ab6a9235c0 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.c +++ b/fs/xfs/libxfs/xfs_exchmaps.c @@ -675,7 +675,7 @@ xfs_exchmaps_rmapbt_blocks( } /* Estimate the bmbt and rmapbt overhead required to exchange mappings. */ -static int +int xfs_exchmaps_estimate_overhead( struct xfs_exchmaps_req *req) { diff --git a/fs/xfs/libxfs/xfs_exchmaps.h b/fs/xfs/libxfs/xfs_exchmaps.h index d8718fca606e..fa822dff202a 100644 --- a/fs/xfs/libxfs/xfs_exchmaps.h +++ b/fs/xfs/libxfs/xfs_exchmaps.h @@ -97,6 +97,7 @@ xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req) return XFS_DATA_FORK; } +int xfs_exchmaps_estimate_overhead(struct xfs_exchmaps_req *req); int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req); extern struct kmem_cache *xfs_exchmaps_intent_cache; diff --git a/fs/xfs/scrub/tempexch.h b/fs/xfs/scrub/tempexch.h index 98222b684b6a..c1dd4adec4f1 100644 --- a/fs/xfs/scrub/tempexch.h +++ b/fs/xfs/scrub/tempexch.h @@ -14,6 +14,8 @@ struct xrep_tempexch { int xrep_tempexch_enable(struct xfs_scrub *sc); int xrep_tempexch_trans_reserve(struct xfs_scrub *sc, int whichfork, struct xrep_tempexch *ti); +int xrep_tempexch_trans_alloc(struct xfs_scrub *sc, int whichfork, + struct xrep_tempexch *ti); int xrep_tempexch_contents(struct xfs_scrub *sc, struct xrep_tempexch *ti); #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index 7791336ca820..0b3060be938f 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -239,6 +239,28 @@ xrep_tempfile_iunlock( sc->temp_ilock_flags &= ~XFS_ILOCK_EXCL; } +/* + * Begin the 
process of making changes to both the file being scrubbed and + * the temporary file by taking ILOCK_EXCL on both. + */ +void +xrep_tempfile_ilock_both( + struct xfs_scrub *sc) +{ + xfs_lock_two_inodes(sc->ip, XFS_ILOCK_EXCL, sc->tempip, XFS_ILOCK_EXCL); + sc->ilock_flags |= XFS_ILOCK_EXCL; + sc->temp_ilock_flags |= XFS_ILOCK_EXCL; +} + +/* Unlock ILOCK_EXCL on both files. */ +void +xrep_tempfile_iunlock_both( + struct xfs_scrub *sc) +{ + xrep_tempfile_iunlock(sc); + xchk_iunlock(sc, XFS_ILOCK_EXCL); +} + /* Release the temporary file. */ void xrep_tempfile_rele( @@ -514,6 +536,89 @@ xrep_tempexch_prep_request( return 0; } +/* + * Fill out the mapping exchange resource estimation structures in preparation + * for exchanging the contents of a metadata file that we've rebuilt in the + * temp file. Caller must hold IOLOCK_EXCL but not ILOCK_EXCL on both files. + */ +STATIC int +xrep_tempexch_estimate( + struct xfs_scrub *sc, + struct xrep_tempexch *tx) +{ + struct xfs_exchmaps_req *req = &tx->req; + struct xfs_ifork *ifp; + struct xfs_ifork *tifp; + int whichfork = xfs_exchmaps_reqfork(req); + int state = 0; + + /* + * The exchmaps code only knows how to exchange file fork space + * mappings. Any fork data in local format must be promoted to a + * single block before the exchange can take place. + */ + ifp = xfs_ifork_ptr(sc->ip, whichfork); + if (ifp->if_format == XFS_DINODE_FMT_LOCAL) + state |= 1; + + tifp = xfs_ifork_ptr(sc->tempip, whichfork); + if (tifp->if_format == XFS_DINODE_FMT_LOCAL) + state |= 2; + + switch (state) { + case 0: + /* Both files have mapped extents; use the regular estimate. */ + return xfs_exchrange_estimate(req); + case 1: + /* + * The file being repaired is in local format, but the temp + * file has mapped extents. To perform the exchange, the file + * being repaired must have its shorform data converted to an + * ondisk block so that the forks will be in extents format. 
+ * We need one resblk for the conversion; the number of + * exchanges is (worst case) the temporary file's extent count + * plus the block we converted. + */ + req->ip1_bcount = sc->tempip->i_nblocks; + req->ip2_bcount = 1; + req->nr_exchanges = 1 + tifp->if_nextents; + req->resblks = 1; + break; + case 2: + /* + * The temporary file is in local format, but the file being + * repaired has mapped extents. To perform the exchange, the + * temp file must have its shortform data converted to an + * ondisk block, and the fork changed to extents format. We + * need one resblk for the conversion; the number of exchanges + * is (worst case) the extent count of the file being repaired + * plus the block we converted. + */ + req->ip1_bcount = 1; + req->ip2_bcount = sc->ip->i_nblocks; + req->nr_exchanges = 1 + ifp->if_nextents; + req->resblks = 1; + break; + case 3: + /* + * Both forks are in local format. To perform the exchange, + * both files must have their shortform data converted to + * fsblocks, and both forks must be converted to extents + * format. We need two resblks for the two conversions, and + * the number of exchanges is 1 since there's only one block at + * fileoff 0. Presumably, the caller could not exchange the + * two inode fork areas directly. + */ + req->ip1_bcount = 1; + req->ip2_bcount = 1; + req->nr_exchanges = 1; + req->resblks = 2; + break; + } + + return xfs_exchmaps_estimate_overhead(req); +} + /* * Obtain a quota reservation to make sure we don't hit EDQUOT. We can skip * this if quota enforcement is disabled or if both inodes' dquots are the @@ -604,6 +709,55 @@ xrep_tempexch_trans_reserve( return xrep_tempexch_reserve_quota(sc, tx); } +/* + * Create a new transaction for a file contents exchange. + * + * This function fills out the mapping excahange request and resource + * estimation structures in preparation for exchanging the contents of a + * metadata file that has been rebuilt in the temp file. 
Next, it reserves + * space, takes ILOCK_EXCL of both inodes, joins them to the transaction and + * reserves quota for the transaction. + * + * The caller is responsible for dropping both ILOCKs when appropriate. + */ +int +xrep_tempexch_trans_alloc( + struct xfs_scrub *sc, + int whichfork, + struct xrep_tempexch *tx) +{ + unsigned int flags = 0; + int error; + + ASSERT(sc->tp == NULL); + + error = xrep_tempexch_prep_request(sc, whichfork, tx); + if (error) + return error; + + error = xrep_tempexch_estimate(sc, tx); + if (error) + return error; + + if (xfs_has_lazysbcount(sc->mp)) + flags |= XFS_TRANS_RES_FDBLKS; + + error = xrep_tempexch_enable(sc); + if (error) + return error; + + error = xfs_trans_alloc(sc->mp, &M_RES(sc->mp)->tr_itruncate, + tx->req.resblks, 0, flags, &sc->tp); + if (error) + return error; + + sc->temp_ilock_flags |= XFS_ILOCK_EXCL; + sc->ilock_flags |= XFS_ILOCK_EXCL; + xfs_exchrange_ilock(sc->tp, sc->ip, sc->tempip); + + return xrep_tempexch_reserve_quota(sc, tx); +} + /* * Exchange file mappings (and hence file contents) between the file being * repaired and the temporary file. Returns with both inodes locked and joined @@ -637,3 +791,53 @@ xrep_tempexch_contents( return 0; } + +/* + * Write local format data from one of the temporary file's forks into the same + * fork of file being repaired, and exchange the file sizes, if appropriate. + * Caller must ensure that the file being repaired has enough fork space to + * hold all the bytes. 
+ */ +void +xrep_tempfile_copyout_local( + struct xfs_scrub *sc, + int whichfork) +{ + struct xfs_ifork *temp_ifp; + struct xfs_ifork *ifp; + unsigned int ilog_flags = XFS_ILOG_CORE; + + temp_ifp = xfs_ifork_ptr(sc->tempip, whichfork); + ifp = xfs_ifork_ptr(sc->ip, whichfork); + + ASSERT(temp_ifp != NULL); + ASSERT(ifp != NULL); + ASSERT(temp_ifp->if_format == XFS_DINODE_FMT_LOCAL); + ASSERT(ifp->if_format == XFS_DINODE_FMT_LOCAL); + + switch (whichfork) { + case XFS_DATA_FORK: + ASSERT(sc->tempip->i_disk_size <= + xfs_inode_data_fork_size(sc->ip)); + break; + case XFS_ATTR_FORK: + ASSERT(sc->tempip->i_forkoff >= sc->ip->i_forkoff); + break; + default: + ASSERT(0); + return; + } + + /* Recreate @sc->ip's incore fork (ifp) with data from temp_ifp. */ + xfs_idestroy_fork(ifp); + xfs_init_local_fork(sc->ip, whichfork, temp_ifp->if_data, + temp_ifp->if_bytes); + + if (whichfork == XFS_DATA_FORK) { + i_size_write(VFS_I(sc->ip), i_size_read(VFS_I(sc->tempip))); + sc->ip->i_disk_size = sc->tempip->i_disk_size; + } + + ilog_flags |= xfs_ilog_fdata(whichfork); + xfs_trans_log_inode(sc->tp, sc->ip, ilog_flags); +} diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h index 7980f9c4de55..d57e4f145a7c 100644 --- a/fs/xfs/scrub/tempfile.h +++ b/fs/xfs/scrub/tempfile.h @@ -17,6 +17,8 @@ void xrep_tempfile_iounlock(struct xfs_scrub *sc); void xrep_tempfile_ilock(struct xfs_scrub *sc); bool xrep_tempfile_ilock_nowait(struct xfs_scrub *sc); void xrep_tempfile_iunlock(struct xfs_scrub *sc); +void xrep_tempfile_iunlock_both(struct xfs_scrub *sc); +void xrep_tempfile_ilock_both(struct xfs_scrub *sc); int xrep_tempfile_prealloc(struct xfs_scrub *sc, xfs_fileoff_t off, xfs_filblks_t len); @@ -32,6 +34,7 @@ int xrep_tempfile_copyin(struct xfs_scrub *sc, xfs_fileoff_t off, int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize); int xrep_tempfile_roll_trans(struct xfs_scrub *sc); +void xrep_tempfile_copyout_local(struct xfs_scrub *sc, int whichfork); #else 
static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc) { ^ permalink raw reply related [flat|nested] 100+ messages in thread
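Reader's note on the patch above: xrep_tempexch_estimate() folds "is the repaired file's fork local?" and "is the temp file's fork local?" into a two-bit state, then fills in the exchange request differently for each of the four combinations. The sketch below restates that table in userspace, using the values from the patch. The demo_* names are hypothetical, and the use_regular flag stands in for the patch's early return to the regular estimate when neither fork is local.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, trimmed-down view of the exchange request fields. */
struct demo_est {
	bool use_regular;	/* state 0: defer to the regular estimate */
	uint64_t ip1_bcount;
	uint64_t ip2_bcount;
	uint64_t nr_exchanges;
	unsigned int resblks;
};

static struct demo_est demo_estimate(bool file_local, bool temp_local,
				     uint64_t file_nblocks,
				     uint64_t file_nextents,
				     uint64_t temp_nblocks,
				     uint64_t temp_nextents)
{
	struct demo_est est = { 0 };
	int state = 0;

	if (file_local)
		state |= 1;	/* file being repaired is in local format */
	if (temp_local)
		state |= 2;	/* temp file is in local format */

	switch (state) {
	case 0:
		/* Both forks have mapped extents. */
		est.use_regular = true;
		break;
	case 1:
		/* Convert the repaired file's fork: one block, one resblk. */
		est.ip1_bcount = temp_nblocks;
		est.ip2_bcount = 1;
		est.nr_exchanges = 1 + temp_nextents;
		est.resblks = 1;
		break;
	case 2:
		/* Convert the temp file's fork: one block, one resblk. */
		est.ip1_bcount = 1;
		est.ip2_bcount = file_nblocks;
		est.nr_exchanges = 1 + file_nextents;
		est.resblks = 1;
		break;
	case 3:
		/* Convert both forks: a single exchange, two resblks. */
		est.ip1_bcount = 1;
		est.ip2_bcount = 1;
		est.nr_exchanges = 1;
		est.resblks = 2;
		break;
	}
	return est;
}
```

In the patch, the three local-format cases then go on to xfs_exchmaps_estimate_overhead() to add the bmbt and rmapbt overhead; this sketch stops before that step.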
* [PATCH 4/7] xfs: repair extended attributes 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:49 ` [PATCH 3/7] xfs: use atomic extent swapping to fix user file fork data Darrick J. Wong @ 2024-04-15 23:50 ` Darrick J. Wong 2024-04-15 23:50 ` [PATCH 5/7] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:50 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> If the extended attributes look bad, try to sift through the rubble to find whatever keys/values we can, stage a new attribute structure in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_attr.c | 2 fs/xfs/libxfs/xfs_attr.h | 2 fs/xfs/libxfs/xfs_da_format.h | 5 fs/xfs/scrub/attr.c | 20 + fs/xfs/scrub/attr.h | 7 fs/xfs/scrub/attr_repair.c | 1207 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/attr_repair.h | 11 fs/xfs/scrub/repair.c | 46 ++ fs/xfs/scrub/repair.h | 6 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/trace.h | 83 +++ fs/xfs/scrub/xfarray.c | 17 + fs/xfs/scrub/xfarray.h | 2 fs/xfs/scrub/xfblob.c | 17 + fs/xfs/scrub/xfblob.h | 2 fs/xfs/scrub/xfile.h | 5 fs/xfs/xfs_buf.c | 3 fs/xfs/xfs_trace.h | 2 19 files changed, 1436 insertions(+), 4 deletions(-) create mode 100644 fs/xfs/scrub/attr_repair.c create mode 100644 fs/xfs/scrub/attr_repair.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index bc27757702fe..8647629ac7bf 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -194,6 +194,7 @@ ifeq ($(CONFIG_XFS_ONLINE_REPAIR),y) xfs-y += $(addprefix scrub/, \ agheader_repair.o \ alloc_repair.o \ + attr_repair.o \ bmap_repair.o \ cow_repair.o 
\ fscounters_repair.o \ diff --git a/fs/xfs/libxfs/xfs_attr.c b/fs/xfs/libxfs/xfs_attr.c index b3c9666cd011..05d22c5e3885 100644 --- a/fs/xfs/libxfs/xfs_attr.c +++ b/fs/xfs/libxfs/xfs_attr.c @@ -1055,7 +1055,7 @@ xfs_attr_set( * External routines when attribute list is inside the inode *========================================================================*/ -static inline int xfs_attr_sf_totsize(struct xfs_inode *dp) +int xfs_attr_sf_totsize(struct xfs_inode *dp) { struct xfs_attr_sf_hdr *sf = dp->i_af.if_data; diff --git a/fs/xfs/libxfs/xfs_attr.h b/fs/xfs/libxfs/xfs_attr.h index 81be9b3e4004..e4f55008552b 100644 --- a/fs/xfs/libxfs/xfs_attr.h +++ b/fs/xfs/libxfs/xfs_attr.h @@ -618,4 +618,6 @@ extern struct kmem_cache *xfs_attr_intent_cache; int __init xfs_attr_intent_init_cache(void); void xfs_attr_intent_destroy_cache(void); +int xfs_attr_sf_totsize(struct xfs_inode *dp); + #endif /* __XFS_ATTR_H__ */ diff --git a/fs/xfs/libxfs/xfs_da_format.h b/fs/xfs/libxfs/xfs_da_format.h index 060e5c96b70f..aac3fe039614 100644 --- a/fs/xfs/libxfs/xfs_da_format.h +++ b/fs/xfs/libxfs/xfs_da_format.h @@ -721,6 +721,11 @@ struct xfs_attr3_leafblock { #define XFS_ATTR_INCOMPLETE (1u << XFS_ATTR_INCOMPLETE_BIT) #define XFS_ATTR_NSP_ONDISK_MASK (XFS_ATTR_ROOT | XFS_ATTR_SECURE) +#define XFS_ATTR_NAMESPACE_STR \ + { XFS_ATTR_LOCAL, "local" }, \ + { XFS_ATTR_ROOT, "root" }, \ + { XFS_ATTR_SECURE, "secure" } + /* * Alignment for namelist and valuelist entries (since they are mixed * there can be only one alignment value) diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 0c467f4f8e77..7621e548d730 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -10,6 +10,7 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_log_format.h" +#include "xfs_trans.h" #include "xfs_inode.h" #include "xfs_da_format.h" #include "xfs_da_btree.h" @@ -20,6 +21,7 @@ #include "scrub/common.h" #include "scrub/dabtree.h" #include "scrub/attr.h" +#include "scrub/repair.h" /* 
Free the buffers linked from the xattr buffer. */ static void @@ -35,6 +37,8 @@ xchk_xattr_buf_cleanup( kvfree(ab->value); ab->value = NULL; ab->value_sz = 0; + kvfree(ab->name); + ab->name = NULL; } /* @@ -65,7 +69,7 @@ xchk_xattr_want_freemap( * reallocating the buffer if necessary. Buffer contents are not preserved * across a reallocation. */ -static int +int xchk_setup_xattr_buf( struct xfs_scrub *sc, size_t value_size) @@ -95,6 +99,12 @@ xchk_setup_xattr_buf( return -ENOMEM; } + if (xchk_could_repair(sc)) { + ab->name = kvmalloc(XATTR_NAME_MAX + 1, XCHK_GFP_FLAGS); + if (!ab->name) + return -ENOMEM; + } + resize_value: if (ab->value_sz >= value_size) return 0; @@ -121,6 +131,12 @@ xchk_setup_xattr( { int error; + if (xchk_could_repair(sc)) { + error = xrep_setup_xattr(sc); + if (error) + return error; + } + /* * We failed to get memory while checking attrs, so this time try to * get all the memory we're ever going to need. Allocate the buffer @@ -247,7 +263,7 @@ xchk_xattr_listent( * Within a char, the lowest bit of the char represents the byte with * the smallest address */ -STATIC bool +bool xchk_xattr_set_map( struct xfs_scrub *sc, unsigned long *map, diff --git a/fs/xfs/scrub/attr.h b/fs/xfs/scrub/attr.h index 48fd9402c432..7db58af56646 100644 --- a/fs/xfs/scrub/attr.h +++ b/fs/xfs/scrub/attr.h @@ -16,9 +16,16 @@ struct xchk_xattr_buf { /* Bitmap of free space in xattr leaf blocks. */ unsigned long *freemap; + /* Memory buffer used to hold salvaged xattr names. */ + unsigned char *name; + /* Memory buffer used to extract xattr values. 
*/ void *value; size_t value_sz; }; +bool xchk_xattr_set_map(struct xfs_scrub *sc, unsigned long *map, + unsigned int start, unsigned int len); +int xchk_setup_xattr_buf(struct xfs_scrub *sc, size_t value_size); + #endif /* __XFS_SCRUB_ATTR_H__ */ diff --git a/fs/xfs/scrub/attr_repair.c b/fs/xfs/scrub/attr_repair.c new file mode 100644 index 000000000000..7b4318764d03 --- /dev/null +++ b/fs/xfs/scrub/attr_repair.c @@ -0,0 +1,1207 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2018-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_dir2.h" +#include "xfs_attr.h" +#include "xfs_attr_leaf.h" +#include "xfs_attr_sf.h" +#include "xfs_attr_remote.h" +#include "xfs_bmap.h" +#include "xfs_bmap_util.h" +#include "xfs_exchmaps.h" +#include "xfs_exchrange.h" +#include "xfs_acl.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/tempfile.h" +#include "scrub/tempexch.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/xfblob.h" +#include "scrub/attr.h" +#include "scrub/reap.h" +#include "scrub/attr_repair.h" + +/* + * Extended Attribute Repair + * ========================= + * + * We repair extended attributes by reading the attr leaf blocks looking for + * attribute entries that look salvageable (name passes verifiers, value can + * be retrieved, etc). Each extended attribute worth salvaging is stashed in + * memory, and the stashed entries are periodically replayed into a temporary + * file to constrain memory use. 
Batching the construction of the temporary + * extended attribute structure in this fashion reduces lock cycling of the + * file being repaired and the temporary file. + * + * When salvaging completes, the remaining stashed attributes are replayed to + * the temporary file. An atomic file contents exchange is used to commit the + * new xattr blocks to the file being repaired. This will disrupt attrmulti + * cursors. + */ + +struct xrep_xattr_key { + /* Cookie for retrieval of the xattr name. */ + xfblob_cookie name_cookie; + + /* Cookie for retrieval of the xattr value. */ + xfblob_cookie value_cookie; + + /* XFS_ATTR_* flags */ + int flags; + + /* Length of the value and name. */ + uint32_t valuelen; + uint16_t namelen; +}; + +/* + * Stash up to 8 pages of attrs in xattr_records/xattr_blobs before we write + * them to the temp file. + */ +#define XREP_XATTR_MAX_STASH_BYTES (PAGE_SIZE * 8) + +struct xrep_xattr { + struct xfs_scrub *sc; + + /* Information for exchanging attr fork mappings at the end. */ + struct xrep_tempexch tx; + + /* xattr keys */ + struct xfarray *xattr_records; + + /* xattr values */ + struct xfblob *xattr_blobs; + + /* Number of attributes that we are salvaging. */ + unsigned long long attrs_found; +}; + +/* Set up to recreate the extended attributes. */ +int +xrep_setup_xattr( + struct xfs_scrub *sc) +{ + return xrep_tempfile_create(sc, S_IFREG); +} + +/* + * Decide if we want to salvage this attribute. We don't bother with + * incomplete or oversized keys or values. The @value parameter can be null + * for remote attrs. 
+ */ +STATIC int +xrep_xattr_want_salvage( + struct xrep_xattr *rx, + unsigned int attr_flags, + const void *name, + int namelen, + const void *value, + int valuelen) +{ + if (attr_flags & XFS_ATTR_INCOMPLETE) + return false; + if (namelen > XATTR_NAME_MAX || namelen <= 0) + return false; + if (!xfs_attr_namecheck(name, namelen)) + return false; + if (valuelen > XATTR_SIZE_MAX || valuelen < 0) + return false; + if (hweight32(attr_flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) + return false; + return true; +} + +/* Allocate an in-core record to hold xattrs while we rebuild the xattr data. */ +STATIC int +xrep_xattr_salvage_key( + struct xrep_xattr *rx, + int flags, + unsigned char *name, + int namelen, + unsigned char *value, + int valuelen) +{ + struct xrep_xattr_key key = { + .valuelen = valuelen, + .flags = flags & XFS_ATTR_NSP_ONDISK_MASK, + }; + unsigned int i = 0; + int error = 0; + + if (xchk_should_terminate(rx->sc, &error)) + return error; + + /* + * Truncate the name to the first character that would trip namecheck. + * If we no longer have a name after that, ignore this attribute. + */ + while (i < namelen && name[i] != 0) + i++; + if (i == 0) + return 0; + key.namelen = i; + + trace_xrep_xattr_salvage_rec(rx->sc->ip, flags, name, key.namelen, + valuelen); + + error = xfblob_store(rx->xattr_blobs, &key.name_cookie, name, + key.namelen); + if (error) + return error; + + error = xfblob_store(rx->xattr_blobs, &key.value_cookie, value, + key.valuelen); + if (error) + return error; + + error = xfarray_append(rx->xattr_records, &key); + if (error) + return error; + + rx->attrs_found++; + return 0; +} + +/* + * Record a shortform extended attribute key & value for later reinsertion + * into the inode. 
+ */ +STATIC int +xrep_xattr_salvage_sf_attr( + struct xrep_xattr *rx, + struct xfs_attr_sf_hdr *hdr, + struct xfs_attr_sf_entry *sfe) +{ + struct xfs_scrub *sc = rx->sc; + struct xchk_xattr_buf *ab = sc->buf; + unsigned char *name = sfe->nameval; + unsigned char *value = &sfe->nameval[sfe->namelen]; + + if (!xchk_xattr_set_map(sc, ab->usedmap, (char *)name - (char *)hdr, + sfe->namelen)) + return 0; + + if (!xchk_xattr_set_map(sc, ab->usedmap, (char *)value - (char *)hdr, + sfe->valuelen)) + return 0; + + if (!xrep_xattr_want_salvage(rx, sfe->flags, sfe->nameval, + sfe->namelen, value, sfe->valuelen)) + return 0; + + return xrep_xattr_salvage_key(rx, sfe->flags, sfe->nameval, + sfe->namelen, value, sfe->valuelen); +} + +/* + * Record a local format extended attribute key & value for later reinsertion + * into the inode. + */ +STATIC int +xrep_xattr_salvage_local_attr( + struct xrep_xattr *rx, + struct xfs_attr_leaf_entry *ent, + unsigned int nameidx, + const char *buf_end, + struct xfs_attr_leaf_name_local *lentry) +{ + struct xchk_xattr_buf *ab = rx->sc->buf; + unsigned char *value; + unsigned int valuelen; + unsigned int namesize; + + /* + * Decode the leaf local entry format. If something seems wrong, we + * junk the attribute. + */ + value = &lentry->nameval[lentry->namelen]; + valuelen = be16_to_cpu(lentry->valuelen); + namesize = xfs_attr_leaf_entsize_local(lentry->namelen, valuelen); + if ((char *)lentry + namesize > buf_end) + return 0; + if (!xrep_xattr_want_salvage(rx, ent->flags, lentry->nameval, + lentry->namelen, value, valuelen)) + return 0; + if (!xchk_xattr_set_map(rx->sc, ab->usedmap, nameidx, namesize)) + return 0; + + /* Try to save this attribute. */ + return xrep_xattr_salvage_key(rx, ent->flags, lentry->nameval, + lentry->namelen, value, valuelen); +} + +/* + * Record a remote format extended attribute key & value for later reinsertion + * into the inode. 
+ */ +STATIC int +xrep_xattr_salvage_remote_attr( + struct xrep_xattr *rx, + struct xfs_attr_leaf_entry *ent, + unsigned int nameidx, + const char *buf_end, + struct xfs_attr_leaf_name_remote *rentry, + unsigned int ent_idx, + struct xfs_buf *leaf_bp) +{ + struct xchk_xattr_buf *ab = rx->sc->buf; + struct xfs_da_args args = { + .trans = rx->sc->tp, + .dp = rx->sc->ip, + .index = ent_idx, + .geo = rx->sc->mp->m_attr_geo, + .owner = rx->sc->ip->i_ino, + .attr_filter = ent->flags & XFS_ATTR_NSP_ONDISK_MASK, + .namelen = rentry->namelen, + .name = rentry->name, + .value = ab->value, + .valuelen = be32_to_cpu(rentry->valuelen), + }; + unsigned int namesize; + int error; + + /* + * Decode the leaf remote entry format. If something seems wrong, we + * junk the attribute. Note that we should never find a zero-length + * remote attribute value. + */ + namesize = xfs_attr_leaf_entsize_remote(rentry->namelen); + if ((char *)rentry + namesize > buf_end) + return 0; + if (args.valuelen == 0 || + !xrep_xattr_want_salvage(rx, ent->flags, rentry->name, + rentry->namelen, NULL, args.valuelen)) + return 0; + if (!xchk_xattr_set_map(rx->sc, ab->usedmap, nameidx, namesize)) + return 0; + + /* + * Enlarge the buffer (if needed) to hold the value that we're trying + * to salvage from the old extended attribute data. + */ + error = xchk_setup_xattr_buf(rx->sc, args.valuelen); + if (error == -ENOMEM) + error = -EDEADLOCK; + if (error) + return error; + + /* Look up the remote value and stash it for reconstruction. */ + error = xfs_attr3_leaf_getvalue(leaf_bp, &args); + if (error || args.rmtblkno == 0) + goto err_free; + + error = xfs_attr_rmtval_get(&args); + if (error) + goto err_free; + + /* Try to save this attribute. 
*/ + error = xrep_xattr_salvage_key(rx, ent->flags, rentry->name, + rentry->namelen, ab->value, args.valuelen); +err_free: + /* remote value was garbage, junk it */ + if (error == -EFSBADCRC || error == -EFSCORRUPTED) + error = 0; + return error; +} + +/* Extract every xattr key that we can from this attr fork block. */ +STATIC int +xrep_xattr_recover_leaf( + struct xrep_xattr *rx, + struct xfs_buf *bp) +{ + struct xfs_attr3_icleaf_hdr leafhdr; + struct xfs_scrub *sc = rx->sc; + struct xfs_mount *mp = sc->mp; + struct xfs_attr_leafblock *leaf; + struct xfs_attr_leaf_name_local *lentry; + struct xfs_attr_leaf_name_remote *rentry; + struct xfs_attr_leaf_entry *ent; + struct xfs_attr_leaf_entry *entries; + struct xchk_xattr_buf *ab = rx->sc->buf; + char *buf_end; + size_t off; + unsigned int nameidx; + unsigned int hdrsize; + int i; + int error = 0; + + bitmap_zero(ab->usedmap, mp->m_attr_geo->blksize); + + /* Check the leaf header */ + leaf = bp->b_addr; + xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf); + hdrsize = xfs_attr3_leaf_hdr_size(leaf); + xchk_xattr_set_map(sc, ab->usedmap, 0, hdrsize); + entries = xfs_attr3_leaf_entryp(leaf); + + buf_end = (char *)bp->b_addr + mp->m_attr_geo->blksize; + for (i = 0, ent = entries; i < leafhdr.count; ent++, i++) { + if (xchk_should_terminate(sc, &error)) + return error; + + /* Skip key if it conflicts with something else? */ + off = (char *)ent - (char *)leaf; + if (!xchk_xattr_set_map(sc, ab->usedmap, off, + sizeof(xfs_attr_leaf_entry_t))) + continue; + + /* Check the name information. 
*/ + nameidx = be16_to_cpu(ent->nameidx); + if (nameidx < leafhdr.firstused || + nameidx >= mp->m_attr_geo->blksize) + continue; + + if (ent->flags & XFS_ATTR_LOCAL) { + lentry = xfs_attr3_leaf_name_local(leaf, i); + error = xrep_xattr_salvage_local_attr(rx, ent, nameidx, + buf_end, lentry); + } else { + rentry = xfs_attr3_leaf_name_remote(leaf, i); + error = xrep_xattr_salvage_remote_attr(rx, ent, nameidx, + buf_end, rentry, i, bp); + } + if (error) + return error; + } + + return 0; +} + +/* Try to recover shortform attrs. */ +STATIC int +xrep_xattr_recover_sf( + struct xrep_xattr *rx) +{ + struct xfs_scrub *sc = rx->sc; + struct xchk_xattr_buf *ab = sc->buf; + struct xfs_attr_sf_hdr *hdr; + struct xfs_attr_sf_entry *sfe; + struct xfs_attr_sf_entry *next; + struct xfs_ifork *ifp; + unsigned char *end; + int i; + int error = 0; + + ifp = xfs_ifork_ptr(rx->sc->ip, XFS_ATTR_FORK); + hdr = ifp->if_data; + + bitmap_zero(ab->usedmap, ifp->if_bytes); + end = (unsigned char *)ifp->if_data + ifp->if_bytes; + xchk_xattr_set_map(sc, ab->usedmap, 0, sizeof(*hdr)); + + sfe = xfs_attr_sf_firstentry(hdr); + if ((unsigned char *)sfe > end) + return 0; + + for (i = 0; i < hdr->count; i++) { + if (xchk_should_terminate(sc, &error)) + return error; + + next = xfs_attr_sf_nextentry(sfe); + if ((unsigned char *)next > end) + break; + + if (xchk_xattr_set_map(sc, ab->usedmap, + (char *)sfe - (char *)hdr, + sizeof(struct xfs_attr_sf_entry))) { + /* + * No conflicts with the sf entry; let's save this + * attribute. + */ + error = xrep_xattr_salvage_sf_attr(rx, hdr, sfe); + if (error) + return error; + } + + sfe = next; + } + + return 0; +} + +/* + * Try to return a buffer of xattr data for a given physical extent. 
+ * + * Because the buffer cache get function complains if it finds a buffer + * matching the block number but not matching the length, we must be careful to + * look for incore buffers (up to the maximum length of a remote value) that + * could be hiding anywhere in the physical range. If we find an incore + * buffer, we can pass that to the caller. Optionally, read a single block and + * pass that back. + * + * Note the subtlety that remote attr value blocks for which there is no incore + * buffer will be passed to the callback one block at a time. These buffers + * will not have any ops attached and must be staled to prevent aliasing with + * multiblock buffers once we drop the ILOCK. + */ +STATIC int +xrep_xattr_find_buf( + struct xfs_mount *mp, + xfs_fsblock_t fsbno, + xfs_extlen_t max_len, + bool can_read, + struct xfs_buf **bpp) +{ + struct xrep_bufscan scan = { + .daddr = XFS_FSB_TO_DADDR(mp, fsbno), + .max_sectors = xrep_bufscan_max_sectors(mp, max_len), + .daddr_step = XFS_FSB_TO_BB(mp, 1), + }; + struct xfs_buf *bp; + + while ((bp = xrep_bufscan_advance(mp, &scan)) != NULL) { + *bpp = bp; + return 0; + } + + if (!can_read) { + *bpp = NULL; + return 0; + } + + return xfs_buf_read(mp->m_ddev_targp, scan.daddr, XFS_FSB_TO_BB(mp, 1), + XBF_TRYLOCK, bpp, NULL); +} + +/* + * Deal with a buffer that we found during our walk of the attr fork. + * + * Attribute leaf and node blocks are simple -- they're a single block, so we + * can walk them one at a time and we never have to worry about discontiguous + * multiblock buffers like we do for directories. + * + * Unfortunately, remote attr blocks add a lot of complexity here. Each disk + * block is totally self contained, in the sense that the v5 header provides no + * indication that there could be more data in the next block. The incore + * buffers can span multiple blocks, though they never cross extent records. + * However, they don't necessarily start or end on an extent record boundary. 
+ * Therefore, we need a special buffer find function to walk the buffer cache + * for us. + * + * The caller must hold the ILOCK on the file being repaired. We use + * XBF_TRYLOCK here to skip any locked buffer on the assumption that we don't + * own the block and don't want to hang the system on a potentially garbage + * buffer. + */ +STATIC int +xrep_xattr_recover_block( + struct xrep_xattr *rx, + xfs_dablk_t dabno, + xfs_fsblock_t fsbno, + xfs_extlen_t max_len, + xfs_extlen_t *actual_len) +{ + struct xfs_da_blkinfo *info; + struct xfs_buf *bp; + int error; + + error = xrep_xattr_find_buf(rx->sc->mp, fsbno, max_len, true, &bp); + if (error) + return error; + info = bp->b_addr; + *actual_len = XFS_BB_TO_FSB(rx->sc->mp, bp->b_length); + + trace_xrep_xattr_recover_leafblock(rx->sc->ip, dabno, + be16_to_cpu(info->magic)); + + /* + * If the buffer has the right magic number for an attr leaf block and + * passes a structure check (we don't care about checksums), salvage + * as much as we can from the block. */ + if (info->magic == cpu_to_be16(XFS_ATTR3_LEAF_MAGIC) && + xrep_buf_verify_struct(bp, &xfs_attr3_leaf_buf_ops) && + xfs_attr3_leaf_header_check(bp, rx->sc->ip->i_ino) == NULL) + error = xrep_xattr_recover_leaf(rx, bp); + + /* + * If the buffer didn't already have buffer ops set, it was read in by + * the _find_buf function and could very well be /part/ of a multiblock + * remote block. Mark it stale so that it doesn't hang around in + * memory to cause problems. + */ + if (bp->b_ops == NULL) + xfs_buf_stale(bp); + + xfs_buf_relse(bp); + return error; +} + +/* Insert one xattr key/value. 
*/ +STATIC int +xrep_xattr_insert_rec( + struct xrep_xattr *rx, + const struct xrep_xattr_key *key) +{ + struct xfs_da_args args = { + .dp = rx->sc->tempip, + .attr_filter = key->flags, + .attr_flags = XATTR_CREATE, + .namelen = key->namelen, + .valuelen = key->valuelen, + .owner = rx->sc->ip->i_ino, + }; + struct xchk_xattr_buf *ab = rx->sc->buf; + int error; + + /* + * Grab pointers to the scrub buffer so that we can use them to insert + * attrs into the temp file. + */ + args.name = ab->name; + args.value = ab->value; + + /* + * The attribute name is stored near the end of the in-core buffer, + * though we reserve one more byte to ensure null termination. + */ + ab->name[XATTR_NAME_MAX] = 0; + + error = xfblob_load(rx->xattr_blobs, key->name_cookie, ab->name, + key->namelen); + if (error) + return error; + + error = xfblob_free(rx->xattr_blobs, key->name_cookie); + if (error) + return error; + + error = xfblob_load(rx->xattr_blobs, key->value_cookie, args.value, + key->valuelen); + if (error) + return error; + + error = xfblob_free(rx->xattr_blobs, key->value_cookie); + if (error) + return error; + + ab->name[key->namelen] = 0; + + trace_xrep_xattr_insert_rec(rx->sc->tempip, key->flags, ab->name, + key->namelen, key->valuelen); + + /* + * xfs_attr_set creates and commits its own transaction. If the attr + * already exists, we'll just drop it during the rebuild. + */ + error = xfs_attr_set(&args); + if (error == -EEXIST) + error = 0; + + return error; +} + +/* + * Periodically flush salvaged attributes to the temporary file. This is done + * to reduce the memory requirements of the xattr rebuild because files can + * contain millions of attributes. 
+ */ +STATIC int +xrep_xattr_flush_stashed( + struct xrep_xattr *rx) +{ + xfarray_idx_t array_cur; + int error; + + /* + * Entering this function, the scrub context has a reference to the + * inode being repaired, the temporary file, and a scrub transaction + * that we use during xattr salvaging to avoid livelocking if there + * are cycles in the xattr structures. We hold ILOCK_EXCL on both + * the inode being repaired, though it is not ijoined to the scrub + * transaction. + * + * To constrain kernel memory use, we occasionally flush salvaged + * xattrs from the xfarray and xfblob structures into the temporary + * file in preparation for exchanging the xattr structures at the end. + * Updating the temporary file requires a transaction, so we commit the + * scrub transaction and drop the two ILOCKs so that xfs_attr_set can + * allocate whatever transaction it wants. + * + * We still hold IOLOCK_EXCL on the inode being repaired, which + * prevents anyone from modifying the damaged xattr data while we + * repair it. + */ + error = xrep_trans_commit(rx->sc); + if (error) + return error; + xchk_iunlock(rx->sc, XFS_ILOCK_EXCL); + + /* + * Take the IOLOCK of the temporary file while we modify xattrs. This + * isn't strictly required because the temporary file is never revealed + * to userspace, but we follow the same locking rules. We still hold + * sc->ip's IOLOCK. + */ + error = xrep_tempfile_iolock_polled(rx->sc); + if (error) + return error; + + /* Add all the salvaged attrs to the temporary file. */ + foreach_xfarray_idx(rx->xattr_records, array_cur) { + struct xrep_xattr_key key; + + error = xfarray_load(rx->xattr_records, array_cur, &key); + if (error) + return error; + + error = xrep_xattr_insert_rec(rx, &key); + if (error) + return error; + } + + /* Empty out both arrays now that we've added the entries. 
*/ + xfarray_truncate(rx->xattr_records); + xfblob_truncate(rx->xattr_blobs); + + xrep_tempfile_iounlock(rx->sc); + + /* Recreate the salvage transaction and relock the inode. */ + error = xchk_trans_alloc(rx->sc, 0); + if (error) + return error; + xchk_ilock(rx->sc, XFS_ILOCK_EXCL); + return 0; +} + +/* Decide if we've stashed too much xattr data in memory. */ +static inline bool +xrep_xattr_want_flush_stashed( + struct xrep_xattr *rx) +{ + unsigned long long bytes; + + bytes = xfarray_bytes(rx->xattr_records) + + xfblob_bytes(rx->xattr_blobs); + return bytes > XREP_XATTR_MAX_STASH_BYTES; +} + +/* Extract as many attribute keys and values as we can. */ +STATIC int +xrep_xattr_recover( + struct xrep_xattr *rx) +{ + struct xfs_bmbt_irec got; + struct xfs_scrub *sc = rx->sc; + struct xfs_da_geometry *geo = sc->mp->m_attr_geo; + xfs_fileoff_t offset; + xfs_extlen_t len; + xfs_dablk_t dabno; + int nmap; + int error; + + /* + * Iterate each xattr leaf block in the attr fork to scan them for any + * attributes that we might salvage. 
+ */ + for (offset = 0; + offset < XFS_MAX_FILEOFF; + offset = got.br_startoff + got.br_blockcount) { + nmap = 1; + error = xfs_bmapi_read(sc->ip, offset, XFS_MAX_FILEOFF - offset, + &got, &nmap, XFS_BMAPI_ATTRFORK); + if (error) + return error; + if (nmap != 1) + return -EFSCORRUPTED; + if (!xfs_bmap_is_written_extent(&got)) + continue; + + for (dabno = round_up(got.br_startoff, geo->fsbcount); + dabno < got.br_startoff + got.br_blockcount; + dabno += len) { + xfs_fileoff_t curr_offset = dabno - got.br_startoff; + xfs_extlen_t maxlen; + + if (xchk_should_terminate(rx->sc, &error)) + return error; + + maxlen = min_t(xfs_filblks_t, INT_MAX, + got.br_blockcount - curr_offset); + error = xrep_xattr_recover_block(rx, dabno, + curr_offset + got.br_startblock, + maxlen, &len); + if (error) + return error; + + if (xrep_xattr_want_flush_stashed(rx)) { + error = xrep_xattr_flush_stashed(rx); + if (error) + return error; + } + } + } + + return 0; +} + +/* + * Reset the extended attribute fork to a state where we can start re-adding + * the salvaged attributes. + */ +STATIC int +xrep_xattr_fork_remove( + struct xfs_scrub *sc, + struct xfs_inode *ip) +{ + struct xfs_attr_sf_hdr *hdr; + struct xfs_ifork *ifp = xfs_ifork_ptr(ip, XFS_ATTR_FORK); + + /* + * If the data fork is in btree format, we can't change di_forkoff + * because we could run afoul of the rule that the data fork isn't + * supposed to be in btree format if there's enough space in the fork + * that it could have used extents format. Instead, reinitialize the + * attr fork to have a shortform structure with zero attributes. + */ + if (ip->i_df.if_format == XFS_DINODE_FMT_BTREE) { + ifp->if_format = XFS_DINODE_FMT_LOCAL; + hdr = xfs_idata_realloc(ip, (int)sizeof(*hdr) - ifp->if_bytes, + XFS_ATTR_FORK); + hdr->count = 0; + hdr->totsize = cpu_to_be16(sizeof(*hdr)); + xfs_trans_log_inode(sc->tp, ip, + XFS_ILOG_CORE | XFS_ILOG_ADATA); + return 0; + } + + /* If we still have attr fork extents, something's wrong. 
*/ + if (ifp->if_nextents != 0) { + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec irec; + unsigned int i = 0; + + xfs_emerg(sc->mp, + "inode 0x%llx attr fork still has %llu attr extents, format %d?!", + ip->i_ino, ifp->if_nextents, ifp->if_format); + for_each_xfs_iext(ifp, &icur, &irec) { + xfs_err(sc->mp, + "[%u]: startoff %llu startblock %llu blockcount %llu state %u", + i++, irec.br_startoff, + irec.br_startblock, irec.br_blockcount, + irec.br_state); + } + ASSERT(0); + return -EFSCORRUPTED; + } + + xfs_attr_fork_remove(ip, sc->tp); + return 0; +} + +/* + * Free all the attribute fork blocks of the file being repaired and delete the + * fork. The caller must ILOCK the scrub file and join it to the transaction. + * This function returns with the inode joined to a clean transaction. + */ +int +xrep_xattr_reset_fork( + struct xfs_scrub *sc) +{ + int error; + + trace_xrep_xattr_reset_fork(sc->ip, sc->ip); + + /* Unmap all the attr blocks. */ + if (xfs_ifork_has_extents(&sc->ip->i_af)) { + error = xrep_reap_ifork(sc, sc->ip, XFS_ATTR_FORK); + if (error) + return error; + } + + error = xrep_xattr_fork_remove(sc, sc->ip); + if (error) + return error; + + return xfs_trans_roll_inode(&sc->tp, sc->ip); +} + +/* + * Free all the attribute fork blocks of the temporary file and delete the attr + * fork. The caller must ILOCK the tempfile and join it to the transaction. + * This function returns with the inode joined to a clean scrub transaction. + */ +STATIC int +xrep_xattr_reset_tempfile_fork( + struct xfs_scrub *sc) +{ + int error; + + trace_xrep_xattr_reset_fork(sc->ip, sc->tempip); + + /* + * Wipe out the attr fork of the temp file so that regular inode + * inactivation won't trip over the corrupt attr fork. 
+ */ + if (xfs_ifork_has_extents(&sc->tempip->i_af)) { + error = xrep_reap_ifork(sc, sc->tempip, XFS_ATTR_FORK); + if (error) + return error; + } + + return xrep_xattr_fork_remove(sc, sc->tempip); +} + +/* + * Find all the extended attributes for this inode by scraping them out of the + * attribute key blocks by hand, and flushing them into the temp file. + * When we're done, free the staging memory before exchanging the xattr + * structures to reduce memory usage. + */ +STATIC int +xrep_xattr_salvage_attributes( + struct xrep_xattr *rx) +{ + struct xfs_inode *ip = rx->sc->ip; + int error; + + /* Short format xattrs are easy! */ + if (rx->sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL) { + error = xrep_xattr_recover_sf(rx); + if (error) + return error; + + return xrep_xattr_flush_stashed(rx); + } + + /* + * For non-inline xattr structures, the salvage function scans the + * buffer cache looking for potential attr leaf blocks. The scan + * requires the ability to lock any buffer found and runs independently + * of any transaction <-> buffer item <-> buffer linkage. Therefore, + * roll the transaction to ensure there are no buffers joined. We hold + * the ILOCK independently of the transaction. + */ + error = xfs_trans_roll(&rx->sc->tp); + if (error) + return error; + + error = xfs_iread_extents(rx->sc->tp, ip, XFS_ATTR_FORK); + if (error) + return error; + + error = xrep_xattr_recover(rx); + if (error) + return error; + + return xrep_xattr_flush_stashed(rx); +} + +/* + * Prepare both inodes' attribute forks for an exchange. Promote the tempfile + * from short format to leaf format, and if the file being repaired has a short + * format attr fork, turn it into an empty extent list. + */ +STATIC int +xrep_xattr_swap_prep( + struct xfs_scrub *sc, + bool temp_local, + bool ip_local) +{ + int error; + + /* + * If the tempfile's attributes are in shortform format, convert that + * to a single leaf extent so that we can use the atomic mapping + * exchange. 
+ */ + if (temp_local) { + struct xfs_da_args args = { + .dp = sc->tempip, + .geo = sc->mp->m_attr_geo, + .whichfork = XFS_ATTR_FORK, + .trans = sc->tp, + .total = 1, + .owner = sc->ip->i_ino, + }; + + error = xfs_attr_shortform_to_leaf(&args); + if (error) + return error; + + /* + * Roll the deferred log items to get us back to a clean + * transaction. + */ + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + } + + /* + * If the file being repaired had a shortform attribute fork, convert + * that to an empty extent list in preparation for the atomic mapping + * exchange. + */ + if (ip_local) { + struct xfs_ifork *ifp; + + ifp = xfs_ifork_ptr(sc->ip, XFS_ATTR_FORK); + + xfs_idestroy_fork(ifp); + ifp->if_format = XFS_DINODE_FMT_EXTENTS; + ifp->if_nextents = 0; + ifp->if_bytes = 0; + ifp->if_data = NULL; + ifp->if_height = 0; + + xfs_trans_log_inode(sc->tp, sc->ip, + XFS_ILOG_CORE | XFS_ILOG_ADATA); + } + + return 0; +} + +/* Exchange the temporary file's attribute fork with the one being repaired. */ +STATIC int +xrep_xattr_swap( + struct xfs_scrub *sc, + struct xrep_tempexch *tx) +{ + bool ip_local, temp_local; + int error = 0; + + ip_local = sc->ip->i_af.if_format == XFS_DINODE_FMT_LOCAL; + temp_local = sc->tempip->i_af.if_format == XFS_DINODE_FMT_LOCAL; + + /* + * If both files have a local format attr fork and the rebuilt + * xattr data would fit in the repaired file's attr fork, just copy + * the contents from the tempfile and declare ourselves done. + */ + if (ip_local && temp_local) { + int forkoff; + int newsize; + + newsize = xfs_attr_sf_totsize(sc->tempip); + forkoff = xfs_attr_shortform_bytesfit(sc->ip, newsize); + if (forkoff > 0) { + sc->ip->i_forkoff = forkoff; + xrep_tempfile_copyout_local(sc, XFS_ATTR_FORK); + return 0; + } + } + + /* Otherwise, make sure both attr forks are in block-mapping mode. 
*/ + error = xrep_xattr_swap_prep(sc, temp_local, ip_local); + if (error) + return error; + + return xrep_tempexch_contents(sc, tx); +} + +/* + * Exchange the new extended attribute data (which we created in the tempfile) + * with the file being repaired. + */ +STATIC int +xrep_xattr_rebuild_tree( + struct xrep_xattr *rx) +{ + struct xfs_scrub *sc = rx->sc; + int error; + + /* + * If we didn't find any attributes to salvage, repair the file by + * zapping its attr fork. + */ + if (rx->attrs_found == 0) { + xfs_trans_ijoin(sc->tp, sc->ip, 0); + error = xrep_xattr_reset_fork(sc); + if (error) + return error; + + goto forget_acls; + } + + trace_xrep_xattr_rebuild_tree(sc->ip, sc->tempip); + + /* + * Commit the repair transaction and drop the ILOCKs so that we can use + * the atomic file content exchange helper functions to compute the + * correct resource reservations. + * + * We still hold IOLOCK_EXCL (aka i_rwsem) which will prevent xattr + * modifications, but there's nothing to prevent userspace from reading + * the attributes until we're ready for the exchange operation. Reads + * will return -EIO without shutting down the fs, so we're ok with + * that. + */ + error = xrep_trans_commit(sc); + if (error) + return error; + + xchk_iunlock(sc, XFS_ILOCK_EXCL); + + /* + * Take the IOLOCK on the temporary file so that we can run xattr + * operations with the same locks held as we would for a normal file. + * We still hold sc->ip's IOLOCK. + */ + error = xrep_tempfile_iolock_polled(rx->sc); + if (error) + return error; + + /* Allocate exchange transaction and lock both inodes. */ + error = xrep_tempexch_trans_alloc(rx->sc, XFS_ATTR_FORK, &rx->tx); + if (error) + return error; + + /* + * Exchange the blocks mapped by the tempfile's attr fork with the file + * being repaired. The old attr blocks will then be attached to the + * tempfile, so reap its attr fork. 
+ */ + error = xrep_xattr_swap(sc, &rx->tx); + if (error) + return error; + + error = xrep_xattr_reset_tempfile_fork(sc); + if (error) + return error; + + /* + * Roll to get a transaction without any inodes joined to it. Then we + * can drop the tempfile's ILOCK and IOLOCK before doing more work on + * the scrub target file. + */ + error = xfs_trans_roll(&sc->tp); + if (error) + return error; + + xrep_tempfile_iunlock(sc); + xrep_tempfile_iounlock(sc); + +forget_acls: + /* Invalidate cached ACLs now that we've reloaded all the xattrs. */ + xfs_forget_acl(VFS_I(sc->ip), SGI_ACL_FILE); + xfs_forget_acl(VFS_I(sc->ip), SGI_ACL_DEFAULT); + return 0; +} + +/* Tear down all the incore scan stuff we created. */ +STATIC void +xrep_xattr_teardown( + struct xrep_xattr *rx) +{ + xfblob_destroy(rx->xattr_blobs); + xfarray_destroy(rx->xattr_records); + kfree(rx); +} + +/* Set up the filesystem scan so we can regenerate extended attributes. */ +STATIC int +xrep_xattr_setup_scan( + struct xfs_scrub *sc, + struct xrep_xattr **rxp) +{ + struct xrep_xattr *rx; + char *descr; + int max_len; + int error; + + rx = kzalloc(sizeof(struct xrep_xattr), XCHK_GFP_FLAGS); + if (!rx) + return -ENOMEM; + rx->sc = sc; + + /* + * Allocate enough memory to handle loading local attr values from the + * xfblob data while flushing stashed attrs to the temporary file. + * We only realloc the buffer when salvaging remote attr values. 
+ */ + max_len = xfs_attr_leaf_entsize_local_max(sc->mp->m_attr_geo->blksize); + error = xchk_setup_xattr_buf(rx->sc, max_len); + if (error == -ENOMEM) + error = -EDEADLOCK; + if (error) + goto out_rx; + + /* Set up some staging for salvaged attribute keys and values */ + descr = xchk_xfile_ino_descr(sc, "xattr keys"); + error = xfarray_create(descr, 0, sizeof(struct xrep_xattr_key), + &rx->xattr_records); + kfree(descr); + if (error) + goto out_rx; + + descr = xchk_xfile_ino_descr(sc, "xattr names"); + error = xfblob_create(descr, &rx->xattr_blobs); + kfree(descr); + if (error) + goto out_keys; + + *rxp = rx; + return 0; +out_keys: + xfarray_destroy(rx->xattr_records); +out_rx: + kfree(rx); + return error; +} + +/* + * Repair the extended attribute metadata. + * + * XXX: Remote attribute value buffers encompass the entire (up to 64k) buffer. + * The buffer cache in XFS can't handle aliased multiblock buffers, so this + * might misbehave if the attr fork is crosslinked with other filesystem + * metadata. + */ +int +xrep_xattr( + struct xfs_scrub *sc) +{ + struct xrep_xattr *rx = NULL; + int error; + + if (!xfs_inode_hasattr(sc->ip)) + return -ENOENT; + + /* The rmapbt is required to reap the old attr fork. */ + if (!xfs_has_rmapbt(sc->mp)) + return -EOPNOTSUPP; + + error = xrep_xattr_setup_scan(sc, &rx); + if (error) + return error; + + ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL); + + error = xrep_xattr_salvage_attributes(rx); + if (error) + goto out_scan; + + /* Last chance to abort before we start committing fixes. 
*/ + if (xchk_should_terminate(sc, &error)) + goto out_scan; + + error = xrep_xattr_rebuild_tree(rx); + if (error) + goto out_scan; + +out_scan: + xrep_xattr_teardown(rx); + return error; +} diff --git a/fs/xfs/scrub/attr_repair.h b/fs/xfs/scrub/attr_repair.h new file mode 100644 index 000000000000..0a9ffa7cfa90 --- /dev/null +++ b/fs/xfs/scrub/attr_repair.h @@ -0,0 +1,11 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2018-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_ATTR_REPAIR_H__ +#define __XFS_SCRUB_ATTR_REPAIR_H__ + +int xrep_xattr_reset_fork(struct xfs_scrub *sc); + +#endif /* __XFS_SCRUB_ATTR_REPAIR_H__ */ diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 443e62f72481..04aec0e9e4c3 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -32,6 +32,9 @@ #include "xfs_reflink.h" #include "xfs_health.h" #include "xfs_buf_mem.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_attr.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" @@ -39,6 +42,7 @@ #include "scrub/bitmap.h" #include "scrub/stats.h" #include "scrub/xfile.h" +#include "scrub/attr_repair.h" /* * Attempt to repair some metadata, if the metadata is corrupt and userspace @@ -1136,6 +1140,17 @@ xrep_metadata_inode_forks( return error; } + /* Clear the attr forks since metadata shouldn't have that. */ + if (xfs_inode_hasattr(sc->ip)) { + if (!dirty) { + dirty = true; + xfs_trans_ijoin(sc->tp, sc->ip, 0); + } + error = xrep_xattr_reset_fork(sc); + if (error) + return error; + } + /* * If we modified the inode, roll the transaction but don't rejoin the * inode to the new transaction because xrep_bmap_data can do that. @@ -1201,3 +1216,34 @@ xrep_trans_cancel_hook_dummy( current->journal_info = *cookiep; *cookiep = NULL; } + +/* + * See if this buffer can pass the given ->verify_struct() function. 
+ * + * If the buffer already has ops attached and they're not the ones that were + * passed in, we reject the buffer. Otherwise, we perform the structure test + * (note that we do not check CRCs) and return the outcome of the test. The + * buffer ops and error state are left unchanged. + */ +bool +xrep_buf_verify_struct( + struct xfs_buf *bp, + const struct xfs_buf_ops *ops) +{ + const struct xfs_buf_ops *old_ops = bp->b_ops; + xfs_failaddr_t fa; + int old_error; + + if (old_ops) { + if (old_ops != ops) + return false; + } + + old_error = bp->b_error; + bp->b_ops = ops; + fa = bp->b_ops->verify_struct(bp); + bp->b_ops = old_ops; + bp->b_error = old_error; + + return fa == NULL; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 0e2b695ab8f6..9cbfd8da5620 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -90,6 +90,7 @@ int xrep_bmap(struct xfs_scrub *sc, int whichfork, bool allow_unwritten); int xrep_metadata_inode_forks(struct xfs_scrub *sc); int xrep_setup_ag_rmapbt(struct xfs_scrub *sc); int xrep_setup_ag_refcountbt(struct xfs_scrub *sc); +int xrep_setup_xattr(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -123,6 +124,7 @@ int xrep_bmap_attr(struct xfs_scrub *sc); int xrep_bmap_cow(struct xfs_scrub *sc); int xrep_nlinks(struct xfs_scrub *sc); int xrep_fscounters(struct xfs_scrub *sc); +int xrep_xattr(struct xfs_scrub *sc); #ifdef CONFIG_XFS_RT int xrep_rtbitmap(struct xfs_scrub *sc); @@ -147,6 +149,8 @@ int xrep_trans_alloc_hook_dummy(struct xfs_mount *mp, void **cookiep, struct xfs_trans **tpp); void xrep_trans_cancel_hook_dummy(void **cookiep, struct xfs_trans *tp); +bool xrep_buf_verify_struct(struct xfs_buf *bp, const struct xfs_buf_ops *ops); + #else #define xrep_ino_dqattach(sc) (0) @@ -190,6 +194,7 @@ xrep_setup_nothing( #define xrep_setup_ag_allocbt xrep_setup_nothing #define xrep_setup_ag_rmapbt xrep_setup_nothing #define xrep_setup_ag_refcountbt xrep_setup_nothing 
+#define xrep_setup_xattr xrep_setup_nothing #define xrep_setup_inode(sc, imap) ((void)0) @@ -215,6 +220,7 @@ xrep_setup_nothing( #define xrep_nlinks xrep_notsupported #define xrep_fscounters xrep_notsupported #define xrep_rtsummary xrep_notsupported +#define xrep_xattr xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 62a064c1a5d3..547189a14b6b 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -331,7 +331,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_xattr, .scrub = xchk_xattr, - .repair = xrep_notsupported, + .repair = xrep_xattr, }, [XFS_SCRUB_TYPE_SYMLINK] = { /* symbolic link */ .type = ST_INODE, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 7d07912d8f75..026813205b47 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2416,6 +2416,89 @@ TRACE_EVENT(xreap_bmapi_binval_scan, __entry->scan_blocks) ); +TRACE_EVENT(xrep_xattr_recover_leafblock, + TP_PROTO(struct xfs_inode *ip, xfs_dablk_t dabno, uint16_t magic), + TP_ARGS(ip, dabno, magic), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_dablk_t, dabno) + __field(uint16_t, magic) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->dabno = dabno; + __entry->magic = magic; + ), + TP_printk("dev %d:%d ino 0x%llx dablk 0x%x magic 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->dabno, + __entry->magic) +); + +DECLARE_EVENT_CLASS(xrep_xattr_salvage_class, + TP_PROTO(struct xfs_inode *ip, unsigned int flags, char *name, + unsigned int namelen, unsigned int valuelen), + TP_ARGS(ip, flags, name, namelen, valuelen), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(unsigned int, flags) + __field(unsigned int, namelen) + __dynamic_array(char, name, namelen) + __field(unsigned int, valuelen) + ), + TP_fast_assign( + 
__entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->flags = flags; + __entry->namelen = namelen; + memcpy(__get_str(name), name, namelen); + __entry->valuelen = valuelen; + ), + TP_printk("dev %d:%d ino 0x%llx flags %s name '%.*s' valuelen 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __print_flags(__entry->flags, "|", XFS_ATTR_NAMESPACE_STR), + __entry->namelen, + __get_str(name), + __entry->valuelen) +); +#define DEFINE_XREP_XATTR_SALVAGE_EVENT(name) \ +DEFINE_EVENT(xrep_xattr_salvage_class, name, \ + TP_PROTO(struct xfs_inode *ip, unsigned int flags, char *name, \ + unsigned int namelen, unsigned int valuelen), \ + TP_ARGS(ip, flags, name, namelen, valuelen)) +DEFINE_XREP_XATTR_SALVAGE_EVENT(xrep_xattr_salvage_rec); +DEFINE_XREP_XATTR_SALVAGE_EVENT(xrep_xattr_insert_rec); + +TRACE_EVENT(xrep_xattr_class, + TP_PROTO(struct xfs_inode *ip, struct xfs_inode *arg_ip), + TP_ARGS(ip, arg_ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_ino_t, src_ino) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->src_ino = arg_ip->i_ino; + ), + TP_printk("dev %d:%d ino 0x%llx src 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->src_ino) +) +#define DEFINE_XREP_XATTR_EVENT(name) \ +DEFINE_EVENT(xrep_xattr_class, name, \ + TP_PROTO(struct xfs_inode *ip, struct xfs_inode *arg_ip), \ + TP_ARGS(ip, arg_ip)) +DEFINE_XREP_XATTR_EVENT(xrep_xattr_rebuild_tree); +DEFINE_XREP_XATTR_EVENT(xrep_xattr_reset_fork); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ diff --git a/fs/xfs/scrub/xfarray.c b/fs/xfs/scrub/xfarray.c index 17c982a4821d..b65cd3fc5ac9 100644 --- a/fs/xfs/scrub/xfarray.c +++ b/fs/xfs/scrub/xfarray.c @@ -1051,3 +1051,20 @@ xfarray_sort( kvfree(si); return error; } + +/* How many bytes is this array consuming? 
*/ +unsigned long long +xfarray_bytes( + struct xfarray *array) +{ + return xfile_bytes(array->xfile); +} + +/* Empty the entire array. */ +void +xfarray_truncate( + struct xfarray *array) +{ + xfile_discard(array->xfile, 0, MAX_LFS_FILESIZE); + array->nr = 0; +} diff --git a/fs/xfs/scrub/xfarray.h b/fs/xfs/scrub/xfarray.h index acb2f94c56c1..3b10a58e9f14 100644 --- a/fs/xfs/scrub/xfarray.h +++ b/fs/xfs/scrub/xfarray.h @@ -44,6 +44,8 @@ int xfarray_unset(struct xfarray *array, xfarray_idx_t idx); int xfarray_store(struct xfarray *array, xfarray_idx_t idx, const void *ptr); int xfarray_store_anywhere(struct xfarray *array, const void *ptr); bool xfarray_element_is_null(struct xfarray *array, const void *ptr); +void xfarray_truncate(struct xfarray *array); +unsigned long long xfarray_bytes(struct xfarray *array); /* * Load an array element, but zero the buffer if there's no data because we diff --git a/fs/xfs/scrub/xfblob.c b/fs/xfs/scrub/xfblob.c index cec668debce5..6ef2a9637f16 100644 --- a/fs/xfs/scrub/xfblob.c +++ b/fs/xfs/scrub/xfblob.c @@ -149,3 +149,20 @@ xfblob_free( xfile_discard(blob->xfile, cookie, sizeof(key) + key.xb_size); return 0; } + +/* How many bytes is this blob storage object consuming? */ +unsigned long long +xfblob_bytes( + struct xfblob *blob) +{ + return xfile_bytes(blob->xfile); +} + +/* Drop all the blobs. 
*/ +void +xfblob_truncate( + struct xfblob *blob) +{ + xfile_discard(blob->xfile, PAGE_SIZE, MAX_LFS_FILESIZE - PAGE_SIZE); + blob->last_offset = PAGE_SIZE; +} diff --git a/fs/xfs/scrub/xfblob.h b/fs/xfs/scrub/xfblob.h index bd98647407f1..78a67a06408f 100644 --- a/fs/xfs/scrub/xfblob.h +++ b/fs/xfs/scrub/xfblob.h @@ -20,5 +20,7 @@ int xfblob_load(struct xfblob *blob, xfblob_cookie cookie, void *ptr, int xfblob_store(struct xfblob *blob, xfblob_cookie *cookie, const void *ptr, uint32_t size); int xfblob_free(struct xfblob *blob, xfblob_cookie cookie); +unsigned long long xfblob_bytes(struct xfblob *blob); +void xfblob_truncate(struct xfblob *blob); #endif /* __XFS_SCRUB_XFBLOB_H__ */ diff --git a/fs/xfs/scrub/xfile.h b/fs/xfs/scrub/xfile.h index 8dfbae1fe33a..cc2cc1714cd4 100644 --- a/fs/xfs/scrub/xfile.h +++ b/fs/xfs/scrub/xfile.h @@ -27,4 +27,9 @@ struct folio *xfile_get_folio(struct xfile *xf, loff_t offset, size_t len, unsigned int flags); void xfile_put_folio(struct xfile *xf, struct folio *folio); +static inline unsigned long long xfile_bytes(struct xfile *xf) +{ + return file_inode(xf->file)->i_blocks << SECTOR_SHIFT; +} + #endif /* __XFS_SCRUB_XFILE_H__ */ diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c index f0fa02264eda..8a0151e23f3d 100644 --- a/fs/xfs/xfs_buf.c +++ b/fs/xfs/xfs_buf.c @@ -494,6 +494,9 @@ _xfs_buf_obj_cmp( * it stale has not yet committed. i.e. we are * reallocating a busy extent. Skip this buffer and * continue searching for an exact match. + * + * Note: If we're scanning for incore buffers to stale, don't + * complain if we find non-stale buffers. 
*/ if (!(map->bm_flags & XBM_LIVESCAN)) ASSERT(bp->b_flags & XBF_STALE); diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h index 939baf08331b..a80c3063a13f 100644 --- a/fs/xfs/xfs_trace.h +++ b/fs/xfs/xfs_trace.h @@ -31,6 +31,8 @@ * pos: file offset, in bytes * bytecount: number of bytes * + * dablk: directory or xattr block offset, in filesystem blocks + * * disize: ondisk file size, in bytes * isize: incore file size, in bytes * ^ permalink raw reply related [flat|nested] 100+ messages in thread
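Editorial sketch, not part of the posted patch: the comment on xrep_buf_verify_struct() in fs/xfs/scrub/repair.c above describes a "borrow the verifier, then put everything back" pattern. The standalone C below is a minimal illustration of that pattern only. Every demo_* name, the magic value, and the toy verifiers are hypothetical stand-ins for struct xfs_buf, struct xfs_buf_ops and ->verify_struct(); none of this is kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_buf;

/* Hypothetical stand-in for struct xfs_buf_ops. */
struct demo_buf_ops {
	/* Returns NULL if the structure looks sane, else a failure address. */
	const void *(*verify_struct)(struct demo_buf *bp);
};

/* Hypothetical stand-in for struct xfs_buf. */
struct demo_buf {
	const struct demo_buf_ops	*ops;
	int				error;
	unsigned int			magic;
};

#define DEMO_MAGIC	0x3BEEu		/* made-up value for the demo */

/* Toy verifier: checks the magic and scribbles on the error field. */
static const void *demo_verify_magic(struct demo_buf *bp)
{
	if (bp->magic != DEMO_MAGIC) {
		bp->error = -1;
		return bp;
	}
	return NULL;
}

/* Toy verifier that rejects everything. */
static const void *demo_verify_never(struct demo_buf *bp)
{
	return bp;
}

const struct demo_buf_ops demo_magic_ops = { .verify_struct = demo_verify_magic };
const struct demo_buf_ops demo_other_ops = { .verify_struct = demo_verify_never };

/*
 * Run ops->verify_struct() against bp without leaving a trace: reject the
 * buffer outright if different ops are already attached, otherwise attach
 * ops temporarily, run the check, and restore the old ops and error state.
 */
bool demo_buf_verify_struct(struct demo_buf *bp, const struct demo_buf_ops *ops)
{
	const struct demo_buf_ops	*old_ops = bp->ops;
	int				old_error = bp->error;
	const void			*fa;

	assert(ops && ops->verify_struct);

	if (old_ops && old_ops != ops)
		return false;

	bp->ops = ops;
	fa = ops->verify_struct(bp);
	bp->ops = old_ops;
	bp->error = old_error;

	return fa == NULL;
}
```

The point of the save and restore is that a caller can probe a buffer of unknown type against several candidate verifiers in turn without the probes interfering with each other.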
* [PATCH 5/7] xfs: scrub should set preen if attr leaf has holes
  2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong
                     ` (3 preceding siblings ...)
  2024-04-15 23:50   ` [PATCH 4/7] xfs: repair extended attributes Darrick J. Wong
@ 2024-04-15 23:50   ` Darrick J. Wong
  2024-04-15 23:50   ` [PATCH 6/7] xfs: flag empty xattr leaf blocks for optimization Darrick J. Wong
  2024-04-15 23:50   ` [PATCH 7/7] xfs: create an xattr iteration function for scrub Darrick J. Wong
  6 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:50 UTC (permalink / raw)
  To: chandanbabu, djwong; +Cc: Dave Chinner, Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

If an attr block indicates that it could use compaction, set the preen
flag to have the attr fork rebuilt, since the attr fork rebuilder can
take care of that for us.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/scrub/attr.c    |    2 ++
 fs/xfs/scrub/dabtree.c |   16 ++++++++++++++++
 fs/xfs/scrub/dabtree.h |    1 +
 fs/xfs/scrub/trace.h   |    1 +
 4 files changed, 20 insertions(+)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index 7621e548d730..ba06be86ac7d 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -428,6 +428,8 @@ xchk_xattr_block(
 		xchk_da_set_corrupt(ds, level);
 	if (!xchk_xattr_set_map(ds->sc, ab->usedmap, 0, hdrsize))
 		xchk_da_set_corrupt(ds, level);
+	if (leafhdr.holes)
+		xchk_da_set_preen(ds, level);
 
 	if (ds->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)
 		goto out;
diff --git a/fs/xfs/scrub/dabtree.c b/fs/xfs/scrub/dabtree.c
index c71254088dff..056de4819f86 100644
--- a/fs/xfs/scrub/dabtree.c
+++ b/fs/xfs/scrub/dabtree.c
@@ -78,6 +78,22 @@ xchk_da_set_corrupt(
 			__return_address);
 }
 
+/* Flag a da btree node in need of optimization. */
+void
+xchk_da_set_preen(
+	struct xchk_da_btree	*ds,
+	int			level)
+{
+	struct xfs_scrub	*sc = ds->sc;
+
+	sc->sm->sm_flags |= XFS_SCRUB_OFLAG_PREEN;
+	trace_xchk_fblock_preen(sc, ds->dargs.whichfork,
+			xfs_dir2_da_to_db(ds->dargs.geo,
+				ds->state->path.blk[level].blkno),
+			__return_address);
+}
+
+/* Find an entry at a certain level in a da btree. */
 static struct xfs_da_node_entry *
 xchk_da_btree_node_entry(
 	struct xchk_da_btree		*ds,
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index 4f8c2138a1ec..d654c125feb4 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -35,6 +35,7 @@ bool xchk_da_process_error(struct xchk_da_btree *ds, int level, int *error);
 
 /* Check for da btree corruption. */
 void xchk_da_set_corrupt(struct xchk_da_btree *ds, int level);
+void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
 
 int xchk_da_btree_hash(struct xchk_da_btree *ds, int level, __be32 *hashp);
 int xchk_da_btree(struct xfs_scrub *sc, int whichfork,
diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h
index 026813205b47..ffaff7722bf2 100644
--- a/fs/xfs/scrub/trace.h
+++ b/fs/xfs/scrub/trace.h
@@ -365,6 +365,7 @@ DEFINE_EVENT(xchk_fblock_error_class, name, \
 
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_error);
 DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_warning);
+DEFINE_SCRUB_FBLOCK_ERROR_EVENT(xchk_fblock_preen);
 
 #ifdef CONFIG_XFS_QUOTA
 DECLARE_EVENT_CLASS(xchk_dqiter_class,

^ permalink raw reply related	[flat|nested] 100+ messages in thread
* [PATCH 6/7] xfs: flag empty xattr leaf blocks for optimization
  2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong
                     ` (4 preceding siblings ...)
  2024-04-15 23:50   ` [PATCH 5/7] xfs: scrub should set preen if attr leaf has holes Darrick J. Wong
@ 2024-04-15 23:50   ` Darrick J. Wong
  2024-04-15 23:50   ` [PATCH 7/7] xfs: create an xattr iteration function for scrub Darrick J. Wong
  6 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:50 UTC (permalink / raw)
  To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Empty xattr leaf blocks at offset zero are a waste of space but
otherwise harmless. If we encounter one, flag it as an opportunity for
optimization.

If we encounter empty attr leaf blocks anywhere else in the attr fork,
that's corruption.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/scrub/attr.c    |   11 +++++++++++
 fs/xfs/scrub/dabtree.h |    2 ++
 2 files changed, 13 insertions(+)

diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c
index ba06be86ac7d..696971204b87 100644
--- a/fs/xfs/scrub/attr.c
+++ b/fs/xfs/scrub/attr.c
@@ -420,6 +420,17 @@ xchk_xattr_block(
 	xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf);
 	hdrsize = xfs_attr3_leaf_hdr_size(leaf);
 
+	/*
+	 * Empty xattr leaf blocks mapped at block 0 are probably a byproduct
+	 * of a race between setxattr and a log shutdown. Anywhere else in the
+	 * attr fork is a corruption.
+	 */
+	if (leafhdr.count == 0) {
+		if (blk->blkno == 0)
+			xchk_da_set_preen(ds, level);
+		else
+			xchk_da_set_corrupt(ds, level);
+	}
 	if (leafhdr.usedbytes > mp->m_attr_geo->blksize)
 		xchk_da_set_corrupt(ds, level);
 	if (leafhdr.firstused > mp->m_attr_geo->blksize)
diff --git a/fs/xfs/scrub/dabtree.h b/fs/xfs/scrub/dabtree.h
index d654c125feb4..de291e3b77dd 100644
--- a/fs/xfs/scrub/dabtree.h
+++ b/fs/xfs/scrub/dabtree.h
@@ -37,6 +37,8 @@ bool xchk_da_process_error(struct xchk_da_btree *ds, int level, int *error);
 void xchk_da_set_corrupt(struct xchk_da_btree *ds, int level);
 void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
 
+void xchk_da_set_preen(struct xchk_da_btree *ds, int level);
+
 int xchk_da_btree_hash(struct xchk_da_btree *ds, int level, __be32 *hashp);
 int xchk_da_btree(struct xfs_scrub *sc, int whichfork,
 		xchk_da_btree_rec_fn scrub_fn, void *private);

^ permalink raw reply related	[flat|nested] 100+ messages in thread
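Editorial sketch, not part of the posted patch: the rule stated in patch 6/7 above is position-dependent. An empty leaf at attr fork block 0 is only wasted space, while an empty leaf anywhere else is treated as corruption. The standalone C below restates that decision as a small pure function. The demo_* names and flag values are hypothetical and chosen for this illustration only.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative outcome bits; the values are made up for this demo. */
#define DEMO_OFLAG_CORRUPT	(1u << 0)
#define DEMO_OFLAG_PREEN	(1u << 1)

/*
 * Classify an attr leaf block by its entry count and its block offset
 * within the attr fork:
 *   - non-empty leaf:           nothing to report from this check
 *   - empty leaf at block 0:    harmless, flag for optimization
 *   - empty leaf elsewhere:     corruption
 */
unsigned int demo_check_empty_leaf(uint16_t count, uint32_t blkno)
{
	if (count != 0)
		return 0;

	return blkno == 0 ? DEMO_OFLAG_PREEN : DEMO_OFLAG_CORRUPT;
}
```

Keeping the decision in a side-effect-free helper like this makes the three cases easy to enumerate; the patch itself performs the same branch inline and reports through xchk_da_set_preen() and xchk_da_set_corrupt().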
* [PATCH 7/7] xfs: create an xattr iteration function for scrub 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong ` (5 preceding siblings ...) 2024-04-15 23:50 ` [PATCH 6/7] xfs: flag empty xattr leaf blocks for optimization Darrick J. Wong @ 2024-04-15 23:50 ` Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:50 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Create a streamlined function to walk a file's xattrs, without all the cursor management stuff in the regular listxattr. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/attr.c | 125 +++++++----------- fs/xfs/scrub/dab_bitmap.h | 37 +++++ fs/xfs/scrub/listxattr.c | 312 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/listxattr.h | 17 ++ 5 files changed, 414 insertions(+), 78 deletions(-) create mode 100644 fs/xfs/scrub/dab_bitmap.h create mode 100644 fs/xfs/scrub/listxattr.c create mode 100644 fs/xfs/scrub/listxattr.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 8647629ac7bf..7dbe6b3befb3 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -165,6 +165,7 @@ xfs-y += $(addprefix scrub/, \ ialloc.o \ inode.o \ iscan.o \ + listxattr.o \ nlinks.o \ parent.o \ readdir.o \ diff --git a/fs/xfs/scrub/attr.c b/fs/xfs/scrub/attr.c index 696971204b87..8853e4d0eee3 100644 --- a/fs/xfs/scrub/attr.c +++ b/fs/xfs/scrub/attr.c @@ -21,6 +21,7 @@ #include "scrub/common.h" #include "scrub/dabtree.h" #include "scrub/attr.h" +#include "scrub/listxattr.h" #include "scrub/repair.h" /* Free the buffers linked from the xattr buffer. */ @@ -153,90 +154,81 @@ xchk_setup_xattr( /* Extended Attributes */ -struct xchk_xattr { - struct xfs_attr_list_context context; - struct xfs_scrub *sc; -}; - /* * Check that an extended attribute key can be looked up by hash. 
* - * We use the XFS attribute list iterator (i.e. xfs_attr_list_ilocked) - * to call this function for every attribute key in an inode. Once - * we're here, we load the attribute value to see if any errors happen, - * or if we get more or less data than we expected. + * We use the extended attribute walk helper to call this function for every + * attribute key in an inode. Once we're here, we load the attribute value to + * see if any errors happen, or if we get more or less data than we expected. */ -static void -xchk_xattr_listent( - struct xfs_attr_list_context *context, - int flags, - unsigned char *name, - int namelen, - int valuelen) +static int +xchk_xattr_actor( + struct xfs_scrub *sc, + struct xfs_inode *ip, + unsigned int attr_flags, + const unsigned char *name, + unsigned int namelen, + const void *value, + unsigned int valuelen, + void *priv) { struct xfs_da_args args = { .op_flags = XFS_DA_OP_NOTIME, - .attr_filter = flags & XFS_ATTR_NSP_ONDISK_MASK, - .geo = context->dp->i_mount->m_attr_geo, + .attr_filter = attr_flags & XFS_ATTR_NSP_ONDISK_MASK, + .geo = sc->mp->m_attr_geo, .whichfork = XFS_ATTR_FORK, - .dp = context->dp, + .dp = ip, .name = name, .namelen = namelen, .hashval = xfs_da_hashname(name, namelen), - .trans = context->tp, + .trans = sc->tp, .valuelen = valuelen, - .owner = context->dp->i_ino, + .owner = ip->i_ino, }; struct xchk_xattr_buf *ab; - struct xchk_xattr *sx; int error = 0; - sx = container_of(context, struct xchk_xattr, context); - ab = sx->sc->buf; + ab = sc->buf; - if (xchk_should_terminate(sx->sc, &error)) { - context->seen_enough = error; - return; - } + if (xchk_should_terminate(sc, &error)) + return error; - if (flags & XFS_ATTR_INCOMPLETE) { + if (attr_flags & XFS_ATTR_INCOMPLETE) { /* Incomplete attr key, just mark the inode for preening. */ - xchk_ino_set_preen(sx->sc, context->dp->i_ino); - return; + xchk_ino_set_preen(sc, ip->i_ino); + return 0; } /* Only one namespace bit allowed. 
*/ - if (hweight32(flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) { - xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); - goto fail_xref; + if (hweight32(attr_flags & XFS_ATTR_NSP_ONDISK_MASK) > 1) { + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno); + return -ECANCELED; } /* Does this name make sense? */ if (!xfs_attr_namecheck(name, namelen)) { - xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, args.blkno); - goto fail_xref; + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno); + return -ECANCELED; } /* - * Local xattr values are stored in the attr leaf block, so we don't - * need to retrieve the value from a remote block to detect corruption - * problems. + * Local and shortform xattr values are stored in the attr leaf block, + * so we don't need to retrieve the value from a remote block to detect + * corruption problems. */ - if (flags & XFS_ATTR_LOCAL) - goto fail_xref; + if (value) + return 0; /* - * Try to allocate enough memory to extrat the attr value. If that - * doesn't work, we overload the seen_enough variable to convey - * the error message back to the main scrub function. + * Try to allocate enough memory to extract the attr value. If that + * doesn't work, return -EDEADLOCK as a signal to try again with a + * maximally sized buffer. 
*/ - error = xchk_setup_xattr_buf(sx->sc, valuelen); + error = xchk_setup_xattr_buf(sc, valuelen); if (error == -ENOMEM) error = -EDEADLOCK; - if (error) { - context->seen_enough = error; - return; - } + if (error) + return error; args.value = ab->value; @@ -244,16 +236,13 @@ xchk_xattr_listent( /* ENODATA means the hash lookup failed and the attr is bad */ if (error == -ENODATA) error = -EFSCORRUPTED; - if (!xchk_fblock_process_error(sx->sc, XFS_ATTR_FORK, args.blkno, + if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, args.blkno, &error)) - goto fail_xref; + return error; if (args.valuelen != valuelen) - xchk_fblock_set_corrupt(sx->sc, XFS_ATTR_FORK, - args.blkno); -fail_xref: - if (sx->sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT) - context->seen_enough = 1; - return; + xchk_fblock_set_corrupt(sc, XFS_ATTR_FORK, args.blkno); + + return 0; } /* @@ -618,16 +607,6 @@ int xchk_xattr( struct xfs_scrub *sc) { - struct xchk_xattr sx = { - .sc = sc, - .context = { - .dp = sc->ip, - .tp = sc->tp, - .resynch = 1, - .put_listent = xchk_xattr_listent, - .allow_incomplete = true, - }, - }; xfs_dablk_t last_checked = -1U; int error = 0; @@ -656,12 +635,6 @@ xchk_xattr( /* * Look up every xattr in this file by name and hash. * - * Use the backend implementation of xfs_attr_list to call - * xchk_xattr_listent on every attribute key in this inode. - * In other words, we use the same iterator/callback mechanism - * that listattr uses to scrub extended attributes, though in our - * _listent function, we check the value of the attribute. - * * The VFS only locks i_rwsem when modifying attrs, so keep all * three locks held because that's the only way to ensure we're * the only thread poking into the da btree. We traverse the da @@ -669,13 +642,9 @@ xchk_xattr( * iteration, which doesn't really follow the usual buffer * locking order. 
*/ - error = xfs_attr_list_ilocked(&sx.context); + error = xchk_xattr_walk(sc, sc->ip, xchk_xattr_actor, NULL); if (!xchk_fblock_process_error(sc, XFS_ATTR_FORK, 0, &error)) return error; - /* Did our listent function try to return any errors? */ - if (sx.context.seen_enough < 0) - return sx.context.seen_enough; - return 0; } diff --git a/fs/xfs/scrub/dab_bitmap.h b/fs/xfs/scrub/dab_bitmap.h new file mode 100644 index 000000000000..0c6e3aad4395 --- /dev/null +++ b/fs/xfs/scrub/dab_bitmap.h @@ -0,0 +1,37 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_DAB_BITMAP_H__ +#define __XFS_SCRUB_DAB_BITMAP_H__ + +/* Bitmaps, but for type-checked for xfs_dablk_t */ + +struct xdab_bitmap { + struct xbitmap32 dabitmap; +}; + +static inline void xdab_bitmap_init(struct xdab_bitmap *bitmap) +{ + xbitmap32_init(&bitmap->dabitmap); +} + +static inline void xdab_bitmap_destroy(struct xdab_bitmap *bitmap) +{ + xbitmap32_destroy(&bitmap->dabitmap); +} + +static inline int xdab_bitmap_set(struct xdab_bitmap *bitmap, + xfs_dablk_t dabno, xfs_extlen_t len) +{ + return xbitmap32_set(&bitmap->dabitmap, dabno, len); +} + +static inline bool xdab_bitmap_test(struct xdab_bitmap *bitmap, + xfs_dablk_t dabno, xfs_extlen_t *len) +{ + return xbitmap32_test(&bitmap->dabitmap, dabno, len); +} + +#endif /* __XFS_SCRUB_DAB_BITMAP_H__ */ diff --git a/fs/xfs/scrub/listxattr.c b/fs/xfs/scrub/listxattr.c new file mode 100644 index 000000000000..cbe5911ecbbc --- /dev/null +++ b/fs/xfs/scrub/listxattr.c @@ -0,0 +1,312 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2022-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_log_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_inode.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_attr.h" +#include "xfs_attr_leaf.h" +#include "xfs_attr_sf.h" +#include "xfs_trans.h" +#include "scrub/scrub.h" +#include "scrub/bitmap.h" +#include "scrub/dab_bitmap.h" +#include "scrub/listxattr.h" + +/* Call a function for every entry in a shortform xattr structure. */ +STATIC int +xchk_xattr_walk_sf( + struct xfs_scrub *sc, + struct xfs_inode *ip, + xchk_xattr_fn attr_fn, + void *priv) +{ + struct xfs_attr_sf_hdr *hdr = ip->i_af.if_data; + struct xfs_attr_sf_entry *sfe; + unsigned int i; + int error; + + sfe = xfs_attr_sf_firstentry(hdr); + for (i = 0; i < hdr->count; i++) { + error = attr_fn(sc, ip, sfe->flags, sfe->nameval, sfe->namelen, + &sfe->nameval[sfe->namelen], sfe->valuelen, + priv); + if (error) + return error; + + sfe = xfs_attr_sf_nextentry(sfe); + } + + return 0; +} + +/* Call a function for every entry in this xattr leaf block. 
*/ +STATIC int +xchk_xattr_walk_leaf_entries( + struct xfs_scrub *sc, + struct xfs_inode *ip, + xchk_xattr_fn attr_fn, + struct xfs_buf *bp, + void *priv) +{ + struct xfs_attr3_icleaf_hdr ichdr; + struct xfs_mount *mp = sc->mp; + struct xfs_attr_leafblock *leaf = bp->b_addr; + struct xfs_attr_leaf_entry *entry; + unsigned int i; + int error; + + xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &ichdr, leaf); + entry = xfs_attr3_leaf_entryp(leaf); + + for (i = 0; i < ichdr.count; entry++, i++) { + void *value; + unsigned char *name; + unsigned int namelen, valuelen; + + if (entry->flags & XFS_ATTR_LOCAL) { + struct xfs_attr_leaf_name_local *name_loc; + + name_loc = xfs_attr3_leaf_name_local(leaf, i); + name = name_loc->nameval; + namelen = name_loc->namelen; + value = &name_loc->nameval[name_loc->namelen]; + valuelen = be16_to_cpu(name_loc->valuelen); + } else { + struct xfs_attr_leaf_name_remote *name_rmt; + + name_rmt = xfs_attr3_leaf_name_remote(leaf, i); + name = name_rmt->name; + namelen = name_rmt->namelen; + value = NULL; + valuelen = be32_to_cpu(name_rmt->valuelen); + } + + error = attr_fn(sc, ip, entry->flags, name, namelen, value, + valuelen, priv); + if (error) + return error; + + } + + return 0; +} + +/* + * Call a function for every entry in a leaf-format xattr structure. Avoid + * memory allocations for the loop detector since there's only one block. + */ +STATIC int +xchk_xattr_walk_leaf( + struct xfs_scrub *sc, + struct xfs_inode *ip, + xchk_xattr_fn attr_fn, + void *priv) +{ + struct xfs_buf *leaf_bp; + int error; + + error = xfs_attr3_leaf_read(sc->tp, ip, ip->i_ino, 0, &leaf_bp); + if (error) + return error; + + error = xchk_xattr_walk_leaf_entries(sc, ip, attr_fn, leaf_bp, priv); + xfs_trans_brelse(sc->tp, leaf_bp); + return error; +} + +/* Find the leftmost leaf in the xattr dabtree. 
*/ +STATIC int +xchk_xattr_find_leftmost_leaf( + struct xfs_scrub *sc, + struct xfs_inode *ip, + struct xdab_bitmap *seen_dablks, + struct xfs_buf **leaf_bpp) +{ + struct xfs_da3_icnode_hdr nodehdr; + struct xfs_mount *mp = sc->mp; + struct xfs_trans *tp = sc->tp; + struct xfs_da_intnode *node; + struct xfs_da_node_entry *btree; + struct xfs_buf *bp; + xfs_failaddr_t fa; + xfs_dablk_t blkno = 0; + unsigned int expected_level = 0; + int error; + + for (;;) { + xfs_extlen_t len = 1; + uint16_t magic; + + /* Make sure we haven't seen this new block already. */ + if (xdab_bitmap_test(seen_dablks, blkno, &len)) + return -EFSCORRUPTED; + + error = xfs_da3_node_read(tp, ip, blkno, &bp, XFS_ATTR_FORK); + if (error) + return error; + + node = bp->b_addr; + magic = be16_to_cpu(node->hdr.info.magic); + if (magic == XFS_ATTR_LEAF_MAGIC || + magic == XFS_ATTR3_LEAF_MAGIC) + break; + + error = -EFSCORRUPTED; + if (magic != XFS_DA_NODE_MAGIC && + magic != XFS_DA3_NODE_MAGIC) + goto out_buf; + + fa = xfs_da3_node_header_check(bp, ip->i_ino); + if (fa) + goto out_buf; + + xfs_da3_node_hdr_from_disk(mp, &nodehdr, node); + + if (nodehdr.count == 0 || nodehdr.level >= XFS_DA_NODE_MAXDEPTH) + goto out_buf; + + /* Check the level from the root node. */ + if (blkno == 0) + expected_level = nodehdr.level - 1; + else if (expected_level != nodehdr.level) + goto out_buf; + else + expected_level--; + + /* Remember that we've seen this node. */ + error = xdab_bitmap_set(seen_dablks, blkno, 1); + if (error) + goto out_buf; + + /* Find the next level towards the leaves of the dabtree. */ + btree = nodehdr.btree; + blkno = be32_to_cpu(btree->before); + xfs_trans_brelse(tp, bp); + } + + error = -EFSCORRUPTED; + fa = xfs_attr3_leaf_header_check(bp, ip->i_ino); + if (fa) + goto out_buf; + + if (expected_level != 0) + goto out_buf; + + /* Remember that we've seen this leaf. 
*/ + error = xdab_bitmap_set(seen_dablks, blkno, 1); + if (error) + goto out_buf; + + *leaf_bpp = bp; + return 0; + +out_buf: + xfs_trans_brelse(tp, bp); + return error; +} + +/* Call a function for every entry in a node-format xattr structure. */ +STATIC int +xchk_xattr_walk_node( + struct xfs_scrub *sc, + struct xfs_inode *ip, + xchk_xattr_fn attr_fn, + void *priv) +{ + struct xfs_attr3_icleaf_hdr leafhdr; + struct xdab_bitmap seen_dablks; + struct xfs_mount *mp = sc->mp; + struct xfs_attr_leafblock *leaf; + struct xfs_buf *leaf_bp; + int error; + + xdab_bitmap_init(&seen_dablks); + + error = xchk_xattr_find_leftmost_leaf(sc, ip, &seen_dablks, &leaf_bp); + if (error) + goto out_bitmap; + + for (;;) { + xfs_extlen_t len; + + error = xchk_xattr_walk_leaf_entries(sc, ip, attr_fn, leaf_bp, + priv); + if (error) + goto out_leaf; + + /* Find the right sibling of this leaf block. */ + leaf = leaf_bp->b_addr; + xfs_attr3_leaf_hdr_from_disk(mp->m_attr_geo, &leafhdr, leaf); + if (leafhdr.forw == 0) + goto out_leaf; + + xfs_trans_brelse(sc->tp, leaf_bp); + + /* Make sure we haven't seen this new leaf already. */ + len = 1; + if (xdab_bitmap_test(&seen_dablks, leafhdr.forw, &len)) { + error = -EFSCORRUPTED; + goto out_bitmap; + } + + error = xfs_attr3_leaf_read(sc->tp, ip, ip->i_ino, + leafhdr.forw, &leaf_bp); + if (error) + goto out_bitmap; + + /* Remember that we've seen this new leaf. */ + error = xdab_bitmap_set(&seen_dablks, leafhdr.forw, 1); + if (error) + goto out_leaf; + } + +out_leaf: + xfs_trans_brelse(sc->tp, leaf_bp); +out_bitmap: + xdab_bitmap_destroy(&seen_dablks); + return error; +} + +/* + * Call a function for every extended attribute in a file. + * + * Callers must hold the ILOCK. No validation or cursor restarts allowed. + * Returns -EFSCORRUPTED on any problem, including loops in the dabtree. 
+ */ +int +xchk_xattr_walk( + struct xfs_scrub *sc, + struct xfs_inode *ip, + xchk_xattr_fn attr_fn, + void *priv) +{ + int error; + + xfs_assert_ilocked(ip, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL); + + if (!xfs_inode_hasattr(ip)) + return 0; + + if (ip->i_af.if_format == XFS_DINODE_FMT_LOCAL) + return xchk_xattr_walk_sf(sc, ip, attr_fn, priv); + + /* attr functions require that the attr fork is loaded */ + error = xfs_iread_extents(sc->tp, ip, XFS_ATTR_FORK); + if (error) + return error; + + if (xfs_attr_is_leaf(ip)) + return xchk_xattr_walk_leaf(sc, ip, attr_fn, priv); + + return xchk_xattr_walk_node(sc, ip, attr_fn, priv); +} diff --git a/fs/xfs/scrub/listxattr.h b/fs/xfs/scrub/listxattr.h new file mode 100644 index 000000000000..48fe89d05946 --- /dev/null +++ b/fs/xfs/scrub/listxattr.h @@ -0,0 +1,17 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2022-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_LISTXATTR_H__ +#define __XFS_SCRUB_LISTXATTR_H__ + +typedef int (*xchk_xattr_fn)(struct xfs_scrub *sc, struct xfs_inode *ip, + unsigned int attr_flags, const unsigned char *name, + unsigned int namelen, const void *value, unsigned int valuelen, + void *priv); + +int xchk_xattr_walk(struct xfs_scrub *sc, struct xfs_inode *ip, + xchk_xattr_fn attr_fn, void *priv); + +#endif /* __XFS_SCRUB_LISTXATTR_H__ */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
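The node-format walker above guards against cycles in the leaf sibling chain by recording every dabtree block it visits in a bitmap and bailing out if a block shows up twice. What follows is only a userspace sketch of that idea, not the kernel code: `seen_bitmap`, `walk_leaf_chain`, and `MODEL_NBLOCKS` are hypothetical names, the "fork" is a toy array of forward pointers, and `-EINVAL` stands in for the kernel's `-EFSCORRUPTED`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stdint.h>

#define MODEL_NBLOCKS	64	/* hypothetical size of the toy attr fork */

/* Toy stand-in for the patch's xdab_bitmap: one bit per block number. */
struct seen_bitmap {
	uint64_t	bits;
};

static bool seen_test(const struct seen_bitmap *bm, unsigned int blkno)
{
	return (bm->bits & (UINT64_C(1) << blkno)) != 0;
}

static void seen_set(struct seen_bitmap *bm, unsigned int blkno)
{
	bm->bits |= UINT64_C(1) << blkno;
}

/*
 * Walk a chain of leaf blocks.  forw[i] is the block number of the right
 * sibling of block i, and 0 ends the chain, as in the patch.  Returns the
 * number of leaves visited, or -EINVAL if the chain points outside the toy
 * fork or revisits a block (the kernel code returns -EFSCORRUPTED for a
 * revisited block).
 */
static int walk_leaf_chain(const unsigned int *forw, unsigned int first)
{
	struct seen_bitmap	seen = { 0 };
	unsigned int		blkno = first;
	int			visited = 0;

	assert(first < MODEL_NBLOCKS);

	/* Remember the leftmost leaf before walking its siblings. */
	seen_set(&seen, blkno);

	for (;;) {
		unsigned int	next;

		visited++;	/* this is where attr_fn would run */

		next = forw[blkno];
		if (next == 0)
			return visited;

		/* Make sure we haven't seen this new leaf already. */
		if (next >= MODEL_NBLOCKS || seen_test(&seen, next))
			return -EINVAL;

		seen_set(&seen, next);
		blkno = next;
	}
}
```

The bitmap costs one bit per block, so the check stays cheap even for long chains; the kernel version allocates its bitmap dynamically, which is why the single-block leaf format skips it entirely.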
* [PATCHSET v30.3 08/16] xfs: online repair of inode unlinked state 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (6 preceding siblings ...) 2024-04-15 23:35 ` [PATCHSET v30.3 07/16] xfs: online repair of extended attributes Darrick J. Wong @ 2024-04-15 23:35 ` Darrick J. Wong 2024-04-15 23:51 ` [PATCH 1/2] xfs: ensure unlinked list state is consistent with nlink during scrub Darrick J. Wong 2024-04-15 23:51 ` [PATCH 2/2] xfs: update the unlinked list when repairing link counts Darrick J. Wong 2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong ` (7 subsequent siblings) 15 siblings, 2 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:35 UTC (permalink / raw) To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs Hi all, This series adds some logic to the inode scrubbers so that they can detect and deal with consistency errors between the link count and the per-inode unlinked list state. The helpers needed to do this are presented here because they are a prerequisite for rebuilding directories, since we need to get a rebuilt non-empty directory off the unlinked list. Note that this patchset does not provide comprehensive reconstruction of the AGI unlinked list; that is coming in a subsequent patchset. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-unlinked-inode-state-6.10 --- Commits in this patchset: * xfs: ensure unlinked list state is consistent with nlink during scrub * xfs: update the unlinked list when repairing link counts --- fs/xfs/scrub/inode.c | 19 ++++++++++++++++++ fs/xfs/scrub/inode_repair.c | 45 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/nlinks_repair.c | 42 +++++++++++++++++++++++++++++++-------- fs/xfs/xfs_inode.c | 5 +---- fs/xfs/xfs_inode.h | 2 ++ 5 files changed, 100 insertions(+), 13 deletions(-) ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/2] xfs: ensure unlinked list state is consistent with nlink during scrub 2024-04-15 23:35 ` [PATCHSET v30.3 08/16] xfs: online repair of inode unlinked state Darrick J. Wong @ 2024-04-15 23:51 ` Darrick J. Wong 2024-04-15 23:51 ` [PATCH 2/2] xfs: update the unlinked list when repairing link counts Darrick J. Wong 1 sibling, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:51 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we have the means to tell if an inode is on an unlinked inode list or not, we can check that an inode with zero link count is on the unlinked list; and an inode that has nonzero link count is not on that list. Make repair clean things up too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/inode.c | 19 ++++++++++++++++++ fs/xfs/scrub/inode_repair.c | 45 +++++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_inode.c | 5 +---- fs/xfs/xfs_inode.h | 2 ++ 4 files changed, 67 insertions(+), 4 deletions(-) diff --git a/fs/xfs/scrub/inode.c b/fs/xfs/scrub/inode.c index 6e2fe2d6250b..d32716fb2fec 100644 --- a/fs/xfs/scrub/inode.c +++ b/fs/xfs/scrub/inode.c @@ -739,6 +739,23 @@ xchk_inode_check_reflink_iflag( xchk_ino_set_corrupt(sc, ino); } +/* + * If this inode has zero link count, it must be on the unlinked list. If + * it has nonzero link count, it must not be on the unlinked list. + */ +STATIC void +xchk_inode_check_unlinked( + struct xfs_scrub *sc) +{ + if (VFS_I(sc->ip)->i_nlink == 0) { + if (!xfs_inode_on_unlinked_list(sc->ip)) + xchk_ino_set_corrupt(sc, sc->ip->i_ino); + } else { + if (xfs_inode_on_unlinked_list(sc->ip)) + xchk_ino_set_corrupt(sc, sc->ip->i_ino); + } +} + /* Scrub an inode. 
*/ int xchk_inode( @@ -771,6 +788,8 @@ xchk_inode( if (S_ISREG(VFS_I(sc->ip)->i_mode)) xchk_inode_check_reflink_iflag(sc, sc->ip->i_ino); + xchk_inode_check_unlinked(sc); + xchk_inode_xref(sc, sc->ip->i_ino, &di); out: return error; diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index 097afba3043f..c743772a523e 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -1745,6 +1745,46 @@ xrep_inode_problems( return xrep_roll_trans(sc); } +/* + * Make sure this inode's unlinked list pointers are consistent with its + * link count. + */ +STATIC int +xrep_inode_unlinked( + struct xfs_scrub *sc) +{ + unsigned int nlink = VFS_I(sc->ip)->i_nlink; + int error; + + /* + * If this inode is linked from the directory tree and on the unlinked + * list, remove it from the unlinked list. + */ + if (nlink > 0 && xfs_inode_on_unlinked_list(sc->ip)) { + struct xfs_perag *pag; + int error; + + pag = xfs_perag_get(sc->mp, + XFS_INO_TO_AGNO(sc->mp, sc->ip->i_ino)); + error = xfs_iunlink_remove(sc->tp, pag, sc->ip); + xfs_perag_put(pag); + if (error) + return error; + } + + /* + * If this inode is not linked from the directory tree yet not on the + * unlinked list, put it on the unlinked list. + */ + if (nlink == 0 && !xfs_inode_on_unlinked_list(sc->ip)) { + error = xfs_iunlink(sc->tp, sc->ip); + if (error) + return error; + } + + return 0; +} + /* Repair an inode's fields. 
*/ int xrep_inode( @@ -1794,5 +1834,10 @@ xrep_inode( return error; } + /* Reconnect incore unlinked list */ + error = xrep_inode_unlinked(sc); + if (error) + return error; + return xrep_defer_finish(sc); } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index ac92c0525d9b..b24c0e23d37d 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -42,9 +42,6 @@ struct kmem_cache *xfs_inode_cache; -STATIC int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag, - struct xfs_inode *); - /* * helper function to extract extent size hint from inode */ @@ -2252,7 +2249,7 @@ xfs_iunlink_remove_inode( /* * Pull the on-disk inode from the AGI unlinked list. */ -STATIC int +int xfs_iunlink_remove( struct xfs_trans *tp, struct xfs_perag *pag, diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 596eec715675..8157ae7f8e59 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -617,6 +617,8 @@ extern struct kmem_cache *xfs_inode_cache; bool xfs_inode_needs_inactive(struct xfs_inode *ip); int xfs_iunlink(struct xfs_trans *tp, struct xfs_inode *ip); +int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag, + struct xfs_inode *ip); void xfs_end_io(struct work_struct *work); ^ permalink raw reply related [flat|nested] 100+ messages in thread
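The invariant that this patch checks and repairs is small enough to model outside the kernel: an inode is on the unlinked list if and only if its link count is zero. The sketch below is a userspace illustration under that assumption only. `model_inode`, `model_unlinked_state_ok`, and `model_repair_unlinked` are hypothetical names, and flipping a boolean stands in for the real `xfs_iunlink()` and `xfs_iunlink_remove()` calls, which need a transaction and a perag reference.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy inode: only the two fields the invariant cares about. */
struct model_inode {
	unsigned int	nlink;
	bool		on_unlinked_list;
};

/* Scrub side: zero link count must coincide with unlinked list membership. */
static bool model_unlinked_state_ok(const struct model_inode *ip)
{
	return (ip->nlink == 0) == ip->on_unlinked_list;
}

/*
 * Repair side: make list membership agree with the link count.  Returns
 * true if anything had to change.
 */
static bool model_repair_unlinked(struct model_inode *ip)
{
	bool	changed = false;

	/* Linked from the directory tree but still on the list: remove it. */
	if (ip->nlink > 0 && ip->on_unlinked_list) {
		ip->on_unlinked_list = false;	/* models xfs_iunlink_remove() */
		changed = true;
	}

	/* Not linked from anywhere yet not on the list: add it. */
	if (ip->nlink == 0 && !ip->on_unlinked_list) {
		ip->on_unlinked_list = true;	/* models xfs_iunlink() */
		changed = true;
	}

	assert(model_unlinked_state_ok(ip));
	return changed;
}
```

Note that the repair never touches the link count itself; it only moves the inode on or off the list, which matches the division of labour between this patch and the link count repair in the next one.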
* [PATCH 2/2] xfs: update the unlinked list when repairing link counts 2024-04-15 23:35 ` [PATCHSET v30.3 08/16] xfs: online repair of inode unlinked state Darrick J. Wong 2024-04-15 23:51 ` [PATCH 1/2] xfs: ensure unlinked list state is consistent with nlink during scrub Darrick J. Wong @ 2024-04-15 23:51 ` Darrick J. Wong 1 sibling, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:51 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When we're repairing the link counts of a file, we must ensure either that the file has zero link count and is on the unlinked list; or that it has nonzero link count and is not on the unlinked list. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/nlinks_repair.c | 42 +++++++++++++++++++++++++++++++++--------- 1 file changed, 33 insertions(+), 9 deletions(-) diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c index b87618322f55..58cacb8e94c1 100644 --- a/fs/xfs/scrub/nlinks_repair.c +++ b/fs/xfs/scrub/nlinks_repair.c @@ -17,6 +17,7 @@ #include "xfs_iwalk.h" #include "xfs_ialloc.h" #include "xfs_sb.h" +#include "xfs_ag.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/repair.h" @@ -36,6 +37,20 @@ * inode is locked. */ +/* Remove an inode from the unlinked list. */ +STATIC int +xrep_nlinks_iunlink_remove( + struct xfs_scrub *sc) +{ + struct xfs_perag *pag; + int error; + + pag = xfs_perag_get(sc->mp, XFS_INO_TO_AGNO(sc->mp, sc->ip->i_ino)); + error = xfs_iunlink_remove(sc->tp, pag, sc->ip); + xfs_perag_put(pag); + return error; +} + /* * Correct the link count of the given inode. Because we have to grab locks * and resources in a certain order, it's possible that this will be a no-op. @@ -99,16 +114,25 @@ xrep_nlinks_repair_inode( } /* - * We did not find any links to this inode. If the inode agrees, we - * have nothing further to do. 
If not, the inode has a nonzero link - * count and we don't have anywhere to graft the child onto. Dropping - * a live inode's link count to zero can cause unexpected shutdowns in - * inactivation, so leave it alone. + * If this inode is linked from the directory tree and on the unlinked + * list, remove it from the unlinked list. */ - if (total_links == 0) { - if (actual_nlink != 0) - trace_xrep_nlinks_unfixable_inode(mp, ip, &obs); - goto out_trans; + if (total_links > 0 && xfs_inode_on_unlinked_list(ip)) { + error = xrep_nlinks_iunlink_remove(sc); + if (error) + goto out_trans; + dirty = true; + } + + /* + * If this inode is not linked from the directory tree yet not on the + * unlinked list, put it on the unlinked list. + */ + if (total_links == 0 && !xfs_inode_on_unlinked_list(ip)) { + error = xfs_iunlink(sc->tp, ip); + if (error) + goto out_trans; + dirty = true; } /* Commit the new link count if it changed. */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCHSET v30.3 09/16] xfs: online repair of directories 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (7 preceding siblings ...) 2024-04-15 23:35 ` [PATCHSET v30.3 08/16] xfs: online repair of inode unlinked state Darrick J. Wong @ 2024-04-15 23:35 ` Darrick J. Wong 2024-04-15 23:51 ` [PATCH 1/5] xfs: inactivate directory data blocks Darrick J. Wong ` (4 more replies) 2024-04-15 23:36 ` [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found Darrick J. Wong ` (6 subsequent siblings) 15 siblings, 5 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:35 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs Hi all, This series employs atomic extent swapping to enable safe reconstruction of directory data. For now, XFS does not support reverse directory links (aka parent pointers), so we can only salvage the dirents of a directory and construct a new structure. Directory repair therefore consists of five main parts: First, we walk the existing directory to salvage as many entries as we can, by adding them as new directory entries to the repair temp dir. Second, we validate the parent pointer found in the directory. If one was not found, we scan the entire filesystem looking for a potential parent. Third, we use atomic extent swaps to exchange the entire data fork between the two directories. Fourth, we reap the old directory blocks as carefully as we can. To wrap up the directory repair code, we need to add to the regular filesystem the ability to free all the data fork blocks in a directory. This does not change anything with normal directories, since they must still unlink and shrink one entry at a time. However, this will facilitate freeing of partially-inactivated temporary directories during log recovery. The second half of this patchset implements repairs for the dotdot entries of directories. 
For now there is only rudimentary support for this, because there are no directory parent pointers, so the best we can do is scanning the filesystem and the VFS dcache for answers. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-dirs-6.10 --- Commits in this patchset: * xfs: inactivate directory data blocks * xfs: online repair of directories * xfs: scan the filesystem to repair a directory dotdot entry * xfs: online repair of parent pointers * xfs: ask the dentry cache if it knows the parent of a directory --- fs/xfs/Makefile | 3 fs/xfs/scrub/dir.c | 9 fs/xfs/scrub/dir_repair.c | 1402 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/findparent.c | 448 +++++++++++++ fs/xfs/scrub/findparent.h | 50 + fs/xfs/scrub/inode_repair.c | 5 fs/xfs/scrub/iscan.c | 18 + fs/xfs/scrub/iscan.h | 1 fs/xfs/scrub/nlinks.c | 23 + fs/xfs/scrub/nlinks_repair.c | 9 fs/xfs/scrub/parent.c | 14 fs/xfs/scrub/parent_repair.c | 234 +++++++ fs/xfs/scrub/readdir.c | 7 fs/xfs/scrub/repair.c | 1 fs/xfs/scrub/repair.h | 8 fs/xfs/scrub/scrub.c | 4 fs/xfs/scrub/tempfile.c | 13 fs/xfs/scrub/tempfile.h | 2 fs/xfs/scrub/trace.h | 115 +++ fs/xfs/scrub/xfblob.h | 24 + fs/xfs/xfs_inode.c | 51 ++ 21 files changed, 2437 insertions(+), 4 deletions(-) create mode 100644 fs/xfs/scrub/dir_repair.c create mode 100644 fs/xfs/scrub/findparent.c create mode 100644 fs/xfs/scrub/findparent.h create mode 100644 fs/xfs/scrub/parent_repair.c ^ permalink raw reply [flat|nested] 100+ messages in thread
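The first step described above, salvaging entries from the damaged directory, depends on a filter that decides which entries are worth keeping. Patch 2/5 implements this in `xrep_dir_want_salvage()` and `xrep_dir_salvage_entry()`; the following is a simplified userspace model of those checks, not the kernel code. The `model_*` names are hypothetical, the `max_ino` bound is a crude stand-in for `xfs_verify_dir_ino()`, the `xfs_dir2_namecheck()` call is left out, and `MODEL_MAXNAMELEN` assumes the XFS `MAXNAMELEN` value of 256.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MODEL_MAXNAMELEN	256	/* assumed to match XFS MAXNAMELEN */

/*
 * Decide whether a dirent is worth salvaging.  self_ino is the directory
 * being repaired; max_ino is a hypothetical upper bound used here instead
 * of the kernel's inode number verifier.
 */
static bool model_want_salvage(uint64_t self_ino, uint64_t ino,
		uint64_t max_ino, const char *name, int namelen)
{
	/* No pointers to ourselves or to garbage. */
	if (ino == self_ino)
		return false;
	if (ino == 0 || ino > max_ino)
		return false;

	/* No weird looking names or dot entries. */
	if (namelen >= MODEL_MAXNAMELEN || namelen <= 0)
		return false;
	if (namelen == 1 && name[0] == '.')
		return false;

	return true;
}

/*
 * Length of the name up to the first NUL or '/' byte.  A result of zero
 * means there is nothing left to salvage.
 */
static unsigned int model_salvage_namelen(const char *name,
		unsigned int namelen)
{
	unsigned int	i = 0;

	assert(name != NULL);

	while (i < namelen && name[i] != 0 && name[i] != '/')
		i++;
	return i;
}
```

In the patch, a `..` entry passes the filter but is then skipped during salvage because the new parent has already been chosen by that point; this model keeps the same behaviour by not rejecting it here.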
* [PATCH 1/5] xfs: inactivate directory data blocks 2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong @ 2024-04-15 23:51 ` Darrick J. Wong 2024-04-15 23:52 ` [PATCH 2/5] xfs: online repair of directories Darrick J. Wong ` (3 subsequent siblings) 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:51 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Teach inode inactivation to delete all the incore buffers backing a directory. In normal runtime this should never happen because the VFS forbids rmdir on a non-empty directory. In the next patch, online directory repair stands up a new directory, exchanges it with the broken directory, and then drops the private temporary directory. If we cancel the repair just prior to exchanging the directory contents, the new directory will need to be torn down. Note: If we commit the repair, reaping will take care of all the ondisk space allocations and incore buffers for the old corrupt directory. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_inode.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 51 insertions(+) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index b24c0e23d37d..09d643a9e997 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -16,6 +16,7 @@ #include "xfs_inode.h" #include "xfs_dir2.h" #include "xfs_attr.h" +#include "xfs_bit.h" #include "xfs_trans_space.h" #include "xfs_trans.h" #include "xfs_buf_item.h" @@ -1551,6 +1552,51 @@ xfs_release( return error; } +/* + * Mark all the buffers attached to this directory stale. In theory we should + * never be freeing a directory with any blocks at all, but this covers the + * case where we've recovered a directory swap with a "temporary" directory + * created by online repair and now need to dump it. 
+ */ +STATIC void +xfs_inactive_dir( + struct xfs_inode *dp) +{ + struct xfs_iext_cursor icur; + struct xfs_bmbt_irec got; + struct xfs_mount *mp = dp->i_mount; + struct xfs_da_geometry *geo = mp->m_dir_geo; + struct xfs_ifork *ifp = xfs_ifork_ptr(dp, XFS_DATA_FORK); + xfs_fileoff_t off; + + /* + * Invalidate each directory block. All directory blocks are of + * fsbcount length and alignment, so we only need to walk those same + * offsets. We hold the only reference to this inode, so we must wait + * for the buffer locks. + */ + for_each_xfs_iext(ifp, &icur, &got) { + for (off = round_up(got.br_startoff, geo->fsbcount); + off < got.br_startoff + got.br_blockcount; + off += geo->fsbcount) { + struct xfs_buf *bp = NULL; + xfs_fsblock_t fsbno; + int error; + + fsbno = (off - got.br_startoff) + got.br_startblock; + error = xfs_buf_incore(mp->m_ddev_targp, + XFS_FSB_TO_DADDR(mp, fsbno), + XFS_FSB_TO_BB(mp, geo->fsbcount), + XBF_LIVESCAN, &bp); + if (error) + continue; + + xfs_buf_stale(bp); + xfs_buf_relse(bp); + } + } +} + /* * xfs_inactive_truncate * @@ -1861,6 +1907,11 @@ xfs_inactive( goto out; } + if (S_ISDIR(VFS_I(ip)->i_mode) && ip->i_df.if_nextents > 0) { + xfs_inactive_dir(ip); + truncate = 1; + } + if (S_ISLNK(VFS_I(ip)->i_mode)) error = xfs_inactive_symlink(ip); else if (truncate) ^ permalink raw reply related [flat|nested] 100+ messages in thread
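The loop in `xfs_inactive_dir()` relies on every directory block being `fsbcount` filesystem blocks long and aligned to that size, so it only has to visit aligned offsets inside each data fork extent. The arithmetic can be tried out in userspace; this is a sketch under that assumption, with `model_extent`, `model_round_up`, and `model_dir_blocks` as hypothetical names and an output array standing in for the buffer lookup and `xfs_buf_stale()` call.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy version of an extent record; all fields are in filesystem blocks. */
struct model_extent {
	uint64_t	startoff;	/* offset within the file */
	uint64_t	startblock;	/* address on disk */
	uint64_t	blockcount;	/* length of the mapping */
};

/* Round x up to the next multiple of step. */
static uint64_t model_round_up(uint64_t x, uint64_t step)
{
	return ((x + step - 1) / step) * step;
}

/*
 * Record the disk address of each directory block that starts inside this
 * extent.  Returns the number of addresses written to out[], at most max.
 */
static size_t model_dir_blocks(const struct model_extent *ext,
		uint64_t fsbcount, uint64_t *out, size_t max)
{
	uint64_t	off;
	size_t		n = 0;

	assert(fsbcount > 0);

	for (off = model_round_up(ext->startoff, fsbcount);
	     off < ext->startoff + ext->blockcount && n < max;
	     off += fsbcount) {
		/* Translate the file offset into a disk address. */
		out[n++] = (off - ext->startoff) + ext->startblock;
	}

	return n;
}
```

For an extent mapping file offsets 2 through 11 to disk blocks starting at 100, with four-block directory blocks, the aligned offsets are 4 and 8, which translate to disk blocks 102 and 106.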
* [PATCH 2/5] xfs: online repair of directories 2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong 2024-04-15 23:51 ` [PATCH 1/5] xfs: inactivate directory data blocks Darrick J. Wong @ 2024-04-15 23:52 ` Darrick J. Wong 2024-04-15 23:52 ` [PATCH 3/5] xfs: scan the filesystem to repair a directory dotdot entry Darrick J. Wong ` (2 subsequent siblings) 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:52 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> If a directory looks like it's in bad shape, try to sift through the rubble to find whatever directory entries we can, scan the directory tree for the parent (if needed), stage the new directory contents in a temporary file and use the atomic extent swapping mechanism to commit the results in bulk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/dir.c | 9 fs/xfs/scrub/dir_repair.c | 1349 ++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/inode_repair.c | 5 fs/xfs/scrub/nlinks.c | 23 + fs/xfs/scrub/nlinks_repair.c | 9 fs/xfs/scrub/parent.c | 4 fs/xfs/scrub/readdir.c | 7 fs/xfs/scrub/repair.c | 1 fs/xfs/scrub/repair.h | 4 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/tempfile.c | 13 fs/xfs/scrub/tempfile.h | 2 fs/xfs/scrub/trace.h | 112 +++ fs/xfs/scrub/xfblob.h | 24 + 15 files changed, 1563 insertions(+), 2 deletions(-) create mode 100644 fs/xfs/scrub/dir_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 7dbe6b3befb3..5c9449e14f74 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -198,6 +198,7 @@ xfs-y += $(addprefix scrub/, \ attr_repair.o \ bmap_repair.o \ cow_repair.o \ + dir_repair.o \ fscounters_repair.o \ ialloc_repair.o \ inode_repair.o \ diff --git a/fs/xfs/scrub/dir.c b/fs/xfs/scrub/dir.c index 7bac74621af7..3fe6ffcf9c06 100644 --- a/fs/xfs/scrub/dir.c +++ 
b/fs/xfs/scrub/dir.c @@ -21,12 +21,21 @@ #include "scrub/dabtree.h" #include "scrub/readdir.h" #include "scrub/health.h" +#include "scrub/repair.h" /* Set us up to scrub directories. */ int xchk_setup_directory( struct xfs_scrub *sc) { + int error; + + if (xchk_could_repair(sc)) { + error = xrep_setup_directory(sc); + if (error) + return error; + } + return xchk_setup_inode_contents(sc, 0); } diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c new file mode 100644 index 000000000000..48aa80d8c7dc --- /dev/null +++ b/fs/xfs/scrub/dir_repair.c @@ -0,0 +1,1349 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_icache.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_dir2.h" +#include "xfs_dir2_priv.h" +#include "xfs_bmap.h" +#include "xfs_quota.h" +#include "xfs_bmap_btree.h" +#include "xfs_trans_space.h" +#include "xfs_bmap_util.h" +#include "xfs_exchmaps.h" +#include "xfs_exchrange.h" +#include "xfs_ag.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/tempfile.h" +#include "scrub/tempexch.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" +#include "scrub/xfblob.h" +#include "scrub/readdir.h" +#include "scrub/reap.h" + +/* + * Directory Repair + * ================ + * + * We repair directories by reading the directory data blocks looking for + * directory entries that look salvageable (name passes verifiers, entry points + * to a valid allocated inode, etc). 
Each entry worth salvaging is stashed in + * memory, and the stashed entries are periodically replayed into a temporary + * directory to constrain memory use. Batching the construction of the + * temporary directory in this fashion reduces lock cycling of the directory + * being repaired and the temporary directory, and will later become important + * for parent pointer scanning. + * + * Directory entries added to the temporary directory do not elevate the link + * counts of the inodes found. When salvaging completes, the remaining stashed + * entries are replayed to the temporary directory. An atomic mapping exchange + * is used to commit the new directory blocks to the directory being repaired. + * This will disrupt readdir cursors. + * + * Locking Issues + * -------------- + * + * If /a, /a/b, and /c are all directories, the VFS does not take i_rwsem on + * /a/b for a "mv /a/b /c/" operation. This means that only b's ILOCK protects + * b's dotdot update. This is in contrast to every other dotdot update (link, + * remove, mkdir). If the repair code drops the ILOCK, it must either + * revalidate the dotdot entry or use dirent hooks to capture updates from + * other threads. + */ + +/* Directory entry to be restored in the new directory. */ +struct xrep_dirent { + /* Cookie for retrieval of the dirent name. */ + xfblob_cookie name_cookie; + + /* Target inode number. */ + xfs_ino_t ino; + + /* Length of the dirent name. */ + uint8_t namelen; + + /* File type of the dirent. */ + uint8_t ftype; +}; + +/* + * Stash up to 8 pages of recovered dirent data in dir_entries and dir_names + * before we write them to the temp dir. + */ +#define XREP_DIR_MAX_STASH_BYTES (PAGE_SIZE * 8) + +struct xrep_dir { + struct xfs_scrub *sc; + + /* Fixed-size array of xrep_dirent structures. */ + struct xfarray *dir_entries; + + /* Blobs containing directory entry names. */ + struct xfblob *dir_names; + + /* Information for exchanging data forks at the end. 
*/ + struct xrep_tempexch tx; + + /* Preallocated args struct for performing dir operations */ + struct xfs_da_args args; + + /* + * This is the parent that we're going to set on the reconstructed + * directory. + */ + xfs_ino_t parent_ino; + + /* How many subdirectories did we find? */ + uint64_t subdirs; + + /* How many dirents did we find? */ + unsigned int dirents; + + /* Directory entry name, plus the trailing null. */ + struct xfs_name xname; + unsigned char namebuf[MAXNAMELEN]; +}; + +/* Tear down all the incore stuff we created. */ +static void +xrep_dir_teardown( + struct xfs_scrub *sc) +{ + struct xrep_dir *rd = sc->buf; + + xfblob_destroy(rd->dir_names); + xfarray_destroy(rd->dir_entries); +} + +/* Set up for a directory repair. */ +int +xrep_setup_directory( + struct xfs_scrub *sc) +{ + struct xrep_dir *rd; + int error; + + error = xrep_tempfile_create(sc, S_IFDIR); + if (error) + return error; + + rd = kvzalloc(sizeof(struct xrep_dir), XCHK_GFP_FLAGS); + if (!rd) + return -ENOMEM; + rd->sc = sc; + rd->xname.name = rd->namebuf; + sc->buf = rd; + + return 0; +} + +/* + * If we're the root of a directory tree, we are our own parent. If we're an + * unlinked directory, the parent /won't/ have a link to us. Set the parent + * directory to the root for both cases. Returns NULLFSINO if we don't know + * what to do. + */ +static inline xfs_ino_t +xrep_dir_self_parent( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + + if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino) + return sc->mp->m_sb.sb_rootino; + + if (VFS_I(sc->ip)->i_nlink == 0) + return sc->mp->m_sb.sb_rootino; + + return NULLFSINO; +} + +/* + * Look up the dotdot entry. Returns NULLFSINO if we don't know what to do. + * The next patch will check this more carefully. 
+ */ +static inline xfs_ino_t +xrep_dir_lookup_parent( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + xfs_ino_t ino; + int error; + + error = xfs_dir_lookup(sc->tp, sc->ip, &xfs_name_dotdot, &ino, NULL); + if (error) + return NULLFSINO; + if (!xfs_verify_dir_ino(sc->mp, ino)) + return NULLFSINO; + + return ino; +} + +/* + * Try to find the parent of the directory being repaired. + * + * NOTE: This function will someday be augmented by the directory parent repair + * code, which will know how to check the parent and scan the filesystem if + * we cannot find anything. Inode scans will have to be done before we start + * salvaging directory entries, so we do this now. + */ +STATIC int +xrep_dir_find_parent( + struct xrep_dir *rd) +{ + xfs_ino_t ino; + + ino = xrep_dir_self_parent(rd); + if (ino != NULLFSINO) { + rd->parent_ino = ino; + return 0; + } + + ino = xrep_dir_lookup_parent(rd); + if (ino != NULLFSINO) { + rd->parent_ino = ino; + return 0; + } + + /* NOTE: A future patch will deal with moving orphans. */ + return -EFSCORRUPTED; +} + +/* + * Decide if we want to salvage this entry. We don't bother with oversized + * names or the dot entry. + */ +STATIC int +xrep_dir_want_salvage( + struct xrep_dir *rd, + const char *name, + int namelen, + xfs_ino_t ino) +{ + struct xfs_mount *mp = rd->sc->mp; + + /* No pointers to ourselves or to garbage. */ + if (ino == rd->sc->ip->i_ino) + return false; + if (!xfs_verify_dir_ino(mp, ino)) + return false; + + /* No weird looking names or dot entries. */ + if (namelen >= MAXNAMELEN || namelen <= 0) + return false; + if (namelen == 1 && name[0] == '.') + return false; + if (!xfs_dir2_namecheck(name, namelen)) + return false; + + return true; +} + +/* + * Remember that we want to create a dirent in the tempdir. These stashed + * actions will be replayed later. 
+ */ +STATIC int +xrep_dir_stash_createname( + struct xrep_dir *rd, + const struct xfs_name *name, + xfs_ino_t ino) +{ + struct xrep_dirent dirent = { + .ino = ino, + .namelen = name->len, + .ftype = name->type, + }; + int error; + + trace_xrep_dir_stash_createname(rd->sc->tempip, name, ino); + + error = xfblob_storename(rd->dir_names, &dirent.name_cookie, name); + if (error) + return error; + + return xfarray_append(rd->dir_entries, &dirent); +} + +/* Allocate an in-core record to hold entries while we rebuild the dir data. */ +STATIC int +xrep_dir_salvage_entry( + struct xrep_dir *rd, + unsigned char *name, + unsigned int namelen, + xfs_ino_t ino) +{ + struct xfs_name xname = { + .name = name, + }; + struct xfs_scrub *sc = rd->sc; + struct xfs_inode *ip; + unsigned int i = 0; + int error = 0; + + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * Truncate the name to the first character that would trip namecheck. + * If we no longer have a name after that, ignore this entry. + */ + while (i < namelen && name[i] != 0 && name[i] != '/') + i++; + if (i == 0) + return 0; + xname.len = i; + + /* Ignore '..' entries; we already picked the new parent. */ + if (xname.len == 2 && name[0] == '.' && name[1] == '.') { + trace_xrep_dir_salvaged_parent(sc->ip, ino); + return 0; + } + + trace_xrep_dir_salvage_entry(sc->ip, &xname, ino); + + /* + * Compute the ftype or dump the entry if we can't. We don't lock the + * inode because inodes can't change type while we have a reference. + */ + error = xchk_iget(sc, ino, &ip); + if (error) + return 0; + + xname.type = xfs_mode_to_ftype(VFS_I(ip)->i_mode); + xchk_irele(sc, ip); + + return xrep_dir_stash_createname(rd, &xname, ino); +} + +/* Record a shortform directory entry for later reinsertion. 
*/ +STATIC int +xrep_dir_salvage_sf_entry( + struct xrep_dir *rd, + struct xfs_dir2_sf_hdr *sfp, + struct xfs_dir2_sf_entry *sfep) +{ + xfs_ino_t ino; + + ino = xfs_dir2_sf_get_ino(rd->sc->mp, sfp, sfep); + if (!xrep_dir_want_salvage(rd, sfep->name, sfep->namelen, ino)) + return 0; + + return xrep_dir_salvage_entry(rd, sfep->name, sfep->namelen, ino); +} + +/* Record a regular directory entry for later reinsertion. */ +STATIC int +xrep_dir_salvage_data_entry( + struct xrep_dir *rd, + struct xfs_dir2_data_entry *dep) +{ + xfs_ino_t ino; + + ino = be64_to_cpu(dep->inumber); + if (!xrep_dir_want_salvage(rd, dep->name, dep->namelen, ino)) + return 0; + + return xrep_dir_salvage_entry(rd, dep->name, dep->namelen, ino); +} + +/* Try to recover block/data format directory entries. */ +STATIC int +xrep_dir_recover_data( + struct xrep_dir *rd, + struct xfs_buf *bp) +{ + struct xfs_da_geometry *geo = rd->sc->mp->m_dir_geo; + unsigned int offset; + unsigned int end; + int error = 0; + + /* + * Loop over the data portion of the block. + * Each object is a real entry (dep) or an unused one (dup). + */ + offset = geo->data_entry_offset; + end = min_t(unsigned int, BBTOB(bp->b_length), + xfs_dir3_data_end_offset(geo, bp->b_addr)); + + while (offset < end) { + struct xfs_dir2_data_unused *dup = bp->b_addr + offset; + struct xfs_dir2_data_entry *dep = bp->b_addr + offset; + + if (xchk_should_terminate(rd->sc, &error)) + return error; + + /* Skip unused entries. */ + if (be16_to_cpu(dup->freetag) == XFS_DIR2_DATA_FREE_TAG) { + offset += be16_to_cpu(dup->length); + continue; + } + + /* Don't walk off the end of the block. */ + offset += xfs_dir2_data_entsize(rd->sc->mp, dep->namelen); + if (offset > end) + break; + + /* Ok, let's save this entry. */ + error = xrep_dir_salvage_data_entry(rd, dep); + if (error) + return error; + + } + + return 0; +} + +/* Try to recover shortform directory entries. 
*/ +STATIC int +xrep_dir_recover_sf( + struct xrep_dir *rd) +{ + struct xfs_dir2_sf_hdr *hdr; + struct xfs_dir2_sf_entry *sfep; + struct xfs_dir2_sf_entry *next; + struct xfs_ifork *ifp; + xfs_ino_t ino; + unsigned char *end; + int error = 0; + + ifp = xfs_ifork_ptr(rd->sc->ip, XFS_DATA_FORK); + hdr = ifp->if_data; + end = (unsigned char *)ifp->if_data + ifp->if_bytes; + + ino = xfs_dir2_sf_get_parent_ino(hdr); + trace_xrep_dir_salvaged_parent(rd->sc->ip, ino); + + sfep = xfs_dir2_sf_firstentry(hdr); + while ((unsigned char *)sfep < end) { + if (xchk_should_terminate(rd->sc, &error)) + return error; + + next = xfs_dir2_sf_nextentry(rd->sc->mp, hdr, sfep); + if ((unsigned char *)next > end) + break; + + /* Ok, let's save this entry. */ + error = xrep_dir_salvage_sf_entry(rd, hdr, sfep); + if (error) + return error; + + sfep = next; + } + + return 0; +} + +/* + * Try to figure out the format of this directory from the data fork mappings + * and the directory size. If we can be reasonably sure of format, we can be + * more aggressive in salvaging directory entries. On return, @magic_guess + * will be set to DIR3_BLOCK_MAGIC if we think this is a "block format" + * directory; DIR3_DATA_MAGIC if we think this is a "data format" directory, + * and 0 if we can't tell. + */ +STATIC void +xrep_dir_guess_format( + struct xrep_dir *rd, + __be32 *magic_guess) +{ + struct xfs_inode *dp = rd->sc->ip; + struct xfs_mount *mp = rd->sc->mp; + struct xfs_da_geometry *geo = mp->m_dir_geo; + xfs_fileoff_t last; + int error; + + ASSERT(xfs_has_crc(mp)); + + *magic_guess = 0; + + /* + * If there's a single directory block and the directory size is + * exactly one block, this has to be a single block format directory. 
+ */ + error = xfs_bmap_last_offset(dp, &last, XFS_DATA_FORK); + if (!error && XFS_FSB_TO_B(mp, last) == geo->blksize && + dp->i_disk_size == geo->blksize) { + *magic_guess = cpu_to_be32(XFS_DIR3_BLOCK_MAGIC); + return; + } + + /* + * If the last extent before the leaf offset matches the directory + * size and the directory size is larger than 1 block, this is a + * data format directory. + */ + last = geo->leafblk; + error = xfs_bmap_last_before(rd->sc->tp, dp, &last, XFS_DATA_FORK); + if (!error && + XFS_FSB_TO_B(mp, last) > geo->blksize && + XFS_FSB_TO_B(mp, last) == dp->i_disk_size) { + *magic_guess = cpu_to_be32(XFS_DIR3_DATA_MAGIC); + return; + } +} + +/* Recover directory entries from a specific directory block. */ +STATIC int +xrep_dir_recover_dirblock( + struct xrep_dir *rd, + __be32 magic_guess, + xfs_dablk_t dabno) +{ + struct xfs_dir2_data_hdr *hdr; + struct xfs_buf *bp; + __be32 oldmagic; + int error; + + /* + * Try to read buffer. We invalidate them in the next step so we don't + * bother to set a buffer type or ops. + */ + error = xfs_da_read_buf(rd->sc->tp, rd->sc->ip, dabno, + XFS_DABUF_MAP_HOLE_OK, &bp, XFS_DATA_FORK, NULL); + if (error || !bp) + return error; + + hdr = bp->b_addr; + oldmagic = hdr->magic; + + trace_xrep_dir_recover_dirblock(rd->sc->ip, dabno, + be32_to_cpu(hdr->magic), be32_to_cpu(magic_guess)); + + /* + * If we're sure of the block's format, proceed with the salvage + * operation using the specified magic number. + */ + if (magic_guess) { + hdr->magic = magic_guess; + goto recover; + } + + /* + * If we couldn't guess what type of directory this is, then we will + * only salvage entries from directory blocks that match the magic + * number and pass verifiers. 
+ */ + switch (hdr->magic) { + case cpu_to_be32(XFS_DIR2_BLOCK_MAGIC): + case cpu_to_be32(XFS_DIR3_BLOCK_MAGIC): + if (!xrep_buf_verify_struct(bp, &xfs_dir3_block_buf_ops)) + goto out; + if (xfs_dir3_block_header_check(bp, rd->sc->ip->i_ino) != NULL) + goto out; + break; + case cpu_to_be32(XFS_DIR2_DATA_MAGIC): + case cpu_to_be32(XFS_DIR3_DATA_MAGIC): + if (!xrep_buf_verify_struct(bp, &xfs_dir3_data_buf_ops)) + goto out; + if (xfs_dir3_data_header_check(bp, rd->sc->ip->i_ino) != NULL) + goto out; + break; + default: + goto out; + } + +recover: + error = xrep_dir_recover_data(rd, bp); + +out: + hdr->magic = oldmagic; + xfs_trans_brelse(rd->sc->tp, bp); + return error; +} + +static inline void +xrep_dir_init_args( + struct xrep_dir *rd, + struct xfs_inode *dp, + const struct xfs_name *name) +{ + memset(&rd->args, 0, sizeof(struct xfs_da_args)); + rd->args.geo = rd->sc->mp->m_dir_geo; + rd->args.whichfork = XFS_DATA_FORK; + rd->args.owner = rd->sc->ip->i_ino; + rd->args.trans = rd->sc->tp; + rd->args.dp = dp; + if (!name) + return; + rd->args.name = name->name; + rd->args.namelen = name->len; + rd->args.filetype = name->type; + rd->args.hashval = xfs_dir2_hashname(rd->sc->mp, name); +} + +/* Replay a stashed createname into the temporary directory. 
*/ +STATIC int +xrep_dir_replay_createname( + struct xrep_dir *rd, + const struct xfs_name *name, + xfs_ino_t inum, + xfs_extlen_t total) +{ + struct xfs_scrub *sc = rd->sc; + struct xfs_inode *dp = rd->sc->tempip; + bool is_block, is_leaf; + int error; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + + error = xfs_dir_ino_validate(sc->mp, inum); + if (error) + return error; + + trace_xrep_dir_replay_createname(dp, name, inum); + + xrep_dir_init_args(rd, dp, name); + rd->args.inumber = inum; + rd->args.total = total; + rd->args.op_flags = XFS_DA_OP_ADDNAME | XFS_DA_OP_OKNOENT; + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) + return xfs_dir2_sf_addname(&rd->args); + + error = xfs_dir2_isblock(&rd->args, &is_block); + if (error) + return error; + if (is_block) + return xfs_dir2_block_addname(&rd->args); + + error = xfs_dir2_isleaf(&rd->args, &is_leaf); + if (error) + return error; + if (is_leaf) + return xfs_dir2_leaf_addname(&rd->args); + + return xfs_dir2_node_addname(&rd->args); +} + +/* + * Add this stashed incore directory entry to the temporary directory. + * The caller must hold the tempdir's IOLOCK, must not hold any ILOCKs, and + * must not be in transaction context. + */ +STATIC int +xrep_dir_replay_update( + struct xrep_dir *rd, + const struct xfs_name *xname, + const struct xrep_dirent *dirent) +{ + struct xfs_mount *mp = rd->sc->mp; +#ifdef DEBUG + xfs_ino_t ino; +#endif + uint resblks; + int error; + + resblks = XFS_LINK_SPACE_RES(mp, xname->len); + error = xchk_trans_alloc(rd->sc, resblks); + if (error) + return error; + + /* Lock the temporary directory and join it to the transaction */ + xrep_tempfile_ilock(rd->sc); + xfs_trans_ijoin(rd->sc->tp, rd->sc->tempip, 0); + + /* + * Create a replacement dirent in the temporary directory. Note that + * _createname doesn't check for existing entries. There shouldn't be + * any in the temporary dir, but we'll verify this in debug mode. 
+ */ +#ifdef DEBUG + error = xchk_dir_lookup(rd->sc, rd->sc->tempip, xname, &ino); + if (error != -ENOENT) { + ASSERT(error != -ENOENT); + goto out_cancel; + } +#endif + + error = xrep_dir_replay_createname(rd, xname, dirent->ino, resblks); + if (error) + goto out_cancel; + + if (xname->type == XFS_DIR3_FT_DIR) + rd->subdirs++; + rd->dirents++; + + /* Commit and unlock. */ + error = xrep_trans_commit(rd->sc); + if (error) + return error; + + xrep_tempfile_iunlock(rd->sc); + return 0; +out_cancel: + xchk_trans_cancel(rd->sc); + xrep_tempfile_iunlock(rd->sc); + return error; +} + +/* + * Flush stashed incore dirent updates that have been recorded by the scanner. + * This is done to reduce the memory requirements of the directory rebuild, + * since directories can contain up to 32GB of directory data. + * + * Caller must not hold transactions or ILOCKs. Caller must hold the tempdir + * IOLOCK. + */ +STATIC int +xrep_dir_replay_updates( + struct xrep_dir *rd) +{ + xfarray_idx_t array_cur; + int error; + + /* Add all the salvaged dirents to the temporary directory. */ + foreach_xfarray_idx(rd->dir_entries, array_cur) { + struct xrep_dirent dirent; + + error = xfarray_load(rd->dir_entries, array_cur, &dirent); + if (error) + return error; + + error = xfblob_loadname(rd->dir_names, dirent.name_cookie, + &rd->xname, dirent.namelen); + if (error) + return error; + rd->xname.type = dirent.ftype; + + error = xrep_dir_replay_update(rd, &rd->xname, &dirent); + if (error) + return error; + } + + /* Empty out both arrays now that we've added the entries. */ + xfarray_truncate(rd->dir_entries); + xfblob_truncate(rd->dir_names); + return 0; +} + +/* + * Periodically flush stashed directory entries to the temporary dir. This + * is done to reduce the memory requirements of the directory rebuild, since + * directories can contain up to 32GB of directory data. 
+ */ +STATIC int +xrep_dir_flush_stashed( + struct xrep_dir *rd) +{ + int error; + + /* + * Entering this function, the scrub context has a reference to the + * inode being repaired, the temporary file, and a scrub transaction + * that we use during dirent salvaging to avoid livelocking if there + * are cycles in the directory structures. We hold ILOCK_EXCL on both + * the inode being repaired and the temporary file, though they are + * not ijoined to the scrub transaction. + * + * To constrain kernel memory use, we occasionally write salvaged + * dirents from the xfarray and xfblob structures into the temporary + * directory in preparation for exchanging the directory structures at + * the end. Updating the temporary file requires a transaction, so we + * commit the scrub transaction and drop the two ILOCKs so that + * we can allocate whatever transaction we want. + * + * We still hold IOLOCK_EXCL on the inode being repaired, which + * prevents anyone from accessing the damaged directory data while we + * repair it. + */ + error = xrep_trans_commit(rd->sc); + if (error) + return error; + xchk_iunlock(rd->sc, XFS_ILOCK_EXCL); + + /* + * Take the IOLOCK of the temporary file while we modify dirents. This + * isn't strictly required because the temporary file is never revealed + * to userspace, but we follow the same locking rules. We still hold + * sc->ip's IOLOCK. + */ + error = xrep_tempfile_iolock_polled(rd->sc); + if (error) + return error; + + /* Write to the tempdir all the updates that we've stashed. */ + error = xrep_dir_replay_updates(rd); + xrep_tempfile_iounlock(rd->sc); + if (error) + return error; + + /* + * Recreate the salvage transaction and relock the dir we're salvaging. + */ + error = xchk_trans_alloc(rd->sc, 0); + if (error) + return error; + xchk_ilock(rd->sc, XFS_ILOCK_EXCL); + return 0; +} + +/* Decide if we've stashed too much dirent data in memory. 
*/ +static inline bool +xrep_dir_want_flush_stashed( + struct xrep_dir *rd) +{ + unsigned long long bytes; + + bytes = xfarray_bytes(rd->dir_entries) + xfblob_bytes(rd->dir_names); + return bytes > XREP_DIR_MAX_STASH_BYTES; +} + +/* Extract as many directory entries as we can. */ +STATIC int +xrep_dir_recover( + struct xrep_dir *rd) +{ + struct xfs_bmbt_irec got; + struct xfs_scrub *sc = rd->sc; + struct xfs_da_geometry *geo = sc->mp->m_dir_geo; + xfs_fileoff_t offset; + xfs_dablk_t dabno; + __be32 magic_guess; + int nmap; + int error; + + xrep_dir_guess_format(rd, &magic_guess); + + /* Iterate each directory data block in the data fork. */ + for (offset = 0; + offset < geo->leafblk; + offset = got.br_startoff + got.br_blockcount) { + nmap = 1; + error = xfs_bmapi_read(sc->ip, offset, geo->leafblk - offset, + &got, &nmap, 0); + if (error) + return error; + if (nmap != 1) + return -EFSCORRUPTED; + if (!xfs_bmap_is_written_extent(&got)) + continue; + + for (dabno = round_up(got.br_startoff, geo->fsbcount); + dabno < got.br_startoff + got.br_blockcount; + dabno += geo->fsbcount) { + if (xchk_should_terminate(rd->sc, &error)) + return error; + + error = xrep_dir_recover_dirblock(rd, + magic_guess, dabno); + if (error) + return error; + + /* Flush dirents to constrain memory usage. */ + if (xrep_dir_want_flush_stashed(rd)) { + error = xrep_dir_flush_stashed(rd); + if (error) + return error; + } + } + } + + return 0; +} + +/* + * Find all the directory entries for this inode by scraping them out of the + * directory leaf blocks by hand, and flushing them into the temp dir. + */ +STATIC int +xrep_dir_find_entries( + struct xrep_dir *rd) +{ + struct xfs_inode *dp = rd->sc->ip; + int error; + + /* + * Salvage directory entries from the old directory, and write them to + * the temporary directory. 
+ */ + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) { + error = xrep_dir_recover_sf(rd); + } else { + error = xfs_iread_extents(rd->sc->tp, dp, XFS_DATA_FORK); + if (error) + return error; + + error = xrep_dir_recover(rd); + } + if (error) + return error; + + return xrep_dir_flush_stashed(rd); +} + +/* Scan all files in the filesystem for dirents. */ +STATIC int +xrep_dir_salvage_entries( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + int error; + + /* + * Drop the ILOCK on this directory so that we can scan for this + * directory's parent. Figure out who is going to be the parent of + * this directory, then retake the ILOCK so that we can salvage + * directory entries. + */ + xchk_iunlock(sc, XFS_ILOCK_EXCL); + error = xrep_dir_find_parent(rd); + xchk_ilock(sc, XFS_ILOCK_EXCL); + if (error) + return error; + + /* + * Collect directory entries by parsing raw leaf blocks to salvage + * whatever we can. When we're done, free the staging memory before + * exchanging the directories to reduce memory usage. + */ + error = xrep_dir_find_entries(rd); + if (error) + return error; + + /* + * Commit the repair transaction and drop the ILOCK so that we can + * (later) use the atomic mapping exchange functions to compute the + * correct block reservations and re-lock the inodes. + * + * We still hold IOLOCK_EXCL (aka i_rwsem) which will prevent directory + * modifications, but there's nothing to prevent userspace from reading + * the directory until we're ready for the exchange operation. Reads + * will return -EIO without shutting down the fs, so we're ok with + * that. + */ + error = xrep_trans_commit(sc); + if (error) + return error; + + xchk_iunlock(sc, XFS_ILOCK_EXCL); + return 0; +} + + +/* + * Free all the directory blocks and reset the data fork. The caller must + * join the inode to the transaction. This function returns with the inode + * joined to a clean scrub transaction.
+ */ +STATIC int +xrep_dir_reset_fork( + struct xrep_dir *rd, + xfs_ino_t parent_ino) +{ + struct xfs_scrub *sc = rd->sc; + struct xfs_ifork *ifp = xfs_ifork_ptr(sc->tempip, XFS_DATA_FORK); + int error; + + /* Unmap all the directory buffers. */ + if (xfs_ifork_has_extents(ifp)) { + error = xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK); + if (error) + return error; + } + + trace_xrep_dir_reset_fork(sc->tempip, parent_ino); + + /* Reset the data fork to an empty data fork. */ + xfs_idestroy_fork(ifp); + ifp->if_bytes = 0; + sc->tempip->i_disk_size = 0; + + /* Reinitialize the short form directory. */ + xrep_dir_init_args(rd, sc->tempip, NULL); + return xfs_dir2_sf_create(&rd->args, parent_ino); +} + +/* + * Prepare both inodes' directory forks for exchanging mappings. Promote the + * tempfile from short format to leaf format, and if the file being repaired + * has a short format data fork, turn it into an empty extent list. + */ +STATIC int +xrep_dir_swap_prep( + struct xfs_scrub *sc, + bool temp_local, + bool ip_local) +{ + int error; + + /* + * If the tempfile's directory is in shortform format, convert that to + * a single leaf extent so that we can use the atomic mapping exchange. + */ + if (temp_local) { + struct xfs_da_args args = { + .dp = sc->tempip, + .geo = sc->mp->m_dir_geo, + .whichfork = XFS_DATA_FORK, + .trans = sc->tp, + .total = 1, + .owner = sc->ip->i_ino, + }; + + error = xfs_dir2_sf_to_block(&args); + if (error) + return error; + + /* + * Roll the deferred log items to get us back to a clean + * transaction. + */ + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + } + + /* + * If the file being repaired had a shortform data fork, convert that + * to an empty extent list in preparation for the atomic mapping + * exchange. 
+ */ + if (ip_local) { + struct xfs_ifork *ifp; + + ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); + xfs_idestroy_fork(ifp); + ifp->if_format = XFS_DINODE_FMT_EXTENTS; + ifp->if_nextents = 0; + ifp->if_bytes = 0; + ifp->if_data = NULL; + ifp->if_height = 0; + + xfs_trans_log_inode(sc->tp, sc->ip, + XFS_ILOG_CORE | XFS_ILOG_DDATA); + } + + return 0; +} + +/* + * Replace the inode number of a directory entry. + */ +static int +xrep_dir_replace( + struct xrep_dir *rd, + struct xfs_inode *dp, + const struct xfs_name *name, + xfs_ino_t inum, + xfs_extlen_t total) +{ + struct xfs_scrub *sc = rd->sc; + bool is_block, is_leaf; + int error; + + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); + + error = xfs_dir_ino_validate(sc->mp, inum); + if (error) + return error; + + xrep_dir_init_args(rd, dp, name); + rd->args.inumber = inum; + rd->args.total = total; + + if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL) + return xfs_dir2_sf_replace(&rd->args); + + error = xfs_dir2_isblock(&rd->args, &is_block); + if (error) + return error; + if (is_block) + return xfs_dir2_block_replace(&rd->args); + + error = xfs_dir2_isleaf(&rd->args, &is_leaf); + if (error) + return error; + if (is_leaf) + return xfs_dir2_leaf_replace(&rd->args); + + return xfs_dir2_node_replace(&rd->args); +} + +/* + * Reset the link count of this directory and adjust the unlinked list pointers + * as needed. + */ +STATIC int +xrep_dir_set_nlink( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + struct xfs_inode *dp = sc->ip; + struct xfs_perag *pag; + unsigned int new_nlink = rd->subdirs + 2; + int error; + + /* + * The directory is not on the incore unlinked list, which means that + * it needs to be reachable via the directory tree. Update the nlink + * with our observed link count. + * + * XXX: A subsequent patch will handle parentless directories by moving + * them to the lost and found instead of aborting the repair. 
+ */ + if (!xfs_inode_on_unlinked_list(dp)) + goto reset_nlink; + + /* + * The directory is on the unlinked list and we did not find any + * dirents. Set the link count to zero and let the directory + * inactivate when the last reference drops. + */ + if (rd->dirents == 0) { + new_nlink = 0; + goto reset_nlink; + } + + /* + * The directory is on the unlinked list and we found dirents. This + * directory needs to be reachable via the directory tree. Remove the + * dir from the unlinked list and update nlink with the observed link + * count. + */ + pag = xfs_perag_get(sc->mp, XFS_INO_TO_AGNO(sc->mp, dp->i_ino)); + if (!pag) { + ASSERT(0); + return -EFSCORRUPTED; + } + + error = xfs_iunlink_remove(sc->tp, pag, dp); + xfs_perag_put(pag); + if (error) + return error; + +reset_nlink: + if (VFS_I(dp)->i_nlink != new_nlink) + set_nlink(VFS_I(dp), new_nlink); + return 0; +} + +/* Exchange the temporary directory's data fork with the one being repaired. */ +STATIC int +xrep_dir_swap( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + bool ip_local, temp_local; + int error = 0; + + /* + * If we found enough subdirs to overflow this directory's link count, + * bail out to userspace before we modify anything. + */ + if (rd->subdirs + 2 > XFS_MAXLINK) + return -EFSCORRUPTED; + + /* + * Reset the temporary directory's '..' entry to point to the parent + * that we found. The temporary directory was created with the root + * directory as the parent, so we can skip this if repairing a + * subdirectory of the root. + * + * It's also possible that this replacement could expand a sf + * tempdir into block format. + */ + if (rd->parent_ino != sc->mp->m_rootip->i_ino) { + error = xrep_dir_replace(rd, rd->sc->tempip, &xfs_name_dotdot, + rd->parent_ino, rd->tx.req.resblks); + if (error) + return error; + } + + /* + * Changing the dot and dotdot entries could have changed the shape of + * the directory, so we recompute these.
+ */ + ip_local = sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL; + temp_local = sc->tempip->i_df.if_format == XFS_DINODE_FMT_LOCAL; + + /* + * If both files have a local format data fork and the rebuilt + * directory data would fit in the repaired file's data fork, copy + * the contents from the tempfile and update the directory link count. + * We're done now. + */ + if (ip_local && temp_local && + sc->tempip->i_disk_size <= xfs_inode_data_fork_size(sc->ip)) { + xrep_tempfile_copyout_local(sc, XFS_DATA_FORK); + return xrep_dir_set_nlink(rd); + } + + /* + * Clean the transaction before we start working on exchanging + * directory contents. + */ + error = xrep_tempfile_roll_trans(rd->sc); + if (error) + return error; + + /* Otherwise, make sure both data forks are in block-mapping mode. */ + error = xrep_dir_swap_prep(sc, temp_local, ip_local); + if (error) + return error; + + /* + * Set nlink of the directory in the same transaction sequence that + * (atomically) commits the new directory data. + */ + error = xrep_dir_set_nlink(rd); + if (error) + return error; + + return xrep_tempexch_contents(sc, &rd->tx); +} + +/* + * Exchange the new directory contents (which we created in the tempfile) with + * the directory being repaired. + */ +STATIC int +xrep_dir_rebuild_tree( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + int error; + + trace_xrep_dir_rebuild_tree(sc->ip, rd->parent_ino); + + /* + * Take the IOLOCK on the temporary file so that we can run dir + * operations with the same locks held as we would for a normal file. + * We still hold sc->ip's IOLOCK. + */ + error = xrep_tempfile_iolock_polled(rd->sc); + if (error) + return error; + + /* Allocate transaction and ILOCK the scrub file and the temp file. */ + error = xrep_tempexch_trans_alloc(sc, XFS_DATA_FORK, &rd->tx); + if (error) + return error; + + /* + * Exchange the tempdir's data fork with the file being repaired.
This + * recreates the transaction and re-takes the ILOCK in the scrub + * context. + */ + error = xrep_dir_swap(rd); + if (error) + return error; + + /* + * Release the old directory blocks and reset the data fork of the temp + * directory to an empty shortform directory because inactivation does + * nothing for directories. + */ + error = xrep_dir_reset_fork(rd, sc->mp->m_rootip->i_ino); + if (error) + return error; + + /* + * Roll to get a transaction without any inodes joined to it. Then we + * can drop the tempfile's ILOCK and IOLOCK before doing more work on + * the scrub target directory. + */ + error = xfs_trans_roll(&sc->tp); + if (error) + return error; + + xrep_tempfile_iunlock(sc); + xrep_tempfile_iounlock(sc); + return 0; +} + +/* Set up the filesystem scan so we can regenerate directory entries. */ +STATIC int +xrep_dir_setup_scan( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + char *descr; + int error; + + rd->parent_ino = NULLFSINO; + + /* Set up some staging memory for salvaging dirents. */ + descr = xchk_xfile_ino_descr(sc, "directory entries"); + error = xfarray_create(descr, 0, sizeof(struct xrep_dirent), + &rd->dir_entries); + kfree(descr); + if (error) + return error; + + descr = xchk_xfile_ino_descr(sc, "directory entry names"); + error = xfblob_create(descr, &rd->dir_names); + kfree(descr); + if (error) + goto out_xfarray; + + return 0; + +out_xfarray: + xfarray_destroy(rd->dir_entries); + rd->dir_entries = NULL; + return error; +} + +/* + * Repair the directory metadata. + * + * XXX: Directory entry buffers can be multiple fsblocks in size. The buffer + * cache in XFS can't handle aliased multiblock buffers, so this might + * misbehave if the directory blocks are crosslinked with other filesystem + * metadata. + * + * XXX: Is it necessary to check the dcache for this directory to make sure + * that we always recreate every cached entry? 
+ */ +int +xrep_directory( + struct xfs_scrub *sc) +{ + struct xrep_dir *rd = sc->buf; + int error; + + /* The rmapbt is required to reap the old data fork. */ + if (!xfs_has_rmapbt(sc->mp)) + return -EOPNOTSUPP; + + error = xrep_dir_setup_scan(rd); + if (error) + return error; + + error = xrep_dir_salvage_entries(rd); + if (error) + goto out_teardown; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + goto out_teardown; + + error = xrep_dir_rebuild_tree(rd); + if (error) + goto out_teardown; + +out_teardown: + xrep_dir_teardown(sc); + return error; +} diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index c743772a523e..0dde5df2f8d3 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -46,6 +46,7 @@ #include "scrub/repair.h" #include "scrub/iscan.h" #include "scrub/readdir.h" +#include "scrub/tempfile.h" /* * Inode Record Repair @@ -340,6 +341,10 @@ xrep_dinode_findmode_walk_directory( unsigned int lock_mode; int error = 0; + /* Ignore temporary repair directories. */ + if (xrep_is_tempfile(dp)) + return 0; + /* * Scan the directory to see if there it contains an entry pointing to * the directory that we are repairing. diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c index 8a7d9557897c..8b9aa73093d6 100644 --- a/fs/xfs/scrub/nlinks.c +++ b/fs/xfs/scrub/nlinks.c @@ -27,6 +27,7 @@ #include "scrub/nlinks.h" #include "scrub/trace.h" #include "scrub/readdir.h" +#include "scrub/tempfile.h" /* * Live Inode Link Count Checking @@ -152,6 +153,13 @@ xchk_nlinks_live_update( xnc = container_of(nb, struct xchk_nlink_ctrs, dhook.dirent_hook.nb); + /* + * Ignore temporary directories being used to stage dir repairs, since + * we don't bump the link counts of the children. 
+ */ + if (xrep_is_tempfile(p->dp)) + return NOTIFY_DONE; + trace_xchk_nlinks_live_update(xnc->sc->mp, p->dp, action, p->ip->i_ino, p->delta, p->name->name, p->name->len); @@ -303,6 +311,13 @@ xchk_nlinks_collect_dir( unsigned int lock_mode; int error = 0; + /* + * Ignore temporary directories being used to stage dir repairs, since + * we don't bump the link counts of the children. + */ + if (xrep_is_tempfile(dp)) + return 0; + /* Prevent anyone from changing this directory while we walk it. */ xfs_ilock(dp, XFS_IOLOCK_SHARED); lock_mode = xfs_ilock_data_map_shared(dp); @@ -537,6 +552,14 @@ xchk_nlinks_compare_inode( unsigned int actual_nlink; int error; + /* + * Ignore temporary files being used to stage repairs, since we assume + * they're correct for non-directories, and the directory repair code + * doesn't bump the link counts for the children. + */ + if (xrep_is_tempfile(ip)) + return 0; + xfs_ilock(ip, XFS_ILOCK_SHARED); mutex_lock(&xnc->lock); diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c index 58cacb8e94c1..23eb08c4b5ad 100644 --- a/fs/xfs/scrub/nlinks_repair.c +++ b/fs/xfs/scrub/nlinks_repair.c @@ -26,6 +26,7 @@ #include "scrub/iscan.h" #include "scrub/nlinks.h" #include "scrub/trace.h" +#include "scrub/tempfile.h" /* * Live Inode Link Count Repair @@ -68,6 +69,14 @@ xrep_nlinks_repair_inode( bool dirty = false; int error; + /* + * Ignore temporary files being used to stage repairs, since we assume + * they're correct for non-directories, and the directory repair code + * doesn't bump the link counts for the children. 
+ */ + if (xrep_is_tempfile(ip)) + return 0; + xchk_ilock(sc, XFS_IOLOCK_EXCL); error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &sc->tp); diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index 5da10ed1fe8c..050a8e8914f6 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -17,6 +17,7 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/readdir.h" +#include "scrub/tempfile.h" /* Set us up to scrub parents. */ int @@ -143,7 +144,8 @@ xchk_parent_validate( } if (!xchk_fblock_xref_process_error(sc, XFS_DATA_FORK, 0, &error)) return error; - if (dp == sc->ip || dp == sc->tempip || !S_ISDIR(VFS_I(dp)->i_mode)) { + if (dp == sc->ip || xrep_is_tempfile(dp) || + !S_ISDIR(VFS_I(dp)->i_mode)) { xchk_fblock_set_corrupt(sc, XFS_DATA_FORK, 0); goto out_rele; } diff --git a/fs/xfs/scrub/readdir.c b/fs/xfs/scrub/readdir.c index e94080469315..028690761c62 100644 --- a/fs/xfs/scrub/readdir.c +++ b/fs/xfs/scrub/readdir.c @@ -333,6 +333,13 @@ xchk_dir_lookup( if (xfs_is_shutdown(dp->i_mount)) return -EIO; + /* + * A temporary directory's block headers are written with the owner + * set to sc->ip, so we must switch the owner here for the lookup. 
+ */ + if (dp == sc->tempip) + args.owner = sc->ip->i_ino; + ASSERT(S_ISDIR(VFS_I(dp)->i_mode)); xfs_assert_ilocked(dp, XFS_ILOCK_SHARED | XFS_ILOCK_EXCL); diff --git a/fs/xfs/scrub/repair.c b/fs/xfs/scrub/repair.c index 04aec0e9e4c3..369f0430e4ba 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -35,6 +35,7 @@ #include "xfs_da_format.h" #include "xfs_da_btree.h" #include "xfs_attr.h" +#include "xfs_dir2.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 9cbfd8da5620..4e25aa95753a 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -91,6 +91,7 @@ int xrep_metadata_inode_forks(struct xfs_scrub *sc); int xrep_setup_ag_rmapbt(struct xfs_scrub *sc); int xrep_setup_ag_refcountbt(struct xfs_scrub *sc); int xrep_setup_xattr(struct xfs_scrub *sc); +int xrep_setup_directory(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -125,6 +126,7 @@ int xrep_bmap_cow(struct xfs_scrub *sc); int xrep_nlinks(struct xfs_scrub *sc); int xrep_fscounters(struct xfs_scrub *sc); int xrep_xattr(struct xfs_scrub *sc); +int xrep_directory(struct xfs_scrub *sc); #ifdef CONFIG_XFS_RT int xrep_rtbitmap(struct xfs_scrub *sc); @@ -195,6 +197,7 @@ xrep_setup_nothing( #define xrep_setup_ag_rmapbt xrep_setup_nothing #define xrep_setup_ag_refcountbt xrep_setup_nothing #define xrep_setup_xattr xrep_setup_nothing +#define xrep_setup_directory xrep_setup_nothing #define xrep_setup_inode(sc, imap) ((void)0) @@ -221,6 +224,7 @@ xrep_setup_nothing( #define xrep_fscounters xrep_notsupported #define xrep_rtsummary xrep_notsupported #define xrep_xattr xrep_notsupported +#define xrep_directory xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 547189a14b6b..8e9e2bf121c2 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -325,7 +325,7 @@ static const struct 
xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_directory, .scrub = xchk_directory, - .repair = xrep_notsupported, + .repair = xrep_directory, }, [XFS_SCRUB_TYPE_XATTR] = { /* extended attributes */ .type = ST_INODE, diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index 0b3060be938f..4ca86a6a5be1 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -841,3 +841,16 @@ xrep_tempfile_copyout_local( ilog_flags |= xfs_ilog_fdata(whichfork); xfs_trans_log_inode(sc->tp, sc->ip, ilog_flags); } + +/* Decide if a given XFS inode is a temporary file for a repair. */ +bool +xrep_is_tempfile( + const struct xfs_inode *ip) +{ + const struct inode *inode = &ip->i_vnode; + + if (IS_PRIVATE(inode) && !(inode->i_opflags & IOP_XATTR)) + return true; + + return false; +} diff --git a/fs/xfs/scrub/tempfile.h b/fs/xfs/scrub/tempfile.h index d57e4f145a7c..e51399f595fe 100644 --- a/fs/xfs/scrub/tempfile.h +++ b/fs/xfs/scrub/tempfile.h @@ -35,11 +35,13 @@ int xrep_tempfile_set_isize(struct xfs_scrub *sc, unsigned long long isize); int xrep_tempfile_roll_trans(struct xfs_scrub *sc); void xrep_tempfile_copyout_local(struct xfs_scrub *sc, int whichfork); +bool xrep_is_tempfile(const struct xfs_inode *ip); #else static inline void xrep_tempfile_iolock_both(struct xfs_scrub *sc) { xchk_ilock(sc, XFS_IOLOCK_EXCL); } +# define xrep_is_tempfile(ip) (false) # define xrep_tempfile_rele(sc) #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index ffaff7722bf2..d6d9e8d6109c 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2500,6 +2500,118 @@ DEFINE_EVENT(xrep_xattr_class, name, \ DEFINE_XREP_XATTR_EVENT(xrep_xattr_rebuild_tree); DEFINE_XREP_XATTR_EVENT(xrep_xattr_reset_fork); +TRACE_EVENT(xrep_dir_recover_dirblock, + TP_PROTO(struct xfs_inode *dp, xfs_dablk_t dabno, uint32_t magic, + uint32_t magic_guess), + TP_ARGS(dp, dabno, magic, magic_guess), + TP_STRUCT__entry( + __field(dev_t, 
dev) + __field(xfs_ino_t, dir_ino) + __field(xfs_dablk_t, dabno) + __field(uint32_t, magic) + __field(uint32_t, magic_guess) + ), + TP_fast_assign( + __entry->dev = dp->i_mount->m_super->s_dev; + __entry->dir_ino = dp->i_ino; + __entry->dabno = dabno; + __entry->magic = magic; + __entry->magic_guess = magic_guess; + ), + TP_printk("dev %d:%d dir 0x%llx dablk 0x%x magic 0x%x magic_guess 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->dir_ino, + __entry->dabno, + __entry->magic, + __entry->magic_guess) +); + +DECLARE_EVENT_CLASS(xrep_dir_class, + TP_PROTO(struct xfs_inode *dp, xfs_ino_t parent_ino), + TP_ARGS(dp, parent_ino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, dir_ino) + __field(xfs_ino_t, parent_ino) + ), + TP_fast_assign( + __entry->dev = dp->i_mount->m_super->s_dev; + __entry->dir_ino = dp->i_ino; + __entry->parent_ino = parent_ino; + ), + TP_printk("dev %d:%d dir 0x%llx parent 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->dir_ino, + __entry->parent_ino) +) +#define DEFINE_XREP_DIR_EVENT(name) \ +DEFINE_EVENT(xrep_dir_class, name, \ + TP_PROTO(struct xfs_inode *dp, xfs_ino_t parent_ino), \ + TP_ARGS(dp, parent_ino)) +DEFINE_XREP_DIR_EVENT(xrep_dir_rebuild_tree); +DEFINE_XREP_DIR_EVENT(xrep_dir_reset_fork); + +DECLARE_EVENT_CLASS(xrep_dirent_class, + TP_PROTO(struct xfs_inode *dp, const struct xfs_name *name, + xfs_ino_t ino), + TP_ARGS(dp, name, ino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, dir_ino) + __field(unsigned int, namelen) + __dynamic_array(char, name, name->len) + __field(xfs_ino_t, ino) + __field(uint8_t, ftype) + ), + TP_fast_assign( + __entry->dev = dp->i_mount->m_super->s_dev; + __entry->dir_ino = dp->i_ino; + __entry->namelen = name->len; + memcpy(__get_str(name), name->name, name->len); + __entry->ino = ino; + __entry->ftype = name->type; + ), + TP_printk("dev %d:%d dir 0x%llx ftype %s name '%.*s' ino 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + 
__entry->dir_ino, + __print_symbolic(__entry->ftype, XFS_DIR3_FTYPE_STR), + __entry->namelen, + __get_str(name), + __entry->ino) +) +#define DEFINE_XREP_DIRENT_EVENT(name) \ +DEFINE_EVENT(xrep_dirent_class, name, \ + TP_PROTO(struct xfs_inode *dp, const struct xfs_name *name, \ + xfs_ino_t ino), \ + TP_ARGS(dp, name, ino)) +DEFINE_XREP_DIRENT_EVENT(xrep_dir_salvage_entry); +DEFINE_XREP_DIRENT_EVENT(xrep_dir_stash_createname); +DEFINE_XREP_DIRENT_EVENT(xrep_dir_replay_createname); + +DECLARE_EVENT_CLASS(xrep_parent_salvage_class, + TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), + TP_ARGS(dp, ino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, dir_ino) + __field(xfs_ino_t, ino) + ), + TP_fast_assign( + __entry->dev = dp->i_mount->m_super->s_dev; + __entry->dir_ino = dp->i_ino; + __entry->ino = ino; + ), + TP_printk("dev %d:%d dir 0x%llx parent 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->dir_ino, + __entry->ino) +) +#define DEFINE_XREP_PARENT_SALVAGE_EVENT(name) \ +DEFINE_EVENT(xrep_parent_salvage_class, name, \ + TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), \ + TP_ARGS(dp, ino)) +DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_dir_salvaged_parent); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ diff --git a/fs/xfs/scrub/xfblob.h b/fs/xfs/scrub/xfblob.h index 78a67a06408f..ae78322613ca 100644 --- a/fs/xfs/scrub/xfblob.h +++ b/fs/xfs/scrub/xfblob.h @@ -23,4 +23,28 @@ int xfblob_free(struct xfblob *blob, xfblob_cookie cookie); unsigned long long xfblob_bytes(struct xfblob *blob); void xfblob_truncate(struct xfblob *blob); +static inline int +xfblob_storename( + struct xfblob *blob, + xfblob_cookie *cookie, + const struct xfs_name *xname) +{ + return xfblob_store(blob, cookie, xname->name, xname->len); +} + +static inline int +xfblob_loadname( + struct xfblob *blob, + xfblob_cookie cookie, + struct xfs_name *xname, + uint32_t size) +{ + int ret = xfblob_load(blob, cookie, (void *)xname->name, size); 
+ if (ret) + return ret; + + xname->len = size; + return 0; +} + #endif /* __XFS_SCRUB_XFBLOB_H__ */
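A note on the xfblob_storename/xfblob_loadname helpers in the patch above: the store side hands back a cookie, and the load side needs both that cookie and the name length, so the caller has to remember the length itself. The following is a minimal userspace sketch of that round trip, not the kernel implementation. Every toy_* name, the fixed-size buffer, and the offset-as-cookie scheme are assumptions made only for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_BLOB_CAP	256
#define TOY_NAME_MAX	64

/* Hypothetical stand-in for struct xfblob: an append-only byte buffer. */
struct toy_blob {
	unsigned char	buf[TOY_BLOB_CAP];
	size_t		used;
};

/* Hypothetical cookie: here it is simply the byte offset of the blob. */
typedef size_t toy_cookie;

/* Hypothetical stand-in for struct xfs_name, with inline storage. */
struct toy_name {
	char		name[TOY_NAME_MAX];
	uint32_t	len;
};

/* Append @size bytes to the blob and return where they were put. */
static int
toy_blob_store(struct toy_blob *blob, toy_cookie *cookie,
		const void *data, uint32_t size)
{
	if (size > TOY_BLOB_CAP - blob->used)
		return -1;
	*cookie = blob->used;
	memcpy(blob->buf + blob->used, data, size);
	blob->used += size;
	return 0;
}

/* Copy @size bytes starting at @cookie back out of the blob. */
static int
toy_blob_load(const struct toy_blob *blob, toy_cookie cookie,
		void *data, uint32_t size)
{
	if (cookie > blob->used || size > blob->used - cookie)
		return -1;
	memcpy(data, blob->buf + cookie, size);
	return 0;
}

/* Same shape as xfblob_storename: store only the name bytes. */
static int
toy_blob_storename(struct toy_blob *blob, toy_cookie *cookie,
		const struct toy_name *xname)
{
	return toy_blob_store(blob, cookie, xname->name, xname->len);
}

/* Same shape as xfblob_loadname: the caller supplies the length. */
static int
toy_blob_loadname(const struct toy_blob *blob, toy_cookie cookie,
		struct toy_name *xname, uint32_t size)
{
	int ret;

	if (size > sizeof(xname->name))
		return -1;
	ret = toy_blob_load(blob, cookie, xname->name, size);
	if (ret)
		return ret;

	xname->len = size;
	return 0;
}
```

In this sketch the length travels next to the cookie in whatever record the caller keeps per stashed name, which is why the load helper takes it as a parameter instead of discovering it.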
* [PATCH 3/5] xfs: scan the filesystem to repair a directory dotdot entry
  2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong
  2024-04-15 23:51   ` [PATCH 1/5] xfs: inactivate directory data blocks Darrick J. Wong
  2024-04-15 23:52   ` [PATCH 2/5] xfs: online repair of directories Darrick J. Wong
@ 2024-04-15 23:52   ` Darrick J. Wong
  2024-04-15 23:52   ` [PATCH 4/5] xfs: online repair of parent pointers Darrick J. Wong
  2024-04-15 23:52   ` [PATCH 5/5] xfs: ask the dentry cache if it knows the parent of a directory Darrick J. Wong
  4 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:52 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the online directory repair code to scan the filesystem so that we
can set the dotdot entry when we're rebuilding a directory. This involves
dropping ILOCK on the directory that we're repairing, which means that the
VFS can sneak in and tell us to update dotdot at any time. Deal with these
races by using a dirent hook to absorb dotdot updates, and be careful not
to check the scan results until after we've retaken the ILOCK.

Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/dir_repair.c | 70 +++++--- fs/xfs/scrub/findparent.c | 412 +++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/findparent.h | 49 +++++ fs/xfs/scrub/iscan.c | 18 ++ fs/xfs/scrub/iscan.h | 1 fs/xfs/scrub/trace.h | 1 7 files changed, 528 insertions(+), 24 deletions(-) create mode 100644 fs/xfs/scrub/findparent.c create mode 100644 fs/xfs/scrub/findparent.h diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 5c9449e14f74..3c754777ec28 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -199,6 +199,7 @@ xfs-y += $(addprefix scrub/, \ bmap_repair.o \ cow_repair.o \ dir_repair.o \ + findparent.o \ fscounters_repair.o \ ialloc_repair.o \ inode_repair.o \ diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c index 48aa80d8c7dc..b17de79207db 100644 --- a/fs/xfs/scrub/dir_repair.c +++ b/fs/xfs/scrub/dir_repair.c @@ -38,8 +38,10 @@ #include "scrub/xfile.h" #include "scrub/xfarray.h" #include "scrub/xfblob.h" +#include "scrub/iscan.h" #include "scrub/readdir.h" #include "scrub/reap.h" +#include "scrub/findparent.h" /* * Directory Repair @@ -108,10 +110,10 @@ struct xrep_dir { struct xfs_da_args args; /* - * This is the parent that we're going to set on the reconstructed - * directory. + * Information used to scan the filesystem to find the inumber of the + * dotdot entry for this directory. */ - xfs_ino_t parent_ino; + struct xrep_parent_scan_info pscan; /* How many subdirectories did we find? 
*/ uint64_t subdirs; @@ -131,6 +133,7 @@ xrep_dir_teardown( { struct xrep_dir *rd = sc->buf; + xrep_findparent_scan_teardown(&rd->pscan); xfblob_destroy(rd->dir_names); xfarray_destroy(rd->dir_entries); } @@ -143,6 +146,8 @@ xrep_setup_directory( struct xrep_dir *rd; int error; + xchk_fsgates_enable(sc, XCHK_FSGATES_DIRENTS); + error = xrep_tempfile_create(sc, S_IFDIR); if (error) return error; @@ -179,8 +184,8 @@ xrep_dir_self_parent( } /* - * Look up the dotdot entry. Returns NULLFSINO if we don't know what to do. - * The next patch will check this more carefully. + * Look up the dotdot entry and confirm that it's really the parent. + * Returns NULLFSINO if we don't know what to do. */ static inline xfs_ino_t xrep_dir_lookup_parent( @@ -196,37 +201,39 @@ xrep_dir_lookup_parent( if (!xfs_verify_dir_ino(sc->mp, ino)) return NULLFSINO; + error = xrep_findparent_confirm(sc, &ino); + if (error) + return NULLFSINO; + return ino; } -/* - * Try to find the parent of the directory being repaired. - * - * NOTE: This function will someday be augmented by the directory parent repair - * code, which will know how to check the parent and scan the filesystem if - * we cannot find anything. Inode scans will have to be done before we start - * salvaging directory entries, so we do this now. - */ +/* Try to find the parent of the directory being repaired. */ STATIC int xrep_dir_find_parent( struct xrep_dir *rd) { xfs_ino_t ino; - ino = xrep_dir_self_parent(rd); + ino = xrep_findparent_self_reference(rd->sc); if (ino != NULLFSINO) { - rd->parent_ino = ino; + xrep_findparent_scan_finish_early(&rd->pscan, ino); return 0; } ino = xrep_dir_lookup_parent(rd); if (ino != NULLFSINO) { - rd->parent_ino = ino; + xrep_findparent_scan_finish_early(&rd->pscan, ino); return 0; } - /* NOTE: A future patch will deal with moving orphans. */ - return -EFSCORRUPTED; + /* + * A full filesystem scan is the last resort. On a busy filesystem, + * the scan can fail with -EBUSY if we cannot grab IOLOCKs. 
That means + * that we don't know what who the parent is, so we should return to + * userspace. + */ + return xrep_findparent_scan(&rd->pscan); } /* @@ -931,6 +938,10 @@ xrep_dir_salvage_entries( * the directory until we're ready for the exchange operation. Reads * will return -EIO without shutting down the fs, so we're ok with * that. + * + * The VFS can change dotdot on us, but the findparent scan will keep + * our incore parent inode up to date. See the note on locking issues + * for more details. */ error = xrep_trans_commit(sc); if (error) @@ -1154,6 +1165,14 @@ xrep_dir_swap( if (rd->subdirs + 2 > XFS_MAXLINK) return -EFSCORRUPTED; + /* + * If we never found the parent for this directory, we can't fix this + * directory. + */ + ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL); + if (rd->pscan.parent_ino == NULLFSINO) + return -EFSCORRUPTED; + /* * Reset the temporary directory's '..' entry to point to the parent * that we found. The temporary directory was created with the root @@ -1163,9 +1182,9 @@ xrep_dir_swap( * It's also possible that this replacement could also expand a sf * tempdir into block format. */ - if (rd->parent_ino != sc->mp->m_rootip->i_ino) { + if (rd->pscan.parent_ino != sc->mp->m_rootip->i_ino) { error = xrep_dir_replace(rd, rd->sc->tempip, &xfs_name_dotdot, - rd->parent_ino, rd->tx.req.resblks); + rd->pscan.parent_ino, rd->tx.req.resblks); if (error) return error; } @@ -1224,7 +1243,7 @@ xrep_dir_rebuild_tree( struct xfs_scrub *sc = rd->sc; int error; - trace_xrep_dir_rebuild_tree(sc->ip, rd->parent_ino); + trace_xrep_dir_rebuild_tree(sc->ip, rd->pscan.parent_ino); /* * Take the IOLOCK on the temporary file so that we can run dir @@ -1281,8 +1300,6 @@ xrep_dir_setup_scan( char *descr; int error; - rd->parent_ino = NULLFSINO; - /* Set up some staging memory for salvaging dirents. 
*/ descr = xchk_xfile_ino_descr(sc, "directory entries"); error = xfarray_create(descr, 0, sizeof(struct xrep_dirent), @@ -1297,8 +1314,15 @@ xrep_dir_setup_scan( if (error) goto out_xfarray; + error = xrep_findparent_scan_start(sc, &rd->pscan); + if (error) + goto out_xfblob; + return 0; +out_xfblob: + xfblob_destroy(rd->dir_names); + rd->dir_names = NULL; out_xfarray: xfarray_destroy(rd->dir_entries); rd->dir_entries = NULL; diff --git a/fs/xfs/scrub/findparent.c b/fs/xfs/scrub/findparent.c new file mode 100644 index 000000000000..7b3ec8d7d6cc --- /dev/null +++ b/fs/xfs/scrub/findparent.c @@ -0,0 +1,412 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_icache.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_dir2.h" +#include "xfs_bmap_btree.h" +#include "xfs_dir2_priv.h" +#include "xfs_trans_space.h" +#include "xfs_health.h" +#include "xfs_exchmaps.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/iscan.h" +#include "scrub/findparent.h" +#include "scrub/readdir.h" +#include "scrub/tempfile.h" + +/* + * Finding the Parent of a Directory + * ================================= + * + * Directories have parent pointers, in the sense that each directory contains + * a dotdot entry that points to the single allowed parent. The brute force + * way to find the parent of a given directory is to scan every directory in + * the filesystem looking for a child dirent that references this directory. 
+ * + * This module wraps the process of scanning the directory tree. It requires + * that @sc->ip is the directory whose parent we want to find, and that the + * caller hold only the IOLOCK on that directory. The scan itself needs to + * take the ILOCK of each directory visited. + * + * Because we cannot hold @sc->ip's ILOCK during a scan of the whole fs, it is + * necessary to use dirent hook to update the parent scan results. Callers + * must not read the scan results without re-taking @sc->ip's ILOCK. + * + * There are a few shortcuts that we can take to avoid scanning the entire + * filesystem, such as noticing directory tree roots. + */ + +struct xrep_findparent_info { + /* The directory currently being scanned. */ + struct xfs_inode *dp; + + /* + * Scrub context. We're looking for a @dp containing a directory + * entry pointing to sc->ip->i_ino. + */ + struct xfs_scrub *sc; + + /* Optional scan information for a xrep_findparent_scan call. */ + struct xrep_parent_scan_info *parent_scan; + + /* + * Parent that we've found for sc->ip. If we're scanning the entire + * directory tree, we need this to ensure that we only find /one/ + * parent directory. + */ + xfs_ino_t found_parent; + + /* + * This is set to true if @found_parent was not observed directly from + * the directory scan but by noticing a change in dotdot entries after + * cycling the sc->ip IOLOCK. + */ + bool parent_tentative; +}; + +/* + * If this directory entry points to the scrub target inode, then the directory + * we're scanning is the parent of the scrub target inode. + */ +STATIC int +xrep_findparent_dirent( + struct xfs_scrub *sc, + struct xfs_inode *dp, + xfs_dir2_dataptr_t dapos, + const struct xfs_name *name, + xfs_ino_t ino, + void *priv) +{ + struct xrep_findparent_info *fpi = priv; + int error = 0; + + if (xchk_should_terminate(fpi->sc, &error)) + return error; + + if (ino != fpi->sc->ip->i_ino) + return 0; + + /* Ignore garbage directory entry names. 
*/ + if (name->len == 0 || !xfs_dir2_namecheck(name->name, name->len)) + return -EFSCORRUPTED; + + /* + * Ignore dotdot and dot entries -- we're looking for parent -> child + * links only. + */ + if (name->name[0] == '.' && (name->len == 1 || + (name->len == 2 && name->name[1] == '.'))) + return 0; + + /* Uhoh, more than one parent for a dir? */ + if (fpi->found_parent != NULLFSINO && + !(fpi->parent_tentative && fpi->found_parent == fpi->dp->i_ino)) { + trace_xrep_findparent_dirent(fpi->sc->ip, 0); + return -EFSCORRUPTED; + } + + /* We found a potential parent; remember this. */ + trace_xrep_findparent_dirent(fpi->sc->ip, fpi->dp->i_ino); + fpi->found_parent = fpi->dp->i_ino; + fpi->parent_tentative = false; + + if (fpi->parent_scan) + xrep_findparent_scan_found(fpi->parent_scan, fpi->dp->i_ino); + + return 0; +} + +/* + * If this is a directory, walk the dirents looking for any that point to the + * scrub target inode. + */ +STATIC int +xrep_findparent_walk_directory( + struct xrep_findparent_info *fpi) +{ + struct xfs_scrub *sc = fpi->sc; + struct xfs_inode *dp = fpi->dp; + unsigned int lock_mode; + int error = 0; + + /* + * The inode being scanned cannot be its own parent, nor can any + * temporary directory we created to stage this repair. + */ + if (dp == sc->ip || dp == sc->tempip) + return 0; + + /* + * Similarly, temporary files created to stage a repair cannot be the + * parent of this inode. + */ + if (xrep_is_tempfile(dp)) + return 0; + + /* + * Scan the directory to see if there it contains an entry pointing to + * the directory that we are repairing. + */ + lock_mode = xfs_ilock_data_map_shared(dp); + + /* + * If this directory is known to be sick, we cannot scan it reliably + * and must abort. 
+ */ + if (xfs_inode_has_sickness(dp, XFS_SICK_INO_CORE | + XFS_SICK_INO_BMBTD | + XFS_SICK_INO_DIR)) { + error = -EFSCORRUPTED; + goto out_unlock; + } + + /* + * We cannot complete our parent pointer scan if a directory looks as + * though it has been zapped by the inode record repair code. + */ + if (xchk_dir_looks_zapped(dp)) { + error = -EBUSY; + goto out_unlock; + } + + error = xchk_dir_walk(sc, dp, xrep_findparent_dirent, fpi); + if (error) + goto out_unlock; + +out_unlock: + xfs_iunlock(dp, lock_mode); + return error; +} + +/* + * Update this directory's dotdot pointer based on ongoing dirent updates. + */ +STATIC int +xrep_findparent_live_update( + struct notifier_block *nb, + unsigned long action, + void *data) +{ + struct xfs_dir_update_params *p = data; + struct xrep_parent_scan_info *pscan; + struct xfs_scrub *sc; + + pscan = container_of(nb, struct xrep_parent_scan_info, + dhook.dirent_hook.nb); + sc = pscan->sc; + + /* + * If @p->ip is the subdirectory that we're interested in and we've + * already scanned @p->dp, update the dotdot target inumber to the + * parent inode. + */ + if (p->ip->i_ino == sc->ip->i_ino && + xchk_iscan_want_live_update(&pscan->iscan, p->dp->i_ino)) { + if (p->delta > 0) { + xrep_findparent_scan_found(pscan, p->dp->i_ino); + } else { + xrep_findparent_scan_found(pscan, NULLFSINO); + } + } + + return NOTIFY_DONE; +} + +/* + * Set up a scan to find the parent of a directory. The provided dirent hook + * will be called when there is a dotdot update for the inode being repaired. + */ +int +xrep_findparent_scan_start( + struct xfs_scrub *sc, + struct xrep_parent_scan_info *pscan) +{ + int error; + + if (!(sc->flags & XCHK_FSGATES_DIRENTS)) { + ASSERT(sc->flags & XCHK_FSGATES_DIRENTS); + return -EINVAL; + } + + pscan->sc = sc; + pscan->parent_ino = NULLFSINO; + + mutex_init(&pscan->lock); + + xchk_iscan_start(sc, 30000, 100, &pscan->iscan); + + /* + * Hook into the dirent update code. 
The hook only operates on inodes + * that were already scanned, and the scanner thread takes each inode's + * ILOCK, which means that any in-progress inode updates will finish + * before we can scan the inode. + */ + xfs_dir_hook_setup(&pscan->dhook, xrep_findparent_live_update); + error = xfs_dir_hook_add(sc->mp, &pscan->dhook); + if (error) + goto out_iscan; + + return 0; +out_iscan: + xchk_iscan_teardown(&pscan->iscan); + mutex_destroy(&pscan->lock); + return error; +} + +/* + * Scan the entire filesystem looking for a parent inode for the inode being + * scrubbed. @sc->ip must not be the root of a directory tree. Callers must + * not hold a dirty transaction or any lock that would interfere with taking + * an ILOCK. + * + * Returns 0 with @pscan->parent_ino set to the parent that we found. + * Returns 0 with @pscan->parent_ino set to NULLFSINO if we found no parents. + * Returns the usual negative errno if something else happened. + */ +int +xrep_findparent_scan( + struct xrep_parent_scan_info *pscan) +{ + struct xrep_findparent_info fpi = { + .sc = pscan->sc, + .found_parent = NULLFSINO, + .parent_scan = pscan, + }; + struct xfs_scrub *sc = pscan->sc; + int ret; + + ASSERT(S_ISDIR(VFS_IC(sc->ip)->i_mode)); + + while ((ret = xchk_iscan_iter(&pscan->iscan, &fpi.dp)) == 1) { + if (S_ISDIR(VFS_I(fpi.dp)->i_mode)) + ret = xrep_findparent_walk_directory(&fpi); + else + ret = 0; + xchk_iscan_mark_visited(&pscan->iscan, fpi.dp); + xchk_irele(sc, fpi.dp); + if (ret) + break; + + if (xchk_should_terminate(sc, &ret)) + break; + } + xchk_iscan_iter_finish(&pscan->iscan); + + return ret; +} + +/* Tear down a parent scan. */ +void +xrep_findparent_scan_teardown( + struct xrep_parent_scan_info *pscan) +{ + xfs_dir_hook_del(pscan->sc->mp, &pscan->dhook); + xchk_iscan_teardown(&pscan->iscan); + mutex_destroy(&pscan->lock); +} + +/* Finish a parent scan early. 
*/ +void +xrep_findparent_scan_finish_early( + struct xrep_parent_scan_info *pscan, + xfs_ino_t ino) +{ + xrep_findparent_scan_found(pscan, ino); + xchk_iscan_finish_early(&pscan->iscan); +} + +/* + * Confirm that the directory @parent_ino actually contains a directory entry + * pointing to the child @sc->ip->ino. This function returns one of several + * ways: + * + * Returns 0 with @parent_ino unchanged if the parent was confirmed. + * Returns 0 with @parent_ino set to NULLFSINO if the parent was not valid. + * Returns the usual negative errno if something else happened. + */ +int +xrep_findparent_confirm( + struct xfs_scrub *sc, + xfs_ino_t *parent_ino) +{ + struct xrep_findparent_info fpi = { + .sc = sc, + .found_parent = NULLFSINO, + }; + int error; + + /* + * The root directory always points to itself. Unlinked dirs can point + * anywhere, so we point them at the root dir too. + */ + if (sc->ip == sc->mp->m_rootip || VFS_I(sc->ip)->i_nlink == 0) { + *parent_ino = sc->mp->m_sb.sb_rootino; + return 0; + } + + /* Reject garbage parent inode numbers and self-referential parents. */ + if (*parent_ino == NULLFSINO) + return 0; + if (!xfs_verify_dir_ino(sc->mp, *parent_ino) || + *parent_ino == sc->ip->i_ino) { + *parent_ino = NULLFSINO; + return 0; + } + + error = xchk_iget(sc, *parent_ino, &fpi.dp); + if (error) + return error; + + if (!S_ISDIR(VFS_I(fpi.dp)->i_mode)) { + *parent_ino = NULLFSINO; + goto out_rele; + } + + error = xrep_findparent_walk_directory(&fpi); + if (error) + goto out_rele; + + *parent_ino = fpi.found_parent; +out_rele: + xchk_irele(sc, fpi.dp); + return error; +} + +/* + * If we're the root of a directory tree, we are our own parent. If we're an + * unlinked directory, the parent /won't/ have a link to us. Set the parent + * directory to the root for both cases. Returns NULLFSINO if we don't know + * what to do. 
+ */ +xfs_ino_t +xrep_findparent_self_reference( + struct xfs_scrub *sc) +{ + if (sc->ip->i_ino == sc->mp->m_sb.sb_rootino) + return sc->mp->m_sb.sb_rootino; + + if (VFS_I(sc->ip)->i_nlink == 0) + return sc->mp->m_sb.sb_rootino; + + return NULLFSINO; +} diff --git a/fs/xfs/scrub/findparent.h b/fs/xfs/scrub/findparent.h new file mode 100644 index 000000000000..d946bc81f34e --- /dev/null +++ b/fs/xfs/scrub/findparent.h @@ -0,0 +1,49 @@ +/* SPDX-License-Identifier: GPL-2.0-or-later */ +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_FINDPARENT_H__ +#define __XFS_SCRUB_FINDPARENT_H__ + +struct xrep_parent_scan_info { + struct xfs_scrub *sc; + + /* Inode scan cursor. */ + struct xchk_iscan iscan; + + /* Hook to capture directory entry updates. */ + struct xfs_dir_hook dhook; + + /* Lock protecting parent_ino. */ + struct mutex lock; + + /* Parent inode that we've found. */ + xfs_ino_t parent_ino; + + bool lookup_parent; +}; + +int xrep_findparent_scan_start(struct xfs_scrub *sc, + struct xrep_parent_scan_info *pscan); +int xrep_findparent_scan(struct xrep_parent_scan_info *pscan); +void xrep_findparent_scan_teardown(struct xrep_parent_scan_info *pscan); + +static inline void +xrep_findparent_scan_found( + struct xrep_parent_scan_info *pscan, + xfs_ino_t ino) +{ + mutex_lock(&pscan->lock); + pscan->parent_ino = ino; + mutex_unlock(&pscan->lock); +} + +void xrep_findparent_scan_finish_early(struct xrep_parent_scan_info *pscan, + xfs_ino_t ino); + +int xrep_findparent_confirm(struct xfs_scrub *sc, xfs_ino_t *parent_ino); + +xfs_ino_t xrep_findparent_self_reference(struct xfs_scrub *sc); + +#endif /* __XFS_SCRUB_FINDPARENT_H__ */ diff --git a/fs/xfs/scrub/iscan.c b/fs/xfs/scrub/iscan.c index c643b7d79b60..c380207702e2 100644 --- a/fs/xfs/scrub/iscan.c +++ b/fs/xfs/scrub/iscan.c @@ -243,6 +243,17 @@ xchk_iscan_finish( mutex_unlock(&iscan->lock); } +/* Mark an inode scan finished before we 
actually scan anything. */ +void +xchk_iscan_finish_early( + struct xchk_iscan *iscan) +{ + ASSERT(iscan->cursor_ino == iscan->scan_start_ino); + ASSERT(iscan->__visited_ino == iscan->scan_start_ino); + + xchk_iscan_finish(iscan); +} + /* * Grab the AGI to advance the inode scan. Returns 0 if *agi_bpp is now set, * -ECANCELED if the live scan aborted, -EBUSY if the AGI could not be grabbed, @@ -436,8 +447,13 @@ xchk_iscan_iget( * It's possible that this inode has lost all of its links but * hasn't yet been inactivated. If we don't have a transaction * or it's not writable, flush the inodegc workers and wait. + * If we have a non-empty transaction, we must not block on + * inodegc, which allocates its own transactions. */ - xfs_inodegc_flush(mp); + if (sc->tp && !(sc->tp->t_flags & XFS_TRANS_NO_WRITECOUNT)) + xfs_inodegc_push(mp); + else + xfs_inodegc_flush(mp); return xchk_iscan_iget_retry(iscan, true); } diff --git a/fs/xfs/scrub/iscan.h b/fs/xfs/scrub/iscan.h index 5e0e4ed9dea6..f9f47fa01a9e 100644 --- a/fs/xfs/scrub/iscan.h +++ b/fs/xfs/scrub/iscan.h @@ -88,6 +88,7 @@ xchk_iscan_set_agi_trylock(struct xchk_iscan *iscan) void xchk_iscan_start(struct xfs_scrub *sc, unsigned int iget_timeout, unsigned int iget_retry_delay, struct xchk_iscan *iscan); +void xchk_iscan_finish_early(struct xchk_iscan *iscan); void xchk_iscan_teardown(struct xchk_iscan *iscan); int xchk_iscan_iter(struct xchk_iscan *iscan, struct xfs_inode **ipp); diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index d6d9e8d6109c..85537a87516e 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2611,6 +2611,7 @@ DEFINE_EVENT(xrep_parent_salvage_class, name, \ TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), \ TP_ARGS(dp, ino)) DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_dir_salvaged_parent); +DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_findparent_dirent); #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */
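The core rule of the findparent patch above is: walk every directory, skip the target itself, ignore dot and dotdot entries, and treat a second parent as corruption. The sketch below is a rough userspace model of that rule only. It leaves out locking, the live-update hook, tempfile filtering, and sick-directory checks. All toy_* names and the TOY_ECORRUPT value are hypothetical and do not appear in the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define TOY_NULLINO	((uint64_t)-1)	/* stand-in for NULLFSINO */
#define TOY_ECORRUPT	117		/* stand-in for EFSCORRUPTED */

/* Hypothetical in-memory directory entry and directory. */
struct toy_dirent {
	const char	*name;
	uint64_t	ino;
};

struct toy_dir {
	uint64_t		ino;
	const struct toy_dirent	*ents;
	size_t			nr_ents;
};

/* Dot and dotdot are child -> parent links, which the scan ignores. */
static int
toy_is_dot_or_dotdot(const char *name)
{
	return strcmp(name, ".") == 0 || strcmp(name, "..") == 0;
}

/*
 * Brute-force parent search for @child_ino.
 *
 * Returns 0 with *parent set to the single parent that was found.
 * Returns 0 with *parent set to TOY_NULLINO if no parent was found.
 * Returns -TOY_ECORRUPT if more than one entry points at the child.
 */
static int
toy_find_parent(const struct toy_dir *dirs, size_t nr_dirs,
		uint64_t child_ino, uint64_t *parent)
{
	size_t i, j;

	*parent = TOY_NULLINO;
	for (i = 0; i < nr_dirs; i++) {
		/* A directory cannot be its own parent. */
		if (dirs[i].ino == child_ino)
			continue;

		for (j = 0; j < dirs[i].nr_ents; j++) {
			const struct toy_dirent *e = &dirs[i].ents[j];

			if (e->ino != child_ino)
				continue;
			if (toy_is_dot_or_dotdot(e->name))
				continue;

			/* Uhoh, more than one parent for a dir? */
			if (*parent != TOY_NULLINO)
				return -TOY_ECORRUPT;
			*parent = dirs[i].ino;
		}
	}
	return 0;
}
```

As in the patch, "no parent found" is a successful return with a null inode number, so the caller decides what to do with an orphan.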
* [PATCH 4/5] xfs: online repair of parent pointers
  2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong
                     ` (2 preceding siblings ...)
  2024-04-15 23:52   ` [PATCH 3/5] xfs: scan the filesystem to repair a directory dotdot entry Darrick J. Wong
@ 2024-04-15 23:52   ` Darrick J. Wong
  2024-04-15 23:52   ` [PATCH 5/5] xfs: ask the dentry cache if it knows the parent of a directory Darrick J. Wong
  4 siblings, 0 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:52 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Teach the online repair code to fix parent pointers for directories.
For now, this means correcting the dotdot entry of an existing directory
that is otherwise consistent.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 fs/xfs/Makefile              |    1 
 fs/xfs/scrub/parent.c        |   10 ++
 fs/xfs/scrub/parent_repair.c |  221 ++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/scrub/repair.h        |    4 +
 fs/xfs/scrub/scrub.c         |    2 
 fs/xfs/scrub/trace.h         |    1 
 6 files changed, 238 insertions(+), 1 deletion(-)
 create mode 100644 fs/xfs/scrub/parent_repair.c

diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 3c754777ec28..d48646f86563 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -205,6 +205,7 @@ xfs-y += $(addprefix scrub/, \ inode_repair.o \ newbt.o \ nlinks_repair.o \ + parent_repair.o \ rcbag_btree.o \ rcbag.o \ reap.o \ diff --git a/fs/xfs/scrub/parent.c b/fs/xfs/scrub/parent.c index 050a8e8914f6..acb6282c3d14 100644 --- a/fs/xfs/scrub/parent.c +++ b/fs/xfs/scrub/parent.c @@ -10,6 +10,7 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_log_format.h" +#include "xfs_trans.h" #include "xfs_inode.h" #include "xfs_icache.h" #include "xfs_dir2.h" @@ -18,12 +19,21 @@ #include "scrub/common.h" #include "scrub/readdir.h" #include "scrub/tempfile.h" +#include "scrub/repair.h" /* Set us up to scrub
parents. */ int xchk_setup_parent( struct xfs_scrub *sc) { + int error; + + if (xchk_could_repair(sc)) { + error = xrep_setup_parent(sc); + if (error) + return error; + } + return xchk_setup_inode_contents(sc, 0); } diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c new file mode 100644 index 000000000000..0a9651bb0b05 --- /dev/null +++ b/fs/xfs/scrub/parent_repair.c @@ -0,0 +1,221 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2020-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_icache.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_dir2.h" +#include "xfs_bmap_btree.h" +#include "xfs_dir2_priv.h" +#include "xfs_trans_space.h" +#include "xfs_health.h" +#include "xfs_exchmaps.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/iscan.h" +#include "scrub/findparent.h" +#include "scrub/readdir.h" + +/* + * Repairing The Directory Parent Pointer + * ====================================== + * + * Currently, only directories support parent pointers (in the form of '..' + * entries), so we simply scan the filesystem and update the '..' entry. + * + * Note that because the only parent pointer is the dotdot entry, we won't + * touch an unhealthy directory, since the directory repair code is perfectly + * capable of rebuilding a directory with the proper parent inode. + * + * See the section on locking issues in dir_repair.c for more information about + * conflicts with the VFS. The findparent code wll keep our incore parent + * inode up to date. 
+ */ + +struct xrep_parent { + struct xfs_scrub *sc; + + /* + * Information used to scan the filesystem to find the inumber of the + * dotdot entry for this directory. + */ + struct xrep_parent_scan_info pscan; +}; + +/* Tear down all the incore stuff we created. */ +static void +xrep_parent_teardown( + struct xrep_parent *rp) +{ + xrep_findparent_scan_teardown(&rp->pscan); +} + +/* Set up for a parent repair. */ +int +xrep_setup_parent( + struct xfs_scrub *sc) +{ + struct xrep_parent *rp; + + xchk_fsgates_enable(sc, XCHK_FSGATES_DIRENTS); + + rp = kvzalloc(sizeof(struct xrep_parent), XCHK_GFP_FLAGS); + if (!rp) + return -ENOMEM; + rp->sc = sc; + sc->buf = rp; + + return 0; +} + +/* + * Scan all files in the filesystem for a child dirent that we can turn into + * the dotdot entry for this directory. + */ +STATIC int +xrep_parent_find_dotdot( + struct xrep_parent *rp) +{ + struct xfs_scrub *sc = rp->sc; + xfs_ino_t ino; + unsigned int sick, checked; + int error; + + /* + * Avoid sick directories. There shouldn't be anyone else clearing the + * directory's sick status. + */ + xfs_inode_measure_sickness(sc->ip, &sick, &checked); + if (sick & XFS_SICK_INO_DIR) + return -EFSCORRUPTED; + + ino = xrep_findparent_self_reference(sc); + if (ino != NULLFSINO) { + xrep_findparent_scan_finish_early(&rp->pscan, ino); + return 0; + } + + /* + * Drop the ILOCK on this directory so that we can scan for the dotdot + * entry. Figure out who is going to be the parent of this directory, + * then retake the ILOCK so that we can salvage directory entries. + */ + xchk_iunlock(sc, XFS_ILOCK_EXCL); + error = xrep_findparent_scan(&rp->pscan); + xchk_ilock(sc, XFS_ILOCK_EXCL); + + return error; +} + +/* Reset a directory's dotdot entry, if needed. 
*/ +STATIC int +xrep_parent_reset_dotdot( + struct xrep_parent *rp) +{ + struct xfs_scrub *sc = rp->sc; + xfs_ino_t ino; + unsigned int spaceres; + int error = 0; + + ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL); + + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, &ino); + if (error || ino == rp->pscan.parent_ino) + return error; + + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + trace_xrep_parent_reset_dotdot(sc->ip, rp->pscan.parent_ino); + + /* + * Reserve more space just in case we have to expand the dir. We're + * allowed to exceed quota to repair inconsistent metadata. + */ + spaceres = XFS_RENAME_SPACE_RES(sc->mp, xfs_name_dotdot.len); + error = xfs_trans_reserve_more_inode(sc->tp, sc->ip, spaceres, 0, + true); + if (error) + return error; + + error = xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot, + rp->pscan.parent_ino, spaceres); + if (error) + return error; + + /* + * Roll transaction to detach the inode from the transaction but retain + * ILOCK_EXCL. + */ + return xfs_trans_roll(&sc->tp); +} + +/* + * Commit the new parent pointer structure (currently only the dotdot entry) to + * the file that we're repairing. + */ +STATIC int +xrep_parent_rebuild_tree( + struct xrep_parent *rp) +{ + if (rp->pscan.parent_ino == NULLFSINO) { + /* Cannot fix orphaned directories yet. */ + return -EFSCORRUPTED; + } + + return xrep_parent_reset_dotdot(rp); +} + +/* Set up the filesystem scan so we can look for parents. */ +STATIC int +xrep_parent_setup_scan( + struct xrep_parent *rp) +{ + struct xfs_scrub *sc = rp->sc; + + return xrep_findparent_scan_start(sc, &rp->pscan); +} + +int +xrep_parent( + struct xfs_scrub *sc) +{ + struct xrep_parent *rp = sc->buf; + int error; + + error = xrep_parent_setup_scan(rp); + if (error) + return error; + + error = xrep_parent_find_dotdot(rp); + if (error) + goto out_teardown; + + /* Last chance to abort before we start committing fixes. 
*/ + if (xchk_should_terminate(sc, &error)) + goto out_teardown; + + error = xrep_parent_rebuild_tree(rp); + if (error) + goto out_teardown; + +out_teardown: + xrep_parent_teardown(rp); + return error; +} diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 4e25aa95753a..e53374fa5430 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -92,6 +92,7 @@ int xrep_setup_ag_rmapbt(struct xfs_scrub *sc); int xrep_setup_ag_refcountbt(struct xfs_scrub *sc); int xrep_setup_xattr(struct xfs_scrub *sc); int xrep_setup_directory(struct xfs_scrub *sc); +int xrep_setup_parent(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -127,6 +128,7 @@ int xrep_nlinks(struct xfs_scrub *sc); int xrep_fscounters(struct xfs_scrub *sc); int xrep_xattr(struct xfs_scrub *sc); int xrep_directory(struct xfs_scrub *sc); +int xrep_parent(struct xfs_scrub *sc); #ifdef CONFIG_XFS_RT int xrep_rtbitmap(struct xfs_scrub *sc); @@ -198,6 +200,7 @@ xrep_setup_nothing( #define xrep_setup_ag_refcountbt xrep_setup_nothing #define xrep_setup_xattr xrep_setup_nothing #define xrep_setup_directory xrep_setup_nothing +#define xrep_setup_parent xrep_setup_nothing #define xrep_setup_inode(sc, imap) ((void)0) @@ -225,6 +228,7 @@ xrep_setup_nothing( #define xrep_rtsummary xrep_notsupported #define xrep_xattr xrep_notsupported #define xrep_directory xrep_notsupported +#define xrep_parent xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 8e9e2bf121c2..520d83db193c 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -343,7 +343,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_parent, .scrub = xchk_parent, - .repair = xrep_notsupported, + .repair = xrep_parent, }, [XFS_SCRUB_TYPE_RTBITMAP] = { /* realtime bitmap */ .type = ST_FS, diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 85537a87516e..e1755fe63e67 
100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2550,6 +2550,7 @@ DEFINE_EVENT(xrep_dir_class, name, \ TP_ARGS(dp, parent_ino)) DEFINE_XREP_DIR_EVENT(xrep_dir_rebuild_tree); DEFINE_XREP_DIR_EVENT(xrep_dir_reset_fork); +DEFINE_XREP_DIR_EVENT(xrep_parent_reset_dotdot); DECLARE_EVENT_CLASS(xrep_dirent_class, TP_PROTO(struct xfs_inode *dp, const struct xfs_name *name, ^ permalink raw reply related [flat|nested] 100+ messages in thread
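As a reading aid for patch 4/5, the decision that xrep_parent_rebuild_tree and xrep_parent_reset_dotdot make can be modeled in userspace C. This is only a sketch under stated assumptions: struct toy_dir, toy_reset_dotdot, TOY_NULLINO and TOY_EFSCORRUPTED are hypothetical stand-ins invented for illustration, not kernel interfaces, and the locking, transaction, and quota handling of the real code is not represented.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's "no inode" sentinel (NULLFSINO). */
#define TOY_NULLINO		((uint64_t)-1)
/* Hypothetical stand-in for -EFSCORRUPTED; the value is arbitrary here. */
#define TOY_EFSCORRUPTED	117

/* Toy model of a directory; not a kernel structure. */
struct toy_dir {
	uint64_t	ino;	/* inode number of this directory */
	uint64_t	dotdot;	/* inode number that '..' points to */
	unsigned int	writes;	/* how many times '..' was rewritten */
};

/*
 * Model of the repair decision:
 *  - no parent was found: give up, orphans cannot be fixed at this stage;
 *  - '..' already matches the parent that was found: nothing to write;
 *  - otherwise: replace '..' with the parent that was found.
 */
static int
toy_reset_dotdot(struct toy_dir *dir, uint64_t found_parent)
{
	if (found_parent == TOY_NULLINO)
		return -TOY_EFSCORRUPTED;

	if (dir->dotdot == found_parent)
		return 0;

	dir->dotdot = found_parent;
	dir->writes++;
	return 0;
}
```

The point of the early return on a matching '..' is that a healthy directory costs no update at all; in the patch the equivalent check happens before the inode is joined to the transaction and before any extra block reservation is requested.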
* [PATCH 5/5] xfs: ask the dentry cache if it knows the parent of a directory 2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong ` (3 preceding siblings ...) 2024-04-15 23:52 ` [PATCH 4/5] xfs: online repair of parent pointers Darrick J. Wong @ 2024-04-15 23:52 ` Darrick J. Wong 4 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:52 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs
From: Darrick J. Wong <djwong@kernel.org>

It's possible that the dentry cache can tell us the parent of a directory. Therefore, when repairing directory dot dot entries, query the dcache as a last resort before scanning the entire filesystem.

A reviewer asks: "How high is the chance that we actually have a valid dcache entry for a file in a corrupted directory?"

There's a decent chance of this actually working. Say you have a 1000-block directory foo, and block 980 gets corrupted. Let's further suppose that block 0 has a correct entry for ".." and "bar". If someone accesses /mnt/foo/bar, that will cause the dcache to create a dentry from /mnt to /mnt/foo whose d_parent points back to /mnt. If you then want to rebuild the directory, XFS can obtain the parent from the dcache without needing to wander into parent pointers or scan the filesystem to find /mnt's connection to foo.

Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/dir_repair.c | 29 +++++++++++++++++++++++++++++ fs/xfs/scrub/findparent.c | 38 +++++++++++++++++++++++++++++++++++++- fs/xfs/scrub/findparent.h | 1 + fs/xfs/scrub/parent_repair.c | 13 +++++++++++++ fs/xfs/scrub/trace.h | 1 + 5 files changed, 81 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c index b17de79207db..34fe720fde0e 100644 --- a/fs/xfs/scrub/dir_repair.c +++ b/fs/xfs/scrub/dir_repair.c @@ -208,6 +208,29 @@ xrep_dir_lookup_parent( return ino; } +/* + * Look up '..' in the dentry cache and confirm that it's really the parent. + * Returns NULLFSINO if the dcache misses or if the hit is implausible. + */ +static inline xfs_ino_t +xrep_dir_dcache_parent( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + xfs_ino_t parent_ino; + int error; + + parent_ino = xrep_findparent_from_dcache(sc); + if (parent_ino == NULLFSINO) + return parent_ino; + + error = xrep_findparent_confirm(sc, &parent_ino); + if (error) + return NULLFSINO; + + return parent_ino; +} + /* Try to find the parent of the directory being repaired. */ STATIC int xrep_dir_find_parent( @@ -221,6 +244,12 @@ xrep_dir_find_parent( return 0; } + ino = xrep_dir_dcache_parent(rd); + if (ino != NULLFSINO) { + xrep_findparent_scan_finish_early(&rd->pscan, ino); + return 0; + } + ino = xrep_dir_lookup_parent(rd); if (ino != NULLFSINO) { xrep_findparent_scan_finish_early(&rd->pscan, ino); diff --git a/fs/xfs/scrub/findparent.c b/fs/xfs/scrub/findparent.c index 7b3ec8d7d6cc..712dd73e4789 100644 --- a/fs/xfs/scrub/findparent.c +++ b/fs/xfs/scrub/findparent.c @@ -53,7 +53,8 @@ * must not read the scan results without re-taking @sc->ip's ILOCK. * * There are a few shortcuts that we can take to avoid scanning the entire - * filesystem, such as noticing directory tree roots. 
+ * filesystem, such as noticing directory tree roots and querying the dentry + * cache for parent information. */ struct xrep_findparent_info { @@ -410,3 +411,38 @@ xrep_findparent_self_reference( return NULLFSINO; } + +/* Check the dentry cache to see if it knows of a parent for the scrub target. */ +xfs_ino_t +xrep_findparent_from_dcache( + struct xfs_scrub *sc) +{ + struct inode *pip = NULL; + struct dentry *dentry, *parent; + xfs_ino_t ret = NULLFSINO; + + dentry = d_find_alias(VFS_I(sc->ip)); + if (!dentry) + goto out; + + parent = dget_parent(dentry); + if (!parent) + goto out_dput; + + ASSERT(parent->d_sb == sc->ip->i_mount->m_super); + + pip = igrab(d_inode(parent)); + dput(parent); + + if (S_ISDIR(pip->i_mode)) { + trace_xrep_findparent_from_dcache(sc->ip, XFS_I(pip)->i_ino); + ret = XFS_I(pip)->i_ino; + } + + xchk_irele(sc, XFS_I(pip)); + +out_dput: + dput(dentry); +out: + return ret; +} diff --git a/fs/xfs/scrub/findparent.h b/fs/xfs/scrub/findparent.h index d946bc81f34e..501f99d3164e 100644 --- a/fs/xfs/scrub/findparent.h +++ b/fs/xfs/scrub/findparent.h @@ -45,5 +45,6 @@ void xrep_findparent_scan_finish_early(struct xrep_parent_scan_info *pscan, int xrep_findparent_confirm(struct xfs_scrub *sc, xfs_ino_t *parent_ino); xfs_ino_t xrep_findparent_self_reference(struct xfs_scrub *sc); +xfs_ino_t xrep_findparent_from_dcache(struct xfs_scrub *sc); #endif /* __XFS_SCRUB_FINDPARENT_H__ */ diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c index 0a9651bb0b05..826926c2bb0d 100644 --- a/fs/xfs/scrub/parent_repair.c +++ b/fs/xfs/scrub/parent_repair.c @@ -118,7 +118,20 @@ xrep_parent_find_dotdot( * then retake the ILOCK so that we can salvage directory entries. */ xchk_iunlock(sc, XFS_ILOCK_EXCL); + + /* Does the VFS dcache have an answer for us?
*/ + ino = xrep_findparent_from_dcache(sc); + if (ino != NULLFSINO) { + error = xrep_findparent_confirm(sc, &ino); + if (!error && ino != NULLFSINO) { + xrep_findparent_scan_finish_early(&rp->pscan, ino); + goto out_relock; + } + } + + /* Scan the entire filesystem for a parent. */ error = xrep_findparent_scan(&rp->pscan); +out_relock: xchk_ilock(sc, XFS_ILOCK_EXCL); return error; diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index e1755fe63e67..d68ec8e2781e 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2613,6 +2613,7 @@ DEFINE_EVENT(xrep_parent_salvage_class, name, \ TP_ARGS(dp, ino)) DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_dir_salvaged_parent); DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_findparent_dirent); +DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_findparent_from_dcache); #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
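The order in which patch 5/5 consults its sources in xrep_parent_find_dotdot (self reference first, then a confirmed dcache hint, then the full filesystem scan) can be sketched as a small userspace model. Everything named toy_* or TOY_* below is hypothetical and exists only for illustration; in particular, the hint_confirmed flag merely stands in for the outcome of xrep_findparent_confirm and does not reproduce what that function checks.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the kernel's "no inode" sentinel (NULLFSINO). */
#define TOY_NULLINO	((uint64_t)-1)

/* Which source produced the answer; purely illustrative. */
enum toy_source {
	TOY_SRC_NONE,
	TOY_SRC_SELF,
	TOY_SRC_DCACHE,
	TOY_SRC_SCAN,
};

/* Inputs to the toy lookup; each field models one possible source. */
struct toy_lookup {
	uint64_t	self_ref;	/* set if the dir is its own parent */
	uint64_t	dcache_hint;	/* parent according to the dcache */
	bool		hint_confirmed;	/* did confirmation accept the hint? */
	uint64_t	scan_result;	/* what a full scan would report */
	unsigned int	scans;		/* number of full scans performed */
};

/*
 * Try the cheap sources first and fall back to the expensive scan only
 * when none of them yields a usable answer.
 */
static uint64_t
toy_find_parent(struct toy_lookup *l, enum toy_source *src)
{
	if (l->self_ref != TOY_NULLINO) {
		*src = TOY_SRC_SELF;
		return l->self_ref;
	}

	/* A dcache hit is only a hint; it must survive confirmation. */
	if (l->dcache_hint != TOY_NULLINO && l->hint_confirmed) {
		*src = TOY_SRC_DCACHE;
		return l->dcache_hint;
	}

	l->scans++;
	*src = (l->scan_result != TOY_NULLINO) ? TOY_SRC_SCAN : TOY_SRC_NONE;
	return l->scan_result;
}
```

An unconfirmed hint is treated exactly like a cache miss, which mirrors the patch: it only finishes the scan early when the confirmation step reports no error and leaves a valid inode number behind.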
* [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (8 preceding siblings ...) 2024-04-15 23:35 ` [PATCHSET v30.3 09/16] xfs: online repair of directories Darrick J. Wong @ 2024-04-15 23:36 ` Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong ` (2 more replies) 2024-04-15 23:36 ` [PATCHSET v30.3 11/16] xfs: online repair of symbolic links Darrick J. Wong ` (5 subsequent siblings) 15 siblings, 3 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:36 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

Hi all,

Orphaned files are defined to be files with nonzero ondisk link count but no observable parent directory. This series enables online repair to reparent orphaned files into the filesystem directory tree, and wires up this reparenting ability into the directory, file link count, and parent pointer repair functions. This is how we fix files with positive link count that are not reachable through the directory tree.

This patch will also create the orphanage directory (lost+found) if it is not present. In contrast to xfs_repair, we follow e2fsck in creating the lost+found without group or other-owner access to avoid accidental disclosure of files that were previously hidden by an 0700 directory. That's silly security, but people have been known to do it.

If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome.
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-orphanage-6.10 --- Commits in this patchset: * xfs: move orphan files to the orphanage * xfs: move files to orphanage instead of letting nlinks drop to zero * xfs: ensure dentry consistency when the orphanage adopts a file --- .../filesystems/xfs/xfs-online-fsck-design.rst | 20 - fs/xfs/Makefile | 1 fs/xfs/scrub/dir_repair.c | 130 ++++ fs/xfs/scrub/nlinks.c | 20 + fs/xfs/scrub/nlinks.h | 7 fs/xfs/scrub/nlinks_repair.c | 123 ++++ fs/xfs/scrub/orphanage.c | 589 ++++++++++++++++++++ fs/xfs/scrub/orphanage.h | 75 +++ fs/xfs/scrub/parent_repair.c | 100 +++ fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/scrub.h | 4 fs/xfs/scrub/trace.c | 1 fs/xfs/scrub/trace.h | 96 +++ fs/xfs/xfs_inode.c | 6 fs/xfs/xfs_inode.h | 1 16 files changed, 1139 insertions(+), 38 deletions(-) create mode 100644 fs/xfs/scrub/orphanage.c create mode 100644 fs/xfs/scrub/orphanage.h ^ permalink raw reply [flat|nested] 100+ messages in thread
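Patch 1/3 of this series names each adopted file after its inode number and appends a numeric suffix when that name is already present in lost+found (see xrep_adoption_compute_name). The naming rule can be tried out in userspace with the sketch below. It is a model under assumptions, not the kernel code: toy_orphan_name, toy_name_taken, struct toy_names and the TOY_* constants are hypothetical, and the "does this name exist" callback replaces the real directory lookup.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define TOY_MAXNAMELEN	256	/* assumed buffer size for this sketch */
#define TOY_MAX_TRIES	10000	/* suffix limit used by the patch */

/* Hypothetical callback: is @name already present in the orphanage? */
typedef bool (*toy_exists_fn)(const char *name, void *ctx);

/* Toy directory contents: a plain array of names. */
struct toy_names {
	const char	**names;
	size_t		count;
};

static bool
toy_name_taken(const char *name, void *ctx)
{
	struct toy_names *set = ctx;
	size_t i;

	for (i = 0; i < set->count; i++)
		if (strcmp(set->names[i], name) == 0)
			return true;
	return false;
}

/*
 * Pick "<ino>" if it is free, otherwise "<ino>.1", "<ino>.2", and so on.
 * Returns 0 with the chosen name in @buf, or -1 once every candidate up to
 * the suffix limit is taken.
 */
static int
toy_orphan_name(uint64_t ino, toy_exists_fn exists, void *ctx,
		char *buf, size_t buflen)
{
	unsigned int incr = 0;

	snprintf(buf, buflen, "%llu", (unsigned long long)ino);
	while (exists(buf, ctx) && incr < TOY_MAX_TRIES)
		snprintf(buf, buflen, "%llu.%u",
			 (unsigned long long)ino, ++incr);

	return exists(buf, ctx) ? -1 : 0;
}
```

Using the inode number as the base name keeps the first attempt unique in the common case, so the suffix loop only matters when an earlier repair already left an entry with the same number behind.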
* [PATCH 1/3] xfs: move orphan files to the orphanage 2024-04-15 23:36 ` [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found Darrick J. Wong @ 2024-04-15 23:53 ` Darrick J. Wong 2024-04-15 23:53 ` [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero Darrick J. Wong 2024-04-15 23:53 ` [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:53 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When we're repairing a directory structure or fixing the dotdot entry of a subdirectory, it's possible that we won't ever find a parent for the subdirectory. When this is the case, move it to the orphanage, aka /lost+found. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 19 + fs/xfs/Makefile | 1 fs/xfs/scrub/dir_repair.c | 130 +++++ fs/xfs/scrub/orphanage.c | 498 ++++++++++++++++++++ fs/xfs/scrub/orphanage.h | 75 +++ fs/xfs/scrub/parent_repair.c | 100 ++++ fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/scrub.h | 4 fs/xfs/scrub/trace.h | 28 + fs/xfs/xfs_inode.c | 6 fs/xfs/xfs_inode.h | 1 11 files changed, 844 insertions(+), 20 deletions(-) create mode 100644 fs/xfs/scrub/orphanage.c create mode 100644 fs/xfs/scrub/orphanage.h diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 3afa1bc5f47c..37dddaaeda50 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4778,14 +4778,21 @@ Orphaned files are adopted by the orphanage as follows: The ``xrep_orphanage_iolock_two`` function follows the inode locking strategy discussed earlier. -3. 
Call ``xrep_orphanage_compute_blkres`` and ``xrep_orphanage_compute_name`` - to compute the new name in the orphanage and the block reservation required. - -4. Use ``xrep_orphanage_adoption_prep`` to reserve resources to the repair +3. Use ``xrep_adoption_trans_alloc`` to reserve resources to the repair transaction. -5. Call ``xrep_orphanage_adopt`` to reparent the orphaned file into the lost - and found, and update the kernel dentry cache. +4. Call ``xrep_orphanage_compute_name`` to compute the new name in the + orphanage. + +5. If the adoption is going to happen, call ``xrep_adoption_reparent`` to + reparent the orphaned file into the lost and found and invalidate the dentry + cache. + +6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the + orphanage ILOCK, and clean the scrub transaction. + +7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all + resources. The proposed patches are in the `orphanage adoption diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index d48646f86563..1e23d1b3cd7b 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -205,6 +205,7 @@ xfs-y += $(addprefix scrub/, \ inode_repair.o \ newbt.o \ nlinks_repair.o \ + orphanage.o \ parent_repair.o \ rcbag_btree.o \ rcbag.o \ diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c index 34fe720fde0e..c150b2efa2c2 100644 --- a/fs/xfs/scrub/dir_repair.c +++ b/fs/xfs/scrub/dir_repair.c @@ -42,6 +42,7 @@ #include "scrub/readdir.h" #include "scrub/reap.h" #include "scrub/findparent.h" +#include "scrub/orphanage.h" /* * Directory Repair @@ -115,12 +116,21 @@ struct xrep_dir { */ struct xrep_parent_scan_info pscan; + /* + * Context information for attaching this directory to the lost+found + * if this directory does not have a parent. + */ + struct xrep_adoption adoption; + /* How many subdirectories did we find? */ uint64_t subdirs; /* How many dirents did we find? */ unsigned int dirents; + /* Should we move this directory to the orphanage? 
*/ + bool needs_adoption; + /* Directory entry name, plus the trailing null. */ struct xfs_name xname; unsigned char namebuf[MAXNAMELEN]; @@ -148,6 +158,10 @@ xrep_setup_directory( xchk_fsgates_enable(sc, XCHK_FSGATES_DIRENTS); + error = xrep_orphanage_try_create(sc); + if (error) + return error; + error = xrep_tempfile_create(sc, S_IFDIR); if (error) return error; @@ -1137,10 +1151,8 @@ xrep_dir_set_nlink( /* * The directory is not on the incore unlinked list, which means that * it needs to be reachable via the directory tree. Update the nlink - * with our observed link count. - * - * XXX: A subsequent patch will handle parentless directories by moving - * them to the lost and found instead of aborting the repair. + * with our observed link count. If the directory has no parent, it + * will be moved to the orphanage. */ if (!xfs_inode_on_unlinked_list(dp)) goto reset_nlink; @@ -1151,6 +1163,7 @@ xrep_dir_set_nlink( * inactivate when the last reference drops. */ if (rd->dirents == 0) { + rd->needs_adoption = false; new_nlink = 0; goto reset_nlink; } @@ -1159,7 +1172,8 @@ xrep_dir_set_nlink( * The directory is on the unlinked list and we found dirents. This * directory needs to be reachable via the directory tree. Remove the * dir from the unlinked list and update nlink with the observed link - * count. + * count. If the directory has no parent, it will be moved to the + * orphanage. */ pag = xfs_perag_get(sc->mp, XFS_INO_TO_AGNO(sc->mp, dp->i_ino)); if (!pag) { @@ -1195,12 +1209,16 @@ xrep_dir_swap( return -EFSCORRUPTED; /* - * If we never found the parent for this directory, we can't fix this - * directory. + * If we never found the parent for this directory, temporarily assign + * the root dir as the parent; we'll move this to the orphanage after + * exchanging the dir contents. We hold the ILOCK of the dir being + * repaired, so we're not worried about racy updates of dotdot. 
*/ ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL); - if (rd->pscan.parent_ino == NULLFSINO) - return -EFSCORRUPTED; + if (rd->pscan.parent_ino == NULLFSINO) { + rd->needs_adoption = true; + rd->pscan.parent_ino = rd->sc->mp->m_sb.sb_rootino; + } /* * Reset the temporary directory's '..' entry to point to the parent @@ -1358,6 +1376,91 @@ xrep_dir_setup_scan( return error; } +/* + * Move the current file to the orphanage. + * + * Caller must hold IOLOCK_EXCL on @sc->ip, and no other inode locks. Upon + * successful return, the scrub transaction will have enough extra reservation + * to make the move; it will hold IOLOCK_EXCL and ILOCK_EXCL of @sc->ip and the + * orphanage; and both inodes will be ijoined. + */ +STATIC int +xrep_dir_move_to_orphanage( + struct xrep_dir *rd) +{ + struct xfs_scrub *sc = rd->sc; + xfs_ino_t orig_parent, new_parent; + int error; + + /* + * We are about to drop the ILOCK on sc->ip to lock the orphanage and + * prepare for the adoption. Therefore, look up the old dotdot entry + * for sc->ip so that we can compare it after we re-lock sc->ip. + */ + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, &orig_parent); + if (error) + return error; + + /* + * Drop the ILOCK on the scrub target and commit the transaction. + * Adoption computes its own resource requirements and gathers the + * necessary components. + */ + error = xrep_trans_commit(sc); + if (error) + return error; + xchk_iunlock(sc, XFS_ILOCK_EXCL); + + /* If we can take the orphanage's iolock then we're ready to move. */ + if (!xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) { + xchk_iunlock(sc, sc->ilock_flags); + error = xrep_orphanage_iolock_two(sc); + if (error) + return error; + } + + /* Grab transaction and ILOCK the two files. 
*/ + error = xrep_adoption_trans_alloc(sc, &rd->adoption); + if (error) + return error; + + error = xrep_adoption_compute_name(&rd->adoption, &rd->xname); + if (error) + return error; + + /* + * Now that we've reacquired the ILOCK on sc->ip, look up the dotdot + * entry again. If the parent changed or the child was unlinked while + * the child directory was unlocked, we don't need to move the child to + * the orphanage after all. + */ + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, &new_parent); + if (error) + return error; + + /* + * Attach to the orphanage if we still have a linked directory and it + * hasn't been moved. + */ + if (orig_parent == new_parent && VFS_I(sc->ip)->i_nlink > 0) { + error = xrep_adoption_move(&rd->adoption); + if (error) + return error; + } + + /* + * Launder the scrub transaction so we can drop the orphanage ILOCK + * and IOLOCK. Return holding the scrub target's ILOCK and IOLOCK. + */ + error = xrep_adoption_trans_roll(&rd->adoption); + if (error) + return error; + + xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL); + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + return 0; +} + /* * Repair the directory metadata. * @@ -1396,6 +1499,15 @@ xrep_directory( if (error) goto out_teardown; + if (rd->needs_adoption) { + if (!xrep_orphanage_can_adopt(rd->sc)) + error = -EFSCORRUPTED; + else + error = xrep_dir_move_to_orphanage(rd); + if (error) + goto out_teardown; + } + out_teardown: xrep_dir_teardown(sc); return error; diff --git a/fs/xfs/scrub/orphanage.c b/fs/xfs/scrub/orphanage.c new file mode 100644 index 000000000000..41733be3ef45 --- /dev/null +++ b/fs/xfs/scrub/orphanage.c @@ -0,0 +1,498 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_inode.h" +#include "xfs_ialloc.h" +#include "xfs_quota.h" +#include "xfs_trans_space.h" +#include "xfs_dir2.h" +#include "xfs_icache.h" +#include "xfs_bmap.h" +#include "xfs_bmap_btree.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/repair.h" +#include "scrub/trace.h" +#include "scrub/orphanage.h" +#include "scrub/readdir.h" + +#include <linux/namei.h> + +/* + * The Orphanage + * ============= + * + * If the directory tree is damaged, children of that directory become + * inaccessible via that file path. If a child has no other parents, the file + * is said to be orphaned. xfs_repair fixes this situation by creating an + * orphanage directory (specifically, /lost+found) and creating a directory + * entry pointing to the orphaned file. + * + * Online repair follows this tactic by creating a root-owned /lost+found + * directory if one does not exist. If an orphan is found, it will move that + * file into the orphanage. + */ + +/* Make the orphanage owned by root. */ +STATIC int +xrep_chown_orphanage( + struct xfs_scrub *sc, + struct xfs_inode *dp) +{ + struct xfs_trans *tp; + struct xfs_mount *mp = sc->mp; + struct xfs_dquot *udqp = NULL, *gdqp = NULL, *pdqp = NULL; + struct xfs_dquot *oldu = NULL, *oldg = NULL, *oldp = NULL; + struct inode *inode = VFS_I(dp); + int error; + + error = xfs_qm_vop_dqalloc(dp, GLOBAL_ROOT_UID, GLOBAL_ROOT_GID, 0, + XFS_QMOPT_QUOTALL, &udqp, &gdqp, &pdqp); + if (error) + return error; + + error = xfs_trans_alloc_ichange(dp, udqp, gdqp, pdqp, true, &tp); + if (error) + goto out_dqrele; + + /* + * Always clear setuid/setgid/sticky on the orphanage since we don't + * normally want that functionality on this directory and xfs_repair + * doesn't create it this way either.
Leave the other access bits + * unchanged. + */ + inode->i_mode &= ~(S_ISUID | S_ISGID | S_ISVTX); + + /* + * Change the ownerships and register quota modifications + * in the transaction. + */ + if (!uid_eq(inode->i_uid, GLOBAL_ROOT_UID)) { + if (XFS_IS_UQUOTA_ON(mp)) + oldu = xfs_qm_vop_chown(tp, dp, &dp->i_udquot, udqp); + inode->i_uid = GLOBAL_ROOT_UID; + } + if (!gid_eq(inode->i_gid, GLOBAL_ROOT_GID)) { + if (XFS_IS_GQUOTA_ON(mp)) + oldg = xfs_qm_vop_chown(tp, dp, &dp->i_gdquot, gdqp); + inode->i_gid = GLOBAL_ROOT_GID; + } + if (dp->i_projid != 0) { + if (XFS_IS_PQUOTA_ON(mp)) + oldp = xfs_qm_vop_chown(tp, dp, &dp->i_pdquot, pdqp); + dp->i_projid = 0; + } + + dp->i_diflags &= ~(XFS_DIFLAG_REALTIME | XFS_DIFLAG_RTINHERIT); + xfs_trans_log_inode(tp, dp, XFS_ILOG_CORE); + + XFS_STATS_INC(mp, xs_ig_attrchg); + + if (xfs_has_wsync(mp)) + xfs_trans_set_sync(tp); + error = xfs_trans_commit(tp); + + xfs_qm_dqrele(oldu); + xfs_qm_dqrele(oldg); + xfs_qm_dqrele(oldp); + +out_dqrele: + xfs_qm_dqrele(udqp); + xfs_qm_dqrele(gdqp); + xfs_qm_dqrele(pdqp); + return error; +} + +#define ORPHANAGE "lost+found" + +/* Create the orphanage directory, and set sc->orphanage to it. */ +int +xrep_orphanage_create( + struct xfs_scrub *sc) +{ + struct xfs_mount *mp = sc->mp; + struct dentry *root_dentry, *orphanage_dentry; + struct inode *root_inode = VFS_I(sc->mp->m_rootip); + struct inode *orphanage_inode; + int error; + + if (xfs_is_shutdown(mp)) + return -EIO; + if (xfs_is_readonly(mp)) { + sc->orphanage = NULL; + return 0; + } + + ASSERT(sc->tp == NULL); + ASSERT(sc->orphanage == NULL); + + /* Find the dentry for the root directory... */ + root_dentry = d_find_alias(root_inode); + if (!root_dentry) { + error = -EFSCORRUPTED; + goto out; + } + + /* ...which is a directory, right? */ + if (!d_is_dir(root_dentry)) { + error = -EFSCORRUPTED; + goto out_dput_root; + } + + /* Try to find the orphanage directory. 
*/ + inode_lock_nested(root_inode, I_MUTEX_PARENT); + orphanage_dentry = lookup_one_len(ORPHANAGE, root_dentry, + strlen(ORPHANAGE)); + if (IS_ERR(orphanage_dentry)) { + error = PTR_ERR(orphanage_dentry); + goto out_unlock_root; + } + + /* + * Nothing found? Call mkdir to create the orphanage. Create the + * directory without other-user access because we're live and someone + * could have been relying partly on minimal access to a parent + * directory to control access to a file we put in here. + */ + if (d_really_is_negative(orphanage_dentry)) { + error = vfs_mkdir(&nop_mnt_idmap, root_inode, orphanage_dentry, + 0750); + if (error) + goto out_dput_orphanage; + } + + /* Not a directory? Bail out. */ + if (!d_is_dir(orphanage_dentry)) { + error = -ENOTDIR; + goto out_dput_orphanage; + } + + /* + * Grab a reference to the orphanage. This /should/ succeed since + * we hold the root directory locked and therefore nobody can delete + * the orphanage. + */ + orphanage_inode = igrab(d_inode(orphanage_dentry)); + if (!orphanage_inode) { + error = -ENOENT; + goto out_dput_orphanage; + } + + /* Make sure the orphanage is owned by root. */ + error = xrep_chown_orphanage(sc, XFS_I(orphanage_inode)); + if (error) + goto out_dput_orphanage; + + /* Stash the reference for later and bail out. 
*/ + sc->orphanage = XFS_I(orphanage_inode); + sc->orphanage_ilock_flags = 0; + +out_dput_orphanage: + dput(orphanage_dentry); +out_unlock_root: + inode_unlock(VFS_I(sc->mp->m_rootip)); +out_dput_root: + dput(root_dentry); +out: + return error; +} + +void +xrep_orphanage_ilock( + struct xfs_scrub *sc, + unsigned int ilock_flags) +{ + sc->orphanage_ilock_flags |= ilock_flags; + xfs_ilock(sc->orphanage, ilock_flags); +} + +bool +xrep_orphanage_ilock_nowait( + struct xfs_scrub *sc, + unsigned int ilock_flags) +{ + if (xfs_ilock_nowait(sc->orphanage, ilock_flags)) { + sc->orphanage_ilock_flags |= ilock_flags; + return true; + } + + return false; +} + +void +xrep_orphanage_iunlock( + struct xfs_scrub *sc, + unsigned int ilock_flags) +{ + xfs_iunlock(sc->orphanage, ilock_flags); + sc->orphanage_ilock_flags &= ~ilock_flags; +} + +/* Grab the IOLOCK of the orphanage and sc->ip. */ +int +xrep_orphanage_iolock_two( + struct xfs_scrub *sc) +{ + int error = 0; + + while (true) { + if (xchk_should_terminate(sc, &error)) + return error; + + /* + * Normal XFS takes the IOLOCK before grabbing a transaction. + * Scrub holds a transaction, which means that we can't block + * on either IOLOCK. + */ + if (xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) { + if (xchk_ilock_nowait(sc, XFS_IOLOCK_EXCL)) + break; + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + } + delay(1); + } + + return 0; +} + +/* Release the orphanage. */ +void +xrep_orphanage_rele( + struct xfs_scrub *sc) +{ + if (!sc->orphanage) + return; + + if (sc->orphanage_ilock_flags) + xfs_iunlock(sc->orphanage, sc->orphanage_ilock_flags); + + xchk_irele(sc, sc->orphanage); + sc->orphanage = NULL; +} + +/* Adoption moves a file into /lost+found */ + +/* Can the orphanage adopt @sc->ip? 
*/ +bool +xrep_orphanage_can_adopt( + struct xfs_scrub *sc) +{ + ASSERT(sc->ip != NULL); + + if (!sc->orphanage) + return false; + if (sc->ip == sc->orphanage) + return false; + if (xfs_internal_inum(sc->mp, sc->ip->i_ino)) + return false; + return true; +} + +/* + * Create a new transaction to send a child to the orphanage. + * + * Allocate a new transaction with sufficient disk space to handle the + * adoption, take ILOCK_EXCL of the orphanage and sc->ip, joins them to the + * transaction, and reserve quota to reparent the latter. Caller must hold the + * IOLOCK of the orphanage and sc->ip. + */ +int +xrep_adoption_trans_alloc( + struct xfs_scrub *sc, + struct xrep_adoption *adopt) +{ + struct xfs_mount *mp = sc->mp; + unsigned int child_blkres = 0; + int error; + + ASSERT(sc->tp == NULL); + ASSERT(sc->ip != NULL); + ASSERT(sc->orphanage != NULL); + ASSERT(sc->ilock_flags & XFS_IOLOCK_EXCL); + ASSERT(sc->orphanage_ilock_flags & XFS_IOLOCK_EXCL); + ASSERT(!(sc->ilock_flags & (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL))); + ASSERT(!(sc->orphanage_ilock_flags & + (XFS_ILOCK_SHARED | XFS_ILOCK_EXCL))); + + /* Compute the worst case space reservation that we need. */ + adopt->sc = sc; + adopt->orphanage_blkres = XFS_LINK_SPACE_RES(mp, MAXNAMELEN); + if (S_ISDIR(VFS_I(sc->ip)->i_mode)) + child_blkres = XFS_RENAME_SPACE_RES(mp, xfs_name_dotdot.len); + adopt->child_blkres = child_blkres; + + /* + * Allocate a transaction to link the child into the parent, along with + * enough disk space to handle expansion of both the orphanage and the + * dotdot entry of a child directory. 
+ */ + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, + adopt->orphanage_blkres + adopt->child_blkres, 0, 0, + &sc->tp); + if (error) + return error; + + xfs_lock_two_inodes(sc->orphanage, XFS_ILOCK_EXCL, + sc->ip, XFS_ILOCK_EXCL); + sc->ilock_flags |= XFS_ILOCK_EXCL; + sc->orphanage_ilock_flags |= XFS_ILOCK_EXCL; + + xfs_trans_ijoin(sc->tp, sc->orphanage, 0); + xfs_trans_ijoin(sc->tp, sc->ip, 0); + + /* + * Reserve enough quota in the orphan directory to add the new name. + * Normally the orphanage should have user/group/project ids of zero + * and hence is not subject to quota enforcement, but we're allowed to + * exceed quota to reattach disconnected parts of the directory tree. + */ + error = xfs_trans_reserve_quota_nblks(sc->tp, sc->orphanage, + adopt->orphanage_blkres, 0, true); + if (error) + goto out_cancel; + + /* + * Reserve enough quota in the child directory to change dotdot. + * Here we're also allowed to exceed file quota to repair inconsistent + * metadata. + */ + if (adopt->child_blkres) { + error = xfs_trans_reserve_quota_nblks(sc->tp, sc->ip, + adopt->child_blkres, 0, true); + if (error) + goto out_cancel; + } + + return 0; +out_cancel: + xchk_trans_cancel(sc); + xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL); + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + return error; +} + +/* + * Compute the xfs_name for the directory entry that we're adding to the + * orphanage. Caller must hold ILOCKs of sc->ip and the orphanage and must not + * reuse namebuf until the adoption completes or is dissolved. + */ +int +xrep_adoption_compute_name( + struct xrep_adoption *adopt, + struct xfs_name *xname) +{ + struct xfs_scrub *sc = adopt->sc; + char *namebuf = (void *)xname->name; + xfs_ino_t ino; + unsigned int incr = 0; + int error = 0; + + adopt->xname = xname; + xname->len = snprintf(namebuf, MAXNAMELEN, "%llu", sc->ip->i_ino); + xname->type = xfs_mode_to_ftype(VFS_I(sc->ip)->i_mode); + + /* Make sure the filename is unique in the lost+found. 
*/ + error = xchk_dir_lookup(sc, sc->orphanage, xname, &ino); + while (error == 0 && incr < 10000) { + xname->len = snprintf(namebuf, MAXNAMELEN, "%llu.%u", + sc->ip->i_ino, ++incr); + error = xchk_dir_lookup(sc, sc->orphanage, xname, &ino); + } + if (error == 0) { + /* We already have 10,000 entries in the orphanage? */ + return -EFSCORRUPTED; + } + + if (error != -ENOENT) + return error; + return 0; +} + +/* + * Move the current file to the orphanage under the computed name. + * + * Returns with a dirty transaction so that the caller can handle any other + * work, such as fixing up unlinked lists or resetting link counts. + */ +int +xrep_adoption_move( + struct xrep_adoption *adopt) +{ + struct xfs_scrub *sc = adopt->sc; + bool isdir = S_ISDIR(VFS_I(sc->ip)->i_mode); + int error; + + trace_xrep_adoption_reparent(sc->orphanage, adopt->xname, + sc->ip->i_ino); + + /* Create the new name in the orphanage. */ + error = xfs_dir_createname(sc->tp, sc->orphanage, adopt->xname, + sc->ip->i_ino, adopt->orphanage_blkres); + if (error) + return error; + + /* + * Bump the link count of the orphanage if we just added a + * subdirectory, and update its timestamps. + */ + xfs_trans_ichgtime(sc->tp, sc->orphanage, + XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG); + if (isdir) + xfs_bumplink(sc->tp, sc->orphanage); + xfs_trans_log_inode(sc->tp, sc->orphanage, XFS_ILOG_CORE); + + /* Replace the dotdot entry if the child is a subdirectory. */ + if (isdir) { + error = xfs_dir_replace(sc->tp, sc->ip, &xfs_name_dotdot, + sc->orphanage->i_ino, adopt->child_blkres); + if (error) + return error; + } + + /* + * Notify dirent hooks that we moved the file to /lost+found, and + * finish all the deferred work so that we know the adoption is fully + * recorded in the log. + */ + xfs_dir_update_hook(sc->orphanage, sc->ip, 1, adopt->xname); + return 0; +} + +/* + * Roll to a clean scrub transaction so that we can release the orphanage, + * even if xrep_adoption_move was not called. 
+ * + * Commits all the work and deferred ops attached to an adoption request and + * rolls to a clean scrub transaction. On success, returns 0 with the scrub + * context holding a clean transaction with no inodes joined. On failure, + * returns negative errno with no scrub transaction. All inode locks are + * still held after this function returns. + */ +int +xrep_adoption_trans_roll( + struct xrep_adoption *adopt) +{ + struct xfs_scrub *sc = adopt->sc; + int error; + + trace_xrep_adoption_trans_roll(sc->orphanage, sc->ip, + !!(sc->tp->t_flags & XFS_TRANS_DIRTY)); + + /* Finish all the deferred ops to commit all repairs. */ + error = xrep_defer_finish(sc); + if (error) + return error; + + /* Roll the transaction once more to detach the inodes. */ + return xfs_trans_roll(&sc->tp); +} diff --git a/fs/xfs/scrub/orphanage.h b/fs/xfs/scrub/orphanage.h new file mode 100644 index 000000000000..319179ab788d --- /dev/null +++ b/fs/xfs/scrub/orphanage.h @@ -0,0 +1,75 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2021-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_ORPHANAGE_H__ +#define __XFS_SCRUB_ORPHANAGE_H__ + +#ifdef CONFIG_XFS_ONLINE_REPAIR +int xrep_orphanage_create(struct xfs_scrub *sc); + +/* + * If we're doing a repair, ensure that the orphanage exists and attach it to + * the scrub context. + */ +static inline int +xrep_orphanage_try_create( + struct xfs_scrub *sc) +{ + int error; + + ASSERT(sc->sm->sm_flags & XFS_SCRUB_IFLAG_REPAIR); + + error = xrep_orphanage_create(sc); + switch (error) { + case 0: + case -ENOENT: + case -ENOTDIR: + case -ENOSPC: + /* + * If the orphanage can't be found or isn't a directory, we'll + * keep going, but we won't be able to attach the file to the + * orphanage if we can't find the parent. 
+ */ + return 0; + } + + return error; +} + +int xrep_orphanage_iolock_two(struct xfs_scrub *sc); + +void xrep_orphanage_ilock(struct xfs_scrub *sc, unsigned int ilock_flags); +bool xrep_orphanage_ilock_nowait(struct xfs_scrub *sc, + unsigned int ilock_flags); +void xrep_orphanage_iunlock(struct xfs_scrub *sc, unsigned int ilock_flags); + +void xrep_orphanage_rele(struct xfs_scrub *sc); + +/* Information about a request to add a file to the orphanage. */ +struct xrep_adoption { + struct xfs_scrub *sc; + + /* Name used for the adoption. */ + struct xfs_name *xname; + + /* Block reservations for orphanage and child (if directory). */ + unsigned int orphanage_blkres; + unsigned int child_blkres; +}; + +bool xrep_orphanage_can_adopt(struct xfs_scrub *sc); + +int xrep_adoption_trans_alloc(struct xfs_scrub *sc, + struct xrep_adoption *adopt); +int xrep_adoption_compute_name(struct xrep_adoption *adopt, + struct xfs_name *xname); +int xrep_adoption_move(struct xrep_adoption *adopt); +int xrep_adoption_trans_roll(struct xrep_adoption *adopt); +#else +struct xrep_adoption { /* empty */ }; +# define xrep_orphanage_rele(sc) ((void)0) +#endif /* CONFIG_XFS_ONLINE_REPAIR */ + +#endif /* __XFS_SCRUB_ORPHANAGE_H__ */ diff --git a/fs/xfs/scrub/parent_repair.c b/fs/xfs/scrub/parent_repair.c index 826926c2bb0d..ebb5791bf839 100644 --- a/fs/xfs/scrub/parent_repair.c +++ b/fs/xfs/scrub/parent_repair.c @@ -32,6 +32,8 @@ #include "scrub/iscan.h" #include "scrub/findparent.h" #include "scrub/readdir.h" +#include "scrub/tempfile.h" +#include "scrub/orphanage.h" /* * Repairing The Directory Parent Pointer @@ -57,6 +59,13 @@ struct xrep_parent { * dotdot entry for this directory. */ struct xrep_parent_scan_info pscan; + + /* Orphanage reparenting request. */ + struct xrep_adoption adoption; + + /* Directory entry name, plus the trailing null. */ + struct xfs_name xname; + unsigned char namebuf[MAXNAMELEN]; }; /* Tear down all the incore stuff we created. 
*/ @@ -80,9 +89,10 @@ xrep_setup_parent( if (!rp) return -ENOMEM; rp->sc = sc; + rp->xname.name = rp->namebuf; sc->buf = rp; - return 0; + return xrep_orphanage_try_create(sc); } /* @@ -179,6 +189,91 @@ xrep_parent_reset_dotdot( return xfs_trans_roll(&sc->tp); } +/* + * Move the current file to the orphanage. + * + * Caller must hold IOLOCK_EXCL on @sc->ip, and no other inode locks. Upon + * successful return, the scrub transaction will have enough extra reservation + * to make the move; it will hold IOLOCK_EXCL and ILOCK_EXCL of @sc->ip and the + * orphanage; and both inodes will be ijoined. + */ +STATIC int +xrep_parent_move_to_orphanage( + struct xrep_parent *rp) +{ + struct xfs_scrub *sc = rp->sc; + xfs_ino_t orig_parent, new_parent; + int error; + + /* + * We are about to drop the ILOCK on sc->ip to lock the orphanage and + * prepare for the adoption. Therefore, look up the old dotdot entry + * for sc->ip so that we can compare it after we re-lock sc->ip. + */ + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, &orig_parent); + if (error) + return error; + + /* + * Drop the ILOCK on the scrub target and commit the transaction. + * Adoption computes its own resource requirements and gathers the + * necessary components. + */ + error = xrep_trans_commit(sc); + if (error) + return error; + xchk_iunlock(sc, XFS_ILOCK_EXCL); + + /* If we can take the orphanage's iolock then we're ready to move. */ + if (!xrep_orphanage_ilock_nowait(sc, XFS_IOLOCK_EXCL)) { + xchk_iunlock(sc, sc->ilock_flags); + error = xrep_orphanage_iolock_two(sc); + if (error) + return error; + } + + /* Grab transaction and ILOCK the two files. */ + error = xrep_adoption_trans_alloc(sc, &rp->adoption); + if (error) + return error; + + error = xrep_adoption_compute_name(&rp->adoption, &rp->xname); + if (error) + return error; + + /* + * Now that we've reacquired the ILOCK on sc->ip, look up the dotdot + * entry again. 
If the parent changed or the child was unlinked while + * the child directory was unlocked, we don't need to move the child to + * the orphanage after all. + */ + error = xchk_dir_lookup(sc, sc->ip, &xfs_name_dotdot, &new_parent); + if (error) + return error; + + /* + * Attach to the orphanage if we still have a linked directory and it + * hasn't been moved. + */ + if (orig_parent == new_parent && VFS_I(sc->ip)->i_nlink > 0) { + error = xrep_adoption_move(&rp->adoption); + if (error) + return error; + } + + /* + * Launder the scrub transaction so we can drop the orphanage ILOCK + * and IOLOCK. Return holding the scrub target's ILOCK and IOLOCK. + */ + error = xrep_adoption_trans_roll(&rp->adoption); + if (error) + return error; + + xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL); + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + return 0; +} + /* * Commit the new parent pointer structure (currently only the dotdot entry) to * the file that we're repairing. @@ -188,7 +283,8 @@ xrep_parent_rebuild_tree( struct xrep_parent *rp) { if (rp->pscan.parent_ino == NULLFSINO) { - /* Cannot fix orphaned directories yet. */ + if (xrep_orphanage_can_adopt(rp->sc)) + return xrep_parent_move_to_orphanage(rp); return -EFSCORRUPTED; } diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 520d83db193c..6417628ce26b 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -27,6 +27,7 @@ #include "scrub/stats.h" #include "scrub/xfile.h" #include "scrub/tempfile.h" +#include "scrub/orphanage.h" /* * Online Scrub and Repair @@ -217,6 +218,7 @@ xchk_teardown( } xrep_tempfile_rele(sc); + xrep_orphanage_rele(sc); xchk_fsgates_disable(sc); return error; } diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index d38f0b30416c..7abe498f7a46 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -105,6 +105,10 @@ struct xfs_scrub { /* Lock flags for @ip. */ uint ilock_flags; + /* The orphanage, for stashing files that have lost their parent. 
*/ + uint orphanage_ilock_flags; + struct xfs_inode *orphanage; + /* A temporary file on this filesystem, for staging new metadata. */ struct xfs_inode *tempip; uint temp_ilock_flags; diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index d68ec8e2781e..7c49aa6f6b8d 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2588,6 +2588,34 @@ DEFINE_EVENT(xrep_dirent_class, name, \ DEFINE_XREP_DIRENT_EVENT(xrep_dir_salvage_entry); DEFINE_XREP_DIRENT_EVENT(xrep_dir_stash_createname); DEFINE_XREP_DIRENT_EVENT(xrep_dir_replay_createname); +DEFINE_XREP_DIRENT_EVENT(xrep_adoption_reparent); + +DECLARE_EVENT_CLASS(xrep_adoption_class, + TP_PROTO(struct xfs_inode *dp, struct xfs_inode *ip, bool moved), + TP_ARGS(dp, ip, moved), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, dir_ino) + __field(xfs_ino_t, child_ino) + __field(bool, moved) + ), + TP_fast_assign( + __entry->dev = dp->i_mount->m_super->s_dev; + __entry->dir_ino = dp->i_ino; + __entry->child_ino = ip->i_ino; + __entry->moved = moved; + ), + TP_printk("dev %d:%d dir 0x%llx child 0x%llx moved? %d", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->dir_ino, + __entry->child_ino, + __entry->moved) +); +#define DEFINE_XREP_ADOPTION_EVENT(name) \ +DEFINE_EVENT(xrep_adoption_class, name, \ + TP_PROTO(struct xfs_inode *dp, struct xfs_inode *ip, bool moved), \ + TP_ARGS(dp, ip, moved)) +DEFINE_XREP_ADOPTION_EVENT(xrep_adoption_trans_roll); DECLARE_EVENT_CLASS(xrep_parent_salvage_class, TP_PROTO(struct xfs_inode *dp, xfs_ino_t ino), diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 09d643a9e997..803a64687014 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -914,10 +914,10 @@ xfs_droplink( /* * Increment the link count on an inode & log the change. 
*/ -static void +void xfs_bumplink( - xfs_trans_t *tp, - xfs_inode_t *ip) + struct xfs_trans *tp, + struct xfs_inode *ip) { xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG); diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 8157ae7f8e59..18bc3d7750a0 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -625,6 +625,7 @@ void xfs_end_io(struct work_struct *work); int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_iunlock2_remapping(struct xfs_inode *ip1, struct xfs_inode *ip2); +void xfs_bumplink(struct xfs_trans *tp, struct xfs_inode *ip); static inline bool xfs_inode_unlinked_incomplete( ^ permalink raw reply related [flat|nested] 100+ messages in thread
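Editorial aside, not part of the posted patch: the naming loop in xrep_adoption_compute_name can be sketched in userspace roughly as follows. This is a minimal model under stated assumptions. The sketch_* names, the callback type, and the list-based lookup helper are all hypothetical stand-ins (the kernel code calls xchk_dir_lookup against the orphanage and returns -EFSCORRUPTED when it runs out of candidates); only the "<ino>" then "<ino>.<n>" scheme and the 10000 bound come from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

#define SKETCH_MAXNAMELEN	256	/* stands in for MAXNAMELEN */
#define SKETCH_MAX_SUFFIX	10000	/* same retry bound as the patch */

/* Hypothetical stand-in for the directory lookup: true if @name is taken. */
typedef bool (*sketch_name_taken_fn)(const char *name, void *priv);

/* Example lookup helper: @priv is a NULL-terminated array of used names. */
bool
sketch_name_in_list(
	const char	*name,
	void		*priv)
{
	const char	**used = priv;

	for (; *used != NULL; used++)
		if (strcmp(*used, name) == 0)
			return true;
	return false;
}

/*
 * Try "<ino>" first, then "<ino>.1", "<ino>.2", and so on until a free name
 * turns up.  Returns the length of the chosen name, or -1 if the base name
 * and every suffixed candidate are already taken.
 */
int
sketch_adoption_compute_name(
	unsigned long long	ino,
	sketch_name_taken_fn	taken,
	void			*priv,
	char			*namebuf,
	size_t			buflen)
{
	unsigned int		incr = 0;
	bool			found;
	int			len;

	/* 20 digits + '.' + 5 digits + NUL fits comfortably in 32 bytes. */
	assert(buflen >= 32);

	len = snprintf(namebuf, buflen, "%llu", ino);
	found = taken(namebuf, priv);
	while (found && incr < SKETCH_MAX_SUFFIX) {
		len = snprintf(namebuf, buflen, "%llu.%u", ino, ++incr);
		found = taken(namebuf, priv);
	}

	return found ? -1 : len;
}
```

A caller would pass its own lookup callback; the list-based helper here exists only to make the sketch easy to exercise.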
* [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero 2024-04-15 23:36 ` [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong @ 2024-04-15 23:53 ` Darrick J. Wong 2024-04-15 23:53 ` [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:53 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> If we encounter an inode with a nonzero link count but zero observed links, move it to the orphanage. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 3 fs/xfs/scrub/nlinks.c | 20 ++- fs/xfs/scrub/nlinks.h | 7 + fs/xfs/scrub/nlinks_repair.c | 123 ++++++++++++++++++-- fs/xfs/scrub/repair.h | 2 fs/xfs/scrub/trace.c | 1 fs/xfs/scrub/trace.h | 26 ++++ 7 files changed, 163 insertions(+), 19 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 37dddaaeda50..74a8e42c74bd 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4789,7 +4789,8 @@ Orphaned files are adopted by the orphanage as follows: cache. 6. Call ``xrep_adoption_finish`` to commit any filesystem updates, release the - orphanage ILOCK, and clean the scrub transaction. + orphanage ILOCK, and clean the scrub transaction. Call + ``xrep_adoption_commit`` to commit the updates and the scrub transaction. 7. If a runtime error happens, call ``xrep_adoption_cancel`` to release all resources. 
diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c index 8b9aa73093d6..c456523fac9c 100644 --- a/fs/xfs/scrub/nlinks.c +++ b/fs/xfs/scrub/nlinks.c @@ -24,6 +24,7 @@ #include "scrub/xfile.h" #include "scrub/xfarray.h" #include "scrub/iscan.h" +#include "scrub/orphanage.h" #include "scrub/nlinks.h" #include "scrub/trace.h" #include "scrub/readdir.h" @@ -44,11 +45,23 @@ int xchk_setup_nlinks( struct xfs_scrub *sc) { + struct xchk_nlink_ctrs *xnc; + int error; + xchk_fsgates_enable(sc, XCHK_FSGATES_DIRENTS); - sc->buf = kzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS); - if (!sc->buf) + if (xchk_could_repair(sc)) { + error = xrep_setup_nlinks(sc); + if (error) + return error; + } + + xnc = kvzalloc(sizeof(struct xchk_nlink_ctrs), XCHK_GFP_FLAGS); + if (!xnc) return -ENOMEM; + xnc->xname.name = xnc->namebuf; + xnc->sc = sc; + sc->buf = xnc; return xchk_setup_fs(sc); } @@ -873,9 +886,6 @@ xchk_nlinks_setup_scan( xfs_agino_t first_agino, last_agino; int error; - ASSERT(xnc->sc == NULL); - xnc->sc = sc; - mutex_init(&xnc->lock); /* Retry iget every tenth of a second for up to 30 seconds. */ diff --git a/fs/xfs/scrub/nlinks.h b/fs/xfs/scrub/nlinks.h index a950f3daf204..b820712bfd87 100644 --- a/fs/xfs/scrub/nlinks.h +++ b/fs/xfs/scrub/nlinks.h @@ -28,6 +28,13 @@ struct xchk_nlink_ctrs { * from other writer threads. */ struct xfs_dir_hook dhook; + + /* Orphanage reparenting request. */ + struct xrep_adoption adoption; + + /* Directory entry name, plus the trailing null. */ + struct xfs_name xname; + char namebuf[MAXNAMELEN]; }; /* diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c index 23eb08c4b5ad..0cb67339eac8 100644 --- a/fs/xfs/scrub/nlinks_repair.c +++ b/fs/xfs/scrub/nlinks_repair.c @@ -24,6 +24,7 @@ #include "scrub/xfile.h" #include "scrub/xfarray.h" #include "scrub/iscan.h" +#include "scrub/orphanage.h" #include "scrub/nlinks.h" #include "scrub/trace.h" #include "scrub/tempfile.h" @@ -38,6 +39,34 @@ * inode is locked. 
*/ +/* Set up to repair inode link counts. */ +int +xrep_setup_nlinks( + struct xfs_scrub *sc) +{ + return xrep_orphanage_try_create(sc); +} + +/* + * Inodes that aren't the root directory or the orphanage, have a nonzero link + * count, and no observed parents should be moved to the orphanage. + */ +static inline bool +xrep_nlinks_is_orphaned( + struct xfs_scrub *sc, + struct xfs_inode *ip, + unsigned int actual_nlink, + const struct xchk_nlink *obs) +{ + struct xfs_mount *mp = ip->i_mount; + + if (obs->parents != 0) + return false; + if (ip == mp->m_rootip || ip == sc->orphanage) + return false; + return actual_nlink != 0; +} + /* Remove an inode from the unlinked list. */ STATIC int xrep_nlinks_iunlink_remove( @@ -66,6 +95,7 @@ xrep_nlinks_repair_inode( struct xfs_inode *ip = sc->ip; uint64_t total_links; uint64_t actual_nlink; + bool orphanage_available = false; bool dirty = false; int error; @@ -77,14 +107,41 @@ xrep_nlinks_repair_inode( if (xrep_is_tempfile(ip)) return 0; - xchk_ilock(sc, XFS_IOLOCK_EXCL); + /* + * If the filesystem has an orphanage attached to the scrub context, + * prepare for a link count repair that could involve @ip being adopted + * by the lost+found. + */ + if (xrep_orphanage_can_adopt(sc)) { + error = xrep_orphanage_iolock_two(sc); + if (error) + return error; - error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, &sc->tp); - if (error) - return error; + error = xrep_adoption_trans_alloc(sc, &xnc->adoption); + if (error) { + xchk_iunlock(sc, XFS_IOLOCK_EXCL); + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + } else { + orphanage_available = true; + } + } - xchk_ilock(sc, XFS_ILOCK_EXCL); - xfs_trans_ijoin(sc->tp, ip, 0); + /* + * Either there is no orphanage or we couldn't allocate resources for + * that kind of update. Let's try again with only the resources we + * need for a simple link count update, since that's much more common. 
+ */ + if (!orphanage_available) { + xchk_ilock(sc, XFS_IOLOCK_EXCL); + + error = xfs_trans_alloc(mp, &M_RES(mp)->tr_link, 0, 0, 0, + &sc->tp); + if (error) + return error; + + xchk_ilock(sc, XFS_ILOCK_EXCL); + xfs_trans_ijoin(sc->tp, ip, 0); + } mutex_lock(&xnc->lock); @@ -122,6 +179,41 @@ xrep_nlinks_repair_inode( goto out_trans; } + /* + * Decide if we're going to move this file to the orphanage, and fix + * up the incore link counts if we are. + */ + if (orphanage_available && + xrep_nlinks_is_orphaned(sc, ip, actual_nlink, &obs)) { + /* Figure out what name we're going to use here. */ + error = xrep_adoption_compute_name(&xnc->adoption, &xnc->xname); + if (error) + goto out_trans; + + /* + * Reattach this file to the directory tree by moving it to + * the orphanage per the adoption parameters that we already + * computed. + */ + error = xrep_adoption_move(&xnc->adoption); + if (error) + goto out_trans; + + /* + * Re-read the link counts since the reparenting will have + * updated our scan info. + */ + mutex_lock(&xnc->lock); + error = xfarray_load_sparse(xnc->nlinks, ip->i_ino, &obs); + mutex_unlock(&xnc->lock); + if (error) + goto out_trans; + + total_links = xchk_nlink_total(ip, &obs); + actual_nlink = VFS_I(ip)->i_nlink; + dirty = true; + } + /* * If this inode is linked from the directory tree and on the unlinked * list, remove it from the unlinked list. 
@@ -165,14 +257,19 @@ xrep_nlinks_repair_inode( xfs_trans_log_inode(sc->tp, ip, XFS_ILOG_CORE); error = xrep_trans_commit(sc); - xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); - return error; + goto out_unlock; out_scanlock: mutex_unlock(&xnc->lock); out_trans: xchk_trans_cancel(sc); - xchk_iunlock(sc, XFS_ILOCK_EXCL | XFS_IOLOCK_EXCL); +out_unlock: + xchk_iunlock(sc, XFS_ILOCK_EXCL); + if (orphanage_available) { + xrep_orphanage_iunlock(sc, XFS_ILOCK_EXCL); + xrep_orphanage_iunlock(sc, XFS_IOLOCK_EXCL); + } + xchk_iunlock(sc, XFS_IOLOCK_EXCL); return error; } @@ -205,10 +302,10 @@ xrep_nlinks( /* * We need ftype for an accurate count of the number of child * subdirectory links. Child subdirectories with a back link (dotdot - * entry) but no forward link are unfixable, so we cannot repair the - * link count of the parent directory based on the back link count - * alone. Filesystems without ftype support are rare (old V4) so we - * just skip out here. + * entry) but no forward link are moved to the orphanage, so we cannot + * repair the link count of the parent directory based on the back link + * count alone. Filesystems without ftype support are rare (old V4) so + * we just skip out here. 
*/ if (!xfs_has_ftype(sc->mp)) return -EOPNOTSUPP; diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index e53374fa5430..7e6aba7fe558 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -93,6 +93,7 @@ int xrep_setup_ag_refcountbt(struct xfs_scrub *sc); int xrep_setup_xattr(struct xfs_scrub *sc); int xrep_setup_directory(struct xfs_scrub *sc); int xrep_setup_parent(struct xfs_scrub *sc); +int xrep_setup_nlinks(struct xfs_scrub *sc); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -201,6 +202,7 @@ xrep_setup_nothing( #define xrep_setup_xattr xrep_setup_nothing #define xrep_setup_directory xrep_setup_nothing #define xrep_setup_parent xrep_setup_nothing +#define xrep_setup_nlinks xrep_setup_nothing #define xrep_setup_inode(sc, imap) ((void)0) diff --git a/fs/xfs/scrub/trace.c b/fs/xfs/scrub/trace.c index 3dd281d6d185..b2ce7b22cad3 100644 --- a/fs/xfs/scrub/trace.c +++ b/fs/xfs/scrub/trace.c @@ -24,6 +24,7 @@ #include "scrub/xfarray.h" #include "scrub/quota.h" #include "scrub/iscan.h" +#include "scrub/orphanage.h" #include "scrub/nlinks.h" #include "scrub/fscounters.h" diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 7c49aa6f6b8d..2a4c54f7992a 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2643,6 +2643,32 @@ DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_dir_salvaged_parent); DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_findparent_dirent); DEFINE_XREP_PARENT_SALVAGE_EVENT(xrep_findparent_from_dcache); +TRACE_EVENT(xrep_nlinks_set_record, + TP_PROTO(struct xfs_mount *mp, xfs_ino_t ino, + const struct xchk_nlink *obs), + TP_ARGS(mp, ino, obs), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(xfs_nlink_t, parents) + __field(xfs_nlink_t, backrefs) + __field(xfs_nlink_t, children) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->ino = ino; + __entry->parents = obs->parents; + __entry->backrefs = obs->backrefs; + __entry->children = obs->children; + ), + 
TP_printk("dev %d:%d ino 0x%llx parents %u backrefs %u children %u", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->parents, + __entry->backrefs, + __entry->children) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
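Editorial aside, not part of the posted patch: the decision made by xrep_nlinks_is_orphaned can be modelled in userspace as below. This is only a sketch. The sketch_* structures are hypothetical simplifications (inode numbers stand in for the inode pointers the kernel compares, and only the observed parent count is modelled); the three conditions themselves are taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced stand-in for the observed link counts. */
struct sketch_nlink_obs {
	uint32_t	parents;	/* observed links from parent dirs */
};

/* Hypothetical stand-in for the bits of mount/scrub state we need. */
struct sketch_fs {
	uint64_t	root_ino;	/* root directory */
	uint64_t	orphanage_ino;	/* lost+found attached to the scrub */
};

/*
 * A file is a candidate for adoption when nothing points at it (no observed
 * parents), it is neither the root directory nor the orphanage itself, and
 * its recorded link count is still nonzero.
 */
bool
sketch_is_orphaned(
	const struct sketch_fs		*fs,
	uint64_t			ino,
	uint32_t			actual_nlink,
	const struct sketch_nlink_obs	*obs)
{
	assert(fs != NULL && obs != NULL);

	if (obs->parents != 0)
		return false;
	if (ino == fs->root_ino || ino == fs->orphanage_ino)
		return false;
	return actual_nlink != 0;
}
```

Note that a file with zero observed parents and a zero link count is not treated as an orphan here; in the patch that case goes on to the ordinary link count and unlinked-list handling instead.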
* [PATCH 3/3] xfs: ensure dentry consistency when the orphanage adopts a file 2024-04-15 23:36 ` [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: move orphan files to the orphanage Darrick J. Wong 2024-04-15 23:53 ` [PATCH 2/3] xfs: move files to orphanage instead of letting nlinks drop to zero Darrick J. Wong @ 2024-04-15 23:53 ` Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:53 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When the orphanage adopts a file, that file becomes a child of the orphanage. The dentry cache may have entries for the orphanage directory and the name we've chosen, so (1) make sure we abort if the dcache has a positive entry because something's not right; and (2) invalidate and purge negative dentries if the adoption goes through. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/orphanage.c | 91 ++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/trace.h | 42 +++++++++++++++++++++ 2 files changed, 133 insertions(+) diff --git a/fs/xfs/scrub/orphanage.c b/fs/xfs/scrub/orphanage.c index 41733be3ef45..885b7d478a0a 100644 --- a/fs/xfs/scrub/orphanage.c +++ b/fs/xfs/scrub/orphanage.c @@ -418,6 +418,90 @@ xrep_adoption_compute_name( return 0; } +/* + * Make sure the dcache does not have a positive dentry for the name we've + * chosen. The caller should have checked with the ondisk directory, so any + * discrepancy is a sign that something is seriously wrong. 
+ */ +static int +xrep_adoption_check_dcache( + struct xrep_adoption *adopt) +{ + struct qstr qname = QSTR_INIT(adopt->xname->name, + adopt->xname->len); + struct dentry *d_orphanage, *d_child; + int error = 0; + + d_orphanage = d_find_alias(VFS_I(adopt->sc->orphanage)); + if (!d_orphanage) + return 0; + + d_child = d_hash_and_lookup(d_orphanage, &qname); + if (d_child) { + trace_xrep_adoption_check_child(adopt->sc->mp, d_child); + + if (d_is_positive(d_child)) { + ASSERT(d_is_negative(d_child)); + error = -EFSCORRUPTED; + } + + dput(d_child); + } + + dput(d_orphanage); + if (error) + return error; + + /* + * Do we need to update d_parent of the dentry for the file being + * repaired? There shouldn't be a hashed dentry with a parent since + * the file had nonzero nlink but wasn't connected to any parent dir. + */ + d_child = d_find_alias(VFS_I(adopt->sc->ip)); + if (!d_child) + return 0; + + trace_xrep_adoption_check_alias(adopt->sc->mp, d_child); + + if (d_child->d_parent && !d_unhashed(d_child)) { + ASSERT(d_child->d_parent == NULL || d_unhashed(d_child)); + error = -EFSCORRUPTED; + } + + dput(d_child); + return error; +} + +/* + * Remove all negative dentries from the dcache. There should not be any + * positive entries, since we've maintained our lock on the orphanage + * directory. + */ +static void +xrep_adoption_zap_dcache( + struct xrep_adoption *adopt) +{ + struct qstr qname = QSTR_INIT(adopt->xname->name, + adopt->xname->len); + struct dentry *d_orphanage, *d_child; + + d_orphanage = d_find_alias(VFS_I(adopt->sc->orphanage)); + if (!d_orphanage) + return; + + d_child = d_hash_and_lookup(d_orphanage, &qname); + while (d_child != NULL) { + trace_xrep_adoption_invalidate_child(adopt->sc->mp, d_child); + + ASSERT(d_is_negative(d_child)); + d_invalidate(d_child); + dput(d_child); + d_child = d_lookup(d_orphanage, &qname); + } + + dput(d_orphanage); +} + /* * Move the current file to the orphanage under the computed name. 
* @@ -435,6 +519,10 @@ xrep_adoption_move( trace_xrep_adoption_reparent(sc->orphanage, adopt->xname, sc->ip->i_ino); + error = xrep_adoption_check_dcache(adopt); + if (error) + return error; + /* Create the new name in the orphanage. */ error = xfs_dir_createname(sc->tp, sc->orphanage, adopt->xname, sc->ip->i_ino, adopt->orphanage_blkres); @@ -465,6 +553,9 @@ xrep_adoption_move( * recorded in the log. */ xfs_dir_update_hook(sc->orphanage, sc->ip, 1, adopt->xname); + + /* Remove negative dentries from the lost+found's dcache */ + xrep_adoption_zap_dcache(adopt); return 0; } diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 2a4c54f7992a..668da6ff2ca2 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2669,6 +2669,48 @@ TRACE_EVENT(xrep_nlinks_set_record, __entry->children) ); +DECLARE_EVENT_CLASS(xrep_dentry_class, + TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), + TP_ARGS(mp, dentry), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(unsigned int, flags) + __field(unsigned long, ino) + __field(bool, positive) + __field(unsigned long, parent_ino) + __field(unsigned int, namelen) + __dynamic_array(char, name, dentry->d_name.len) + ), + TP_fast_assign( + __entry->dev = mp->m_super->s_dev; + __entry->flags = dentry->d_flags; + __entry->positive = d_is_positive(dentry); + if (dentry->d_parent && d_inode(dentry->d_parent)) + __entry->parent_ino = d_inode(dentry->d_parent)->i_ino; + else + __entry->parent_ino = -1UL; + __entry->ino = d_inode(dentry) ? d_inode(dentry)->i_ino : 0; + __entry->namelen = dentry->d_name.len; + memcpy(__get_str(name), dentry->d_name.name, dentry->d_name.len); + ), + TP_printk("dev %d:%d flags 0x%x positive? 
%d parent_ino 0x%lx ino 0x%lx name '%.*s'", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->flags, + __entry->positive, + __entry->parent_ino, + __entry->ino, + __entry->namelen, + __get_str(name)) +); +#define DEFINE_REPAIR_DENTRY_EVENT(name) \ +DEFINE_EVENT(xrep_dentry_class, name, \ + TP_PROTO(struct xfs_mount *mp, const struct dentry *dentry), \ + TP_ARGS(mp, dentry)) +DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_child); +DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_alias); +DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_dentry); +DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_invalidate_child); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
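Editorial aside, not part of the posted patch: the two dcache rules described in the commit message (refuse the adoption if a positive entry already exists for the chosen name, then drop negative entries once the adoption goes through) can be illustrated with a toy name cache. Everything here is hypothetical: the sketch_* types, the flat array standing in for the dcache, and the placeholder error value. The real code works on struct dentry via d_find_alias, d_hash_and_lookup, and d_invalidate, and returns -EFSCORRUPTED.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_ECORRUPT	117	/* placeholder for the kernel's EFSCORRUPTED */

enum sketch_dentry_state {
	SKETCH_NEGATIVE,	/* cached "no such name" result */
	SKETCH_POSITIVE,	/* cached name that resolves to a file */
};

/* Hypothetical toy cache entry; the real dcache is far more involved. */
struct sketch_dentry {
	const char			*name;
	enum sketch_dentry_state	state;
	bool				valid;	/* cleared on invalidation */
};

/* Before the move: a positive entry for @name means something is wrong. */
int
sketch_check_cache(
	const struct sketch_dentry	*cache,
	size_t				nr,
	const char			*name)
{
	size_t				i;

	for (i = 0; i < nr; i++) {
		if (!cache[i].valid || strcmp(cache[i].name, name) != 0)
			continue;
		if (cache[i].state == SKETCH_POSITIVE)
			return -SKETCH_ECORRUPT;
	}
	return 0;
}

/* After the move: invalidate stale negative entries for @name. */
size_t
sketch_zap_cache(
	struct sketch_dentry	*cache,
	size_t			nr,
	const char		*name)
{
	size_t			zapped = 0;
	size_t			i;

	for (i = 0; i < nr; i++) {
		if (!cache[i].valid || strcmp(cache[i].name, name) != 0)
			continue;
		/* The check above should have rejected positive entries. */
		assert(cache[i].state == SKETCH_NEGATIVE);
		cache[i].valid = false;
		zapped++;
	}
	return zapped;
}
```

The ordering matters in the same way as in the patch: the check runs before any directory update is made, and the zap runs only after the new name has been created.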
* [PATCHSET v30.3 11/16] xfs: online repair of symbolic links 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (9 preceding siblings ...) 2024-04-15 23:36 ` [PATCHSET v30.3 10/16] xfs: move orphan files to lost and found Darrick J. Wong @ 2024-04-15 23:36 ` Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: expose xfs_bmap_local_to_extents for online repair Darrick J. Wong ` (2 more replies) 2024-04-15 23:36 ` [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets Darrick J. Wong ` (4 subsequent siblings) 15 siblings, 3 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:36 UTC (permalink / raw) To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs Hi all, The patches in this set add the ability to repair the target buffer of a symbolic link, using the same salvage, rebuild, and swap strategy used everywhere else. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-symlink-6.10 --- Commits in this patchset: * xfs: expose xfs_bmap_local_to_extents for online repair * xfs: pass the owner to xfs_symlink_write_target * xfs: online repair of symbolic links --- fs/xfs/Makefile | 1 fs/xfs/libxfs/xfs_bmap.c | 11 - fs/xfs/libxfs/xfs_bmap.h | 6 fs/xfs/libxfs/xfs_symlink_remote.c | 7 fs/xfs/libxfs/xfs_symlink_remote.h | 7 fs/xfs/scrub/repair.h | 8 + fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/symlink.c | 13 + fs/xfs/scrub/symlink_repair.c | 506 ++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.c | 13 + fs/xfs/scrub/trace.h | 46 +++ fs/xfs/xfs_symlink.c | 4 12 files changed, 609 insertions(+), 15 deletions(-) create mode 100644 fs/xfs/scrub/symlink_repair.c ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/3] xfs: expose xfs_bmap_local_to_extents for online repair 2024-04-15 23:36 ` [PATCHSET v30.3 11/16] xfs: online repair of symbolic links Darrick J. Wong @ 2024-04-15 23:53 ` Darrick J. Wong 2024-04-15 23:54 ` [PATCH 2/3] xfs: pass the owner to xfs_symlink_write_target Darrick J. Wong 2024-04-15 23:54 ` [PATCH 3/3] xfs: online repair of symbolic links Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:53 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Allow online repair to call xfs_bmap_local_to_extents and add a void * argument at the end so that online repair can pass its own context. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_bmap.c | 11 ++++++----- fs/xfs/libxfs/xfs_bmap.h | 6 ++++++ fs/xfs/libxfs/xfs_symlink_remote.c | 3 ++- fs/xfs/libxfs/xfs_symlink_remote.h | 3 ++- 4 files changed, 16 insertions(+), 7 deletions(-) diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c index 46bbc9f0a117..59b8b9dc29cc 100644 --- a/fs/xfs/libxfs/xfs_bmap.c +++ b/fs/xfs/libxfs/xfs_bmap.c @@ -779,7 +779,7 @@ xfs_bmap_local_to_extents_empty( } -STATIC int /* error */ +int /* error */ xfs_bmap_local_to_extents( xfs_trans_t *tp, /* transaction pointer */ xfs_inode_t *ip, /* incore inode pointer */ @@ -789,7 +789,8 @@ xfs_bmap_local_to_extents( void (*init_fn)(struct xfs_trans *tp, struct xfs_buf *bp, struct xfs_inode *ip, - struct xfs_ifork *ifp)) + struct xfs_ifork *ifp, void *priv), + void *priv) { int error = 0; int flags; /* logging flags returned */ @@ -850,7 +851,7 @@ xfs_bmap_local_to_extents( * log here. Note that init_fn must also set the buffer log item type * correctly. 
*/ - init_fn(tp, bp, ip, ifp); + init_fn(tp, bp, ip, ifp, priv); /* account for the change in fork size */ xfs_idata_realloc(ip, -ifp->if_bytes, whichfork); @@ -982,8 +983,8 @@ xfs_bmap_add_attrfork_local( if (S_ISLNK(VFS_I(ip)->i_mode)) return xfs_bmap_local_to_extents(tp, ip, 1, flags, - XFS_DATA_FORK, - xfs_symlink_local_to_remote); + XFS_DATA_FORK, xfs_symlink_local_to_remote, + NULL); /* should only be called for types that support local format data */ ASSERT(0); diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h index b8bdbf1560e6..32fb2a455c29 100644 --- a/fs/xfs/libxfs/xfs_bmap.h +++ b/fs/xfs/libxfs/xfs_bmap.h @@ -179,6 +179,12 @@ unsigned int xfs_bmap_compute_attr_offset(struct xfs_mount *mp); int xfs_bmap_add_attrfork(struct xfs_inode *ip, int size, int rsvd); void xfs_bmap_local_to_extents_empty(struct xfs_trans *tp, struct xfs_inode *ip, int whichfork); +int xfs_bmap_local_to_extents(struct xfs_trans *tp, struct xfs_inode *ip, + xfs_extlen_t total, int *logflagsp, int whichfork, + void (*init_fn)(struct xfs_trans *tp, struct xfs_buf *bp, + struct xfs_inode *ip, struct xfs_ifork *ifp, + void *priv), + void *priv); void xfs_bmap_compute_maxlevels(struct xfs_mount *mp, int whichfork); int xfs_bmap_first_unused(struct xfs_trans *tp, struct xfs_inode *ip, xfs_extlen_t len, xfs_fileoff_t *unused, int whichfork); diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c index 8f0d5c584f46..d150576ddd0a 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.c +++ b/fs/xfs/libxfs/xfs_symlink_remote.c @@ -169,7 +169,8 @@ xfs_symlink_local_to_remote( struct xfs_trans *tp, struct xfs_buf *bp, struct xfs_inode *ip, - struct xfs_ifork *ifp) + struct xfs_ifork *ifp, + void *priv) { struct xfs_mount *mp = ip->i_mount; char *buf; diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h index ac3dac8f617e..83b89a1deb9f 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.h +++ b/fs/xfs/libxfs/xfs_symlink_remote.h @@ 
-16,7 +16,8 @@ int xfs_symlink_hdr_set(struct xfs_mount *mp, xfs_ino_t ino, uint32_t offset, bool xfs_symlink_hdr_ok(xfs_ino_t ino, uint32_t offset, uint32_t size, struct xfs_buf *bp); void xfs_symlink_local_to_remote(struct xfs_trans *tp, struct xfs_buf *bp, - struct xfs_inode *ip, struct xfs_ifork *ifp); + struct xfs_inode *ip, struct xfs_ifork *ifp, + void *priv); xfs_failaddr_t xfs_symlink_shortform_verify(void *sfp, int64_t size); int xfs_symlink_remote_read(struct xfs_inode *ip, char *link); int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip, ^ permalink raw reply related [flat|nested] 100+ messages in thread
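The patch above threads an opaque void * through the init_fn callback so that a caller can hand its own context to the initializer. What follows is only a userspace sketch of that pattern, not the kernel code: every demo_* name, the struct layouts, and the field sizes are made up for illustration. It shows a default initializer that ignores priv and a repair-style wrapper that calls the default and then overrides the owner from its context.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a remote symlink block header plus payload. */
struct demo_hdr {
	uint64_t	owner;
	char		payload[32];
};

/* Callback type: the trailing priv pointer carries caller context. */
typedef void (*demo_init_fn)(struct demo_hdr *hdr, uint64_t ino,
			     const char *data, void *priv);

/* Default initializer: stamps the file's own inode number, ignores priv. */
static void demo_default_init(struct demo_hdr *hdr, uint64_t ino,
			      const char *data, void *priv)
{
	(void)priv;
	hdr->owner = ino;
	strncpy(hdr->payload, data, sizeof(hdr->payload) - 1);
	hdr->payload[sizeof(hdr->payload) - 1] = '\0';
}

/* Hypothetical repair context: the inode that should own the block. */
struct demo_repair_ctx {
	uint64_t	real_owner;
};

/* Repair initializer: reuse the default, then fix up the owner field. */
static void demo_repair_init(struct demo_hdr *hdr, uint64_t ino,
			     const char *data, void *priv)
{
	struct demo_repair_ctx *ctx = priv;

	demo_default_init(hdr, ino, data, NULL);
	hdr->owner = ctx->real_owner;
}

/* The converter only forwards priv; it never looks inside it. */
static void demo_local_to_extents(struct demo_hdr *hdr, uint64_t ino,
				  const char *data, demo_init_fn init_fn,
				  void *priv)
{
	assert(init_fn != NULL);
	init_fn(hdr, ino, data, priv);
}
```

Existing callers that need no context simply pass NULL, which mirrors what xfs_bmap_add_attrfork_local does in the diff.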
* [PATCH 2/3] xfs: pass the owner to xfs_symlink_write_target 2024-04-15 23:36 ` [PATCHSET v30.3 11/16] xfs: online repair of symbolic links Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: expose xfs_bmap_local_to_extents for online repair Darrick J. Wong @ 2024-04-15 23:54 ` Darrick J. Wong 2024-04-15 23:54 ` [PATCH 3/3] xfs: online repair of symbolic links Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:54 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Require callers of xfs_symlink_write_target to pass the owner number explicitly. This sets us up for online repair to be able to write a remote symlink target to sc->tempip with sc->ip's inumber in the block header. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_symlink_remote.c | 4 ++-- fs/xfs/libxfs/xfs_symlink_remote.h | 4 ++-- fs/xfs/xfs_symlink.c | 4 ++-- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c index d150576ddd0a..f228127a88ff 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.c +++ b/fs/xfs/libxfs/xfs_symlink_remote.c @@ -311,6 +311,7 @@ int xfs_symlink_write_target( struct xfs_trans *tp, struct xfs_inode *ip, + xfs_ino_t owner, const char *target_path, int pathlen, xfs_fsblock_t fs_blocks, @@ -365,8 +366,7 @@ xfs_symlink_write_target( byte_cnt = min(byte_cnt, pathlen); buf = bp->b_addr; - buf += xfs_symlink_hdr_set(mp, ip->i_ino, offset, byte_cnt, - bp); + buf += xfs_symlink_hdr_set(mp, owner, offset, byte_cnt, bp); memcpy(buf, cur_chunk, byte_cnt); diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h index 83b89a1deb9f..c1672fe1f17b 100644 --- a/fs/xfs/libxfs/xfs_symlink_remote.h +++ b/fs/xfs/libxfs/xfs_symlink_remote.h @@ -21,8 +21,8 @@ void xfs_symlink_local_to_remote(struct 
xfs_trans *tp, struct xfs_buf *bp, xfs_failaddr_t xfs_symlink_shortform_verify(void *sfp, int64_t size); int xfs_symlink_remote_read(struct xfs_inode *ip, char *link); int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip, - const char *target_path, int pathlen, xfs_fsblock_t fs_blocks, - uint resblks); + xfs_ino_t owner, const char *target_path, int pathlen, + xfs_fsblock_t fs_blocks, uint resblks); int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip); #endif /* __XFS_SYMLINK_REMOTE_H */ diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index 3daeebff4bb4..fb060aaf6d40 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -181,8 +181,8 @@ xfs_symlink( xfs_qm_vop_create_dqattach(tp, ip, udqp, gdqp, pdqp); resblks -= XFS_IALLOC_SPACE_RES(mp); - error = xfs_symlink_write_target(tp, ip, target_path, pathlen, - fs_blocks, resblks); + error = xfs_symlink_write_target(tp, ip, ip->i_ino, target_path, + pathlen, fs_blocks, resblks); if (error) goto out_trans_cancel; resblks -= fs_blocks; ^ permalink raw reply related [flat|nested] 100+ messages in thread
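The point of the patch above is that the owner stamped into each remote block header no longer has to be the inode that holds the blocks. A rough userspace sketch of that idea follows. It is an illustration under invented assumptions, not the XFS on-disk format: the demo_* names, the 8-byte payload size, and the header layout are all hypothetical. The writer splits a target into chunks and records the caller-supplied owner, offset, and byte count in each chunk header.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical payload capacity of one block, chosen small for the demo. */
#define DEMO_BLOCK_PAYLOAD	8

/* Hypothetical block: a tiny header followed by target bytes. */
struct demo_block {
	uint64_t	owner;	/* inode number recorded in the header */
	uint32_t	offset;	/* byte offset of this chunk in the target */
	uint32_t	bytes;	/* number of target bytes in this chunk */
	char		data[DEMO_BLOCK_PAYLOAD];
};

/*
 * Write a target into blocks[], stamping the caller-supplied owner into
 * every header.  Returns the number of blocks used, or -1 if the target
 * does not fit in nblocks.
 */
static int demo_write_target(struct demo_block *blocks, size_t nblocks,
			     uint64_t owner, const char *target, size_t len)
{
	size_t	offset = 0;
	size_t	n = 0;

	assert(blocks != NULL && target != NULL);

	while (offset < len) {
		size_t	chunk = len - offset;

		if (n == nblocks)
			return -1;
		if (chunk > DEMO_BLOCK_PAYLOAD)
			chunk = DEMO_BLOCK_PAYLOAD;

		blocks[n].owner = owner;
		blocks[n].offset = (uint32_t)offset;
		blocks[n].bytes = (uint32_t)chunk;
		memcpy(blocks[n].data, target + offset, chunk);

		offset += chunk;
		n++;
	}
	return (int)n;
}
```

A normal caller would pass the inode's own number as the owner, as xfs_symlink() does with ip->i_ino in the diff, while a repair caller writing into a temporary file would pass the number of the file being repaired.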
* [PATCH 3/3] xfs: online repair of symbolic links 2024-04-15 23:36 ` [PATCHSET v30.3 11/16] xfs: online repair of symbolic links Darrick J. Wong 2024-04-15 23:53 ` [PATCH 1/3] xfs: expose xfs_bmap_local_to_extents for online repair Darrick J. Wong 2024-04-15 23:54 ` [PATCH 2/3] xfs: pass the owner to xfs_symlink_write_target Darrick J. Wong @ 2024-04-15 23:54 ` Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:54 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> If a symbolic link target looks bad, try to sift through the rubble to find as much of the target buffer as we can, stage a new target (short or remote format as needed) in a temporary file, and use the atomic extent swapping mechanism to commit the results. In the worst case, we replace the target with an overly long filename that cannot possibly resolve. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/Makefile | 1 fs/xfs/scrub/repair.h | 8 + fs/xfs/scrub/scrub.c | 2 fs/xfs/scrub/symlink.c | 13 + fs/xfs/scrub/symlink_repair.c | 506 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/tempfile.c | 13 + fs/xfs/scrub/trace.h | 46 ++++ 7 files changed, 587 insertions(+), 2 deletions(-) create mode 100644 fs/xfs/scrub/symlink_repair.c diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile index 1e23d1b3cd7b..4e1eb3b6dbc4 100644 --- a/fs/xfs/Makefile +++ b/fs/xfs/Makefile @@ -213,6 +213,7 @@ xfs-y += $(addprefix scrub/, \ refcount_repair.o \ repair.o \ rmap_repair.o \ + symlink_repair.o \ tempfile.o \ xfblob.o \ ) diff --git a/fs/xfs/scrub/repair.h b/fs/xfs/scrub/repair.h index 7e6aba7fe558..622eb486a16f 100644 --- a/fs/xfs/scrub/repair.h +++ b/fs/xfs/scrub/repair.h @@ -94,6 +94,7 @@ int xrep_setup_xattr(struct xfs_scrub *sc); int xrep_setup_directory(struct xfs_scrub *sc); int xrep_setup_parent(struct xfs_scrub *sc); int 
xrep_setup_nlinks(struct xfs_scrub *sc); +int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *resblks); /* Repair setup functions */ int xrep_setup_ag_allocbt(struct xfs_scrub *sc); @@ -130,6 +131,7 @@ int xrep_fscounters(struct xfs_scrub *sc); int xrep_xattr(struct xfs_scrub *sc); int xrep_directory(struct xfs_scrub *sc); int xrep_parent(struct xfs_scrub *sc); +int xrep_symlink(struct xfs_scrub *sc); #ifdef CONFIG_XFS_RT int xrep_rtbitmap(struct xfs_scrub *sc); @@ -206,6 +208,11 @@ xrep_setup_nothing( #define xrep_setup_inode(sc, imap) ((void)0) +static inline int xrep_setup_symlink(struct xfs_scrub *sc, unsigned int *x) +{ + return 0; +} + #define xrep_revalidate_allocbt (NULL) #define xrep_revalidate_iallocbt (NULL) @@ -231,6 +238,7 @@ xrep_setup_nothing( #define xrep_xattr xrep_notsupported #define xrep_directory xrep_notsupported #define xrep_parent xrep_notsupported +#define xrep_symlink xrep_notsupported #endif /* CONFIG_XFS_ONLINE_REPAIR */ diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 6417628ce26b..301d5b753fdd 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -339,7 +339,7 @@ static const struct xchk_meta_ops meta_scrub_ops[] = { .type = ST_INODE, .setup = xchk_setup_symlink, .scrub = xchk_symlink, - .repair = xrep_notsupported, + .repair = xrep_symlink, }, [XFS_SCRUB_TYPE_PARENT] = { /* parent pointers */ .type = ST_INODE, diff --git a/fs/xfs/scrub/symlink.c b/fs/xfs/scrub/symlink.c index d77d8a9598f6..c848bcc07cd5 100644 --- a/fs/xfs/scrub/symlink.c +++ b/fs/xfs/scrub/symlink.c @@ -10,6 +10,7 @@ #include "xfs_trans_resv.h" #include "xfs_mount.h" #include "xfs_log_format.h" +#include "xfs_trans.h" #include "xfs_inode.h" #include "xfs_symlink.h" #include "xfs_health.h" @@ -17,18 +18,28 @@ #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/health.h" +#include "scrub/repair.h" /* Set us up to scrub a symbolic link. 
*/ int xchk_setup_symlink( struct xfs_scrub *sc) { + unsigned int resblks = 0; + int error; + /* Allocate the buffer without the inode lock held. */ sc->buf = kvzalloc(XFS_SYMLINK_MAXLEN + 1, XCHK_GFP_FLAGS); if (!sc->buf) return -ENOMEM; - return xchk_setup_inode_contents(sc, 0); + if (xchk_could_repair(sc)) { + error = xrep_setup_symlink(sc, &resblks); + if (error) + return error; + } + + return xchk_setup_inode_contents(sc, resblks); } /* Symbolic links. */ diff --git a/fs/xfs/scrub/symlink_repair.c b/fs/xfs/scrub/symlink_repair.c new file mode 100644 index 000000000000..178304959535 --- /dev/null +++ b/fs/xfs/scrub/symlink_repair.c @@ -0,0 +1,506 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2018-2024 Oracle. All Rights Reserved. + * Author: Darrick J. Wong <djwong@kernel.org> + */ +#include "xfs.h" +#include "xfs_fs.h" +#include "xfs_shared.h" +#include "xfs_format.h" +#include "xfs_trans_resv.h" +#include "xfs_mount.h" +#include "xfs_defer.h" +#include "xfs_btree.h" +#include "xfs_bit.h" +#include "xfs_log_format.h" +#include "xfs_trans.h" +#include "xfs_sb.h" +#include "xfs_inode.h" +#include "xfs_inode_fork.h" +#include "xfs_symlink.h" +#include "xfs_bmap.h" +#include "xfs_quota.h" +#include "xfs_da_format.h" +#include "xfs_da_btree.h" +#include "xfs_bmap_btree.h" +#include "xfs_trans_space.h" +#include "xfs_symlink_remote.h" +#include "xfs_exchmaps.h" +#include "xfs_exchrange.h" +#include "xfs_health.h" +#include "scrub/xfs_scrub.h" +#include "scrub/scrub.h" +#include "scrub/common.h" +#include "scrub/trace.h" +#include "scrub/repair.h" +#include "scrub/tempfile.h" +#include "scrub/tempexch.h" +#include "scrub/reap.h" + +/* + * Symbolic Link Repair + * ==================== + * + * We repair symbolic links by reading whatever target data we can find, up to + * the first NULL byte. If the recovered target strlen matches i_size, then + * we rewrite the target. 
In all other cases, we replace the target with an + * overly long string that cannot possibly resolve. The new target is written + * into a private hidden temporary file, and then a file contents exchange + * commits the new symlink target to the file being repaired. + */ + +/* Set us up to repair the symlink file. */ +int +xrep_setup_symlink( + struct xfs_scrub *sc, + unsigned int *resblks) +{ + struct xfs_mount *mp = sc->mp; + unsigned long long blocks; + int error; + + error = xrep_tempfile_create(sc, S_IFLNK); + if (error) + return error; + + /* + * If we're doing a repair, we reserve enough blocks to write out a + * completely new symlink file, plus twice as many blocks as we would + * need if we can only allocate one block per data fork mapping. This + * should cover the preallocation of the temporary file and exchanging + * the extent mappings. + * + * We cannot use xfs_exchmaps_estimate because we have not yet + * constructed the replacement symlink and therefore do not know how + * many extents it will use. By the time we do, we will have a dirty + * transaction (which we cannot drop because we cannot drop the + * symlink ILOCK) and cannot ask for more reservation. + */ + blocks = xfs_symlink_blocks(sc->mp, XFS_SYMLINK_MAXLEN); + blocks += xfs_bmbt_calc_size(mp, blocks) * 2; + if (blocks > UINT_MAX) + return -EOPNOTSUPP; + + *resblks += blocks; + return 0; +} + +/* + * Try to salvage the pathname from remote blocks. Returns the number of bytes + * salvaged or a negative errno. + */ +STATIC ssize_t +xrep_symlink_salvage_remote( + struct xfs_scrub *sc) +{ + struct xfs_bmbt_irec mval[XFS_SYMLINK_MAPS]; + struct xfs_inode *ip = sc->ip; + struct xfs_buf *bp; + char *target_buf = sc->buf; + xfs_failaddr_t fa; + xfs_filblks_t fsblocks; + xfs_daddr_t d; + loff_t len; + loff_t offset = 0; + unsigned int byte_cnt; + bool magic_ok; + bool hdr_ok; + int n; + int nmaps = XFS_SYMLINK_MAPS; + int error; + + /* We'll only read until the buffer is full. 
*/ + len = min_t(loff_t, ip->i_disk_size, XFS_SYMLINK_MAXLEN); + fsblocks = xfs_symlink_blocks(sc->mp, len); + error = xfs_bmapi_read(ip, 0, fsblocks, mval, &nmaps, 0); + if (error) + return error; + + for (n = 0; n < nmaps; n++) { + struct xfs_dsymlink_hdr *dsl; + + d = XFS_FSB_TO_DADDR(sc->mp, mval[n].br_startblock); + + /* Read the rmt block. We'll run the verifiers manually. */ + error = xfs_trans_read_buf(sc->mp, sc->tp, sc->mp->m_ddev_targp, + d, XFS_FSB_TO_BB(sc->mp, mval[n].br_blockcount), + 0, &bp, NULL); + if (error) + return error; + bp->b_ops = &xfs_symlink_buf_ops; + + /* How many bytes do we expect to get out of this buffer? */ + byte_cnt = XFS_FSB_TO_B(sc->mp, mval[n].br_blockcount); + byte_cnt = XFS_SYMLINK_BUF_SPACE(sc->mp, byte_cnt); + byte_cnt = min_t(unsigned int, byte_cnt, len); + + /* + * See if the verifiers accept this block. We're willing to + * salvage if the offset/byte/ino are ok and either the + * verifier passed or the magic is ok. Anything else and we + * stop dead in our tracks. + */ + fa = bp->b_ops->verify_struct(bp); + dsl = bp->b_addr; + magic_ok = dsl->sl_magic == cpu_to_be32(XFS_SYMLINK_MAGIC); + hdr_ok = xfs_symlink_hdr_ok(ip->i_ino, offset, byte_cnt, bp); + if (!hdr_ok || (fa != NULL && !magic_ok)) + break; + + memcpy(target_buf + offset, dsl + 1, byte_cnt); + + len -= byte_cnt; + offset += byte_cnt; + } + return offset; +} + +/* + * Try to salvage an inline symlink's contents. Returns the number of bytes + * salvaged or a negative errno. + */ +STATIC ssize_t +xrep_symlink_salvage_inline( + struct xfs_scrub *sc) +{ + struct xfs_inode *ip = sc->ip; + char *target_buf = sc->buf; + char *old_target; + struct xfs_ifork *ifp; + unsigned int nr; + + ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK); + if (!ifp->if_data) + return 0; + + /* + * If inode repair zapped the link target, pretend that we didn't find + * any bytes at all so that we can replace the (now totally lost) link + * target with a warning message. 
+ */ + old_target = ifp->if_data; + if (xfs_inode_has_sickness(sc->ip, XFS_SICK_INO_SYMLINK_ZAPPED) && + sc->ip->i_disk_size == 1 && old_target[0] == '?') + return 0; + + nr = min(XFS_SYMLINK_MAXLEN, xfs_inode_data_fork_size(ip)); + strncpy(target_buf, ifp->if_data, nr); + return nr; +} + +#define DUMMY_TARGET \ + "The target of this symbolic link could not be recovered at all and " \ + "has been replaced with this explanatory message. To avoid " \ + "accidentally pointing to an existing file path, this message is " \ + "longer than the maximum supported file name length. That is an " \ + "acceptable length for a symlink target on XFS but will produce " \ + "File Name Too Long errors if resolved." + +/* Salvage whatever we can of the target. */ +STATIC int +xrep_symlink_salvage( + struct xfs_scrub *sc) +{ + char *target_buf = sc->buf; + ssize_t buflen = 0; + + BUILD_BUG_ON(sizeof(DUMMY_TARGET) - 1 <= NAME_MAX); + + /* + * Salvage the target if there weren't any corruption problems observed + * while scanning it. + */ + if (!(sc->sm->sm_flags & XFS_SCRUB_OFLAG_CORRUPT)) { + if (sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL) + buflen = xrep_symlink_salvage_inline(sc); + else + buflen = xrep_symlink_salvage_remote(sc); + if (buflen < 0) + return buflen; + + /* + * NULL-terminate the buffer because the ondisk target does not + * do that for us. If salvage didn't find the exact amount of + * data that we expected to find, don't salvage anything. + */ + target_buf[buflen] = 0; + if (strlen(target_buf) != sc->ip->i_disk_size) + buflen = 0; + } + + /* + * Change an empty target into a dummy target and clear the symlink + * target zapped flag. 
+ */ + if (buflen == 0) { + sc->sick_mask |= XFS_SICK_INO_SYMLINK_ZAPPED; + sprintf(target_buf, DUMMY_TARGET); + } + + trace_xrep_symlink_salvage_target(sc->ip, target_buf, + strlen(target_buf)); + return 0; +} + +STATIC void +xrep_symlink_local_to_remote( + struct xfs_trans *tp, + struct xfs_buf *bp, + struct xfs_inode *ip, + struct xfs_ifork *ifp, + void *priv) +{ + struct xfs_scrub *sc = priv; + struct xfs_dsymlink_hdr *dsl = bp->b_addr; + + xfs_symlink_local_to_remote(tp, bp, ip, ifp, NULL); + + if (!xfs_has_crc(sc->mp)) + return; + + dsl->sl_owner = cpu_to_be64(sc->ip->i_ino); + xfs_trans_log_buf(tp, bp, 0, + sizeof(struct xfs_dsymlink_hdr) + ifp->if_bytes - 1); +} + +/* + * Prepare both links' data forks for an exchange. Promote the tempfile from + * local format to extents format, and if the file being repaired has a short + * format data fork, turn it into an empty extent list. + */ +STATIC int +xrep_symlink_swap_prep( + struct xfs_scrub *sc, + bool temp_local, + bool ip_local) +{ + int error; + + /* + * If the temp link is in shortform format, convert that to a remote + * target so that we can use the atomic mapping exchange. + */ + if (temp_local) { + int logflags = XFS_ILOG_CORE; + + error = xfs_bmap_local_to_extents(sc->tp, sc->tempip, 1, + &logflags, XFS_DATA_FORK, + xrep_symlink_local_to_remote, + sc); + if (error) + return error; + + xfs_trans_log_inode(sc->tp, sc->ip, 0); + + error = xfs_defer_finish(&sc->tp); + if (error) + return error; + } + + /* + * If the file being repaired had a shortform data fork, convert that + * to an empty extent list in preparation for the atomic mapping + * exchange. 
+ */ + if (ip_local) { + struct xfs_ifork *ifp; + + ifp = xfs_ifork_ptr(sc->ip, XFS_DATA_FORK); + xfs_idestroy_fork(ifp); + ifp->if_format = XFS_DINODE_FMT_EXTENTS; + ifp->if_nextents = 0; + ifp->if_bytes = 0; + ifp->if_data = NULL; + ifp->if_height = 0; + + xfs_trans_log_inode(sc->tp, sc->ip, + XFS_ILOG_CORE | XFS_ILOG_DDATA); + } + + return 0; +} + +/* Exchange the temporary symlink's data fork with the one being repaired. */ +STATIC int +xrep_symlink_swap( + struct xfs_scrub *sc) +{ + struct xrep_tempexch *tx = sc->buf; + bool ip_local, temp_local; + int error; + + ip_local = sc->ip->i_df.if_format == XFS_DINODE_FMT_LOCAL; + temp_local = sc->tempip->i_df.if_format == XFS_DINODE_FMT_LOCAL; + + /* + * If both links have a local format data fork and the rebuilt + * remote data would fit in the repaired file's data fork, copy the + * contents from the tempfile and declare ourselves done. + */ + if (ip_local && temp_local && + sc->tempip->i_disk_size <= xfs_inode_data_fork_size(sc->ip)) { + xrep_tempfile_copyout_local(sc, XFS_DATA_FORK); + return 0; + } + + /* Otherwise, make sure both data forks are in block-mapping mode. */ + error = xrep_symlink_swap_prep(sc, temp_local, ip_local); + if (error) + return error; + + return xrep_tempexch_contents(sc, tx); +} + +/* + * Free all the remote blocks and reset the data fork. The caller must join + * the inode to the transaction. This function returns with the inode joined + * to a clean scrub transaction. + */ +STATIC int +xrep_symlink_reset_fork( + struct xfs_scrub *sc) +{ + struct xfs_ifork *ifp = xfs_ifork_ptr(sc->tempip, XFS_DATA_FORK); + int error; + + /* Unmap all the remote target buffers. */ + if (xfs_ifork_has_extents(ifp)) { + error = xrep_reap_ifork(sc, sc->tempip, XFS_DATA_FORK); + if (error) + return error; + } + + trace_xrep_symlink_reset_fork(sc->tempip); + + /* Reset the temp symlink target to dummy content. 
*/ + xfs_idestroy_fork(ifp); + return xfs_symlink_write_target(sc->tp, sc->tempip, sc->tempip->i_ino, + "?", 1, 0, 0); +} + +/* + * Reinitialize a link target. Caller must ensure the inode is joined to + * the transaction. + */ +STATIC int +xrep_symlink_rebuild( + struct xfs_scrub *sc) +{ + struct xrep_tempexch *tx; + char *target_buf = sc->buf; + xfs_fsblock_t fs_blocks; + unsigned int target_len; + unsigned int resblks; + int error; + + /* How many blocks do we need? */ + target_len = strlen(target_buf); + ASSERT(target_len != 0); + if (target_len == 0 || target_len > XFS_SYMLINK_MAXLEN) + return -EFSCORRUPTED; + + trace_xrep_symlink_rebuild(sc->ip); + + /* + * In preparation to write the new symlink target to the temporary + * file, drop the ILOCK of the file being repaired (it shouldn't be + * joined) and take the ILOCK of the temporary file. + * + * The VFS does not take the IOLOCK while reading a symlink (and new + * symlinks are hidden with INEW until they've been written) so it's + * possible that a readlink() could see the old corrupted contents + * while we're doing this. + */ + xchk_iunlock(sc, XFS_ILOCK_EXCL); + xrep_tempfile_ilock(sc); + xfs_trans_ijoin(sc->tp, sc->tempip, 0); + + /* + * Reserve resources to reinitialize the target. We're allowed to + * exceed file quota to repair inconsistent metadata, though this is + * unlikely. + */ + fs_blocks = xfs_symlink_blocks(sc->mp, target_len); + resblks = XFS_SYMLINK_SPACE_RES(sc->mp, target_len, fs_blocks); + error = xfs_trans_reserve_quota_nblks(sc->tp, sc->tempip, resblks, 0, + true); + if (error) + return error; + + /* Erase the dummy target set up by the tempfile initialization. */ + xfs_idestroy_fork(&sc->tempip->i_df); + sc->tempip->i_df.if_bytes = 0; + sc->tempip->i_df.if_format = XFS_DINODE_FMT_EXTENTS; + + /* Write the salvaged target to the temporary link. 
*/ + error = xfs_symlink_write_target(sc->tp, sc->tempip, sc->ip->i_ino, + target_buf, target_len, fs_blocks, resblks); + if (error) + return error; + + /* + * Commit the repair transaction so that we can use the atomic mapping + * exchange functions to compute the correct block reservations and + * re-lock the inodes. + */ + target_buf = NULL; + error = xrep_trans_commit(sc); + if (error) + return error; + + /* Last chance to abort before we start committing fixes. */ + if (xchk_should_terminate(sc, &error)) + return error; + + xrep_tempfile_iunlock(sc); + + /* + * We're done with the temporary buffer, so we can reuse it for the + * tempfile contents exchange information. + */ + tx = sc->buf; + error = xrep_tempexch_trans_alloc(sc, XFS_DATA_FORK, tx); + if (error) + return error; + + /* + * Exchange the temp link's data fork with the file being repaired. + * This recreates the transaction and takes the ILOCKs of the file + * being repaired and the temporary file. + */ + error = xrep_symlink_swap(sc); + if (error) + return error; + + /* + * Release the old symlink blocks and reset the data fork of the temp + * link to an empty shortform link. This is the last repair action we + * perform on the symlink, so we don't need to clean the transaction. + */ + return xrep_symlink_reset_fork(sc); +} + +/* Repair a symbolic link. */ +int +xrep_symlink( + struct xfs_scrub *sc) +{ + int error; + + /* The rmapbt is required to reap the old data fork. */ + if (!xfs_has_rmapbt(sc->mp)) + return -EOPNOTSUPP; + + ASSERT(sc->ilock_flags & XFS_ILOCK_EXCL); + + error = xrep_symlink_salvage(sc); + if (error) + return error; + + /* Now reset the target. 
*/ + error = xrep_symlink_rebuild(sc); + if (error) + return error; + + return xrep_trans_commit(sc); +} diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index 4ca86a6a5be1..c72e447eb8ec 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -21,6 +21,7 @@ #include "xfs_exchrange.h" #include "xfs_exchmaps.h" #include "xfs_defer.h" +#include "xfs_symlink_remote.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/repair.h" @@ -109,6 +110,18 @@ xrep_tempfile_create( error = xfs_dir_init(tp, sc->tempip, dp); if (error) goto out_trans_cancel; + } else if (S_ISLNK(VFS_I(sc->tempip)->i_mode)) { + /* + * Initialize the temporary symlink with a meaningless target + * that won't trip the verifiers. Repair must rewrite the + * target with meaningful content before swapping with the file + * being repaired. A single-byte target will not write a + * remote target block, so the owner is irrelevant. + */ + error = xfs_symlink_write_target(tp, sc->tempip, + sc->tempip->i_ino, ".", 1, 0, 0); + if (error) + goto out_trans_cancel; } /* diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 668da6ff2ca2..03cb095fc1a1 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2711,6 +2711,52 @@ DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_alias); DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_check_dentry); DEFINE_REPAIR_DENTRY_EVENT(xrep_adoption_invalidate_child); +TRACE_EVENT(xrep_symlink_salvage_target, + TP_PROTO(struct xfs_inode *ip, char *target, unsigned int targetlen), + TP_ARGS(ip, target, targetlen), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + __field(unsigned int, targetlen) + __dynamic_array(char, target, targetlen + 1) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + __entry->targetlen = targetlen; + memcpy(__get_str(target), target, targetlen); + __get_str(target)[targetlen] = 0; + ), + TP_printk("dev %d:%d ip 0x%llx target '%.*s'", + 
MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino, + __entry->targetlen, + __get_str(target)) +); + +DECLARE_EVENT_CLASS(xrep_symlink_class, + TP_PROTO(struct xfs_inode *ip), + TP_ARGS(ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_ino_t, ino) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->ino = ip->i_ino; + ), + TP_printk("dev %d:%d ip 0x%llx", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->ino) +); + +#define DEFINE_XREP_SYMLINK_EVENT(name) \ +DEFINE_EVENT(xrep_symlink_class, name, \ + TP_PROTO(struct xfs_inode *ip), \ + TP_ARGS(ip)) +DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_rebuild); +DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_reset_fork); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
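The salvage rule described in the patch above (read whatever bytes can be recovered, NUL-terminate them, and accept the result only if its string length matches the recorded size, otherwise substitute a placeholder) can be tried out in userspace. The sketch below is a simplified model, not the kernel implementation: demo_salvage, its parameters, and the short fallback string used in testing are hypothetical, and the real code additionally distinguishes inline from remote targets and consults scrub flags.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Simplified model of the salvage decision.  raw/raw_len are the bytes
 * recovered from the damaged link, expected is the size recorded in the
 * inode, and fallback is the placeholder target.  Returns 1 if the
 * recovered bytes were kept, 0 if the fallback was substituted.
 */
static int demo_salvage(const char *raw, size_t raw_len, size_t expected,
			char *out, size_t out_size, const char *fallback)
{
	assert(out_size > 0);
	assert(strlen(fallback) < out_size);

	if (raw_len > 0 && raw_len < out_size) {
		/* The stored target is not NUL-terminated; do it here. */
		memcpy(out, raw, raw_len);
		out[raw_len] = '\0';

		/* An embedded NUL or a short read fails this comparison. */
		if (strlen(out) == expected)
			return 1;
	}

	/* Nothing trustworthy was recovered; use the placeholder. */
	strcpy(out, fallback);
	return 0;
}
```

In the patch itself the placeholder is DUMMY_TARGET, which is deliberately longer than NAME_MAX so that resolving it fails rather than landing on some unrelated existing path.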
* [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (10 preceding siblings ...) 2024-04-15 23:36 ` [PATCHSET v30.3 11/16] xfs: online repair of symbolic links Darrick J. Wong @ 2024-04-15 23:36 ` Darrick J. Wong 2024-04-15 23:54 ` [PATCH 1/3] xfs: check AGI unlinked inode buckets Darrick J. Wong ` (2 more replies) 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong ` (3 subsequent siblings) 15 siblings, 3 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:36 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs Hi all, This series enhances the AGI scrub code to check the unlinked inode bucket lists for errors, and fixes them if necessary. Now that iunlink pointer updates are virtual log items, we can batch updates pretty efficiently in the logging code. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-iunlink-6.10 --- Commits in this patchset: * xfs: check AGI unlinked inode buckets * xfs: hoist AGI repair context to a heap object * xfs: repair AGI unlinked inode bucket lists --- fs/xfs/scrub/agheader.c | 40 ++ fs/xfs/scrub/agheader_repair.c | 879 ++++++++++++++++++++++++++++++++++++++-- fs/xfs/scrub/agino_bitmap.h | 49 ++ fs/xfs/scrub/trace.h | 255 ++++++++++++ fs/xfs/xfs_inode.c | 2 fs/xfs/xfs_inode.h | 1 6 files changed, 1179 insertions(+), 47 deletions(-) create mode 100644 fs/xfs/scrub/agino_bitmap.h ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/3] xfs: check AGI unlinked inode buckets 2024-04-15 23:36 ` [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets Darrick J. Wong @ 2024-04-15 23:54 ` Darrick J. Wong 2024-04-15 23:54 ` [PATCH 2/3] xfs: hoist AGI repair context to a heap object Darrick J. Wong 2024-04-15 23:55 ` [PATCH 3/3] xfs: repair AGI unlinked inode bucket lists Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:54 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Look for corruptions in the AGI unlinked bucket chains. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/agheader.c | 40 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_inode.c | 2 +- fs/xfs/xfs_inode.h | 1 + 3 files changed, 42 insertions(+), 1 deletion(-) diff --git a/fs/xfs/scrub/agheader.c b/fs/xfs/scrub/agheader.c index e954f07679dd..1528f14bd925 100644 --- a/fs/xfs/scrub/agheader.c +++ b/fs/xfs/scrub/agheader.c @@ -15,6 +15,7 @@ #include "xfs_ialloc.h" #include "xfs_rmap.h" #include "xfs_ag.h" +#include "xfs_inode.h" #include "scrub/scrub.h" #include "scrub/common.h" @@ -865,6 +866,43 @@ xchk_agi_xref( /* scrub teardown will take care of sc->sa for us */ } +/* + * Check the unlinked buckets for links to bad inodes. We hold the AGI, so + * there cannot be any threads updating unlinked list pointers in this AG. 
+ */ +STATIC void +xchk_iunlink( + struct xfs_scrub *sc, + struct xfs_agi *agi) +{ + unsigned int i; + struct xfs_inode *ip; + + for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) { + xfs_agino_t agino = be32_to_cpu(agi->agi_unlinked[i]); + + while (agino != NULLAGINO) { + if (agino % XFS_AGI_UNLINKED_BUCKETS != i) { + xchk_block_set_corrupt(sc, sc->sa.agi_bp); + return; + } + + ip = xfs_iunlink_lookup(sc->sa.pag, agino); + if (!ip) { + xchk_block_set_corrupt(sc, sc->sa.agi_bp); + return; + } + + if (!xfs_inode_on_unlinked_list(ip)) { + xchk_block_set_corrupt(sc, sc->sa.agi_bp); + return; + } + + agino = ip->i_next_unlinked; + } + } +} + /* Scrub the AGI. */ int xchk_agi( @@ -949,6 +987,8 @@ xchk_agi( if (pag->pagi_freecount != be32_to_cpu(agi->agi_freecount)) xchk_block_set_corrupt(sc, sc->sa.agi_bp); + xchk_iunlink(sc, agi); + xchk_agi_xref(sc); out: return error; diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 803a64687014..fed0cd6bffdf 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1985,7 +1985,7 @@ xfs_inactive( * only unlinked, referenced inodes can be on the unlinked inode list. If we * don't find the inode in cache, then let the caller handle the situation. */ -static struct xfs_inode * +struct xfs_inode * xfs_iunlink_lookup( struct xfs_perag *pag, xfs_agino_t agino) diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index 18bc3d7750a0..c74c48bc0945 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -619,6 +619,7 @@ bool xfs_inode_needs_inactive(struct xfs_inode *ip); int xfs_iunlink(struct xfs_trans *tp, struct xfs_inode *ip); int xfs_iunlink_remove(struct xfs_trans *tp, struct xfs_perag *pag, struct xfs_inode *ip); +struct xfs_inode *xfs_iunlink_lookup(struct xfs_perag *pag, xfs_agino_t agino); void xfs_end_io(struct work_struct *work); ^ permalink raw reply related [flat|nested] 100+ messages in thread
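For readers following along without a kernel tree, the chain walk that xchk_iunlink() performs can be modelled in user space roughly as below. This is only a sketch under stated assumptions: struct toy_inode, toy_lookup() and toy_check_buckets() are invented names standing in for the incore inode, xfs_iunlink_lookup() and xchk_iunlink(); the on_unlinked flag stands in for xfs_inode_on_unlinked_list(); and the step counter that guards against cycles is an addition of the toy, not something the patch does. It keeps the three checks from the patch: the inode must hash to the bucket being walked, it must be found in cache, and it must believe it is on an unlinked list.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define TOY_BUCKETS	64u		/* same value as XFS_AGI_UNLINKED_BUCKETS */
#define TOY_NULLAGINO	UINT32_MAX	/* end-of-chain marker, like NULLAGINO */
#define TOY_MAX_INODES	256u		/* size of the toy inode cache */

/* Hypothetical stand-in for the incore inode; not the kernel's xfs_inode. */
struct toy_inode {
	bool		cached;		/* is this inode in the toy cache? */
	bool		on_unlinked;	/* does it think it is on an unlinked list? */
	uint32_t	next_unlinked;	/* next agino in the bucket chain */
};

/* Hypothetical stand-in for xfs_iunlink_lookup(): NULL if not cached. */
static struct toy_inode *
toy_lookup(struct toy_inode *table, uint32_t agino)
{
	if (agino >= TOY_MAX_INODES || !table[agino].cached)
		return NULL;
	return &table[agino];
}

/*
 * Walk every bucket chain.  Returns true if every inode on every chain
 * hashes to its bucket, is cached, and is marked as unlinked; returns
 * false at the first problem, much as the scrubber flags the AGI corrupt
 * and stops.
 */
static bool
toy_check_buckets(struct toy_inode *table, const uint32_t heads[TOY_BUCKETS])
{
	unsigned int	i;

	assert(table != NULL && heads != NULL);

	for (i = 0; i < TOY_BUCKETS; i++) {
		uint32_t	agino = heads[i];
		unsigned int	steps = 0;

		while (agino != TOY_NULLAGINO) {
			struct toy_inode	*ip;

			/* An inode must live in the bucket it hashes to. */
			if (agino % TOY_BUCKETS != i)
				return false;

			/* Unlinked inodes are expected to be in cache. */
			ip = toy_lookup(table, agino);
			if (!ip)
				return false;

			/* The inode must agree that it is unlinked. */
			if (!ip->on_unlinked)
				return false;

			/* Toy-only guard against cycles; not in the patch. */
			if (++steps > TOY_MAX_INODES)
				return false;

			agino = ip->next_unlinked;
		}
	}

	return true;
}
```

A caller would fill heads[] with TOY_NULLAGINO for empty buckets, mark a few table entries as cached and unlinked, and chain them through next_unlinked; any chain that wanders into an inode from another bucket makes toy_check_buckets() return false.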
* [PATCH 2/3] xfs: hoist AGI repair context to a heap object 2024-04-15 23:36 ` [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets Darrick J. Wong 2024-04-15 23:54 ` [PATCH 1/3] xfs: check AGI unlinked inode buckets Darrick J. Wong @ 2024-04-15 23:54 ` Darrick J. Wong 2024-04-15 23:55 ` [PATCH 3/3] xfs: repair AGI unlinked inode bucket lists Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:54 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Save ~460 bytes of stack space by moving all the repair context to a heap object. We're going to add even more context data in the next patch, which is why we really need to do this now. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/agheader_repair.c | 105 ++++++++++++++++++++++++---------------- 1 file changed, 63 insertions(+), 42 deletions(-) diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index 427054b65b23..d210bd7d5eb1 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -796,15 +796,29 @@ enum { XREP_AGI_MAX }; +struct xrep_agi { + struct xfs_scrub *sc; + + /* AGI buffer, tracked separately */ + struct xfs_buf *agi_bp; + + /* context for finding btree roots */ + struct xrep_find_ag_btree fab[XREP_AGI_MAX]; + + /* old AGI contents in case we have to revert */ + struct xfs_agi old_agi; +}; + /* * Given the inode btree roots described by *fab, find the roots, check them * for sanity, and pass the root data back out via *fab. 
*/ STATIC int xrep_agi_find_btrees( - struct xfs_scrub *sc, - struct xrep_find_ag_btree *fab) + struct xrep_agi *ragi) { + struct xfs_scrub *sc = ragi->sc; + struct xrep_find_ag_btree *fab = ragi->fab; struct xfs_buf *agf_bp; struct xfs_mount *mp = sc->mp; int error; @@ -837,10 +851,11 @@ xrep_agi_find_btrees( */ STATIC void xrep_agi_init_header( - struct xfs_scrub *sc, - struct xfs_buf *agi_bp, - struct xfs_agi *old_agi) + struct xrep_agi *ragi) { + struct xfs_scrub *sc = ragi->sc; + struct xfs_buf *agi_bp = ragi->agi_bp; + struct xfs_agi *old_agi = &ragi->old_agi; struct xfs_agi *agi = agi_bp->b_addr; struct xfs_perag *pag = sc->sa.pag; struct xfs_mount *mp = sc->mp; @@ -868,10 +883,12 @@ xrep_agi_init_header( /* Set btree root information in an AGI. */ STATIC void xrep_agi_set_roots( - struct xfs_scrub *sc, - struct xfs_agi *agi, - struct xrep_find_ag_btree *fab) + struct xrep_agi *ragi) { + struct xfs_scrub *sc = ragi->sc; + struct xfs_agi *agi = ragi->agi_bp->b_addr; + struct xrep_find_ag_btree *fab = ragi->fab; + agi->agi_root = cpu_to_be32(fab[XREP_AGI_INOBT].root); agi->agi_level = cpu_to_be32(fab[XREP_AGI_INOBT].height); @@ -884,9 +901,10 @@ xrep_agi_set_roots( /* Update the AGI counters. */ STATIC int xrep_agi_calc_from_btrees( - struct xfs_scrub *sc, - struct xfs_buf *agi_bp) + struct xrep_agi *ragi) { + struct xfs_scrub *sc = ragi->sc; + struct xfs_buf *agi_bp = ragi->agi_bp; struct xfs_btree_cur *cur; struct xfs_agi *agi = agi_bp->b_addr; struct xfs_mount *mp = sc->mp; @@ -931,9 +949,10 @@ xrep_agi_calc_from_btrees( /* Trigger reinitialization of the in-core data. */ STATIC int xrep_agi_commit_new( - struct xfs_scrub *sc, - struct xfs_buf *agi_bp) + struct xrep_agi *ragi) { + struct xfs_scrub *sc = ragi->sc; + struct xfs_buf *agi_bp = ragi->agi_bp; struct xfs_perag *pag; struct xfs_agi *agi = agi_bp->b_addr; @@ -956,33 +975,36 @@ xrep_agi_commit_new( /* Repair the AGI. 
*/ int xrep_agi( - struct xfs_scrub *sc) + struct xfs_scrub *sc) { - struct xrep_find_ag_btree fab[XREP_AGI_MAX] = { - [XREP_AGI_INOBT] = { - .rmap_owner = XFS_RMAP_OWN_INOBT, - .buf_ops = &xfs_inobt_buf_ops, - .maxlevels = M_IGEO(sc->mp)->inobt_maxlevels, - }, - [XREP_AGI_FINOBT] = { - .rmap_owner = XFS_RMAP_OWN_INOBT, - .buf_ops = &xfs_finobt_buf_ops, - .maxlevels = M_IGEO(sc->mp)->inobt_maxlevels, - }, - [XREP_AGI_END] = { - .buf_ops = NULL - }, - }; - struct xfs_agi old_agi; - struct xfs_mount *mp = sc->mp; - struct xfs_buf *agi_bp; - struct xfs_agi *agi; - int error; + struct xrep_agi *ragi; + struct xfs_mount *mp = sc->mp; + int error; /* We require the rmapbt to rebuild anything. */ if (!xfs_has_rmapbt(mp)) return -EOPNOTSUPP; + sc->buf = kzalloc(sizeof(struct xrep_agi), XCHK_GFP_FLAGS); + if (!sc->buf) + return -ENOMEM; + ragi = sc->buf; + ragi->sc = sc; + + ragi->fab[XREP_AGI_INOBT] = (struct xrep_find_ag_btree){ + .rmap_owner = XFS_RMAP_OWN_INOBT, + .buf_ops = &xfs_inobt_buf_ops, + .maxlevels = M_IGEO(sc->mp)->inobt_maxlevels, + }; + ragi->fab[XREP_AGI_FINOBT] = (struct xrep_find_ag_btree){ + .rmap_owner = XFS_RMAP_OWN_INOBT, + .buf_ops = &xfs_finobt_buf_ops, + .maxlevels = M_IGEO(sc->mp)->inobt_maxlevels, + }; + ragi->fab[XREP_AGI_END] = (struct xrep_find_ag_btree){ + .buf_ops = NULL, + }; + /* * Make sure we have the AGI buffer, as scrub might have decided it * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED. @@ -990,14 +1012,13 @@ xrep_agi( error = xfs_trans_read_buf(mp, sc->tp, mp->m_ddev_targp, XFS_AG_DADDR(mp, sc->sa.pag->pag_agno, XFS_AGI_DADDR(mp)), - XFS_FSS_TO_BB(mp, 1), 0, &agi_bp, NULL); + XFS_FSS_TO_BB(mp, 1), 0, &ragi->agi_bp, NULL); if (error) return error; - agi_bp->b_ops = &xfs_agi_buf_ops; - agi = agi_bp->b_addr; + ragi->agi_bp->b_ops = &xfs_agi_buf_ops; /* Find the AGI btree roots. 
*/ - error = xrep_agi_find_btrees(sc, fab); + error = xrep_agi_find_btrees(ragi); if (error) return error; @@ -1006,18 +1027,18 @@ xrep_agi( return error; /* Start rewriting the header and implant the btrees we found. */ - xrep_agi_init_header(sc, agi_bp, &old_agi); - xrep_agi_set_roots(sc, agi, fab); - error = xrep_agi_calc_from_btrees(sc, agi_bp); + xrep_agi_init_header(ragi); + xrep_agi_set_roots(ragi); + error = xrep_agi_calc_from_btrees(ragi); if (error) goto out_revert; /* Reinitialize in-core state. */ - return xrep_agi_commit_new(sc, agi_bp); + return xrep_agi_commit_new(ragi); out_revert: /* Mark the incore AGI state stale and revert the AGI. */ clear_bit(XFS_AGSTATE_AGI_INIT, &sc->sa.pag->pag_opstate); - memcpy(agi, &old_agi, sizeof(old_agi)); + memcpy(ragi->agi_bp->b_addr, &ragi->old_agi, sizeof(struct xfs_agi)); return error; } ^ permalink raw reply related [flat|nested] 100+ messages in thread
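The shape of this refactoring (one heap-allocated context handed to every helper, instead of several large locals threaded through argument lists) can be sketched outside the kernel as follows. Everything here is hypothetical and only illustrates the pattern: calloc() stands in for kzalloc(), the toy_* types and functions are invented, and the toy frees its context directly, whereas the patch stores the allocation in sc->buf and does not free it inside xrep_agi() itself.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

enum { TOY_INOBT = 0, TOY_FINOBT, TOY_END, TOY_MAX };

/* Hypothetical per-btree search context, loosely like xrep_find_ag_btree. */
struct toy_find_btree {
	int		owner;
	unsigned int	maxlevels;
	unsigned int	root;		/* 0 means "no root found" in this toy */
};

/* Hypothetical header that is being rebuilt. */
struct toy_header {
	unsigned int	root;
	unsigned int	level;
	unsigned int	count;
};

/* All repair state lives in one heap object rather than on the stack. */
struct toy_repair {
	struct toy_header	*hdr;		/* header being rewritten */
	struct toy_find_btree	fab[TOY_MAX];	/* btree search context */
	struct toy_header	old_hdr;	/* saved copy in case we revert */
};

/* Save the old header, then start from a clean slate. */
static void
toy_init_header(struct toy_repair *r)
{
	memcpy(&r->old_hdr, r->hdr, sizeof(r->old_hdr));
	memset(r->hdr, 0, sizeof(*r->hdr));
}

/* Install the root we found; fail if the search came up empty. */
static int
toy_set_roots(struct toy_repair *r)
{
	if (r->fab[TOY_INOBT].root == 0)
		return -EINVAL;
	r->hdr->root = r->fab[TOY_INOBT].root;
	return 0;
}

/* Rebuild *hdr; on failure, put the old contents back. */
static int
toy_repair_header(struct toy_header *hdr, unsigned int found_root)
{
	struct toy_repair	*r;
	int			error;

	assert(hdr != NULL);

	r = calloc(1, sizeof(*r));
	if (!r)
		return -ENOMEM;
	r->hdr = hdr;

	/* Compound literals replace the old on-stack array initializer. */
	r->fab[TOY_INOBT] = (struct toy_find_btree){
		.owner = 1, .maxlevels = 4, .root = found_root,
	};
	r->fab[TOY_FINOBT] = (struct toy_find_btree){
		.owner = 1, .maxlevels = 4,
	};

	toy_init_header(r);
	error = toy_set_roots(r);
	if (error)
		memcpy(r->hdr, &r->old_hdr, sizeof(r->old_hdr));

	free(r);
	return error;
}
```

Each helper now takes a single pointer, so adding more state later (as the next patch in the series does) changes the struct but not any helper signature.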
* [PATCH 3/3] xfs: repair AGI unlinked inode bucket lists 2024-04-15 23:36 ` [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets Darrick J. Wong 2024-04-15 23:54 ` [PATCH 1/3] xfs: check AGI unlinked inode buckets Darrick J. Wong 2024-04-15 23:54 ` [PATCH 2/3] xfs: hoist AGI repair context to a heap object Darrick J. Wong @ 2024-04-15 23:55 ` Darrick J. Wong 2 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:55 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Teach the AGI repair code to rebuild the unlinked buckets and lists. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/agheader_repair.c | 774 ++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/agino_bitmap.h | 49 +++ fs/xfs/scrub/trace.h | 255 +++++++++++++ 3 files changed, 1074 insertions(+), 4 deletions(-) create mode 100644 fs/xfs/scrub/agino_bitmap.h diff --git a/fs/xfs/scrub/agheader_repair.c b/fs/xfs/scrub/agheader_repair.c index d210bd7d5eb1..0dbc484b182f 100644 --- a/fs/xfs/scrub/agheader_repair.c +++ b/fs/xfs/scrub/agheader_repair.c @@ -21,13 +21,18 @@ #include "xfs_rmap_btree.h" #include "xfs_refcount_btree.h" #include "xfs_ag.h" +#include "xfs_inode.h" +#include "xfs_iunlink_item.h" #include "scrub/scrub.h" #include "scrub/common.h" #include "scrub/trace.h" #include "scrub/repair.h" #include "scrub/bitmap.h" #include "scrub/agb_bitmap.h" +#include "scrub/agino_bitmap.h" #include "scrub/reap.h" +#include "scrub/xfile.h" +#include "scrub/xfarray.h" /* Superblock */ @@ -796,6 +801,8 @@ enum { XREP_AGI_MAX }; +#define XREP_AGI_LOOKUP_BATCH 32 + struct xrep_agi { struct xfs_scrub *sc; @@ -807,8 +814,34 @@ struct xrep_agi { /* old AGI contents in case we have to revert */ struct xfs_agi old_agi; + + /* bitmap of which inodes are unlinked */ + struct xagino_bitmap iunlink_bmp; + + /* heads of the unlinked inode bucket lists 
*/ + xfs_agino_t iunlink_heads[XFS_AGI_UNLINKED_BUCKETS]; + + /* scratchpad for batched lookups of the radix tree */ + struct xfs_inode *lookup_batch[XREP_AGI_LOOKUP_BATCH]; + + /* Map of ino -> next_ino for unlinked inode processing. */ + struct xfarray *iunlink_next; + + /* Map of ino -> prev_ino for unlinked inode processing. */ + struct xfarray *iunlink_prev; }; +static void +xrep_agi_buf_cleanup( + void *buf) +{ + struct xrep_agi *ragi = buf; + + xfarray_destroy(ragi->iunlink_prev); + xfarray_destroy(ragi->iunlink_next); + xagino_bitmap_destroy(&ragi->iunlink_bmp); +} + /* * Given the inode btree roots described by *fab, find the roots, check them * for sanity, and pass the root data back out via *fab. @@ -871,10 +904,6 @@ xrep_agi_init_header( if (xfs_has_crc(mp)) uuid_copy(&agi->agi_uuid, &mp->m_sb.sb_meta_uuid); - /* We don't know how to fix the unlinked list yet. */ - memcpy(&agi->agi_unlinked, &old_agi->agi_unlinked, - sizeof(agi->agi_unlinked)); - /* Mark the incore AGF data stale until we're done fixing things. */ ASSERT(xfs_perag_initialised_agi(pag)); clear_bit(XFS_AGSTATE_AGI_INIT, &pag->pag_opstate); @@ -946,6 +975,714 @@ xrep_agi_calc_from_btrees( return error; } +/* + * Record a forwards unlinked chain pointer from agino -> next_agino in our + * staging information. + */ +static inline int +xrep_iunlink_store_next( + struct xrep_agi *ragi, + xfs_agino_t agino, + xfs_agino_t next_agino) +{ + ASSERT(next_agino != 0); + + return xfarray_store(ragi->iunlink_next, agino, &next_agino); +} + +/* + * Record a backwards unlinked chain pointer from prev_ino <- agino in our + * staging information. + */ +static inline int +xrep_iunlink_store_prev( + struct xrep_agi *ragi, + xfs_agino_t agino, + xfs_agino_t prev_agino) +{ + ASSERT(prev_agino != 0); + + return xfarray_store(ragi->iunlink_prev, agino, &prev_agino); +} + +/* + * Given an @agino, look up the next inode in the iunlink bucket. 
Returns + * NULLAGINO if we're at the end of the chain, 0 if @agino is not in memory + * like it should be, or a per-AG inode number. + */ +static inline xfs_agino_t +xrep_iunlink_next( + struct xfs_scrub *sc, + xfs_agino_t agino) +{ + struct xfs_inode *ip; + + ip = xfs_iunlink_lookup(sc->sa.pag, agino); + if (!ip) + return 0; + + return ip->i_next_unlinked; +} + +/* + * Load the inode @agino into memory, set its i_prev_unlinked, and drop the + * inode so it can be inactivated. Returns NULLAGINO if we're at the end of + * the chain or if we should stop walking the chain due to corruption; or a + * per-AG inode number. + */ +STATIC xfs_agino_t +xrep_iunlink_reload_next( + struct xrep_agi *ragi, + xfs_agino_t prev_agino, + xfs_agino_t agino) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_inode *ip; + xfs_ino_t ino; + xfs_agino_t ret = NULLAGINO; + int error; + + ino = XFS_AGINO_TO_INO(sc->mp, sc->sa.pag->pag_agno, agino); + error = xchk_iget(ragi->sc, ino, &ip); + if (error) + return ret; + + trace_xrep_iunlink_reload_next(ip, prev_agino); + + /* If this is a linked inode, stop processing the chain. */ + if (VFS_I(ip)->i_nlink != 0) { + xrep_iunlink_store_next(ragi, agino, NULLAGINO); + goto rele; + } + + ip->i_prev_unlinked = prev_agino; + ret = ip->i_next_unlinked; + + /* + * Drop the inode reference that we just took. We hold the AGI, so + * this inode cannot move off the unlinked list and hence cannot be + * reclaimed. + */ +rele: + xchk_irele(sc, ip); + return ret; +} + +/* + * Walk an AGI unlinked bucket's list to load incore any unlinked inodes that + * still existed at mount time. This can happen if iunlink processing fails + * during log recovery. 
+ */ +STATIC int +xrep_iunlink_walk_ondisk_bucket( + struct xrep_agi *ragi, + unsigned int bucket) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_agi *agi = sc->sa.agi_bp->b_addr; + xfs_agino_t prev_agino = NULLAGINO; + xfs_agino_t next_agino; + int error = 0; + + next_agino = be32_to_cpu(agi->agi_unlinked[bucket]); + while (next_agino != NULLAGINO) { + xfs_agino_t agino = next_agino; + + if (xchk_should_terminate(ragi->sc, &error)) + return error; + + trace_xrep_iunlink_walk_ondisk_bucket(sc->sa.pag, bucket, + prev_agino, agino); + + if (bucket != agino % XFS_AGI_UNLINKED_BUCKETS) + break; + + next_agino = xrep_iunlink_next(sc, agino); + if (!next_agino) + next_agino = xrep_iunlink_reload_next(ragi, prev_agino, + agino); + + prev_agino = agino; + } + + return 0; +} + +/* Decide if this is an unlinked inode in this AG. */ +STATIC bool +xrep_iunlink_igrab( + struct xfs_perag *pag, + struct xfs_inode *ip) +{ + struct xfs_mount *mp = pag->pag_mount; + + if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno) + return false; + + if (!xfs_inode_on_unlinked_list(ip)) + return false; + + return true; +} + +/* + * Mark the given inode in the lookup batch in our unlinked inode bitmap, and + * remember if this inode is the start of the unlinked chain. 
+ */ +STATIC int +xrep_iunlink_visit( + struct xrep_agi *ragi, + unsigned int batch_idx) +{ + struct xfs_mount *mp = ragi->sc->mp; + struct xfs_inode *ip = ragi->lookup_batch[batch_idx]; + xfs_agino_t agino; + unsigned int bucket; + int error; + + ASSERT(XFS_INO_TO_AGNO(mp, ip->i_ino) == ragi->sc->sa.pag->pag_agno); + ASSERT(xfs_inode_on_unlinked_list(ip)); + + agino = XFS_INO_TO_AGINO(mp, ip->i_ino); + bucket = agino % XFS_AGI_UNLINKED_BUCKETS; + + trace_xrep_iunlink_visit(ragi->sc->sa.pag, bucket, + ragi->iunlink_heads[bucket], ip); + + error = xagino_bitmap_set(&ragi->iunlink_bmp, agino, 1); + if (error) + return error; + + if (ip->i_prev_unlinked == NULLAGINO) { + if (ragi->iunlink_heads[bucket] == NULLAGINO) + ragi->iunlink_heads[bucket] = agino; + } + + return 0; +} + +/* + * Find all incore unlinked inodes so that we can rebuild the unlinked buckets. + * We hold the AGI so there should not be any modifications to the unlinked + * list. + */ +STATIC int +xrep_iunlink_mark_incore( + struct xrep_agi *ragi) +{ + struct xfs_perag *pag = ragi->sc->sa.pag; + struct xfs_mount *mp = pag->pag_mount; + uint32_t first_index = 0; + bool done = false; + unsigned int nr_found = 0; + + do { + unsigned int i; + int error = 0; + + if (xchk_should_terminate(ragi->sc, &error)) + return error; + + rcu_read_lock(); + + nr_found = radix_tree_gang_lookup(&pag->pag_ici_root, + (void **)&ragi->lookup_batch, first_index, + XREP_AGI_LOOKUP_BATCH); + if (!nr_found) { + rcu_read_unlock(); + return 0; + } + + for (i = 0; i < nr_found; i++) { + struct xfs_inode *ip = ragi->lookup_batch[i]; + + if (done || !xrep_iunlink_igrab(pag, ip)) + ragi->lookup_batch[i] = NULL; + + /* + * Update the index for the next lookup. Catch + * overflows into the next AG range which can occur if + * we have inodes in the last block of the AG and we + * are currently pointing to the last inode. 
+ * + * Because we may see inodes that are from the wrong AG + * due to RCU freeing and reallocation, only update the + * index if it lies in this AG. It was a race that lead + * us to see this inode, so another lookup from the + * same index will not find it again. + */ + if (XFS_INO_TO_AGNO(mp, ip->i_ino) != pag->pag_agno) + continue; + first_index = XFS_INO_TO_AGINO(mp, ip->i_ino + 1); + if (first_index < XFS_INO_TO_AGINO(mp, ip->i_ino)) + done = true; + } + + /* unlock now we've grabbed the inodes. */ + rcu_read_unlock(); + + for (i = 0; i < nr_found; i++) { + if (!ragi->lookup_batch[i]) + continue; + error = xrep_iunlink_visit(ragi, i); + if (error) + return error; + } + } while (!done); + + return 0; +} + +/* Mark all the unlinked ondisk inodes in this inobt record in iunlink_bmp. */ +STATIC int +xrep_iunlink_mark_ondisk_rec( + struct xfs_btree_cur *cur, + const union xfs_btree_rec *rec, + void *priv) +{ + struct xfs_inobt_rec_incore irec; + struct xrep_agi *ragi = priv; + struct xfs_scrub *sc = ragi->sc; + struct xfs_mount *mp = cur->bc_mp; + xfs_agino_t agino; + unsigned int i; + int error = 0; + + xfs_inobt_btrec_to_irec(mp, rec, &irec); + + for (i = 0, agino = irec.ir_startino; + i < XFS_INODES_PER_CHUNK; + i++, agino++) { + struct xfs_inode *ip; + unsigned int len = 1; + + /* Skip free inodes */ + if (XFS_INOBT_MASK(i) & irec.ir_free) + continue; + /* Skip inodes we've seen before */ + if (xagino_bitmap_test(&ragi->iunlink_bmp, agino, &len)) + continue; + + /* + * Skip incore inodes; these were already picked up by + * the _mark_incore step. + */ + rcu_read_lock(); + ip = radix_tree_lookup(&sc->sa.pag->pag_ici_root, agino); + rcu_read_unlock(); + if (ip) + continue; + + /* + * Try to look up this inode. If we can't get it, just move + * on because we haven't actually scrubbed the inobt or the + * inodes yet. 
+ */ + error = xchk_iget(ragi->sc, + XFS_AGINO_TO_INO(mp, sc->sa.pag->pag_agno, + agino), + &ip); + if (error) + continue; + + trace_xrep_iunlink_reload_ondisk(ip); + + if (VFS_I(ip)->i_nlink == 0) + error = xagino_bitmap_set(&ragi->iunlink_bmp, agino, 1); + xchk_irele(sc, ip); + if (error) + break; + } + + return error; +} + +/* + * Find ondisk inodes that are unlinked and not in cache, and mark them in + * iunlink_bmp. We haven't checked the inobt yet, so we don't error out if + * the btree is corrupt. + */ +STATIC void +xrep_iunlink_mark_ondisk( + struct xrep_agi *ragi) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_buf *agi_bp = ragi->agi_bp; + struct xfs_btree_cur *cur; + int error; + + cur = xfs_inobt_init_cursor(sc->sa.pag, sc->tp, agi_bp); + error = xfs_btree_query_all(cur, xrep_iunlink_mark_ondisk_rec, ragi); + xfs_btree_del_cursor(cur, error); +} + +/* + * Walk an iunlink bucket's inode list. For each inode that should be on this + * chain, clear its entry in iunlink_bmp because it's ok and we don't need + * to touch it further. + */ +STATIC int +xrep_iunlink_resolve_bucket( + struct xrep_agi *ragi, + unsigned int bucket) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_inode *ip; + xfs_agino_t prev_agino = NULLAGINO; + xfs_agino_t next_agino = ragi->iunlink_heads[bucket]; + int error = 0; + + while (next_agino != NULLAGINO) { + if (xchk_should_terminate(ragi->sc, &error)) + return error; + + /* Find the next inode in the chain. */ + ip = xfs_iunlink_lookup(sc->sa.pag, next_agino); + if (!ip) { + /* Inode not incore? Terminate the chain. */ + trace_xrep_iunlink_resolve_uncached(sc->sa.pag, + bucket, prev_agino, next_agino); + + next_agino = NULLAGINO; + break; + } + + if (next_agino % XFS_AGI_UNLINKED_BUCKETS != bucket) { + /* + * Inode is in the wrong bucket. Advance the list, + * but pretend we didn't see this inode. 
+ */ + trace_xrep_iunlink_resolve_wronglist(sc->sa.pag, + bucket, prev_agino, next_agino); + + next_agino = ip->i_next_unlinked; + continue; + } + + if (!xfs_inode_on_unlinked_list(ip)) { + /* + * Incore inode doesn't think this inode is on an + * unlinked list. This is probably because we reloaded + * it from disk. Advance the list, but pretend we + * didn't see this inode; we'll fix that later. + */ + trace_xrep_iunlink_resolve_nolist(sc->sa.pag, + bucket, prev_agino, next_agino); + next_agino = ip->i_next_unlinked; + continue; + } + + trace_xrep_iunlink_resolve_ok(sc->sa.pag, bucket, prev_agino, + next_agino); + + /* + * Otherwise, this inode's unlinked pointers are ok. Clear it + * from the unlinked bitmap since we're done with it, and make + * sure the chain is still correct. + */ + error = xagino_bitmap_clear(&ragi->iunlink_bmp, next_agino, 1); + if (error) + return error; + + /* Remember the previous inode's next pointer. */ + if (prev_agino != NULLAGINO) { + error = xrep_iunlink_store_next(ragi, prev_agino, + next_agino); + if (error) + return error; + } + + /* Remember this inode's previous pointer. */ + error = xrep_iunlink_store_prev(ragi, next_agino, prev_agino); + if (error) + return error; + + /* Advance the list and remember this inode. */ + prev_agino = next_agino; + next_agino = ip->i_next_unlinked; + } + + /* Update the previous inode's next pointer. */ + if (prev_agino != NULLAGINO) { + error = xrep_iunlink_store_next(ragi, prev_agino, next_agino); + if (error) + return error; + } + + return 0; +} + +/* Reinsert this unlinked inode into the head of the staged bucket list. */ +STATIC int +xrep_iunlink_add_to_bucket( + struct xrep_agi *ragi, + xfs_agino_t agino) +{ + xfs_agino_t current_head; + unsigned int bucket; + int error; + + bucket = agino % XFS_AGI_UNLINKED_BUCKETS; + + /* Point this inode at the current head of the bucket list. 
*/ + current_head = ragi->iunlink_heads[bucket]; + + trace_xrep_iunlink_add_to_bucket(ragi->sc->sa.pag, bucket, agino, + current_head); + + error = xrep_iunlink_store_next(ragi, agino, current_head); + if (error) + return error; + + /* Remember the head inode's previous pointer. */ + if (current_head != NULLAGINO) { + error = xrep_iunlink_store_prev(ragi, current_head, agino); + if (error) + return error; + } + + ragi->iunlink_heads[bucket] = agino; + return 0; +} + +/* Reinsert unlinked inodes into the staged iunlink buckets. */ +STATIC int +xrep_iunlink_add_lost_inodes( + uint32_t start, + uint32_t len, + void *priv) +{ + struct xrep_agi *ragi = priv; + int error; + + for (; len > 0; start++, len--) { + error = xrep_iunlink_add_to_bucket(ragi, start); + if (error) + return error; + } + + return 0; +} + +/* + * Figure out the iunlink bucket values and find inodes that need to be + * reinserted into the list. + */ +STATIC int +xrep_iunlink_rebuild_buckets( + struct xrep_agi *ragi) +{ + unsigned int i; + int error; + + /* + * Walk the ondisk AGI unlinked list to find inodes that are on the + * list but aren't in memory. This can happen if a past log recovery + * tried to clear the iunlinked list but failed. Our scan rebuilds the + * unlinked list using incore inodes, so we must load and link them + * properly. + */ + for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) { + error = xrep_iunlink_walk_ondisk_bucket(ragi, i); + if (error) + return error; + } + + /* + * Record all the incore unlinked inodes in iunlink_bmp that we didn't + * find by walking the ondisk iunlink buckets. This shouldn't happen, + * but we can't risk forgetting an inode somewhere. + */ + error = xrep_iunlink_mark_incore(ragi); + if (error) + return error; + + /* + * If there are ondisk inodes that are unlinked and have not been loaded + * into cache, record them in iunlink_bmp. 
+ */ + xrep_iunlink_mark_ondisk(ragi); + + /* + * Walk each iunlink bucket to (re)construct as much of the incore list + * as would be correct. For each inode that survives this step, mark + * it clear in iunlink_bmp; we're done with those inodes. + */ + for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) { + error = xrep_iunlink_resolve_bucket(ragi, i); + if (error) + return error; + } + + /* + * Any unlinked inodes that we didn't find through the bucket list + * walk (or that were ignored by the walk) must be inserted into the bucket + * list. Stage this in memory for now. + */ + return xagino_bitmap_walk(&ragi->iunlink_bmp, + xrep_iunlink_add_lost_inodes, ragi); +} + +/* Update i_next_iunlinked for the inode @agino. */ +STATIC int +xrep_iunlink_relink_next( + struct xrep_agi *ragi, + xfarray_idx_t idx, + xfs_agino_t next_agino) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_perag *pag = sc->sa.pag; + struct xfs_inode *ip; + xfarray_idx_t agino = idx - 1; + bool want_rele = false; + int error = 0; + + ip = xfs_iunlink_lookup(pag, agino); + if (!ip) { + xfs_ino_t ino; + xfs_agino_t prev_agino; + + /* + * No inode exists in cache. Load it off the disk so that we + * can reinsert it into the incore unlinked list. + */ + ino = XFS_AGINO_TO_INO(sc->mp, pag->pag_agno, agino); + error = xchk_iget(sc, ino, &ip); + if (error) + return -EFSCORRUPTED; + + want_rele = true; + + /* Set the backward pointer since this just came off disk. */ + error = xfarray_load(ragi->iunlink_prev, agino, &prev_agino); + if (error) + goto out_rele; + + trace_xrep_iunlink_relink_prev(ip, prev_agino); + ip->i_prev_unlinked = prev_agino; + } + + /* Update the forward pointer. 
*/ + if (ip->i_next_unlinked != next_agino) { + error = xfs_iunlink_log_inode(sc->tp, ip, pag, next_agino); + if (error) + goto out_rele; + + trace_xrep_iunlink_relink_next(ip, next_agino); + ip->i_next_unlinked = next_agino; + } + +out_rele: + /* + * The iunlink lookup doesn't igrab because we hold the AGI buffer lock + * and the inode cannot be reclaimed. However, if we used iget to load + * a missing inode, we must irele it here. + */ + if (want_rele) + xchk_irele(sc, ip); + return error; +} + +/* Update i_prev_iunlinked for the inode @agino. */ +STATIC int +xrep_iunlink_relink_prev( + struct xrep_agi *ragi, + xfarray_idx_t idx, + xfs_agino_t prev_agino) +{ + struct xfs_scrub *sc = ragi->sc; + struct xfs_perag *pag = sc->sa.pag; + struct xfs_inode *ip; + xfarray_idx_t agino = idx - 1; + bool want_rele = false; + int error = 0; + + ASSERT(prev_agino != 0); + + ip = xfs_iunlink_lookup(pag, agino); + if (!ip) { + xfs_ino_t ino; + xfs_agino_t next_agino; + + /* + * No inode exists in cache. Load it off the disk so that we + * can reinsert it into the incore unlinked list. + */ + ino = XFS_AGINO_TO_INO(sc->mp, pag->pag_agno, agino); + error = xchk_iget(sc, ino, &ip); + if (error) + return -EFSCORRUPTED; + + want_rele = true; + + /* Set the forward pointer since this just came off disk. */ + error = xfarray_load(ragi->iunlink_prev, agino, &next_agino); + if (error) + goto out_rele; + + error = xfs_iunlink_log_inode(sc->tp, ip, pag, next_agino); + if (error) + goto out_rele; + + trace_xrep_iunlink_relink_next(ip, next_agino); + ip->i_next_unlinked = next_agino; + } + + /* Update the backward pointer. */ + if (ip->i_prev_unlinked != prev_agino) { + trace_xrep_iunlink_relink_prev(ip, prev_agino); + ip->i_prev_unlinked = prev_agino; + } + +out_rele: + /* + * The iunlink lookup doesn't igrab because we hold the AGI buffer lock + * and the inode cannot be reclaimed. However, if we used iget to load + * a missing inode, we must irele it here. 
+ */ + if (want_rele) + xchk_irele(sc, ip); + return error; +} + +/* Log all the iunlink updates we need to finish regenerating the AGI. */ +STATIC int +xrep_iunlink_commit( + struct xrep_agi *ragi) +{ + struct xfs_agi *agi = ragi->agi_bp->b_addr; + xfarray_idx_t idx = XFARRAY_CURSOR_INIT; + xfs_agino_t agino; + unsigned int i; + int error; + + /* Fix all the forward links */ + while ((error = xfarray_iter(ragi->iunlink_next, &idx, &agino)) == 1) { + error = xrep_iunlink_relink_next(ragi, idx, agino); + if (error) + return error; + } + + /* Fix all the back links */ + idx = XFARRAY_CURSOR_INIT; + while ((error = xfarray_iter(ragi->iunlink_prev, &idx, &agino)) == 1) { + error = xrep_iunlink_relink_prev(ragi, idx, agino); + if (error) + return error; + } + + /* Copy the staged iunlink buckets to the new AGI. */ + for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) { + trace_xrep_iunlink_commit_bucket(ragi->sc->sa.pag, i, + be32_to_cpu(ragi->old_agi.agi_unlinked[i]), + ragi->iunlink_heads[i]); + + agi->agi_unlinked[i] = cpu_to_be32(ragi->iunlink_heads[i]); + } + + return 0; +} + /* Trigger reinitialization of the in-core data. */ STATIC int xrep_agi_commit_new( @@ -979,6 +1716,8 @@ xrep_agi( { struct xrep_agi *ragi; struct xfs_mount *mp = sc->mp; + char *descr; + unsigned int i; int error; /* We require the rmapbt to rebuild anything. 
*/ @@ -1005,6 +1744,26 @@ xrep_agi( .buf_ops = NULL, }; + for (i = 0; i < XFS_AGI_UNLINKED_BUCKETS; i++) + ragi->iunlink_heads[i] = NULLAGINO; + + xagino_bitmap_init(&ragi->iunlink_bmp); + sc->buf_cleanup = xrep_agi_buf_cleanup; + + descr = xchk_xfile_ag_descr(sc, "iunlinked next pointers"); + error = xfarray_create(descr, 0, sizeof(xfs_agino_t), + &ragi->iunlink_next); + kfree(descr); + if (error) + return error; + + descr = xchk_xfile_ag_descr(sc, "iunlinked prev pointers"); + error = xfarray_create(descr, 0, sizeof(xfs_agino_t), + &ragi->iunlink_prev); + kfree(descr); + if (error) + return error; + /* * Make sure we have the AGI buffer, as scrub might have decided it * was corrupt after xfs_ialloc_read_agi failed with -EFSCORRUPTED. @@ -1022,6 +1781,10 @@ xrep_agi( if (error) return error; + error = xrep_iunlink_rebuild_buckets(ragi); + if (error) + return error; + /* Last chance to abort before we start committing fixes. */ if (xchk_should_terminate(sc, &error)) return error; @@ -1030,6 +1793,9 @@ xrep_agi( xrep_agi_init_header(ragi); xrep_agi_set_roots(ragi); error = xrep_agi_calc_from_btrees(ragi); + if (error) + goto out_revert; + error = xrep_iunlink_commit(ragi); if (error) goto out_revert; diff --git a/fs/xfs/scrub/agino_bitmap.h b/fs/xfs/scrub/agino_bitmap.h new file mode 100644 index 000000000000..56d7db5f1699 --- /dev/null +++ b/fs/xfs/scrub/agino_bitmap.h @@ -0,0 +1,49 @@ +// SPDX-License-Identifier: GPL-2.0-or-later +/* + * Copyright (c) 2018-2024 Oracle. All Rights Reserved. + * Author: Darrick J. 
Wong <djwong@kernel.org> + */ +#ifndef __XFS_SCRUB_AGINO_BITMAP_H__ +#define __XFS_SCRUB_AGINO_BITMAP_H__ + +/* Bitmaps, but for type-checked for xfs_agino_t */ + +struct xagino_bitmap { + struct xbitmap32 aginobitmap; +}; + +static inline void xagino_bitmap_init(struct xagino_bitmap *bitmap) +{ + xbitmap32_init(&bitmap->aginobitmap); +} + +static inline void xagino_bitmap_destroy(struct xagino_bitmap *bitmap) +{ + xbitmap32_destroy(&bitmap->aginobitmap); +} + +static inline int xagino_bitmap_clear(struct xagino_bitmap *bitmap, + xfs_agino_t agino, unsigned int len) +{ + return xbitmap32_clear(&bitmap->aginobitmap, agino, len); +} + +static inline int xagino_bitmap_set(struct xagino_bitmap *bitmap, + xfs_agino_t agino, unsigned int len) +{ + return xbitmap32_set(&bitmap->aginobitmap, agino, len); +} + +static inline bool xagino_bitmap_test(struct xagino_bitmap *bitmap, + xfs_agino_t agino, unsigned int *len) +{ + return xbitmap32_test(&bitmap->aginobitmap, agino, len); +} + +static inline int xagino_bitmap_walk(struct xagino_bitmap *bitmap, + xbitmap32_walk_fn fn, void *priv) +{ + return xbitmap32_walk(&bitmap->aginobitmap, fn, priv); +} + +#endif /* __XFS_SCRUB_AGINO_BITMAP_H__ */ diff --git a/fs/xfs/scrub/trace.h b/fs/xfs/scrub/trace.h index 03cb095fc1a1..814db1d1747a 100644 --- a/fs/xfs/scrub/trace.h +++ b/fs/xfs/scrub/trace.h @@ -2757,6 +2757,261 @@ DEFINE_EVENT(xrep_symlink_class, name, \ DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_rebuild); DEFINE_XREP_SYMLINK_EVENT(xrep_symlink_reset_fork); +TRACE_EVENT(xrep_iunlink_visit, + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, + xfs_agino_t bucket_agino, struct xfs_inode *ip), + TP_ARGS(pag, bucket, bucket_agino, ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, agino) + __field(unsigned int, bucket) + __field(xfs_agino_t, bucket_agino) + __field(xfs_agino_t, prev_agino) + __field(xfs_agino_t, next_agino) + ), + TP_fast_assign( + __entry->dev = 
pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->agino = XFS_INO_TO_AGINO(pag->pag_mount, ip->i_ino); + __entry->bucket = bucket; + __entry->bucket_agino = bucket_agino; + __entry->prev_agino = ip->i_prev_unlinked; + __entry->next_agino = ip->i_next_unlinked; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x bucket_agino 0x%x prev_agino 0x%x next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->bucket, + __entry->agino, + __entry->bucket_agino, + __entry->prev_agino, + __entry->next_agino) +); + +TRACE_EVENT(xrep_iunlink_reload_next, + TP_PROTO(struct xfs_inode *ip, xfs_agino_t prev_agino), + TP_ARGS(ip, prev_agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, agino) + __field(xfs_agino_t, old_prev_agino) + __field(xfs_agino_t, prev_agino) + __field(xfs_agino_t, next_agino) + __field(unsigned int, nlink) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino); + __entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino); + __entry->old_prev_agino = ip->i_prev_unlinked; + __entry->prev_agino = prev_agino; + __entry->next_agino = ip->i_next_unlinked; + __entry->nlink = VFS_I(ip)->i_nlink; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x nlink %u old_prev_agino %u prev_agino 0x%x next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agino % XFS_AGI_UNLINKED_BUCKETS, + __entry->agino, + __entry->nlink, + __entry->old_prev_agino, + __entry->prev_agino, + __entry->next_agino) +); + +TRACE_EVENT(xrep_iunlink_reload_ondisk, + TP_PROTO(struct xfs_inode *ip), + TP_ARGS(ip), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, agino) + __field(unsigned int, nlink) + __field(xfs_agino_t, next_agino) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->agno = 
XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino); + __entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino); + __entry->nlink = VFS_I(ip)->i_nlink; + __entry->next_agino = ip->i_next_unlinked; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x nlink %u next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agino % XFS_AGI_UNLINKED_BUCKETS, + __entry->agino, + __entry->nlink, + __entry->next_agino) +); + +TRACE_EVENT(xrep_iunlink_walk_ondisk_bucket, + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, + xfs_agino_t prev_agino, xfs_agino_t next_agino), + TP_ARGS(pag, bucket, prev_agino, next_agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(unsigned int, bucket) + __field(xfs_agino_t, prev_agino) + __field(xfs_agino_t, next_agino) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->bucket = bucket; + __entry->prev_agino = prev_agino; + __entry->next_agino = next_agino; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u prev_agino 0x%x next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->bucket, + __entry->prev_agino, + __entry->next_agino) +); + +DECLARE_EVENT_CLASS(xrep_iunlink_resolve_class, + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, + xfs_agino_t prev_agino, xfs_agino_t next_agino), + TP_ARGS(pag, bucket, prev_agino, next_agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(unsigned int, bucket) + __field(xfs_agino_t, prev_agino) + __field(xfs_agino_t, next_agino) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->bucket = bucket; + __entry->prev_agino = prev_agino; + __entry->next_agino = next_agino; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u prev_agino 0x%x next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->bucket, + 
__entry->prev_agino, + __entry->next_agino) +); +#define DEFINE_REPAIR_IUNLINK_RESOLVE_EVENT(name) \ +DEFINE_EVENT(xrep_iunlink_resolve_class, name, \ + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, \ + xfs_agino_t prev_agino, xfs_agino_t next_agino), \ + TP_ARGS(pag, bucket, prev_agino, next_agino)) +DEFINE_REPAIR_IUNLINK_RESOLVE_EVENT(xrep_iunlink_resolve_uncached); +DEFINE_REPAIR_IUNLINK_RESOLVE_EVENT(xrep_iunlink_resolve_wronglist); +DEFINE_REPAIR_IUNLINK_RESOLVE_EVENT(xrep_iunlink_resolve_nolist); +DEFINE_REPAIR_IUNLINK_RESOLVE_EVENT(xrep_iunlink_resolve_ok); + +TRACE_EVENT(xrep_iunlink_relink_next, + TP_PROTO(struct xfs_inode *ip, xfs_agino_t next_agino), + TP_ARGS(ip, next_agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, agino) + __field(xfs_agino_t, next_agino) + __field(xfs_agino_t, new_next_agino) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino); + __entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino); + __entry->next_agino = ip->i_next_unlinked; + __entry->new_next_agino = next_agino; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x next_agino 0x%x -> 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agino % XFS_AGI_UNLINKED_BUCKETS, + __entry->agino, + __entry->next_agino, + __entry->new_next_agino) +); + +TRACE_EVENT(xrep_iunlink_relink_prev, + TP_PROTO(struct xfs_inode *ip, xfs_agino_t prev_agino), + TP_ARGS(ip, prev_agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(xfs_agino_t, agino) + __field(xfs_agino_t, prev_agino) + __field(xfs_agino_t, new_prev_agino) + ), + TP_fast_assign( + __entry->dev = ip->i_mount->m_super->s_dev; + __entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino); + __entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino); + __entry->prev_agino = ip->i_prev_unlinked; + __entry->new_prev_agino = prev_agino; + 
), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x prev_agino 0x%x -> 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->agino % XFS_AGI_UNLINKED_BUCKETS, + __entry->agino, + __entry->prev_agino, + __entry->new_prev_agino) +); + +TRACE_EVENT(xrep_iunlink_add_to_bucket, + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, + xfs_agino_t agino, xfs_agino_t curr_head), + TP_ARGS(pag, bucket, agino, curr_head), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(unsigned int, bucket) + __field(xfs_agino_t, agino) + __field(xfs_agino_t, next_agino) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->bucket = bucket; + __entry->agino = agino; + __entry->next_agino = curr_head; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x next_agino 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->bucket, + __entry->agino, + __entry->next_agino) +); + +TRACE_EVENT(xrep_iunlink_commit_bucket, + TP_PROTO(struct xfs_perag *pag, unsigned int bucket, + xfs_agino_t old_agino, xfs_agino_t agino), + TP_ARGS(pag, bucket, old_agino, agino), + TP_STRUCT__entry( + __field(dev_t, dev) + __field(xfs_agnumber_t, agno) + __field(unsigned int, bucket) + __field(xfs_agino_t, old_agino) + __field(xfs_agino_t, agino) + ), + TP_fast_assign( + __entry->dev = pag->pag_mount->m_super->s_dev; + __entry->agno = pag->pag_agno; + __entry->bucket = bucket; + __entry->old_agino = old_agino; + __entry->agino = agino; + ), + TP_printk("dev %d:%d agno 0x%x bucket %u agino 0x%x -> 0x%x", + MAJOR(__entry->dev), MINOR(__entry->dev), + __entry->agno, + __entry->bucket, + __entry->old_agino, + __entry->agino) +); + #endif /* IS_ENABLED(CONFIG_XFS_ONLINE_REPAIR) */ #endif /* _TRACE_XFS_SCRUB_TRACE_H */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCHSET v30.3 13/16] xfs: inode-related repair fixes 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (11 preceding siblings ...) 2024-04-15 23:36 ` [PATCHSET v30.3 12/16] xfs: online fsck of iunlink buckets Darrick J. Wong @ 2024-04-15 23:36 ` Darrick J. Wong 2024-04-15 23:55 ` [PATCH 1/4] xfs: check unused nlink fields in the ondisk inode Darrick J. Wong ` (3 more replies) 2024-04-15 23:37 ` [PATCHSET v30.3 14/16] xfs: less heavy locks during fstrim Darrick J. Wong ` (2 subsequent siblings) 15 siblings, 4 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:36 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs Hi all, While doing QA of the online fsck code, I made a few observations: First, nobody was checking that the di_onlink field is actually zero; Second, allocating a temporary file for repairs can fail (and thus bring down the entire fs) if the inode cluster is corrupt; Third, file link counts do not pin at ~0U to prevent integer overflows; and Fourth, the x{chk,rep}_metadata_inode_fork functions should be subclassing the main scrub context, not modifying the parent's setup willy-nilly. This scattered patchset fixes those four problems. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome. 
--D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=inode-repair-improvements-6.10 --- Commits in this patchset: * xfs: check unused nlink fields in the ondisk inode * xfs: try to avoid allocating from sick inode clusters * xfs: pin inodes that would otherwise overflow link count * xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype --- fs/xfs/libxfs/xfs_format.h | 6 ++++ fs/xfs/libxfs/xfs_ialloc.c | 40 ++++++++++++++++++++++++ fs/xfs/libxfs/xfs_inode_buf.c | 8 +++++ fs/xfs/scrub/common.c | 23 ++------------ fs/xfs/scrub/dir_repair.c | 11 ++----- fs/xfs/scrub/inode_repair.c | 12 +++++++ fs/xfs/scrub/nlinks.c | 4 ++ fs/xfs/scrub/nlinks_repair.c | 8 +---- fs/xfs/scrub/repair.c | 67 ++++++++--------------------------------- fs/xfs/scrub/scrub.c | 63 +++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/scrub.h | 11 +++++++ fs/xfs/xfs_inode.c | 33 +++++++++++++------- 12 files changed, 187 insertions(+), 99 deletions(-) ^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/4] xfs: check unused nlink fields in the ondisk inode 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong @ 2024-04-15 23:55 ` Darrick J. Wong 2024-04-15 23:55 ` [PATCH 2/4] xfs: try to avoid allocating from sick inode clusters Darrick J. Wong ` (2 subsequent siblings) 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:55 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> v2/v3 inodes use di_nlink and not di_onlink; and v1 inodes use di_onlink and not di_nlink. Whichever field is not in use, make sure its contents are zero, and teach xfs_scrub to fix that if it is. This clears a bunch of missing scrub failure errors in xfs/385 for core.onlink. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_inode_buf.c | 8 ++++++++ fs/xfs/scrub/inode_repair.c | 12 ++++++++++++ 2 files changed, 20 insertions(+) diff --git a/fs/xfs/libxfs/xfs_inode_buf.c b/fs/xfs/libxfs/xfs_inode_buf.c index d0dcce462bf4..d79002343d0b 100644 --- a/fs/xfs/libxfs/xfs_inode_buf.c +++ b/fs/xfs/libxfs/xfs_inode_buf.c @@ -491,6 +491,14 @@ xfs_dinode_verify( return __this_address; } + if (dip->di_version > 1) { + if (dip->di_onlink) + return __this_address; + } else { + if (dip->di_nlink) + return __this_address; + } + /* don't allow invalid i_size */ di_size = be64_to_cpu(dip->di_size); if (di_size & (1ULL << 63)) diff --git a/fs/xfs/scrub/inode_repair.c b/fs/xfs/scrub/inode_repair.c index 0dde5df2f8d3..e3b74ea50fde 100644 --- a/fs/xfs/scrub/inode_repair.c +++ b/fs/xfs/scrub/inode_repair.c @@ -516,6 +516,17 @@ xrep_dinode_mode( return 0; } +/* Fix unused link count fields having nonzero values. 
*/ +STATIC void +xrep_dinode_nlinks( + struct xfs_dinode *dip) +{ + if (dip->di_version > 1) + dip->di_onlink = 0; + else + dip->di_nlink = 0; +} + /* Fix any conflicting flags that the verifiers complain about. */ STATIC void xrep_dinode_flags( @@ -1377,6 +1388,7 @@ xrep_dinode_core( iget_error = xrep_dinode_mode(ri, dip); if (iget_error) goto write; + xrep_dinode_nlinks(dip); xrep_dinode_flags(sc, dip, ri->rt_extents > 0); xrep_dinode_size(ri, dip); xrep_dinode_extsize_hints(sc, dip); ^ permalink raw reply related [flat|nested] 100+ messages in thread
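To illustrate the rule that [PATCH 1/4] adds to the verifier and to repair, here is a minimal userspace sketch. The struct and the demo_* names are hypothetical stand-ins, not the kernel's xfs_dinode or its helpers, and the sketch ignores ondisk endianness (a zero test gives the same answer either way).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for the ondisk inode core. */
struct demo_dinode {
	uint8_t		di_version;	/* 1, 2 or 3 */
	uint16_t	di_onlink;	/* link count used by v1 inodes */
	uint32_t	di_nlink;	/* link count used by v2/v3 inodes */
};

/* Verifier rule: whichever link count field is not in use must be zero. */
static bool demo_nlink_fields_ok(const struct demo_dinode *dip)
{
	assert(dip != NULL);
	if (dip->di_version > 1)
		return dip->di_onlink == 0;
	return dip->di_nlink == 0;
}

/* Repair rule: zero the unused field and leave the live one alone. */
static void demo_fix_nlink_fields(struct demo_dinode *dip)
{
	assert(dip != NULL);
	if (dip->di_version > 1)
		dip->di_onlink = 0;
	else
		dip->di_nlink = 0;
}
```

The repair helper never touches the field that is in use, so the real link count survives the fix.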
* [PATCH 2/4] xfs: try to avoid allocating from sick inode clusters 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong 2024-04-15 23:55 ` [PATCH 1/4] xfs: check unused nlink fields in the ondisk inode Darrick J. Wong @ 2024-04-15 23:55 ` Darrick J. Wong 2024-04-15 23:55 ` [PATCH 3/4] xfs: pin inodes that would otherwise overflow link count Darrick J. Wong 2024-04-15 23:56 ` [PATCH 4/4] xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:55 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> I noticed that xfs/413 and xfs/375 occasionally failed while fuzzing core.mode of an inode. The root cause of these problems is that the field we fuzzed (core.mode or core.magic, typically) causes the entire inode cluster buffer verification to fail, which affects several inodes at once. The repair process tries to create either a /lost+found or a temporary repair file, but regrettably it picks the same inode cluster that we just corrupted, with the result that repair triggers the demise of the filesystem. Try to avoid this by making the inode allocation path detect when the perag health status indicates that someone has found bad inode cluster buffers, and try to read the inode cluster buffer. If the cluster buffer fails the verifiers, try another AG. This isn't foolproof and can result in premature ENOSPC, but that might be better than shutting down. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_ialloc.c | 40 ++++++++++++++++++++++++++++++++++++++++ 1 file changed, 40 insertions(+) diff --git a/fs/xfs/libxfs/xfs_ialloc.c b/fs/xfs/libxfs/xfs_ialloc.c index cb37f0007731..14c81f227c5b 100644 --- a/fs/xfs/libxfs/xfs_ialloc.c +++ b/fs/xfs/libxfs/xfs_ialloc.c @@ -1057,6 +1057,33 @@ xfs_inobt_first_free_inode( return xfs_lowbit64(realfree); } +/* + * If this AG has corrupt inodes, check if allocating this inode would fail + * with corruption errors. Returns 0 if we're clear, or EAGAIN to try again + * somewhere else. + */ +static int +xfs_dialloc_check_ino( + struct xfs_perag *pag, + struct xfs_trans *tp, + xfs_ino_t ino) +{ + struct xfs_imap imap; + struct xfs_buf *bp; + int error; + + error = xfs_imap(pag, tp, ino, &imap, 0); + if (error) + return -EAGAIN; + + error = xfs_imap_to_bp(pag->pag_mount, tp, &imap, &bp); + if (error) + return -EAGAIN; + + xfs_trans_brelse(tp, bp); + return 0; +} + /* * Allocate an inode using the inobt-only algorithm. */ @@ -1309,6 +1336,13 @@ xfs_dialloc_ag_inobt( ASSERT((XFS_AGINO_TO_OFFSET(mp, rec.ir_startino) % XFS_INODES_PER_CHUNK) == 0); ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, rec.ir_startino + offset); + + if (xfs_ag_has_sickness(pag, XFS_SICK_AG_INODES)) { + error = xfs_dialloc_check_ino(pag, tp, ino); + if (error) + goto error0; + } + rec.ir_free &= ~XFS_INOBT_MASK(offset); rec.ir_freecount--; error = xfs_inobt_update(cur, &rec); @@ -1584,6 +1618,12 @@ xfs_dialloc_ag( XFS_INODES_PER_CHUNK) == 0); ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, rec.ir_startino + offset); + if (xfs_ag_has_sickness(pag, XFS_SICK_AG_INODES)) { + error = xfs_dialloc_check_ino(pag, tp, ino); + if (error) + goto error_cur; + } + /* * Modify or remove the finobt record. */ ^ permalink raw reply related [flat|nested] 100+ messages in thread
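The control flow described in [PATCH 2/4] can be modelled in userspace roughly as follows. This is only a sketch under simplifying assumptions: the demo_ag struct and demo_pick_ag() are hypothetical, and a single per-AG boolean stands in for the result of reading and verifying the inode cluster buffer, whereas the kernel code probes the specific inode it is about to allocate.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-AG state, standing in for perag health and the inobt. */
struct demo_ag {
	bool	sick_inodes;		/* health status reports bad inode clusters */
	bool	cluster_readable;	/* would the cluster buffer pass the verifiers? */
	bool	has_free_inode;
};

/*
 * Pick an AG to allocate an inode from, starting at @start and wrapping.
 * Returns the AG index, or -1 if nothing usable was found (the premature
 * ENOSPC case mentioned in the commit message).
 */
static int demo_pick_ag(const struct demo_ag *ags, size_t nr, size_t start)
{
	assert(ags != NULL && nr > 0);

	for (size_t i = 0; i < nr; i++) {
		size_t idx = (start + i) % nr;
		const struct demo_ag *ag = &ags[idx];

		if (!ag->has_free_inode)
			continue;
		/* Only pay for the probe when the AG is already known to be sick. */
		if (ag->sick_inodes && !ag->cluster_readable)
			continue;	/* the -EAGAIN case: try somewhere else */
		return (int)idx;
	}
	return -1;
}
```

A healthy AG skips the probe entirely, so the extra buffer read is only paid for when the health status already reports a problem.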
* [PATCH 3/4] xfs: pin inodes that would otherwise overflow link count 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong 2024-04-15 23:55 ` [PATCH 1/4] xfs: check unused nlink fields in the ondisk inode Darrick J. Wong 2024-04-15 23:55 ` [PATCH 2/4] xfs: try to avoid allocating from sick inode clusters Darrick J. Wong @ 2024-04-15 23:55 ` Darrick J. Wong 2024-04-15 23:56 ` [PATCH 4/4] xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:55 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> The VFS inc_nlink function does not explicitly check for integer overflows in the i_nlink field. Instead, the VFS checks the link count against s_max_links in the vfs_{link,create,rename} functions. XFS sets the maximum link count to 2.1 billion, so integer overflows should not be a problem. However, it's possible that online repair could find that a file has more than four billion links, particularly if the link count got corrupted while creating hardlinks to the file. The di_nlinkv2 field is not large enough to store a value larger than 2^32 - 1, so we ought to define a magic pin value of ~0U which means that the inode never gets deleted. This will prevent a use-after-free (UAF) error if the repair finds this situation and users begin deleting links to the file. Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_format.h | 6 ++++++ fs/xfs/scrub/dir_repair.c | 11 +++-------- fs/xfs/scrub/nlinks.c | 4 +++- fs/xfs/scrub/nlinks_repair.c | 8 ++------ fs/xfs/xfs_inode.c | 33 ++++++++++++++++++++++----------- 5 files changed, 36 insertions(+), 26 deletions(-) diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h index 10153ce116d4..f1818c54af6f 100644 --- a/fs/xfs/libxfs/xfs_format.h +++ b/fs/xfs/libxfs/xfs_format.h @@ -899,6 +899,12 @@ static inline uint xfs_dinode_size(int version) */ #define XFS_MAXLINK ((1U << 31) - 1U) +/* + * Any file that hits the maximum ondisk link count should be pinned to avoid + * a use-after-free situation. + */ +#define XFS_NLINK_PINNED (~0U) + /* * Values for di_format * diff --git a/fs/xfs/scrub/dir_repair.c b/fs/xfs/scrub/dir_repair.c index c150b2efa2c2..38957da26b94 100644 --- a/fs/xfs/scrub/dir_repair.c +++ b/fs/xfs/scrub/dir_repair.c @@ -1145,7 +1145,9 @@ xrep_dir_set_nlink( struct xfs_scrub *sc = rd->sc; struct xfs_inode *dp = sc->ip; struct xfs_perag *pag; - unsigned int new_nlink = rd->subdirs + 2; + unsigned int new_nlink = min_t(unsigned long long, + rd->subdirs + 2, + XFS_NLINK_PINNED); int error; /* @@ -1201,13 +1203,6 @@ xrep_dir_swap( bool ip_local, temp_local; int error = 0; - /* - * If we found enough subdirs to overflow this directory's link count, - * bail out to userspace before we modify anything. - */ - if (rd->subdirs + 2 > XFS_MAXLINK) - return -EFSCORRUPTED; - /* * If we never found the parent for this directory, temporarily assign * the root dir as the parent; we'll move this to the orphanage after diff --git a/fs/xfs/scrub/nlinks.c b/fs/xfs/scrub/nlinks.c index c456523fac9c..fcb9c473f372 100644 --- a/fs/xfs/scrub/nlinks.c +++ b/fs/xfs/scrub/nlinks.c @@ -607,9 +607,11 @@ xchk_nlinks_compare_inode( * this as a corruption. The VFS won't let users increase the link * count, but it will let them decrease it. 
*/ - if (total_links > XFS_MAXLINK) { + if (total_links > XFS_NLINK_PINNED) { xchk_ino_set_corrupt(sc, ip->i_ino); goto out_corrupt; + } else if (total_links > XFS_MAXLINK) { + xchk_ino_set_warning(sc, ip->i_ino); } /* Link counts should match. */ diff --git a/fs/xfs/scrub/nlinks_repair.c b/fs/xfs/scrub/nlinks_repair.c index 0cb67339eac8..83f8637bb08f 100644 --- a/fs/xfs/scrub/nlinks_repair.c +++ b/fs/xfs/scrub/nlinks_repair.c @@ -238,14 +238,10 @@ xrep_nlinks_repair_inode( /* Commit the new link count if it changed. */ if (total_links != actual_nlink) { - if (total_links > XFS_MAXLINK) { - trace_xrep_nlinks_unfixable_inode(mp, ip, &obs); - goto out_trans; - } - trace_xrep_nlinks_update_inode(mp, ip, &obs); - set_nlink(VFS_I(ip), total_links); + set_nlink(VFS_I(ip), min_t(unsigned long long, total_links, + XFS_NLINK_PINNED)); dirty = true; } diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index fed0cd6bffdf..03dcb4ac0431 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -890,22 +890,25 @@ xfs_init_new_inode( */ static int /* error */ xfs_droplink( - xfs_trans_t *tp, - xfs_inode_t *ip) + struct xfs_trans *tp, + struct xfs_inode *ip) { - if (VFS_I(ip)->i_nlink == 0) { - xfs_alert(ip->i_mount, - "%s: Attempt to drop inode (%llu) with nlink zero.", - __func__, ip->i_ino); - return -EFSCORRUPTED; - } + struct inode *inode = VFS_I(ip); xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG); - drop_nlink(VFS_I(ip)); + if (inode->i_nlink == 0) { + xfs_info_ratelimited(tp->t_mountp, + "Inode 0x%llx link count dropped below zero. 
Pinning link count.", + ip->i_ino); + set_nlink(inode, XFS_NLINK_PINNED); + } + if (inode->i_nlink != XFS_NLINK_PINNED) + drop_nlink(inode); + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); - if (VFS_I(ip)->i_nlink) + if (inode->i_nlink) return 0; return xfs_iunlink(tp, ip); @@ -919,9 +922,17 @@ xfs_bumplink( struct xfs_trans *tp, struct xfs_inode *ip) { + struct inode *inode = VFS_I(ip); + xfs_trans_ichgtime(tp, ip, XFS_ICHGTIME_CHG); - inc_nlink(VFS_I(ip)); + if (inode->i_nlink == XFS_NLINK_PINNED - 1) + xfs_info_ratelimited(tp->t_mountp, + "Inode 0x%llx link count exceeded maximum. Pinning link count.", + ip->i_ino); + if (inode->i_nlink != XFS_NLINK_PINNED) + inc_nlink(inode); + xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE); } ^ permalink raw reply related [flat|nested] 100+ messages in thread
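The pinning behaviour that [PATCH 3/4] gives xfs_bumplink, xfs_droplink, and the repair code can be sketched in userspace like this. All demo_* names are hypothetical; the sketch assumes a 32-bit unsigned int and leaves out logging, timestamps, and transactions.

```c
#include <assert.h>
#include <stddef.h>

#define DEMO_NLINK_PINNED	(~0U)	/* mirrors XFS_NLINK_PINNED in the patch */

struct demo_inode {
	unsigned int	nlink;
};

/* Increment the link count unless it is already pinned. */
static void demo_bumplink(struct demo_inode *ip)
{
	assert(ip != NULL);
	if (ip->nlink != DEMO_NLINK_PINNED)
		ip->nlink++;
}

/*
 * Decrement the link count.  A drop from zero pins the count instead of
 * wrapping, and a pinned count never moves.  Returns 1 if the inode now
 * has no links and would go on the unlinked list, 0 otherwise.
 */
static int demo_droplink(struct demo_inode *ip)
{
	assert(ip != NULL);
	if (ip->nlink == 0)
		ip->nlink = DEMO_NLINK_PINNED;
	if (ip->nlink != DEMO_NLINK_PINNED)
		ip->nlink--;
	return ip->nlink == 0;
}

/* Repair side: clamp an observed 64-bit link total to the pin value. */
static unsigned int demo_clamp_nlink(unsigned long long total_links)
{
	if (total_links > DEMO_NLINK_PINNED)
		return DEMO_NLINK_PINNED;
	return (unsigned int)total_links;
}
```

Once the count reaches the pin value it stays there in both directions, so a pinned inode never reaches zero links and is never freed.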
* [PATCH 4/4] xfs: create subordinate scrub contexts for xchk_metadata_inode_subtype 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:55 ` [PATCH 3/4] xfs: pin inodes that would otherwise overflow link count Darrick J. Wong @ 2024-04-15 23:56 ` Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:56 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> When a file-based metadata structure is being scrubbed in xchk_metadata_inode_subtype, we should create an entirely new scrub context so that each scrubber doesn't trip over another's buffers. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/common.c | 23 +++-------------- fs/xfs/scrub/repair.c | 67 ++++++++++--------------------------------------- fs/xfs/scrub/scrub.c | 63 ++++++++++++++++++++++++++++++++++++++++++++++ fs/xfs/scrub/scrub.h | 11 ++++++++ 4 files changed, 91 insertions(+), 73 deletions(-) diff --git a/fs/xfs/scrub/common.c b/fs/xfs/scrub/common.c index a2da2bef509a..48302532d10d 100644 --- a/fs/xfs/scrub/common.c +++ b/fs/xfs/scrub/common.c @@ -1203,27 +1203,12 @@ xchk_metadata_inode_subtype( struct xfs_scrub *sc, unsigned int scrub_type) { - __u32 smtype = sc->sm->sm_type; - unsigned int sick_mask = sc->sick_mask; + struct xfs_scrub_subord *sub; int error; - sc->sm->sm_type = scrub_type; - - switch (scrub_type) { - case XFS_SCRUB_TYPE_INODE: - error = xchk_inode(sc); - break; - case XFS_SCRUB_TYPE_BMBTD: - error = xchk_bmap_data(sc); - break; - default: - ASSERT(0); - error = -EFSCORRUPTED; - break; - } - - sc->sick_mask = sick_mask; - sc->sm->sm_type = smtype; + sub = xchk_scrub_create_subord(sc, scrub_type); + error = sub->sc.ops->scrub(&sub->sc); + xchk_scrub_free_subord(sub); return error; } diff --git a/fs/xfs/scrub/repair.c 
b/fs/xfs/scrub/repair.c index 369f0430e4ba..b6aff89679d5 100644 --- a/fs/xfs/scrub/repair.c +++ b/fs/xfs/scrub/repair.c @@ -1009,55 +1009,27 @@ xrep_metadata_inode_subtype( struct xfs_scrub *sc, unsigned int scrub_type) { - __u32 smtype = sc->sm->sm_type; - __u32 smflags = sc->sm->sm_flags; - unsigned int sick_mask = sc->sick_mask; + struct xfs_scrub_subord *sub; int error; /* - * Let's see if the inode needs repair. We're going to open-code calls - * to the scrub and repair functions so that we can hang on to the + * Let's see if the inode needs repair. Use a subordinate scrub context + * to call the scrub and repair functions so that we can hang on to the * resources that we already acquired instead of using the standard * setup/teardown routines. */ - sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; - sc->sm->sm_type = scrub_type; - - switch (scrub_type) { - case XFS_SCRUB_TYPE_INODE: - error = xchk_inode(sc); - break; - case XFS_SCRUB_TYPE_BMBTD: - error = xchk_bmap_data(sc); - break; - case XFS_SCRUB_TYPE_BMBTA: - error = xchk_bmap_attr(sc); - break; - default: - ASSERT(0); - error = -EFSCORRUPTED; - } + sub = xchk_scrub_create_subord(sc, scrub_type); + error = sub->sc.ops->scrub(&sub->sc); if (error) goto out; - - if (!xrep_will_attempt(sc)) + if (!xrep_will_attempt(&sub->sc)) goto out; /* * Repair some part of the inode. This will potentially join the inode * to the transaction. */ - switch (scrub_type) { - case XFS_SCRUB_TYPE_INODE: - error = xrep_inode(sc); - break; - case XFS_SCRUB_TYPE_BMBTD: - error = xrep_bmap(sc, XFS_DATA_FORK, false); - break; - case XFS_SCRUB_TYPE_BMBTA: - error = xrep_bmap(sc, XFS_ATTR_FORK, false); - break; - } + error = sub->sc.ops->repair(&sub->sc); if (error) goto out; @@ -1066,10 +1038,10 @@ xrep_metadata_inode_subtype( * that the inode will not be joined to the transaction when we exit * the function. 
*/ - error = xfs_defer_finish(&sc->tp); + error = xfs_defer_finish(&sub->sc.tp); if (error) goto out; - error = xfs_trans_roll(&sc->tp); + error = xfs_trans_roll(&sub->sc.tp); if (error) goto out; @@ -1077,31 +1049,18 @@ xrep_metadata_inode_subtype( * Clear the corruption flags and re-check the metadata that we just * repaired. */ - sc->sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; - - switch (scrub_type) { - case XFS_SCRUB_TYPE_INODE: - error = xchk_inode(sc); - break; - case XFS_SCRUB_TYPE_BMBTD: - error = xchk_bmap_data(sc); - break; - case XFS_SCRUB_TYPE_BMBTA: - error = xchk_bmap_attr(sc); - break; - } + sub->sc.sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; + error = sub->sc.ops->scrub(&sub->sc); if (error) goto out; /* If corruption persists, the repair has failed. */ - if (xchk_needs_repair(sc->sm)) { + if (xchk_needs_repair(sub->sc.sm)) { error = -EFSCORRUPTED; goto out; } out: - sc->sick_mask = sick_mask; - sc->sm->sm_type = smtype; - sc->sm->sm_flags = smflags; + xchk_scrub_free_subord(sub); return error; } diff --git a/fs/xfs/scrub/scrub.c b/fs/xfs/scrub/scrub.c index 301d5b753fdd..ebb06838c31b 100644 --- a/fs/xfs/scrub/scrub.c +++ b/fs/xfs/scrub/scrub.c @@ -177,6 +177,39 @@ xchk_fsgates_disable( } #undef FSGATES_MASK +/* Free the resources associated with a scrub subtype. 
*/ +void +xchk_scrub_free_subord( + struct xfs_scrub_subord *sub) +{ + struct xfs_scrub *sc = sub->parent_sc; + + ASSERT(sc->ip == sub->sc.ip); + ASSERT(sc->orphanage == sub->sc.orphanage); + ASSERT(sc->tempip == sub->sc.tempip); + + sc->sm->sm_type = sub->old_smtype; + sc->sm->sm_flags = sub->old_smflags | + (sc->sm->sm_flags & XFS_SCRUB_FLAGS_OUT); + sc->tp = sub->sc.tp; + + if (sub->sc.buf) { + if (sub->sc.buf_cleanup) + sub->sc.buf_cleanup(sub->sc.buf); + kvfree(sub->sc.buf); + } + if (sub->sc.xmbtp) + xmbuf_free(sub->sc.xmbtp); + if (sub->sc.xfile) + xfile_destroy(sub->sc.xfile); + + sc->ilock_flags = sub->sc.ilock_flags; + sc->orphanage_ilock_flags = sub->sc.orphanage_ilock_flags; + sc->temp_ilock_flags = sub->sc.temp_ilock_flags; + + kfree(sub); +} + /* Free all the resources and finish the transactions. */ STATIC int xchk_teardown( @@ -505,6 +538,36 @@ static inline void xchk_postmortem(struct xfs_scrub *sc) } #endif /* CONFIG_XFS_ONLINE_REPAIR */ +/* + * Create a new scrub context from an existing one, but with a different scrub + * type. + */ +struct xfs_scrub_subord * +xchk_scrub_create_subord( + struct xfs_scrub *sc, + unsigned int subtype) +{ + struct xfs_scrub_subord *sub; + + sub = kzalloc(sizeof(*sub), XCHK_GFP_FLAGS); + if (!sub) + return ERR_PTR(-ENOMEM); + + sub->old_smtype = sc->sm->sm_type; + sub->old_smflags = sc->sm->sm_flags; + sub->parent_sc = sc; + memcpy(&sub->sc, sc, sizeof(struct xfs_scrub)); + sub->sc.ops = &meta_scrub_ops[subtype]; + sub->sc.sm->sm_type = subtype; + sub->sc.sm->sm_flags &= ~XFS_SCRUB_FLAGS_OUT; + sub->sc.buf = NULL; + sub->sc.buf_cleanup = NULL; + sub->sc.xfile = NULL; + sub->sc.xmbtp = NULL; + + return sub; +} + /* Dispatch metadata scrubbing. 
*/ int xfs_scrub_metadata( diff --git a/fs/xfs/scrub/scrub.h b/fs/xfs/scrub/scrub.h index 7abe498f7a46..54a4242bc79c 100644 --- a/fs/xfs/scrub/scrub.h +++ b/fs/xfs/scrub/scrub.h @@ -156,6 +156,17 @@ struct xfs_scrub { */ #define XREP_FSGATES_ALL (XREP_FSGATES_EXCHANGE_RANGE) +struct xfs_scrub_subord { + struct xfs_scrub sc; + struct xfs_scrub *parent_sc; + unsigned int old_smtype; + unsigned int old_smflags; +}; + +struct xfs_scrub_subord *xchk_scrub_create_subord(struct xfs_scrub *sc, + unsigned int subtype); +void xchk_scrub_free_subord(struct xfs_scrub_subord *sub); + /* Metadata scrubbers */ int xchk_tester(struct xfs_scrub *sc); int xchk_superblock(struct xfs_scrub *sc); ^ permalink raw reply related [flat|nested] 100+ messages in thread
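The save, copy, and restore pattern behind xchk_scrub_create_subord and xchk_scrub_free_subord in [PATCH 4/4] can be illustrated with a small userspace model. Everything named demo_* is hypothetical, DEMO_FLAGS_OUT is an invented mask standing in for XFS_SCRUB_FLAGS_OUT, and the model returns NULL on allocation failure rather than an ERR_PTR value.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define DEMO_FLAGS_OUT	0xF0u	/* hypothetical mask of result ("output") flags */

/* The request is shared between the parent and subordinate contexts. */
struct demo_req {
	unsigned int	type;
	unsigned int	flags;
};

struct demo_ctx {
	struct demo_req	*req;
	void		*buf;	/* private scratch buffer, never shared */
};

struct demo_subord {
	struct demo_ctx	ctx;
	struct demo_ctx	*parent;
	unsigned int	old_type;
	unsigned int	old_flags;
};

/* Clone the parent context for a different scrub type. */
static struct demo_subord *
demo_create_subord(struct demo_ctx *parent, unsigned int subtype)
{
	struct demo_subord *sub;

	assert(parent != NULL && parent->req != NULL);
	sub = calloc(1, sizeof(*sub));
	if (!sub)
		return NULL;

	sub->old_type = parent->req->type;
	sub->old_flags = parent->req->flags;
	sub->parent = parent;
	memcpy(&sub->ctx, parent, sizeof(*parent));
	sub->ctx.req->type = subtype;
	sub->ctx.req->flags &= ~DEMO_FLAGS_OUT;
	sub->ctx.buf = NULL;	/* the child must not trip over the parent's buffer */
	return sub;
}

/* Restore the parent's request, keeping any result flags the child raised. */
static void demo_free_subord(struct demo_subord *sub)
{
	struct demo_req *req = sub->parent->req;

	req->type = sub->old_type;
	req->flags = sub->old_flags | (req->flags & DEMO_FLAGS_OUT);
	free(sub->ctx.buf);
	free(sub);
}
```

A caller of a helper shaped like this has to check the returned pointer before dereferencing it, whichever failure convention (NULL here, ERR_PTR in the patch) is in use.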
* [PATCHSET v30.3 14/16] xfs: less heavy locks during fstrim 2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong ` (12 preceding siblings ...) 2024-04-15 23:36 ` [PATCHSET v30.3 13/16] xfs: inode-related repair fixes Darrick J. Wong @ 2024-04-15 23:37 ` Darrick J. Wong 2024-04-15 23:56 ` [PATCH 1/1] xfs: fix performance problems when fstrimming a subset of a fragmented AG Darrick J. Wong 2024-04-15 23:37 ` [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2 Darrick J. Wong 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong 15 siblings, 1 reply; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:37 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Dave Chinner, linux-xfs Hi all, Congratulations! You have made it to the final patchset of the main online fsck feature! This patchset fixes some stalling behavior that I observed when running FITRIM against large flash-based filesystems with very heavily fragmented free space data. In summary -- the current fstrim implementation optimizes for trimming the largest free extents first, and holds the AGF lock for the duration of the operation. This is great if fstrim is being run as a foreground process by a sysadmin. For xfs_scrub, however, this isn't so good -- we don't really want to block on one huge kernel call while reporting no progress information. We don't want to hold the AGF so long that background processes stall. These problems are easily fixable by issuing smaller FITRIM calls, but there's still the problem of walking the entire cntbt. To solve that second problem, we introduce a new sub-AG FITRIM implementation. To solve the first problem, make it relax the AGF periodically. If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below. This has been running on the djcloud for months with no problems. Enjoy! 
Comments and questions are, as always, welcome. --D kernel git tree: https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=discard-relax-locks-6.10 --- Commits in this patchset: * xfs: fix performance problems when fstrimming a subset of a fragmented AG --- fs/xfs/xfs_discard.c | 153 ++++++++++++++++++++++++++++++-------------------- 1 file changed, 93 insertions(+), 60 deletions(-) ^ permalink raw reply [flat|nested] 100+ messages in thread
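As a sketch of the "issue smaller FITRIM calls" idea from the cover letter above, the helper below splits one large byte range into fixed-size chunks that a caller could submit one at a time, reporting progress in between. The names are hypothetical and the helper only computes ranges; it issues no ioctl. Per the patch description, chunking like this only pays off once the kernel can trim a sub-AG range without walking the whole cntbt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of one trim request, in bytes. */
struct demo_trim_range {
	uint64_t	start;
	uint64_t	len;
};

/*
 * Split [start, start + len) into pieces of at most @chunk bytes.  Up to
 * @max pieces are stored in @out; the return value is the total number of
 * pieces needed, which may exceed @max.
 */
static size_t
demo_split_trim(uint64_t start, uint64_t len, uint64_t chunk,
		struct demo_trim_range *out, size_t max)
{
	size_t n = 0;

	assert(chunk > 0);
	assert(out != NULL || max == 0);

	while (len > 0) {
		uint64_t this_len = len < chunk ? len : chunk;

		if (n < max) {
			out[n].start = start;
			out[n].len = this_len;
		}
		n++;
		start += this_len;
		len -= this_len;
	}
	return n;
}
```

Each resulting range would map onto the start and len fields of one FITRIM request, with the caller free to update its progress display between calls.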
* [PATCH 1/1] xfs: fix performance problems when fstrimming a subset of a fragmented AG
From: Darrick J. Wong @ 2024-04-15 23:56 UTC
To: chandanbabu, djwong; +Cc: Dave Chinner, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

On a 10TB filesystem where the free space in each AG is heavily fragmented, I noticed some very high runtimes on a FITRIM call for the entire filesystem. xfs_scrub likes to report progress information on each phase of the scrub, but a strace of a single FITRIM call for the entire filesystem:

ioctl(3, FITRIM, {start=0x0, len=10995116277760, minlen=0}) = 0 <686.209839>

shows that scrub is uncommunicative for the entire duration. Reducing the size of the FITRIM requests to a single AG at a time produces lower times for each individual call, but even this isn't quite acceptable, because the time between progress reports is still very high. Strace for the first 4x 1TB AGs looks like:

ioctl(3, FITRIM, {start=0x0, len=1099511627776, minlen=0}) = 0 <68.352033>
ioctl(3, FITRIM, {start=0x10000000000, len=1099511627776, minlen=0}) = 0 <68.760323>
ioctl(3, FITRIM, {start=0x20000000000, len=1099511627776, minlen=0}) = 0 <67.235226>
ioctl(3, FITRIM, {start=0x30000000000, len=1099511627776, minlen=0}) = 0 <69.465744>

I then had the idea to limit the length parameter of each call to a smallish amount (~11GB) so that we could report progress relatively quickly, but much to my surprise, each FITRIM call still took ~68 seconds!

Unfortunately, the by-length fstrim implementation handles this poorly because it walks the entire free space by length index (cntbt), which is a very inefficient way to walk a subset of the blocks of an AG.

Therefore, create a second implementation that will walk the bnobt and perform the trims in block number order.
This implementation avoids the worst problems of the original code, though it lacks the desirable attribute of freeing the biggest chunks first. On the other hand, this second implementation makes it much easier to constrain the system call latency and to report fstrim progress to anyone who's running xfs_scrub.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
---
 fs/xfs/xfs_discard.c | 153 ++++++++++++++++++++++++++++++--------------------
 1 file changed, 93 insertions(+), 60 deletions(-)

diff --git a/fs/xfs/xfs_discard.c b/fs/xfs/xfs_discard.c index 268bb734dc0a..25fe3b932b5a 100644 --- a/fs/xfs/xfs_discard.c +++ b/fs/xfs/xfs_discard.c @@ -145,14 +145,18 @@ xfs_discard_extents( return error; } +struct xfs_trim_cur { + xfs_agblock_t start; + xfs_extlen_t count; + xfs_agblock_t end; + xfs_extlen_t minlen; + bool by_bno; +}; static int xfs_trim_gather_extents( struct xfs_perag *pag, - xfs_daddr_t start, - xfs_daddr_t end, - xfs_daddr_t minlen, - struct xfs_alloc_rec_incore *tcur, + struct xfs_trim_cur *tcur, struct xfs_busy_extents *extents, uint64_t *blocks_trimmed) { @@ -179,21 +183,26 @@ xfs_trim_gather_extents( if (error) goto out_trans_cancel; - cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); - - /* - * Look up the extent length requested in the AGF and start with it. 
- */ - if (tcur->ar_startblock == NULLAGBLOCK) - error = xfs_alloc_lookup_ge(cur, 0, tcur->ar_blockcount, &i); - else - error = xfs_alloc_lookup_le(cur, tcur->ar_startblock, - tcur->ar_blockcount, &i); + if (tcur->by_bno) { + /* sub-AG discard request always starts at tcur->start */ + cur = xfs_bnobt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_le(cur, tcur->start, 0, &i); + if (!error && !i) + error = xfs_alloc_lookup_ge(cur, tcur->start, 0, &i); + } else if (tcur->start == 0) { + /* first time through a by-len starts with max length */ + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_ge(cur, 0, tcur->count, &i); + } else { + /* nth time through a by-len starts where we left off */ + cur = xfs_cntbt_init_cursor(mp, tp, agbp, pag); + error = xfs_alloc_lookup_le(cur, tcur->start, tcur->count, &i); + } if (error) goto out_del_cursor; if (i == 0) { /* nothing of that length left in the AG, we are done */ - tcur->ar_blockcount = 0; + tcur->count = 0; goto out_del_cursor; } @@ -204,8 +213,6 @@ xfs_trim_gather_extents( while (i) { xfs_agblock_t fbno; xfs_extlen_t flen; - xfs_daddr_t dbno; - xfs_extlen_t dlen; error = xfs_alloc_get_rec(cur, &fbno, &flen, &i); if (error) @@ -221,37 +228,45 @@ xfs_trim_gather_extents( * Update the cursor to point at this extent so we * restart the next batch from this extent. */ - tcur->ar_startblock = fbno; - tcur->ar_blockcount = flen; - break; - } - - /* - * use daddr format for all range/len calculations as that is - * the format the range/len variables are supplied in by - * userspace. - */ - dbno = XFS_AGB_TO_DADDR(mp, pag->pag_agno, fbno); - dlen = XFS_FSB_TO_BB(mp, flen); - - /* - * Too small? Give up. - */ - if (dlen < minlen) { - trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); - tcur->ar_blockcount = 0; + tcur->start = fbno; + tcur->count = flen; break; } /* * If the extent is entirely outside of the range we are - * supposed to discard skip it. 
Do not bother to trim - * down partially overlapping ranges for now. + * supposed to skip it. Do not bother to trim down partially + * overlapping ranges for now. */ - if (dbno + dlen < start || dbno > end) { + if (fbno + flen < tcur->start) { trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); goto next_extent; } + if (fbno > tcur->end) { + trace_xfs_discard_exclude(mp, pag->pag_agno, fbno, flen); + if (tcur->by_bno) { + tcur->count = 0; + break; + } + goto next_extent; + } + + /* Trim the extent returned to the range we want. */ + if (fbno < tcur->start) { + flen -= tcur->start - fbno; + fbno = tcur->start; + } + if (fbno + flen > tcur->end + 1) + flen = tcur->end - fbno + 1; + + /* Too small? Give up. */ + if (flen < tcur->minlen) { + trace_xfs_discard_toosmall(mp, pag->pag_agno, fbno, flen); + if (tcur->by_bno) + goto next_extent; + tcur->count = 0; + break; + } /* * If any blocks in the range are still busy, skip the @@ -266,7 +281,10 @@ xfs_trim_gather_extents( &extents->extent_list); *blocks_trimmed += flen; next_extent: - error = xfs_btree_decrement(cur, 0, &i); + if (tcur->by_bno) + error = xfs_btree_increment(cur, 0, &i); + else + error = xfs_btree_decrement(cur, 0, &i); if (error) break; @@ -276,7 +294,7 @@ xfs_trim_gather_extents( * is no more extents to search. 
*/ if (i == 0) - tcur->ar_blockcount = 0; + tcur->count = 0; } /* @@ -306,17 +324,22 @@ xfs_trim_should_stop(void) static int xfs_trim_extents( struct xfs_perag *pag, - xfs_daddr_t start, - xfs_daddr_t end, - xfs_daddr_t minlen, + xfs_agblock_t start, + xfs_agblock_t end, + xfs_extlen_t minlen, uint64_t *blocks_trimmed) { - struct xfs_alloc_rec_incore tcur = { - .ar_blockcount = pag->pagf_longest, - .ar_startblock = NULLAGBLOCK, + struct xfs_trim_cur tcur = { + .start = start, + .count = pag->pagf_longest, + .end = end, + .minlen = minlen, }; int error = 0; + if (start != 0 || end != pag->block_count) + tcur.by_bno = true; + do { struct xfs_busy_extents *extents; @@ -330,8 +353,8 @@ xfs_trim_extents( extents->owner = extents; INIT_LIST_HEAD(&extents->extent_list); - error = xfs_trim_gather_extents(pag, start, end, minlen, - &tcur, extents, blocks_trimmed); + error = xfs_trim_gather_extents(pag, &tcur, extents, + blocks_trimmed); if (error) { kfree(extents); break; @@ -354,7 +377,7 @@ xfs_trim_extents( if (xfs_trim_should_stop()) break; - } while (tcur.ar_blockcount != 0); + } while (tcur.count != 0); return error; @@ -378,8 +401,10 @@ xfs_ioc_trim( unsigned int granularity = bdev_discard_granularity(mp->m_ddev_targp->bt_bdev); struct fstrim_range range; - xfs_daddr_t start, end, minlen; - xfs_agnumber_t agno; + xfs_daddr_t start, end; + xfs_extlen_t minlen; + xfs_agnumber_t start_agno, end_agno; + xfs_agblock_t start_agbno, end_agbno; uint64_t blocks_trimmed = 0; int error, last_error = 0; @@ -399,7 +424,8 @@ xfs_ioc_trim( return -EFAULT; range.minlen = max_t(u64, granularity, range.minlen); - minlen = BTOBB(range.minlen); + minlen = XFS_B_TO_FSB(mp, range.minlen); + /* * Truncating down the len isn't actually quite correct, but using * BBTOB would mean we trivially get overflows for values @@ -413,15 +439,21 @@ xfs_ioc_trim( return -EINVAL; start = BTOBB(range.start); - end = start + BTOBBT(range.len) - 1; + end = min_t(xfs_daddr_t, start + BTOBBT(range.len), + 
XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks)) - 1; - if (end > XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1) - end = XFS_FSB_TO_BB(mp, mp->m_sb.sb_dblocks) - 1; + start_agno = xfs_daddr_to_agno(mp, start); + start_agbno = xfs_daddr_to_agbno(mp, start); + end_agno = xfs_daddr_to_agno(mp, end); + end_agbno = xfs_daddr_to_agbno(mp, end); - agno = xfs_daddr_to_agno(mp, start); - for_each_perag_range(mp, agno, xfs_daddr_to_agno(mp, end), pag) { - error = xfs_trim_extents(pag, start, end, minlen, - &blocks_trimmed); + for_each_perag_range(mp, start_agno, end_agno, pag) { + xfs_agblock_t agend = pag->block_count; + + if (start_agno == end_agno) + agend = end_agbno; + error = xfs_trim_extents(pag, start_agbno, agend, minlen, + &blocks_trimmed); if (error) last_error = error; @@ -429,6 +461,7 @@ xfs_ioc_trim( xfs_perag_rele(pag); break; } + start_agbno = 0; } if (last_error) ^ permalink raw reply related [flat|nested] 100+ messages in thread
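The range handling that this patch adds to xfs_trim_gather_extents() can be restated as a standalone function. This is a sketch under stated assumptions, not the kernel code: `trim_window`, `trim_result`, `trim_verdict` and `clamp_extent` are hypothetical names, plain `uint32_t` stands in for `xfs_agblock_t` and `xfs_extlen_t`, and the three verdicts only label the outcomes that the hunk handles with `goto next_extent`, `break`, or by queueing the extent.

```c
#include <assert.h>
#include <stdint.h>

enum trim_verdict { TRIM_SKIP, TRIM_TOO_SMALL, TRIM_OK };

/* Hypothetical stand-in for the start/end/minlen fields of xfs_trim_cur. */
struct trim_window {
	uint32_t start;		/* first AG block to consider */
	uint32_t end;		/* last AG block to consider (inclusive) */
	uint32_t minlen;	/* shortest extent worth discarding */
};

struct trim_result {
	enum trim_verdict verdict;
	uint32_t bno;		/* clamped start block */
	uint32_t len;		/* clamped length in blocks */
};

/* Clamp the free extent [fbno, fbno + flen) to the window. */
struct trim_result clamp_extent(struct trim_window w, uint32_t fbno,
				uint32_t flen)
{
	struct trim_result r = { TRIM_SKIP, fbno, flen };
	uint64_t ext_end = (uint64_t)fbno + flen;	/* one past the last block */

	assert(w.start <= w.end);

	/* Same tests as the patch: entirely before or after the window. */
	if (ext_end < w.start || fbno > w.end)
		return r;

	/* Cut off the part that precedes the window. */
	if (fbno < w.start) {
		flen -= w.start - fbno;
		fbno = w.start;
	}

	/* Cut off the part that runs past the inclusive end. */
	if ((uint64_t)fbno + flen > (uint64_t)w.end + 1)
		flen = w.end - fbno + 1;

	r.bno = fbno;
	r.len = flen;
	r.verdict = flen < w.minlen ? TRIM_TOO_SMALL : TRIM_OK;
	return r;
}
```

In the patch, what happens after a "too small" result depends on the walk order: the by-bno walk moves on to the next extent, while the by-length walk stops because every remaining record is shorter still.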
* [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2
From: Darrick J. Wong @ 2024-04-15 23:37 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

Hi all,

This series updates the design documentation for online fsck to reflect the final design of the parent pointers feature as well as the implementation of online fsck for the new metadata.

If you're going to start using this code, I strongly recommend pulling from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy! Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=online-fsck-design-6.10
---
Commits in this patchset:
 * docs: update the parent pointers documentation to the final version
 * docs: update online directory and parent pointer repair sections
 * docs: update offline parent pointer repair strategy
 * docs: describe xfs directory tree online fsck
---
 .../filesystems/xfs/xfs-online-fsck-design.rst | 354 +++++++++++++++-----
 1 file changed, 266 insertions(+), 88 deletions(-)
* [PATCH 1/4] docs: update the parent pointers documentation to the final version 2024-04-15 23:37 ` [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2 Darrick J. Wong @ 2024-04-15 23:56 ` Darrick J. Wong 2024-04-15 23:56 ` [PATCH 2/4] docs: update online directory and parent pointer repair sections Darrick J. Wong ` (2 subsequent siblings) 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:56 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we've decided on the ondisk format of parent pointers, update the documentation to reflect that. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 94 +++++++++++--------- 1 file changed, 53 insertions(+), 41 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 74a8e42c74bd..1e3211d12247 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4465,10 +4465,10 @@ reconstruction of filesystem space metadata. The parent pointer feature, however, makes total directory reconstruction possible. -XFS parent pointers include the dirent name and location of the entry within -the parent directory. +XFS parent pointers contain the information needed to identify the +corresponding directory entry in the parent directory. In other words, child files use extended attributes to store pointers to -parents in the form ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. +parents in the form ``(dirent_name) → (parent_inum, parent_gen)``. The directory checking process can be strengthened to ensure that the target of each dirent also contains a parent pointer pointing back to the dirent. 
Likewise, each parent pointer can be checked by ensuring that the target of @@ -4476,8 +4476,6 @@ each parent pointer is a directory and that it contains a dirent matching the parent pointer. Both online and offline repair can use this strategy. -**Note**: The ondisk format of parent pointers is not yet finalized. - +--------------------------------------------------------------------------+ | **Historical Sidebar**: | +--------------------------------------------------------------------------+ @@ -4519,8 +4517,58 @@ Both online and offline repair can use this strategy. | Chandan increased the maximum extent counts of both data and attribute | | forks, thereby ensuring that the extended attribute structure can grow | | to handle the maximum hardlink count of any file. | +| | +| For this second effort, the ondisk parent pointer format as originally | +| proposed was ``(parent_inum, parent_gen, dirent_pos) → (dirent_name)``. | +| The format was changed during development to eliminate the requirement | +| of repair tools needing to ensure that the ``dirent_pos`` field | +| always matched when reconstructing a directory. | +| | +| There were a few other ways to have solved that problem: | +| | +| 1. The field could be designated advisory, since the other three values | +| are sufficient to find the entry in the parent. | +| However, this makes indexed key lookup impossible while repairs are | +| ongoing. | +| | +| 2. We could allow creating directory entries at specified offsets, which | +| solves the referential integrity problem but runs the risk that | +| dirent creation will fail due to conflicts with the free space in the | +| directory. | +| | +| These conflicts could be resolved by appending the directory entry | +| and amending the xattr code to support updating an xattr key and | +| reindexing the dabtree, though this would have to be performed with | +| the parent directory still locked. | +| | +| 3. 
Same as above, but remove the old parent pointer entry and add a new | +| one atomically. | +| | +| 4. Change the ondisk xattr format to | +| ``(parent_inum, name) → (parent_gen)``, which would provide the attr | +| name uniqueness that we require, without forcing repair code to | +| update the dirent position. | +| Unfortunately, this requires changes to the xattr code to support | +| attr names as long as 263 bytes. | +| | +| 5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → | +| (name, parent_gen)``. | +| If the hash is sufficiently resistant to collisions (e.g. sha256) | +| then this should provide the attr name uniqueness that we require. | +| Names shorter than 247 bytes could be stored directly. | +| | +| 6. Change the ondisk xattr format to ``(dirent_name) → (parent_ino, | +| parent_gen)``. This format doesn't require any of the complicated | +| nested name hashing of the previous suggestions. However, it was | +| discovered that multiple hardlinks to the same inode with the same | +| filename caused performance problems with hashed xattr lookups, so | +| the parent inumber is now xor'd into the hash index. | +| | +| In the end, it was decided that solution #6 was the most compact and the | +| most performant. A new hash function was designed for parent pointers. | +--------------------------------------------------------------------------+ + Case Study: Repairing Directories with Parent Pointers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ @@ -4569,42 +4617,6 @@ The proposed patchset is the <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_ series. -**Unresolved Question**: How will repair ensure that the ``dirent_pos`` fields -match in the reconstructed directory? - -*Answer*: There are a few ways to solve this problem: - -1. The field could be designated advisory, since the other three values are - sufficient to find the entry in the parent. 
- However, this makes indexed key lookup impossible while repairs are ongoing. - -2. We could allow creating directory entries at specified offsets, which solves - the referential integrity problem but runs the risk that dirent creation - will fail due to conflicts with the free space in the directory. - - These conflicts could be resolved by appending the directory entry and - amending the xattr code to support updating an xattr key and reindexing the - dabtree, though this would have to be performed with the parent directory - still locked. - -3. Same as above, but remove the old parent pointer entry and add a new one - atomically. - -4. Change the ondisk xattr format to ``(parent_inum, name) → (parent_gen)``, - which would provide the attr name uniqueness that we require, without - forcing repair code to update the dirent position. - Unfortunately, this requires changes to the xattr code to support attr - names as long as 263 bytes. - -5. Change the ondisk xattr format to ``(parent_inum, hash(name)) → - (name, parent_gen)``. - If the hash is sufficiently resistant to collisions (e.g. sha256) then - this should provide the attr name uniqueness that we require. - Names shorter than 247 bytes could be stored directly. - -Discussion is ongoing under the `parent pointers patch deluge -<https://www.spinics.net/lists/linux-xfs/msg69397.html>`_. - Case Study: Repairing Parent Pointers ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 2/4] docs: update online directory and parent pointer repair sections 2024-04-15 23:37 ` [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2 Darrick J. Wong 2024-04-15 23:56 ` [PATCH 1/4] docs: update the parent pointers documentation to the final version Darrick J. Wong @ 2024-04-15 23:56 ` Darrick J. Wong 2024-04-15 23:57 ` [PATCH 3/4] docs: update offline parent pointer repair strategy Darrick J. Wong 2024-04-15 23:57 ` [PATCH 4/4] docs: describe xfs directory tree online fsck Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:56 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Update the case studies of online directory and parent pointer reconstruction to reflect what they actually do in the final version. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 55 +++++++++++--------- 1 file changed, 29 insertions(+), 26 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 1e3211d12247..1ea4e59c9cdb 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4576,8 +4576,9 @@ Directory rebuilding uses a :ref:`coordinated inode scan <iscan>` and a :ref:`directory entry live update hook <liveupdate>` as follows: 1. Set up a temporary directory for generating the new directory structure, - an xfblob for storing entry names, and an xfarray for stashing directory - updates. + an xfblob for storing entry names, and an xfarray for stashing the fixed + size fields involved in a directory update: ``(child inumber, add vs. + remove, name cookie, ftype)``. 2. 
Set up an inode scanner and hook into the directory entry code to receive updates on directory operations. @@ -4586,35 +4587,34 @@ a :ref:`directory entry live update hook <liveupdate>` as follows: pointer references the directory of interest. If so: - a. Stash an addname entry for this dirent in the xfarray for later. + a. Stash the parent pointer name and an addname entry for this dirent in the + xfblob and xfarray, respectively. - b. When finished scanning that file, flush the stashed updates to the - temporary directory. + b. When finished scanning that file or the kernel memory consumption exceeds + a threshold, flush the stashed updates to the temporary directory. 4. For each live directory update received via the hook, decide if the child has already been scanned. If so: - a. Stash an addname or removename entry for this dirent update in the - xfarray for later. + a. Stash the parent pointer name and an addname or removename entry for this + dirent update in the xfblob and xfarray for later. We cannot write directly to the temporary directory because hook functions are not allowed to modify filesystem metadata. Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed updates to the temporary directory. -5. When the scan is complete, atomically exchange the contents of the temporary +5. When the scan is complete, replay any stashed entries in the xfarray. + +6. When the scan is complete, atomically exchange the contents of the temporary directory and the directory being repaired. The temporary directory now contains the damaged directory structure. -6. Reap the temporary directory. - -7. Update the dirent position field of parent pointers as necessary. - This may require the queuing of a substantial number of xattr log intent - items. +7. Reap the temporary directory. 
The proposed patchset is the `parent pointers directory repair -<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-dir-repair>`_ +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ series. Case Study: Repairing Parent Pointers @@ -4624,8 +4624,9 @@ Online reconstruction of a file's parent pointer information works similarly to directory reconstruction: 1. Set up a temporary file for generating a new extended attribute structure, - an `xfblob<xfblob>` for storing parent pointer names, and an xfarray for - stashing parent pointer updates. + an xfblob for storing parent pointer names, and an xfarray for stashing the + fixed size fields involved in a parent pointer update: ``(parent inumber, + parent generation, add vs. remove, name cookie)``. 2. Set up an inode scanner and hook into the directory entry code to receive updates on directory operations. @@ -4634,34 +4635,36 @@ directory reconstruction: dirent references the file of interest. If so: - a. Stash an addpptr entry for this parent pointer in the xfblob and xfarray - for later. + a. Stash the dirent name and an addpptr entry for this parent pointer in the + xfblob and xfarray, respectively. - b. When finished scanning the directory, flush the stashed updates to the - temporary directory. + b. When finished scanning the directory or the kernel memory consumption + exceeds a threshold, flush the stashed updates to the temporary file. 4. For each live directory update received via the hook, decide if the parent has already been scanned. If so: - a. Stash an addpptr or removepptr entry for this dirent update in the - xfarray for later. + a. Stash the dirent name and an addpptr or removepptr entry for this dirent + update in the xfblob and xfarray for later. We cannot write parent pointers directly to the temporary file because hook functions are not allowed to modify filesystem metadata. 
Instead, we stash updates in the xfarray and rely on the scanner thread to apply the stashed parent pointer updates to the temporary file. -5. Copy all non-parent pointer extended attributes to the temporary file. +5. When the scan is complete, replay any stashed entries in the xfarray. -6. When the scan is complete, atomically exchange the mappings of the attribute +6. Copy all non-parent pointer extended attributes to the temporary file. + +7. When the scan is complete, atomically exchange the mappings of the attribute forks of the temporary file and the file being repaired. The temporary file now contains the damaged extended attribute structure. -7. Reap the temporary file. +8. Reap the temporary file. The proposed patchset is the `parent pointers repair -<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-online-parent-repair>`_ +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=pptrs-fsck>`_ series. Digression: Offline Checking of Parent Pointers ^ permalink raw reply related [flat|nested] 100+ messages in thread
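The stash-and-replay pattern that both case studies in this patch describe can be sketched in userspace C. Everything here is an illustrative assumption rather than the kernel implementation: `name_blob` is a toy stand-in for an xfblob (the cookie is just a byte offset), the fixed-size `dir_update` record mirrors the ``(child inumber, add vs. remove, name cookie, ftype)`` tuple from the text, and `stash_update`, `stash_replay` and `stash_demo` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define BLOB_BYTES	256	/* arbitrary sizes for the sketch */
#define MAX_STASH	16

/* Toy stand-in for an xfblob: names are packed into one buffer. */
struct name_blob {
	char		bytes[BLOB_BYTES];
	uint32_t	used;
};

/* Fixed-size record: (child inumber, add vs. remove, name cookie, ftype). */
struct dir_update {
	uint64_t	child_ino;
	uint32_t	name_cookie;	/* offset of the name in the blob */
	uint8_t		is_add;
	uint8_t		ftype;
};

struct stash {
	struct name_blob	names;
	struct dir_update	recs[MAX_STASH];
	unsigned int		nr;
};

typedef void (*replay_fn)(const struct dir_update *rec, const char *name,
			  void *priv);

/* Store the name, then queue the fixed-size record. Returns -1 when full. */
int stash_update(struct stash *s, uint64_t child_ino, int is_add,
		 uint8_t ftype, const char *name)
{
	size_t len = strlen(name) + 1;
	struct dir_update *rec;

	if (s->nr == MAX_STASH || len > BLOB_BYTES - s->names.used)
		return -1;	/* caller is expected to replay, then retry */

	rec = &s->recs[s->nr++];
	rec->child_ino = child_ino;
	rec->name_cookie = s->names.used;
	rec->is_add = is_add ? 1 : 0;
	rec->ftype = ftype;
	memcpy(s->names.bytes + s->names.used, name, len);
	s->names.used += (uint32_t)len;
	return 0;
}

/* Apply every stashed record in arrival order, then empty the stash. */
unsigned int stash_replay(struct stash *s, replay_fn fn, void *priv)
{
	unsigned int i, done = s->nr;

	for (i = 0; i < s->nr; i++)
		fn(&s->recs[i], s->names.bytes + s->recs[i].name_cookie, priv);
	s->nr = 0;
	s->names.used = 0;
	return done;
}

/* Demo callback: track the net entry count of the "temporary directory". */
static void count_net(const struct dir_update *rec, const char *name,
		      void *priv)
{
	int *net = priv;

	(void)name;
	*net += rec->is_add ? 1 : -1;
}

/* Demo: two adds and one remove leave one entry behind. */
int stash_demo(void)
{
	struct stash s;
	int net = 0;

	memset(&s, 0, sizeof(s));
	stash_update(&s, 131, 1, 1, "a.txt");
	stash_update(&s, 132, 1, 2, "subdir");
	stash_update(&s, 131, 0, 1, "a.txt");
	stash_replay(&s, count_net, &net);
	assert(s.nr == 0);
	return net;
}
```

The split between a fixed-size record array and a separate name store is the point of the sketch: the hook side only appends, and the scanner side is the only one that applies updates. The "flush when memory consumption exceeds a threshold" rule from the text corresponds to the caller replaying when `stash_update` reports that the stash is full.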
* [PATCH 3/4] docs: update offline parent pointer repair strategy 2024-04-15 23:37 ` [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2 Darrick J. Wong 2024-04-15 23:56 ` [PATCH 1/4] docs: update the parent pointers documentation to the final version Darrick J. Wong 2024-04-15 23:56 ` [PATCH 2/4] docs: update online directory and parent pointer repair sections Darrick J. Wong @ 2024-04-15 23:57 ` Darrick J. Wong 2024-04-15 23:57 ` [PATCH 4/4] docs: describe xfs directory tree online fsck Darrick J. Wong 3 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:57 UTC (permalink / raw To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now update how xfs_repair checks and repairs parent pointer info. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- .../filesystems/xfs/xfs-online-fsck-design.rst | 81 +++++++++++++++----- 1 file changed, 60 insertions(+), 21 deletions(-) diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 1ea4e59c9cdb..70e3e629d8b3 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4675,26 +4675,56 @@ files are erased long before directory tree connectivity checks are performed. Parent pointer checks are therefore a second pass to be added to the existing connectivity checks: -1. After the set of surviving files has been established (i.e. phase 6), +1. After the set of surviving files has been established (phase 6), walk the surviving directories of each AG in the filesystem. This is already performed as part of the connectivity checks. -2. For each directory entry found, record the name in an xfblob, and store - ``(child_ag_inum, parent_inum, parent_gen, dirent_pos)`` tuples in a - per-AG in-memory slab. +2. 
For each directory entry found, + + a. If the name has already been stored in the xfblob, then use that cookie + and skip the next step. + + b. Otherwise, record the name in an xfblob, and remember the xfblob cookie. + Unique mappings are critical for + + 1. Deduplicating names to reduce memory usage, and + + 2. Creating a stable sort key for the parent pointer indexes so that the + parent pointer validation described below will work. + + c. Store ``(child_ag_inum, parent_inum, parent_gen, name_hash, name_len, + name_cookie)`` tuples in a per-AG in-memory slab. The ``name_hash`` + referenced in this section is the regular directory entry name hash, not + the specialized one used for parent pointer xattrs. 3. For each AG in the filesystem, - a. Sort the per-AG tuples in order of child_ag_inum, parent_inum, and - dirent_pos. + a. Sort the per-AG tuple set in order of ``child_ag_inum``, ``parent_inum``, + ``name_hash``, and ``name_cookie``. + Having a single ``name_cookie`` for each ``name`` is critical for + handling the uncommon case of a directory containing multiple hardlinks + to the same file where all the names hash to the same value. b. For each inode in the AG, 1. Scan the inode for parent pointers. - Record the names in a per-file xfblob, and store ``(parent_inum, - parent_gen, dirent_pos)`` tuples in a per-file slab. + For each parent pointer found, - 2. Sort the per-file tuples in order of parent_inum, and dirent_pos. + a. Validate the ondisk parent pointer. + If validation fails, move on to the next parent pointer in the + file. + + b. If the name has already been stored in the xfblob, then use that + cookie and skip the next step. + + c. Record the name in a per-file xfblob, and remember the xfblob + cookie. + + d. Store ``(parent_inum, parent_gen, name_hash, name_len, + name_cookie)`` tuples in a per-file slab. + + 2. Sort the per-file tuples in order of ``parent_inum``, ``name_hash``, + and ``name_cookie``. 3. 
Position one slab cursor at the start of the inode's records in the per-AG tuple slab. @@ -4703,28 +4733,37 @@ connectivity checks: 4. Position a second slab cursor at the start of the per-file tuple slab. - 5. Iterate the two cursors in lockstep, comparing the parent_ino and - dirent_pos fields of the records under each cursor. + 5. Iterate the two cursors in lockstep, comparing the ``parent_ino``, + ``name_hash``, and ``name_cookie`` fields of the records under each + cursor: - a. Tuples in the per-AG list but not the per-file list are missing and - need to be written to the inode. + a. If the per-AG cursor is at a lower point in the keyspace than the + per-file cursor, then the per-AG cursor points to a missing parent + pointer. + Add the parent pointer to the inode and advance the per-AG + cursor. - b. Tuples in the per-file list but not the per-AG list are dangling - and need to be removed from the inode. + b. If the per-file cursor is at a lower point in the keyspace than + the per-AG cursor, then the per-file cursor points to a dangling + parent pointer. + Remove the parent pointer from the inode and advance the per-file + cursor. - c. For tuples in both lists, update the parent_gen and name components - of the parent pointer if necessary. + c. Otherwise, both cursors point at the same parent pointer. + Update the parent_gen component if necessary. + Advance both cursors. 4. Move on to examining link counts, as we do today. The proposed patchset is the `offline parent pointers repair -<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-repair>`_ +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=pptrs-fsck>`_ series. -Rebuilding directories from parent pointers in offline repair is very -challenging because it currently uses a single-pass scan of the filesystem -during phase 3 to decide which files are corrupt enough to be zapped. 
+Rebuilding directories from parent pointers in offline repair would be very +challenging because xfs_repair currently uses two single-pass scans of the +filesystem during phases 3 and 4 to decide which files are corrupt enough to be +zapped. This scan would have to be converted into a multi-pass scan: 1. The first pass of the scan zaps corrupt inodes, forks, and attributes ^ permalink raw reply related [flat|nested] 100+ messages in thread
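The lockstep comparison in step 3b5 of the updated text amounts to a merge join over two sorted arrays. The sketch below assumes that both tuple sets are already sorted by ``(parent_ino, name_hash, name_cookie)``. `pptr_key`, `pptr_diff` and `pptr_lockstep` are hypothetical names, plain arrays stand in for the slabs and their cursors, and the function only counts the actions that xfs_repair would take instead of modifying any inode.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical tuple; sort key is (parent_ino, name_hash, name_cookie). */
struct pptr_key {
	uint64_t parent_ino;
	uint32_t name_hash;
	uint32_t name_cookie;
	uint32_t parent_gen;	/* payload, not part of the sort key */
};

struct pptr_diff {
	unsigned int missing;	/* in the per-AG list only: add to the inode */
	unsigned int dangling;	/* in the per-file list only: remove */
	unsigned int matched;	/* present in both lists */
	unsigned int gen_fixes;	/* matched, but parent_gen needs updating */
};

static int key_cmp(const struct pptr_key *a, const struct pptr_key *b)
{
	if (a->parent_ino != b->parent_ino)
		return a->parent_ino < b->parent_ino ? -1 : 1;
	if (a->name_hash != b->name_hash)
		return a->name_hash < b->name_hash ? -1 : 1;
	if (a->name_cookie != b->name_cookie)
		return a->name_cookie < b->name_cookie ? -1 : 1;
	return 0;
}

/* Walk both sorted arrays in lockstep and classify every record. */
struct pptr_diff pptr_lockstep(const struct pptr_key *ag, size_t nr_ag,
			       const struct pptr_key *file, size_t nr_file)
{
	struct pptr_diff d = { 0, 0, 0, 0 };
	size_t i = 0, j = 0;

	while (i < nr_ag || j < nr_file) {
		int c;

		if (j == nr_file)
			c = -1;		/* only per-AG records remain */
		else if (i == nr_ag)
			c = 1;		/* only per-file records remain */
		else
			c = key_cmp(&ag[i], &file[j]);

		if (c < 0) {
			d.missing++;	/* per-AG cursor is lower in keyspace */
			i++;
		} else if (c > 0) {
			d.dangling++;	/* per-file cursor is lower in keyspace */
			j++;
		} else {
			d.matched++;
			if (ag[i].parent_gen != file[j].parent_gen)
				d.gen_fixes++;
			i++;
			j++;
		}
	}
	return d;
}
```

The stable ``name_cookie`` matters here for the reason the text gives: if two names with the same hash could receive different cookies in the two lists, records that describe the same parent pointer would compare as unequal and be reported as one missing plus one dangling entry.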
* [PATCH 4/4] docs: describe xfs directory tree online fsck
From: Darrick J. Wong @ 2024-04-15 23:57 UTC
To: chandanbabu, djwong; +Cc: Christoph Hellwig, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

I've added a scrubber that checks the directory tree structure and fixes any problems it finds; describe this in the design documentation.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst | 124 ++++++++++++++++++++
 1 file changed, 124 insertions(+)

diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst index 70e3e629d8b3..12aa63840830 100644 --- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst +++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst @@ -4785,6 +4785,130 @@ This scan would have to be converted into a multi-pass scan: This code has not yet been constructed. +.. _dirtree: + +Case Study: Directory Tree Structure +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +As mentioned earlier, the filesystem directory tree is supposed to be a +directed acyclic graph structure. +However, each node in this graph is a separate ``xfs_inode`` object with its +own locks, which makes validating the tree qualities difficult. +Fortunately, non-directories are allowed to have multiple parents and cannot +have children, so only directories need to be scanned. +Directories typically constitute 5-10% of the files in a filesystem, which +reduces the amount of work dramatically. 
+ +If the directory tree could be frozen, it would be easy to discover cycles and +disconnected regions by running a depth (or breadth) first search downwards +from the root directory and marking a bitmap for each directory found. +At any point in the walk, trying to set an already set bit means there is a +cycle. +After the scan completes, XORing the marked inode bitmap with the inode +allocation bitmap reveals disconnected inodes. +However, one of online repair's design goals is to avoid locking the entire +filesystem unless it's absolutely necessary. +Directory tree updates can move subtrees across the scanner wavefront on a live +filesystem, so the bitmap algorithm cannot be applied. + +Directory parent pointers enable an incremental approach to validation of the +tree structure. +Instead of using one thread to scan the entire filesystem, multiple threads can +walk from individual subdirectories upwards towards the root. +For this to work, all directory entries and parent pointers must be internally +consistent, each directory entry must have a parent pointer, and the link +counts of all directories must be correct. +Each scanner thread must be able to take the IOLOCK of an alleged parent +directory while holding the IOLOCK of the child directory to prevent either +directory from being moved within the tree. +This is not possible since the VFS does not take the IOLOCK of a child +subdirectory when moving that subdirectory, so instead the scanner stabilizes +the parent -> child relationship by taking the ILOCKs and installing a dirent +update hook to detect changes. + +The scanning process uses a dirent hook to detect changes to the directories +mentioned in the scan data. +The scan works as follows: + +1. For each subdirectory in the filesystem, + + a. For each parent pointer of that subdirectory, + + 1. Create a path object for that parent pointer, and mark the + subdirectory inode number in the path object's bitmap. + + 2. 
Record the parent pointer name and inode number in a path structure. + + 3. If the alleged parent is the subdirectory being scrubbed, the path is + a cycle. + Mark the path for deletion and repeat step 1a with the next + subdirectory parent pointer. + + 4. Try to mark the alleged parent inode number in a bitmap in the path + object. + If the bit is already set, then there is a cycle in the directory + tree. + Mark the path as a cycle and repeat step 1a with the next subdirectory + parent pointer. + + 5. Load the alleged parent. + If the alleged parent is not a linked directory, abort the scan + because the parent pointer information is inconsistent. + + 6. For each parent pointer of this alleged ancestor directory, + + a. Record the parent pointer name and inode number in the path object + if no parent has been set for that level. + + b. If an ancestor has more than one parent, mark the path as corrupt. + Repeat step 1a with the next subdirectory parent pointer. + + c. Repeat steps 1a3-1a6 for the ancestor identified in step 1a6a. + This repeats until the directory tree root is reached or no parents + are found. + + 7. If the walk terminates at the root directory, mark the path as ok. + + 8. If the walk terminates without reaching the root, mark the path as + disconnected. + +2. If the directory entry update hook triggers, check all paths already found + by the scan. + If the entry matches part of a path, mark that path and the scan stale. + When the scanner thread sees that the scan has been marked stale, it deletes + all scan data and starts over. + +Repairing the directory tree works as follows: + +1. Walk each path of the target subdirectory. + + a. Corrupt paths and cycle paths are counted as suspect. + + b. Paths already marked for deletion are counted as bad. + + c. Paths that reached the root are counted as good. + +2. If the subdirectory is either the root directory or has zero link count, + delete all incoming directory entries in the immediate parents. 
+ Repairs are complete. + +3. If the subdirectory has exactly one path, set the dotdot entry to the + parent and exit. + +4. If the subdirectory has at least one good path, delete all the other + incoming directory entries in the immediate parents. + +5. If the subdirectory has no good paths and more than one suspect path, delete + all the other incoming directory entries in the immediate parents. + +6. If the subdirectory has zero paths, attach it to the lost and found. + +The proposed patches are in the +`directory tree repair +<https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=scrub-directory-tree>`_ +series. + + .. _orphanage: The Orphanage ^ permalink raw reply related [flat|nested] 100+ messages in thread
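The upward walk in the directory tree case study above can be illustrated with a small userspace sketch. This is not the kernel scrubber. It assumes a toy model in which every directory has at most one parent, stored in a plain array, and in which inode numbers are smaller than 64 so that the per-path "seen" bitmap fits in one word. The names `walk_to_root`, `path_outcome`, `MAX_INODES` and `NO_PARENT` are made up for the example.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_INODES	64		/* hypothetical: bitmap fits in a u64 */
#define NO_PARENT	UINT32_MAX	/* hypothetical: directory has no parent */

enum path_outcome {
	PATH_OK,		/* walk reached the root directory */
	PATH_CYCLE,		/* walk revisited a directory */
	PATH_DISCONNECTED,	/* walk ran out of parents before the root */
};

/*
 * Walk upwards from @start through the toy parent[] table.
 * Precondition: every inode number involved is below MAX_INODES.
 */
static enum path_outcome
walk_to_root(const uint32_t *parent, uint32_t root, uint32_t start)
{
	uint64_t	seen = 0;	/* one bit per directory on this path */
	uint32_t	ino = start;

	for (;;) {
		if (ino == root)
			return PATH_OK;

		/* Trying to set an already set bit means there is a cycle. */
		if (seen & (1ULL << ino))
			return PATH_CYCLE;
		seen |= 1ULL << ino;

		if (parent[ino] == NO_PARENT)
			return PATH_DISCONNECTED;
		ino = parent[ino];
	}
}
```

With a table where 2 -> 1 -> 0 (the root), 3 and 4 point at each other, and 5 has no parent, the walk from 2 ends at the root, the walk from 3 reports a cycle, and the walk from 5 reports a disconnected path. The scan described in the document additionally has to cope with multiple parent pointers per directory and with concurrent updates caught by the dirent hook, neither of which this sketch models.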
* [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates
  2024-04-15 23:28 [PATCHBOMB v30.3] xfs: online repair, part 1 is done Darrick J. Wong
  ` (14 preceding siblings ...)
  2024-04-15 23:37 ` [PATCHSET v13.2 15/16] xfs: design documentation for online fsck, part 2 Darrick J. Wong
@ 2024-04-15 23:37 ` Darrick J. Wong
  2024-04-15 23:57 ` [PATCH 1/7] xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 Darrick J. Wong
  ` (6 more replies)
  15 siblings, 7 replies; 100+ messages in thread
From: Darrick J. Wong @ 2024-04-15 23:37 UTC (permalink / raw)
  To: chandanbabu, djwong
  Cc: Christoph Hellwig, Catherine Hoang, Allison Henderson, hch,
	allison.henderson, catherine.hoang, linux-xfs

Hi all,

This series changes the directory update code to retain the ILOCK on all
files involved in a rename until the end of the operation. The upcoming
parent pointers patchset applies parent pointers in a separate chained
update from the actual directory update, which is why it is now
necessary to keep the ILOCK instead of dropping it after the first
transaction in the chain.

As a side effect, we no longer need to hold the IOLOCK during an rmapbt
scan of inodes to serialize the scan with ongoing directory updates.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems. Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=retain-ilock-during-dir-ops-6.10
---
Commits in this patchset:
 * xfs: Increase XFS_DEFER_OPS_NR_INODES to 5
 * xfs: Increase XFS_QM_TRANS_MAXDQS to 5
 * xfs: Hold inode locks in xfs_ialloc
 * xfs: Hold inode locks in xfs_trans_alloc_dir
 * xfs: Hold inode locks in xfs_rename
 * xfs: don't pick up IOLOCK during rmapbt repair scan
 * xfs: unlock new repair tempfiles after creation
---
 fs/xfs/libxfs/xfs_defer.c  |    6 ++-
 fs/xfs/libxfs/xfs_defer.h  |    8 +++-
 fs/xfs/scrub/rmap_repair.c |   16 -------
 fs/xfs/scrub/tempfile.c    |    2 +
 fs/xfs/xfs_dquot.c         |   41 ++++++++++++++++++
 fs/xfs/xfs_dquot.h         |    1
 fs/xfs/xfs_inode.c         |   98 ++++++++++++++++++++++++++++++++------------
 fs/xfs/xfs_inode.h         |    2 +
 fs/xfs/xfs_qm.c            |    4 +-
 fs/xfs/xfs_qm.h            |    2 -
 fs/xfs/xfs_symlink.c       |    6 ++-
 fs/xfs/xfs_trans.c         |    9 +++-
 fs/xfs/xfs_trans_dquot.c   |   15 ++++---
 13 files changed, 156 insertions(+), 54 deletions(-)

^ permalink raw reply [flat|nested] 100+ messages in thread
* [PATCH 1/7] xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong @ 2024-04-15 23:57 ` Darrick J. Wong 2024-04-15 23:57 ` [PATCH 2/7] xfs: Increase XFS_QM_TRANS_MAXDQS " Darrick J. Wong ` (5 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:57 UTC (permalink / raw To: chandanbabu, djwong Cc: Allison Henderson, Catherine Hoang, Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Allison Henderson <allison.henderson@oracle.com> Renames that generate parent pointer updates can join up to 5 inodes locked in sorted order. So we need to increase the number of defer ops inodes and relock them in the same way. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: have one sorting function] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/libxfs/xfs_defer.c | 6 +++++- fs/xfs/libxfs/xfs_defer.h | 8 +++++++- fs/xfs/xfs_inode.c | 27 ++++++++++++++++++--------- fs/xfs/xfs_inode.h | 2 ++ 4 files changed, 32 insertions(+), 11 deletions(-) diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c index 061cc01245a9..4a078e07e1a0 100644 --- a/fs/xfs/libxfs/xfs_defer.c +++ b/fs/xfs/libxfs/xfs_defer.c @@ -1092,7 +1092,11 @@ xfs_defer_ops_continue( ASSERT(!(tp->t_flags & XFS_TRANS_DIRTY)); /* Lock the captured resources to the new transaction. 
*/ - if (dfc->dfc_held.dr_inos == 2) + if (dfc->dfc_held.dr_inos > 2) { + xfs_sort_inodes(dfc->dfc_held.dr_ip, dfc->dfc_held.dr_inos); + xfs_lock_inodes(dfc->dfc_held.dr_ip, dfc->dfc_held.dr_inos, + XFS_ILOCK_EXCL); + } else if (dfc->dfc_held.dr_inos == 2) xfs_lock_two_inodes(dfc->dfc_held.dr_ip[0], XFS_ILOCK_EXCL, dfc->dfc_held.dr_ip[1], XFS_ILOCK_EXCL); else if (dfc->dfc_held.dr_inos == 1) diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h index 81cca60d70a3..8b338031e487 100644 --- a/fs/xfs/libxfs/xfs_defer.h +++ b/fs/xfs/libxfs/xfs_defer.h @@ -77,7 +77,13 @@ extern const struct xfs_defer_op_type xfs_exchmaps_defer_type; /* * Deferred operation item relogging limits. */ -#define XFS_DEFER_OPS_NR_INODES 2 /* join up to two inodes */ + +/* + * Rename w/ parent pointers can require up to 5 inodes with deferred ops to + * be joined to the transaction: src_dp, target_dp, src_ip, target_ip, and wip. + * These inodes are locked in sorted order by their inode numbers + */ +#define XFS_DEFER_OPS_NR_INODES 5 #define XFS_DEFER_OPS_NR_BUFS 2 /* join up to two buffers */ /* Resources that must be held across a transaction roll. */ diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 03dcb4ac0431..efd040094753 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -418,7 +418,7 @@ xfs_lock_inumorder( * lock more than one at a time, lockdep will report false positives saying we * have violated locking orders. 
*/ -static void +void xfs_lock_inodes( struct xfs_inode **ips, int inodes, @@ -2802,7 +2802,7 @@ xfs_sort_for_rename( struct xfs_inode **i_tab,/* out: sorted array of inodes */ int *num_inodes) /* in/out: inodes in array */ { - int i, j; + int i; ASSERT(*num_inodes == __XFS_SORT_INODES); memset(i_tab, 0, *num_inodes * sizeof(struct xfs_inode *)); @@ -2824,17 +2824,26 @@ xfs_sort_for_rename( i_tab[i++] = wip; *num_inodes = i; + xfs_sort_inodes(i_tab, *num_inodes); +} + +void +xfs_sort_inodes( + struct xfs_inode **i_tab, + unsigned int num_inodes) +{ + int i, j; + + ASSERT(num_inodes <= __XFS_SORT_INODES); + /* * Sort the elements via bubble sort. (Remember, there are at * most 5 elements to sort, so this is adequate.) */ - for (i = 0; i < *num_inodes; i++) { - for (j = 1; j < *num_inodes; j++) { - if (i_tab[j]->i_ino < i_tab[j-1]->i_ino) { - struct xfs_inode *temp = i_tab[j]; - i_tab[j] = i_tab[j-1]; - i_tab[j-1] = temp; - } + for (i = 0; i < num_inodes; i++) { + for (j = 1; j < num_inodes; j++) { + if (i_tab[j]->i_ino < i_tab[j-1]->i_ino) + swap(i_tab[j], i_tab[j - 1]); } } } diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h index c74c48bc0945..a6da1ab8ab13 100644 --- a/fs/xfs/xfs_inode.h +++ b/fs/xfs/xfs_inode.h @@ -627,6 +627,8 @@ int xfs_ilock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_iunlock2_io_mmap(struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_iunlock2_remapping(struct xfs_inode *ip1, struct xfs_inode *ip2); void xfs_bumplink(struct xfs_trans *tp, struct xfs_inode *ip); +void xfs_lock_inodes(struct xfs_inode **ips, int inodes, uint lock_mode); +void xfs_sort_inodes(struct xfs_inode **i_tab, unsigned int num_inodes); static inline bool xfs_inode_unlinked_incomplete( ^ permalink raw reply related [flat|nested] 100+ messages in thread
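For readers following the xfs_sort_inodes() hunk above, here is a minimal userspace sketch of the same idea: bubble-sort a short array of inode pointers by inode number so that every caller acquires locks in the same ascending order. `struct toy_inode` and `toy_sort_inodes` are stand-ins invented for this example. They are not the kernel's `struct xfs_inode` or its helpers, and the sketch performs no locking at all.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_inode; only the number matters here. */
struct toy_inode {
	uint64_t	ino;
};

/*
 * Sort @tab in place by ascending inode number.  A bubble sort is adequate
 * because the array holds at most a handful of entries (5 for a rename).
 */
static void
toy_sort_inodes(struct toy_inode **tab, unsigned int nr)
{
	unsigned int	i, j;

	for (i = 0; i < nr; i++) {
		for (j = 1; j < nr; j++) {
			if (tab[j]->ino < tab[j - 1]->ino) {
				struct toy_inode *tmp = tab[j];

				tab[j] = tab[j - 1];
				tab[j - 1] = tmp;
			}
		}
	}
}
```

Sorting first matters because two tasks that lock the same set of inodes in different orders can deadlock against each other. Taking the locks in one agreed order (here, by inode number) rules that out.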
* [PATCH 2/7] xfs: Increase XFS_QM_TRANS_MAXDQS to 5 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong 2024-04-15 23:57 ` [PATCH 1/7] xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 Darrick J. Wong @ 2024-04-15 23:57 ` Darrick J. Wong 2024-04-15 23:58 ` [PATCH 3/7] xfs: Hold inode locks in xfs_ialloc Darrick J. Wong ` (4 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:57 UTC (permalink / raw) To: chandanbabu, djwong Cc: Allison Henderson, Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Allison Henderson <allison.henderson@oracle.com> With parent pointers enabled, a rename operation can update up to 5 inodes: src_dp, target_dp, src_ip, target_ip and wip. This causes their dquots to be attached to the transaction chain, so we need to increase XFS_QM_TRANS_MAXDQS. This patch also adds a helper function xfs_dqlockn to lock an arbitrary number of dquots. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Darrick J.
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_dquot.c | 41 +++++++++++++++++++++++++++++++++++++++++ fs/xfs/xfs_dquot.h | 1 + fs/xfs/xfs_qm.h | 2 +- fs/xfs/xfs_trans_dquot.c | 15 ++++++++++----- 4 files changed, 53 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_dquot.c b/fs/xfs/xfs_dquot.c index c98cb468c357..13aba84bd64a 100644 --- a/fs/xfs/xfs_dquot.c +++ b/fs/xfs/xfs_dquot.c @@ -1371,6 +1371,47 @@ xfs_dqlock2( } } +static int +xfs_dqtrx_cmp( + const void *a, + const void *b) +{ + const struct xfs_dqtrx *qa = a; + const struct xfs_dqtrx *qb = b; + + if (qa->qt_dquot->q_id > qb->qt_dquot->q_id) + return 1; + if (qa->qt_dquot->q_id < qb->qt_dquot->q_id) + return -1; + return 0; +} + +void +xfs_dqlockn( + struct xfs_dqtrx *q) +{ + unsigned int i; + + BUILD_BUG_ON(XFS_QM_TRANS_MAXDQS > MAX_LOCKDEP_SUBCLASSES); + + /* Sort in order of dquot id, do not allow duplicates */ + for (i = 0; i < XFS_QM_TRANS_MAXDQS && q[i].qt_dquot != NULL; i++) { + unsigned int j; + + for (j = 0; j < i; j++) + ASSERT(q[i].qt_dquot != q[j].qt_dquot); + } + if (i == 0) + return; + + sort(q, i, sizeof(struct xfs_dqtrx), xfs_dqtrx_cmp, NULL); + + mutex_lock(&q[0].qt_dquot->q_qlock); + for (i = 1; i < XFS_QM_TRANS_MAXDQS && q[i].qt_dquot != NULL; i++) + mutex_lock_nested(&q[i].qt_dquot->q_qlock, + XFS_QLOCK_NESTED + i - 1); +} + int __init xfs_qm_init(void) { diff --git a/fs/xfs/xfs_dquot.h b/fs/xfs/xfs_dquot.h index 956272d9b302..677bb2dc9ac9 100644 --- a/fs/xfs/xfs_dquot.h +++ b/fs/xfs/xfs_dquot.h @@ -223,6 +223,7 @@ int xfs_qm_dqget_uncached(struct xfs_mount *mp, void xfs_qm_dqput(struct xfs_dquot *dqp); void xfs_dqlock2(struct xfs_dquot *, struct xfs_dquot *); +void xfs_dqlockn(struct xfs_dqtrx *q); void xfs_dquot_set_prealloc_limits(struct xfs_dquot *); diff --git a/fs/xfs/xfs_qm.h b/fs/xfs/xfs_qm.h index f5993012bf98..6e09dfcd13e2 100644 --- a/fs/xfs/xfs_qm.h +++ b/fs/xfs/xfs_qm.h @@ -136,7 +136,7 @@ enum { XFS_QM_TRANS_PRJ, XFS_QM_TRANS_DQTYPES 
}; -#define XFS_QM_TRANS_MAXDQS 2 +#define XFS_QM_TRANS_MAXDQS 5 struct xfs_dquot_acct { struct xfs_dqtrx dqs[XFS_QM_TRANS_DQTYPES][XFS_QM_TRANS_MAXDQS]; }; diff --git a/fs/xfs/xfs_trans_dquot.c b/fs/xfs/xfs_trans_dquot.c index 577b535a595c..b368e13424c4 100644 --- a/fs/xfs/xfs_trans_dquot.c +++ b/fs/xfs/xfs_trans_dquot.c @@ -379,24 +379,29 @@ xfs_trans_mod_dquot( /* * Given an array of dqtrx structures, lock all the dquots associated and join - * them to the transaction, provided they have been modified. We know that the - * highest number of dquots of one type - usr, grp and prj - involved in a - * transaction is 3 so we don't need to make this very generic. + * them to the transaction, provided they have been modified. */ STATIC void xfs_trans_dqlockedjoin( struct xfs_trans *tp, struct xfs_dqtrx *q) { + unsigned int i; ASSERT(q[0].qt_dquot != NULL); if (q[1].qt_dquot == NULL) { xfs_dqlock(q[0].qt_dquot); xfs_trans_dqjoin(tp, q[0].qt_dquot); - } else { - ASSERT(XFS_QM_TRANS_MAXDQS == 2); + } else if (q[2].qt_dquot == NULL) { xfs_dqlock2(q[0].qt_dquot, q[1].qt_dquot); xfs_trans_dqjoin(tp, q[0].qt_dquot); xfs_trans_dqjoin(tp, q[1].qt_dquot); + } else { + xfs_dqlockn(q); + for (i = 0; i < XFS_QM_TRANS_MAXDQS; i++) { + if (q[i].qt_dquot == NULL) + break; + xfs_trans_dqjoin(tp, q[i].qt_dquot); + } } } ^ permalink raw reply related [flat|nested] 100+ messages in thread
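The xfs_dqlockn() helper in this patch sorts the dquots by id and then locks them in that order. A rough userspace sketch of that pattern is below. It assumes invented names (`toy_dquot`, `toy_dqtrx`, `toy_dqlockn`, `TOY_MAXDQS`), uses the C library's qsort() in place of the kernel's sort(), and only records the position at which each entry would be locked instead of taking a real mutex.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

#define TOY_MAXDQS	5	/* hypothetical mirror of XFS_QM_TRANS_MAXDQS */

struct toy_dquot {
	uint32_t	id;
	int		lock_order;	/* position at which it was "locked" */
};

struct toy_dqtrx {
	struct toy_dquot	*dquot;	/* NULL terminates the used entries */
};

/* Compare explicitly rather than subtracting, so large ids cannot wrap. */
static int
toy_dqtrx_cmp(const void *a, const void *b)
{
	const struct toy_dqtrx	*qa = a;
	const struct toy_dqtrx	*qb = b;

	if (qa->dquot->id > qb->dquot->id)
		return 1;
	if (qa->dquot->id < qb->dquot->id)
		return -1;
	return 0;
}

/* Sort the populated entries by id, then "lock" them in ascending order. */
static unsigned int
toy_dqlockn(struct toy_dqtrx *q)
{
	unsigned int	nr = 0;
	unsigned int	i;

	while (nr < TOY_MAXDQS && q[nr].dquot != NULL)
		nr++;
	if (nr == 0)
		return 0;

	qsort(q, nr, sizeof(q[0]), toy_dqtrx_cmp);

	/* Stand-in for mutex_lock() / mutex_lock_nested() in the patch. */
	for (i = 0; i < nr; i++)
		q[i].dquot->lock_order = (int)i;
	return nr;
}
```

In the patch itself, each lock after the first is taken with a distinct lockdep subclass, which is why the BUILD_BUG_ON checks XFS_QM_TRANS_MAXDQS against MAX_LOCKDEP_SUBCLASSES. The sketch has no equivalent of lockdep and only shows the ordering.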
* [PATCH 3/7] xfs: Hold inode locks in xfs_ialloc 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong 2024-04-15 23:57 ` [PATCH 1/7] xfs: Increase XFS_DEFER_OPS_NR_INODES to 5 Darrick J. Wong 2024-04-15 23:57 ` [PATCH 2/7] xfs: Increase XFS_QM_TRANS_MAXDQS " Darrick J. Wong @ 2024-04-15 23:58 ` Darrick J. Wong 2024-04-15 23:58 ` [PATCH 4/7] xfs: Hold inode locks in xfs_trans_alloc_dir Darrick J. Wong ` (3 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:58 UTC (permalink / raw To: chandanbabu, djwong Cc: Allison Henderson, Catherine Hoang, Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Allison Henderson <allison.henderson@oracle.com> Modify xfs_ialloc to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> [djwong: hold the parent ilocked across transaction rolls too] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_inode.c | 12 +++++++++--- fs/xfs/xfs_qm.c | 4 +++- fs/xfs/xfs_symlink.c | 6 ++++-- 3 files changed, 16 insertions(+), 6 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index efd040094753..2ec005e6c1da 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -747,6 +747,8 @@ xfs_inode_inherit_flags2( /* * Initialise a newly allocated inode and return the in-core inode to the * caller locked exclusively. + * + * Caller is responsible for unlocking the inode manually upon return */ int xfs_init_new_inode( @@ -873,7 +875,7 @@ xfs_init_new_inode( /* * Log the new values stuffed into the inode. 
*/ - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, ip, 0); xfs_trans_log_inode(tp, ip, flags); /* now that we have an i_mode we can setup the inode structure */ @@ -1101,8 +1103,7 @@ xfs_create( * the transaction cancel unlocking dp so don't do it explicitly in the * error path. */ - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL); - unlock_dp_on_error = false; + xfs_trans_ijoin(tp, dp, 0); error = xfs_dir_createname(tp, dp, name, ip->i_ino, resblks - XFS_IALLOC_SPACE_RES(mp)); @@ -1151,6 +1152,8 @@ xfs_create( xfs_qm_dqrele(pdqp); *ipp = ip; + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_iunlock(dp, XFS_ILOCK_EXCL); return 0; out_trans_cancel: @@ -1162,6 +1165,7 @@ xfs_create( * transactions and deadlocks from xfs_inactive. */ if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_finish_inode_setup(ip); xfs_irele(ip); } @@ -1247,6 +1251,7 @@ xfs_create_tmpfile( xfs_qm_dqrele(pdqp); *ipp = ip; + xfs_iunlock(ip, XFS_ILOCK_EXCL); return 0; out_trans_cancel: @@ -1258,6 +1263,7 @@ xfs_create_tmpfile( * transactions and deadlocks from xfs_inactive. */ if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_finish_inode_setup(ip); xfs_irele(ip); } diff --git a/fs/xfs/xfs_qm.c b/fs/xfs/xfs_qm.c index 0f4cf4170c35..47120b745c47 100644 --- a/fs/xfs/xfs_qm.c +++ b/fs/xfs/xfs_qm.c @@ -836,8 +836,10 @@ xfs_qm_qino_alloc( ASSERT(xfs_is_shutdown(mp)); xfs_alert(mp, "%s failed (error %d)!", __func__, error); } - if (need_alloc) + if (need_alloc) { + xfs_iunlock(*ipp, XFS_ILOCK_EXCL); xfs_finish_inode_setup(*ipp); + } return error; } diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c index fb060aaf6d40..85ef56fdd7df 100644 --- a/fs/xfs/xfs_symlink.c +++ b/fs/xfs/xfs_symlink.c @@ -172,8 +172,7 @@ xfs_symlink( * the transaction cancel unlocking dp so don't do it explicitly in the * error path. */ - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL); - unlock_dp_on_error = false; + xfs_trans_ijoin(tp, dp, 0); /* * Also attach the dquot(s) to it, if applicable. 
@@ -215,6 +214,8 @@ xfs_symlink( xfs_qm_dqrele(pdqp); *ipp = ip; + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_iunlock(dp, XFS_ILOCK_EXCL); return 0; out_trans_cancel: @@ -226,6 +227,7 @@ xfs_symlink( * transactions and deadlocks from xfs_inactive. */ if (ip) { + xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_finish_inode_setup(ip); xfs_irele(ip); } ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 4/7] xfs: Hold inode locks in xfs_trans_alloc_dir 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong ` (2 preceding siblings ...) 2024-04-15 23:58 ` [PATCH 3/7] xfs: Hold inode locks in xfs_ialloc Darrick J. Wong @ 2024-04-15 23:58 ` Darrick J. Wong 2024-04-15 23:58 ` [PATCH 5/7] xfs: Hold inode locks in xfs_rename Darrick J. Wong ` (2 subsequent siblings) 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:58 UTC (permalink / raw To: chandanbabu, djwong Cc: Allison Henderson, Catherine Hoang, Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Allison Henderson <allison.henderson@oracle.com> Modify xfs_trans_alloc_dir to hold locks after return. Caller will be responsible for manual unlock. We will need this later to hold locks across parent pointer operations Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. 
Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_inode.c | 14 ++++++++++++-- fs/xfs/xfs_trans.c | 9 +++++++-- 2 files changed, 19 insertions(+), 4 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 2ec005e6c1da..36e1012e156a 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -1368,10 +1368,15 @@ xfs_link( if (xfs_has_wsync(mp) || xfs_has_dirsync(mp)) xfs_trans_set_sync(tp); - return xfs_trans_commit(tp); + error = xfs_trans_commit(tp); + xfs_iunlock(tdp, XFS_ILOCK_EXCL); + xfs_iunlock(sip, XFS_ILOCK_EXCL); + return error; error_return: xfs_trans_cancel(tp); + xfs_iunlock(tdp, XFS_ILOCK_EXCL); + xfs_iunlock(sip, XFS_ILOCK_EXCL); std_return: if (error == -ENOSPC && nospace_error) error = nospace_error; @@ -2781,15 +2786,20 @@ xfs_remove( error = xfs_trans_commit(tp); if (error) - goto std_return; + goto out_unlock; if (is_dir && xfs_inode_is_filestream(ip)) xfs_filestream_deassociate(ip); + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_iunlock(dp, XFS_ILOCK_EXCL); return 0; out_trans_cancel: xfs_trans_cancel(tp); + out_unlock: + xfs_iunlock(ip, XFS_ILOCK_EXCL); + xfs_iunlock(dp, XFS_ILOCK_EXCL); std_return: return error; } diff --git a/fs/xfs/xfs_trans.c b/fs/xfs/xfs_trans.c index 7350640059cc..50d878d78a5e 100644 --- a/fs/xfs/xfs_trans.c +++ b/fs/xfs/xfs_trans.c @@ -1430,6 +1430,8 @@ xfs_trans_alloc_ichange( * The caller must ensure that the on-disk dquots attached to this inode have * already been allocated and initialized. The ILOCKs will be dropped when the * transaction is committed or cancelled. 
+ * + * Caller is responsible for unlocking the inodes manually upon return */ int xfs_trans_alloc_dir( @@ -1460,8 +1462,8 @@ xfs_trans_alloc_dir( xfs_lock_two_inodes(dp, XFS_ILOCK_EXCL, ip, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, dp, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, dp, 0); + xfs_trans_ijoin(tp, ip, 0); error = xfs_qm_dqattach_locked(dp, false); if (error) { @@ -1484,6 +1486,9 @@ xfs_trans_alloc_dir( if (error == -EDQUOT || error == -ENOSPC) { if (!retried) { xfs_trans_cancel(tp); + xfs_iunlock(dp, XFS_ILOCK_EXCL); + if (dp != ip) + xfs_iunlock(ip, XFS_ILOCK_EXCL); xfs_blockgc_free_quota(dp, 0); retried = true; goto retry; ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 5/7] xfs: Hold inode locks in xfs_rename 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong ` (3 preceding siblings ...) 2024-04-15 23:58 ` [PATCH 4/7] xfs: Hold inode locks in xfs_trans_alloc_dir Darrick J. Wong @ 2024-04-15 23:58 ` Darrick J. Wong 2024-04-15 23:59 ` [PATCH 6/7] xfs: don't pick up IOLOCK during rmapbt repair scan Darrick J. Wong 2024-04-15 23:59 ` [PATCH 7/7] xfs: unlock new repair tempfiles after creation Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:58 UTC (permalink / raw) To: chandanbabu, djwong Cc: Allison Henderson, Catherine Hoang, Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Allison Henderson <allison.henderson@oracle.com> Modify xfs_rename to hold all inode locks across a rename operation. We will need this later when we add parent pointers. Signed-off-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Catherine Hoang <catherine.hoang@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/xfs_inode.c | 45 +++++++++++++++++++++++++++++++++------------ 1 file changed, 33 insertions(+), 12 deletions(-) diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c index 36e1012e156a..2aec7ab59aeb 100644 --- a/fs/xfs/xfs_inode.c +++ b/fs/xfs/xfs_inode.c @@ -2804,6 +2804,21 @@ xfs_remove( return error; } +static inline void +xfs_iunlock_rename( + struct xfs_inode **i_tab, + int num_inodes) +{ + int i; + + for (i = num_inodes - 1; i >= 0; i--) { + /* Skip duplicate inodes if src and target dps are the same */ + if (!i_tab[i] || (i > 0 && i_tab[i] == i_tab[i - 1])) + continue; + xfs_iunlock(i_tab[i], XFS_ILOCK_EXCL); + } +} + /* * Enter all inodes for a rename transaction into a sorted array.
*/ @@ -3113,8 +3128,10 @@ xfs_rename( * Attach the dquots to the inodes */ error = xfs_qm_vop_rename_dqattach(inodes); - if (error) - goto out_trans_cancel; + if (error) { + xfs_trans_cancel(tp); + goto out_release_wip; + } /* * Lock all the participating inodes. Depending upon whether @@ -3125,18 +3142,16 @@ xfs_rename( xfs_lock_inodes(inodes, num_inodes, XFS_ILOCK_EXCL); /* - * Join all the inodes to the transaction. From this point on, - * we can rely on either trans_commit or trans_cancel to unlock - * them. + * Join all the inodes to the transaction. */ - xfs_trans_ijoin(tp, src_dp, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, src_dp, 0); if (new_parent) - xfs_trans_ijoin(tp, target_dp, XFS_ILOCK_EXCL); - xfs_trans_ijoin(tp, src_ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, target_dp, 0); + xfs_trans_ijoin(tp, src_ip, 0); if (target_ip) - xfs_trans_ijoin(tp, target_ip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, target_ip, 0); if (wip) - xfs_trans_ijoin(tp, wip, XFS_ILOCK_EXCL); + xfs_trans_ijoin(tp, wip, 0); /* * If we are using project inheritance, we only allow renames @@ -3150,10 +3165,13 @@ xfs_rename( } /* RENAME_EXCHANGE is unique from here on. */ - if (flags & RENAME_EXCHANGE) - return xfs_cross_rename(tp, src_dp, src_name, src_ip, + if (flags & RENAME_EXCHANGE) { + error = xfs_cross_rename(tp, src_dp, src_name, src_ip, target_dp, target_name, target_ip, spaceres); + xfs_iunlock_rename(inodes, num_inodes); + return error; + } /* * Try to reserve quota to handle an expansion of the target directory. 
@@ -3167,6 +3185,7 @@ xfs_rename( if (error == -EDQUOT || error == -ENOSPC) { if (!retried) { xfs_trans_cancel(tp); + xfs_iunlock_rename(inodes, num_inodes); xfs_blockgc_free_quota(target_dp, 0); retried = true; goto retry; @@ -3393,12 +3412,14 @@ xfs_rename( xfs_dir_update_hook(src_dp, wip, 1, src_name); error = xfs_finish_rename(tp); + xfs_iunlock_rename(inodes, num_inodes); if (wip) xfs_irele(wip); return error; out_trans_cancel: xfs_trans_cancel(tp); + xfs_iunlock_rename(inodes, num_inodes); out_release_wip: if (wip) xfs_irele(wip); ^ permalink raw reply related [flat|nested] 100+ messages in thread
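The xfs_iunlock_rename() helper added by this patch walks the sorted inode array backwards and skips empty slots and repeated entries. The userspace sketch below shows why the duplicate check only needs to look at the neighbouring slot: once the array is sorted, identical pointers sit next to each other. `struct toy_inode` and `toy_iunlock_rename` are hypothetical names, and a counter stands in for the ILOCK.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct xfs_inode with a lock counter. */
struct toy_inode {
	uint64_t	ino;
	int		locked;		/* 1 while "locked", 0 once released */
};

/*
 * Release every distinct inode in the sorted array @tab, highest slot first.
 * Returns how many inodes were actually unlocked.
 */
static unsigned int
toy_iunlock_rename(struct toy_inode **tab, int nr)
{
	unsigned int	unlocked = 0;
	int		i;

	for (i = nr - 1; i >= 0; i--) {
		/* Skip empty slots and a repeat of the previous entry. */
		if (!tab[i] || (i > 0 && tab[i] == tab[i - 1]))
			continue;
		tab[i]->locked--;
		unlocked++;
	}
	return unlocked;
}
```

If the same directory appears twice in adjacent slots (for example when the source and target directories are the same), it is released once, at the lower slot, so its lock count never goes negative.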
* [PATCH 6/7] xfs: don't pick up IOLOCK during rmapbt repair scan 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong ` (4 preceding siblings ...) 2024-04-15 23:58 ` [PATCH 5/7] xfs: Hold inode locks in xfs_rename Darrick J. Wong @ 2024-04-15 23:59 ` Darrick J. Wong 2024-04-15 23:59 ` [PATCH 7/7] xfs: unlock new repair tempfiles after creation Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:59 UTC (permalink / raw To: chandanbabu, djwong Cc: Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Darrick J. Wong <djwong@kernel.org> Now that we've fixed the directory operations to hold the ILOCK until they're finished with rmapbt updates for directory shape changes, we no longer need to take this lock when scanning directories for rmapbt records. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/rmap_repair.c | 16 +--------------- 1 file changed, 1 insertion(+), 15 deletions(-) diff --git a/fs/xfs/scrub/rmap_repair.c b/fs/xfs/scrub/rmap_repair.c index e8e07b683eab..25acd69614c2 100644 --- a/fs/xfs/scrub/rmap_repair.c +++ b/fs/xfs/scrub/rmap_repair.c @@ -578,23 +578,9 @@ xrep_rmap_scan_inode( struct xrep_rmap *rr, struct xfs_inode *ip) { - unsigned int lock_mode = 0; + unsigned int lock_mode = xrep_rmap_scan_ilock(ip); int error; - /* - * Directory updates (create/link/unlink/rename) drop the directory's - * ILOCK before finishing any rmapbt updates associated with directory - * shape changes. For this scan to coordinate correctly with the live - * update hook, we must take the only lock (i_rwsem) that is held all - * the way to dir op completion. This will get fixed by the parent - * pointer patchset. - */ - if (S_ISDIR(VFS_I(ip)->i_mode)) { - lock_mode = XFS_IOLOCK_SHARED; - xfs_ilock(ip, lock_mode); - } - lock_mode |= xrep_rmap_scan_ilock(ip); - /* Check the data fork. 
*/ error = xrep_rmap_scan_ifork(rr, ip, XFS_DATA_FORK); if (error) ^ permalink raw reply related [flat|nested] 100+ messages in thread
* [PATCH 7/7] xfs: unlock new repair tempfiles after creation 2024-04-15 23:37 ` [PATCHSET v13.2 16/16] xfs: retain ILOCK during directory updates Darrick J. Wong ` (5 preceding siblings ...) 2024-04-15 23:59 ` [PATCH 6/7] xfs: don't pick up IOLOCK during rmapbt repair scan Darrick J. Wong @ 2024-04-15 23:59 ` Darrick J. Wong 6 siblings, 0 replies; 100+ messages in thread From: Darrick J. Wong @ 2024-04-15 23:59 UTC (permalink / raw To: chandanbabu, djwong Cc: Christoph Hellwig, hch, allison.henderson, catherine.hoang, linux-xfs From: Darrick J. Wong <djwong@kernel.org> After creation, drop the ILOCK on temporary files that have been created to stage a repair. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> --- fs/xfs/scrub/tempfile.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/fs/xfs/scrub/tempfile.c b/fs/xfs/scrub/tempfile.c index c72e447eb8ec..6f39504a216e 100644 --- a/fs/xfs/scrub/tempfile.c +++ b/fs/xfs/scrub/tempfile.c @@ -153,6 +153,7 @@ xrep_tempfile_create( xfs_qm_dqrele(pdqp); /* Finish setting up the incore / vfs context. */ + xfs_iunlock(sc->tempip, XFS_ILOCK_EXCL); xfs_setup_iops(sc->tempip); xfs_finish_inode_setup(sc->tempip); @@ -168,6 +169,7 @@ xrep_tempfile_create( * transactions and deadlocks from xfs_inactive. */ if (sc->tempip) { + xfs_iunlock(sc->tempip, XFS_ILOCK_EXCL); xfs_finish_inode_setup(sc->tempip); xchk_irele(sc, sc->tempip); } ^ permalink raw reply related [flat|nested] 100+ messages in thread