Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap

Linux-XFS Archive mirror
 help / color / mirror / Atom feed

From: Jan Kara <jack@suse.cz>
To: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>
Cc: linux-ext4@vger.kernel.org, linux-xfs@vger.kernel.org,
	linux-fsdevel@vger.kernel.org,
	Matthew Wilcox <willy@infradead.org>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Ojaswin Mujoo <ojaswin@linux.ibm.com>, Jan Kara <jack@suse.cz>
Subject: Re: [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap
Date: Sun, 12 May 2024 15:20:03 +0200	[thread overview]
Message-ID: <20240512132003.b6o4ghqk5yjsvkpy@quack3> (raw)
In-Reply-To: <009d08646b77e0d774b4ce248675b86564bca9ee.1714046808.git.ritesh.list@gmail.com>

On Thu 25-04-24 18:58:48, Ritesh Harjani (IBM) wrote:
> There is a possibility of following race with iomap during
> writebck -
> 
> write_cache_pages()
>   cache extent covering 0..1MB range
>   write page at offset 0k
> 					truncate(file, 4k)
> 					  drops all relevant pages
> 					  frees fs blocks
> 					pwrite(file, 4k, 4k)
> 					  creates dirty page in the page cache
>   writes page at offset 4k to a stale block
> 
> This race can happen because iomap_writepages() keeps a cached extent mapping
> within struct iomap. While write_cache_pages() is going over each folio,
> (can cache a large extent range), if a truncate happens in parallel on the
> next folio followed by a buffered write to the same offset within the file,
> this can change logical to physical offset of the cached iomap mapping.
> That means, the cached iomap has now become stale.
> 
> This patch implements the seq counter approach for revalidation of stale
> iomap mappings. i_blkseq will get incremented for every block
> allocation/free. Here is what we do -
> 
> For ext2 buffered-writes, the block allocation happens at the
> ->write_iter time itself. So at writeback time,
> 1. We first cache the i_blkseq.
> 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
>    already allocated.
> 3. Call ext2_get_blocks() the second time with length to be same as
>    the no. of blocks we know were already allocated.
> 4. Till now it means, the cached i_blkseq remains valid as no block
>    allocation has happened yet.
> This means the next call to ->map_blocks(), we can verify whether the
> i_blkseq has raced with truncate or not. If not, then i_blkseq will
> remain valid.
> 
> In case of a hole (could happen with mmaped writes), we only allocate
> 1 block at a time anyways. So even if the i_blkseq value changes right
> after, we anyway need to allocate the next block in subsequent
> ->map_blocks() call.
> 
> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com>

A few small comments below.

> @@ -698,6 +699,11 @@ static inline struct ext2_inode_info *EXT2_I(struct inode *inode)
>  	return container_of(inode, struct ext2_inode_info, vfs_inode);
>  }
>  
> +static inline void ext2_inc_i_blkseq(struct ext2_inode_info *ei)
> +{
> +	WRITE_ONCE(ei->i_blkseq, READ_ONCE(ei->i_blkseq) + 1);
> +}
> +

Please add a comment here (and assertion as well) that updates of i_blkseq
are protected by ei->i_truncate_mutex. Reads can race at any moment to
that's the reason why WRITE_ONCE() is used. You can remove READ_ONCE() here
as it is pointless (we are locked here).

>  static int ext2_write_map_blocks(struct iomap_writepage_ctx *wpc,
>  				 struct inode *inode, loff_t offset,
>  				 unsigned len)
>  {
> -	if (offset >= wpc->iomap.offset &&
> -	    offset < wpc->iomap.offset + wpc->iomap.length)
> +	loff_t maxblocks = (loff_t)INT_MAX;
> +	u8 blkbits = inode->i_blkbits;
> +	u32 bno;
> +	bool new, boundary;
> +	int ret;
> +
> +	if (ext2_imap_valid(wpc, inode, offset))
>  		return 0;
>  
> -	return ext2_iomap_begin(inode, offset, inode->i_sb->s_blocksize,
> +	/*
> +	 * For ext2 buffered-writes, the block allocation happens at the
> +	 * ->write_iter time itself. So at writeback time -
> +	 * 1. We first cache the i_blkseq.
> +	 * 2. Call ext2_get_blocks(, create = 0) to get the no. of blocks
> +	 *    already allocated.
> +	 * 3. Call ext2_get_blocks() the second time with length to be same as
> +	 *    the no. of blocks we know were already allocated.
> +	 * 4. Till now it means, the cached i_blkseq remains valid as no block
> +	 *    allocation has happened yet.
> +	 * This means the next call to ->map_blocks(), we can verify whether the
> +	 * i_blkseq has raced with truncate or not. If not, then i_blkseq will
> +	 * remain valid.
> +	 *
> +	 * In case of a hole (could happen with mmaped writes), we only allocate
> +	 * 1 block at a time anyways. So even if the i_blkseq value changes, we
> +	 * anyway need to allocate the next block in subsequent ->map_blocks()
> +	 * call.
> +	 */

I suspect it would be tidier to move this logic into ext2_get_blocks()
itself but I guess it is ok as is for now.

> +	wpc->iomap.validity_cookie = READ_ONCE(EXT2_I(inode)->i_blkseq);
> +
> +	ret = ext2_get_blocks(inode, offset >> blkbits, maxblocks << blkbits,
> +			      &bno, &new, &boundary, 0);
> +	if (ret < 0)
> +		return ret;
> +	/*
> +	 * ret can be 0 in case of a hole which is possible for mmaped writes.
> +	 */
> +	ret = ret ? ret : 1;
> +	return ext2_iomap_begin(inode, offset, (loff_t)ret << blkbits,
>  				IOMAP_WRITE, &wpc->iomap, NULL);
>  }
>  
> diff --git a/fs/ext2/super.c b/fs/ext2/super.c
> index 37f7ce56adce..32f5386284d6 100644
> --- a/fs/ext2/super.c
> +++ b/fs/ext2/super.c
> @@ -188,7 +188,7 @@ static struct inode *ext2_alloc_inode(struct super_block *sb)
>  #ifdef CONFIG_QUOTA
>  	memset(&ei->i_dquot, 0, sizeof(ei->i_dquot));
>  #endif
> -
> +	WRITE_ONCE(ei->i_blkseq, 0);
>  	return &ei->vfs_inode;

No need for write once here. This cannot race with anything...

									Honza
-- 
Jan Kara <jack@suse.com>
SUSE Labs, CR

next prev parent reply	other threads:[~2024-05-13 13:41 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-25 13:28 [RFCv3 0/7] ext2 iomap changes and iomap improvements Ritesh Harjani (IBM)
2024-04-25 13:28 ` [RFCv3 1/7] ext2: Remove comment related to journal handle Ritesh Harjani (IBM)
2024-04-26 15:21   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 2/7] ext2: Convert ext2 regular file buffered I/O to use iomap Ritesh Harjani (IBM)
2024-04-26 15:29   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 3/7] ext2: Enable large folio support Ritesh Harjani (IBM)
2024-04-26 15:30   ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 4/7] ext2: Implement seq counter for validating cached iomap Ritesh Harjani (IBM)
2024-04-26 15:41   ` Darrick J. Wong
2024-05-12 13:20   ` Jan Kara [this message]
2024-04-25 13:28 ` [RFCv3 5/7] iomap: Fix iomap_adjust_read_range for plen calculation Ritesh Harjani (IBM)
2024-04-26  6:49   ` Christoph Hellwig
2024-04-26  8:52     ` Ritesh Harjani
2024-04-26 15:43       ` Darrick J. Wong
2024-04-25 13:28 ` [RFCv3 6/7] iomap: Optimize iomap_read_folio Ritesh Harjani (IBM)
2024-04-25 13:53   ` Matthew Wilcox
2024-04-26  6:53   ` Christoph Hellwig
2024-04-26  8:50     ` Ritesh Harjani
2024-04-27  4:44       ` Christoph Hellwig
2024-04-25 13:28 ` [RFCv3 7/7] iomap: Optimize data access patterns for filesystems with indirect mappings Ritesh Harjani (IBM)
2024-04-26 16:24   ` Darrick J. Wong
2024-04-26 17:17     ` Ritesh Harjani
2024-04-26 17:25       ` Ritesh Harjani
2024-04-26 17:37         ` Matthew Wilcox
2024-04-26 17:55           ` Ritesh Harjani
2024-04-26 18:13             ` Matthew Wilcox
2024-04-26 18:57               ` Ritesh Harjani
2024-04-26 19:19                 ` Matthew Wilcox
2024-04-27  4:54                   ` Christoph Hellwig
2024-04-27  6:03                   ` Ritesh Harjani
2024-04-27  4:47   ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240512132003.b6o4ghqk5yjsvkpy@quack3 \
    --to=jack@suse.cz \
    --cc=djwong@kernel.org \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).