Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed
From: "Darrick J. Wong" <djwong@kernel.org>
To: Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-ext4@vger.kernel.org,
	hch@infradead.org, brauner@kernel.org, david@fromorbit.com,
	chandanbabu@kernel.org, jack@suse.cz, yi.zhang@huawei.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com
Subject: Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
Date: Fri, 17 May 2024 10:59:00 -0700	[thread overview]
Message-ID: <20240517175900.GC360919@frogsfrogsfrogs> (raw)
In-Reply-To: <20240517111355.233085-4-yi.zhang@huaweicloud.com>

On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating a realtime file unaligned to a shorter size,
> xfs_setattr_size() only flush the EOF page before zeroing out, and
> xfs_truncate_page() also only zeros the EOF block. This could expose
> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> a write operation").
> 
> If the sb_rextsize is bigger than one block, and we have a realtime
> inode that contains a long enough written extent. If we unaligned
> truncate into the middle of this extent, xfs_itruncate_extents() could
> split the extent and align the it's tail to sb_rextsize, there maybe
> have more than one blocks more between the end of the file. Since
> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> value, so it may leftover some blocks contains stale data that could be
> exposed if we append write it over a long enough distance later.

IOWs, any time we truncate down, we need to zero every byte from the new
EOF all the way to the end of the allocation unit, correct?

Maybe pictures would be easier to reason with.  Say you have
rextsize=30 and a partially written rtextent; each 'W' is a written
fsblock and 'u' is an unwritten fsblock:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
                    ^ old EOF

Now you want to truncate down:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
so the truncate leaves the file in this state:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

(where 'z' is a written block with zeroes after EOF)

This is bad because the "W"s between the new and old EOF still contain
old credit card info or whatever.  Now if we mmap the file or whatever,
we can access those old contents.

So your new patch amends iomap_truncate_page so that it'll zero all the
way to the end of the @blocksize parameter.  That fixes the exposure by 
writing zeroes to the pagecache before we truncate down:

WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
     ^ new EOF      ^ old EOF

Is that correct?

If so, then why don't we make xfs_truncate_page convert the post-eof
rtextent blocks back to unwritten status:

WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
     ^ new EOF      ^ old EOF

If we can do that, then do we need the changes to iomap_truncate_page?
Converting the mapping should be much faster than dirtying potentially
a lot of data (rt extents can be 1GB in size).

> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> and make sure the entire zeroed range have been flushed to disk before
> updating the inode size.
> 
> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_iops.c  | 10 ----------
>  2 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4958cc3337bc..fc379450fe74 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>  	loff_t			pos,
>  	bool			*did_zero)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	unsigned int		blocksize = i_blocksize(inode);
> +	int			error;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);

Don't opencode xfs_inode_alloc_unitsize, please.

> +
> +	/*
> +	 * iomap won't detect a dirty page over an unwritten block (or a
> +	 * cow block over a hole) and subsequently skips zeroing the
> +	 * newly post-EOF portion of the page. Flush the new EOF to
> +	 * convert the block before the pagecache truncate.
> +	 */
> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> +					     roundup_64(pos, blocksize));
> +	if (error)
> +		return error;pos_in_block

Ok so this is hoisting the filemap_write_and_wait_range call from
xfs_setattr_size.  It's curious that we need to need to twiddle anything
other than the EOF block itself though?

>  
>  	if (IS_DAX(inode))
> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> -					&xfs_dax_write_iomap_ops);
> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> -				   &xfs_buffered_write_iomap_ops);
> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> +					  &xfs_dax_write_iomap_ops);
> +	else
> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> +					    &xfs_buffered_write_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Write back path won't write dirty blocks post EOF folio,
> +	 * flush the entire zeroed range before updating the inode
> +	 * size.
> +	 */
> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> +					    roundup_64(pos, blocksize));

...but what is the purpose of the second filemap_write_and_wait_range
call?  Is that to flush the bytes between new and old EOF to disk before
truncate_setsize invalidates the (zeroed) pagecache?

--D

>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66f8c47642e8..baeeddf4a6bb 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -845,16 +845,6 @@ xfs_setattr_size(
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
>  	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.39.2
> 
> 

  reply	other threads:[~2024-05-17 17:59 UTC|newest]

Thread overview: 19+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-17 11:13 [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-05-17 17:29   ` Darrick J. Wong
2024-05-18  2:01     ` Zhang Yi
2024-05-17 11:13 ` [PATCH v3 2/3] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
2024-05-17 17:59   ` Darrick J. Wong [this message]
2024-05-18  6:35     ` Zhang Yi
2024-05-18 19:26       ` Darrick J. Wong
2024-05-20  6:56         ` Zhang Yi
2024-05-20  7:11           ` Zhang Yi
2024-05-20 18:37           ` Darrick J. Wong
2024-05-21 13:45             ` Zhang Yi
2024-05-21  2:38   ` Dave Chinner
2024-05-22  1:57     ` Zhang Yi
2024-05-23  1:11       ` Dave Chinner
2024-05-23  2:00         ` Zhang Yi
2024-05-22  3:00     ` Darrick J. Wong
2024-05-23  1:14       ` Dave Chinner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240517175900.GC360919@frogsfrogsfrogs \
    --to=djwong@kernel.org \
    --cc=brauner@kernel.org \
    --cc=chandanbabu@kernel.org \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=yi.zhang@huawei.com \
    --cc=yi.zhang@huaweicloud.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).