[PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes

Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed

* [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes
@ 2024-05-17 11:13 Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
                   ` (2 more replies)
  0 siblings, 3 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-17 11:13 UTC (permalink / raw
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, linux-ext4, djwong, hch, brauner, david,
	chandanbabu, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Changes since v2:
 - Use div_u64_rem() instead of do_div().

Changes since v1:
 - In iomap_truncate_page() and dax_truncate_page(), for the case of
   truncate blocksize is not power of 2, use do_dive() to calculate the
   offset.

This series fix a stale data exposure issue reported by Chandan when
running fstests generic/561 on xfs with realtime device[1]. The real
problem is xfs_setattr_size() doesn't zero out enough range when
truncating a realtime inode, please see the third patch or [1] for
details.

The first two patches modify iomap_truncate_page() and
dax_truncate_page() to pass filesystem identified blocksize, and drop
the assumption of i_blocksize() as Dave suggested. The third patch fix
the issue by modifying xfs_truncate_page() to pass the correct
blocksize, and make sure zeroed range have been flushed to disk before
updating i_size.

[1] https://lore.kernel.org/linux-xfs/87ttj8ircu.fsf@debian-BULLSEYE-live-builder-AMD64/

Thanks,
Yi.

Zhang Yi (3):
  iomap: pass blocksize to iomap_truncate_page()
  fsdax: pass blocksize to dax_truncate_page()
  xfs: correct the zeroing truncate range

 fs/dax.c               | 13 +++++++++----
 fs/ext2/inode.c        |  4 ++--
 fs/iomap/buffered-io.c | 13 +++++++++----
 fs/xfs/xfs_iomap.c     | 36 ++++++++++++++++++++++++++++++++----
 fs/xfs/xfs_iops.c      | 10 ----------
 include/linux/dax.h    |  4 ++--
 include/linux/iomap.h  |  4 ++--
 7 files changed, 56 insertions(+), 28 deletions(-)

-- 
2.39.2

^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page()
  2024-05-17 11:13 [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
@ 2024-05-17 11:13 ` Zhang Yi
  2024-05-17 17:29   ` Darrick J. Wong
  2024-05-17 11:13 ` [PATCH v3 2/3] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
  2 siblings, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-05-17 11:13 UTC (permalink / raw
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, linux-ext4, djwong, hch, brauner, david,
	chandanbabu, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

iomap_truncate_page() always assumes the block size of the truncating
inode is i_blocksize(), this is not always true for some filesystems,
e.g. XFS does extent size alignment for realtime inodes. Drop this
assumption and pass the block size for zeroing into
iomap_truncate_page(), allow filesystems to indicate the correct block
size.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 13 +++++++++----
 fs/xfs/xfs_iomap.c     |  3 ++-
 include/linux/iomap.h  |  4 ++--
 3 files changed, 13 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 0926d216a5af..a0a0ac2c659c 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/sched/signal.h>
 #include <linux/migrate.h>
+#include <linux/math64.h>
 #include "trace.h"
 
 #include "../internal.h"
@@ -1445,11 +1446,15 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
 int
-iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops)
 {
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
+	unsigned int off;
+
+	if (is_power_of_2(blocksize))
+		off = pos & (blocksize - 1);
+	else
+		div_u64_rem(pos, blocksize, &off);
 
 	/* Block boundary? Nothing to do */
 	if (!off)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 2857ef1b0272..31ac07bb8425 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1467,10 +1467,11 @@ xfs_truncate_page(
 	bool			*did_zero)
 {
 	struct inode		*inode = VFS_I(ip);
+	unsigned int		blocksize = i_blocksize(inode);
 
 	if (IS_DAX(inode))
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_dax_write_iomap_ops);
-	return iomap_truncate_page(inode, pos, did_zero,
+	return iomap_truncate_page(inode, pos, blocksize, did_zero,
 				   &xfs_buffered_write_iomap_ops);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..d67bf86ec582 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops);
-int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 2/3] fsdax: pass blocksize to dax_truncate_page()
  2024-05-17 11:13 [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-05-17 11:13 ` Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
  2 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-17 11:13 UTC (permalink / raw
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, linux-ext4, djwong, hch, brauner, david,
	chandanbabu, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

dax_truncate_page() always assumes the block size of the truncating
inode is i_blocksize(), this is not always true for some filesystems,
e.g. XFS does extent size alignment for realtime inodes. Drop this
assumption and pass the block size for zeroing into dax_truncate_page(),
allow filesystems to indicate the correct block size.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/dax.c            | 13 +++++++++----
 fs/ext2/inode.c     |  4 ++--
 fs/xfs/xfs_iomap.c  |  2 +-
 include/linux/dax.h |  4 ++--
 4 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index 423fc1607dfa..98419280d9ae 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/iomap.h>
 #include <linux/rmap.h>
+#include <linux/math64.h>
 #include <asm/pgalloc.h>
 
 #define CREATE_TRACE_POINTS
@@ -1403,11 +1404,15 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 }
 EXPORT_SYMBOL_GPL(dax_zero_range);
 
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops)
 {
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
+	unsigned int off;
+
+	if (is_power_of_2(blocksize))
+		off = pos & (blocksize - 1);
+	else
+		div_u64_rem(pos, blocksize, &off);
 
 	/* Block boundary? Nothing to do */
 	if (!off)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index f3d570a9302b..fbbd479f3c4e 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1278,8 +1278,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = dax_truncate_page(inode, newsize, NULL,
-					  &ext2_iomap_ops);
+		error = dax_truncate_page(inode, newsize, i_blocksize(inode),
+					  NULL, &ext2_iomap_ops);
 	else
 		error = block_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 31ac07bb8425..4958cc3337bc 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1470,7 +1470,7 @@ xfs_truncate_page(
 	unsigned int		blocksize = i_blocksize(inode);
 
 	if (IS_DAX(inode))
-		return dax_truncate_page(inode, pos, did_zero,
+		return dax_truncate_page(inode, pos, blocksize, did_zero,
 					&xfs_dax_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, blocksize, did_zero,
 				   &xfs_buffered_write_iomap_ops);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e3327af4c..4aa8ef7c8fd4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,8 +210,8 @@ int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops);
 
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-17 11:13 [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
  2024-05-17 11:13 ` [PATCH v3 2/3] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
@ 2024-05-17 11:13 ` Zhang Yi
  2024-05-17 17:59   ` Darrick J. Wong
  2024-05-21  2:38   ` Dave Chinner
  2 siblings, 2 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-17 11:13 UTC (permalink / raw
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, linux-ext4, djwong, hch, brauner, david,
	chandanbabu, jack, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

When truncating a realtime file unaligned to a shorter size,
xfs_setattr_size() only flush the EOF page before zeroing out, and
xfs_truncate_page() also only zeros the EOF block. This could expose
stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
a write operation").

If the sb_rextsize is bigger than one block, and we have a realtime
inode that contains a long enough written extent. If we unaligned
truncate into the middle of this extent, xfs_itruncate_extents() could
split the extent and align the it's tail to sb_rextsize, there maybe
have more than one blocks more between the end of the file. Since
xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
value, so it may leftover some blocks contains stale data that could be
exposed if we append write it over a long enough distance later.

xfs_truncate_page() should flush, zeros out the entire rtextsize range,
and make sure the entire zeroed range have been flushed to disk before
updating the inode size.

Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
 fs/xfs/xfs_iops.c  | 10 ----------
 2 files changed, 31 insertions(+), 14 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 4958cc3337bc..fc379450fe74 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1466,12 +1466,39 @@ xfs_truncate_page(
 	loff_t			pos,
 	bool			*did_zero)
 {
+	struct xfs_mount	*mp = ip->i_mount;
 	struct inode		*inode = VFS_I(ip);
 	unsigned int		blocksize = i_blocksize(inode);
+	int			error;
+
+	if (XFS_IS_REALTIME_INODE(ip))
+		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
+
+	/*
+	 * iomap won't detect a dirty page over an unwritten block (or a
+	 * cow block over a hole) and subsequently skips zeroing the
+	 * newly post-EOF portion of the page. Flush the new EOF to
+	 * convert the block before the pagecache truncate.
+	 */
+	error = filemap_write_and_wait_range(inode->i_mapping, pos,
+					     roundup_64(pos, blocksize));
+	if (error)
+		return error;
 
 	if (IS_DAX(inode))
-		return dax_truncate_page(inode, pos, blocksize, did_zero,
-					&xfs_dax_write_iomap_ops);
-	return iomap_truncate_page(inode, pos, blocksize, did_zero,
-				   &xfs_buffered_write_iomap_ops);
+		error = dax_truncate_page(inode, pos, blocksize, did_zero,
+					  &xfs_dax_write_iomap_ops);
+	else
+		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
+					    &xfs_buffered_write_iomap_ops);
+	if (error)
+		return error;
+
+	/*
+	 * Write back path won't write dirty blocks post EOF folio,
+	 * flush the entire zeroed range before updating the inode
+	 * size.
+	 */
+	return filemap_write_and_wait_range(inode->i_mapping, pos,
+					    roundup_64(pos, blocksize));
 }
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 66f8c47642e8..baeeddf4a6bb 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -845,16 +845,6 @@ xfs_setattr_size(
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
 	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page()
  2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-05-17 17:29   ` Darrick J. Wong
  2024-05-18  2:01     ` Zhang Yi
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-17 17:29 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Fri, May 17, 2024 at 07:13:53PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> iomap_truncate_page() always assumes the block size of the truncating
> inode is i_blocksize(), this is not always true for some filesystems,
> e.g. XFS does extent size alignment for realtime inodes. Drop this
> assumption and pass the block size for zeroing into
> iomap_truncate_page(), allow filesystems to indicate the correct block
> size.
> 
> Suggested-by: Dave Chinner <david@fromorbit.com>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/iomap/buffered-io.c | 13 +++++++++----
>  fs/xfs/xfs_iomap.c     |  3 ++-
>  include/linux/iomap.h  |  4 ++--
>  3 files changed, 13 insertions(+), 7 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 0926d216a5af..a0a0ac2c659c 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -17,6 +17,7 @@
>  #include <linux/bio.h>
>  #include <linux/sched/signal.h>
>  #include <linux/migrate.h>
> +#include <linux/math64.h>
>  #include "trace.h"
>  
>  #include "../internal.h"
> @@ -1445,11 +1446,15 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>  EXPORT_SYMBOL_GPL(iomap_zero_range);
>  
>  int
> -iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -		const struct iomap_ops *ops)
> +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> +		bool *did_zero, const struct iomap_ops *ops)
>  {
> -	unsigned int blocksize = i_blocksize(inode);
> -	unsigned int off = pos & (blocksize - 1);
> +	unsigned int off;
> +
> +	if (is_power_of_2(blocksize))
> +		off = pos & (blocksize - 1);
> +	else
> +		div_u64_rem(pos, blocksize, &off);

I wish this was a helper in math64.h somewhere.

static inline u32 rem_u64(u64 dividend, u32 divisor)
{
	if (likely(is_power_of_2(divisor)))
		return dividend & (divisor - 1);

	return dividend % divisor;
}

That way we skip the second division in div_u64_rem entirely, and the
iomap/dax code becomes:

	unsigned int off = rem_u64(pos, blocksize); /* pos in block */

Otherwise this looks like a straightforward mechanical change to me.

--D

>  
>  	/* Block boundary? Nothing to do */
>  	if (!off)
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 2857ef1b0272..31ac07bb8425 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1467,10 +1467,11 @@ xfs_truncate_page(
>  	bool			*did_zero)
>  {
>  	struct inode		*inode = VFS_I(ip);
> +	unsigned int		blocksize = i_blocksize(inode);
>  
>  	if (IS_DAX(inode))
>  		return dax_truncate_page(inode, pos, did_zero,
>  					&xfs_dax_write_iomap_ops);
> -	return iomap_truncate_page(inode, pos, did_zero,
> +	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>  				   &xfs_buffered_write_iomap_ops);
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..d67bf86ec582 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>  		const struct iomap_ops *ops);
>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>  		bool *did_zero, const struct iomap_ops *ops);
> -int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> -		const struct iomap_ops *ops);
> +int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> +		bool *did_zero, const struct iomap_ops *ops);
>  vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
>  			const struct iomap_ops *ops);
>  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
@ 2024-05-17 17:59   ` Darrick J. Wong
  2024-05-18  6:35     ` Zhang Yi
  2024-05-21  2:38   ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-17 17:59 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating a realtime file unaligned to a shorter size,
> xfs_setattr_size() only flush the EOF page before zeroing out, and
> xfs_truncate_page() also only zeros the EOF block. This could expose
> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> a write operation").
> 
> If the sb_rextsize is bigger than one block, and we have a realtime
> inode that contains a long enough written extent. If we unaligned
> truncate into the middle of this extent, xfs_itruncate_extents() could
> split the extent and align the it's tail to sb_rextsize, there maybe
> have more than one blocks more between the end of the file. Since
> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> value, so it may leftover some blocks contains stale data that could be
> exposed if we append write it over a long enough distance later.

IOWs, any time we truncate down, we need to zero every byte from the new
EOF all the way to the end of the allocation unit, correct?

Maybe pictures would be easier to reason with.  Say you have
rextsize=30 and a partially written rtextent; each 'W' is a written
fsblock and 'u' is an unwritten fsblock:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
                    ^ old EOF

Now you want to truncate down:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
so the truncate leaves the file in this state:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF      ^ old EOF

(where 'z' is a written block with zeroes after EOF)

This is bad because the "W"s between the new and old EOF still contain
old credit card info or whatever.  Now if we mmap the file or whatever,
we can access those old contents.

So your new patch amends iomap_truncate_page so that it'll zero all the
way to the end of the @blocksize parameter.  That fixes the exposure by 
writing zeroes to the pagecache before we truncate down:

WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
     ^ new EOF      ^ old EOF

Is that correct?

If so, then why don't we make xfs_truncate_page convert the post-eof
rtextent blocks back to unwritten status:

WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
     ^ new EOF      ^ old EOF

If we can do that, then do we need the changes to iomap_truncate_page?
Converting the mapping should be much faster than dirtying potentially
a lot of data (rt extents can be 1GB in size).

> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> and make sure the entire zeroed range have been flushed to disk before
> updating the inode size.
> 
> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_iops.c  | 10 ----------
>  2 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4958cc3337bc..fc379450fe74 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>  	loff_t			pos,
>  	bool			*did_zero)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	unsigned int		blocksize = i_blocksize(inode);
> +	int			error;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);

Don't opencode xfs_inode_alloc_unitsize, please.

> +
> +	/*
> +	 * iomap won't detect a dirty page over an unwritten block (or a
> +	 * cow block over a hole) and subsequently skips zeroing the
> +	 * newly post-EOF portion of the page. Flush the new EOF to
> +	 * convert the block before the pagecache truncate.
> +	 */
> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> +					     roundup_64(pos, blocksize));
> +	if (error)
> +		return error;pos_in_block

Ok so this is hoisting the filemap_write_and_wait_range call from
xfs_setattr_size.  It's curious that we need to need to twiddle anything
other than the EOF block itself though?

>  
>  	if (IS_DAX(inode))
> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> -					&xfs_dax_write_iomap_ops);
> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> -				   &xfs_buffered_write_iomap_ops);
> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> +					  &xfs_dax_write_iomap_ops);
> +	else
> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> +					    &xfs_buffered_write_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Write back path won't write dirty blocks post EOF folio,
> +	 * flush the entire zeroed range before updating the inode
> +	 * size.
> +	 */
> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> +					    roundup_64(pos, blocksize));

...but what is the purpose of the second filemap_write_and_wait_range
call?  Is that to flush the bytes between new and old EOF to disk before
truncate_setsize invalidates the (zeroed) pagecache?

--D

>  }
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 66f8c47642e8..baeeddf4a6bb 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -845,16 +845,6 @@ xfs_setattr_size(
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
>  	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page()
  2024-05-17 17:29   ` Darrick J. Wong
@ 2024-05-18  2:01     ` Zhang Yi
  0 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-18  2:01 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/18 1:29, Darrick J. Wong wrote:
> On Fri, May 17, 2024 at 07:13:53PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> iomap_truncate_page() always assumes the block size of the truncating
>> inode is i_blocksize(), this is not always true for some filesystems,
>> e.g. XFS does extent size alignment for realtime inodes. Drop this
>> assumption and pass the block size for zeroing into
>> iomap_truncate_page(), allow filesystems to indicate the correct block
>> size.
>>
>> Suggested-by: Dave Chinner <david@fromorbit.com>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/iomap/buffered-io.c | 13 +++++++++----
>>  fs/xfs/xfs_iomap.c     |  3 ++-
>>  include/linux/iomap.h  |  4 ++--
>>  3 files changed, 13 insertions(+), 7 deletions(-)
>>
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index 0926d216a5af..a0a0ac2c659c 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -17,6 +17,7 @@
>>  #include <linux/bio.h>
>>  #include <linux/sched/signal.h>
>>  #include <linux/migrate.h>
>> +#include <linux/math64.h>
>>  #include "trace.h"
>>  
>>  #include "../internal.h"
>> @@ -1445,11 +1446,15 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
>>  EXPORT_SYMBOL_GPL(iomap_zero_range);
>>  
>>  int
>> -iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>> -		const struct iomap_ops *ops)
>> +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
>> +		bool *did_zero, const struct iomap_ops *ops)
>>  {
>> -	unsigned int blocksize = i_blocksize(inode);
>> -	unsigned int off = pos & (blocksize - 1);
>> +	unsigned int off;
>> +
>> +	if (is_power_of_2(blocksize))
>> +		off = pos & (blocksize - 1);
>> +	else
>> +		div_u64_rem(pos, blocksize, &off);
> 
> I wish this was a helper in math64.h somewhere.
> 
> static inline u32 rem_u64(u64 dividend, u32 divisor)
> {
> 	if (likely(is_power_of_2(divisor)))
> 		return dividend & (divisor - 1);
> 
> 	return dividend % divisor;
> }
> 
> That way we skip the second division in div_u64_rem entirely, and the
> iomap/dax code becomes:
> 
> 	unsigned int off = rem_u64(pos, blocksize); /* pos in block */
> 
> Otherwise this looks like a straightforward mechanical change to me.
> 

Yeah, we do need this helper.

Thanks,
Yi.

> 
>>  
>>  	/* Block boundary? Nothing to do */
>>  	if (!off)
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 2857ef1b0272..31ac07bb8425 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1467,10 +1467,11 @@ xfs_truncate_page(
>>  	bool			*did_zero)
>>  {
>>  	struct inode		*inode = VFS_I(ip);
>> +	unsigned int		blocksize = i_blocksize(inode);
>>  
>>  	if (IS_DAX(inode))
>>  		return dax_truncate_page(inode, pos, did_zero,
>>  					&xfs_dax_write_iomap_ops);
>> -	return iomap_truncate_page(inode, pos, did_zero,
>> +	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>>  				   &xfs_buffered_write_iomap_ops);
>>  }
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 6fc1c858013d..d67bf86ec582 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
>>  		const struct iomap_ops *ops);
>>  int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
>>  		bool *did_zero, const struct iomap_ops *ops);
>> -int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
>> -		const struct iomap_ops *ops);
>> +int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
>> +		bool *did_zero, const struct iomap_ops *ops);
>>  vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
>>  			const struct iomap_ops *ops);
>>  int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
>> -- 
>> 2.39.2
>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-17 17:59   ` Darrick J. Wong
@ 2024-05-18  6:35     ` Zhang Yi
  2024-05-18 19:26       ` Darrick J. Wong
  0 siblings, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-05-18  6:35 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/18 1:59, Darrick J. Wong wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating a realtime file unaligned to a shorter size,
>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>> xfs_truncate_page() also only zeros the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> If the sb_rextsize is bigger than one block, and we have a realtime
>> inode that contains a long enough written extent. If we unaligned
>> truncate into the middle of this extent, xfs_itruncate_extents() could
>> split the extent and align the it's tail to sb_rextsize, there maybe
>> have more than one blocks more between the end of the file. Since
>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>> value, so it may leftover some blocks contains stale data that could be
>> exposed if we append write it over a long enough distance later.
> 
> IOWs, any time we truncate down, we need to zero every byte from the new
> EOF all the way to the end of the allocation unit, correct?

Yeah.

> 
> Maybe pictures would be easier to reason with.  Say you have
> rextsize=30 and a partially written rtextent; each 'W' is a written
> fsblock and 'u' is an unwritten fsblock:
> 
> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>                     ^ old EOF
> 
> Now you want to truncate down:
> 
> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>      ^ new EOF      ^ old EOF
> 
> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
> so the truncate leaves the file in this state:
> 
> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>      ^ new EOF      ^ old EOF
> 
> (where 'z' is a written block with zeroes after EOF)
> 
> This is bad because the "W"s between the new and old EOF still contain
> old credit card info or whatever.  Now if we mmap the file or whatever,
> we can access those old contents.
> 
> So your new patch amends iomap_truncate_page so that it'll zero all the
> way to the end of the @blocksize parameter.  That fixes the exposure by 
> writing zeroes to the pagecache before we truncate down:
> 
> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>      ^ new EOF      ^ old EOF
> 
> Is that correct?
> 

Yes, it's correct. However, not only write zeros to the pagecache, but
also flush to disk, please see below for details.

> If so, then why don't we make xfs_truncate_page convert the post-eof
> rtextent blocks back to unwritten status:
> 
> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>      ^ new EOF      ^ old EOF
> 
> If we can do that, then do we need the changes to iomap_truncate_page?
> Converting the mapping should be much faster than dirtying potentially
> a lot of data (rt extents can be 1GB in size).

Now that the exposed stale data range (should be zeroed) is only one
rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
it breaks the alignment of rtextent and the definition of "extsize is used
to specify the size of the blocks in the real-time section of the
filesystem", is it fine? And IIUC, the upcoming xfs force alignment
extent feature seems also need to follow this alignment, right?

> 
>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>> and make sure the entire zeroed range have been flushed to disk before
>> updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_iops.c  | 10 ----------
>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>  	loff_t			pos,
>>  	bool			*did_zero)
>>  {
>> +	struct xfs_mount	*mp = ip->i_mount;
>>  	struct inode		*inode = VFS_I(ip);
>>  	unsigned int		blocksize = i_blocksize(inode);
>> +	int			error;
>> +
>> +	if (XFS_IS_REALTIME_INODE(ip))
>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> 
> Don't opencode xfs_inode_alloc_unitsize, please.

Ha, I missed the latest added helper, thanks for pointing this out.

> 
>> +
>> +	/*
>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>> +	 * cow block over a hole) and subsequently skips zeroing the
>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>> +	 * convert the block before the pagecache truncate.
>> +	 */
>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					     roundup_64(pos, blocksize));
>> +	if (error)
>> +		return error;pos_in_block
> 
> Ok so this is hoisting the filemap_write_and_wait_range call from
> xfs_setattr_size.  It's curious that we need to need to twiddle anything
> other than the EOF block itself though?

Since we planed to zero out the dirtied range which is ailgned to the
extsize instead of the blocksize, ensure one block is not unwritten is
not enough, we should also make sure that the range which is going to
zero out is not unwritten, or else the iomap_zero_iter() will skip
zeroing out the extra blocks.

For example:

before zeroing:
           |<-    extszie   ->|
        ...dddddddddddddddddddd
        ...UUUUUUUUUUUUUUUUUUUU
           ^                  ^
        new EOF             old EOF    (where 'd' means the pagecache is dirty)

if we only flush the new EOF block, the result becomes:

           |<-    extszie   ->|
           zddddddddddddddddddd
           ZUUUUUUUUUUUUUUUUUUU
           ^                  ^
        new EOF             old EOF


then the dirty extent range that between new EOF block and the old EOF
block can't be zeroed sine it's still unwritten. So we have to flush the
whole range before zeroing out.

> 
>>  
>>  	if (IS_DAX(inode))
>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> -					&xfs_dax_write_iomap_ops);
>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> -				   &xfs_buffered_write_iomap_ops);
>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> +					  &xfs_dax_write_iomap_ops);
>> +	else
>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> +					    &xfs_buffered_write_iomap_ops);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Write back path won't write dirty blocks post EOF folio,
>> +	 * flush the entire zeroed range before updating the inode
>> +	 * size.
>> +	 */
>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					    roundup_64(pos, blocksize));
> 
> ...but what is the purpose of the second filemap_write_and_wait_range
> call?  Is that to flush the bytes between new and old EOF to disk before
> truncate_setsize invalidates the (zeroed) pagecache?
> 

The second filemap_write_and_wait_range() call is used to make sure that
the zeroed data be flushed to disk before we updating i_size. If we don't
add this one, once the i_size is been changed, the zeroed data which
beyond the new EOF folio(block) couldn't be write back, because
iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
the stale data problem is still there.

For example:

before zeroing:
           |<-    extszie   ->|
           wwwwwwwwwwwwwwwwwwww (pagecache)
        ...WWWWWWWWWWWWWWWWWWWW (disk)
           ^                  ^
        new EOF               EOF   (where 'w' means the pagecache contains data)

then iomap_truncate_page() zeroing out the pagecache:

           |<-    extszie   ->|
           zzzzzzzzzzzzzzzzzzzz (pagecache)
           WWWWWWWWWWWWWWWWWWWW (disk)
           ^                  ^
        new EOF               EOF

then update i_size, sync and drop cache:

           |<-    extszie   ->|
           ZWWWWWWWWWWWWWWWWWWW (disk)
           ^
           EOF

Thanks,
Yi.

> 
>>  }
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index 66f8c47642e8..baeeddf4a6bb 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -845,16 +845,6 @@ xfs_setattr_size(
>>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>>  				&did_zeroing);
>>  	} else {
>> -		/*
>> -		 * iomap won't detect a dirty page over an unwritten block (or a
>> -		 * cow block over a hole) and subsequently skips zeroing the
>> -		 * newly post-EOF portion of the page. Flush the new EOF to
>> -		 * convert the block before the pagecache truncate.
>> -		 */
>> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
>> -						     newsize);
>> -		if (error)
>> -			return error;
>>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>>  	}
>>  
>> -- 
>> 2.39.2
>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-18  6:35     ` Zhang Yi
@ 2024-05-18 19:26       ` Darrick J. Wong
  2024-05-20  6:56         ` Zhang Yi
  0 siblings, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-18 19:26 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
> On 2024/5/18 1:59, Darrick J. Wong wrote:
> > On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> >> From: Zhang Yi <yi.zhang@huawei.com>
> >>
> >> When truncating a realtime file unaligned to a shorter size,
> >> xfs_setattr_size() only flush the EOF page before zeroing out, and
> >> xfs_truncate_page() also only zeros the EOF block. This could expose
> >> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> >> a write operation").
> >>
> >> If the sb_rextsize is bigger than one block, and we have a realtime
> >> inode that contains a long enough written extent. If we unaligned
> >> truncate into the middle of this extent, xfs_itruncate_extents() could
> >> split the extent and align the it's tail to sb_rextsize, there maybe
> >> have more than one blocks more between the end of the file. Since
> >> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> >> value, so it may leftover some blocks contains stale data that could be
> >> exposed if we append write it over a long enough distance later.

Hum.  Is this an appending write into the next rtextent?  For example,
if you start with a file like this:

WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
                    ^ old EOF

Then truncate it improperly like this:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
     ^ new EOF               

Then do an extending write like this:

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
     ^ EOF                    ^ next rtx        ^ append here

And now the problem is that we've exposed stale data that should be
zeroes?

WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
      ^^^^^^^^^^^^^^^                             ^ new EOF
      should be zeroed

> > 
> > IOWs, any time we truncate down, we need to zero every byte from the new
> > EOF all the way to the end of the allocation unit, correct?
> 
> Yeah.
> 
> > 
> > Maybe pictures would be easier to reason with.  Say you have
> > rextsize=30 and a partially written rtextent; each 'W' is a written
> > fsblock and 'u' is an unwritten fsblock:
> > 
> > WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >                     ^ old EOF
> > 
> > Now you want to truncate down:
> > 
> > WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
> > so the truncate leaves the file in this state:
> > 
> > WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > (where 'z' is a written block with zeroes after EOF)
> > 
> > This is bad because the "W"s between the new and old EOF still contain
> > old credit card info or whatever.  Now if we mmap the file or whatever,
> > we can access those old contents.
> > 
> > So your new patch amends iomap_truncate_page so that it'll zero all the
> > way to the end of the @blocksize parameter.  That fixes the exposure by 
> > writing zeroes to the pagecache before we truncate down:
> > 
> > WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > Is that correct?
> > 
> 
> Yes, it's correct. However, not only write zeros to the pagecache, but
> also flush to disk, please see below for details.

<nod> iomap_truncate_page writes zeroes to any part of the pagecache
backed by written extents, and then xfs must call
filemap_write_and_wait_range to write the dirty (zeroed) cache out to
disk.

> > If so, then why don't we make xfs_truncate_page convert the post-eof
> > rtextent blocks back to unwritten status:
> > 
> > WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
> >      ^ new EOF      ^ old EOF
> > 
> > If we can do that, then do we need the changes to iomap_truncate_page?
> > Converting the mapping should be much faster than dirtying potentially
> > a lot of data (rt extents can be 1GB in size).
> 
> Now that the exposed stale data range (should be zeroed) is only one
> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
> it breaks the alignment of rtextent and the definition of "extsize is used
> to specify the size of the blocks in the real-time section of the
> filesystem", is it fine?

A written -> unwritten extent conversion doesn't change which physical
space extent is mapped to the file data extent; it merely marks the
mapping as unwritten.

For example, if you start with this mapping:

{startoff = 8, startblock 256, blockcount = 8, state = written}

and then convert blocks 13-15 to unwritten, you get:

{startoff = 8, startblock 256, blockcount = 5, state = written}
{startoff = 13, startblock 261, blockcount = 3, state = unwritten}

File blocks 8-15 still map to physical space 256-263.

In xfs, the entire allocation unit is /always/ mapped to the file, even
if parts of it have to be unwritten.  Hole punching on rt, for example,
converts the punched region to unwritten.  This is (iirc) the key
difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
(or want) the implied cluster allocation code that ext4 has.

I can't tell if there's something that you see that I don't see such
that we really /do/ need to actually write zeroes to the entire tail of
the rtextent; or if you weren't sure that forcing all the post-eof
fsblocks in the rtextent to unwritten (and zapping the pagecache) would
actually preserve the rtextsize alignment.

(Or if there's something else?)

>                          And IIUC, the upcoming xfs force alignment
> extent feature seems also need to follow this alignment, right?

Yes.

> > 
> >> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> >> and make sure the entire zeroed range have been flushed to disk before
> >> updating the inode size.
> >>
> >> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> >> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> >> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> >> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >> ---
> >>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
> >>  fs/xfs/xfs_iops.c  | 10 ----------
> >>  2 files changed, 31 insertions(+), 14 deletions(-)
> >>
> >> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >> index 4958cc3337bc..fc379450fe74 100644
> >> --- a/fs/xfs/xfs_iomap.c
> >> +++ b/fs/xfs/xfs_iomap.c
> >> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
> >>  	loff_t			pos,
> >>  	bool			*did_zero)
> >>  {
> >> +	struct xfs_mount	*mp = ip->i_mount;
> >>  	struct inode		*inode = VFS_I(ip);
> >>  	unsigned int		blocksize = i_blocksize(inode);
> >> +	int			error;
> >> +
> >> +	if (XFS_IS_REALTIME_INODE(ip))
> >> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> > 
> > Don't opencode xfs_inode_alloc_unitsize, please.
> 
> Ha, I missed the latest added helper, thanks for pointing this out.
> 
> > 
> >> +
> >> +	/*
> >> +	 * iomap won't detect a dirty page over an unwritten block (or a
> >> +	 * cow block over a hole) and subsequently skips zeroing the
> >> +	 * newly post-EOF portion of the page. Flush the new EOF to
> >> +	 * convert the block before the pagecache truncate.
> >> +	 */
> >> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> >> +					     roundup_64(pos, blocksize));
> >> +	if (error)
> >> +		return error;pos_in_block
> > 
> > Ok so this is hoisting the filemap_write_and_wait_range call from
> > xfs_setattr_size.  It's curious that we need to need to twiddle anything
> > other than the EOF block itself though?
> 
> Since we planed to zero out the dirtied range which is ailgned to the
> extsize instead of the blocksize, ensure one block is not unwritten is
> not enough, we should also make sure that the range which is going to
> zero out is not unwritten, or else the iomap_zero_iter() will skip
> zeroing out the extra blocks.
> 
> For example:
> 
> before zeroing:
>            |<-    extszie   ->|
>         ...dddddddddddddddddddd
>         ...UUUUUUUUUUUUUUUUUUUU
>            ^                  ^
>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
> 
> if we only flush the new EOF block, the result becomes:
> 
>            |<-    extszie   ->|
>            zddddddddddddddddddd
>            ZUUUUUUUUUUUUUUUUUUU
>            ^                  ^
>         new EOF             old EOF
> 
> 
> then the dirty extent range that between new EOF block and the old EOF
> block can't be zeroed sine it's still unwritten. So we have to flush the
> whole range before zeroing out.

"Z" on the second line of the second diagram is a written fsblock with
the tail zeroed, correct?

truncate_setsize -> truncate_pagecache unmaps all the pagecache after
the eof folio and unconditionally zeroes the tail of the eof folio
without regard to the mappings.  Doesn't that cover us here?  After the
truncate_setsize finishes, won't we end up in this state:

           |<-   rextsize   ->|
           zzzzzzzz               
           ZUUUUUUUUUUUUUUUUUUU
           ^      ^           ^
        new EOF   |         old EOF
                  folio boundary

> > 
> >>  
> >>  	if (IS_DAX(inode))
> >> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> >> -					&xfs_dax_write_iomap_ops);
> >> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> >> -				   &xfs_buffered_write_iomap_ops);
> >> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> >> +					  &xfs_dax_write_iomap_ops);
> >> +	else
> >> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> >> +					    &xfs_buffered_write_iomap_ops);
> >> +	if (error)
> >> +		return error;
> >> +
> >> +	/*
> >> +	 * Write back path won't write dirty blocks post EOF folio,
> >> +	 * flush the entire zeroed range before updating the inode
> >> +	 * size.
> >> +	 */
> >> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> >> +					    roundup_64(pos, blocksize));
> > 
> > ...but what is the purpose of the second filemap_write_and_wait_range
> > call?  Is that to flush the bytes between new and old EOF to disk before
> > truncate_setsize invalidates the (zeroed) pagecache?
> > 
> 
> The second filemap_write_and_wait_range() call is used to make sure that
> the zeroed data be flushed to disk before we updating i_size. If we don't
> add this one, once the i_size is been changed, the zeroed data which
> beyond the new EOF folio(block) couldn't be write back, because
> iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
> the stale data problem is still there.
> 
> For example:
> 
> before zeroing:
>            |<-    extszie   ->|
>            wwwwwwwwwwwwwwwwwwww (pagecache)
>         ...WWWWWWWWWWWWWWWWWWWW (disk)
>            ^                  ^
>         new EOF               EOF   (where 'w' means the pagecache contains data)
> 
> then iomap_truncate_page() zeroing out the pagecache:
> 
>            |<-    extszie   ->|
>            zzzzzzzzzzzzzzzzzzzz (pagecache)
>            WWWWWWWWWWWWWWWWWWWW (disk)
>            ^                  ^
>         new EOF               EOF
> 
> then update i_size, sync and drop cache:
> 
>            |<-    extszie   ->|
>            ZWWWWWWWWWWWWWWWWWWW (disk)
>            ^
>            EOF

<nod> Ok, so this second call to filemap_write_and_wait_range flushes
the newly written pagecache to disk.  If it doesn't work to
force-convert the tail fsblocks of the rtextent to unwritten status,
then I suppose this is necessary if @blocksize != mp->m_sb.blocksize.

--D

> Thanks,
> Yi.
> 
> > 
> >>  }
> >> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> >> index 66f8c47642e8..baeeddf4a6bb 100644
> >> --- a/fs/xfs/xfs_iops.c
> >> +++ b/fs/xfs/xfs_iops.c
> >> @@ -845,16 +845,6 @@ xfs_setattr_size(
> >>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> >>  				&did_zeroing);
> >>  	} else {
> >> -		/*
> >> -		 * iomap won't detect a dirty page over an unwritten block (or a
> >> -		 * cow block over a hole) and subsequently skips zeroing the
> >> -		 * newly post-EOF portion of the page. Flush the new EOF to
> >> -		 * convert the block before the pagecache truncate.
> >> -		 */
> >> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> >> -						     newsize);
> >> -		if (error)
> >> -			return error;
> >>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> >>  	}
> >>  
> >> -- 
> >> 2.39.2
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-18 19:26       ` Darrick J. Wong
@ 2024-05-20  6:56         ` Zhang Yi
  2024-05-20  7:11           ` Zhang Yi
  2024-05-20 18:37           ` Darrick J. Wong
  0 siblings, 2 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-20  6:56 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/19 3:26, Darrick J. Wong wrote:
> On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
>> On 2024/5/18 1:59, Darrick J. Wong wrote:
>>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> When truncating a realtime file unaligned to a shorter size,
>>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>>>> xfs_truncate_page() also only zeros the EOF block. This could expose
>>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>>>> a write operation").
>>>>
>>>> If the sb_rextsize is bigger than one block, and we have a realtime
>>>> inode that contains a long enough written extent. If we unaligned
>>>> truncate into the middle of this extent, xfs_itruncate_extents() could
>>>> split the extent and align the it's tail to sb_rextsize, there maybe
>>>> have more than one blocks more between the end of the file. Since
>>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>>>> value, so it may leftover some blocks contains stale data that could be
>>>> exposed if we append write it over a long enough distance later.
> 
> Hum.  Is this an appending write into the next rtextent?  For example,
> if you start with a file like this:
> 
> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>                     ^ old EOF
> 
> Then truncate it improperly like this:
> 
> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>      ^ new EOF               
> 
> Then do an extending write like this:
> 
> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>      ^ EOF                    ^ next rtx        ^ append here
> 
> And now the problem is that we've exposed stale data that should be
> zeroes?
> 
> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>       ^^^^^^^^^^^^^^^                             ^ new EOF
>       should be zeroed
> 

Yeah.

>>>
>>> IOWs, any time we truncate down, we need to zero every byte from the new
>>> EOF all the way to the end of the allocation unit, correct?
>>
>> Yeah.
>>
>>>
>>> Maybe pictures would be easier to reason with.  Say you have
>>> rextsize=30 and a partially written rtextent; each 'W' is a written
>>> fsblock and 'u' is an unwritten fsblock:
>>>
>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>                     ^ old EOF
>>>
>>> Now you want to truncate down:
>>>
>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>      ^ new EOF      ^ old EOF
>>>
>>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
>>> so the truncate leaves the file in this state:
>>>
>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>      ^ new EOF      ^ old EOF
>>>
>>> (where 'z' is a written block with zeroes after EOF)
>>>
>>> This is bad because the "W"s between the new and old EOF still contain
>>> old credit card info or whatever.  Now if we mmap the file or whatever,
>>> we can access those old contents.
>>>
>>> So your new patch amends iomap_truncate_page so that it'll zero all the
>>> way to the end of the @blocksize parameter.  That fixes the exposure by 
>>> writing zeroes to the pagecache before we truncate down:
>>>
>>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>>>      ^ new EOF      ^ old EOF
>>>
>>> Is that correct?
>>>
>>
>> Yes, it's correct. However, not only write zeros to the pagecache, but
>> also flush to disk, please see below for details.
> 
> <nod> iomap_truncate_page writes zeroes to any part of the pagecache
> backed by written extents, and then xfs must call
> filemap_write_and_wait_range to write the dirty (zeroed) cache out to
> disk.
> 
>>> If so, then why don't we make xfs_truncate_page convert the post-eof
>>> rtextent blocks back to unwritten status:
>>>
>>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>>>      ^ new EOF      ^ old EOF
>>>
>>> If we can do that, then do we need the changes to iomap_truncate_page?
>>> Converting the mapping should be much faster than dirtying potentially
>>> a lot of data (rt extents can be 1GB in size).
>>
>> Now that the exposed stale data range (should be zeroed) is only one
>> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
>> it breaks the alignment of rtextent and the definition of "extsize is used
>> to specify the size of the blocks in the real-time section of the
>> filesystem", is it fine?
> 
> A written -> unwritten extent conversion doesn't change which physical
> space extent is mapped to the file data extent; it merely marks the
> mapping as unwritten.
> 
> For example, if you start with this mapping:
> 
> {startoff = 8, startblock 256, blockcount = 8, state = written}
> 
> and then convert blocks 13-15 to unwritten, you get:
> 
> {startoff = 8, startblock 256, blockcount = 5, state = written}
> {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
> 
> File blocks 8-15 still map to physical space 256-263.

Yeah, indeed.

> 
> In xfs, the entire allocation unit is /always/ mapped to the file, even
> if parts of it have to be unwritten.  Hole punching on rt, for example,
> converts the punched region to unwritten.  This is (iirc) the key
> difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
> (or want) the implied cluster allocation code that ext4 has.
> 

I checked the xfs_file_fallocate(), it looks like hole punching on realtime
inode is still follow the rtextsize alignment, i.e. if we punch hole on a
file that only contains one written extent, it doesn't split it and convet
the punched range to unwritten. Please take a look at
xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
and zeroing out the whole unligned range in one reextsize unit, and
FALLOC_FL_ZERO_RANGE is the same.

 836	/* We can only free complete realtime extents. */
 837	if (xfs_inode_has_bigrtalloc(ip)) {
 838		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
 839 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
 840	}
...
 864	error = xfs_zero_range(ip, offset, len, NULL);

And I tested it on my machine, it's true that it doesn't do the convertion.

  # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
  # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
  # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
  # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
  # umount /mnt/scratch

  # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
   u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
   0:[0,0,7,0]

Am I missed something?

> I can't tell if there's something that you see that I don't see such
> that we really /do/ need to actually write zeroes to the entire tail of
> the rtextent; or if you weren't sure that forcing all the post-eof
> fsblocks in the rtextent to unwritten (and zapping the pagecache) would
> actually preserve the rtextsize alignment.

I haven't found any restrictions yet, and I also noticed that a simple
write is not guaranteed to align the extent to rtextsize, since the write
back path doesn't zeroing out the extra blocks that align to the
rtextsize.

  # #extsize=28k
  # xfs_io -d -f -c  "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
  # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
     u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
     0:[0,0,1,0]
     1:[1,1,6,1]

So I guess convert the tail fsblocks of the rtextent to unwritten status
could work. However, I'm a little confused, besides the write operation,
other operations like above punch hold and zero range, they seem to be
doing their best to follow the alignment rule since commit fe341eb151ec
("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
to rt extent size") [1], it looks like this commit is to fix some issues,
so I'm not sure that converting to unwritten would always preserve the
rtextsize alignment.

[1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/

> 
> (Or if there's something else?)
> 
>>                          And IIUC, the upcoming xfs force alignment
>> extent feature seems also need to follow this alignment, right?
> 
> Yes.
> 
>>>
>>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>>>> and make sure the entire zeroed range have been flushed to disk before
>>>> updating the inode size.
>>>>
>>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>>>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>> ---
>>>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>>>  fs/xfs/xfs_iops.c  | 10 ----------
>>>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>>>
>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>> index 4958cc3337bc..fc379450fe74 100644
>>>> --- a/fs/xfs/xfs_iomap.c
>>>> +++ b/fs/xfs/xfs_iomap.c
>>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>>>  	loff_t			pos,
>>>>  	bool			*did_zero)
>>>>  {
>>>> +	struct xfs_mount	*mp = ip->i_mount;
>>>>  	struct inode		*inode = VFS_I(ip);
>>>>  	unsigned int		blocksize = i_blocksize(inode);
>>>> +	int			error;
>>>> +
>>>> +	if (XFS_IS_REALTIME_INODE(ip))
>>>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>>>
>>> Don't opencode xfs_inode_alloc_unitsize, please.
>>
>> Ha, I missed the latest added helper, thanks for pointing this out.
>>
>>>
>>>> +
>>>> +	/*
>>>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>>>> +	 * cow block over a hole) and subsequently skips zeroing the
>>>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>>>> +	 * convert the block before the pagecache truncate.
>>>> +	 */
>>>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>>>> +					     roundup_64(pos, blocksize));
>>>> +	if (error)
>>>> +		return error;pos_in_block
>>>
>>> Ok so this is hoisting the filemap_write_and_wait_range call from
>>> xfs_setattr_size.  It's curious that we need to need to twiddle anything
>>> other than the EOF block itself though?
>>
>> Since we planed to zero out the dirtied range which is ailgned to the
>> extsize instead of the blocksize, ensure one block is not unwritten is
>> not enough, we should also make sure that the range which is going to
>> zero out is not unwritten, or else the iomap_zero_iter() will skip
>> zeroing out the extra blocks.
>>
>> For example:
>>
>> before zeroing:
>>            |<-    extszie   ->|
>>         ...dddddddddddddddddddd
>>         ...UUUUUUUUUUUUUUUUUUUU
>>            ^                  ^
>>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
>>
>> if we only flush the new EOF block, the result becomes:
>>
>>            |<-    extszie   ->|
>>            zddddddddddddddddddd
>>            ZUUUUUUUUUUUUUUUUUUU
>>            ^                  ^
>>         new EOF             old EOF
>>
>>
>> then the dirty extent range that between new EOF block and the old EOF
>> block can't be zeroed sine it's still unwritten. So we have to flush the
>> whole range before zeroing out.
> 
> "Z" on the second line of the second diagram is a written fsblock with
> the tail zeroed, correct?

Yeah,

> 
> truncate_setsize -> truncate_pagecache unmaps all the pagecache after
> the eof folio and unconditionally zeroes the tail of the eof folio
> without regard to the mappings.  Doesn't that cover us here?  After the
> truncate_setsize finishes, won't we end up in this state:
> 
>            |<-   rextsize   ->|
>            zzzzzzzz               
>            ZUUUUUUUUUUUUUUUUUUU
>            ^      ^           ^
>         new EOF   |         old EOF
>                   folio boundary
> 

Yeah, this case is fine, but the below case is not fine.

truncate                          write back
xfs_setattr_size()
 xfs_truncate_page()
  filemap_write_and_wait_range(newsize, newsize) <- A
  iomap_zero_range() <- B
                                  flush dirty pages <- C
 truncate_setsize() <- D

Please assume if a concurrent write back happenes just before
truncate_setsize(), the state of the file changes as below:

A:
              |<-    extszie   ->|
              wddddddddddddddddddd (pagecache)
              WUUUUUUUUUUUUUUUUUUU (disk)
              ^                  ^
           (new EOF)           old EOF  (where 'd' means the pagecache is dirty)
                                        (where 'x' means the pagecache contianes user data)

B:
              |<-    extszie   ->|
              zddddddddddddddddddd
              ZUUUUUUUUUUUUUUUUUUU
              ^                  ^
           (new EOF)           old EOF  (where 'z' means the pagecache is zero)

C:
              |<-    extszie   ->|
              zwwwwwwwwwwwwwwwwwww
              ZWWWWWWWWWWWWWWWWWWW
              ^                  ^
           (new EOF)           old EOF

D:
              |<-    extszie   ->|
              zzzzzzzzz
              ZWWWWWWWWWWWWWWWWWWW
              ^       ^          ^
            new EOF   |       (old EOF)
                      folio boundary


Thanks,
Yi.

>>>
>>>>  
>>>>  	if (IS_DAX(inode))
>>>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>>>> -					&xfs_dax_write_iomap_ops);
>>>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>>>> -				   &xfs_buffered_write_iomap_ops);
>>>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>>>> +					  &xfs_dax_write_iomap_ops);
>>>> +	else
>>>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>>>> +					    &xfs_buffered_write_iomap_ops);
>>>> +	if (error)
>>>> +		return error;
>>>> +
>>>> +	/*
>>>> +	 * Write back path won't write dirty blocks post EOF folio,
>>>> +	 * flush the entire zeroed range before updating the inode
>>>> +	 * size.
>>>> +	 */
>>>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>>>> +					    roundup_64(pos, blocksize));
>>>
>>> ...but what is the purpose of the second filemap_write_and_wait_range
>>> call?  Is that to flush the bytes between new and old EOF to disk before
>>> truncate_setsize invalidates the (zeroed) pagecache?
>>>
>>
>> The second filemap_write_and_wait_range() call is used to make sure that
>> the zeroed data be flushed to disk before we updating i_size. If we don't
>> add this one, once the i_size is been changed, the zeroed data which
>> beyond the new EOF folio(block) couldn't be write back, because
>> iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
>> the stale data problem is still there.
>>
>> For example:
>>
>> before zeroing:
>>            |<-    extszie   ->|
>>            wwwwwwwwwwwwwwwwwwww (pagecache)
>>         ...WWWWWWWWWWWWWWWWWWWW (disk)
>>            ^                  ^
>>         new EOF               EOF   (where 'w' means the pagecache contains data)
>>
>> then iomap_truncate_page() zeroing out the pagecache:
>>
>>            |<-    extszie   ->|
>>            zzzzzzzzzzzzzzzzzzzz (pagecache)
>>            WWWWWWWWWWWWWWWWWWWW (disk)
>>            ^                  ^
>>         new EOF               EOF
>>
>> then update i_size, sync and drop cache:
>>
>>            |<-    extszie   ->|
>>            ZWWWWWWWWWWWWWWWWWWW (disk)
>>            ^
>>            EOF
> 
> <nod> Ok, so this second call to filemap_write_and_wait_range flushes
> the newly written pagecache to disk.  If it doesn't work to
> force-convert the tail fsblocks of the rtextent to unwritten status,
> then I suppose this is necessary if @blocksize != mp->m_sb.blocksize.
> 
> --D
> 
>> Thanks,
>> Yi.
>>
>>>
>>>>  }
>>>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>>>> index 66f8c47642e8..baeeddf4a6bb 100644
>>>> --- a/fs/xfs/xfs_iops.c
>>>> +++ b/fs/xfs/xfs_iops.c
>>>> @@ -845,16 +845,6 @@ xfs_setattr_size(
>>>>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>>>>  				&did_zeroing);
>>>>  	} else {
>>>> -		/*
>>>> -		 * iomap won't detect a dirty page over an unwritten block (or a
>>>> -		 * cow block over a hole) and subsequently skips zeroing the
>>>> -		 * newly post-EOF portion of the page. Flush the new EOF to
>>>> -		 * convert the block before the pagecache truncate.
>>>> -		 */
>>>> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
>>>> -						     newsize);
>>>> -		if (error)
>>>> -			return error;
>>>>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>>>>  	}
>>>>  
>>>> -- 
>>>> 2.39.2
>>>>
>>>>
>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-20  6:56         ` Zhang Yi
@ 2024-05-20  7:11           ` Zhang Yi
  2024-05-20 18:37           ` Darrick J. Wong
  1 sibling, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-20  7:11 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, chengzhihao1, yukuai3, yi.zhang

On 2024/5/20 14:56, Zhang Yi wrote:
> On 2024/5/19 3:26, Darrick J. Wong wrote:
>> On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
>>> On 2024/5/18 1:59, Darrick J. Wong wrote:
>>>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> When truncating a realtime file unaligned to a shorter size,
>>>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>>>>> xfs_truncate_page() also only zeros the EOF block. This could expose
>>>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>>>>> a write operation").
>>>>>
>>>>> If the sb_rextsize is bigger than one block, and we have a realtime
>>>>> inode that contains a long enough written extent. If we unaligned
>>>>> truncate into the middle of this extent, xfs_itruncate_extents() could
>>>>> split the extent and align the it's tail to sb_rextsize, there maybe
>>>>> have more than one blocks more between the end of the file. Since
>>>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>>>>> value, so it may leftover some blocks contains stale data that could be
>>>>> exposed if we append write it over a long enough distance later.
>>
>> Hum.  Is this an appending write into the next rtextent?  For example,
>> if you start with a file like this:
>>
>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>                     ^ old EOF
>>
>> Then truncate it improperly like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>      ^ new EOF               
>>
>> Then do an extending write like this:
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>      ^ EOF                    ^ next rtx        ^ append here
>>
>> And now the problem is that we've exposed stale data that should be
>> zeroes?
>>
>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>       ^^^^^^^^^^^^^^^                             ^ new EOF
>>       should be zeroed
>>
> 
> Yeah.
> 
>>>>
>>>> IOWs, any time we truncate down, we need to zero every byte from the new
>>>> EOF all the way to the end of the allocation unit, correct?
>>>
>>> Yeah.
>>>
>>>>
>>>> Maybe pictures would be easier to reason with.  Say you have
>>>> rextsize=30 and a partially written rtextent; each 'W' is a written
>>>> fsblock and 'u' is an unwritten fsblock:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>                     ^ old EOF
>>>>
>>>> Now you want to truncate down:
>>>>
>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
>>>> so the truncate leaves the file in this state:
>>>>
>>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> (where 'z' is a written block with zeroes after EOF)
>>>>
>>>> This is bad because the "W"s between the new and old EOF still contain
>>>> old credit card info or whatever.  Now if we mmap the file or whatever,
>>>> we can access those old contents.
>>>>
>>>> So your new patch amends iomap_truncate_page so that it'll zero all the
>>>> way to the end of the @blocksize parameter.  That fixes the exposure by 
>>>> writing zeroes to the pagecache before we truncate down:
>>>>
>>>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> Is that correct?
>>>>
>>>
>>> Yes, it's correct. However, not only write zeros to the pagecache, but
>>> also flush to disk, please see below for details.
>>
>> <nod> iomap_truncate_page writes zeroes to any part of the pagecache
>> backed by written extents, and then xfs must call
>> filemap_write_and_wait_range to write the dirty (zeroed) cache out to
>> disk.
>>
>>>> If so, then why don't we make xfs_truncate_page convert the post-eof
>>>> rtextent blocks back to unwritten status:
>>>>
>>>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>>>>      ^ new EOF      ^ old EOF
>>>>
>>>> If we can do that, then do we need the changes to iomap_truncate_page?
>>>> Converting the mapping should be much faster than dirtying potentially
>>>> a lot of data (rt extents can be 1GB in size).
>>>
>>> Now that the exposed stale data range (should be zeroed) is only one
>>> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
>>> it breaks the alignment of rtextent and the definition of "extsize is used
>>> to specify the size of the blocks in the real-time section of the
>>> filesystem", is it fine?
>>
>> A written -> unwritten extent conversion doesn't change which physical
>> space extent is mapped to the file data extent; it merely marks the
>> mapping as unwritten.
>>
>> For example, if you start with this mapping:
>>
>> {startoff = 8, startblock 256, blockcount = 8, state = written}
>>
>> and then convert blocks 13-15 to unwritten, you get:
>>
>> {startoff = 8, startblock 256, blockcount = 5, state = written}
>> {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
>>
>> File blocks 8-15 still map to physical space 256-263.
> 
> Yeah, indeed.
> 
>>
>> In xfs, the entire allocation unit is /always/ mapped to the file, even
>> if parts of it have to be unwritten.  Hole punching on rt, for example,
>> converts the punched region to unwritten.  This is (iirc) the key
>> difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
>> (or want) the implied cluster allocation code that ext4 has.
>>
> 
> I checked the xfs_file_fallocate(), it looks like hole punching on realtime
> inode is still follow the rtextsize alignment, i.e. if we punch hole on a
> file that only contains one written extent, it doesn't split it and convet
> the punched range to unwritten. Please take a look at
> xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
> and zeroing out the whole unligned range in one reextsize unit, and
> FALLOC_FL_ZERO_RANGE is the same.
> 
>  836	/* We can only free complete realtime extents. */
>  837	if (xfs_inode_has_bigrtalloc(ip)) {
>  838		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>  839 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>  840	}
> ...
>  864	error = xfs_zero_range(ip, offset, len, NULL);
> 
> And I tested it on my machine, it's true that it doesn't do the convertion.
> 
>   # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
>   # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
>   # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
>   # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
>   # umount /mnt/scratch
> 
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>    u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
>    0:[0,0,7,0]
> 
> Am I missed something?
> 
>> I can't tell if there's something that you see that I don't see such
>> that we really /do/ need to actually write zeroes to the entire tail of
>> the rtextent; or if you weren't sure that forcing all the post-eof
>> fsblocks in the rtextent to unwritten (and zapping the pagecache) would
>> actually preserve the rtextsize alignment.
> 
> I haven't found any restrictions yet, and I also noticed that a simple
> write is not guaranteed to align the extent to rtextsize, since the write
> back path doesn't zeroing out the extra blocks that align to the
> rtextsize.
> 
>   # #extsize=28k
>   # xfs_io -d -f -c  "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>      u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
>      0:[0,0,1,0]
>      1:[1,1,6,1]
> 
> So I guess convert the tail fsblocks of the rtextent to unwritten status
> could work. However, I'm a little confused, besides the write operation,
> other operations like above punch hold and zero range, they seem to be
> doing their best to follow the alignment rule since commit fe341eb151ec
> ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
> to rt extent size") [1], it looks like this commit is to fix some issues,
> so I'm not sure that converting to unwritten would always preserve the
> rtextsize alignment.
> 
> [1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/
> 
>>
>> (Or if there's something else?)
>>
>>>                          And IIUC, the upcoming xfs force alignment
>>> extent feature seems also need to follow this alignment, right?
>>
>> Yes.
>>
>>>>
>>>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>>>>> and make sure the entire zeroed range have been flushed to disk before
>>>>> updating the inode size.
>>>>>
>>>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>>>>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>>>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>>>>  fs/xfs/xfs_iops.c  | 10 ----------
>>>>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>>>>
>>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>>> index 4958cc3337bc..fc379450fe74 100644
>>>>> --- a/fs/xfs/xfs_iomap.c
>>>>> +++ b/fs/xfs/xfs_iomap.c
>>>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>>>>  	loff_t			pos,
>>>>>  	bool			*did_zero)
>>>>>  {
>>>>> +	struct xfs_mount	*mp = ip->i_mount;
>>>>>  	struct inode		*inode = VFS_I(ip);
>>>>>  	unsigned int		blocksize = i_blocksize(inode);
>>>>> +	int			error;
>>>>> +
>>>>> +	if (XFS_IS_REALTIME_INODE(ip))
>>>>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>>>>
>>>> Don't opencode xfs_inode_alloc_unitsize, please.
>>>
>>> Ha, I missed the latest added helper, thanks for pointing this out.
>>>
>>>>
>>>>> +
>>>>> +	/*
>>>>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>>>>> +	 * cow block over a hole) and subsequently skips zeroing the
>>>>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>>>>> +	 * convert the block before the pagecache truncate.
>>>>> +	 */
>>>>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>>>>> +					     roundup_64(pos, blocksize));
>>>>> +	if (error)
>>>>> +		return error;pos_in_block
>>>>
>>>> Ok so this is hoisting the filemap_write_and_wait_range call from
>>>> xfs_setattr_size.  It's curious that we need to need to twiddle anything
>>>> other than the EOF block itself though?
>>>
>>> Since we planed to zero out the dirtied range which is ailgned to the
>>> extsize instead of the blocksize, ensure one block is not unwritten is
>>> not enough, we should also make sure that the range which is going to
>>> zero out is not unwritten, or else the iomap_zero_iter() will skip
>>> zeroing out the extra blocks.
>>>
>>> For example:
>>>
>>> before zeroing:
>>>            |<-    extszie   ->|
>>>         ...dddddddddddddddddddd
>>>         ...UUUUUUUUUUUUUUUUUUUU
>>>            ^                  ^
>>>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
>>>
>>> if we only flush the new EOF block, the result becomes:
>>>
>>>            |<-    extszie   ->|
>>>            zddddddddddddddddddd
>>>            ZUUUUUUUUUUUUUUUUUUU
>>>            ^                  ^
>>>         new EOF             old EOF
>>>
>>>
>>> then the dirty extent range that between new EOF block and the old EOF
>>> block can't be zeroed sine it's still unwritten. So we have to flush the
>>> whole range before zeroing out.
>>
>> "Z" on the second line of the second diagram is a written fsblock with
>> the tail zeroed, correct?
> 
> Yeah,
> 
>>
>> truncate_setsize -> truncate_pagecache unmaps all the pagecache after
>> the eof folio and unconditionally zeroes the tail of the eof folio
>> without regard to the mappings.  Doesn't that cover us here?  After the
>> truncate_setsize finishes, won't we end up in this state:
>>
>>            |<-   rextsize   ->|
>>            zzzzzzzz               
>>            ZUUUUUUUUUUUUUUUUUUU
>>            ^      ^           ^
>>         new EOF   |         old EOF
>>                   folio boundary
>>
> 
> Yeah, this case is fine, but the below case is not fine.
> 
> truncate                          write back
> xfs_setattr_size()
>  xfs_truncate_page()
>   filemap_write_and_wait_range(newsize, newsize) <- A
>   iomap_zero_range() <- B
>                                   flush dirty pages <- C
>  truncate_setsize() <- D
> 
> Please assume if a concurrent write back happenes just before
> truncate_setsize(), the state of the file changes as below:
> 
> A:
>               |<-    extszie   ->|
>               wddddddddddddddddddd (pagecache)
>               WUUUUUUUUUUUUUUUUUUU (disk)
>               ^                  ^
>            (new EOF)           old EOF  (where 'd' means the pagecache is dirty)
>                                         (where 'x' means the pagecache contianes user data)
                                                 ^^^
                                                  w
> 
> B:
>               |<-    extszie   ->|
>               zddddddddddddddddddd
>               ZUUUUUUUUUUUUUUUUUUU
>               ^                  ^
>            (new EOF)           old EOF  (where 'z' means the pagecache is zero)
> 
> C:
>               |<-    extszie   ->|
>               zwwwwwwwwwwwwwwwwwww
>               ZWWWWWWWWWWWWWWWWWWW
>               ^                  ^
>            (new EOF)           old EOF
> 
> D:
>               |<-    extszie   ->|
>               zzzzzzzzz
>               ZWWWWWWWWWWWWWWWWWWW
>               ^       ^          ^
>             new EOF   |       (old EOF)
>                       folio boundary
> 
> 
> Thanks,
> Yi.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-20  6:56         ` Zhang Yi
  2024-05-20  7:11           ` Zhang Yi
@ 2024-05-20 18:37           ` Darrick J. Wong
  2024-05-21 13:45             ` Zhang Yi
  1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-20 18:37 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Mon, May 20, 2024 at 02:56:22PM +0800, Zhang Yi wrote:
> On 2024/5/19 3:26, Darrick J. Wong wrote:
> > On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
> >> On 2024/5/18 1:59, Darrick J. Wong wrote:
> >>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> >>>> From: Zhang Yi <yi.zhang@huawei.com>
> >>>>
> >>>> When truncating a realtime file unaligned to a shorter size,
> >>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
> >>>> xfs_truncate_page() also only zeros the EOF block. This could expose
> >>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> >>>> a write operation").
> >>>>
> >>>> If the sb_rextsize is bigger than one block, and we have a realtime
> >>>> inode that contains a long enough written extent. If we unaligned
> >>>> truncate into the middle of this extent, xfs_itruncate_extents() could
> >>>> split the extent and align the it's tail to sb_rextsize, there maybe
> >>>> have more than one blocks more between the end of the file. Since
> >>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> >>>> value, so it may leftover some blocks contains stale data that could be
> >>>> exposed if we append write it over a long enough distance later.
> > 
> > Hum.  Is this an appending write into the next rtextent?  For example,
> > if you start with a file like this:
> > 
> > WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >                     ^ old EOF
> > 
> > Then truncate it improperly like this:
> > 
> > WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
> >      ^ new EOF               
> > 
> > Then do an extending write like this:
> > 
> > WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
> >      ^ EOF                    ^ next rtx        ^ append here
> > 
> > And now the problem is that we've exposed stale data that should be
> > zeroes?
> > 
> > WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
> >       ^^^^^^^^^^^^^^^                             ^ new EOF
> >       should be zeroed
> > 
> 
> Yeah.
> 
> >>>
> >>> IOWs, any time we truncate down, we need to zero every byte from the new
> >>> EOF all the way to the end of the allocation unit, correct?
> >>
> >> Yeah.
> >>
> >>>
> >>> Maybe pictures would be easier to reason with.  Say you have
> >>> rextsize=30 and a partially written rtextent; each 'W' is a written
> >>> fsblock and 'u' is an unwritten fsblock:
> >>>
> >>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >>>                     ^ old EOF
> >>>
> >>> Now you want to truncate down:
> >>>
> >>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
> >>>      ^ new EOF      ^ old EOF
> >>>
> >>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
> >>> so the truncate leaves the file in this state:
> >>>
> >>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
> >>>      ^ new EOF      ^ old EOF
> >>>
> >>> (where 'z' is a written block with zeroes after EOF)
> >>>
> >>> This is bad because the "W"s between the new and old EOF still contain
> >>> old credit card info or whatever.  Now if we mmap the file or whatever,
> >>> we can access those old contents.
> >>>
> >>> So your new patch amends iomap_truncate_page so that it'll zero all the
> >>> way to the end of the @blocksize parameter.  That fixes the exposure by 
> >>> writing zeroes to the pagecache before we truncate down:
> >>>
> >>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
> >>>      ^ new EOF      ^ old EOF
> >>>
> >>> Is that correct?
> >>>
> >>
> >> Yes, it's correct. However, not only write zeros to the pagecache, but
> >> also flush to disk, please see below for details.
> > 
> > <nod> iomap_truncate_page writes zeroes to any part of the pagecache
> > backed by written extents, and then xfs must call
> > filemap_write_and_wait_range to write the dirty (zeroed) cache out to
> > disk.
> > 
> >>> If so, then why don't we make xfs_truncate_page convert the post-eof
> >>> rtextent blocks back to unwritten status:
> >>>
> >>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
> >>>      ^ new EOF      ^ old EOF
> >>>
> >>> If we can do that, then do we need the changes to iomap_truncate_page?
> >>> Converting the mapping should be much faster than dirtying potentially
> >>> a lot of data (rt extents can be 1GB in size).
> >>
> >> Now that the exposed stale data range (should be zeroed) is only one
> >> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
> >> it breaks the alignment of rtextent and the definition of "extsize is used
> >> to specify the size of the blocks in the real-time section of the
> >> filesystem", is it fine?
> > 
> > A written -> unwritten extent conversion doesn't change which physical
> > space extent is mapped to the file data extent; it merely marks the
> > mapping as unwritten.
> > 
> > For example, if you start with this mapping:
> > 
> > {startoff = 8, startblock 256, blockcount = 8, state = written}
> > 
> > and then convert blocks 13-15 to unwritten, you get:
> > 
> > {startoff = 8, startblock 256, blockcount = 5, state = written}
> > {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
> > 
> > File blocks 8-15 still map to physical space 256-263.
> 
> Yeah, indeed.
> 
> > 
> > In xfs, the entire allocation unit is /always/ mapped to the file, even
> > if parts of it have to be unwritten.  Hole punching on rt, for example,
> > converts the punched region to unwritten.  This is (iirc) the key
> > difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
> > (or want) the implied cluster allocation code that ext4 has.
> > 
> 
> I checked the xfs_file_fallocate(), it looks like hole punching on realtime
> inode is still follow the rtextsize alignment, i.e. if we punch hole on a
> file that only contains one written extent, it doesn't split it and convet
> the punched range to unwritten. Please take a look at
> xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
> and zeroing out the whole unligned range in one reextsize unit, and
> FALLOC_FL_ZERO_RANGE is the same.
> 
>  836	/* We can only free complete realtime extents. */
>  837	if (xfs_inode_has_bigrtalloc(ip)) {
>  838		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>  839 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>  840	}
> ...
>  864	error = xfs_zero_range(ip, offset, len, NULL);
> 
> And I tested it on my machine, it's true that it doesn't do the convertion.
> 
>   # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
>   # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
>   # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
>   # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
>   # umount /mnt/scratch
> 
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>    u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
>    0:[0,0,7,0]
> 
> Am I missed something?

I think fpunch is broken here -- xfs definitely supports having
unwritten mappings in the middle of an allocation unit.  See below.

> > I can't tell if there's something that you see that I don't see such
> > that we really /do/ need to actually write zeroes to the entire tail of
> > the rtextent; or if you weren't sure that forcing all the post-eof
> > fsblocks in the rtextent to unwritten (and zapping the pagecache) would
> > actually preserve the rtextsize alignment.
> 
> I haven't found any restrictions yet, and I also noticed that a simple
> write is not guaranteed to align the extent to rtextsize, since the write
> back path doesn't zeroing out the extra blocks that align to the
> rtextsize.
> 
>   # #extsize=28k
>   # xfs_io -d -f -c  "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>      u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
>      0:[0,0,1,0]
>      1:[1,1,6,1]
> 
> So I guess convert the tail fsblocks of the rtextent to unwritten status
> could work. However, I'm a little confused, besides the write operation,
> other operations like above punch hold and zero range, they seem to be
> doing their best to follow the alignment rule since commit fe341eb151ec
> ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
> to rt extent size") [1], it looks like this commit is to fix some issues,
> so I'm not sure that converting to unwritten would always preserve the
> rtextsize alignment.
> 
> [1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/

Looking at commit fe341eb151ec0 ("xfs: ensure that fpunch, fcollapse,
and finsert operations are aligned to rt extent size"), I think the
logic in xfs_free_file_space is wrong.  If bunmapi is told to unmap a
partial rt extent, it will instead convert it to unwritten; it only
unmaps full rt extents.  For punch that's ok because punch
opportunistically removes blocks and zeroes the unaligned segments; for
zero that's ok because it will falloc the sparse holes and zero the
rest.

It's only collapse and insert that *require* alignment.  That's
something that should be checked during input validation, and I think
that got fixed by 25219dbfa734e ("xfs: fix fallocate functions when
rtextsize is larger than 1").

> > 
> > (Or if there's something else?)
> > 
> >>                          And IIUC, the upcoming xfs force alignment
> >> extent feature seems also need to follow this alignment, right?
> > 
> > Yes.
> > 
> >>>
> >>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> >>>> and make sure the entire zeroed range have been flushed to disk before
> >>>> updating the inode size.
> >>>>
> >>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> >>>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> >>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> >>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> >>>> ---
> >>>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
> >>>>  fs/xfs/xfs_iops.c  | 10 ----------
> >>>>  2 files changed, 31 insertions(+), 14 deletions(-)
> >>>>
> >>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> >>>> index 4958cc3337bc..fc379450fe74 100644
> >>>> --- a/fs/xfs/xfs_iomap.c
> >>>> +++ b/fs/xfs/xfs_iomap.c
> >>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
> >>>>  	loff_t			pos,
> >>>>  	bool			*did_zero)
> >>>>  {
> >>>> +	struct xfs_mount	*mp = ip->i_mount;
> >>>>  	struct inode		*inode = VFS_I(ip);
> >>>>  	unsigned int		blocksize = i_blocksize(inode);
> >>>> +	int			error;
> >>>> +
> >>>> +	if (XFS_IS_REALTIME_INODE(ip))
> >>>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> >>>
> >>> Don't opencode xfs_inode_alloc_unitsize, please.
> >>
> >> Ha, I missed the latest added helper, thanks for pointing this out.
> >>
> >>>
> >>>> +
> >>>> +	/*
> >>>> +	 * iomap won't detect a dirty page over an unwritten block (or a
> >>>> +	 * cow block over a hole) and subsequently skips zeroing the
> >>>> +	 * newly post-EOF portion of the page. Flush the new EOF to
> >>>> +	 * convert the block before the pagecache truncate.
> >>>> +	 */
> >>>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> >>>> +					     roundup_64(pos, blocksize));
> >>>> +	if (error)
> >>>> +		return error;pos_in_block
> >>>
> >>> Ok so this is hoisting the filemap_write_and_wait_range call from
> >>> xfs_setattr_size.  It's curious that we need to need to twiddle anything
> >>> other than the EOF block itself though?
> >>
> >> Since we planed to zero out the dirtied range which is ailgned to the
> >> extsize instead of the blocksize, ensure one block is not unwritten is
> >> not enough, we should also make sure that the range which is going to
> >> zero out is not unwritten, or else the iomap_zero_iter() will skip
> >> zeroing out the extra blocks.
> >>
> >> For example:
> >>
> >> before zeroing:
> >>            |<-    extszie   ->|
> >>         ...dddddddddddddddddddd
> >>         ...UUUUUUUUUUUUUUUUUUUU
> >>            ^                  ^
> >>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
> >>
> >> if we only flush the new EOF block, the result becomes:
> >>
> >>            |<-    extszie   ->|
> >>            zddddddddddddddddddd
> >>            ZUUUUUUUUUUUUUUUUUUU
> >>            ^                  ^
> >>         new EOF             old EOF
> >>
> >>
> >> then the dirty extent range that between new EOF block and the old EOF
> >> block can't be zeroed sine it's still unwritten. So we have to flush the
> >> whole range before zeroing out.
> > 
> > "Z" on the second line of the second diagram is a written fsblock with
> > the tail zeroed, correct?
> 
> Yeah,
> 
> > 
> > truncate_setsize -> truncate_pagecache unmaps all the pagecache after
> > the eof folio and unconditionally zeroes the tail of the eof folio
> > without regard to the mappings.  Doesn't that cover us here?  After the
> > truncate_setsize finishes, won't we end up in this state:
> > 
> >            |<-   rextsize   ->|
> >            zzzzzzzz               
> >            ZUUUUUUUUUUUUUUUUUUU
> >            ^      ^           ^
> >         new EOF   |         old EOF
> >                   folio boundary
> > 
> 
> Yeah, this case is fine, but the below case is not fine.
> 
> truncate                          write back
> xfs_setattr_size()
>  xfs_truncate_page()
>   filemap_write_and_wait_range(newsize, newsize) <- A
>   iomap_zero_range() <- B
>                                   flush dirty pages <- C
>  truncate_setsize() <- D
> 
> Please assume if a concurrent write back happenes just before
> truncate_setsize(), the state of the file changes as below:
> 
> A:
>               |<-    extszie   ->|
>               wddddddddddddddddddd (pagecache)
>               WUUUUUUUUUUUUUUUUUUU (disk)
>               ^                  ^
>            (new EOF)           old EOF  (where 'd' means the pagecache is dirty)
>                                         (where 'x' means the pagecache contianes user data)

"W", not "x", as you noted.

> 
> B:
>               |<-    extszie   ->|
>               zddddddddddddddddddd
>               ZUUUUUUUUUUUUUUUUUUU
>               ^                  ^
>            (new EOF)           old EOF  (where 'z' means the pagecache is zero)
> 
> C:
>               |<-    extszie   ->|
>               zwwwwwwwwwwwwwwwwwww
>               ZWWWWWWWWWWWWWWWWWWW
>               ^                  ^
>            (new EOF)           old EOF
> 
> D:
>               |<-    extszie   ->|
>               zzzzzzzzz
>               ZWWWWWWWWWWWWWWWWWWW
>               ^       ^          ^
>             new EOF   |       (old EOF)
>                       folio boundary

Hmm.  At point D we still hold i_rwsem and the invalidate_lock, so could
we convert the underlying blocks to unwritten here instead of writing
them all out?  Once we reduce i_size, writeback will stop at EOF, right?

D:
              |<-    extszie   ->|
              zzzzzzzzz
              ZUUUUUUUUUUUUUUUUUUU
              ^       ^          ^
            new EOF   |       (old EOF)
                      folio boundary

--D

> 
> Thanks,
> Yi.
> 
> >>>
> >>>>  
> >>>>  	if (IS_DAX(inode))
> >>>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> >>>> -					&xfs_dax_write_iomap_ops);
> >>>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> >>>> -				   &xfs_buffered_write_iomap_ops);
> >>>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> >>>> +					  &xfs_dax_write_iomap_ops);
> >>>> +	else
> >>>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> >>>> +					    &xfs_buffered_write_iomap_ops);
> >>>> +	if (error)
> >>>> +		return error;
> >>>> +
> >>>> +	/*
> >>>> +	 * Write back path won't write dirty blocks post EOF folio,
> >>>> +	 * flush the entire zeroed range before updating the inode
> >>>> +	 * size.
> >>>> +	 */
> >>>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> >>>> +					    roundup_64(pos, blocksize));
> >>>
> >>> ...but what is the purpose of the second filemap_write_and_wait_range
> >>> call?  Is that to flush the bytes between new and old EOF to disk before
> >>> truncate_setsize invalidates the (zeroed) pagecache?
> >>>
> >>
> >> The second filemap_write_and_wait_range() call is used to make sure that
> >> the zeroed data be flushed to disk before we updating i_size. If we don't
> >> add this one, once the i_size is been changed, the zeroed data which
> >> beyond the new EOF folio(block) couldn't be write back, because
> >> iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
> >> the stale data problem is still there.
> >>
> >> For example:
> >>
> >> before zeroing:
> >>            |<-    extszie   ->|
> >>            wwwwwwwwwwwwwwwwwwww (pagecache)
> >>         ...WWWWWWWWWWWWWWWWWWWW (disk)
> >>            ^                  ^
> >>         new EOF               EOF   (where 'w' means the pagecache contains data)
> >>
> >> then iomap_truncate_page() zeroing out the pagecache:
> >>
> >>            |<-    extszie   ->|
> >>            zzzzzzzzzzzzzzzzzzzz (pagecache)
> >>            WWWWWWWWWWWWWWWWWWWW (disk)
> >>            ^                  ^
> >>         new EOF               EOF
> >>
> >> then update i_size, sync and drop cache:
> >>
> >>            |<-    extszie   ->|
> >>            ZWWWWWWWWWWWWWWWWWWW (disk)
> >>            ^
> >>            EOF
> > 
> > <nod> Ok, so this second call to filemap_write_and_wait_range flushes
> > the newly written pagecache to disk.  If it doesn't work to
> > force-convert the tail fsblocks of the rtextent to unwritten status,
> > then I suppose this is necessary if @blocksize != mp->m_sb.blocksize.
> > 
> > --D
> > 
> >> Thanks,
> >> Yi.
> >>
> >>>
> >>>>  }
> >>>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> >>>> index 66f8c47642e8..baeeddf4a6bb 100644
> >>>> --- a/fs/xfs/xfs_iops.c
> >>>> +++ b/fs/xfs/xfs_iops.c
> >>>> @@ -845,16 +845,6 @@ xfs_setattr_size(
> >>>>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> >>>>  				&did_zeroing);
> >>>>  	} else {
> >>>> -		/*
> >>>> -		 * iomap won't detect a dirty page over an unwritten block (or a
> >>>> -		 * cow block over a hole) and subsequently skips zeroing the
> >>>> -		 * newly post-EOF portion of the page. Flush the new EOF to
> >>>> -		 * convert the block before the pagecache truncate.
> >>>> -		 */
> >>>> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> >>>> -						     newsize);
> >>>> -		if (error)
> >>>> -			return error;
> >>>>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> >>>>  	}
> >>>>  
> >>>> -- 
> >>>> 2.39.2
> >>>>
> >>>>
> >>
> >>
> 
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
  2024-05-17 17:59   ` Darrick J. Wong
@ 2024-05-21  2:38   ` Dave Chinner
  2024-05-22  1:57     ` Zhang Yi
  2024-05-22  3:00     ` Darrick J. Wong
  1 sibling, 2 replies; 19+ messages in thread
From: Dave Chinner @ 2024-05-21  2:38 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, djwong, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating a realtime file unaligned to a shorter size,
> xfs_setattr_size() only flush the EOF page before zeroing out, and
> xfs_truncate_page() also only zeros the EOF block. This could expose
> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> a write operation").
> 
> If the sb_rextsize is bigger than one block, and we have a realtime
> inode that contains a long enough written extent. If we unaligned
> truncate into the middle of this extent, xfs_itruncate_extents() could
> split the extent and align the it's tail to sb_rextsize, there maybe
> have more than one blocks more between the end of the file. Since
> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> value, so it may leftover some blocks contains stale data that could be
> exposed if we append write it over a long enough distance later.
> 
> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> and make sure the entire zeroed range have been flushed to disk before
> updating the inode size.
> 
> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>  fs/xfs/xfs_iops.c  | 10 ----------
>  2 files changed, 31 insertions(+), 14 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 4958cc3337bc..fc379450fe74 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>  	loff_t			pos,
>  	bool			*did_zero)
>  {
> +	struct xfs_mount	*mp = ip->i_mount;
>  	struct inode		*inode = VFS_I(ip);
>  	unsigned int		blocksize = i_blocksize(inode);
> +	int			error;
> +
> +	if (XFS_IS_REALTIME_INODE(ip))
> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> +
> +	/*
> +	 * iomap won't detect a dirty page over an unwritten block (or a
> +	 * cow block over a hole) and subsequently skips zeroing the
> +	 * newly post-EOF portion of the page. Flush the new EOF to
> +	 * convert the block before the pagecache truncate.
> +	 */
> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> +					     roundup_64(pos, blocksize));
> +	if (error)
> +		return error;
>  
>  	if (IS_DAX(inode))
> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> -					&xfs_dax_write_iomap_ops);
> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> -				   &xfs_buffered_write_iomap_ops);
> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> +					  &xfs_dax_write_iomap_ops);
> +	else
> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> +					    &xfs_buffered_write_iomap_ops);
> +	if (error)
> +		return error;
> +
> +	/*
> +	 * Write back path won't write dirty blocks post EOF folio,
> +	 * flush the entire zeroed range before updating the inode
> +	 * size.
> +	 */
> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> +					    roundup_64(pos, blocksize));
>  }

Ok, this means we do -three- blocking writebacks through this path
instead of one or maybe two.

We already know that this existing blocking writeback case for dirty
pages over unwritten extents is a significant performance issue for
some workloads. I have a fix in progress for iomap to handle this
case without requiring blocking writeback to be done to convert the
extent to written before we do the truncate.

Regardless, I think this whole "truncate is allocation unit size
aware" algorithm is largely unworkable without a rewrite. What XFS
needs to do on truncate *down* before we start the truncate
transaction is pretty simple:

	- ensure that the new EOF extent tail contains zeroes
	- ensure that the range from the existing ip->i_disk_size to
	  the new EOF is on disk so data vs metadata ordering is
	  correct for crash recovery purposes.

What this patch does to acheive that is:

	1. blocking writeback to clean dirty unwritten/cow blocks at
	the new EOF.
	2. iomap_truncate_page() writes zeroes into the page cache,
	which dirties the pages we just cleaned at the new EOF.
	3. blocking writeback to clean the dirty blocks at the new
	EOF.
	4. truncate_setsize() then writes zeros to partial folios at
	the new EOF, dirtying the EOF page again.
	5. blocking writeback to clean dirty blocks from the current
	on-disk size to the new EOF.

This is pretty crazy when you stop and think about it. We're writing
the same EOF block -three- times. The first data write gets
overwritten by zeroes on the second write, and the third write
writes the same zeroes as the second write. There are two redundant
*blocking* writes in this process.

We can do all this with a single writeback operation if we are a
little bit smarter about the order of operations we perform and we
are a little bit smarter in iomap about zeroing dirty pages in the
page cache:

	1. change iomap_zero_range() to do the right thing with
	dirty unwritten and cow extents (the patch I've been working
	on).

	2. pass the range to be zeroed into iomap_truncate_page()
	(the fundamental change being made here).

	3. zero the required range *through the page cache*
	(iomap_zero_range() already does this).

	4. write back the XFS inode from ip->i_disk_size to the end
	of the range zeroed by iomap_truncate_page()
	(xfs_setattr_size() already does this).

	5. i_size_write(newsize);

	6. invalidate_inode_pages2_range(newsize, -1) to trash all
	the page cache beyond the new EOF without doing any zeroing
	as we've already done all the zeroing needed to the page
	cache through iomap_truncate_page().


The patch I'm working on for step 1 is below. It still needs to be
extended to handle the cow case, but I'm unclear on how to exercise
that case so I haven't written the code to do it. The rest of it is
just rearranging the code that we already use just to get the order
of operations right. The only notable change in behaviour is using
invalidate_inode_pages2_range() instead of truncate_pagecache(),
because we don't want the EOF page to be dirtied again once we've
already written zeroes to disk....

-- 
Dave Chinner
david@fromorbit.com


[RFC] iomap: zeroing needs to be pagecache aware

From: Dave Chinner <dchinner@redhat.com>

Unwritten extents can have page cache data over the range being
zeroed so we can't just skip them entirely. Fix this by checking for
an existing dirty folio over the unwritten range we are zeroing
and only performing zeroing if the folio is already dirty.

XXX: how do we detect a iomap containing a cow mapping over a hole
in iomap_zero_iter()? The XFS code implies this case also needs to
zero the page cache if there is data present, so trigger for page
cache lookup only in iomap_zero_iter() needs to handle this case as
well.

Before:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m14.103s
user    0m0.015s
sys     0m0.020s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.90    0.847616          16     50000           ftruncate
 14.01    0.138229           2     50000           pwrite64
....

After:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m0.144s
user    0m0.021s
sys     0m0.012s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.86    0.505964          10     50000           ftruncate
 46.12    0.433251           8     50000           pwrite64
....

Yup, we get back all the performance.

As for the "mmap write beyond EOF" data exposure aspect
documented here:

https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/

With this command:

$ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
  -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
  -c fsync -c "pread -v 3k 32" /mnt/scratch/foo

Before:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec

After:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)

We see that this post-eof unwritten extent dirty page zeroing is
working correctly.

This has passed through most of fstests on a couple of test VMs
without issues at the moment, so I think this approach to fixing the
issue is going to be solid once we've worked out how to detect the
COW-hole mapping case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iops.c      | 12 +-----------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4e8e41c8b3c0..6877474de0c9 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
+ *
+ * Note: when zeroing unwritten extents, we might have data in the page cache
+ * over an unwritten extent. In this case, we want to do a pure lookup on the
+ * page cache and not create a new folio as we don't need to perform zeroing on
+ * unwritten extents if there is no cached data over the given range.
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
+	if (iter->flags & IOMAP_ZERO) {
+		const struct iomap *srcmap = iomap_iter_srcmap(iter);
+
+		if (srcmap->type == IOMAP_UNWRITTEN)
+			fgp &= ~FGP_CREAT;
+	}
+
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 	fgp |= fgf_set_order(len);
@@ -1375,7 +1387,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
 
 	do {
@@ -1385,8 +1397,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		size_t bytes = min_t(u64, SIZE_MAX, length);
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENOENT) {
+				/*
+				 * Unwritten extents need to have page cache
+				 * lookups done to determine if they have data
+				 * over them that needs zeroing. If there is no
+				 * data, we'll get -ENOENT returned here, so we
+				 * can just skip over this index.
+				 */
+				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1394,6 +1420,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean (i.e. because
+		 * it has been read from), then it already contains zeros. Hence
+		 * we can just skip it.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1401,6 +1438,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(bytes == 0))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8a145ca7d380..e8c9f3018c80 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -838,17 +838,7 @@ xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 

^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-20 18:37           ` Darrick J. Wong
@ 2024-05-21 13:45             ` Zhang Yi
  0 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-21 13:45 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch, brauner,
	david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/21 2:37, Darrick J. Wong wrote:
> On Mon, May 20, 2024 at 02:56:22PM +0800, Zhang Yi wrote:
>> On 2024/5/19 3:26, Darrick J. Wong wrote:
>>> On Sat, May 18, 2024 at 02:35:02PM +0800, Zhang Yi wrote:
>>>> On 2024/5/18 1:59, Darrick J. Wong wrote:
>>>>> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> When truncating a realtime file unaligned to a shorter size,
>>>>>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>>>>>> xfs_truncate_page() also only zeros the EOF block. This could expose
>>>>>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>>>>>> a write operation").
>>>>>>
>>>>>> If the sb_rextsize is bigger than one block, and we have a realtime
>>>>>> inode that contains a long enough written extent. If we unaligned
>>>>>> truncate into the middle of this extent, xfs_itruncate_extents() could
>>>>>> split the extent and align the it's tail to sb_rextsize, there maybe
>>>>>> have more than one blocks more between the end of the file. Since
>>>>>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>>>>>> value, so it may leftover some blocks contains stale data that could be
>>>>>> exposed if we append write it over a long enough distance later.
>>>
>>> Hum.  Is this an appending write into the next rtextent?  For example,
>>> if you start with a file like this:
>>>
>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>                     ^ old EOF
>>>
>>> Then truncate it improperly like this:
>>>
>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>      ^ new EOF               
>>>
>>> Then do an extending write like this:
>>>
>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>>      ^ EOF                    ^ next rtx        ^ append here
>>>
>>> And now the problem is that we've exposed stale data that should be
>>> zeroes?
>>>
>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuuuuuuuuuuuuuuuuuuuuWWWuuuuuuuuu
>>>       ^^^^^^^^^^^^^^^                             ^ new EOF
>>>       should be zeroed
>>>
>>
>> Yeah.
>>
>>>>>
>>>>> IOWs, any time we truncate down, we need to zero every byte from the new
>>>>> EOF all the way to the end of the allocation unit, correct?
>>>>
>>>> Yeah.
>>>>
>>>>>
>>>>> Maybe pictures would be easier to reason with.  Say you have
>>>>> rextsize=30 and a partially written rtextent; each 'W' is a written
>>>>> fsblock and 'u' is an unwritten fsblock:
>>>>>
>>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>>                     ^ old EOF
>>>>>
>>>>> Now you want to truncate down:
>>>>>
>>>>> WWWWWWWWWWWWWWWWWWWWWuuuuuuuuu
>>>>>      ^ new EOF      ^ old EOF
>>>>>
>>>>> Currently, iomap_truncate_blocks only zeroes up to the next i_blocksize,
>>>>> so the truncate leaves the file in this state:
>>>>>
>>>>> WWWWWzWWWWWWWWWWWWWWWuuuuuuuuu
>>>>>      ^ new EOF      ^ old EOF
>>>>>
>>>>> (where 'z' is a written block with zeroes after EOF)
>>>>>
>>>>> This is bad because the "W"s between the new and old EOF still contain
>>>>> old credit card info or whatever.  Now if we mmap the file or whatever,
>>>>> we can access those old contents.
>>>>>
>>>>> So your new patch amends iomap_truncate_page so that it'll zero all the
>>>>> way to the end of the @blocksize parameter.  That fixes the exposure by 
>>>>> writing zeroes to the pagecache before we truncate down:
>>>>>
>>>>> WWWWWzzzzzzzzzzzzzzzzuuuuuuuuu
>>>>>      ^ new EOF      ^ old EOF
>>>>>
>>>>> Is that correct?
>>>>>
>>>>
>>>> Yes, it's correct. However, not only write zeros to the pagecache, but
>>>> also flush to disk, please see below for details.
>>>
>>> <nod> iomap_truncate_page writes zeroes to any part of the pagecache
>>> backed by written extents, and then xfs must call
>>> filemap_write_and_wait_range to write the dirty (zeroed) cache out to
>>> disk.
>>>
>>>>> If so, then why don't we make xfs_truncate_page convert the post-eof
>>>>> rtextent blocks back to unwritten status:
>>>>>
>>>>> WWWWWzuuuuuuuuuuuuuuuuuuuuuuuu
>>>>>      ^ new EOF      ^ old EOF
>>>>>
>>>>> If we can do that, then do we need the changes to iomap_truncate_page?
>>>>> Converting the mapping should be much faster than dirtying potentially
>>>>> a lot of data (rt extents can be 1GB in size).
>>>>
>>>> Now that the exposed stale data range (should be zeroed) is only one
>>>> rtextsize unit, if we convert the post-eof rtextent blocks to unwritten,
>>>> it breaks the alignment of rtextent and the definition of "extsize is used
>>>> to specify the size of the blocks in the real-time section of the
>>>> filesystem", is it fine?
>>>
>>> A written -> unwritten extent conversion doesn't change which physical
>>> space extent is mapped to the file data extent; it merely marks the
>>> mapping as unwritten.
>>>
>>> For example, if you start with this mapping:
>>>
>>> {startoff = 8, startblock 256, blockcount = 8, state = written}
>>>
>>> and then convert blocks 13-15 to unwritten, you get:
>>>
>>> {startoff = 8, startblock 256, blockcount = 5, state = written}
>>> {startoff = 13, startblock 261, blockcount = 3, state = unwritten}
>>>
>>> File blocks 8-15 still map to physical space 256-263.
>>
>> Yeah, indeed.
>>
>>>
>>> In xfs, the entire allocation unit is /always/ mapped to the file, even
>>> if parts of it have to be unwritten.  Hole punching on rt, for example,
>>> converts the punched region to unwritten.  This is (iirc) the key
>>> difference between xfs rt and ext4 bigalloc.  xfs doesn't have or need
>>> (or want) the implied cluster allocation code that ext4 has.
>>>
>>
>> I checked the xfs_file_fallocate(), it looks like hole punching on realtime
>> inode is still follow the rtextsize alignment, i.e. if we punch hole on a
>> file that only contains one written extent, it doesn't split it and convet
>> the punched range to unwritten. Please take a look at
>> xfs_file_fallocate()->xfs_free_file_space(), it aligned the freeing range
>> and zeroing out the whole unligned range in one reextsize unit, and
>> FALLOC_FL_ZERO_RANGE is the same.
>>
>>  836	/* We can only free complete realtime extents. */
>>  837	if (xfs_inode_has_bigrtalloc(ip)) {
>>  838		startoffset_fsb = xfs_rtb_roundup_rtx(mp, startoffset_fsb);
>>  839 		endoffset_fsb = xfs_rtb_rounddown_rtx(mp, endoffset_fsb);
>>  840	}
>> ...
>>  864	error = xfs_zero_range(ip, offset, len, NULL);
>>
>> And I tested it on my machine, it's true that it doesn't do the convertion.
>>
>>   # mkfs.xfs -f -rrtdev=/dev/nvme0n1 -f -m reflink=0,rmapbt=0, -d rtinherit=1 -r extsize=28k /dev/pmem2s
>>   # mount -ortdev=/dev/nvme0n1 /dev/pmem2s /mnt/scratch
>>   # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
>>   # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
>>   # umount /mnt/scratch
>>
>>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>>    u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
>>    0:[0,0,7,0]
>>
>> Am I missed something?
> 
> I think fpunch is broken here -- xfs definitely supports having
> unwritten mappings in the middle of an allocation unit.  See below.
> 
>>> I can't tell if there's something that you see that I don't see such
>>> that we really /do/ need to actually write zeroes to the entire tail of
>>> the rtextent; or if you weren't sure that forcing all the post-eof
>>> fsblocks in the rtextent to unwritten (and zapping the pagecache) would
>>> actually preserve the rtextsize alignment.
>>
>> I haven't found any restrictions yet, and I also noticed that a simple
>> write is not guaranteed to align the extent to rtextsize, since the write
>> back path doesn't zeroing out the extra blocks that align to the
>> rtextsize.
>>
>>   # #extsize=28k
>>   # xfs_io -d -f -c  "pwrite 0 4k" -c "fsync" /mnt/scratch/foo
>>   # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
>>      u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
>>      0:[0,0,1,0]
>>      1:[1,1,6,1]
>>
>> So I guess convert the tail fsblocks of the rtextent to unwritten status
>> could work. However, I'm a little confused, besides the write operation,
>> other operations like above punch hold and zero range, they seem to be
>> doing their best to follow the alignment rule since commit fe341eb151ec
>> ("xfs: ensure that fpunch, fcollapse, and finsert operations are aligned
>> to rt extent size") [1], it looks like this commit is to fix some issues,
>> so I'm not sure that converting to unwritten would always preserve the
>> rtextsize alignment.
>>
>> [1]. https://lore.kernel.org/linux-xfs/159950168085.582172.4254559621934598919.stgit@magnolia/
> 
> Looking at commit fe341eb151ec0 ("xfs: ensure that fpunch, fcollapse,
> and finsert operations are aligned to rt extent size"), I think the
> logic in xfs_free_file_space is wrong.  If bunmapi is told to unmap a
> partial rt extent, it will instead convert it to unwritten; it only
> unmaps full rt extents.  For punch that's ok because punch
> opportunistically removes blocks and zeroes the unaligned segments; for
> zero that's ok because it will falloc the sparse holes and zero the
> rest.
> 
> It's only collapse and insert that *require* alignment.  That's
> something that should be checked during input validation, and I think
> that got fixed by 25219dbfa734e ("xfs: fix fallocate functions when
> rtextsize is larger than 1").
> 

Ha, I see! I didn't notice that xfs_bunmapi() would split and convert
the extent to unwritten for an rt inode. I try to drop the rtextsize
range alignment logic in xfs_free_file_space(), it works as expect
now.

  # #reextsize is 28k
  # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
  # xfs_io -c "fpunch 4k 24k" /mnt/scratch/foo
  # umount /mnt/scratch
  # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
    u3.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
    0:[0,0,1,0]
    1:[1,1,6,1]

Now, think about the original problem again. I simplified my test case
below,

  # #reextsize is 28k
  # xfs_io -f -c "pwrite 0 28k" -c "fsync" /mnt/scratch/foo
  # xfs_io -c "truncate 4k" /mnt/scratch/foo
  # umount /mnt/scratch
  # xfs_db -c "inode 131" -c "p u3.bmx" /dev/pmem2s
    u3.bmx[0] = [startoff,startblock,blockcount,extentflag]
    0:[0,0,7,0]

The expected result should be the same as the fpunch, but it doesn't.
So, It seems that xfs_setattr_size()->xfs_itruncate_extents() doesn't
do the right thing. After check it, the problem is __xfs_bunmapi()
goto the wrong branch, it failed to convert this extent to unwritten
since tp->t_blk_res is zero.

5559			} else if (del.br_startoff == start &&
5560				  (del.br_state == XFS_EXT_UNWRITTEN ||
5561				   tp->t_blk_res == 0)) {
5562				/*
5563 				 * Can't make it unwritten.  There isn't
5564				 * a full extent here so just skip it.
5565				 */

May be we should fix this problem by reserve enough blocks when
allocating transaction in xfs_setattr_size? If so, the whole tail
rtextsize aligned blocks beyond EOF could be convert to unwritten
as expected, and we could also avoid writing too much zero,
finally the stale data exposure issue would gone.

>>>
>>> (Or if there's something else?)
>>>
>>>>                          And IIUC, the upcoming xfs force alignment
>>>> extent feature seems also need to follow this alignment, right?
>>>
>>> Yes.
>>>
>>>>>
>>>>>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>>>>>> and make sure the entire zeroed range have been flushed to disk before
>>>>>> updating the inode size.
>>>>>>
>>>>>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>>>>>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>>>>>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>> ---
>>>>>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>>>>>  fs/xfs/xfs_iops.c  | 10 ----------
>>>>>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>>>>>
>>>>>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>>>>>> index 4958cc3337bc..fc379450fe74 100644
>>>>>> --- a/fs/xfs/xfs_iomap.c
>>>>>> +++ b/fs/xfs/xfs_iomap.c
>>>>>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>>>>>  	loff_t			pos,
>>>>>>  	bool			*did_zero)
>>>>>>  {
>>>>>> +	struct xfs_mount	*mp = ip->i_mount;
>>>>>>  	struct inode		*inode = VFS_I(ip);
>>>>>>  	unsigned int		blocksize = i_blocksize(inode);
>>>>>> +	int			error;
>>>>>> +
>>>>>> +	if (XFS_IS_REALTIME_INODE(ip))
>>>>>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>>>>>
>>>>> Don't opencode xfs_inode_alloc_unitsize, please.
>>>>
>>>> Ha, I missed the latest added helper, thanks for pointing this out.
>>>>
>>>>>
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>>>>>> +	 * cow block over a hole) and subsequently skips zeroing the
>>>>>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>>>>>> +	 * convert the block before the pagecache truncate.
>>>>>> +	 */
>>>>>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>>>>>> +					     roundup_64(pos, blocksize));
>>>>>> +	if (error)
>>>>>> +		return error;pos_in_block
>>>>>
>>>>> Ok so this is hoisting the filemap_write_and_wait_range call from
>>>>> xfs_setattr_size.  It's curious that we need to need to twiddle anything
>>>>> other than the EOF block itself though?
>>>>
>>>> Since we planed to zero out the dirtied range which is ailgned to the
>>>> extsize instead of the blocksize, ensure one block is not unwritten is
>>>> not enough, we should also make sure that the range which is going to
>>>> zero out is not unwritten, or else the iomap_zero_iter() will skip
>>>> zeroing out the extra blocks.
>>>>
>>>> For example:
>>>>
>>>> before zeroing:
>>>>            |<-    extszie   ->|
>>>>         ...dddddddddddddddddddd
>>>>         ...UUUUUUUUUUUUUUUUUUUU
>>>>            ^                  ^
>>>>         new EOF             old EOF    (where 'd' means the pagecache is dirty)
>>>>
>>>> if we only flush the new EOF block, the result becomes:
>>>>
>>>>            |<-    extszie   ->|
>>>>            zddddddddddddddddddd
>>>>            ZUUUUUUUUUUUUUUUUUUU
>>>>            ^                  ^
>>>>         new EOF             old EOF
>>>>
>>>>
>>>> then the dirty extent range that between new EOF block and the old EOF
>>>> block can't be zeroed sine it's still unwritten. So we have to flush the
>>>> whole range before zeroing out.
>>>
>>> "Z" on the second line of the second diagram is a written fsblock with
>>> the tail zeroed, correct?
>>
>> Yeah,
>>
>>>
>>> truncate_setsize -> truncate_pagecache unmaps all the pagecache after
>>> the eof folio and unconditionally zeroes the tail of the eof folio
>>> without regard to the mappings.  Doesn't that cover us here?  After the
>>> truncate_setsize finishes, won't we end up in this state:
>>>
>>>            |<-   rextsize   ->|
>>>            zzzzzzzz               
>>>            ZUUUUUUUUUUUUUUUUUUU
>>>            ^      ^           ^
>>>         new EOF   |         old EOF
>>>                   folio boundary
>>>
>>
>> Yeah, this case is fine, but the below case is not fine.
>>
>> truncate                          write back
>> xfs_setattr_size()
>>  xfs_truncate_page()
>>   filemap_write_and_wait_range(newsize, newsize) <- A
>>   iomap_zero_range() <- B
>>                                   flush dirty pages <- C
>>  truncate_setsize() <- D
>>
>> Please assume if a concurrent write back happenes just before
>> truncate_setsize(), the state of the file changes as below:
>>
>> A:
>>               |<-    extszie   ->|
>>               wddddddddddddddddddd (pagecache)
>>               WUUUUUUUUUUUUUUUUUUU (disk)
>>               ^                  ^
>>            (new EOF)           old EOF  (where 'd' means the pagecache is dirty)
>>                                         (where 'x' means the pagecache contianes user data)
> 
> "W", not "x", as you noted.
> 
>>
>> B:
>>               |<-    extszie   ->|
>>               zddddddddddddddddddd
>>               ZUUUUUUUUUUUUUUUUUUU
>>               ^                  ^
>>            (new EOF)           old EOF  (where 'z' means the pagecache is zero)
>>
>> C:
>>               |<-    extszie   ->|
>>               zwwwwwwwwwwwwwwwwwww
>>               ZWWWWWWWWWWWWWWWWWWW
>>               ^                  ^
>>            (new EOF)           old EOF
>>
>> D:
>>               |<-    extszie   ->|
>>               zzzzzzzzz
>>               ZWWWWWWWWWWWWWWWWWWW
>>               ^       ^          ^
>>             new EOF   |       (old EOF)
>>                       folio boundary
> 
> Hmm.  At point D we still hold i_rwsem and the invalidate_lock, so could
> we convert the underlying blocks to unwritten here instead of writing
> them all out?  Once we reduce i_size, writeback will stop at EOF, right?
> 
> D:
>               |<-    extszie   ->|
>               zzzzzzzzz
>               ZUUUUUUUUUUUUUUUUUUU
>               ^       ^          ^
>             new EOF   |       (old EOF)
>                       folio boundary
> 

Yeah.

Thanks,
Yi.

>>
>>>>>
>>>>>>  
>>>>>>  	if (IS_DAX(inode))
>>>>>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>>>>>> -					&xfs_dax_write_iomap_ops);
>>>>>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>>>>>> -				   &xfs_buffered_write_iomap_ops);
>>>>>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>>>>>> +					  &xfs_dax_write_iomap_ops);
>>>>>> +	else
>>>>>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>>>>>> +					    &xfs_buffered_write_iomap_ops);
>>>>>> +	if (error)
>>>>>> +		return error;
>>>>>> +
>>>>>> +	/*
>>>>>> +	 * Write back path won't write dirty blocks post EOF folio,
>>>>>> +	 * flush the entire zeroed range before updating the inode
>>>>>> +	 * size.
>>>>>> +	 */
>>>>>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>>>>>> +					    roundup_64(pos, blocksize));
>>>>>
>>>>> ...but what is the purpose of the second filemap_write_and_wait_range
>>>>> call?  Is that to flush the bytes between new and old EOF to disk before
>>>>> truncate_setsize invalidates the (zeroed) pagecache?
>>>>>
>>>>
>>>> The second filemap_write_and_wait_range() call is used to make sure that
>>>> the zeroed data be flushed to disk before we updating i_size. If we don't
>>>> add this one, once the i_size is been changed, the zeroed data which
>>>> beyond the new EOF folio(block) couldn't be write back, because
>>>> iomap_writepage_map()->iomap_writepage_handle_eof() skip that range, so
>>>> the stale data problem is still there.
>>>>
>>>> For example:
>>>>
>>>> before zeroing:
>>>>            |<-    extszie   ->|
>>>>            wwwwwwwwwwwwwwwwwwww (pagecache)
>>>>         ...WWWWWWWWWWWWWWWWWWWW (disk)
>>>>            ^                  ^
>>>>         new EOF               EOF   (where 'w' means the pagecache contains data)
>>>>
>>>> then iomap_truncate_page() zeroing out the pagecache:
>>>>
>>>>            |<-    extszie   ->|
>>>>            zzzzzzzzzzzzzzzzzzzz (pagecache)
>>>>            WWWWWWWWWWWWWWWWWWWW (disk)
>>>>            ^                  ^
>>>>         new EOF               EOF
>>>>
>>>> then update i_size, sync and drop cache:
>>>>
>>>>            |<-    extszie   ->|
>>>>            ZWWWWWWWWWWWWWWWWWWW (disk)
>>>>            ^
>>>>            EOF
>>>
>>> <nod> Ok, so this second call to filemap_write_and_wait_range flushes
>>> the newly written pagecache to disk.  If it doesn't work to
>>> force-convert the tail fsblocks of the rtextent to unwritten status,
>>> then I suppose this is necessary if @blocksize != mp->m_sb.blocksize.
>>>
>>> --D
>>>
>>>> Thanks,
>>>> Yi.
>>>>
>>>>>
>>>>>>  }
>>>>>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>>>>>> index 66f8c47642e8..baeeddf4a6bb 100644
>>>>>> --- a/fs/xfs/xfs_iops.c
>>>>>> +++ b/fs/xfs/xfs_iops.c
>>>>>> @@ -845,16 +845,6 @@ xfs_setattr_size(
>>>>>>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>>>>>>  				&did_zeroing);
>>>>>>  	} else {
>>>>>> -		/*
>>>>>> -		 * iomap won't detect a dirty page over an unwritten block (or a
>>>>>> -		 * cow block over a hole) and subsequently skips zeroing the
>>>>>> -		 * newly post-EOF portion of the page. Flush the new EOF to
>>>>>> -		 * convert the block before the pagecache truncate.
>>>>>> -		 */
>>>>>> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
>>>>>> -						     newsize);
>>>>>> -		if (error)
>>>>>> -			return error;
>>>>>>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>>>>>>  	}
>>>>>>  
>>>>>> -- 
>>>>>> 2.39.2
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-21  2:38   ` Dave Chinner
@ 2024-05-22  1:57     ` Zhang Yi
  2024-05-23  1:11       ` Dave Chinner
  2024-05-22  3:00     ` Darrick J. Wong
  1 sibling, 1 reply; 19+ messages in thread
From: Zhang Yi @ 2024-05-22  1:57 UTC (permalink / raw
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, djwong, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/21 10:38, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating a realtime file unaligned to a shorter size,
>> xfs_setattr_size() only flush the EOF page before zeroing out, and
>> xfs_truncate_page() also only zeros the EOF block. This could expose
>> stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
>> a write operation").
>>
>> If the sb_rextsize is bigger than one block, and we have a realtime
>> inode that contains a long enough written extent. If we unaligned
>> truncate into the middle of this extent, xfs_itruncate_extents() could
>> split the extent and align the it's tail to sb_rextsize, there maybe
>> have more than one blocks more between the end of the file. Since
>> xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
>> value, so it may leftover some blocks contains stale data that could be
>> exposed if we append write it over a long enough distance later.
>>
>> xfs_truncate_page() should flush, zeros out the entire rtextsize range,
>> and make sure the entire zeroed range have been flushed to disk before
>> updating the inode size.
>>
>> Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
>> Reported-by: Chandan Babu R <chandanbabu@kernel.org>
>> Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
>>  fs/xfs/xfs_iops.c  | 10 ----------
>>  2 files changed, 31 insertions(+), 14 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 4958cc3337bc..fc379450fe74 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1466,12 +1466,39 @@ xfs_truncate_page(
>>  	loff_t			pos,
>>  	bool			*did_zero)
>>  {
>> +	struct xfs_mount	*mp = ip->i_mount;
>>  	struct inode		*inode = VFS_I(ip);
>>  	unsigned int		blocksize = i_blocksize(inode);
>> +	int			error;
>> +
>> +	if (XFS_IS_REALTIME_INODE(ip))
>> +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
>> +
>> +	/*
>> +	 * iomap won't detect a dirty page over an unwritten block (or a
>> +	 * cow block over a hole) and subsequently skips zeroing the
>> +	 * newly post-EOF portion of the page. Flush the new EOF to
>> +	 * convert the block before the pagecache truncate.
>> +	 */
>> +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					     roundup_64(pos, blocksize));
>> +	if (error)
>> +		return error;
>>  
>>  	if (IS_DAX(inode))
>> -		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> -					&xfs_dax_write_iomap_ops);
>> -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
>> -				   &xfs_buffered_write_iomap_ops);
>> +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
>> +					  &xfs_dax_write_iomap_ops);
>> +	else
>> +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
>> +					    &xfs_buffered_write_iomap_ops);
>> +	if (error)
>> +		return error;
>> +
>> +	/*
>> +	 * Write back path won't write dirty blocks post EOF folio,
>> +	 * flush the entire zeroed range before updating the inode
>> +	 * size.
>> +	 */
>> +	return filemap_write_and_wait_range(inode->i_mapping, pos,
>> +					    roundup_64(pos, blocksize));
>>  }
> 
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
> 
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
> 
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
> 
> 	- ensure that the new EOF extent tail contains zeroes
> 	- ensure that the range from the existing ip->i_disk_size to
> 	  the new EOF is on disk so data vs metadata ordering is
> 	  correct for crash recovery purposes.
> 
> What this patch does to acheive that is:
> 
> 	1. blocking writeback to clean dirty unwritten/cow blocks at
> 	the new EOF.
> 	2. iomap_truncate_page() writes zeroes into the page cache,
> 	which dirties the pages we just cleaned at the new EOF.
> 	3. blocking writeback to clean the dirty blocks at the new
> 	EOF.
> 	4. truncate_setsize() then writes zeros to partial folios at
> 	the new EOF, dirtying the EOF page again.
> 	5. blocking writeback to clean dirty blocks from the current
> 	on-disk size to the new EOF.
> 
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.

Yes, this is indeed a performance disaster, and iomap_zero_range()
should aware the dirty pages. I had the same problem when developing
buffered iomap conversion on ext4.

> 
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
> 
> 	1. change iomap_zero_range() to do the right thing with
> 	dirty unwritten and cow extents (the patch I've been working
> 	on).
> 
> 	2. pass the range to be zeroed into iomap_truncate_page()
> 	(the fundamental change being made here).
> 
> 	3. zero the required range *through the page cache*
> 	(iomap_zero_range() already does this).
> 
> 	4. write back the XFS inode from ip->i_disk_size to the end
> 	of the range zeroed by iomap_truncate_page()
> 	(xfs_setattr_size() already does this).
> 
> 	5. i_size_write(newsize);
> 
> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> 	the page cache beyond the new EOF without doing any zeroing
> 	as we've already done all the zeroing needed to the page
> 	cache through iomap_truncate_page().
> 
> 
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
> 

Indeed, this sounds like the best solution. Since Darrick recommended
that we could fix the stale data exposure on realtime inode issue by
convert the tail extent to unwritten, I suppose we could do this after
fixing the problem.

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-21  2:38   ` Dave Chinner
  2024-05-22  1:57     ` Zhang Yi
@ 2024-05-22  3:00     ` Darrick J. Wong
  2024-05-23  1:14       ` Dave Chinner
  1 sibling, 1 reply; 19+ messages in thread
From: Darrick J. Wong @ 2024-05-22  3:00 UTC (permalink / raw
  To: Dave Chinner
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Tue, May 21, 2024 at 12:38:30PM +1000, Dave Chinner wrote:
> On Fri, May 17, 2024 at 07:13:55PM +0800, Zhang Yi wrote:
> > From: Zhang Yi <yi.zhang@huawei.com>
> > 
> > When truncating a realtime file unaligned to a shorter size,
> > xfs_setattr_size() only flush the EOF page before zeroing out, and
> > xfs_truncate_page() also only zeros the EOF block. This could expose
> > stale data since 943bc0882ceb ("iomap: don't increase i_size if it's not
> > a write operation").
> > 
> > If the sb_rextsize is bigger than one block, and we have a realtime
> > inode that contains a long enough written extent. If we unaligned
> > truncate into the middle of this extent, xfs_itruncate_extents() could
> > split the extent and align the it's tail to sb_rextsize, there maybe
> > have more than one blocks more between the end of the file. Since
> > xfs_truncate_page() only zeros the trailing portion of the i_blocksize()
> > value, so it may leftover some blocks contains stale data that could be
> > exposed if we append write it over a long enough distance later.
> > 
> > xfs_truncate_page() should flush, zeros out the entire rtextsize range,
> > and make sure the entire zeroed range have been flushed to disk before
> > updating the inode size.
> > 
> > Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
> > Reported-by: Chandan Babu R <chandanbabu@kernel.org>
> > Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
> > Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> > ---
> >  fs/xfs/xfs_iomap.c | 35 +++++++++++++++++++++++++++++++----
> >  fs/xfs/xfs_iops.c  | 10 ----------
> >  2 files changed, 31 insertions(+), 14 deletions(-)
> > 
> > diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> > index 4958cc3337bc..fc379450fe74 100644
> > --- a/fs/xfs/xfs_iomap.c
> > +++ b/fs/xfs/xfs_iomap.c
> > @@ -1466,12 +1466,39 @@ xfs_truncate_page(
> >  	loff_t			pos,
> >  	bool			*did_zero)
> >  {
> > +	struct xfs_mount	*mp = ip->i_mount;
> >  	struct inode		*inode = VFS_I(ip);
> >  	unsigned int		blocksize = i_blocksize(inode);
> > +	int			error;
> > +
> > +	if (XFS_IS_REALTIME_INODE(ip))
> > +		blocksize = XFS_FSB_TO_B(mp, mp->m_sb.sb_rextsize);
> > +
> > +	/*
> > +	 * iomap won't detect a dirty page over an unwritten block (or a
> > +	 * cow block over a hole) and subsequently skips zeroing the
> > +	 * newly post-EOF portion of the page. Flush the new EOF to
> > +	 * convert the block before the pagecache truncate.
> > +	 */
> > +	error = filemap_write_and_wait_range(inode->i_mapping, pos,
> > +					     roundup_64(pos, blocksize));
> > +	if (error)
> > +		return error;
> >  
> >  	if (IS_DAX(inode))
> > -		return dax_truncate_page(inode, pos, blocksize, did_zero,
> > -					&xfs_dax_write_iomap_ops);
> > -	return iomap_truncate_page(inode, pos, blocksize, did_zero,
> > -				   &xfs_buffered_write_iomap_ops);
> > +		error = dax_truncate_page(inode, pos, blocksize, did_zero,
> > +					  &xfs_dax_write_iomap_ops);
> > +	else
> > +		error = iomap_truncate_page(inode, pos, blocksize, did_zero,
> > +					    &xfs_buffered_write_iomap_ops);
> > +	if (error)
> > +		return error;
> > +
> > +	/*
> > +	 * Write back path won't write dirty blocks post EOF folio,
> > +	 * flush the entire zeroed range before updating the inode
> > +	 * size.
> > +	 */
> > +	return filemap_write_and_wait_range(inode->i_mapping, pos,
> > +					    roundup_64(pos, blocksize));
> >  }
> 
> Ok, this means we do -three- blocking writebacks through this path
> instead of one or maybe two.
> 
> We already know that this existing blocking writeback case for dirty
> pages over unwritten extents is a significant performance issue for
> some workloads. I have a fix in progress for iomap to handle this
> case without requiring blocking writeback to be done to convert the
> extent to written before we do the truncate.
> 
> Regardless, I think this whole "truncate is allocation unit size
> aware" algorithm is largely unworkable without a rewrite. What XFS
> needs to do on truncate *down* before we start the truncate
> transaction is pretty simple:
> 
> 	- ensure that the new EOF extent tail contains zeroes
> 	- ensure that the range from the existing ip->i_disk_size to
> 	  the new EOF is on disk so data vs metadata ordering is
> 	  correct for crash recovery purposes.
> 
> What this patch does to acheive that is:
> 
> 	1. blocking writeback to clean dirty unwritten/cow blocks at
> 	the new EOF.
> 	2. iomap_truncate_page() writes zeroes into the page cache,
> 	which dirties the pages we just cleaned at the new EOF.
> 	3. blocking writeback to clean the dirty blocks at the new
> 	EOF.
> 	4. truncate_setsize() then writes zeros to partial folios at
> 	the new EOF, dirtying the EOF page again.
> 	5. blocking writeback to clean dirty blocks from the current
> 	on-disk size to the new EOF.
> 
> This is pretty crazy when you stop and think about it. We're writing
> the same EOF block -three- times. The first data write gets
> overwritten by zeroes on the second write, and the third write
> writes the same zeroes as the second write. There are two redundant
> *blocking* writes in this process.
> 
> We can do all this with a single writeback operation if we are a
> little bit smarter about the order of operations we perform and we
> are a little bit smarter in iomap about zeroing dirty pages in the
> page cache:
> 
> 	1. change iomap_zero_range() to do the right thing with
> 	dirty unwritten and cow extents (the patch I've been working
> 	on).
> 
> 	2. pass the range to be zeroed into iomap_truncate_page()
> 	(the fundamental change being made here).
> 
> 	3. zero the required range *through the page cache*
> 	(iomap_zero_range() already does this).
> 
> 	4. write back the XFS inode from ip->i_disk_size to the end
> 	of the range zeroed by iomap_truncate_page()
> 	(xfs_setattr_size() already does this).
> 
> 	5. i_size_write(newsize);
> 
> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> 	the page cache beyond the new EOF without doing any zeroing
> 	as we've already done all the zeroing needed to the page
> 	cache through iomap_truncate_page().
> 
> 
> The patch I'm working on for step 1 is below. It still needs to be
> extended to handle the cow case, but I'm unclear on how to exercise
> that case so I haven't written the code to do it. The rest of it is
> just rearranging the code that we already use just to get the order
> of operations right. The only notable change in behaviour is using
> invalidate_inode_pages2_range() instead of truncate_pagecache(),
> because we don't want the EOF page to be dirtied again once we've
> already written zeroes to disk....
> 
> -- 
> Dave Chinner
> david@fromorbit.com
> 
> 
> [RFC] iomap: zeroing needs to be pagecache aware
> 
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.

Hmm.  If memory serves, we probably need to adapt the
xfs_buffered/direct_write_iomap_begin functions to return the hole in
srcmap and the cow mapping in the iomap.  RN I think it just returns the
hole.

--D

> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58  XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182 ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00  ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429 ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 
> This has passed through most of fstests on a couple of test VMs
> without issues at the moment, so I think this approach to fixing the
> issue is going to be solid once we've worked out how to detect the
> COW-hole mapping case.
> 
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
> ---
>  fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
>  fs/xfs/xfs_iops.c      | 12 +-----------
>  2 files changed, 41 insertions(+), 13 deletions(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 4e8e41c8b3c0..6877474de0c9 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
>   *
>   * Returns a locked reference to the folio at @pos, or an error pointer if the
>   * folio could not be obtained.
> + *
> + * Note: when zeroing unwritten extents, we might have data in the page cache
> + * over an unwritten extent. In this case, we want to do a pure lookup on the
> + * page cache and not create a new folio as we don't need to perform zeroing on
> + * unwritten extents if there is no cached data over the given range.
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
> +	if (iter->flags & IOMAP_ZERO) {
> +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +
> +		if (srcmap->type == IOMAP_UNWRITTEN)
> +			fgp &= ~FGP_CREAT;
> +	}
> +
>  	if (iter->flags & IOMAP_NOWAIT)
>  		fgp |= FGP_NOWAIT;
>  	fgp |= fgf_set_order(len);
> @@ -1375,7 +1387,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  	loff_t written = 0;
>  
>  	/* already zeroed?  we're done. */
> -	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
> +	if (srcmap->type == IOMAP_HOLE)
>  		return length;
>  
>  	do {
> @@ -1385,8 +1397,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		size_t bytes = min_t(u64, SIZE_MAX, length);
>  
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status)
> +		if (status) {
> +			if (status == -ENOENT) {
> +				/*
> +				 * Unwritten extents need to have page cache
> +				 * lookups done to determine if they have data
> +				 * over them that needs zeroing. If there is no
> +				 * data, we'll get -ENOENT returned here, so we
> +				 * can just skip over this index.
> +				 */
> +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> +				if (bytes > PAGE_SIZE - offset_in_page(pos))
> +					bytes = PAGE_SIZE - offset_in_page(pos);
> +				goto loop_continue;
> +			}
>  			return status;
> +		}
>  		if (iter->iomap.flags & IOMAP_F_STALE)
>  			break;
>  
> @@ -1394,6 +1420,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (bytes > folio_size(folio) - offset)
>  			bytes = folio_size(folio) - offset;
>  
> +		/*
> +		 * If the folio over an unwritten extent is clean (i.e. because
> +		 * it has been read from), then it already contains zeros. Hence
> +		 * we can just skip it.
> +		 */
> +		if (srcmap->type == IOMAP_UNWRITTEN &&
> +		    !folio_test_dirty(folio)) {
> +			folio_unlock(folio);
> +			goto loop_continue;
> +		}
> +
>  		folio_zero_range(folio, offset, bytes);
>  		folio_mark_accessed(folio);
>  
> @@ -1401,6 +1438,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		if (WARN_ON_ONCE(bytes == 0))
>  			return -EIO;
>  
> +loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index 8a145ca7d380..e8c9f3018c80 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -838,17 +838,7 @@ xfs_setattr_size(
>  		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>  		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>  				&did_zeroing);
> -	} else {
> -		/*
> -		 * iomap won't detect a dirty page over an unwritten block (or a
> -		 * cow block over a hole) and subsequently skips zeroing the
> -		 * newly post-EOF portion of the page. Flush the new EOF to
> -		 * convert the block before the pagecache truncate.
> -		 */
> -		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
> -						     newsize);
> -		if (error)
> -			return error;
> +	} else if (newsize != oldsize) {
>  		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>  	}
>  
> 

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-22  1:57     ` Zhang Yi
@ 2024-05-23  1:11       ` Dave Chinner
  2024-05-23  2:00         ` Zhang Yi
  0 siblings, 1 reply; 19+ messages in thread
From: Dave Chinner @ 2024-05-23  1:11 UTC (permalink / raw
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, djwong, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote:
> On 2024/5/21 10:38, Dave Chinner wrote:
> > We can do all this with a single writeback operation if we are a
> > little bit smarter about the order of operations we perform and we
> > are a little bit smarter in iomap about zeroing dirty pages in the
> > page cache:
> > 
> > 	1. change iomap_zero_range() to do the right thing with
> > 	dirty unwritten and cow extents (the patch I've been working
> > 	on).
> > 
> > 	2. pass the range to be zeroed into iomap_truncate_page()
> > 	(the fundamental change being made here).
> > 
> > 	3. zero the required range *through the page cache*
> > 	(iomap_zero_range() already does this).
> > 
> > 	4. write back the XFS inode from ip->i_disk_size to the end
> > 	of the range zeroed by iomap_truncate_page()
> > 	(xfs_setattr_size() already does this).
> > 
> > 	5. i_size_write(newsize);
> > 
> > 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
> > 	the page cache beyond the new EOF without doing any zeroing
> > 	as we've already done all the zeroing needed to the page
> > 	cache through iomap_truncate_page().
> > 
> > 
> > The patch I'm working on for step 1 is below. It still needs to be
> > extended to handle the cow case, but I'm unclear on how to exercise
> > that case so I haven't written the code to do it. The rest of it is
> > just rearranging the code that we already use just to get the order
> > of operations right. The only notable change in behaviour is using
> > invalidate_inode_pages2_range() instead of truncate_pagecache(),
> > because we don't want the EOF page to be dirtied again once we've
> > already written zeroes to disk....
> > 
> 
> Indeed, this sounds like the best solution. Since Darrick recommended
> that we could fix the stale data exposure on realtime inode issue by
> convert the tail extent to unwritten, I suppose we could do this after
> fixing the problem.

We also need to fix the truncate issue for the upcoming forced
alignment feature (for atomic writes), and in that case we are
required to write zeroes to the entire tail extent. i.e. forced
alignment does not allow partial unwritten extent conversion of
the EOF extent.

Hence I think we want to fix the problem by zeroing the entire EOF
extent first, then optimise the large rtextsize case to use
unwritten extents if that tail zeroing proves to be a performance
issue.

I say "if" because the large rtextsize case will still need to write
zeroes for the fsb that spans EOF. Adding conversion of the rest of
the extent to unwritten may well be more expensive (in terms of both
CPU and IO requirements for the transactional metadata updates) than
just submitting a slightly larger IO containing real zeroes and
leaving it as a written extent....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-22  3:00     ` Darrick J. Wong
@ 2024-05-23  1:14       ` Dave Chinner
  0 siblings, 0 replies; 19+ messages in thread
From: Dave Chinner @ 2024-05-23  1:14 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On Tue, May 21, 2024 at 08:00:20PM -0700, Darrick J. Wong wrote:
> On Tue, May 21, 2024 at 12:38:30PM +1000, Dave Chinner wrote:
> > [RFC] iomap: zeroing needs to be pagecache aware
> > 
> > From: Dave Chinner <dchinner@redhat.com>
> > 
> > Unwritten extents can have page cache data over the range being
> > zeroed so we can't just skip them entirely. Fix this by checking for
> > an existing dirty folio over the unwritten range we are zeroing
> > and only performing zeroing if the folio is already dirty.
> > 
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> Hmm.  If memory serves, we probably need to adapt the
> xfs_buffered/direct_write_iomap_begin functions to return the hole in
> srcmap and the cow mapping in the iomap.  RN I think it just returns the
> hole.

Yes, that is what I was thinking we need to do -
xfs_buffered_write_iomap_begin() doesn't even check for COW mappings
if IOMAP_ZERO is set, so there's a bunch of refactoring work needed
to let iomap know that there is a COW mapping over the hole so it
can do the same page cache lookup stuff that I added for unwritten
extents....

-Dave.

-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v3 3/3] xfs: correct the zeroing truncate range
  2024-05-23  1:11       ` Dave Chinner
@ 2024-05-23  2:00         ` Zhang Yi
  0 siblings, 0 replies; 19+ messages in thread
From: Zhang Yi @ 2024-05-23  2:00 UTC (permalink / raw
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, linux-kernel, linux-ext4, djwong, hch,
	brauner, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3

On 2024/5/23 9:11, Dave Chinner wrote:
> On Wed, May 22, 2024 at 09:57:13AM +0800, Zhang Yi wrote:
>> On 2024/5/21 10:38, Dave Chinner wrote:
>>> We can do all this with a single writeback operation if we are a
>>> little bit smarter about the order of operations we perform and we
>>> are a little bit smarter in iomap about zeroing dirty pages in the
>>> page cache:
>>>
>>> 	1. change iomap_zero_range() to do the right thing with
>>> 	dirty unwritten and cow extents (the patch I've been working
>>> 	on).
>>>
>>> 	2. pass the range to be zeroed into iomap_truncate_page()
>>> 	(the fundamental change being made here).
>>>
>>> 	3. zero the required range *through the page cache*
>>> 	(iomap_zero_range() already does this).
>>>
>>> 	4. write back the XFS inode from ip->i_disk_size to the end
>>> 	of the range zeroed by iomap_truncate_page()
>>> 	(xfs_setattr_size() already does this).
>>>
>>> 	5. i_size_write(newsize);
>>>
>>> 	6. invalidate_inode_pages2_range(newsize, -1) to trash all
>>> 	the page cache beyond the new EOF without doing any zeroing
>>> 	as we've already done all the zeroing needed to the page
>>> 	cache through iomap_truncate_page().
>>>
>>>
>>> The patch I'm working on for step 1 is below. It still needs to be
>>> extended to handle the cow case, but I'm unclear on how to exercise
>>> that case so I haven't written the code to do it. The rest of it is
>>> just rearranging the code that we already use just to get the order
>>> of operations right. The only notable change in behaviour is using
>>> invalidate_inode_pages2_range() instead of truncate_pagecache(),
>>> because we don't want the EOF page to be dirtied again once we've
>>> already written zeroes to disk....
>>>
>>
>> Indeed, this sounds like the best solution. Since Darrick recommended
>> that we could fix the stale data exposure on realtime inode issue by
>> convert the tail extent to unwritten, I suppose we could do this after
>> fixing the problem.
> 
> We also need to fix the truncate issue for the upcoming forced
> alignment feature (for atomic writes), and in that case we are
> required to write zeroes to the entire tail extent. i.e. forced
> alignment does not allow partial unwritten extent conversion of
> the EOF extent.
> 

Yes, right. I noticed that feature also needs to fix.

> Hence I think we want to fix the problem by zeroing the entire EOF
> extent first, then optimise the large rtextsize case to use
> unwritten extents if that tail zeroing proves to be a performance
> issue.
> 
> I say "if" because the large rtextsize case will still need to write
> zeroes for the fsb that spans EOF. Adding conversion of the rest of
> the extent to unwritten may well be more expensive (in terms of both
> CPU and IO requirements for the transactional metadata updates) than
> just submitting a slightly larger IO containing real zeroes and
> leaving it as a written extent....
> 

Yeah, if the rtextsize if not large (in most cases), I'm pretty sure
that writing zeros would better. If the rtextsize is large enough, I
think it deserves a performance test.

Thanks,
Yi.


^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2024-05-23  2:00 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-05-17 11:13 [PATCH v3 0/3] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-05-17 11:13 ` [PATCH v3 1/3] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-05-17 17:29   ` Darrick J. Wong
2024-05-18  2:01     ` Zhang Yi
2024-05-17 11:13 ` [PATCH v3 2/3] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-05-17 11:13 ` [PATCH v3 3/3] xfs: correct the zeroing truncate range Zhang Yi
2024-05-17 17:59   ` Darrick J. Wong
2024-05-18  6:35     ` Zhang Yi
2024-05-18 19:26       ` Darrick J. Wong
2024-05-20  6:56         ` Zhang Yi
2024-05-20  7:11           ` Zhang Yi
2024-05-20 18:37           ` Darrick J. Wong
2024-05-21 13:45             ` Zhang Yi
2024-05-21  2:38   ` Dave Chinner
2024-05-22  1:57     ` Zhang Yi
2024-05-23  1:11       ` Dave Chinner
2024-05-23  2:00         ` Zhang Yi
2024-05-22  3:00     ` Darrick J. Wong
2024-05-23  1:14       ` Dave Chinner

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).