[RFC 0/7] ext4: Allocator changes for atomic write support with DIO

Linux-ext4 Archive mirror
 help / color / mirror / Atom feed

* [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
@ 2023-11-30 13:53 Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Ojaswin Mujoo
                   ` (7 more replies)
  0 siblings, 8 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

This patch series builds on top of John Gary's atomic direct write 
patch series [1] and enables this support in ext4. This is a 2 step
process:

1. Enable aligned allocation in ext4 mballoc. This allows us to allocate
power-of-2 aligned physical blocks, which is needed for atomic writes.

2. Hook the direct IO path in ext4 to use aligned allocation to obtain 
physical blocks at a given alignment, which is needed for atomic IO. If 
for any reason we are not able to obtain blocks at given alignment we
fail the atomic write.

Currently this RFC does not impose any restrictions for atomic and non-atomic
allocations to any inode,  which also leaves policy decisions to user-space
as much as possible. So, for example, the user space can:

 * Do an atomic direct IO at any alignment and size provided it
   satisfies underlying device constraints. The only restriction for now
   is that it should be power of 2 len and atleast of FS block size.

 * Do any combination of non atomic and atomic writes on the same file
   in any order. As long as the user space is passing the RWF_ATOMIC flag 
   to pwritev2() it is guaranteed to do an atomic IO (or fail if not
   possible).

There are some TODOs on the allocator side which are remaining like...

1.  Fallback to original request size when normalized request size (due to
    preallocation) allocation is not possible.
2.  Testing some edge cases.

But since all the basic test scenarios were covered, hence we wanted to get
this RFC out for discussion on atomic write support for DIO in ext4.

Further points for discussion -

1. We might need an inode flag to identify that the inode has blocks/extents
atomically allocated. So that other userspace tools do not move the blocks of
the inode for e.g. during resize/fsck etc.
  a. Should inode be marked as atomic similar to how we have IS_DAX(inode)
  implementation? Any thoughts?

2. Should there be support for open flags like O_ATOMIC. So that in case if
user wants to do only atomic writes to an open fd, then all writes can be
considered atomic.

3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look
like since say if there are block allocations done which were done atomically,
it should not matter to FS w.r.t compatibility.

4. Mostly aligned allocations are required when we don't have data=journal
mode. So should we return -EIO with data journalling mode for DIO request?

Script to test using pwritev2() can be found here: 
https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9

Regards,
ojaswin

[1] https://lore.kernel.org/linux-fsdevel/20230929102726.2985188-1-john.g.garry@oracle.com

Ojaswin Mujoo (7):
  iomap: Don't fall back to buffered write if the write is atomic
  ext4: Factor out size and start prediction from
    ext4_mb_normalize_request()
  ext4: add aligned allocation support in mballoc
  ext4: allow inode preallocation for aligned alloc
  block: export blkdev_atomic_write_valid() and refactor api
  ext4: Add aligned allocation support for atomic direct io
  ext4: Support atomic write for statx

 block/fops.c                |  18 ++-
 fs/ext4/ext4.h              |  10 +-
 fs/ext4/extents.c           |  14 ++
 fs/ext4/file.c              |  49 ++++++
 fs/ext4/inode.c             | 142 ++++++++++++++++-
 fs/ext4/mballoc.c           | 302 +++++++++++++++++++++++++-----------
 fs/iomap/direct-io.c        |   8 +-
 include/linux/blkdev.h      |   2 +
 include/trace/events/ext4.h |   2 +
 9 files changed, 442 insertions(+), 105 deletions(-)

-- 
2.39.3

^ permalink raw reply	[flat|nested] 32+ messages in thread

* [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-11-30 21:10   ` Dave Chinner
  2023-11-30 13:53 ` [RFC 2/7] ext4: Factor out size and start prediction from ext4_mb_normalize_request() Ojaswin Mujoo
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

Currently, iomap only supports atomic writes for direct IOs and there is
no guarantees that a buffered IO will be atomic. Hence, if the user has
explicitly requested the direct write to be atomic and there's a
failure, return -EIO instead of falling back to buffered IO.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/iomap/direct-io.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
index 6ef25e26f1a1..3e7cd9bc8f4d 100644
--- a/fs/iomap/direct-io.c
+++ b/fs/iomap/direct-io.c
@@ -662,7 +662,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
 			if (ret != -EAGAIN) {
 				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
 								iomi.len);
-				ret = -ENOTBLK;
+				/*
+				 * if this write was supposed to be atomic,
+				 * return the err rather than trying to fall
+				 * back to buffered IO.
+				 */
+				if (!atomic_write)
+					ret = -ENOTBLK;
 			}
 			goto out_free_dio;
 		}
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 2/7] ext4: Factor out size and start prediction from ext4_mb_normalize_request()
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 3/7] ext4: add aligned allocation support in mballoc Ojaswin Mujoo
                   ` (5 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

As a part of trimming down the size of ext4_mb_normalize_request(), factor
out the logic to predict normalized start and size to a separate function
ext4_mb_pa_predict_size().

This is no functional change in this patch.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/mballoc.c | 95 ++++++++++++++++++++++++++++-------------------
 1 file changed, 56 insertions(+), 39 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 0b0aff458efd..3eb7b639d36e 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -4394,6 +4394,58 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	*end = new_end;
 }
 
+static void ext4_mb_pa_predict_size(struct ext4_allocation_context *ac,
+				    loff_t *start, loff_t *size)
+{
+	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
+	loff_t new_size = *size;
+	loff_t new_start = *start;
+	int bsbits, max;
+
+	bsbits = ac->ac_sb->s_blocksize_bits;
+	/* max size of free chunks */
+	max = 2 << bsbits;
+
+#define NRL_CHECK_SIZE(req, size, max, chunk_size) \
+	(req <= (size) || max <= (chunk_size))
+
+	if (new_size <= 16 * 1024) {
+		new_size = 16 * 1024;
+	} else if (new_size <= 32 * 1024) {
+		new_size = 32 * 1024;
+	} else if (new_size <= 64 * 1024) {
+		new_size = 64 * 1024;
+	} else if (new_size <= 128 * 1024) {
+		new_size = 128 * 1024;
+	} else if (new_size <= 256 * 1024) {
+		new_size = 256 * 1024;
+	} else if (new_size <= 512 * 1024) {
+		new_size = 512 * 1024;
+	} else if (new_size <= 1024 * 1024) {
+		new_size = 1024 * 1024;
+	} else if (NRL_CHECK_SIZE(new_size, 4 * 1024 * 1024, max, 2 * 1024)) {
+		new_start = ((loff_t)ac->ac_o_ex.fe_logical >>
+						(21 - bsbits)) << 21;
+		new_size = 2 * 1024 * 1024;
+	} else if (NRL_CHECK_SIZE(new_size, 8 * 1024 * 1024, max, 4 * 1024)) {
+		new_start = ((loff_t)ac->ac_o_ex.fe_logical >>
+							(22 - bsbits)) << 22;
+		new_size = 4 * 1024 * 1024;
+	} else if (NRL_CHECK_SIZE(EXT4_C2B(sbi, ac->ac_o_ex.fe_len),
+					(8<<20)>>bsbits, max, 8 * 1024)) {
+		new_start = ((loff_t)ac->ac_o_ex.fe_logical >>
+							(23 - bsbits)) << 23;
+		new_size = 8 * 1024 * 1024;
+	} else {
+		new_start = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
+		new_size = (loff_t) EXT4_C2B(sbi,
+					      ac->ac_o_ex.fe_len) << bsbits;
+	}
+
+	*size = new_size;
+	*start = new_start;
+}
+
 /*
  * Normalization means making request better in terms of
  * size and alignment
@@ -4404,7 +4456,7 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	struct ext4_super_block *es = sbi->s_es;
-	int bsbits, max;
+	int bsbits;
 	loff_t size, start_off, end;
 	loff_t orig_size __maybe_unused;
 	ext4_lblk_t start;
@@ -4438,47 +4490,12 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 		size = i_size_read(ac->ac_inode);
 	orig_size = size;
 
-	/* max size of free chunks */
-	max = 2 << bsbits;
-
-#define NRL_CHECK_SIZE(req, size, max, chunk_size)	\
-		(req <= (size) || max <= (chunk_size))
-
 	/* first, try to predict filesize */
 	/* XXX: should this table be tunable? */
 	start_off = 0;
-	if (size <= 16 * 1024) {
-		size = 16 * 1024;
-	} else if (size <= 32 * 1024) {
-		size = 32 * 1024;
-	} else if (size <= 64 * 1024) {
-		size = 64 * 1024;
-	} else if (size <= 128 * 1024) {
-		size = 128 * 1024;
-	} else if (size <= 256 * 1024) {
-		size = 256 * 1024;
-	} else if (size <= 512 * 1024) {
-		size = 512 * 1024;
-	} else if (size <= 1024 * 1024) {
-		size = 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(size, 4 * 1024 * 1024, max, 2 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-						(21 - bsbits)) << 21;
-		size = 2 * 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(size, 8 * 1024 * 1024, max, 4 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-							(22 - bsbits)) << 22;
-		size = 4 * 1024 * 1024;
-	} else if (NRL_CHECK_SIZE(EXT4_C2B(sbi, ac->ac_o_ex.fe_len),
-					(8<<20)>>bsbits, max, 8 * 1024)) {
-		start_off = ((loff_t)ac->ac_o_ex.fe_logical >>
-							(23 - bsbits)) << 23;
-		size = 8 * 1024 * 1024;
-	} else {
-		start_off = (loff_t) ac->ac_o_ex.fe_logical << bsbits;
-		size	  = (loff_t) EXT4_C2B(sbi,
-					      ac->ac_o_ex.fe_len) << bsbits;
-	}
+
+	ext4_mb_pa_predict_size(ac, &start_off, &size);
+
 	size = size >> bsbits;
 	start = start_off >> bsbits;
 
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 3/7] ext4: add aligned allocation support in mballoc
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 2/7] ext4: Factor out size and start prediction from ext4_mb_normalize_request() Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 4/7] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
                   ` (4 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

Add support in mballoc for allocating blocks that are aligned
to a certain power-of-2 offset.

1. We define a new flag EXT4_MB_ALIGNED_ALLOC to indicate that we want
an aligned allocation.

2. The alignment is determined by the length of the allocation, for
example if we ask for 8192 bytes, then the alignment of physical blocks
will also be 8192 bytes aligned (ie 2 blocks aligned on 4k blocksize).

3. We dont yet support arbitrary alignment. For aligned writes, the
length/alignment must be power of 2 blocks, ie for 4k blocksize we
can get 4k byte aligned, 8k byte aligned, 16k byte aligned ...
allocation but not 12k byte aligned.

4. We use CR_POWER2_ALIGNED criteria for aligned allocation which by
design allocates in an aligned manner. Since CR_POWER2_ALIGNED needs the
ac->ac_g_ex.fe_len to be power of 2, thats where the restriction in
point 3 above comes from. Since right now aligned allocation support is
added mainly for atomic writes use case, this restriction should be fine
since atomic write capable devices usually support only power of 2
alignments

5. For ease of review enabling inode preallocation support is done in
upcoming patches and is disabled in this patch.

6. In case we can't find anything in CR_POWER2_ALIGNED, we return ENOSPC.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h              |  6 ++--
 fs/ext4/mballoc.c           | 69 ++++++++++++++++++++++++++++++++++---
 include/trace/events/ext4.h |  1 +
 3 files changed, 69 insertions(+), 7 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 9418359b1d9d..38a77148b85c 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -216,9 +216,11 @@ enum criteria {
 /* Large fragment size list lookup succeeded at least once for cr = 0 */
 #define EXT4_MB_CR_POWER2_ALIGNED_OPTIMIZED		0x8000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1 */
-#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED		0x00010000
+#define EXT4_MB_CR_GOAL_LEN_FAST_OPTIMIZED		0x10000
 /* Avg fragment size rb tree lookup succeeded at least once for cr = 1.5 */
-#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED		0x00020000
+#define EXT4_MB_CR_BEST_AVAIL_LEN_OPTIMIZED		0x20000
+/* The allocation must respect alignment requirements for physical blocks */
+#define EXT4_MB_ALIGNED_ALLOC		0x40000
 
 struct ext4_allocation_request {
 	/* target inode for block we're allocating */
diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index 3eb7b639d36e..b1df531e6db3 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2150,8 +2150,11 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * user requested originally, we store allocated
 	 * space in a special descriptor.
 	 */
-	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len)
+	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
+		/* Aligned allocation doesn't have preallocation support */
+		WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC);
 		ext4_mb_new_preallocation(ac);
+	}
 
 }
 
@@ -2784,10 +2787,15 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 	BUG_ON(ac->ac_status == AC_STATUS_FOUND);
 
-	/* first, try the goal */
-	err = ext4_mb_find_by_goal(ac, &e4b);
-	if (err || ac->ac_status == AC_STATUS_FOUND)
-		goto out;
+	/*
+	 * first, try the goal. Skip trying goal for aligned allocations since
+	 * goal determination logic is not alignment aware (yet)
+	 */
+	if (!(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)) {
+		err = ext4_mb_find_by_goal(ac, &e4b);
+		if (err || ac->ac_status == AC_STATUS_FOUND)
+			goto out;
+	}
 
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		goto out;
@@ -2828,9 +2836,26 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	 */
 	if (ac->ac_2order)
 		cr = CR_POWER2_ALIGNED;
+	else
+		WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC &&
+			ac->ac_g_ex.fe_len > 1);
 repeat:
 	for (; cr < EXT4_MB_NUM_CRS && ac->ac_status == AC_STATUS_CONTINUE; cr++) {
 		ac->ac_criteria = cr;
+
+		if (ac->ac_criteria > CR_POWER2_ALIGNED &&
+		    ac->ac_flags & EXT4_MB_ALIGNED_ALLOC &&
+		    ac->ac_g_ex.fe_len > 1
+		    ) {
+			/*
+			 * Aligned allocation only supports power 2 alignment
+			 * values which can only be satisfied by
+			 * CR_POWER2_ALIGNED. The exception being allocations of
+			 * 1 block which can be done via any criteria
+			 */
+			break;
+		}
+
 		/*
 		 * searching for the right group start
 		 * from the goal value specified
@@ -2955,6 +2980,23 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 	if (!err && ac->ac_status != AC_STATUS_FOUND && first_err)
 		err = first_err;
 
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC && ac->ac_status == AC_STATUS_FOUND) {
+		ext4_fsblk_t start = ext4_grp_offs_to_block(sb, &ac->ac_b_ex);
+		ext4_grpblk_t len = EXT4_C2B(sbi, ac->ac_b_ex.fe_len);
+
+		if (!len) {
+			ext4_warning(sb, "Expected a non zero len extent");
+			ac->ac_status = AC_STATUS_BREAK;
+			goto exit;
+		}
+
+		WARN_ON(!is_power_of_2(len));
+		WARN_ON(start % len);
+		/* We don't support preallocation yet */
+		WARN_ON(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+	}
+
+ exit:
 	mb_debug(sb, "Best len %d, origin len %d, ac_status %u, ac_flags 0x%x, cr %d ret %d\n",
 		 ac->ac_b_ex.fe_len, ac->ac_o_ex.fe_len, ac->ac_status,
 		 ac->ac_flags, cr, err);
@@ -4475,6 +4517,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
+	/*
+	 * caller may have strict alignment requirements. In this case, avoid
+	 * normalization since it is not alignment aware.
+	 */
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)
+		return;
+
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4790,6 +4839,10 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
+	/* using preallocated blocks is not alignment aware. */
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)
+		return false;
+
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
@@ -6069,6 +6122,12 @@ static bool ext4_mb_discard_preallocations_should_retry(struct super_block *sb,
 	u64 seq_retry = 0;
 	bool ret = false;
 
+	/* No need to retry for aligned allocations */
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) {
+		ret = false;
+		goto out_dbg;
+	}
+
 	freed = ext4_mb_discard_preallocations(sb, ac->ac_o_ex.fe_len);
 	if (freed) {
 		ret = true;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 65029dfb92fb..56895cfb5781 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -36,6 +36,7 @@ struct partial_cluster;
 	{ EXT4_MB_STREAM_ALLOC,		"STREAM_ALLOC" },	\
 	{ EXT4_MB_USE_ROOT_BLOCKS,	"USE_ROOT_BLKS" },	\
 	{ EXT4_MB_USE_RESERVED,		"USE_RESV" },		\
+	{ EXT4_MB_ALIGNED_ALLOC,		"ALIGNED_ALLOC" }, \
 	{ EXT4_MB_STRICT_CHECK,		"STRICT_CHECK" })
 
 #define show_map_flags(flags) __print_flags(flags, "|",			\
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 4/7] ext4: allow inode preallocation for aligned alloc
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
                   ` (2 preceding siblings ...)
  2023-11-30 13:53 ` [RFC 3/7] ext4: add aligned allocation support in mballoc Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api Ojaswin Mujoo
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

Enable inode preallocation support for aligned allocations. Inode
preallocation will only be used if the preallocated blocks are able to
satisfy the length and alignment requirements of the allocations, else
we disable preallocation for this particular allocation and proceed as
usual. Disabling inode preallocation is required otherwise we might end
up with overlapping preallocated ranges which can trigger a BUG() later.

While normalizing the request, we need to make sure that:

1. start of normalized(goal) request matches original request so it is
easier to align it during actual allocations. This prevents various edge
cases where the start of goal is different than original start making it
trickier to align the original start as requested by user.

2. the length of goal should not be smaller than original and should
   be a power of 2.

For now, group preallocation is disabled for aligned allocations.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/mballoc.c | 168 +++++++++++++++++++++++++++++-----------------
 1 file changed, 107 insertions(+), 61 deletions(-)

diff --git a/fs/ext4/mballoc.c b/fs/ext4/mballoc.c
index b1df531e6db3..c21b2758c3f0 100644
--- a/fs/ext4/mballoc.c
+++ b/fs/ext4/mballoc.c
@@ -2151,8 +2151,6 @@ static void ext4_mb_use_best_found(struct ext4_allocation_context *ac,
 	 * space in a special descriptor.
 	 */
 	if (ac->ac_o_ex.fe_len < ac->ac_b_ex.fe_len) {
-		/* Aligned allocation doesn't have preallocation support */
-		WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC);
 		ext4_mb_new_preallocation(ac);
 	}
 
@@ -2992,8 +2990,7 @@ ext4_mb_regular_allocator(struct ext4_allocation_context *ac)
 
 		WARN_ON(!is_power_of_2(len));
 		WARN_ON(start % len);
-		/* We don't support preallocation yet */
-		WARN_ON(ac->ac_b_ex.fe_len != ac->ac_o_ex.fe_len);
+		WARN_ON(ac->ac_b_ex.fe_len < ac->ac_o_ex.fe_len);
 	}
 
  exit:
@@ -4309,7 +4306,7 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	struct ext4_prealloc_space *tmp_pa = NULL, *left_pa = NULL, *right_pa = NULL;
 	struct rb_node *iter;
 	ext4_lblk_t new_start, tmp_pa_start, right_pa_start = -1;
-	loff_t new_end, tmp_pa_end, left_pa_end = -1;
+	loff_t size, new_end, tmp_pa_end, left_pa_end = -1;
 
 	new_start = *start;
 	new_end = *end;
@@ -4429,6 +4426,22 @@ ext4_mb_pa_adjust_overlap(struct ext4_allocation_context *ac,
 	}
 	read_unlock(&ei->i_prealloc_lock);
 
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) {
+		/*
+		 * Aligned allocation happens via CR_POWER2_ALIGNED criteria
+		 * hence we must make sure that the new size is a power of 2.
+		 */
+		size = new_end - new_start;
+		size = (loff_t)1 << (fls64(size) - 1);
+
+		if (unlikely(size < ac->ac_o_ex.fe_len))
+			size = ac->ac_o_ex.fe_len;
+		new_end = new_start + size;
+
+		WARN_ON(*start != new_start);
+		WARN_ON(!is_power_of_2(size));
+	}
+
 	/* XXX: extra loop to check we really don't overlap preallocations */
 	ext4_mb_pa_assert_overlap(ac, new_start, new_end);
 
@@ -4484,6 +4497,21 @@ static void ext4_mb_pa_predict_size(struct ext4_allocation_context *ac,
 					      ac->ac_o_ex.fe_len) << bsbits;
 	}
 
+	/*
+	 * For aligned allocations, we need to ensure 2 things:
+	 *
+	 * 1. The start should remain same as original start so that finding
+	 * aligned physical blocks for it is straight forward.
+	 *
+	 * 2. The new_size should not be less than the original len. This
+	 * can sometimes happen due to the way we predict size above.
+	 */
+	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) {
+		new_start = ac->ac_o_ex.fe_logical << bsbits;
+		new_size = max_t(loff_t, new_size,
+				 EXT4_C2B(sbi, ac->ac_o_ex.fe_len) << bsbits);
+	}
+
 	*size = new_size;
 	*start = new_start;
 }
@@ -4517,13 +4545,6 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	if (ac->ac_flags & EXT4_MB_HINT_NOPREALLOC)
 		return;
 
-	/*
-	 * caller may have strict alignment requirements. In this case, avoid
-	 * normalization since it is not alignment aware.
-	 */
-	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)
-		return;
-
 	if (ac->ac_flags & EXT4_MB_HINT_GROUP_ALLOC) {
 		ext4_mb_normalize_group_request(ac);
 		return ;
@@ -4557,8 +4578,13 @@ ext4_mb_normalize_request(struct ext4_allocation_context *ac,
 	start = max(start, rounddown(ac->ac_o_ex.fe_logical,
 			(ext4_lblk_t)EXT4_BLOCKS_PER_GROUP(ac->ac_sb)));
 
-	/* don't cover already allocated blocks in selected range */
+	/*
+	 * don't cover already allocated blocks in selected range. For aligned
+	 * alloc, since we don't change the original start we should ideally not
+	 * enter this if block.
+	 */
 	if (ar->pleft && start <= ar->lleft) {
+		WARN_ON(ac->ac_flags & EXT4_MB_ALIGNED_ALLOC);
 		size -= ar->lleft + 1 - start;
 		start = ar->lleft + 1;
 	}
@@ -4791,32 +4817,46 @@ ext4_mb_check_group_pa(ext4_fsblk_t goal_block,
 }
 
 /*
- * check if found pa meets EXT4_MB_HINT_GOAL_ONLY
+ * check if found pa meets EXT4_MB_HINT_GOAL_ONLY or EXT4_MB_ALIGNED_ALLOC
  */
 static bool
-ext4_mb_pa_goal_check(struct ext4_allocation_context *ac,
+ext4_mb_pa_check(struct ext4_allocation_context *ac,
 		      struct ext4_prealloc_space *pa)
 {
 	struct ext4_sb_info *sbi = EXT4_SB(ac->ac_sb);
 	ext4_fsblk_t start;
 
-	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY)))
+	if (likely(!(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY ||
+		     ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)))
 		return true;
 
-	/*
-	 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
-	 * in ext4_mb_normalize_request and will keep same with ac_o_ex
-	 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
-	 * consistent with ext4_mb_find_by_goal.
-	 */
-	start = pa->pa_pstart +
-		(ac->ac_g_ex.fe_logical - pa->pa_lstart);
-	if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
-		return false;
+	if (ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY) {
+		/*
+		 * If EXT4_MB_HINT_GOAL_ONLY is set, ac_g_ex will not be adjusted
+		 * in ext4_mb_normalize_request and will keep same with ac_o_ex
+		 * from ext4_mb_initialize_context. Choose ac_g_ex here to keep
+		 * consistent with ext4_mb_find_by_goal.
+		 */
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (ext4_grp_offs_to_block(ac->ac_sb, &ac->ac_g_ex) != start)
+			return false;
 
-	if (ac->ac_g_ex.fe_len > pa->pa_len -
-	    EXT4_B2C(sbi, ac->ac_g_ex.fe_logical - pa->pa_lstart))
-		return false;
+		if (ac->ac_g_ex.fe_len >
+		    pa->pa_len - EXT4_B2C(sbi, ac->ac_g_ex.fe_logical -
+						       pa->pa_lstart))
+			return false;
+	} else if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC) {
+		start = pa->pa_pstart +
+			(ac->ac_g_ex.fe_logical - pa->pa_lstart);
+		if (start % EXT4_C2B(sbi, ac->ac_g_ex.fe_len))
+			return false;
+
+		if (EXT4_C2B(sbi, ac->ac_g_ex.fe_len) >
+		    (EXT4_C2B(sbi, pa->pa_len) -
+		     (ac->ac_g_ex.fe_logical - pa->pa_lstart)))
+			return false;
+	}
 
 	return true;
 }
@@ -4839,10 +4879,6 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return false;
 
-	/* using preallocated blocks is not alignment aware. */
-	if (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC)
-		return false;
-
 	/*
 	 * first, try per-file preallocation by searching the inode pa rbtree.
 	 *
@@ -4948,41 +4984,49 @@ ext4_mb_use_preallocated(struct ext4_allocation_context *ac)
 		goto try_group_pa;
 	}
 
-	if (tmp_pa->pa_free && likely(ext4_mb_pa_goal_check(ac, tmp_pa))) {
+	if (tmp_pa->pa_free && likely(ext4_mb_pa_check(ac, tmp_pa))) {
 		atomic_inc(&tmp_pa->pa_count);
 		ext4_mb_use_inode_pa(ac, tmp_pa);
 		spin_unlock(&tmp_pa->pa_lock);
 		read_unlock(&ei->i_prealloc_lock);
 		return true;
 	} else {
+		if (tmp_pa->pa_free == 0)
+			/*
+			 * We found a valid overlapping pa but couldn't use it because
+			 * it had no free blocks. This should ideally never happen
+			 * because:
+			 *
+			 * 1. When a new inode pa is added to rbtree it must have
+			 *    pa_free > 0 since otherwise we won't actually need
+			 *    preallocation.
+			 *
+			 * 2. An inode pa that is in the rbtree can only have it's
+			 *    pa_free become zero when another thread calls:
+			 *      ext4_mb_new_blocks
+			 *       ext4_mb_use_preallocated
+			 *        ext4_mb_use_inode_pa
+			 *
+			 * 3. Further, after the above calls make pa_free == 0, we will
+			 *    immediately remove it from the rbtree in:
+			 *      ext4_mb_new_blocks
+			 *       ext4_mb_release_context
+			 *        ext4_mb_put_pa
+			 *
+			 * 4. Since the pa_free becoming 0 and pa_free getting removed
+			 * from tree both happen in ext4_mb_new_blocks, which is always
+			 * called with i_data_sem held for data allocations, we can be
+			 * sure that another process will never see a pa in rbtree with
+			 * pa_free == 0.
+			 */
+			WARN_ON_ONCE(tmp_pa->pa_free == 0);
 		/*
-		 * We found a valid overlapping pa but couldn't use it because
-		 * it had no free blocks. This should ideally never happen
-		 * because:
-		 *
-		 * 1. When a new inode pa is added to rbtree it must have
-		 *    pa_free > 0 since otherwise we won't actually need
-		 *    preallocation.
-		 *
-		 * 2. An inode pa that is in the rbtree can only have it's
-		 *    pa_free become zero when another thread calls:
-		 *      ext4_mb_new_blocks
-		 *       ext4_mb_use_preallocated
-		 *        ext4_mb_use_inode_pa
-		 *
-		 * 3. Further, after the above calls make pa_free == 0, we will
-		 *    immediately remove it from the rbtree in:
-		 *      ext4_mb_new_blocks
-		 *       ext4_mb_release_context
-		 *        ext4_mb_put_pa
-		 *
-		 * 4. Since the pa_free becoming 0 and pa_free getting removed
-		 * from tree both happen in ext4_mb_new_blocks, which is always
-		 * called with i_data_sem held for data allocations, we can be
-		 * sure that another process will never see a pa in rbtree with
-		 * pa_free == 0.
+		 * If we come here we need to disable preallocations else we'd
+		 * have multiple preallocations for the same logical offset
+		 * which is not allowed and will cause BUG_ONs to be triggered
+		 * later.
 		 */
-		WARN_ON_ONCE(tmp_pa->pa_free == 0);
+		ac->ac_flags |= EXT4_MB_HINT_NOPREALLOC;
 	}
 	spin_unlock(&tmp_pa->pa_lock);
 try_group_pa:
@@ -5818,6 +5862,7 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	int bsbits = ac->ac_sb->s_blocksize_bits;
 	loff_t size, isize;
 	bool inode_pa_eligible, group_pa_eligible;
+	bool is_aligned = (ac->ac_flags & EXT4_MB_ALIGNED_ALLOC);
 
 	if (!(ac->ac_flags & EXT4_MB_HINT_DATA))
 		return;
@@ -5825,7 +5870,8 @@ static void ext4_mb_group_or_file(struct ext4_allocation_context *ac)
 	if (unlikely(ac->ac_flags & EXT4_MB_HINT_GOAL_ONLY))
 		return;
 
-	group_pa_eligible = sbi->s_mb_group_prealloc > 0;
+	/* Aligned allocation does not support group pa */
+	group_pa_eligible = (!is_aligned && sbi->s_mb_group_prealloc > 0);
 	inode_pa_eligible = true;
 	size = extent_logical_end(sbi, &ac->ac_o_ex);
 	isize = (i_size_read(ac->ac_inode) + ac->ac_sb->s_blocksize - 1)
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
                   ` (3 preceding siblings ...)
  2023-11-30 13:53 ` [RFC 4/7] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-12-01 10:47   ` John Garry
  2023-11-30 13:53 ` [RFC 6/7] ext4: Add aligned allocation support for atomic direct io Ojaswin Mujoo
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

Export the blkdev_atomic_write_valid() function so that other filesystems
can call it as a part of validating the atomic write operation.

Further, refactor the api to accept a len argument instead of iov_iter to
make it easier to call from other places.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 block/fops.c           | 18 ++++++++++--------
 include/linux/blkdev.h |  2 ++
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/block/fops.c b/block/fops.c
index 516669ad69e5..5dae95c49720 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -41,8 +41,7 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
 		!bdev_iter_is_aligned(bdev, iter);
 }
 
-static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
-			      struct iov_iter *iter)
+bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len)
 {
 	unsigned int atomic_write_unit_min_bytes =
 			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
@@ -53,16 +52,17 @@ static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
 		return false;
 	if (pos % atomic_write_unit_min_bytes)
 		return false;
-	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
+	if (len % atomic_write_unit_min_bytes)
 		return false;
-	if (!is_power_of_2(iov_iter_count(iter)))
+	if (!is_power_of_2(len))
 		return false;
-	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
+	if (len > atomic_write_unit_max_bytes)
 		return false;
-	if (pos % iov_iter_count(iter))
+	if (pos % len)
 		return false;
 	return true;
 }
+EXPORT_SYMBOL_GPL(blkdev_atomic_write_valid);
 
 #define DIO_INLINE_BIO_VECS 4
 
@@ -81,7 +81,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
-	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+	if (atomic_write &&
+	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
 		return -EINVAL;
 
 	if (nr_pages <= DIO_INLINE_BIO_VECS)
@@ -348,7 +349,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
 	if (blkdev_dio_unaligned(bdev, pos, iter))
 		return -EINVAL;
 
-	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
+	if (atomic_write &&
+	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
 		return -EINVAL;
 
 	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index f70988083734..5a3124fc191f 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -1566,6 +1566,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
 int freeze_bdev(struct block_device *bdev);
 int thaw_bdev(struct block_device *bdev);
 
+bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len);
+
 struct io_comp_batch {
 	struct request *req_list;
 	bool need_ts;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 6/7] ext4: Add aligned allocation support for atomic direct io
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
                   ` (4 preceding siblings ...)
  2023-11-30 13:53 ` [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-11-30 13:53 ` [RFC 7/7] ext4: Support atomic write for statx Ojaswin Mujoo
  2023-12-04 10:36 ` [RFC 0/7] ext4: Allocator changes for atomic write support with DIO John Garry
  7 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

If the direct IO write is meant to be atomic, ext4 will now try to
allocate aligned physical blocks so that atomic write request can be
satisfied.

This patch also makes the ext4_map_blocks() family of function alignment
aware and defines ext4_map_blocks_aligned() function that can allow
users to ask for aligned blocks and has checks to ensure the returned
extent actually follows the alignment requirements. As usual, alignment
requirement is always determined by the length and the offset should be
naturally aligned to this len.

Although aligned mapping usually makes sense with EXT4_GET_BLOCKS_CREATE
we can call ext4_map_blocks_aligned() without that flag aswell. This can
be useful to check:

 1. If an aligned extent is already present and can be reused.

 2. If a pre-existing extent at the location can't satisfy the alignment
 in which case and aligned write of given len and offset won't be
 possible.

 3. If there is a hole, is it big enough that a subsequent map blocks
 would be able to allocate the required aligned extent of off and len.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/ext4.h              |   4 ++
 fs/ext4/extents.c           |  14 +++++
 fs/ext4/file.c              |  49 +++++++++++++++++
 fs/ext4/inode.c             | 104 +++++++++++++++++++++++++++++++++++-
 include/trace/events/ext4.h |   1 +
 5 files changed, 170 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
index 38a77148b85c..1a57662e6a7a 100644
--- a/fs/ext4/ext4.h
+++ b/fs/ext4/ext4.h
@@ -717,6 +717,8 @@ enum {
 #define EXT4_GET_BLOCKS_IO_SUBMIT		0x0400
 	/* Caller is in the atomic contex, find extent if it has been cached */
 #define EXT4_GET_BLOCKS_CACHED_NOWAIT		0x0800
+/* Caller wants strictly aligned allocation */
+#define EXT4_GET_BLOCKS_ALIGNED			0x1000
 
 /*
  * The bit position of these flags must not overlap with any of the
@@ -3683,6 +3685,8 @@ extern int ext4_convert_unwritten_io_end_vec(handle_t *handle,
 					     ext4_io_end_t *io_end);
 extern int ext4_map_blocks(handle_t *handle, struct inode *inode,
 			   struct ext4_map_blocks *map, int flags);
+extern int ext4_map_blocks_aligned(handle_t *handle, struct inode *inode,
+			   struct ext4_map_blocks *map, int flags);
 extern int ext4_ext_calc_credits_for_single_extent(struct inode *inode,
 						   int num,
 						   struct ext4_ext_path *path);
diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
index 202c76996b62..2334fa767a6b 100644
--- a/fs/ext4/extents.c
+++ b/fs/ext4/extents.c
@@ -4091,6 +4091,7 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 	int err = 0, depth, ret;
 	unsigned int allocated = 0, offset = 0;
 	unsigned int allocated_clusters = 0;
+	unsigned int orig_mlen = map->m_len;
 	struct ext4_allocation_request ar;
 	ext4_lblk_t cluster_offset;
 
@@ -4282,6 +4283,19 @@ int ext4_ext_map_blocks(handle_t *handle, struct inode *inode,
 		ar.flags |= EXT4_MB_DELALLOC_RESERVED;
 	if (flags & EXT4_GET_BLOCKS_METADATA_NOFAIL)
 		ar.flags |= EXT4_MB_USE_RESERVED;
+	if (flags & EXT4_GET_BLOCKS_ALIGNED) {
+		/*
+		 * During aligned allocation we dont want to map a length smaller
+		 * than the originally requested length since we use this len to
+		 * determine alignment and changing it can misalign the blocks.
+		 */
+		if (ar.len != orig_mlen) {
+			ext4_warning(inode->i_sb,
+				     "Tried to modify requested len of aligned allocation.");
+			goto out;
+		}
+		ar.flags |= EXT4_MB_ALIGNED_ALLOC;
+	}
 	newblock = ext4_mb_new_blocks(handle, &ar, &err);
 	if (!newblock)
 		goto out;
diff --git a/fs/ext4/file.c b/fs/ext4/file.c
index 6830ea3a6c59..c928c2e8c067 100644
--- a/fs/ext4/file.c
+++ b/fs/ext4/file.c
@@ -430,6 +430,48 @@ static const struct iomap_dio_ops ext4_dio_write_ops = {
 	.end_io = ext4_dio_write_end_io,
 };
 
+/*
+ * Check loff_t because the iov_iter_count() used in blkdev was size_t
+ */
+static bool ext4_dio_atomic_write_checks(struct kiocb *iocb,
+					 struct iov_iter *from)
+{
+	struct inode *inode = iocb->ki_filp->f_inode;
+	struct block_device *bdev = inode->i_sb->s_bdev;
+	size_t len = iov_iter_count(from);
+	loff_t pos = iocb->ki_pos;
+	u8 blkbits = inode->i_blkbits;
+
+	/*
+	 * Currently aligned alloc, which is needed for atomic IO, is only
+	 * supported with extent based files and non bigalloc file systems
+	 */
+	if (EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) {
+		ext4_warning(inode->i_sb,
+			     "Atomic write not supported with bigalloc");
+		return false;
+	}
+	if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))) {
+		ext4_warning(inode->i_sb,
+			     "Atomic write not supported on non-extent files");
+		return false;
+	}
+	if (len & ((1 << blkbits) - 1))
+		/* len should be blocksize aligned */
+		return false;
+	else if (pos % len)
+		/* pos should be naturally aligned to len */
+		return false;
+	else if (!is_power_of_2(len >> blkbits))
+		/*
+		 * len in blocks should be power of 2 for mballoc to ensure
+		 * alignment
+		 */
+		return false;
+
+	return blkdev_atomic_write_valid(bdev, pos, len);
+}
+
 /*
  * The intention here is to start with shared lock acquired then see if any
  * condition requires an exclusive inode lock. If yes, then we restart the
@@ -458,12 +500,19 @@ static ssize_t ext4_dio_write_checks(struct kiocb *iocb, struct iov_iter *from,
 	size_t count;
 	ssize_t ret;
 	bool overwrite, unaligned_io;
+	bool atomic_write = (iocb->ki_flags & IOCB_ATOMIC);
 
 restart:
 	ret = ext4_generic_write_checks(iocb, from);
 	if (ret <= 0)
 		goto out;
 
+	if (atomic_write && !ext4_dio_atomic_write_checks(iocb, from)) {
+		ext4_warning(inode->i_sb, "Atomic write checks failed.");
+		ret = -EIO;
+		goto out;
+	}
+
 	offset = iocb->ki_pos;
 	count = ret;
 
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 4ce35f1c8b0a..d185ec54ffa3 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -453,6 +453,77 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
 }
 #endif /* ES_AGGRESSIVE_TEST */
 
+/*
+ * This function checks if the map returned by ext4_map_blocks satisfies aligned
+ * allocation requirements. This should be used as the entry point for aligned
+ * allocations
+ */
+static bool ext4_map_check_alignment(struct ext4_map_blocks *map,
+				     unsigned int orig_mlen,
+				     ext4_lblk_t orig_mlblk,
+				     int flags)
+{
+	if (flags & EXT4_GET_BLOCKS_CREATE) {
+		/*
+		 * A create lookup must be mapped to satisfy alignment
+		 * requirements
+		 */
+		if (!(map->m_flags & EXT4_MAP_MAPPED))
+			return false;
+	} else {
+		/*
+		 * For create=0, if we find a hole, this hole should be big
+		 * enough to accommodate our aligned extent later
+		 */
+		if (!(map->m_flags & EXT4_MAP_MAPPED) &&
+		    (!(map->m_flags & EXT4_MAP_UNWRITTEN))) {
+			if (map->m_len < orig_mlen)
+				return false;
+			if (map->m_lblk != orig_mlblk)
+				/* Ideally shouldn't happen */
+				return false;
+			return true;
+		}
+	}
+
+	/*
+	 * For all the remaining cases, to satisfy alignment, the extent should
+	 * be exactly as big as requests and be at the right physical block
+	 * alignment
+	 */
+	if (map->m_len != orig_mlen)
+		return false;
+	if (map->m_lblk != orig_mlblk)
+		return false;
+	if (!map->m_len || map->m_pblk % map->m_len)
+		return false;
+
+	return true;
+}
+
+int ext4_map_blocks_aligned(handle_t *handle, struct inode *inode,
+			    struct ext4_map_blocks *map, int flags)
+{
+	int ret;
+	unsigned int orig_mlen = map->m_len;
+	ext4_lblk_t orig_mlblk = map->m_lblk;
+
+	if (flags & EXT4_GET_BLOCKS_CREATE)
+		flags |= EXT4_GET_BLOCKS_ALIGNED;
+
+	ret = ext4_map_blocks(handle, inode, map, flags);
+
+	if (ret >= 0 &&
+	    !ext4_map_check_alignment(map, orig_mlen, orig_mlblk, flags)) {
+		ext4_warning(
+			inode->i_sb,
+			"Returned extent couldn't satisfy alignment requirements");
+		ret = -EIO;
+	}
+
+	return ret;
+}
+
 /*
  * The ext4_map_blocks() function tries to look up the requested blocks,
  * and returns if the blocks are already mapped.
@@ -474,6 +545,12 @@ static void ext4_map_blocks_es_recheck(handle_t *handle,
  * indicate the length of a hole starting at map->m_lblk.
  *
  * It returns the error in case of allocation failure.
+ *
+ * Note for aligned allocations: While most of the alignment related checks are
+ * done by higher level functions, we do have some optimizations here. When
+ * trying to *create* a new aligned extent if at any point we are sure that the
+ * extent won't be as big as the full length, we exit early instead of going for
+ * the allocation and failing later.
  */
 int ext4_map_blocks(handle_t *handle, struct inode *inode,
 		    struct ext4_map_blocks *map, int flags)
@@ -481,6 +558,7 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	struct extent_status es;
 	int retval;
 	int ret = 0;
+	unsigned int orig_mlen = map->m_len;
 #ifdef ES_AGGRESSIVE_TEST
 	struct ext4_map_blocks orig_map;
 
@@ -583,6 +661,12 @@ int ext4_map_blocks(handle_t *handle, struct inode *inode,
 	if ((flags & EXT4_GET_BLOCKS_CREATE) == 0)
 		return retval;
 
+	/* For aligned allocation, we must not change original alignment */
+	if (retval < 0 && (flags & EXT4_GET_BLOCKS_ALIGNED) &&
+	    map->m_len != orig_mlen) {
+		return retval;
+	}
+
 	/*
 	 * Returns if the blocks have already allocated
 	 *
@@ -3307,7 +3391,10 @@ static int ext4_iomap_alloc(struct inode *inode, struct ext4_map_blocks *map,
 	else if (ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS))
 		m_flags = EXT4_GET_BLOCKS_IO_CREATE_EXT;
 
-	ret = ext4_map_blocks(handle, inode, map, m_flags);
+	if (flags & IOMAP_ATOMIC_WRITE)
+		ret = ext4_map_blocks_aligned(handle, inode, map, m_flags);
+	else
+		ret = ext4_map_blocks(handle, inode, map, m_flags);
 
 	/*
 	 * We cannot fill holes in indirect tree based inodes as that could
@@ -3353,7 +3440,11 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 		 * especially in multi-threaded overwrite requests.
 		 */
 		if (offset + length <= i_size_read(inode)) {
-			ret = ext4_map_blocks(NULL, inode, &map, 0);
+			if (flags & IOMAP_ATOMIC_WRITE)
+				ret = ext4_map_blocks_aligned(NULL, inode, &map, 0);
+			else
+				ret = ext4_map_blocks(NULL, inode, &map, 0);
+
 			if (ret > 0 && (map.m_flags & EXT4_MAP_MAPPED))
 				goto out;
 		}
@@ -3372,6 +3463,15 @@ static int ext4_iomap_begin(struct inode *inode, loff_t offset, loff_t length,
 	 */
 	map.m_len = fscrypt_limit_io_blocks(inode, map.m_lblk, map.m_len);
 
+	/*
+	 * Ensure the found extent meets the alignment requirements for aligned
+	 * allocation
+	 */
+	if ((flags & IOMAP_ATOMIC_WRITE) &&
+	    ((map.m_pblk << blkbits) % length ||
+	     (map.m_len << blkbits) != length))
+		return -EIO;
+
 	ext4_set_iomap(inode, iomap, &map, offset, length, flags);
 
 	return 0;
diff --git a/include/trace/events/ext4.h b/include/trace/events/ext4.h
index 56895cfb5781..7bf116021408 100644
--- a/include/trace/events/ext4.h
+++ b/include/trace/events/ext4.h
@@ -50,6 +50,7 @@ struct partial_cluster;
 	{ EXT4_GET_BLOCKS_CONVERT_UNWRITTEN,	"CONVERT_UNWRITTEN" },  \
 	{ EXT4_GET_BLOCKS_ZERO,			"ZERO" },		\
 	{ EXT4_GET_BLOCKS_IO_SUBMIT,		"IO_SUBMIT" },		\
+	{ EXT4_GET_BLOCKS_ALIGNED,		"ALIGNED" },		\
 	{ EXT4_EX_NOCACHE,			"EX_NOCACHE" })
 
 /*
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* [RFC 7/7] ext4: Support atomic write for statx
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
                   ` (5 preceding siblings ...)
  2023-11-30 13:53 ` [RFC 6/7] ext4: Add aligned allocation support for atomic direct io Ojaswin Mujoo
@ 2023-11-30 13:53 ` Ojaswin Mujoo
  2023-12-04 10:36 ` [RFC 0/7] ext4: Allocator changes for atomic write support with DIO John Garry
  7 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-11-30 13:53 UTC (permalink / raw
  To: linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, John Garry, dchinner

Support providing info on atomic write unit min and max for an inode.

For simplicity, currently we limit the min at the FS block size, but a
lower limit could be supported in future.

Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
---
 fs/ext4/inode.c | 38 ++++++++++++++++++++++++++++++++++++--
 1 file changed, 36 insertions(+), 2 deletions(-)

diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index d185ec54ffa3..c8f974d0f113 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5621,6 +5621,7 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 	struct ext4_inode *raw_inode;
 	struct ext4_inode_info *ei = EXT4_I(inode);
 	unsigned int flags;
+	struct block_device *bdev = inode->i_sb->s_bdev;
 
 	if ((request_mask & STATX_BTIME) &&
 	    EXT4_FITS_IN_INODE(raw_inode, ei, i_crtime)) {
@@ -5639,8 +5640,6 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 
 		stat->result_mask |= STATX_DIOALIGN;
 		if (dio_align == 1) {
-			struct block_device *bdev = inode->i_sb->s_bdev;
-
 			/* iomap defaults */
 			stat->dio_mem_align = bdev_dma_alignment(bdev) + 1;
 			stat->dio_offset_align = bdev_logical_block_size(bdev);
@@ -5650,6 +5649,41 @@ int ext4_getattr(struct mnt_idmap *idmap, const struct path *path,
 		}
 	}
 
+	if ((request_mask & STATX_WRITE_ATOMIC)) {
+		unsigned int awumin, awumax;
+		unsigned int blocksize = 1 << inode->i_blkbits;
+
+		awumin = queue_atomic_write_unit_min_bytes(bdev->bd_queue);
+		awumax = queue_atomic_write_unit_max_bytes(bdev->bd_queue);
+
+		if (!(ext4_test_inode_flag(inode, EXT4_INODE_EXTENTS)) ||
+		    EXT4_SB(inode->i_sb)->s_cluster_ratio > 1) {
+			/*
+			 * Currently not supported for non extent files or
+			 * with bigalloc
+			 */
+			stat->atomic_write_unit_min = 0;
+			stat->atomic_write_unit_max = 0;
+		} else if (awumin && awumax) {
+			/*
+			 * For now we support atomic writes which are
+			 * at least block size bytes. If that exceeds the
+			 * max atomic unit, then don't advertise support
+			 */
+			stat->atomic_write_unit_min = max(awumin, blocksize);
+
+			if (awumax < stat->atomic_write_unit_min) {
+				stat->atomic_write_unit_min = 0;
+				stat->atomic_write_unit_max = 0;
+			} else {
+				stat->atomic_write_unit_max = awumax;
+				stat->attributes |= STATX_ATTR_WRITE_ATOMIC;
+			}
+		}
+		stat->attributes_mask |= STATX_ATTR_WRITE_ATOMIC;
+		stat->result_mask |= STATX_WRITE_ATOMIC;
+	}
+
 	flags = ei->i_flags & EXT4_FL_USER_VISIBLE;
 	if (flags & EXT4_APPEND_FL)
 		stat->attributes |= STATX_ATTR_APPEND;
-- 
2.39.3


^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-11-30 13:53 ` [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Ojaswin Mujoo
@ 2023-11-30 21:10   ` Dave Chinner
  2023-12-01 10:42     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Dave Chinner @ 2023-11-30 21:10 UTC (permalink / raw
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel,
	John Garry, dchinner

On Thu, Nov 30, 2023 at 07:23:09PM +0530, Ojaswin Mujoo wrote:
> Currently, iomap only supports atomic writes for direct IOs and there is
> no guarantees that a buffered IO will be atomic. Hence, if the user has
> explicitly requested the direct write to be atomic and there's a
> failure, return -EIO instead of falling back to buffered IO.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> ---
>  fs/iomap/direct-io.c | 8 +++++++-
>  1 file changed, 7 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> index 6ef25e26f1a1..3e7cd9bc8f4d 100644
> --- a/fs/iomap/direct-io.c
> +++ b/fs/iomap/direct-io.c
> @@ -662,7 +662,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>  			if (ret != -EAGAIN) {
>  				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
>  								iomi.len);
> -				ret = -ENOTBLK;
> +				/*
> +				 * if this write was supposed to be atomic,
> +				 * return the err rather than trying to fall
> +				 * back to buffered IO.
> +				 */
> +				if (!atomic_write)
> +					ret = -ENOTBLK;

This belongs in the caller when it receives an -ENOTBLK from
iomap_dio_rw(). The iomap code is saying "this IO cannot be done
with direct IO" by returning this value, and then the caller can
make the determination of whether to run a buffered IO or not.

For example, a filesystem might still be able to perform an atomic
IO via a COW-based buffered IO slow path. Sure, ext4 can't do this,
but the above patch would prevent filesystems that could from being
able to implement such a fallback....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-11-30 21:10   ` Dave Chinner
@ 2023-12-01 10:42     ` John Garry
  2023-12-01 13:27       ` Matthew Wilcox
  2023-12-01 22:07       ` Dave Chinner
  0 siblings, 2 replies; 32+ messages in thread
From: John Garry @ 2023-12-01 10:42 UTC (permalink / raw
  To: Dave Chinner, Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On 30/11/2023 21:10, Dave Chinner wrote:
> On Thu, Nov 30, 2023 at 07:23:09PM +0530, Ojaswin Mujoo wrote:
>> Currently, iomap only supports atomic writes for direct IOs and there is
>> no guarantees that a buffered IO will be atomic. Hence, if the user has
>> explicitly requested the direct write to be atomic and there's a
>> failure, return -EIO instead of falling back to buffered IO.
>>
>> Signed-off-by: Ojaswin Mujoo<ojaswin@linux.ibm.com>
>> ---
>>   fs/iomap/direct-io.c | 8 +++++++-
>>   1 file changed, 7 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
>> index 6ef25e26f1a1..3e7cd9bc8f4d 100644
>> --- a/fs/iomap/direct-io.c
>> +++ b/fs/iomap/direct-io.c
>> @@ -662,7 +662,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
>>   			if (ret != -EAGAIN) {
>>   				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
>>   								iomi.len);
>> -				ret = -ENOTBLK;
>> +				/*
>> +				 * if this write was supposed to be atomic,
>> +				 * return the err rather than trying to fall
>> +				 * back to buffered IO.
>> +				 */
>> +				if (!atomic_write)
>> +					ret = -ENOTBLK;
> This belongs in the caller when it receives an -ENOTBLK from
> iomap_dio_rw(). The iomap code is saying "this IO cannot be done
> with direct IO" by returning this value, and then the caller can
> make the determination of whether to run a buffered IO or not.
> 
> For example, a filesystem might still be able to perform an atomic
> IO via a COW-based buffered IO slow path. Sure, ext4 can't do this,
> but the above patch would prevent filesystems that could from being
> able to implement such a fallback....

Sure, and I think that we need a better story for supporting buffered IO 
for atomic writes.

Currently we have:
- man pages tell us RWF_ATOMIC is only supported for direct IO
- statx gives atomic write unit min/max, not explicitly telling us it's 
for direct IO
- RWF_ATOMIC is ignored for !O_DIRECT

So I am thinking of expanding statx support to enable querying of atomic 
write capabilities for buffered IO and direct IO separately.

Thanks,
John


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api
  2023-11-30 13:53 ` [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api Ojaswin Mujoo
@ 2023-12-01 10:47   ` John Garry
  2023-12-11 10:57     ` Ojaswin Mujoo
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2023-12-01 10:47 UTC (permalink / raw
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, dchinner

On 30/11/2023 13:53, Ojaswin Mujoo wrote:
> Export the blkdev_atomic_write_valid() function so that other filesystems
> can call it as a part of validating the atomic write operation.
> 
> Further, refactor the api to accept a len argument instead of iov_iter to
> make it easier to call from other places.
> 
> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>

I was actually thinking of moving this functionality to vfs and maybe 
also calling earlier in write path, as the code is really common to 
blkdev and FSes.

However, Christoph Hellwig was not so happy about current interface with 
power-of-2 requirement et al, so I was going to wait until that 
discussion is concluded before deciding.

Thanks,
John

> ---
>   block/fops.c           | 18 ++++++++++--------
>   include/linux/blkdev.h |  2 ++
>   2 files changed, 12 insertions(+), 8 deletions(-)
> 
> diff --git a/block/fops.c b/block/fops.c
> index 516669ad69e5..5dae95c49720 100644
> --- a/block/fops.c
> +++ b/block/fops.c
> @@ -41,8 +41,7 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
>   		!bdev_iter_is_aligned(bdev, iter);
>   }
>   
> -static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> -			      struct iov_iter *iter)
> +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len)
>   {
>   	unsigned int atomic_write_unit_min_bytes =
>   			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> @@ -53,16 +52,17 @@ static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
>   		return false;
>   	if (pos % atomic_write_unit_min_bytes)
>   		return false;
> -	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> +	if (len % atomic_write_unit_min_bytes)
>   		return false;
> -	if (!is_power_of_2(iov_iter_count(iter)))
> +	if (!is_power_of_2(len))
>   		return false;
> -	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> +	if (len > atomic_write_unit_max_bytes)
>   		return false;
> -	if (pos % iov_iter_count(iter))
> +	if (pos % len)
>   		return false;
>   	return true;
>   }
> +EXPORT_SYMBOL_GPL(blkdev_atomic_write_valid);
>   
>   #define DIO_INLINE_BIO_VECS 4
>   
> @@ -81,7 +81,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
>   	if (blkdev_dio_unaligned(bdev, pos, iter))
>   		return -EINVAL;
>   
> -	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
> +	if (atomic_write &&
> +	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
>   		return -EINVAL;
>   
>   	if (nr_pages <= DIO_INLINE_BIO_VECS)
> @@ -348,7 +349,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
>   	if (blkdev_dio_unaligned(bdev, pos, iter))
>   		return -EINVAL;
>   
> -	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
> +	if (atomic_write &&
> +	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
>   		return -EINVAL;
>   
>   	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index f70988083734..5a3124fc191f 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -1566,6 +1566,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
>   int freeze_bdev(struct block_device *bdev);
>   int thaw_bdev(struct block_device *bdev);
>   
> +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len);
> +
>   struct io_comp_batch {
>   	struct request *req_list;
>   	bool need_ts;


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-01 10:42     ` John Garry
@ 2023-12-01 13:27       ` Matthew Wilcox
  2023-12-01 19:06         ` John Garry
  2023-12-01 22:07       ` Dave Chinner
  1 sibling, 1 reply; 32+ messages in thread
From: Matthew Wilcox @ 2023-12-01 13:27 UTC (permalink / raw
  To: John Garry
  Cc: Dave Chinner, Ojaswin Mujoo, linux-ext4, Theodore Ts'o,
	Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, dchinner

On Fri, Dec 01, 2023 at 10:42:57AM +0000, John Garry wrote:
> Sure, and I think that we need a better story for supporting buffered IO for
> atomic writes.
> 
> Currently we have:
> - man pages tell us RWF_ATOMIC is only supported for direct IO
> - statx gives atomic write unit min/max, not explicitly telling us it's for
> direct IO
> - RWF_ATOMIC is ignored for !O_DIRECT
> 
> So I am thinking of expanding statx support to enable querying of atomic
> write capabilities for buffered IO and direct IO separately.

Or ... we could support RWF_ATOMIC in the page cache?

I haven't particularly been following the atomic writes patchset, but
for filesystems which support large folios, we now create large folios
in the write path.  I see four problems to solve:

1. We might already have a smaller folio in the page cache from an
   earlier access,  We'd have to kick it out before creating a new folio
   that is the appropriate size.

2. We currently believe it's always OK to fall back to allocating smaller
   folios if memory allocation fails.  We'd need to change that policy
   (which we need to modify anyway for the bs>PS support).

3. We need to somewhere keep the information that writeback of this
   folio has to use the atomic commands.  Maybe it becomes a per-inode
   flag so that all writeback from this inode now uses the atomic
   commands?

4. If somebody does a weird thing like truncate/holepunch into the
   middle of the folio, we need to define what we do.  It's conceptually
   a bizarre thing to do, so I can't see any user actually wanting to
   do that ... but we need to define the semantics.

Maybe there are things I haven't thought of.  And of course, some
filesystems don't support large folios yet.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-01 13:27       ` Matthew Wilcox
@ 2023-12-01 19:06         ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-01 19:06 UTC (permalink / raw
  To: Matthew Wilcox
  Cc: Dave Chinner, Ojaswin Mujoo, linux-ext4, Theodore Ts'o,
	Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, dchinner

On 01/12/2023 13:27, Matthew Wilcox wrote:
>> Sure, and I think that we need a better story for supporting buffered IO for
>> atomic writes.
>>
>> Currently we have:
>> - man pages tell us RWF_ATOMIC is only supported for direct IO
>> - statx gives atomic write unit min/max, not explicitly telling us it's for
>> direct IO
>> - RWF_ATOMIC is ignored for !O_DIRECT
>>
>> So I am thinking of expanding statx support to enable querying of atomic
>> write capabilities for buffered IO and direct IO separately.
> Or ... we could support RWF_ATOMIC in the page cache?
> 
> I haven't particularly been following the atomic writes patchset,

Some background is that we are focused on direct IO as the database 
applications we're interested in use direct IO, but there are other DBs 
which do not support direct IO (and want atomic write support).

> but
> for filesystems which support large folios, we now create large folios
> in the write path.  I see four problems to solve:
> 
> 1. We might already have a smaller folio in the page cache from an
>     earlier access,  We'd have to kick it out before creating a new folio
>     that is the appropriate size.

Understood. Even though we give scope to do atomic writes of variable 
size, we do expect applications to use a fixed size mostly. In addition, 
typically we would expect only atomic or non-atomic writes. But what you 
say would be possible.

> 
> 2. We currently believe it's always OK to fall back to allocating smaller
>     folios if memory allocation fails.  We'd need to change that policy
>     (which we need to modify anyway for the bs>PS support).

ok

> 
> 3. We need to somewhere keep the information that writeback of this
>     folio has to use the atomic commands.  Maybe it becomes a per-inode
>     flag so that all writeback from this inode now uses the atomic
>     commands?

I'm not sure. Currently atomic writes are simply flagged per IO, and 
per-inode atomic flags are something which we have avoided so far.

> 
> 4. If somebody does a weird thing like truncate/holepunch into the
>     middle of the folio, we need to define what we do.  It's conceptually
>     a bizarre thing to do, so I can't see any user actually wanting to
>     do that ... but we need to define the semantics.

ok

> 
> Maybe there are things I haven't thought of.  And of course, some
> filesystems don't support large folios yet.

I may consider a PoC...

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-01 10:42     ` John Garry
  2023-12-01 13:27       ` Matthew Wilcox
@ 2023-12-01 22:07       ` Dave Chinner
  2023-12-04  9:02         ` John Garry
  2023-12-07 12:43         ` John Garry
  1 sibling, 2 replies; 32+ messages in thread
From: Dave Chinner @ 2023-12-01 22:07 UTC (permalink / raw
  To: John Garry
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On Fri, Dec 01, 2023 at 10:42:57AM +0000, John Garry wrote:
> On 30/11/2023 21:10, Dave Chinner wrote:
> > On Thu, Nov 30, 2023 at 07:23:09PM +0530, Ojaswin Mujoo wrote:
> > > Currently, iomap only supports atomic writes for direct IOs and there is
> > > no guarantees that a buffered IO will be atomic. Hence, if the user has
> > > explicitly requested the direct write to be atomic and there's a
> > > failure, return -EIO instead of falling back to buffered IO.
> > > 
> > > Signed-off-by: Ojaswin Mujoo<ojaswin@linux.ibm.com>
> > > ---
> > >   fs/iomap/direct-io.c | 8 +++++++-
> > >   1 file changed, 7 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/fs/iomap/direct-io.c b/fs/iomap/direct-io.c
> > > index 6ef25e26f1a1..3e7cd9bc8f4d 100644
> > > --- a/fs/iomap/direct-io.c
> > > +++ b/fs/iomap/direct-io.c
> > > @@ -662,7 +662,13 @@ __iomap_dio_rw(struct kiocb *iocb, struct iov_iter *iter,
> > >   			if (ret != -EAGAIN) {
> > >   				trace_iomap_dio_invalidate_fail(inode, iomi.pos,
> > >   								iomi.len);
> > > -				ret = -ENOTBLK;
> > > +				/*
> > > +				 * if this write was supposed to be atomic,
> > > +				 * return the err rather than trying to fall
> > > +				 * back to buffered IO.
> > > +				 */
> > > +				if (!atomic_write)
> > > +					ret = -ENOTBLK;
> > This belongs in the caller when it receives an -ENOTBLK from
> > iomap_dio_rw(). The iomap code is saying "this IO cannot be done
> > with direct IO" by returning this value, and then the caller can
> > make the determination of whether to run a buffered IO or not.
> > 
> > For example, a filesystem might still be able to perform an atomic
> > IO via a COW-based buffered IO slow path. Sure, ext4 can't do this,
> > but the above patch would prevent filesystems that could from being
> > able to implement such a fallback....
> 
> Sure, and I think that we need a better story for supporting buffered IO for
> atomic writes.
> 
> Currently we have:
> - man pages tell us RWF_ATOMIC is only supported for direct IO
> - statx gives atomic write unit min/max, not explicitly telling us it's for
> direct IO
> - RWF_ATOMIC is ignored for !O_DIRECT
> 
> So I am thinking of expanding statx support to enable querying of atomic
> write capabilities for buffered IO and direct IO separately.

You're over complicating this way too much by trying to restrict the
functionality down to just what you want to implement right now.

RWF_ATOMIC is no different to RWF_NOWAIT. The API doesn't decide
what can be supported - the filesystems themselves decide what part
of the API they can support and implement those pieces.

TO go back to RWF_NOWAIT, for a long time we (XFS) only supported
RWF_NOWAIT on DIO, and buffered reads and writes were given
-EOPNOTSUPP by the filesystem. Then other filesystems started
supporting DIO with RWF_NOWAIT. Then buffered read support was added
to the page cache and XFS, and as other filesystems were converted
they removed the RWF_NOWAIT exclusion check from their read IO
paths.

We are now in the same place with buffered write support for
RWF_NOWAIT. XFS, the page cache and iomap allow buffered writes w/
RWF_NOWAIT, but ext4, btrfs and f2fs still all return -EOPNOTSUPP
because they don't support non-blocking buffered writes yet.

This is the same model we should be applying with RWF_ATOMIC - we
know that over time we'll be able to expand support for atomic
writes across both direct and buffered IO, so we should not be
restricting the API or infrastructure to only allow RWF_ATOMIC w/
DIO. Just have the filesystems reject RWF_ATOMIC w/ -EOPNOTSUPP if
they don't support it, and for those that do it is conditional on
whther the filesystem supports it for the given type of IO being
done.

Seriously - an application can easily probe for RWF_ATOMIC support
without needing information to be directly exposed in statx() - just
open a O_TMPFILE, issue the type of RWF_ATOMIC IO you require to be
supported, and if it returns -EOPNOTSUPP then it you can't use
RWF_ATOMIC optimisations in the application....

-Dave.
-- 
Dave Chinner
david@fromorbit.com

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-01 22:07       ` Dave Chinner
@ 2023-12-04  9:02         ` John Garry
  2023-12-04 18:17           ` Darrick J. Wong
  2023-12-07 12:43         ` John Garry
  1 sibling, 1 reply; 32+ messages in thread
From: John Garry @ 2023-12-04  9:02 UTC (permalink / raw
  To: Dave Chinner
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On 01/12/2023 22:07, Dave Chinner wrote:
>> Sure, and I think that we need a better story for supporting buffered IO for
>> atomic writes.
>>
>> Currently we have:
>> - man pages tell us RWF_ATOMIC is only supported for direct IO
>> - statx gives atomic write unit min/max, not explicitly telling us it's for
>> direct IO
>> - RWF_ATOMIC is ignored for !O_DIRECT
>>
>> So I am thinking of expanding statx support to enable querying of atomic
>> write capabilities for buffered IO and direct IO separately.
> You're over complicating this way too much by trying to restrict the
> functionality down to just what you want to implement right now.
> 
> RWF_ATOMIC is no different to RWF_NOWAIT. The API doesn't decide
> what can be supported - the filesystems themselves decide what part
> of the API they can support and implement those pieces.

Sure, but for RWF_ATOMIC we still have the associated statx call to tell 
us whether atomic writes are supported for a file and the specific range 
capability.

> 
> TO go back to RWF_NOWAIT, for a long time we (XFS) only supported
> RWF_NOWAIT on DIO, and buffered reads and writes were given
> -EOPNOTSUPP by the filesystem. Then other filesystems started
> supporting DIO with RWF_NOWAIT. Then buffered read support was added
> to the page cache and XFS, and as other filesystems were converted
> they removed the RWF_NOWAIT exclusion check from their read IO
> paths.
> 
> We are now in the same place with buffered write support for
> RWF_NOWAIT. XFS, the page cache and iomap allow buffered writes w/
> RWF_NOWAIT, but ext4, btrfs and f2fs still all return -EOPNOTSUPP
> because they don't support non-blocking buffered writes yet.
> 
> This is the same model we should be applying with RWF_ATOMIC - we
> know that over time we'll be able to expand support for atomic
> writes across both direct and buffered IO, so we should not be
> restricting the API or infrastructure to only allow RWF_ATOMIC w/
> DIO.

Agreed.

> Just have the filesystems reject RWF_ATOMIC w/ -EOPNOTSUPP if
> they don't support it,

Yes, I was going to add this regardless.

> and for those that do it is conditional on
> whther the filesystem supports it for the given type of IO being
> done.
> 
> Seriously - an application can easily probe for RWF_ATOMIC support
> without needing information to be directly exposed in statx() - just
> open a O_TMPFILE, issue the type of RWF_ATOMIC IO you require to be
> supported, and if it returns -EOPNOTSUPP then it you can't use
> RWF_ATOMIC optimisations in the application....

ok, if that is the done thing.

So I can't imagine that atomic write unit range will be different for 
direct IO and buffered IO (ignoring for a moment Christoph's idea for 
CoW always for no HW offload) when supported. But it seems that we may 
have a scenario where statx tells is that atomic writes are supported 
for a file, and a DIO write succeeds and a buffered IO write may return 
-EOPNOTSUPP. If that's acceptable then I'll work towards that.

If we could just run statx on a file descriptor here then that would be 
simpler...

Thanks,
John



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
                   ` (6 preceding siblings ...)
  2023-11-30 13:53 ` [RFC 7/7] ext4: Support atomic write for statx Ojaswin Mujoo
@ 2023-12-04 10:36 ` John Garry
  2023-12-04 13:38   ` Ojaswin Mujoo
  7 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2023-12-04 10:36 UTC (permalink / raw
  To: Ojaswin Mujoo, linux-ext4, Theodore Ts'o
  Cc: Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, dchinner

On 30/11/2023 13:53, Ojaswin Mujoo wrote:

Thanks for putting this together.

> This patch series builds on top of John Gary's atomic direct write
> patch series [1] and enables this support in ext4. This is a 2 step
> process:
> 
> 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate
> power-of-2 aligned physical blocks, which is needed for atomic writes.
> 
> 2. Hook the direct IO path in ext4 to use aligned allocation to obtain
> physical blocks at a given alignment, which is needed for atomic IO. If
> for any reason we are not able to obtain blocks at given alignment we
> fail the atomic write.

So are we supposed to be doing atomic writes on unwritten ranges only in 
the file to get the aligned allocations?

I actually tried that, and I got a WARN triggered:

# mkfs.ext4 /dev/sda
mke2fs 1.46.5 (30-Dec-2021)
Creating filesystem with 358400 1k blocks and 89760 inodes
Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9
Superblock backups stored on blocks:
         8193, 24577, 40961, 57345, 73729, 204801, 221185

Allocating group tables: done
Writing inode tables: done
Creating journal (8192 blocks): done
Writing superblocks and filesystem accounting information: done

[   12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left
# mount /dev/sda mnt
[   12.798804] EXT4-fs (sda): mounted filesystem
7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota
mode: none.
# touch mnt/file
#
# /test-statx -a /root/mnt/file
statx(/root/mnt/file) = 0
dump_statx results=5fff
   Size: 0               Blocks: 0          IO Block: 1024    regular file
Device: 08:00           Inode: 12          Links: 1
Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
Access: 2023-12-04 10:27:40.002848720+0000
Modify: 2023-12-04 10:27:40.002848720+0000
Change: 2023-12-04 10:27:40.002848720+0000
  Birth: 2023-12-04 10:27:40.002848720+0000
stx_attributes_mask=0x703874
         STATX_ATTR_WRITE_ATOMIC set
         unit min: 1024
         uunit max: 524288
Attributes: 0000000000400000 (........ ........ ........ ........
........ .?--.... ..---... .---.-..)
#



looks ok so far, then write 4KB at offset 0:

# /test-pwritev2 -a -d -p 0 -l 4096  /root/mnt/file
file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
[   46.813720] ------------[ cut here ]------------
[   46.814934] WARNING: CPU: 1 PID: 158 at fs/ext4/mballoc.c:2991
ext4_mb_regular_allocator+0xeca/0xf20
[   46.816344] Modules linked in:
[   46.816831] CPU: 1 PID: 158 Comm: test-pwritev2 Not tainted
6.7.0-rc1-00038-gae3807f27e7d-dirty #968
[   46.818220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
[   46.819886] RIP: 0010:ext4_mb_regular_allocator+0xeca/0xf20
[   46.820734] Code: fd ff ff f0 41 ff 81 e4 03 00 00 e9 63 fd ff ff
90 0f 0b 90 e9 fe f3 ff ff 90 48 c7 c7 e2 7a b2 86 44 89 ca e8 f7 f1
d2 ff 90 <0f> 0b 90 90 45 8b 44 24 3c e9 d4 f3 ff ff 4d 8b 45 08 4c 89
c2 4d
[   46.823577] RSP: 0018:ffffb77dc056b7c0 EFLAGS: 00010286
[   46.824379] RAX: 0000000000000000 RBX: ffff9b2ad77dea80 RCX: 
0000000000000000
[   46.825458] RDX: 0000000000000001 RSI: ffff9b2b3491b5c0 RDI: 
ffff9b2b3491b5c0
[   46.826557] RBP: ffff9b2adc7cd000 R08: 0000000000000000 R09: 
c0000000ffffdfff
[   46.827634] R10: ffff9b2adcb9d780 R11: ffffb77dc056b648 R12: 
ffff9b2ac6778000
[   46.828714] R13: ffff9b2adc7cd000 R14: ffff9b2adc7d0000 R15: 
000000000000002a
[   46.829796] FS:  00007f726dece740(0000) GS:ffff9b2b34900000(0000)
knlGS:0000000000000000
[   46.830706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   46.831299] CR2: 0000000001ed72b8 CR3: 000000001c794006 CR4: 
0000000000370ef0
[   46.832041] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 
0000000000000000
[   46.832813] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 
0000000000000400
[   46.833546] Call Trace:
[   46.833901]  <TASK>
[   46.834163]  ? __warn+0x78/0x130
[   46.834504]  ? ext4_mb_regular_allocator+0xeca/0xf20
[   46.835037]  ? report_bug+0xf8/0x1e0
[   46.835527]  ? console_unlock+0x45/0xd0
[   46.835963]  ? handle_bug+0x40/0x70
[   46.836419]  ? exc_invalid_op+0x13/0x70
[   46.836865]  ? asm_exc_invalid_op+0x16/0x20
[   46.837329]  ? ext4_mb_regular_allocator+0xeca/0xf20
[   46.837852]  ext4_mb_new_blocks+0x7e8/0xe60
[   46.838382]  ? __kmalloc+0x4b/0x130
[   46.838824]  ? __kmalloc+0x4b/0x130
[   46.839243]  ? ext4_find_extent+0x347/0x360
[   46.839743]  ext4_ext_map_blocks+0xc44/0xff0
[   46.840395]  ext4_map_blocks+0x162/0x5b0
[   46.841010]  ? jbd2__journal_start+0x84/0x1f0
[   46.841694]  ext4_map_blocks_aligned+0x20/0xa0
[   46.842382]  ext4_iomap_begin+0x1e9/0x320
[   46.843006]  iomap_iter+0x16d/0x350
[   46.843554]  __iomap_dio_rw+0x3be/0x830
[   46.844150]  iomap_dio_rw+0x9/0x30
[   46.844680]  ext4_file_write_iter+0x597/0x800
[   46.845346]  do_iter_readv_writev+0xe1/0x150
[   46.846029]  do_iter_write+0x86/0x1f0
[   46.846638]  vfs_writev+0x96/0x190
[   46.847176]  ? do_pwritev+0x98/0xd0
[   46.847721]  do_pwritev+0x98/0xd0
[   46.848230]  ? syscall_trace_enter.isra.19+0x130/0x1b0
[   46.849028]  do_syscall_64+0x42/0xf0
[   46.849590]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
[   46.850405] RIP: 0033:0x7f726df9666f
[   46.850964] Code: d5 41 54 49 89 f4 55 89 fd 53 44 89 c3 48 83 ec
18 80 3d bb fd 0b 00 00 74 2a 45 89 c1 49 89 ca 45 31 c0 b8 48 01 00
00 0f 05 <48> 3d 00 f0 ff ff 76 5c 48 8b 15 7a 77 0b 00 f7 d8 64 89 02
48 83
[   46.854020] RSP: 002b:00007fff28b9bff0 EFLAGS: 00000246 ORIG_RAX:
0000000000000148
[   46.855178] RAX: ffffffffffffffda RBX: 0000000000000024 RCX: 
00007f726df9666f
[   46.856248] RDX: 0000000000000001 RSI: 00007fff28b9c050 RDI: 
0000000000000003
[   46.857303] RBP: 0000000000000003 R08: 0000000000000000 R09: 
0000000000000024
[   46.858365] R10: 0000000000000000 R11: 0000000000000246 R12: 
00007fff28b9c050
[   46.859407] R13: 0000000000000001 R14: 0000000000000000 R15: 
00007f726e08aa60
[   46.860448]  </TASK>
[   46.860797] ---[ end trace 0000000000000000 ]---
[   46.861497] EXT4-fs warning (device sda):
ext4_map_blocks_aligned:520: Returned extent couldn't satisfy
alignment requirements
main wrote -1 bytes at offset 0
[   46.863855] test-pwritev2 (158) used greatest stack depth: 11920 
bytes left
#

Please note that I tested on my own dev branch, which contains changes 
over [1], but I expect it would not make a difference for this test.


> 
> Currently this RFC does not impose any restrictions for atomic and non-atomic
> allocations to any inode,  which also leaves policy decisions to user-space
> as much as possible. So, for example, the user space can:
> 
>   * Do an atomic direct IO at any alignment and size provided it
>     satisfies underlying device constraints. The only restriction for now
>     is that it should be power of 2 len and atleast of FS block size.
> 
>   * Do any combination of non atomic and atomic writes on the same file
>     in any order. As long as the user space is passing the RWF_ATOMIC flag
>     to pwritev2() it is guaranteed to do an atomic IO (or fail if not
>     possible).
> 
> There are some TODOs on the allocator side which are remaining like...
> 
> 1.  Fallback to original request size when normalized request size (due to
>      preallocation) allocation is not possible.
> 2.  Testing some edge cases.
> 
> But since all the basic test scenarios were covered, hence we wanted to get
> this RFC out for discussion on atomic write support for DIO in ext4.
> 
> Further points for discussion -
> 
> 1. We might need an inode flag to identify that the inode has blocks/extents
> atomically allocated. So that other userspace tools do not move the blocks of
> the inode for e.g. during resize/fsck etc.
>    a. Should inode be marked as atomic similar to how we have IS_DAX(inode)
>    implementation? Any thoughts?
> 
> 2. Should there be support for open flags like O_ATOMIC. So that in case if
> user wants to do only atomic writes to an open fd, then all writes can be
> considered atomic.
> 
> 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look
> like since say if there are block allocations done which were done atomically,
> it should not matter to FS w.r.t compatibility.
> 
> 4. Mostly aligned allocations are required when we don't have data=journal
> mode. So should we return -EIO with data journalling mode for DIO request?
> 
> Script to test using pwritev2() can be found here:
> https://urldefense.com/v3/__https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9__;!!ACWV5N9M2RV99hQ!LbVSb-43597CLDmYnhgOwH6MAcikusRh75-4fUbUrA_8go3B6JL1lWJPmhij8siPJE031qtQb6-bpdLEa1qrVA$  

Please note that the posix_memalign() call in the program should PAGE align.

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-04 10:36 ` [RFC 0/7] ext4: Allocator changes for atomic write support with DIO John Garry
@ 2023-12-04 13:38   ` Ojaswin Mujoo
  2023-12-04 14:44     ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-12-04 13:38 UTC (permalink / raw
  To: John Garry
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

Hi John,

Thanks for the review. 

On Mon, Dec 04, 2023 at 10:36:01AM +0000, John Garry wrote:
> On 30/11/2023 13:53, Ojaswin Mujoo wrote:
> 
> Thanks for putting this together.
> 
> > This patch series builds on top of John Gary's atomic direct write
> > patch series [1] and enables this support in ext4. This is a 2 step
> > process:
> > 
> > 1. Enable aligned allocation in ext4 mballoc. This allows us to allocate
> > power-of-2 aligned physical blocks, which is needed for atomic writes.
> > 
> > 2. Hook the direct IO path in ext4 to use aligned allocation to obtain
> > physical blocks at a given alignment, which is needed for atomic IO. If
> > for any reason we are not able to obtain blocks at given alignment we
> > fail the atomic write.
> 
> So are we supposed to be doing atomic writes on unwritten ranges only in the
> file to get the aligned allocations?

If we do an atomic write on a hole, ext4 will give us an aligned extent
provided the hole is big enough to accomodate it. 

However, if we do an atomic write on a existing extent (written or
unwritten) ext4 would check if it satisfies the alignment and length
requirement and returns an error if it doesn't. Since we don't have cow
like functionality afaik the only way we could let this kind of write go
through is by punching the pre-existing extent which is not something we
can directly do in the same write operation, hence we return error.

> 
> I actually tried that, and I got a WARN triggered:
> 
> # mkfs.ext4 /dev/sda
> mke2fs 1.46.5 (30-Dec-2021)
> Creating filesystem with 358400 1k blocks and 89760 inodes
> Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9
> Superblock backups stored on blocks:
>         8193, 24577, 40961, 57345, 73729, 204801, 221185
> 
> Allocating group tables: done
> Writing inode tables: done
> Creating journal (8192 blocks): done
> Writing superblocks and filesystem accounting information: done
> 
> [   12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left
> # mount /dev/sda mnt
> [   12.798804] EXT4-fs (sda): mounted filesystem
> 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota
> mode: none.
> # touch mnt/file
> #
> # /test-statx -a /root/mnt/file
> statx(/root/mnt/file) = 0
> dump_statx results=5fff
>   Size: 0               Blocks: 0          IO Block: 1024    regular file
> Device: 08:00           Inode: 12          Links: 1
> Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
> Access: 2023-12-04 10:27:40.002848720+0000
> Modify: 2023-12-04 10:27:40.002848720+0000
> Change: 2023-12-04 10:27:40.002848720+0000
>  Birth: 2023-12-04 10:27:40.002848720+0000
> stx_attributes_mask=0x703874
>         STATX_ATTR_WRITE_ATOMIC set
>         unit min: 1024
>         uunit max: 524288
> Attributes: 0000000000400000 (........ ........ ........ ........
> ........ .?--.... ..---... .---.-..)
> #
> 
> 
> 
> looks ok so far, then write 4KB at offset 0:
> 
> # /test-pwritev2 -a -d -p 0 -l 4096  /root/mnt/file
> file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
> [   46.813720] ------------[ cut here ]------------
> [   46.814934] WARNING: CPU: 1 PID: 158 at fs/ext4/mballoc.c:2991
> ext4_mb_regular_allocator+0xeca/0xf20
> [   46.816344] Modules linked in:
> [   46.816831] CPU: 1 PID: 158 Comm: test-pwritev2 Not tainted
> 6.7.0-rc1-00038-gae3807f27e7d-dirty #968
> [   46.818220] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009),
> BIOS rel-1.16.0-0-gd239552c-rebuilt.opensuse.org 04/01/2014
> [   46.819886] RIP: 0010:ext4_mb_regular_allocator+0xeca/0xf20
> [   46.820734] Code: fd ff ff f0 41 ff 81 e4 03 00 00 e9 63 fd ff ff
> 90 0f 0b 90 e9 fe f3 ff ff 90 48 c7 c7 e2 7a b2 86 44 89 ca e8 f7 f1
> d2 ff 90 <0f> 0b 90 90 45 8b 44 24 3c e9 d4 f3 ff ff 4d 8b 45 08 4c 89
> c2 4d
> [   46.823577] RSP: 0018:ffffb77dc056b7c0 EFLAGS: 00010286
> [   46.824379] RAX: 0000000000000000 RBX: ffff9b2ad77dea80 RCX:
> 0000000000000000
> [   46.825458] RDX: 0000000000000001 RSI: ffff9b2b3491b5c0 RDI:
> ffff9b2b3491b5c0
> [   46.826557] RBP: ffff9b2adc7cd000 R08: 0000000000000000 R09:
> c0000000ffffdfff
> [   46.827634] R10: ffff9b2adcb9d780 R11: ffffb77dc056b648 R12:
> ffff9b2ac6778000
> [   46.828714] R13: ffff9b2adc7cd000 R14: ffff9b2adc7d0000 R15:
> 000000000000002a
> [   46.829796] FS:  00007f726dece740(0000) GS:ffff9b2b34900000(0000)
> knlGS:0000000000000000
> [   46.830706] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [   46.831299] CR2: 0000000001ed72b8 CR3: 000000001c794006 CR4:
> 0000000000370ef0
> [   46.832041] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [   46.832813] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [   46.833546] Call Trace:
> [   46.833901]  <TASK>
> [   46.834163]  ? __warn+0x78/0x130
> [   46.834504]  ? ext4_mb_regular_allocator+0xeca/0xf20
> [   46.835037]  ? report_bug+0xf8/0x1e0
> [   46.835527]  ? console_unlock+0x45/0xd0
> [   46.835963]  ? handle_bug+0x40/0x70
> [   46.836419]  ? exc_invalid_op+0x13/0x70
> [   46.836865]  ? asm_exc_invalid_op+0x16/0x20
> [   46.837329]  ? ext4_mb_regular_allocator+0xeca/0xf20
> [   46.837852]  ext4_mb_new_blocks+0x7e8/0xe60
> [   46.838382]  ? __kmalloc+0x4b/0x130
> [   46.838824]  ? __kmalloc+0x4b/0x130
> [   46.839243]  ? ext4_find_extent+0x347/0x360
> [   46.839743]  ext4_ext_map_blocks+0xc44/0xff0
> [   46.840395]  ext4_map_blocks+0x162/0x5b0
> [   46.841010]  ? jbd2__journal_start+0x84/0x1f0
> [   46.841694]  ext4_map_blocks_aligned+0x20/0xa0
> [   46.842382]  ext4_iomap_begin+0x1e9/0x320
> [   46.843006]  iomap_iter+0x16d/0x350
> [   46.843554]  __iomap_dio_rw+0x3be/0x830
> [   46.844150]  iomap_dio_rw+0x9/0x30
> [   46.844680]  ext4_file_write_iter+0x597/0x800
> [   46.845346]  do_iter_readv_writev+0xe1/0x150
> [   46.846029]  do_iter_write+0x86/0x1f0
> [   46.846638]  vfs_writev+0x96/0x190
> [   46.847176]  ? do_pwritev+0x98/0xd0
> [   46.847721]  do_pwritev+0x98/0xd0
> [   46.848230]  ? syscall_trace_enter.isra.19+0x130/0x1b0
> [   46.849028]  do_syscall_64+0x42/0xf0
> [   46.849590]  entry_SYSCALL_64_after_hwframe+0x6f/0x77
> [   46.850405] RIP: 0033:0x7f726df9666f
> [   46.850964] Code: d5 41 54 49 89 f4 55 89 fd 53 44 89 c3 48 83 ec
> 18 80 3d bb fd 0b 00 00 74 2a 45 89 c1 49 89 ca 45 31 c0 b8 48 01 00
> 00 0f 05 <48> 3d 00 f0 ff ff 76 5c 48 8b 15 7a 77 0b 00 f7 d8 64 89 02
> 48 83
> [   46.854020] RSP: 002b:00007fff28b9bff0 EFLAGS: 00000246 ORIG_RAX:
> 0000000000000148
> [   46.855178] RAX: ffffffffffffffda RBX: 0000000000000024 RCX:
> 00007f726df9666f
> [   46.856248] RDX: 0000000000000001 RSI: 00007fff28b9c050 RDI:
> 0000000000000003
> [   46.857303] RBP: 0000000000000003 R08: 0000000000000000 R09:
> 0000000000000024
> [   46.858365] R10: 0000000000000000 R11: 0000000000000246 R12:
> 00007fff28b9c050
> [   46.859407] R13: 0000000000000001 R14: 0000000000000000 R15:
> 00007f726e08aa60
> [   46.860448]  </TASK>
> [   46.860797] ---[ end trace 0000000000000000 ]---
> [   46.861497] EXT4-fs warning (device sda):
> ext4_map_blocks_aligned:520: Returned extent couldn't satisfy
> alignment requirements
> main wrote -1 bytes at offset 0
> [   46.863855] test-pwritev2 (158) used greatest stack depth: 11920 bytes
> left
> #
> 
> Please note that I tested on my own dev branch, which contains changes over
> [1], but I expect it would not make a difference for this test.

Hmm this should not ideally happen, can you please share your test
script with me if possible?

> 
> 
> > 
> > Currently this RFC does not impose any restrictions for atomic and non-atomic
> > allocations to any inode,  which also leaves policy decisions to user-space
> > as much as possible. So, for example, the user space can:
> > 
> >   * Do an atomic direct IO at any alignment and size provided it
> >     satisfies underlying device constraints. The only restriction for now
> >     is that it should be power of 2 len and atleast of FS block size.
> > 
> >   * Do any combination of non atomic and atomic writes on the same file
> >     in any order. As long as the user space is passing the RWF_ATOMIC flag
> >     to pwritev2() it is guaranteed to do an atomic IO (or fail if not
> >     possible).
> > 
> > There are some TODOs on the allocator side which are remaining like...
> > 
> > 1.  Fallback to original request size when normalized request size (due to
> >      preallocation) allocation is not possible.
> > 2.  Testing some edge cases.
> > 
> > But since all the basic test scenarios were covered, hence we wanted to get
> > this RFC out for discussion on atomic write support for DIO in ext4.
> > 
> > Further points for discussion -
> > 
> > 1. We might need an inode flag to identify that the inode has blocks/extents
> > atomically allocated. So that other userspace tools do not move the blocks of
> > the inode for e.g. during resize/fsck etc.
> >    a. Should inode be marked as atomic similar to how we have IS_DAX(inode)
> >    implementation? Any thoughts?
> > 
> > 2. Should there be support for open flags like O_ATOMIC. So that in case if
> > user wants to do only atomic writes to an open fd, then all writes can be
> > considered atomic.
> > 
> > 3. Do we need to have any feature compat flags for FS? (IMO) It doesn't look
> > like since say if there are block allocations done which were done atomically,
> > it should not matter to FS w.r.t compatibility.
> > 
> > 4. Mostly aligned allocations are required when we don't have data=journal
> > mode. So should we return -EIO with data journalling mode for DIO request?
> > 
> > Script to test using pwritev2() can be found here:
> > https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9
> 
> Please note that the posix_memalign() call in the program should PAGE align.

Why do you say that? direct IO seems to be working when the userspace
buffer is 512 byte aligned, am I missing something?

Regards,
ojaswin

PS: I'm on vacation this week so might be a bit slow to address all the
review comments.

> 
> Thanks,
> John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-04 13:38   ` Ojaswin Mujoo
@ 2023-12-04 14:44     ` John Garry
  2023-12-11 10:54       ` Ojaswin Mujoo
  0 siblings, 1 reply; 32+ messages in thread
From: John Garry @ 2023-12-04 14:44 UTC (permalink / raw
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On 04/12/2023 13:38, Ojaswin Mujoo wrote:
>> So are we supposed to be doing atomic writes on unwritten ranges only in the
>> file to get the aligned allocations?
> If we do an atomic write on a hole, ext4 will give us an aligned extent
> provided the hole is big enough to accomodate it.
> 
> However, if we do an atomic write on a existing extent (written or
> unwritten) ext4 would check if it satisfies the alignment and length
> requirement and returns an error if it doesn't.

This seems a rather big drawback.

> Since we don't have cow
> like functionality afaik the only way we could let this kind of write go
> through is by punching the pre-existing extent which is not something we
> can directly do in the same write operation, hence we return error.

Well, as you prob saw, for XFS we were relying on forcing extent 
alignment, and not CoW (yet).

> 
>> I actually tried that, and I got a WARN triggered:
>>
>> # mkfs.ext4 /dev/sda
>> mke2fs 1.46.5 (30-Dec-2021)
>> Creating filesystem with 358400 1k blocks and 89760 inodes
>> Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9
>> Superblock backups stored on blocks:
>>          8193, 24577, 40961, 57345, 73729, 204801, 221185
>>
>> Allocating group tables: done
>> Writing inode tables: done
>> Creating journal (8192 blocks): done
>> Writing superblocks and filesystem accounting information: done
>>
>> [   12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left
>> # mount /dev/sda mnt
>> [   12.798804] EXT4-fs (sda): mounted filesystem
>> 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota
>> mode: none.
>> # touch mnt/file
>> #
>> # /test-statx -a /root/mnt/file
>> statx(/root/mnt/file) = 0
>> dump_statx results=5fff
>>    Size: 0               Blocks: 0          IO Block: 1024    regular file
>> Device: 08:00           Inode: 12          Links: 1
>> Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
>> Access: 2023-12-04 10:27:40.002848720+0000
>> Modify: 2023-12-04 10:27:40.002848720+0000
>> Change: 2023-12-04 10:27:40.002848720+0000
>>   Birth: 2023-12-04 10:27:40.002848720+0000
>> stx_attributes_mask=0x703874
>>          STATX_ATTR_WRITE_ATOMIC set
>>          unit min: 1024
>>          uunit max: 524288
>> Attributes: 0000000000400000 (........ ........ ........ ........
>> ........ .?--.... ..---... .---.-..)
>> #
>>
>>
>>
>> looks ok so far, then write 4KB at offset 0:
>>
>> # /test-pwritev2 -a -d -p 0 -l 4096  /root/mnt/file
>> file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24

...

>> Please note that I tested on my own dev branch, which contains changes over
>> [1], but I expect it would not make a difference for this test.
> Hmm this should not ideally happen, can you please share your test
> script with me if possible?

It's doing nothing special, just RWF_ATOMIC flag is set for DIO write:

https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c

>>>
>>> Script to test using pwritev2() can be found here:
>>> https://urldefense.com/v3/__https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9__;!!ACWV5N9M2RV99hQ!Lr0j4iDHrfXisXOGZ82HNPefBtVDDGe9zbLhey7rRDfPE7A_tsrrQ9Dw_4Ng_qS7xTGCZaEWBKtd6pqA_LIBfA$  
>> Please note that the posix_memalign() call in the program should PAGE align.
> Why do you say that? direct IO seems to be working when the userspace
> buffer is 512 byte aligned, am I missing something?

ah, sorry, if you just use 1x IOV vector then no special alignment are 
required, so ignore this. Indeed, I need to improve kernel checks for 
alignment anyway.

Thanks,
John


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-04  9:02         ` John Garry
@ 2023-12-04 18:17           ` Darrick J. Wong
  2023-12-04 18:34             ` John Garry
  0 siblings, 1 reply; 32+ messages in thread
From: Darrick J. Wong @ 2023-12-04 18:17 UTC (permalink / raw
  To: John Garry
  Cc: Dave Chinner, Ojaswin Mujoo, linux-ext4, Theodore Ts'o,
	Ritesh Harjani, linux-kernel, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On Mon, Dec 04, 2023 at 09:02:56AM +0000, John Garry wrote:
> On 01/12/2023 22:07, Dave Chinner wrote:
> > > Sure, and I think that we need a better story for supporting buffered IO for
> > > atomic writes.
> > > 
> > > Currently we have:
> > > - man pages tell us RWF_ATOMIC is only supported for direct IO
> > > - statx gives atomic write unit min/max, not explicitly telling us it's for
> > > direct IO
> > > - RWF_ATOMIC is ignored for !O_DIRECT
> > > 
> > > So I am thinking of expanding statx support to enable querying of atomic
> > > write capabilities for buffered IO and direct IO separately.
> > You're over complicating this way too much by trying to restrict the
> > functionality down to just what you want to implement right now.
> > 
> > RWF_ATOMIC is no different to RWF_NOWAIT. The API doesn't decide
> > what can be supported - the filesystems themselves decide what part
> > of the API they can support and implement those pieces.
> 
> Sure, but for RWF_ATOMIC we still have the associated statx call to tell us
> whether atomic writes are supported for a file and the specific range
> capability.
> 
> > 
> > TO go back to RWF_NOWAIT, for a long time we (XFS) only supported
> > RWF_NOWAIT on DIO, and buffered reads and writes were given
> > -EOPNOTSUPP by the filesystem. Then other filesystems started
> > supporting DIO with RWF_NOWAIT. Then buffered read support was added
> > to the page cache and XFS, and as other filesystems were converted
> > they removed the RWF_NOWAIT exclusion check from their read IO
> > paths.
> > 
> > We are now in the same place with buffered write support for
> > RWF_NOWAIT. XFS, the page cache and iomap allow buffered writes w/
> > RWF_NOWAIT, but ext4, btrfs and f2fs still all return -EOPNOTSUPP
> > because they don't support non-blocking buffered writes yet.
> > 
> > This is the same model we should be applying with RWF_ATOMIC - we
> > know that over time we'll be able to expand support for atomic
> > writes across both direct and buffered IO, so we should not be
> > restricting the API or infrastructure to only allow RWF_ATOMIC w/
> > DIO.
> 
> Agreed.
> 
> > Just have the filesystems reject RWF_ATOMIC w/ -EOPNOTSUPP if
> > they don't support it,
> 
> Yes, I was going to add this regardless.
> 
> > and for those that do it is conditional on
> > whther the filesystem supports it for the given type of IO being
> > done.
> > 
> > Seriously - an application can easily probe for RWF_ATOMIC support
> > without needing information to be directly exposed in statx() - just
> > open a O_TMPFILE, issue the type of RWF_ATOMIC IO you require to be
> > supported, and if it returns -EOPNOTSUPP then it you can't use
> > RWF_ATOMIC optimisations in the application....
> 
> ok, if that is the done thing.
> 
> So I can't imagine that atomic write unit range will be different for direct
> IO and buffered IO (ignoring for a moment Christoph's idea for CoW always
> for no HW offload) when supported. But it seems that we may have a scenario
> where statx tells is that atomic writes are supported for a file, and a DIO
> write succeeds and a buffered IO write may return -EOPNOTSUPP. If that's
> acceptable then I'll work towards that.
> 
> If we could just run statx on a file descriptor here then that would be
> simpler...

statx(fd, "", AT_EMPTY_PATH, ...); ?

--D

> Thanks,
> John
> 
> 
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-04 18:17           ` Darrick J. Wong
@ 2023-12-04 18:34             ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-04 18:34 UTC (permalink / raw
  To: Darrick J. Wong
  Cc: Dave Chinner, Ojaswin Mujoo, linux-ext4, Theodore Ts'o,
	Ritesh Harjani, linux-kernel, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On 04/12/2023 18:17, Darrick J. Wong wrote:
>> If we could just run statx on a file descriptor here then that would be
>> simpler...
> statx(fd, "", AT_EMPTY_PATH, ...); ?

I really meant that if we could only run statx on a fd then we could 
know if we want DIO or buffered IO (as that is how it was opened).

Thanks,
John



^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic
  2023-12-01 22:07       ` Dave Chinner
  2023-12-04  9:02         ` John Garry
@ 2023-12-07 12:43         ` John Garry
  1 sibling, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-07 12:43 UTC (permalink / raw
  To: Dave Chinner
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On 01/12/2023 22:07, Dave Chinner wrote:
> RWF_ATOMIC is no different to RWF_NOWAIT. The API doesn't decide
> what can be supported - the filesystems themselves decide what part
> of the API they can support and implement those pieces.
> 
> TO go back to RWF_NOWAIT, for a long time we (XFS) only supported
> RWF_NOWAIT on DIO, and buffered reads and writes were given
> -EOPNOTSUPP by the filesystem. Then other filesystems started
> supporting DIO with RWF_NOWAIT. Then buffered read support was added
> to the page cache and XFS, and as other filesystems were converted
> they removed the RWF_NOWAIT exclusion check from their read IO
> paths.
> 
> We are now in the same place with buffered write support for
> RWF_NOWAIT. XFS, the page cache and iomap allow buffered writes w/
> RWF_NOWAIT, but ext4, btrfs and f2fs still all return -EOPNOTSUPP
> because they don't support non-blocking buffered writes yet.
> 
> This is the same model we should be applying with RWF_ATOMIC - we
> know that over time we'll be able to expand support for atomic
> writes across both direct and buffered IO, so we should not be
> restricting the API or infrastructure to only allow RWF_ATOMIC w/
> DIO. Just have the filesystems reject RWF_ATOMIC w/ -EOPNOTSUPP if
> they don't support it, and for those that do it is conditional on
> whther the filesystem supports it for the given type of IO being
> done.
> 
> Seriously - an application can easily probe for RWF_ATOMIC support
> without needing information to be directly exposed in statx() - just
> open a O_TMPFILE, issue the type of RWF_ATOMIC IO you require to be
> supported, and if it returns -EOPNOTSUPP then it you can't use
> RWF_ATOMIC optimisations in the application....

Hi Dave,

For rejecting RWF_ATOMIC when not supported for a file, how about 
something like this:

--->8----

diff --git a/block/fops.c b/block/fops.c
index 273bd8f5a370..d9563ef29dde 100644
--- a/block/fops.c
+++ b/block/fops.c
@@ -639,6 +637,9 @@ static int blkdev_open(struct inode *inode, struct 
file *filp)
  	if (IS_ERR(handle))
  		return PTR_ERR(handle);

+	if (queue_atomic_write_unit_max_bytes(bdev_get_queue(handle->bdev)))
+		filp->f_mode |= FMODE_CAN_ATOMIC_WRITE;
+
  	if (bdev_nowait(handle->bdev))
  		filp->f_mode |= FMODE_NOWAIT;

diff --git a/include/linux/fs.h b/include/linux/fs.h
index 4256ec184461..d725c194243c 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -185,6 +185,9 @@ typedef int (dio_iodone_t)(struct kiocb *iocb, 
loff_t offset,
  /* File supports async nowait buffered writes */
  #define FMODE_BUF_WASYNC	((__force fmode_t)0x80000000)

+/* File supports atomic writes */
+#define FMODE_CAN_ATOMIC_WRITE	((__force fmode_t)0x100000000)
+
  /*
   * Attribute flags.  These should be or-ed together to figure out what
   * has been changed!
@@ -3266,6 +3269,10 @@ static inline int kiocb_set_rw_flags(struct kiocb 
*ki, rwf_t flags)
  			return -EOPNOTSUPP;
  		kiocb_flags |= IOCB_NOIO;
  	}
+	if (flags & RWF_ATOMIC) {
+		if (!(ki->ki_filp->f_mode & FMODE_CAN_ATOMIC_WRITE))
+			return -EOPNOTSUPP;
+	}
  	kiocb_flags |= (__force int) (flags & RWF_SUPPORTED);
  	if (flags & RWF_SYNC)
  		kiocb_flags |= IOCB_DSYNC;
diff --git a/include/linux/types.h b/include/linux/types.h
index 253168bb3fe1..49c754fde1d6 100644
--- a/include/linux/types.h
+++ b/include/linux/types.h
@@ -153,7 +153,7 @@ typedef u32 dma_addr_t;

  typedef unsigned int __bitwise gfp_t;
  typedef unsigned int __bitwise slab_flags_t;
-typedef unsigned int __bitwise fmode_t;
+typedef unsigned long __bitwise fmode_t;

  #ifdef CONFIG_PHYS_ADDR_T_64BIT
  typedef u64 phys_addr_t;


----8<------

My concern is that we need to increase fmode_t in size as all available 
32 bits are used up.

Thanks,
John

^ permalink raw reply related	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-04 14:44     ` John Garry
@ 2023-12-11 10:54       ` Ojaswin Mujoo
  2023-12-12  7:46         ` John Garry
  2023-12-13  6:42         ` Ojaswin Mujoo
  0 siblings, 2 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-12-11 10:54 UTC (permalink / raw
  To: John Garry
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

Hi John,

On Mon, Dec 04, 2023 at 02:44:36PM +0000, John Garry wrote:
> On 04/12/2023 13:38, Ojaswin Mujoo wrote:
> > > So are we supposed to be doing atomic writes on unwritten ranges only in the
> > > file to get the aligned allocations?
> > If we do an atomic write on a hole, ext4 will give us an aligned extent
> > provided the hole is big enough to accomodate it.
> > 
> > However, if we do an atomic write on a existing extent (written or
> > unwritten) ext4 would check if it satisfies the alignment and length
> > requirement and returns an error if it doesn't.
> 
> This seems a rather big drawback.

So if I'm not wrong, we force the extent alignment as well as the size
of the extent in xfs right. 

We didn't want to overly restrict the users of atomic writes by forcing
the extents to be of a certain alignment/size irrespective of the size
of write. The design in this patchset provides this flexibility at the
cost of some added precautions that the user should take (eg not doing
an atomic write on a pre existing unaligned extent etc).

However, I don't think it should be too hard to provide an optional
forced alignment feature on top of this if there's interest in that.
> 
> > Since we don't have cow
> > like functionality afaik the only way we could let this kind of write go
> > through is by punching the pre-existing extent which is not something we
> > can directly do in the same write operation, hence we return error.
> 
> Well, as you prob saw, for XFS we were relying on forcing extent alignment,
> and not CoW (yet).

That's right.

> 
> > 
> > > I actually tried that, and I got a WARN triggered:
> > > 
> > > # mkfs.ext4 /dev/sda
> > > mke2fs 1.46.5 (30-Dec-2021)
> > > Creating filesystem with 358400 1k blocks and 89760 inodes
> > > Filesystem UUID: 7543a44b-2957-4ddc-9d4a-db3a5fd019c9
> > > Superblock backups stored on blocks:
> > >          8193, 24577, 40961, 57345, 73729, 204801, 221185
> > > 
> > > Allocating group tables: done
> > > Writing inode tables: done
> > > Creating journal (8192 blocks): done
> > > Writing superblocks and filesystem accounting information: done
> > > 
> > > [   12.745889] mkfs.ext4 (150) used greatest stack depth: 13304 bytes left
> > > # mount /dev/sda mnt
> > > [   12.798804] EXT4-fs (sda): mounted filesystem
> > > 7543a44b-2957-4ddc-9d4a-db3a5fd019c9 r/w with ordered data mode. Quota
> > > mode: none.
> > > # touch mnt/file
> > > #
> > > # /test-statx -a /root/mnt/file
> > > statx(/root/mnt/file) = 0
> > > dump_statx results=5fff
> > >    Size: 0               Blocks: 0          IO Block: 1024    regular file
> > > Device: 08:00           Inode: 12          Links: 1
> > > Access: (0644/-rw-r--r--)  Uid:     0   Gid:     0
> > > Access: 2023-12-04 10:27:40.002848720+0000
> > > Modify: 2023-12-04 10:27:40.002848720+0000
> > > Change: 2023-12-04 10:27:40.002848720+0000
> > >   Birth: 2023-12-04 10:27:40.002848720+0000
> > > stx_attributes_mask=0x703874
> > >          STATX_ATTR_WRITE_ATOMIC set
> > >          unit min: 1024
> > >          uunit max: 524288
> > > Attributes: 0000000000400000 (........ ........ ........ ........
> > > ........ .?--.... ..---... .---.-..)
> > > #
> > > 
> > > 
> > > 
> > > looks ok so far, then write 4KB at offset 0:
> > > 
> > > # /test-pwritev2 -a -d -p 0 -l 4096  /root/mnt/file
> > > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
> 
> ...
> 
> > > Please note that I tested on my own dev branch, which contains changes over
> > > [1], but I expect it would not make a difference for this test.
> > Hmm this should not ideally happen, can you please share your test
> > script with me if possible?
> 
> It's doing nothing special, just RWF_ATOMIC flag is set for DIO write:
> 
> https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c

Thanks for the script, will try to replicate this today and get back to
you.

> 
> > > > 
> > > > Script to test using pwritev2() can be found here:
> > > > https://gist.github.com/OjaswinM/e67accee3cbb7832bd3f1a9543c01da9
> > > Please note that the posix_memalign() call in the program should PAGE align.
> > Why do you say that? direct IO seems to be working when the userspace
> > buffer is 512 byte aligned, am I missing something?
> 
> ah, sorry, if you just use 1x IOV vector then no special alignment are
> required, so ignore this. Indeed, I need to improve kernel checks for
> alignment anyway.
> 
> Thanks,
> John
> 

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api
  2023-12-01 10:47   ` John Garry
@ 2023-12-11 10:57     ` Ojaswin Mujoo
  0 siblings, 0 replies; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-12-11 10:57 UTC (permalink / raw
  To: John Garry
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On Fri, Dec 01, 2023 at 10:47:59AM +0000, John Garry wrote:
> On 30/11/2023 13:53, Ojaswin Mujoo wrote:
> > Export the blkdev_atomic_write_valid() function so that other filesystems
> > can call it as a part of validating the atomic write operation.
> > 
> > Further, refactor the api to accept a len argument instead of iov_iter to
> > make it easier to call from other places.
> > 
> > Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com>
> 
> I was actually thinking of moving this functionality to vfs and maybe also
> calling earlier in write path, as the code is really common to blkdev and
> FSes.

This makes sense. The code to make sure the underlying device
will be able to support this atomic write can be moved higher up in vfs.
And then each fs can do extra fs-specific checks in their code.

> 
> However, Christoph Hellwig was not so happy about current interface with
> power-of-2 requirement et al, so I was going to wait until that discussion
> is concluded before deciding.

Got it, I'll leave this bit to you then :) 

> 
> Thanks,
> John
> 
> > ---
> >   block/fops.c           | 18 ++++++++++--------
> >   include/linux/blkdev.h |  2 ++
> >   2 files changed, 12 insertions(+), 8 deletions(-)
> > 
> > diff --git a/block/fops.c b/block/fops.c
> > index 516669ad69e5..5dae95c49720 100644
> > --- a/block/fops.c
> > +++ b/block/fops.c
> > @@ -41,8 +41,7 @@ static bool blkdev_dio_unaligned(struct block_device *bdev, loff_t pos,
> >   		!bdev_iter_is_aligned(bdev, iter);
> >   }
> > -static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> > -			      struct iov_iter *iter)
> > +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len)
> >   {
> >   	unsigned int atomic_write_unit_min_bytes =
> >   			queue_atomic_write_unit_min_bytes(bdev_get_queue(bdev));
> > @@ -53,16 +52,17 @@ static bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos,
> >   		return false;
> >   	if (pos % atomic_write_unit_min_bytes)
> >   		return false;
> > -	if (iov_iter_count(iter) % atomic_write_unit_min_bytes)
> > +	if (len % atomic_write_unit_min_bytes)
> >   		return false;
> > -	if (!is_power_of_2(iov_iter_count(iter)))
> > +	if (!is_power_of_2(len))
> >   		return false;
> > -	if (iov_iter_count(iter) > atomic_write_unit_max_bytes)
> > +	if (len > atomic_write_unit_max_bytes)
> >   		return false;
> > -	if (pos % iov_iter_count(iter))
> > +	if (pos % len)
> >   		return false;
> >   	return true;
> >   }
> > +EXPORT_SYMBOL_GPL(blkdev_atomic_write_valid);
> >   #define DIO_INLINE_BIO_VECS 4
> > @@ -81,7 +81,8 @@ static ssize_t __blkdev_direct_IO_simple(struct kiocb *iocb,
> >   	if (blkdev_dio_unaligned(bdev, pos, iter))
> >   		return -EINVAL;
> > -	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
> > +	if (atomic_write &&
> > +	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
> >   		return -EINVAL;
> >   	if (nr_pages <= DIO_INLINE_BIO_VECS)
> > @@ -348,7 +349,8 @@ static ssize_t __blkdev_direct_IO_async(struct kiocb *iocb,
> >   	if (blkdev_dio_unaligned(bdev, pos, iter))
> >   		return -EINVAL;
> > -	if (atomic_write && !blkdev_atomic_write_valid(bdev, pos, iter))
> > +	if (atomic_write &&
> > +	    !blkdev_atomic_write_valid(bdev, pos, iov_iter_count(iter)))
> >   		return -EINVAL;
> >   	if (iocb->ki_flags & IOCB_ALLOC_CACHE)
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index f70988083734..5a3124fc191f 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -1566,6 +1566,8 @@ static inline int early_lookup_bdev(const char *pathname, dev_t *dev)
> >   int freeze_bdev(struct block_device *bdev);
> >   int thaw_bdev(struct block_device *bdev);
> > +bool blkdev_atomic_write_valid(struct block_device *bdev, loff_t pos, size_t len);
> > +
> >   struct io_comp_batch {
> >   	struct request *req_list;
> >   	bool need_ts;
> 

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-11 10:54       ` Ojaswin Mujoo
@ 2023-12-12  7:46         ` John Garry
  2023-12-12 13:10           ` Christoph Hellwig
  2023-12-13  5:59           ` Ojaswin Mujoo
  2023-12-13  6:42         ` Ojaswin Mujoo
  1 sibling, 2 replies; 32+ messages in thread
From: John Garry @ 2023-12-12  7:46 UTC (permalink / raw
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On 11/12/2023 10:54, Ojaswin Mujoo wrote:
>> This seems a rather big drawback.
> So if I'm not wrong, we force the extent alignment as well as the size
> of the extent in xfs right.

For XFS in my v1 patchset, we only force alignment (but not size).

It is assumed that the user will fallocate/dd the complete file before 
issuing atomic writes, and we will have extent alignment and length as 
required.

However - as we have seen with a trial user - it can create a problem if 
we don't do that and we write 4K and then overwrite with a 16K atomic 
write to a file, as 2x extents may be allocated for the complete 16K and 
it cannot be issued as a single BIO.

> 
> We didn't want to overly restrict the users of atomic writes by forcing
> the extents to be of a certain alignment/size irrespective of the size
> of write. The design in this patchset provides this flexibility at the
> cost of some added precautions that the user should take (eg not doing
> an atomic write on a pre existing unaligned extent etc).

Doesn't bigalloc already give you what you require here?

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-12  7:46         ` John Garry
@ 2023-12-12 13:10           ` Christoph Hellwig
  2023-12-12 15:16             ` Theodore Ts'o
  2023-12-12 16:10             ` John Garry
  2023-12-13  5:59           ` Ojaswin Mujoo
  1 sibling, 2 replies; 32+ messages in thread
From: Christoph Hellwig @ 2023-12-12 13:10 UTC (permalink / raw
  To: John Garry
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote:
> It is assumed that the user will fallocate/dd the complete file before
> issuing atomic writes, and we will have extent alignment and length as
> required.

I don't think that's a long time maintainable usage model.

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-12 13:10           ` Christoph Hellwig
@ 2023-12-12 15:16             ` Theodore Ts'o
  2023-12-12 15:19               ` Christoph Hellwig
  2023-12-12 16:10             ` John Garry
  1 sibling, 1 reply; 32+ messages in thread
From: Theodore Ts'o @ 2023-12-12 15:16 UTC (permalink / raw
  To: Christoph Hellwig
  Cc: John Garry, Ojaswin Mujoo, linux-ext4, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On Tue, Dec 12, 2023 at 05:10:42AM -0800, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote:
> > It is assumed that the user will fallocate/dd the complete file before
> > issuing atomic writes, and we will have extent alignment and length as
> > required.
> 
> I don't think that's a long time maintainable usage model.

For databases that are trying to use this to significantly improve
their performance by eliminating double writes, the allocation and
writes are being done by a single process.  So for *that* use case, it
is quite maintainable.

Cheers,

						- Ted

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-12 15:16             ` Theodore Ts'o
@ 2023-12-12 15:19               ` Christoph Hellwig
  0 siblings, 0 replies; 32+ messages in thread
From: Christoph Hellwig @ 2023-12-12 15:19 UTC (permalink / raw
  To: Theodore Ts'o
  Cc: Christoph Hellwig, John Garry, Ojaswin Mujoo, linux-ext4,
	Ritesh Harjani, linux-kernel, Darrick J . Wong, linux-block,
	linux-xfs, linux-fsdevel, dchinner

On Tue, Dec 12, 2023 at 10:16:13AM -0500, Theodore Ts'o wrote:
> On Tue, Dec 12, 2023 at 05:10:42AM -0800, Christoph Hellwig wrote:
> > On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote:
> > > It is assumed that the user will fallocate/dd the complete file before
> > > issuing atomic writes, and we will have extent alignment and length as
> > > required.
> > 
> > I don't think that's a long time maintainable usage model.
> 
> For databases that are trying to use this to significantly improve
> their performance by eliminating double writes, the allocation and
> writes are being done by a single process.  So for *that* use case, it
> is quite maintainable.

That's not the freaking point.  We need to have proper kernel interfaces
that don't rely on intimate knowledge and control of details.  We need
to build proper genral purpose interfaces and not layer hacks on top of
hacks.


^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-12 13:10           ` Christoph Hellwig
  2023-12-12 15:16             ` Theodore Ts'o
@ 2023-12-12 16:10             ` John Garry
  1 sibling, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-12 16:10 UTC (permalink / raw
  To: Christoph Hellwig
  Cc: Ojaswin Mujoo, linux-ext4, Theodore Ts'o, Ritesh Harjani,
	linux-kernel, Darrick J . Wong, linux-block, linux-xfs,
	linux-fsdevel, dchinner

On 12/12/2023 13:10, Christoph Hellwig wrote:
> On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote:
>> It is assumed that the user will fallocate/dd the complete file before
>> issuing atomic writes, and we will have extent alignment and length as
>> required.
> I don't think that's a long time maintainable usage model.

ok, sure - as I may have mentioned, this was even causing issues in porting.

I sent the v2 series without XFS support, but the same API proposal as 
before. Let's discuss any potential API changes there.

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-12  7:46         ` John Garry
  2023-12-12 13:10           ` Christoph Hellwig
@ 2023-12-13  5:59           ` Ojaswin Mujoo
  2023-12-13  9:17             ` John Garry
  1 sibling, 1 reply; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-12-13  5:59 UTC (permalink / raw
  To: John Garry
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On Tue, Dec 12, 2023 at 07:46:51AM +0000, John Garry wrote:
> On 11/12/2023 10:54, Ojaswin Mujoo wrote:
> > > This seems a rather big drawback.
> > So if I'm not wrong, we force the extent alignment as well as the size
> > of the extent in xfs right.
> 
> For XFS in my v1 patchset, we only force alignment (but not size).
> 
> It is assumed that the user will fallocate/dd the complete file before
> issuing atomic writes, and we will have extent alignment and length as
> required.
> 
> However - as we have seen with a trial user - it can create a problem if we
> don't do that and we write 4K and then overwrite with a 16K atomic write to
> a file, as 2x extents may be allocated for the complete 16K and it cannot be
> issued as a single BIO.

So currently, if we don't fallocate beforehand in xfs and the user
tries to do the 16k overwrite to an offset having a 4k extent, how are
we handling it?

Here ext4 will return an error indicating atomic write can't happen at
this particular offset. The way I see it is if the user passes atomic
flag to pwritev2 and we are unable to ensure atomicity for any reason we
return error, which seems like a fair approach for a generic interface. 

> 
> > 
> > We didn't want to overly restrict the users of atomic writes by
> > forcing
> > the extents to be of a certain alignment/size irrespective of the
> > size
> > of write. The design in this patchset provides this flexibility at
> > the
> > cost of some added precautions that the user should take (eg not
> > doing
> > an atomic write on a pre existing unaligned extent etc).
> 
> Doesn't bigalloc already give you what you require here?

Yes, but its an mkfs time feature and it also applies to each an every
file which might not be desirable for all use cases. 

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-11 10:54       ` Ojaswin Mujoo
  2023-12-12  7:46         ` John Garry
@ 2023-12-13  6:42         ` Ojaswin Mujoo
  2023-12-13  9:20           ` John Garry
  1 sibling, 1 reply; 32+ messages in thread
From: Ojaswin Mujoo @ 2023-12-13  6:42 UTC (permalink / raw
  To: John Garry
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On Mon, Dec 11, 2023 at 04:24:14PM +0530, Ojaswin Mujoo wrote:
> > > > looks ok so far, then write 4KB at offset 0:
> > > > 
> > > > # /test-pwritev2 -a -d -p 0 -l 4096  /root/mnt/file
> > > > file=/root/mnt/file write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
> > 
> > ...
> > 
> > > > Please note that I tested on my own dev branch, which contains changes over
> > > > [1], but I expect it would not make a difference for this test.
> > > Hmm this should not ideally happen, can you please share your test
> > > script with me if possible?
> > 
> > It's doing nothing special, just RWF_ATOMIC flag is set for DIO write:
> > 
> > https://github.com/johnpgarry/linux/blob/236870d48ecb19c1cf89dc439e188182a0524cd4/samples/vfs/test-pwritev2.c
> 
> Thanks for the script, will try to replicate this today and get back to
> you.
> 

Hi John,

So I don't seem to be able to hit the warn on:

$ touch mnt/testfile
$ ./test-pwritev2 -a -d -p 0 -l 4096 mnt/testfile

	file=mnt/testfile write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
	main wrote 4096 bytes at offset 0

$ filefrag -v mnt/testfile

	Filesystem type is: ef53
	File size of testfile is 4096 (1 block of 4096 bytes)
	ext:     logical_offset:        physical_offset: length:   expected: flags:
		0:        0..       0:      32900..     32900:      1: last,eof

$ ./test-pwritev2 -a -d -p 8192 -l 8192 mnt/testfile

	file=mnt/testfile write_size=8192 offset=8192 o_flags=0x4002 wr_flags=0x24
	main wrote 8192 bytes at offset 8192

$ filefrag -v mnt/testfile

	Filesystem type is: ef53
	File size of mnt/testfile is 16384 (4 blocks of 4096 bytes)
	 ext:     logical_offset:        physical_offset: length:   expected: flags:
		0:        0..       0:      32900..     32900:      1:
		1:        2..       3:      33288..     33289:      2: 32902: last,eof
	mnt/testfile: 2 extents found

Not sure why you are hitting the WARN_ON. The tree I used is:

Latest ted/dev + your atomic patchset v1 + this patchset

Regards,
ojaswin

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-13  5:59           ` Ojaswin Mujoo
@ 2023-12-13  9:17             ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-13  9:17 UTC (permalink / raw
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On 13/12/2023 05:59, Ojaswin Mujoo wrote:
>> However - as we have seen with a trial user - it can create a problem if we
>> don't do that and we write 4K and then overwrite with a 16K atomic write to
>> a file, as 2x extents may be allocated for the complete 16K and it cannot be
>> issued as a single BIO.
> So currently, if we don't fallocate beforehand in xfs and the user
> tries to do the 16k overwrite to an offset having a 4k extent, how are
> we handling it?
> 
> Here ext4 will return an error indicating atomic write can't happen at
> this particular offset. The way I see it is if the user passes atomic
> flag to pwritev2 and we are unable to ensure atomicity for any reason we
> return error, which seems like a fair approach for a generic interface.

ok, but this does not seem like a good user experience.

> 
>>> We didn't want to overly restrict the users of atomic writes by
>>> forcing
>>> the extents to be of a certain alignment/size irrespective of the
>>> size
>>> of write. The design in this patchset provides this flexibility at
>>> the
>>> cost of some added precautions that the user should take (eg not
>>> doing
>>> an atomic write on a pre existing unaligned extent etc).
>> Doesn't bigalloc already give you what you require here?
> Yes, but its an mkfs time feature and it also applies to each an every
> file which might not be desirable for all use cases.

Sure, but to get started could you initially support that option (as 
long as it does not conflict with other per-file option)?

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

* Re: [RFC 0/7] ext4: Allocator changes for atomic write support with DIO
  2023-12-13  6:42         ` Ojaswin Mujoo
@ 2023-12-13  9:20           ` John Garry
  0 siblings, 0 replies; 32+ messages in thread
From: John Garry @ 2023-12-13  9:20 UTC (permalink / raw
  To: Ojaswin Mujoo
  Cc: linux-ext4, Theodore Ts'o, Ritesh Harjani, linux-kernel,
	Darrick J . Wong, linux-block, linux-xfs, linux-fsdevel, dchinner

On 13/12/2023 06:42, Ojaswin Mujoo wrote:
> Hi John,
> 
> So I don't seem to be able to hit the warn on:
> 
> $ touch mnt/testfile
> $ ./test-pwritev2 -a -d -p 0 -l 4096 mnt/testfile
> 
> 	file=mnt/testfile write_size=4096 offset=0 o_flags=0x4002 wr_flags=0x24
> 	main wrote 4096 bytes at offset 0
> 
> $ filefrag -v mnt/testfile
> 
> 	Filesystem type is: ef53
> 	File size of testfile is 4096 (1 block of 4096 bytes)
> 	ext:     logical_offset:        physical_offset: length:   expected: flags:
> 		0:        0..       0:      32900..     32900:      1: last,eof
> 
> $ ./test-pwritev2 -a -d -p 8192 -l 8192 mnt/testfile
> 
> 	file=mnt/testfile write_size=8192 offset=8192 o_flags=0x4002 wr_flags=0x24
> 	main wrote 8192 bytes at offset 8192
> 
> $ filefrag -v mnt/testfile
> 
> 	Filesystem type is: ef53
> 	File size of mnt/testfile is 16384 (4 blocks of 4096 bytes)
> 	 ext:     logical_offset:        physical_offset: length:   expected: flags:
> 		0:        0..       0:      32900..     32900:      1:
> 		1:        2..       3:      33288..     33289:      2: 32902: last,eof
> 	mnt/testfile: 2 extents found
> 
> Not sure why you are hitting the WARN_ON. The tree I used is:
> 
> Latest ted/dev + your atomic patchset v1 + this patchset

What commit/tree is ted/dev exactly?

Anyway, my v2 series is at 
https://github.com/johnpgarry/linux/commits/atomic-writes-v6.7-v2-blk

I'll try that later for ext4.

Thanks,
John

^ permalink raw reply	[flat|nested] 32+ messages in thread

end of thread, other threads:[~2023-12-13  9:25 UTC | newest]

Thread overview: 32+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2023-11-30 13:53 [RFC 0/7] ext4: Allocator changes for atomic write support with DIO Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 1/7] iomap: Don't fall back to buffered write if the write is atomic Ojaswin Mujoo
2023-11-30 21:10   ` Dave Chinner
2023-12-01 10:42     ` John Garry
2023-12-01 13:27       ` Matthew Wilcox
2023-12-01 19:06         ` John Garry
2023-12-01 22:07       ` Dave Chinner
2023-12-04  9:02         ` John Garry
2023-12-04 18:17           ` Darrick J. Wong
2023-12-04 18:34             ` John Garry
2023-12-07 12:43         ` John Garry
2023-11-30 13:53 ` [RFC 2/7] ext4: Factor out size and start prediction from ext4_mb_normalize_request() Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 3/7] ext4: add aligned allocation support in mballoc Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 4/7] ext4: allow inode preallocation for aligned alloc Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 5/7] block: export blkdev_atomic_write_valid() and refactor api Ojaswin Mujoo
2023-12-01 10:47   ` John Garry
2023-12-11 10:57     ` Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 6/7] ext4: Add aligned allocation support for atomic direct io Ojaswin Mujoo
2023-11-30 13:53 ` [RFC 7/7] ext4: Support atomic write for statx Ojaswin Mujoo
2023-12-04 10:36 ` [RFC 0/7] ext4: Allocator changes for atomic write support with DIO John Garry
2023-12-04 13:38   ` Ojaswin Mujoo
2023-12-04 14:44     ` John Garry
2023-12-11 10:54       ` Ojaswin Mujoo
2023-12-12  7:46         ` John Garry
2023-12-12 13:10           ` Christoph Hellwig
2023-12-12 15:16             ` Theodore Ts'o
2023-12-12 15:19               ` Christoph Hellwig
2023-12-12 16:10             ` John Garry
2023-12-13  5:59           ` Ojaswin Mujoo
2023-12-13  9:17             ` John Garry
2023-12-13  6:42         ` Ojaswin Mujoo
2023-12-13  9:20           ` John Garry

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).