Linux-XFS Archive mirror
From: Leah Rumancik <leah.rumancik@gmail.com>
To: stable@vger.kernel.org
Cc: linux-xfs@vger.kernel.org, amir73il@gmail.com,
	chandan.babu@oracle.com, fred@cloudflare.com,
	Dave Chinner <dchinner@redhat.com>,
	Christoph Hellwig <hch@lst.de>,
	"Darrick J . Wong" <djwong@kernel.org>,
	Leah Rumancik <leah.rumancik@gmail.com>
Subject: [PATCH 6.1 01/24] xfs: write page faults in iomap are not buffered writes
Date: Wed,  1 May 2024 11:40:49 -0700
Message-ID: <20240501184112.3799035-1-leah.rumancik@gmail.com>

From: Dave Chinner <dchinner@redhat.com>

[ Upstream commit 118e021b4b66f758f8e8f21dc0e5e0a4c721e69e ]

When we reserve a delalloc region in xfs_buffered_write_iomap_begin,
we mark the iomap as IOMAP_F_NEW so that the write context
understands that it allocated the delalloc region.

If we then fail that buffered write, xfs_buffered_write_iomap_end()
checks for the IOMAP_F_NEW flag and if it is set, it punches out
the unused delalloc region that was allocated for the write.

The assumption this code makes is that all buffered write operations
that can allocate space are run under an exclusive lock (i_rwsem).
This is an invalid assumption: page faults in mmap()d regions call
through this same function pair to map the file range being faulted
and this runs only holding the inode->i_mapping->invalidate_lock in
shared mode.

IOWs, page faults can race with write() calls whose nested page cache
write operation fails, and the result is data loss.
That is, the failing iomap_end call will punch out the data that
the other racing iomap iteration brought into the page cache. This
can be reproduced with generic/34[46] if we arbitrarily fail page
cache copy-in operations from write() syscalls.
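
One possible interleaving of that race can be sketched as a small,
single-threaded userspace model. This is only an illustration under
stated assumptions: every model_* name below is hypothetical, nothing
here is kernel code, and the real cleanup path is more involved than
a pair of booleans.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model; none of these names exist in the kernel. */
struct model_file_range {
	bool delalloc_reserved;	/* a delalloc extent backs the range */
	bool cached_data;	/* the page cache holds data for the range */
};

/*
 * write(): reserve delalloc for the range. The return value stands in for
 * IOMAP_F_NEW, i.e. "this context allocated the extent".
 */
static bool model_write_begin(struct model_file_range *r)
{
	bool is_new = !r->delalloc_reserved;

	r->delalloc_reserved = true;
	return is_new;
}

/* Page fault: maps the same range and brings data into the page cache. */
static void model_page_fault(struct model_file_range *r)
{
	r->delalloc_reserved = true;
	r->cached_data = true;
}

/*
 * Cleanup after a failed copy-in: if this context thinks it allocated the
 * extent, punch it out together with the cached data covering it.
 */
static void model_write_failed_end(struct model_file_range *r, bool is_new)
{
	if (!is_new)
		return;
	r->cached_data = false;
	r->delalloc_reserved = false;
}

/* Replays the interleaving; returns whether the faulted data survived. */
static bool model_fault_data_survives(void)
{
	struct model_file_range r = { false, false };
	bool is_new = model_write_begin(&r);	/* write() maps first */

	assert(is_new);				/* write() owns the extent */
	model_page_fault(&r);			/* fault races in */
	model_write_failed_end(&r, is_new);	/* copy-in failed */
	return r.cached_data;
}
```

In this model model_fault_data_survives() returns false: the cleanup
run on behalf of the failed write() also discards the data that the
page fault brought into the page cache.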

Code analysis tells us that the iomap_page_mkwrite() function holds
the already instantiated and uptodate folio locked across the iomap
mapping iterations. Hence the folio cannot be removed from memory
whilst we are mapping the range it covers, and as such we do not
care if the mapping changes state underneath the iomap iteration
loop:

1. If the folio is not already dirty, there are no writeback races
   possible.
2. If we allocated the mapping (delalloc or unwritten), the folio
   cannot already be dirty. See #1.
3. If the folio is already dirty, it must be up to date. As we hold
   it locked, it cannot be reclaimed from memory. Hence we always
   have valid data in the page cache while iterating the mapping.
4. Valid data in the page cache can exist when the underlying
   mapping is DELALLOC, UNWRITTEN or WRITTEN. Having the mapping
   change from DELALLOC->UNWRITTEN or UNWRITTEN->WRITTEN does not
   change the data in the page - it only affects actions if we are
   initialising a new page. Hence #3 applies and we don't care
   about these extent map transitions racing with
   iomap_page_mkwrite().
5. iomap_page_mkwrite() checks for page invalidation races
   (truncate, hole punch, etc) after it locks the folio. We also
   hold the mapping->invalidate_lock here, and hence the mapping
   cannot change due to extent removal operations while we are
   iterating the folio.

As such, filesystems that don't use bufferheads will never fail
the iomap_folio_mkwrite_iter() operation on the current mapping,
regardless of whether the iomap should be considered stale.

Further, the range we are asked to iterate is limited to the range
inside EOF that the folio spans. Hence, for XFS, we will only map
the exact range we are asked for, and we will only do speculative
preallocation with delalloc if we are mapping a hole at the EOF
page. The iterator will consume the entire range of the folio that
is within EOF, and anything beyond the EOF block cannot be accessed.
We never need to truncate this post-EOF speculative prealloc away in
the context of the iomap_page_mkwrite() iterator because if it
remains unused we'll remove it when the last reference to the inode
goes away.

Hence we don't actually need an .iomap_end() cleanup/error handling
path at all for iomap_page_mkwrite() for XFS. This means we can
separate the page fault processing from the complexity of the
.iomap_end() processing in the buffered write path. It also means
that the buffered write path will be able to take the
mapping->invalidate_lock as necessary.
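
The effect of giving page faults their own ops table can be sketched
in a standalone userspace model. It assumes, as the iomap iterator
does, that an absent end callback is simply skipped. All model_*
names are hypothetical stand-ins (MODEL_F_NEW for IOMAP_F_NEW, the
two tables for xfs_buffered_write_iomap_ops and
xfs_page_mkwrite_iomap_ops); this is not the kernel implementation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_F_NEW 0x1u	/* stands in for IOMAP_F_NEW */

struct model_iomap {
	unsigned int flags;
	bool delalloc_reserved;
};

struct model_iomap_ops {
	int (*begin)(struct model_iomap *map);
	int (*end)(struct model_iomap *map, size_t requested, size_t written);
};

/* Shared begin: reserve delalloc and mark the mapping as newly allocated. */
static int model_begin(struct model_iomap *map)
{
	map->flags |= MODEL_F_NEW;
	map->delalloc_reserved = true;
	return 0;
}

/* Buffered write end: on a short write, punch out the new reservation. */
static int model_buffered_end(struct model_iomap *map, size_t requested,
			      size_t written)
{
	if ((map->flags & MODEL_F_NEW) && written < requested)
		map->delalloc_reserved = false;
	return 0;
}

static const struct model_iomap_ops model_buffered_write_ops = {
	.begin	= model_begin,
	.end	= model_buffered_end,
};

/* Page fault ops: same begin, deliberately no end callback. */
static const struct model_iomap_ops model_page_mkwrite_ops = {
	.begin	= model_begin,
};

/* One iteration; returns whether the reservation is still in place. */
static bool model_iterate(const struct model_iomap_ops *ops,
			  size_t requested, size_t written)
{
	struct model_iomap map = { 0 };

	assert(ops && ops->begin);
	if (ops->begin(&map))
		return false;
	if (ops->end)		/* optional, skipped when NULL */
		ops->end(&map, requested, written);
	return map.delalloc_reserved;
}
```

With this model, model_iterate(&model_buffered_write_ops, 4096, 0)
drops the reservation, while the same call with
&model_page_mkwrite_ops keeps it, because no cleanup path is ever
entered for the page fault case.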

Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Leah Rumancik <leah.rumancik@gmail.com>
Acked-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_file.c  | 2 +-
 fs/xfs/xfs_iomap.c | 9 +++++++++
 fs/xfs/xfs_iomap.h | 1 +
 3 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_file.c b/fs/xfs/xfs_file.c
index e462d39c840e..595a5bcf46b9 100644
--- a/fs/xfs/xfs_file.c
+++ b/fs/xfs/xfs_file.c
@@ -1325,7 +1325,7 @@ __xfs_filemap_fault(
 		if (write_fault) {
 			xfs_ilock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 			ret = iomap_page_mkwrite(vmf,
-					&xfs_buffered_write_iomap_ops);
+					&xfs_page_mkwrite_iomap_ops);
 			xfs_iunlock(XFS_I(inode), XFS_MMAPLOCK_SHARED);
 		} else {
 			ret = filemap_fault(vmf);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 07da03976ec1..5cea069a38b4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1187,6 +1187,15 @@ const struct iomap_ops xfs_buffered_write_iomap_ops = {
 	.iomap_end		= xfs_buffered_write_iomap_end,
 };
 
+/*
+ * iomap_page_mkwrite() will never fail in a way that requires delalloc extents
+ * that it allocated to be revoked. Hence we do not need an .iomap_end method
+ * for this operation.
+ */
+const struct iomap_ops xfs_page_mkwrite_iomap_ops = {
+	.iomap_begin		= xfs_buffered_write_iomap_begin,
+};
+
 static int
 xfs_read_iomap_begin(
 	struct inode		*inode,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index c782e8c0479c..0f62ab633040 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -47,6 +47,7 @@ xfs_aligned_fsb_count(
 }
 
 extern const struct iomap_ops xfs_buffered_write_iomap_ops;
+extern const struct iomap_ops xfs_page_mkwrite_iomap_ops;
 extern const struct iomap_ops xfs_direct_write_iomap_ops;
 extern const struct iomap_ops xfs_read_iomap_ops;
 extern const struct iomap_ops xfs_seek_iomap_ops;
-- 
2.45.0.rc1.225.g2a3ae87e7f-goog


Thread overview: 31+ messages
2024-05-01 18:40 Leah Rumancik [this message]
2024-05-01 18:40 ` [PATCH 6.1 02/24] xfs: punching delalloc extents on write failure is racy Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 03/24] xfs: use byte ranges for write cleanup ranges Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 04/24] xfs,iomap: move delalloc punching to iomap Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 05/24] iomap: buffered write failure should not truncate the page cache Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 06/24] xfs: xfs_bmap_punch_delalloc_range() should take a byte range Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 07/24] iomap: write iomap validity checks Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 08/24] xfs: use iomap_valid method to detect stale cached iomaps Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 09/24] xfs: drop write error injection is unfixable, remove it Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 10/24] xfs: fix off-by-one-block in xfs_discard_folio() Leah Rumancik
2024-05-01 18:40 ` [PATCH 6.1 11/24] xfs: fix incorrect error-out in xfs_remove Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 12/24] xfs: fix sb write verify for lazysbcount Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 13/24] xfs: fix incorrect i_nlink caused by inode racing Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 14/24] xfs: invalidate block device page cache during unmount Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 15/24] xfs: attach dquots to inode before reading data/cow fork mappings Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 16/24] xfs: wait iclog complete before tearing down AIL Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 17/24] xfs: fix super block buf log item UAF during force shutdown Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 18/24] xfs: hoist refcount record merge predicates Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 19/24] xfs: estimate post-merge refcounts correctly Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 20/24] xfs: invalidate xfs_bufs when allocating cow extents Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 21/24] xfs: allow inode inactivation during a ro mount log recovery Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 22/24] xfs: fix log recovery when unknown rocompat bits are set Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 23/24] xfs: get root inode correctly at bulkstat Leah Rumancik
2024-05-01 18:41 ` [PATCH 6.1 24/24] xfs: short circuit xfs_growfs_data_private() if delta is zero Leah Rumancik
2024-05-04  9:16 ` [PATCH 6.1 01/24] xfs: write page faults in iomap are not buffered writes Greg KH
2024-05-04 18:17   ` Amir Goldstein
2024-05-06 17:52     ` Leah Rumancik
2024-05-22 14:11       ` Greg KH
2024-05-22 21:55         ` Leah Rumancik
2024-05-23  7:08           ` Greg KH
2024-05-23 11:06           ` Greg KH
