[PATCHSET v29.4 03/13] xfs: atomic file content exchanges

* [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
@ 2024-02-27  2:18 Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
                   ` (16 more replies)
  0 siblings, 17 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:18 UTC (permalink / raw
  To: djwong; +Cc: linux-fsdevel, linux-xfs, hch

Hi all,

This series creates a new XFS_IOC_EXCHANGE_RANGE ioctl to exchange
ranges of bytes between two files atomically.  This new functionality
enables data storage programs to stage and commit file updates such that
reader programs will see either the old contents or the new contents in
their entirety, with no chance of torn writes.  A successful call
completion guarantees that the new contents will be seen even if the
system fails.

The ability to exchange file fork mappings between files in this manner
is critical to supporting online filesystem repair, which is built upon
the strategy of constructing a clean copy of a damaged structure and
committing the new structure into the metadata file atomically.

User programs will be able to update files atomically by opening an
O_TMPFILE, reflinking the source file to it, making whatever updates
they want to make, and exchanging the relevant ranges of the temp file
with the original file.  If the updates are aligned with the file block
size, a new (since v2) flag provides for exchanging only the written
areas.  Callers can arrange for the update to be rejected if the
original file has been changed.

The intent behind this new userspace functionality is to enable atomic
rewrites of arbitrary parts of individual files.  For years, application
programmers wanting to ensure the atomicity of a file update had to
write the changes to a new file in the same directory, fsync the new
file, rename the new file on top of the old filename, and then fsync the
directory.  People get it wrong all the time, and $fs hacks abound.
Here are the proposed manual pages:

IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2)

NAME
       ioctl_xfs_exchange_range  -  exchange  the contents of parts of
       two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs_staging.h>

       int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
       xfs_exch_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes the contents of the two ranges.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The  system  call  parameters are conveyed in structures of the
       following form:

           struct xfs_exch_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
               __s64    length;
               __u64    flags;

               __u64    pad;
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       Both files must be from the same filesystem mount.  If the  two
       file  descriptors represent the same file, the byte ranges must
       not overlap.  Most  disk-based  filesystems  require  that  the
       starts  of  both ranges must be aligned to the file block size.
       If this is the case, the ends of the ranges  must  also  be  so
       aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
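
The range rules in the paragraph above can be modelled in userspace along
these lines.  This is only a sketch of the stated constraints, not the kernel's
checking code; the function name exch_ranges_valid() and its parameters are
hypothetical, and the same-file case combined with exchange-to-EOF is not
modelled.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical userspace model of the documented range rules.
 * blksz is the file block size; to_eof mirrors XFS_EXCHRANGE_TO_EOF.
 */
static bool exch_ranges_valid(bool same_file, int64_t off1, int64_t off2,
			      int64_t len, int64_t blksz, bool to_eof)
{
	if (off1 < 0 || off2 < 0 || len < 0 || blksz <= 0)
		return false;

	/* Reject ranges whose end would overflow a signed 64-bit offset. */
	if (len > INT64_MAX - off1 || len > INT64_MAX - off2)
		return false;

	/* The starts of both ranges must be aligned to the block size. */
	if (off1 % blksz || off2 % blksz)
		return false;

	/* The ends must be aligned too, unless exchanging to EOF. */
	if (!to_eof && len % blksz)
		return false;

	/* Two ranges within the same file must not overlap. */
	if (same_file && !to_eof && off1 < off2 + len && off2 < off1 + len)
		return false;

	return true;
}
```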

       The flags field controls the behavior of the exchange operation.

           XFS_EXCHRANGE_TO_EOF
                  Ignore  the length parameter.  All bytes in file1_fd
                  from file1_offset to EOF are moved to file2_fd,  and
                  file2's  size is set to (file2_offset+(file1_length-
                  file1_offset)).  Meanwhile, all bytes in file2  from
                  file2_offset  to  EOF are moved to file1 and file1's
                  size   is   set   to    (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHRANGE_DSYNC
                  Ensure  that  all modified in-core data in both file
                  ranges and all metadata updates  pertaining  to  the
                  exchange operation are flushed to persistent storage
                  before the call returns.  Opening  either  file  de‐
                  scriptor  with  O_SYNC or O_DSYNC will have the same
                  effect.

           XFS_EXCHRANGE_FILE1_WRITTEN
                  Only exchange sub-ranges of file1_fd that are  known
                  to  contain  data  written  by application software.
                  Each sub-range may be  expanded  (both  upwards  and
                  downwards)  to  align with the file allocation unit.
                  For files on the data device, this is one filesystem
                  block.   For  files  on the realtime device, this is
                  the realtime extent size.  This facility can be used
                  to  implement  fast  atomic scatter-gather writes of
                  any complexity for software-defined storage  targets
                  if  all  writes  are  aligned to the file allocation
                  unit.

           XFS_EXCHRANGE_DRY_RUN
                  Check the parameters and the feasibility of the  op‐
                  eration, but do not change anything.
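
As a worked example of the XFS_EXCHRANGE_TO_EOF size arithmetic given above,
the resulting file sizes can be computed as follows.  The struct and function
names here are hypothetical and exist only for illustration.

```c
#include <stdint.h>

/* Hypothetical holder for the file sizes after a TO_EOF exchange. */
struct exch_sizes {
	int64_t file1_size;
	int64_t file2_size;
};

/*
 * Apply the documented formulas:
 *   file2 size = file2_offset + (file1_length - file1_offset)
 *   file1 size = file1_offset + (file2_length - file2_offset)
 */
static struct exch_sizes
to_eof_sizes(int64_t file1_length, int64_t file1_offset,
	     int64_t file2_length, int64_t file2_offset)
{
	struct exch_sizes s = {
		.file1_size = file1_offset + (file2_length - file2_offset),
		.file2_size = file2_offset + (file1_length - file1_offset),
	};

	return s;
}
```

With both offsets set to zero the two files simply trade sizes, which is the
whole-file replacement case used in the examples.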

RETURN VALUE
       On  error, -1 is returned, and errno is set to indicate the er‐
       ror.

ERRORS
       Error codes can be one of, but are not limited to, the  follow‐
       ing:

       EBADF  file1_fd  is not open for reading and writing or is open
              for append-only writes; or  file2_fd  is  not  open  for
              reading and writing or is open for append-only writes.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space in the filesystem to
              exchange the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd is immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several  use  cases  are imagined for this system call.  In all
       cases, application software must coordinate updates to the file
       because the exchange is performed unconditionally.

       The  first  is a data storage program that wants to commit non-
       contiguous updates to a file atomically and  coordinates  write
       access  to that file.  This can be done by creating a temporary
       file, calling FICLONE(2) to share the contents, and staging the
       updates into the temporary file.  The XFS_EXCHRANGE_TO_EOF flag is
       recommended for this purpose.  The temporary file can be deleted or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           struct xfs_exch_range args = {
               .file1_fd = temp_fd,
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

       The  second  is  a  software-defined  storage host (e.g. a disk
       jukebox) which implements an atomic scatter-gather  write  com‐
       mand.   Provided the exported disk's logical block size matches
       the file's allocation unit size, this can be done by creating a
       temporary file and writing the data at the appropriate offsets.
       It is recommended that the temporary file be truncated  to  the
       size  of  the  regular file before any writes are staged to the
       temporary file to avoid issues with zeroing during  EOF  exten‐
       sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
       only the file allocation units involved  in  the  emulated  de‐
       vice's  write  command.  The temporary file should be truncated
       or punched out completely before being reused to stage  another
       write.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           int blksz;

           fstat(fd, &sb);
           blksz = sb.st_blksize;

           /* land scatter gather writes between 100fsb and 500fsb */
           pwrite(temp_fd, data1, blksz * 2, blksz * 100);
           pwrite(temp_fd, data2, blksz * 20, blksz * 480);
           pwrite(temp_fd, data3, blksz * 7, blksz * 257);

           /* commit the entire update */
           struct xfs_exch_range args = {
               .file1_fd = temp_fd,
               .file1_offset = blksz * 100,
               .file2_offset = blksz * 100,
               .length       = blksz * 400,
               .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
                               XFS_EXCHRANGE_DSYNC,
           };

           ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
IOCTL-XFS-COMMIT-RANGE(2)  System Calls Manual  IOCTL-XFS-COMMIT-RANGE(2)

NAME
       ioctl_xfs_commit_range - conditionally exchange the contents of
       parts of two files

SYNOPSIS
       #include <sys/ioctl.h>
       #include <xfs/xfs_fs_staging.h>

       int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
       mit_range *arg);

DESCRIPTION
       Given  a  range  of bytes in a first file file1_fd and a second
       range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
       changes  the contents of the two ranges if file2_fd passes cer‐
       tain freshness criteria.

       After locking both files but before  exchanging  the  contents,
       the  supplied  file2_ino field must match file2_fd's inode num‐
       ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
       file2_ctime, and file2_ctime_nsec fields must match the modifi‐
       cation time and change time of file2.  If they  do  not  match,
       EBUSY will be returned.

       Exchanges  are  atomic  with  regards to concurrent file opera‐
       tions, so no userspace-level locks need to be taken  to  obtain
       consistent  results.  Implementations must guarantee that read‐
       ers see either the old contents or the new  contents  in  their
       entirety, even if the system fails.

       The  system  call  parameters are conveyed in structures of the
       following form:

           struct xfs_commit_range {
               __s64    file1_fd;
               __s64    file1_offset;
               __s64    file2_offset;
               __s64    length;
               __u64    flags;

               __s64    file2_ino;
               __s64    file2_mtime;
               __s64    file2_ctime;
               __s32    file2_mtime_nsec;
               __s32    file2_ctime_nsec;

               __u64    pad;
           };

       The field pad must be zero.

       The fields file1_fd, file1_offset, and length define the  first
       range of bytes to be exchanged.

       The fields file2_fd, file2_offset, and length define the second
       range of bytes to be exchanged.

       The   fields    file2_ino,    file2_mtime,    file2_mtime_nsec,
       file2_ctime   and   file2_ctime_nsec   must  be  gathered  from
       file2_fd's stat information prior to beginning the file update.
       These file attributes are used to confirm that file2_fd has not
       been changed by another thread since the current thread began
       staging its own update.

       Both  files must be from the same filesystem mount.  If the two
       file descriptors represent the same file, the byte ranges  must
       not  overlap.   Most  disk-based  filesystems  require that the
       starts of both ranges must be aligned to the file  block  size.
       If  this  is  the  case, the ends of the ranges must also be so
       aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.

       The flags field controls the behavior of the exchange operation.

           XFS_EXCHRANGE_TO_EOF
                  Ignore the length parameter.  All bytes in  file1_fd
                  from  file1_offset to EOF are moved to file2_fd, and
                  file2's size is set to  (file2_offset+(file1_length-
                  file1_offset)).   Meanwhile, all bytes in file2 from
                  file2_offset to EOF are moved to file1  and  file1's
                  size    is   set   to   (file1_offset+(file2_length-
                  file2_offset)).

           XFS_EXCHRANGE_DSYNC
                  Ensure that all modified in-core data in  both  file
                  ranges  and  all  metadata updates pertaining to the
                  exchange operation are flushed to persistent storage
                  before  the  call  returns.  Opening either file de‐
                  scriptor with O_SYNC or O_DSYNC will have  the  same
                  effect.

           XFS_EXCHRANGE_FILE1_WRITTEN
                  Only  exchange sub-ranges of file1_fd that are known
                  to contain data  written  by  application  software.
                  Each  sub-range  may  be  expanded (both upwards and
                  downwards) to align with the file  allocation  unit.
                  For files on the data device, this is one filesystem
                  block.  For files on the realtime  device,  this  is
                  the realtime extent size.  This facility can be used
                  to implement fast atomic  scatter-gather  writes  of
                  any  complexity for software-defined storage targets
                  if all writes are aligned  to  the  file  allocation
                  unit.

           XFS_EXCHRANGE_DRY_RUN
                  Check  the parameters and the feasibility of the op‐
                  eration, but do not change anything.
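
The sub-range expansion mentioned under XFS_EXCHRANGE_FILE1_WRITTEN amounts to
rounding the start down and the end up to the file allocation unit.  A minimal
sketch, assuming a hypothetical helper name and using division rather than bit
masks so that it also works when the unit (for example a realtime extent size)
is not a power of two:

```c
#include <stdint.h>

/* Half-open byte range [start, end). */
struct byte_range {
	uint64_t start;
	uint64_t end;
};

/*
 * Expand [start, end) outwards so that both edges fall on a multiple of
 * the allocation unit.  unit must be nonzero.
 */
static struct byte_range
expand_to_alloc_unit(uint64_t start, uint64_t end, uint64_t unit)
{
	struct byte_range r;

	r.start = (start / unit) * unit;		/* round down */
	r.end = ((end + unit - 1) / unit) * unit;	/* round up */
	return r;
}
```

This is why writes that are already aligned to the allocation unit exchange
exactly the bytes that were written and nothing more.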

RETURN VALUE
       On error, -1 is returned, and errno is set to indicate the  er‐
       ror.

ERRORS
       Error  codes can be one of, but are not limited to, the follow‐
       ing:

       EBADF  file1_fd is not open for reading and writing or is  open
              for  append-only  writes;  or  file2_fd  is not open for
              reading and writing or is open for append-only writes.

       EBUSY  The file2 inode number and timestamps  supplied  do  not
              match file2_fd.

       EINVAL The  parameters  are  not correct for these files.  This
              error can also appear if either file  descriptor  repre‐
              sents  a device, FIFO, or socket.  Disk filesystems gen‐
              erally require the offset and  length  arguments  to  be
              aligned to the fundamental block sizes of both files.

       EIO    An I/O error occurred.

       EISDIR One of the files is a directory.

       ENOMEM The  kernel  was unable to allocate sufficient memory to
              perform the operation.

       ENOSPC There is not enough free space in the filesystem to
              exchange the contents safely.

       EOPNOTSUPP
              The filesystem does not support exchanging bytes between
              the two files.

       EPERM  file1_fd or file2_fd is immutable.

       ETXTBSY
              One of the files is a swap file.

       EUCLEAN
              The filesystem is corrupt.

       EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
              filesystem.

CONFORMING TO
       This API is XFS-specific.

USE CASES
       Several use cases are imagined for this system call.  Coordina‐
       tion between multiple threads is performed by the kernel.

       The first is a filesystem defragmenter, which copies  the  con‐
       tents  of  a  file into another file and wishes to exchange the
       space mappings of the two files,  provided  that  the  original
       file has not changed.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           struct xfs_commit_range args = {
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           fstat(fd, &sb);
           args.file2_ino = sb.st_ino;
           args.file2_mtime = sb.st_mtim.tv_sec;
           args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
           args.file2_ctime = sb.st_ctim.tv_sec;
           args.file2_ctime_nsec = sb.st_ctim.tv_nsec;

           /* make a fresh copy of the file with terrible alignment to avoid reflink */
           copy_file_range(fd, NULL, temp_fd, NULL, 1, 0);
           copy_file_range(fd, NULL, temp_fd, NULL, sb.st_size - 1, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed while defrag was underway\n");

       The second is a data storage program that wants to commit
       non-contiguous updates to a file atomically.  This program cannot
       coordinate updates to the file and therefore relies on the kernel
       to reject the COMMIT_RANGE command if the file has been updated by
       someone else.  This can be done by creating a temporary file,
       calling FICLONE(2) to share the contents, and staging the updates
       into the temporary file.  The XFS_EXCHRANGE_TO_EOF flag is
       recommended for this purpose.  The temporary file can be deleted or
       punched out afterwards.

       An example program might look like this:

           int fd = open("/some/file", O_RDWR);
           int temp_fd = open("/some", O_TMPFILE | O_RDWR);
           struct stat sb;
           struct xfs_commit_range args = {
               .flags = XFS_EXCHRANGE_TO_EOF,
           };

           /* gather file2's freshness information */
           fstat(fd, &sb);
           args.file2_ino = sb.st_ino;
           args.file2_mtime = sb.st_mtim.tv_sec;
           args.file2_mtime_nsec = sb.st_mtim.tv_nsec;
           args.file2_ctime = sb.st_ctim.tv_sec;
           args.file2_ctime_nsec = sb.st_ctim.tv_nsec;

           ioctl(temp_fd, FICLONE, fd);

           /* append 1MB of records */
           lseek(temp_fd, 0, SEEK_END);
           write(temp_fd, data1, 1000000);

           /* update record index */
           pwrite(temp_fd, data1, 600, 98765);
           pwrite(temp_fd, data2, 320, 54321);
           pwrite(temp_fd, data2, 15, 0);

           /* commit the entire update */
           args.file1_fd = temp_fd;
           ret = ioctl(fd, XFS_IOC_COMMIT_RANGE, &args);
           if (ret && errno == EBUSY)
               printf("file changed before commit; will roll back\n");

NOTES
       Some  filesystems may limit the amount of data or the number of
       extents that can be exchanged in a single call.

SEE ALSO
       ioctl(2)

XFS                           2024-02-18     IOCTL-XFS-COMMIT-RANGE(2)

The reference implementation in XFS creates a new log incompat feature
and log intent items to track high level progress of swapping ranges of
two files and finish interrupted work if the system goes down.  Sample
code can be found in the corresponding changes to xfs_io to exercise the
use case mentioned above.

Note that this function is /not/ the O_DIRECT atomic untorn file writes
concept that has also been floating around for years.  It is also not
the RWF_ATOMIC patchset that has been shared.  This RFC is constructed
entirely in software, which means that there are no limitations other
than the general filesystem limits.

As a side note, the original motivation behind the kernel functionality
is online repair of file-based metadata.  The atomic file content
exchange is implemented as an atomic exchange of file fork mappings,
which means that we can implement online reconstruction of extended
attributes and directories by building a new one in another inode and
exchanging the contents.

Subsequent patchsets adapt the online filesystem repair code to use
atomic file exchanges.  This enables repair functions to construct a
clean copy of a directory, xattr information, symbolic links, realtime
bitmaps, and realtime summary information in a temporary inode.  If this
completes successfully, the new contents can be committed atomically
into the inode being repaired.  This is essential to avoid making
corruption problems worse if the system goes down in the middle of
running repair.

For userspace, this series also includes the userspace pieces needed to
test the new functionality, and a sample implementation of atomic file
updates.

If you're going to start using this code, I strongly recommend pulling
from my git trees, which are linked below.

This has been running on the djcloud for months with no problems.  Enjoy!
Comments and questions are, as always, welcome.

--D

kernel git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates

xfsprogs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfsprogs-dev.git/log/?h=atomic-file-updates

fstests git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfstests-dev.git/log/?h=atomic-file-updates

xfsdocs git tree:
https://git.kernel.org/cgit/linux/kernel/git/djwong/xfs-documentation.git/log/?h=atomic-file-updates
---
Commits in this patchset:
 * vfs: export remap and write check helpers
 * xfs: introduce new file range exchange ioctls
 * xfs: create a log incompat flag for atomic file mapping exchanges
 * xfs: introduce a file mapping exchange log intent item
 * xfs: create deferred log items for file mapping exchanges
 * xfs: bind together the front and back ends of the file range exchange code
 * xfs: add error injection to test file mapping exchange recovery
 * xfs: condense extended attributes after a mapping exchange operation
 * xfs: condense directories after a mapping exchange operation
 * xfs: condense symbolic links after a mapping exchange operation
 * xfs: make file range exchange support realtime files
 * xfs: support non-power-of-two rtextsize with exchange-range
 * docs: update swapext -> exchmaps language
 * xfs: enable logged file mapping exchange feature
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  259 ++--
 fs/read_write.c                                    |    1 
 fs/remap_range.c                                   |    4 
 fs/xfs/Makefile                                    |    3 
 fs/xfs/libxfs/xfs_bmap.h                           |    2 
 fs/xfs/libxfs/xfs_defer.c                          |    6 
 fs/xfs/libxfs/xfs_defer.h                          |    2 
 fs/xfs/libxfs/xfs_errortag.h                       |    4 
 fs/xfs/libxfs/xfs_exchmaps.c                       | 1222 ++++++++++++++++++++
 fs/xfs/libxfs/xfs_exchmaps.h                       |  123 ++
 fs/xfs/libxfs/xfs_format.h                         |   16 
 fs/xfs/libxfs/xfs_fs.h                             |   75 +
 fs/xfs/libxfs/xfs_log_format.h                     |   64 +
 fs/xfs/libxfs/xfs_log_recover.h                    |    2 
 fs/xfs/libxfs/xfs_sb.c                             |    3 
 fs/xfs/libxfs/xfs_symlink_remote.c                 |   47 +
 fs/xfs/libxfs/xfs_symlink_remote.h                 |    1 
 fs/xfs/libxfs/xfs_trans_space.h                    |    4 
 fs/xfs/xfs_error.c                                 |    3 
 fs/xfs/xfs_exchmaps_item.c                         |  603 ++++++++++
 fs/xfs/xfs_exchmaps_item.h                         |   64 +
 fs/xfs/xfs_exchrange.c                             |  865 ++++++++++++++
 fs/xfs/xfs_exchrange.h                             |   51 +
 fs/xfs/xfs_ioctl.c                                 |   89 +
 fs/xfs/xfs_log_recover.c                           |    2 
 fs/xfs/xfs_mount.h                                 |    5 
 fs/xfs/xfs_super.c                                 |   19 
 fs/xfs/xfs_symlink.c                               |   49 -
 fs/xfs/xfs_trace.c                                 |    2 
 fs/xfs/xfs_trace.h                                 |  382 ++++++
 include/linux/fs.h                                 |    1 
 31 files changed, 3797 insertions(+), 176 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h
 create mode 100644 fs/xfs/xfs_exchmaps_item.c
 create mode 100644 fs/xfs/xfs_exchmaps_item.h
 create mode 100644 fs/xfs/xfs_exchrange.c
 create mode 100644 fs/xfs/xfs_exchrange.h



* [PATCH 01/14] vfs: export remap and write check helpers
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
@ 2024-02-27  2:21 ` Darrick J. Wong
  2024-02-28 15:40   ` Christoph Hellwig
  2024-02-27  2:21 ` [PATCH 02/14] xfs: introduce new file range exchange ioctls Darrick J. Wong
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:21 UTC (permalink / raw
  To: djwong; +Cc: linux-fsdevel, linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Export these functions so that the next patch can use them to check the
file ranges being passed to the XFS_IOC_EXCHANGE_RANGE operation.

Cc: linux-fsdevel@vger.kernel.org
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/read_write.c    |    1 +
 fs/remap_range.c   |    4 ++--
 include/linux/fs.h |    1 +
 3 files changed, 4 insertions(+), 2 deletions(-)


diff --git a/fs/read_write.c b/fs/read_write.c
index d4c036e82b6c3..85c096f2c0d06 100644
--- a/fs/read_write.c
+++ b/fs/read_write.c
@@ -1667,6 +1667,7 @@ int generic_write_check_limits(struct file *file, loff_t pos, loff_t *count)
 
 	return 0;
 }
+EXPORT_SYMBOL_GPL(generic_write_check_limits);
 
 /* Like generic_write_checks(), but takes size of write instead of iter. */
 int generic_write_checks_count(struct kiocb *iocb, loff_t *count)
diff --git a/fs/remap_range.c b/fs/remap_range.c
index de07f978ce3eb..28246dfc84851 100644
--- a/fs/remap_range.c
+++ b/fs/remap_range.c
@@ -99,8 +99,7 @@ static int generic_remap_checks(struct file *file_in, loff_t pos_in,
 	return 0;
 }
 
-static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
-			     bool write)
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write)
 {
 	int mask = write ? MAY_WRITE : MAY_READ;
 	loff_t tmp;
@@ -118,6 +117,7 @@ static int remap_verify_area(struct file *file, loff_t pos, loff_t len,
 
 	return fsnotify_file_area_perm(file, mask, &pos, len);
 }
+EXPORT_SYMBOL_GPL(remap_verify_area);
 
 /*
  * Ensure that we don't remap a partial EOF block in the middle of something
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 1fbc72c5f112c..f0ada316dc97b 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2096,6 +2096,7 @@ extern ssize_t vfs_read(struct file *, char __user *, size_t, loff_t *);
 extern ssize_t vfs_write(struct file *, const char __user *, size_t, loff_t *);
 extern ssize_t vfs_copy_file_range(struct file *, loff_t , struct file *,
 				   loff_t, size_t, unsigned int);
+int remap_verify_area(struct file *file, loff_t pos, loff_t len, bool write);
 int __generic_remap_file_range_prep(struct file *file_in, loff_t pos_in,
 				    struct file *file_out, loff_t pos_out,
 				    loff_t *len, unsigned int remap_flags,



* [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
@ 2024-02-27  2:21 ` Darrick J. Wong
  2024-02-28 15:44   ` Christoph Hellwig
  2024-02-27  2:21 ` [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges Darrick J. Wong
                   ` (14 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:21 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Introduce a pair of new ioctls to handle exchanging ranges of bytes
between files.  The goal here is to perform the exchange atomically with
respect to applications -- either they see the file contents before the
exchange or they see that A-B is now B-A, even if the kernel crashes.

The simpler of the two ioctls is XFS_IOC_EXCHANGE_RANGE, which performs
the exchange unconditionally.  XFS_IOC_COMMIT_RANGE, on the other hand,
requires the caller to sample the file attributes of one of the files
participating in the exchange, and aborts the exchange if that file has
changed in the meantime (presumably by another thread).

For userspace programs the goal is to enable defragmentation of files
(it's a replacement for the old swapext ioctl), the atomic commit of
sparse updates to an indexed data file, or emulation of atomic writes on
software defined storage servers.

My original goal with all this code was to make it so that online repair
can build a replacement directory or xattr structure in a temporary file
and commit the repair by atomically exchanging all the data blocks
between the two files.  However, I needed a way to test this mechanism
thoroughly, so I've been evolving an ioctl interface since then.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile        |    1 
 fs/xfs/libxfs/xfs_fs.h |   72 +++++++++++
 fs/xfs/xfs_exchrange.c |  307 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |   38 ++++++
 fs/xfs/xfs_ioctl.c     |   89 ++++++++++++++
 5 files changed, 507 insertions(+)
 create mode 100644 fs/xfs/xfs_exchrange.c
 create mode 100644 fs/xfs/xfs_exchrange.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 20f0bbe4b7102..3e85762a28ee7 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -67,6 +67,7 @@ xfs-y				+= xfs_aops.o \
 				   xfs_dir2_readdir.o \
 				   xfs_discard.o \
 				   xfs_error.o \
+				   xfs_exchrange.o \
 				   xfs_export.o \
 				   xfs_extent_busy.o \
 				   xfs_file.o \
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index ca1b17d014377..fe2bd607ac11f 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -772,6 +772,76 @@ struct xfs_scrub_metadata {
 #  define XFS_XATTR_LIST_MAX 65536
 #endif
 
+/*
+ * Exchange part of file1 with part of the file that this ioctl is being called
+ * against (which we'll call file2).  Filesystems must be able to
+ * restart and complete the operation even after the system goes down.
+ */
+struct xfs_exch_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see XFS_EXCHRANGE_* below */
+
+	__u64		pad;		/* must be zeroes */
+};
+
+/*
+ * Using the same definition of file2 as struct xfs_exch_range, commit the
+ * contents of file1 into file2 if file2 has the same inode number, mtime, and
+ * ctime as the arguments provided to the call.  The old contents of file2 will
+ * be moved to file1.
+ *
+ * Returns -EBUSY if there isn't an exact match for the file2 fields.
+ *
+ * Filesystems must be able to restart and complete the operation even after
+ * the system goes down.
+ */
+struct xfs_commit_range {
+	__s64		file1_fd;
+	__s64		file1_offset;	/* file1 offset, bytes */
+	__s64		file2_offset;	/* file2 offset, bytes */
+	__u64		length;		/* bytes to exchange */
+
+	__u64		flags;		/* see XFS_EXCHRANGE_* below */
+
+	/* file2 metadata for freshness checks */
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+
+	__u64		pad;		/* must be zeroes */
+};
+
+/*
+ * Exchange file data all the way to the ends of both files, and then exchange
+ * the file sizes.  This flag can be used to replace a file's contents with a
+ * different amount of data.  length will be ignored.
+ */
+#define XFS_EXCHRANGE_TO_EOF		(1ULL << 0)
+
+/* Flush all changes in file data and file metadata to disk before returning. */
+#define XFS_EXCHRANGE_DSYNC		(1ULL << 1)
+
+/* Dry run; do all the parameter verification but do not change anything. */
+#define XFS_EXCHRANGE_DRY_RUN		(1ULL << 2)
+
+/*
+ * Exchange only the parts of the two files where the file allocation units
+ * mapped to file1's range have been written to.  This can accelerate
+ * scatter-gather atomic writes with a temp file if all writes are aligned to
+ * the file allocation unit.
+ */
+#define XFS_EXCHRANGE_FILE1_WRITTEN	(1ULL << 3)
+
+#define XFS_EXCHRANGE_ALL_FLAGS		(XFS_EXCHRANGE_TO_EOF | \
+					 XFS_EXCHRANGE_DSYNC | \
+					 XFS_EXCHRANGE_DRY_RUN | \
+					 XFS_EXCHRANGE_FILE1_WRITTEN)
 
 /*
  * ioctl commands that are used by Linux filesystems
@@ -843,6 +913,8 @@ struct xfs_scrub_metadata {
 #define XFS_IOC_FSGEOMETRY	     _IOR ('X', 126, struct xfs_fsop_geom)
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
+#define XFS_IOC_EXCHANGE_RANGE	     _IOWR('X', 129, struct xfs_exch_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 129, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
new file mode 100644
index 0000000000000..d5889db89daeb
--- /dev/null
+++ b/fs/xfs/xfs_exchrange.c
@@ -0,0 +1,307 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_exchrange.h"
+#include <linux/fsnotify.h>
+
+/*
+ * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
+ * This part deals with struct file objects and byte ranges and does not deal
+ * with XFS-specific data structures such as xfs_inodes and block ranges.  It
+ * may some day be ported to the VFS.
+ *
+ * The goal is to exchange fxr.length bytes starting at fxr.file1_offset in
+ * file1 with the same number of bytes starting at fxr.file2_offset in file2.
+ * Implementations must call xfs_generic_exchrange_prep to prepare the two
+ * files prior to taking locks; and they must update the inode change and mod
+ * times of both files as part of the metadata update.  The timestamp update
+ * and freshness checks must be done atomically as part of the data exchange
+ * operation to ensure correctness of the freshness check.
+ * xfs_generic_exchrange_finish must be called after the operation completes
+ * successfully but before locks are dropped.
+ */
+
+/* Verify that we have security clearance to perform this operation. */
+static int
+xfs_generic_exchrange_verify_area(
+	struct xfs_exchrange	*fxr)
+{
+	int			ret;
+
+	ret = remap_verify_area(fxr->file1, fxr->file1_offset, fxr->length,
+			true);
+	if (ret)
+		return ret;
+
+	return remap_verify_area(fxr->file2, fxr->file2_offset, fxr->length,
+			true);
+}
+
+/*
+ * Performs necessary checks before doing a range exchange, having stabilized
+ * mutable inode attributes via i_rwsem.
+ */
+static inline int
+xfs_generic_exchrange_checks(
+	struct xfs_exchrange	*fxr,
+	unsigned int		alloc_unit)
+{
+	struct inode		*inode1 = file_inode(fxr->file1);
+	struct inode		*inode2 = file_inode(fxr->file2);
+	uint64_t		allocmask = alloc_unit - 1;
+	int64_t			test_len;
+	uint64_t		blen;
+	loff_t			size1, size2, tmp;
+	int			error;
+
+	/* Don't touch certain kinds of inodes */
+	if (IS_IMMUTABLE(inode1) || IS_IMMUTABLE(inode2))
+		return -EPERM;
+	if (IS_SWAPFILE(inode1) || IS_SWAPFILE(inode2))
+		return -ETXTBSY;
+
+	size1 = i_size_read(inode1);
+	size2 = i_size_read(inode2);
+
+	/* Ranges cannot start after EOF. */
+	if (fxr->file1_offset > size1 || fxr->file2_offset > size2)
+		return -EINVAL;
+
+	/*
+	 * If the caller said to exchange to EOF, we set the length of the
+	 * request large enough to cover everything to the end of both files.
+	 */
+	if (fxr->flags & XFS_EXCHRANGE_TO_EOF) {
+		fxr->length = max_t(int64_t, size1 - fxr->file1_offset,
+					     size2 - fxr->file2_offset);
+
+		error = xfs_generic_exchrange_verify_area(fxr);
+		if (error)
+			return error;
+	}
+
+	/*
+	 * The start of both ranges must be aligned to the file allocation
+	 * unit.
+	 */
+	if (!IS_ALIGNED(fxr->file1_offset, alloc_unit) ||
+	    !IS_ALIGNED(fxr->file2_offset, alloc_unit))
+		return -EINVAL;
+
+	/* Ensure offsets don't wrap. */
+	if (check_add_overflow(fxr->file1_offset, fxr->length, &tmp) ||
+	    check_add_overflow(fxr->file2_offset, fxr->length, &tmp))
+		return -EINVAL;
+
+	/*
+	 * We require both ranges to end within EOF, unless we're exchanging
+	 * to EOF.
+	 */
+	if (!(fxr->flags & XFS_EXCHRANGE_TO_EOF) &&
+	    (fxr->file1_offset + fxr->length > size1 ||
+	     fxr->file2_offset + fxr->length > size2))
+		return -EINVAL;
+
+	/*
+	 * Make sure we don't hit any file size limits.  If we hit any size
+	 * limits such that test_len was adjusted, we abort the whole
+	 * operation.
+	 */
+	test_len = fxr->length;
+	error = generic_write_check_limits(fxr->file2, fxr->file2_offset,
+			&test_len);
+	if (error)
+		return error;
+	error = generic_write_check_limits(fxr->file1, fxr->file1_offset,
+			&test_len);
+	if (error)
+		return error;
+	if (test_len != fxr->length)
+		return -EINVAL;
+
+	/*
+	 * If the range ends exactly at file1's EOF, round the end up to the
+	 * next allocation unit boundary for this check.  Do the same for a
+	 * range ending at file2's EOF.
+	 *
+	 * Otherwise, reject the range length if it's not aligned to an
+	 * allocation unit.
+	 */
+	if (fxr->file1_offset + fxr->length == size1)
+		blen = ALIGN(size1, alloc_unit) - fxr->file1_offset;
+	else if (fxr->file2_offset + fxr->length == size2)
+		blen = ALIGN(size2, alloc_unit) - fxr->file2_offset;
+	else if (!IS_ALIGNED(fxr->length, alloc_unit))
+		return -EINVAL;
+	else
+		blen = fxr->length;
+
+	/* Don't allow overlapped exchanges within the same file. */
+	if (inode1 == inode2 &&
+	    fxr->file2_offset + blen > fxr->file1_offset &&
+	    fxr->file1_offset + blen > fxr->file2_offset)
+		return -EINVAL;
+
+	/*
+	 * Ensure that we don't exchange a partial EOF block into the middle of
+	 * another file.
+	 */
+	if ((fxr->length & allocmask) == 0)
+		return 0;
+
+	blen = fxr->length;
+	if (fxr->file2_offset + blen < size2)
+		blen &= ~allocmask;
+
+	if (fxr->file1_offset + blen < size1)
+		blen &= ~allocmask;
+
+	return blen == fxr->length ? 0 : -EINVAL;
+}
+
+/*
+ * Check that the two inodes are eligible for range exchanges, the ranges make
+ * sense, and then flush all dirty data.  Caller must ensure that the inodes
+ * have been locked against any other modifications.
+ */
+static inline int
+xfs_generic_exchrange_prep(
+	struct xfs_exchrange	*fxr,
+	unsigned int		alloc_unit)
+{
+	struct inode		*inode1 = file_inode(fxr->file1);
+	struct inode		*inode2 = file_inode(fxr->file2);
+	bool			same_inode = (inode1 == inode2);
+	int			error;
+
+	/* Check that we don't violate system file offset limits. */
+	error = xfs_generic_exchrange_checks(fxr, alloc_unit);
+	if (error || fxr->length == 0)
+		return error;
+
+	/* Wait for the completion of any pending IOs on both files */
+	inode_dio_wait(inode1);
+	if (!same_inode)
+		inode_dio_wait(inode2);
+
+	error = filemap_write_and_wait_range(inode1->i_mapping,
+			fxr->file1_offset,
+			fxr->file1_offset + fxr->length - 1);
+	if (error)
+		return error;
+
+	error = filemap_write_and_wait_range(inode2->i_mapping,
+			fxr->file2_offset,
+			fxr->file2_offset + fxr->length - 1);
+	if (error)
+		return error;
+
+	/*
+	 * If the files or inodes involved require synchronous writes, amend
+	 * the request to force the filesystem to flush all data and metadata
+	 * to disk after the operation completes.
+	 */
+	if (((fxr->file1->f_flags | fxr->file2->f_flags) & (__O_SYNC | O_DSYNC)) ||
+	    IS_SYNC(inode1) || IS_SYNC(inode2))
+		fxr->flags |= XFS_EXCHRANGE_DSYNC;
+
+	return 0;
+}
+
+/*
+ * Finish a range exchange operation, if it was successful.  Caller must ensure
+ * that the inodes are still locked against any other modifications.
+ */
+static inline int
+xfs_generic_exchrange_finish(
+	struct xfs_exchrange	*fxr)
+{
+	int			error;
+
+	error = file_remove_privs(fxr->file1);
+	if (error)
+		return error;
+	if (file_inode(fxr->file1) == file_inode(fxr->file2))
+		return 0;
+
+	return file_remove_privs(fxr->file2);
+}
+
+/* Exchange parts of two files. */
+int
+xfs_exchange_range(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode1 = file_inode(fxr->file1);
+	struct inode		*inode2 = file_inode(fxr->file2);
+	int			ret;
+
+	BUILD_BUG_ON(XFS_IOC_EXCHANGE_RANGE == XFS_IOC_COMMIT_RANGE);
+	BUILD_BUG_ON(XFS_EXCHRANGE_ALL_FLAGS & XFS_EXCHRANGE_PRIVATE_FLAGS);
+
+	if (fxr->flags & ~(XFS_EXCHRANGE_ALL_FLAGS | __XFS_EXCHRANGE_CHECK_FRESH2))
+		return -EINVAL;
+
+	/*
+	 * The ioctl enforces that src and dest files are on the same mount.
+	 * However, they only need to be on the same file system.
+	 */
+	if (inode1->i_sb != inode2->i_sb)
+		return -EXDEV;
+
+	/* Userspace requests only honored for regular files. */
+	if (S_ISDIR(inode1->i_mode) || S_ISDIR(inode2->i_mode))
+		return -EISDIR;
+	if (!S_ISREG(inode1->i_mode) || !S_ISREG(inode2->i_mode))
+		return -EINVAL;
+
+	/* Both files must be opened for read and write. */
+	if (!(fxr->file1->f_mode & FMODE_READ) ||
+	    !(fxr->file1->f_mode & FMODE_WRITE) ||
+	    !(fxr->file2->f_mode & FMODE_READ) ||
+	    !(fxr->file2->f_mode & FMODE_WRITE))
+		return -EBADF;
+
+	/* Neither file can be opened append-only. */
+	if ((fxr->file1->f_flags & O_APPEND) ||
+	    (fxr->file2->f_flags & O_APPEND))
+		return -EBADF;
+
+	/*
+	 * If we're not exchanging to EOF, we can check the areas before
+	 * stabilizing both files' i_size.
+	 */
+	if (!(fxr->flags & XFS_EXCHRANGE_TO_EOF)) {
+		ret = xfs_generic_exchrange_verify_area(fxr);
+		if (ret)
+			return ret;
+	}
+
+	/* Update cmtime if the fd/inode don't forbid it. */
+	if (!(fxr->file1->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode1))
+		fxr->flags |= __XFS_EXCHRANGE_UPD_CMTIME1;
+	if (!(fxr->file2->f_mode & FMODE_NOCMTIME) && !IS_NOCMTIME(inode2))
+		fxr->flags |= __XFS_EXCHRANGE_UPD_CMTIME2;
+
+	file_start_write(fxr->file2);
+	ret = -EOPNOTSUPP; /* XXX call out to xfs code */
+	file_end_write(fxr->file2);
+	if (ret)
+		return ret;
+
+	fsnotify_modify(fxr->file1);
+	if (fxr->file2 != fxr->file1)
+		fsnotify_modify(fxr->file2);
+	return 0;
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
new file mode 100644
index 0000000000000..593a85a644bce
--- /dev/null
+++ b/fs/xfs/xfs_exchrange.h
@@ -0,0 +1,38 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_EXCHRANGE_H__
+#define __XFS_EXCHRANGE_H__
+
+/* Update the ctime/mtime of file1 and file2 */
+#define __XFS_EXCHRANGE_UPD_CMTIME1	(1ULL << 63)
+#define __XFS_EXCHRANGE_UPD_CMTIME2	(1ULL << 62)
+
+/* Freshness check required */
+#define __XFS_EXCHRANGE_CHECK_FRESH2	(1ULL << 61)
+
+#define XFS_EXCHRANGE_PRIVATE_FLAGS	(__XFS_EXCHRANGE_UPD_CMTIME1 | \
+					 __XFS_EXCHRANGE_UPD_CMTIME2 | \
+					 __XFS_EXCHRANGE_CHECK_FRESH2)
+
+struct xfs_exchrange {
+	struct file		*file1;
+	struct file		*file2;
+
+	loff_t			file1_offset;
+	loff_t			file2_offset;
+	u64			length;
+
+	u64			flags;	/* XFS_EXCHRANGE flags */
+
+	/* file2 metadata, used when __XFS_EXCHRANGE_CHECK_FRESH2 is set */
+	u64			file2_ino;
+	struct timespec64	file2_mtime;
+	struct timespec64	file2_ctime;
+};
+
+int xfs_exchange_range(struct xfs_exchrange *fxr);
+
+#endif /* __XFS_EXCHRANGE_H__ */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index 1360a551118dd..08fc15881ee51 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -40,6 +40,7 @@
 #include "xfs_xattr.h"
 #include "xfs_rtbitmap.h"
 #include "xfs_file.h"
+#include "xfs_exchrange.h"
 
 #include <linux/mount.h>
 #include <linux/namei.h>
@@ -1930,6 +1931,89 @@ xfs_ioctl_fs_counts(
 	return 0;
 }
 
+static long
+xfs_ioc_exchange_range(
+	struct file			*file,
+	struct xfs_exch_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_exch_range		args;
+	struct fd			file1;
+	int				error;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+		return -EINVAL;
+	if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
+		return -EINVAL;
+
+	fxr.file1_offset	= args.file1_offset;
+	fxr.file2_offset	= args.file2_offset;
+	fxr.length		= args.length;
+	fxr.flags		= args.flags;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+	fxr.file1 = file1.file;
+
+	error = -EXDEV;
+	if (fxr.file1->f_path.mnt != fxr.file2->f_path.mnt)
+		goto fdput;
+
+	error = xfs_exchange_range(&fxr);
+fdput:
+	fdput(file1);
+	return error;
+}
+
+static long
+xfs_ioc_commit_range(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args;
+	struct fd			file1;
+	int				error;
+
+	if (copy_from_user(&args, argp, sizeof(args)))
+		return -EFAULT;
+	if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+		return -EINVAL;
+	if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
+		return -EINVAL;
+
+	fxr.file1_offset	= args.file1_offset;
+	fxr.file2_offset	= args.file2_offset;
+	fxr.length		= args.length;
+	fxr.flags		= args.flags | __XFS_EXCHRANGE_CHECK_FRESH2;
+	fxr.file2_ino		= args.file2_ino;
+	fxr.file2_mtime.tv_sec	= args.file2_mtime;
+	fxr.file2_mtime.tv_nsec	= args.file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= args.file2_ctime;
+	fxr.file2_ctime.tv_nsec	= args.file2_ctime_nsec;
+
+	file1 = fdget(args.file1_fd);
+	if (!file1.file)
+		return -EBADF;
+	fxr.file1 = file1.file;
+
+	error = -EXDEV;
+	if (fxr.file1->f_path.mnt != fxr.file2->f_path.mnt)
+		goto fdput;
+
+	error = xfs_exchange_range(&fxr);
+fdput:
+	fdput(file1);
+	return error;
+}
+
 /*
  * These long-unused ioctls were removed from the official ioctl API in 5.17,
  * but retain these definitions so that we can log warnings about them.
@@ -2170,6 +2254,11 @@ xfs_file_ioctl(
 		return error;
 	}
 
+	case XFS_IOC_EXCHANGE_RANGE:
+		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_COMMIT_RANGE:
+		return xfs_ioc_commit_range(filp, arg);
+
 	default:
 		return -ENOTTY;
 	}



* [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
  2024-02-27  2:21 ` [PATCH 02/14] xfs: introduce new file range exchange ioctls Darrick J. Wong
@ 2024-02-27  2:21 ` Darrick J. Wong
  2024-02-28 15:44   ` Christoph Hellwig
  2024-02-27  2:21 ` [PATCH 04/14] xfs: introduce a file mapping exchange log intent item Darrick J. Wong
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:21 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Create a log incompat flag so that we only attempt to process file
mapping exchange log items if the filesystem supports it, and a geometry
flag to advertise support if it's present or could be present.
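
A minimal userspace model of the advertising decision, assuming the
logic in xfs_exchrange_possible() from this patch: the geometry flag is
set when the log incompat bit is already on disk, or when the filesystem
is new enough to be upgraded and log incompat features can still be
added.  struct exch_state and the helper names are hypothetical
stand-ins for the mount state the kernel consults; only the flag value
is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit value from this patch; defined locally so the sketch stands alone. */
#define XFS_FSOP_GEOM_FLAGS_EXCHRANGE	(1U << 24)

/* Hypothetical snapshot of the state consulted by the kernel. */
struct exch_state {
	bool	log_exchmaps_set;	/* log incompat bit already on disk */
	bool	has_bigtime;
	bool	has_large_extent_counts;
	bool	can_add_log_incompat;	/* log incompat upgrades permitted */
};

/* Model of xfs_exchrange_possible(). */
static bool exchrange_possible(const struct exch_state *s)
{
	assert(s);
	if (s->log_exchmaps_set)
		return true;
	return (s->has_bigtime || s->has_large_extent_counts) &&
	       s->can_add_log_incompat;
}

/* Model of the flag update performed while filling in the geometry. */
static uint32_t geometry_flags(const struct exch_state *s, uint32_t flags)
{
	if (exchrange_possible(s))
		flags |= XFS_FSOP_GEOM_FLAGS_EXCHRANGE;
	return flags;
}

/* What a userspace caller would test after an fs geometry query. */
static bool geom_advertises_exchrange(uint32_t geo_flags)
{
	return (geo_flags & XFS_FSOP_GEOM_FLAGS_EXCHRANGE) != 0;
}
```

As the comment on xfs_exchrange_possible() says, a set flag only
advertises the potential for exchanges and says nothing about actual
readiness to start one.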

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |   13 +++++++++++++
 fs/xfs/libxfs/xfs_fs.h     |    3 +++
 fs/xfs/libxfs/xfs_sb.c     |    3 +++
 fs/xfs/xfs_exchrange.c     |   31 +++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h     |    2 ++
 5 files changed, 52 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 2b2f9050fbfbb..753adde56a2d0 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -391,6 +391,12 @@ xfs_sb_has_incompat_feature(
 }
 
 #define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)	/* Delayed Attributes */
+
+/*
+ * Log contains file mapping exchange log intent items which are not otherwise
+ * protected by an INCOMPAT/RO_COMPAT feature flag.
+ */
+#define XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS (1 << 1)
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
 	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
@@ -423,6 +429,13 @@ static inline bool xfs_sb_version_haslogxattrs(struct xfs_sb *sbp)
 		 XFS_SB_FEAT_INCOMPAT_LOG_XATTRS);
 }
 
+static inline bool xfs_sb_version_haslogexchmaps(struct xfs_sb *sbp)
+{
+	return xfs_sb_is_v5(sbp) &&
+		(sbp->sb_features_log_incompat &
+		 XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS);
+}
+
 static inline bool
 xfs_is_quota_inode(struct xfs_sb *sbp, xfs_ino_t ino)
 {
diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index fe2bd607ac11f..ede313f8371e5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -240,6 +240,9 @@ typedef struct xfs_fsop_resblks {
 #define XFS_FSOP_GEOM_FLAGS_INOBTCNT	(1 << 22) /* inobt btree counter */
 #define XFS_FSOP_GEOM_FLAGS_NREXT64	(1 << 23) /* large extent counters */
 
+/* file range exchange available to userspace */
+#define XFS_FSOP_GEOM_FLAGS_EXCHRANGE	(1 << 24)
+
 /*
  * Minimum and maximum sizes need for growth checks.
  *
diff --git a/fs/xfs/libxfs/xfs_sb.c b/fs/xfs/libxfs/xfs_sb.c
index d991eec054368..2d8a0546ab4ba 100644
--- a/fs/xfs/libxfs/xfs_sb.c
+++ b/fs/xfs/libxfs/xfs_sb.c
@@ -26,6 +26,7 @@
 #include "xfs_health.h"
 #include "xfs_ag.h"
 #include "xfs_rtbitmap.h"
+#include "xfs_exchrange.h"
 
 /*
  * Physical superblock buffer manipulations. Shared with libxfs in userspace.
@@ -1258,6 +1259,8 @@ xfs_fs_geometry(
 	}
 	if (xfs_has_large_extent_counts(mp))
 		geo->flags |= XFS_FSOP_GEOM_FLAGS_NREXT64;
+	if (xfs_exchrange_possible(mp))
+		geo->flags |= XFS_FSOP_GEOM_FLAGS_EXCHRANGE;
 	geo->rtsectsize = sbp->sb_blocksize;
 	geo->dirblocksize = xfs_dir2_dirblock_bytes(sbp);
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index d5889db89daeb..6ee181e9229a8 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -15,6 +15,37 @@
 #include "xfs_exchrange.h"
 #include <linux/fsnotify.h>
 
+/*
+ * If the filesystem has relatively new features enabled, we're willing to
+ * upgrade the filesystem to have the EXCHMAPS log incompat feature.
+ * Technically we could do this with any V5 filesystem, but let's not deal
+ * with really old kernels.
+ */
+static inline bool
+xfs_exchrange_upgradeable(
+	struct xfs_mount	*mp)
+{
+	return xfs_has_bigtime(mp) || xfs_has_large_extent_counts(mp);
+}
+
+/*
+ * Decide if we should advertise to userspace the potential for using file
+ * range exchanges on this filesystem.  This does not say anything about the
+ * actual readiness to start such an operation.
+ */
+bool
+xfs_exchrange_possible(
+	struct xfs_mount	*mp)
+{
+	/* Always possible when mapping exchange log intent items are enabled */
+	if (xfs_sb_version_haslogexchmaps(&mp->m_sb))
+		return true;
+
+	/* Can we upgrade the fs to have the log intent item? */
+	return xfs_exchrange_upgradeable(mp) &&
+	       xfs_can_add_incompat_log_features(mp, false);
+}
+
 /*
  * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
  * This part deals with struct file objects and byte ranges and does not deal
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 593a85a644bce..a008b42736716 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -6,6 +6,8 @@
 #ifndef __XFS_EXCHRANGE_H__
 #define __XFS_EXCHRANGE_H__
 
+bool xfs_exchrange_possible(struct xfs_mount *mp);
+
 /* Update the mtime/cmtime of file1 and file2 */
 #define __XFS_EXCHRANGE_UPD_CMTIME1	(1ULL << 63)
 #define __XFS_EXCHRANGE_UPD_CMTIME2	(1ULL << 62)



* [PATCH 04/14] xfs: introduce a file mapping exchange log intent item
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (2 preceding siblings ...)
  2024-02-27  2:21 ` [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges Darrick J. Wong
@ 2024-02-27  2:21 ` Darrick J. Wong
  2024-02-28 15:45   ` Christoph Hellwig
  2024-02-27  2:22 ` [PATCH 05/14] xfs: create deferred log items for file mapping exchanges Darrick J. Wong
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:21 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Introduce a new intent log item to handle exchanging mappings between
the forks of two files.
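
The lifetime rule for the intent item can be sketched in userspace as
follows.  This is a simplified model, assuming the behavior described in
the comments on xfs_xmi_init() and xfs_xmi_release() in this patch: the
intent starts with two references, one for the log and one for the
eventual done item, and whichever side drops the last reference frees
it.  struct intent_model and its helpers are hypothetical names, and C11
atomics stand in for the kernel's atomic_t.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical userspace stand-in for struct xfs_xmi_log_item. */
struct intent_model {
	uint64_t	id;		/* pairs the intent with its done item */
	atomic_int	refcount;
	bool		*freed;		/* lets a caller observe the free */
};

/* Allocate an intent holding two references: the log's and the done item's. */
static struct intent_model *intent_init(bool *freed)
{
	struct intent_model *im = calloc(1, sizeof(*im));

	assert(im && freed);
	im->id = (uint64_t)(uintptr_t)im;
	atomic_init(&im->refcount, 2);
	im->freed = freed;
	*freed = false;
	return im;
}

/* Drop one reference; only the last caller frees the intent. */
static void intent_release(struct intent_model *im)
{
	assert(atomic_load(&im->refcount) > 0);
	if (atomic_fetch_sub(&im->refcount, 1) == 1) {
		*im->freed = true;
		free(im);
	}
}
```

In the model, the first intent_release() call plays the role of either
the log unpin or the done item, and the second call performs the free
regardless of the order in which the two arrive.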

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_log_format.h  |   42 ++++++-
 fs/xfs/libxfs/xfs_log_recover.h |    2 
 fs/xfs/xfs_exchmaps_item.c      |  235 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchmaps_item.h      |   59 ++++++++++
 fs/xfs/xfs_log_recover.c        |    2 
 fs/xfs/xfs_super.c              |   19 +++
 7 files changed, 357 insertions(+), 3 deletions(-)
 create mode 100644 fs/xfs/xfs_exchmaps_item.c
 create mode 100644 fs/xfs/xfs_exchmaps_item.h


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index 3e85762a28ee7..ae34dba36508b 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -103,6 +103,7 @@ xfs-y				+= xfs_log.o \
 				   xfs_buf_item.o \
 				   xfs_buf_item_recover.o \
 				   xfs_dquot_item_recover.o \
+				   xfs_exchmaps_item.o \
 				   xfs_extfree_item.o \
 				   xfs_attr_item.o \
 				   xfs_icreate_item.o \
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 16872972e1e97..09024431cae9a 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -117,8 +117,9 @@ struct xfs_unmount_log_format {
 #define XLOG_REG_TYPE_ATTRD_FORMAT	28
 #define XLOG_REG_TYPE_ATTR_NAME	29
 #define XLOG_REG_TYPE_ATTR_VALUE	30
-#define XLOG_REG_TYPE_MAX		30
-
+#define XLOG_REG_TYPE_XMI_FORMAT	31
+#define XLOG_REG_TYPE_XMD_FORMAT	32
+#define XLOG_REG_TYPE_MAX		32
 
 /*
  * Flags to log operation header
@@ -243,6 +244,8 @@ typedef struct xfs_trans_header {
 #define	XFS_LI_BUD		0x1245
 #define	XFS_LI_ATTRI		0x1246  /* attr set/remove intent*/
 #define	XFS_LI_ATTRD		0x1247  /* attr set/remove done */
+#define	XFS_LI_XMI		0x1248  /* mapping exchange intent */
+#define	XFS_LI_XMD		0x1249  /* mapping exchange done */
 
 #define XFS_LI_TYPE_DESC \
 	{ XFS_LI_EFI,		"XFS_LI_EFI" }, \
@@ -260,7 +263,9 @@ typedef struct xfs_trans_header {
 	{ XFS_LI_BUI,		"XFS_LI_BUI" }, \
 	{ XFS_LI_BUD,		"XFS_LI_BUD" }, \
 	{ XFS_LI_ATTRI,		"XFS_LI_ATTRI" }, \
-	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }
+	{ XFS_LI_ATTRD,		"XFS_LI_ATTRD" }, \
+	{ XFS_LI_XMI,		"XFS_LI_XMI" }, \
+	{ XFS_LI_XMD,		"XFS_LI_XMD" }
 
 /*
  * Inode Log Item Format definitions.
@@ -878,6 +883,37 @@ struct xfs_bud_log_format {
 	uint64_t		bud_bui_id;	/* id of corresponding bui */
 };
 
+/*
+ * XMI/XMD (file mapping exchange) log format definitions
+ */
+
+/* This is the structure used to lay out a mapping exchange log item. */
+struct xfs_xmi_log_format {
+	uint16_t		xmi_type;	/* xmi log item type */
+	uint16_t		xmi_size;	/* size of this item */
+	uint32_t		__pad;		/* must be zero */
+	uint64_t		xmi_id;		/* xmi identifier */
+
+	uint64_t		xmi_inode1;	/* inumber of first file */
+	uint64_t		xmi_inode2;	/* inumber of second file */
+	uint64_t		xmi_startoff1;	/* block offset into file1 */
+	uint64_t		xmi_startoff2;	/* block offset into file2 */
+	uint64_t		xmi_blockcount;	/* number of blocks */
+	uint64_t		xmi_flags;	/* XFS_EXCHMAPS_* */
+	uint64_t		xmi_isize1;	/* intended file1 size */
+	uint64_t		xmi_isize2;	/* intended file2 size */
+};
+
+#define XFS_EXCHMAPS_LOGGED_FLAGS		(0)
+
+/* This is the structure used to lay out a mapping exchange done log item. */
+struct xfs_xmd_log_format {
+	uint16_t		xmd_type;	/* xmd log item type */
+	uint16_t		xmd_size;	/* size of this item */
+	uint32_t		__pad;
+	uint64_t		xmd_xmi_id;	/* id of corresponding xmi */
+};
+
 /*
  * Dquot Log format definitions.
  *
diff --git a/fs/xfs/libxfs/xfs_log_recover.h b/fs/xfs/libxfs/xfs_log_recover.h
index 9fe7a9564bca9..47b758b49cb35 100644
--- a/fs/xfs/libxfs/xfs_log_recover.h
+++ b/fs/xfs/libxfs/xfs_log_recover.h
@@ -75,6 +75,8 @@ extern const struct xlog_recover_item_ops xlog_cui_item_ops;
 extern const struct xlog_recover_item_ops xlog_cud_item_ops;
 extern const struct xlog_recover_item_ops xlog_attri_item_ops;
 extern const struct xlog_recover_item_ops xlog_attrd_item_ops;
+extern const struct xlog_recover_item_ops xlog_xmi_item_ops;
+extern const struct xlog_recover_item_ops xlog_xmd_item_ops;
 
 /*
  * Macros, structures, prototypes for internal log manager use.
diff --git a/fs/xfs/xfs_exchmaps_item.c b/fs/xfs/xfs_exchmaps_item.c
new file mode 100644
index 0000000000000..c36f1065914c6
--- /dev/null
+++ b/fs/xfs/xfs_exchmaps_item.c
@@ -0,0 +1,235 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_bit.h"
+#include "xfs_shared.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_trans_priv.h"
+#include "xfs_exchmaps_item.h"
+#include "xfs_log.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_log_priv.h"
+#include "xfs_log_recover.h"
+
+struct kmem_cache	*xfs_xmi_cache;
+struct kmem_cache	*xfs_xmd_cache;
+
+static const struct xfs_item_ops xfs_xmi_item_ops;
+
+static inline struct xfs_xmi_log_item *XMI_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_xmi_log_item, xmi_item);
+}
+
+STATIC void
+xfs_xmi_item_free(
+	struct xfs_xmi_log_item	*xmi_lip)
+{
+	kmem_free(xmi_lip->xmi_item.li_lv_shadow);
+	kmem_cache_free(xfs_xmi_cache, xmi_lip);
+}
+
+/*
+ * Freeing the XMI requires that we remove it from the AIL if it has already
+ * been placed there. However, the XMI may not yet have been placed in the AIL
+ * when called by xfs_xmi_release() from XMD processing due to the ordering of
+ * committed vs unpin operations in bulk insert operations. Hence the reference
+ * count to ensure only the last caller frees the XMI.
+ */
+STATIC void
+xfs_xmi_release(
+	struct xfs_xmi_log_item	*xmi_lip)
+{
+	ASSERT(atomic_read(&xmi_lip->xmi_refcount) > 0);
+	if (atomic_dec_and_test(&xmi_lip->xmi_refcount)) {
+		xfs_trans_ail_delete(&xmi_lip->xmi_item, 0);
+		xfs_xmi_item_free(xmi_lip);
+	}
+}
+
+
+STATIC void
+xfs_xmi_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_xmi_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given xmi log
+ * item. We use only 1 iovec, and we point that at the xmi_log_format structure
+ * embedded in the xmi item.
+ */
+STATIC void
+xfs_xmi_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_xmi_log_item	*xmi_lip = XMI_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	xmi_lip->xmi_format.xmi_type = XFS_LI_XMI;
+	xmi_lip->xmi_format.xmi_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMI_FORMAT,
+			&xmi_lip->xmi_format,
+			sizeof(struct xfs_xmi_log_format));
+}
+
+/*
+ * The unpin operation is the last place an XMI is manipulated in the log. It
+ * is either inserted in the AIL or aborted in the event of a log I/O error. In
+ * either case, the XMI transaction has been successfully committed to make it
+ * this far. Therefore, we expect whoever committed the XMI to either construct
+ * and commit the XMD or drop the XMD's reference in the event of error. Simply
+ * drop the log's XMI reference now that the log is done with it.
+ */
+STATIC void
+xfs_xmi_item_unpin(
+	struct xfs_log_item	*lip,
+	int			remove)
+{
+	struct xfs_xmi_log_item	*xmi_lip = XMI_ITEM(lip);
+
+	xfs_xmi_release(xmi_lip);
+}
+
+/*
+ * The XMI has been either committed or aborted if the transaction has been
+ * cancelled. If the transaction was cancelled, an XMD isn't going to be
+ * constructed and thus we free the XMI here directly.
+ */
+STATIC void
+xfs_xmi_item_release(
+	struct xfs_log_item	*lip)
+{
+	xfs_xmi_release(XMI_ITEM(lip));
+}
+
+/* Allocate and initialize an xmi item. */
+STATIC struct xfs_xmi_log_item *
+xfs_xmi_init(
+	struct xfs_mount	*mp)
+
+{
+	struct xfs_xmi_log_item	*xmi_lip;
+
+	xmi_lip = kmem_cache_zalloc(xfs_xmi_cache, GFP_KERNEL | __GFP_NOFAIL);
+
+	xfs_log_item_init(mp, &xmi_lip->xmi_item, XFS_LI_XMI, &xfs_xmi_item_ops);
+	xmi_lip->xmi_format.xmi_id = (uintptr_t)(void *)xmi_lip;
+	atomic_set(&xmi_lip->xmi_refcount, 2);
+
+	return xmi_lip;
+}
+
+static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip)
+{
+	return container_of(lip, struct xfs_xmd_log_item, xmd_item);
+}
+
+STATIC bool
+xfs_xmi_item_match(
+	struct xfs_log_item	*lip,
+	uint64_t		intent_id)
+{
+	return XMI_ITEM(lip)->xmi_format.xmi_id == intent_id;
+}
+
+static const struct xfs_item_ops xfs_xmi_item_ops = {
+	.flags		= XFS_ITEM_INTENT,
+	.iop_size	= xfs_xmi_item_size,
+	.iop_format	= xfs_xmi_item_format,
+	.iop_unpin	= xfs_xmi_item_unpin,
+	.iop_release	= xfs_xmi_item_release,
+	.iop_match	= xfs_xmi_item_match,
+};
+
+/*
+ * This routine is called to create an in-core file mapping exchange item from
+ * the xmi format structure which was logged on disk.  It allocates an in-core
+ * xmi, copies the exchange information from the format structure into it, and
+ * adds the xmi to the AIL with the given LSN.
+ */
+STATIC int
+xlog_recover_xmi_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	struct xfs_mount		*mp = log->l_mp;
+	struct xfs_xmi_log_item		*xmi_lip;
+	struct xfs_xmi_log_format	*xmi_formatp;
+	size_t				len;
+
+	len = sizeof(struct xfs_xmi_log_format);
+	if (item->ri_buf[0].i_len != len) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xmi_formatp = item->ri_buf[0].i_addr;
+	if (xmi_formatp->__pad != 0) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xmi_lip = xfs_xmi_init(mp);
+	memcpy(&xmi_lip->xmi_format, xmi_formatp, len);
+
+	/* not implemented yet */
+	return -EIO;
+}
+
+const struct xlog_recover_item_ops xlog_xmi_item_ops = {
+	.item_type		= XFS_LI_XMI,
+	.commit_pass2		= xlog_recover_xmi_commit_pass2,
+};
+
+/*
+ * This routine is called when an XMD format structure is found in a committed
+ * transaction in the log. Its purpose is to cancel the corresponding XMI if it
+ * was still in the log. To do this it searches the AIL for the XMI with an id
+ * equal to that in the XMD format structure. If we find it we drop the XMD
+ * reference, which removes the XMI from the AIL and frees it.
+ */
+STATIC int
+xlog_recover_xmd_commit_pass2(
+	struct xlog			*log,
+	struct list_head		*buffer_list,
+	struct xlog_recover_item	*item,
+	xfs_lsn_t			lsn)
+{
+	struct xfs_xmd_log_format	*xmd_formatp;
+
+	xmd_formatp = item->ri_buf[0].i_addr;
+	if (item->ri_buf[0].i_len != sizeof(struct xfs_xmd_log_format)) {
+		XFS_ERROR_REPORT(__func__, XFS_ERRLEVEL_LOW, log->l_mp);
+		return -EFSCORRUPTED;
+	}
+
+	xlog_recover_release_intent(log, XFS_LI_XMI, xmd_formatp->xmd_xmi_id);
+	return 0;
+}
+
+const struct xlog_recover_item_ops xlog_xmd_item_ops = {
+	.item_type		= XFS_LI_XMD,
+	.commit_pass2		= xlog_recover_xmd_commit_pass2,
+};
diff --git a/fs/xfs/xfs_exchmaps_item.h b/fs/xfs/xfs_exchmaps_item.h
new file mode 100644
index 0000000000000..ada1eb314e658
--- /dev/null
+++ b/fs/xfs/xfs_exchmaps_item.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef	__XFS_EXCHMAPS_ITEM_H__
+#define	__XFS_EXCHMAPS_ITEM_H__
+
+/*
+ * The file mapping exchange intent item helps us exchange multiple file
+ * mappings between two inode forks.  It does this by tracking the range of
+ * file block offsets that still need to be exchanged, and relogs as progress
+ * happens.
+ *
+ * *I items should be recorded in the *first* of a series of rolled
+ * transactions, and the *D items should be recorded in the same transaction
+ * that records the associated bmbt updates.
+ *
+ * Should the system crash after the commit of the first transaction but
+ * before the commit of the final transaction in a series, log recovery will
+ * use the redo information recorded by the intent items to replay the
+ * rest of the mapping exchanges.
+ */
+
+/* kernel only XMI/XMD definitions */
+
+struct xfs_mount;
+struct kmem_cache;
+
+/*
+ * This is the incore file mapping exchange intent log item.  It is used to log
+ * the fact that we are exchanging mappings between two files.  It is used in
+ * conjunction with the incore file mapping exchange done log item described
+ * below.
+ *
+ * These log items follow the same rules as struct xfs_efi_log_item; see the
+ * comments about that structure (in xfs_extfree_item.h) for more details.
+ */
+struct xfs_xmi_log_item {
+	struct xfs_log_item		xmi_item;
+	atomic_t			xmi_refcount;
+	struct xfs_xmi_log_format	xmi_format;
+};
+
+/*
+ * This is the incore file mapping exchange done log item.  It is used to log
+ * the fact that an exchange mentioned in an earlier xmi item has been
+ * performed.
+ */
+struct xfs_xmd_log_item {
+	struct xfs_log_item		xmd_item;
+	struct xfs_xmi_log_item		*xmd_intent_log_item;
+	struct xfs_xmd_log_format	xmd_format;
+};
+
+extern struct kmem_cache	*xfs_xmi_cache;
+extern struct kmem_cache	*xfs_xmd_cache;
+
+#endif	/* __XFS_EXCHMAPS_ITEM_H__ */
diff --git a/fs/xfs/xfs_log_recover.c b/fs/xfs/xfs_log_recover.c
index 36a1b4eeb39fa..5c11322ce7318 100644
--- a/fs/xfs/xfs_log_recover.c
+++ b/fs/xfs/xfs_log_recover.c
@@ -1789,6 +1789,8 @@ static const struct xlog_recover_item_ops *xlog_recover_item_ops[] = {
 	&xlog_bud_item_ops,
 	&xlog_attri_item_ops,
 	&xlog_attrd_item_ops,
+	&xlog_xmi_item_ops,
+	&xlog_xmd_item_ops,
 };
 
 static const struct xlog_recover_item_ops *
diff --git a/fs/xfs/xfs_super.c b/fs/xfs/xfs_super.c
index 679b99bed5499..5f9e406855d7d 100644
--- a/fs/xfs/xfs_super.c
+++ b/fs/xfs/xfs_super.c
@@ -43,6 +43,7 @@
 #include "xfs_iunlink_item.h"
 #include "xfs_dahash_test.h"
 #include "xfs_rtbitmap.h"
+#include "xfs_exchmaps_item.h"
 #include "scrub/stats.h"
 #include "scrub/rcbag_btree.h"
 
@@ -2199,8 +2200,24 @@ xfs_init_caches(void)
 	if (!xfs_iunlink_cache)
 		goto out_destroy_attri_cache;
 
+	xfs_xmd_cache = kmem_cache_create("xfs_xmd_item",
+					 sizeof(struct xfs_xmd_log_item),
+					 0, 0, NULL);
+	if (!xfs_xmd_cache)
+		goto out_destroy_iul_cache;
+
+	xfs_xmi_cache = kmem_cache_create("xfs_xmi_item",
+					 sizeof(struct xfs_xmi_log_item),
+					 0, 0, NULL);
+	if (!xfs_xmi_cache)
+		goto out_destroy_xmd_cache;
+
 	return 0;
 
+ out_destroy_xmd_cache:
+	kmem_cache_destroy(xfs_xmd_cache);
+ out_destroy_iul_cache:
+	kmem_cache_destroy(xfs_iunlink_cache);
  out_destroy_attri_cache:
 	kmem_cache_destroy(xfs_attri_cache);
  out_destroy_attrd_cache:
@@ -2257,6 +2274,8 @@ xfs_destroy_caches(void)
 	 * destroy caches.
 	 */
 	rcu_barrier();
+	kmem_cache_destroy(xfs_xmd_cache);
+	kmem_cache_destroy(xfs_xmi_cache);
 	kmem_cache_destroy(xfs_iunlink_cache);
 	kmem_cache_destroy(xfs_attri_cache);
 	kmem_cache_destroy(xfs_attrd_cache);
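
The two references that xfs_xmi_init() sets up (one held by the log, one
held by whoever commits the XMD or handles the error) can be modeled in
userspace roughly as follows.  This is only a sketch under assumptions:
the demo_* names are hypothetical, C11 atomics stand in for the kernel's
atomic_t, and it assumes that the release helper frees the item once the
count reaches zero, as the EFI code these items are modeled on does.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct xfs_xmi_log_item; not the kernel type. */
struct demo_intent {
	atomic_int	refcount;
	bool		*freed;	/* set when the last reference is dropped */
};

/* Like xfs_xmi_init(): start with one reference each for two owners. */
static struct demo_intent *demo_intent_init(bool *freed)
{
	struct demo_intent *it = calloc(1, sizeof(*it));

	if (!it)
		abort();
	atomic_init(&it->refcount, 2);
	it->freed = freed;
	*freed = false;
	return it;
}

/* Drop one reference; free the item when the count reaches zero. */
static void demo_intent_release(struct demo_intent *it)
{
	int old = atomic_fetch_sub(&it->refcount, 1);

	assert(old > 0);
	if (old == 1) {
		*it->freed = true;
		free(it);
	}
}

/* True if the item survives the first release and dies on the second. */
static bool demo_two_owner_lifetime(void)
{
	bool freed;
	bool alive_after_unpin;
	struct demo_intent *it = demo_intent_init(&freed);

	demo_intent_release(it);	/* the log is done with it (unpin) */
	alive_after_unpin = !freed;
	demo_intent_release(it);	/* done item committed, or error path */
	return alive_after_unpin && freed;
}
```

Whichever owner finishes last frees the item, so neither side needs to
know the order in which the two releases happen.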


^ permalink raw reply related	[flat|nested] 62+ messages in thread

* [PATCH 05/14] xfs: create deferred log items for file mapping exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (3 preceding siblings ...)
  2024-02-27  2:21 ` [PATCH 04/14] xfs: introduce a file mapping exchange log intent item Darrick J. Wong
@ 2024-02-27  2:22 ` Darrick J. Wong
  2024-02-28 15:49   ` Christoph Hellwig
  2024-02-27  2:22 ` [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:22 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that we've created the skeleton of a log intent item to track and
restart file mapping exchange operations, add the upper level logic to
commit intent items and turn them into concrete work recorded in the
log.  This builds on the existing bmap update intent items that have
been around for a while now.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
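
For reviewers who want a mental model of the walk in
xfs_exchmaps_find_mappings() and xfs_exchmaps_one_step(), here is a
minimal userspace sketch.  It assumes each file is a flat array holding
one physical block number per logical block, with -1 standing in for a
hole; the demo_* names and that representation are hypothetical and are
not how the kernel stores mappings.  Each step trims the pair of
mappings to the shorter of the two, skips pairs that already point to
the same storage, exchanges the first differing pair, and advances the
cursor.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_HOLE	(-1L)	/* hypothetical marker for an unmapped block */

/* Hypothetical cursor, loosely modeled on struct xfs_exchmaps_intent. */
struct demo_xmi {
	size_t	startoff1;
	size_t	startoff2;
	size_t	blockcount;
};

/* Length of the mapping starting at @off, capped at @max (max >= 1). */
static size_t demo_run_len(const long *phys, size_t off, size_t max)
{
	size_t len = 1;

	assert(max >= 1);
	while (len < max) {
		long prev = phys[off + len - 1];
		long next = phys[off + len];

		/* A hole continues as a hole; real blocks must be contiguous. */
		if (prev == DEMO_HOLE ? next != DEMO_HOLE : next != prev + 1)
			break;
		len++;
	}
	return len;
}

/* Exchange the next differing pair; return its length, or 0 if done. */
static size_t demo_exchange_step(long *f1, long *f2, struct demo_xmi *xmi)
{
	while (xmi->blockcount > 0) {
		size_t s1 = xmi->startoff1, s2 = xmi->startoff2;
		size_t len = demo_run_len(f1, s1, xmi->blockcount);
		size_t len2 = demo_run_len(f2, s2, len);
		bool differ = f1[s1] != f2[s2];
		size_t i;

		/* Only exchange as many blocks as the shorter mapping. */
		if (len2 < len)
			len = len2;
		for (i = 0; differ && i < len; i++) {
			long tmp = f1[s1 + i];

			f1[s1 + i] = f2[s2 + i];
			f2[s2 + i] = tmp;
		}

		/* Advance the cursor whether we exchanged or skipped. */
		xmi->startoff1 += len;
		xmi->startoff2 += len;
		xmi->blockcount -= len;
		if (differ)
			return len;
	}
	return 0;
}

/* Run steps until the range is exhausted; return the number of exchanges. */
static unsigned int demo_exchange_all(long *f1, long *f2, struct demo_xmi *xmi)
{
	unsigned int steps = 0;

	while (demo_exchange_step(f1, f2, xmi) > 0)
		steps++;
	return steps;
}
```

In the patch, each such step runs in its own transaction and the
remaining range is relogged in a new XMI, which is what lets log
recovery pick up where a crash interrupted the sequence.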
 fs/xfs/Makefile                 |    1 
 fs/xfs/libxfs/xfs_bmap.h        |    2 
 fs/xfs/libxfs/xfs_defer.c       |    6 
 fs/xfs/libxfs/xfs_defer.h       |    2 
 fs/xfs/libxfs/xfs_exchmaps.c    | 1031 +++++++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_exchmaps.h    |  118 ++++
 fs/xfs/libxfs/xfs_log_format.h  |   24 +
 fs/xfs/libxfs/xfs_trans_space.h |    4 
 fs/xfs/xfs_exchmaps_item.c      |  372 ++++++++++++++
 fs/xfs/xfs_exchmaps_item.h      |    5 
 fs/xfs/xfs_exchrange.c          |   49 ++
 fs/xfs/xfs_exchrange.h          |   10 
 fs/xfs/xfs_trace.c              |    1 
 fs/xfs/xfs_trace.h              |  217 ++++++++
 14 files changed, 1837 insertions(+), 5 deletions(-)
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.c
 create mode 100644 fs/xfs/libxfs/xfs_exchmaps.h
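
The block reservation arithmetic in xfs_exchmaps_estimate_overhead() can
be illustrated with the userspace sketch below.  The assumptions: the
demo_* names and the parameters maps_per_block and blocks_per_split are
hypothetical placeholders for the XFS_MAX_CONTIG_*_PER_BLOCK() and
*_SPACE_RES() macros, and the GCC/Clang __builtin_*_overflow() helpers
stand in for the kernel's check_add_overflow().  The overflow-checked
multiply is an addition of this sketch and is not in the patch.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Round-up division without intermediate overflow, like howmany_64(). */
static uint64_t demo_howmany(uint64_t a, uint64_t b)
{
	return a / b + (a % b != 0);
}

/*
 * Add the btree growth estimate for both files to *resblks.  Returns 0 and
 * updates *resblks on success, or -ENOSPC (leaving *resblks untouched) if
 * the total overflows or exceeds what can be reserved.
 */
static int demo_estimate_resblks(uint64_t nr_exchanges,
				 uint64_t maps_per_block,
				 uint64_t blocks_per_split,
				 uint64_t *resblks)
{
	uint64_t per_file;
	uint64_t total = *resblks;

	if (__builtin_mul_overflow(demo_howmany(nr_exchanges, maps_per_block),
				   blocks_per_split, &per_file))
		return -ENOSPC;

	/* Each file may need its own btree expansion, so add it twice. */
	if (__builtin_add_overflow(total, per_file, &total))
		return -ENOSPC;
	if (__builtin_add_overflow(total, per_file, &total))
		return -ENOSPC;

	/* Cap the final total at UINT_MAX blocks. */
	if (total > UINT_MAX)
		return -ENOSPC;

	*resblks = total;
	return 0;
}
```

Accumulating into a local and only storing the result on success keeps
the caller's request unchanged on every error path.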


diff --git a/fs/xfs/Makefile b/fs/xfs/Makefile
index ae34dba36508b..20d7dea6f5cad 100644
--- a/fs/xfs/Makefile
+++ b/fs/xfs/Makefile
@@ -34,6 +34,7 @@ xfs-y				+= $(addprefix libxfs/, \
 				   xfs_dir2_node.o \
 				   xfs_dir2_sf.o \
 				   xfs_dquot_buf.o \
+				   xfs_exchmaps.o \
 				   xfs_ialloc.o \
 				   xfs_ialloc_btree.o \
 				   xfs_iext_tree.o \
diff --git a/fs/xfs/libxfs/xfs_bmap.h b/fs/xfs/libxfs/xfs_bmap.h
index f7662595309d8..b8bdbf1560e65 100644
--- a/fs/xfs/libxfs/xfs_bmap.h
+++ b/fs/xfs/libxfs/xfs_bmap.h
@@ -158,7 +158,7 @@ static inline bool xfs_bmap_is_real_extent(const struct xfs_bmbt_irec *irec)
  * Return true if the extent is a real, allocated extent, or false if it is  a
  * delayed allocation, and unwritten extent or a hole.
  */
-static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
+static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
 {
 	return xfs_bmap_is_real_extent(irec) &&
 	       irec->br_state != XFS_EXT_UNWRITTEN;
diff --git a/fs/xfs/libxfs/xfs_defer.c b/fs/xfs/libxfs/xfs_defer.c
index 66a17910d0219..159665252599b 100644
--- a/fs/xfs/libxfs/xfs_defer.c
+++ b/fs/xfs/libxfs/xfs_defer.c
@@ -27,6 +27,7 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_trans_priv.h"
+#include "xfs_exchmaps.h"
 
 static struct kmem_cache	*xfs_defer_pending_cache;
 
@@ -1181,6 +1182,10 @@ xfs_defer_init_item_caches(void)
 	error = xfs_attr_intent_init_cache();
 	if (error)
 		goto err;
+	error = xfs_exchmaps_intent_init_cache();
+	if (error)
+		goto err;
+
 	return 0;
 err:
 	xfs_defer_destroy_item_caches();
@@ -1191,6 +1196,7 @@ xfs_defer_init_item_caches(void)
 void
 xfs_defer_destroy_item_caches(void)
 {
+	xfs_exchmaps_intent_destroy_cache();
 	xfs_attr_intent_destroy_cache();
 	xfs_extfree_intent_destroy_cache();
 	xfs_bmap_intent_destroy_cache();
diff --git a/fs/xfs/libxfs/xfs_defer.h b/fs/xfs/libxfs/xfs_defer.h
index 18a9fb92dde8e..81cca60d70a3b 100644
--- a/fs/xfs/libxfs/xfs_defer.h
+++ b/fs/xfs/libxfs/xfs_defer.h
@@ -72,7 +72,7 @@ extern const struct xfs_defer_op_type xfs_rmap_update_defer_type;
 extern const struct xfs_defer_op_type xfs_extent_free_defer_type;
 extern const struct xfs_defer_op_type xfs_agfl_free_defer_type;
 extern const struct xfs_defer_op_type xfs_attr_defer_type;
-
+extern const struct xfs_defer_op_type xfs_exchmaps_defer_type;
 
 /*
  * Deferred operation item relogging limits.
diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
new file mode 100644
index 0000000000000..eddb0972e344e
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -0,0 +1,1031 @@
+// SPDX-License-Identifier: GPL-2.0-or-later
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#include "xfs.h"
+#include "xfs_fs.h"
+#include "xfs_shared.h"
+#include "xfs_format.h"
+#include "xfs_log_format.h"
+#include "xfs_trans_resv.h"
+#include "xfs_mount.h"
+#include "xfs_defer.h"
+#include "xfs_inode.h"
+#include "xfs_trans.h"
+#include "xfs_bmap.h"
+#include "xfs_icache.h"
+#include "xfs_quota.h"
+#include "xfs_exchmaps.h"
+#include "xfs_trace.h"
+#include "xfs_bmap_btree.h"
+#include "xfs_trans_space.h"
+#include "xfs_error.h"
+#include "xfs_errortag.h"
+#include "xfs_health.h"
+#include "xfs_exchmaps_item.h"
+
+struct kmem_cache	*xfs_exchmaps_intent_cache;
+
+/* bmbt mappings adjacent to a pair of records. */
+struct xfs_exchmaps_adjacent {
+	struct xfs_bmbt_irec		left1;
+	struct xfs_bmbt_irec		right1;
+	struct xfs_bmbt_irec		left2;
+	struct xfs_bmbt_irec		right2;
+};
+
+#define ADJACENT_INIT { \
+	.left1  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right1 = { .br_startblock = HOLESTARTBLOCK }, \
+	.left2  = { .br_startblock = HOLESTARTBLOCK }, \
+	.right2 = { .br_startblock = HOLESTARTBLOCK }, \
+}
+
+/* Information to reset reflink flag / CoW fork state after an exchange. */
+
+/*
+ * If the reflink flag is set on either inode, make sure it has an incore CoW
+ * fork, since all reflink inodes must have them.  If there's a CoW fork and it
+ * has mappings in it, make sure the inodes are tagged appropriately so that
+ * speculative preallocations can be GC'd if we run low on space.
+ */
+static inline void
+xfs_exchmaps_ensure_cowfork(
+	struct xfs_inode	*ip)
+{
+	struct xfs_ifork	*cfork;
+
+	if (xfs_is_reflink_inode(ip))
+		xfs_ifork_init_cow(ip);
+
+	cfork = xfs_ifork_ptr(ip, XFS_COW_FORK);
+	if (!cfork)
+		return;
+	if (cfork->if_bytes > 0)
+		xfs_inode_set_cowblocks_tag(ip);
+	else
+		xfs_inode_clear_cowblocks_tag(ip);
+}
+
+/*
+ * Adjust the on-disk inode size upwards if needed so that we never add
+ * mappings into the file past EOF.  This is crucial so that log recovery won't
+ * get confused by the sudden appearance of post-eof mappings.
+ */
+STATIC void
+xfs_exchmaps_update_size(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip,
+	struct xfs_bmbt_irec	*imap,
+	xfs_fsize_t		new_isize)
+{
+	struct xfs_mount	*mp = tp->t_mountp;
+	xfs_fsize_t		len;
+
+	if (new_isize < 0)
+		return;
+
+	len = min(XFS_FSB_TO_B(mp, imap->br_startoff + imap->br_blockcount),
+		  new_isize);
+
+	if (len <= ip->i_disk_size)
+		return;
+
+	trace_xfs_exchmaps_update_inode_size(ip, len);
+
+	ip->i_disk_size = len;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Advance the incore state tracking after exchanging a mapping. */
+static inline void
+xmi_advance(
+	struct xfs_exchmaps_intent	*xmi,
+	const struct xfs_bmbt_irec	*irec)
+{
+	xmi->xmi_startoff1 += irec->br_blockcount;
+	xmi->xmi_startoff2 += irec->br_blockcount;
+	xmi->xmi_blockcount -= irec->br_blockcount;
+}
+
+/* Do we still have more mappings to exchange? */
+static inline bool
+xmi_has_more_exchange_work(const struct xfs_exchmaps_intent *xmi)
+{
+	return xmi->xmi_blockcount > 0;
+}
+
+/* Do we have post-operation cleanups to perform? */
+static inline bool
+xmi_has_postop_work(const struct xfs_exchmaps_intent *xmi)
+{
+	return xmi->xmi_flags & (XFS_EXCHMAPS_CLEAR_INO1_REFLINK |
+				 XFS_EXCHMAPS_CLEAR_INO2_REFLINK);
+}
+
+/* Check all mappings to make sure we can actually exchange them. */
+int
+xfs_exchmaps_check_forks(
+	struct xfs_mount		*mp,
+	const struct xfs_exchmaps_req	*req)
+{
+	struct xfs_ifork		*ifp1, *ifp2;
+	int				whichfork = xfs_exchmaps_reqfork(req);
+
+	/* No fork? */
+	ifp1 = xfs_ifork_ptr(req->ip1, whichfork);
+	ifp2 = xfs_ifork_ptr(req->ip2, whichfork);
+	if (!ifp1 || !ifp2)
+		return -EINVAL;
+
+	/* We don't know how to exchange local format forks. */
+	if (ifp1->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
+		return -EINVAL;
+
+	/* We don't support realtime data forks yet. */
+	if (!XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+	if (whichfork == XFS_ATTR_FORK)
+		return 0;
+	return -EINVAL;
+}
+
+#ifdef CONFIG_XFS_QUOTA
+/* Log the actual updates to the quota accounting. */
+static inline void
+xfs_exchmaps_update_quota(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int64_t				ip1_delta = 0, ip2_delta = 0;
+	unsigned int			qflag;
+
+	qflag = XFS_IS_REALTIME_INODE(xmi->xmi_ip1) ? XFS_TRANS_DQ_RTBCOUNT :
+						      XFS_TRANS_DQ_BCOUNT;
+
+	if (xfs_bmap_is_real_extent(irec1)) {
+		ip1_delta -= irec1->br_blockcount;
+		ip2_delta += irec1->br_blockcount;
+	}
+
+	if (xfs_bmap_is_real_extent(irec2)) {
+		ip1_delta += irec2->br_blockcount;
+		ip2_delta -= irec2->br_blockcount;
+	}
+
+	xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip1, qflag, ip1_delta);
+	xfs_trans_mod_dquot_byino(tp, xmi->xmi_ip2, qflag, ip2_delta);
+}
+#else
+# define xfs_exchmaps_update_quota(tp, xmi, irec1, irec2)	((void)0)
+#endif
+
+/* Decide if we want to skip this mapping from file1. */
+static inline bool
+xfs_exchmaps_can_skip_mapping(
+	struct xfs_exchmaps_intent	*xmi,
+	struct xfs_bmbt_irec		*irec)
+{
+	/* Do not skip this mapping if the caller did not tell us to. */
+	if (!(xmi->xmi_flags & XFS_EXCHMAPS_INO1_WRITTEN))
+		return false;
+
+	/* Do not skip mapped, written mappings. */
+	if (xfs_bmap_is_written_extent(irec))
+		return false;
+
+	/*
+	 * The mapping is unwritten or a hole.  It cannot be a delalloc
+	 * reservation because we already excluded those.  It cannot be an
+	 * unwritten mapping with dirty page cache because we flushed the page
+	 * cache.  We don't support realtime files yet, so we needn't (yet)
+	 * deal with them.
+	 */
+	return true;
+}
+
+/*
+ * Walk forward through the file ranges in @xmi until we find two different
+ * mappings to exchange.  If there is work to do, return the mappings;
+ * otherwise we've reached the end of the range and xmi_blockcount will be
+ * zero.
+ *
+ * If the walk skips over a pair of mappings to the same storage, save them as
+ * the left records in @adj (if provided) so that the simulation phase can
+ * avoid an extra lookup.
+ */
+static int
+xfs_exchmaps_find_mappings(
+	struct xfs_exchmaps_intent	*xmi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2,
+	struct xfs_exchmaps_adjacent	*adj)
+{
+	int				nimaps;
+	int				bmap_flags;
+	int				error;
+
+	bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_whichfork(xmi));
+
+	for (; xmi_has_more_exchange_work(xmi); xmi_advance(xmi, irec1)) {
+		/* Read mapping from the first file */
+		nimaps = 1;
+		error = xfs_bmapi_read(xmi->xmi_ip1, xmi->xmi_startoff1,
+				xmi->xmi_blockcount, irec1, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec1->br_startblock == DELAYSTARTBLOCK ||
+		    irec1->br_startoff != xmi->xmi_startoff1) {
+			/*
+			 * We should never get no mapping or a delalloc mapping
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		if (xfs_exchmaps_can_skip_mapping(xmi, irec1)) {
+			trace_xfs_exchmaps_mapping1_skip(xmi->xmi_ip1, irec1);
+			continue;
+		}
+
+		/* Read mapping from the second file */
+		nimaps = 1;
+		error = xfs_bmapi_read(xmi->xmi_ip2, xmi->xmi_startoff2,
+				irec1->br_blockcount, irec2, &nimaps,
+				bmap_flags);
+		if (error)
+			return error;
+		if (nimaps != 1 ||
+		    irec2->br_startblock == DELAYSTARTBLOCK ||
+		    irec2->br_startoff != xmi->xmi_startoff2) {
+			/*
+			 * We should never get no mapping or a delalloc mapping
+			 * or something that doesn't match what we asked for,
+			 * since the caller flushed both inodes and we hold the
+			 * ILOCKs for both inodes.
+			 */
+			ASSERT(0);
+			return -EINVAL;
+		}
+
+		/*
+		 * We can only exchange as many blocks as the smaller of the
+		 * two mappings.
+		 */
+		irec1->br_blockcount = min(irec1->br_blockcount,
+					   irec2->br_blockcount);
+
+		trace_xfs_exchmaps_mapping1(xmi->xmi_ip1, irec1);
+		trace_xfs_exchmaps_mapping2(xmi->xmi_ip2, irec2);
+
+		/* We found something to exchange, so return it. */
+		if (irec1->br_startblock != irec2->br_startblock)
+			return 0;
+
+		/*
+		 * Two mappings pointing to the same physical block must not
+		 * have different states; that's filesystem corruption.  Move
+		 * on to the next mapping if they're both holes or both point
+		 * to the same physical space extent.
+		 */
+		if (irec1->br_state != irec2->br_state) {
+			xfs_bmap_mark_sick(xmi->xmi_ip1,
+					xfs_exchmaps_whichfork(xmi));
+			xfs_bmap_mark_sick(xmi->xmi_ip2,
+					xfs_exchmaps_whichfork(xmi));
+			return -EFSCORRUPTED;
+		}
+
+		/*
+		 * Save the mappings if we're estimating work and skipping
+		 * these identical mappings.
+		 */
+		if (adj) {
+			memcpy(&adj->left1, irec1, sizeof(*irec1));
+			memcpy(&adj->left2, irec2, sizeof(*irec2));
+		}
+	}
+
+	return 0;
+}
+
+/* Exchange these two mappings. */
+static void
+xfs_exchmaps_one_step(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi,
+	struct xfs_bmbt_irec		*irec1,
+	struct xfs_bmbt_irec		*irec2)
+{
+	int				whichfork = xfs_exchmaps_whichfork(xmi);
+
+	xfs_exchmaps_update_quota(tp, xmi, irec1, irec2);
+
+	/* Remove both mappings. */
+	xfs_bmap_unmap_extent(tp, xmi->xmi_ip1, whichfork, irec1);
+	xfs_bmap_unmap_extent(tp, xmi->xmi_ip2, whichfork, irec2);
+
+	/*
+	 * Re-add both mappings.  We exchange the file offsets between the two
+	 * maps and add the opposite map, which has the effect of filling the
+	 * logical offsets we just unmapped, but with the physical mapping
+	 * information exchanged.
+	 */
+	swap(irec1->br_startoff, irec2->br_startoff);
+	xfs_bmap_map_extent(tp, xmi->xmi_ip1, whichfork, irec2);
+	xfs_bmap_map_extent(tp, xmi->xmi_ip2, whichfork, irec1);
+
+	/* Make sure we're not adding mappings past EOF. */
+	if (whichfork == XFS_DATA_FORK) {
+		xfs_exchmaps_update_size(tp, xmi->xmi_ip1, irec2,
+				xmi->xmi_isize1);
+		xfs_exchmaps_update_size(tp, xmi->xmi_ip2, irec1,
+				xmi->xmi_isize2);
+	}
+
+	/*
+	 * Advance our cursor and exit.  The caller (either defer ops or log
+	 * recovery) will log the XMD item, and if xmi_blockcount is nonzero,
+	 * it will log a new XMI item for the remainder and call us back.
+	 */
+	xmi_advance(xmi, irec1);
+}
+
+/* Clear the reflink flag after an exchange. */
+static inline void
+xfs_exchmaps_clear_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_unset_inode_flag(ip);
+
+	ip->i_diflags2 &= ~XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Finish whatever work might come after an exchange operation. */
+static int
+xfs_exchmaps_do_postop_work(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO1_REFLINK) {
+		xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip1);
+		xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO1_REFLINK;
+	}
+
+	if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO2_REFLINK) {
+		xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip2);
+		xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
+	}
+
+	return 0;
+}
+
+/* Finish one step in a mapping exchange operation, possibly relogging. */
+int
+xfs_exchmaps_finish_one(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	struct xfs_bmbt_irec		irec1, irec2;
+	int				error;
+
+	if (xmi_has_more_exchange_work(xmi)) {
+		/*
+		 * If the operation state says that some range of the files
+		 * have not yet been exchanged, look for mappings in that range
+		 * to exchange.  If we find some mappings, exchange them.
+		 */
+		error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, NULL);
+		if (error)
+			return error;
+
+		if (xmi_has_more_exchange_work(xmi))
+			xfs_exchmaps_one_step(tp, xmi, &irec1, &irec2);
+
+		/*
+		 * If the caller asked us to exchange the file sizes after the
+		 * exchange and either we just exchanged the last mappings in
+		 * the range or we didn't find anything to exchange, update the
+		 * ondisk file sizes.
+		 */
+		if ((xmi->xmi_flags & XFS_EXCHMAPS_SET_SIZES) &&
+		    !xmi_has_more_exchange_work(xmi)) {
+			xmi->xmi_ip1->i_disk_size = xmi->xmi_isize1;
+			xmi->xmi_ip2->i_disk_size = xmi->xmi_isize2;
+
+			xfs_trans_log_inode(tp, xmi->xmi_ip1, XFS_ILOG_CORE);
+			xfs_trans_log_inode(tp, xmi->xmi_ip2, XFS_ILOG_CORE);
+		}
+	} else if (xmi_has_postop_work(xmi)) {
+		/*
+		 * Now that we're finished with the exchange operation,
+		 * complete the post-op cleanup work.
+		 */
+		error = xfs_exchmaps_do_postop_work(tp, xmi);
+		if (error)
+			return error;
+	}
+
+	/* If we still have work to do, ask for a new transaction. */
+	if (xmi_has_more_exchange_work(xmi) || xmi_has_postop_work(xmi)) {
+		trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
+		return -EAGAIN;
+	}
+
+	/*
+	 * If we reach here, we've finished all the exchange work and the post
+	 * operation work.  The last thing we need to do before returning to
+	 * the caller is to make sure that COW forks are set up correctly.
+	 */
+	if (!(xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)) {
+		xfs_exchmaps_ensure_cowfork(xmi->xmi_ip1);
+		xfs_exchmaps_ensure_cowfork(xmi->xmi_ip2);
+	}
+
+	return 0;
+}
+
+/*
+ * Compute the number of bmbt blocks we should reserve for each file.  In the
+ * worst case, each exchange will fill a hole with a new mapping, which could
+ * result in a btree split every time we add a new leaf block.
+ */
+static inline uint64_t
+xfs_exchmaps_bmbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_exchmaps_req	*req)
+{
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)) *
+			XFS_EXTENTADD_SPACE_RES(mp, xfs_exchmaps_reqfork(req));
+}
+
+/* Compute the space we should reserve for the rmap btree expansions. */
+static inline uint64_t
+xfs_exchmaps_rmapbt_blocks(
+	struct xfs_mount		*mp,
+	const struct xfs_exchmaps_req	*req)
+{
+	if (!xfs_has_rmapbt(mp))
+		return 0;
+	if (XFS_IS_REALTIME_INODE(req->ip1))
+		return 0;
+
+	return howmany_64(req->nr_exchanges,
+					XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)) *
+			XFS_RMAPADD_SPACE_RES(mp);
+}
+
+/* Estimate the bmbt and rmapbt overhead required to exchange mappings. */
+static int
+xfs_exchmaps_estimate_overhead(
+	struct xfs_exchmaps_req		*req)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+	xfs_filblks_t			bmbt_blocks;
+	xfs_filblks_t			rmapbt_blocks;
+	xfs_filblks_t			resblks = req->resblks;
+
+	/*
+	 * Compute the number of bmbt and rmapbt blocks we might need to handle
+	 * the estimated number of exchanges.
+	 */
+	bmbt_blocks = xfs_exchmaps_bmbt_blocks(mp, req);
+	rmapbt_blocks = xfs_exchmaps_rmapbt_blocks(mp, req);
+
+	trace_xfs_exchmaps_overhead(mp, bmbt_blocks, rmapbt_blocks);
+
+	/* Make sure the change in file block count doesn't overflow. */
+	if (check_add_overflow(req->ip1_bcount, bmbt_blocks, &req->ip1_bcount))
+		return -EFBIG;
+	if (check_add_overflow(req->ip2_bcount, bmbt_blocks, &req->ip2_bcount))
+		return -EFBIG;
+
+	/*
+	 * Add together the number of blocks we need to handle btree growth in
+	 * both files, then add that to the number of blocks we need to reserve
+	 * for this transaction.
+	 */
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, bmbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+	if (check_add_overflow(resblks, rmapbt_blocks, &resblks))
+		return -ENOSPC;
+
+	/* Can't actually reserve more than UINT_MAX blocks. */
+	if (resblks > UINT_MAX)
+		return -ENOSPC;
+
+	req->resblks = resblks;
+	trace_xfs_exchmaps_final_estimate(req);
+	return 0;
+}
+
+/* Decide if we can merge two real mappings. */
+static inline bool
+xmi_can_merge(
+	const struct xfs_bmbt_irec	*b1,
+	const struct xfs_bmbt_irec	*b2)
+{
+	/* Don't merge holes. */
+	if (b1->br_startblock == HOLESTARTBLOCK ||
+	    b2->br_startblock == HOLESTARTBLOCK)
+		return false;
+
+	/* Don't merge delalloc reservations either. */
+	if (!xfs_bmap_is_real_extent(b1) || !xfs_bmap_is_real_extent(b2))
+		return false;
+
+	if (b1->br_startoff   + b1->br_blockcount == b2->br_startoff &&
+	    b1->br_startblock + b1->br_blockcount == b2->br_startblock &&
+	    b1->br_state			  == b2->br_state &&
+	    b1->br_blockcount + b2->br_blockcount <= XFS_MAX_BMBT_EXTLEN)
+		return true;
+
+	return false;
+}
+
+#define CLEFT_CONTIG	0x01
+#define CRIGHT_CONTIG	0x02
+#define CHOLE		0x04
+#define CBOTH_CONTIG	(CLEFT_CONTIG | CRIGHT_CONTIG)
+
+#define NLEFT_CONTIG	0x10
+#define NRIGHT_CONTIG	0x20
+#define NHOLE		0x40
+#define NBOTH_CONTIG	(NLEFT_CONTIG | NRIGHT_CONTIG)
+
+/* Estimate the effect of a single exchange on mapping count. */
+static inline int
+xmi_delta_nextents_step(
+	struct xfs_mount		*mp,
+	const struct xfs_bmbt_irec	*left,
+	const struct xfs_bmbt_irec	*curr,
+	const struct xfs_bmbt_irec	*new,
+	const struct xfs_bmbt_irec	*right)
+{
+	bool				lhole, rhole, chole, nhole;
+	unsigned int			state = 0;
+	int				ret = 0;
+
+	lhole = left->br_startblock == HOLESTARTBLOCK;
+	rhole = right->br_startblock == HOLESTARTBLOCK;
+	chole = curr->br_startblock == HOLESTARTBLOCK;
+	nhole = new->br_startblock == HOLESTARTBLOCK;
+
+	if (chole)
+		state |= CHOLE;
+	if (!lhole && !chole && xmi_can_merge(left, curr))
+		state |= CLEFT_CONTIG;
+	if (!rhole && !chole && xmi_can_merge(curr, right))
+		state |= CRIGHT_CONTIG;
+	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
+	    left->br_blockcount + curr->br_blockcount +
+					right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~CRIGHT_CONTIG;
+
+	if (nhole)
+		state |= NHOLE;
+	if (!lhole && !nhole && xmi_can_merge(left, new))
+		state |= NLEFT_CONTIG;
+	if (!rhole && !nhole && xmi_can_merge(new, right))
+		state |= NRIGHT_CONTIG;
+	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
+	    left->br_blockcount + new->br_blockcount +
+					right->br_blockcount > XFS_MAX_BMBT_EXTLEN)
+		state &= ~NRIGHT_CONTIG;
+
+	switch (state & (CLEFT_CONTIG | CRIGHT_CONTIG | CHOLE)) {
+	case CLEFT_CONTIG | CRIGHT_CONTIG:
+		/*
+		 * left/curr/right are the same mapping, so deleting curr
+		 * causes 2 new mappings to be created.
+		 */
+		ret += 2;
+		break;
+	case 0:
+		/*
+		 * curr is not contiguous with any mapping, so we remove curr
+		 * completely
+		 */
+		ret--;
+		break;
+	case CHOLE:
+		/* hole, do nothing */
+		break;
+	case CLEFT_CONTIG:
+	case CRIGHT_CONTIG:
+		/* trim either left or right, no change */
+		break;
+	}
+
+	switch (state & (NLEFT_CONTIG | NRIGHT_CONTIG | NHOLE)) {
+	case NLEFT_CONTIG | NRIGHT_CONTIG:
+		/*
+		 * left/new/right will become the same mapping, so adding
+		 * new causes the deletion of right.
+		 */
+		ret--;
+		break;
+	case 0:
+		/* new is not contiguous with any mapping */
+		ret++;
+		break;
+	case NHOLE:
+		/* hole, do nothing. */
+		break;
+	case NLEFT_CONTIG:
+	case NRIGHT_CONTIG:
+		/* new is absorbed into left or right, no change */
+		break;
+	}
+
+	trace_xfs_exchmaps_delta_nextents_step(mp, left, curr, new, right, ret,
+			state);
+	return ret;
+}
+
+/* Make sure we don't overflow the extent (mapping) counters. */
+static inline int
+xmi_ensure_delta_nextents(
+	struct xfs_exchmaps_req	*req,
+	struct xfs_inode	*ip,
+	int64_t			delta)
+{
+	struct xfs_mount	*mp = ip->i_mount;
+	int			whichfork = xfs_exchmaps_reqfork(req);
+	struct xfs_ifork	*ifp = xfs_ifork_ptr(ip, whichfork);
+	uint64_t		new_nextents;
+	xfs_extnum_t		max_nextents;
+
+	if (delta < 0)
+		return 0;
+
+	/*
+	 * It's always an error if the delta causes integer overflow.  delta
+	 * needs an explicit cast here to avoid warnings about implicit casts
+	 * coded into the overflow check.
+	 */
+	if (check_add_overflow(ifp->if_nextents, (uint64_t)delta,
+				&new_nextents))
+		return -EFBIG;
+
+	if (XFS_TEST_ERROR(false, mp, XFS_ERRTAG_REDUCE_MAX_IEXTENTS) &&
+	    new_nextents > 10)
+		return -EFBIG;
+
+	/*
+	 * We always promote both inodes to have large extent counts if the
+	 * superblock feature is enabled, so we only need to check against the
+	 * theoretical maximum.
+	 */
+	max_nextents = xfs_iext_max_nextents(xfs_has_large_extent_counts(mp),
+					     whichfork);
+	if (new_nextents > max_nextents)
+		return -EFBIG;
+
+	return 0;
+}
+
+/* Find the next mapping after irec. */
+static inline int
+xmi_next(
+	struct xfs_inode		*ip,
+	int				bmap_flags,
+	const struct xfs_bmbt_irec	*irec,
+	struct xfs_bmbt_irec		*nrec)
+{
+	xfs_fileoff_t			off;
+	xfs_filblks_t			blockcount;
+	int				nimaps = 1;
+	int				error;
+
+	off = irec->br_startoff + irec->br_blockcount;
+	blockcount = XFS_MAX_FILEOFF - off;
+	error = xfs_bmapi_read(ip, off, blockcount, nrec, &nimaps, bmap_flags);
+	if (error)
+		return error;
+	if (nrec->br_startblock == DELAYSTARTBLOCK ||
+	    nrec->br_startoff != off) {
+		/*
+		 * If we don't get the mapping we want, mark the record as a
+		 * hole so that our estimator function won't try to merge it.
+		 * We shouldn't get delalloc reservations.
+		 */
+		nrec->br_startblock = HOLESTARTBLOCK;
+	}
+
+	return 0;
+}
+
+int __init
+xfs_exchmaps_intent_init_cache(void)
+{
+	xfs_exchmaps_intent_cache = kmem_cache_create("xfs_exchmaps_intent",
+			sizeof(struct xfs_exchmaps_intent),
+			0, 0, NULL);
+
+	return xfs_exchmaps_intent_cache != NULL ? 0 : -ENOMEM;
+}
+
+void
+xfs_exchmaps_intent_destroy_cache(void)
+{
+	kmem_cache_destroy(xfs_exchmaps_intent_cache);
+	xfs_exchmaps_intent_cache = NULL;
+}
+
+/*
+ * Decide if we will exchange the reflink flags between the two files after the
+ * exchange.  The only time we want to do this is if we're exchanging all
+ * mappings under EOF and the inode reflink flags have different states.
+ */
+static inline bool
+xmi_can_exchange_reflink_flags(
+	const struct xfs_exchmaps_req	*req,
+	unsigned int			reflink_state)
+{
+	struct xfs_mount		*mp = req->ip1->i_mount;
+
+	if (hweight32(reflink_state) != 1)
+		return false;
+	if (req->startoff1 != 0 || req->startoff2 != 0)
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip1->i_disk_size))
+		return false;
+	if (req->blockcount != XFS_B_TO_FSB(mp, req->ip2->i_disk_size))
+		return false;
+	return true;
+}
+
+
+/* Allocate and initialize a new incore intent item from a request. */
+struct xfs_exchmaps_intent *
+xfs_exchmaps_init_intent(
+	const struct xfs_exchmaps_req	*req)
+{
+	struct xfs_exchmaps_intent	*xmi;
+	unsigned int			rs = 0;
+
+	xmi = kmem_cache_zalloc(xfs_exchmaps_intent_cache,
+			GFP_NOFS | __GFP_NOFAIL);
+	INIT_LIST_HEAD(&xmi->xmi_list);
+	xmi->xmi_ip1 = req->ip1;
+	xmi->xmi_ip2 = req->ip2;
+	xmi->xmi_startoff1 = req->startoff1;
+	xmi->xmi_startoff2 = req->startoff2;
+	xmi->xmi_blockcount = req->blockcount;
+	xmi->xmi_isize1 = xmi->xmi_isize2 = -1;
+	xmi->xmi_flags = req->flags & XFS_EXCHMAPS_PARAMS;
+
+	if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK)
+		return xmi;
+
+	if (req->flags & XFS_EXCHMAPS_SET_SIZES) {
+		xmi->xmi_flags |= XFS_EXCHMAPS_SET_SIZES;
+		xmi->xmi_isize1 = req->ip2->i_disk_size;
+		xmi->xmi_isize2 = req->ip1->i_disk_size;
+	}
+
+	/* Record the state of each inode's reflink flag before the op. */
+	if (xfs_is_reflink_inode(req->ip1))
+		rs |= 1;
+	if (xfs_is_reflink_inode(req->ip2))
+		rs |= 2;
+
+	/*
+	 * Figure out if we're clearing the reflink flags (which effectively
+	 * exchanges them) after the operation.
+	 */
+	if (xmi_can_exchange_reflink_flags(req, rs)) {
+		if (rs & 1)
+			xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO1_REFLINK;
+		if (rs & 2)
+			xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
+	}
+
+	return xmi;
+}
+
+/*
+ * Estimate the number of exchange operations and the number of file blocks
+ * in each file that will be affected by the exchange operation.
+ */
+int
+xfs_exchmaps_estimate(
+	struct xfs_exchmaps_req		*req)
+{
+	struct xfs_exchmaps_intent	*xmi;
+	struct xfs_bmbt_irec		irec1, irec2;
+	struct xfs_exchmaps_adjacent	adj = ADJACENT_INIT;
+	xfs_filblks_t			ip1_blocks = 0, ip2_blocks = 0;
+	int64_t				d_nexts1, d_nexts2;
+	int				bmap_flags;
+	int				error;
+
+	ASSERT(!(req->flags & ~XFS_EXCHMAPS_PARAMS));
+
+	bmap_flags = xfs_bmapi_aflag(xfs_exchmaps_reqfork(req));
+	xmi = xfs_exchmaps_init_intent(req);
+
+	/*
+	 * To guard against the possibility of overflowing the extent counters,
+	 * we have to estimate an upper bound on the potential increase in that
+	 * counter.  We can split the mapping at each end of the range, and for
+	 * each step of the exchange we can split the mapping that we're
+	 * working on if the mappings do not align.
+	 */
+	d_nexts1 = d_nexts2 = 3;
+
+	while (xmi_has_more_exchange_work(xmi)) {
+		/*
+		 * Walk through the file ranges until we find something to
+		 * exchange.  Because we're simulating the exchange, pass in
+		 * adj to capture skipped mappings for correct estimation of
+		 * bmbt record merges.
+		 */
+		error = xfs_exchmaps_find_mappings(xmi, &irec1, &irec2, &adj);
+		if (error)
+			goto out_free;
+		if (!xmi_has_more_exchange_work(xmi))
+			break;
+
+		/* Update accounting. */
+		if (xfs_bmap_is_real_extent(&irec1))
+			ip1_blocks += irec1.br_blockcount;
+		if (xfs_bmap_is_real_extent(&irec2))
+			ip2_blocks += irec2.br_blockcount;
+		req->nr_exchanges++;
+
+		/* Read the next mappings from both files. */
+		error = xmi_next(req->ip1, bmap_flags, &irec1, &adj.right1);
+		if (error)
+			goto out_free;
+
+		error = xmi_next(req->ip2, bmap_flags, &irec2, &adj.right2);
+		if (error)
+			goto out_free;
+
+		/* Update extent count deltas. */
+		d_nexts1 += xmi_delta_nextents_step(req->ip1->i_mount,
+				&adj.left1, &irec1, &irec2, &adj.right1);
+
+		d_nexts2 += xmi_delta_nextents_step(req->ip1->i_mount,
+				&adj.left2, &irec2, &irec1, &adj.right2);
+
+		/* Now pretend we exchanged the mappings. */
+		if (xmi_can_merge(&adj.left2, &irec1))
+			adj.left2.br_blockcount += irec1.br_blockcount;
+		else
+			memcpy(&adj.left2, &irec1, sizeof(irec1));
+
+		if (xmi_can_merge(&adj.left1, &irec2))
+			adj.left1.br_blockcount += irec2.br_blockcount;
+		else
+			memcpy(&adj.left1, &irec2, sizeof(irec2));
+
+		xmi_advance(xmi, &irec1);
+	}
+
+	/* Account for the blocks that are being exchanged. */
+	if (XFS_IS_REALTIME_INODE(req->ip1) &&
+	    xfs_exchmaps_reqfork(req) == XFS_DATA_FORK) {
+		req->ip1_rtbcount = ip1_blocks;
+		req->ip2_rtbcount = ip2_blocks;
+	} else {
+		req->ip1_bcount = ip1_blocks;
+		req->ip2_bcount = ip2_blocks;
+	}
+
+	/*
+	 * Make sure that both forks have enough slack left in their extent
+	 * counters that the exchange operation will not overflow.
+	 */
+	trace_xfs_exchmaps_delta_nextents(req, d_nexts1, d_nexts2);
+	if (req->ip1 == req->ip2) {
+		error = xmi_ensure_delta_nextents(req, req->ip1,
+				d_nexts1 + d_nexts2);
+	} else {
+		error = xmi_ensure_delta_nextents(req, req->ip1, d_nexts1);
+		if (error)
+			goto out_free;
+		error = xmi_ensure_delta_nextents(req, req->ip2, d_nexts2);
+	}
+	if (error)
+		goto out_free;
+
+	trace_xfs_exchmaps_initial_estimate(req);
+	error = xfs_exchmaps_estimate_overhead(req);
+out_free:
+	kmem_cache_free(xfs_exchmaps_intent_cache, xmi);
+	return error;
+}
+
+/* Set the reflink flag before an operation. */
+static inline void
+xfs_exchmaps_set_reflink(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	trace_xfs_reflink_set_inode_flag(ip);
+
+	ip->i_diflags2 |= XFS_DIFLAG2_REFLINK;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/*
+ * If either file has shared blocks and we're exchanging data forks, we must
+ * flag the other file as having shared blocks so that we get the shared-block
+ * rmap functions if we need to fix up the rmaps.
+ */
+void
+xfs_exchmaps_ensure_reflink(
+	struct xfs_trans			*tp,
+	const struct xfs_exchmaps_intent	*xmi)
+{
+	unsigned int				rs = 0;
+
+	if (xfs_is_reflink_inode(xmi->xmi_ip1))
+		rs |= 1;
+	if (xfs_is_reflink_inode(xmi->xmi_ip2))
+		rs |= 2;
+
+	if ((rs & 1) && !xfs_is_reflink_inode(xmi->xmi_ip2))
+		xfs_exchmaps_set_reflink(tp, xmi->xmi_ip2);
+
+	if ((rs & 2) && !xfs_is_reflink_inode(xmi->xmi_ip1))
+		xfs_exchmaps_set_reflink(tp, xmi->xmi_ip1);
+}
+
+/* Set the large extent count flag before an operation if needed. */
+static inline void
+xfs_exchmaps_ensure_large_extent_counts(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	if (xfs_inode_has_large_extent_counts(ip))
+		return;
+
+	ip->i_diflags2 |= XFS_DIFLAG2_NREXT64;
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+}
+
+/* Widen the extent counter fields of both inodes if necessary. */
+void
+xfs_exchmaps_upgrade_extent_counts(
+	struct xfs_trans			*tp,
+	const struct xfs_exchmaps_intent	*xmi)
+{
+	if (!xfs_has_large_extent_counts(tp->t_mountp))
+		return;
+
+	xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip1);
+	xfs_exchmaps_ensure_large_extent_counts(tp, xmi->xmi_ip2);
+}
+
+/*
+ * Schedule an exchange of a range of mappings between two inodes.
+ *
+ * The use of file mapping exchange log intent items ensures the operation can
+ * be resumed even if the system goes down.  The caller must commit the
+ * transaction to start the work.
+ *
+ * The caller must ensure that the inodes are joined to the transaction and
+ * ILOCKd; they will still be joined to the transaction at exit.
+ */
+void
+xfs_exchange_mappings(
+	struct xfs_trans		*tp,
+	const struct xfs_exchmaps_req	*req)
+{
+	struct xfs_exchmaps_intent	*xmi;
+
+	ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
+	ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
+	ASSERT(!(req->flags & ~XFS_EXCHMAPS_LOGGED_FLAGS));
+	if (req->flags & XFS_EXCHMAPS_SET_SIZES)
+		ASSERT(!(req->flags & XFS_EXCHMAPS_ATTR_FORK));
+	ASSERT(xfs_sb_version_haslogexchmaps(&tp->t_mountp->m_sb));
+
+	if (req->blockcount == 0)
+		return;
+
+	xmi = xfs_exchmaps_init_intent(req);
+	xfs_exchmaps_defer_add(tp, xmi);
+	xfs_exchmaps_ensure_reflink(tp, xmi);
+	xfs_exchmaps_upgrade_extent_counts(tp, xmi);
+}
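
The extent counter check at the top of this section (the tail of
xmi_ensure_delta_nextents) can be tried out in userspace.  What follows is
only a sketch under stated assumptions: nextents and max_nextents are
hypothetical stand-ins for ifp->if_nextents and the value returned by
xfs_iext_max_nextents(), the GCC/Clang builtin __builtin_add_overflow stands
in for check_add_overflow(), and the early return for a non-positive delta is
an assumption, since that part of the function is not visible in this hunk.

```c
#include <errno.h>
#include <stdint.h>

/*
 * Userspace sketch, not the kernel function.  nextents and max_nextents are
 * hypothetical stand-ins for ifp->if_nextents and xfs_iext_max_nextents().
 */
static int
ensure_delta_nextents(uint64_t nextents, int64_t delta, uint64_t max_nextents)
{
	uint64_t	new_nextents;

	/* Assumption: a fork that is not growing cannot overflow its counter. */
	if (delta <= 0)
		return 0;

	/* Stand-in for check_add_overflow(); the cast mirrors the patch. */
	if (__builtin_add_overflow(nextents, (uint64_t)delta, &new_nextents))
		return -EFBIG;

	/* Compare against the theoretical maximum for this fork. */
	if (new_nextents > max_nextents)
		return -EFBIG;

	return 0;
}
```

Both failure modes report -EFBIG, so a caller does not need to distinguish
arithmetic overflow from exceeding the per-fork limit.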
diff --git a/fs/xfs/libxfs/xfs_exchmaps.h b/fs/xfs/libxfs/xfs_exchmaps.h
new file mode 100644
index 0000000000000..e8fc3f80c68c2
--- /dev/null
+++ b/fs/xfs/libxfs/xfs_exchmaps.h
@@ -0,0 +1,118 @@
+/* SPDX-License-Identifier: GPL-2.0-or-later */
+/*
+ * Copyright (c) 2020-2024 Oracle.  All Rights Reserved.
+ * Author: Darrick J. Wong <djwong@kernel.org>
+ */
+#ifndef __XFS_EXCHMAPS_H__
+#define __XFS_EXCHMAPS_H__
+
+/* In-core deferred operation info about a file mapping exchange request. */
+struct xfs_exchmaps_intent {
+	/* List of other incore deferred work. */
+	struct list_head	xmi_list;
+
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*xmi_ip1;
+	struct xfs_inode	*xmi_ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		xmi_startoff1;
+	xfs_fileoff_t		xmi_startoff2;
+	xfs_filblks_t		xmi_blockcount;
+
+	/* Set these file sizes after the operation, unless negative. */
+	xfs_fsize_t		xmi_isize1;
+	xfs_fsize_t		xmi_isize2;
+
+	uint64_t		xmi_flags;	/* XFS_EXCHMAPS_* flags */
+};
+
+/* Flags accepted by xfs_exchmaps_estimate and xfs_exchange_mappings. */
+#define XFS_EXCHMAPS_PARAMS		(XFS_EXCHMAPS_ATTR_FORK | \
+					 XFS_EXCHMAPS_SET_SIZES | \
+					 XFS_EXCHMAPS_INO1_WRITTEN)
+
+static inline int
+xfs_exchmaps_whichfork(const struct xfs_exchmaps_intent *xmi)
+{
+	if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
+}
+
+/* Parameters for a mapping exchange request. */
+struct xfs_exchmaps_req {
+	/* Inodes participating in the operation. */
+	struct xfs_inode	*ip1;
+	struct xfs_inode	*ip2;
+
+	/* File offset range information. */
+	xfs_fileoff_t		startoff1;
+	xfs_fileoff_t		startoff2;
+	xfs_filblks_t		blockcount;
+
+	/* XFS_EXCHMAPS_* operation flags */
+	uint64_t		flags;
+
+	/*
+	 * Fields below this line are filled out by xfs_exchmaps_estimate;
+	 * callers should initialize this part of the struct to zero.
+	 */
+
+	/*
+	 * Data device blocks to be moved out of ip1, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip1_bcount;
+
+	/*
+	 * Data device blocks to be moved out of ip2, and free space needed to
+	 * handle the bmbt changes.
+	 */
+	xfs_filblks_t		ip2_bcount;
+
+	/* rt blocks to be moved out of ip1. */
+	xfs_filblks_t		ip1_rtbcount;
+
+	/* rt blocks to be moved out of ip2. */
+	xfs_filblks_t		ip2_rtbcount;
+
+	/* Free space needed to handle the bmbt changes */
+	unsigned long long	resblks;
+
+	/* Number of exchanges needed to complete the operation */
+	unsigned long long	nr_exchanges;
+};
+
+static inline int
+xfs_exchmaps_reqfork(const struct xfs_exchmaps_req *req)
+{
+	if (req->flags & XFS_EXCHMAPS_ATTR_FORK)
+		return XFS_ATTR_FORK;
+	return XFS_DATA_FORK;
+}
+
+int xfs_exchmaps_estimate(struct xfs_exchmaps_req *req);
+
+extern struct kmem_cache	*xfs_exchmaps_intent_cache;
+
+int __init xfs_exchmaps_intent_init_cache(void);
+void xfs_exchmaps_intent_destroy_cache(void);
+
+struct xfs_exchmaps_intent *xfs_exchmaps_init_intent(
+		const struct xfs_exchmaps_req *req);
+void xfs_exchmaps_ensure_reflink(struct xfs_trans *tp,
+		const struct xfs_exchmaps_intent *xmi);
+void xfs_exchmaps_upgrade_extent_counts(struct xfs_trans *tp,
+		const struct xfs_exchmaps_intent *xmi);
+
+int xfs_exchmaps_finish_one(struct xfs_trans *tp,
+		struct xfs_exchmaps_intent *xmi);
+
+int xfs_exchmaps_check_forks(struct xfs_mount *mp,
+		const struct xfs_exchmaps_req *req);
+
+void xfs_exchange_mappings(struct xfs_trans *tp,
+		const struct xfs_exchmaps_req *req);
+
+#endif /* __XFS_EXCHMAPS_H__ */
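
The reflink bookkeeping done by xfs_exchmaps_init_intent (declared in the
header above) relies on a two-bit state word: bit 0 is set if inode 1 has the
reflink flag, bit 1 if inode 2 does.  A minimal userspace sketch of the
decision made by xmi_can_exchange_reflink_flags follows.  The struct, the
macro names, and the eof_blocks fields are hypothetical; in the patch the EOF
block counts come from XFS_B_TO_FSB(mp, ip->i_disk_size), and
__builtin_popcount stands in for hweight32().

```c
#include <stdbool.h>
#include <stdint.h>

#define INO1_REFLINK	(1U << 0)	/* hypothetical name for "rs |= 1" */
#define INO2_REFLINK	(1U << 1)	/* hypothetical name for "rs |= 2" */

/* Hypothetical request summary; all fields are in filesystem blocks. */
struct exch_range {
	uint64_t	startoff1;
	uint64_t	startoff2;
	uint64_t	blockcount;
	uint64_t	eof_blocks1;	/* file 1 size rounded up to blocks */
	uint64_t	eof_blocks2;	/* file 2 size rounded up to blocks */
};

static bool
can_exchange_reflink_flags(const struct exch_range *r, unsigned int rs)
{
	/* Exactly one bit set means the two inodes' flags differ. */
	if (__builtin_popcount(rs) != 1)
		return false;

	/* The exchange must start at offset zero in both files... */
	if (r->startoff1 != 0 || r->startoff2 != 0)
		return false;

	/* ...and must cover every block below EOF in both files. */
	return r->blockcount == r->eof_blocks1 &&
	       r->blockcount == r->eof_blocks2;
}
```

When both bits or neither bit is set there is nothing to exchange, and a
partial exchange cannot prove that either file ends up free of shared blocks.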
diff --git a/fs/xfs/libxfs/xfs_log_format.h b/fs/xfs/libxfs/xfs_log_format.h
index 09024431cae9a..8dbe1f997dfd5 100644
--- a/fs/xfs/libxfs/xfs_log_format.h
+++ b/fs/xfs/libxfs/xfs_log_format.h
@@ -904,7 +904,29 @@ struct xfs_xmi_log_format {
 	uint64_t		xmi_isize2;	/* intended file2 size */
 };
 
-#define XFS_EXCHMAPS_LOGGED_FLAGS		(0)
+/* Exchange mappings between extended attribute forks instead of data forks. */
+#define XFS_EXCHMAPS_ATTR_FORK		(1ULL << 0)
+
+/* Set the file sizes when finished. */
+#define XFS_EXCHMAPS_SET_SIZES		(1ULL << 1)
+
+/*
+ * Exchange the mappings of the two files only if the file allocation units
+ * mapped to file1's range have been written.
+ */
+#define XFS_EXCHMAPS_INO1_WRITTEN	(1ULL << 2)
+
+/* Clear the reflink flag from inode1 after the operation. */
+#define XFS_EXCHMAPS_CLEAR_INO1_REFLINK	(1ULL << 3)
+
+/* Clear the reflink flag from inode2 after the operation. */
+#define XFS_EXCHMAPS_CLEAR_INO2_REFLINK	(1ULL << 4)
+
+#define XFS_EXCHMAPS_LOGGED_FLAGS	(XFS_EXCHMAPS_ATTR_FORK | \
+					 XFS_EXCHMAPS_SET_SIZES | \
+					 XFS_EXCHMAPS_INO1_WRITTEN | \
+					 XFS_EXCHMAPS_CLEAR_INO1_REFLINK | \
+					 XFS_EXCHMAPS_CLEAR_INO2_REFLINK)
 
 /* This is the structure used to lay out an mapping exchange done log item. */
 struct xfs_xmd_log_format {
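
The two masks defined in this patch split the flag space: XFS_EXCHMAPS_PARAMS
covers what a caller may request, and XFS_EXCHMAPS_LOGGED_FLAGS covers
everything that may appear in a logged intent, including the two
CLEAR_INO*_REFLINK bits that the kernel sets on its own.  The sketch below
copies the bit values from the hunk above under a hypothetical EX_ prefix and
shows the two masking operations that log recovery performs; the helper names
are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit values copied from the hunk above; the EX_ prefix is hypothetical. */
#define EX_ATTR_FORK		(1ULL << 0)
#define EX_SET_SIZES		(1ULL << 1)
#define EX_INO1_WRITTEN		(1ULL << 2)
#define EX_CLEAR_INO1_REFLINK	(1ULL << 3)
#define EX_CLEAR_INO2_REFLINK	(1ULL << 4)

#define EX_PARAMS	(EX_ATTR_FORK | EX_SET_SIZES | EX_INO1_WRITTEN)
#define EX_LOGGED_FLAGS	(EX_PARAMS | EX_CLEAR_INO1_REFLINK | \
			 EX_CLEAR_INO2_REFLINK)

/* Reject a recovered intent that carries any bit outside the logged set. */
static bool
logged_flags_valid(uint64_t flags)
{
	return (flags & ~EX_LOGGED_FLAGS) == 0;
}

/* Strip the kernel-internal bits before rebuilding a request. */
static uint64_t
request_flags_from_log(uint64_t logged)
{
	return logged & EX_PARAMS;
}
```

This mirrors the check in xfs_xmi_validate and the masking in
xfs_xmi_item_recover_intent elsewhere in this patch.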
diff --git a/fs/xfs/libxfs/xfs_trans_space.h b/fs/xfs/libxfs/xfs_trans_space.h
index 87b31c69a7732..9640fc232c147 100644
--- a/fs/xfs/libxfs/xfs_trans_space.h
+++ b/fs/xfs/libxfs/xfs_trans_space.h
@@ -10,6 +10,10 @@
  * Components of space reservations.
  */
 
+/* Worst case number of bmaps that can be held in a block. */
+#define XFS_MAX_CONTIG_BMAPS_PER_BLOCK(mp)    \
+		(((mp)->m_bmap_dmxr[0]) - ((mp)->m_bmap_dmnr[0]))
+
 /* Worst case number of rmaps that can be held in a block. */
 #define XFS_MAX_CONTIG_RMAPS_PER_BLOCK(mp)    \
 		(((mp)->m_rmap_mxr[0]) - ((mp)->m_rmap_mnr[0]))
diff --git a/fs/xfs/xfs_exchmaps_item.c b/fs/xfs/xfs_exchmaps_item.c
index c36f1065914c6..2086d053bc1c4 100644
--- a/fs/xfs/xfs_exchmaps_item.c
+++ b/fs/xfs/xfs_exchmaps_item.c
@@ -16,13 +16,17 @@
 #include "xfs_trans.h"
 #include "xfs_trans_priv.h"
 #include "xfs_exchmaps_item.h"
+#include "xfs_exchmaps.h"
 #include "xfs_log.h"
 #include "xfs_bmap.h"
 #include "xfs_icache.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trans_space.h"
 #include "xfs_error.h"
 #include "xfs_log_priv.h"
 #include "xfs_log_recover.h"
+#include "xfs_exchrange.h"
+#include "xfs_trace.h"
 
 struct kmem_cache	*xfs_xmi_cache;
 struct kmem_cache	*xfs_xmd_cache;
@@ -144,6 +148,369 @@ static inline struct xfs_xmd_log_item *XMD_ITEM(struct xfs_log_item *lip)
 	return container_of(lip, struct xfs_xmd_log_item, xmd_item);
 }
 
+STATIC void
+xfs_xmd_item_size(
+	struct xfs_log_item	*lip,
+	int			*nvecs,
+	int			*nbytes)
+{
+	*nvecs += 1;
+	*nbytes += sizeof(struct xfs_xmd_log_format);
+}
+
+/*
+ * This is called to fill in the vector of log iovecs for the given xmd log
+ * item. We use only 1 iovec, and we point that at the xmd_log_format structure
+ * embedded in the xmd item.
+ */
+STATIC void
+xfs_xmd_item_format(
+	struct xfs_log_item	*lip,
+	struct xfs_log_vec	*lv)
+{
+	struct xfs_xmd_log_item	*xmd_lip = XMD_ITEM(lip);
+	struct xfs_log_iovec	*vecp = NULL;
+
+	xmd_lip->xmd_format.xmd_type = XFS_LI_XMD;
+	xmd_lip->xmd_format.xmd_size = 1;
+
+	xlog_copy_iovec(lv, &vecp, XLOG_REG_TYPE_XMD_FORMAT, &xmd_lip->xmd_format,
+			sizeof(struct xfs_xmd_log_format));
+}
+
+/*
+ * The XMD is either committed or aborted if the transaction is cancelled. If
+ * the transaction is cancelled, drop our reference to the XMI and free the
+ * XMD.
+ */
+STATIC void
+xfs_xmd_item_release(
+	struct xfs_log_item	*lip)
+{
+	struct xfs_xmd_log_item	*xmd_lip = XMD_ITEM(lip);
+
+	xfs_xmi_release(xmd_lip->xmd_intent_log_item);
+	kmem_free(xmd_lip->xmd_item.li_lv_shadow);
+	kmem_cache_free(xfs_xmd_cache, xmd_lip);
+}
+
+static struct xfs_log_item *
+xfs_xmd_item_intent(
+	struct xfs_log_item	*lip)
+{
+	return &XMD_ITEM(lip)->xmd_intent_log_item->xmi_item;
+}
+
+static const struct xfs_item_ops xfs_xmd_item_ops = {
+	.flags		= XFS_ITEM_RELEASE_WHEN_COMMITTED |
+			  XFS_ITEM_INTENT_DONE,
+	.iop_size	= xfs_xmd_item_size,
+	.iop_format	= xfs_xmd_item_format,
+	.iop_release	= xfs_xmd_item_release,
+	.iop_intent	= xfs_xmd_item_intent,
+};
+
+/* Log file mapping exchange information in the intent item. */
+STATIC struct xfs_log_item *
+xfs_exchmaps_create_intent(
+	struct xfs_trans		*tp,
+	struct list_head		*items,
+	unsigned int			count,
+	bool				sort)
+{
+	struct xfs_xmi_log_item		*xmi_lip;
+	struct xfs_exchmaps_intent	*xmi;
+	struct xfs_xmi_log_format	*xlf;
+
+	ASSERT(count == 1);
+
+	xmi = list_first_entry_or_null(items, struct xfs_exchmaps_intent,
+			xmi_list);
+
+	xmi_lip = xfs_xmi_init(tp->t_mountp);
+	xlf = &xmi_lip->xmi_format;
+
+	xlf->xmi_inode1 = xmi->xmi_ip1->i_ino;
+	xlf->xmi_inode2 = xmi->xmi_ip2->i_ino;
+	xlf->xmi_startoff1 = xmi->xmi_startoff1;
+	xlf->xmi_startoff2 = xmi->xmi_startoff2;
+	xlf->xmi_blockcount = xmi->xmi_blockcount;
+	xlf->xmi_isize1 = xmi->xmi_isize1;
+	xlf->xmi_isize2 = xmi->xmi_isize2;
+	xlf->xmi_flags = xmi->xmi_flags & XFS_EXCHMAPS_LOGGED_FLAGS;
+
+	return &xmi_lip->xmi_item;
+}
+
+STATIC struct xfs_log_item *
+xfs_exchmaps_create_done(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	unsigned int			count)
+{
+	struct xfs_xmi_log_item		*xmi_lip = XMI_ITEM(intent);
+	struct xfs_xmd_log_item		*xmd_lip;
+
+	xmd_lip = kmem_cache_zalloc(xfs_xmd_cache, GFP_KERNEL | __GFP_NOFAIL);
+	xfs_log_item_init(tp->t_mountp, &xmd_lip->xmd_item, XFS_LI_XMD,
+			  &xfs_xmd_item_ops);
+	xmd_lip->xmd_intent_log_item = xmi_lip;
+	xmd_lip->xmd_format.xmd_xmi_id = xmi_lip->xmi_format.xmi_id;
+
+	return &xmd_lip->xmd_item;
+}
+
+/* Add this deferred XMI to the transaction. */
+void
+xfs_exchmaps_defer_add(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
+
+	xfs_defer_add(tp, &xmi->xmi_list, &xfs_exchmaps_defer_type);
+}
+
+static inline struct xfs_exchmaps_intent *xmi_entry(const struct list_head *e)
+{
+	return list_entry(e, struct xfs_exchmaps_intent, xmi_list);
+}
+
+/* Cancel a deferred file mapping exchange. */
+STATIC void
+xfs_exchmaps_cancel_item(
+	struct list_head		*item)
+{
+	struct xfs_exchmaps_intent	*xmi = xmi_entry(item);
+
+	kmem_cache_free(xfs_exchmaps_intent_cache, xmi);
+}
+
+/* Process a deferred file mapping exchange. */
+STATIC int
+xfs_exchmaps_finish_item(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*done,
+	struct list_head		*item,
+	struct xfs_btree_cur		**state)
+{
+	struct xfs_exchmaps_intent	*xmi = xmi_entry(item);
+	int				error;
+
+	/*
+	 * Exchange one more mapping between two files.  If there's still more
+	 * work to do, we want to requeue ourselves after all other pending
+	 * deferred operations have finished.  This includes all of the dfops
+	 * that we queued directly as well as any new ones created in the
+	 * process of finishing the others.  Doing so prevents us from queuing
+	 * a large number of XMI log items in kernel memory, which in turn
+	 * prevents us from pinning the tail of the log (while logging those
+	 * new XMI items) until the first XMI items can be processed.
+	 */
+	error = xfs_exchmaps_finish_one(tp, xmi);
+	if (error != -EAGAIN)
+		xfs_exchmaps_cancel_item(item);
+	return error;
+}
+
+/* Abort all pending XMIs. */
+STATIC void
+xfs_exchmaps_abort_intent(
+	struct xfs_log_item		*intent)
+{
+	xfs_xmi_release(XMI_ITEM(intent));
+}
+
+/* Is this recovered XMI ok? */
+static inline bool
+xfs_xmi_validate(
+	struct xfs_mount		*mp,
+	struct xfs_xmi_log_item		*xmi_lip)
+{
+	struct xfs_xmi_log_format	*xlf = &xmi_lip->xmi_format;
+
+	if (!xfs_sb_version_haslogexchmaps(&mp->m_sb))
+		return false;
+
+	if (xmi_lip->xmi_format.__pad != 0)
+		return false;
+
+	if (xlf->xmi_flags & ~XFS_EXCHMAPS_LOGGED_FLAGS)
+		return false;
+
+	if (!xfs_verify_ino(mp, xlf->xmi_inode1) ||
+	    !xfs_verify_ino(mp, xlf->xmi_inode2))
+		return false;
+
+	if ((xlf->xmi_flags & XFS_EXCHMAPS_SET_SIZES) &&
+	     (xlf->xmi_isize1 < 0 || xlf->xmi_isize2 < 0))
+		return false;
+
+	if (!xfs_verify_fileext(mp, xlf->xmi_startoff1, xlf->xmi_blockcount))
+		return false;
+
+	return xfs_verify_fileext(mp, xlf->xmi_startoff2, xlf->xmi_blockcount);
+}
+
+/*
+ * Use the recovered log state to create a new request, estimate resource
+ * requirements, and create a new incore intent state.
+ */
+STATIC struct xfs_exchmaps_intent *
+xfs_xmi_item_recover_intent(
+	struct xfs_mount		*mp,
+	struct xfs_defer_pending	*dfp,
+	const struct xfs_xmi_log_format	*xlf,
+	struct xfs_exchmaps_req		*req,
+	struct xfs_inode		**ipp1,
+	struct xfs_inode		**ipp2)
+{
+	struct xfs_inode		*ip1, *ip2;
+	struct xfs_exchmaps_intent	*xmi;
+	int				error;
+
+	/*
+	 * Grab both inodes and set IRECOVERY to prevent trimming of post-eof
+	 * mappings and freeing of unlinked inodes until we're totally done
+	 * processing files.
+	 */
+	error = xlog_recover_iget(mp, xlf->xmi_inode1, &ip1);
+	if (error)
+		return ERR_PTR(error);
+	error = xlog_recover_iget(mp, xlf->xmi_inode2, &ip2);
+	if (error)
+		goto err_rele1;
+
+	req->ip1 = ip1;
+	req->ip2 = ip2;
+	req->startoff1 = xlf->xmi_startoff1;
+	req->startoff2 = xlf->xmi_startoff2;
+	req->blockcount = xlf->xmi_blockcount;
+	req->flags = xlf->xmi_flags & XFS_EXCHMAPS_PARAMS;
+
+	xfs_exchrange_ilock(NULL, ip1, ip2);
+	error = xfs_exchmaps_estimate(req);
+	xfs_exchrange_iunlock(ip1, ip2);
+	if (error)
+		goto err_rele2;
+
+	*ipp1 = ip1;
+	*ipp2 = ip2;
+	xmi = xfs_exchmaps_init_intent(req);
+	xfs_defer_add_item(dfp, &xmi->xmi_list);
+	return xmi;
+
+err_rele2:
+	xfs_irele(ip2);
+err_rele1:
+	xfs_irele(ip1);
+	req->ip2 = req->ip1 = NULL;
+	return ERR_PTR(error);
+}
+
+/* Process a file mapping exchange item that was recovered from the log. */
+STATIC int
+xfs_exchmaps_recover_work(
+	struct xfs_defer_pending	*dfp,
+	struct list_head		*capture_list)
+{
+	struct xfs_exchmaps_req		req = { .flags = 0 };
+	struct xfs_trans_res		resv;
+	struct xfs_exchmaps_intent	*xmi;
+	struct xfs_log_item		*lip = dfp->dfp_intent;
+	struct xfs_xmi_log_item		*xmi_lip = XMI_ITEM(lip);
+	struct xfs_mount		*mp = lip->li_log->l_mp;
+	struct xfs_trans		*tp;
+	struct xfs_inode		*ip1, *ip2;
+	int				error = 0;
+
+	if (!xfs_xmi_validate(mp, xmi_lip)) {
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				&xmi_lip->xmi_format,
+				sizeof(xmi_lip->xmi_format));
+		return -EFSCORRUPTED;
+	}
+
+	xmi = xfs_xmi_item_recover_intent(mp, dfp, &xmi_lip->xmi_format, &req,
+			&ip1, &ip2);
+	if (IS_ERR(xmi))
+		return PTR_ERR(xmi);
+
+	trace_xfs_exchmaps_recover(mp, xmi);
+
+	resv = xlog_recover_resv(&M_RES(mp)->tr_write);
+	error = xfs_trans_alloc(mp, &resv, req.resblks, 0, 0, &tp);
+	if (error)
+		goto err_rele;
+
+	xfs_exchrange_ilock(tp, ip1, ip2);
+
+	xfs_exchmaps_ensure_reflink(tp, xmi);
+	xfs_exchmaps_upgrade_extent_counts(tp, xmi);
+	error = xlog_recover_finish_intent(tp, dfp);
+	if (error == -EFSCORRUPTED)
+		XFS_CORRUPTION_ERROR(__func__, XFS_ERRLEVEL_LOW, mp,
+				&xmi_lip->xmi_format,
+				sizeof(xmi_lip->xmi_format));
+	if (error)
+		goto err_cancel;
+
+	/*
+	 * Commit transaction, which frees the transaction and saves the inodes
+	 * for later replay activities.
+	 */
+	error = xfs_defer_ops_capture_and_commit(tp, capture_list);
+	goto err_unlock;
+
+err_cancel:
+	xfs_trans_cancel(tp);
+err_unlock:
+	xfs_exchrange_iunlock(ip1, ip2);
+err_rele:
+	xfs_irele(ip2);
+	xfs_irele(ip1);
+	return error;
+}
+
+/* Relog an intent item to push the log tail forward. */
+static struct xfs_log_item *
+xfs_exchmaps_relog_intent(
+	struct xfs_trans		*tp,
+	struct xfs_log_item		*intent,
+	struct xfs_log_item		*done_item)
+{
+	struct xfs_xmi_log_item		*xmi_lip;
+	struct xfs_xmi_log_format	*old_xlf, *new_xlf;
+
+	old_xlf = &XMI_ITEM(intent)->xmi_format;
+
+	xmi_lip = xfs_xmi_init(tp->t_mountp);
+	new_xlf = &xmi_lip->xmi_format;
+
+	new_xlf->xmi_inode1	= old_xlf->xmi_inode1;
+	new_xlf->xmi_inode2	= old_xlf->xmi_inode2;
+	new_xlf->xmi_startoff1	= old_xlf->xmi_startoff1;
+	new_xlf->xmi_startoff2	= old_xlf->xmi_startoff2;
+	new_xlf->xmi_blockcount	= old_xlf->xmi_blockcount;
+	new_xlf->xmi_flags	= old_xlf->xmi_flags;
+	new_xlf->xmi_isize1	= old_xlf->xmi_isize1;
+	new_xlf->xmi_isize2	= old_xlf->xmi_isize2;
+
+	return &xmi_lip->xmi_item;
+}
+
+const struct xfs_defer_op_type xfs_exchmaps_defer_type = {
+	.name		= "exchmaps",
+	.max_items	= 1,
+	.create_intent	= xfs_exchmaps_create_intent,
+	.abort_intent	= xfs_exchmaps_abort_intent,
+	.create_done	= xfs_exchmaps_create_done,
+	.finish_item	= xfs_exchmaps_finish_item,
+	.cancel_item	= xfs_exchmaps_cancel_item,
+	.recover_work	= xfs_exchmaps_recover_work,
+	.relog_intent	= xfs_exchmaps_relog_intent,
+};
+
 STATIC bool
 xfs_xmi_item_match(
 	struct xfs_log_item	*lip,
@@ -194,8 +561,9 @@ xlog_recover_xmi_commit_pass2(
 	xmi_lip = xfs_xmi_init(mp);
 	memcpy(&xmi_lip->xmi_format, xmi_formatp, len);
 
-	/* not implemented yet */
-	return -EIO;
+	xlog_recover_intent_item(log, &xmi_lip->xmi_item, lsn,
+			&xfs_exchmaps_defer_type);
+	return 0;
 }
 
 const struct xlog_recover_item_ops xlog_xmi_item_ops = {
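
The -EAGAIN convention used by xfs_exchmaps_finish_item in the larger hunk
above may be easier to follow in isolation: the item is freed on any result
except -EAGAIN, which asks the deferred-ops machinery to requeue it.  The
following is a userspace sketch only.  struct work, the step counter, and
run_to_completion are all hypothetical, and the loop merely stands in for the
requeueing that xfs_defer performs between transaction rolls.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical work item; steps_left models the mappings still to exchange. */
struct work {
	unsigned int	steps_left;
};

/* Do one step of work; return -EAGAIN if more steps remain. */
static int
work_finish_one(struct work *w)
{
	if (w->steps_left > 0)
		w->steps_left--;
	return w->steps_left ? -EAGAIN : 0;
}

/* Same shape as the finish_item hook: free the item unless requeued. */
static int
work_finish_item(struct work *w)
{
	int	error;

	error = work_finish_one(w);
	if (error != -EAGAIN)
		free(w);
	return error;
}

/* Hypothetical driver standing in for the deferred-ops requeue loop. */
static int
run_to_completion(unsigned int steps, unsigned int *calls)
{
	struct work	*w = malloc(sizeof(*w));
	int		error;

	if (!w)
		return -ENOMEM;
	w->steps_left = steps;
	*calls = 0;

	do {
		error = work_finish_item(w);
		(*calls)++;
	} while (error == -EAGAIN);

	return error;
}
```

Because ownership of the item stays with the caller only while -EAGAIN is
returned, the item is never touched after a final success or failure.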
diff --git a/fs/xfs/xfs_exchmaps_item.h b/fs/xfs/xfs_exchmaps_item.h
index ada1eb314e658..efa368d25d09c 100644
--- a/fs/xfs/xfs_exchmaps_item.h
+++ b/fs/xfs/xfs_exchmaps_item.h
@@ -56,4 +56,9 @@ struct xfs_xmd_log_item {
 extern struct kmem_cache	*xfs_xmi_cache;
 extern struct kmem_cache	*xfs_xmd_cache;
 
+struct xfs_exchmaps_intent;
+
+void xfs_exchmaps_defer_add(struct xfs_trans *tp,
+		struct xfs_exchmaps_intent *xmi);
+
 #endif	/* __XFS_EXCHMAPS_ITEM_H__ */
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index 6ee181e9229a8..431adcd3e6722 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -13,6 +13,7 @@
 #include "xfs_inode.h"
 #include "xfs_trans.h"
 #include "xfs_exchrange.h"
+#include "xfs_exchmaps.h"
 #include <linux/fsnotify.h>
 
 /*
@@ -46,6 +47,54 @@ xfs_exchrange_possible(
 	       xfs_can_add_incompat_log_features(mp, false);
 }
 
+/* Lock (and optionally join) two inodes for a file range exchange. */
+void
+xfs_exchrange_ilock(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip1 != ip2)
+		xfs_lock_two_inodes(ip1, XFS_ILOCK_EXCL,
+				    ip2, XFS_ILOCK_EXCL);
+	else
+		xfs_ilock(ip1, XFS_ILOCK_EXCL);
+	if (tp) {
+		xfs_trans_ijoin(tp, ip1, 0);
+		if (ip2 != ip1)
+			xfs_trans_ijoin(tp, ip2, 0);
+	}
+
+}
+
+/* Unlock two inodes after a file range exchange operation. */
+void
+xfs_exchrange_iunlock(
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	if (ip2 != ip1)
+		xfs_iunlock(ip2, XFS_ILOCK_EXCL);
+	xfs_iunlock(ip1, XFS_ILOCK_EXCL);
+}
+
+/*
+ * Estimate the resource requirements to exchange file contents between the two
+ * files.  The caller is required to hold the IOLOCK and the MMAPLOCK and to
+ * have flushed both inodes' pagecache and active direct-ios.
+ */
+int
+xfs_exchrange_estimate(
+	struct xfs_exchmaps_req	*req)
+{
+	int			error;
+
+	xfs_exchrange_ilock(NULL, req->ip1, req->ip2);
+	error = xfs_exchmaps_estimate(req);
+	xfs_exchrange_iunlock(req->ip1, req->ip2);
+	return error;
+}
+
 /*
  * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
  * This part deals with struct file objects and byte ranges and does not deal
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index a008b42736716..eeec4b40b9fbe 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -37,4 +37,14 @@ struct xfs_exchrange {
 
 int xfs_exchange_range(struct xfs_exchrange *fxr);
 
+/* XFS-specific parts of file exchanges */
+
+struct xfs_exchmaps_req;
+
+void xfs_exchrange_ilock(struct xfs_trans *tp, struct xfs_inode *ip1,
+		struct xfs_inode *ip2);
+void xfs_exchrange_iunlock(struct xfs_inode *ip1, struct xfs_inode *ip2);
+
+int xfs_exchrange_estimate(struct xfs_exchmaps_req *req);
+
 #endif /* __XFS_EXCHRANGE_H__ */
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 1a963382e5e9e..9f38e69f1ce40 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -39,6 +39,7 @@
 #include "xfs_buf_mem.h"
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
+#include "xfs_exchmaps.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 8652881a2151a..0a56397a92373 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -82,6 +82,8 @@ struct xfs_perag;
 struct xfbtree;
 struct xfs_btree_ops;
 struct xfs_bmap_intent;
+struct xfs_exchmaps_intent;
+struct xfs_exchmaps_req;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -4790,6 +4792,221 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_alloc_block);
 DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
 #endif /* CONFIG_XFS_BTREE_IN_MEM */
 
+/* exchmaps tracepoints */
+#define XFS_EXCHMAPS_STRINGS \
+	{ XFS_EXCHMAPS_ATTR_FORK,		"ATTRFORK" }, \
+	{ XFS_EXCHMAPS_SET_SIZES,		"SETSIZES" }, \
+	{ XFS_EXCHMAPS_INO1_WRITTEN,		"INO1_WRITTEN" }, \
+	{ XFS_EXCHMAPS_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
+	{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
+
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip);
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
+DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
+DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
+
+TRACE_EVENT(xfs_exchmaps_overhead,
+	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
+		 unsigned long long rmapbt_blocks),
+	TP_ARGS(mp, bmbt_blocks, rmapbt_blocks),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(unsigned long long, bmbt_blocks)
+		__field(unsigned long long, rmapbt_blocks)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->bmbt_blocks = bmbt_blocks;
+		__entry->rmapbt_blocks = rmapbt_blocks;
+	),
+	TP_printk("dev %d:%d bmbt_blocks 0x%llx rmapbt_blocks 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->bmbt_blocks,
+		  __entry->rmapbt_blocks)
+);
+
+DECLARE_EVENT_CLASS(xfs_exchmaps_estimate_class,
+	TP_PROTO(const struct xfs_exchmaps_req *req),
+	TP_ARGS(req),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(uint64_t, flags)
+		__field(xfs_filblks_t, ip1_bcount)
+		__field(xfs_filblks_t, ip2_bcount)
+		__field(xfs_filblks_t, ip1_rtbcount)
+		__field(xfs_filblks_t, ip2_rtbcount)
+		__field(unsigned long long, resblks)
+		__field(unsigned long long, nr_exchanges)
+	),
+	TP_fast_assign(
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->startoff1 = req->startoff1;
+		__entry->startoff2 = req->startoff2;
+		__entry->blockcount = req->blockcount;
+		__entry->flags = req->flags;
+		__entry->ip1_bcount = req->ip1_bcount;
+		__entry->ip2_bcount = req->ip2_bcount;
+		__entry->ip1_rtbcount = req->ip1_rtbcount;
+		__entry->ip2_rtbcount = req->ip2_rtbcount;
+		__entry->resblks = req->resblks;
+		__entry->nr_exchanges = req->nr_exchanges;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) bcount1 0x%llx rtbcount1 0x%llx bcount2 0x%llx rtbcount2 0x%llx resblks 0x%llx nr_exchanges %llu",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
+		  __entry->ip1_bcount,
+		  __entry->ip1_rtbcount,
+		  __entry->ip2_bcount,
+		  __entry->ip2_rtbcount,
+		  __entry->resblks,
+		  __entry->nr_exchanges)
+);
+
+#define DEFINE_EXCHMAPS_ESTIMATE_EVENT(name)	\
+DEFINE_EVENT(xfs_exchmaps_estimate_class, name,	\
+	TP_PROTO(const struct xfs_exchmaps_req *req), \
+	TP_ARGS(req))
+DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_initial_estimate);
+DEFINE_EXCHMAPS_ESTIMATE_EVENT(xfs_exchmaps_final_estimate);
+
+DECLARE_EVENT_CLASS(xfs_exchmaps_intent_class,
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi),
+	TP_ARGS(mp, xmi),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(uint64_t, flags)
+		__field(xfs_fileoff_t, startoff1)
+		__field(xfs_fileoff_t, startoff2)
+		__field(xfs_filblks_t, blockcount)
+		__field(xfs_fsize_t, isize1)
+		__field(xfs_fsize_t, isize2)
+		__field(xfs_fsize_t, new_isize1)
+		__field(xfs_fsize_t, new_isize2)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->ino1 = xmi->xmi_ip1->i_ino;
+		__entry->ino2 = xmi->xmi_ip2->i_ino;
+		__entry->flags = xmi->xmi_flags;
+		__entry->startoff1 = xmi->xmi_startoff1;
+		__entry->startoff2 = xmi->xmi_startoff2;
+		__entry->blockcount = xmi->xmi_blockcount;
+		__entry->isize1 = xmi->xmi_ip1->i_disk_size;
+		__entry->isize2 = xmi->xmi_ip2->i_disk_size;
+		__entry->new_isize1 = xmi->xmi_isize1;
+		__entry->new_isize2 = xmi->xmi_isize2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx fileoff1 0x%llx ino2 0x%llx fileoff2 0x%llx fsbcount 0x%llx flags (%s) isize1 0x%llx newisize1 0x%llx isize2 0x%llx newisize2 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->startoff1,
+		  __entry->ino2, __entry->startoff2,
+		  __entry->blockcount,
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHMAPS_STRINGS),
+		  __entry->isize1, __entry->new_isize1,
+		  __entry->isize2, __entry->new_isize2)
+);
+
+#define DEFINE_EXCHMAPS_INTENT_EVENT(name)	\
+DEFINE_EVENT(xfs_exchmaps_intent_class, name,	\
+	TP_PROTO(struct xfs_mount *mp, const struct xfs_exchmaps_intent *xmi), \
+	TP_ARGS(mp, xmi))
+DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_defer);
+DEFINE_EXCHMAPS_INTENT_EVENT(xfs_exchmaps_recover);
+
+TRACE_EVENT(xfs_exchmaps_delta_nextents_step,
+	TP_PROTO(struct xfs_mount *mp,
+		 const struct xfs_bmbt_irec *left,
+		 const struct xfs_bmbt_irec *curr,
+		 const struct xfs_bmbt_irec *new,
+		 const struct xfs_bmbt_irec *right,
+		 int delta, unsigned int state),
+	TP_ARGS(mp, left, curr, new, right, delta, state),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_fileoff_t, loff)
+		__field(xfs_fsblock_t, lstart)
+		__field(xfs_filblks_t, lcount)
+		__field(xfs_fileoff_t, coff)
+		__field(xfs_fsblock_t, cstart)
+		__field(xfs_filblks_t, ccount)
+		__field(xfs_fileoff_t, noff)
+		__field(xfs_fsblock_t, nstart)
+		__field(xfs_filblks_t, ncount)
+		__field(xfs_fileoff_t, roff)
+		__field(xfs_fsblock_t, rstart)
+		__field(xfs_filblks_t, rcount)
+		__field(int, delta)
+		__field(unsigned int, state)
+	),
+	TP_fast_assign(
+		__entry->dev = mp->m_super->s_dev;
+		__entry->loff = left->br_startoff;
+		__entry->lstart = left->br_startblock;
+		__entry->lcount = left->br_blockcount;
+		__entry->coff = curr->br_startoff;
+		__entry->cstart = curr->br_startblock;
+		__entry->ccount = curr->br_blockcount;
+		__entry->noff = new->br_startoff;
+		__entry->nstart = new->br_startblock;
+		__entry->ncount = new->br_blockcount;
+		__entry->roff = right->br_startoff;
+		__entry->rstart = right->br_startblock;
+		__entry->rcount = right->br_blockcount;
+		__entry->delta = delta;
+		__entry->state = state;
+	),
+	TP_printk("dev %d:%d left 0x%llx:0x%llx:0x%llx; curr 0x%llx:0x%llx:0x%llx <- new 0x%llx:0x%llx:0x%llx; right 0x%llx:0x%llx:0x%llx delta %d state 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->loff, __entry->lstart, __entry->lcount,
+		  __entry->coff, __entry->cstart, __entry->ccount,
+		  __entry->noff, __entry->nstart, __entry->ncount,
+		  __entry->roff, __entry->rstart, __entry->rcount,
+		  __entry->delta, __entry->state)
+);
+
+TRACE_EVENT(xfs_exchmaps_delta_nextents,
+	TP_PROTO(const struct xfs_exchmaps_req *req, int64_t d_nexts1,
+		 int64_t d_nexts2),
+	TP_ARGS(req, d_nexts1, d_nexts2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ino1)
+		__field(xfs_ino_t, ino2)
+		__field(xfs_extnum_t, nexts1)
+		__field(xfs_extnum_t, nexts2)
+		__field(int64_t, d_nexts1)
+		__field(int64_t, d_nexts2)
+	),
+	TP_fast_assign(
+		int whichfork = xfs_exchmaps_reqfork(req);
+
+		__entry->dev = req->ip1->i_mount->m_super->s_dev;
+		__entry->ino1 = req->ip1->i_ino;
+		__entry->ino2 = req->ip2->i_ino;
+		__entry->nexts1 = xfs_ifork_ptr(req->ip1, whichfork)->if_nextents;
+		__entry->nexts2 = xfs_ifork_ptr(req->ip2, whichfork)->if_nextents;
+		__entry->d_nexts1 = d_nexts1;
+		__entry->d_nexts2 = d_nexts2;
+	),
+	TP_printk("dev %d:%d ino1 0x%llx nexts %llu ino2 0x%llx nexts %llu delta1 %lld delta2 %lld",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino1, __entry->nexts1,
+		  __entry->ino2, __entry->nexts2,
+		  __entry->d_nexts1, __entry->d_nexts2)
+);
+
 #endif /* _TRACE_XFS_H */
 
 #undef TRACE_INCLUDE_PATH



* [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (4 preceding siblings ...)
  2024-02-27  2:22 ` [PATCH 05/14] xfs: create deferred log items for file mapping exchanges Darrick J. Wong
@ 2024-02-27  2:22 ` Darrick J. Wong
  2024-02-28 15:49   ` Christoph Hellwig
  2024-02-27  2:22 ` [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong
                   ` (10 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:22 UTC
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

So far, we've constructed the front end of the file range exchange code,
which does all the checking, and the back end of the file mapping
exchange code, which actually does the work.  Glue these two pieces
together so that we can turn on the functionality.
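
As a rough illustration of the glue described above, the sketch below
models the order of operations and the error unwinding only.  The struct
and function names are hypothetical stand-ins, not the kernel code; each
simulated step returns 0 or a negative errno chosen by the caller.

```c
#include <stdbool.h>

/* Hypothetical test harness: per-step results plus bookkeeping. */
struct exch_steps {
	int	lock_err;	/* result of locking both files against IO */
	int	enable_err;	/* result of enabling the log feature */
	int	prep_err;	/* result of the front-end checks */
	int	map_err;	/* result of the back-end mapping exchange */
	int	finish_err;	/* result of removing file privileges */
	bool	locked;		/* true while the simulated locks are held */
	int	steps_run;	/* number of steps attempted */
};

static int run_step(struct exch_steps *s, int err)
{
	s->steps_run++;
	return err;
}

/* Lock, enable, prep, exchange, finish; always unlock once locked. */
int exchange_contents_sketch(struct exch_steps *s)
{
	int error;

	/* If locking fails there is nothing to undo. */
	error = run_step(s, s->lock_err);
	if (error)
		return error;
	s->locked = true;

	error = run_step(s, s->enable_err);
	if (error)
		goto out_unlock;

	error = run_step(s, s->prep_err);
	if (error)
		goto out_unlock;

	error = run_step(s, s->map_err);
	if (error)
		goto out_unlock;

	error = run_step(s, s->finish_err);

out_unlock:
	s->locked = false;
	return error;
}
```

With a zeroed struct all five steps run and the call returns 0; setting
prep_err to a negative value stops after the third step and still drops
the locks.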

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_exchrange.c |  396 ++++++++++++++++++++++++++++++++++++++++++++++++
 fs/xfs/xfs_exchrange.h |    1 
 fs/xfs/xfs_mount.h     |    5 -
 fs/xfs/xfs_trace.c     |    1 
 fs/xfs/xfs_trace.h     |  164 ++++++++++++++++++++
 5 files changed, 565 insertions(+), 2 deletions(-)
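
The first pass of xfs_exchrange_reserve_quota() in the diff below
reserves only the net gain in mapped blocks for whichever file grows.
A minimal sketch of that arithmetic for the data-device counts follows;
the struct and function names are hypothetical, and it assumes that
ip1_bcount and ip2_bcount are the blocks each file gives up to the other.

```c
#include <stdint.h>

/* Hypothetical result type: blocks of quota each file must be able to absorb. */
struct quota_need {
	int64_t	ip1_blocks;
	int64_t	ip2_blocks;
};

/*
 * File 1 receives ip2_bcount blocks and gives up ip1_bcount blocks, so
 * its net gain is the difference.  Only the file that grows needs a
 * reservation; the other file's usage shrinks by the same amount.
 */
struct quota_need net_quota_need(int64_t ip1_bcount, int64_t ip2_bcount)
{
	struct quota_need need = { 0, 0 };
	int64_t delta = ip2_bcount - ip1_bcount;

	if (delta > 0)
		need.ip1_blocks = delta;
	else if (delta < 0)
		need.ip2_blocks = -delta;
	return need;
}
```

The patch applies the same rule separately to the realtime block counts,
and then makes a second, forced reservation of the gross counts.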


diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index 431adcd3e6722..741c7ae7f7895 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -12,8 +12,15 @@
 #include "xfs_defer.h"
 #include "xfs_inode.h"
 #include "xfs_trans.h"
+#include "xfs_quota.h"
+#include "xfs_bmap_util.h"
+#include "xfs_reflink.h"
+#include "xfs_trace.h"
 #include "xfs_exchrange.h"
 #include "xfs_exchmaps.h"
+#include "xfs_sb.h"
+#include "xfs_icache.h"
+#include "xfs_log.h"
 #include <linux/fsnotify.h>
 
 /*
@@ -47,6 +54,34 @@ xfs_exchrange_possible(
 	       xfs_can_add_incompat_log_features(mp, false);
 }
 
+/*
+ * Get permission to use log-assisted atomic exchange of file mappings.
+ * Callers must not be running any transactions or hold any ILOCKs.
+ */
+int
+xfs_exchrange_enable(
+	struct xfs_mount	*mp)
+{
+	int			error = 0;
+
+	/* Mapping exchange log intent items are already enabled */
+	if (xfs_sb_version_haslogexchmaps(&mp->m_sb))
+		return 0;
+
+	if (!xfs_exchrange_upgradeable(mp))
+		return -EOPNOTSUPP;
+
+	error = xfs_add_incompat_log_feature(mp,
+			XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS);
+	if (error)
+		return error;
+
+	xfs_warn_mount(mp, XFS_OPSTATE_WARNED_EXCHMAPS,
+ "EXPERIMENTAL atomic file range exchange feature in use. Use at your own risk!");
+
+	return 0;
+}
+
 /* Lock (and optionally join) two inodes for a file range exchange. */
 void
 xfs_exchrange_ilock(
@@ -95,6 +130,234 @@ xfs_exchrange_estimate(
 	return error;
 }
 
+/*
+ * Check that file2's metadata agree with the snapshot that we took for the
+ * range commit request.
+ *
+ * This should be called after the filesystem has locked /all/ inode metadata
+ * against modification.
+ */
+STATIC int
+xfs_exchrange_check_freshness(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip2)
+{
+	struct inode			*inode2 = VFS_I(ip2);
+	struct timespec64		ctime = inode_get_ctime(inode2);
+	struct timespec64		mtime = inode_get_mtime(inode2);
+
+	trace_xfs_exchrange_freshness(fxr, ip2);
+
+	/* Check that file2 hasn't otherwise been modified. */
+	if (fxr->file2_ino != ip2->i_ino ||
+	    !timespec64_equal(&fxr->file2_ctime, &ctime) ||
+	    !timespec64_equal(&fxr->file2_mtime, &mtime))
+		return -EBUSY;
+
+	return 0;
+}
+
+#define QRETRY_IP1	(0x1)
+#define QRETRY_IP2	(0x2)
+
+/*
+ * Obtain a quota reservation to make sure we don't hit EDQUOT.  We can skip
+ * this if quota enforcement is disabled or if both inodes' dquots are the
+ * same.  The qretry structure must be initialized to zeroes before the first
+ * call to this function.
+ */
+STATIC int
+xfs_exchrange_reserve_quota(
+	struct xfs_trans		*tp,
+	const struct xfs_exchmaps_req	*req,
+	unsigned int			*qretry)
+{
+	int64_t				ddelta, rdelta;
+	int				ip1_error = 0;
+	int				error;
+
+	/*
+	 * Don't bother with a quota reservation if we're not enforcing them
+	 * or the two inodes have the same dquots.
+	 */
+	if (!XFS_IS_QUOTA_ON(tp->t_mountp) || req->ip1 == req->ip2 ||
+	    (req->ip1->i_udquot == req->ip2->i_udquot &&
+	     req->ip1->i_gdquot == req->ip2->i_gdquot &&
+	     req->ip1->i_pdquot == req->ip2->i_pdquot))
+		return 0;
+
+	*qretry = 0;
+
+	/*
+	 * For each file, compute the net gain in the number of regular blocks
+	 * that will be mapped into that file and reserve that much quota.  The
+	 * quota counts must be able to absorb at least that much space.
+	 */
+	ddelta = req->ip2_bcount - req->ip1_bcount;
+	rdelta = req->ip2_rtbcount - req->ip1_rtbcount;
+	if (ddelta > 0 || rdelta > 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip1,
+				ddelta > 0 ? ddelta : 0,
+				rdelta > 0 ? rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC) {
+			/*
+			 * Save this error and see what happens if we try to
+			 * reserve quota for ip2.  Then report both.
+			 */
+			*qretry |= QRETRY_IP1;
+			ip1_error = error;
+			error = 0;
+		}
+		if (error)
+			return error;
+	}
+	if (ddelta < 0 || rdelta < 0) {
+		error = xfs_trans_reserve_quota_nblks(tp, req->ip2,
+				ddelta < 0 ? -ddelta : 0,
+				rdelta < 0 ? -rdelta : 0,
+				false);
+		if (error == -EDQUOT || error == -ENOSPC)
+			*qretry |= QRETRY_IP2;
+		if (error)
+			return error;
+	}
+	if (ip1_error)
+		return ip1_error;
+
+	/*
+	 * For each file, forcibly reserve the gross gain in mapped blocks so
+	 * that we don't trip over any quota block reservation assertions.
+	 * We must reserve the gross gain because the quota code subtracts from
+	 * bcount the number of blocks that we unmap; it does not add that
+	 * quantity back to the quota block reservation.
+	 */
+	error = xfs_trans_reserve_quota_nblks(tp, req->ip1, req->ip1_bcount,
+			req->ip1_rtbcount, true);
+	if (error)
+		return error;
+
+	return xfs_trans_reserve_quota_nblks(tp, req->ip2, req->ip2_bcount,
+			req->ip2_rtbcount, true);
+}
+
+/* Exchange the mappings (and hence the contents) of two files' forks. */
+STATIC int
+xfs_exchrange_mappings(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2)
+{
+	struct xfs_mount		*mp = ip1->i_mount;
+	struct xfs_exchmaps_req		req = {
+		.ip1			= ip1,
+		.ip2			= ip2,
+		.startoff1		= XFS_B_TO_FSBT(mp, fxr->file1_offset),
+		.startoff2		= XFS_B_TO_FSBT(mp, fxr->file2_offset),
+		.blockcount		= XFS_B_TO_FSB(mp, fxr->length),
+	};
+	struct xfs_trans		*tp;
+	unsigned int			qretry;
+	bool				retried = false;
+	int				error;
+
+	trace_xfs_exchrange_mappings(fxr, ip1, ip2);
+
+	if (fxr->flags & XFS_EXCHRANGE_TO_EOF)
+		req.flags |= XFS_EXCHMAPS_SET_SIZES;
+	if (fxr->flags & XFS_EXCHRANGE_FILE1_WRITTEN)
+		req.flags |= XFS_EXCHMAPS_INO1_WRITTEN;
+
+	error = xfs_exchrange_estimate(&req);
+	if (error)
+		return error;
+
+retry:
+	/* Allocate the transaction, lock the inodes, and join them. */
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_write, req.resblks, 0,
+			XFS_TRANS_RES_FDBLKS, &tp);
+	if (error)
+		return error;
+
+	xfs_exchrange_ilock(tp, ip1, ip2);
+
+	trace_xfs_exchrange_before(ip2, 2);
+	trace_xfs_exchrange_before(ip1, 1);
+
+	error = xfs_exchmaps_check_forks(mp, &req);
+	if (error)
+		goto out_trans_cancel;
+
+	/*
+	 * Reserve ourselves some quota if any of them are in enforcing mode.
+	 * In theory we only need enough to satisfy the change in the number
+	 * of blocks between the two ranges being remapped.
+	 */
+	error = xfs_exchrange_reserve_quota(tp, &req, &qretry);
+	if ((error == -EDQUOT || error == -ENOSPC) && !retried) {
+		xfs_trans_cancel(tp);
+		xfs_exchrange_iunlock(ip1, ip2);
+		if (qretry & QRETRY_IP1)
+			xfs_blockgc_free_quota(ip1, 0);
+		if (qretry & QRETRY_IP2)
+			xfs_blockgc_free_quota(ip2, 0);
+		retried = true;
+		goto retry;
+	}
+	if (error)
+		goto out_trans_cancel;
+
+	/* If we got this far on a dry run, all parameters are ok. */
+	if (fxr->flags & XFS_EXCHRANGE_DRY_RUN)
+		goto out_trans_cancel;
+
+	/* Update the mtime and ctime of both files. */
+	if (fxr->flags & __XFS_EXCHRANGE_UPD_CMTIME1)
+		xfs_trans_ichgtime(tp, ip1, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+	if (fxr->flags & __XFS_EXCHRANGE_UPD_CMTIME2)
+		xfs_trans_ichgtime(tp, ip2, XFS_ICHGTIME_MOD | XFS_ICHGTIME_CHG);
+
+	xfs_exchange_mappings(tp, &req);
+
+	/*
+	 * Force the log to persist metadata updates if the caller or the
+	 * administrator requires this.  The generic prep function already
+	 * flushed the relevant parts of the page cache.
+	 */
+	if (xfs_has_wsync(mp) || (fxr->flags & XFS_EXCHRANGE_DSYNC))
+		xfs_trans_set_sync(tp);
+
+	error = xfs_trans_commit(tp);
+
+	trace_xfs_exchrange_after(ip2, 2);
+	trace_xfs_exchrange_after(ip1, 1);
+
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * If the caller wanted us to exchange the contents of two complete
+	 * files of unequal length, exchange the incore sizes now.  This should
+	 * be safe because we flushed both files' page caches, exchanged all
+	 * the mappings, and updated the ondisk sizes.
+	 */
+	if (fxr->flags & XFS_EXCHRANGE_TO_EOF) {
+		loff_t	temp;
+
+		temp = i_size_read(VFS_I(ip2));
+		i_size_write(VFS_I(ip2), i_size_read(VFS_I(ip1)));
+		i_size_write(VFS_I(ip1), temp);
+	}
+
+out_unlock:
+	xfs_exchrange_iunlock(ip1, ip2);
+	return error;
+
+out_trans_cancel:
+	xfs_trans_cancel(tp);
+	goto out_unlock;
+}
+
 /*
  * Generic code for exchanging ranges of two files via XFS_IOC_EXCHANGE_RANGE.
  * This part deals with struct file objects and byte ranges and does not deal
@@ -318,6 +581,137 @@ xfs_generic_exchrange_finish(
 	return file_remove_privs(fxr->file2);
 }
 
+/* Prepare two files to have their data exchanged. */
+STATIC int
+xfs_exchrange_prep(
+	struct xfs_exchrange	*fxr,
+	struct xfs_inode	*ip1,
+	struct xfs_inode	*ip2)
+{
+	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip2);
+	int			error;
+
+	trace_xfs_exchrange_prep(fxr, ip1, ip2);
+
+	/* Verify both files are either realtime or non-realtime */
+	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
+		return -EINVAL;
+
+	/*
+	 * The alignment checks in the generic helpers cannot deal with
+	 * allocation units that are not powers of 2.  This can happen with the
+	 * realtime volume if the extent size is set.
+	 */
+	if (!is_power_of_2(alloc_unit))
+		return -EOPNOTSUPP;
+
+	error = xfs_generic_exchrange_prep(fxr, alloc_unit);
+	if (error || fxr->length == 0)
+		return error;
+
+	if (fxr->flags & __XFS_EXCHRANGE_CHECK_FRESH2) {
+		error = xfs_exchrange_check_freshness(fxr, ip2);
+		if (error)
+			return error;
+	}
+
+	/* Attach dquots to both inodes before changing block maps. */
+	error = xfs_qm_dqattach(ip2);
+	if (error)
+		return error;
+	error = xfs_qm_dqattach(ip1);
+	if (error)
+		return error;
+
+	trace_xfs_exchrange_flush(fxr, ip1, ip2);
+
+	/* Flush the relevant ranges of both files. */
+	error = xfs_flush_unmap_range(ip2, fxr->file2_offset, fxr->length);
+	if (error)
+		return error;
+	error = xfs_flush_unmap_range(ip1, fxr->file1_offset, fxr->length);
+	if (error)
+		return error;
+
+	/*
+	 * Cancel CoW fork preallocations for the ranges of both files.  The
+	 * prep function should have flushed all the dirty data, so the only
+	 * CoW mappings remaining should be speculative.
+	 */
+	if (xfs_inode_has_cow_data(ip1)) {
+		error = xfs_reflink_cancel_cow_range(ip1, fxr->file1_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	if (xfs_inode_has_cow_data(ip2)) {
+		error = xfs_reflink_cancel_cow_range(ip2, fxr->file2_offset,
+				fxr->length, true);
+		if (error)
+			return error;
+	}
+
+	return 0;
+}
+
+/*
+ * Exchange contents of files.  This is the binding between the generic
+ * file-level concepts and the XFS inode-specific implementation.
+ */
+STATIC int
+xfs_exchrange_contents(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode1 = file_inode(fxr->file1);
+	struct inode		*inode2 = file_inode(fxr->file2);
+	struct xfs_inode	*ip1 = XFS_I(inode1);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+	struct xfs_mount	*mp = ip1->i_mount;
+	int			error;
+
+	if (fxr->flags & ~(XFS_EXCHRANGE_ALL_FLAGS | XFS_EXCHRANGE_PRIVATE_FLAGS))
+		return -EINVAL;
+
+	if (xfs_is_shutdown(mp))
+		return -EIO;
+
+	/* Lock both files against IO */
+	error = xfs_ilock2_io_mmap(ip1, ip2);
+	if (error)
+		goto out_err;
+
+	/* Get permission to use log-assisted file mapping exchanges. */
+	error = xfs_exchrange_enable(mp);
+	if (error)
+		goto out_unlock;
+
+	/* Prepare and then exchange file contents. */
+	error = xfs_exchrange_prep(fxr, ip1, ip2);
+	if (error)
+		goto out_unlock;
+
+	error = xfs_exchrange_mappings(fxr, ip1, ip2);
+	if (error)
+		goto out_unlock;
+
+	/*
+	 * Finish the exchange by removing special file privileges like any
+	 * other file write would do.  This may involve turning on support for
+	 * logged xattrs if either file has security capabilities.
+	 */
+	error = xfs_generic_exchrange_finish(fxr);
+	if (error)
+		goto out_unlock;
+
+out_unlock:
+	xfs_iunlock2_io_mmap(ip1, ip2);
+out_err:
+	if (error)
+		trace_xfs_exchrange_error(ip2, error, _RET_IP_);
+	return error;
+}
+
 /* Exchange parts of two files. */
 int
 xfs_exchange_range(
@@ -375,7 +769,7 @@ xfs_exchange_range(
 		fxr->flags |= __XFS_EXCHRANGE_UPD_CMTIME2;
 
 	file_start_write(fxr->file2);
-	ret = -EOPNOTSUPP; /* XXX call out to xfs code */
+	ret = xfs_exchrange_contents(fxr);
 	file_end_write(fxr->file2);
 	if (ret)
 		return ret;
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index eeec4b40b9fbe..2dd9ab7d76828 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -7,6 +7,7 @@
 #define __XFS_EXCHRANGE_H__
 
 bool xfs_exchrange_possible(struct xfs_mount *mp);
+int xfs_exchrange_enable(struct xfs_mount *mp);
 
 /* Update the mtime/cmtime of file1 and file2 */
 #define __XFS_EXCHRANGE_UPD_CMTIME1	(1ULL << 63)
diff --git a/fs/xfs/xfs_mount.h b/fs/xfs/xfs_mount.h
index 654d282234b1e..a567a1ac24134 100644
--- a/fs/xfs/xfs_mount.h
+++ b/fs/xfs/xfs_mount.h
@@ -417,6 +417,8 @@ __XFS_HAS_FEAT(nouuid, NOUUID)
 #define XFS_OPSTATE_QUOTACHECK_RUNNING	10
 /* Do we want to clear log incompat flags? */
 #define XFS_OPSTATE_UNSET_LOG_INCOMPAT	11
+/* Kernel has logged a warning about logged file mapping exchanges being used. */
+#define XFS_OPSTATE_WARNED_EXCHMAPS	12
 
 #define __XFS_IS_OPSTATE(name, NAME) \
 static inline bool xfs_is_ ## name (struct xfs_mount *mp) \
@@ -464,7 +466,8 @@ xfs_should_warn(struct xfs_mount *mp, long nr)
 	{ (1UL << XFS_OPSTATE_WARNED_SHRINK),		"wshrink" }, \
 	{ (1UL << XFS_OPSTATE_WARNED_LARP),		"wlarp" }, \
 	{ (1UL << XFS_OPSTATE_QUOTACHECK_RUNNING),	"quotacheck" }, \
-	{ (1UL << XFS_OPSTATE_UNSET_LOG_INCOMPAT),	"unset_log_incompat" }
+	{ (1UL << XFS_OPSTATE_UNSET_LOG_INCOMPAT),	"unset_log_incompat" }, \
+	{ (1UL << XFS_OPSTATE_WARNED_EXCHMAPS),		"wexchmaps" }
 
 /*
  * Max and min values for mount-option defined I/O
diff --git a/fs/xfs/xfs_trace.c b/fs/xfs/xfs_trace.c
index 9f38e69f1ce40..cf92a3bd56c79 100644
--- a/fs/xfs/xfs_trace.c
+++ b/fs/xfs/xfs_trace.c
@@ -40,6 +40,7 @@
 #include "xfs_btree_mem.h"
 #include "xfs_bmap.h"
 #include "xfs_exchmaps.h"
+#include "xfs_exchrange.h"
 
 /*
  * We include this last to have the helpers above available for the trace
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 0a56397a92373..3929022b03cfe 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -84,6 +84,7 @@ struct xfs_btree_ops;
 struct xfs_bmap_intent;
 struct xfs_exchmaps_intent;
 struct xfs_exchmaps_req;
+struct xfs_exchrange;
 
 #define XFS_ATTR_FILTER_FLAGS \
 	{ XFS_ATTR_ROOT,	"ROOT" }, \
@@ -4805,6 +4806,169 @@ DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);
 DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping2);
 DEFINE_ITRUNC_EVENT(xfs_exchmaps_update_inode_size);
 
+#define XFS_EXCHRANGE_INODES \
+	{ 1,	"file1" }, \
+	{ 2,	"file2" }
+
+DECLARE_EVENT_CLASS(xfs_exchrange_inode_class,
+	TP_PROTO(struct xfs_inode *ip, int whichfile),
+	TP_ARGS(ip, whichfile),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(int, whichfile)
+		__field(xfs_ino_t, ino)
+		__field(int, format)
+		__field(xfs_extnum_t, nex)
+		__field(int, broot_size)
+		__field(int, fork_off)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip)->i_sb->s_dev;
+		__entry->whichfile = whichfile;
+		__entry->ino = ip->i_ino;
+		__entry->format = ip->i_df.if_format;
+		__entry->nex = ip->i_df.if_nextents;
+		__entry->fork_off = xfs_inode_fork_boff(ip);
+	),
+	TP_printk("dev %d:%d ino 0x%llx whichfile %s format %s num_extents %llu forkoff 0x%x",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ino,
+		  __print_symbolic(__entry->whichfile, XFS_EXCHRANGE_INODES),
+		  __print_symbolic(__entry->format, XFS_INODE_FORMAT_STR),
+		  __entry->nex,
+		  __entry->fork_off)
+)
+
+#define DEFINE_EXCHRANGE_INODE_EVENT(name) \
+DEFINE_EVENT(xfs_exchrange_inode_class, name, \
+	TP_PROTO(struct xfs_inode *ip, int whichfile), \
+	TP_ARGS(ip, whichfile))
+
+DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_before);
+DEFINE_EXCHRANGE_INODE_EVENT(xfs_exchrange_after);
+DEFINE_INODE_ERROR_EVENT(xfs_exchrange_error);
+
+#define XFS_EXCHRANGE_FLAGS_STRS \
+	{ XFS_EXCHRANGE_TO_EOF,		"TO_EOF" }, \
+	{ XFS_EXCHRANGE_DSYNC,		"DSYNC" }, \
+	{ XFS_EXCHRANGE_DRY_RUN,	"DRY_RUN" }, \
+	{ XFS_EXCHRANGE_FILE1_WRITTEN,	"F1_WRITTEN" }, \
+	{ __XFS_EXCHRANGE_UPD_CMTIME1,	"CMTIME1" }, \
+	{ __XFS_EXCHRANGE_UPD_CMTIME2,	"CMTIME2" }, \
+	{ __XFS_EXCHRANGE_CHECK_FRESH2,	"FRESH2" }
+
+/* file exchange-range tracepoint class */
+DECLARE_EVENT_CLASS(xfs_exchrange_class,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1,
+		 struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip1, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip1_ino)
+		__field(loff_t, ip1_isize)
+		__field(loff_t, ip1_disize)
+		__field(xfs_ino_t, ip2_ino)
+		__field(loff_t, ip2_isize)
+		__field(loff_t, ip2_disize)
+
+		__field(loff_t, file1_offset)
+		__field(loff_t, file2_offset)
+		__field(unsigned long long, length)
+		__field(unsigned long long, flags)
+	),
+	TP_fast_assign(
+		__entry->dev = VFS_I(ip1)->i_sb->s_dev;
+		__entry->ip1_ino = ip1->i_ino;
+		__entry->ip1_isize = VFS_I(ip1)->i_size;
+		__entry->ip1_disize = ip1->i_disk_size;
+		__entry->ip2_ino = ip2->i_ino;
+		__entry->ip2_isize = VFS_I(ip2)->i_size;
+		__entry->ip2_disize = ip2->i_disk_size;
+
+		__entry->file1_offset = fxr->file1_offset;
+		__entry->file2_offset = fxr->file2_offset;
+		__entry->length = fxr->length;
+		__entry->flags = fxr->flags;
+	),
+	TP_printk("dev %d:%d flags %s bytecount 0x%llx "
+		  "ino1 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx -> "
+		  "ino2 0x%llx isize 0x%llx disize 0x%llx pos 0x%llx",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __print_flags_u64(__entry->flags, "|", XFS_EXCHRANGE_FLAGS_STRS),
+		  __entry->length,
+		  __entry->ip1_ino,
+		  __entry->ip1_isize,
+		  __entry->ip1_disize,
+		  __entry->file1_offset,
+		  __entry->ip2_ino,
+		  __entry->ip2_isize,
+		  __entry->ip2_disize,
+		  __entry->file2_offset)
+)
+
+#define DEFINE_EXCHRANGE_EVENT(name)	\
+DEFINE_EVENT(xfs_exchrange_class, name,	\
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip1, \
+		 struct xfs_inode *ip2), \
+	TP_ARGS(fxr, ip1, ip2))
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_prep);
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_flush);
+DEFINE_EXCHRANGE_EVENT(xfs_exchrange_mappings);
+
+TRACE_EVENT(xfs_exchrange_freshness,
+	TP_PROTO(const struct xfs_exchrange *fxr, struct xfs_inode *ip2),
+	TP_ARGS(fxr, ip2),
+	TP_STRUCT__entry(
+		__field(dev_t, dev)
+		__field(xfs_ino_t, ip2_ino)
+		__field(long long, ip2_mtime)
+		__field(long long, ip2_ctime)
+		__field(int, ip2_mtime_nsec)
+		__field(int, ip2_ctime_nsec)
+
+		__field(xfs_ino_t, file2_ino)
+		__field(long long, file2_mtime)
+		__field(long long, file2_ctime)
+		__field(int, file2_mtime_nsec)
+		__field(int, file2_ctime_nsec)
+	),
+	TP_fast_assign(
+		struct timespec64	ts64;
+		struct inode		*inode2 = VFS_I(ip2);
+
+		__entry->dev = inode2->i_sb->s_dev;
+		__entry->ip2_ino = ip2->i_ino;
+
+		ts64 = inode_get_ctime(inode2);
+		__entry->ip2_ctime = ts64.tv_sec;
+		__entry->ip2_ctime_nsec = ts64.tv_nsec;
+
+		ts64 = inode_get_mtime(inode2);
+		__entry->ip2_mtime = ts64.tv_sec;
+		__entry->ip2_mtime_nsec = ts64.tv_nsec;
+
+		__entry->file2_ino = fxr->file2_ino;
+		__entry->file2_mtime = fxr->file2_mtime.tv_sec;
+		__entry->file2_ctime = fxr->file2_ctime.tv_sec;
+		__entry->file2_mtime_nsec = fxr->file2_mtime.tv_nsec;
+		__entry->file2_ctime_nsec = fxr->file2_ctime.tv_nsec;
+	),
+	TP_printk("dev %d:%d "
+		  "ino 0x%llx mtime %lld:%d ctime %lld:%d -> "
+		  "file 0x%llx mtime %lld:%d ctime %lld:%d",
+		  MAJOR(__entry->dev), MINOR(__entry->dev),
+		  __entry->ip2_ino,
+		  __entry->ip2_mtime,
+		  __entry->ip2_mtime_nsec,
+		  __entry->ip2_ctime,
+		  __entry->ip2_ctime_nsec,
+		  __entry->file2_ino,
+		  __entry->file2_mtime,
+		  __entry->file2_mtime_nsec,
+		  __entry->file2_ctime,
+		  __entry->file2_ctime_nsec)
+);
+
 TRACE_EVENT(xfs_exchmaps_overhead,
 	TP_PROTO(struct xfs_mount *mp, unsigned long long bmbt_blocks,
 		 unsigned long long rmapbt_blocks),



* [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (5 preceding siblings ...)
  2024-02-27  2:22 ` [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong
@ 2024-02-27  2:22 ` Darrick J. Wong
  2024-02-28 15:50   ` Christoph Hellwig
  2024-02-27  2:22 ` [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong
                   ` (9 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:22 UTC
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add an errortag so that we can test recovery of exchmaps log items.
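
For readers unfamiliar with error tags, here is a simplified userspace
model of the check, assuming the "random factor" semantics noted in
xfs_errortag.h (1 means always, 2 means half the time, and so on) and
assuming that a factor of 0 means the tag is disabled.  The function
names are hypothetical and rand() stands in for the kernel's random
number source.

```c
#include <errno.h>
#include <stdbool.h>
#include <stdlib.h>

/* Fire roughly once per rand_factor calls; never fire when it is 0. */
bool errortag_should_fire(unsigned int rand_factor)
{
	if (rand_factor == 0)
		return false;
	return (unsigned int)rand() % rand_factor == 0;
}

/* Model of a "finish one" step that fails with -EIO when the tag fires. */
int finish_one_sketch(unsigned int rand_factor)
{
	if (errortag_should_fire(rand_factor))
		return -EIO;
	return 0;
}
```

With the default factor of 1 added by this patch, every call fails once
the tag is armed, which makes each exchange step a recovery test point.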

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_errortag.h |    4 +++-
 fs/xfs/libxfs/xfs_exchmaps.c |    3 +++
 fs/xfs/xfs_error.c           |    3 +++
 3 files changed, 9 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_errortag.h b/fs/xfs/libxfs/xfs_errortag.h
index 01a9e86b30379..7002d7676a788 100644
--- a/fs/xfs/libxfs/xfs_errortag.h
+++ b/fs/xfs/libxfs/xfs_errortag.h
@@ -63,7 +63,8 @@
 #define XFS_ERRTAG_ATTR_LEAF_TO_NODE			41
 #define XFS_ERRTAG_WB_DELAY_MS				42
 #define XFS_ERRTAG_WRITE_DELAY_MS			43
-#define XFS_ERRTAG_MAX					44
+#define XFS_ERRTAG_EXCHMAPS_FINISH_ONE			44
+#define XFS_ERRTAG_MAX					45
 
 /*
  * Random factors for above tags, 1 means always, 2 means 1/2 time, etc.
@@ -111,5 +112,6 @@
 #define XFS_RANDOM_ATTR_LEAF_TO_NODE			1
 #define XFS_RANDOM_WB_DELAY_MS				3000
 #define XFS_RANDOM_WRITE_DELAY_MS			3000
+#define XFS_RANDOM_EXCHMAPS_FINISH_ONE			1
 
 #endif /* __XFS_ERRORTAG_H_ */
diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index eddb0972e344e..228eb15fce212 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -437,6 +437,9 @@ xfs_exchmaps_finish_one(
 			return error;
 	}
 
+	if (XFS_TEST_ERROR(false, tp->t_mountp, XFS_ERRTAG_EXCHMAPS_FINISH_ONE))
+		return -EIO;
+
 	/* If we still have work to do, ask for a new transaction. */
 	if (xmi_has_more_exchange_work(xmi) || xmi_has_postop_work(xmi)) {
 		trace_xfs_exchmaps_defer(tp->t_mountp, xmi);
diff --git a/fs/xfs/xfs_error.c b/fs/xfs/xfs_error.c
index b2cbbba3e15a5..b23e22222f804 100644
--- a/fs/xfs/xfs_error.c
+++ b/fs/xfs/xfs_error.c
@@ -62,6 +62,7 @@ static unsigned int xfs_errortag_random_default[] = {
 	XFS_RANDOM_ATTR_LEAF_TO_NODE,
 	XFS_RANDOM_WB_DELAY_MS,
 	XFS_RANDOM_WRITE_DELAY_MS,
+	XFS_RANDOM_EXCHMAPS_FINISH_ONE,
 };
 
 struct xfs_errortag_attr {
@@ -179,6 +180,7 @@ XFS_ERRORTAG_ATTR_RW(da_leaf_split,	XFS_ERRTAG_DA_LEAF_SPLIT);
 XFS_ERRORTAG_ATTR_RW(attr_leaf_to_node,	XFS_ERRTAG_ATTR_LEAF_TO_NODE);
 XFS_ERRORTAG_ATTR_RW(wb_delay_ms,	XFS_ERRTAG_WB_DELAY_MS);
 XFS_ERRORTAG_ATTR_RW(write_delay_ms,	XFS_ERRTAG_WRITE_DELAY_MS);
+XFS_ERRORTAG_ATTR_RW(exchmaps_finish_one, XFS_ERRTAG_EXCHMAPS_FINISH_ONE);
 
 static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(noerror),
@@ -224,6 +226,7 @@ static struct attribute *xfs_errortag_attrs[] = {
 	XFS_ERRORTAG_ATTR_LIST(attr_leaf_to_node),
 	XFS_ERRORTAG_ATTR_LIST(wb_delay_ms),
 	XFS_ERRORTAG_ATTR_LIST(write_delay_ms),
+	XFS_ERRORTAG_ATTR_LIST(exchmaps_finish_one),
 	NULL,
 };
 ATTRIBUTE_GROUPS(xfs_errortag);



* [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (6 preceding siblings ...)
  2024-02-27  2:22 ` [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong
@ 2024-02-27  2:22 ` Darrick J. Wong
  2024-02-28 15:50   ` Christoph Hellwig
  2024-02-27  2:23 ` [PATCH 09/14] xfs: condense directories " Darrick J. Wong
                   ` (8 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:22 UTC
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add a new file mapping exchange flag that enables us to perform
post-exchange processing on file2 once we're done exchanging the extent
mappings.  If we were swapping mappings between extended attribute
forks, we want to be able to convert file2's attr fork from block to
inline format.

(This implies that all fork contents are exchanged.)

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online xattr repair feature can create
salvaged attrs in a temporary file and exchange the attr fork mappings
when ready.  If one file is in extents format and the other is inline,
we will have to promote both to extents format to perform the exchange.
After the exchange, we can try to condense the fixed file's attr fork
back down to inline format if possible.
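
The post-exchange step follows a "check flag, do work, clear flag"
pattern.  The sketch below models only that control flow; the SK_*
values and function names are made up for illustration and are not the
ondisk flag values, and the attr fork conversion is replaced by a
caller-supplied result and a call counter.

```c
#include <stdint.h>

/* Hypothetical flag bits, for illustration only. */
#define SK_ATTR_FORK		(1u << 0)
#define SK_CLEAR_INO1_REFLINK	(1u << 1)
#define SK_CLEAR_INO2_REFLINK	(1u << 2)
#define SK_INO2_SHORTFORM	(1u << 3)

#define SK_POSTOP_FLAGS		(SK_CLEAR_INO1_REFLINK | \
				 SK_CLEAR_INO2_REFLINK | \
				 SK_INO2_SHORTFORM)

/* Is there any post-exchange work still pending? */
int sk_has_postop_work(uint32_t flags)
{
	return (flags & SK_POSTOP_FLAGS) != 0;
}

/*
 * Run pending post-exchange work.  The shortform flag is cleared even
 * if the conversion fails, so a failed conversion is not retried.
 */
int sk_do_postop_work(uint32_t *flags, int condense_err, int *condense_calls)
{
	if (*flags & SK_INO2_SHORTFORM) {
		int error = 0;

		/* Only attr fork exchanges have something to condense. */
		if (*flags & SK_ATTR_FORK) {
			(*condense_calls)++;
			error = condense_err;
		}
		*flags &= ~SK_INO2_SHORTFORM;
		if (error)
			return error;
	}

	if (*flags & SK_CLEAR_INO1_REFLINK)
		*flags &= ~SK_CLEAR_INO1_REFLINK;
	if (*flags & SK_CLEAR_INO2_REFLINK)
		*flags &= ~SK_CLEAR_INO2_REFLINK;
	return 0;
}
```

Clearing each flag as its work completes is what lets a flags test like
sk_has_postop_work() decide whether another pass is needed.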

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_exchmaps.c |   53 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/libxfs/xfs_exchmaps.h |    5 ++++
 fs/xfs/xfs_trace.h           |    3 ++
 3 files changed, 58 insertions(+), 3 deletions(-)


diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index 228eb15fce212..a2af9ecfe4b01 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -24,6 +24,10 @@
 #include "xfs_errortag.h"
 #include "xfs_health.h"
 #include "xfs_exchmaps_item.h"
+#include "xfs_da_format.h"
+#include "xfs_da_btree.h"
+#include "xfs_attr_leaf.h"
+#include "xfs_attr.h"
 
 struct kmem_cache	*xfs_exchmaps_intent_cache;
 
@@ -121,7 +125,8 @@ static inline bool
 xmi_has_postop_work(const struct xfs_exchmaps_intent *xmi)
 {
 	return xmi->xmi_flags & (XFS_EXCHMAPS_CLEAR_INO1_REFLINK |
-				 XFS_EXCHMAPS_CLEAR_INO2_REFLINK);
+				 XFS_EXCHMAPS_CLEAR_INO2_REFLINK |
+				 __XFS_EXCHMAPS_INO2_SHORTFORM);
 }
 
 /* Check all mappings to make sure we can actually exchange them. */
@@ -360,6 +365,36 @@ xfs_exchmaps_one_step(
 	xmi_advance(xmi, irec1);
 }
 
+/* Convert inode2's leaf attr fork back to shortform, if possible. */
+STATIC int
+xfs_exchmaps_attr_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	struct xfs_da_args	args = {
+		.dp		= xmi->xmi_ip2,
+		.geo		= tp->t_mountp->m_attr_geo,
+		.whichfork	= XFS_ATTR_FORK,
+		.trans		= tp,
+	};
+	struct xfs_buf		*bp;
+	int			forkoff;
+	int			error;
+
+	if (!xfs_attr_is_leaf(xmi->xmi_ip2))
+		return 0;
+
+	error = xfs_attr3_leaf_read(tp, xmi->xmi_ip2, 0, &bp);
+	if (error)
+		return error;
+
+	forkoff = xfs_attr_shortform_allfit(bp, xmi->xmi_ip2);
+	if (forkoff == 0)
+		return 0;
+
+	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
+}
+
 /* Clear the reflink flag after an exchange. */
 static inline void
 xfs_exchmaps_clear_reflink(
@@ -378,6 +413,16 @@ xfs_exchmaps_do_postop_work(
 	struct xfs_trans		*tp,
 	struct xfs_exchmaps_intent	*xmi)
 {
+	if (xmi->xmi_flags & __XFS_EXCHMAPS_INO2_SHORTFORM) {
+		int			error = 0;
+
+		if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
+			error = xfs_exchmaps_attr_to_sf(tp, xmi);
+		xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM;
+		if (error)
+			return error;
+	}
+
 	if (xmi->xmi_flags & XFS_EXCHMAPS_CLEAR_INO1_REFLINK) {
 		xfs_exchmaps_clear_reflink(tp, xmi->xmi_ip1);
 		xmi->xmi_flags &= ~XFS_EXCHMAPS_CLEAR_INO1_REFLINK;
@@ -795,8 +840,10 @@ xfs_exchmaps_init_intent(
 	xmi->xmi_isize1 = xmi->xmi_isize2 = -1;
 	xmi->xmi_flags = req->flags & XFS_EXCHMAPS_PARAMS;
 
-	if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK)
+	if (xfs_exchmaps_whichfork(xmi) == XFS_ATTR_FORK) {
+		xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM;
 		return xmi;
+	}
 
 	if (req->flags & XFS_EXCHMAPS_SET_SIZES) {
 		xmi->xmi_flags |= XFS_EXCHMAPS_SET_SIZES;
@@ -1017,6 +1064,8 @@ xfs_exchange_mappings(
 {
 	struct xfs_exchmaps_intent	*xmi;
 
+	BUILD_BUG_ON(XFS_EXCHMAPS_INTERNAL_FLAGS & XFS_EXCHMAPS_LOGGED_FLAGS);
+
 	ASSERT(xfs_isilocked(req->ip1, XFS_ILOCK_EXCL));
 	ASSERT(xfs_isilocked(req->ip2, XFS_ILOCK_EXCL));
 	ASSERT(!(req->flags & ~XFS_EXCHMAPS_LOGGED_FLAGS));
diff --git a/fs/xfs/libxfs/xfs_exchmaps.h b/fs/xfs/libxfs/xfs_exchmaps.h
index e8fc3f80c68c2..d8718fca606e5 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.h
+++ b/fs/xfs/libxfs/xfs_exchmaps.h
@@ -27,6 +27,11 @@ struct xfs_exchmaps_intent {
 	uint64_t		xmi_flags;	/* XFS_EXCHMAPS_* flags */
 };
 
+/* Try to convert inode2 from block to short format at the end, if possible. */
+#define __XFS_EXCHMAPS_INO2_SHORTFORM	(1ULL << 63)
+
+#define XFS_EXCHMAPS_INTERNAL_FLAGS	(__XFS_EXCHMAPS_INO2_SHORTFORM)
+
 /* flags that can be passed to xfs_exchmaps_{estimate,mappings} */
 #define XFS_EXCHMAPS_PARAMS		(XFS_EXCHMAPS_ATTR_FORK | \
 					 XFS_EXCHMAPS_SET_SIZES | \
diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
index 3929022b03cfe..d6666aa6a9529 100644
--- a/fs/xfs/xfs_trace.h
+++ b/fs/xfs/xfs_trace.h
@@ -4799,7 +4799,8 @@ DEFINE_XFBTREE_FREESP_EVENT(xfbtree_free_block);
 	{ XFS_EXCHMAPS_SET_SIZES,		"SETSIZES" }, \
 	{ XFS_EXCHMAPS_INO1_WRITTEN,		"INO1_WRITTEN" }, \
 	{ XFS_EXCHMAPS_CLEAR_INO1_REFLINK,	"CLEAR_INO1_REFLINK" }, \
-	{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }
+	{ XFS_EXCHMAPS_CLEAR_INO2_REFLINK,	"CLEAR_INO2_REFLINK" }, \
+	{ __XFS_EXCHMAPS_INO2_SHORTFORM,	"INO2_SF" }
 
 DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1_skip);
 DEFINE_INODE_IREC_EVENT(xfs_exchmaps_mapping1);



* [PATCH 09/14] xfs: condense directories after a mapping exchange operation
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (7 preceding siblings ...)
  2024-02-27  2:22 ` [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong
@ 2024-02-27  2:23 ` Darrick J. Wong
  2024-02-28 15:51   ` Christoph Hellwig
  2024-02-27  2:23 ` [PATCH 10/14] xfs: condense symbolic links " Darrick J. Wong
                   ` (7 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:23 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new file mapping exchange flag that enables
us to perform post-exchange processing on file2 once we're done exchanging
extent mappings.  Now add this ability for directories.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online directory repair feature can
create salvaged dirents in a temporary directory and exchange the data
fork mappings when ready.  If one file is in extents format and the
other is inline, we will have to promote both to extents format to
perform the exchange.  After the exchange, we can try to condense the
fixed directory down to inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_exchmaps.c |   43 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 43 insertions(+)


diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index a2af9ecfe4b01..875f34f4537e0 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -28,6 +28,8 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr_leaf.h"
 #include "xfs_attr.h"
+#include "xfs_dir2_priv.h"
+#include "xfs_dir2.h"
 
 struct kmem_cache	*xfs_exchmaps_intent_cache;
 
@@ -395,6 +397,42 @@ xfs_exchmaps_attr_to_sf(
 	return xfs_attr3_leaf_to_shortform(bp, &args, forkoff);
 }
 
+/* Convert inode2's block dir fork back to shortform, if possible. */
+STATIC int
+xfs_exchmaps_dir_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	struct xfs_da_args	args = {
+		.dp		= xmi->xmi_ip2,
+		.geo		= tp->t_mountp->m_dir_geo,
+		.whichfork	= XFS_DATA_FORK,
+		.trans		= tp,
+	};
+	struct xfs_dir2_sf_hdr	sfh;
+	struct xfs_buf		*bp;
+	bool			isblock;
+	int			size;
+	int			error;
+
+	error = xfs_dir2_isblock(&args, &isblock);
+	if (error)
+		return error;
+
+	if (!isblock)
+		return 0;
+
+	error = xfs_dir3_block_read(tp, xmi->xmi_ip2, &bp);
+	if (error)
+		return error;
+
+	size = xfs_dir2_block_sfsize(xmi->xmi_ip2, bp->b_addr, &sfh);
+	if (size > xfs_inode_data_fork_size(xmi->xmi_ip2))
+		return 0;
+
+	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
+}
+
 /* Clear the reflink flag after an exchange. */
 static inline void
 xfs_exchmaps_clear_reflink(
@@ -418,6 +456,8 @@ xfs_exchmaps_do_postop_work(
 
 		if (xmi->xmi_flags & XFS_EXCHMAPS_ATTR_FORK)
 			error = xfs_exchmaps_attr_to_sf(tp, xmi);
+		else if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode))
+			error = xfs_exchmaps_dir_to_sf(tp, xmi);
 		xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM;
 		if (error)
 			return error;
@@ -868,6 +908,9 @@ xfs_exchmaps_init_intent(
 			xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
 	}
 
+	if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode))
+		xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM;
+
 	return xmi;
 }
 



* [PATCH 10/14] xfs: condense symbolic links after a mapping exchange operation
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (8 preceding siblings ...)
  2024-02-27  2:23 ` [PATCH 09/14] xfs: condense directories " Darrick J. Wong
@ 2024-02-27  2:23 ` Darrick J. Wong
  2024-02-28 15:51   ` Christoph Hellwig
  2024-02-27  2:23 ` [PATCH 11/14] xfs: make file range exchange support realtime files Darrick J. Wong
                   ` (6 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:23 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

The previous commit added a new file mapping exchange flag that enables
us to perform post-exchange processing on file2 once we're done
exchanging the extent mappings.  Now add this ability for symlinks.

This isn't used anywhere right now, but we need to have the basic ondisk
flags in place so that a future online symlink repair feature can
salvage the remote target in a temporary link and exchange the data fork
mappings when ready.  If one file is in extents format and the other is
inline, we will have to promote both to extents format to perform the
exchange.  After the exchange, we can try to condense the fixed symlink
down to inline format if possible.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_exchmaps.c       |   48 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.c |   47 +++++++++++++++++++++++++++++++++++
 fs/xfs/libxfs/xfs_symlink_remote.h |    1 +
 fs/xfs/xfs_symlink.c               |   49 ++++--------------------------------
 4 files changed, 101 insertions(+), 44 deletions(-)
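
With this patch the post-exchange step has three possible condense helpers.
The sketch below models, in plain userspace C, how the dispatch picks one and
when a symlink target is eligible to move back inline.  The ex_* names, the
file type enum, and the flag value are hypothetical; the kernel code tests the
inode mode and the real fork format instead.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define EX_ATTR_FORK	(1ULL << 0)	/* illustrative value */

/* Simplified stand-in for the inode mode checks (S_ISDIR, S_ISLNK). */
enum ex_ftype { EX_FT_REG, EX_FT_DIR, EX_FT_LNK };

enum ex_condense {
	EX_CONDENSE_NONE,
	EX_CONDENSE_ATTR,	/* leaf attr fork -> shortform */
	EX_CONDENSE_DIR,	/* block directory -> shortform */
	EX_CONDENSE_LINK,	/* remote symlink target -> inline */
};

/* Pick the condense helper the same way the postop dispatch does. */
static inline enum ex_condense
ex_pick_condense(uint64_t flags, enum ex_ftype ftype)
{
	if (flags & EX_ATTR_FORK)
		return EX_CONDENSE_ATTR;
	if (ftype == EX_FT_DIR)
		return EX_CONDENSE_DIR;
	if (ftype == EX_FT_LNK)
		return EX_CONDENSE_LINK;
	return EX_CONDENSE_NONE;
}

/*
 * A symlink is only worth converting if its target is currently stored
 * remotely and is short enough to fit in the inode's data fork area.
 */
static inline bool
ex_link_should_condense(bool is_local, uint64_t target_len,
			uint64_t fork_size)
{
	return !is_local && target_len <= fork_size;
}
```

The attr fork test comes first because an attr fork exchange can involve any
file type, including a directory or a symlink.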


diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index 875f34f4537e0..f5e723f674ce8 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -30,6 +30,7 @@
 #include "xfs_attr.h"
 #include "xfs_dir2_priv.h"
 #include "xfs_dir2.h"
+#include "xfs_symlink_remote.h"
 
 struct kmem_cache	*xfs_exchmaps_intent_cache;
 
@@ -433,6 +434,48 @@ xfs_exchmaps_dir_to_sf(
 	return xfs_dir2_block_to_sf(&args, bp, size, &sfh);
 }
 
+/* Convert inode2's remote symlink target back to shortform, if possible. */
+STATIC int
+xfs_exchmaps_link_to_sf(
+	struct xfs_trans		*tp,
+	struct xfs_exchmaps_intent	*xmi)
+{
+	struct xfs_inode		*ip = xmi->xmi_ip2;
+	struct xfs_ifork		*ifp = xfs_ifork_ptr(ip, XFS_DATA_FORK);
+	char				*buf;
+	int				error;
+
+	if (ifp->if_format == XFS_DINODE_FMT_LOCAL ||
+	    ip->i_disk_size > xfs_inode_data_fork_size(ip))
+		return 0;
+
+	/* Read the current symlink target into a buffer. */
+	buf = kmem_alloc(ip->i_disk_size + 1, KM_NOFS);
+	if (!buf) {
+		ASSERT(0);
+		return -ENOMEM;
+	}
+
+	error = xfs_symlink_remote_read(ip, buf);
+	if (error)
+		goto free;
+
+	/* Remove the blocks. */
+	error = xfs_symlink_remote_truncate(tp, ip);
+	if (error)
+		goto free;
+
+	/* Convert fork to local format and log our changes. */
+	xfs_idestroy_fork(ifp);
+	ifp->if_bytes = 0;
+	ifp->if_format = XFS_DINODE_FMT_LOCAL;
+	xfs_init_local_fork(ip, XFS_DATA_FORK, buf, ip->i_disk_size);
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_DDATA | XFS_ILOG_CORE);
+free:
+	kmem_free(buf);
+	return error;
+}
+
 /* Clear the reflink flag after an exchange. */
 static inline void
 xfs_exchmaps_clear_reflink(
@@ -458,6 +501,8 @@ xfs_exchmaps_do_postop_work(
 			error = xfs_exchmaps_attr_to_sf(tp, xmi);
 		else if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode))
 			error = xfs_exchmaps_dir_to_sf(tp, xmi);
+		else if (S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode))
+			error = xfs_exchmaps_link_to_sf(tp, xmi);
 		xmi->xmi_flags &= ~__XFS_EXCHMAPS_INO2_SHORTFORM;
 		if (error)
 			return error;
@@ -908,7 +953,8 @@ xfs_exchmaps_init_intent(
 			xmi->xmi_flags |= XFS_EXCHMAPS_CLEAR_INO2_REFLINK;
 	}
 
-	if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode))
+	if (S_ISDIR(VFS_I(xmi->xmi_ip2)->i_mode) ||
+	    S_ISLNK(VFS_I(xmi->xmi_ip2)->i_mode))
 		xmi->xmi_flags |= __XFS_EXCHMAPS_INO2_SHORTFORM;
 
 	return xmi;
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.c b/fs/xfs/libxfs/xfs_symlink_remote.c
index ebc9ce8565b85..df1db72a3b7f3 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.c
+++ b/fs/xfs/libxfs/xfs_symlink_remote.c
@@ -380,3 +380,50 @@ xfs_symlink_write_target(
 	ASSERT(pathlen == 0);
 	return 0;
 }
+
+/* Remove all the blocks from a symlink and invalidate buffers. */
+int
+xfs_symlink_remote_truncate(
+	struct xfs_trans	*tp,
+	struct xfs_inode	*ip)
+{
+	struct xfs_bmbt_irec	mval[XFS_SYMLINK_MAPS];
+	struct xfs_mount	*mp = tp->t_mountp;
+	struct xfs_buf		*bp;
+	int			nmaps = XFS_SYMLINK_MAPS;
+	int			done = 0;
+	int			i;
+	int			error;
+
+	/* Read mappings and invalidate buffers. */
+	error = xfs_bmapi_read(ip, 0, XFS_MAX_FILEOFF, mval, &nmaps, 0);
+	if (error)
+		return error;
+
+	for (i = 0; i < nmaps; i++) {
+		if (!xfs_bmap_is_real_extent(&mval[i]))
+			break;
+
+		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
+				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
+				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
+				&bp);
+		if (error)
+			return error;
+
+		xfs_trans_binval(tp, bp);
+	}
+
+	/* Unmap the remote blocks. */
+	error = xfs_bunmapi(tp, ip, 0, XFS_MAX_FILEOFF, 0, nmaps, &done);
+	if (error)
+		return error;
+	if (!done) {
+		ASSERT(done);
+		xfs_inode_mark_sick(ip, XFS_SICK_INO_SYMLINK);
+		return -EFSCORRUPTED;
+	}
+
+	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+	return 0;
+}
diff --git a/fs/xfs/libxfs/xfs_symlink_remote.h b/fs/xfs/libxfs/xfs_symlink_remote.h
index a63bd38ae4faf..ac3dac8f617ed 100644
--- a/fs/xfs/libxfs/xfs_symlink_remote.h
+++ b/fs/xfs/libxfs/xfs_symlink_remote.h
@@ -22,5 +22,6 @@ int xfs_symlink_remote_read(struct xfs_inode *ip, char *link);
 int xfs_symlink_write_target(struct xfs_trans *tp, struct xfs_inode *ip,
 		const char *target_path, int pathlen, xfs_fsblock_t fs_blocks,
 		uint resblks);
+int xfs_symlink_remote_truncate(struct xfs_trans *tp, struct xfs_inode *ip);
 
 #endif /* __XFS_SYMLINK_REMOTE_H */
diff --git a/fs/xfs/xfs_symlink.c b/fs/xfs/xfs_symlink.c
index d7302019ff435..486df9f7963d8 100644
--- a/fs/xfs/xfs_symlink.c
+++ b/fs/xfs/xfs_symlink.c
@@ -251,19 +251,12 @@ xfs_symlink(
  */
 STATIC int
 xfs_inactive_symlink_rmt(
-	struct xfs_inode *ip)
+	struct xfs_inode	*ip)
 {
-	struct xfs_buf	*bp;
-	int		done;
-	int		error;
-	int		i;
-	xfs_mount_t	*mp;
-	xfs_bmbt_irec_t	mval[XFS_SYMLINK_MAPS];
-	int		nmaps;
-	int		size;
-	xfs_trans_t	*tp;
+	struct xfs_mount	*mp = ip->i_mount;
+	struct xfs_trans	*tp;
+	int			error;
 
-	mp = ip->i_mount;
 	ASSERT(!xfs_need_iread_extents(&ip->i_df));
 	/*
 	 * We're freeing a symlink that has some
@@ -287,44 +280,14 @@ xfs_inactive_symlink_rmt(
 	 * locked for the second transaction.  In the error paths we need it
 	 * held so the cancel won't rele it, see below.
 	 */
-	size = (int)ip->i_disk_size;
 	ip->i_disk_size = 0;
 	VFS_I(ip)->i_mode = (VFS_I(ip)->i_mode & ~S_IFMT) | S_IFREG;
 	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
-	/*
-	 * Find the block(s) so we can inval and unmap them.
-	 */
-	done = 0;
-	nmaps = ARRAY_SIZE(mval);
-	error = xfs_bmapi_read(ip, 0, xfs_symlink_blocks(mp, size),
-				mval, &nmaps, 0);
-	if (error)
-		goto error_trans_cancel;
-	/*
-	 * Invalidate the block(s). No validation is done.
-	 */
-	for (i = 0; i < nmaps; i++) {
-		error = xfs_trans_get_buf(tp, mp->m_ddev_targp,
-				XFS_FSB_TO_DADDR(mp, mval[i].br_startblock),
-				XFS_FSB_TO_BB(mp, mval[i].br_blockcount), 0,
-				&bp);
-		if (error)
-			goto error_trans_cancel;
-		xfs_trans_binval(tp, bp);
-	}
-	/*
-	 * Unmap the dead block(s) to the dfops.
-	 */
-	error = xfs_bunmapi(tp, ip, 0, size, 0, nmaps, &done);
+
+	error = xfs_symlink_remote_truncate(tp, ip);
 	if (error)
 		goto error_trans_cancel;
-	ASSERT(done);
 
-	/*
-	 * Commit the transaction. This first logs the EFI and the inode, then
-	 * rolls and commits the transaction that frees the extents.
-	 */
-	xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
 	error = xfs_trans_commit(tp);
 	if (error) {
 		ASSERT(xfs_is_shutdown(mp));



* [PATCH 11/14] xfs: make file range exchange support realtime files
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (9 preceding siblings ...)
  2024-02-27  2:23 ` [PATCH 10/14] xfs: condense symbolic links " Darrick J. Wong
@ 2024-02-27  2:23 ` Darrick J. Wong
  2024-02-28 15:51   ` Christoph Hellwig
  2024-02-27  2:23 ` [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
                   ` (5 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:23 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Now that bmap items support the realtime device, we can add the
necessary pieces to the file range exchange code to support exchanging
mappings.  All we really need to do here is adjust the blockcount
upwards to the end of the rt extent and remove the inode checks.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_exchmaps.c |   70 ++++++++++++++++++++++++++++++++++++------
 fs/xfs/xfs_exchrange.c       |    9 +++++
 2 files changed, 69 insertions(+), 10 deletions(-)
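
The skip/trim decision for unwritten mappings on a realtime file can be
easier to follow outside the kernel.  Below is a rough userspace model of
that decision, assuming the caller has already established that the mapping
is not written.  The struct, the ex_* helpers, and the use of rextsize == 1
as a stand-in for "no big allocation unit" are simplifications made for this
sketch, not the kernel's interfaces.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified mapping: offsets and lengths are in filesystem blocks. */
struct ex_mapping {
	uint64_t	startoff;
	uint64_t	blockcount;
	bool		real;	/* true: unwritten extent, false: hole */
};

static inline uint64_t ex_roundup(uint64_t x, uint64_t unit)
{
	return ((x + unit - 1) / unit) * unit;
}

static inline uint64_t ex_rounddown(uint64_t x, uint64_t unit)
{
	return (x / unit) * unit;
}

/*
 * Return true if the (unwritten or hole) mapping can be skipped.  The
 * mapping may be trimmed so that the part acted on never straddles an
 * rt extent boundary.
 */
static inline bool
ex_can_skip_unwritten(struct ex_mapping *m, uint64_t rextsize)
{
	assert(rextsize > 0);

	/* Single-block allocation unit: always safe to skip. */
	if (rextsize == 1)
		return true;

	/* Holes should align with rt extent boundaries; skip them. */
	if (!m->real)
		return true;

	/* Unaligned start: trim to the next boundary and exchange it. */
	if (m->startoff % rextsize != 0) {
		uint64_t gap = ex_roundup(m->startoff, rextsize) - m->startoff;

		if (m->blockcount > gap)
			m->blockcount = gap;
		return false;
	}

	/* Aligned start and aligned length: skip the whole mapping. */
	if (m->blockcount % rextsize == 0)
		return true;

	/* Longer than one rt extent: trim to a boundary and skip that. */
	if (m->blockcount > rextsize) {
		uint64_t end = ex_rounddown(m->startoff + m->blockcount,
					    rextsize);

		m->blockcount = end - m->startoff;
		return true;
	}

	/* Shorter than one rt extent: it must be exchanged. */
	return false;
}
```

In every branch the portion that is skipped covers only whole rt extents, so
a partially written rt extent is always exchanged as a unit.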


diff --git a/fs/xfs/libxfs/xfs_exchmaps.c b/fs/xfs/libxfs/xfs_exchmaps.c
index f5e723f674ce8..501365cd4cf4e 100644
--- a/fs/xfs/libxfs/xfs_exchmaps.c
+++ b/fs/xfs/libxfs/xfs_exchmaps.c
@@ -152,12 +152,7 @@ xfs_exchmaps_check_forks(
 	    ifp2->if_format == XFS_DINODE_FMT_LOCAL)
 		return -EINVAL;
 
-	/* We don't support realtime data forks yet. */
-	if (!XFS_IS_REALTIME_INODE(req->ip1))
-		return 0;
-	if (whichfork == XFS_ATTR_FORK)
-		return 0;
-	return -EINVAL;
+	return 0;
 }
 
 #ifdef CONFIG_XFS_QUOTA
@@ -198,6 +193,8 @@ xfs_exchmaps_can_skip_mapping(
 	struct xfs_exchmaps_intent	*xmi,
 	struct xfs_bmbt_irec		*irec)
 {
+	struct xfs_mount		*mp = xmi->xmi_ip1->i_mount;
+
 	/* Do not skip this mapping if the caller did not tell us to. */
 	if (!(xmi->xmi_flags & XFS_EXCHMAPS_INO1_WRITTEN))
 		return false;
@@ -209,11 +206,64 @@ xfs_exchmaps_can_skip_mapping(
 	/*
 	 * The mapping is unwritten or a hole.  It cannot be a delalloc
 	 * reservation because we already excluded those.  It cannot be an
-	 * unwritten mapping with dirty page cache because we flushed the page
-	 * cache.  We don't support realtime files yet, so we needn't (yet)
-	 * deal with them.
+	 * unwritten extent with dirty page cache because we flushed the page
+	 * cache.  For files where the allocation unit is 1FSB (files on the
+	 * data dev, rt files if the extent size is 1FSB), we can safely
+	 * skip this mapping.
 	 */
-	return true;
+	if (!xfs_inode_has_bigallocunit(xmi->xmi_ip1))
+		return true;
+
+	/*
+	 * For a realtime file with a multi-fsb allocation unit, the decision
+	 * is trickier because we can only swap full allocation units.
+	 * Unwritten mappings can appear in the middle of an rtx if the rtx is
+	 * partially written, but they can also appear for preallocations.
+	 *
+	 * If the mapping is a hole, skip it entirely.  Holes should align with
+	 * rtx boundaries.
+	 */
+	if (!xfs_bmap_is_real_extent(irec))
+		return true;
+
+	/*
+	 * All mappings below this point are unwritten.
+	 *
+	 * - If the beginning is not aligned to an rtx, trim the end of the
+	 *   mapping so that it does not cross an rtx boundary, and swap it.
+	 *
+	 * - If both ends are aligned to an rtx, skip the entire mapping.
+	 */
+	if (!isaligned_64(irec->br_startoff, mp->m_sb.sb_rextsize)) {
+		xfs_fileoff_t	new_end;
+
+		new_end = roundup_64(irec->br_startoff, mp->m_sb.sb_rextsize);
+		irec->br_blockcount = min(irec->br_blockcount,
+					  new_end - irec->br_startoff);
+		return false;
+	}
+	if (isaligned_64(irec->br_blockcount, mp->m_sb.sb_rextsize))
+		return true;
+
+	/*
+	 * All mappings below this point are unwritten, start on an rtx
+	 * boundary, and do not end on an rtx boundary.
+	 *
+	 * - If the mapping is longer than one rtx, trim the end of the mapping
+	 *   down to an rtx boundary and skip it.
+	 *
+	 * - The mapping is shorter than one rtx.  Swap it.
+	 */
+	if (irec->br_blockcount > mp->m_sb.sb_rextsize) {
+		xfs_fileoff_t	new_end;
+
+		new_end = rounddown_64(irec->br_startoff + irec->br_blockcount,
+				mp->m_sb.sb_rextsize);
+		irec->br_blockcount = new_end - irec->br_startoff;
+		return true;
+	}
+
+	return false;
 }
 
 /*
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index 741c7ae7f7895..a2b1c9d933385 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -21,6 +21,7 @@
 #include "xfs_sb.h"
 #include "xfs_icache.h"
 #include "xfs_log.h"
+#include "xfs_rtbitmap.h"
 #include <linux/fsnotify.h>
 
 /*
@@ -268,6 +269,14 @@ xfs_exchrange_mappings(
 	if (fxr->flags & XFS_EXCHRANGE_FILE1_WRITTEN)
 		req.flags |= XFS_EXCHMAPS_INO1_WRITTEN;
 
+	/*
+	 * Round the request length up to the nearest file allocation unit.
+	 * The prep function already checked that the request offsets and
+	 * length in @fxr are safe to round up.
+	 */
+	if (xfs_inode_has_bigallocunit(ip2))
+		req.blockcount = xfs_rtb_roundup_rtx(mp, req.blockcount);
+
 	error = xfs_exchrange_estimate(&req);
 	if (error)
 		return error;



* [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (10 preceding siblings ...)
  2024-02-27  2:23 ` [PATCH 11/14] xfs: make file range exchange support realtime files Darrick J. Wong
@ 2024-02-27  2:23 ` Darrick J. Wong
  2024-02-28 15:51   ` Christoph Hellwig
  2024-02-27  2:24 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
                   ` (4 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:23 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

The generic exchange-range alignment checks use (fast) bitmasking
operations to perform block alignment checks on the exchange parameters.
Unfortunately, bitmasks require that the alignment size be a power of
two.  This isn't true for realtime devices with a non-power-of-two
extent size, so we have to copy-pasta the generic checks using long
division for this to work properly.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/xfs_exchrange.c |   89 ++++++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 82 insertions(+), 7 deletions(-)
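
To illustrate why the bitmask checks cannot be reused here, the sketch below
compares a mask-based alignment test with a division-based one.  The ex_*
helpers are hypothetical userspace stand-ins written for this note; they only
mirror the idea behind the kernel's alignment and rounding helpers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Fast check; only correct when unit is a power of two. */
static inline bool ex_isaligned_mask(uint64_t x, uint64_t unit)
{
	return (x & (unit - 1)) == 0;
}

/* Slow check; correct for any nonzero unit. */
static inline bool ex_isaligned_div(uint64_t x, uint64_t unit)
{
	assert(unit > 0);
	return x % unit == 0;
}

/* Round x up to the next multiple of unit, for any nonzero unit. */
static inline uint64_t ex_roundup(uint64_t x, uint64_t unit)
{
	assert(unit > 0);
	return ((x + unit - 1) / unit) * unit;
}
```

With a unit of 3 blocks, the mask test reports that 4 is aligned and that 6
is not, which is wrong in both directions; the division test gets both right.
For a power-of-two unit such as 4096 the two tests agree.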


diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index a2b1c9d933385..ae74ff016ee1f 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -590,6 +590,75 @@ xfs_generic_exchrange_finish(
 	return file_remove_privs(fxr->file2);
 }
 
+/*
+ * Check the alignment of an exchange request when the allocation unit size
+ * isn't a power of two.  The generic file-level helpers use (fast)
+ * bitmask-based alignment checks, but here we have to use slow long division.
+ */
+static int
+xfs_exchrange_check_rtalign(
+	const struct xfs_exchrange	*fxr,
+	struct xfs_inode		*ip1,
+	struct xfs_inode		*ip2,
+	unsigned int			alloc_unit)
+{
+	uint64_t			length = fxr->length;
+	uint64_t			blen;
+	loff_t				size1, size2;
+
+	size1 = i_size_read(VFS_I(ip1));
+	size2 = i_size_read(VFS_I(ip2));
+
+	/* The start of both ranges must be aligned to a rt extent. */
+	if (!isaligned_64(fxr->file1_offset, alloc_unit) ||
+	    !isaligned_64(fxr->file2_offset, alloc_unit))
+		return -EINVAL;
+
+	if (fxr->flags & XFS_EXCHRANGE_TO_EOF)
+		length = max_t(int64_t, size1 - fxr->file1_offset,
+					size2 - fxr->file2_offset);
+
+	/*
+	 * If the user wanted us to exchange up to the infile's EOF, round up
+	 * to the next rt extent boundary for this check.  Do the same for the
+	 * outfile.
+	 *
+	 * Otherwise, reject the range length if it's not rt extent aligned.
+	 * We already confirmed the starting offsets' rt extent block
+	 * alignment.
+	 */
+	if (fxr->file1_offset + length == size1)
+		blen = roundup_64(size1, alloc_unit) - fxr->file1_offset;
+	else if (fxr->file2_offset + length == size2)
+		blen = roundup_64(size2, alloc_unit) - fxr->file2_offset;
+	else if (!isaligned_64(length, alloc_unit))
+		return -EINVAL;
+	else
+		blen = length;
+
+	/* Don't allow overlapped exchanges within the same file. */
+	if (ip1 == ip2 &&
+	    fxr->file2_offset + blen > fxr->file1_offset &&
+	    fxr->file1_offset + blen > fxr->file2_offset)
+		return -EINVAL;
+
+	/*
+	 * Ensure that we don't exchange a partial EOF rt extent into the
+	 * middle of another file.
+	 */
+	if (isaligned_64(length, alloc_unit))
+		return 0;
+
+	blen = length;
+	if (fxr->file2_offset + length < size2)
+		blen = rounddown_64(blen, alloc_unit);
+
+	if (fxr->file1_offset + blen < size1)
+		blen = rounddown_64(blen, alloc_unit);
+
+	return blen == length ? 0 : -EINVAL;
+}
+
 /* Prepare two files to have their data exchanged. */
 STATIC int
 xfs_exchrange_prep(
@@ -597,6 +666,7 @@ xfs_exchrange_prep(
 	struct xfs_inode	*ip1,
 	struct xfs_inode	*ip2)
 {
+	struct xfs_mount	*mp = ip2->i_mount;
 	unsigned int		alloc_unit = xfs_inode_alloc_unitsize(ip2);
 	int			error;
 
@@ -606,13 +676,18 @@ xfs_exchrange_prep(
 	if (XFS_IS_REALTIME_INODE(ip1) != XFS_IS_REALTIME_INODE(ip2))
 		return -EINVAL;
 
-	/*
-	 * The alignment checks in the generic helpers cannot deal with
-	 * allocation units that are not powers of 2.  This can happen with the
-	 * realtime volume if the extent size is set.
-	 */
-	if (!is_power_of_2(alloc_unit))
-		return -EOPNOTSUPP;
+	/* Check non-power of two alignment issues, if necessary. */
+	if (!is_power_of_2(alloc_unit)) {
+		error = xfs_exchrange_check_rtalign(fxr, ip1, ip2, alloc_unit);
+		if (error)
+			return error;
+
+		/*
+		 * Do the generic file-level checks with the regular block
+		 * alignment.
+		 */
+		alloc_unit = mp->m_sb.sb_blocksize;
+	}
 
 	error = xfs_generic_exchrange_prep(fxr, alloc_unit);
 	if (error || fxr->length == 0)



* [PATCH 13/14] docs: update swapext -> exchmaps language
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (11 preceding siblings ...)
  2024-02-27  2:23 ` [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
@ 2024-02-27  2:24 ` Darrick J. Wong
  2024-02-28 15:52   ` Christoph Hellwig
  2024-02-27  2:24 ` [PATCH 14/14] xfs: enable logged file mapping exchange feature Darrick J. Wong
                   ` (3 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:24 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Start reworking the atomic swapext design documentation to refer to its
new file contents/mapping exchange name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  259 +++++++++++---------
 1 file changed, 136 insertions(+), 123 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 1d161752f09ed..f72e1ed2d0e5f 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
 function frees them all because compaction is not needed.
 
 The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
 However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
 
 Repairs for file-based metadata such as extended attributes, directories,
 symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
 Afterward, the mappings in the old file fork are the candidate blocks for
 disposal.
 
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
 cannot be staged in memory, even when a paging scheme is available.
 Therefore, online repair of file-based metadata createas a temporary file in
 the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
 Once the repair is complete, the old fork can be reaped as necessary; if the
 system goes down during the reap, the iunlink code will delete the blocks
 during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
 This dependency is the reason why online repair can only use pageable kernel
 memory to stage ondisk space usage information.
 
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file.  The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
 
 There is a downside to the reaping process -- if the system crashes during the
 reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
 series.
 
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
 
 Once repair builds a temporary file with a new data structure written into
 it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
 These problems are overcome by creating a new deferred operation and a new type
 of log intent item to track the progress of an operation to exchange two file
 ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
 The new log item records the progress of the exchange to ensure that once an
 exchange begins, it will always run to completion, even there are
 interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS`` log-incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 
 The proposed patchset is the
-`atomic extent swap
+`file contents exchange
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
 series.
 
@@ -4061,72 +4067,73 @@ series.
 | The feature bit will not be cleared from the superblock until the log    |
 | becomes clean.                                                           |
 |                                                                          |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the        |
+| Log-assisted extended attribute updates and file content exchanges both  |
+| use log incompat features and provide convenience wrappers around the    |
 | functionality.                                                           |
 +--------------------------------------------------------------------------+
 
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
 
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
 The goal is to exchange all file fork mappings between two file fork offset
 ranges.
 There are likely to be many extent mappings in each fork, and the edges of
 the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
 such as exchanging file sizes, inode flags, or conversion of fork data to local
 format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
 
 .. code-block:: c
 
-	struct xfs_swapext_intent {
+	struct xfs_exchmaps_intent {
 	    /* Inodes participating in the operation. */
-	    struct xfs_inode    *sxi_ip1;
-	    struct xfs_inode    *sxi_ip2;
+	    struct xfs_inode    *xmi_ip1;
+	    struct xfs_inode    *xmi_ip2;
 
 	    /* File offset range information. */
-	    xfs_fileoff_t       sxi_startoff1;
-	    xfs_fileoff_t       sxi_startoff2;
-	    xfs_filblks_t       sxi_blockcount;
+	    xfs_fileoff_t       xmi_startoff1;
+	    xfs_fileoff_t       xmi_startoff2;
+	    xfs_filblks_t       xmi_blockcount;
 
 	    /* Set these file sizes after the operation, unless negative. */
-	    xfs_fsize_t         sxi_isize1;
-	    xfs_fsize_t         sxi_isize2;
+	    xfs_fsize_t         xmi_isize1;
+	    xfs_fsize_t         xmi_isize2;
 
-	    /* XFS_SWAP_EXT_* log operation flags */
-	    uint64_t            sxi_flags;
+	    /* XFS_EXCHMAPS_* log operation flags */
+	    uint64_t            xmi_flags;
 	};
 
 The new log intent item contains enough information to track two logical fork
 offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
 blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
 
-When the extent swap is initiated, the sequence of operations is as follows:
+When the exchange is initiated, the sequence of operations is as follows:
 
-1. Create a deferred work item for the extent swap.
-   At the start, it should contain the entirety of the file ranges to be
-   swapped.
+1. Create a deferred work item for the file mapping exchange.
+   At the start, it should contain the entirety of the file block ranges to be
+   exchanged.
 
 2. Call ``xfs_defer_finish`` to process the exchange.
-   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
    This will log an extent swap intent item to the transaction for the deferred
-   extent swap work item.
+   mapping exchange work item.
 
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
 
-   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
-      ``sxi_startoff2``, respectively, and compute the longest extent that can
-      be swapped in a single step.
+   a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+      ``xmi_startoff2``, respectively, and compute the longest extent that can
+      be exchanged in a single step.
       This is the minimum of the two ``br_blockcount`` s in the mappings.
       Keep advancing through the file forks until at least one of the mappings
       contains written blocks.
@@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
 
    g. Extend the ondisk size of either file if necessary.
 
-   h. Log an extent swap done log item for the extent swap intent log item
-      that was read at the start of step 3.
+   h. Log a mapping exchange done log item for the mapping exchange intent log
+      item that was read at the start of step 3.
 
    i. Compute the amount of file range that has just been covered.
       This quantity is ``(map1.br_startoff + map1.br_blockcount -
-      sxi_startoff1)``, because step 3a could have skipped holes.
+      xmi_startoff1)``, because step 3a could have skipped holes.
 
-   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+   j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
       by the number of blocks computed in the previous step, and decrease
-      ``sxi_blockcount`` by the same quantity.
+      ``xmi_blockcount`` by the same quantity.
       This advances the cursor.
 
-   k. Log a new extent swap intent log item reflecting the advanced state of
-      the work item.
+   k. Log a new mapping exchange intent log item reflecting the advanced state
+      of the work item.
 
    l. Return the proper error code (EAGAIN) to the deferred operation manager
       to inform it that there is more work to be done.
@@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
    This will be discussed in more detail in subsequent sections.
 
 If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash of
+both.
 
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
 
 There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
 First, regular files require the page cache to be flushed to disk before the
 operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
 The preparation step scans the ranges of both files to estimate:
 
 - Data device blocks needed to handle the repeated updates to the fork
@@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
   to different extents on the realtime volume, which could happen if the
   operation fails to run to completion.
 
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
 Regular users are required to abide the quota limits, though metadata repairs
 may exceed quota to resolve inconsistent metadata elsewhere.
 
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
 
 Extended attributes, symbolic links, and directories can set the fork format to
 "local" and treat the fork as a literal area for data storage.
 Metadata repairs must take extra steps to support these cases:
 
 - If both forks are in local format and the fork areas are large enough, the
-  swap is performed by copying the incore fork contents, logging both forks,
-  and committing.
-  The atomic extent swap mechanism is not necessary, since this can be done
-  with a single transaction.
+  exchange is performed by copying the incore fork contents, logging both
+  forks, and committing.
+  The atomic file mapping exchange mechanism is not necessary, since this can
+  be done with a single transaction.
 
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+  used.
 
 - Otherwise, only one fork is in local format.
   The contents of the local format fork are converted to a block to perform the
-  swap.
+  exchange.
   The conversion to block format must be done in the same transaction that
-  logs the initial extent swap intent log item.
-  The regular atomic extent swap is used to exchange the mappings.
-  Special flags are set on the swap operation so that the transaction can be
-  rolled one more time to convert the second file's fork back to local format
-  so that the second file will be ready to go as soon as the ILOCK is dropped.
+  logs the initial mapping exchange intent log item.
+  The regular atomic mapping exchange is used to exchange the metadata file
+  mappings.
+  Special flags are set on the exchange operation so that the transaction can
+  be rolled one more time to convert the second file's fork back to local
+  format so that the second file will be ready to go as soon as the ILOCK is
+  dropped.
 
 Extended attributes and directories stamp the owning inode into every block,
 but the buffer verifiers do not actually check the inode number!
 Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
 
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
 If the filesystem should go down during the reap part of the repair, the
 iunlink processing at the end of recovery will free both the temporary file and
 whatever blocks were not reaped.
 However, this iunlink processing omits the cross-link detection of online
 repair, and is not completely foolproof.
 
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
 
 To repair a metadata file, online repair proceeds as follows:
 
@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
    file.
    The same fork must be written to as is being repaired.
 
-3. Commit the scrub transaction, since the swap estimation step must be
-   completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+   must be completed before transaction reservations are made.
 
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
    the appropriate resource reservations, locks, and fill out a ``struct
-   xfs_swapext_req`` with the details of the swap operation.
+   xfs_exchmaps_req`` with the details of the exchange operation.
 
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
 
 6. Commit the transaction to complete the repair.
 
@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap:
 3. Compare the contents of the xfile against the ondisk file.
 
 To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
 The temporary file is then reaped.
 
 The proposed patchset is the
@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows:
    memory or there are no more attr fork blocks to examine, unlock the file and
    add the staged extended attributes to the temporary file.
 
-3. Use atomic extent swapping to exchange the new and old extended attribute
-   structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+   attribute structures.
    The old attribute blocks are now attached to the temporary file.
 
 4. Reap the temporary file.
@@ -4410,7 +4421,8 @@ salvaging directories is straightforward:
    directory and add the staged dirents into the temporary directory.
    Truncate the staging files.
 
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+   structures.
    The old directory blocks are now attached to the temporary file.
 
 5. Reap the temporary file.
@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
       Instead, we stash updates in the xfarray and rely on the scanner thread
       to apply the stashed updates to the temporary directory.
 
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, atomically exchange the contents of the temporary
    directory and the directory being repaired.
    The temporary directory now contains the damaged directory structure.
 
@@ -4629,8 +4641,8 @@ directory reconstruction:
 
 5. Copy all non-parent pointer extended attributes to the temporary file.
 
-6. When the scan is complete, atomically swap the attribute fork of the
-   temporary file and the file being repaired.
+6. When the scan is complete, atomically exchange the mappings of the attribute
+   forks of the temporary file and the file being repaired.
    The temporary file now contains the damaged extended attribute structure.
 
 7. Reap the temporary file.
@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it
 has been built, and why.
 Please feel free to contact the XFS mailing list with questions.
 
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
 
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
 This frontend has been out for review for several years now, though the
 necessary refinements to online repair and lack of customer demand mean that
 the proposal has not been pushed very hard.
 
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
 
 As mentioned earlier, XFS has long had the ability to swap extents between
 files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to
 develop an iterative mechanism that used deferred bmap and rmap operations to
 swap mappings one at a time.
 This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
 For the narrow case of file defragmentation, the file contents must be
 identical, so the recovery guarantees are not much of a gain.
 
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchanges are much more flexible than the existing swapext
 implementations because it can guarantee that the caller never sees a mix of
 old and new contents even after a crash, and it can operate on two arbitrary
 file fork ranges.
@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases:
   Next, it opens a temporary file and calls the file clone operation to reflink
   the first file's contents into the temporary file.
   Writes to the original file should instead be written to the temporary file.
-  Finally, the process calls the atomic extent swap system call
-  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
-  of the updates to the original file, or none of them.
+  Finally, the process calls the atomic file mapping exchange system call
+  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+  committing all of the updates to the original file, or none of them.
 
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
 
 - **Transactional file updates**: The same mechanism as above, but the caller
   only wants the commit to occur if the original file's contents have not
@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases:
   change timestamps of the original file before reflinking its data to the
   temporary file.
   When the program is ready to commit the changes, it passes the timestamps
-  into the kernel as arguments to the atomic extent swap system call.
+  into the kernel as arguments to the atomic file mapping exchange system call.
   The kernel only commits the changes if the provided timestamps match the
   original file.
+  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
 
 - **Emulation of atomic block device writes**: Export a block device with a
   logical sector size matching the filesystem block size to force all writes
   to be aligned to the filesystem block size.
   Stage all writes to a temporary file, and when that is complete, call the
-  atomic extent swap system call with a flag to indicate that holes in the
-  temporary file should be ignored.
+  atomic file mapping exchange system call with a flag to indicate that holes
+  in the temporary file should be ignored.
   This emulates an atomic device write in software, and can support arbitrary
   scattered writes.
 
@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file.
 Cloning the extent means that the original owners cannot overwrite the
 contents; any changes will be written somewhere else via copy-on-write.
 Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
 mapping away from the area being cleared.
 When all other mappings have been moved, clearspace reflinks the space into the
 space collector file so that it becomes unavailable.
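
For readers following the step 3 loop in the documentation above, the
cursor arithmetic in steps 3i through 3l can be sketched in plain C.
This is a minimal userspace model, not the kernel code: the demo_* names
and the trimmed-down structs are assumptions made for illustration, the
negative EAGAIN return follows the usual kernel convention rather than
anything stated in the text, and only the xmi_* and br_* field names
come from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical stand-in for the file1 mapping found in step 3a. */
struct demo_mapping {
	uint64_t br_startoff;	/* first file block of the mapping */
	uint64_t br_blockcount;	/* length of the mapping in blocks */
};

/* Hypothetical subset of the work item; field names follow the patch. */
struct demo_intent {
	uint64_t xmi_startoff1;
	uint64_t xmi_startoff2;
	uint64_t xmi_blockcount;
};

/*
 * Model of steps 3i-3l: advance the cursor past the mapping that was just
 * exchanged.  Returns -EAGAIN while blocks remain, 0 once the range is done.
 */
static int demo_advance_cursor(struct demo_intent *xmi,
			       const struct demo_mapping *map1)
{
	uint64_t covered;

	/* Step 3a only moves forward, so the mapping ends beyond the cursor. */
	assert(map1->br_startoff + map1->br_blockcount > xmi->xmi_startoff1);

	/* Step 3i: this count includes any holes skipped before map1. */
	covered = map1->br_startoff + map1->br_blockcount - xmi->xmi_startoff1;

	/* Sketch-only guard (an assumption): never run past the range. */
	if (covered > xmi->xmi_blockcount)
		covered = xmi->xmi_blockcount;

	/* Step 3j: move both offsets forward and shrink the remaining count. */
	xmi->xmi_startoff1 += covered;
	xmi->xmi_startoff2 += covered;
	xmi->xmi_blockcount -= covered;

	/* Step 3l: tell the caller whether more work is pending. */
	return xmi->xmi_blockcount ? -EAGAIN : 0;
}
```

With the cursor at (100, 300) and 50 blocks left, a file1 mapping at
block 104 of length 16 covers 20 blocks (4 skipped plus 16 exchanged),
leaving the cursor at (120, 320) with 30 blocks to go.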



* [PATCH 14/14] xfs: enable logged file mapping exchange feature
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (12 preceding siblings ...)
  2024-02-27  2:24 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
@ 2024-02-27  2:24 ` Darrick J. Wong
  2024-02-28 15:52   ` Christoph Hellwig
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27  2:24 UTC (permalink / raw
  To: djwong; +Cc: linux-xfs, hch

From: Darrick J. Wong <djwong@kernel.org>

Add the XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS feature to the set of features
that we will permit when mounting a filesystem.  This turns on support
for the file range exchange feature.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_format.h |    3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)


diff --git a/fs/xfs/libxfs/xfs_format.h b/fs/xfs/libxfs/xfs_format.h
index 753adde56a2d0..aa2ad7e04202b 100644
--- a/fs/xfs/libxfs/xfs_format.h
+++ b/fs/xfs/libxfs/xfs_format.h
@@ -398,7 +398,8 @@ xfs_sb_has_incompat_feature(
  */
 #define XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS (1 << 1)
 #define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
-	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)
+		(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS | \
+		 XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS)
 #define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL
 static inline bool
 xfs_sb_has_incompat_log_feature(
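
As a rough illustration of what widening the mask changes, the sketch
below checks a set of log-incompat bits against a "known" mask before
and after this patch.  It is a standalone model under stated
assumptions: the value (1 << 0) for XFS_SB_FEAT_INCOMPAT_LOG_XATTRS is
assumed rather than shown in the hunk, and DEMO_LOG_ALL_BEFORE and
demo_has_unknown_log_features() are hypothetical names, not kernel
functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* EXCHMAPS is taken from the hunk; the XATTRS value is an assumption. */
#define XFS_SB_FEAT_INCOMPAT_LOG_XATTRS   (1 << 0)
#define XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS (1 << 1)

/* Mask as it reads after this patch. */
#define XFS_SB_FEAT_INCOMPAT_LOG_ALL \
		(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS | \
		 XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS)
#define XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN	~XFS_SB_FEAT_INCOMPAT_LOG_ALL

/* Hypothetical name for the mask as it read before this patch. */
#define DEMO_LOG_ALL_BEFORE	(XFS_SB_FEAT_INCOMPAT_LOG_XATTRS)

/* The known and unknown masks must never share a bit. */
static_assert((XFS_SB_FEAT_INCOMPAT_LOG_ALL &
	       XFS_SB_FEAT_INCOMPAT_LOG_UNKNOWN) == 0,
	      "known and unknown masks overlap");

/*
 * Hypothetical helper: true if the superblock's log-incompat bits include
 * anything outside the given set of known features.
 */
static bool demo_has_unknown_log_features(uint32_t sb_bits, uint32_t known)
{
	return (sb_bits & ~known) != 0;
}
```

Under the old mask a superblock with the EXCHMAPS bit set counts as
carrying an unknown log feature; under the new mask it does not.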



* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (13 preceding siblings ...)
  2024-02-27  2:24 ` [PATCH 14/14] xfs: enable logged file mapping exchange feature Darrick J. Wong
@ 2024-02-27  9:23 ` Amir Goldstein
  2024-02-27 10:53   ` Jeff Layton
  2024-02-27 15:45   ` Darrick J. Wong
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
  16 siblings, 2 replies; 62+ messages in thread
From: Amir Goldstein @ 2024-02-27  9:23 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically.  This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes.  A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file.  If the updates are aligned with the file block
> size, a new (since v2) flag provides for exchanging only the written
> areas.  Callers can arrange for the update to be rejected if the
> original file has been changed.
>
> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory, fsync the new
> file, rename the new file on top of the old filename, and then fsync the
> directory.  People get it wrong all the time, and $fs hacks abound.
> Here are the proposed manual pages:
>
> IOCTL-XFS-EXCHANGE-RANGE(2)  System Calls Manual  IOCTL-XFS-EXCHANGE-RANGE(2)
>
> NAME
>        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
>        two files
>
> SYNOPSIS
>        #include <sys/ioctl.h>
>        #include <xfs/xfs_fs_staging.h>
>
>        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
>        xfs_exch_range *arg);
>
> DESCRIPTION
>        Given  a  range  of bytes in a first file file1_fd and a second
>        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
>        changes the contents of the two ranges.
>
>        Exchanges  are  atomic  with  regards to concurrent file opera‐
>        tions, so no userspace-level locks need to be taken  to  obtain
>        consistent  results.  Implementations must guarantee that read‐
>        ers see either the old contents or the new  contents  in  their
>        entirety, even if the system fails.
>
>        The  system  call  parameters are conveyed in structures of the
>        following form:
>
>            struct xfs_exch_range {
>                __s64    file1_fd;
>                __s64    file1_offset;
>                __s64    file2_offset;
>                __s64    length;
>                __u64    flags;
>
>                __u64    pad;
>            };
>
>        The field pad must be zero.
>
>        The fields file1_fd, file1_offset, and length define the  first
>        range of bytes to be exchanged.
>
>        The fields file2_fd, file2_offset, and length define the second
>        range of bytes to be exchanged.
>
>        Both files must be from the same filesystem mount.  If the  two
>        file  descriptors represent the same file, the byte ranges must
>        not overlap.  Most  disk-based  filesystems  require  that  the
>        starts  of  both ranges must be aligned to the file block size.
>        If this is the case, the ends of the ranges  must  also  be  so
>        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
>
>        The field flags controls the behavior of the exchange operation.
>
>            XFS_EXCHRANGE_TO_EOF
>                   Ignore  the length parameter.  All bytes in file1_fd
>                   from file1_offset to EOF are moved to file2_fd,  and
>                   file2's  size is set to (file2_offset+(file1_length-
>                   file1_offset)).  Meanwhile, all bytes in file2  from
>                   file2_offset  to  EOF are moved to file1 and file1's
>                   size   is   set   to    (file1_offset+(file2_length-
>                   file2_offset)).
>
>            XFS_EXCHRANGE_DSYNC
>                   Ensure  that  all modified in-core data in both file
>                   ranges and all metadata updates  pertaining  to  the
>                   exchange operation are flushed to persistent storage
>                   before the call returns.  Opening  either  file  de‐
>                   scriptor  with  O_SYNC or O_DSYNC will have the same
>                   effect.
>
>            XFS_EXCHRANGE_FILE1_WRITTEN
>                   Only exchange sub-ranges of file1_fd that are  known
>                   to  contain  data  written  by application software.
>                   Each sub-range may be  expanded  (both  upwards  and
>                   downwards)  to  align with the file allocation unit.
>                   For files on the data device, this is one filesystem
>                   block.   For  files  on the realtime device, this is
>                   the realtime extent size.  This facility can be used
>                   to  implement  fast  atomic scatter-gather writes of
>                   any complexity for software-defined storage  targets
>                   if  all  writes  are  aligned to the file allocation
>                   unit.
>
>            XFS_EXCHRANGE_DRY_RUN
>                   Check the parameters and the feasibility of the  op‐
>                   eration, but do not change anything.
>
> RETURN VALUE
>        On  error, -1 is returned, and errno is set to indicate the er‐
>        ror.
>
> ERRORS
>        Error codes can be one of, but are not limited to, the  follow‐
>        ing:
>
>        EBADF  file1_fd  is not open for reading and writing or is open
>               for append-only writes; or  file2_fd  is  not  open  for
>               reading and writing or is open for append-only writes.
>
>        EINVAL The  parameters  are  not correct for these files.  This
>               error can also appear if either file  descriptor  repre‐
>               sents  a device, FIFO, or socket.  Disk filesystems gen‐
>               erally require the offset and  length  arguments  to  be
>               aligned to the fundamental block sizes of both files.
>
>        EIO    An I/O error occurred.
>
>        EISDIR One of the files is a directory.
>
>        ENOMEM The  kernel  was unable to allocate sufficient memory to
>               perform the operation.
>
>        ENOSPC There is not enough free space in the filesystem  to  ex‐
>               change the contents safely.
>
>        EOPNOTSUPP
>               The filesystem does not support exchanging bytes between
>               the two files.
>
>        EPERM  file1_fd or file2_fd are immutable.
>
>        ETXTBSY
>               One of the files is a swap file.
>
>        EUCLEAN
>               The filesystem is corrupt.
>
>        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
>               filesystem.
>
> CONFORMING TO
>        This API is XFS-specific.
>
> USE CASES
>        Several  use  cases  are imagined for this system call.  In all
>        cases, application software must coordinate updates to the file
>        because the exchange is performed unconditionally.
>
>        The  first  is a data storage program that wants to commit non-
>        contiguous updates to a file atomically and  coordinates  write
>        access  to that file.  This can be done by creating a temporary
>        file, calling FICLONE(2) to share the contents, and staging the
>        updates into the temporary file.  The FULL_FILES flag is recom‐
>        mended for this purpose.  The temporary file can be deleted  or
>        punched out afterwards.
>
>        An example program might look like this:
>
>            int fd = open("/some/file", O_RDWR);
>            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
>
>            ioctl(temp_fd, FICLONE, fd);
>
>            /* append 1MB of records */
>            lseek(temp_fd, 0, SEEK_END);
>            write(temp_fd, data1, 1000000);
>
>            /* update record index */
>            pwrite(temp_fd, data1, 600, 98765);
>            pwrite(temp_fd, data2, 320, 54321);
>            pwrite(temp_fd, data2, 15, 0);
>
>            /* commit the entire update */
>            struct xfs_exch_range args = {
>                .file1_fd = temp_fd,
>                .flags = XFS_EXCHRANGE_TO_EOF,
>            };
>
>            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
>        The  second  is  a  software-defined  storage host (e.g. a disk
>        jukebox) which implements an atomic scatter-gather  write  com‐
>        mand.   Provided the exported disk's logical block size matches
>        the file's allocation unit size, this can be done by creating a
>        temporary file and writing the data at the appropriate offsets.
>        It is recommended that the temporary file be truncated  to  the
>        size  of  the  regular file before any writes are staged to the
>        temporary file to avoid issues with zeroing during  EOF  exten‐
>        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
>        only the file allocation units involved  in  the  emulated  de‐
>        vice's  write  command.  The temporary file should be truncated
>        or punched out completely before being reused to stage  another
>        write.
>
>        An example program might look like this:
>
>            int fd = open("/some/file", O_RDWR);
>            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
>            struct stat sb;
>            int blksz;
>
>            fstat(fd, &sb);
>            blksz = sb.st_blksize;
>
>            /* land scatter gather writes between 100fsb and 500fsb */
>            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
>            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
>            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
>
>            /* commit the entire update */
>            struct xfs_exch_range args = {
>                .file1_fd = temp_fd,
>                .file1_offset = blksz * 100,
>                .file2_offset = blksz * 100,
>                .length       = blksz * 400,
>                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
>                                XFS_EXCHRANGE_DSYNC,
>            };
>
>            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
>
> NOTES
>        Some  filesystems may limit the amount of data or the number of
>        extents that can be exchanged in a single call.
>
> SEE ALSO
>        ioctl(2)
>
> XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
>
> NAME
>        ioctl_xfs_commit_range - conditionally exchange the contents of
>        parts of two files
>
> SYNOPSIS
>        #include <sys/ioctl.h>
>        #include <xfs/xfs_fs_staging.h>
>
>        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
>        mit_range *arg);
>
> DESCRIPTION
>        Given  a  range  of bytes in a first file file1_fd and a second
>        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
>        changes  the contents of the two ranges if file2_fd passes cer‐
>        tain freshness criteria.
>
>        After locking both files but before  exchanging  the  contents,
>        the  supplied  file2_ino field must match file2_fd's inode num‐
>        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
>        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
>        cation time and change time of file2.  If they  do  not  match,
>        EBUSY will be returned.
>

Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?

And for a new API, wouldn't it be better to use change_cookie (a.k.a. i_version)?
Even if this API is designed to be hoisted out of XFS at some future time,
is there a real need to support it on filesystems that do not support
i_version?

Not to mention the fact that POSIX does not explicitly define how ctime should
behave with changes to fiemap (uninitialized extent and all), so who knows
how other filesystems may update ctime in those cases.

I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
really explicitly requests a bump of i_version on the next change.

Thanks,
Amir.


* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 10:53   ` Jeff Layton
  2024-02-27 16:06     ` Darrick J. Wong
  2024-02-27 23:46     ` Dave Chinner
  2024-02-27 15:45   ` Darrick J. Wong
  1 sibling, 2 replies; 62+ messages in thread
From: Jeff Layton @ 2024-02-27 10:53 UTC
  To: Amir Goldstein, Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch

On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > 
> > Hi all,
> > 
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> > 
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> > 
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.  If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas.  Callers can arrange for the update to be rejected if the
> > original file has been changed.
> > 
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> > 

This is a cool idea!  I've had some handwavy ideas about making a gated
write() syscall (i.e. only write if the change cookie hasn't changed),
but something like this may be a simpler lift initially.

> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> > 
> > NAME
> >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> >        two files
> > 
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> > 
> >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> >        xfs_exch_range *arg);
> > 
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes the contents of the two ranges.
> > 
> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> > 
> >        The  system  call  parameters are conveyed in structures of the
> >        following form:
> > 
> >            struct xfs_exch_range {
> >                __s64    file1_fd;
> >                __s64    file1_offset;
> >                __s64    file2_offset;
> >                __s64    length;
> >                __u64    flags;
> > 
> >                __u64    pad;
> >            };
> > 
> >        The field pad must be zero.
> > 
> >        The fields file1_fd, file1_offset, and length define the  first
> >        range of bytes to be exchanged.
> > 
> >        The fields file2_fd, file2_offset, and length define the second
> >        range of bytes to be exchanged.
> > 
> >        Both files must be from the same filesystem mount.  If the  two
> >        file  descriptors represent the same file, the byte ranges must
> >        not overlap.  Most  disk-based  filesystems  require  that  the
> >        starts  of  both ranges must be aligned to the file block size.
> >        If this is the case, the ends of the ranges  must  also  be  so
> >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > 
> >        The field flags control the behavior of the exchange operation.
> > 
> >            XFS_EXCHRANGE_TO_EOF
> >                   Ignore  the length parameter.  All bytes in file1_fd
> >                   from file1_offset to EOF are moved to file2_fd,  and
> >                   file2's  size is set to (file2_offset+(file1_length-
> >                   file1_offset)).  Meanwhile, all bytes in file2  from
> >                   file2_offset  to  EOF are moved to file1 and file1's
> >                   size   is   set   to    (file1_offset+(file2_length-
> >                   file2_offset)).
> > 
> >            XFS_EXCHRANGE_DSYNC
> >                   Ensure  that  all modified in-core data in both file
> >                   ranges and all metadata updates  pertaining  to  the
> >                   exchange operation are flushed to persistent storage
> >                   before the call returns.  Opening  either  file  de‐
> >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> >                   effect.
> > 
> >            XFS_EXCHRANGE_FILE1_WRITTEN
> >                   Only exchange sub-ranges of file1_fd that are  known
> >                   to  contain  data  written  by application software.
> >                   Each sub-range may be  expanded  (both  upwards  and
> >                   downwards)  to  align with the file allocation unit.
> >                   For files on the data device, this is one filesystem
> >                   block.   For  files  on the realtime device, this is
> >                   the realtime extent size.  This facility can be used
> >                   to  implement  fast  atomic scatter-gather writes of
> >                   any complexity for software-defined storage  targets
> >                   if  all  writes  are  aligned to the file allocation
> >                   unit.
> > 
> >            XFS_EXCHRANGE_DRY_RUN
> >                   Check the parameters and the feasibility of the  op‐
> >                   eration, but do not change anything.
> > 
> > RETURN VALUE
> >        On  error, -1 is returned, and errno is set to indicate the er‐
> >        ror.
> > 
> > ERRORS
> >        Error codes can be one of, but are not limited to, the  follow‐
> >        ing:
> > 
> >        EBADF  file1_fd  is not open for reading and writing or is open
> >               for append-only writes; or  file2_fd  is  not  open  for
> >               reading and writing or is open for append-only writes.
> > 
> >        EINVAL The  parameters  are  not correct for these files.  This
> >               error can also appear if either file  descriptor  repre‐
> >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> >               erally require the offset and  length  arguments  to  be
> >               aligned to the fundamental block sizes of both files.
> > 
> >        EIO    An I/O error occurred.
> > 
> >        EISDIR One of the files is a directory.
> > 
> >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> >               perform the operation.
> > 
> >        ENOSPC There is not enough free space in the filesystem to
> >               exchange the contents safely.
> > 
> >        EOPNOTSUPP
> >               The filesystem does not support exchanging bytes between
> >               the two files.
> > 
> >        EPERM  file1_fd or file2_fd are immutable.
> > 
> >        ETXTBSY
> >               One of the files is a swap file.
> > 
> >        EUCLEAN
> >               The filesystem is corrupt.
> > 
> >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> >               filesystem.
> > 
> > CONFORMING TO
> >        This API is XFS-specific.
> > 
> > USE CASES
> >        Several  use  cases  are imagined for this system call.  In all
> >        cases, application software must coordinate updates to the file
> >        because the exchange is performed unconditionally.
> > 
> >        The  first  is a data storage program that wants to commit non-
> >        contiguous updates to a file atomically and  coordinates  write
> >        access  to that file.  This can be done by creating a temporary
> >        file, calling FICLONE(2) to share the contents, and staging the
> >        updates into the temporary file.  The FULL_FILES flag is recom‐
> >        mended for this purpose.  The temporary file can be deleted  or
> >        punched out afterwards.
> > 
> >        An example program might look like this:
> > 
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > 
> >            ioctl(temp_fd, FICLONE, fd);
> > 
> >            /* append 1MB of records */
> >            lseek(temp_fd, 0, SEEK_END);
> >            write(temp_fd, data1, 1000000);
> > 
> >            /* update record index */
> >            pwrite(temp_fd, data1, 600, 98765);
> >            pwrite(temp_fd, data2, 320, 54321);
> >            pwrite(temp_fd, data2, 15, 0);
> > 
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .flags = XFS_EXCHRANGE_TO_EOF,
> >            };
> > 
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > 
> >        The  second  is  a  software-defined  storage host (e.g. a disk
> >        jukebox) which implements an atomic scatter-gather  write  com‐
> >        mand.   Provided the exported disk's logical block size matches
> >        the file's allocation unit size, this can be done by creating a
> >        temporary file and writing the data at the appropriate offsets.
> >        It is recommended that the temporary file be truncated  to  the
> >        size  of  the  regular file before any writes are staged to the
> >        temporary file to avoid issues with zeroing during  EOF  exten‐
> >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> >        only the file allocation units involved  in  the  emulated  de‐
> >        vice's  write  command.  The temporary file should be truncated
> >        or punched out completely before being reused to stage  another
> >        write.
> > 
> >        An example program might look like this:
> > 
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >            struct stat sb;
> >            int blksz;
> > 
> >            fstat(fd, &sb);
> >            blksz = sb.st_blksize;
> > 
> >            /* land scatter gather writes between 100fsb and 500fsb */
> >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > 
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .file1_offset = blksz * 100,
> >                .file2_offset = blksz * 100,
> >                .length       = blksz * 400,
> >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> >                                XFS_EXCHRANGE_DSYNC,
> >            };
> > 
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > 
> > NOTES
> >        Some  filesystems may limit the amount of data or the number of
> >        extents that can be exchanged in a single call.
> > 
> > SEE ALSO
> >        ioctl(2)
> > 
> > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> > 
> > NAME
> >        ioctl_xfs_commit_range - conditionally exchange the contents of
> >        parts of two files
> > 
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> > 
> >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> >        mit_range *arg);
> > 
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes  the contents of the two ranges if file2_fd passes cer‐
> >        tain freshness criteria.
> > 
> >        After locking both files but before  exchanging  the  contents,
> >        the  supplied  file2_ino field must match file2_fd's inode num‐
> >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> >        cation time and change time of file2.  If they  do  not  match,
> >        EBUSY will be returned.
> > 
> 
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?
> 

ctime should always change if the mtime does. An mtime update means that
the metadata was updated, so you also need to update the ctime. 

> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)
> 
> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.
> 
> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.
> 


I agree. Using an opaque change cookie would be a lot nicer from an API
standpoint, and shouldn't be subject to timestamp granularity issues.

That said, XFS's change cookie is currently broken. Dave C. said he had
some patches in progress to fix that however.
-- 
Jeff Layton <jlayton@kernel.org>


* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
  2024-02-27 10:53   ` Jeff Layton
@ 2024-02-27 15:45   ` Darrick J. Wong
  2024-02-27 16:58     ` Amir Goldstein
  1 sibling, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27 15:45 UTC
  To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file.  If the updates are aligned with the file block
> > size, a new (since v2) flag provides for exchanging only the written
> > areas.  Callers can arrange for the update to be rejected if the
> > original file has been changed.
> >
> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory, fsync the new
> > file, rename the new file on top of the old filename, and then fsync the
> > directory.  People get it wrong all the time, and $fs hacks abound.
> > Here are the proposed manual pages:
> >
> > IOCTL-XFS-EXCHANGE-RANGE(2System Calls ManuIOCTL-XFS-EXCHANGE-RANGE(2)
> >
> > NAME
> >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> >        two files
> >
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> >
> >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> >        xfs_exch_range *arg);
> >
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes the contents of the two ranges.
> >
> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> >
> >        The  system  call  parameters are conveyed in structures of the
> >        following form:
> >
> >            struct xfs_exch_range {
> >                __s64    file1_fd;
> >                __s64    file1_offset;
> >                __s64    file2_offset;
> >                __s64    length;
> >                __u64    flags;
> >
> >                __u64    pad;
> >            };
> >
> >        The field pad must be zero.
> >
> >        The fields file1_fd, file1_offset, and length define the  first
> >        range of bytes to be exchanged.
> >
> >        The fields file2_fd, file2_offset, and length define the second
> >        range of bytes to be exchanged.
> >
> >        Both files must be from the same filesystem mount.  If the  two
> >        file  descriptors represent the same file, the byte ranges must
> >        not overlap.  Most  disk-based  filesystems  require  that  the
> >        starts  of  both ranges must be aligned to the file block size.
> >        If this is the case, the ends of the ranges  must  also  be  so
> >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> >
> >        The field flags control the behavior of the exchange operation.
> >
> >            XFS_EXCHRANGE_TO_EOF
> >                   Ignore  the length parameter.  All bytes in file1_fd
> >                   from file1_offset to EOF are moved to file2_fd,  and
> >                   file2's  size is set to (file2_offset+(file1_length-
> >                   file1_offset)).  Meanwhile, all bytes in file2  from
> >                   file2_offset  to  EOF are moved to file1 and file1's
> >                   size   is   set   to    (file1_offset+(file2_length-
> >                   file2_offset)).
> >
> >            XFS_EXCHRANGE_DSYNC
> >                   Ensure  that  all modified in-core data in both file
> >                   ranges and all metadata updates  pertaining  to  the
> >                   exchange operation are flushed to persistent storage
> >                   before the call returns.  Opening  either  file  de‐
> >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> >                   effect.
> >
> >            XFS_EXCHRANGE_FILE1_WRITTEN
> >                   Only exchange sub-ranges of file1_fd that are  known
> >                   to  contain  data  written  by application software.
> >                   Each sub-range may be  expanded  (both  upwards  and
> >                   downwards)  to  align with the file allocation unit.
> >                   For files on the data device, this is one filesystem
> >                   block.   For  files  on the realtime device, this is
> >                   the realtime extent size.  This facility can be used
> >                   to  implement  fast  atomic scatter-gather writes of
> >                   any complexity for software-defined storage  targets
> >                   if  all  writes  are  aligned to the file allocation
> >                   unit.
> >
> >            XFS_EXCHRANGE_DRY_RUN
> >                   Check the parameters and the feasibility of the  op‐
> >                   eration, but do not change anything.
> >
> > RETURN VALUE
> >        On  error, -1 is returned, and errno is set to indicate the er‐
> >        ror.
> >
> > ERRORS
> >        Error codes can be one of, but are not limited to, the  follow‐
> >        ing:
> >
> >        EBADF  file1_fd  is not open for reading and writing or is open
> >               for append-only writes; or  file2_fd  is  not  open  for
> >               reading and writing or is open for append-only writes.
> >
> >        EINVAL The  parameters  are  not correct for these files.  This
> >               error can also appear if either file  descriptor  repre‐
> >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> >               erally require the offset and  length  arguments  to  be
> >               aligned to the fundamental block sizes of both files.
> >
> >        EIO    An I/O error occurred.
> >
> >        EISDIR One of the files is a directory.
> >
> >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> >               perform the operation.
> >
> >        ENOSPC There is not enough free space in the filesystem to
> >               exchange the contents safely.
> >
> >        EOPNOTSUPP
> >               The filesystem does not support exchanging bytes between
> >               the two files.
> >
> >        EPERM  file1_fd or file2_fd are immutable.
> >
> >        ETXTBSY
> >               One of the files is a swap file.
> >
> >        EUCLEAN
> >               The filesystem is corrupt.
> >
> >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> >               filesystem.
> >
> > CONFORMING TO
> >        This API is XFS-specific.
> >
> > USE CASES
> >        Several  use  cases  are imagined for this system call.  In all
> >        cases, application software must coordinate updates to the file
> >        because the exchange is performed unconditionally.
> >
> >        The  first  is a data storage program that wants to commit non-
> >        contiguous updates to a file atomically and  coordinates  write
> >        access  to that file.  This can be done by creating a temporary
> >        file, calling FICLONE(2) to share the contents, and staging the
> >        updates into the temporary file.  The FULL_FILES flag is recom‐
> >        mended for this purpose.  The temporary file can be deleted  or
> >        punched out afterwards.
> >
> >        An example program might look like this:
> >
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >
> >            ioctl(temp_fd, FICLONE, fd);
> >
> >            /* append 1MB of records */
> >            lseek(temp_fd, 0, SEEK_END);
> >            write(temp_fd, data1, 1000000);
> >
> >            /* update record index */
> >            pwrite(temp_fd, data1, 600, 98765);
> >            pwrite(temp_fd, data2, 320, 54321);
> >            pwrite(temp_fd, data2, 15, 0);
> >
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .flags = XFS_EXCHRANGE_TO_EOF,
> >            };
> >
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> >        The  second  is  a  software-defined  storage host (e.g. a disk
> >        jukebox) which implements an atomic scatter-gather  write  com‐
> >        mand.   Provided the exported disk's logical block size matches
> >        the file's allocation unit size, this can be done by creating a
> >        temporary file and writing the data at the appropriate offsets.
> >        It is recommended that the temporary file be truncated  to  the
> >        size  of  the  regular file before any writes are staged to the
> >        temporary file to avoid issues with zeroing during  EOF  exten‐
> >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> >        only the file allocation units involved  in  the  emulated  de‐
> >        vice's  write  command.  The temporary file should be truncated
> >        or punched out completely before being reused to stage  another
> >        write.
> >
> >        An example program might look like this:
> >
> >            int fd = open("/some/file", O_RDWR);
> >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> >            struct stat sb;
> >            int blksz;
> >
> >            fstat(fd, &sb);
> >            blksz = sb.st_blksize;
> >
> >            /* land scatter gather writes between 100fsb and 500fsb */
> >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> >
> >            /* commit the entire update */
> >            struct xfs_exch_range args = {
> >                .file1_fd = temp_fd,
> >                .file1_offset = blksz * 100,
> >                .file2_offset = blksz * 100,
> >                .length       = blksz * 400,
> >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> >                                XFS_EXCHRANGE_DSYNC,
> >            };
> >
> >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> >
> > NOTES
> >        Some  filesystems may limit the amount of data or the number of
> >        extents that can be exchanged in a single call.
> >
> > SEE ALSO
> >        ioctl(2)
> >
> > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > IOCTL-XFS-COMMIT-RANGE(2) System Calls ManualIOCTL-XFS-COMMIT-RANGE(2)
> >
> > NAME
> >        ioctl_xfs_commit_range - conditionally exchange the contents of
> >        parts of two files
> >
> > SYNOPSIS
> >        #include <sys/ioctl.h>
> >        #include <xfs/xfs_fs_staging.h>
> >
> >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> >        mit_range *arg);
> >
> > DESCRIPTION
> >        Given  a  range  of bytes in a first file file1_fd and a second
> >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> >        changes  the contents of the two ranges if file2_fd passes cer‐
> >        tain freshness criteria.
> >
> >        After locking both files but before  exchanging  the  contents,
> >        the  supplied  file2_ino field must match file2_fd's inode num‐
> >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> >        cation time and change time of file2.  If they  do  not  match,
> >        EBUSY will be returned.
> >
> 
> Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?

It's the other way 'round -- mtime doesn't change but ctime does.  The
race I'm trying to protect against is:

Thread 0			Thread 1
<snapshot fd cmtime>
<start writing tempfd>
				<fstat fd>
				<write to fd>
				<futimens to reset mtime>
<commitrange>

mtime is controllable by "attackers" but ctime isn't.  I think we only
need to capture ctime, but ye olde swapext ioctl (from which this
derives) did both.
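
To make the race above concrete, here is a minimal userspace sketch of what
"Thread 1" can get away with.  It is not part of the patchset; the helper
names (take_snapshot, sneaky_write, demo_race) and the scratch file under
/tmp are made up for illustration, and it assumes a filesystem whose
timestamp granularity is finer than the 50ms pause.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <time.h>
#include <unistd.h>

/* Hypothetical freshness snapshot: what a committer records up front. */
struct snap {
	struct timespec mtime;
	struct timespec ctime;
};

static int ts_equal(const struct timespec *a, const struct timespec *b)
{
	return a->tv_sec == b->tv_sec && a->tv_nsec == b->tv_nsec;
}

/* Record fd's current mtime and ctime.  Returns 0 on success, -1 on error. */
static int take_snapshot(int fd, struct snap *s)
{
	struct stat sb;

	if (fstat(fd, &sb) < 0)
		return -1;
	s->mtime = sb.st_mtim;
	s->ctime = sb.st_ctim;
	return 0;
}

/* Play "Thread 1": fstat, write, then futimens() to put the old mtime back. */
static int sneaky_write(int fd)
{
	struct timespec pause = { .tv_sec = 0, .tv_nsec = 50 * 1000 * 1000 };
	struct timespec times[2];
	struct stat sb;

	if (fstat(fd, &sb) < 0)
		return -1;

	/* Assumption: 50ms is longer than the fs timestamp granularity. */
	nanosleep(&pause, NULL);

	if (write(fd, "x", 1) != 1)
		return -1;

	times[0].tv_sec = 0;
	times[0].tv_nsec = UTIME_OMIT;	/* leave atime alone */
	times[1] = sb.st_mtim;		/* restore the pre-write mtime */
	return futimens(fd, times);
}

/*
 * Returns 1 if the write was hidden from mtime but still visible in ctime,
 * 0 if not, and -1 on error.
 */
int demo_race(void)
{
	char path[] = "/tmp/cmtime-demo-XXXXXX";
	struct snap before, after;
	int ret = -1;
	int fd;

	fd = mkstemp(path);
	if (fd < 0)
		return -1;
	unlink(path);

	if (take_snapshot(fd, &before) == 0 &&
	    sneaky_write(fd) == 0 &&
	    take_snapshot(fd, &after) == 0)
		ret = ts_equal(&before.mtime, &after.mtime) &&
		      !ts_equal(&before.ctime, &after.ctime);

	close(fd);
	return ret;
}
```

A caller comparing only mtime would accept the stale snapshot here; comparing
ctime is what catches the intervening write, since futimens() itself counts
as an inode change.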

> And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?

Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
work in the intended manner in XFS, no.

> Even if this API is designed to be hoisted out of XFS at some future time,
> Is there a real need to support it on filesystems that do not support
> i_version(?)

Given the way the iversion discussions have gone (file data write
counter) I don't think there's a way to support commitrange on
non-iversion filesystems.

I withdrew any plans to make this more than an XFS-specific ioctl last
year after giving up on ever getting through fsdevel review.  I think
the last reply I got was from viro back in 2021...

> Not to mention the fact that POSIX does not explicitly define how ctime should
> behave with changes to fiemap (uninitialized extent and all), so who knows
> how other filesystems may update ctime in those cases.

...and given the lack of interest from any other filesystem developers
in porting it to !xfs, I'm not likely to take this up ever again.  To be
fair, I think the only filesystems that could possibly support
EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.

> I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> really explicitly requests a bump of i_version on the next change.

Another way I could've structured this (and still could!) would be to
declare the entire freshness region as an untyped u64 fresh[4] blob and
add a START_COMMIT ioctl to fill it out.  Then the kernel fs drivers
get to determine what goes in there.

That at least would be less work for userspace to do.
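
As a rough sketch of what that alternative might look like from userspace:
everything below is hypothetical.  The struct layout, the demo_* names, and
the ioctl numbers are placeholders invented for illustration and do not
match any real kernel header; only the two-step shape (sample the opaque
blob first, hand it back unmodified at commit time) comes from the idea
above.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>
#include <sys/ioctl.h>

/* HYPOTHETICAL argument layout; field types are assumptions. */
struct demo_commit_range {
	int64_t		file1_fd;
	int64_t		file1_offset;
	int64_t		file2_offset;
	int64_t		length;
	uint64_t	flags;
	uint64_t	fresh[4];	/* opaque: filled by the kernel, never parsed */
};

/* Five 64-bit fields plus the four-word blob. */
_Static_assert(sizeof(struct demo_commit_range) == 72,
	       "unexpected padding in demo_commit_range");

/* HYPOTHETICAL ioctl numbers, chosen only so that the sketch compiles. */
#define DEMO_IOC_START_COMMIT	_IOWR('X', 250, struct demo_commit_range)
#define DEMO_IOC_COMMIT_RANGE	_IOWR('X', 251, struct demo_commit_range)

/*
 * Step 1: before staging any writes, ask the kernel to sample whatever
 * freshness data it wants for fd into args->fresh.
 */
int demo_start_commit(int fd, struct demo_commit_range *args)
{
	memset(args, 0, sizeof(*args));
	return ioctl(fd, DEMO_IOC_START_COMMIT, args);
}

/*
 * Step 2: after staging writes in temp_fd, exchange [off, off + len) with
 * fd, passing the blob from step 1 back untouched.
 */
int demo_commit(int fd, int temp_fd, int64_t off, int64_t len,
		struct demo_commit_range *args)
{
	args->file1_fd = temp_fd;
	args->file1_offset = off;
	args->file2_offset = off;
	args->length = len;
	return ioctl(fd, DEMO_IOC_COMMIT_RANGE, args);
}
```

Userspace never looks inside fresh[], so the kernel could swap mtime/ctime
for something else later without changing the ABI.  Going by the manpage
above, a caller would treat EBUSY from the commit step as "file2 changed,
restage and retry".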

I don't want userspace API wrangling to hold up online repair **yet
again**.  I only made EXCHANGE_RANGE so that I could test the functionality
that fsck relies on.  If there was another way to test it then I would
have gladly done that.  Further down the line, COMMIT_RANGE will get us
out of trouble with the xfs defrag tool.

--D

> Thanks,
> Amir.
> 


* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 10:53   ` Jeff Layton
@ 2024-02-27 16:06     ` Darrick J. Wong
  2024-03-01 13:16       ` Jeff Layton
  2024-02-27 23:46     ` Dave Chinner
  1 sibling, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27 16:06 UTC (permalink / raw
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > Hi all,
> > > 
> > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > ranges of bytes between two files atomically.  This new functionality
> > > enables data storage programs to stage and commit file updates such that
> > > reader programs will see either the old contents or the new contents in
> > > their entirety, with no chance of torn writes.  A successful call
> > > completion guarantees that the new contents will be seen even if the
> > > system fails.
> > > 
> > > The ability to exchange file fork mappings between files in this manner
> > > is critical to supporting online filesystem repair, which is built upon
> > > the strategy of constructing a clean copy of a damaged structure and
> > > committing the new structure into the metadata file atomically.
> > > 
> > > User programs will be able to update files atomically by opening an
> > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > they want to make, and exchange the relevant ranges of the temp file
> > > with the original file.  If the updates are aligned with the file block
> > > size, a new (since v2) flag provides for exchanging only the written
> > > areas.  Callers can arrange for the update to be rejected if the
> > > original file has been changed.
> > > 
> > > The intent behind this new userspace functionality is to enable atomic
> > > rewrites of arbitrary parts of individual files.  For years, application
> > > programmers wanting to ensure the atomicity of a file update had to
> > > write the changes to a new file in the same directory, fsync the new
> > > file, rename the new file on top of the old filename, and then fsync the
> > > directory.  People get it wrong all the time, and $fs hacks abound.
> > > Here are the proposed manual pages:
> > > 
> 
> This is a cool idea!  I've had some handwavy ideas about making a gated
> write() syscall (i.e. only write if the change cookie hasn't changed),
> but something like this may be a simpler lift initially.

How /does/ userspace get at the change cookie nowadays?

> > > IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2)
> > > 
> > > NAME
> > >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> > >        two files
> > > 
> > > SYNOPSIS
> > >        #include <sys/ioctl.h>
> > >        #include <xfs/xfs_fs_staging.h>
> > > 
> > >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> > >        xfs_exch_range *arg);
> > > 
> > > DESCRIPTION
> > >        Given  a  range  of bytes in a first file file1_fd and a second
> > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > >        changes the contents of the two ranges.
> > > 
> > >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> > >        tions, so no userspace-level locks need to be taken  to  obtain
> > >        consistent  results.  Implementations must guarantee that read‐
> > >        ers see either the old contents or the new  contents  in  their
> > >        entirety, even if the system fails.
> > > 
> > >        The  system  call  parameters are conveyed in structures of the
> > >        following form:
> > > 
> > >            struct xfs_exch_range {
> > >                __s64    file1_fd;
> > >                __s64    file1_offset;
> > >                __s64    file2_offset;
> > >                __s64    length;
> > >                __u64    flags;
> > > 
> > >                __u64    pad;
> > >            };
> > > 
> > >        The field pad must be zero.
> > > 
> > >        The fields file1_fd, file1_offset, and length define the  first
> > >        range of bytes to be exchanged.
> > > 
> > >        The fields file2_fd, file2_offset, and length define the second
> > >        range of bytes to be exchanged.
> > > 
> > >        Both files must be from the same filesystem mount.  If the  two
> > >        file  descriptors represent the same file, the byte ranges must
> > >        not overlap.  Most  disk-based  filesystems  require  that  the
> > >        starts  of  both ranges must be aligned to the file block size.
> > >        If this is the case, the ends of the ranges  must  also  be  so
> > >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > 
> > >        The field flags controls the behavior of the exchange operation.
> > > 
> > >            XFS_EXCHRANGE_TO_EOF
> > >                   Ignore  the length parameter.  All bytes in file1_fd
> > >                   from file1_offset to EOF are moved to file2_fd,  and
> > >                   file2's  size is set to (file2_offset+(file1_length-
> > >                   file1_offset)).  Meanwhile, all bytes in file2  from
> > >                   file2_offset  to  EOF are moved to file1 and file1's
> > >                   size   is   set   to    (file1_offset+(file2_length-
> > >                   file2_offset)).
> > > 
> > >            XFS_EXCHRANGE_DSYNC
> > >                   Ensure  that  all modified in-core data in both file
> > >                   ranges and all metadata updates  pertaining  to  the
> > >                   exchange operation are flushed to persistent storage
> > >                   before the call returns.  Opening  either  file  de‐
> > >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> > >                   effect.
> > > 
> > >            XFS_EXCHRANGE_FILE1_WRITTEN
> > >                   Only exchange sub-ranges of file1_fd that are  known
> > >                   to  contain  data  written  by application software.
> > >                   Each sub-range may be  expanded  (both  upwards  and
> > >                   downwards)  to  align with the file allocation unit.
> > >                   For files on the data device, this is one filesystem
> > >                   block.   For  files  on the realtime device, this is
> > >                   the realtime extent size.  This facility can be used
> > >                   to  implement  fast  atomic scatter-gather writes of
> > >                   any complexity for software-defined storage  targets
> > >                   if  all  writes  are  aligned to the file allocation
> > >                   unit.
> > > 
> > >            XFS_EXCHRANGE_DRY_RUN
> > >                   Check the parameters and the feasibility of the  op‐
> > >                   eration, but do not change anything.
> > > 
> > > RETURN VALUE
> > >        On  error, -1 is returned, and errno is set to indicate the er‐
> > >        ror.
> > > 
> > > ERRORS
> > >        Error codes can be one of, but are not limited to, the  follow‐
> > >        ing:
> > > 
> > >        EBADF  file1_fd  is not open for reading and writing or is open
> > >               for append-only writes; or  file2_fd  is  not  open  for
> > >               reading and writing or is open for append-only writes.
> > > 
> > >        EINVAL The  parameters  are  not correct for these files.  This
> > >               error can also appear if either file  descriptor  repre‐
> > >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> > >               erally require the offset and  length  arguments  to  be
> > >               aligned to the fundamental block sizes of both files.
> > > 
> > >        EIO    An I/O error occurred.
> > > 
> > >        EISDIR One of the files is a directory.
> > > 
> > >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> > >               perform the operation.
> > > 
> > >        ENOSPC There is not enough free space in the filesystem to  ex‐
> > >               change the contents safely.
> > > 
> > >        EOPNOTSUPP
> > >               The filesystem does not support exchanging bytes between
> > >               the two files.
> > > 
> > >        EPERM  file1_fd or file2_fd are immutable.
> > > 
> > >        ETXTBSY
> > >               One of the files is a swap file.
> > > 
> > >        EUCLEAN
> > >               The filesystem is corrupt.
> > > 
> > >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> > >               filesystem.
> > > 
> > > CONFORMING TO
> > >        This API is XFS-specific.
> > > 
> > > USE CASES
> > >        Several  use  cases  are imagined for this system call.  In all
> > >        cases, application software must coordinate updates to the file
> > >        because the exchange is performed unconditionally.
> > > 
> > >        The  first  is a data storage program that wants to commit non-
> > >        contiguous updates to a file atomically and  coordinates  write
> > >        access  to that file.  This can be done by creating a temporary
> > >        file, calling FICLONE(2) to share the contents, and staging the
> > >        updates into the temporary file.  The FULL_FILES flag is recom‐
> > >        mended for this purpose.  The temporary file can be deleted  or
> > >        punched out afterwards.
> > > 
> > >        An example program might look like this:
> > > 
> > >            int fd = open("/some/file", O_RDWR);
> > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > 
> > >            ioctl(temp_fd, FICLONE, fd);
> > > 
> > >            /* append 1MB of records */
> > >            lseek(temp_fd, 0, SEEK_END);
> > >            write(temp_fd, data1, 1000000);
> > > 
> > >            /* update record index */
> > >            pwrite(temp_fd, data1, 600, 98765);
> > >            pwrite(temp_fd, data2, 320, 54321);
> > >            pwrite(temp_fd, data2, 15, 0);
> > > 
> > >            /* commit the entire update */
> > >            struct xfs_exch_range args = {
> > >                .file1_fd = temp_fd,
> > >                .flags = XFS_EXCHRANGE_TO_EOF,
> > >            };
> > > 
> > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > 
> > >        The  second  is  a  software-defined  storage host (e.g. a disk
> > >        jukebox) which implements an atomic scatter-gather  write  com‐
> > >        mand.   Provided the exported disk's logical block size matches
> > >        the file's allocation unit size, this can be done by creating a
> > >        temporary file and writing the data at the appropriate offsets.
> > >        It is recommended that the temporary file be truncated  to  the
> > >        size  of  the  regular file before any writes are staged to the
> > >        temporary file to avoid issues with zeroing during  EOF  exten‐
> > >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> > >        only the file allocation units involved  in  the  emulated  de‐
> > >        vice's  write  command.  The temporary file should be truncated
> > >        or punched out completely before being reused to stage  another
> > >        write.
> > > 
> > >        An example program might look like this:
> > > 
> > >            int fd = open("/some/file", O_RDWR);
> > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > >            struct stat sb;
> > >            int blksz;
> > > 
> > >            fstat(fd, &sb);
> > >            blksz = sb.st_blksize;
> > > 
> > >            /* land scatter gather writes between 100fsb and 500fsb */
> > >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > 
> > >            /* commit the entire update */
> > >            struct xfs_exch_range args = {
> > >                .file1_fd = temp_fd,
> > >                .file1_offset = blksz * 100,
> > >                .file2_offset = blksz * 100,
> > >                .length       = blksz * 400,
> > >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> > >                                XFS_EXCHRANGE_DSYNC,
> > >            };
> > > 
> > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > 
> > > NOTES
> > >        Some  filesystems may limit the amount of data or the number of
> > >        extents that can be exchanged in a single call.
> > > 
> > > SEE ALSO
> > >        ioctl(2)
> > > 
> > > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > > IOCTL-XFS-COMMIT-RANGE(2) System Calls Manual IOCTL-XFS-COMMIT-RANGE(2)
> > > 
> > > NAME
> > >        ioctl_xfs_commit_range - conditionally exchange the contents of
> > >        parts of two files
> > > 
> > > SYNOPSIS
> > >        #include <sys/ioctl.h>
> > >        #include <xfs/xfs_fs_staging.h>
> > > 
> > >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> > >        mit_range *arg);
> > > 
> > > DESCRIPTION
> > >        Given  a  range  of bytes in a first file file1_fd and a second
> > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > >        changes  the contents of the two ranges if file2_fd passes cer‐
> > >        tain freshness criteria.
> > > 
> > >        After locking both files but before  exchanging  the  contents,
> > >        the  supplied  file2_ino field must match file2_fd's inode num‐
> > >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> > >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > >        cation time and change time of file2.  If they  do  not  match,
> > >        EBUSY will be returned.
> > > 
> > 
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
> > 
> 
> ctime should always change if the mtime does. An mtime update means that
> the metadata was updated, so you also need to update the ctime. 

Exactly. :)

> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> > 
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> > 
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> > 
> 
> 
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.

TLDR: No.

> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.

Dave says that about a lot of things.  I'm not willing to delay the
online fsck project _even further_ for a bunch of vaporware that's not
even out on linux-xfs for review.

The difference in opinion between xfs and the rest of the kernel about
i_version is 50% of why I didn't include it here.  The other 50% is the
part where userspace can't access it, because I do not want to saddle my
mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
CHANGES.

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 


* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 15:45   ` Darrick J. Wong
@ 2024-02-27 16:58     ` Amir Goldstein
  0 siblings, 0 replies; 62+ messages in thread
From: Amir Goldstein @ 2024-02-27 16:58 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, Jeff Layton

On Tue, Feb 27, 2024 at 5:45 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 11:23:39AM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
[...]
> > Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?
>
> It's the other way 'round -- mtime doesn't change but ctime does.  The
> race I'm trying to protect against is:
>
> Thread 0                        Thread 1
> <snapshot fd cmtime>
> <start writing tempfd>
>                                 <fstat fd>
>                                 <write to fd>
>                                 <futimens to reset mtime>
> <commitrange>
>
> mtime is controllable by "attackers" but ctime isn't.  I think we only
> need to capture ctime, but ye olde swapext ioctl (from which this
> derives) did both.
>

Yes, that's what I meant; it was just a braino.
mtime seems redundant, but if you want to keep it for compatibility with
the legacy API, so be it.

> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
>
> Seeing as iversion (as the vfs and/or jlayton seems to want it) doesn't
> work in the intended manner in XFS, no.
>

OK. For the record, AFAICT, the problem the NFS guys have with xfs iversion
is that it is too aggressive for their taste (i.e. bumped on atime updates),
but because ctime is always updated along with iversion in xfs and because
iversion has no granularity problem, I think it is a better choice for you,
regardless of any other filesystems and their interpretation of ctime or
iversion.

> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
>
> Given the way the iversion discussions have gone (file data write
> counter) I don't think there's a way to support commitrange on
> non-iversion filesystems.
>
> I withdrew any plans to make this more than an XFS-specific ioctl last
> year after giving up on ever getting through fsdevel review.  I think
> the last reply I got was from viro back in 2021...
>

understandable.
I wasn't implying that you should hoist this out of XFS.
I was wondering about why not use xfs's iversion, which
seems like a better change counter than ctime.

> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
>
> ...and given the lack of interest from any other filesystem developers
> in porting it to !xfs, I'm not likely to take this up ever again.  To be
> fair, I think the only filesystems that could possibly support
> EXCHANGE_RANGE are /maybe/ btrfs and /probably/ bcachefs.
>

Again, sorry if my questions were misinterpreted.
I was not trying to imply that this API should be hoisted.

> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
>
> Another way I could've structured this (and still could!) would be to
> declare the entire freshness region as an untyped u64 fresh[4] blob and
> add a START_COMMIT ioctl to fill it out.  Then the kernel fs drivers
> get to determine what goes in there.
>
> That at least would be less work for userspace to do.
>

To me that makes sense. Cleaner API.
Fewer questions asked.
But it's up to you.

Thanks,
Amir.


* [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (14 preceding siblings ...)
  2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
@ 2024-02-27 17:46 ` Darrick J. Wong
  2024-02-27 18:52   ` Amir Goldstein
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
  16 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-27 17:46 UTC (permalink / raw
  To: linux-fsdevel, linux-xfs, hch; +Cc: Amir Goldstein, jlayton

From: Darrick J. Wong <djwong@kernel.org>

To head off bikeshedding about the fields in xfs_commit_range, let's
make it an opaque u64 array and require the userspace program to call
a third ioctl to sample the freshness data for us.  If we ever converge
on a definition for i_version then we can use that; for now we'll just
use mtime/ctime like the old swapext ioctl.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
---
 fs/xfs/libxfs/xfs_fs.h |   13 +++--------
 fs/xfs/xfs_exchrange.c |   15 ++++++++++++
 fs/xfs/xfs_exchrange.h |    1 +
 fs/xfs/xfs_ioctl.c     |   58 +++++++++++++++++++++++++++++++++++++++++++-----
 4 files changed, 72 insertions(+), 15 deletions(-)

diff --git a/fs/xfs/libxfs/xfs_fs.h b/fs/xfs/libxfs/xfs_fs.h
index 01b3553adfc55..4019a78ee3ea5 100644
--- a/fs/xfs/libxfs/xfs_fs.h
+++ b/fs/xfs/libxfs/xfs_fs.h
@@ -860,14 +860,8 @@ struct xfs_commit_range {
 
 	__u64		flags;		/* see XFS_EXCHRANGE_* below */
 
-	/* file2 metadata for freshness checks */
-	__u64		file2_ino;	/* inode number */
-	__s64		file2_mtime;	/* modification time */
-	__s64		file2_ctime;	/* change time */
-	__s32		file2_mtime_nsec; /* mod time, nsec */
-	__s32		file2_ctime_nsec; /* change time, nsec */
-
-	__u64		pad;		/* must be zeroes */
+	/* opaque file2 metadata for freshness checks */
+	__u64		file2_freshness[5];
 };
 
 /*
@@ -973,7 +967,8 @@ struct xfs_commit_range {
 #define XFS_IOC_BULKSTAT	     _IOR ('X', 127, struct xfs_bulkstat_req)
 #define XFS_IOC_INUMBERS	     _IOR ('X', 128, struct xfs_inumbers_req)
 #define XFS_IOC_EXCHANGE_RANGE	     _IOWR('X', 129, struct xfs_exch_range)
-#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 129, struct xfs_commit_range)
+#define XFS_IOC_START_COMMIT	     _IOWR('X', 130, struct xfs_commit_range)
+#define XFS_IOC_COMMIT_RANGE	     _IOWR('X', 131, struct xfs_commit_range)
 /*	XFS_IOC_GETFSUUID ---------- deprecated 140	 */
 
 
diff --git a/fs/xfs/xfs_exchrange.c b/fs/xfs/xfs_exchrange.c
index e55ae06f1a32c..dae855515c3c4 100644
--- a/fs/xfs/xfs_exchrange.c
+++ b/fs/xfs/xfs_exchrange.c
@@ -863,3 +863,18 @@ xfs_exchange_range(
 		fsnotify_modify(fxr->file2);
 	return 0;
 }
+
+/* Sample freshness data from fxr->file2 for a commit range operation. */
+void
+xfs_exchrange_freshness(
+	struct xfs_exchrange	*fxr)
+{
+	struct inode		*inode2 = file_inode(fxr->file2);
+	struct xfs_inode	*ip2 = XFS_I(inode2);
+
+	xfs_ilock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+	fxr->file2_ino = ip2->i_ino;
+	fxr->file2_ctime = inode_get_ctime(inode2);
+	fxr->file2_mtime = inode_get_mtime(inode2);
+	xfs_iunlock(ip2, XFS_IOLOCK_SHARED | XFS_MMAPLOCK_SHARED | XFS_ILOCK_SHARED);
+}
diff --git a/fs/xfs/xfs_exchrange.h b/fs/xfs/xfs_exchrange.h
index 2dd9ab7d76828..942283a7f75f5 100644
--- a/fs/xfs/xfs_exchrange.h
+++ b/fs/xfs/xfs_exchrange.h
@@ -36,6 +36,7 @@ struct xfs_exchrange {
 	struct timespec64	file2_ctime;
 };
 
+void xfs_exchrange_freshness(struct xfs_exchrange *fxr);
 int xfs_exchange_range(struct xfs_exchrange *fxr);
 
 /* XFS-specific parts of file exchanges */
diff --git a/fs/xfs/xfs_ioctl.c b/fs/xfs/xfs_ioctl.c
index ee26ac2028da1..1940da72a1da7 100644
--- a/fs/xfs/xfs_ioctl.c
+++ b/fs/xfs/xfs_ioctl.c
@@ -2402,6 +2402,47 @@ xfs_ioc_exchange_range(
 	return error;
 }
 
+/* Opaque freshness blob for XFS_IOC_COMMIT_RANGE */
+struct xfs_commit_range_fresh {
+	__u64		file2_ino;	/* inode number */
+	__s64		file2_mtime;	/* modification time */
+	__s64		file2_ctime;	/* change time */
+	__s32		file2_mtime_nsec; /* mod time, nsec */
+	__s32		file2_ctime_nsec; /* change time, nsec */
+	__u64		pad;		/* zero */
+};
+
+static long
+xfs_ioc_start_commit(
+	struct file			*file,
+	struct xfs_commit_range __user	*argp)
+{
+	struct xfs_exchrange		fxr = {
+		.file2			= file,
+	};
+	struct xfs_commit_range		args = { };
+	struct xfs_commit_range_fresh	*kern_f;
+	struct xfs_commit_range_fresh	__user *user_f;
+
+	BUILD_BUG_ON(sizeof(struct xfs_commit_range_fresh) !=
+		     sizeof(args.file2_freshness));
+
+	xfs_exchrange_freshness(&fxr);
+
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+	kern_f->file2_ino		= fxr.file2_ino;
+	kern_f->file2_mtime		= fxr.file2_mtime.tv_sec;
+	kern_f->file2_mtime_nsec	= fxr.file2_mtime.tv_nsec;
+	kern_f->file2_ctime		= fxr.file2_ctime.tv_sec;
+	kern_f->file2_ctime_nsec	= fxr.file2_ctime.tv_nsec;
+
+	user_f = (struct xfs_commit_range_fresh __user *)&argp->file2_freshness;
+	if (copy_to_user(user_f, kern_f, sizeof(*kern_f)))
+		return -EFAULT;
+
+	return 0;
+}
+
 static long
 xfs_ioc_commit_range(
 	struct file			*file,
@@ -2411,12 +2452,15 @@ xfs_ioc_commit_range(
 		.file2			= file,
 	};
 	struct xfs_commit_range		args;
+	struct xfs_commit_range_fresh	*kern_f;
 	struct fd			file1;
 	int				error;
 
+	kern_f = (struct xfs_commit_range_fresh *)&args.file2_freshness;
+
 	if (copy_from_user(&args, argp, sizeof(args)))
 		return -EFAULT;
-	if (memchr_inv(&args.pad, 0, sizeof(args.pad)))
+	if (memchr_inv(&kern_f->pad, 0, sizeof(kern_f->pad)))
 		return -EINVAL;
 	if (args.flags & ~XFS_EXCHRANGE_ALL_FLAGS)
 		return -EINVAL;
@@ -2425,11 +2469,11 @@ xfs_ioc_commit_range(
 	fxr.file2_offset	= args.file2_offset;
 	fxr.length		= args.length;
 	fxr.flags		= args.flags | __XFS_EXCHRANGE_CHECK_FRESH2;
-	fxr.file2_ino		= args.file2_ino;
-	fxr.file2_mtime.tv_sec	= args.file2_mtime;
-	fxr.file2_mtime.tv_nsec	= args.file2_mtime_nsec;
-	fxr.file2_ctime.tv_sec	= args.file2_ctime;
-	fxr.file2_ctime.tv_nsec	= args.file2_ctime_nsec;
+	fxr.file2_ino		= kern_f->file2_ino;
+	fxr.file2_mtime.tv_sec	= kern_f->file2_mtime;
+	fxr.file2_mtime.tv_nsec	= kern_f->file2_mtime_nsec;
+	fxr.file2_ctime.tv_sec	= kern_f->file2_ctime;
+	fxr.file2_ctime.tv_nsec	= kern_f->file2_ctime_nsec;
 
 	file1 = fdget(args.file1_fd);
 	if (!file1.file)
@@ -2782,6 +2826,8 @@ xfs_file_ioctl(
 
 	case XFS_IOC_EXCHANGE_RANGE:
 		return xfs_ioc_exchange_range(filp, arg);
+	case XFS_IOC_START_COMMIT:
+		return xfs_ioc_start_commit(filp, arg);
 	case XFS_IOC_COMMIT_RANGE:
 		return xfs_ioc_commit_range(filp, arg);
 


* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-27 18:52   ` Amir Goldstein
  2024-02-29 23:27     ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Amir Goldstein @ 2024-02-27 18:52 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
>
> From: Darrick J. Wong <djwong@kernel.org>
>
> To head off bikeshedding about the fields in xfs_commit_range, let's
> make it an opaque u64 array and require the userspace program to call
> a third ioctl to sample the freshness data for us.  If we ever converge
> on a definition for i_version then we can use that; for now we'll just
> use mtime/ctime like the old swapext ioctl.

This addresses my concerns about using mtime/ctime.

I have to say, Darrick, that I think that referring to this concern as
bikeshedding is not being honest.

I do hate nit picking reviews and I do hate "maybe also fix the world"
review comments, but I think the question about using mtime/ctime in
this new API was not out of place and I think that making the freshness
data opaque is better for everyone in the long run and hopefully, this will
help you move to the things you care about faster.

Thanks,
Amir.


* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 10:53   ` Jeff Layton
  2024-02-27 16:06     ` Darrick J. Wong
@ 2024-02-27 23:46     ` Dave Chinner
  2024-02-28 10:30       ` Jeff Layton
  1 sibling, 1 reply; 62+ messages in thread
From: Dave Chinner @ 2024-02-27 23:46 UTC (permalink / raw
  To: Jeff Layton
  Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch

On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?

Like xfs_fsr doing online defrag, we really only care about explicit
user data changes here, not internal layout and metadata changes to
the files...

> > Even if this API is designed to be hoisted out of XFS at some future time,
> > Is there a real need to support it on filesystems that do not support
> > i_version(?)
> > 
> > Not to mention the fact that POSIX does not explicitly define how ctime should
> > behave with changes to fiemap (uninitialized extent and all), so who knows
> > how other filesystems may update ctime in those cases.
> > 
> > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > really explicitly requests a bump of i_version on the next change.
> > 
> 
> 
> I agree. Using an opaque change cookie would be a lot nicer from an API
> standpoint, and shouldn't be subject to timestamp granularity issues.
> 
> That said, XFS's change cookie is currently broken. Dave C. said he had
> some patches in progress to fix that however.

By "fix", I meant "remove".

i.e. the patches I was proposing were to remove SB_I_VERSION support
from XFS so NFS just uses the ctime on XFS because the recent
changes to i_version make it a ctime change counter, not an inode
change counter.

Then patches were posted for finer grained inode timestamps to allow
everything to use ctime instead of i_version, and with that I
thought NFS was just going to change to ctime for everyone, and that
the whole change cookie issue was going away.

It now sounds like that isn't happening, so I'll just resurrect the
patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
support from XFS for now and us XFS people can just go back to
ignoring this problem again.

-Dave.
-- 
Dave Chinner
david@fromorbit.com

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
                   ` (15 preceding siblings ...)
  2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
@ 2024-02-28  1:50 ` Colin Walters
  2024-02-29 20:18   ` Darrick J. Wong
  16 siblings, 1 reply; 62+ messages in thread
From: Colin Walters @ 2024-02-28  1:50 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> Hi all,
>
> This series creates a new FIEXCHANGE_RANGE system call to exchange
> ranges of bytes between two files atomically.  This new functionality
> enables data storage programs to stage and commit file updates such that
> reader programs will see either the old contents or the new contents in
> their entirety, with no chance of torn writes.  A successful call
> completion guarantees that the new contents will be seen even if the
> system fails.
>
> The ability to exchange file fork mappings between files in this manner
> is critical to supporting online filesystem repair, which is built upon
> the strategy of constructing a clean copy of a damaged structure and
> committing the new structure into the metadata file atomically.
>
> User programs will be able to update files atomically by opening an
> O_TMPFILE, reflinking the source file to it, making whatever updates
> they want to make, and exchange the relevant ranges of the temp file
> with the original file. 

It's probably worth noting that the "reflinking the source file" here
is optional, right?  IOW one can just:

- open(O_TMPFILE)
- write()
- ioctl(FIEXCHANGE_RANGE)

I suspect the "simpler" non-database cases (think e.g. editors
operating on plain text files) are going to be operating on an
in-memory copy; in theory of course we could identify common ranges
and reflink, but it's not clear to me it's really worth it at the
tiny scale most source files are.

> The intent behind this new userspace functionality is to enable atomic
> rewrites of arbitrary parts of individual files.  For years, application
> programmers wanting to ensure the atomicity of a file update had to
> write the changes to a new file in the same directory

More sophisticated tools already are using O_TMPFILE I would say,
just with a final last step of materializing it with a name,
and then rename() into place.  So if this also
obviates the need for
https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
that seems good.

>        Exchanges  are  atomic  with  regards to concurrent file opera‐
>        tions, so no userspace-level locks need to be taken  to  obtain
>        consistent  results.  Implementations must guarantee that read‐
>        ers see either the old contents or the new  contents  in  their
>        entirety, even if the system fails.

But given that we're reusing the same inode, I don't think that can *really* be true...at least, not without higher level serialization.

A classic case today is dconf in GNOME, which is a basic memory-mapped database file that is atomically replaced by the "create new file, rename into place" model.  Clients with an mmap() view just see the old data until they reload explicitly.  But with this, clients with an mmap'd view *will* immediately see the new contents (because it's the same inode, right?) and that's just going to lead to possibly split reads and undefined behavior - without extra userspace serialization or locking (that more proper databases are going to be doing).

Arguably of course, dconf is too simple and more sophisticated tools like sqlite or LMDB could make use of this.  (There's some special atomic write that got added to f2fs for sqlite last I saw...I'm curious if this could replace it)

But still...it seems to me like there's going to be quite a lot of the "potentially concurrent reader, atomic replace desired" pattern and since this can't replace that, we should call that out explicitly in the man page.  And also if so, then there's still a need for the linkat(AT_REPLACE) etc.
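
For reference, the "create new file, rename into place" model mentioned above can be sketched in C roughly as follows.  This is a minimal illustration, not dconf's actual code: sketch_replace_file() is a hypothetical helper name, the temporary file is created with mkstemp() next to the target (so it gets mode 0600 rather than the original's permissions), and a fully durable version would also fsync the parent directory after the rename.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/*
 * Hypothetical helper: replace the file at @path with @len bytes from
 * @buf by writing a sibling temporary file and renaming it over @path.
 * Returns 0 on success, -1 on failure.
 */
static int sketch_replace_file(const char *path, const void *buf, size_t len)
{
	char tmp[4096];
	const char *p = buf;
	int n, fd;

	/* Build the mkstemp() template in the same directory as the target. */
	n = snprintf(tmp, sizeof(tmp), "%s.XXXXXX", path);
	if (n < 0 || (size_t)n >= sizeof(tmp))
		return -1;

	fd = mkstemp(tmp);
	if (fd < 0)
		return -1;

	/* Write the whole buffer, retrying short writes and EINTR. */
	while (len > 0) {
		ssize_t done = write(fd, p, len);

		if (done < 0) {
			if (errno == EINTR)
				continue;
			goto fail;
		}
		p += done;
		len -= (size_t)done;
	}

	/* Flush the new contents before they become visible under @path. */
	if (fsync(fd) < 0)
		goto fail;
	if (close(fd) < 0) {
		fd = -1;
		goto fail;
	}
	fd = -1;

	/* Atomically swap the name; the old inode lives on for old users. */
	if (rename(tmp, path) < 0)
		goto fail;
	return 0;

fail:
	if (fd >= 0)
		close(fd);
	unlink(tmp);
	return -1;
}
```

Because rename() installs a different inode under the same name, a reader that already has the old file open or mapped keeps seeing the old bytes until it reopens the path.  That is the property that an in-place exchange on the same inode does not give.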

>            XFS_EXCHRANGE_TO_EOF

I kept reading this as some sort of typo...would it really be too onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.?  Echoes of unix "creat" here =)

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 23:46     ` Dave Chinner
@ 2024-02-28 10:30       ` Jeff Layton
  2024-02-28 10:58         ` Amir Goldstein
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff Layton @ 2024-02-28 10:30 UTC (permalink / raw
  To: Dave Chinner
  Cc: Amir Goldstein, Darrick J. Wong, linux-fsdevel, linux-xfs, hch

On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> 
> Like xfs_fsr doing online defrag, we really only care about explicit
> user data changes here, not internal layout and metadata changes to
> the files...
> 
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > > 
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > > 
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > > 
> > 
> > 
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> > 
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
> 
> By "fix", I meant "remove".
> 
> i.e. the patches I was proposing were to remove SB_I_VERSION support
> from XFS so NFS just uses the ctime on XFS because the recent
> changes to i_version make it a ctime change counter, not an inode
> change counter.
> 
> Then patches were posted for finer grained inode timestamps to allow
> everything to use ctime instead of i_version, and with that I
> thought NFS was just going to change to ctime for everyone with that
> the whole change cookie issue was going away.
> 
> It now sounds like that isn't happening, so I'll just ressurect the
> patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> support from XFS for now and us XFS people can just go back to
> ignoring this problem again.


I must have misunderstood what you said when we were at LPC this year:

After the multigrain ctime patches were reverted, you mentioned that you
were working on a patchset that used the unused bits in the tv_nsec
field as a counter for counting changes that have occurred within the same
tv_nsec value.

Did those not pan out for some reason?
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28 10:30       ` Jeff Layton
@ 2024-02-28 10:58         ` Amir Goldstein
  2024-02-28 11:01           ` Jeff Layton
  0 siblings, 1 reply; 62+ messages in thread
From: Amir Goldstein @ 2024-02-28 10:58 UTC (permalink / raw
  To: Jeff Layton
  Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
	Jan Kara

On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
>
> On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> >
> > Like xfs_fsr doing online defrag, we really only care about explicit
> > user data changes here, not internal layout and metadata changes to
> > the files...
> >
> > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > Is there a real need to support it on filesystems that do not support
> > > > i_version(?)
> > > >
> > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > how other filesystems may update ctime in those cases.
> > > >
> > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > really explicitly requests a bump of i_version on the next change.
> > > >
> > >
> > >
> > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > >
> > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > some patches in progress to fix that however.
> >
> > By "fix", I meant "remove".
> >
> > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > from XFS so NFS just uses the ctime on XFS because the recent
> > changes to i_version make it a ctime change counter, not an inode
> > change counter.
> >
> > Then patches were posted for finer grained inode timestamps to allow
> > everything to use ctime instead of i_version, and with that I
> > thought NFS was just going to change to ctime for everyone with that
> > the whole change cookie issue was going away.
> >
> > It now sounds like that isn't happening, so I'll just ressurect the
> > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > support from XFS for now and us XFS people can just go back to
> > ignoring this problem again.
>
>
> I must have misunderstood what you said when we were at LPC this year:
>
> After the multigrain ctime patches were reverted, you mentioned that you
> were working on a patchset that used the unused bits in the tv_nsec
> field as counter for counting changes that have occurred within the same
> tv_nsec value.
>
> Did those not pan out for some reason?

Jeff,

Could I trouble you to suggest a topic for LSFMM to summarize
everything that has been going on this year wrt change cookie/time
at xfs/vfs level and try to set a clear roadmap?

Thanks,
Amir.

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28 10:58         ` Amir Goldstein
@ 2024-02-28 11:01           ` Jeff Layton
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff Layton @ 2024-02-28 11:01 UTC (permalink / raw
  To: Amir Goldstein
  Cc: Dave Chinner, Darrick J. Wong, linux-fsdevel, linux-xfs, hch,
	Jan Kara

On Wed, 2024-02-28 at 12:58 +0200, Amir Goldstein wrote:
> On Wed, Feb 28, 2024 at 12:30 PM Jeff Layton <jlayton@kernel.org> wrote:
> > 
> > On Wed, 2024-02-28 at 10:46 +1100, Dave Chinner wrote:
> > > On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > > > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > 
> > > Like xfs_fsr doing online defrag, we really only care about explicit
> > > user data changes here, not internal layout and metadata changes to
> > > the files...
> > > 
> > > > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > > > Is there a real need to support it on filesystems that do not support
> > > > > i_version(?)
> > > > > 
> > > > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > > > how other filesystems may update ctime in those cases.
> > > > > 
> > > > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > > > really explicitly requests a bump of i_version on the next change.
> > > > > 
> > > > 
> > > > 
> > > > I agree. Using an opaque change cookie would be a lot nicer from an API
> > > > standpoint, and shouldn't be subject to timestamp granularity issues.
> > > > 
> > > > That said, XFS's change cookie is currently broken. Dave C. said he had
> > > > some patches in progress to fix that however.
> > > 
> > > By "fix", I meant "remove".
> > > 
> > > i.e. the patches I was proposing were to remove SB_I_VERSION support
> > > from XFS so NFS just uses the ctime on XFS because the recent
> > > changes to i_version make it a ctime change counter, not an inode
> > > change counter.
> > > 
> > > Then patches were posted for finer grained inode timestamps to allow
> > > everything to use ctime instead of i_version, and with that I
> > > thought NFS was just going to change to ctime for everyone with that
> > > the whole change cookie issue was going away.
> > > 
> > > It now sounds like that isn't happening, so I'll just ressurect the
> > > patch to remove published SB_I_VERSION and STATX_CHANGE_COOKIE
> > > support from XFS for now and us XFS people can just go back to
> > > ignoring this problem again.
> > 
> > 
> > I must have misunderstood what you said when we were at LPC this year:
> > 
> > After the multigrain ctime patches were reverted, you mentioned that you
> > were working on a patchset that used the unused bits in the tv_nsec
> > field as counter for counting changes that have occurred within the same
> > tv_nsec value.
> > 
> > Did those not pan out for some reason?
> 
> Jeff,
> 
> Could I trouble you to suggest a topic for LSFMM to summarize
> everything that has been going on this year wrt change cookie/time
> at xfs/vfs level and try to set a clear roadmap?
> 

Sure, I'll send something later today.
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 01/14] vfs: export remap and write check helpers
  2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
@ 2024-02-28 15:40   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:40 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-27  2:21 ` [PATCH 02/14] xfs: introduce new file range exchange ioctls Darrick J. Wong
@ 2024-02-28 15:44   ` Christoph Hellwig
  2024-02-28 19:35     ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:44 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

On Mon, Feb 26, 2024 at 06:21:23PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> Introduce a pair of new ioctls to handle exchanging ranges of bytes
> between files.  The goal here is to perform the exchange atomically with
> respect to applications -- either they see the file contents before the
> exchange or they see that A-B is now B-A, even if the kernel crashes.
> 
> The simpler of the two ioctls is XFS_IOC_EXCHANGE_RANGE, which performs
> the exchange unconditionally.  XFS_IOC_COMMIT_RANGE, on the other hand,
> requires the caller to sample the file attributes of one of the files
> participating in the exchange, and aborts the exchange if that file has
> changed in the meantime (presumably by another thread).

So per all the discussions, wouldn't it make sense to separate out
XFS_IOC_COMMIT_RANGE (plus the new start commit one later), and if
discussions are still going on just get XFS_IOC_EXCHANGE_RANGE
done ASAP to go on with online repair, and give XFS_IOC_COMMIT_RANGE
enough time to discuss the finer details?

> +struct xfs_exch_range {
> +	__s64		file1_fd;

I should have noticed this last time, but why are we passing a fd
as a 64-bit value when it actually just is a 32-bit value in syscalls?
(same for commit).

> +	if (((fxr->file1->f_flags | fxr->file2->f_flags) & (__O_SYNC | O_DSYNC)) ||

Nit: overly long line here.

> +	if (fxr->flags & ~(XFS_EXCHRANGE_ALL_FLAGS | __XFS_EXCHRANGE_CHECK_FRESH2))

.. and here

> +	/*
> +	 * The ioctl enforces that src and dest files are on the same mount.
> +	 * However, they only need to be on the same file system.
> +	 */
> +	if (inode1->i_sb != inode2->i_sb)
> +		return -EXDEV;

How about only doing this check once further up?  As the same sb also
applies the same mount.

* Re: [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges
  2024-02-27  2:21 ` [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges Darrick J. Wong
@ 2024-02-28 15:44   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:44 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 04/14] xfs: introduce a file mapping exchange log intent item
  2024-02-27  2:21 ` [PATCH 04/14] xfs: introduce a file mapping exchange log intent item Darrick J. Wong
@ 2024-02-28 15:45   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:45 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 05/14] xfs: create deferred log items for file mapping exchanges
  2024-02-27  2:22 ` [PATCH 05/14] xfs: create deferred log items for file mapping exchanges Darrick J. Wong
@ 2024-02-28 15:49   ` Christoph Hellwig
  2024-02-28 19:55     ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:49 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

> -static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
> +static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)

This seems entirely unrelated, can you split it into a cleanup patch?

> +		state |= CRIGHT_CONTIG;
> +	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
> +	    left->br_startblock + curr->br_startblock +
> +					right->br_startblock > XFS_MAX_BMBT_EXTLEN)

Overly long line here (and pretty weird formatting causing it..)

> +	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
> +	    left->br_startblock + new->br_startblock +
> +					right->br_startblock > XFS_MAX_BMBT_EXTLEN)

Same here.

> +/* XFS-specific parts of file exchanges */

Well, everything really is XFS-specific :)  I'd drop this comment.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code
  2024-02-27  2:22 ` [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong
@ 2024-02-28 15:49   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:49 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery
  2024-02-27  2:22 ` [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong
@ 2024-02-28 15:50   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:50 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation
  2024-02-27  2:22 ` [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong
@ 2024-02-28 15:50   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:50 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 09/14] xfs: condense directories after a mapping exchange operation
  2024-02-27  2:23 ` [PATCH 09/14] xfs: condense directories " Darrick J. Wong
@ 2024-02-28 15:51   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:51 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 10/14] xfs: condense symbolic links after a mapping exchange operation
  2024-02-27  2:23 ` [PATCH 10/14] xfs: condense symbolic links " Darrick J. Wong
@ 2024-02-28 15:51   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:51 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 11/14] xfs: make file range exchange support realtime files
  2024-02-27  2:23 ` [PATCH 11/14] xfs: make file range exchange support realtime files Darrick J. Wong
@ 2024-02-28 15:51   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:51 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range
  2024-02-27  2:23 ` [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
@ 2024-02-28 15:51   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:51 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 13/14] docs: update swapext -> exchmaps language
  2024-02-27  2:24 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
@ 2024-02-28 15:52   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:52 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 14/14] xfs: enable logged file mapping exchange feature
  2024-02-27  2:24 ` [PATCH 14/14] xfs: enable logged file mapping exchange feature Darrick J. Wong
@ 2024-02-28 15:52   ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 15:52 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: linux-xfs, hch

Looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-28 15:44   ` Christoph Hellwig
@ 2024-02-28 19:35     ` Darrick J. Wong
  2024-02-28 19:37       ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-28 19:35 UTC (permalink / raw
  To: Christoph Hellwig; +Cc: linux-xfs, hch

On Wed, Feb 28, 2024 at 07:44:32AM -0800, Christoph Hellwig wrote:
> On Mon, Feb 26, 2024 at 06:21:23PM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Introduce a pair of new ioctls to handle exchanging ranges of bytes
> > between files.  The goal here is to perform the exchange atomically with
> > respect to applications -- either they see the file contents before the
> > exchange or they see that A-B is now B-A, even if the kernel crashes.
> > 
> > The simpler of the two ioctls is XFS_IOC_EXCHANGE_RANGE, which performs
> > the exchange unconditionally.  XFS_IOC_COMMIT_RANGE, on the other hand,
> > requires the caller to sample the file attributes of one of the files
> > participating in the exchange, and aborts the exchange if that file has
> > changed in the meantime (presumably by another thread).
> 
> So per all the discussions, wouldn't it make sense to separate out
> XFS_IOC_COMMIT_RANGE (plus the new start commit one later), and if
> discussions are still going on just get XFS_IOC_EXCHANGE_RANGE
> done ASAP to go on with online repair, and give XFS_IOC_COMMIT_RANGE
> enough time to discuss the finer details?

Done.  All of the COMMIT_RANGE code (and the _CHECK_FRESH2 code) has
been moved to a separate patch and combined with the patch[1] from
yesterday.

[1] https://lore.kernel.org/linux-xfs/CAOQ4uxiPfno-Hx+fH3LEN_4D6HQgyMAySRNCU=O2R_-ksrxSDQ@mail.gmail.com/

> > +struct xfs_exch_range {
> > +	__s64		file1_fd;
> 
> I should have noticed this last time, by why are we passing a fd
> as a 64-bit value when it actually just is a 32-bit value in syscalls?
> (same for commit).

I'll change that.

> > +	if (((fxr->file1->f_flags | fxr->file2->f_flags) & (__O_SYNC | O_DSYNC)) ||
> 
> Nit: overly long line here.

I'll replace the __O_SYNC | O_DSYNC with O_SYNC, since they're the same
thing.

> > +	if (fxr->flags & ~(XFS_EXCHRANGE_ALL_FLAGS | __XFS_EXCHRANGE_CHECK_FRESH2))
> 
> .. and here

Fixed, thanks.

> > +	/*
> > +	 * The ioctl enforces that src and dest files are on the same mount.
> > +	 * However, they only need to be on the same file system.
> > +	 */
> > +	if (inode1->i_sb != inode2->i_sb)
> > +		return -EXDEV;
> 
> How about only doing this checks once further up?  As the same sb also
> applies the same mount.

I'll remove this check entirely, since we've already checked that the
vfsmnt are the same.  Assuming that's what you meant-- I was slightly
confused by "same sb also applies the same mount" and decided to
interpret that as "same sb implies the same mount".

--D

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-28 19:35     ` Darrick J. Wong
@ 2024-02-28 19:37       ` Christoph Hellwig
  2024-02-28 23:00         ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 19:37 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, hch

On Wed, Feb 28, 2024 at 11:35:47AM -0800, Darrick J. Wong wrote:
> > How about only doing this checks once further up?  As the same sb also
> > applies the same mount.
> 
> I'll remove this check entirely, since we've already checked that the
> vfsmnt are the same.  Assuming that's what you meant-- I was slightly
> confused by "same sb also applies the same mount" and decided to
> interpret that as "same sb implies the same mount".

You interpreted that correctly.  Sorry for my jetlagged early morning
incoherence.

* Re: [PATCH 05/14] xfs: create deferred log items for file mapping exchanges
  2024-02-28 15:49   ` Christoph Hellwig
@ 2024-02-28 19:55     ` Darrick J. Wong
  2024-02-28 22:08       ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-28 19:55 UTC (permalink / raw
  To: Christoph Hellwig; +Cc: linux-xfs, hch

On Wed, Feb 28, 2024 at 07:49:05AM -0800, Christoph Hellwig wrote:
> > -static inline bool xfs_bmap_is_written_extent(struct xfs_bmbt_irec *irec)
> > +static inline bool xfs_bmap_is_written_extent(const struct xfs_bmbt_irec *irec)
> 
> This seems entirely unrelated, can you split it into a cleanup patch?

Done.

> > +		state |= CRIGHT_CONTIG;
> > +	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
> > +	    left->br_startblock + curr->br_startblock +
> > +					right->br_startblock > XFS_MAX_BMBT_EXTLEN)

Oh yikes, that should be br_blockcount, not br_startblock.

> Overly long line here (and pretty weird formatting causing it..)

I'll change it to this helper:

static inline bool
xmi_can_merge_all(
	const struct xfs_bmbt_irec	*l,
	const struct xfs_bmbt_irec	*m,
	const struct xfs_bmbt_irec	*r)
{
	xfs_filblks_t			new_len;

	new_len = l->br_blockcount + m->br_blockcount + r->br_blockcount;
	return new_len <= XFS_MAX_BMBT_EXTLEN;
}

Then the call sites become:

	if ((state & CBOTH_CONTIG) == CBOTH_CONTIG &&
	    !xmi_can_merge_all(left, curr, right))
		state &= ~CRIGHT_CONTIG;

> > +	if ((state & NBOTH_CONTIG) == NBOTH_CONTIG &&
> > +	    left->br_startblock + new->br_startblock +
> > +					right->br_startblock > XFS_MAX_BMBT_EXTLEN)
> 
> Same here.

Here too.

> > +/* XFS-specific parts of file exchanges */
> 
> Well, everything really is XFS-specific :)  I'd drop this comment.

Ok.

> Otherwise looks good:
> 
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Thanks!

--D

* Re: [PATCH 05/14] xfs: create deferred log items for file mapping exchanges
  2024-02-28 19:55     ` Darrick J. Wong
@ 2024-02-28 22:08       ` Christoph Hellwig
  2024-02-28 22:56         ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-28 22:08 UTC (permalink / raw
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, hch

On Wed, Feb 28, 2024 at 11:55:32AM -0800, Darrick J. Wong wrote:
> static inline bool
> xmi_can_merge_all(
> 	const struct xfs_bmbt_irec	*l,
> 	const struct xfs_bmbt_irec	*m,
> 	const struct xfs_bmbt_irec	*r)
> {
> 	xfs_filblks_t			new_len;
> 
> 	new_len = l->br_blockcount + m->br_blockcount + r->br_blockcount;
> 	return new_len <= XFS_MAX_BMBT_EXTLEN;
> }

Dumb question:  can the addition overflow?

* Re: [PATCH 05/14] xfs: create deferred log items for file mapping exchanges
  2024-02-28 22:08       ` Christoph Hellwig
@ 2024-02-28 22:56         ` Darrick J. Wong
  0 siblings, 0 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-28 22:56 UTC (permalink / raw
  To: Christoph Hellwig; +Cc: linux-xfs, hch

On Wed, Feb 28, 2024 at 02:08:51PM -0800, Christoph Hellwig wrote:
> On Wed, Feb 28, 2024 at 11:55:32AM -0800, Darrick J. Wong wrote:
> > static inline bool
> > xmi_can_merge_all(
> > 	const struct xfs_bmbt_irec	*l,
> > 	const struct xfs_bmbt_irec	*m,
> > 	const struct xfs_bmbt_irec	*r)
> > {
> > 	xfs_filblks_t			new_len;
> > 
> > 	new_len = l->br_blockcount + m->br_blockcount + r->br_blockcount;
> > 	return new_len <= XFS_MAX_BMBT_EXTLEN;
> > }
> 
> Dumb question:  can the addition overflow?

No.

Both callsites of xmi_can_merge_all trigger only if both LEFT_CONTIG and
RIGHT_CONTIG have been set.  Both of these _CONTIG flags are set only if
xmi_can_merge returned true, which it only does for real mappings.  Real
mappings are derived from ondisk bmbt mappings, which means they won't
be larger than 2^21 blocks in length.

Therefore, [lmr]->br_blockcount each can only be up to 2^21, and adding
them all together only requires 23 bits.  The u64 here is overkill, but
it matches xfs_bmbt_irec.br_blockcount.
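
To spell the arithmetic out, here's a throwaway userspace sketch.
DEMO_MAX_EXTLEN and demo_can_merge_all are made-up stand-ins (plain
uint64_t instead of xfs_filblks_t), and the 2^21 bound is just the
figure quoted above, not something copied out of the kernel headers:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed bound from the discussion above: a real mapping is at most 2^21 blocks. */
#define DEMO_MAX_EXTLEN	((uint64_t)1 << 21)

/* Three such lengths sum to at most 3 * 2^21, which is below 2^23. */
static_assert(3 * DEMO_MAX_EXTLEN < ((uint64_t)1 << 23),
	      "sum of three extent lengths fits in 23 bits");

/* Hypothetical stand-in for xmi_can_merge_all, taking bare block counts. */
static inline bool
demo_can_merge_all(uint64_t l, uint64_t m, uint64_t r)
{
	/* Precondition: each length came from an ondisk (real) mapping. */
	assert(l <= DEMO_MAX_EXTLEN && m <= DEMO_MAX_EXTLEN &&
	       r <= DEMO_MAX_EXTLEN);

	return l + m + r <= DEMO_MAX_EXTLEN;
}
```

With the inputs bounded like that, the 64-bit sum has dozens of spare
bits, so the comparison can't be fooled by wraparound.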

--D

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-28 19:37       ` Christoph Hellwig
@ 2024-02-28 23:00         ` Darrick J. Wong
  2024-02-29 13:22           ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-28 23:00 UTC
  To: Christoph Hellwig; +Cc: linux-xfs, hch

On Wed, Feb 28, 2024 at 11:37:26AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 28, 2024 at 11:35:47AM -0800, Darrick J. Wong wrote:
> > > How about only doing this checks once further up?  As the same sb also
> > > applies the same mount.
> > 
> > I'll remove this check entirely, since we've already checked that the
> > vfsmnt are the same.  Assuming that's what you meant-- I was slightly
> > confused by "same sb also applies the same mount" and decided to
> > interpret that as "same sb implies the same mount".
> 
> You interpreted that correctly.  Sorry for my jetlagged early morning
> incoherence.

So it occurs to me that I've mismatched the signedness in struct
xfs_exchange_range:

struct xfs_exchange_range {
	...
	__s64		file1_offset;	/* file1 offset, bytes */
	__s64		file2_offset;	/* file2 offset, bytes */
	__u64		length;		/* bytes to exchange */

Compare this to FICLONERANGE:

struct file_clone_range {
	...
	__u64 src_offset;
	__u64 src_length;
	__u64 dest_offset;
};

The offsets and lengths for FICLONERANGE are unsigned, so I think
xfs_exchange_range ought to follow that.

--D

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-28 23:00         ` Darrick J. Wong
@ 2024-02-29 13:22           ` Christoph Hellwig
  2024-02-29 17:10             ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-29 13:22 UTC
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, hch

On Wed, Feb 28, 2024 at 03:00:57PM -0800, Darrick J. Wong wrote:
> The offsets and lengths for FICLONERANGE are unsigned, so I think
> xfs_exchange_range ought to follow that.

Yes.  I thought I had actually brought that up before, but I might have
wanted to and not actually sent the comments out..

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-29 13:22           ` Christoph Hellwig
@ 2024-02-29 17:10             ` Darrick J. Wong
  2024-02-29 19:42               ` Christoph Hellwig
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-29 17:10 UTC
  To: Christoph Hellwig; +Cc: linux-xfs, hch

On Thu, Feb 29, 2024 at 05:22:13AM -0800, Christoph Hellwig wrote:
> On Wed, Feb 28, 2024 at 03:00:57PM -0800, Darrick J. Wong wrote:
> > The offsets and lengths for FICLONERANGE are unsigned, so I think
> > xfs_exchange_range ought to follow that.
> 
> Yes.  I thought I had actually brought that up before, but I might have
> wanted to and not actually sent the comments out..

You mentioned it in passing, but I misinterpreted what you'd said and
took the signedness in the wrong direction.  Here's what I went with in
the end:

struct xfs_exchange_range {
	__s32		file1_fd;
	__u32		pad;		/* must be zeroes */
	__u64		file1_offset;	/* file1 offset, bytes */
	__u64		file2_offset;	/* file2 offset, bytes */
	__u64		length;		/* bytes to exchange */

	__u64		flags;		/* see XFS_EXCHANGE_RANGE_* below */
};
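
A quick way to sanity-check that layout from userspace is a few compile
time assertions.  This is only a sketch: demo_exchange_range is a
hypothetical mirror of the struct above that uses <stdint.h> types in
place of __s32/__u32/__u64, and the expected numbers assume the usual
ABIs where those types have no internal padding.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace mirror of the struct quoted above. */
struct demo_exchange_range {
	int32_t		file1_fd;
	uint32_t	pad;		/* must be zeroes */
	uint64_t	file1_offset;	/* file1 offset, bytes */
	uint64_t	file2_offset;	/* file2 offset, bytes */
	uint64_t	length;		/* bytes to exchange */
	uint64_t	flags;
};

/*
 * The explicit pad puts every 64-bit field on an 8-byte boundary, so the
 * compiler has no reason to insert hidden padding of its own.
 */
static_assert(offsetof(struct demo_exchange_range, pad) == 4,
	      "pad follows the fd");
static_assert(offsetof(struct demo_exchange_range, file1_offset) == 8,
	      "first offset is 8-byte aligned");
static_assert(offsetof(struct demo_exchange_range, flags) == 32,
	      "flags is the last field");
static_assert(sizeof(struct demo_exchange_range) == 40,
	      "no implicit padding anywhere");
```

If any of those trip on some architecture, the ioctl structure would
need a compat handler there.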

--D

* Re: [PATCH 02/14] xfs: introduce new file range exchange ioctls
  2024-02-29 17:10             ` Darrick J. Wong
@ 2024-02-29 19:42               ` Christoph Hellwig
  0 siblings, 0 replies; 62+ messages in thread
From: Christoph Hellwig @ 2024-02-29 19:42 UTC
  To: Darrick J. Wong; +Cc: Christoph Hellwig, linux-xfs, hch

On Thu, Feb 29, 2024 at 09:10:43AM -0800, Darrick J. Wong wrote:
> You mentioned it in passing, but I misinterpreted what you'd said and
> took the signedness in the wrong direction.  Here's what I went with in
> the end:

Looks good to me.

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
@ 2024-02-29 20:18   ` Darrick J. Wong
  2024-02-29 22:43     ` Colin Walters
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-29 20:18 UTC
  To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Tue, Feb 27, 2024 at 08:50:20PM -0500, Colin Walters wrote:
> 
> 
> On Mon, Feb 26, 2024, at 9:18 PM, Darrick J. Wong wrote:
> > Hi all,
> >
> > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > ranges of bytes between two files atomically.  This new functionality
> > enables data storage programs to stage and commit file updates such that
> > reader programs will see either the old contents or the new contents in
> > their entirety, with no chance of torn writes.  A successful call
> > completion guarantees that the new contents will be seen even if the
> > system fails.
> >
> > The ability to exchange file fork mappings between files in this manner
> > is critical to supporting online filesystem repair, which is built upon
> > the strategy of constructing a clean copy of a damaged structure and
> > committing the new structure into the metadata file atomically.
> >
> > User programs will be able to update files atomically by opening an
> > O_TMPFILE, reflinking the source file to it, making whatever updates
> > they want to make, and exchange the relevant ranges of the temp file
> > with the original file. 
> 
> It's probably worth noting that the "reflinking the source file" here
> is optional, right?  IOW one can just:
> 
> - open(O_TMPFILE)
> - write()
> - ioctl(FIEXCHANGE_RANGE)

If the write() rewrites the entire file, then yes, that'll also work.

> I suspect the "simpler" non-database cases (think e.g. editors
> operating on plain text files) are going to be operating on an
> in-memory copy; in theory of course we could identify common ranges
> and reflink, but it's not clear to me it's really worth it at the
> tiny scale most source files are.

Correct, there's no built-in dedupe.  For small files you'll probably
end up with a single allocation anyway, which is ideal in terms of
ondisk metadata overhead.

One advantage that EXCHANGE_RANGE has over the rename dance is that the
calling application doesn't have to copy all the file attributes and
xattrs to the temporary file before the switch.
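
For comparison, here is roughly what the rename dance looks like in
plain POSIX C.  Treat it as a sketch: demo_replace_file is a made-up
helper name, it assumes dir and name are short enough to fit in the
local buffers, and it deliberately does not copy mode, ownership, or
xattrs from the old file, which is exactly the chore mentioned above.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Replace dir/name with buf; returns 0, or -1 with errno set. */
static int
demo_replace_file(const char *dir, const char *name,
		  const void *buf, size_t len)
{
	char		tmp[4096], dst[4096];
	const char	*p = buf;
	int		fd, dirfd, saved;

	assert(dir != NULL && name != NULL && buf != NULL);

	if ((size_t)snprintf(tmp, sizeof(tmp), "%s/.%s.XXXXXX", dir, name) >=
			sizeof(tmp) ||
	    (size_t)snprintf(dst, sizeof(dst), "%s/%s", dir, name) >=
			sizeof(dst)) {
		errno = ENAMETOOLONG;
		return -1;
	}

	/* Step 1: stage the new contents in the same directory. */
	fd = mkstemp(tmp);
	if (fd < 0)
		return -1;
	while (len > 0) {
		ssize_t	n = write(fd, p, len);

		if (n < 0 && errno == EINTR)
			continue;
		if (n < 0)
			goto fail;
		p += n;
		len -= (size_t)n;
	}

	/* Step 2: make the staged data durable before exposing it. */
	if (fsync(fd) < 0)
		goto fail;

	/* Step 3: atomically switch the name over to the new file. */
	if (rename(tmp, dst) < 0)
		goto fail;
	close(fd);

	/* Step 4: persist the directory update itself. */
	dirfd = open(dir, O_RDONLY | O_DIRECTORY);
	if (dirfd < 0)
		return -1;
	if (fsync(dirfd) < 0) {
		saved = errno;
		close(dirfd);
		errno = saved;
		return -1;
	}
	close(dirfd);
	return 0;

fail:
	saved = errno;
	close(fd);
	unlink(tmp);
	errno = saved;
	return -1;
}
```

Note that readers who already hold the old file open keep seeing the old
inode here, which is the behavioral difference from an exchange.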

> > The intent behind this new userspace functionality is to enable atomic
> > rewrites of arbitrary parts of individual files.  For years, application
> > programmers wanting to ensure the atomicity of a file update had to
> > write the changes to a new file in the same directory
> 
> More sophisticated tools already are using O_TMPFILE I would say,
> just with a final last step of materializing it with a name,
> and then rename() into place.  So if this also
> obviates the need for
> https://lore.kernel.org/linux-fsdevel/364531.1579265357@warthog.procyon.org.uk/
> that seems good.

It would, though I would bet that extending linkat (or rename, or
whatever) is going to be the only workable solution for old / simple
filesystems (e.g. fat32).

> >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> >        tions, so no userspace-level locks need to be taken  to  obtain
> >        consistent  results.  Implementations must guarantee that read‐
> >        ers see either the old contents or the new  contents  in  their
> >        entirety, even if the system fails.
> 
> But given that we're reusing the same inode, I don't think that can
> *really* be true...at least, not without higher level serialization.

Higher level coordination is required, yes.  It doesn't have to be
serialization, though.  The committing thread could signal all the other
readers that they should invalidate and restart whatever they're working
on if that work depends on the file that was COMMIT_RANGE'd.  The
readers could detect unexpected data and resample mtime of the files
they've read and restart if it's changed.
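
The "resample mtime and restart" idea could look something like the
sketch below.  demo_read_stable is a hypothetical helper, not anything
in this patchset, and it is only as good as the timestamp granularity:
a change that lands within the same timestamp tick would slip past it.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Read up to len bytes from the start of path, restarting if mtime moved
 * while we were reading.  Returns the byte count, or -1 with errno set
 * (EAGAIN if every attempt raced with a writer).
 */
static ssize_t
demo_read_stable(const char *path, void *buf, size_t len, int max_tries)
{
	struct stat	before, after;
	ssize_t		n = -1;
	int		fd, saved;

	assert(path != NULL && buf != NULL);

	fd = open(path, O_RDONLY);
	if (fd < 0)
		return -1;

	errno = EAGAIN;
	while (max_tries-- > 0) {
		if (fstat(fd, &before) < 0)
			break;
		n = pread(fd, buf, len, 0);
		if (n < 0)
			break;
		if (fstat(fd, &after) < 0) {
			n = -1;
			break;
		}
		if (before.st_mtim.tv_sec == after.st_mtim.tv_sec &&
		    before.st_mtim.tv_nsec == after.st_mtim.tv_nsec)
			goto out;	/* nothing moved underneath us */

		/* The file changed mid-read; discard and go around again. */
		n = -1;
		errno = EAGAIN;
	}
out:
	saved = errno;
	close(fd);
	errno = saved;
	return n;
}
```

Keeping the fd open across retries is fine for the exchange case because
the inode stays the same; it would not notice a rename-style replacement.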

> A classic case today is dconf in GNOME is a basic memory-mapped
> database file that is atomically replaced by the "create new file,
> rename into place" model.  Clients with mmap() view just see the old
> data until they reload explicitly.  But with this, clients with mmap'd
> view *will* immediately see the new contents (because it's the same
> inode, right?)

Correct, they'll start seeing the new contents as soon as they access
the affected pages.

How /does/ dconf handle those changes?  Does it rename the file and
signal all the other dconf threads to reopen the file?  And then those
threads get the new file contents?

>                and that's just going to lead to possibly split reads
> and undefined behavior - without extra userspace serialization or
> locking (that more proper databases) are going to be doing.

Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
with mmap without actual flock()ing. :(

fsnotify will send a message out to userspace after the exchange
finishes, which means that userspace could watch for the notifications
via fanotify.  However, that's still a bit racy... :/

> Arguably of course, dconf is too simple and more sophisticated tools
> like sqlite or LMDB could make use of this.  (There's some special
> atomic write that got added to f2fs for sqlite last I saw...I'm
> curious if this could replace it)

I think so:

F2FS_IOC_START_ATOMIC_WRITE -> XFS_IOC_START_COMMIT,
F2FS_IOC_COMMIT_ATOMIC_WRITE -> XFS_IOC_COMMIT_RANGE, and
F2FS_IOC_ABORT_VOLATILE_WRITE merely turns into close(temp_fd);

> But still...it seems to me like there's going to be quite a lot of the
> "potentially concurrent reader, atomic replace desired" pattern and
> since this can't replace that, we should call that out explicitly in
> the man page.  And also if so, then there's still a need for the
> linkat(AT_REPLACE) etc.

Hmm, I think I'll shrink that paragraph of the manpage:

"Exchanges are atomic with regards to concurrent file operations.
Implementations must guarantee that readers see either the old contents
or the new contents in their entirety, even if the system fails."

> 
> >            XFS_EXCHRANGE_TO_EOF
> 
> I kept reading this as some sort of typo...would it really be too
> onerous to spell it out as XFS_EXCHANGE_RANGE_TO_EOF e.g.?  Echoes of
> unix "creat" here =)

Yeah, I've expanded that to XFS_EXCHANGE_RANGE_TO_EOF for v29.5.

--D

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-29 20:18   ` Darrick J. Wong
@ 2024-02-29 22:43     ` Colin Walters
  2024-03-01  0:03       ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Colin Walters @ 2024-02-29 22:43 UTC
  To: Darrick J. Wong; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
>
> Correct, there's no built-in dedupe.  For small files you'll probably
> end up with a single allocation anyway, which is ideal in terms of
> ondisk metadata overhead.

Makes sense.

> though I would bet that extending linkat (or rename, or
> whatever) is going to be the only workable solution for old / simple
> filesystems (e.g. fat32).

Ah, right; that too.

> How /does/ dconf handle those changes?  Does it rename the file and
> signal all the other dconf threads to reopen the file?  And then those
> threads get the new file contents?

I briefly skimmed the code and couldn't find it, but yes I believe it's basically that clients have an inotify watch that gets handled from the mainloop and clients close and reopen and re-mmap - it's probably nonexistent to have non-mainloop threads reading things from the mmap, so there's no races with any other threads.

>
> Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
> with mmap without actual flock()ing. :(
>
> fsnotify will send a message out to userspace after the exchange
> finishes, which means that userspace could watch for the notifications
> via fanotify.  However, that's still a bit racy... :/

Right.  However...it's not just about mmap.  Sorry this is a minor rant but...near my top ten list of changes to make with a time machine for Unix would be the concept of a contents-immutable file, like all the seals that work on memfd with F_ADD_SEALS (and outside of fsverity, which is good but can be a bit of a heavier hammer).

A few times I've been working on a shell script in my editor on my desktop, and these shell scripts are tests because shell script is so tempting.  I'm sure this is familiar, given (x)fstests.

Say you run the tests (directly from source in git), notice a bug, start typing in your editor, and save the changes, and your editor happens to do a generic "open(O_TRUNC), save" instead of an atomic rename.  This happens to be what `nano` and VSCode do, although at least the `vi` I have here does an atomic rename.  (One could say all editors that don't are broken...but...)

And now because the way bash works (and I assume other historical Unix shells) is that they interpret the file *as they're reading it* in this scenario you can get completely undefined behavior.  It could do *anything*.

At least one of those times, I got an error from an `rm -rf` invocation that happened to live in one of those test scripts...that could have in theory just gone off and removed anything.

Basically the contents-immutable is really what you *always* want for executables and really anything that can be parsed without locking (like, almost all config files in /etc too).  With ELF files there's ETXTBSY if it *happens* to be in use, but that's just a hack.  Also in that other thread about racing writes to suid executables...well, there'd be no possibility for races if we just denied writing because again - it makes no sense to just make random writes in-place to an executable.  (OK I did see the zig folks are trying an incremental linker, but still I would just assume reflinks are available for that)

Now this is relevant here because, I don't think anything like dpkg/rpm and all those things could ever use this ioctl for this reason.

So, it seems to me like it should really be more explicitly targeted at
- Things that are using open()+write() today and it's safe for that use case
- The database cases

And not talk about replacing the general open(O_TMPFILE) + rename() path.

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-27 18:52   ` Amir Goldstein
@ 2024-02-29 23:27     ` Darrick J. Wong
  2024-03-01 13:00       ` Amir Goldstein
  2024-03-01 13:31       ` Jeff Layton
  0 siblings, 2 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-02-29 23:27 UTC
  To: Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> >
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > To head off bikeshedding about the fields in xfs_commit_range, let's
> > make it an opaque u64 array and require the userspace program to call
> > a third ioctl to sample the freshness data for us.  If we ever converge
> > on a definition for i_version then we can use that; for now we'll just
> > use mtime/ctime like the old swapext ioctl.
> 
> This addresses my concerns about using mtime/ctime.

Oh good! :)

> I have to say, Darrick, that I think that referring to this concern as
> bikeshedding is not being honest.
> 
> I do hate nit picking reviews and I do hate "maybe also fix the world"
> review comments, but I think the question about using mtime/ctime in
> this new API was not out of place

I agree, your question about mtime/ctime:

"Maybe a stupid question, but under which circumstances would mtime
change and ctime not change? Why are both needed?"

was a very good question.  But perhaps that statement referred to the
other part of that thread.

>                                   and I think that making the freshness
> data opaque is better for everyone in the long run and hopefully, this will
> help you move to the things you care about faster.

I wish you'd suggested an opaque blob that the fs can lay out however it
wants instead of suggesting specifically the change cookie.  I'm very
much ok with an opaque freshness blob that allows future flexibility in
how we define the blob's contents.
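
To make the shape of that concrete, here's a userspace mock-up.
Everything named demo_* is hypothetical and is not the ioctl interface
from this patchset; it only shows a caller carrying around a blob it
never interprets.  The check-then-write below is also racy in a way the
kernel version would not be, since the kernel can do both under the
inode lock.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical opaque blob; callers must not look inside. */
struct demo_freshness {
	uint64_t	opaque[4];
};

/* Sample the blob.  This mock-up fills it with mtime and ctime. */
static int
demo_sample_freshness(int fd, struct demo_freshness *f)
{
	struct stat	st;

	assert(f != NULL);
	if (fstat(fd, &st) < 0)
		return -1;
	f->opaque[0] = (uint64_t)st.st_mtim.tv_sec;
	f->opaque[1] = (uint64_t)st.st_mtim.tv_nsec;
	f->opaque[2] = (uint64_t)st.st_ctim.tv_sec;
	f->opaque[3] = (uint64_t)st.st_ctim.tv_nsec;
	return 0;
}

/*
 * Overwrite the start of path with buf, but only if the file still
 * matches the earlier sample.  Fails with EBUSY if it has changed.
 */
static int
demo_commit_if_fresh(const char *path, const struct demo_freshness *then,
		     const void *buf, size_t len)
{
	struct demo_freshness	now;
	int			fd, saved, ret = -1;

	fd = open(path, O_WRONLY);
	if (fd < 0)
		return -1;
	if (demo_sample_freshness(fd, &now) < 0)
		goto out;
	if (memcmp(&now, then, sizeof(now)) != 0) {
		errno = EBUSY;
		goto out;
	}
	errno = EIO;	/* reported if the write comes up short */
	if (pwrite(fd, buf, len, 0) == (ssize_t)len)
		ret = 0;
out:
	saved = errno;
	close(fd);
	errno = saved;
	return ret;
}
```

Swapping the blob's contents for a change counter later would only touch
demo_sample_freshness; callers would not need to change.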

I was however very upset about Jeff's suggestion of using i_version.
I apologize for using all caps in that reply, and snarling about it in
the commit message here.  The final version of this patch will not have
that.

That said, I don't think it is at all helpful to suggest using a file
attribute whose behavior is as yet unresolved.  Multigrain timestamps
were a clever idea, regrettably reverted.  As far as I could tell when I
wrote my reply, neither had NFS implemented a better behavior and
quietly merged it; nor have Jeff and Dave produced any sort of candidate
patchset to fix all the resulting issues in XFS.

Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
internal" made me think "OH $deity, they wants me to do that work
too???"

A better way to have worded that might've been "How about switching
this to a fs-determined structure so that we can switch the freshness
check to i_version when that's fully working on XFS?"

The problem I have with reading patch review emails is that I can't
easily tell whether an author's suggestion is being made in a casual
offhand manner, or if it reflects something they feel strongly needs to
change before merging.

In fairness to you, Amir, I don't know how much you've kept on top of
that i_version vs. XFS discussion.  So I have no idea if you were aware
of the status of that work.

--D

> Thanks,
> Amir.
> 

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-29 22:43     ` Colin Walters
@ 2024-03-01  0:03       ` Darrick J. Wong
  0 siblings, 0 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-03-01  0:03 UTC
  To: Colin Walters; +Cc: linux-fsdevel, xfs, Christoph Hellwig

On Thu, Feb 29, 2024 at 05:43:21PM -0500, Colin Walters wrote:
> 
> 
> On Thu, Feb 29, 2024, at 3:18 PM, Darrick J. Wong wrote:
> >
> > Correct, there's no built-in dedupe.  For small files you'll probably
> > end up with a single allocation anyway, which is ideal in terms of
> > ondisk metadata overhead.
> 
> Makes sense.
> 
> > though I would bet that extending linkat (or rename, or
> > whatever) is going to be the only workable solution for old / simple
> > filesystems (e.g. fat32).
> 
> Ah, right; that too.
> 
> > How /does/ dconf handle those changes?  Does it rename the file and
> > signal all the other dconf threads to reopen the file?  And then those
> > threads get the new file contents?
> 
> I briefly skimmed the code and couldn't find it, but yes I believe
> it's basically that clients have an inotify watch that gets handled
> from the mainloop and clients close and reopen and re-mmap - it's
> probably nonexistent to have non-mainloop threads reading things from
> the mmap, so there's no races with any other threads.

Hrmm.  IIRC inotify and fanotify both use the same fsnotify backend.
fsnotify events are emitted after i_rwsem drops, which (if I read
read_write.c correctly) means this is technically racy.

That said if they're mostly waiting around in the inotify loop then it
probably doesn't matter.

> > Huurrrh hurrrh.  That's right, I don't see how exchange can mesh well
> > with mmap without actual flock()ing. :(
> >
> > fsnotify will send a message out to userspace after the exchange
> > finishes, which means that userspace could watch for the notifications
> > via fanotify.  However, that's still a bit racy... :/
> 
> Right.  However...it's not just about mmap.  Sorry this is a minor
> rant but...near my top ten list of changes to make with a time machine
> for Unix would be the concept of a contents-immutable file, like all
> the seals that work on memfd with F_ADD_SEALS (and outside of
> fsverity, which is good but can be a bit of a heavier hammer).

You and me both. :)

Also I want a persistent file contents write counter; and a
file-anything write counter.

Oh, and a conditional read where you pass in the file contents write
counter and it returns an error if the file has been changed since
sampling time.  The changecookie thing mentioned elsewhere gets us
towards that, if only the issues w/ XFS get resolved.

> A few times I've been working on shell script in my editor on my
> desktop, and these shell scripts are tests because shell script is so
> tempting.  I'm sure this is familiar, given (x)fstests.
> 
> And if you just run the tests (directly from source in git), and then
> notice a bug, and start typing in your editor, save the changes, and
> then your editor happens to do a generic "open(O_TRUNC), save"
> instead of an atomic rename.  This happens to be what `nano` and
> VSCode do, although at least the `vi` I have here does an atomic
> rename.  (One could say all editors that don't are broken...but...)

I think they do O_TRUNC because it saves them from having to copy the
file attrs and xattrs.  Too bad it severely screws up a program running
in another terminal that just happens to hit the zero-byte file.

> And now because the way bash works (and I assume other historical Unix
> shells) is that they interpret the file *as they're reading it* in
> this scenario you can get completely undefined behavior.  It could do
> *anything*.
> 
> At least one of those times, I got an error from an `rm -rf`
> invocation that happened to live in one of those test scripts...that
> could have in theory just gone off and removed anything.
> 
> Basically the contents-immutable is really what you *always* want for
> executables and really anything that can be parsed without locking
> (like, almost all config files in /etc too).  With ELF files there's

Yes.

> ETXTBSY if it *happens* to be in use, but that's just a hack.  Also in

A hack that doesn't work for scripts.  Either interpreters have to read
the entire script into memory before execution, or I guess they can do
the insane thing that the DOS batch interpreter did, where before each
statement it would save the file pos, close it, execute the command,
reopen the batch file, and seek back to that line.

> that other thread about racing writes to suid executables...well,
> there'd be no possibility for races if we just denied writing because
> again - it makes no sense to just make random writes in-place to an
> executable.  (OK I did see the zig folks are trying an incremental
> linker, but still I would just assume reflinks are available for that)
> 
> Now this is relevant here because, I don't think anything like
> dpkg/rpm and all those things could ever use this ioctl for this
> reason.

Right.  dpkg executable file replacement really doesn't make much sense
for exchange range.  That also wasn't the use case I was targeting,
though admittedly I'm only using this ioctl to test functionality that
online fsck requires.

> So, it seems to me like it should really be more explicitly targeted at
> - Things that are using open()+write() today and it's safe for that use case
> - The database cases
> 
> And not talk about replacing the general open(O_TMPFILE) + rename() path.

I think I'll change the cover letter to talk about what it does, what
problems it solves, and what problems it introduces.  Figuring out how
to take advantage of it is an exercise for application writers.

--D

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-29 23:27     ` Darrick J. Wong
@ 2024-03-01 13:00       ` Amir Goldstein
  2024-03-01 13:31       ` Jeff Layton
  1 sibling, 0 replies; 62+ messages in thread
From: Amir Goldstein @ 2024-03-01 13:00 UTC
  To: Darrick J. Wong; +Cc: linux-fsdevel, linux-xfs, hch, jlayton

On Fri, Mar 1, 2024 at 1:27 AM Darrick J. Wong <djwong@kernel.org> wrote:
>
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > >
> > > From: Darrick J. Wong <djwong@kernel.org>
> > >
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> >
> > This addresses my concerns about using mtime/ctime.
>
> Oh good! :)
>
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> >
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
>
> I agree, your question about mtime/ctime:
>
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
>
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
>
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
>
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
>

I wish I had thought of it myself - it is a good idea - just did not
occur to me.
Using the language of i_changecounter, that is "the current xfs implementation
of i_version", I still think that using it as the content of the
opaque freshness blob
makes more sense than mtime+ctime, but it is none of my concern what
you decide to fill in the freshness blob for the first version.

I was not aware of the way xfs_fsr is currently using mtime+ctime when
I replied and I am not sure if and how it is relevant to the new API.

> I was however very upset about Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
>
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
>
> A better way to have worded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
>

Yeh, I should have chosen my words more carefully.
I was perfectly aware of your lack of interest in doing extra work
and wasn't trying to request any.

> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner, or if it reflects something they feel strongly needs to
> change before merging.
>

Can't speak for everyone else, but coming from the middle east,
I have fewer politeness filters.
When I write "wouldn't it be better to use change_cookie?"
I am just asking that question.

When I am asking something to be changed before merge,
I try to be much more explicit about it and this is what I expect
others to do when reviewing my patches.

Thanks,
Amir.

* Re: [PATCHSET v29.4 03/13] xfs: atomic file content exchanges
  2024-02-27 16:06     ` Darrick J. Wong
@ 2024-03-01 13:16       ` Jeff Layton
  0 siblings, 0 replies; 62+ messages in thread
From: Jeff Layton @ 2024-03-01 13:16 UTC
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Tue, 2024-02-27 at 08:06 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 05:53:46AM -0500, Jeff Layton wrote:
> > On Tue, 2024-02-27 at 11:23 +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 4:18 AM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > Hi all,
> > > > 
> > > > This series creates a new FIEXCHANGE_RANGE system call to exchange
> > > > ranges of bytes between two files atomically.  This new functionality
> > > > enables data storage programs to stage and commit file updates such that
> > > > reader programs will see either the old contents or the new contents in
> > > > their entirety, with no chance of torn writes.  A successful call
> > > > completion guarantees that the new contents will be seen even if the
> > > > system fails.
> > > > 
> > > > The ability to exchange file fork mappings between files in this manner
> > > > is critical to supporting online filesystem repair, which is built upon
> > > > the strategy of constructing a clean copy of a damaged structure and
> > > > committing the new structure into the metadata file atomically.
> > > > 
> > > > User programs will be able to update files atomically by opening an
> > > > O_TMPFILE, reflinking the source file to it, making whatever updates
> > > > they want to make, and exchange the relevant ranges of the temp file
> > > > with the original file.  If the updates are aligned with the file block
> > > > size, a new (since v2) flag provides for exchanging only the written
> > > > areas.  Callers can arrange for the update to be rejected if the
> > > > original file has been changed.
> > > > 
> > > > The intent behind this new userspace functionality is to enable atomic
> > > > rewrites of arbitrary parts of individual files.  For years, application
> > > > programmers wanting to ensure the atomicity of a file update had to
> > > > write the changes to a new file in the same directory, fsync the new
> > > > file, rename the new file on top of the old filename, and then fsync the
> > > > directory.  People get it wrong all the time, and $fs hacks abound.
> > > > Here are the proposed manual pages:
> > > > 
> > 
> > This is a cool idea!  I've had some handwavy ideas about making a gated
> > write() syscall (i.e. only write if the change cookie hasn't changed),
> > but something like this may be a simpler lift initially.
> 
> How /does/ userspace get at the change cookie nowadays?
> 

Today, it doesn't. That would need to be exposed before we could make
that work.

> > > > IOCTL-XFS-EXCHANGE-RANGE(2) System Calls Manual IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_exchange_range  -  exchange  the contents of parts of
> > > >        two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int   ioctl(int   file2_fd,   XFS_IOC_EXCHANGE_RANGE,    struct
> > > >        xfs_exch_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes the contents of the two ranges.
> > > > 
> > > >        Exchanges  are  atomic  with  regards to concurrent file opera‐
> > > >        tions, so no userspace-level locks need to be taken  to  obtain
> > > >        consistent  results.  Implementations must guarantee that read‐
> > > >        ers see either the old contents or the new  contents  in  their
> > > >        entirety, even if the system fails.
> > > > 
> > > >        The  system  call  parameters are conveyed in structures of the
> > > >        following form:
> > > > 
> > > >            struct xfs_exch_range {
> > > >                __s64    file1_fd;
> > > >                __s64    file1_offset;
> > > >                __s64    file2_offset;
> > > >                __s64    length;
> > > >                __u64    flags;
> > > > 
> > > >                __u64    pad;
> > > >            };
> > > > 
> > > >        The field pad must be zero.
> > > > 
> > > >        The fields file1_fd, file1_offset, and length define the  first
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        The fields file2_fd, file2_offset, and length define the second
> > > >        range of bytes to be exchanged.
> > > > 
> > > >        Both files must be from the same filesystem mount.  If the  two
> > > >        file  descriptors represent the same file, the byte ranges must
> > > >        not overlap.  Most  disk-based  filesystems  require  that  the
> > > >        starts  of  both ranges must be aligned to the file block size.
> > > >        If this is the case, the ends of the ranges  must  also  be  so
> > > >        aligned unless the XFS_EXCHRANGE_TO_EOF flag is set.
> > > > 
> > > >        The field flags control the behavior of the exchange operation.
> > > > 
> > > >            XFS_EXCHRANGE_TO_EOF
> > > >                   Ignore  the length parameter.  All bytes in file1_fd
> > > >                   from file1_offset to EOF are moved to file2_fd,  and
> > > >                   file2's  size is set to (file2_offset+(file1_length-
> > > >                   file1_offset)).  Meanwhile, all bytes in file2  from
> > > >                   file2_offset  to  EOF are moved to file1 and file1's
> > > >                   size   is   set   to    (file1_offset+(file2_length-
> > > >                   file2_offset)).
> > > > 
> > > >            XFS_EXCHRANGE_DSYNC
> > > >                   Ensure  that  all modified in-core data in both file
> > > >                   ranges and all metadata updates  pertaining  to  the
> > > >                   exchange operation are flushed to persistent storage
> > > >                   before the call returns.  Opening  either  file  de‐
> > > >                   scriptor  with  O_SYNC or O_DSYNC will have the same
> > > >                   effect.
> > > > 
> > > >            XFS_EXCHRANGE_FILE1_WRITTEN
> > > >                   Only exchange sub-ranges of file1_fd that are  known
> > > >                   to  contain  data  written  by application software.
> > > >                   Each sub-range may be  expanded  (both  upwards  and
> > > >                   downwards)  to  align with the file allocation unit.
> > > >                   For files on the data device, this is one filesystem
> > > >                   block.   For  files  on the realtime device, this is
> > > >                   the realtime extent size.  This facility can be used
> > > >                   to  implement  fast  atomic scatter-gather writes of
> > > >                   any complexity for software-defined storage  targets
> > > >                   if  all  writes  are  aligned to the file allocation
> > > >                   unit.
> > > > 
> > > >            XFS_EXCHRANGE_DRY_RUN
> > > >                   Check the parameters and the feasibility of the  op‐
> > > >                   eration, but do not change anything.
> > > > 
> > > > RETURN VALUE
> > > >        On  error, -1 is returned, and errno is set to indicate the er‐
> > > >        ror.
> > > > 
> > > > ERRORS
> > > >        Error codes can be one of, but are not limited to, the  follow‐
> > > >        ing:
> > > > 
> > > >        EBADF  file1_fd  is not open for reading and writing or is open
> > > >               for append-only writes; or  file2_fd  is  not  open  for
> > > >               reading and writing or is open for append-only writes.
> > > > 
> > > >        EINVAL The  parameters  are  not correct for these files.  This
> > > >               error can also appear if either file  descriptor  repre‐
> > > >               sents  a device, FIFO, or socket.  Disk filesystems gen‐
> > > >               erally require the offset and  length  arguments  to  be
> > > >               aligned to the fundamental block sizes of both files.
> > > > 
> > > >        EIO    An I/O error occurred.
> > > > 
> > > >        EISDIR One of the files is a directory.
> > > > 
> > > >        ENOMEM The  kernel  was unable to allocate sufficient memory to
> > > >               perform the operation.
> > > > 
> > > >        ENOSPC There is not enough free space  in  the  filesystem  to
> > > >               exchange the contents safely.
> > > > 
> > > >        EOPNOTSUPP
> > > >               The filesystem does not support exchanging bytes between
> > > >               the two files.
> > > > 
> > > >        EPERM  file1_fd or file2_fd are immutable.
> > > > 
> > > >        ETXTBSY
> > > >               One of the files is a swap file.
> > > > 
> > > >        EUCLEAN
> > > >               The filesystem is corrupt.
> > > > 
> > > >        EXDEV  file1_fd and  file2_fd  are  not  on  the  same  mounted
> > > >               filesystem.
> > > > 
> > > > CONFORMING TO
> > > >        This API is XFS-specific.
> > > > 
> > > > USE CASES
> > > >        Several  use  cases  are imagined for this system call.  In all
> > > >        cases, application software must coordinate updates to the file
> > > >        because the exchange is performed unconditionally.
> > > > 
> > > >        The  first  is a data storage program that wants to commit non-
> > > >        contiguous updates to a file atomically and  coordinates  write
> > > >        access  to that file.  This can be done by creating a temporary
> > > >        file, calling FICLONE(2) to share the contents, and staging the
> > > >        updates into the temporary file.  The FULL_FILES flag is recom‐
> > > >        mended for this purpose.  The temporary file can be deleted  or
> > > >        punched out afterwards.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > > 
> > > >            ioctl(temp_fd, FICLONE, fd);
> > > > 
> > > >            /* append 1MB of records */
> > > >            lseek(temp_fd, 0, SEEK_END);
> > > >            write(temp_fd, data1, 1000000);
> > > > 
> > > >            /* update record index */
> > > >            pwrite(temp_fd, data1, 600, 98765);
> > > >            pwrite(temp_fd, data2, 320, 54321);
> > > >            pwrite(temp_fd, data2, 15, 0);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .flags = XFS_EXCHRANGE_TO_EOF,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > >        The  second  is  a  software-defined  storage host (e.g. a disk
> > > >        jukebox) which implements an atomic scatter-gather  write  com‐
> > > >        mand.   Provided the exported disk's logical block size matches
> > > >        the file's allocation unit size, this can be done by creating a
> > > >        temporary file and writing the data at the appropriate offsets.
> > > >        It is recommended that the temporary file be truncated  to  the
> > > >        size  of  the  regular file before any writes are staged to the
> > > >        temporary file to avoid issues with zeroing during  EOF  exten‐
> > > >        sion.   Use  this  call with the FILE1_WRITTEN flag to exchange
> > > >        only the file allocation units involved  in  the  emulated  de‐
> > > >        vice's  write  command.  The temporary file should be truncated
> > > >        or punched out completely before being reused to stage  another
> > > >        write.
> > > > 
> > > >        An example program might look like this:
> > > > 
> > > >            int fd = open("/some/file", O_RDWR);
> > > >            int temp_fd = open("/some", O_TMPFILE | O_RDWR);
> > > >            struct stat sb;
> > > >            int blksz;
> > > > 
> > > >            fstat(fd, &sb);
> > > >            blksz = sb.st_blksize;
> > > > 
> > > >            /* land scatter gather writes between 100fsb and 500fsb */
> > > >            pwrite(temp_fd, data1, blksz * 2, blksz * 100);
> > > >            pwrite(temp_fd, data2, blksz * 20, blksz * 480);
> > > >            pwrite(temp_fd, data3, blksz * 7, blksz * 257);
> > > > 
> > > >            /* commit the entire update */
> > > >            struct xfs_exch_range args = {
> > > >                .file1_fd = temp_fd,
> > > >                .file1_offset = blksz * 100,
> > > >                .file2_offset = blksz * 100,
> > > >                .length       = blksz * 400,
> > > >                .flags        = XFS_EXCHRANGE_FILE1_WRITTEN |
> > > >                                XFS_EXCHRANGE_DSYNC,
> > > >            };
> > > > 
> > > >            ioctl(fd, XFS_IOC_EXCHANGE_RANGE, &args);
> > > > 
> > > > NOTES
> > > >        Some  filesystems may limit the amount of data or the number of
> > > >        extents that can be exchanged in a single call.
> > > > 
> > > > SEE ALSO
> > > >        ioctl(2)
> > > > 
> > > > XFS                           2024-02-10   IOCTL-XFS-EXCHANGE-RANGE(2)
> > > > IOCTL-XFS-COMMIT-RANGE(2) System Calls Manual IOCTL-XFS-COMMIT-RANGE(2)
> > > > 
> > > > NAME
> > > >        ioctl_xfs_commit_range - conditionally exchange the contents of
> > > >        parts of two files
> > > > 
> > > > SYNOPSIS
> > > >        #include <sys/ioctl.h>
> > > >        #include <xfs/xfs_fs_staging.h>
> > > > 
> > > >        int ioctl(int file2_fd, XFS_IOC_COMMIT_RANGE,  struct  xfs_com‐
> > > >        mit_range *arg);
> > > > 
> > > > DESCRIPTION
> > > >        Given  a  range  of bytes in a first file file1_fd and a second
> > > >        range of bytes in a second file  file2_fd,  this  ioctl(2)  ex‐
> > > >        changes  the contents of the two ranges if file2_fd passes cer‐
> > > >        tain freshness criteria.
> > > > 
> > > >        After locking both files but before  exchanging  the  contents,
> > > >        the  supplied  file2_ino field must match file2_fd's inode num‐
> > > >        ber,   and   the   supplied   file2_mtime,    file2_mtime_nsec,
> > > >        file2_ctime, and file2_ctime_nsec fields must match the modifi‐
> > > >        cation time and change time of file2.  If they  do  not  match,
> > > >        EBUSY will be returned.
> > > > 
> > > 
> > > Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?
> > > 
> > 
> > ctime should always change if the mtime does. An mtime update means that
> > the metadata was updated, so you also need to update the ctime. 
> 
> Exactly. :)
> 
> > > And for a new API, wouldn't it be better to use change_cookie (a.k.a i_version)?
> > > Even if this API is designed to be hoisted out of XFS at some future time,
> > > Is there a real need to support it on filesystems that do not support
> > > i_version(?)
> > > 
> > > Not to mention the fact that POSIX does not explicitly define how ctime should
> > > behave with changes to fiemap (uninitialized extent and all), so who knows
> > > how other filesystems may update ctime in those cases.
> > > 
> > > I realize that STATX_CHANGE_COOKIE is currently kernel internal, but
> > > it seems that XFS_IOC_EXCHANGE_RANGE is a case where userspace
> > > really explicitly requests a bump of i_version on the next change.
> > > 
> > 
> > 
> > I agree. Using an opaque change cookie would be a lot nicer from an API
> > standpoint, and shouldn't be subject to timestamp granularity issues.
> 
> TLDR: No.
> 
> > That said, XFS's change cookie is currently broken. Dave C. said he had
> > some patches in progress to fix that however.
> 
> Dave says that about a lot of things.  I'm not willing to delay the
> online fsck project _even further_ for a bunch of vaporware that's not
> even out on linux-xfs for review.
> 
> The difference in opinion between xfs and the rest of the kernel about
> i_version is 50% of why I didn't include it here.  The other 50% is the
> part where userspace can't access it, because I do not want to saddle my
> mostly internal project with YET ANOTHER ASK FROM RH PEOPLE FOR CORE
> CHANGES.

Ouch, point taken.

I just have grave concerns about using something as coarse-grained as
the timestamps to gate changes to a file. With modern machines, a single timestamp
can represent a large number of different states of the file's contents.

Is that not a problem here?
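
To make the concern concrete, here is a minimal sketch in plain C (not
kernel code).  The names model_file, coarse_stamp, model_write and
looks_fresh are made up for illustration, and the fixed clock granule is an
assumption; the sketch only models a file whose mtime is stamped from a
clock that advances once per granule.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of a file whose mtime comes from a coarse clock. */
struct model_file {
    uint64_t mtime_ns;   /* timestamp as last recorded, in nanoseconds */
    uint64_t nr_writes;  /* ground truth: number of writes that happened */
};

/* Round a nanosecond clock reading down to the clock's granularity. */
uint64_t coarse_stamp(uint64_t now_ns, uint64_t granule_ns)
{
    assert(granule_ns != 0);
    return now_ns - (now_ns % granule_ns);
}

/* Apply one write at time now_ns, stamping mtime from the coarse clock. */
void model_write(struct model_file *f, uint64_t now_ns, uint64_t granule_ns)
{
    f->mtime_ns = coarse_stamp(now_ns, granule_ns);
    f->nr_writes++;
}

/* Timestamp-only check: "fresh" if mtime still equals the sampled value. */
bool looks_fresh(const struct model_file *f, uint64_t sampled_mtime_ns)
{
    return f->mtime_ns == sampled_mtime_ns;
}
```

With an assumed 4 ms granule, writes at t=10 ms and t=11 ms both record an
mtime of 8 ms, so a sample taken between the two writes still compares
equal after the second one.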
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-02-29 23:27     ` Darrick J. Wong
  2024-03-01 13:00       ` Amir Goldstein
@ 2024-03-01 13:31       ` Jeff Layton
  2024-03-02  2:48         ` Darrick J. Wong
  1 sibling, 1 reply; 62+ messages in thread
From: Jeff Layton @ 2024-03-01 13:31 UTC
  To: Darrick J. Wong, Amir Goldstein; +Cc: linux-fsdevel, linux-xfs, hch

On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > 
> > > From: Darrick J. Wong <djwong@kernel.org>
> > > 
> > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > make it an opaque u64 array and require the userspace program to call
> > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > on a definition for i_version then we can use that; for now we'll just
> > > use mtime/ctime like the old swapext ioctl.
> > 
> > This addresses my concerns about using mtime/ctime.
> 
> Oh good! :)
> 
> > I have to say, Darrick, that I think that referring to this concern as
> > bikeshedding is not being honest.
> > 
> > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > review comments, but I think the question about using mtime/ctime in
> > this new API was not out of place
> 
> I agree, your question about mtime/ctime:
> 
> "Maybe a stupid question, but under which circumstances would mtime
> change and ctime not change? Why are both needed?"
> 
> was a very good question.  But perhaps that statement referred to the
> other part of that thread.
> 
> >                                   and I think that making the freshness
> > data opaque is better for everyone in the long run and hopefully, this will
> > help you move to the things you care about faster.
> 
> I wish you'd suggested an opaque blob that the fs can lay out however it
> wants instead of suggesting specifically the change cookie.  I'm very
> much ok with an opaque freshness blob that allows future flexibility in
> how we define the blob's contents.
> 
> I was however very upset about Jeff's suggestion of using i_version.
> I apologize for using all caps in that reply, and snarling about it in
> the commit message here.  The final version of this patch will not have
> that.
> 
> That said, I don't think it is at all helpful to suggest using a file
> attribute whose behavior is as yet unresolved.  Multigrain timestamps
> were a clever idea, regrettably reverted.  As far as I could tell when I
> wrote my reply, neither had NFS implemented a better behavior and
> quietly merged it; nor have Jeff and Dave produced any sort of candidate
> patchset to fix all the resulting issues in XFS.
>
> Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> internal" made me think "OH $deity, they wants me to do that work
> too???"
> 
> A better way to have worded that might've been "How about switching
> this to a fs-determined structure so that we can switch the freshness
> check to i_version when that's fully working on XFS?"
> 
> The problem I have with reading patch review emails is that I can't
> easily tell whether an author's suggestion is being made in a casual
> offhand manner?  Or if it reflects something they feel strongly needs
> change before merging.
> 
> In fairness to you, Amir, I don't know how much you've kept on top of
> that i_version vs. XFS discussion.  So I have no idea if you were aware
> of the status of that work.
> 

Sorry, I didn't mean to trigger anyone, but I do have real concerns
about any API that attempts to use timestamps to detect whether
something has changed.

We learned that lesson in NFS in the 90's. VFS timestamp resolution is
just not enough to show whether there was a change to a file -- full
stop.

I get the hand-wringing over i_version definitions and I don't care to
rehash that discussion here, but I'll point out that this is a
(proposed) XFS-private interface:

What you could do is expose the XFS change counter (the one that gets
bumped for everything, even atime updates, possibly via a different
ioctl), and use that for your "freshness" check.

You'd unfortunately get false negative freshness checks after read
operations, but you shouldn't get any false positives (which is a real
danger with timestamps).
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-01 13:31       ` Jeff Layton
@ 2024-03-02  2:48         ` Darrick J. Wong
  2024-03-02 12:43           ` Jeff Layton
  0 siblings, 1 reply; 62+ messages in thread
From: Darrick J. Wong @ 2024-03-02  2:48 UTC
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > 
> > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > 
> > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > make it an opaque u64 array and require the userspace program to call
> > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > on a definition for i_version then we can use that; for now we'll just
> > > > use mtime/ctime like the old swapext ioctl.
> > > 
> > > This addresses my concerns about using mtime/ctime.
> > 
> > Oh good! :)
> > 
> > > I have to say, Darrick, that I think that referring to this concern as
> > > bikeshedding is not being honest.
> > > 
> > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > review comments, but I think the question about using mtime/ctime in
> > > this new API was not out of place
> > 
> > I agree, your question about mtime/ctime:
> > 
> > "Maybe a stupid question, but under which circumstances would mtime
> > change and ctime not change? Why are both needed?"
> > 
> > was a very good question.  But perhaps that statement referred to the
> > other part of that thread.
> > 
> > >                                   and I think that making the freshness
> > > data opaque is better for everyone in the long run and hopefully, this will
> > > help you move to the things you care about faster.
> > 
> > I wish you'd suggested an opaque blob that the fs can lay out however it
> > wants instead of suggesting specifically the change cookie.  I'm very
> > much ok with an opaque freshness blob that allows future flexibility in
> > how we define the blob's contents.
> > 
> > I was however very upset about Jeff's suggestion of using i_version.
> > I apologize for using all caps in that reply, and snarling about it in
> > the commit message here.  The final version of this patch will not have
> > that.
> > 
> > That said, I don't think it is at all helpful to suggest using a file
> > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > were a clever idea, regrettably reverted.  As far as I could tell when I
> > wrote my reply, neither had NFS implemented a better behavior and
> > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > patchset to fix all the resulting issues in XFS.
> >
> > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > internal" made me think "OH $deity, they wants me to do that work
> > too???"
> > 
> > A better way to have worded that might've been "How about switching
> > this to a fs-determined structure so that we can switch the freshness
> > check to i_version when that's fully working on XFS?"
> > 
> > The problem I have with reading patch review emails is that I can't
> > easily tell whether an author's suggestion is being made in a casual
> > offhand manner?  Or if it reflects something they feel strongly needs
> > change before merging.
> > 
> > In fairness to you, Amir, I don't know how much you've kept on top of
> > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > of the status of that work.
> > 
> 
> Sorry, I didn't mean to trigger anyone, but I do have real concerns
> about any API that attempts to use timestamps to detect whether
> something has changed.
> 
> We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> just not enough to show whether there was a change to a file -- full
> stop.
> 
> I get the hand-wringing over i_version definitions and I don't care to
> rehash that discussion here, but I'll point out that this is a
> (proposed) XFS-private interface:
> 
> What you could do is expose the XFS change counter (the one that gets
> bumped for everything, even atime updates, possibly via a different
> ioctl), and use that for your "freshness" check.
> 
> You'd unfortunately get false negative freshness checks after read
> operations, but you shouldn't get any false positives (which is a real
> danger with timestamps).

I don't see how that would work for this usecase?  You have to sample
file2 before reflinking file2's contents to file1, writing the changes
to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
inode flag on file2 will trigger an iversion update even though it won't
change mtime or ctime.  The COMMIT then fails due to the inode flags
change.

Worse yet, applications aren't going to know if a particular access is
actually the one that will trigger an atime update.  So this will just
fail unpredictably.

If iversion was purely a write counter then I would switch the freshness
implementation to use it.  But it's not, and I know this to be true
because I tried that and could not get COMMIT_RANGE to work reliably.
I suppose the advantage of the blob thing is that we actually /can/
switch over whenever it's ready.
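
For illustration only, the shape of the blob idea can be sketched in
userspace C.  Everything named here is hypothetical (struct fresh_blob,
FRESH_WORDS, fresh_from_stat, fresh_equal); this is not the ioctl interface
from the patch.  The layout (inode number plus mtime and ctime) is an
assumption that mirrors what the commit message says is used for now.

```c
#define _POSIX_C_SOURCE 200809L  /* for st_mtim / st_ctim */
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>
#include <sys/stat.h>

#define FRESH_WORDS 5  /* hypothetical blob size, in 64-bit words */

/* Opaque to callers: they may only copy it around and compare it. */
struct fresh_blob {
    uint64_t opaque[FRESH_WORDS];
};

/* Fill a blob from stat data; the layout is private to this function. */
void fresh_from_stat(const struct stat *sb, struct fresh_blob *b)
{
    assert(sb != NULL && b != NULL);
    memset(b, 0, sizeof(*b));
    b->opaque[0] = (uint64_t)sb->st_ino;
    b->opaque[1] = (uint64_t)sb->st_mtim.tv_sec;
    b->opaque[2] = (uint64_t)sb->st_mtim.tv_nsec;
    b->opaque[3] = (uint64_t)sb->st_ctim.tv_sec;
    b->opaque[4] = (uint64_t)sb->st_ctim.tv_nsec;
}

/* Freshness check: the file is unchanged only if the blobs are identical. */
bool fresh_equal(const struct fresh_blob *a, const struct fresh_blob *b)
{
    return memcmp(a, b, sizeof(*a)) == 0;
}
```

A caller would fstat() the file, build a blob before staging its update,
and build a second one at commit time.  Because callers only ever compare
blobs bytewise, what goes inside can change later without touching them.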

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-02  2:48         ` Darrick J. Wong
@ 2024-03-02 12:43           ` Jeff Layton
  2024-03-07 23:25             ` Darrick J. Wong
  0 siblings, 1 reply; 62+ messages in thread
From: Jeff Layton @ 2024-03-02 12:43 UTC
  To: Darrick J. Wong; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > 
> > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > 
> > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > make it an opaque u64 array and require the userspace program to call
> > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > use mtime/ctime like the old swapext ioctl.
> > > > 
> > > > This addresses my concerns about using mtime/ctime.
> > > 
> > > Oh good! :)
> > > 
> > > > I have to say, Darrick, that I think that referring to this concern as
> > > > bikeshedding is not being honest.
> > > > 
> > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > review comments, but I think the question about using mtime/ctime in
> > > > this new API was not out of place
> > > 
> > > I agree, your question about mtime/ctime:
> > > 
> > > "Maybe a stupid question, but under which circumstances would mtime
> > > change and ctime not change? Why are both needed?"
> > > 
> > > was a very good question.  But perhaps that statement referred to the
> > > other part of that thread.
> > > 
> > > >                                   and I think that making the freshness
> > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > help you move to the things you care about faster.
> > > 
> > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > wants instead of suggesting specifically the change cookie.  I'm very
> > > much ok with an opaque freshness blob that allows future flexibility in
> > > how we define the blob's contents.
> > > 
> > > I was however very upset about Jeff's suggestion of using i_version.
> > > I apologize for using all caps in that reply, and snarling about it in
> > > the commit message here.  The final version of this patch will not have
> > > that.
> > > 
> > > That said, I don't think it is at all helpful to suggest using a file
> > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > wrote my reply, neither had NFS implemented a better behavior and
> > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > patchset to fix all the resulting issues in XFS.
> > > 
> > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > internal" made me think "OH $deity, they wants me to do that work
> > > too???"
> > > 
> > > A better way to have worded that might've been "How about switching
> > > this to a fs-determined structure so that we can switch the freshness
> > > check to i_version when that's fully working on XFS?"
> > > 
> > > The problem I have with reading patch review emails is that I can't
> > > easily tell whether an author's suggestion is being made in a casual
> > > offhand manner?  Or if it reflects something they feel strongly needs
> > > change before merging.
> > > 
> > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > of the status of that work.
> > > 
> > 
> > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > about any API that attempts to use timestamps to detect whether
> > something has changed.
> > 
> > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > just not enough to show whether there was a change to a file -- full
> > stop.
> > 
> > I get the hand-wringing over i_version definitions and I don't care to
> > rehash that discussion here, but I'll point out that this is a
> > (proposed) XFS-private interface:
> > 
> > What you could do is expose the XFS change counter (the one that gets
> > bumped for everything, even atime updates, possibly via a different
> > ioctl), and use that for your "freshness" check.
> > 
> > You'd unfortunately get false negative freshness checks after read
> > operations, but you shouldn't get any false positives (which is a real
> > danger with timestamps).
> 
> I don't see how that would work for this usecase?  You have to sample
> file2 before reflinking file2's contents to file1, writing the changes
> to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> inode flag on file2 will trigger an iversion update even though it won't
> change mtime or ctime.  The COMMIT then fails due to the inode flags
> change.
> 
> Worse yet, applications aren't going to know if a particular access is
> actually the one that will trigger an atime update.  So this will just
> fail unpredictably.
> 
> If iversion was purely a write counter then I would switch the freshness
> implementation to use it.  But it's not, and I know this to be true
> because I tried that and could not get COMMIT_RANGE to work reliably.
> I suppose the advantage of the blob thing is that we actually /can/
> switch over whenever it's ready.
> 

Yeah, that's the other part -- you have to be willing to redrive the I/O
every time the freshness check fails, which can get expensive depending
on how active the file is. Again this is an XFS interface, so I don't
really have a dog in this fight. If you think timestamps are good
enough, then so be it.

All I can do is mention that it has been our experience in the NFS world
that relying on timestamps like this will eventually lead to data
corruption. The race conditions may be tight, and much of the time the
race may be benign, but if you do this enough you'll eventually get
bitten, and end up exchanging data when you shouldn't have.
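
The redrive cost mentioned above can be sketched as a bounded retry loop.
This is a minimal illustration under stated assumptions: the callback type,
commit_with_retry, and the flaky_attempt stand-in are all hypothetical
names, and the callback is assumed to return a negative errno (-EBUSY when
the freshness check fails) rather than the -1/errno convention of ioctl(2).

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical callback: resample the file, restage the update, and try
 * to commit.  Returns 0 on success or a negative errno; -EBUSY means the
 * file changed underneath us and the whole attempt must be redone.
 */
typedef int (*stage_and_commit_fn)(void *ctx);

/* Redrive the attempt until it stops returning -EBUSY or tries run out. */
int commit_with_retry(stage_and_commit_fn attempt, void *ctx,
                      int max_tries, int *tries_used)
{
    int ret = -EINVAL;  /* returned if max_tries is not positive */
    int tries = 0;

    assert(attempt != NULL);
    while (tries < max_tries) {
        tries++;
        ret = attempt(ctx);
        if (ret != -EBUSY)
            break;      /* success, or an error retrying won't fix */
    }
    if (tries_used != NULL)
        *tries_used = tries;
    return ret;
}

/* Stand-in for a real attempt: reports stale a fixed number of times. */
struct flaky_ctx {
    int stale_left;
};

int flaky_attempt(void *ctx)
{
    struct flaky_ctx *fc = ctx;

    if (fc->stale_left > 0) {
        fc->stale_left--;
        return -EBUSY;
    }
    return 0;
}
```

Each pass through the loop repeats the sampling and staging I/O, which is
why a busy file makes this expensive; the bound keeps a hot file from
spinning the caller forever.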

All of that said, I think this is great discussion fodder for LSF this
year. I feel like the time is right to consider these sorts of
interfaces that do synchronized I/O without locking. I've already
proposed a discussion around the state of the i_version counter, so
maybe we can chat about it then?
-- 
Jeff Layton <jlayton@kernel.org>

* Re: [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque
  2024-03-02 12:43           ` Jeff Layton
@ 2024-03-07 23:25             ` Darrick J. Wong
  0 siblings, 0 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-03-07 23:25 UTC
  To: Jeff Layton; +Cc: Amir Goldstein, linux-fsdevel, linux-xfs, hch

On Sat, Mar 02, 2024 at 07:43:53AM -0500, Jeff Layton wrote:
> On Fri, 2024-03-01 at 18:48 -0800, Darrick J. Wong wrote:
> > On Fri, Mar 01, 2024 at 08:31:21AM -0500, Jeff Layton wrote:
> > > On Thu, 2024-02-29 at 15:27 -0800, Darrick J. Wong wrote:
> > > > On Tue, Feb 27, 2024 at 08:52:58PM +0200, Amir Goldstein wrote:
> > > > > On Tue, Feb 27, 2024 at 7:46 PM Darrick J. Wong <djwong@kernel.org> wrote:
> > > > > > 
> > > > > > From: Darrick J. Wong <djwong@kernel.org>
> > > > > > 
> > > > > > To head off bikeshedding about the fields in xfs_commit_range, let's
> > > > > > make it an opaque u64 array and require the userspace program to call
> > > > > > a third ioctl to sample the freshness data for us.  If we ever converge
> > > > > > on a definition for i_version then we can use that; for now we'll just
> > > > > > use mtime/ctime like the old swapext ioctl.
> > > > > 
> > > > > This addresses my concerns about using mtime/ctime.
> > > > 
> > > > Oh good! :)
> > > > 
> > > > > I have to say, Darrick, that I think that referring to this concern as
> > > > > bikeshedding is not being honest.
> > > > > 
> > > > > I do hate nit picking reviews and I do hate "maybe also fix the world"
> > > > > review comments, but I think the question about using mtime/ctime in
> > > > > this new API was not out of place
> > > > 
> > > > I agree, your question about mtime/ctime:
> > > > 
> > > > "Maybe a stupid question, but under which circumstances would mtime
> > > > change and ctime not change? Why are both needed?"
> > > > 
> > > > was a very good question.  But perhaps that statement referred to the
> > > > other part of that thread.
> > > > 
> > > > >                                   and I think that making the freshness
> > > > > data opaque is better for everyone in the long run and hopefully, this will
> > > > > help you move to the things you care about faster.
> > > > 
> > > > I wish you'd suggested an opaque blob that the fs can lay out however it
> > > > wants instead of suggesting specifically the change cookie.  I'm very
> > > > much ok with an opaque freshness blob that allows future flexibility in
> > > > how we define the blob's contents.
> > > > 
> > > > I was however very upset about Jeff's suggestion of using i_version.
> > > > I apologize for using all caps in that reply, and snarling about it in
> > > > the commit message here.  The final version of this patch will not have
> > > > that.
> > > > 
> > > > That said, I don't think it is at all helpful to suggest using a file
> > > > attribute whose behavior is as yet unresolved.  Multigrain timestamps
> > > > were a clever idea, regrettably reverted.  As far as I could tell when I
> > > > wrote my reply, neither had NFS implemented a better behavior and
> > > > quietly merged it; nor have Jeff and Dave produced any sort of candidate
> > > > patchset to fix all the resulting issues in XFS.
> > > > 
> > > > Reading "I realize that STATX_CHANGE_COOKIE is currently kernel
> > > > internal" made me think "OH $deity, they wants me to do that work
> > > > too???"
> > > > 
> > > > A better way to have worded that might've been "How about switching
> > > > this to a fs-determined structure so that we can switch the freshness
> > > > check to i_version when that's fully working on XFS?"
> > > > 
> > > > The problem I have with reading patch review emails is that I can't
> > > > easily tell whether an author's suggestion is being made in a casual
> > > > offhand manner, or whether it reflects something they feel strongly
> > > > needs to change before merging.
> > > > 
> > > > In fairness to you, Amir, I don't know how much you've kept on top of
> > > > that i_version vs. XFS discussion.  So I have no idea if you were aware
> > > > of the status of that work.
> > > > 
> > > 
> > > Sorry, I didn't mean to trigger anyone, but I do have real concerns
> > > about any API that attempts to use timestamps to detect whether
> > > something has changed.
> > > 
> > > We learned that lesson in NFS in the 90's. VFS timestamp resolution is
> > > just not enough to show whether there was a change to a file -- full
> > > stop.
> > > 
> > > I get the hand-wringing over i_version definitions and I don't care to
> > > rehash that discussion here, but I'll point out that this is a
> > > (proposed) XFS-private interface:
> > > 
> > > What you could do is expose the XFS change counter (the one that gets
> > > bumped for everything, even atime updates, possibly via a different
> > > ioctl), and use that for your "freshness" check.
> > > 
> > > You'd unfortunately get false negative freshness checks after read
> > > operations, but you shouldn't get any false positives (which is a real
> > > danger with timestamps).
> > 
> > I don't see how that would work for this use case.  You have to sample
> > file2 before reflinking file2's contents to file1, writing the changes
> > to file1, and executing COMMIT_RANGE.  Setting the xfs-private REFLINK
> > inode flag on file2 will trigger an iversion update even though it won't
> > change mtime or ctime.  The COMMIT then fails due to the inode flags
> > change.
> > 
> > Worse yet, applications aren't going to know if a particular access is
> > actually the one that will trigger an atime update.  So this will just
> > fail unpredictably.
> > 
> > If iversion was purely a write counter then I would switch the freshness
> > implementation to use it.  But it's not, and I know this to be true
> > because I tried that and could not get COMMIT_RANGE to work reliably.
> > I suppose the advantage of the blob thing is that we actually /can/
> > switch over whenever it's ready.
> > 
> 
> Yeah, that's the other part -- you have to be willing to redrive the I/O
> every time the freshness check fails, which can get expensive depending
> on how active the file is. Again this is an XFS interface, so I don't
> really have a dog in this fight. If you think timestamps are good
> enough, then so be it.
> 
> All I can do is mention that it has been our experience in the NFS world
> that relying on timestamps like this will eventually lead to data
> corruption. The race conditions may be tight, and much of the time the
> race may be benign, but if you do this enough you'll eventually get
> bitten, and end up exchanging data when you shouldn't have.
> 
> All of that said, I think this is great discussion fodder for LSF this
> year. I feel like the time is right to consider these sorts of
> interfaces that do synchronized I/O without locking. I've already
> proposed a discussion around the state of the i_version counter, so
> maybe we can chat about it then?

Yes.  I've gotten an invitation, so corporate approval and dumb injuries
notwithstanding, I'll be there this year. :)

--D

> -- 
> Jeff Layton <jlayton@kernel.org>
> 
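
For anyone following the freshness debate above, here is a minimal sketch of
the two failure modes being argued about.  It is a userspace simulation, not
kernel code: struct fake_inode, struct fresh_sample, fake_write(), fake_read()
and both fresh_by_*() helpers are hypothetical names invented for this
illustration, and the "tick" is an assumed coarse timestamp granule.  Case 1
models Jeff's concern (a second write inside the same tick is invisible to a
timestamp comparison).  Case 2 models Darrick's concern (a counter that is
bumped for non-write events reports a change even though the contents are the
same).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for an inode; not a kernel structure. */
struct fake_inode {
	uint64_t mtime_ticks;	/* timestamp, quantized to coarse clock ticks */
	uint64_t change_count;	/* bumped on every event, even reads */
};

/* What a caller samples before staging its update. */
struct fresh_sample {
	uint64_t mtime_ticks;
	uint64_t change_count;
};

/* Simulate a write that lands during the given coarse clock tick. */
static void fake_write(struct fake_inode *ip, uint64_t now_ticks)
{
	ip->mtime_ticks = now_ticks;
	ip->change_count++;
}

/* Simulate a read whose only side effect is an atime-style counter bump. */
static void fake_read(struct fake_inode *ip)
{
	ip->change_count++;
}

static struct fresh_sample fake_sample(const struct fake_inode *ip)
{
	struct fresh_sample s = { ip->mtime_ticks, ip->change_count };

	return s;
}

/* Timestamp check: "unchanged" if the timestamp still matches. */
static bool fresh_by_timestamp(const struct fake_inode *ip,
			       const struct fresh_sample *s)
{
	return ip->mtime_ticks == s->mtime_ticks;
}

/* Counter check: "unchanged" only if no event was counted at all. */
static bool fresh_by_counter(const struct fake_inode *ip,
			     const struct fresh_sample *s)
{
	return ip->change_count == s->change_count;
}

/* Run both cases; returns how many times the two checks disagreed. */
int freshness_demo(void)
{
	struct fake_inode file = { 0, 0 };
	struct fresh_sample s;
	int disagreements = 0;

	/* Case 1: two writes in one tick; the timestamp misses the second. */
	fake_write(&file, 100);
	s = fake_sample(&file);
	fake_write(&file, 100);
	assert(file.change_count == 2);
	printf("same-tick write: timestamp fresh=%d, counter fresh=%d\n",
	       fresh_by_timestamp(&file, &s), fresh_by_counter(&file, &s));
	disagreements += fresh_by_timestamp(&file, &s) !=
			 fresh_by_counter(&file, &s);

	/* Case 2: contents untouched, but a read bumps the counter. */
	s = fake_sample(&file);
	fake_read(&file);
	printf("read only:       timestamp fresh=%d, counter fresh=%d\n",
	       fresh_by_timestamp(&file, &s), fresh_by_counter(&file, &s));
	disagreements += fresh_by_timestamp(&file, &s) !=
			 fresh_by_counter(&file, &s);

	return disagreements;
}
```

Calling freshness_demo() from a main() prints one line per case.  In case 1
the timestamp check wrongly reports "fresh", so a commit would proceed over
changed data.  In case 2 the counter check wrongly reports "stale", so a
commit would be rejected and the caller would have to redrive its I/O.  An
opaque freshness blob leaves room to switch between the two later.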


* [PATCH 13/14] docs: update swapext -> exchmaps language
  2024-03-30  0:57 [PATCHSET v30.1] " Darrick J. Wong
@ 2024-03-30  1:00 ` Darrick J. Wong
  0 siblings, 0 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-03-30  1:00 UTC (permalink / raw
  To: djwong; +Cc: Christoph Hellwig, linux-fsdevel, hch, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Start reworking the atomic swapext design documentation to refer to its
new file contents/mapping exchange name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  259 +++++++++++---------
 1 file changed, 136 insertions(+), 123 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 1d161752f09ed..f72e1ed2d0e5f 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
 function frees them all because compaction is not needed.
 
 The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
 However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
 
 Repairs for file-based metadata such as extended attributes, directories,
 symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
 Afterward, the mappings in the old file fork are the candidate blocks for
 disposal.
 
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
 cannot be staged in memory, even when a paging scheme is available.
 Therefore, online repair of file-based metadata createas a temporary file in
 the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
 Once the repair is complete, the old fork can be reaped as necessary; if the
 system goes down during the reap, the iunlink code will delete the blocks
 during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
 This dependency is the reason why online repair can only use pageable kernel
 memory to stage ondisk space usage information.
 
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file.  The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
 
 There is a downside to the reaping process -- if the system crashes during the
 reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
 series.
 
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
 
 Once repair builds a temporary file with a new data structure written into
 it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
 These problems are overcome by creating a new deferred operation and a new type
 of log intent item to track the progress of an operation to exchange two file
 ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
 The new log item records the progress of the exchange to ensure that once an
 exchange begins, it will always run to completion, even there are
 interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_LOG_EXCHMAPS`` log-incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 
 The proposed patchset is the
-`atomic extent swap
+`file contents exchange
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
 series.
 
@@ -4061,72 +4067,73 @@ series.
 | The feature bit will not be cleared from the superblock until the log    |
 | becomes clean.                                                           |
 |                                                                          |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the        |
+| Log-assisted extended attribute updates and file content exchanges both  |
+| use log incompat features and provide convenience wrappers around the    |
 | functionality.                                                           |
 +--------------------------------------------------------------------------+
 
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
 
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
 The goal is to exchange all file fork mappings between two file fork offset
 ranges.
 There are likely to be many extent mappings in each fork, and the edges of
 the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
 such as exchanging file sizes, inode flags, or conversion of fork data to local
 format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
 
 .. code-block:: c
 
-	struct xfs_swapext_intent {
+	struct xfs_exchmaps_intent {
 	    /* Inodes participating in the operation. */
-	    struct xfs_inode    *sxi_ip1;
-	    struct xfs_inode    *sxi_ip2;
+	    struct xfs_inode    *xmi_ip1;
+	    struct xfs_inode    *xmi_ip2;
 
 	    /* File offset range information. */
-	    xfs_fileoff_t       sxi_startoff1;
-	    xfs_fileoff_t       sxi_startoff2;
-	    xfs_filblks_t       sxi_blockcount;
+	    xfs_fileoff_t       xmi_startoff1;
+	    xfs_fileoff_t       xmi_startoff2;
+	    xfs_filblks_t       xmi_blockcount;
 
 	    /* Set these file sizes after the operation, unless negative. */
-	    xfs_fsize_t         sxi_isize1;
-	    xfs_fsize_t         sxi_isize2;
+	    xfs_fsize_t         xmi_isize1;
+	    xfs_fsize_t         xmi_isize2;
 
-	    /* XFS_SWAP_EXT_* log operation flags */
-	    uint64_t            sxi_flags;
+	    /* XFS_EXCHMAPS_* log operation flags */
+	    uint64_t            xmi_flags;
 	};
 
 The new log intent item contains enough information to track two logical fork
 offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
 blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
 
-When the extent swap is initiated, the sequence of operations is as follows:
+When the exchange is initiated, the sequence of operations is as follows:
 
-1. Create a deferred work item for the extent swap.
-   At the start, it should contain the entirety of the file ranges to be
-   swapped.
+1. Create a deferred work item for the file mapping exchange.
+   At the start, it should contain the entirety of the file block ranges to be
+   exchanged.
 
 2. Call ``xfs_defer_finish`` to process the exchange.
-   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
    This will log an extent swap intent item to the transaction for the deferred
-   extent swap work item.
+   mapping exchange work item.
 
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
 
-   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
-      ``sxi_startoff2``, respectively, and compute the longest extent that can
-      be swapped in a single step.
+   a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+      ``xmi_startoff2``, respectively, and compute the longest extent that can
+      be exchanged in a single step.
       This is the minimum of the two ``br_blockcount`` s in the mappings.
       Keep advancing through the file forks until at least one of the mappings
       contains written blocks.
@@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
 
    g. Extend the ondisk size of either file if necessary.
 
-   h. Log an extent swap done log item for the extent swap intent log item
-      that was read at the start of step 3.
+   h. Log a mapping exchange done log item for the mapping exchange intent log
+      item that was read at the start of step 3.
 
    i. Compute the amount of file range that has just been covered.
       This quantity is ``(map1.br_startoff + map1.br_blockcount -
-      sxi_startoff1)``, because step 3a could have skipped holes.
+      xmi_startoff1)``, because step 3a could have skipped holes.
 
-   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+   j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
       by the number of blocks computed in the previous step, and decrease
-      ``sxi_blockcount`` by the same quantity.
+      ``xmi_blockcount`` by the same quantity.
       This advances the cursor.
 
-   k. Log a new extent swap intent log item reflecting the advanced state of
-      the work item.
+   k. Log a new mapping exchange intent log item reflecting the advanced state
+      of the work item.
 
    l. Return the proper error code (EAGAIN) to the deferred operation manager
       to inform it that there is more work to be done.
@@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
    This will be discussed in more detail in subsequent sections.
 
 If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.
 
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
 
 There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
 First, regular files require the page cache to be flushed to disk before the
 operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
 The preparation step scans the ranges of both files to estimate:
 
 - Data device blocks needed to handle the repeated updates to the fork
@@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
   to different extents on the realtime volume, which could happen if the
   operation fails to run to completion.
 
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
 Regular users are required to abide the quota limits, though metadata repairs
 may exceed quota to resolve inconsistent metadata elsewhere.
 
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
 
 Extended attributes, symbolic links, and directories can set the fork format to
 "local" and treat the fork as a literal area for data storage.
 Metadata repairs must take extra steps to support these cases:
 
 - If both forks are in local format and the fork areas are large enough, the
-  swap is performed by copying the incore fork contents, logging both forks,
-  and committing.
-  The atomic extent swap mechanism is not necessary, since this can be done
-  with a single transaction.
+  exchange is performed by copying the incore fork contents, logging both
+  forks, and committing.
+  The atomic file mapping exchange mechanism is not necessary, since this can
+  be done with a single transaction.
 
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+  used.
 
 - Otherwise, only one fork is in local format.
   The contents of the local format fork are converted to a block to perform the
-  swap.
+  exchange.
   The conversion to block format must be done in the same transaction that
-  logs the initial extent swap intent log item.
-  The regular atomic extent swap is used to exchange the mappings.
-  Special flags are set on the swap operation so that the transaction can be
-  rolled one more time to convert the second file's fork back to local format
-  so that the second file will be ready to go as soon as the ILOCK is dropped.
+  logs the initial mapping exchange intent log item.
+  The regular atomic mapping exchange is used to exchange the metadata file
+  mappings.
+  Special flags are set on the exchange operation so that the transaction can
+  be rolled one more time to convert the second file's fork back to local
+  format so that the second file will be ready to go as soon as the ILOCK is
+  dropped.
 
 Extended attributes and directories stamp the owning inode into every block,
 but the buffer verifiers do not actually check the inode number!
 Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
 
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
 If the filesystem should go down during the reap part of the repair, the
 iunlink processing at the end of recovery will free both the temporary file and
 whatever blocks were not reaped.
 However, this iunlink processing omits the cross-link detection of online
 repair, and is not completely foolproof.
 
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
 
 To repair a metadata file, online repair proceeds as follows:
 
@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
    file.
    The same fork must be written to as is being repaired.
 
-3. Commit the scrub transaction, since the swap estimation step must be
-   completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+   must be completed before transaction reservations are made.
 
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
    the appropriate resource reservations, locks, and fill out a ``struct
-   xfs_swapext_req`` with the details of the swap operation.
+   xfs_exchmaps_req`` with the details of the exchange operation.
 
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
 
 6. Commit the transaction to complete the repair.
 
@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap:
 3. Compare the contents of the xfile against the ondisk file.
 
 To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
 The temporary file is then reaped.
 
 The proposed patchset is the
@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows:
    memory or there are no more attr fork blocks to examine, unlock the file and
    add the staged extended attributes to the temporary file.
 
-3. Use atomic extent swapping to exchange the new and old extended attribute
-   structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+   attribute structures.
    The old attribute blocks are now attached to the temporary file.
 
 4. Reap the temporary file.
@@ -4410,7 +4421,8 @@ salvaging directories is straightforward:
    directory and add the staged dirents into the temporary directory.
    Truncate the staging files.
 
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+   structures.
    The old directory blocks are now attached to the temporary file.
 
 5. Reap the temporary file.
@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
       Instead, we stash updates in the xfarray and rely on the scanner thread
       to apply the stashed updates to the temporary directory.
 
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, atomically exchange the contents of the temporary
    directory and the directory being repaired.
    The temporary directory now contains the damaged directory structure.
 
@@ -4629,8 +4641,8 @@ directory reconstruction:
 
 5. Copy all non-parent pointer extended attributes to the temporary file.
 
-6. When the scan is complete, atomically swap the attribute fork of the
-   temporary file and the file being repaired.
+6. When the scan is complete, atomically exchange the mappings of the attribute
+   forks of the temporary file and the file being repaired.
    The temporary file now contains the damaged extended attribute structure.
 
 7. Reap the temporary file.
@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it
 has been built, and why.
 Please feel free to contact the XFS mailing list with questions.
 
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
 
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
 This frontend has been out for review for several years now, though the
 necessary refinements to online repair and lack of customer demand mean that
 the proposal has not been pushed very hard.
 
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
 
 As mentioned earlier, XFS has long had the ability to swap extents between
 files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to
 develop an iterative mechanism that used deferred bmap and rmap operations to
 swap mappings one at a time.
 This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
 For the narrow case of file defragmentation, the file contents must be
 identical, so the recovery guarantees are not much of a gain.
 
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchange is much more flexible than the existing swapext
 implementations because it can guarantee that the caller never sees a mix of
 old and new contents even after a crash, and it can operate on two arbitrary
 file fork ranges.
@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases:
   Next, it opens a temporary file and calls the file clone operation to reflink
   the first file's contents into the temporary file.
   Writes to the original file should instead be written to the temporary file.
-  Finally, the process calls the atomic extent swap system call
-  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
-  of the updates to the original file, or none of them.
+  Finally, the process calls the atomic file mapping exchange system call
+  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+  committing all of the updates to the original file, or none of them.
 
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
 
 - **Transactional file updates**: The same mechanism as above, but the caller
   only wants the commit to occur if the original file's contents have not
@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases:
   change timestamps of the original file before reflinking its data to the
   temporary file.
   When the program is ready to commit the changes, it passes the timestamps
-  into the kernel as arguments to the atomic extent swap system call.
+  into the kernel as arguments to the atomic file mapping exchange system call.
   The kernel only commits the changes if the provided timestamps match the
   original file.
+  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
 
 - **Emulation of atomic block device writes**: Export a block device with a
   logical sector size matching the filesystem block size to force all writes
   to be aligned to the filesystem block size.
   Stage all writes to a temporary file, and when that is complete, call the
-  atomic extent swap system call with a flag to indicate that holes in the
-  temporary file should be ignored.
+  atomic file mapping exchange system call with a flag to indicate that holes
+  in the temporary file should be ignored.
   This emulates an atomic device write in software, and can support arbitrary
   scattered writes.
 
@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file.
 Cloning the extent means that the original owners cannot overwrite the
 contents; any changes will be written somewhere else via copy-on-write.
 Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
 mapping away from the area being cleared.
 When all other mappings have been moved, clearspace reflinks the space into the
 space collector file so that it becomes unavailable.



* [PATCH 13/14] docs: update swapext -> exchmaps language
  2024-04-09  3:34 [PATCHSET v30.2] xfs: atomic file content exchanges Darrick J. Wong
@ 2024-04-09  3:37 ` Darrick J. Wong
  0 siblings, 0 replies; 62+ messages in thread
From: Darrick J. Wong @ 2024-04-09  3:37 UTC (permalink / raw
  To: djwong; +Cc: Christoph Hellwig, hch, linux-fsdevel, linux-xfs

From: Darrick J. Wong <djwong@kernel.org>

Start reworking the atomic swapext design documentation to refer to its
new file contents/mapping exchange name.

Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
 .../filesystems/xfs/xfs-online-fsck-design.rst     |  259 +++++++++++---------
 1 file changed, 136 insertions(+), 123 deletions(-)


diff --git a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
index 1d161752f09ed..3afa1bc5f47ce 100644
--- a/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
+++ b/Documentation/filesystems/xfs/xfs-online-fsck-design.rst
@@ -2167,7 +2167,7 @@ The ``xfblob_free`` function frees a specific blob, and the ``xfblob_truncate``
 function frees them all because compaction is not needed.
 
 The details of repairing directories and extended attributes will be discussed
-in a subsequent section about atomic extent swapping.
+in a subsequent section about atomic file content exchanges.
 However, it should be noted that these repair functions only use blob storage
 to cache a small number of entries before adding them to a temporary ondisk
 file, which is why compaction is not required.
@@ -2802,7 +2802,8 @@ follows this format:
 
 Repairs for file-based metadata such as extended attributes, directories,
 symbolic links, quota files and realtime bitmaps are performed by building a
-new structure attached to a temporary file and swapping the forks.
+new structure attached to a temporary file and exchanging all mappings in the
+file forks.
 Afterward, the mappings in the old file fork are the candidate blocks for
 disposal.
 
@@ -3851,8 +3852,8 @@ Because file forks can consume as much space as the entire filesystem, repairs
 cannot be staged in memory, even when a paging scheme is available.
 Therefore, online repair of file-based metadata createas a temporary file in
 the XFS filesystem, writes a new structure at the correct offsets into the
-temporary file, and atomically swaps the fork mappings (and hence the fork
-contents) to commit the repair.
+temporary file, and atomically exchanges all file fork mappings (and hence the
+fork contents) to commit the repair.
 Once the repair is complete, the old fork can be reaped as necessary; if the
 system goes down during the reap, the iunlink code will delete the blocks
 during log recovery.
@@ -3862,10 +3863,11 @@ consistent to use a temporary file safely!
 This dependency is the reason why online repair can only use pageable kernel
 memory to stage ondisk space usage information.
 
-Swapping metadata extents with a temporary file requires the owner field of the
-block headers to match the file being repaired and not the temporary file.  The
-directory, extended attribute, and symbolic link functions were all modified to
-allow callers to specify owner numbers explicitly.
+Exchanging metadata file mappings with a temporary file requires the owner
+field of the block headers to match the file being repaired and not the
+temporary file.
+The directory, extended attribute, and symbolic link functions were all
+modified to allow callers to specify owner numbers explicitly.
 
 There is a downside to the reaping process -- if the system crashes during the
 reap phase and the fork extents are crosslinked, the iunlink processing will
@@ -3974,8 +3976,8 @@ The proposed patches are in the
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=repair-tempfiles>`_
 series.
 
-Atomic Extent Swapping
-----------------------
+Logged File Content Exchanges
+-----------------------------
 
 Once repair builds a temporary file with a new data structure written into
 it, it must commit the new changes into the existing file.
@@ -4010,17 +4012,21 @@ e. Old blocks in the file may be cross-linked with another structure and must
 These problems are overcome by creating a new deferred operation and a new type
 of log intent item to track the progress of an operation to exchange two file
 ranges.
-The new deferred operation type chains together the same transactions used by
-the reverse-mapping extent swap code.
+The new exchange operation type chains together the same transactions used by
+the reverse-mapping extent swap code, but records intermediate progress in the
+log so that operations can be restarted after a crash.
+This new functionality is called the file contents exchange (xfs_exchrange)
+code.
+The underlying implementation exchanges file fork mappings (xfs_exchmaps).
 The new log item records the progress of the exchange to ensure that once an
 exchange begins, it will always run to completion, even there are
 interruptions.
-The new ``XFS_SB_FEAT_INCOMPAT_LOG_ATOMIC_SWAP`` log-incompatible feature flag
+The new ``XFS_SB_FEAT_INCOMPAT_EXCHRANGE`` incompatible feature flag
 in the superblock protects these new log item records from being replayed on
 old kernels.
 
 The proposed patchset is the
-`atomic extent swap
+`file contents exchange
 <https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux.git/log/?h=atomic-file-updates>`_
 series.
 
@@ -4061,72 +4067,73 @@ series.
 | The feature bit will not be cleared from the superblock until the log    |
 | becomes clean.                                                           |
 |                                                                          |
-| Log-assisted extended attribute updates and atomic extent swaps both use |
-| log incompat features and provide convenience wrappers around the        |
+| Log-assisted extended attribute updates and file content exchanges both  |
+| use log incompat features and provide convenience wrappers around the    |
 | functionality.                                                           |
 +--------------------------------------------------------------------------+
 
-Mechanics of an Atomic Extent Swap
-``````````````````````````````````
+Mechanics of a Logged File Content Exchange
+```````````````````````````````````````````
 
-Swapping entire file forks is a complex task.
+Exchanging contents between file forks is a complex task.
 The goal is to exchange all file fork mappings between two file fork offset
 ranges.
 There are likely to be many extent mappings in each fork, and the edges of
 the mappings aren't necessarily aligned.
-Furthermore, there may be other updates that need to happen after the swap,
+Furthermore, there may be other updates that need to happen after the exchange,
 such as exchanging file sizes, inode flags, or conversion of fork data to local
 format.
-This is roughly the format of the new deferred extent swap work item:
+This is roughly the format of the new deferred exchange-mapping work item:
 
 .. code-block:: c
 
-	struct xfs_swapext_intent {
+	struct xfs_exchmaps_intent {
 	    /* Inodes participating in the operation. */
-	    struct xfs_inode    *sxi_ip1;
-	    struct xfs_inode    *sxi_ip2;
+	    struct xfs_inode    *xmi_ip1;
+	    struct xfs_inode    *xmi_ip2;
 
 	    /* File offset range information. */
-	    xfs_fileoff_t       sxi_startoff1;
-	    xfs_fileoff_t       sxi_startoff2;
-	    xfs_filblks_t       sxi_blockcount;
+	    xfs_fileoff_t       xmi_startoff1;
+	    xfs_fileoff_t       xmi_startoff2;
+	    xfs_filblks_t       xmi_blockcount;
 
 	    /* Set these file sizes after the operation, unless negative. */
-	    xfs_fsize_t         sxi_isize1;
-	    xfs_fsize_t         sxi_isize2;
+	    xfs_fsize_t         xmi_isize1;
+	    xfs_fsize_t         xmi_isize2;
 
-	    /* XFS_SWAP_EXT_* log operation flags */
-	    uint64_t            sxi_flags;
+	    /* XFS_EXCHMAPS_* log operation flags */
+	    uint64_t            xmi_flags;
 	};
 
 The new log intent item contains enough information to track two logical fork
 offset ranges: ``(inode1, startoff1, blockcount)`` and ``(inode2, startoff2,
 blockcount)``.
-Each step of a swap operation exchanges the largest file range mapping possible
-from one file to the other.
-After each step in the swap operation, the two startoff fields are incremented
-and the blockcount field is decremented to reflect the progress made.
-The flags field captures behavioral parameters such as swapping the attr fork
-instead of the data fork and other work to be done after the extent swap.
-The two isize fields are used to swap the file size at the end of the operation
-if the file data fork is the target of the swap operation.
+Each step of an exchange operation exchanges the largest file range mapping
+possible from one file to the other.
+After each step in the exchange operation, the two startoff fields are
+incremented and the blockcount field is decremented to reflect the progress
+made.
+The flags field captures behavioral parameters such as exchanging attr fork
+mappings instead of the data fork and other work to be done after the exchange.
+The two isize fields are used to exchange the file sizes at the end of the
+operation if the file data fork is the target of the operation.
 
-When the extent swap is initiated, the sequence of operations is as follows:
+When the exchange is initiated, the sequence of operations is as follows:
 
-1. Create a deferred work item for the extent swap.
-   At the start, it should contain the entirety of the file ranges to be
-   swapped.
+1. Create a deferred work item for the file mapping exchange.
+   At the start, it should contain the entirety of the file block ranges to be
+   exchanged.
 
 2. Call ``xfs_defer_finish`` to process the exchange.
-   This is encapsulated in ``xrep_tempswap_contents`` for scrub operations.
+   This is encapsulated in ``xrep_tempexch_contents`` for scrub operations.
    This will log an extent swap intent item to the transaction for the deferred
-   extent swap work item.
+   mapping exchange work item.
 
-3. Until ``sxi_blockcount`` of the deferred extent swap work item is zero,
+3. Until ``xmi_blockcount`` of the deferred mapping exchange work item is zero,
 
-   a. Read the block maps of both file ranges starting at ``sxi_startoff1`` and
-      ``sxi_startoff2``, respectively, and compute the longest extent that can
-      be swapped in a single step.
+   a. Read the block maps of both file ranges starting at ``xmi_startoff1`` and
+      ``xmi_startoff2``, respectively, and compute the longest extent that can
+      be exchanged in a single step.
       This is the minimum of the two ``br_blockcount`` s in the mappings.
       Keep advancing through the file forks until at least one of the mappings
       contains written blocks.
@@ -4148,20 +4155,20 @@ When the extent swap is initiated, the sequence of operations is as follows:
 
    g. Extend the ondisk size of either file if necessary.
 
-   h. Log an extent swap done log item for the extent swap intent log item
-      that was read at the start of step 3.
+   h. Log a mapping exchange done log item for the mapping exchange intent log
+      item that was read at the start of step 3.
 
    i. Compute the amount of file range that has just been covered.
       This quantity is ``(map1.br_startoff + map1.br_blockcount -
-      sxi_startoff1)``, because step 3a could have skipped holes.
+      xmi_startoff1)``, because step 3a could have skipped holes.
 
-   j. Increase the starting offsets of ``sxi_startoff1`` and ``sxi_startoff2``
+   j. Increase the starting offsets of ``xmi_startoff1`` and ``xmi_startoff2``
       by the number of blocks computed in the previous step, and decrease
-      ``sxi_blockcount`` by the same quantity.
+      ``xmi_blockcount`` by the same quantity.
       This advances the cursor.
 
-   k. Log a new extent swap intent log item reflecting the advanced state of
-      the work item.
+   k. Log a new mapping exchange intent log item reflecting the advanced state
+      of the work item.
 
    l. Return the proper error code (EAGAIN) to the deferred operation manager
       to inform it that there is more work to be done.
@@ -4172,22 +4179,23 @@ When the extent swap is initiated, the sequence of operations is as follows:
    This will be discussed in more detail in subsequent sections.
 
 If the filesystem goes down in the middle of an operation, log recovery will
-find the most recent unfinished extent swap log intent item and restart from
-there.
-This is how extent swapping guarantees that an outside observer will either see
-the old broken structure or the new one, and never a mismash of both.
+find the most recent unfinished mapping exchange log intent item and restart
+from there.
+This is how atomic file mapping exchanges guarantee that an outside observer
+will either see the old broken structure or the new one, and never a mishmash
+of both.
 
-Preparation for Extent Swapping
-```````````````````````````````
+Preparation for File Content Exchanges
+``````````````````````````````````````
 
 There are a few things that need to be taken care of before initiating an
-atomic extent swap operation.
+atomic file mapping exchange operation.
 First, regular files require the page cache to be flushed to disk before the
 operation begins, and directio writes to be quiesced.
-Like any filesystem operation, extent swapping must determine the maximum
-amount of disk space and quota that can be consumed on behalf of both files in
-the operation, and reserve that quantity of resources to avoid an unrecoverable
-out of space failure once it starts dirtying metadata.
+Like any filesystem operation, file mapping exchanges must determine the
+maximum amount of disk space and quota that can be consumed on behalf of both
+files in the operation, and reserve that quantity of resources to avoid an
+unrecoverable out of space failure once it starts dirtying metadata.
 The preparation step scans the ranges of both files to estimate:
 
 - Data device blocks needed to handle the repeated updates to the fork
@@ -4201,56 +4209,59 @@ The preparation step scans the ranges of both files to estimate:
   to different extents on the realtime volume, which could happen if the
   operation fails to run to completion.
 
-The need for precise estimation increases the run time of the swap operation,
-but it is very important to maintain correct accounting.
-The filesystem must not run completely out of free space, nor can the extent
-swap ever add more extent mappings to a fork than it can support.
+The need for precise estimation increases the run time of the exchange
+operation, but it is very important to maintain correct accounting.
+The filesystem must not run completely out of free space, nor can the mapping
+exchange ever add more extent mappings to a fork than it can support.
 Regular users are required to abide the quota limits, though metadata repairs
 may exceed quota to resolve inconsistent metadata elsewhere.
 
-Special Features for Swapping Metadata File Extents
-```````````````````````````````````````````````````
+Special Features for Exchanging Metadata File Contents
+``````````````````````````````````````````````````````
 
 Extended attributes, symbolic links, and directories can set the fork format to
 "local" and treat the fork as a literal area for data storage.
 Metadata repairs must take extra steps to support these cases:
 
 - If both forks are in local format and the fork areas are large enough, the
-  swap is performed by copying the incore fork contents, logging both forks,
-  and committing.
-  The atomic extent swap mechanism is not necessary, since this can be done
-  with a single transaction.
+  exchange is performed by copying the incore fork contents, logging both
+  forks, and committing.
+  The atomic file mapping exchange mechanism is not necessary, since this can
+  be done with a single transaction.
 
-- If both forks map blocks, then the regular atomic extent swap is used.
+- If both forks map blocks, then the regular atomic file mapping exchange is
+  used.
 
 - Otherwise, only one fork is in local format.
   The contents of the local format fork are converted to a block to perform the
-  swap.
+  exchange.
   The conversion to block format must be done in the same transaction that
-  logs the initial extent swap intent log item.
-  The regular atomic extent swap is used to exchange the mappings.
-  Special flags are set on the swap operation so that the transaction can be
-  rolled one more time to convert the second file's fork back to local format
-  so that the second file will be ready to go as soon as the ILOCK is dropped.
+  logs the initial mapping exchange intent log item.
+  The regular atomic mapping exchange is used to exchange the metadata file
+  mappings.
+  Special flags are set on the exchange operation so that the transaction can
+  be rolled one more time to convert the second file's fork back to local
+  format so that the second file will be ready to go as soon as the ILOCK is
+  dropped.
 
 Extended attributes and directories stamp the owning inode into every block,
 but the buffer verifiers do not actually check the inode number!
 Although there is no verification, it is still important to maintain
-referential integrity, so prior to performing the extent swap, online repair
-builds every block in the new data structure with the owner field of the file
-being repaired.
+referential integrity, so prior to performing the mapping exchange, online
+repair builds every block in the new data structure with the owner field of the
+file being repaired.
 
-After a successful swap operation, the repair operation must reap the old fork
-blocks by processing each fork mapping through the standard :ref:`file extent
-reaping <reaping>` mechanism that is done post-repair.
+After a successful exchange operation, the repair operation must reap the old
+fork blocks by processing each fork mapping through the standard :ref:`file
+extent reaping <reaping>` mechanism that is done post-repair.
 If the filesystem should go down during the reap part of the repair, the
 iunlink processing at the end of recovery will free both the temporary file and
 whatever blocks were not reaped.
 However, this iunlink processing omits the cross-link detection of online
 repair, and is not completely foolproof.
 
-Swapping Temporary File Extents
-```````````````````````````````
+Exchanging Temporary File Contents
+``````````````````````````````````
 
 To repair a metadata file, online repair proceeds as follows:
 
@@ -4260,14 +4271,14 @@ To repair a metadata file, online repair proceeds as follows:
    file.
    The same fork must be written to as is being repaired.
 
-3. Commit the scrub transaction, since the swap estimation step must be
-   completed before transaction reservations are made.
+3. Commit the scrub transaction, since the exchange resource estimation step
+   must be completed before transaction reservations are made.
 
-4. Call ``xrep_tempswap_trans_alloc`` to allocate a new scrub transaction with
+4. Call ``xrep_tempexch_trans_alloc`` to allocate a new scrub transaction with
    the appropriate resource reservations, locks, and fill out a ``struct
-   xfs_swapext_req`` with the details of the swap operation.
+   xfs_exchmaps_req`` with the details of the exchange operation.
 
-5. Call ``xrep_tempswap_contents`` to swap the contents.
+5. Call ``xrep_tempexch_contents`` to exchange the contents.
 
 6. Commit the transaction to complete the repair.
 
@@ -4309,7 +4320,7 @@ To check the summary file against the bitmap:
 3. Compare the contents of the xfile against the ondisk file.
 
 To repair the summary file, write the xfile contents into the temporary file
-and use atomic extent swap to commit the new contents.
+and use atomic mapping exchange to commit the new contents.
 The temporary file is then reaped.
 
 The proposed patchset is the
@@ -4352,8 +4363,8 @@ Salvaging extended attributes is done as follows:
    memory or there are no more attr fork blocks to examine, unlock the file and
    add the staged extended attributes to the temporary file.
 
-3. Use atomic extent swapping to exchange the new and old extended attribute
-   structures.
+3. Use atomic file mapping exchange to exchange the new and old extended
+   attribute structures.
    The old attribute blocks are now attached to the temporary file.
 
 4. Reap the temporary file.
@@ -4410,7 +4421,8 @@ salvaging directories is straightforward:
    directory and add the staged dirents into the temporary directory.
    Truncate the staging files.
 
-4. Use atomic extent swapping to exchange the new and old directory structures.
+4. Use atomic file mapping exchange to exchange the new and old directory
+   structures.
    The old directory blocks are now attached to the temporary file.
 
 5. Reap the temporary file.
@@ -4542,7 +4554,7 @@ a :ref:`directory entry live update hook <liveupdate>` as follows:
       Instead, we stash updates in the xfarray and rely on the scanner thread
       to apply the stashed updates to the temporary directory.
 
-5. When the scan is complete, atomically swap the contents of the temporary
+5. When the scan is complete, atomically exchange the contents of the temporary
    directory and the directory being repaired.
    The temporary directory now contains the damaged directory structure.
 
@@ -4629,8 +4641,8 @@ directory reconstruction:
 
 5. Copy all non-parent pointer extended attributes to the temporary file.
 
-6. When the scan is complete, atomically swap the attribute fork of the
-   temporary file and the file being repaired.
+6. When the scan is complete, atomically exchange the mappings of the attribute
+   forks of the temporary file and the file being repaired.
    The temporary file now contains the damaged extended attribute structure.
 
 7. Reap the temporary file.
@@ -5105,18 +5117,18 @@ make it easier for code readers to understand what has been built, for whom it
 has been built, and why.
 Please feel free to contact the XFS mailing list with questions.
 
-FIEXCHANGE_RANGE
-----------------
+XFS_IOC_EXCHANGE_RANGE
+----------------------
 
-As discussed earlier, a second frontend to the atomic extent swap mechanism is
-a new ioctl call that userspace programs can use to commit updates to files
-atomically.
+As discussed earlier, a second frontend to the atomic file mapping exchange
+mechanism is a new ioctl call that userspace programs can use to commit updates
+to files atomically.
 This frontend has been out for review for several years now, though the
 necessary refinements to online repair and lack of customer demand mean that
 the proposal has not been pushed very hard.
 
-Extent Swapping with Regular User Files
-```````````````````````````````````````
+File Content Exchanges with Regular User Files
+``````````````````````````````````````````````
 
 As mentioned earlier, XFS has long had the ability to swap extents between
 files, which is used almost exclusively by ``xfs_fsr`` to defragment files.
@@ -5131,12 +5143,12 @@ the consistency of the fork mappings with the reverse mapping index was to
 develop an iterative mechanism that used deferred bmap and rmap operations to
 swap mappings one at a time.
 This mechanism is identical to steps 2-3 from the procedure above except for
-the new tracking items, because the atomic extent swap mechanism is an
-iteration of an existing mechanism and not something totally novel.
+the new tracking items, because the atomic file mapping exchange mechanism is
+an iteration of an existing mechanism and not something totally novel.
 For the narrow case of file defragmentation, the file contents must be
 identical, so the recovery guarantees are not much of a gain.
 
-Atomic extent swapping is much more flexible than the existing swapext
+Atomic file content exchanges are much more flexible than the existing swapext
 implementations because it can guarantee that the caller never sees a mix of
 old and new contents even after a crash, and it can operate on two arbitrary
 file fork ranges.
@@ -5147,11 +5159,11 @@ The extra flexibility enables several new use cases:
   Next, it opens a temporary file and calls the file clone operation to reflink
   the first file's contents into the temporary file.
   Writes to the original file should instead be written to the temporary file.
-  Finally, the process calls the atomic extent swap system call
-  (``FIEXCHANGE_RANGE``) to exchange the file contents, thereby committing all
-  of the updates to the original file, or none of them.
+  Finally, the process calls the atomic file mapping exchange system call
+  (``XFS_IOC_EXCHANGE_RANGE``) to exchange the file contents, thereby
+  committing all of the updates to the original file, or none of them.
 
-.. _swapext_if_unchanged:
+.. _exchrange_if_unchanged:
 
 - **Transactional file updates**: The same mechanism as above, but the caller
   only wants the commit to occur if the original file's contents have not
@@ -5160,16 +5172,17 @@ The extra flexibility enables several new use cases:
   change timestamps of the original file before reflinking its data to the
   temporary file.
   When the program is ready to commit the changes, it passes the timestamps
-  into the kernel as arguments to the atomic extent swap system call.
+  into the kernel as arguments to the atomic file mapping exchange system call.
   The kernel only commits the changes if the provided timestamps match the
   original file.
+  A new ioctl (``XFS_IOC_COMMIT_RANGE``) is provided to perform this.
 
 - **Emulation of atomic block device writes**: Export a block device with a
   logical sector size matching the filesystem block size to force all writes
   to be aligned to the filesystem block size.
   Stage all writes to a temporary file, and when that is complete, call the
-  atomic extent swap system call with a flag to indicate that holes in the
-  temporary file should be ignored.
+  atomic file mapping exchange system call with a flag to indicate that holes
+  in the temporary file should be ignored.
   This emulates an atomic device write in software, and can support arbitrary
   scattered writes.
 
@@ -5251,8 +5264,8 @@ of the file to try to share the physical space with a dummy file.
 Cloning the extent means that the original owners cannot overwrite the
 contents; any changes will be written somewhere else via copy-on-write.
 Clearspace makes its own copy of the frozen extent in an area that is not being
-cleared, and uses ``FIEDEUPRANGE`` (or the :ref:`atomic extent swap
-<swapext_if_unchanged>` feature) to change the target file's data extent
+cleared, and uses ``FIDEDUPERANGE`` (or the :ref:`atomic file content exchanges
+<exchrange_if_unchanged>` feature) to change the target file's data extent
 mapping away from the area being cleared.
 When all other mappings have been moved, clearspace reflinks the space into the
 space collector file so that it becomes unavailable.
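
For readers following the cursor arithmetic in steps 3i and 3j of the
design text above, here is a minimal sketch in plain C.  It is not the
kernel implementation: struct toy_intent, struct toy_map and
toy_advance() are hypothetical names made up for this illustration, and
they model only the startoff and blockcount fields of the intent.  The
sketch assumes that the mapping passed in has already been trimmed so
that it does not run past the end of the requested range.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for one file fork mapping (illustration only). */
struct toy_map {
	uint64_t startoff;	/* first file block of the mapping */
	uint64_t blockcount;	/* length of the mapping in blocks */
};

/* Hypothetical stand-in for the cursor fields of the exchange intent. */
struct toy_intent {
	uint64_t startoff1;	/* next block to exchange in file 1 */
	uint64_t startoff2;	/* next block to exchange in file 2 */
	uint64_t blockcount;	/* blocks still left to exchange */
};

/*
 * Steps 3i and 3j: compute how much of the range was just covered and
 * advance the cursor.  map1 may start after startoff1 because holes
 * could have been skipped, so the covered length is measured from the
 * old cursor position to the end of map1.
 *
 * Returns nonzero if more work remains (the case in which the real
 * code returns EAGAIN to the deferred operation manager), or zero once
 * the whole range has been processed.
 */
static int toy_advance(struct toy_intent *xmi, const struct toy_map *map1)
{
	uint64_t covered;

	/* The mapping must not begin before the cursor. */
	assert(map1->startoff >= xmi->startoff1);

	covered = map1->startoff + map1->blockcount - xmi->startoff1;

	/* Assumption: the caller trimmed map1 to the remaining range. */
	assert(covered <= xmi->blockcount);

	xmi->startoff1 += covered;
	xmi->startoff2 += covered;
	xmi->blockcount -= covered;

	return xmi->blockcount != 0;
}
```

Calling toy_advance() once per exchanged mapping, until it returns
zero, mirrors the "until xmi_blockcount is zero" loop in step 3.  The
real code also logs a done item and a fresh intent item on every
iteration, which this sketch leaves out.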



end of thread, other threads:[~2024-04-09  3:37 UTC | newest]

Thread overview: 62+ messages
2024-02-27  2:18 [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Darrick J. Wong
2024-02-27  2:21 ` [PATCH 01/14] vfs: export remap and write check helpers Darrick J. Wong
2024-02-28 15:40   ` Christoph Hellwig
2024-02-27  2:21 ` [PATCH 02/14] xfs: introduce new file range exchange ioctls Darrick J. Wong
2024-02-28 15:44   ` Christoph Hellwig
2024-02-28 19:35     ` Darrick J. Wong
2024-02-28 19:37       ` Christoph Hellwig
2024-02-28 23:00         ` Darrick J. Wong
2024-02-29 13:22           ` Christoph Hellwig
2024-02-29 17:10             ` Darrick J. Wong
2024-02-29 19:42               ` Christoph Hellwig
2024-02-27  2:21 ` [PATCH 03/14] xfs: create a log incompat flag for atomic file mapping exchanges Darrick J. Wong
2024-02-28 15:44   ` Christoph Hellwig
2024-02-27  2:21 ` [PATCH 04/14] xfs: introduce a file mapping exchange log intent item Darrick J. Wong
2024-02-28 15:45   ` Christoph Hellwig
2024-02-27  2:22 ` [PATCH 05/14] xfs: create deferred log items for file mapping exchanges Darrick J. Wong
2024-02-28 15:49   ` Christoph Hellwig
2024-02-28 19:55     ` Darrick J. Wong
2024-02-28 22:08       ` Christoph Hellwig
2024-02-28 22:56         ` Darrick J. Wong
2024-02-27  2:22 ` [PATCH 06/14] xfs: bind together the front and back ends of the file range exchange code Darrick J. Wong
2024-02-28 15:49   ` Christoph Hellwig
2024-02-27  2:22 ` [PATCH 07/14] xfs: add error injection to test file mapping exchange recovery Darrick J. Wong
2024-02-28 15:50   ` Christoph Hellwig
2024-02-27  2:22 ` [PATCH 08/14] xfs: condense extended attributes after a mapping exchange operation Darrick J. Wong
2024-02-28 15:50   ` Christoph Hellwig
2024-02-27  2:23 ` [PATCH 09/14] xfs: condense directories " Darrick J. Wong
2024-02-28 15:51   ` Christoph Hellwig
2024-02-27  2:23 ` [PATCH 10/14] xfs: condense symbolic links " Darrick J. Wong
2024-02-28 15:51   ` Christoph Hellwig
2024-02-27  2:23 ` [PATCH 11/14] xfs: make file range exchange support realtime files Darrick J. Wong
2024-02-28 15:51   ` Christoph Hellwig
2024-02-27  2:23 ` [PATCH 12/14] xfs: support non-power-of-two rtextsize with exchange-range Darrick J. Wong
2024-02-28 15:51   ` Christoph Hellwig
2024-02-27  2:24 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
2024-02-28 15:52   ` Christoph Hellwig
2024-02-27  2:24 ` [PATCH 14/14] xfs: enable logged file mapping exchange feature Darrick J. Wong
2024-02-28 15:52   ` Christoph Hellwig
2024-02-27  9:23 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Amir Goldstein
2024-02-27 10:53   ` Jeff Layton
2024-02-27 16:06     ` Darrick J. Wong
2024-03-01 13:16       ` Jeff Layton
2024-02-27 23:46     ` Dave Chinner
2024-02-28 10:30       ` Jeff Layton
2024-02-28 10:58         ` Amir Goldstein
2024-02-28 11:01           ` Jeff Layton
2024-02-27 15:45   ` Darrick J. Wong
2024-02-27 16:58     ` Amir Goldstein
2024-02-27 17:46 ` [PATCH 14/13] xfs: make XFS_IOC_COMMIT_RANGE freshness data opaque Darrick J. Wong
2024-02-27 18:52   ` Amir Goldstein
2024-02-29 23:27     ` Darrick J. Wong
2024-03-01 13:00       ` Amir Goldstein
2024-03-01 13:31       ` Jeff Layton
2024-03-02  2:48         ` Darrick J. Wong
2024-03-02 12:43           ` Jeff Layton
2024-03-07 23:25             ` Darrick J. Wong
2024-02-28  1:50 ` [PATCHSET v29.4 03/13] xfs: atomic file content exchanges Colin Walters
2024-02-29 20:18   ` Darrick J. Wong
2024-02-29 22:43     ` Colin Walters
2024-03-01  0:03       ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2024-03-30  0:57 [PATCHSET v30.1] " Darrick J. Wong
2024-03-30  1:00 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
2024-04-09  3:34 [PATCHSET v30.2] xfs: atomic file content exchanges Darrick J. Wong
2024-04-09  3:37 ` [PATCH 13/14] docs: update swapext -> exchmaps language Darrick J. Wong
