* [RFC 0/9] gfs2: handle page faults during read and write
@ 2021-05-31 17:01 ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Hi Linus,

here's a set of fixes for how gfs2 handles page faults during read and
write syscalls.  The patch queue is ready for merging except for two
open issues.  Could you shed some light on them, or point us to the
right person to help us out?

Right now, the filesystem can end up trying to take a lock it's already
holding, which causes it to BUG.  We can recognize and mostly deal with
that simple case, but more complex scenarios exist which involve
multiple locks and / or incompatible locking modes.  To handle those, we
switch to trylocks where necessary.
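
To illustrate, the simplest reproducer is a read whose user buffer is
mmapped from the same file (a sketch, error handling omitted; the path
is made up, any file on a gfs2 mount will do):

  #include <fcntl.h>
  #include <sys/mman.h>
  #include <unistd.h>

  int main(void)
  {
          int fd = open("/mnt/gfs2/testfile", O_RDWR);
          char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);

          /*
           * read() takes the inode glock; if the destination page
           * is not resident, copying into it faults and re-enters
           * the fault handler, which wants the same glock.
           */
          return read(fd, p, 4096) < 0;
  }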

The patches appear to be working as intended, with the following
remaining questions:

(1) Jan Kara has pointed out [*] that returning VM_FAULT_SIGBUS from the
.fault / .page_mkwrite ops will raise SIGBUS, which would be visible to
user space.  This hasn't been observed in our testing; instead, accesses
to the mmapped memory via call chains like:

  iov_iter_fault_in_readable -> fault_in_pages_readable -> __get_user

would simply fail with -EFAULT, as we're expecting.
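
(For reference, fault_in_pages_readable probes one byte per page with
__get_user, roughly:

  volatile char c;

  if (__get_user(c, uaddr))       /* kernel-mode access; resolved
                                     via the exception tables */
          return -EFAULT;         /* no signal is raised */

so a failing probe surfaces as -EFAULT rather than as a signal.)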

From looking at do_user_addr_fault and do_sigbus in arch/x86/mm/fault.c,
my impression is that VM_FAULT_SIGBUS will not cause SIGBUS to be raised
when the fault occurs in kernel mode, and that we can rely on the
-EFAULT behavior to trigger the retries at the filesystem level.

(2) The patch queue introduces the same kind of trylock behavior for
both .fault (gfs2_fault) and .page_mkwrite (gfs2_page_mkwrite).  I'm not
aware of any situation in which we can actually end up in .page_mkwrite
during a .read_iter or .write_iter operation, so the trylock behavior in
.page_mkwrite might be harmless but unnecessary.


Thank you very much,
Andreas


[*] https://listman.redhat.com/archives/cluster-devel/2021-May/msg00080.html

Previous posting of this patch queue:

https://listman.redhat.com/archives/cluster-devel/2021-May/msg00073.html

New xfstest for mmap + page faults during read / write:
 
https://lore.kernel.org/fstests/20210531152604.240462-1-agruenba@redhat.com/

Andreas Gruenbacher (9):
  gfs2: Clean up the error handling in gfs2_page_mkwrite
  gfs2: Add wrapper for iomap_file_buffered_write
  gfs2: Add gfs2_holder_is_compatible helper
  gfs2: Fix mmap + page fault deadlocks (part 1)
  iov_iter: Add iov_iter_fault_in_writeable()
  gfs2: Add wrappers for accessing journal_info
  gfs2: Encode glock holding and retry flags in journal_info
  gfs2: Add LM_FLAG_OUTER glock holder flag
  gfs2: Fix mmap + page fault deadlocks (part 2)

 fs/gfs2/aops.c      |   6 +-
 fs/gfs2/bmap.c      |  31 ++++----
 fs/gfs2/file.c      | 175 +++++++++++++++++++++++++++++++++-----------
 fs/gfs2/glock.c     |  12 +++
 fs/gfs2/glock.h     |  27 ++++++-
 fs/gfs2/incore.h    |  41 +++++++++++
 fs/gfs2/inode.c     |   2 +-
 fs/gfs2/log.c       |   4 +-
 fs/gfs2/lops.c      |   2 +-
 fs/gfs2/meta_io.c   |   6 +-
 fs/gfs2/super.c     |   2 +-
 fs/gfs2/trans.c     |  16 ++--
 include/linux/uio.h |   1 +
 lib/iov_iter.c      |  20 ++++-
 14 files changed, 265 insertions(+), 80 deletions(-)

-- 
2.26.3


* [RFC 1/9] gfs2: Clean up the error handling in gfs2_page_mkwrite
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

In several places, we set an error number only so that
block_page_mkwrite_return can translate it into the corresponding
VM_FAULT_* code.  This is getting confusing, so set the VM_FAULT_*
codes directly instead.  (No change in functionality.)
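
For reference, block_page_mkwrite_return() in
include/linux/buffer_head.h performs roughly the following
translation:

  static inline vm_fault_t block_page_mkwrite_return(int err)
  {
          if (err == 0)
                  return VM_FAULT_LOCKED;
          if (err == -EFAULT || err == -EAGAIN)
                  return VM_FAULT_NOPAGE;
          if (err == -ENOMEM)
                  return VM_FAULT_OOM;
          /* -ENOSPC, -EDQUOT, -EIO ... */
          return VM_FAULT_SIGBUS;
  }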

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 63 ++++++++++++++++++++++++++++++++------------------
 1 file changed, 40 insertions(+), 23 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 8a35a0196b6d..0eb235728098 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -427,22 +427,25 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	struct gfs2_alloc_parms ap = { .aflags = 0, };
 	u64 offset = page_offset(page);
 	unsigned int data_blocks, ind_blocks, rblocks;
+	vm_fault_t ret = VM_FAULT_LOCKED;
 	struct gfs2_holder gh;
 	unsigned int length;
 	loff_t size;
-	int ret;
+	int err;
 
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
-	ret = gfs2_glock_nq(&gh);
-	if (ret)
+	err = gfs2_glock_nq(&gh);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
 		goto out_uninit;
+	}
 
 	/* Check page index against inode size */
 	size = i_size_read(inode);
 	if (offset >= size) {
-		ret = -EINVAL;
+		ret = VM_FAULT_SIGBUS;
 		goto out_unlock;
 	}
 
@@ -469,24 +472,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	    !gfs2_write_alloc_required(ip, offset, length)) {
 		lock_page(page);
 		if (!PageUptodate(page) || page->mapping != inode->i_mapping) {
-			ret = -EAGAIN;
+			ret = VM_FAULT_NOPAGE;
 			unlock_page(page);
 		}
 		goto out_unlock;
 	}
 
-	ret = gfs2_rindex_update(sdp);
-	if (ret)
+	err = gfs2_rindex_update(sdp);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
 		goto out_unlock;
+	}
 
 	gfs2_write_calc_reserv(ip, length, &data_blocks, &ind_blocks);
 	ap.target = data_blocks + ind_blocks;
-	ret = gfs2_quota_lock_check(ip, &ap);
-	if (ret)
+	err = gfs2_quota_lock_check(ip, &ap);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
 		goto out_unlock;
-	ret = gfs2_inplace_reserve(ip, &ap);
-	if (ret)
+	}
+	err = gfs2_inplace_reserve(ip, &ap);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
 		goto out_quota_unlock;
+	}
 
 	rblocks = RES_DINODE + ind_blocks;
 	if (gfs2_is_jdata(ip))
@@ -495,27 +504,35 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 		rblocks += RES_STATFS + RES_QUOTA;
 		rblocks += gfs2_rg_blocks(ip, data_blocks + ind_blocks);
 	}
-	ret = gfs2_trans_begin(sdp, rblocks, 0);
-	if (ret)
+	err = gfs2_trans_begin(sdp, rblocks, 0);
+	if (err) {
+		ret = block_page_mkwrite_return(err);
 		goto out_trans_fail;
+	}
 
 	lock_page(page);
-	ret = -EAGAIN;
 	/* If truncated, we must retry the operation, we may have raced
 	 * with the glock demotion code.
 	 */
-	if (!PageUptodate(page) || page->mapping != inode->i_mapping)
+	if (!PageUptodate(page) || page->mapping != inode->i_mapping) {
+		ret = VM_FAULT_NOPAGE;
 		goto out_trans_end;
+	}
 
 	/* Unstuff, if required, and allocate backing blocks for page */
-	ret = 0;
-	if (gfs2_is_stuffed(ip))
-		ret = gfs2_unstuff_dinode(ip, page);
-	if (ret == 0)
-		ret = gfs2_allocate_page_backing(page, length);
+	if (gfs2_is_stuffed(ip)) {
+		err = gfs2_unstuff_dinode(ip, page);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_trans_end;
+		}
+	}
+	err = gfs2_allocate_page_backing(page, length);
+	if (err)
+		ret = block_page_mkwrite_return(err);
 
 out_trans_end:
-	if (ret)
+	if (ret != VM_FAULT_LOCKED)
 		unlock_page(page);
 	gfs2_trans_end(sdp);
 out_trans_fail:
@@ -526,12 +543,12 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
-	if (ret == 0) {
+	if (ret == VM_FAULT_LOCKED) {
 		set_page_dirty(page);
 		wait_for_stable_page(page);
 	}
 	sb_end_pagefault(inode->i_sb);
-	return block_page_mkwrite_return(ret);
+	return ret;
 }
 
 static vm_fault_t gfs2_fault(struct vm_fault *vmf)
-- 
2.26.3


* [RFC 2/9] gfs2: Add wrapper for iomap_file_buffered_write
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Add a wrapper around iomap_file_buffered_write.  This is where we will
later add the code for retrying the operation; see the sketch below.
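
The retry is expected to take roughly this shape (a sketch only;
should_retry() is a stand-in for the journal_info-based retry flag
that the later patches in this series introduce):

  static ssize_t gfs2_file_buffered_write(struct kiocb *iocb,
                                          struct iov_iter *from)
  {
          struct inode *inode = file_inode(iocb->ki_filp);
          ssize_t ret;

          do {
                  current->backing_dev_info = inode_to_bdi(inode);
                  ret = iomap_file_buffered_write(iocb, from,
                                                  &gfs2_iomap_ops);
                  current->backing_dev_info = NULL;
                  /* should_retry(): check the retry flag and fault
                     in the user pages before trying again */
          } while (ret == -EFAULT && should_retry(from));
          return ret;
  }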

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 20 ++++++++++++++------
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 0eb235728098..6d77743f11a4 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -876,6 +876,18 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	return written ? written : ret;
 }
 
+static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *from)
+{
+	struct file *file = iocb->ki_filp;
+	struct inode *inode = file_inode(file);
+	ssize_t ret;
+
+	current->backing_dev_info = inode_to_bdi(inode);
+	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
+	current->backing_dev_info = NULL;
+	return ret;
+}
+
 /**
  * gfs2_file_write_iter - Perform a write to a file
  * @iocb: The io context
@@ -927,9 +939,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 			goto out_unlock;
 
 		iocb->ki_flags |= IOCB_DSYNC;
-		current->backing_dev_info = inode_to_bdi(inode);
-		buffered = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		buffered = gfs2_file_buffered_write(iocb, from);
 		if (unlikely(buffered <= 0)) {
 			if (!ret)
 				ret = buffered;
@@ -951,9 +961,7 @@ static ssize_t gfs2_file_write_iter(struct kiocb *iocb, struct iov_iter *from)
 		if (!ret || ret2 > 0)
 			ret += ret2;
 	} else {
-		current->backing_dev_info = inode_to_bdi(inode);
-		ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
-		current->backing_dev_info = NULL;
+		ret = gfs2_file_buffered_write(iocb, from);
 		if (likely(ret > 0)) {
 			iocb->ki_pos += ret;
 			ret = generic_write_sync(iocb, ret);
-- 
2.26.3


* [RFC 3/9] gfs2: Add gfs2_holder_is_compatible helper
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

This function checks whether a glock holder's locking state is
compatible with a requested locking state.
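
In tabular form, the compatibility this helper implements (rows are
the state currently held, columns the state being requested; requests
for LM_ST_UNLOCKED trip the BUG_ON):

  held \ requested    EXCLUSIVE   DEFERRED   SHARED
  EXCLUSIVE              yes         no        yes
  DEFERRED               no          yes       no
  SHARED                 no          no        yes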

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.h | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index 31a8f2f649b5..f0ef6fd24ba4 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -49,6 +49,20 @@ enum {
 #define LM_ST_DEFERRED		2
 #define LM_ST_SHARED		3
 
+static inline bool gfs2_holder_is_compatible(struct gfs2_holder *gh, int state) {
+	BUG_ON(state == LM_ST_UNLOCKED);
+	switch(gh->gh_state) {
+	case LM_ST_EXCLUSIVE:
+		return state != LM_ST_DEFERRED;
+	case LM_ST_DEFERRED:
+		return state == LM_ST_DEFERRED;
+	case LM_ST_SHARED:
+		return state == LM_ST_SHARED;
+	default:
+		return false;
+	}
+}
+
 /*
  * lm_lock() flags
  *
-- 
2.26.3


* [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

When the buffer passed to a read or write system call is memory mapped
to the same file, a page fault can occur in filemap_fault.  In that
case, the task will already be holding the inode glock, and trying to
take the same lock again will result in a BUG in add_to_queue().

Fix that by recognizing the self-recursion case.  Either skip the lock
taking (when the glock is held in a compatible way), or fail the
operation.

Likewise, a request to un-share a copy-on-write page can *probably*
happen in similar situations, so treat the locking in gfs2_page_mkwrite
in the same way.

A future patch will handle these cases more gracefully by retrying
operations instead of failing them, along with addressing more complex
deadlock scenarios.

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/file.c | 40 ++++++++++++++++++++++++++++++----------
 1 file changed, 30 insertions(+), 10 deletions(-)

diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 6d77743f11a4..7d88abb4629b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -423,6 +423,7 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	struct page *page = vmf->page;
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	struct gfs2_alloc_parms ap = { .aflags = 0, };
 	u64 offset = page_offset(page);
@@ -436,10 +437,18 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	sb_start_pagefault(inode->i_sb);
 
 	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 
 	/* Check page index against inode size */
@@ -540,7 +549,8 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 out_quota_unlock:
 	gfs2_quota_unlock(ip);
 out_unlock:
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	if (ret == VM_FAULT_LOCKED) {
@@ -555,6 +565,7 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 {
 	struct inode *inode = file_inode(vmf->vma->vm_file);
 	struct gfs2_inode *ip = GFS2_I(inode);
+	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
 	u16 state;
@@ -562,13 +573,22 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
 	gfs2_holder_init(ip->i_gl, state, 0, &gh);
-	err = gfs2_glock_nq(&gh);
-	if (err) {
-		ret = block_page_mkwrite_return(err);
-		goto out_uninit;
+	if (likely(!outer_gh)) {
+		err = gfs2_glock_nq(&gh);
+		if (err) {
+			ret = block_page_mkwrite_return(err);
+			goto out_uninit;
+		}
+	} else {
+		if (!gfs2_holder_is_compatible(outer_gh, state)) {
+			/* We could try to upgrade outer_gh here. */
+			ret = VM_FAULT_SIGBUS;
+			goto out_uninit;
+		}
 	}
 	ret = filemap_fault(vmf);
-	gfs2_glock_dq(&gh);
+	if (likely(!outer_gh))
+		gfs2_glock_dq(&gh);
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return ret;
-- 
2.26.3


* [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Add the equivalent of iov_iter_fault_in_readable(), but for pages that
will be written to.

While at it, fix an indentation error in iov_iter_fault_in_readable().
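
A sketch of the intended use on the read side (variable names are
illustrative): after a locked read fails with -EFAULT, drop the glock
so the fault can make progress, pre-fault the destination pages, and
retry:

  if (ret == -EFAULT) {
          gfs2_glock_dq(&gh);
          if (iov_iter_fault_in_writeable(to, PAGE_SIZE))
                  return -EFAULT;   /* genuinely unmapped, give up */
          goto retry;               /* take the glock again */
  }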

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 include/linux/uio.h |  1 +
 lib/iov_iter.c      | 20 +++++++++++++++++++-
 2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/include/linux/uio.h b/include/linux/uio.h
index d3ec87706d75..6811eb6ac6e3 100644
--- a/include/linux/uio.h
+++ b/include/linux/uio.h
@@ -124,6 +124,7 @@ size_t iov_iter_copy_from_user_atomic(struct page *page,
 void iov_iter_advance(struct iov_iter *i, size_t bytes);
 void iov_iter_revert(struct iov_iter *i, size_t bytes);
 int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes);
+int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes);
 size_t iov_iter_single_seg_count(const struct iov_iter *i);
 size_t copy_page_to_iter(struct page *page, size_t offset, size_t bytes,
 			 struct iov_iter *i);
diff --git a/lib/iov_iter.c b/lib/iov_iter.c
index c701b7a187f2..317c94eac907 100644
--- a/lib/iov_iter.c
+++ b/lib/iov_iter.c
@@ -480,13 +480,31 @@ int iov_iter_fault_in_readable(struct iov_iter *i, size_t bytes)
 		iterate_iovec(i, bytes, v, iov, skip, ({
 			err = fault_in_pages_readable(v.iov_base, v.iov_len);
 			if (unlikely(err))
-			return err;
+				return err;
 		0;}))
 	}
 	return 0;
 }
 EXPORT_SYMBOL(iov_iter_fault_in_readable);
 
+int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes)
+{
+	size_t skip = i->iov_offset;
+	const struct iovec *iov;
+	int err;
+	struct iovec v;
+
+	if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
+		iterate_iovec(i, bytes, v, iov, skip, ({
+			err = fault_in_pages_writeable(v.iov_base, v.iov_len);
+			if (unlikely(err))
+				return err;
+		0;}))
+	}
+	return 0;
+}
+EXPORT_SYMBOL(iov_iter_fault_in_writeable);
+
 void iov_iter_init(struct iov_iter *i, unsigned int direction,
 			const struct iovec *iov, unsigned long nr_segs,
 			size_t count)
-- 
2.26.3


* [RFC 6/9] gfs2: Add wrappers for accessing journal_info
@ 2021-05-31 17:01   ` Andreas Gruenbacher
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Stop accessing current->journal_info directly; use the new
current_trans() and set_current_trans() wrappers instead.  The next
patch will change the wrappers to encode additional information in the
lower bits of current->journal_info.

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/aops.c    |  6 +++---
 fs/gfs2/bmap.c    | 28 ++++++++++++++--------------
 fs/gfs2/incore.h  | 10 ++++++++++
 fs/gfs2/inode.c   |  2 +-
 fs/gfs2/log.c     |  4 ++--
 fs/gfs2/lops.c    |  2 +-
 fs/gfs2/meta_io.c |  6 +++---
 fs/gfs2/super.c   |  2 +-
 fs/gfs2/trans.c   | 16 ++++++++--------
 9 files changed, 43 insertions(+), 33 deletions(-)

diff --git a/fs/gfs2/aops.c b/fs/gfs2/aops.c
index 23b5be3db044..50dd1771d00c 100644
--- a/fs/gfs2/aops.c
+++ b/fs/gfs2/aops.c
@@ -95,7 +95,7 @@ static int gfs2_writepage(struct page *page, struct writeback_control *wbc)
 
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
-	if (current->journal_info)
+	if (current_trans())
 		goto redirty;
 	return iomap_writepage(page, wbc, &wpc, &gfs2_writeback_ops);
 
@@ -182,7 +182,7 @@ static int gfs2_jdata_writepage(struct page *page, struct writeback_control *wbc
 
 	if (gfs2_assert_withdraw(sdp, gfs2_glock_is_held_excl(ip->i_gl)))
 		goto out;
-	if (PageChecked(page) || current->journal_info)
+	if (PageChecked(page) || current_trans())
 		goto out_ignore;
 	return __gfs2_jdata_writepage(page, wbc);
 
@@ -620,7 +620,7 @@ void adjust_fs_space(struct inode *inode)
  
 static int jdata_set_page_dirty(struct page *page)
 {
-	if (current->journal_info)
+	if (current_trans())
 		SetPageChecked(page);
 	return __set_page_dirty_buffers(page);
 }
diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 0bcf11a9987b..2ff501c413f4 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -1016,7 +1016,7 @@ static void gfs2_iomap_page_done(struct inode *inode, loff_t pos,
 				 unsigned copied, struct page *page,
 				 struct iomap *iomap)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	struct gfs2_inode *ip = GFS2_I(inode);
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 
@@ -1099,7 +1099,7 @@ static int gfs2_iomap_begin_write(struct inode *inode, loff_t pos,
 			}
 		}
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (tr->tr_num_buf_new)
 			__mark_inode_dirty(inode, I_DIRTY_DATASYNC);
 
@@ -1347,7 +1347,7 @@ int gfs2_alloc_extent(struct inode *inode, u64 lblock, u64 *dblock,
 static int gfs2_block_zero_range(struct inode *inode, loff_t from,
 				 unsigned int length)
 {
-	BUG_ON(current->journal_info);
+	BUG_ON(current_trans());
 	return iomap_zero_range(inode, from, length, NULL, &gfs2_iomap_ops);
 }
 
@@ -1386,7 +1386,7 @@ static int gfs2_journaled_truncate(struct inode *inode, u64 oldsize, u64 newsize
 		truncate_pagecache(inode, oldsize - chunk);
 		oldsize -= chunk;
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (!test_bit(TR_TOUCHED, &tr->tr_flags))
 			continue;
 
@@ -1447,7 +1447,7 @@ static int trunc_start(struct inode *inode, u64 newsize)
 
 out:
 	brelse(dibh);
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 	return error;
 }
@@ -1555,7 +1555,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 		   the rgrp. So we estimate. We know it can't be more than
 		   the dinode's i_blocks and we don't want to exceed the
 		   journal flush threshold, sd_log_thresh2. */
-		if (current->journal_info == NULL) {
+		if (!current_trans()) {
 			unsigned int jblocks_rqsted, revokes;
 
 			jblocks_rqsted = rgd->rd_length + RES_DINODE +
@@ -1577,7 +1577,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 			down_write(&ip->i_rw_mutex);
 		}
 		/* check if we will exceed the transaction blocks requested */
-		tr = current->journal_info;
+		tr = current_trans();
 		if (tr->tr_num_buf_new + RES_STATFS +
 		    RES_QUOTA >= atomic_read(&sdp->sd_log_thresh2)) {
 			/* We set blks_outside_rgrp to ensure the loop will
@@ -1625,7 +1625,7 @@ static int sweep_bh_for_rgrps(struct gfs2_inode *ip, struct gfs2_holder *rd_gh,
 	if (!ret && blks_outside_rgrp) { /* If buffer still has non-zero blocks
 					    outside the rgrp we just processed,
 					    do it all over again. */
-		if (current->journal_info) {
+		if (current_trans()) {
 			struct buffer_head *dibh;
 
 			ret = gfs2_meta_inode_buffer(ip, &dibh);
@@ -1991,7 +1991,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 	}
 
 	if (btotal) {
-		if (current->journal_info == NULL) {
+		if (!current_trans()) {
 			ret = gfs2_trans_begin(sdp, RES_DINODE + RES_STATFS +
 					       RES_QUOTA, 0);
 			if (ret)
@@ -2011,7 +2011,7 @@ static int punch_hole(struct gfs2_inode *ip, u64 offset, u64 length)
 out:
 	if (gfs2_holder_initialized(&rd_gh))
 		gfs2_glock_dq_uninit(&rd_gh);
-	if (current->journal_info) {
+	if (current_trans()) {
 		up_write(&ip->i_rw_mutex);
 		gfs2_trans_end(sdp);
 		cond_resched();
@@ -2436,7 +2436,7 @@ static int gfs2_journaled_truncate_range(struct inode *inode, loff_t offset,
 		offset += chunk;
 		length -= chunk;
 
-		tr = current->journal_info;
+		tr = current_trans();
 		if (!test_bit(TR_TOUCHED, &tr->tr_flags))
 			continue;
 
@@ -2501,7 +2501,7 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
 	}
 
 	if (gfs2_is_jdata(ip)) {
-		BUG_ON(!current->journal_info);
+		BUG_ON(!current_trans());
 		gfs2_journaled_truncate_range(inode, offset, length);
 	} else
 		truncate_pagecache_range(inode, offset, offset + length - 1);
@@ -2509,14 +2509,14 @@ int __gfs2_punch_hole(struct file *file, loff_t offset, loff_t length)
 	file_update_time(file);
 	mark_inode_dirty(inode);
 
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 
 	if (!gfs2_is_stuffed(ip))
 		error = punch_hole(ip, offset, length);
 
 out:
-	if (current->journal_info)
+	if (current_trans())
 		gfs2_trans_end(sdp);
 	return error;
 }
diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index e6f820f146cb..aa8d1a23132d 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -871,5 +871,15 @@ static inline unsigned gfs2_max_stuffed_size(const struct gfs2_inode *ip)
 	return GFS2_SB(&ip->i_inode)->sd_sb.sb_bsize - sizeof(struct gfs2_dinode);
 }
 
+static inline struct gfs2_trans *current_trans(void)
+{
+	return current->journal_info;
+}
+
+static inline void set_current_trans(struct gfs2_trans *tr)
+{
+	current->journal_info = tr;
+}
+
 #endif /* __INCORE_DOT_H__ */
 
diff --git a/fs/gfs2/inode.c b/fs/gfs2/inode.c
index 6e15434b23ac..1b94cbdc00cc 100644
--- a/fs/gfs2/inode.c
+++ b/fs/gfs2/inode.c
@@ -1883,7 +1883,7 @@ static int gfs2_setattr_simple(struct inode *inode, struct iattr *attr)
 {
 	int error;
 
-	if (current->journal_info)
+	if (current_trans())
 		return __gfs2_setattr_simple(inode, attr);
 
 	error = gfs2_trans_begin(GFS2_SB(inode), RES_DINODE, 0);
diff --git a/fs/gfs2/log.c b/fs/gfs2/log.c
index 42c15cfc0821..3ee29045ab90 100644
--- a/fs/gfs2/log.c
+++ b/fs/gfs2/log.c
@@ -204,7 +204,7 @@ void gfs2_ail1_flush(struct gfs2_sbd *sdp, struct writeback_control *wbc)
 	ret = 0;
 	if (time_after(jiffies, flush_start + (HZ * 600))) {
 		fs_err(sdp, "Error: In %s for ten minutes! t=%d\n",
-		       __func__, current->journal_info ? 1 : 0);
+		       __func__, current_trans() ? 1 : 0);
 		dump_ail_list(sdp);
 		goto out;
 	}
@@ -971,7 +971,7 @@ static void empty_ail1_list(struct gfs2_sbd *sdp)
 	for (;;) {
 		if (time_after(jiffies, start + (HZ * 600))) {
 			fs_err(sdp, "Error: In %s for 10 minutes! t=%d\n",
-			       __func__, current->journal_info ? 1 : 0);
+			       __func__, current_trans() ? 1 : 0);
 			dump_ail_list(sdp);
 			return;
 		}
diff --git a/fs/gfs2/lops.c b/fs/gfs2/lops.c
index 8ee05d25dfa6..9bd080e5db43 100644
--- a/fs/gfs2/lops.c
+++ b/fs/gfs2/lops.c
@@ -43,7 +43,7 @@ void gfs2_pin(struct gfs2_sbd *sdp, struct buffer_head *bh)
 {
 	struct gfs2_bufdata *bd;
 
-	BUG_ON(!current->journal_info);
+	BUG_ON(!current_trans());
 
 	clear_buffer_dirty(bh);
 	if (test_set_buffer_pinned(bh))
diff --git a/fs/gfs2/meta_io.c b/fs/gfs2/meta_io.c
index d68184ebbfdd..f5622393de63 100644
--- a/fs/gfs2/meta_io.c
+++ b/fs/gfs2/meta_io.c
@@ -294,7 +294,7 @@ int gfs2_meta_read(struct gfs2_glock *gl, u64 blkno, int flags,
 	bh = *bhp;
 	wait_on_buffer(bh);
 	if (unlikely(!buffer_uptodate(bh))) {
-		struct gfs2_trans *tr = current->journal_info;
+		struct gfs2_trans *tr = current_trans();
 		if (tr && test_bit(TR_TOUCHED, &tr->tr_flags))
 			gfs2_io_error_bh_wd(sdp, bh);
 		brelse(bh);
@@ -321,7 +321,7 @@ int gfs2_meta_wait(struct gfs2_sbd *sdp, struct buffer_head *bh)
 	wait_on_buffer(bh);
 
 	if (!buffer_uptodate(bh)) {
-		struct gfs2_trans *tr = current->journal_info;
+		struct gfs2_trans *tr = current_trans();
 		if (tr && test_bit(TR_TOUCHED, &tr->tr_flags))
 			gfs2_io_error_bh_wd(sdp, bh);
 		return -EIO;
@@ -337,7 +337,7 @@ void gfs2_remove_from_journal(struct buffer_head *bh, int meta)
 	struct address_space *mapping = bh->b_page->mapping;
 	struct gfs2_sbd *sdp = gfs2_mapping2sbd(mapping);
 	struct gfs2_bufdata *bd = bh->b_private;
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	int was_pinned = 0;
 
 	if (test_clear_buffer_pinned(bh)) {
diff --git a/fs/gfs2/super.c b/fs/gfs2/super.c
index 4d4ceb0b6903..5cb823e58d01 100644
--- a/fs/gfs2/super.c
+++ b/fs/gfs2/super.c
@@ -557,7 +557,7 @@ static void gfs2_dirty_inode(struct inode *inode, int flags)
 	} else if (WARN_ON_ONCE(ip->i_gl->gl_state != LM_ST_EXCLUSIVE))
 		return;
 
-	if (current->journal_info == NULL) {
+	if (!current_trans()) {
 		ret = gfs2_trans_begin(sdp, RES_DINODE, 0);
 		if (ret) {
 			fs_err(sdp, "dirty_inode: gfs2_trans_begin %d\n", ret);
diff --git a/fs/gfs2/trans.c b/fs/gfs2/trans.c
index 63fec11ef2ce..7681fbb12050 100644
--- a/fs/gfs2/trans.c
+++ b/fs/gfs2/trans.c
@@ -43,8 +43,8 @@ int __gfs2_trans_begin(struct gfs2_trans *tr, struct gfs2_sbd *sdp,
 {
 	unsigned int extra_revokes;
 
-	if (current->journal_info) {
-		gfs2_print_trans(sdp, current->journal_info);
+	if (current_trans()) {
+		gfs2_print_trans(sdp, current_trans());
 		BUG();
 	}
 	BUG_ON(blocks == 0 && revokes == 0);
@@ -101,7 +101,7 @@ int __gfs2_trans_begin(struct gfs2_trans *tr, struct gfs2_sbd *sdp,
 		return -EROFS;
 	}
 
-	current->journal_info = tr;
+	set_current_trans(tr);
 
 	return 0;
 }
@@ -123,10 +123,10 @@ int gfs2_trans_begin(struct gfs2_sbd *sdp, unsigned int blocks,
 
 void gfs2_trans_end(struct gfs2_sbd *sdp)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	s64 nbuf;
 
-	current->journal_info = NULL;
+	set_current_trans(NULL);
 
 	if (!test_bit(TR_TOUCHED, &tr->tr_flags)) {
 		gfs2_log_release_revokes(sdp, tr->tr_revokes);
@@ -191,7 +191,7 @@ static struct gfs2_bufdata *gfs2_alloc_bufdata(struct gfs2_glock *gl,
  */
 void gfs2_trans_add_data(struct gfs2_glock *gl, struct buffer_head *bh)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_bufdata *bd;
 
@@ -232,7 +232,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 	struct gfs2_sbd *sdp = gl->gl_name.ln_sbd;
 	struct gfs2_bufdata *bd;
 	struct gfs2_meta_header *mh;
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 	enum gfs2_freeze_state state = atomic_read(&sdp->sd_freeze_state);
 
 	lock_buffer(bh);
@@ -288,7 +288,7 @@ void gfs2_trans_add_meta(struct gfs2_glock *gl, struct buffer_head *bh)
 
 void gfs2_trans_add_revoke(struct gfs2_sbd *sdp, struct gfs2_bufdata *bd)
 {
-	struct gfs2_trans *tr = current->journal_info;
+	struct gfs2_trans *tr = current_trans();
 
 	BUG_ON(!list_empty(&bd->bd_list));
 	gfs2_add_revoke(sdp, bd);
-- 
2.26.3


* [RFC 7/9] gfs2: Encode glock holding and retry flags in journal_info
  2021-05-31 17:01 ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-05-31 17:01   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Use the lowest two bits in current->journal_info to encode when
we're holding a glock, and when an operation holding a glock
needs to be retried.
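
To illustrate (this is not part of the patch itself), the flag bits and
the transaction pointer don't disturb each other, because struct
gfs2_trans is always at least 4-byte aligned:

	set_current_trans(tr);
	set_current_holds_glock(true);
	BUG_ON(current_trans() != tr);    /* bit 0 leaves the pointer intact */
	set_current_trans(NULL);
	BUG_ON(!current_holds_glock());   /* clearing the pointer keeps bit 0 */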

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/incore.h | 35 +++++++++++++++++++++++++++++++++--
 1 file changed, 33 insertions(+), 2 deletions(-)

diff --git a/fs/gfs2/incore.h b/fs/gfs2/incore.h
index aa8d1a23132d..e32433df119c 100644
--- a/fs/gfs2/incore.h
+++ b/fs/gfs2/incore.h
@@ -871,14 +871,45 @@ static inline unsigned gfs2_max_stuffed_size(const struct gfs2_inode *ip)
 	return GFS2_SB(&ip->i_inode)->sd_sb.sb_bsize - sizeof(struct gfs2_dinode);
 }
 
+/*
+ * Transactions are always memory aligned, so we use bit 0 of
+ * current->journal_info to indicate when we're holding a glock and so taking
+ * random additional glocks might deadlock, and bit 1 to indicate when such an
+ * operation needs to be retried after dropping and re-acquiring that "outer"
+ * glock.
+ */
+
 static inline struct gfs2_trans *current_trans(void)
 {
-	return current->journal_info;
+	return (void *)((long)current->journal_info & ~3);
 }
 
 static inline void set_current_trans(struct gfs2_trans *tr)
 {
-	current->journal_info = tr;
+	long flags = (long)current->journal_info & 3;
+	current->journal_info = (void *)((long)tr | flags);
+}
+
+static inline bool current_holds_glock(void)
+{
+	return (long)current->journal_info & 1;
+}
+
+static inline bool current_needs_retry(void)
+{
+	return (long)current->journal_info & 2;
+}
+
+static inline void set_current_holds_glock(bool b)
+{
+	current->journal_info =
+		(void *)(((long)current->journal_info & ~1) | b);
+}
+
+static inline void set_current_needs_retry(bool b)
+{
+	current->journal_info =
+		(void *)(((long)current->journal_info & ~2) | (b << 1));
 }
 
 #endif /* __INCORE_DOT_H__ */
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC 8/9] gfs2: Add LM_FLAG_OUTER glock holder flag
  2021-05-31 17:01 ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-05-31 17:01   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

When a glock holder has the LM_FLAG_OUTER flag set, we set the
current_holds_glock() flag upon taking the lock.  With that, we
can recognize when trying to take an "inner" glock, and react
accordingly.
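
Sketched roughly (the actual wiring comes in the next patch), the
intended usage is:

	/* Outer level: the read/write syscall path marks its glock. */
	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
	ret = gfs2_glock_nq(gh);

	/* Inner level: a page fault in the middle of that operation. */
	if (current_holds_glock())
		flags |= LM_FLAG_TRY;	/* fail instead of deadlocking */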

Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/glock.c | 12 ++++++++++++
 fs/gfs2/glock.h | 13 ++++++++++---
 2 files changed, 22 insertions(+), 3 deletions(-)

diff --git a/fs/gfs2/glock.c b/fs/gfs2/glock.c
index d9cb261f55b0..f6cae2ee1c83 100644
--- a/fs/gfs2/glock.c
+++ b/fs/gfs2/glock.c
@@ -1427,6 +1427,11 @@ int gfs2_glock_nq(struct gfs2_holder *gh)
 	if (glock_blocked_by_withdraw(gl) && !(gh->gh_flags & LM_FLAG_NOEXP))
 		return -EIO;
 
+	if (gh->gh_flags & LM_FLAG_OUTER) {
+		BUG_ON(current_holds_glock());
+		set_current_holds_glock(true);
+	}
+
 	if (test_bit(GLF_LRU, &gl->gl_flags))
 		gfs2_glock_remove_from_lru(gl);
 
@@ -1514,6 +1519,11 @@ void gfs2_glock_dq(struct gfs2_holder *gh)
 		__gfs2_glock_queue_work(gl, delay);
 	}
 	spin_unlock(&gl->gl_lockref.lock);
+
+	if (gh->gh_flags & LM_FLAG_OUTER) {
+		BUG_ON(!current_holds_glock());
+		set_current_holds_glock(false);
+	}
 }
 
 void gfs2_glock_dq_wait(struct gfs2_holder *gh)
@@ -2068,6 +2078,8 @@ static const char *hflags2str(char *buf, u16 flags, unsigned long iflags)
 		*p++ = 'p';
 	if (flags & LM_FLAG_NODE_SCOPE)
 		*p++ = 'n';
+	if (flags & LM_FLAG_OUTER)
+		*p++ = 'o';
 	if (flags & GL_ASYNC)
 		*p++ = 'a';
 	if (flags & GL_EXACT)
diff --git a/fs/gfs2/glock.h b/fs/gfs2/glock.h
index f0ef6fd24ba4..8b145269fb14 100644
--- a/fs/gfs2/glock.h
+++ b/fs/gfs2/glock.h
@@ -94,6 +94,12 @@ static inline bool gfs2_holder_is_compatible(struct gfs2_holder *gh, int state)
  * This holder agrees to share the lock within this node. In other words,
  * the glock is held in EX mode according to DLM, but local holders on the
  * same node can share it.
+ *
+ * LM_FLAG_OUTER
+ * Use set_current_holds_glock() to indicate when the current task is holding
+ * this "upper" glock, and current_holds_glock() to detect when the current
+ * task is trying to take another glock.  Used to prevent deadlocks involving
+ * the inode glock during page faults.
  */
 
 #define LM_FLAG_TRY		0x0001
@@ -102,9 +108,10 @@ static inline bool gfs2_holder_is_compatible(struct gfs2_holder *gh, int state)
 #define LM_FLAG_ANY		0x0008
 #define LM_FLAG_PRIORITY	0x0010
 #define LM_FLAG_NODE_SCOPE	0x0020
-#define GL_ASYNC		0x0040
-#define GL_EXACT		0x0080
-#define GL_SKIP			0x0100
+#define LM_FLAG_OUTER		0x0040
+#define GL_ASYNC		0x0080
+#define GL_EXACT		0x0100
+#define GL_SKIP			0x0200
 #define GL_NOCACHE		0x0400
   
 /*
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-31 17:01 ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-05-31 17:01   ` Andreas Gruenbacher
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Gruenbacher @ 2021-05-31 17:01 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, linux-kernel, Alexander Viro,
	Jan Kara, Matthew Wilcox

Now that we handle self-recursion on the inode glock in gfs2_fault and
gfs2_page_mkwrite, we need to take care of more complex deadlock
scenarios like the following (example by Jan Kara):

Two independent processes P1, P2. Two files F1, F2, and two mappings M1,
M2 where M1 is a mapping of F1, M2 is a mapping of F2. Now P1 does DIO
to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer. They
can race like:

P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          <fault in M2>                           <fault in M1>
            gfs2_fault()                            gfs2_fault()
              tries to grab glock of F2               tries to grab glock of F1

Such scenarios are much harder to reproduce than self-recursion.

We deal with such situations by using the LM_FLAG_OUTER flag to mark
"outer" glock taking.  Then, when taking an "inner" glock, we use the
LM_FLAG_TRY flag so that locking attempts that don't succeed immediately
will be aborted.  In case of a failed locking attempt, we "unwind" to
where the "outer" glock was taken, drop the "outer" glock, and fault in
the first offending user page.  This will re-trigger the "inner" locking
attempt but without the "outer" glock being held and without the
LM_FLAG_TRY flag.  Once that's done, we re-acquire the "outer" glock and
retry the original operation.
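
Condensed from the diff below, the resulting pattern in the direct read
path (the other read/write paths are analogous) is:

	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
retry:
	ret = gfs2_glock_nq(gh);
	if (ret)
		goto out_uninit;
	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
	gfs2_glock_dq(gh);
	if (unlikely(current_needs_retry())) {
		set_current_needs_retry(false);
		/* fault in the page that made the inner trylock fail */
		if (ret == -EFAULT &&
		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
			goto retry;
	}
out_uninit:
	gfs2_holder_uninit(gh);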

Reported-by: Jan Kara <jack@suse.cz>
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
 fs/gfs2/bmap.c |  3 ++-
 fs/gfs2/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 2ff501c413f4..82e4506984e3 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -967,7 +967,8 @@ static int gfs2_write_lock(struct inode *inode)
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	int error;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_OUTER,
+			 &ip->i_gh);
 	error = gfs2_glock_nq(&ip->i_gh);
 	if (error)
 		goto out_uninit;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7d88abb4629b..3107d49a379b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	vm_fault_t ret = VM_FAULT_LOCKED;
 	struct gfs2_holder gh;
 	unsigned int length;
+	u16 flags = 0;
 	loff_t size;
 	int err;
 
 	sb_start_pagefault(inode->i_sb);
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -568,20 +577,28 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
-	u16 state;
+	u16 state, flags = 0;
 	int err;
 
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-	gfs2_holder_init(ip->i_gl, state, 0, &gh);
+	gfs2_holder_init(ip->i_gl, state, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, state)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (!count)
 		return 0; /* skip atime */
 
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -837,7 +861,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * unfortunately, have the option of only flushing a range like the
 	 * VFS does.
 	 */
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -851,6 +876,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_readable(from, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -883,7 +914,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
-	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_OUTER, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
@@ -891,6 +923,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -902,9 +940,17 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	current->backing_dev_info = NULL;
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_readable(from, PAGE_SIZE))
+			goto retry;
+	}
+
 	return ret;
 }
 
-- 
2.26.3


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-05-31 17:01   ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-05-31 17:12     ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-05-31 17:12 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, linux-kernel, Jan Kara,
	Matthew Wilcox

On Mon, May 31, 2021 at 07:01:19PM +0200, Andreas Gruenbacher wrote:
> Add the equivalent of iov_iter_fault_in_readable(), but for pages that
> will be written to.
> 
> While at it, fix an indentation error in iov_iter_fault_in_readable().

> +int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes)
> +{
> +	size_t skip = i->iov_offset;
> +	const struct iovec *iov;
> +	int err;
> +	struct iovec v;
> +
> +	if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
> +		iterate_iovec(i, bytes, v, iov, skip, ({
> +			err = fault_in_pages_writeable(v.iov_base, v.iov_len);
> +			if (unlikely(err))
> +				return err;
> +		0;}))
> +	}
> +	return 0;
> +}
> +EXPORT_SYMBOL(iov_iter_fault_in_writeable);

I really don't like that.  Conflicts with iov_iter patches are not hard to
deal with, but (like fault_in_pages_writeable() itself) it's dangerous as
hell - fault-in for read is non-destructive, but that is *not*.  Existing
users have to be careful with it and there are very few of those.  Adding
that as a new primitive is inviting trouble; at the very least it needs
a big fat "Don't use unless you really know what you are doing" kind of
warning.
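
For context, fault_in_pages_writeable() faults the pages in by actually
storing to them - condensed from include/linux/pagemap.h:

	do {
		if (unlikely(__put_user(0, uaddr) != 0))	/* writes a zero byte */
			return -EFAULT;
		uaddr += PAGE_SIZE;
	} while (uaddr <= end);

so data already placed in the destination buffer can get overwritten
with zeroes by a later fault-in.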

^ permalink raw reply	[flat|nested] 49+ messages in thread

* [Cluster-devel] [RFC 0/9] gfs2: handle page faults during read and write
  2021-05-31 17:01 ` [Cluster-devel] " Andreas Gruenbacher
                   ` (9 preceding siblings ...)
  (?)
@ 2021-05-31 17:57 ` Linus Torvalds
  2021-05-31 20:35     ` [Cluster-devel] " Andreas Gruenbacher
  -1 siblings, 1 reply; 49+ messages in thread
From: Linus Torvalds @ 2021-05-31 17:57 UTC (permalink / raw)
  To: cluster-devel.redhat.com

Sorry, I'm on a boat right now, with only a cellphone. Which is why this
html mess email, and quick reply.

Due to the html, this may get a bounce from the mailing list, and only make
it to the personal email recipients. Feel free to quote more just in case
others didn't get my original email through the lists.

I'll be out most of the day, but I'll try to take a deeper look this
evening.

In the meantime, a couple of questions and comments..

On Mon, May 31, 2021, 07:01 Andreas Gruenbacher <agruenba@redhat.com> wrote:

>
> here's a set of fixes for how gfs2 handles page faults during read and
> write syscalls.


So how much of this is due to the confusion you just introduced where you
pointlessly and incorrectly take an exclusive lock for write faults?

See my reply to that pull request for why it's wrong and pointless.

> The patch queue is ready for merging except for two
> open issues.
>

There is no way this series is acceptable for 5.13. This kind of change is
very much a merge window thing. Much much too late to make fundamental
locking changes. Maybe it can then be backported to stable (including at
that point 5.13 of course) if it's been shown to be ok.

This deadlock is not new, we've very much had the same kind of thing when
writing to a file in the generic filemap_write() function, where we take
the page lock and then copy from user space. If that copy faults, and needs
the same page for the source due to an odd mmap issue (usually malicious),
you get a deadlock on the page lock if you aren't careful.

I'm surprised that gfs2 hasn't seen this, I thought we had fstests for it.
And I'd have expected that case to also trigger any internal gfs2 issues,
although it's possible that the generic code just does such a good job at
avoiding the issue that we'd need another test for your case.

      Linus


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 0/9] gfs2: handle page faults during read and write
  2021-05-31 17:57 ` [Cluster-devel] [RFC 0/9] gfs2: handle page faults during read and write Linus Torvalds
@ 2021-05-31 20:35     ` Andreas Gruenbacher
  0 siblings, 0 replies; 49+ messages in thread
From: Andreas Gruenbacher @ 2021-05-31 20:35 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Al Viro, cluster-devel, Linux Kernel Mailing List, Jan Kara,
	Matthew Wilcox

On Mon, May 31, 2021 at 7:57 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> Sorry, I'm on a boat right now, with only a cellphone. Which is why this html mess email, and quick reply.
>
> Due to the html, this may get a bounce from the mailing list, and only make it to the personal email recipients. Feel free to quote more just in case others didn't get my original email through the lists.
>
> I'll be out most of the day, but I'll try to take a deeper look this evening.
>
> In the meantime, a couple of questions and comments..
>
> On Mon, May 31, 2021, 07:01 Andreas Gruenbacher <agruenba@redhat.com> wrote:
>>
>> here's a set of fixes for how gfs2 handles page faults during read and
>> write syscalls.
>
> So how much of this is due to the confusion you just introduced where you pointlessly and incorrectly take an exclusive lock for write faults?
>
> See my reply to that pull request for why it's wrong and pointless.

Those are separate problems. If we treat a write fault as a filesystem
read and only take a read lock as you've explained in your other
reply, we'll still have some remaining locking mode incompatibilities
(gfs2_holder_is_compatible); we probably won't need
iov_iter_fault_in_writeable() though.

>>   The patch queue is ready for merging except for two
>> open issues.
>
>
> There is no way this series is acceptable for 5.13. This kind of change is very much a merge window thing. Much much too late to make fundamental locking changes. Maybe it can then be backported to stable (including at that point 5.13 of course) if it's been shown to be ok.
>
> This deadlock is not new, we've very much had the same kind of thing when writing to a file in the generic filemap_write() function, where we take the page lock and then copy from user space. If that copy faults, and needs the same page for the source due to an odd mmap issue (usually malicious), you get a deadlock on the page lock if you aren't careful.

Right, the deadlock isn't new; we just didn't know about it until Jan
Kara pointed it out.

It would be important to us to have the self-recursion case addressed
in 5.13 at least; that's the four patches up to and including "gfs2:
Fix mmap + page fault deadlocks (part 1)".

> I'm surprised that gfs2 hasn't seen this, I thought we had fstests for it. And I'd have expected that case to also trigger any internal gfs2 issues, although it's possible that the generic code just does such a good job at avoiding the issue that we'd need another test for your case.

fstests didn't catch it, so I wrote a new test (not merged yet):

https://lore.kernel.org/fstests/20210531152604.240462-1-agruenba@redhat.com/

Thanks a lot,
Andreas


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2)
  2021-05-31 17:01   ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-06-01  5:47     ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2021-06-01  5:47 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Linux Kernel Mailing List, Alexander Viro,
	Jan Kara, Matthew Wilcox

On Mon, May 31, 2021 at 7:02 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> @@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
> [...]
>         ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
>         gfs2_glock_dq(gh);
> +       if (unlikely(current_needs_retry())) {
> +               set_current_needs_retry(false);
> +               if (ret == -EFAULT &&
> +                   !iov_iter_fault_in_writeable(to, PAGE_SIZE))
> +                       goto retry;
> +       }

Hmm. I haven't walked through this all, but is that "ret == -EFAULT"
test the right thing to do?

Can iomap_dio_rw() not instead just return a partial success if it hit
a missing page half-way?

Shouldn't you retry for that case too?

                Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-05-31 17:01   ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-06-01  6:00     ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2021-06-01  6:00 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: cluster-devel, Linux Kernel Mailing List, Alexander Viro,
	Jan Kara, Matthew Wilcox

On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> Fix that by recognizing the self-recursion case.

Hmm. I get the feeling that the self-recursion case should never have
been allowed to happen in the first place.

IOW, is there some reason why you can't make the user accesses always
be doen with page faults disabled (ie using the "atomic" user space
access model), and then if you get a partial read (or write) to user
space, at that point you drop the locks in read/write, do the "try to
make readable/writable" and try again.

IOW, none of this "detect recursion" thing. Just "no recursion in the
first place".

That way you'd not have these odd rules at fault time at all, because
a fault while holding a lock would never get to the filesystem at all,
it would be aborted early. And you'd not have any odd "inner/outer"
locks, or lock compatibility rules or anything like that. You'd
literally have just "oh, I didn't get everything at RW time while I
held locks, so let's drop the locks, try to access user space, and
retry".

Wouldn't that be a lot simpler and more robust?
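
Very roughly, and with made-up lock helpers just to show the shape:

retry:
	take_locks();
	pagefault_disable();
	left = copy_from_user(dst, src, len);	/* fails instead of faulting */
	pagefault_enable();
	drop_locks();
	if (left) {	/* short copy */
		if (fault_in_pages_readable(src, len))
			return -EFAULT;	/* no locks held here */
		goto retry;
	}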

Because what if the mmap is something a bit more complex, like
overlayfs or userfaultfd, and completing the fault isn't about gfs2
handling it as a "fault", but about some *other* entity calling back
to gfs2 and doing a read/write instead? Now all your "inner/outer"
lock logic ends up being entirely pointless, as far as I can tell, and
you end up deadlocking on the lock you are holding over the user space
access _anyway_.

So I literally think that your approach is

 (a) too complicated

 (b) doesn't actually fix the issue in the more general case

But maybe I'm missing something.

              Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-06-01  6:00     ` [Cluster-devel] " Linus Torvalds
@ 2021-06-02 11:16       ` Andreas Gruenbacher
  -1 siblings, 0 replies; 49+ messages in thread
From: Andreas Gruenbacher @ 2021-06-02 11:16 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: cluster-devel, Linux Kernel Mailing List, Alexander Viro,
	Jan Kara, Matthew Wilcox

On Tue, Jun 1, 2021 at 8:00 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
> On Mon, May 31, 2021 at 7:01 AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> >
> > Fix that by recognizing the self-recursion case.
>
> Hmm. I get the feeling that the self-recursion case should never have
> been allowed to happen in the first place.
>
> IOW, is there some reason why you can't make the user accesses always
> be done with page faults disabled (ie using the "atomic" user space
> access model), and then if you get a partial read (or write) to user
> space, at that point you drop the locks in read/write, do the "try to
> make readable/writable" and try again.
>
> IOW, none of this "detect recursion" thing. Just "no recursion in the
> first place".
>
> That way you'd not have these odd rules at fault time at all, because
> a fault while holding a lock would never get to the filesystem at all,
> it would be aborted early. And you'd not have any odd "inner/outer"
> locks, or lock compatibility rules or anything like that. You'd
> literally have just "oh, I didn't get everything at RW time while I
> held locks, so let's drop the locks, try to access user space, and
> retry".

Well, iomap_file_buffered_write() does that by using
iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
in iomap_write_actor(), but the read and direct I/O side doesn't seem
to have equivalents. I suspect we can't just wrap
generic_file_read_iter() and iomap_dio_rw() calls in
pagefault_disable().
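
For reference, condensed from iomap_write_actor() in
fs/iomap/buffered-io.c:

	/* pre-fault with no page lock held; bail out if that fails */
	if (unlikely(iov_iter_fault_in_readable(i, bytes))) {
		status = -EFAULT;
		break;
	}
	status = iomap_write_begin(inode, pos, bytes, 0, &page, iomap, srcmap);
	...
	/* copy with page faults disabled; a short copy just loops */
	copied = iov_iter_copy_from_user_atomic(page, i, offset, bytes);
	status = iomap_write_end(inode, pos, bytes, copied, page, iomap, srcmap);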

> Wouldn't that be a lot simpler and more robust?

Sure, with vfs primitives that support atomic user-space access and
with a iov_iter_fault_in_writeable() like operation, we could do that.

> Because what if the mmap is something a bit more complex, like
> overlayfs or userfaultfd, and completing the fault isn't about gfs2
> handling it as a "fault", but about some *other* entity calling back
> to gfs2 and doing a read/write instead? Now all your "inner/outer"
> lock logic ends up being entirely pointless, as far as I can tell, and
> you end up deadlocking on the lock you are holding over the user space
> access _anyway_.

Yes, those kinds of deadlocks would still be possible.

Until we have a better solution, wouldn't it make sense to at least
prevent those self-recursion deadlocks? I'll send a separate pull
request in case you find that acceptable.

Thanks,
Andreas


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-06-02 11:16       ` [Cluster-devel] " Andreas Gruenbacher
@ 2021-06-11 16:25         ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-11 16:25 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:

> Well, iomap_file_buffered_write() does that by using
> iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> in iomap_write_actor(), but the read and direct I/O side doesn't seem
> to have equivalents. I suspect we can't just wrap
> generic_file_read_iter() and iomap_dio_rw() calls in
> pagefault_disable().

And it will have zero effect on O_DIRECT case, so you get the same
deadlocks right back.  Because there you hit
	iomap_dio_bio_actor()
		bio_iov_iter_get_pages()
			....
				get_user_pages_fast()
					....
						faultin_page()
							handle_mm_fault()
and at no point does the CPU hit an exception, so pagefault_disable() will
have no effect whatsoever.  You can bloody well hit gfs2 readpage/mkwrite
if the destination is in an mmapped area of some GFS2 file.  Do that
while holding GFS2 locks and you are fucked.

No amount of prefaulting will protect you, BTW - it might make the
deadlock harder to reproduce, but that's it.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-06-11 16:25         ` [Cluster-devel] " Al Viro
@ 2021-06-12 21:05           ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-12 21:05 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Fri, Jun 11, 2021 at 04:25:10PM +0000, Al Viro wrote:
> On Wed, Jun 02, 2021 at 01:16:32PM +0200, Andreas Gruenbacher wrote:
> 
> > Well, iomap_file_buffered_write() does that by using
> > iov_iter_fault_in_readable() and iov_iter_copy_from_user_atomic() as
> > in iomap_write_actor(), but the read and direct I/O side doesn't seem
> > to have equivalents. I suspect we can't just wrap
> > generic_file_read_iter() and iomap_dio_rw() calls in
> > pagefault_disable().
> 
> And it will have zero effect on O_DIRECT case, so you get the same
> deadlocks right back.  Because there you hit
> 	iomap_dio_bio_actor()
> 		bio_iov_iter_get_pages()
> 			....
> 				get_user_pages_fast()
> 					....
> 						faultin_page()
> 							handle_mm_fault()
> and at no point does the CPU hit an exception, so pagefault_disable() will
> have no effect whatsoever.  You can bloody well hit gfs2 readpage/mkwrite
> if the destination is in an mmapped area of some GFS2 file.  Do that
> while holding GFS2 locks and you are fucked.
> 
> No amount of prefaulting will protect you, BTW - it might make the
> deadlock harder to reproduce, but that's it.

AFAICS, what we have is
	* handle_mm_fault() can hit gfs2_fault(), which grabs per-inode lock
shared
	* handle_mm_fault() for write can hit gfs2_page_mkwrite(), which grabs
per-inode lock exclusive
	* pagefault_disable() prevents that for real page faults, but not for
get_user_pages_fast()
	* normal write:
        with inode_lock(inode)
		in a loop
			with per-inode lock exclusive
				__gfs2_iomap_get
				possibly gfs2_iomap_begin_write
				in a loop
					fault-in [read faults]
					iomap_write_begin
					copy_page_from_iter_atomic() [pf disabled]
					iomap_write_end
				gfs2_iomap_end
	* O_DIRECT write:
	with inode_lock(inode) and per-inode lock deferred (?)
		in a loop
			__gfs2_iomap_get
			possibly gfs2_iomap_begin_write
			bio_iov_iter_get_pages(), map and submit [gup]
			gfs2_iomap_end
	* normal read:
		in a loop
			filemap_get_pages (grab pages and readpage them if needed)
			copy_page_to_iter() for each [write faults]
	* O_DIRECT read:
        with per-inode lock deferred
		in a loop
			__gfs2_iomap_get
			either iov_iter_zero() (on hole) [write faults]
			or bio_iov_iter_get_pages(), map and submit [gup]
			gfs2_iomap_end

... with some amount of waiting on buffered IO in case of O_DIRECT writes

Is the above an accurate description of the mainline situation there?
In particular, normal read doesn't seem to bother with locks at all.
What exactly are those cluster locks for in O_DIRECT read?
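
For concreteness, the O_DIRECT self-deadlock described above should be
reachable from user space with something as small as the following
sketch (hypothetical path and size; any sufficiently large file on a
GFS2 mount would do):

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		/* hypothetical path; file assumed to be >= 1 MiB */
		int fd = open("/mnt/gfs2/testfile", O_RDWR | O_DIRECT);
		size_t len = 1 << 20;
		char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
				 MAP_SHARED, fd, 0);

		/*
		 * The destination of the O_DIRECT read is an unfaulted
		 * mapping of the same file: bio_iov_iter_get_pages() ->
		 * get_user_pages_fast() -> handle_mm_fault() ends up in
		 * gfs2_fault() / gfs2_page_mkwrite() while the read path
		 * already holds the inode's cluster locks.
		 */
		pread(fd, buf, len, 0);
		return 0;
	}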

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-05-31 17:12     ` [Cluster-devel] " Al Viro
@ 2021-06-12 21:12       ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-12 21:12 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, linux-kernel, Jan Kara,
	Matthew Wilcox

On Mon, May 31, 2021 at 05:12:31PM +0000, Al Viro wrote:

> > +int iov_iter_fault_in_writeable(struct iov_iter *i, size_t bytes)
> > +{
> > +	size_t skip = i->iov_offset;
> > +	const struct iovec *iov;
> > +	int err;
> > +	struct iovec v;
> > +
> > +	if (!(i->type & (ITER_BVEC|ITER_KVEC))) {
> > +		iterate_iovec(i, bytes, v, iov, skip, ({
> > +			err = fault_in_pages_writeable(v.iov_base, v.iov_len);
> > +			if (unlikely(err))
> > +				return err;
> > +		0;}))
> > +	}
> > +	return 0;
> > +}
> > +EXPORT_SYMBOL(iov_iter_fault_in_writeable);
> 
> I really don't like that.  Conflicts with iov_iter patches are not hard to
> deal with, but (like fault_in_pages_writeable() itself) it's dangerous as
> hell - fault-in for read is non-destructive, but that is *not*.  Existing
> users have to be careful with it and there are very few of those.  Adding
> that as a new primitive is inviting trouble; at the very least it needs
> a big fat "Don't use unless you really know what you are doing" kind of
> warning.

Actually, is there any good way to make sure that a write fault is triggered
_without_ modification of the data?  On x86 a lock xadd (of 0, that is) would
probably do it, and some of the other architectures could probably get away
with using cmpxchg and its relatives, but how reliable is that wrt always
triggering a write fault if the page is currently read-only?

I mean, something like
	do {
		r0 = r = *p;
		atomically [if (*p == r) *p = r; r = *p;]
	} while (r != r0);
would look like a feasible candidate, but what if the processor
"optimizes" that cmpxchg to a simple load, seeing that the new value is
equal to the expected old one?
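
Spelled out as a kernel-style sketch (whether a matching compare-and-swap
always performs the write-permission check is exactly the open question):

	/*
	 * Probe one word for writability without changing its value.
	 * Relies on cmpxchg() doing the write-access check even when
	 * it ends up storing the value that was already there.
	 */
	static void probe_word_writeable(unsigned long *p)
	{
		unsigned long r0, r;

		do {
			r0 = READ_ONCE(*p);
			/* atomic no-op: if *p == r0, store r0 back */
			r = cmpxchg(p, r0, r0);
		} while (r != r0);	/* lost a race; reload and retry */
	}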

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-06-12 21:12       ` [Cluster-devel] " Al Viro
@ 2021-06-12 21:33         ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2021-06-12 21:33 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Gruenbacher, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Sat, Jun 12, 2021 at 2:12 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> Actually, is there any good way to make sure that write fault is triggered
> _without_ modification of the data?  On x86 lock xadd (of 0, that is) would
> probably do it and some of the other architectures could probably get away
> with using cmpxchg and its relatives, but how reliable it is wrt always
> triggering a write fault if the page is currently read-only?

I wouldn't worry about the CPU optimizing a zero 'add' away (extra
work for no gain in any normal situation).

But any architecture using 'ldl/stc' to do atomics would do it in
software for at least cmpxchg (ie just abort after the "doesn't
match").

Even on x86, it's certainly _possible_ that a non-matching cmpxchg
might not have done the writability check, although I would find that
unlikely (ie I would expect it to do just one TLB lookup, and just one
permission check, whether it then writes or not).

And if the architecture does atomic operations using that ldl/stc
model, I could (again) see the software loop breaking out early
(before the stc) on the grounds of "value didn't change".

Although it's a lot less likely outside of cmpxchg. I suspect an "add
zero" would work just fine even on a ldl/stc model.

That said, reads are obviously much easier, and I'd probably prefer
the model for writes to be to not necessarily pre-fault anything at
all, but just write to user space with page faults disabled.

And then only if that fails do you do anything special. And at that
point, even walking the page tables by hand might be perfectly
acceptable - since we know it's going to fault anyway, and it might
actually be cheaper to just do it by hand with GUP or whatever.
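
In other words, something like this rough, filesystem-agnostic sketch
for the read side (using the existing copy_page_to_iter() and
pagefault_disable() interfaces):

	size_t copied;

	pagefault_disable();
	copied = copy_page_to_iter(page, offset, bytes, iter);
	pagefault_enable();

	if (copied < bytes) {
		/*
		 * The copy faulted with page faults disabled: drop
		 * the filesystem locks here, fault the iter in (by
		 * hand, or with GUP), and retry the loop.
		 */
	}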

          Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-06-12 21:05           ` [Cluster-devel] " Al Viro
@ 2021-06-12 21:35             ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-12 21:35 UTC (permalink / raw)
  To: Andreas Gruenbacher
  Cc: Linus Torvalds, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Sat, Jun 12, 2021 at 09:05:40PM +0000, Al Viro wrote:

> Is the above an accurate description of the mainline situation there?
> In particular, normal read doesn't seem to bother with locks at all.
> What exactly are those cluster locks for in O_DIRECT read?

BTW, assuming a lack of contention, how costly is dropping/regaining
such a cluster lock?

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-06-12 21:33         ` [Cluster-devel] " Linus Torvalds
@ 2021-06-12 21:47           ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-12 21:47 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Sat, Jun 12, 2021 at 02:33:31PM -0700, Linus Torvalds wrote:

> That said, reads are obviously much easier, and I'd probably prefer
> the model for writes to be to not necessarily pre-fault anything at
> all, but just write to user space with page faults disabled.

*nod*
I don't like that write pre-fault model at all - note that unlike read
we'll end up with atomic operations, etc. and there's a plenty of
non-obvious ways for that to end up being costly, even assuming it
works correctly in all cases.

	O_DIRECT case is a PITA - there we use GUP and there's no way
to tell GUP that in the current situation we do *NOT* want to hit
->fault()/->page_mkwrite()/etc.  pagefault_disable() won't be even
noticed there...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-06-12 21:47           ` [Cluster-devel] " Al Viro
@ 2021-06-12 23:17             ` Linus Torvalds
  -1 siblings, 0 replies; 49+ messages in thread
From: Linus Torvalds @ 2021-06-12 23:17 UTC (permalink / raw)
  To: Al Viro
  Cc: Andreas Gruenbacher, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Sat, Jun 12, 2021 at 2:47 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
>
>         O_DIRECT case is a PITA - there we use GUP and there's no way
> to tell GUP that in the current situation we do *NOT* want to hit
> ->fault()/->page_mkwrite()/etc.  pagefault_disable() won't be even
> noticed there...

Well, we could change that.

And we do have get_user_pages_fast_only() these days.
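
E.g., roughly (a sketch; start and npages stand for whatever the dio
path derived from the iov_iter):

	unsigned long start;	/* user address from the iov_iter */
	int npages, nr;		/* pages covered by this segment */
	struct page *pages[16];	/* arbitrary batch size */

	/*
	 * Fast path only: never calls handle_mm_fault(), so it
	 * cannot recurse into ->fault() or ->page_mkwrite().
	 */
	nr = get_user_pages_fast_only(start, npages, FOLL_WRITE, pages);
	if (nr < npages) {
		/* drop the fs locks, fault in the rest, and retry */
	}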

              Linus

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable()
  2021-06-12 23:17             ` [Cluster-devel] " Linus Torvalds
@ 2021-06-12 23:38               ` Al Viro
  -1 siblings, 0 replies; 49+ messages in thread
From: Al Viro @ 2021-06-12 23:38 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: Andreas Gruenbacher, cluster-devel, Linux Kernel Mailing List,
	Jan Kara, Matthew Wilcox

On Sat, Jun 12, 2021 at 04:17:30PM -0700, Linus Torvalds wrote:
> On Sat, Jun 12, 2021 at 2:47 PM Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> >         O_DIRECT case is a PITA - there we use GUP and there's no way
> > to tell GUP that in the current situation we do *NOT* want to hit
> > ->fault()/->page_mkwrite()/etc.  pagefault_disable() won't be even
> > noticed there...
> 
> Well, we could change that.
> 
> And we do have get_user_pages_fast_only() these days.

FWIW, I'm looking through the users of iov_iter_get_pages{,_alloc}()
right now; their semantics feel somewhat wrong.  Will follow up
when I finish looking through that pile...

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
  2021-06-12 21:35             ` [Cluster-devel] " Al Viro
@ 2021-06-13  8:44               ` Steven Whitehouse
  -1 siblings, 0 replies; 49+ messages in thread
From: Steven Whitehouse @ 2021-06-13  8:44 UTC (permalink / raw)
  To: Al Viro, Andreas Gruenbacher
  Cc: cluster-devel, Jan Kara, Linus Torvalds,
	Linux Kernel Mailing List, Matthew Wilcox

Hi,

On Sat, 2021-06-12 at 21:35 +0000, Al Viro wrote:
> On Sat, Jun 12, 2021 at 09:05:40PM +0000, Al Viro wrote:
> 
> > Is the above an accurate description of the mainline situation
> > there?
> > In particular, normal read doesn't seem to bother with locks at
> > all.
> > What exactly are those cluster locks for in O_DIRECT read?
> 
> BTW, assuming the lack of contention, how costly is
> dropping/regaining
> such cluster lock?
> 

The answer is that it depends...

The locking modes for glocks for inodes look like this:

==========      ==========   ==============   ==========   ==============
Glock mode      Cache data   Cache Metadata   Dirty Data   Dirty Metadata
==========      ==========   ==============   ==========   ==============
    UN             No              No             No            No
    SH             Yes             Yes            No            No
    DF             No              Yes            No            No
    EX             Yes             Yes            Yes           Yes
==========      ==========   ==============   ==========   ==============

The above is a copy & paste from Documentation/filesystems/gfs2-glocks.rst.
If you think of these locks as cache control, then it makes a lot more
sense.

The DF (deferred) mode is there only for DIO. It is a shared lock mode
that is incompatible with the normal SH mode. That is because it is ok
to cache data pages under SH but not under DF. That is the only other
difference between the two shared modes. DF is used for both read and
write under DIO, meaning that it is possible for multiple nodes to read
& write the same file at the same time with DIO, leaving any
synchronisation to the application layer. As soon as one performs an
operation which alters the metadata tree (truncate, extend, hole
filling), we drop back to the normal EX mode, so DF is only used
for preallocated files.
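
In code terms, the DIO paths take the glock roughly like this (a
simplified sketch of the pattern in fs/gfs2/file.c, not a verbatim
copy):

	struct gfs2_holder gh;
	int ret;

	/* DF: shared among DIO users, incompatible with SH */
	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
	ret = gfs2_glock_nq(&gh);
	if (ret)
		goto out_uninit;

	/* ... build and submit the direct I/O while DF is held ... */

	gfs2_glock_dq(&gh);
out_uninit:
	gfs2_holder_uninit(&gh);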

Your original question, though, was about the cost of locking, and there
is a wide variation according to circumstances. The glock layer caches
the results of DLM requests and will continue to hold glocks gained
from remote nodes until either memory pressure or a request from
another node to drop the lock is received.

When no other nodes are interested in a lock, all such cluster lock
activity is local. There is a cost to it though, and if (for example)
you tried to take and drop the cluster lock on every page, that would
definitely be noticeable. There are probably optimisations that could
be done on what is quite a complex code path, but in general that's what
we've discovered from testing. The introduction of ->readpages() vs the
old ->readpage() made a measurable difference; likewise on the write
side, iomap has also shown performance increases due to the reduction in
locking on multi-page writes.

If there is another node that has an interest in a lock, then it can
get very expensive in terms of latency to regain the lock. Dropping the
lock to a lower mode may involve I/O (from EX mode) and journal
flush(es), and getting the lock back again involves I/O to other nodes
and then a wait while they finish what they are doing. To avoid
starvation there is a "minimum hold time", so that when a node gains a
glock it is allowed to retain it, in the absence of local requests,
for a short period. The idea is that if a large number of glock
requests are being made on a node, each for a short time, we allow
several of those to complete before we do the expensive glock release
to another node.

See Documentation/filesystems/gfs2-glocks.rst for a longer explanation
and locking order/rules between different lock types,

Steve.



^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, newest: ~2021-06-13  9:02 UTC

Thread overview: 49+ messages
-- index below lists the messages in this thread --
2021-05-31 17:01 [RFC 0/9] gfs2: handle page faults during read and write Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 1/9] gfs2: Clean up the error handling in gfs2_page_mkwrite Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 2/9] gfs2: Add wrapper for iomap_file_buffered_write Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 3/9] gfs2: Add gfs2_holder_is_compatible helper Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1) Andreas Gruenbacher
2021-06-01  6:00   ` Linus Torvalds
2021-06-02 11:16     ` Andreas Gruenbacher
2021-06-11 16:25       ` Al Viro
2021-06-12 21:05         ` Al Viro
2021-06-12 21:35           ` Al Viro
2021-06-13  8:44             ` Steven Whitehouse
2021-05-31 17:01 ` [RFC 5/9] iov_iter: Add iov_iter_fault_in_writeable() Andreas Gruenbacher
2021-05-31 17:12   ` Al Viro
2021-06-12 21:12     ` Al Viro
2021-06-12 21:33       ` Linus Torvalds
2021-06-12 21:47         ` Al Viro
2021-06-12 23:17           ` Linus Torvalds
2021-06-12 23:38             ` Al Viro
2021-05-31 17:01 ` [RFC 6/9] gfs2: Add wrappers for accessing journal_info Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 7/9] gfs2: Encode glock holding and retry flags in journal_info Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 8/9] gfs2: Add LM_FLAG_OUTER glock holder flag Andreas Gruenbacher
2021-05-31 17:01 ` [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2) Andreas Gruenbacher
2021-06-01  5:47   ` Linus Torvalds
2021-05-31 17:57 ` [Cluster-devel] [RFC 0/9] gfs2: handle page faults during read and write Linus Torvalds
2021-05-31 20:35   ` Andreas Gruenbacher
