From: Andreas Gruenbacher <agruenba@redhat.com>
To: Linus Torvalds
Cc: Andreas Gruenbacher, cluster-devel@redhat.com,
    linux-kernel@vger.kernel.org, Alexander Viro, Jan Kara, Matthew Wilcox
Subject: [RFC 9/9] gfs2: Fix mmap + page fault deadlocks (part 2)
Date: Mon, 31 May 2021 19:01:23 +0200
Message-Id: <20210531170123.243771-10-agruenba@redhat.com>
In-Reply-To: <20210531170123.243771-1-agruenba@redhat.com>
References: <20210531170123.243771-1-agruenba@redhat.com>

Now that we handle self-recursion on the inode glock in gfs2_fault and
gfs2_page_mkwrite, we need to take care of more complex deadlock
scenarios like the following (example by Jan Kara):

Two independent processes P1, P2.  Two files F1, F2, and two mappings
M1, M2 where M1 is a mapping of F1, M2 is a mapping of F2.  Now P1 does
DIO to F1 with M2 as a buffer, P2 does DIO to F2 with M1 as a buffer.
They can race like:

P1                                      P2
read()                                  read()
  gfs2_file_read_iter()                   gfs2_file_read_iter()
    gfs2_file_direct_read()                 gfs2_file_direct_read()
      locks glock of F1                       locks glock of F2
      iomap_dio_rw()                          iomap_dio_rw()
        bio_iov_iter_get_pages()                bio_iov_iter_get_pages()
          gfs2_fault()                            gfs2_fault()
            tries to grab glock of F2               tries to grab glock of F1

Such scenarios are much harder to reproduce than self-recursion.

We deal with such situations by using the LM_FLAG_OUTER flag to mark
"outer" glock taking.  Then, when taking an "inner" glock, we use the
LM_FLAG_TRY flag so that locking attempts that don't succeed immediately
will be aborted.  In case of a failed locking attempt, we "unwind" to
where the "outer" glock was taken, drop the "outer" glock, and fault in
the first offending user page.  This will re-trigger the "inner" locking
attempt but without the "outer" glock being held and without the
LM_FLAG_TRY flag.  Once that's done, we re-acquire the "outer" glock and
retry the original operation.

Reported-by: Jan Kara
Signed-off-by: Andreas Gruenbacher <agruenba@redhat.com>
---
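The per-task helpers used in this patch (current_holds_glock(),
current_needs_retry() and set_current_needs_retry()) come from an
earlier patch in this series and are not part of this diff.  As a rough
sketch of the idea only, assuming two hypothetical per-task flags (the
field names below are made up; the real implementation may differ):

#include <linux/sched.h>	/* for current */

/* Sketch only: hypothetical task_struct fields, not the actual helpers. */
static inline bool current_holds_glock(void)
{
	/* True while this task holds a glock taken with LM_FLAG_OUTER. */
	return current->gfs2_holds_outer_glock;
}

static inline bool current_needs_retry(void)
{
	/* Set when an inner LM_FLAG_TRY attempt fails under an outer glock. */
	return current->gfs2_needs_retry;
}

static inline void set_current_needs_retry(bool needs_retry)
{
	current->gfs2_needs_retry = needs_retry;
}

Per-task state is used here because gfs2_fault() and gfs2_page_mkwrite()
are reached through the page-fault path and have no direct way of
receiving context from the outer read or write caller.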
 fs/gfs2/bmap.c |  3 ++-
 fs/gfs2/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++------
 2 files changed, 54 insertions(+), 7 deletions(-)

diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c
index 2ff501c413f4..82e4506984e3 100644
--- a/fs/gfs2/bmap.c
+++ b/fs/gfs2/bmap.c
@@ -967,7 +967,8 @@ static int gfs2_write_lock(struct inode *inode)
 	struct gfs2_sbd *sdp = GFS2_SB(inode);
 	int error;
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_OUTER,
+			 &ip->i_gh);
 	error = gfs2_glock_nq(&ip->i_gh);
 	if (error)
 		goto out_uninit;
diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c
index 7d88abb4629b..3107d49a379b 100644
--- a/fs/gfs2/file.c
+++ b/fs/gfs2/file.c
@@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf)
 	vm_fault_t ret = VM_FAULT_LOCKED;
 	struct gfs2_holder gh;
 	unsigned int length;
+	u16 flags = 0;
 	loff_t size;
 	int err;
 
 	sb_start_pagefault(inode->i_sb);
 
-	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh);
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
+	gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -568,20 +577,28 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf)
 	struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl);
 	struct gfs2_holder gh;
 	vm_fault_t ret;
-	u16 state;
+	u16 state, flags = 0;
 	int err;
 
+	if (current_holds_glock())
+		flags |= LM_FLAG_TRY;
+
 	state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED;
-	gfs2_holder_init(ip->i_gl, state, 0, &gh);
+	gfs2_holder_init(ip->i_gl, state, flags, &gh);
 	if (likely(!outer_gh)) {
 		err = gfs2_glock_nq(&gh);
 		if (err) {
 			ret = block_page_mkwrite_return(err);
+			if (err == GLR_TRYFAILED) {
+				set_current_needs_retry(true);
+				ret = VM_FAULT_SIGBUS;
+			}
 			goto out_uninit;
 		}
 	} else {
 		if (!gfs2_holder_is_compatible(outer_gh, state)) {
 			/* We could try to upgrade outer_gh here. */
+			set_current_needs_retry(true);
 			ret = VM_FAULT_SIGBUS;
 			goto out_uninit;
 		}
@@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to,
 	if (!count)
 		return 0; /* skip atime */
 
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
 
 	ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0);
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -837,7 +861,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 	 * unfortunately, have the option of only flushing a range like the
 	 * VFS does.
 	 */
-	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh);
+retry:
 	ret = gfs2_glock_nq(gh);
 	if (ret)
 		goto out_uninit;
@@ -851,6 +876,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from,
 		ret = 0;
 out:
 	gfs2_glock_dq(gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_readable(from, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(gh);
 	return ret;
@@ -883,7 +914,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			return ret;
 	}
 	ip = GFS2_I(iocb->ki_filp->f_mapping->host);
-	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh);
+	gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_OUTER, &gh);
+retry:
 	ret = gfs2_glock_nq(&gh);
 	if (ret)
 		goto out_uninit;
@@ -891,6 +923,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 	if (ret > 0)
 		written += ret;
 	gfs2_glock_dq(&gh);
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_writeable(to, PAGE_SIZE))
+			goto retry;
+	}
 out_uninit:
 	gfs2_holder_uninit(&gh);
 	return written ? written : ret;
@@ -902,9 +940,17 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro
 	struct inode *inode = file_inode(file);
 	ssize_t ret;
 
+retry:
 	current->backing_dev_info = inode_to_bdi(inode);
 	ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops);
 	current->backing_dev_info = NULL;
+	if (unlikely(current_needs_retry())) {
+		set_current_needs_retry(false);
+		if (ret == -EFAULT &&
+		    !iov_iter_fault_in_readable(from, PAGE_SIZE))
+			goto retry;
+	}
+
 	return ret;
 }

-- 
2.26.3
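As an illustration of the cross-mapping scenario from the commit
message, the deadlock-prone pattern can be approximated from user space
roughly as follows.  This is a hypothetical sketch for illustration
only: run two instances on a gfs2 mount with the file arguments swapped
and both files at least 1 MiB in size; in practice the read would have
to be repeated in a loop to actually hit the race window.

/* Hypothetical sketch: direct I/O read into a mapping of another file. */
#define _GNU_SOURCE		/* for O_DIRECT */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	size_t len = 1UL << 20;
	int dio_fd, map_fd;
	void *buf;

	if (argc != 3) {
		fprintf(stderr, "usage: %s <dio-file> <mmap-file>\n", argv[0]);
		return 1;
	}

	dio_fd = open(argv[1], O_RDONLY | O_DIRECT);
	map_fd = open(argv[2], O_RDWR);
	if (dio_fd < 0 || map_fd < 0) {
		perror("open");
		return 1;
	}

	/* Map the second file but do not touch it, so its pages are not
	 * faulted in before the direct I/O read below. */
	buf = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, map_fd, 0);
	if (buf == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Direct I/O read into the other file's mapping: the glock of
	 * <dio-file> is held while the fault on <mmap-file> is handled. */
	if (read(dio_fd, buf, len) < 0)
		perror("read");

	return 0;
}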
They can race like: P1 P2 read() read() gfs2_file_read_iter() gfs2_file_read_iter() gfs2_file_direct_read() gfs2_file_direct_read() locks glock of F1 locks glock of F2 iomap_dio_rw() iomap_dio_rw() bio_iov_iter_get_pages() bio_iov_iter_get_pages() gfs2_fault() gfs2_fault() tries to grab glock of F2 tries to grab glock of F1 Those of scenarios are much harder to reproduce than self-recursion. We deal with such situations by using the LM_FLAG_OUTER flag to mark "outer" glock taking. Then, when taking an "inner" glock, we use the LM_FLAG_TRY flag so that locking attempts that don't succeed immediately will be aborted. In case of a failed locking attempt, we "unwind" to where the "outer" glock was taken, drop the "outer" glock, and fault in the first offending user page. This will re-trigger the "inner" locking attempt but without the "outer" glock being held and without the LM_FLAG_TRY flag. Once that's done, we re-acquire the "outer" glock and retry the original operation. Reported-by: Jan Kara Signed-off-by: Andreas Gruenbacher --- fs/gfs2/bmap.c | 3 ++- fs/gfs2/file.c | 58 ++++++++++++++++++++++++++++++++++++++++++++------ 2 files changed, 54 insertions(+), 7 deletions(-) diff --git a/fs/gfs2/bmap.c b/fs/gfs2/bmap.c index 2ff501c413f4..82e4506984e3 100644 --- a/fs/gfs2/bmap.c +++ b/fs/gfs2/bmap.c @@ -967,7 +967,8 @@ static int gfs2_write_lock(struct inode *inode) struct gfs2_sbd *sdp = GFS2_SB(inode); int error; - gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &ip->i_gh); + gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, LM_FLAG_OUTER, + &ip->i_gh); error = gfs2_glock_nq(&ip->i_gh); if (error) goto out_uninit; diff --git a/fs/gfs2/file.c b/fs/gfs2/file.c index 7d88abb4629b..3107d49a379b 100644 --- a/fs/gfs2/file.c +++ b/fs/gfs2/file.c @@ -431,21 +431,30 @@ static vm_fault_t gfs2_page_mkwrite(struct vm_fault *vmf) vm_fault_t ret = VM_FAULT_LOCKED; struct gfs2_holder gh; unsigned int length; + u16 flags = 0; loff_t size; int err; sb_start_pagefault(inode->i_sb); - gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, 0, &gh); + if (current_holds_glock()) + flags |= LM_FLAG_TRY; + + gfs2_holder_init(ip->i_gl, LM_ST_EXCLUSIVE, flags, &gh); if (likely(!outer_gh)) { err = gfs2_glock_nq(&gh); if (err) { ret = block_page_mkwrite_return(err); + if (err == GLR_TRYFAILED) { + set_current_needs_retry(true); + ret = VM_FAULT_SIGBUS; + } goto out_uninit; } } else { if (!gfs2_holder_is_compatible(outer_gh, LM_ST_EXCLUSIVE)) { /* We could try to upgrade outer_gh here. */ + set_current_needs_retry(true); ret = VM_FAULT_SIGBUS; goto out_uninit; } @@ -568,20 +577,28 @@ static vm_fault_t gfs2_fault(struct vm_fault *vmf) struct gfs2_holder *outer_gh = gfs2_glock_is_locked_by_me(ip->i_gl); struct gfs2_holder gh; vm_fault_t ret; - u16 state; + u16 state, flags = 0; int err; + if (current_holds_glock()) + flags |= LM_FLAG_TRY; + state = (vmf->flags & FAULT_FLAG_WRITE) ? LM_ST_EXCLUSIVE : LM_ST_SHARED; - gfs2_holder_init(ip->i_gl, state, 0, &gh); + gfs2_holder_init(ip->i_gl, state, flags, &gh); if (likely(!outer_gh)) { err = gfs2_glock_nq(&gh); if (err) { ret = block_page_mkwrite_return(err); + if (err == GLR_TRYFAILED) { + set_current_needs_retry(true); + ret = VM_FAULT_SIGBUS; + } goto out_uninit; } } else { if (!gfs2_holder_is_compatible(outer_gh, state)) { /* We could try to upgrade outer_gh here. 
*/ + set_current_needs_retry(true); ret = VM_FAULT_SIGBUS; goto out_uninit; } @@ -807,13 +824,20 @@ static ssize_t gfs2_file_direct_read(struct kiocb *iocb, struct iov_iter *to, if (!count) return 0; /* skip atime */ - gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); + gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh); +retry: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; ret = iomap_dio_rw(iocb, to, &gfs2_iomap_ops, NULL, 0); gfs2_glock_dq(gh); + if (unlikely(current_needs_retry())) { + set_current_needs_retry(false); + if (ret == -EFAULT && + !iov_iter_fault_in_writeable(to, PAGE_SIZE)) + goto retry; + } out_uninit: gfs2_holder_uninit(gh); return ret; @@ -837,7 +861,8 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, * unfortunately, have the option of only flushing a range like the * VFS does. */ - gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, gh); + gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, LM_FLAG_OUTER, gh); +retry: ret = gfs2_glock_nq(gh); if (ret) goto out_uninit; @@ -851,6 +876,12 @@ static ssize_t gfs2_file_direct_write(struct kiocb *iocb, struct iov_iter *from, ret = 0; out: gfs2_glock_dq(gh); + if (unlikely(current_needs_retry())) { + set_current_needs_retry(false); + if (ret == -EFAULT && + !iov_iter_fault_in_readable(from, PAGE_SIZE)) + goto retry; + } out_uninit: gfs2_holder_uninit(gh); return ret; @@ -883,7 +914,8 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) return ret; } ip = GFS2_I(iocb->ki_filp->f_mapping->host); - gfs2_holder_init(ip->i_gl, LM_ST_SHARED, 0, &gh); + gfs2_holder_init(ip->i_gl, LM_ST_SHARED, LM_FLAG_OUTER, &gh); +retry: ret = gfs2_glock_nq(&gh); if (ret) goto out_uninit; @@ -891,6 +923,12 @@ static ssize_t gfs2_file_read_iter(struct kiocb *iocb, struct iov_iter *to) if (ret > 0) written += ret; gfs2_glock_dq(&gh); + if (unlikely(current_needs_retry())) { + set_current_needs_retry(false); + if (ret == -EFAULT && + !iov_iter_fault_in_writeable(to, PAGE_SIZE)) + goto retry; + } out_uninit: gfs2_holder_uninit(&gh); return written ? written : ret; @@ -902,9 +940,17 @@ static ssize_t gfs2_file_buffered_write(struct kiocb *iocb, struct iov_iter *fro struct inode *inode = file_inode(file); ssize_t ret; +retry: current->backing_dev_info = inode_to_bdi(inode); ret = iomap_file_buffered_write(iocb, from, &gfs2_iomap_ops); current->backing_dev_info = NULL; + if (unlikely(current_needs_retry())) { + set_current_needs_retry(false); + if (ret == -EFAULT && + !iov_iter_fault_in_readable(from, PAGE_SIZE)) + goto retry; + } + return ret; } -- 2.26.3