Message-ID: <3ae2eac460c27a9f58a1b89b49da710c0c9d43ed.camel@redhat.com>
Subject: Re: [Cluster-devel] [RFC 4/9] gfs2: Fix mmap + page fault deadlocks (part 1)
From: Steven Whitehouse
To: Al Viro, Andreas Gruenbacher
Cc: cluster-devel, Jan Kara, Linus Torvalds, Linux Kernel Mailing List, Matthew Wilcox
Date: Sun, 13 Jun 2021 09:44:31 +0100
In-Reply-To:
References: <20210531170123.243771-1-agruenba@redhat.com> <20210531170123.243771-5-agruenba@redhat.com>
Organization: Red Hat UK Ltd

Hi,

On Sat, 2021-06-12 at 21:35 +0000, Al Viro wrote:
> On Sat, Jun 12, 2021 at 09:05:40PM +0000, Al Viro wrote:
>
> > Is the above an accurate description of the mainline situation
> > there?
> > In particular, normal read doesn't seem to bother with locks at
> > all.
> > What exactly are those cluster locks for in O_DIRECT read?
>
> BTW, assuming the lack of contention, how costly is
> dropping/regaining such cluster lock?
>
The answer is that it depends... The locking modes for glocks for
inodes look like this:

========== ========== ============== ========== ==============
Glock mode Cache data Cache Metadata Dirty Data Dirty Metadata
========== ========== ============== ========== ==============
UN         No         No             No         No
SH         Yes        Yes            No         No
DF         No         Yes            No         No
EX         Yes        Yes            Yes        Yes
========== ========== ============== ========== ==============

The above is a copy & paste from
Documentation/filesystems/gfs2-glocks.rst. If you think of these locks
as cache control, then it makes a lot more sense.

The DF (deferred) mode is there only for DIO. It is a shared lock mode
that is incompatible with the normal SH mode. That is because it is ok
to cache data pages under SH but not under DF. That is the only other
difference between the two shared modes. DF is used for both read and
write under DIO, meaning that it is possible for multiple nodes to
read & write the same file at the same time with DIO, leaving any
synchronisation to the application layer. As soon as one performs an
operation which alters the metadata tree (truncate, extend, hole
filling) we drop back to the normal EX mode, so DF is only used for
preallocated files.
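To make the DF usage concrete, the DIO paths bracket each transfer
with a deferred glock. This is only a minimal sketch in the style of
the in-tree code, not a verbatim copy of fs/gfs2/file.c:
do_dio_transfer() is a hypothetical stand-in for the actual
iomap-based transfer, and the error handling is simplified.

static ssize_t gfs2_dio_read_sketch(struct kiocb *iocb, struct iov_iter *to)
{
	struct gfs2_inode *ip = GFS2_I(file_inode(iocb->ki_filp));
	struct gfs2_holder gh;
	ssize_t ret;

	/* DF (LM_ST_DEFERRED) excludes SH/EX holders on other nodes, so
	 * nobody may have data pages cached, but it is compatible with
	 * other DF holders, i.e. concurrent DIO elsewhere in the
	 * cluster. */
	gfs2_holder_init(ip->i_gl, LM_ST_DEFERRED, 0, &gh);
	ret = gfs2_glock_nq(&gh);
	if (ret)
		goto out;

	ret = do_dio_transfer(iocb, to); /* hypothetical: the iomap DIO call */

	gfs2_glock_dq(&gh);
out:
	gfs2_holder_uninit(&gh);
	return ret;
}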
Your original question, though, was about the cost of locking, and
there is a wide variation according to circumstances. The glock layer
caches the results of DLM requests and will continue to hold glocks
gained from remote nodes until either memory pressure or a request to
drop the lock from another node is received.

When no other nodes are interested in a lock, all such cluster lock
activity is local. There is a cost to it though, and if (for example)
you tried to take and drop the cluster lock on every page, that would
definitely be noticeable. There are probably optimisations that could
be done on what is quite a complex code path, but in general that's
what we've discovered from testing. The introduction of ->readpages()
vs the old ->readpage() made a measurable difference, and likewise on
the write side, iomap has also shown performance increases due to the
reduction in locking on multi-page writes.

If there is another node with an interest in a lock, then it can get
very expensive in terms of latency to regain that lock. Dropping the
lock to a lower mode may involve I/O (from EX mode) and journal
flush(es), and getting the lock back again involves I/O to other nodes
and then a wait while they finish what they are doing. To avoid
starvation there is a "minimum hold time", so that when a node gains a
glock it is allowed to retain it, in the absence of local requests,
for a short period. The idea is that if a large number of glock
requests are being made on a node, each for a short time, we allow
several of those to complete before we do the expensive glock release
to another node.
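As a rough sketch of how the minimum hold time fits in (the real logic
lives in the glock workqueue code in fs/gfs2/glock.c; demote_now()
here is a hypothetical helper standing in for the actual state
machine):

static void demote_sketch(struct gfs2_glock *gl)
{
	/* gl_tchange records when the glock last changed state */
	unsigned long holdtime = gl->gl_tchange + gl->gl_hold_time;

	if (time_before(jiffies, holdtime)) {
		/* Gained too recently: give local requests a chance to
		 * complete before paying for handing the lock away. */
		schedule_delayed_work(&gl->gl_work, holdtime - jiffies);
		return;
	}
	demote_now(gl); /* hypothetical: write back, flush and drop to
			 * the remotely requested mode */
}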
See Documentation/filesystems/gfs2-glocks.rst for a longer explanation
and the locking order/rules between different lock types,

Steve.