[Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions

cluster-devel.redhat.com archive mirror
 help / color / mirror / Atom feed

From: Alexander Aring <aahringo@redhat.com>
To: cluster-devel.redhat.com
Subject: [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions
Date: Sat, 3 Jun 2023 08:02:31 -0400	[thread overview]
Message-ID: <CAK-6q+j_RZ+uySaVEmSfSHTDGRfM0LrbkxMvSuDfK2kjuYNDPw@mail.gmail.com> (raw)
In-Reply-To: <CAHc6FU4HU6oiYHBAYndp6wb0-LmkSRiAccjQE6t1Gf1PCnhFQQ@mail.gmail.com>

Hi,

On Sat, Jun 3, 2023 at 6:03?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
>
> On Thu, Jun 1, 2023 at 9:10?PM Alexander Aring <aahringo@redhat.com> wrote:
> > Hi,
> >
> > On Thu, Jun 1, 2023 at 1:11?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > >
> > > On Thu, Jun 1, 2023 at 6:28?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > Hi,
> > > >
> > > > On Tue, May 30, 2023 at 1:40?PM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > >
> > > > > On Tue, May 30, 2023 at 4:08?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, May 30, 2023 at 7:01?AM Andreas Gruenbacher <agruenba@redhat.com> wrote:
> > > > > > >
> > > > > > > On Tue, May 30, 2023 at 12:19?AM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > Hi,
> > > > > > > >
> > > > > > > > On Thu, May 25, 2023 at 11:02?AM Andreas Gruenbacher
> > > > > > > > <agruenba@redhat.com> wrote:
> > > > > > > > >
> > > > > > > > > On Wed, May 24, 2023 at 6:02?PM Alexander Aring <aahringo@redhat.com> wrote:
> > > > > > > > > > This patch fixes a possible plock op collisions when using F_SETLKW lock
> > > > > > > > > > requests and fsid, number and owner are not enough to identify a result
> > > > > > > > > > for a pending request. The ltp testcases [0] and [1] are examples when
> > > > > > > > > > this is not enough in case of using classic posix locks with threads and
> > > > > > > > > > open filedescriptor posix locks.
> > > > > > > > > >
> > > > > > > > > > The idea to fix the issue here is to place all lock request in order. In
> > > > > > > > > > case of non F_SETLKW lock request (indicated if wait is set or not) the
> > > > > > > > > > lock requests are ordered inside the recv_list. If a result comes back
> > > > > > > > > > the right plock op can be found by the first plock_op in recv_list which
> > > > > > > > > > has not info.wait set. This can be done only by non F_SETLKW plock ops as
> > > > > > > > > > dlm_controld always reads a specific plock op (list_move_tail() from
> > > > > > > > > > send_list to recv_mlist) and write the result immediately back.
> > > > > > > > > >
> > > > > > > > > > This behaviour is for F_SETLKW not possible as multiple waiters can be
> > > > > > > > > > get a result back in an random order. To avoid a collisions in cases
> > > > > > > > > > like [0] or [1] this patch adds more fields to compare the plock
> > > > > > > > > > operations as the lock request is the same. This is also being made in
> > > > > > > > > > NFS to find an result for an asynchronous F_SETLKW lock request [2][3]. We
> > > > > > > > > > still can't find the exact lock request for a specific result if the
> > > > > > > > > > lock request is the same, but if this is the case we don't care the
> > > > > > > > > > order how the identical lock requests get their result back to grant the
> > > > > > > > > > lock.
> > > > > > > > >
> > > > > > > > > When the recv_list contains multiple indistinguishable requests, this
> > > > > > > > > can only be because they originated from multiple threads of the same
> > > > > > > > > process. In that case, I agree that it doesn't matter which of those
> > > > > > > > > requests we "complete" in dev_write() as long as we only complete one
> > > > > > > > > request. We do need to compare the additional request fields in
> > > > > > > > > dev_write() to find a suitable request, so that makes sense as well.
> > > > > > > > > We need to compare all of the fields that identify a request (optype,
> > > > > > > > > ex, wait, pid, nodeid, fsid, number, start, end, owner) to find the
> > > > > > > > > "right" request (or in case there is more than one identical request,
> > > > > > > > > a "suitable" request).
> > > > > > > > >
> > > > > > > >
> > > > > > > > In my "definition" why this works is as you said the "identical
> > > > > > > > request". There is a more deeper definition of "when is a request
> > > > > > > > identical" and in my opinion it is here as: "A request A is identical
> > > > > > > > to request B when they get granted under the same 'time'" which is all
> > > > > > > > the fields you mentioned.
> > > > > > > >
> > > > > > > > Even with cancellation (F_SETLKW only) it does not matter which
> > > > > > > > "identical request" you cancel because the kernel and user
> > > > > > > > (dlm_controld) makes no relation between a lock request instance. You
> > > > > > > > need to have at least the same amount of "results" coming back from
> > > > > > > > user space as the amount you are waiting for a result for the same
> > > > > > > > "identical request".
> > > > > > >
> > > > > > > That's not incorrect per se, but cancellations create an additional
> > > > > > > difficulty: they can either succeed or fail. To indicate that a
> > > > > > > cancellation has succeeded, a new type of message can be introduced
> > > > > > > (say, "CANCELLED"), and it's obvious that a CANCELLED message can only
> > > > > > > belong to a locking request that is being cancelled. When cancelling a
> > > > > > > locking request fails, the kernel will see a "locking request granted"
> > > > > > > message though, and when multiple identical locking requests are
> > > > > > > queued and only some of them have been cancelled, it won't be obvious
> > > > > > > which locking request a "locking request granted" message should be
> > > > > > > assigned to anymore. You really don't want to mix things up in that
> > > > > > > case.
> > > > > > >
> > > > > > > This complication makes it a whole lot more difficult to reason about
> > > > > > > the correctness of the code. All that complexity is avoidable by
> > > > > > > sticking with a fixed mapping of requests and replies (i.e., a unique
> > > > > > > request identifier).
> > > > > > >
> > > > > > > To put it differently, you can shoot yourself in the foot and still
> > > > > > > hop along on the other leg, but it may not be the best of all possible
> > > > > > > ideas.
> > > > > > >
> > > > > >
> > > > > > It makes things more complicated, I agree and the reason why this
> > > > > > works now is because there are a lot of "dependencies". I would love
> > > > > > to have an unique identifier to make it possible that we can follow an
> > > > > > instance handle of the original lock request.
> > > > > >
> > > > > > * an unique identifier which also works with the async lock request of
> > > > > > lockd case.
> > > > >
> > > > > What's the lockd case you're referring to here, and why is it relevant
> > > > > for the problem at hand?
> > > >
> > > > just mentioned that we need a solution which also works for the
> > > > asynchronous lock request (F_SETLK, F_SETLKW) case, there is only one
> > > > user lockd. [0] DLM plock handling implements the behaviour mentioned
> > > > at [0] but lm_grant() callback can also return negative values and
> > > > signals that the lock request was cancelled (on nfs layer) and then
> > > > need to tell it DLM. This however is not supported as we have a lack
> > > > of cancellation.
> > >
> > > Ouch, that's a bit messy. But if the vfs_lock_file() description is
> > > accurate, then only F_SETLK requests arriving via lockd can be
> > > asynchronous, and F_SETLKW requests never are asynchronous. And we
> > > only need to cancel F_SETLKW requests. It follows that we only ever
> > > need to cancel synchronous requests.
> > >
> >
> > I looked into the code and I think the second sentence of [0] is
> > important regarding to F_SLEEP "Callers expecting ->lock() to return
> > asynchronously will only use F_SETLK, not F_SETLKW; they will set
> > FL_SLEEP if (and only if) the request is for a blocking lock.".
> > If lockd does a lock request, it will then set args.wait to 1 if it
> > was F_SETLKW [1].
>
> Hmm, this sets args.block?
>

You are right but args.block gets passed as the wait parameter to nlmsvc_lock().

An example user:

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/lockd/svc4proc.c#n158

> > The receiver will then create a blocking request [2]
> > and set FL_SLEEP [3]; it does unset FL_SLEEP when args.wait (now as
> > wait parameter) wasn't set. There is a case of [5] which unset wait
> > again, but it seems we can end with FL_SLEEP there anyway.
> >
> > I think we have currently an issue here that we handle F_SETLK when
> > F_SLEEP (in case of async lockd request) is handled as trylock which
> > isn't right. Don't ask me why they are going over F_SLEEP and F_SETLK
> > and not just using F_SETLKW to signal that it is a blocking request.
> > So we actually have the F_SETLKW case but it's just signaled with
> > F_SETLK + FL_SLEEP?
>
> It seems to me that the vfs_lock_file() description and the
> distinction it makes between F_SETLK and F_SETLKW makes sense in a
> scenario when locking is implemented locally in the kernel, but not in
> the context of dlm_controld, where all the locking decisions are made
> in user-space: in the former scenario, F_SETLK requests can be decided
> immediately without sleeping, but F_SETLKW requests may sleep. In the
> dlm_controld scenario, locking requests can never be decided
> immediately, and the requesting task will always sleep. So from the
> point of view of lockd, any request to dlm_controld should probably be
> treated as asynchronous.
>
> This makes things a bit more ugly than I was hoping for.
>

I think they describe a "blocking" request as what you describe with
F_SETLKW. Everything is asynchronous, but F_SETLKW is the "blocking
request" by not actually meaning the lock() callback blocks in the
sense of wait_event(). There is however a semantic difference between
F_SETLK and F_SETLKW in sense when the asynchronous result comes back
and what the result is.

I looked more into this and a "somewhat" recent commit confuses me
more. It's 40595cdc93ed ("nfs: block notification on fs with its own
->lock") [0] and from what I understood, they signaled before F_SETLKW
with FL_SLEEP set, but they realized that mostly everybody does not
follow this API, so they switch to only F_SETLK and assume that the
most nfs clients do polling for a look. If you can't use F_SETLKW,
then the alternative could be polling the lock with F_SETLK, which
isn't a good design to use on DLM. In the commit message it's also
mentioned that they may bring back the old behaviour.

I think the comment in [1] is not up to date anymore.

There are also only two users of "fl_lmops" evaluation, which is fcntl
core and DLM, fuse will reject any lock request when "fl_lmops" is
set.

To be honest I have no idea how to proceed here, probably just look in
fcntl core and do what they do. I also tested dlm/next with a nfsv3
server on gfs2 by running "stress-ng --fcntl 16", I see some worried
error messages on the kernel log coming up.

- Alex

[0] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/fs/lockd/svclock.c?h=v6.4-rc4&id=40595cdc93edf4110c0f0c0b06f8d82008f23929
[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/fs/locks.c?h=v6.4-rc4#n2255

     prev parent reply	other threads:[~2023-06-03 12:02 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-05-24 16:02 [Cluster-devel] [PATCHv2 dlm/next] fs: dlm: avoid F_SETLKW plock op lookup collisions Alexander Aring
2023-05-25 15:02 ` Andreas Gruenbacher
2023-05-29 22:18   ` Alexander Aring
2023-05-29 22:20     ` Alexander Aring
2023-05-30 11:01     ` Andreas Gruenbacher
2023-05-30 14:06       ` Alexander Aring
2023-05-30 17:40         ` Andreas Gruenbacher
2023-06-01 16:28           ` Alexander Aring
2023-06-01 17:10             ` Andreas Gruenbacher
2023-06-01 19:09               ` Alexander Aring
2023-06-03 10:02                 ` Andreas Gruenbacher
2023-06-03 12:02                   ` Alexander Aring [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAK-6q+j_RZ+uySaVEmSfSHTDGRfM0LrbkxMvSuDfK2kjuYNDPw@mail.gmail.com \
    --to=aahringo@redhat.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).