From: Marek Olšák <maraeo@gmail.com>
Date: Thu, 3 Jun 2021 06:55:02 -0400
Subject: Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
To: Daniel Vetter <daniel@ffwll.ch>
Cc: Christian König, Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev

On Thu., Jun. 3, 2021, 06:03 Daniel Vetter <daniel@ffwll.ch> wrote:
> On Thu, Jun 03, 2021 at 04:20:18AM -0400, Marek Olšák wrote:
> > On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > > > > > Yes, we can't break anything because we don't want to complicate things for us. It's pretty much all NAK'd already. We are trying to gather more knowledge and then make better decisions.
> > > > > > >
> > > > > > > The idea we are considering is that we'll expose memory-based sync objects to userspace for read only, and the kernel or hw will strictly control the memory writes to those sync objects. The hole in that idea is that userspace can decide not to signal a job, so even if userspace can't overwrite memory-based sync object states arbitrarily, it can still decide not to signal them, and then a future fence is born.
> > > > > >
> > > > > > This would actually be treated as a GPU hang caused by that context, so it should be fine.
> > > > >
> > > > > This is practically what I proposed already, except you're not doing it with dma_fence. And on the memory fence side this also doesn't actually give what you want for that compute model.
> > > > >
> > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work in the kernel to hide these not-dma_fence-but-almost, and still pain to actually drive the hardware like it should be for compute or direct display.
> > > > >
> > > > > Also maybe I've missed it, but I didn't see any replies to my suggestion how to fake the entire dma_fence stuff on top of new hw. Would be interesting to know what doesn't work there, instead of amd folks going off into internal again and then coming back with another rfc from out of nowhere :-)
> > > >
> > > > Going internal again is probably a good idea to spare you the long discussions and not waste your time, but we haven't talked about the dma_fence stuff internally other than acknowledging that it can be solved.
> > > >
> > > > The compute use case already uses the hw as-is with no inter-process sharing, which mostly keeps the kernel out of the picture. It uses glFinish to sync with GL.
> > > >
> > > > The gfx use case needs new hardware logic to support implicit and explicit sync. When we propose a solution, it's usually torn apart the next day by ourselves.
> > > >
> > > > Since we are talking about next hw or next next hw, preemption should be better.
> > > >
> > > > user queue = user-mapped ring buffer
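
(As a rough illustration of the term, a user-mapped ring buffer can be modeled like the struct below; the names are invented for this sketch and are not any driver's actual ABI.)

/* Minimal model of a user queue: a ring of command dwords mapped into
 * the process, plus a doorbell the process writes to tell the hw that
 * new work is available.  Invented names, illustration only. */
#include <stdint.h>

struct user_queue {
	uint32_t *ring;              /* command buffer, mapped to userspace */
	uint32_t  ring_dwords;       /* size of the ring in dwords */
	uint64_t  wptr;              /* write pointer advanced by userspace */
	volatile uint64_t *doorbell; /* hw register/page that kicks the queue */
};

/* Userspace submits by copying packets into the ring and ringing the
 * doorbell - no kernel call on the fast path. */
static void user_queue_submit(struct user_queue *q, const uint32_t *pkt,
			      uint32_t ndw)
{
	for (uint32_t i = 0; i < ndw; i++)
		q->ring[(q->wptr + i) % q->ring_dwords] = pkt[i];
	q->wptr += ndw;
	*q->doorbell = q->wptr;
}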
> > > >
> > > > For implicit sync, we will only let userspace lock access to a buffer via a user queue, which waits for the per-buffer sequence counter in memory to be >= the number assigned by the kernel, and later unlock the access with another command, which increments the per-buffer sequence counter in memory with atomic_inc regardless of the number assigned by the kernel. The kernel counter and the counter in memory can be out of sync, and I'll explain why that's OK. If a process increments the kernel counter but not the memory counter, that's its problem, and it's the same as a GPU hang caused by that process. If a process increments the memory counter but not the kernel counter, the ">=" condition alongside atomic_inc guarantees that signaling n will signal n+1, so it will never deadlock, but it will also effectively disable synchronization. This method of disabling synchronization is similar to a process corrupting the buffer, which should be fine. Can you find any flaw in it? I can't find any.
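
(To make the lock/unlock protocol above concrete, here is a small host-side model; it is purely illustrative - a real implementation would emit hw wait and atomic-inc packets into the user queue rather than run C code on the CPU, and the memory counter would be mapped read-only to userspace.)

#include <stdint.h>
#include <stdio.h>

/* Per-buffer implicit-sync state: one counter lives in memory, the
 * other is kernel-side bookkeeping.  Illustration only. */
struct buffer_sync {
	uint64_t mem_seq;    /* written only by the unlock command / hw */
	uint64_t kernel_seq; /* last value handed out by the kernel */
};

/* Kernel: assign the value the next user of the buffer must wait for. */
static uint64_t kernel_assign(struct buffer_sync *b)
{
	return b->kernel_seq++;
}

/* "Lock": the wait command blocks until mem_seq >= the assigned value. */
static int wait_satisfied(const struct buffer_sync *b, uint64_t wait_val)
{
	return b->mem_seq >= wait_val;
}

/* "Unlock": unconditional atomic_inc, independent of the assigned value.
 * Skipping it only hangs later waiters (treated as a GPU hang of this
 * context); extra increments merely disable synchronization. */
static void unlock_signal(struct buffer_sync *b)
{
	b->mem_seq++;
}

int main(void)
{
	struct buffer_sync b = {0, 0};
	uint64_t first  = kernel_assign(&b); /* will wait for mem_seq >= 0 */
	uint64_t second = kernel_assign(&b); /* will wait for mem_seq >= 1 */

	printf("first may run:  %d\n", wait_satisfied(&b, first));  /* 1 */
	printf("second may run: %d\n", wait_satisfied(&b, second)); /* 0 */
	unlock_signal(&b);                      /* first releases the buffer */
	printf("second may run: %d\n", wait_satisfied(&b, second)); /* 1 */
	return 0;
}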
> > >
> > > Hm, maybe I misunderstood what exactly you wanted to do earlier. That kind of "we let userspace free-wheel whatever it wants, and the kernel ensures correctness of the resulting chain of dma_fence by resetting the entire context" is what I proposed too.
> > >
> > > Like you say, userspace is allowed to render garbage already.
> > >
> > > > The explicit submit can be done by userspace (if there is no synchronization), but we plan to use the kernel to do it for implicit sync. Essentially, the kernel will receive a buffer list and addresses of wait commands in the user queue. It will assign new sequence numbers to all buffers and write those numbers into the wait commands, and ring the hw doorbell to start execution of that queue.
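
(A rough model of that kernel-side submission path, reusing struct buffer_sync and kernel_assign() from the earlier sketch; all structures and names here are invented for illustration.)

#include <stddef.h>
#include <stdint.h>

struct wait_cmd { uint64_t wait_value; }; /* sits inside the user queue */

struct implicit_submit {
	struct buffer_sync *buffers; /* buffers referenced by the submission */
	struct wait_cmd   **waits;   /* one wait command to patch per buffer */
	size_t              count;
};

/* Kernel: assign fresh per-buffer sequence numbers, patch them into the
 * wait commands already written into the user queue, then ring the
 * doorbell so the hw starts executing the queue. */
static void kernel_implicit_submit(struct implicit_submit *s,
				   void (*ring_doorbell)(void))
{
	for (size_t i = 0; i < s->count; i++)
		s->waits[i]->wait_value = kernel_assign(&s->buffers[i]);
	ring_doorbell();
}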
> > >
> > > Yeah, for implicit sync I think using the kernel and drm/scheduler to sort out the dma_fence dependencies is probably best. Since you can filter out which dma_fence you hand to the scheduler for dependency tracking, you can filter out your own ones and let the hw handle those directly (depending on how much your hw can do and all that). On i915 we might do that to be able to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
> > >
> > > For buffer tracking with implicit sync I think the cleanest is probably to still keep them wrapped as dma_fence and stuffed into dma_resv, but conceptually it's the same. If we let every driver reinvent their own buffer tracking just because the hw works a bit differently, it'll be a mess.
> > >
> > > Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets to keep the pieces if it gets it wrong. You do still need to handle external dma_fence though, hence the drm/scheduler frontend to sort these out.
> >
> > The reason is to prevent a lower-privileged process from deadlocking/hanging a higher-privileged process where the kernel can't tell who did it. If the implicit-sync sequence counter is read-only to userspace and only incrementable by the unlock-signal command after the lock-wait command appeared in the same queue (both together forming a critical section), userspace can't manipulate it arbitrarily and we get almost the exact same behavior as implicit sync has today. That means any implicitly-sync'd buffer from any process can be fully trusted by a compositor to signal in a finite time, and possibly even trusted by the kernel. The only thing that's different is that a malicious process can disable implicit sync for a buffer in all processes and the kernel, but it can't hang other processes or the kernel (it can only hang itself, and the kernel will be notified). So I'm a happy panda now. :)
>
> Yeah, I think that's not going to work too well; it's too many piled-up hacks. Within a drm_file fd you can do whatever you feel like, since it's just one client.
>
> But once implicit sync kicks in, I think you need to go with dma_fence and drm/scheduler to handle the dependencies, and tdr kicking in. With the dma_fence you do know who the offender is - you might not know why, but that doesn't matter; you just shred the entire context and let that userspace figure out the details.
>
> I think trying to make memory fences work as implicit sync directly, without wrapping them in a dma_fence and assorted guarantees, will just not work.
>
> And once you do wrap them in dma_fence, then all the other problems go away: cross-driver sync, syncfiles, ... So I really don't see the benefit of this half-way approach.
>
> Yes, there's going to be a tad bit of overhead, but that's already there in the current model. And it can't hurt to have a bit of motivation for compositors to switch over to userspace memory fences properly.

Well, Christian thinks that we need a high level synchronization primitive in hw. I don't know myself and you may be right. A software scheduler with user queues might be one option. My part is only to find out how much of the scheduler logic can be moved to the hardware.

We plan to have memory timeline semaphores, or simply monotonic counters, and a fence will be represented by the counter address and a constant sequence number for the <= comparison. One counter can represent up to 2^64 different fences. Giving any process write access to a fence is the same as giving it the power to manipulate the signalled state of a sequence of up to 2^64 fences. That could mess up a lot of things. However, if the hardware had a high level synchronization primitive with access rights and a limited set of clearly defined operations such that we can formally prove whether it's safe for everybody, we could have a solution where we don't have to involve the software scheduler and just let the hardware do everything.
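
(As an illustration of the above, a memory timeline semaphore and the fences built on it could be modeled like this; the names are invented for this sketch and do not refer to an existing API.)

#include <stdbool.h>
#include <stdint.h>

/* A timeline is just a monotonically increasing 64-bit counter in
 * memory.  Whoever can write the counter controls the state of every
 * fence built on top of it, which is exactly the access-rights
 * concern described above. */
struct mem_timeline {
	uint64_t counter; /* writable only by hw/kernel in the real design */
};

/* A fence is the counter address plus one constant sequence number,
 * so a single counter can back up to 2^64 distinct fences. */
struct mem_fence {
	const struct mem_timeline *timeline;
	uint64_t seq;
};

/* Signalled once seq <= counter (the "<=" comparison from the mail). */
static bool mem_fence_signalled(const struct mem_fence *f)
{
	return f->seq <= f->timeline->counter;
}

/* Signalling only ever advances the counter, so a fence that became
 * signalled can never become unsignalled again. */
static void mem_timeline_signal(struct mem_timeline *t, uint64_t seq)
{
	if (t->counter < seq)
		t->counter = seq;
}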

Marek

> -Daniel
> --
> Daniel Vetter
> Software Engineer, Intel Corporation
> http://blog.ffwll.ch