On Thu, Jun 3, 2021 at 3:47 AM Daniel Vetter <daniel@ffwll.ch> wrote:

> On Wed, Jun 02, 2021 at 11:16:39PM -0400, Marek Olšák wrote:
> > On Wed, Jun 2, 2021 at 2:48 PM Daniel Vetter <daniel@ffwll.ch> wrote:
> >
> > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > >
> > > > > Yes, we can't break anything because we don't want to complicate
> things
> > > > > for us. It's pretty much all NAK'd already. We are trying to gather
> > > more
> > > > > knowledge and then make better decisions.
> > > > >
> > > > > The idea we are considering is that we'll expose memory-based sync
> > > objects
> > > > > to userspace for read only, and the kernel or hw will strictly
> control
> > > the
> > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > userspace can decide not to signal a job, so even if userspace
> can't
> > > > > overwrite memory-based sync object states arbitrarily, it can still
> > > decide
> > > > > not to signal them, and then a future fence is born.
> > > > >
> > > >
> > > > This would actually be treated as a GPU hang caused by that context,
> so
> > > it
> > > > should be fine.
> > >
> > > This is practically what I proposed already, except your not doing it
> with
> > > dma_fence. And on the memory fence side this also doesn't actually give
> > > what you want for that compute model.
> > >
> > > This seems like a bit a worst of both worlds approach to me? Tons of
> work
> > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > actually drive the hardware like it should be for compute or direct
> > > display.
> > >
> > > Also maybe I've missed it, but I didn't see any replies to my
> suggestion
> > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > interesting to know what doesn't work there instead of amd folks going
> of
> > > into internal again and then coming back with another rfc from out of
> > > nowhere :-)
> > >
> >
> > Going internal again is probably a good idea to spare you the long
> > discussions and not waste your time, but we haven't talked about the
> > dma_fence stuff internally other than acknowledging that it can be
> solved.
> >
> > The compute use case already uses the hw as-is with no inter-process
> > sharing, which mostly keeps the kernel out of the picture. It uses
> glFinish
> > to sync with GL.
> >
> > The gfx use case needs new hardware logic to support implicit and
> explicit
> > sync. When we propose a solution, it's usually torn apart the next day by
> > ourselves.
> >
> > Since we are talking about next hw or next next hw, preemption should be
> > better.
> >
> > user queue = user-mapped ring buffer
> >
> > For implicit sync, we will only let userspace lock access to a buffer
> via a
> > user queue, which waits for the per-buffer sequence counter in memory to
> be
> > >= the number assigned by the kernel, and later unlock the access with
> > another command, which increments the per-buffer sequence counter in
> memory
> > with atomic_inc regardless of the number assigned by the kernel. The
> kernel
> > counter and the counter in memory can be out-of-sync, and I'll explain
> why
> > it's OK. If a process increments the kernel counter but not the memory
> > counter, that's its problem and it's the same as a GPU hang caused by
> that
> > process. If a process increments the memory counter but not the kernel
> > counter, the ">=" condition alongside atomic_inc guarantee that
> signaling n
> > will signal n+1, so it will never deadlock but also it will effectively
> > disable synchronization. This method of disabling synchronization is
> > similar to a process corrupting the buffer, which should be fine. Can you
> > find any flaw in it? I can't find any.
>
> Hm maybe I misunderstood what exactly you wanted to do earlier. That kind
> of "we let userspace free-wheel whatever it wants, kernel ensures
> correctness of the resulting chain of dma_fence with reset the entire
> context" is what I proposed too.
>
> Like you say, userspace is allowed to render garbage already.
>
> > The explicit submit can be done by userspace (if there is no
> > synchronization), but we plan to use the kernel to do it for implicit
> sync.
> > Essentially, the kernel will receive a buffer list and addresses of wait
> > commands in the user queue. It will assign new sequence numbers to all
> > buffers and write those numbers into the wait commands, and ring the hw
> > doorbell to start execution of that queue.
>
> Yeah for implicit sync I think kernel and using drm/scheduler to sort out
> the dma_fence dependencies is probably best. Since you can filter out
> which dma_fence you hand to the scheduler for dependency tracking you can
> filter out your own ones and let the hw handle those directly (depending
> how much your hw can do an all that). On i915 we might do that to be able
> to use MI_SEMAPHORE_WAIT/SIGNAL functionality in the hw and fw scheduler.
>
> For buffer tracking with implicit sync I think cleanest is probably to
> still keep them wrapped as dma_fence and stuffed into dma_resv, but
> conceptually it's the same. If we let every driver reinvent their own
> buffer tracking just because the hw works a bit different it'll be a mess.
>
> Wrt wait commands: I'm honestly not sure why you'd do that. Userspace gets
> to keep the pieces if it gets it wrong. You do still need to handle
> external dma_fence though, hence drm/scheduler frontend to sort these out.
>

The reason is to disallow lower-privileged process to deadlock/hang a
higher-privileged process where the kernel can't tell who did it. If the
implicit-sync sequence counter is read only to userspace and only
incrementable by the unlock-signal command after the lock-wait command
appeared in the same queue (both together forming a critical section),
userspace can't manipulate it arbitrarily and we get almost the exact same
behavior as implicit sync has today. That means any implicitly-sync'd
buffer from any process can be fully trusted by a compositor to signal in a
finite time, and possibly even trusted by the kernel. The only thing that's
different is that a malicious process can disable implicit sync for a
buffer in all processes/kernel, but it can't hang other processes/kernel
(it can only hang itself and the kernel will be notified). So I'm a happy
panda now. :)

Marek