Re: [Mesa-dev] Linux Graphics Next: Userspace submission update

From: Daniel Vetter <daniel@ffwll.ch>
To: "Christian König" <ckoenig.leichtzumerken@gmail.com>
Cc: "Marek Olšák" <maraeo@gmail.com>,
	"Michel Dänzer" <michel@daenzer.net>,
	dri-devel <dri-devel@lists.freedesktop.org>,
	"Jason Ekstrand" <jason@jlekstrand.net>,
	"ML Mesa-dev" <mesa-dev@lists.freedesktop.org>
Subject: Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
Date: Wed, 9 Jun 2021 15:19:28 +0200
Message-ID: <YMC/4IhCePCu57HU@phenom.ffwll.local>
In-Reply-To: <586edeb3-73df-3da2-4925-1829712cba8b@gmail.com>

On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> Am 04.06.21 um 10:57 schrieb Daniel Vetter:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > Am 02.06.21 um 21:19 schrieb Daniel Vetter:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > Am 02.06.21 um 20:48 schrieb Daniel Vetter:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák <maraeo@gmail.com> wrote:
> > > > > > > 
> > > > > > > > Yes, we can't break anything because we don't want to complicate things
> > > > > > > > for us. It's pretty much all NAK'd already. We are trying to gather more
> > > > > > > > knowledge and then make better decisions.
> > > > > > > > 
> > > > > > > > The idea we are considering is that we'll expose memory-based sync objects
> > > > > > > > to userspace for read only, and the kernel or hw will strictly control the
> > > > > > > > memory writes to those sync objects. The hole in that idea is that
> > > > > > > > userspace can decide not to signal a job, so even if userspace can't
> > > > > > > > overwrite memory-based sync object states arbitrarily, it can still decide
> > > > > > > > not to signal them, and then a future fence is born.
> > > > > > > > 
> > > > > > > This would actually be treated as a GPU hang caused by that context, so it
> > > > > > > should be fine.
> > > > > > This is practically what I proposed already, except you're not doing it with
> > > > > > dma_fence. And on the memory fence side this also doesn't actually give
> > > > > > what you want for that compute model.
> > > > > > 
> > > > > > This seems like a bit of a worst-of-both-worlds approach to me? Tons of work
> > > > > > in the kernel to hide these not-dma_fence-but-almost, and still pain to
> > > > > > actually drive the hardware like it should be for compute or direct
> > > > > > display.
> > > > > > 
> > > > > > Also maybe I've missed it, but I didn't see any replies to my suggestion
> > > > > > how to fake the entire dma_fence stuff on top of new hw. Would be
> > > > > > interesting to know what doesn't work there instead of amd folks going off
> > > > > > into internal again and then coming back with another rfc from out of
> > > > > > nowhere :-)
> > > > > Well to be honest I would just push back on our hardware/firmware guys that
> > > > > we need to keep kernel queues forever before going down that route.
> > > > I looked again, and you said the model won't work because preemption is way
> > > > too slow, even when the context is idle.
> > > > 
> > > > I guess at that point I got maybe too fed up and just figured "not my
> > > > problem", but if preempt is too slow as the unload fence, you can do it
> > > > with pte removal and tlb shootdown too (that is hopefully not too slow,
> > > > otherwise your hw is just garbage and won't even be fast for direct submit
> > > > compute workloads).
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > > 
> > > I've rejected it because I think polling for 6 seconds on a TLB flush which
> > > can block interrupts as well is just madness.
> > Hm but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not so shitty one?
> 
> Yeah, we call this the lightweight and the heavyweight tlb flush.
> 
> The lightweight can be used when you are sure that you don't have any of the
> PTEs currently in flight in the 3D/DMA engine and you just need to
> invalidate the TLB.
> 
> The heavyweight must be used when you need to invalidate the TLB *AND* make
> sure that no concurrent operation moves new stuff into the TLB.
> 
> The problem is for this use case we have to use the heavyweight one.
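
As a rough sketch of the distinction quoted above (this is not amdgpu code; the enum, the function and both parameter names are hypothetical and only restate the two conditions in Python), the choice between the two flush modes could be modelled like this:

```python
from enum import Enum


class TlbFlush(Enum):
    """Hypothetical labels for the two flush modes described above."""
    LIGHTWEIGHT = "lightweight"
    HEAVYWEIGHT = "heavyweight"


def pick_tlb_flush(ptes_in_flight: bool, must_block_concurrent_fills: bool) -> TlbFlush:
    """Pick a flush mode from the two conditions named in the thread.

    ptes_in_flight: the 3D/DMA engine may currently be using the PTEs.
    must_block_concurrent_fills: no concurrent operation may move new
    entries into the TLB while the invalidation happens.
    """
    if ptes_in_flight or must_block_concurrent_fills:
        return TlbFlush.HEAVYWEIGHT
    # Engine is known idle with respect to these PTEs: invalidating is enough.
    return TlbFlush.LIGHTWEIGHT
```

Under this toy model the unload-fence use case always lands in the heavyweight branch, because the engine may still be running when the PTEs are cleared.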

Just for my own curiosity: So the lightweight flush is only for in-between
CS when you know access is idle? Or does that also not work if userspace
has a CS on a dma engine going at the same time because the tlb aren't
isolated enough between engines?
-Daniel

> > But yeah at that point I think you just have to bite one of the bullets.
> 
> Yeah, completely agree. We can choose which way we want to die, but it's
> certainly not going to be nice whatever we do.
> 
> > 
> > The thing is with hmm/userspace memory fence model this will be even
> > worse, because you will _have_ to do this tlb flush deep down in core mm
> > functions, so this is going to be userptr, but worse.
> > 
> > With the dma_resv/dma_fence bo memory management model you can at least
> > wrap that tlb flush into a dma_fence and push the waiting/pinging onto a
> > separate thread or something like that. If the hw really is that slow.
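
A minimal userspace sketch of that idea, assuming nothing about the real dma_fence API: FlushFence, start_flush and poll_flush_done are hypothetical names, and a Python thread stands in for whatever kernel worker would do the polling. The point is only that the caller gets a fence back immediately while the slow polling happens elsewhere.

```python
import threading
import time
from typing import Callable, Optional


class FlushFence:
    """Toy fence that signals once the (slow) tlb flush has finished."""

    def __init__(self) -> None:
        self._done = threading.Event()

    def signal(self) -> None:
        self._done.set()

    def is_signaled(self) -> bool:
        return self._done.is_set()

    def wait(self, timeout: Optional[float] = None) -> bool:
        # Returns True if the fence signaled before the timeout expired.
        return self._done.wait(timeout)


def start_flush(poll_flush_done: Callable[[], bool],
                poll_interval: float = 0.001) -> FlushFence:
    """Kick off polling on a separate thread and return a fence right away.

    poll_flush_done is a hypothetical callback that reports whether the
    hardware has completed the flush.
    """
    fence = FlushFence()

    def worker() -> None:
        while not poll_flush_done():
            time.sleep(poll_interval)
        fence.signal()

    threading.Thread(target=worker, daemon=True).start()
    return fence
```

The caller can attach the returned fence to whatever it is waiting on and continue; only code that really needs the flush to be finished calls wait().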
> > 
> > Somewhat aside: Personally I think that sriov needs to move over to the
> > compute model, i.e. indefinite timeouts, no tdr, because everything takes
> > too long. At least looking around sriov timeouts tend to be 10x bare
> > metal, across the board.
> > 
> > But for stuff like cloud gaming that's serious amounts of heavy lifting
> > since it brings us right back to "the entire linux/android 3d stack is built
> > on top of dma_fence right now".
> > 
> > > > The only thing that you need to do when you use pte clearing + tlb
> > > > shootdown instead of preemption as the unload fence for buffers that get
> > > > moved is that if you get any gpu page fault, you don't serve that, but
> > > > instead treat it as a tdr and shoot the context permanently.
> > > > 
> > > > So summarizing the model I proposed:
> > > > 
> > > > - you allow userspace to directly write into the ringbuffer, and also
> > > >     write the fences directly
> > > > 
> > > > - actual submit is done by the kernel, using drm/scheduler. The kernel
> > > >     blindly trusts userspace to set up everything else, and even just wraps
> > > >     dma_fences around the userspace memory fences.
> > > > 
> > > > - the only check is tdr. If a fence doesn't complete, a tdr fires: a) the
> > > >     kernel shoots the entire context and b) userspace recovers by setting up a
> > > >     new ringbuffer
> > > > 
> > > > - memory management is done using ttm only, you still need to supply the
> > > >     buffer list (ofc that list includes the always present ones, so CS will
> > > >     only get the list of special buffers like today). If your hw can't trun
> > > >     gpu page faults and you ever get one we pull up the same old solution:
> > > >     Kernel shoots the entire context.
> > > > 
> > > >     The important thing is that from the gpu pov memory management works
> > > >     exactly like compute workload with direct submit, except that you just
> > > >     terminate the context on _any_ page fault, instead of only those that go
> > > >     somewhere where there's really no mapping and repair the others.
> > > > 
> > > >     Also I guess from reading the old thread this means you'd disable page
> > > >     fault retry because that is apparently also way too slow for anything.
> > > > 
> > > > - memory management uses an unload fence. That unload fence waits for all
> > > >     userspace memory fences (represented as dma_fence) to complete, with
> > > >     maybe some fudge to busy-spin until we've reached the actual end of the
> > > >     ringbuffer (maybe you have an IB tail there after the memory fence write,
> > > >     we have that on intel hw), and it waits for the memory to get
> > > >     "unloaded". This is either preemption, or pte clearing + tlb shootdown,
> > > >     or whatever else your hw provides which is a) used for dynamic memory
> > > >     management b) fast enough for actual memory management.
> > > > 
> > > > - any time a context dies we force-complete all its pending fences,
> > > >     in-order ofc
> > > > 
> > > > So from hw pov this looks 99% like direct userspace submit, with the exact
> > > > same mappings, command sequences and everything else. The only difference
> > > > is that the ringbuffer head/tail updates happen from drm/scheduler, instead
> > > > of directly from userspace.
> > > > 
> > > > None of this stuff needs funny tricks where the kernel controls the
> > > > writes to memory fences, or where you need kernel ringbuffers, or anything
> > > > like this. Userspace is allowed to do anything stupid, the rules are
> > > > guaranteed with:
> > > > 
> > > > - we rely on the hw isolation features to work, but _exactly_ like compute
> > > >     direct submit would too
> > > > 
> > > > - dying on any page fault captures memory management issues
> > > > 
> > > > - dying (without kernel recovery, this is up to userspace if it cares) on
> > > >     any tdr makes sure fences complete still
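
The fence rules summarized above can be sketched as a small toy model. This is plain Python, not drm/scheduler or dma_fence code, and every class, field and function name is hypothetical. It assumes the memory fence is a monotonically increasing sequence number written by userspace or the hw, that the kernel's only check is a deadline (tdr), and that a dead context force-completes its pending fences in order.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Fence:
    """Toy stand-in for a dma_fence wrapped around a userspace memory fence."""
    seqno: int
    signaled: bool = False
    error: bool = False  # True when the kernel force-completed the fence


@dataclass
class Context:
    """Toy submission context; the kernel trusts userspace for everything else."""
    pending: List[Fence] = field(default_factory=list)
    signal_order: List[int] = field(default_factory=list)
    next_seqno: int = 1
    dead: bool = False

    def submit(self) -> Fence:
        if self.dead:
            raise RuntimeError("context is dead, userspace must set up a new one")
        fence = Fence(self.next_seqno)
        self.next_seqno += 1
        self.pending.append(fence)
        return fence

    def memory_fence_written(self, value: int) -> None:
        """The memory fence now holds `value`: signal every fence up to it."""
        while self.pending and self.pending[0].seqno <= value:
            fence = self.pending.pop(0)
            fence.signaled = True
            self.signal_order.append(fence.seqno)

    def kill(self) -> None:
        """tdr or any gpu page fault: shoot the context, force-complete in order."""
        self.dead = True
        while self.pending:
            fence = self.pending.pop(0)
            fence.signaled = True
            fence.error = True
            self.signal_order.append(fence.seqno)


def tdr_check(ctx: Context, now: float, deadline: float) -> None:
    """The only check: kill the context if fences are still pending past the deadline."""
    if ctx.pending and now > deadline:
        ctx.kill()
```

A short usage example: submit three jobs, let the first one signal through the memory fence, then let the deadline pass. The remaining two fences still complete, in order and flagged as errors, and further submits on the dead context fail, which is where userspace would set up a new ringbuffer.

```python
ctx = Context()
fences = [ctx.submit() for _ in range(3)]
ctx.memory_fence_written(1)
tdr_check(ctx, now=11.0, deadline=10.0)
```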
> > > > 
> > > > > That syncfile and all that Android stuff isn't working out of the box with
> > > > > the new shiny user queue submission model (which in turn is mostly because
> > > > > of Windows) already raised some eyebrows here.
> > > > I think if you really want to make sure the current linux stack doesn't
> > > > break the _only_ option you have is provide a ctx mode that allows
> > > > dma_fence and drm/scheduler to be used like today.
> > > Yeah, but I still can just tell our hw/fw guys that we really really need to
> > > keep kernel queues or the whole Linux/Android infrastructure needs to get a
> > > compatibility layer like you describe above.
> > > 
> > > > For everything else it sounds like you're a few years too late, because even
> > > > just huge kernel changes won't happen in time. Much less rewriting
> > > > userspace protocols.
> > > Seconded, question is rather if we are going to start migrating at some
> > > point or if we should keep pushing on our hw/fw guys.
> > So from what I'm hearing other hw might gain the sw compat layer too. Plus
> > I'm hoping that with the sw compat layer it'd be easier to smooth over
> > userspace to the new model (because there will be a long time where we
> > have to support both, maybe even with a runtime switch from userspace
> > memory fences to dma_fence kernel stuff).
> > 
> > But in the end it's up to you what makes more sense between sw work and
> > hw/fw work involved.
> 
> I'm currently entertaining my head a bit with the idea of implementing the
> HW scheduling on the CPU.
> 
> The only obstacle that I can really see is that we might need to unmap a
> page in the CPU page table when a queue becomes idle.
> 
> But apart from that it would give us the required functionality, just
> without the hardware scheduler.
> 
> Christian.
> 
> > -Daniel
> 

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch
