Date: Wed, 9 Jun 2021 15:19:28 +0200
From: Daniel Vetter
To: Christian König
Subject: Re: [Mesa-dev] Linux Graphics Next: Userspace submission update
Cc: Marek Olšák, Michel Dänzer, dri-devel, Jason Ekstrand, ML Mesa-dev

On Fri, Jun 04, 2021 at 01:27:15PM +0200, Christian König wrote:
> On 04.06.21 at 10:57, Daniel Vetter wrote:
> > On Fri, Jun 04, 2021 at 09:00:31AM +0200, Christian König wrote:
> > > On 02.06.21 at 21:19, Daniel Vetter wrote:
> > > > On Wed, Jun 02, 2021 at 08:52:38PM +0200, Christian König wrote:
> > > > > On 02.06.21 at 20:48, Daniel Vetter wrote:
> > > > > > On Wed, Jun 02, 2021 at 05:38:51AM -0400, Marek Olšák wrote:
> > > > > > > On Wed, Jun 2, 2021 at 5:34 AM Marek Olšák wrote:
> > > > > > > >
> > > > > > > > Yes, we can't break anything because we don't want to
> > > > > > > > complicate things for us. It's pretty much all NAK'd already.
> > > > > > > > We are trying to gather more knowledge and then make better
> > > > > > > > decisions.
> > > > > > > >
> > > > > > > > The idea we are considering is that we'll expose memory-based
> > > > > > > > sync objects to userspace as read-only, and the kernel or hw
> > > > > > > > will strictly control the memory writes to those sync objects.
> > > > > > > > The hole in that idea is that userspace can decide not to
> > > > > > > > signal a job, so even if userspace can't overwrite
> > > > > > > > memory-based sync object states arbitrarily, it can still
> > > > > > > > decide not to signal them, and then a future fence is born.
> > > > > > > >
> > > > > > > This would actually be treated as a GPU hang caused by that
> > > > > > > context, so it should be fine.
> > > > > >
> > > > > > This is practically what I proposed already, except you're not
> > > > > > doing it with dma_fence. And on the memory fence side this also
> > > > > > doesn't actually give you what you want for that compute model.
> > > > > >
> > > > > > This seems like a bit of a worst-of-both-worlds approach to me:
> > > > > > tons of work in the kernel to hide these not-dma_fence-but-almost
> > > > > > fences, and still a pain to actually drive the hardware like it
> > > > > > should be driven for compute or direct display.
> > > > > >
> > > > > > Also maybe I've missed it, but I didn't see any replies to my
> > > > > > suggestion for how to fake the entire dma_fence stuff on top of
> > > > > > new hw. Would be interesting to know what doesn't work there,
> > > > > > instead of amd folks going off into internal discussions again
> > > > > > and then coming back with another rfc out of nowhere :-)
> > > > >
> > > > > Well, to be honest I would just push back on our hardware/firmware
> > > > > guys that we need to keep kernel queues forever before going down
> > > > > that route.
> > > >
> > > > I looked again, and you said the model won't work because preemption
> > > > is way too slow, even when the context is idle.
> > > >
> > > > I guess at that point I got maybe too fed up and just figured "not
> > > > my problem", but if preempt is too slow as the unload fence, you can
> > > > do it with pte removal and tlb shootdown too (which is hopefully not
> > > > too slow, otherwise your hw is just garbage and won't even be fast
> > > > for direct-submit compute workloads).
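To make the pte-removal variant a bit more concrete, the unload fence would
boil down to something like this (very rough sketch, all the gpu_* helpers
below are made up, this is not real driver code):

static void ctx_unload(struct gpu_ctx *ctx)
{
	/* Clear the ptes so the hw can't start any new access. */
	gpu_vm_clear_ptes(ctx);		/* made-up helper */

	/* Shoot down stale translations so in-flight access stops too. */
	gpu_tlb_shootdown(ctx);		/* made-up helper */

	/*
	 * From here on any access by this ctx faults. We never service
	 * such a fault, we treat it as a tdr and kill the ctx.
	 */
}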
> > > Have you seen that one here:
> > > https://www.spinics.net/lists/amd-gfx/msg63101.html :)
> > >
> > > I've rejected it because I think polling for 6 seconds on a TLB
> > > flush, which can block interrupts as well, is just madness.
> >
> > Hm, but I thought you had like 2 tlb flush modes, the shitty one (with
> > retrying page faults) and the not-so-shitty one?
>
> Yeah, we call these the lightweight and the heavyweight tlb flush.
>
> The lightweight one can be used when you are sure that you don't have
> any of the PTEs currently in flight in the 3D/DMA engine and you just
> need to invalidate the TLB.
>
> The heavyweight one must be used when you need to invalidate the TLB
> *and* make sure that no concurrent operation moves new stuff into the
> TLB.
>
> The problem is that for this use case we have to use the heavyweight one.

Just for my own curiosity: So the lightweight flush is only for in-between
CS, when you know access is idle? Or does that also not work if userspace
has a CS going on a dma engine at the same time, because the tlbs aren't
isolated enough between engines?
-Daniel

> > But yeah, at that point I think you just have to bite one of the
> > bullets.
>
> Yeah, completely agree. We can choose which way we want to die, but it's
> certainly not going to be nice whatever we do.
>
> > The thing is, with the hmm/userspace memory fence model this will be
> > even worse, because you will _have_ to do this tlb flush deep down in
> > core mm functions, so this is going to be userptr, but worse.
> >
> > With the dma_resv/dma_fence bo memory management model you can at
> > least wrap that tlb flush into a dma_fence and push the
> > waiting/pinging onto a separate thread or something like that, if the
> > hw really is that slow.
> >
> > Somewhat aside: Personally I think that sriov needs to move over to
> > the compute model, i.e. indefinite timeouts, no tdr, because
> > everything takes too long. At least looking around, sriov timeouts
> > tend to be 10x bare metal, across the board.
> >
> > But for stuff like cloud gaming that's a serious amount of heavy
> > lifting, since it brings us right back to "the entire linux/android
> > 3d stack is built on top of dma_fence right now".
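To illustrate what I mean with wrapping that tlb flush into a dma_fence and
pushing the wait onto a separate thread: roughly the below (untested
sketch, gpu_device and gpu_heavyweight_tlb_flush_wait() are made-up
stand-ins for whatever your hw needs):

struct tlb_flush_fence {
	struct dma_fence base;
	spinlock_t lock;
	struct work_struct work;
	struct gpu_device *gdev;	/* made-up device struct */
};

static const char *tff_name(struct dma_fence *f)
{
	return "tlb-flush";
}

static const struct dma_fence_ops tff_ops = {
	.get_driver_name = tff_name,
	.get_timeline_name = tff_name,
};

static void tff_work(struct work_struct *work)
{
	struct tlb_flush_fence *tff = container_of(work, typeof(*tff), work);

	/*
	 * The potentially seconds-long part, now off any path that
	 * can't sleep.
	 */
	gpu_heavyweight_tlb_flush_wait(tff->gdev);	/* made-up helper */

	dma_fence_signal(&tff->base);
	dma_fence_put(&tff->base);
}

/* creation side, e.g. from the eviction path: */
static struct dma_fence *tlb_flush_fence_create(struct gpu_device *gdev)
{
	struct tlb_flush_fence *tff = kzalloc(sizeof(*tff), GFP_KERNEL);

	if (!tff)
		return NULL;

	spin_lock_init(&tff->lock);
	tff->gdev = gdev;
	dma_fence_init(&tff->base, &tff_ops, &tff->lock,
		       dma_fence_context_alloc(1), 1);
	INIT_WORK(&tff->work, tff_work);

	dma_fence_get(&tff->base);	/* ref dropped by the worker */
	queue_work(system_unbound_wq, &tff->work);

	return &tff->base;
}

Then you can stuff that fence into the dma_resv like any other fence and
nothing upstack has to care how slow the flush is.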
> > > > The only thing that you need to do when you use pte clearing + tlb
> > > > shootdown instead of preemption as the unload fence for buffers
> > > > that get moved is that if you get any gpu page fault, you don't
> > > > serve it, but instead treat it as a tdr and shoot the context
> > > > permanently.
> > > >
> > > > So summarizing the model I proposed:
> > > >
> > > > - you allow userspace to directly write into the ringbuffer, and
> > > >   also write the fences directly
> > > >
> > > > - actual submit is done by the kernel, using drm/scheduler. The
> > > >   kernel blindly trusts userspace to set up everything else, and
> > > >   even just wraps dma_fences around the userspace memory fences.
> > > >
> > > > - the only check is tdr. If a fence doesn't complete and a tdr
> > > >   fires, a) the kernel shoots the entire context and b) userspace
> > > >   recovers by setting up a new ringbuffer
> > > >
> > > > - memory management is done using ttm only, you still need to
> > > >   supply the buffer list (ofc that list includes the always-present
> > > >   ones, so CS will only get the list of special buffers like
> > > >   today). If your hw can't turn off gpu page faults and you ever
> > > >   get one, we pull out the same old solution: the kernel shoots the
> > > >   entire context.
> > > >
> > > > The important thing is that from the gpu pov memory management
> > > > works exactly like a compute workload with direct submit, except
> > > > that you just terminate the context on _any_ page fault, instead of
> > > > only those that go somewhere where there's really no mapping, and
> > > > repair the others.
> > > >
> > > > Also I guess from reading the old thread this means you'd disable
> > > > page fault retry because that is apparently also way too slow for
> > > > anything.
> > > >
> > > > - memory management uses an unload fence. That unload fence waits
> > > >   for all userspace memory fences (represented as dma_fence) to
> > > >   complete, with maybe some fudge to busy-spin until we've reached
> > > >   the actual end of the ringbuffer (maybe you have an IB tail there
> > > >   after the memory fence write, we have that on intel hw), and it
> > > >   waits for the memory to get "unloaded". This is either
> > > >   preemption, or pte clearing + tlb shootdown, or whatever else
> > > >   your hw provides which is a) used for dynamic memory management
> > > >   b) fast enough for actual memory management.
> > > >
> > > > - any time a context dies we force-complete all its pending fences,
> > > >   in-order ofc
> > > >
> > > > So from the hw pov this looks 99% like direct userspace submit,
> > > > with the exact same mappings, command sequences and everything
> > > > else. The only difference is that the ringbuffer head/tail updates
> > > > happen from drm/scheduler, instead of directly from userspace.
> > > >
> > > > None of this stuff needs funny tricks where the kernel controls the
> > > > writes to memory fences, or where you need kernel ringbuffers, or
> > > > anything like that. Userspace is allowed to do anything stupid, the
> > > > rules are guaranteed with:
> > > >
> > > > - we rely on the hw isolation features to work, but _exactly_ like
> > > >   compute direct submit would too
> > > >
> > > > - dying on any page fault captures memory management issues
> > > >
> > > > - dying (without kernel recovery, this is up to userspace if it
> > > >   cares) on any tdr makes sure fences still complete
> > > >
> > > > > That syncfile and all that Android stuff isn't working out of the
> > > > > box with the new shiny user queue submission model (which in turn
> > > > > is mostly because of Windows) already raised some eyebrows here.
> > > >
> > > > I think if you really want to make sure the current linux stack
> > > > doesn't break, the _only_ option you have is to provide a ctx mode
> > > > that allows dma_fence and drm/scheduler to be used like today.
> > >
> > > Yeah, but I can still just tell our hw/fw guys that we really really
> > > need to keep kernel queues, or the whole Linux/Android infrastructure
> > > needs to get a compatibility layer like you describe above.
> > >
> > > > For everything else it sounds like you're a few years too late,
> > > > because even just huge kernel changes won't happen in time. Much
> > > > less rewriting userspace protocols.
> > >
> > > Seconded, the question is rather whether we are going to start
> > > migrating at some point, or whether we should keep pushing on our
> > > hw/fw guys.
> >
> > So from what I'm hearing other hw might gain the sw compat layer too.
> > Plus I'm hoping that with the sw compat layer it'd be easier to smooth
> > over userspace to the new model (because there will be a long time
> > where we have to support both, maybe even with a runtime switch from
> > userspace memory fences to dma_fence kernel stuff).
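btw in case we keep talking past each other on what "wraps dma_fences
around the userspace memory fences" means in the above, here's roughly
what I have in mind (untested sketch again, all the umf_* and gpu_ctx
names are made up, and locking is omitted):

struct umf_dma_fence {
	struct dma_fence base;
	spinlock_t lock;
	struct delayed_work work;
	struct list_head node;	/* on the ctx's pending list, in order */
	u64 *seqno;		/* kernel mapping of the fence bo */
	u64 wait_value;
};

static void umf_poll(struct work_struct *work)
{
	struct umf_dma_fence *f =
		container_of(to_delayed_work(work), typeof(*f), work);

	if (READ_ONCE(*f->seqno) >= f->wait_value) {
		dma_fence_signal(&f->base);
		dma_fence_put(&f->base);
		return;
	}

	/* Not signalled yet, keep pinging. The tdr fires independently. */
	schedule_delayed_work(&f->work, usecs_to_jiffies(10));
}

/* and when the tdr fires, something like (assumes fences not yet
 * signalled): */
static void ctx_kill(struct gpu_ctx *ctx)
{
	struct umf_dma_fence *f;

	list_for_each_entry(f, &ctx->pending, node) {	/* submission order */
		dma_fence_set_error(&f->base, -ETIME);
		dma_fence_signal(&f->base);
	}
}

So the dma_fence contract is upheld no matter what stupid things
userspace does with its memory fences.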
> > But in the end it's up to you what makes more sense between the sw
> > work and the hw/fw work involved.
>
> I'm currently toying with the idea of implementing the HW scheduling on
> the CPU.
>
> The only obstacle that I can really see is that we might need to unmap a
> page in the CPU page table when a queue becomes idle.
>
> But apart from that it would give us the required functionality, just
> without the hardware scheduler.
>
> Christian.
>
> > -Daniel

-- 
Daniel Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch