From: "Thomas Hellström" <thomas.hellstrom@linux.intel.com>
To: Jason Gunthorpe <jgg@nvidia.com>
Cc: "Zeng, Oak" <oak.zeng@intel.com>,
	"dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Brost, Matthew" <matthew.brost@intel.com>,
	 "Welty, Brian" <brian.welty@intel.com>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>,
	"Vishwanathapura,
	Niranjana" <niranjana.vishwanathapura@intel.com>,
	Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
Date: Fri, 26 Apr 2024 16:49:26 +0200	[thread overview]
Message-ID: <ad82f95ee29ada403459416d4c97c2b9083b5a0f.camel@linux.intel.com> (raw)
In-Reply-To: <20240426120047.GX941030@nvidia.com>

On Fri, 2024-04-26 at 09:00 -0300, Jason Gunthorpe wrote:
> On Fri, Apr 26, 2024 at 11:55:05AM +0200, Thomas Hellström wrote:
> > First, the gpu_vma structure is something that partitions the
> > gpu_vm and holds gpu-related range metadata, like what to mirror,
> > desired gpu caching policies etc. These are managed (created,
> > removed and split) mainly from user-space, and they are stored in
> > and looked up from an rb-tree.
> 
> Except we are talking about SVA here, so all of this should not be
> exposed to userspace.

I think you are misreading. This is at the level of "mirror this region
of the cpu_vm", "prefer this region placed in VRAM", "the GPU will do
atomic accesses on this region"; very similar to cpu mmap / munmap and
madvise. What I'm trying to say is that this does not directly affect
the SVA itself, only whether to do SVA at all, and in that case which
region of the CPU mm will be mirrored, plus any gpu attributes for the
mirrored region.
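
To make the analogy concrete, the kind of per-range metadata I'm
talking about is roughly the following. This is a hypothetical sketch
only; the struct and field names are made up for illustration and are
not the actual xe structures:

/* Hypothetical per-range attributes kept in a gpu_vma (illustration
 * only; types as in <linux/types.h>). */
struct gpu_vma_attrs {
	u64 start, end;           /* CPU VA range of the cpu_vm to mirror */
	bool mirror;              /* mirror this range at all */
	bool prefer_vram;         /* prefer backing pages placed in VRAM */
	bool atomic_access;       /* GPU will do atomic accesses here */
	unsigned int gpu_caching; /* desired GPU caching policy */
};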

> 
> > Now, when we hit a fault, we want to use hmm_range_fault() to
> > re-populate the faulting PTE, but also to pre-fault a range. Using
> > a range here (let's call this a prefault range for clarity) rather
> > than inserting a single PTE is done for multiple reasons:
> 
> I've never said to do a single range, everyone using hmm_range_fault()
> has some kind of prefetch populate-around algorithm.
> 
> > This is why we've been using dma_map_sg() for these ranges, since
> > it is assumed the benefits gained from 
> 
> This doesn't logically follow. You need to use dma_map_page page by
> page and batch that into your update mechanism.
> 
> If you use dma_map_sg you get into the world of wrongness where you
> have to track ranges and invalidation has to wipe an entire range -
> because you cannot do a dma unmap of a single page from a dma_map_sg
> mapping. This is all the wrong way to use hmm_range_fault.
> 
> hmm_range_fault() is page table mirroring, it fundamentally must be
> page-by-page. The target page table structure must have similar
> properties to the MM page table - especially page by page
> validate/invalidate. Meaning you cannot use dma_map_sg().

To me this is purely an optimization to make the driver page-table, and
hence the GPU TLB, benefit from iommu coalescing / large pages and
large driver PTEs. It is true that invalidation will sometimes shoot
down large gpu ptes unnecessarily, but it will not put any additional
burden on the core AFAICT. The dma mappings themselves aren't touched
on invalidation, since zapping the gpu PTEs effectively stops any dma
accesses. The dma mappings are rebuilt on the next gpu pagefault,
which, as you mention, is considered slow anyway, but will probably
reuse the same prefault region and hence need to rebuild the same dma
mappings regardless.

So as long as we are correct and do not adversely affect the core mm,
and if gpu performance (for whatever reason) is severely hampered when
large gpu page-table entries are not used, couldn't this be considered
something left to the driver?

And a related question: what about THP pages? Is it OK to set up a
single dma mapping for those?
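
To make sure we're talking about the same scheme, below is a rough
sketch of the page-by-page approach as I understand your suggestion.
It is an illustration only, not a proposed implementation: error
unwinding is omitted, and the drv_*() lock/commit helpers are made-up
placeholders for whatever the driver uses to update its page tables
under the notifier lock.

#include <linux/hmm.h>
#include <linux/mm.h>
#include <linux/dma-mapping.h>
#include <linux/mmu_notifier.h>

static int drv_populate_chunk(struct device *dev, struct mm_struct *mm,
			      struct mmu_interval_notifier *notifier,
			      unsigned long start, unsigned long npages,
			      dma_addr_t *dma)
{
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = start + npages * PAGE_SIZE,
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	unsigned long *pfns;
	unsigned long i;
	int ret;

	pfns = kvmalloc_array(npages, sizeof(*pfns), GFP_KERNEL);
	if (!pfns)
		return -ENOMEM;
	range.hmm_pfns = pfns;

retry:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret == -EBUSY)
		goto retry;
	if (ret)
		goto out;

	/* dma-map page by page; no dma_map_sg() involved. */
	for (i = 0; i < npages; i++) {
		struct page *page = hmm_pfn_to_page(pfns[i]);

		dma[i] = dma_map_page(dev, page, 0, PAGE_SIZE,
				      DMA_BIDIRECTIONAL);
		if (dma_mapping_error(dev, dma[i])) {
			ret = -EFAULT;	/* unwinding of dma[] omitted */
			goto out;
		}
	}

	/* Batch the result into the gpu page-table update under the
	 * driver notifier lock, re-checking the notifier seqno. */
	drv_notifier_lock();			/* placeholder helpers */
	if (mmu_interval_read_retry(notifier, range.notifier_seq)) {
		drv_notifier_unlock();
		goto retry;		/* real code would unmap dma[] first */
	}
	drv_commit_gpu_ptes(start, npages, dma);
	drv_notifier_unlock();
out:
	kvfree(pfns);
	return ret;
}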


> 
> > Second, when pre-faulting a range like this, the mmu interval
> > notifier seqno comes into play until the gpu ptes for the prefault
> > range are safely in place. Now, if an invalidation happens in a
> > completely separate part of the mirror range, it will bump the
> > seqno and force us to rerun the fault processing unnecessarily.
> 
> This is how hmm_range_fault() works. Drivers should not do hacky
> things to try to "improve" this. SVA granules should be large, maybe
> not the entire MM, but still quite large. 2M is far too small.
> 
> There is a tradeoff here of slowing down the entire MM vs risking an
> iteration during fault processing. We want to err toward making fault
> processing slower, because fault processing is already really slow.
> 
> > Hence, for this purpose we
> > ideally just want to get a seqno bump covering the prefault range.
> 
> Ideally, but this is not something we can get for free.
> 
> > That's why finer-granularity mmu_interval notifiers might be
> > beneficial (and then cached for future re-use of the same prefault
> > range). This leads me to the next question:
> 
> It is not the design, please don't invent crazy special Intel things
> on top of hmm_range_fault.

For the record, this is not a "crazy special Intel" invention. It's the
way all GPU implementations have done this so far; we're currently
catching up. If we're going to do this another way, we need to fully
understand why the current way is a bad thing to do. That's why these
questions are asked.

> 
> > You mention that mmu_notifiers are expensive to register. From
> > looking at the code it seems *mmu_interval* notifiers are cheap
> > unless there are ongoing invalidations, in which case using a
> > gpu_vma-wide notifier would block anyway? Could you clarify a bit
> > more the cost involved here?
> 
> The rb tree insertions become expensive the larger the tree is. If
> you have only a couple of notifiers it is reasonable.
> 
> > If we don't register these smaller-range interval notifiers, do
> > you think the seqno bumps from unrelated subranges would be a real
> > problem?
> 
> I don't think it is, you'd need to have a workload which was
> aggressively manipulating the CPU mm (which is also pretty slow). If
> the workload is doing that then it also really won't like being
> slowed down by the giant rb tree.

OK, this makes sense, and will also simplify implementation.
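
Just to spell out what I now have in mind: a single, wide
mmu_interval_notifier per mirrored region, whose invalidate callback
only zaps the overlapping gpu ptes. A rough sketch follows (the drv_*()
helpers are again made-up placeholders, not existing code):

#include <linux/mmu_notifier.h>

static bool drv_invalidate(struct mmu_interval_notifier *mni,
			   const struct mmu_notifier_range *range,
			   unsigned long cur_seq)
{
	if (!mmu_notifier_range_blockable(range))
		return false;

	drv_notifier_lock();			/* placeholder driver lock */
	mmu_interval_set_seq(mni, cur_seq);
	/* Zap only the gpu ptes overlapping [range->start, range->end). */
	drv_zap_gpu_ptes(range->start, range->end);
	drv_notifier_unlock();

	return true;
}

static const struct mmu_interval_notifier_ops drv_notifier_ops = {
	.invalidate = drv_invalidate,
};

/* Registered once per mirrored region, not per prefault chunk. */
static int drv_register_mirror(struct mmu_interval_notifier *mni,
			       struct mm_struct *mm, unsigned long start,
			       unsigned long length)
{
	return mmu_interval_notifier_insert(mni, mm, start, length,
					    &drv_notifier_ops);
}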

> 
> You can't win with an argument that collisions are likely due to an
> app pattern that puts a lot of stress on the MM, so the right
> response is to make the MM slower.
> 
> > Finally, the size of the pre-faulting range is something we need to
> > tune.
> 
> Correct.
> 
> > Currently it is cpu vma-wide. I understand you strongly suggest
> > this should be avoided. Could you elaborate a bit on why this is
> > such a bad choice?
> 
> Why would a prefetch have anything to do with a VMA? Ie, your app
> calls malloc() and gets a little allocation out of a giant mmap()
> arena - you want to prefault the entire arena? Does that really make
> any sense?

Personally, no, it doesn't. I'd rather use some sort of fixed-size
chunk. But to rephrase, the question was more about the strong "drivers
should not be aware of the cpu mm vma structures" comment. While I
fully agree they are probably not very useful for determining the size
of gpu prefault regions, is there anything else we should be aware of
here?
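
For the record, by "fixed-size chunk" I mean something as simple as the
sketch below; SZ_2M is an arbitrary placeholder for whatever granule we
end up tuning for, and drv_prefault_range() is a made-up helper name:

#include <linux/align.h>
#include <linux/minmax.h>
#include <linux/sizes.h>

#define PREFAULT_CHUNK	SZ_2M	/* tuning knob, not tied to any cpu vma */

/* Clamp the prefault range to a fixed-size chunk around the fault
 * address, bounded by the mirrored region. */
static void drv_prefault_range(unsigned long fault_addr,
			       unsigned long mirror_start,
			       unsigned long mirror_end,
			       unsigned long *start, unsigned long *end)
{
	*start = max(ALIGN_DOWN(fault_addr, PREFAULT_CHUNK), mirror_start);
	*end = min(*start + PREFAULT_CHUNK, mirror_end);
}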

Thanks,
Thomas


> 
> Mirroring is a huge PITA, IMHO it should be discouraged in favour of
> SVA. Sadly too many CPUs still cannot implement SVA.
> 
> With mirroring there is no good way for the system to predict what
> the access pattern is. The only way to make this actually performant
> is for userspace to explicitly manage the mirroring with some kind of
> prefetching scheme to avoid faulting its accesses except in
> extraordinary cases.
> 
> VMA is emphatically not a hint about what to prefetch. You should
> balance your prefetching based on HW performance and related. If it
> is tidy for HW to fault around a 2M granule then just do that.
> 
> Jason


Thread overview: 123+ messages
2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
2024-04-05  0:39   ` Jason Gunthorpe
2024-04-05  3:33     ` Zeng, Oak
2024-04-05 12:37       ` Jason Gunthorpe
2024-04-05 16:42         ` Zeng, Oak
2024-04-05 18:02           ` Jason Gunthorpe
2024-04-09 16:45             ` Zeng, Oak
2024-04-09 17:24               ` Jason Gunthorpe
2024-04-23 21:17                 ` Zeng, Oak
2024-04-24  2:31                   ` Matthew Brost
2024-04-24 13:57                     ` Jason Gunthorpe
2024-04-24 16:35                       ` Matthew Brost
2024-04-24 16:44                         ` Jason Gunthorpe
2024-04-24 16:56                           ` Matthew Brost
2024-04-24 17:48                             ` Jason Gunthorpe
2024-04-24 13:48                   ` Jason Gunthorpe
2024-04-24 23:59                     ` Zeng, Oak
2024-04-25  1:05                       ` Jason Gunthorpe
2024-04-26  9:55                         ` Thomas Hellström
2024-04-26 12:00                           ` Jason Gunthorpe
2024-04-26 14:49                             ` Thomas Hellström [this message]
2024-04-26 16:35                               ` Jason Gunthorpe
2024-04-29  8:25                                 ` Thomas Hellström
2024-04-30 17:30                                   ` Jason Gunthorpe
2024-04-30 18:57                                     ` Daniel Vetter
2024-05-01  0:09                                       ` Jason Gunthorpe
2024-05-02  8:04                                         ` Daniel Vetter
2024-05-02  9:11                                           ` Thomas Hellström
2024-05-02 12:46                                             ` Jason Gunthorpe
2024-05-02 15:01                                               ` Thomas Hellström
2024-05-02 19:25                                                 ` Zeng, Oak
2024-05-03 13:37                                                   ` Jason Gunthorpe
2024-05-03 14:43                                                     ` Zeng, Oak
2024-05-03 16:28                                                       ` Jason Gunthorpe
2024-05-03 20:29                                                         ` Zeng, Oak
2024-05-04  1:03                                                           ` Dave Airlie
2024-05-06 13:04                                                             ` Daniel Vetter
2024-05-06 23:50                                                               ` Matthew Brost
2024-05-07 11:56                                                                 ` Jason Gunthorpe
2024-05-06 13:33                                                           ` Jason Gunthorpe
2024-04-09 17:33               ` Matthew Brost
2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
2024-01-23  2:06   ` Welty, Brian
2024-01-23  3:09     ` Zeng, Oak
2024-01-23  3:21       ` Making drm_gpuvm work across gpu devices Zeng, Oak
2024-01-23 11:13         ` Christian König
2024-01-23 19:37           ` Zeng, Oak
2024-01-23 20:17             ` Felix Kuehling
2024-01-25  1:39               ` Zeng, Oak
2024-01-23 23:56             ` Danilo Krummrich
2024-01-24  3:57               ` Zeng, Oak
2024-01-24  4:14                 ` Zeng, Oak
2024-01-24  6:48                   ` Christian König
2024-01-25 22:13                 ` Danilo Krummrich
2024-01-24  8:33             ` Christian König
2024-01-25  1:17               ` Zeng, Oak
2024-01-25  1:25                 ` David Airlie
2024-01-25  5:25                   ` Zeng, Oak
2024-01-26 10:09                     ` Christian König
2024-01-26 20:13                       ` Zeng, Oak
2024-01-29 10:10                         ` Christian König
2024-01-29 20:09                           ` Zeng, Oak
2024-01-25 11:00                 ` Re: Making " 周春明(日月)
2024-01-25 17:00                   ` Zeng, Oak
2024-01-25 17:15                 ` Making " Felix Kuehling
2024-01-25 18:37                   ` Zeng, Oak
2024-01-26 13:23                     ` Christian König
2024-01-25 16:42               ` Zeng, Oak
2024-01-25 18:32               ` Daniel Vetter
2024-01-25 21:02                 ` Zeng, Oak
2024-01-26  8:21                 ` Thomas Hellström
2024-01-26 12:52                   ` Christian König
2024-01-27  2:21                     ` Zeng, Oak
2024-01-29 10:19                       ` Christian König
2024-01-30  0:21                         ` Zeng, Oak
2024-01-30  8:39                           ` Christian König
2024-01-30 22:29                             ` Zeng, Oak
2024-01-30 23:12                               ` David Airlie
2024-01-31  9:15                                 ` Daniel Vetter
2024-01-31 20:17                                   ` Zeng, Oak
2024-01-31 20:59                                     ` Zeng, Oak
2024-02-01  8:52                                     ` Christian König
2024-02-29 18:22                                       ` Zeng, Oak
2024-03-08  4:43                                         ` Zeng, Oak
2024-03-08 10:07                                           ` Christian König
2024-01-30  8:43                           ` Thomas Hellström
2024-01-29 15:03                 ` Felix Kuehling
2024-01-29 15:33                   ` Christian König
2024-01-29 16:24                     ` Felix Kuehling
2024-01-29 16:28                       ` Christian König
2024-01-29 17:52                         ` Felix Kuehling
2024-01-29 19:03                           ` Christian König
2024-01-29 20:24                             ` Felix Kuehling
2024-02-23 20:12               ` Zeng, Oak
2024-02-27  6:54                 ` Christian König
2024-02-27 15:58                   ` Zeng, Oak
2024-02-28 19:51                     ` Zeng, Oak
2024-02-29  9:41                       ` Christian König
2024-02-29 16:05                         ` Zeng, Oak
2024-02-29 17:12                         ` Thomas Hellström
2024-03-01  7:01                           ` Christian König
2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng
