From: Jason Gunthorpe <jgg@nvidia.com>
To: "Zeng, Oak" <oak.zeng@intel.com>
Cc: "dri-devel@lists.freedesktop.org"
	<dri-devel@lists.freedesktop.org>,
	"intel-xe@lists.freedesktop.org" <intel-xe@lists.freedesktop.org>,
	"Brost, Matthew" <matthew.brost@intel.com>,
	"Thomas.Hellstrom@linux.intel.com"
	<Thomas.Hellstrom@linux.intel.com>,
	"Welty, Brian" <brian.welty@intel.com>,
	"Ghimiray, Himal Prasad" <himal.prasad.ghimiray@intel.com>,
	"Bommu, Krishnaiah" <krishnaiah.bommu@intel.com>,
	"Vishwanathapura,
	Niranjana" <niranjana.vishwanathapura@intel.com>,
	Leon Romanovsky <leon@kernel.org>
Subject: Re: [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range
Date: Wed, 24 Apr 2024 22:05:20 -0300	[thread overview]
Message-ID: <20240425010520.GW941030@nvidia.com> (raw)
In-Reply-To: <SA1PR11MB699102978E72F21E6C803D6392102@SA1PR11MB6991.namprd11.prod.outlook.com>

On Wed, Apr 24, 2024 at 11:59:18PM +0000, Zeng, Oak wrote:
> Hi Jason,
> 
> I went through the conversation between you and Matt. I think we are pretty much aligned. Here is what I get from this thread:
> 
> 1) hmm range fault size, gpu page table map size: you prefer a bigger
> gpu vma size, and the vma can be sparsely mapped to the gpu. Our vma
> size is configurable through a user madvise api.

That is even worse! It is not a user tunable in any way, shape or form!

> 2) invalidation: you prefer a giant notifier. We can consider this if
> it turns out our implementation is not performant. Currently we
> don't know.

Having many small notifiers is the wrong way to use the API, so
please don't use it that way.
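
Something like this is what I mean - one interval notifier covering a
big chunk of the address space, not one notifier per small gpu vma.
Untested sketch only; the xe_sva_chunk naming, the populated bitmap and
the 512M size are all made up here for illustration:

#include <linux/mmu_notifier.h>
#include <linux/sizes.h>

/* Notifier granule - a driver design constant, not a user tunable */
#define XE_SVA_CHUNK_SIZE SZ_512M

struct xe_sva_chunk {
	struct mmu_interval_notifier notifier;
	unsigned long base;
	unsigned long *populated; /* bitmap of blocks present in the gpu PT */
	/* gpu page table state for this chunk goes here */
};

static bool xe_sva_chunk_invalidate(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq);

static const struct mmu_interval_notifier_ops xe_sva_chunk_ops = {
	.invalidate = xe_sva_chunk_invalidate,
};

static int xe_sva_chunk_init(struct xe_sva_chunk *chunk,
			     struct mm_struct *mm, unsigned long start)
{
	/*
	 * One registration covers the whole chunk. It says nothing about
	 * which pages inside it are populated in the gpu page table.
	 */
	chunk->base = start;
	return mmu_interval_notifier_insert(&chunk->notifier, mm, start,
					    XE_SVA_CHUNK_SIZE,
					    &xe_sva_chunk_ops);
}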
 
> 3) whether the driver can look up the cpu vma. I think we need this for data migration purposes.

The migration code may, but not the SVA/hmm_range_fault code.

> > > What do you mean by "page by page"/"no API surface available to
> > > safely try to construct covering ranges"? As I understand it,
> > > hmm_range_fault takes a virtual address range (defined in the
> > > hmm_range struct) and walks the cpu page table in this range. It is
> > > a range based API.
> > 
> > Yes, but invalidation is not linked to range_fault so you can get
> > invalidations of single pages. You are binding range_fault to
> > dma_map_sg but then you can't handle invalidation at all sanely.
> 
> Ok, I understand your point now.
> 
> Yes, strictly speaking we can get invalidation of a single page. This
> can be triggered by core mm numa balancing or ksm (kernel samepage
> merging). At present, my understanding is that single page (or a few
> pages) invalidation is not a very common case. The more common cases
> are invalidation triggered by user munmap, or invalidation triggered
> by hmm migration itself (triggered in migrate_vma_setup). I will
> experiment with this.

Regardless, it must be handled, and unmapping an entire 'gpu vma' is a
horrible implementation of HMM.

> I agree that in the case of single page invalidation the current code
> is not performant, because we invalidate the whole vma. What I can do
> is look at the mmu_notifier_range parameter of the invalidation
> callback and only invalidate that range. This requires our driver to
> split the vma state and page table state. It is a big change. We
> plan to do it in stage 2.

Which is, again, continuing to explain why this VMA based approach
is a completely wrong way to use hmm.
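
A range-clamped invalidation, continuing the hypothetical xe_sva_chunk
from the sketch above, is not much code at all. Again untested, the
driver page table locking is omitted and xe_svm_unmap_pages() is an
invented helper:

static bool xe_sva_chunk_invalidate(struct mmu_interval_notifier *mni,
				    const struct mmu_notifier_range *range,
				    unsigned long cur_seq)
{
	struct xe_sva_chunk *chunk =
		container_of(mni, struct xe_sva_chunk, notifier);
	/* Clamp to the overlap of the notifier and the invalidated range */
	unsigned long start = max(range->start, mni->interval_tree.start);
	unsigned long end = min(range->end, mni->interval_tree.last + 1);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mmu_interval_set_seq(mni, cur_seq);

	/*
	 * Remove only the overlapping pages from the gpu page table. This
	 * may be a single 4K page, eg from numa balancing or ksm, not a
	 * whole gpu vma.
	 */
	xe_svm_unmap_pages(chunk, start, end);
	return true;
}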

> > I understand, but SVA/hmm_range_fault/invalidation are *NOT* VMA based
> > and you do need to ensure the page table manipulation has an API that
> > is usable. "populate an entire VMA" / "invalidate an entire VMA" is
> > not a sufficient primitive.
> 
> I understand that invalidating an entire VMA might not be performant.
> I will improve this as explained above.

Please stop saying performant. There are correct ways to use hmm and
bad ways. What you are doing is a bad way. Even if the performance is
only a little different it is still the kind of horrible code
structure I don't want to see in new hmm users.

> I think whether we want to populate an entire VMA or only one page is
> still the driver's choice.

hmm_range_fault() should be driven with a prefetch/fault-around
scheme. This has nothing to do with a durable VMA record and addresses
these concerns.
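
Roughly like this - untested sketch; the window size and function names
are invented, and the notifier seq/retry loop after mapping and the
clamping of the window to the notifier's range are omitted. The caller
is assumed to pass a zeroed pfns array of the right size:

#include <linux/hmm.h>
#include <linux/mmu_notifier.h>
#include <linux/mmap_lock.h>

/* Pages to fault around the gpu faulting address - a prefetch tunable */
#define XE_FAULT_AROUND_PAGES 512

static int xe_svm_fault_around(struct mmu_interval_notifier *notifier,
			       struct mm_struct *mm,
			       unsigned long fault_addr,
			       unsigned long *pfns)
{
	unsigned long start = ALIGN_DOWN(fault_addr, PAGE_SIZE) -
			      (XE_FAULT_AROUND_PAGES / 2) * PAGE_SIZE;
	struct hmm_range range = {
		.notifier = notifier,
		.start = start,
		.end = start + XE_FAULT_AROUND_PAGES * PAGE_SIZE,
		.hmm_pfns = pfns,
		/* Nothing is mandatory by default, see below */
		.default_flags = 0,
	};
	int ret;

	/*
	 * Only the page the gpu actually touched must be present; the
	 * rest of the window is opportunistic prefetch around it.
	 */
	pfns[(fault_addr - start) >> PAGE_SHIFT] = HMM_PFN_REQ_FAULT;

	mmap_read_lock(mm);
	range.notifier_seq = mmu_interval_read_begin(notifier);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);

	/*
	 * -EFAULT here already means the gpu touched an invalid address;
	 * no vm_area_struct lookup is needed in the driver for that.
	 */
	return ret;
}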

> Do you suggest per-page population? Or do you suggest populating
> the entire address space or the entire memory region? I did look at
> the RDMA odp code. In function ib_umem_odp_map_dma_and_lock, it is
> also a range based population. It seems it populates the whole
> memory region each time, but I am not very sure.

Performance here is very complicated. You often want to allow
userspace to prefetch data into the GPU page table to warm it up to
avoid page faults, as faults are generally super slow and hard to
manage performantly. ODP has many options to control this at a fine
granularity.

> > You can create VMAs, but they can't be fully populated and they should
> > be a lot bigger than 2M. ODP uses a similar 2 level scheme for its
> > SVA, the "vma" granule is 512M.
> 
> Oh, I see. So you are suggesting a much bigger granularity. That
> makes sense to me. Our design actually supports a much bigger
> granularity. The migration/population granularity is configurable in
> our design. It is a memory advise API and one of the advise hints is
> called "MIGRATION_GRANULARITY".

I don't think the notifier granule should be user tunable. You are
actually doing something different - it sounds like it mostly acts as
a prefetch tunable.

> > The key thing is the VMA is just a container for the notifier and
> > other data structures, it doesn't insist the range be fully populated
> > and it must support page-by-page unmap/remap/invalidation.
> 
> Agree, and I don't see a hard conflict between our implementation and
> this concept. The linux core mm vma (struct vm_area_struct) represents
> an address range in the process's address space. The Xe driver would
> create some xe_vmas to cover a sub-region of a core mm vma.

No, again and again, NO. hmm_range_fault users are not, and must not
be, linked to CPU VMAs, ever.

> > mapping is driven by the range passed to hmm_range_fault() during
> > fault handling, which is entirely based on your driver's prefetch
> > logic.
> 
> In our driver, mapping can be triggered by either prefetch or fault. 
> 
> Prefetch is a user API so the user can decide the range.

> The range used in a fault is decided by the MIGRATION_GRANULARITY
> user setting. The default value is 2MiB, as said.

Which is basically turning it into a fault prefetch tunable.
 
> In our implementation, our page table is not organized as a radix
> tree. 

Huh? What does the HW in the GPU do to figure out how to issue DMAs?
Your GPU HW doesn't have a radix tree walker you are configuring?

> Maybe this is an area we can improve. For invalidation, we don't need
> to walk the page table to figure out which range is present in the
> page table, since we only register a mmu notifier when a range is
> actually mapped to the gpu page table. So every notifier callback is
> for a valid range in the gpu page table.

But this is not page by page.

> I agree many 2M small notifiers can slow down the red/black tree
> walk on the core mm side. But with a giant notifier, core mm calls
> back into the driver many more times than with small notifiers - the
> driver would be called back even if a range is not mapped to the gpu
> page table.

The driver should do a cheaper check than a red/black tree lookup to
decide if it needs to do any work, eg with a typical radix page table,
or some kind of fast sparse bitmap. At a 2M granule with today's 100GB
workloads this is an obviously losing idea.
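
Eg something like this per-chunk bitmap test before the notifier
callback does any real work (sketch only; the populated bitmap on the
hypothetical xe_sva_chunk above is invented):

#include <linux/bitmap.h>

#define XE_BLOCK_SHIFT 21	/* track populated 2M blocks, for example */

static bool xe_sva_chunk_has_mappings(struct xe_sva_chunk *chunk,
				      unsigned long start, unsigned long end)
{
	unsigned long first = (start - chunk->base) >> XE_BLOCK_SHIFT;
	unsigned long last = (end - 1 - chunk->base) >> XE_BLOCK_SHIFT;

	/*
	 * chunk->populated is set/cleared by the driver as it maps and
	 * unmaps blocks in the gpu page table. If nothing in [start, end)
	 * is populated the invalidate callback can just update the
	 * sequence number and return without touching the gpu page table.
	 */
	return find_next_bit(chunk->populated, last + 1, first) <= last;
}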

> > I want to be very clear, there should be no interaction of your
> > hmm_range_fault based code with MM centric VMAs. You MUST NOT look up
> > the CPU vm_area_struct in your driver.
> 
> Without looking up the cpu vma, we can't even decide whether our gpu
> accessed a valid address or not.

hmm_range_fault provides this for you.
 
> Furthermore, we call the hmm helper migrate_vma_setup for migration,
> which takes a struct migrate_vma parameter. migrate_vma has a vma
> field. If we don't look up the cpu vma, how do we get this field?

Migration is a different topic. The vma lookup for a migration has
nothing to do with hmm_range_fault and SVA.

(and perhaps arguably this is structured wrong as you probably want
hmm_range_fault to do the migrations for you as it walks the range to
avoid double walks and things)
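
Ie the migration side looks more or less like this, and none of it
touches hmm_range_fault (untested sketch; the dst page allocation, the
data copy and clamping the range to the vma are all left out, the
function name is invented):

#include <linux/migrate.h>
#include <linux/mm.h>

static int xe_svm_migrate_range(struct mm_struct *mm, unsigned long start,
				unsigned long end, unsigned long *src,
				unsigned long *dst, void *pgmap_owner)
{
	struct migrate_vma migrate = {
		.start		= start,
		.end		= end,
		.src		= src,
		.dst		= dst,
		.pgmap_owner	= pgmap_owner,
		.flags		= MIGRATE_VMA_SELECT_SYSTEM,
	};
	int ret;

	mmap_read_lock(mm);
	/* The CPU vma lookup lives here, in the migration path only */
	migrate.vma = find_vma_intersection(mm, start, end);
	if (!migrate.vma) {
		mmap_read_unlock(mm);
		return -EFAULT;
	}

	ret = migrate_vma_setup(&migrate);
	if (!ret && migrate.cpages) {
		/* fill dst[] with device pages and copy the data, then: */
		migrate_vma_pages(&migrate);
		migrate_vma_finalize(&migrate);
	}
	mmap_read_unlock(mm);
	return ret;
}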

Jason


Thread overview: 123+ messages
2024-01-17 22:12 [PATCH 00/23] XeKmd basic SVM support Oak Zeng
2024-01-17 22:12 ` [PATCH 01/23] drm/xe/svm: Add SVM document Oak Zeng
2024-01-17 22:12 ` [PATCH 02/23] drm/xe/svm: Add svm key data structures Oak Zeng
2024-01-17 22:12 ` [PATCH 03/23] drm/xe/svm: create xe svm during vm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 04/23] drm/xe/svm: Trace svm creation Oak Zeng
2024-01-17 22:12 ` [PATCH 05/23] drm/xe/svm: add helper to retrieve svm range from address Oak Zeng
2024-01-17 22:12 ` [PATCH 06/23] drm/xe/svm: Introduce a helper to build sg table from hmm range Oak Zeng
2024-04-05  0:39   ` Jason Gunthorpe
2024-04-05  3:33     ` Zeng, Oak
2024-04-05 12:37       ` Jason Gunthorpe
2024-04-05 16:42         ` Zeng, Oak
2024-04-05 18:02           ` Jason Gunthorpe
2024-04-09 16:45             ` Zeng, Oak
2024-04-09 17:24               ` Jason Gunthorpe
2024-04-23 21:17                 ` Zeng, Oak
2024-04-24  2:31                   ` Matthew Brost
2024-04-24 13:57                     ` Jason Gunthorpe
2024-04-24 16:35                       ` Matthew Brost
2024-04-24 16:44                         ` Jason Gunthorpe
2024-04-24 16:56                           ` Matthew Brost
2024-04-24 17:48                             ` Jason Gunthorpe
2024-04-24 13:48                   ` Jason Gunthorpe
2024-04-24 23:59                     ` Zeng, Oak
2024-04-25  1:05                       ` Jason Gunthorpe [this message]
2024-04-26  9:55                         ` Thomas Hellström
2024-04-26 12:00                           ` Jason Gunthorpe
2024-04-26 14:49                             ` Thomas Hellström
2024-04-26 16:35                               ` Jason Gunthorpe
2024-04-29  8:25                                 ` Thomas Hellström
2024-04-30 17:30                                   ` Jason Gunthorpe
2024-04-30 18:57                                     ` Daniel Vetter
2024-05-01  0:09                                       ` Jason Gunthorpe
2024-05-02  8:04                                         ` Daniel Vetter
2024-05-02  9:11                                           ` Thomas Hellström
2024-05-02 12:46                                             ` Jason Gunthorpe
2024-05-02 15:01                                               ` Thomas Hellström
2024-05-02 19:25                                                 ` Zeng, Oak
2024-05-03 13:37                                                   ` Jason Gunthorpe
2024-05-03 14:43                                                     ` Zeng, Oak
2024-05-03 16:28                                                       ` Jason Gunthorpe
2024-05-03 20:29                                                         ` Zeng, Oak
2024-05-04  1:03                                                           ` Dave Airlie
2024-05-06 13:04                                                             ` Daniel Vetter
2024-05-06 23:50                                                               ` Matthew Brost
2024-05-07 11:56                                                                 ` Jason Gunthorpe
2024-05-06 13:33                                                           ` Jason Gunthorpe
2024-04-09 17:33               ` Matthew Brost
2024-01-17 22:12 ` [PATCH 07/23] drm/xe/svm: Add helper for binding hmm range to gpu Oak Zeng
2024-01-17 22:12 ` [PATCH 08/23] drm/xe/svm: Add helper to invalidate svm range from GPU Oak Zeng
2024-01-17 22:12 ` [PATCH 09/23] drm/xe/svm: Remap and provide memmap backing for GPU vram Oak Zeng
2024-01-17 22:12 ` [PATCH 10/23] drm/xe/svm: Introduce svm migration function Oak Zeng
2024-01-17 22:12 ` [PATCH 11/23] drm/xe/svm: implement functions to allocate and free device memory Oak Zeng
2024-01-17 22:12 ` [PATCH 12/23] drm/xe/svm: Trace buddy block allocation and free Oak Zeng
2024-01-17 22:12 ` [PATCH 13/23] drm/xe/svm: Handle CPU page fault Oak Zeng
2024-01-17 22:12 ` [PATCH 14/23] drm/xe/svm: trace svm range migration Oak Zeng
2024-01-17 22:12 ` [PATCH 15/23] drm/xe/svm: Implement functions to register and unregister mmu notifier Oak Zeng
2024-01-17 22:12 ` [PATCH 16/23] drm/xe/svm: Implement the mmu notifier range invalidate callback Oak Zeng
2024-01-17 22:12 ` [PATCH 17/23] drm/xe/svm: clean up svm range during process exit Oak Zeng
2024-01-17 22:12 ` [PATCH 18/23] drm/xe/svm: Move a few structures to xe_gt.h Oak Zeng
2024-01-17 22:12 ` [PATCH 19/23] drm/xe/svm: migrate svm range to vram Oak Zeng
2024-01-17 22:12 ` [PATCH 20/23] drm/xe/svm: Populate svm range Oak Zeng
2024-01-17 22:12 ` [PATCH 21/23] drm/xe/svm: GPU page fault support Oak Zeng
2024-01-23  2:06   ` Welty, Brian
2024-01-23  3:09     ` Zeng, Oak
2024-01-23  3:21       ` Making drm_gpuvm work across gpu devices Zeng, Oak
2024-01-23 11:13         ` Christian König
2024-01-23 19:37           ` Zeng, Oak
2024-01-23 20:17             ` Felix Kuehling
2024-01-25  1:39               ` Zeng, Oak
2024-01-23 23:56             ` Danilo Krummrich
2024-01-24  3:57               ` Zeng, Oak
2024-01-24  4:14                 ` Zeng, Oak
2024-01-24  6:48                   ` Christian König
2024-01-25 22:13                 ` Danilo Krummrich
2024-01-24  8:33             ` Christian König
2024-01-25  1:17               ` Zeng, Oak
2024-01-25  1:25                 ` David Airlie
2024-01-25  5:25                   ` Zeng, Oak
2024-01-26 10:09                     ` Christian König
2024-01-26 20:13                       ` Zeng, Oak
2024-01-29 10:10                         ` Christian König
2024-01-29 20:09                           ` Zeng, Oak
2024-01-25 11:00                 ` 回复:Making " 周春明(日月)
2024-01-25 17:00                   ` Zeng, Oak
2024-01-25 17:15                 ` Making " Felix Kuehling
2024-01-25 18:37                   ` Zeng, Oak
2024-01-26 13:23                     ` Christian König
2024-01-25 16:42               ` Zeng, Oak
2024-01-25 18:32               ` Daniel Vetter
2024-01-25 21:02                 ` Zeng, Oak
2024-01-26  8:21                 ` Thomas Hellström
2024-01-26 12:52                   ` Christian König
2024-01-27  2:21                     ` Zeng, Oak
2024-01-29 10:19                       ` Christian König
2024-01-30  0:21                         ` Zeng, Oak
2024-01-30  8:39                           ` Christian König
2024-01-30 22:29                             ` Zeng, Oak
2024-01-30 23:12                               ` David Airlie
2024-01-31  9:15                                 ` Daniel Vetter
2024-01-31 20:17                                   ` Zeng, Oak
2024-01-31 20:59                                     ` Zeng, Oak
2024-02-01  8:52                                     ` Christian König
2024-02-29 18:22                                       ` Zeng, Oak
2024-03-08  4:43                                         ` Zeng, Oak
2024-03-08 10:07                                           ` Christian König
2024-01-30  8:43                           ` Thomas Hellström
2024-01-29 15:03                 ` Felix Kuehling
2024-01-29 15:33                   ` Christian König
2024-01-29 16:24                     ` Felix Kuehling
2024-01-29 16:28                       ` Christian König
2024-01-29 17:52                         ` Felix Kuehling
2024-01-29 19:03                           ` Christian König
2024-01-29 20:24                             ` Felix Kuehling
2024-02-23 20:12               ` Zeng, Oak
2024-02-27  6:54                 ` Christian König
2024-02-27 15:58                   ` Zeng, Oak
2024-02-28 19:51                     ` Zeng, Oak
2024-02-29  9:41                       ` Christian König
2024-02-29 16:05                         ` Zeng, Oak
2024-02-29 17:12                         ` Thomas Hellström
2024-03-01  7:01                           ` Christian König
2024-01-17 22:12 ` [PATCH 22/23] drm/xe/svm: Add DRM_XE_SVM kernel config entry Oak Zeng
2024-01-17 22:12 ` [PATCH 23/23] drm/xe/svm: Add svm memory hints interface Oak Zeng
