LinuxPPC-Dev Archive mirror
 help / color / mirror / Atom feed
From: Jason Gunthorpe <jgg@nvidia.com>
To: Peter Xu <peterx@redhat.com>
Cc: James Houghton <jthoughton@google.com>,
	David Hildenbrand <david@redhat.com>,
	Yang Shi <shy828301@gmail.com>,
	Andrew Jones <andrew.jones@linux.dev>,
	linux-mm@kvack.org, linux-riscv@lists.infradead.org,
	Andrea Arcangeli <aarcange@redhat.com>,
	"Aneesh Kumar K . V" <aneesh.kumar@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Christoph Hellwig <hch@infradead.org>,
	linux-arm-kernel@lists.infradead.org,
	Axel Rasmussen <axelrasmussen@google.com>,
	Rik van Riel <riel@surriel.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"Kirill A . Shutemov" <kirill@shutemov.name>,
	Vlastimil Babka <vbabka@suse.cz>,
	Lorenzo Stoakes <lstoakes@gmail.com>,
	Muchun Song <muchun.song@linux.dev>,
	linux-kernel@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	linuxppc-dev@lists.ozlabs.org, Mike Rapoport <rppt@kernel.org>,
	Mike Kravetz <mike.kravetz@oracle.com>
Subject: Re: [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2
Date: Tue, 26 Mar 2024 11:02:52 -0300	[thread overview]
Message-ID: <20240326140252.GH6245@nvidia.com> (raw)
In-Reply-To: <ZgHJaJSpoeJVEccN@x1n>

On Mon, Mar 25, 2024 at 02:58:48PM -0400, Peter Xu wrote:

> > This remark would be a little easier to understand if you refer to
> > hugetlb_walk() not huge_pte_offset() - the latter is only used to
> > implement hugetlb_walk() and isn't removed by this series, while a
> > single hugetlb_walk() was removed.
> 
> Right.  Here huge_pte_offset() is the arch API that I hope we can remove,
> the hugetlb_walk() is simply the wrapper.

But arguably hugetlb_walk is the thing that should go..

In the generic code we should really try to get away from the weird
hugetlb abuse of treating every level as a pte_t. That is not how the
generic page table API works and that weirdness is part of what
motivates the arch API to supply special functions for reading. Not
every arch can just cast every level to a pte_t and still work.

But that weirdness is also saving alot of code so something else needs
to be though up..

> > Regardless, I think the point is spot on, the direction should be to
> > make the page table reading generic with minimal/no interaction with
> > hugetlbfs in the generic code.
> 
> Yes, and I also like your terms on calling them "pgtable readers".  It's a
> better way to describe the difference in that regard between
> huge_pte_offset() v.s. huge_pte_alloc().  Exactly that's my goal, that we
> should start with the "readers".

Yeah, it makes alot of sense to tackle the readers first - we are
pretty close now to having enough done to have generic readers. I
would imagine tackling everything outside mm/huge*.c to use the normal
page table API for reading.

Then consider what to do with the reading in mm/huge*.c
 
> The writters might change semantics when merge, and can be more
> challenging, I'll need to look into details of each one, like page fault
> handlers.  Such work may need to be analyzed case by case, and this GUP
> part is definitely the low hanging fruit comparing to the rest.

The write side is tricky, I think if the read side is sorted out then
it will be easer to reason about the write side. Today the write side
is paired with the special read side and it is extra hard to
understand if there is something weird hidden in the arch.
 
> measurements too when getting there.  And btw, IIUC the major challenge of
> pagewalk.c is not the removal of walk_hugetlb_range() alone - that may not
> be that hard if that's the solo purpose.  The better way to go is to remove
> mm_walk_ops.hugetlb_entry() altogether, which will cause a change in all
> callers; that's "the challenge".. pretty much labor works, not a major
> technical challenge it seems.  Not sure if it's a good news or bad..

Ugh, that is a big pain. It is relying on that hugetlbfs trick of
passing in a pte that is not a pte to make the API generic..

The more I look at this the more I think we need to get to Matthew's
idea of having some kind of generic page table API that is not tightly
tied to level. Replacing the hugetlb trick of 'everything is a PTE'
with 5 special cases in every place seems just horrible.

   struct mm_walk_ops {
       int (*leaf_entry)(struct mm_walk_state *state, struct mm_walk *walk);
   }

And many cases really want something like:
   struct mm_walk_state state;

   if (!mm_walk_seek_leaf(state, mm, address))
          goto no_present
   if (mm_walk_is_write(state)) ..

And detailed walking:
   for_each_pt_leaf(state, mm, address) {
       if (mm_walk_is_write(state)) ..
   }

Replacing it with a mm_walk_state that retains the level or otherwise
to allow decoding any entry composes a lot better. Forced Loop
unrolling can get back to the current code gen in alot of places.

It also makes the power stuff a bit nicer as the mm_walk_state could
automatically retain back pointers to the higher levels in the state
struct too...

The puzzle is how to do it and still get reasonable efficient codegen,
many operations are going to end up switching on some state->level to
know how to decode the entry.

> One thing I'll soon look into is to allow huge mappings for PFNMAP; there's
> one request from VFIO side for MMIO. Dropping mm_walk_ops.hugetlb_entry()
> seems to be good for all such purposes; well, I may need to bite the bullet
> here.. for either of the purposes to move on.

That would be a nice feature!

> > If, or how much, the hugepd write side remains special is the open
> > question, I think.
> 
> It seems balls are rolling in that aspect, I haven't looked in depth there
> in Christophe's series but it's great to have started!

Yes!

Jason

  reply	other threads:[~2024-03-26 14:04 UTC|newest]

Thread overview: 36+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-03-21 22:07 [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2 peterx
2024-03-21 22:07 ` [PATCH v3 01/12] mm/Kconfig: CONFIG_PGTABLE_HAS_HUGE_LEAVES peterx
2024-03-21 22:07 ` [PATCH v3 02/12] mm/hugetlb: Declare hugetlbfs_pagecache_present() non-static peterx
2024-03-21 22:07 ` [PATCH v3 03/12] mm: Make HPAGE_PXD_* macros even if !THP peterx
2024-03-22 17:14   ` SeongJae Park
2024-03-23  0:30     ` Peter Xu
2024-03-23  1:05       ` SeongJae Park
2024-03-21 22:07 ` [PATCH v3 04/12] mm: Introduce vma_pgtable_walk_{begin|end}() peterx
2024-03-22 12:27   ` Jason Gunthorpe
2024-03-21 22:07 ` [PATCH v3 05/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing peterx
2024-03-22 12:28   ` Jason Gunthorpe
2024-03-21 22:07 ` [PATCH v3 06/12] mm/gup: Refactor record_subpages() to find 1st small page peterx
2024-03-21 22:07 ` [PATCH v3 07/12] mm/gup: Handle hugetlb for no_page_table() peterx
2024-03-21 22:07 ` [PATCH v3 08/12] mm/gup: Cache *pudp in follow_pud_mask() peterx
2024-03-21 22:07 ` [PATCH v3 09/12] mm/gup: Handle huge pud for follow_pud_mask() peterx
2024-03-21 22:08 ` [PATCH v3 10/12] mm/gup: Handle huge pmd for follow_pmd_mask() peterx
2024-03-21 22:08 ` [PATCH v3 11/12] mm/gup: Handle hugepd for follow_page() peterx
2024-03-21 22:08 ` [PATCH v3 12/12] mm/gup: Handle hugetlb in the generic follow_page_mask code peterx
2024-03-22 13:30   ` Jason Gunthorpe
2024-03-22 15:55     ` Peter Xu
2024-03-22 16:08       ` Jason Gunthorpe
2024-03-22 20:48   ` Andrew Morton
2024-03-23  0:45     ` Peter Xu
2024-03-23  2:15       ` Peter Xu
2024-03-22 16:10 ` [PATCH v3 00/12] mm/gup: Unify hugetlb, part 2 Jason Gunthorpe
2024-03-25 18:58   ` Peter Xu
2024-03-26 14:02     ` Jason Gunthorpe [this message]
2024-04-04 21:48       ` Peter Xu
2024-04-05 18:16         ` Jason Gunthorpe
2024-04-05 21:42           ` Peter Xu
2024-04-09 23:43             ` Jason Gunthorpe
2024-04-10 15:28               ` Peter Xu
2024-04-10 16:30                 ` Christophe Leroy
2024-04-10 19:58                   ` Peter Xu
2024-04-12 14:27                     ` Christophe Leroy
2024-03-25 14:56 ` Christophe Leroy

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240326140252.GH6245@nvidia.com \
    --to=jgg@nvidia.com \
    --cc=aarcange@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=andrew.jones@linux.dev \
    --cc=aneesh.kumar@kernel.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hch@infradead.org \
    --cc=jhubbard@nvidia.com \
    --cc=jthoughton@google.com \
    --cc=kirill@shutemov.name \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=lstoakes@gmail.com \
    --cc=mike.kravetz@oracle.com \
    --cc=muchun.song@linux.dev \
    --cc=peterx@redhat.com \
    --cc=riel@surriel.com \
    --cc=rppt@kernel.org \
    --cc=shy828301@gmail.com \
    --cc=vbabka@suse.cz \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).