Linux-Doc Archive mirror
 help / color / mirror / Atom feed
From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>, Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-doc@vger.kernel.org,
	Andrew Morton <akpm@linux-foundation.org>,
	Jonathan Corbet <corbet@lwn.net>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Zi Yan <ziy@nvidia.com>, Yang Shi <yang.shi@linux.alibaba.com>,
	Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
Date: Fri, 26 Apr 2024 08:57:36 +0200	[thread overview]
Message-ID: <c04f5611-162c-441c-b94c-79032f09e015@redhat.com> (raw)
In-Reply-To: <1a406a5f-1336-4051-a722-04b9ea2f54df@nvidia.com>

On 26.04.24 03:23, John Hubbard wrote:
> On 4/25/24 1:06 AM, David Hildenbrand wrote:
>> On 25.04.24 07:40, John Hubbard wrote:
>>> On 4/24/24 9:17 PM, Matthew Wilcox wrote:
>>>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote:
> ...
>> We'll talk more about all that at LSF/MM in the mapcount session. A spoiler:
> 
> Looking forward to it. And as an aside, this year it feels like the mm
> code is changing relatively fast. So many large and small improvements
> have happened or are in progress.

Yes, it's happening on a very fast pace (and it's hard for me to get 
reasonable work done while still keeping reviewing that much ...).

I'll note, that replacing a page-based interface by a folio-based 
interface should not be shocking news in 2024, and that the issues with 
mapcounts for large folios have been a recurring topic at LSF/MM and on 
the mailing list.

> 
> 
>>
>> page_mapcount() in the context of large folios:
>> * Is a misunderstood function (e.g., page_mapcount() vs page_count()
>>     checks, mapped = !page_mapcount() checks).
>> * Is a misleading function (e.g., page_mapped() == folio_mapped() but
>>     page_mapcount() != folio_mapcount())
>>
>> We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel.
>>
>> If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window.
>>
>> Regarding the subpage mapcount optimizations, I can further add:
>> * (un)map performance improvements for PTE-mapped THP
>> * Preparation for folio_order() > PMD_ORDER, where the current scheme
>>     won't scale and needs further adjustments/complexity to even keep it
>>     working
>> * Preparation for hugetlb-like vmemmap optimizations until we have
>>     memdescs / dynamically allocated folios
>> * (Paving the way for partially mapping hugetlb folios that faced
>>      similar issues? Not sure if that ever gets real, though)
>>
>> Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months.
> 
> I think so, yes. There is a lot of context required to understand the
> motivation, and more required in order to figure out if it is safe,
> and if it still provides "good" behavior.

I think the motivation for removing page_mapcount() should be very clear 
at this point: a single remaining user in mm/ should be warranted, and 
the faster it is gone the better.

[case in point: I even have another potential user [1] in my mailbox 
that should be using a folio interface, well, or PG_anon_exclusive :) ]

[1] https://lore.kernel.org/all/Zirw0uINbP6GxFiK@bender.morinfr.org/T/#u

Regarding removing subpage mapcounts I agree: I added too many details 
that made it look harder to understand :P

> 
> I still think it's mostly harmless, though, so being ahead of its time
> is not necessarily an indictment. :)

I didn't express my thought clearly: LSF/MM is just around the corner 
and the discussion we are having here is the perfect preparation for 
that session! :)

I don't particularly care if we merge this patch now or after the next 
merge window along with the remaining page_mapcount() removal.

Discussing the impact of this change is the important piece. :)

[...]

>> Thanks for having a look!
>>
>> I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory.
>>
>> Imagine someone would be setting "khugepaged_max_ptes_shared=0", and then we have an area where (I think this is the extreme case):
>>
>> * We map 256 subpages of a 2M folio that are shared 256 times with a
>>     child process.
>> * We don't map the first subpage.
>> * One PTE maps another page and is pte_write().
>> * 255 PTEs are pte_none().
>>
>> folio_likely_mapped_shared() would return "false".
>>
>> But then my thinking is:
>> * We are already wasting 256 subpages that are free in the 2M folio.
>>     Sure, we might be able to relaim it when deferred splitting.
>> * Why would someone set khugepaged_max_ptes_shared=0 but leave
>>     khugepaged_max_ptes_none set that high that we would allow 255
>>     pte_none?
>> * If the child is a short-living subprocess, we don't really care
>> * Any futher writes to unbacked/R/O PTEs in that PMD area would COW and
>>     consume memory.
>>
>> So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided "this is really a corner case" and likely not worth the worry.
>>
>> If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better.
>>
> 
> OK. I'll need to think through some more of these cases. And meanwhile, I
> was poking around from the other direction: just injection test it by
> pasting in "true" or "false", in place of calling folio_likely_mapped_shared().
> And see what changes.

Highly appreciated!

> 
> The "true" test lets me fail several khugepaged selftests, while the "false"
> test just increases the counter in /proc/vmstat.
> 
> That's more of a black box way of poking at it, just to have another facet
> of testing. Because it is good to ensure that we really do have test
> coverage if we're changing the code. Anyway, just ideas.

Yes, all makes sense.

I'm very interested if there are valid concerns that the "false 
negatives" are unacceptable: it would be another case for why we really 
want to make folio_likely_mapped_shared() precise. For me it's clear 
that we want to make it precise, but so far I am not convinced that it 
is absolutely required in the khugepaged context.

Thanks!

-- 
Cheers,

David / dhildenb


      reply	other threads:[~2024-04-26  6:57 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-04-24 12:26 [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared() David Hildenbrand
2024-04-24 16:28 ` Yang Shi
2024-04-24 16:36   ` David Hildenbrand
2024-04-25  4:00 ` John Hubbard
2024-04-25  4:17   ` Matthew Wilcox
2024-04-25  5:40     ` John Hubbard
2024-04-25  8:06       ` David Hildenbrand
2024-04-26  1:23         ` John Hubbard
2024-04-26  6:57           ` David Hildenbrand [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=c04f5611-162c-441c-b94c-79032f09e015@redhat.com \
    --to=david@redhat.com \
    --cc=akpm@linux-foundation.org \
    --cc=corbet@lwn.net \
    --cc=jhubbard@nvidia.com \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=linux-doc@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=ryan.roberts@arm.com \
    --cc=willy@infradead.org \
    --cc=yang.shi@linux.alibaba.com \
    --cc=ziy@nvidia.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).