From: David Hildenbrand <david@redhat.com>
To: John Hubbard <jhubbard@nvidia.com>, Matthew Wilcox <willy@infradead.org>
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org,
linux-doc@vger.kernel.org,
Andrew Morton <akpm@linux-foundation.org>,
Jonathan Corbet <corbet@lwn.net>,
"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
Zi Yan <ziy@nvidia.com>, Yang Shi <yang.shi@linux.alibaba.com>,
Ryan Roberts <ryan.roberts@arm.com>
Subject: Re: [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared()
Date: Fri, 26 Apr 2024 08:57:36 +0200 [thread overview]
Message-ID: <c04f5611-162c-441c-b94c-79032f09e015@redhat.com> (raw)
In-Reply-To: <1a406a5f-1336-4051-a722-04b9ea2f54df@nvidia.com>
On 26.04.24 03:23, John Hubbard wrote:
> On 4/25/24 1:06 AM, David Hildenbrand wrote:
>> On 25.04.24 07:40, John Hubbard wrote:
>>> On 4/24/24 9:17 PM, Matthew Wilcox wrote:
>>>> On Wed, Apr 24, 2024 at 09:00:50PM -0700, John Hubbard wrote:
> ...
>> We'll talk more about all that at LSF/MM in the mapcount session. A spoiler:
>
> Looking forward to it. And as an aside, this year it feels like the mm
> code is changing relatively fast. So many large and small improvements
> have happened or are in progress.
Yes, it's happening at a very fast pace (and it's hard for me to get
reasonable work done while still doing that much reviewing ...).
I'll note that replacing a page-based interface with a folio-based
interface should not be shocking news in 2024, and that the issues with
mapcounts for large folios have been a recurring topic at LSF/MM and on
the mailing list.
>
>
>>
>> page_mapcount() in the context of large folios:
>> * Is a misunderstood function (e.g., page_mapcount() vs page_count()
>> checks, mapped = !page_mapcount() checks).
>> * Is a misleading function (e.g., page_mapped() == folio_mapped() but
>> page_mapcount() != folio_mapcount())
>>
>> We could just rename it to "folio_precise_page_mapcount()", but then, once we tackle the subpage mapcount optimizations (initially using a separate kernel config toggle), we'll have to teach each caller about an alternative that gets the job done, and it's not that easy to prevent further reuse around the kernel.
>>
>> If you look at linux-next, we're down to 5 page_mapcount() calls in fs/proc/, so I'll relocate it to fs/proc/internal.h to prevent any further use - once the s390x change lands in the next merge window.
>>
>> Regarding the subpage mapcount optimizations, I can further add:
>> * (un)map performance improvements for PTE-mapped THP
>> * Preparation for folio_order() > PMD_ORDER, where the current scheme
>> won't scale and needs further adjustments/complexity to even keep it
>> working
>> * Preparation for hugetlb-like vmemmap optimizations until we have
>> memdescs / dynamically allocated folios
>> * (Paving the way for partially mapping hugetlb folios that faced
>> similar issues? Not sure if that ever gets real, though)
>>
>> Is this patch ahead of its time? LSF/MM is just around the corner, and I'm planning on posting the other relevant patches in the next months.
>
> I think so, yes. There is a lot of context required to understand the
> motivation, and more required in order to figure out if it is safe,
> and if it still provides "good" behavior.
I think the motivation for removing page_mapcount() should be very clear
at this point: every single remaining user in mm/ needs to be warranted,
and the sooner it is gone, the better.
[case in point: I even have another potential user [1] in my mailbox
that should be using a folio interface, well, or PG_anon_exclusive :) ]
[1] https://lore.kernel.org/all/Zirw0uINbP6GxFiK@bender.morinfr.org/T/#u
Regarding removing subpage mapcounts, I agree: I added too many details,
which made it look harder to understand. :P
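To illustrate the page_mapcount() vs. folio_mapcount() confusion with a quick userspace toy model (plain Python, not kernel code; the names only loosely mirror the kernel API, and entire/PMD mappings are deliberately ignored):

```python
# Toy model of per-subpage vs. per-folio mapcounts for a PTE-mapped
# large folio. Not kernel code; a simplification for illustration only.

class Folio:
    def __init__(self, nr_pages):
        self.nr_pages = nr_pages
        self.page_mapcounts = [0] * nr_pages  # per-subpage PTE mapcounts

    def map_pte(self, idx):
        # PTE-map one subpage
        self.page_mapcounts[idx] += 1

    def page_mapcount(self, idx):
        # "precise" per-subpage mapcount
        return self.page_mapcounts[idx]

    def folio_mapcount(self):
        # simplified: sum of subpage mapcounts (no entire/PMD mappings)
        return sum(self.page_mapcounts)

    def folio_mapped(self):
        return self.folio_mapcount() > 0

folio = Folio(512)           # a 2M folio of 4K subpages
for i in range(1, 512):      # PTE-map every subpage except the first
    folio.map_pte(i)

# The folio is clearly mapped, yet page_mapcount() of the first
# (unmapped) subpage is 0 -- exactly the kind of mismatch that makes
# page_mapcount() misleading for large folios.
assert folio.folio_mapped()
assert folio.page_mapcount(0) == 0
assert folio.folio_mapcount() == 511
```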
>
> I still think it's mostly harmless, though, so being ahead of its time
> is not necessarily an indictment. :)
I didn't express my thought clearly: LSF/MM is just around the corner
and the discussion we are having here is the perfect preparation for
that session! :)
I don't particularly care if we merge this patch now or after the next
merge window along with the remaining page_mapcount() removal.
Discussing the impact of this change is the important piece. :)
[...]
>> Thanks for having a look!
>>
>> I'm only a bit concerned about folio_likely_mapped_shared() "false negatives" (detecting exclusive although shared), until we have a more precise folio_likely_mapped_shared() variant to not unexpectedly waste memory.
>>
>> Imagine someone were to set "khugepaged_max_ptes_shared=0", and we then have an area where (I think this is the extreme case):
>>
>> * We map 256 subpages of a 2M folio that are shared 256 times with a
>> child process.
>> * We don't map the first subpage.
>> * One PTE maps another page and is pte_write().
>> * 255 PTEs are pte_none().
>>
>> folio_likely_mapped_shared() would return "false".
>>
>> But then my thinking is:
>> * We are already wasting 256 subpages that are free in the 2M folio.
>>    Sure, we might be able to reclaim them via deferred splitting.
>> * Why would someone set khugepaged_max_ptes_shared=0 but leave
>>    khugepaged_max_ptes_none set so high that we would allow 255
>>    pte_none entries?
>> * If the child is a short-lived subprocess, we don't really care.
>> * Any further writes to unbacked/R/O PTEs in that PMD area would COW
>>    and consume memory.
>>
>> So I had to use more and more "ifs" to construct a scenario where we might end up wasting 1M of memory, at which point I decided that this is really a corner case and likely not worth worrying about.
>>
>> If we run into real issues, though, it will be easy to just inline page_mapcount() here to resolve it; but the less special-casing the better.
>>
>
> OK. I'll need to think through some more of these cases. And meanwhile, I
> was poking around from the other direction: just injection test it by
> pasting in "true" or "false", in place of calling folio_likely_mapped_shared().
> And see what changes.
Highly appreciated!
>
> The "true" test lets me fail several khugepaged selftests, while the "false"
> test just increases the counter in /proc/vmstat.
>
> That's more of a black box way of poking at it, just to have another facet
> of testing. Because it is good to ensure that we really do have test
> coverage if we're changing the code. Anyway, just ideas.
Yes, all makes sense.
I'm very interested in whether there are valid concerns that the "false
negatives" are unacceptable: that would be another argument for making
folio_likely_mapped_shared() precise. To me it's clear that we want to
make it precise, but so far I am not convinced that it is absolutely
required in the khugepaged context.
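To make the extreme case above concrete, here is a back-of-the-envelope model (plain Python, not kernel code; the heuristic below is a deliberate simplification of folio_likely_mapped_shared(), which as I understand it peeks at the first subpage):

```python
# Back-of-the-envelope model of the khugepaged "false negative" scenario.
# Not kernel code; the heuristic is a simplification for illustration.

PAGE_SIZE = 4096
NR_PAGES = 512                      # 2M folio of 4K subpages

# Per-subpage mapcounts: first subpage unmapped, subpages 1..256 mapped
# in parent + child (mapcount 2), the remaining 255 PTEs are pte_none
# (one PTE in the PMD area instead maps a different, unrelated page).
mapcounts = [0] + [2] * 256 + [0] * 255
assert len(mapcounts) == NR_PAGES

def likely_mapped_shared(mapcounts):
    # simplified heuristic: only the first subpage's mapcount is consulted
    return mapcounts[0] > 1

# "False negative": the heuristic says exclusive, although 256 subpages
# are in fact shared with the child.
assert likely_mapped_shared(mapcounts) is False
assert sum(1 for mc in mapcounts if mc > 1) == 256

# Memory already wasted inside the folio before khugepaged does anything:
unused = sum(1 for mc in mapcounts if mc == 0)
print(unused * PAGE_SIZE // (1 << 20), "MiB unused in the folio")  # 1 MiB
```

So in this model, collapsing despite the misdetection can waste at most those 256 unmapped subpages, i.e. 1M of the 2M folio.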
Thanks!
--
Cheers,
David / dhildenb
Thread overview: 9+ messages
2024-04-24 12:26 [PATCH v1] mm/khugepaged: replace page_mapcount() check by folio_likely_mapped_shared() David Hildenbrand
2024-04-24 16:28 ` Yang Shi
2024-04-24 16:36 ` David Hildenbrand
2024-04-25 4:00 ` John Hubbard
2024-04-25 4:17 ` Matthew Wilcox
2024-04-25 5:40 ` John Hubbard
2024-04-25 8:06 ` David Hildenbrand
2024-04-26 1:23 ` John Hubbard
2024-04-26 6:57 ` David Hildenbrand [this message]