From: Barry Song <21cnbao@gmail.com>
To: Yang Shi <shy828301@gmail.com>
Cc: lsf-pc@lists.linux-foundation.org, Linux-MM <linux-mm@kvack.org>
Subject: Re: [LSF/MM/BPF TOPIC]mTHP reliable allocation and reclamation
Date: Thu, 16 May 2024 10:15:13 +1200
Message-ID: <CAGsJ_4z8BTrTV0P6-vQWyVD5zotDErQNhgUbx_7kHji6oE05OA@mail.gmail.com>
In-Reply-To: <CAHbLzkrw_jVu=nRMf6-hW50-oM1QgrJuYkHwnG_XuA18FAgf5Q@mail.gmail.com>

On Thu, May 16, 2024 at 9:41 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, May 15, 2024 at 1:25 PM Barry Song <21cnbao@gmail.com> wrote:
> >
> > On Thu, May 16, 2024 at 1:49 AM Yang Shi <shy828301@gmail.com> wrote:
> > >
> > > On Tue, May 14, 2024 at 3:20 AM Barry Song <21cnbao@gmail.com> wrote:
> > > >
> > > > On Sat, May 11, 2024 at 9:18 AM Yang Shi <shy828301@gmail.com> wrote:
> > > > >
> > > > > On Thu, May 9, 2024 at 7:22 PM Barry Song <21cnbao@gmail.com> wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > I'd like to propose a session about the allocation and reclamation of
> > > > > > mTHP. This is related to Yu Zhao's
> > > > > > TAO[1] but not the same.
> > > > > >
> > > > > > OPPO has deployed mTHP-like large folios on thousands of real
> > > > > > Android devices, utilizing ARM64 CONT-PTE. However, we've
> > > > > > encountered challenges:
> > > > > >
> > > > > > - The allocation of mTHP isn't consistently reliable; the longer a
> > > > > >   device has been running, the less certain obtaining large folios
> > > > > >   becomes. For instance, after a few hours of operation, the success
> > > > > >   rate of allocating large folios on a phone may drop to just 2%.
> > > > > >
> > > > > > - Mixing large and small folios in the same LRU list can lead to
> > > > > > mutual blocking and unpredictable
> > > > > >   latency during reclamation/allocation.
> > > > >
> > > > > I'm also curious how much large folios can improve reclamation
> > > > > efficiency. Having large folios is supposed to reduce scan time since
> > > > > there should be fewer folios on the LRU. But IIRC I haven't seen much
> > > > > data or benchmarking (particularly for real-life workloads) on this.
> > > >
> > > > Hi Yang,
> > > >
> > > > We don't have direct data on this, but Ryan's THP_SWPOUT series [1]
> > > > offers some insight:
> > > >
> > > > | alloc size |                baseline |           + this series |
> > > > |            | mm-unstable (~v6.9-rc1) |                         |
> > > > |:-----------|------------------------:|------------------------:|
> > > > | 4K Page    |                    0.0% |                    1.3% |
> > > > | 64K THP    |                  -13.6% |                   46.3% |
> > > > | 2M THP     |                   91.4% |                   89.6% |
> > > >
> > > >
> > > > I suspect the -13.6% performance decrease is due to the split
> > > > operation. Once the split is eliminated, the series shows a 46.3%
> > > > improvement, presumably because reclaiming one 64K folio costs less
> > > > than reclaiming 16 * 4K pages.
> > >
> > > Thank you. Actually I care about 4k vs 64k vs 256k ...
> > >
> > > I did a simple test: call MADV_PAGEOUT on 4G of memory with the
> > > swapout optimization, then measure the time spent in madvise(). The
> > > time was reduced by ~23% going from 4k to 64k, but there was no
> > > noticeable further reduction between 64k and larger sizes.
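
[ For anyone who wants to reproduce this kind of measurement, a minimal
  userspace sketch of such a test might look like the below. It is only an
  illustration, not Yang's actual test; it assumes a 64-bit system, Linux
  >= 5.4 for MADV_PAGEOUT, more than 4GiB of RAM and enough swap (e.g.
  zram) configured, and the mTHP size exercised is controlled separately
  via the transparent_hugepage sysfs knobs. ]

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <time.h>

#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21		/* from <linux/mman.h>, Linux >= 5.4 */
#endif

#define SIZE (4UL << 30)	/* 4 GiB */

int main(void)
{
	struct timespec t0, t1;

	char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Touch every byte so the range is actually populated. */
	memset(p, 0xa5, SIZE);

	clock_gettime(CLOCK_MONOTONIC, &t0);

	/* Ask the kernel to reclaim (swap out) the whole range. */
	if (madvise(p, SIZE, MADV_PAGEOUT)) {
		perror("madvise(MADV_PAGEOUT)");
		return 1;
	}

	clock_gettime(CLOCK_MONOTONIC, &t1);

	printf("MADV_PAGEOUT of 4GiB took %.3f s\n",
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);

	munmap(p, SIZE);
	return 0;
}
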
> >
> > What do you see if you do a perf analysis? I suspect that even with
> > larger folios, try_to_unmap_one() still iterates through the PTEs
> > individually.
>
> Yes, I think so.
>
> > If we're able to batch the unmapping process for the entire folio, we might
> > observe improved performance.
>
> I profiled my benchmark and didn't see try_to_unmap show up as a hot
> spot. The time is actually spent in zram I/O.
>
> But batching try_to_unmap() may show some improvement. Did you do that
> in your kernel? It should be worth exploring.

Not at the moment. However, we've experimented with compressing large
folios at larger granularities, such as 64KiB [1]. This has yielded
significant reductions in CPU usage and better compression ratios.

The granularity can be adjusted through the ZSMALLOC_MULTI_PAGES_ORDER
setting; the default value is 4 (i.e. 16 pages, or 64KiB with 4KiB
pages).

Without our patch, zRAM compresses large folios at 4KiB granularity,
iterating over each subpage.

[1] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
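
As a rough userspace illustration of why the larger granularity helps (a
toy comparison using zlib's compress(), not the zram/zsmalloc code from
[1]): compressing one 64KiB buffer in a single call gives the compressor
more context and amortizes per-call overhead, compared with compressing 16
independent 4KiB subpages.

/*
 * Toy comparison (zlib, not the actual zram/zsmalloc code): compress a
 * 64KiB buffer in one call vs. as 16 independent 4KiB chunks. Build with
 * "cc demo.c -lz". The exact numbers depend on the data; this only
 * illustrates the mechanism and is not a benchmark of the series in [1].
 */
#include <stdio.h>
#include <stdlib.h>
#include <zlib.h>

#define PAGE_4K		4096UL
#define FOLIO_64K	(16 * PAGE_4K)

int main(void)
{
	unsigned char *src = malloc(FOLIO_64K);
	uLong bound = compressBound(FOLIO_64K);
	unsigned char *dst = malloc(bound);
	uLongf whole = bound, total_4k = 0;

	if (!src || !dst)
		return 1;

	/* Somewhat compressible, deterministic test data. */
	srand(0);
	for (unsigned long i = 0; i < FOLIO_64K; i++)
		src[i] = (unsigned char)(rand() & 0x0f);

	/* Case 1: compress the whole 64KiB folio in one call. */
	if (compress(dst, &whole, src, FOLIO_64K) != Z_OK)
		return 1;

	/* Case 2: compress the same data as 16 separate 4KiB subpages. */
	for (int i = 0; i < 16; i++) {
		uLongf out = bound;

		if (compress(dst, &out, src + i * PAGE_4K, PAGE_4K) != Z_OK)
			return 1;
		total_4k += out;
	}

	printf("64KiB in one call: %lu bytes\n", (unsigned long)whole);
	printf("16 x 4KiB calls  : %lu bytes\n", (unsigned long)total_4k);

	free(src);
	free(dst);
	return 0;
}
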

>
> >
> > >
> > > Actually I saw the same pattern (performance doesn't scale with page
> > > size beyond 64K) with some real-life workload benchmarks. I'm going to
> > > talk about it at today's LSF/MM.
> > >
> > > >
> > > > However, at present, on real Android devices we are observing nearly
> > > > 100% occurrence of anon_thp_swpout_fallback after the device has been
> > > > running for several hours [2].
> > > >
> > > > Hence, without measures to mitigate swap fragmentation, we are likely
> > > > to see a regression instead of an improvement.
> > > >
> > > > [1] https://lore.kernel.org/all/20240408183946.2991168-1-ryan.roberts@arm.com/
> > > > [2] https://lore.kernel.org/lkml/CAGsJ_4zAcJkuW016Cfi6wicRr8N9X+GJJhgMQdSMp+Ah+NSgNQ@mail.gmail.com/
> > > >
> > > > >
> > > > > >
> > > > > >   For instance, if you need large folios, the LRU list's tail could
> > > > > >   be filled with small folios.
> > > > > >   LRU (LF = large folio, SF = small folio):
> > > > > >
> > > > > >    LF - LF - LF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF - SF
> > > > > >
> > > > > >   You might end up reclaiming many small folios yet still struggle
> > > > > >   to allocate large folios. The inverse can occur when the LRU
> > > > > >   list's tail is populated with large folios:
> > > > > >
> > > > > >    SF - SF - SF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF - LF
> > > > > >
> > > > > > In OPPO's products, we allocate dedicated pageblocks solely for
> > > > > > large folio allocation, and we've modified the LRU mechanism to
> > > > > > maintain dual LRUs: one for small folios and another for large
> > > > > > ones. The dedicated pageblocks provide a basic guarantee that large
> > > > > > folios can be allocated, while segregating small and large folios
> > > > > > into two LRUs ensures that each can be reclaimed efficiently for
> > > > > > its respective requests. However, the implementation lacks
> > > > > > aesthetic appeal and is tailored primarily to our products, so it
> > > > > > isn't fully upstreamable.
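
[ To make the dual-LRU policy above concrete, here is a toy userspace model
  of the selection logic. It is only an illustration; it is not OPPO's
  kernel implementation, and all names in it are made up. ]

/*
 * Toy model of the "dual LRU" idea: one LRU for small (4KiB) folios and
 * one for large (e.g. 64KiB) folios, so that reclaim triggered by a
 * large-folio allocation doesn't have to churn through a tail full of
 * small folios (and vice versa).
 */
#include <stdio.h>
#include <stdlib.h>

struct folio {
	int id;
	int nr_pages;		/* 1 for a small folio, 16 for a 64KiB folio */
	struct folio *next;
};

struct lru {
	struct folio *head;	/* coldest end: reclaim candidates come from here */
	struct folio *tail;	/* hottest end: newly added folios go here */
};

static void lru_add(struct lru *lru, struct folio *f)
{
	f->next = NULL;
	if (lru->tail)
		lru->tail->next = f;
	else
		lru->head = f;
	lru->tail = f;
}

static struct folio *lru_reclaim_one(struct lru *lru)
{
	struct folio *victim = lru->head;

	if (victim) {
		lru->head = victim->next;
		if (!lru->head)
			lru->tail = NULL;
	}
	return victim;
}

/* Pick the LRU that matches the allocation we are trying to satisfy. */
static struct folio *reclaim_for(struct lru *small, struct lru *large,
				 int want_large)
{
	return lru_reclaim_one(want_large ? large : small);
}

int main(void)
{
	struct lru small = { 0 }, large = { 0 };

	/* Populate both LRUs with a mix of folio sizes. */
	for (int i = 0; i < 8; i++) {
		struct folio *f = calloc(1, sizeof(*f));

		f->id = i;
		f->nr_pages = (i & 1) ? 16 : 1;
		lru_add(f->nr_pages > 1 ? &large : &small, f);
	}

	/* A large-folio allocation failed: reclaim only from the large LRU. */
	struct folio *v = reclaim_for(&small, &large, 1);

	printf("reclaimed folio %d (%d pages)\n", v->id, v->nr_pages);
	return 0;
}
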
> > > > > >
> > > > > > You can obtain the architectural diagram of OPPO's approach from link[2].
> > > > > >
> > > > > > Therefore, I plan to:
> > > > > >
> > > > > > - Introduce the architecture of OPPO's mTHP-like approach,
> > > > > >   including the additional optimizations we've made to address swap
> > > > > >   fragmentation and improve swap performance, such as dual-zRAM and
> > > > > >   compression/decompression of large folios [3].
> > > > > >
> > > > > > - Present OPPO's use of dedicated pageblocks and a dual-LRU system
> > > > > >   for mTHP.
> > > > > >
> > > > > > - Share our observations from employing Yu Zhao's TAO on Pixel 6
> > > > > >   phones.
> > > > > >
> > > > > > - Discuss our future direction: are we leaning towards TAO or
> > > > > >   dedicated pageblocks? If we opt for pageblocks, how do we plan to
> > > > > >   resolve the LRU issue?
> > > > > >
> > > > > > [1] https://lore.kernel.org/linux-mm/20240229183436.4110845-1-yuzhao@google.com/
> > > > > > [2] https://github.com/21cnbao/mTHP/blob/main/largefoliosarch.png
> > > > > > [3] https://lore.kernel.org/linux-mm/20240327214816.31191-1-21cnbao@gmail.com/
> > > > > >
> >
Thanks
Barry

