From: Barry Song <21cnbao@gmail.com>
To: Ryan Roberts <ryan.roberts@arm.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	David Hildenbrand <david@redhat.com>,
	 Matthew Wilcox <willy@infradead.org>,
	Huang Ying <ying.huang@intel.com>, Gao Xiang <xiang@kernel.org>,
	 Yu Zhao <yuzhao@google.com>, Yang Shi <shy828301@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	 Kefeng Wang <wangkefeng.wang@huawei.com>,
	Chris Li <chrisl@kernel.org>,  Lance Yang <ioworker0@gmail.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v7 5/7] mm: swap: Allow storage of all mTHP orders
Date: Mon, 13 May 2024 21:24:13 +1200
Message-ID: <CAGsJ_4zEbqkEwzG0p-svwBA8obY0fSGqqthH7guc5qcxodM8hg@mail.gmail.com>
In-Reply-To: <17b4f026-d734-4610-8517-d83081f75ed4@arm.com>

On Mon, May 13, 2024 at 8:43 PM Ryan Roberts <ryan.roberts@arm.com> wrote:
>
> On 13/05/2024 08:30, Barry Song wrote:
> > On Tue, Apr 9, 2024 at 6:40 AM Ryan Roberts <ryan.roberts@arm.com> wrote:
> >>
> >> Multi-size THP enables performance improvements by allocating large,
> >> pte-mapped folios for anonymous memory. However, I've observed that on an
> >> arm64 system running a parallel workload (e.g. kernel compilation)
> >> across many cores, under high memory pressure, performance regresses. This
> >> is due to bottlenecking on the increased number of TLBIs issued because of
> >> all the extra folio splitting when the large folios are swapped out.
> >>
> >> Therefore, solve this regression by adding support for swapping out mTHP
> >> without needing to split the folio, just like is already done for
> >> PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled,
> >> and when the swap backing store is a non-rotating block device. These
> >> are the same constraints as for the existing PMD-sized THP swap-out
> >> support.
> >>
> >> Note that no attempt is made to swap-in (m)THP here - this is still done
> >> page-by-page, like for PMD-sized THP. But swapping-out mTHP is a
> >> prerequisite for swapping-in mTHP.
> >>
> >> The main change here is to improve the swap entry allocator so that it
> >> can allocate any power-of-2 number of contiguous entries between [1, (1
> >> << PMD_ORDER)]. This is done by allocating a cluster for each distinct
> >> order and allocating sequentially from it until the cluster is full.
> >> This ensures that we don't need to search the map and we get no
> >> fragmentation due to alignment padding for different orders in the
> >> cluster. If there is no current cluster for a given order, we attempt to
> >> allocate a free cluster from the list. If there are no free clusters, we
> >> fail the allocation and the caller can fall back to splitting the folio
> >> and allocating individual entries (as per the existing PMD-sized THP
> >> fallback).
> >>
> >> The per-order current clusters are maintained per-cpu using the existing
> >> infrastructure. This is done to avoid interleaving pages from different
> >> tasks, which would prevent IO from being batched. This is already done for
> >> the order-0 allocations, so we follow the same pattern.
> >>
> >> As is done for order-0 per-cpu clusters, the scanner now can steal
> >> order-0 entries from any per-cpu-per-order reserved cluster. This
> >> ensures that when the swap file is getting full, space doesn't get tied
> >> up in the per-cpu reserves.
> >>
> >> This change only modifies swap to be able to accept mTHPs of any order. It
> >> doesn't change the callers to elide doing the actual split; that will be
> >> done in separate changes.
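
(For anyone else following along, this is how I read the allocation policy
described above. The snippet below is only a toy model I put together to
convince myself of the behaviour; the names, sizes and structure are all made
up and it is not the kernel code.)

/*
 * Toy model of the per-cpu, per-order cluster policy: each order keeps its
 * own current cluster and hands out 1 << order contiguous slots from it.
 * When the cluster is exhausted we take a whole cluster off the free list;
 * if none is left, the caller falls back to splitting the folio and
 * allocating order-0 entries.
 */
#include <stdbool.h>
#include <stdio.h>

#define CLUSTER_SLOTS	512	/* one PMD worth of 4KiB slots */
#define MAX_ORDER	9	/* up to PMD_ORDER */

static bool have_cluster[MAX_ORDER + 1];      /* current cluster per order */
static unsigned int next_slot[MAX_ORDER + 1]; /* next free slot within it  */
static unsigned int free_clusters = 4;        /* pretend 4 free clusters   */

/* Return the first slot of a run of (1 << order) slots, or -1 on failure. */
static int toy_alloc(int order)
{
	unsigned int nr = 1u << order;

	if (!have_cluster[order] || next_slot[order] + nr > CLUSTER_SLOTS) {
		if (!free_clusters)
			return -1;	/* caller splits the folio instead */
		free_clusters--;
		have_cluster[order] = true;
		next_slot[order] = 0;
	}

	/*
	 * All allocations from this cluster are the same size, so runs are
	 * naturally aligned and no padding is ever wasted.
	 */
	next_slot[order] += nr;
	return (int)(next_slot[order] - nr);
}

int main(void)
{
	printf("first order-4 run at slot %d\n", toy_alloc(4));	/* 0  */
	printf("next order-4 run at slot %d\n", toy_alloc(4));	/* 16 */
	return 0;
}
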
>
> [...]
>
> >
> > Hi Ryan,
> >
> > Sorry for bringing up an old thread.
>
> No problem - thanks for the report!
>
> >
> > During the initial hour of using an Android phone with 64KiB mTHP, we
> > noticed that the anon_swpout_fallback rate was less than 10%. However,
> > after several hours of phone usage, we observed a significant increase
> > in the anon_swpout_fallback rate, reaching 100%.
>
> I suspect this is due to fragmentation of the clusters; if there is just one
> page left in a cluster then the cluster can't be freed, and once the cluster
> free list is empty a new cluster allocation will fail, causing fallback to
> order-0.
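
To spell out the arithmetic behind that (a back-of-the-envelope illustration
only; it assumes 4KiB base pages and CONFIG_THP_SWAP, so a cluster covers one
PMD, i.e. 512 slots, and a 64KiB mTHP needs an order-4 run of 16 slots):

#include <stdio.h>

int main(void)
{
	unsigned int cluster_slots = 512;	/* 2MiB cluster / 4KiB slots */
	unsigned int mthp_slots = 1u << 4;	/* 64KiB mTHP / 4KiB pages   */

	/*
	 * A cluster only returns to the free list once every slot in it is
	 * free, so a single long-lived order-0 entry strands the rest of the
	 * cluster as far as new high-order allocations are concerned.
	 */
	printf("one stale slot pins %u free slots (room for %u x 64KiB mTHPs)\n",
	       cluster_slots - 1, (cluster_slots - 1) / mthp_slots);
	return 0;
}

So once every cluster holds at least one long-lived entry and the free list is
empty, every 64KiB swap-out falls back to splitting, which matches the 100%
anon_swpout_fallback rate we end up seeing.
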
>
> >
> > As I checked the code of scan_swap_map_try_ssd_cluster(),
> >
> > static bool scan_swap_map_try_ssd_cluster(struct swap_info_struct *si,
> >         unsigned long *offset, unsigned long *scan_base, int order)
> > {
> >         unsigned int nr_pages = 1 << order;
> >         struct percpu_cluster *cluster;
> >         struct swap_cluster_info *ci;
> >         unsigned int tmp, max;
> >
> > new_cluster:
> >         cluster = this_cpu_ptr(si->percpu_cluster);
> >         tmp = cluster->next[order];
> >         if (tmp == SWAP_NEXT_INVALID) {
> >                 if (!cluster_list_empty(&si->free_clusters)) {
> >                         tmp = cluster_next(&si->free_clusters.head) *
> >                                         SWAPFILE_CLUSTER;
> >                 } else if (!cluster_list_empty(&si->discard_clusters)) {
> >                         /*
> >                          * we don't have free cluster but have some clusters in
> >                          * discarding, do discard now and reclaim them, then
> >                          * reread cluster_next_cpu since we dropped si->lock
> >                          */
> >                         swap_do_scheduled_discard(si);
> >                         *scan_base = this_cpu_read(*si->cluster_next_cpu);
> >                         *offset = *scan_base;
> >                         goto new_cluster;
> >                 } else
> >                         return false;
> >         }
> > ...
> >
> > }
> >
> > Considering the cluster_list_empty() checks, is it really necessary to have
> > a free cluster in order to get a contiguous allocation of swap slots for
> > large folio swap-out?
>
> Yes, currently that is done by design; if we can't allocate a free cluster then
> we only scan for free space in an already allocated cluster for order-0
> allocations. I did this for a few reasons:
>
> 1: Simplicity.
>
> 2: Keep behavior the same as PMD-order allocations, which are never scanned
> (although the cluster is the same size as the PMD so scanning would be pointless
> there - so perhaps this is not a good argument for not scanning smaller high
> orders).
>
> 3: If scanning for a high order fails then we would fall back to order-0 and
> scan again, so I was trying to avoid the potential for 2 scans (although once
> you split the page, you'll end up scanning per-page, so perhaps it's not a real
> argument either).
>
> > For instance, if numerous clusters still have ample free swap slots, could
> > we potentially miss out on them because the slower scan is never run?
>
> I think it would definitely be possible to add support for scanning high orders
> and from memory, I don't think it would be too difficult. Based on your
> experience, it sounds like this would be valuable.
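
For what it's worth, the kind of scan I had in mind is something like the
sketch below. This is untested pseudo-C, every name is made up, and it glosses
over locking entirely; the idea is just to look for a naturally aligned free
run of 1 << order slots inside already-allocated clusters before giving up and
splitting the folio:

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Is this slot recorded as in use in the (hypothetical) usage bitmap? */
static bool slot_in_use(const unsigned char *map, size_t slot)
{
	return map[slot / 8] & (1u << (slot % 8));
}

/*
 * Return the first slot of an aligned free run of (1 << order) slots, or
 * (size_t)-1 if no such run exists. Stepping by the run size keeps runs
 * naturally aligned and avoids re-checking overlapping windows.
 */
static size_t scan_for_high_order(const unsigned char *map, size_t nr_slots,
				  unsigned int order)
{
	size_t nr = (size_t)1 << order;

	for (size_t base = 0; base + nr <= nr_slots; base += nr) {
		size_t i;

		for (i = 0; i < nr; i++) {
			if (slot_in_use(map, base + i))
				break;
		}
		if (i == nr)
			return base;	/* found an aligned free run */
	}
	return (size_t)-1;
}

int main(void)
{
	unsigned char map[512 / 8] = { 0 };

	map[0] = 0x01;	/* pretend slot 0 holds a stale order-0 entry */
	printf("order-4 run found at slot %zu\n",
	       scan_for_high_order(map, 512, 4));	/* expect 16 */
	return 0;
}
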
>
> I'm going to be out on paternity leave for 3 weeks from the end of today, so I
> won't personally be able to do this until I get back. I might find some time to
> review if you were to post something, though :)

Congratulations on the arrival of your precious little one! Forget about swap
and mTHP, and enjoy your time with the family :-)

>
> >
> > I'm not saying your patchset has problems, just that I have some questions.
>
> Let's call it "opportunity for further improvement" rather than problems. :)
>
> I suspect swap-in of large folios may help reduce the fragmentation a bit since
> we are less likely to keep parts of a previously swapped-out mTHP in swap.
>
> Also, I understand that Chris Li has been doing some thinking around an
> indirection layer which would remove the requirement for pages of a large folio
> to be stored contiguously in the swap file. I think he is planning to talk about
> that at LSFMM? (which I sadly won't be attending).
>
> Thanks,
> Ryan
>
> >

Thanks
Barry


Thread overview: 23+ messages
2024-04-08 18:39 [PATCH v7 0/7] Swap-out mTHP without splitting Ryan Roberts
2024-04-08 18:39 ` [PATCH v7 1/7] mm: swap: Remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags Ryan Roberts
2024-04-08 18:39 ` [PATCH v7 2/7] mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache() Ryan Roberts
2024-04-09  7:34   ` David Hildenbrand
2024-04-09  8:51   ` Barry Song
2024-04-09  9:22     ` Barry Song
2024-04-09  9:24       ` David Hildenbrand
2024-04-09  9:41         ` Barry Song
2024-04-09  9:55           ` Ryan Roberts
2024-04-09 10:29             ` Barry Song
2024-04-09 10:42             ` David Hildenbrand
2024-04-09 11:18   ` [PATCH] FIXUP: " Ryan Roberts
2024-04-08 18:39 ` [PATCH v7 3/7] mm: swap: Simplify struct percpu_cluster Ryan Roberts
2024-04-08 18:39 ` [PATCH v7 4/7] mm: swap: Update get_swap_pages() to take folio order Ryan Roberts
2024-04-09  7:36   ` David Hildenbrand
2024-04-08 18:39 ` [PATCH v7 5/7] mm: swap: Allow storage of all mTHP orders Ryan Roberts
2024-05-13  7:30   ` Barry Song
2024-05-13  8:43     ` Ryan Roberts
2024-05-13  9:24       ` Barry Song [this message]
2024-04-08 18:39 ` [PATCH v7 6/7] mm: vmscan: Avoid split during shrink_folio_list() Ryan Roberts
2024-04-08 18:39 ` [PATCH v7 7/7] mm: madvise: Avoid split during MADV_PAGEOUT and MADV_COLD Ryan Roberts
2024-06-03 21:18 ` [PATCH v7 0/7] Swap-out mTHP without splitting Yosry Ahmed
2024-06-03 22:01   ` Zi Yan
