Linux-Fsdevel Archive mirror
 help / color / mirror / Atom feed
From: "Huang, Ying" <ying.huang@intel.com>
To: Kairui Song <ryncsn@gmail.com>
Cc: linux-mm@kvack.org,  Kairui Song <kasong@tencent.com>,
	 Andrew Morton <akpm@linux-foundation.org>,
	 Matthew Wilcox <willy@infradead.org>,
	 Chris Li <chrisl@kernel.org>,
	 Barry Song <v-songbaohua@oppo.com>,
	 Ryan Roberts <ryan.roberts@arm.com>,  Neil Brown <neilb@suse.de>,
	 Minchan Kim <minchan@kernel.org>,
	 David Hildenbrand <david@redhat.com>,
	 Hugh Dickins <hughd@google.com>,
	 Yosry Ahmed <yosryahmed@google.com>,
	linux-fsdevel@vger.kernel.org,  linux-kernel@vger.kernel.org
Subject: Re: [PATCH v5 12/12] mm/swap: reduce swap cache search space
Date: Sat, 11 May 2024 13:54:40 +0800	[thread overview]
Message-ID: <87jzk0hp0f.fsf@yhuang6-desk2.ccr.corp.intel.com> (raw)
In-Reply-To: <20240510114747.21548-13-ryncsn@gmail.com> (Kairui Song's message of "Fri, 10 May 2024 19:47:47 +0800")

Kairui Song <ryncsn@gmail.com> writes:

> From: Kairui Song <kasong@tencent.com>
>
> Currently we use one swap_address_space for every 64M chunk to reduce lock
> contention, this is like having a set of smaller swap files inside one
> swap device. But when doing swap cache look up or insert, we are
> still using the offset of the whole large swap device. This is OK for
> correctness, as the offset (key) is unique.
>
> But Xarray is specially optimized for small indexes, it creates the
> radix tree levels lazily to be just enough to fit the largest key
> stored in one Xarray. So we are wasting tree nodes unnecessarily.
>
> For 64M chunk it should only take at most 3 levels to contain everything.
> But if we are using the offset from the whole swap device, the offset (key)
> value will be way beyond 64M, and so will the tree level.
>
> Optimize this by using a new helper swap_cache_index to get a swap
> entry's unique offset in its own 64M swap_address_space.
>
> I see a ~1% performance gain in benchmark and actual workload with
> high memory pressure.
>
> Test with `time memhog 128G` inside a 8G memcg using 128G swap (ramdisk
> with SWP_SYNCHRONOUS_IO dropped, tested 3 times, results are stable. The
> test result is similar but the improvement is smaller if SWP_SYNCHRONOUS_IO
> is enabled, as swap out path can never skip swap cache):
>
> Before:
> 6.07user 250.74system 4:17.26elapsed 99%CPU (0avgtext+0avgdata 8373376maxresident)k
> 0inputs+0outputs (55major+33555018minor)pagefaults 0swaps
>
> After (1.8% faster):
> 6.08user 246.09system 4:12.58elapsed 99%CPU (0avgtext+0avgdata 8373248maxresident)k
> 0inputs+0outputs (54major+33555027minor)pagefaults 0swaps
>
> Similar result with MySQL and sysbench using swap:
> Before:
> 94055.61 qps
>
> After (0.8% faster):
> 94834.91 qps
>
> Radix tree slab usage is also very slightly lower.
>
> Signed-off-by: Kairui Song <kasong@tencent.com>

LGTM, Thanks!

Reviewed-by: "Huang, Ying" <ying.huang@intel.com>

> ---
>  mm/huge_memory.c |  2 +-
>  mm/memcontrol.c  |  2 +-
>  mm/mincore.c     |  2 +-
>  mm/shmem.c       |  2 +-
>  mm/swap.h        | 15 +++++++++++++++
>  mm/swap_state.c  | 17 +++++++++--------
>  mm/swapfile.c    |  6 +++---
>  7 files changed, 31 insertions(+), 15 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 317de2afd371..fcc0e86a2589 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2838,7 +2838,7 @@ static void __split_huge_page(struct page *page, struct list_head *list,
>  	split_page_memcg(head, order, new_order);
>  
>  	if (folio_test_anon(folio) && folio_test_swapcache(folio)) {
> -		offset = swp_offset(folio->swap);
> +		offset = swap_cache_index(folio->swap);
>  		swap_cache = swap_address_space(folio->swap);
>  		xa_lock(&swap_cache->i_pages);
>  	}
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index d127c9c5fabf..024aeb64d0be 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -6153,7 +6153,7 @@ static struct page *mc_handle_swap_pte(struct vm_area_struct *vma,
>  	 * Because swap_cache_get_folio() updates some statistics counter,
>  	 * we call find_get_page() with swapper_space directly.
>  	 */
> -	page = find_get_page(swap_address_space(ent), swp_offset(ent));
> +	page = find_get_page(swap_address_space(ent), swap_cache_index(ent));
>  	entry->val = ent.val;
>  
>  	return page;
> diff --git a/mm/mincore.c b/mm/mincore.c
> index dad3622cc963..e31cf1bde614 100644
> --- a/mm/mincore.c
> +++ b/mm/mincore.c
> @@ -139,7 +139,7 @@ static int mincore_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
>  			} else {
>  #ifdef CONFIG_SWAP
>  				*vec = mincore_page(swap_address_space(entry),
> -						    swp_offset(entry));
> +						    swap_cache_index(entry));
>  #else
>  				WARN_ON(1);
>  				*vec = 1;
> diff --git a/mm/shmem.c b/mm/shmem.c
> index fa2a0ed97507..326315c12feb 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -1756,7 +1756,7 @@ static int shmem_replace_folio(struct folio **foliop, gfp_t gfp,
>  
>  	old = *foliop;
>  	entry = old->swap;
> -	swap_index = swp_offset(entry);
> +	swap_index = swap_cache_index(entry);
>  	swap_mapping = swap_address_space(entry);
>  
>  	/*
> diff --git a/mm/swap.h b/mm/swap.h
> index 82023ab93205..2c0e96272d49 100644
> --- a/mm/swap.h
> +++ b/mm/swap.h
> @@ -27,6 +27,7 @@ void __swap_writepage(struct folio *folio, struct writeback_control *wbc);
>  /* One swap address space for each 64M swap space */
>  #define SWAP_ADDRESS_SPACE_SHIFT	14
>  #define SWAP_ADDRESS_SPACE_PAGES	(1 << SWAP_ADDRESS_SPACE_SHIFT)
> +#define SWAP_ADDRESS_SPACE_MASK		(SWAP_ADDRESS_SPACE_PAGES - 1)
>  extern struct address_space *swapper_spaces[];
>  #define swap_address_space(entry)			    \
>  	(&swapper_spaces[swp_type(entry)][swp_offset(entry) \
> @@ -40,6 +41,15 @@ static inline loff_t swap_dev_pos(swp_entry_t entry)
>  	return ((loff_t)swp_offset(entry)) << PAGE_SHIFT;
>  }
>  
> +/*
> + * Return the swap cache index of the swap entry.
> + */
> +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> +{
> +	BUILD_BUG_ON((SWP_OFFSET_MASK | SWAP_ADDRESS_SPACE_MASK) != SWP_OFFSET_MASK);
> +	return swp_offset(entry) & SWAP_ADDRESS_SPACE_MASK;
> +}
> +
>  void show_swap_cache_info(void);
>  bool add_to_swap(struct folio *folio);
>  void *get_shadow_from_swap_cache(swp_entry_t entry);
> @@ -86,6 +96,11 @@ static inline struct address_space *swap_address_space(swp_entry_t entry)
>  	return NULL;
>  }
>  
> +static inline pgoff_t swap_cache_index(swp_entry_t entry)
> +{
> +	return 0;
> +}
> +
>  static inline void show_swap_cache_info(void)
>  {
>  }
> diff --git a/mm/swap_state.c b/mm/swap_state.c
> index 642c30d8376c..6e86c759dc1d 100644
> --- a/mm/swap_state.c
> +++ b/mm/swap_state.c
> @@ -72,7 +72,7 @@ void show_swap_cache_info(void)
>  void *get_shadow_from_swap_cache(swp_entry_t entry)
>  {
>  	struct address_space *address_space = swap_address_space(entry);
> -	pgoff_t idx = swp_offset(entry);
> +	pgoff_t idx = swap_cache_index(entry);
>  	void *shadow;
>  
>  	shadow = xa_load(&address_space->i_pages, idx);
> @@ -89,7 +89,7 @@ int add_to_swap_cache(struct folio *folio, swp_entry_t entry,
>  			gfp_t gfp, void **shadowp)
>  {
>  	struct address_space *address_space = swap_address_space(entry);
> -	pgoff_t idx = swp_offset(entry);
> +	pgoff_t idx = swap_cache_index(entry);
>  	XA_STATE_ORDER(xas, &address_space->i_pages, idx, folio_order(folio));
>  	unsigned long i, nr = folio_nr_pages(folio);
>  	void *old;
> @@ -144,7 +144,7 @@ void __delete_from_swap_cache(struct folio *folio,
>  	struct address_space *address_space = swap_address_space(entry);
>  	int i;
>  	long nr = folio_nr_pages(folio);
> -	pgoff_t idx = swp_offset(entry);
> +	pgoff_t idx = swap_cache_index(entry);
>  	XA_STATE(xas, &address_space->i_pages, idx);
>  
>  	xas_set_update(&xas, workingset_update_node);
> @@ -253,13 +253,14 @@ void clear_shadow_from_swap_cache(int type, unsigned long begin,
>  
>  	for (;;) {
>  		swp_entry_t entry = swp_entry(type, curr);
> +		unsigned long index = curr & SWAP_ADDRESS_SPACE_MASK;
>  		struct address_space *address_space = swap_address_space(entry);
> -		XA_STATE(xas, &address_space->i_pages, curr);
> +		XA_STATE(xas, &address_space->i_pages, index);
>  
>  		xas_set_update(&xas, workingset_update_node);
>  
>  		xa_lock_irq(&address_space->i_pages);
> -		xas_for_each(&xas, old, end) {
> +		xas_for_each(&xas, old, min(index + (end - curr), SWAP_ADDRESS_SPACE_PAGES)) {
>  			if (!xa_is_value(old))
>  				continue;
>  			xas_store(&xas, NULL);
> @@ -350,7 +351,7 @@ struct folio *swap_cache_get_folio(swp_entry_t entry,
>  {
>  	struct folio *folio;
>  
> -	folio = filemap_get_folio(swap_address_space(entry), swp_offset(entry));
> +	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
>  	if (!IS_ERR(folio)) {
>  		bool vma_ra = swap_use_vma_readahead();
>  		bool readahead;
> @@ -420,7 +421,7 @@ struct folio *filemap_get_incore_folio(struct address_space *mapping,
>  	si = get_swap_device(swp);
>  	if (!si)
>  		return ERR_PTR(-ENOENT);
> -	index = swp_offset(swp);
> +	index = swap_cache_index(swp);
>  	folio = filemap_get_folio(swap_address_space(swp), index);
>  	put_swap_device(si);
>  	return folio;
> @@ -447,7 +448,7 @@ struct folio *__read_swap_cache_async(swp_entry_t entry, gfp_t gfp_mask,
>  		 * that would confuse statistics.
>  		 */
>  		folio = filemap_get_folio(swap_address_space(entry),
> -						swp_offset(entry));
> +					  swap_cache_index(entry));
>  		if (!IS_ERR(folio))
>  			goto got_folio;
>  
> diff --git a/mm/swapfile.c b/mm/swapfile.c
> index 0b0ae6e8c764..4f0e8b2ac8aa 100644
> --- a/mm/swapfile.c
> +++ b/mm/swapfile.c
> @@ -142,7 +142,7 @@ static int __try_to_reclaim_swap(struct swap_info_struct *si,
>  	struct folio *folio;
>  	int ret = 0;
>  
> -	folio = filemap_get_folio(swap_address_space(entry), offset);
> +	folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
>  	if (IS_ERR(folio))
>  		return 0;
>  	/*
> @@ -2158,7 +2158,7 @@ static int try_to_unuse(unsigned int type)
>  	       (i = find_next_to_unuse(si, i)) != 0) {
>  
>  		entry = swp_entry(type, i);
> -		folio = filemap_get_folio(swap_address_space(entry), i);
> +		folio = filemap_get_folio(swap_address_space(entry), swap_cache_index(entry));
>  		if (IS_ERR(folio))
>  			continue;
>  
> @@ -3476,7 +3476,7 @@ EXPORT_SYMBOL_GPL(swapcache_mapping);
>  
>  pgoff_t __folio_swap_cache_index(struct folio *folio)
>  {
> -	return swp_offset(folio->swap);
> +	return swap_cache_index(folio->swap);
>  }
>  EXPORT_SYMBOL_GPL(__folio_swap_cache_index);

      reply	other threads:[~2024-05-11  5:56 UTC|newest]

Thread overview: 15+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-10 11:47 [PATCH v5 00/12] mm/swap: clean up and optimize swap cache index Kairui Song
2024-05-10 11:47 ` [PATCH v5 01/12] f2fs: drop usage of page_index Kairui Song
2024-05-10 11:47 ` [PATCH v5 02/12] nilfs2: " Kairui Song
2024-05-10 11:47 ` [PATCH v5 03/12] ceph: " Kairui Song
2024-05-10 11:47 ` [PATCH v5 04/12] NFS: remove nfs_page_lengthg and " Kairui Song
2024-05-10 11:47 ` [PATCH v5 05/12] cifs: drop usage of page_file_offset Kairui Song
2024-05-10 11:47 ` [PATCH v5 06/12] afs: drop usage of folio_file_pos Kairui Song
2024-05-10 11:47 ` [PATCH v5 07/12] netfs: " Kairui Song
2024-05-10 11:47 ` [PATCH v5 08/12] nfs: " Kairui Song
2024-05-10 11:47 ` [PATCH v5 09/12] mm/swap: get the swap device offset directly Kairui Song
2024-05-10 11:47 ` [PATCH v5 10/12] mm: remove page_file_offset and folio_file_pos Kairui Song
2024-05-10 11:47 ` [PATCH v5 11/12] mm: drop page_index and simplify folio_index Kairui Song
2024-05-11  5:39   ` Huang, Ying
2024-05-10 11:47 ` [PATCH v5 12/12] mm/swap: reduce swap cache search space Kairui Song
2024-05-11  5:54   ` Huang, Ying [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=87jzk0hp0f.fsf@yhuang6-desk2.ccr.corp.intel.com \
    --to=ying.huang@intel.com \
    --cc=akpm@linux-foundation.org \
    --cc=chrisl@kernel.org \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=kasong@tencent.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=minchan@kernel.org \
    --cc=neilb@suse.de \
    --cc=ryan.roberts@arm.com \
    --cc=ryncsn@gmail.com \
    --cc=v-songbaohua@oppo.com \
    --cc=willy@infradead.org \
    --cc=yosryahmed@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).