* [RFC PATCH] A Summary of VMA scanning improvements explored
@ 2024-03-22 13:41 Raghavendra K T
  2024-03-22 13:41 ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Raghavendra K T
  0 siblings, 1 reply; 5+ messages in thread
From: Raghavendra K T @ 2024-03-22 13:41 UTC
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Johannes Weiner, kernel test robot,
	Raghavendra K T

I am posting a summary of the NUMA balancing (VMA scanning) improvements tried out.

(The intention is RFC only; these can be revisited in the future when someone
sees potential benefit in PATCH1 and PATCH2.)

PATCH3 has more potential for workloads that need aggressive scanning,
but it may also need migration ratelimiting.

Patchset details:
==================
PATCH 1. Increase the number of access PID history windows (information about the
tasks accessing a VMA) from 2 to 4.

Based on PeterZ's suggestion/patch.
Rationale:
- Increases the depth of the tasks' access history.
- Gives a better view of hot VMAs.
- Gives a better view of VMAs which are widely shared amongst tasks.
With that, we can take a better decision in choosing the VMAs that need to
be scanned for introducing PROT_NONE.
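
For illustration only, here is a minimal userspace sketch of the rotating
4-window access-PID bitmap idea (the multiplicative hash is just a stand-in
for the kernel's hash_32(); the names mirror the patch later in this thread):

#include <stdio.h>

#define NR_ACCESS_PID_HIST	4

static unsigned long pids_active[NR_ACCESS_PID_HIST];
static unsigned int pids_active_idx;

/* Stand-in for hash_32(pid, ilog2(BITS_PER_LONG)) */
static unsigned int pid_hash(int pid)
{
	return ((unsigned int)pid * 2654435761u) >> (32 - 6); /* 6 = ilog2(64) */
}

/* Record an access in the current window */
static void record_access(int pid)
{
	pids_active[pids_active_idx] |= 1UL << pid_hash(pid);
}

/* Advance the window; the slot becoming "current" starts empty */
static void rotate_window(void)
{
	pids_active_idx = (pids_active_idx + 1) % NR_ACCESS_PID_HIST;
	pids_active[pids_active_idx] = 0;
}

/* Did this pid access the VMA in any of the last 4 windows? */
static int accessed_in_history(int pid)
{
	unsigned long pids = 0;
	int i;

	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
		pids |= pids_active[i];

	return !!(pids & (1UL << pid_hash(pid)));
}

int main(void)
{
	record_access(1234);
	rotate_window();
	rotate_window();
	printf("seen in history: %d\n", accessed_in_history(1234)); /* 1 */
	rotate_window();
	rotate_window();	/* the window holding 1234 is now recycled */
	printf("seen in history: %d\n", accessed_in_history(1234)); /* 0 */
	return 0;
}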

PATCH 2. Increase the number of bits used to map tasks accessing a VMA from 64 to 128.

Based on a suggestion by Ingo.
Rationale:
Decreases the number of hash collisions (false positives), while the whole
information still fits in a cacheline.

This is potentially helpful when the workload involves more threads, where
collisions otherwise make tasks:
- unnecessarily scan VMAs they never accessed.
- create contention in the scan path.
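
As a back-of-the-envelope check of the collision argument (assuming a uniform
hash), the chance that at least one other of N threads lands in the same hash
bucket as a given thread is 1 - (1 - 1/B)^(N-1). A small userspace program to
compare B=64 with B=128:

#include <stdio.h>

/*
 * Probability that at least one *other* of the nthreads threads hashes
 * into the same bucket as a given thread, for a uniform hash over
 * 'buckets' buckets: 1 - (1 - 1/buckets)^(nthreads - 1).
 */
static double collision_prob(int nthreads, int buckets)
{
	double p_alone = 1.0;
	int i;

	for (i = 1; i < nthreads; i++)
		p_alone *= 1.0 - 1.0 / buckets;

	return 1.0 - p_alone;
}

int main(void)
{
	int n;

	for (n = 16; n <= 256; n *= 2)
		printf("%3d threads: 64 buckets %.2f, 128 buckets %.2f\n",
		       n, collision_prob(n, 64), collision_prob(n, 128));
	return 0;
}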

PATCH 3. Change the notion of a 256MB limit per scan to a budget of 64K PTE
updates (for 4K pages). Extend the same logic to hugetlb / THP pages.

Based on a suggestion by Mel.

Rationale: This helps to cover more memory per scan, especially when THP or
hugetlb pages are involved.
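
For intuition, a tiny userspace calculation of what the changed accounting
could cover in the best case (this ignores the separate virtual-space cap,
virtpages, that still applies in task_numa_work()):

#include <stdio.h>

int main(void)
{
	long scan_size_mb = 256;			/* sysctl_numa_balancing_scan_size */
	long pte_budget   = scan_size_mb << (20 - 12);	/* 256MB / 4KB = 65536 PTEs */
	long thp_mb       = 2;				/* one PMD-mapped THP */

	printf("PTE budget per scan        : %ld\n", pte_budget);
	printf("coverage with 4K pages     : %ld MB\n", pte_budget * 4096 / (1 << 20));
	printf("coverage if all THP (2MB)  : %ld MB\n", pte_budget * thp_mb);
	return 0;
}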

PS: Please note that all 3 are independent patches. Apologies in advance if the
patchset confuses any patching script. More comments/details will be added
for the patches of interest.

Summary of results:
==================
PATCH1 and PATCH2 give a benefit in some of the cases I ran, but they still need
a more convincing usecase / results (as of the 6.9+ kernel).

PATCH3:
Some benchmarks, such as XSBench and Hashjoin, benefit from the additional
scanning, but microbenchmarks (such as: allocate on one node, fault from another
node, and see how fast migration happens) suffer because of the aggressive
migration overhead.

Overall, PATCH3 should do better if we combine it with ratelimiting of migration
(similar to the CXL case) or tune the scan rate down when scanning is no longer
necessary (for example, I still see that VMA scanning does not slow down even
when the rate of migration has dropped or all migrations have completed).

Diffstat for each of the patches
================================
PATCH 1:

Raghavendra K T (1):
  sched/numa: Hot VMA and shared VMA optimization

 include/linux/mm.h       | 12 ++++++---
 include/linux/mm_types.h | 11 +++++---
 kernel/sched/fair.c      | 58 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 69 insertions(+), 12 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca
============================
PATCH 2:

Raghavendra K T (1):
  sched/numa: Increase the VMA accessing PID bits

 include/linux/mm.h       | 29 ++++++++++++++++++++++++++---
 include/linux/mm_types.h |  7 ++++++-
 kernel/sched/fair.c      | 21 ++++++++++++++++-----
 3 files changed, 48 insertions(+), 9 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca
===========================
PATCH 3:

Raghavendra K T (1):
  sched/numa: Convert 256MB VMA scan limit notion

 include/linux/hugetlb.h |  3 +-
 include/linux/mm.h      | 16 +++++++-
 kernel/sched/fair.c     | 15 ++++---
 mm/hugetlb.c            |  9 +++++
 mm/mempolicy.c          | 11 +++++-
 mm/mprotect.c           | 87 +++++++++++++++++++++++++++++++++--------
 6 files changed, 115 insertions(+), 26 deletions(-)

base-commit: b0546776ad3f332e215cebc0b063ba4351971cca
-- 
2.34.1



* [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization
  2024-03-22 13:41 [RFC PATCH] A Summary of VMA scanning improvements explored Raghavendra K T
@ 2024-03-22 13:41 ` Raghavendra K T
  2024-03-22 13:41   ` [RFC PATCH 2 1/1] sched/numa: Increase the VMA accessing PID bits Raghavendra K T
                     ` (2 more replies)
  0 siblings, 3 replies; 5+ messages in thread
From: Raghavendra K T @ 2024-03-22 13:41 UTC
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Johannes Weiner, kernel test robot,
	Raghavendra K T

Optimizations are based on the history of PIDs accessing a VMA.

- Increase tasks' access history windows (PeterZ) from 2 to 4.
(This patch is from Peter Zijlstra <peterz@infradead.org>.)

Idea: A task is allowed to scan a VMA if:
- the VMA was very recently accessed, as indicated by the latest
  access PIDs information (hot VMA), or
- the VMA is shared by more than 2 tasks. Here the whole history of the
  VMA's access PIDs is considered, using bitmap_weight().

Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
I will split the patchset and post it if we find this patchset useful
going forward. The first patch is from PeterZ.

 include/linux/mm.h       | 12 ++++++---
 include/linux/mm_types.h | 11 +++++---
 kernel/sched/fair.c      | 58 ++++++++++++++++++++++++++++++++++++----
 3 files changed, 69 insertions(+), 12 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..1bf1df064b60 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1744,10 +1744,14 @@ static inline int folio_xchg_access_time(struct folio *folio, int time)
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
-
-	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
-	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
-		__set_bit(pid_bit, &vma->numab_state->pids_active[1]);
+	unsigned long *pids, pid_idx;
+
+	if (vma->numab_state) {
+		pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
+		pid_idx = READ_ONCE(vma->numab_state->pids_active_idx);
+		pids = vma->numab_state->pids_active + pid_idx;
+		if (!test_bit(pid_bit, pids))
+			__set_bit(pid_bit, pids);
 	}
 }
 #else /* !CONFIG_NUMA_BALANCING */
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..050ceef1e9d5 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -574,6 +574,7 @@ struct vma_lock {
 	struct rw_semaphore lock;
 };
 
+#define NR_ACCESS_PID_HIST	4
 struct vma_numab_state {
 	/*
 	 * Initialised as time in 'jiffies' after which VMA
@@ -588,17 +589,21 @@ struct vma_numab_state {
 	 */
 	unsigned long pids_active_reset;
 
+	/* Points to current active PID tracking index. */
+	unsigned long pids_active_idx;
+
 	/*
 	 * Approximate tracking of PIDs that trapped a NUMA hinting
 	 * fault. May produce false positives due to hash collisions.
 	 *
-	 *   [0] Previous PID tracking
-	 *   [1] Current PID tracking
+	 *   [pids_active_idx - 1] Previous PID tracking
+	 *   [pids_active_idx] Current PID tracking
 	 *
+	 * Whole array is used in a rotating manner to track latest PIDs.
 	 * Window moves after next_pid_reset has expired approximately
 	 * every VMA_PID_RESET_PERIOD jiffies:
 	 */
-	unsigned long pids_active[2];
+	unsigned long pids_active[NR_ACCESS_PID_HIST];
 
 	/* MM scan sequence ID when scan first started after VMA creation */
 	int start_scan_seq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..ed329b2f4d53 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3157,9 +3157,44 @@ static void reset_ptenuma_scan(struct task_struct *p)
 	p->mm->numa_scan_offset = 0;
 }
 
+static inline bool vma_test_access_pid_history(struct vm_area_struct *vma)
+{
+	unsigned int i, pid_bit;
+	unsigned long pids = 0;
+
+	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
+
+	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
+		pids  |= vma->numab_state->pids_active[i];
+
+	return test_bit(pid_bit, &pids);
+}
+
+static inline bool vma_accessed_recent(struct vm_area_struct *vma)
+{
+	unsigned long *pids, pid_idx;
+
+	pid_idx = vma->numab_state->pids_active_idx;
+	pids = vma->numab_state->pids_active + pid_idx;
+
+	return (bitmap_weight(pids, BITS_PER_LONG) >= 1);
+}
+
+#define SHARED_VMA_THRESH	3
+
+static inline bool vma_shared_access(struct vm_area_struct *vma)
+{
+	int i;
+	unsigned long pids = 0;
+
+	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
+		pids  |= vma->numab_state->pids_active[i];
+
+	return (bitmap_weight(&pids, BITS_PER_LONG) >= SHARED_VMA_THRESH);
+}
+
 static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	unsigned long pids;
 	/*
 	 * Allow unconditional access first two times, so that all the (pages)
 	 * of VMAs get prot_none fault introduced irrespective of accesses.
@@ -3169,8 +3204,16 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 	if ((READ_ONCE(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2)
 		return true;
 
-	pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
-	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+	/* Check if the current task had historically accessed VMA. */
+	if (vma_test_access_pid_history(vma))
+		return true;
+
+	/* Check at least one task had accessed VMA recently. */
+	if (vma_accessed_recent(vma))
+		return true;
+
+	/* Check if VMA is shared by many tasks. */
+	if (vma_shared_access(vma))
 		return true;
 
 	/*
@@ -3202,6 +3245,7 @@ static void task_numa_work(struct callback_head *work)
 	unsigned long nr_pte_updates = 0;
 	long pages, virtpages;
 	struct vma_iterator vmi;
+	unsigned long pid_idx;
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
 
@@ -3341,8 +3385,12 @@ static void task_numa_work(struct callback_head *work)
 				time_after(jiffies, vma->numab_state->pids_active_reset)) {
 			vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
-			vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
-			vma->numab_state->pids_active[1] = 0;
+
+			pid_idx = vma->numab_state->pids_active_idx;
+			pid_idx = (pid_idx + 1) % NR_ACCESS_PID_HIST;
+
+			vma->numab_state->pids_active_idx = pid_idx;
+			vma->numab_state->pids_active[pid_idx] = 0;
 		}
 
 		/* Do not rescan VMAs twice within the same sequence. */
-- 
2.34.1



* [RFC PATCH 2 1/1] sched/numa: Increase the VMA accessing PID bits
  2024-03-22 13:41 ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Raghavendra K T
@ 2024-03-22 13:41   ` Raghavendra K T
  2024-03-22 13:41   ` [RFC PATCH 3 1/1] sched/numa: Convert 256MB VMA scan limit notion Raghavendra K T
  2024-06-25 14:20   ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Chen Yu
  2 siblings, 0 replies; 5+ messages in thread
From: Raghavendra K T @ 2024-03-22 13:41 UTC
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Johannes Weiner, kernel test robot,
	Raghavendra K T

Currently we use 64 bits to track the tasks accessing a VMA.

This increases the probability of false positives and thus potentially
causes unnecessary scanning of a VMA even though the task had not
accessed it. Increase the tracking to 128 bits.

Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/mm.h       | 29 ++++++++++++++++++++++++++---
 include/linux/mm_types.h |  7 ++++++-
 kernel/sched/fair.c      | 21 ++++++++++++++++-----
 3 files changed, 48 insertions(+), 9 deletions(-)

Is there a better idea than having an array of 2 long variables for the 128
bits?
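
Untested, but one possible alternative could be to keep each history slot as
a generic bitmap and let the bitmap helpers do the word/bit split for us
(instead of open-coding pid_array_idx()/pid_bit_idx()), roughly like:

	/* include/linux/mm_types.h */
	#define NR_TRACKED_PIDS		128
	unsigned long pids_active[2][BITS_TO_LONGS(NR_TRACKED_PIDS)];

	/* include/linux/mm.h: vma_set_access_pid_bit() */
	pid_bit = hash_32(current->pid, ilog2(NR_TRACKED_PIDS));
	if (vma->numab_state &&
	    !test_bit(pid_bit, vma->numab_state->pids_active[1]))
		__set_bit(pid_bit, vma->numab_state->pids_active[1]);

	/* kernel/sched/fair.c: vma_is_accessed() */
	DECLARE_BITMAP(pids, NR_TRACKED_PIDS);

	bitmap_or(pids, vma->numab_state->pids_active[0],
		  vma->numab_state->pids_active[1], NR_TRACKED_PIDS);
	if (test_bit(hash_32(current->pid, ilog2(NR_TRACKED_PIDS)), pids))
		return true;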

diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..d8ff7233cf9b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1741,13 +1741,26 @@ static inline int folio_xchg_access_time(struct folio *folio, int time)
 	return last_time << PAGE_ACCESS_TIME_BUCKETS;
 }
 
+static inline int pid_array_idx(int pid_bit)
+{
+	return (pid_bit / BITS_PER_LONG);
+}
+
+static inline int pid_bit_idx(int pid_bit)
+{
+	return (pid_bit % BITS_PER_LONG);
+}
+
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 	unsigned int pid_bit;
 
-	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG));
-	if (vma->numab_state && !test_bit(pid_bit, &vma->numab_state->pids_active[1])) {
-		__set_bit(pid_bit, &vma->numab_state->pids_active[1]);
+	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG * NR_PID_ARRAY));
+
+	if (vma->numab_state && !test_bit(pid_bit_idx(pid_bit),
+				&vma->numab_state->pids_active[1][pid_array_idx(pid_bit)])) {
+		__set_bit(pid_bit_idx(pid_bit),
+				&vma->numab_state->pids_active[1][pid_array_idx(pid_bit)]);
 	}
 }
 #else /* !CONFIG_NUMA_BALANCING */
@@ -1800,6 +1813,16 @@ static inline bool cpupid_match_pid(struct task_struct *task, int cpupid)
 	return false;
 }
 
+static inline int pid_array_idx(int pid_bit)
+{
+	return 0;
+}
+
+static inline int pid_bit_idx(int pid_bit)
+{
+	return 0;
+}
+
 static inline void vma_set_access_pid_bit(struct vm_area_struct *vma)
 {
 }
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 8b611e13153e..34bb8e1f0e1c 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -574,6 +574,11 @@ struct vma_lock {
 	struct rw_semaphore lock;
 };
 
+#define NR_PID_ARRAY	2
+#define NR_TRACKED_PIDS	(BITS_PER_LONG * NR_PID_ARRAY)
+
+#define NR_ACCESS_PID_HIST     2
+
 struct vma_numab_state {
 	/*
 	 * Initialised as time in 'jiffies' after which VMA
@@ -598,7 +603,7 @@ struct vma_numab_state {
 	 * Window moves after next_pid_reset has expired approximately
 	 * every VMA_PID_RESET_PERIOD jiffies:
 	 */
-	unsigned long pids_active[2];
+	unsigned long pids_active[NR_ACCESS_PID_HIST][NR_PID_ARRAY];
 
 	/* MM scan sequence ID when scan first started after VMA creation */
 	int start_scan_seq;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..63086ca00430 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3159,7 +3159,8 @@ static void reset_ptenuma_scan(struct task_struct *p)
 
 static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 {
-	unsigned long pids;
+	int pid_bit, pid_aidx, i;
+	unsigned long pids = 0;
 	/*
 	 * Allow unconditional access first two times, so that all the (pages)
 	 * of VMAs get prot_none fault introduced irrespective of accesses.
@@ -3169,8 +3170,13 @@ static bool vma_is_accessed(struct mm_struct *mm, struct vm_area_struct *vma)
 	if ((READ_ONCE(current->mm->numa_scan_seq) - vma->numab_state->start_scan_seq) < 2)
 		return true;
 
-	pids = vma->numab_state->pids_active[0] | vma->numab_state->pids_active[1];
-	if (test_bit(hash_32(current->pid, ilog2(BITS_PER_LONG)), &pids))
+	pid_bit = hash_32(current->pid, ilog2(BITS_PER_LONG * NR_PID_ARRAY));
+	pid_aidx = pid_array_idx(pid_bit);
+
+	for (i = 0; i < NR_ACCESS_PID_HIST; i++)
+		pids |= vma->numab_state->pids_active[i][pid_aidx];
+
+	if (test_bit(pid_bit_idx(pid_bit), &pids))
 		return true;
 
 	/*
@@ -3204,6 +3210,7 @@ static void task_numa_work(struct callback_head *work)
 	struct vma_iterator vmi;
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
+	int i;
 
 	SCHED_WARN_ON(p != container_of(work, struct task_struct, numa_work));
 
@@ -3341,8 +3348,12 @@ static void task_numa_work(struct callback_head *work)
 				time_after(jiffies, vma->numab_state->pids_active_reset)) {
 			vma->numab_state->pids_active_reset = vma->numab_state->pids_active_reset +
 				msecs_to_jiffies(VMA_PID_RESET_PERIOD);
-			vma->numab_state->pids_active[0] = READ_ONCE(vma->numab_state->pids_active[1]);
-			vma->numab_state->pids_active[1] = 0;
+
+			for (i = 0; i < NR_PID_ARRAY; i++) {
+				vma->numab_state->pids_active[0][i] =
+					READ_ONCE(vma->numab_state->pids_active[1][i]);
+				vma->numab_state->pids_active[1][i] = 0;
+			}
 		}
 
 		/* Do not rescan VMAs twice within the same sequence. */
-- 
2.34.1



* [RFC PATCH 3 1/1] sched/numa: Convert 256MB VMA scan limit notion
  2024-03-22 13:41 ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Raghavendra K T
  2024-03-22 13:41   ` [RFC PATCH 2 1/1] sched/numa: Increase the VMA accessing PID bits Raghavendra K T
@ 2024-03-22 13:41   ` Raghavendra K T
  2024-06-25 14:20   ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Chen Yu
  2 siblings, 0 replies; 5+ messages in thread
From: Raghavendra K T @ 2024-03-22 13:41 UTC
  To: linux-kernel, linux-mm
  Cc: Ingo Molnar, Peter Zijlstra, Mel Gorman, Andrew Morton,
	David Hildenbrand, rppt, Juri Lelli, Vincent Guittot,
	Bharata B Rao, Johannes Weiner, kernel test robot,
	Raghavendra K T, Mike Kravetz, Muchun Song

Currently, the VMA scanning that introduces PROT_NONE faults to track
a task's page access pattern is limited to 256MB per scan. This limit works
well for 4K pages. However, in cases like VMAs backed by hugepages, there
is an opportunity to scale up.

One idea is to convert the 256MB scan-size notion into a budget of 64K 4K-sized
PTE updates. Thus when a 2MB huge page is scanned, we account for only one
PMD-level scan.

However, CPUs could spend more time in migrations than optimally
needed in some cases (mostly microbenchmarks).

Benchmarks with hugepages/THP=on, such as Hashjoin, have shown
good benefit.

TODO:
 - Introduce ratelimiting logic similar to the one in the CXL case.
 - Tune the scan rate to dynamically adapt to the rate of migrations.

Inspired by Mel's suggestion [1],
"Scan based on page table updates, not address ranges to mitigate
   problems with THP vs base page updates"

[1] Link: https://lore.kernel.org/lkml/20220128052851.17162-1-bharata@amd.com/T/#m38f6bf64f484eb98562f64ed02be86f2768d6fff

Suggested-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
---
 include/linux/hugetlb.h |  3 +-
 include/linux/mm.h      | 16 +++++++-
 kernel/sched/fair.c     | 15 ++++---
 mm/hugetlb.c            |  9 +++++
 mm/mempolicy.c          | 11 +++++-
 mm/mprotect.c           | 87 +++++++++++++++++++++++++++++++++--------
 6 files changed, 115 insertions(+), 26 deletions(-)

Note: I think we can do better than passing a struct around to learn
how much memory is covered by the current VMA scan.

Currently change_prot_numa() returns how many pages were successfully scanned,
but we do not get to know how much of the memory range was covered in the scan.

Ideas??

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index c1ee640d87b1..eb6987148e44 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -278,7 +278,8 @@ int pud_huge(pud_t pud);
 long hugetlb_change_protection(struct vm_area_struct *vma,
 		unsigned long address, unsigned long end, pgprot_t newprot,
 		unsigned long cp_flags);
-
+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		unsigned long end);
 bool is_hugetlb_entry_migration(pte_t pte);
 bool is_hugetlb_entry_hwpoisoned(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index f5a97dec5169..8c5490db007d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -549,6 +549,10 @@ struct vm_fault {
 					 */
 };
 
+struct pte_info {
+	long nr_huge_pte;
+};
+
 /*
  * These are the virtual MM functions - opening of an area, closing and
  * unmapping it (needed to keep files on disk up-to-date etc), pointer
@@ -2547,8 +2551,15 @@ static inline bool vma_wants_manual_pte_write_upgrade(struct vm_area_struct *vma
 bool can_change_pte_writable(struct vm_area_struct *vma, unsigned long addr,
 			     pte_t pte);
 extern long change_protection(struct mmu_gather *tlb,
+ 			      struct vm_area_struct *vma, unsigned long start,
+ 			      unsigned long end, unsigned long cp_flags);
+extern long change_protection_n(struct mmu_gather *tlb,
 			      struct vm_area_struct *vma, unsigned long start,
-			      unsigned long end, unsigned long cp_flags);
+			      unsigned long end, unsigned long cp_flags,
+			      struct pte_info *info);
+extern long effective_scanned_ptes(struct vm_area_struct *vma,
+				unsigned long start, unsigned long end,
+				struct pte_info *info);
 extern int mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	  struct vm_area_struct *vma, struct vm_area_struct **pprev,
 	  unsigned long start, unsigned long end, unsigned long newflags);
@@ -3535,6 +3546,9 @@ void vma_set_file(struct vm_area_struct *vma, struct file *file);
 #ifdef CONFIG_NUMA_BALANCING
 unsigned long change_prot_numa(struct vm_area_struct *vma,
 			unsigned long start, unsigned long end);
+unsigned long change_prot_numa_n(struct vm_area_struct *vma,
+			unsigned long start, unsigned long end,
+			struct pte_info *info);
 #endif
 
 struct vm_area_struct *find_extend_vma_locked(struct mm_struct *,
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a16129f9a5c..3646a0e14bd4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3200,8 +3200,9 @@ static void task_numa_work(struct callback_head *work)
 	struct vm_area_struct *vma;
 	unsigned long start, end;
 	unsigned long nr_pte_updates = 0;
-	long pages, virtpages;
+	long pages, virtpages, ptes_to_scan, e_scanned_ptes;
 	struct vma_iterator vmi;
+	struct pte_info info = {0};
 	bool vma_pids_skipped;
 	bool vma_pids_forced = false;
 
@@ -3248,6 +3249,8 @@ static void task_numa_work(struct callback_head *work)
 
 	pages = sysctl_numa_balancing_scan_size;
 	pages <<= 20 - PAGE_SHIFT; /* MB in pages */
+	/* Consider total number of PTEs to scan rather than sticking to 256MB */
+	ptes_to_scan = pages;
 	virtpages = pages * 8;	   /* Scan up to this much virtual space */
 	if (!pages)
 		return;
@@ -3366,7 +3369,7 @@ static void task_numa_work(struct callback_head *work)
 			start = max(start, vma->vm_start);
 			end = ALIGN(start + (pages << PAGE_SHIFT), HPAGE_SIZE);
 			end = min(end, vma->vm_end);
-			nr_pte_updates = change_prot_numa(vma, start, end);
+			nr_pte_updates = change_prot_numa_n(vma, start, end, &info);
 
 			/*
 			 * Try to scan sysctl_numa_balancing_size worth of
@@ -3376,12 +3379,14 @@ static void task_numa_work(struct callback_head *work)
 			 * PTEs, scan up to virtpages, to skip through those
 			 * areas faster.
 			 */
+			e_scanned_ptes -= effective_scanned_ptes(vma, start, end, &info);
+
 			if (nr_pte_updates)
-				pages -= (end - start) >> PAGE_SHIFT;
-			virtpages -= (end - start) >> PAGE_SHIFT;
+				ptes_to_scan -= e_scanned_ptes;
 
+			virtpages -= e_scanned_ptes;
 			start = end;
-			if (pages <= 0 || virtpages <= 0)
+			if (ptes_to_scan <= 0 || virtpages <= 0)
 				goto out;
 
 			cond_resched();
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ed1581b670d4..a5bb13457398 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6996,6 +6996,15 @@ long hugetlb_change_protection(struct vm_area_struct *vma,
 	return pages > 0 ? (pages << h->order) : pages;
 }
 
+long hugetllb_effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end)
+{
+	struct hstate *h = hstate_vma(vma);
+
+	return (end - start) >> (PAGE_SHIFT + h->order);
+}
+
+
 /* Return true if reservation was successful, false otherwise.  */
 bool hugetlb_reserve_pages(struct inode *inode,
 					long from, long to,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 10a590ee1c89..103eca1858e7 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -631,8 +631,9 @@ static int queue_folios_hugetlb(pte_t *pte, unsigned long hmask,
  * an architecture makes a different choice, it will need further
  * changes to the core.
  */
-unsigned long change_prot_numa(struct vm_area_struct *vma,
-			unsigned long addr, unsigned long end)
+unsigned long change_prot_numa_n(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end,
+			struct pte_info *info)
 {
 	struct mmu_gather tlb;
 	long nr_updated;
@@ -647,6 +648,12 @@ unsigned long change_prot_numa(struct vm_area_struct *vma,
 
 	return nr_updated;
 }
+
+unsigned long change_prot_numa(struct vm_area_struct *vma,
+			unsigned long addr, unsigned long end)
+{
+	return change_prot_numa_n(vma, addr, end, NULL);
+}
 #endif /* CONFIG_NUMA_BALANCING */
 
 static int queue_pages_test_walk(unsigned long start, unsigned long end,
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 81991102f785..8e43506705e0 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -352,9 +352,10 @@ pgtable_populate_needed(struct vm_area_struct *vma, unsigned long cp_flags)
 		err;							\
 	})
 
-static inline long change_pmd_range(struct mmu_gather *tlb,
+static inline long change_pmd_range_n(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct pte_info *info)
 {
 	pmd_t *pmd;
 	unsigned long next;
@@ -431,14 +432,25 @@ static inline long change_pmd_range(struct mmu_gather *tlb,
 	if (range.start)
 		mmu_notifier_invalidate_range_end(&range);
 
-	if (nr_huge_updates)
+	if (nr_huge_updates) {
 		count_vm_numa_events(NUMA_HUGE_PTE_UPDATES, nr_huge_updates);
+		if (info)
+			info->nr_huge_pte = nr_huge_updates;
+		}
 	return pages;
 }
 
-static inline long change_pud_range(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+static inline long change_pmd_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+{
+	return change_pmd_range_n(tlb, vma, pud, addr, end, newprot, cp_flags, NULL);
+}
+
+static inline long change_pud_range_n(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct pte_info *info)
 {
 	pud_t *pud;
 	unsigned long next;
@@ -452,17 +464,26 @@ static inline long change_pud_range(struct mmu_gather *tlb,
 			return ret;
 		if (pud_none_or_clear_bad(pud))
 			continue;
-		pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
-					  cp_flags);
+		pages += change_pmd_range_n(tlb, vma, pud, addr, next, newprot,
+					  cp_flags, info);
 	} while (pud++, addr = next, addr != end);
 
 	return pages;
 }
 
-static inline long change_p4d_range(struct mmu_gather *tlb,
-		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+static inline long change_pud_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
 		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
 {
+
+	return change_pud_range_n(tlb, vma, p4d, addr, end, newprot, cp_flags, NULL);
+}
+
+static inline long change_p4d_range_n(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct pte_info *info)
+{
 	p4d_t *p4d;
 	unsigned long next;
 	long pages = 0, ret;
@@ -475,16 +496,24 @@ static inline long change_p4d_range(struct mmu_gather *tlb,
 			return ret;
 		if (p4d_none_or_clear_bad(p4d))
 			continue;
-		pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
-					  cp_flags);
+		pages += change_pud_range_n(tlb, vma, p4d, addr, next, newprot,
+					  cp_flags, info);
 	} while (p4d++, addr = next, addr != end);
 
 	return pages;
 }
 
+static inline long change_p4d_range(struct mmu_gather *tlb,
+		struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+{
+	return change_p4d_range_n(tlb, vma, pgd, addr, end, newprot, cp_flags, NULL);
+}
+
 static long change_protection_range(struct mmu_gather *tlb,
 		struct vm_area_struct *vma, unsigned long addr,
-		unsigned long end, pgprot_t newprot, unsigned long cp_flags)
+		unsigned long end, pgprot_t newprot, unsigned long cp_flags,
+		struct pte_info *info)
 {
 	struct mm_struct *mm = vma->vm_mm;
 	pgd_t *pgd;
@@ -503,8 +532,8 @@ static long change_protection_range(struct mmu_gather *tlb,
 		}
 		if (pgd_none_or_clear_bad(pgd))
 			continue;
-		pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
-					  cp_flags);
+		pages += change_p4d_range_n(tlb, vma, pgd, addr, next, newprot,
+					  cp_flags, info);
 	} while (pgd++, addr = next, addr != end);
 
 	tlb_end_vma(tlb, vma);
@@ -512,9 +541,10 @@ static long change_protection_range(struct mmu_gather *tlb,
 	return pages;
 }
 
-long change_protection(struct mmu_gather *tlb,
+long change_protection_n(struct mmu_gather *tlb,
 		       struct vm_area_struct *vma, unsigned long start,
-		       unsigned long end, unsigned long cp_flags)
+		       unsigned long end, unsigned long cp_flags,
+			struct pte_info *info)
 {
 	pgprot_t newprot = vma->vm_page_prot;
 	long pages;
@@ -538,11 +568,34 @@ long change_protection(struct mmu_gather *tlb,
 						  cp_flags);
 	else
 		pages = change_protection_range(tlb, vma, start, end, newprot,
-						cp_flags);
+						cp_flags, info);
 
 	return pages;
 }
 
+long change_protection(struct mmu_gather *tlb,
+		       struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, unsigned long cp_flags)
+{
+	return change_protection_n(tlb, vma, start, end, cp_flags, NULL);
+}
+
+long effective_scanned_ptes(struct vm_area_struct *vma, unsigned long start,
+		       unsigned long end, struct pte_info *info)
+{
+	long ptes = (end - start) >> PAGE_SHIFT;
+
+	if (is_vm_hugetlb_page(vma))
+		return hugetllb_effective_scanned_ptes(vma, start, end);
+
+	if (info && info->nr_huge_pte) {
+		ptes -= info->nr_huge_pte / HPAGE_PMD_SIZE;
+		ptes += info->nr_huge_pte;
+	}
+
+	return ptes;
+}
+
 static int prot_none_pte_entry(pte_t *pte, unsigned long addr,
 			       unsigned long next, struct mm_walk *walk)
 {
-- 
2.34.1



* Re: [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization
  2024-03-22 13:41 ` [RFC PATCH 1 1/1] sched/numa: Hot VMA and shared VMA optimization Raghavendra K T
  2024-03-22 13:41   ` [RFC PATCH 2 1/1] sched/numa: Increase the VMA accessing PID bits Raghavendra K T
  2024-03-22 13:41   ` [RFC PATCH 3 1/1] sched/numa: Convert 256MB VMA scan limit notion Raghavendra K T
@ 2024-06-25 14:20   ` Chen Yu
  2 siblings, 0 replies; 5+ messages in thread
From: Chen Yu @ 2024-06-25 14:20 UTC
  To: Raghavendra K T
  Cc: linux-kernel, linux-mm, Ingo Molnar, Peter Zijlstra, Mel Gorman,
	Andrew Morton, David Hildenbrand, rppt, Juri Lelli,
	Vincent Guittot, Bharata B Rao, Johannes Weiner,
	kernel test robot, Yujie Liu

Hi Raghavendra,

On 2024-03-22 at 19:11:12 +0530, Raghavendra K T wrote:
> Optimizations are based on the history of PIDs accessing a VMA.
> 
> - Increase tasks' access history windows (PeterZ) from 2 to 4.
> (This patch is from Peter Zijlstra <peterz@infradead.org>.)
> 
> Idea: A task is allowed to scan a VMA if:
> - the VMA was very recently accessed, as indicated by the latest
>   access PIDs information (hot VMA), or
> - the VMA is shared by more than 2 tasks. Here the whole history of the
>   VMA's access PIDs is considered, using bitmap_weight().
> 
> Signed-off-by: Raghavendra K T <raghavendra.kt@amd.com>
> ---
> I will split the patchset and post it if we find this patchset useful
> going forward. The first patch is from PeterZ.
> 

This is a good direction, I think. We did an initial test using autonumabench
THREADLOCAL on a 240-CPU, 2-node system. It seems that this patch does not
show an obvious difference, but it does show a more stable result (less run-to-run
variance). We'll enable Sub-NUMA Clustering to see if there is any difference.
My understanding is that, if we can extend NR_ACCESS_PID_HIST further,
THREADLOCAL could see more benefit, as each thread has its own VMA. Or maybe
making the length of the VMA access history adaptive (rather than a fixed 4)
would be more flexible.
                                          numa_scan_orig    numa_scan_4_history
Min       syst-NUMA01_THREADLOCAL      388.47 (   0.00%)      397.43 (  -2.31%)
Min       elsp-NUMA01_THREADLOCAL       40.27 (   0.00%)       38.94 (   3.30%)
Amean     syst-NUMA01_THREADLOCAL      467.62 (   0.00%)      459.10 (   1.82%)
Amean     elsp-NUMA01_THREADLOCAL       42.20 (   0.00%)       44.84 (  -6.26%)
Stddev    syst-NUMA01_THREADLOCAL       74.11 (   0.00%)       60.90 (  17.81%)
CoeffVar  syst-NUMA01_THREADLOCAL       15.85 (   0.00%)       13.27 (  16.29%)
Max       syst-NUMA01_THREADLOCAL      535.36 (   0.00%)      519.21 (   3.02%)
Max       elsp-NUMA01_THREADLOCAL       43.96 (   0.00%)       56.33 ( -28.14%)
BAmean-50 syst-NUMA01_THREADLOCAL      388.47 (   0.00%)      397.43 (  -2.31%)
BAmean-50 elsp-NUMA01_THREADLOCAL       40.27 (   0.00%)       38.94 (   3.30%)
BAmean-95 syst-NUMA01_THREADLOCAL      433.75 (   0.00%)      429.05 (   1.08%)
BAmean-95 elsp-NUMA01_THREADLOCAL       41.31 (   0.00%)       39.09 (   5.39%)
BAmean-99 syst-NUMA01_THREADLOCAL      433.75 (   0.00%)      429.05 (   1.08%)
BAmean-99 elsp-NUMA01_THREADLOCAL       41.31 (   0.00%)       39.09 (   5.39%)

thanks,
Chenyu
