LKML Archive mirror
* [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
@ 2024-01-22 19:41 David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64 David Hildenbrand
                   ` (11 more replies)
  0 siblings, 12 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

Now that the rmap overhaul[1], which provides a clean interface for rmap
batching, is upstream, let's implement PTE batching during fork() when
processing PTE-mapped THPs.

This series is partially based on Ryan's previous work[2] to implement
cont-pte support on arm64, but it's a complete rewrite based on [1] to
optimize all architectures independent of any such PTE bits, and to
use the new rmap batching functions that simplify the code and prepare
for further rmap accounting changes.

We collect consecutive PTEs that map consecutive pages of the same large
folio, making sure that the other PTE bits are compatible, and (a) adjust
the refcount only once per batch, (b) call rmap handling functions only
once per batch and (c) perform batch PTE setting/updates.
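
For the common anonymous-memory case, the batched copy essentially boils
down to the following (simplified sketch of the code added later in this
series; error handling, the file-backed case and corner cases omitted):

	nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
	folio_ref_add(folio, nr);				/* (a) */
	folio_try_dup_anon_rmap_ptes(folio, page, nr, src_vma);	/* (b) */
	wrprotect_ptes(src_vma->vm_mm, addr, src_pte, nr);	/* COW mappings */
	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);	/* (c) */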

While this series should be beneficial for adding cont-pte support on
ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
for large folios with minimal added overhead and further changes[4] that
build on top of the total mapcount.

Independent of all that, this series results in a speedup during fork with
PTE-mapped THP, which is the default with THPs that are smaller than a PMD
(for example, 16KiB to 1024KiB mTHPs for anonymous memory[5]).

On an Intel Xeon Silver 4210R CPU, forking with 1GiB of memory backed by
PTE-mapped folios of the given size (stddev < 1%) results in the following
fork() runtimes in seconds (shorter is better):

Folio Size | v6.8-rc1 |      New | Change
------------------------------------------
      4KiB | 0.014328 | 0.014265 |     0%
     16KiB | 0.014263 | 0.013293 |   - 7%
     32KiB | 0.014334 | 0.012355 |   -14%
     64KiB | 0.014046 | 0.011837 |   -16%
    128KiB | 0.014011 | 0.011536 |   -18%
    256KiB | 0.013993 | 0.011340 |   -19%
    512KiB | 0.013983 | 0.011311 |   -19%
   1024KiB | 0.013986 | 0.011282 |   -19%
   2048KiB | 0.014305 | 0.011496 |   -20%
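
For reference, a minimal harness along these lines can reproduce such a
measurement (sketch only, not necessarily the exact benchmark used; the
respective THP/mTHP size is assumed to be enabled via
/sys/kernel/mm/transparent_hugepage/):

	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <sys/wait.h>
	#include <time.h>
	#include <unistd.h>

	#define SIZE	(1024ul * 1024 * 1024)

	int main(void)
	{
		struct timespec start, end;
		char *mem;
		pid_t pid;

		mem = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
		if (mem == MAP_FAILED)
			return 1;
		/* Ask for (m)THP and populate the range so folios get allocated. */
		madvise(mem, SIZE, MADV_HUGEPAGE);
		memset(mem, 1, SIZE);

		clock_gettime(CLOCK_MONOTONIC, &start);
		pid = fork();
		if (pid == 0)
			_exit(0);
		clock_gettime(CLOCK_MONOTONIC, &end);
		waitpid(pid, NULL, 0);

		printf("fork() took %f s\n", (end.tv_sec - start.tv_sec) +
		       (end.tv_nsec - start.tv_nsec) / 1e9);
		return 0;
	}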

Next up is PTE batching when unmapping, which I'll probably send out
based on this series this or next week.

Only tested on x86-64. Compile-tested on most other architectures. Will
do more testing and double-check the arch changes while this is getting
some review.

[1] https://lkml.kernel.org/r/20231220224504.646757-1-david@redhat.com
[2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.roberts@arm.com
[3] https://lkml.kernel.org/r/20230809083256.699513-1-david@redhat.com
[4] https://lkml.kernel.org/r/20231124132626.235350-1-david@redhat.com
[5] https://lkml.kernel.org/r/20231207161211.2374093-1-ryan.roberts@arm.com

Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Dinh Nguyen <dinguyen@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org>
Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com>
Cc: Paul Walmsley <paul.walmsley@sifive.com>
Cc: Palmer Dabbelt <palmer@dabbelt.com>
Cc: Albert Ou <aou@eecs.berkeley.edu>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: "David S. Miller" <davem@davemloft.net>
Cc: linux-arm-kernel@lists.infradead.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-riscv@lists.infradead.org
Cc: linux-s390@vger.kernel.org
Cc: sparclinux@vger.kernel.org

David Hildenbrand (11):
  arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  nios2/pgtable: define PFN_PTE_SHIFT
  powerpc/pgtable: define PFN_PTE_SHIFT
  risc: pgtable: define PFN_PTE_SHIFT
  s390/pgtable: define PFN_PTE_SHIFT
  sparc/pgtable: define PFN_PTE_SHIFT
  mm/memory: factor out copying the actual PTE in copy_present_pte()
  mm/memory: pass PTE to copy_present_pte()
  mm/memory: optimize fork() with PTE-mapped THP
  mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  mm/memory: ignore writable bit in folio_pte_batch()

 arch/arm/include/asm/pgtable.h      |   2 +
 arch/arm64/include/asm/pgtable.h    |   2 +
 arch/nios2/include/asm/pgtable.h    |   2 +
 arch/powerpc/include/asm/pgtable.h  |   2 +
 arch/riscv/include/asm/pgtable.h    |   2 +
 arch/s390/include/asm/pgtable.h     |   2 +
 arch/sparc/include/asm/pgtable_64.h |   2 +
 include/linux/pgtable.h             |  17 ++-
 mm/memory.c                         | 188 +++++++++++++++++++++-------
 9 files changed, 173 insertions(+), 46 deletions(-)


base-commit: 6613476e225e090cc9aad49be7fa504e290dd33d
-- 
2.43.0


^ permalink raw reply	[flat|nested] 49+ messages in thread

* [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-23 10:34   ` Ryan Roberts
  2024-01-22 19:41 ` [PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT David Hildenbrand
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().
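
For context, the generic fallback in include/linux/pgtable.h advances the
PFN using exactly that shift, roughly:

	#ifndef pte_next_pfn
	static inline pte_t pte_next_pfn(pte_t pte)
	{
		return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
	}
	#endif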

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/arm/include/asm/pgtable.h   | 2 ++
 arch/arm64/include/asm/pgtable.h | 2 ++
 2 files changed, 4 insertions(+)

diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
index d657b84b6bf70..be91e376df79e 100644
--- a/arch/arm/include/asm/pgtable.h
+++ b/arch/arm/include/asm/pgtable.h
@@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
 extern void __sync_icache_dcache(pte_t pteval);
 #endif
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr,
 		      pte_t *ptep, pte_t pteval, unsigned int nr);
 #define set_ptes set_ptes
diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index 79ce70fbb751c..d4b3bd96e3304 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
 		mte_sync_tags(pte, nr_pages);
 }
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm,
 			    unsigned long __always_unused addr,
 			    pte_t *ptep, pte_t pte, unsigned int nr)
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64 David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 03/11] powerpc/pgtable: " David Hildenbrand
                   ` (9 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/nios2/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/nios2/include/asm/pgtable.h b/arch/nios2/include/asm/pgtable.h
index 5144506dfa693..d052dfcbe8d3a 100644
--- a/arch/nios2/include/asm/pgtable.h
+++ b/arch/nios2/include/asm/pgtable.h
@@ -178,6 +178,8 @@ static inline void set_pte(pte_t *ptep, pte_t pteval)
 	*ptep = pteval;
 }
 
+#define PFN_PTE_SHIFT		0
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 03/11] powerpc/pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64 David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 04/11] risc: pgtable: " David Hildenbrand
                   ` (8 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/powerpc/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9224f23065fff..7a1ba8889aeae 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -41,6 +41,8 @@ struct mm_struct;
 
 #ifndef __ASSEMBLY__
 
+#define PFN_PTE_SHIFT		PTE_RPN_SHIFT
+
 void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
 		pte_t pte, unsigned int nr);
 #define set_ptes set_ptes
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (2 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 03/11] powerpc/pgtable: " David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-22 20:03   ` Alexandre Ghiti
  2024-01-22 19:41 ` [PATCH v1 05/11] s390/pgtable: " David Hildenbrand
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/riscv/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
index 0c94260b5d0c1..add5cd30ab34d 100644
--- a/arch/riscv/include/asm/pgtable.h
+++ b/arch/riscv/include/asm/pgtable.h
@@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
 	set_pte(ptep, pteval);
 }
 
+#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pteval, unsigned int nr)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 05/11] s390/pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (3 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 04/11] risc: pgtable: " David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 06/11] sparc/pgtable: " David Hildenbrand
                   ` (6 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/s390/include/asm/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index 1299b56e43f6f..4b91e65c85d97 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1316,6 +1316,8 @@ pgprot_t pgprot_writecombine(pgprot_t prot);
 #define pgprot_writethrough	pgprot_writethrough
 pgprot_t pgprot_writethrough(pgprot_t prot);
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 /*
  * Set multiple PTEs to consecutive pages with a single call.  All PTEs
  * are within the same folio, PMD and VMA.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 06/11] sparc/pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (4 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 05/11] s390/pgtable: " David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-22 19:41 ` [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte() David Hildenbrand
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We want to make use of pte_next_pfn() outside of set_ptes(). Let's
simply define PFN_PTE_SHIFT, required by pte_next_pfn().

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 arch/sparc/include/asm/pgtable_64.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index a8c871b7d7860..652af9d63fa29 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -929,6 +929,8 @@ static inline void __set_pte_at(struct mm_struct *mm, unsigned long addr,
 	maybe_tlb_batch_add(mm, addr, ptep, orig, fullmm, PAGE_SHIFT);
 }
 
+#define PFN_PTE_SHIFT		PAGE_SHIFT
+
 static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
 		pte_t *ptep, pte_t pte, unsigned int nr)
 {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte()
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (5 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 06/11] sparc/pgtable: " David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-23 10:45   ` Ryan Roberts
  2024-01-22 19:41 ` [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte() David Hildenbrand
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

Let's prepare for further changes.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 60 ++++++++++++++++++++++++++++-------------------------
 1 file changed, 32 insertions(+), 28 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7e1f4849463aa..2aa2051ee51d3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	return 0;
 }
 
+static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+		struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
+		pte_t pte, unsigned long addr)
+{
+	struct mm_struct *src_mm = src_vma->vm_mm;
+
+	/* If it's a COW mapping, write protect it in both processes. */
+	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
+		ptep_set_wrprotect(src_mm, addr, src_pte);
+		pte = pte_wrprotect(pte);
+	}
+
+	/* If it's a shared mapping, mark it clean in the child. */
+	if (src_vma->vm_flags & VM_SHARED)
+		pte = pte_mkclean(pte);
+	pte = pte_mkold(pte);
+
+	if (!userfaultfd_wp(dst_vma))
+		pte = pte_clear_uffd_wp(pte);
+
+	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+}
+
 /*
  * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
  * is required to copy this pte.
@@ -939,16 +962,16 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
 		 struct folio **prealloc)
 {
-	struct mm_struct *src_mm = src_vma->vm_mm;
-	unsigned long vm_flags = src_vma->vm_flags;
 	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
 
 	page = vm_normal_page(src_vma, addr, pte);
-	if (page)
-		folio = page_folio(page);
-	if (page && folio_test_anon(folio)) {
+	if (unlikely(!page))
+		goto copy_pte;
+
+	folio = page_folio(page);
+	if (folio_test_anon(folio)) {
 		/*
 		 * If this page may have been pinned by the parent process,
 		 * copy the page immediately for the child so that we'll always
@@ -963,34 +986,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 						 addr, rss, prealloc, page);
 		}
 		rss[MM_ANONPAGES]++;
-	} else if (page) {
+		VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+	} else {
 		folio_get(folio);
 		folio_dup_file_rmap_pte(folio, page);
 		rss[mm_counter_file(page)]++;
 	}
 
-	/*
-	 * If it's a COW mapping, write protect it both
-	 * in the parent and the child
-	 */
-	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
-		pte = pte_wrprotect(pte);
-	}
-	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
-
-	/*
-	 * If it's a shared mapping, mark it clean in
-	 * the child
-	 */
-	if (vm_flags & VM_SHARED)
-		pte = pte_mkclean(pte);
-	pte = pte_mkold(pte);
-
-	if (!userfaultfd_wp(dst_vma))
-		pte = pte_clear_uffd_wp(pte);
-
-	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+copy_pte:
+	__copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
 	return 0;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte()
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (6 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte() David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-23 10:47   ` Ryan Roberts
  2024-01-22 19:41 ` [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

We already read it, let's just forward it.

This patch is based on work by Ryan Roberts.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 2aa2051ee51d3..185b4aff13d62 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
  */
 static inline int
 copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
-		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
-		 struct folio **prealloc)
+		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
+		 int *rss, struct folio **prealloc)
 {
-	pte_t pte = ptep_get(src_pte);
 	struct page *page;
 	struct folio *folio;
 
@@ -1104,7 +1103,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		}
 		/* copy_present_pte() will clear `*prealloc' if consumed */
 		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       addr, rss, &prealloc);
+				       ptent, addr, rss, &prealloc);
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (7 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte() David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-23 12:01   ` Ryan Roberts
  2024-01-22 19:41 ` [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch() David Hildenbrand
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

Let's implement PTE batching when consecutive (present) PTEs map
consecutive pages of the same large folio, and all other PTE bits besides
the PFNs are equal.

We will optimize folio_pte_batch() separately, to ignore some other
PTE bits. This patch is based on work by Ryan Roberts.

Use __always_inline for __copy_present_ptes() and keep the handling for
single PTEs completely separate from the multi-PTE case: we really want
the compiler to optimize for the single-PTE case with small folios, to
not degrade performance.

Note that PTE batching will never exceed a single page table and will
always stay within VMA boundaries.
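
As a concrete illustration (hypothetical numbers): with 4KiB base pages, one
PTE table covers 2MiB, and the caller clamps each batch to what remains of
the current page table / VMA:

	max_nr = (end - addr) / PAGE_SIZE;
	nr = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
			       ptent, addr, max_nr, rss, &prealloc);

A 64KiB folio fully mapped in that range is then copied as a single batch of
16 PTEs; a batch is never extended across such a boundary.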

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 include/linux/pgtable.h |  17 +++++-
 mm/memory.c             | 113 +++++++++++++++++++++++++++++++++-------
 2 files changed, 109 insertions(+), 21 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f6d0e3513948a..d32cedf6936ba 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -212,8 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
 #define arch_flush_lazy_mmu_mode()	do {} while (0)
 #endif
 
-#ifndef set_ptes
-
 #ifndef pte_next_pfn
 static inline pte_t pte_next_pfn(pte_t pte)
 {
@@ -221,6 +219,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
 }
 #endif
 
+#ifndef set_ptes
 /**
  * set_ptes - Map consecutive pages to a contiguous range of addresses.
  * @mm: Address space to map the pages into.
@@ -650,6 +649,20 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
 }
 #endif
 
+#ifndef wrprotect_ptes
+static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
+		pte_t *ptep, unsigned int nr)
+{
+	for (;;) {
+		ptep_set_wrprotect(mm, addr, ptep);
+		if (--nr == 0)
+			break;
+		ptep++;
+		addr += PAGE_SIZE;
+	}
+}
+#endif
+
 /*
  * On some architectures hardware does not set page access bit when accessing
  * memory page, it is responsibility of software setting this bit. It brings
diff --git a/mm/memory.c b/mm/memory.c
index 185b4aff13d62..f563aec85b2a8 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	return 0;
 }
 
-static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
+static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
 		struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
-		pte_t pte, unsigned long addr)
+		pte_t pte, unsigned long addr, int nr)
 {
 	struct mm_struct *src_mm = src_vma->vm_mm;
 
 	/* If it's a COW mapping, write protect it in both processes. */
 	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
-		ptep_set_wrprotect(src_mm, addr, src_pte);
+		wrprotect_ptes(src_mm, addr, src_pte, nr);
 		pte = pte_wrprotect(pte);
 	}
 
@@ -950,26 +950,94 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
 	if (!userfaultfd_wp(dst_vma))
 		pte = pte_clear_uffd_wp(pte);
 
-	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
+	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
+}
+
+/*
+ * Detect a PTE batch: consecutive (present) PTEs that map consecutive
+ * pages of the same folio.
+ *
+ * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ */
+static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
+		pte_t *start_ptep, pte_t pte, int max_nr)
+{
+	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
+	const pte_t *end_ptep = start_ptep + max_nr;
+	pte_t expected_pte = pte_next_pfn(pte);
+	pte_t *ptep = start_ptep + 1;
+
+	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
+
+	while (ptep != end_ptep) {
+		pte = ptep_get(ptep);
+
+		if (!pte_same(pte, expected_pte))
+			break;
+
+		/*
+		 * Stop immediately once we reached the end of the folio. In
+		 * corner cases the next PFN might fall into a different
+		 * folio.
+		 */
+		if (pte_pfn(pte) == folio_end_pfn)
+			break;
+
+		expected_pte = pte_next_pfn(expected_pte);
+		ptep++;
+	}
+
+	return ptep - start_ptep;
 }
 
 /*
- * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
- * is required to copy this pte.
+ * Copy one present PTE, trying to batch-process subsequent PTEs that map
+ * consecutive pages of the same folio by copying them as well.
+ *
+ * Returns -EAGAIN if one preallocated page is required to copy the next PTE.
+ * Otherwise, returns the number of copied PTEs (at least 1).
  */
 static inline int
-copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
+copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
-		 int *rss, struct folio **prealloc)
+		 int max_nr, int *rss, struct folio **prealloc)
 {
 	struct page *page;
 	struct folio *folio;
+	int err, nr;
 
 	page = vm_normal_page(src_vma, addr, pte);
 	if (unlikely(!page))
 		goto copy_pte;
 
 	folio = page_folio(page);
+
+	/*
+	 * If we likely have to copy, just don't bother with batching. Make
+	 * sure that the common "small folio" case stays as fast as possible
+	 * by keeping the batching logic separate.
+	 */
+	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
+		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+		if (folio_test_anon(folio)) {
+			folio_ref_add(folio, nr);
+			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
+								  nr, src_vma))) {
+				folio_ref_sub(folio, nr);
+				return -EAGAIN;
+			}
+			rss[MM_ANONPAGES] += nr;
+			VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
+		} else {
+			folio_ref_add(folio, nr);
+			folio_dup_file_rmap_ptes(folio, page, nr);
+			rss[mm_counter_file(page)] += nr;
+		}
+		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
+				    addr, nr);
+		return nr;
+	}
+
 	if (folio_test_anon(folio)) {
 		/*
 		 * If this page may have been pinned by the parent process,
@@ -981,8 +1049,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
 			/* Page may be pinned, we have to copy. */
 			folio_put(folio);
-			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
-						 addr, rss, prealloc, page);
+			err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
+						addr, rss, prealloc, page);
+			return err ? err : 1;
 		}
 		rss[MM_ANONPAGES]++;
 		VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
@@ -993,8 +1062,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	}
 
 copy_pte:
-	__copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
-	return 0;
+	__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1);
+	return 1;
 }
 
 static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
@@ -1031,10 +1100,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	pte_t *src_pte, *dst_pte;
 	pte_t ptent;
 	spinlock_t *src_ptl, *dst_ptl;
-	int progress, ret = 0;
+	int progress, max_nr, ret = 0;
 	int rss[NR_MM_COUNTERS];
 	swp_entry_t entry = (swp_entry_t){0};
 	struct folio *prealloc = NULL;
+	int nr;
 
 again:
 	progress = 0;
@@ -1065,6 +1135,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 	arch_enter_lazy_mmu_mode();
 
 	do {
+		nr = 1;
+
 		/*
 		 * We are holding two locks at this point - either of them
 		 * could generate latencies in another task on another CPU.
@@ -1101,9 +1173,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			 */
 			WARN_ON_ONCE(ret != -ENOENT);
 		}
-		/* copy_present_pte() will clear `*prealloc' if consumed */
-		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
-				       ptent, addr, rss, &prealloc);
+		/* copy_present_ptes() will clear `*prealloc' if consumed */
+		max_nr = (end - addr) / PAGE_SIZE;
+		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
+					ptent, addr, max_nr, rss, &prealloc);
 		/*
 		 * If we need a pre-allocated page for this pte, drop the
 		 * locks, allocate, and try again.
@@ -1120,8 +1193,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 			folio_put(prealloc);
 			prealloc = NULL;
 		}
-		progress += 8;
-	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
+		nr = ret;
+		progress += 8 * nr;
+	} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
+		 addr != end);
 
 	arch_leave_lazy_mmu_mode();
 	pte_unmap_unlock(orig_src_pte, src_ptl);
@@ -1142,7 +1217,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
 		prealloc = folio_prealloc(src_mm, src_vma, addr, false);
 		if (!prealloc)
 			return -ENOMEM;
-	} else if (ret) {
+	} else if (ret < 0) {
 		VM_WARN_ON_ONCE(1);
 	}
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (8 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
@ 2024-01-22 19:41 ` David Hildenbrand
  2024-01-23 12:25   ` Ryan Roberts
  2024-01-22 19:42 ` [PATCH v1 11/11] mm/memory: ignore writable bit " David Hildenbrand
  2024-01-23 19:15 ` [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP Ryan Roberts
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:41 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

Let's ignore these bits: they are irrelevant for fork, and will likely
be irrelevant for upcoming users such as page unmapping.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 10 ++++++++--
 1 file changed, 8 insertions(+), 2 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index f563aec85b2a8..341b2be845b6e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
 	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
 }
 
+static inline pte_t __pte_batch_clear_ignored(pte_t pte)
+{
+	return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
+}
+
 /*
  * Detect a PTE batch: consecutive (present) PTEs that map consecutive
  * pages of the same folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ * the accessed bit, dirty bit and soft-dirty bit.
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		pte_t *start_ptep, pte_t pte, int max_nr)
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
-	pte_t expected_pte = pte_next_pfn(pte);
+	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
 	pte_t *ptep = start_ptep + 1;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
 	while (ptep != end_ptep) {
-		pte = ptep_get(ptep);
+		pte = __pte_batch_clear_ignored(ptep_get(ptep));
 
 		if (!pte_same(pte, expected_pte))
 			break;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* [PATCH v1 11/11] mm/memory: ignore writable bit in folio_pte_batch()
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (9 preceding siblings ...)
  2024-01-22 19:41 ` [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch() David Hildenbrand
@ 2024-01-22 19:42 ` David Hildenbrand
  2024-01-23 12:35   ` Ryan Roberts
  2024-01-23 19:15 ` [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP Ryan Roberts
  11 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 19:42 UTC
  To: linux-kernel
  Cc: linux-mm, David Hildenbrand, Andrew Morton, Matthew Wilcox,
	Ryan Roberts, Russell King, Catalin Marinas, Will Deacon,
	Dinh Nguyen, Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

... and optionally report back to the caller whether any PTE other than the
first one is writable. fork() has to make sure to properly write-protect in
case any PTE is writable. Other users (e.g., page unmapping) won't care.
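
Sketch of the interaction on the fork() side (see the diff below): if any
PTE in the batch was writable, the representative PTE is made writable again
so that __copy_present_ptes() reliably takes the write-protect path for COW
mappings:

	if (any_writable)
		pte = pte_mkwrite(pte, src_vma);
	...
	/* in __copy_present_ptes(): */
	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
		wrprotect_ptes(src_mm, addr, src_pte, nr);
		pte = pte_wrprotect(pte);
	}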

Signed-off-by: David Hildenbrand <david@redhat.com>
---
 mm/memory.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 341b2be845b6e..a26fd0669016b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -955,7 +955,7 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
 
 static inline pte_t __pte_batch_clear_ignored(pte_t pte)
 {
-	return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
+	return pte_wrprotect(pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte))));
 }
 
 /*
@@ -963,20 +963,29 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte)
  * pages of the same folio.
  *
  * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
- * the accessed bit, dirty bit and soft-dirty bit.
+ * the accessed bit, dirty bit, soft-dirty bit and writable bit.
+ * If "any_writable" is set, it will indicate if any other PTE besides the
+ * first (given) PTE is writable.
  */
 static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr)
+		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
 {
 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
 	const pte_t *end_ptep = start_ptep + max_nr;
 	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
 	pte_t *ptep = start_ptep + 1;
+	bool writable;
+
+	if (any_writable)
+		*any_writable = false;
 
 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
 
 	while (ptep != end_ptep) {
-		pte = __pte_batch_clear_ignored(ptep_get(ptep));
+		pte = ptep_get(ptep);
+		if (any_writable)
+			writable = !!pte_write(pte);
+		pte = __pte_batch_clear_ignored(pte);
 
 		if (!pte_same(pte, expected_pte))
 			break;
@@ -989,6 +998,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
 		if (pte_pfn(pte) == folio_end_pfn)
 			break;
 
+		if (any_writable)
+			*any_writable |= writable;
+
 		expected_pte = pte_next_pfn(expected_pte);
 		ptep++;
 	}
@@ -1010,6 +1022,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 {
 	struct page *page;
 	struct folio *folio;
+	bool any_writable;
 	int err, nr;
 
 	page = vm_normal_page(src_vma, addr, pte);
@@ -1024,7 +1037,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 	 * by keeping the batching logic separate.
 	 */
 	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr,
+				     &any_writable);
 		if (folio_test_anon(folio)) {
 			folio_ref_add(folio, nr);
 			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
@@ -1039,6 +1053,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
 			folio_dup_file_rmap_ptes(folio, page, nr);
 			rss[mm_counter_file(page)] += nr;
 		}
+		if (any_writable)
+			pte = pte_mkwrite(pte, src_vma);
 		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
 				    addr, nr);
 		return nr;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
  2024-01-22 19:41 ` [PATCH v1 04/11] risc: pgtable: " David Hildenbrand
@ 2024-01-22 20:03   ` Alexandre Ghiti
  2024-01-22 20:08     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Alexandre Ghiti @ 2024-01-22 20:03 UTC
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Russell King, Catalin Marinas, Will Deacon, Dinh Nguyen,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

Hi David,

On 22/01/2024 20:41, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>   arch/riscv/include/asm/pgtable.h | 2 ++
>   1 file changed, 2 insertions(+)
>
> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
> index 0c94260b5d0c1..add5cd30ab34d 100644
> --- a/arch/riscv/include/asm/pgtable.h
> +++ b/arch/riscv/include/asm/pgtable.h
> @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
>   	set_pte(ptep, pteval);
>   }
>   
> +#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT
> +
>   static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>   		pte_t *ptep, pte_t pteval, unsigned int nr)
>   {


There is a typo in the commit title: risc -> riscv. Otherwise, this is 
right so:

Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Thanks,

Alex


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 04/11] risc: pgtable: define PFN_PTE_SHIFT
  2024-01-22 20:03   ` Alexandre Ghiti
@ 2024-01-22 20:08     ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-22 20:08 UTC
  To: Alexandre Ghiti, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Ryan Roberts,
	Russell King, Catalin Marinas, Will Deacon, Dinh Nguyen,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

On 22.01.24 21:03, Alexandre Ghiti wrote:
> Hi David,
> 
> On 22/01/2024 20:41, David Hildenbrand wrote:
>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>    arch/riscv/include/asm/pgtable.h | 2 ++
>>    1 file changed, 2 insertions(+)
>>
>> diff --git a/arch/riscv/include/asm/pgtable.h b/arch/riscv/include/asm/pgtable.h
>> index 0c94260b5d0c1..add5cd30ab34d 100644
>> --- a/arch/riscv/include/asm/pgtable.h
>> +++ b/arch/riscv/include/asm/pgtable.h
>> @@ -523,6 +523,8 @@ static inline void __set_pte_at(pte_t *ptep, pte_t pteval)
>>    	set_pte(ptep, pteval);
>>    }
>>    
>> +#define PFN_PTE_SHIFT		_PAGE_PFN_SHIFT
>> +
>>    static inline void set_ptes(struct mm_struct *mm, unsigned long addr,
>>    		pte_t *ptep, pte_t pteval, unsigned int nr)
>>    {
> 
> 
> There is a typo in the commit title: risc -> riscv. Otherwise, this is
> right so:

Whops :)

> 
> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com>

Thanks!

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-22 19:41 ` [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64 David Hildenbrand
@ 2024-01-23 10:34   ` Ryan Roberts
  2024-01-23 10:48     ` David Hildenbrand
  2024-01-23 15:01     ` Matthew Wilcox
  0 siblings, 2 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 10:34 UTC
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  arch/arm/include/asm/pgtable.h   | 2 ++
>  arch/arm64/include/asm/pgtable.h | 2 ++
>  2 files changed, 4 insertions(+)
> 
> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
> index d657b84b6bf70..be91e376df79e 100644
> --- a/arch/arm/include/asm/pgtable.h
> +++ b/arch/arm/include/asm/pgtable.h
> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>  extern void __sync_icache_dcache(pte_t pteval);
>  #endif
>  
> +#define PFN_PTE_SHIFT		PAGE_SHIFT
> +
>  void set_ptes(struct mm_struct *mm, unsigned long addr,
>  		      pte_t *ptep, pte_t pteval, unsigned int nr);
>  #define set_ptes set_ptes
> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
> index 79ce70fbb751c..d4b3bd96e3304 100644
> --- a/arch/arm64/include/asm/pgtable.h
> +++ b/arch/arm64/include/asm/pgtable.h
> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
>  		mte_sync_tags(pte, nr_pages);
>  }
>  
> +#define PFN_PTE_SHIFT		PAGE_SHIFT

I think this is buggy. And so is the arm64 implementation of set_ptes(). It
works fine for 48-bit output addresses, but for 52-bit OAs, the high bits are not
kept contiguously, so if you happen to be setting a mapping for which the
physical memory block straddles bit 48, this won't work.

Today, only the 64K base page config can support 52 bits, and for this,
OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
Fortunately we already have helpers in arm64 to abstract this.

So I think arm64 will want to define its own pte_next_pfn():

#define pte_next_pfn pte_next_pfn
static inline pte_t pte_next_pfn(pte_t pte)
{
	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
}

I'll do a separate patch to fix the already broken arm64 set_ptes() implementation.

I'm not sure if this type of problem might also apply to other arches?


> +
>  static inline void set_ptes(struct mm_struct *mm,
>  			    unsigned long __always_unused addr,
>  			    pte_t *ptep, pte_t pte, unsigned int nr)


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte()
  2024-01-22 19:41 ` [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte() David Hildenbrand
@ 2024-01-23 10:45   ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 10:45 UTC
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> Let's prepare for further changes.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 60 ++++++++++++++++++++++++++++-------------------------
>  1 file changed, 32 insertions(+), 28 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7e1f4849463aa..2aa2051ee51d3 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,6 +930,29 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  	return 0;
>  }
>  
> +static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> +		struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
> +		pte_t pte, unsigned long addr)
> +{
> +	struct mm_struct *src_mm = src_vma->vm_mm;
> +
> +	/* If it's a COW mapping, write protect it in both processes. */
> +	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
> +		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		pte = pte_wrprotect(pte);
> +	}
> +
> +	/* If it's a shared mapping, mark it clean in the child. */
> +	if (src_vma->vm_flags & VM_SHARED)
> +		pte = pte_mkclean(pte);
> +	pte = pte_mkold(pte);
> +
> +	if (!userfaultfd_wp(dst_vma))
> +		pte = pte_clear_uffd_wp(pte);
> +
> +	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> +}
> +
>  /*
>   * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
>   * is required to copy this pte.
> @@ -939,16 +962,16 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
>  		 struct folio **prealloc)
>  {
> -	struct mm_struct *src_mm = src_vma->vm_mm;
> -	unsigned long vm_flags = src_vma->vm_flags;
>  	pte_t pte = ptep_get(src_pte);
>  	struct page *page;
>  	struct folio *folio;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
> -	if (page)
> -		folio = page_folio(page);
> -	if (page && folio_test_anon(folio)) {
> +	if (unlikely(!page))
> +		goto copy_pte;
> +
> +	folio = page_folio(page);
> +	if (folio_test_anon(folio)) {
>  		/*
>  		 * If this page may have been pinned by the parent process,
>  		 * copy the page immediately for the child so that we'll always
> @@ -963,34 +986,15 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  						 addr, rss, prealloc, page);
>  		}
>  		rss[MM_ANONPAGES]++;
> -	} else if (page) {
> +		VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> +	} else {
>  		folio_get(folio);
>  		folio_dup_file_rmap_pte(folio, page);
>  		rss[mm_counter_file(page)]++;
>  	}
>  
> -	/*
> -	 * If it's a COW mapping, write protect it both
> -	 * in the parent and the child
> -	 */
> -	if (is_cow_mapping(vm_flags) && pte_write(pte)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> -		pte = pte_wrprotect(pte);
> -	}
> -	VM_BUG_ON(page && folio_test_anon(folio) && PageAnonExclusive(page));
> -
> -	/*
> -	 * If it's a shared mapping, mark it clean in
> -	 * the child
> -	 */
> -	if (vm_flags & VM_SHARED)
> -		pte = pte_mkclean(pte);
> -	pte = pte_mkold(pte);
> -
> -	if (!userfaultfd_wp(dst_vma))
> -		pte = pte_clear_uffd_wp(pte);
> -
> -	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> +copy_pte:
> +	__copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
>  	return 0;
>  }
>  


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte()
  2024-01-22 19:41 ` [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte() David Hildenbrand
@ 2024-01-23 10:47   ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 10:47 UTC
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> We already read it, let's just forward it.
> 
> This patch is based on work by Ryan Roberts.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 7 +++----
>  1 file changed, 3 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 2aa2051ee51d3..185b4aff13d62 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -959,10 +959,9 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
>   */
>  static inline int
>  copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> -		 pte_t *dst_pte, pte_t *src_pte, unsigned long addr, int *rss,
> -		 struct folio **prealloc)
> +		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
> +		 int *rss, struct folio **prealloc)
>  {
> -	pte_t pte = ptep_get(src_pte);
>  	struct page *page;
>  	struct folio *folio;
>  
> @@ -1104,7 +1103,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		}
>  		/* copy_present_pte() will clear `*prealloc' if consumed */
>  		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> -				       addr, rss, &prealloc);
> +				       ptent, addr, rss, &prealloc);
>  		/*
>  		 * If we need a pre-allocated page for this pte, drop the
>  		 * locks, allocate, and try again.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 10:34   ` Ryan Roberts
@ 2024-01-23 10:48     ` David Hildenbrand
  2024-01-23 11:02       ` David Hildenbrand
                         ` (2 more replies)
  2024-01-23 15:01     ` Matthew Wilcox
  1 sibling, 3 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 10:48 UTC
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 11:34, Ryan Roberts wrote:
> On 22/01/2024 19:41, David Hildenbrand wrote:
>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>> simply define PFN_PTE_SHIFT, required by pte_next_pfn().
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   arch/arm/include/asm/pgtable.h   | 2 ++
>>   arch/arm64/include/asm/pgtable.h | 2 ++
>>   2 files changed, 4 insertions(+)
>>
>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>> index d657b84b6bf70..be91e376df79e 100644
>> --- a/arch/arm/include/asm/pgtable.h
>> +++ b/arch/arm/include/asm/pgtable.h
>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>   extern void __sync_icache_dcache(pte_t pteval);
>>   #endif
>>   
>> +#define PFN_PTE_SHIFT		PAGE_SHIFT
>> +
>>   void set_ptes(struct mm_struct *mm, unsigned long addr,
>>   		      pte_t *ptep, pte_t pteval, unsigned int nr);
>>   #define set_ptes set_ptes
>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>> index 79ce70fbb751c..d4b3bd96e3304 100644
>> --- a/arch/arm64/include/asm/pgtable.h
>> +++ b/arch/arm64/include/asm/pgtable.h
>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
>>   		mte_sync_tags(pte, nr_pages);
>>   }
>>   
>> +#define PFN_PTE_SHIFT		PAGE_SHIFT
> 
> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
> kept contigously, so if you happen to be setting a mapping for which the
> physical memory block straddles bit 48, this won't work.

Right, as soon as the PTE bits are not contiguous, this stops working, 
just like set_ptes() would, which I used as orientation.
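
For reference, the generic fallback that PFN_PTE_SHIFT feeds is (from 
memory) roughly:

static inline pte_t pte_next_pfn(pte_t pte)
{
	return __pte(pte_val(pte) + (1UL << PFN_PTE_SHIFT));
}

So that can only ever work when the PFN occupies a single contiguous 
bit range in the PTE.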

> 
> Today, only the 64K base page config can support 52 bits, and for this,
> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
> Fortunately we already have helpers in arm64 to abstract this.
> 
> So I think arm64 will want to define its own pte_next_pfn():
> 
> #define pte_next_pfn pte_next_pfn
> static inline pte_t pte_next_pfn(pte_t pte)
> {
> 	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
> }
> 
> I'll do a separate patch to fix the already broken arm64 set_ptes() implementation.

Makes sense.

> 
> I'm not sure if this type of problem might also apply to other arches?

I saw similar handling in the PPC implementation of set_ptes(), but was 
not able to convince myself that it is actually required there.

pte_pfn on ppc does:

static inline unsigned long pte_pfn(pte_t pte)
{
	return (pte_val(pte) & PTE_RPN_MASK) >> PTE_RPN_SHIFT;
}

But that means that the PFNs *are* contiguous. If high bits are used for 
something else, then we might produce a garbage PTE on overflow, but I 
concluded that shouldn't really matter for folio_pte_batch() purposes: 
we'd not detect "belongs to this folio batch" either way.

Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I 
just hope that we don't lose any other arbitrary PTE bits by doing the 
pte_pgprot().


I guess pte_pfn() implementations should tell us if anything special 
needs to happen.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 10:48     ` David Hildenbrand
@ 2024-01-23 11:02       ` David Hildenbrand
  2024-01-23 11:17         ` Ryan Roberts
  2024-01-23 11:08       ` Ryan Roberts
  2024-01-23 11:10       ` Christophe Leroy
  2 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 11:02 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 11:48, David Hildenbrand wrote:
> On 23.01.24 11:34, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>    arch/arm/include/asm/pgtable.h   | 2 ++
>>>    arch/arm64/include/asm/pgtable.h | 2 ++
>>>    2 files changed, 4 insertions(+)
>>>
>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>> index d657b84b6bf70..be91e376df79e 100644
>>> --- a/arch/arm/include/asm/pgtable.h
>>> +++ b/arch/arm/include/asm/pgtable.h
>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>    extern void __sync_icache_dcache(pte_t pteval);
>>>    #endif
>>>    
>>> +#define PFN_PTE_SHIFT		PAGE_SHIFT
>>> +
>>>    void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>    		      pte_t *ptep, pte_t pteval, unsigned int nr);
>>>    #define set_ptes set_ptes
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte, unsigned int nr_pages)
>>>    		mte_sync_tags(pte, nr_pages);
>>>    }
>>>    
>>> +#define PFN_PTE_SHIFT		PAGE_SHIFT
>>
>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>> kept contigously, so if you happen to be setting a mapping for which the
>> physical memory block straddles bit 48, this won't work.
> 
> Right, as soon as the PTE bits are not contiguous, this stops working,
> just like set_ptes() would, which I used as orientation.
> 
>>
>> Today, only the 64K base page config can support 52 bits, and for this,
>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>> Fortunately we already have helpers in arm64 to abstract this.
>>
>> So I think arm64 will want to define its own pte_next_pfn():
>>
>> #define pte_next_pfn pte_next_pfn
>> static inline pte_t pte_next_pfn(pte_t pte)
>> {
>> 	return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>> }
>>

Digging into the details, on arm64 we have:

#define pte_pfn(pte)           (__pte_to_phys(pte) >> PAGE_SHIFT)

and

#define __pte_to_phys(pte)     (pte_val(pte) & PTE_ADDR_MASK)

But that implies that upstream the PFN is always contiguous, no?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 10:48     ` David Hildenbrand
  2024-01-23 11:02       ` David Hildenbrand
@ 2024-01-23 11:08       ` Ryan Roberts
  2024-01-23 11:16         ` Christophe Leroy
  2024-01-23 11:10       ` Christophe Leroy
  2 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 11:08 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 10:48, David Hildenbrand wrote:
> On 23.01.24 11:34, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   arch/arm/include/asm/pgtable.h   | 2 ++
>>>   arch/arm64/include/asm/pgtable.h | 2 ++
>>>   2 files changed, 4 insertions(+)
>>>
>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>> index d657b84b6bf70..be91e376df79e 100644
>>> --- a/arch/arm/include/asm/pgtable.h
>>> +++ b/arch/arm/include/asm/pgtable.h
>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>   extern void __sync_icache_dcache(pte_t pteval);
>>>   #endif
>>>   +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>> +
>>>   void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>                 pte_t *ptep, pte_t pteval, unsigned int nr);
>>>   #define set_ptes set_ptes
>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte,
>>> unsigned int nr_pages)
>>>           mte_sync_tags(pte, nr_pages);
>>>   }
>>>   +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>
>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>> kept contigously, so if you happen to be setting a mapping for which the
>> physical memory block straddles bit 48, this won't work.
> 
> Right, as soon as the PTE bits are not contiguous, this stops working, just like
> set_ptes() would, which I used as orientation.
> 
>>
>> Today, only the 64K base page config can support 52 bits, and for this,
>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>> Fortunately we already have helpers in arm64 to abstract this.
>>
>> So I think arm64 will want to define its own pte_next_pfn():
>>
>> #define pte_next_pfn pte_next_pfn
>> static inline pte_t pte_next_pfn(pte_t pte)
>> {
>>     return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>> }
>>
>> I'll do a separate patch to fix the already broken arm64 set_ptes()
>> implementation.
> 
> Make sense.
> 
>>
>> I'm not sure if this type of problem might also apply to other arches?
> 
> I saw similar handling in the PPC implementation of set_ptes, but was not able
> to convince me that it is actually required there.
> 
> pte_pfn on ppc does:
> 
> static inline unsigned long pte_pfn(pte_t pte)
> {
>     return (pte_val(pte) & PTE_RPN_MASK) >> PTE_RPN_SHIFT;
> }
> 
> But that means that the PFNs *are* contiguous.

all the ppc pfn_pte() implementations also only shift the pfn, so I think ppc is
safe to just define PFN_PTE_SHIFT. Although 2 of the 3 implementations shift by
PTE_RPN_SHIFT and the other shifts by PAGE_SHIFT, so you might want to define
PFN_PTE_SHIFT separately for all 3 configs?

> If high bits are used for
> something else, then we might produce a garbage PTE on overflow, but that
> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
> detect "belongs to this folio batch" either way.

Exactly.

> 
> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().

I don't see the need for ppc to implement pte_next_pfn().

pte_pgprot() is not a "proper" arch interface (it's only required by the core-mm
if the arch implements a certain Kconfig IIRC). For arm64, all bits that are not
pfn are pgprot, so there are no bits lost.
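
IIRC the arm64 helper literally derives pgprot as "everything that is 
not the pfn", roughly:

static inline pgprot_t pte_pgprot(pte_t pte)
{
	unsigned long pfn = pte_pfn(pte);

	return __pgprot(pte_val(pfn_pte(pfn, __pgprot(0))) ^ pte_val(pte));
}

So round-tripping through pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)) 
should not drop any bits there.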

> 
> 
> I guess pte_pfn() implementations should tell us if anything special needs to
> happen.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 10:48     ` David Hildenbrand
  2024-01-23 11:02       ` David Hildenbrand
  2024-01-23 11:08       ` Ryan Roberts
@ 2024-01-23 11:10       ` Christophe Leroy
  2 siblings, 0 replies; 49+ messages in thread
From: Christophe Leroy @ 2024-01-23 11:10 UTC (permalink / raw
  To: David Hildenbrand, Ryan Roberts, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org



Le 23/01/2024 à 11:48, David Hildenbrand a écrit :
> On 23.01.24 11:34, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   arch/arm/include/asm/pgtable.h   | 2 ++
>>>   arch/arm64/include/asm/pgtable.h | 2 ++
>>>   2 files changed, 4 insertions(+)
>>>
>>> diff --git a/arch/arm/include/asm/pgtable.h 
>>> b/arch/arm/include/asm/pgtable.h
>>> index d657b84b6bf70..be91e376df79e 100644
>>> --- a/arch/arm/include/asm/pgtable.h
>>> +++ b/arch/arm/include/asm/pgtable.h
>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t 
>>> pteval)
>>>   extern void __sync_icache_dcache(pte_t pteval);
>>>   #endif
>>> +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>> +
>>>   void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>                 pte_t *ptep, pte_t pteval, unsigned int nr);
>>>   #define set_ptes set_ptes
>>> diff --git a/arch/arm64/include/asm/pgtable.h 
>>> b/arch/arm64/include/asm/pgtable.h
>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>> --- a/arch/arm64/include/asm/pgtable.h
>>> +++ b/arch/arm64/include/asm/pgtable.h
>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t 
>>> pte, unsigned int nr_pages)
>>>           mte_sync_tags(pte, nr_pages);
>>>   }
>>> +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>
>> I think this is buggy. And so is the arm64 implementation of 
>> set_ptes(). It
>> works fine for 48-bit output address, but for 52-bit OAs, the high 
>> bits are not
>> kept contigously, so if you happen to be setting a mapping for which the
>> physical memory block straddles bit 48, this won't work.
> 
> Right, as soon as the PTE bits are not contiguous, this stops working, 
> just like set_ptes() would, which I used as orientation.
> 
>>
>> Today, only the 64K base page config can support 52 bits, and for this,
>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base 
>> pages is
>> coming (hopefully v6.9) and in this case OA[51:50] are stored in 
>> PTE[9:8].
>> Fortunately we already have helpers in arm64 to abstract this.
>>
>> So I think arm64 will want to define its own pte_next_pfn():
>>
>> #define pte_next_pfn pte_next_pfn
>> static inline pte_t pte_next_pfn(pte_t pte)
>> {
>>     return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>> }
>>
>> I'll do a separate patch to fix the already broken arm64 set_ptes() 
>> implementation.
> 
> Make sense.
> 
>>
>> I'm not sure if this type of problem might also apply to other arches?
> 
> I saw similar handling in the PPC implementation of set_ptes, but was 
> not able to convince me that it is actually required there.
> 
> pte_pfn on ppc does:
> 
> static inline unsigned long pte_pfn(pte_t pte)
> {
>      return (pte_val(pte) & PTE_RPN_MASK) >> PTE_RPN_SHIFT;
> }
> 
> But that means that the PFNs *are* contiguous. If high bits are used for 
> something else, then we might produce a garbage PTE on overflow, but 
> that shouldn't really matter I concluded for folio_pte_batch() purposes, 
> we'd not detect "belongs to this folio batch" either way.

Yes PFNs are contiguous. The only thing is that the PFN is not located 
at PAGE_SHIFT, see 
https://elixir.bootlin.com/linux/v6.3-rc2/source/arch/powerpc/include/asm/nohash/pte-e500.h#L63

On powerpc e500 we have 24 PTE flags and the RPN starts above that.

The mask is then standard:

#define PTE_RPN_MASK	(~((1ULL << PTE_RPN_SHIFT) - 1))

Christophe

> 
> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I 
> just hope that we don't lose any other arbitrary PTE bits by doing the 
> pte_pgprot().
> 
> 
> I guess pte_pfn() implementations should tell us if anything special 
> needs to happen.
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:08       ` Ryan Roberts
@ 2024-01-23 11:16         ` Christophe Leroy
  2024-01-23 11:31           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Christophe Leroy @ 2024-01-23 11:16 UTC (permalink / raw
  To: Ryan Roberts, David Hildenbrand, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org



Le 23/01/2024 à 12:08, Ryan Roberts a écrit :
> On 23/01/2024 10:48, David Hildenbrand wrote:
>> On 23.01.24 11:34, Ryan Roberts wrote:
>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    arch/arm/include/asm/pgtable.h   | 2 ++
>>>>    arch/arm64/include/asm/pgtable.h | 2 ++
>>>>    2 files changed, 4 insertions(+)
>>>>
>>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>>> index d657b84b6bf70..be91e376df79e 100644
>>>> --- a/arch/arm/include/asm/pgtable.h
>>>> +++ b/arch/arm/include/asm/pgtable.h
>>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>>    extern void __sync_icache_dcache(pte_t pteval);
>>>>    #endif
>>>>    +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>> +
>>>>    void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>                  pte_t *ptep, pte_t pteval, unsigned int nr);
>>>>    #define set_ptes set_ptes
>>>> diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
>>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte,
>>>> unsigned int nr_pages)
>>>>            mte_sync_tags(pte, nr_pages);
>>>>    }
>>>>    +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>
>>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>>> kept contigously, so if you happen to be setting a mapping for which the
>>> physical memory block straddles bit 48, this won't work.
>>
>> Right, as soon as the PTE bits are not contiguous, this stops working, just like
>> set_ptes() would, which I used as orientation.
>>
>>>
>>> Today, only the 64K base page config can support 52 bits, and for this,
>>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>>> Fortunately we already have helpers in arm64 to abstract this.
>>>
>>> So I think arm64 will want to define its own pte_next_pfn():
>>>
>>> #define pte_next_pfn pte_next_pfn
>>> static inline pte_t pte_next_pfn(pte_t pte)
>>> {
>>>      return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>>> }
>>>
>>> I'll do a separate patch to fix the already broken arm64 set_ptes()
>>> implementation.
>>
>> Make sense.
>>
>>>
>>> I'm not sure if this type of problem might also apply to other arches?
>>
>> I saw similar handling in the PPC implementation of set_ptes, but was not able
>> to convince me that it is actually required there.
>>
>> pte_pfn on ppc does:
>>
>> static inline unsigned long pte_pfn(pte_t pte)
>> {
>>      return (pte_val(pte) & PTE_RPN_MASK) >> PTE_RPN_SHIFT;
>> }
>>
>> But that means that the PFNs *are* contiguous.
> 
> all the ppc pfn_pte() implementations also only shift the pfn, so I think ppc is
> safe to just define PFN_PTE_SHIFT. Although 2 of the 3 implementations shift by
> PTE_RPN_SHIFT and the other shifts by PAGE_SHIFT, so you might want to define
> PFN_PTE_SHIFT separately for all 3 configs?

We have PTE_RPN_SHIFT defined for all 4 implementations; for some of 
them you are right that it is defined as PAGE_SHIFT, but I see no reason 
to define PFN_PTE_SHIFT separately.
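
i.e. a single definition should cover all of them, something like:

	#define PFN_PTE_SHIFT	PTE_RPN_SHIFT

in the common powerpc pgtable header.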

> 
>> If high bits are used for
>> something else, then we might produce a garbage PTE on overflow, but that
>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>> detect "belongs to this folio batch" either way.
> 
> Exactly.
> 
>>
>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
> 
> I don't see the need for ppc to implement pte_next_pfn().

Agreed.

> 
> pte_pgprot() is not a "proper" arch interface (its only required by the core-mm
> if the arch implements a certain Kconfig IIRC). For arm64, all bits that are not
> pfn are pgprot, so there are no bits lost.
> 
>>
>>
>> I guess pte_pfn() implementations should tell us if anything special needs to
>> happen.
>>
> 

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:02       ` David Hildenbrand
@ 2024-01-23 11:17         ` Ryan Roberts
  2024-01-23 11:33           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 11:17 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 11:02, David Hildenbrand wrote:
> On 23.01.24 11:48, David Hildenbrand wrote:
>> On 23.01.24 11:34, Ryan Roberts wrote:
>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    arch/arm/include/asm/pgtable.h   | 2 ++
>>>>    arch/arm64/include/asm/pgtable.h | 2 ++
>>>>    2 files changed, 4 insertions(+)
>>>>
>>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>>> index d657b84b6bf70..be91e376df79e 100644
>>>> --- a/arch/arm/include/asm/pgtable.h
>>>> +++ b/arch/arm/include/asm/pgtable.h
>>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>>    extern void __sync_icache_dcache(pte_t pteval);
>>>>    #endif
>>>>    +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>> +
>>>>    void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>                  pte_t *ptep, pte_t pteval, unsigned int nr);
>>>>    #define set_ptes set_ptes
>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>> b/arch/arm64/include/asm/pgtable.h
>>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte,
>>>> unsigned int nr_pages)
>>>>            mte_sync_tags(pte, nr_pages);
>>>>    }
>>>>    +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>
>>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>>> kept contigously, so if you happen to be setting a mapping for which the
>>> physical memory block straddles bit 48, this won't work.
>>
>> Right, as soon as the PTE bits are not contiguous, this stops working,
>> just like set_ptes() would, which I used as orientation.
>>
>>>
>>> Today, only the 64K base page config can support 52 bits, and for this,
>>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>>> Fortunately we already have helpers in arm64 to abstract this.
>>>
>>> So I think arm64 will want to define its own pte_next_pfn():
>>>
>>> #define pte_next_pfn pte_next_pfn
>>> static inline pte_t pte_next_pfn(pte_t pte)
>>> {
>>>     return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>>> }
>>>
> 
> Digging into the details, on arm64 we have:
> 
> #define pte_pfn(pte)           (__pte_to_phys(pte) >> PAGE_SHIFT)
> 
> and
> 
> #define __pte_to_phys(pte)     (pte_val(pte) & PTE_ADDR_MASK)
> 
> But that implies, that upstream the PFN is always contiguous, no?
> 


But __pte_to_phys() and __phys_to_pte_val() depend on a Kconfig. If the PA
size is 52 bits, the bits are not all contiguous:

#ifdef CONFIG_ARM64_PA_BITS_52
static inline phys_addr_t __pte_to_phys(pte_t pte)
{
	return (pte_val(pte) & PTE_ADDR_LOW) |
		((pte_val(pte) & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
}
static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
{
	return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PTE_ADDR_MASK;
}
#else
#define __pte_to_phys(pte)	(pte_val(pte) & PTE_ADDR_MASK)
#define __phys_to_pte_val(phys)	(phys)
#endif
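
To make the failure mode concrete, here is a quick userspace model of 
the 64K-page encoding above (toy code, not the kernel definitions; the 
constants are filled in from memory just for the demo):

#include <stdio.h>
#include <stdint.h>

#define PAGE_SHIFT		16	/* 64K pages */
#define PTE_ADDR_LOW		(((1ULL << (48 - PAGE_SHIFT)) - 1) << PAGE_SHIFT)
#define PTE_ADDR_HIGH		(0xfULL << 12)
#define PTE_ADDR_HIGH_SHIFT	36

static uint64_t phys_to_pte(uint64_t phys)
{
	return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) &
	       (PTE_ADDR_LOW | PTE_ADDR_HIGH);
}

static uint64_t pte_to_phys(uint64_t pte)
{
	return (pte & PTE_ADDR_LOW) |
	       ((pte & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
}

int main(void)
{
	/* last 64K page below bit 48 */
	uint64_t phys = (1ULL << 48) - (1ULL << PAGE_SHIFT);
	uint64_t pte = phys_to_pte(phys);

	/* rebuilding from the phys addr moves the carry into OA[51:48] */
	printf("correct next: %llx\n", (unsigned long long)
	       pte_to_phys(phys_to_pte(phys + (1ULL << PAGE_SHIFT))));
	/* naive "pte + (1 << PFN_PTE_SHIFT)": the carry lands in bit 48
	 * of the raw pte, which is not part of the address fields */
	printf("naive   next: %llx\n", (unsigned long long)
	       pte_to_phys(pte + (1ULL << PAGE_SHIFT)));
	return 0;
}

This prints 1000000000000 for the correct case and 0 for the naive one.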


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:16         ` Christophe Leroy
@ 2024-01-23 11:31           ` David Hildenbrand
  2024-01-23 11:38             ` Ryan Roberts
  2024-01-24  5:46             ` Aneesh Kumar K.V
  0 siblings, 2 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 11:31 UTC (permalink / raw
  To: Christophe Leroy, Ryan Roberts, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

>>
>>> If high bits are used for
>>> something else, then we might produce a garbage PTE on overflow, but that
>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>> detect "belongs to this folio batch" either way.
>>
>> Exactly.
>>
>>>
>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>
>> I don't see the need for ppc to implement pte_next_pfn().
> 
> Agreed.

So likely we should then do on top for powerpc (whitespace damage):

diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
index a04ae4449a025..549a440ed7f65 100644
--- a/arch/powerpc/mm/pgtable.c
+++ b/arch/powerpc/mm/pgtable.c
@@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
                         break;
                 ptep++;
                 addr += PAGE_SIZE;
-               /*
-                * increment the pfn.
-                */
-               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
+               pte = pte_next_pfn(pte);
         }
  }
  


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:17         ` Ryan Roberts
@ 2024-01-23 11:33           ` David Hildenbrand
  2024-01-23 11:44             ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 11:33 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 12:17, Ryan Roberts wrote:
> On 23/01/2024 11:02, David Hildenbrand wrote:
>> On 23.01.24 11:48, David Hildenbrand wrote:
>>> On 23.01.24 11:34, Ryan Roberts wrote:
>>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>>>
>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>> ---
>>>>>     arch/arm/include/asm/pgtable.h   | 2 ++
>>>>>     arch/arm64/include/asm/pgtable.h | 2 ++
>>>>>     2 files changed, 4 insertions(+)
>>>>>
>>>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>>>> index d657b84b6bf70..be91e376df79e 100644
>>>>> --- a/arch/arm/include/asm/pgtable.h
>>>>> +++ b/arch/arm/include/asm/pgtable.h
>>>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>>>     extern void __sync_icache_dcache(pte_t pteval);
>>>>>     #endif
>>>>>     +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>>> +
>>>>>     void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>                   pte_t *ptep, pte_t pteval, unsigned int nr);
>>>>>     #define set_ptes set_ptes
>>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>>> b/arch/arm64/include/asm/pgtable.h
>>>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte,
>>>>> unsigned int nr_pages)
>>>>>             mte_sync_tags(pte, nr_pages);
>>>>>     }
>>>>>     +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>>
>>>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>>>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>>>> kept contigously, so if you happen to be setting a mapping for which the
>>>> physical memory block straddles bit 48, this won't work.
>>>
>>> Right, as soon as the PTE bits are not contiguous, this stops working,
>>> just like set_ptes() would, which I used as orientation.
>>>
>>>>
>>>> Today, only the 64K base page config can support 52 bits, and for this,
>>>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>>>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>>>> Fortunately we already have helpers in arm64 to abstract this.
>>>>
>>>> So I think arm64 will want to define its own pte_next_pfn():
>>>>
>>>> #define pte_next_pfn pte_next_pfn
>>>> static inline pte_t pte_next_pfn(pte_t pte)
>>>> {
>>>>      return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>>>> }
>>>>
>>
>> Digging into the details, on arm64 we have:
>>
>> #define pte_pfn(pte)           (__pte_to_phys(pte) >> PAGE_SHIFT)
>>
>> and
>>
>> #define __pte_to_phys(pte)     (pte_val(pte) & PTE_ADDR_MASK)
>>
>> But that implies, that upstream the PFN is always contiguous, no?
>>
> 
> 
> But __pte_to_phys() and __phys_to_pte_val() depend on a Kconfig. If PA bits is
> 52, the bits are not all contiguous:
> 
> #ifdef CONFIG_ARM64_PA_BITS_52
> static inline phys_addr_t __pte_to_phys(pte_t pte)
> {
> 	return (pte_val(pte) & PTE_ADDR_LOW) |
> 		((pte_val(pte) & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
> }
> static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
> {
> 	return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PTE_ADDR_MASK;
> }
> #else
> #define __pte_to_phys(pte)	(pte_val(pte) & PTE_ADDR_MASK)
> #define __phys_to_pte_val(phys)	(phys)
> #endif
> 

Ah, how could I have missed that? Agreed, set_ptes() and this patch are 
broken.

Do you want to send a patch to implement pte_next_pfn() on arm64, and 
then use pte_next_pfn() in set_ptes()? Then I can drop this patch here 
completely from this series.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:31           ` David Hildenbrand
@ 2024-01-23 11:38             ` Ryan Roberts
  2024-01-23 11:40               ` David Hildenbrand
  2024-01-23 11:48               ` Christophe Leroy
  2024-01-24  5:46             ` Aneesh Kumar K.V
  1 sibling, 2 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 11:38 UTC (permalink / raw
  To: David Hildenbrand, Christophe Leroy, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

On 23/01/2024 11:31, David Hildenbrand wrote:
>>>
>>>> If high bits are used for
>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>> detect "belongs to this folio batch" either way.
>>>
>>> Exactly.
>>>
>>>>
>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>
>>> I don't see the need for ppc to implement pte_next_pfn().
>>
>> Agreed.
> 
> So likely we should then do on top for powerpc (whitespace damage):
> 
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index a04ae4449a025..549a440ed7f65 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
> pte_t *ptep,
>                         break;
>                 ptep++;
>                 addr += PAGE_SIZE;
> -               /*
> -                * increment the pfn.
> -                */
> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
> +               pte = pte_next_pfn(pte);
>         }
>  }

Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
increment to this more complex approach, but the log doesn't say why.
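
IIRC the simple increment was something along the lines of

	pte = __pte(pte_val(pte) + (1UL << PTE_RPN_SHIFT));

i.e. just bumping the RPN field in place, whereas the current code 
rebuilds the pte via pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte)).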

>  
> 
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:38             ` Ryan Roberts
@ 2024-01-23 11:40               ` David Hildenbrand
  2024-01-24  5:45                 ` Aneesh Kumar K.V
  2024-01-23 11:48               ` Christophe Leroy
  1 sibling, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 11:40 UTC (permalink / raw
  To: Ryan Roberts, Christophe Leroy, linux-kernel@vger.kernel.org,
	Aneesh Kumar K.V
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

On 23.01.24 12:38, Ryan Roberts wrote:
> On 23/01/2024 11:31, David Hildenbrand wrote:
>>>>
>>>>> If high bits are used for
>>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>>> detect "belongs to this folio batch" either way.
>>>>
>>>> Exactly.
>>>>
>>>>>
>>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>>
>>>> I don't see the need for ppc to implement pte_next_pfn().
>>>
>>> Agreed.
>>
>> So likely we should then do on top for powerpc (whitespace damage):
>>
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index a04ae4449a025..549a440ed7f65 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep,
>>                          break;
>>                  ptep++;
>>                  addr += PAGE_SIZE;
>> -               /*
>> -                * increment the pfn.
>> -                */
>> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>> +               pte = pte_next_pfn(pte);
>>          }
>>   }
> 
> Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
> arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
> increment to this more complex approach, but the log doesn't say why.

@Aneesh, was that change on purpose?

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:33           ` David Hildenbrand
@ 2024-01-23 11:44             ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 11:44 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 11:33, David Hildenbrand wrote:
> On 23.01.24 12:17, Ryan Roberts wrote:
>> On 23/01/2024 11:02, David Hildenbrand wrote:
>>> On 23.01.24 11:48, David Hildenbrand wrote:
>>>> On 23.01.24 11:34, Ryan Roberts wrote:
>>>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>>>> We want to make use of pte_next_pfn() outside of set_ptes(). Let's
>>>>>> simpliy define PFN_PTE_SHIFT, required by pte_next_pfn().
>>>>>>
>>>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>>>> ---
>>>>>>     arch/arm/include/asm/pgtable.h   | 2 ++
>>>>>>     arch/arm64/include/asm/pgtable.h | 2 ++
>>>>>>     2 files changed, 4 insertions(+)
>>>>>>
>>>>>> diff --git a/arch/arm/include/asm/pgtable.h b/arch/arm/include/asm/pgtable.h
>>>>>> index d657b84b6bf70..be91e376df79e 100644
>>>>>> --- a/arch/arm/include/asm/pgtable.h
>>>>>> +++ b/arch/arm/include/asm/pgtable.h
>>>>>> @@ -209,6 +209,8 @@ static inline void __sync_icache_dcache(pte_t pteval)
>>>>>>     extern void __sync_icache_dcache(pte_t pteval);
>>>>>>     #endif
>>>>>>     +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>>>> +
>>>>>>     void set_ptes(struct mm_struct *mm, unsigned long addr,
>>>>>>                   pte_t *ptep, pte_t pteval, unsigned int nr);
>>>>>>     #define set_ptes set_ptes
>>>>>> diff --git a/arch/arm64/include/asm/pgtable.h
>>>>>> b/arch/arm64/include/asm/pgtable.h
>>>>>> index 79ce70fbb751c..d4b3bd96e3304 100644
>>>>>> --- a/arch/arm64/include/asm/pgtable.h
>>>>>> +++ b/arch/arm64/include/asm/pgtable.h
>>>>>> @@ -341,6 +341,8 @@ static inline void __sync_cache_and_tags(pte_t pte,
>>>>>> unsigned int nr_pages)
>>>>>>             mte_sync_tags(pte, nr_pages);
>>>>>>     }
>>>>>>     +#define PFN_PTE_SHIFT        PAGE_SHIFT
>>>>>
>>>>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>>>>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are
>>>>> not
>>>>> kept contigously, so if you happen to be setting a mapping for which the
>>>>> physical memory block straddles bit 48, this won't work.
>>>>
>>>> Right, as soon as the PTE bits are not contiguous, this stops working,
>>>> just like set_ptes() would, which I used as orientation.
>>>>
>>>>>
>>>>> Today, only the 64K base page config can support 52 bits, and for this,
>>>>> OA[51:48] are stored in PTE[15:12]. But 52 bits for 4K and 16K base pages is
>>>>> coming (hopefully v6.9) and in this case OA[51:50] are stored in PTE[9:8].
>>>>> Fortunately we already have helpers in arm64 to abstract this.
>>>>>
>>>>> So I think arm64 will want to define its own pte_next_pfn():
>>>>>
>>>>> #define pte_next_pfn pte_next_pfn
>>>>> static inline pte_t pte_next_pfn(pte_t pte)
>>>>> {
>>>>>      return pfn_pte(pte_pfn(pte) + 1, pte_pgprot(pte));
>>>>> }
>>>>>
>>>
>>> Digging into the details, on arm64 we have:
>>>
>>> #define pte_pfn(pte)           (__pte_to_phys(pte) >> PAGE_SHIFT)
>>>
>>> and
>>>
>>> #define __pte_to_phys(pte)     (pte_val(pte) & PTE_ADDR_MASK)
>>>
>>> But that implies, that upstream the PFN is always contiguous, no?
>>>
>>
>>
>> But __pte_to_phys() and __phys_to_pte_val() depend on a Kconfig. If PA bits is
>> 52, the bits are not all contiguous:
>>
>> #ifdef CONFIG_ARM64_PA_BITS_52
>> static inline phys_addr_t __pte_to_phys(pte_t pte)
>> {
>>     return (pte_val(pte) & PTE_ADDR_LOW) |
>>         ((pte_val(pte) & PTE_ADDR_HIGH) << PTE_ADDR_HIGH_SHIFT);
>> }
>> static inline pteval_t __phys_to_pte_val(phys_addr_t phys)
>> {
>>     return (phys | (phys >> PTE_ADDR_HIGH_SHIFT)) & PTE_ADDR_MASK;
>> }
>> #else
>> #define __pte_to_phys(pte)    (pte_val(pte) & PTE_ADDR_MASK)
>> #define __phys_to_pte_val(phys)    (phys)
>> #endif
>>
> 
> Ah, how could I've missed that. Agreed, set_ptes() and this patch are broken.
> 
> Do you want to send a patch to implement pte_next_pfn() on arm64, and then use
> pte_next_pfn() in set_ptes()? Then I can drop this patch here completely from
> this series.

Yes, good idea. I probably won't get around to it until tomorrow.


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:38             ` Ryan Roberts
  2024-01-23 11:40               ` David Hildenbrand
@ 2024-01-23 11:48               ` Christophe Leroy
  2024-01-23 11:53                 ` David Hildenbrand
  1 sibling, 1 reply; 49+ messages in thread
From: Christophe Leroy @ 2024-01-23 11:48 UTC (permalink / raw
  To: Ryan Roberts, David Hildenbrand, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org



Le 23/01/2024 à 12:38, Ryan Roberts a écrit :
> On 23/01/2024 11:31, David Hildenbrand wrote:
>>>>
>>>>> If high bits are used for
>>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>>> detect "belongs to this folio batch" either way.
>>>>
>>>> Exactly.
>>>>
>>>>>
>>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>>
>>>> I don't see the need for ppc to implement pte_next_pfn().
>>>
>>> Agreed.
>>
>> So likely we should then do on top for powerpc (whitespace damage):
>>
>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>> index a04ae4449a025..549a440ed7f65 100644
>> --- a/arch/powerpc/mm/pgtable.c
>> +++ b/arch/powerpc/mm/pgtable.c
>> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
>> pte_t *ptep,
>>                          break;
>>                  ptep++;
>>                  addr += PAGE_SIZE;
>> -               /*
>> -                * increment the pfn.
>> -                */
>> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>> +               pte = pte_next_pfn(pte);
>>          }
>>   }
> 
> Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
> arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
> increment to this more complex approach, but the log doesn't say why.

Right. There was a discussion about it without any conclusion: 
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20231024143604.16749-1-aneesh.kumar@linux.ibm.com/

As far as I understand, the simple increment is better on ppc/32 but 
worse on ppc/64.

Christophe

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:48               ` Christophe Leroy
@ 2024-01-23 11:53                 ` David Hildenbrand
  0 siblings, 0 replies; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 11:53 UTC (permalink / raw
  To: Christophe Leroy, Ryan Roberts, linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley,
	Palmer Dabbelt, Albert Ou, Alexander Gordeev, Gerald Schaefer,
	Heiko Carstens, Vasily Gorbik, Christian Borntraeger,
	Sven Schnelle, David S. Miller,
	linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

On 23.01.24 12:48, Christophe Leroy wrote:
> 
> 
> Le 23/01/2024 à 12:38, Ryan Roberts a écrit :
>> On 23/01/2024 11:31, David Hildenbrand wrote:
>>>>>
>>>>>> If high bits are used for
>>>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>>>> detect "belongs to this folio batch" either way.
>>>>>
>>>>> Exactly.
>>>>>
>>>>>>
>>>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>>>
>>>>> I don't see the need for ppc to implement pte_next_pfn().
>>>>
>>>> Agreed.
>>>
>>> So likely we should then do on top for powerpc (whitespace damage):
>>>
>>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>>> index a04ae4449a025..549a440ed7f65 100644
>>> --- a/arch/powerpc/mm/pgtable.c
>>> +++ b/arch/powerpc/mm/pgtable.c
>>> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep,
>>>                           break;
>>>                   ptep++;
>>>                   addr += PAGE_SIZE;
>>> -               /*
>>> -                * increment the pfn.
>>> -                */
>>> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>> +               pte = pte_next_pfn(pte);
>>>           }
>>>    }
>>
>> Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
>> arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
>> increment to this more complex approach, but the log doesn't say why.
> 
> Right. There was a discussion about it without any conclusion:
> https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20231024143604.16749-1-aneesh.kumar@linux.ibm.com/
> 
> As far as I understand, the simple increment is better on ppc/32 but
> worse on ppc/64.

Sounds like we're micro-optimizing for the output of a specific compiler 
version. Hurray.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-22 19:41 ` [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
@ 2024-01-23 12:01   ` Ryan Roberts
  2024-01-23 12:19     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 12:01 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> Let's implement PTE batching when consecutive (present) PTEs map
> consecutive pages of the same large folio, and all other PTE bits besides
> the PFNs are equal.
> 
> We will optimize folio_pte_batch() separately, to ignore some other
> PTE bits. This patch is based on work by Ryan Roberts.
> 
> Use __always_inline for __copy_present_ptes() and keep the handling for
> single PTEs completely separate from the multi-PTE case: we really want
> the compiler to optimize for the single-PTE case with small folios, to
> not degrade performance.
> 
> Note that PTE batching will never exceed a single page table and will
> always stay within VMA boundaries.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  include/linux/pgtable.h |  17 +++++-
>  mm/memory.c             | 113 +++++++++++++++++++++++++++++++++-------
>  2 files changed, 109 insertions(+), 21 deletions(-)
> 
> diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
> index f6d0e3513948a..d32cedf6936ba 100644
> --- a/include/linux/pgtable.h
> +++ b/include/linux/pgtable.h
> @@ -212,8 +212,6 @@ static inline int pmd_dirty(pmd_t pmd)
>  #define arch_flush_lazy_mmu_mode()	do {} while (0)
>  #endif
>  
> -#ifndef set_ptes
> -
>  #ifndef pte_next_pfn
>  static inline pte_t pte_next_pfn(pte_t pte)
>  {
> @@ -221,6 +219,7 @@ static inline pte_t pte_next_pfn(pte_t pte)
>  }
>  #endif
>  
> +#ifndef set_ptes
>  /**
>   * set_ptes - Map consecutive pages to a contiguous range of addresses.
>   * @mm: Address space to map the pages into.
> @@ -650,6 +649,20 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long addres
>  }
>  #endif
>  
> +#ifndef wrprotect_ptes

I wrote some documentation for this (based on Matthew's docs for set_ptes()) in
my version. Perhaps it makes sense to add it here, given this is overridable by
the arch.

/**
 * wrprotect_ptes - Write protect a consecutive set of pages.
 * @mm: Address space that the pages are mapped into.
 * @addr: Address of first page to write protect.
 * @ptep: Page table pointer for the first entry.
 * @nr: Number of pages to write protect.
 *
 * May be overridden by the architecture, else implemented as a loop over
 * ptep_set_wrprotect().
 *
 * Context: The caller holds the page table lock. The PTEs are all in the same
 * PMD.
 */

> +static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
> +		pte_t *ptep, unsigned int nr)
> +{
> +	for (;;) {
> +		ptep_set_wrprotect(mm, addr, ptep);
> +		if (--nr == 0)
> +			break;
> +		ptep++;
> +		addr += PAGE_SIZE;
> +	}
> +}
> +#endif
> +
>  /*
>   * On some architectures hardware does not set page access bit when accessing
>   * memory page, it is responsibility of software setting this bit. It brings
> diff --git a/mm/memory.c b/mm/memory.c
> index 185b4aff13d62..f563aec85b2a8 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -930,15 +930,15 @@ copy_present_page(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  	return 0;
>  }
>  
> -static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
> +static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,

nit: doesn't the addition of __always_inline really belong in the patch where
you factored this out? (#7)

>  		struct vm_area_struct *src_vma, pte_t *dst_pte, pte_t *src_pte,
> -		pte_t pte, unsigned long addr)
> +		pte_t pte, unsigned long addr, int nr)
>  {
>  	struct mm_struct *src_mm = src_vma->vm_mm;
>  
>  	/* If it's a COW mapping, write protect it both processes. */
>  	if (is_cow_mapping(src_vma->vm_flags) && pte_write(pte)) {
> -		ptep_set_wrprotect(src_mm, addr, src_pte);
> +		wrprotect_ptes(src_mm, addr, src_pte, nr);
>  		pte = pte_wrprotect(pte);
>  	}
>  
> @@ -950,26 +950,94 @@ static inline void __copy_present_pte(struct vm_area_struct *dst_vma,
>  	if (!userfaultfd_wp(dst_vma))
>  		pte = pte_clear_uffd_wp(pte);
>  
> -	set_pte_at(dst_vma->vm_mm, addr, dst_pte, pte);
> +	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
> +}
> +
> +/*
> + * Detect a PTE batch: consecutive (present) PTEs that map consecutive
> + * pages of the same folio.
> + *
> + * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> + */
> +static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> +		pte_t *start_ptep, pte_t pte, int max_nr)
> +{
> +	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> +	const pte_t *end_ptep = start_ptep + max_nr;
> +	pte_t expected_pte = pte_next_pfn(pte);
> +	pte_t *ptep = start_ptep + 1;
> +
> +	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> +
> +	while (ptep != end_ptep) {
> +		pte = ptep_get(ptep);
> +
> +		if (!pte_same(pte, expected_pte))
> +			break;
> +
> +		/*
> +		 * Stop immediately once we reached the end of the folio. In
> +		 * corner cases the next PFN might fall into a different
> +		 * folio.
> +		 */
> +		if (pte_pfn(pte) == folio_end_pfn)
> +			break;
> +
> +		expected_pte = pte_next_pfn(expected_pte);
> +		ptep++;
> +	}
> +
> +	return ptep - start_ptep;
>  }
>  
>  /*
> - * Copy one pte.  Returns 0 if succeeded, or -EAGAIN if one preallocated page
> - * is required to copy this pte.
> + * Copy one present PTE, trying to batch-process subsequent PTEs that map
> + * consecutive pages of the same folio by copying them as well.
> + *
> + * Returns -EAGAIN if one preallocated page is required to copy the next PTE.
> + * Otherwise, returns the number of copied PTEs (at least 1).
>   */
>  static inline int
> -copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
> +copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		 pte_t *dst_pte, pte_t *src_pte, pte_t pte, unsigned long addr,
> -		 int *rss, struct folio **prealloc)
> +		 int max_nr, int *rss, struct folio **prealloc)
>  {
>  	struct page *page;
>  	struct folio *folio;
> +	int err, nr;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
>  	if (unlikely(!page))
>  		goto copy_pte;
>  
>  	folio = page_folio(page);
> +
> +	/*
> +	 * If we likely have to copy, just don't bother with batching. Make
> +	 * sure that the common "small folio" case stays as fast as possible
> +	 * by keeping the batching logic separate.
> +	 */
> +	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> +		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
> +		if (folio_test_anon(folio)) {
> +			folio_ref_add(folio, nr);
> +			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> +								  nr, src_vma))) {

What happens if it's not the first page of the batch that fails here? Aren't you
signalling that you need a prealloc'ed page for the wrong pte? Shouldn't you
still batch copy all the way up to the failing page first? Perhaps it all comes
out in the wash and these events are so infrequent that we don't care about the
lost batching opportunity?

> +				folio_ref_sub(folio, nr);
> +				return -EAGAIN;
> +			}
> +			rss[MM_ANONPAGES] += nr;
> +			VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> +		} else {
> +			folio_ref_add(folio, nr);

Perhaps hoist this out to immediately after folio_pte_batch() since you're
calling it on both branches?

> +			folio_dup_file_rmap_ptes(folio, page, nr);
> +			rss[mm_counter_file(page)] += nr;
> +		}
> +		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
> +				    addr, nr);
> +		return nr;
> +	}
> +
>  	if (folio_test_anon(folio)) {
>  		/*
>  		 * If this page may have been pinned by the parent process,
> @@ -981,8 +1049,9 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		if (unlikely(folio_try_dup_anon_rmap_pte(folio, page, src_vma))) {
>  			/* Page may be pinned, we have to copy. */
>  			folio_put(folio);
> -			return copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> -						 addr, rss, prealloc, page);
> +			err = copy_present_page(dst_vma, src_vma, dst_pte, src_pte,
> +						addr, rss, prealloc, page);
> +			return err ? err : 1;
>  		}
>  		rss[MM_ANONPAGES]++;
>  		VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
> @@ -993,8 +1062,8 @@ copy_present_pte(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	}
>  
>  copy_pte:
> -	__copy_present_pte(dst_vma, src_vma, dst_pte, src_pte, pte, addr);
> -	return 0;
> +	__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte, addr, 1);
> +	return 1;
>  }
>  
>  static inline struct folio *folio_prealloc(struct mm_struct *src_mm,
> @@ -1031,10 +1100,11 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	pte_t *src_pte, *dst_pte;
>  	pte_t ptent;
>  	spinlock_t *src_ptl, *dst_ptl;
> -	int progress, ret = 0;
> +	int progress, max_nr, ret = 0;
>  	int rss[NR_MM_COUNTERS];
>  	swp_entry_t entry = (swp_entry_t){0};
>  	struct folio *prealloc = NULL;
> +	int nr;
>  
>  again:
>  	progress = 0;
> @@ -1065,6 +1135,8 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  	arch_enter_lazy_mmu_mode();
>  
>  	do {
> +		nr = 1;
> +
>  		/*
>  		 * We are holding two locks at this point - either of them
>  		 * could generate latencies in another task on another CPU.
> @@ -1101,9 +1173,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			 */
>  			WARN_ON_ONCE(ret != -ENOENT);
>  		}
> -		/* copy_present_pte() will clear `*prealloc' if consumed */
> -		ret = copy_present_pte(dst_vma, src_vma, dst_pte, src_pte,
> -				       ptent, addr, rss, &prealloc);
> +		/* copy_present_ptes() will clear `*prealloc' if consumed */
> +		max_nr = (end - addr) / PAGE_SIZE;
> +		ret = copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte,
> +					ptent, addr, max_nr, rss, &prealloc);
>  		/*
>  		 * If we need a pre-allocated page for this pte, drop the
>  		 * locks, allocate, and try again.
> @@ -1120,8 +1193,10 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  			folio_put(prealloc);
>  			prealloc = NULL;
>  		}
> -		progress += 8;
> -	} while (dst_pte++, src_pte++, addr += PAGE_SIZE, addr != end);
> +		nr = ret;
> +		progress += 8 * nr;
> +	} while (dst_pte += nr, src_pte += nr, addr += PAGE_SIZE * nr,
> +		 addr != end);
>  
>  	arch_leave_lazy_mmu_mode();
>  	pte_unmap_unlock(orig_src_pte, src_ptl);
> @@ -1142,7 +1217,7 @@ copy_pte_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma,
>  		prealloc = folio_prealloc(src_mm, src_vma, addr, false);
>  		if (!prealloc)
>  			return -ENOMEM;
> -	} else if (ret) {
> +	} else if (ret < 0) {
>  		VM_WARN_ON_ONCE(1);
>  	}
>  


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 12:01   ` Ryan Roberts
@ 2024-01-23 12:19     ` David Hildenbrand
  2024-01-23 12:28       ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 12:19 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

[...]

> 
> I wrote some documentation for this (based on Matthew's docs for set_ptes() in
> my version. Perhaps it makes sense to add it here, given this is overridable by
> the arch.
> 
> /**
>   * wrprotect_ptes - Write protect a consecutive set of pages.
>   * @mm: Address space that the pages are mapped into.
>   * @addr: Address of first page to write protect.
>   * @ptep: Page table pointer for the first entry.
>   * @nr: Number of pages to write protect.
>   *
>   * May be overridden by the architecture, else implemented as a loop over
>   * ptep_set_wrprotect().
>   *
>   * Context: The caller holds the page table lock. The PTEs are all in the same
>   * PMD.
>   */
> 

I could have sworn I had documentation for this at some point. Let me add some, 
thanks.
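
The generic fallback boils down to a simple loop over ptep_set_wrprotect();
roughly (untested sketch):

#ifndef wrprotect_ptes
static inline void wrprotect_ptes(struct mm_struct *mm, unsigned long addr,
				  pte_t *ptep, unsigned int nr)
{
	/* Write-protect each PTE in the range individually. */
	for (;;) {
		ptep_set_wrprotect(mm, addr, ptep);
		if (--nr == 0)
			break;
		ptep++;
		addr += PAGE_SIZE;
	}
}
#endif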

[...]

>> +
>> +	/*
>> +	 * If we likely have to copy, just don't bother with batching. Make
>> +	 * sure that the common "small folio" case stays as fast as possible
>> +	 * by keeping the batching logic separate.
>> +	 */
>> +	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
>> +		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
>> +		if (folio_test_anon(folio)) {
>> +			folio_ref_add(folio, nr);
>> +			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>> +								  nr, src_vma))) {
> 
> What happens if its not the first page of the batch that fails here? Aren't you
> signalling that you need a prealloc'ed page for the wrong pte? Shouldn't you
> still batch copy all the way up to the failing page first? Perhaps it all comes
> out in the wash and these events are so infrequent that we don't care about the
> lost batching opportunity?

I assume you mean the weird corner case where some folio pages in the 
range have PAE set, others don't -- and the folio may be pinned.

In that case, we fall back to individual pages, and might have 
preallocated a page although we wouldn't have to preallocate one for 
processing the next page (that doesn't have PAE set).

It should all work, although not optimized to the extreme, and as it's a 
corner case, we don't particularly care. Hopefully, in the future we'll 
only have a single PAE flag per folio.

Or am I missing something?

> 
>> +				folio_ref_sub(folio, nr);
>> +				return -EAGAIN;
>> +			}
>> +			rss[MM_ANONPAGES] += nr;
>> +			VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
>> +		} else {
>> +			folio_ref_add(folio, nr);
> 
> Perhaps hoist this out to immediately after folio_pte_batch() since you're
> calling it on both branches?

Makes sense, thanks.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-22 19:41 ` [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch() David Hildenbrand
@ 2024-01-23 12:25   ` Ryan Roberts
  2024-01-23 13:06     ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 12:25 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> Let's ignore these bits: they are irrelevant for fork, and will likely
> be irrelevant for upcoming users such as page unmapping.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 10 ++++++++--
>  1 file changed, 8 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index f563aec85b2a8..341b2be845b6e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>  	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>  }
>  
> +static inline pte_t __pte_batch_clear_ignored(pte_t pte)
> +{
> +	return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
> +}
> +
>  /*
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same folio.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.

nit: last char should be a comma (,) not a full stop (.)

> + * the accessed bit, dirty bit and soft-dirty bit.
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  		pte_t *start_ptep, pte_t pte, int max_nr)
>  {
>  	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>  	const pte_t *end_ptep = start_ptep + max_nr;
> -	pte_t expected_pte = pte_next_pfn(pte);
> +	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>  	pte_t *ptep = start_ptep + 1;
>  
>  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  
>  	while (ptep != end_ptep) {
> -		pte = ptep_get(ptep);
> +		pte = __pte_batch_clear_ignored(ptep_get(ptep));
>  
>  		if (!pte_same(pte, expected_pte))
>  			break;

I think you'll lose dirty information in the child for private mappings? If the
first pte in a batch is clean, but a subsequent page is dirty, you will end up
setting all the pages in the batch as clean in the child. Previous behavior
would preserve dirty bit for private mappings.

In my version (v3) that did arbitrary batching, I had some fun and games
tracking dirty, write and uffd_wp:
https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-2-ryan.roberts@arm.com/

Also, I think you will currently either set soft dirty on all or none of the
pages in the batch, depending on the value of the first. I previously convinced
myself that the state was unimportant so always cleared it in the child to
provide consistency.




^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 12:19     ` David Hildenbrand
@ 2024-01-23 12:28       ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 12:28 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 12:19, David Hildenbrand wrote:
> [...]
> 
>>
>> I wrote some documentation for this (based on Matthew's docs for set_ptes() in
>> my version. Perhaps it makes sense to add it here, given this is overridable by
>> the arch.
>>
>> /**
>>   * wrprotect_ptes - Write protect a consecutive set of pages.
>>   * @mm: Address space that the pages are mapped into.
>>   * @addr: Address of first page to write protect.
>>   * @ptep: Page table pointer for the first entry.
>>   * @nr: Number of pages to write protect.
>>   *
>>   * May be overridden by the architecture, else implemented as a loop over
>>   * ptep_set_wrprotect().
>>   *
>>   * Context: The caller holds the page table lock. The PTEs are all in the same
>>   * PMD.
>>   */
>>
> 
> I could have sworn I had a documentation at some point. Let me add some, thanks.
> 
> [...]
> 
>>> +
>>> +    /*
>>> +     * If we likely have to copy, just don't bother with batching. Make
>>> +     * sure that the common "small folio" case stays as fast as possible
>>> +     * by keeping the batching logic separate.
>>> +     */
>>> +    if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
>>> +        nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
>>> +        if (folio_test_anon(folio)) {
>>> +            folio_ref_add(folio, nr);
>>> +            if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
>>> +                                  nr, src_vma))) {
>>
>> What happens if its not the first page of the batch that fails here? Aren't you
>> signalling that you need a prealloc'ed page for the wrong pte? Shouldn't you
>> still batch copy all the way up to the failing page first? Perhaps it all comes
>> out in the wash and these events are so infrequent that we don't care about the
>> lost batching opportunity?
> 
> I assume you mean the weird corner case that some folio pages in the range have
> PAE set, others don't -- and the folio maybe pinned.
> 
> In that case, we fallback to individual pages, and might have preallocated a
> page although we wouldn't have to preallocate one for processing the next page
> (that doesn't have PAE set).
> 
> It should all work, although not optimized to the extreme, and as it's a corner
> case, we don't particularly care. Hopefully, in the future we'll only have a
> single PAE flag per folio.
> 
> Or am I missing something?

No, your explanation makes sense. Just wanted to check this all definitely
worked, because the flow is slightly different to my previous version that was
doing try_dup_rmap page-by-page.

> 
>>
>>> +                folio_ref_sub(folio, nr);
>>> +                return -EAGAIN;
>>> +            }
>>> +            rss[MM_ANONPAGES] += nr;
>>> +            VM_WARN_ON_FOLIO(PageAnonExclusive(page), folio);
>>> +        } else {
>>> +            folio_ref_add(folio, nr);
>>
>> Perhaps hoist this out to immediately after folio_pte_batch() since you're
>> calling it on both branches?
> 
> Makes sense, thanks.
> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 11/11] mm/memory: ignore writable bit in folio_pte_batch()
  2024-01-22 19:42 ` [PATCH v1 11/11] mm/memory: ignore writable bit " David Hildenbrand
@ 2024-01-23 12:35   ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 12:35 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:42, David Hildenbrand wrote:
> ... and conditionally return to the caller if any pte except the first one
> is writable. fork() has to make sure to properly write-protect in case any
> PTE is writable. Other users (e.g., page unmaping) won't care.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>

Reviewed-by: Ryan Roberts <ryan.roberts@arm.com>

> ---
>  mm/memory.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 341b2be845b6e..a26fd0669016b 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -955,7 +955,7 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>  
>  static inline pte_t __pte_batch_clear_ignored(pte_t pte)
>  {
> -	return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
> +	return pte_wrprotect(pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte))));
>  }
>  
>  /*
> @@ -963,20 +963,29 @@ static inline pte_t __pte_batch_clear_ignored(pte_t pte)
>   * pages of the same folio.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> - * the accessed bit, dirty bit and soft-dirty bit.
> + * the accessed bit, dirty bit, soft-dirty bit and writable bit.
> + . If "any_writable" is set, it will indicate if any other PTE besides the
> + * first (given) PTE is writable.
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> -		pte_t *start_ptep, pte_t pte, int max_nr)
> +		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>  {
>  	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>  	const pte_t *end_ptep = start_ptep + max_nr;
>  	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>  	pte_t *ptep = start_ptep + 1;
> +	bool writable;
> +
> +	if (any_writable)
> +		*any_writable = false;
>  
>  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  
>  	while (ptep != end_ptep) {
> -		pte = __pte_batch_clear_ignored(ptep_get(ptep));
> +		pte = ptep_get(ptep);
> +		if (any_writable)
> +			writable = !!pte_write(pte);
> +		pte = __pte_batch_clear_ignored(pte);
>  
>  		if (!pte_same(pte, expected_pte))
>  			break;
> @@ -989,6 +998,9 @@ static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>  		if (pte_pfn(pte) == folio_end_pfn)
>  			break;
>  
> +		if (any_writable)
> +			*any_writable |= writable;
> +
>  		expected_pte = pte_next_pfn(expected_pte);
>  		ptep++;
>  	}
> @@ -1010,6 +1022,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  {
>  	struct page *page;
>  	struct folio *folio;
> +	bool any_writable;
>  	int err, nr;
>  
>  	page = vm_normal_page(src_vma, addr, pte);
> @@ -1024,7 +1037,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  	 * by keeping the batching logic separate.
>  	 */
>  	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> -		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
> +		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr,
> +				     &any_writable);
>  		if (folio_test_anon(folio)) {
>  			folio_ref_add(folio, nr);
>  			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
> @@ -1039,6 +1053,8 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
>  			folio_dup_file_rmap_ptes(folio, page, nr);
>  			rss[mm_counter_file(page)] += nr;
>  		}
> +		if (any_writable)
> +			pte = pte_mkwrite(pte, src_vma);
>  		__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
>  				    addr, nr);
>  		return nr;


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-23 12:25   ` Ryan Roberts
@ 2024-01-23 13:06     ` David Hildenbrand
  2024-01-23 13:42       ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 13:06 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 13:25, Ryan Roberts wrote:
> On 22/01/2024 19:41, David Hildenbrand wrote:
>> Let's ignore these bits: they are irrelevant for fork, and will likely
>> be irrelevant for upcoming users such as page unmapping.
>>
>> Signed-off-by: David Hildenbrand <david@redhat.com>
>> ---
>>   mm/memory.c | 10 ++++++++--
>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/memory.c b/mm/memory.c
>> index f563aec85b2a8..341b2be845b6e 100644
>> --- a/mm/memory.c
>> +++ b/mm/memory.c
>> @@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
>>   	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>   }
>>   
>> +static inline pte_t __pte_batch_clear_ignored(pte_t pte)
>> +{
>> +	return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
>> +}
>> +
>>   /*
>>    * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>    * pages of the same folio.
>>    *
>>    * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> 
> nit: last char should be a comma (,) not a full stop (.)
> 
>> + * the accessed bit, dirty bit and soft-dirty bit.
>>    */
>>   static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>   		pte_t *start_ptep, pte_t pte, int max_nr)
>>   {
>>   	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>   	const pte_t *end_ptep = start_ptep + max_nr;
>> -	pte_t expected_pte = pte_next_pfn(pte);
>> +	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>   	pte_t *ptep = start_ptep + 1;
>>   
>>   	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>   
>>   	while (ptep != end_ptep) {
>> -		pte = ptep_get(ptep);
>> +		pte = __pte_batch_clear_ignored(ptep_get(ptep));
>>   
>>   		if (!pte_same(pte, expected_pte))
>>   			break;
> 
> I think you'll lose dirty information in the child for private mappings? If the
> first pte in a batch is clean, but a subsequent page is dirty, you will end up
> setting all the pages in the batch as clean in the child. Previous behavior
> would preserve dirty bit for private mappings.
> 
> In my version (v3) that did arbitrary batching, I had some fun and games
> tracking dirty, write and uffd_wp:
> https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-2-ryan.roberts@arm.com/
> 
> Also, I think you will currently either set soft dirty on all or none of the
> pages in the batch, depending on the value of the first. I previously convinced
> myself that the state was unimportant so always cleared it in the child to
> provide consistency.

Good points regarding dirty and soft-dirty. I wanted to avoid passing 
flags to folio_pte_batch(), but maybe that's just what we need to not 
change behavior.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-23 13:06     ` David Hildenbrand
@ 2024-01-23 13:42       ` Ryan Roberts
  2024-01-23 13:55         ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 13:42 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 13:06, David Hildenbrand wrote:
> On 23.01.24 13:25, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> Let's ignore these bits: they are irrelevant for fork, and will likely
>>> be irrelevant for upcoming users such as page unmapping.
>>>
>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>> ---
>>>   mm/memory.c | 10 ++++++++--
>>>   1 file changed, 8 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/memory.c b/mm/memory.c
>>> index f563aec85b2a8..341b2be845b6e 100644
>>> --- a/mm/memory.c
>>> +++ b/mm/memory.c
>>> @@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct
>>> vm_area_struct *dst_vma,
>>>       set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>   }
>>>   +static inline pte_t __pte_batch_clear_ignored(pte_t pte)
>>> +{
>>> +    return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
>>> +}
>>> +
>>>   /*
>>>    * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>>    * pages of the same folio.
>>>    *
>>>    * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
>>
>> nit: last char should be a comma (,) not a full stop (.)
>>
>>> + * the accessed bit, dirty bit and soft-dirty bit.
>>>    */
>>>   static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>           pte_t *start_ptep, pte_t pte, int max_nr)
>>>   {
>>>       unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>       const pte_t *end_ptep = start_ptep + max_nr;
>>> -    pte_t expected_pte = pte_next_pfn(pte);
>>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>>       pte_t *ptep = start_ptep + 1;
>>>         VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>         while (ptep != end_ptep) {
>>> -        pte = ptep_get(ptep);
>>> +        pte = __pte_batch_clear_ignored(ptep_get(ptep));
>>>             if (!pte_same(pte, expected_pte))
>>>               break;
>>
>> I think you'll lose dirty information in the child for private mappings? If the
>> first pte in a batch is clean, but a subsequent page is dirty, you will end up
>> setting all the pages in the batch as clean in the child. Previous behavior
>> would preserve dirty bit for private mappings.
>>
>> In my version (v3) that did arbitrary batching, I had some fun and games
>> tracking dirty, write and uffd_wp:
>> https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-2-ryan.roberts@arm.com/
>>
>> Also, I think you will currently either set soft dirty on all or none of the
>> pages in the batch, depending on the value of the first. I previously convinced
>> myself that the state was unimportant so always cleared it in the child to
>> provide consistency.
> 
> Good points regarding dirty and soft-dirty. I wanted to avoid passing flags to
> folio_pte_batch(), but maybe that's just what we need to not change behavior.

I think you could just not bother with the enforce_uffd_wp and always enforce
uffd-wp. So that's one simplification vs mine. Then you just need an any_dirty
flag following the same pattern as your any_writable, and set the dirty bit on
the whole batch in the child if any PTE was dirty in the parent.
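
Something like this on top of your 11/11 hunk (untested sketch; the extra
folio_pte_batch() parameter and plumbing are invented here):

	bool any_writable, any_dirty;
	...
	nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr,
			     &any_writable, &any_dirty);
	...
	if (any_writable)
		pte = pte_mkwrite(pte, src_vma);
	if (any_dirty)
		pte = pte_mkdirty(pte);
	__copy_present_ptes(dst_vma, src_vma, dst_pte, src_pte, pte,
			    addr, nr);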

Although now I'm wondering if there is a race here... What happens if a page in
the parent becomes dirty after you have checked it but before you write protect
it? Isn't that already a problem with the current non-batched version? Why do we
even need to preserve dirty in the child for private mappings?


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-23 13:42       ` Ryan Roberts
@ 2024-01-23 13:55         ` David Hildenbrand
  2024-01-23 14:13           ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 13:55 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 14:42, Ryan Roberts wrote:
> On 23/01/2024 13:06, David Hildenbrand wrote:
>> On 23.01.24 13:25, Ryan Roberts wrote:
>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>> Let's ignore these bits: they are irrelevant for fork, and will likely
>>>> be irrelevant for upcoming users such as page unmapping.
>>>>
>>>> Signed-off-by: David Hildenbrand <david@redhat.com>
>>>> ---
>>>>    mm/memory.c | 10 ++++++++--
>>>>    1 file changed, 8 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/memory.c b/mm/memory.c
>>>> index f563aec85b2a8..341b2be845b6e 100644
>>>> --- a/mm/memory.c
>>>> +++ b/mm/memory.c
>>>> @@ -953,24 +953,30 @@ static __always_inline void __copy_present_ptes(struct
>>>> vm_area_struct *dst_vma,
>>>>        set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>>>>    }
>>>>    +static inline pte_t __pte_batch_clear_ignored(pte_t pte)
>>>> +{
>>>> +    return pte_clear_soft_dirty(pte_mkclean(pte_mkold(pte)));
>>>> +}
>>>> +
>>>>    /*
>>>>     * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>>>>     * pages of the same folio.
>>>>     *
>>>>     * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
>>>
>>> nit: last char should be a comma (,) not a full stop (.)
>>>
>>>> + * the accessed bit, dirty bit and soft-dirty bit.
>>>>     */
>>>>    static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>            pte_t *start_ptep, pte_t pte, int max_nr)
>>>>    {
>>>>        unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>>        const pte_t *end_ptep = start_ptep + max_nr;
>>>> -    pte_t expected_pte = pte_next_pfn(pte);
>>>> +    pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>>>        pte_t *ptep = start_ptep + 1;
>>>>          VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>>          while (ptep != end_ptep) {
>>>> -        pte = ptep_get(ptep);
>>>> +        pte = __pte_batch_clear_ignored(ptep_get(ptep));
>>>>              if (!pte_same(pte, expected_pte))
>>>>                break;
>>>
>>> I think you'll lose dirty information in the child for private mappings? If the
>>> first pte in a batch is clean, but a subsequent page is dirty, you will end up
>>> setting all the pages in the batch as clean in the child. Previous behavior
>>> would preserve dirty bit for private mappings.
>>>
>>> In my version (v3) that did arbitrary batching, I had some fun and games
>>> tracking dirty, write and uffd_wp:
>>> https://lore.kernel.org/linux-arm-kernel/20231204105440.61448-2-ryan.roberts@arm.com/
>>>
>>> Also, I think you will currently either set soft dirty on all or none of the
>>> pages in the batch, depending on the value of the first. I previously convinced
>>> myself that the state was unimportant so always cleared it in the child to
>>> provide consistency.
>>
>> Good points regarding dirty and soft-dirty. I wanted to avoid passing flags to
>> folio_pte_batch(), but maybe that's just what we need to not change behavior.
> 
> I think you could not bother with the enforce_uffd_wp - just always enforce
> uffd-wp. So that's one simplification vs mine. Then you just need an any_dirty

I think I'll just leave uffd-wp alone for now; it's a corner case with 
fork/munmap that can be optimized later on top if really needed.

Regarding soft-dirty (which is set automatically much more often), I can 
certainly ignore the bit if !vma_soft_dirty_enabled(vma) [which is true 
in most cases]. So that's easy to handle. But soft-dirty for the child is 
likely completely meaningless and should always be cleared. 
I have to double-check what the vm_flags will be for the child process.

> flag following the same pattern as your any_writable. Then just set dirty on the
> whole batch in the child if any were dirty in the parent.

Regarding dirtying, I'm not 100% sure yet if we should just always dirty 
all ptes if any is dirty, or if we should preserve the state for private 
VMAs for now.

> 
> Although now I'm wondering if there is a race here... What happens if a page in
> the parent becomes dirty after you have checked it but before you write protect
> it? Isn't that already a problem with the current non-batched version? Why do we
> even to preserve dirty in the child for private mappings?

I suspect it's because the parent could zap the anon folio. If the folio is 
clean, but the PTE is dirty, I suspect that we could lose the child's data 
if we were to evict that clean folio (swapout).

So I assume we simply copy the dirty PTE bit, so the system knows that 
that folio is actually dirty, because one PTE is dirty.

Touching only PTEs avoids having to mess with folio flags.

But that's just pure speculation. E.g., fs/proc/task_mmu.c does some 
slightly different accounting if a PTE is dirty. But usually, it checks 
if either the PTE or the folio is dirty.

I'll have to do some more digging.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-23 13:55         ` David Hildenbrand
@ 2024-01-23 14:13           ` David Hildenbrand
  2024-01-23 14:27             ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 14:13 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

>> Although now I'm wondering if there is a race here... What happens if a page in
>> the parent becomes dirty after you have checked it but before you write protect
>> it? Isn't that already a problem with the current non-batched version? Why do we
>> even to preserve dirty in the child for private mappings?
> 
> I suspect, because the parent could zap the anon folio. If the folio is
> clean, but the PTE dirty, I suspect that we could lose data of the child
> if we were to evict that clean folio (swapout).
> 
> So I assume we simply copy the dirty PTE bit, so the system knows that
> that folio is actually dirty, because one PTE is dirty.

Oh, and regarding your race concern: it's undefined which page state one
would see if some write is racing with fork, so it also doesn't matter
whether we copy the PTE dirty bit or not if it gets set in a racy fashion.

I'll now experiment with:

 From 14e83ff2a422a96ce5701f9c8454a49f9ed947e3 Mon Sep 17 00:00:00 2001
From: David Hildenbrand <david@redhat.com>
Date: Sat, 30 Dec 2023 12:54:35 +0100
Subject: [PATCH] mm/memory: ignore dirty/accessed/soft-dirty bits in
  folio_pte_batch()

Let's always ignore the accessed/young bit: we'll always mark the PTE
as old in our child process during fork, and upcoming users will
similarly not care.

Ignore the dirty bit only if we don't want to duplicate the dirty bit
into the child process during fork. Maybe, we could just set all PTEs
in the child dirty if any PTE is dirty. For now, let's keep the behavior
unchanged.

Ignore the soft-dirty bit only if the bit doesn't have any meaning in
the src vma.

Signed-off-by: David Hildenbrand <david@redhat.com>
---
  mm/memory.c | 34 ++++++++++++++++++++++++++++++----
  1 file changed, 30 insertions(+), 4 deletions(-)

diff --git a/mm/memory.c b/mm/memory.c
index 7690994929d26..9aba1b0e871ca 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct vm_area_struct *dst_vma,
  	set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
  }
  
+/* Flags for folio_pte_batch(). */
+typedef int __bitwise fpb_t;
+
+/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
+#define FPB_IGNORE_DIRTY		((__force fpb_t)BIT(0))
+
+/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
+#define FPB_IGNORE_SOFT_DIRTY		((__force fpb_t)BIT(1))
+
+static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
+{
+	if (flags & FPB_IGNORE_DIRTY)
+		pte = pte_mkclean(pte);
+	if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
+		pte = pte_clear_soft_dirty(pte);
+	return pte_mkold(pte);
+}
+
  /*
   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
   * pages of the same folio.
   *
   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
+ * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
+ * (with FPB_IGNORE_SOFT_DIRTY).
   */
  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
-		pte_t *start_ptep, pte_t pte, int max_nr)
+		pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
  {
  	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
  	const pte_t *end_ptep = start_ptep + max_nr;
-	pte_t expected_pte = pte_next_pfn(pte);
+	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
  	pte_t *ptep = start_ptep + 1;
  
  	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
  
  	while (ptep != end_ptep) {
-		pte = ptep_get(ptep);
+		pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
  
  		if (!pte_same(pte, expected_pte))
  			break;
@@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
  {
  	struct page *page;
  	struct folio *folio;
+	fpb_t flags = 0;
  	int err, nr;
  
  	page = vm_normal_page(src_vma, addr, pte);
@@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma
  	 * by keeping the batching logic separate.
  	 */
  	if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
-		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
+		if (src_vma->vm_flags & VM_SHARED)
+			flags |= FPB_IGNORE_DIRTY;
+		if (!vma_soft_dirty_enabled(src_vma))
+			flags |= FPB_IGNORE_SOFT_DIRTY;
+
+		nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
  		folio_ref_add(folio, nr);
  		if (folio_test_anon(folio)) {
  			if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,
-- 
2.43.0


-- 
Cheers,

David / dhildenb


^ permalink raw reply related	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch()
  2024-01-23 14:13           ` David Hildenbrand
@ 2024-01-23 14:27             ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 14:27 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 14:13, David Hildenbrand wrote:
>>> Although now I'm wondering if there is a race here... What happens if a page in
>>> the parent becomes dirty after you have checked it but before you write protect
>>> it? Isn't that already a problem with the current non-batched version? Why do we
>>> even to preserve dirty in the child for private mappings?
>>
>> I suspect, because the parent could zap the anon folio. If the folio is
>> clean, but the PTE dirty, I suspect that we could lose data of the child
>> if we were to evict that clean folio (swapout).
>>
>> So I assume we simply copy the dirty PTE bit, so the system knows that
>> that folio is actually dirty, because one PTE is dirty.
> 
> Oh, and regarding your race concern: it's undefined which page state
> would see if some write is racing with fork, so it also doesn't matter
> if we would copy the PTE dirty bit or not, if it gets set in a racy fashion.

Ahh that makes sense. Thanks.

> 
> I'll not experiment with:

Looks good as long as it's still performant.

> 
> From 14e83ff2a422a96ce5701f9c8454a49f9ed947e3 Mon Sep 17 00:00:00 2001
> From: David Hildenbrand <david@redhat.com>
> Date: Sat, 30 Dec 2023 12:54:35 +0100
> Subject: [PATCH] mm/memory: ignore dirty/accessed/soft-dirty bits in
>  folio_pte_batch()
> 
> Let's always ignore the accessed/young bit: we'll always mark the PTE
> as old in our child process during fork, and upcoming users will
> similarly not care.
> 
> Ignore the dirty bit only if we don't want to duplicate the dirty bit
> into the child process during fork. Maybe, we could just set all PTEs
> in the child dirty if any PTE is dirty. For now, let's keep the behavior
> unchanged.
> 
> Ignore the soft-dirty bit only if the bit doesn't have any meaning in
> the src vma.
> 
> Signed-off-by: David Hildenbrand <david@redhat.com>
> ---
>  mm/memory.c | 34 ++++++++++++++++++++++++++++++----
>  1 file changed, 30 insertions(+), 4 deletions(-)
> 
> diff --git a/mm/memory.c b/mm/memory.c
> index 7690994929d26..9aba1b0e871ca 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -953,24 +953,44 @@ static __always_inline void __copy_present_ptes(struct
> vm_area_struct *dst_vma,
>      set_ptes(dst_vma->vm_mm, addr, dst_pte, pte, nr);
>  }
>  
> +/* Flags for folio_pte_batch(). */
> +typedef int __bitwise fpb_t;
> +
> +/* Compare PTEs after pte_mkclean(), ignoring the dirty bit. */
> +#define FPB_IGNORE_DIRTY        ((__force fpb_t)BIT(0))
> +
> +/* Compare PTEs after pte_clear_soft_dirty(), ignoring the soft-dirty bit. */
> +#define FPB_IGNORE_SOFT_DIRTY        ((__force fpb_t)BIT(1))
> +
> +static inline pte_t __pte_batch_clear_ignored(pte_t pte, fpb_t flags)
> +{
> +    if (flags & FPB_IGNORE_DIRTY)
> +        pte = pte_mkclean(pte);
> +    if (likely(flags & FPB_IGNORE_SOFT_DIRTY))
> +        pte = pte_clear_soft_dirty(pte);
> +    return pte_mkold(pte);
> +}
> +
>  /*
>   * Detect a PTE batch: consecutive (present) PTEs that map consecutive
>   * pages of the same folio.
>   *
>   * All PTEs inside a PTE batch have the same PTE bits set, excluding the PFN.
> + * the accessed bit, dirty bit (with FPB_IGNORE_DIRTY) and soft-dirty bit
> + * (with FPB_IGNORE_SOFT_DIRTY).
>   */
>  static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> -        pte_t *start_ptep, pte_t pte, int max_nr)
> +        pte_t *start_ptep, pte_t pte, int max_nr, fpb_t flags)
>  {
>      unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>      const pte_t *end_ptep = start_ptep + max_nr;
> -    pte_t expected_pte = pte_next_pfn(pte);
> +    pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte), flags);
>      pte_t *ptep = start_ptep + 1;
>  
>      VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>  
>      while (ptep != end_ptep) {
> -        pte = ptep_get(ptep);
> +        pte = __pte_batch_clear_ignored(ptep_get(ptep), flags);
>  
>          if (!pte_same(pte, expected_pte))
>              break;
> @@ -1004,6 +1024,7 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct
> vm_area_struct *src_vma
>  {
>      struct page *page;
>      struct folio *folio;
> +    fpb_t flags = 0;
>      int err, nr;
>  
>      page = vm_normal_page(src_vma, addr, pte);
> @@ -1018,7 +1039,12 @@ copy_present_ptes(struct vm_area_struct *dst_vma, struct
> vm_area_struct *src_vma
>       * by keeping the batching logic separate.
>       */
>      if (unlikely(!*prealloc && folio_test_large(folio) && max_nr != 1)) {
> -        nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr);
> +        if (src_vma->vm_flags & VM_SHARED)
> +            flags |= FPB_IGNORE_DIRTY;
> +        if (!vma_soft_dirty_enabled(src_vma))
> +            flags |= FPB_IGNORE_SOFT_DIRTY;
> +
> +        nr = folio_pte_batch(folio, addr, src_pte, pte, max_nr, flags);
>          folio_ref_add(folio, nr);
>          if (folio_test_anon(folio)) {
>              if (unlikely(folio_try_dup_anon_rmap_ptes(folio, page,


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 10:34   ` Ryan Roberts
  2024-01-23 10:48     ` David Hildenbrand
@ 2024-01-23 15:01     ` Matthew Wilcox
  2024-01-23 15:22       ` Ryan Roberts
  1 sibling, 1 reply; 49+ messages in thread
From: Matthew Wilcox @ 2024-01-23 15:01 UTC (permalink / raw
  To: Ryan Roberts
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Dinh Nguyen,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

On Tue, Jan 23, 2024 at 10:34:21AM +0000, Ryan Roberts wrote:
> > +#define PFN_PTE_SHIFT		PAGE_SHIFT
> 
> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
> kept contigously, so if you happen to be setting a mapping for which the
> physical memory block straddles bit 48, this won't work.

I'd like to see the folio allocation that can straddle bit 48 ...

agreed, it's not workable _in general_, but specifically for a memory
allocation from a power-of-two allocator, you'd have to do a 49-bit
allocation (half a petabyte) to care.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 15:01     ` Matthew Wilcox
@ 2024-01-23 15:22       ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 15:22 UTC (permalink / raw
  To: Matthew Wilcox
  Cc: David Hildenbrand, linux-kernel, linux-mm, Andrew Morton,
	Russell King, Catalin Marinas, Will Deacon, Dinh Nguyen,
	Michael Ellerman, Nicholas Piggin, Christophe Leroy,
	Aneesh Kumar K.V, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel, linuxppc-dev, linux-riscv,
	linux-s390, sparclinux

On 23/01/2024 15:01, Matthew Wilcox wrote:
> On Tue, Jan 23, 2024 at 10:34:21AM +0000, Ryan Roberts wrote:
>>> +#define PFN_PTE_SHIFT		PAGE_SHIFT
>>
>> I think this is buggy. And so is the arm64 implementation of set_ptes(). It
>> works fine for 48-bit output address, but for 52-bit OAs, the high bits are not
>> kept contigously, so if you happen to be setting a mapping for which the
>> physical memory block straddles bit 48, this won't work.
> 
> I'd like to see the folio allocation that can straddle bit 48 ...
> 
> agreed, it's not workable _in general_, but specifically for a memory
> allocation from a power-of-two allocator, you'd have to do a 49-bit
> allocation (half a petabyte) to care.

Hmm, good point. So it's a hypothetical bug, not an actual bug. Personally I'm
still inclined to "fix" it, although it's going to cost a few more instructions.
Shout if you disagree.

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
                   ` (10 preceding siblings ...)
  2024-01-22 19:42 ` [PATCH v1 11/11] mm/memory: ignore writable bit " David Hildenbrand
@ 2024-01-23 19:15 ` Ryan Roberts
  2024-01-23 19:33   ` David Hildenbrand
  11 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 19:15 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 22/01/2024 19:41, David Hildenbrand wrote:
> Now that the rmap overhaul[1] is upstream that provides a clean interface
> for rmap batching, let's implement PTE batching during fork when processing
> PTE-mapped THPs.
> 
> This series is partially based on Ryan's previous work[2] to implement
> cont-pte support on arm64, but its a complete rewrite based on [1] to
> optimize all architectures independent of any such PTE bits, and to
> use the new rmap batching functions that simplify the code and prepare
> for further rmap accounting changes.
> 
> We collect consecutive PTEs that map consecutive pages of the same large
> folio, making sure that the other PTE bits are compatible, and (a) adjust
> the refcount only once per batch, (b) call rmap handling functions only
> once per batch and (c) perform batch PTE setting/updates.
> 
> While this series should be beneficial for adding cont-pte support on
> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
> for large folios with minimal added overhead and further changes[4] that
> build up on top of the total mapcount.

I'm currently rebasing my contpte work onto this series, and have hit a problem.
I need to expose the "size" of a pte (pte_size()) and skip forward to the start
of the next (cont)pte every time through the folio_pte_batch() loop. But
pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:


static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
{
	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
	const pte_t *end_ptep = start_ptep + max_nr;
	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
-	pte_t *ptep = start_ptep + 1;
+	pte_t *ptep = start_ptep;
+	int vfn, nr, i;
	bool writable;

	if (any_writable)
		*any_writable = false;

	VM_WARN_ON_FOLIO(!pte_present(pte), folio);

+	vfn = addr >> PAGE_SHIFT;
+	nr = pte_size(pte);
+	nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
+	ptep += nr;
+
	while (ptep != end_ptep) {
+		pte = ptep_get(ptep);
		nr = pte_size(pte);
		if (any_writable)
			writable = !!pte_write(pte);
		pte = __pte_batch_clear_ignored(pte);

		if (!pte_same(pte, expected_pte))
			break;

		/*
		 * Stop immediately once we reached the end of the folio. In
		 * corner cases the next PFN might fall into a different
		 * folio.
		 */
-		if (pte_pfn(pte) == folio_end_pfn)
+		if (pte_pfn(pte) >= folio_end_pfn)
			break;

		if (any_writable)
			*any_writable |= writable;

-		expected_pte = pte_next_pfn(expected_pte);
-		ptep++;
+		for (i = 0; i < nr; i++)
+			expected_pte = pte_next_pfn(expected_pte);
+		ptep += nr;
	}

	return ptep - start_ptep;
}


So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
perhaps it's actually better to expose pte_pgprot() for all the arches. Then we
can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
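
i.e. roughly (just a sketch of the idea), instead of stepping one pfn at a time:

	/* rebuild the expected PTE directly from pfn + pgprot */
	expected_pte = pfn_pte(pte_pfn(expected_pte) + nr,
			       pte_pgprot(expected_pte));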

What do you think?



^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 19:15 ` [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP Ryan Roberts
@ 2024-01-23 19:33   ` David Hildenbrand
  2024-01-23 19:43     ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 19:33 UTC (permalink / raw
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 20:15, Ryan Roberts wrote:
> On 22/01/2024 19:41, David Hildenbrand wrote:
>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>> for rmap batching, let's implement PTE batching during fork when processing
>> PTE-mapped THPs.
>>
>> This series is partially based on Ryan's previous work[2] to implement
>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>> optimize all architectures independent of any such PTE bits, and to
>> use the new rmap batching functions that simplify the code and prepare
>> for further rmap accounting changes.
>>
>> We collect consecutive PTEs that map consecutive pages of the same large
>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>> the refcount only once per batch, (b) call rmap handling functions only
>> once per batch and (c) perform batch PTE setting/updates.
>>
>> While this series should be beneficial for adding cont-pte support on
>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>> for large folios with minimal added overhead and further changes[4] that
>> build up on top of the total mapcount.
> 
> I'm currently rebasing my contpte work onto this series, and have hit a problem.
> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
> of the next (cont)pte every time through the folio_pte_batch() loop. But
> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
> 
> 
> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
> 		pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
> {
> 	unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
> 	const pte_t *end_ptep = start_ptep + max_nr;
> 	pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
> -	pte_t *ptep = start_ptep + 1;
> +	pte_t *ptep = start_ptep;
> +	int vfn, nr, i;
> 	bool writable;
> 
> 	if (any_writable)
> 		*any_writable = false;
> 
> 	VM_WARN_ON_FOLIO(!pte_present(pte), folio);
> 
> +	vfn = addr >> PAGE_SIZE;
> +	nr = pte_size(pte);
> +	nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
> +	ptep += nr;
> +
> 	while (ptep != end_ptep) {
> +		pte = ptep_get(ptep);
> 		nr = pte_size(pte);
> 		if (any_writable)
> 			writable = !!pte_write(pte);
> 		pte = __pte_batch_clear_ignored(pte);
> 
> 		if (!pte_same(pte, expected_pte))
> 			break;
> 
> 		/*
> 		 * Stop immediately once we reached the end of the folio. In
> 		 * corner cases the next PFN might fall into a different
> 		 * folio.
> 		 */
> -		if (pte_pfn(pte) == folio_end_pfn)
> +		if (pte_pfn(pte) >= folio_end_pfn)
> 			break;
> 
> 		if (any_writable)
> 			*any_writable |= writable;
> 
> -		expected_pte = pte_next_pfn(expected_pte);
> -		ptep++;
> +		for (i = 0; i < nr; i++)
> +			expected_pte = pte_next_pfn(expected_pte);
> +		ptep += nr;
> 	}
> 
> 	return ptep - start_ptep;
> }
> 
> 
> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
> perhaps its actually better to expose pte_pgprot() for all the arches. Then we
> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
> 
> What do you think?

The pte_pgprot() stuff is just nasty IMHO.

Likely it's best to simply convert pte_next_pfn() to something like 
pte_advance_pfns(). Then we could just have

#define pte_next_pfn(pte) pte_advance_pfns(pte, 1)

That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And 
only 3 archs (x86-64, arm64, and powerpc) need slight care to replace a 
hardcoded "1" by an integer we pass in.

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 19:33   ` David Hildenbrand
@ 2024-01-23 19:43     ` Ryan Roberts
  2024-01-23 20:14       ` David Hildenbrand
  0 siblings, 1 reply; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 19:43 UTC (permalink / raw
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 19:33, David Hildenbrand wrote:
> On 23.01.24 20:15, Ryan Roberts wrote:
>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>> for rmap batching, let's implement PTE batching during fork when processing
>>> PTE-mapped THPs.
>>>
>>> This series is partially based on Ryan's previous work[2] to implement
>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>> optimize all architectures independent of any such PTE bits, and to
>>> use the new rmap batching functions that simplify the code and prepare
>>> for further rmap accounting changes.
>>>
>>> We collect consecutive PTEs that map consecutive pages of the same large
>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>> the refcount only once per batch, (b) call rmap handling functions only
>>> once per batch and (c) perform batch PTE setting/updates.
>>>
>>> While this series should be beneficial for adding cont-pte support on
>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>> for large folios with minimal added overhead and further changes[4] that
>>> build up on top of the total mapcount.
>>
>> I'm currently rebasing my contpte work onto this series, and have hit a problem.
>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>
>>
>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>         pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>> {
>>     unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>     const pte_t *end_ptep = start_ptep + max_nr;
>>     pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>> -    pte_t *ptep = start_ptep + 1;
>> +    pte_t *ptep = start_ptep;
>> +    int vfn, nr, i;
>>     bool writable;
>>
>>     if (any_writable)
>>         *any_writable = false;
>>
>>     VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>
>> +    vfn = addr >> PAGE_SIZE;
>> +    nr = pte_size(pte);
>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>> +    ptep += nr;
>> +
>>     while (ptep != end_ptep) {
>> +        pte = ptep_get(ptep);
>>         nr = pte_size(pte);
>>         if (any_writable)
>>             writable = !!pte_write(pte);
>>         pte = __pte_batch_clear_ignored(pte);
>>
>>         if (!pte_same(pte, expected_pte))
>>             break;
>>
>>         /*
>>          * Stop immediately once we reached the end of the folio. In
>>          * corner cases the next PFN might fall into a different
>>          * folio.
>>          */
>> -        if (pte_pfn(pte) == folio_end_pfn)
>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>             break;
>>
>>         if (any_writable)
>>             *any_writable |= writable;
>>
>> -        expected_pte = pte_next_pfn(expected_pte);
>> -        ptep++;
>> +        for (i = 0; i < nr; i++)
>> +            expected_pte = pte_next_pfn(expected_pte);
>> +        ptep += nr;
>>     }
>>
>>     return ptep - start_ptep;
>> }
>>
>>
>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>> perhaps it's actually better to expose pte_pgprot() for all the arches. Then we
>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>
>> What do you think?
> 
> The pte_pgprot() stuff is just nasty IMHO.

I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
that we should be able to do the reverse.
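
Just to illustrate, an untested sketch of what I mean, assuming pte_pgprot()
were available everywhere (which is exactly the gap today):

static inline pte_t pte_advance_pfns(pte_t pte, unsigned long nr)
{
	/* decompose into pfn + pgprot, then rebuild with the advanced pfn */
	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}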

> 
> Likely it's best to simply convert pte_next_pfn() to something like
> pte_advance_pfns(). Then we could just have
> 
> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
> 
> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
> by an integer we pass in.

I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
the principle works I guess. I guess I can do this change along with my series.

> 


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 19:43     ` Ryan Roberts
@ 2024-01-23 20:14       ` David Hildenbrand
  2024-01-23 20:43         ` Ryan Roberts
  0 siblings, 1 reply; 49+ messages in thread
From: David Hildenbrand @ 2024-01-23 20:14 UTC (permalink / raw)
  To: Ryan Roberts, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23.01.24 20:43, Ryan Roberts wrote:
> On 23/01/2024 19:33, David Hildenbrand wrote:
>> On 23.01.24 20:15, Ryan Roberts wrote:
>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>> PTE-mapped THPs.
>>>>
>>>> This series is partially based on Ryan's previous work[2] to implement
>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>> optimize all architectures independent of any such PTE bits, and to
>>>> use the new rmap batching functions that simplify the code and prepare
>>>> for further rmap accounting changes.
>>>>
>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>
>>>> While this series should be beneficial for adding cont-pte support on
>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>> for large folios with minimal added overhead and further changes[4] that
>>>> build up on top of the total mapcount.
>>>
>>> I'm currently rebasing my contpte work onto this series, and have hit a problem.
>>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>>
>>>
>>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>          pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>>> {
>>>      unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>      const pte_t *end_ptep = start_ptep + max_nr;
>>>      pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>> -    pte_t *ptep = start_ptep + 1;
>>> +    pte_t *ptep = start_ptep;
>>> +    int vfn, nr, i;
>>>      bool writable;
>>>
>>>      if (any_writable)
>>>          *any_writable = false;
>>>
>>>      VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>
>>> +    vfn = addr >> PAGE_SHIFT;
>>> +    nr = pte_size(pte);
>>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>>> +    ptep += nr;
>>> +
>>>      while (ptep != end_ptep) {
>>> +        pte = ptep_get(ptep);
>>>          nr = pte_size(pte);
>>>          if (any_writable)
>>>              writable = !!pte_write(pte);
>>>          pte = __pte_batch_clear_ignored(pte);
>>>
>>>          if (!pte_same(pte, expected_pte))
>>>              break;
>>>
>>>          /*
>>>           * Stop immediately once we reached the end of the folio. In
>>>           * corner cases the next PFN might fall into a different
>>>           * folio.
>>>           */
>>> -        if (pte_pfn(pte) == folio_end_pfn)
>>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>>              break;
>>>
>>>          if (any_writable)
>>>              *any_writable |= writable;
>>>
>>> -        expected_pte = pte_next_pfn(expected_pte);
>>> -        ptep++;
>>> +        for (i = 0; i < nr; i++)
>>> +            expected_pte = pte_next_pfn(expected_pte);
>>> +        ptep += nr;
>>>      }
>>>
>>>      return ptep - start_ptep;
>>> }
>>>
>>>
>>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>>> perhaps it's actually better to expose pte_pgprot() for all the arches. Then we
>>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>>
>>> What do you think?
>>
>> The pte_pgprot() stuff is just nasty IMHO.
> 
> I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
> that we should be able to do the reverse.

But pte_pgprot() is only available on a handful of architectures, no? It 
would be nice to have a completely generic pte_next_pfn() / 
pte_advance_pfns(), though.

Anyhow, this is all "easy" to rework later. Unless I am missing 
something, the low hanging fruit is simply using PFN_PTE_SHIFT for now 
that exists on most archs already.
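
FWIW, with such a pte_advance_pfns(), the inner loop in your snippet above
would then boil down to something like (sketch only):

        expected_pte = pte_advance_pfns(expected_pte, nr);
        ptep += nr;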

> 
>>
>> Likely it's best to simply convert pte_next_pfn() to something like
>> pte_advance_pfns(). Then we could just have
>>
>> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
>>
>> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
>> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
>> by an integer we pass in.
> 
> I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
> the principle works I guess. I guess I can do this change along with my series.

It is, if nobody insists on that micro-optimization on powerpc.

If there is good reason to invest more time and effort right now on the 
pte_pgprot approach, then please let me know :)

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP
  2024-01-23 20:14       ` David Hildenbrand
@ 2024-01-23 20:43         ` Ryan Roberts
  0 siblings, 0 replies; 49+ messages in thread
From: Ryan Roberts @ 2024-01-23 20:43 UTC (permalink / raw)
  To: David Hildenbrand, linux-kernel
  Cc: linux-mm, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Christophe Leroy, Aneesh Kumar K.V,
	Naveen N. Rao, Paul Walmsley, Palmer Dabbelt, Albert Ou,
	Alexander Gordeev, Gerald Schaefer, Heiko Carstens, Vasily Gorbik,
	Christian Borntraeger, Sven Schnelle, David S. Miller,
	linux-arm-kernel, linuxppc-dev, linux-riscv, linux-s390,
	sparclinux

On 23/01/2024 20:14, David Hildenbrand wrote:
> On 23.01.24 20:43, Ryan Roberts wrote:
>> On 23/01/2024 19:33, David Hildenbrand wrote:
>>> On 23.01.24 20:15, Ryan Roberts wrote:
>>>> On 22/01/2024 19:41, David Hildenbrand wrote:
>>>>> Now that the rmap overhaul[1] is upstream that provides a clean interface
>>>>> for rmap batching, let's implement PTE batching during fork when processing
>>>>> PTE-mapped THPs.
>>>>>
>>>>> This series is partially based on Ryan's previous work[2] to implement
>>>>> cont-pte support on arm64, but its a complete rewrite based on [1] to
>>>>> optimize all architectures independent of any such PTE bits, and to
>>>>> use the new rmap batching functions that simplify the code and prepare
>>>>> for further rmap accounting changes.
>>>>>
>>>>> We collect consecutive PTEs that map consecutive pages of the same large
>>>>> folio, making sure that the other PTE bits are compatible, and (a) adjust
>>>>> the refcount only once per batch, (b) call rmap handling functions only
>>>>> once per batch and (c) perform batch PTE setting/updates.
>>>>>
>>>>> While this series should be beneficial for adding cont-pte support on
>>>>> ARM64[2], it's one of the requirements for maintaining a total mapcount[3]
>>>>> for large folios with minimal added overhead and further changes[4] that
>>>>> build up on top of the total mapcount.
>>>>
>>>> I'm currently rebasing my contpte work onto this series, and have hit a
>>>> problem.
>>>> I need to expose the "size" of a pte (pte_size()) and skip forward to the start
>>>> of the next (cont)pte every time through the folio_pte_batch() loop. But
>>>> pte_next_pfn() only allows advancing by 1 pfn; I need to advance by nr pfns:
>>>>
>>>>
>>>> static inline int folio_pte_batch(struct folio *folio, unsigned long addr,
>>>>          pte_t *start_ptep, pte_t pte, int max_nr, bool *any_writable)
>>>> {
>>>>      unsigned long folio_end_pfn = folio_pfn(folio) + folio_nr_pages(folio);
>>>>      const pte_t *end_ptep = start_ptep + max_nr;
>>>>      pte_t expected_pte = __pte_batch_clear_ignored(pte_next_pfn(pte));
>>>> -    pte_t *ptep = start_ptep + 1;
>>>> +    pte_t *ptep = start_ptep;
>>>> +    int vfn, nr, i;
>>>>      bool writable;
>>>>
>>>>      if (any_writable)
>>>>          *any_writable = false;
>>>>
>>>>      VM_WARN_ON_FOLIO(!pte_present(pte), folio);
>>>>
>>>> +    vfn = addr >> PAGE_SHIFT;
>>>> +    nr = pte_size(pte);
>>>> +    nr = ALIGN_DOWN(vfn + nr, nr) - vfn;
>>>> +    ptep += nr;
>>>> +
>>>>      while (ptep != end_ptep) {
>>>> +        pte = ptep_get(ptep);
>>>>          nr = pte_size(pte);
>>>>          if (any_writable)
>>>>              writable = !!pte_write(pte);
>>>>          pte = __pte_batch_clear_ignored(pte);
>>>>
>>>>          if (!pte_same(pte, expected_pte))
>>>>              break;
>>>>
>>>>          /*
>>>>           * Stop immediately once we reached the end of the folio. In
>>>>           * corner cases the next PFN might fall into a different
>>>>           * folio.
>>>>           */
>>>> -        if (pte_pfn(pte) == folio_end_pfn)
>>>> +        if (pte_pfn(pte) >= folio_end_pfn)
>>>>              break;
>>>>
>>>>          if (any_writable)
>>>>              *any_writable |= writable;
>>>>
>>>> -        expected_pte = pte_next_pfn(expected_pte);
>>>> -        ptep++;
>>>> +        for (i = 0; i < nr; i++)
>>>> +            expected_pte = pte_next_pfn(expected_pte);
>>>> +        ptep += nr;
>>>>      }
>>>>
>>>>      return ptep - start_ptep;
>>>> }
>>>>
>>>>
>>>> So I'm wondering if instead of enabling pte_next_pfn() for all the arches,
>>>> perhaps it's actually better to expose pte_pgprot() for all the arches. Then we
>>>> can be much more flexible about generating ptes with pfn_pte(pfn, pgprot).
>>>>
>>>> What do you think?
>>>
>>> The pte_pgprot() stuff is just nasty IMHO.
>>
>> I dunno; we have pfn_pte() which takes a pfn and a pgprot. It seems reasonable
>> that we should be able to do the reverse.
> 
> But pte_pgprot() is only available on a handful of architectures, no? It would
> be nice to have a completely generic pte_next_pfn() / pte_advance_pfns(), though.
> 
> Anyhow, this is all "easy" to rework later. Unless I am missing something, the
> low hanging fruit is simply using PFN_PTE_SHIFT for now that exists on most
> archs already.
> 
>>
>>>
>>> Likely it's best to simply convert pte_next_pfn() to something like
>>> pte_advance_pfns(). Then we could just have
>>>
>>> #define pte_next_pfn(pte) pte_advance_pfns(pte, 1)
>>>
>>> That should be fairly easy to do on top (based on PFN_PTE_SHIFT). And only 3
>>> archs (x86-64, arm64, and powerpc) need slight care to replace a hardcoded "1"
>>> by an integer we pass in.
>>
>> I thought we agreed powerpc was safe to just define PFN_PTE_SHIFT? But, yeah,
>> the principle works I guess. I guess I can do this change along with my series.
> 
> It is, if nobody insists on that micro-optimization on powerpc.
> 
> If there is good reason to invest more time and effort right now on the
> pte_pgprot approach, then please let me know :)
> 

No, I think you're right. I thought pte_pgprot() was implemented by more arches,
but there are 13 without it, so plugging that gap would clearly be a lot of
effort. I'll take the approach you suggest with pte_advance_pfns(). It'll just
require mods to x86 and arm64, +/- ppc.
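
For arm64 I'd imagine the override ends up looking something like this
(untested sketch; arm64 already has pte_pgprot(), so it should be mostly
mechanical):

#define pte_advance_pfns pte_advance_pfns
static inline pte_t pte_advance_pfns(pte_t pte, unsigned long nr)
{
	/* rebuild the pte with the pfn advanced by nr, keeping the prot bits */
	return pfn_pte(pte_pfn(pte) + nr, pte_pgprot(pte));
}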


^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:40               ` David Hildenbrand
@ 2024-01-24  5:45                 ` Aneesh Kumar K.V
  0 siblings, 0 replies; 49+ messages in thread
From: Aneesh Kumar K.V @ 2024-01-24  5:45 UTC (permalink / raw)
  To: David Hildenbrand, Ryan Roberts, Christophe Leroy,
	linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

David Hildenbrand <david@redhat.com> writes:

> On 23.01.24 12:38, Ryan Roberts wrote:
>> On 23/01/2024 11:31, David Hildenbrand wrote:
>>>>>
>>>>>> If high bits are used for
>>>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>>>> detect "belongs to this folio batch" either way.
>>>>>
>>>>> Exactly.
>>>>>
>>>>>>
>>>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>>>
>>>>> I don't see the need for ppc to implement pte_next_pfn().
>>>>
>>>> Agreed.
>>>
>>> So likely we should then do on top for powerpc (whitespace damage):
>>>
>>> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
>>> index a04ae4449a025..549a440ed7f65 100644
>>> --- a/arch/powerpc/mm/pgtable.c
>>> +++ b/arch/powerpc/mm/pgtable.c
>>> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr,
>>> pte_t *ptep,
>>>                          break;
>>>                  ptep++;
>>>                  addr += PAGE_SIZE;
>>> -               /*
>>> -                * increment the pfn.
>>> -                */
>>> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
>>> +               pte = pte_next_pfn(pte);
>>>          }
>>>   }
>> 
>> Looks like commit 47b8def9358c ("powerpc/mm: Avoid calling
>> arch_enter/leave_lazy_mmu() in set_ptes") changed from doing the simple
>> increment to this more complex approach, but the log doesn't say why.
>
> @Aneesh, was that change on purpose?
>

Because we had a bug with the patch that introduced the change, and that
line was confusing. The right thing would have been to add
pte_next_pfn() to make it clear. It was confusing because not all pte
formats have the pfn at the PAGE_SHIFT offset (even though we did use the
correct PTE_RPN_SHIFT in this specific case). To make it simpler, I ended
up switching that line to pte_pfn(pte) + 1.

-aneesh

^ permalink raw reply	[flat|nested] 49+ messages in thread

* Re: [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64
  2024-01-23 11:31           ` David Hildenbrand
  2024-01-23 11:38             ` Ryan Roberts
@ 2024-01-24  5:46             ` Aneesh Kumar K.V
  1 sibling, 0 replies; 49+ messages in thread
From: Aneesh Kumar K.V @ 2024-01-24  5:46 UTC (permalink / raw)
  To: David Hildenbrand, Christophe Leroy, Ryan Roberts,
	linux-kernel@vger.kernel.org
  Cc: linux-mm@kvack.org, Andrew Morton, Matthew Wilcox, Russell King,
	Catalin Marinas, Will Deacon, Dinh Nguyen, Michael Ellerman,
	Nicholas Piggin, Naveen N. Rao, Paul Walmsley, Palmer Dabbelt,
	Albert Ou, Alexander Gordeev, Gerald Schaefer, Heiko Carstens,
	Vasily Gorbik, Christian Borntraeger, Sven Schnelle,
	David S. Miller, linux-arm-kernel@lists.infradead.org,
	linuxppc-dev@lists.ozlabs.org, linux-riscv@lists.infradead.org,
	linux-s390@vger.kernel.org, sparclinux@vger.kernel.org

David Hildenbrand <david@redhat.com> writes:

>>>
>>>> If high bits are used for
>>>> something else, then we might produce a garbage PTE on overflow, but that
>>>> shouldn't really matter I concluded for folio_pte_batch() purposes, we'd not
>>>> detect "belongs to this folio batch" either way.
>>>
>>> Exactly.
>>>
>>>>
>>>> Maybe it's likely cleaner to also have a custom pte_next_pfn() on ppc, I just
>>>> hope that we don't lose any other arbitrary PTE bits by doing the pte_pgprot().
>>>
>>> I don't see the need for ppc to implement pte_next_pfn().
>> 
>> Agreed.
>
> So likely we should then do on top for powerpc (whitespace damage):
>
> diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm/pgtable.c
> index a04ae4449a025..549a440ed7f65 100644
> --- a/arch/powerpc/mm/pgtable.c
> +++ b/arch/powerpc/mm/pgtable.c
> @@ -220,10 +220,7 @@ void set_ptes(struct mm_struct *mm, unsigned long addr, pte_t *ptep,
>                          break;
>                  ptep++;
>                  addr += PAGE_SIZE;
> -               /*
> -                * increment the pfn.
> -                */
> -               pte = pfn_pte(pte_pfn(pte) + 1, pte_pgprot((pte)));
> +               pte = pte_next_pfn(pte);
>          }
>   }

Agreed.

-aneesh

^ permalink raw reply	[flat|nested] 49+ messages in thread

end of thread, other threads:[~2024-01-24  5:46 UTC | newest]

Thread overview: 49+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-01-22 19:41 [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 01/11] arm/pgtable: define PFN_PTE_SHIFT on arm and arm64 David Hildenbrand
2024-01-23 10:34   ` Ryan Roberts
2024-01-23 10:48     ` David Hildenbrand
2024-01-23 11:02       ` David Hildenbrand
2024-01-23 11:17         ` Ryan Roberts
2024-01-23 11:33           ` David Hildenbrand
2024-01-23 11:44             ` Ryan Roberts
2024-01-23 11:08       ` Ryan Roberts
2024-01-23 11:16         ` Christophe Leroy
2024-01-23 11:31           ` David Hildenbrand
2024-01-23 11:38             ` Ryan Roberts
2024-01-23 11:40               ` David Hildenbrand
2024-01-24  5:45                 ` Aneesh Kumar K.V
2024-01-23 11:48               ` Christophe Leroy
2024-01-23 11:53                 ` David Hildenbrand
2024-01-24  5:46             ` Aneesh Kumar K.V
2024-01-23 11:10       ` Christophe Leroy
2024-01-23 15:01     ` Matthew Wilcox
2024-01-23 15:22       ` Ryan Roberts
2024-01-22 19:41 ` [PATCH v1 02/11] nios2/pgtable: define PFN_PTE_SHIFT David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 03/11] powerpc/pgtable: " David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 04/11] risc: pgtable: " David Hildenbrand
2024-01-22 20:03   ` Alexandre Ghiti
2024-01-22 20:08     ` David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 05/11] s390/pgtable: " David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 06/11] sparc/pgtable: " David Hildenbrand
2024-01-22 19:41 ` [PATCH v1 07/11] mm/memory: factor out copying the actual PTE in copy_present_pte() David Hildenbrand
2024-01-23 10:45   ` Ryan Roberts
2024-01-22 19:41 ` [PATCH v1 08/11] mm/memory: pass PTE to copy_present_pte() David Hildenbrand
2024-01-23 10:47   ` Ryan Roberts
2024-01-22 19:41 ` [PATCH v1 09/11] mm/memory: optimize fork() with PTE-mapped THP David Hildenbrand
2024-01-23 12:01   ` Ryan Roberts
2024-01-23 12:19     ` David Hildenbrand
2024-01-23 12:28       ` Ryan Roberts
2024-01-22 19:41 ` [PATCH v1 10/11] mm/memory: ignore dirty/accessed/soft-dirty bits in folio_pte_batch() David Hildenbrand
2024-01-23 12:25   ` Ryan Roberts
2024-01-23 13:06     ` David Hildenbrand
2024-01-23 13:42       ` Ryan Roberts
2024-01-23 13:55         ` David Hildenbrand
2024-01-23 14:13           ` David Hildenbrand
2024-01-23 14:27             ` Ryan Roberts
2024-01-22 19:42 ` [PATCH v1 11/11] mm/memory: ignore writable bit " David Hildenbrand
2024-01-23 12:35   ` Ryan Roberts
2024-01-23 19:15 ` [PATCH v1 00/11] mm/memory: optimize fork() with PTE-mapped THP Ryan Roberts
2024-01-23 19:33   ` David Hildenbrand
2024-01-23 19:43     ` Ryan Roberts
2024-01-23 20:14       ` David Hildenbrand
2024-01-23 20:43         ` Ryan Roberts
