LKML Archive mirror
* [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
@ 2024-03-12 18:11 David Hildenbrand
  2024-03-12 19:22 ` Matthew Wilcox
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-03-12 18:11 UTC
  To: linux-kernel
  Cc: linux-mm, x86, David Hildenbrand, Wupeng Ma, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Andrew Morton

PAT handling won't do the right thing in COW mappings: the first PTE
(or, in fact, all PTEs) can be replaced during write faults to point at
anon folios. Reliably recovering the correct PFN and cachemode using
follow_phys() from PTEs will not work in COW mappings.

Using follow_phys(), we might just get the address+protection of the
anon folio (which is very wrong), or follow_phys() might fail on
swap/nonswap entries, triggering a WARN_ON_ONCE() in untrack_pfn() and
track_pfn_copy() and not properly calling free_pfn_range().

In free_pfn_range(), we either wouldn't call memtype_free() or would
call it with the wrong range, possibly leaking memory.

To fix that, let's update follow_phys() to refuse returning anon folios,
and fall back to using the PFN stored in vma->vm_pgoff for COW
mappings if we run into that.

We will now properly handle untrack_pfn() with COW mappings, where we
don't need the cachemode. We'll have to fail fork()->track_pfn_copy() if
the first page was replaced by an anon folio, though: we'd have to store
the cachemode in the VMA to make this work, likely growing the VMA size.

For now, let's keep it simple and have track_pfn_copy() just fail in that
case: it would already have failed in the past with swap/nonswap entries,
and it would have done the wrong thing with anon folios.

Simple reproducer to trigger the WARN_ON_ONCE() in untrack_pfn():

<--- C reproducer --->
 #include <stdio.h>
 #include <sys/mman.h>
 #include <unistd.h>
 #include <liburing.h>

 int main(void)
 {
         struct io_uring_params p = {};
         int ring_fd;
         size_t size;
         char *map;

         ring_fd = io_uring_setup(1, &p);
         if (ring_fd < 0) {
                 perror("io_uring_setup");
                 return 1;
         }
         size = p.sq_off.array + p.sq_entries * sizeof(unsigned);

         /* Map the submission queue ring MAP_PRIVATE */
         map = mmap(0, size, PROT_READ | PROT_WRITE, MAP_PRIVATE,
                    ring_fd, IORING_OFF_SQ_RING);
         if (map == MAP_FAILED) {
                 perror("mmap");
                 return 1;
         }

         /* We have at least one page. Let's COW it. */
         *map = 0;
         pause();
         return 0;
 }
<--- C reproducer --->

On a system with 16 GiB RAM and swap configured:
 # ./iouring &
 # memhog 16G
 # killall iouring
[  301.552930] ------------[ cut here ]------------
[  301.553285] WARNING: CPU: 7 PID: 1402 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xf4/0x100
[  301.553989] Modules linked in: binfmt_misc nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_g
[  301.558232] CPU: 7 PID: 1402 Comm: iouring Not tainted 6.7.5-100.fc38.x86_64 #1
[  301.558772] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.16.3-0-ga6ed6b701f0a-prebu4
[  301.559569] RIP: 0010:untrack_pfn+0xf4/0x100
[  301.559893] Code: 75 c4 eb cf 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 b8 48 8b 7b 30 e8 ea 1a f7 000
[  301.561189] RSP: 0018:ffffba2c0377fab8 EFLAGS: 00010282
[  301.561590] RAX: 00000000ffffffea RBX: ffff9208c8ce9cc0 RCX: 000000010455e047
[  301.562105] RDX: 07fffffff0eb1e0a RSI: 0000000000000000 RDI: ffff9208c391d200
[  301.562628] RBP: 0000000000000000 R08: ffffba2c0377fab8 R09: 0000000000000000
[  301.563145] R10: ffff9208d2292d50 R11: 0000000000000002 R12: 00007fea890e0000
[  301.563669] R13: 0000000000000000 R14: ffffba2c0377fc08 R15: 0000000000000000
[  301.564186] FS:  0000000000000000(0000) GS:ffff920c2fbc0000(0000) knlGS:0000000000000000
[  301.564773] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  301.565197] CR2: 00007fea88ee8a20 CR3: 00000001033a8000 CR4: 0000000000750ef0
[  301.565725] PKRU: 55555554
[  301.565944] Call Trace:
[  301.566148]  <TASK>
[  301.566325]  ? untrack_pfn+0xf4/0x100
[  301.566618]  ? __warn+0x81/0x130
[  301.566876]  ? untrack_pfn+0xf4/0x100
[  301.567163]  ? report_bug+0x171/0x1a0
[  301.567466]  ? handle_bug+0x3c/0x80
[  301.567743]  ? exc_invalid_op+0x17/0x70
[  301.568038]  ? asm_exc_invalid_op+0x1a/0x20
[  301.568363]  ? untrack_pfn+0xf4/0x100
[  301.568660]  ? untrack_pfn+0x65/0x100
[  301.568947]  unmap_single_vma+0xa6/0xe0
[  301.569247]  unmap_vmas+0xb5/0x190
[  301.569532]  exit_mmap+0xec/0x340
[  301.569801]  __mmput+0x3e/0x130
[  301.570051]  do_exit+0x305/0xaf0
...

Reported-by: Wupeng Ma <mawupeng1@huawei.com>
Closes: https://lkml.kernel.org/r/20240227122814.3781907-1-mawupeng1@huawei.com
Fixes: b1a86e15dc03 ("x86, pat: remove the dependency on 'vm_pgoff' in track/untrack pfn vma routines")
Fixes: 5899329b1910 ("x86: PAT: implement track/untrack of pfnmap regions for x86 - v3")
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: David Hildenbrand <david@redhat.com>
---

On top of mm/mm-unstable.

I spent way too much time trying to squeeze the cachemode into the VMA in
a clean way. I even went the extra mile and tried to get rid of
follow_phys() completely, storing the cachemode and the PFN in the VMA.

But I didn't quite like the end result and decided to just keep it
simple and not care about improving the already-broken fork() corner case.

follow_phys() is rather ugly, and I have some patches to improve the
situation. But let's get this fix here out first and discuss if there is
an alternative.

---
 arch/x86/mm/pat/memtype.c | 49 ++++++++++++++++++++++++++++-----------
 mm/memory.c               |  4 ++++
 2 files changed, 39 insertions(+), 14 deletions(-)

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0904d7e8e1260..82857148dbec5 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -950,6 +950,38 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 		memtype_free(paddr, paddr + size);
 }
 
+static int get_pat_info(struct vm_area_struct *vma, resource_size_t *paddr,
+			pgprot_t *pgprot)
+{
+	unsigned long prot;
+
+	VM_WARN_ON_ONCE(!(vma->vm_flags & VM_PAT));
+
+	/*
+	 * We need the starting PFN and cachemode used for track_pfn_remap()
+	 * that covered the whole VMA. For most mappings, we can obtain that
+	 * information from the page tables. For COW mappings, we might now
+	 * suddenly have anon folios mapped and follow_phys() will fail.
+	 *
+	 * Fallback to using vma->vm_pgoff, see remap_pfn_range_notrack(), to
+	 * detect the PFN. If we need the cachemode as well, we're out of luck
+	 * for now and have to fail fork().
+	 */
+	if (!follow_phys(vma, vma->vm_start, 0, &prot, paddr)) {
+		if (pgprot)
+			*pgprot = __pgprot(prot);
+		return 0;
+	}
+	if (is_cow_mapping(vma->vm_flags)) {
+		if (pgprot)
+			return -EINVAL;
+		*paddr = (resource_size_t)vma->vm_pgoff << PAGE_SHIFT;
+		return 0;
+	}
+	WARN_ON_ONCE(1);
+	return -EINVAL;
+}
+
 /*
  * track_pfn_copy is called when vma that is covering the pfnmap gets
  * copied through copy_page_range().
@@ -960,20 +992,13 @@ static void free_pfn_range(u64 paddr, unsigned long size)
 int track_pfn_copy(struct vm_area_struct *vma)
 {
 	resource_size_t paddr;
-	unsigned long prot;
 	unsigned long vma_size = vma->vm_end - vma->vm_start;
 	pgprot_t pgprot;
 
 	if (vma->vm_flags & VM_PAT) {
-		/*
-		 * reserve the whole chunk covered by vma. We need the
-		 * starting address and protection from pte.
-		 */
-		if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) {
-			WARN_ON_ONCE(1);
+		if (get_pat_info(vma, &paddr, &pgprot))
 			return -EINVAL;
-		}
-		pgprot = __pgprot(prot);
+		/* reserve the whole chunk covered by vma. */
 		return reserve_pfn_range(paddr, vma_size, &pgprot, 1);
 	}
 
@@ -1048,7 +1073,6 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 		 unsigned long size, bool mm_wr_locked)
 {
 	resource_size_t paddr;
-	unsigned long prot;
 
 	if (vma && !(vma->vm_flags & VM_PAT))
 		return;
@@ -1056,11 +1080,8 @@ void untrack_pfn(struct vm_area_struct *vma, unsigned long pfn,
 	/* free the chunk starting from pfn or the whole chunk */
 	paddr = (resource_size_t)pfn << PAGE_SHIFT;
 	if (!paddr && !size) {
-		if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) {
-			WARN_ON_ONCE(1);
+		if (get_pat_info(vma, &paddr, NULL))
 			return;
-		}
-
 		size = vma->vm_end - vma->vm_start;
 	}
 	free_pfn_range(paddr, size);
diff --git a/mm/memory.c b/mm/memory.c
index f2bc6dd15eb83..bd3336fa5b66a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5971,6 +5971,10 @@ int follow_phys(struct vm_area_struct *vma,
 		goto out;
 	pte = ptep_get(ptep);
 
+	/* Never return PFNs of anon folios in COW mappings. */
+	if (vm_normal_folio(vma, address, pte))
+		goto unlock;
+
 	if ((flags & FOLL_WRITE) && !pte_write(pte))
 		goto unlock;
 
-- 
2.43.2


^ permalink raw reply related	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-12 18:11 [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings David Hildenbrand
@ 2024-03-12 19:22 ` Matthew Wilcox
  2024-03-12 19:38   ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Matthew Wilcox @ 2024-03-12 19:22 UTC
  To: David Hildenbrand
  Cc: linux-kernel, linux-mm, x86, Wupeng Ma, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Andrew Morton

On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
> PAT handling won't do the right thing in COW mappings: the first PTE
> (or, in fact, all PTEs) can be replaced during write faults to point at
> anon folios. Reliably recovering the correct PFN and cachemode using
> follow_phys() from PTEs will not work in COW mappings.

I guess the first question is: Why do we want to support COW mappings
of VM_PAT areas?  What breaks if we just disallow it?


* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-12 19:22 ` Matthew Wilcox
@ 2024-03-12 19:38   ` David Hildenbrand
  2024-03-14 16:42     ` David Hildenbrand
  2024-03-26  8:33     ` Ingo Molnar
  0 siblings, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-03-12 19:38 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, x86, Wupeng Ma, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Andrew Morton

On 12.03.24 20:22, Matthew Wilcox wrote:
> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>> PAT handling won't do the right thing in COW mappings: the first PTE
>> (or, in fact, all PTEs) can be replaced during write faults to point at
>> anon folios. Reliably recovering the correct PFN and cachemode using
>> follow_phys() from PTEs will not work in COW mappings.
> 
> I guess the first question is: Why do we want to support COW mappings
> of VM_PAT areas?  What breaks if we just disallow it?

Well, that was my first approach. Then I decided to be less radical (IOW 
make my life easier by breaking less user space) and "fix it" with 
minimal effort.

Breaking some weird user space is possible, although I believe for most 
such mappings MAP_PRIVATE doesn't make too much sense.

Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT 
support.

I can try digging through some possible user space users tomorrow.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-12 19:38   ` David Hildenbrand
@ 2024-03-14 16:42     ` David Hildenbrand
  2024-03-14 17:12       ` David Hildenbrand
  2024-03-25  2:57       ` mawupeng
  2024-03-26  8:33     ` Ingo Molnar
  1 sibling, 2 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-03-14 16:42 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, x86, Wupeng Ma, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Andrew Morton

On 12.03.24 20:38, David Hildenbrand wrote:
> On 12.03.24 20:22, Matthew Wilcox wrote:
>> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>>> PAT handling won't do the right thing in COW mappings: the first PTE
>>> (or, in fact, all PTEs) can be replaced during write faults to point at
>>> anon folios. Reliably recovering the correct PFN and cachemode using
>>> follow_phys() from PTEs will not work in COW mappings.
>>
>> I guess the first question is: Why do we want to support COW mappings
>> of VM_PAT areas?  What breaks if we just disallow it?
> 
> Well, that was my first approach. Then I decided to be less radical (IOW
> make my life easier by breaking less user space) and "fix it" with
> minimal effort.
> 
> Breaking some weird user space is possible, although I believe for most
> such mappings MAP_PRIVATE doesn't make too much sense.
> 
> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
> support.
> 
> I can try digging through some possible user space users tomorrow.

As discussed, MAP_PRIVATE doesn't make too much sense for most PFNMAP 
mappings.

However, /dev/mem and /proc/vmcore are still used with MAP_PRIVATE in 
some cases.

Side note: /proc/vmcore is a bit weird: mmap_vmcore() sets VM_MIXEDMAP, 
and then we might call remap_pfn_range(), which sets VM_PFNMAP. I'm not 
so sure if that's what we want to happen ...

As far as I can see, makedumpfile always mmap's memory to be dumped 
(/dev/mem, /proc/vmcore) using PROT_READ+MAP_PRIVATE, resulting in a COW 
mapping.


In my opinion, we should use this fairly simple fix to keep it working 
for now and look into disabling any MAP_PRIVATE of VM_PFNMAP separately, 
for all architectures.

But I'll leave the decision to x86 maintainers.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-14 16:42     ` David Hildenbrand
@ 2024-03-14 17:12       ` David Hildenbrand
  2024-03-25  2:57       ` mawupeng
  1 sibling, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-03-14 17:12 UTC
  To: Matthew Wilcox
  Cc: linux-kernel, linux-mm, x86, Wupeng Ma, Dave Hansen,
	Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, H. Peter Anvin, Andrew Morton

On 14.03.24 17:42, David Hildenbrand wrote:
> On 12.03.24 20:38, David Hildenbrand wrote:
>> On 12.03.24 20:22, Matthew Wilcox wrote:
>>> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>>>> PAT handling won't do the right thing in COW mappings: the first PTE
>>>> (or, in fact, all PTEs) can be replaced during write faults to point at
>>>> anon folios. Reliably recovering the correct PFN and cachemode using
>>>> follow_phys() from PTEs will not work in COW mappings.
>>>
>>> I guess the first question is: Why do we want to support COW mappings
>>> of VM_PAT areas?  What breaks if we just disallow it?
>>
>> Well, that was my first approach. Then I decided to be less radical (IOW
>> make my life easier by breaking less user space) and "fix it" with
>> minimal effort.
>>
>> Breaking some weird user space is possible, although I believe for most
>> such mappings MAP_PRIVATE doesn't make too much sense.
>>
>> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
>> support.
>>
>> I can try digging through some possible user space users tomorrow.
> 
> As discussed, MAP_PRIVATE doesn't make too much sense for most PFNMAP
> mappings.
> 
> However, /dev/mem and /proc/vmcore are still used with MAP_PRIVATE in
> some cases.
> 
> Side note: /proc/vmcore is a bit weird: mmap_vmcore() sets VM_MIXEDMAP,
> and then we might call remap_pfn_range(), which sets VM_PFNMAP. I'm not
> so sure if that's what we want to happen ...

Correction: at least mmap_vmcore() ends up clearing VM_MAYWRITE. So no 
COW mapping. We could do the same to at least keep PROT_READ|MAP_PRIVATE 
working. If user space specified PROT_WRITE for whatever reason, it's 
not that easy.

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-14 16:42     ` David Hildenbrand
  2024-03-14 17:12       ` David Hildenbrand
@ 2024-03-25  2:57       ` mawupeng
  1 sibling, 0 replies; 12+ messages in thread
From: mawupeng @ 2024-03-25  2:57 UTC
  To: david, willy
  Cc: mawupeng1, linux-kernel, linux-mm, x86, dave.hansen, luto, peterz,
	tglx, mingo, bp, hpa, akpm



On 2024/3/15 0:42, David Hildenbrand wrote:
> On 12.03.24 20:38, David Hildenbrand wrote:
>> On 12.03.24 20:22, Matthew Wilcox wrote:
>>> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>>>> PAT handling won't do the right thing in COW mappings: the first PTE
>>>> (or, in fact, all PTEs) can be replaced during write faults to point at
>>>> anon folios. Reliably recovering the correct PFN and cachemode using
>>>> follow_phys() from PTEs will not work in COW mappings.
>>>
>>> I guess the first question is: Why do we want to support COW mappings
>>> of VM_PAT areas?  What breaks if we just disallow it?
>>
>> Well, that was my first approach. Then I decided to be less radical (IOW
>> make my life easier by breaking less user space) and "fix it" with
>> minimal effort.
>>
>> Breaking some weird user space is possible, although I believe for most
>> such mappings MAP_PRIVATE doesn't make too much sense.
>>
>> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
>> support.
>>
>> I can try digging through some possible user space users tomorrow.
> 
> As discussed, MAP_PRIVATE doesn't make too much sense for most PFNMAP mappings.
> 
> However, /dev/mem and /proc/vmcore are still used with MAP_PRIVATE in some cases.
> 
> Side note: /proc/vmcore is a bit weird: mmap_vmcore() sets VM_MIXEDMAP, and then we might call remap_pfn_range(), which sets VM_PFNMAP. I'm not so sure if that's what we want to happen ...
> 
> As far as I can see, makedumpfile always mmap's memory to be dumped (/dev/mem, /proc/vmcore) using PROT_READ+MAP_PRIVATE, resulting in a COW mapping.
> 
> 
> In my opinion, we should use this fairly simple fix to keep it working for now and look into disabling any MAP_PRIVATE of VM_PFNMAP separately, for all architectures.
> 
> But I'll leave the decision to x86 maintainers.

Hi, x86 maintainers:

Kindly ping.

> 


* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-12 19:38   ` David Hildenbrand
  2024-03-14 16:42     ` David Hildenbrand
@ 2024-03-26  8:33     ` Ingo Molnar
  2024-03-26  8:48       ` David Hildenbrand
  1 sibling, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2024-03-26  8:33 UTC
  To: David Hildenbrand
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Andrew Morton


* David Hildenbrand <david@redhat.com> wrote:

> On 12.03.24 20:22, Matthew Wilcox wrote:
> > On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
> > > PAT handling won't do the right thing in COW mappings: the first PTE
> > > (or, in fact, all PTEs) can be replaced during write faults to point at
> > > anon folios. Reliably recovering the correct PFN and cachemode using
> > > follow_phys() from PTEs will not work in COW mappings.
> > 
> > I guess the first question is: Why do we want to support COW mappings 
> > of VM_PAT areas?  What breaks if we just disallow it?
> 
> Well, that was my first approach. Then I decided to be less radical (IOW 
> make my life easier by breaking less user space) and "fix it" with 
> minimal effort.
> 
> Breaking some weird user space is possible, although I believe for most 
> such mappings MAP_PRIVATE doesn't make too much sense.
> 
> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT 
> support.
> 
> I can try digging through some possible user space users tomorrow.

I'd much prefer restricting VM_PAT areas over expanding support. Could we 
try the trivial restriction approach first, and only go with your original 
patch if that fails?

Thanks,

	Ingo


* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-26  8:33     ` Ingo Molnar
@ 2024-03-26  8:48       ` David Hildenbrand
  2024-03-26  8:53         ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-03-26  8:48 UTC
  To: Ingo Molnar
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Andrew Morton

On 26.03.24 09:33, Ingo Molnar wrote:
> 
> * David Hildenbrand <david@redhat.com> wrote:
> 
>> On 12.03.24 20:22, Matthew Wilcox wrote:
>>> On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
>>>> PAT handling won't do the right thing in COW mappings: the first PTE
>>>> (or, in fact, all PTEs) can be replaced during write faults to point at
>>>> anon folios. Reliably recovering the correct PFN and cachemode using
>>>> follow_phys() from PTEs will not work in COW mappings.
>>>
>>> I guess the first question is: Why do we want to support COW mappings
>>> of VM_PAT areas?  What breaks if we just disallow it?
>>
>> Well, that was my first approach. Then I decided to be less radical (IOW
>> make my life easier by breaking less user space) and "fix it" with
>> minimal effort.
>>
>> Breaking some weird user space is possible, although I believe for most
>> such mappings MAP_PRIVATE doesn't make too much sense.
>>
>> Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
>> support.
>>
>> I can try digging through some possible user space users tomorrow.
> 
> I'd much prefer restricting VM_PAT areas over expanding support. Could we

Note that we're not expanding support, we're fixing what used to be
possible before but mostly broke silently.

But I agree that we should rather remove these corner cases instead of fixing
them.

> try the trivial restriction approach first, and only go with your original
> patch if that fails?

Which version would you prefer? I had two alternatives (excluding comment
changes; whitespace is expected to be broken).


1) Disallow when we would have set VM_PAT on is_cow_mapping()

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0d72183b5dd0..6979912b1a5d 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
                                 && size == (vma->vm_end - vma->vm_start))) {
                 int ret;
  
+               if (is_cow_mapping(vma->vm_flags))
+                       return -EINVAL;
+
                 ret = reserve_pfn_range(paddr, size, prot, 0);
                 if (ret == 0 && vma)
                         vm_flags_set(vma, VM_PAT);


2) Fallback to !VM_PAT

diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index 0d72183b5dd0..8e97156c9be8 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
         enum page_cache_mode pcm;
  
         /* reserve the whole chunk starting from paddr */
-       if (!vma || (addr == vma->vm_start
-                               && size == (vma->vm_end - vma->vm_start))) {
+       if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
+                    size == (vma->vm_end - vma->vm_start))) {
                 int ret;
  
                 ret = reserve_pfn_range(paddr, size, prot, 0);



Personally, I'd go for 2).

-- 
Cheers,

David / dhildenb



* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-26  8:48       ` David Hildenbrand
@ 2024-03-26  8:53         ` Ingo Molnar
  2024-03-26  8:57           ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2024-03-26  8:53 UTC
  To: David Hildenbrand
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Andrew Morton


* David Hildenbrand <david@redhat.com> wrote:

> On 26.03.24 09:33, Ingo Molnar wrote:
> > 
> > * David Hildenbrand <david@redhat.com> wrote:
> > 
> > > On 12.03.24 20:22, Matthew Wilcox wrote:
> > > > On Tue, Mar 12, 2024 at 07:11:18PM +0100, David Hildenbrand wrote:
> > > > > PAT handling won't do the right thing in COW mappings: the first PTE
> > > > > (or, in fact, all PTEs) can be replaced during write faults to point at
> > > > > anon folios. Reliably recovering the correct PFN and cachemode using
> > > > > follow_phys() from PTEs will not work in COW mappings.
> > > > 
> > > > I guess the first question is: Why do we want to support COW mappings
> > > > of VM_PAT areas?  What breaks if we just disallow it?
> > > 
> > > Well, that was my first approach. Then I decided to be less radical (IOW
> > > make my life easier by breaking less user space) and "fix it" with
> > > minimal effort.
> > > 
> > > Breaking some weird user space is possible, although I believe for
> > > most such mappings MAP_PRIVATE doesn't make too much sense.
> > > 
> > > Nasty COW support for VM_PFNMAP mappings dates back forever. So does PAT
> > > support.
> > > 
> > > I can try digging through some possible user space users tomorrow.
> > 
> > I'd much prefer restricting VM_PAT areas over expanding support. Could we
> 
> Note that we're not expanding support, we're fixing what used to be
> possible before but mostly broke silently.

Yeah - that's de facto expanding support. :-)

> But I agree that we should rather remove these corner cases instead of 
> fixing them.

Yeah, especially if no code is hitting it intentionally.

> > try the trivial restriction approach first, and only go with your original
> > patch if that fails?
> 
> Which version would you prefer, I had two alternatives (excluding comment
> changes, white-space expected to be broken).
> 
> 
> 1) Disallow when we would have set VM_PAT on is_cow_mapping()
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 0d72183b5dd0..6979912b1a5d 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>                                 && size == (vma->vm_end - vma->vm_start))) {
>                 int ret;
> +               if (is_cow_mapping(vma->vm_flags))
> +                       return -EINVAL;
> +
>                 ret = reserve_pfn_range(paddr, size, prot, 0);
>                 if (ret == 0 && vma)
>                         vm_flags_set(vma, VM_PAT);
> 
> 
> 2) Fallback to !VM_PAT
> 
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index 0d72183b5dd0..8e97156c9be8 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>         enum page_cache_mode pcm;
>         /* reserve the whole chunk starting from paddr */
> -       if (!vma || (addr == vma->vm_start
> -                               && size == (vma->vm_end - vma->vm_start))) {
> +       if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
> +                    size == (vma->vm_end - vma->vm_start))) {
>                 int ret;
>                 ret = reserve_pfn_range(paddr, size, prot, 0);
>
> 
> 
> Personally, I'd go for 2).

So what's the advantage of #2? This is clearly something the user didn't 
really intend or think about much. Isn't explicitly failing that mapping a 
better option than silently downgrading it to !VM_PAT?

(If I'm reading it right ...)

Thanks,

	Ingo


* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-26  8:53         ` Ingo Molnar
@ 2024-03-26  8:57           ` David Hildenbrand
  2024-04-01  9:45             ` Ingo Molnar
  0 siblings, 1 reply; 12+ messages in thread
From: David Hildenbrand @ 2024-03-26  8:57 UTC
  To: Ingo Molnar
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Andrew Morton

>>> try the trivial restriction approach first, and only go with your original
>>> patch if that fails?
>>
>> Which version would you prefer, I had two alternatives (excluding comment
>> changes, white-space expected to be broken).
>>
>>
>> 1) Disallow when we would have set VM_PAT on is_cow_mapping()
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 0d72183b5dd0..6979912b1a5d 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>                                  && size == (vma->vm_end - vma->vm_start))) {
>>                  int ret;
>> +               if (is_cow_mapping(vma->vm_flags))
>> +                       return -EINVAL;
>> +
>>                  ret = reserve_pfn_range(paddr, size, prot, 0);
>>                  if (ret == 0 && vma)
>>                          vm_flags_set(vma, VM_PAT);
>>
>>
>> 2) Fallback to !VM_PAT
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index 0d72183b5dd0..8e97156c9be8 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>          enum page_cache_mode pcm;
>>          /* reserve the whole chunk starting from paddr */
>> -       if (!vma || (addr == vma->vm_start
>> -                               && size == (vma->vm_end - vma->vm_start))) {
>> +       if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
>> +                    size == (vma->vm_end - vma->vm_start))) {
>>                  int ret;
>>                  ret = reserve_pfn_range(paddr, size, prot, 0);
>>
>>
>>
>> Personally, I'd go for 2).
> 
> So what's the advantage of #2? This is clearly something the user didn't
> really intend or think about much. Isn't explicitly failing that mapping a
> better option than silently downgrading it to !VM_PAT?
> 
> (If I'm reading it right ...)

I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail 
with 1), while it keeps on working for 2).

Note that I think we currently set VM_PAT on each and every system if 
remap_pfn_range() will cover the whole VMA, even if PAT is not actually 
enabled.

It's all a bit of a mess TBH, but I got my hands dirty enough on that.

So 1) can be rather destructive ... 2) at least somehow keeps it working.

For that reason I went with the current patch, because it's hard to tell 
which use case you will end up breaking ... :/

-- 
Cheers,

David / dhildenb


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-03-26  8:57           ` David Hildenbrand
@ 2024-04-01  9:45             ` Ingo Molnar
  2024-04-02  9:14               ` David Hildenbrand
  0 siblings, 1 reply; 12+ messages in thread
From: Ingo Molnar @ 2024-04-01  9:45 UTC (permalink / raw
  To: David Hildenbrand, Andrew Morton
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin, Andrew Morton


* David Hildenbrand <david@redhat.com> wrote:

> > > > try the trivial restriction approach first, and only go with your original
> > > > patch if that fails?
> > > 
> > > Which version would you prefer, I had two alternatives (excluding comment
> > > changes, white-space expected to be broken).
> > > 
> > > 
> > > 1) Disallow when we would have set VM_PAT on is_cow_mapping()
> > > 
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index 0d72183b5dd0..6979912b1a5d 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > >                                  && size == (vma->vm_end - vma->vm_start))) {
> > >                  int ret;
> > > +               if (is_cow_mapping(vma->vm_flags))
> > > +                       return -EINVAL;
> > > +
> > >                  ret = reserve_pfn_range(paddr, size, prot, 0);
> > >                  if (ret == 0 && vma)
> > >                          vm_flags_set(vma, VM_PAT);
> > > 
> > > 
> > > 2) Fallback to !VM_PAT
> > > 
> > > diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> > > index 0d72183b5dd0..8e97156c9be8 100644
> > > --- a/arch/x86/mm/pat/memtype.c
> > > +++ b/arch/x86/mm/pat/memtype.c
> > > @@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
> > >          enum page_cache_mode pcm;
> > >          /* reserve the whole chunk starting from paddr */
> > > -       if (!vma || (addr == vma->vm_start
> > > -                               && size == (vma->vm_end - vma->vm_start))) {
> > > +       if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
> > > +                    size == (vma->vm_end - vma->vm_start))) {
> > >                  int ret;
> > >                  ret = reserve_pfn_range(paddr, size, prot, 0);
> > > 
> > > 
> > > 
> > > Personally, I'd go for 2).
> > 
> > So what's the advantage of #2? This is clearly something the user didn't
> > really intend or think about much. Isn't explicitly failing that mapping a
> > better option than silently downgrading it to !VM_PAT?
> > 
> > (If I'm reading it right ...)
> 
> I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail
> with 1), while it keeps on working for 2).
> 
> Note that I think we currently set VM_PAT on each and every system if
> remap_pfn_range() covers the whole VMA, even if PAT is not actually
> enabled.
> 
> It's all a bit of a mess TBH, but I got my hands dirty enough on that.
> 
> So 1) can be rather destructive ... 2) at least somehow keeps it working.
> 
> For that reason I went with the current patch, because it's hard to tell
> which use case you will end up breaking ... :/

Yeah, so I think you make valid observations, i.e. your first patch is 
probably the best one.

But since it changes mm/memory.c, I'd like to pass that over to Andrew 
and the MM folks.

The x86 bits:

  Acked-by: Ingo Molnar <mingo@kernel.org>

Thanks,

	Ingo


* Re: [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings
  2024-04-01  9:45             ` Ingo Molnar
@ 2024-04-02  9:14               ` David Hildenbrand
  0 siblings, 0 replies; 12+ messages in thread
From: David Hildenbrand @ 2024-04-02  9:14 UTC (permalink / raw
  To: Ingo Molnar, Andrew Morton
  Cc: Matthew Wilcox, linux-kernel, linux-mm, x86, Wupeng Ma,
	Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, H. Peter Anvin

On 01.04.24 11:45, Ingo Molnar wrote:
> 
> * David Hildenbrand <david@redhat.com> wrote:
> 
>>>>> try the trivial restriction approach first, and only go with your original
>>>>> patch if that fails?
>>>>
>>>> Which version would you prefer, I had two alternatives (excluding comment
>>>> changes, white-space expected to be broken).
>>>>
>>>>
>>>> 1) Disallow when we would have set VM_PAT on is_cow_mapping()
>>>>
>>>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>>>> index 0d72183b5dd0..6979912b1a5d 100644
>>>> --- a/arch/x86/mm/pat/memtype.c
>>>> +++ b/arch/x86/mm/pat/memtype.c
>>>> @@ -994,6 +994,9 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>>>                                   && size == (vma->vm_end - vma->vm_start))) {
>>>>                   int ret;
>>>> +               if (is_cow_mapping(vma->vm_flags))
>>>> +                       return -EINVAL;
>>>> +
>>>>                   ret = reserve_pfn_range(paddr, size, prot, 0);
>>>>                   if (ret == 0 && vma)
>>>>                           vm_flags_set(vma, VM_PAT);
>>>>
>>>>
>>>> 2) Fallback to !VM_PAT
>>>>
>>>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>>>> index 0d72183b5dd0..8e97156c9be8 100644
>>>> --- a/arch/x86/mm/pat/memtype.c
>>>> +++ b/arch/x86/mm/pat/memtype.c
>>>> @@ -990,8 +990,8 @@ int track_pfn_remap(struct vm_area_struct *vma, pgprot_t *prot,
>>>>           enum page_cache_mode pcm;
>>>>           /* reserve the whole chunk starting from paddr */
>>>> -       if (!vma || (addr == vma->vm_start
>>>> -                               && size == (vma->vm_end - vma->vm_start))) {
>>>> +       if (!vma || (!is_cow_mapping(vma->vm_flags) && addr == vma->vm_start &&
>>>> +                    size == (vma->vm_end - vma->vm_start))) {
>>>>                   int ret;
>>>>                   ret = reserve_pfn_range(paddr, size, prot, 0);
>>>>
>>>>
>>>>
>>>> Personally, I'd go for 2).
>>>
>>> So what's the advantage of #2? This is clearly something the user didn't
>>> really intend or think about much. Isn't explicitly failing that mapping a
>>> better option than silently downgrading it to !VM_PAT?
>>>
>>> (If I'm reading it right ...)
>>
>> I think a simple mmap(MAP_PRIVATE) of /dev/mem will unconditionally fail
>> with 1), while it keeps on working for 2).
>>
>> Note that I think we currently set VM_PAT on each and every system if
>> remap_pfn_range() covers the whole VMA, even if PAT is not actually
>> enabled.
>>
>> It's all a bit of a mess TBH, but I got my hands dirty enough on that.
>>
>> So 1) can be rather destructive ... 2) at least somehow keeps it working.
>>
>> For that reason I went with the current patch, because it's hard to tell
>> which use case you will end up breaking ... :/
> 

Hi,

> Yeah, so I think you make valid observations, i.e. your first patch is
> probably the best one.

okay, so the original patch, thanks.

> 
> But since it changes mm/memory.c, I'd like to pass that over to Andrew
> and the MM folks.
> 
> The x86 bits:
> 
>    Acked-by: Ingo Molnar <mingo@kernel.org>


Thanks. There is now a conflict with other work that already landed in 
mm-unstable, which moves follow_phys() into arch/x86/mm/pat/memtype.c.


@Andrew, this here is a fix; how should we best handle it? Likely the 
fix should go in first, and the fixup of Christoph's patch should be 
easy. Just let me know how you want to handle it.

-- 
Cheers,

David / dhildenb



end of thread, other threads:[~2024-04-02  9:14 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-03-12 18:11 [PATCH v1] x86/mm/pat: fix VM_PAT handling in COW mappings David Hildenbrand
2024-03-12 19:22 ` Matthew Wilcox
2024-03-12 19:38   ` David Hildenbrand
2024-03-14 16:42     ` David Hildenbrand
2024-03-14 17:12       ` David Hildenbrand
2024-03-25  2:57       ` mawupeng
2024-03-26  8:33     ` Ingo Molnar
2024-03-26  8:48       ` David Hildenbrand
2024-03-26  8:53         ` Ingo Molnar
2024-03-26  8:57           ` David Hildenbrand
2024-04-01  9:45             ` Ingo Molnar
2024-04-02  9:14               ` David Hildenbrand
