* [PATCH 0/6] make vma locking more obvious
@ 2023-07-31 17:12 Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 1/6] mm: enable page walking API to lock vmas during the walk Suren Baghdasaryan
                   ` (5 more replies)
  0 siblings, 6 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan

During recent vma locking patch reviews, Linus and Jann Horn noted a number
of issues with vma locking and suggested improvements:

1. walk_page_range() has no way to write-lock a vma during the walk when
the walk is done under mmap_write_lock, for example in s390_reset_cmma().

2. Vma locking is hidden inside the vm_flags modifiers and is hard to
follow. The suggestion is to change vm_flags_reset{_once} to assert that
the vma is already write-locked and to require explicit locking by the
caller.

3. The same issue exists with vma_prepare() hiding vma locking.

4. In userfaultfd, vm_flags are modified after vma->vm_userfaultfd_ctx,
so page faults can operate on a context while it is being changed.

5. do_brk_flags() and __install_special_mapping() do not lock a newly
created vma before adding it into the mm. While not strictly a problem,
this is fragile if the vma is modified after insertion, as in the
mmap_region() case which was recently fixed. The suggestion is to always
lock a new vma before inserting it and making it visible to page faults.

6. For CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() would benefit from
being mmap_assert_write_locked() instead of a no-op; any place which
operates on a vma and calls mmap_assert_write_locked() can then be
converted into vma_assert_write_locked(). (Points 2 and 6 are sketched
right after this list.)
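
For points 2 and 6, a rough sketch of the intended end state (the actual
patches below are authoritative):

	/* 6: for CONFIG_PER_VMA_LOCK=n, assert on the mmap lock instead of
	 * being a no-op */
	static inline void vma_assert_write_locked(struct vm_area_struct *vma)
		{ mmap_assert_write_locked(vma->vm_mm); }

	/* 2: the vm_flags modifiers only assert; callers take the vma write
	 * lock explicitly beforehand */
	static inline void vm_flags_reset(struct vm_area_struct *vma,
					  vm_flags_t flags)
	{
		vma_assert_write_locked(vma);
		vm_flags_init(vma, flags);
	}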

I CC'ed stable only on the first patch because the others are cleanups and
the bug in userfaultfd does not affect stable (lock_vma_under_rcu() prevents
uffds from being handled under vma lock protection). However, I would be
happy if the whole series were merged into stable 6.4, since it makes vma
locking more maintainable.

The patches apply cleanly over Linus' ToT and will conflict when applied
over mm-unstable because [1] is missing there. The conflict can be easily
resolved by ignoring the conflicting deletions, but it is probably simpler
to take [1] into mm-unstable and avoid the later conflict.

[1] commit 6c21e066f925 ("mm/mempolicy: Take VMA lock before replacing policy")

Suren Baghdasaryan (6):
  mm: enable page walking API to lock vmas during the walk
  mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and
    mmap
  mm: replace mmap with vma write lock assertions when operating on a
    vma
  mm: lock vma explicitly before doing vm_flags_reset and
    vm_flags_reset_once
  mm: always lock new vma before inserting into vma tree
  mm: move vma locking out of vma_prepare

 arch/powerpc/kvm/book3s_hv_uvmem.c      |  1 +
 arch/powerpc/mm/book3s64/subpage_prot.c |  2 +-
 arch/riscv/mm/pageattr.c                |  4 ++--
 arch/s390/mm/gmap.c                     | 10 ++++-----
 drivers/infiniband/hw/hfi1/file_ops.c   |  1 +
 fs/proc/task_mmu.c                      | 10 ++++-----
 fs/userfaultfd.c                        |  6 +++++
 include/linux/mm.h                      | 13 +++++++----
 include/linux/pagewalk.h                |  6 ++---
 mm/damon/vaddr.c                        |  4 ++--
 mm/hmm.c                                |  2 +-
 mm/hugetlb.c                            |  2 +-
 mm/khugepaged.c                         |  5 +++--
 mm/ksm.c                                | 16 +++++++-------
 mm/madvise.c                            | 13 +++++------
 mm/memcontrol.c                         |  4 ++--
 mm/memory-failure.c                     |  2 +-
 mm/memory.c                             |  2 +-
 mm/mempolicy.c                          | 12 ++++------
 mm/migrate_device.c                     |  2 +-
 mm/mincore.c                            |  2 +-
 mm/mlock.c                              |  5 +++--
 mm/mmap.c                               | 29 ++++++++++++++++---------
 mm/mprotect.c                           |  3 ++-
 mm/pagewalk.c                           | 13 ++++++++---
 mm/vmscan.c                             |  3 ++-
 26 files changed, 100 insertions(+), 72 deletions(-)

-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 18:02   ` Linus Torvalds
  2023-07-31 17:12 ` [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap Suren Baghdasaryan
                   ` (4 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan,
	Linus Torvalds

walk_page_range() and friends often operate under a write-locked mmap_lock.
With the introduction of vma locks, the vmas have to be locked as well
during such walks to prevent concurrent page faults in these areas.
Add an additional parameter to the walk_page_range() family of functions
to indicate which walks should lock the vmas before operating on them.
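
As an illustration (the ops names below are made up), a walk performed
under mmap_write_lock() which modifies the mappings now passes true for
the new parameter, while read-side walks keep passing false:

	mmap_write_lock(mm);
	walk_page_range(mm, 0, TASK_SIZE, &fixup_walk_ops, true, NULL);
	mmap_write_unlock(mm);

	mmap_read_lock(mm);
	walk_page_range(mm, 0, TASK_SIZE, &scan_walk_ops, false, NULL);
	mmap_read_unlock(mm);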

Cc: stable@vger.kernel.org # 6.4.x
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/mm/book3s64/subpage_prot.c |  2 +-
 arch/riscv/mm/pageattr.c                |  4 ++--
 arch/s390/mm/gmap.c                     | 10 +++++-----
 fs/proc/task_mmu.c                      | 10 +++++-----
 include/linux/pagewalk.h                |  6 +++---
 mm/damon/vaddr.c                        |  4 ++--
 mm/hmm.c                                |  2 +-
 mm/ksm.c                                | 16 ++++++++--------
 mm/madvise.c                            |  8 ++++----
 mm/memcontrol.c                         |  4 ++--
 mm/memory-failure.c                     |  2 +-
 mm/mempolicy.c                          | 12 ++++--------
 mm/migrate_device.c                     |  2 +-
 mm/mincore.c                            |  2 +-
 mm/mlock.c                              |  2 +-
 mm/mprotect.c                           |  2 +-
 mm/pagewalk.c                           | 13 ++++++++++---
 mm/vmscan.c                             |  3 ++-
 18 files changed, 54 insertions(+), 50 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/subpage_prot.c b/arch/powerpc/mm/book3s64/subpage_prot.c
index 0dc85556dec5..177e5c646d9c 100644
--- a/arch/powerpc/mm/book3s64/subpage_prot.c
+++ b/arch/powerpc/mm/book3s64/subpage_prot.c
@@ -159,7 +159,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
 	 */
 	for_each_vma_range(vmi, vma, addr + len) {
 		vm_flags_set(vma, VM_NOHUGEPAGE);
-		walk_page_vma(vma, &subpage_walk_ops, NULL);
+		walk_page_vma(vma, &subpage_walk_ops, true, NULL);
 	}
 }
 #else
diff --git a/arch/riscv/mm/pageattr.c b/arch/riscv/mm/pageattr.c
index ea3d61de065b..95207994cbf0 100644
--- a/arch/riscv/mm/pageattr.c
+++ b/arch/riscv/mm/pageattr.c
@@ -167,7 +167,7 @@ int set_direct_map_invalid_noflush(struct page *page)
 	};
 
 	mmap_read_lock(&init_mm);
-	ret = walk_page_range(&init_mm, start, end, &pageattr_ops, &masks);
+	ret = walk_page_range(&init_mm, start, end, &pageattr_ops, false, &masks);
 	mmap_read_unlock(&init_mm);
 
 	return ret;
@@ -184,7 +184,7 @@ int set_direct_map_default_noflush(struct page *page)
 	};
 
 	mmap_read_lock(&init_mm);
-	ret = walk_page_range(&init_mm, start, end, &pageattr_ops, &masks);
+	ret = walk_page_range(&init_mm, start, end, &pageattr_ops, false, &masks);
 	mmap_read_unlock(&init_mm);
 
 	return ret;
diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index 9c8af31be970..16a58c860c74 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -2523,7 +2523,7 @@ static inline void thp_split_mm(struct mm_struct *mm)
 
 	for_each_vma(vmi, vma) {
 		vm_flags_mod(vma, VM_NOHUGEPAGE, VM_HUGEPAGE);
-		walk_page_vma(vma, &thp_split_walk_ops, NULL);
+		walk_page_vma(vma, &thp_split_walk_ops, true, NULL);
 	}
 	mm->def_flags |= VM_NOHUGEPAGE;
 }
@@ -2584,7 +2584,7 @@ int s390_enable_sie(void)
 	mm->context.has_pgste = 1;
 	/* split thp mappings and disable thp for future mappings */
 	thp_split_mm(mm);
-	walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, NULL);
+	walk_page_range(mm, 0, TASK_SIZE, &zap_zero_walk_ops, true, NULL);
 	mmap_write_unlock(mm);
 	return 0;
 }
@@ -2672,7 +2672,7 @@ int s390_enable_skey(void)
 		mm->context.uses_skeys = 0;
 		goto out_up;
 	}
-	walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, NULL);
+	walk_page_range(mm, 0, TASK_SIZE, &enable_skey_walk_ops, true, NULL);
 
 out_up:
 	mmap_write_unlock(mm);
@@ -2697,7 +2697,7 @@ static const struct mm_walk_ops reset_cmma_walk_ops = {
 void s390_reset_cmma(struct mm_struct *mm)
 {
 	mmap_write_lock(mm);
-	walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, NULL);
+	walk_page_range(mm, 0, TASK_SIZE, &reset_cmma_walk_ops, true, NULL);
 	mmap_write_unlock(mm);
 }
 EXPORT_SYMBOL_GPL(s390_reset_cmma);
@@ -2771,7 +2771,7 @@ int __s390_uv_destroy_range(struct mm_struct *mm, unsigned long start,
 	while (r > 0) {
 		state.count = 0;
 		mmap_read_lock(mm);
-		r = walk_page_range(mm, state.next, end, &gather_pages_ops, &state);
+		r = walk_page_range(mm, state.next, end, &gather_pages_ops, false, &state);
 		mmap_read_unlock(mm);
 		cond_resched();
 		s390_uv_destroy_pfns(state.count, state.pfns);
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 507cd4e59d07..f0d0f2959f91 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -804,9 +804,9 @@ static void smap_gather_stats(struct vm_area_struct *vma,
 
 	/* mmap_lock is held in m_start */
 	if (!start)
-		walk_page_vma(vma, ops, mss);
+		walk_page_vma(vma, ops, false, mss);
 	else
-		walk_page_range(vma->vm_mm, start, vma->vm_end, ops, mss);
+		walk_page_range(vma->vm_mm, start, vma->vm_end, ops, false, mss);
 }
 
 #define SEQ_PUT_DEC(str, val) \
@@ -1307,7 +1307,7 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
 						0, mm, 0, -1UL);
 			mmu_notifier_invalidate_range_start(&range);
 		}
-		walk_page_range(mm, 0, -1, &clear_refs_walk_ops, &cp);
+		walk_page_range(mm, 0, -1, &clear_refs_walk_ops, true, &cp);
 		if (type == CLEAR_REFS_SOFT_DIRTY) {
 			mmu_notifier_invalidate_range_end(&range);
 			flush_tlb_mm(mm);
@@ -1720,7 +1720,7 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
 		ret = mmap_read_lock_killable(mm);
 		if (ret)
 			goto out_free;
-		ret = walk_page_range(mm, start_vaddr, end, &pagemap_ops, &pm);
+		ret = walk_page_range(mm, start_vaddr, end, &pagemap_ops, false, &pm);
 		mmap_read_unlock(mm);
 		start_vaddr = end;
 
@@ -1981,7 +1981,7 @@ static int show_numa_map(struct seq_file *m, void *v)
 		seq_puts(m, " huge");
 
 	/* mmap_lock is held by m_start */
-	walk_page_vma(vma, &show_numa_ops, md);
+	walk_page_vma(vma, &show_numa_ops, false, md);
 
 	if (!md->pages)
 		goto out;
diff --git a/include/linux/pagewalk.h b/include/linux/pagewalk.h
index 27a6df448ee5..69656ec44049 100644
--- a/include/linux/pagewalk.h
+++ b/include/linux/pagewalk.h
@@ -105,16 +105,16 @@ struct mm_walk {
 
 int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
-		void *private);
+		bool lock_vma, void *private);
 int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 			  unsigned long end, const struct mm_walk_ops *ops,
 			  pgd_t *pgd,
 			  void *private);
 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, const struct mm_walk_ops *ops,
-			void *private);
+			bool lock_vma, void *private);
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
-		void *private);
+		  bool lock_vma, void *private);
 int walk_page_mapping(struct address_space *mapping, pgoff_t first_index,
 		      pgoff_t nr, const struct mm_walk_ops *ops,
 		      void *private);
diff --git a/mm/damon/vaddr.c b/mm/damon/vaddr.c
index 2fcc9731528a..54f50b1aefe4 100644
--- a/mm/damon/vaddr.c
+++ b/mm/damon/vaddr.c
@@ -391,7 +391,7 @@ static const struct mm_walk_ops damon_mkold_ops = {
 static void damon_va_mkold(struct mm_struct *mm, unsigned long addr)
 {
 	mmap_read_lock(mm);
-	walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, NULL);
+	walk_page_range(mm, addr, addr + 1, &damon_mkold_ops, false, NULL);
 	mmap_read_unlock(mm);
 }
 
@@ -536,7 +536,7 @@ static bool damon_va_young(struct mm_struct *mm, unsigned long addr,
 	};
 
 	mmap_read_lock(mm);
-	walk_page_range(mm, addr, addr + 1, &damon_young_ops, &arg);
+	walk_page_range(mm, addr, addr + 1, &damon_young_ops, false, &arg);
 	mmap_read_unlock(mm);
 	return arg.young;
 }
diff --git a/mm/hmm.c b/mm/hmm.c
index 855e25e59d8f..f94f5e268e40 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -600,7 +600,7 @@ int hmm_range_fault(struct hmm_range *range)
 					     range->notifier_seq))
 			return -EBUSY;
 		ret = walk_page_range(mm, hmm_vma_walk.last, range->end,
-				      &hmm_walk_ops, &hmm_vma_walk);
+				      &hmm_walk_ops, false, &hmm_vma_walk);
 		/*
 		 * When -EBUSY is returned the loop restarts with
 		 * hmm_vma_walk.last set to an address that has not been stored
diff --git a/mm/ksm.c b/mm/ksm.c
index ba266359da55..494a1f3fcb97 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -470,7 +470,7 @@ static const struct mm_walk_ops break_ksm_ops = {
  * of the process that owns 'vma'.  We also do not want to enforce
  * protection keys here anyway.
  */
-static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
+static int break_ksm(struct vm_area_struct *vma, unsigned long addr, bool lock_vma)
 {
 	vm_fault_t ret = 0;
 
@@ -479,7 +479,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
 
 		cond_resched();
 		ksm_page = walk_page_range_vma(vma, addr, addr + 1,
-					       &break_ksm_ops, NULL);
+					       &break_ksm_ops, lock_vma, NULL);
 		if (WARN_ON_ONCE(ksm_page < 0))
 			return ksm_page;
 		if (!ksm_page)
@@ -565,7 +565,7 @@ static void break_cow(struct ksm_rmap_item *rmap_item)
 	mmap_read_lock(mm);
 	vma = find_mergeable_vma(mm, addr);
 	if (vma)
-		break_ksm(vma, addr);
+		break_ksm(vma, addr, false);
 	mmap_read_unlock(mm);
 }
 
@@ -871,7 +871,7 @@ static void remove_trailing_rmap_items(struct ksm_rmap_item **rmap_list)
  * in cmp_and_merge_page on one of the rmap_items we would be removing.
  */
 static int unmerge_ksm_pages(struct vm_area_struct *vma,
-			     unsigned long start, unsigned long end)
+			     unsigned long start, unsigned long end, bool lock_vma)
 {
 	unsigned long addr;
 	int err = 0;
@@ -882,7 +882,7 @@ static int unmerge_ksm_pages(struct vm_area_struct *vma,
 		if (signal_pending(current))
 			err = -ERESTARTSYS;
 		else
-			err = break_ksm(vma, addr);
+			err = break_ksm(vma, addr, lock_vma);
 	}
 	return err;
 }
@@ -1029,7 +1029,7 @@ static int unmerge_and_remove_all_rmap_items(void)
 			if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
 				continue;
 			err = unmerge_ksm_pages(vma,
-						vma->vm_start, vma->vm_end);
+						vma->vm_start, vma->vm_end, false);
 			if (err)
 				goto error;
 		}
@@ -2530,7 +2530,7 @@ static int __ksm_del_vma(struct vm_area_struct *vma)
 		return 0;
 
 	if (vma->anon_vma) {
-		err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
+		err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end, true);
 		if (err)
 			return err;
 	}
@@ -2668,7 +2668,7 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
 			return 0;		/* just ignore the advice */
 
 		if (vma->anon_vma) {
-			err = unmerge_ksm_pages(vma, start, end);
+			err = unmerge_ksm_pages(vma, start, end, true);
 			if (err)
 				return err;
 		}
diff --git a/mm/madvise.c b/mm/madvise.c
index 886f06066622..0e484111a1d2 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -287,7 +287,7 @@ static long madvise_willneed(struct vm_area_struct *vma,
 	*prev = vma;
 #ifdef CONFIG_SWAP
 	if (!file) {
-		walk_page_range(vma->vm_mm, start, end, &swapin_walk_ops, vma);
+		walk_page_range(vma->vm_mm, start, end, &swapin_walk_ops, false, vma);
 		lru_add_drain(); /* Push any new pages onto the LRU now */
 		return 0;
 	}
@@ -546,7 +546,7 @@ static void madvise_cold_page_range(struct mmu_gather *tlb,
 	};
 
 	tlb_start_vma(tlb, vma);
-	walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private);
+	walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, false, &walk_private);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -584,7 +584,7 @@ static void madvise_pageout_page_range(struct mmu_gather *tlb,
 	};
 
 	tlb_start_vma(tlb, vma);
-	walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, &walk_private);
+	walk_page_range(vma->vm_mm, addr, end, &cold_walk_ops, false, &walk_private);
 	tlb_end_vma(tlb, vma);
 }
 
@@ -786,7 +786,7 @@ static int madvise_free_single_vma(struct vm_area_struct *vma,
 	mmu_notifier_invalidate_range_start(&range);
 	tlb_start_vma(&tlb, vma);
 	walk_page_range(vma->vm_mm, range.start, range.end,
-			&madvise_free_walk_ops, &tlb);
+			&madvise_free_walk_ops, false, &tlb);
 	tlb_end_vma(&tlb, vma);
 	mmu_notifier_invalidate_range_end(&range);
 	tlb_finish_mmu(&tlb);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e8ca4bdcb03c..76aaadbd4bf9 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -6031,7 +6031,7 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
 	unsigned long precharge;
 
 	mmap_read_lock(mm);
-	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, NULL);
+	walk_page_range(mm, 0, ULONG_MAX, &precharge_walk_ops, false, NULL);
 	mmap_read_unlock(mm);
 
 	precharge = mc.precharge;
@@ -6332,7 +6332,7 @@ static void mem_cgroup_move_charge(void)
 	 * When we have consumed all precharges and failed in doing
 	 * additional charge, the page walk just aborts.
 	 */
-	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, NULL);
+	walk_page_range(mc.mm, 0, ULONG_MAX, &charge_walk_ops, false, NULL);
 	mmap_read_unlock(mc.mm);
 	atomic_dec(&mc.from->moving_account);
 }
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ece5d481b5ff..763297df9240 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -860,7 +860,7 @@ static int kill_accessing_process(struct task_struct *p, unsigned long pfn,
 
 	mmap_read_lock(p->mm);
 	ret = walk_page_range(p->mm, 0, TASK_SIZE, &hwp_walk_ops,
-			      (void *)&priv);
+			      false, (void *)&priv);
 	if (ret == 1 && priv.tk.addr)
 		kill_proc(&priv.tk, pfn, flags);
 	else
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c53f8beeb507..70ba53c70700 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -738,7 +738,7 @@ static const struct mm_walk_ops queue_pages_walk_ops = {
 static int
 queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		nodemask_t *nodes, unsigned long flags,
-		struct list_head *pagelist)
+		struct list_head *pagelist, bool lock_vma)
 {
 	int err;
 	struct queue_pages qp = {
@@ -750,7 +750,7 @@ queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
 		.first = NULL,
 	};
 
-	err = walk_page_range(mm, start, end, &queue_pages_walk_ops, &qp);
+	err = walk_page_range(mm, start, end, &queue_pages_walk_ops, lock_vma, &qp);
 
 	if (!qp.first)
 		/* whole range in hole */
@@ -1078,7 +1078,7 @@ static int migrate_to_node(struct mm_struct *mm, int source, int dest,
 	vma = find_vma(mm, 0);
 	VM_BUG_ON(!(flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)));
 	queue_pages_range(mm, vma->vm_start, mm->task_size, &nmask,
-			flags | MPOL_MF_DISCONTIG_OK, &pagelist);
+			flags | MPOL_MF_DISCONTIG_OK, &pagelist, false);
 
 	if (!list_empty(&pagelist)) {
 		err = migrate_pages(&pagelist, alloc_migration_target, NULL,
@@ -1321,12 +1321,8 @@ static long do_mbind(unsigned long start, unsigned long len,
 	 * Lock the VMAs before scanning for pages to migrate, to ensure we don't
 	 * miss a concurrently inserted page.
 	 */
-	vma_iter_init(&vmi, mm, start);
-	for_each_vma_range(vmi, vma, end)
-		vma_start_write(vma);
-
 	ret = queue_pages_range(mm, start, end, nmask,
-			  flags | MPOL_MF_INVERT, &pagelist);
+			  flags | MPOL_MF_INVERT, &pagelist, true);
 
 	if (ret < 0) {
 		err = ret;
diff --git a/mm/migrate_device.c b/mm/migrate_device.c
index 8365158460ed..1bc9937bf1fb 100644
--- a/mm/migrate_device.c
+++ b/mm/migrate_device.c
@@ -304,7 +304,7 @@ static void migrate_vma_collect(struct migrate_vma *migrate)
 	mmu_notifier_invalidate_range_start(&range);
 
 	walk_page_range(migrate->vma->vm_mm, migrate->start, migrate->end,
-			&migrate_vma_walk_ops, migrate);
+			&migrate_vma_walk_ops, false, migrate);
 
 	mmu_notifier_invalidate_range_end(&range);
 	migrate->end = migrate->start + (migrate->npages << PAGE_SHIFT);
diff --git a/mm/mincore.c b/mm/mincore.c
index b7f7a516b26c..a06288c6c126 100644
--- a/mm/mincore.c
+++ b/mm/mincore.c
@@ -198,7 +198,7 @@ static long do_mincore(unsigned long addr, unsigned long pages, unsigned char *v
 		memset(vec, 1, pages);
 		return pages;
 	}
-	err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, vec);
+	err = walk_page_range(vma->vm_mm, addr, end, &mincore_walk_ops, false, vec);
 	if (err < 0)
 		return err;
 	return (end - addr) >> PAGE_SHIFT;
diff --git a/mm/mlock.c b/mm/mlock.c
index 0a0c996c5c21..3634de0b28e3 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -389,7 +389,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 	vm_flags_reset_once(vma, newflags);
 
 	lru_add_drain();
-	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, NULL);
+	walk_page_range(vma->vm_mm, start, end, &mlock_walk_ops, true, NULL);
 	lru_add_drain();
 
 	if (newflags & VM_IO) {
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 6f658d483704..f781f709c39d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -599,7 +599,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 		pgprot_t new_pgprot = vm_get_page_prot(newflags);
 
 		error = walk_page_range(current->mm, start, end,
-				&prot_none_walk_ops, &new_pgprot);
+				&prot_none_walk_ops, true, &new_pgprot);
 		if (error)
 			return error;
 	}
diff --git a/mm/pagewalk.c b/mm/pagewalk.c
index 2022333805d3..7503885fae75 100644
--- a/mm/pagewalk.c
+++ b/mm/pagewalk.c
@@ -406,6 +406,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
  * @start:	start address of the virtual address range
  * @end:	end address of the virtual address range
  * @ops:	operation to call during the walk
+ * @lock_vma:	write-lock the vma before operating on it
  * @private:	private data for callbacks' usage
  *
  * Recursively walk the page table tree of the process represented by @mm
@@ -442,7 +443,7 @@ static int __walk_page_range(unsigned long start, unsigned long end,
  */
 int walk_page_range(struct mm_struct *mm, unsigned long start,
 		unsigned long end, const struct mm_walk_ops *ops,
-		void *private)
+		bool lock_vma, void *private)
 {
 	int err = 0;
 	unsigned long next;
@@ -474,6 +475,8 @@ int walk_page_range(struct mm_struct *mm, unsigned long start,
 			if (ops->pte_hole)
 				err = ops->pte_hole(start, next, -1, &walk);
 		} else { /* inside vma */
+			if (lock_vma)
+				vma_start_write(vma);
 			walk.vma = vma;
 			next = min(end, vma->vm_end);
 			vma = find_vma(mm, vma->vm_end);
@@ -535,7 +538,7 @@ int walk_page_range_novma(struct mm_struct *mm, unsigned long start,
 
 int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 			unsigned long end, const struct mm_walk_ops *ops,
-			void *private)
+			bool lock_vma, void *private)
 {
 	struct mm_walk walk = {
 		.ops		= ops,
@@ -550,11 +553,13 @@ int walk_page_range_vma(struct vm_area_struct *vma, unsigned long start,
 		return -EINVAL;
 
 	mmap_assert_locked(walk.mm);
+	if (lock_vma)
+		vma_start_write(vma);
 	return __walk_page_range(start, end, &walk);
 }
 
 int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
-		void *private)
+		  bool lock_vma, void *private)
 {
 	struct mm_walk walk = {
 		.ops		= ops,
@@ -567,6 +572,8 @@ int walk_page_vma(struct vm_area_struct *vma, const struct mm_walk_ops *ops,
 		return -EINVAL;
 
 	mmap_assert_locked(walk.mm);
+	if (lock_vma)
+		vma_start_write(vma);
 	return __walk_page_range(vma->vm_start, vma->vm_end, &walk);
 }
 
diff --git a/mm/vmscan.c b/mm/vmscan.c
index 1080209a568b..d85f86871fd9 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -4306,7 +4306,8 @@ static void walk_mm(struct lruvec *lruvec, struct mm_struct *mm, struct lru_gen_
 
 		/* the caller might be holding the lock for write */
 		if (mmap_read_trylock(mm)) {
-			err = walk_page_range(mm, walk->next_addr, ULONG_MAX, &mm_walk_ops, walk);
+			err = walk_page_range(mm, walk->next_addr, ULONG_MAX,
+					      &mm_walk_ops, false, walk);
 
 			mmap_read_unlock(mm);
 		}
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 1/6] mm: enable page walking API to lock vmas during the walk Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 17:14   ` kernel test robot
  2023-07-31 17:12 ` [PATCH 3/6] mm: replace mmap with vma write lock assertions when operating on a vma Suren Baghdasaryan
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan

When CONFIG_PER_VMA_LOCK=n, vma_assert_write_locked() should be equivalent
to mmap_assert_write_locked().

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 include/linux/mm.h | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 406ab9ea818f..262b5f44101d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -750,7 +750,8 @@ static inline void vma_end_read(struct vm_area_struct *vma) {}
 static inline void vma_start_write(struct vm_area_struct *vma) {}
 static inline bool vma_try_start_write(struct vm_area_struct *vma)
 		{ return true; }
-static inline void vma_assert_write_locked(struct vm_area_struct *vma) {}
+static inline void vma_assert_write_locked(struct vm_area_struct *vma)
+		{ mmap_assert_write_locked(vma->vm_mm); }
 static inline void vma_mark_detached(struct vm_area_struct *vma,
 				     bool detached) {}
 
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 3/6] mm: replace mmap with vma write lock assertions when operating on a vma
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 1/6] mm: enable page walking API to lock vmas during the walk Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once Suren Baghdasaryan
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan

A vma write lock assertion always includes the mmap write lock assertion,
plus additional vma lock checks when per-VMA locks are enabled. Replace
the weaker mmap_assert_write_locked() assertions with the stronger
vma_assert_write_locked() ones when we are operating on a vma which
is expected to be locked.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/hugetlb.c    | 2 +-
 mm/khugepaged.c | 5 +++--
 mm/memory.c     | 2 +-
 3 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 64a3239b6407..1d871a1167d8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5028,7 +5028,7 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 					src_vma->vm_start,
 					src_vma->vm_end);
 		mmu_notifier_invalidate_range_start(&range);
-		mmap_assert_write_locked(src);
+		vma_assert_write_locked(src_vma);
 		raw_write_seqcount_begin(&src->write_protect_seq);
 	} else {
 		/*
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..1e43a56fba31 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1495,7 +1495,7 @@ static int set_huge_pmd(struct vm_area_struct *vma, unsigned long addr,
 	};
 
 	VM_BUG_ON(!PageTransHuge(hpage));
-	mmap_assert_write_locked(vma->vm_mm);
+	vma_assert_write_locked(vma);
 
 	if (do_set_pmd(&vmf, hpage))
 		return SCAN_FAIL;
@@ -1525,7 +1525,7 @@ static void collapse_and_free_pmd(struct mm_struct *mm, struct vm_area_struct *v
 	pmd_t pmd;
 	struct mmu_notifier_range range;
 
-	mmap_assert_write_locked(mm);
+	vma_assert_write_locked(vma);
 	if (vma->vm_file)
 		lockdep_assert_held_write(&vma->vm_file->f_mapping->i_mmap_rwsem);
 	/*
@@ -1570,6 +1570,7 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr,
 	int count = 0, result = SCAN_FAIL;
 	int i;
 
+	/* Ensure vma can't change, it will be locked below after checks */
 	mmap_assert_write_locked(mm);
 
 	/* Fast check before locking page if already PMD-mapped */
diff --git a/mm/memory.c b/mm/memory.c
index 603b2f419948..652d99b9858a 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1312,7 +1312,7 @@ copy_page_range(struct vm_area_struct *dst_vma, struct vm_area_struct *src_vma)
 		 * Use the raw variant of the seqcount_t write API to avoid
 		 * lockdep complaining about preemptibility.
 		 */
-		mmap_assert_write_locked(src_mm);
+		vma_assert_write_locked(src_vma);
 		raw_write_seqcount_begin(&src_mm->write_protect_seq);
 	}
 
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
                   ` (2 preceding siblings ...)
  2023-07-31 17:12 ` [PATCH 3/6] mm: replace mmap with vma write lock assertions when operating on a vma Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 5/6] mm: always lock new vma before inserting into vma tree Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 6/6] mm: move vma locking out of vma_prepare Suren Baghdasaryan
  5 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan,
	Linus Torvalds

The implicit vma locking inside vm_flags_reset() and vm_flags_reset_once()
is not obvious and makes it hard to understand where vma locking is
happening. Also, in some cases (like in dup_userfaultfd()) the vma should
be locked earlier than the vm_flags modification. To make the locking more
visible, change these functions to assert that the vma write lock is taken
and lock the vma explicitly beforehand. Fix the userfaultfd functions which
should lock the vma earlier.
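
The resulting convention, sketched on a generic caller (the diffs below
are authoritative):

	/* the caller write-locks the vma; vm_flags_reset() only asserts */
	vma_start_write(vma);
	vm_flags_reset(vma, new_flags);

	/* the userfaultfd paths lock the vma before touching the context */
	vma_start_write(vma);
	vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
	userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);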

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 arch/powerpc/kvm/book3s_hv_uvmem.c    |  1 +
 drivers/infiniband/hw/hfi1/file_ops.c |  1 +
 fs/userfaultfd.c                      |  6 ++++++
 include/linux/mm.h                    | 10 +++++++---
 mm/madvise.c                          |  5 ++---
 mm/mlock.c                            |  3 ++-
 mm/mprotect.c                         |  1 +
 7 files changed, 20 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/kvm/book3s_hv_uvmem.c b/arch/powerpc/kvm/book3s_hv_uvmem.c
index 709ebd578394..e2d6f9327f77 100644
--- a/arch/powerpc/kvm/book3s_hv_uvmem.c
+++ b/arch/powerpc/kvm/book3s_hv_uvmem.c
@@ -410,6 +410,7 @@ static int kvmppc_memslot_page_merge(struct kvm *kvm,
 			ret = H_STATE;
 			break;
 		}
+		vma_start_write(vma);
 		/* Copy vm_flags to avoid partial modifications in ksm_madvise */
 		vm_flags = vma->vm_flags;
 		ret = ksm_madvise(vma, vma->vm_start, vma->vm_end,
diff --git a/drivers/infiniband/hw/hfi1/file_ops.c b/drivers/infiniband/hw/hfi1/file_ops.c
index a5ab22cedd41..5920bfc1e1c5 100644
--- a/drivers/infiniband/hw/hfi1/file_ops.c
+++ b/drivers/infiniband/hw/hfi1/file_ops.c
@@ -344,6 +344,7 @@ static int hfi1_file_mmap(struct file *fp, struct vm_area_struct *vma)
 		goto done;
 	}
 
+	vma_start_write(vma);
 	/*
 	 * vm_pgoff is used as a buffer selector cookie.  Always mmap from
 	 * the beginning.
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 7cecd49e078b..6cde95533dcd 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -667,6 +667,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
 		mmap_write_lock(mm);
 		for_each_vma(vmi, vma) {
 			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
+				vma_start_write(vma);
 				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 				userfaultfd_set_vm_flags(vma,
 							 vma->vm_flags & ~__VM_UFFD_FLAGS);
@@ -702,6 +703,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
 
 	octx = vma->vm_userfaultfd_ctx.ctx;
 	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
+		vma_start_write(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
 		return 0;
@@ -783,6 +785,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
 		atomic_inc(&ctx->mmap_changing);
 	} else {
 		/* Drop uffd context if remap feature not enabled */
+		vma_start_write(vma);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 		userfaultfd_set_vm_flags(vma, vma->vm_flags & ~__VM_UFFD_FLAGS);
 	}
@@ -940,6 +943,7 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
 			prev = vma;
 		}
 
+		vma_start_write(vma);
 		userfaultfd_set_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 	}
@@ -1502,6 +1506,7 @@ static int userfaultfd_register(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
+		vma_start_write(vma);
 		userfaultfd_set_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx.ctx = ctx;
 
@@ -1685,6 +1690,7 @@ static int userfaultfd_unregister(struct userfaultfd_ctx *ctx,
 		 * the next vma was merged into the current one and
 		 * the current one has not been updated yet.
 		 */
+		vma_start_write(vma);
 		userfaultfd_set_vm_flags(vma, new_flags);
 		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 262b5f44101d..2c720c9bb1ae 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -780,18 +780,22 @@ static inline void vm_flags_init(struct vm_area_struct *vma,
 	ACCESS_PRIVATE(vma, __vm_flags) = flags;
 }
 
-/* Use when VMA is part of the VMA tree and modifications need coordination */
+/*
+ * Use when VMA is part of the VMA tree and modifications need coordination
+ * Note: vm_flags_reset and vm_flags_reset_once do not lock the vma and
+ * it should be locked explicitly beforehand.
+ */
 static inline void vm_flags_reset(struct vm_area_struct *vma,
 				  vm_flags_t flags)
 {
-	vma_start_write(vma);
+	vma_assert_write_locked(vma);
 	vm_flags_init(vma, flags);
 }
 
 static inline void vm_flags_reset_once(struct vm_area_struct *vma,
 				       vm_flags_t flags)
 {
-	vma_start_write(vma);
+	vma_assert_write_locked(vma);
 	WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags);
 }
 
diff --git a/mm/madvise.c b/mm/madvise.c
index 0e484111a1d2..54628f4ca217 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -173,9 +173,8 @@ static int madvise_update_vma(struct vm_area_struct *vma,
 	}
 
 success:
-	/*
-	 * vm_flags is protected by the mmap_lock held in write mode.
-	 */
+	/* vm_flags is protected by the mmap_lock held in write mode. */
+	vma_start_write(vma);
 	vm_flags_reset(vma, new_flags);
 	if (!vma->vm_file || vma_is_anon_shmem(vma)) {
 		error = replace_anon_vma_name(vma, anon_name);
diff --git a/mm/mlock.c b/mm/mlock.c
index 3634de0b28e3..f0f5125188ba 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -386,6 +386,7 @@ static void mlock_vma_pages_range(struct vm_area_struct *vma,
 	 */
 	if (newflags & VM_LOCKED)
 		newflags |= VM_IO;
+	vma_start_write(vma);
 	vm_flags_reset_once(vma, newflags);
 
 	lru_add_drain();
@@ -460,9 +461,9 @@ static int mlock_fixup(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	 * It's okay if try_to_unmap_one unmaps a page just after we
 	 * set VM_LOCKED, populate_vma_page_range will bring it back.
 	 */
-
 	if ((newflags & VM_LOCKED) && (oldflags & VM_LOCKED)) {
 		/* No work to do, and mlocking twice would be wrong */
+		vma_start_write(vma);
 		vm_flags_reset(vma, newflags);
 	} else {
 		mlock_vma_pages_range(vma, start, end, newflags);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f781f709c39d..0eab019914db 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -656,6 +656,7 @@ mprotect_fixup(struct vma_iterator *vmi, struct mmu_gather *tlb,
 	 * vm_flags and vm_page_prot are protected by the mmap_lock
 	 * held in write mode.
 	 */
+	vma_start_write(vma);
 	vm_flags_reset(vma, newflags);
 	if (vma_wants_manual_pte_write_upgrade(vma))
 		mm_cp_flags |= MM_CP_TRY_CHANGE_WRITABLE;
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 5/6] mm: always lock new vma before inserting into vma tree
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
                   ` (3 preceding siblings ...)
  2023-07-31 17:12 ` [PATCH 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 17:12 ` [PATCH 6/6] mm: move vma locking out of vma_prepare Suren Baghdasaryan
  5 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan

While it's not strictly necessary to lock a newly created vma before
adding it into the vma tree (as long as no further changes are performed
to it), it seems like a good policy to lock it and prevent accidental
changes after it becomes visible to page faults. Lock the vma before
adding it into the vma tree.

Suggested-by: Jann Horn <jannh@google.com>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 3937479d0e07..850a39dee075 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -412,6 +412,8 @@ static int vma_link(struct mm_struct *mm, struct vm_area_struct *vma)
 	if (vma_iter_prealloc(&vmi))
 		return -ENOMEM;
 
+	vma_start_write(vma);
+
 	if (vma->vm_file) {
 		mapping = vma->vm_file->f_mapping;
 		i_mmap_lock_write(mapping);
@@ -477,7 +479,8 @@ static inline void vma_prepare(struct vma_prepare *vp)
 	vma_start_write(vp->vma);
 	if (vp->adj_next)
 		vma_start_write(vp->adj_next);
-	/* vp->insert is always a newly created VMA, no need for locking */
+	if (vp->insert)
+		vma_start_write(vp->insert);
 	if (vp->remove)
 		vma_start_write(vp->remove);
 	if (vp->remove2)
@@ -3098,6 +3101,7 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	vma->vm_pgoff = addr >> PAGE_SHIFT;
 	vm_flags_init(vma, flags);
 	vma->vm_page_prot = vm_get_page_prot(flags);
+	vma_start_write(vma);
 	if (vma_iter_store_gfp(vmi, vma, GFP_KERNEL))
 		goto mas_store_fail;
 
@@ -3345,7 +3349,6 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap,
 			get_file(new_vma->vm_file);
 		if (new_vma->vm_ops && new_vma->vm_ops->open)
 			new_vma->vm_ops->open(new_vma);
-		vma_start_write(new_vma);
 		if (vma_link(mm, new_vma))
 			goto out_vma_link;
 		*need_rmap_locks = false;
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH 6/6] mm: move vma locking out of vma_prepare
  2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
                   ` (4 preceding siblings ...)
  2023-07-31 17:12 ` [PATCH 5/6] mm: always lock new vma before inserting into vma tree Suren Baghdasaryan
@ 2023-07-31 17:12 ` Suren Baghdasaryan
  2023-07-31 20:30   ` Liam R. Howlett
  5 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 17:12 UTC (permalink / raw)
  To: akpm
  Cc: torvalds, jannh, willy, liam.howlett, david, peterx, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable, Suren Baghdasaryan,
	Linus Torvalds

vma_prepare() is currently the central place where vmas are being locked
before vma_complete() applies changes to them. While this is convenient,
it also obscures vma locking and makes it hard to follow the locking rules.
Move vma locking out of vma_prepare() and take vma locks explicitly at the
locations where vmas are being modified.

Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
---
 mm/mmap.c | 26 ++++++++++++++++----------
 1 file changed, 16 insertions(+), 10 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 850a39dee075..e59d83cb1d7a 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
  */
 static inline void vma_prepare(struct vma_prepare *vp)
 {
-	vma_start_write(vp->vma);
-	if (vp->adj_next)
-		vma_start_write(vp->adj_next);
-	if (vp->insert)
-		vma_start_write(vp->insert);
-	if (vp->remove)
-		vma_start_write(vp->remove);
-	if (vp->remove2)
-		vma_start_write(vp->remove2);
-
 	if (vp->file) {
 		uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
 
@@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	bool remove_next = false;
 	struct vma_prepare vp;
 
+	vma_start_write(vma);
 	if (next && (vma != next) && (end == next->vm_end)) {
 		int ret;
 
@@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		ret = dup_anon_vma(vma, next);
 		if (ret)
 			return ret;
+		vma_start_write(next);
 	}
 
 	init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
@@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (vma_iter_prealloc(vmi))
 		return -ENOMEM;
 
+	vma_start_write(vma);
+
 	init_vma_prep(&vp, vma);
 	vma_prepare(&vp);
 	vma_adjust_trans_huge(vma, start, end, 0);
@@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 	/* Can we merge both the predecessor and the successor? */
 	if (merge_prev && merge_next &&
 	    is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
+		vma_start_write(next);
 		remove = next;				/* case 1 */
 		vma_end = next->vm_end;
 		err = dup_anon_vma(prev, next);
 		if (curr) {				/* case 6 */
+			vma_start_write(curr);
 			remove = curr;
 			remove2 = next;
 			if (!next->anon_vma)
@@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 	} else if (merge_prev) {			/* case 2 */
 		if (curr) {
 			err = dup_anon_vma(prev, curr);
+			vma_start_write(curr);
 			if (end == curr->vm_end) {	/* case 7 */
 				remove = curr;
 			} else {			/* case 5 */
@@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 		res = next;
 		if (prev && addr < prev->vm_end) {	/* case 4 */
 			vma_end = addr;
+			vma_start_write(next);
 			adjust = next;
 			adj_start = -(prev->vm_end - addr);
 			err = dup_anon_vma(next, prev);
@@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 			vma_pgoff = next->vm_pgoff - pglen;
 			if (curr) {			/* case 8 */
 				vma_pgoff = curr->vm_pgoff;
+				vma_start_write(curr);
 				remove = curr;
 				err = dup_anon_vma(next, curr);
 			}
@@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
 	if (vma_iter_prealloc(vmi))
 		return NULL;
 
+	vma_start_write(vma);
+
 	init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
 	VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
 		   vp.anon_vma != adjust->anon_vma);
@@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
 	if (new->vm_ops && new->vm_ops->open)
 		new->vm_ops->open(new);
 
+	vma_start_write(vma);
+	vma_start_write(new);
+
 	init_vma_prep(&vp, vma);
 	vp.insert = new;
 	vma_prepare(&vp);
@@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
 		if (vma_iter_prealloc(vmi))
 			goto unacct_fail;
 
+		vma_start_write(vma);
+
 		init_vma_prep(&vp, vma);
 		vma_prepare(&vp);
 		vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
-- 
2.41.0.487.g6d72f3e995-goog


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap
  2023-07-31 17:12 ` [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap Suren Baghdasaryan
@ 2023-07-31 17:14   ` kernel test robot
  0 siblings, 0 replies; 18+ messages in thread
From: kernel test robot @ 2023-07-31 17:14 UTC (permalink / raw)
  To: Suren Baghdasaryan; +Cc: stable, oe-kbuild-all

Hi,

Thanks for your patch.

FYI: the kernel test robot notices that the stable kernel rule is not satisfied.

Rule: 'Cc: stable@vger.kernel.org' or 'commit <sha1> upstream.'
Subject: [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap
Link: https://lore.kernel.org/stable/20230731171233.1098105-3-surenb%40google.com

The check is based on https://www.kernel.org/doc/html/latest/process/stable-kernel-rules.html

-- 
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki




^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 17:12 ` [PATCH 1/6] mm: enable page walking API to lock vmas during the walk Suren Baghdasaryan
@ 2023-07-31 18:02   ` Linus Torvalds
  2023-07-31 19:30     ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2023-07-31 18:02 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, jannh, willy, liam.howlett, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable

On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan <surenb@google.com> wrote:
>
> -               walk_page_vma(vma, &subpage_walk_ops, NULL);
> +               walk_page_vma(vma, &subpage_walk_ops, true, NULL);

Rather than add a new argument to the walk_page_*() functions, I
really think you should just add the locking rule to the 'const struct
mm_walk_ops' structure.

The locking rule goes along with the rules for what you are actually
doing, after all. Plus it would actually make it all much more legible
when it's not just some random "true/false" argument, but an actual

        .write_lock = 1

in the ops definition.

Yes, yes, that might mean that some ops might need duplicating in case
you really have a walk that sometimes takes the lock, and sometimes
doesn't, but that is odd to begin with.

The only such case I found from a quick look was the very strange
queue_pages_range() case. Is it really true that do_mbind() needs the
write-lock, but do_migrate_pages() does not?

And if they really are that different maybe they should have different walk_ops?

Maybe there were other cases that I didn't notice.

>                 error = walk_page_range(current->mm, start, end,
> -                               &prot_none_walk_ops, &new_pgprot);
> +                               &prot_none_walk_ops, true, &new_pgprot);

This looks odd. You're adding vma locking to a place that didn't do it before.

Yes, the mmap semaphore is held for writing, but this particular walk
doesn't need it as far as I can tell.

In fact, this feels like that walker should maybe *verify* that it's
held for writing, but not try to write it again?

Maybe the "lock_vma" flag should be a tri-state:

 - lock for reading (no-op per vma), verify that the mmap sem is held
for reading

 - lock for reading (no-op per vma), but with mmap sem held for
writing (this kind of "check before doing changes" walker)

 - lock for writing (the mmap sem obviously needs to be held for writing)

>         mmap_assert_locked(walk.mm);
> +       if (lock_vma)
> +               vma_start_write(vma);

So I think this should also be tightened up, and something like

        switch (ops->locking) {
        case WRLOCK:
                vma_start_write(vma);
                fallthrough;
        case WRLOCK_VERIFY:
                mmap_assert_write_locked(mm);
                break;
        case RDLOCK:
                mmap_assert_locked(walk.mm);
        }

because we shouldn't have a 'vma_start_write()' without holding the
mmap sem for *writing*, and the above would also allow that
mprotect_fixup() "walk to see if we can merge, verify that it was
already locked" thing.

Hmm?

NOTE! The above names are just completely made up. I don't think it
should actually be some "WRLOCK" enum. There are probably much better
names. Take the above as a "maybe something kind of in this direction"
rather than "do it exactly like this".

            Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 18:02   ` Linus Torvalds
@ 2023-07-31 19:30     ` Suren Baghdasaryan
  2023-07-31 19:33       ` Linus Torvalds
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 19:30 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, jannh, willy, liam.howlett, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable

On Mon, Jul 31, 2023 at 11:02 AM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 31 Jul 2023 at 10:12, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > -               walk_page_vma(vma, &subpage_walk_ops, NULL);
> > +               walk_page_vma(vma, &subpage_walk_ops, true, NULL);
>
> Rather than add a new argument to the walk_page_*() functions, I
> really think you should just add the locking rule to the 'const struct
> mm_walk_ops' structure.
>
> The locking rule goes along with the rules for what you are actually
> doing, after all. Plus it would actually make it all much more legible
> when it's not just some random "true/false" argument, but a an actual
>
>         .write_lock = 1
>
> in the ops definition.

Yeah, I was thinking about that but thought a flag like this in a pure
"ops" struct would be frowned upon. If this is acceptable then it
makes it much cleaner.

>
> Yes, yes, that might mean that some ops might need duplicating in case
> you really have a walk that sometimes takes the lock, and sometimes
> doesn't, but that is odd to begin with.
>
> The only such case I found from a quick look was the very strange
> queue_pages_range() case. Is it really true that do_mbind() needs the
> write-lock, but do_migrate_pages() does not?
>
> And if they really are that different maybe they should have different walk_ops?

Makes sense to me.

>
> Maybe there were other cases that I didn't notice.
>
> >                 error = walk_page_range(current->mm, start, end,
> > -                               &prot_none_walk_ops, &new_pgprot);
> > +                               &prot_none_walk_ops, true, &new_pgprot);
>
> This looks odd. You're adding vma locking to a place that didn't do it before.
>
> Yes, the mmap semaphore is held for writing, but this particular walk
> doesn't need it as far as I can tell.

Yes you are correct. Locking a vma in this case seems unnecessary.

>
> In fact, this feels like that walker should maybe *verify* that it's
> held for writing, but not try to write it again?

In this particular case, does this walk even require the vma to be
write locked? Looks like it's simply checking the ptes. And if so,
walk_page_range() already has mmap_assert_locked(walk.mm) at the
beginning to ensure the tree is stable. Do we need anything else here?

>
> Maybe the "lock_vma" flag should be a tri-state:
>
>  - lock for reading (no-op per vma), verify that the mmap sem is held
> for reading
>
>  - lock for reading (no-op per vma), but with mmap sem held for
> writing (this kind of "check before doing changes" walker)
>
>  - lock for writing (with mmap sem obviously needs to be held for writing)
>
> >         mmap_assert_locked(walk.mm);
> > +       if (lock_vma)
> > +               vma_start_write(vma);
>
> So I think this should also be tightened up, and something like
>
>         switch (ops->locking) {
>         case WRLOCK:
>                 vma_start_write(vma);
>                 fallthrough;
>         case WRLOCK_VERIFY:
>                 mmap_assert_write_locked(mm);
>                 break;
>         case RDLOCK:
>                 mmap_assert_locked(walk.mm);
>         }

I got the idea but a couple of modifications, if I may.
walk_page_range() already does mmap_assert_locked() at the beginning,
so we can change it to:

if (ops->locking == RDLOCK)
        mmap_assert_locked(walk.mm);
else
        mmap_assert_write_locked(mm);

and during the walk:

        switch (ops->locking) {
        case WRLOCK:
                 vma_start_write(vma);
                 break;
#ifdef CONFIG_PER_VMA_LOCK
        case WRLOCK_VERIFY:
                 vma_assert_write_locked(vma);
                 break;
#endif
         }

The above CONFIG_PER_VMA_LOCK #ifdef is ugly, but with !CONFIG_PER_VMA_LOCK
vma_assert_write_locked() becomes mmap_assert_write_locked(), and we
already checked that, so it's unnecessary.

>
> because we shouldn't have a 'vma_start_write()' without holding the
> mmap sem for *writing*, and the above would also allow that
> mprotect_fixup() "walk to see if we can merge, verify that it was
> already locked" thing.
>
> Hmm?
>
> NOTE! The above names are just completely made up. I dcon't think it
> should actually be some "WRLOCK" enum. There are probably much better
> names. Take the above as a "maybe something kind of in this direction"
> rather than "do it exactly like this".

I'm not great with names... Maybe just add a PGWALK_ prefix like this:

PGWALK_RDLOCK
PGWALK_WRLOCK
PGWALK_WRLOCK_VERIFY

?
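
For concreteness, the ops side might then look roughly like this (the
struct field name and the example ops/callback names below are only
illustrative):

	enum page_walk_lock {
		PGWALK_RDLOCK,
		PGWALK_WRLOCK,
		PGWALK_WRLOCK_VERIFY,
	};

	struct mm_walk_ops {
		/* ... existing callbacks ... */
		enum page_walk_lock walk_lock;
	};

	static const struct mm_walk_ops example_fixup_ops = {
		.pmd_entry	= example_fixup_pmd,
		.walk_lock	= PGWALK_WRLOCK,
	};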

>
>             Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 19:30     ` Suren Baghdasaryan
@ 2023-07-31 19:33       ` Linus Torvalds
  2023-07-31 20:24         ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Linus Torvalds @ 2023-07-31 19:33 UTC (permalink / raw)
  To: Suren Baghdasaryan
  Cc: akpm, jannh, willy, liam.howlett, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable

On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan <surenb@google.com> wrote:
>
> I got the idea but a couple of modifications, if I may.

Ack, sounds sane to me.

             Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 19:33       ` Linus Torvalds
@ 2023-07-31 20:24         ` Suren Baghdasaryan
  2023-08-01 20:28           ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 20:24 UTC (permalink / raw)
  To: Linus Torvalds
  Cc: akpm, jannh, willy, liam.howlett, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable

On Mon, Jul 31, 2023 at 12:33 PM Linus Torvalds
<torvalds@linux-foundation.org> wrote:
>
> On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan <surenb@google.com> wrote:
> >
> > I got the idea but a couple of modifications, if I may.
>
> Ack, sounds sane to me.

Ok, I'll wait for more feedback today and will post an update tomorrow. Thanks!

>
>              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/6] mm: move vma locking out of vma_prepare
  2023-07-31 17:12 ` [PATCH 6/6] mm: move vma locking out of vma_prepare Suren Baghdasaryan
@ 2023-07-31 20:30   ` Liam R. Howlett
  2023-07-31 20:38     ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Liam R. Howlett @ 2023-07-31 20:30 UTC (permalink / raw
  To: Suren Baghdasaryan
  Cc: akpm, torvalds, jannh, willy, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable, Lorenzo Stoakes, Linus Torvalds

Adding Lorenzo since this also touches vma_merge() again..

* Suren Baghdasaryan <surenb@google.com> [230731 13:12]:
> vma_prepare() is currently the central place where vmas are being locked
> before vma_complete() applies changes to them. While this is convenient,
> it also obscures vma locking and makes it hard to follow the locking rules.
> Move vma locking out of vma_prepare() and take vma locks explicitly at the
> locations where vmas are being modified.

I get the idea of locking closer to the edits, but vma_merge() is very
hard to follow.  It was worse when it was two functions and much larger,
but adding this into vma_merge() is difficult to validate.

We still set vma = <another vma> in places, which adds to the difficulty
of ensuring that, in the end, all VMAs that will be modified/removed have
been locked... and the 'locking rules' are also obscured.

It's also annoying that this doesn't fully allow you to follow the
locking anyways.  dup_anon_vma() still locks internally, with good
reason, but it's still not clear that the VMA is locked when looking at
this.

That being said, I did go through each case and it looks like it locks
the correct VMA(s) to me.

Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>

> 
> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> ---
>  mm/mmap.c | 26 ++++++++++++++++----------
>  1 file changed, 16 insertions(+), 10 deletions(-)
> 
> diff --git a/mm/mmap.c b/mm/mmap.c
> index 850a39dee075..e59d83cb1d7a 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
>   */
>  static inline void vma_prepare(struct vma_prepare *vp)
>  {
> -	vma_start_write(vp->vma);
> -	if (vp->adj_next)
> -		vma_start_write(vp->adj_next);
> -	if (vp->insert)
> -		vma_start_write(vp->insert);
> -	if (vp->remove)
> -		vma_start_write(vp->remove);
> -	if (vp->remove2)
> -		vma_start_write(vp->remove2);
> -
>  	if (vp->file) {
>  		uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
>  
> @@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	bool remove_next = false;
>  	struct vma_prepare vp;
>  
> +	vma_start_write(vma);
>  	if (next && (vma != next) && (end == next->vm_end)) {
>  		int ret;
>  
> @@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		ret = dup_anon_vma(vma, next);
>  		if (ret)
>  			return ret;
> +		vma_start_write(next);
>  	}
>  
>  	init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (vma_iter_prealloc(vmi))
>  		return -ENOMEM;
>  
> +	vma_start_write(vma);
> +
>  	init_vma_prep(&vp, vma);
>  	vma_prepare(&vp);
>  	vma_adjust_trans_huge(vma, start, end, 0);
> @@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
>  	/* Can we merge both the predecessor and the successor? */
>  	if (merge_prev && merge_next &&
>  	    is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
> +		vma_start_write(next);
>  		remove = next;				/* case 1 */
>  		vma_end = next->vm_end;
>  		err = dup_anon_vma(prev, next);
>  		if (curr) {				/* case 6 */
> +			vma_start_write(curr);
>  			remove = curr;
>  			remove2 = next;
>  			if (!next->anon_vma)
> @@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
>  	} else if (merge_prev) {			/* case 2 */
>  		if (curr) {
>  			err = dup_anon_vma(prev, curr);
> +			vma_start_write(curr);
>  			if (end == curr->vm_end) {	/* case 7 */
>  				remove = curr;
>  			} else {			/* case 5 */
> @@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
>  		res = next;
>  		if (prev && addr < prev->vm_end) {	/* case 4 */
>  			vma_end = addr;
> +			vma_start_write(next);
>  			adjust = next;
>  			adj_start = -(prev->vm_end - addr);
>  			err = dup_anon_vma(next, prev);
> @@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
>  			vma_pgoff = next->vm_pgoff - pglen;
>  			if (curr) {			/* case 8 */
>  				vma_pgoff = curr->vm_pgoff;
> +				vma_start_write(curr);
>  				remove = curr;
>  				err = dup_anon_vma(next, curr);
>  			}
> @@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
>  	if (vma_iter_prealloc(vmi))
>  		return NULL;
>  
> +	vma_start_write(vma);
> +
>  	init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
>  	VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
>  		   vp.anon_vma != adjust->anon_vma);
> @@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  	if (new->vm_ops && new->vm_ops->open)
>  		new->vm_ops->open(new);
>  
> +	vma_start_write(vma);
> +	vma_start_write(new);
> +
>  	init_vma_prep(&vp, vma);
>  	vp.insert = new;
>  	vma_prepare(&vp);
> @@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
>  		if (vma_iter_prealloc(vmi))
>  			goto unacct_fail;
>  
> +		vma_start_write(vma);
> +
>  		init_vma_prep(&vp, vma);
>  		vma_prepare(&vp);
>  		vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
> -- 
> 2.41.0.487.g6d72f3e995-goog
> 

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 6/6] mm: move vma locking out of vma_prepare
  2023-07-31 20:30   ` Liam R. Howlett
@ 2023-07-31 20:38     ` Suren Baghdasaryan
  0 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-07-31 20:38 UTC (permalink / raw
  To: Liam R. Howlett, Suren Baghdasaryan, akpm, torvalds, jannh, willy,
	david, peterx, ldufour, vbabka, michel, jglisse, mhocko, hannes,
	dave, hughd, linux-kernel, linux-mm, stable, Lorenzo Stoakes,
	Linus Torvalds

On Mon, Jul 31, 2023 at 1:30 PM Liam R. Howlett <Liam.Howlett@oracle.com> wrote:
>
> Adding Lorenzo since this also touches vma_merge() again..
>
> * Suren Baghdasaryan <surenb@google.com> [230731 13:12]:
> > vma_prepare() is currently the central place where vmas are being locked
> > before vma_complete() applies changes to them. While this is convenient,
> > it also obscures vma locking and makes it hard to follow the locking rules.
> > Move vma locking out of vma_prepare() and take vma locks explicitly at the
> > locations where vmas are being modified.
>
> I get the idea of locking closer to the edits, but vma_merge() is very
> hard to follow.  It was worse when it was two functions and much larger,
> but adding this into vma_merge() is difficult to validate.
>
> We still set vma = <another vma> in places, which adds to the difficulty
> of ensuring that, in the end, all VMAs that will be modified/removed have
> been locked... and the 'locking rules' are also obscured.
>
> It's also annoying that this doesn't fully allow you to follow the
> locking anyways.  dup_anon_vma() still locks internally, with good
> reason, but it's still not clear that the VMA is locked when looking at
> this.
>
> That being said, I did go through each case and it looks like it locks
> the correct VMA(s) to me.

Thanks!
Yeah, it took me some time to ensure the locking there is correct. If
you see a better alternative to make the locking more obvious I'm open
to suggestions. I accept that locking in vma_merge() is not easy to
follow.

>
> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
>
> >
> > Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
> > Signed-off-by: Suren Baghdasaryan <surenb@google.com>
> > ---
> >  mm/mmap.c | 26 ++++++++++++++++----------
> >  1 file changed, 16 insertions(+), 10 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index 850a39dee075..e59d83cb1d7a 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -476,16 +476,6 @@ static inline void init_vma_prep(struct vma_prepare *vp,
> >   */
> >  static inline void vma_prepare(struct vma_prepare *vp)
> >  {
> > -     vma_start_write(vp->vma);
> > -     if (vp->adj_next)
> > -             vma_start_write(vp->adj_next);
> > -     if (vp->insert)
> > -             vma_start_write(vp->insert);
> > -     if (vp->remove)
> > -             vma_start_write(vp->remove);
> > -     if (vp->remove2)
> > -             vma_start_write(vp->remove2);
> > -
> >       if (vp->file) {
> >               uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end);
> >
> > @@ -650,6 +640,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       bool remove_next = false;
> >       struct vma_prepare vp;
> >
> > +     vma_start_write(vma);
> >       if (next && (vma != next) && (end == next->vm_end)) {
> >               int ret;
> >
> > @@ -657,6 +648,7 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >               ret = dup_anon_vma(vma, next);
> >               if (ret)
> >                       return ret;
> > +             vma_start_write(next);
> >       }
> >
> >       init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL);
> > @@ -708,6 +700,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (vma_iter_prealloc(vmi))
> >               return -ENOMEM;
> >
> > +     vma_start_write(vma);
> > +
> >       init_vma_prep(&vp, vma);
> >       vma_prepare(&vp);
> >       vma_adjust_trans_huge(vma, start, end, 0);
> > @@ -946,10 +940,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> >       /* Can we merge both the predecessor and the successor? */
> >       if (merge_prev && merge_next &&
> >           is_mergeable_anon_vma(prev->anon_vma, next->anon_vma, NULL)) {
> > +             vma_start_write(next);
> >               remove = next;                          /* case 1 */
> >               vma_end = next->vm_end;
> >               err = dup_anon_vma(prev, next);
> >               if (curr) {                             /* case 6 */
> > +                     vma_start_write(curr);
> >                       remove = curr;
> >                       remove2 = next;
> >                       if (!next->anon_vma)
> > @@ -958,6 +954,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> >       } else if (merge_prev) {                        /* case 2 */
> >               if (curr) {
> >                       err = dup_anon_vma(prev, curr);
> > +                     vma_start_write(curr);
> >                       if (end == curr->vm_end) {      /* case 7 */
> >                               remove = curr;
> >                       } else {                        /* case 5 */
> > @@ -969,6 +966,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> >               res = next;
> >               if (prev && addr < prev->vm_end) {      /* case 4 */
> >                       vma_end = addr;
> > +                     vma_start_write(next);
> >                       adjust = next;
> >                       adj_start = -(prev->vm_end - addr);
> >                       err = dup_anon_vma(next, prev);
> > @@ -983,6 +981,7 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> >                       vma_pgoff = next->vm_pgoff - pglen;
> >                       if (curr) {                     /* case 8 */
> >                               vma_pgoff = curr->vm_pgoff;
> > +                             vma_start_write(curr);
> >                               remove = curr;
> >                               err = dup_anon_vma(next, curr);
> >                       }
> > @@ -996,6 +995,8 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm,
> >       if (vma_iter_prealloc(vmi))
> >               return NULL;
> >
> > +     vma_start_write(vma);
> > +
> >       init_multi_vma_prep(&vp, vma, adjust, remove, remove2);
> >       VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma &&
> >                  vp.anon_vma != adjust->anon_vma);
> > @@ -2373,6 +2374,9 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >       if (new->vm_ops && new->vm_ops->open)
> >               new->vm_ops->open(new);
> >
> > +     vma_start_write(vma);
> > +     vma_start_write(new);
> > +
> >       init_vma_prep(&vp, vma);
> >       vp.insert = new;
> >       vma_prepare(&vp);
> > @@ -3078,6 +3082,8 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma,
> >               if (vma_iter_prealloc(vmi))
> >                       goto unacct_fail;
> >
> > +             vma_start_write(vma);
> > +
> >               init_vma_prep(&vp, vma);
> >               vma_prepare(&vp);
> >               vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0);
> > --
> > 2.41.0.487.g6d72f3e995-goog
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-07-31 20:24         ` Suren Baghdasaryan
@ 2023-08-01 20:28           ` Suren Baghdasaryan
  2023-08-01 21:34             ` Peter Xu
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-08-01 20:28 UTC (permalink / raw
  To: Linus Torvalds
  Cc: akpm, jannh, willy, liam.howlett, david, peterx, ldufour, vbabka,
	michel, jglisse, mhocko, hannes, dave, hughd, linux-kernel,
	linux-mm, stable

On Mon, Jul 31, 2023 at 1:24 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Mon, Jul 31, 2023 at 12:33 PM Linus Torvalds
> <torvalds@linux-foundation.org> wrote:
> >
> > On Mon, 31 Jul 2023 at 12:31, Suren Baghdasaryan <surenb@google.com> wrote:
> > >
> > > I got the idea but a couple of modifications, if I may.
> >
> > Ack, sounds sane to me.
>
> Ok, I'll wait for more feedback today and will post an update tomorrow. Thanks!

I have the new patchset ready but I see 3 places where we walk the
pages after mmap_write_lock() while *I think* we can tolerate
concurrent page faults (don't need to lock the vmas):

s390_enable_sie()
break_ksm()
clear_refs_write()

In all these walks we lock the PTL when modifying the page table entries,
which is why I think we can skip locking the vma, but maybe I'm missing
something? Could someone please check these 3 cases and confirm or
deny my claim?
Thanks,
Suren.

>
> >
> >              Linus

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-08-01 20:28           ` Suren Baghdasaryan
@ 2023-08-01 21:34             ` Peter Xu
  2023-08-01 21:46               ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Peter Xu @ 2023-08-01 21:34 UTC (permalink / raw
  To: Suren Baghdasaryan
  Cc: Linus Torvalds, akpm, jannh, willy, liam.howlett, david, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable

On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
> I have the new patchset ready but I see 3 places where we walk the
> pages after mmap_write_lock() while *I think* we can tolerate
> concurrent page faults (don't need to lock the vmas):
> 
> s390_enable_sie()
> break_ksm()
> clear_refs_write()

This one doesn't look right to be listed - tlb flushing is postponed until
after the pgtable lock is released, so I assume the same issue can happen as
in fork(): we can have race conditions that corrupt data if, e.g., thread A
is writing with a writable (unflushed) tlb alongside thread B CoWing.

It'll indeed be nice to know whether break_ksm() can avoid that lock_vma
parameter across quite a few function jumps. I don't yet see an immediate
issue with this one..  No idea on s390_enable_sie(), but to make it simple
and safe I'd simply leave it with the write vma lock to match the mmap
write lock.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-08-01 21:34             ` Peter Xu
@ 2023-08-01 21:46               ` Suren Baghdasaryan
  2023-08-01 22:13                 ` Suren Baghdasaryan
  0 siblings, 1 reply; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-08-01 21:46 UTC (permalink / raw
  To: Peter Xu
  Cc: Linus Torvalds, akpm, jannh, willy, liam.howlett, david, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable

On Tue, Aug 1, 2023 at 2:34 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
> > I have the new patchset ready but I see 3 places where we walk the
> > pages after mmap_write_lock() while *I think* we can tolerate
> > concurrent page faults (don't need to lock the vmas):
> >
> > s390_enable_sie()
> > break_ksm()
> > clear_refs_write()
>
> This one doesn't look right to be listed - tlb flushing is postponed until
> after the pgtable lock is released, so I assume the same issue can happen as
> in fork(): we can have race conditions that corrupt data if, e.g., thread A
> is writing with a writable (unflushed) tlb alongside thread B CoWing.

Ah, good point.

>
> It'll indeed be nice to know whether break_ksm() can avoid that lock_vma
> parameter across quite a few function jumps. I don't yet see an immediate
> issue with this one..  No idea on s390_enable_sie(), but to make it simple
> and safe I'd simply leave it with the write vma lock to match the mmap
> write lock.

Thanks for taking a look, Peter!

Ok, let me keep all three of them with vma locking in place to be safe
and will post v2 for further discussion.
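
For these three walkers, keeping the vma write lock would then just mean
selecting the write-lock mode in their mm_walk_ops. As a rough example,
reusing the ops->locking field name and the PGWALK_ constants discussed
earlier in the thread (both of which may still change in v2), the clear_refs
walker in fs/proc/task_mmu.c could be declared as:

static const struct mm_walk_ops clear_refs_walk_ops = {
	.pmd_entry	= clear_refs_pte_range,
	.test_walk	= clear_refs_test_walk,
	.locking	= PGWALK_WRLOCK,
};
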
Thanks,
Suren.

>
> Thanks,
>
> --
> Peter Xu
>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH 1/6] mm: enable page walking API to lock vmas during the walk
  2023-08-01 21:46               ` Suren Baghdasaryan
@ 2023-08-01 22:13                 ` Suren Baghdasaryan
  0 siblings, 0 replies; 18+ messages in thread
From: Suren Baghdasaryan @ 2023-08-01 22:13 UTC (permalink / raw
  To: Peter Xu
  Cc: Linus Torvalds, akpm, jannh, willy, liam.howlett, david, ldufour,
	vbabka, michel, jglisse, mhocko, hannes, dave, hughd,
	linux-kernel, linux-mm, stable

On Tue, Aug 1, 2023 at 2:46 PM Suren Baghdasaryan <surenb@google.com> wrote:
>
> On Tue, Aug 1, 2023 at 2:34 PM Peter Xu <peterx@redhat.com> wrote:
> >
> > On Tue, Aug 01, 2023 at 01:28:56PM -0700, Suren Baghdasaryan wrote:
> > > I have the new patchset ready but I see 3 places where we walk the
> > > pages after mmap_write_lock() while *I think* we can tolerate
> > > concurrent page faults (don't need to lock the vmas):
> > >
> > > s390_enable_sie()
> > > break_ksm()
> > > clear_refs_write()
> >
> > This one doesn't look right to be listed - tlb flushing is postponed until
> > after the pgtable lock is released, so I assume the same issue can happen as
> > in fork(): we can have race conditions that corrupt data if, e.g., thread A
> > is writing with a writable (unflushed) tlb alongside thread B CoWing.
>
> Ah, good point.
>
> >
> > It'll indeed be nice to know whether break_ksm() can avoid that lock_vma
> > parameter across quite a few function jumps. I don't yet see an immediate
> > issue with this one..  No idea on s390_enable_sie(), but to make it simple
> > and safe I'd simply leave it with the write vma lock to match the mmap
> > write lock.
>
> Thanks for taking a look, Peter!
>
> Ok, let me keep all three of them with vma locking in place to be safe
> and will post v2 for further discussion.

v2 posted at https://lore.kernel.org/all/20230801220733.1987762-1-surenb@google.com/

> Thanks,
> Suren.
>
> >
> > Thanks,
> >
> > --
> > Peter Xu
> >

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2023-08-01 22:14 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2023-07-31 17:12 [PATCH 0/6] make vma locking more obvious Suren Baghdasaryan
2023-07-31 17:12 ` [PATCH 1/6] mm: enable page walking API to lock vmas during the walk Suren Baghdasaryan
2023-07-31 18:02   ` Linus Torvalds
2023-07-31 19:30     ` Suren Baghdasaryan
2023-07-31 19:33       ` Linus Torvalds
2023-07-31 20:24         ` Suren Baghdasaryan
2023-08-01 20:28           ` Suren Baghdasaryan
2023-08-01 21:34             ` Peter Xu
2023-08-01 21:46               ` Suren Baghdasaryan
2023-08-01 22:13                 ` Suren Baghdasaryan
2023-07-31 17:12 ` [PATCH 2/6] mm: for !CONFIG_PER_VMA_LOCK equate write lock assertion for vma and mmap Suren Baghdasaryan
2023-07-31 17:14   ` kernel test robot
2023-07-31 17:12 ` [PATCH 3/6] mm: replace mmap with vma write lock assertions when operating on a vma Suren Baghdasaryan
2023-07-31 17:12 ` [PATCH 4/6] mm: lock vma explicitly before doing vm_flags_reset and vm_flags_reset_once Suren Baghdasaryan
2023-07-31 17:12 ` [PATCH 5/6] mm: always lock new vma before inserting into vma tree Suren Baghdasaryan
2023-07-31 17:12 ` [PATCH 6/6] mm: move vma locking out of vma_prepare Suren Baghdasaryan
2023-07-31 20:30   ` Liam R. Howlett
2023-07-31 20:38     ` Suren Baghdasaryan

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).