* [PATCH v3] KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting
@ 2024-05-09 18:11 David Matlack
  2024-06-03 18:13 ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: David Matlack @ 2024-05-09 18:11 UTC
  To: Paolo Bonzini, Sean Christopherson; +Cc: kvm, David Matlack, Bibo Mao

Always drop mmu_lock to allocate shadow pages in the TDP MMU when doing
eager page splitting. Dropping mmu_lock during eager page splitting is
cheap since KVM does not have to flush remote TLBs, and avoids stalling
vCPU threads that are taking page faults in parallel.

This change reduces 20%+ dips in MySQL throughput during live migration
in a 160 vCPU VM while userspace is issuing CLEAR_DIRTY_LOG ioctls
(tested with 1GiB and 8GiB CLEARs). Userspace could issue finer-grained
CLEARs, which would also reduce contention on mmu_lock, but doing so
will increase the rate of remote TLB flushing (KVM must flush TLBs
before returning from CLEAR_DIRTY_LOG).
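
For illustration only (not part of this patch): a rough sketch of what a
chunked userspace CLEAR loop could look like. The helper name, parameters,
and chunking policy are made up for the example; it assumes a single
memslot whose dirty bitmap was just fetched with KVM_GET_DIRTY_LOG and
that KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 is enabled.

#include <linux/kvm.h>
#include <sys/ioctl.h>

/*
 * Clear the dirty state of one memslot in chunk_pages-sized pieces rather
 * than a single huge CLEAR.  "bitmap" is the full-slot bitmap returned by
 * KVM_GET_DIRTY_LOG; bit 0 of the bitmap passed to each CLEAR corresponds
 * to first_page, so advance into the full bitmap as the loop progresses.
 * first_page and chunk_pages are assumed to be multiples of 64, as the
 * ABI requires (the final chunk of a slot may be shorter).
 */
static int clear_dirty_log_chunked(int vm_fd, __u32 slot, __u64 *bitmap,
				   __u64 slot_pages, __u64 chunk_pages)
{
	struct kvm_clear_dirty_log clear;
	__u64 first, npages;

	for (first = 0; first < slot_pages; first += chunk_pages) {
		npages = slot_pages - first < chunk_pages ?
			 slot_pages - first : chunk_pages;

		clear = (struct kvm_clear_dirty_log) {
			.slot		= slot,
			.first_page	= first,
			.num_pages	= npages,
			.dirty_bitmap	= bitmap + first / 64,
		};

		if (ioctl(vm_fd, KVM_CLEAR_DIRTY_LOG, &clear) < 0)
			return -1;
	}

	return 0;
}

Smaller chunk_pages trades shorter mmu_lock hold times for more frequent
remote TLB flushes, which is exactly the trade-off described above.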

When there isn't contention on mmu_lock[1], this change does not regress
the time it takes to perform eager page splitting.

[1] Tested with dirty_log_perf_test, which does not run vCPUs during
eager page splitting, and with a 16 vCPU VM Live Migration with
manual-protect disabled (where mmu_lock is held in read-mode).

Cc: Bibo Mao <maobibo@loongson.cn>
Cc: Sean Christopherson <seanjc@google.com>
Signed-off-by: David Matlack <dmatlack@google.com>
---
v3:
 - Only drop mmu_lock during TDP MMU eager page splitting. This fixes
   the performance regression without regressing CLEAR_DIRTY_LOG
   performance on other architectures from frequent lock dropping.

v2: https://lore.kernel.org/kvm/20240402213656.3068504-1-dmatlack@google.com/
 - Rebase onto kvm/queue [Marc]

v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/

 arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
 1 file changed, 4 insertions(+), 18 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index aaa2369a9479..2089d696e3c6 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -1385,11 +1385,11 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
 	return spte_set;
 }
 
-static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
+static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(void)
 {
+	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 	struct kvm_mmu_page *sp;
 
-	gfp |= __GFP_ZERO;
 
 	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
 	if (!sp)
@@ -1412,19 +1412,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 
 	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
 
-	/*
-	 * Since we are allocating while under the MMU lock we have to be
-	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
-	 * reclaim and to avoid making any filesystem callbacks (which can end
-	 * up invoking KVM MMU notifiers, resulting in a deadlock).
-	 *
-	 * If this allocation fails we drop the lock and retry with reclaim
-	 * allowed.
-	 */
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
-	if (sp)
-		return sp;
-
 	rcu_read_unlock();
 
 	if (shared)
@@ -1433,7 +1420,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
 		write_unlock(&kvm->mmu_lock);
 
 	iter->yielded = true;
-	sp = __tdp_mmu_alloc_sp_for_split(GFP_KERNEL_ACCOUNT);
+	sp = __tdp_mmu_alloc_sp_for_split();
 
 	if (shared)
 		read_lock(&kvm->mmu_lock);
@@ -1524,8 +1511,7 @@ static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
 				break;
 			}
 
-			if (iter.yielded)
-				continue;
+			continue;
 		}
 
 		tdp_mmu_init_child_sp(sp, &iter);

base-commit: 2d181d84af38146748042a6974c577fc46c3f1c3
-- 
2.45.0.118.g7fe29c98d7-goog



* Re: [PATCH v3] KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting
  2024-05-09 18:11 [PATCH v3] KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting David Matlack
@ 2024-06-03 18:13 ` Sean Christopherson
  2024-06-11 15:29   ` David Matlack
  0 siblings, 1 reply; 4+ messages in thread
From: Sean Christopherson @ 2024-06-03 18:13 UTC
  To: David Matlack; +Cc: Paolo Bonzini, kvm, Bibo Mao

On Thu, May 09, 2024, David Matlack wrote:
> Always drop mmu_lock to allocate shadow pages in the TDP MMU when doing
> eager page splitting. Dropping mmu_lock during eager page splitting is
> cheap since KVM does not have to flush remote TLBs, and avoids stalling
> vCPU threads that are taking page faults in parallel.
> 
> This change reduces 20%+ dips in MySQL throughput during live migration
> in a 160 vCPU VM while userspace is issuing CLEAR_DIRTY_LOG ioctls
> (tested with 1GiB and 8GiB CLEARs). Userspace could issue finer-grained
> CLEARs, which would also reduce contention on mmu_lock, but doing so
> will increase the rate of remote TLB flushing (KVM must flush TLBs
> before returning from CLEAR_DIRTY_LOG).
> 
> When there isn't contention on mmu_lock[1], this change does not regress
> the time it takes to perform eager page splitting.
> 
> [1] Tested with dirty_log_perf_test, which does not run vCPUs during
> eager page splitting, and with a 16 vCPU VM Live Migration with
> manual-protect disabled (where mmu_lock is held in read-mode).
> 
> Cc: Bibo Mao <maobibo@loongson.cn>
> Cc: Sean Christopherson <seanjc@google.com>
> Signed-off-by: David Matlack <dmatlack@google.com>
> ---
> v3:
>  - Only drop mmu_lock during TDP MMU eager page splitting. This fixes
>    the performance regression without regressing CLEAR_DIRTY_LOG
>    performance on other architectures from frequent lock dropping.
> 
> v2: https://lore.kernel.org/kvm/20240402213656.3068504-1-dmatlack@google.com/
>  - Rebase onto kvm/queue [Marc]
> 
> v1: https://lore.kernel.org/kvm/20231205181645.482037-1-dmatlack@google.com/
> 
>  arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++------------------
>  1 file changed, 4 insertions(+), 18 deletions(-)
> 
> diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> index aaa2369a9479..2089d696e3c6 100644
> --- a/arch/x86/kvm/mmu/tdp_mmu.c
> +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> @@ -1385,11 +1385,11 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
>  	return spte_set;
>  }
>  
> -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(void)
>  {
> +	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
>  	struct kvm_mmu_page *sp;
>  
> -	gfp |= __GFP_ZERO;
>  
>  	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);

This can more simply and clearly be:

	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);

>  	if (!sp)
> @@ -1412,19 +1412,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  
>  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
>  
> -	/*
> -	 * Since we are allocating while under the MMU lock we have to be
> -	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
> -	 * reclaim and to avoid making any filesystem callbacks (which can end
> -	 * up invoking KVM MMU notifiers, resulting in a deadlock).
> -	 *
> -	 * If this allocation fails we drop the lock and retry with reclaim
> -	 * allowed.
> -	 */
> -	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> -	if (sp)
> -		return sp;
> -
>  	rcu_read_unlock();
>  
>  	if (shared)
> @@ -1433,7 +1420,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
>  		write_unlock(&kvm->mmu_lock);
>  
>  	iter->yielded = true;

Now that yielding is unconditional, this really should be in the loop itself.
The bare continue looks weird, and it's unnecessarily hard to see that "yielded"
is being set.

And there's definitely no reason to have two helpers.

Not sure how many patches it'll take, but I think we should end up with:

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
{
	struct kvm_mmu_page *sp;

	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
	if (!sp)
		return NULL;

	sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
	if (!sp->spt) {
		kmem_cache_free(mmu_page_header_cache, sp);
		return NULL;
	}

	return sp;
}

static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
					 struct kvm_mmu_page *root,
					 gfn_t start, gfn_t end,
					 int target_level, bool shared)
{
	struct kvm_mmu_page *sp = NULL;
	struct tdp_iter iter;

	rcu_read_lock();

	/*
	 * Traverse the page table splitting all huge pages above the target
	 * level into one lower level. For example, if we encounter a 1GB page
	 * we split it into 512 2MB pages.
	 *
	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
	 * to visit an SPTE before ever visiting its children, which means we
	 * will correctly recursively split huge pages that are more than one
	 * level above the target level (e.g. splitting a 1GB to 512 2MB pages,
	 * and then splitting each of those to 512 4KB pages).
	 */
	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
retry:
		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
			continue;

		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
			continue;

		if (!sp) {
			rcu_read_unlock();

			if (shared)
				read_unlock(&kvm->mmu_lock);
			else
				write_unlock(&kvm->mmu_lock);

			sp = tdp_mmu_alloc_sp_for_split();

			if (shared)
				read_lock(&kvm->mmu_lock);
			else
				write_lock(&kvm->mmu_lock);

			if (!sp) {
				trace_kvm_mmu_split_huge_page(iter.gfn,
							      iter.old_spte,
							      iter.level, -ENOMEM);
				return -ENOMEM;
			}

			rcu_read_lock();

			iter.yielded = true;
			continue;
		}

		tdp_mmu_init_child_sp(sp, &iter);

		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
			goto retry;

		sp = NULL;
	}

	rcu_read_unlock();

	/*
	 * It's possible to exit the loop having never used the last sp if, for
	 * example, a vCPU doing HugePage NX splitting wins the race and
	 * installs its own sp in place of the last sp we tried to split.
	 */
	if (sp)
		tdp_mmu_free_sp(sp);

	return 0;
}


* Re: [PATCH v3] KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting
  2024-06-03 18:13 ` Sean Christopherson
@ 2024-06-11 15:29   ` David Matlack
  2024-06-11 16:35     ` Sean Christopherson
  0 siblings, 1 reply; 4+ messages in thread
From: David Matlack @ 2024-06-11 15:29 UTC
  To: Sean Christopherson; +Cc: Paolo Bonzini, kvm, Bibo Mao

On 2024-06-03 11:13 AM, Sean Christopherson wrote:
> On Thu, May 09, 2024, David Matlack wrote:
> > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > index aaa2369a9479..2089d696e3c6 100644
> > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > @@ -1385,11 +1385,11 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> >  	return spte_set;
> >  }
> >  
> > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(void)
> >  {
> > +	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
> >  	struct kvm_mmu_page *sp;
> >  
> > -	gfp |= __GFP_ZERO;
> >  
> >  	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
> 
> This can more simply and clearly be:
> 
> 	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);

Will do. And I assume you'd prefer get_zeroed_page(GFP_KERNEL_ACCOUNT)
as well below?
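
i.e. I'd expect the helper to end up roughly like this (untested sketch,
just to confirm we're talking about the same end state):

static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
{
	struct kvm_mmu_page *sp;

	/* Called with mmu_lock dropped; GFP_KERNEL_ACCOUNT may reclaim/sleep. */
	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
	if (!sp)
		return NULL;

	sp->spt = (void *)get_zeroed_page(GFP_KERNEL_ACCOUNT);
	if (!sp->spt) {
		kmem_cache_free(mmu_page_header_cache, sp);
		return NULL;
	}

	return sp;
}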

> 
> >  	if (!sp)
> > @@ -1412,19 +1412,6 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >  
> >  	kvm_lockdep_assert_mmu_lock_held(kvm, shared);
> >  
> > -	/*
> > -	 * Since we are allocating while under the MMU lock we have to be
> > -	 * careful about GFP flags. Use GFP_NOWAIT to avoid blocking on direct
> > -	 * reclaim and to avoid making any filesystem callbacks (which can end
> > -	 * up invoking KVM MMU notifiers, resulting in a deadlock).
> > -	 *
> > -	 * If this allocation fails we drop the lock and retry with reclaim
> > -	 * allowed.
> > -	 */
> > -	sp = __tdp_mmu_alloc_sp_for_split(GFP_NOWAIT | __GFP_ACCOUNT);
> > -	if (sp)
> > -		return sp;
> > -
> >  	rcu_read_unlock();
> >  
> >  	if (shared)
> > @@ -1433,7 +1420,7 @@ static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(struct kvm *kvm,
> >  		write_unlock(&kvm->mmu_lock);
> >  
> >  	iter->yielded = true;
> 
> Now that yielding is unconditional, this really should be in the loop itself.
> The bare continue looks weird, and it's unnecessarily hard to see that "yielded"
> is being set.
> 
> And there's definitely no reason to have two helpers.
> 
> Not sure how many patches it'll take, but I think we should end up with:
> 
> static struct kvm_mmu_page *tdp_mmu_alloc_sp_for_split(void)
> {
> 	struct kvm_mmu_page *sp;
> 
> 	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
> 	if (!sp)
> 		return NULL;
> 
> 	sp->spt = (void *)__get_free_page(GFP_KERNEL_ACCOUNT);
> 	if (!sp->spt) {
> 		kmem_cache_free(mmu_page_header_cache, sp);
> 		return NULL;
> 	}
> 
> 	return sp;
> }
> 
> static int tdp_mmu_split_huge_pages_root(struct kvm *kvm,
> 					 struct kvm_mmu_page *root,
> 					 gfn_t start, gfn_t end,
> 					 int target_level, bool shared)
> {
> 	struct kvm_mmu_page *sp = NULL;
> 	struct tdp_iter iter;
> 
> 	rcu_read_lock();
> 
> 	/*
> 	 * Traverse the page table splitting all huge pages above the target
> 	 * level into one lower level. For example, if we encounter a 1GB page
> 	 * we split it into 512 2MB pages.
> 	 *
> 	 * Since the TDP iterator uses a pre-order traversal, we are guaranteed
> 	 * to visit an SPTE before ever visiting its children, which means we
> 	 * will correctly recursively split huge pages that are more than one
> 	 * level above the target level (e.g. splitting a 1GB to 512 2MB pages,
> 	 * and then splitting each of those to 512 4KB pages).
> 	 */
> 	for_each_tdp_pte_min_level(iter, root, target_level + 1, start, end) {
> retry:
> 		if (tdp_mmu_iter_cond_resched(kvm, &iter, false, shared))
> 			continue;
> 
> 		if (!is_shadow_present_pte(iter.old_spte) || !is_large_pte(iter.old_spte))
> 			continue;
> 
> 		if (!sp) {
> 			rcu_read_unlock();
> 
> 			if (shared)
> 				read_unlock(&kvm->mmu_lock);
> 			else
> 				write_unlock(&kvm->mmu_lock);
> 
> 			sp = tdp_mmu_alloc_sp_for_split();
> 
> 			if (shared)
> 				read_lock(&kvm->mmu_lock);
> 			else
> 				write_lock(&kvm->mmu_lock);
> 
> 			if (!sp) {
> 				trace_kvm_mmu_split_huge_page(iter.gfn,
> 							      iter.old_spte,
> 							      iter.level, -ENOMEM);
> 				return -ENOMEM;
> 			}
> 
> 			rcu_read_lock();
> 
> 			iter.yielded = true;
> 			continue;
> 		}
> 
> 		tdp_mmu_init_child_sp(sp, &iter);
> 
> 		if (tdp_mmu_split_huge_page(kvm, &iter, sp, shared))
> 			goto retry;
> 
> 		sp = NULL;
> 	}
> 
> 	rcu_read_unlock();
> 
> 	/*
> 	 * It's possible to exit the loop having never used the last sp if, for
> 	 * example, a vCPU doing HugePage NX splitting wins the race and
> 	 * installs its own sp in place of the last sp we tried to split.
> 	 */
> 	if (sp)
> 		tdp_mmu_free_sp(sp);
> 
> 	return 0;
> }

Ack, will do.


* Re: [PATCH v3] KVM: x86/mmu: Always drop mmu_lock to allocate TDP MMU SPs for eager splitting
  2024-06-11 15:29   ` David Matlack
@ 2024-06-11 16:35     ` Sean Christopherson
  0 siblings, 0 replies; 4+ messages in thread
From: Sean Christopherson @ 2024-06-11 16:35 UTC
  To: David Matlack; +Cc: Paolo Bonzini, kvm, Bibo Mao

On Tue, Jun 11, 2024, David Matlack wrote:
> On 2024-06-03 11:13 AM, Sean Christopherson wrote:
> > On Thu, May 09, 2024, David Matlack wrote:
> > > diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
> > > index aaa2369a9479..2089d696e3c6 100644
> > > --- a/arch/x86/kvm/mmu/tdp_mmu.c
> > > +++ b/arch/x86/kvm/mmu/tdp_mmu.c
> > > @@ -1385,11 +1385,11 @@ bool kvm_tdp_mmu_wrprot_slot(struct kvm *kvm,
> > >  	return spte_set;
> > >  }
> > >  
> > > -static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(gfp_t gfp)
> > > +static struct kvm_mmu_page *__tdp_mmu_alloc_sp_for_split(void)
> > >  {
> > > +	gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
> > >  	struct kvm_mmu_page *sp;
> > >  
> > > -	gfp |= __GFP_ZERO;
> > >  
> > >  	sp = kmem_cache_alloc(mmu_page_header_cache, gfp);
> > 
> > This can more simply and clearly be:
> > 
> > 	sp = kmem_cache_zalloc(mmu_page_header_cache, GFP_KERNEL_ACCOUNT);
> 
> Will do. And I assume you'd prefer get_zeroed_page(GFP_KERNEL_ACCOUNT)
> as well below?

Ah, yeah, good catch!

