[PATCH v2 0/8] KVM: x86/xen updates

KVM Archive mirror
 help / color / mirror / Atom feed

* [PATCH v2 0/8] KVM: x86/xen updates
@ 2024-02-27 11:49 David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
                   ` (9 more replies)
  0 siblings, 10 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm; +Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj

These apply to the kvm-x86/xen tree.

First, deal with the awful brokenness of the KVM clock, and its systemic 
drift especially when TSC scaling is used. This is a bit of a workaround 
for Xen timers where it hurts *most*, but it's actually easier in this 
case because there is a vCPU (and associated PV clock information) to 
use for the scaling. A better fix for __get_kvmclock() is in the works, 
but there's an enormous yak to shave there because there are so many 
interrelated bugs in the TSC and timekeeping code.

Ensure that the guest doesn't miss Xen event channel wakeups which are
already pending when the local APIC is enabled. Userspace doesn't get
to interpose here, so KVM needs to do the same as Xen and explicitly
check for the pending event. While looking at that, Michal spotted a
potential false positive from the WARN_ON_ONCE() when delivering the
vector, so fix that too.

The remainder of the series is about cleaning up locking, simplifying
the pfncache locking so that a recursive lock deadlock in the Xen code
can be eliminated (by virtue of the inner function not having to take
that lock at all any more). The final patch in the series is optional,
but probably worth doing anyway.

In moving the rwlock cleanup to be an optional patch at the end of the
series, I've reworked the commit messages so most of the lamentation
about the existing horridness, and the mention of the "bug that should
not happen", is in the simpler ->refresh_lock patch.

In v2 I rounded up the patches which were dropped from Paul's shared-info
series, to (cosmetically) split up kvm_xen_set_evtchn_fast() and then fix
the RT_PREEMPT locking issue. To address the concerns about fairness when
using read_trylock(), I've adjusted it to only do so from IRQ context, so
if it does fall back to the slow path it still takes the lock normally as
before.

David Woodhouse (6):
      KVM: x86/xen: improve accuracy of Xen timers
      KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
      KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
      KVM: pfncache: simplify locking and make more self-contained
      KVM: x86/xen: fix recursive deadlock in timer injection
      KVM: pfncache: clean up rwlock abuse

Paul Durrant (2):
      KVM: x86/xen: split up kvm_xen_set_evtchn_fast()
      KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()

 arch/x86/kvm/lapic.c |   5 +-
 arch/x86/kvm/x86.c   |  61 +++++++++-
 arch/x86/kvm/x86.h   |   1 +
 arch/x86/kvm/xen.c   | 327 +++++++++++++++++++++++++++++++++------------------
 arch/x86/kvm/xen.h   |  18 +++
 virt/kvm/pfncache.c  | 216 +++++++++++++++++-----------------
 6 files changed, 403 insertions(+), 225 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-03-04 23:44   ` Sean Christopherson
  2024-02-27 11:49 ` [PATCH v2 2/8] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled David Woodhouse
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

A test program such as http://david.woodhou.se/timerlat.c confirms user
reports that timers are increasingly inaccurate as the lifetime of a
guest increases. Reporting the actual delay observed when asking for
100µs of sleep, it starts off OK on a newly-launched guest but gets
worse over time, giving incorrect sleep times:

root@ip-10-0-193-21:~# ./timerlat -c -n 5
00000000 latency 103243/100000 (3.2430%)
00000001 latency 103243/100000 (3.2430%)
00000002 latency 103242/100000 (3.2420%)
00000003 latency 103245/100000 (3.2450%)
00000004 latency 103245/100000 (3.2450%)

The biggest problem is that get_kvmclock_ns() returns inaccurate values
when the guest TSC is scaled. The guest sees a TSC value scaled from the
host TSC by a mul/shift conversion (hopefully done in hardware). The
guest then converts that guest TSC value into nanoseconds using the
mul/shift conversion given to it by the KVM pvclock information.

But get_kvmclock_ns() performs only a single conversion directly from
host TSC to nanoseconds, giving a different result. A test program at
http://david.woodhou.se/tsdrift.c demonstrates the cumulative error
over a day.

It's non-trivial to fix get_kvmclock_ns(), although I'll come back to
that. The actual guest hv_clock is per-CPU, and *theoretically* each
vCPU could be running at a *different* frequency. But this patch is
needed anyway because...

The other issue with Xen timers was that the code would snapshot the
host CLOCK_MONOTONIC at some point in time, and then... after a few
interrupts may have occurred, some preemption perhaps... would also read
the guest's kvmclock. Then it would proceed under the false assumption
that those two happened at the *same* time. Any time which *actually*
elapsed between reading the two clocks was introduced as inaccuracies
in the time at which the timer fired.

Fix it to use a variant of kvm_get_time_and_clockread(), which reads the
host TSC just *once*, then use the returned TSC value to calculate the
kvmclock (making sure to do that the way the guest would instead of
making the same mistake get_kvmclock_ns() does).

Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen
timers still have to use CLOCK_MONOTONIC. In practice the difference
between the two won't matter over the timescales involved, as the
*absolute* values don't matter; just the delta.

This does mean a new variant of kvm_get_time_and_clockread() is needed;
called kvm_get_monotonic_and_clockread() because that's what it does.

Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/x86.c |  61 +++++++++++++++++++++--
 arch/x86/kvm/x86.h |   1 +
 arch/x86/kvm/xen.c | 121 ++++++++++++++++++++++++++++++++++-----------
 3 files changed, 149 insertions(+), 34 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2911e6383fef..89815a887e4d 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2862,7 +2862,11 @@ static inline u64 vgettsc(struct pvclock_clock *clock, u64 *tsc_timestamp,
 	return v * clock->mult;
 }
 
-static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
+/*
+ * As with get_kvmclock_base_ns(), this counts from boot time, at the
+ * frequency of CLOCK_MONOTONIC_RAW (hence adding gtos->offs_boot).
+ */
+static int do_kvmclock_base(s64 *t, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	unsigned long seq;
@@ -2881,6 +2885,29 @@ static int do_monotonic_raw(s64 *t, u64 *tsc_timestamp)
 	return mode;
 }
 
+/*
+ * This calculates CLOCK_MONOTONIC at the time of the TSC snapshot, with
+ * no boot time offset.
+ */
+static int do_monotonic(s64 *t, u64 *tsc_timestamp)
+{
+	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
+	unsigned long seq;
+	int mode;
+	u64 ns;
+
+	do {
+		seq = read_seqcount_begin(&gtod->seq);
+		ns = gtod->clock.base_cycles;
+		ns += vgettsc(&gtod->clock, tsc_timestamp, &mode);
+		ns >>= gtod->clock.shift;
+		ns += ktime_to_ns(gtod->clock.offset);
+	} while (unlikely(read_seqcount_retry(&gtod->seq, seq)));
+	*t = ns;
+
+	return mode;
+}
+
 static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 {
 	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
@@ -2902,18 +2929,42 @@ static int do_realtime(struct timespec64 *ts, u64 *tsc_timestamp)
 	return mode;
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates the kvmclock_base_ns (CLOCK_MONOTONIC_RAW + boot time) and
+ * reports the TSC value from which it do so. Returns true if host is
+ * using TSC based clocksource.
+ */
 static bool kvm_get_time_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
 {
 	/* checked again under seqlock below */
 	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
 		return false;
 
-	return gtod_is_based_on_tsc(do_monotonic_raw(kernel_ns,
-						      tsc_timestamp));
+	return gtod_is_based_on_tsc(do_kvmclock_base(kernel_ns,
+						     tsc_timestamp));
 }
 
-/* returns true if host is using TSC based clocksource */
+/*
+ * Calculates CLOCK_MONOTONIC and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ */
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp)
+{
+	/* checked again under seqlock below */
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode))
+		return false;
+
+	return gtod_is_based_on_tsc(do_monotonic(kernel_ns,
+						 tsc_timestamp));
+}
+
+/*
+ * Calculates CLOCK_REALTIME and reports the TSC value from which it did
+ * so. Returns true if host is using TSC based clocksource.
+ *
+ * DO NOT USE this for anything related to migration. You want CLOCK_TAI
+ * for that.
+ */
 static bool kvm_get_walltime_and_clockread(struct timespec64 *ts,
 					   u64 *tsc_timestamp)
 {
diff --git a/arch/x86/kvm/x86.h b/arch/x86/kvm/x86.h
index 2f7e19166658..56b7a78f45bf 100644
--- a/arch/x86/kvm/x86.h
+++ b/arch/x86/kvm/x86.h
@@ -294,6 +294,7 @@ void kvm_inject_realmode_interrupt(struct kvm_vcpu *vcpu, int irq, int inc_eip);
 
 u64 get_kvmclock_ns(struct kvm *kvm);
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm);
+bool kvm_get_monotonic_and_clockread(s64 *kernel_ns, u64 *tsc_timestamp);
 
 int kvm_read_guest_virt(struct kvm_vcpu *vcpu,
 	gva_t addr, void *val, unsigned int bytes,
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 8a04e0ae9245..ccd2dc753fd6 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -24,6 +24,7 @@
 #include <xen/interface/sched.h>
 
 #include <asm/xen/cpuid.h>
+#include <asm/pvclock.h>
 
 #include "cpuid.h"
 #include "trace.h"
@@ -149,8 +150,93 @@ static enum hrtimer_restart xen_timer_callback(struct hrtimer *timer)
 	return HRTIMER_NORESTART;
 }
 
-static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_ns)
+static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs,
+				bool linux_wa)
 {
+	uint64_t guest_now;
+	int64_t kernel_now, delta;
+
+	/*
+	 * The guest provides the requested timeout in absolute nanoseconds
+	 * of the KVM clock — as *it* sees it, based on the scaled TSC and
+	 * the pvclock information provided by KVM.
+	 *
+	 * The kernel doesn't support hrtimers based on CLOCK_MONOTONIC_RAW
+	 * so use CLOCK_MONOTONIC. In the timescales covered by timers, the
+	 * difference won't matter much as there is no cumulative effect.
+	 *
+	 * Calculate the time for some arbitrary point in time around "now"
+	 * in terms of both kvmclock and CLOCK_MONOTONIC. Calculate the
+	 * delta between the kvmclock "now" value and the guest's requested
+	 * timeout, apply the "Linux workaround" described below, and add
+	 * the resulting delta to the CLOCK_MONOTONIC "now" value, to get
+	 * the absolute CLOCK_MONOTONIC time at which the timer should
+	 * fire.
+	 */
+	if (vcpu->arch.hv_clock.version && vcpu->kvm->arch.use_master_clock &&
+	    static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		uint64_t host_tsc, guest_tsc;
+
+		if (!IS_ENABLED(CONFIG_64BIT) ||
+		    !kvm_get_monotonic_and_clockread(&kernel_now, &host_tsc)) {
+			/*
+			 * Don't fall back to get_kvmclock_ns() because it's
+			 * broken; it has a systemic error in its results
+			 * because it scales directly from host TSC to
+			 * nanoseconds, and doesn't scale first to guest TSC
+			 * and then* to nanoseconds as the guest does.
+			 *
+			 * There is a small error introduced here because time
+			 * continues to elapse between the ktime_get() and the
+			 * subsequent rdtsc(). But not the systemic drift due
+			 * to get_kvmclock_ns().
+			 */
+			kernel_now = ktime_get(); /* This is CLOCK_MONOTONIC */
+			host_tsc = rdtsc();
+		}
+
+		/* Calculate the guest kvmclock as the guest would do it. */
+		guest_tsc = kvm_read_l1_tsc(vcpu, host_tsc);
+		guest_now = __pvclock_read_cycles(&vcpu->arch.hv_clock,
+						  guest_tsc);
+	} else {
+		/*
+		 * Without CONSTANT_TSC, get_kvmclock_ns() is the only option.
+		 *
+		 * Also if the guest PV clock hasn't been set up yet, as is
+		 * likely to be the case during migration when the vCPU has
+		 * not been run yet. It would be possible to calculate the
+		 * scaling factors properly in that case but there's not much
+		 * point in doing so. The get_kvmclock_ns() drift accumulates
+		 * over time, so it's OK to use it at startup. Besides, on
+		 * migration there's going to be a little bit of skew in the
+		 * precise moment at which timers fire anyway. Often they'll
+		 * be in the "past" by the time the VM is running again after
+		 * migration.
+		 */
+		guest_now = get_kvmclock_ns(vcpu->kvm);
+		kernel_now = ktime_get();
+	}
+
+	delta = guest_abs - guest_now;
+
+	/* Xen has a 'Linux workaround' in do_set_timer_op() which
+	 * checks for negative absolute timeout values (caused by
+	 * integer overflow), and for values about 13 days in the
+	 * future (2^50ns) which would be caused by jiffies
+	 * overflow. For those cases, it sets the timeout 100ms in
+	 * the future (not *too* soon, since if a guest really did
+	 * set a long timeout on purpose we don't want to keep
+	 * churning CPU time by waking it up).
+	 */
+	if (linux_wa) {
+		if ((unlikely((int64_t)guest_abs < 0 ||
+			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
+			delta = 100 * NSEC_PER_MSEC;
+			guest_abs = guest_now + delta;
+		}
+	}
+
 	/*
 	 * Avoid races with the old timer firing. Checking timer_expires
 	 * to avoid calling hrtimer_cancel() will only have false positives
@@ -162,12 +248,11 @@ static void kvm_xen_start_timer(struct kvm_vcpu *vcpu, u64 guest_abs, s64 delta_
 	atomic_set(&vcpu->arch.xen.timer_pending, 0);
 	vcpu->arch.xen.timer_expires = guest_abs;
 
-	if (delta_ns <= 0) {
+	if (delta <= 0) {
 		xen_timer_callback(&vcpu->arch.xen.timer);
 	} else {
-		ktime_t ktime_now = ktime_get();
 		hrtimer_start(&vcpu->arch.xen.timer,
-			      ktime_add_ns(ktime_now, delta_ns),
+			      ktime_add_ns(kernel_now, delta),
 			      HRTIMER_MODE_ABS_HARD);
 	}
 }
@@ -998,8 +1083,7 @@ int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data)
 		/* Start the timer if the new value has a valid vector+expiry. */
 		if (data->u.timer.port && data->u.timer.expires_ns)
 			kvm_xen_start_timer(vcpu, data->u.timer.expires_ns,
-					    data->u.timer.expires_ns -
-					    get_kvmclock_ns(vcpu->kvm));
+					    false);
 
 		r = 0;
 		break;
@@ -1472,7 +1556,6 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 {
 	struct vcpu_set_singleshot_timer oneshot;
 	struct x86_exception e;
-	s64 delta;
 
 	if (!kvm_xen_timer_enabled(vcpu))
 		return false;
@@ -1506,9 +1589,7 @@ static bool kvm_xen_hcall_vcpu_op(struct kvm_vcpu *vcpu, bool longmode, int cmd,
 			return true;
 		}
 
-		/* A delta <= 0 results in an immediate callback, which is what we want */
-		delta = oneshot.timeout_abs_ns - get_kvmclock_ns(vcpu->kvm);
-		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, delta);
+		kvm_xen_start_timer(vcpu, oneshot.timeout_abs_ns, false);
 		*r = 0;
 		return true;
 
@@ -1532,25 +1613,7 @@ static bool kvm_xen_hcall_set_timer_op(struct kvm_vcpu *vcpu, uint64_t timeout,
 		return false;
 
 	if (timeout) {
-		uint64_t guest_now = get_kvmclock_ns(vcpu->kvm);
-		int64_t delta = timeout - guest_now;
-
-		/* Xen has a 'Linux workaround' in do_set_timer_op() which
-		 * checks for negative absolute timeout values (caused by
-		 * integer overflow), and for values about 13 days in the
-		 * future (2^50ns) which would be caused by jiffies
-		 * overflow. For those cases, it sets the timeout 100ms in
-		 * the future (not *too* soon, since if a guest really did
-		 * set a long timeout on purpose we don't want to keep
-		 * churning CPU time by waking it up).
-		 */
-		if (unlikely((int64_t)timeout < 0 ||
-			     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
-			delta = 100 * NSEC_PER_MSEC;
-			timeout = guest_now + delta;
-		}
-
-		kvm_xen_start_timer(vcpu, timeout, delta);
+		kvm_xen_start_timer(vcpu, timeout, true);
 	} else {
 		kvm_xen_stop_timer(vcpu);
 	}

base-commit: 003d914220c97ef93cabfe3ec4e245e2383e19e9
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 2/8] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 3/8] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery David Woodhouse
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse, stable

From: David Woodhouse <dwmw@amazon.co.uk>

Linux guests since commit b1c3497e604d ("x86/xen: Add support for
HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU
upcall vector when it's advertised in the Xen CPUID leaves.

This upcall is injected through the guest's local APIC as an MSI, unlike
the older system vector which was merely injected by the hypervisor any
time the CPU was able to receive an interrupt and the upcall_pending
flags is set in its vcpu_info.

Effectively, that makes the per-CPU upcall edge triggered instead of
level triggered, which results in the upcall being lost if the MSI is
delivered when the local APIC is *disabled*.

Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC
for a vCPU is software enabled (in fact, on any write to the SPIV
register which doesn't disable the APIC). Do the same in KVM since KVM
doesn't provide a way for userspace to intervene and trap accesses to
the SPIV register of a local APIC emulated by KVM.

Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Cc: stable@vger.kernel.org
---
 arch/x86/kvm/lapic.c |  5 ++++-
 arch/x86/kvm/xen.c   |  2 +-
 arch/x86/kvm/xen.h   | 18 ++++++++++++++++++
 3 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 3242f3da2457..75bc7d3f0022 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -41,6 +41,7 @@
 #include "ioapic.h"
 #include "trace.h"
 #include "x86.h"
+#include "xen.h"
 #include "cpuid.h"
 #include "hyperv.h"
 #include "smm.h"
@@ -499,8 +500,10 @@ static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 	}
 
 	/* Check if there are APF page ready requests pending */
-	if (enabled)
+	if (enabled) {
 		kvm_make_request(KVM_REQ_APF_READY, apic->vcpu);
+		kvm_xen_sw_enable_lapic(apic->vcpu);
+	}
 }
 
 static inline void kvm_apic_set_xapic_id(struct kvm_lapic *apic, u8 id)
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index ccd2dc753fd6..06904696759c 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -568,7 +568,7 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
 		kvm_xen_update_runstate_guest(v, state == RUNSTATE_runnable);
 }
 
-static void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
+void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
 {
 	struct kvm_lapic_irq irq = { };
 	int r;
diff --git a/arch/x86/kvm/xen.h b/arch/x86/kvm/xen.h
index f8f1fe22d090..f5841d9000ae 100644
--- a/arch/x86/kvm/xen.h
+++ b/arch/x86/kvm/xen.h
@@ -18,6 +18,7 @@ extern struct static_key_false_deferred kvm_xen_enabled;
 
 int __kvm_xen_has_interrupt(struct kvm_vcpu *vcpu);
 void kvm_xen_inject_pending_events(struct kvm_vcpu *vcpu);
+void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *vcpu);
 int kvm_xen_vcpu_set_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data);
 int kvm_xen_vcpu_get_attr(struct kvm_vcpu *vcpu, struct kvm_xen_vcpu_attr *data);
 int kvm_xen_hvm_set_attr(struct kvm *kvm, struct kvm_xen_hvm_attr *data);
@@ -36,6 +37,19 @@ int kvm_xen_setup_evtchn(struct kvm *kvm,
 			 const struct kvm_irq_routing_entry *ue);
 void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu);
 
+static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu)
+{
+	/*
+	 * The local APIC is being enabled. If the per-vCPU upcall vector is
+	 * set and the vCPU's evtchn_upcall_pending flag is set, inject the
+	 * interrupt.
+	 */
+	if (static_branch_unlikely(&kvm_xen_enabled.key) &&
+	    vcpu->arch.xen.vcpu_info_cache.active &&
+	    vcpu->arch.xen.upcall_vector && __kvm_xen_has_interrupt(vcpu))
+		kvm_xen_inject_vcpu_vector(vcpu);
+}
+
 static inline bool kvm_xen_msr_enabled(struct kvm *kvm)
 {
 	return static_branch_unlikely(&kvm_xen_enabled.key) &&
@@ -101,6 +115,10 @@ static inline void kvm_xen_destroy_vcpu(struct kvm_vcpu *vcpu)
 {
 }
 
+static inline void kvm_xen_sw_enable_lapic(struct kvm_vcpu *vcpu)
+{
+}
+
 static inline bool kvm_xen_msr_enabled(struct kvm *kvm)
 {
 	return false;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 3/8] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 2/8] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained David Woodhouse
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast
version will always work for physical unicast", justifying its use of
kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails.

In fact that assumption isn't true if X2APIC isn't in use by the guest
and there is (8-bit x)APIC ID aliasing. A single "unicast" destination
APIC ID *may* then be delivered to multiple vCPUs. Remove the warning,
and in fact it might as well just call kvm_irq_delivery_to_apic().

Reported-by: Michal Luczaj <mhal@rbox.co>
Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 06904696759c..54a4bdb63b17 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -10,7 +10,7 @@
 #include "x86.h"
 #include "xen.h"
 #include "hyperv.h"
-#include "lapic.h"
+#include "irq.h"
 
 #include <linux/eventfd.h>
 #include <linux/kvm_host.h>
@@ -571,7 +571,6 @@ void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
 void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
 {
 	struct kvm_lapic_irq irq = { };
-	int r;
 
 	irq.dest_id = v->vcpu_id;
 	irq.vector = v->arch.xen.upcall_vector;
@@ -580,8 +579,7 @@ void kvm_xen_inject_vcpu_vector(struct kvm_vcpu *v)
 	irq.delivery_mode = APIC_DM_FIXED;
 	irq.level = 1;
 
-	/* The fast version will always work for physical unicast */
-	WARN_ON_ONCE(!kvm_irq_delivery_to_apic_fast(v->kvm, NULL, &irq, &r, NULL));
+	kvm_irq_delivery_to_apic(v->kvm, NULL, &irq, NULL);
 }
 
 /*
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (2 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 3/8] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-03-05  0:08   ` Sean Christopherson
  2024-02-27 11:49 ` [PATCH v2 5/8] KVM: x86/xen: fix recursive deadlock in timer injection David Woodhouse
                   ` (5 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

The locking on the gfn_to_pfn_cache is... interesting. And awful.

There is a rwlock in ->lock which readers take to ensure protection
against concurrent changes. But __kvm_gpc_refresh() makes assumptions
that certain fields will not change even while it drops the write lock
and performs MM operations to revalidate the target PFN and kernel
mapping.

Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via
mutex") partly addressed that — not by fixing it, but by adding a new
mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh()
calls on a given gfn_to_pfn_cache, but is still only a partial solution.

There is still a theoretical race where __kvm_gpc_refresh() runs in
parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has
dropped the write lock, kvm_gpc_deactivate() clears the ->active flag
and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous
->pfn and ->khva are still valid, and reinstalls those values into the
structure. This leaves the gfn_to_pfn_cache with the ->valid bit set,
but ->active clear. And a ->khva which looks like a reasonable kernel
address but is actually unmapped.

All it takes is a subsequent reactivation to cause that ->khva to be
dereferenced. This would theoretically cause an oops which would look
something like this:

[1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0
[1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0

I say "theoretically" because theoretically, that oops that was seen in
production cannot happen. The code which uses the gfn_to_pfn_cache is
supposed to have its *own* locking, to further paper over the fact that
the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own
rwlock abuse is not sufficient.

For the Xen vcpu_info that external lock is the vcpu->mutex, and for the
shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect
the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all
but the cases where the vcpu or kvm object is being *destroyed*, in
which case the subsequent reactivation should never happen.

Theoretically.

Nevertheless, this locking abuse is awful and should be fixed, even if
no clear explanation can be found for how the oops happened. So expand
the use of the ->refresh_lock mutex to ensure serialization of
activate/deactivate vs. refresh and make the pfncache locking entirely
self-sufficient.

This means that a future commit can simplify the locking in the callers,
such as the Xen emulation code which has an outstanding problem with
recursive locking of kvm->arch.xen.xen_lock, which will no longer be
necessary.

The rwlock abuse described above is still not best practice, although
it's harmless now that the ->refresh_lock is held for the entire duration
while the offending code drops the write lock, does some other stuff,
then takes the write lock again and assumes nothing changed. That can
also be fixed^W cleaned up in a subsequent commit, but this commit is
a simpler basis for the Xen deadlock fix mentioned above.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 virt/kvm/pfncache.c | 34 ++++++++++++++++++++++------------
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 9ac8c9da4eda..43d67f8f064e 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -256,12 +256,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	if (page_offset + len > PAGE_SIZE)
 		return -EINVAL;
 
-	/*
-	 * If another task is refreshing the cache, wait for it to complete.
-	 * There is no guarantee that concurrent refreshes will see the same
-	 * gpa, memslots generation, etc..., so they must be fully serialized.
-	 */
-	mutex_lock(&gpc->refresh_lock);
+	lockdep_assert_held(&gpc->refresh_lock);
 
 	write_lock_irq(&gpc->lock);
 
@@ -347,8 +342,6 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 out_unlock:
 	write_unlock_irq(&gpc->lock);
 
-	mutex_unlock(&gpc->refresh_lock);
-
 	if (unmap_old)
 		gpc_unmap(old_pfn, old_khva);
 
@@ -357,15 +350,23 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
 int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
 {
+	unsigned long uhva;
+	int ret;
+
+	mutex_lock(&gpc->refresh_lock);
+
 	/*
 	 * If the GPA is valid then ignore the HVA, as a cache can be GPA-based
 	 * or HVA-based, not both.  For GPA-based caches, the HVA will be
 	 * recomputed during refresh if necessary.
 	 */
-	unsigned long uhva = kvm_is_error_gpa(gpc->gpa) ? gpc->uhva :
-							  KVM_HVA_ERR_BAD;
+	uhva = kvm_is_error_gpa(gpc->gpa) ? gpc->uhva : KVM_HVA_ERR_BAD;
+
+	ret = __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
 
-	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
+	mutex_unlock(&gpc->refresh_lock);
+
+	return ret;
 }
 
 void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
@@ -377,12 +378,16 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
 	gpc->pfn = KVM_PFN_ERR_FAULT;
 	gpc->gpa = INVALID_GPA;
 	gpc->uhva = KVM_HVA_ERR_BAD;
+	gpc->active = gpc->valid = false;
 }
 
 static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
 			      unsigned long len)
 {
 	struct kvm *kvm = gpc->kvm;
+	int ret;
+
+	mutex_lock(&gpc->refresh_lock);
 
 	if (!gpc->active) {
 		if (KVM_BUG_ON(gpc->valid, kvm))
@@ -401,7 +406,10 @@ static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned
 		gpc->active = true;
 		write_unlock_irq(&gpc->lock);
 	}
-	return __kvm_gpc_refresh(gpc, gpa, uhva, len);
+	ret = __kvm_gpc_refresh(gpc, gpa, uhva, len);
+	mutex_unlock(&gpc->refresh_lock);
+
+	return ret;
 }
 
 int kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long len)
@@ -420,6 +428,7 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t old_pfn;
 	void *old_khva;
 
+	mutex_lock(&gpc->refresh_lock);
 	if (gpc->active) {
 		/*
 		 * Deactivate the cache before removing it from the list, KVM
@@ -449,4 +458,5 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 
 		gpc_unmap(old_pfn, old_khva);
 	}
+	mutex_unlock(&gpc->refresh_lock);
 }
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 5/8] KVM: x86/xen: fix recursive deadlock in timer injection
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (3 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 6/8] KVM: x86/xen: split up kvm_xen_set_evtchn_fast() David Woodhouse
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

From: David Woodhouse <dwmw@amazon.co.uk>

The fast-path timer delivery introduced a recursive locking deadlock
when userspace configures a timer which has already expired and is
delivered immediately. The call to kvm_xen_inject_timer_irqs() can
call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock,
which is already held in kvm_xen_vcpu_get_attr().

 ============================================
 WARNING: possible recursive locking detected
 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G           O
 --------------------------------------------
 xen_shinfo_test/250013 is trying to acquire lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm]

 but task is already holding lock:
 ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm]

Now that the gfn_to_pfn_cache has its own self-sufficient locking, its
callers no longer need to ensure serialization, so just stop taking
kvm->arch.xen.xen_lock from kvm_xen_set_evtchn().

Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 54a4bdb63b17..e87b36590809 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1865,8 +1865,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		mm_borrowed = true;
 	}
 
-	mutex_lock(&kvm->arch.xen.xen_lock);
-
 	/*
 	 * It is theoretically possible for the page to be unmapped
 	 * and the MMU notifier to invalidate the shared_info before
@@ -1894,8 +1892,6 @@ static int kvm_xen_set_evtchn(struct kvm_xen_evtchn *xe, struct kvm *kvm)
 		srcu_read_unlock(&kvm->srcu, idx);
 	} while(!rc);
 
-	mutex_unlock(&kvm->arch.xen.xen_lock);
-
 	if (mm_borrowed)
 		kthread_unuse_mm(kvm->mm);
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 6/8] KVM: x86/xen: split up kvm_xen_set_evtchn_fast()
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (4 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 5/8] KVM: x86/xen: fix recursive deadlock in timer injection David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-27 11:49 ` [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast() David Woodhouse
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	Paul Durrant, David Woodhouse, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Dave Hansen, H. Peter Anvin, David Woodhouse,
	x86

From: Paul Durrant <pdurrant@amazon.com>

The implementation of kvm_xen_set_evtchn_fast() is a rather lengthy piece
of code that performs two operations: updating of the shared_info
evtchn_pending mask, and updating of the vcpu_info evtchn_pending_sel
mask. Introduce a separate function to perform each of those operations and
re-work kvm_xen_set_evtchn_fast() to use them.

No functional change intended.

Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org
---
 arch/x86/kvm/xen.c | 173 ++++++++++++++++++++++++++-------------------
 1 file changed, 99 insertions(+), 74 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index e87b36590809..c16b6d394d55 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1728,112 +1728,137 @@ static void kvm_xen_check_poller(struct kvm_vcpu *vcpu, int port)
 	}
 }
 
-/*
- * The return value from this function is propagated to kvm_set_irq() API,
- * so it returns:
- *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
- *  = 0   Interrupt was coalesced (previous irq is still pending)
- *  > 0   Number of CPUs interrupt was delivered to
- *
- * It is also called directly from kvm_arch_set_irq_inatomic(), where the
- * only check on its return value is a comparison with -EWOULDBLOCK'.
- */
-int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
+static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 {
+	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
-	struct kvm_vcpu *vcpu;
 	unsigned long *pending_bits, *mask_bits;
 	unsigned long flags;
-	int port_word_bit;
-	bool kick_vcpu = false;
-	int vcpu_idx, idx, rc;
-
-	vcpu_idx = READ_ONCE(xe->vcpu_idx);
-	if (vcpu_idx >= 0)
-		vcpu = kvm_get_vcpu(kvm, vcpu_idx);
-	else {
-		vcpu = kvm_get_vcpu_by_id(kvm, xe->vcpu_id);
-		if (!vcpu)
-			return -EINVAL;
-		WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
-	}
-
-	if (xe->port >= max_evtchn_port(kvm))
-		return -EINVAL;
-
-	rc = -EWOULDBLOCK;
-
-	idx = srcu_read_lock(&kvm->srcu);
+	int rc = -EWOULDBLOCK;
 
 	read_lock_irqsave(&gpc->lock, flags);
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out_rcu;
+		goto out;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
+
 		pending_bits = (unsigned long *)&shinfo->evtchn_pending;
 		mask_bits = (unsigned long *)&shinfo->evtchn_mask;
-		port_word_bit = xe->port / 64;
 	} else {
 		struct compat_shared_info *shinfo = gpc->khva;
+
 		pending_bits = (unsigned long *)&shinfo->evtchn_pending;
 		mask_bits = (unsigned long *)&shinfo->evtchn_mask;
-		port_word_bit = xe->port / 32;
 	}
 
-	/*
-	 * If this port wasn't already set, and if it isn't masked, then
-	 * we try to set the corresponding bit in the in-kernel shadow of
-	 * evtchn_pending_sel for the target vCPU. And if *that* wasn't
-	 * already set, then we kick the vCPU in question to write to the
-	 * *real* evtchn_pending_sel in its own guest vcpu_info struct.
-	 */
-	if (test_and_set_bit(xe->port, pending_bits)) {
+	if (test_and_set_bit(port, pending_bits)) {
 		rc = 0; /* It was already raised */
-	} else if (test_bit(xe->port, mask_bits)) {
-		rc = -ENOTCONN; /* Masked */
-		kvm_xen_check_poller(vcpu, xe->port);
+	} else if (test_bit(port, mask_bits)) {
+		rc = -ENOTCONN; /* It is masked */
+		kvm_xen_check_poller(vcpu, port);
 	} else {
-		rc = 1; /* Delivered to the bitmap in shared_info. */
-		/* Now switch to the vCPU's vcpu_info to set the index and pending_sel */
-		read_unlock_irqrestore(&gpc->lock, flags);
-		gpc = &vcpu->arch.xen.vcpu_info_cache;
+		rc = 1; /* It is newly raised */
+	}
 
-		read_lock_irqsave(&gpc->lock, flags);
-		if (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-			/*
-			 * Could not access the vcpu_info. Set the bit in-kernel
-			 * and prod the vCPU to deliver it for itself.
-			 */
+ out:
+	read_unlock_irqrestore(&gpc->lock, flags);
+	return rc;
+}
+
+static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
+{
+	struct kvm *kvm = vcpu->kvm;
+	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
+	unsigned long flags;
+	bool kick_vcpu = false;
+
+	read_lock_irqsave(&gpc->lock, flags);
+
+	/*
+	 * Try to deliver the event directly to the vcpu_info. If successful and
+	 * the guest is using upcall_vector delivery, send the MSI.
+	 * If the pfncache is invalid, set the shadow. In this case, or if the
+	 * guest is using another form of event delivery, the vCPU must be
+	 * kicked to complete the delivery.
+	 */
+	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
+		struct vcpu_info *vcpu_info = gpc->khva;
+		int port_word_bit = port / 64;
+
+		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
-			goto out_rcu;
+			goto out;
 		}
 
-		if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
-			struct vcpu_info *vcpu_info = gpc->khva;
-			if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
-				WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
-				kick_vcpu = true;
-			}
-		} else {
-			struct compat_vcpu_info *vcpu_info = gpc->khva;
-			if (!test_and_set_bit(port_word_bit,
-					      (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
-				WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+		if (!test_and_set_bit(port_word_bit, &vcpu_info->evtchn_pending_sel)) {
+			WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+			kick_vcpu = true;
+		}
+	} else {
+		struct compat_vcpu_info *vcpu_info = gpc->khva;
+		int port_word_bit = port / 32;
+
+		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
-			}
+			goto out;
 		}
 
-		/* For the per-vCPU lapic vector, deliver it as MSI. */
-		if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
-			kvm_xen_inject_vcpu_vector(vcpu);
-			kick_vcpu = false;
+		if (!test_and_set_bit(port_word_bit,
+				      (unsigned long *)&vcpu_info->evtchn_pending_sel)) {
+			WRITE_ONCE(vcpu_info->evtchn_upcall_pending, 1);
+			kick_vcpu = true;
 		}
 	}
 
- out_rcu:
+	if (kick_vcpu && vcpu->arch.xen.upcall_vector) {
+		kvm_xen_inject_vcpu_vector(vcpu);
+		kick_vcpu = false;
+	}
+
+ out:
 	read_unlock_irqrestore(&gpc->lock, flags);
+	return kick_vcpu;
+}
+
+/*
+ * The return value from this function is propagated to kvm_set_irq() API,
+ * so it returns:
+ *  < 0   Interrupt was ignored (masked or not delivered for other reasons)
+ *  = 0   Interrupt was coalesced (previous irq is still pending)
+ *  > 0   Number of CPUs interrupt was delivered to
+ *
+ * It is also called directly from kvm_arch_set_irq_inatomic(), where the
+ * only check on its return value is a comparison with -EWOULDBLOCK
+ * (which may be returned by set_shinfo_evtchn_pending()).
+ */
+int kvm_xen_set_evtchn_fast(struct kvm_xen_evtchn *xe, struct kvm *kvm)
+{
+	struct kvm_vcpu *vcpu;
+	bool kick_vcpu = false;
+	int vcpu_idx, idx, rc;
+
+	vcpu_idx = READ_ONCE(xe->vcpu_idx);
+	if (vcpu_idx >= 0)
+		vcpu = kvm_get_vcpu(kvm, vcpu_idx);
+	else {
+		vcpu = kvm_get_vcpu_by_id(kvm, xe->vcpu_id);
+		if (!vcpu)
+			return -EINVAL;
+		WRITE_ONCE(xe->vcpu_idx, vcpu->vcpu_idx);
+	}
+
+	if (xe->port >= max_evtchn_port(kvm))
+		return -EINVAL;
+
+	idx = srcu_read_lock(&kvm->srcu);
+
+	rc = set_shinfo_evtchn_pending(vcpu, xe->port);
+	if (rc == 1) /* Delivered to the bitmap in shared_info */
+		kick_vcpu = set_vcpu_info_evtchn_pending(vcpu, xe->port);
+
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	if (kick_vcpu) {
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (5 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 6/8] KVM: x86/xen: split up kvm_xen_set_evtchn_fast() David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-27 14:43   ` Steven Rostedt
  2024-02-27 11:49 ` [PATCH v2 8/8] KVM: pfncache: clean up rwlock abuse David Woodhouse
                   ` (2 subsequent siblings)
  9 siblings, 1 reply; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	Paul Durrant, David Woodhouse, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Steven Rostedt, Dave Hansen,
	H. Peter Anvin, David Woodhouse, x86

From: Paul Durrant <pdurrant@amazon.com>

As described in [1] compiling with CONFIG_PROVE_RAW_LOCK_NESTING shows that
kvm_xen_set_evtchn_fast() is blocking on pfncache locks in IRQ context.
There is only actually blocking with PREEMPT_RT because the locks will
turned into mutexes. There is no 'raw' version of rwlock_t that can be used
to avoid that, so use read_trylock() and treat failure to lock the same as
an invalid cache.

[1] https://lore.kernel.org/lkml/99771ef3a4966a01fefd3adbb2ba9c3a75f97cf2.camel@infradead.org/T/#mbd06e5a04534ce9c0ee94bd8f1e8d942b2d45bd6
Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Reviewed-by: David Woodhouse <dwmw@amazon.co.uk>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steven Rostedt <rostedt@goodmis.org>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: David Woodhouse <dwmw2@infradead.org>
Cc: x86@kernel.org

v2:
 • Use read_trylock only in interrupt context, to avoid concerns about
   unfairness in the slow path.
---
 arch/x86/kvm/xen.c | 41 +++++++++++++++++++++++++++++++----------
 1 file changed, 31 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index c16b6d394d55..d8b5326ecebc 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	unsigned long flags;
 	int rc = -EWOULDBLOCK;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	local_irq_save(flags);
+	if (!read_trylock(&gpc->lock)) {
+		/*
+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
+		 * turned into mutexes and most interrupts are threaded.
+		 * But timer events may be delivered in hardirq mode due
+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
+		 * path if the trylock fails in interrupt context.
+		 */
+		if (in_interrupt())
+			goto out;
+
+		read_lock(&gpc->lock);
+	}
+
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out;
+		goto out_unlock;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
@@ -1761,8 +1775,10 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		rc = 1; /* It is newly raised */
 	}
 
+ out_unlock:
+	read_unlock(&gpc->lock);
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	local_irq_restore(flags);
 	return rc;
 }
 
@@ -1772,21 +1788,23 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
 	unsigned long flags;
 	bool kick_vcpu = false;
+	bool locked;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	local_irq_save(flags);
+	locked = read_trylock(&gpc->lock);
 
 	/*
 	 * Try to deliver the event directly to the vcpu_info. If successful and
 	 * the guest is using upcall_vector delivery, send the MSI.
-	 * If the pfncache is invalid, set the shadow. In this case, or if the
-	 * guest is using another form of event delivery, the vCPU must be
-	 * kicked to complete the delivery.
+	 * If the pfncache lock is contended or the cache is invalid, set the
+	 * shadow. In this case, or if the guest is using another form of event
+	 * delivery, the vCPU must be kicked to complete the delivery.
 	 */
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 64;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1800,7 +1818,7 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		struct compat_vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 32;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1819,7 +1837,10 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	}
 
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	if (locked)
+		read_unlock(&gpc->lock);
+
+	local_irq_restore(flags);
 	return kick_vcpu;
 }
 
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v2 8/8] KVM: pfncache: clean up rwlock abuse
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (6 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast() David Woodhouse
@ 2024-02-27 11:49 ` David Woodhouse
  2024-02-29 23:12 ` [PATCH v2 0/8] KVM: x86/xen updates Sean Christopherson
  2024-03-05  0:35 ` Sean Christopherson
  9 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 11:49 UTC (permalink / raw
  To: kvm
  Cc: Sean Christopherson, Paul Durrant, Paolo Bonzini, Michal Luczaj,
	David Woodhouse, Paul Durrant

From: David Woodhouse <dwmw@amazon.co.uk>

There is a rwlock in ->lock which readers take to ensure protection
against concurrent changes. But __kvm_gpc_refresh() makes assumptions
that certain fields will not change even while it drops the write lock
and performs MM operations to revalidate the target PFN and kernel
mapping.

Those assumptions *are* valid, because a previous commit expanded the
coverage of the ->refresh_lock mutex to ensure serialization and that
nothing else can change those fields while __kvm_gpc_refresh() drops
the rwlock. But this is not good practice.

Clean up the semantics of hva_to_pfn_retry() so that it no longer does
any locking gymnastics because it no longer operates on the gpc object
at all. It is now called with a uhva and simply returns the
corresponding pfn (pinned), and a mapped khva for it.

Its caller __kvm_gpc_refresh() now sets gpc->uhva and clears gpc->valid
before dropping ->lock, calling hva_to_pfn_retry() and retaking ->lock
for write.

If hva_to_pfn_retry() fails, *or* if the ->uhva or ->active fields in
the gpc changed while the lock was dropped, the new mapping is discarded
and the gpc is not modified. On success with an unchanged gpc, the new
mapping is installed and the current ->pfn and ->uhva are taken into the
local old_pfn and old_khva variables to be unmapped once the locks are
all released.

This simplification means that ->refresh_lock is no longer needed for
correctness, but it does still provide a minor optimisation because it
will prevent two concurrent __kvm_gpc_refresh() calls from mapping a
given PFN, only for one of them to lose the race and discard its
mapping.

The optimisation in hva_to_pfn_retry() where it attempts to use the old
mapping if the pfn doesn't change is dropped, since it makes the pinning
more complex. It's a pointless optimisation anyway, since the odds of
the pfn ending up the same when the uhva has changed (i.e. the odds of
the two userspace addresses both pointing to the same underlying
physical page) are negligible,

The 'hva_changed' local variable in __kvm_gpc_refresh() is also removed,
since it's simpler just to clear gpc->valid if the uhva changed.
Likewise the unmap_old variable is dropped because it's just as easy to
check the old_pfn variable for KVM_PFN_ERR_FAULT.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <pdurrant@amazon.com>
---
Cc: Sean Christopherson <seanjc@google.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
---
 virt/kvm/pfncache.c | 182 +++++++++++++++++++++-----------------------
 1 file changed, 87 insertions(+), 95 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 43d67f8f064e..f47c1fc44f58 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -139,107 +139,65 @@ static inline bool mmu_notifier_retry_cache(struct kvm *kvm, unsigned long mmu_s
 	return kvm->mmu_invalidate_seq != mmu_seq;
 }
 
-static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
+/*
+ * Given a user virtual address, obtain a pinned host PFN and kernel mapping
+ * for it. The caller will release the PFN after installing it into the GPC
+ * so that the MMU notifier invalidation mechanism is active.
+ */
+static kvm_pfn_t hva_to_pfn_retry(struct kvm *kvm, unsigned long uhva,
+				  kvm_pfn_t *pfn, void **khva)
 {
 	/* Note, the new page offset may be different than the old! */
-	void *old_khva = (void *)PAGE_ALIGN_DOWN((uintptr_t)gpc->khva);
 	kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
 	void *new_khva = NULL;
 	unsigned long mmu_seq;
 
-	lockdep_assert_held(&gpc->refresh_lock);
-
-	lockdep_assert_held_write(&gpc->lock);
-
-	/*
-	 * Invalidate the cache prior to dropping gpc->lock, the gpa=>uhva
-	 * assets have already been updated and so a concurrent check() from a
-	 * different task may not fail the gpa/uhva/generation checks.
-	 */
-	gpc->valid = false;
-
-	do {
-		mmu_seq = gpc->kvm->mmu_invalidate_seq;
+	for (;;) {
+		mmu_seq = kvm->mmu_invalidate_seq;
 		smp_rmb();
 
-		write_unlock_irq(&gpc->lock);
-
-		/*
-		 * If the previous iteration "failed" due to an mmu_notifier
-		 * event, release the pfn and unmap the kernel virtual address
-		 * from the previous attempt.  Unmapping might sleep, so this
-		 * needs to be done after dropping the lock.  Opportunistically
-		 * check for resched while the lock isn't held.
-		 */
-		if (new_pfn != KVM_PFN_ERR_FAULT) {
-			/*
-			 * Keep the mapping if the previous iteration reused
-			 * the existing mapping and didn't create a new one.
-			 */
-			if (new_khva != old_khva)
-				gpc_unmap(new_pfn, new_khva);
-
-			kvm_release_pfn_clean(new_pfn);
-
-			cond_resched();
-		}
-
 		/* We always request a writeable mapping */
-		new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+		new_pfn = hva_to_pfn(uhva, false, false, NULL, true, NULL);
 		if (is_error_noslot_pfn(new_pfn))
-			goto out_error;
+			return -EFAULT;
 
 		/*
-		 * Obtain a new kernel mapping if KVM itself will access the
-		 * pfn.  Note, kmap() and memremap() can both sleep, so this
-		 * too must be done outside of gpc->lock!
+		 * Always obtain a new kernel mapping. Trying to reuse an
+		 * existing one is more complex than it's worth.
 		 */
-		if (new_pfn == gpc->pfn)
-			new_khva = old_khva;
-		else
-			new_khva = gpc_map(new_pfn);
-
+		new_khva = gpc_map(new_pfn);
 		if (!new_khva) {
 			kvm_release_pfn_clean(new_pfn);
-			goto out_error;
+			return -EFAULT;
 		}
 
-		write_lock_irq(&gpc->lock);
+		if (!mmu_notifier_retry_cache(kvm, mmu_seq))
+			break;
 
 		/*
-		 * Other tasks must wait for _this_ refresh to complete before
-		 * attempting to refresh.
+		 * If this iteration "failed" due to an mmu_notifier event,
+		 * release the pfn and unmap the kernel virtual address, and
+		 * loop around again.
 		 */
-		WARN_ON_ONCE(gpc->valid);
-	} while (mmu_notifier_retry_cache(gpc->kvm, mmu_seq));
-
-	gpc->valid = true;
-	gpc->pfn = new_pfn;
-	gpc->khva = new_khva + offset_in_page(gpc->uhva);
+		if (new_pfn != KVM_PFN_ERR_FAULT) {
+			gpc_unmap(new_pfn, new_khva);
+			kvm_release_pfn_clean(new_pfn);
+		}
+	}
 
-	/*
-	 * Put the reference to the _new_ pfn.  The pfn is now tracked by the
-	 * cache and can be safely migrated, swapped, etc... as the cache will
-	 * invalidate any mappings in response to relevant mmu_notifier events.
-	 */
-	kvm_release_pfn_clean(new_pfn);
+	*pfn = new_pfn;
+	*khva = new_khva;
 
 	return 0;
-
-out_error:
-	write_lock_irq(&gpc->lock);
-
-	return -EFAULT;
 }
 
-static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
-			     unsigned long len)
+static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa,
+			     unsigned long uhva, unsigned long len)
 {
-	unsigned long page_offset;
-	bool unmap_old = false;
+	unsigned long page_offset = kvm_is_error_gpa(gpa) ?
+		offset_in_page(uhva) : offset_in_page(gpa);
 	unsigned long old_uhva;
-	kvm_pfn_t old_pfn;
-	bool hva_change = false;
+	kvm_pfn_t old_pfn = KVM_PFN_ERR_FAULT;
 	void *old_khva;
 	int ret;
 
@@ -275,7 +233,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 		gpc->uhva = PAGE_ALIGN_DOWN(uhva);
 
 		if (gpc->uhva != old_uhva)
-			hva_change = true;
+			gpc->valid = false;
 	} else {
 		struct kvm_memslots *slots = kvm_memslots(gpc->kvm);
 
@@ -290,7 +248,11 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
 			if (kvm_is_error_hva(gpc->uhva)) {
 				ret = -EFAULT;
-				goto out;
+
+				gpc->valid = false;
+				gpc->pfn = KVM_PFN_ERR_FAULT;
+				gpc->khva = NULL;
+				goto out_unlock;
 			}
 
 			/*
@@ -298,7 +260,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 			 * HVA may still be the same.
 			 */
 			if (gpc->uhva != old_uhva)
-				hva_change = true;
+				gpc->valid = false;
 		} else {
 			gpc->uhva = old_uhva;
 		}
@@ -311,9 +273,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	 * If the userspace HVA changed or the PFN was already invalid,
 	 * drop the lock and do the HVA to PFN lookup again.
 	 */
-	if (!gpc->valid || hva_change) {
-		ret = hva_to_pfn_retry(gpc);
-	} else {
+	if (gpc->valid) {
 		/*
 		 * If the HVA→PFN mapping was already valid, don't unmap it.
 		 * But do update gpc->khva because the offset within the page
@@ -321,28 +281,60 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 		 */
 		gpc->khva = old_khva + page_offset;
 		ret = 0;
-		goto out_unlock;
-	}
 
- out:
-	/*
-	 * Invalidate the cache and purge the pfn/khva if the refresh failed.
-	 * Some/all of the uhva, gpa, and memslot generation info may still be
-	 * valid, leave it as is.
-	 */
-	if (ret) {
+		/* old_pfn must not be unmapped because it was reused. */
+		old_pfn = KVM_PFN_ERR_FAULT;
+	} else {
+		kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
+		unsigned long new_uhva = gpc->uhva;
+		void *new_khva = NULL;
+
+		/*
+		 * Invalidate the cache prior to dropping gpc->lock; the
+		 * gpa=>uhva assets have already been updated and so a
+		 * concurrent check() from a different task may not fail
+		 * the gpa/uhva/generation checks as it should.
+		 */
 		gpc->valid = false;
-		gpc->pfn = KVM_PFN_ERR_FAULT;
-		gpc->khva = NULL;
-	}
 
-	/* Detect a pfn change before dropping the lock! */
-	unmap_old = (old_pfn != gpc->pfn);
+		write_unlock_irq(&gpc->lock);
+
+		ret = hva_to_pfn_retry(gpc->kvm, new_uhva, &new_pfn, &new_khva);
+
+		write_lock_irq(&gpc->lock);
+
+		WARN_ON_ONCE(gpc->valid);
+
+		if (ret || !gpc->active || gpc->uhva != new_uhva) {
+			/*
+			 * On failure or if another change occurred while the
+			 * lock was dropped, just purge the new mapping.
+			 */
+			old_pfn = new_pfn;
+			old_khva = new_khva;
+		} else {
+			old_pfn = gpc->pfn;
+			old_khva = gpc->khva;
+
+			gpc->pfn = new_pfn;
+			gpc->khva = new_khva + offset_in_page(gpc->uhva);
+			gpc->valid = true;
+		}
+
+		/*
+		 * Put the reference to the _new_ pfn. On success, the
+		 * pfn is now tracked by the cache and can safely be
+		 * migrated, swapped, etc. as the cache will invalidate
+		 * any mappings in response to relevant mmu_notifier
+		 * events.
+		 */
+		kvm_release_pfn_clean(new_pfn);
+	}
 
 out_unlock:
 	write_unlock_irq(&gpc->lock);
 
-	if (unmap_old)
+	if (old_pfn != KVM_PFN_ERR_FAULT)
 		gpc_unmap(old_pfn, old_khva);
 
 	return ret;
-- 
2.43.0


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 11:49 ` [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast() David Woodhouse
@ 2024-02-27 14:43   ` Steven Rostedt
  2024-02-27 17:00     ` David Woodhouse
  2024-03-08 18:10     ` David Woodhouse
  0 siblings, 2 replies; 26+ messages in thread
From: Steven Rostedt @ 2024-02-27 14:43 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, x86

On Tue, 27 Feb 2024 11:49:21 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> index c16b6d394d55..d8b5326ecebc 100644
> --- a/arch/x86/kvm/xen.c
> +++ b/arch/x86/kvm/xen.c
> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>  	unsigned long flags;
>  	int rc = -EWOULDBLOCK;
>  
> -	read_lock_irqsave(&gpc->lock, flags);
> +	local_irq_save(flags);
> +	if (!read_trylock(&gpc->lock)) {

Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
so in non RT and the taking a mutex.

Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
interrupts.

Not really sure what the best solution is here.

-- Steve


> +		/*
> +		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
> +		 * turned into mutexes and most interrupts are threaded.
> +		 * But timer events may be delivered in hardirq mode due
> +		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
> +		 * path if the trylock fails in interrupt context.
> +		 */
> +		if (in_interrupt())
> +			goto out;
> +
> +		read_lock(&gpc->lock);
> +	}
> +
>  	if (!kvm_gpc_check(gpc, PAGE_SIZE))
> -		goto out;
> +		goto out_unlock;
>  
>  	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
>  		struct shared_info *shinfo = gpc->khva;

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 14:43   ` Steven Rostedt
@ 2024-02-27 17:00     ` David Woodhouse
  2024-02-27 17:18       ` Thomas Gleixner
  2024-02-27 17:25       ` Steven Rostedt
  2024-03-08 18:10     ` David Woodhouse
  1 sibling, 2 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 17:00 UTC (permalink / raw
  To: Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, x86

On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>On Tue, 27 Feb 2024 11:49:21 +0000
>David Woodhouse <dwmw2@infradead.org> wrote:
>
>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>> index c16b6d394d55..d8b5326ecebc 100644
>> --- a/arch/x86/kvm/xen.c
>> +++ b/arch/x86/kvm/xen.c
>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>  	unsigned long flags;
>>  	int rc = -EWOULDBLOCK;
>>  
>> -	read_lock_irqsave(&gpc->lock, flags);
>> +	local_irq_save(flags);
>> +	if (!read_trylock(&gpc->lock)) {
>
>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>so in non RT and the taking a mutex.

Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even though RT turns it into a mutex this is only a trylock with a fall back to a slow path in process context, so it ends up being a very short period of irqdisable/lock – just to validate the target address of the cache, and write there. (Or fall back to that slow path, if the cache needs revalidating).

So I think this is probably OK, but...

>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>interrupts.

... that's kind of odd. I think there's code elsewhere which assumes it will, and pairs it with an explicit local_irq_save() like the above, in a different atomic context (updating runstate time info when being descheduled). So *that* needs changing for RT?

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 17:00     ` David Woodhouse
@ 2024-02-27 17:18       ` Thomas Gleixner
  2024-02-27 17:22         ` David Woodhouse
  2024-02-27 17:25       ` Steven Rostedt
  1 sibling, 1 reply; 26+ messages in thread
From: Thomas Gleixner @ 2024-02-27 17:18 UTC (permalink / raw
  To: David Woodhouse, Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, H. Peter Anvin, x86

On Tue, Feb 27 2024 at 17:00, David Woodhouse wrote:
> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>>On Tue, 27 Feb 2024 11:49:21 +0000
>>David Woodhouse <dwmw2@infradead.org> wrote:
>>
>>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>>> index c16b6d394d55..d8b5326ecebc 100644
>>> --- a/arch/x86/kvm/xen.c
>>> +++ b/arch/x86/kvm/xen.c
>>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>>  	unsigned long flags;
>>>  	int rc = -EWOULDBLOCK;
>>>  
>>> -	read_lock_irqsave(&gpc->lock, flags);
>>> +	local_irq_save(flags);
>>> +	if (!read_trylock(&gpc->lock)) {
>>
>>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>>so in non RT and the taking a mutex.
>
> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And
> even though RT turns it into a mutex this is only a trylock with a
> fall back to a slow path in process context, so it ends up being a
> very short period of irqdisable/lock – just to validate the target
> address of the cache, and write there. (Or fall back to that slow
> path, if the cache needs revalidating).
>
> So I think this is probably OK, but...
>
>>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>>interrupts.
>
> ... that's kind of odd. I think there's code elsewhere which assumes
> it will, and pairs it with an explicit local_irq_save() like the
> above, in a different atomic context (updating runstate time info when
> being descheduled).

While it is legit, it simply cannot work for RT. We fixed the few places
which had it open coded (mostly for no good reason).

> So *that* needs changing for RT?

No. It's not required anywhere and we really don't change that just to
accomodate the xen shim.

I'll look at the problem at hand tomorrow.

Thanks,

        tglx

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 17:18       ` Thomas Gleixner
@ 2024-02-27 17:22         ` David Woodhouse
  2024-03-07  9:27           ` David Woodhouse
  0 siblings, 1 reply; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 17:22 UTC (permalink / raw
  To: Thomas Gleixner, Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, H. Peter Anvin, x86

On 27 February 2024 17:18:58 GMT, Thomas Gleixner <tglx@linutronix.de> wrote:
>On Tue, Feb 27 2024 at 17:00, David Woodhouse wrote:
>> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>>>On Tue, 27 Feb 2024 11:49:21 +0000
>>>David Woodhouse <dwmw2@infradead.org> wrote:
>>>
>>>> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>>>> index c16b6d394d55..d8b5326ecebc 100644
>>>> --- a/arch/x86/kvm/xen.c
>>>> +++ b/arch/x86/kvm/xen.c
>>>> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>>>>  	unsigned long flags;
>>>>  	int rc = -EWOULDBLOCK;
>>>>  
>>>> -	read_lock_irqsave(&gpc->lock, flags);
>>>> +	local_irq_save(flags);
>>>> +	if (!read_trylock(&gpc->lock)) {
>>>
>>>Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>>>so in non RT and the taking a mutex.
>>
>> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And
>> even though RT turns it into a mutex this is only a trylock with a
>> fall back to a slow path in process context, so it ends up being a
>> very short period of irqdisable/lock – just to validate the target
>> address of the cache, and write there. (Or fall back to that slow
>> path, if the cache needs revalidating).
>>
>> So I think this is probably OK, but...
>>
>>>Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
>>>interrupts.
>>
>> ... that's kind of odd. I think there's code elsewhere which assumes
>> it will, and pairs it with an explicit local_irq_save() like the
>> above, in a different atomic context (updating runstate time info when
>> being descheduled).
>
>While it is legit, it simply cannot work for RT. We fixed the few places
>which had it open coded (mostly for no good reason).
>
>> So *that* needs changing for RT?
>
>No. It's not required anywhere and we really don't change that just to
>accomodate the xen shim.

It's open-coded in the runstate update in the Xen shim, as I said. That's what needs changing, which is separate from the problem at hand in this patch.

>I'll look at the problem at hand tomorrow.

Ta.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 17:00     ` David Woodhouse
  2024-02-27 17:18       ` Thomas Gleixner
@ 2024-02-27 17:25       ` Steven Rostedt
  2024-02-27 17:28         ` David Woodhouse
  1 sibling, 1 reply; 26+ messages in thread
From: Steven Rostedt @ 2024-02-27 17:25 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, x86

On Tue, 27 Feb 2024 17:00:42 +0000
David Woodhouse <dwmw2@infradead.org> wrote:

> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
> >On Tue, 27 Feb 2024 11:49:21 +0000
> >David Woodhouse <dwmw2@infradead.org> wrote:
> >  
> >> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
> >> index c16b6d394d55..d8b5326ecebc 100644
> >> --- a/arch/x86/kvm/xen.c
> >> +++ b/arch/x86/kvm/xen.c
> >> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
> >>  	unsigned long flags;
> >>  	int rc = -EWOULDBLOCK;
> >>  
> >> -	read_lock_irqsave(&gpc->lock, flags);
> >> +	local_irq_save(flags);
> >> +	if (!read_trylock(&gpc->lock)) {  
> >
> >Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
> >so in non RT and the taking a mutex.  
> 
> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even
> though RT turns it into a mutex this is only a trylock with a fall back
> to a slow path in process context, so it ends up being a very short
> period of irqdisable/lock – just to validate the target address of the
> cache, and write there. (Or fall back to that slow path, if the cache
> needs revalidating).

Yes, but if the trylock fails, you then call:

+	if (!read_trylock(&gpc->lock)) {
+		/*
+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
+		 * turned into mutexes and most interrupts are threaded.
+		 * But timer events may be delivered in hardirq mode due
+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
+		 * path if the trylock fails in interrupt context.
+		 */
+		if (in_interrupt())
+			goto out;
+
+		read_lock(&gpc->lock);

That takes the read_lock() with interrupts still disabled. On RT, that will
likely schedule as it had just failed a trylock.

+	}
+
> 
> So I think this is probably OK, but...
> 
> 
> >Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to
> >re-enable interrupts.  
> 
> ... that's kind of odd. I think there's code elsewhere which assumes it
> will, and pairs it with an explicit local_irq_save() like the above, in a
> different atomic context (updating runstate time info when being
> descheduled). So *that* needs changing for RT?


I see Thomas replied. I'll leave the rest to him.

-- Steve

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 17:25       ` Steven Rostedt
@ 2024-02-27 17:28         ` David Woodhouse
  0 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-02-27 17:28 UTC (permalink / raw
  To: Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, David Woodhouse, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Peter Zijlstra, Dave Hansen,
	H. Peter Anvin, x86

On 27 February 2024 17:25:52 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>On Tue, 27 Feb 2024 17:00:42 +0000
>David Woodhouse <dwmw2@infradead.org> wrote:
>
>> On 27 February 2024 14:43:26 GMT, Steven Rostedt <rostedt@goodmis.org> wrote:
>> >On Tue, 27 Feb 2024 11:49:21 +0000
>> >David Woodhouse <dwmw2@infradead.org> wrote:
>> >  
>> >> diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
>> >> index c16b6d394d55..d8b5326ecebc 100644
>> >> --- a/arch/x86/kvm/xen.c
>> >> +++ b/arch/x86/kvm/xen.c
>> >> @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
>> >>  	unsigned long flags;
>> >>  	int rc = -EWOULDBLOCK;
>> >>  
>> >> -	read_lock_irqsave(&gpc->lock, flags);
>> >> +	local_irq_save(flags);
>> >> +	if (!read_trylock(&gpc->lock)) {  
>> >
>> >Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
>> >so in non RT and the taking a mutex.  
>> 
>> Well, it *isn't* a mutex in non-RT. It's basically a spinlock. And even
>> though RT turns it into a mutex this is only a trylock with a fall back
>> to a slow path in process context, so it ends up being a very short
>> period of irqdisable/lock – just to validate the target address of the
>> cache, and write there. (Or fall back to that slow path, if the cache
>> needs revalidating).
>
>Yes, but if the trylock fails, you then call:
>
>+	if (!read_trylock(&gpc->lock)) {
>+		/*
>+		 * When PREEMPT_RT turns locks into mutexes, rwlocks are
>+		 * turned into mutexes and most interrupts are threaded.
>+		 * But timer events may be delivered in hardirq mode due
>+		 * to using HRTIMER_MODE_ABS_HARD. So bail to the slow
>+		 * path if the trylock fails in interrupt context.
>+		 */
>+		if (in_interrupt())
>+			goto out;
>+
>+		read_lock(&gpc->lock);
>
>That takes the read_lock() with interrupts still disabled. On RT, that will
>likely schedule as it had just failed a trylock.

Oh... pants. Right, so it really does need to use read_trylock_irqsave() in the non-IRQ case, *and* it needs to either read_unlock_irqrestore() or separately unlock and local_irq_restore(), according to which way it did the locking in the first place. Ick.



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/8] KVM: x86/xen updates
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (7 preceding siblings ...)
  2024-02-27 11:49 ` [PATCH v2 8/8] KVM: pfncache: clean up rwlock abuse David Woodhouse
@ 2024-02-29 23:12 ` Sean Christopherson
  2024-03-05  0:35 ` Sean Christopherson
  9 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2024-02-29 23:12 UTC (permalink / raw
  To: David Woodhouse; +Cc: kvm, Paul Durrant, Paolo Bonzini, Michal Luczaj

On Tue, Feb 27, 2024, David Woodhouse wrote:
> David Woodhouse (6):
>       KVM: x86/xen: improve accuracy of Xen timers
>       KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
>       KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
>       KVM: pfncache: simplify locking and make more self-contained
>       KVM: x86/xen: fix recursive deadlock in timer injection
>       KVM: pfncache: clean up rwlock abuse
> 
> Paul Durrant (2):
>       KVM: x86/xen: split up kvm_xen_set_evtchn_fast()
>       KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
> 
>  arch/x86/kvm/lapic.c |   5 +-
>  arch/x86/kvm/x86.c   |  61 +++++++++-
>  arch/x86/kvm/x86.h   |   1 +
>  arch/x86/kvm/xen.c   | 327 +++++++++++++++++++++++++++++++++------------------
>  arch/x86/kvm/xen.h   |  18 +++
>  virt/kvm/pfncache.c  | 216 +++++++++++++++++-----------------
>  6 files changed, 403 insertions(+), 225 deletions(-)

FYI, I'm planning on grabbing at least the first 3 for 6.9, but I'm off tomorrow
and don't want to risk having to fix breakage in -next, so it won't happen until
next week.

I might also grab 4 and 5, I just need to stare at that locking code a bit.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-02-27 11:49 ` [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
@ 2024-03-04 23:44   ` Sean Christopherson
  2024-03-05  9:29     ` Paul Durrant
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2024-03-04 23:44 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Paul Durrant, Paolo Bonzini, Michal Luczaj, David Woodhouse

On Tue, Feb 27, 2024, David Woodhouse wrote:
> +	/* Xen has a 'Linux workaround' in do_set_timer_op() which
> +	 * checks for negative absolute timeout values (caused by
> +	 * integer overflow), and for values about 13 days in the
> +	 * future (2^50ns) which would be caused by jiffies
> +	 * overflow. For those cases, it sets the timeout 100ms in
> +	 * the future (not *too* soon, since if a guest really did
> +	 * set a long timeout on purpose we don't want to keep
> +	 * churning CPU time by waking it up).
> +	 */

I'm going to massage this slightly, partly to take advantage of reduced indentation,
but also to call out when the workaround is applied.  Though in all honesty, the
extra context may just be in response to a PEBKAC on my end, as I misread the diff
multiple times.

> +	if (linux_wa) {
> +		if ((unlikely((int64_t)guest_abs < 0 ||

No need for a second set of parantheses around the unlikely.

> +			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {

And this can all easily fit into one if-statement.

> +			delta = 100 * NSEC_PER_MSEC;
> +			guest_abs = guest_now + delta;
> +		}
> +	}

This is what I'm going to commit, holler if it looks wrong (disclaimer: I've only
compile tested at this point).

	/*
	 * Xen has a 'Linux workaround' in do_set_timer_op() which checks for
	 * negative absolute timeout values (caused by integer overflow), and
	 * for values about 13 days in the future (2^50ns) which would be
	 * caused by jiffies overflow. For those cases, Xen sets the timeout
	 * 100ms in the future (not *too* soon, since if a guest really did
	 * set a long timeout on purpose we don't want to keep churning CPU
	 * time by waking it up).  Emulate Xen's workaround when starting the
	 * timer in response to __HYPERVISOR_set_timer_op.
	 */
	if (linux_wa &&
	    unlikely((int64_t)guest_abs < 0 ||
		     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
		delta = 100 * NSEC_PER_MSEC;
		guest_abs = guest_now + delta;
	}

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained
  2024-02-27 11:49 ` [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained David Woodhouse
@ 2024-03-05  0:08   ` Sean Christopherson
  0 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2024-03-05  0:08 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Paul Durrant, Paolo Bonzini, Michal Luczaj, David Woodhouse

On Tue, Feb 27, 2024, David Woodhouse wrote:
>  static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
>  			      unsigned long len)
>  {
>  	struct kvm *kvm = gpc->kvm;
> +	int ret;
> +
> +	mutex_lock(&gpc->refresh_lock);
>  
>  	if (!gpc->active) {
>  		if (KVM_BUG_ON(gpc->valid, kvm))

Heh, it's _just_ out of sight in this diff, but this has an error path that misses
the unlock.  The full context is:

	if (!gpc->active) {
		if (KVM_BUG_ON(gpc->valid, kvm))
			return -EIO;

At the risk of doing too much when applying, I think this is the perfect patch to
start introducing use of guard(mutex) in common KVM.  As a bonus, it's a noticeably
smaller diff.  Testing this now...

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org
[sean: use guard(mutex) to fix a missed unlock]
Signed-off-by: Sean Christopherson <seanjc@google.com>
---
 virt/kvm/pfncache.c | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 9ac8c9da4eda..4e07112a24c2 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -256,12 +256,7 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 	if (page_offset + len > PAGE_SIZE)
 		return -EINVAL;
 
-	/*
-	 * If another task is refreshing the cache, wait for it to complete.
-	 * There is no guarantee that concurrent refreshes will see the same
-	 * gpa, memslots generation, etc..., so they must be fully serialized.
-	 */
-	mutex_lock(&gpc->refresh_lock);
+	lockdep_assert_held(&gpc->refresh_lock);
 
 	write_lock_irq(&gpc->lock);
 
@@ -347,8 +342,6 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 out_unlock:
 	write_unlock_irq(&gpc->lock);
 
-	mutex_unlock(&gpc->refresh_lock);
-
 	if (unmap_old)
 		gpc_unmap(old_pfn, old_khva);
 
@@ -357,13 +350,16 @@ static int __kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned l
 
 int kvm_gpc_refresh(struct gfn_to_pfn_cache *gpc, unsigned long len)
 {
+	unsigned long uhva;
+
+	guard(mutex)(&gpc->refresh_lock);
+
 	/*
 	 * If the GPA is valid then ignore the HVA, as a cache can be GPA-based
 	 * or HVA-based, not both.  For GPA-based caches, the HVA will be
 	 * recomputed during refresh if necessary.
 	 */
-	unsigned long uhva = kvm_is_error_gpa(gpc->gpa) ? gpc->uhva :
-							  KVM_HVA_ERR_BAD;
+	uhva = kvm_is_error_gpa(gpc->gpa) ? gpc->uhva : KVM_HVA_ERR_BAD;
 
 	return __kvm_gpc_refresh(gpc, gpc->gpa, uhva, len);
 }
@@ -377,6 +373,7 @@ void kvm_gpc_init(struct gfn_to_pfn_cache *gpc, struct kvm *kvm)
 	gpc->pfn = KVM_PFN_ERR_FAULT;
 	gpc->gpa = INVALID_GPA;
 	gpc->uhva = KVM_HVA_ERR_BAD;
+	gpc->active = gpc->valid = false;
 }
 
 static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned long uhva,
@@ -384,6 +381,8 @@ static int __kvm_gpc_activate(struct gfn_to_pfn_cache *gpc, gpa_t gpa, unsigned
 {
 	struct kvm *kvm = gpc->kvm;
 
+	guard(mutex)(&gpc->refresh_lock);
+
 	if (!gpc->active) {
 		if (KVM_BUG_ON(gpc->valid, kvm))
 			return -EIO;
@@ -420,6 +419,8 @@ void kvm_gpc_deactivate(struct gfn_to_pfn_cache *gpc)
 	kvm_pfn_t old_pfn;
 	void *old_khva;
 
+	guard(mutex)(&gpc->refresh_lock);
+
 	if (gpc->active) {
 		/*
 		 * Deactivate the cache before removing it from the list, KVM

base-commit: 6d6f106db109251010423d8728d76a0260db5814
-- 


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 0/8] KVM: x86/xen updates
  2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
                   ` (8 preceding siblings ...)
  2024-02-29 23:12 ` [PATCH v2 0/8] KVM: x86/xen updates Sean Christopherson
@ 2024-03-05  0:35 ` Sean Christopherson
  9 siblings, 0 replies; 26+ messages in thread
From: Sean Christopherson @ 2024-03-05  0:35 UTC (permalink / raw
  To: Sean Christopherson, kvm, David Woodhouse
  Cc: Paul Durrant, Paolo Bonzini, Michal Luczaj

On Tue, 27 Feb 2024 11:49:14 +0000, David Woodhouse wrote:
> These apply to the kvm-x86/xen tree.
> 
> First, deal with the awful brokenness of the KVM clock, and its systemic
> drift especially when TSC scaling is used. This is a bit of a workaround
> for Xen timers where it hurts *most*, but it's actually easier in this
> case because there is a vCPU (and associated PV clock information) to
> use for the scaling. A better fix for __get_kvmclock() is in the works,
> but there's an enormous yak to shave there because there are so many
> interrelated bugs in the TSC and timekeeping code.
> 
> [...]

Applied 1-5 to kvm-x86 xen.  Please take a look and test the result (patches 1
and 4 in particular).  I didn't _intend_ to make any functional changes outside
of fixing up the unlock goof, but I'd greatly appreciate extra eyeballs,
especially this close to the merge window.

Oh, and can you look at v2[*] of Vitaly's fixes for xen_shinfo_test?  I'd like
to get that applied soonish (I see intermittent failures), but I'm nowhere near
competent enough with clocks to give it a proper review.

Thanks!

[*] https://lore.kernel.org/all/20240206151950.31174-1-vkuznets@redhat.com


[1/8] KVM: x86/xen: improve accuracy of Xen timers
      https://github.com/kvm-x86/linux/commit/451a707813ae
[2/8] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled
      https://github.com/kvm-x86/linux/commit/8e62bf2bfa46
[3/8] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery
      https://github.com/kvm-x86/linux/commit/66e3cf729b1e
[4/8] KVM: pfncache: simplify locking and make more self-contained
      https://github.com/kvm-x86/linux/commit/6addfcf27139
[5/8] KVM: x86/xen: fix recursive deadlock in timer injection
      https://github.com/kvm-x86/linux/commit/7a36d680658b
[6/8] KVM: x86/xen: split up kvm_xen_set_evtchn_fast()
      (not applied)
[7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
      (not applied)
[8/8] KVM: pfncache: clean up rwlock abuse
      (not applied)

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-03-04 23:44   ` Sean Christopherson
@ 2024-03-05  9:29     ` Paul Durrant
  2024-03-05 21:12       ` Sean Christopherson
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Durrant @ 2024-03-05  9:29 UTC (permalink / raw
  To: Sean Christopherson, David Woodhouse
  Cc: kvm, Paolo Bonzini, Michal Luczaj, David Woodhouse

On 04/03/2024 23:44, Sean Christopherson wrote:
> On Tue, Feb 27, 2024, David Woodhouse wrote:
>> +	/* Xen has a 'Linux workaround' in do_set_timer_op() which
>> +	 * checks for negative absolute timeout values (caused by
>> +	 * integer overflow), and for values about 13 days in the
>> +	 * future (2^50ns) which would be caused by jiffies
>> +	 * overflow. For those cases, it sets the timeout 100ms in
>> +	 * the future (not *too* soon, since if a guest really did
>> +	 * set a long timeout on purpose we don't want to keep
>> +	 * churning CPU time by waking it up).
>> +	 */
> 
> I'm going to massage this slightly, partly to take advantage of reduced indentation,
> but also to call out when the workaround is applied.  Though in all honesty, the
> extra context may just be in response to a PEBKAC on my end, as I misread the diff
> multiple times.
> 
>> +	if (linux_wa) {
>> +		if ((unlikely((int64_t)guest_abs < 0 ||
> 
> No need for a second set of parantheses around the unlikely.
> 
>> +			      (delta > 0 && (uint32_t) (delta >> 50) != 0)))) {
> 
> And this can all easily fit into one if-statement.
> 
>> +			delta = 100 * NSEC_PER_MSEC;
>> +			guest_abs = guest_now + delta;
>> +		}
>> +	}
> 
> This is what I'm going to commit, holler if it looks wrong (disclaimer: I've only
> compile tested at this point).
> 
> 	/*
> 	 * Xen has a 'Linux workaround' in do_set_timer_op() which checks for
> 	 * negative absolute timeout values (caused by integer overflow), and
> 	 * for values about 13 days in the future (2^50ns) which would be
> 	 * caused by jiffies overflow. For those cases, Xen sets the timeout
> 	 * 100ms in the future (not *too* soon, since if a guest really did
> 	 * set a long timeout on purpose we don't want to keep churning CPU
> 	 * time by waking it up).  Emulate Xen's workaround when starting the
> 	 * timer in response to __HYPERVISOR_set_timer_op.
> 	 */
> 	if (linux_wa &&
> 	    unlikely((int64_t)guest_abs < 0 ||
> 		     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {

Now that I look at it again, since the last test is simply a '!= 0', I 
don't really see why the case is necessary. Perhaps lose that too. 
Otherwise LGTM.

> 		delta = 100 * NSEC_PER_MSEC;
> 		guest_abs = guest_now + delta;
> 	}


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-03-05  9:29     ` Paul Durrant
@ 2024-03-05 21:12       ` Sean Christopherson
  2024-03-06  9:48         ` Paul Durrant
  0 siblings, 1 reply; 26+ messages in thread
From: Sean Christopherson @ 2024-03-05 21:12 UTC (permalink / raw
  To: paul; +Cc: David Woodhouse, kvm, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

On Tue, Mar 05, 2024, Paul Durrant wrote:
> On 04/03/2024 23:44, Sean Christopherson wrote:
> > On Tue, Feb 27, 2024, David Woodhouse wrote:
> > 	/*
> > 	 * Xen has a 'Linux workaround' in do_set_timer_op() which checks for
> > 	 * negative absolute timeout values (caused by integer overflow), and
> > 	 * for values about 13 days in the future (2^50ns) which would be
> > 	 * caused by jiffies overflow. For those cases, Xen sets the timeout
> > 	 * 100ms in the future (not *too* soon, since if a guest really did
> > 	 * set a long timeout on purpose we don't want to keep churning CPU
> > 	 * time by waking it up).  Emulate Xen's workaround when starting the
> > 	 * timer in response to __HYPERVISOR_set_timer_op.
> > 	 */
> > 	if (linux_wa &&
> > 	    unlikely((int64_t)guest_abs < 0 ||
> > 		     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> 
> Now that I look at it again, since the last test is simply a '!= 0', I don't
> really see why the case is necessary. Perhaps lose that too. Otherwise LGTM.

Hmm, I think I'll keep it as-is purely so that the diff shows that it's a just
code movement.  There's already enough going on in in this patch, and practically
speaking I doubt anything other than checkpatch will ever care about the "!= 0".

Thanks!

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-03-05 21:12       ` Sean Christopherson
@ 2024-03-06  9:48         ` Paul Durrant
  2024-03-06  9:57           ` David Woodhouse
  0 siblings, 1 reply; 26+ messages in thread
From: Paul Durrant @ 2024-03-06  9:48 UTC (permalink / raw
  To: Sean Christopherson
  Cc: David Woodhouse, kvm, Paolo Bonzini, Michal Luczaj,
	David Woodhouse

On 05/03/2024 21:12, Sean Christopherson wrote:
> On Tue, Mar 05, 2024, Paul Durrant wrote:
>> On 04/03/2024 23:44, Sean Christopherson wrote:
>>> On Tue, Feb 27, 2024, David Woodhouse wrote:
>>> 	/*
>>> 	 * Xen has a 'Linux workaround' in do_set_timer_op() which checks for
>>> 	 * negative absolute timeout values (caused by integer overflow), and
>>> 	 * for values about 13 days in the future (2^50ns) which would be
>>> 	 * caused by jiffies overflow. For those cases, Xen sets the timeout
>>> 	 * 100ms in the future (not *too* soon, since if a guest really did
>>> 	 * set a long timeout on purpose we don't want to keep churning CPU
>>> 	 * time by waking it up).  Emulate Xen's workaround when starting the
>>> 	 * timer in response to __HYPERVISOR_set_timer_op.
>>> 	 */
>>> 	if (linux_wa &&
>>> 	    unlikely((int64_t)guest_abs < 0 ||
>>> 		     (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
>>
>> Now that I look at it again, since the last test is simply a '!= 0', I don't
>> really see why the case is necessary. Perhaps lose that too. Otherwise LGTM.
> 
> Hmm, I think I'll keep it as-is purely so that the diff shows that it's a just
> code movement.  There's already enough going on in in this patch, and practically
> speaking I doubt anything other than checkpatch will ever care about the "!= 0".
> 
> Thanks!

... and now I see I typo-ed 'cast' to 'case'... it was more that 
'(uint32_t)' than the superfluous '!= 0' that caught my eye but yes, I 
agree, it's code movement so separate cleanup is probably better.


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers
  2024-03-06  9:48         ` Paul Durrant
@ 2024-03-06  9:57           ` David Woodhouse
  0 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-03-06  9:57 UTC (permalink / raw
  To: paul, Sean Christopherson; +Cc: kvm, Paolo Bonzini, Michal Luczaj

[-- Attachment #1: Type: text/plain, Size: 2240 bytes --]

On Wed, 2024-03-06 at 09:48 +0000, Paul Durrant wrote:
> On 05/03/2024 21:12, Sean Christopherson wrote:
> > On Tue, Mar 05, 2024, Paul Durrant wrote:
> > > On 04/03/2024 23:44, Sean Christopherson wrote:
> > > > On Tue, Feb 27, 2024, David Woodhouse wrote:
> > > >         /*
> > > >          * Xen has a 'Linux workaround' in do_set_timer_op() which checks for
> > > >          * negative absolute timeout values (caused by integer overflow), and
> > > >          * for values about 13 days in the future (2^50ns) which would be
> > > >          * caused by jiffies overflow. For those cases, Xen sets the timeout
> > > >          * 100ms in the future (not *too* soon, since if a guest really did
> > > >          * set a long timeout on purpose we don't want to keep churning CPU
> > > >          * time by waking it up).  Emulate Xen's workaround when starting the
> > > >          * timer in response to __HYPERVISOR_set_timer_op.
> > > >          */
> > > >         if (linux_wa &&
> > > >             unlikely((int64_t)guest_abs < 0 ||
> > > >                      (delta > 0 && (uint32_t) (delta >> 50) != 0))) {
> > > 
> > > Now that I look at it again, since the last test is simply a '!= 0', I don't
> > > really see why the case is necessary. Perhaps lose that too. Otherwise LGTM.
> > 
> > Hmm, I think I'll keep it as-is purely so that the diff shows that it's a just
> > code movement.  There's already enough going on in in this patch, and practically
> > speaking I doubt anything other than checkpatch will ever care about the "!= 0".
> > 
> > Thanks!
> 
> ... and now I see I typo-ed 'cast' to 'case'... it was more that 
> '(uint32_t)' than the superfluous '!= 0' that caught my eye but yes, I 
> agree, it's code movement so separate cleanup is probably better.

Right. FWIW it looks like that because that's how it is in Xen:
https://xenbits.xen.org/gitweb/?p=xen.git;a=commitdiff;h=76e5a428e1e

I do agree that (uint32_t) cast is superfluous. I just don't know
whether I care more about that, or the (cosmetic) difference from how
Xen implements the check.

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 17:22         ` David Woodhouse
@ 2024-03-07  9:27           ` David Woodhouse
  2024-03-07 10:02             ` David Woodhouse
  0 siblings, 1 reply; 26+ messages in thread
From: David Woodhouse @ 2024-03-07  9:27 UTC (permalink / raw
  To: Thomas Gleixner, Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Dave Hansen, H. Peter Anvin, x86

[-- Attachment #1: Type: text/plain, Size: 2697 bytes --]

On Tue, 2024-02-27 at 17:22 +0000, David Woodhouse wrote:
> 
> > > > Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
> > > > interrupts.
> > > 
> > > ... that's kind of odd. I think there's code elsewhere which assumes
> > > it will, and pairs it with an explicit local_irq_save() like the
> > > above, in a different atomic context (updating runstate time info when
> > > being descheduled).
> > 
> > While it is legit, it simply cannot work for RT. We fixed the few places
> > which had it open coded (mostly for no good reason).
> > 
> > > So *that* needs changing for RT?
> > 
> > No. It's not required anywhere and we really don't change that just to
> > accomodate the xen shim.
> 
> It's open-coded in the runstate update in the Xen shim, as I said.
> That's what needs changing, which is separate from the problem at
> hand in this patch.
> 
> > I'll look at the problem at hand tomorrow.
> 
> Ta.

How about this? It's fugly as hell, it puts PREEMPT_RT knowledge into
the code to make the local trylock function *conditionally* disable
interrupts. I hate it, but let's start with is it even *correct*?

This is addressing the existing runstate code I mentioned above. If it
works then it could theoretically be used in the patch in $SUBJECT too,
instead of the open-coded irqsave and trylock.

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index efca40cc8c4f..78e23f151065 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -270,6 +270,26 @@ static void kvm_xen_init_timer(struct kvm_vcpu *vcpu)
 	vcpu->arch.xen.timer.function = xen_timer_callback;
 }
 
+/*
+ * With CONFIG_PREEMPT_RT, 'irqsave' locking doesn't really disable IRQs as
+ * "all IRQs are threaded" (except the hrtimer IRQs). So, open-coding a
+ * local_irq_save() before a read_trylock() is wrong for PREEMPT_RT.
+ */
+static inline bool read_trylock_irqsave(rwlock_t *lck, unsigned long *flags)
+{
+#ifndef CONFIG_PREEMPT_RT
+	local_irq_save(*flags);
+#endif
+	if (!read_trylock(lck)) {
+#ifndef CONFIG_PREEMPT_RT
+		local_irq_restore(*flags);
+#endif
+		return false;
+	}
+
+	return true;
+}
+
 static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 {
 	struct kvm_vcpu_xen *vx = &v->arch.xen;
@@ -373,11 +393,8 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * gfn_to_pfn caches that cover the region.
 	 */
 	if (atomic) {
-		local_irq_save(flags);
-		if (!read_trylock(&gpc1->lock)) {
-			local_irq_restore(flags);
+		if (!read_trylock_irqsave(&gpc1->lock, &flags))
 			return;
-		}
 	} else {
 		read_lock_irqsave(&gpc1->lock, flags);
 	}


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-03-07  9:27           ` David Woodhouse
@ 2024-03-07 10:02             ` David Woodhouse
  0 siblings, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-03-07 10:02 UTC (permalink / raw
  To: Thomas Gleixner, Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, Ingo Molnar, Borislav Petkov,
	Peter Zijlstra, Dave Hansen, H. Peter Anvin, x86

[-- Attachment #1: Type: text/plain, Size: 768 bytes --]

On Thu, 2024-03-07 at 09:27 +0000, David Woodhouse wrote:
> 
> How about this? It's fugly as hell, it puts PREEMPT_RT knowledge into
> the code to make the local trylock function *conditionally* disable
> interrupts. I hate it, but let's start with is it even *correct*?

Oh but wait...

The only reason we use read_lock_irqsave() in the first place is
because there was code which runs in interrupt context which takes the
same lock.

But Paul's patch is *changing* that, so the in-interrupt code will now
use read_trylock() instead, and can't deadlock. It always have to have
a fallback to a slow path anyway, for when the cache isn't valid.

So I think we can just drop the irqsave from *all* use of the locks in
question. More coffee required...

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast()
  2024-02-27 14:43   ` Steven Rostedt
  2024-02-27 17:00     ` David Woodhouse
@ 2024-03-08 18:10     ` David Woodhouse
  1 sibling, 0 replies; 26+ messages in thread
From: David Woodhouse @ 2024-03-08 18:10 UTC (permalink / raw
  To: Steven Rostedt
  Cc: kvm, Sean Christopherson, Paul Durrant, Paolo Bonzini,
	Michal Luczaj, Paul Durrant, Thomas Gleixner, Ingo Molnar,
	Borislav Petkov, Peter Zijlstra, Dave Hansen, H. Peter Anvin, x86

[-- Attachment #1: Type: text/plain, Size: 11594 bytes --]

On Tue, 2024-02-27 at 09:43 -0500, Steven Rostedt wrote:
> 
> > --- a/arch/x86/kvm/xen.c
> > +++ b/arch/x86/kvm/xen.c
> > @@ -1736,9 +1736,23 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
> >         unsigned long flags;
> >         int rc = -EWOULDBLOCK;
> >   
> > -       read_lock_irqsave(&gpc->lock, flags);
> > +       local_irq_save(flags);
> > +       if (!read_trylock(&gpc->lock)) {
> 
> Note, directly disabling interrupts in PREEMPT_RT is just as bad as doing
> so in non RT and the taking a mutex.
> 
> Worse yet, in PREEMPT_RT read_unlock_irqrestore() isn't going to re-enable
> interrupts.
> 
> Not really sure what the best solution is here.

I think the best solution is to realise that now Paul made the in-
interrupt code paths use read_trylock(), we don't even need to use
read_lock_irqsave() anywhere anyway, and then we don't have to *care*
that PREEMPT_RT does it differently. Going to do some testing with this
version...

From: David Woodhouse <dwmw@amazon.co.uk>
Subject: [PATCH] KVM: x86/xen: avoid blocking in hardirq context in
 kvm_xen_set_evtchn_fast()

As described in [1] compiling with CONFIG_PROVE_RAW_LOCK_NESTING shows that
kvm_xen_set_evtchn_fast() is blocking on pfncache locks in IRQ context.
There is only actually blocking with PREEMPT_RT because the locks will
turned into mutexes. There is no 'raw' version of rwlock_t that can be used
to avoid that, so use read_trylock() and treat failure to lock the same as
an invalid cache.

However, with PREEMPT_RT read_lock_irqsave() doesn't actually disable IRQs
either. So open-coding a trylock_irqsave() using local_irq_save() would be
wrong in the PREEMPT_RT case.

In fact, if kvm_xen_set_evtchn_fast() is now going to use a trylock anyway
when invoked in hardirq context, there's actually no *need* for interrupts
to be disabled by any other code path that takes the lock. So just switch
all other users to a plain read_lock() and then the different semantics
on PREEMPT_RT are not an issue.

Add a warning in __kvm_xen_has_interrupt() just to be sure that one never
does occur from interrupt context.

[1] https://lore.kernel.org/lkml/99771ef3a4966a01fefd3adbb2ba9c3a75f97cf2.camel@infradead.org/T/#mbd06e5a04534ce9c0ee94bd8f1e8d942b2d45bd6
Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery")
Fixes: bbe17c625d68 ("KVM: x86/xen: Fix potential deadlock in kvm_xen_update_runstate_guest()")
Co-developed-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: Paul Durrant <pdurrant@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/xen.c | 80 ++++++++++++++++++++++++----------------------
 1 file changed, 41 insertions(+), 39 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f8113eb8d040..ab3bbe75602d 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -45,7 +45,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	int ret = 0;
 	int idx = srcu_read_lock(&kvm->srcu);
 
-	read_lock_irq(&gpc->lock);
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, PAGE_SIZE)) {
 		read_unlock_irq(&gpc->lock);
 
@@ -53,7 +53,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 		if (ret)
 			goto out;
 
-		read_lock_irq(&gpc->lock);
+		read_lock(&gpc->lock);
 	}
 
 	/*
@@ -96,7 +96,7 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	smp_wmb();
 
 	wc->version = wc_version + 1;
-	read_unlock_irq(&gpc->lock);
+	read_unlock(&gpc->lock);
 
 	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
 
@@ -277,7 +277,6 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	struct gfn_to_pfn_cache *gpc2 = &vx->runstate2_cache;
 	size_t user_len, user_len1, user_len2;
 	struct vcpu_runstate_info rs;
-	unsigned long flags;
 	size_t times_ofs;
 	uint8_t *update_bit = NULL;
 	uint64_t entry_time;
@@ -373,16 +372,13 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	 * gfn_to_pfn caches that cover the region.
 	 */
 	if (atomic) {
-		local_irq_save(flags);
-		if (!read_trylock(&gpc1->lock)) {
-			local_irq_restore(flags);
+		if (!read_trylock(&gpc1->lock))
 			return;
-		}
 	} else {
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 	while (!kvm_gpc_check(gpc1, user_len1)) {
-		read_unlock_irqrestore(&gpc1->lock, flags);
+		read_unlock(&gpc1->lock);
 
 		/* When invoked from kvm_sched_out() we cannot sleep */
 		if (atomic)
@@ -391,7 +387,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		if (kvm_gpc_refresh(gpc1, user_len1))
 			return;
 
-		read_lock_irqsave(&gpc1->lock, flags);
+		read_lock(&gpc1->lock);
 	}
 
 	if (likely(!user_len2)) {
@@ -419,7 +415,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 		lock_set_subclass(&gpc1->lock.dep_map, 1, _THIS_IP_);
 		if (atomic) {
 			if (!read_trylock(&gpc2->lock)) {
-				read_unlock_irqrestore(&gpc1->lock, flags);
+				read_unlock(&gpc1->lock);
 				return;
 			}
 		} else {
@@ -428,7 +424,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 
 		if (!kvm_gpc_check(gpc2, user_len2)) {
 			read_unlock(&gpc2->lock);
-			read_unlock_irqrestore(&gpc1->lock, flags);
+			read_unlock(&gpc1->lock);
 
 			/* When invoked from kvm_sched_out() we cannot sleep */
 			if (atomic)
@@ -533,7 +529,7 @@ static void kvm_xen_update_runstate_guest(struct kvm_vcpu *v, bool atomic)
 	}
 
 	kvm_gpc_mark_dirty_in_slot(gpc1);
-	read_unlock_irqrestore(&gpc1->lock, flags);
+	read_unlock(&gpc1->lock);
 }
 
 void kvm_xen_update_runstate(struct kvm_vcpu *v, int state)
@@ -592,7 +588,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 {
 	unsigned long evtchn_pending_sel = READ_ONCE(v->arch.xen.evtchn_pending_sel);
 	struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 
 	if (!evtchn_pending_sel)
 		return;
@@ -602,14 +597,14 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	 * does anyway. Page it in and retry the instruction. We're just a
 	 * little more honest about it.
 	 */
-	read_lock_irqsave(&gpc->lock, flags);
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		read_unlock(&gpc->lock);
 
 		if (kvm_gpc_refresh(gpc, sizeof(struct vcpu_info)))
 			return;
 
-		read_lock_irqsave(&gpc->lock, flags);
+		read_lock(&gpc->lock);
 	}
 
 	/* Now gpc->khva is a valid kernel address for the vcpu_info */
@@ -639,7 +634,7 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 	}
 
 	kvm_gpc_mark_dirty_in_slot(gpc);
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 
 	/* For the per-vCPU lapic vector, deliver it as MSI. */
 	if (v->arch.xen.upcall_vector)
@@ -649,7 +644,6 @@ void kvm_xen_inject_pending_events(struct kvm_vcpu *v)
 int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 {
 	struct gfn_to_pfn_cache *gpc = &v->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 	u8 rc = 0;
 
 	/*
@@ -665,9 +659,10 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 	BUILD_BUG_ON(sizeof(rc) !=
 		     sizeof_field(struct compat_vcpu_info, evtchn_upcall_pending));
 
-	read_lock_irqsave(&gpc->lock, flags);
+	WARN_ON_ONCE(in_interrupt());
+	read_lock(&gpc->lock);
 	while (!kvm_gpc_check(gpc, sizeof(struct vcpu_info))) {
-		read_unlock_irqrestore(&gpc->lock, flags);
+		read_unlock(&gpc->lock);
 
 		/*
 		 * This function gets called from kvm_vcpu_block() after setting the
@@ -687,11 +682,11 @@ int __kvm_xen_has_interrupt(struct kvm_vcpu *v)
 			 */
 			return 0;
 		}
-		read_lock_irqsave(&gpc->lock, flags);
+		read_lock(&gpc->lock);
 	}
 
 	rc = ((struct vcpu_info *)gpc->khva)->evtchn_upcall_pending;
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 	return rc;
 }
 
@@ -1382,12 +1377,11 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
 	unsigned long *pending_bits;
-	unsigned long flags;
 	bool ret = true;
 	int idx, i;
 
 	idx = srcu_read_lock(&kvm->srcu);
-	read_lock_irqsave(&gpc->lock, flags);
+	read_lock(&gpc->lock);
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
 		goto out_rcu;
 
@@ -1408,7 +1402,7 @@ static bool wait_pending_event(struct kvm_vcpu *vcpu, int nr_ports,
 	}
 
  out_rcu:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	read_unlock(&gpc->lock);
 	srcu_read_unlock(&kvm->srcu, idx);
 
 	return ret;
@@ -1730,12 +1724,17 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &kvm->arch.xen.shinfo_cache;
 	unsigned long *pending_bits, *mask_bits;
-	unsigned long flags;
 	int rc = -EWOULDBLOCK;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	if (in_interrupt()) {
+		if (!read_trylock(&gpc->lock))
+			goto out;
+	} else {
+		read_lock(&gpc->lock);
+	}
+
 	if (!kvm_gpc_check(gpc, PAGE_SIZE))
-		goto out;
+		goto out_unlock;
 
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct shared_info *shinfo = gpc->khva;
@@ -1758,8 +1757,9 @@ static int set_shinfo_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		rc = 1; /* It is newly raised */
 	}
 
+ out_unlock:
+	read_unlock(&gpc->lock);
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
 	return rc;
 }
 
@@ -1767,23 +1767,23 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 {
 	struct kvm *kvm = vcpu->kvm;
 	struct gfn_to_pfn_cache *gpc = &vcpu->arch.xen.vcpu_info_cache;
-	unsigned long flags;
 	bool kick_vcpu = false;
+	bool locked;
 
-	read_lock_irqsave(&gpc->lock, flags);
+	locked = read_trylock(&gpc->lock);
 
 	/*
 	 * Try to deliver the event directly to the vcpu_info. If successful and
 	 * the guest is using upcall_vector delivery, send the MSI.
-	 * If the pfncache is invalid, set the shadow. In this case, or if the
-	 * guest is using another form of event delivery, the vCPU must be
-	 * kicked to complete the delivery.
+	 * If the pfncache lock is contended or the cache is invalid, set the
+	 * shadow. In this case, or if the guest is using another form of event
+	 * delivery, the vCPU must be kicked to complete the delivery.
 	 */
 	if (IS_ENABLED(CONFIG_64BIT) && kvm->arch.xen.long_mode) {
 		struct vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 64;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1797,7 +1797,7 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 		struct compat_vcpu_info *vcpu_info = gpc->khva;
 		int port_word_bit = port / 32;
 
-		if (!kvm_gpc_check(gpc, sizeof(*vcpu_info))) {
+		if ((!locked || !kvm_gpc_check(gpc, sizeof(*vcpu_info)))) {
 			if (!test_and_set_bit(port_word_bit, &vcpu->arch.xen.evtchn_pending_sel))
 				kick_vcpu = true;
 			goto out;
@@ -1816,7 +1816,9 @@ static bool set_vcpu_info_evtchn_pending(struct kvm_vcpu *vcpu, u32 port)
 	}
 
  out:
-	read_unlock_irqrestore(&gpc->lock, flags);
+	if (locked)
+		read_unlock(&gpc->lock);
+
 	return kick_vcpu;
 }
 
-- 
2.34.1





[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2024-03-08 18:10 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-27 11:49 [PATCH v2 0/8] KVM: x86/xen updates David Woodhouse
2024-02-27 11:49 ` [PATCH v2 1/8] KVM: x86/xen: improve accuracy of Xen timers David Woodhouse
2024-03-04 23:44   ` Sean Christopherson
2024-03-05  9:29     ` Paul Durrant
2024-03-05 21:12       ` Sean Christopherson
2024-03-06  9:48         ` Paul Durrant
2024-03-06  9:57           ` David Woodhouse
2024-02-27 11:49 ` [PATCH v2 2/8] KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled David Woodhouse
2024-02-27 11:49 ` [PATCH v2 3/8] KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery David Woodhouse
2024-02-27 11:49 ` [PATCH v2 4/8] KVM: pfncache: simplify locking and make more self-contained David Woodhouse
2024-03-05  0:08   ` Sean Christopherson
2024-02-27 11:49 ` [PATCH v2 5/8] KVM: x86/xen: fix recursive deadlock in timer injection David Woodhouse
2024-02-27 11:49 ` [PATCH v2 6/8] KVM: x86/xen: split up kvm_xen_set_evtchn_fast() David Woodhouse
2024-02-27 11:49 ` [PATCH v2 7/8] KVM: x86/xen: avoid blocking in hardirq context in kvm_xen_set_evtchn_fast() David Woodhouse
2024-02-27 14:43   ` Steven Rostedt
2024-02-27 17:00     ` David Woodhouse
2024-02-27 17:18       ` Thomas Gleixner
2024-02-27 17:22         ` David Woodhouse
2024-03-07  9:27           ` David Woodhouse
2024-03-07 10:02             ` David Woodhouse
2024-02-27 17:25       ` Steven Rostedt
2024-02-27 17:28         ` David Woodhouse
2024-03-08 18:10     ` David Woodhouse
2024-02-27 11:49 ` [PATCH v2 8/8] KVM: pfncache: clean up rwlock abuse David Woodhouse
2024-02-29 23:12 ` [PATCH v2 0/8] KVM: x86/xen updates Sean Christopherson
2024-03-05  0:35 ` Sean Christopherson

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).