[RFC PATCH 0/10] Cleaning up the KVM clock mess

LKML Archive mirror
 help / color / mirror / Atom feed

* [RFC PATCH 0/10] Cleaning up the KVM clock mess
@ 2024-04-18 19:34 David Woodhouse
  2024-04-18 19:34 ` [PATCH 01/10] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
                   ` (11 more replies)
  0 siblings, 12 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

I lied, there aren't three different definitions of the KVM clock. The
fourth is on i386, where it's still based on CLOCK_MONOTONIC (well,
boot time which might as well be the same sine we offset it anyway).

If we fix *that* to be based on CLOCK_MONOTONIC_RAW then we can rip out
a whole bunch of mechanisms which were added to cope with NTP frequency
skew.

This cleans up the mess and gets us back down to the two unavoidable
definitions of the KVM clock: when the host and guest TSCs are well
behaved and in sync, it's in "master clock" mode where it's defined as
a simple arithmetic function of the guest TSC, otherwise it's clamped
to the host's CLOCK_MONOTONIC_RAW.

It includes Jack's KVM_[GS]ET_CLOCK_GUEST patch to allow accurate
migration. Also my KVM_VCPU_TSC_SCALE which exposes the precise TSC
scaling factors. This is needed to get accurate migration of the guest
TSC, and can *also* be used by userspace to have vDSO-style access to
the KVM clock. Thus allowing hypercalls and other emulated clock devices
(e.g. PIT, HPET, ACPI timer) to be based on the KVM clock too, giving
*consistency* across a live migration.

I do still need to fix KVM_REQ_MASTERCLOCK_UPDATE so that it doesn't
clamp the clock back to CLOCK_MONOTONIC_RAW; we should *update* the
ka->kvmclock_offset when we've been running in use_master_clock mode.
Should probably do that in kvm_arch_hardware_disable() for timekeeping
across hibernation too, but I haven't finished working that one out.

I think there are still some places left where KVM reads the time twice 
in close(ish) succession and then assumes they were at the *same* time, 
which I'll audit and fix too.

I also need to flesh out the test cases and do some real testing from
VMMs, but I think it's ready for some heckling at least.

https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks

David Woodhouse (8):
      KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
      KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
      KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
      KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
      KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
      KVM: x86: Remove periodic global clock updates
      KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
      KVM: x86: Fix KVM clock precision in __get_kvmclock()

Jack Allister (2):
      KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
      KVM: selftests: Add KVM/PV clock selftest to prove timer correction

 Documentation/virt/kvm/api.rst                    |  37 ++
 Documentation/virt/kvm/devices/vcpu.rst           | 115 ++++--
 arch/x86/include/asm/kvm_host.h                   |   8 +-
 arch/x86/include/uapi/asm/kvm.h                   |   6 +
 arch/x86/kvm/svm/svm.c                            |   3 +-
 arch/x86/kvm/vmx/vmx.c                            |   2 +-
 arch/x86/kvm/x86.c                                | 438 +++++++++++++---------
 arch/x86/kvm/xen.c                                |   4 +-
 include/uapi/linux/kvm.h                          |   3 +
 tools/testing/selftests/kvm/Makefile              |   1 +
 tools/testing/selftests/kvm/x86_64/pvclock_test.c | 192 ++++++++++
 11 files changed, 600 insertions(+), 209 deletions(-)

^ permalink raw reply	[flat|nested] 30+ messages in thread

* [PATCH 01/10] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init()
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-18 19:34 ` [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
                   ` (10 subsequent siblings)
  11 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

The KVM clock is an interesting thing. It is defined as "nanoseconds
since the guest was created", but in practice it runs at two *different*
rates — or three different rates, if you count implementation bugs.

Definition A is that it runs synchronously with the CLOCK_MONOTONIC_RAW
of the host, with a delta of kvm->arch.kvmclock_offset.

But that version doesn't actually get used in the common case, where the
host has a reliable TSC and the guest TSCs are all running at the same
rate and in sync with each other, and kvm->arch.use_master_clock is set.

In that common case, definition B is used: There is a reference point in
time at kvm->arch.master_kernel_ns (again a CLOCK_MONOTONIC_RAW time),
and a corresponding host TSC value kvm->arch.master_cycle_now. This
fixed point in time is converted to guest units (the time offset by
kvmclock_offset and the TSC Value scaled and offset to be a guest TSC
value) and advertised to the guest in the pvclock structure. While in
this 'use_master_clock' mode, the fixed point in time never needs to be
changed, and the clock runs precisely in time with the guest TSC, at the
rate advertised in the pvclock structure.

The third definition C is implemented in kvm_get_wall_clock_epoch() and
__get_kvmclock(), using the master_cycle_now and master_kernel_ns fields
but converting the *host* TSC cycles directly to a value in nanoseconds
instead of scaling via the guest TSC.

One might naïvely think that all three definitions are identical, since
CLOCK_MONOTONIC_RAW is not skewed by NTP frequency corrections; all
three are just the result of counting the host TSC at a known frequency,
or the scaled guest TSC at a known precise fraction of the host's
frequency. The problem is with arithmetic precision, and the way that
frequency scaling is done in a division-free way by multiplying by a
scale factor, then shifting right. In practice, all three ways of
calculating the KVM clock will suffer a systemic drift from each other.

Eventually, definition C should just be eliminated. Commit 451a707813ae
("KVM: x86/xen: improve accuracy of Xen timers") worked around it for
the specific case of Xen timers, which are defined in terms of the KVM
clock and suffered from a continually increasing error in timer expiry
times. That commit notes that get_kvmclock_ns() is non-trivial to fix
and says "I'll come back to that", which remains true.

Definitions A and B do need to coexist, the former to handle the case
where the host or guest TSC is suboptimally configured. But KVM should
be more careful about switching between them, and the discontinuity in
guest time which could result.

In particular, KVM_REQ_MASTERCLOCK_UPDATE will take a new snapshot of
time as the reference in master_kernel_ns and master_cycle_now, yanking
the guest's clock back to match definition A at that moment.

When invoked from in 'use_master_clock' mode, kvm_update_masterclock()
should probably *adjust* kvm->arch.kvmclock_offset to account for the
drift, instead of yanking the clock back to defintion A. But in the
meantime there are a bunch of places where it just doesn't need to be
invoked at all.

To start with: there is no need to do such an update when a Xen guest
populates the shared_info page. This seems to have been a hangover from
the very first implementation of shared_info which automatically
populated the vcpu_info structures at their default locations, but even
then it should just have raised KVM_REQ_CLOCK_UPDATE on each vCPU
instead of using KVM_REQ_MASTERCLOCK_UPDATE. And now that userspace is
expected to explicitly set the vcpu_info even in its default locations,
there's not even any need for that either.

Fixes: 629b5348841a1 ("KVM: x86/xen: update wallclock region")
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Reviewed-by: Paul Durrant <paul@xen.org>
---
 arch/x86/kvm/xen.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index f65b35a05d91..5a83a8154b79 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -98,8 +98,6 @@ static int kvm_xen_shared_info_init(struct kvm *kvm)
 	wc->version = wc_version + 1;
 	read_unlock_irq(&gpc->lock);

-	kvm_make_all_cpus_request(kvm, KVM_REQ_MASTERCLOCK_UPDATE);
-
 out:
 	srcu_read_unlock(&kvm->srcu, idx);
 	return ret;
-- 
2.44.0

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
  2024-04-18 19:34 ` [PATCH 01/10] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:29   ` Paul Durrant
  2024-04-22 12:22   ` Paolo Bonzini
  2024-04-18 19:34 ` [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

The kvm_guest_time_update() function scales the host TSC frequency to
the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
scaling ratio previously calculated for that vCPU. Then calcuates the
scaling factors for the KVM clock itself based on that guest TSC
frequency.

However, it uses kHz as the unit when scaling, and then multiplies by
1000 only at the end.

With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
the KVM clock advertised to the guest is based on a frequency of
2,499,999,000 Hz.

By using Hz as the unit from the beginning, the KVM clock would be based
on a more accurate frequency of 2,499,999,999 Hz in this example.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/x86.c              | 17 +++++++++--------
 arch/x86/kvm/xen.c              |  2 +-
 3 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 01c69840647e..8440c4081727 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -887,7 +887,7 @@ struct kvm_vcpu_arch {
 
 	gpa_t time;
 	struct pvclock_vcpu_time_info hv_clock;
-	unsigned int hw_tsc_khz;
+	unsigned int hw_tsc_hz;
 	struct gfn_to_pfn_cache pv_time;
 	/* set guest stopped flag in pvclock flags field */
 	bool pvclock_set_guest_stopped_request;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2d2619d3eee4..23281c508c27 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3215,7 +3215,8 @@ static void kvm_setup_guest_pvclock(struct kvm_vcpu *v,
 
 static int kvm_guest_time_update(struct kvm_vcpu *v)
 {
-	unsigned long flags, tgt_tsc_khz;
+	unsigned long flags;
+	uint64_t tgt_tsc_hz;
 	unsigned seq;
 	struct kvm_vcpu_arch *vcpu = &v->arch;
 	struct kvm_arch *ka = &v->kvm->arch;
@@ -3252,8 +3253,8 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 
 	/* Keep irq disabled to prevent changes to the clock */
 	local_irq_save(flags);
-	tgt_tsc_khz = get_cpu_tsc_khz();
-	if (unlikely(tgt_tsc_khz == 0)) {
+	tgt_tsc_hz = get_cpu_tsc_khz() * 1000LL;
+	if (unlikely(tgt_tsc_hz == 0)) {
 		local_irq_restore(flags);
 		kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
 		return 1;
@@ -3288,14 +3289,14 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	/* With all the info we got, fill in the values */
 
 	if (kvm_caps.has_tsc_control)
-		tgt_tsc_khz = kvm_scale_tsc(tgt_tsc_khz,
-					    v->arch.l1_tsc_scaling_ratio);
+		tgt_tsc_hz = kvm_scale_tsc(tgt_tsc_hz,
+					   v->arch.l1_tsc_scaling_ratio);
 
-	if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
-		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_khz * 1000LL,
+	if (unlikely(vcpu->hw_tsc_hz != tgt_tsc_hz)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tgt_tsc_hz,
 				   &vcpu->hv_clock.tsc_shift,
 				   &vcpu->hv_clock.tsc_to_system_mul);
-		vcpu->hw_tsc_khz = tgt_tsc_khz;
+		vcpu->hw_tsc_hz = tgt_tsc_hz;
 		kvm_xen_update_tsc_info(v);
 	}
 
diff --git a/arch/x86/kvm/xen.c b/arch/x86/kvm/xen.c
index 5a83a8154b79..014048c22652 100644
--- a/arch/x86/kvm/xen.c
+++ b/arch/x86/kvm/xen.c
@@ -2273,7 +2273,7 @@ void kvm_xen_update_tsc_info(struct kvm_vcpu *vcpu)
 
 	entry = kvm_find_cpuid_entry_index(vcpu, function, 2);
 	if (entry)
-		entry->eax = vcpu->arch.hw_tsc_khz;
+		entry->eax = vcpu->arch.hw_tsc_hz / 1000;
 }
 
 void kvm_xen_init_vm(struct kvm *kvm)
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
  2024-04-18 19:34 ` [PATCH 01/10] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
  2024-04-18 19:34 ` [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:40   ` Paul Durrant
  2024-04-22 14:11   ` Paolo Bonzini
  2024-04-18 19:34 ` [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
                   ` (8 subsequent siblings)
  11 siblings, 2 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: Jack Allister <jalliste@amazon.com>

In the common case (where kvm->arch.use_master_clock is true), the KVM
clock is defined as a simple arithmetic function of the guest TSC, based on
a reference point stored in kvm->arch.master_kernel_ns and
kvm->arch.master_cycle_now.

The existing KVM_[GS]ET_CLOCK functionality does not allow for this
relationship to be precisely saved and restored by userspace. All it can
currently do is set the KVM clock at a given UTC reference time, which is
necessarily imprecise.

So on live update, the guest TSC can remain cycle accurate at precisely the
same offset from the host TSC, but there is no way for userspace to restore
the KVM clock accurately.

Even on live migration to a new host, where the accuracy of the guest time-
keeping is fundamentally limited by the accuracy of wallclock
synchronization between the source and destination hosts, the clock jump
experienced by the guest's TSC and its KVM clock should at least be
*consistent*. Even when the guest TSC suffers a discontinuity, its KVM
clock should still remain the *same* arithmetic function of the guest TSC,
and not suffer an *additional* discontinuity.

To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
save and restore the actual PV clock info in pvclock_vcpu_time_info.

The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
point in time just as kvm_update_masterclock() does, and calculating the
corresponding guest TSC value. This guest TSC value is then passed through
the user-provided pvclock structure to generate the *intended* KVM clock
value at that point in time, and through the *actual* KVM clock calculation.
Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.

Where kvm->arch.use_master_clock is false (because the host TSC is
unreliable, or the guest TSCs are configured strangely), the KVM clock
is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
returns an error. In this case, as documented, userspace shall use the
legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
case since the clocks are imprecise in this mode anyway.

On *restoration*, if kvm->arch.use_master_clock is false, an error is
returned for similar reasons and userspace shall fall back to using
KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
*both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
migration data (unless the intent is to refuse to resume on a host with bad
TSC).

(It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
non-masterclock mode, as that mode is necessarily imprecise anyway. The
explicit fallback allows userspace to deliberately fail migration to a host
with misbehaving TSC where master clock mode wouldn't be active.)

Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
CC: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
 Documentation/virt/kvm/api.rst |  37 ++++++++
 arch/x86/kvm/x86.c             | 156 +++++++++++++++++++++++++++++++++
 include/uapi/linux/kvm.h       |   3 +
 3 files changed, 196 insertions(+)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index f0b76ff5030d..758f6fc08fe5 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
 
 See KVM_SET_USER_MEMORY_REGION2 for additional details.
 
+4.143 KVM_GET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (out)
+:Returns: 0 on success, <0 on error
+
+Retrieves the current time information structure used for KVM/PV clocks,
+in precisely the form advertised to the guest vCPU, which gives parameters
+for a direct conversion from a guest TSC value to nanoseconds.
+
+When the KVM clock not is in "master clock" mode, for example because the
+host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
+is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
+In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
+
+4.144 KVM_SET_CLOCK_GUEST
+----------------------------
+
+:Capability: none
+:Architectures: x86_64
+:Type: vcpu ioctl
+:Parameters: struct pvclock_vcpu_time_info (in)
+:Returns: 0 on success, <0 on error
+
+Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
+pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
+arithmetic relationship between guest TSC and KVM clock to be preserved by
+userspace across migration.
+
+When the KVM clock is not in "master clock" mode, and the KVM clock is actually
+defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
+may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
+choose to fail, denying migration to a host whose TSC is misbehaving.
+
 5. The kvm_run structure
 ========================
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 23281c508c27..42abce7b4fc9 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5868,6 +5868,154 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
 	}
 }
 
+#ifdef CONFIG_X86_64
+static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;
+
+	/*
+	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
+	 * never been generated at all, call kvm_guest_time_update() to do so.
+	 * Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
+	 * having been written.
+	 */
+	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
+	    !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
+		if (kvm_guest_time_update(v))
+			return -EINVAL;
+	}
+
+	/*
+	 * PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
+	 * KVM clock is defined in terms of the guest TSC. Otherwise, it is
+	 * is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
+	 * use the legacy KVM_[GS]ET_CLOCK to migrate it.
+	 */
+	if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
+		return -EINVAL;
+
+	if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
+		return -EFAULT;
+
+	return 0;
+}
+
+/*
+ * Reverse the calculation in the hv_clock definition.
+ *
+ * time_ns = ( (cycles << shift) * mul ) >> 32;
+ * (although shift can be negative, so that's bad C)
+ *
+ * So for a single second,
+ *  NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
+ *  NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
+ *  ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
+ *  ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
+ */
+static uint64_t hvclock_to_hz(uint32_t mul, int8_t shift)
+{
+	uint64_t tm = NSEC_PER_SEC << 32;
+
+	/* Maximise precision. Shift right until the top bit is set */
+	tm <<= 2;
+	shift += 2;
+
+	/* While 'mul' is even, increase the shift *after* the division */
+	while (!(mul & 1)) {
+		shift++;
+		mul >>= 1;
+	}
+
+	tm /= mul;
+
+	if (shift > 0)
+		return tm >> shift;
+	else
+		return tm << -shift;
+}
+
+static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
+{
+	struct pvclock_vcpu_time_info user_hv_clock;
+	struct kvm *kvm = v->kvm;
+	struct kvm_arch *ka = &kvm->arch;
+	uint64_t curr_tsc_hz, user_tsc_hz;
+	uint64_t user_clk_ns;
+	uint64_t guest_tsc;
+	int rc = 0;
+
+	if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
+		return -EFAULT;
+
+	if (!user_hv_clock.tsc_to_system_mul)
+		return -EINVAL;
+
+	user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
+				    user_hv_clock.tsc_shift);
+
+
+	kvm_hv_request_tsc_page_update(kvm);
+	kvm_start_pvclock_update(kvm);
+	pvclock_update_vm_gtod_copy(kvm);
+
+	/*
+	 * If not in use_master_clock mode, do not allow userspace to set
+	 * the clock in terms of the guest TSC. Userspace should either
+	 * fail the migration (to a host with suboptimal TSCs), or should
+	 * knowingly restore the KVM clock using KVM_SET_CLOCK instead.
+	 */
+	if (!ka->use_master_clock) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
+	if (unlikely(curr_tsc_hz == 0)) {
+		rc = -EINVAL;
+		goto out;
+	}
+
+	if (kvm_caps.has_tsc_control)
+		curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
+					    v->arch.l1_tsc_scaling_ratio);
+
+	/*
+	 * The scaling factors in the hv_clock do not depend solely on the
+	 * TSC frequency *requested* by userspace. They actually use the
+	 * host TSC frequency that was measured/detected by the host kernel,
+	 * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
+	 *
+	 * So a sanity check that they *precisely* match would have false
+	 * negatives. Allow for a discrepancy of 1 kHz either way.
+	 */
+	if (user_tsc_hz < curr_tsc_hz - 1000 ||
+	    user_tsc_hz > curr_tsc_hz + 1000) {
+		rc = -ERANGE;
+		goto out;
+	}
+
+	/*
+	 * The call to pvclock_update_vm_gtod_copy() has created a new time
+	 * reference point in ka->master_cycle_now and ka->master_kernel_ns.
+	 *
+	 * Calculate the guest TSC at that moment, and the corresponding KVM
+	 * clock value according to user_hv_clock. The value according to the
+	 * current hv_clock will of course be ka->master_kernel_ns since no
+	 * TSC cycles have elapsed.
+	 *
+	 * Adjust ka->kvmclock_offset to the delta, so that both definitions
+	 * of the clock give precisely the same reading at the reference time.
+	 */
+	guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
+	user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
+	ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
+
+out:
+	kvm_end_pvclock_update(kvm);
+	return rc;
+}
+#endif
+
 long kvm_arch_vcpu_ioctl(struct file *filp,
 			 unsigned int ioctl, unsigned long arg)
 {
@@ -6256,6 +6404,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
 		srcu_read_unlock(&vcpu->kvm->srcu, idx);
 		break;
 	}
+#ifdef CONFIG_X86_64
+	case KVM_SET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
+		break;
+	case KVM_GET_CLOCK_GUEST:
+		r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
+		break;
+#endif
 #ifdef CONFIG_KVM_HYPERV
 	case KVM_GET_SUPPORTED_HV_CPUID:
 		r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..0d306311e4d6 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
 	__u64 reserved[6];
 };
 
+#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
+#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
+
 #endif /* __LINUX_KVM_H */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (2 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:44   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: Jack Allister <jalliste@amazon.com>

A VM's KVM/PV clock has an inherent relationship to its TSC (guest). When
either the host system live-updates or the VM is live-migrated this pairing
of the two clock sources should stay the same.

In reality this is not the case without some correction taking place. Two
new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
perform a correction on the PVTI (PV time information) structure held by
KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
in either a live-update/migration scenario.

This test proves that without the necessary fixup there is a perceived
change in the guest TSC & KVM/PV clock relationship before and after a LU/
LM takes place.

The following steps are made to verify there is a delta in the relationship
and that it can be corrected:

1. PVTI is sampled by guest at boot (let's call this PVTI0).
2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
3. PVTI is sampled by guest after change (PVTI1).
4. Correction is requested by usermode to KVM using PVTI0.
5. PVTI is sampled by guest after correction (PVTI2).

The guest the records a singular TSC reference point in time and uses it to
calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
call each clock value CLK[0-2].

In a perfect world CLK[0-2] should all be the same value if the KVM clock
& TSC relationship is preserved across the LU/LM (or faked in this test),
however it is not.

A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
(and the inaccuracies associated with that). A delta of ~3500ns can be
observed if guest TSC scaling to half host TSC frequency is also enabled,
where as without scaling this is observed at ~180ns.

With the correction it should be possible to achieve a delta of ±1ns.

An option to enable guest TSC scaling is available via invoking the tester
with -s/--scale-tsc.

Example of the test output below:
* selftests: kvm: pvclock_test
* scaling tsc from 2999999KHz to 1499999KHz
* before=5038374946 uncorrected=5038371437 corrected=5038374945
* delta_uncorrected=3509 delta_corrected=1

Signed-off-by: Jack Allister <jalliste@amazon.com>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
CC: Paul Durrant <paul@xen.org>
CC: Dongli Zhang <dongli.zhang@oracle.com>
---
 tools/testing/selftests/kvm/Makefile          |   1 +
 .../selftests/kvm/x86_64/pvclock_test.c       | 192 ++++++++++++++++++
 2 files changed, 193 insertions(+)
 create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c

diff --git a/tools/testing/selftests/kvm/Makefile b/tools/testing/selftests/kvm/Makefile
index 871e2de3eb05..e33f56fedb0c 100644
--- a/tools/testing/selftests/kvm/Makefile
+++ b/tools/testing/selftests/kvm/Makefile
@@ -87,6 +87,7 @@ TEST_GEN_PROGS_x86_64 += x86_64/pmu_counters_test
 TEST_GEN_PROGS_x86_64 += x86_64/pmu_event_filter_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_conversions_test
 TEST_GEN_PROGS_x86_64 += x86_64/private_mem_kvm_exits_test
+TEST_GEN_PROGS_x86_64 += x86_64/pvclock_test
 TEST_GEN_PROGS_x86_64 += x86_64/set_boot_cpu_id
 TEST_GEN_PROGS_x86_64 += x86_64/set_sregs_test
 TEST_GEN_PROGS_x86_64 += x86_64/smaller_maxphyaddr_emulation_test
diff --git a/tools/testing/selftests/kvm/x86_64/pvclock_test.c b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
new file mode 100644
index 000000000000..376ffb730a53
--- /dev/null
+++ b/tools/testing/selftests/kvm/x86_64/pvclock_test.c
@@ -0,0 +1,192 @@
+// SPDX-License-Identifier: GPL-2.0-only
+/*
+ * Copyright © 2024, Amazon.com, Inc. or its affiliates.
+ *
+ * Tests for pvclock API
+ * KVM_SET_CLOCK_GUEST/KVM_GET_CLOCK_GUEST
+ */
+#include <asm/pvclock.h>
+#include <asm/pvclock-abi.h>
+#include <sys/stat.h>
+#include <stdint.h>
+#include <stdio.h>
+
+#include "test_util.h"
+#include "kvm_util.h"
+#include "processor.h"
+
+enum {
+	STAGE_FIRST_BOOT,
+	STAGE_UNCORRECTED,
+	STAGE_CORRECTED
+};
+
+#define KVMCLOCK_GPA 0xc0000000ull
+#define KVMCLOCK_SIZE sizeof(struct pvclock_vcpu_time_info)
+
+static void trigger_pvti_update(vm_paddr_t pvti_pa)
+{
+	/*
+	 * We need a way to trigger KVM to update the fields
+	 * in the PV time info. The easiest way to do this is
+	 * to temporarily switch to the old KVM system time
+	 * method and then switch back to the new one.
+	 */
+	wrmsr(MSR_KVM_SYSTEM_TIME, pvti_pa | KVM_MSR_ENABLED);
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+}
+
+static void guest_code(vm_paddr_t pvti_pa)
+{
+	struct pvclock_vcpu_time_info *pvti_va =
+		(struct pvclock_vcpu_time_info *)pvti_pa;
+
+	struct pvclock_vcpu_time_info pvti_boot;
+	struct pvclock_vcpu_time_info pvti_uncorrected;
+	struct pvclock_vcpu_time_info pvti_corrected;
+	uint64_t cycles_boot;
+	uint64_t cycles_uncorrected;
+	uint64_t cycles_corrected;
+	uint64_t tsc_guest;
+
+	/*
+	 * Setup the KVMCLOCK in the guest & store the original
+	 * PV time structure that is used.
+	 */
+	wrmsr(MSR_KVM_SYSTEM_TIME_NEW, pvti_pa | KVM_MSR_ENABLED);
+	pvti_boot = *pvti_va;
+	GUEST_SYNC(STAGE_FIRST_BOOT);
+
+	/*
+	 * Trigger an update of the PVTI, if we calculate
+	 * the KVM clock using this structure we'll see
+	 * a delta from the TSC.
+	 */
+	trigger_pvti_update(pvti_pa);
+	pvti_uncorrected = *pvti_va;
+	GUEST_SYNC(STAGE_UNCORRECTED);
+
+	/*
+	 * The test should have triggered the correction by this
+	 * point in time. We have a copy of each of the PVTI structs
+	 * at each stage now.
+	 *
+	 * Let's sample the timestamp at a SINGLE point in time and
+	 * then calculate what the KVM clock would be using the PVTI
+	 * from each stage.
+	 *
+	 * Then return each of these values to the tester.
+	 */
+	pvti_corrected = *pvti_va;
+	tsc_guest = rdtsc();
+
+	cycles_boot = __pvclock_read_cycles(&pvti_boot, tsc_guest);
+	cycles_uncorrected = __pvclock_read_cycles(&pvti_uncorrected, tsc_guest);
+	cycles_corrected = __pvclock_read_cycles(&pvti_corrected, tsc_guest);
+
+	GUEST_SYNC_ARGS(STAGE_CORRECTED, cycles_boot, cycles_uncorrected,
+			cycles_corrected, 0);
+}
+
+static void run_test(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+	struct pvclock_vcpu_time_info pvti_before;
+	uint64_t before, uncorrected, corrected;
+	int64_t delta_uncorrected, delta_corrected;
+	struct ucall uc;
+	uint64_t ucall_reason;
+
+	/* Loop through each stage of the test. */
+	while (true) {
+
+		/* Start/restart the running vCPU code. */
+		vcpu_run(vcpu);
+		TEST_ASSERT_KVM_EXIT_REASON(vcpu, KVM_EXIT_IO);
+
+		/* Retrieve and verify our stage. */
+		ucall_reason = get_ucall(vcpu, &uc);
+		TEST_ASSERT(ucall_reason == UCALL_SYNC,
+			    "Unhandled ucall reason=%lu",
+			    ucall_reason);
+
+		/* Run host specific code relating to stage. */
+		switch (uc.args[1]) {
+		case STAGE_FIRST_BOOT:
+			/* Store the KVM clock values before an update. */
+			vcpu_ioctl(vcpu, KVM_GET_CLOCK_GUEST, &pvti_before);
+
+			/* Sleep for a set amount of time to increase delta. */
+			sleep(5);
+			break;
+
+		case STAGE_UNCORRECTED:
+			/* Restore the KVM clock values. */
+			vcpu_ioctl(vcpu, KVM_SET_CLOCK_GUEST, &pvti_before);
+			break;
+
+		case STAGE_CORRECTED:
+			/* Query the clock information and verify delta. */
+			before = uc.args[2];
+			uncorrected = uc.args[3];
+			corrected = uc.args[4];
+
+			delta_uncorrected = before - uncorrected;
+			delta_corrected = before - corrected;
+
+			pr_info("before=%lu uncorrected=%lu corrected=%lu\n",
+				before, uncorrected, corrected);
+
+			pr_info("delta_uncorrected=%ld delta_corrected=%ld\n",
+				delta_uncorrected, delta_corrected);
+
+			TEST_ASSERT((delta_corrected <= 1) && (delta_corrected >= -1),
+				    "larger than expected delta detected = %ld", delta_corrected);
+			return;
+		}
+	}
+}
+
+static void configure_pvclock(struct kvm_vm *vm, struct kvm_vcpu *vcpu)
+{
+	unsigned int gpages;
+
+	gpages = vm_calc_num_guest_pages(VM_MODE_DEFAULT, KVMCLOCK_SIZE);
+	vm_userspace_mem_region_add(vm, VM_MEM_SRC_ANONYMOUS,
+				    KVMCLOCK_GPA, 1, gpages, 0);
+	virt_map(vm, KVMCLOCK_GPA, KVMCLOCK_GPA, gpages);
+
+	vcpu_args_set(vcpu, 1, KVMCLOCK_GPA);
+}
+
+static void configure_scaled_tsc(struct kvm_vcpu *vcpu)
+{
+	uint64_t tsc_khz;
+
+	tsc_khz =  __vcpu_ioctl(vcpu, KVM_GET_TSC_KHZ, NULL);
+	pr_info("scaling tsc from %ldKHz to %ldKHz\n", tsc_khz, tsc_khz / 2);
+	tsc_khz /= 2;
+	vcpu_ioctl(vcpu, KVM_SET_TSC_KHZ, (void *)tsc_khz);
+}
+
+int main(int argc, char *argv[])
+{
+	struct kvm_vcpu *vcpu;
+	struct kvm_vm *vm;
+	bool scale_tsc;
+
+	scale_tsc = argc > 1 && (!strncmp(argv[1], "-s", 3) ||
+				 !strncmp(argv[1], "--scale-tsc", 10));
+
+	TEST_REQUIRE(sys_clocksource_is_based_on_tsc());
+
+	vm = vm_create_with_one_vcpu(&vcpu, guest_code);
+
+	configure_pvclock(vm, vcpu);
+
+	if (scale_tsc)
+		configure_scaled_tsc(vcpu);
+
+	run_test(vm, vcpu);
+
+	return 0;
+}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (3 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:45   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

KVM does make an attempt to cope with non-constant TSC, and has notifiers
to handle host TSC frequency changes. However, it *only* adjusts the KVM
clock, and doesn't adjust TSC frequency scaling when the host changes.

This is presumably because non-constant TSCs were fixed in hardware long
before TSC scaling was implemented, so there should never be real CPUs
which have TSC scaling but *not* CONSTANT_TSC.

Such a combination could potentially happen in some odd L1 nesting
environment, but it isn't worth trying to support it. Just make the
dependency explicit.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/svm/svm.c | 3 ++-
 arch/x86/kvm/vmx/vmx.c | 2 +-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/svm/svm.c b/arch/x86/kvm/svm/svm.c
index 0f3b59da0d4a..4d3ec1c3231e 100644
--- a/arch/x86/kvm/svm/svm.c
+++ b/arch/x86/kvm/svm/svm.c
@@ -5202,7 +5202,8 @@ static __init int svm_hardware_setup(void)
 		kvm_enable_efer_bits(EFER_FFXSR);
 
 	if (tsc_scaling) {
-		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR)) {
+		if (!boot_cpu_has(X86_FEATURE_TSCRATEMSR) ||
+		    !boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
 			tsc_scaling = false;
 		} else {
 			pr_info("TSC scaling supported\n");
diff --git a/arch/x86/kvm/vmx/vmx.c b/arch/x86/kvm/vmx/vmx.c
index 6780313914f8..bee830adf744 100644
--- a/arch/x86/kvm/vmx/vmx.c
+++ b/arch/x86/kvm/vmx/vmx.c
@@ -8428,7 +8428,7 @@ __init int vmx_hardware_setup(void)
 	if (!enable_apicv || !cpu_has_vmx_ipiv())
 		enable_ipiv = false;
 
-	if (cpu_has_vmx_tsc_scaling())
+	if (cpu_has_vmx_tsc_scaling() && boot_cpu_has(X86_FEATURE_CONSTANT_TSC))
 		kvm_caps.has_tsc_control = true;
 
 	kvm_caps.max_tsc_scaling_ratio = KVM_VMX_TSC_MULTIPLIER_MAX;
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (4 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:49   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
                   ` (5 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
inadequate. It ignores TSC scaling, and ignores the fact that the host
TSC may differ from one host to the next (and in fact because of the way
the kernel calibrates it, it generally differs from one boot to the next
even on the same hardware).

Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
and attempt to document the *awful* process that we're requiring userspace
to follow to merely preserve the TSC across migration.

I may have thrown up in my mouth a little when writing that documentation.
It's an awful API. If we do this, we should be ashamed of ourselves.
(I also haven't tested the documented process yet).

Let's use Simon's KVM_VCPU_TSC_VALUE instead.
https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
 arch/x86/include/uapi/asm/kvm.h         |   6 ++
 arch/x86/kvm/x86.c                      |  15 ++++
 3 files changed, 109 insertions(+), 27 deletions(-)

diff --git a/Documentation/virt/kvm/devices/vcpu.rst b/Documentation/virt/kvm/devices/vcpu.rst
index 31f14ec4a65b..2a054443118d 100644
--- a/Documentation/virt/kvm/devices/vcpu.rst
+++ b/Documentation/virt/kvm/devices/vcpu.rst
@@ -216,52 +216,113 @@ Returns:
 Specifies the guest's TSC offset relative to the host's TSC. The guest's
 TSC is then derived by the following equation:
 
-  guest_tsc = host_tsc + KVM_VCPU_TSC_OFFSET
+  guest_tsc = (( host_tsc * tsc_scale_ratio ) >> tsc_scale_bits ) + KVM_VCPU_TSC_OFFSET
+
+The values of tsc_scale_ratio and tsc_scale_bits can be obtained using
+the KVM_VCPU_TSC_SCALE attribute.
 
 This attribute is useful to adjust the guest's TSC on live migration,
 so that the TSC counts the time during which the VM was paused. The
-following describes a possible algorithm to use for this purpose.
+following describes a possible algorithm to use for this purpose,
 
 From the source VMM process:
 
-1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_src),
+1. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_src),
    kvmclock nanoseconds (guest_src), and host CLOCK_REALTIME nanoseconds
-   (host_src).
+   (time_src) at a given moment (Tsrc).
+
+2. For each vCPU[i]:
+
+   a. Read the KVM_VCPU_TSC_OFFSET attribute to record the guest TSC offset
+      (ofs_src[i]).
 
-2. Read the KVM_VCPU_TSC_OFFSET attribute for every vCPU to record the
-   guest TSC offset (ofs_src[i]).
+   b. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_src[i], frac_bits_src[i]).
 
-3. Invoke the KVM_GET_TSC_KHZ ioctl to record the frequency of the
-   guest's TSC (freq).
+   c. Use host_tsc_src and the scaling/offset factors to calculate this
+      vCPU's TSC at time Tsrc:
+      tsc_src[i] = (( host_tsc_src * ratio_src[i] ) >> frac_bits_src[i] ) + ofs_src[i]
+
+3. Invoke the KVM_GET_CLOCK_GUEST ioctl on the boot vCPU to return the KVM
+   clock as a function of the guest TSC (pvti_src).  (This ioctl not succeed
+   if the host and guest TSCs are not consistent and well-behaved.)
 
 From the destination VMM process:
 
-4. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
-   kvmclock (guest_src) and CLOCK_REALTIME (host_src) in their respective
+4. Before creating the vCPUs, invoke the KVM_SET_TSC_KHZ ioctl on the VM, to
+   set the scaled frequency of the guest's TSC (freq).
+
+5. Invoke the KVM_SET_CLOCK ioctl, providing the source nanoseconds from
+   kvmclock (guest_src) and CLOCK_REALTIME (time_src) in their respective
    fields.  Ensure that the KVM_CLOCK_REALTIME flag is set in the provided
    structure.
 
-   KVM will advance the VM's kvmclock to account for elapsed time since
-   recording the clock values.  Note that this will cause problems in
+   KVM will restore the VM's kvmclock, accounting for elapsed time since
+   the clock values were recorded.  Note that this will cause problems in
    the guest (e.g., timeouts) unless CLOCK_REALTIME is synchronized
    between the source and destination, and a reasonably short time passes
-   between the source pausing the VMs and the destination executing
-   steps 4-7.
+   between the source pausing the VMs and the destination resuming them.
+   Due to the KVM_[SG]ET_CLOCK API using CLOCK_REALTIME instead of
+   CLOCK_TAI, leap seconds during the migration may also introduce errors.
+
+6. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (host_tsc_dst) and
+   host CLOCK_REALTIME nanoseconds (time_dst) at a given moment (Tdst).
+
+7. Calculate the number of nanoseconds elapsed between Tsrc and Tdst:
+   ΔT = time_dst - time_src
+
+8. As each vCPU[i] is created:
+
+   a. Read the KVM_VCPU_TSC_SCALE attribute to record the guest TSC scaling
+      ratio (ratio_dst[i], frac_bits_dst[i]).
+
+   b. Calculate the intended guest TSC value at time Tdst:
+      tsc_dst[i] = tsc_tsc[i] + (ΔT * freq[i])
+
+   c. Use host_tsc_dst and the scaling/offset factors to calculate this vCPU's
+      TSC at time Tsrc without taking offsetting into account:
+      raw_dst[i] = (( host_tsc_dst * ratio_dst[i] ) >> frac_bits_dst[i] )
+
+   d. Calculate ofs_src[i] = tsc_dst[i] + raw_dst[i] and set the resulting
+      offset using the KVM_VCPU_TSC_OFFSET attrribute.
 
-5. Invoke the KVM_GET_CLOCK ioctl to record the host TSC (tsc_dest) and
-   kvmclock nanoseconds (guest_dest).
+9. If pvti_src was provided, invoke the KVM_SET_CLOCK_GUEST ioctl on the boot
+   vCPU to restore the KVM clock as a precise function of the guest TSC. If
+   pvti_src was not provided by the source, or the ioctl fails on the
+   destination, the KVM clock is operating in its less precise mode where it
+   is defined in terms of the host CLOCK_MONOTONIC_RAW, so the value
+   previously set in step 5 is as accurate as it can be.
+
+4.2 ATTRIBUTE: KVM_VCPU_TSC_SCALE
+
+:Parameters: 64-bit fixed point TSC scale factor
+
+Returns:
+
+	 ======= ======================================
+	 -EFAULT Error reading the provided parameter
+		 address.
+	 -ENXIO  Attribute not supported
+	 -EINVAL Invalid request to write the attribute
+	 ======= ======================================
+
+This read-only attribute reports the guest's TSC scaling factor, in the form
+of a fixed-point number represented by the following structure:
+
+    struct kvm_vcpu_tsc_scale {
+	    __u64	tsc_ratio;
+	    __u64	tsc_frac_bits;
+    };
 
-6. Adjust the guest TSC offsets for every vCPU to account for (1) time
-   elapsed since recording state and (2) difference in TSCs between the
-   source and destination machine:
 
-   ofs_dst[i] = ofs_src[i] -
-     (guest_src - guest_dest) * freq +
-     (tsc_src - tsc_dest)
+The tsc_frac_bits field indicate the location of the fixed point, such that
+host TSC values are converted to guest TSC using the formula:
 
-   ("ofs[i] + tsc - guest * freq" is the guest TSC value corresponding to
-   a time of 0 in kvmclock.  The above formula ensures that it is the
-   same on the destination as it was on the source).
+    guest_tsc = ( ( host_tsc * tsc_ratio ) >> tsc_frac_bits) + offset
 
-7. Write the KVM_VCPU_TSC_OFFSET attribute for every vCPU with the
-   respective value derived in the previous step.
+Userspace can use this to precisely calculate the guest TSC from the host
+TSC at any given moment. This is needed for accurate migration of guests,
+as described in the documentation for the KVM_VCPU_TSC_OFFSET attribute.
+In conjunction with the KVM_GET_CLOCK_GUEST ioctl, it also provides a way
+for userspace to quickly calculate the KVM clock for a guest, to use as a
+time reference for hypercalls or emulation of other timer devices.
diff --git a/arch/x86/include/uapi/asm/kvm.h b/arch/x86/include/uapi/asm/kvm.h
index 72ad5ace118d..fe7c98907818 100644
--- a/arch/x86/include/uapi/asm/kvm.h
+++ b/arch/x86/include/uapi/asm/kvm.h
@@ -864,6 +864,12 @@ struct kvm_hyperv_eventfd {
 /* for KVM_{GET,SET,HAS}_DEVICE_ATTR */
 #define KVM_VCPU_TSC_CTRL 0 /* control group for the timestamp counter (TSC) */
 #define   KVM_VCPU_TSC_OFFSET 0 /* attribute for the TSC offset */
+#define   KVM_VCPU_TSC_SCALE  1 /* attribute for TSC scaling factor */
+
+struct kvm_vcpu_tsc_scale {
+	__u64 tsc_ratio;
+	__u64 tsc_frac_bits;
+};
 
 /* x86-specific KVM_EXIT_HYPERCALL flags. */
 #define KVM_EXIT_HYPERCALL_LONG_MODE	_BITULL(0)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 42abce7b4fc9..00a7c1188dec 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5715,6 +5715,7 @@ static int kvm_arch_tsc_has_attr(struct kvm_vcpu *vcpu,
 
 	switch (attr->attr) {
 	case KVM_VCPU_TSC_OFFSET:
+	case KVM_VCPU_TSC_SCALE:
 		r = 0;
 		break;
 	default:
@@ -5737,6 +5738,17 @@ static int kvm_arch_tsc_get_attr(struct kvm_vcpu *vcpu,
 			break;
 		r = 0;
 		break;
+	case KVM_VCPU_TSC_SCALE: {
+		struct kvm_vcpu_tsc_scale scale;
+
+		scale.tsc_ratio = vcpu->arch.l1_tsc_scaling_ratio;
+		scale.tsc_frac_bits = kvm_caps.tsc_scaling_ratio_frac_bits;
+		r = -EFAULT;
+		if (copy_to_user(uaddr, &scale, sizeof(scale)))
+			break;
+		r = 0;
+		break;
+	}
 	default:
 		r = -ENXIO;
 	}
@@ -5777,6 +5789,9 @@ static int kvm_arch_tsc_set_attr(struct kvm_vcpu *vcpu,
 		r = 0;
 		break;
 	}
+	case KVM_VCPU_TSC_SCALE:
+		r = -EINVAL; /* Read only */
+		break;
 	default:
 		r = -ENXIO;
 	}
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (5 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:53   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 08/10] KVM: x86: Remove periodic global clock updates David Woodhouse
                   ` (4 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
clock") did so only for 64-bit hosts, by capturing the boot offset from
within the existing clocksource notifier update_pvclock_gtod().

That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
clocksource changes") but only on x86_64, because its original purpose
was just to disable the "master clock" mode which is only supported on
x86_64.

Now that the notifier is used for more than disabling master clock mode,
(well, OK, more than a decade later but clocks are hard), enable it for
the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
NTP sync on 32-bit too.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 18 ++++++------------
 1 file changed, 6 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 00a7c1188dec..44b3d2a0da5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2245,7 +2245,6 @@ static int do_set_msr(struct kvm_vcpu *vcpu, unsigned index, u64 *data)
 	return kvm_set_msr_ignored_check(vcpu, index, *data, true);
 }
 
-#ifdef CONFIG_X86_64
 struct pvclock_clock {
 	int vclock_mode;
 	u64 cycle_last;
@@ -2303,13 +2302,6 @@ static s64 get_kvmclock_base_ns(void)
 	/* Count up from boot time, but with the frequency of the raw clock.  */
 	return ktime_to_ns(ktime_add(ktime_get_raw(), pvclock_gtod_data.offs_boot));
 }
-#else
-static s64 get_kvmclock_base_ns(void)
-{
-	/* Master clock not used, so we can just use CLOCK_BOOTTIME.  */
-	return ktime_get_boottime_ns();
-}
-#endif
 
 static void kvm_write_wall_clock(struct kvm *kvm, gpa_t wall_clock, int sec_hi_ofs)
 {
@@ -9819,6 +9811,7 @@ static void pvclock_irq_work_fn(struct irq_work *w)
 }
 
 static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
+#endif
 
 /*
  * Notification about pvclock gtod data update.
@@ -9826,26 +9819,26 @@ static DEFINE_IRQ_WORK(pvclock_irq_work, pvclock_irq_work_fn);
 static int pvclock_gtod_notify(struct notifier_block *nb, unsigned long unused,
 			       void *priv)
 {
-	struct pvclock_gtod_data *gtod = &pvclock_gtod_data;
 	struct timekeeper *tk = priv;
 
 	update_pvclock_gtod(tk);
 
+#ifdef CONFIG_X86_64
 	/*
 	 * Disable master clock if host does not trust, or does not use,
 	 * TSC based clocksource. Delegate queue_work() to irq_work as
 	 * this is invoked with tk_core.seq write held.
 	 */
-	if (!gtod_is_based_on_tsc(gtod->clock.vclock_mode) &&
+	if (!gtod_is_based_on_tsc(pvclock_gtod_data.clock.vclock_mode) &&
 	    atomic_read(&kvm_guest_has_master_clock) != 0)
 		irq_work_queue(&pvclock_irq_work);
+#endif
 	return 0;
 }
 
 static struct notifier_block pvclock_gtod_notifier = {
 	.notifier_call = pvclock_gtod_notify,
 };
-#endif
 
 static inline void kvm_ops_update(struct kvm_x86_init_ops *ops)
 {
@@ -9984,9 +9977,10 @@ int kvm_x86_vendor_init(struct kvm_x86_init_ops *ops)
 
 	if (pi_inject_timer == -1)
 		pi_inject_timer = housekeeping_enabled(HK_TYPE_TIMER);
-#ifdef CONFIG_X86_64
+
 	pvclock_gtod_register_notifier(&pvclock_gtod_notifier);
 
+#ifdef CONFIG_X86_64
 	if (hypervisor_is_type(X86_HYPER_MS_HYPERV))
 		set_hv_tscchange_cb(kvm_hyperv_tsc_notifier);
 #endif
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 08/10] KVM: x86: Remove periodic global clock updates
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (6 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:55   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE David Woodhouse
                   ` (3 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

This effectively reverts commit 332967a3eac0 ("x86: kvm: introduce
periodic global clock updates"). The periodic update was introduced to
propagate NTP corrections to the guest KVM clock, when the KVM clock was
based on CLOCK_MONOTONIC.

However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
for the KVM clock, avoiding the NTP frequency skew altogether.

So the periodic update serves no purpose. Remove it.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/kvm/x86.c | 25 -------------------------
 1 file changed, 25 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 44b3d2a0da5b..4ec4eb850c5b 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -155,9 +155,6 @@ EXPORT_SYMBOL_GPL(report_ignored_msrs);
 unsigned int min_timer_period_us = 200;
 module_param(min_timer_period_us, uint, 0644);
 
-static bool __read_mostly kvmclock_periodic_sync = true;
-module_param(kvmclock_periodic_sync, bool, 0444);
-
 /* tsc tolerance in parts per million - default to 1/2 of the NTP threshold */
 static u32 __read_mostly tsc_tolerance_ppm = 250;
 module_param(tsc_tolerance_ppm, uint, 0644);
@@ -3434,20 +3431,6 @@ static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
 					KVMCLOCK_UPDATE_DELAY);
 }
 
-#define KVMCLOCK_SYNC_PERIOD (300 * HZ)
-
-static void kvmclock_sync_fn(struct work_struct *work)
-{
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
-					   kvmclock_sync_work);
-	struct kvm *kvm = container_of(ka, struct kvm, arch);
-
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work, 0);
-	schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
-					KVMCLOCK_SYNC_PERIOD);
-}
-
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
 static bool is_mci_control_msr(u32 msr)
 {
@@ -12416,8 +12399,6 @@ int kvm_arch_vcpu_create(struct kvm_vcpu *vcpu)
 
 void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 {
-	struct kvm *kvm = vcpu->kvm;
-
 	if (mutex_lock_killable(&vcpu->mutex))
 		return;
 	vcpu_load(vcpu);
@@ -12428,10 +12409,6 @@ void kvm_arch_vcpu_postcreate(struct kvm_vcpu *vcpu)
 	vcpu->arch.msr_kvm_poll_control = 1;
 
 	mutex_unlock(&vcpu->mutex);
-
-	if (kvmclock_periodic_sync && vcpu->vcpu_idx == 0)
-		schedule_delayed_work(&kvm->arch.kvmclock_sync_work,
-						KVMCLOCK_SYNC_PERIOD);
 }
 
 void kvm_arch_vcpu_destroy(struct kvm_vcpu *vcpu)
@@ -12804,7 +12781,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 #endif
 
 	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
-	INIT_DELAYED_WORK(&kvm->arch.kvmclock_sync_work, kvmclock_sync_fn);
 
 	kvm_apicv_init(kvm);
 	kvm_hv_init_vm(kvm);
@@ -12844,7 +12820,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 
 void kvm_arch_sync_events(struct kvm *kvm)
 {
-	cancel_delayed_work_sync(&kvm->arch.kvmclock_sync_work);
 	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (7 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 08/10] KVM: x86: Remove periodic global clock updates David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 15:57   ` Paul Durrant
  2024-04-18 19:34 ` [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
                   ` (2 subsequent siblings)
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

This was introduced in commit 0061d53daf26 ("KVM: x86: limit difference
between kvmclock updates") to reduce cross-vCPU differences which arose
because the KVM clock was based on CLOCK_MONOTONIC and thus subject to
NTP frequency corrections.

However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
for the KVM clock, avoiding the NTP frequency skew altogether.

So remove KVM_REQ_GLOBAL_CLOCK_UPDATE. In kvm_write_system_time(), all
that's needed is a single KVM_REQ_CLOCK_UPDATE for the vCPU whose KVM
clock is being configured.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |  2 +-
 arch/x86/kvm/x86.c              | 57 ++-------------------------------
 2 files changed, 4 insertions(+), 55 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 8440c4081727..cfac72b4aa64 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -98,7 +98,7 @@
 	KVM_ARCH_REQ_FLAGS(14, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_SCAN_IOAPIC \
 	KVM_ARCH_REQ_FLAGS(15, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
-#define KVM_REQ_GLOBAL_CLOCK_UPDATE	KVM_ARCH_REQ(16)
+/* KVM_ARCH_REQ(16) is available to be reused */
 #define KVM_REQ_APIC_PAGE_RELOAD \
 	KVM_ARCH_REQ_FLAGS(17, KVM_REQUEST_WAIT | KVM_REQUEST_NO_WAKEUP)
 #define KVM_REQ_HV_CRASH		KVM_ARCH_REQ(18)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4ec4eb850c5b..f870e29d2558 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2354,13 +2354,13 @@ static void kvm_write_system_time(struct kvm_vcpu *vcpu, gpa_t system_time,
 	}
 
 	vcpu->arch.time = system_time;
-	kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 
 	/* we verify if the enable bit is set... */
-	if (system_time & 1)
+	if (system_time & 1) {
 		kvm_gpc_activate(&vcpu->arch.pv_time, system_time & ~1ULL,
 				 sizeof(struct pvclock_vcpu_time_info));
-	else
+		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
+	} else
 		kvm_gpc_deactivate(&vcpu->arch.pv_time);
 
 	return;
@@ -3391,46 +3391,6 @@ uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
 }
 
-/*
- * kvmclock updates which are isolated to a given vcpu, such as
- * vcpu->cpu migration, should not allow system_timestamp from
- * the rest of the vcpus to remain static. Otherwise ntp frequency
- * correction applies to one vcpu's system_timestamp but not
- * the others.
- *
- * So in those cases, request a kvmclock update for all vcpus.
- * We need to rate-limit these requests though, as they can
- * considerably slow guests that have a large number of vcpus.
- * The time for a remote vcpu to update its kvmclock is bound
- * by the delay we use to rate-limit the updates.
- */
-
-#define KVMCLOCK_UPDATE_DELAY msecs_to_jiffies(100)
-
-static void kvmclock_update_fn(struct work_struct *work)
-{
-	unsigned long i;
-	struct delayed_work *dwork = to_delayed_work(work);
-	struct kvm_arch *ka = container_of(dwork, struct kvm_arch,
-					   kvmclock_update_work);
-	struct kvm *kvm = container_of(ka, struct kvm, arch);
-	struct kvm_vcpu *vcpu;
-
-	kvm_for_each_vcpu(i, vcpu, kvm) {
-		kvm_make_request(KVM_REQ_CLOCK_UPDATE, vcpu);
-		kvm_vcpu_kick(vcpu);
-	}
-}
-
-static void kvm_gen_kvmclock_update(struct kvm_vcpu *v)
-{
-	struct kvm *kvm = v->kvm;
-
-	kvm_make_request(KVM_REQ_CLOCK_UPDATE, v);
-	schedule_delayed_work(&kvm->arch.kvmclock_update_work,
-					KVMCLOCK_UPDATE_DELAY);
-}
-
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
 static bool is_mci_control_msr(u32 msr)
 {
@@ -5018,12 +4978,6 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
 		if (kvm_lapic_hv_timer_in_use(vcpu))
 			kvm_lapic_restart_hv_timer(vcpu);
 
-		/*
-		 * On a host with synchronized TSC, there is no need to update
-		 * kvmclock on vcpu->cpu migration
-		 */
-		if (!vcpu->kvm->arch.use_master_clock || vcpu->cpu == -1)
-			kvm_make_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu);
 		if (vcpu->cpu != cpu)
 			kvm_make_request(KVM_REQ_MIGRATE_TIMER, vcpu);
 		vcpu->cpu = cpu;
@@ -10961,8 +10915,6 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
 			__kvm_migrate_timers(vcpu);
 		if (kvm_check_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu))
 			kvm_update_masterclock(vcpu->kvm);
-		if (kvm_check_request(KVM_REQ_GLOBAL_CLOCK_UPDATE, vcpu))
-			kvm_gen_kvmclock_update(vcpu);
 		if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, vcpu)) {
 			r = kvm_guest_time_update(vcpu);
 			if (unlikely(r))
@@ -12780,8 +12732,6 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
 	kvm->arch.hv_root_tdp = INVALID_PAGE;
 #endif
 
-	INIT_DELAYED_WORK(&kvm->arch.kvmclock_update_work, kvmclock_update_fn);
-
 	kvm_apicv_init(kvm);
 	kvm_hv_init_vm(kvm);
 	kvm_xen_init_vm(kvm);
@@ -12820,7 +12770,6 @@ static void kvm_unload_vcpu_mmus(struct kvm *kvm)
 
 void kvm_arch_sync_events(struct kvm *kvm)
 {
-	cancel_delayed_work_sync(&kvm->arch.kvmclock_update_work);
 	kvm_free_pit(kvm);
 }
 
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock()
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (8 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE David Woodhouse
@ 2024-04-18 19:34 ` David Woodhouse
  2024-04-19 16:07   ` Paul Durrant
  2024-04-19 12:51 ` [PATCH 11/10] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
  2024-04-19 12:52 ` [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
  11 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-18 19:34 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

From: David Woodhouse <dwmw@amazon.co.uk>

When in 'master clock mode' (i.e. when host and guest TSCs are behaving
sanely and in sync), the KVM clock is defined in terms of the guest TSC.

When TSC scaling is used, calculating the KVM clock directly from *host*
TSC cycles leads to a systemic drift from the values calculated by the
guest from its TSC.

Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
had a simple workaround for the specific case of Xen timers, as it had an
actual vCPU to hand and could use its scaling information. That commit
noted that it was broken for the general case of get_kvmclock_ns(), and
said "I'll come back to that".

Since __get_kvmclock() is invoked without a specific CPU, it needs to
be able to find or generate the scaling values required to perform the
correct calculation.

Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
so it isn't as complex as it might have been.

In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
kvm->arch.last_tsc_scaling_ratio. That is only protected by the
tsc_writE_lock, so in pvclock_update_vm_gtod_copy(), copy it into a
separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
shift factors to convert to nanoseconds for the corresponding KVM clock,
just as kvm_guest_time_update() would.

In __get_kvmclock(), which runs within a seqcount retry loop, use those
values to convert host to guest TSC and then to nanoseconds. Only fall
back to using get_kvmclock_base_ns() when not in master clock mode.

There was previously a code path in __get_kvmclock() which looked like
it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
even on 32-bit hosts. In practice that could never happen as the
ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
hosts it would never be set when the system clock isn't TSC-based. So
that code path is now removed.

The kvm_get_wall_clock_epoch() function had the same problem; make it
just call get_kvmclock() and subtract kvmclock from wallclock, with
the same fallback as before.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
 arch/x86/include/asm/kvm_host.h |   4 +
 arch/x86/kvm/x86.c              | 150 ++++++++++++++++----------------
 2 files changed, 78 insertions(+), 76 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index cfac72b4aa64..13f979dd14b9 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1353,6 +1353,7 @@ struct kvm_arch {
 	u64 last_tsc_write;
 	u32 last_tsc_khz;
 	u64 last_tsc_offset;
+	u64 last_tsc_scaling_ratio;
 	u64 cur_tsc_nsec;
 	u64 cur_tsc_write;
 	u64 cur_tsc_offset;
@@ -1366,6 +1367,9 @@ struct kvm_arch {
 	bool use_master_clock;
 	u64 master_kernel_ns;
 	u64 master_cycle_now;
+	u64 master_tsc_scaling_ratio;
+	u32 master_tsc_mul;
+	s8 master_tsc_shift;
 	struct delayed_work kvmclock_update_work;
 	struct delayed_work kvmclock_sync_work;
 
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index f870e29d2558..5cd92f4b4c97 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	kvm->arch.last_tsc_nsec = ns;
 	kvm->arch.last_tsc_write = tsc;
 	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
+	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
 	kvm->arch.last_tsc_offset = offset;
 
 	vcpu->arch.last_guest_tsc = tsc;
@@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &kvm->arch;
+	uint64_t last_tsc_hz;
 	int vclock_mode;
 	bool host_tsc_clocksource, vcpus_matched;
 
@@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
 				&& !ka->backwards_tsc_observed
 				&& !ka->boot_vcpu_runs_old_kvmclock;
 
+	/*
+	 * When TSC scaling is in use (which can thankfully only happen
+	 * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
+	 * KVM clock precisely as the guest would, by scaling through
+	 * the guest TSC frequency. Otherwise, differences in arithmetic
+	 * precision lead to systemic drift between the guest's and the
+	 * host's idea of the time.
+	 */
+	if (kvm_caps.has_tsc_control) {
+		/*
+		 * Copy from the field protected solely by ka->tsc_write_lock,
+		 * to the field protected by the ka->pvclock_sc seqlock.
+		 */
+		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
+
+		/*
+		 * Calculate the scaling factors precisely the same way
+		 * that kvm_guest_time_update() does.
+		 */
+		last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
+					    ka->last_tsc_scaling_ratio);
+		kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
+				   &ka->master_tsc_shift, &ka->master_tsc_mul);
+	} else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1009,
+				   &ka->master_tsc_shift, &ka->master_tsc_mul);
+	}
+
 	if (ka->use_master_clock)
 		atomic_set(&kvm_guest_has_master_clock, 1);
 
@@ -3097,36 +3127,49 @@ static unsigned long get_cpu_tsc_khz(void)
 static void __get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
 {
 	struct kvm_arch *ka = &kvm->arch;
-	struct pvclock_vcpu_time_info hv_clock;
+
+#ifdef CONFIG_X86_64
+	uint64_t cur_tsc_khz = 0;
+	struct timespec64 ts;
 
 	/* both __this_cpu_read() and rdtsc() should be on the same cpu */
 	get_cpu();
 
-	data->flags = 0;
 	if (ka->use_master_clock &&
-	    (static_cpu_has(X86_FEATURE_CONSTANT_TSC) || __this_cpu_read(cpu_tsc_khz))) {
-#ifdef CONFIG_X86_64
-		struct timespec64 ts;
+	    (cur_tsc_khz = get_cpu_tsc_khz()) &&
+	    !kvm_get_walltime_and_clockread(&ts, &data->host_tsc))
+		cur_tsc_khz = 0;
 
-		if (kvm_get_walltime_and_clockread(&ts, &data->host_tsc)) {
-			data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
-			data->flags |= KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC;
-		} else
-#endif
-		data->host_tsc = rdtsc();
-
-		data->flags |= KVM_CLOCK_TSC_STABLE;
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
-		kvm_get_time_scale(NSEC_PER_SEC, get_cpu_tsc_khz() * 1000LL,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		data->clock = __pvclock_read_cycles(&hv_clock, data->host_tsc);
-	} else {
-		data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+	put_cpu();
+
+	if (cur_tsc_khz) {
+		uint64_t tsc_cycles;
+		uint32_t mul;
+		int8_t shift;
+
+		tsc_cycles = data->host_tsc - ka->master_cycle_now;
+
+		if (kvm_caps.has_tsc_control)
+			tsc_cycles = kvm_scale_tsc(tsc_cycles,
+						   ka->master_tsc_scaling_ratio);
+
+		if (static_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
+			mul = ka->master_tsc_mul;
+			shift = ka->master_tsc_shift;
+		} else {
+			kvm_get_time_scale(NSEC_PER_SEC, cur_tsc_khz * 1000LL,
+					   &shift, &mul);
+		}
+		data->clock = ka->master_kernel_ns + ka->kvmclock_offset +
+			pvclock_scale_delta(tsc_cycles, mul, shift);
+		data->realtime = ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec;
+		data->flags = KVM_CLOCK_REALTIME | KVM_CLOCK_HOST_TSC | KVM_CLOCK_TSC_STABLE;
+		return;
 	}
+#endif
 
-	put_cpu();
+	data->clock = get_kvmclock_base_ns() + ka->kvmclock_offset;
+	data->flags = 0;
 }
 
 static void get_kvmclock(struct kvm *kvm, struct kvm_clock_data *data)
@@ -3327,68 +3370,23 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
  * that and kvmclock, but even that would be subject to change over
  * time.
  *
- * Attempt to calculate the epoch at a given moment using the *same*
- * TSC reading via kvm_get_walltime_and_clockread() to obtain both
- * wallclock and kvmclock times, and subtracting one from the other.
+ * Use get_kvmclock() to obtain a simultaneous reading of wallclock
+ * and kvmclock times from the *same* TSC reading, and subtract one
+ * from the other.
  *
  * Fall back to using their values at slightly different moments by
  * calling ktime_get_real_ns() and get_kvmclock_ns() separately.
  */
 uint64_t kvm_get_wall_clock_epoch(struct kvm *kvm)
 {
-#ifdef CONFIG_X86_64
-	struct pvclock_vcpu_time_info hv_clock;
-	struct kvm_arch *ka = &kvm->arch;
-	unsigned long seq, local_tsc_khz;
-	struct timespec64 ts;
-	uint64_t host_tsc;
-
-	do {
-		seq = read_seqcount_begin(&ka->pvclock_sc);
-
-		local_tsc_khz = 0;
-		if (!ka->use_master_clock)
-			break;
-
-		/*
-		 * The TSC read and the call to get_cpu_tsc_khz() must happen
-		 * on the same CPU.
-		 */
-		get_cpu();
-
-		local_tsc_khz = get_cpu_tsc_khz();
-
-		if (local_tsc_khz &&
-		    !kvm_get_walltime_and_clockread(&ts, &host_tsc))
-			local_tsc_khz = 0; /* Fall back to old method */
-
-		put_cpu();
-
-		/*
-		 * These values must be snapshotted within the seqcount loop.
-		 * After that, it's just mathematics which can happen on any
-		 * CPU at any time.
-		 */
-		hv_clock.tsc_timestamp = ka->master_cycle_now;
-		hv_clock.system_time = ka->master_kernel_ns + ka->kvmclock_offset;
+	struct kvm_clock_data data;
 
-	} while (read_seqcount_retry(&ka->pvclock_sc, seq));
+	get_kvmclock(kvm, &data);
 
-	/*
-	 * If the conditions were right, and obtaining the wallclock+TSC was
-	 * successful, calculate the KVM clock at the corresponding time and
-	 * subtract one from the other to get the guest's epoch in nanoseconds
-	 * since 1970-01-01.
-	 */
-	if (local_tsc_khz) {
-		kvm_get_time_scale(NSEC_PER_SEC, local_tsc_khz * NSEC_PER_USEC,
-				   &hv_clock.tsc_shift,
-				   &hv_clock.tsc_to_system_mul);
-		return ts.tv_nsec + NSEC_PER_SEC * ts.tv_sec -
-			__pvclock_read_cycles(&hv_clock, host_tsc);
-	}
-#endif
-	return ktime_get_real_ns() - get_kvmclock_ns(kvm);
+	if (data.flags & KVM_CLOCK_REALTIME)
+		return data.realtime - data.clock;
+	else
+		return ktime_get_real_ns() - data.clock;
 }
 
 /* These helpers are safe iff @msr is known to be an MCx bank MSR. */
-- 
2.44.0


^ permalink raw reply related	[flat|nested] 30+ messages in thread

* [PATCH 11/10] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time()
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (9 preceding siblings ...)
  2024-04-18 19:34 ` [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
@ 2024-04-19 12:51 ` David Woodhouse
  2024-04-19 12:52 ` [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
  11 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-19 12:51 UTC (permalink / raw
  To: kvm, Dongli Zhang
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 5475 bytes --]

From: David Woodhouse <dwmw@amazon.co.uk>

There was some confusion in kvm_update_guest_time() when software needs
to advance the guest TSC.

In master clock mode, there are two points of time which need to be taken
into account. First there is the master clock reference point, stored in
kvm->arch.master_kernel_ns (and associated host TSC ->master_cycle_now).
Secondly, there is the time *now*, at the point kvm_update_guest_time()
is being called.

With software TSC upscaling, the guest TSC is getting further and further
ahead of the host TSC as time elapses. So at time "now", the guest TSC
should be further ahead of the host, than it was at master_kernel_ns.

The adjustment in kvm_update_guest_time() was not taking that into
account, and was only advancing the guest TSC by the appropriate amount
for master_kernel_ns, *not* the current time.

Fix it to calculate them both correctly.

Since the KVM clock reference point in master_kernel_ns might actually
be *earlier* than the reference point used for the guest TSC
(vcpu->last_tsc_nsec), this might lead to a negative delta. Fix the
compute_guest_tsc() function to cope with negative numbers, which
then means there is no need to force a master clock update when the
guest TSC is written.

Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
---
Untested. Thrown on the pile at
https://git.infradead.org/users/dwmw2/linux.git/shortlog/refs/heads/clocks
which we'll be testing more next week...

 arch/x86/kvm/x86.c | 61 +++++++++++++++++++++++++++++++++-------------
 1 file changed, 44 insertions(+), 17 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 5cd92f4b4c97..a78adef698bd 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2488,10 +2488,19 @@ static int kvm_set_tsc_khz(struct kvm_vcpu *vcpu, u32 user_tsc_khz)
 
 static u64 compute_guest_tsc(struct kvm_vcpu *vcpu, s64 kernel_ns)
 {
-	u64 tsc = pvclock_scale_delta(kernel_ns-vcpu->arch.this_tsc_nsec,
-				      vcpu->arch.virtual_tsc_mult,
-				      vcpu->arch.virtual_tsc_shift);
-	tsc += vcpu->arch.this_tsc_write;
+	s64 delta = kernel_ns - vcpu->arch.this_tsc_nsec;
+	u64 tsc = vcpu->arch.this_tsc_write;
+
+	/* pvclock_scale_delta cannot cope with negative deltas */
+	if (delta >= 0)
+		tsc += pvclock_scale_delta(delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+	else
+		tsc -= pvclock_scale_delta(-delta,
+					   vcpu->arch.virtual_tsc_mult,
+					   vcpu->arch.virtual_tsc_shift);
+
 	return tsc;
 }
 
@@ -2502,7 +2511,7 @@ static inline bool gtod_is_based_on_tsc(int mode)
 }
 #endif
 
-static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
+static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu)
 {
 #ifdef CONFIG_X86_64
 	struct kvm_arch *ka = &vcpu->kvm->arch;
@@ -2519,12 +2528,9 @@ static void kvm_track_tsc_matching(struct kvm_vcpu *vcpu, bool new_generation)
 
 	/*
 	 * Request a masterclock update if the masterclock needs to be toggled
-	 * on/off, or when starting a new generation and the masterclock is
-	 * enabled (compute_guest_tsc() requires the masterclock snapshot to be
-	 * taken _after_ the new generation is created).
+	 * on/off.
 	 */
-	if ((ka->use_master_clock && new_generation) ||
-	    (ka->use_master_clock != use_master_clock))
+	if ((ka->use_master_clock != use_master_clock))
 		kvm_make_request(KVM_REQ_MASTERCLOCK_UPDATE, vcpu);
 
 	trace_kvm_track_tsc(vcpu->vcpu_id, ka->nr_vcpus_matched_tsc,
@@ -2702,7 +2708,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
 	vcpu->arch.this_tsc_nsec = kvm->arch.cur_tsc_nsec;
 	vcpu->arch.this_tsc_write = kvm->arch.cur_tsc_write;
 
-	kvm_track_tsc_matching(vcpu, !matched);
+	kvm_track_tsc_matching(vcpu);
 }
 
 static void kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 *user_value)
@@ -3296,8 +3302,6 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 		kernel_ns = get_kvmclock_base_ns();
 	}
 
-	tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
-
 	/*
 	 * We may have to catch up the TSC to match elapsed wall clock
 	 * time for two reasons, even if kvmclock is used.
@@ -3309,11 +3313,34 @@ static int kvm_guest_time_update(struct kvm_vcpu *v)
 	 *	very slowly.
 	 */
 	if (vcpu->tsc_catchup) {
-		u64 tsc = compute_guest_tsc(v, kernel_ns);
-		if (tsc > tsc_timestamp) {
-			adjust_tsc_offset_guest(v, tsc - tsc_timestamp);
-			tsc_timestamp = tsc;
+		uint64_t now_guest_tsc_adjusted;
+		uint64_t now_guest_tsc_unadjusted;
+		int64_t now_guest_tsc_delta;
+
+		tsc_timestamp = compute_guest_tsc(v, kernel_ns);
+
+		if (use_master_clock) {
+			uint64_t now_host_tsc;
+			int64_t now_kernel_ns;
+
+			if (!kvm_get_time_and_clockread(&now_kernel_ns, &now_host_tsc)) {
+				now_kernel_ns = get_kvmclock_base_ns();
+				now_host_tsc = rdtsc();
+			}
+			now_guest_tsc_adjusted = compute_guest_tsc(v, now_kernel_ns);
+			now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, now_host_tsc);
+		} else {
+			now_guest_tsc_adjusted = tsc_timestamp;
+			now_guest_tsc_unadjusted = kvm_read_l1_tsc(v, kernel_ns);
 		}
+
+		now_guest_tsc_delta = now_guest_tsc_adjusted -
+			now_guest_tsc_unadjusted;
+
+		if (now_guest_tsc_delta > 0)
+			adjust_tsc_offset_guest(v, now_guest_tsc_delta);
+	} else {
+		tsc_timestamp = kvm_read_l1_tsc(v, host_tsc);
 	}
 
 	local_irq_restore(flags);
-- 
2.34.1



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply related	[flat|nested] 30+ messages in thread

* Re: [RFC PATCH 0/10] Cleaning up the KVM clock mess
  2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
                   ` (10 preceding siblings ...)
  2024-04-19 12:51 ` [PATCH 11/10] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
@ 2024-04-19 12:52 ` David Woodhouse
  11 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-19 12:52 UTC (permalink / raw
  To: kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith,
	Dongli Zhang

[-- Attachment #1: Type: text/plain, Size: 547 bytes --]

On Thu, 2024-04-18 at 20:34 +0100, David Woodhouse wrote:
> 
>       KVM: x86: Remove periodic global clock updates
>       KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE

Meh, I might have to put those back. They were originally introduced to
cope with NTP frequency skew which is no longer a problem, but we now
know there's a systemic skew even between the host CLOCK_MONOTONIC_RAW
and the KVM clock as calculated via the guest's TSC.

So at least when !ka->use_master_clock I think the forced resync does
need to happen.



[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-18 19:34 ` [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
@ 2024-04-19 15:29   ` Paul Durrant
  2024-04-22 12:22   ` Paolo Bonzini
  1 sibling, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:29 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The kvm_guest_time_update() function scales the host TSC frequency to
> the guest's using kvm_scale_tsc() and the v->arch.l1_tsc_scaling_ratio
> scaling ratio previously calculated for that vCPU. Then calcuates the
> scaling factors for the KVM clock itself based on that guest TSC
> frequency.
> 
> However, it uses kHz as the unit when scaling, and then multiplies by
> 1000 only at the end.
> 
> With a host TSC frequency of 3000MHz and a guest set to 2500MHz, the
> result of kvm_scale_tsc() will actually come out at 2,499,999kHz. So
> the KVM clock advertised to the guest is based on a frequency of
> 2,499,999,000 Hz.
> 
> By using Hz as the unit from the beginning, the KVM clock would be based
> on a more accurate frequency of 2,499,999,999 Hz in this example.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> Fixes: 78db6a503796 ("KVM: x86: rewrite handling of scaled TSC for kvmclock")
> ---
>   arch/x86/include/asm/kvm_host.h |  2 +-
>   arch/x86/kvm/x86.c              | 17 +++++++++--------
>   arch/x86/kvm/xen.c              |  2 +-
>   3 files changed, 11 insertions(+), 10 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-04-18 19:34 ` [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
@ 2024-04-19 15:40   ` Paul Durrant
  2024-04-19 15:49     ` David Woodhouse
  2024-04-22 14:11   ` Paolo Bonzini
  1 sibling, 1 reply; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:40 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: Jack Allister <jalliste@amazon.com>
> 
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based on
> a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
> 
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which is
> necessarily imprecise.
> 
> So on live update, the guest TSC can remain cycle accurate at precisely the
> same offset from the host TSC, but there is no way for userspace to restore
> the KVM clock accurately.
> 
> Even on live migration to a new host, where the accuracy of the guest time-
> keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest TSC,
> and not suffer an *additional* discontinuity.
> 
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
> save and restore the actual PV clock info in pvclock_vcpu_time_info.
> 
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed through
> the user-provided pvclock structure to generate the *intended* KVM clock
> value at that point in time, and through the *actual* KVM clock calculation.
> Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.
> 
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
> 
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
> *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
> migration data (unless the intent is to refuse to resume on a host with bad
> TSC).
> 
> (It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
> non-masterclock mode, as that mode is necessarily imprecise anyway. The
> explicit fallback allows userspace to deliberately fail migration to a host
> with misbehaving TSC where master clock mode wouldn't be active.)
> 
> Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> CC: Paul Durrant <paul@xen.org>
> CC: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>   Documentation/virt/kvm/api.rst |  37 ++++++++
>   arch/x86/kvm/x86.c             | 156 +++++++++++++++++++++++++++++++++
>   include/uapi/linux/kvm.h       |   3 +
>   3 files changed, 196 insertions(+)
> 
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0b76ff5030d..758f6fc08fe5 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
>   
>   See KVM_SET_USER_MEMORY_REGION2 for additional details.
>   
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock not is in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
>   5. The kvm_run structure
>   ========================
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 23281c508c27..42abce7b4fc9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5868,6 +5868,154 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>   	}
>   }
>   
> +#ifdef CONFIG_X86_64
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +	struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;
> +
> +	/*
> +	 * If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
> +	 * never been generated at all, call kvm_guest_time_update() to do so.
> +	 * Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
> +	 * having been written.
> +	 */
> +	if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
> +	    !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
> +		if (kvm_guest_time_update(v))
> +			return -EINVAL;

nit: simple nested if, so you could use &&

> +	}
> +
> +	/*
> +	 * PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
> +	 * KVM clock is defined in terms of the guest TSC. Otherwise, it is
> +	 * is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
> +	 * use the legacy KVM_[GS]ET_CLOCK to migrate it.
> +	 */
> +	if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
> +		return -EINVAL;
> +
> +	if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
> +		return -EFAULT;
> +
> +	return 0;
> +}
> +

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction
  2024-04-18 19:34 ` [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
@ 2024-04-19 15:44   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:44 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: Jack Allister <jalliste@amazon.com>
> 
> A VM's KVM/PV clock has an inherent relationship to its TSC (guest). When
> either the host system live-updates or the VM is live-migrated this pairing
> of the two clock sources should stay the same.
> 
> In reality this is not the case without some correction taking place. Two
> new IOCTLs (KVM_GET_CLOCK_GUEST/KVM_SET_CLOCK_GUEST) can be utilized to
> perform a correction on the PVTI (PV time information) structure held by
> KVM to effectively fixup the kvmclock_offset prior to the guest VM resuming
> in either a live-update/migration scenario.
> 
> This test proves that without the necessary fixup there is a perceived
> change in the guest TSC & KVM/PV clock relationship before and after a LU/
> LM takes place.
> 
> The following steps are made to verify there is a delta in the relationship
> and that it can be corrected:
> 
> 1. PVTI is sampled by guest at boot (let's call this PVTI0).
> 2. Induce a change in PVTI data (KVM_REQ_MASTERCLOCK_UPDATE).
> 3. PVTI is sampled by guest after change (PVTI1).
> 4. Correction is requested by usermode to KVM using PVTI0.
> 5. PVTI is sampled by guest after correction (PVTI2).
> 
> The guest the records a singular TSC reference point in time and uses it to
> calculate 3 KVM clock values utilizing the 3 recorded PVTI prior. Let's
> call each clock value CLK[0-2].
> 
> In a perfect world CLK[0-2] should all be the same value if the KVM clock
> & TSC relationship is preserved across the LU/LM (or faked in this test),
> however it is not.
> 
> A delta can be observed between CLK0-CLK1 due to KVM recalculating the PVTI
> (and the inaccuracies associated with that). A delta of ~3500ns can be
> observed if guest TSC scaling to half host TSC frequency is also enabled,
> where as without scaling this is observed at ~180ns.
> 
> With the correction it should be possible to achieve a delta of ±1ns.
> 
> An option to enable guest TSC scaling is available via invoking the tester
> with -s/--scale-tsc.
> 
> Example of the test output below:
> * selftests: kvm: pvclock_test
> * scaling tsc from 2999999KHz to 1499999KHz
> * before=5038374946 uncorrected=5038371437 corrected=5038374945
> * delta_uncorrected=3509 delta_corrected=1
> 
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> CC: Paul Durrant <paul@xen.org>
> CC: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>   tools/testing/selftests/kvm/Makefile          |   1 +
>   .../selftests/kvm/x86_64/pvclock_test.c       | 192 ++++++++++++++++++
>   2 files changed, 193 insertions(+)
>   create mode 100644 tools/testing/selftests/kvm/x86_64/pvclock_test.c
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC
  2024-04-18 19:34 ` [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
@ 2024-04-19 15:45   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:45 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> KVM does make an attempt to cope with non-constant TSC, and has notifiers
> to handle host TSC frequency changes. However, it *only* adjusts the KVM
> clock, and doesn't adjust TSC frequency scaling when the host changes.
> 
> This is presumably because non-constant TSCs were fixed in hardware long
> before TSC scaling was implemented, so there should never be real CPUs
> which have TSC scaling but *not* CONSTANT_TSC.
> 
> Such a combination could potentially happen in some odd L1 nesting
> environment, but it isn't worth trying to support it. Just make the
> dependency explicit.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/svm/svm.c | 3 ++-
>   arch/x86/kvm/vmx/vmx.c | 2 +-
>   2 files changed, 3 insertions(+), 2 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-04-19 15:40   ` Paul Durrant
@ 2024-04-19 15:49     ` David Woodhouse
  0 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-19 15:49 UTC (permalink / raw
  To: paul, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 1203 bytes --]

On Fri, 2024-04-19 at 16:40 +0100, Paul Durrant wrote:
> 
> > +        * If KVM_REQ_CLOCK_UPDATE is already pending, or if the
> > hv_clock has
> > +        * never been generated at all, call
> > kvm_guest_time_update() to do so.
> > +        * Might as well use the PVCLOCK_TSC_STABLE_BIT as the
> > check for ever
> > +        * having been written.
> > +        */
> > +       if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
> > +           !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
> > +               if (kvm_guest_time_update(v))
> > +                       return -EINVAL;
> 
> nit: simple nested if, so you could use &&

Yeah, I frowned at that a bit, and decided I preferred it this way just
to highlight the fact that kvm_guest_time_update() is doing a *thing*.
And then we bail with -EINVAL if that thing fails.

If you stick it all in the if() statement, the code flow is logically
the same but it's just a bit less obvious that there are side-effects
here in the middle of the conditions, and that those side-effects are
actually the *main* point of the statement.


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2024-04-18 19:34 ` [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
@ 2024-04-19 15:49   ` Paul Durrant
  2024-04-19 15:53     ` David Woodhouse
  0 siblings, 1 reply; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:49 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> inadequate. It ignores TSC scaling, and ignores the fact that the host
> TSC may differ from one host to the next (and in fact because of the way
> the kernel calibrates it, it generally differs from one boot to the next
> even on the same hardware).
> 
> Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> and attempt to document the *awful* process that we're requiring userspace
> to follow to merely preserve the TSC across migration.
> 
> I may have thrown up in my mouth a little when writing that documentation.
> It's an awful API. If we do this, we should be ashamed of ourselves.
> (I also haven't tested the documented process yet).
> 
> Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
>   arch/x86/include/uapi/asm/kvm.h         |   6 ++
>   arch/x86/kvm/x86.c                      |  15 ++++
>   3 files changed, 109 insertions(+), 27 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration
  2024-04-19 15:49   ` Paul Durrant
@ 2024-04-19 15:53     ` David Woodhouse
  0 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-19 15:53 UTC (permalink / raw
  To: paul, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 2231 bytes --]

On Fri, 2024-04-19 at 16:49 +0100, Paul Durrant wrote:
> On 18/04/2024 20:34, David Woodhouse wrote:
> > From: David Woodhouse <dwmw@amazon.co.uk>
> > 
> > The documentation on TSC migration using KVM_VCPU_TSC_OFFSET is woefully
> > inadequate. It ignores TSC scaling, and ignores the fact that the host
> > TSC may differ from one host to the next (and in fact because of the way
> > the kernel calibrates it, it generally differs from one boot to the next
> > even on the same hardware).
> > 
> > Add KVM_VCPU_TSC_SCALE to extract the actual scale ratio and frac_bits,
> > and attempt to document the *awful* process that we're requiring userspace
> > to follow to merely preserve the TSC across migration.
> > 
> > I may have thrown up in my mouth a little when writing that documentation.
> > It's an awful API. If we do this, we should be ashamed of ourselves.
> > (I also haven't tested the documented process yet).
> > 
> > Let's use Simon's KVM_VCPU_TSC_VALUE instead.
> > https://lore.kernel.org/all/20230202165950.483430-1-sveith@amazon.de/

Oops, I meant to change that commit message. I've changed my mind, as
KVM_VCPU_TSC_VALUE sets the TSC using the KVM clock as a reference...
which is utterly backwards, since the KVM clock is *defined* in terms
of the TSC.

Will do so before posting it again.

May also include the "loop over KVM_GET_CLOCK until tai_offset didn't
change" part to fix the leap seconbd problem.

I think we do need to implement a better way to get { TAI, host TSC }
than abusing KVM_GET_CLOCK when we don't even *care* about the KVM
clock. We need a way to migrate the arch counters on platforms other
than x86 too. But I'm trying to leave that part for later. At least on
x86 we *do* have a way to migrate the TSC accurately, even if it's
awful.

> > Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> > ---
> >   Documentation/virt/kvm/devices/vcpu.rst | 115 ++++++++++++++++++------
> >   arch/x86/include/uapi/asm/kvm.h         |   6 ++
> >   arch/x86/kvm/x86.c                      |  15 ++++
> >   3 files changed, 109 insertions(+), 27 deletions(-)
> > 
> 
> Reviewed-by: Paul Durrant <paul@xen.org>
> 


[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host
  2024-04-18 19:34 ` [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
@ 2024-04-19 15:53   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:53 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> Commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to monotonic raw
> clock") did so only for 64-bit hosts, by capturing the boot offset from
> within the existing clocksource notifier update_pvclock_gtod().
> 
> That notifier was added in commit 16e8d74d2da9 ("KVM: x86: notifier for
> clocksource changes") but only on x86_64, because its original purpose
> was just to disable the "master clock" mode which is only supported on
> x86_64.
> 
> Now that the notifier is used for more than disabling master clock mode,
> (well, OK, more than a decade later but clocks are hard), enable it for
> the 32-bit build too so that get_kvmclock_base_ns() can be unaffected by
> NTP sync on 32-bit too.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 18 ++++++------------
>   1 file changed, 6 insertions(+), 12 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 08/10] KVM: x86: Remove periodic global clock updates
  2024-04-18 19:34 ` [PATCH 08/10] KVM: x86: Remove periodic global clock updates David Woodhouse
@ 2024-04-19 15:55   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:55 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This effectively reverts commit 332967a3eac0 ("x86: kvm: introduce
> periodic global clock updates"). The periodic update was introduced to
> propagate NTP corrections to the guest KVM clock, when the KVM clock was
> based on CLOCK_MONOTONIC.
> 
> However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
> monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
> for the KVM clock, avoiding the NTP frequency skew altogether.
> 
> So the periodic update serves no purpose. Remove it.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/kvm/x86.c | 25 -------------------------
>   1 file changed, 25 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE
  2024-04-18 19:34 ` [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE David Woodhouse
@ 2024-04-19 15:57   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 15:57 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> This was introduced in commit 0061d53daf26 ("KVM: x86: limit difference
> between kvmclock updates") to reduce cross-vCPU differences which arose
> because the KVM clock was based on CLOCK_MONOTONIC and thus subject to
> NTP frequency corrections.
> 
> However, commit 53fafdbb8b21 ("KVM: x86: switch KVMCLOCK base to
> monotonic raw clock") switched to using CLOCK_MONOTONIC_RAW as the basis
> for the KVM clock, avoiding the NTP frequency skew altogether.
> 
> So remove KVM_REQ_GLOBAL_CLOCK_UPDATE. In kvm_write_system_time(), all
> that's needed is a single KVM_REQ_CLOCK_UPDATE for the vCPU whose KVM
> clock is being configured.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |  2 +-
>   arch/x86/kvm/x86.c              | 57 ++-------------------------------
>   2 files changed, 4 insertions(+), 55 deletions(-)
> 

Reviewed-by: Paul Durrant <paul@xen.org>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock()
  2024-04-18 19:34 ` [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
@ 2024-04-19 16:07   ` Paul Durrant
  0 siblings, 0 replies; 30+ messages in thread
From: Paul Durrant @ 2024-04-19 16:07 UTC (permalink / raw
  To: David Woodhouse, kvm
  Cc: Paolo Bonzini, Jonathan Corbet, Sean Christopherson,
	Thomas Gleixner, Ingo Molnar, Borislav Petkov, Dave Hansen, x86,
	H. Peter Anvin, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On 18/04/2024 20:34, David Woodhouse wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>
> 
> When in 'master clock mode' (i.e. when host and guest TSCs are behaving
> sanely and in sync), the KVM clock is defined in terms of the guest TSC.
> 
> When TSC scaling is used, calculating the KVM clock directly from *host*
> TSC cycles leads to a systemic drift from the values calculated by the
> guest from its TSC.
> 
> Commit 451a707813ae ("KVM: x86/xen: improve accuracy of Xen timers")
> had a simple workaround for the specific case of Xen timers, as it had an
> actual vCPU to hand and could use its scaling information. That commit
> noted that it was broken for the general case of get_kvmclock_ns(), and
> said "I'll come back to that".
> 
> Since __get_kvmclock() is invoked without a specific CPU, it needs to
> be able to find or generate the scaling values required to perform the
> correct calculation.
> 
> Thankfully, TSC scaling can only happen with X86_FEATURE_CONSTANT_TSC,
> so it isn't as complex as it might have been.
> 
> In __kvm_synchronize_tsc(), note the current vCPU's scaling ratio in
> kvm->arch.last_tsc_scaling_ratio. That is only protected by the
> tsc_writE_lock, so in pvclock_update_vm_gtod_copy(), copy it into a

^ typo: capitalization of the 'E'

> separate kvm->arch.master_tsc_scaling_ratio so that it can be accessed
> using the kvm->arch.pvclock_sc seqcount lock. Also generate the mul and
> shift factors to convert to nanoseconds for the corresponding KVM clock,
> just as kvm_guest_time_update() would.
> 
> In __get_kvmclock(), which runs within a seqcount retry loop, use those
> values to convert host to guest TSC and then to nanoseconds. Only fall
> back to using get_kvmclock_base_ns() when not in master clock mode.
> 
> There was previously a code path in __get_kvmclock() which looked like
> it could set KVM_CLOCK_TSC_STABLE without KVM_CLOCK_REALTIME, perhaps
> even on 32-bit hosts. In practice that could never happen as the
> ka->use_master_clock flag couldn't be set on 32-bit, and even on 64-bit
> hosts it would never be set when the system clock isn't TSC-based. So
> that code path is now removed.
> 
> The kvm_get_wall_clock_epoch() function had the same problem; make it
> just call get_kvmclock() and subtract kvmclock from wallclock, with
> the same fallback as before.
> 
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> ---
>   arch/x86/include/asm/kvm_host.h |   4 +
>   arch/x86/kvm/x86.c              | 150 ++++++++++++++++----------------
>   2 files changed, 78 insertions(+), 76 deletions(-)
> 
> diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
> index cfac72b4aa64..13f979dd14b9 100644
> --- a/arch/x86/include/asm/kvm_host.h
> +++ b/arch/x86/include/asm/kvm_host.h
> @@ -1353,6 +1353,7 @@ struct kvm_arch {
>   	u64 last_tsc_write;
>   	u32 last_tsc_khz;
>   	u64 last_tsc_offset;
> +	u64 last_tsc_scaling_ratio;
>   	u64 cur_tsc_nsec;
>   	u64 cur_tsc_write;
>   	u64 cur_tsc_offset;
> @@ -1366,6 +1367,9 @@ struct kvm_arch {
>   	bool use_master_clock;
>   	u64 master_kernel_ns;
>   	u64 master_cycle_now;
> +	u64 master_tsc_scaling_ratio;
> +	u32 master_tsc_mul;
> +	s8 master_tsc_shift;
>   	struct delayed_work kvmclock_update_work;
>   	struct delayed_work kvmclock_sync_work;
>   
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index f870e29d2558..5cd92f4b4c97 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2671,6 +2671,7 @@ static void __kvm_synchronize_tsc(struct kvm_vcpu *vcpu, u64 offset, u64 tsc,
>   	kvm->arch.last_tsc_nsec = ns;
>   	kvm->arch.last_tsc_write = tsc;
>   	kvm->arch.last_tsc_khz = vcpu->arch.virtual_tsc_khz;
> +	kvm->arch.last_tsc_scaling_ratio = vcpu->arch.l1_tsc_scaling_ratio;
>   	kvm->arch.last_tsc_offset = offset;
>   
>   	vcpu->arch.last_guest_tsc = tsc;
> @@ -3006,6 +3007,7 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>   {
>   #ifdef CONFIG_X86_64
>   	struct kvm_arch *ka = &kvm->arch;
> +	uint64_t last_tsc_hz;
>   	int vclock_mode;
>   	bool host_tsc_clocksource, vcpus_matched;
>   
> @@ -3025,6 +3027,34 @@ static void pvclock_update_vm_gtod_copy(struct kvm *kvm)
>   				&& !ka->backwards_tsc_observed
>   				&& !ka->boot_vcpu_runs_old_kvmclock;
>   
> +	/*
> +	 * When TSC scaling is in use (which can thankfully only happen
> +	 * with X86_FEATURE_CONSTANT_TSC), the host must calculate the
> +	 * KVM clock precisely as the guest would, by scaling through
> +	 * the guest TSC frequency. Otherwise, differences in arithmetic
> +	 * precision lead to systemic drift between the guest's and the
> +	 * host's idea of the time.
> +	 */
> +	if (kvm_caps.has_tsc_control) {
> +		/*
> +		 * Copy from the field protected solely by ka->tsc_write_lock,
> +		 * to the field protected by the ka->pvclock_sc seqlock.
> +		 */
> +		ka->master_tsc_scaling_ratio = ka->last_tsc_scaling_ratio;
> +
> +		/*
> +		 * Calculate the scaling factors precisely the same way
> +		 * that kvm_guest_time_update() does.
> +		 */
> +		last_tsc_hz = kvm_scale_tsc(tsc_khz * 1000,
> +					    ka->last_tsc_scaling_ratio);
> +		kvm_get_time_scale(NSEC_PER_SEC, last_tsc_hz,
> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	} else if (boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
> +		kvm_get_time_scale(NSEC_PER_SEC, tsc_khz * 1009,

1009?

> +				   &ka->master_tsc_shift, &ka->master_tsc_mul);
> +	}
> +
>   	if (ka->use_master_clock)
>   		atomic_set(&kvm_guest_has_master_clock, 1);
>   


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-18 19:34 ` [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
  2024-04-19 15:29   ` Paul Durrant
@ 2024-04-22 12:22   ` Paolo Bonzini
  2024-04-22 15:39     ` David Woodhouse
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2024-04-22 12:22 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote:
>         gpa_t time;
>         struct pvclock_vcpu_time_info hv_clock;
> -       unsigned int hw_tsc_khz;
> +       unsigned int hw_tsc_hz;

Why not change this to u64? 4.3 GHz is scarily close to current
processors, though I expect that it will break a lot more software
than just KVM.

>  static int kvm_guest_time_update(struct kvm_vcpu *v)
>  {
> -       unsigned long flags, tgt_tsc_khz;
> +       unsigned long flags;
> +       uint64_t tgt_tsc_hz;

... especially considering that you did use a 64-bit integer here
(though---please use u64 not uint64_t; and BTW if you want to add a
patch to change kvm_get_time_scale() to u64, please do.

Thanks,

Paolo


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-04-18 19:34 ` [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
  2024-04-19 15:40   ` Paul Durrant
@ 2024-04-22 14:11   ` Paolo Bonzini
  2024-04-22 15:02     ` David Woodhouse
  1 sibling, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2024-04-22 14:11 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On Thu, Apr 18, 2024 at 9:46 PM David Woodhouse <dwmw2@infradead.org> wrote:
> +       curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
> +       if (unlikely(curr_tsc_hz == 0)) {
> +               rc = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (kvm_caps.has_tsc_control)
> +               curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
> +                                           v->arch.l1_tsc_scaling_ratio);
> +
> +       /*
> +        * The scaling factors in the hv_clock do not depend solely on the
> +        * TSC frequency *requested* by userspace. They actually use the
> +        * host TSC frequency that was measured/detected by the host kernel,
> +        * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
> +        * So a sanity check that they *precisely* match would have false
> +        * negatives. Allow for a discrepancy of 1 kHz either way.

This is not very clear - if kvm_caps.has_tsc_control, cur_tsc_hz is
exactly the "host TSC frequency [...] scaled by kvm_scale_tsc() with
the vCPU's l1_tsc_scaling_ratio". But even in that case there is a
double rounding issue, I guess.

> +       /*
> +        * The call to pvclock_update_vm_gtod_copy() has created a new time
> +        * reference point in ka->master_cycle_now and ka->master_kernel_ns.
> +        *
> +        * Calculate the guest TSC at that moment, and the corresponding KVM
> +        * clock value according to user_hv_clock. The value according to the
> +        * current hv_clock will of course be ka->master_kernel_ns since no
> +        * TSC cycles have elapsed.
> +        *
> +        * Adjust ka->kvmclock_offset to the delta, so that both definitions
> +        * of the clock give precisely the same reading at the reference time.
> +        */
> +       guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +       user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
> +       ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
> +
> +out:
> +       kvm_end_pvclock_update(kvm);
> +       return rc;
> +}
> +#endif
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>                          unsigned int ioctl, unsigned long arg)
>  {
> @@ -6256,6 +6404,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>                 srcu_read_unlock(&vcpu->kvm->srcu, idx);
>                 break;
>         }
> +#ifdef CONFIG_X86_64
> +       case KVM_SET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> +               break;
> +       case KVM_GET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> +               break;
> +#endif
>  #ifdef CONFIG_KVM_HYPERV
>         case KVM_GET_SUPPORTED_HV_CPUID:
>                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>
> +#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */
> --
> 2.44.0
>

On Thu, Apr 18, 2024 at 9:46 PM David Woodhouse <dwmw2@infradead.org> wrote:
>
> From: Jack Allister <jalliste@amazon.com>
>
> In the common case (where kvm->arch.use_master_clock is true), the KVM
> clock is defined as a simple arithmetic function of the guest TSC, based on
> a reference point stored in kvm->arch.master_kernel_ns and
> kvm->arch.master_cycle_now.
>
> The existing KVM_[GS]ET_CLOCK functionality does not allow for this
> relationship to be precisely saved and restored by userspace. All it can
> currently do is set the KVM clock at a given UTC reference time, which is
> necessarily imprecise.
>
> So on live update, the guest TSC can remain cycle accurate at precisely the
> same offset from the host TSC, but there is no way for userspace to restore
> the KVM clock accurately.
>
> Even on live migration to a new host, where the accuracy of the guest time-
> keeping is fundamentally limited by the accuracy of wallclock
> synchronization between the source and destination hosts, the clock jump
> experienced by the guest's TSC and its KVM clock should at least be
> *consistent*. Even when the guest TSC suffers a discontinuity, its KVM
> clock should still remain the *same* arithmetic function of the guest TSC,
> and not suffer an *additional* discontinuity.
>
> To allow for accurate migration of the KVM clock, add per-vCPU ioctls which
> save and restore the actual PV clock info in pvclock_vcpu_time_info.
>
> The restoration in KVM_SET_CLOCK_GUEST works by creating a new reference
> point in time just as kvm_update_masterclock() does, and calculating the
> corresponding guest TSC value. This guest TSC value is then passed through
> the user-provided pvclock structure to generate the *intended* KVM clock
> value at that point in time, and through the *actual* KVM clock calculation.
> Then kvm->arch.kvmclock_offset is adjusted to eliminate for the difference.
>
> Where kvm->arch.use_master_clock is false (because the host TSC is
> unreliable, or the guest TSCs are configured strangely), the KVM clock
> is *not* defined as a function of the guest TSC so KVM_GET_CLOCK_GUEST
> returns an error. In this case, as documented, userspace shall use the
> legacy KVM_GET_CLOCK ioctl. The loss of precision is acceptable in this
> case since the clocks are imprecise in this mode anyway.
>
> On *restoration*, if kvm->arch.use_master_clock is false, an error is
> returned for similar reasons and userspace shall fall back to using
> KVM_SET_CLOCK. This does mean that, as documented, userspace needs to use
> *both* KVM_GET_CLOCK_GUEST and KVM_GET_CLOCK and send both results with the
> migration data (unless the intent is to refuse to resume on a host with bad
> TSC).
>
> (It may have been possible to make KVM_SET_CLOCK_GUEST "good enough" in the
> non-masterclock mode, as that mode is necessarily imprecise anyway. The
> explicit fallback allows userspace to deliberately fail migration to a host
> with misbehaving TSC where master clock mode wouldn't be active.)
>
> Co-developed-by: David Woodhouse <dwmw@amazon.co.uk>
> Signed-off-by: Jack Allister <jalliste@amazon.com>
> Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
> CC: Paul Durrant <paul@xen.org>
> CC: Dongli Zhang <dongli.zhang@oracle.com>
> ---
>  Documentation/virt/kvm/api.rst |  37 ++++++++
>  arch/x86/kvm/x86.c             | 156 +++++++++++++++++++++++++++++++++
>  include/uapi/linux/kvm.h       |   3 +
>  3 files changed, 196 insertions(+)
>
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index f0b76ff5030d..758f6fc08fe5 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -6352,6 +6352,43 @@ a single guest_memfd file, but the bound ranges must not overlap).
>
>  See KVM_SET_USER_MEMORY_REGION2 for additional details.
>
> +4.143 KVM_GET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (out)
> +:Returns: 0 on success, <0 on error
> +
> +Retrieves the current time information structure used for KVM/PV clocks,
> +in precisely the form advertised to the guest vCPU, which gives parameters
> +for a direct conversion from a guest TSC value to nanoseconds.
> +
> +When the KVM clock not is in "master clock" mode, for example because the
> +host TSC is unreliable or the guest TSCs are oddly configured, the KVM clock
> +is actually defined by the host CLOCK_MONOTONIC_RAW instead of the guest TSC.
> +In this case, the KVM_GET_CLOCK_GUEST ioctl returns -EINVAL.
> +
> +4.144 KVM_SET_CLOCK_GUEST
> +----------------------------
> +
> +:Capability: none
> +:Architectures: x86_64
> +:Type: vcpu ioctl
> +:Parameters: struct pvclock_vcpu_time_info (in)
> +:Returns: 0 on success, <0 on error
> +
> +Sets the KVM clock (for the whole VM) in terms of the vCPU TSC, using the
> +pvclock structure as returned by KVM_GET_CLOCK_GUEST. This allows the precise
> +arithmetic relationship between guest TSC and KVM clock to be preserved by
> +userspace across migration.
> +
> +When the KVM clock is not in "master clock" mode, and the KVM clock is actually
> +defined by the host CLOCK_MONOTONIC_RAW, this ioctl returns -EINVAL. Userspace
> +may choose to set the clock using the less precise KVM_SET_CLOCK ioctl, or may
> +choose to fail, denying migration to a host whose TSC is misbehaving.
> +
>  5. The kvm_run structure
>  ========================
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 23281c508c27..42abce7b4fc9 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -5868,6 +5868,154 @@ static int kvm_vcpu_ioctl_enable_cap(struct kvm_vcpu *vcpu,
>         }
>  }
>
> +#ifdef CONFIG_X86_64
> +static int kvm_vcpu_ioctl_get_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +       struct pvclock_vcpu_time_info *hv_clock = &v->arch.hv_clock;
> +
> +       /*
> +        * If KVM_REQ_CLOCK_UPDATE is already pending, or if the hv_clock has
> +        * never been generated at all, call kvm_guest_time_update() to do so.
> +        * Might as well use the PVCLOCK_TSC_STABLE_BIT as the check for ever
> +        * having been written.
> +        */
> +       if (kvm_check_request(KVM_REQ_CLOCK_UPDATE, v) ||
> +           !(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT)) {
> +               if (kvm_guest_time_update(v))
> +                       return -EINVAL;
> +       }
> +
> +       /*
> +        * PVCLOCK_TSC_STABLE_BIT is set in use_master_clock mode where the
> +        * KVM clock is defined in terms of the guest TSC. Otherwise, it is
> +        * is defined by the host CLOCK_MONOTONIC_RAW, and userspace should
> +        * use the legacy KVM_[GS]ET_CLOCK to migrate it.
> +        */
> +       if (!(hv_clock->flags & PVCLOCK_TSC_STABLE_BIT))
> +               return -EINVAL;
> +
> +       if (copy_to_user(argp, hv_clock, sizeof(*hv_clock)))
> +               return -EFAULT;
> +
> +       return 0;
> +}
> +
> +/*
> + * Reverse the calculation in the hv_clock definition.
> + *
> + * time_ns = ( (cycles << shift) * mul ) >> 32;
> + * (although shift can be negative, so that's bad C)
> + *
> + * So for a single second,
> + *  NSEC_PER_SEC = ( ( FREQ_HZ << shift) * mul ) >> 32
> + *  NSEC_PER_SEC << 32 = ( FREQ_HZ << shift ) * mul
> + *  ( NSEC_PER_SEC << 32 ) / mul = FREQ_HZ << shift
> + *  ( NSEC_PER_SEC << 32 ) / mul ) >> shift = FREQ_HZ
> + */
> +static uint64_t hvclock_to_hz(uint32_t mul, int8_t shift)
> +{
> +       uint64_t tm = NSEC_PER_SEC << 32;
> +
> +       /* Maximise precision. Shift right until the top bit is set */
> +       tm <<= 2;
> +       shift += 2;
> +
> +       /* While 'mul' is even, increase the shift *after* the division */
> +       while (!(mul & 1)) {
> +               shift++;
> +               mul >>= 1;
> +       }
> +
> +       tm /= mul;
> +
> +       if (shift > 0)
> +               return tm >> shift;
> +       else
> +               return tm << -shift;
> +}
> +
> +static int kvm_vcpu_ioctl_set_clock_guest(struct kvm_vcpu *v, void __user *argp)
> +{
> +       struct pvclock_vcpu_time_info user_hv_clock;
> +       struct kvm *kvm = v->kvm;
> +       struct kvm_arch *ka = &kvm->arch;
> +       uint64_t curr_tsc_hz, user_tsc_hz;
> +       uint64_t user_clk_ns;
> +       uint64_t guest_tsc;
> +       int rc = 0;
> +
> +       if (copy_from_user(&user_hv_clock, argp, sizeof(user_hv_clock)))
> +               return -EFAULT;
> +
> +       if (!user_hv_clock.tsc_to_system_mul)
> +               return -EINVAL;
> +
> +       user_tsc_hz = hvclock_to_hz(user_hv_clock.tsc_to_system_mul,
> +                                   user_hv_clock.tsc_shift);
> +
> +
> +       kvm_hv_request_tsc_page_update(kvm);
> +       kvm_start_pvclock_update(kvm);
> +       pvclock_update_vm_gtod_copy(kvm);
> +
> +       /*
> +        * If not in use_master_clock mode, do not allow userspace to set
> +        * the clock in terms of the guest TSC. Userspace should either
> +        * fail the migration (to a host with suboptimal TSCs), or should
> +        * knowingly restore the KVM clock using KVM_SET_CLOCK instead.
> +        */
> +       if (!ka->use_master_clock) {
> +               rc = -EINVAL;
> +               goto out;
> +       }
> +
> +       curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
> +       if (unlikely(curr_tsc_hz == 0)) {
> +               rc = -EINVAL;
> +               goto out;
> +       }
> +
> +       if (kvm_caps.has_tsc_control)
> +               curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
> +                                           v->arch.l1_tsc_scaling_ratio);
> +
> +       /*
> +        * The scaling factors in the hv_clock do not depend solely on the
> +        * TSC frequency *requested* by userspace. They actually use the
> +        * host TSC frequency that was measured/detected by the host kernel,
> +        * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
> +        *
> +        * So a sanity check that they *precisely* match would have false
> +        * negatives. Allow for a discrepancy of 1 kHz either way.
> +        */
> +       if (user_tsc_hz < curr_tsc_hz - 1000 ||
> +           user_tsc_hz > curr_tsc_hz + 1000) {
> +               rc = -ERANGE;
> +               goto out;
> +       }
> +
> +       /*
> +        * The call to pvclock_update_vm_gtod_copy() has created a new time
> +        * reference point in ka->master_cycle_now and ka->master_kernel_ns.
> +        *
> +        * Calculate the guest TSC at that moment, and the corresponding KVM
> +        * clock value according to user_hv_clock. The value according to the
> +        * current hv_clock will of course be ka->master_kernel_ns since no
> +        * TSC cycles have elapsed.
> +        *
> +        * Adjust ka->kvmclock_offset to the delta, so that both definitions
> +        * of the clock give precisely the same reading at the reference time.
> +        */
> +       guest_tsc = kvm_read_l1_tsc(v, ka->master_cycle_now);
> +       user_clk_ns = __pvclock_read_cycles(&user_hv_clock, guest_tsc);
> +       ka->kvmclock_offset = user_clk_ns - ka->master_kernel_ns;
> +
> +out:
> +       kvm_end_pvclock_update(kvm);
> +       return rc;
> +}
> +#endif
> +
>  long kvm_arch_vcpu_ioctl(struct file *filp,
>                          unsigned int ioctl, unsigned long arg)
>  {
> @@ -6256,6 +6404,14 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
>                 srcu_read_unlock(&vcpu->kvm->srcu, idx);
>                 break;
>         }
> +#ifdef CONFIG_X86_64
> +       case KVM_SET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_set_clock_guest(vcpu, argp);
> +               break;
> +       case KVM_GET_CLOCK_GUEST:
> +               r = kvm_vcpu_ioctl_get_clock_guest(vcpu, argp);
> +               break;
> +#endif
>  #ifdef CONFIG_KVM_HYPERV
>         case KVM_GET_SUPPORTED_HV_CPUID:
>                 r = kvm_ioctl_get_supported_hv_cpuid(vcpu, argp);
> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
> index 2190adbe3002..0d306311e4d6 100644
> --- a/include/uapi/linux/kvm.h
> +++ b/include/uapi/linux/kvm.h
> @@ -1548,4 +1548,7 @@ struct kvm_create_guest_memfd {
>         __u64 reserved[6];
>  };
>
> +#define KVM_SET_CLOCK_GUEST       _IOW(KVMIO,  0xd5, struct pvclock_vcpu_time_info)
> +#define KVM_GET_CLOCK_GUEST       _IOR(KVMIO,  0xd6, struct pvclock_vcpu_time_info)
> +
>  #endif /* __LINUX_KVM_H */
> --
> 2.44.0
>


^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration
  2024-04-22 14:11   ` Paolo Bonzini
@ 2024-04-22 15:02     ` David Woodhouse
  0 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-22 15:02 UTC (permalink / raw
  To: Paolo Bonzini
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 2289 bytes --]

On Mon, 2024-04-22 at 16:11 +0200, Paolo Bonzini wrote:
> On Thu, Apr 18, 2024 at 9:46 PM David Woodhouse <dwmw2@infradead.org> wrote:
> > +       curr_tsc_hz = get_cpu_tsc_khz() * 1000LL;
> > +       if (unlikely(curr_tsc_hz == 0)) {
> > +               rc = -EINVAL;
> > +               goto out;
> > +       }
> > +
> > +       if (kvm_caps.has_tsc_control)
> > +               curr_tsc_hz = kvm_scale_tsc(curr_tsc_hz,
> > +                                           v->arch.l1_tsc_scaling_ratio);
> > +
> > +       /*
> > +        * The scaling factors in the hv_clock do not depend solely on the
> > +        * TSC frequency *requested* by userspace. They actually use the
> > +        * host TSC frequency that was measured/detected by the host kernel,
> > +        * scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio.
> > +        * So a sanity check that they *precisely* match would have false
> > +        * negatives. Allow for a discrepancy of 1 kHz either way.
> 
> This is not very clear - if kvm_caps.has_tsc_control, cur_tsc_hz is
> exactly the "host TSC frequency [...] scaled by kvm_scale_tsc() with
> the vCPU's l1_tsc_scaling_ratio". But even in that case there is a
> double rounding issue, I guess.

That's exactly what I'm saying, isn't it?

Perhaps the issue is clearer if I say "that was measured/detected by
*each* host kernel"?

The point is that if I boot on a kernel which measured its TSC against
the PIT and came up with a value of 3002MHz, and then migrate to an
"identical" host which measured against *its* PIT and decided its TSC
frequency was 2999MHz.... then migrate a guest with an explicit TSC
frequency of 2500MHz from one host to the other... their effective
tsc_to_system_mul and tsc_shift in the pvclock are *different*
because...

"The scaling factors in the hv_clock do not depend solely on the
TSC frequency *requested* by userspace. They actually use the
host TSC frequency that was measured/detected by each host kernel,
scaled by kvm_scale_tsc() with the vCPU's l1_tsc_scaling_ratio."

Or did I misunderstand your objection?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-22 12:22   ` Paolo Bonzini
@ 2024-04-22 15:39     ` David Woodhouse
  2024-04-22 15:54       ` Paolo Bonzini
  0 siblings, 1 reply; 30+ messages in thread
From: David Woodhouse @ 2024-04-22 15:39 UTC (permalink / raw
  To: Paolo Bonzini
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 2027 bytes --]

On Mon, 2024-04-22 at 14:22 +0200, Paolo Bonzini wrote:
> On Thu, Apr 18, 2024 at 9:51 PM David Woodhouse <dwmw2@infradead.org> wrote:
> >          gpa_t time;
> >          struct pvclock_vcpu_time_info hv_clock;
> > -       unsigned int hw_tsc_khz;
> > +       unsigned int hw_tsc_hz;
> 
> Why not change this to u64? 4.3 GHz is scarily close to current
> processors, though I expect that it will break a lot more software
> than just KVM.

I'm not sure that just changing the variable is sufficient. I'm
concerned that we may still have places (like kvm_get_time_scale(),
although it's hard to tell because it's entirely uncommented!) which
assume that they can multiply the value by *any* 32-bit number and
it'll still fit in 64 bits.

> >   static int kvm_guest_time_update(struct kvm_vcpu *v)
> >   {
> > -       unsigned long flags, tgt_tsc_khz;
> > +       unsigned long flags;
> > +       uint64_t tgt_tsc_hz;
> 
> ... especially considering that you did use a 64-bit integer here
> (though---please use u64 not uint64_t; and BTW if you want to add a
> patch to change kvm_get_time_scale() to u64, please do.

Meh, I'm used to programming in C. Yes, I *am* old enough to have been
doing this since the last decade of the 1900s, but it *has* been a long
time since 1999, and my fingers have learned :)

That said, although I *do* write in C by default, where I'm editing
functions which already have that old anachronistic crap, I generally
manage to go back and change my additions from proper C to match their
surroundings. And I suppose there *are* some of those awful u64/s64
types already in kvm_guest_time_update(), so I would normally have done
so. I'll change that in my tree.

No way I'm regressing kvm_get_time_scale() myself though :)

Heh, looks like it was you who made it uint64_t, in 2016. In a commit
(3ae13faac) which said "Prepare for improving the precision in the next
patch"... which never came, AFAICT?

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-22 15:39     ` David Woodhouse
@ 2024-04-22 15:54       ` Paolo Bonzini
  2024-04-22 16:44         ` David Woodhouse
  0 siblings, 1 reply; 30+ messages in thread
From: Paolo Bonzini @ 2024-04-22 15:54 UTC (permalink / raw
  To: David Woodhouse
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote:

> > ... especially considering that you did use a 64-bit integer here
> > (though---please use u64 not uint64_t; and BTW if you want to add a
> > patch to change kvm_get_time_scale() to u64, please do.
>
> Meh, I'm used to programming in C. Yes, I *am* old enough to have been
> doing this since the last decade of the 1900s, but it *has* been a long
> time since 1999, and my fingers have learned :)

Oh, I am on the same page (working on both QEMU and Linux, adapting my
muscle memory to the context sucks) but u64/s64 is the preferred
spelling and I have been asked to use them before.

> Heh, looks like it was you who made it uint64_t, in 2016. In a commit
> (3ae13faac) which said "Prepare for improving the precision in the next
> patch"... which never came, AFAICT?

Yes, it was posted as
https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/
but not committed.

As an aside, we discovered later that the patch you list as "Fixes"
fixed another tricky bug: before, kvmclock could jump if the TSC is
set within the 250 ppm tolerance that does not activate TSC scaling.
This is possible after a first live migration, and then the second
live migration used the guest TSC frequency *that userspace desired*
instead of the *actual* TSC frequency.

Before:

        this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
        if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
                tgt_tsc_khz = vcpu->virtual_tsc_khz;
                kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
                        &vcpu->hv_clock.tsc_shift,
                        &vcpu->hv_clock.tsc_to_system_mul);
                vcpu->hw_tsc_khz = this_tsc_khz;

After:

        tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);

        // tgt_tsc_khz unchanged because TSC scaling was not enabled
        tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);

        if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
                kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
                        &vcpu->hv_clock.tsc_shift,
                        &vcpu->hv_clock.tsc_to_system_mul);
                vcpu->hw_tsc_khz = tgt_tsc_khz;

So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in
the second case it uses __this_cpu_read(cpu_tsc_khz).

This then caused a mismatch between the actual guest frequency and
what is used by kvm_guest_time_update, which only becomes visible when
migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK
returns what _should have been_ the same value read by the guest, but
it's wrong.

Paolo

^ permalink raw reply	[flat|nested] 30+ messages in thread

* Re: [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force
  2024-04-22 15:54       ` Paolo Bonzini
@ 2024-04-22 16:44         ` David Woodhouse
  0 siblings, 0 replies; 30+ messages in thread
From: David Woodhouse @ 2024-04-22 16:44 UTC (permalink / raw
  To: Paolo Bonzini
  Cc: kvm, Jonathan Corbet, Sean Christopherson, Thomas Gleixner,
	Ingo Molnar, Borislav Petkov, Dave Hansen, x86, H. Peter Anvin,
	Paul Durrant, Shuah Khan, linux-doc, linux-kernel,
	linux-kselftest, Oliver Upton, Marcelo Tosatti, jalliste, sveith

[-- Attachment #1: Type: text/plain, Size: 4941 bytes --]

On Mon, 2024-04-22 at 17:54 +0200, Paolo Bonzini wrote:
> On Mon, Apr 22, 2024 at 5:39 PM David Woodhouse <dwmw2@infradead.org> wrote:
> 
> > > ... especially considering that you did use a 64-bit integer here
> > > (though---please use u64 not uint64_t; and BTW if you want to add a
> > > patch to change kvm_get_time_scale() to u64, please do.
> > 
> > Meh, I'm used to programming in C. Yes, I *am* old enough to have been
> > doing this since the last decade of the 1900s, but it *has* been a long
> > time since 1999, and my fingers have learned :)
> 
> Oh, I am on the same page (working on both QEMU and Linux, adapting my
> muscle memory to the context sucks) but u64/s64 is the preferred
> spelling and I have been asked to use them before.

Ever heard of Jury Nullification... ? :)

> > Heh, looks like it was you who made it uint64_t, in 2016. In a commit
> > (3ae13faac) which said "Prepare for improving the precision in the next
> > patch"... which never came, AFAICT?
> 
> Yes, it was posted as
> https://lore.kernel.org/lkml/1454944711-33022-5-git-send-email-pbonzini@redhat.com/
> but not committed.

Ah, thanks. So that isn't about arithmetic precision, but about dealing
with the mess we had where the KVM clock was afflicted by NTP skew.

We live in a much saner world now where it's simply based on the guest
TSC (or, in pathological cases, the host's CLOCK_MONOTONIC_RAW. 

> As an aside, we discovered later that the patch you list as "Fixes"
> fixed another tricky bug: before, kvmclock could jump if the TSC is
> set within the 250 ppm tolerance that does not activate TSC scaling.
> This is possible after a first live migration, and then the second
> live migration used the guest TSC frequency *that userspace desired*
> instead of the *actual* TSC frequency.

Given our saner world where the KVM clock now *isn't* adjusted for NTP
skew, that "bug" was probably better left unfixed. In fact, I may give
some thought to reverting commit 78db6a503796 completely.

Perhaps we should call that "definition E". I think we're up to five
now? But no, let's not add historical ones to the taxonomy :)

I believe Jack's KVM_SET_CLOCK_GUEST fixes the fundamental issue there
of the clock jumping on migration. It's just a special case of the
breakage in KVM_REQ_MASTERCLOCK_UPDATE, where the KVM clock has happily
been running as a direct function of the guest TSC, and when we yank it
back to some other definition when we create a new masterclock
reference point.

With KVM_SET_CLOCK_GUEST, the KVM clock is restored as a function of
the guest TSC, rather than based on realtime/CLOCK_MONOTONIC_RAW/etc.

So even though a new masterclock reference is taken in the new kernel
(and the scaling factors in the pvclock may be slightly different, as
we discussed in the comment you responded to), the ka->kvmclock_offset
is adjusted so that the value of the KVM clock at *that* moment when
the new reference is taken, is precisely what it would have been under
the old kvmclock regime for the contemporary guest TSC value.

> Before:
> 
>         this_tsc_khz = __this_cpu_read(cpu_tsc_khz);
>         if (unlikely(vcpu->hw_tsc_khz != this_tsc_khz)) {
>                 tgt_tsc_khz = vcpu->virtual_tsc_khz;
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = this_tsc_khz;
> 
> After:
> 
>         tgt_tsc_khz = __this_cpu_read(cpu_tsc_khz);
> 
>         // tgt_tsc_khz unchanged because TSC scaling was not enabled
>         tgt_tsc_khz = kvm_scale_tsc(v, tgt_tsc_khz);
> 
>         if (unlikely(vcpu->hw_tsc_khz != tgt_tsc_khz)) {
>                 kvm_get_time_scale(NSEC_PER_SEC / 1000, tgt_tsc_khz,
>                         &vcpu->hv_clock.tsc_shift,
>                         &vcpu->hv_clock.tsc_to_system_mul);
>                 vcpu->hw_tsc_khz = tgt_tsc_khz;
> 
> So in the first case kvm_get_time_scale uses vcpu->virtual_tsc_khz, in
> the second case it uses __this_cpu_read(cpu_tsc_khz).
> 
> This then caused a mismatch between the actual guest frequency and
> what is used by kvm_guest_time_update, which only becomes visible when
> migration resets the clock with KVM_GET/SET_CLOCK. KVM_GET_CLOCK
> returns what _should have been_ the same value read by the guest, but
> it's wrong.

Hm? Until I fixed it in this series, KVM_GET_CLOCK didn't even *try*
scaling via the guest's TSC frequency, did it? It just converted from
the *host* TSC to nanoseconds (since master_kernel_now) directly. Which
was even *more* broken :)

[-- Attachment #2: smime.p7s --]
[-- Type: application/pkcs7-signature, Size: 5965 bytes --]

^ permalink raw reply	[flat|nested] 30+ messages in thread

end of thread, other threads:[~2024-04-22 16:44 UTC | newest]

Thread overview: 30+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-04-18 19:34 [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse
2024-04-18 19:34 ` [PATCH 01/10] KVM: x86/xen: Do not corrupt KVM clock in kvm_xen_shared_info_init() David Woodhouse
2024-04-18 19:34 ` [PATCH 02/10] KVM: x86: Improve accuracy of KVM clock when TSC scaling is in force David Woodhouse
2024-04-19 15:29   ` Paul Durrant
2024-04-22 12:22   ` Paolo Bonzini
2024-04-22 15:39     ` David Woodhouse
2024-04-22 15:54       ` Paolo Bonzini
2024-04-22 16:44         ` David Woodhouse
2024-04-18 19:34 ` [PATCH 03/10] KVM: x86: Add KVM_[GS]ET_CLOCK_GUEST for accurate KVM clock migration David Woodhouse
2024-04-19 15:40   ` Paul Durrant
2024-04-19 15:49     ` David Woodhouse
2024-04-22 14:11   ` Paolo Bonzini
2024-04-22 15:02     ` David Woodhouse
2024-04-18 19:34 ` [PATCH 04/10] KVM: selftests: Add KVM/PV clock selftest to prove timer correction David Woodhouse
2024-04-19 15:44   ` Paul Durrant
2024-04-18 19:34 ` [PATCH 05/10] KVM: x86: Explicitly disable TSC scaling without CONSTANT_TSC David Woodhouse
2024-04-19 15:45   ` Paul Durrant
2024-04-18 19:34 ` [PATCH 06/10] KVM: x86: Add KVM_VCPU_TSC_SCALE and fix the documentation on TSC migration David Woodhouse
2024-04-19 15:49   ` Paul Durrant
2024-04-19 15:53     ` David Woodhouse
2024-04-18 19:34 ` [PATCH 07/10] KVM: x86: Avoid NTP frequency skew for KVM clock on 32-bit host David Woodhouse
2024-04-19 15:53   ` Paul Durrant
2024-04-18 19:34 ` [PATCH 08/10] KVM: x86: Remove periodic global clock updates David Woodhouse
2024-04-19 15:55   ` Paul Durrant
2024-04-18 19:34 ` [PATCH 09/10] KVM: x86: Kill KVM_REQ_GLOBAL_CLOCK_UPDATE David Woodhouse
2024-04-19 15:57   ` Paul Durrant
2024-04-18 19:34 ` [PATCH 10/10] KVM: x86: Fix KVM clock precision in __get_kvmclock() David Woodhouse
2024-04-19 16:07   ` Paul Durrant
2024-04-19 12:51 ` [PATCH 11/10] KVM: x86: Fix software TSC upscaling in kvm_update_guest_time() David Woodhouse
2024-04-19 12:52 ` [RFC PATCH 0/10] Cleaning up the KVM clock mess David Woodhouse

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).