From: Sean Christopherson <seanjc@google.com>
To: Oliver Upton <oliver.upton@linux.dev>
Cc: Anish Moorthy <amoorthy@google.com>,
	maz@kernel.org, kvm@vger.kernel.org,  kvmarm@lists.linux.dev,
	robert.hoo.linux@gmail.com, jthoughton@google.com,
	 dmatlack@google.com, axelrasmussen@google.com,
	peterx@redhat.com,  nadav.amit@gmail.com,
	isaku.yamahata@gmail.com, kconsul@linux.vnet.ibm.com
Subject: Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
Date: Mon, 4 Mar 2024 14:49:07 -0800
Message-ID: <ZeZP4xOMk7LUnNt2@google.com>
In-Reply-To: <ZeY3P8Za5Q6pkkQV@linux.dev>

On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Oliver Upton wrote:
> > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> 
> [...]
> 
> > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > > shake out as a write fault on the stage-1 descriptor.
> > > 
> > > With that said, an architecture-neutral UAPI may not be able to capture
> > > the nuance of a fault. This UAPI will become much more load-bearing in
> > > the future, and the loss of granularity could become an issue.
> > 
> > What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> > case scenario is that KVM may exit to userspace multiple times in order to resolve
> > the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> > 
> > The intent/contract of the exit to userspace isn't for userspace to be able to
> > completely understand what fault occurred, but rather for KVM to communicate what
> > action userspace needs to take in order for KVM to make forward progress.
> 
> For one, the stage-2 page tables can describe permissions beyond RWX.
> MTE tag allocation can be controlled at stage-2, which (confusingly)
> describes if the guest can insert tags in an opaque, physical space not
> described by HPFAR.
> 
> There is a corresponding bit in ESR_EL2 that describes this at the time
> of a fault, and R/W/X flags aren't enough to convey the right corrective
> action.
>
> > > Marc had some ideas about forwarding the register state to userspace
> > > directly, which should be the right level of information for _any_ fault
> > > taken to userspace.
> > 
> > I don't know enough about ARM to weigh in on that side of things, but for x86
> > this definitely doesn't hold true.
> 
> We tend to directly model the CPU architecture wherever possible, as it
> is the only way to create something intelligible. That same rationale
> applies to a huge portion of KVM UAPI; it is architecture-dependent by
> design.

Heh, "by design" :-)

I'm not saying "no arch-specific code in memory_fault", all I'm saying is that
stuff that can be arch-neutral, should be arch-neutral.  And AFAIK, basic RWX
information is common across all architectures.

E.g. if KVM needs to communicate MTE information on top of basic RWX info, why
not add a flag to memory_fault.flags that communicates that MTE is enabled and
relevant info can be found in an "extended" data field?

The presence of MTE stuff shouldn't affect the fundamental access information,
e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
irrespective of whether or not MTE is in play.
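
To make the above concrete, here is a rough sketch of the idea, not a proposal
for the actual uAPI.  Everything prefixed with sketch_/SKETCH_ is hypothetical,
including the flag encodings, the "extended" flag, and the arch_data field; none
of it is taken from the series.  The point is only that the basic access type is
reported the same way whether or not arch-specific data rides along.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical encodings, for illustration only (not the real uAPI values). */
#define SKETCH_FLAG_READ	(1ULL << 0)
#define SKETCH_FLAG_WRITE	(1ULL << 1)
#define SKETCH_FLAG_EXEC	(1ULL << 2)
#define SKETCH_FLAG_EXTENDED	(1ULL << 8)	/* hypothetical: arch_data is valid */

struct sketch_memory_fault {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
	uint64_t arch_data;	/* hypothetical "extended" data field */
};

/* What a (hypothetical) arch fault handler knows about the guest access. */
struct sketch_access {
	bool write;
	bool exec;
	bool mte_relevant;	/* assumption: arch code decides this */
	uint64_t mte_info;	/* assumption: opaque arch-specific payload */
};

static inline struct sketch_memory_fault
sketch_fill_fault(uint64_t gpa, uint64_t size, const struct sketch_access *a)
{
	struct sketch_memory_fault f = { .gpa = gpa, .size = size };

	/* The basic access type is always reported, arch extras or not. */
	if (a->exec)
		f.flags |= SKETCH_FLAG_EXEC;
	else if (a->write)
		f.flags |= SKETCH_FLAG_WRITE;
	else
		f.flags |= SKETCH_FLAG_READ;

	/* Arch-specific information is additive, it never replaces RWX. */
	if (a->mte_relevant) {
		f.flags |= SKETCH_FLAG_EXTENDED;
		f.arch_data = a->mte_info;
	}
	return f;
}
```

Userspace that only understands RWX can ignore SKETCH_FLAG_EXTENDED and still
see that the guest attempted a write.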

The one thing we may want to squeak in before 6.8 is released is a placeholder
in memory_fault, though I don't think that's strictly necessary since the union
as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
sizeof(kvm_run.memory_fault), but that's a bit of a stretch.
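
A small sketch of why the padding makes a placeholder mostly unnecessary.  The
struct and union names are made up for illustration and the "extended" field is
hypothetical; only the 256-byte padding reflects what's described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "before" layout of the memory_fault member. */
struct sketch_fault_v1 {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
};

/* Hypothetical "after" layout with one extra field appended. */
struct sketch_fault_v2 {
	uint64_t flags;
	uint64_t gpa;
	uint64_t size;
	uint64_t extended;
};

/* Stand-ins for the exit union, padded to 256 bytes. */
union sketch_exit_v1 {
	struct sketch_fault_v1 memory_fault;
	char padding[256];
};

union sketch_exit_v2 {
	struct sketch_fault_v2 memory_fault;
	char padding[256];
};

/* Growing the member doesn't change the size of the union as a whole... */
static_assert(sizeof(union sketch_exit_v1) == 256, "union is padded to 256");
static_assert(sizeof(union sketch_exit_v2) == sizeof(union sketch_exit_v1),
	      "appending a field must not grow the union");

/* ...but it does change the size of the member, which is the stretch case. */
static_assert(sizeof(struct sketch_fault_v2) > sizeof(struct sketch_fault_v1),
	      "sizeof(memory_fault) changes when a field is appended");
```

Only userspace that sizes something off of the member itself, as opposed to the
union or the whole run structure, would notice the difference.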

> > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > "caching" emulated MMIO accesses, and the resulting fault captures the
> > "reserved bits set" information in register state.  But that's purely an
> > (optional) implementation detail of KVM that should never be exposed to
> > userspace.
> 
> MMIO accesses would show up elsewhere though, right?

Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
you're saying/asking.

> If these magic SPTEs were causing -EFAULT exits then something must've gone
> sideways.

More or less.  This scenario can happen if the guest re-accesses a GFN that
doesn't have a memslot, but in the interim userspace made the GFN private.  It's
likely a misbehaving userspace, but that really doesn't matter.  KVM's contract
is to report that KVM exited to userspace because the guest was trying to access
GFN X as shared, but the GFN is configured as private by userspace.

My point was that dumping fault/register information straight to userspace in this
scenario, without massaging/filtering that information, is not a sane option on
x86.
 
> Either way, I have no issues whatsoever if the direction for x86 is to
> provide abstracted fault information.

I don't understand how ARM can get away with NOT providing a layer of abstraction.
Copying fault state verbatim to userspace will bleed KVM implementation details
into userspace, and risks breakage of KVM's ABI due to changes in hardware.
Abstracting gory hardware details from userspace is one of the main roles of the
kernel.

A concrete example of hardware throwing a wrench in things is AMD's upcoming
"encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
the page fault error code directly to userspace, then running the same VM on
different hardware generations, e.g. after live migration, would generate different
error codes.
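
Roughly, as a sketch: the bit positions and all names below are hypothetical
stand-ins rather than real hardware or KVM definitions.  The same guest write
yields different raw error codes on two CPU generations, but identical
abstracted flags.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical raw error-code bits, for illustration only. */
#define SKETCH_HW_ERR_WRITE	(1ULL << 1)
#define SKETCH_HW_ERR_FETCH	(1ULL << 4)
#define SKETCH_HW_ERR_ENCRYPTED	(1ULL << 34)	/* set only by newer CPUs */

/* Hypothetical abstract flags reported to userspace. */
#define SKETCH_FLAG_READ	(1ULL << 0)
#define SKETCH_FLAG_WRITE	(1ULL << 1)
#define SKETCH_FLAG_EXEC	(1ULL << 2)

/*
 * Translate a raw error code into abstract access flags.  Bits that describe
 * hardware or KVM internals (e.g. the "encrypted" bit) are deliberately not
 * propagated.
 */
static inline uint64_t sketch_abstract_flags(uint64_t hw_error_code)
{
	if (hw_error_code & SKETCH_HW_ERR_FETCH)
		return SKETCH_FLAG_EXEC;
	if (hw_error_code & SKETCH_HW_ERR_WRITE)
		return SKETCH_FLAG_WRITE;
	return SKETCH_FLAG_READ;
}
```

E.g. sketch_abstract_flags(SKETCH_HW_ERR_WRITE) and
sketch_abstract_flags(SKETCH_HW_ERR_WRITE | SKETCH_HW_ERR_ENCRYPTED) both return
SKETCH_FLAG_WRITE, so userspace sees the same thing before and after migration.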
 
Are we talking past each other?  I'm genuinely confused by the pushback on
capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
but its existence doesn't preclude KVM from providing more information as needed.
