[PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY

KVM Archive mirror
 help / color / mirror / Atom feed

* [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
@ 2024-02-15 23:53 Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
                   ` (15 more replies)
  0 siblings, 16 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

This series adds an option to cause stage-2 fault handlers to
KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
the userspace mappings. Doing so allows userspace to receive stage-2
faults directly from KVM_RUN instead of through userfaultfd, which
suffers from serious contention issues as the number of vCPUs scales.

Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
demand_paging_test, which demonstrates the scalability improvements:
the following data was collected using [2] on an x86 machine with 256
cores.

vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
1       150     340
2       191     477
4       210     809
8       155     1239
16      130     1595
32      108     2299
64      86      3482
128     62      4134
256     36      4012

The diff since the last version is small enough that I've attached a
range-diff in the cover letter- hopefully it's useful for review.

Links
~~~~~
[1] Original RFC from James Houghton:
    https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/

[2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
    A quick rundown of the new flags (also detailed in later commits)
        -a registers all of guest memory to a single uffd.
        -r species the number of reader threads for polling the uffd.
        -w is what actually enables the new capabilities.
    All data was collected after applying the entire series

---

v7
  - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
    [James]
  - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
    and split it off into its own commit [Anish]
  - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
    available on arm [James]
  - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
    arm64" commit [Anish]
  - Drop buggy "fast GUP on read faults" patch [Thanks James!]
  - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
    [Sean, Oliver]
  - Drop incorrect "Documentation:" from some shortlogs [Sean]
  - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
  - Style issues [Sean]

v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
  - Rebase onto guest_memfd series [Anish/Sean]
  - Set write fault flag properly in user_mem_abort() [Oliver]
  - Reformat unnecessarily multi-line comments [Sean]
  - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
  - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
  - Remove unnecessary rounding in user_mem_abort() annotation [David]
  - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
    them with the stage-2 fault annotation patches [Sean]
  - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
    add another boolean parameter instead [Sean]
  - Better shortlog for the hva_to_pfn_fast() change [Anish]

v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
  - Rename APIs (again) [Sean]
  - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
  - Reword hva_to_pfn_fast() change commit message [Sean]
  - Correct style on terminal if statements [Sean]
  - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
  - Add read fault flag for annotated faults [Sean]
  - read/write_guest_page() changes
      - Move the annotations into vcpu wrapper fns [Sean]
      - Reorder parameters [Robert]
  - Rename kvm_populate_efault_info() to
    kvm_handle_guest_uaccess_fault() [Sean]
  - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
  - Correct description of the faults which hva_to_pfn_fast() can now
    resolve [Sean]
  - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
  - Magnanimously accept Sean's rewrite of the handle_error_pfn()
    annotation [Anish]
  - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]

v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
  - Fix excessive indentation [Robert, Oliver]
  - Calculate final stats when uffd handler fn returns an error [Robert]
  - Remove redundant info from uffd_desc [Robert]
  - Fix various commit message typos [Robert]
  - Add comment about suppressed EEXISTs in selftest [Robert]
  - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
  - Fix some include/logic issues in self test [Robert]
  - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
  - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
  - Drop most of the annotations from v3: see
    https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
  - Remove WARN on bare efaults [Sean, Oliver]
  - Eliminate unnecessary UFFDIO_WAKE call from self test [James]

v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
  - Rework the implementation to be based on two orthogonal
    capabilities (KVM_CAP_MEMORY_FAULT_INFO and
    KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
  - Change return code of kvm_populate_efault_info [Isaku]
  - Use kvm_populate_efault_info from arm code [Oliver]

v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/

    This was a bit of a misfire, as I sent my WIP series on the mailing
    list but was just targeting Sean for some feedback. Oliver Upton and
    Isaku Yamahata ended up discovering the series and giving me some
    feedback anyways, so thanks to them :) In the end, there was enough
    discussion to justify retroactively labeling it as v2, even with the
    limited cc list.

  - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
  - API changes:
        - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
          KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
          requirement).
        - Switched to memslot flag
  - Take Oliver's simplification to the "allow fast gup for readable
    faults" logic.
  - Slightly redefine the return code of user_mem_abort.
  - Fix documentation errors brought up by Marc
  - Reword commit messages in imperative mood

v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/

Anish Moorthy (14):
  KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  KVM: Add function comments for __kvm_read/write_guest_page()
  KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
  KVM: Simplify error handling in __gfn_to_pfn_memslot()
  KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
    userspace
  KVM: Add memslot flag to let userspace force an exit on missing hva
    mappings
  KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
    stage-2 fault handler
  KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
    stage-2 fault handler
  KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
  KVM: selftests: Report per-vcpu demand paging rate from demand paging
    test
  KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
    paging test
  KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
    signal errors via TEST_ASSERT
  KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  KVM: selftests: Handle memory fault exits in demand_paging_test

 Documentation/virt/kvm/api.rst                |  39 ++-
 arch/arm64/kvm/Kconfig                        |   1 +
 arch/arm64/kvm/arm.c                          |   1 +
 arch/arm64/kvm/mmu.c                          |   7 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
 arch/x86/kvm/Kconfig                          |   1 +
 arch/x86/kvm/mmu/mmu.c                        |   8 +-
 include/linux/kvm_host.h                      |  21 +-
 include/uapi/linux/kvm.h                      |   5 +
 .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
 .../selftests/kvm/access_tracking_perf_test.c |   2 +-
 .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
 .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
 .../testing/selftests/kvm/include/memstress.h |   2 +-
 .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
 tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
 .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
 .../kvm/memslot_modification_stress_test.c    |   2 +-
 .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
 virt/kvm/Kconfig                              |   3 +
 virt/kvm/kvm_main.c                           |  46 ++-
 22 files changed, 453 insertions(+), 172 deletions(-)

Range-diff against v6:
 1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
    +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
     
    -    The current docstring can be read as "atomic -> allowed to sleep," when
    -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
    -    that clearer in the docstring.
    +    The current description can be read as "atomic -> allowed to sleep,"
    +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
    +    Make that clearer in the docstring.
     
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
    +    KVM: Add function comments for __kvm_read/write_guest_page()
     
         The (gfn, data, offset, len) order of parameters is a little strange
    -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
    -    make things perfectly clear.
    +    since "offset" applies to "gfn" rather than to "data". Add function
    +    comments to make things perfectly clear.
     
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
 3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
 4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
    @@ Metadata
      ## Commit message ##
         KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
     
    +    kvm_prepare_memory_fault_exit() already takes parameters describing the
    +    RWX-ness of the relevant access but doesn't actually do anything with
    +    them. Define and use the flags necessary to pass this information on to
    +    userspace.
    +
         Suggested-by: Sean Christopherson <seanjc@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
 5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
 6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
    @@ Commit message
     
         Suggested-by: James Houghton <jthoughton@google.com>
         Suggested-by: Sean Christopherson <seanjc@google.com>
    -    Reviewed-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
     
      ## Documentation/virt/kvm/api.rst ##
     @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
    -   /* for kvm_userspace_memory_region::flags */
        #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
        #define KVM_MEM_READONLY	(1UL << 1)
    -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
    +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
     +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
      
      This ioctl allows the user to create, modify or delete a guest physical
    @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
      be identical.  This allows large pages in the guest to be backed by large
      pages in the host.
      
    --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
    --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
    +-The flags field supports three flags
     +The flags field supports four flags
    -+
    -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
    + 
    + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
      writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
    --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
    -+use it.
    -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
    - to make a new slot read-only.  In this case, writes to this memory will be
    +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
      posted to userspace as KVM_EXIT_MMIO exits.
    -+3.  KVM_MEM_GUEST_MEMFD
    + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
    + incompatible with KVM_SET_USER_MEMORY_REGION.
     +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
      
      When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
      the memory region are automatically reflected into the guest.  For example, an
    +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
    + was a load or a store, instruction abort if it was an instruction
    + fetch) is injected in the guest.
    + 
    ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
    ++exclusive.
    ++
    + 4.36 KVM_SET_TSS_ADDR
    + ---------------------
    + 
     @@ Documentation/virt/kvm/api.rst: error/annotated fault.
      
      See KVM_EXIT_MEMORY_FAULT for more information.
    @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
      
      /* for KVM_IRQ_LINE */
      struct kvm_irq_level {
    -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
    +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
      #define KVM_CAP_MEMORY_ATTRIBUTES 233
      #define KVM_CAP_GUEST_MEMFD 234
      #define KVM_CAP_VM_TYPES 235
     +#define KVM_CAP_EXIT_ON_MISSING 236
      
    - #ifdef KVM_CAP_IRQ_ROUTING
    - 
    + struct kvm_irq_routing_irqchip {
    + 	__u32 irqchip;
     
      ## virt/kvm/Kconfig ##
     @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
    @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
     +
      	if (mem->flags & ~valid_flags)
      		return -EINVAL;
    ++	else if ((mem->flags & KVM_MEM_READONLY) &&
    ++		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
    ++		return -EINVAL;
      
    + 	return 0;
    + }
     @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
      
      kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
    @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
      		writable = NULL;
      	}
      
    -+	if (!atomic && can_exit_on_missing
    -+	    && kvm_is_slot_exit_on_missing(slot)) {
    ++	/* When the slot is exit-on-missing (and when we should respect that)
    ++	 * set atomic=true to prevent GUP from faulting in the userspace
    ++	 * mappings.
    ++	 */
    ++	if (!atomic && can_exit_on_missing &&
    ++	    kvm_is_slot_exit_on_missing(slot)) {
     +		atomic = true;
     +		if (async) {
     +			*async = false;
 7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
    @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
     
      ## arch/x86/kvm/Kconfig ##
     @@ arch/x86/kvm/Kconfig: config KVM
    - 	select INTERVAL_TREE
    + 	select KVM_VFIO
      	select HAVE_KVM_PM_NOTIFIER if PM
      	select KVM_GENERIC_HARDWARE_ENABLING
     +        select HAVE_KVM_EXIT_ON_MISSING
 8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
 -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
 9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
    @@ Metadata
     Author: Anish Moorthy <amoorthy@google.com>
     
      ## Commit message ##
    -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
    +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
     
         Prevent the stage-2 fault handler from faulting in pages when
         KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
    -    calls to check the memslot flag.
    -
    -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
    -    when the stage-2 handler cannot resolve the pfn for a fault. With
    -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
    -    faults as vCPU exits, which userspace can attempt to resolve without
    -    terminating the guest.
    +    call to check the memslot flag. This effects the delivery of stage-2
    +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
    +    can attempt to resolve without terminating the guest.
     
         Delivering stage-2 faults to userspace in this way sidesteps the
         significant scalabiliy issues associated with using userfaultfd for the
    @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
     
      ## arch/arm64/kvm/Kconfig ##
     @@ arch/arm64/kvm/Kconfig: menuconfig KVM
    + 	select SCHED_INFO
      	select GUEST_PERF_EVENTS if PERF_EVENTS
    - 	select INTERVAL_TREE
      	select XARRAY_MULTI
     +        select HAVE_KVM_EXIT_ON_MISSING
      	help
    @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
      	if (pfn == KVM_PFN_ERR_HWPOISON) {
      		kvm_send_hwpoison_signal(hva, vma_shift);
      		return 0;
    - 	}
    --	if (is_error_noslot_pfn(pfn))
    -+	if (is_error_noslot_pfn(pfn)) {
    -+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
    -+					      write_fault, exec_fault, false);
    - 		return -EFAULT;
    -+	}
    - 
    - 	if (kvm_is_device_pfn(pfn)) {
    - 		/*
10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
    @@ Commit message
         configuring the number of reader threads per UFFD as well: add the "-r"
         flag to do so.
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
     @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
    @@ Commit message
         [1] Single-vCPU performance does suffer somewhat.
         [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/demand_paging_test.c ##
     @@
13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
    @@ Commit message
     
         Demonstrate a (very basic) scheme for supporting memory fault exits.
     
    -    >From the vCPU threads:
    +    From the vCPU threads:
         1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
            with the purpose of establishing the absent mappings. Do so with
            wake_waiters=false to avoid serializing on the userfaultfd wait queue
    @@ Commit message
         [A] In reality it is much likelier that the vCPU thread simply lost a
             race to establish the mapping for the page.
     
    -    Acked-by: James Houghton <jthoughton@google.com>
         Signed-off-by: Anish Moorthy <amoorthy@google.com>
    +    Acked-by: James Houghton <jthoughton@google.com>
     
      ## tools/testing/selftests/kvm/demand_paging_test.c ##
     @@

base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply	[flat|nested] 39+ messages in thread

* [PATCH v7 01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 02/14] KVM: Add function comments for __kvm_read/write_guest_page() Anish Moorthy
                   ` (14 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

The current description can be read as "atomic -> allowed to sleep,"
when in fact the intended statement is "atomic -> NOT allowed to sleep."
Make that clearer in the docstring.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff588677beb7..46e7b8dbb3d8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2959,7 +2959,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
 /*
  * Pin guest page in memory and return its pfn.
  * @addr: host virtual address which maps memory to the guest
- * @atomic: whether this function can sleep
+ * @atomic: whether this function is forbidden from sleeping
  * @interruptible: whether the process can be interrupted by non-fatal signals
  * @async: whether this function need to wait IO complete if the
  *         host page is not in the memory
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 02/14] KVM: Add function comments for __kvm_read/write_guest_page()
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag Anish Moorthy
                   ` (13 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

The (gfn, data, offset, len) order of parameters is a little strange
since "offset" applies to "gfn" rather than to "data". Add function
comments to make things perfectly clear.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 46e7b8dbb3d8..7186d301d617 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3304,6 +3304,7 @@ static int next_segment(unsigned long len, int offset)
 		return len;
 }
 
+/* Copy @len bytes from guest memory at '(@gfn * PAGE_SIZE) + @offset' to @data */
 static int __kvm_read_guest_page(struct kvm_memory_slot *slot, gfn_t gfn,
 				 void *data, int offset, int len)
 {
@@ -3405,6 +3406,7 @@ int kvm_vcpu_read_guest_atomic(struct kvm_vcpu *vcpu, gpa_t gpa,
 }
 EXPORT_SYMBOL_GPL(kvm_vcpu_read_guest_atomic);
 
+/* Copy @len bytes from @data into guest memory at '(@gfn * PAGE_SIZE) + @offset' */
 static int __kvm_write_guest_page(struct kvm *kvm,
 				  struct kvm_memory_slot *memslot, gfn_t gfn,
 			          const void *data, int offset, int len)
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 02/14] KVM: Add function comments for __kvm_read/write_guest_page() Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-04-09 22:47   ` Sean Christopherson
  2024-02-15 23:53 ` [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
                   ` (12 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

The documentation for KVM_SET_USER_MEMORY_REGION2 describes what the
flag does, but the flag itself is absent from where the other memslot
flags are listed. Add it.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 11 ++++++++---
 1 file changed, 8 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 3ec0b7a455a0..8f75fca2294e 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1352,6 +1352,7 @@ yet and must be cleared on entry.
   /* for kvm_userspace_memory_region::flags */
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1382,12 +1383,16 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
-KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
+The flags field supports three flags
+
+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
-use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
+use it.
+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
 to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
+3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
+incompatible with KVM_SET_USER_MEMORY_REGION.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (2 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-04-09 22:44   ` Sean Christopherson
  2024-02-15 23:53 ` [PATCH v7 05/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
                   ` (11 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
duplicate the "if (writable)" block. Fix this by bringing all
kvm_is_error_hva() cases under one conditional.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 virt/kvm/kvm_main.c | 10 ++++------
 1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7186d301d617..67ca580a18c5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -3031,15 +3031,13 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 	if (hva)
 		*hva = addr;
 
-	if (addr == KVM_HVA_ERR_RO_BAD) {
-		if (writable)
-			*writable = false;
-		return KVM_PFN_ERR_RO_FAULT;
-	}
-
 	if (kvm_is_error_hva(addr)) {
 		if (writable)
 			*writable = false;
+
+		if (addr == KVM_HVA_ERR_RO_BAD)
+			return KVM_PFN_ERR_RO_FAULT;
+
 		return KVM_PFN_NOSLOT;
 	}
 
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 05/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (3 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

kvm_prepare_memory_fault_exit() already takes parameters describing the
RWX-ness of the relevant access but doesn't actually do anything with
them. Define and use the flags necessary to pass this information on to
userspace.

Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 5 +++++
 include/linux/kvm_host.h       | 9 ++++++++-
 include/uapi/linux/kvm.h       | 3 +++
 3 files changed, 16 insertions(+), 1 deletion(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 8f75fca2294e..9f5d45c49e36 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -6964,6 +6964,9 @@ spec refer, https://github.com/riscv/riscv-sbi-doc.
 
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+  #define KVM_MEMORY_EXIT_FLAG_READ     (1ULL << 0)
+  #define KVM_MEMORY_EXIT_FLAG_WRITE    (1ULL << 1)
+  #define KVM_MEMORY_EXIT_FLAG_EXEC     (1ULL << 2)
   #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
@@ -6975,6 +6978,8 @@ could not be resolved by KVM.  The 'gpa' and 'size' (in bytes) describe the
 guest physical address range [gpa, gpa + size) of the fault.  The 'flags' field
 describes properties of the faulting access that are likely pertinent:
 
+ - KVM_MEMORY_EXIT_FLAG_READ/WRITE/EXEC - When set, indicates that the memory
+   fault occurred on a read/write/exec access respectively.
  - KVM_MEMORY_EXIT_FLAG_PRIVATE - When set, indicates the memory fault occurred
    on a private memory access.  When clear, indicates the fault occurred on a
    shared access.
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 7e7fd25b09b3..32cbe5c3a9d1 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -2343,8 +2343,15 @@ static inline void kvm_prepare_memory_fault_exit(struct kvm_vcpu *vcpu,
 	vcpu->run->memory_fault.gpa = gpa;
 	vcpu->run->memory_fault.size = size;
 
-	/* RWX flags are not (yet) defined or communicated to userspace. */
 	vcpu->run->memory_fault.flags = 0;
+
+	if (is_write)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_WRITE;
+	else if (is_exec)
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_EXEC;
+	else
+		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_READ;
+
 	if (is_private)
 		vcpu->run->memory_fault.flags |= KVM_MEMORY_EXIT_FLAG_PRIVATE;
 }
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 2190adbe3002..36a51b162a71 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -428,6 +428,9 @@ struct kvm_run {
 		} notify;
 		/* KVM_EXIT_MEMORY_FAULT */
 		struct {
+#define KVM_MEMORY_EXIT_FLAG_READ       (1ULL << 0)
+#define KVM_MEMORY_EXIT_FLAG_WRITE      (1ULL << 1)
+#define KVM_MEMORY_EXIT_FLAG_EXEC       (1ULL << 2)
 #define KVM_MEMORY_EXIT_FLAG_PRIVATE	(1ULL << 3)
 			__u64 flags;
 			__u64 gpa;
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (4 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 05/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-03-08 22:07   ` Sean Christopherson
  2024-02-15 23:53 ` [PATCH v7 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
                   ` (9 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Allowing KVM to fault in pages during vcpu-context guest memory accesses
can be undesirable: during userfaultfd-based postcopy, it can cause
significant performance issues due to vCPUs contending for
userfaultfd-internal locks.

Add a new memslot flag (KVM_MEM_EXIT_ON_MISSING) through which userspace
can indicate that KVM_RUN should exit instead of faulting in pages
during vcpu-context guest memory accesses. The unfaulted pages are
reported by the accompanying KVM_EXIT_MEMORY_FAULT_INFO, allowing
userspace to determine and take appropriate action.

The basic implementation strategy is to check the memslot flag from
within __gfn_to_pfn_memslot() and override the caller-provided arguments
accordingly. Some callers (such as kvm_vcpu_map()) must be able to opt
out of this behavior, and do so by passing can_exit_on_missing=false.

No functional change intended: nothing sets KVM_MEM_EXIT_ON_MISSING or
passes can_exit_on_missing=true to __gfn_to_pfn_memslot().

Suggested-by: James Houghton <jthoughton@google.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst         | 23 +++++++++++++++++-
 arch/arm64/kvm/mmu.c                   |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c    |  2 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |  2 +-
 arch/x86/kvm/mmu/mmu.c                 |  4 ++--
 include/linux/kvm_host.h               | 12 +++++++++-
 include/uapi/linux/kvm.h               |  2 ++
 virt/kvm/Kconfig                       |  3 +++
 virt/kvm/kvm_main.c                    | 32 ++++++++++++++++++++++----
 9 files changed, 70 insertions(+), 12 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 9f5d45c49e36..bf7bc21d56ac 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
   #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
   #define KVM_MEM_READONLY	(1UL << 1)
   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
+  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
 
 This ioctl allows the user to create, modify or delete a guest physical
 memory slot.  Bits 0-15 of "slot" specify the slot id and this value
@@ -1383,7 +1384,7 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
 be identical.  This allows large pages in the guest to be backed by large
 pages in the host.
 
-The flags field supports three flags
+The flags field supports four flags
 
 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
 writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
@@ -1393,6 +1394,7 @@ to make a new slot read-only.  In this case, writes to this memory will be
 posted to userspace as KVM_EXIT_MMIO exits.
 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
 incompatible with KVM_SET_USER_MEMORY_REGION.
+4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
 
 When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
 the memory region are automatically reflected into the guest.  For example, an
@@ -1408,6 +1410,9 @@ Instead, an abort (data abort if the cause of the page-table update
 was a load or a store, instruction abort if it was an instruction
 fetch) is injected in the guest.
 
+Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
+exclusive.
+
 4.36 KVM_SET_TSS_ADDR
 ---------------------
 
@@ -8044,6 +8049,22 @@ error/annotated fault.
 
 See KVM_EXIT_MEMORY_FAULT for more information.
 
+7.35 KVM_CAP_EXIT_ON_MISSING
+----------------------------
+
+:Architectures: None
+:Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
+
+The presence of this capability indicates that userspace may set the
+KVM_MEM_EXIT_ON_MISSING flag on memslots. Said flag will cause KVM_RUN to fail
+(-EFAULT) in response to guest-context memory accesses which would require KVM
+to page fault on the userspace mapping.
+
+The range of guest physical memory causing the fault is advertised to userspace
+through KVM_CAP_MEMORY_FAULT_INFO. Userspace should take appropriate action.
+This could mean, for instance, checking that the fault is resolvable, faulting
+in the relevant userspace mapping, then retrying KVM_RUN.
+
 8. Other capabilities.
 ======================
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index d14504821b79..dfe0cbb5937c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1487,7 +1487,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmap_read_unlock(current->mm);
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, NULL);
+				   write_fault, &writable, false, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 2b1f0cdd8c18..31ebfe4fe8e1 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -614,7 +614,7 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
 	} else {
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, &write_ok, NULL);
+					   writing, &write_ok, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 4a1abb9f7c05..03b0f1c4a0d8 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -853,7 +853,7 @@ int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
 
 		/* Call KVM generic code to do the slow-path check */
 		pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-					   writing, upgrade_p, NULL);
+					   writing, upgrade_p, false, NULL);
 		if (is_error_noslot_pfn(pfn))
 			return -EFAULT;
 		page = NULL;
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 2d6cdeab1f8a..b89a9518f6de 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -4371,7 +4371,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4393,7 +4393,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
-					  &fault->hva);
+					  false, &fault->hva);
 	return RET_PF_CONTINUE;
 }
 
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 32cbe5c3a9d1..210e07c4c2eb 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1216,7 +1216,8 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn);
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva);
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva);
 
 void kvm_release_pfn_clean(kvm_pfn_t pfn);
 void kvm_release_pfn_dirty(kvm_pfn_t pfn);
@@ -2394,4 +2395,13 @@ static inline int kvm_gmem_get_pfn(struct kvm *kvm,
 }
 #endif /* CONFIG_KVM_PRIVATE_MEM */
 
+/*
+ * Whether vCPUs should exit upon trying to access memory for which the
+ * userspace mappings are missing.
+ */
+static inline bool kvm_is_slot_exit_on_missing(const struct kvm_memory_slot *slot)
+{
+	return slot && slot->flags & KVM_MEM_EXIT_ON_MISSING;
+}
+
 #endif
diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
index 36a51b162a71..e9f33ae93dee 100644
--- a/include/uapi/linux/kvm.h
+++ b/include/uapi/linux/kvm.h
@@ -51,6 +51,7 @@ struct kvm_userspace_memory_region2 {
 #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
 #define KVM_MEM_READONLY	(1UL << 1)
 #define KVM_MEM_GUEST_MEMFD	(1UL << 2)
+#define KVM_MEM_EXIT_ON_MISSING	(1UL << 3)
 
 /* for KVM_IRQ_LINE */
 struct kvm_irq_level {
@@ -920,6 +921,7 @@ struct kvm_enable_cap {
 #define KVM_CAP_MEMORY_ATTRIBUTES 233
 #define KVM_CAP_GUEST_MEMFD 234
 #define KVM_CAP_VM_TYPES 235
+#define KVM_CAP_EXIT_ON_MISSING 236
 
 struct kvm_irq_routing_irqchip {
 	__u32 irqchip;
diff --git a/virt/kvm/Kconfig b/virt/kvm/Kconfig
index 29b73eedfe74..c7bdde127af4 100644
--- a/virt/kvm/Kconfig
+++ b/virt/kvm/Kconfig
@@ -109,3 +109,6 @@ config KVM_GENERIC_PRIVATE_MEM
        select KVM_GENERIC_MEMORY_ATTRIBUTES
        select KVM_PRIVATE_MEM
        bool
+
+config HAVE_KVM_EXIT_ON_MISSING
+       bool
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 67ca580a18c5..469b99898be8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1600,7 +1600,7 @@ static void kvm_replace_memslot(struct kvm *kvm,
  * only allows these.
  */
 #define KVM_SET_USER_MEMORY_REGION_V1_FLAGS \
-	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY)
+	(KVM_MEM_LOG_DIRTY_PAGES | KVM_MEM_READONLY | KVM_MEM_EXIT_ON_MISSING)
 
 static int check_memory_region_flags(struct kvm *kvm,
 				     const struct kvm_userspace_memory_region2 *mem)
@@ -1618,8 +1618,14 @@ static int check_memory_region_flags(struct kvm *kvm,
 	valid_flags |= KVM_MEM_READONLY;
 #endif
 
+	if (IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING))
+		valid_flags |= KVM_MEM_EXIT_ON_MISSING;
+
 	if (mem->flags & ~valid_flags)
 		return -EINVAL;
+	else if ((mem->flags & KVM_MEM_READONLY) &&
+		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
+		return -EINVAL;
 
 	return 0;
 }
@@ -3024,7 +3030,8 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
 
 kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 			       bool atomic, bool interruptible, bool *async,
-			       bool write_fault, bool *writable, hva_t *hva)
+			       bool write_fault, bool *writable,
+			       bool can_exit_on_missing, hva_t *hva)
 {
 	unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
 
@@ -3047,6 +3054,19 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
 		writable = NULL;
 	}
 
+	/* When the slot is exit-on-missing (and when we should respect that)
+	 * set atomic=true to prevent GUP from faulting in the userspace
+	 * mappings.
+	 */
+	if (!atomic && can_exit_on_missing &&
+	    kvm_is_slot_exit_on_missing(slot)) {
+		atomic = true;
+		if (async) {
+			*async = false;
+			async = NULL;
+		}
+	}
+
 	return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
 			  writable);
 }
@@ -3056,21 +3076,21 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
 		      bool *writable)
 {
 	return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
-				    NULL, write_fault, writable, NULL);
+				    NULL, write_fault, writable, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
 
 kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
 
 kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
 {
 	return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
-				    NULL, NULL);
+				    NULL, false, NULL);
 }
 EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
 
@@ -4877,6 +4897,8 @@ static int kvm_vm_ioctl_check_extension_generic(struct kvm *kvm, long arg)
 	case KVM_CAP_GUEST_MEMFD:
 		return !kvm || kvm_arch_has_private_mem(kvm);
 #endif
+	case KVM_CAP_EXIT_ON_MISSING:
+		return IS_ENABLED(CONFIG_HAVE_KVM_EXIT_ON_MISSING);
 	default:
 		break;
 	}
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (5 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-02-15 23:53 ` [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the " Anish Moorthy
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Prevent the stage-2 fault handler from faulting in pages when
KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
calls to check the memslot flag.

To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
when the stage-2 handler returns EFAULT, e.g. when it cannot resolve the
pfn. With KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of
stage-2 faults as vCPU exits, which userspace can attempt to resolve
without terminating the guest.

Delivering stage-2 faults to userspace in this way sidesteps the
significant scalabiliy issues associated with using userfaultfd for the
same purpose.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/x86/kvm/Kconfig           | 1 +
 arch/x86/kvm/mmu/mmu.c         | 8 ++++++--
 3 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index bf7bc21d56ac..d52757f9e1cb 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8052,7 +8052,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
 7.35 KVM_CAP_EXIT_ON_MISSING
 ----------------------------
 
-:Architectures: None
+:Architectures: x86
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index d43efae05794..09224e306abf 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -44,6 +44,7 @@ config KVM
 	select KVM_VFIO
 	select HAVE_KVM_PM_NOTIFIER if PM
 	select KVM_GENERIC_HARDWARE_ENABLING
+        select HAVE_KVM_EXIT_ON_MISSING
 	help
 	  Support hosting fully virtualized guest machines using hardware
 	  virtualization extensions.  You will need a fairly recent
diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index b89a9518f6de..26388e4f42df 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -3305,6 +3305,10 @@ static int kvm_handle_error_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fa
 		return RET_PF_RETRY;
 	}
 
+	WARN_ON_ONCE(fault->goal_level != PG_LEVEL_4K);
+
+	kvm_prepare_memory_fault_exit(vcpu, gfn_to_gpa(fault->gfn), PAGE_SIZE,
+				      fault->write, fault->exec, fault->is_private);
 	return -EFAULT;
 }
 
@@ -4371,7 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	async = false;
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, false, &async,
 					  fault->write, &fault->map_writable,
-					  false, &fault->hva);
+					  true, &fault->hva);
 	if (!async)
 		return RET_PF_CONTINUE; /* *pfn has correct page already */
 
@@ -4393,7 +4397,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
 	 */
 	fault->pfn = __gfn_to_pfn_memslot(slot, fault->gfn, false, true, NULL,
 					  fault->write, &fault->map_writable,
-					  false, &fault->hva);
+					  true, &fault->hva);
 	return RET_PF_CONTINUE;
 }
 
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (6 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
@ 2024-02-15 23:53 ` Anish Moorthy
  2024-03-04 20:00   ` Oliver Upton
  2024-02-15 23:54 ` [PATCH v7 09/14] KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING Anish Moorthy
                   ` (7 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:53 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

At the moment the only intended use case for KVM_CAP_MEMORY_FAULT_INFO
on arm64 is to annotate EFAULTs from the stage-2 fault handler, so
add that annotation now.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/arm64/kvm/arm.c           | 1 +
 arch/arm64/kvm/mmu.c           | 5 ++++-
 3 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index d52757f9e1cb..7012f40332b3 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8031,7 +8031,7 @@ unavailable to host or other VMs.
 7.34 KVM_CAP_MEMORY_FAULT_INFO
 ------------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that KVM_RUN will fill
diff --git a/arch/arm64/kvm/arm.c b/arch/arm64/kvm/arm.c
index a25265aca432..ca4617f53250 100644
--- a/arch/arm64/kvm/arm.c
+++ b/arch/arm64/kvm/arm.c
@@ -240,6 +240,7 @@ int kvm_vm_ioctl_check_extension(struct kvm *kvm, long ext)
 	case KVM_CAP_ARM_SYSTEM_SUSPEND:
 	case KVM_CAP_IRQFD_RESAMPLE:
 	case KVM_CAP_COUNTER_OFFSET:
+	case KVM_CAP_MEMORY_FAULT_INFO:
 		r = 1;
 		break;
 	case KVM_CAP_SET_GUEST_DEBUG2:
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index dfe0cbb5937c..5b740ddfcc8e 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1492,8 +1492,11 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
 	}
-	if (is_error_noslot_pfn(pfn))
+	if (is_error_noslot_pfn(pfn)) {
+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
+					      write_fault, exec_fault, false);
 		return -EFAULT;
+	}
 
 	if (kvm_is_device_pfn(pfn)) {
 		/*
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 09/14] KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (7 preceding siblings ...)
  2024-02-15 23:53 ` [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the " Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-02-15 23:54 ` [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Prevent the stage-2 fault handler from faulting in pages when
KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
call to check the memslot flag. This effects the delivery of stage-2
faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
can attempt to resolve without terminating the guest.

Delivering stage-2 faults to userspace in this way sidesteps the
significant scalabiliy issues associated with using userfaultfd for the
same purpose.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 Documentation/virt/kvm/api.rst | 2 +-
 arch/arm64/kvm/Kconfig         | 1 +
 arch/arm64/kvm/mmu.c           | 2 +-
 3 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
index 7012f40332b3..01b762272b6f 100644
--- a/Documentation/virt/kvm/api.rst
+++ b/Documentation/virt/kvm/api.rst
@@ -8052,7 +8052,7 @@ See KVM_EXIT_MEMORY_FAULT for more information.
 7.35 KVM_CAP_EXIT_ON_MISSING
 ----------------------------
 
-:Architectures: x86
+:Architectures: x86, arm64
 :Returns: Informational only, -EINVAL on direct KVM_ENABLE_CAP.
 
 The presence of this capability indicates that userspace may set the
diff --git a/arch/arm64/kvm/Kconfig b/arch/arm64/kvm/Kconfig
index 01398d2996c7..309d8e7ebc1c 100644
--- a/arch/arm64/kvm/Kconfig
+++ b/arch/arm64/kvm/Kconfig
@@ -39,6 +39,7 @@ menuconfig KVM
 	select SCHED_INFO
 	select GUEST_PERF_EVENTS if PERF_EVENTS
 	select XARRAY_MULTI
+        select HAVE_KVM_EXIT_ON_MISSING
 	help
 	  Support hosting virtualized guest machines.
 
diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 5b740ddfcc8e..b0f1fef0a52c 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1487,7 +1487,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
 	mmap_read_unlock(current->mm);
 
 	pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
-				   write_fault, &writable, false, NULL);
+				   write_fault, &writable, true, NULL);
 	if (pfn == KVM_PFN_ERR_HWPOISON) {
 		kvm_send_hwpoison_signal(hva, vma_shift);
 		return 0;
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (8 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 09/14] KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-04-09 22:49   ` Sean Christopherson
  2024-02-15 23:54 ` [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
                   ` (5 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Using the overall demand paging rate to measure performance can be
slightly misleading when vCPU accesses are not overlapped. Adding more
vCPUs will (usually) increase the overall demand paging rate even
if performance remains constant or even degrades on a per-vcpu basis. As
such, it makes sense to report both the total and per-vcpu paging rates.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/demand_paging_test.c | 15 +++++++++++----
 1 file changed, 11 insertions(+), 4 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 09c116a82a84..6dc823fa933a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -135,6 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
 	int i;
+	double vcpu_paging_rate;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -191,11 +192,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
-	pr_info("Total guest execution time: %ld.%.9lds\n",
+	pr_info("Total guest execution time:\t%ld.%.9lds\n",
 		ts_diff.tv_sec, ts_diff.tv_nsec);
-	pr_info("Overall demand paging rate: %f pgs/sec\n",
-		memstress_args.vcpu_args[0].pages * nr_vcpus /
-		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));
+
+	vcpu_paging_rate =
+		memstress_args.vcpu_args[0].pages
+		/ ((double)ts_diff.tv_sec
+			+ (double)ts_diff.tv_nsec / NSEC_PER_SEC);
+	pr_info("Per-vcpu demand paging rate:\t%f pgs/sec/vcpu\n",
+		vcpu_paging_rate);
+	pr_info("Overall demand paging rate:\t%f pgs/sec\n",
+		vcpu_paging_rate * nr_vcpus);
 
 	memstress_destroy_vm(vm);
 
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (9 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-04-09 22:58   ` Sean Christopherson
  2024-02-15 23:54 ` [PATCH v7 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
                   ` (4 subsequent siblings)
  15 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

At the moment, demand_paging_test does not support profiling/testing
multiple vCPU threads concurrently faulting on a single uffd because

    (a) "-u" (run test in userfaultfd mode) creates a uffd for each vCPU's
        region, so that each uffd services a single vCPU thread.
    (b) "-u -o" (userfaultfd mode + overlapped vCPU memory accesses)
        simply doesn't work: the test tries to register the same memory
        to multiple uffds, causing an error.

Add support for many vcpus per uffd by
    (1) Keeping "-u" behavior unchanged.
    (2) Making "-u -a" create a single uffd for all of guest memory.
    (3) Making "-u -o" implicitly pass "-a", solving the problem in (b).
In cases (2) and (3) all vCPU threads fault on a single uffd.

With potentially multiple vCPUs per UFFD, it makes sense to allow
configuring the number of reader threads per UFFD as well: add the "-r"
flag to do so.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/aarch64/page_fault_test.c   |  4 +-
 .../selftests/kvm/demand_paging_test.c        | 76 +++++++++++++---
 .../selftests/kvm/include/userfaultfd_util.h  | 17 +++-
 .../selftests/kvm/lib/userfaultfd_util.c      | 87 +++++++++++++------
 4 files changed, 137 insertions(+), 47 deletions(-)

diff --git a/tools/testing/selftests/kvm/aarch64/page_fault_test.c b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
index 08a5ca5bed56..dad1fb338f36 100644
--- a/tools/testing/selftests/kvm/aarch64/page_fault_test.c
+++ b/tools/testing/selftests/kvm/aarch64/page_fault_test.c
@@ -375,14 +375,14 @@ static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
 		*pt_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						    pt_args.hva,
 						    pt_args.paging_size,
-						    test->uffd_pt_handler);
+						    1, test->uffd_pt_handler);
 
 	*data_uffd = NULL;
 	if (test->uffd_data_handler)
 		*data_uffd = uffd_setup_demand_paging(uffd_mode, 0,
 						      data_args.hva,
 						      data_args.paging_size,
-						      test->uffd_data_handler);
+						      1, test->uffd_data_handler);
 }
 
 static void free_uffd(struct test_desc *test, struct uffd_desc *pt_uffd,
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 6dc823fa933a..f7897a951f90 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -77,8 +77,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		copy.mode = 0;
 
 		r = ioctl(uffd, UFFDIO_COPY, &copy);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -89,8 +101,20 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		cont.range.len = demand_paging_size;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
-		if (r == -1) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx from thread %d with errno: %d\n",
+		/*
+		 * With multiple vCPU threads fault on a single page and there are
+		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
+		 * will fail with EEXIST: handle that case without signaling an
+		 * error.
+		 *
+		 * Note that this also suppress any EEXISTs occurring from,
+		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
+		 * happens here, but a realistic VMM might potentially maintain
+		 * some external state to correctly surface EEXISTs to userspace
+		 * (or prevent duplicate COPY/CONTINUEs in the first place).
+		 */
+		if (r == -1 && errno != EEXIST) {
+			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
 				addr, tid, errno);
 			return r;
 		}
@@ -110,7 +134,9 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 
 struct test_params {
 	int uffd_mode;
+	bool single_uffd;
 	useconds_t uffd_delay;
+	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
 };
@@ -134,8 +160,9 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i;
+	int i, num_uffds = 0;
 	double vcpu_paging_rate;
+	uint64_t uffd_region_size;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
 				 p->src_type, p->partition_vcpu_memory_access);
@@ -148,7 +175,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
 	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		for (i = 0; i < nr_vcpus; i++) {
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		for (i = 0; i < num_uffds; i++) {
 			vcpu_args = &memstress_args.vcpu_args[i];
 			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
 				     vcpu_args->pages * memstress_args.guest_page_size);
@@ -156,9 +184,13 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	}
 
 	if (p->uffd_mode) {
-		uffd_descs = malloc(nr_vcpus * sizeof(struct uffd_desc *));
+		num_uffds = p->single_uffd ? 1 : nr_vcpus;
+		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
+
+		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
 		TEST_ASSERT(uffd_descs, "Memory allocation failed");
-		for (i = 0; i < nr_vcpus; i++) {
+		for (i = 0; i < num_uffds; i++) {
+			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
 
 			vcpu_args = &memstress_args.vcpu_args[i];
@@ -171,7 +203,8 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
 				p->uffd_mode, p->uffd_delay, vcpu_hva,
-				vcpu_args->pages * memstress_args.guest_page_size,
+				uffd_region_size,
+				p->readers_per_uffd,
 				&handle_uffd_page_request);
 		}
 	}
@@ -188,7 +221,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 
 	if (p->uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
-		for (i = 0; i < nr_vcpus; i++)
+		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
 	}
 
@@ -214,15 +247,20 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 static void help(char *name)
 {
 	puts("");
-	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-d uffd_delay_usec]\n"
-	       "          [-b memory] [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
+		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
 	kvm_print_vcpu_pinning_help();
+	printf(" -a: Use a single userfaultfd for all of guest memory, instead of\n"
+	       "     creating one for each region paged by a unique vCPU\n"
+	       "     Set implicitly with -o, and no effect without -u.\n");
 	printf(" -d: add a delay in usec to the User Fault\n"
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
+	printf(" -r: Set the number of reader threads per uffd.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -241,12 +279,14 @@ int main(int argc, char *argv[])
 	struct test_params p = {
 		.src_type = DEFAULT_VM_MEM_SRC,
 		.partition_vcpu_memory_access = true,
+		.readers_per_uffd = 1,
+		.single_uffd = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "hm:u:d:b:s:v:c:o")) != -1) {
+	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
@@ -258,6 +298,9 @@ int main(int argc, char *argv[])
 				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
 			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
+		case 'a':
+			p.single_uffd = true;
+			break;
 		case 'd':
 			p.uffd_delay = strtoul(optarg, NULL, 0);
 			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
@@ -278,6 +321,13 @@ int main(int argc, char *argv[])
 			break;
 		case 'o':
 			p.partition_vcpu_memory_access = false;
+			p.single_uffd = true;
+			break;
+		case 'r':
+			p.readers_per_uffd = atoi(optarg);
+			TEST_ASSERT(p.readers_per_uffd >= 1,
+				    "Invalid number of readers per uffd %d: must be >=1",
+				    p.readers_per_uffd);
 			break;
 		case 'h':
 		default:
diff --git a/tools/testing/selftests/kvm/include/userfaultfd_util.h b/tools/testing/selftests/kvm/include/userfaultfd_util.h
index 877449c34592..af83a437e74a 100644
--- a/tools/testing/selftests/kvm/include/userfaultfd_util.h
+++ b/tools/testing/selftests/kvm/include/userfaultfd_util.h
@@ -17,18 +17,27 @@
 
 typedef int (*uffd_handler_t)(int uffd_mode, int uffd, struct uffd_msg *msg);
 
-struct uffd_desc {
+struct uffd_reader_args {
 	int uffd_mode;
 	int uffd;
-	int pipefds[2];
 	useconds_t delay;
 	uffd_handler_t handler;
-	pthread_t thread;
+	/* Holds the read end of the pipe for killing the reader. */
+	int pipe;
+};
+
+struct uffd_desc {
+	int uffd;
+	uint64_t num_readers;
+	/* Holds the write ends of the pipes for killing the readers. */
+	int *pipefds;
+	pthread_t *readers;
+	struct uffd_reader_args *reader_args;
 };
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler);
+					   uint64_t num_readers, uffd_handler_t handler);
 
 void uffd_stop_demand_paging(struct uffd_desc *uffd);
 
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 271f63891581..6f220aa4fb08 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -27,10 +27,8 @@
 
 static void *uffd_handler_thread_fn(void *arg)
 {
-	struct uffd_desc *uffd_desc = (struct uffd_desc *)arg;
-	int uffd = uffd_desc->uffd;
-	int pipefd = uffd_desc->pipefds[0];
-	useconds_t delay = uffd_desc->delay;
+	struct uffd_reader_args *reader_args = (struct uffd_reader_args *)arg;
+	int uffd = reader_args->uffd;
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
@@ -44,7 +42,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 		pollfd[0].fd = uffd;
 		pollfd[0].events = POLLIN;
-		pollfd[1].fd = pipefd;
+		pollfd[1].fd = reader_args->pipe;
 		pollfd[1].events = POLLIN;
 
 		r = poll(pollfd, 2, -1);
@@ -92,9 +90,9 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
 
-		if (delay)
-			usleep(delay);
-		r = uffd_desc->handler(uffd_desc->uffd_mode, uffd, &msg);
+		if (reader_args->delay)
+			usleep(reader_args->delay);
+		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
 		if (r < 0)
 			return NULL;
 		pages++;
@@ -110,7 +108,7 @@ static void *uffd_handler_thread_fn(void *arg)
 
 struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 					   void *hva, uint64_t len,
-					   uffd_handler_t handler)
+					   uint64_t num_readers, uffd_handler_t handler)
 {
 	struct uffd_desc *uffd_desc;
 	bool is_minor = (uffd_mode == UFFDIO_REGISTER_MODE_MINOR);
@@ -118,14 +116,26 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	struct uffdio_api uffdio_api;
 	struct uffdio_register uffdio_register;
 	uint64_t expected_ioctls = ((uint64_t) 1) << _UFFDIO_COPY;
-	int ret;
+	int ret, i;
 
 	PER_PAGE_DEBUG("Userfaultfd %s mode, faults resolved with %s\n",
 		       is_minor ? "MINOR" : "MISSING",
 		       is_minor ? "UFFDIO_CONINUE" : "UFFDIO_COPY");
 
 	uffd_desc = malloc(sizeof(struct uffd_desc));
-	TEST_ASSERT(uffd_desc, "malloc failed");
+	TEST_ASSERT(uffd_desc, "Failed to malloc uffd descriptor");
+
+	uffd_desc->pipefds = malloc(sizeof(int) * num_readers);
+	TEST_ASSERT(uffd_desc->pipefds, "Failed to malloc pipes");
+
+	uffd_desc->readers = malloc(sizeof(pthread_t) * num_readers);
+	TEST_ASSERT(uffd_desc->readers, "Failed to malloc reader threads");
+
+	uffd_desc->reader_args = malloc(
+		sizeof(struct uffd_reader_args) * num_readers);
+	TEST_ASSERT(uffd_desc->reader_args, "Failed to malloc reader_args");
+
+	uffd_desc->num_readers = num_readers;
 
 	/* In order to get minor faults, prefault via the alias. */
 	if (is_minor)
@@ -148,18 +158,28 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 	TEST_ASSERT((uffdio_register.ioctls & expected_ioctls) ==
 		    expected_ioctls, "missing userfaultfd ioctls");
 
-	ret = pipe2(uffd_desc->pipefds, O_CLOEXEC | O_NONBLOCK);
-	TEST_ASSERT(!ret, "Failed to set up pipefd");
-
-	uffd_desc->uffd_mode = uffd_mode;
 	uffd_desc->uffd = uffd;
-	uffd_desc->delay = delay;
-	uffd_desc->handler = handler;
-	pthread_create(&uffd_desc->thread, NULL, uffd_handler_thread_fn,
-		       uffd_desc);
+	for (i = 0; i < uffd_desc->num_readers; ++i) {
+		int pipes[2];
+
+		ret = pipe2((int *) &pipes, O_CLOEXEC | O_NONBLOCK);
+		TEST_ASSERT(!ret, "Failed to set up pipefd %i for uffd_desc %p",
+			    i, uffd_desc);
+
+		uffd_desc->pipefds[i] = pipes[1];
 
-	PER_VCPU_DEBUG("Created uffd thread for HVA range [%p, %p)\n",
-		       hva, hva + len);
+		uffd_desc->reader_args[i].uffd_mode = uffd_mode;
+		uffd_desc->reader_args[i].uffd = uffd;
+		uffd_desc->reader_args[i].delay = delay;
+		uffd_desc->reader_args[i].handler = handler;
+		uffd_desc->reader_args[i].pipe = pipes[0];
+
+		pthread_create(&uffd_desc->readers[i], NULL, uffd_handler_thread_fn,
+			       &uffd_desc->reader_args[i]);
+
+		PER_VCPU_DEBUG("Created uffd thread %i for HVA range [%p, %p)\n",
+			       i, hva, hva + len);
+	}
 
 	return uffd_desc;
 }
@@ -167,19 +187,30 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
 void uffd_stop_demand_paging(struct uffd_desc *uffd)
 {
 	char c = 0;
-	int ret;
+	int i, ret;
 
-	ret = write(uffd->pipefds[1], &c, 1);
-	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = write(uffd->pipefds[i], &c, 1);
+		TEST_ASSERT(
+			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);
+	}
 
-	ret = pthread_join(uffd->thread, NULL);
-	TEST_ASSERT(ret == 0, "Pthread_join failed.");
+	for (i = 0; i < uffd->num_readers; ++i) {
+		ret = pthread_join(uffd->readers[i], NULL);
+		TEST_ASSERT(
+			ret == 0, "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);
+	}
 
 	close(uffd->uffd);
 
-	close(uffd->pipefds[1]);
-	close(uffd->pipefds[0]);
+	for (i = 0; i < uffd->num_readers; ++i) {
+		close(uffd->pipefds[i]);
+		close(uffd->reader_args[i].pipe);
+	}
 
+	free(uffd->pipefds);
+	free(uffd->readers);
+	free(uffd->reader_args);
 	free(uffd);
 }
 
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (10 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-02-15 23:54 ` [PATCH v7 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

With multiple reader threads POLLing a single UFFD, the test suffers
from the thundering herd problem: performance degrades as the number of
reader threads is increased. Solve this issue [1] by switching the
the polling mechanism to EPOLL + EPOLLEXCLUSIVE.

Also, change the error-handling convention of uffd_handler_thread_fn.
Instead of just printing errors and returning early from the polling
loop, check for them via TEST_ASSERT. "return NULL" is reserved for a
successful exit from uffd_handler_thread_fn, ie one triggered by a
write to the exit pipe.

Performance samples generated by the command in [2] are given below.

Num Reader Threads, Paging Rate (POLL), Paging Rate (EPOLL)
1      249k      185k
2      201k      235k
4      186k      155k
16     150k      217k
32     89k       198k

[1] Single-vCPU performance does suffer somewhat.
[2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        |  1 -
 .../selftests/kvm/lib/userfaultfd_util.c      | 74 +++++++++----------
 2 files changed, 35 insertions(+), 40 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index f7897a951f90..0455347f932a 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -13,7 +13,6 @@
 #include <stdio.h>
 #include <stdlib.h>
 #include <time.h>
-#include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
 #include <sys/syscall.h>
diff --git a/tools/testing/selftests/kvm/lib/userfaultfd_util.c b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
index 6f220aa4fb08..2a179133645a 100644
--- a/tools/testing/selftests/kvm/lib/userfaultfd_util.c
+++ b/tools/testing/selftests/kvm/lib/userfaultfd_util.c
@@ -16,6 +16,7 @@
 #include <poll.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <sys/epoll.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -32,60 +33,55 @@ static void *uffd_handler_thread_fn(void *arg)
 	int64_t pages = 0;
 	struct timespec start;
 	struct timespec ts_diff;
+	int epollfd;
+	struct epoll_event evt;
+
+	epollfd = epoll_create(1);
+	TEST_ASSERT(epollfd >= 0, "Failed to create epollfd.");
+
+	evt.events = EPOLLIN | EPOLLEXCLUSIVE;
+	evt.data.u32 = 0;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, uffd, &evt) == 0,
+		    "Failed to add uffd to epollfd");
+
+	evt.events = EPOLLIN;
+	evt.data.u32 = 1;
+	TEST_ASSERT(epoll_ctl(epollfd, EPOLL_CTL_ADD, reader_args->pipe, &evt) == 0,
+		    "Failed to add pipe to epollfd");
 
 	clock_gettime(CLOCK_MONOTONIC, &start);
 	while (1) {
 		struct uffd_msg msg;
-		struct pollfd pollfd[2];
-		char tmp_chr;
 		int r;
 
-		pollfd[0].fd = uffd;
-		pollfd[0].events = POLLIN;
-		pollfd[1].fd = reader_args->pipe;
-		pollfd[1].events = POLLIN;
-
-		r = poll(pollfd, 2, -1);
-		switch (r) {
-		case -1:
-			pr_info("poll err");
-			continue;
-		case 0:
-			continue;
-		case 1:
-			break;
-		default:
-			pr_info("Polling uffd returned %d", r);
-			return NULL;
-		}
+		r = epoll_wait(epollfd, &evt, 1, -1);
+		TEST_ASSERT(r == 1,
+			    "Unexpected number of events (%d) from epoll, errno = %d",
+			    r, errno);
 
-		if (pollfd[0].revents & POLLERR) {
-			pr_info("uffd revents has POLLERR");
-			return NULL;
-		}
+		if (evt.data.u32 == 1) {
+			char tmp_chr;
 
-		if (pollfd[1].revents & POLLIN) {
-			r = read(pollfd[1].fd, &tmp_chr, 1);
+			TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+				    "Reader thread received EPOLLERR or EPOLLHUP on pipe.");
+			r = read(reader_args->pipe, &tmp_chr, 1);
 			TEST_ASSERT(r == 1,
-				    "Error reading pipefd in UFFD thread\n");
+				    "Error reading pipefd in uffd reader thread");
 			break;
 		}
 
-		if (!(pollfd[0].revents & POLLIN))
-			continue;
+		TEST_ASSERT(!(evt.events & (EPOLLERR | EPOLLHUP)),
+			    "Reader thread received EPOLLERR or EPOLLHUP on uffd.");
 
 		r = read(uffd, &msg, sizeof(msg));
 		if (r == -1) {
-			if (errno == EAGAIN)
-				continue;
-			pr_info("Read of uffd got errno %d\n", errno);
-			return NULL;
+			TEST_ASSERT(errno == EAGAIN,
+				    "Error reading from UFFD: errno = %d", errno);
+			continue;
 		}
 
-		if (r != sizeof(msg)) {
-			pr_info("Read on uffd returned unexpected size: %d bytes", r);
-			return NULL;
-		}
+		TEST_ASSERT(r == sizeof(msg),
+			    "Read on uffd returned unexpected number of bytes (%d)", r);
 
 		if (!(msg.event & UFFD_EVENT_PAGEFAULT))
 			continue;
@@ -93,8 +89,8 @@ static void *uffd_handler_thread_fn(void *arg)
 		if (reader_args->delay)
 			usleep(reader_args->delay);
 		r = reader_args->handler(reader_args->uffd_mode, uffd, &msg);
-		if (r < 0)
-			return NULL;
+		TEST_ASSERT(r >= 0,
+			    "Reader thread handler fn returned negative value %d", r);
 		pages++;
 	}
 
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (11 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-02-15 23:54 ` [PATCH v7 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Memslot flags aren't currently exposed to the tests, and are just always
set to 0. Add a parameter to allow tests to manually set those flags.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
---
 tools/testing/selftests/kvm/access_tracking_perf_test.c       | 2 +-
 tools/testing/selftests/kvm/demand_paging_test.c              | 2 +-
 tools/testing/selftests/kvm/dirty_log_perf_test.c             | 2 +-
 tools/testing/selftests/kvm/include/memstress.h               | 2 +-
 tools/testing/selftests/kvm/lib/memstress.c                   | 4 ++--
 .../testing/selftests/kvm/memslot_modification_stress_test.c  | 2 +-
 .../selftests/kvm/x86_64/dirty_log_page_splitting_test.c      | 2 +-
 7 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/tools/testing/selftests/kvm/access_tracking_perf_test.c b/tools/testing/selftests/kvm/access_tracking_perf_test.c
index 3c7defd34f56..b51656b408b8 100644
--- a/tools/testing/selftests/kvm/access_tracking_perf_test.c
+++ b/tools/testing/selftests/kvm/access_tracking_perf_test.c
@@ -306,7 +306,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct kvm_vm *vm;
 	int nr_vcpus = params->nr_vcpus;
 
-	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, params->vcpu_memory_bytes, 1, 0,
 				 params->backing_src, !overlap_memory_access);
 
 	memstress_start_vcpu_threads(nr_vcpus, vcpu_thread_main);
diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 0455347f932a..61bb2e23bef0 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -163,7 +163,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	double vcpu_paging_rate;
 	uint64_t uffd_region_size;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
diff --git a/tools/testing/selftests/kvm/dirty_log_perf_test.c b/tools/testing/selftests/kvm/dirty_log_perf_test.c
index d374dbcf9a53..8b1a84a4db3b 100644
--- a/tools/testing/selftests/kvm/dirty_log_perf_test.c
+++ b/tools/testing/selftests/kvm/dirty_log_perf_test.c
@@ -153,7 +153,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	int i;
 
 	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
-				 p->slots, p->backing_src,
+				 p->slots, 0, p->backing_src,
 				 p->partition_vcpu_memory_access);
 
 	pr_info("Random seed: %u\n", p->random_seed);
diff --git a/tools/testing/selftests/kvm/include/memstress.h b/tools/testing/selftests/kvm/include/memstress.h
index ce4e603050ea..8be9609d3ca0 100644
--- a/tools/testing/selftests/kvm/include/memstress.h
+++ b/tools/testing/selftests/kvm/include/memstress.h
@@ -56,7 +56,7 @@ struct memstress_args {
 extern struct memstress_args memstress_args;
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access);
 void memstress_destroy_vm(struct kvm_vm *vm);
diff --git a/tools/testing/selftests/kvm/lib/memstress.c b/tools/testing/selftests/kvm/lib/memstress.c
index d05487e5a371..e74b09f39769 100644
--- a/tools/testing/selftests/kvm/lib/memstress.c
+++ b/tools/testing/selftests/kvm/lib/memstress.c
@@ -123,7 +123,7 @@ void memstress_setup_vcpus(struct kvm_vm *vm, int nr_vcpus,
 }
 
 struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
-				   uint64_t vcpu_memory_bytes, int slots,
+				   uint64_t vcpu_memory_bytes, int slots, uint32_t slot_flags,
 				   enum vm_mem_backing_src_type backing_src,
 				   bool partition_vcpu_memory_access)
 {
@@ -212,7 +212,7 @@ struct kvm_vm *memstress_create_vm(enum vm_guest_mode mode, int nr_vcpus,
 
 		vm_userspace_mem_region_add(vm, backing_src, region_start,
 					    MEMSTRESS_MEM_SLOT_INDEX + i,
-					    region_pages, 0);
+					    region_pages, slot_flags);
 	}
 
 	/* Do mapping for the demand paging memory slot */
diff --git a/tools/testing/selftests/kvm/memslot_modification_stress_test.c b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
index 9855c41ca811..0b19ec3ecc9c 100644
--- a/tools/testing/selftests/kvm/memslot_modification_stress_test.c
+++ b/tools/testing/selftests/kvm/memslot_modification_stress_test.c
@@ -95,7 +95,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	struct test_params *p = arg;
 	struct kvm_vm *vm;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
 				 VM_MEM_SRC_ANONYMOUS,
 				 p->partition_vcpu_memory_access);
 
diff --git a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
index 634c6bfcd572..a770d7fa469a 100644
--- a/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
+++ b/tools/testing/selftests/kvm/x86_64/dirty_log_page_splitting_test.c
@@ -100,7 +100,7 @@ static void run_test(enum vm_guest_mode mode, void *unused)
 	struct kvm_page_stats stats_dirty_logging_disabled;
 	struct kvm_page_stats stats_repopulated;
 
-	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size,
+	vm = memstress_create_vm(mode, VCPUS, guest_percpu_mem_size, 0,
 				 SLOTS, backing_src, false);
 
 	guest_num_pages = (VCPUS * guest_percpu_mem_size) >> vm->page_shift;
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* [PATCH v7 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (12 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
@ 2024-02-15 23:54 ` Anish Moorthy
  2024-02-16  7:36 ` [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Gupta, Pankaj
  2024-04-10  0:19 ` Sean Christopherson
  15 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-02-15 23:54 UTC (permalink / raw
  To: seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, amoorthy, dmatlack, axelrasmussen,
	peterx, nadav.amit, isaku.yamahata, kconsul

Demonstrate a (very basic) scheme for supporting memory fault exits.

From the vCPU threads:
1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
   with the purpose of establishing the absent mappings. Do so with
   wake_waiters=false to avoid serializing on the userfaultfd wait queue
   locks.

2. When the UFFDIO_COPY/CONTINUE in (1) fails with EEXIST,
   assume that the mapping was already established but is currently
   absent [A] and attempt to populate it using MADV_POPULATE_WRITE.

Issue UFFDIO_COPY/CONTINUEs from the reader threads as well, but with
wake_waiters=true to ensure that any threads sleeping on the uffd are
eventually woken up.

A real VMM would track whether it had already COPY/CONTINUEd pages (eg,
via a bitmap) to avoid calls destined to EEXIST. However, even the
naive approach is enough to demonstrate the performance advantages of
KVM_EXIT_MEMORY_FAULT.

[A] In reality it is much likelier that the vCPU thread simply lost a
    race to establish the mapping for the page.

Signed-off-by: Anish Moorthy <amoorthy@google.com>
Acked-by: James Houghton <jthoughton@google.com>
---
 .../selftests/kvm/demand_paging_test.c        | 245 +++++++++++++-----
 1 file changed, 173 insertions(+), 72 deletions(-)

diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
index 61bb2e23bef0..44bdcc7aad87 100644
--- a/tools/testing/selftests/kvm/demand_paging_test.c
+++ b/tools/testing/selftests/kvm/demand_paging_test.c
@@ -15,6 +15,7 @@
 #include <time.h>
 #include <pthread.h>
 #include <linux/userfaultfd.h>
+#include <linux/mman.h>
 #include <sys/syscall.h>
 
 #include "kvm_util.h"
@@ -31,36 +32,102 @@ static uint64_t guest_percpu_mem_size = DEFAULT_PER_VCPU_MEM_SIZE;
 static size_t demand_paging_size;
 static char *guest_data_prototype;
 
+static int num_uffds;
+static size_t uffd_region_size;
+static struct uffd_desc **uffd_descs;
+/*
+ * Delay when demand paging is performed through userfaultfd or directly by
+ * vcpu_worker in the case of an annotated memory fault.
+ */
+static useconds_t uffd_delay;
+static int uffd_mode;
+
+
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu);
+
+static void madv_write_or_err(uint64_t gpa)
+{
+	int r;
+	void *hva = addr_gpa2hva(memstress_args.vm, gpa);
+
+	r = madvise(hva, demand_paging_size, MADV_POPULATE_WRITE);
+	TEST_ASSERT(r == 0,
+		    "MADV_POPULATE_WRITE on hva 0x%lx (gpa 0x%lx) fail, errno %i\n",
+		    (uintptr_t) hva, gpa, errno);
+}
+
+static void ready_page(uint64_t gpa)
+{
+	int r, uffd;
+
+	/*
+	 * This test only registers memslot 1 w/ userfaultfd. Any accesses outside
+	 * the registered ranges should fault in the physical pages through
+	 * MADV_POPULATE_WRITE.
+	 */
+	if ((gpa < memstress_args.gpa)
+		|| (gpa >= memstress_args.gpa + memstress_args.size)) {
+		madv_write_or_err(gpa);
+	} else {
+		if (uffd_delay)
+			usleep(uffd_delay);
+
+		uffd = uffd_descs[(gpa - memstress_args.gpa) / uffd_region_size]->uffd;
+
+		r = handle_uffd_page_request(uffd_mode, uffd,
+					     (uint64_t) addr_gpa2hva(memstress_args.vm, gpa), true);
+
+		if (r == EEXIST)
+			madv_write_or_err(gpa);
+	}
+}
+
 static void vcpu_worker(struct memstress_vcpu_args *vcpu_args)
 {
 	struct kvm_vcpu *vcpu = vcpu_args->vcpu;
 	int vcpu_idx = vcpu_args->vcpu_idx;
 	struct kvm_run *run = vcpu->run;
-	struct timespec start;
-	struct timespec ts_diff;
+	struct timespec last_start;
+	struct timespec total_runtime = {};
 	int ret;
-
-	clock_gettime(CLOCK_MONOTONIC, &start);
-
-	/* Let the guest access its memory */
-	ret = _vcpu_run(vcpu);
-	TEST_ASSERT(ret == 0, "vcpu_run failed: %d\n", ret);
-	if (get_ucall(vcpu, NULL) != UCALL_SYNC) {
-		TEST_ASSERT(false,
-			    "Invalid guest sync status: exit_reason=%s\n",
-			    exit_reason_str(run->exit_reason));
+	u64 num_memory_fault_exits = 0;
+	bool annotated_memory_fault = false;
+
+	while (true) {
+		clock_gettime(CLOCK_MONOTONIC, &last_start);
+		/* Let the guest access its memory */
+		ret = _vcpu_run(vcpu);
+		annotated_memory_fault = errno == EFAULT
+					 && run->exit_reason == KVM_EXIT_MEMORY_FAULT;
+		TEST_ASSERT(ret == 0 || annotated_memory_fault,
+			    "vcpu_run failed: %d\n", ret);
+
+		total_runtime = timespec_add(total_runtime,
+					     timespec_elapsed(last_start));
+		if (ret != 0 && get_ucall(vcpu, NULL) != UCALL_SYNC) {
+
+			if (annotated_memory_fault) {
+				++num_memory_fault_exits;
+				ready_page(run->memory_fault.gpa);
+				continue;
+			}
+
+			TEST_ASSERT(false,
+				    "Invalid guest sync status: exit_reason=%s\n",
+				    exit_reason_str(run->exit_reason));
+		}
+		break;
 	}
-
-	ts_diff = timespec_elapsed(start);
-	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds\n", vcpu_idx,
-		       ts_diff.tv_sec, ts_diff.tv_nsec);
+	PER_VCPU_DEBUG("vCPU %d execution time: %ld.%.9lds, %d memory fault exits\n",
+		       vcpu_idx, total_runtime.tv_sec, total_runtime.tv_nsec,
+		       num_memory_fault_exits);
 }
 
-static int handle_uffd_page_request(int uffd_mode, int uffd,
-		struct uffd_msg *msg)
+static int handle_uffd_page_request(int uffd_mode, int uffd, uint64_t hva,
+				    bool is_vcpu)
 {
 	pid_t tid = syscall(__NR_gettid);
-	uint64_t addr = msg->arg.pagefault.address;
 	struct timespec start;
 	struct timespec ts_diff;
 	int r;
@@ -71,16 +138,15 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		struct uffdio_copy copy;
 
 		copy.src = (uint64_t)guest_data_prototype;
-		copy.dst = addr;
+		copy.dst = hva;
 		copy.len = demand_paging_size;
-		copy.mode = 0;
+		copy.mode = is_vcpu ? UFFDIO_COPY_MODE_DONTWAKE : 0;
 
-		r = ioctl(uffd, UFFDIO_COPY, &copy);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -88,23 +154,24 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_COPY in 0x%lx from thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		r = ioctl(uffd, UFFDIO_COPY, &copy);
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_COPY on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+		/* The comments in the UFFDIO_COPY branch also apply here. */
 		struct uffdio_continue cont = {0};
 
-		cont.range.start = addr;
+		cont.range.start = hva;
 		cont.range.len = demand_paging_size;
+		cont.mode = is_vcpu ? UFFDIO_CONTINUE_MODE_DONTWAKE : 0;
 
 		r = ioctl(uffd, UFFDIO_CONTINUE, &cont);
 		/*
-		 * With multiple vCPU threads fault on a single page and there are
-		 * multiple readers for the UFFD, at least one of the UFFDIO_COPYs
-		 * will fail with EEXIST: handle that case without signaling an
-		 * error.
+		 * With multiple vCPU threads and at least one of multiple reader threads
+		 * or vCPU memory faults, multiple vCPUs accessing an absent page will
+		 * almost certainly cause some thread doing the UFFDIO_COPY here to get
+		 * EEXIST: make sure to allow that case.
 		 *
 		 * Note that this also suppress any EEXISTs occurring from,
 		 * e.g., the first UFFDIO_COPY/CONTINUEs on a page. That never
@@ -112,32 +179,54 @@ static int handle_uffd_page_request(int uffd_mode, int uffd,
 		 * some external state to correctly surface EEXISTs to userspace
 		 * (or prevent duplicate COPY/CONTINUEs in the first place).
 		 */
-		if (r == -1 && errno != EEXIST) {
-			pr_info("Failed UFFDIO_CONTINUE in 0x%lx, thread %d, errno = %d\n",
-				addr, tid, errno);
-			return r;
-		}
+		TEST_ASSERT(r == 0 || errno == EEXIST,
+			    "Thread 0x%x failed UFFDIO_CONTINUE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
 	} else {
 		TEST_FAIL("Invalid uffd mode %d", uffd_mode);
 	}
 
+	/*
+	 * If the above UFFDIO_COPY/CONTINUE failed with EEXIST, waiting threads
+	 * will not have been woken: wake them here.
+	 */
+	if (!is_vcpu && r != 0) {
+		struct uffdio_range range = {
+			.start = hva,
+			.len = demand_paging_size
+		};
+		r = ioctl(uffd, UFFDIO_WAKE, &range);
+		TEST_ASSERT(r == 0,
+			    "Thread 0x%x failed UFFDIO_WAKE on hva 0x%lx, errno = %d",
+			    tid, hva, errno);
+	}
+
 	ts_diff = timespec_elapsed(start);
 
 	PER_PAGE_DEBUG("UFFD page-in %d \t%ld ns\n", tid,
 		       timespec_to_ns(ts_diff));
 	PER_PAGE_DEBUG("Paged in %ld bytes at 0x%lx from thread %d\n",
-		       demand_paging_size, addr, tid);
+		       demand_paging_size, hva, tid);
 
 	return 0;
 }
 
+static int handle_uffd_page_request_from_uffd(int uffd_mode, int uffd,
+					      struct uffd_msg *msg)
+{
+	TEST_ASSERT(msg->event == UFFD_EVENT_PAGEFAULT,
+		    "Received uffd message with event %d != UFFD_EVENT_PAGEFAULT",
+		    msg->event);
+	return handle_uffd_page_request(uffd_mode, uffd,
+					msg->arg.pagefault.address, false);
+}
+
 struct test_params {
-	int uffd_mode;
 	bool single_uffd;
-	useconds_t uffd_delay;
 	int readers_per_uffd;
 	enum vm_mem_backing_src_type src_type;
 	bool partition_vcpu_memory_access;
+	bool memfault_exits;
 };
 
 static void prefault_mem(void *alias, uint64_t len)
@@ -155,16 +244,22 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 {
 	struct memstress_vcpu_args *vcpu_args;
 	struct test_params *p = arg;
-	struct uffd_desc **uffd_descs = NULL;
 	struct timespec start;
 	struct timespec ts_diff;
 	struct kvm_vm *vm;
-	int i, num_uffds = 0;
+	int i;
 	double vcpu_paging_rate;
-	uint64_t uffd_region_size;
+	uint32_t slot_flags = 0;
+	bool uffd_memfault_exits = uffd_mode && p->memfault_exits;
 
-	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1, 0,
-				 p->src_type, p->partition_vcpu_memory_access);
+	if (uffd_memfault_exits) {
+		TEST_ASSERT(kvm_has_cap(KVM_CAP_EXIT_ON_MISSING) > 0,
+					"KVM does not have KVM_CAP_EXIT_ON_MISSING");
+		slot_flags = KVM_MEM_EXIT_ON_MISSING;
+	}
+
+	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size,
+				 1, slot_flags, p->src_type, p->partition_vcpu_memory_access);
 
 	demand_paging_size = get_backing_src_pagesz(p->src_type);
 
@@ -173,21 +268,21 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 		    "Failed to allocate buffer for guest data pattern");
 	memset(guest_data_prototype, 0xAB, demand_paging_size);
 
-	if (p->uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
-		num_uffds = p->single_uffd ? 1 : nr_vcpus;
-		for (i = 0; i < num_uffds; i++) {
-			vcpu_args = &memstress_args.vcpu_args[i];
-			prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
-				     vcpu_args->pages * memstress_args.guest_page_size);
-		}
-	}
-
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		num_uffds = p->single_uffd ? 1 : nr_vcpus;
 		uffd_region_size = nr_vcpus * guest_percpu_mem_size / num_uffds;
 
+		if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR) {
+			for (i = 0; i < num_uffds; i++) {
+				vcpu_args = &memstress_args.vcpu_args[i];
+				prefault_mem(addr_gpa2alias(vm, vcpu_args->gpa),
+					     uffd_region_size);
+			}
+		}
+
 		uffd_descs = malloc(num_uffds * sizeof(struct uffd_desc *));
-		TEST_ASSERT(uffd_descs, "Memory allocation failed");
+		TEST_ASSERT(uffd_descs, "Failed to allocate uffd descriptors");
+
 		for (i = 0; i < num_uffds; i++) {
 			struct memstress_vcpu_args *vcpu_args;
 			void *vcpu_hva;
@@ -201,10 +296,10 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 			 * requests.
 			 */
 			uffd_descs[i] = uffd_setup_demand_paging(
-				p->uffd_mode, p->uffd_delay, vcpu_hva,
+				uffd_mode, uffd_delay, vcpu_hva,
 				uffd_region_size,
 				p->readers_per_uffd,
-				&handle_uffd_page_request);
+				&handle_uffd_page_request_from_uffd);
 		}
 	}
 
@@ -218,7 +313,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	ts_diff = timespec_elapsed(start);
 	pr_info("All vCPU threads joined\n");
 
-	if (p->uffd_mode) {
+	if (uffd_mode) {
 		/* Tell the user fault fd handler threads to quit */
 		for (i = 0; i < num_uffds; i++)
 			uffd_stop_demand_paging(uffd_descs[i]);
@@ -239,7 +334,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
 	memstress_destroy_vm(vm);
 
 	free(guest_data_prototype);
-	if (p->uffd_mode)
+	if (uffd_mode)
 		free(uffd_descs);
 }
 
@@ -248,7 +343,8 @@ static void help(char *name)
 	puts("");
 	printf("usage: %s [-h] [-m vm_mode] [-u uffd_mode] [-a]\n"
 		   "          [-d uffd_delay_usec] [-r readers_per_uffd] [-b memory]\n"
-		   "          [-s type] [-v vcpus] [-c cpu_list] [-o]\n", name);
+		   "          [-s type] [-v vcpus] [-c cpu_list] [-o] [-w] \n",
+	       name);
 	guest_modes_help();
 	printf(" -u: use userfaultfd to handle vCPU page faults. Mode is a\n"
 	       "     UFFD registration mode: 'MISSING' or 'MINOR'.\n");
@@ -260,6 +356,7 @@ static void help(char *name)
 	       "     FD handler to simulate demand paging\n"
 	       "     overheads. Ignored without -u.\n");
 	printf(" -r: Set the number of reader threads per uffd.\n");
+	printf(" -w: Enable kvm cap for memory fault exits.\n");
 	printf(" -b: specify the size of the memory region which should be\n"
 	       "     demand paged by each vCPU. e.g. 10M or 3G.\n"
 	       "     Default: 1G\n");
@@ -280,29 +377,30 @@ int main(int argc, char *argv[])
 		.partition_vcpu_memory_access = true,
 		.readers_per_uffd = 1,
 		.single_uffd = false,
+		.memfault_exits = false,
 	};
 	int opt;
 
 	guest_modes_append_default();
 
-	while ((opt = getopt(argc, argv, "ahom:u:d:b:s:v:c:r:")) != -1) {
+	while ((opt = getopt(argc, argv, "ahowm:u:d:b:s:v:c:r:")) != -1) {
 		switch (opt) {
 		case 'm':
 			guest_modes_cmdline(optarg);
 			break;
 		case 'u':
 			if (!strcmp("MISSING", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
+				uffd_mode = UFFDIO_REGISTER_MODE_MISSING;
 			else if (!strcmp("MINOR", optarg))
-				p.uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
-			TEST_ASSERT(p.uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
+				uffd_mode = UFFDIO_REGISTER_MODE_MINOR;
+			TEST_ASSERT(uffd_mode, "UFFD mode must be 'MISSING' or 'MINOR'.");
 			break;
 		case 'a':
 			p.single_uffd = true;
 			break;
 		case 'd':
-			p.uffd_delay = strtoul(optarg, NULL, 0);
-			TEST_ASSERT(p.uffd_delay >= 0, "A negative UFFD delay is not supported.");
+			uffd_delay = strtoul(optarg, NULL, 0);
+			TEST_ASSERT(uffd_delay >= 0, "A negative UFFD delay is not supported.");
 			break;
 		case 'b':
 			guest_percpu_mem_size = parse_size(optarg);
@@ -328,6 +426,9 @@ int main(int argc, char *argv[])
 				    "Invalid number of readers per uffd %d: must be >=1",
 				    p.readers_per_uffd);
 			break;
+		case 'w':
+			p.memfault_exits = true;
+			break;
 		case 'h':
 		default:
 			help(argv[0]);
@@ -335,7 +436,7 @@ int main(int argc, char *argv[])
 		}
 	}
 
-	if (p.uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
+	if (uffd_mode == UFFDIO_REGISTER_MODE_MINOR &&
 	    !backing_src_is_shared(p.src_type)) {
 		TEST_FAIL("userfaultfd MINOR mode requires shared memory; pick a different -s");
 	}
-- 
2.44.0.rc0.258.g7320e95886-goog


^ permalink raw reply related	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (13 preceding siblings ...)
  2024-02-15 23:54 ` [PATCH v7 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
@ 2024-02-16  7:36 ` Gupta, Pankaj
  2024-02-16 20:00   ` Anish Moorthy
  2024-04-10  0:19 ` Sean Christopherson
  15 siblings, 1 reply; 39+ messages in thread
From: Gupta, Pankaj @ 2024-02-16  7:36 UTC (permalink / raw
  To: Anish Moorthy, seanjc, oliver.upton, maz, kvm, kvmarm
  Cc: robert.hoo.linux, jthoughton, dmatlack, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> This series adds an option to cause stage-2 fault handlers to
> KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> the userspace mappings. Doing so allows userspace to receive stage-2
> faults directly from KVM_RUN instead of through userfaultfd, which
> suffers from serious contention issues as the number of vCPUs scales.

Thanks for your work!

So, this is an alternative approach userspace like Qemu to do post copy
live migration using KVM_MEMORY_FAULT_EXIT instead of userfaultfd which
seems slower with more vCPU's.

Maybe I am missing some things here, just curious how userspace VMM e.g 
Qemu would do memory copy with this approach once the page is available 
from remote host which was done with UFFDIO_COPY earlier?

Just trying to understand how this will work for the existing interfaces.

Best regards,
Pankaj

> 
> Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> demand_paging_test, which demonstrates the scalability improvements:
> the following data was collected using [2] on an x86 machine with 256
> cores.
> 
> vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> 1       150     340
> 2       191     477
> 4       210     809
> 8       155     1239
> 16      130     1595
> 32      108     2299
> 64      86      3482
> 128     62      4134
> 256     36      4012
> 
> The diff since the last version is small enough that I've attached a
> range-diff in the cover letter- hopefully it's useful for review.
> 
> Links
> ~~~~~
> [1] Original RFC from James Houghton:
>      https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> 
> [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
>      A quick rundown of the new flags (also detailed in later commits)
>          -a registers all of guest memory to a single uffd.
>          -r species the number of reader threads for polling the uffd.
>          -w is what actually enables the new capabilities.
>      All data was collected after applying the entire series
> 
> ---
> 
> v7
>    - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
>      [James]
>    - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
>      and split it off into its own commit [Anish]
>    - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
>      available on arm [James]
>    - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
>      arm64" commit [Anish]
>    - Drop buggy "fast GUP on read faults" patch [Thanks James!]
>    - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
>      [Sean, Oliver]
>    - Drop incorrect "Documentation:" from some shortlogs [Sean]
>    - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
>    - Style issues [Sean]
> 
> v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
>    - Rebase onto guest_memfd series [Anish/Sean]
>    - Set write fault flag properly in user_mem_abort() [Oliver]
>    - Reformat unnecessarily multi-line comments [Sean]
>    - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
>    - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
>    - Remove unnecessary rounding in user_mem_abort() annotation [David]
>    - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
>      them with the stage-2 fault annotation patches [Sean]
>    - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
>      add another boolean parameter instead [Sean]
>    - Better shortlog for the hva_to_pfn_fast() change [Anish]
> 
> v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
>    - Rename APIs (again) [Sean]
>    - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
>    - Reword hva_to_pfn_fast() change commit message [Sean]
>    - Correct style on terminal if statements [Sean]
>    - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
>    - Add read fault flag for annotated faults [Sean]
>    - read/write_guest_page() changes
>        - Move the annotations into vcpu wrapper fns [Sean]
>        - Reorder parameters [Robert]
>    - Rename kvm_populate_efault_info() to
>      kvm_handle_guest_uaccess_fault() [Sean]
>    - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
>    - Correct description of the faults which hva_to_pfn_fast() can now
>      resolve [Sean]
>    - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
>    - Magnanimously accept Sean's rewrite of the handle_error_pfn()
>      annotation [Anish]
>    - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]
> 
> v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
>    - Fix excessive indentation [Robert, Oliver]
>    - Calculate final stats when uffd handler fn returns an error [Robert]
>    - Remove redundant info from uffd_desc [Robert]
>    - Fix various commit message typos [Robert]
>    - Add comment about suppressed EEXISTs in selftest [Robert]
>    - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
>    - Fix some include/logic issues in self test [Robert]
>    - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
>    - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
>    - Drop most of the annotations from v3: see
>      https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
>    - Remove WARN on bare efaults [Sean, Oliver]
>    - Eliminate unnecessary UFFDIO_WAKE call from self test [James]
> 
> v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
>    - Rework the implementation to be based on two orthogonal
>      capabilities (KVM_CAP_MEMORY_FAULT_INFO and
>      KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
>    - Change return code of kvm_populate_efault_info [Isaku]
>    - Use kvm_populate_efault_info from arm code [Oliver]
> 
> v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/
> 
>      This was a bit of a misfire, as I sent my WIP series on the mailing
>      list but was just targeting Sean for some feedback. Oliver Upton and
>      Isaku Yamahata ended up discovering the series and giving me some
>      feedback anyways, so thanks to them :) In the end, there was enough
>      discussion to justify retroactively labeling it as v2, even with the
>      limited cc list.
> 
>    - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
>    - API changes:
>          - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
>            KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
>            requirement).
>          - Switched to memslot flag
>    - Take Oliver's simplification to the "allow fast gup for readable
>      faults" logic.
>    - Slightly redefine the return code of user_mem_abort.
>    - Fix documentation errors brought up by Marc
>    - Reword commit messages in imperative mood
> 
> v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/
> 
> Anish Moorthy (14):
>    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
>    KVM: Add function comments for __kvm_read/write_guest_page()
>    KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
>    KVM: Simplify error handling in __gfn_to_pfn_memslot()
>    KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
>      userspace
>    KVM: Add memslot flag to let userspace force an exit on missing hva
>      mappings
>    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
>      stage-2 fault handler
>    KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
>      stage-2 fault handler
>    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
>    KVM: selftests: Report per-vcpu demand paging rate from demand paging
>      test
>    KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
>      paging test
>    KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
>      signal errors via TEST_ASSERT
>    KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
>    KVM: selftests: Handle memory fault exits in demand_paging_test
> 
>   Documentation/virt/kvm/api.rst                |  39 ++-
>   arch/arm64/kvm/Kconfig                        |   1 +
>   arch/arm64/kvm/arm.c                          |   1 +
>   arch/arm64/kvm/mmu.c                          |   7 +-
>   arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
>   arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
>   arch/x86/kvm/Kconfig                          |   1 +
>   arch/x86/kvm/mmu/mmu.c                        |   8 +-
>   include/linux/kvm_host.h                      |  21 +-
>   include/uapi/linux/kvm.h                      |   5 +
>   .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
>   .../selftests/kvm/access_tracking_perf_test.c |   2 +-
>   .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
>   .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
>   .../testing/selftests/kvm/include/memstress.h |   2 +-
>   .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
>   tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
>   .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
>   .../kvm/memslot_modification_stress_test.c    |   2 +-
>   .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
>   virt/kvm/Kconfig                              |   3 +
>   virt/kvm/kvm_main.c                           |  46 ++-
>   22 files changed, 453 insertions(+), 172 deletions(-)
> 
> Range-diff against v6:
>   1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
>      @@ Metadata
>       Author: Anish Moorthy <amoorthy@google.com>
>       
>        ## Commit message ##
>      -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
>      +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
>       
>      -    The current docstring can be read as "atomic -> allowed to sleep," when
>      -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
>      -    that clearer in the docstring.
>      +    The current description can be read as "atomic -> allowed to sleep,"
>      +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
>      +    Make that clearer in the docstring.
>       
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>       
>   2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
>      @@ Metadata
>       Author: Anish Moorthy <amoorthy@google.com>
>       
>        ## Commit message ##
>      -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
>      +    KVM: Add function comments for __kvm_read/write_guest_page()
>       
>           The (gfn, data, offset, len) order of parameters is a little strange
>      -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
>      -    make things perfectly clear.
>      +    since "offset" applies to "gfn" rather than to "data". Add function
>      +    comments to make things perfectly clear.
>       
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>       
>   -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
>   3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
>   4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
>      @@ Metadata
>        ## Commit message ##
>           KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
>       
>      +    kvm_prepare_memory_fault_exit() already takes parameters describing the
>      +    RWX-ness of the relevant access but doesn't actually do anything with
>      +    them. Define and use the flags necessary to pass this information on to
>      +    userspace.
>      +
>           Suggested-by: Sean Christopherson <seanjc@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>       
>   5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
>   6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
>      @@ Commit message
>       
>           Suggested-by: James Houghton <jthoughton@google.com>
>           Suggested-by: Sean Christopherson <seanjc@google.com>
>      -    Reviewed-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>       
>        ## Documentation/virt/kvm/api.rst ##
>       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
>      -   /* for kvm_userspace_memory_region::flags */
>          #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>          #define KVM_MEM_READONLY	(1UL << 1)
>      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
>      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
>       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
>        
>        This ioctl allows the user to create, modify or delete a guest physical
>      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
>        be identical.  This allows large pages in the guest to be backed by large
>        pages in the host.
>        
>      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
>      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
>      +-The flags field supports three flags
>       +The flags field supports four flags
>      -+
>      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>      +
>      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
>      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
>      -+use it.
>      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>      - to make a new slot read-only.  In this case, writes to this memory will be
>      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
>        posted to userspace as KVM_EXIT_MMIO exits.
>      -+3.  KVM_MEM_GUEST_MEMFD
>      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
>      + incompatible with KVM_SET_USER_MEMORY_REGION.
>       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
>        
>        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
>        the memory region are automatically reflected into the guest.  For example, an
>      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
>      + was a load or a store, instruction abort if it was an instruction
>      + fetch) is injected in the guest.
>      +
>      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
>      ++exclusive.
>      ++
>      + 4.36 KVM_SET_TSS_ADDR
>      + ---------------------
>      +
>       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
>        
>        See KVM_EXIT_MEMORY_FAULT for more information.
>      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
>        
>        /* for KVM_IRQ_LINE */
>        struct kvm_irq_level {
>      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
>      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
>        #define KVM_CAP_MEMORY_ATTRIBUTES 233
>        #define KVM_CAP_GUEST_MEMFD 234
>        #define KVM_CAP_VM_TYPES 235
>       +#define KVM_CAP_EXIT_ON_MISSING 236
>        
>      - #ifdef KVM_CAP_IRQ_ROUTING
>      -
>      + struct kvm_irq_routing_irqchip {
>      + 	__u32 irqchip;
>       
>        ## virt/kvm/Kconfig ##
>       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
>      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
>       +
>        	if (mem->flags & ~valid_flags)
>        		return -EINVAL;
>      ++	else if ((mem->flags & KVM_MEM_READONLY) &&
>      ++		 (mem->flags & KVM_MEM_EXIT_ON_MISSING))
>      ++		return -EINVAL;
>        
>      + 	return 0;
>      + }
>       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>        
>        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
>        		writable = NULL;
>        	}
>        
>      -+	if (!atomic && can_exit_on_missing
>      -+	    && kvm_is_slot_exit_on_missing(slot)) {
>      ++	/* When the slot is exit-on-missing (and when we should respect that)
>      ++	 * set atomic=true to prevent GUP from faulting in the userspace
>      ++	 * mappings.
>      ++	 */
>      ++	if (!atomic && can_exit_on_missing &&
>      ++	    kvm_is_slot_exit_on_missing(slot)) {
>       +		atomic = true;
>       +		if (async) {
>       +			*async = false;
>   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
>      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
>       
>        ## arch/x86/kvm/Kconfig ##
>       @@ arch/x86/kvm/Kconfig: config KVM
>      - 	select INTERVAL_TREE
>      + 	select KVM_VFIO
>        	select HAVE_KVM_PM_NOTIFIER if PM
>        	select KVM_GENERIC_HARDWARE_ENABLING
>       +        select HAVE_KVM_EXIT_ON_MISSING
>   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
>   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
>   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
>      @@ Metadata
>       Author: Anish Moorthy <amoorthy@google.com>
>       
>        ## Commit message ##
>      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
>      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
>       
>           Prevent the stage-2 fault handler from faulting in pages when
>           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
>      -    calls to check the memslot flag.
>      -
>      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
>      -    when the stage-2 handler cannot resolve the pfn for a fault. With
>      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
>      -    faults as vCPU exits, which userspace can attempt to resolve without
>      -    terminating the guest.
>      +    call to check the memslot flag. This effects the delivery of stage-2
>      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
>      +    can attempt to resolve without terminating the guest.
>       
>           Delivering stage-2 faults to userspace in this way sidesteps the
>           significant scalabiliy issues associated with using userfaultfd for the
>      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
>       
>        ## arch/arm64/kvm/Kconfig ##
>       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
>      + 	select SCHED_INFO
>        	select GUEST_PERF_EVENTS if PERF_EVENTS
>      - 	select INTERVAL_TREE
>        	select XARRAY_MULTI
>       +        select HAVE_KVM_EXIT_ON_MISSING
>        	help
>      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
>        	if (pfn == KVM_PFN_ERR_HWPOISON) {
>        		kvm_send_hwpoison_signal(hva, vma_shift);
>        		return 0;
>      - 	}
>      --	if (is_error_noslot_pfn(pfn))
>      -+	if (is_error_noslot_pfn(pfn)) {
>      -+		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
>      -+					      write_fault, exec_fault, false);
>      - 		return -EFAULT;
>      -+	}
>      -
>      - 	if (kvm_is_device_pfn(pfn)) {
>      - 		/*
> 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
>      @@ Commit message
>           configuring the number of reader threads per UFFD as well: add the "-r"
>           flag to do so.
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
>       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
>      @@ Commit message
>           [1] Single-vCPU performance does suffer somewhat.
>           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/demand_paging_test.c ##
>       @@
> 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
>      @@ Commit message
>       
>           Demonstrate a (very basic) scheme for supporting memory fault exits.
>       
>      -    >From the vCPU threads:
>      +    From the vCPU threads:
>           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
>              with the purpose of establishing the absent mappings. Do so with
>              wake_waiters=false to avoid serializing on the userfaultfd wait queue
>      @@ Commit message
>           [A] In reality it is much likelier that the vCPU thread simply lost a
>               race to establish the mapping for the page.
>       
>      -    Acked-by: James Houghton <jthoughton@google.com>
>           Signed-off-by: Anish Moorthy <amoorthy@google.com>
>      +    Acked-by: James Houghton <jthoughton@google.com>
>       
>        ## tools/testing/selftests/kvm/demand_paging_test.c ##
>       @@
> 
> base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-02-16  7:36 ` [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Gupta, Pankaj
@ 2024-02-16 20:00   ` Anish Moorthy
  2024-02-16 23:40     ` Axel Rasmussen
  0 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-02-16 20:00 UTC (permalink / raw
  To: Gupta, Pankaj
  Cc: seanjc, oliver.upton, maz, kvm, kvmarm, robert.hoo.linux,
	jthoughton, dmatlack, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Thu, Feb 15, 2024 at 11:36 PM Gupta, Pankaj <pankaj.gupta@amd.com> wrote:
>
> On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> > This series adds an option to cause stage-2 fault handlers to
> > KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> > the userspace mappings. Doing so allows userspace to receive stage-2
> > faults directly from KVM_RUN instead of through userfaultfd, which
> > suffers from serious contention issues as the number of vCPUs scales.
>
> Thanks for your work!

:D

>
> So, this is an alternative approach userspace like Qemu to do post copy
> live migration using KVM_MEMORY_FAULT_EXIT instead of userfaultfd which
> seems slower with more vCPU's.
>
> Maybe I am missing some things here, just curious how userspace VMM e.g
> Qemu would do memory copy with this approach once the page is available
> from remote host which was done with UFFDIO_COPY earlier?

This new capability is meant to be used *alongside* userfaultfd during
post-copy: it's not a replacement. KVM_RUN can generate page faults
from outside the stage-2 fault handlers (IIUC instruction emulation is
one source), and these paths are unchanged: so it's important that
userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
to catch these guest accesses. But with the new
KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
report needing to fault in memory via KVM_MEMORY_FAULT_EXIT instead of
queuing onto the UFFD.

In the workloads I've tested, the vast majority of guest-generated
page faults (99%+) come from the stage-2 handlers. So this series
"solves" the issue of contention on the UFFD file descriptor by
(mostly) sidestepping it.

As for how userspace actually uses the new functionality: when a vCPU
thread receives a KVM_MEMORY_FAULT_EXIT for an unfetched page during
post-copy it might

(a) Fetch the page
(b) Install the page into KVM's mapping via UFFDIO_COPY (don't
necessarily need to UFFDIO_WAKE!)
(c) Call KVM_RUN to re-enter the guest and retry the access. The
stage-2 fault handler will fire again but almost certainly won't
KVM_MEMORY_FAULT_EXIT now (since the UFFDIO_COPY will have mapped the
page), so the guest can continue.

and userspace can continue using some thread(s) to

(a) Read page faults from the UFFD.
(b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
(c) goto (a)

to make sure it catches everything. The combination of these two things
adds up to more performant "uffd-based" postcopy.

I'm of course skimming over some details (e.g.: when two vCPU threads
race to fetch a page one of them should probably MADV_POPULATE_WRITE
somehow), but I hope this is helpful. My patch to the KVM demand
paging self test might also clarify things a bit [1].

Please let me know if you have more questions!

[1] https://lore.kernel.org/kvm/1f67639d-c6a2-1f36-b086-eb65fa2ab275@amd.com/T/#m28055e5d708103d126985e38e18b591d535e1e84




> Just trying to understand how this will work for the existing interfaces.
> Best regards,
> Pankaj
>
> >
> > Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> > demand_paging_test, which demonstrates the scalability improvements:
> > the following data was collected using [2] on an x86 machine with 256
> > cores.
> >
> > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > 1       150     340
> > 2       191     477
> > 4       210     809
> > 8       155     1239
> > 16      130     1595
> > 32      108     2299
> > 64      86      3482
> > 128     62      4134
> > 256     36      4012
> >
> > The diff since the last version is small enough that I've attached a
> > range-diff in the cover letter- hopefully it's useful for review.
> >
> > Links
> > ~~~~~
> > [1] Original RFC from James Houghton:
> >      https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> >
> > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> >      A quick rundown of the new flags (also detailed in later commits)
> >          -a registers all of guest memory to a single uffd.
> >          -r species the number of reader threads for polling the uffd.
> >          -w is what actually enables the new capabilities.
> >      All data was collected after applying the entire series
> >
> > ---
> >
> > v7
> >    - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
> >      [James]
> >    - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
> >      and split it off into its own commit [Anish]
> >    - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
> >      available on arm [James]
> >    - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
> >      arm64" commit [Anish]
> >    - Drop buggy "fast GUP on read faults" patch [Thanks James!]
> >    - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
> >      [Sean, Oliver]
> >    - Drop incorrect "Documentation:" from some shortlogs [Sean]
> >    - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
> >    - Style issues [Sean]
> >
> > v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
> >    - Rebase onto guest_memfd series [Anish/Sean]
> >    - Set write fault flag properly in user_mem_abort() [Oliver]
> >    - Reformat unnecessarily multi-line comments [Sean]
> >    - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
> >    - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
> >    - Remove unnecessary rounding in user_mem_abort() annotation [David]
> >    - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
> >      them with the stage-2 fault annotation patches [Sean]
> >    - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
> >      add another boolean parameter instead [Sean]
> >    - Better shortlog for the hva_to_pfn_fast() change [Anish]
> >
> > v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
> >    - Rename APIs (again) [Sean]
> >    - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
> >    - Reword hva_to_pfn_fast() change commit message [Sean]
> >    - Correct style on terminal if statements [Sean]
> >    - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
> >    - Add read fault flag for annotated faults [Sean]
> >    - read/write_guest_page() changes
> >        - Move the annotations into vcpu wrapper fns [Sean]
> >        - Reorder parameters [Robert]
> >    - Rename kvm_populate_efault_info() to
> >      kvm_handle_guest_uaccess_fault() [Sean]
> >    - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
> >    - Correct description of the faults which hva_to_pfn_fast() can now
> >      resolve [Sean]
> >    - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
> >    - Magnanimously accept Sean's rewrite of the handle_error_pfn()
> >      annotation [Anish]
> >    - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]
> >
> > v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
> >    - Fix excessive indentation [Robert, Oliver]
> >    - Calculate final stats when uffd handler fn returns an error [Robert]
> >    - Remove redundant info from uffd_desc [Robert]
> >    - Fix various commit message typos [Robert]
> >    - Add comment about suppressed EEXISTs in selftest [Robert]
> >    - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
> >    - Fix some include/logic issues in self test [Robert]
> >    - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
> >    - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
> >    - Drop most of the annotations from v3: see
> >      https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
> >    - Remove WARN on bare efaults [Sean, Oliver]
> >    - Eliminate unnecessary UFFDIO_WAKE call from self test [James]
> >
> > v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
> >    - Rework the implementation to be based on two orthogonal
> >      capabilities (KVM_CAP_MEMORY_FAULT_INFO and
> >      KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
> >    - Change return code of kvm_populate_efault_info [Isaku]
> >    - Use kvm_populate_efault_info from arm code [Oliver]
> >
> > v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/
> >
> >      This was a bit of a misfire, as I sent my WIP series on the mailing
> >      list but was just targeting Sean for some feedback. Oliver Upton and
> >      Isaku Yamahata ended up discovering the series and giving me some
> >      feedback anyways, so thanks to them :) In the end, there was enough
> >      discussion to justify retroactively labeling it as v2, even with the
> >      limited cc list.
> >
> >    - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
> >    - API changes:
> >          - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
> >            KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
> >            requirement).
> >          - Switched to memslot flag
> >    - Take Oliver's simplification to the "allow fast gup for readable
> >      faults" logic.
> >    - Slightly redefine the return code of user_mem_abort.
> >    - Fix documentation errors brought up by Marc
> >    - Reword commit messages in imperative mood
> >
> > v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/
> >
> > Anish Moorthy (14):
> >    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >    KVM: Add function comments for __kvm_read/write_guest_page()
> >    KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> >    KVM: Simplify error handling in __gfn_to_pfn_memslot()
> >    KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
> >      userspace
> >    KVM: Add memslot flag to let userspace force an exit on missing hva
> >      mappings
> >    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
> >      stage-2 fault handler
> >    KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
> >      stage-2 fault handler
> >    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> >    KVM: selftests: Report per-vcpu demand paging rate from demand paging
> >      test
> >    KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
> >      paging test
> >    KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
> >      signal errors via TEST_ASSERT
> >    KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> >    KVM: selftests: Handle memory fault exits in demand_paging_test
> >
> >   Documentation/virt/kvm/api.rst                |  39 ++-
> >   arch/arm64/kvm/Kconfig                        |   1 +
> >   arch/arm64/kvm/arm.c                          |   1 +
> >   arch/arm64/kvm/mmu.c                          |   7 +-
> >   arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
> >   arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
> >   arch/x86/kvm/Kconfig                          |   1 +
> >   arch/x86/kvm/mmu/mmu.c                        |   8 +-
> >   include/linux/kvm_host.h                      |  21 +-
> >   include/uapi/linux/kvm.h                      |   5 +
> >   .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
> >   .../selftests/kvm/access_tracking_perf_test.c |   2 +-
> >   .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
> >   .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
> >   .../testing/selftests/kvm/include/memstress.h |   2 +-
> >   .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
> >   tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
> >   .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
> >   .../kvm/memslot_modification_stress_test.c    |   2 +-
> >   .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
> >   virt/kvm/Kconfig                              |   3 +
> >   virt/kvm/kvm_main.c                           |  46 ++-
> >   22 files changed, 453 insertions(+), 172 deletions(-)
> >
> > Range-diff against v6:
> >   1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >      +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> >
> >      -    The current docstring can be read as "atomic -> allowed to sleep," when
> >      -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
> >      -    that clearer in the docstring.
> >      +    The current description can be read as "atomic -> allowed to sleep,"
> >      +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
> >      +    Make that clearer in the docstring.
> >
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> >      +    KVM: Add function comments for __kvm_read/write_guest_page()
> >
> >           The (gfn, data, offset, len) order of parameters is a little strange
> >      -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
> >      -    make things perfectly clear.
> >      +    since "offset" applies to "gfn" rather than to "data". Add function
> >      +    comments to make things perfectly clear.
> >
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> >   3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
> >   4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> >      @@ Metadata
> >        ## Commit message ##
> >           KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> >
> >      +    kvm_prepare_memory_fault_exit() already takes parameters describing the
> >      +    RWX-ness of the relevant access but doesn't actually do anything with
> >      +    them. Define and use the flags necessary to pass this information on to
> >      +    userspace.
> >      +
> >           Suggested-by: Sean Christopherson <seanjc@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >   5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
> >   6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
> >      @@ Commit message
> >
> >           Suggested-by: James Houghton <jthoughton@google.com>
> >           Suggested-by: Sean Christopherson <seanjc@google.com>
> >      -    Reviewed-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Documentation/virt/kvm/api.rst ##
> >       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
> >      -   /* for kvm_userspace_memory_region::flags */
> >          #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> >          #define KVM_MEM_READONLY     (1UL << 1)
> >      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> >      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> >       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> >
> >        This ioctl allows the user to create, modify or delete a guest physical
> >      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
> >        be identical.  This allows large pages in the guest to be backed by large
> >        pages in the host.
> >
> >      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> >      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> >      +-The flags field supports three flags
> >       +The flags field supports four flags
> >      -+
> >      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> >      +
> >      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> >        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> >      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >      -+use it.
> >      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
> >      - to make a new slot read-only.  In this case, writes to this memory will be
> >      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
> >        posted to userspace as KVM_EXIT_MMIO exits.
> >      -+3.  KVM_MEM_GUEST_MEMFD
> >      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
> >      + incompatible with KVM_SET_USER_MEMORY_REGION.
> >       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
> >
> >        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> >        the memory region are automatically reflected into the guest.  For example, an
> >      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
> >      + was a load or a store, instruction abort if it was an instruction
> >      + fetch) is injected in the guest.
> >      +
> >      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
> >      ++exclusive.
> >      ++
> >      + 4.36 KVM_SET_TSS_ADDR
> >      + ---------------------
> >      +
> >       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
> >
> >        See KVM_EXIT_MEMORY_FAULT for more information.
> >      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
> >
> >        /* for KVM_IRQ_LINE */
> >        struct kvm_irq_level {
> >      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
> >      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
> >        #define KVM_CAP_MEMORY_ATTRIBUTES 233
> >        #define KVM_CAP_GUEST_MEMFD 234
> >        #define KVM_CAP_VM_TYPES 235
> >       +#define KVM_CAP_EXIT_ON_MISSING 236
> >
> >      - #ifdef KVM_CAP_IRQ_ROUTING
> >      -
> >      + struct kvm_irq_routing_irqchip {
> >      +        __u32 irqchip;
> >
> >        ## virt/kvm/Kconfig ##
> >       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
> >      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
> >       +
> >               if (mem->flags & ~valid_flags)
> >                       return -EINVAL;
> >      ++       else if ((mem->flags & KVM_MEM_READONLY) &&
> >      ++                (mem->flags & KVM_MEM_EXIT_ON_MISSING))
> >      ++               return -EINVAL;
> >
> >      +        return 0;
> >      + }
> >       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> >
> >        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> >      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
> >                       writable = NULL;
> >               }
> >
> >      -+       if (!atomic && can_exit_on_missing
> >      -+           && kvm_is_slot_exit_on_missing(slot)) {
> >      ++       /* When the slot is exit-on-missing (and when we should respect that)
> >      ++        * set atomic=true to prevent GUP from faulting in the userspace
> >      ++        * mappings.
> >      ++        */
> >      ++       if (!atomic && can_exit_on_missing &&
> >      ++           kvm_is_slot_exit_on_missing(slot)) {
> >       +               atomic = true;
> >       +               if (async) {
> >       +                       *async = false;
> >   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
> >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> >
> >        ## arch/x86/kvm/Kconfig ##
> >       @@ arch/x86/kvm/Kconfig: config KVM
> >      -        select INTERVAL_TREE
> >      +        select KVM_VFIO
> >               select HAVE_KVM_PM_NOTIFIER if PM
> >               select KVM_GENERIC_HARDWARE_ENABLING
> >       +        select HAVE_KVM_EXIT_ON_MISSING
> >   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
> >   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
> >   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> >      @@ Metadata
> >       Author: Anish Moorthy <amoorthy@google.com>
> >
> >        ## Commit message ##
> >      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> >      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> >
> >           Prevent the stage-2 fault handler from faulting in pages when
> >           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> >      -    calls to check the memslot flag.
> >      -
> >      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> >      -    when the stage-2 handler cannot resolve the pfn for a fault. With
> >      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> >      -    faults as vCPU exits, which userspace can attempt to resolve without
> >      -    terminating the guest.
> >      +    call to check the memslot flag. This effects the delivery of stage-2
> >      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
> >      +    can attempt to resolve without terminating the guest.
> >
> >           Delivering stage-2 faults to userspace in this way sidesteps the
> >           significant scalabiliy issues associated with using userfaultfd for the
> >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> >
> >        ## arch/arm64/kvm/Kconfig ##
> >       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
> >      +        select SCHED_INFO
> >               select GUEST_PERF_EVENTS if PERF_EVENTS
> >      -        select INTERVAL_TREE
> >               select XARRAY_MULTI
> >       +        select HAVE_KVM_EXIT_ON_MISSING
> >               help
> >      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
> >               if (pfn == KVM_PFN_ERR_HWPOISON) {
> >                       kvm_send_hwpoison_signal(hva, vma_shift);
> >                       return 0;
> >      -        }
> >      --       if (is_error_noslot_pfn(pfn))
> >      -+       if (is_error_noslot_pfn(pfn)) {
> >      -+               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> >      -+                                             write_fault, exec_fault, false);
> >      -                return -EFAULT;
> >      -+       }
> >      -
> >      -        if (kvm_is_device_pfn(pfn)) {
> >      -                /*
> > 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> > 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
> >      @@ Commit message
> >           configuring the number of reader threads per UFFD as well: add the "-r"
> >           flag to do so.
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
> >       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> > 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
> >      @@ Commit message
> >           [1] Single-vCPU performance does suffer somewhat.
> >           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> >       @@
> > 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
> >      @@ Commit message
> >
> >           Demonstrate a (very basic) scheme for supporting memory fault exits.
> >
> >      -    >From the vCPU threads:
> >      +    From the vCPU threads:
> >           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> >              with the purpose of establishing the absent mappings. Do so with
> >              wake_waiters=false to avoid serializing on the userfaultfd wait queue
> >      @@ Commit message
> >           [A] In reality it is much likelier that the vCPU thread simply lost a
> >               race to establish the mapping for the page.
> >
> >      -    Acked-by: James Houghton <jthoughton@google.com>
> >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> >      +    Acked-by: James Houghton <jthoughton@google.com>
> >
> >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> >       @@
> >
> > base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
>

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-02-16 20:00   ` Anish Moorthy
@ 2024-02-16 23:40     ` Axel Rasmussen
  2024-02-21  7:35       ` Gupta, Pankaj
  0 siblings, 1 reply; 39+ messages in thread
From: Axel Rasmussen @ 2024-02-16 23:40 UTC (permalink / raw
  To: Anish Moorthy
  Cc: Gupta, Pankaj, seanjc, oliver.upton, maz, kvm, kvmarm,
	robert.hoo.linux, jthoughton, dmatlack, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Fri, Feb 16, 2024 at 12:00 PM Anish Moorthy <amoorthy@google.com> wrote:
>
> On Thu, Feb 15, 2024 at 11:36 PM Gupta, Pankaj <pankaj.gupta@amd.com> wrote:
> >
> > On 2/16/2024 12:53 AM, Anish Moorthy wrote:
> > > This series adds an option to cause stage-2 fault handlers to
> > > KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> > > the userspace mappings. Doing so allows userspace to receive stage-2
> > > faults directly from KVM_RUN instead of through userfaultfd, which
> > > suffers from serious contention issues as the number of vCPUs scales.
> >
> > Thanks for your work!
>
> :D
>
> >
> > So, this is an alternative approach userspace like Qemu to do post copy
> > live migration using KVM_MEMORY_FAULT_EXIT instead of userfaultfd which
> > seems slower with more vCPU's.
> >
> > Maybe I am missing some things here, just curious how userspace VMM e.g
> > Qemu would do memory copy with this approach once the page is available
> > from remote host which was done with UFFDIO_COPY earlier?
>
> This new capability is meant to be used *alongside* userfaultfd during
> post-copy: it's not a replacement. KVM_RUN can generate page faults
> from outside the stage-2 fault handlers (IIUC instruction emulation is
> one source), and these paths are unchanged: so it's important that
> userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
> to catch these guest accesses. But with the new
> KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
> report needing to fault in memory via KVM_MEMORY_FAULT_EXIT instead of
> queuing onto the UFFD.
>
> In the workloads I've tested, the vast majority of guest-generated
> page faults (99%+) come from the stage-2 handlers. So this series
> "solves" the issue of contention on the UFFD file descriptor by
> (mostly) sidestepping it.
>
> As for how userspace actually uses the new functionality: when a vCPU
> thread receives a KVM_MEMORY_FAULT_EXIT for an unfetched page during
> post-copy it might
>
> (a) Fetch the page
> (b) Install the page into KVM's mapping via UFFDIO_COPY (don't
> necessarily need to UFFDIO_WAKE!)
> (c) Call KVM_RUN to re-enter the guest and retry the access. The
> stage-2 fault handler will fire again but almost certainly won't
> KVM_MEMORY_FAULT_EXIT now (since the UFFDIO_COPY will have mapped the
> page), so the guest can continue.
>
> and userspace can continue using some thread(s) to
>
> (a) Read page faults from the UFFD.
> (b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
> (c) goto (a)
>
> to make sure it catches everything. The combination of these two things
> adds up to more performant "uffd-based" postcopy.
>
> I'm of course skimming over some details (e.g.: when two vCPU threads
> race to fetch a page one of them should probably MADV_POPULATE_WRITE
> somehow), but I hope this is helpful. My patch to the KVM demand
> paging self test might also clarify things a bit [1].

One other small detail is, you can equally use UFFDIO_CONTINUE,
depending on how the rest of the live migration implementation works.

Really briefly, this series should be viewed as an alternate (and more
scalable) mechanism to find out that a fault occurred. The way
userspace then *resolves* the fault (whether via UFFDIO_COPY or
UFFDIO_CONTINUE) can remain the same as before.

>
>
> Please let me know if you have more questions!
>
> [1] https://lore.kernel.org/kvm/1f67639d-c6a2-1f36-b086-eb65fa2ab275@amd.com/T/#m28055e5d708103d126985e38e18b591d535e1e84
>
>
>
>
> > Just trying to understand how this will work for the existing interfaces.
> > Best regards,
> > Pankaj
> >
> > >
> > > Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> > > demand_paging_test, which demonstrates the scalability improvements:
> > > the following data was collected using [2] on an x86 machine with 256
> > > cores.
> > >
> > > vCPUs, Average Paging Rate (w/o new caps), Average Paging Rate (w/ new caps)
> > > 1       150     340
> > > 2       191     477
> > > 4       210     809
> > > 8       155     1239
> > > 16      130     1595
> > > 32      108     2299
> > > 64      86      3482
> > > 128     62      4134
> > > 256     36      4012
> > >
> > > The diff since the last version is small enough that I've attached a
> > > range-diff in the cover letter- hopefully it's useful for review.
> > >
> > > Links
> > > ~~~~~
> > > [1] Original RFC from James Houghton:
> > >      https://lore.kernel.org/linux-mm/CADrL8HVDB3u2EOhXHCrAgJNLwHkj2Lka1B_kkNb0dNwiWiAN_Q@mail.gmail.com/
> > >
> > > [2] ./demand_paging_test -b 64M -u MINOR -s shmem -a -v <n> -r <n> [-w]
> > >      A quick rundown of the new flags (also detailed in later commits)
> > >          -a registers all of guest memory to a single uffd.
> > >          -r species the number of reader threads for polling the uffd.
> > >          -w is what actually enables the new capabilities.
> > >      All data was collected after applying the entire series
> > >
> > > ---
> > >
> > > v7
> > >    - Add comment for the upgrade-to-atomic in __gfn_to_pfn_memslot()
> > >      [James]
> > >    - Expand description for KVM_MEM_GUEST_MEMFD in kvm/api.rst [James]
> > >      and split it off into its own commit [Anish]
> > >    - Update documentation to indicate that KVM_CAP_MEMORY_FAULT_INFO is
> > >      available on arm [James]
> > >    - Expand commit message for the "enable KVM_CAP_MEMORY_FAULT_INFO on
> > >      arm64" commit [Anish]
> > >    - Drop buggy "fast GUP on read faults" patch [Thanks James!]
> > >    - Make KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING mutually exclusive
> > >      [Sean, Oliver]
> > >    - Drop incorrect "Documentation:" from some shortlogs [Sean]
> > >    - Add description for the KVM_EXIT_MEMORY_FAULT RWX patch [Sean]
> > >    - Style issues [Sean]
> > >
> > > v6: https://lore.kernel.org/kvm/20231109210325.3806151-1-amoorthy@google.com/
> > >    - Rebase onto guest_memfd series [Anish/Sean]
> > >    - Set write fault flag properly in user_mem_abort() [Oliver]
> > >    - Reformat unnecessarily multi-line comments [Sean]
> > >    - Drop the kvm_vcpu_read|write_guest_page() annotations [Sean]
> > >    - Rename *USERFAULT_ON_MISSING to *EXIT_ON_MISSING [David]
> > >    - Remove unnecessary rounding in user_mem_abort() annotation [David]
> > >    - Rewrite logs for KVM_MEM_EXIT_ON_MISSING patches and squash
> > >      them with the stage-2 fault annotation patches [Sean]
> > >    - Undo the enum parameter addition to __gfn_to_pfn_memslot(), and just
> > >      add another boolean parameter instead [Sean]
> > >    - Better shortlog for the hva_to_pfn_fast() change [Anish]
> > >
> > > v5: https://lore.kernel.org/kvm/20230908222905.1321305-1-amoorthy@google.com/
> > >    - Rename APIs (again) [Sean]
> > >    - Initialize hardware_exit_reason along w/ exit_reason on x86 [Isaku]
> > >    - Reword hva_to_pfn_fast() change commit message [Sean]
> > >    - Correct style on terminal if statements [Sean]
> > >    - Switch to kconfig to signal KVM_CAP_USERFAULT_ON_MISSING [Sean]
> > >    - Add read fault flag for annotated faults [Sean]
> > >    - read/write_guest_page() changes
> > >        - Move the annotations into vcpu wrapper fns [Sean]
> > >        - Reorder parameters [Robert]
> > >    - Rename kvm_populate_efault_info() to
> > >      kvm_handle_guest_uaccess_fault() [Sean]
> > >    - Remove unnecessary EINVAL on trying to enable memory fault info cap [Sean]
> > >    - Correct description of the faults which hva_to_pfn_fast() can now
> > >      resolve [Sean]
> > >    - Eliminate unnecessary parameter added to __kvm_faultin_pfn() [Sean]
> > >    - Magnanimously accept Sean's rewrite of the handle_error_pfn()
> > >      annotation [Anish]
> > >    - Remove vcpu null check from kvm_handle_guest_uaccess_fault [Sean]
> > >
> > > v4: https://lore.kernel.org/kvm/20230602161921.208564-1-amoorthy@google.com/T/#t
> > >    - Fix excessive indentation [Robert, Oliver]
> > >    - Calculate final stats when uffd handler fn returns an error [Robert]
> > >    - Remove redundant info from uffd_desc [Robert]
> > >    - Fix various commit message typos [Robert]
> > >    - Add comment about suppressed EEXISTs in selftest [Robert]
> > >    - Add exit_reasons_known definition for KVM_EXIT_MEMORY_FAULT [Robert]
> > >    - Fix some include/logic issues in self test [Robert]
> > >    - Rename no-slow-gup cap to KVM_CAP_NOWAIT_ON_FAULT [Oliver, Sean]
> > >    - Make KVM_CAP_MEMORY_FAULT_INFO informational-only [Oliver, Sean]
> > >    - Drop most of the annotations from v3: see
> > >      https://lore.kernel.org/kvm/20230412213510.1220557-1-amoorthy@google.com/T/#mfe28e6a5015b7cd8c5ea1c351b0ca194aeb33daf
> > >    - Remove WARN on bare efaults [Sean, Oliver]
> > >    - Eliminate unnecessary UFFDIO_WAKE call from self test [James]
> > >
> > > v3: https://lore.kernel.org/kvm/ZEBXi5tZZNxA+jRs@x1n/T/#t
> > >    - Rework the implementation to be based on two orthogonal
> > >      capabilities (KVM_CAP_MEMORY_FAULT_INFO and
> > >      KVM_CAP_NOWAIT_ON_FAULT) [Sean, Oliver]
> > >    - Change return code of kvm_populate_efault_info [Isaku]
> > >    - Use kvm_populate_efault_info from arm code [Oliver]
> > >
> > > v2: https://lore.kernel.org/kvm/20230315021738.1151386-1-amoorthy@google.com/
> > >
> > >      This was a bit of a misfire, as I sent my WIP series on the mailing
> > >      list but was just targeting Sean for some feedback. Oliver Upton and
> > >      Isaku Yamahata ended up discovering the series and giving me some
> > >      feedback anyways, so thanks to them :) In the end, there was enough
> > >      discussion to justify retroactively labeling it as v2, even with the
> > >      limited cc list.
> > >
> > >    - Introduce KVM_CAP_X86_MEMORY_FAULT_EXIT.
> > >    - API changes:
> > >          - Gate KVM_CAP_MEMORY_FAULT_NOWAIT behind
> > >            KVM_CAP_x86_MEMORY_FAULT_EXIT (on x86 only: arm has no such
> > >            requirement).
> > >          - Switched to memslot flag
> > >    - Take Oliver's simplification to the "allow fast gup for readable
> > >      faults" logic.
> > >    - Slightly redefine the return code of user_mem_abort.
> > >    - Fix documentation errors brought up by Marc
> > >    - Reword commit messages in imperative mood
> > >
> > > v1: https://lore.kernel.org/kvm/20230215011614.725983-1-amoorthy@google.com/
> > >
> > > Anish Moorthy (14):
> > >    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >    KVM: Add function comments for __kvm_read/write_guest_page()
> > >    KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> > >    KVM: Simplify error handling in __gfn_to_pfn_memslot()
> > >    KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to
> > >      userspace
> > >    KVM: Add memslot flag to let userspace force an exit on missing hva
> > >      mappings
> > >    KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from
> > >      stage-2 fault handler
> > >    KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the
> > >      stage-2 fault handler
> > >    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> > >    KVM: selftests: Report per-vcpu demand paging rate from demand paging
> > >      test
> > >    KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand
> > >      paging test
> > >    KVM: selftests: Use EPOLL in userfaultfd_util reader threads and
> > >      signal errors via TEST_ASSERT
> > >    KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > >    KVM: selftests: Handle memory fault exits in demand_paging_test
> > >
> > >   Documentation/virt/kvm/api.rst                |  39 ++-
> > >   arch/arm64/kvm/Kconfig                        |   1 +
> > >   arch/arm64/kvm/arm.c                          |   1 +
> > >   arch/arm64/kvm/mmu.c                          |   7 +-
> > >   arch/powerpc/kvm/book3s_64_mmu_hv.c           |   2 +-
> > >   arch/powerpc/kvm/book3s_64_mmu_radix.c        |   2 +-
> > >   arch/x86/kvm/Kconfig                          |   1 +
> > >   arch/x86/kvm/mmu/mmu.c                        |   8 +-
> > >   include/linux/kvm_host.h                      |  21 +-
> > >   include/uapi/linux/kvm.h                      |   5 +
> > >   .../selftests/kvm/aarch64/page_fault_test.c   |   4 +-
> > >   .../selftests/kvm/access_tracking_perf_test.c |   2 +-
> > >   .../selftests/kvm/demand_paging_test.c        | 295 ++++++++++++++----
> > >   .../selftests/kvm/dirty_log_perf_test.c       |   2 +-
> > >   .../testing/selftests/kvm/include/memstress.h |   2 +-
> > >   .../selftests/kvm/include/userfaultfd_util.h  |  17 +-
> > >   tools/testing/selftests/kvm/lib/memstress.c   |   4 +-
> > >   .../selftests/kvm/lib/userfaultfd_util.c      | 159 ++++++----
> > >   .../kvm/memslot_modification_stress_test.c    |   2 +-
> > >   .../x86_64/dirty_log_page_splitting_test.c    |   2 +-
> > >   virt/kvm/Kconfig                              |   3 +
> > >   virt/kvm/kvm_main.c                           |  46 ++-
> > >   22 files changed, 453 insertions(+), 172 deletions(-)
> > >
> > > Range-diff against v6:
> > >   1:  2089d8955538 !  1:  063d5d109f34 KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: Documentation: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >      +    KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
> > >
> > >      -    The current docstring can be read as "atomic -> allowed to sleep," when
> > >      -    in fact the intended statement is "atomic -> NOT allowed to sleep." Make
> > >      -    that clearer in the docstring.
> > >      +    The current description can be read as "atomic -> allowed to sleep,"
> > >      +    when in fact the intended statement is "atomic -> NOT allowed to sleep."
> > >      +    Make that clearer in the docstring.
> > >
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   2:  36963c6eee29 !  2:  e038fe64f44a KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: Documentation: Add docstrings for __kvm_read/write_guest_page()
> > >      +    KVM: Add function comments for __kvm_read/write_guest_page()
> > >
> > >           The (gfn, data, offset, len) order of parameters is a little strange
> > >      -    since "offset" applies to "gfn" rather than to "data". Add docstrings to
> > >      -    make things perfectly clear.
> > >      +    since "offset" applies to "gfn" rather than to "data". Add function
> > >      +    comments to make things perfectly clear.
> > >
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   -:  ------------ >  3:  812a2208da95 KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
> > >   3:  4994835c51f5 =  4:  44cec9bf6166 KVM: Simplify error handling in __gfn_to_pfn_memslot()
> > >   4:  3d51224854b1 !  5:  df09c7482fbf KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> > >      @@ Metadata
> > >        ## Commit message ##
> > >           KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace
> > >
> > >      +    kvm_prepare_memory_fault_exit() already takes parameters describing the
> > >      +    RWX-ness of the relevant access but doesn't actually do anything with
> > >      +    them. Define and use the flags necessary to pass this information on to
> > >      +    userspace.
> > >      +
> > >           Suggested-by: Sean Christopherson <seanjc@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >   5:  6bab46398020 <  -:  ------------ KVM: Try using fast GUP to resolve read faults
> > >   6:  556e7079c419 !  6:  6a6993bda462 KVM: Add memslot flag to let userspace force an exit on missing hva mappings
> > >      @@ Commit message
> > >
> > >           Suggested-by: James Houghton <jthoughton@google.com>
> > >           Suggested-by: Sean Christopherson <seanjc@google.com>
> > >      -    Reviewed-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Documentation/virt/kvm/api.rst ##
> > >       @@ Documentation/virt/kvm/api.rst: yet and must be cleared on entry.
> > >      -   /* for kvm_userspace_memory_region::flags */
> > >          #define KVM_MEM_LOG_DIRTY_PAGES      (1UL << 0)
> > >          #define KVM_MEM_READONLY     (1UL << 1)
> > >      -+  #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > >      +   #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > >       +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > >
> > >        This ioctl allows the user to create, modify or delete a guest physical
> > >      @@ Documentation/virt/kvm/api.rst: It is recommended that the lower 21 bits of gues
> > >        be identical.  This allows large pages in the guest to be backed by large
> > >        pages in the host.
> > >
> > >      --The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> > >      --KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> > >      +-The flags field supports three flags
> > >       +The flags field supports four flags
> > >      -+
> > >      -+1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> > >      +
> > >      + 1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
> > >        writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to
> > >      --use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > >      -+use it.
> > >      -+2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
> > >      - to make a new slot read-only.  In this case, writes to this memory will be
> > >      +@@ Documentation/virt/kvm/api.rst: to make a new slot read-only.  In this case, writes to this memory will be
> > >        posted to userspace as KVM_EXIT_MMIO exits.
> > >      -+3.  KVM_MEM_GUEST_MEMFD
> > >      + 3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
> > >      + incompatible with KVM_SET_USER_MEMORY_REGION.
> > >       +4.  KVM_MEM_EXIT_ON_MISSING: see KVM_CAP_EXIT_ON_MISSING for details.
> > >
> > >        When the KVM_CAP_SYNC_MMU capability is available, changes in the backing of
> > >        the memory region are automatically reflected into the guest.  For example, an
> > >      +@@ Documentation/virt/kvm/api.rst: Instead, an abort (data abort if the cause of the page-table update
> > >      + was a load or a store, instruction abort if it was an instruction
> > >      + fetch) is injected in the guest.
> > >      +
> > >      ++Note: KVM_MEM_READONLY and KVM_MEM_EXIT_ON_MISSING are currently mutually
> > >      ++exclusive.
> > >      ++
> > >      + 4.36 KVM_SET_TSS_ADDR
> > >      + ---------------------
> > >      +
> > >       @@ Documentation/virt/kvm/api.rst: error/annotated fault.
> > >
> > >        See KVM_EXIT_MEMORY_FAULT for more information.
> > >      @@ include/uapi/linux/kvm.h: struct kvm_userspace_memory_region2 {
> > >
> > >        /* for KVM_IRQ_LINE */
> > >        struct kvm_irq_level {
> > >      -@@ include/uapi/linux/kvm.h: struct kvm_ppc_resize_hpt {
> > >      +@@ include/uapi/linux/kvm.h: struct kvm_enable_cap {
> > >        #define KVM_CAP_MEMORY_ATTRIBUTES 233
> > >        #define KVM_CAP_GUEST_MEMFD 234
> > >        #define KVM_CAP_VM_TYPES 235
> > >       +#define KVM_CAP_EXIT_ON_MISSING 236
> > >
> > >      - #ifdef KVM_CAP_IRQ_ROUTING
> > >      -
> > >      + struct kvm_irq_routing_irqchip {
> > >      +        __u32 irqchip;
> > >
> > >        ## virt/kvm/Kconfig ##
> > >       @@ virt/kvm/Kconfig: config KVM_GENERIC_PRIVATE_MEM
> > >      @@ virt/kvm/kvm_main.c: static int check_memory_region_flags(struct kvm *kvm,
> > >       +
> > >               if (mem->flags & ~valid_flags)
> > >                       return -EINVAL;
> > >      ++       else if ((mem->flags & KVM_MEM_READONLY) &&
> > >      ++                (mem->flags & KVM_MEM_EXIT_ON_MISSING))
> > >      ++               return -EINVAL;
> > >
> > >      +        return 0;
> > >      + }
> > >       @@ virt/kvm/kvm_main.c: kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > >
> > >        kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > >      @@ virt/kvm/kvm_main.c: kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot
> > >                       writable = NULL;
> > >               }
> > >
> > >      -+       if (!atomic && can_exit_on_missing
> > >      -+           && kvm_is_slot_exit_on_missing(slot)) {
> > >      ++       /* When the slot is exit-on-missing (and when we should respect that)
> > >      ++        * set atomic=true to prevent GUP from faulting in the userspace
> > >      ++        * mappings.
> > >      ++        */
> > >      ++       if (!atomic && can_exit_on_missing &&
> > >      ++           kvm_is_slot_exit_on_missing(slot)) {
> > >       +               atomic = true;
> > >       +               if (async) {
> > >       +                       *async = false;
> > >   7:  28b6fe1ad5b9 !  7:  70696937be14 KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler
> > >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> > >
> > >        ## arch/x86/kvm/Kconfig ##
> > >       @@ arch/x86/kvm/Kconfig: config KVM
> > >      -        select INTERVAL_TREE
> > >      +        select KVM_VFIO
> > >               select HAVE_KVM_PM_NOTIFIER if PM
> > >               select KVM_GENERIC_HARDWARE_ENABLING
> > >       +        select HAVE_KVM_EXIT_ON_MISSING
> > >   8:  a80db5672168 <  -:  ------------ KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO
> > >   -:  ------------ >  8:  05bbf29372ed KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
> > >   9:  70c5db4f5c9e !  9:  bb22b31c8437 KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> > >      @@ Metadata
> > >       Author: Anish Moorthy <amoorthy@google.com>
> > >
> > >        ## Commit message ##
> > >      -    KVM: arm64: Enable KVM_CAP_EXIT_ON_MISSING and annotate an EFAULT from stage-2 fault-handler
> > >      +    KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING
> > >
> > >           Prevent the stage-2 fault handler from faulting in pages when
> > >           KVM_MEM_EXIT_ON_MISSING is set by allowing its  __gfn_to_pfn_memslot()
> > >      -    calls to check the memslot flag.
> > >      -
> > >      -    To actually make that behavior useful, prepare a KVM_EXIT_MEMORY_FAULT
> > >      -    when the stage-2 handler cannot resolve the pfn for a fault. With
> > >      -    KVM_MEM_EXIT_ON_MISSING enabled this effects the delivery of stage-2
> > >      -    faults as vCPU exits, which userspace can attempt to resolve without
> > >      -    terminating the guest.
> > >      +    call to check the memslot flag. This effects the delivery of stage-2
> > >      +    faults as vCPU exits (see KVM_CAP_MEMORY_FAULT_INFO), which userspace
> > >      +    can attempt to resolve without terminating the guest.
> > >
> > >           Delivering stage-2 faults to userspace in this way sidesteps the
> > >           significant scalabiliy issues associated with using userfaultfd for the
> > >      @@ Documentation/virt/kvm/api.rst: See KVM_EXIT_MEMORY_FAULT for more information.
> > >
> > >        ## arch/arm64/kvm/Kconfig ##
> > >       @@ arch/arm64/kvm/Kconfig: menuconfig KVM
> > >      +        select SCHED_INFO
> > >               select GUEST_PERF_EVENTS if PERF_EVENTS
> > >      -        select INTERVAL_TREE
> > >               select XARRAY_MULTI
> > >       +        select HAVE_KVM_EXIT_ON_MISSING
> > >               help
> > >      @@ arch/arm64/kvm/mmu.c: static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr
> > >               if (pfn == KVM_PFN_ERR_HWPOISON) {
> > >                       kvm_send_hwpoison_signal(hva, vma_shift);
> > >                       return 0;
> > >      -        }
> > >      --       if (is_error_noslot_pfn(pfn))
> > >      -+       if (is_error_noslot_pfn(pfn)) {
> > >      -+               kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > >      -+                                             write_fault, exec_fault, false);
> > >      -                return -EFAULT;
> > >      -+       }
> > >      -
> > >      -        if (kvm_is_device_pfn(pfn)) {
> > >      -                /*
> > > 10:  ab913b9b5570 = 10:  a62ee8593b84 KVM: selftests: Report per-vcpu demand paging rate from demand paging test
> > > 11:  a27ff8b097d7 ! 11:  58ddb652eac1 KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
> > >      @@ Commit message
> > >           configuring the number of reader threads per UFFD as well: add the "-r"
> > >           flag to do so.
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/aarch64/page_fault_test.c ##
> > >       @@ tools/testing/selftests/kvm/aarch64/page_fault_test.c: static void setup_uffd(struct kvm_vm *vm, struct test_params *p,
> > > 12:  ee196df32964 ! 12:  b4cfe82097e2 KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT
> > >      @@ Commit message
> > >           [1] Single-vCPU performance does suffer somewhat.
> > >           [2] ./demand_paging_test -u MINOR -s shmem -v 4 -o -r <num readers>
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> > >       @@
> > > 13:  9406cb2581e5 = 13:  f8095728fcef KVM: selftests: Add memslot_flags parameter to memstress_create_vm()
> > > 14:  dbab5917e1f6 ! 14:  a5863f1206bb KVM: selftests: Handle memory fault exits in demand_paging_test
> > >      @@ Commit message
> > >
> > >           Demonstrate a (very basic) scheme for supporting memory fault exits.
> > >
> > >      -    >From the vCPU threads:
> > >      +    From the vCPU threads:
> > >           1. Simply issue UFFDIO_COPY/CONTINUEs in response to memory fault exits,
> > >              with the purpose of establishing the absent mappings. Do so with
> > >              wake_waiters=false to avoid serializing on the userfaultfd wait queue
> > >      @@ Commit message
> > >           [A] In reality it is much likelier that the vCPU thread simply lost a
> > >               race to establish the mapping for the page.
> > >
> > >      -    Acked-by: James Houghton <jthoughton@google.com>
> > >           Signed-off-by: Anish Moorthy <amoorthy@google.com>
> > >      +    Acked-by: James Houghton <jthoughton@google.com>
> > >
> > >        ## tools/testing/selftests/kvm/demand_paging_test.c ##
> > >       @@
> > >
> > > base-commit: 687d8f4c3dea0758afd748968d91288220bbe7e3
> >

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-02-16 23:40     ` Axel Rasmussen
@ 2024-02-21  7:35       ` Gupta, Pankaj
  0 siblings, 0 replies; 39+ messages in thread
From: Gupta, Pankaj @ 2024-02-21  7:35 UTC (permalink / raw
  To: Axel Rasmussen, Anish Moorthy
  Cc: seanjc, oliver.upton, maz, kvm, kvmarm, robert.hoo.linux,
	jthoughton, dmatlack, peterx, nadav.amit, isaku.yamahata, kconsul


>>> On 2/16/2024 12:53 AM, Anish Moorthy wrote:
>>>> This series adds an option to cause stage-2 fault handlers to
>>>> KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
>>>> the userspace mappings. Doing so allows userspace to receive stage-2
>>>> faults directly from KVM_RUN instead of through userfaultfd, which
>>>> suffers from serious contention issues as the number of vCPUs scales.
>>>
>>> Thanks for your work!
>>
>> :D
>>
>>>
>>> So, this is an alternative approach userspace like Qemu to do post copy
>>> live migration using KVM_MEMORY_FAULT_EXIT instead of userfaultfd which
>>> seems slower with more vCPU's.
>>>
>>> Maybe I am missing some things here, just curious how userspace VMM e.g
>>> Qemu would do memory copy with this approach once the page is available
>>> from remote host which was done with UFFDIO_COPY earlier?
>>
>> This new capability is meant to be used *alongside* userfaultfd during
>> post-copy: it's not a replacement. KVM_RUN can generate page faults
>> from outside the stage-2 fault handlers (IIUC instruction emulation is
>> one source), and these paths are unchanged: so it's important that
>> userspace still UFFDIO_REGISTERs KVM's mapping and reads from the UFFD
>> to catch these guest accesses. But with the new
>> KVM_MEM_EXIT_ON_MISSING memslot flag set, the stage-2 handlers will
>> report needing to fault in memory via KVM_MEMORY_FAULT_EXIT instead of
>> queuing onto the UFFD.
>>
>> In the workloads I've tested, the vast majority of guest-generated
>> page faults (99%+) come from the stage-2 handlers. So this series
>> "solves" the issue of contention on the UFFD file descriptor by
>> (mostly) sidestepping it.
>>
>> As for how userspace actually uses the new functionality: when a vCPU
>> thread receives a KVM_MEMORY_FAULT_EXIT for an unfetched page during
>> post-copy it might
>>
>> (a) Fetch the page
>> (b) Install the page into KVM's mapping via UFFDIO_COPY (don't
>> necessarily need to UFFDIO_WAKE!)
>> (c) Call KVM_RUN to re-enter the guest and retry the access. The
>> stage-2 fault handler will fire again but almost certainly won't
>> KVM_MEMORY_FAULT_EXIT now (since the UFFDIO_COPY will have mapped the
>> page), so the guest can continue.
>>
>> and userspace can continue using some thread(s) to
>>
>> (a) Read page faults from the UFFD.
>> (b) Install the page using UFFDIO_COPY + UFFDIO_WAKE
>> (c) goto (a)
>>
>> to make sure it catches everything. The combination of these two things
>> adds up to more performant "uffd-based" postcopy.
>>
>> I'm of course skimming over some details (e.g.: when two vCPU threads
>> race to fetch a page one of them should probably MADV_POPULATE_WRITE
>> somehow), but I hope this is helpful. My patch to the KVM demand
>> paging self test might also clarify things a bit [1].
> 
> One other small detail is, you can equally use UFFDIO_CONTINUE,
> depending on how the rest of the live migration implementation works.
> 
> Really briefly, this series should be viewed as an alternate (and more
> scalable) mechanism to find out that a fault occurred. The way
> userspace then *resolves* the fault (whether via UFFDIO_COPY or
> UFFDIO_CONTINUE) can remain the same as before.
> 

That clarifies. Thank you!

Best regards,
Pankaj


^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-02-15 23:53 ` [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the " Anish Moorthy
@ 2024-03-04 20:00   ` Oliver Upton
  2024-03-04 20:10     ` Oliver Upton
  0 siblings, 1 reply; 39+ messages in thread
From: Oliver Upton @ 2024-03-04 20:00 UTC (permalink / raw
  To: Anish Moorthy
  Cc: seanjc, maz, kvm, kvmarm, robert.hoo.linux, jthoughton, dmatlack,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:

[...]

> +	if (is_error_noslot_pfn(pfn)) {
> +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> +					      write_fault, exec_fault, false);

Hmm... Reinterpreting the fault context into something that wants to be
arch-neutral might make this a bit difficult for userspace to
understand.

The CPU can take an instruction abort on an S1PTW due to missing write
permissions, i.e. hardware cannot write to the stage-1 descriptor for an
AF or DBM update. In this case HPFAR points to the IPA of the stage-1
descriptor that took the fault, not the target page.

It would seem this gets expressed to userspace as an intent to write and
execute on the stage-1 page tables, no?

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-04 20:00   ` Oliver Upton
@ 2024-03-04 20:10     ` Oliver Upton
  2024-03-04 20:32       ` Sean Christopherson
  0 siblings, 1 reply; 39+ messages in thread
From: Oliver Upton @ 2024-03-04 20:10 UTC (permalink / raw
  To: Anish Moorthy
  Cc: seanjc, maz, kvm, kvmarm, robert.hoo.linux, jthoughton, dmatlack,
	axelrasmussen, peterx, nadav.amit, isaku.yamahata, kconsul

On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:
> 
> [...]
> 
> > +	if (is_error_noslot_pfn(pfn)) {
> > +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > +					      write_fault, exec_fault, false);
> 
> Hmm... Reinterpreting the fault context into something that wants to be
> arch-neutral might make this a bit difficult for userspace to
> understand.
> 
> The CPU can take an instruction abort on an S1PTW due to missing write
> permissions, i.e. hardware cannot write to the stage-1 descriptor for an
> AF or DBM update. In this case HPFAR points to the IPA of the stage-1
> descriptor that took the fault, not the target page.
> 
> It would seem this gets expressed to userspace as an intent to write and
> execute on the stage-1 page tables, no?

Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
shake out as a write fault on the stage-1 descriptor.

With that said, an architecture-neutral UAPI may not be able to capture
the nuance of a fault. This UAPI will become much more load-bearing in
the future, and the loss of granularity could become an issue.

Marc had some ideas about forwarding the register state to userspace
directly, which should be the right level of information for _any_ fault
taken to userspace.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-04 20:10     ` Oliver Upton
@ 2024-03-04 20:32       ` Sean Christopherson
  2024-03-04 21:03         ` Oliver Upton
  0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2024-03-04 20:32 UTC (permalink / raw
  To: Oliver Upton
  Cc: Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> > On Thu, Feb 15, 2024 at 11:53:59PM +0000, Anish Moorthy wrote:
> > 
> > [...]
> > 
> > > +	if (is_error_noslot_pfn(pfn)) {
> > > +		kvm_prepare_memory_fault_exit(vcpu, gfn * PAGE_SIZE, PAGE_SIZE,
> > > +					      write_fault, exec_fault, false);
> > 
> > Hmm... Reinterpreting the fault context into something that wants to be
> > arch-neutral might make this a bit difficult for userspace to
> > understand.
> > 
> > The CPU can take an instruction abort on an S1PTW due to missing write
> > permissions, i.e. hardware cannot write to the stage-1 descriptor for an
> > AF or DBM update. In this case HPFAR points to the IPA of the stage-1
> > descriptor that took the fault, not the target page.
> > 
> > It would seem this gets expressed to userspace as an intent to write and
> > execute on the stage-1 page tables, no?
> 
> Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> shake out as a write fault on the stage-1 descriptor.
> 
> With that said, an architecture-neutral UAPI may not be able to capture
> the nuance of a fault. This UAPI will become much more load-bearing in
> the future, and the loss of granularity could become an issue.

What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
case scenario is that KVM may exit to userspace multiple times in order to resolve
the problem, IMO that's an acceptable cost for having "dumb", common uAPI.

The intent/contract of the exit to userspace isn't for userspace to be able to
completely understand what fault occurred, but rather for KVM to communicate what
action userspace needs to take in order for KVM to make forward progress.

> Marc had some ideas about forwarding the register state to userspace
> directly, which should be the right level of information for _any_ fault
> taken to userspace.

I don't know enough about ARM to weigh in on that side of things, but for x86
this definitely doesn't hold true.  E.g. on the x86 side, KVM intentionally sets
reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting
fault captures the "reserved bits set" information in register state.  But that's
purely an (optional) imlementation detail of KVM that should never be exposed to
userspace.

Ditto for things like access tracking on hardware without A/D bits, and shadow
paging, which again can generate fault state that is inscrutable/misleading
without context that only KVM knows (and shouldn't expose to userspace).

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-04 20:32       ` Sean Christopherson
@ 2024-03-04 21:03         ` Oliver Upton
  2024-03-04 22:49           ` Sean Christopherson
  0 siblings, 1 reply; 39+ messages in thread
From: Oliver Upton @ 2024-03-04 21:03 UTC (permalink / raw
  To: Sean Christopherson
  Cc: Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> On Mon, Mar 04, 2024, Oliver Upton wrote:
> > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:

[...]

> > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > shake out as a write fault on the stage-1 descriptor.
> > 
> > With that said, an architecture-neutral UAPI may not be able to capture
> > the nuance of a fault. This UAPI will become much more load-bearing in
> > the future, and the loss of granularity could become an issue.
> 
> What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> case scenario is that KVM may exit to userspace multiple times in order to resolve
> the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> 
> The intent/contract of the exit to userspace isn't for userspace to be able to
> completely understand what fault occurred, but rather for KVM to communicate what
> action userspace needs to take in order for KVM to make forward progress.

For one, the stage-2 page tables can describe permissions beyond RWX.
MTE tag allocation can be controlled at stage-2, which (confusingly)
desribes if the guest can insert tags in an opaque, physical space not
described by HPFAR.

There is a corresponding bit in ESR_EL2 that describes this at the time
of a fault, and R/W/X flags aren't enough to convey the right corrective
action.

> > Marc had some ideas about forwarding the register state to userspace
> > directly, which should be the right level of information for _any_ fault
> > taken to userspace.
> 
> I don't know enough about ARM to weigh in on that side of things, but for x86
> this definitely doesn't hold true.

We tend to directly model the CPU architecture wherever possible, as it
is the only way to create something intelligible. That same rationale
applies to a huge portion of KVM UAPI; it is architecture-dependent by
design.

> E.g. on the x86 side, KVM intentionally sets
> reserved bits in SPTEs for "caching" emulated MMIO accesses, and the resulting
> fault captures the "reserved bits set" information in register state.  But that's
> purely an (optional) imlementation detail of KVM that should never be exposed to
> userspace.

MMIO accesses would show up elsewhere though, right? If these magic
SPTEs were causing -EFAULT exits then something must've gone sideways.

Either way, I have no issues whatsoever if the direction for x86 is to
provide abstracted fault information.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-04 21:03         ` Oliver Upton
@ 2024-03-04 22:49           ` Sean Christopherson
  2024-03-05  1:01             ` Oliver Upton
  0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2024-03-04 22:49 UTC (permalink / raw
  To: Oliver Upton
  Cc: Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Mon, Mar 04, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 12:32:51PM -0800, Sean Christopherson wrote:
> > On Mon, Mar 04, 2024, Oliver Upton wrote:
> > > On Mon, Mar 04, 2024 at 08:00:15PM +0000, Oliver Upton wrote:
> 
> [...]
> 
> > > Duh, kvm_vcpu_trap_is_exec_fault() (not to be confused with
> > > kvm_vcpu_trap_is_iabt()) filters for S1PTW, so this *should*
> > > shake out as a write fault on the stage-1 descriptor.
> > > 
> > > With that said, an architecture-neutral UAPI may not be able to capture
> > > the nuance of a fault. This UAPI will become much more load-bearing in
> > > the future, and the loss of granularity could become an issue.
> > 
> > What is the possible fallout from loss of granularity/nuance?  E.g. if the worst
> > case scenario is that KVM may exit to userspace multiple times in order to resolve
> > the problem, IMO that's an acceptable cost for having "dumb", common uAPI.
> > 
> > The intent/contract of the exit to userspace isn't for userspace to be able to
> > completely understand what fault occurred, but rather for KVM to communicate what
> > action userspace needs to take in order for KVM to make forward progress.
> 
> For one, the stage-2 page tables can describe permissions beyond RWX.
> MTE tag allocation can be controlled at stage-2, which (confusingly)
> desribes if the guest can insert tags in an opaque, physical space not
> described by HPFAR.
> 
> There is a corresponding bit in ESR_EL2 that describes this at the time
> of a fault, and R/W/X flags aren't enough to convey the right corrective
> action.
>
> > > Marc had some ideas about forwarding the register state to userspace
> > > directly, which should be the right level of information for _any_ fault
> > > taken to userspace.
> > 
> > I don't know enough about ARM to weigh in on that side of things, but for x86
> > this definitely doesn't hold true.
> 
> We tend to directly model the CPU architecture wherever possible, as it
> is the only way to create something intelligible. That same rationale
> applies to a huge portion of KVM UAPI; it is architecture-dependent by
> design.

Heh, "by design" :-)

I'm not saying "no arch-specific code in memory_fault", all I'm saying is that
stuff that can be arch-neutral, should be arch-neutral.  And AFAIK, basic RWX
information is common across all architectures.

E.g. if KVM needs to communicate MTE information on top of basic RWX info, why
not add a flag to memory_fault.flags that communicates that MTE is enabled and
relevant info can be found in an "extended" data field?

The presense of MTE stuff shouldn't affect the fundamental access information,
e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
irrespective of whether or not MTE is in play.

The one thing we may want to squeak in before 6.8 is released is a placeholder
in memory_fault, though I don't think that's strictly necessary since the union
as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
sizeof(kvm_run.memory_fault), but that's a bit of a stretch.

> > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > "caching" emulated MMIO accesses, and the resulting fault captures the
> > "reserved bits set" information in register state.  But that's purely an
> > (optional) imlementation detail of KVM that should never be exposed to
> > userspace.
> 
> MMIO accesses would show up elsewhere though, right?

Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
you're saying/asking.

> If these magic SPTEs were causing -EFAULT exits then something must've gone
> sideways.

More or less.  This scenario can happen if the guest re-accesses a GFN that
doesn't have a memslot, but in the interim userspace made the GFN private.  It's
likely a misbehaving userspace, but that really doesn't matter.  KVM's contract
is to report that KVM exited to userspace because the guest was trying to access
GFN X as shared, but the GFN is configured as private by userspace.

My point was that dumping fault/register information straight to userspace in this
scenario, without massaging/filtering that information, is not a sane option on
x86.

> Either way, I have no issues whatsoever if the direction for x86 is to
> provide abstracted fault information.

I don't understand how ARM can get away with NOT providing a layer of abstraction.
Copying fault state verbatim to userspace will bleed KVM implementation details
into userspace, and risks breakage of KVM's ABI due to changes in hardware.
Abstracting gory hardware details from userspace is one of the main roles of the
kernel.

A concrete example of hardware throwing a wrench in things is AMD's upcoming
"encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
the page fault error code directly to userspace, then running the same VM on
different hardware generations, e.g. after live migration, would generate different
error codes.

Are we talking past each other?  I'm genuinely confused by the pushback on
capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
but its existence doesn't preclude KVM from providing more information as needed.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-04 22:49           ` Sean Christopherson
@ 2024-03-05  1:01             ` Oliver Upton
  2024-03-05 15:39               ` Sean Christopherson
  0 siblings, 1 reply; 39+ messages in thread
From: Oliver Upton @ 2024-03-05  1:01 UTC (permalink / raw
  To: Sean Christopherson
  Cc: Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:

[...]

> The presense of MTE stuff shouldn't affect the fundamental access information,

  "When FEAT_MTE is implemented, for a synchronous Data Abort on an
  instruction that directly accesses Allocation Tags, ISV is 0."

If there is no instruction syndrome, there's insufficient fault context
to determine if the guest was doing a read or a write.

> e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> irrespective of whether or not MTE is in play.

When the MMU generates such an abort, it *is not* read, write, or execute.
It is a NoTagAccess fault. There is no sane way to describe this in
terms of RWX.

> The one thing we may want to squeak in before 6.8 is released is a placeholder
> in memory_fault, though I don't think that's strictly necessary since the union
> as a whole is padded to 256 bytes.  I suppose userspace could allocate based on
> sizeof(kvm_run.memory_fault), but that's a bit of a stretch.

Strictly speaking, that isn't ABI any more, but compile-time brittleness
to header changes. IOW, old userspace could still run on new kernel b/c
it compiled against the old structure size and only knows about the
fields present at that time.

> > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > "reserved bits set" information in register state.  But that's purely an
> > > (optional) imlementation detail of KVM that should never be exposed to
> > > userspace.
> > 
> > MMIO accesses would show up elsewhere though, right?
> 
> Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
> you're saying/asking.

If "reserved" EPT violations found their way to userspace via the
"memory fault" exit structure then that'd likely be due to a KVM bug.
The only expected flows in the near term are this and CoCo crap.

> > Either way, I have no issues whatsoever if the direction for x86 is to
> > provide abstracted fault information.
> 
> I don't understand how ARM can get away with NOT providing a layer of abstraction.
> Copying fault state verbatim to userspace will bleed KVM implementation details
> into userspace,

The memslot flag already bleeds KVM implementation detail into userspace
to a degree. The event we're trying to let userspace handle is at the
intersection of a specific hardware/software state.

> Abstracting gory hardware details from userspace is one of the main roles of the
> kernel.

Where it can be accomplished without a loss (or misrepresentation) of
information, agreed. But KVM UAPI is so architecture-specific that it
seems arbitrary to draw the line here.

> A concrete example of hardware throwing a wrench in things is AMD's upcoming
> "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
> the page fault error code directly to userspace, then running the same VM on
> different hardware generations, e.g. after live migration, would generate different
> error codes.
>  
> Are we talking past each other?  I'm genuinely confused by the pushback on
> capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
> but its existence doesn't preclude KVM from providing more information as needed.

My pushback isn't exactly on RWX (even though I noted the MTE quirk
above). What I'm poking at here is the general infrastructure for
reflecting faults into userspace, which is aggressively becoming more
relevant.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the stage-2 fault handler
  2024-03-05  1:01             ` Oliver Upton
@ 2024-03-05 15:39               ` Sean Christopherson
  0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2024-03-05 15:39 UTC (permalink / raw
  To: Oliver Upton
  Cc: Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Tue, Mar 05, 2024, Oliver Upton wrote:
> On Mon, Mar 04, 2024 at 02:49:07PM -0800, Sean Christopherson wrote:
> 
> [...]
> 
> > The presense of MTE stuff shouldn't affect the fundamental access information,
> 
>   "When FEAT_MTE is implemented, for a synchronous Data Abort on an
>   instruction that directly accesses Allocation Tags, ISV is 0."
> 
> If there is no instruction syndrome, there's insufficient fault context
> to determine if the guest was doing a read or a write.
> 
> > e.g. if the guest was attempting to write, then KVM should set KVM_MEMORY_EXIT_FLAG_WRITE
> > irrespective of whether or not MTE is in play.
> 
> When the MMU generates such an abort, it *is not* read, write, or execute.
> It is a NoTagAccess fault. There is no sane way to describe this in
> terms of RWX.

RWX=0, with KVM_MEMORY_EXIT_FLAG_MTE seems like a reasonable way to communicate
that, no?

> > > > E.g. on the x86 side, KVM intentionally sets reserved bits in SPTEs for
> > > > "caching" emulated MMIO accesses, and the resulting fault captures the
> > > > "reserved bits set" information in register state.  But that's purely an
> > > > (optional) imlementation detail of KVM that should never be exposed to
> > > > userspace.
> > > 
> > > MMIO accesses would show up elsewhere though, right?
> > 
> > Yes, but I don't see how that's relevant.  Maybe I'm just misunderstanding what
> > you're saying/asking.
> 
> If "reserved" EPT violations found their way to userspace via the
> "memory fault" exit structure then that'd likely be due to a KVM bug.

Heh, sadly no.  Userspace can convert a GFN to private at any time, and the
TDX and SNP specs allow for implicit converstion "requests" from the guest, i.e.
KVM isn't allowed to kill the guest if the guest accesses a GFN with the "wrong"
private/shared attribute.

This particular case would likely be hit only if there's a userspace and/or guest
bug, but whether the setup is broken or just unique isn't KVM's call to make.

> The only expected flows in the near term are this and CoCo crap.
> 
> > > Either way, I have no issues whatsoever if the direction for x86 is to
> > > provide abstracted fault information.
> > 
> > I don't understand how ARM can get away with NOT providing a layer of abstraction.
> > Copying fault state verbatim to userspace will bleed KVM implementation details
> > into userspace,
> 
> The memslot flag already bleeds KVM implementation detail into userspace
> to a degree. The event we're trying to let userspace handle is at the
> intersection of a specific hardware/software state.

Yes, but IMO there's a huge difference between userspace knowing that KVM uses gup()
and memslots to translate gfn=>hva=>pfn, or even knowing that KVM plays games with
reserved stage-2 PTE bits, and userspace knowing exactly how KVM configures stage-2
PTEs.

Another example would be dirty logging on Intel CPUs.  The *admin* can decide
whether to use a write-protection scheme or page-modification logging, but KVM's
ABI with userspace provides a layer of abstraction (dirty ring or bitmap) so that
the userspace VMM doesn't need to do X for write-protection and Y for PML.

> > Abstracting gory hardware details from userspace is one of the main roles of the
> > kernel.
> 
> Where it can be accomplished without a loss (or misrepresentation) of
> information, agreed. But KVM UAPI is so architecture-specific that it
> seems arbitrary to draw the line here.

I don't think it's arbitrary.  KVM's existing uAPI for mapping memory into the
guest is almost entirely arch-neutral, and I want to preserve that for related
functionality unless it's literally impossible to do so.

> > A concrete example of hardware throwing a wrench in things is AMD's upcoming
> > "encrypted" flag (in the stage-2 page fault error code), which is set by SNP-capable
> > CPUs for *any* VM that supports guest-controlled encrypted memory.  If KVM reported
> > the page fault error code directly to userspace, then running the same VM on
> > different hardware generations, e.g. after live migration, would generate different
> > error codes.
> >  
> > Are we talking past each other?  I'm genuinely confused by the pushback on
> > capturing RWX information.  Yes, the RWX info may be insufficient in some cases,
> > but its existence doesn't preclude KVM from providing more information as needed.
> 
> My pushback isn't exactly on RWX (even though I noted the MTE quirk
> above). What I'm poking at here is the general infrastructure for
> reflecting faults into userspace, which is aggressively becoming more
> relevant.

But the purpose of memory_fault isn't to reflect faults into userspace, it's to
alert userspace that KVM has encountered a memory fault that requires userspace
action to resolve.

That distinction matters because there are and will be MMU features that KVM
supports, and that can generate novel faults, but such faults won't be punted to
userspace unless KVM provides a way for userspace to explicitly control the MMU
feature.

And if KVM lets userspace control a feature, then KVM needs new uAPI to expose
the controls.  Which means that we should always have an opportunity to expand
memory_fault, e.g. with new flags, to support such features.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-02-15 23:53 ` [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
@ 2024-03-08 22:07   ` Sean Christopherson
  2024-03-09  0:46     ` David Matlack
  0 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2024-03-08 22:07 UTC (permalink / raw
  To: Anish Moorthy
  Cc: oliver.upton, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Feb 15, 2024, Anish Moorthy wrote:
> diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> index 9f5d45c49e36..bf7bc21d56ac 100644
> --- a/Documentation/virt/kvm/api.rst
> +++ b/Documentation/virt/kvm/api.rst
> @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
>    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
>    #define KVM_MEM_READONLY	(1UL << 1)
>    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)

David M.,

Before this gets queued anywhere, a few questions related to the generic KVM
userfault stuff you're working on:

  1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
     should exit to userspace, even for guest_memfd?  Or are you envisioning the
     "data invalid" gfn attribute as being a superset?

     We danced very close to this topic in the PUCK call, but I don't _think_ we
     ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
     effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
     flag.

     I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
     but after re-watching parts of the PUCK recording, e.g. about decoupling
     KVM from userspace page tables, I suspect past me was wrong.

  2. What is your best guess as to when KVM userfault patches will be available,
     even if only in RFC form?

The reason I ask is because Oliver pointed out (off-list) that (a) Google is the
primary user for KVM_MEM_EXIT_ON_MISSING, possibly the _only_ user for the
forseeable future, and (b) if Google moves on to KVM userfault before ever
ingesting KVM_MEM_EXIT_ON_MISSING from upstream, then we'll have effectively
added dead code to KVM's eternal ABI.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-08 22:07   ` Sean Christopherson
@ 2024-03-09  0:46     ` David Matlack
  2024-03-11  4:45       ` Oliver Upton
  0 siblings, 1 reply; 39+ messages in thread
From: David Matlack @ 2024-03-09  0:46 UTC (permalink / raw
  To: Sean Christopherson
  Cc: Anish Moorthy, oliver.upton, maz, kvm, kvmarm, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > index 9f5d45c49e36..bf7bc21d56ac 100644
> > --- a/Documentation/virt/kvm/api.rst
> > +++ b/Documentation/virt/kvm/api.rst
> > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> >    #define KVM_MEM_READONLY	(1UL << 1)
> >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> 
> David M.,
> 
> Before this gets queued anywhere, a few questions related to the generic KVM
> userfault stuff you're working on:
> 
>   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
>      should exit to userspace, even for guest_memfd?  Or are you envisioning the
>      "data invalid" gfn attribute as being a superset?
> 
>      We danced very close to this topic in the PUCK call, but I don't _think_ we
>      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
>      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
>      flag.
> 
>      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
>      but after re-watching parts of the PUCK recording, e.g. about decoupling
>      KVM from userspace page tables, I suspect past me was wrong.

No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.

The plan is to introduce a new gfn attribute and exit to userspace based
on that. I do forsee having an on/off switch for the new attribute, but
it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.

> 
>   2. What is your best guess as to when KVM userfault patches will be available,
>      even if only in RFC form?

We're aiming for the end of April for RFC with KVM/ARM support.

> 
> The reason I ask is because Oliver pointed out (off-list) that (a) Google is the
> primary user for KVM_MEM_EXIT_ON_MISSING, possibly the _only_ user for the
> forseeable future, and (b) if Google moves on to KVM userfault before ever
> ingesting KVM_MEM_EXIT_ON_MISSING from upstream, then we'll have effectively
> added dead code to KVM's eternal ABI.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-09  0:46     ` David Matlack
@ 2024-03-11  4:45       ` Oliver Upton
  2024-03-11 16:20         ` David Matlack
  2024-03-11 16:36         ` Sean Christopherson
  0 siblings, 2 replies; 39+ messages in thread
From: Oliver Upton @ 2024-03-11  4:45 UTC (permalink / raw
  To: David Matlack
  Cc: Sean Christopherson, Anish Moorthy, maz, kvm, kvmarm,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

Hey,

Thanks Sean for bringing this up on the list, didn't have time for a lot
of upstream stuffs :)

On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > --- a/Documentation/virt/kvm/api.rst
> > > +++ b/Documentation/virt/kvm/api.rst
> > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > >    #define KVM_MEM_READONLY	(1UL << 1)
> > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > 
> > David M.,
> > 
> > Before this gets queued anywhere, a few questions related to the generic KVM
> > userfault stuff you're working on:
> > 
> >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> >      "data invalid" gfn attribute as being a superset?
> > 
> >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> >      flag.
> > 
> >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> >      KVM from userspace page tables, I suspect past me was wrong.
> 
> No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> 
> The plan is to introduce a new gfn attribute and exit to userspace based
> on that. I do forsee having an on/off switch for the new attribute, but
> it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.

With that in mind, unless someone else has a usecase for the
KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
take this bit of the series upstream. The "memory fault" UAPI should
still be useful when the KVM userfault stuff comes along.

Anish, apologies, you must have whiplash from all the bikeshedding,
nitpicking, and other fun you've been put through on this series. Thanks
for being patient.

> > 
> >   2. What is your best guess as to when KVM userfault patches will be available,
> >      even if only in RFC form?
> 
> We're aiming for the end of April for RFC with KVM/ARM support.

Just to make sure everyone is read in on what this entails -- is this
the implementation that only worries about vCPUs touching non-present
memory, leaving the question of other UAPIs that consume guest memory
(e.g. GIC/ITS table save/restore) up for further discussion?

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-11  4:45       ` Oliver Upton
@ 2024-03-11 16:20         ` David Matlack
  2024-03-11 16:36         ` Sean Christopherson
  1 sibling, 0 replies; 39+ messages in thread
From: David Matlack @ 2024-03-11 16:20 UTC (permalink / raw
  To: Oliver Upton
  Cc: Sean Christopherson, Anish Moorthy, maz, kvm, kvmarm,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
> > >
> > >   2. What is your best guess as to when KVM userfault patches will be available,
> > >      even if only in RFC form?
> >
> > We're aiming for the end of April for RFC with KVM/ARM support.
>
> Just to make sure everyone is read in on what this entails -- is this
> the implementation that only worries about vCPUs touching non-present
> memory, leaving the question of other UAPIs that consume guest memory
> (e.g. GIC/ITS table save/restore) up for further discussion?

Yes. The initial version will only support returning to userspace on
invalid vCPU accesses with KVM_EXIT_MEMORY_FAULT. Non-vCPU accesses to
invalid pages (e.g. GIC/ITS table save/restore) will trigger an error
return from __gfn_to_hva_many() (which will cause the corresponding
ioctl to fail). It will be userspace's responsibility to clear the
invalid attribute before invoking those ioctls.

For x86 we may need an blocking kernel-to-userspace notification
mechanism for code paths in the emulator, but we'd like to investigate
and discuss if there are any other cleaner alternatives before going
too far down that route.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-11  4:45       ` Oliver Upton
  2024-03-11 16:20         ` David Matlack
@ 2024-03-11 16:36         ` Sean Christopherson
  2024-03-11 17:08           ` Anish Moorthy
  1 sibling, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2024-03-11 16:36 UTC (permalink / raw
  To: Oliver Upton
  Cc: David Matlack, Anish Moorthy, maz, kvm, kvmarm, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Sun, Mar 10, 2024, Oliver Upton wrote:
> On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> > On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES	(1UL << 0)
> > > >    #define KVM_MEM_READONLY	(1UL << 1)
> > > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > > 
> > > David M.,
> > > 
> > > Before this gets queued anywhere, a few questions related to the generic KVM
> > > userfault stuff you're working on:
> > > 
> > >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> > >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> > >      "data invalid" gfn attribute as being a superset?
> > > 
> > >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> > >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> > >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> > >      flag.
> > > 
> > >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> > >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> > >      KVM from userspace page tables, I suspect past me was wrong.
> > 
> > No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> > 
> > The plan is to introduce a new gfn attribute and exit to userspace based
> > on that. I do forsee having an on/off switch for the new attribute, but
> > it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.
> 
> With that in mind, unless someone else has a usecase for the
> KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
> take this bit of the series upstream. The "memory fault" UAPI should
> still be useful when the KVM userfault stuff comes along.

+1

Though I'll go a step further and say that even if someone does have a use case,
we should still wait.  The imminent collision with David Steven's kvm_follow_pfn()
series[*] is going to be a painful rebase no matter what, and once that's out of
the way, rebasing this series onto future kernels shouldn't be crazy difficult.

In other words, _if_ it turns out there's value in KVM_MEM_EXIT_ON_MISSING even
with David M's work, the cost of waiting another cycle (or two) is relatively
small.

Oh, and I'll plan on grabbing patches 1-4 for 6.10.

[*]https://lore.kernel.org/all/20240229025759.1187910-1-stevensd@google.com

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-11 16:36         ` Sean Christopherson
@ 2024-03-11 17:08           ` Anish Moorthy
  2024-03-11 21:21             ` Oliver Upton
  0 siblings, 1 reply; 39+ messages in thread
From: Anish Moorthy @ 2024-03-11 17:08 UTC (permalink / raw
  To: Sean Christopherson
  Cc: Oliver Upton, David Matlack, maz, kvm, kvmarm, robert.hoo.linux,
	jthoughton, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Sun, Mar 10, 2024 at 9:46 PM Oliver Upton <oliver.upton@linux.dev> wrote:
>
> Hey,
>
> Thanks Sean for bringing this up on the list, didn't have time for a lot
> of upstream stuffs :)
>
> On Fri, Mar 08, 2024 at 04:46:32PM -0800, David Matlack wrote:
> > On 2024-03-08 02:07 PM, Sean Christopherson wrote:
> > > On Thu, Feb 15, 2024, Anish Moorthy wrote:
> > > > diff --git a/Documentation/virt/kvm/api.rst b/Documentation/virt/kvm/api.rst
> > > > index 9f5d45c49e36..bf7bc21d56ac 100644
> > > > --- a/Documentation/virt/kvm/api.rst
> > > > +++ b/Documentation/virt/kvm/api.rst
> > > > @@ -1353,6 +1353,7 @@ yet and must be cleared on entry.
> > > >    #define KVM_MEM_LOG_DIRTY_PAGES        (1UL << 0)
> > > >    #define KVM_MEM_READONLY       (1UL << 1)
> > > >    #define KVM_MEM_GUEST_MEMFD      (1UL << 2)
> > > > +  #define KVM_MEM_EXIT_ON_MISSING  (1UL << 3)
> > >
> > > David M.,
> > >
> > > Before this gets queued anywhere, a few questions related to the generic KVM
> > > userfault stuff you're working on:
> > >
> > >   1. Do you anticipate reusing KVM_MEM_EXIT_ON_MISSING to communicate that a vCPU
> > >      should exit to userspace, even for guest_memfd?  Or are you envisioning the
> > >      "data invalid" gfn attribute as being a superset?
> > >
> > >      We danced very close to this topic in the PUCK call, but I don't _think_ we
> > >      ever explicitly talked about whether or not KVM_MEM_EXIT_ON_MISSING would
> > >      effectively be obsoleted by a KVM_SET_MEMORY_ATTRIBUTES-based "invalid data"
> > >      flag.
> > >
> > >      I was originally thinking that KVM_MEM_EXIT_ON_MISSING would be re-used,
> > >      but after re-watching parts of the PUCK recording, e.g. about decoupling
> > >      KVM from userspace page tables, I suspect past me was wrong.
> >
> > No I don't anticipate reusing KVM_MEM_EXIT_ON_MISSING.
> >
> > The plan is to introduce a new gfn attribute and exit to userspace based
> > on that. I do forsee having an on/off switch for the new attribute, but
> > it wouldn't make sense to reuse KVM_MEM_EXIT_ON_MISSING for that.
>
> With that in mind, unless someone else has a usecase for the
> KVM_MEM_EXIT_ON_MISSING behavior my *strong* preference is that we not
> take this bit of the series upstream. The "memory fault" UAPI should
> still be useful when the KVM userfault stuff comes along.
>
> Anish, apologies, you must have whiplash from all the bikeshedding,
> nitpicking, and other fun you've been put through on this series. Thanks
> for being patient.

No worries- I got a lot of patient (and much-needed) review as well
:). And I understand not wanting to add an eternal feature when
something better is coming down the line.

On Mon, Mar 11, 2024 at 9:36 AM Sean Christopherson <seanjc@google.com> wrote:
>
> Oh, and I'll plan on grabbing patches 1-4 for 6.10.

I think patches 10/11/12 are useful changes to the selftest that make
sense to merge even with KVM_MEM_EXIT_ON_MISSING being mothballed-
they should rebase without any issues. And the annotations on the
stage-2 fault handlers seem like they should still be added, but I
suppose David can do that with his series.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings
  2024-03-11 17:08           ` Anish Moorthy
@ 2024-03-11 21:21             ` Oliver Upton
  0 siblings, 0 replies; 39+ messages in thread
From: Oliver Upton @ 2024-03-11 21:21 UTC (permalink / raw
  To: Anish Moorthy
  Cc: Sean Christopherson, David Matlack, maz, kvm, kvmarm,
	robert.hoo.linux, jthoughton, axelrasmussen, peterx, nadav.amit,
	isaku.yamahata, kconsul

On Mon, Mar 11, 2024 at 10:08:56AM -0700, Anish Moorthy wrote:
> I think patches 10/11/12 are useful changes to the selftest that make
> sense to merge even with KVM_MEM_EXIT_ON_MISSING being mothballed-
> they should rebase without any issues. And the annotations on the
> stage-2 fault handlers seem like they should still be added, but I
> suppose David can do that with his series.

Yeah, let's fold the vCPU exit portions of the UAPI into the overall KVM
userfault series. In that case there is sufficient context at the time
of the "memory fault" to generate a 'precise' fault context (this GFN
failed since it isn't marked as present).

Compare that to the current implementation, which actually annotates
_any_ __gfn_to_pfn_memslot() failures on the way out to userspace. I
haven't seen anyone saying their userspace wants to use this, and I'd
rather not take a new feature without a user, even if it is comparably
trivial.

-- 
Thanks,
Oliver

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot()
  2024-02-15 23:53 ` [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
@ 2024-04-09 22:44   ` Sean Christopherson
  0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2024-04-09 22:44 UTC (permalink / raw
  To: Anish Moorthy
  Cc: oliver.upton, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Feb 15, 2024, Anish Moorthy wrote:
> KVM_HVA_ERR_RO_BAD satisfies kvm_is_error_hva(), so there's no need to
> duplicate the "if (writable)" block. Fix this by bringing all
> kvm_is_error_hva() cases under one conditional.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  virt/kvm/kvm_main.c | 10 ++++------
>  1 file changed, 4 insertions(+), 6 deletions(-)
> 
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 7186d301d617..67ca580a18c5 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -3031,15 +3031,13 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
>  	if (hva)
>  		*hva = addr;
>  
> -	if (addr == KVM_HVA_ERR_RO_BAD) {
> -		if (writable)
> -			*writable = false;
> -		return KVM_PFN_ERR_RO_FAULT;
> -	}
> -
>  	if (kvm_is_error_hva(addr)) {
>  		if (writable)
>  			*writable = false;
> +
> +		if (addr == KVM_HVA_ERR_RO_BAD)
> +			return KVM_PFN_ERR_RO_FAULT;
> +
>  		return KVM_PFN_NOSLOT;

This would be a good use of a ternary operator, e.g. to make it super obvious
that "if (kvm_is_error_hva(addr))" is terminal in all cases.  I'll fixup when
applying.

>  	}
>  
> -- 
> 2.44.0.rc0.258.g7320e95886-goog
> 

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag
  2024-02-15 23:53 ` [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag Anish Moorthy
@ 2024-04-09 22:47   ` Sean Christopherson
  0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2024-04-09 22:47 UTC (permalink / raw
  To: Anish Moorthy
  Cc: oliver.upton, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Feb 15, 2024, Anish Moorthy wrote:
>  This ioctl allows the user to create, modify or delete a guest physical
>  memory slot.  Bits 0-15 of "slot" specify the slot id and this value
> @@ -1382,12 +1383,16 @@ It is recommended that the lower 21 bits of guest_phys_addr and userspace_addr
>  be identical.  This allows large pages in the guest to be backed by large
>  pages in the host.
>  
> -The flags field supports two flags: KVM_MEM_LOG_DIRTY_PAGES and
> -KVM_MEM_READONLY.  The former can be set to instruct KVM to keep track of
> +The flags field supports three flags
> +
> +1.  KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of
>  writes to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl to know how to

This formatting is funky.  I suspect you are trying to avoid touching lines that
otherwise wouldn't be modified, but the end result is hard to read.  I would also
opportunistically clean up the wording in general, e.g.

1. KVM_MEM_LOG_DIRTY_PAGES: can be set to instruct KVM to keep track of writes
   to memory within the slot.  See KVM_GET_DIRTY_LOG ioctl for details.

> -use it.  The latter can be set, if KVM_CAP_READONLY_MEM capability allows it,
> +use it.
> +2.  KVM_MEM_READONLY: can be set, if KVM_CAP_READONLY_MEM capability allows it,
>  to make a new slot read-only.  In this case, writes to this memory will be
>  posted to userspace as KVM_EXIT_MMIO exits.
> +3.  KVM_MEM_GUEST_MEMFD: see KVM_SET_USER_MEMORY_REGION2. This flag is
> +incompatible with KVM_SET_USER_MEMORY_REGION.

This is now stale, as KVM_MEM_GUEST_MEMFD is also incompatible with READONLY.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
  2024-02-15 23:54 ` [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
@ 2024-04-09 22:49   ` Sean Christopherson
  0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2024-04-09 22:49 UTC (permalink / raw
  To: Anish Moorthy
  Cc: oliver.upton, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Feb 15, 2024, Anish Moorthy wrote:
> Using the overall demand paging rate to measure performance can be
> slightly misleading when vCPU accesses are not overlapped. Adding more
> vCPUs will (usually) increase the overall demand paging rate even
> if performance remains constant or even degrades on a per-vcpu basis. As
> such, it makes sense to report both the total and per-vcpu paging rates.
> 
> Signed-off-by: Anish Moorthy <amoorthy@google.com>
> ---
>  tools/testing/selftests/kvm/demand_paging_test.c | 15 +++++++++++----
>  1 file changed, 11 insertions(+), 4 deletions(-)
> 
> diff --git a/tools/testing/selftests/kvm/demand_paging_test.c b/tools/testing/selftests/kvm/demand_paging_test.c
> index 09c116a82a84..6dc823fa933a 100644
> --- a/tools/testing/selftests/kvm/demand_paging_test.c
> +++ b/tools/testing/selftests/kvm/demand_paging_test.c
> @@ -135,6 +135,7 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  	struct timespec ts_diff;
>  	struct kvm_vm *vm;
>  	int i;
> +	double vcpu_paging_rate;
>  
>  	vm = memstress_create_vm(mode, nr_vcpus, guest_percpu_mem_size, 1,
>  				 p->src_type, p->partition_vcpu_memory_access);
> @@ -191,11 +192,17 @@ static void run_test(enum vm_guest_mode mode, void *arg)
>  			uffd_stop_demand_paging(uffd_descs[i]);
>  	}
>  
> -	pr_info("Total guest execution time: %ld.%.9lds\n",
> +	pr_info("Total guest execution time:\t%ld.%.9lds\n",
>  		ts_diff.tv_sec, ts_diff.tv_nsec);
> -	pr_info("Overall demand paging rate: %f pgs/sec\n",
> -		memstress_args.vcpu_args[0].pages * nr_vcpus /
> -		((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC));
> +
> +	vcpu_paging_rate =
> +		memstress_args.vcpu_args[0].pages
> +		/ ((double)ts_diff.tv_sec
> +			+ (double)ts_diff.tv_nsec / NSEC_PER_SEC);

*sigh*

For the umpteenth time, please follow kernel coding style.  Either

	vcpu_paging_rate = memstress_args.vcpu_args[0].pages /
			   ((double)ts_diff.tv_sec + (double)ts_diff.tv_nsec / NSEC_PER_SEC);

or

	vcpu_paging_rate = memstress_args.vcpu_args[0].pages /
			   ((double)ts_diff.tv_sec +
			    (double)ts_diff.tv_nsec / NSEC_PER_SEC);

I don't have a strong preference, so I'll go with the first one when applying.

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
  2024-02-15 23:54 ` [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
@ 2024-04-09 22:58   ` Sean Christopherson
  0 siblings, 0 replies; 39+ messages in thread
From: Sean Christopherson @ 2024-04-09 22:58 UTC (permalink / raw
  To: Anish Moorthy
  Cc: oliver.upton, maz, kvm, kvmarm, robert.hoo.linux, jthoughton,
	dmatlack, axelrasmussen, peterx, nadav.amit, isaku.yamahata,
	kconsul

On Thu, Feb 15, 2024, Anish Moorthy wrote:
> @@ -167,19 +187,30 @@ struct uffd_desc *uffd_setup_demand_paging(int uffd_mode, useconds_t delay,
>  void uffd_stop_demand_paging(struct uffd_desc *uffd)
>  {
>  	char c = 0;
> -	int ret;
> +	int i, ret;
>  
> -	ret = write(uffd->pipefds[1], &c, 1);
> -	TEST_ASSERT(ret == 1, "Unable to write to pipefd");
> +	for (i = 0; i < uffd->num_readers; ++i) {
> +		ret = write(uffd->pipefds[i], &c, 1);
> +		TEST_ASSERT(
> +			ret == 1, "Unable to write to pipefd %i for uffd_desc %p", i, uffd);

More coding style oddities.

And storing the return code in "ret" is unnecessary, and arguably makes it more
difficult to debug failures by not capturing the failing command in the assert
message.

> +	}
>  
> -	ret = pthread_join(uffd->thread, NULL);
> -	TEST_ASSERT(ret == 0, "Pthread_join failed.");
> +	for (i = 0; i < uffd->num_readers; ++i) {
> +		ret = pthread_join(uffd->readers[i], NULL);
> +		TEST_ASSERT(
> +			ret == 0, "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);

Preferred kernel style is "!ret", not "ret == 0".

Putting it all together:

	int i;

	for (i = 0; i < uffd->num_readers; ++i)
		TEST_ASSERT(write(uffd->pipefds[i], &c, 1) == 1,
			    "Unable to write to pipefd %i for uffd_desc %p", i, uffd);

	for (i = 0; i < uffd->num_readers; ++i)
		TEST_ASSERT(!pthread_join(uffd->readers[i], NULL),
			    "Pthread_join failed on reader %i for uffd_desc %p", i, uffd);

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
                   ` (14 preceding siblings ...)
  2024-02-16  7:36 ` [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Gupta, Pankaj
@ 2024-04-10  0:19 ` Sean Christopherson
  2024-05-07 17:38   ` Anish Moorthy
  15 siblings, 1 reply; 39+ messages in thread
From: Sean Christopherson @ 2024-04-10  0:19 UTC (permalink / raw
  To: Sean Christopherson, oliver.upton, maz, kvm, kvmarm,
	Anish Moorthy
  Cc: robert.hoo.linux, jthoughton, dmatlack, axelrasmussen, peterx,
	nadav.amit, isaku.yamahata, kconsul

On Thu, 15 Feb 2024 23:53:51 +0000, Anish Moorthy wrote:
> This series adds an option to cause stage-2 fault handlers to
> KVM_MEMORY_FAULT_EXIT when they would otherwise be required to fault in
> the userspace mappings. Doing so allows userspace to receive stage-2
> faults directly from KVM_RUN instead of through userfaultfd, which
> suffers from serious contention issues as the number of vCPUs scales.
> 
> Support for the new option (KVM_CAP_EXIT_ON_MISSING) is added to the
> demand_paging_test, which demonstrates the scalability improvements:
> the following data was collected using [2] on an x86 machine with 256
> cores.
> 
> [...]

Applied 1,2, and 4 to kvm-x86 generic, and 10-12 to kvm-x86 selftests.

I skipped all KVM_CAP_EXIT_ON_MISSING as per our decision to hold off until we
see the KVM userfault stuff.  I skipped the docs patch because it would require
more massaging than I wanted to do when applying.  And lastly, I skipped the
"Add memslot_flags parameter to memstress_create_vm()" patch because it would be
dead code without the exit-on-missing usage.

Please take a look at the selftests commits in particular, as I did a decent
amount of massaging when applying.

Thanks!

[01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter
        https://github.com/kvm-x86/linux/commit/ed2f049fc144
[02/14] KVM: Add function comments for __kvm_read/write_guest_page()
        https://github.com/kvm-x86/linux/commit/a3bd2f7ead6d
...

[04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot()
        https://github.com/kvm-x86/linux/commit/f588557ac4ac

...

[10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test
        https://github.com/kvm-x86/linux/commit/2ca76c12c48b
[11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in demand paging test
        https://github.com/kvm-x86/linux/commit/df4ec5aada9d
[12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads
        https://github.com/kvm-x86/linux/commit/0cba6442e9e2

--
https://github.com/kvm-x86/linux/tree/next

^ permalink raw reply	[flat|nested] 39+ messages in thread

* Re: [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults
  2024-04-10  0:19 ` Sean Christopherson
@ 2024-05-07 17:38   ` Anish Moorthy
  0 siblings, 0 replies; 39+ messages in thread
From: Anish Moorthy @ 2024-05-07 17:38 UTC (permalink / raw
  To: Sean Christopherson; +Cc: kvm, kvmarm

On Tue, Apr 9, 2024 at 5:21 PM Sean Christopherson <seanjc@google.com> wrote:
>
> I skipped all KVM_CAP_EXIT_ON_MISSING as per our decision to hold off until we
> see the KVM userfault stuff.  I skipped the docs patch because it would require
> more massaging than I wanted to do when applying.  And lastly, I skipped the
> "Add memslot_flags parameter to memstress_create_vm()" patch because it would be
> dead code without the exit-on-missing usage.
>
> Please take a look at the selftests commits in particular, as I did a decent
> amount of massaging when applying.

Thanks for cleaning the commits, and for all the help along the way. I
just got around to checking the selftest commits, and they look good
to me

^ permalink raw reply	[flat|nested] 39+ messages in thread

end of thread, other threads:[~2024-05-07 17:38 UTC | newest]

Thread overview: 39+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2024-02-15 23:53 [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Anish Moorthy
2024-02-15 23:53 ` [PATCH v7 01/14] KVM: Clarify meaning of hva_to_pfn()'s 'atomic' parameter Anish Moorthy
2024-02-15 23:53 ` [PATCH v7 02/14] KVM: Add function comments for __kvm_read/write_guest_page() Anish Moorthy
2024-02-15 23:53 ` [PATCH v7 03/14] KVM: Documentation: Make note of the KVM_MEM_GUEST_MEMFD memslot flag Anish Moorthy
2024-04-09 22:47   ` Sean Christopherson
2024-02-15 23:53 ` [PATCH v7 04/14] KVM: Simplify error handling in __gfn_to_pfn_memslot() Anish Moorthy
2024-04-09 22:44   ` Sean Christopherson
2024-02-15 23:53 ` [PATCH v7 05/14] KVM: Define and communicate KVM_EXIT_MEMORY_FAULT RWX flags to userspace Anish Moorthy
2024-02-15 23:53 ` [PATCH v7 06/14] KVM: Add memslot flag to let userspace force an exit on missing hva mappings Anish Moorthy
2024-03-08 22:07   ` Sean Christopherson
2024-03-09  0:46     ` David Matlack
2024-03-11  4:45       ` Oliver Upton
2024-03-11 16:20         ` David Matlack
2024-03-11 16:36         ` Sean Christopherson
2024-03-11 17:08           ` Anish Moorthy
2024-03-11 21:21             ` Oliver Upton
2024-02-15 23:53 ` [PATCH v7 07/14] KVM: x86: Enable KVM_CAP_EXIT_ON_MISSING and annotate EFAULTs from stage-2 fault handler Anish Moorthy
2024-02-15 23:53 ` [PATCH v7 08/14] KVM: arm64: Enable KVM_CAP_MEMORY_FAULT_INFO and annotate fault in the " Anish Moorthy
2024-03-04 20:00   ` Oliver Upton
2024-03-04 20:10     ` Oliver Upton
2024-03-04 20:32       ` Sean Christopherson
2024-03-04 21:03         ` Oliver Upton
2024-03-04 22:49           ` Sean Christopherson
2024-03-05  1:01             ` Oliver Upton
2024-03-05 15:39               ` Sean Christopherson
2024-02-15 23:54 ` [PATCH v7 09/14] KVM: arm64: Implement and advertise KVM_CAP_EXIT_ON_MISSING Anish Moorthy
2024-02-15 23:54 ` [PATCH v7 10/14] KVM: selftests: Report per-vcpu demand paging rate from demand paging test Anish Moorthy
2024-04-09 22:49   ` Sean Christopherson
2024-02-15 23:54 ` [PATCH v7 11/14] KVM: selftests: Allow many vCPUs and reader threads per UFFD in " Anish Moorthy
2024-04-09 22:58   ` Sean Christopherson
2024-02-15 23:54 ` [PATCH v7 12/14] KVM: selftests: Use EPOLL in userfaultfd_util reader threads and signal errors via TEST_ASSERT Anish Moorthy
2024-02-15 23:54 ` [PATCH v7 13/14] KVM: selftests: Add memslot_flags parameter to memstress_create_vm() Anish Moorthy
2024-02-15 23:54 ` [PATCH v7 14/14] KVM: selftests: Handle memory fault exits in demand_paging_test Anish Moorthy
2024-02-16  7:36 ` [PATCH v7 00/14] Improve KVM + userfaultfd performance via KVM_EXIT_MEMORY_FAULTs on stage-2 faults Gupta, Pankaj
2024-02-16 20:00   ` Anish Moorthy
2024-02-16 23:40     ` Axel Rasmussen
2024-02-21  7:35       ` Gupta, Pankaj
2024-04-10  0:19 ` Sean Christopherson
2024-05-07 17:38   ` Anish Moorthy

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).