Linux Confidential Computing Development
From: Vishal Annapurve <vannapurve@google.com>
To: x86@kernel.org, linux-kernel@vger.kernel.org
Cc: pbonzini@redhat.com, rientjes@google.com, bgardon@google.com,
	seanjc@google.com, erdemaktas@google.com, ackerleytng@google.com,
	jxgao@google.com, sagis@google.com, oupton@google.com,
	peterx@redhat.com, vkuznets@redhat.com, dmatlack@google.com,
	pgonda@google.com, michael.roth@amd.com, kirill@shutemov.name,
	thomas.lendacky@amd.com, dave.hansen@linux.intel.com,
	linux-coco@lists.linux.dev, chao.p.peng@linux.intel.com,
	isaku.yamahata@gmail.com, andrew.jones@linux.dev, corbet@lwn.net,
	hch@lst.de, m.szyprowski@samsung.com, bp@suse.de,
	rostedt@goodmis.org, iommu@lists.linux.dev,
	Vishal Annapurve <vannapurve@google.com>
Subject: [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity
Date: Fri, 12 Jan 2024 05:52:46 +0000
Message-ID: <20240112055251.36101-1-vannapurve@google.com>

The goal of this series is to align memory conversion requests from CVMs
to huge page sizes, allowing better host-side management of guest memory
and optimized page table walks.

This patch series is partially tested and needs more work; I am seeking
feedback from the wider community before making further progress.

Background
=====================
Confidential VMs (CVMs) support two types of guest memory ranges:
1) Private memory: intended to be consumed/modified only by the CVM.
2) Shared memory: visible to both guest and host components, used for
untrusted IO.

Guest memfd [1] support is set to be merged upstream to handle guest private
memory isolation from host userspace. The guest memfd approach allows the
following setup:
* Private memory backed by a guest memfd file that is not accessible from
  host userspace.
* Shared memory backed by tmpfs/hugetlbfs files that are accessible from
  host userspace.

The userspace VMM needs to register two backing stores for all guest
memory ranges:
* HVAs for shared memory
* Guest memfd ranges for private memory

KVM keeps track of shared/private guest memory ranges, which can be updated
at runtime using IOCTLs. This allows KVM to back guest memory using either
HVAs (shared) or guest memfd file offsets (private), based on the attributes
of the guest memory ranges.
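
As an illustration, a minimal userspace sketch of this dual registration
could look like the following (struct/ioctl names follow the guest memfd
series in [1]; vm_fd, tmpfs_fd, GPA_BASE and MEM_SIZE are placeholders and
error handling is omitted):

  #include <stdint.h>
  #include <sys/ioctl.h>
  #include <sys/mman.h>
  #include <linux/kvm.h>

  /*
   * Rough sketch (not part of this series): one memslot carrying both
   * backings -- an HVA for shared accesses and a guest memfd for
   * private accesses.
   */
  static int register_dual_backed_slot(int vm_fd, int tmpfs_fd)
  {
          struct kvm_create_guest_memfd gmem = { .size = MEM_SIZE };
          int gmem_fd = ioctl(vm_fd, KVM_CREATE_GUEST_MEMFD, &gmem);
          void *shared_hva = mmap(NULL, MEM_SIZE, PROT_READ | PROT_WRITE,
                                  MAP_SHARED, tmpfs_fd, 0);
          struct kvm_userspace_memory_region2 region = {
                  .slot               = 0,
                  .flags              = KVM_MEM_GUEST_MEMFD,
                  .guest_phys_addr    = GPA_BASE,
                  .memory_size        = MEM_SIZE,
                  .userspace_addr     = (uint64_t)shared_hva, /* shared */
                  .guest_memfd        = gmem_fd,              /* private */
                  .guest_memfd_offset = 0,
          };

          return ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION2, &region);
  }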

In this setup there is a possibility of "double allocation", i.e. scenarios
where both the shared and private backing stores mapped to the same guest
memory range have memory allocated.

The guest issues a hypercall to convert the memory type, which KVM forwards
to host userspace.
The userspace VMM is supposed to handle the conversion as follows:
1) Private to shared conversion:
  * Update guest memory attributes for the range to be shared using KVM
    supported IOCTLs.
    - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
      to the guest memory being converted.
  * Unback the guest memfd range.
2) Shared to private conversion:
  * Update guest memory attributes for the range to be private using KVM
    supported IOCTLs.
    - While handling this IOCTL, KVM will unmap EPT/NPT entries corresponding
      to the guest memory being converted.
  * Unback the shared memory file.

Note that unbacking needs to be done for both kinds of conversions in order to
avoid double allocation.
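
For reference, a rough userspace sketch of the private-to-shared direction
is shown below (handle_private_to_shared() is a made-up helper, not code
from this series; the attribute ioctl and guest memfd hole punching follow
[1]):

  #define _GNU_SOURCE
  #include <stdint.h>
  #include <fcntl.h>
  #include <sys/ioctl.h>
  #include <linux/kvm.h>

  /*
   * Illustrative only: flip a guest range to shared and unback the
   * private (guest memfd) side so that only one copy stays allocated.
   * The shared-to-private direction is symmetric, punching the hole in
   * the shared (tmpfs/hugetlbfs) file instead.
   */
  static int handle_private_to_shared(int vm_fd, int gmem_fd, uint64_t gpa,
                                      uint64_t size, uint64_t gmem_offset)
  {
          struct kvm_memory_attributes attrs = {
                  .address    = gpa,
                  .size       = size,
                  .attributes = 0, /* clear KVM_MEMORY_ATTRIBUTE_PRIVATE */
          };

          /* KVM unmaps EPT/NPT entries for the range while handling this. */
          if (ioctl(vm_fd, KVM_SET_MEMORY_ATTRIBUTES, &attrs) < 0)
                  return -1;

          /* Unback the guest memfd range to avoid double allocation. */
          return fallocate(gmem_fd,
                           FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                           gmem_offset, size);
  }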

Problem
=====================
CVMs can convert memory between these two types at 4K granularity.
Conversions done at 4K granularity cause issues when using guest memfd
support with hugetlb/hugepage backed guest private memory:
1) hugetlbfs doesn't allow freeing sub-hugepage ranges when punching holes,
causing all private-to-shared memory conversions to result in double
allocation.
2) Even if a new fs that allows splitting hugepages is implemented for
guest memfd, punching holes at 4K will cause:
   - Loss of the vmemmap optimization [2].
   - More memory consumed by EPT/NPT entries and extra page table walks
     for guest side accesses.
   - Shared memory mappings consuming more host page table entries and
     extra page table walks for host side accesses.
   - A higher number of conversions, with the additional overhead of VM
     exits serviced by host userspace.

Memory conversion scenarios in the guest that are of major concern:
- SWIOTLB area conversion early during boot.
   * dma_map_* API invocations for CVMs result in using bounce buffers
     from the SWIOTLB region, which is already marked as shared.
- Device drivers allocating memory at runtime using dma_alloc_* APIs,
  which bypass SWIOTLB.
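
As an illustration of the second case, an ordinary runtime allocation like
the sketch below ends up marking its backing pages shared (decrypted) in a
CVM, i.e. a conversion at whatever granularity the allocation happens to
have (alloc_device_ring() is a made-up example, not code from this series):

  #include <linux/dma-mapping.h>
  #include <linux/sizes.h>

  /*
   * Illustrative driver-style snippet: a runtime DMA allocation that does
   * not go through the SWIOTLB bounce buffers.  In a CVM, the direct DMA
   * path marks these pages shared with the host, i.e. a private->shared
   * conversion at the allocation's granularity.
   */
  static void *alloc_device_ring(struct device *dev, dma_addr_t *dma_handle)
  {
          return dma_alloc_coherent(dev, SZ_4K, dma_handle, GFP_KERNEL);
  }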

Proposal
=====================
To counter the above issues, this series proposes the following:
1) Use boot-time allocated SWIOTLB pools for all DMA memory allocated
using dma_alloc_* APIs.
2) Increase the memory allocated at boot for SWIOTLB from 6% to 8% for CVMs.
3) Enable dynamic SWIOTLB [4] by default for CVMs so that SWIOTLB can be
scaled up as needed.
4) Ensure the SWIOTLB pool is 2M aligned so that all of its conversions
happen at 2M granularity, once during boot.
5) Add a check to ensure all conversions happen at 2M granularity.

** This series leaves out some of the conversion sites that might not
be 2M aligned, but these should be easy to fix once the approach is
finalized. **
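
For proposal item 5, the check could be as simple as the sketch below on
the x86 page-state-change path (an illustration of the idea rather than
the actual patch; the exact call site and whether to warn or fail are open
for discussion):

  /*
   * Illustrative sketch only: warn when a guest memory conversion request
   * is not 2M aligned.  Something along these lines could run in
   * __set_memory_enc_dec() before the page state is flipped.
   */
  static void cvm_check_conversion_alignment(unsigned long vaddr, int numpages)
  {
          unsigned long size = (unsigned long)numpages << PAGE_SHIFT;

          if (cc_platform_has(CC_ATTR_MEM_ENCRYPT) &&
              (!IS_ALIGNED(vaddr, PMD_SIZE) || !IS_ALIGNED(size, PMD_SIZE)))
                  WARN_ONCE(1, "conversion not 2M aligned: %#lx + %#lx\n",
                            vaddr, size);
  }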

1G alignment for conversion:
* Using 1G alignment may cause over-allocation of SWIOTLB buffers, but this
  might be acceptable for CVMs depending on further considerations.
* It might be challenging to use 1G aligned conversions in OVMF; 2M
  alignment should be achievable with OVMF changes [3].

Alternatives could be:
1) Separate hugepage-aligned DMA pools set up by individual device drivers
for CVMs.

[1] https://lore.kernel.org/linux-mips/20231105163040.14904-1-pbonzini@redhat.com/
[2] https://www.kernel.org/doc/html/next/mm/vmemmap_dedup.html
[3] https://github.com/tianocore/edk2/pull/3784
[4] https://lore.kernel.org/lkml/20230908080031.GA7848@lst.de/T/

Vishal Annapurve (5):
  swiotlb: Support allocating DMA memory from SWIOTLB
  swiotlb: Allow setting up default alignment of SWIOTLB region
  x86: CVMs: Enable dynamic swiotlb by default for CVMs
  x86: CVMs: Allow allocating all DMA memory from SWIOTLB
  x86: CVMs: Ensure that memory conversions happen at 2M alignment

 arch/x86/Kconfig             |  2 ++
 arch/x86/kernel/pci-dma.c    |  2 +-
 arch/x86/mm/mem_encrypt.c    |  8 ++++++--
 arch/x86/mm/pat/set_memory.c |  6 ++++--
 include/linux/swiotlb.h      | 22 ++++++----------------
 kernel/dma/direct.c          |  4 ++--
 kernel/dma/swiotlb.c         | 17 ++++++++++++-----
 7 files changed, 33 insertions(+), 28 deletions(-)

-- 
2.43.0.275.g3460e3d667-goog


Thread overview: 28+ messages
2024-01-12  5:52 Vishal Annapurve [this message]
2024-01-12  5:52 ` [RFC V1 1/5] swiotlb: Support allocating DMA memory from SWIOTLB Vishal Annapurve
2024-02-14 14:49   ` Kirill A. Shutemov
2024-02-15  3:33     ` Vishal Annapurve
2024-02-15  9:44       ` Alexander Graf
2024-02-15 20:26         ` Michael Kelley
2024-02-24 17:07           ` Vishal Annapurve
2024-02-24 22:02             ` Michael Kelley
2024-03-05 17:19         ` Vishal Annapurve
2024-01-12  5:52 ` [RFC V1 2/5] swiotlb: Allow setting up default alignment of SWIOTLB region Vishal Annapurve
2024-01-12  5:52 ` [RFC V1 3/5] x86: CVMs: Enable dynamic swiotlb by default for CVMs Vishal Annapurve
2024-02-01 12:20   ` Jeremi Piotrowski
2024-02-02  4:40     ` Vishal Annapurve
2024-01-12  5:52 ` [RFC V1 4/5] x86: CVMs: Allow allocating all DMA memory from SWIOTLB Vishal Annapurve
2024-01-31 16:17   ` Dave Hansen
2024-02-01  3:41     ` Vishal Annapurve
2024-01-12  5:52 ` [RFC V1 5/5] x86: CVMs: Ensure that memory conversions happen at 2M alignment Vishal Annapurve
2024-01-31 16:33   ` Dave Hansen
2024-02-01  3:46     ` Vishal Annapurve
2024-02-01 12:02       ` Jeremi Piotrowski
2024-02-02  5:08         ` Vishal Annapurve
2024-02-02  8:00           ` Jeremi Piotrowski
2024-02-02 16:22             ` Vishal Annapurve
2024-02-02 16:35               ` Dave Hansen
2024-02-03  5:19                 ` Vishal Annapurve
2024-01-30 16:42 ` [RFC V1 0/5] x86: CVMs: Align memory conversions to 2M granularity Vishal Annapurve
2024-01-31 16:52 ` Dave Hansen
2024-02-01  5:44   ` Vishal Annapurve
