Xen-Devel Archive mirror
* [PATCH v2 00/13] x86: assorted shadow mode adjustments
@ 2023-03-30 11:22 Jan Beulich
  2023-03-30 11:24 ` [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E() Jan Beulich
                   ` (12 more replies)
  0 siblings, 13 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:22 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

This series is largely fallout from the XSA-427 investigations, partly a
result of the first approach taken there having been more intrusive.

Most patches aren't really dependent upon one another, so they can
probably go in independently (as they get acked).

A few patches from v1 went in, but there are also three new patches in
v2. See individual patches for what has changed (in response to review
comments).

01: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E()
02: drop redundant present bit checks from SHADOW_FOREACH_L<N>E() "bodies"
03: reduce explicit log-dirty recording for HVM
04: call sh_update_cr3() directly from sh_page_fault()
05: don't generate bogus "domain dying" trace entry from sh_page_fault()
06: use lighter weight mode checks
07: move OOS functions to their own file
08: sh_rm_write_access_from_sl1p() is HVM-only
09: drop is_hvm_...() where easily possible
10: make monitor table create/destroy more consistent
11: vCPU-s never have "no mode"
12: "monitor table" is a HVM-only concept
13: adjust monitor table prealloc amount

Jan



* [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E()
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
@ 2023-03-30 11:24 ` Jan Beulich
  2023-03-30 12:20   ` Andrew Cooper
  2023-03-30 11:26 ` [PATCH v2 02/13] x86/shadow: drop redundant present bit checks from FOREACH_PRESENT_L<N>E() "bodies" Jan Beulich
                   ` (11 subsequent siblings)
  12 siblings, 1 reply; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:24 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

These being local macros, the SHADOW prefix doesn't gain us much. What
is more important to be aware of at use sites is that the supplied code
is invoked for present entries only.
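
For reference, the shape of the underlying iteration, as a simplified
sketch of the L1 variant (type assertions and the two-page 32-bit-on-PAE
variant omitted; not the literal macro from multi.c):

  #define FOREACH_PRESENT_L1E_SKETCH(_sl1mfn, _sl1e, _gl1p, _done, _code) \
  do {                                                                    \
      int _i;                                                             \
      shadow_l1e_t *_sp = map_domain_page(_sl1mfn);                       \
      for ( _i = 0; _i < SHADOW_L1_PAGETABLE_ENTRIES; _i++ )              \
      {                                                                   \
          (_sl1e) = _sp + _i;                                             \
          /* The caller-supplied body runs for present entries only. */   \
          if ( shadow_l1e_get_flags(*(_sl1e)) & _PAGE_PRESENT )           \
              { _code }                                                   \
          if ( _done )                                                    \
              break;                                                      \
          increment_ptr_to_guest_entry(_gl1p);                            \
      }                                                                   \
      unmap_domain_page(_sp);                                             \
  } while ( 0 )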

While making the adjustment, also properly use NULL for the 3rd argument
at the respective invocation sites.

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -777,7 +777,7 @@ static inline void increment_ptr_to_gues
 }
 
 /* All kinds of l1: touch all entries */
-#define _SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)        \
+#define _FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)       \
 do {                                                                    \
     int _i;                                                             \
     shadow_l1e_t *_sp = map_domain_page((_sl1mfn));                     \
@@ -796,25 +796,25 @@ do {
 
 /* 32-bit l1, on PAE or 64-bit shadows: need to walk both pages of shadow */
 #if GUEST_PAGING_LEVELS == 2 && SHADOW_PAGING_LEVELS > 2
-#define SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p, _done,  _code)        \
+#define FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p, _done,  _code)       \
 do {                                                                    \
     int __done = 0;                                                     \
-    _SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p,                          \
+    _FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p,                         \
                          ({ (__done = _done); }), _code);               \
     _sl1mfn = sh_next_page(_sl1mfn);                                    \
     if ( !__done )                                                      \
-        _SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code);       \
+        _FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code);      \
 } while (0)
 #else /* Everything else; l1 shadows are only one page */
-#define SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)         \
-       _SHADOW_FOREACH_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)
+#define FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)        \
+       _FOREACH_PRESENT_L1E(_sl1mfn, _sl1e, _gl1p, _done, _code)
 #endif
 
 
 #if GUEST_PAGING_LEVELS == 2
 
 /* 32-bit l2 on PAE/64: four pages, touch every second entry */
-#define SHADOW_FOREACH_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)     \
+#define FOREACH_PRESENT_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)    \
 do {                                                                      \
     int _i, _j;                                                           \
     ASSERT(shadow_mode_external(_dom));                                   \
@@ -839,7 +839,7 @@ do {
 #elif GUEST_PAGING_LEVELS == 3
 
 /* PAE: touch all entries */
-#define SHADOW_FOREACH_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)      \
+#define FOREACH_PRESENT_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)     \
 do {                                                                       \
     int _i;                                                                \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                        \
@@ -859,7 +859,7 @@ do {
 #else
 
 /* 64-bit l2: touch all entries except for PAE compat guests. */
-#define SHADOW_FOREACH_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)       \
+#define FOREACH_PRESENT_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)      \
 do {                                                                        \
     unsigned int _i, _end = SHADOW_L2_PAGETABLE_ENTRIES;                    \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                         \
@@ -886,7 +886,7 @@ do {
 #if GUEST_PAGING_LEVELS == 4
 
 /* 64-bit l3: touch all entries */
-#define SHADOW_FOREACH_L3E(_sl3mfn, _sl3e, _gl3p, _done, _code)         \
+#define FOREACH_PRESENT_L3E(_sl3mfn, _sl3e, _gl3p, _done, _code)        \
 do {                                                                    \
     int _i;                                                             \
     shadow_l3e_t *_sp = map_domain_page((_sl3mfn));                     \
@@ -903,7 +903,7 @@ do {
 } while (0)
 
 /* 64-bit l4: avoid Xen mappings */
-#define SHADOW_FOREACH_L4E(_sl4mfn, _sl4e, _gl4p, _done, _dom, _code)   \
+#define FOREACH_PRESENT_L4E(_sl4mfn, _sl4e, _gl4p, _done, _dom, _code)  \
 do {                                                                    \
     shadow_l4e_t *_sp = map_domain_page((_sl4mfn));                     \
     int _xen = !shadow_mode_external(_dom);                             \
@@ -1288,7 +1288,7 @@ void sh_destroy_l4_shadow(struct domain
     shadow_demote(d, gmfn, t);
     /* Decrement refcounts of all the old entries */
     sl4mfn = smfn;
-    SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, 0, d, {
+    FOREACH_PRESENT_L4E(sl4mfn, sl4e, NULL, 0, d, {
         if ( shadow_l4e_get_flags(*sl4e) & _PAGE_PRESENT )
         {
             sh_put_ref(d, shadow_l4e_get_mfn(*sl4e),
@@ -1319,7 +1319,7 @@ void sh_destroy_l3_shadow(struct domain
 
     /* Decrement refcounts of all the old entries */
     sl3mfn = smfn;
-    SHADOW_FOREACH_L3E(sl3mfn, sl3e, 0, 0, {
+    FOREACH_PRESENT_L3E(sl3mfn, sl3e, NULL, 0, {
         if ( shadow_l3e_get_flags(*sl3e) & _PAGE_PRESENT )
             sh_put_ref(d, shadow_l3e_get_mfn(*sl3e),
                         (((paddr_t)mfn_x(sl3mfn)) << PAGE_SHIFT)
@@ -1351,7 +1351,7 @@ void sh_destroy_l2_shadow(struct domain
 
     /* Decrement refcounts of all the old entries */
     sl2mfn = smfn;
-    SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, 0, d, {
+    FOREACH_PRESENT_L2E(sl2mfn, sl2e, NULL, 0, d, {
         if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT )
             sh_put_ref(d, shadow_l2e_get_mfn(*sl2e),
                         (((paddr_t)mfn_x(sl2mfn)) << PAGE_SHIFT)
@@ -1389,7 +1389,7 @@ void sh_destroy_l1_shadow(struct domain
     {
         /* Decrement refcounts of all the old entries */
         mfn_t sl1mfn = smfn;
-        SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, 0, {
+        FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, 0, {
             unsigned int sl1f = shadow_l1e_get_flags(*sl1e);
 
             if ( (sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(*sl1e) )
@@ -1421,7 +1421,7 @@ void sh_destroy_l1_shadow(struct domain
 void sh_unhook_l2_mappings(struct domain *d, mfn_t sl2mfn, bool user_only)
 {
     shadow_l2e_t *sl2e;
-    SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, 0, d, {
+    FOREACH_PRESENT_L2E(sl2mfn, sl2e, NULL, 0, d, {
         if ( !user_only || (sl2e->l2 & _PAGE_USER) )
             shadow_set_l2e(d, sl2e, shadow_l2e_empty(), sl2mfn);
     });
@@ -1432,7 +1432,7 @@ void sh_unhook_l2_mappings(struct domain
 void sh_unhook_l4_mappings(struct domain *d, mfn_t sl4mfn, bool user_only)
 {
     shadow_l4e_t *sl4e;
-    SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, 0, d, {
+    FOREACH_PRESENT_L4E(sl4mfn, sl4e, NULL, 0, d, {
         if ( !user_only || (sl4e->l4 & _PAGE_USER) )
             shadow_set_l4e(d, sl4e, shadow_l4e_empty(), sl4mfn);
     });
@@ -1666,7 +1666,7 @@ void sh_resync_l1(struct vcpu *v, mfn_t
     gp = map_domain_page(gl1mfn);
     gl1p = gp;
 
-   SHADOW_FOREACH_L1E(sl1mfn, sl1p, &gl1p, 0, {
+   FOREACH_PRESENT_L1E(sl1mfn, sl1p, &gl1p, 0, {
         guest_l1e_t gl1e = *gl1p;
 
         if ( snp[guest_index(gl1p)].l1 != gl1e.l1 )
@@ -3564,7 +3564,7 @@ int cf_check sh_rm_write_access_from_l1(
     mfn_t base_sl1mfn = sl1mfn; /* Because sl1mfn changes in the foreach */
 #endif
 
-    SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, done,
+    FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
     {
         flags = shadow_l1e_get_flags(*sl1e);
         if ( (flags & _PAGE_PRESENT)
@@ -3597,7 +3597,7 @@ int cf_check sh_rm_mappings_from_l1(
     int done = 0;
     int flags;
 
-    SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, done,
+    FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
     {
         flags = shadow_l1e_get_flags(*sl1e);
         if ( (flags & _PAGE_PRESENT)
@@ -3648,7 +3648,7 @@ int cf_check sh_remove_l1_shadow(struct
     int done = 0;
     int flags;
 
-    SHADOW_FOREACH_L2E(sl2mfn, sl2e, 0, done, d,
+    FOREACH_PRESENT_L2E(sl2mfn, sl2e, NULL, done, d,
     {
         flags = shadow_l2e_get_flags(*sl2e);
         if ( (flags & _PAGE_PRESENT)
@@ -3671,7 +3671,7 @@ int cf_check sh_remove_l2_shadow(struct
     int done = 0;
     int flags;
 
-    SHADOW_FOREACH_L3E(sl3mfn, sl3e, 0, done,
+    FOREACH_PRESENT_L3E(sl3mfn, sl3e, NULL, done,
     {
         flags = shadow_l3e_get_flags(*sl3e);
         if ( (flags & _PAGE_PRESENT)
@@ -3693,7 +3693,7 @@ int cf_check sh_remove_l3_shadow(struct
     int done = 0;
     int flags;
 
-    SHADOW_FOREACH_L4E(sl4mfn, sl4e, 0, done, d,
+    FOREACH_PRESENT_L4E(sl4mfn, sl4e, NULL, done, d,
     {
         flags = shadow_l4e_get_flags(*sl4e);
         if ( (flags & _PAGE_PRESENT)
@@ -3924,7 +3924,7 @@ int cf_check sh_audit_l1_table(struct do
 #endif
 
     gl1e = gp = map_domain_page(gl1mfn);
-    SHADOW_FOREACH_L1E(sl1mfn, sl1e, &gl1e, done, {
+    FOREACH_PRESENT_L1E(sl1mfn, sl1e, &gl1e, done, {
 
         if ( sh_l1e_is_magic(*sl1e) )
         {
@@ -3978,7 +3978,7 @@ int cf_check sh_audit_fl1_table(struct d
 
     /* fl1 has no useful backpointer: all we can check are flags */
     e = guest_l1e_from_gfn(_gfn(0), 0); gl1e = &e; /* Needed for macro */
-    SHADOW_FOREACH_L1E(sl1mfn, sl1e, 0, done, {
+    FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done, {
         f = shadow_l1e_get_flags(*sl1e);
         f &= ~(_PAGE_AVAIL0|_PAGE_AVAIL1|_PAGE_AVAIL2);
         if ( !(f == 0
@@ -4015,7 +4015,7 @@ int cf_check sh_audit_l2_table(struct do
 #endif
 
     gl2e = gp = map_domain_page(gl2mfn);
-    SHADOW_FOREACH_L2E(sl2mfn, sl2e, &gl2e, done, d, {
+    FOREACH_PRESENT_L2E(sl2mfn, sl2e, &gl2e, done, d, {
 
         s = sh_audit_flags(d, 2, guest_l2e_get_flags(*gl2e),
                            shadow_l2e_get_flags(*sl2e));
@@ -4066,7 +4066,7 @@ int cf_check sh_audit_l3_table(struct do
 #endif
 
     gl3e = gp = map_domain_page(gl3mfn);
-    SHADOW_FOREACH_L3E(sl3mfn, sl3e, &gl3e, done, {
+    FOREACH_PRESENT_L3E(sl3mfn, sl3e, &gl3e, done, {
 
         s = sh_audit_flags(d, 3, guest_l3e_get_flags(*gl3e),
                            shadow_l3e_get_flags(*sl3e));
@@ -4115,7 +4115,7 @@ int cf_check sh_audit_l4_table(struct do
 #endif
 
     gl4e = gp = map_domain_page(gl4mfn);
-    SHADOW_FOREACH_L4E(sl4mfn, sl4e, &gl4e, done, d,
+    FOREACH_PRESENT_L4E(sl4mfn, sl4e, &gl4e, done, d,
     {
         s = sh_audit_flags(d, 4, guest_l4e_get_flags(*gl4e),
                            shadow_l4e_get_flags(*sl4e));




* [PATCH v2 02/13] x86/shadow: drop redundant present bit checks from FOREACH_PRESENT_L<N>E() "bodies"
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
  2023-03-30 11:24 ` [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E() Jan Beulich
@ 2023-03-30 11:26 ` Jan Beulich
  2023-03-30 11:26 ` [PATCH v2 03/13] x86/shadow: reduce explicit log-dirty recording for HVM Jan Beulich
                   ` (10 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:26 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

FOREACH_PRESENT_L<N>E(), as their names (now) say, already invoke the
"body" only when the present bit is set; no need to re-do the check.

While there, also
- stop open-coding mfn_to_maddr() in code being touched (re-indented)
  anyway,
- stop open-coding mfn_eq() in code being touched or in adjacent code,
- drop local variables when they're no longer used at least twice.
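
Taken together, at a typical call site this amounts to the following
(illustrative sketch only; the hunks below contain the real instances):

  /* Before: the body re-checked what the iterator already guarantees. */
  FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
  {
      flags = shadow_l1e_get_flags(*sl1e);
      if ( (flags & _PAGE_PRESENT)
           && (mfn_x(shadow_l1e_get_mfn(*sl1e)) == mfn_x(target_mfn)) )
          { /* ... */ }
  });

  /* After: only the remaining, meaningful condition is left, and the
   * open-coded MFN comparison becomes mfn_eq(). */
  FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
  {
      if ( mfn_eq(shadow_l1e_get_mfn(*sl1e), target_mfn) )
          { /* ... */ }
  });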

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2: Re-base over new earlier patch. More mfn_eq().

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -1289,12 +1289,8 @@ void sh_destroy_l4_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl4mfn = smfn;
     FOREACH_PRESENT_L4E(sl4mfn, sl4e, NULL, 0, d, {
-        if ( shadow_l4e_get_flags(*sl4e) & _PAGE_PRESENT )
-        {
-            sh_put_ref(d, shadow_l4e_get_mfn(*sl4e),
-                       (((paddr_t)mfn_x(sl4mfn)) << PAGE_SHIFT)
-                       | ((unsigned long)sl4e & ~PAGE_MASK));
-        }
+        sh_put_ref(d, shadow_l4e_get_mfn(*sl4e),
+                   mfn_to_maddr(sl4mfn) | ((unsigned long)sl4e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1320,10 +1316,8 @@ void sh_destroy_l3_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl3mfn = smfn;
     FOREACH_PRESENT_L3E(sl3mfn, sl3e, NULL, 0, {
-        if ( shadow_l3e_get_flags(*sl3e) & _PAGE_PRESENT )
-            sh_put_ref(d, shadow_l3e_get_mfn(*sl3e),
-                        (((paddr_t)mfn_x(sl3mfn)) << PAGE_SHIFT)
-                        | ((unsigned long)sl3e & ~PAGE_MASK));
+        sh_put_ref(d, shadow_l3e_get_mfn(*sl3e),
+                   mfn_to_maddr(sl3mfn) | ((unsigned long)sl3e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1352,10 +1346,8 @@ void sh_destroy_l2_shadow(struct domain
     /* Decrement refcounts of all the old entries */
     sl2mfn = smfn;
     FOREACH_PRESENT_L2E(sl2mfn, sl2e, NULL, 0, d, {
-        if ( shadow_l2e_get_flags(*sl2e) & _PAGE_PRESENT )
-            sh_put_ref(d, shadow_l2e_get_mfn(*sl2e),
-                        (((paddr_t)mfn_x(sl2mfn)) << PAGE_SHIFT)
-                        | ((unsigned long)sl2e & ~PAGE_MASK));
+        sh_put_ref(d, shadow_l2e_get_mfn(*sl2e),
+                   mfn_to_maddr(sl2mfn) | ((unsigned long)sl2e & ~PAGE_MASK));
     });
 
     /* Put the memory back in the pool */
@@ -1390,11 +1382,10 @@ void sh_destroy_l1_shadow(struct domain
         /* Decrement refcounts of all the old entries */
         mfn_t sl1mfn = smfn;
         FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, 0, {
-            unsigned int sl1f = shadow_l1e_get_flags(*sl1e);
-
-            if ( (sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(*sl1e) )
+            if ( !sh_l1e_is_magic(*sl1e) )
             {
-                shadow_vram_put_mfn(shadow_l1e_get_mfn(*sl1e), sl1f,
+                shadow_vram_put_mfn(shadow_l1e_get_mfn(*sl1e),
+                                    shadow_l1e_get_flags(*sl1e),
                                     sl1mfn, sl1e, d);
                 shadow_put_page_from_l1e(*sl1e, d);
             }
@@ -3558,7 +3549,6 @@ int cf_check sh_rm_write_access_from_l1(
 {
     shadow_l1e_t *sl1e;
     int done = 0;
-    int flags;
 #if SHADOW_OPTIMIZATIONS & SHOPT_WRITABLE_HEURISTIC
     struct vcpu *curr = current;
     mfn_t base_sl1mfn = sl1mfn; /* Because sl1mfn changes in the foreach */
@@ -3566,10 +3556,8 @@ int cf_check sh_rm_write_access_from_l1(
 
     FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
     {
-        flags = shadow_l1e_get_flags(*sl1e);
-        if ( (flags & _PAGE_PRESENT)
-             && (flags & _PAGE_RW)
-             && (mfn_x(shadow_l1e_get_mfn(*sl1e)) == mfn_x(readonly_mfn)) )
+        if ( (shadow_l1e_get_flags(*sl1e) & _PAGE_RW) &&
+             mfn_eq(shadow_l1e_get_mfn(*sl1e), readonly_mfn) )
         {
             shadow_l1e_t ro_sl1e = shadow_l1e_remove_flags(*sl1e, _PAGE_RW);
 
@@ -3595,13 +3583,10 @@ int cf_check sh_rm_mappings_from_l1(
 {
     shadow_l1e_t *sl1e;
     int done = 0;
-    int flags;
 
     FOREACH_PRESENT_L1E(sl1mfn, sl1e, NULL, done,
     {
-        flags = shadow_l1e_get_flags(*sl1e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l1e_get_mfn(*sl1e)) == mfn_x(target_mfn)) )
+        if ( mfn_eq(shadow_l1e_get_mfn(*sl1e), target_mfn) )
         {
             shadow_set_l1e(d, sl1e, shadow_l1e_empty(), p2m_invalid, sl1mfn);
             if ( sh_check_page_has_no_refs(mfn_to_page(target_mfn)) )
@@ -3646,13 +3631,10 @@ int cf_check sh_remove_l1_shadow(struct
 {
     shadow_l2e_t *sl2e;
     int done = 0;
-    int flags;
 
     FOREACH_PRESENT_L2E(sl2mfn, sl2e, NULL, done, d,
     {
-        flags = shadow_l2e_get_flags(*sl2e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l2e_get_mfn(*sl2e)) == mfn_x(sl1mfn)) )
+        if ( mfn_eq(shadow_l2e_get_mfn(*sl2e), sl1mfn) )
         {
             shadow_set_l2e(d, sl2e, shadow_l2e_empty(), sl2mfn);
             if ( mfn_to_page(sl1mfn)->u.sh.type == 0 )
@@ -3669,13 +3651,10 @@ int cf_check sh_remove_l2_shadow(struct
 {
     shadow_l3e_t *sl3e;
     int done = 0;
-    int flags;
 
     FOREACH_PRESENT_L3E(sl3mfn, sl3e, NULL, done,
     {
-        flags = shadow_l3e_get_flags(*sl3e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l3e_get_mfn(*sl3e)) == mfn_x(sl2mfn)) )
+        if ( mfn_eq(shadow_l3e_get_mfn(*sl3e), sl2mfn) )
         {
             shadow_set_l3e(d, sl3e, shadow_l3e_empty(), sl3mfn);
             if ( mfn_to_page(sl2mfn)->u.sh.type == 0 )
@@ -3691,13 +3670,10 @@ int cf_check sh_remove_l3_shadow(struct
 {
     shadow_l4e_t *sl4e;
     int done = 0;
-    int flags;
 
     FOREACH_PRESENT_L4E(sl4mfn, sl4e, NULL, done, d,
     {
-        flags = shadow_l4e_get_flags(*sl4e);
-        if ( (flags & _PAGE_PRESENT)
-             && (mfn_x(shadow_l4e_get_mfn(*sl4e)) == mfn_x(sl3mfn)) )
+        if ( mfn_eq(shadow_l4e_get_mfn(*sl4e), sl3mfn) )
         {
             shadow_set_l4e(d, sl4e, shadow_l4e_empty(), sl4mfn);
             if ( mfn_to_page(sl3mfn)->u.sh.type == 0 )




* [PATCH v2 03/13] x86/shadow: reduce explicit log-dirty recording for HVM
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
  2023-03-30 11:24 ` [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E() Jan Beulich
  2023-03-30 11:26 ` [PATCH v2 02/13] x86/shadow: drop redundant present bit checks from FOREACH_PRESENT_L<N>E() "bodies" Jan Beulich
@ 2023-03-30 11:26 ` Jan Beulich
  2023-03-30 11:27 ` [PATCH v2 04/13] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
                   ` (9 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:26 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

validate_guest_pt_write(), by calling sh_validate_guest_entry(), already
guarantees the needed update of log-dirty information. Move the
operation into the sole code path needing it (when SHOPT_SKIP_VERIFY is
enabled), making clear that only one such call is needed.
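
The resulting flow in sh_emulate_unmap_dest() is then roughly the
following (a simplified sketch; the cross-page handling is omitted and
the condition / byte-count names are illustrative):

  if ( single_aligned_write )          /* SHOPT_SKIP_VERIFY fast path */
      /* No validation runs on this path, so record the write directly. */
      paging_mark_dirty(v->domain, sh_ctxt->mfn[0]);
  else
      /*
       * validate_guest_pt_write() -> sh_validate_guest_entry() already
       * performs the log-dirty update for the written page(s).
       */
      validate_guest_pt_write(v, sh_ctxt->mfn[0], addr, bytes);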

Signed-off-by: Jan Beulich <jbeulich@suse.com>

--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -656,6 +656,7 @@ static void sh_emulate_unmap_dest(struct
     {
         /* Writes with this alignment constraint can't possibly cross pages. */
         ASSERT(!mfn_valid(sh_ctxt->mfn[1]));
+        paging_mark_dirty(v->domain, sh_ctxt->mfn[0]);
     }
     else
 #endif /* SHADOW_OPTIMIZATIONS & SHOPT_SKIP_VERIFY */
@@ -673,12 +674,10 @@ static void sh_emulate_unmap_dest(struct
             validate_guest_pt_write(v, sh_ctxt->mfn[1], addr + b1, b2);
     }
 
-    paging_mark_dirty(v->domain, sh_ctxt->mfn[0]);
     put_page(mfn_to_page(sh_ctxt->mfn[0]));
 
     if ( unlikely(mfn_valid(sh_ctxt->mfn[1])) )
     {
-        paging_mark_dirty(v->domain, sh_ctxt->mfn[1]);
         put_page(mfn_to_page(sh_ctxt->mfn[1]));
         vunmap((void *)((unsigned long)addr & PAGE_MASK));
     }




* [PATCH v2 04/13] x86/shadow: call sh_update_cr3() directly from sh_page_fault()
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (2 preceding siblings ...)
  2023-03-30 11:26 ` [PATCH v2 03/13] x86/shadow: reduce explicit log-dirty recording for HVM Jan Beulich
@ 2023-03-30 11:27 ` Jan Beulich
  2023-03-30 11:28 ` [PATCH v2 05/13] x86/shadow: don't generate bogus "domain dying" trace entry " Jan Beulich
                   ` (8 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:27 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

There's no need for an indirect call here, as the mode is invariant
throughout the entire paging-locked region. All it takes to avoid it is
to have a forward declaration of sh_update_cr3() in place.
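
In isolation the pattern is simply (a sketch; the hunk below is the real
change):

  /* Forward declaration, ahead of sh_page_fault(). */
  static void cf_check sh_update_cr3(struct vcpu *v, int do_locking,
                                     bool noflush);

  /* Within the paging-locked region the mode cannot change, so
   *     v->arch.paging.mode->update_cr3(v, 0, false);
   * can become a direct call: */
  sh_update_cr3(v, 0, false);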

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
I find this and the respective Win7-related comment suspicious: if we
really need to "fix up" L3 entries "on demand", wouldn't we be better
off retrying the shadow_get_and_create_l1e() rather than exiting? The
spurious page fault that the guest observes cannot, after all, be known
to be non-fatal inside the guest. That's purely an OS policy.

Furthermore, the sh_update_cr3() call will also invalidate L3 entries which
were loaded successfully before, but invalidated by the guest
afterwards. I strongly suspect that the described hardware behavior is
_only_ to load previously not-present entries from the PDPT, but not
purge ones already marked present. IOW I think sh_update_cr3() would
need calling in an "incremental" mode here. (The alternative of doing
this in shadow_get_and_create_l3e() instead would likely be more
cumbersome.)

Beyond the "on demand" L3 entry creation I also can't see what guest
actions could lead to the ASSERT() being inapplicable in the PAE case.
The 3-level code in shadow_get_and_create_l2e() doesn't consult guest
PDPTEs, and all other logic is similar to that for other modes.

(See 89329d832aed ["x86 shadow: Update cr3 in PAE mode when guest walk
succeed but shadow walk fails"].)

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -91,6 +91,8 @@ const char *const fetch_type_names[] = {
 # define for_each_shadow_table(v, i) for ( (i) = 0; (i) < 1; ++(i) )
 #endif
 
+static void cf_check sh_update_cr3(struct vcpu *v, int do_locking, bool noflush);
+
 /* Helper to perform a local TLB flush. */
 static void sh_flush_local(const struct domain *d)
 {
@@ -2487,7 +2489,7 @@ static int cf_check sh_page_fault(
          * In any case, in the PAE case, the ASSERT is not true; it can
          * happen because of actions the guest is taking. */
 #if GUEST_PAGING_LEVELS == 3
-        v->arch.paging.mode->update_cr3(v, 0, false);
+        sh_update_cr3(v, 0, false);
 #else
         ASSERT(d->is_shutting_down);
 #endif




* [PATCH v2 05/13] x86/shadow: don't generate bogus "domain dying" trace entry from sh_page_fault()
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (3 preceding siblings ...)
  2023-03-30 11:27 ` [PATCH v2 04/13] x86/shadow: call sh_update_cr3() directly from sh_page_fault() Jan Beulich
@ 2023-03-30 11:28 ` Jan Beulich
  2023-03-30 11:28 ` [PATCH v2 06/13] x86/shadow: use lighter weight mode checks Jan Beulich
                   ` (7 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:28 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

When, in 3-level guest mode, we help a guest stay alive, we shouldn't
emit a trace entry claiming the contrary. Move the invocation up into
the respective #ifdef branch. While this moves it into the locked
region, emitting trace records with the paging lock held is okay (it is
done elsewhere as well); it just needlessly increases lock holding time
a little.
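
For clarity, the resulting shape of the affected block (reconstructed
from the hunk below):

  #if GUEST_PAGING_LEVELS == 3
          sh_update_cr3(v, 0, false);
  #else
          ASSERT(d->is_shutting_down);
          trace_shadow_gen(TRC_SHADOW_DOMF_DYING, va);
  #endif
          paging_unlock(d);
          put_gfn(d, gfn_x(gfn));
          return 0;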

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -2492,10 +2492,10 @@ static int cf_check sh_page_fault(
         sh_update_cr3(v, 0, false);
 #else
         ASSERT(d->is_shutting_down);
+        trace_shadow_gen(TRC_SHADOW_DOMF_DYING, va);
 #endif
         paging_unlock(d);
         put_gfn(d, gfn_x(gfn));
-        trace_shadow_gen(TRC_SHADOW_DOMF_DYING, va);
         return 0;
     }
 




* [PATCH v2 06/13] x86/shadow: use lighter weight mode checks
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (4 preceding siblings ...)
  2023-03-30 11:28 ` [PATCH v2 05/13] x86/shadow: don't generate bogus "domain dying" trace entry " Jan Beulich
@ 2023-03-30 11:28 ` Jan Beulich
  2023-03-30 11:29 ` [PATCH v2 07/13] x86/shadow: move OOS functions to their own file Jan Beulich
                   ` (6 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:28 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

shadow_mode_...(), with the exception of shadow_mode_enabled(), are
shorthands for shadow_mode_enabled() && paging_mode_...(). While
potentially useful outside of shadow-internal functions, when we already
know that we're dealing with a domain in shadow mode, the "paging"
checks are sufficient and cheaper. While the "shadow" ones commonly
translate to a MOV/AND/CMP/Jcc sequence, the "paging" ones typically
resolve to just TEST+Jcc.
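
Expressed as a sketch (reflecting the relationship described above, not
the literal header definitions):

  #define shadow_mode_refcounts(d) \
      (shadow_mode_enabled(d) && paging_mode_refcounts(d))

  /*
   * Inside code which only ever runs for shadow-mode domains, the
   * shadow_mode_enabled() half is already known to be true, so the
   * plain paging_mode_refcounts(d) bit test is all that's needed.
   */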

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Re-base over new earlier patch.

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1856,7 +1856,7 @@ int sh_remove_write_access(struct domain
      * In guest refcounting, we trust Xen to already be restricting
      * all the writes to the guest page tables, so we do not need to
      * do more. */
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         return 0;
 
     /* Early exit if it's already a pagetable, or otherwise not writeable */
@@ -2088,7 +2088,7 @@ int sh_remove_all_mappings(struct domain
          *   guest pages with an extra reference taken by
          *   prepare_ring_for_helper().
          */
-        if ( !(shadow_mode_external(d)
+        if ( !(paging_mode_external(d)
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
                    == (is_special_page(page) ||
@@ -2385,8 +2385,8 @@ static void sh_update_paging_modes(struc
     {
         const struct paging_mode *old_mode = v->arch.paging.mode;
 
-        ASSERT(shadow_mode_translate(d));
-        ASSERT(shadow_mode_external(d));
+        ASSERT(paging_mode_translate(d));
+        ASSERT(paging_mode_external(d));
 
 #if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
         /* Need to resync all our pages now, because if a page goes out
@@ -2773,7 +2773,7 @@ void shadow_vcpu_teardown(struct vcpu *v
 
     sh_detach_old_tables(v);
 #ifdef CONFIG_HVM
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         mfn_t mfn = pagetable_get_mfn(v->arch.hvm.monitor_table);
 
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -526,7 +526,7 @@ _sh_propagate(struct vcpu *v,
                || (level == 1
                    && page_get_owner(mfn_to_page(target_mfn)) == dom_io);
     if ( mmio_mfn
-         && !(level == 1 && (!shadow_mode_refcounts(d)
+         && !(level == 1 && (!paging_mode_refcounts(d)
                              || p2mt == p2m_mmio_direct)) )
     {
         ASSERT((ft == ft_prefetch));
@@ -543,7 +543,7 @@ _sh_propagate(struct vcpu *v,
                        _PAGE_RW | _PAGE_PRESENT);
     if ( guest_nx_enabled(v) )
         pass_thru_flags |= _PAGE_NX_BIT;
-    if ( level == 1 && !shadow_mode_refcounts(d) && mmio_mfn )
+    if ( level == 1 && !paging_mode_refcounts(d) && mmio_mfn )
         pass_thru_flags |= PAGE_CACHE_ATTRS;
     sflags = gflags & pass_thru_flags;
 
@@ -663,7 +663,7 @@ _sh_propagate(struct vcpu *v,
      * (We handle log-dirty entirely inside the shadow code, without using the
      * p2m_ram_logdirty p2m type: only HAP uses that.)
      */
-    if ( level == 1 && unlikely(shadow_mode_log_dirty(d)) && !mmio_mfn )
+    if ( level == 1 && unlikely(paging_mode_log_dirty(d)) && !mmio_mfn )
     {
         if ( ft & FETCH_TYPE_WRITE )
             paging_mark_dirty(d, target_mfn);
@@ -819,7 +819,7 @@ do {
 #define FOREACH_PRESENT_L2E(_sl2mfn, _sl2e, _gl2p, _done, _dom, _code)    \
 do {                                                                      \
     int _i, _j;                                                           \
-    ASSERT(shadow_mode_external(_dom));                                   \
+    ASSERT(paging_mode_external(_dom));                                   \
     ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_32_shadow);      \
     for ( _j = 0; _j < 4; _j++ )                                          \
     {                                                                     \
@@ -845,7 +845,7 @@ do {
 do {                                                                       \
     int _i;                                                                \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                        \
-    ASSERT(shadow_mode_external(_dom));                                    \
+    ASSERT(paging_mode_external(_dom));                                    \
     ASSERT(mfn_to_page(_sl2mfn)->u.sh.type == SH_type_l2_pae_shadow);      \
     for ( _i = 0; _i < SHADOW_L2_PAGETABLE_ENTRIES; _i++ )                 \
     {                                                                      \
@@ -866,7 +866,7 @@ do {
     unsigned int _i, _end = SHADOW_L2_PAGETABLE_ENTRIES;                    \
     shadow_l2e_t *_sp = map_domain_page((_sl2mfn));                         \
     ASSERT_VALID_L2(mfn_to_page(_sl2mfn)->u.sh.type);                       \
-    if ( is_pv_32bit_domain(_dom) /* implies !shadow_mode_external */ &&    \
+    if ( is_pv_32bit_domain(_dom) /* implies !paging_mode_external */ &&    \
          mfn_to_page(_sl2mfn)->u.sh.type != SH_type_l2_64_shadow )          \
         _end = COMPAT_L2_PAGETABLE_FIRST_XEN_SLOT(_dom);                    \
     for ( _i = 0; _i < _end; ++_i )                                         \
@@ -908,7 +908,7 @@ do {
 #define FOREACH_PRESENT_L4E(_sl4mfn, _sl4e, _gl4p, _done, _dom, _code)  \
 do {                                                                    \
     shadow_l4e_t *_sp = map_domain_page((_sl4mfn));                     \
-    int _xen = !shadow_mode_external(_dom);                             \
+    int _xen = !paging_mode_external(_dom);                             \
     int _i;                                                             \
     ASSERT(mfn_to_page(_sl4mfn)->u.sh.type == SH_type_l4_64_shadow);\
     for ( _i = 0; _i < SHADOW_L4_PAGETABLE_ENTRIES; _i++ )              \
@@ -977,7 +977,7 @@ sh_make_shadow(struct vcpu *v, mfn_t gmf
 #endif
 
     // Create the Xen mappings...
-    if ( !shadow_mode_external(d) )
+    if ( !paging_mode_external(d) )
     {
         switch (shadow_type)
         {
@@ -1379,7 +1379,7 @@ void sh_destroy_l1_shadow(struct domain
         shadow_demote(d, gmfn, t);
     }
 
-    if ( shadow_mode_refcounts(d) )
+    if ( paging_mode_refcounts(d) )
     {
         /* Decrement refcounts of all the old entries */
         mfn_t sl1mfn = smfn;
@@ -1476,7 +1476,7 @@ static int cf_check validate_gl4e(
     l4e_propagate_from_guest(v, new_gl4e, sl3mfn, &new_sl4e, ft_prefetch);
 
     // check for updates to xen reserved slots
-    if ( !shadow_mode_external(d) )
+    if ( !paging_mode_external(d) )
     {
         int shadow_index = (((unsigned long)sl4p & ~PAGE_MASK) /
                             sizeof(shadow_l4e_t));
@@ -2399,7 +2399,7 @@ static int cf_check sh_page_fault(
     gfn = guest_walk_to_gfn(&gw);
     gmfn = get_gfn(d, gfn, &p2mt);
 
-    if ( shadow_mode_refcounts(d) &&
+    if ( paging_mode_refcounts(d) &&
          ((!p2m_is_valid(p2mt) && !p2m_is_grant(p2mt)) ||
           (!p2m_is_mmio(p2mt) && !mfn_valid(gmfn))) )
     {
@@ -2623,7 +2623,7 @@ static int cf_check sh_page_fault(
     return EXCRET_fault_fixed;
 
  emulate:
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         goto not_a_shadow_fault;
 
 #ifdef CONFIG_HVM
@@ -3067,7 +3067,7 @@ sh_update_linear_entries(struct vcpu *v)
      */
 
     /* Don't try to update the monitor table if it doesn't exist */
-    if ( !shadow_mode_external(d) ||
+    if ( !paging_mode_external(d) ||
          pagetable_get_pfn(v->arch.hvm.monitor_table) == 0 )
         return;
 
@@ -3216,7 +3216,7 @@ static void cf_check sh_update_cr3(struc
     /* Double-check that the HVM code has sent us a sane guest_table */
     if ( is_hvm_domain(d) )
     {
-        ASSERT(shadow_mode_external(d));
+        ASSERT(paging_mode_external(d));
         if ( hvm_paging_enabled(v) )
             ASSERT(pagetable_get_pfn(v->arch.guest_table));
         else
@@ -3241,7 +3241,7 @@ static void cf_check sh_update_cr3(struc
      * table.  We cache the current state of that table and shadow that,
      * until the next CR3 write makes us refresh our cache.
      */
-    ASSERT(shadow_mode_external(d));
+    ASSERT(paging_mode_external(d));
 
     /*
      * Find where in the page the l3 table is, but ignore the low 2 bits of
@@ -3272,7 +3272,7 @@ static void cf_check sh_update_cr3(struc
         ASSERT(d->is_dying || d->is_shutting_down);
         return;
     }
-    if ( !shadow_mode_external(d) && !is_pv_32bit_domain(d) )
+    if ( !paging_mode_external(d) && !is_pv_32bit_domain(d) )
     {
         mfn_t smfn = pagetable_get_mfn(v->arch.paging.shadow.shadow_table[0]);
 
@@ -3366,7 +3366,7 @@ static void cf_check sh_update_cr3(struc
     ///
     /// v->arch.cr3
     ///
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         make_cr3(v, pagetable_get_mfn(v->arch.hvm.monitor_table));
     }
@@ -3383,7 +3383,7 @@ static void cf_check sh_update_cr3(struc
     ///
     /// v->arch.hvm.hw_cr[3]
     ///
-    if ( shadow_mode_external(d) )
+    if ( paging_mode_external(d) )
     {
         ASSERT(is_hvm_domain(d));
 #if SHADOW_PAGING_LEVELS == 3
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -420,7 +420,7 @@ static inline int sh_remove_write_access
                                          unsigned int level,
                                          unsigned long fault_addr)
 {
-    ASSERT(!shadow_mode_refcounts(d));
+    ASSERT(!paging_mode_refcounts(d));
     return 0;
 }
 #endif
@@ -533,8 +533,8 @@ sh_mfn_is_a_page_table(mfn_t gmfn)
         return 0;
 
     owner = page_get_owner(page);
-    if ( owner && shadow_mode_refcounts(owner)
-         && (page->count_info & PGC_shadowed_pt) )
+    if ( owner && paging_mode_refcounts(owner) &&
+         (page->count_info & PGC_shadowed_pt) )
         return 1;
 
     type_info = page->u.inuse.type_info & PGT_type_mask;
--- a/xen/arch/x86/mm/shadow/set.c
+++ b/xen/arch/x86/mm/shadow/set.c
@@ -93,7 +93,7 @@ shadow_get_page_from_l1e(shadow_l1e_t sl
     struct domain *owner = NULL;
 
     ASSERT(!sh_l1e_is_magic(sl1e));
-    ASSERT(shadow_mode_refcounts(d));
+    ASSERT(paging_mode_refcounts(d));
 
     if ( mfn_valid(mfn) )
     {
@@ -354,7 +354,7 @@ int shadow_set_l1e(struct domain *d, sha
          !sh_l1e_is_magic(new_sl1e) )
     {
         /* About to install a new reference */
-        if ( shadow_mode_refcounts(d) )
+        if ( paging_mode_refcounts(d) )
         {
 #define PAGE_FLIPPABLE (_PAGE_RW | _PAGE_PWT | _PAGE_PCD | _PAGE_PAT)
             int rc;
@@ -387,7 +387,7 @@ int shadow_set_l1e(struct domain *d, sha
 
     old_sl1f = shadow_l1e_get_flags(old_sl1e);
     if ( (old_sl1f & _PAGE_PRESENT) && !sh_l1e_is_magic(old_sl1e) &&
-         shadow_mode_refcounts(d) )
+         paging_mode_refcounts(d) )
     {
         /*
          * We lost a reference to an old mfn.
--- a/xen/arch/x86/mm/shadow/types.h
+++ b/xen/arch/x86/mm/shadow/types.h
@@ -274,7 +274,7 @@ int shadow_set_l4e(struct domain *d, sha
 static void inline
 shadow_put_page_from_l1e(shadow_l1e_t sl1e, struct domain *d)
 {
-    if ( !shadow_mode_refcounts(d) )
+    if ( !paging_mode_refcounts(d) )
         return;
 
     put_page_from_l1e(sl1e, d);




* [PATCH v2 07/13] x86/shadow: move OOS functions to their own file
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (5 preceding siblings ...)
  2023-03-30 11:28 ` [PATCH v2 06/13] x86/shadow: use lighter weight mode checks Jan Beulich
@ 2023-03-30 11:29 ` Jan Beulich
  2023-03-30 11:30 ` [PATCH v2 08/13] x86/shadow: sh_rm_write_access_from_sl1p() is HVM-only Jan Beulich
                   ` (5 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:29 UTC
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

The code has been identified as HVM-only, and its main functions are
pretty well isolated. Move them to their own file. While moving, besides
making two functions non-static, do a few style adjustments, mainly
comment formatting, but leave the code otherwise untouched.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Acked-by: Andrew Cooper <andrew.cooper3@citrix.com>
---
v2: Adjust SPDX to GPL-2.0-only. A few more style adjustments.

--- a/xen/arch/x86/mm/shadow/Makefile
+++ b/xen/arch/x86/mm/shadow/Makefile
@@ -1,6 +1,6 @@
 ifeq ($(CONFIG_SHADOW_PAGING),y)
 obj-y += common.o set.o
-obj-$(CONFIG_HVM) += hvm.o guest_2.o guest_3.o guest_4.o
+obj-$(CONFIG_HVM) += hvm.o guest_2.o guest_3.o guest_4.o oos.o
 obj-$(CONFIG_PV) += pv.o guest_4.o
 else
 obj-y += none.o
--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -152,576 +152,6 @@ static int __init cf_check shadow_audit_
 __initcall(shadow_audit_key_init);
 #endif /* SHADOW_AUDIT */
 
-#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
-/**************************************************************************/
-/* Out-of-sync shadows. */
-
-/* From time to time, we let a shadowed pagetable page go out of sync
- * with its shadow: the guest is allowed to write directly to the page,
- * and those writes are not synchronously reflected in the shadow.
- * This lets us avoid many emulations if the guest is writing a lot to a
- * pagetable, but it relaxes a pretty important invariant in the shadow
- * pagetable design.  Therefore, some rules:
- *
- * 1. Only L1 pagetables may go out of sync: any page that is shadowed
- *    at at higher level must be synchronously updated.  This makes
- *    using linear shadow pagetables much less dangerous.
- *    That means that: (a) unsyncing code needs to check for higher-level
- *    shadows, and (b) promotion code needs to resync.
- *
- * 2. All shadow operations on a guest page require the page to be brought
- *    back into sync before proceeding.  This must be done under the
- *    paging lock so that the page is guaranteed to remain synced until
- *    the operation completes.
- *
- *    Exceptions to this rule: the pagefault and invlpg handlers may
- *    update only one entry on an out-of-sync page without resyncing it.
- *
- * 3. Operations on shadows that do not start from a guest page need to
- *    be aware that they may be handling an out-of-sync shadow.
- *
- * 4. Operations that do not normally take the paging lock (fast-path
- *    #PF handler, INVLPG) must fall back to a locking, syncing version
- *    if they see an out-of-sync table.
- *
- * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
- *    must explicitly resync all relevant pages or update their
- *    shadows.
- *
- * Currently out-of-sync pages are listed in a simple open-addressed
- * hash table with a second chance (must resist temptation to radically
- * over-engineer hash tables...)  The virtual address of the access
- * which caused us to unsync the page is also kept in the hash table, as
- * a hint for finding the writable mappings later.
- *
- * We keep a hash per vcpu, because we want as much as possible to do
- * the re-sync on the save vcpu we did the unsync on, so the VA hint
- * will be valid.
- */
-
-static void sh_oos_audit(struct domain *d)
-{
-    unsigned int idx, expected_idx, expected_idx_alt;
-    struct page_info *pg;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        {
-            mfn_t *oos = v->arch.paging.shadow.oos;
-            if ( mfn_eq(oos[idx], INVALID_MFN) )
-                continue;
-
-            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
-            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
-            if ( idx != expected_idx && idx != expected_idx_alt )
-            {
-                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
-                       __func__, idx, mfn_x(oos[idx]),
-                       expected_idx, expected_idx_alt);
-                BUG();
-            }
-            pg = mfn_to_page(oos[idx]);
-            if ( !(pg->count_info & PGC_shadowed_pt) )
-            {
-                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
-                BUG();
-            }
-            if ( !(pg->shadow_flags & SHF_out_of_sync) )
-            {
-                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
-                BUG();
-            }
-            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
-            {
-                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
-                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
-                BUG();
-            }
-        }
-    }
-}
-
-#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
-void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    struct vcpu *v;
-    mfn_t *oos;
-
-    ASSERT(mfn_is_out_of_sync(gmfn));
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-
-        if ( mfn_eq(oos[idx], gmfn) )
-            return;
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-#endif
-
-/* Update the shadow, but keep the page out of sync. */
-static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)
-{
-    struct page_info *pg = mfn_to_page(gmfn);
-
-    ASSERT(mfn_valid(gmfn));
-    ASSERT(page_is_out_of_sync(pg));
-
-    /* Call out to the appropriate per-mode resyncing function */
-    if ( pg->shadow_flags & SHF_L1_32 )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
-    else if ( pg->shadow_flags & SHF_L1_PAE )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
-    else if ( pg->shadow_flags & SHF_L1_64 )
-        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
-}
-
-static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
-                                            mfn_t smfn, unsigned long off)
-{
-    ASSERT(mfn_valid(smfn));
-    ASSERT(mfn_valid(gmfn));
-
-    switch ( mfn_to_page(smfn)->u.sh.type )
-    {
-    case SH_type_l1_32_shadow:
-    case SH_type_fl1_32_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
-            (d, gmfn, smfn, off);
-
-    case SH_type_l1_pae_shadow:
-    case SH_type_fl1_pae_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
-            (d, gmfn, smfn, off);
-
-    case SH_type_l1_64_shadow:
-    case SH_type_fl1_64_shadow:
-        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
-            (d, gmfn, smfn, off);
-
-    default:
-        return 0;
-    }
-}
-
-/*
- * Fixup arrays: We limit the maximum number of writable mappings to
- * SHADOW_OOS_FIXUPS and store enough information to remove them
- * quickly on resync.
- */
-
-static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
-                                       struct oos_fixup *fixup)
-{
-    struct domain *d = v->domain;
-    int i;
-    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-    {
-        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
-        {
-            sh_remove_write_access_from_sl1p(d, gmfn,
-                                             fixup->smfn[i],
-                                             fixup->off[i]);
-            fixup->smfn[i] = INVALID_MFN;
-        }
-    }
-
-    /* Always flush the TLBs. See comment on oos_fixup_add(). */
-    return 1;
-}
-
-void oos_fixup_add(struct domain *d, mfn_t gmfn,
-                   mfn_t smfn,  unsigned long off)
-{
-    int idx, next;
-    mfn_t *oos;
-    struct oos_fixup *oos_fixup;
-    struct vcpu *v;
-
-    perfc_incr(shadow_oos_fixup_add);
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_fixup = v->arch.paging.shadow.oos_fixup;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            int i;
-            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-            {
-                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn)
-                     && (oos_fixup[idx].off[i] == off) )
-                    return;
-            }
-
-            next = oos_fixup[idx].next;
-
-            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
-            {
-                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
-
-                /* Reuse this slot and remove current writable mapping. */
-                sh_remove_write_access_from_sl1p(d, gmfn,
-                                                 oos_fixup[idx].smfn[next],
-                                                 oos_fixup[idx].off[next]);
-                perfc_incr(shadow_oos_fixup_evict);
-                /* We should flush the TLBs now, because we removed a
-                   writable mapping, but since the shadow is already
-                   OOS we have no problem if another vcpu write to
-                   this page table. We just have to be very careful to
-                   *always* flush the tlbs on resync. */
-            }
-
-            oos_fixup[idx].smfn[next] = smfn;
-            oos_fixup[idx].off[next] = off;
-            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
-
-            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
-                                   struct oos_fixup *fixup)
-{
-    struct domain *d = v->domain;
-    int ftlb = 0;
-
-    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);
-
-    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
-    {
-    default:
-    case 0:
-        break;
-
-    case 1:
-        ftlb |= 1;
-        break;
-
-    case -1:
-        /* An unfindable writeable typecount has appeared, probably via a
-         * grant table entry: can't shoot the mapping, so try to unshadow
-         * the page.  If that doesn't work either, the guest is granting
-         * his pagetables and must be killed after all.
-         * This will flush the tlb, so we can return with no worries. */
-        shadow_remove_all_shadows(d, gmfn);
-        return 1;
-    }
-
-    if ( ftlb )
-        guest_flush_tlb_mask(d, d->dirty_cpumask);
-
-    return 0;
-}
-
-
-static inline void trace_resync(int event, mfn_t gmfn)
-{
-    if ( tb_init_done )
-    {
-        /* Convert gmfn to gfn */
-        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
-
-        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
-    }
-}
-
-/* Pull all the entries on an out-of-sync page back into sync. */
-static void _sh_resync(struct vcpu *v, mfn_t gmfn,
-                       struct oos_fixup *fixup, mfn_t snp)
-{
-    struct page_info *pg = mfn_to_page(gmfn);
-
-    ASSERT(paging_locked_by_me(v->domain));
-    ASSERT(mfn_is_out_of_sync(gmfn));
-    /* Guest page must be shadowed *only* as L1 when out of sync. */
-    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
-             & ~SHF_L1_ANY));
-    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
-
-    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
-
-    /* Need to pull write access so the page *stays* in sync. */
-    if ( oos_remove_write_access(v, gmfn, fixup) )
-    {
-        /* Page has been unshadowed. */
-        return;
-    }
-
-    /* No more writable mappings of this page, please */
-    pg->shadow_flags &= ~SHF_oos_may_write;
-
-    /* Update the shadows with current guest entries. */
-    _sh_resync_l1(v, gmfn, snp);
-
-    /* Now we know all the entries are synced, and will stay that way */
-    pg->shadow_flags &= ~SHF_out_of_sync;
-    perfc_incr(shadow_resync);
-    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
-}
-
-
-/* Add an MFN to the list of out-of-sync guest pagetables */
-static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
-{
-    int i, idx, oidx, swap = 0;
-    mfn_t *oos = v->arch.paging.shadow.oos;
-    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
-    struct oos_fixup fixup = { .next = 0 };
-
-    for (i = 0; i < SHADOW_OOS_FIXUPS; i++ )
-        fixup.smfn[i] = INVALID_MFN;
-
-    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-    oidx = idx;
-
-    if ( !mfn_eq(oos[idx], INVALID_MFN)
-         && (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
-    {
-        /* Punt the current occupant into the next slot */
-        SWAP(oos[idx], gmfn);
-        SWAP(oos_fixup[idx], fixup);
-        swap = 1;
-        idx = (idx + 1) % SHADOW_OOS_PAGES;
-    }
-    if ( !mfn_eq(oos[idx], INVALID_MFN) )
-    {
-        /* Crush the current occupant. */
-        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-        perfc_incr(shadow_unsync_evict);
-    }
-    oos[idx] = gmfn;
-    oos_fixup[idx] = fixup;
-
-    if ( swap )
-        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
-
-    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
-}
-
-/* Remove an MFN from the list of out-of-sync guest pagetables */
-static void oos_hash_remove(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    struct vcpu *v;
-
-    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            oos[idx] = INVALID_MFN;
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    mfn_t *oos_snapshot;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            return oos_snapshot[idx];
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-/* Pull a single guest page back into sync */
-void sh_resync(struct domain *d, mfn_t gmfn)
-{
-    int idx;
-    mfn_t *oos;
-    mfn_t *oos_snapshot;
-    struct oos_fixup *oos_fixup;
-    struct vcpu *v;
-
-    for_each_vcpu(d, v)
-    {
-        oos = v->arch.paging.shadow.oos;
-        oos_fixup = v->arch.paging.shadow.oos_fixup;
-        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
-        if ( !mfn_eq(oos[idx], gmfn) )
-            idx = (idx + 1) % SHADOW_OOS_PAGES;
-
-        if ( mfn_eq(oos[idx], gmfn) )
-        {
-            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
-            oos[idx] = INVALID_MFN;
-            return;
-        }
-    }
-
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
-           mfn_x(gmfn));
-    BUG();
-}
-
-/* Figure out whether it's definitely safe not to sync this l1 table,
- * by making a call out to the mode in which that shadow was made. */
-static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
-{
-    struct page_info *pg = mfn_to_page(gl1mfn);
-    if ( pg->shadow_flags & SHF_L1_32 )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
-    else if ( pg->shadow_flags & SHF_L1_PAE )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
-    else if ( pg->shadow_flags & SHF_L1_64 )
-        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);
-    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
-           mfn_x(gl1mfn));
-    BUG();
-}
-
-
-/* Pull all out-of-sync pages back into sync.  Pages brought out of sync
- * on other vcpus are allowed to remain out of sync, but their contents
- * will be made safe (TLB flush semantics); pages unsynced by this vcpu
- * are brought back into sync and write-protected.  If skip != 0, we try
- * to avoid resyncing at all if we think we can get away with it. */
-void sh_resync_all(struct vcpu *v, int skip, int this, int others)
-{
-    int idx;
-    struct vcpu *other;
-    mfn_t *oos = v->arch.paging.shadow.oos;
-    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
-    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
-
-    SHADOW_PRINTK("%pv\n", v);
-
-    ASSERT(paging_locked_by_me(v->domain));
-
-    if ( !this )
-        goto resync_others;
-
-    /* First: resync all of this vcpu's oos pages */
-    for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        if ( !mfn_eq(oos[idx], INVALID_MFN) )
-        {
-            /* Write-protect and sync contents */
-            _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-            oos[idx] = INVALID_MFN;
-        }
-
- resync_others:
-    if ( !others )
-        return;
-
-    /* Second: make all *other* vcpus' oos pages safe. */
-    for_each_vcpu(v->domain, other)
-    {
-        if ( v == other )
-            continue;
-
-        oos = other->arch.paging.shadow.oos;
-        oos_fixup = other->arch.paging.shadow.oos_fixup;
-        oos_snapshot = other->arch.paging.shadow.oos_snapshot;
-
-        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
-        {
-            if ( mfn_eq(oos[idx], INVALID_MFN) )
-                continue;
-
-            if ( skip )
-            {
-                /* Update the shadows and leave the page OOS. */
-                if ( sh_skip_sync(v, oos[idx]) )
-                    continue;
-                trace_resync(TRC_SHADOW_RESYNC_ONLY, oos[idx]);
-                _sh_resync_l1(other, oos[idx], oos_snapshot[idx]);
-            }
-            else
-            {
-                /* Write-protect and sync contents */
-                _sh_resync(other, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
-                oos[idx] = INVALID_MFN;
-            }
-        }
-    }
-}
-
-/* Allow a shadowed page to go out of sync. Unsyncs are traced in
- * multi.c:sh_page_fault() */
-int sh_unsync(struct vcpu *v, mfn_t gmfn)
-{
-    struct page_info *pg;
-
-    ASSERT(paging_locked_by_me(v->domain));
-
-    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
-
-    pg = mfn_to_page(gmfn);
-
-    /* Guest page must be shadowed *only* as L1 and *only* once when out
-     * of sync.  Also, get out now if it's already out of sync.
-     * Also, can't safely unsync if some vcpus have paging disabled.*/
-    if ( pg->shadow_flags &
-         ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)
-         || sh_page_has_multiple_shadows(pg)
-         || !is_hvm_vcpu(v)
-         || !v->domain->arch.paging.shadow.oos_active )
-        return 0;
-
-    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_out_of_sync);
-    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_oos_may_write);
-
-    pg->shadow_flags |= SHF_out_of_sync|SHF_oos_may_write;
-    oos_hash_add(v, gmfn);
-    perfc_incr(shadow_unsync);
-    TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_UNSYNC);
-    return 1;
-}
-
-#endif /* (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) */
-
-
 /**************************************************************************/
 /* Code for "promoting" a guest page to the point where the shadow code is
  * willing to let it be treated as a guest page table.  This generally
--- /dev/null
+++ b/xen/arch/x86/mm/shadow/oos.c
@@ -0,0 +1,606 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/******************************************************************************
+ * arch/x86/mm/shadow/oos.c
+ *
+ * Shadow code dealing with out-of-sync shadows.
+ * Parts of this code are Copyright (c) 2006 by XenSource Inc.
+ * Parts of this code are Copyright (c) 2006 by Michael A Fetterman
+ * Parts based on earlier work by Michael A Fetterman, Ian Pratt et al.
+ */
+
+#include "private.h"
+
+#if (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC)
+
+#include <xen/trace.h>
+
+#include <asm/shadow.h>
+
+/*
+ * From time to time, we let a shadowed pagetable page go out of sync
+ * with its shadow: the guest is allowed to write directly to the page,
+ * and those writes are not synchronously reflected in the shadow.
+ * This lets us avoid many emulations if the guest is writing a lot to a
+ * pagetable, but it relaxes a pretty important invariant in the shadow
+ * pagetable design.  Therefore, some rules:
+ *
+ * 1. Only L1 pagetables may go out of sync: any page that is shadowed
+ *    at a higher level must be synchronously updated.  This makes
+ *    using linear shadow pagetables much less dangerous.
+ *    That means that: (a) unsyncing code needs to check for higher-level
+ *    shadows, and (b) promotion code needs to resync.
+ *
+ * 2. All shadow operations on a guest page require the page to be brought
+ *    back into sync before proceeding.  This must be done under the
+ *    paging lock so that the page is guaranteed to remain synced until
+ *    the operation completes.
+ *
+ *    Exceptions to this rule: the pagefault and invlpg handlers may
+ *    update only one entry on an out-of-sync page without resyncing it.
+ *
+ * 3. Operations on shadows that do not start from a guest page need to
+ *    be aware that they may be handling an out-of-sync shadow.
+ *
+ * 4. Operations that do not normally take the paging lock (fast-path
+ *    #PF handler, INVLPG) must fall back to a locking, syncing version
+ *    if they see an out-of-sync table.
+ *
+ * 5. Operations corresponding to guest TLB flushes (MOV CR3, INVLPG)
+ *    must explicitly resync all relevant pages or update their
+ *    shadows.
+ *
+ * Currently out-of-sync pages are listed in a simple open-addressed
+ * hash table with a second chance (must resist temptation to radically
+ * over-engineer hash tables...)  The virtual address of the access
+ * which caused us to unsync the page is also kept in the hash table, as
+ * a hint for finding the writable mappings later.
+ *
+ * We keep a hash per vcpu, because we want as much as possible to do
+ * the re-sync on the same vcpu we did the unsync on, so the VA hint
+ * will be valid.
+ */
+
+#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES_FULL
+void sh_oos_audit(struct domain *d)
+{
+    unsigned int idx, expected_idx, expected_idx_alt;
+    struct page_info *pg;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        {
+            mfn_t *oos = v->arch.paging.shadow.oos;
+
+            if ( mfn_eq(oos[idx], INVALID_MFN) )
+                continue;
+
+            expected_idx = mfn_x(oos[idx]) % SHADOW_OOS_PAGES;
+            expected_idx_alt = ((expected_idx + 1) % SHADOW_OOS_PAGES);
+            if ( idx != expected_idx && idx != expected_idx_alt )
+            {
+                printk("%s: idx %x contains gmfn %lx, expected at %x or %x.\n",
+                       __func__, idx, mfn_x(oos[idx]),
+                       expected_idx, expected_idx_alt);
+                BUG();
+            }
+            pg = mfn_to_page(oos[idx]);
+            if ( !(pg->count_info & PGC_shadowed_pt) )
+            {
+                printk("%s: idx %x gmfn %lx not a pt (count %lx)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->count_info);
+                BUG();
+            }
+            if ( !(pg->shadow_flags & SHF_out_of_sync) )
+            {
+                printk("%s: idx %x gmfn %lx not marked oos (flags %x)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
+                BUG();
+            }
+            if ( (pg->shadow_flags & SHF_page_type_mask & ~SHF_L1_ANY) )
+            {
+                printk("%s: idx %x gmfn %lx shadowed as non-l1 (flags %x)\n",
+                       __func__, idx, mfn_x(oos[idx]), pg->shadow_flags);
+                BUG();
+            }
+        }
+    }
+}
+#endif
+
+#if SHADOW_AUDIT & SHADOW_AUDIT_ENTRIES
+void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    struct vcpu *v;
+    mfn_t *oos;
+
+    ASSERT(mfn_is_out_of_sync(gmfn));
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+
+        if ( mfn_eq(oos[idx], gmfn) )
+            return;
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" marked OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+#endif
+
+/* Update the shadow, but keep the page out of sync. */
+static inline void _sh_resync_l1(struct vcpu *v, mfn_t gmfn, mfn_t snpmfn)
+{
+    struct page_info *pg = mfn_to_page(gmfn);
+
+    ASSERT(mfn_valid(gmfn));
+    ASSERT(page_is_out_of_sync(pg));
+
+    /* Call out to the appropriate per-mode resyncing function */
+    if ( pg->shadow_flags & SHF_L1_32 )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 2)(v, gmfn, snpmfn);
+    else if ( pg->shadow_flags & SHF_L1_PAE )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 3)(v, gmfn, snpmfn);
+    else if ( pg->shadow_flags & SHF_L1_64 )
+        SHADOW_INTERNAL_NAME(sh_resync_l1, 4)(v, gmfn, snpmfn);
+}
+
+static int sh_remove_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
+                                            mfn_t smfn, unsigned long off)
+{
+    ASSERT(mfn_valid(smfn));
+    ASSERT(mfn_valid(gmfn));
+
+    switch ( mfn_to_page(smfn)->u.sh.type )
+    {
+    case SH_type_l1_32_shadow:
+    case SH_type_fl1_32_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 2)
+            (d, gmfn, smfn, off);
+
+    case SH_type_l1_pae_shadow:
+    case SH_type_fl1_pae_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 3)
+            (d, gmfn, smfn, off);
+
+    case SH_type_l1_64_shadow:
+    case SH_type_fl1_64_shadow:
+        return SHADOW_INTERNAL_NAME(sh_rm_write_access_from_sl1p, 4)
+            (d, gmfn, smfn, off);
+
+    default:
+        return 0;
+    }
+}
+
+/*
+ * Fixup arrays: We limit the maximum number of writable mappings to
+ * SHADOW_OOS_FIXUPS and store enough information to remove them
+ * quickly on resync.
+ */
+
+static inline int oos_fixup_flush_gmfn(struct vcpu *v, mfn_t gmfn,
+                                       struct oos_fixup *fixup)
+{
+    struct domain *d = v->domain;
+    int i;
+    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+    {
+        if ( !mfn_eq(fixup->smfn[i], INVALID_MFN) )
+        {
+            sh_remove_write_access_from_sl1p(d, gmfn,
+                                             fixup->smfn[i],
+                                             fixup->off[i]);
+            fixup->smfn[i] = INVALID_MFN;
+        }
+    }
+
+    /* Always flush the TLBs. See comment on oos_fixup_add(). */
+    return 1;
+}
+
+void oos_fixup_add(struct domain *d, mfn_t gmfn,
+                   mfn_t smfn,  unsigned long off)
+{
+    int idx, next;
+    mfn_t *oos;
+    struct oos_fixup *oos_fixup;
+    struct vcpu *v;
+
+    perfc_incr(shadow_oos_fixup_add);
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_fixup = v->arch.paging.shadow.oos_fixup;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            int i;
+            for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+            {
+                if ( mfn_eq(oos_fixup[idx].smfn[i], smfn) &&
+                     (oos_fixup[idx].off[i] == off) )
+                    return;
+            }
+
+            next = oos_fixup[idx].next;
+
+            if ( !mfn_eq(oos_fixup[idx].smfn[next], INVALID_MFN) )
+            {
+                TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_EVICT);
+
+                /* Reuse this slot and remove current writable mapping. */
+                sh_remove_write_access_from_sl1p(d, gmfn,
+                                                 oos_fixup[idx].smfn[next],
+                                                 oos_fixup[idx].off[next]);
+                perfc_incr(shadow_oos_fixup_evict);
+                /*
+                 * We should flush the TLBs now, because we removed a
+                 * writable mapping, but since the shadow is already
+                 * OOS we have no problem if another vcpu writes to
+                 * this page table. We just have to be very careful to
+                 * *always* flush the tlbs on resync.
+                 */
+            }
+
+            oos_fixup[idx].smfn[next] = smfn;
+            oos_fixup[idx].off[next] = off;
+            oos_fixup[idx].next = (next + 1) % SHADOW_OOS_FIXUPS;
+
+            TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_OOS_FIXUP_ADD);
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+static int oos_remove_write_access(struct vcpu *v, mfn_t gmfn,
+                                   struct oos_fixup *fixup)
+{
+    struct domain *d = v->domain;
+    int ftlb = 0;
+
+    ftlb |= oos_fixup_flush_gmfn(v, gmfn, fixup);
+
+    switch ( sh_remove_write_access(d, gmfn, 0, 0) )
+    {
+    default:
+    case 0:
+        break;
+
+    case 1:
+        ftlb |= 1;
+        break;
+
+    case -1:
+        /*
+         * An unfindable writeable typecount has appeared, probably via a
+         * grant table entry: can't shoot the mapping, so try to unshadow
+         * the page.  If that doesn't work either, the guest is granting
+         * his pagetables and must be killed after all.
+         * This will flush the tlb, so we can return with no worries.
+         */
+        shadow_remove_all_shadows(d, gmfn);
+        return 1;
+    }
+
+    if ( ftlb )
+        guest_flush_tlb_mask(d, d->dirty_cpumask);
+
+    return 0;
+}
+
+static inline void trace_resync(int event, mfn_t gmfn)
+{
+    if ( tb_init_done )
+    {
+        /* Convert gmfn to gfn */
+        gfn_t gfn = mfn_to_gfn(current->domain, gmfn);
+
+        __trace_var(event, 0/*!tsc*/, sizeof(gfn), &gfn);
+    }
+}
+
+/* Pull all the entries on an out-of-sync page back into sync. */
+static void _sh_resync(struct vcpu *v, mfn_t gmfn,
+                       struct oos_fixup *fixup, mfn_t snp)
+{
+    struct page_info *pg = mfn_to_page(gmfn);
+
+    ASSERT(paging_locked_by_me(v->domain));
+    ASSERT(mfn_is_out_of_sync(gmfn));
+    /* Guest page must be shadowed *only* as L1 when out of sync. */
+    ASSERT(!(mfn_to_page(gmfn)->shadow_flags & SHF_page_type_mask
+             & ~SHF_L1_ANY));
+    ASSERT(!sh_page_has_multiple_shadows(mfn_to_page(gmfn)));
+
+    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
+
+    /* Need to pull write access so the page *stays* in sync. */
+    if ( oos_remove_write_access(v, gmfn, fixup) )
+    {
+        /* Page has been unshadowed. */
+        return;
+    }
+
+    /* No more writable mappings of this page, please */
+    pg->shadow_flags &= ~SHF_oos_may_write;
+
+    /* Update the shadows with current guest entries. */
+    _sh_resync_l1(v, gmfn, snp);
+
+    /* Now we know all the entries are synced, and will stay that way */
+    pg->shadow_flags &= ~SHF_out_of_sync;
+    perfc_incr(shadow_resync);
+    trace_resync(TRC_SHADOW_RESYNC_FULL, gmfn);
+}
+
+/* Add an MFN to the list of out-of-sync guest pagetables */
+static void oos_hash_add(struct vcpu *v, mfn_t gmfn)
+{
+    int i, idx, oidx, swap = 0;
+    mfn_t *oos = v->arch.paging.shadow.oos;
+    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
+    struct oos_fixup fixup = { .next = 0 };
+
+    for ( i = 0; i < SHADOW_OOS_FIXUPS; i++ )
+        fixup.smfn[i] = INVALID_MFN;
+
+    idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+    oidx = idx;
+
+    if ( !mfn_eq(oos[idx], INVALID_MFN) &&
+         (mfn_x(oos[idx]) % SHADOW_OOS_PAGES) == idx )
+    {
+        /* Punt the current occupant into the next slot */
+        SWAP(oos[idx], gmfn);
+        SWAP(oos_fixup[idx], fixup);
+        swap = 1;
+        idx = (idx + 1) % SHADOW_OOS_PAGES;
+    }
+    if ( !mfn_eq(oos[idx], INVALID_MFN) )
+    {
+        /* Crush the current occupant. */
+        _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+        perfc_incr(shadow_unsync_evict);
+    }
+    oos[idx] = gmfn;
+    oos_fixup[idx] = fixup;
+
+    if ( swap )
+        SWAP(oos_snapshot[idx], oos_snapshot[oidx]);
+
+    copy_domain_page(oos_snapshot[oidx], oos[oidx]);
+}
+
+/* Remove an MFN from the list of out-of-sync guest pagetables */
+void oos_hash_remove(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    struct vcpu *v;
+
+    SHADOW_PRINTK("d%d gmfn %lx\n", d->domain_id, mfn_x(gmfn));
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            oos[idx] = INVALID_MFN;
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    mfn_t *oos_snapshot;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            return oos_snapshot[idx];
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+/* Pull a single guest page back into sync */
+void sh_resync(struct domain *d, mfn_t gmfn)
+{
+    int idx;
+    mfn_t *oos;
+    mfn_t *oos_snapshot;
+    struct oos_fixup *oos_fixup;
+    struct vcpu *v;
+
+    for_each_vcpu(d, v)
+    {
+        oos = v->arch.paging.shadow.oos;
+        oos_fixup = v->arch.paging.shadow.oos_fixup;
+        oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+        idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;
+        if ( !mfn_eq(oos[idx], gmfn) )
+            idx = (idx + 1) % SHADOW_OOS_PAGES;
+
+        if ( mfn_eq(oos[idx], gmfn) )
+        {
+            _sh_resync(v, gmfn, &oos_fixup[idx], oos_snapshot[idx]);
+            oos[idx] = INVALID_MFN;
+            return;
+        }
+    }
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not in hash table\n",
+           mfn_x(gmfn));
+    BUG();
+}
+
+/*
+ * Figure out whether it's definitely safe not to sync this l1 table,
+ * by making a call out to the mode in which that shadow was made.
+ */
+static int sh_skip_sync(struct vcpu *v, mfn_t gl1mfn)
+{
+    struct page_info *pg = mfn_to_page(gl1mfn);
+
+    if ( pg->shadow_flags & SHF_L1_32 )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 2)(v, gl1mfn);
+    else if ( pg->shadow_flags & SHF_L1_PAE )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 3)(v, gl1mfn);
+    else if ( pg->shadow_flags & SHF_L1_64 )
+        return SHADOW_INTERNAL_NAME(sh_safe_not_to_sync, 4)(v, gl1mfn);
+
+    printk(XENLOG_ERR "gmfn %"PRI_mfn" was OOS but not shadowed as an l1\n",
+           mfn_x(gl1mfn));
+    BUG();
+}
+
+/*
+ * Pull all out-of-sync pages back into sync.  Pages brought out of sync
+ * on other vcpus are allowed to remain out of sync, but their contents
+ * will be made safe (TLB flush semantics); pages unsynced by this vcpu
+ * are brought back into sync and write-protected.  If skip != 0, we try
+ * to avoid resyncing at all if we think we can get away with it.
+ */
+void sh_resync_all(struct vcpu *v, int skip, int this, int others)
+{
+    int idx;
+    struct vcpu *other;
+    mfn_t *oos = v->arch.paging.shadow.oos;
+    mfn_t *oos_snapshot = v->arch.paging.shadow.oos_snapshot;
+    struct oos_fixup *oos_fixup = v->arch.paging.shadow.oos_fixup;
+
+    SHADOW_PRINTK("%pv\n", v);
+
+    ASSERT(paging_locked_by_me(v->domain));
+
+    if ( !this )
+        goto resync_others;
+
+    /* First: resync all of this vcpu's oos pages */
+    for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        if ( !mfn_eq(oos[idx], INVALID_MFN) )
+        {
+            /* Write-protect and sync contents */
+            _sh_resync(v, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+            oos[idx] = INVALID_MFN;
+        }
+
+ resync_others:
+    if ( !others )
+        return;
+
+    /* Second: make all *other* vcpus' oos pages safe. */
+    for_each_vcpu(v->domain, other)
+    {
+        if ( v == other )
+            continue;
+
+        oos = other->arch.paging.shadow.oos;
+        oos_fixup = other->arch.paging.shadow.oos_fixup;
+        oos_snapshot = other->arch.paging.shadow.oos_snapshot;
+
+        for ( idx = 0; idx < SHADOW_OOS_PAGES; idx++ )
+        {
+            if ( mfn_eq(oos[idx], INVALID_MFN) )
+                continue;
+
+            if ( skip )
+            {
+                /* Update the shadows and leave the page OOS. */
+                if ( sh_skip_sync(v, oos[idx]) )
+                    continue;
+                trace_resync(TRC_SHADOW_RESYNC_ONLY, oos[idx]);
+                _sh_resync_l1(other, oos[idx], oos_snapshot[idx]);
+            }
+            else
+            {
+                /* Write-protect and sync contents */
+                _sh_resync(other, oos[idx], &oos_fixup[idx], oos_snapshot[idx]);
+                oos[idx] = INVALID_MFN;
+            }
+        }
+    }
+}
+
+/*
+ * Allow a shadowed page to go out of sync. Unsyncs are traced in
+ * multi.c:sh_page_fault()
+ */
+int sh_unsync(struct vcpu *v, mfn_t gmfn)
+{
+    struct page_info *pg;
+
+    ASSERT(paging_locked_by_me(v->domain));
+
+    SHADOW_PRINTK("%pv gmfn=%"PRI_mfn"\n", v, mfn_x(gmfn));
+
+    pg = mfn_to_page(gmfn);
+
+    /*
+     * Guest page must be shadowed *only* as L1 and *only* once when out
+     * of sync.  Also, get out now if it's already out of sync.
+     * Also, can't safely unsync if some vcpus have paging disabled.
+     */
+    if ( (pg->shadow_flags &
+          ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)) ||
+         sh_page_has_multiple_shadows(pg) ||
+         !is_hvm_vcpu(v) ||
+         !v->domain->arch.paging.shadow.oos_active )
+        return 0;
+
+    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_out_of_sync);
+    BUILD_BUG_ON(!(typeof(pg->shadow_flags))SHF_oos_may_write);
+
+    pg->shadow_flags |= SHF_out_of_sync|SHF_oos_may_write;
+    oos_hash_add(v, gmfn);
+    perfc_incr(shadow_unsync);
+    TRACE_SHADOW_PATH_FLAG(TRCE_SFLAG_UNSYNC);
+    return 1;
+}
+
+#endif /* (SHADOW_OPTIMIZATIONS & SHOPT_OUT_OF_SYNC) */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -452,6 +452,7 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
 /* Pull an out-of-sync page back into sync. */
 void sh_resync(struct domain *d, mfn_t gmfn);
 
+void oos_hash_remove(struct domain *d, mfn_t gmfn);
 void oos_fixup_add(struct domain *d, mfn_t gmfn, mfn_t smfn, unsigned long off);
 
 /* Pull all out-of-sync shadows back into sync.  If skip != 0, we try
@@ -477,6 +478,7 @@ shadow_sync_other_vcpus(struct vcpu *v)
     sh_resync_all(v, 1 /* skip */, 0 /* this */, 1 /* others */);
 }
 
+void sh_oos_audit(struct domain *d);
 void oos_audit_hash_is_present(struct domain *d, mfn_t gmfn);
 mfn_t oos_snapshot_lookup(struct domain *d, mfn_t gmfn);
 



^ permalink raw reply	[flat|nested] 15+ messages in thread
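
[Editorial illustration, not part of the patch.] The file moved above keeps,
per vCPU, a small open-addressed hash with a "second chance": a gmfn may sit
either in its home slot (mfn % SHADOW_OOS_PAGES) or in the slot after it.
The lookup pattern repeated throughout oos.c boils down to the sketch below;
the helper name oos_probe() is hypothetical and only meant to make the
probing rule explicit.

    /* Return the slot holding gmfn, or -1 if it isn't tracked (sketch). */
    static int oos_probe(const mfn_t *oos, mfn_t gmfn)
    {
        unsigned int idx = mfn_x(gmfn) % SHADOW_OOS_PAGES;

        if ( mfn_eq(oos[idx], gmfn) )
            return idx;

        idx = (idx + 1) % SHADOW_OOS_PAGES;        /* second chance */

        return mfn_eq(oos[idx], gmfn) ? (int)idx : -1;
    }

Insertion in oos_hash_add() follows the same two-slot rule: a new entry
prefers its home slot, a displaced occupant is punted into the next slot,
and anything still in the way there gets resynced and evicted.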

* [PATCH v2 08/13] x86/shadow: sh_rm_write_access_from_sl1p() is HVM-only
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (6 preceding siblings ...)
  2023-03-30 11:29 ` [PATCH v2 07/13] x86/shadow: move OOS functions to their own file Jan Beulich
@ 2023-03-30 11:30 ` Jan Beulich
  2023-03-30 11:30 ` [PATCH v2 09/13] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
                   ` (4 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:30 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

The function is used from (HVM-only) OOS code only - replace the
respective #ifdef inside the function to make this more obvious. (Note
that SHOPT_OUT_OF_SYNC won't be set when !HVM, so the #ifdef surrounding
the function is already sufficient.)

Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: New.

--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3429,9 +3429,7 @@ static void cf_check sh_update_cr3(struc
 int sh_rm_write_access_from_sl1p(struct domain *d, mfn_t gmfn,
                                  mfn_t smfn, unsigned long off)
 {
-#ifdef CONFIG_HVM
     struct vcpu *curr = current;
-#endif
     int r;
     shadow_l1e_t *sl1p, sl1e;
     struct page_info *sp;
@@ -3439,12 +3437,10 @@ int sh_rm_write_access_from_sl1p(struct
     ASSERT(mfn_valid(gmfn));
     ASSERT(mfn_valid(smfn));
 
-#ifdef CONFIG_HVM
     /* Remember if we've been told that this process is being torn down */
     if ( curr->domain == d && is_hvm_domain(d) )
         curr->arch.paging.shadow.pagetable_dying
             = mfn_to_page(gmfn)->pagetable_dying;
-#endif
 
     sp = mfn_to_page(smfn);
 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 09/13] x86/shadow: drop is_hvm_...() where easily possible
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (7 preceding siblings ...)
  2023-03-30 11:30 ` [PATCH v2 08/13] x86/shadow: sh_rm_write_access_from_sl1p() is HVM-only Jan Beulich
@ 2023-03-30 11:30 ` Jan Beulich
  2023-03-30 11:30 ` [PATCH v2 10/13] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
                   ` (3 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:30 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

Emulation related functions are involved in HVM handling only, and in
some cases they even invoke such checks after having already done things
which are valid for HVM domains only. OOS active also implies HVM. In
sh_remove_all_mappings() one of the two checks is redundant with an
earlier paging_mode_external() one (the other, however, needs to stay).

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Re-base over changes/additions earlier in the series.

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1522,7 +1522,7 @@ int sh_remove_all_mappings(struct domain
                && (page->count_info & PGC_count_mask) <= 3
                && ((page->u.inuse.type_info & PGT_count_mask)
                    == (is_special_page(page) ||
-                       (is_hvm_domain(d) && is_ioreq_server_page(d, page))))) )
+                       is_ioreq_server_page(d, page)))) )
             printk(XENLOG_G_ERR "can't find all mappings of mfn %"PRI_mfn
                    " (gfn %"PRI_gfn"): c=%lx t=%lx s=%d i=%d\n",
                    mfn_x(gmfn), gfn_x(gfn),
--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -204,10 +204,6 @@ hvm_emulate_write(enum x86_segment seg,
     if ( rc || !bytes )
         return rc;
 
-    /* Unaligned writes are only acceptable on HVM */
-    if ( (addr & (bytes - 1)) && !is_hvm_vcpu(v)  )
-        return X86EMUL_UNHANDLEABLE;
-
     ptr = sh_emulate_map_dest(v, addr, bytes, sh_ctxt);
     if ( IS_ERR(ptr) )
         return ~PTR_ERR(ptr);
@@ -258,10 +254,6 @@ hvm_emulate_cmpxchg(enum x86_segment seg
     if ( rc )
         return rc;
 
-    /* Unaligned writes are only acceptable on HVM */
-    if ( (addr & (bytes - 1)) && !is_hvm_vcpu(v)  )
-        return X86EMUL_UNHANDLEABLE;
-
     ptr = sh_emulate_map_dest(v, addr, bytes, sh_ctxt);
     if ( IS_ERR(ptr) )
         return ~PTR_ERR(ptr);
@@ -457,8 +449,7 @@ static void *sh_emulate_map_dest(struct
 
 #ifndef NDEBUG
     /* We don't emulate user-mode writes to page tables. */
-    if ( is_hvm_domain(d) ? hvm_get_cpl(v) == 3
-                          : !guest_kernel_mode(v, guest_cpu_user_regs()) )
+    if ( hvm_get_cpl(v) == 3 )
     {
         gdprintk(XENLOG_DEBUG, "User-mode write to pagetable reached "
                  "emulate_map_dest(). This should never happen!\n");
@@ -487,15 +478,6 @@ static void *sh_emulate_map_dest(struct
         sh_ctxt->mfn[1] = INVALID_MFN;
         map = map_domain_page(sh_ctxt->mfn[0]) + (vaddr & ~PAGE_MASK);
     }
-    else if ( !is_hvm_domain(d) )
-    {
-        /*
-         * Cross-page emulated writes are only supported for HVM guests;
-         * PV guests ought to know better.
-         */
-        put_page(mfn_to_page(sh_ctxt->mfn[0]));
-        return MAPPING_UNHANDLEABLE;
-    }
     else
     {
         /* This write crosses a page boundary. Translate the second page. */
--- a/xen/arch/x86/mm/shadow/multi.c
+++ b/xen/arch/x86/mm/shadow/multi.c
@@ -3438,7 +3438,7 @@ int sh_rm_write_access_from_sl1p(struct
     ASSERT(mfn_valid(smfn));
 
     /* Remember if we've been told that this process is being torn down */
-    if ( curr->domain == d && is_hvm_domain(d) )
+    if ( curr->domain == d )
         curr->arch.paging.shadow.pagetable_dying
             = mfn_to_page(gmfn)->pagetable_dying;
 
--- a/xen/arch/x86/mm/shadow/oos.c
+++ b/xen/arch/x86/mm/shadow/oos.c
@@ -580,7 +580,6 @@ int sh_unsync(struct vcpu *v, mfn_t gmfn
     if ( (pg->shadow_flags &
           ((SHF_page_type_mask & ~SHF_L1_ANY) | SHF_out_of_sync)) ||
          sh_page_has_multiple_shadows(pg) ||
-         !is_hvm_vcpu(v) ||
          !v->domain->arch.paging.shadow.oos_active )
         return 0;
 



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 10/13] x86/shadow: make monitor table create/destroy more consistent
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (8 preceding siblings ...)
  2023-03-30 11:30 ` [PATCH v2 09/13] x86/shadow: drop is_hvm_...() where easily possible Jan Beulich
@ 2023-03-30 11:30 ` Jan Beulich
  2023-03-30 11:31 ` [PATCH v2 11/13] x86/shadow: vCPU-s never have "no mode" Jan Beulich
                   ` (2 subsequent siblings)
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:30 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

While benign at present, it is still a little fragile to operate on a
wrong "old_mode" value in sh_update_paging_modes(). This can happen when
no monitor table was present initially - we'd create one for the new
mode without updating old_mode. Correct this in two ways, each of which
would be sufficient on its own: Once by adding "else" to the second of
the involved if()s in the function, and then by setting the correct
initial mode for HVM domains in shadow_vcpu_init().

Further use the same predicate (paging_mode_external()) consistently
when dealing with shadow mode init/update/cleanup, rather than a mix of
is_hvm_vcpu() (init), is_hvm_domain() (update), and
paging_mode_external() (cleanup).

Finally drop a redundant is_hvm_domain() from inside the bigger if()
(which is being converted to paging_mode_external()) in
sh_update_paging_modes().

Signed-off-by: Jan Beulich <jbeulich@suse.com>
---
v2: Style adjustment.
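
[Editorial sketch, heavily condensed; the helper names are hypothetical
stand-ins for the real code in sh_update_paging_modes().] The fragility
being closed is roughly:

    const struct paging_mode *old_mode = v->arch.paging.mode;

    /* ... pick the new v->arch.paging.mode ... */

    if ( !monitor_table_present(v) )              /* hypothetical shorthand */
    {
        /* Build a monitor table for the new mode; old_mode is now stale. */
    }
    else if ( v->arch.paging.mode != old_mode )   /* "else" added here */
    {
        /* Report the mode change, rebuild the monitor table if needed. */
    }

Previously the second comparison also ran right after the first branch had
just created a table for the new mode, i.e. against a stale old_mode; either
the added "else" or the corrected initial mode in shadow_vcpu_init() is
enough to avoid that.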

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -129,9 +129,9 @@ void shadow_vcpu_init(struct vcpu *v)
     }
 #endif
 
-    v->arch.paging.mode = is_hvm_vcpu(v) ?
-                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 3) :
-                          &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);
+    v->arch.paging.mode = paging_mode_external(v->domain)
+                          ? &SHADOW_INTERNAL_NAME(sh_paging_mode, 2)
+                          : &SHADOW_INTERNAL_NAME(sh_paging_mode, 4);
 }
 
 #if SHADOW_AUDIT
@@ -1811,7 +1811,7 @@ static void sh_update_paging_modes(struc
         sh_detach_old_tables(v);
 
 #ifdef CONFIG_HVM
-    if ( is_hvm_domain(d) )
+    if ( paging_mode_external(d) )
     {
         const struct paging_mode *old_mode = v->arch.paging.mode;
 
@@ -1864,13 +1864,12 @@ static void sh_update_paging_modes(struc
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
         }
-
-        if ( v->arch.paging.mode != old_mode )
+        else if ( v->arch.paging.mode != old_mode )
         {
             SHADOW_PRINTK("new paging mode: %pv pe=%d gl=%u "
                           "sl=%u (was g=%u s=%u)\n",
                           v,
-                          is_hvm_domain(d) ? hvm_paging_enabled(v) : 1,
+                          hvm_paging_enabled(v),
                           v->arch.paging.mode->guest_levels,
                           v->arch.paging.mode->shadow.shadow_levels,
                           old_mode ? old_mode->guest_levels : 0,



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 11/13] x86/shadow: vCPU-s never have "no mode"
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (9 preceding siblings ...)
  2023-03-30 11:30 ` [PATCH v2 10/13] x86/shadow: make monitor table create/destroy more consistent Jan Beulich
@ 2023-03-30 11:31 ` Jan Beulich
  2023-03-30 11:31 ` [PATCH v2 12/13] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
  2023-03-30 11:32 ` [PATCH v2 13/13] x86/shadow: adjust monitor table prealloc amount Jan Beulich
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:31 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

With an initial mode installed by shadow_vcpu_init(), there's no need
for sh_update_paging_modes() to deal with the "mode is still unset"
case. Leave an assertion, though.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1864,6 +1864,8 @@ static void sh_update_paging_modes(struc
             make_cr3(v, mmfn);
             hvm_update_host_cr3(v);
         }
+        else if ( !old_mode )
+            ASSERT_UNREACHABLE();
         else if ( v->arch.paging.mode != old_mode )
         {
             SHADOW_PRINTK("new paging mode: %pv pe=%d gl=%u "
@@ -1872,11 +1874,10 @@ static void sh_update_paging_modes(struc
                           hvm_paging_enabled(v),
                           v->arch.paging.mode->guest_levels,
                           v->arch.paging.mode->shadow.shadow_levels,
-                          old_mode ? old_mode->guest_levels : 0,
-                          old_mode ? old_mode->shadow.shadow_levels : 0);
-            if ( old_mode &&
-                 (v->arch.paging.mode->shadow.shadow_levels !=
-                  old_mode->shadow.shadow_levels) )
+                          old_mode->guest_levels,
+                          old_mode->shadow.shadow_levels);
+            if ( v->arch.paging.mode->shadow.shadow_levels !=
+                 old_mode->shadow.shadow_levels )
             {
                 /* Need to make a new monitor table for the new mode */
                 mfn_t new_mfn, old_mfn;



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 12/13] x86/shadow: "monitor table" is a HVM-only concept
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (10 preceding siblings ...)
  2023-03-30 11:31 ` [PATCH v2 11/13] x86/shadow: vCPU-s never have "no mode" Jan Beulich
@ 2023-03-30 11:31 ` Jan Beulich
  2023-03-30 11:32 ` [PATCH v2 13/13] x86/shadow: adjust monitor table prealloc amount Jan Beulich
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:31 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

It looks like in the combination of aff8bf94ce65 ('x86/shadow: only
4-level guest code needs building when !HVM') and 0b841314dace
('x86/shadow: sh_{make,destroy}_monitor_table() are "even more" HVM-
only') I didn't go quite far enough: SH_type_monitor_table is also
effectively unused when !HVM.

The assertion early in sh_destroy_shadow() can have the type dropped
altogether: it shouldn't make it here in the first place. Pages of
this type are freed directly from sh_destroy_monitor_table() only.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>

--- a/xen/arch/x86/mm/shadow/common.c
+++ b/xen/arch/x86/mm/shadow/common.c
@@ -1190,7 +1190,6 @@ void sh_destroy_shadow(struct domain *d,
     ASSERT(t == SH_type_fl1_32_shadow  ||
            t == SH_type_fl1_pae_shadow ||
            t == SH_type_fl1_64_shadow  ||
-           t == SH_type_monitor_table  ||
            (is_pv_32bit_domain(d) && t == SH_type_l4_64_shadow) ||
            (page_get_owner(mfn_to_page(backpointer(sp))) == d));
 
--- a/xen/arch/x86/mm/shadow/private.h
+++ b/xen/arch/x86/mm/shadow/private.h
@@ -205,8 +205,7 @@ extern void shadow_audit_tables(struct v
 #define SH_type_l4_64_shadow   6U /* shadowing a 64-bit L4 page */
 #define SH_type_max_shadow     6U
 #define SH_type_p2m_table      7U /* in use as the p2m table */
-#define SH_type_monitor_table  8U /* in use as a monitor table */
-#define SH_type_unused         9U
+#define SH_type_unused         8U
 #endif
 
 #ifndef CONFIG_PV32 /* Unused (but uglier to #ifdef above): */



^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH v2 13/13] x86/shadow: adjust monitor table prealloc amount
  2023-03-30 11:22 [PATCH v2 00/13] x86: assorted shadow mode adjustments Jan Beulich
                   ` (11 preceding siblings ...)
  2023-03-30 11:31 ` [PATCH v2 12/13] x86/shadow: "monitor table" is a HVM-only concept Jan Beulich
@ 2023-03-30 11:32 ` Jan Beulich
  12 siblings, 0 replies; 15+ messages in thread
From: Jan Beulich @ 2023-03-30 11:32 UTC (permalink / raw
  To: xen-devel@lists.xenproject.org
  Cc: Andrew Cooper, Wei Liu, Roger Pau Monné, George Dunlap,
	Tim Deegan

While 670d6b908ff2 ('x86 shadow: Move the shadow linear mapping for
n-on-3-on-4 shadows so') bumped the amount by one too little for the
32-on-64 case (which luckily was dead code, and hence a bump wasn't
necessary in the first place), 0b841314dace ('x86/shadow:
sh_{make,destroy}_monitor_table() are "even more" HVM-only'), dropping
the dead code, then didn't adjust the amount back. Yet even the original
amount was too high in certain cases. Switch to pre-allocating just as
much as is going to be needed.

Signed-off-by: Jan Beulich <jbeulich@suse.com>
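
[Editorial note; the page counts below are an assumption drawn from the
current shape of sh_make_monitor_table(), not something stated in the
patch.] The new expression sizes the prealloc to what the function goes on
to allocate:

    /*
     * shadow_levels == 4: only the monitor L4 page itself      -> 1 page
     * shadow_levels  < 4: L4 plus the pages backing the
     *                     shadow linear mapping                -> 3 pages
     */
    shadow_prealloc(d, SH_type_monitor_table, shadow_levels < 4 ? 3 : 1);

so a flat CONFIG_PAGING_LEVELS (i.e. 4) over-allocated at least in the
4-level case.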

--- a/xen/arch/x86/mm/shadow/hvm.c
+++ b/xen/arch/x86/mm/shadow/hvm.c
@@ -738,7 +738,7 @@ mfn_t sh_make_monitor_table(const struct
     ASSERT(!pagetable_get_pfn(v->arch.hvm.monitor_table));
 
     /* Guarantee we can get the memory we need */
-    if ( !shadow_prealloc(d, SH_type_monitor_table, CONFIG_PAGING_LEVELS) )
+    if ( !shadow_prealloc(d, SH_type_monitor_table, shadow_levels < 4 ? 3 : 1) )
         return INVALID_MFN;
 
     m4mfn = shadow_alloc(d, SH_type_monitor_table, 0);



^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E()
  2023-03-30 11:24 ` [PATCH v2 01/13] x86/shadow: rename SHADOW_FOREACH_L<N>E() to FOREACH_PRESENT_L<N>E() Jan Beulich
@ 2023-03-30 12:20   ` Andrew Cooper
  0 siblings, 0 replies; 15+ messages in thread
From: Andrew Cooper @ 2023-03-30 12:20 UTC (permalink / raw
  To: Jan Beulich, xen-devel@lists.xenproject.org
  Cc: Wei Liu, Roger Pau Monné, George Dunlap, Tim Deegan

On 30/03/2023 12:24 pm, Jan Beulich wrote:
> These being local macros, the SHADOW prefix doesn't gain us much. What
> is more important to be aware of at use sites is that the supplied code
> is invoked for present entries only.
>
> While making the adjustment also properly use NULL for the 3rd argument
> at respective invocation sites.
>
> Requested-by: Andrew Cooper <andrew.cooper3@citrix.com>
> Signed-off-by: Jan Beulich <jbeulich@suse.com>

Reviewed-by: Andrew Cooper <andrew.cooper3@citrix.com>


^ permalink raw reply	[flat|nested] 15+ messages in thread
