* [PATCH v3 00/19] Virtual NUMA for PV and HVM
@ 2015-01-13 12:11 Wei Liu
  2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
                   ` (18 more replies)
  0 siblings, 19 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, jbeulich

Hi all

This is version 3 of this series rebased on top of staging. More commit
messages have been written and I hope this can help make things clearer.

This patch series implements virtual NUMA support for both PV and HVM guests.
That is, the admin can configure via libxl what virtual NUMA topology the
guest sees.

This covers stage 1 (basic vNUMA support) and part of stage 2 (vNUMA-aware
ballooning, hypervisor side) described in my previous email to xen-devel [0].

This series is broken into several parts:

1. xen patches: vNUMA debug output and vNUMA-aware memory hypercall support.
2. libxc/libxl support for PV vNUMA.
3. libxc/libxl/hypervisor support for HVM vNUMA.
4. xl vNUMA configuration documentation and parser.

One significant difference from Elena's work is that this patch series makes
use of multiple vmemranges should there be a memory hole, instead of shrinking
RAM. This matches the behaviour of real hardware.
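
As an illustrative sketch (the figures are made up, not taken from the
series): a vnode straddling the hole below 4GiB would be described by two
vmemranges carrying the same nid, using the start/end/flags/nid layout
that appears later in the series:

    /* Hypothetical: one 4096MiB vnode (nid 0) split around a 3-4GiB hole. */
    xen_vmemrange_t vmemranges[] = {
        { .start = 0x000000000ULL, .end = 0x0c0000000ULL, .flags = 0, .nid = 0 },
        { .start = 0x100000000ULL, .end = 0x140000000ULL, .flags = 0, .nid = 0 },
    };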

The vNUMA auto placement algorithm is missing at the moment and Dario is
working on it.

This series can be found at:
 git://xenbits.xen.org/people/liuw/xen.git wip.vnuma-v3

With this series, the following configuration can be used to enable virtual
NUMA support, and it works for both PV and HVM guests.

memory = 6000
vnuma_memory = [3000, 3000]
vnuma_vcpu_map = [0, 1]
vnuma_pnode_map = [0, 0]
vnuma_vdistances = [ [10,30], [30,10] ]

For example output of guest NUMA information, please see [1].

In terms of libxl / libxc internals, things are broken into several
parts:

1. libxl interface

Users of libxl can only specify how many vnodes a guest has, but currently
they have no control over the actual memory layout. Note that it would be
fairly easy to export an interface to control the memory layout in the
future.

2. libxl internal

Libxl generates some internal vNUMA configuration when building the domain,
then transforms it into libxc representations. It also validates the vNUMA
configuration along the way.

3. libxc internal

Libxc does what it's told to do. It doesn't do anything smart (in fact, I
deliberately didn't put any smart logic inside it). Libxc also reports some
information back to libxl in the HVM case, but that's it.

Wei.

[0] <20141111173606.GC21312@zion.uk.xensource.com>
[1] <1416582421-10789-1-git-send-email-wei.liu2@citrix.com>

Changes in v3:
1. Address comments made by Jan.
2. More commit messages and comments.
3. Shorten some error messages.

Changes in v2:
1. Make vnuma_vdistances mandatory.
2. Use nested list to specify distances among nodes.
3. Hvmloader uses hypercall to retrieve vNUMA information.
4. Fix some problems spotted by Jan.


Tiejun Chen (1):
  tools/hvmloader: link errno.h from xen internal

Wei Liu (18):
  xen: dump vNUMA information with debug key "u"
  xen: make two memory hypercalls vNUMA-aware
  libxc: allocate memory with vNUMA information for PV guest
  libxl: introduce vNUMA types
  libxl: add vmemrange to libxl__domain_build_state
  libxl: introduce libxl__vnuma_config_check
  libxl: x86: factor out e820_host_sanitize
  libxl: functions to build vmemranges for PV guest
  libxl: build, check and pass vNUMA info to Xen for PV guest
  xen: handle XENMEM_get_vnumainfo in compat_memory_op
  hvmloader: retrieve vNUMA information from hypervisor
  hvmloader: construct SRAT
  hvmloader: construct SLIT
  libxc: allocate memory with vNUMA information for HVM guest
  libxl: build, check and pass vNUMA info to Xen for HVM guest
  libxl: disallow memory relocation when vNUMA is enabled
  libxlutil: nested list support
  xl: vNUMA support

 .gitignore                              |    1 +
 docs/man/xl.cfg.pod.5                   |   39 ++++++
 tools/firmware/hvmloader/Makefile       |    9 +-
 tools/firmware/hvmloader/acpi/acpi2_0.h |   61 +++++++++
 tools/firmware/hvmloader/acpi/build.c   |  110 +++++++++++++++
 tools/firmware/hvmloader/hvmloader.c    |    3 +
 tools/firmware/hvmloader/vnuma.c        |  102 ++++++++++++++
 tools/firmware/hvmloader/vnuma.h        |   52 +++++++
 tools/libxc/include/xc_dom.h            |    5 +
 tools/libxc/include/xenguest.h          |    7 +
 tools/libxc/xc_dom_x86.c                |   79 +++++++++--
 tools/libxc/xc_hvm_build_x86.c          |  224 +++++++++++++++++++-----------
 tools/libxc/xc_private.h                |    2 +
 tools/libxl/Makefile                    |    2 +-
 tools/libxl/libxl_arch.h                |    6 +
 tools/libxl/libxl_arm.c                 |    9 ++
 tools/libxl/libxl_create.c              |    9 ++
 tools/libxl/libxl_dm.c                  |    6 +-
 tools/libxl/libxl_dom.c                 |   99 ++++++++++++++
 tools/libxl/libxl_internal.h            |   18 +++
 tools/libxl/libxl_types.idl             |    9 ++
 tools/libxl/libxl_vnuma.c               |  227 +++++++++++++++++++++++++++++++
 tools/libxl/libxl_x86.c                 |  105 ++++++++++++--
 tools/libxl/libxlu_cfg.c                |  180 ++++++++++++++----------
 tools/libxl/libxlu_cfg_i.h              |   15 +-
 tools/libxl/libxlu_cfg_y.c              |  110 +++++++--------
 tools/libxl/libxlu_cfg_y.h              |    2 +-
 tools/libxl/libxlu_cfg_y.y              |   19 ++-
 tools/libxl/libxlu_internal.h           |   24 +++-
 tools/libxl/libxlutil.h                 |   11 +-
 tools/libxl/xl_cmdimpl.c                |  147 ++++++++++++++++++++
 xen/arch/x86/numa.c                     |   50 ++++++-
 xen/common/compat/memory.c              |   46 +++++++
 xen/common/kernel.c                     |    2 +-
 xen/common/memory.c                     |   59 +++++++-
 xen/include/public/features.h           |    3 +
 xen/include/public/memory.h             |    2 +
 xen/include/xlat.lst                    |    2 +
 38 files changed, 1592 insertions(+), 264 deletions(-)
 create mode 100644 tools/firmware/hvmloader/vnuma.c
 create mode 100644 tools/firmware/hvmloader/vnuma.h
 create mode 100644 tools/libxl/libxl_vnuma.c

-- 
1.7.10.4


* [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:38   ` Jan Beulich
  2015-01-13 19:21   ` Andrew Cooper
  2015-01-13 12:11 ` [PATCH v3 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
                   ` (17 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, Elena Ufimsteva, Jan Beulich

Signed-off-by: Elena Ufimsteva <ufimtseva@gmail.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Constify struct vnuma_info.
2. Don't print amount of ram of a vmemrange.
3. Process softirqs when dumping information.
4. Fix format string.

Changes in v2:
1. Use unsigned int for loop vars.
2. Use strlcpy.
3. Properly align output.
---
 xen/arch/x86/numa.c |   50 +++++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 49 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/numa.c b/xen/arch/x86/numa.c
index 628a40a..c35d42a 100644
--- a/xen/arch/x86/numa.c
+++ b/xen/arch/x86/numa.c
@@ -16,6 +16,7 @@
 #include <xen/pfn.h>
 #include <asm/acpi.h>
 #include <xen/sched.h>
+#include <xen/softirq.h>
 
 static int numa_setup(char *s);
 custom_param("numa", numa_setup);
@@ -363,10 +364,12 @@ EXPORT_SYMBOL(node_data);
 static void dump_numa(unsigned char key)
 {
     s_time_t now = NOW();
-    int i;
+    unsigned int i, j, n;
+    int err;
     struct domain *d;
     struct page_info *page;
     unsigned int page_num_node[MAX_NUMNODES];
+    const struct vnuma_info *vnuma;
 
     printk("'%c' pressed -> dumping numa info (now-0x%X:%08X)\n", key,
            (u32)(now>>32), (u32)now);
@@ -393,6 +396,8 @@ static void dump_numa(unsigned char key)
     printk("Memory location of each domain:\n");
     for_each_domain ( d )
     {
+        process_pending_softirqs();
+
         printk("Domain %u (total: %u):\n", d->domain_id, d->tot_pages);
 
         for_each_online_node ( i )
@@ -408,6 +413,49 @@ static void dump_numa(unsigned char key)
 
         for_each_online_node ( i )
             printk("    Node %u: %u\n", i, page_num_node[i]);
+
+        if ( !d->vnuma )
+                continue;
+
+        vnuma = d->vnuma;
+        printk("     %u vnodes, %u vcpus:\n", vnuma->nr_vnodes, d->max_vcpus);
+        for ( i = 0; i < vnuma->nr_vnodes; i++ )
+        {
+            err = snprintf(keyhandler_scratch, 12, "%3u",
+                    vnuma->vnode_to_pnode[i]);
+            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
+                strlcpy(keyhandler_scratch, "???", 3);
+
+            printk("       vnode  %3u - pnode %s\n", i, keyhandler_scratch);
+            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
+            {
+                if ( vnuma->vmemrange[j].nid == i )
+                {
+                    printk("         %016"PRIx64" - %016"PRIx64"\n",
+                           vnuma->vmemrange[j].start,
+                           vnuma->vmemrange[j].end);
+                }
+            }
+
+            printk("       vcpus: ");
+            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
+            {
+                if ( !(j & 0x3f) )
+                    process_pending_softirqs();
+
+                if ( vnuma->vcpu_to_vnode[j] == i )
+                {
+                    if ( (n + 1) % 8 == 0 )
+                        printk("%3d\n", j);
+                    else if ( !(n % 8) && n != 0 )
+                        printk("%17d ", j);
+                    else
+                        printk("%3d ", j);
+                    n++;
+                }
+            }
+            printk("\n");
+        }
     }
 
     rcu_read_unlock(&domlist_read_lock);
-- 
1.7.10.4


* [PATCH v3 02/19] xen: make two memory hypercalls vNUMA-aware
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
  2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 12:11 ` [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
                   ` (16 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, jbeulich

Make XENMEM_increase_reservation and XENMEM_populate_physmap
vNUMA-aware.

That is, if a guest requests Xen to allocate memory for a specific vnode,
Xen can translate that vnode to a pnode using the vNUMA information of
that guest.

XENMEMF_vnode is introduced for the guest to mark that the node number is
in fact a virtual node number and should be translated by Xen.

XENFEAT_memory_op_vnode_supported is introduced to indicate that Xen is
able to translate a virtual node to a physical node.
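
For illustration only (this is not part of the patch): a vNUMA-aware guest
that has seen XENFEAT_memory_op_vnode_supported could ask for pages on its
virtual node 1 roughly as below. HYPERVISOR_memory_op, pfn_list, nr_pages
and rc stand in for the guest's own plumbing.

    struct xen_memory_reservation r = {
        .nr_extents   = nr_pages,
        .extent_order = 0,
        .domid        = DOMID_SELF,
        /* Encode the node number as usual, but mark it as a virtual node. */
        .mem_flags    = XENMEMF_node(1) | XENMEMF_vnode,
    };

    set_xen_guest_handle(r.extent_start, pfn_list);
    rc = HYPERVISOR_memory_op(XENMEM_populate_physmap, &r);

Xen then looks up vnode 1 in the guest's vNUMA information and turns the
request into an exact-node allocation on the corresponding pnode.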

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Acked-by: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Coding style fix.
2. Remove redundant assignment.

Changes in v2:
1. Return start_extent when vnode translation fails.
2. Expose new feature bit to guest.
3. Fix typo in comment.
---
 xen/common/kernel.c           |    2 +-
 xen/common/memory.c           |   59 ++++++++++++++++++++++++++++++++++++++---
 xen/include/public/features.h |    3 +++
 xen/include/public/memory.h   |    2 ++
 4 files changed, 61 insertions(+), 5 deletions(-)

diff --git a/xen/common/kernel.c b/xen/common/kernel.c
index 0d9e519..e5e0050 100644
--- a/xen/common/kernel.c
+++ b/xen/common/kernel.c
@@ -301,7 +301,7 @@ DO(xen_version)(int cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
         switch ( fi.submap_idx )
         {
         case 0:
-            fi.submap = 0;
+            fi.submap = (1U << XENFEAT_memory_op_vnode_supported);
             if ( VM_ASSIST(d, VMASST_TYPE_pae_extended_cr3) )
                 fi.submap |= (1U << XENFEAT_pae_pgdir_above_4gb);
             if ( paging_mode_translate(current->domain) )
diff --git a/xen/common/memory.c b/xen/common/memory.c
index 234dae6..671c93d 100644
--- a/xen/common/memory.c
+++ b/xen/common/memory.c
@@ -692,6 +692,50 @@ out:
     return rc;
 }
 
+static int translate_vnode_to_pnode(struct domain *d,
+                                    struct xen_memory_reservation *r,
+                                    struct memop_args *a)
+{
+    int rc = 0;
+    unsigned int vnode, pnode;
+
+    /*
+     * Note: we don't strictly require non-supported bits set to zero,
+     * so we may have exact_vnode bit set for old guests that don't
+     * support vNUMA.
+     *
+     * To distinguish spurious vnode request v.s. real one, check if
+     * d->vnuma exists.
+     */
+    if ( r->mem_flags & XENMEMF_vnode )
+    {
+        read_lock(&d->vnuma_rwlock);
+        if ( d->vnuma )
+        {
+            vnode = XENMEMF_get_node(r->mem_flags);
+
+            if ( vnode < d->vnuma->nr_vnodes )
+            {
+                pnode = d->vnuma->vnode_to_pnode[vnode];
+
+                a->memflags &=
+                    ~MEMF_node(XENMEMF_get_node(r->mem_flags));
+
+                if ( pnode != NUMA_NO_NODE )
+                {
+                    a->memflags |= MEMF_exact_node;
+                    a->memflags |= MEMF_node(pnode);
+                }
+            }
+            else
+                rc = -EINVAL;
+        }
+        read_unlock(&d->vnuma_rwlock);
+    }
+
+    return rc;
+}
+
 long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
 {
     struct domain *d;
@@ -734,10 +778,6 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             args.memflags = MEMF_bits(address_bits);
         }
 
-        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
-        if ( reservation.mem_flags & XENMEMF_exact_node_request )
-            args.memflags |= MEMF_exact_node;
-
         if ( op == XENMEM_populate_physmap
              && (reservation.mem_flags & XENMEMF_populate_on_demand) )
             args.memflags |= MEMF_populate_on_demand;
@@ -747,6 +787,17 @@ long do_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             return start_extent;
         args.domain = d;
 
+        args.memflags |= MEMF_node(XENMEMF_get_node(reservation.mem_flags));
+        if ( reservation.mem_flags & XENMEMF_exact_node_request )
+            args.memflags |= MEMF_exact_node;
+
+        rc = translate_vnode_to_pnode(d, &reservation, &args);
+        if ( rc )
+        {
+            rcu_unlock_domain(d);
+            return start_extent;
+        }
+
         rc = xsm_memory_adjust_reservation(XSM_TARGET, current->domain, d);
         if ( rc )
         {
diff --git a/xen/include/public/features.h b/xen/include/public/features.h
index 16d92aa..2110b04 100644
--- a/xen/include/public/features.h
+++ b/xen/include/public/features.h
@@ -99,6 +99,9 @@
 #define XENFEAT_grant_map_identity        12
  */
 
+/* Guest can use XENMEMF_vnode to specify virtual node for memory op. */
+#define XENFEAT_memory_op_vnode_supported 13
+
 #define XENFEAT_NR_SUBMAPS 1
 
 #endif /* __XEN_PUBLIC_FEATURES_H__ */
diff --git a/xen/include/public/memory.h b/xen/include/public/memory.h
index 595f953..2b5206b 100644
--- a/xen/include/public/memory.h
+++ b/xen/include/public/memory.h
@@ -55,6 +55,8 @@
 /* Flag to request allocation only from the node specified */
 #define XENMEMF_exact_node_request  (1<<17)
 #define XENMEMF_exact_node(n) (XENMEMF_node(n) | XENMEMF_exact_node_request)
+/* Flag to indicate the node specified is virtual node */
+#define XENMEMF_vnode  (1<<18)
 #endif
 
 struct xen_memory_reservation {
-- 
1.7.10.4


* [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
  2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
  2015-01-13 12:11 ` [PATCH v3 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 17:05   ` Ian Jackson
  2015-01-13 20:02   ` Andrew Cooper
  2015-01-13 12:11 ` [PATCH v3 04/19] libxl: introduce vNUMA types Wei Liu
                   ` (15 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

From libxc's point of view, it only needs to know the vnode-to-pnode
mapping and the size of each vnode to allocate memory accordingly. Add
these fields to the xc_dom_image structure.

The caller might not pass in vNUMA information. In that case, a dummy
layout is generated for the convenience of libxc's allocation code. The
upper layer (libxl etc.) still sees the domain as having no vNUMA
configuration.
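
A rough sketch of the contract from the caller's side (this is not part of
the patch; libxl does essentially this later in the series), for a guest
built with 2048MiB split evenly over two vnodes:

    /* Hypothetical caller: two vnodes of 1024MiB each, both on pnode 0. */
    dom->nr_vnodes      = 2;
    dom->vnode_to_pnode = xc_dom_malloc(dom, 2 * sizeof(*dom->vnode_to_pnode));
    dom->vnode_size     = xc_dom_malloc(dom, 2 * sizeof(*dom->vnode_size));

    dom->vnode_to_pnode[0] = 0;
    dom->vnode_to_pnode[1] = 0;
    dom->vnode_size[0] = dom->vnode_size[1] = 1024; /* sizes are in MiB */

The vnode sizes must add up to the domain's total memory, which the
allocation code checks before populating the physmap.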

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit log.
2. Shorten some error messages.
---
 tools/libxc/include/xc_dom.h |    5 +++
 tools/libxc/xc_dom_x86.c     |   79 ++++++++++++++++++++++++++++++++++++------
 tools/libxc/xc_private.h     |    2 ++
 3 files changed, 75 insertions(+), 11 deletions(-)

diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
index 07d7224..c459e77 100644
--- a/tools/libxc/include/xc_dom.h
+++ b/tools/libxc/include/xc_dom.h
@@ -167,6 +167,11 @@ struct xc_dom_image {
     struct xc_dom_loader *kernel_loader;
     void *private_loader;
 
+    /* vNUMA information */
+    unsigned int *vnode_to_pnode; /* vnode to pnode mapping array */
+    uint64_t *vnode_size;         /* vnode size array */
+    unsigned int nr_vnodes;       /* number of elements of above arrays */
+
     /* kernel loader */
     struct xc_dom_arch *arch_hooks;
     /* allocate up to virt_alloc_end */
diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
index bf06fe4..06a7e54 100644
--- a/tools/libxc/xc_dom_x86.c
+++ b/tools/libxc/xc_dom_x86.c
@@ -759,7 +759,8 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
 int arch_setup_meminit(struct xc_dom_image *dom)
 {
     int rc;
-    xen_pfn_t pfn, allocsz, i, j, mfn;
+    xen_pfn_t pfn, allocsz, mfn, total, pfn_base;
+    int i, j;
 
     rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
     if ( rc )
@@ -811,18 +812,74 @@ int arch_setup_meminit(struct xc_dom_image *dom)
         /* setup initial p2m */
         for ( pfn = 0; pfn < dom->total_pages; pfn++ )
             dom->p2m_host[pfn] = pfn;
-        
+
+        /* Setup dummy vNUMA information if it's not provided. Note
+         * that this is a valid state if libxl doesn't provide any
+         * vNUMA information.
+         *
+         * In this case we setup some dummy value for the convenience
+         * of the allocation code. Note that from the user's PoV the
+         * guest still has no vNUMA configuration.
+         */
+        if ( dom->nr_vnodes == 0 )
+        {
+            dom->nr_vnodes = 1;
+            dom->vnode_to_pnode = xc_dom_malloc(dom,
+                                                sizeof(*dom->vnode_to_pnode));
+            dom->vnode_to_pnode[0] = XC_VNUMA_NO_NODE;
+            dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size));
+            dom->vnode_size[0] = (dom->total_pages << PAGE_SHIFT) >> 20;
+        }
+
+        total = 0;
+        for ( i = 0; i < dom->nr_vnodes; i++ )
+            total += ((dom->vnode_size[i] << 20) >> PAGE_SHIFT);
+        if ( total != dom->total_pages )
+        {
+            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                         "%s: vNUMA page count mismatch (0x%"PRIpfn" != 0x%"PRIpfn")\n",
+                         __FUNCTION__, total, dom->total_pages);
+            return -EINVAL;
+        }
+
         /* allocate guest memory */
-        for ( i = rc = allocsz = 0;
-              (i < dom->total_pages) && !rc;
-              i += allocsz )
+        pfn_base = 0;
+        for ( i = 0; i < dom->nr_vnodes; i++ )
         {
-            allocsz = dom->total_pages - i;
-            if ( allocsz > 1024*1024 )
-                allocsz = 1024*1024;
-            rc = xc_domain_populate_physmap_exact(
-                dom->xch, dom->guest_domid, allocsz,
-                0, 0, &dom->p2m_host[i]);
+            unsigned int memflags;
+            uint64_t pages;
+
+            memflags = 0;
+            if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
+            {
+                memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[i]);
+                memflags |= XENMEMF_exact_node_request;
+            }
+
+            pages = (dom->vnode_size[i] << 20) >> PAGE_SHIFT;
+
+            for ( j = 0; j < pages; j += allocsz )
+            {
+                allocsz = pages - j;
+                if ( allocsz > 1024*1024 )
+                    allocsz = 1024*1024;
+
+                rc = xc_domain_populate_physmap_exact(dom->xch,
+                         dom->guest_domid, allocsz, 0, memflags,
+                         &dom->p2m_host[pfn_base+j]);
+
+                if ( rc )
+                {
+                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
+                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
+                                     "%s: fail to allocate 0x%"PRIx64"/0x%"PRIpfn" pages (v=%d, p=%d)\n",
+                                     __FUNCTION__, pages, dom->total_pages, i,
+                                     dom->vnode_to_pnode[i]);
+                    return rc;
+                }
+            }
+
+            pfn_base += pages;
         }
 
         /* Ensure no unclaimed pages are left unused.
diff --git a/tools/libxc/xc_private.h b/tools/libxc/xc_private.h
index 45b8644..1809674 100644
--- a/tools/libxc/xc_private.h
+++ b/tools/libxc/xc_private.h
@@ -35,6 +35,8 @@
 
 #include <xen/sys/privcmd.h>
 
+#define XC_VNUMA_NO_NODE (~0U)
+
 #if defined(HAVE_VALGRIND_MEMCHECK_H) && !defined(NDEBUG) && !defined(__MINIOS__)
 /* Compile in Valgrind client requests? */
 #include <valgrind/memcheck.h>
-- 
1.7.10.4


* [PATCH v3 04/19] libxl: introduce vNUMA types
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (2 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 15:40   ` Ian Jackson
  2015-01-13 12:11 ` [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
                   ` (14 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

A domain can contain several virtual NUMA nodes, hence we introduce an
array of them in libxl_domain_build_info.

libxl_vnode_info contains the size of memory in that node, the distances
from that node to every node, the underlying pnode and a bitmap of
vcpus.
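
A sketch of how a caller might fill in one node of a two-node guest
(illustrative only; the helper names follow libxl's usual IDL conventions,
and ctx / b_info are assumed to exist already):

    libxl_vnode_info *v = &b_info->vnuma_nodes[0];

    libxl_vnode_info_init(v);
    v->mem   = 2048;   /* MiB */
    v->pnode = 0;      /* backed by physical node 0 */

    /* Distances to vnode 0 (itself) and vnode 1. */
    v->num_distances = 2;
    v->distances = calloc(2, sizeof(*v->distances));
    v->distances[0] = 10;
    v->distances[1] = 30;

    /* vcpus 0 and 1 live on this vnode. */
    libxl_bitmap_alloc(ctx, &v->vcpus, b_info->max_vcpus);
    libxl_bitmap_set(&v->vcpus, 0);
    libxl_bitmap_set(&v->vcpus, 1);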

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Add commit message.
---
 tools/libxl/libxl_types.idl |    9 +++++++++
 1 file changed, 9 insertions(+)

diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
index 1214d2e..49e9b69 100644
--- a/tools/libxl/libxl_types.idl
+++ b/tools/libxl/libxl_types.idl
@@ -354,6 +354,13 @@ libxl_domain_sched_params = Struct("domain_sched_params",[
     ("budget",       integer, {'init_val': 'LIBXL_DOMAIN_SCHED_PARAM_BUDGET_DEFAULT'}),
     ])
 
+libxl_vnode_info = Struct("vnode_info", [
+    ("mem", uint64), # memory size of this node, in MiB
+    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
+    ("pnode", uint32), # physical node of this node
+    ("vcpus", libxl_bitmap), # vcpus in this node
+    ])
+
 libxl_domain_build_info = Struct("domain_build_info",[
     ("max_vcpus",       integer),
     ("avail_vcpus",     libxl_bitmap),
@@ -374,6 +381,8 @@ libxl_domain_build_info = Struct("domain_build_info",[
     ("disable_migrate", libxl_defbool),
     ("cpuid",           libxl_cpuid_policy_list),
     ("blkdev_start",    string),
+
+    ("vnuma_nodes", Array(libxl_vnode_info, "num_vnuma_nodes")),
     
     ("device_model_version", libxl_device_model_version),
     ("device_model_stubdomain", libxl_defbool),
-- 
1.7.10.4


* [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (3 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 04/19] libxl: introduce vNUMA types Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 17:02   ` Ian Jackson
  2015-01-13 12:11 ` [PATCH v3 06/19] libxl: introduce libxl__vnuma_config_check Wei Liu
                   ` (13 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

A vnode consists of one or more vmemranges (virtual memory ranges). One
case where a vnode has multiple vmemranges is when there is a memory hole
inside it.

Currently we haven't exported the vmemrange interface to libxl users.
Vmemranges are generated during domain build, so the relevant structures
live in the domain build state.

If we later discover we need to export the interface, those structures
can be moved to libxl_domain_build_info as well.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit message.
---
 tools/libxl/libxl_internal.h |    3 +++
 1 file changed, 3 insertions(+)

diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 934465a..6d3ac58 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -973,6 +973,9 @@ typedef struct {
     libxl__file_reference pv_ramdisk;
     const char * pv_cmdline;
     bool pvh_enabled;
+
+    xen_vmemrange_t *vmemranges;
+    uint32_t num_vmemranges;
 } libxl__domain_build_state;
 
 _hidden int libxl__build_pre(libxl__gc *gc, uint32_t domid,
-- 
1.7.10.4


* [PATCH v3 06/19] libxl: introduce libxl__vnuma_config_check
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (4 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 12:11 ` [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize Wei Liu
                   ` (12 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

This function is used to check whether a vNUMA configuration (be it
auto-generated or supplied by the user) is valid.

The checks performed are listed in the comment above the function.

This vNUMA function (and future ones) is placed in a new file called
libxl_vnuma.c.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit log.
2. Shorten two error messages.
---
 tools/libxl/Makefile         |    2 +-
 tools/libxl/libxl_internal.h |    5 ++
 tools/libxl/libxl_vnuma.c    |  137 ++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 143 insertions(+), 1 deletion(-)
 create mode 100644 tools/libxl/libxl_vnuma.c

diff --git a/tools/libxl/Makefile b/tools/libxl/Makefile
index b417372..22ae0c3 100644
--- a/tools/libxl/Makefile
+++ b/tools/libxl/Makefile
@@ -93,7 +93,7 @@ LIBXL_LIBS += -lyajl
 LIBXL_OBJS = flexarray.o libxl.o libxl_create.o libxl_dm.o libxl_pci.o \
 			libxl_dom.o libxl_exec.o libxl_xshelp.o libxl_device.o \
 			libxl_internal.o libxl_utils.o libxl_uuid.o \
-			libxl_json.o libxl_aoutils.o libxl_numa.o \
+			libxl_json.o libxl_aoutils.o libxl_numa.o libxl_vnuma.o \
 			libxl_save_callout.o _libxl_save_msgs_callout.o \
 			libxl_qmp.o libxl_event.o libxl_fork.o $(LIBXL_OBJS-y)
 LIBXL_OBJS += libxl_genid.o
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 6d3ac58..39356ba 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3394,6 +3394,11 @@ void libxl__numa_candidate_put_nodemap(libxl__gc *gc,
     libxl_bitmap_copy(CTX, &cndt->nodemap, nodemap);
 }
 
+/* Check if vNUMA config is valid. Returns 0 if valid. */
+int libxl__vnuma_config_check(libxl__gc *gc,
+                              const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
new file mode 100644
index 0000000..439f5ab
--- /dev/null
+++ b/tools/libxl/libxl_vnuma.c
@@ -0,0 +1,137 @@
+/*
+ * Copyright (C) 2014      Citrix Ltd.
+ * Author Wei Liu <wei.liu2@citrix.com>
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU Lesser General Public License as published
+ * by the Free Software Foundation; version 2.1 only. with the special
+ * exception on linking described in file LICENSE.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
+ * GNU Lesser General Public License for more details.
+ */
+#include "libxl_osdeps.h" /* must come before any other headers */
+#include "libxl_internal.h"
+#include <stdlib.h>
+
+/* Sort vmemranges in ascending order with "start" */
+static int compare_vmemrange(const void *a, const void *b)
+{
+    const xen_vmemrange_t *x = a, *y = b;
+    if (x->start < y->start)
+        return -1;
+    if (x->start > y->start)
+        return 1;
+    return 0;
+}
+
+/* Check if vNUMA configuration is valid:
+ *  1. all pnodes inside vnode_to_pnode array are valid
+ *  2. one vcpu belongs to and only belongs to one vnode
+ *  3. each vmemrange is valid and doesn't overlap with each other
+ */
+int libxl__vnuma_config_check(libxl__gc *gc,
+			      const libxl_domain_build_info *b_info,
+                              const libxl__domain_build_state *state)
+{
+    int i, j, rc = ERROR_INVAL, nr_nodes;
+    libxl_numainfo *ninfo = NULL;
+    uint64_t total_ram = 0;
+    libxl_bitmap cpumap;
+    libxl_vnode_info *p;
+
+    libxl_bitmap_init(&cpumap);
+
+    /* Check pnode specified is valid */
+    ninfo = libxl_get_numainfo(CTX, &nr_nodes);
+    if (!ninfo) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR, "libxl_get_numainfo failed");
+        goto out;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        uint32_t pnode;
+
+        p = &b_info->vnuma_nodes[i];
+        pnode = p->pnode;
+
+        /* The pnode specified is not valid? */
+        if (pnode >= nr_nodes) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Invalid pnode %d specified",
+                       pnode);
+            goto out;
+        }
+
+        total_ram += p->mem;
+    }
+
+    if (total_ram != (b_info->max_memkb >> 10)) {
+        LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                   "Amount of memory mismatch (0x%"PRIx64" != 0x%"PRIx64")",
+                   total_ram, (b_info->max_memkb >> 10));
+        goto out;
+    }
+
+    /* Check vcpu mapping */
+    libxl_cpu_bitmap_alloc(CTX, &cpumap, b_info->max_vcpus);
+    libxl_bitmap_set_none(&cpumap);
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        p = &b_info->vnuma_nodes[i];
+        libxl_for_each_set_bit(j, p->vcpus) {
+            if (!libxl_bitmap_test(&cpumap, j))
+                libxl_bitmap_set(&cpumap, j);
+            else {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Vcpu %d assigned more than once", j);
+                goto out;
+            }
+        }
+    }
+
+    for (i = 0; i < b_info->max_vcpus; i++) {
+        if (!libxl_bitmap_test(&cpumap, i)) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vcpu %d is not assigned to any vnode", i);
+            goto out;
+        }
+    }
+
+    /* Check vmemranges */
+    qsort(state->vmemranges, state->num_vmemranges, sizeof(xen_vmemrange_t),
+          compare_vmemrange);
+
+    for (i = 0; i < state->num_vmemranges; i++) {
+        if (state->vmemranges[i].end < state->vmemranges[i].start) {
+                LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                           "Vmemrange end < start");
+                goto out;
+        }
+    }
+
+    for (i = 0; i < state->num_vmemranges - 1; i++) {
+        if (state->vmemranges[i].end > state->vmemranges[i+1].start) {
+            LIBXL__LOG(CTX, LIBXL__LOG_ERROR,
+                       "Vmemranges overlapped, 0x%"PRIx64"-0x%"PRIx64", 0x%"PRIx64"-0x%"PRIx64,
+                       state->vmemranges[i].start, state->vmemranges[i].end,
+                       state->vmemranges[i+1].start, state->vmemranges[i+1].end);
+            goto out;
+        }
+    }
+
+    rc = 0;
+out:
+    if (ninfo) libxl_numainfo_dispose(ninfo);
+    libxl_bitmap_dispose(&cpumap);
+    return rc;
+}
+
+/*
+ * Local variables:
+ * mode: C
+ * c-basic-offset: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4


* [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (5 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 06/19] libxl: introduce libxl__vnuma_config_check Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 17:07   ` Ian Jackson
  2015-01-13 21:00   ` Andrew Cooper
  2015-01-13 12:11 ` [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest Wei Liu
                   ` (11 subsequent siblings)
  18 siblings, 2 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

This function gets the machine E820 map and sanitizes it according to
the PV guest configuration.

This will be used in a later patch. No functional change is introduced
in this patch.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
 tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
 1 file changed, 22 insertions(+), 9 deletions(-)

diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index 9ceb373..e959e37 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
     return 0;
 }
 
+static int e820_host_sanitize(libxl__gc *gc,
+                              libxl_domain_build_info *b_info,
+                              struct e820entry map[],
+                              uint32_t *nr)
+{
+    int rc;
+
+    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);
+    if (rc < 0) {
+        errno = rc;
+        return ERROR_FAIL;
+    }
+
+    *nr = rc;
+
+    rc = e820_sanitize(CTX, map, nr, b_info->target_memkb,
+                       (b_info->max_memkb - b_info->target_memkb) +
+                       b_info->u.pv.slack_memkb);
+    return rc;
+}
+
 static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
         libxl_domain_config *d_config)
 {
@@ -223,15 +244,7 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
     if (!libxl_defbool_val(b_info->u.pv.e820_host))
         return ERROR_INVAL;
 
-    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
-    if (rc < 0) {
-        errno = rc;
-        return ERROR_FAIL;
-    }
-    nr = rc;
-    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
-                       (b_info->max_memkb - b_info->target_memkb) +
-                       b_info->u.pv.slack_memkb);
+    rc = e820_host_sanitize(gc, b_info, map, &nr);
     if (rc)
         return ERROR_FAIL;
 
-- 
1.7.10.4


* [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (6 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 18:15   ` Ian Jackson
  2015-01-13 12:11 ` [PATCH v3 09/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (10 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

Introduce an arch-independent routine to generate one vmemrange per
vnode. Also introduce arch-dependent routines for different
architectures because part of the process is arch-specific -- ARM
doesn't have NUMA support yet and E820 is x86 only.

For those x86 guests which care about the machine E820 map (i.e. with
e820_host=1), each vnode is further split into several vmemranges to
accommodate memory holes. A few stubs for libxl_arm.c are created.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit log.
---
 tools/libxl/libxl_arch.h     |    6 ++++
 tools/libxl/libxl_arm.c      |    9 +++++
 tools/libxl/libxl_internal.h |    5 +++
 tools/libxl/libxl_vnuma.c    |   34 +++++++++++++++++++
 tools/libxl/libxl_x86.c      |   74 ++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 128 insertions(+)

diff --git a/tools/libxl/libxl_arch.h b/tools/libxl/libxl_arch.h
index d3bc136..e249048 100644
--- a/tools/libxl/libxl_arch.h
+++ b/tools/libxl/libxl_arch.h
@@ -27,4 +27,10 @@ int libxl__arch_domain_init_hw_description(libxl__gc *gc,
 int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
                                       libxl_domain_build_info *info,
                                       struct xc_dom_image *dom);
+
+/* build vNUMA vmemrange with arch specific information */
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state);
 #endif
diff --git a/tools/libxl/libxl_arm.c b/tools/libxl/libxl_arm.c
index 65a762b..d3968a7 100644
--- a/tools/libxl/libxl_arm.c
+++ b/tools/libxl/libxl_arm.c
@@ -707,6 +707,15 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *info,
+                                      libxl__domain_build_state *state)
+{
+    /* Don't touch anything. */
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 39356ba..73d533a 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3399,6 +3399,11 @@ int libxl__vnuma_config_check(libxl__gc *gc,
                               const libxl_domain_build_info *b_info,
                               const libxl__domain_build_state *state);
 
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state);
+
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
 
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index 439f5ab..a72abe2 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -14,6 +14,7 @@
  */
 #include "libxl_osdeps.h" /* must come before any other headers */
 #include "libxl_internal.h"
+#include "libxl_arch.h"
 #include <stdlib.h>
 
 /* Sort vmemranges in ascending order with "start" */
@@ -128,6 +129,39 @@ out:
     return rc;
 }
 
+/* Build vmemranges for PV guest */
+int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
+                                    uint32_t domid,
+                                    libxl_domain_build_info *b_info,
+                                    libxl__domain_build_state *state)
+{
+    int i;
+    uint64_t next;
+    xen_vmemrange_t *v = NULL;
+
+    assert(state->vmemranges == NULL);
+
+    /* Generate one vmemrange for each virtual node. */
+    next = 0;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        v = libxl__realloc(gc, v, sizeof(*v) * (i+1));
+
+        v[i].start = next;
+        v[i].end = next + (p->mem << 20); /* mem is in MiB */
+        v[i].flags = 0;
+        v[i].nid = i;
+
+        next = v[i].end;
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = i;
+
+    return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
index e959e37..2018afc 100644
--- a/tools/libxl/libxl_x86.c
+++ b/tools/libxl/libxl_x86.c
@@ -338,6 +338,80 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
     return 0;
 }
 
+int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
+                                      uint32_t domid,
+                                      libxl_domain_build_info *b_info,
+                                      libxl__domain_build_state *state)
+{
+    int i, x, n, rc;
+    uint32_t nr_e820;
+    struct e820entry map[E820MAX];
+    xen_vmemrange_t *v;
+
+    /* Only touch vmemranges if it's PV guest and e820_host is true */
+    if (!(b_info->type == LIBXL_DOMAIN_TYPE_PV &&
+          libxl_defbool_val(b_info->u.pv.e820_host))) {
+        rc = 0;
+        goto out;
+    }
+
+    rc = e820_host_sanitize(gc, b_info, map, &nr_e820);
+    if (rc) goto out;
+
+    /* Ditch old vmemranges and start with host E820 map. Note, memory
+     * was gc allocated.
+     */
+    state->vmemranges = NULL;
+    state->num_vmemranges = 0;
+
+    n = 0; /* E820 counter */
+    x = 0;
+    v = NULL;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+        uint64_t remaining = (p->mem << 20);
+
+        while (remaining > 0) {
+            if (n >= nr_e820) {
+                rc = ERROR_FAIL;
+                goto out;
+            }
+
+            /* Skip non RAM region */
+            if (map[n].type != E820_RAM) {
+                n++;
+                continue;
+            }
+
+            v = libxl__realloc(gc, v, sizeof(xen_vmemrange_t) * (x+1));
+
+            if (map[n].size >= remaining) {
+                v[x].start = map[n].addr;
+                v[x].end = map[n].addr + remaining;
+                map[n].addr += remaining;
+                map[n].size -= remaining;
+                remaining = 0;
+            } else {
+                v[x].start = map[n].addr;
+                v[x].end = map[n].addr + map[n].size;
+                remaining -= map[n].size;
+                n++;
+            }
+
+            v[x].flags = 0;
+            v[x].nid = i;
+            x++;
+        }
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = x;
+
+    rc = 0;
+out:
+    return rc;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4


* [PATCH v3 09/19] libxl: build, check and pass vNUMA info to Xen for PV guest
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (7 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 12:11 ` [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
                   ` (9 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

Transform the user-supplied vNUMA configuration into libxl internal
representations, and finally into libxc representations. Check the
validity of the configuration along the way.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Add more commit log.
---
 tools/libxl/libxl_dom.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 71 insertions(+)

diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index 48d661a..f06c88b 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -515,6 +515,51 @@ retry_transaction:
     return 0;
 }
 
+static int set_vnuma_info(libxl__gc *gc, uint32_t domid,
+                          const libxl_domain_build_info *info,
+                          const libxl__domain_build_state *state)
+{
+    int rc = 0;
+    int i, nr_vdistance;
+    unsigned int *vcpu_to_vnode, *vnode_to_pnode, *vdistance = NULL;
+
+    vcpu_to_vnode = libxl__calloc(gc, info->max_vcpus,
+                                  sizeof(unsigned int));
+    vnode_to_pnode = libxl__calloc(gc, info->num_vnuma_nodes,
+                                   sizeof(unsigned int));
+
+    nr_vdistance = info->num_vnuma_nodes * info->num_vnuma_nodes;
+    vdistance = libxl__calloc(gc, nr_vdistance, sizeof(unsigned int));
+
+    for (i = 0; i < info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *v = &info->vnuma_nodes[i];
+        int bit;
+
+        /* vnode to pnode mapping */
+        vnode_to_pnode[i] = v->pnode;
+
+        /* vcpu to vnode mapping */
+        libxl_for_each_set_bit(bit, v->vcpus)
+            vcpu_to_vnode[bit] = i;
+
+        /* node distances */
+        assert(info->num_vnuma_nodes == v->num_distances);
+        memcpy(vdistance + (i * info->num_vnuma_nodes),
+               v->distances,
+               v->num_distances * sizeof(unsigned int));
+    }
+
+    if ( xc_domain_setvnuma(CTX->xch, domid, info->num_vnuma_nodes,
+                            state->num_vmemranges, info->max_vcpus,
+                            state->vmemranges, vdistance,
+                            vcpu_to_vnode, vnode_to_pnode) < 0 ) {
+        LOGE(ERROR, "xc_domain_setvnuma failed");
+        rc = ERROR_FAIL;
+    }
+
+    return rc;
+}
+
 int libxl__build_pv(libxl__gc *gc, uint32_t domid,
              libxl_domain_build_info *info, libxl__domain_build_state *state)
 {
@@ -572,6 +617,32 @@ int libxl__build_pv(libxl__gc *gc, uint32_t domid,
     dom->xenstore_domid = state->store_domid;
     dom->claim_enabled = libxl_defbool_val(info->claim_mode);
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        ret = libxl__vnuma_build_vmemrange_pv(gc, domid, info, state);
+        if (ret) {
+            LOGE(ERROR, "cannot build vmemranges");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+
+        dom->nr_vnodes = info->num_vnuma_nodes;
+        dom->vnode_to_pnode = xc_dom_malloc(dom, sizeof(*dom->vnode_to_pnode) *
+                                            dom->nr_vnodes);
+        dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size) *
+                                        dom->nr_vnodes);
+
+        for (i = 0; i < dom->nr_vnodes; i++) {
+            dom->vnode_to_pnode[i] = info->vnuma_nodes[i].pnode;
+            dom->vnode_size[i] = info->vnuma_nodes[i].mem;
+        }
+    }
+
     if ( (ret = xc_dom_boot_xen_init(dom, ctx->xch, domid)) != 0 ) {
         LOGE(ERROR, "xc_dom_boot_xen_init failed");
         goto out;
-- 
1.7.10.4


* [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (8 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 09/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:41   ` Jan Beulich
  2015-01-13 12:11 ` [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal Wei Liu
                   ` (8 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, jbeulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <jbeulich@suse.com>
---
Changes in v3:
1. Fix hard tabs.
2. Fix subject line.
3. Add extra hunk to handle -ENOBUFS (hence drop Reviewed-by:)
---
 xen/common/compat/memory.c |   46 ++++++++++++++++++++++++++++++++++++++++++++
 xen/include/xlat.lst       |    2 ++
 2 files changed, 48 insertions(+)

diff --git a/xen/common/compat/memory.c b/xen/common/compat/memory.c
index 06c90be..5e69282 100644
--- a/xen/common/compat/memory.c
+++ b/xen/common/compat/memory.c
@@ -15,6 +15,7 @@ CHECK_TYPE(domid);
 #undef xen_domid_t
 
 CHECK_mem_access_op;
+CHECK_vmemrange;
 
 int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
 {
@@ -32,12 +33,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             struct xen_add_to_physmap *atp;
             struct xen_add_to_physmap_batch *atpb;
             struct xen_remove_from_physmap *xrfp;
+            struct xen_vnuma_topology_info *vnuma;
         } nat;
         union {
             struct compat_memory_reservation rsrv;
             struct compat_memory_exchange xchg;
             struct compat_add_to_physmap atp;
             struct compat_add_to_physmap_batch atpb;
+            struct compat_vnuma_topology_info vnuma;
         } cmp;
 
         set_xen_guest_handle(nat.hnd, COMPAT_ARG_XLAT_VIRT_BASE);
@@ -273,11 +276,46 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
             break;
         }
 
+        case XENMEM_get_vnumainfo:
+        {
+            enum XLAT_vnuma_topology_info_vdistance vdistance =
+                XLAT_vnuma_topology_info_vdistance_h;
+            enum XLAT_vnuma_topology_info_vcpu_to_vnode vcpu_to_vnode =
+                XLAT_vnuma_topology_info_vcpu_to_vnode_h;
+            enum XLAT_vnuma_topology_info_vmemrange vmemrange =
+                XLAT_vnuma_topology_info_vmemrange_h;
+
+            if ( copy_from_guest(&cmp.vnuma, compat, 1) )
+                return -EFAULT;
+
+#define XLAT_vnuma_topology_info_HNDL_vdistance_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vdistance.h, (_s_)->vdistance.h)
+#define XLAT_vnuma_topology_info_HNDL_vcpu_to_vnode_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vcpu_to_vnode.h, (_s_)->vcpu_to_vnode.h)
+#define XLAT_vnuma_topology_info_HNDL_vmemrange_h(_d_, _s_)		\
+            guest_from_compat_handle((_d_)->vmemrange.h, (_s_)->vmemrange.h)
+
+            XLAT_vnuma_topology_info(nat.vnuma, &cmp.vnuma);
+
+#undef XLAT_vnuma_topology_info_HNDL_vdistance_h
+#undef XLAT_vnuma_topology_info_HNDL_vcpu_to_vnode_h
+#undef XLAT_vnuma_topology_info_HNDL_vmemrange_h
+            break;
+        }
+
         default:
             return compat_arch_memory_op(cmd, compat);
         }
 
         rc = do_memory_op(cmd, nat.hnd);
+        if ( rc == -ENOBUFS && op == XENMEM_get_vnumainfo )
+        {
+            cmp.vnuma.nr_vnodes = nat.vnuma->nr_vnodes;
+            cmp.vnuma.nr_vcpus = nat.vnuma->nr_vcpus;
+            cmp.vnuma.nr_vmemranges = nat.vnuma->nr_vmemranges;
+            if ( __copy_to_guest(compat, &cmp.vnuma, 1) )
+                rc = -EFAULT;
+        }
         if ( rc < 0 )
             break;
 
@@ -398,6 +436,14 @@ int compat_memory_op(unsigned int cmd, XEN_GUEST_HANDLE_PARAM(void) compat)
         case XENMEM_remove_from_physmap:
             break;
 
+        case XENMEM_get_vnumainfo:
+            cmp.vnuma.nr_vnodes = nat.vnuma->nr_vnodes;
+            cmp.vnuma.nr_vcpus = nat.vnuma->nr_vcpus;
+            cmp.vnuma.nr_vmemranges = nat.vnuma->nr_vmemranges;
+            if ( __copy_to_guest(compat, &cmp.vnuma, 1) )
+                rc = -EFAULT;
+            break;
+
         default:
             domain_crash(current->domain);
             split = 0;
diff --git a/xen/include/xlat.lst b/xen/include/xlat.lst
index 41b3e35..9c9fd9a 100644
--- a/xen/include/xlat.lst
+++ b/xen/include/xlat.lst
@@ -64,6 +64,8 @@
 ?	mem_access_op		memory.h
 !	pod_target			memory.h
 !	remove_from_physmap		memory.h
+?	vmemrange			memory.h
+!	vnuma_topology_info		memory.h
 ?	physdev_eoi			physdev.h
 ?	physdev_get_free_pirq		physdev.h
 ?	physdev_irq			physdev.h
-- 
1.7.10.4


* [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (9 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:32   ` Jan Beulich
  2015-01-13 12:11 ` [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
                   ` (7 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC
  To: xen-devel; +Cc: Tiejun Chen, dario.faggioli, jbeulich

From: Tiejun Chen <tiejun.chen@intel.com>

We need to act on some specific hypercall error numbers, so we
require the hypervisor's view of the errno.h values rather than
just the build environment's numbers. To that end, link this header
file from Xen.

Acked-by: Jan Beulich <jbeulich@suse.com>
Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
---
 .gitignore                        |    1 +
 tools/firmware/hvmloader/Makefile |    7 ++++++-
 2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/.gitignore b/.gitignore
index 8c8c06f..30b90f9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -127,6 +127,7 @@ tools/firmware/hvmloader/acpi/ssdt_*.h
 tools/firmware/hvmloader/hvmloader
 tools/firmware/hvmloader/roms.h
 tools/firmware/hvmloader/roms.inc
+tools/firmware/hvmloader/errno.h
 tools/firmware/rombios/BIOS-bochs-[^/]*
 tools/firmware/rombios/_rombios[^/]*_.c
 tools/firmware/rombios/rombios[^/]*.s
diff --git a/tools/firmware/hvmloader/Makefile b/tools/firmware/hvmloader/Makefile
index 46a79c5..ef2337b 100644
--- a/tools/firmware/hvmloader/Makefile
+++ b/tools/firmware/hvmloader/Makefile
@@ -87,6 +87,11 @@ endif
 all: subdirs-all
 	$(MAKE) hvmloader
 
+subdirs-all: errno.h
+
+errno.h:
+	ln -sf $(XEN_ROOT)/xen/include/xen/errno.h .
+
 ovmf.o rombios.o seabios.o hvmloader.o: roms.inc
 smbios.o: CFLAGS += -D__SMBIOS_DATE__="\"$(shell date +%m/%d/%Y)\""
 
@@ -136,7 +141,7 @@ endif
 
 .PHONY: clean
 clean: subdirs-clean
-	rm -f roms.inc roms.inc.new acpi.h
+	rm -f roms.inc roms.inc.new acpi.h errno.h
 	rm -f hvmloader hvmloader.tmp *.o $(DEPS)
 
 -include $(DEPS)
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (10 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:50   ` Jan Beulich
  2015-01-13 12:11 ` [PATCH v3 13/19] hvmloader: construct SRAT Wei Liu
                   ` (6 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, Jan Beulich

Hvmloader issues the XENMEM_get_vnumainfo hypercall and stores the
retrieved information in scratch space for later use.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Move init_vnuma_info before ACPI stuff.
2. Fix errno.h inclusion.
3. Remove upper limits and use loop.
---
 tools/firmware/hvmloader/Makefile    |    2 +-
 tools/firmware/hvmloader/hvmloader.c |    3 +
 tools/firmware/hvmloader/vnuma.c     |  102 ++++++++++++++++++++++++++++++++++
 tools/firmware/hvmloader/vnuma.h     |   52 +++++++++++++++++
 4 files changed, 158 insertions(+), 1 deletion(-)
 create mode 100644 tools/firmware/hvmloader/vnuma.c
 create mode 100644 tools/firmware/hvmloader/vnuma.h

diff --git a/tools/firmware/hvmloader/Makefile b/tools/firmware/hvmloader/Makefile
index ef2337b..8bf119d 100644
--- a/tools/firmware/hvmloader/Makefile
+++ b/tools/firmware/hvmloader/Makefile
@@ -29,7 +29,7 @@ LOADADDR = 0x100000
 CFLAGS += $(CFLAGS_xeninclude)
 
 OBJS  = hvmloader.o mp_tables.o util.o smbios.o 
-OBJS += smp.o cacheattr.o xenbus.o
+OBJS += smp.o cacheattr.o xenbus.o vnuma.o
 OBJS += e820.o pci.o pir.o ctype.o
 OBJS += hvm_param.o
 ifeq ($(debug),y)
diff --git a/tools/firmware/hvmloader/hvmloader.c b/tools/firmware/hvmloader/hvmloader.c
index 7b0da38..25b7f08 100644
--- a/tools/firmware/hvmloader/hvmloader.c
+++ b/tools/firmware/hvmloader/hvmloader.c
@@ -26,6 +26,7 @@
 #include "pci_regs.h"
 #include "apic_regs.h"
 #include "acpi/acpi2_0.h"
+#include "vnuma.h"
 #include <xen/version.h>
 #include <xen/hvm/params.h>
 
@@ -310,6 +311,8 @@ int main(void)
 
     if ( acpi_enabled )
     {
+        init_vnuma_info();
+
         if ( bios->acpi_build_tables )
         {
             printf("Loading ACPI ...\n");
diff --git a/tools/firmware/hvmloader/vnuma.c b/tools/firmware/hvmloader/vnuma.c
new file mode 100644
index 0000000..1f3d4a5
--- /dev/null
+++ b/tools/firmware/hvmloader/vnuma.c
@@ -0,0 +1,102 @@
+/*
+ * vnuma.c: obtain vNUMA information from hypervisor
+ *
+ * Copyright (c) 2014 Wei Liu, Citrix Systems (R&D) Ltd.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ * 1. Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ * 2. Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in the
+ *    documentation and/or other materials provided with the distribution.
+ *
+ * THIS SOFTWARE IS PROVIDED BY AUTHOR AND CONTRIBUTORS ``AS IS'' AND
+ * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+ * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
+ * ARE DISCLAIMED.  IN NO EVENT SHALL AUTHOR OR CONTRIBUTORS BE LIABLE
+ * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
+ * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
+ * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
+ * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
+ * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
+ * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
+ * SUCH DAMAGE.
+ */
+
+#include "util.h"
+#include "hypercall.h"
+#include "vnuma.h"
+#include "errno.h"
+
+unsigned int nr_vnodes, nr_vmemranges;
+unsigned int *vcpu_to_vnode, *vdistance;
+xen_vmemrange_t *vmemrange;
+
+void init_vnuma_info(void)
+{
+    int rc, retry = 0;
+    struct xen_vnuma_topology_info vnuma_topo;
+
+    vcpu_to_vnode = scratch_alloc(sizeof(uint32_t) * hvm_info->nr_vcpus, 0);
+
+    rc = -EAGAIN;
+    while ( rc == -EAGAIN && retry < 10 )
+    {
+        vnuma_topo.domid = DOMID_SELF;
+        vnuma_topo.pad = 0;
+        vnuma_topo.nr_vcpus = 0;
+        vnuma_topo.nr_vnodes = 0;
+        vnuma_topo.nr_vmemranges = 0;
+
+        set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
+        set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
+        set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);
+
+        rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
+
+        if ( rc == -EOPNOTSUPP )
+            return;
+
+        if ( rc != -ENOBUFS )
+            break;
+
+        ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);
+
+        vdistance = scratch_alloc(sizeof(uint32_t) * vnuma_topo.nr_vnodes *
+                                  vnuma_topo.nr_vnodes, 0);
+        vmemrange = scratch_alloc(sizeof(xen_vmemrange_t) *
+                                  vnuma_topo.nr_vmemranges, 0);
+
+        set_xen_guest_handle(vnuma_topo.vdistance.h, vdistance);
+        set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, vcpu_to_vnode);
+        set_xen_guest_handle(vnuma_topo.vmemrange.h, vmemrange);
+
+        rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
+
+        retry++;
+    }
+
+    if ( rc < 0 )
+    {
+        printf("Failed to retrieve vNUMA information, rc = %d\n", rc);
+        return;
+    }
+
+    nr_vnodes = vnuma_topo.nr_vnodes;
+    nr_vmemranges = vnuma_topo.nr_vmemranges;
+
+    return;
+}
+
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
diff --git a/tools/firmware/hvmloader/vnuma.h b/tools/firmware/hvmloader/vnuma.h
new file mode 100644
index 0000000..63b648a
--- /dev/null
+++ b/tools/firmware/hvmloader/vnuma.h
@@ -0,0 +1,52 @@
+/******************************************************************************
+ * vnuma.h
+ *
+ * Copyright (c) 2014, Wei Liu
+ *
+ * This program is free software; you can redistribute it and/or
+ * modify it under the terms of the GNU General Public License version 2
+ * as published by the Free Software Foundation; or, when distributed
+ * separately from the Linux kernel or incorporated into other
+ * software packages, subject to the following license:
+ *
+ * Permission is hereby granted, free of charge, to any person obtaining a copy
+ * of this source file (the "Software"), to deal in the Software without
+ * restriction, including without limitation the rights to use, copy, modify,
+ * merge, publish, distribute, sublicense, and/or sell copies of the Software,
+ * and to permit persons to whom the Software is furnished to do so, subject to
+ * the following conditions:
+ *
+ * The above copyright notice and this permission notice shall be included in
+ * all copies or substantial portions of the Software.
+ *
+ * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+ * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+ * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+ * AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+ * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING
+ * FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS
+ * IN THE SOFTWARE.
+ */
+
+#ifndef __HVMLOADER_VNUMA_H__
+#define __HVMLOADER_VNUMA_H__
+
+#include <xen/memory.h>
+
+extern unsigned int nr_vnodes, nr_vmemranges;
+extern unsigned int *vcpu_to_vnode, *vdistance;
+extern xen_vmemrange_t *vmemrange;
+
+void init_vnuma_info(void);
+
+#endif /* __HVMLOADER_VNUMA_H__ */
+
+/*
+ * Local variables:
+ * mode: C
+ * c-file-style: "BSD"
+ * c-basic-offset: 4
+ * tab-width: 4
+ * indent-tabs-mode: nil
+ * End:
+ */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 13/19] hvmloader: construct SRAT
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (11 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:52   ` Jan Beulich
  2015-01-13 12:11 ` [PATCH v3 14/19] hvmloader: construct SLIT Wei Liu
                   ` (5 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Remove redundant variable.
2. Coding style fix.
3. Add assertion.
Changes in v2:
1. Remove explicit zero initializers.
2. Adapt to new vNUMA retrieval routine.
3. Move SRAT very late in secondary table build.
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |   53 +++++++++++++++++++++++
 tools/firmware/hvmloader/acpi/build.c   |   72 +++++++++++++++++++++++++++++++
 2 files changed, 125 insertions(+)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 7b22d80..6169213 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -364,6 +364,57 @@ struct acpi_20_madt_intsrcovr {
 };
 
 /*
+ * System Resource Affinity Table header definition (SRAT)
+ */
+struct acpi_20_srat {
+    struct acpi_header header;
+    uint32_t table_revision;
+    uint32_t reserved2[2];
+};
+
+#define ACPI_SRAT_TABLE_REVISION 1
+
+/*
+ * System Resource Affinity Table structure types.
+ */
+#define ACPI_PROCESSOR_AFFINITY 0x0
+#define ACPI_MEMORY_AFFINITY    0x1
+struct acpi_20_srat_processor {
+    uint8_t type;
+    uint8_t length;
+    uint8_t domain;
+    uint8_t apic_id;
+    uint32_t flags;
+    uint8_t sapic_id;
+    uint8_t domain_hi[3];
+    uint32_t reserved;
+};
+
+/*
+ * Local APIC Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_LOCAL_APIC_AFFIN_ENABLED (1 << 0)
+
+struct acpi_20_srat_memory {
+    uint8_t type;
+    uint8_t length;
+    uint32_t domain;
+    uint16_t reserved;
+    uint64_t base_address;
+    uint64_t mem_length;
+    uint32_t reserved2;
+    uint32_t flags;
+    uint64_t reserved3;
+};
+
+/*
+ * Memory Affinity Flags.  All other bits are reserved and must be 0.
+ */
+#define ACPI_MEM_AFFIN_ENABLED (1 << 0)
+#define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
+#define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
+
+/*
  * Table Signatures.
  */
 #define ACPI_2_0_RSDP_SIGNATURE ASCII64('R','S','D',' ','P','T','R',' ')
@@ -375,6 +426,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_TCPA_SIGNATURE ASCII32('T','C','P','A')
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
+#define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
 
 /*
  * Table revision numbers.
@@ -388,6 +440,7 @@ struct acpi_20_madt_intsrcovr {
 #define ACPI_2_0_HPET_REVISION 0x01
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
+#define ACPI_2_0_SRAT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 1431296..3e96c23 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -23,6 +23,7 @@
 #include "ssdt_pm.h"
 #include "../config.h"
 #include "../util.h"
+#include "../vnuma.h"
 #include <xen/hvm/hvm_xs_strings.h>
 #include <xen/hvm/params.h>
 
@@ -203,6 +204,66 @@ static struct acpi_20_waet *construct_waet(void)
     return waet;
 }
 
+static struct acpi_20_srat *construct_srat(void)
+{
+    struct acpi_20_srat *srat;
+    struct acpi_20_srat_processor *processor;
+    struct acpi_20_srat_memory *memory;
+    unsigned int size;
+    void *p;
+    int i;
+
+    size = sizeof(*srat) + sizeof(*processor) * hvm_info->nr_vcpus +
+        sizeof(*memory) * nr_vmemranges;
+
+    p = mem_alloc(size, 16);
+    if ( !p )
+        return NULL;
+
+    srat = p;
+    memset(srat, 0, sizeof(*srat));
+    srat->header.signature    = ACPI_2_0_SRAT_SIGNATURE;
+    srat->header.revision     = ACPI_2_0_SRAT_REVISION;
+    fixed_strcpy(srat->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(srat->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    srat->header.oem_revision = ACPI_OEM_REVISION;
+    srat->header.creator_id   = ACPI_CREATOR_ID;
+    srat->header.creator_revision = ACPI_CREATOR_REVISION;
+    srat->table_revision      = ACPI_SRAT_TABLE_REVISION;
+
+    processor = (struct acpi_20_srat_processor *)(srat + 1);
+    for ( i = 0; i < hvm_info->nr_vcpus; i++ )
+    {
+        memset(processor, 0, sizeof(*processor));
+        processor->type     = ACPI_PROCESSOR_AFFINITY;
+        processor->length   = sizeof(*processor);
+        processor->domain   = vcpu_to_vnode[i];
+        processor->apic_id  = LAPIC_ID(i);
+        processor->flags    = ACPI_LOCAL_APIC_AFFIN_ENABLED;
+        processor++;
+    }
+
+    memory = (struct acpi_20_srat_memory *)processor;
+    for ( i = 0; i < nr_vmemranges; i++ )
+    {
+        memset(memory, 0, sizeof(*memory));
+        memory->type          = ACPI_MEMORY_AFFINITY;
+        memory->length        = sizeof(*memory);
+        memory->domain        = vmemrange[i].nid;
+        memory->flags         = ACPI_MEM_AFFIN_ENABLED;
+        memory->base_address  = vmemrange[i].start;
+        memory->mem_length    = vmemrange[i].end - vmemrange[i].start;
+        memory++;
+    }
+
+    ASSERT(((unsigned long)memory) - ((unsigned long)p) == size);
+
+    srat->header.length = size;
+    set_checksum(srat, offsetof(struct acpi_header, checksum), size);
+
+    return srat;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -257,6 +318,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_hpet *hpet;
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
+    struct acpi_20_srat *srat;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -346,6 +408,16 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         }
     }
 
+    /* SRAT */
+    if ( nr_vnodes > 0 )
+    {
+        srat = construct_srat();
+        if ( srat )
+            table_ptrs[nr_tables++] = (unsigned long)srat;
+        else
+            printf("Failed to build SRAT, skipping...\n");
+    }
+
     /* Load any additional tables passed through. */
     nr_tables += construct_passthrough_tables(table_ptrs, nr_tables);
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 14/19] hvmloader: construct SLIT
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (12 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 13/19] hvmloader: construct SRAT Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 16:53   ` Jan Beulich
  2015-01-13 12:11 ` [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
                   ` (4 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel; +Cc: dario.faggioli, Wei Liu, Jan Beulich

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Jan Beulich <JBeulich@suse.com>
---
Changes in v3:
1. Coding style fix.
2. Fix an error code.
3. Use unsigned int for loop variable.
Changes in v2:
1. Adapt to new vNUMA retrieval routine.
2. Move SLIT very late in secondary table build.
---
 tools/firmware/hvmloader/acpi/acpi2_0.h |    8 +++++++
 tools/firmware/hvmloader/acpi/build.c   |   40 ++++++++++++++++++++++++++++++-
 2 files changed, 47 insertions(+), 1 deletion(-)

diff --git a/tools/firmware/hvmloader/acpi/acpi2_0.h b/tools/firmware/hvmloader/acpi/acpi2_0.h
index 6169213..d698095 100644
--- a/tools/firmware/hvmloader/acpi/acpi2_0.h
+++ b/tools/firmware/hvmloader/acpi/acpi2_0.h
@@ -414,6 +414,12 @@ struct acpi_20_srat_memory {
 #define ACPI_MEM_AFFIN_HOTPLUGGABLE (1 << 1)
 #define ACPI_MEM_AFFIN_NONVOLATILE (1 << 2)
 
+struct acpi_20_slit {
+    struct acpi_header header;
+    uint64_t localities;
+    uint8_t entry[0];
+};
+
 /*
  * Table Signatures.
  */
@@ -427,6 +433,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_HPET_SIGNATURE ASCII32('H','P','E','T')
 #define ACPI_2_0_WAET_SIGNATURE ASCII32('W','A','E','T')
 #define ACPI_2_0_SRAT_SIGNATURE ASCII32('S','R','A','T')
+#define ACPI_2_0_SLIT_SIGNATURE ASCII32('S','L','I','T')
 
 /*
  * Table revision numbers.
@@ -441,6 +448,7 @@ struct acpi_20_srat_memory {
 #define ACPI_2_0_WAET_REVISION 0x01
 #define ACPI_1_0_FADT_REVISION 0x01
 #define ACPI_2_0_SRAT_REVISION 0x01
+#define ACPI_2_0_SLIT_REVISION 0x01
 
 #pragma pack ()
 
diff --git a/tools/firmware/hvmloader/acpi/build.c b/tools/firmware/hvmloader/acpi/build.c
index 3e96c23..7dac6a8 100644
--- a/tools/firmware/hvmloader/acpi/build.c
+++ b/tools/firmware/hvmloader/acpi/build.c
@@ -264,6 +264,38 @@ static struct acpi_20_srat *construct_srat(void)
     return srat;
 }
 
+static struct acpi_20_slit *construct_slit(void)
+{
+    struct acpi_20_slit *slit;
+    unsigned int i, num, size;
+
+    num = nr_vnodes * nr_vnodes;
+    size = sizeof(*slit) + num * sizeof(uint8_t);
+
+    slit = mem_alloc(size, 16);
+    if ( !slit )
+        return NULL;
+
+    memset(slit, 0, size);
+    slit->header.signature    = ACPI_2_0_SLIT_SIGNATURE;
+    slit->header.revision     = ACPI_2_0_SLIT_REVISION;
+    fixed_strcpy(slit->header.oem_id, ACPI_OEM_ID);
+    fixed_strcpy(slit->header.oem_table_id, ACPI_OEM_TABLE_ID);
+    slit->header.oem_revision = ACPI_OEM_REVISION;
+    slit->header.creator_id   = ACPI_CREATOR_ID;
+    slit->header.creator_revision = ACPI_CREATOR_REVISION;
+
+    for ( i = 0; i < num; i++ )
+        slit->entry[i] = vdistance[i];
+
+    slit->localities = nr_vnodes;
+
+    slit->header.length = size;
+    set_checksum(slit, offsetof(struct acpi_header, checksum), size);
+
+    return slit;
+}
+
 static int construct_passthrough_tables(unsigned long *table_ptrs,
                                         int nr_tables)
 {
@@ -319,6 +351,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
     struct acpi_20_waet *waet;
     struct acpi_20_tcpa *tcpa;
     struct acpi_20_srat *srat;
+    struct acpi_20_slit *slit;
     unsigned char *ssdt;
     static const uint16_t tis_signature[] = {0x0001, 0x0001, 0x0001};
     uint16_t *tis_hdr;
@@ -408,7 +441,7 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
         }
     }
 
-    /* SRAT */
+    /* SRAT and SLIT */
     if ( nr_vnodes > 0 )
     {
         srat = construct_srat();
@@ -416,6 +449,11 @@ static int construct_secondary_tables(unsigned long *table_ptrs,
             table_ptrs[nr_tables++] = (unsigned long)srat;
         else
             printf("Failed to build SRAT, skipping...\n");
+        slit = construct_slit();
+        if ( slit )
+            table_ptrs[nr_tables++] = (unsigned long)slit;
+        else
+            printf("Failed to build SLIT, skipping...\n");
     }
 
     /* Load any additional tables passed through. */
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (13 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 14/19] hvmloader: construct SLIT Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 18:15   ` Konrad Rzeszutek Wilk
  2015-01-13 12:11 ` [PATCH v3 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
                   ` (3 subsequent siblings)
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

The algorithm is more or less the same as the one used for PV guests.
Libxc gets hold of the vnode-to-pnode mapping and the size of each
vnode, then allocates memory accordingly.

The function then returns the low memory end, high memory end and MMIO
start to the caller. Libxl needs those values to construct vmemranges
for the guest.
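
A minimal sketch of a caller (assumed code, not part of this series)
using the new fields of xc_hvm_build_args introduced below; the node
sizes and vnode-to-pnode mapping are made-up values, and libxl derives
the real ones from the domain configuration in a later patch:

    #include <inttypes.h>
    #include <stdio.h>
    #include <xenctrl.h>
    #include <xenguest.h>

    static int build_with_vnuma(xc_interface *xch, uint32_t domid,
                                struct xc_hvm_build_args *args)
    {
        static unsigned int vnode_to_pnode[] = { 0, 0 }; /* both vnodes on pnode 0 */
        static uint64_t vnode_size[] = { 3000, 3000 };   /* per-vnode size in MiB */
        int rc;

        /* args->mem_size, image, etc. are assumed to be set up as before;
         * mem_size must equal the sum of vnode sizes (6000 MiB here). */
        args->nr_vnodes = 2;
        args->vnode_to_pnode = vnode_to_pnode;
        args->vnode_size = vnode_size;

        rc = xc_hvm_build(xch, domid, args);
        if ( rc )
            return rc;

        /* Out parameters: libxl uses these to build vmemranges around the
         * MMIO hole (see the following libxl patch). */
        fprintf(stderr, "lowmem_end=%#"PRIx64" highmem_end=%#"PRIx64
                " mmio_start=%#"PRIx64"\n",
                args->lowmem_end, args->highmem_end, args->mmio_start);
        return 0;
    }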

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit log.
2. Add a few code comments.
---
 tools/libxc/include/xenguest.h |    7 ++
 tools/libxc/xc_hvm_build_x86.c |  224 ++++++++++++++++++++++++++--------------
 2 files changed, 151 insertions(+), 80 deletions(-)

diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
index 40bbac8..d1cbb4e 100644
--- a/tools/libxc/include/xenguest.h
+++ b/tools/libxc/include/xenguest.h
@@ -230,6 +230,13 @@ struct xc_hvm_build_args {
     struct xc_hvm_firmware_module smbios_module;
     /* Whether to use claim hypercall (1 - enable, 0 - disable). */
     int claim_enabled;
+    unsigned int nr_vnodes;	  /* Number of vnodes */
+    unsigned int *vnode_to_pnode; /* Vnode to pnode mapping */
+    uint64_t *vnode_size;	  /* Size of vnodes */
+    /* Out parameters  */
+    uint64_t lowmem_end;
+    uint64_t highmem_end;
+    uint64_t mmio_start;
 };
 
 /**
diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
index c81a25b..54d3dc8 100644
--- a/tools/libxc/xc_hvm_build_x86.c
+++ b/tools/libxc/xc_hvm_build_x86.c
@@ -89,7 +89,8 @@ static int modules_init(struct xc_hvm_build_args *args,
 }
 
 static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
-                           uint64_t mmio_start, uint64_t mmio_size)
+                           uint64_t mmio_start, uint64_t mmio_size,
+                           struct xc_hvm_build_args *args)
 {
     struct hvm_info_table *hvm_info = (struct hvm_info_table *)
         (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
@@ -119,6 +120,10 @@ static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
     hvm_info->high_mem_pgend = highmem_end >> PAGE_SHIFT;
     hvm_info->reserved_mem_pgstart = ioreq_server_pfn(0);
 
+    args->lowmem_end = lowmem_end;
+    args->highmem_end = highmem_end;
+    args->mmio_start = mmio_start;
+
     /* Finish with the checksum. */
     for ( i = 0, sum = 0; i < hvm_info->length; i++ )
         sum += ((uint8_t *)hvm_info)[i];
@@ -244,7 +249,7 @@ static int setup_guest(xc_interface *xch,
                        char *image, unsigned long image_size)
 {
     xen_pfn_t *page_array = NULL;
-    unsigned long i, nr_pages = args->mem_size >> PAGE_SHIFT;
+    unsigned long i, j, nr_pages = args->mem_size >> PAGE_SHIFT;
     unsigned long target_pages = args->mem_target >> PAGE_SHIFT;
     uint64_t mmio_start = (1ull << 32) - args->mmio_size;
     uint64_t mmio_size = args->mmio_size;
@@ -258,13 +263,13 @@ static int setup_guest(xc_interface *xch,
     xen_capabilities_info_t caps;
     unsigned long stat_normal_pages = 0, stat_2mb_pages = 0, 
         stat_1gb_pages = 0;
-    int pod_mode = 0;
+    unsigned int memflags = 0;
     int claim_enabled = args->claim_enabled;
     xen_pfn_t special_array[NR_SPECIAL_PAGES];
     xen_pfn_t ioreq_server_array[NR_IOREQ_SERVER_PAGES];
-
-    if ( nr_pages > target_pages )
-        pod_mode = XENMEMF_populate_on_demand;
+    uint64_t dummy_vnode_size;
+    unsigned int dummy_vnode_to_pnode;
+    uint64_t total;
 
     memset(&elf, 0, sizeof(elf));
     if ( elf_init(&elf, image, image_size) != 0 )
@@ -276,6 +281,37 @@ static int setup_guest(xc_interface *xch,
     v_start = 0;
     v_end = args->mem_size;
 
+    if ( nr_pages > target_pages )
+        memflags |= XENMEMF_populate_on_demand;
+
+    if ( args->nr_vnodes == 0 )
+    {
+        /* Build dummy vnode information */
+        args->nr_vnodes = 1;
+        dummy_vnode_to_pnode = XC_VNUMA_NO_NODE;
+        dummy_vnode_size = args->mem_size >> 20;
+        args->vnode_size = &dummy_vnode_size;
+        args->vnode_to_pnode = &dummy_vnode_to_pnode;
+    }
+    else
+    {
+        if ( nr_pages > target_pages )
+        {
+            PERROR("Cannot enable vNUMA and PoD at the same time");
+            goto error_out;
+        }
+    }
+
+    total = 0;
+    for ( i = 0; i < args->nr_vnodes; i++ )
+        total += (args->vnode_size[i] << 20);
+    if ( total != args->mem_size )
+    {
+        PERROR("Memory size requested by vNUMA (0x%"PRIx64") mismatches memory size configured for domain (0x%"PRIx64")",
+               total, args->mem_size);
+        goto error_out;
+    }
+
     if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
     {
         PERROR("Could not get Xen capabilities");
@@ -320,7 +356,7 @@ static int setup_guest(xc_interface *xch,
         }
     }
 
-    if ( pod_mode )
+    if ( memflags & XENMEMF_populate_on_demand )
     {
         /*
          * Subtract VGA_HOLE_SIZE from target_pages for the VGA
@@ -349,103 +385,128 @@ static int setup_guest(xc_interface *xch,
      * ensure that we can be preempted and hence dom0 remains responsive.
      */
     rc = xc_domain_populate_physmap_exact(
-        xch, dom, 0xa0, 0, pod_mode, &page_array[0x00]);
+        xch, dom, 0xa0, 0, memflags, &page_array[0x00]);
     cur_pages = 0xc0;
     stat_normal_pages = 0xc0;
 
-    while ( (rc == 0) && (nr_pages > cur_pages) )
+    for ( i = 0; i < args->nr_vnodes; i++ )
     {
-        /* Clip count to maximum 1GB extent. */
-        unsigned long count = nr_pages - cur_pages;
-        unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
-
-        if ( count > max_pages )
-            count = max_pages;
-
-        cur_pfn = page_array[cur_pages];
+        unsigned int new_memflags = memflags;
+        uint64_t pages, finished;
 
-        /* Take care the corner cases of super page tails */
-        if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-             (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
-            count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
-        else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
-                  (count > SUPERPAGE_1GB_NR_PFNS) )
-            count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
-
-        /* Attemp to allocate 1GB super page. Because in each pass we only
-         * allocate at most 1GB, we don't have to clip super page boundaries.
-         */
-        if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
-             /* Check if there exists MMIO hole in the 1GB memory range */
-             !check_mmio_hole(cur_pfn << PAGE_SHIFT,
-                              SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
-                              mmio_start, mmio_size) )
+        if ( args->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
         {
-            long done;
-            unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
-            xen_pfn_t sp_extents[nr_extents];
-
-            for ( i = 0; i < nr_extents; i++ )
-                sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
-
-            done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
-                                              pod_mode, sp_extents);
-
-            if ( done > 0 )
-            {
-                stat_1gb_pages += done;
-                done <<= SUPERPAGE_1GB_SHIFT;
-                cur_pages += done;
-                count -= done;
-            }
+            new_memflags |= XENMEMF_exact_node(args->vnode_to_pnode[i]);
+            new_memflags |= XENMEMF_exact_node_request;
         }
 
-        if ( count != 0 )
+        pages = (args->vnode_size[i] << 20) >> PAGE_SHIFT;
+        /* Consider vga hole belongs to node 0 */
+        if ( i == 0 )
+            finished = 0xc0;
+        else
+            finished = 0;
+
+        while ( (rc == 0) && (pages > finished) )
         {
-            /* Clip count to maximum 8MB extent. */
-            max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
+            /* Clip count to maximum 1GB extent. */
+            unsigned long count = pages - finished;
+            unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
+
             if ( count > max_pages )
                 count = max_pages;
-            
-            /* Clip partial superpage extents to superpage boundaries. */
-            if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                 (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
-                count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
-            else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
-                      (count > SUPERPAGE_2MB_NR_PFNS) )
-                count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
-
-            /* Attempt to allocate superpage extents. */
-            if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+
+            cur_pfn = page_array[cur_pages];
+
+            /* Take care the corner cases of super page tails */
+            if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                 (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
+                count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
+            else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
+                      (count > SUPERPAGE_1GB_NR_PFNS) )
+                count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
+
+            /* Attemp to allocate 1GB super page. Because in each pass we only
+             * allocate at most 1GB, we don't have to clip super page boundaries.
+             */
+            if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
+                 /* Check if there exists MMIO hole in the 1GB memory range */
+                 !check_mmio_hole(cur_pfn << PAGE_SHIFT,
+                                  SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
+                                  mmio_start, mmio_size) )
             {
                 long done;
-                unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
                 xen_pfn_t sp_extents[nr_extents];
 
-                for ( i = 0; i < nr_extents; i++ )
-                    sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
+                for ( j = 0; j < nr_extents; j++ )
+                    sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];
 
-                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
-                                                  pod_mode, sp_extents);
+                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
+                                                  new_memflags, sp_extents);
 
                 if ( done > 0 )
                 {
-                    stat_2mb_pages += done;
-                    done <<= SUPERPAGE_2MB_SHIFT;
+                    stat_1gb_pages += done;
+                    done <<= SUPERPAGE_1GB_SHIFT;
                     cur_pages += done;
+                    finished += done;
                     count -= done;
                 }
             }
-        }
 
-        /* Fall back to 4kB extents. */
-        if ( count != 0 )
-        {
-            rc = xc_domain_populate_physmap_exact(
-                xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
-            cur_pages += count;
-            stat_normal_pages += count;
+            if ( count != 0 )
+            {
+                /* Clip count to maximum 8MB extent. */
+                max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
+                if ( count > max_pages )
+                    count = max_pages;
+
+                /* Clip partial superpage extents to superpage boundaries. */
+                if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                     (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
+                    count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
+                else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
+                          (count > SUPERPAGE_2MB_NR_PFNS) )
+                    count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
+
+                /* Attempt to allocate superpage extents. */
+                if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
+                {
+                    long done;
+                    unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
+                    xen_pfn_t sp_extents[nr_extents];
+
+                    for ( j = 0; j < nr_extents; j++ )
+                        sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_2MB_SHIFT)];
+
+                    done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
+                                                      new_memflags, sp_extents);
+
+                    if ( done > 0 )
+                    {
+                        stat_2mb_pages += done;
+                        done <<= SUPERPAGE_2MB_SHIFT;
+                        cur_pages += done;
+                        finished += done;
+                        count -= done;
+                    }
+                }
+            }
+
+            /* Fall back to 4kB extents. */
+            if ( count != 0 )
+            {
+                rc = xc_domain_populate_physmap_exact(
+                    xch, dom, count, 0, new_memflags, &page_array[cur_pages]);
+                cur_pages += count;
+                finished += count;
+                stat_normal_pages += count;
+            }
         }
+
+        if ( rc != 0 )
+            break;
     }
 
     if ( rc != 0 )
@@ -469,7 +530,7 @@ static int setup_guest(xc_interface *xch,
               xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
               HVM_INFO_PFN)) == NULL )
         goto error_out;
-    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
+    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size, args);
     munmap(hvm_info_page, PAGE_SIZE);
 
     /* Allocate and clear special pages. */
@@ -608,6 +669,9 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
             args.acpi_module.guest_addr_out;
         hvm_args->smbios_module.guest_addr_out = 
             args.smbios_module.guest_addr_out;
+        hvm_args->lowmem_end = args.lowmem_end;
+        hvm_args->highmem_end = args.highmem_end;
+        hvm_args->mmio_start = args.mmio_start;
     }
 
     free(image);
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 16/19] libxl: build, check and pass vNUMA info to Xen for HVM guest
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (14 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 12:11 ` [PATCH v3 17/19] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
                   ` (2 subsequent siblings)
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

Transform the user-supplied vNUMA configuration into libxl internal
representations, then into libxc representations, checking validity
along the way.

Libxc is more involved in building vmemranges in the HVM case than in
the PV case. The vmemranges are built after xc_hvm_build returns,
because doing so relies on the memory hole information provided by
xc_hvm_build.
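
As a worked illustration (numbers assumed, not taken from the patch):
with lowmem_end = 0xC0000000 (3 GiB), an MMIO hole covering
[0xF0000000, 4 GiB) and two 3000 MiB vnodes, hole_start is 0xC0000000,
hole_end is 4 GiB, and the vmemrange construction below yields:

    vnode 0: [0x000000000, 0x0BB800000)  3000 MiB, entirely below the hole
    vnode 1: [0x0BB800000, 0x0C0000000)    72 MiB below the hole
             [0x100000000, 0x1B7000000)  2928 MiB above 4 GiB

i.e. a vnode that straddles the memory hole is split into two
vmemranges.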

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v3:
1. Rewrite commit log.
---
 tools/libxl/libxl_create.c   |    9 +++++++
 tools/libxl/libxl_dom.c      |   28 +++++++++++++++++++++
 tools/libxl/libxl_internal.h |    5 ++++
 tools/libxl/libxl_vnuma.c    |   56 ++++++++++++++++++++++++++++++++++++++++++
 4 files changed, 98 insertions(+)

diff --git a/tools/libxl/libxl_create.c b/tools/libxl/libxl_create.c
index 6f87d1c..9700172 100644
--- a/tools/libxl/libxl_create.c
+++ b/tools/libxl/libxl_create.c
@@ -845,6 +845,15 @@ static void initiate_domain_create(libxl__egc *egc,
         goto error_out;
     }
 
+    /* Disallow PoD and vNUMA to be enabled at the same time because PoD
+     * pool is not vNUMA-aware yet.
+     */
+    if (pod_enabled && d_config->b_info.num_vnuma_nodes) {
+        ret = ERROR_INVAL;
+        LOG(ERROR, "Cannot enable PoD and vNUMA at the same time");
+        goto error_out;
+    }
+
     ret = libxl__domain_create_info_setdefault(gc, &d_config->c_info);
     if (ret) goto error_out;
 
diff --git a/tools/libxl/libxl_dom.c b/tools/libxl/libxl_dom.c
index f06c88b..592904f 100644
--- a/tools/libxl/libxl_dom.c
+++ b/tools/libxl/libxl_dom.c
@@ -887,12 +887,40 @@ int libxl__build_hvm(libxl__gc *gc, uint32_t domid,
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        int i;
+
+        args.nr_vnodes = info->num_vnuma_nodes;
+        args.vnode_to_pnode = libxl__malloc(gc, sizeof(*args.vnode_to_pnode) *
+                                            args.nr_vnodes);
+        args.vnode_size = libxl__malloc(gc, sizeof(*args.vnode_size) *
+                                        args.nr_vnodes);
+        for (i = 0; i < args.nr_vnodes; i++) {
+            args.vnode_to_pnode[i] = info->vnuma_nodes[i].pnode;
+            args.vnode_size[i] = info->vnuma_nodes[i].mem;
+        }
+
+        /* Consider video ram belongs to node 0 */
+        args.vnode_size[0] -= (info->video_memkb >> 10);
+    }
+
     ret = xc_hvm_build(ctx->xch, domid, &args);
     if (ret) {
         LOGEV(ERROR, ret, "hvm building failed");
         goto out;
     }
 
+    if (info->num_vnuma_nodes != 0) {
+        ret = libxl__vnuma_build_vmemrange_hvm(gc, domid, info, state, &args);
+        if (ret) {
+            LOGEV(ERROR, ret, "hvm build vmemranges failed");
+            goto out;
+        }
+        ret = libxl__vnuma_config_check(gc, info, state);
+        if (ret) goto out;
+        ret = set_vnuma_info(gc, domid, info, state);
+        if (ret) goto out;
+    }
     ret = hvm_build_set_params(ctx->xch, domid, info, state->store_port,
                                &state->store_mfn, state->console_port,
                                &state->console_mfn, state->store_domid,
diff --git a/tools/libxl/libxl_internal.h b/tools/libxl/libxl_internal.h
index 73d533a..a5513c9 100644
--- a/tools/libxl/libxl_internal.h
+++ b/tools/libxl/libxl_internal.h
@@ -3403,6 +3403,11 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
                                     uint32_t domid,
                                     libxl_domain_build_info *b_info,
                                     libxl__domain_build_state *state);
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args);
 
 _hidden int libxl__ms_vm_genid_set(libxl__gc *gc, uint32_t domid,
                                    const libxl_ms_vm_genid *id);
diff --git a/tools/libxl/libxl_vnuma.c b/tools/libxl/libxl_vnuma.c
index a72abe2..c6e949f 100644
--- a/tools/libxl/libxl_vnuma.c
+++ b/tools/libxl/libxl_vnuma.c
@@ -162,6 +162,62 @@ int libxl__vnuma_build_vmemrange_pv(libxl__gc *gc,
     return libxl__arch_vnuma_build_vmemrange(gc, domid, b_info, state);
 }
 
+/* Build vmemranges for HVM guest */
+int libxl__vnuma_build_vmemrange_hvm(libxl__gc *gc,
+                                     uint32_t domid,
+                                     libxl_domain_build_info *b_info,
+                                     libxl__domain_build_state *state,
+                                     struct xc_hvm_build_args *args)
+{
+    uint64_t hole_start, hole_end, next;
+    int i, x;
+    xen_vmemrange_t *v;
+
+    /* Derive vmemranges from vnode size and memory hole.
+     *
+     * Guest physical address space layout:
+     * [0, hole_start) [hole_start, hole_end) [hole_end, highmem_end)
+     */
+    hole_start = args->lowmem_end < args->mmio_start ?
+        args->lowmem_end : args->mmio_start;
+    hole_end = (args->mmio_start + args->mmio_size) > (1ULL << 32) ?
+        (args->mmio_start + args->mmio_size) : (1ULL << 32);
+
+    assert(state->vmemranges == NULL);
+
+    next = 0;
+    x = 0;
+    v = NULL;
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+        uint64_t remaining = (p->mem << 20);
+
+        while (remaining > 0) {
+            uint64_t count = remaining;
+
+            if (next >= hole_start && next < hole_end)
+                next = hole_end;
+            if ((next < hole_start) && (next + remaining >= hole_start))
+                count = hole_start - next;
+
+            v = libxl__realloc(gc, v, sizeof(xen_vmemrange_t) * (x + 1));
+            v[x].start = next;
+            v[x].end = next + count;
+            v[x].flags = 0;
+            v[x].nid = i;
+
+            x++;
+            remaining -= count;
+            next += count;
+        }
+    }
+
+    state->vmemranges = v;
+    state->num_vmemranges = x;
+
+    return 0;
+}
+
 /*
  * Local variables:
  * mode: C
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 17/19] libxl: disallow memory relocation when vNUMA is enabled
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (15 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 12:11 ` [PATCH v3 18/19] libxlutil: nested list support Wei Liu
  2015-01-13 12:11 ` [PATCH v3 19/19] xl: vNUMA support Wei Liu
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel; +Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich

Disallow memory relocation when vNUMA is enabled, because relocated
memory ends up off-node. Furthermore, even if we dynamically expanded
node coverage in hvmloader, low memory and high memory may reside on
different physical nodes, so blindly relocating low memory to high
memory would give us a sub-optimal configuration.

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Konrad Wilk <konrad.wilk@oracle.com>
---
 tools/libxl/libxl_dm.c |    6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/tools/libxl/libxl_dm.c b/tools/libxl/libxl_dm.c
index c2b0487..201c9e5 100644
--- a/tools/libxl/libxl_dm.c
+++ b/tools/libxl/libxl_dm.c
@@ -1356,13 +1356,15 @@ void libxl__spawn_local_dm(libxl__egc *egc, libxl__dm_spawn_state *dmss)
                         libxl__sprintf(gc, "%s/hvmloader/bios", path),
                         "%s", libxl_bios_type_to_string(b_info->u.hvm.bios));
         /* Disable relocating memory to make the MMIO hole larger
-         * unless we're running qemu-traditional */
+         * unless we're running qemu-traditional and vNUMA is not
+         * configured. */
         libxl__xs_write(gc, XBT_NULL,
                         libxl__sprintf(gc,
                                        "%s/hvmloader/allow-memory-relocate",
                                        path),
                         "%d",
-                        b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL);
+                        b_info->device_model_version==LIBXL_DEVICE_MODEL_VERSION_QEMU_XEN_TRADITIONAL &&
+                        !b_info->num_vnuma_nodes);
         free(path);
     }
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 18/19] libxlutil: nested list support
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (16 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 17/19] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  2015-01-13 15:52   ` Ian Jackson
  2015-01-13 12:11 ` [PATCH v3 19/19] xl: vNUMA support Wei Liu
  18 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel; +Cc: Ian Jackson, dario.faggioli, Wei Liu, Ian Campbell, jbeulich

This is done with three major changes:
1. Rework internal representation of setting.
2. Extend grammar of parser.
3. Introduce new APIs.

New APIs introduced:
1. xlu_cfg_value_type
2. xlu_cfg_value_get_string
3. xlu_cfg_value_get_list
4. xlu_cfg_get_listitem2

Existing APIs work as before.

Nested lists will be used to represent the two-dimensional array that
specifies distances among vNUMA nodes.
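
A hedged sketch (assumed usage, not code from this series) of walking
such a nested list, e.g. vnuma_vdistances = [ [10,30], [30,10] ], with
the new APIs; "cfg" is an already-parsed XLU_Config and the enum
constants XLU_STRING/XLU_LIST come from the updated libxlutil.h:

    #include <stdio.h>
    #include <stdlib.h>
    #include "libxlutil.h"

    static void dump_vdistances(const XLU_Config *cfg)
    {
        XLU_ConfigList *rows;
        int nr_rows, i, j;

        if (xlu_cfg_get_list(cfg, "vnuma_vdistances", &rows, &nr_rows, 1))
            return;                   /* absent, or not a list at all */

        for (i = 0; i < nr_rows; i++) {
            const XLU_ConfigValue *row = xlu_cfg_get_listitem2(rows, i);
            const XLU_ConfigList *cols;
            const XLU_ConfigValue *cell;

            if (!row || xlu_cfg_value_type(row) != XLU_LIST)
                continue;             /* each entry must itself be a list */
            cols = xlu_cfg_value_get_list(row);

            /* xlu_cfg_get_listitem2() returns NULL past the end. */
            for (j = 0; (cell = xlu_cfg_get_listitem2(cols, j)); j++)
                if (xlu_cfg_value_type(cell) == XLU_STRING)
                    printf("distance[%d][%d] = %ld\n", i, j,
                           strtol(xlu_cfg_value_get_string(cell), NULL, 0));
        }
    }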

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
---
 tools/libxl/libxlu_cfg.c      |  180 +++++++++++++++++++++++++----------------
 tools/libxl/libxlu_cfg_i.h    |   15 +++-
 tools/libxl/libxlu_cfg_y.c    |  110 +++++++++++--------------
 tools/libxl/libxlu_cfg_y.h    |    2 +-
 tools/libxl/libxlu_cfg_y.y    |   19 +++--
 tools/libxl/libxlu_internal.h |   24 ++++--
 tools/libxl/libxlutil.h       |   11 ++-
 7 files changed, 208 insertions(+), 153 deletions(-)

diff --git a/tools/libxl/libxlu_cfg.c b/tools/libxl/libxlu_cfg.c
index 22adcb0..db26c34 100644
--- a/tools/libxl/libxlu_cfg.c
+++ b/tools/libxl/libxlu_cfg.c
@@ -132,13 +132,9 @@ int xlu_cfg_readdata(XLU_Config *cfg, const char *data, int length) {
 }
 
 void xlu__cfg_set_free(XLU_ConfigSetting *set) {
-    int i;
-
     if (!set) return;
+    xlu__cfg_value_free(set->value);
     free(set->name);
-    for (i=0; i<set->nvalues; i++)
-        free(set->values[i]);
-    free(set->values);
     free(set);
 }
 
@@ -173,7 +169,7 @@ static int find_atom(const XLU_Config *cfg, const char *n,
     set= find(cfg,n);
     if (!set) return ESRCH;
 
-    if (set->avalues!=1) {
+    if (set->value->type!=XLU_STRING) {
         if (!dont_warn)
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is"
@@ -185,13 +181,30 @@ static int find_atom(const XLU_Config *cfg, const char *n,
     return 0;
 }
 
+enum XLU_ConfigValueType xlu_cfg_value_type(const XLU_ConfigValue *value)
+{
+    return value->type;
+}
+
+const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value)
+{
+    assert(value->type == XLU_STRING);
+    return value->u.string;
+}
+
+const XLU_ConfigList *xlu_cfg_value_get_list(const XLU_ConfigValue *value)
+{
+    assert(value->type == XLU_LIST);
+    return &value->u.list;
+}
+
 int xlu_cfg_get_string(const XLU_Config *cfg, const char *n,
                        const char **value_r, int dont_warn) {
     XLU_ConfigSetting *set;
     int e;
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
-    *value_r= set->values[0];
+    *value_r= set->value->u.string;
     return 0;
 }
 
@@ -202,7 +215,7 @@ int xlu_cfg_replace_string(const XLU_Config *cfg, const char *n,
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
     free(*value_r);
-    *value_r= strdup(set->values[0]);
+    *value_r= strdup(set->value->u.string);
     return 0;
 }
 
@@ -214,7 +227,7 @@ int xlu_cfg_get_long(const XLU_Config *cfg, const char *n,
     char *ep;
 
     e= find_atom(cfg,n,&set,dont_warn);  if (e) return e;
-    errno= 0; l= strtol(set->values[0], &ep, 0);
+    errno= 0; l= strtol(set->value->u.string, &ep, 0);
     e= errno;
     if (errno) {
         e= errno;
@@ -226,7 +239,7 @@ int xlu_cfg_get_long(const XLU_Config *cfg, const char *n,
                     cfg->config_source, set->lineno, n, strerror(e));
         return e;
     }
-    if (*ep || ep==set->values[0]) {
+    if (*ep || ep==set->value->u.string) {
         if (!dont_warn)
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is not a valid number\n",
@@ -253,7 +266,7 @@ int xlu_cfg_get_list(const XLU_Config *cfg, const char *n,
                      XLU_ConfigList **list_r, int *entries_r, int dont_warn) {
     XLU_ConfigSetting *set;
     set= find(cfg,n);  if (!set) return ESRCH;
-    if (set->avalues==1) {
+    if (set->value->type!=XLU_LIST) {
         if (!dont_warn) {
             fprintf(cfg->report,
                     "%s:%d: warning: parameter `%s' is a single value"
@@ -262,8 +275,8 @@ int xlu_cfg_get_list(const XLU_Config *cfg, const char *n,
         }
         return EINVAL;
     }
-    if (list_r) *list_r= set;
-    if (entries_r) *entries_r= set->nvalues;
+    if (list_r) *list_r= &set->value->u.list;
+    if (entries_r) *entries_r= set->value->u.list.nvalues;
     return 0;
 }
 
@@ -290,72 +303,29 @@ int xlu_cfg_get_list_as_string_list(const XLU_Config *cfg, const char *n,
     return 0;
 }
 
-const char *xlu_cfg_get_listitem(const XLU_ConfigList *set, int entry) {
-    if (entry < 0 || entry >= set->nvalues) return 0;
-    return set->values[entry];
+const char *xlu_cfg_get_listitem(const XLU_ConfigList *list, int entry) {
+    if (entry < 0 || entry >= list->nvalues) return 0;
+    if (list->values[entry]->type != XLU_STRING) return 0;
+    return list->values[entry]->u.string;
 }
 
-
-XLU_ConfigSetting *xlu__cfg_set_mk(CfgParseContext *ctx,
-                                   int alloc, char *atom) {
-    XLU_ConfigSetting *set= 0;
-
-    if (ctx->err) goto x;
-    assert(!!alloc == !!atom);
-
-    set= malloc(sizeof(*set));
-    if (!set) goto xe;
-
-    set->name= 0; /* tbd */
-    set->avalues= alloc;
-
-    if (!alloc) {
-        set->nvalues= 0;
-        set->values= 0;
-    } else {
-        set->values= malloc(sizeof(*set->values) * alloc);
-        if (!set->values) goto xe;
-
-        set->nvalues= 1;
-        set->values[0]= atom;
-    }
-    return set;
-
- xe:
-    ctx->err= errno;
- x:
-    free(set);
-    free(atom);
-    return 0;
-}
-
-void xlu__cfg_set_add(CfgParseContext *ctx, XLU_ConfigSetting *set,
-                      char *atom) {
-    if (ctx->err) return;
-
-    assert(atom);
-
-    if (set->nvalues >= set->avalues) {
-        int new_avalues;
-        char **new_values;
-
-        if (set->avalues > INT_MAX / 100) { ctx->err= ERANGE; return; }
-        new_avalues= set->avalues * 4;
-        new_values= realloc(set->values,
-                            sizeof(*new_values) * new_avalues);
-        if (!new_values) { ctx->err= errno; free(atom); return; }
-        set->values= new_values;
-        set->avalues= new_avalues;
-    }
-    set->values[set->nvalues++]= atom;
+const XLU_ConfigValue *xlu_cfg_get_listitem2(const XLU_ConfigList *list,
+                                             int entry)
+{
+    if (entry < 0 || entry >= list->nvalues) return NULL;
+    return list->values[entry];
 }
 
 void xlu__cfg_set_store(CfgParseContext *ctx, char *name,
-                        XLU_ConfigSetting *set, int lineno) {
+                        XLU_ConfigValue *val, int lineno) {
+    XLU_ConfigSetting *set;
     if (ctx->err) return;
 
     assert(name);
+    set = malloc(sizeof(*set));
+    assert(set);
     set->name= name;
+    set->value = val;
     set->lineno= lineno;
     set->next= ctx->cfg->settings;
     ctx->cfg->settings= set;
@@ -473,6 +443,76 @@ void xlu__cfg_yyerror(YYLTYPE *loc, CfgParseContext *ctx, char const *msg) {
     if (!ctx->err) ctx->err= EINVAL;
 }
 
+XLU_ConfigValue *xlu__cfg_make_string(CfgParseContext *ctx,
+                                      char *src)
+{
+    XLU_ConfigValue *ret;
+
+    ret = malloc(sizeof(*ret));
+    assert(ret);
+    ret->type = XLU_STRING;
+    ret->u.string = src;
+
+    return ret;
+}
+
+XLU_ConfigValue *xlu__cfg_make_list(CfgParseContext *ctx,
+                                    XLU_ConfigValue *val)
+{
+    XLU_ConfigValue *ret;
+
+    ret = malloc(sizeof(*ret));
+    assert(ret);
+    ret->type = XLU_LIST;
+    ret->u.list.nvalues = 1;
+    ret->u.list.values = malloc(sizeof(*ret->u.list.values));
+    ret->u.list.values[0] = val;
+
+    return ret;
+}
+
+XLU_ConfigValue *xlu__cfg_list_append(CfgParseContext *ctx,
+                                      XLU_ConfigValue *list,
+                                      XLU_ConfigValue *val)
+{
+    XLU_ConfigValue **new_val;
+
+    assert(list->type == XLU_LIST);
+
+    new_val = realloc(list->u.list.values,
+                      sizeof(*new_val) * (list->u.list.nvalues+1));
+    if (!new_val) {
+        ctx->err = errno;
+        xlu__cfg_value_free(val);
+        return list;
+    }
+
+    list->u.list.values = new_val;
+    list->u.list.values[list->u.list.nvalues] = val;
+    list->u.list.nvalues++;
+
+    return list;
+}
+
+void xlu__cfg_value_free(XLU_ConfigValue *val)
+{
+    int i;
+
+    if (!val) return;
+
+    switch (val->type) {
+    case XLU_STRING:
+        free(val->u.string);
+        break;
+    case XLU_LIST:
+        for (i = 0; i < val->u.list.nvalues; i++) {
+            xlu__cfg_value_free(val->u.list.values[i]);
+        }
+        free(val->u.list.values);
+    }
+    free(val);
+}
+
 /*
  * Local variables:
  * mode: C
diff --git a/tools/libxl/libxlu_cfg_i.h b/tools/libxl/libxlu_cfg_i.h
index 54d033c..89cef9a 100644
--- a/tools/libxl/libxlu_cfg_i.h
+++ b/tools/libxl/libxlu_cfg_i.h
@@ -23,10 +23,8 @@
 #include "libxlu_cfg_y.h"
 
 void xlu__cfg_set_free(XLU_ConfigSetting *set);
-XLU_ConfigSetting *xlu__cfg_set_mk(CfgParseContext*, int alloc, char *atom);
-void xlu__cfg_set_add(CfgParseContext*, XLU_ConfigSetting *set, char *atom);
 void xlu__cfg_set_store(CfgParseContext*, char *name,
-                        XLU_ConfigSetting *set, int lineno);
+                        XLU_ConfigValue *val, int lineno);
 
 char *xlu__cfgl_strdup(CfgParseContext*, const char *src);
 char *xlu__cfgl_dequote(CfgParseContext*, const char *src);
@@ -36,7 +34,18 @@ void xlu__cfgl_lexicalerror(CfgParseContext*, char const *msg);
 
 void xlu__cfgl_likely_python(CfgParseContext *ctx);
 
+XLU_ConfigValue *xlu__cfg_make_string(CfgParseContext*, char *src);
 
+XLU_ConfigValue *xlu__cfg_make_list(CfgParseContext*,
+                                    XLU_ConfigValue *val);
+
+
+
+XLU_ConfigValue *xlu__cfg_list_append(CfgParseContext *ctx,
+                                      XLU_ConfigValue *list,
+                                      XLU_ConfigValue *val);
+
+void xlu__cfg_value_free(struct XLU_ConfigValue *val);
 
 /* Why oh why does bison not declare this in its autogenerated .h ? */
 int xlu__cfg_yyparse(CfgParseContext *ctx);
diff --git a/tools/libxl/libxlu_cfg_y.c b/tools/libxl/libxlu_cfg_y.c
index 07b5a1d..bc2b81e 100644
--- a/tools/libxl/libxlu_cfg_y.c
+++ b/tools/libxl/libxlu_cfg_y.c
@@ -126,7 +126,7 @@ typedef union YYSTYPE
 #line 25 "libxlu_cfg_y.y"
 
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 
 
 
@@ -377,14 +377,14 @@ union yyalloc
 /* YYFINAL -- State number of the termination state.  */
 #define YYFINAL  3
 /* YYLAST -- Last index in YYTABLE.  */
-#define YYLAST   24
+#define YYLAST   26
 
 /* YYNTOKENS -- Number of terminals.  */
 #define YYNTOKENS  12
 /* YYNNTS -- Number of nonterminals.  */
 #define YYNNTS  11
 /* YYNRULES -- Number of rules.  */
-#define YYNRULES  22
+#define YYNRULES  21
 /* YYNRULES -- Number of states.  */
 #define YYNSTATES  30
 
@@ -433,8 +433,8 @@ static const yytype_uint8 yytranslate[] =
 static const yytype_uint8 yyprhs[] =
 {
        0,     0,     3,     5,     8,     9,    12,    15,    17,    20,
-      24,    26,    28,    30,    35,    37,    39,    40,    42,    46,
-      49,    55,    56
+      24,    26,    28,    30,    32,    34,    36,    41,    42,    45,
+      51,    52
 };
 
 /* YYRHS -- A `-1'-separated list of the rules' RHS.  */
@@ -443,17 +443,17 @@ static const yytype_int8 yyrhs[] =
       13,     0,    -1,    14,    -1,    14,    16,    -1,    -1,    14,
       15,    -1,    16,    17,    -1,    17,    -1,     1,     6,    -1,
        3,     7,    18,    -1,     6,    -1,     8,    -1,    19,    -1,
-       9,    22,    20,    10,    -1,     4,    -1,     5,    -1,    -1,
-      21,    -1,    21,    11,    22,    -1,    19,    22,    -1,    21,
-      11,    22,    19,    22,    -1,    -1,    22,     6,    -1
+      20,    -1,     4,    -1,     5,    -1,     9,    22,    21,    10,
+      -1,    -1,    18,    22,    -1,    21,    11,    22,    18,    22,
+      -1,    -1,    22,     6,    -1
 };
 
 /* YYRLINE[YYN] -- source line where rule number YYN was defined.  */
 static const yytype_uint8 yyrline[] =
 {
        0,    47,    47,    48,    50,    51,    53,    54,    55,    57,
-      59,    60,    62,    63,    65,    66,    68,    69,    70,    72,
-      73,    75,    77
+      59,    60,    62,    63,    65,    66,    68,    70,    71,    72,
+      74,    76
 };
 #endif
 
@@ -464,7 +464,7 @@ static const char *const yytname[] =
 {
   "$end", "error", "$undefined", "IDENT", "STRING", "NUMBER", "NEWLINE",
   "'='", "';'", "'['", "']'", "','", "$accept", "file", "stmts", "stmt",
-  "assignment", "endstmt", "value", "atom", "valuelist", "values", "nlok", 0
+  "assignment", "endstmt", "value", "atom", "list", "values", "nlok", 0
 };
 #endif
 
@@ -482,16 +482,16 @@ static const yytype_uint16 yytoknum[] =
 static const yytype_uint8 yyr1[] =
 {
        0,    12,    13,    13,    14,    14,    15,    15,    15,    16,
-      17,    17,    18,    18,    19,    19,    20,    20,    20,    21,
-      21,    22,    22
+      17,    17,    18,    18,    19,    19,    20,    21,    21,    21,
+      22,    22
 };
 
 /* YYR2[YYN] -- Number of symbols composing right hand side of rule YYN.  */
 static const yytype_uint8 yyr2[] =
 {
        0,     2,     1,     2,     0,     2,     2,     1,     2,     3,
-       1,     1,     1,     4,     1,     1,     0,     1,     3,     2,
-       5,     0,     2
+       1,     1,     1,     1,     1,     1,     4,     0,     2,     5,
+       0,     2
 };
 
 /* YYDEFACT[STATE-NAME] -- Default reduction number in state STATE-NUM.
@@ -500,32 +500,32 @@ static const yytype_uint8 yyr2[] =
 static const yytype_uint8 yydefact[] =
 {
        4,     0,     0,     1,     0,     0,    10,    11,     5,     3,
-       7,     8,     0,     6,    14,    15,    21,     9,    12,    16,
-      22,    21,     0,    17,    19,    13,    21,    18,    21,    20
+       7,     8,     0,     6,    14,    15,    20,     9,    12,    13,
+      17,    21,    20,     0,    18,    16,    20,     0,    20,    19
 };
 
 /* YYDEFGOTO[NTERM-NUM].  */
 static const yytype_int8 yydefgoto[] =
 {
-      -1,     1,     2,     8,     9,    10,    17,    18,    22,    23,
-      19
+      -1,     1,     2,     8,     9,    10,    17,    18,    19,    23,
+      20
 };
 
 /* YYPACT[STATE-NUM] -- Index in YYTABLE of the portion describing
    STATE-NUM.  */
-#define YYPACT_NINF -18
+#define YYPACT_NINF -19
 static const yytype_int8 yypact[] =
 {
-     -18,     4,     0,   -18,    -1,     6,   -18,   -18,   -18,     3,
-     -18,   -18,    11,   -18,   -18,   -18,   -18,   -18,   -18,    13,
-     -18,   -18,    12,    10,    17,   -18,   -18,    13,   -18,    17
+     -19,    20,     0,   -19,    15,    16,   -19,   -19,   -19,     4,
+     -19,   -19,    13,   -19,   -19,   -19,   -19,   -19,   -19,   -19,
+      10,   -19,   -19,    -6,    18,   -19,   -19,    10,   -19,    18
 };
 
 /* YYPGOTO[NTERM-NUM].  */
 static const yytype_int8 yypgoto[] =
 {
-     -18,   -18,   -18,   -18,   -18,    15,   -18,   -17,   -18,   -18,
-     -14
+     -19,   -19,   -19,   -19,   -19,    17,   -18,   -19,   -19,   -19,
+     -15
 };
 
 /* YYTABLE[YYPACT[STATE-NUM]].  What to do in state STATE-NUM.  If
@@ -534,22 +534,22 @@ static const yytype_int8 yypgoto[] =
 #define YYTABLE_NINF -3
 static const yytype_int8 yytable[] =
 {
-      -2,     4,    21,     5,     3,    11,     6,    24,     7,     6,
-      28,     7,    27,    12,    29,    14,    15,    14,    15,    20,
-      16,    26,    25,    20,    13
+      -2,     4,    22,     5,    25,    26,     6,    24,     7,    28,
+       6,    27,     7,    29,    14,    15,    21,    14,    15,    16,
+       3,    11,    16,    12,    21,     0,    13
 };
 
 #define yypact_value_is_default(yystate) \
-  ((yystate) == (-18))
+  ((yystate) == (-19))
 
 #define yytable_value_is_error(yytable_value) \
   YYID (0)
 
-static const yytype_uint8 yycheck[] =
+static const yytype_int8 yycheck[] =
 {
-       0,     1,    19,     3,     0,     6,     6,    21,     8,     6,
-      27,     8,    26,     7,    28,     4,     5,     4,     5,     6,
-       9,    11,    10,     6,     9
+       0,     1,    20,     3,    10,    11,     6,    22,     8,    27,
+       6,    26,     8,    28,     4,     5,     6,     4,     5,     9,
+       0,     6,     9,     7,     6,    -1,     9
 };
 
 /* YYSTOS[STATE-NUM] -- The (internal number of the) accessing
@@ -557,8 +557,8 @@ static const yytype_uint8 yycheck[] =
 static const yytype_uint8 yystos[] =
 {
        0,    13,    14,     0,     1,     3,     6,     8,    15,    16,
-      17,     6,     7,    17,     4,     5,     9,    18,    19,    22,
-       6,    19,    20,    21,    22,    10,    11,    22,    19,    22
+      17,     6,     7,    17,     4,     5,     9,    18,    19,    20,
+      22,     6,    18,    21,    22,    10,    11,    22,    18,    22
 };
 
 #define yyerrok		(yyerrstatus = 0)
@@ -1148,7 +1148,7 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1155 "libxlu_cfg_y.c"
@@ -1162,11 +1162,11 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 /* Line 1391 of yacc.c  */
 #line 1164 "libxlu_cfg_y.c"
 	break;
-      case 20: /* "valuelist" */
+      case 20: /* "list" */
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1173 "libxlu_cfg_y.c"
@@ -1175,7 +1175,7 @@ yydestruct (yymsg, yytype, yyvaluep, yylocationp, ctx)
 
 /* Line 1391 of yacc.c  */
 #line 43 "libxlu_cfg_y.y"
-	{ xlu__cfg_set_free((yyvaluep->setting)); };
+	{ xlu__cfg_value_free((yyvaluep->value)); };
 
 /* Line 1391 of yacc.c  */
 #line 1182 "libxlu_cfg_y.c"
@@ -1508,21 +1508,14 @@ yyreduce:
 
 /* Line 1806 of yacc.c  */
 #line 57 "libxlu_cfg_y.y"
-    { xlu__cfg_set_store(ctx,(yyvsp[(1) - (3)].string),(yyvsp[(3) - (3)].setting),(yylsp[(3) - (3)]).first_line); }
+    { xlu__cfg_set_store(ctx,(yyvsp[(1) - (3)].string),(yyvsp[(3) - (3)].value),(yylsp[(3) - (3)]).first_line); }
     break;
 
   case 12:
 
 /* Line 1806 of yacc.c  */
 #line 62 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,1,(yyvsp[(1) - (1)].string)); }
-    break;
-
-  case 13:
-
-/* Line 1806 of yacc.c  */
-#line 63 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(3) - (4)].setting); }
+    { (yyval.value)= xlu__cfg_make_string(ctx,(yyvsp[(1) - (1)].string)); }
     break;
 
   case 14:
@@ -1543,41 +1536,34 @@ yyreduce:
 
 /* Line 1806 of yacc.c  */
 #line 68 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,0,0); }
+    { (yyval.value)= (yyvsp[(3) - (4)].value); }
     break;
 
   case 17:
 
 /* Line 1806 of yacc.c  */
-#line 69 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(1) - (1)].setting); }
+#line 70 "libxlu_cfg_y.y"
+    { (yyval.value)= xlu__cfg_make_list(ctx,NULL); }
     break;
 
   case 18:
 
 /* Line 1806 of yacc.c  */
-#line 70 "libxlu_cfg_y.y"
-    { (yyval.setting)= (yyvsp[(1) - (3)].setting); }
+#line 71 "libxlu_cfg_y.y"
+    { (yyval.value)= xlu__cfg_make_list(ctx,(yyvsp[(1) - (2)].value)); }
     break;
 
   case 19:
 
 /* Line 1806 of yacc.c  */
 #line 72 "libxlu_cfg_y.y"
-    { (yyval.setting)= xlu__cfg_set_mk(ctx,2,(yyvsp[(1) - (2)].string)); }
-    break;
-
-  case 20:
-
-/* Line 1806 of yacc.c  */
-#line 73 "libxlu_cfg_y.y"
-    { xlu__cfg_set_add(ctx,(yyvsp[(1) - (5)].setting),(yyvsp[(4) - (5)].string)); (yyval.setting)= (yyvsp[(1) - (5)].setting); }
+    { xlu__cfg_list_append(ctx,(yyvsp[(1) - (5)].value),(yyvsp[(4) - (5)].value)); (yyval.value)= (yyvsp[(1) - (5)].value);}
     break;
 
 
 
 /* Line 1806 of yacc.c  */
-#line 1581 "libxlu_cfg_y.c"
+#line 1567 "libxlu_cfg_y.c"
       default: break;
     }
   /* User semantic actions sometimes alter yychar, and that requires
diff --git a/tools/libxl/libxlu_cfg_y.h b/tools/libxl/libxlu_cfg_y.h
index d7dfaf2..37e8213 100644
--- a/tools/libxl/libxlu_cfg_y.h
+++ b/tools/libxl/libxlu_cfg_y.h
@@ -54,7 +54,7 @@ typedef union YYSTYPE
 #line 25 "libxlu_cfg_y.y"
 
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 
 
 
diff --git a/tools/libxl/libxlu_cfg_y.y b/tools/libxl/libxlu_cfg_y.y
index 5acd438..cd0d494 100644
--- a/tools/libxl/libxlu_cfg_y.y
+++ b/tools/libxl/libxlu_cfg_y.y
@@ -24,7 +24,7 @@
 
 %union {
   char *string;
-  XLU_ConfigSetting *setting;
+  XLU_ConfigValue *value;
 }
 
 %locations
@@ -39,8 +39,8 @@
 %type <string>            atom
 %destructor { free($$); } atom IDENT STRING NUMBER
 
-%type <setting>                         value valuelist values
-%destructor { xlu__cfg_set_free($$); }  value valuelist values
+%type <value>                         value list values
+%destructor { xlu__cfg_value_free($$); }  value list values
 
 %%
 
@@ -59,18 +59,17 @@ assignment: IDENT '=' value { xlu__cfg_set_store(ctx,$1,$3,@3.first_line); }
 endstmt: NEWLINE
  |      ';'
 
-value:  atom                         { $$= xlu__cfg_set_mk(ctx,1,$1); }
- |      '[' nlok valuelist ']'       { $$= $3; }
+value:  atom                         { $$= xlu__cfg_make_string(ctx,$1); }
+ |      list
 
 atom:   STRING                   { $$= $1; }
  |      NUMBER                   { $$= $1; }
 
-valuelist: /* empty */           { $$= xlu__cfg_set_mk(ctx,0,0); }
- |      values                  { $$= $1; }
- |      values ',' nlok         { $$= $1; }
+list:   '[' nlok values ']'      { $$= $3; }
 
-values: atom nlok                  { $$= xlu__cfg_set_mk(ctx,2,$1); }
- |      values ',' nlok atom nlok  { xlu__cfg_set_add(ctx,$1,$4); $$= $1; }
+values: /* empty */                { $$= xlu__cfg_make_list(ctx,NULL); }
+ |      value nlok                 { $$= xlu__cfg_make_list(ctx,$1); }
+ |      values ',' nlok value nlok { xlu__cfg_list_append(ctx,$1,$4); $$= $1;}
 
 nlok:
         /* nothing */
diff --git a/tools/libxl/libxlu_internal.h b/tools/libxl/libxlu_internal.h
index 7579158..66eec28 100644
--- a/tools/libxl/libxlu_internal.h
+++ b/tools/libxl/libxlu_internal.h
@@ -23,17 +23,29 @@
 #include <assert.h>
 #include <regex.h>
 
-#define XLU_ConfigList XLU_ConfigSetting
-
 #include "libxlutil.h"
 
-struct XLU_ConfigSetting { /* transparent */
+typedef struct XLU_ConfigValue XLU_ConfigValue;
+
+typedef struct XLU_ConfigList {
+    int nvalues;
+    XLU_ConfigValue **values;
+} XLU_ConfigList;
+
+typedef struct XLU_ConfigValue {
+    enum XLU_ConfigValueType type;
+    union {
+        char *string;
+        XLU_ConfigList list;
+    } u;
+} XLU_ConfigValue;
+
+typedef struct XLU_ConfigSetting { /* transparent */
     struct XLU_ConfigSetting *next;
     char *name;
-    int nvalues, avalues; /* lists have avalues>1 */
-    char **values;
+    struct XLU_ConfigValue *value;
     int lineno;
-};
+} XLU_ConfigSetting;
 
 struct XLU_Config {
     XLU_ConfigSetting *settings;
diff --git a/tools/libxl/libxlutil.h b/tools/libxl/libxlutil.h
index 0333e55..37d9549 100644
--- a/tools/libxl/libxlutil.h
+++ b/tools/libxl/libxlutil.h
@@ -23,6 +23,15 @@
 /* Unless otherwise stated, all functions return an errno value. */
 typedef struct XLU_Config XLU_Config;
 typedef struct XLU_ConfigList XLU_ConfigList;
+typedef struct XLU_ConfigValue XLU_ConfigValue;
+enum XLU_ConfigValueType {
+    XLU_STRING,
+    XLU_LIST,
+};
+
+enum XLU_ConfigValueType xlu_cfg_value_type(const XLU_ConfigValue *value);
+const char *xlu_cfg_value_get_string(const XLU_ConfigValue *value);
+const XLU_ConfigList *xlu_cfg_value_get_list(const XLU_ConfigValue *value);
 
 XLU_Config *xlu_cfg_init(FILE *report, const char *report_filename);
   /* 0 means we got ENOMEM. */
@@ -65,7 +74,7 @@ int xlu_cfg_get_list_as_string_list(const XLU_Config *cfg, const char *n,
 const char *xlu_cfg_get_listitem(const XLU_ConfigList*, int entry);
   /* xlu_cfg_get_listitem cannot fail, except that if entry is
    * out of range it returns 0 (not setting errno) */
-
+const XLU_ConfigValue *xlu_cfg_get_listitem2(const XLU_ConfigList*, int entry);
 
 /*
  * Disk specification parsing.
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* [PATCH v3 19/19] xl: vNUMA support
  2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
                   ` (17 preceding siblings ...)
  2015-01-13 12:11 ` [PATCH v3 18/19] libxlutil: nested list support Wei Liu
@ 2015-01-13 12:11 ` Wei Liu
  18 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 12:11 UTC (permalink / raw
  To: xen-devel
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, jbeulich,
	Elena Ufimtseva

This patch includes the configuration option parser and documentation.

Please see the hunk to xl.cfg.pod.5 for more information.
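
For illustration, a guest configuration exercising all four new options
might look like this (values are purely illustrative and assume a host
with at least two physical NUMA nodes):

  memory = 4096
  vcpus = 4
  vnuma_memory     = [ 1024, 1024, 1024, 1024 ]
  vnuma_vcpu_map   = [ 0, 1, 2, 3 ]
  vnuma_pnode_map  = [ 0, 0, 1, 1 ]
  vnuma_vdistances = [ [10,20,20,20], [20,10,20,20], [20,20,10,20], [20,20,20,10] ]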

Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Cc: Ian Campbell <ian.campbell@citrix.com>
Cc: Ian Jackson <ian.jackson@eu.citrix.com>
Cc: Dario Faggioli <dario.faggioli@citrix.com>
Cc: Elena Ufimtseva <ufimtseva@gmail.com>
---
Changes in v2:
1. Make vnuma_vdistances mandatory.
2. Use nested list to specify vdistances.
3. Update documentation.
---
 docs/man/xl.cfg.pod.5    |   39 ++++++++++++
 tools/libxl/xl_cmdimpl.c |  147 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 186 insertions(+)

diff --git a/docs/man/xl.cfg.pod.5 b/docs/man/xl.cfg.pod.5
index e2f91fc..b23bd6f 100644
--- a/docs/man/xl.cfg.pod.5
+++ b/docs/man/xl.cfg.pod.5
@@ -266,6 +266,45 @@ it will crash.
 
 =back
 
+=head3 Virtual NUMA Memory Allocation
+
+=over 4
+
+=item B<vnuma_memory=[ NUMBER, NUMBER, ... ]>
+
+Specify the size of memory covered by each virtual NUMA node. The number of
+elements in the list also implicitly defines the number of virtual NUMA nodes.
+
+The sum of all elements in this list should be equal to the memory size
+specified by B<maxmem=> in the guest configuration file, or B<memory=> if
+B<maxmem=> is not specified.
+
+=item B<vnuma_vcpu_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which virtual NUMA node a specific vcpu belongs to. The number of
+elements in this list should be equal to B<maxvcpus=> in the guest
+configuration file, or B<vcpus=> if B<maxvcpus=> is not specified.
+
+=item B<vnuma_pnode_map=[ NUMBER, NUMBER, ... ]>
+
+Specify which physical NUMA node a specific virtual NUMA node maps to. The
+number of elements in this list should be equal to the number of virtual
+NUMA nodes defined in B<vnuma_memory=>.
+
+=item B<vnuma_vdistances=[ [NUMBER, ..., NUMBER], [NUMBER, ..., NUMBER], ... ]>
+
+Two-dimensional list specifying the distances among virtual NUMA nodes.
+
+The number of elements in the outer list equals the number of virtual
+nodes. The element at position B<i> is itself a list that specifies the
+distances from node B<i> to every node, including itself.
+
+For example, for a guest with 2 virtual nodes, the user can specify:
+
+  vnuma_vdistances = [ [10, 20], [20, 10] ]
+
+=back
+
 =head3 Event Actions
 
 =over 4
diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
index 0b02a6c..e891522 100644
--- a/tools/libxl/xl_cmdimpl.c
+++ b/tools/libxl/xl_cmdimpl.c
@@ -977,6 +977,151 @@ static int parse_nic_config(libxl_device_nic *nic, XLU_Config **config, char *to
     return 0;
 }
 
+static void parse_vnuma_config(const XLU_Config *config,
+                               libxl_domain_build_info *b_info)
+{
+    int i, j;
+    XLU_ConfigList *vnuma_memory, *vnuma_vcpu_map, *vnuma_pnode_map,
+        *vnuma_vdistances;
+    int num_vnuma_memory, num_vnuma_vcpu_map, num_vnuma_pnode_map,
+        num_vnuma_vdistances;
+    const char *buf;
+    libxl_physinfo physinfo;
+    uint32_t nr_nodes;
+    unsigned long val;
+    char *ep;
+
+    libxl_physinfo_init(&physinfo);
+    if (libxl_get_physinfo(ctx, &physinfo) != 0) {
+        libxl_physinfo_dispose(&physinfo);
+        fprintf(stderr, "libxl_physinfo failed.\n");
+        exit(1);
+    }
+    nr_nodes = physinfo.nr_nodes;
+    libxl_physinfo_dispose(&physinfo);
+
+    if (xlu_cfg_get_list(config, "vnuma_memory", &vnuma_memory,
+                         &num_vnuma_memory, 1))
+        return;              /* No vnuma config */
+
+    b_info->num_vnuma_nodes = num_vnuma_memory;
+    b_info->vnuma_nodes = xmalloc(num_vnuma_memory * sizeof(libxl_vnode_info));
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
+
+        libxl_vnode_info_init(p);
+        libxl_cpu_bitmap_alloc(ctx, &p->vcpus, b_info->max_vcpus);
+        libxl_bitmap_set_none(&p->vcpus);
+        p->distances = xmalloc(b_info->num_vnuma_nodes * sizeof(uint32_t));
+        p->num_distances = b_info->num_vnuma_nodes;
+    }
+
+    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
+        buf = xlu_cfg_get_listitem(vnuma_memory, i);
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnuma memory: %s\n", buf);
+            exit(1);
+        }
+        b_info->vnuma_nodes[i].mem = val;
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_vcpu_map", &vnuma_vcpu_map,
+                         &num_vnuma_vcpu_map, 1)) {
+        fprintf(stderr, "No vcpu to vnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < b_info->max_vcpus &&
+           (buf = xlu_cfg_get_listitem(vnuma_vcpu_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vcpu map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= num_vnuma_memory) {
+            fprintf(stderr, "xl: Invalid vnode number specified: %lu\n", val);
+            exit(1);
+        }
+        libxl_bitmap_set(&b_info->vnuma_nodes[val].vcpus, i);
+        i++;
+    }
+
+    if (i != b_info->max_vcpus) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vcpu_map, provided %d, required %d\n",
+                i + 1, b_info->max_vcpus);
+        exit(1);
+    }
+
+    if (xlu_cfg_get_list(config, "vnuma_pnode_map", &vnuma_pnode_map,
+                         &num_vnuma_pnode_map, 1)) {
+        fprintf(stderr, "No vnode to pnode map specified\n");
+        exit(1);
+    }
+
+    i = 0;
+    while (i < num_vnuma_pnode_map &&
+           (buf = xlu_cfg_get_listitem(vnuma_pnode_map, i)) != NULL) {
+        val = strtoul(buf, &ep, 10);
+        if (ep == buf) {
+            fprintf(stderr, "xl: Invalid argument parsing vnode to pnode map: %s\n", buf);
+            exit(1);
+        }
+        if (val >= nr_nodes) {
+            fprintf(stderr, "xl: Invalid pnode number specified: %lu\n", val);
+            exit(1);
+        }
+
+        b_info->vnuma_nodes[i].pnode = val;
+
+        i++;
+    }
+
+    if (i != b_info->num_vnuma_nodes) {
+        fprintf(stderr, "xl: Not enough elements in vnuma_vnode_map, provided %d, required %d\n",
+                i + 1, b_info->num_vnuma_nodes);
+        exit(1);
+    }
+
+    if (!xlu_cfg_get_list(config, "vnuma_vdistances", &vnuma_vdistances,
+                          &num_vnuma_vdistances, 1)) {
+        if (num_vnuma_vdistances != num_vnuma_memory) {
+            fprintf(stderr, "xl: Required %d sub-lists in vnuma_vdistances but provided %d\n",
+                    num_vnuma_memory, num_vnuma_vdistances);
+            exit(1);
+        }
+
+        for (i = 0; i < num_vnuma_vdistances; i++) {
+            const XLU_ConfigValue *tmp;
+            const XLU_ConfigList *sublist;
+
+            tmp = xlu_cfg_get_listitem2(vnuma_vdistances, i);
+            if (xlu_cfg_value_type(tmp) != XLU_LIST) {
+                fprintf(stderr, "xl: Expecting list in vnuma_vdistance\n");
+                exit(1);
+            }
+            sublist = xlu_cfg_value_get_list(tmp);
+            for (j = 0;
+                 (buf = xlu_cfg_get_listitem(sublist, j)) != NULL;
+                 j++) {
+                val = strtoul(buf, &ep, 10);
+                if (ep == buf) {
+                    fprintf(stderr, "xl: Invalid argument parsing vdistances map: %s\n", buf);
+                    exit(1);
+                }
+
+                b_info->vnuma_nodes[i].distances[j] = val;
+            }
+        }
+
+    } else {
+        fprintf(stderr, "xl: No vnuma_vdistances specified.\n");
+        exit(1);
+    }
+}
+
 static void parse_config_data(const char *config_source,
                               const char *config_data,
                               int config_len,
@@ -1167,6 +1312,8 @@ static void parse_config_data(const char *config_source,
         }
     }
 
+    parse_vnuma_config(config, b_info);
+
     if (!xlu_cfg_get_long(config, "rtc_timeoffset", &l, 0))
         b_info->rtc_timeoffset = l;
 
-- 
1.7.10.4

^ permalink raw reply related	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 04/19] libxl: introduce vNUMA types
  2015-01-13 12:11 ` [PATCH v3 04/19] libxl: introduce vNUMA types Wei Liu
@ 2015-01-13 15:40   ` Ian Jackson
  2015-01-13 15:51     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 15:40 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, Ian Campbell, jbeulich,
	xen-devel

Wei Liu writes ("[PATCH v3 04/19] libxl: introduce vNUMA types"):
> A domain can contain several virtual NUMA nodes, hence we introduce an
> array in libxl_domain_build_info.
> 
> libxl_vnode_info contains the size of memory in that node, the distance
> from that node to every nodes, the underlying pnode and a bitmap of
> vcpus.
...
> +libxl_vnode_info = Struct("vnode_info", [
> +    ("mem", uint64), # memory size of this node, in MiB
> +    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
> +    ("pnode", uint32), # physical node of this node
> +    ("vcpus", libxl_bitmap), # vcpus in this node
> +    ])

Can you please document the defaults ?  It's probably tolerable to have
that in the xl documentation (only) so maybe it should go in 19/19.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 04/19] libxl: introduce vNUMA types
  2015-01-13 15:40   ` Ian Jackson
@ 2015-01-13 15:51     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 15:51 UTC (permalink / raw
  To: Ian Jackson
  Cc: Wei Liu, Ian Campbell, dario.faggioli, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 03:40:43PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v3 04/19] libxl: introduce vNUMA types"):
> > A domain can contain several virtual NUMA nodes, hence we introduce an
> > array in libxl_domain_build_info.
> > 
> > libxl_vnode_info contains the size of memory in that node, the distance
> > from that node to every nodes, the underlying pnode and a bitmap of
> > vcpus.
> ...
> > +libxl_vnode_info = Struct("vnode_info", [
> > +    ("mem", uint64), # memory size of this node, in MiB
> > +    ("distances", Array(uint32, "num_distances")), # distances from this node to other nodes
> > +    ("pnode", uint32), # physical node of this node
> > +    ("vcpus", libxl_bitmap), # vcpus in this node
> > +    ])
> 
> Can you please document the defaults ?  It's probably tolerable to have
> that in the xl documentation (only) so maybe it should go in 19/19.
> 

There are no default values. Users have to specify every field of this
structure at the moment, otherwise libxl refuses to proceed. I have to
check whether I actually coded it that way, but that's the expected
behaviour.
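
To be concrete, a caller has to fill in each vnode along these lines
(illustrative sketch with made-up values; a real caller would take them
from the guest configuration):

    libxl_vnode_info *p = &b_info->vnuma_nodes[0];

    libxl_vnode_info_init(p);
    p->mem = 2048;                  /* MiB covered by this vnode */
    p->pnode = 0;                   /* backed by physical node 0 */
    p->num_distances = 2;
    p->distances = calloc(p->num_distances, sizeof(uint32_t));
    p->distances[0] = 10;           /* distance to itself */
    p->distances[1] = 20;           /* distance to vnode 1 */
    libxl_cpu_bitmap_alloc(ctx, &p->vcpus, b_info->max_vcpus);
    libxl_bitmap_set_none(&p->vcpus);
    libxl_bitmap_set(&p->vcpus, 0); /* vcpu 0 lives on this vnode */

and similarly for every other vnode.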

I think there might be some default values when Dario finishes his work
on vNUMA auto placement. But I can't say for sure.

Wei.

> Thanks,
> Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 18/19] libxlutil: nested list support
  2015-01-13 12:11 ` [PATCH v3 18/19] libxlutil: nested list support Wei Liu
@ 2015-01-13 15:52   ` Ian Jackson
  2015-01-13 16:11     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 15:52 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Ian Campbell, jbeulich, xen-devel

Wei Liu writes ("[PATCH v3 18/19] libxlutil: nested list support"):
> This is done with three major changes:
> 1. Rework internal representation of setting.
> 2. Extend grammar of parser.
> 3. Introduce new APIs.

This commit message is very brief.  For example, under the heading of
`Rework internal representation of setting' I would expect a clear
description of every formulaic change.

Also, I think it would be much easier to review if split up into 3 parts,
which from the description above ought to be doable without trouble.

AFAICT from your changes, the API is not backward compatible.  ICBW,
but if I'm right that's not acceptable I'm afraid, even in libxlu.

> Previous APIs work as before.

That can't be right because you have to at least specify how they deal
with the additional config file syntax.

Thanks,
Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 18/19] libxlutil: nested list support
  2015-01-13 15:52   ` Ian Jackson
@ 2015-01-13 16:11     ` Wei Liu
  2015-01-13 18:25       ` Ian Jackson
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 16:11 UTC (permalink / raw
  To: Ian Jackson; +Cc: dario.faggioli, Wei Liu, Ian Campbell, jbeulich, xen-devel

On Tue, Jan 13, 2015 at 03:52:48PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v3 18/19] libxlutil: nested list support"):
> > This is done with three major changes:
> > 1. Rework internal representation of setting.
> > 2. Extend grammar of parser.
> > 3. Introduce new APIs.
> 
> This commit message is very brief.  For example, under the heading of
> `Rework internal representation of setting' I would expect a clear
> description of every formulaic change.
> 

Originally the internal representation of a setting was a (string,
string) pair, the first string being the name of the setting and the
second string being its value. Now the internal representation is a
(string, ConfigValue) pair, where a ConfigValue can refer to either a
string or a list of ConfigValues. The internal functions that deal with
settings are changed accordingly.
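
For example, a caller that wants to read a nested list does roughly the
following with the new APIs (untested sketch; "some_nested_list" is a
made-up option name, error handling omitted):

    XLU_ConfigList *outer;
    const char *item;
    int n, i, j;

    if (!xlu_cfg_get_list(config, "some_nested_list", &outer, &n, 1)) {
        for (i = 0; i < n; i++) {
            const XLU_ConfigValue *v = xlu_cfg_get_listitem2(outer, i);
            if (xlu_cfg_value_type(v) != XLU_LIST)
                continue; /* string element; use xlu_cfg_value_get_string */
            const XLU_ConfigList *inner = xlu_cfg_value_get_list(v);
            for (j = 0; (item = xlu_cfg_get_listitem(inner, j)) != NULL; j++)
                ; /* item is element j of sub-list i, as a string */
        }
    }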

Does the above description make things clearer?

> Also, I think would be much easier to review if split up into 3 parts,
> which from the description above ought to be doable without trouble.
> 

OK. I can try to split this patch into three.

> AFAICT from your changes, the API is not backward compatible.  ICBW,
> but if I'm right that's not acceptable I'm afraid, even in libxlu.
> 

The old APIs still have the same semantics as before, so any applications
linked against those APIs still get the same results returned.

> > Previous APIs work as before.
> 
> That can't be right because you have to at least specify how they deal
> with the additional config file syntax.
> 

No, the old APIs don't deal with the new syntax. If applications want to
support the new syntax, they need to use the new APIs.

If the old APIs are used to retrieve a value written in the new syntax,
they have no effect.

Wei.

> Thanks,
> Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal
  2015-01-13 12:11 ` [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal Wei Liu
@ 2015-01-13 16:32   ` Jan Beulich
  2015-01-13 16:36     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:32 UTC (permalink / raw
  To: Wei Liu, Tiejun Chen; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> From: Tiejun Chen <tiejun.chen@intel.com>
> 
> We need to act on some specific hypercall error numbers, so
> require the hypervisor view on the errno.h value rather than
> just the build environment's number. So here link this headfile
> from xen.
> 
> Acked-by: Jan Beulich <jbeulich@suse.com>
> Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>

Now that Tiejun's series didn't make it into 4.5 after all, I have to
retract that ack: as was discussed before, for 4.6 we want to
deal with this properly. Please see the RFC patch at
http://lists.xenproject.org/archives/html/xen-devel/2015-01/msg01236.html.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal
  2015-01-13 16:32   ` Jan Beulich
@ 2015-01-13 16:36     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 16:36 UTC (permalink / raw
  To: Jan Beulich; +Cc: Tiejun Chen, dario.faggioli, Wei Liu, xen-devel

On Tue, Jan 13, 2015 at 04:32:10PM +0000, Jan Beulich wrote:
> >>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> > From: Tiejun Chen <tiejun.chen@intel.com>
> > 
> > We need to act on some specific hypercall error numbers, so
> > require the hypervisor view on the errno.h value rather than
> > just the build environment's number. So here link this headfile
> > from xen.
> > 
> > Acked-by: Jan Beulich <jbeulich@suse.com>
> > Signed-off-by: Tiejun Chen <tiejun.chen@intel.com>
> 
> Now that Tiejun's series didn't go into 4.5 anymore, I have to
> retract that ack: As was discussed before, for 4.6 we want to
> deal with this properly. Please see the RFC patch at
> http://lists.xenproject.org/archives/html/xen-devel/2015-01/msg01236.html.
> 

Thanks. This works for me too.

I will make adjustments to my patch once your patch is in.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
@ 2015-01-13 16:38   ` Jan Beulich
  2015-01-13 16:45     ` Wei Liu
  2015-01-13 19:21   ` Andrew Cooper
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:38 UTC (permalink / raw
  To: Wei Liu, Elena Ufimsteva; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> @@ -408,6 +413,49 @@ static void dump_numa(unsigned char key)
>  
>          for_each_online_node ( i )
>              printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +        if ( !d->vnuma )
> +                continue;
> +
> +        vnuma = d->vnuma;

I'm sorry for noticing this only now, but don't you need to try to
acquire d->vnuma_rwlock for reading, and skip any printing if the
acquire attempt fails, to avoid a XEN_DOMCTL_setvnumainfo invocation
making the information go away under your feet?
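
I.e. something along these lines (sketch only, assuming a non-blocking
read_trylock() primitive is available for d->vnuma_rwlock):

    if ( !read_trylock(&d->vnuma_rwlock) )
        continue;          /* update in progress - skip vNUMA dump */

    vnuma = d->vnuma;
    if ( vnuma )
    {
        /* ... print vdistance / vcpu_to_vnode / vmemrange ... */
    }

    read_unlock(&d->vnuma_rwlock);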

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op
  2015-01-13 12:11 ` [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
@ 2015-01-13 16:41   ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:41 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
>          rc = do_memory_op(cmd, nat.hnd);
> +        if ( rc == -ENOBUFS && op == XENMEM_get_vnumainfo )
> +        {
> +            cmp.vnuma.nr_vnodes = nat.vnuma->nr_vnodes;
> +            cmp.vnuma.nr_vcpus = nat.vnuma->nr_vcpus;
> +            cmp.vnuma.nr_vmemranges = nat.vnuma->nr_vmemranges;
> +            if ( __copy_to_guest(compat, &cmp.vnuma, 1) )
> +                rc = -EFAULT;
> +        }
>          if ( rc < 0 )
>              break;

Please move the new if() inside the body of the existing one.
Ack with that change.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 16:38   ` Jan Beulich
@ 2015-01-13 16:45     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 16:45 UTC (permalink / raw
  To: Jan Beulich; +Cc: dario.faggioli, Wei Liu, Elena Ufimsteva, xen-devel

On Tue, Jan 13, 2015 at 04:38:33PM +0000, Jan Beulich wrote:
> >>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> > @@ -408,6 +413,49 @@ static void dump_numa(unsigned char key)
> >  
> >          for_each_online_node ( i )
> >              printk("    Node %u: %u\n", i, page_num_node[i]);
> > +
> > +        if ( !d->vnuma )
> > +                continue;
> > +
> > +        vnuma = d->vnuma;
> 
> I'm sorry for noticing this only now, but don't you need to try to
> acquire d->vnuma_rwlock for reading, and skip any printing if
> the acquire attempt fails, to avoid a XEN_DOMCTL_setvnumainfo
> invocation to make the information go away under your feet?
> 

I will fix this.

Wei.

> Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-13 12:11 ` [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
@ 2015-01-13 16:50   ` Jan Beulich
  2015-01-13 17:17     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:50 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> +void init_vnuma_info(void)
> +{
> +    int rc, retry = 0;
> +    struct xen_vnuma_topology_info vnuma_topo;
> +
> +    vcpu_to_vnode = scratch_alloc(sizeof(uint32_t) * hvm_info->nr_vcpus, 0);

sizeof(*vcpu_to_vnode) please.

> +    rc = -EAGAIN;
> +    while ( rc == -EAGAIN && retry < 10 )

What's the justification for 10 here? A sane tool stack shouldn't alter
the values while starting the domain.

> +    {
> +        vnuma_topo.domid = DOMID_SELF;
> +        vnuma_topo.pad = 0;
> +        vnuma_topo.nr_vcpus = 0;
> +        vnuma_topo.nr_vnodes = 0;
> +        vnuma_topo.nr_vmemranges = 0;
> +
> +        set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
> +        set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
> +        set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);
> +
> +        rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> +
> +        if ( rc == -EOPNOTSUPP )
> +            return;
> +
> +        if ( rc != -ENOBUFS )
> +            break;
> +
> +        ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);

I also wonder whether we shouldn't make the hypervisor return the
(modified) values in the -EAGAIN error case, so that you could move
the above first half of the loop body out of the loop.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 13/19] hvmloader: construct SRAT
  2015-01-13 12:11 ` [PATCH v3 13/19] hvmloader: construct SRAT Wei Liu
@ 2015-01-13 16:52   ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:52 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 14/19] hvmloader: construct SLIT
  2015-01-13 12:11 ` [PATCH v3 14/19] hvmloader: construct SLIT Wei Liu
@ 2015-01-13 16:53   ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2015-01-13 16:53 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, xen-devel

>>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>

Acked-by: Jan Beulich <jbeulich@suse.com>

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state
  2015-01-13 12:11 ` [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
@ 2015-01-13 17:02   ` Ian Jackson
  2015-01-13 17:34     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 17:02 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, Ian Campbell, jbeulich,
	xen-devel

Wei Liu writes ("[PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state"):
> A vnode consists of one or more vmemranges (virtual memory range).  One
> example of multiple vmemranges is that there is a hole in one vnode.

I'm finding this series a bit oddly structured.  This patch, for
example, just introduces some new fields to an internal state struct -
but these fields are not initialised, set, or read.

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-13 12:11 ` [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
@ 2015-01-13 17:05   ` Ian Jackson
  2015-01-13 17:41     ` Wei Liu
  2015-01-13 20:02   ` Andrew Cooper
  1 sibling, 1 reply; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 17:05 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, Ian Campbell, jbeulich,
	xen-devel

Wei Liu writes ("[PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest"):
...
> diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
> index 07d7224..c459e77 100644
> --- a/tools/libxc/include/xc_dom.h
> +++ b/tools/libxc/include/xc_dom.h
> @@ -167,6 +167,11 @@ struct xc_dom_image {
...
> +    /* vNUMA information */
> +    unsigned int *vnode_to_pnode; /* vnode to pnode mapping array */
> +    uint64_t *vnode_size;         /* vnode size array */

You don't specify the units.  You should probably name the variable
_bytes or _pages or something.

Looking at the algorithm below it seems to be in _mby.  But the domain
size is specified in pages.  So AFAICT if you try to create a domain
which is not a whole number of pages, it is bound to fail !

Perhaps the vnode memory size should be in pages too.

> +    unsigned int nr_vnodes;       /* number of elements of above arrays */

Is there some reason to prefer this arrangement with multiple parallel
arrays, to one with a single array of structs ?

> +        /* Setup dummy vNUMA information if it's not provided. Not
> +         * that this is a valid state if libxl doesn't provide any
> +         * vNUMA information.
> +         *
> +         * In this case we setup some dummy value for the convenience
> +         * of the allocation code. Note that from the user's PoV the
> +         * guest still has no vNUMA configuration.
> +         */

This arrangement for defaulting makes it difficult to supply only
partial information - for example, to supply the number of vnodes but
allow the system to make up the details.

I have a similar complaint about the corresponding libxl code.

I think you should decide where you want the defaulting to be, and do
it in a more flexible way in that one place.  Probably, libxl.

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-13 12:11 ` [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize Wei Liu
@ 2015-01-13 17:07   ` Ian Jackson
  2015-01-13 21:00   ` Andrew Cooper
  1 sibling, 0 replies; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 17:07 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, Ian Campbell, jbeulich,
	xen-devel

Wei Liu writes ("[PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize"):
> This function gets the machine E820 map and sanitize it according to PV
> guest configuration.
> 
> This will be used in later patch. No functional change introduced in
> this patch.

Thanks.  It is easy to see that this is correct.

The way that `rc' is used to contain a libxc (syscall) return value is
contrary to the coding style but it is better not to fix this in the
same patch as the code motion.

Acked-by: Ian Jackson <ian.jackson@eu.citrix.com>

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor
  2015-01-13 16:50   ` Jan Beulich
@ 2015-01-13 17:17     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 17:17 UTC (permalink / raw
  To: Jan Beulich; +Cc: dario.faggioli, Wei Liu, xen-devel

On Tue, Jan 13, 2015 at 04:50:11PM +0000, Jan Beulich wrote:
> >>> On 13.01.15 at 13:11, <wei.liu2@citrix.com> wrote:
> > +void init_vnuma_info(void)
> > +{
> > +    int rc, retry = 0;
> > +    struct xen_vnuma_topology_info vnuma_topo;
> > +
> > +    vcpu_to_vnode = scratch_alloc(sizeof(uint32_t) * hvm_info->nr_vcpus, 0);
> 
> sizeof(*vcpu_to_vnode) please.
> 

Done.

> > +    rc = -EAGAIN;
> > +    while ( rc == -EAGAIN && retry < 10 )
> 
> What's the justification for 10 here? A sane tool stack shouldn't alter
> the values while starting the domain.
> 

I wasn't sure whether a toolstack would change the values whilst the
domain is running. But now that you confirm a sane toolstack shouldn't
do that, I can just remove this loop.

> > +    {
> > +        vnuma_topo.domid = DOMID_SELF;
> > +        vnuma_topo.pad = 0;
> > +        vnuma_topo.nr_vcpus = 0;
> > +        vnuma_topo.nr_vnodes = 0;
> > +        vnuma_topo.nr_vmemranges = 0;
> > +
> > +        set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
> > +        set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
> > +        set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);
> > +
> > +        rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
> > +
> > +        if ( rc == -EOPNOTSUPP )
> > +            return;
> > +
> > +        if ( rc != -ENOBUFS )
> > +            break;
> > +
> > +        ASSERT(vnuma_topo.nr_vcpus == hvm_info->nr_vcpus);
> 
> I also wonder whether we shouldn't make the hypervisor return
> back the (modified) values in the -EAGAIN error case, so that you
> could move above first half of the loop body out of the loop.
> 

I don't think the hypercall modifies the values in the -EAGAIN case. The
first half of that loop just prepares the hypercall structure so that we
can retrieve the new sizes.

But since you say no sane toolstack should do that, this issue becomes
moot.
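
So the loop-free version would look roughly like this (sketch; it
assumes the sizes reported by the first call stay valid):

    vnuma_topo.domid = DOMID_SELF;
    vnuma_topo.pad = 0;
    vnuma_topo.nr_vcpus = 0;
    vnuma_topo.nr_vnodes = 0;
    vnuma_topo.nr_vmemranges = 0;
    set_xen_guest_handle(vnuma_topo.vdistance.h, NULL);
    set_xen_guest_handle(vnuma_topo.vcpu_to_vnode.h, NULL);
    set_xen_guest_handle(vnuma_topo.vmemrange.h, NULL);

    rc = hypercall_memory_op(XENMEM_get_vnumainfo, &vnuma_topo);
    if ( rc == -EOPNOTSUPP )
        return;              /* no vNUMA information for this guest */
    ASSERT(rc == -ENOBUFS);  /* nr_* fields now hold the real sizes */

    /* allocate vdistance / vcpu_to_vnode / vmemrange buffers based on
     * the returned nr_* fields, set the handles and issue the
     * hypercall once more, this time expecting rc == 0 */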

Wei.

> Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state
  2015-01-13 17:02   ` Ian Jackson
@ 2015-01-13 17:34     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 17:34 UTC (permalink / raw
  To: Ian Jackson
  Cc: Wei Liu, Ian Campbell, dario.faggioli, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 05:02:10PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state"):
> > A vnode consists of one or more vmemranges (virtual memory range).  One
> > example of multiple vmemranges is that there is a hole in one vnode.
> 
> I'm finding this series a bit oddly structured.  This patch, for
> example, just introduces some new fields to an internal state struct -
> but these fields are not initialised, set, or read.
> 

The new fields (like the other existing fields) are initialised to zero
in initiate_domain_create, which is why they don't need to be explicitly
initialised.

These new fields are accessed in the next patch. I can either say so
explicitly in the commit log or squash this patch with the next one.
Which way do you prefer?

TBH I don't think this patch and the next one should be squashed into
one patch.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-13 17:05   ` Ian Jackson
@ 2015-01-13 17:41     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-13 17:41 UTC (permalink / raw
  To: Ian Jackson
  Cc: Wei Liu, Ian Campbell, dario.faggioli, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 05:05:26PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest"):
> ...
> > diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
> > index 07d7224..c459e77 100644
> > --- a/tools/libxc/include/xc_dom.h
> > +++ b/tools/libxc/include/xc_dom.h
> > @@ -167,6 +167,11 @@ struct xc_dom_image {
> ...
> > +    /* vNUMA information */
> > +    unsigned int *vnode_to_pnode; /* vnode to pnode mapping array */
> > +    uint64_t *vnode_size;         /* vnode size array */
> 
> You don't specify the units.  You should probably name the variable
> _bytes or _pages or something.
> 
> Looking at the algorithm below it seems to be in _mby.  But the domain
> size is specified in pages.  So AFAICT if you try to create a domain
> which is not a whole number of pages, it is bound to fail !
> 
> Perhaps the vnode memory size should be in pages too.
> 

Let's use pages as the unit.

> > +    unsigned int nr_vnodes;       /* number of elements of above arrays */
> 
> Is there some reason to prefer this arrangement with multiple parallel
> arrays, to one with a single array of structs ?
> 

No, I don't have a preference. I can pack vnode_to_pnode and
vnode_size(_pages) into a struct.
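
Something along these lines, perhaps (hypothetical names, to be settled
in the next version):

    /* per-vnode record replacing the parallel arrays */
    struct xc_vnuma_node {
        unsigned int pnode; /* physical node backing this vnode */
        uint64_t pages;     /* size of this vnode in guest pages */
    };

    /* in struct xc_dom_image */
    struct xc_vnuma_node *vnuma_nodes;
    unsigned int nr_vnuma_nodes; /* number of elements in the array */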

> > +        /* Setup dummy vNUMA information if it's not provided. Not
> > +         * that this is a valid state if libxl doesn't provide any
> > +         * vNUMA information.
> > +         *
> > +         * In this case we setup some dummy value for the convenience
> > +         * of the allocation code. Note that from the user's PoV the
> > +         * guest still has no vNUMA configuration.
> > +         */
> 
> This arrangement for defaulting makes it difficult to supply only
> partial information - for example, to supply the number of vnodes but
> allow the system to make up the details.
> 
> I have a similar complaint about the corresponding libxl code.
> 
> I think you should decide where you want the defaulting to be, and do
> it in a more flexible way in that one place.  Probably, libxl.
> 

The defaulting will be in libxl. That's what Dario is working on.

If libxl provides information, these dummy values will have no effect.

Maybe the comment is confusing. I wasn't saying there is defaulting
happening inside libxc. It's only for the convenience of the allocation
code, because it needs to operate on one mapping.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-13 12:11 ` [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
@ 2015-01-13 18:15   ` Konrad Rzeszutek Wilk
  2015-01-13 20:44     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-13 18:15 UTC (permalink / raw
  To: Wei Liu
  Cc: Ian Campbell, dario.faggioli, Ian Jackson, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 12:11:43PM +0000, Wei Liu wrote:
> The algorithm is more or less the same as the one used for PV guest.
> Libxc gets hold of the mapping of vnode to pnode and size of each vnode
> then allocate memory accordingly.

Could you split this patch in two? One part for the adding of the code
and the other for moving the existing code around?


> 
> And then the function returns low memory end, high memory end and mmio
> start to caller. Libxl needs those values to construct vmemranges for
> that guest.
> 
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
> Changes in v3:
> 1. Rewrite commit log.
> 2. Add a few code comments.
> ---
>  tools/libxc/include/xenguest.h |    7 ++
>  tools/libxc/xc_hvm_build_x86.c |  224 ++++++++++++++++++++++++++--------------
>  2 files changed, 151 insertions(+), 80 deletions(-)
> 
> diff --git a/tools/libxc/include/xenguest.h b/tools/libxc/include/xenguest.h
> index 40bbac8..d1cbb4e 100644
> --- a/tools/libxc/include/xenguest.h
> +++ b/tools/libxc/include/xenguest.h
> @@ -230,6 +230,13 @@ struct xc_hvm_build_args {
>      struct xc_hvm_firmware_module smbios_module;
>      /* Whether to use claim hypercall (1 - enable, 0 - disable). */
>      int claim_enabled;
> +    unsigned int nr_vnodes;	  /* Number of vnodes */
> +    unsigned int *vnode_to_pnode; /* Vnode to pnode mapping */
> +    uint64_t *vnode_size;	  /* Size of vnodes */
> +    /* Out parameters  */
> +    uint64_t lowmem_end;
> +    uint64_t highmem_end;
> +    uint64_t mmio_start;
>  };
>  
>  /**
> diff --git a/tools/libxc/xc_hvm_build_x86.c b/tools/libxc/xc_hvm_build_x86.c
> index c81a25b..54d3dc8 100644
> --- a/tools/libxc/xc_hvm_build_x86.c
> +++ b/tools/libxc/xc_hvm_build_x86.c
> @@ -89,7 +89,8 @@ static int modules_init(struct xc_hvm_build_args *args,
>  }
>  
>  static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
> -                           uint64_t mmio_start, uint64_t mmio_size)
> +                           uint64_t mmio_start, uint64_t mmio_size,
> +                           struct xc_hvm_build_args *args)
>  {
>      struct hvm_info_table *hvm_info = (struct hvm_info_table *)
>          (((unsigned char *)hvm_info_page) + HVM_INFO_OFFSET);
> @@ -119,6 +120,10 @@ static void build_hvm_info(void *hvm_info_page, uint64_t mem_size,
>      hvm_info->high_mem_pgend = highmem_end >> PAGE_SHIFT;
>      hvm_info->reserved_mem_pgstart = ioreq_server_pfn(0);
>  
> +    args->lowmem_end = lowmem_end;
> +    args->highmem_end = highmem_end;
> +    args->mmio_start = mmio_start;
> +
>      /* Finish with the checksum. */
>      for ( i = 0, sum = 0; i < hvm_info->length; i++ )
>          sum += ((uint8_t *)hvm_info)[i];
> @@ -244,7 +249,7 @@ static int setup_guest(xc_interface *xch,
>                         char *image, unsigned long image_size)
>  {
>      xen_pfn_t *page_array = NULL;
> -    unsigned long i, nr_pages = args->mem_size >> PAGE_SHIFT;
> +    unsigned long i, j, nr_pages = args->mem_size >> PAGE_SHIFT;
>      unsigned long target_pages = args->mem_target >> PAGE_SHIFT;
>      uint64_t mmio_start = (1ull << 32) - args->mmio_size;
>      uint64_t mmio_size = args->mmio_size;
> @@ -258,13 +263,13 @@ static int setup_guest(xc_interface *xch,
>      xen_capabilities_info_t caps;
>      unsigned long stat_normal_pages = 0, stat_2mb_pages = 0, 
>          stat_1gb_pages = 0;
> -    int pod_mode = 0;
> +    unsigned int memflags = 0;
>      int claim_enabled = args->claim_enabled;
>      xen_pfn_t special_array[NR_SPECIAL_PAGES];
>      xen_pfn_t ioreq_server_array[NR_IOREQ_SERVER_PAGES];
> -
> -    if ( nr_pages > target_pages )
> -        pod_mode = XENMEMF_populate_on_demand;
> +    uint64_t dummy_vnode_size;
> +    unsigned int dummy_vnode_to_pnode;
> +    uint64_t total;
>  
>      memset(&elf, 0, sizeof(elf));
>      if ( elf_init(&elf, image, image_size) != 0 )
> @@ -276,6 +281,37 @@ static int setup_guest(xc_interface *xch,
>      v_start = 0;
>      v_end = args->mem_size;
>  
> +    if ( nr_pages > target_pages )
> +        memflags |= XENMEMF_populate_on_demand;
> +
> +    if ( args->nr_vnodes == 0 )
> +    {
> +        /* Build dummy vnode information */
> +        args->nr_vnodes = 1;
> +        dummy_vnode_to_pnode = XC_VNUMA_NO_NODE;
> +        dummy_vnode_size = args->mem_size >> 20;
> +        args->vnode_size = &dummy_vnode_size;
> +        args->vnode_to_pnode = &dummy_vnode_to_pnode;
> +    }
> +    else
> +    {
> +        if ( nr_pages > target_pages )
> +        {
> +            PERROR("Cannot enable vNUMA and PoD at the same time");
> +            goto error_out;
> +        }
> +    }
> +
> +    total = 0;
> +    for ( i = 0; i < args->nr_vnodes; i++ )
> +        total += (args->vnode_size[i] << 20);
> +    if ( total != args->mem_size )
> +    {
> +        PERROR("Memory size requested by vNUMA (0x%"PRIx64") mismatches memory size configured for domain (0x%"PRIx64")",
> +               total, args->mem_size);
> +        goto error_out;
> +    }
> +
>      if ( xc_version(xch, XENVER_capabilities, &caps) != 0 )
>      {
>          PERROR("Could not get Xen capabilities");
> @@ -320,7 +356,7 @@ static int setup_guest(xc_interface *xch,
>          }
>      }
>  
> -    if ( pod_mode )
> +    if ( memflags & XENMEMF_populate_on_demand )
>      {
>          /*
>           * Subtract VGA_HOLE_SIZE from target_pages for the VGA
> @@ -349,103 +385,128 @@ static int setup_guest(xc_interface *xch,
>       * ensure that we can be preempted and hence dom0 remains responsive.
>       */
>      rc = xc_domain_populate_physmap_exact(
> -        xch, dom, 0xa0, 0, pod_mode, &page_array[0x00]);
> +        xch, dom, 0xa0, 0, memflags, &page_array[0x00]);
>      cur_pages = 0xc0;
>      stat_normal_pages = 0xc0;
>  
> -    while ( (rc == 0) && (nr_pages > cur_pages) )
> +    for ( i = 0; i < args->nr_vnodes; i++ )
>      {
> -        /* Clip count to maximum 1GB extent. */
> -        unsigned long count = nr_pages - cur_pages;
> -        unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
> -
> -        if ( count > max_pages )
> -            count = max_pages;
> -
> -        cur_pfn = page_array[cur_pages];
> +        unsigned int new_memflags = memflags;
> +        uint64_t pages, finished;
>  
> -        /* Take care the corner cases of super page tails */
> -        if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
> -             (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
> -            count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
> -        else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
> -                  (count > SUPERPAGE_1GB_NR_PFNS) )
> -            count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
> -
> -        /* Attemp to allocate 1GB super page. Because in each pass we only
> -         * allocate at most 1GB, we don't have to clip super page boundaries.
> -         */
> -        if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
> -             /* Check if there exists MMIO hole in the 1GB memory range */
> -             !check_mmio_hole(cur_pfn << PAGE_SHIFT,
> -                              SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
> -                              mmio_start, mmio_size) )
> +        if ( args->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
>          {
> -            long done;
> -            unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
> -            xen_pfn_t sp_extents[nr_extents];
> -
> -            for ( i = 0; i < nr_extents; i++ )
> -                sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_1GB_SHIFT)];
> -
> -            done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
> -                                              pod_mode, sp_extents);
> -
> -            if ( done > 0 )
> -            {
> -                stat_1gb_pages += done;
> -                done <<= SUPERPAGE_1GB_SHIFT;
> -                cur_pages += done;
> -                count -= done;
> -            }
> +            new_memflags |= XENMEMF_exact_node(args->vnode_to_pnode[i]);
> +            new_memflags |= XENMEMF_exact_node_request;
>          }
>  
> -        if ( count != 0 )
> +        pages = (args->vnode_size[i] << 20) >> PAGE_SHIFT;
> +        /* Consider vga hole belongs to node 0 */
> +        if ( i == 0 )
> +            finished = 0xc0;
> +        else
> +            finished = 0;
> +
> +        while ( (rc == 0) && (pages > finished) )
>          {
> -            /* Clip count to maximum 8MB extent. */
> -            max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
> +            /* Clip count to maximum 1GB extent. */
> +            unsigned long count = pages - finished;
> +            unsigned long max_pages = SUPERPAGE_1GB_NR_PFNS;
> +
>              if ( count > max_pages )
>                  count = max_pages;
> -            
> -            /* Clip partial superpage extents to superpage boundaries. */
> -            if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
> -                 (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
> -                count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
> -            else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
> -                      (count > SUPERPAGE_2MB_NR_PFNS) )
> -                count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
> -
> -            /* Attempt to allocate superpage extents. */
> -            if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
> +
> +            cur_pfn = page_array[cur_pages];
> +
> +            /* Take care the corner cases of super page tails */
> +            if ( ((cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
> +                 (count > (-cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1))) )
> +                count = -cur_pfn & (SUPERPAGE_1GB_NR_PFNS-1);
> +            else if ( ((count & (SUPERPAGE_1GB_NR_PFNS-1)) != 0) &&
> +                      (count > SUPERPAGE_1GB_NR_PFNS) )
> +                count &= ~(SUPERPAGE_1GB_NR_PFNS - 1);
> +
> +            /* Attemp to allocate 1GB super page. Because in each pass we only
> +             * allocate at most 1GB, we don't have to clip super page boundaries.
> +             */
> +            if ( ((count | cur_pfn) & (SUPERPAGE_1GB_NR_PFNS - 1)) == 0 &&
> +                 /* Check if there exists MMIO hole in the 1GB memory range */
> +                 !check_mmio_hole(cur_pfn << PAGE_SHIFT,
> +                                  SUPERPAGE_1GB_NR_PFNS << PAGE_SHIFT,
> +                                  mmio_start, mmio_size) )
>              {
>                  long done;
> -                unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
> +                unsigned long nr_extents = count >> SUPERPAGE_1GB_SHIFT;
>                  xen_pfn_t sp_extents[nr_extents];
>  
> -                for ( i = 0; i < nr_extents; i++ )
> -                    sp_extents[i] = page_array[cur_pages+(i<<SUPERPAGE_2MB_SHIFT)];
> +                for ( j = 0; j < nr_extents; j++ )
> +                    sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_1GB_SHIFT)];
>  
> -                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
> -                                                  pod_mode, sp_extents);
> +                done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_1GB_SHIFT,
> +                                                  new_memflags, sp_extents);
>  
>                  if ( done > 0 )
>                  {
> -                    stat_2mb_pages += done;
> -                    done <<= SUPERPAGE_2MB_SHIFT;
> +                    stat_1gb_pages += done;
> +                    done <<= SUPERPAGE_1GB_SHIFT;
>                      cur_pages += done;
> +                    finished += done;
>                      count -= done;
>                  }
>              }
> -        }
>  
> -        /* Fall back to 4kB extents. */
> -        if ( count != 0 )
> -        {
> -            rc = xc_domain_populate_physmap_exact(
> -                xch, dom, count, 0, pod_mode, &page_array[cur_pages]);
> -            cur_pages += count;
> -            stat_normal_pages += count;
> +            if ( count != 0 )
> +            {
> +                /* Clip count to maximum 8MB extent. */
> +                max_pages = SUPERPAGE_2MB_NR_PFNS * 4;
> +                if ( count > max_pages )
> +                    count = max_pages;
> +
> +                /* Clip partial superpage extents to superpage boundaries. */
> +                if ( ((cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
> +                     (count > (-cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1))) )
> +                    count = -cur_pfn & (SUPERPAGE_2MB_NR_PFNS-1);
> +                else if ( ((count & (SUPERPAGE_2MB_NR_PFNS-1)) != 0) &&
> +                          (count > SUPERPAGE_2MB_NR_PFNS) )
> +                    count &= ~(SUPERPAGE_2MB_NR_PFNS - 1); /* clip non-s.p. tail */
> +
> +                /* Attempt to allocate superpage extents. */
> +                if ( ((count | cur_pfn) & (SUPERPAGE_2MB_NR_PFNS - 1)) == 0 )
> +                {
> +                    long done;
> +                    unsigned long nr_extents = count >> SUPERPAGE_2MB_SHIFT;
> +                    xen_pfn_t sp_extents[nr_extents];
> +
> +                    for ( j = 0; j < nr_extents; j++ )
> +                        sp_extents[j] = page_array[cur_pages+(j<<SUPERPAGE_2MB_SHIFT)];
> +
> +                    done = xc_domain_populate_physmap(xch, dom, nr_extents, SUPERPAGE_2MB_SHIFT,
> +                                                      new_memflags, sp_extents);
> +
> +                    if ( done > 0 )
> +                    {
> +                        stat_2mb_pages += done;
> +                        done <<= SUPERPAGE_2MB_SHIFT;
> +                        cur_pages += done;
> +                        finished += done;
> +                        count -= done;
> +                    }
> +                }
> +            }
> +
> +            /* Fall back to 4kB extents. */
> +            if ( count != 0 )
> +            {
> +                rc = xc_domain_populate_physmap_exact(
> +                    xch, dom, count, 0, new_memflags, &page_array[cur_pages]);
> +                cur_pages += count;
> +                finished += count;
> +                stat_normal_pages += count;
> +            }
>          }
> +
> +        if ( rc != 0 )
> +            break;
>      }
>  
>      if ( rc != 0 )
> @@ -469,7 +530,7 @@ static int setup_guest(xc_interface *xch,
>                xch, dom, PAGE_SIZE, PROT_READ | PROT_WRITE,
>                HVM_INFO_PFN)) == NULL )
>          goto error_out;
> -    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size);
> +    build_hvm_info(hvm_info_page, v_end, mmio_start, mmio_size, args);
>      munmap(hvm_info_page, PAGE_SIZE);
>  
>      /* Allocate and clear special pages. */
> @@ -608,6 +669,9 @@ int xc_hvm_build(xc_interface *xch, uint32_t domid,
>              args.acpi_module.guest_addr_out;
>          hvm_args->smbios_module.guest_addr_out = 
>              args.smbios_module.guest_addr_out;
> +        hvm_args->lowmem_end = args.lowmem_end;
> +        hvm_args->highmem_end = args.highmem_end;
> +        hvm_args->mmio_start = args.mmio_start;
>      }
>  
>      free(image);
> -- 
> 1.7.10.4
> 
> 
> _______________________________________________
> Xen-devel mailing list
> Xen-devel@lists.xen.org
> http://lists.xen.org/xen-devel

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest
  2015-01-13 12:11 ` [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest Wei Liu
@ 2015-01-13 18:15   ` Ian Jackson
  2015-01-14 11:05     ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 18:15 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, Ian Campbell, jbeulich,
	xen-devel

Wei Liu writes ("[PATCH v3 08/19] libxl: functions to build vmemranges for PV guest"):
> Introduce an arch-independent routine to generate one vmemrange per
> vnode. Also introduce arch-dependent routines for different
> architectures because part of the process is arch-specific -- ARM does
> not yet have NUMA support and E820 is x86 only.
> 
> For those x86 guests who care about machine E820 map (i.e. with
> e820_host=1), vnode is further split into several vmemranges to
> accommodate memory holes.  A few stubs for libxl_arm.c are created.
...
> +    /* Generate one vmemrange for each virtual node. */
> +    next = 0;
> +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> +        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
> +
> +        v = libxl__realloc(gc, v, sizeof(*v) * (i+1));

Please use GCREALLOC_ARRAY.
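
That is, presumably something along these lines (assuming v stays
gc-allocated as in the hunk above; just a sketch):

   GCREALLOC_ARRAY(v, i+1);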

> +        v[i].start = next;
> +        v[i].end = next + (p->mem << 20); /* mem is in MiB */

Why are all these values in different units ?

Also, it would be best if the units were in the field and variable
names.  Then you wouldn't have to write an explanatory comment.

> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> index e959e37..2018afc 100644
> --- a/tools/libxl/libxl_x86.c
> +++ b/tools/libxl/libxl_x86.c
> @@ -338,6 +338,80 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
...
> +int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
> +                                      uint32_t domid,
> +                                      libxl_domain_build_info *b_info,
> +                                      libxl__domain_build_state *state)
> +{
...
> +    n = 0; /* E820 counter */

How about putting this information in the variable name rather than
dropping it into a comment ?  Likewise i.

> +        while (remaining > 0) {
> +            if (n >= nr_e820) {
> +                rc = ERROR_FAIL;

ERROR_NOMEM, surely ?

> +            if (map[n].size >= remaining) {
> +                v[x].start = map[n].addr;
> +                v[x].end = map[n].addr + remaining;
> +                map[n].addr += remaining;
> +                map[n].size -= remaining;
> +                remaining = 0;
> +            } else {
> +                v[x].start = map[n].addr;
> +                v[x].end = map[n].addr + map[n].size;
> +                remaining -= map[n].size;
> +                n++;
> +            }

It might be possible to write this more compactly with something like

   use = map[n].size < remaining ? map[n].size : remaining;
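
With that, the rest of the body could become something like the below
(same variables as the quoted hunk; the surrounding loop, the bounds
check on n and the allocation of v[x] stay as before -- just a sketch,
not the actual code):

   v[x].start = map[n].addr;
   v[x].end = map[n].addr + use;
   map[n].addr += use;
   map[n].size -= use;
   remaining -= use;
   if (map[n].size == 0)
       n++;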

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 18/19] libxlutil: nested list support
  2015-01-13 16:11     ` Wei Liu
@ 2015-01-13 18:25       ` Ian Jackson
  0 siblings, 0 replies; 63+ messages in thread
From: Ian Jackson @ 2015-01-13 18:25 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Ian Campbell, jbeulich, xen-devel

Wei Liu writes ("Re: [PATCH v3 18/19] libxlutil: nested list support"):
> On Tue, Jan 13, 2015 at 03:52:48PM +0000, Ian Jackson wrote:
> > This commit message is very brief.  For example, under the heading of
> > `Rework internal representation of setting' I would expect a clear
> > description of every formulaic change.
> 
> Originally the internal representation of a setting was a (string,
> string) pair, the first string being the name of the setting and the
> second string being its value. Now the internal representation is
> changed to a (string, ConfigValue) pair, where a ConfigValue can refer
> to a string or a list of ConfigValues. Internal functions that deal
> with settings are changed accordingly.
> 
> Does the above description make things clearer?

Yes.  Something like that should be in the commit message.  It would
help to refer to the actual type names.  You could say (if true) for
example, "internal functions new refer to a ConfigSetting; the public
APIs still talk about ConfigValues" or some such.
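
For illustration, the shape being described sounds roughly like the
below (names and layout are only my guess at what the patch does, not
the actual libxlu declarations):

   typedef struct XLU_ConfigValue XLU_ConfigValue;

   struct XLU_ConfigValue {
       enum { XLU_STRING, XLU_LIST } type;
       union {
           char *string;
           struct {
               int nvalues;
               XLU_ConfigValue **values;
           } list;
       } u;
   };

   /* the (name, value) pair the internal functions now deal with */
   struct XLU_ConfigSetting {
       char *name;
       XLU_ConfigValue *value;
   };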

> > Also, I think it would be much easier to review if split up into 3 parts,
> > which from the description above ought to be doable without trouble.
> 
> OK. I can try to split this patch into three.

If it's difficult for some reason then do get back to me.

> > AFAICT from your changes, the API is not backward compatible.  ICBW,
> > but if I'm right that's not acceptable I'm afraid, even in libxlu.
> 
> The old APIs still have the same semantics as before, so any applications
> linked against those APIs still get the same results.

Oh yes.  Sorry, I had misread the patch and read your changes to
libxlu_internal.h as being in libxlutil.h.

> > > Previous APIs work as before.
> > 
> > That can't be right because you have to at least specify how they deal
> > with the additional config file syntax.
> 
> No, the old APIs don't deal with the new syntax. If applications want to
> support the new syntax, they need to use the new API.

It's obvious that the old API can't return the new syntax.  The
question is what happens if you try.

> If the old APIs try to get a value from the new syntax, it has no effect.

I don't think "has no effect" can be right.  What is the actual return
value from the function ?  Will it be treated as an error ?  (IMO it
should be, and this should be documented.)

Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
  2015-01-13 16:38   ` Jan Beulich
@ 2015-01-13 19:21   ` Andrew Cooper
  2015-01-13 20:51     ` Wei Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Andrew Cooper @ 2015-01-13 19:21 UTC (permalink / raw
  To: Wei Liu, xen-devel; +Cc: dario.faggioli, Jan Beulich, Elena Ufimsteva

On 13/01/15 12:11, Wei Liu wrote:
> @@ -408,6 +413,49 @@ static void dump_numa(unsigned char key)
>  
>          for_each_online_node ( i )
>              printk("    Node %u: %u\n", i, page_num_node[i]);
> +
> +        if ( !d->vnuma )
> +                continue;

Nit - extraneous whitespace.

> +
> +        vnuma = d->vnuma;
> +        printk("     %u vnodes, %u vcpus:\n", vnuma->nr_vnodes, d->max_vcpus);
> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> +        {
> +            err = snprintf(keyhandler_scratch, 12, "%3u",
> +                    vnuma->vnode_to_pnode[i]);
> +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> +                strlcpy(keyhandler_scratch, "???", 3);
> +
> +            printk("       vnode  %3u - pnode %s\n", i, keyhandler_scratch);
> +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
> +            {
> +                if ( vnuma->vmemrange[j].nid == i )
> +                {
> +                    printk("         %016"PRIx64" - %016"PRIx64"\n",
> +                           vnuma->vmemrange[j].start,
> +                           vnuma->vmemrange[j].end);
> +                }
> +            }
> +
> +            printk("       vcpus: ");
> +            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
> +            {
> +                if ( !(j & 0x3f) )
> +                    process_pending_softirqs();
> +
> +                if ( vnuma->vcpu_to_vnode[j] == i )
> +                {
> +                    if ( (n + 1) % 8 == 0 )
> +                        printk("%3d\n", j);
> +                    else if ( !(n % 8) && n != 0 )
> +                        printk("%17d ", j);
> +                    else
> +                        printk("%3d ", j);
> +                    n++;
> +                }

Do you have a sample of what this looks like for a VM with more than 8
vcpus ?

~Andrew

> +            }
> +            printk("\n");
> +        }
>      }
>  
>      rcu_read_unlock(&domlist_read_lock);

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-13 12:11 ` [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
  2015-01-13 17:05   ` Ian Jackson
@ 2015-01-13 20:02   ` Andrew Cooper
  2015-01-14 10:53     ` Wei Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Andrew Cooper @ 2015-01-13 20:02 UTC (permalink / raw
  To: Wei Liu, xen-devel
  Cc: Elena Ufimtseva, dario.faggioli, Ian Jackson, Ian Campbell,
	jbeulich

On 13/01/15 12:11, Wei Liu wrote:
> From libxc's point of view, it only needs to know the vnode to pnode
> mapping and the size of each vnode in order to allocate memory
> accordingly. Add these fields to the xc_dom structure.
>
> The caller might not pass in vNUMA information. In that case, a dummy
> layout is generated for the convenience of libxc's allocation code. The
> upper layer (libxl etc) still sees the domain has no vNUMA
> configuration.
>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
> Changes in v3:
> 1. Rewrite commit log.
> 2. Shorten some error messages.
> ---
>  tools/libxc/include/xc_dom.h |    5 +++
>  tools/libxc/xc_dom_x86.c     |   79 ++++++++++++++++++++++++++++++++++++------
>  tools/libxc/xc_private.h     |    2 ++
>  3 files changed, 75 insertions(+), 11 deletions(-)
>
> diff --git a/tools/libxc/include/xc_dom.h b/tools/libxc/include/xc_dom.h
> index 07d7224..c459e77 100644
> --- a/tools/libxc/include/xc_dom.h
> +++ b/tools/libxc/include/xc_dom.h
> @@ -167,6 +167,11 @@ struct xc_dom_image {
>      struct xc_dom_loader *kernel_loader;
>      void *private_loader;
>  
> +    /* vNUMA information */
> +    unsigned int *vnode_to_pnode; /* vnode to pnode mapping array */
> +    uint64_t *vnode_size;         /* vnode size array */

Please make it very clear in the comment here that "size" is in MB (at
least I presume so, given the shifts by 20).  There are currently no
specified units.

> +    unsigned int nr_vnodes;       /* number of elements of above arrays */
> +
>      /* kernel loader */
>      struct xc_dom_arch *arch_hooks;
>      /* allocate up to virt_alloc_end */
> diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
> index bf06fe4..06a7e54 100644
> --- a/tools/libxc/xc_dom_x86.c
> +++ b/tools/libxc/xc_dom_x86.c
> @@ -759,7 +759,8 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
>  int arch_setup_meminit(struct xc_dom_image *dom)
>  {
>      int rc;
> -    xen_pfn_t pfn, allocsz, i, j, mfn;
> +    xen_pfn_t pfn, allocsz, mfn, total, pfn_base;
> +    int i, j;
>  
>      rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
>      if ( rc )
> @@ -811,18 +812,74 @@ int arch_setup_meminit(struct xc_dom_image *dom)
>          /* setup initial p2m */
>          for ( pfn = 0; pfn < dom->total_pages; pfn++ )
>              dom->p2m_host[pfn] = pfn;
> -        
> +
> +        /* Setup dummy vNUMA information if it's not provided. Not
> +         * that this is a valid state if libxl doesn't provide any
> +         * vNUMA information.
> +         *
> +         * In this case we setup some dummy value for the convenience
> +         * of the allocation code. Note that from the user's PoV the
> +         * guest still has no vNUMA configuration.
> +         */
> +        if ( dom->nr_vnodes == 0 )
> +        {
> +            dom->nr_vnodes = 1;
> +            dom->vnode_to_pnode = xc_dom_malloc(dom,
> +                                                sizeof(*dom->vnode_to_pnode));
> +            dom->vnode_to_pnode[0] = XC_VNUMA_NO_NODE;
> +            dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size));
> +            dom->vnode_size[0] = (dom->total_pages << PAGE_SHIFT) >> 20;
> +        }
> +
> +        total = 0;
> +        for ( i = 0; i < dom->nr_vnodes; i++ )
> +            total += ((dom->vnode_size[i] << 20) >> PAGE_SHIFT);

Can I suggest a "mb_to_pages()" helper rather than opencoding this in
several locations.
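
For illustration, such a helper could be as small as this (name per the
suggestion above, exact placement up to you -- just a sketch):

    /* Convert a size in MiB into a number of guest pages. */
    static inline uint64_t mb_to_pages(uint64_t mb)
    {
        return (mb << 20) >> PAGE_SHIFT;
    }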

> +        if ( total != dom->total_pages )
> +        {
> +            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> +                         "%s: vNUMA page count mismatch (0x%"PRIpfn" != 0x%"PRIpfn")\n",
> +                         __FUNCTION__, total, dom->total_pages);

__func__ please.  It is part of C99 unlike __FUNCTION__ which is a gnuism.

andrewcoop:xen.git$ git grep  __FUNCTION__ | wc -l
230
andrewcoop:xen.git$ git grep  __func__ | wc -l
194

Looks like the codebase is very mixed, but best to err on the side of
the standard.

> +            return -EINVAL;
> +        }
> +
>          /* allocate guest memory */
> -        for ( i = rc = allocsz = 0;
> -              (i < dom->total_pages) && !rc;
> -              i += allocsz )
> +        pfn_base = 0;
> +        for ( i = 0; i < dom->nr_vnodes; i++ )
>          {
> -            allocsz = dom->total_pages - i;
> -            if ( allocsz > 1024*1024 )
> -                allocsz = 1024*1024;
> -            rc = xc_domain_populate_physmap_exact(
> -                dom->xch, dom->guest_domid, allocsz,
> -                0, 0, &dom->p2m_host[i]);
> +            unsigned int memflags;
> +            uint64_t pages;
> +
> +            memflags = 0;
> +            if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
> +            {
> +                memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[i]);
> +                memflags |= XENMEMF_exact_node_request;
> +            }
> +
> +            pages = (dom->vnode_size[i] << 20) >> PAGE_SHIFT;
> +
> +            for ( j = 0; j < pages; j += allocsz )
> +            {
> +                allocsz = pages - j;
> +                if ( allocsz > 1024*1024 )
> +                    allocsz = 1024*1024;
> +
> +                rc = xc_domain_populate_physmap_exact(dom->xch,
> +                         dom->guest_domid, allocsz, 0, memflags,
> +                         &dom->p2m_host[pfn_base+j]);
> +
> +                if ( rc )
> +                {
> +                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
> +                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> +                                     "%s: fail to allocate 0x%"PRIx64"/0x%"PRIpfn" pages (v=%d, p=%d)\n",
> +                                     __FUNCTION__, pages, dom->total_pages, i,
> +                                     dom->vnode_to_pnode[i]);

"failed to allocate"

I am not sure the total number of pages is useful here, especially as
you don't know how many pages have successfully been allocated first.

Finally, please also print an error for the non-vnuma case.  Failing to
populate memory is not something which should go without an error message.

~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-13 18:15   ` Konrad Rzeszutek Wilk
@ 2015-01-13 20:44     ` Wei Liu
  2015-01-14 19:02       ` Konrad Rzeszutek Wilk
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 20:44 UTC (permalink / raw
  To: Konrad Rzeszutek Wilk
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, xen-devel,
	jbeulich, Elena Ufimtseva

On Tue, Jan 13, 2015 at 01:15:21PM -0500, Konrad Rzeszutek Wilk wrote:
> On Tue, Jan 13, 2015 at 12:11:43PM +0000, Wei Liu wrote:
> > The algorithm is more or less the same as the one used for PV guests.
> > Libxc gets hold of the mapping of vnode to pnode and the size of each
> > vnode, then allocates memory accordingly.
> 
> Could you split this patch in two? One part for the adding of the code
> and the other for moving the existing code around?
> 

The structures added are just a few lines. On the one hand the two Ians
complained I wrote patches that only added structures; on the other you
complained I bundled them with functional code. It's hard to please
everyone, isn't it? :-)

Jokes aside, I'm not sure if the above paragraph answers your question.
Which part of code is "added" and which part of code is "moved around"?

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 19:21   ` Andrew Cooper
@ 2015-01-13 20:51     ` Wei Liu
  2015-01-14 11:06       ` Andrew Cooper
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-13 20:51 UTC (permalink / raw
  To: Andrew Cooper
  Cc: dario.faggioli, Wei Liu, Jan Beulich, Elena Ufimsteva, xen-devel

On Tue, Jan 13, 2015 at 07:21:47PM +0000, Andrew Cooper wrote:
> On 13/01/15 12:11, Wei Liu wrote:
> > @@ -408,6 +413,49 @@ static void dump_numa(unsigned char key)
> >  
> >          for_each_online_node ( i )
> >              printk("    Node %u: %u\n", i, page_num_node[i]);
> > +
> > +        if ( !d->vnuma )
> > +                continue;
> 
> Nit - extraneous whitespace.
> 

I noticed this when I tried to add the lock. This is now fixed.

> > +
> > +        vnuma = d->vnuma;
> > +        printk("     %u vnodes, %u vcpus:\n", vnuma->nr_vnodes, d->max_vcpus);
> > +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
> > +        {
> > +            err = snprintf(keyhandler_scratch, 12, "%3u",
> > +                    vnuma->vnode_to_pnode[i]);
> > +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
> > +                strlcpy(keyhandler_scratch, "???", 3);
> > +
> > +            printk("       vnode  %3u - pnode %s\n", i, keyhandler_scratch);
> > +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
> > +            {
> > +                if ( vnuma->vmemrange[j].nid == i )
> > +                {
> > +                    printk("         %016"PRIx64" - %016"PRIx64"\n",
> > +                           vnuma->vmemrange[j].start,
> > +                           vnuma->vmemrange[j].end);
> > +                }
> > +            }
> > +
> > +            printk("       vcpus: ");
> > +            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
> > +            {
> > +                if ( !(j & 0x3f) )
> > +                    process_pending_softirqs();
> > +
> > +                if ( vnuma->vcpu_to_vnode[j] == i )
> > +                {
> > +                    if ( (n + 1) % 8 == 0 )
> > +                        printk("%3d\n", j);
> > +                    else if ( !(n % 8) && n != 0 )
> > +                        printk("%17d ", j);
> > +                    else
> > +                        printk("%3d ", j);
> > +                    n++;
> > +                }
> 
> Do you have a sample of what this looks like for a VM with more than 8
> vcpus ?
> 

(XEN) Memory location of each domain:
(XEN) Domain 0 (total: 131072):
(XEN)     Node 0: 131072
(XEN) Domain 6 (total: 768064):
(XEN)     Node 0: 768064
(XEN)      2 vnodes, 20 vcpus:
(XEN)        vnode    0 - pnode   0
(XEN)          0000000000000000 - 000000005dc00000
(XEN)        vcpus:   0   1   2   3   4   5   6   7
(XEN)                 8   9
(XEN)        vnode    1 - pnode   0
(XEN)          000000005dc00000 - 00000000bb000000
(XEN)          0000000100000000 - 0000000100800000
(XEN)        vcpus:  10  11  12  13  14  15  16  17
(XEN)                18  19

> ~Andrew
> 
> > +            }
> > +            printk("\n");
> > +        }
> >      }
> >  
> >      rcu_read_unlock(&domlist_read_lock);
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-13 12:11 ` [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize Wei Liu
  2015-01-13 17:07   ` Ian Jackson
@ 2015-01-13 21:00   ` Andrew Cooper
  2015-01-14 10:30     ` Wei Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Andrew Cooper @ 2015-01-13 21:00 UTC (permalink / raw
  To: Wei Liu, xen-devel
  Cc: Elena Ufimtseva, dario.faggioli, Ian Jackson, Ian Campbell,
	jbeulich

On 13/01/15 12:11, Wei Liu wrote:
> This function gets the machine E820 map and sanitize it according to PV
> guest configuration.
>
> This will be used in later patch. No functional change introduced in
> this patch.
>
> Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> Cc: Ian Campbell <ian.campbell@citrix.com>
> Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> Cc: Dario Faggioli <dario.faggioli@citrix.com>
> Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> ---
>  tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
>  1 file changed, 22 insertions(+), 9 deletions(-)
>
> diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> index 9ceb373..e959e37 100644
> --- a/tools/libxl/libxl_x86.c
> +++ b/tools/libxl/libxl_x86.c
> @@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
>      return 0;
>  }
>  
> +static int e820_host_sanitize(libxl__gc *gc,
> +                              libxl_domain_build_info *b_info,
> +                              struct e820entry map[],
> +                              uint32_t *nr)
> +{
> +    int rc;
> +
> +    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);

This function should not assume that map[] is E820MAX entries long.  The
code you copied was passing a local variable whereas here you are
passing someone else's pointer.  You should pass the length in *nr like
e820_sanitize() does.
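
i.e. something along these lines (sketch only -- the real rework is of
course up to you):

    static int e820_host_sanitize(libxl__gc *gc,
                                  libxl_domain_build_info *b_info,
                                  struct e820entry map[],
                                  uint32_t *nr)  /* in: capacity, out: used */
    {
        int rc;

        rc = xc_get_machine_memory_map(CTX->xch, map, *nr);
        if (rc < 0) {
            errno = rc;
            return ERROR_FAIL;
        }

        *nr = rc;

        return e820_sanitize(CTX, map, nr, b_info->target_memkb,
                             (b_info->max_memkb - b_info->target_memkb) +
                             b_info->u.pv.slack_memkb);
    }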

~Andrew

> +    if (rc < 0) {
> +        errno = rc;
> +        return ERROR_FAIL;
> +    }
> +
> +    *nr = rc;
> +
> +    rc = e820_sanitize(CTX, map, nr, b_info->target_memkb,
> +                       (b_info->max_memkb - b_info->target_memkb) +
> +                       b_info->u.pv.slack_memkb);
> +    return rc;
> +}
> +
>  static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
>          libxl_domain_config *d_config)
>  {
> @@ -223,15 +244,7 @@ static int libxl__e820_alloc(libxl__gc *gc, uint32_t domid,
>      if (!libxl_defbool_val(b_info->u.pv.e820_host))
>          return ERROR_INVAL;
>  
> -    rc = xc_get_machine_memory_map(ctx->xch, map, E820MAX);
> -    if (rc < 0) {
> -        errno = rc;
> -        return ERROR_FAIL;
> -    }
> -    nr = rc;
> -    rc = e820_sanitize(ctx, map, &nr, b_info->target_memkb,
> -                       (b_info->max_memkb - b_info->target_memkb) +
> -                       b_info->u.pv.slack_memkb);
> +    rc = e820_host_sanitize(gc, b_info, map, &nr);
>      if (rc)
>          return ERROR_FAIL;
>  

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-13 21:00   ` Andrew Cooper
@ 2015-01-14 10:30     ` Wei Liu
  2015-01-14 11:40       ` Ian Campbell
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-14 10:30 UTC (permalink / raw
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, xen-devel,
	jbeulich, Elena Ufimtseva

On Tue, Jan 13, 2015 at 09:00:49PM +0000, Andrew Cooper wrote:
> On 13/01/15 12:11, Wei Liu wrote:
> > This function gets the machine E820 map and sanitize it according to PV
> > guest configuration.
> >
> > This will be used in later patch. No functional change introduced in
> > this patch.
> >
> > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > Cc: Ian Campbell <ian.campbell@citrix.com>
> > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > Cc: Dario Faggioli <dario.faggioli@citrix.com>
> > Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> > ---
> >  tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
> >  1 file changed, 22 insertions(+), 9 deletions(-)
> >
> > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> > index 9ceb373..e959e37 100644
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> >      return 0;
> >  }
> >  
> > +static int e820_host_sanitize(libxl__gc *gc,
> > +                              libxl_domain_build_info *b_info,
> > +                              struct e820entry map[],
> > +                              uint32_t *nr)
> > +{
> > +    int rc;
> > +
> > +    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);
> 
> This function should not assume that map[] is E820MAX entries long.  The
> code you copied was passing a local variable whereas here you are
> passing someone else's pointer.  You should pass the length in *nr like
> e820_sanitize() does.
> 

Good catch. I will fix.

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-13 20:02   ` Andrew Cooper
@ 2015-01-14 10:53     ` Wei Liu
  2015-01-14 10:58       ` Andrew Cooper
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-14 10:53 UTC (permalink / raw
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, xen-devel,
	jbeulich, Elena Ufimtseva

On Tue, Jan 13, 2015 at 08:02:17PM +0000, Andrew Cooper wrote:
[...]
> > +    /* vNUMA information */
> > +    unsigned int *vnode_to_pnode; /* vnode to pnode mapping array */
> > +    uint64_t *vnode_size;         /* vnode size array */
> 
> Please make it very clear in the comment here that "size" is in MB (at
> least I presume so, given the shifts by 20).  There are currently no
> specified units.
> 

The size will be in pages.
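
(So the field would presumably end up looking something like

    uint64_t *vnode_pages;   /* number of guest pages per vnode */

-- name purely illustrative, the point being that the unit is then
explicit.)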

> > +    unsigned int nr_vnodes;       /* number of elements of above arrays */
> > +
> >      /* kernel loader */
> >      struct xc_dom_arch *arch_hooks;
> >      /* allocate up to virt_alloc_end */
> > diff --git a/tools/libxc/xc_dom_x86.c b/tools/libxc/xc_dom_x86.c
> > index bf06fe4..06a7e54 100644
> > --- a/tools/libxc/xc_dom_x86.c
> > +++ b/tools/libxc/xc_dom_x86.c
> > @@ -759,7 +759,8 @@ static int x86_shadow(xc_interface *xch, domid_t domid)
> >  int arch_setup_meminit(struct xc_dom_image *dom)
> >  {
> >      int rc;
> > -    xen_pfn_t pfn, allocsz, i, j, mfn;
> > +    xen_pfn_t pfn, allocsz, mfn, total, pfn_base;
> > +    int i, j;
> >  
> >      rc = x86_compat(dom->xch, dom->guest_domid, dom->guest_type);
> >      if ( rc )
> > @@ -811,18 +812,74 @@ int arch_setup_meminit(struct xc_dom_image *dom)
> >          /* setup initial p2m */
> >          for ( pfn = 0; pfn < dom->total_pages; pfn++ )
> >              dom->p2m_host[pfn] = pfn;
> > -        
> > +
> > +        /* Setup dummy vNUMA information if it's not provided. Not
> > +         * that this is a valid state if libxl doesn't provide any
> > +         * vNUMA information.
> > +         *
> > +         * In this case we setup some dummy value for the convenience
> > +         * of the allocation code. Note that from the user's PoV the
> > +         * guest still has no vNUMA configuration.
> > +         */
> > +        if ( dom->nr_vnodes == 0 )
> > +        {
> > +            dom->nr_vnodes = 1;
> > +            dom->vnode_to_pnode = xc_dom_malloc(dom,
> > +                                                sizeof(*dom->vnode_to_pnode));
> > +            dom->vnode_to_pnode[0] = XC_VNUMA_NO_NODE;
> > +            dom->vnode_size = xc_dom_malloc(dom, sizeof(*dom->vnode_size));
> > +            dom->vnode_size[0] = (dom->total_pages << PAGE_SHIFT) >> 20;
> > +        }
> > +
> > +        total = 0;
> > +        for ( i = 0; i < dom->nr_vnodes; i++ )
> > +            total += ((dom->vnode_size[i] << 20) >> PAGE_SHIFT);
> 
> Can I suggest a "mb_to_pages()" helper rather than opencoding this in
> several locations.
> 

Sure. If I end up needing one I will add that helper.

> > +        if ( total != dom->total_pages )
> > +        {
> > +            xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> > +                         "%s: vNUMA page count mismatch (0x%"PRIpfn" != 0x%"PRIpfn")\n",
> > +                         __FUNCTION__, total, dom->total_pages);
> 
> __func__ please.  It is part of C99 unlike __FUNCTION__ which is a gnuism.
> 
> andrewcoop:xen.git$ git grep  __FUNCTION__ | wc -l
> 230
> andrewcoop:xen.git$ git grep  __func__ | wc -l
> 194
> 
> Looks like the codebase is very mixed, but best to err on the side of
> the standard.
> 

No problem.

> > +            return -EINVAL;
> > +        }
> > +
> >          /* allocate guest memory */
> > -        for ( i = rc = allocsz = 0;
> > -              (i < dom->total_pages) && !rc;
> > -              i += allocsz )
> > +        pfn_base = 0;
> > +        for ( i = 0; i < dom->nr_vnodes; i++ )
> >          {
> > -            allocsz = dom->total_pages - i;
> > -            if ( allocsz > 1024*1024 )
> > -                allocsz = 1024*1024;
> > -            rc = xc_domain_populate_physmap_exact(
> > -                dom->xch, dom->guest_domid, allocsz,
> > -                0, 0, &dom->p2m_host[i]);
> > +            unsigned int memflags;
> > +            uint64_t pages;
> > +
> > +            memflags = 0;
> > +            if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
> > +            {
> > +                memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[i]);
> > +                memflags |= XENMEMF_exact_node_request;
> > +            }
> > +
> > +            pages = (dom->vnode_size[i] << 20) >> PAGE_SHIFT;
> > +
> > +            for ( j = 0; j < pages; j += allocsz )
> > +            {
> > +                allocsz = pages - j;
> > +                if ( allocsz > 1024*1024 )
> > +                    allocsz = 1024*1024;
> > +
> > +                rc = xc_domain_populate_physmap_exact(dom->xch,
> > +                         dom->guest_domid, allocsz, 0, memflags,
> > +                         &dom->p2m_host[pfn_base+j]);
> > +
> > +                if ( rc )
> > +                {
> > +                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
> > +                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> > +                                     "%s: fail to allocate 0x%"PRIx64"/0x%"PRIpfn" pages (v=%d, p=%d)\n",
> > +                                     __FUNCTION__, pages, dom->total_pages, i,
> > +                                     dom->vnode_to_pnode[i]);
> 
> "failed to allocate"
> 

Fixed.

> I am not sure the total number of pages is useful here, especially as
> you don't know how many pages have successfully been allocated first.
> 

Are you suggesting we don't print any numbers, or print more numbers?

> Finally, please also print an error for the non-vnuma case.  Failing to
> populate memory is not something which should go without an error message.
> 

Sure.

Wei.

> ~Andrew
> 

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-14 10:53     ` Wei Liu
@ 2015-01-14 10:58       ` Andrew Cooper
  2015-01-14 11:01         ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Cooper @ 2015-01-14 10:58 UTC (permalink / raw
  To: Wei Liu
  Cc: Ian Campbell, dario.faggioli, Ian Jackson, xen-devel, jbeulich,
	Elena Ufimtseva

On 14/01/15 10:53, Wei Liu wrote:
>
>>> +            return -EINVAL;
>>> +        }
>>> +
>>>          /* allocate guest memory */
>>> -        for ( i = rc = allocsz = 0;
>>> -              (i < dom->total_pages) && !rc;
>>> -              i += allocsz )
>>> +        pfn_base = 0;
>>> +        for ( i = 0; i < dom->nr_vnodes; i++ )
>>>          {
>>> -            allocsz = dom->total_pages - i;
>>> -            if ( allocsz > 1024*1024 )
>>> -                allocsz = 1024*1024;
>>> -            rc = xc_domain_populate_physmap_exact(
>>> -                dom->xch, dom->guest_domid, allocsz,
>>> -                0, 0, &dom->p2m_host[i]);
>>> +            unsigned int memflags;
>>> +            uint64_t pages;
>>> +
>>> +            memflags = 0;
>>> +            if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
>>> +            {
>>> +                memflags |= XENMEMF_exact_node(dom->vnode_to_pnode[i]);
>>> +                memflags |= XENMEMF_exact_node_request;
>>> +            }
>>> +
>>> +            pages = (dom->vnode_size[i] << 20) >> PAGE_SHIFT;
>>> +
>>> +            for ( j = 0; j < pages; j += allocsz )
>>> +            {
>>> +                allocsz = pages - j;
>>> +                if ( allocsz > 1024*1024 )
>>> +                    allocsz = 1024*1024;
>>> +
>>> +                rc = xc_domain_populate_physmap_exact(dom->xch,
>>> +                         dom->guest_domid, allocsz, 0, memflags,
>>> +                         &dom->p2m_host[pfn_base+j]);
>>> +
>>> +                if ( rc )
>>> +                {
>>> +                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
>>> +                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
>>> +                                     "%s: fail to allocate 0x%"PRIx64"/0x%"PRIpfn" pages (v=%d, p=%d)\n",
>>> +                                     __FUNCTION__, pages, dom->total_pages, i,
>>> +                                     dom->vnode_to_pnode[i]);
>> "failed to allocate"
>>
> Fixed.
>
>> I am not sure the total number of pages is useful here, especially as
>> you don't know how many pages have successfully been allocated first.
>>
> Are you suggesting we don't print any numbers, or print more numbers?

I am not suggesting that you don't print numbers.  "pages" is useful to
know in the error message, but "d->total_pages" is not, as you don't
know how many successful allocations have occurred before, or how much
memory has been allocated so far.

~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest
  2015-01-14 10:58       ` Andrew Cooper
@ 2015-01-14 11:01         ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-14 11:01 UTC (permalink / raw
  To: Andrew Cooper
  Cc: Wei Liu, Ian Campbell, dario.faggioli, Ian Jackson, xen-devel,
	jbeulich, Elena Ufimtseva

On Wed, Jan 14, 2015 at 10:58:08AM +0000, Andrew Cooper wrote:
[...]
> >>> +                         &dom->p2m_host[pfn_base+j]);
> >>> +
> >>> +                if ( rc )
> >>> +                {
> >>> +                    if ( dom->vnode_to_pnode[i] != XC_VNUMA_NO_NODE )
> >>> +                        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
> >>> +                                     "%s: fail to allocate 0x%"PRIx64"/0x%"PRIpfn" pages (v=%d, p=%d)\n",
> >>> +                                     __FUNCTION__, pages, dom->total_pages, i,
> >>> +                                     dom->vnode_to_pnode[i]);
> >> "failed to allocate"
> >>
> > Fixed.
> >
> >> I am not sure the total number of pages is useful here, especially as
> >> you don't know how many pages have successfully been allocated first.
> >>
> > Are you suggesting we don't print any numbers, or print more numbers?
> 
> I am not suggesting that you don't print numbers.  "pages" is useful to
> know in the error message, but "d->total_pages" is not, as you don't
> know how many successful allocations have occurred before, or how much
> memory has been allocated so far.
> 

OK. Don't print dom->total_pages then. :-)
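
Something like the below, then (just a sketch of the reworded message --
exact fields and wording may still change):

    if ( rc )
    {
        xc_dom_panic(dom->xch, XC_INTERNAL_ERROR,
                     "%s: failed to allocate 0x%"PRIx64" pages (v=%d, p=%d)",
                     __func__, pages, i, dom->vnode_to_pnode[i]);
        return rc;
    }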

Wei.

> ~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest
  2015-01-13 18:15   ` Ian Jackson
@ 2015-01-14 11:05     ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-14 11:05 UTC (permalink / raw
  To: Ian Jackson
  Cc: Wei Liu, Ian Campbell, dario.faggioli, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 06:15:49PM +0000, Ian Jackson wrote:
> Wei Liu writes ("[PATCH v3 08/19] libxl: functions to build vmemranges for PV guest"):
> > Introduce an arch-independent routine to generate one vmemrange per
> > vnode. Also introduce arch-dependent routines for different
> > architectures because part of the process is arch-specific -- ARM does
> > not yet have NUMA support and E820 is x86 only.
> > 
> > For those x86 guests who care about machine E820 map (i.e. with
> > e820_host=1), vnode is further split into several vmemranges to
> > accommodate memory holes.  A few stubs for libxl_arm.c are created.
> ...
> > +    /* Generate one vmemrange for each virtual node. */
> > +    next = 0;
> > +    for (i = 0; i < b_info->num_vnuma_nodes; i++) {
> > +        libxl_vnode_info *p = &b_info->vnuma_nodes[i];
> > +
> > +        v = libxl__realloc(gc, v, sizeof(*v) * (i+1));
> 
> Please use GCREALLOC_ARRAY.

No problem.

> 
> > +        v[i].start = next;
> > +        v[i].end = next + (p->mem << 20); /* mem is in MiB */
> 
> Why are all these values in different units ?
> 
> Also, it would be best if the units were in the field and variable
> names.  Then you wouldn't have to write an explanatory comment.
> 

I will rework the code / data structures that are unclear about their units.

> > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> > index e959e37..2018afc 100644
> > --- a/tools/libxl/libxl_x86.c
> > +++ b/tools/libxl/libxl_x86.c
> > @@ -338,6 +338,80 @@ int libxl__arch_domain_finalise_hw_description(libxl__gc *gc,
> ...
> > +int libxl__arch_vnuma_build_vmemrange(libxl__gc *gc,
> > +                                      uint32_t domid,
> > +                                      libxl_domain_build_info *b_info,
> > +                                      libxl__domain_build_state *state)
> > +{
> ...
> > +    n = 0; /* E820 counter */
> 
> How about putting this information in the variable name rather than
> dropping it into a comment ?  Likewise i.
> 

Sure. Done.

> > +        while (remaining > 0) {
> > +            if (n >= nr_e820) {
> > +                rc = ERROR_FAIL;
> 
> ERROR_NOMEM, surely ?
> 

Done.

> > +            if (map[n].size >= remaining) {
> > +                v[x].start = map[n].addr;
> > +                v[x].end = map[n].addr + remaining;
> > +                map[n].addr += remaining;
> > +                map[n].size -= remaining;
> > +                remaining = 0;
> > +            } else {
> > +                v[x].start = map[n].addr;
> > +                v[x].end = map[n].addr + map[n].size;
> > +                remaining -= map[n].size;
> > +                n++;
> > +            }
> 
> It might be possible to write this more compactly with something like
> 
>    use = map[n].size < remaining ? map[n].size : remaining;
> 

I will see what I can do.

Wei.

> Ian.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-13 20:51     ` Wei Liu
@ 2015-01-14 11:06       ` Andrew Cooper
  2015-01-14 11:24         ` Jan Beulich
  0 siblings, 1 reply; 63+ messages in thread
From: Andrew Cooper @ 2015-01-14 11:06 UTC (permalink / raw
  To: Wei Liu; +Cc: dario.faggioli, Jan Beulich, Elena Ufimsteva, xen-devel

On 13/01/15 20:51, Wei Liu wrote:
>
>>> +
>>> +        vnuma = d->vnuma;
>>> +        printk("     %u vnodes, %u vcpus:\n", vnuma->nr_vnodes, d->max_vcpus);
>>> +        for ( i = 0; i < vnuma->nr_vnodes; i++ )
>>> +        {
>>> +            err = snprintf(keyhandler_scratch, 12, "%3u",
>>> +                    vnuma->vnode_to_pnode[i]);
>>> +            if ( err < 0 || vnuma->vnode_to_pnode[i] == NUMA_NO_NODE )
>>> +                strlcpy(keyhandler_scratch, "???", 3);
>>> +
>>> +            printk("       vnode  %3u - pnode %s\n", i, keyhandler_scratch);
>>> +            for ( j = 0; j < vnuma->nr_vmemranges; j++ )
>>> +            {
>>> +                if ( vnuma->vmemrange[j].nid == i )
>>> +                {
>>> +                    printk("         %016"PRIx64" - %016"PRIx64"\n",
>>> +                           vnuma->vmemrange[j].start,
>>> +                           vnuma->vmemrange[j].end);
>>> +                }
>>> +            }
>>> +
>>> +            printk("       vcpus: ");
>>> +            for ( j = 0, n = 0; j < d->max_vcpus; j++ )
>>> +            {
>>> +                if ( !(j & 0x3f) )
>>> +                    process_pending_softirqs();
>>> +
>>> +                if ( vnuma->vcpu_to_vnode[j] == i )
>>> +                {
>>> +                    if ( (n + 1) % 8 == 0 )
>>> +                        printk("%3d\n", j);
>>> +                    else if ( !(n % 8) && n != 0 )
>>> +                        printk("%17d ", j);
>>> +                    else
>>> +                        printk("%3d ", j);
>>> +                    n++;
>>> +                }
>> Do you have a sample of what this looks like for a VM with more than 8
>> vcpus ?
>>
> (XEN) Memory location of each domain:
> (XEN) Domain 0 (total: 131072):
> (XEN)     Node 0: 131072
> (XEN) Domain 6 (total: 768064):
> (XEN)     Node 0: 768064
> (XEN)      2 vnodes, 20 vcpus:
> (XEN)        vnode    0 - pnode   0
> (XEN)          0000000000000000 - 000000005dc00000
> (XEN)        vcpus:   0   1   2   3   4   5   6   7
> (XEN)                 8   9
> (XEN)        vnode    1 - pnode   0
> (XEN)          000000005dc00000 - 00000000bb000000
> (XEN)          0000000100000000 - 0000000100800000
> (XEN)        vcpus:  10  11  12  13  14  15  16  17
> (XEN)                18  19

This is very verbose, and (IMO) moderately difficult to parse even
knowing what the information is.

May I suggest the following stylistic changes:

2 vnodes, 20 vcpus:
  1: pnode 0, vcpus {1-9}
    0000000000000000 - 000000005dc00000
  2: pnode 1, vcpus {10-20}
    000000005dc00000 - 00000000bb000000
    0000000100000000 - 0000000100800000
 
You have already stated 2 vnodes, so "vnode $X" is redundant as the list
index.  The vcpus are exceedingly likely to be consecutively allocated,
and cpumask_scnprintf() is a very concise way of representing them (and
will reduce your code quite a bit).

I might even be tempted to suggest adding something like "guest physical
layout" to the "2 vnodes, 20 vcpus" line, to make explicitly clear that
the memory ranges are gfns and not mfns/something else.  After all, this
debugkey is mixing host and guest memory attributes.

~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 11:06       ` Andrew Cooper
@ 2015-01-14 11:24         ` Jan Beulich
  2015-01-14 11:45           ` Andrew Cooper
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-14 11:24 UTC (permalink / raw
  To: Andrew Cooper, Wei Liu; +Cc: dario.faggioli, Elena Ufimsteva, xen-devel

>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
> May I suggest the following stylistic changes:
> 
> 2 vnodes, 20 vcpus:
>   1: pnode 0, vcpus {1-9}
>     0000000000000000 - 000000005dc00000
>   2: pnode 1, vcpus {10-20}
>     000000005dc00000 - 00000000bb000000
>     0000000100000000 - 0000000100800000
>  
> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
> index.  The vcpus are exceedingly likely to be consecutively allocated,
> and cpumask_scnprintf() is a very concise way of representing them (and
> will reduce your code quite a bit).

You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
dealing with vCPU-s.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-14 10:30     ` Wei Liu
@ 2015-01-14 11:40       ` Ian Campbell
  2015-01-14 12:05         ` Wei Liu
  0 siblings, 1 reply; 63+ messages in thread
From: Ian Campbell @ 2015-01-14 11:40 UTC (permalink / raw
  To: Wei Liu
  Cc: jbeulich, Andrew Cooper, dario.faggioli, Ian Jackson, xen-devel,
	Elena Ufimtseva

On Wed, 2015-01-14 at 10:30 +0000, Wei Liu wrote:
> On Tue, Jan 13, 2015 at 09:00:49PM +0000, Andrew Cooper wrote:
> > On 13/01/15 12:11, Wei Liu wrote:
> > > This function gets the machine E820 map and sanitize it according to PV
> > > guest configuration.
> > >
> > > This will be used in later patch. No functional change introduced in
> > > this patch.
> > >
> > > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > > Cc: Ian Campbell <ian.campbell@citrix.com>
> > > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > > Cc: Dario Faggioli <dario.faggioli@citrix.com>
> > > Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> > > ---
> > >  tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
> > >  1 file changed, 22 insertions(+), 9 deletions(-)
> > >
> > > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> > > index 9ceb373..e959e37 100644
> > > --- a/tools/libxl/libxl_x86.c
> > > +++ b/tools/libxl/libxl_x86.c
> > > @@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> > >      return 0;
> > >  }
> > >  
> > > +static int e820_host_sanitize(libxl__gc *gc,
> > > +                              libxl_domain_build_info *b_info,
> > > +                              struct e820entry map[],
> > > +                              uint32_t *nr)
> > > +{
> > > +    int rc;
> > > +
> > > +    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);
> > 
> > This function should not assume that map[] is E820MAX entries long.  The
> > code you copied was passing a local variable whereas here you are
> > passing someone else's pointer.  You should pass the length in *nr like
> > e820_sanitize() does.
> > 
> 
> Good catch. I will fix.

Does it also work to make the prototype "map[E820MAX]", thereby causing
the compiler to force the caller to use a suitably sized array? (not
sure what the constraints on the caller are here, i.e. whether it needs
to be dynamically allocated.)

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 11:24         ` Jan Beulich
@ 2015-01-14 11:45           ` Andrew Cooper
  2015-01-14 12:02             ` Wei Liu
  2015-01-14 14:49             ` Jan Beulich
  0 siblings, 2 replies; 63+ messages in thread
From: Andrew Cooper @ 2015-01-14 11:45 UTC (permalink / raw
  To: Jan Beulich, Wei Liu; +Cc: dario.faggioli, Elena Ufimsteva, xen-devel

On 14/01/15 11:24, Jan Beulich wrote:
>>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
>> May I suggest the following stylistic changes:
>>
>> 2 vnodes, 20 vcpus:
>>   1: pnode 0, vcpus {1-9}
>>     0000000000000000 - 000000005dc00000
>>   2: pnode 1, vcpus {10-20}
>>     000000005dc00000 - 00000000bb000000
>>     0000000100000000 - 0000000100800000
>>  
>> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
>> index.  The vcpus are exceedingly likely to be consecutively allocated,
>> and cpumask_scnprintf() is a very concise way of representing them (and
>> will reduce your code quite a bit).
> You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
> dealing with vCPU-s.

Yes, although I was actually thinking of the scnlistprintf() variant.

However, I further notice that the source data is not in an appropriate
form, so it is perhaps a less sensible suggestion.

~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 11:45           ` Andrew Cooper
@ 2015-01-14 12:02             ` Wei Liu
  2015-01-14 15:04               ` Jan Beulich
  2015-01-14 14:49             ` Jan Beulich
  1 sibling, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-14 12:02 UTC (permalink / raw
  To: Andrew Cooper
  Cc: dario.faggioli, Wei Liu, Elena Ufimsteva, Jan Beulich, xen-devel

On Wed, Jan 14, 2015 at 11:45:25AM +0000, Andrew Cooper wrote:
> On 14/01/15 11:24, Jan Beulich wrote:
> >>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
> >> May I suggest the following stylistic changes:
> >>
> >> 2 vnodes, 20 vcpus:
> >>   1: pnode 0, vcpus {1-9}
> >>     0000000000000000 - 000000005dc00000
> >>   2: pnode 1, vcpus {10-20}
> >>     000000005dc00000 - 00000000bb000000
> >>     0000000100000000 - 0000000100800000
> >>  
> >> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
> >> index.  The vcpus are exceedingly likely to be consecutively allocated,
> >> and cpumask_scnprintf() is a very concise way of representing them (and
> >> will reduce your code quite a bit).
> > You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
> > dealing with vCPU-s.
> 
> Yes, although I was actually thinking of the scnlistprintf() variant.
> 
> However, I further notice that the source data is not in an appropriate
> form, so it is perhaps a less sensible suggestion.
> 

Yes. I need to rearrange the source data into bitmaps, and that seems to
involve dynamic memory allocation (?), which I would like to avoid in a
key handler.
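
For reference, the transposition would presumably look something like
the below (purely illustrative, and it contains exactly the allocation
I would rather avoid; it also assumes keyhandler_scratch's declared size
is visible here):

    unsigned long *vcpus = xzalloc_array(unsigned long,
                                         BITS_TO_LONGS(d->max_vcpus));

    if ( vcpus )
    {
        for ( j = 0; j < d->max_vcpus; j++ )
            if ( vnuma->vcpu_to_vnode[j] == i )
                __set_bit(j, vcpus);

        bitmap_scnlistprintf(keyhandler_scratch, sizeof(keyhandler_scratch),
                             vcpus, d->max_vcpus);
        printk("       vcpus: %s\n", keyhandler_scratch);
        xfree(vcpus);
    }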

I adopted the other suggestions you made. I also increased the output
width to 4 for every number in case we have 3-digit numbers in the
future. Now the output looks like:

(XEN) Domain 1 (total: 768064):
(XEN)     Node 0: 768064
(XEN)      2 vnodes, 20 vcpus, guest physical layout:
(XEN)          0: pnode   0
(XEN)          0000000000000000 - 000000005dc00000
(XEN)          vcpus:    0    1    2    3    4    5    6    7
(XEN)                    8    9
(XEN)          1: pnode   0
(XEN)          000000005dc00000 - 00000000bb000000
(XEN)          0000000100000000 - 0000000100800000
(XEN)          vcpus:   10   11   12   13   14   15   16   17
(XEN)                   18   19

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize
  2015-01-14 11:40       ` Ian Campbell
@ 2015-01-14 12:05         ` Wei Liu
  0 siblings, 0 replies; 63+ messages in thread
From: Wei Liu @ 2015-01-14 12:05 UTC (permalink / raw
  To: Ian Campbell
  Cc: Wei Liu, Elena Ufimtseva, Andrew Cooper, dario.faggioli,
	Ian Jackson, xen-devel, jbeulich

On Wed, Jan 14, 2015 at 11:40:42AM +0000, Ian Campbell wrote:
> On Wed, 2015-01-14 at 10:30 +0000, Wei Liu wrote:
> > On Tue, Jan 13, 2015 at 09:00:49PM +0000, Andrew Cooper wrote:
> > > On 13/01/15 12:11, Wei Liu wrote:
> > > > This function gets the machine E820 map and sanitize it according to PV
> > > > guest configuration.
> > > >
> > > > This will be used in later patch. No functional change introduced in
> > > > this patch.
> > > >
> > > > Signed-off-by: Wei Liu <wei.liu2@citrix.com>
> > > > Cc: Ian Campbell <ian.campbell@citrix.com>
> > > > Cc: Ian Jackson <ian.jackson@eu.citrix.com>
> > > > Cc: Dario Faggioli <dario.faggioli@citrix.com>
> > > > Cc: Elena Ufimtseva <ufimtseva@gmail.com>
> > > > ---
> > > >  tools/libxl/libxl_x86.c |   31 ++++++++++++++++++++++---------
> > > >  1 file changed, 22 insertions(+), 9 deletions(-)
> > > >
> > > > diff --git a/tools/libxl/libxl_x86.c b/tools/libxl/libxl_x86.c
> > > > index 9ceb373..e959e37 100644
> > > > --- a/tools/libxl/libxl_x86.c
> > > > +++ b/tools/libxl/libxl_x86.c
> > > > @@ -207,6 +207,27 @@ static int e820_sanitize(libxl_ctx *ctx, struct e820entry src[],
> > > >      return 0;
> > > >  }
> > > >  
> > > > +static int e820_host_sanitize(libxl__gc *gc,
> > > > +                              libxl_domain_build_info *b_info,
> > > > +                              struct e820entry map[],
> > > > +                              uint32_t *nr)
> > > > +{
> > > > +    int rc;
> > > > +
> > > > +    rc = xc_get_machine_memory_map(CTX->xch, map, E820MAX);
> > > 
> > > This function should not assume that map[] is E820MAX entries long.  The
> > > code you copied was passing a local variable whereas here you are
> > > passing someone else's pointer.  You should pass the length in *nr like
> > > e820_sanitize() does.
> > > 
> > 
> > Good catch. I will fix.
> 
> Does it also work to make the prototype "map[E820MAX]", thereby causing
> the compiler to force the caller to use a suitably sized array? (not
> sure what the constraints on the caller are here, i.e. whether it needs
> to be dynamically allocated.)

The map passed in doesn't necessarily have E820MAX elements, I think.
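
Something along these lines is roughly what I have in mind (only a sketch:
the actual sanitizing step is elided, and I'm assuming
xc_get_machine_memory_map() returns the number of entries it retrieved, as
the existing caller treats it):

static int e820_host_sanitize(libxl__gc *gc,
                              libxl_domain_build_info *b_info,
                              struct e820entry map[],
                              uint32_t *nr)
{
    int rc;

    /* On entry *nr is the capacity of map[]; don't assume E820MAX. */
    rc = xc_get_machine_memory_map(CTX->xch, map, *nr);
    if (rc < 0)
        return ERROR_FAIL;

    /* Report back how many entries were actually retrieved. */
    *nr = rc;

    /* ... sanitize map[] against the guest configuration here ... */

    return 0;
}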

Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 11:45           ` Andrew Cooper
  2015-01-14 12:02             ` Wei Liu
@ 2015-01-14 14:49             ` Jan Beulich
  2015-01-14 17:25               ` Wei Liu
  1 sibling, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-14 14:49 UTC (permalink / raw
  To: Andrew Cooper, Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, xen-devel

>>> On 14.01.15 at 12:45, <andrew.cooper3@citrix.com> wrote:
> On 14/01/15 11:24, Jan Beulich wrote:
>>>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
>>> May I suggest the following stylistic changes:
>>>
>>> 2 vnodes, 20 vcpus:
>>>   1: pnode 0, vcpus {1-9}
>>>     0000000000000000 - 000000005dc00000
>>>   2: pnode 1, vcpus {10-20}
>>>     000000005dc00000 - 00000000bb000000
>>>     0000000100000000 - 0000000100800000
>>>  
>>> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
>>> index.  The vcpus are exceedingly likely to be consecutively allocated,
>>> and cpumask_scnprintf() is a very concise way of representing them (and
>>> will reduce your code quite a bit).
>> You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
>> dealing with vCPU-s.
> 
> Yes, although I was actually thinking of the scnlistprintf() variant.
> 
> However, I further notice that the source data is not in an appropriate
> form, so it is perhaps a less sensible suggestion.

Yeah, but even if the library function isn't usable here, it would still
make sense to attempt to condense the output as much as possible.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 12:02             ` Wei Liu
@ 2015-01-14 15:04               ` Jan Beulich
  0 siblings, 0 replies; 63+ messages in thread
From: Jan Beulich @ 2015-01-14 15:04 UTC (permalink / raw
  To: Andrew Cooper, Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, xen-devel

>>> On 14.01.15 at 13:02, <wei.liu2@citrix.com> wrote:
> On Wed, Jan 14, 2015 at 11:45:25AM +0000, Andrew Cooper wrote:
>> On 14/01/15 11:24, Jan Beulich wrote:
>> >>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
>> >> May I suggest the following stylistic changes:
>> >>
>> >> 2 vnodes, 20 vcpus:
>> >>   1: pnode 0, vcpus {1-9}
>> >>     0000000000000000 - 000000005dc00000
>> >>   2: pnode 1, vcpus {10-20}
>> >>     000000005dc00000 - 00000000bb000000
>> >>     0000000100000000 - 0000000100800000
>> >>  
>> >> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
>> >> index.  The vcpus are exceedingly likely to be consecutively allocated,
>> >> and cpumask_scnprintf() is a very concise way of representing them (and
>> >> will reduce your code quite a bit).
>> > You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
>> > dealing with vCPU-s.
>> 
>> Yes, although I was actually thinking of the scnlistprintf() variant.
>> 
>> However, I further notice that the source data is not in an appropriate
>> form, so it is perhaps a less sensible suggestion.
>> 
> 
> Yes. I need to rearrange the source data into bitmaps, and that seems to
> involve dynamic memory allocation (?), which I would like to avoid in the
> key handler.
> 
> I adopted the other suggestions you made. I also increased the output width
> to 4 for every number in case we have 3-digit numbers in the future. Now the
> output looks like:
> 
> (XEN) Domain 1 (total: 768064):
> (XEN)     Node 0: 768064
> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
> (XEN)          0: pnode   0
> (XEN)          0000000000000000 - 000000005dc00000
> (XEN)          vcpus:    0    1    2    3    4    5    6    7
> (XEN)                    8    9
> (XEN)          1: pnode   0
> (XEN)          000000005dc00000 - 00000000bb000000
> (XEN)          0000000100000000 - 0000000100800000
> (XEN)          vcpus:   10   11   12   13   14   15   16   17
> (XEN)                   18   19

As said in an earlier reply, I think you should still aim at printing
"vcpus: {0-9}" and alike instead of the individual enumeration
you use right now.

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 14:49             ` Jan Beulich
@ 2015-01-14 17:25               ` Wei Liu
  2015-01-15  8:14                 ` Jan Beulich
  0 siblings, 1 reply; 63+ messages in thread
From: Wei Liu @ 2015-01-14 17:25 UTC (permalink / raw
  To: Jan Beulich
  Cc: Andrew Cooper, dario.faggioli, Wei Liu, Elena Ufimtseva,
	xen-devel

On Wed, Jan 14, 2015 at 02:49:17PM +0000, Jan Beulich wrote:
> >>> On 14.01.15 at 12:45, <andrew.cooper3@citrix.com> wrote:
> > On 14/01/15 11:24, Jan Beulich wrote:
> >>>>> On 14.01.15 at 12:06, <andrew.cooper3@citrix.com> wrote:
> >>> May I suggest the following stylistic changes:
> >>>
> >>> 2 vnodes, 20 vcpus:
> >>>   1: pnode 0, vcpus {1-9}
> >>>     0000000000000000 - 000000005dc00000
> >>>   2: pnode 1, vcpus {10-20}
> >>>     000000005dc00000 - 00000000bb000000
> >>>     0000000100000000 - 0000000100800000
> >>>  
> >>> You have already stated 2 vnodes, so "vnode $X" is redundant as the list
> >>> index.  The vcpus are exceedingly likely to be consecutively allocated,
> >>> and cpumask_scnprintf() is a very concise way of representing them (and
> >>> will reduce your code quite a bit).
> >> You mean bitmap_scnprintf() - cpumask_scnprintf() is not suitable for
> >> dealing with vCPU-s.
> > 
> > Yes, although I was actually thinking of the scnlistprintf() variant.
> > 
> > However, I further notice that the source data is not in an appropriate
> > form, so it is perhaps a less sensible suggestion.
> 
> Yeah, but even if the library function isn't usable here, it would still
> make sense to attempt to condense the output as much as possible.
> 

I modified my code and here are some output examples:

(XEN) Domain 1 (total: 768064):
(XEN)     Node 0: 768064
(XEN)      2 vnodes, 20 vcpus, guest physical layout:
(XEN)          0: pnode   0 vcpus 0-9
(XEN)            0000000000000000 - 000000005dc00000
(XEN)          1: pnode   0 vcpus 10-19
(XEN)            000000005dc00000 - 00000000bb000000
(XEN)            0000000100000000 - 0000000100800000

(XEN) Domain 2 (total: 768064):
(XEN)     Node 0: 768064
(XEN)      2 vnodes, 20 vcpus, guest physical layout:
(XEN)          0: pnode   0 vcpus 0-18
(XEN)            0000000000000000 - 000000005dc00000
(XEN)          1: pnode   0 vcpus 19
(XEN)            000000005dc00000 - 00000000bb000000
(XEN)            0000000100000000 - 0000000100800000

(XEN) Domain 4 (total: 768064):
(XEN)     Node 0: 768064
(XEN)      2 vnodes, 20 vcpus, guest physical layout:
(XEN)          0: pnode   0 vcpus 0-4 10-14
(XEN)            0000000000000000 - 000000005dc00000
(XEN)          1: pnode   0 vcpus 5-9 15-19
(XEN)            000000005dc00000 - 00000000bb000000
(XEN)            0000000100000000 - 0000000100800000
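
The range printing itself goes roughly like this (just a sketch from memory
rather than the exact patch, assuming the per-domain vNUMA info exposes a
vcpu_to_vnode array; no dynamic allocation is needed):

static void dump_vnode_vcpus(const struct domain *d, unsigned int vnode)
{
    unsigned int i, start = 0;
    bool_t in_range = 0;

    printk("vcpus");

    /* i == d->max_vcpus acts as a sentinel that closes any open range. */
    for ( i = 0; i <= d->max_vcpus; i++ )
    {
        bool_t member = i < d->max_vcpus &&
                        d->vnuma->vcpu_to_vnode[i] == vnode;

        if ( member && !in_range )
        {
            /* Start of a new run of consecutive vcpus on this vnode. */
            start = i;
            in_range = 1;
        }
        else if ( !member && in_range )
        {
            /* Run ended: print "N" for a single vcpu, "N-M" otherwise. */
            if ( start == i - 1 )
                printk(" %u", start);
            else
                printk(" %u-%u", start, i - 1);
            in_range = 0;
        }
    }
}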

Wei.

> Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest
  2015-01-13 20:44     ` Wei Liu
@ 2015-01-14 19:02       ` Konrad Rzeszutek Wilk
  0 siblings, 0 replies; 63+ messages in thread
From: Konrad Rzeszutek Wilk @ 2015-01-14 19:02 UTC (permalink / raw
  To: Wei Liu
  Cc: Ian Campbell, dario.faggioli, Ian Jackson, xen-devel, jbeulich,
	Elena Ufimtseva

On Tue, Jan 13, 2015 at 08:44:29PM +0000, Wei Liu wrote:
> On Tue, Jan 13, 2015 at 01:15:21PM -0500, Konrad Rzeszutek Wilk wrote:
> > On Tue, Jan 13, 2015 at 12:11:43PM +0000, Wei Liu wrote:
> > > The algorithm is more or less the same as the one used for PV guests.
> > > Libxc gets hold of the mapping of vnode to pnode and the size of each
> > > vnode, then allocates memory accordingly.
> > 
> > Could you split this patch in two? One part for the adding of the code
> > and the other for moving the existing code around?
> > 
> 
> The structures added are just a few lines. On one hand the two Ians
> complained that I wrote patches that only added structures; on the other
> you complained that I bundled them with functional code. It's hard to
> please everyone, isn't it? :-)
> 
> Jokes aside, I'm not sure whether the above paragraph answers your question.
> Which part of the code is "added" and which part is "moved around"?

The existing code that is just moved (shifted by a tab) could have a
separate patch which is non-functional. That is - it does not change the
behavior of the code - it just moves it around.
> 
> Wei.

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-14 17:25               ` Wei Liu
@ 2015-01-15  8:14                 ` Jan Beulich
  2015-01-15 10:38                   ` Andrew Cooper
  0 siblings, 1 reply; 63+ messages in thread
From: Jan Beulich @ 2015-01-15  8:14 UTC (permalink / raw
  To: Andrew Cooper, Wei Liu; +Cc: dario.faggioli, Elena Ufimtseva, xen-devel

>>> On 14.01.15 at 18:25, <wei.liu2@citrix.com> wrote:
> I modified my code and here are some output examples:
> 
> (XEN) Domain 1 (total: 768064):
> (XEN)     Node 0: 768064
> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
> (XEN)          0: pnode   0 vcpus 0-9
> (XEN)            0000000000000000 - 000000005dc00000
> (XEN)          1: pnode   0 vcpus 10-19
> (XEN)            000000005dc00000 - 00000000bb000000
> (XEN)            0000000100000000 - 0000000100800000
> 
> (XEN) Domain 2 (total: 768064):
> (XEN)     Node 0: 768064
> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
> (XEN)          0: pnode   0 vcpus 0-18
> (XEN)            0000000000000000 - 000000005dc00000
> (XEN)          1: pnode   0 vcpus 19
> (XEN)            000000005dc00000 - 00000000bb000000
> (XEN)            0000000100000000 - 0000000100800000
> 
> (XEN) Domain 4 (total: 768064):
> (XEN)     Node 0: 768064
> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
> (XEN)          0: pnode   0 vcpus 0-4 10-14
> (XEN)            0000000000000000 - 000000005dc00000
> (XEN)          1: pnode   0 vcpus 5-9 15-19
> (XEN)            000000005dc00000 - 00000000bb000000
> (XEN)            0000000100000000 - 0000000100800000

LGTM - Andrew?

Jan

^ permalink raw reply	[flat|nested] 63+ messages in thread

* Re: [PATCH v3 01/19] xen: dump vNUMA information with debug key "u"
  2015-01-15  8:14                 ` Jan Beulich
@ 2015-01-15 10:38                   ` Andrew Cooper
  0 siblings, 0 replies; 63+ messages in thread
From: Andrew Cooper @ 2015-01-15 10:38 UTC (permalink / raw
  To: Jan Beulich, Wei Liu; +Cc: dario.faggioli, Elena Ufimsteva, xen-devel

On 15/01/15 08:14, Jan Beulich wrote:
>>>> On 14.01.15 at 18:25, <wei.liu2@citrix.com> wrote:
>> I modified my code and here are some output examples:
>>
>> (XEN) Domain 1 (total: 768064):
>> (XEN)     Node 0: 768064
>> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
>> (XEN)          0: pnode   0 vcpus 0-9
>> (XEN)            0000000000000000 - 000000005dc00000
>> (XEN)          1: pnode   0 vcpus 10-19
>> (XEN)            000000005dc00000 - 00000000bb000000
>> (XEN)            0000000100000000 - 0000000100800000
>>
>> (XEN) Domain 2 (total: 768064):
>> (XEN)     Node 0: 768064
>> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
>> (XEN)          0: pnode   0 vcpus 0-18
>> (XEN)            0000000000000000 - 000000005dc00000
>> (XEN)          1: pnode   0 vcpus 19
>> (XEN)            000000005dc00000 - 00000000bb000000
>> (XEN)            0000000100000000 - 0000000100800000
>>
>> (XEN) Domain 4 (total: 768064):
>> (XEN)     Node 0: 768064
>> (XEN)      2 vnodes, 20 vcpus, guest physical layout:
>> (XEN)          0: pnode   0 vcpus 0-4 10-14
>> (XEN)            0000000000000000 - 000000005dc00000
>> (XEN)          1: pnode   0 vcpus 5-9 15-19
>> (XEN)            000000005dc00000 - 00000000bb000000
>> (XEN)            0000000100000000 - 0000000100800000
> LGTM - Andrew?

Comma after the pnode number to distinguish "vcpus" from the number
beforehand.  Other than that, LGTM.

~Andrew

^ permalink raw reply	[flat|nested] 63+ messages in thread

end of thread, other threads:[~2015-01-15 10:38 UTC | newest]

Thread overview: 63+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-01-13 12:11 [PATCH v3 00/19] Virtual NUMA for PV and HVM Wei Liu
2015-01-13 12:11 ` [PATCH v3 01/19] xen: dump vNUMA information with debug key "u" Wei Liu
2015-01-13 16:38   ` Jan Beulich
2015-01-13 16:45     ` Wei Liu
2015-01-13 19:21   ` Andrew Cooper
2015-01-13 20:51     ` Wei Liu
2015-01-14 11:06       ` Andrew Cooper
2015-01-14 11:24         ` Jan Beulich
2015-01-14 11:45           ` Andrew Cooper
2015-01-14 12:02             ` Wei Liu
2015-01-14 15:04               ` Jan Beulich
2015-01-14 14:49             ` Jan Beulich
2015-01-14 17:25               ` Wei Liu
2015-01-15  8:14                 ` Jan Beulich
2015-01-15 10:38                   ` Andrew Cooper
2015-01-13 12:11 ` [PATCH v3 02/19] xen: make two memory hypercalls vNUMA-aware Wei Liu
2015-01-13 12:11 ` [PATCH v3 03/19] libxc: allocate memory with vNUMA information for PV guest Wei Liu
2015-01-13 17:05   ` Ian Jackson
2015-01-13 17:41     ` Wei Liu
2015-01-13 20:02   ` Andrew Cooper
2015-01-14 10:53     ` Wei Liu
2015-01-14 10:58       ` Andrew Cooper
2015-01-14 11:01         ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 04/19] libxl: introduce vNUMA types Wei Liu
2015-01-13 15:40   ` Ian Jackson
2015-01-13 15:51     ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 05/19] libxl: add vmemrange to libxl__domain_build_state Wei Liu
2015-01-13 17:02   ` Ian Jackson
2015-01-13 17:34     ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 06/19] libxl: introduce libxl__vnuma_config_check Wei Liu
2015-01-13 12:11 ` [PATCH v3 07/19] libxl: x86: factor out e820_host_sanitize Wei Liu
2015-01-13 17:07   ` Ian Jackson
2015-01-13 21:00   ` Andrew Cooper
2015-01-14 10:30     ` Wei Liu
2015-01-14 11:40       ` Ian Campbell
2015-01-14 12:05         ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 08/19] libxl: functions to build vmemranges for PV guest Wei Liu
2015-01-13 18:15   ` Ian Jackson
2015-01-14 11:05     ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 09/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2015-01-13 12:11 ` [PATCH v3 10/19] xen: handle XENMEM_get_vnumainfo in compat_memory_op Wei Liu
2015-01-13 16:41   ` Jan Beulich
2015-01-13 12:11 ` [PATCH v3 11/19] tools/hvmloader: link errno.h from xen internal Wei Liu
2015-01-13 16:32   ` Jan Beulich
2015-01-13 16:36     ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 12/19] hvmloader: retrieve vNUMA information from hypervisor Wei Liu
2015-01-13 16:50   ` Jan Beulich
2015-01-13 17:17     ` Wei Liu
2015-01-13 12:11 ` [PATCH v3 13/19] hvmloader: construct SRAT Wei Liu
2015-01-13 16:52   ` Jan Beulich
2015-01-13 12:11 ` [PATCH v3 14/19] hvmloader: construct SLIT Wei Liu
2015-01-13 16:53   ` Jan Beulich
2015-01-13 12:11 ` [PATCH v3 15/19] libxc: allocate memory with vNUMA information for HVM guest Wei Liu
2015-01-13 18:15   ` Konrad Rzeszutek Wilk
2015-01-13 20:44     ` Wei Liu
2015-01-14 19:02       ` Konrad Rzeszutek Wilk
2015-01-13 12:11 ` [PATCH v3 16/19] libxl: build, check and pass vNUMA info to Xen " Wei Liu
2015-01-13 12:11 ` [PATCH v3 17/19] libxl: disallow memory relocation when vNUMA is enabled Wei Liu
2015-01-13 12:11 ` [PATCH v3 18/19] libxlutil: nested list support Wei Liu
2015-01-13 15:52   ` Ian Jackson
2015-01-13 16:11     ` Wei Liu
2015-01-13 18:25       ` Ian Jackson
2015-01-13 12:11 ` [PATCH v3 19/19] xl: vNUMA support Wei Liu
