Linux-CXL Archive mirror
From: Jonathan Cameron <Jonathan.Cameron@Huawei.com>
To: "Parthasarathy,
	Mohan (In-Memory Compute Platforms)"
	<mohan_parthasarathy@hpe.com>
Cc: "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>,
	Dave Jiang <dave.jiang@intel.com>
Subject: Re: How to connect a CXL memory device to a NUMA node ?
Date: Fri, 5 Apr 2024 14:29:45 +0100	[thread overview]
Message-ID: <20240405142945.00002921@Huawei.com> (raw)
In-Reply-To: <PH7PR84MB158244A35FB3A633E6E0DFE888032@PH7PR84MB1582.NAMPRD84.PROD.OUTLOOK.COM>

On Fri, 5 Apr 2024 10:36:37 +0000
"Parthasarathy, Mohan (In-Memory Compute Platforms)" <mohan_parthasarathy@hpe.com> wrote:

> Hi all,

Hi Mohan,

You've found a gap on the kernel side of things rather than QEMU I think.
Btw I assume you are testing on x86 - there are some more changes needed on
ARM64. I have them but need to find time to clean up the code. My tests
are on ARM64 but should align with what you are seeing.

Directly, no, there isn't a way to do it, because such a setup would rely
on firmware doing the distance discovery and creating appropriate SLIT and SRAT tables.
It doesn't make sense to emulate that firmware setup directly in QEMU, though we
could in theory do so.

Unless someone fancies taking on EDK2 support for doing that on top of QEMU, we are
focusing on what looks more like a hotplug flow (or a BIOS that leaves configuration
to the OS).

For that we use SRAT Generic Port Affinity structures and CFMWS structures in the CEDT.
Not all the code is upstream yet though. You'll need the QEMU patches I posted
earlier this week
https://lore.kernel.org/qemu-devel/20240403102927.31263-1-Jonathan.Cameron@huawei.com/
(there is a test in there that acts as an example of how to configure it)
plus Dave Jiang's kernel fixes posted on this list:
https://lore.kernel.org/linux-cxl/20240403154844.3403859-1-dave.jiang@intel.com/T/#t
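
For reference, this is roughly what the QEMU side of a single volatile type-3
device behind one fixed memory window looks like - an untested sketch based on
docs/system/devices/cxl.rst, with illustrative names like cxl-mem0, and a
placeholder for the generic-port object whose exact syntax comes from the
series above:

  qemu-system-x86_64 -machine q35,cxl=on -m 4G,maxmem=8G,slots=8 -smp 4 \
    -object memory-backend-ram,id=cxl-mem0,size=4G \
    -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.0 \
    -device cxl-rp,port=0,bus=cxl.0,id=rp0,chassis=0,slot=2 \
    -device cxl-type3,bus=rp0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
    -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G
  # plus, from the series above (placeholder - see its test for the real syntax):
  # -object acpi-generic-port,id=gp0,pci-bus=cxl.0,node=<gp-node-id>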

However, I think we do have a gap in providing any SLIT-equivalent data for the NUMA
nodes generated for CXL memory.

You can add SLIT entries for all the ACPI-described nodes via -numa dist,src=0,dst=0,val=10 etc.
as described here:
https://github.com/open-mpi/hwloc/wiki/Simulating-complex-memory-with-Qemu
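
For completeness, a rough sketch of the relevant options for two CPU + memory
nodes (illustrative node ids and sizes; I used 21 so the SLIT-provided value
stands out; pairs you don't specify fall back to QEMU's 10/20 defaults):

  -object memory-backend-ram,id=mem0,size=2G \
  -object memory-backend-ram,id=mem1,size=2G \
  -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
  -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
  -numa dist,src=0,dst=1,val=21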

I've just run with that and get generic distances from numactl -H similar to below,
but the HMAT-derived /sys/bus/node/devices/nodeX/access0/ etc. correctly show different values.
For nodes in SLIT, the values in /sys/bus/node/devices/nodeX/distance (which is probably
what numactl -H reads) are the ones provided (I used 21 to make it obvious), but
for CXL memory added to the OS the default value of 20 is used.
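
To compare the two views directly, I'm looking at something like this (node
number illustrative; access0 covers all initiators including generic ports,
access1 is restricted to CPU-containing initiators):

  cat /sys/bus/node/devices/node2/distance        # SLIT value, or the default 20
  grep . /sys/bus/node/devices/node2/access0/initiators/*_latency \
         /sys/bus/node/devices/node2/access0/initiators/*_bandwidth   # HMAT derived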

Currently CXL-related NUMA nodes are created per CFMWS, so if you want your two
devices in their own NUMA nodes you will need to create two CFMWS entries, one
per device.
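
Something along these lines - again an untested sketch with illustrative names,
using the documented cxl-fmw machine options so each device sits behind its own
window:

  -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.0 \
  -device pxb-cxl,bus_nr=222,bus=pcie.0,id=cxl.1 \
  -device cxl-rp,port=0,bus=cxl.0,id=rp0,chassis=0,slot=2 \
  -device cxl-rp,port=0,bus=cxl.1,id=rp1,chassis=0,slot=3 \
  -device cxl-type3,bus=rp0,volatile-memdev=cxl-mem0,id=cxl-vmem0 \
  -device cxl-type3,bus=rp1,volatile-memdev=cxl-mem1,id=cxl-vmem1 \
  -M cxl-fmw.0.targets.0=cxl.0,cxl-fmw.0.size=4G,cxl-fmw.1.targets.0=cxl.1,cxl-fmw.1.size=4G

Once each region is created and committed by the OS, each device's memory
should then end up in the node associated with its own CFMWS.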

So I wonder, what should we do about distances traditionally retrieved from SLIT?
Unfortunately there will be lots of legacy code out there that will care :(

We need to poke something into numa_set_distance() I think.

The fun part is how we derive something sensible given the can of worms SLIT
is. 10 is well defined, but beyond 'bigger means further' inter-node distances depend
on what mood the BIOS writer was in and on what broke in various OSes with
the values they actually wanted to put in - the numbers get tweaked to work around
OS issues - we lie about some of our platforms so that the scheduler doesn't
go crazy, for example. We could try to calculate the relationship between SLIT
values and HMAT values on a platform and use that to derive a value?

Anyone object to just using 42 for all CXL memory nodes that weren't set
to anything at boot time? (i.e. not already in SLIT?)

Jonathan


> 
> I want to create a VM with 2 CXL memory devices - one attached to NUMA node 0 and one attached to NUMA node 1. Is this possible in QEMU ? Currently when I create a CXL memory device in QEMU, it shows equal distances from each numa node in numactl output. I want it such that it should show closer distance to the NUMA node (or socket) it is attached to. Something like this :
> 
> [fedora@localhost ~]$ numactl -H
> available: 3 nodes (0-2)
> node 0 cpus: 0 1
> node 0 size: 1894 MB
> node 0 free: 1627 MB
> node 1 cpus: 2 3
> node 1 size: 2012 MB
> node 1 free: 1696 MB
> node 2 cpus:
> node 2 size: 4096 MB
> node 2 free: 4096 MB
> node distances:
> node   0   1   2
>   0:  10  20  20
>   1:  20  10  30
>   2:  20  20  10
> 
> As you can see the distance for numa node 1 to the cxl device should be 30, not 20, assuming the CXL device is attached to node 0. Any thoughts on how to make this work ?
> 
> Regards,
> Mohan
> 


Thread overview: 2+ messages
2024-04-05 10:36 How to connect a CXL memory device to a NUMA node ? Parthasarathy, Mohan (In-Memory Compute Platforms)
2024-04-05 13:29 ` Jonathan Cameron [this message]
