Linux-CXL Archive mirror
 help / color / mirror / Atom feed
From: Dan Williams <dan.j.williams@intel.com>
To: Nathan Fontenot <ndfont@gmail.com>,
	Alison Schofield <alison.schofield@intel.com>,
	Hongjian Fan <hongjian.fan@seagate.com>
Cc: "linux-cxl@vger.kernel.org" <linux-cxl@vger.kernel.org>
Subject: Re: Question on deferring dax registration to cxl module for CXL_REGION
Date: Thu, 29 Feb 2024 13:52:46 -0800	[thread overview]
Message-ID: <65e0fcae989d6_1138c7294c2@dwillia2-xfh.jf.intel.com.notmuch> (raw)
In-Reply-To: <c3aadcee-818c-4346-9935-b72df9e4b443@gmail.com>

Nathan Fontenot wrote:
[..] 
> Alison, and others,
> 
> Can you provide some additional details on this new approach. I'm trying to wrap
> my head around management of the the separate cxl resource tree and what resources
> would be put in it.
> 
> I've also wondered if you were looking to use this to manage cxl resources outside
> of the iomem resource tree or is it just for management of 'soft reserve' resources
> under the CFMWS.

Hey Nathan, here is my current thinking. The tl;dr is quite a bit of
spelunking in early init code is needed to get this fixed up properly.
If you have cycles to take this on, here is a roadmap:

The BIOS is responsible for building the CFMWS and optionally deciding
what memory should not be included in the OS general pool by default
(EFI_MEMORY_SP attribute). That attribute allows the OS the *option* to
use it in a dedicated fashion. That could either be because the memory
is too slow and needs to be set aside for applications that can tolerate
the latency, or memory that is too fast, i.e. so precious that only a
specific application should use it.

However, when the OS has no policy for that dedicated memory it should
find its way back into the general pool. So EFI_MEMORY_SP is a mechanism
that prevents immovable allocations landing in dedicated memory early in
boot, but with the understanding that most system-owners likely want to
proceed with all memory in the general pool unless and until it becomes
abundantly clear that a policy to do something different needs to be
deployed.

So, how does this affect CXL? Given there is only one EFI_MEMORY_SP bit
that is not CXL specific, it may be applied across one large address
range that includes CXL, HBM, PMEM, and/or even some DDR. For HBM and
DDR there is no driver subsystem like CXL that can decorate that address
range with a device like a cxl_region device. Those ranges need to go
straight to device-dax so that the dedicated memory-policy can be
deployed.

The problem is what do with the remaining Soft Reserve ranges that
intersect CFMWS that *might* later have a cxl_region established. Right
now a "Soft Reserve" entry in iomem_resource can collide with "CXL
Window" entries and impede the CXL subsystem's ability to manage the
resource tree.

The proposal is to delay the decision about installing a "Soft Reserve"
entry until after the CXL intersections have been determined.

NUMA init should run before e820__reserve_resources_late() which means
that e820__reserve_resources_late() should be able to discover where
a "Soft Reserve" entry intersects a CXL range, and trim that "Soft
Reserve" entry to only reference non-CXL ranges.

Sometime later the cxl_acpi driver loads, re-parses CFMWS and inserts
"CXL Window" resources into iomem resource. You might wonder, "could not
NUMA init do this resource registration?". The problem is that NUMA init
does not know about any of the resources deferred by
e820__reserve_resources() to e820__reserve_resources_late(), so it does
not know how to do the fixups that add_cxl_resource() (in cxl_acpi)
does.

After cxl_acpi is loaded Linux knows where all the CXL windows are, but
it does not know if there are cxl_region instances populated into those
windows. That process involves waiting for PCI probe to complete and the
CXL autoassembly process to run its course.

That could be achieved with something like cxl_acpi kicks off a work
item to do:

	wait_for_device_probe()
	/* trim soft reserve where cxl took over */
	for_each_deferred_cxl_soft_reserve_range()
		for_each_cxl_region()
			if (does_region_intersect_softreserve_range())
				trim_soft_reserve_range()

	/* create standalone Soft Reserve entries for the rest */
	for_each_deferred_cxl_soft_reserve_range()
		insert_soft_reserve_resource()

	/*
	 * CXL soft reserve with no cxl_region likely means something
	 * like Type-2 memory that may be invisible to cxl_pci
	 */
	notify_device_dax_of_late_discover_soft_reserve()

Hope this helps clarify what I think is needed to fix this up, patches
are welcome.

      reply	other threads:[~2024-02-29 21:52 UTC|newest]

Thread overview: 5+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-11 21:03 Question on deferring dax registration to cxl module for CXL_REGION Hongjian Fan
2024-01-12  0:26 ` Alison Schofield
2024-01-12 22:38   ` Hongjian Fan
2024-02-15 21:33   ` Nathan Fontenot
2024-02-29 21:52     ` Dan Williams [this message]

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=65e0fcae989d6_1138c7294c2@dwillia2-xfh.jf.intel.com.notmuch \
    --to=dan.j.williams@intel.com \
    --cc=alison.schofield@intel.com \
    --cc=hongjian.fan@seagate.com \
    --cc=linux-cxl@vger.kernel.org \
    --cc=ndfont@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for read-only IMAP folder(s) and NNTP newsgroup(s).