From: Gregory Price <gregory.price@memverge.com>
To: linux-cxl@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
Date: Thu, 13 Apr 2023 07:39:43 -0400
Message-ID: <ZDfp/+7uTyh2wWcX@memverge.com>
In-Reply-To: <ZDb71ZXGtzz0ttQT@memverge.com>

On Wed, Apr 12, 2023 at 02:43:33PM -0400, Gregory Price wrote:
> 
> 
> I was looking to validate the mlock-ability of various pages when CXL is in
> different states (NUMA, dax, etc.), and I discovered a page_table_check
> BUG when accessing Memory Expander memory while a device is in daxdev mode.
> 
> This happens essentially on the fault of the first accessed page:
> 
> int dax_fd = open(device_path, O_RDWR);
> void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> ((char*)mapped_memory)[0] = 1;
> 
> 
> Full details of my test here:
> 
> Step 1) Test that memory onlined in NUMA node works
> 
> [user@host0 ~]# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 0 size: 63892 MB
> node 0 free: 59622 MB
> node 1 cpus:
> node 1 size: 129024 MB
> node 1 free: 129024 MB
> node distances:
> node   0   1
>   0:  10  50
>   1:  255  10
> 
> 
> [user@host0 ~]# numactl --preferred=1 memhog 128G
> ... snip ...
> 
> Passes no problem, all memory is accessible and used.
> 
> 
> 
> Next, reconfigure the device to daxdev mode
> 
> 
> [user@host0 ~]# daxctl list
> [
>   {
>     "chardev":"dax0.0",
>     "size":137438953472,
>     "target_node":1,
>     "align":2097152,
>     "mode":"system-ram",
>     "online_memblocks":63,
>     "total_memblocks":63,
>     "movable":true
>   }
> ]
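
For reference, here is the three-line repro quoted above fleshed out into
a standalone program (a sketch: the "/dev/dax0.0" path, the 2MB mapping
size, the error handling, and the comments are additions of mine, not
from the original report):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	/* Assumed device node and mapping size; adjust for your system. */
	const char *device_path = "/dev/dax0.0";
	size_t len = 2UL * 1024 * 1024;		/* one 2MB daxdev page */

	int dax_fd = open(device_path, O_RDWR);
	if (dax_fd < 0) {
		perror("open");
		return EXIT_FAILURE;
	}

	void *mapped_memory = mmap(NULL, len, PROT_READ | PROT_WRITE,
				   MAP_SHARED, dax_fd, 0);
	if (mapped_memory == MAP_FAILED) {
		perror("mmap");
		close(dax_fd);
		return EXIT_FAILURE;
	}

	/* The first write faults the page in; page_table_check fires here. */
	((char *)mapped_memory)[0] = 1;

	munmap(mapped_memory, len);
	close(dax_fd);
	return EXIT_SUCCESS;
}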


Follow-up: I was investigating why my dax region here only created 63
2GB memory blocks for a 128GB region, and the reason is a forced
alignment of dax devices against the CXL Fixed Memory Window (CFMW).

[    0.000000] BIOS-e820: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
[    0.000000] reserve setup_data: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
[    0.000000] reserve setup_data: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved


Some debug prints I added:

[   20.726483] dax cxl probe
[   20.727330] cxl_dax_region dax_region0: alloc_dax_region: start 1050000000 end 304fffffff
[   20.728405] Creating dev_dev
[   20.729033] dev_dax nr_range: 0
[   20.735481]  dax0.0: alloc range[0]: 0x0000001050000000:0x000000304fffffff

The memory backing this dax region gets squashed by this code:

From drivers/dax/kmem.c:

static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
{
	struct dev_dax_range *dax_range = &dev_dax->ranges[i];
	struct range *range = &dax_range->range;

	/* memory-block align the hotplug range */
	r->start = ALIGN(range->start, memory_block_size_bytes());
	r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
	if (r->start >= r->end) {
		r->start = range->start;
		r->end = range->end;
		return -ENOSPC;
	}
	return 0;
}

With a 2GB memory_block_size_bytes(), we end up with a mapping range of:

start=0x1080000000
end=0x2fffffffff
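
To sanity-check the numbers, here is a userspace sketch of the same
ALIGN()/ALIGN_DOWN() math (the macros are reimplemented here, and the
2GB memory_block_size_bytes() value is inferred from the 63x2GB
memblock count above):

#include <stdio.h>

/* Userspace reimplementations of the kernel's alignment macros. */
#define ALIGN_DOWN(x, a)	((x) & ~((a) - 1))
#define ALIGN(x, a)		ALIGN_DOWN((x) + (a) - 1, (a))

int main(void)
{
	unsigned long long block = 2ULL << 30;		/* 2GB memory blocks */
	unsigned long long start = 0x1050000000ULL;	/* region start, from e820 */
	unsigned long long end   = 0x304fffffffULL;	/* region end, from e820 */

	unsigned long long r_start = ALIGN(start, block);
	unsigned long long r_end   = ALIGN_DOWN(end + 1, block) - 1;

	printf("aligned range:    %#llx-%#llx\n", r_start, r_end);
	printf("chopped at start: %lluMB\n", (r_start - start) >> 20);
	printf("chopped at end:   %lluMB\n", (end - r_end) >> 20);
	printf("memory blocks:    %llu\n", (r_end - r_start + 1) / block);
	return 0;
}

This prints the 0x1080000000-0x2fffffffff range above, 63 memory blocks,
and the 768MB/1280MB truncation noted below.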


Why NUMA mode works under these conditions without crashing the system
escapes me at the moment, given that page faulting goes through the
same driver.  My guess is that the pfn-to-page mappings are off in some
way when the device is placed in devdax mode, whereas they're correct
in NUMA mode.


Note that the above code chops off the first 768MB and the last 1.25GB
of the dax region.

The CFMW is only required to be 256MB aligned, but this code forces
anything mapped into that window to be 2GB aligned.  I don't think it's
safe to say the BIOS is wrong here.


It seems like the dax region ranges are being tied to the memory block
size, but a raw devdax device does not necessarily use memory blocks.
Is there a potential bug in the mode-switching code?

~Gregory

