Linux-CXL Archive mirror
* [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
@ 2023-04-12 18:43 Gregory Price
  2023-04-13 11:39 ` Gregory Price
  2023-04-18  6:35 ` Dan Williams
  0 siblings, 2 replies; 6+ messages in thread
From: Gregory Price @ 2023-04-12 18:43 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dan Williams, Dave Jiang



I was looking to validate mlock-ability of various pages when CXL is in
different states (numa, dax, etc), and I discovered a page_table_check
BUG when accessing MemExp memory while a device is in daxdev mode.

this happens essentially on a fault of the first accessed page

int dax_fd = open(device_path, O_RDWR);
void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
((char*)mapped_memory)[0] = 1;


Full details of my test here:

Step 1) Test that memory onlined in NUMA node works

[user@host0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 63892 MB
node 0 free: 59622 MB
node 1 cpus:
node 1 size: 129024 MB
node 1 free: 129024 MB
node distances:
node   0   1
  0:  10  50
  1:  255  10


[user@host0 ~]# numactl --preferred=1 memhog 128G
... snip ...

Passes no problem, all memory is accessible and used.



Next, reconfigure the device to daxdev mode


[user@host0 ~]# daxctl list
[
  {
    "chardev":"dax0.0",
    "size":137438953472,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":63,
    "total_memblocks":63,
    "movable":true
  }
]
[user@host0 ~]# daxctl offline-memory dax0.0
offlined memory for 1 device
[user@host0 ~]# daxctl reconfigure-device --human --mode=devdax dax0.0
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax"
}
reconfigured 1 device
[user@host0 mapping0]# daxctl list -M -u
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax",
  "mappings":[
    {
      "page_offset":"0",
      "start":"0x1050000000",
      "end":"0x304fffffff",
      "size":"128.00 GiB (137.44 GB)"
    }
  ]
}


Now map and access the memory via /dev/dax0.0  (test program attached)

[ 1028.430734] kernel BUG at mm/page_table_check.c:53!
[ 1028.430753] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1028.430763] CPU: 14 PID: 5292 Comm: daxmemtest Not tainted 6.3.0-rc6-dirty #22
[ 1028.430774] Hardware name: AMD Corporation ONYX/ONYX, BIOS ROX1006C 03/01/2023
[ 1028.430785] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.430798] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.430820] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.430829] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.430838] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.430847] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.430857] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.430867] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.430876] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.430887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.430895] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.430905] PKRU: 55555554
[ 1028.430909] Call Trace:
[ 1028.430914]  <TASK>
[ 1028.430919]  vmf_insert_pfn_pmd_prot+0x2b4/0x360
[ 1028.430929]  dev_dax_huge_fault+0x181/0x400 [device_dax]
[ 1028.430941]  __handle_mm_fault+0x806/0xfe0
[ 1028.430951]  handle_mm_fault+0x189/0x460
[ 1028.430958]  do_user_addr_fault+0x1e0/0x730
[ 1028.430968]  exc_page_fault+0x7e/0x200
[ 1028.430977]  asm_exc_page_fault+0x22/0x30
[ 1028.430984] RIP: 0033:0x401262
[ 1028.430990] Code: 00 b8 00 00 00 00 e8 1d fe ff ff 8b 45 f4 89 c7 e8 23 fe ff ff b8 01 00 00 00 eb 2a bf 7e 20 40 00 e8 e2 fd ff ff 48 8b 45 e0 <c6> 00 01 8b 45 f4 89 c7 e8 01 fe ff ff bf 85 20 40 00 e8 c7 fd ff
[ 1028.431011] RSP: 002b:00007fff888d8fc0 EFLAGS: 00010206
[ 1028.431019] RAX: 00007f0a65a00000 RBX: 00007fff888d90f8 RCX: 00007f0a65f01c37
[ 1028.431242] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f0a65ffaa70
[ 1028.431446] RBP: 00007fff888d8fe0 R08: 0000000000000003 R09: 0000000000000000
[ 1028.431651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 1028.431852] R13: 00007fff888d9108 R14: 0000000000403e18 R15: 00007f0a6610d000
[ 1028.432053]  </TASK>
[ 1028.432247] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype nft_compat br_netfilter bridge rpcsec_gss_krb5 stp llc auth_rpcgss overlay nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm kmem device_dax dax_cxl irqbypass rapl wmi_bmof pcspkr dax_hmem cxl_mem ipmi_ssif cxl_port acpi_ipmi ipmi_si ipmi_devintf i2c_piix4 cxl_pci k10temp ipmi_msghandler cxl_acpi cxl_core acpi_cpufreq fuse zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme nvme_core ast tg3 i2c_algo_bit nvme_common sp5100_tco ccp wmi
[ 1028.434042] ---[ end trace 0000000000000000 ]---
[ 1028.434278] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.434518] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.435009] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.435251] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.435501] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.435744] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.435985] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.436220] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.436454] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.436683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.436910] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.437140] PKRU: 55555554
[ 1028.437375] note: daxmemtest[5292] exited with preempt_count 1



Test program:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main() {
    // Open the DAX device
    const char *device_path = "/dev/dax0.0"; // Replace with your DAX device path
    int dax_fd = open(device_path, O_RDWR);

    if (dax_fd < 0) {
        printf("Error: Unable to open DAX device: %s\n", strerror(errno));
        return 1;
    }
    printf("file opened\n");

    // Memory-map the DAX device
    size_t size = 1024*1024*2; // 2MB
    void *mapped_memory = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);

    if (mapped_memory == MAP_FAILED) {
        printf("Error: Unable to mmap DAX device: %s\n", strerror(errno));
        close(dax_fd);
        return 1;
    }
    printf("mmaped\n");

    ((char*)mapped_memory)[0] = 1;

/*
    // Lock the memory region using mlock
    int result = mlock(mapped_memory, size);

    if (result != 0) {
        printf("Error: Unable to lock memory using mlock: %s\n", strerror(errno));
        munmap(mapped_memory, size);
        close(dax_fd);
        return 1;
    }
    printf("mlocked\n");

    // Use the mapped_memory for your application

    // Remember to unlock the memory using munlock before unmapping it
    result = munlock(mapped_memory, size);
    if (result != 0) {
        printf("Error: Unable to unlock memory using munlock: %s\n", strerror(errno));
    }
    printf("munlocked\n");

    munmap(mapped_memory, size);
*/
    close(dax_fd);
    printf("success\n");
    return 0;
}



CXL topology at time of error:
[user@host0 ~]# ./cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":1,
    "dports":[
      {
        "dport":"pci0000:3f",
        "alias":"ACPI0016:00",
        "id":4
      }
    ],
    "endpoints:root0":[
      {
        "endpoint":"endpoint1",
        "host":"mem0",
        "depth":1,
        "memdev":{
          "memdev":"mem0",
          "ram_size":137438953472,
          "health":{
            "maintenance_needed":true,
            "performance_degraded":false,
            "hw_replacement_needed":false,
            "media_normal":false,
            "media_not_ready":false,
            "media_persistence_lost":true,
            "media_data_lost":false,
            "media_powerloss_persistence_loss":false,
            "media_shutdown_persistence_loss":false,
            "media_persistence_loss_imminent":false,
            "media_powerloss_data_loss":false,
            "media_shutdown_data_loss":false,
            "media_data_loss_imminent":false,
            "ext_life_used":"unknown",
            "ext_temperature":"normal",
            "ext_corrected_volatile":"normal",
            "ext_corrected_persistent":"normal",
            "life_used_percent":4,
            "temperature":0,
            "dirty_shutdowns":0,
            "volatile_errors":0,
            "pmem_errors":0
          },
          "alert_config":{
            "life_used_prog_warn_threshold_valid":false,
            "dev_over_temperature_prog_warn_threshold_valid":true,
            "dev_under_temperature_prog_warn_threshold_valid":false,
            "corrected_volatile_mem_err_prog_warn_threshold_valid":false,
            "corrected_pmem_err_prog_warn_threshold_valid":false,
            "life_used_prog_warn_threshold_writable":false,
            "dev_over_temperature_prog_warn_threshold_writable":true,
            "dev_under_temperature_prog_warn_threshold_writable":false,
            "corrected_volatile_mem_err_prog_warn_threshold_writable":false,
            "corrected_pmem_err_prog_warn_threshold_writable":false,
            "life_used_crit_alert_threshold":75,
            "life_used_prog_warn_threshold":25,
            "dev_over_temperature_crit_alert_threshold":150,
            "dev_under_temperature_crit_alert_threshold":65360,
            "dev_over_temperature_prog_warn_threshold":75,
            "dev_under_temperature_prog_warn_threshold":65472,
            "corrected_volatile_mem_err_prog_warn_threshold":16,
            "corrected_pmem_err_prog_warn_threshold":0
          },
          "serial":9947034750368612352,
          "host":"0000:3f:00.0",
          "partition_info":{
            "total_size":137438953472,
            "volatile_only_size":137438953472,
            "persistent_only_size":0,
            "partition_alignment_size":0
          }
        },
        "decoders:endpoint1":[
          {
            "decoder":"decoder1.0",
            "resource":70061654016,
            "size":137438953472,
            "interleave_ways":1,
            "region":"region0",
            "dpa_resource":0,
            "dpa_size":137438953472,
            "mode":"ram"
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":70061654016,
        "size":137438953472,
        "interleave_ways":1,
        "max_available_extent":0,
        "volatile_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:3f",
            "alias":"ACPI0016:00",
            "position":0,
            "id":4
          }
        ],
        "regions:decoder0.0":[
          {
            "region":"region0",
            "resource":70061654016,
            "size":137438953472,
            "type":"ram",
            "interleave_ways":1,
            "interleave_granularity":4096,
            "decode_state":"commit",
            "mappings":[
              {
                "position":0,
                "memdev":"mem0",
                "decoder":"decoder1.0"
              }
            ],
            "daxregion":{
              "id":0,
              "size":137438953472,
              "align":2097152,
              "devices":[
                {
                  "chardev":"dax0.0",
                  "size":137438953472,
                  "target_node":1,
                  "align":2097152,
                  "mode":"devdax"
                }
              ]
            }
          }
        ]
      }
    ]
  }
]



~Gregory

^ permalink raw reply	[flat|nested] 6+ messages in thread

* Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
  2023-04-12 18:43 [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check Gregory Price
@ 2023-04-13 11:39 ` Gregory Price
  2023-04-18  6:43   ` Dan Williams
  2023-04-18  6:35 ` Dan Williams
  1 sibling, 1 reply; 6+ messages in thread
From: Gregory Price @ 2023-04-13 11:39 UTC (permalink / raw)
  To: linux-cxl; +Cc: Dan Williams, Dave Jiang

On Wed, Apr 12, 2023 at 02:43:33PM -0400, Gregory Price wrote:
> 
> 
> I was looking to validate mlock-ability of various pages when CXL is in
> different states (numa, dax, etc), and I discovered a page_table_check
> BUG when accessing MemExp memory while a device is in daxdev mode.
> 
> this happens essentially on a fault of the first accessed page
> 
> int dax_fd = open(device_path, O_RDWR);
> void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> ((char*)mapped_memory)[0] = 1;
> 
> 
> Full details of my test here:
> 
> Step 1) Test that memory onlined in NUMA node works
> 
> [user@host0 ~]# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 0 size: 63892 MB
> node 0 free: 59622 MB
> node 1 cpus:
> node 1 size: 129024 MB
> node 1 free: 129024 MB
> node distances:
> node   0   1
>   0:  10  50
>   1:  255  10
> 
> 
> [user@host0 ~]# numactl --preferred=1 memhog 128G
> ... snip ...
> 
> Passes no problem, all memory is accessible and used.
> 
> 
> 
> Next, reconfigure the device to daxdev mode
> 
> 
> [user@host0 ~]# daxctl list
> [
>   {
>     "chardev":"dax0.0",
>     "size":137438953472,
>     "target_node":1,
>     "align":2097152,
>     "mode":"system-ram",
>     "online_memblocks":63,
>     "total_memblocks":63,
>     "movable":true
>   }
> ]


Follow-up: I was investigating why my dax region here only created 63
2GB memory blocks for a 128GB region, and the reason is a forced alignment
of dax devices against the CXL Fixed Memory Window.

[    0.000000] BIOS-e820: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
[    0.000000] BIOS-e820: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
[    0.000000] reserve setup_data: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
[    0.000000] reserve setup_data: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved


some debug prints i added

[   20.726483] dax cxl probe
[   20.727330] cxl_dax_region dax_region0: alloc_dax_region: start 1050000000 end 304fffffff
[   20.728405] Creating dev_dev
[   20.729033] dev_dax nr_range: 0
[   20.735481]  dax0.0: alloc range[0]: 0x0000001050000000:0x000000304fffffff

The memory backing this dax region gets squashed by this code:

+++ b/drivers/dax/kmem.c
static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
{
        struct dev_dax_range *dax_range = &dev_dax->ranges[i];
        struct range *range = &dax_range->range;

        /* memory-block align the hotplug range */
        r->start = ALIGN(range->start, memory_block_size_bytes());
        r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
        if (r->start >= r->end) {
                r->start = range->start;
                r->end = range->end;
                return -ENOSPC;
        }
        return 0;
}


and we end up with a mapping range of:

start=0x1080000000
end=0x2fffffffff


Why NUMA-mode works under these conditions without crashing the system
is escaping me at the moment, given that the page faulting system goes
through the same driver.  But my guess is that pfn-to-page mappings are
off in some way when placed in devdax mode, whereas they're correct
under numa mode.


Note that the above code chops off the first 768MB of the dax region and
the last 1.25GB of the dax region.

The CFMWS window is required to be 256MB aligned, but this code will force
anything mapped into that area to be 2GB aligned.  I don't think it's
safe to say the BIOS is wrong.


It seems like the dax region ranges are being tied to memory block size,
but that a raw devdax does not necessarily utilize memory blocks.  Is
there a potential bug in the mode-switching code?

~Gregory


* RE: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
  2023-04-12 18:43 [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check Gregory Price
  2023-04-13 11:39 ` Gregory Price
@ 2023-04-18  6:35 ` Dan Williams
  2023-04-20  1:29   ` Gregory Price
  1 sibling, 1 reply; 6+ messages in thread
From: Dan Williams @ 2023-04-18  6:35 UTC (permalink / raw)
  To: Gregory Price, linux-cxl; +Cc: Dan Williams, Dave Jiang

Gregory Price wrote:
> 
> 
> I was looking to validate mlock-ability of various pages when CXL is in
> different states (numa, dax, etc), and I discovered a page_table_check
> BUG when accessing MemExp memory while a device is in daxdev mode.
> 
> this happens essentially on a fault of the first accessed page
> 
> int dax_fd = open(device_path, O_RDWR);
> void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> ((char*)mapped_memory)[0] = 1;
> 
> 
> Full details of my test here:
> 
> Step 1) Test that memory onlined in NUMA node works
> 
> [user@host0 ~]# numactl --hardware
> available: 2 nodes (0-1)
> node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> node 0 size: 63892 MB
> node 0 free: 59622 MB
> node 1 cpus:
> node 1 size: 129024 MB
> node 1 free: 129024 MB
> node distances:
> node   0   1
>   0:  10  50
>   1:  255  10
> 
> 
> [user@host0 ~]# numactl --preferred=1 memhog 128G
> ... snip ...
> 
> Passes no problem, all memory is accessible and used.
> 
> 
> 
> Next, reconfigure the device to daxdev mode
> 
> 
> [user@host0 ~]# daxctl list
> [
>   {
>     "chardev":"dax0.0",
>     "size":137438953472,
>     "target_node":1,
>     "align":2097152,
>     "mode":"system-ram",
>     "online_memblocks":63,
>     "total_memblocks":63,
>     "movable":true
>   }
> ]
> [user@host0 ~]# daxctl offline-memory dax0.0
> offlined memory for 1 device
> [user@host0 ~]# daxctl reconfigure-device --human --mode=devdax dax0.0
> {
>   "chardev":"dax0.0",
>   "size":"128.00 GiB (137.44 GB)",
>   "target_node":1,
>   "align":2097152,
>   "mode":"devdax"
> }
> reconfigured 1 device
> [user@host0 mapping0]# daxctl list -M -u
> {
>   "chardev":"dax0.0",
>   "size":"128.00 GiB (137.44 GB)",
>   "target_node":1,
>   "align":2097152,
>   "mode":"devdax",
>   "mappings":[
>     {
>       "page_offset":"0",
>       "start":"0x1050000000",
>       "end":"0x304fffffff",
>       "size":"128.00 GiB (137.44 GB)"
>     }
>   ]
> }
> 
> 
> Now map and access the memory via /dev/dax0.0  (test program attached)
> 
> [ 1028.430734] kernel BUG at mm/page_table_check.c:53!

I have never tested DAX with CONFIG_PAGE_TABLE_CHECK=y, so I would need to
dig in further here. A quick test passes the unit tests, but the unit
tests don't have this "map dax after system-ram" scenario. Just for
completeness, does it behave without that debug option enabled?

[..] 
> 
> Test program:
> 
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <sys/stat.h>
> #include <fcntl.h>
> #include <unistd.h>
> #include <stdio.h>
> #include <stdlib.h>
> #include <errno.h>
> #include <string.h>
> 
> int main() {
>     // Open the DAX device
>     const char *device_path = "/dev/dax0.0"; // Replace with your DAX device path
>     int dax_fd = open(device_path, O_RDWR);
> 
>     if (dax_fd < 0) {
>         printf("Error: Unable to open DAX device: %s\n", strerror(errno));
>         return 1;
>     }
>     printf("file opened\n");
> 
>     // Memory-map the DAX device
>     size_t size = 1024*1024*2; // 2MB
>     void *mapped_memory = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> 
>     if (mapped_memory == MAP_FAILED) {
>         printf("Error: Unable to mmap DAX device: %s\n", strerror(errno));
>         close(dax_fd);
>         return 1;
>     }
>     printf("mmaped\n");
> 
>     ((char*)mapped_memory)[0] = 1;
> 
> /*

i.e. just touching the memory fails, no need to mlock it? This smells
more like the CONFIG_PAGE_TABLE_CHECK machinery is getting confused, but
I would have expected its metadata to be reset by the dax device
reconfiguration.


* Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
  2023-04-13 11:39 ` Gregory Price
@ 2023-04-18  6:43   ` Dan Williams
  2023-04-20  0:58     ` Gregory Price
  0 siblings, 1 reply; 6+ messages in thread
From: Dan Williams @ 2023-04-18  6:43 UTC (permalink / raw)
  To: Gregory Price, linux-cxl; +Cc: Dan Williams, Dave Jiang

Gregory Price wrote:
> On Wed, Apr 12, 2023 at 02:43:33PM -0400, Gregory Price wrote:
> > 
> > 
> > I was looking to validate mlock-ability of various pages when CXL is in
> > different states (numa, dax, etc), and I discovered a page_table_check
> > BUG when accessing MemExp memory while a device is in daxdev mode.
> > 
> > this happens essentially on a fault of the first accessed page
> > 
> > int dax_fd = open(device_path, O_RDWR);
> > void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
> > ((char*)mapped_memory)[0] = 1;
> > 
> > 
> > Full details of my test here:
> > 
> > Step 1) Test that memory onlined in NUMA node works
> > 
> > [user@host0 ~]# numactl --hardware
> > available: 2 nodes (0-1)
> > node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
> > node 0 size: 63892 MB
> > node 0 free: 59622 MB
> > node 1 cpus:
> > node 1 size: 129024 MB
> > node 1 free: 129024 MB
> > node distances:
> > node   0   1
> >   0:  10  50
> >   1:  255  10
> > 
> > 
> > [user@host0 ~]# numactl --preferred=1 memhog 128G
> > ... snip ...
> > 
> > Passes no problem, all memory is accessible and used.
> > 
> > 
> > 
> > Next, reconfigure the device to daxdev mode
> > 
> > 
> > [user@host0 ~]# daxctl list
> > [
> >   {
> >     "chardev":"dax0.0",
> >     "size":137438953472,
> >     "target_node":1,
> >     "align":2097152,
> >     "mode":"system-ram",
> >     "online_memblocks":63,
> >     "total_memblocks":63,
> >     "movable":true
> >   }
> > ]
> 
> 
> Follow-up: I was investigating why my dax region here only created 63
> 2GB memory blocks for a 128GB region, and the reason is a forced alignment
> of dax devices against the CXL Fixed Memory Window.
> 
> [    0.000000] BIOS-e820: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
> [    0.000000] BIOS-e820: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
> [    0.000000] reserve setup_data: [mem 0x0000001050000000-0x000000304fffffff] soft reserved
> [    0.000000] reserve setup_data: [mem 0x00003ffc00000000-0x00003ffc03ffffff] reserved
> 
> 
> some debug prints i added
> 
> [   20.726483] dax cxl probe
> [   20.727330] cxl_dax_region dax_region0: alloc_dax_region: start 1050000000 end 304fffffff
> [   20.728405] Creating dev_dev
> [   20.729033] dev_dax nr_range: 0
> [   20.735481]  dax0.0: alloc range[0]: 0x0000001050000000:0x000000304fffffff
> 
> The memory backing this dax region gets squashed by this code:
> 
> +++ b/drivers/dax/kmem.c
> static int dax_kmem_range(struct dev_dax *dev_dax, int i, struct range *r)
> {
>         struct dev_dax_range *dax_range = &dev_dax->ranges[i];
>         struct range *range = &dax_range->range;
> 
>         /* memory-block align the hotplug range */
>         r->start = ALIGN(range->start, memory_block_size_bytes());
>         r->end = ALIGN_DOWN(range->end + 1, memory_block_size_bytes()) - 1;
>         if (r->start >= r->end) {
>                 r->start = range->start;
>                 r->end = range->end;
>                 return -ENOSPC;
>         }
>         return 0;
> }
> 
> 
> and we end up with a mapping range of:
> 
> start=0x1080000000
> end=0x2fffffffff
> 
> 
> Why NUMA-mode works under these conditions without crashing the system
> is escaping me at the moment,

Why would it crash? That range is valid within
0x1050000000-0x304fffffff.

>  given that the page faulting system goes
> through the same driver.  But my guess is that pfn-to-page mappings are
> off in some way when placed in devdax mode, whereas they're correct
> under numa mode.

pfn-to-page is pretty simple; it's the pfn-to-page_ext mapping that's
concerning for CONFIG_PAGE_TABLE_CHECK.

> Note that the above code chops off the first 768MB of the dax region and
> the last 1.25GB of the dax region.

Yes, if the core-mm picks 2GB for the block size (which it does for
systems with more than 64GB of memory), then it will align hot-added
ranges.

> The CFMWS window is required to be 256MB aligned, but this code will force
> anything mapped into that area to be 2GB aligned.  I don't think it's
> safe to say the BIOS is wrong.

The *minimum* alignment of the CFMWS window is 256M, but if they don't
want to waste memory on Linux they had better make it 2GB aligned.

BIOS looks ok here.

> It seems like the dax region ranges are being tied to memory block size,
> but that a raw devdax does not necessarily utilize memory blocks.  Is
> there a potential bug in the mode-switching code?

No memory-blocks to worry about in dax-mode. Until evidence to the
contrary, I'm still looking for how CONFIG_PAGE_TABLE_CHECK might get
confused by DAX mode switches.


* Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
  2023-04-18  6:43   ` Dan Williams
@ 2023-04-20  0:58     ` Gregory Price
  0 siblings, 0 replies; 6+ messages in thread
From: Gregory Price @ 2023-04-20  0:58 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang

On Mon, Apr 17, 2023 at 11:43:27PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > Why NUMA-mode works under these conditions without crashing the system
> > is escaping me at the moment,
> 
> Why would it crash? That range is valid within
> 0x1050000000-0x304fffffff.
> 

Basically I was expecting a page fault in NUMA mode to produce the same
effects as a fault in DAX mode; clearly this is not the case, and
something about the switch from numa to dax is causing the difference.

> >  given that the page faulting system goes
> > through the same driver.  But my guess is that pfn-to-page mappings are
> > off in some way when placed in devdax mode, whereas they're correct
> > under numa mode.
> 
> pfn-to-page is pretty simple; it's the pfn-to-page_ext mapping that's
> concerning for CONFIG_PAGE_TABLE_CHECK.
> 

Testing CONFIG_PAGE_TABLE_CHECK=n now, will report back when done.

> > Note that the above code chops off the first 768MB of the dax region and
> > the last 1.25GB of the dax region.
> 
> Yes, if the core-mm picks 2GB for the block size (which it does for
> systems with more than 64GB of memory), then it will align hot-added
> ranges.
> 
> > The CFMWS window is required to be 256MB aligned, but this code will force
> > anything mapped into that area to be 2GB aligned.  I don't think it's
> > safe to say the BIOS is wrong.
> 
> The *minimum* alignment of the CFMWS window is 256M, but if they don't
> want to waste memory on Linux they had better make it 2GB aligned.
> 
> BIOS looks ok here.
>

FWIW, I have a QEMU instance with 64GB that puts CXL devices on 256MB
alignment as well, so QEMU instances over a certain amount of DRAM
produce the same effect as hardware: lost memory.

~Gregory


* Re: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
  2023-04-18  6:35 ` Dan Williams
@ 2023-04-20  1:29   ` Gregory Price
  0 siblings, 0 replies; 6+ messages in thread
From: Gregory Price @ 2023-04-20  1:29 UTC (permalink / raw)
  To: Dan Williams; +Cc: linux-cxl, Dave Jiang

On Mon, Apr 17, 2023 at 11:35:15PM -0700, Dan Williams wrote:
> Gregory Price wrote:
> > Now map and access the memory via /dev/dax0.0  (test program attached)
> > 
> > [ 1028.430734] kernel BUG at mm/page_table_check.c:53!
> 
> I have never tested DAX with CONFIG_PAGE_TABLE_CHECK=y, so I would need to
> dig in further here. A quick test passes the unit tests, but the unit
> tests don't have this "map dax after system-ram" scenario. Just for
> completeness, does it behave without that debug option enabled?
> 

Confirmed: it passes without issues when this debug option is disabled.
Also confirmed on production hardware with a release build where this
check is disabled.

So something is up with the page table check code when going from numa
to dax.

> 
> i.e. just touching the memory fails, no need to mlock it? This smells
> more like the CONFIG_PAGE_TABLE_CHECK machinery is getting confused, but
> I would have expected its metadata to be reset by the dax device
> reconfiguration.

Yes, just touching it faults, without mlocking it.

I dug in and the page_ext for the page is NULL, which is what causes the
BUG().  I don't know the subsystem well enough to know why converting to
dax would cause the page_ext to be NULL.
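
For reference, the BUG() being hit is the page_ext sanity check at the top
of the lookup helper in mm/page_table_check.c; quoting v6.3 from memory
(verify against an actual tree):

static struct page_table_check *get_page_table_check(struct page_ext *page_ext)
{
        BUG_ON(!page_ext);
        return (void *)(page_ext) + page_table_check_ops.offset;
}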

The reason this got conflated with the other hardware/firmware/BIOS
issues is that I was thinking the memory-block alignment issue may have
been part of the problem, but clearly that's not the case.

~Gregory

