From: Gregory Price <gregory.price@memverge.com>
To: linux-cxl@vger.kernel.org
Cc: Dan Williams <dan.j.williams@intel.com>,
	Dave Jiang <dave.jiang@intel.com>
Subject: [BUG] DAX access of Memory Expander on RCH topology fires BUG on page_table_check
Date: Wed, 12 Apr 2023 14:43:33 -0400
Message-ID: <ZDb71ZXGtzz0ttQT@memverge.com>



I was looking to validate the mlock-ability of various pages while CXL
memory is in different states (NUMA node, devdax, etc.), and I discovered
a page_table_check BUG when accessing memory expander memory while the
device is in devdax mode.

This happens on the fault of the first page accessed:

int dax_fd = open(device_path, O_RDWR);
void *mapped_memory = mmap(NULL, (1024*1024*2), PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);
((char*)mapped_memory)[0] = 1;


Full details of my test here:

Step 1) Test that memory onlined in NUMA node works

[user@host0 ~]# numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 0 size: 63892 MB
node 0 free: 59622 MB
node 1 cpus:
node 1 size: 129024 MB
node 1 free: 129024 MB
node distances:
node   0   1
  0:  10  50
  1:  255  10


[user@host0 ~]# numactl --preferred=1 memhog 128G
... snip ...

This passes with no problem; all memory is accessible and used.



Next, reconfigure the device to devdax mode:


[user@host0 ~]# daxctl list
[
  {
    "chardev":"dax0.0",
    "size":137438953472,
    "target_node":1,
    "align":2097152,
    "mode":"system-ram",
    "online_memblocks":63,
    "total_memblocks":63,
    "movable":true
  }
]
[user@host0 ~]# daxctl offline-memory dax0.0
offlined memory for 1 device
[user@host0 ~]# daxctl reconfigure-device --human --mode=devdax dax0.0
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax"
}
reconfigured 1 device
[user@host0 mapping0]# daxctl list -M -u
{
  "chardev":"dax0.0",
  "size":"128.00 GiB (137.44 GB)",
  "target_node":1,
  "align":2097152,
  "mode":"devdax",
  "mappings":[
    {
      "page_offset":"0",
      "start":"0x1050000000",
      "end":"0x304fffffff",
      "size":"128.00 GiB (137.44 GB)"
    }
  ]
}


Now map and access the memory via /dev/dax0.0 (test program attached below):

[ 1028.430734] kernel BUG at mm/page_table_check.c:53!
[ 1028.430753] invalid opcode: 0000 [#1] PREEMPT SMP NOPTI
[ 1028.430763] CPU: 14 PID: 5292 Comm: daxmemtest Not tainted 6.3.0-rc6-dirty #22
[ 1028.430774] Hardware name: AMD Corporation ONYX/ONYX, BIOS ROX1006C 03/01/2023
[ 1028.430785] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.430798] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.430820] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.430829] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.430838] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.430847] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.430857] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.430867] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.430876] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.430887] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.430895] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.430905] PKRU: 55555554
[ 1028.430909] Call Trace:
[ 1028.430914]  <TASK>
[ 1028.430919]  vmf_insert_pfn_pmd_prot+0x2b4/0x360
[ 1028.430929]  dev_dax_huge_fault+0x181/0x400 [device_dax]
[ 1028.430941]  __handle_mm_fault+0x806/0xfe0
[ 1028.430951]  handle_mm_fault+0x189/0x460
[ 1028.430958]  do_user_addr_fault+0x1e0/0x730
[ 1028.430968]  exc_page_fault+0x7e/0x200
[ 1028.430977]  asm_exc_page_fault+0x22/0x30
[ 1028.430984] RIP: 0033:0x401262
[ 1028.430990] Code: 00 b8 00 00 00 00 e8 1d fe ff ff 8b 45 f4 89 c7 e8 23 fe ff ff b8 01 00 00 00 eb 2a bf 7e 20 40 00 e8 e2 fd ff ff 48 8b 45 e0 <c6> 00 01 8b 45 f4 89 c7 e8 01 fe ff ff bf 85 20 40 00 e8 c7 fd ff
[ 1028.431011] RSP: 002b:00007fff888d8fc0 EFLAGS: 00010206
[ 1028.431019] RAX: 00007f0a65a00000 RBX: 00007fff888d90f8 RCX: 00007f0a65f01c37
[ 1028.431242] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 00007f0a65ffaa70
[ 1028.431446] RBP: 00007fff888d8fe0 R08: 0000000000000003 R09: 0000000000000000
[ 1028.431651] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000000
[ 1028.431852] R13: 00007fff888d9108 R14: 0000000000403e18 R15: 00007f0a6610d000
[ 1028.432053]  </TASK>
[ 1028.432247] Modules linked in: xt_conntrack xt_MASQUERADE nf_conntrack_netlink xt_addrtype nft_compat br_netfilter bridge rpcsec_gss_krb5 stp llc auth_rpcgss overlay nfsv4 dns_resolver nfs lockd grace fscache netfs nft_fib_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 rfkill ip_set nf_tables nfnetlink qrtr sunrpc vfat fat intel_rapl_msr intel_rapl_common amd64_edac edac_mce_amd kvm_amd kvm kmem device_dax dax_cxl irqbypass rapl wmi_bmof pcspkr dax_hmem cxl_mem ipmi_ssif cxl_port acpi_ipmi ipmi_si ipmi_devintf i2c_piix4 cxl_pci k10temp ipmi_msghandler cxl_acpi cxl_core acpi_cpufreq fuse zram xfs crct10dif_pclmul crc32_pclmul crc32c_intel polyval_clmulni polyval_generic ghash_clmulni_intel sha512_ssse3 nvme nvme_core ast tg3 i2c_algo_bit nvme_common sp5100_tco ccp wmi
[ 1028.434042] ---[ end trace 0000000000000000 ]---
[ 1028.434278] RIP: 0010:page_table_check_set.part.0+0x89/0xf0
[ 1028.434518] Code: 75 65 44 89 c2 f0 0f c1 10 83 c2 01 83 fa 01 7e 04 84 db 75 6d 48 83 c1 01 48 03 3d 21 09 52 05 4c 39 e1 74 52 48 85 ff 75 c2 <0f> 0b 8b 10 85 d2 75 37 44 89 c2 f0 0f c1 50 04 83 c2 01 79 d6 0f
[ 1028.435009] RSP: 0000:ff6120b001417d30 EFLAGS: 00010246
[ 1028.435251] RAX: ff115d297d82b128 RBX: 0000000000000001 RCX: 00000000001e0414
[ 1028.435501] RDX: 0000000000000003 RSI: 0000000000000000 RDI: 0000000000000000
[ 1028.435744] RBP: fff3676981400000 R08: 0000000000000002 R09: 0000000032458015
[ 1028.435985] R10: 0000000000000001 R11: 000000001ad9d129 R12: 0000000000000200
[ 1028.436220] R13: fff3676981400000 R14: 84000010500008e7 R15: 0000000000000001
[ 1028.436454] FS:  00007f0a660c4740(0000) GS:ff115d37c1200000(0000) knlGS:0000000000000000
[ 1028.436683] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1028.436910] CR2: 00007f0a65a00000 CR3: 00000001ab212005 CR4: 0000000000771ee0
[ 1028.437140] PKRU: 55555554
[ 1028.437375] note: daxmemtest[5292] exited with preempt_count 1



Test program:

#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>
#include <errno.h>
#include <string.h>

int main() {
    // Open the DAX device
    const char *device_path = "/dev/dax0.0"; // Replace with your DAX device path
    int dax_fd = open(device_path, O_RDWR);

    if (dax_fd < 0) {
        printf("Error: Unable to open DAX device: %s\n", strerror(errno));
        return 1;
    }
    printf("file opened\n");

    // Memory-map the DAX device
    size_t size = 1024*1024*2; // 2MB
    void *mapped_memory = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_SHARED, dax_fd, 0);

    if (mapped_memory == MAP_FAILED) {
        printf("Error: Unable to mmap DAX device: %s\n", strerror(errno));
        close(dax_fd);
        return 1;
    }
    printf("mmaped\n");

    ((char*)mapped_memory)[0] = 1;

/*
    // Lock the memory region using mlock
    int result = mlock(mapped_memory, size);

    if (result != 0) {
        printf("Error: Unable to lock memory using mlock: %s\n", strerror(errno));
        munmap(mapped_memory, size);
        close(dax_fd);
        return 1;
    }
    printf("mlocked\n");

    // Use the mapped_memory for your application

    // Remember to unlock the memory using munlock before unmapping it
    result = munlock(mapped_memory, size);
    if (result != 0) {
        printf("Error: Unable to unlock memory using munlock: %s\n", strerror(errno));
    }
    printf("munlocked\n");

    munmap(mapped_memory, size);
*/
    munmap(mapped_memory, size);
    close(dax_fd);
    printf("success\n");
    return 0;
}



CXL topology at time of error:
[user@host0 ~]# ./cxl list -vvvv
[
  {
    "bus":"root0",
    "provider":"ACPI.CXL",
    "nr_dports":1,
    "dports":[
      {
        "dport":"pci0000:3f",
        "alias":"ACPI0016:00",
        "id":4
      }
    ],
    "endpoints:root0":[
      {
        "endpoint":"endpoint1",
        "host":"mem0",
        "depth":1,
        "memdev":{
          "memdev":"mem0",
          "ram_size":137438953472,
          "health":{
            "maintenance_needed":true,
            "performance_degraded":false,
            "hw_replacement_needed":false,
            "media_normal":false,
            "media_not_ready":false,
            "media_persistence_lost":true,
            "media_data_lost":false,
            "media_powerloss_persistence_loss":false,
            "media_shutdown_persistence_loss":false,
            "media_persistence_loss_imminent":false,
            "media_powerloss_data_loss":false,
            "media_shutdown_data_loss":false,
            "media_data_loss_imminent":false,
            "ext_life_used":"unknown",
            "ext_temperature":"normal",
            "ext_corrected_volatile":"normal",
            "ext_corrected_persistent":"normal",
            "life_used_percent":4,
            "temperature":0,
            "dirty_shutdowns":0,
            "volatile_errors":0,
            "pmem_errors":0
          },
          "alert_config":{
            "life_used_prog_warn_threshold_valid":false,
            "dev_over_temperature_prog_warn_threshold_valid":true,
            "dev_under_temperature_prog_warn_threshold_valid":false,
            "corrected_volatile_mem_err_prog_warn_threshold_valid":false,
            "corrected_pmem_err_prog_warn_threshold_valid":false,
            "life_used_prog_warn_threshold_writable":false,
            "dev_over_temperature_prog_warn_threshold_writable":true,
            "dev_under_temperature_prog_warn_threshold_writable":false,
            "corrected_volatile_mem_err_prog_warn_threshold_writable":false,
            "corrected_pmem_err_prog_warn_threshold_writable":false,
            "life_used_crit_alert_threshold":75,
            "life_used_prog_warn_threshold":25,
            "dev_over_temperature_crit_alert_threshold":150,
            "dev_under_temperature_crit_alert_threshold":65360,
            "dev_over_temperature_prog_warn_threshold":75,
            "dev_under_temperature_prog_warn_threshold":65472,
            "corrected_volatile_mem_err_prog_warn_threshold":16,
            "corrected_pmem_err_prog_warn_threshold":0
          },
          "serial":9947034750368612352,
          "host":"0000:3f:00.0",
          "partition_info":{
            "total_size":137438953472,
            "volatile_only_size":137438953472,
            "persistent_only_size":0,
            "partition_alignment_size":0
          }
        },
        "decoders:endpoint1":[
          {
            "decoder":"decoder1.0",
            "resource":70061654016,
            "size":137438953472,
            "interleave_ways":1,
            "region":"region0",
            "dpa_resource":0,
            "dpa_size":137438953472,
            "mode":"ram"
          }
        ]
      }
    ],
    "decoders:root0":[
      {
        "decoder":"decoder0.0",
        "resource":70061654016,
        "size":137438953472,
        "interleave_ways":1,
        "max_available_extent":0,
        "volatile_capable":true,
        "nr_targets":1,
        "targets":[
          {
            "target":"pci0000:3f",
            "alias":"ACPI0016:00",
            "position":0,
            "id":4
          }
        ],
        "regions:decoder0.0":[
          {
            "region":"region0",
            "resource":70061654016,
            "size":137438953472,
            "type":"ram",
            "interleave_ways":1,
            "interleave_granularity":4096,
            "decode_state":"commit",
            "mappings":[
              {
                "position":0,
                "memdev":"mem0",
                "decoder":"decoder1.0"
              }
            ],
            "daxregion":{
              "id":0,
              "size":137438953472,
              "align":2097152,
              "devices":[
                {
                  "chardev":"dax0.0",
                  "size":137438953472,
                  "target_node":1,
                  "align":2097152,
                  "mode":"devdax"
                }
              ]
            }
          }
        ]
      }
    ]
  }
]



~Gregory

