Linux-CXL Archive mirror
From: Terry Bowman <Terry.Bowman@amd.com>
To: Jonathan Cameron <Jonathan.Cameron@Huawei.com>,
	Yuquan Wang <wangyuquan1236@phytium.com.cn>
Cc: linux-cxl@vger.kernel.org, qemu-devel@nongnu.org,
	Robert Richter <rrichter@amd.com>,
	dan.williams@intel.com
Subject: Re: Enabling internal errors for VH CXL devices: [was: Re: Questions about CXL RAS injection test in qemu]
Date: Wed, 6 Mar 2024 13:06:42 -0600
Message-ID: <bd47d1b1-43a0-4c96-af91-2a3ef898a9cb@amd.com>
In-Reply-To: <6447f2fd-8594-454f-b0ca-f12ad1947e10@amd.com>

Hi Yuquan,

For your test, the first log output should come from the AER driver if
everything is working correctly.

You may want to check whether the upstream PCI bridge's AER UIE/CIE
masks are set. That could prevent the error from being handled by the OS's
AER driver.
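
As a rough aid for that check, here is a minimal shell sketch that decodes
a raw mask value as printed by setpci. It assumes the standard PCIe AER
layout (Uncorrectable Error Mask at AER+0x08 with UIE at bit 22, Correctable
Error Mask at AER+0x14 with CIE at bit 14). The helper names aer_bit_state
and aer_clear_bit are made up for illustration and are not part of pciutils.

```shell
#!/bin/sh
# Assumed bit positions, per the PCIe AER capability layout.
UIE_BIT=22   # Uncorrectable Internal Error, in the mask at AER+0x08
CIE_BIT=14   # Corrected Internal Error, in the mask at AER+0x14

# aer_bit_state VALUE BIT: print "masked" or "unmasked" for one mask bit.
# VALUE is the register contents in hex, with or without a 0x prefix
# (setpci prints it without one).
aer_bit_state() {
    val=$(( 0x${1#0x} ))
    if [ $(( (val >> $2) & 1 )) -eq 1 ]; then
        echo masked
    else
        echo unmasked
    fi
}

# aer_clear_bit VALUE BIT: print VALUE with one bit cleared, as 8 hex digits.
# Hypothetical helper for unmasking a single error type instead of
# writing 0 to the whole mask register.
aer_clear_bit() {
    val=$(( 0x${1#0x} ))
    printf '%08x\n' $(( val & ~(1 << $2) ))
}
```

With the endpoint values quoted below (02400000 at 0x208 and 0000e000 at
0x214), aer_bit_state reports both UIE and CIE as masked. On a live system
something like aer_bit_state "$(setpci -s 0d:00.0 0x208.l)" "$UIE_BIT" should
work for the endpoint. For the upstream bridge, the AER capability offset
has to be read from lspci -vvv first, since it may not be 0x200.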

Regards,
Terry

On 3/6/24 11:12, Terry Bowman wrote:
> Hi Yuquan and Jon,
> 
> I added responses inline below.
> 
> On 3/6/24 07:23, Jonathan Cameron wrote:
>> On Wed, 6 Mar 2024 19:27:07 +0800
>> Yuquan Wang <wangyuquan1236@phytium.com.cn> wrote:
>>
>>> Hello, Jonathan
>>>
>>> Recently I ran into some problems with CXL RAS tests.
>>>
>>> I tried to use the "cxl-inject-uncorrectable-errors" and "cxl-inject-correctable-error"
>>> QMP commands to inject CXL errors; however, no kernel messages were printed in
>>> my QEMU machine. The QMP connection was also unstable, and the machine
>>> always ended up "terminating on signal 2".
>>
>> The qmp connection being unstable is odd - might be related to the CXL code, but
>> I'm not sure how..
>>
>>>
>>> In addition, I successfully used the HMP command "pcie_aer_inject_error" under the same conditions.
>>> The kernel printed the relevant messages.
>>
>> IIRC the AER paths print under all circumstances whereas CXL errors do not, they simply
>> trigger tracepoints - but you should have seen device resets.
>>
>> However, I spun up a test and I think the issue is more straightforward.
>> The uncorrectable internal error and correctable internal error are masked on the device.
>> I thought we changed the default on this in Linux but maybe not :(
>>
> 
> The device AER UIE/CIE masks can be set and device AER errors can still be expected to be handled.
> The device reports AER UIE/CIE to the root port/RCEC on behalf of device AER CRC, TLP, etc. errors.
>
> In earlier changes we added logic to clear the RCEC UIE/CIE mask in order to properly receive
> AER UIE/CIE notifications from devices and RCH dports.
> 
> "CXL Protocol and Link errors detected by components that are part of a CXL VH are
> escalated and reported using standard PCIe error reporting mechanisms over CXL.io as
> UIEs and/or CIEs. See PCIe Base Specification for details."[1]
> 
> [1] CXL3.1 12.2.1 - Protocol and Link Layer Error Reporting
> 
>> The hack is to find the relevant device with lspci -tv and then use
>> setpci -s 0d:00.0 0x208.l=0
>> to clear all the mask bits for uncorrectable errors.
>>
>> Note I tested this on a convenient arm64 setup so always possible there is yet
>> another problem on x86.
>>
>> Robert / Terry, I tracked down the patch where you enabled this for RCHs and there was
>> some discussion on walking out on VH as well to enable this, but seems it
>> never happened. Can you remember why?  Just kicked back for a future occasion?
>>
>> Jonathan
>>
>>
> 
> I tested (qemu x86) using the aer-inject tool and found it to work. Below shows the 
> endpoint CIE is masked (0xe000 @ AER+0x14) and the injected error is properly handled
> with root port logging and cxl_pci handler trace logs.
> 
>     # lspci | grep -i cxl
>     0d:00.0 CXL: Intel Corporation Device 0d93 (rev 01)
>
>     # lspci -s 0d:00.0 -vvv | grep Advanced
>     Capabilities: [200 v2] Advanced Error Reporting
>
>     # setpci -s 0d:00.0 0x208.l
>     02400000
>
>     # setpci -s 0d:00.0 0x214.l
>     0000e000
>
>     # cat aer-input.txt
>     # Inject a correctable bad TLP error into the device with header log
>     # words 0 1 2 3.
>     #
>     # Either specify the PCI id on the command-line option or uncomment and edit
>     # the PCI_ID line below using the correct PCI ID.
>     #
>     # Note that system firmware/BIOS may mask certain errors and/or not report
>     # header log words.
>     #
>     AER
>     #PCI_ID 0000:0C.00.0
>     COR_STATUS BAD_TLP
>     HEADER_LOG 0 1 2 3
>
>     # ./aer-inject -s 0000:0d:00.0 aer-input.txt
>     [   72.850686] pcieport 0000:0c:00.0: aer_inject: Injecting errors 00000040/00000000 into device 0000:0d:00.0
>     [   72.851784] pcieport 0000:0c:00.0: AER: Corrected error received: 0000:0d:00.0
>     [   72.852594] cxl_pci 0000:0d:00.0: PCIe Bus Error: severity=Corrected, type=Data Link Layer, (Receiver ID)
>     [   72.853591] cxl_pci 0000:0d:00.0:   device [8086:0d93] error status/mask=00000040/0000e000
>     # [   72.854277] cxl_pci 0000:0d:00.0:    [ 6] BadTLP
> 
> I have not tried to use cxl-inject-uncorrectable-errors or cxl-inject-correctable-error.
> 
> Regards,
> Terry
> 
>>>
>>> Question:
>>> 1) Are my CXL RAS test operations standard?
>>> 2) Is the error injected by "pcie_aer_inject_error" a "protocol & link error" of cxl.io?
>>>    Is the error injected by "cxl-inject-uncorrectable-errors" or "cxl-inject-correctable-error" a "protocol & link error" of cxl.cachemem?
>>>
>>> I hope I can get some help here; any help will be greatly appreciated.
>>>
>>>
>>> My qemu command line:
>>> qemu-system-x86_64 \
>>> -M q35,nvdimm=on,cxl=on \
>>> -m 4G \
>>> -smp 4 \
>>> -object memory-backend-ram,size=2G,id=mem0 \
>>> -numa node,nodeid=0,cpus=0-1,memdev=mem0 \
>>> -object memory-backend-ram,size=2G,id=mem1 \
>>> -numa node,nodeid=1,cpus=2-3,memdev=mem1 \
>>> -object memory-backend-ram,size=256M,id=cxl-mem0 \
>>> -device pxb-cxl,bus_nr=12,bus=pcie.0,id=cxl.1 \
>>> -device cxl-rp,port=0,bus=cxl.1,id=root_port0,chassis=0,slot=0 \
>>> -device cxl-type3,bus=root_port0,volatile-memdev=cxl-mem0,id=cxl-mem0 \
>>> -M cxl-fmw.0.targets.0=cxl.1,cxl-fmw.0.size=4G,cxl-fmw.0.interleave-granularity=4k \
>>> -hda ../disk/ubuntu_x86_test_new.qcow2 \
>>> -nographic \
>>> -qmp tcp:127.0.0.1:4444,server,nowait
>>>
>>> QEMU version: 8.2.50, the latest commit of branch cxl-2024-03-05 in "https://gitlab.com/jic23/qemu"
>>> Kernel version: 6.8.0-rc6
>>>
>>> My steps in the Qemu qmp:
>>> 1) telnet 127.0.0.1 4444
>>>
>>> result:
>>> Trying 127.0.0.1...
>>> Connected to 127.0.0.1.
>>> Escape character is '^]'.
>>> {"QMP": {"version": {"qemu": {"micro": 50, "minor": 2, "major": 8}, "package": "v6.2.0-19482-gccfb4fe221"}, "capabilities": ["oob"]}}
>>>
>>> 2) { "execute": "qmp_capabilities" }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If injecting a correctable error:
>>> { "execute": "cxl-inject-correctable-error",
>>>     "arguments": {
>>>         "path": "/machine/peripheral/cxl-mem0",
>>>         "type": "physical"
>>>     } }
>>>
>>> result:
>>> {"return": {}}
>>>
>>> 3) If injecting uncorrectable errors:
>>> { "execute": "cxl-inject-uncorrectable-errors",
>>>   "arguments": {
>>>     "path": "/machine/peripheral/cxl-mem0",
>>>     "errors": [
>>>         {
>>>             "type": "cache-address-parity",
>>>             "header": [ 3, 4]
>>>         },
>>>         {
>>>             "type": "cache-data-parity",
>>>             "header": [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31]
>>>         },
>>>         {
>>>             "type": "internal",
>>>             "header": [ 1, 2, 4]
>>>         }
>>>         ]
>>>   }}
>>>
>>> result:
>>> {"return": {}}
>>> {"timestamp": {"seconds": 1709721640, "microseconds": 275345}, "event": "SHUTDOWN", "data": {"guest": false, "reason": "host-signal"}}
>>>
>>> Many thanks
>>> Yuquan
>>>
>>
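
For repeatable runs it may be easier to script the QMP session than to type
it into telnet. A minimal sketch, assuming a POSIX shell; the function name
qmp_cor_error is hypothetical, and only the command and argument names
already shown in the quoted steps above are used.

```shell
#!/bin/sh
# qmp_cor_error PATH TYPE: print the QMP request for
# cxl-inject-correctable-error as a single JSON line.
# PATH and TYPE are inserted verbatim, so they must not contain
# characters that need JSON escaping.
qmp_cor_error() {
    printf '{ "execute": "cxl-inject-correctable-error", "arguments": { "path": "%s", "type": "%s" } }\n' "$1" "$2"
}

# One possible way to send it (assumes nc is installed and the QMP socket
# from the quoted QEMU command line; qmp_capabilities must be sent first):
#
#   { echo '{ "execute": "qmp_capabilities" }'
#     qmp_cor_error /machine/peripheral/cxl-mem0 physical
#     sleep 1; } | nc 127.0.0.1 4444
```

Keeping the request on one line avoids depending on how the QMP client
buffers multi-line input.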
