From: Nilay Shroff <nilay@linux.ibm.com>
To: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Sagi Grimberg <sagi@grimberg.me>, "axboe@fb.com" <axboe@fb.com>,
	Gregory Joyce <gjoyce@ibm.com>,
	Srimannarayana Murthy Maram <msmurthy@imap.linux.ibm.com>
Subject: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
Date: Thu, 18 Apr 2024 18:22:08 +0530
Message-ID: <199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com>

Hi,

We found that the nvme driver hangs when disk IO is in progress and we inject a PCIe error and then hot-unplug (logically, not physically) the nvme disk.

Notes and observations:
====================== 
This was observed on the latest Linus kernel tree (v6.9-rc4); however, we believe this issue is also present on older kernels.

Test details:
=============
Steps to reproduce this issue:

1. Run some disk IO using fio or any other tool
2. While disk IO is running, inject a PCIe error
3. Disable the slot to which the nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
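For reference, the rough command sequence we use is shown below. The fio invocation is the same one
visible in the process listing later in this report; the PCIe error-injection step is platform
specific (for example, the aer_inject facility on x86 or the RTAS-based errinjct tool on pSeries),
so it is only indicated as a placeholder here:

# fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync \
      --iodepth=256 --runtime=300 --numjobs=1 --time_based &
# <inject a PCIe error targeting the nvme device using the platform's error-injection tool>
# echo 0 > /sys/bus/pci/slots/<slot-no>/power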

Kernel Logs:
============
When we follow the steps described in the test details, we get the logs below:

[  295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
[  295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[  490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
[  490.381614]       Not tainted 6.9.0-rc4+ #8
[  490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  490.381623] task:bash            state:D stack:0     pid:2510  tgid:2510  ppid:2509   flags:0x00042080
[  490.381632] Call Trace:
[  490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
[  490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[  490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
[  490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
[  490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[  490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
[  490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[  490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[  490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
[  490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[  490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
[  490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[  490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[  490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[  490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[  490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
[  490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
[  490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[  490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[  490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[  490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[  490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
[  490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
[  490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

NVMe controller state:
======================
# cat /sys/class/nvme/nvme1/state 
deleting (no IO)

Process State:
==============
# ps -aex 
   [..]
   2510 pts/2    Ds+    0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
   2549 ?        Ds     0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based 
   [..]

Observation:
============
As is apparent from the above logs, the "disable-slot" task (pid 2510) is waiting in uninterruptible
sleep for the request queue to be frozen, because the in-flight IO(s) could not finish. The in-flight
IO(s) do time out, but nvme_timeout() does not cancel them; it merely logs the error
"Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
As those in-flight IOs were never cancelled, the NVMe driver code running in the context of
"disable-slot" cannot make forward progress, and the NVMe controller state remains "deleting (no IO)"
indefinitely. The only way we found to get out of this state is to reboot the system.
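For context, below is a simplified, annotated sketch of the relevant timeout path in
drivers/nvme/host/pci.c as we read it on v6.9-rc4 (a paraphrase to illustrate the flow, not the
exact source):

static enum blk_eh_timer_return nvme_timeout(struct request *req)
{
	[..]
	u32 csts = readl(dev->bar + NVME_REG_CSTS);

	/* CSTS reads as 0xffffffff after the injected error */
	if (nvme_should_reset(dev, csts)) {
		nvme_warn_reset(dev, csts);	/* logs the "controller is down" messages above */
		goto disable;
	}
	[..]
disable:
	/*
	 * The controller is already in NVME_CTRL_DELETING (set by nvme_remove()),
	 * so the transition to RESETTING fails and we return without disabling
	 * the controller or cancelling the timed-out request.
	 */
	if (!nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_RESETTING))
		return BLK_EH_DONE;

	nvme_dev_disable(dev, false);
	[..]
	return BLK_EH_DONE;
}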

Proposed fix:
=============
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev)) {
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
		nvme_dev_disable(dev, true);
	}
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl); <== here the ctrl state is set to "deleting (no IO)"
	[..]
}

As shown above, nvme_remove() invokes nvme_dev_disable(), however it does so only when the device is
no longer present (i.e. pci_device_is_present() fails). As nvme_dev_disable() helps cancel pending
IOs, does it make sense to unconditionally cancel pending IOs before moving on? Or are there any side
effects if we were to invoke nvme_dev_disable() unconditionally here?
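To make the question concrete, the change we have in mind would look roughly as follows. This is an
untested sketch shown only to illustrate the question; the graceful-removal path (device still
present) may well need different handling:

static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev))
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);

	/*
	 * Disable the controller unconditionally so that any in-flight IO is
	 * cancelled before nvme_remove_namespaces() -> del_gendisk() waits for
	 * the request queues to freeze.
	 */
	nvme_dev_disable(dev, true);

	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);
	[..]
}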

Thanks,
--Nilay

