From: Nilay Shroff <nilay@linux.ibm.com>
To: "linux-nvme@lists.infradead.org" <linux-nvme@lists.infradead.org>
Cc: Keith Busch <kbusch@kernel.org>, Christoph Hellwig <hch@lst.de>,
	Sagi Grimberg <sagi@grimberg.me>, "axboe@fb.com" <axboe@fb.com>,
	Gregory Joyce <gjoyce@ibm.com>,
	Srimannarayana Murthy Maram <msmurthy@imap.linux.ibm.com>
Subject: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
Date: Thu, 18 Apr 2024 18:22:08 +0530
Message-ID: <199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com>

Hi,

We found that the nvme driver hangs when disk IO is ongoing, a PCIe error is injected, and the nvme disk is then hot-unplugged (logically, not physically).

Notes and observations:
====================== 
This is observed on the latest Linus kernel tree (v6.9-rc4); however, we believe the issue is also present on older kernels.

Test details:
=============
Steps to reproduce this issue (a concrete command sketch follows the list):

1. Run some disk IO using fio or any other tool
2. While disk IO is running, inject a PCIe error
3. Disable the slot where the nvme disk is attached (echo 0 > /sys/bus/pci/slots/<slot-no>/power)
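
For reference, the concrete command sequence looks roughly like the following. The fio invocation is the
one visible in the process listing further below; the PCIe error-injection step is platform specific, so
only a placeholder is shown for it:

# fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync \
      --iodepth=256 --runtime=300 --numjobs=1 --time_based &
# <inject a PCIe error against the nvme device using the platform's error-injection facility>
# echo 0 > /sys/bus/pci/slots/<slot-no>/power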

Kernel Logs:
============
When we follow the steps described above, we get the following logs:

[  295.240811] nvme nvme1: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
[  295.240837] nvme nvme1: Does your device have a faulty power saving mode enabled?
[  295.240845] nvme nvme1: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
[  490.381591] INFO: task bash:2510 blocked for more than 122 seconds.
[  490.381614]       Not tainted 6.9.0-rc4+ #8
[  490.381618] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[  490.381623] task:bash            state:D stack:0     pid:2510  tgid:2510  ppid:2509   flags:0x00042080
[  490.381632] Call Trace:
[  490.381635] [c00000006748f510] [c00000006748f550] 0xc00000006748f550 (unreliable)
[  490.381644] [c00000006748f6c0] [c00000000001f3fc] __switch_to+0x13c/0x220
[  490.381654] [c00000006748f720] [c000000000fb87e0] __schedule+0x268/0x7c4
[  490.381663] [c00000006748f7f0] [c000000000fb8d7c] schedule+0x40/0x108
[  490.381669] [c00000006748f860] [c000000000808bb4] blk_mq_freeze_queue_wait+0xa4/0xec
[  490.381676] [c00000006748f8c0] [c00000000081eba8] del_gendisk+0x284/0x464
[  490.381683] [c00000006748f920] [c0080000064c74a4] nvme_ns_remove+0x138/0x2ac [nvme_core]
[  490.381697] [c00000006748f960] [c0080000064c7704] nvme_remove_namespaces+0xec/0x198 [nvme_core]
[  490.381710] [c00000006748f9d0] [c008000006704b70] nvme_remove+0x80/0x168 [nvme]
[  490.381752] [c00000006748fa10] [c00000000092a10c] pci_device_remove+0x6c/0x110
[  490.381776] [c00000006748fa50] [c000000000a4f504] device_remove+0x70/0xc4
[  490.381786] [c00000006748fa80] [c000000000a515d8] device_release_driver_internal+0x2a4/0x324
[  490.381801] [c00000006748fad0] [c00000000091b528] pci_stop_bus_device+0xb8/0x104
[  490.381817] [c00000006748fb10] [c00000000091b910] pci_stop_and_remove_bus_device+0x28/0x40
[  490.381826] [c00000006748fb40] [c000000000072620] pci_hp_remove_devices+0x90/0x128
[  490.381831] [c00000006748fbd0] [c008000004440504] disable_slot+0x40/0x90 [rpaphp]
[  490.381839] [c00000006748fc00] [c000000000956090] power_write_file+0xf8/0x19c
[  490.381846] [c00000006748fc80] [c00000000094b4f8] pci_slot_attr_store+0x40/0x5c
[  490.381851] [c00000006748fca0] [c0000000006e5dc4] sysfs_kf_write+0x64/0x78
[  490.381858] [c00000006748fcc0] [c0000000006e48d8] kernfs_fop_write_iter+0x1b0/0x290
[  490.381864] [c00000006748fd10] [c0000000005e0f4c] vfs_write+0x3b0/0x4f8
[  490.381871] [c00000006748fdc0] [c0000000005e13c0] ksys_write+0x84/0x140
[  490.381876] [c00000006748fe10] [c000000000030a84] system_call_exception+0x124/0x330
[  490.381882] [c00000006748fe50] [c00000000000cedc] system_call_vectored_common+0x15c/0x2ec

NVMe controller state:
======================
# cat /sys/class/nvme/nvme1/state 
deleting (no IO)

Process State:
==============
# ps -aex 
   [..]
   2510 pts/2    Ds+    0:00 -bash USER=root LOGNAME=root HOME=/root PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin SHELL=/bin/bash TERM=xterm-256colo
   2549 ?        Ds     0:14 fio --filename=/dev/nvme1n1 --direct=1 --rw=randrw --bs=4k --ioengine=psync --iodepth=256 --runtime=300 --numjobs=1 --time_based 
   [..]

Observation:
============
As is apparent from the above logs, "disable-slot" (pid 2510) is waiting in uninterruptible sleep
for the request queue to be frozen, because the in-flight IO(s) could not finish. The in-flight IO(s)
do eventually time out, but nvme_timeout() does not cancel them; it only logs the error
"Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug" and returns BLK_EH_DONE.
Since those in-flight IOs are never cancelled, the NVMe driver code running in the context of
"disable-slot" cannot make forward progress, and the NVMe controller state remains "deleting (no IO)"
indefinitely. The only way we found to get out of this state is to reboot the system.
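
To make the hang mechanism explicit: del_gendisk() ends up in blk_mq_freeze_queue_wait(), which is
essentially the following (simplified from block/blk-mq.c, shown here only to illustrate why the
uncancelled IOs matter):

void blk_mq_freeze_queue_wait(struct request_queue *q)
{
	/*
	 * Sleep until every in-flight request has dropped its reference on
	 * q_usage_counter. The timed-out IOs above are never completed or
	 * cancelled, so the counter never reaches zero and this wait never
	 * returns.
	 */
	wait_event(q->mq_freeze_wq, percpu_ref_is_zero(&q->q_usage_counter));
}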

Proposed fix:
============
static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev)) {
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);
		nvme_dev_disable(dev, true);
	}
	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);	<== here the ctrl state is set to "deleting (no IO)"
	[..]
}

As shown above, nvme_remove() invokes nvme_dev_disable(), but only if the device has been physically
removed (i.e. pci_device_is_present() fails). Since nvme_dev_disable() helps cancel pending IOs, does
it make sense to unconditionally cancel pending IOs before moving on? Or are there side effects if we
were to invoke nvme_dev_disable() unconditionally in the first place? A rough sketch of that variant
follows.
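
To make the question concrete, the variant being asked about would look roughly like this. This is an
untested sketch only, keeping the existing ordering and the existing shutdown argument; whether the
unconditional nvme_dev_disable() call has unwanted side effects is exactly the open question:

static void nvme_remove(struct pci_dev *pdev)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DELETING);
	pci_set_drvdata(pdev, NULL);

	if (!pci_device_is_present(pdev))
		nvme_change_ctrl_state(&dev->ctrl, NVME_CTRL_DEAD);

	/* Unconditionally cancel any IOs still stuck in flight. */
	nvme_dev_disable(dev, true);

	flush_work(&dev->ctrl.reset_work);
	nvme_stop_ctrl(&dev->ctrl);
	nvme_remove_namespaces(&dev->ctrl);
	[..]
}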

Thanks,
--Nilay

