Date: Mon, 22 Apr 2024 07:52:25 -0600
From: Keith Busch
To: Sagi Grimberg
Cc: Nilay Shroff, "linux-nvme@lists.infradead.org", Christoph Hellwig, "axboe@fb.com", Gregory Joyce, Srimannarayana Murthy Maram
Subject: Re: [Bug Report] PCIe errinject and hot-unplug causes nvme driver hang
References: <199be893-5dfa-41e5-b6f2-40ac90ebccc4@linux.ibm.com> <579c82da-52a7-4425-81d7-480c676b8cbb@grimberg.me> <627cdf69-ff60-4596-a7f3-0fdd0af0f601@grimberg.me>
In-Reply-To: <627cdf69-ff60-4596-a7f3-0fdd0af0f601@grimberg.me>
On Mon, Apr 22, 2024 at 04:00:54PM +0300, Sagi Grimberg wrote:
> > pci_rescan_remove_lock then it shall be able to recover the pci error and hence
> > pending IOs could be finished. Later when hot-unplug task starts, it could
> > forward progress and cleanup all resources used by the nvme disk.
> >
> > So does it make sense if we unconditionally cancel the pending IOs from
> > nvme_remove() before it forward progress to remove namespaces?
>
> The driver attempts to allow inflight I/O to complete successfully, if the
> device is still present in the remove stage. I am not sure we want to
> unconditionally fail these I/Os. Keith?

We have a timeout handler to clean this up, but I think it was another
PPC-specific patch that made the timeout handler do nothing while PCIe error
recovery is in progress. That seems questionable: we should be able to run
error handling and timeouts concurrently, but the error handling just needs
to synchronize the request_queues in the "error_detected" path.
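
To illustrate what I mean (an untested sketch only, not a proposal in its
final form): nvme_sync_queues() already exists and flushes each
request_queue's timeout work, so calling it from the pci error_detected
callback is one way that synchronization could look. The surrounding code is
roughly the current nvme_error_detected() in drivers/nvme/host/pci.c; the
nvme_sync_queues() call is the hypothetical addition:

static pci_ers_result_t nvme_error_detected(struct pci_dev *pdev,
					    pci_channel_state_t state)
{
	struct nvme_dev *dev = pci_get_drvdata(pdev);

	switch (state) {
	case pci_channel_io_normal:
		return PCI_ERS_RESULT_CAN_RECOVER;
	case pci_channel_io_frozen:
		dev_warn(dev->ctrl.device,
			 "frozen state error detected, reset controller\n");
		nvme_dev_disable(dev, false);
		/*
		 * Hypothetical addition: flush pending timeout work on the
		 * admin and I/O queues so a timeout handler running
		 * concurrently cannot race with the recovery sequence that
		 * follows, instead of having it bail out entirely.
		 */
		nvme_sync_queues(&dev->ctrl);
		return PCI_ERS_RESULT_NEED_RESET;
	case pci_channel_io_perm_failure:
		return PCI_ERS_RESULT_DISCONNECT;
	}
	return PCI_ERS_RESULT_NEED_RESET;
}

Whether that call belongs here or later in the recovery path would need
checking against the hot-unplug case Nilay reported.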