Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

All the mail mirrored from lore.kernel.org
 help / color / mirror / Atom feed

From: Can Guo <cang@codeaurora.org>
To: Bart Van Assche <bvanassche@acm.org>
Cc: asutoshd@codeaurora.org, nguyenb@codeaurora.org,
	hongwus@codeaurora.org, ziqichen@codeaurora.org,
	linux-scsi@vger.kernel.org, kernel-team@android.com,
	Alim Akhtar <alim.akhtar@samsung.com>,
	Avri Altman <avri.altman@wdc.com>,
	"James E.J. Bottomley" <jejb@linux.ibm.com>,
	"Martin K. Petersen" <martin.petersen@oracle.com>,
	Stanley Chu <stanley.chu@mediatek.com>,
	Bean Huo <beanhuo@micron.com>, Jaegeuk Kim <jaegeuk@kernel.org>,
	open list <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests
Date: Wed, 16 Jun 2021 12:00:57 +0800	[thread overview]
Message-ID: <8aae95071b9ab3c0a3cab91d1ae138e1@codeaurora.org> (raw)
In-Reply-To: <fabc70f8-6bb8-4b62-3311-f6e0ce9eb2c3@acm.org>

On 2021-06-16 02:25, Bart Van Assche wrote:
> On 6/14/21 7:36 PM, Can Guo wrote:
>> I've considered the similar way (leverage hba->host->eh_noresume) last
>> year,
>> but I didn't take this way due to below reasons:
>> 
>> 1. UFS error handler basically does one thing - reset and restore, 
>> which
>> stops hba [1], resets device [2] and re-probes the device [3]. 
>> Stopping
>> hba [1]
>> shall complete any pending requests in the doorbell (with error or no
>> error).
>> After [1], suspend/resume contexts, blocked by SSU cmd, shall be 
>> unblocked
>> right away to do whatever it needs to handle the SSU cmd failure 
>> (completed
>> in [1], so scsi_execute() returns an error), e.g., put link back to 
>> the old
>> state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and 
>> etc...
>> However, reset and restore ([2] and [3]) is still running, and it can
>> (most likely)
>> be disturbed by suspend/resume. So passing a parameter or using
>> hba->host->eh_noresume
>> to skip lock_system_sleep() and unlock_system_sleep() can break the 
>> cycle,
>> but error handling may run concurrently with suspend/resume. Of course
>> we can
>> modify suspend/resume to avoid it, but I was pursuing a minimal change
>> to get this fixed.
>> 
>> 2. Whatever way we take to break the cycle, suspend/resume shall fail 
>> and
>> RPM framework shall save the error to dev.power.runtime_error, leaving
>> the device in runtime suspended or active mode permanently. If it is 
>> left
>> runtime suspended, UFS driver won't accept cmd anymore, while if it is 
>> left
>> runtime active, powers of UFS device and host will be left ON, leading
>> to power
>> penalty. So my main idea is to let suspend/resume contexts, blocked by
>> PM cmds,
>> fail fast first and then error handler recover everything back to 
>> work.
> 
> Hi Can,
> 
> Has it been considered to make the UFS error handler fail pending
> commands with an error code that causes the SCSI core to resubmit the
> SCSI command, e.g. DID_IMM_RETRY or DID_TRANSPORT_DISRUPTED? I want to
> prevent that power management or suspend/resume callbacks fail if the
> error handler succeeds with recovering the UFS transport.
> 

Hi Bart,

Thanks for the suggestion, I thought about it but I didn't go that
far in this path because I believe letting a context fast fail is
better than retrying/blocking it (to me suspend/resume can fail
due to many reasons and task abort is just one of them). I appreciate
the idea, but I would like to stick to my way as of now because

1. Merely preventing task abort cannot prevent suspend/resume fail.
Task abort (to PM requests), in real cases, is just one of many kinds
of failure which can fail the suspend/resume callbacks. During
suspend/resume, if AH8 error and/or UIC errors happen, IRQ handler
may complete SSU cmd with errors and schedule the error handler (I've
seen such scenarios in real customer cases). My idea is to treat task
abort (to PM requests) as a failure (let scsi_execute() return with
whatever error) and let error handler recover everything just like
any other UFS errors which invoke error handler. In case this, again,
goes back to the topic that is why don't just do error recovery in
suspend/resume, let me paste my previous reply here -

"
Error handler has the same nature of user access - it is unpredictable, 
meaning it
can be invoked at any time (from IRQ handler), even when there is no 
ongoing
cmd/data transactions (like auto hibern8 failure and UIC errors, such as 
DME
error and some errors in data link layer) [1], unless you disable UFS 
IRQ.

The reasons why I choose not to do it that way are (althrough error 
handler
prepare has became much more simple after apply this change)

- I want to keep all the complexity within error handler, and re-direct 
all error
recovery needs to error handler. It can avoid calling 
ufshcd_reset_and_restore()
and/or flush_work(&hba->eh_work) here and there. The entire UFS 
suspend/resume is
already complex enough, I don't want to mess up with it.

- We do explicit recovery only when we see certain errors, e.g., H8 
enter func
returns an error during suspend, but as mentioned above [1], error 
handling can
be invoked already from IRQ handler (due to all kinds of UIC errors 
before H8 enter
func returns). So, we still need host_sem (in case of system 
suspend/resume) to
avoid concurrency.

- During system suspend/resume, error handling can be invoked (due to 
non-fatal
errors) but still UFS cmds return no error at all. Similar like above, 
we need
host_sem to avoid concurrency.
"

2. And say we want SCSI layer to resubmit PM requests to prevent
suspend/resume fail, we should keep retrying the PM requests (so
long as error handler can recover everything successfully), meaning
we should give them unlimited retries (which I think is a bad idea),
otherwise (if they have zero retries or limited retries), in extreme
conditions, what may happen is that error handler can recover everything
successfully every time, but all these retries (say 3) still time out,
which block the power management for too long (retries * 60 seconds) 
and,
most important, when the last retry times out, scsi layer will anyways
complete the PM request (even we return DID_IMM_RETRY), then we end up
same - suspend/resume shall run concurrently with error handler and we
couldn't recover saved PM errors.

Thanks,

Can Guo.

> Thanks,
> 
> Bart.

next prev parent reply	other threads:[~2021-06-16  4:01 UTC|newest]

Thread overview: 44+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2021-06-10  4:43 [PATCH v3 0/9] Complementary changes for error handling Can Guo
2021-06-10  4:43 ` [PATCH v3 1/9] scsi: ufs: Differentiate status between hba pm ops and wl pm ops Can Guo
2021-06-10 11:15   ` Adrian Hunter
2021-06-11  0:53     ` Can Guo
2021-06-11 20:40   ` Bart Van Assche
2021-06-12  6:20     ` Can Guo
2021-06-16 17:50   ` Bart Van Assche
2021-06-23  1:32     ` Can Guo
2021-06-10  4:43 ` [PATCH v3 2/9] scsi: ufs: Update the return value of supplier " Can Guo
2021-06-10  4:43 ` [PATCH v3 3/9] scsi: ufs: Enable IRQ after enabling clocks in error handling preparation Can Guo
2021-06-10  4:43 ` [PATCH v3 4/9] scsi: ufs: Complete the cmd before returning in queuecommand Can Guo
2021-06-11 20:52   ` Bart Van Assche
2021-06-12  7:38     ` Can Guo
2021-06-12 15:50       ` Bart Van Assche
2021-06-13 13:30         ` Can Guo
2021-06-10  4:43 ` [PATCH v3 5/9] scsi: ufs: Simplify error handling preparation Can Guo
2021-06-10 12:30   ` Adrian Hunter
2021-06-11  3:01     ` Can Guo
2021-06-11 20:58       ` Bart Van Assche
2021-06-12  6:46         ` Can Guo
2021-06-12  9:49           ` Can Guo
2021-06-10  4:43 ` [PATCH v3 6/9] scsi: ufs: Update ufshcd_recover_pm_error() Can Guo
2021-06-10  4:43 ` [PATCH v3 7/9] scsi: ufs: Let host_sem cover the entire system suspend/resume Can Guo
2021-06-10 13:32   ` Adrian Hunter
2021-06-11  3:06     ` Can Guo
2021-06-11 21:00   ` Bart Van Assche
2021-06-12  6:46     ` Can Guo
2021-06-10  4:43 ` [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests Can Guo
2021-06-11 21:02   ` Bart Van Assche
2021-06-12  7:07     ` Can Guo
2021-06-12 16:50       ` Bart Van Assche
2021-06-13 14:42         ` Can Guo
2021-06-14 18:49           ` Bart Van Assche
2021-06-15  2:36             ` Can Guo
2021-06-15  3:17               ` Can Guo
2021-06-15 18:25               ` Bart Van Assche
2021-06-16  4:00                 ` Can Guo [this message]
2021-06-16  4:40                   ` Bart Van Assche
2021-06-16  8:47                     ` Can Guo
2021-06-16 17:55                       ` Bart Van Assche
2021-06-23  1:34                         ` Can Guo
2021-06-10  4:43 ` [PATCH v3 9/9] scsi: ufs: Apply more limitations to user access Can Guo
2021-06-11 21:03   ` Bart Van Assche
2021-06-12  7:13     ` Can Guo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=8aae95071b9ab3c0a3cab91d1ae138e1@codeaurora.org \
    --to=cang@codeaurora.org \
    --cc=alim.akhtar@samsung.com \
    --cc=asutoshd@codeaurora.org \
    --cc=avri.altman@wdc.com \
    --cc=beanhuo@micron.com \
    --cc=bvanassche@acm.org \
    --cc=hongwus@codeaurora.org \
    --cc=jaegeuk@kernel.org \
    --cc=jejb@linux.ibm.com \
    --cc=kernel-team@android.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-scsi@vger.kernel.org \
    --cc=martin.petersen@oracle.com \
    --cc=nguyenb@codeaurora.org \
    --cc=stanley.chu@mediatek.com \
    --cc=ziqichen@codeaurora.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.