2021-06-10 04:48:06

by Can Guo

Subject: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

If PM requests fail during runtime suspend/resume, the RPM framework saves
the error to dev->power.runtime_error. Until runtime_error is cleared,
runtime PM on this device stops working, leaving the device permanently
either runtime active or runtime suspended.

When a PM request sent during runtime suspend/resume is aborted, the RPM
framework saves the (TIMEOUT) error even if the abort succeeds. In this
situation, we can rely on error handling to recover and clear
runtime_error. So, let PM requests take the fast abort path in
ufshcd_abort().

Signed-off-by: Can Guo <[email protected]>
---
drivers/scsi/ufs/ufshcd.c | 38 ++++++++++++++++++++++----------------
1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 861942b..cf24ec2 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -2737,7 +2737,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
* err handler blocked for too long. So, just fail the scsi cmd
* sent from PM ops, err handler can recover PM error anyways.
*/
- if (hba->wl_pm_op_in_progress) {
+ if (cmd->request->rq_flags & RQF_PM) {
hba->force_reset = true;
set_host_byte(cmd, DID_BAD_TARGET);
cmd->scsi_done(cmd);
@@ -2760,7 +2760,7 @@ static int ufshcd_queuecommand(struct Scsi_Host *host, struct scsi_cmnd *cmd)
}

if (unlikely(test_bit(tag, &hba->outstanding_reqs))) {
- if (hba->wl_pm_op_in_progress) {
+ if (cmd->request->rq_flags & RQF_PM) {
set_host_byte(cmd, DID_BAD_TARGET);
cmd->scsi_done(cmd);
} else {
@@ -6985,11 +6985,14 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
int err = 0;
struct ufshcd_lrb *lrbp;
u32 reg;
+ bool need_eh = false;

host = cmd->device->host;
hba = shost_priv(host);
tag = cmd->request->tag;
lrbp = &hba->lrb[tag];
+
+ dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag);
if (!ufshcd_valid_tag(hba, tag)) {
dev_err(hba->dev,
"%s: invalid command tag %d: cmd=0x%p, cmd->request=0x%p",
@@ -7007,9 +7010,6 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
goto out;
}

- /* Print Transfer Request of aborted task */
- dev_info(hba->dev, "%s: Device abort task at tag %d\n", __func__, tag);
-
/*
* Print detailed info about aborted request.
* As more than one request might get aborted at the same time,
@@ -7037,21 +7037,21 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
}

/*
- * Task abort to the device W-LUN is illegal. When this command
- * will fail, due to spec violation, scsi err handling next step
- * will be to send LU reset which, again, is a spec violation.
- * To avoid these unnecessary/illegal steps, first we clean up
- * the lrb taken by this cmd and re-set it in outstanding_reqs,
- * then queue the eh_work and bail.
+ * This fast path guarantees the cmd always gets aborted successfully,
+ * meanwhile it invokes the error handler. It allows contexts, which
+ * are blocked by this cmd, to fail fast. It serves multiple purposes:
+ * #1 To avoid unnecessary/illegal abort attempts to the W-LU.
+ * #2 To avoid live lock between eh_work and specific contexts, i.e.,
+ * suspend/resume and eh_work itself.
+ * #3 To let eh_work recover runtime PM error in case abort happens
+ * to cmds sent from runtime suspend/resume ops.
*/
- if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN) {
+ if (lrbp->lun == UFS_UPIU_UFS_DEVICE_WLUN ||
+ (cmd->request->rq_flags & RQF_PM)) {
ufshcd_update_evt_hist(hba, UFS_EVT_ABORT, lrbp->lun);
__ufshcd_transfer_req_compl(hba, (1UL << tag));
set_bit(tag, &hba->outstanding_reqs);
- spin_lock_irqsave(host->host_lock, flags);
- hba->force_reset = true;
- ufshcd_schedule_eh_work(hba);
- spin_unlock_irqrestore(host->host_lock, flags);
+ need_eh = true;
goto out;
}

@@ -7065,6 +7065,12 @@ static int ufshcd_abort(struct scsi_cmnd *cmd)
cleanup:
__ufshcd_transfer_req_compl(hba, (1UL << tag));
out:
+ if (cmd->request->rq_flags & RQF_PM || need_eh) {
+ spin_lock_irqsave(host->host_lock, flags);
+ hba->force_reset = true;
+ ufshcd_schedule_eh_work(hba);
+ spin_unlock_irqrestore(host->host_lock, flags);
+ }
err = SUCCESS;
} else {
dev_err(hba->dev, "%s: failed with err %d\n", __func__, err);
--
Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, a Linux Foundation Collaborative Project.
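
The latching behaviour that the commit message relies on can be illustrated with a small userspace model. This is only a sketch, not the kernel's runtime PM code: model_dev, model_rpm_suspend() and model_clear_error() are made-up names, and the model assumes only what the message states, namely that a failed callback latches its error and that further runtime PM attempts are refused until the error is cleared.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for the runtime PM state of one device. */
struct model_dev {
	int runtime_error;	/* latched error, 0 when none */
	bool suspended;
};

/*
 * Try to runtime-suspend the device. callback_ret is the result of the
 * driver's suspend callback, e.g. -ETIMEDOUT when the PM request was
 * aborted.
 */
static int model_rpm_suspend(struct model_dev *dev, int callback_ret)
{
	if (dev->runtime_error)
		return -EINVAL;		/* refused until the error is cleared */

	if (callback_ret) {
		dev->runtime_error = callback_ret;	/* latch the failure */
		return callback_ret;
	}

	dev->suspended = true;
	return 0;
}

/* What the error handler is expected to do after a successful recovery. */
static void model_clear_error(struct model_dev *dev)
{
	dev->runtime_error = 0;
}
```

In this model a single timed-out suspend makes every later suspend attempt fail, even one whose callback would succeed, until model_clear_error() runs. That is the state the patch wants the error handler to recover from.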


2021-06-11 21:04:22

by Bart Van Assche

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/9/21 9:43 PM, Can Guo wrote:
> If PM requests fail during runtime suspend/resume, RPM framework saves the
> error to dev->power.runtime_error. Before the runtime_error gets cleared,
> runtime PM on this specific device won't work again, leaving the device
> either runtime active or runtime suspended permanently.
>
> When task abort happens to a PM request sent during runtime suspend/resume,
> even if it can be successfully aborted, RPM framework anyways saves the
> (TIMEOUT) error. In this situation, we can leverage error handling to
> recover and clear the runtime_error. So, let PM requests take the fast
> abort path in ufshcd_abort().

How can a PM request fail during runtime suspend/resume? Does such a
failure perhaps indicate a UFS controller bug? I appreciate your work
but I'm wondering whether it's worth complicating the UFS driver for
issues that should be fixed in the controller instead of in software.

Thanks,

Bart.

2021-06-12 07:10:03

by Can Guo

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 2021-06-12 05:02, Bart Van Assche wrote:
> On 6/9/21 9:43 PM, Can Guo wrote:
>> If PM requests fail during runtime suspend/resume, RPM framework saves
>> the
>> error to dev->power.runtime_error. Before the runtime_error gets
>> cleared,
>> runtime PM on this specific device won't work again, leaving the
>> device
>> either runtime active or runtime suspended permanently.
>>
>> When task abort happens to a PM request sent during runtime
>> suspend/resume,
>> even if it can be successfully aborted, RPM framework anyways saves
>> the
>> (TIMEOUT) error. In this situation, we can leverage error handling to
>> recover and clear the runtime_error. So, let PM requests take the fast
>> abort path in ufshcd_abort().
>
> How can a PM request fail during runtime suspend/resume? Does such a
> failure perhaps indicate an UFS controller bug?

I've replied to your similar question in the previous series. Over the
years I've seen too many SSU and SYNCHRONIZE_CACHE cmds time out; even
60s is not enough for them to complete. And you are right, in most cases
the device is not responding - the UFS controller is busy with
housekeeping.

> I appreciate your work
> but I'm wondering whether it's worth to complicate the UFS driver for
> issues that should be fixed in the controller instead of in software.
>

Sigh... I also want my life and work to be easier... I agree with you.

In the project bring-up stage, we fix whatever error/bug/failure we face
to unblock the project. During that stage we only focus on fixing the very
first UFS error, and don't care much about error recovery or what that
error can cause (usually more UFS errors and system stability issues
follow the very first UFS error).

However, in recent years our customers tend to ask for more - they want
UFS error handling to recover everything whenever a UFS error occurs,
because they believe it is the last line of defense after their products
go to market. So I put a lot of effort into fixing, testing and trying to
make it robust. Now here we are. FYI, I am on a tight schedule to have
these UFS error handling changes ready in Android12-5.10.

Thanks,

Can Guo.

> Thanks,
>
> Bart.

2021-06-12 17:02:54

by Bart Van Assche

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/12/21 12:07 AM, Can Guo wrote:
> Sigh... I also want my life and work to be easier...

How about reducing the number of states and state transitions in the UFS
driver?

One source of complexity is that ufshcd_err_handler() is scheduled
independently of the SCSI error handler and hence may run concurrently
with the SCSI error handler. Has the following already been considered?
- Call ufshcd_err_handler() synchronously from ufshcd_abort() and
ufshcd_eh_host_reset_handler() instead of asynchronously.
- Call scsi_schedule_eh() from ufshcd_uic_pwr_ctrl() and
ufshcd_check_errors() instead of ufshcd_schedule_eh_work().

These changes will guarantee that all commands have completed or timed
out before ufshcd_err_handler() is called. I think that would allow to
remove e.g. the following code from the error handler:

ufshcd_scsi_block_requests(hba);
/* Drain ufshcd_queuecommand() */
down_write(&hba->clk_scaling_lock);
up_write(&hba->clk_scaling_lock);

Thanks,

Bart.

2021-06-13 14:45:30

by Can Guo

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Hi Bart,

On 2021-06-13 00:50, Bart Van Assche wrote:
> On 6/12/21 12:07 AM, Can Guo wrote:
>> Sigh... I also want my life and work to be easier...
>
> How about reducing the number of states and state transitions in the
> UFS
> driver? One source of complexity is that ufshcd_err_handler() is
> scheduled
> independently of the SCSI error handler and hence may run concurrently
> with the SCSI error handler. Has the following already been considered?
> - Call ufshcd_err_handler() synchronously from ufshcd_abort() and
> ufshcd_eh_host_reset_handler() instead of asynchronously.

1. ufshcd_eh_host_reset_handler() invokes ufshcd_err_handler() and flushes
it, so it is synchronous. ufshcd_eh_host_reset_handler() used to call
reset_and_restore() directly, which could run concurrently with the UFS
error handler, so I fixed it last year [1].

2. Invoking ufshcd_err_handler() synchronously from ufshcd_abort() can
cause a live lock, which is why I chose the asynchronous way (from the
first day I started to fix error handling). The live lock happens when a
PM request is aborted, e.g., an SSU cmd sent from suspend/resume. Because
the UFS error handler is synchronized with suspend/resume (by calling
pm_runtime_get_sync() and lock_system_sleep()), the sequence is like:
[1] ufshcd_wl_resume() sends SSU cmd
[2] ufshcd_abort() calls UFS error handler
[3] UFS error handler calls lock_system_sleep() and pm_runtime_get_sync()

In the above sequence, either lock_system_sleep() or pm_runtime_get_sync()
blocks - [3] is blocked by [1], [2] is blocked by [3], while [1] is
blocked by [2].

For PM requests, I chose to abort them fast to unblock suspend/resume.
Suspend/resume fails of course, but the UFS error handler recovers
PM errors anyway.
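
The circular wait in steps [1] to [3] can be checked with a tiny wait-for graph. The sketch below is a userspace model under stated assumptions: the context names and model_has_cycle() are invented for illustration, and each context is assumed to wait on at most one other context.

```c
#include <stdbool.h>

/* The three contexts from the sequence: [1] resume, [2] abort, [3] eh. */
enum { CTX_RESUME, CTX_ABORT, CTX_EH, CTX_COUNT };

/* waits_for[i] is the context that i is blocked on, or -1 if i can run. */
static bool model_has_cycle(const int waits_for[CTX_COUNT])
{
	for (int start = 0; start < CTX_COUNT; start++) {
		int cur = start;

		/* Follow at most CTX_COUNT edges; reaching start again is a cycle. */
		for (int hops = 0; hops < CTX_COUNT && cur != -1; hops++) {
			cur = waits_for[cur];
			if (cur == start)
				return true;
		}
	}
	return false;
}
```

With a synchronous error handler the edges are resume -> abort, abort -> eh and eh -> resume, which form a cycle. With the fast abort path, abort no longer waits for the error handler (its entry becomes -1), so the cycle disappears.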

> - Call scsi_schedule_eh() from ufshcd_uic_pwr_ctrl() and
> ufshcd_check_errors() instead of ufshcd_schedule_eh_work().

When ufshcd_uic_pwr_ctrl() and/or ufshcd_check_errors() report errors,
they are usually fatal errors. According to the UFSHCI spec, SW should
re-probe UFS to recover.

However, scsi_schedule_eh() does more than that - scsi_unjam_host() sends
a request sense cmd and calls scsi_eh_ready_devs(), while
scsi_eh_ready_devs() sends a test unit ready cmd and calls all the way
down to scsi_eh_device/target/bus/host_reset(). But we only need
scsi_eh_host_reset() in this case. I know you have concerns that
scsi_schedule_eh() may run concurrently with the UFS error handler, but as
I mentioned above in [1] - I've made ufshcd_eh_host_reset_handler()
synchronized with the UFS error handler. I hope that can ease your concern.
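
To make the cost of the full escalation concrete, here is a rough userspace sketch of the ladder as described above. It is a model only: the enum and model_escalate() are hypothetical names, and it assumes that each level is tried in order until one reports success.

```c
#include <stdbool.h>

/* Recovery levels named in the message above, in escalation order. */
enum model_step {
	MODEL_DEVICE_RESET,
	MODEL_TARGET_RESET,
	MODEL_BUS_RESET,
	MODEL_HOST_RESET,
	MODEL_STEP_COUNT
};

/*
 * Walk the ladder until a level reports success and return how many
 * levels were attempted. works[] says which levels can fix the fault.
 */
static int model_escalate(const bool works[MODEL_STEP_COUNT])
{
	int attempted = 0;

	for (int step = 0; step < MODEL_STEP_COUNT; step++) {
		attempted++;
		if (works[step])
			break;
	}
	return attempted;
}
```

If only the host reset can recover a fatal error, the ladder attempts all four levels in this model, whereas going straight to the host reset needs one.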

I am not saying your idea won't work; it is a good suggestion. I will try
it after these changes go in, because it would require extra effort and
the effort won't be minor - I need to consider how to remove/reduce the
ufshcd states along with the change, and redo the error injection and
stability tests all over again, which is a long way to go. As for now, at
least the current changes work well as per my tests and we really need
these changes for Android12-5.10.

Thanks,

Can Guo.

>
> These changes will guarantee that all commands have completed or timed
> out before ufshcd_err_handler() is called. I think that would allow to
> remove e.g. the following code from the error handler:
>
> ufshcd_scsi_block_requests(hba);
> /* Drain ufshcd_queuecommand() */
> down_write(&hba->clk_scaling_lock);
> up_write(&hba->clk_scaling_lock);
>
> Thanks,
>
> Bart.

2021-06-14 18:51:59

by Bart Van Assche

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/13/21 7:42 AM, Can Guo wrote:
> 2. ufshcd_abort() invokes ufshcd_err_handler() synchronously can have a
> live lock issue, which is why I chose the asynchronous way (from the first
> day I started to fix error handling). The live lock happens when abort
> happens
> to a PM request, e.g., a SSU cmd sent from suspend/resume. Because UFS
> error
> handler is synchronized with suspend/resume (by calling
> pm_runtime_get_sync()
> and lock_system_sleep()), the sequence is like:
> [1] ufshcd_wl_resume() sends SSU cmd
> [2] ufshcd_abort() calls UFS error handler
> [3] UFS error handler calls lock_system_sleep() and pm_runtime_get_sync()
>
> In above sequence, either lock_system_sleep() or pm_runtime_get_sync()
> shall
> be blocked - [3] is blocked by [1], [2] is blocked by [3], while [1] is
> blocked by [2].
>
> For PM requests, I chose to abort them fast to unblock suspend/resume,
> suspend/resume shall fail of course, but UFS error handler recovers
> PM errors anyways.

In the above sequence, does [2] perhaps refer to aborting the SSU
command submitted in step [1] (this is not clear to me)? If so, how
about breaking the circular waiting cycle as follows:
- If it can happen that SSU succeeds after more than scsi_timeout
seconds, define a custom timeout handler. From inside the timeout
handler, schedule a link check and return BLK_EH_RESET_TIMER. If the
link is no longer operational, run the error handler. If the link
cannot be recovered by the error handler, fail all pending commands.
This will prevent that ufshcd_abort() is called if a SSU command takes
longer than expected. See also commit 0dd0dec1677e.
- Modify the UFS error handler such that it accepts a context argument.
The context argument specifies whether or not the UFS error handler is
called from inside a system suspend or system resume handler. If the
UFS error handler is called from inside a system suspend or resume
callback, skip the lock_system_sleep() and unlock_system_sleep()
calls.
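
The two suggestions above can be sketched in userspace as follows. This is a model under stated assumptions, not driver code: the enum is a local stand-in whose names merely echo BLK_EH_RESET_TIMER, and model_tmo_hba, model_timed_out() and model_err_handler() are hypothetical.

```c
#include <stdbool.h>

/* Local stand-ins; not the block layer's real enum. */
enum model_eh_ret { MODEL_EH_DONE, MODEL_EH_RESET_TIMER };

struct model_tmo_hba {
	bool link_check_pending;	/* set when a link check was requested */
	bool sleep_lock_held;		/* models the system sleep lock */
	int sleep_locks_taken;		/* how often the lock was acquired */
};

/* First bullet: a timed-out SSU asks for a link check and re-arms its timer. */
static enum model_eh_ret model_timed_out(struct model_tmo_hba *hba, bool is_ssu)
{
	if (!is_ssu)
		return MODEL_EH_DONE;	/* in this model: usual abort path */

	hba->link_check_pending = true;
	return MODEL_EH_RESET_TIMER;
}

/* Second bullet: the caller states whether it runs inside system PM. */
static void model_err_handler(struct model_tmo_hba *hba, bool in_system_pm)
{
	if (!in_system_pm) {
		hba->sleep_lock_held = true;	/* stands for lock_system_sleep() */
		hba->sleep_locks_taken++;
	}

	/* ... reset and restore would run here ... */

	if (!in_system_pm)
		hba->sleep_lock_held = false;	/* stands for unlock_system_sleep() */
}
```

The point of the context flag is that a handler invoked from a suspend or resume callback never touches the sleep lock, so it cannot wait on the context that called it.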

Thanks,

Bart.

2021-06-15 02:42:54

by Can Guo

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Hi Bart,

On 2021-06-15 02:49, Bart Van Assche wrote:
> On 6/13/21 7:42 AM, Can Guo wrote:
>> 2. ufshcd_abort() invokes ufshcd_err_handler() synchronously can have
>> a
>> live lock issue, which is why I chose the asynchronous way (from the
>> first
>> day I started to fix error handling). The live lock happens when abort
>> happens
>> to a PM request, e.g., a SSU cmd sent from suspend/resume. Because UFS
>> error
>> handler is synchronized with suspend/resume (by calling
>> pm_runtime_get_sync()
>> and lock_system_sleep()), the sequence is like:
>> [1] ufshcd_wl_resume() sends SSU cmd
>> [2] ufshcd_abort() calls UFS error handler
>> [3] UFS error handler calls lock_system_sleep() and
>> pm_runtime_get_sync()
>>
>> In above sequence, either lock_system_sleep() or pm_runtime_get_sync()
>> shall
>> be blocked - [3] is blocked by [1], [2] is blocked by [3], while [1]
>> is
>> blocked by [2].
>>
>> For PM requests, I chose to abort them fast to unblock suspend/resume,
>> suspend/resume shall fail of course, but UFS error handler recovers
>> PM errors anyways.
>
> In the above sequence, does [2] perhaps refer to aborting the SSU
> command submitted in step [1] (this is not clear to me)?

Yes, your understanding is right.

> If so, how about breaking the circular waiting cycle as follows:
> - If it can happen that SSU succeeds after more than scsi_timeout
> seconds, define a custom timeout handler. From inside the timeout
> handler, schedule a link check and return BLK_EH_RESET_TIMER. If the
> link is no longer operational, run the error handler. If the link
> cannot be recovered by the error handler, fail all pending commands.
> This will prevent that ufshcd_abort() is called if a SSU command
> takes
> longer than expected. See also commit 0dd0dec1677e.
> - Modify the UFS error handler such that it accepts a context argument.
> The context argument specifies whether or not the UFS error handler
> is
> called from inside a system suspend or system resume handler. If the
> UFS error handler is called from inside a system suspend or resume
> callback, skip the lock_system_sleep() and unlock_system_sleep()
> calls.
>

I am aware of commit 0dd0dec1677e; I gave my Reviewed-by tag. Thank you
for your suggestion. I believe it can resolve the cycle, because I
actually considered a similar way (leveraging hba->host->eh_noresume) last
year, but I didn't take it for the reasons below:

1. The UFS error handler basically does one thing - reset and restore,
which stops the hba [1], resets the device [2] and re-probes the device
[3]. Stopping the hba [1] completes any pending requests in the doorbell
(with or without error). After [1], suspend/resume contexts blocked by the
SSU cmd are unblocked right away to do whatever they need to handle the
SSU cmd failure (the cmd is completed in [1], so scsi_execute() returns an
error), e.g., put the link back to the old state, call
ufshcd_vops_suspend(), turn off irq/clocks/powers, etc. However, reset and
restore ([2] and [3]) is still running, and it can (most likely will) be
disturbed by suspend/resume. So passing a parameter or using
hba->host->eh_noresume to skip lock_system_sleep() and
unlock_system_sleep() can break the cycle, but error handling may then run
concurrently with suspend/resume. Of course we can modify suspend/resume
to avoid it, but I was pursuing a minimal change to get this fixed.

2. Whatever way we take to break the cycle, suspend/resume fails and the
RPM framework saves the error to dev.power.runtime_error, leaving the
device in runtime suspended or active mode permanently. If it is left
runtime suspended, the UFS driver won't accept cmds anymore, while if it
is left runtime active, the powers of the UFS device and host are left ON,
leading to a power penalty. So my main idea is to let suspend/resume
contexts blocked by PM cmds fail fast first, and then let the error
handler recover everything back to work.

Thanks,

Can Guo.

> Thanks,
>
> Bart.

2021-06-15 03:19:18

by Can Guo

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 2021-06-15 10:36, Can Guo wrote:
> Hi Bart,
>
> On 2021-06-15 02:49, Bart Van Assche wrote:
>> On 6/13/21 7:42 AM, Can Guo wrote:
>>> 2. ufshcd_abort() invokes ufshcd_err_handler() synchronously can have
>>> a
>>> live lock issue, which is why I chose the asynchronous way (from the
>>> first
>>> day I started to fix error handling). The live lock happens when
>>> abort
>>> happens
>>> to a PM request, e.g., a SSU cmd sent from suspend/resume. Because
>>> UFS
>>> error
>>> handler is synchronized with suspend/resume (by calling
>>> pm_runtime_get_sync()
>>> and lock_system_sleep()), the sequence is like:
>>> [1] ufshcd_wl_resume() sends SSU cmd
>>> [2] ufshcd_abort() calls UFS error handler
>>> [3] UFS error handler calls lock_system_sleep() and
>>> pm_runtime_get_sync()
>>>
>>> In above sequence, either lock_system_sleep() or
>>> pm_runtime_get_sync()
>>> shall
>>> be blocked - [3] is blocked by [1], [2] is blocked by [3], while [1]
>>> is
>>> blocked by [2].
>>>
>>> For PM requests, I chose to abort them fast to unblock
>>> suspend/resume,
>>> suspend/resume shall fail of course, but UFS error handler recovers
>>> PM errors anyways.
>>
>> In the above sequence, does [2] perhaps refer to aborting the SSU
>> command submitted in step [1] (this is not clear to me)?
>
> Yes, your understanding is right.
>
>> If so, how about breaking the circular waiting cycle as follows:
>> - If it can happen that SSU succeeds after more than scsi_timeout
>> seconds, define a custom timeout handler. From inside the timeout
>> handler, schedule a link check and return BLK_EH_RESET_TIMER. If the
>> link is no longer operational, run the error handler. If the link
>> cannot be recovered by the error handler, fail all pending commands.
>> This will prevent that ufshcd_abort() is called if a SSU command
>> takes
>> longer than expected. See also commit 0dd0dec1677e.
>> - Modify the UFS error handler such that it accepts a context
>> argument.
>> The context argument specifies whether or not the UFS error handler
>> is
>> called from inside a system suspend or system resume handler. If the
>> UFS error handler is called from inside a system suspend or resume
>> callback, skip the lock_system_sleep() and unlock_system_sleep()
>> calls.
>>
>
> I am aware of commit 0dd0dec1677e, I gave my reviewed-by tag. Thank you
> for your suggestion and I believe it can resolve the cycle, because
> actually
> I've considered the similar way (leverage hba->host->eh_noresume) last
> year,
> but I didn't take this way due to below reasons:
>
> 1. UFS error handler basically does one thing - reset and restore,
> which
> stops hba [1], resets device [2] and re-probes the device [3]. Stopping
> hba [1]
> shall complete any pending requests in the doorbell (with error or no
> error).
> After [1], suspend/resume contexts, blocked by SSU cmd, shall be
> unblocked
> right away to do whatever it needs to handle the SSU cmd failure
> (completed
> in [1], so scsi_execute() returns an error), e.g., put link back to the
> old
> state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and
> etc...
> However, reset and restore ([2] and [3]) is still running, and it can
> (most likely)
> be disturbed by suspend/resume. So passing a parameter or using
> hba->host->eh_noresume
> to skip lock_system_sleep() and unlock_system_sleep() can break the
> cycle,
> but error handling may run concurrently with suspend/resume. Of course
> we can
> modify suspend/resume to avoid it, but I was pursuing a minimal change
> to get this fixed.
>

One more thing - the SSU cmd is not the only PM request sent during
suspend/resume. Last year (before your changes came in) suspend/resume
also sent a request sense cmd without checking its return value - so if
the request sense cmd was aborted, suspend/resume still moved forward,
which could run concurrently with error handling. So I was pursuing a way
to make the error handler less dependent on the behaviours of these
contexts.

Thanks,

Can Guo.

> 2. Whatever way we take to break the cycle, suspend/resume shall fail
> and
> RPM framework shall save the error to dev.power.runtime_error, leaving
> the device in runtime suspended or active mode permanently. If it is
> left
> runtime suspended, UFS driver won't accept cmd anymore, while if it is
> left
> runtime active, powers of UFS device and host will be left ON, leading
> to power
> penalty. So my main idea is to let suspend/resume contexts, blocked by
> PM cmds,
> fail fast first and then error handler recover everything back to work.
>
> Thanks,
>
> Can Guo.
>
>> Thanks,
>>
>> Bart.

2021-06-15 18:26:41

by Bart Van Assche

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/14/21 7:36 PM, Can Guo wrote:
> I've considered the similar way (leverage hba->host->eh_noresume) last
> year,
> but I didn't take this way due to below reasons:
>
> 1. UFS error handler basically does one thing - reset and restore, which
> stops hba [1], resets device [2] and re-probes the device [3]. Stopping
> hba [1]
> shall complete any pending requests in the doorbell (with error or no
> error).
> After [1], suspend/resume contexts, blocked by SSU cmd, shall be unblocked
> right away to do whatever it needs to handle the SSU cmd failure (completed
> in [1], so scsi_execute() returns an error), e.g., put link back to the old
> state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and etc...
> However, reset and restore ([2] and [3]) is still running, and it can
> (most likely)
> be disturbed by suspend/resume. So passing a parameter or using
> hba->host->eh_noresume
> to skip lock_system_sleep() and unlock_system_sleep() can break the cycle,
> but error handling may run concurrently with suspend/resume. Of course
> we can
> modify suspend/resume to avoid it, but I was pursuing a minimal change
> to get this fixed.
>
> 2. Whatever way we take to break the cycle, suspend/resume shall fail and
> RPM framework shall save the error to dev.power.runtime_error, leaving
> the device in runtime suspended or active mode permanently. If it is left
> runtime suspended, UFS driver won't accept cmd anymore, while if it is left
> runtime active, powers of UFS device and host will be left ON, leading
> to power
> penalty. So my main idea is to let suspend/resume contexts, blocked by
> PM cmds,
> fail fast first and then error handler recover everything back to work.

Hi Can,

Has it been considered to make the UFS error handler fail pending
commands with an error code that causes the SCSI core to resubmit the
SCSI command, e.g. DID_IMM_RETRY or DID_TRANSPORT_DISRUPTED? I want to
prevent that power management or suspend/resume callbacks fail if the
error handler succeeds with recovering the UFS transport.

Thanks,

Bart.

2021-06-16 04:05:34

by Can Guo

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 2021-06-16 02:25, Bart Van Assche wrote:
> On 6/14/21 7:36 PM, Can Guo wrote:
>> I've considered the similar way (leverage hba->host->eh_noresume) last
>> year,
>> but I didn't take this way due to below reasons:
>>
>> 1. UFS error handler basically does one thing - reset and restore,
>> which
>> stops hba [1], resets device [2] and re-probes the device [3].
>> Stopping
>> hba [1]
>> shall complete any pending requests in the doorbell (with error or no
>> error).
>> After [1], suspend/resume contexts, blocked by SSU cmd, shall be
>> unblocked
>> right away to do whatever it needs to handle the SSU cmd failure
>> (completed
>> in [1], so scsi_execute() returns an error), e.g., put link back to
>> the old
>> state. call ufshcd_vops_suspend(), turn off irq/clocks/powers and
>> etc...
>> However, reset and restore ([2] and [3]) is still running, and it can
>> (most likely)
>> be disturbed by suspend/resume. So passing a parameter or using
>> hba->host->eh_noresume
>> to skip lock_system_sleep() and unlock_system_sleep() can break the
>> cycle,
>> but error handling may run concurrently with suspend/resume. Of course
>> we can
>> modify suspend/resume to avoid it, but I was pursuing a minimal change
>> to get this fixed.
>>
>> 2. Whatever way we take to break the cycle, suspend/resume shall fail
>> and
>> RPM framework shall save the error to dev.power.runtime_error, leaving
>> the device in runtime suspended or active mode permanently. If it is
>> left
>> runtime suspended, UFS driver won't accept cmd anymore, while if it is
>> left
>> runtime active, powers of UFS device and host will be left ON, leading
>> to power
>> penalty. So my main idea is to let suspend/resume contexts, blocked by
>> PM cmds,
>> fail fast first and then error handler recover everything back to
>> work.
>
> Hi Can,
>
> Has it been considered to make the UFS error handler fail pending
> commands with an error code that causes the SCSI core to resubmit the
> SCSI command, e.g. DID_IMM_RETRY or DID_TRANSPORT_DISRUPTED? I want to
> prevent that power management or suspend/resume callbacks fail if the
> error handler succeeds with recovering the UFS transport.
>

Hi Bart,

Thanks for the suggestion. I thought about it but didn't go far down this
path, because I believe letting a context fail fast is better than
retrying/blocking it (to me suspend/resume can fail for many reasons and
task abort is just one of them). I appreciate the idea, but I would like
to stick to my way for now because

1. Merely preventing task abort cannot prevent suspend/resume from
failing. Task abort (of PM requests), in real cases, is just one of many
kinds of failure which can fail the suspend/resume callbacks. During
suspend/resume, if AH8 errors and/or UIC errors happen, the IRQ handler
may complete the SSU cmd with errors and schedule the error handler (I've
seen such scenarios in real customer cases). My idea is to treat task
abort (of PM requests) as a failure (let scsi_execute() return with
whatever error) and let the error handler recover everything, just like
any other UFS error which invokes the error handler. In case this, again,
goes back to the question of why not just do error recovery in
suspend/resume, let me paste my previous reply here -

"
The error handler has the same nature as user access - it is
unpredictable, meaning it can be invoked at any time (from the IRQ
handler), even when there are no ongoing cmd/data transactions (e.g., on
auto hibern8 failure and UIC errors, such as DME errors and some errors in
the data link layer) [1], unless you disable the UFS IRQ.

The reasons why I choose not to do it that way are (although error handler
prepare has become much simpler after applying this change)

- I want to keep all the complexity within the error handler, and redirect
all error recovery needs to the error handler. That avoids calling
ufshcd_reset_and_restore() and/or flush_work(&hba->eh_work) here and
there. The entire UFS suspend/resume is already complex enough; I don't
want to mess with it.

- We do explicit recovery only when we see certain errors, e.g., the H8
enter func returns an error during suspend, but as mentioned above [1],
error handling can already be invoked from the IRQ handler (due to all
kinds of UIC errors before the H8 enter func returns). So, we still need
host_sem (in case of system suspend/resume) to avoid concurrency.

- During system suspend/resume, error handling can be invoked (due to
non-fatal errors) while UFS cmds still return no error at all. As above,
we need host_sem to avoid concurrency.
"

2. And say we want the SCSI layer to resubmit PM requests to prevent
suspend/resume from failing. Then we should keep retrying the PM requests
(so long as the error handler can recover everything successfully),
meaning we should give them unlimited retries (which I think is a bad
idea). Otherwise (if they have zero or limited retries), in extreme
conditions the error handler can recover everything successfully every
time, but all these retries (say 3) still time out, which blocks power
management for too long (retries * 60 seconds). Most importantly, when the
last retry times out, the SCSI layer completes the PM request anyway (even
if we return DID_IMM_RETRY), and we end up in the same place -
suspend/resume runs concurrently with the error handler and we cannot
recover the saved PM errors.

Thanks,

Can Guo.

> Thanks,
>
> Bart.

2021-06-16 04:41:29

by Bart Van Assche

Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/15/21 9:00 PM, Can Guo wrote:
> I would like to stick to my way as of now because
>
> 1. Merely preventing task abort cannot prevent suspend/resume fail.
> Task abort (to PM requests), in real cases, is just one of many kinds
> of failure which can fail the suspend/resume callbacks. During
> suspend/resume, if AH8 error and/or UIC errors happen, IRQ handler
> may complete SSU cmd with errors and schedule the error handler (I've
> seen such scenarios in real customer cases). My idea is to treat task
> abort (to PM requests) as a failure (let scsi_execute() return with
> whatever error) and let error handler recover everything just like
> any other UFS errors which invoke error handler. In case this, again,
> goes back to the topic that is why don't just do error recovery in
> suspend/resume, let me paste my previous reply here -

Does this mean that the IRQ handler can complete an SSU command with an
error and that the error handler can later recover from that error? That
sounds completely wrong to me. The IRQ handler should never complete any
command with an error if that error could be recoverable. Instead, the
IRQ handler should add that command to a list and leave it to the error
handler to fail that command or to retry it.

> 2. And say we want SCSI layer to resubmit PM requests to prevent
> suspend/resume fail, we should keep retrying the PM requests (so
> long as error handler can recover everything successfully), meaning
> we should give them unlimited retries (which I think is a bad idea),
> otherwise (if they have zero retries or limited retries), in extreme
> conditions, what may happen is that error handler can recover everything
> successfully every time, but all these retries (say 3) still time out,
> which block the power management for too long (retries * 60 seconds) and,
> most important, when the last retry times out, scsi layer will anyways
> complete the PM request (even we return DID_IMM_RETRY), then we end up
> same - suspend/resume shall run concurrently with error handler and we
> couldn't recover saved PM errors.

Hmm ... it is not clear to me why this behavior is considered a problem?

What is wrong with blocking RPM while a START STOP UNIT command is being
processed? If there are UFS devices for which it takes long to process
that command I think it is up to the vendors of these devices to fix
these UFS devices.

Additionally, if a UFS device needs more than (retries * 60 seconds) to
process a START STOP UNIT command, shouldn't it be marked as broken?

Thanks,

Bart.

2021-06-16 08:49:37

by Can Guo

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Hi Bart,

On 2021-06-16 12:40, Bart Van Assche wrote:
> On 6/15/21 9:00 PM, Can Guo wrote:
>> I would like to stick to my way as of now because
>>
>> 1. Merely preventing task abort cannot prevent suspend/resume fail.
>> Task abort (to PM requests), in real cases, is just one of many kinds
>> of failure which can fail the suspend/resume callbacks. During
>> suspend/resume, if AH8 error and/or UIC errors happen, IRQ handler
>> may complete SSU cmd with errors and schedule the error handler (I've
>> seen such scenarios in real customer cases). My idea is to treat task
>> abort (to PM requests) as a failure (let scsi_execute() return with
>> whatever error) and let error handler recover everything just like
>> any other UFS errors which invoke error handler. In case this, again,
>> goes back to the topic that is why don't just do error recovery in
>> suspend/resume, let me paste my previous reply here -
>
> Does this mean that the IRQ handler can complete an SSU command with an
> error and that the error handler can later recover from that error?

Not exactly, sorry that I didn't put it clearly. There are cases where
cmds are completed with an error (either OCS is not SUCCESS or the
device returns check condition in resp) and are accompanied by fatal or
non-fatal UIC errors (UIC errors invoke the UFS error handler). For
example, SSU is completed with OCS_MISMATCH_RESPONSE_UPIU_SIZE (whatever
the reason is in HW), then auto hibern8 enter (the AH8 timer timeout,
hba->ahit, is set to a very low value) kicks in right after but fails
with fatal UIC errors. From the dmesg log, these all happen at once.
I've seen even more complicated cases where all kinds of errors are
mixed together.

> That sounds completely wrong to me. The IRQ handler should never
> complete any command with an error if that error could be recoverable.
> Instead, the IRQ handler should add that command to a list and leave it
> to the error handler to fail that command or to retry it.
>
>> 2. And say we want SCSI layer to resubmit PM requests to prevent
>> suspend/resume fail, we should keep retrying the PM requests (so
>> long as error handler can recover everything successfully), meaning
>> we should give them unlimited retries (which I think is a bad idea),
>> otherwise (if they have zero retries or limited retries), in extreme
>> conditions, what may happen is that error handler can recover everything
>> successfully every time, but all these retries (say 3) still time out,
>> which block the power management for too long (retries * 60 seconds) and,
>> most important, when the last retry times out, scsi layer will anyways
>> complete the PM request (even we return DID_IMM_RETRY), then we end up
>> same - suspend/resume shall run concurrently with error handler and we
>> couldn't recover saved PM errors.
>
> Hmm ... it is not clear to me why this behavior is considered a problem?
>

To me, task abort to PM requests is not worth treating so differently;
after all, suspend/resume may fail due to any kind of UFS error (as I've
explained so many times). My idea is to let PM requests fail fast (60
seconds have passed, maybe the device is broken, and we have reason to
fail the request since it is just a passthrough req) and schedule the
UFS error handler. The UFS error handler shall proceed after
suspend/resume fails out, and then start to recover everything in a safe
environment. Would this approach not work?
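
The idea can be sketched as a small user-space model in C. This is not
the driver code: struct model_req, struct model_hba, MODEL_RQF_PM and
model_abort() are hypothetical stand-ins, and the only things taken from
the patch are the check on a PM flag in the request and the intent to
fail the request and leave recovery to the error handler. The normal
path is assumed to always succeed, purely to keep the model short.

```c
#include <stdbool.h>

/* Stand-in for the kernel's RQF_PM request flag (value is arbitrary). */
#define MODEL_RQF_PM (1U << 0)

struct model_req {
	unsigned int rq_flags;	/* request flags, may contain MODEL_RQF_PM */
};

struct model_hba {
	bool eh_scheduled;	/* set once the error handler is scheduled */
};

enum model_abort_result { MODEL_ABORT_SUCCESS, MODEL_ABORT_FAILED };

/*
 * Model of the abort decision: a timed-out PM request is not aborted
 * individually. It is failed right away and the error handler is
 * scheduled to recover everything after suspend/resume has failed out.
 */
static enum model_abort_result model_abort(struct model_hba *hba,
					   const struct model_req *req)
{
	if (req->rq_flags & MODEL_RQF_PM) {
		hba->eh_scheduled = true;	/* fast path: defer to error handler */
		return MODEL_ABORT_FAILED;
	}

	/* Normal path: a real driver would try to abort the task here. */
	return MODEL_ABORT_SUCCESS;
}
```

In this model an ordinary request goes through the normal abort path and
leaves eh_scheduled untouched, while a request carrying MODEL_RQF_PM
returns MODEL_ABORT_FAILED immediately with eh_scheduled set.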

Thanks,

Can Guo.

> What is wrong with blocking RPM while a START STOP UNIT command is being
> processed? If there are UFS devices for which it takes long to process
> that command I think it is up to the vendors of these devices to fix
> these UFS devices.
>
> Additionally, if a UFS device needs more than (retries * 60 seconds) to
> process a START STOP UNIT command, shouldn't it be marked as broken?
>
> Thanks,
>
> Bart.

2021-06-17 00:39:48

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

On 6/16/21 1:47 AM, Can Guo wrote:
> On 2021-06-16 12:40, Bart Van Assche wrote:
>> On 6/15/21 9:00 PM, Can Guo wrote:
>>> 2. And say we want SCSI layer to resubmit PM requests to prevent
>>> suspend/resume fail, we should keep retrying the PM requests (so
>>> long as error handler can recover everything successfully),
>>> meaning we should give them unlimited retries (which I think is a
>>> bad idea), otherwise (if they have zero retries or limited
>>> retries), in extreme conditions, what may happen is that error
>>> handler can recover everything successfully every time, but all
>>> these retries (say 3) still time out, which block the power
>>> management for too long (retries * 60 seconds) and, most
>>> important, when the last retry times out, scsi layer will
>>> anyways complete the PM request (even we return DID_IMM_RETRY),
>>> then we end up same - suspend/resume shall run concurrently with
>>> error handler and we couldn't recover saved PM errors.
>>
>> Hmm ... it is not clear to me why this behavior is considered a
>> problem?
>
> To me, task abort to PM requests does not worth being treated so
> differently, after all suspend/resume may fail due to any kinds of
> UFS errors (as I've explained so many times). My idea is to let PM
> requests fast fail (60 seconds has passed, a broken device maybe, we
> have reason to fail it since it is just a passthrough req) and
> schedule UFS error handler, UFS error handler shall proceed after
> suspend/resume fails out then start to recover everything in a safe
> environment. Is this way not working?

Hi Can,

Thank you for the clarification. As you probably know the power
management subsystem serializes runtime power management (RPM) and
system suspend callbacks. I was concerned about the consequences of a
failed RPM transition on system suspend and resume. Having taken a
closer look at the UFS driver, I see that failed RPM transitions do not
require special handling in the system suspend or resume callbacks. In
other words, I'm fine with the approach of failing PM requests fast.

Bart.

2021-06-23 01:39:27

by Can Guo

[permalink] [raw]
Subject: Re: [PATCH v3 8/9] scsi: ufs: Update the fast abort path in ufshcd_abort() for PM requests

Hi Bart,

On 2021-06-17 01:55, Bart Van Assche wrote:
> On 6/16/21 1:47 AM, Can Guo wrote:
>> On 2021-06-16 12:40, Bart Van Assche wrote:
>>> On 6/15/21 9:00 PM, Can Guo wrote:
>>>> 2. And say we want SCSI layer to resubmit PM requests to prevent
>>>> suspend/resume fail, we should keep retrying the PM requests (so
>>>> long as error handler can recover everything successfully),
>>>> meaning we should give them unlimited retries (which I think is a
>>>> bad idea), otherwise (if they have zero retries or limited
>>>> retries), in extreme conditions, what may happen is that error
>>>> handler can recover everything successfully every time, but all
>>>> these retries (say 3) still time out, which block the power
>>>> management for too long (retries * 60 seconds) and, most
>>>> important, when the last retry times out, scsi layer will
>>>> anyways complete the PM request (even we return DID_IMM_RETRY),
>>>> then we end up same - suspend/resume shall run concurrently with
>>>> error handler and we couldn't recover saved PM errors.
>>>
>>> Hmm ... it is not clear to me why this behavior is considered a
>>> problem?
>>
>> To me, task abort to PM requests does not worth being treated so
>> differently, after all suspend/resume may fail due to any kinds of
>> UFS errors (as I've explained so many times). My idea is to let PM
>> requests fast fail (60 seconds has passed, a broken device maybe, we
>> have reason to fail it since it is just a passthrough req) and
>> schedule UFS error handler, UFS error handler shall proceed after
>> suspend/resume fails out then start to recover everything in a safe
>> environment. Is this way not working?
> Hi Can,
>
> Thank you for the clarification. As you probably know the power
> management subsystem serializes runtime power management (RPM) and
> system suspend callbacks. I was concerned about the consequences of a
> failed RPM transition on system suspend and resume. Having taken a
> closer look at the UFS driver, I see that failed RPM transitions do not
> require special handling in the system suspend or resume callbacks. In
> other words, I'm fine with the approach of failing PM requests fast.
>

Thank you for the time and effort you spent on this series. I will
upload the next version to address your previous comments (hope I can
convince Trilok to pick these up).

Thanks,

Can Guo.

> Bart.