2020-10-24 15:47:00

by Jaegeuk Kim

[permalink] [raw]
Subject: [UFS v3] UFS fixes

Change log from v2:
- use active_req-- instead of __ufshcd_release to avoid UFS timeout

Change log from v1:
- remove clkgating_enable check in __ufshcd_release
- use __uhfshcd_release instead of active_req.



2020-10-24 15:47:02

by Jaegeuk Kim

[permalink] [raw]
Subject: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

From: Jaegeuk Kim <[email protected]>

When giving a stress test which enables/disables clkgating, we hit device
timeout sometimes. This patch avoids subtle racy condition to address it.

If we use __ufshcd_release(), I've seen that gate_work can be called in parallel
with ungate_work, which results in UFS timeout when doing hibern8.
Should avoid it.

Signed-off-by: Jaegeuk Kim <[email protected]>
---
drivers/scsi/ufs/ufshcd.c | 12 ++++++------
1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index b8f573a02713..e0b479f9eb8a 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -1807,19 +1807,19 @@ static ssize_t ufshcd_clkgate_enable_store(struct device *dev,
return -EINVAL;

value = !!value;
+
+ spin_lock_irqsave(hba->host->host_lock, flags);
if (value == hba->clk_gating.is_enabled)
goto out;

- if (value) {
- ufshcd_release(hba);
- } else {
- spin_lock_irqsave(hba->host->host_lock, flags);
+ if (value)
+ hba->clk_gating.active_reqs--;
+ else
hba->clk_gating.active_reqs++;
- spin_unlock_irqrestore(hba->host->host_lock, flags);
- }

hba->clk_gating.is_enabled = value;
out:
+ spin_unlock_irqrestore(hba->host->host_lock, flags);
return count;
}

--
2.29.0.rc1.297.gfa9743e501-goog

2020-10-26 06:59:39

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

On 10/26, Can Guo wrote:
> On 2020-10-24 23:06, Jaegeuk Kim wrote:
> > From: Jaegeuk Kim <[email protected]>
> >
> > When giving a stress test which enables/disables clkgating, we hit
> > device
> > timeout sometimes. This patch avoids subtle racy condition to address
> > it.
> >
> > If we use __ufshcd_release(), I've seen that gate_work can be called in
> > parallel
> > with ungate_work, which results in UFS timeout when doing hibern8.
> > Should avoid it.
> >
>
> I don't understand this comment. gate_work and ungate_work are queued on
> an ordered workqueue and an ordered workqueue executes at most one work item
> at any given time in the queued order. How can the two run in parallel?

When I hit UFS stuck, I saw this by clkgating tracepoint.

- REQ_CLK_OFF
- CLKS_OFF
- REQ_CLK_OFF
- REQ_CLKS_ON
..

By using active_req, I don't see any problem.

>
> Thanks,
>
> Can Guo.
>
> > Signed-off-by: Jaegeuk Kim <[email protected]>
> > ---
> > drivers/scsi/ufs/ufshcd.c | 12 ++++++------
> > 1 file changed, 6 insertions(+), 6 deletions(-)
> >
> > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > index b8f573a02713..e0b479f9eb8a 100644
> > --- a/drivers/scsi/ufs/ufshcd.c
> > +++ b/drivers/scsi/ufs/ufshcd.c
> > @@ -1807,19 +1807,19 @@ static ssize_t
> > ufshcd_clkgate_enable_store(struct device *dev,
> > return -EINVAL;
> >
> > value = !!value;
> > +
> > + spin_lock_irqsave(hba->host->host_lock, flags);
> > if (value == hba->clk_gating.is_enabled)
> > goto out;
> >
> > - if (value) {
> > - ufshcd_release(hba);
> > - } else {
> > - spin_lock_irqsave(hba->host->host_lock, flags);
> > + if (value)
> > + hba->clk_gating.active_reqs--;
> > + else
> > hba->clk_gating.active_reqs++;
> > - spin_unlock_irqrestore(hba->host->host_lock, flags);
> > - }
> >
> > hba->clk_gating.is_enabled = value;
> > out:
> > + spin_unlock_irqrestore(hba->host->host_lock, flags);
> > return count;
> > }

2020-10-26 07:32:15

by Can Guo

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

On 2020-10-24 23:06, Jaegeuk Kim wrote:
> From: Jaegeuk Kim <[email protected]>
>
> When giving a stress test which enables/disables clkgating, we hit
> device
> timeout sometimes. This patch avoids subtle racy condition to address
> it.
>
> If we use __ufshcd_release(), I've seen that gate_work can be called in
> parallel
> with ungate_work, which results in UFS timeout when doing hibern8.
> Should avoid it.
>

I don't understand this comment. gate_work and ungate_work are queued on
an ordered workqueue and an ordered workqueue executes at most one work
item
at any given time in the queued order. How can the two run in parallel?

Thanks,

Can Guo.

> Signed-off-by: Jaegeuk Kim <[email protected]>
> ---
> drivers/scsi/ufs/ufshcd.c | 12 ++++++------
> 1 file changed, 6 insertions(+), 6 deletions(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index b8f573a02713..e0b479f9eb8a 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -1807,19 +1807,19 @@ static ssize_t
> ufshcd_clkgate_enable_store(struct device *dev,
> return -EINVAL;
>
> value = !!value;
> +
> + spin_lock_irqsave(hba->host->host_lock, flags);
> if (value == hba->clk_gating.is_enabled)
> goto out;
>
> - if (value) {
> - ufshcd_release(hba);
> - } else {
> - spin_lock_irqsave(hba->host->host_lock, flags);
> + if (value)
> + hba->clk_gating.active_reqs--;
> + else
> hba->clk_gating.active_reqs++;
> - spin_unlock_irqrestore(hba->host->host_lock, flags);
> - }
>
> hba->clk_gating.is_enabled = value;
> out:
> + spin_unlock_irqrestore(hba->host->host_lock, flags);
> return count;
> }

2020-10-26 08:14:22

by Can Guo

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

On 2020-10-26 14:13, Jaegeuk Kim wrote:
> On 10/26, Can Guo wrote:
>> On 2020-10-24 23:06, Jaegeuk Kim wrote:
>> > From: Jaegeuk Kim <[email protected]>
>> >
>> > When giving a stress test which enables/disables clkgating, we hit
>> > device
>> > timeout sometimes. This patch avoids subtle racy condition to address
>> > it.
>> >
>> > If we use __ufshcd_release(), I've seen that gate_work can be called in
>> > parallel
>> > with ungate_work, which results in UFS timeout when doing hibern8.
>> > Should avoid it.
>> >
>>
>> I don't understand this comment. gate_work and ungate_work are queued
>> on
>> an ordered workqueue and an ordered workqueue executes at most one
>> work item
>> at any given time in the queued order. How can the two run in
>> parallel?
>
> When I hit UFS stuck, I saw this by clkgating tracepoint.
>
> - REQ_CLK_OFF
> - CLKS_OFF
> - REQ_CLK_OFF
> - REQ_CLKS_ON
> ..
>

I don't see how can you tell that the two works are running in parallel
just from above trace. May I know what is the exact error by "UFS
timeout
when doing hibern8"?

By using __ufshcd_release() here, I do see one potential issue if your
test
quickly toggles on/off of clk_gating - disable it, enable it, disable it
and
enable it, which will cause that __ufshcd_release() being called twice,
meaning
we queue two gate_works back to back. So can you try below code and let
me know
if it helps or not? I am OK with your current change, but I would like
to
understand the problem. Thanks.

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index 1791bce..3eee438 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -2271,6 +2271,8 @@ static void ufshcd_gate_work(struct work_struct
*work)
unsigned long flags;

spin_lock_irqsave(hba->host->host_lock, flags);
+ if (hba->clk_gating.state == CLKS_OFF)
+ goto rel_lock;
/*
* In case you are here to cancel this work the gating state
* would be marked as REQ_CLKS_ON. In this case save time by

Regards,

Can Guo.

> By using active_req, I don't see any problem.
>
>>
>> Thanks,
>>
>> Can Guo.
>>
>> > Signed-off-by: Jaegeuk Kim <[email protected]>
>> > ---
>> > drivers/scsi/ufs/ufshcd.c | 12 ++++++------
>> > 1 file changed, 6 insertions(+), 6 deletions(-)
>> >
>> > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> > index b8f573a02713..e0b479f9eb8a 100644
>> > --- a/drivers/scsi/ufs/ufshcd.c
>> > +++ b/drivers/scsi/ufs/ufshcd.c
>> > @@ -1807,19 +1807,19 @@ static ssize_t
>> > ufshcd_clkgate_enable_store(struct device *dev,
>> > return -EINVAL;
>> >
>> > value = !!value;
>> > +
>> > + spin_lock_irqsave(hba->host->host_lock, flags);
>> > if (value == hba->clk_gating.is_enabled)
>> > goto out;
>> >
>> > - if (value) {
>> > - ufshcd_release(hba);
>> > - } else {
>> > - spin_lock_irqsave(hba->host->host_lock, flags);
>> > + if (value)
>> > + hba->clk_gating.active_reqs--;
>> > + else
>> > hba->clk_gating.active_reqs++;
>> > - spin_unlock_irqrestore(hba->host->host_lock, flags);
>> > - }
>> >
>> > hba->clk_gating.is_enabled = value;
>> > out:
>> > + spin_unlock_irqrestore(hba->host->host_lock, flags);
>> > return count;
>> > }

2020-10-27 07:19:32

by Jaegeuk Kim

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

On 10/26, Can Guo wrote:
> On 2020-10-26 14:13, Jaegeuk Kim wrote:
> > On 10/26, Can Guo wrote:
> > > On 2020-10-24 23:06, Jaegeuk Kim wrote:
> > > > From: Jaegeuk Kim <[email protected]>
> > > >
> > > > When giving a stress test which enables/disables clkgating, we hit
> > > > device
> > > > timeout sometimes. This patch avoids subtle racy condition to address
> > > > it.
> > > >
> > > > If we use __ufshcd_release(), I've seen that gate_work can be called in
> > > > parallel
> > > > with ungate_work, which results in UFS timeout when doing hibern8.
> > > > Should avoid it.
> > > >
> > >
> > > I don't understand this comment. gate_work and ungate_work are
> > > queued on
> > > an ordered workqueue and an ordered workqueue executes at most one
> > > work item
> > > at any given time in the queued order. How can the two run in
> > > parallel?
> >
> > When I hit UFS stuck, I saw this by clkgating tracepoint.
> >
> > - REQ_CLK_OFF
> > - CLKS_OFF
> > - REQ_CLK_OFF
> > - REQ_CLKS_ON
> > ..
> >
>
> I don't see how can you tell that the two works are running in parallel
> just from above trace. May I know what is the exact error by "UFS timeout
> when doing hibern8"?
>
> By using __ufshcd_release() here, I do see one potential issue if your test
> quickly toggles on/off of clk_gating - disable it, enable it, disable it and
> enable it, which will cause that __ufshcd_release() being called twice,
> meaning
> we queue two gate_works back to back. So can you try below code and let me
> know
> if it helps or not? I am OK with your current change, but I would like to
> understand the problem. Thanks.
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index 1791bce..3eee438 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -2271,6 +2271,8 @@ static void ufshcd_gate_work(struct work_struct *work)
> unsigned long flags;
>
> spin_lock_irqsave(hba->host->host_lock, flags);
> + if (hba->clk_gating.state == CLKS_OFF)
> + goto rel_lock;
> /*
> * In case you are here to cancel this work the gating state
> * would be marked as REQ_CLKS_ON. In this case save time by

This doesn't help. So, I checked this back again, and, like what you said, now
suspect __ufshcd_release() which changed state to REQ_CLKS_OFF on CLKS_OFF.

With the below change, I can see the issue anymore. Let me send v4.

---
drivers/scsi/ufs/ufshcd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
index b8f573a02713..cc8d5f0c3fdc 100644
--- a/drivers/scsi/ufs/ufshcd.c
+++ b/drivers/scsi/ufs/ufshcd.c
@@ -1745,7 +1745,8 @@ static void __ufshcd_release(struct ufs_hba *hba)
if (hba->clk_gating.active_reqs || hba->clk_gating.is_suspended ||
hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL ||
ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks ||
- hba->active_uic_cmd || hba->uic_async_done)
+ hba->active_uic_cmd || hba->uic_async_done ||
+ hba->clk_gating.state == CLKS_OFF)
return;

hba->clk_gating.state = REQ_CLKS_OFF;
--
2.29.0.rc1.297.gfa9743e501-goog


>
> Regards,
>
> Can Guo.
>
> > By using active_req, I don't see any problem.
> >
> > >
> > > Thanks,
> > >
> > > Can Guo.
> > >
> > > > Signed-off-by: Jaegeuk Kim <[email protected]>
> > > > ---
> > > > drivers/scsi/ufs/ufshcd.c | 12 ++++++------
> > > > 1 file changed, 6 insertions(+), 6 deletions(-)
> > > >
> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> > > > index b8f573a02713..e0b479f9eb8a 100644
> > > > --- a/drivers/scsi/ufs/ufshcd.c
> > > > +++ b/drivers/scsi/ufs/ufshcd.c
> > > > @@ -1807,19 +1807,19 @@ static ssize_t
> > > > ufshcd_clkgate_enable_store(struct device *dev,
> > > > return -EINVAL;
> > > >
> > > > value = !!value;
> > > > +
> > > > + spin_lock_irqsave(hba->host->host_lock, flags);
> > > > if (value == hba->clk_gating.is_enabled)
> > > > goto out;
> > > >
> > > > - if (value) {
> > > > - ufshcd_release(hba);
> > > > - } else {
> > > > - spin_lock_irqsave(hba->host->host_lock, flags);
> > > > + if (value)
> > > > + hba->clk_gating.active_reqs--;
> > > > + else
> > > > hba->clk_gating.active_reqs++;
> > > > - spin_unlock_irqrestore(hba->host->host_lock, flags);
> > > > - }
> > > >
> > > > hba->clk_gating.is_enabled = value;
> > > > out:
> > > > + spin_unlock_irqrestore(hba->host->host_lock, flags);
> > > > return count;
> > > > }

2020-10-27 14:00:14

by Can Guo

[permalink] [raw]
Subject: Re: [PATCH v3 1/5] scsi: ufs: atomic update for clkgating_enable

On 2020-10-27 03:48, Jaegeuk Kim wrote:
> On 10/26, Can Guo wrote:
>> On 2020-10-26 14:13, Jaegeuk Kim wrote:
>> > On 10/26, Can Guo wrote:
>> > > On 2020-10-24 23:06, Jaegeuk Kim wrote:
>> > > > From: Jaegeuk Kim <[email protected]>
>> > > >
>> > > > When giving a stress test which enables/disables clkgating, we hit
>> > > > device
>> > > > timeout sometimes. This patch avoids subtle racy condition to address
>> > > > it.
>> > > >
>> > > > If we use __ufshcd_release(), I've seen that gate_work can be called in
>> > > > parallel
>> > > > with ungate_work, which results in UFS timeout when doing hibern8.
>> > > > Should avoid it.
>> > > >
>> > >
>> > > I don't understand this comment. gate_work and ungate_work are
>> > > queued on
>> > > an ordered workqueue and an ordered workqueue executes at most one
>> > > work item
>> > > at any given time in the queued order. How can the two run in
>> > > parallel?
>> >
>> > When I hit UFS stuck, I saw this by clkgating tracepoint.
>> >
>> > - REQ_CLK_OFF
>> > - CLKS_OFF
>> > - REQ_CLK_OFF
>> > - REQ_CLKS_ON
>> > ..
>> >
>>
>> I don't see how can you tell that the two works are running in
>> parallel
>> just from above trace. May I know what is the exact error by "UFS
>> timeout
>> when doing hibern8"?
>>
>> By using __ufshcd_release() here, I do see one potential issue if your
>> test
>> quickly toggles on/off of clk_gating - disable it, enable it, disable
>> it and
>> enable it, which will cause that __ufshcd_release() being called
>> twice,
>> meaning
>> we queue two gate_works back to back. So can you try below code and
>> let me
>> know
>> if it helps or not? I am OK with your current change, but I would like
>> to
>> understand the problem. Thanks.
>>
>> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> index 1791bce..3eee438 100644
>> --- a/drivers/scsi/ufs/ufshcd.c
>> +++ b/drivers/scsi/ufs/ufshcd.c
>> @@ -2271,6 +2271,8 @@ static void ufshcd_gate_work(struct work_struct
>> *work)
>> unsigned long flags;
>>
>> spin_lock_irqsave(hba->host->host_lock, flags);
>> + if (hba->clk_gating.state == CLKS_OFF)
>> + goto rel_lock;
>> /*
>> * In case you are here to cancel this work the gating state
>> * would be marked as REQ_CLKS_ON. In this case save time by
>
> This doesn't help. So, I checked this back again, and, like what you
> said, now
> suspect __ufshcd_release() which changed state to REQ_CLKS_OFF on
> CLKS_OFF.
>

Aha, sorry that I gave the right analysis but wrong fix - my check won't
help
since it is checking CLKS_OFF, but at that moment it has become
CLKS_REQ_OFF.
Your fix is fulfiling the right purpose.

Thanks,

Can Guo.


> With the below change, I can see the issue anymore. Let me send v4.
>
> ---
> drivers/scsi/ufs/ufshcd.c | 3 ++-
> 1 file changed, 2 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
> index b8f573a02713..cc8d5f0c3fdc 100644
> --- a/drivers/scsi/ufs/ufshcd.c
> +++ b/drivers/scsi/ufs/ufshcd.c
> @@ -1745,7 +1745,8 @@ static void __ufshcd_release(struct ufs_hba *hba)
> if (hba->clk_gating.active_reqs || hba->clk_gating.is_suspended ||
> hba->ufshcd_state != UFSHCD_STATE_OPERATIONAL ||
> ufshcd_any_tag_in_use(hba) || hba->outstanding_tasks ||
> - hba->active_uic_cmd || hba->uic_async_done)
> + hba->active_uic_cmd || hba->uic_async_done ||
> + hba->clk_gating.state == CLKS_OFF)
> return;
>
> hba->clk_gating.state = REQ_CLKS_OFF;
> --
> 2.29.0.rc1.297.gfa9743e501-goog
>
>
>>
>> Regards,
>>
>> Can Guo.
>>
>> > By using active_req, I don't see any problem.
>> >
>> > >
>> > > Thanks,
>> > >
>> > > Can Guo.
>> > >
>> > > > Signed-off-by: Jaegeuk Kim <[email protected]>
>> > > > ---
>> > > > drivers/scsi/ufs/ufshcd.c | 12 ++++++------
>> > > > 1 file changed, 6 insertions(+), 6 deletions(-)
>> > > >
>> > > > diff --git a/drivers/scsi/ufs/ufshcd.c b/drivers/scsi/ufs/ufshcd.c
>> > > > index b8f573a02713..e0b479f9eb8a 100644
>> > > > --- a/drivers/scsi/ufs/ufshcd.c
>> > > > +++ b/drivers/scsi/ufs/ufshcd.c
>> > > > @@ -1807,19 +1807,19 @@ static ssize_t
>> > > > ufshcd_clkgate_enable_store(struct device *dev,
>> > > > return -EINVAL;
>> > > >
>> > > > value = !!value;
>> > > > +
>> > > > + spin_lock_irqsave(hba->host->host_lock, flags);
>> > > > if (value == hba->clk_gating.is_enabled)
>> > > > goto out;
>> > > >
>> > > > - if (value) {
>> > > > - ufshcd_release(hba);
>> > > > - } else {
>> > > > - spin_lock_irqsave(hba->host->host_lock, flags);
>> > > > + if (value)
>> > > > + hba->clk_gating.active_reqs--;
>> > > > + else
>> > > > hba->clk_gating.active_reqs++;
>> > > > - spin_unlock_irqrestore(hba->host->host_lock, flags);
>> > > > - }
>> > > >
>> > > > hba->clk_gating.is_enabled = value;
>> > > > out:
>> > > > + spin_unlock_irqrestore(hba->host->host_lock, flags);
>> > > > return count;
>> > > > }