2022-10-14 10:57:13

by Tushar Nimkar

[permalink] [raw]
Subject: PM-runtime: supplier looses track of consumer during probe

Hi linux-pm/linux-scsi,

We have included fix [1] but continuing to observe supplier loosing
track of consumer.

Below is trace snippet with additional logging added.
Here consumer is 0:0:0:0 and supplier is 0:0:0:49488. In Last three
lines consumer resume is completed but supplier is put down.

kworker/u16:0-7 0.880014: rpm_idle: 0:0:0:0 flags-4
cnt-0 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:0-7 0.880017: bprint:
pm_runtime_mark_last_busy.46700: :#205 dev_name:0:0:0:0
ktime_get_mono_fast_ns():852365364
kworker/u16:0-7 0.880019: rpm_suspend: 0:0:0:0 flags-8
cnt-0 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:0-7 0.880022: bprint: pm_runtime_put_noidle.44083:
pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:5
decremented usage count
kworker/u16:0-7 0.880023: bprint: pm_runtime_put_noidle.44083:
pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4
decremented usage count
kworker/u16:2-142 0.880024: rpm_resume: 0:0:0:0 flags-4
cnt-1 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:0-7 0.880025: bprint: __rpm_put_suppliers:
__rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488
usage_count:4
kworker/u16:0-7 0.880061: rpm_idle: 0:0:0:49488
flags-1 cnt-4 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:0-7 0.880062: rpm_return_int:
rpm_idle+0x16c:0:0:0:49488 ret=-11
kworker/u16:2-142 0.880062: bprint: __pm_runtime_resume:
__pm_runtime_resume: #1147 dev_name:0:0:0:49488 dev usage_count:5
incremented usage count
kworker/u16:2-142 0.880063: rpm_resume: 0:0:0:49488
flags-4 cnt-5 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:2-142 0.880063: rpm_return_int:
rpm_resume+0x690:0:0:0:49488 ret=1
kworker/u16:0-7 0.880063: rpm_return_int:
rpm_suspend+0x68:0:0:0:0 ret=0
kworker/u16:2-142 0.880063: bprint: rpm_get_suppliers:
rpm_get_suppliers: #300 consumer:0:0:0:0 supplier:0:0:0:49488 usage_count:5
kworker/u16:0-7 0.880065: bprint: pm_runtime_put_noidle.44083:
pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4
decremented usage count
kworker/u16:2-142 0.880065: bprint:
pm_runtime_mark_last_busy.44088: :#205 dev_name:0:0:0:0
ktime_get_mono_fast_ns():852413749
kworker/u16:2-142 0.880065: rpm_idle: 0:0:0:0 flags-1
cnt-1 dep-0 auto-1 p-0 irq-0 child-0
kworker/u16:2-142 0.880065: rpm_return_int: rpm_idle+0x16c:0:0:0:0
ret=-11
kworker/u16:0-7 0.880066: bprint: __rpm_put_suppliers:
__rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488
usage_count:4
kworker/u16:0-7 0.880067: rpm_return_int: rpm_idle+0x16c:0:0:0:0
ret=-16
kworker/u16:2-142 0.880067: rpm_return_int:
rpm_resume+0x690:0:0:0:0 ret=0

Upon looking into this further the race looks to be in below two
processes running in parallel and process-1 is putting down supplier at
[C] because process-2 is setting runtime_status as resuming at [D].

Also as per runtime PM documentation
In order to use autosuspend, subsystems or drivers must call
pm_runtime_use_autosuspend(), and thereafter they should use the various
`*_autosuspend()` helper functions...

It was also observed that *_autosuspend() API at point [A] was invoked
without first invoking pm_runtime_use_autosuspend() which return
expiration as zero at point [B] and proceeds ahead for immediate runtime
suspend of device which seems lead to this race condition.

Process -1
ufshcd_async_scan context (process 1)
scsi_autopm_put_device() //0:0:0:0
pm_runtime_put_sync()
__pm_runtime_idle()
rpm_idle() -- RPM_GET_PUT(4)
__rpm_callback
scsi_runtime_idle()
pm_runtime_mark_last_busy()
pm_runtime_autosuspend() --[A]
rpm_suspend() -- RPM_AUTO(8)
pm_runtime_autosuspend_expiration()
use_autosuspend is false return 0 --- [B]
__update_runtime_status to RPM_SUSPENDING
__rpm_callback()
__rpm_put_suppliers(dev, false)
__update_runtime_status to RPM_SUSPENDED
rpm_suspend_suppliers()
rpm_idle() for supplier -- RPM_ASYNC(1) return
(-EAGAIN) [ Other consumer active for supplier]
rpm_suspend() – END with return=0
scsi_runtime_idle() END return (-EBUSY) always.
/* Do that if resume fails too.*/
(dev->power.runtime_status == RPM_RESUMING && retval))) return -EBUSY
__rpm_put_suppliers(dev, false) -- [C]
rpm_idle() END return -EBUSY

Process -2
sd_probe context (Process 2)
scsi_autopm_get_device() //0:0:0:0
__pm_runtime_resume(RPM_GET_PUT)
rpm_resume() -- RPM_GET_PUT(4)
__update_runtime_status to RPM_RESUMING --[D]
__rpm_callback()
rpm_get_suppliers()
__pm_runtime_resume() - RPM_GET_PUT(4) – supplier
rpm_resume() for supplier.
__update_runtime_status to RPM_ACTIVE
pm_runtime_mark_last_busy ()
rpm_resume() END return 0

Can you please provide your suggestions on addressing above race condition?

This is also reported at [2].

[1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
[2]: https://lkml.org/lkml/2022/10/12/259

Thanks,
Tushar Nimkar


2022-11-04 10:37:56

by Tushar Nimkar

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

Hi linux-pm/linux-scsi,

Gentle reminder!

Can you please provide your suggestions on below race?

Thanks, Tushar Nimkar

On 10/14/2022 4:20 PM, Tushar Nimkar wrote:
> Hi linux-pm/linux-scsi,
>
> We have included fix [1] but continuing to observe supplier loosing
> track of consumer.
>
> Below is trace snippet with additional logging added.
> Here consumer is 0:0:0:0 and supplier is 0:0:0:49488. In Last three
> lines consumer resume is completed but supplier is put down.
>
>    kworker/u16:0-7     0.880014: rpm_idle:             0:0:0:0 flags-4
> cnt-0  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:0-7     0.880017: bprint:
> pm_runtime_mark_last_busy.46700: :#205 dev_name:0:0:0:0
> ktime_get_mono_fast_ns():852365364
>    kworker/u16:0-7     0.880019: rpm_suspend:          0:0:0:0 flags-8
> cnt-0  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:0-7     0.880022: bprint: pm_runtime_put_noidle.44083:
> pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:5
> decremented usage count
>    kworker/u16:0-7     0.880023: bprint: pm_runtime_put_noidle.44083:
> pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4
> decremented usage count
>    kworker/u16:2-142   0.880024: rpm_resume:           0:0:0:0 flags-4
> cnt-1  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:0-7     0.880025: bprint: __rpm_put_suppliers:
> __rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488
> usage_count:4
>    kworker/u16:0-7     0.880061: rpm_idle:             0:0:0:49488
> flags-1 cnt-4  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:0-7     0.880062: rpm_return_int:
> rpm_idle+0x16c:0:0:0:49488 ret=-11
>    kworker/u16:2-142   0.880062: bprint: __pm_runtime_resume:
> __pm_runtime_resume: #1147 dev_name:0:0:0:49488 dev usage_count:5
> incremented usage count
>    kworker/u16:2-142   0.880063: rpm_resume:           0:0:0:49488
> flags-4 cnt-5  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:2-142   0.880063: rpm_return_int:
> rpm_resume+0x690:0:0:0:49488 ret=1
>    kworker/u16:0-7     0.880063: rpm_return_int:
> rpm_suspend+0x68:0:0:0:0 ret=0
>    kworker/u16:2-142   0.880063: bprint: rpm_get_suppliers:
> rpm_get_suppliers: #300 consumer:0:0:0:0 supplier:0:0:0:49488 usage_count:5
>    kworker/u16:0-7     0.880065: bprint: pm_runtime_put_noidle.44083:
> pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4
> decremented usage count
>    kworker/u16:2-142   0.880065: bprint:
> pm_runtime_mark_last_busy.44088: :#205 dev_name:0:0:0:0
> ktime_get_mono_fast_ns():852413749
>    kworker/u16:2-142   0.880065: rpm_idle:             0:0:0:0 flags-1
> cnt-1  dep-0  auto-1 p-0 irq-0 child-0
>    kworker/u16:2-142   0.880065: rpm_return_int: rpm_idle+0x16c:0:0:0:0
> ret=-11
>    kworker/u16:0-7     0.880066: bprint: __rpm_put_suppliers:
> __rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488
> usage_count:4
>    kworker/u16:0-7     0.880067: rpm_return_int: rpm_idle+0x16c:0:0:0:0
> ret=-16
>    kworker/u16:2-142   0.880067: rpm_return_int:
> rpm_resume+0x690:0:0:0:0 ret=0
>
> Upon looking into this further the race looks to be in below two
> processes running in parallel and process-1 is putting down supplier at
> [C] because process-2 is setting runtime_status as resuming at [D].
>
> Also as per runtime PM documentation
> In order to use autosuspend, subsystems or drivers must call
> pm_runtime_use_autosuspend(), and thereafter they should use the various
> `*_autosuspend()` helper functions...
>
> It was also observed that *_autosuspend() API at point [A] was invoked
> without first invoking pm_runtime_use_autosuspend() which return
> expiration as zero at point [B] and proceeds ahead for immediate runtime
> suspend of device which seems lead to this race condition.
>
> Process -1
> ufshcd_async_scan context (process 1)
> scsi_autopm_put_device() //0:0:0:0
> pm_runtime_put_sync()
> __pm_runtime_idle()
> rpm_idle() -- RPM_GET_PUT(4)
>     __rpm_callback
>         scsi_runtime_idle()
>             pm_runtime_mark_last_busy()
>             pm_runtime_autosuspend()  --[A]
>                 rpm_suspend() -- RPM_AUTO(8)
>                     pm_runtime_autosuspend_expiration()
> use_autosuspend    is false return 0   --- [B]
>                         __update_runtime_status to RPM_SUSPENDING
>                     __rpm_callback()
>                         __rpm_put_suppliers(dev, false)
>                     __update_runtime_status to RPM_SUSPENDED
>                 rpm_suspend_suppliers()
>                     rpm_idle() for supplier -- RPM_ASYNC(1) return
> (-EAGAIN) [ Other consumer active for supplier]
>                 rpm_suspend() – END with return=0
>         scsi_runtime_idle() END return (-EBUSY) always.
>      /* Do that if resume fails too.*/
>     (dev->power.runtime_status == RPM_RESUMING && retval)))  return -EBUSY
>         __rpm_put_suppliers(dev, false)  -- [C]
> rpm_idle() END return -EBUSY
>
> Process -2
> sd_probe context (Process 2)
> scsi_autopm_get_device() //0:0:0:0
> __pm_runtime_resume(RPM_GET_PUT)
> rpm_resume() -- RPM_GET_PUT(4)
>     __update_runtime_status to RPM_RESUMING --[D]
>     __rpm_callback()
>         rpm_get_suppliers()
>             __pm_runtime_resume() - RPM_GET_PUT(4) – supplier
>                 rpm_resume() for supplier.
>     __update_runtime_status to RPM_ACTIVE
>     pm_runtime_mark_last_busy ()
> rpm_resume() END return 0
>
> Can you please provide your suggestions on addressing above race condition?
>
> This is also reported at [2].
>
> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
> [2]: https://lkml.org/lkml/2022/10/12/259
>
> Thanks,
> Tushar Nimkar

2022-11-18 15:34:04

by Adrian Hunter

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

On 4/11/22 11:19, Tushar Nimkar wrote:
> Hi linux-pm/linux-scsi,
>
> Gentle reminder!
>
> Can you please provide your suggestions on below race?
>
> Thanks, Tushar Nimkar
>
> On 10/14/2022 4:20 PM, Tushar Nimkar wrote:
>> Hi linux-pm/linux-scsi,
>>
>> We have included fix [1] but continuing to observe supplier loosing track of consumer.
>>
>> Below is trace snippet with additional logging added.
>> Here consumer is 0:0:0:0 and supplier is 0:0:0:49488. In Last three lines consumer resume is completed but supplier is put down.
>>
>>     kworker/u16:0-7     0.880014: rpm_idle:             0:0:0:0 flags-4 cnt-0  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:0-7     0.880017: bprint: pm_runtime_mark_last_busy.46700: :#205 dev_name:0:0:0:0 ktime_get_mono_fast_ns():852365364
>>     kworker/u16:0-7     0.880019: rpm_suspend:          0:0:0:0 flags-8 cnt-0  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:0-7     0.880022: bprint: pm_runtime_put_noidle.44083: pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:5 decremented usage count
>>     kworker/u16:0-7     0.880023: bprint: pm_runtime_put_noidle.44083: pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4 decremented usage count
>>     kworker/u16:2-142   0.880024: rpm_resume:           0:0:0:0 flags-4 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:0-7     0.880025: bprint: __rpm_put_suppliers: __rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488 usage_count:4
>>     kworker/u16:0-7     0.880061: rpm_idle:             0:0:0:49488 flags-1 cnt-4  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:0-7     0.880062: rpm_return_int: rpm_idle+0x16c:0:0:0:49488 ret=-11
>>     kworker/u16:2-142   0.880062: bprint: __pm_runtime_resume: __pm_runtime_resume: #1147 dev_name:0:0:0:49488 dev usage_count:5 incremented usage count
>>     kworker/u16:2-142   0.880063: rpm_resume:           0:0:0:49488 flags-4 cnt-5  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:2-142   0.880063: rpm_return_int: rpm_resume+0x690:0:0:0:49488 ret=1
>>     kworker/u16:0-7     0.880063: rpm_return_int: rpm_suspend+0x68:0:0:0:0 ret=0
>>     kworker/u16:2-142   0.880063: bprint: rpm_get_suppliers: rpm_get_suppliers: #300 consumer:0:0:0:0 supplier:0:0:0:49488 usage_count:5
>>     kworker/u16:0-7     0.880065: bprint: pm_runtime_put_noidle.44083: pm_runtime_put_noidle: #112 dev_name:0:0:0:49488 dev usage_count:4 decremented usage count
>>     kworker/u16:2-142   0.880065: bprint: pm_runtime_mark_last_busy.44088: :#205 dev_name:0:0:0:0 ktime_get_mono_fast_ns():852413749
>>     kworker/u16:2-142   0.880065: rpm_idle:             0:0:0:0 flags-1 cnt-1  dep-0  auto-1 p-0 irq-0 child-0
>>     kworker/u16:2-142   0.880065: rpm_return_int: rpm_idle+0x16c:0:0:0:0 ret=-11
>>     kworker/u16:0-7     0.880066: bprint: __rpm_put_suppliers: __rpm_put_suppliers: #348 consumer:0:0:0:0 supplier:0:0:0:49488 usage_count:4
>>     kworker/u16:0-7     0.880067: rpm_return_int: rpm_idle+0x16c:0:0:0:0 ret=-16
>>     kworker/u16:2-142   0.880067: rpm_return_int: rpm_resume+0x690:0:0:0:0 ret=0
>>
>> Upon looking into this further the race looks to be in below two processes running in parallel and process-1 is putting down supplier at [C] because process-2 is setting runtime_status as resuming at [D].
>>
>> Also as per runtime PM documentation
>> In order to use autosuspend, subsystems or drivers must call
>> pm_runtime_use_autosuspend(), and thereafter they should use the various `*_autosuspend()` helper functions...
>>
>> It was also observed that *_autosuspend() API at point [A] was invoked without first invoking pm_runtime_use_autosuspend() which return expiration as zero at point [B] and proceeds ahead for immediate runtime suspend of device which seems lead to this race condition.
>>
>> Process -1
>> ufshcd_async_scan context (process 1)
>> scsi_autopm_put_device() //0:0:0:0

I am having trouble following your description. What function is calling
scsi_autopm_put_device() here?

>> pm_runtime_put_sync()
>> __pm_runtime_idle()
>> rpm_idle() -- RPM_GET_PUT(4)
>>      __rpm_callback
>>          scsi_runtime_idle()
>>              pm_runtime_mark_last_busy()
>>              pm_runtime_autosuspend()  --[A]
>>                  rpm_suspend() -- RPM_AUTO(8)
>>                      pm_runtime_autosuspend_expiration() use_autosuspend    is false return 0   --- [B]
>>                          __update_runtime_status to RPM_SUSPENDING
>>                      __rpm_callback()
>>                          __rpm_put_suppliers(dev, false)
>>                      __update_runtime_status to RPM_SUSPENDED
>>                  rpm_suspend_suppliers()
>>                      rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
>>                  rpm_suspend() – END with return=0
>>          scsi_runtime_idle() END return (-EBUSY) always.

Not following here either. Which device is EBUSY and why?

>>       /* Do that if resume fails too.*/
>>      (dev->power.runtime_status == RPM_RESUMING && retval)))  return -EBUSY
>>          __rpm_put_suppliers(dev, false)  -- [C]
>> rpm_idle() END return -EBUSY
>>
>> Process -2
>> sd_probe context (Process 2)
>> scsi_autopm_get_device() //0:0:0:0
>> __pm_runtime_resume(RPM_GET_PUT)
>> rpm_resume() -- RPM_GET_PUT(4)
>>      __update_runtime_status to RPM_RESUMING --[D]
>>      __rpm_callback()
>>          rpm_get_suppliers()
>>              __pm_runtime_resume() - RPM_GET_PUT(4) – supplier
>>                  rpm_resume() for supplier.
>>      __update_runtime_status to RPM_ACTIVE
>>      pm_runtime_mark_last_busy ()
>> rpm_resume() END return 0
>>
>> Can you please provide your suggestions on addressing above race condition?
>>
>> This is also reported at [2].
>>
>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>> [2]: https://lkml.org/lkml/2022/10/12/259
>>
>> Thanks,
>> Tushar Nimkar


2022-11-21 06:39:10

by Tushar Nimkar

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

Hi Adrian,

On 11/18/2022 8:25 PM, Adrian Hunter wrote:
> On 4/11/22 11:19, Tushar Nimkar wrote:
>> Hi linux-pm/linux-scsi,

>>> Process -1
>>> ufshcd_async_scan context (process 1)
>>> scsi_autopm_put_device() //0:0:0:0
>
> I am having trouble following your description. What function is calling
> scsi_autopm_put_device() here?
>
Below is flow which calls scsi_autopm_put_device()
Process -1
ufshcd_async_scan()
scsi_probe_and_add_lun()
scsi_add_lun()
slave_configure()
scsi_sysfs_add_sdev()
scsi_autopm_get_device()
device_add()     <- invoked [Process 2] sd_probe()
scsi_autopm_put_device()

>>> pm_runtime_put_sync()
>>> __pm_runtime_idle()
>>> rpm_idle() -- RPM_GET_PUT(4)
>>>      __rpm_callback
>>>          scsi_runtime_idle()
>>>              pm_runtime_mark_last_busy()
>>>              pm_runtime_autosuspend()  --[A]
>>>                  rpm_suspend() -- RPM_AUTO(8)
>>>                      pm_runtime_autosuspend_expiration() use_autosuspend    is false return 0   --- [B]
>>>                          __update_runtime_status to RPM_SUSPENDING
>>>                      __rpm_callback()
>>>                          __rpm_put_suppliers(dev, false)
>>>                      __update_runtime_status to RPM_SUSPENDED
>>>                  rpm_suspend_suppliers()
>>>                      rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
>>>                  rpm_suspend() – END with return=0
>>>          scsi_runtime_idle() END return (-EBUSY) always.
>
> Not following here either. Which device is EBUSY and why?

scsi_runtime_idle() return -EBUSY always [3]
Storage/scsi team can better explain -EBUSY implementation.

[3]
https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210


>>>
>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>>> [2]: https://lkml.org/lkml/2022/10/12/259
>>>
>>> Thanks,
>>> Tushar Nimkar
>
Thanks,
Tushar Nimkar

2022-11-29 17:43:16

by Nitin Rawat

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

Hi Adrian,

On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
> Hi Adrian,
>
> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
>> On 4/11/22 11:19, Tushar Nimkar wrote:
>>> Hi linux-pm/linux-scsi,
>
>>>> Process -1
>>>> ufshcd_async_scan context (process 1)
>>>> scsi_autopm_put_device() //0:0:0:0
>>
>> I am having trouble following your description.  What function is calling
>> scsi_autopm_put_device() here?
>>
> Below is flow which calls scsi_autopm_put_device()
> Process -1
> ufshcd_async_scan()
>     scsi_probe_and_add_lun()
>         scsi_add_lun()
>             slave_configure()
>                 scsi_sysfs_add_sdev()
>                     scsi_autopm_get_device()
>                         device_add()     <- invoked [Process 2] sd_probe()
>                             scsi_autopm_put_device()
>
>>>> pm_runtime_put_sync()
>>>> __pm_runtime_idle()
>>>> rpm_idle() -- RPM_GET_PUT(4)
>>>>       __rpm_callback
>>>>           scsi_runtime_idle()
>>>>               pm_runtime_mark_last_busy()
>>>>               pm_runtime_autosuspend()  --[A]
>>>>                   rpm_suspend() -- RPM_AUTO(8)
>>>>                       pm_runtime_autosuspend_expiration()
>>>> use_autosuspend    is false return 0   --- [B]
>>>>                           __update_runtime_status to RPM_SUSPENDING
>>>>                       __rpm_callback()
>>>>                           __rpm_put_suppliers(dev, false)
>>>>                       __update_runtime_status to RPM_SUSPENDED
>>>>                   rpm_suspend_suppliers()
>>>>                       rpm_idle() for supplier -- RPM_ASYNC(1) return
>>>> (-EAGAIN) [ Other consumer active for supplier]
>>>>                   rpm_suspend() – END with return=0
>>>>           scsi_runtime_idle() END return (-EBUSY) always.
>>
>> Not following here either.  Which device is EBUSY and why?
>
> scsi_runtime_idle() return -EBUSY always [3]
> Storage/scsi team can better explain -EBUSY implementation.

EBUSY is returned from below code for consumer dev 0:0:0:0.
scsi_runtime_idle is called from scsi_autopm_put_device which inturn is
called from ufshcd_async_scan (Process 1 as per above call stack)
static int scsi_runtime_idle(struct device *dev)
{
:

if (scsi_is_sdev_device(dev)) {
pm_runtime_mark_last_busy(dev);
pm_runtime_autosuspend(dev);
return -EBUSY; ---> EBUSY returned from here.
}


}

>
> [3]
> https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
>
>
>>>>
>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>>>> [2]: https://lkml.org/lkml/2022/10/12/259
>>>>
>>>> Thanks,
>>>> Tushar Nimkar
>>
> Thanks,
> Tushar Nimkar

Thanks,
Nitin

2022-12-01 14:06:13

by Adrian Hunter

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

On 29/11/22 18:56, Nitin Rawat wrote:
> Hi Adrian,
>
> On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
>> Hi Adrian,
>>
>> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
>>> On 4/11/22 11:19, Tushar Nimkar wrote:
>>>> Hi linux-pm/linux-scsi,
>>
>>>>> Process -1
>>>>> ufshcd_async_scan context (process 1)
>>>>> scsi_autopm_put_device() //0:0:0:0
>>>
>>> I am having trouble following your description.  What function is calling
>>> scsi_autopm_put_device() here?
>>>
>> Below is flow which calls scsi_autopm_put_device()
>> Process -1
>> ufshcd_async_scan()
>>      scsi_probe_and_add_lun()
>>          scsi_add_lun()
>>              slave_configure()
>>                  scsi_sysfs_add_sdev()
>>                      scsi_autopm_get_device()
>>                          device_add()     <- invoked [Process 2] sd_probe()
>>                              scsi_autopm_put_device()
>>
>>>>> pm_runtime_put_sync()
>>>>> __pm_runtime_idle()
>>>>> rpm_idle() -- RPM_GET_PUT(4)
>>>>>       __rpm_callback
>>>>>           scsi_runtime_idle()
>>>>>               pm_runtime_mark_last_busy()
>>>>>               pm_runtime_autosuspend()  --[A]
>>>>>                   rpm_suspend() -- RPM_AUTO(8)
>>>>>                       pm_runtime_autosuspend_expiration() use_autosuspend    is false return 0   --- [B]
>>>>>                           __update_runtime_status to RPM_SUSPENDING
>>>>>                       __rpm_callback()
>>>>>                           __rpm_put_suppliers(dev, false)
>>>>>                       __update_runtime_status to RPM_SUSPENDED
>>>>>                   rpm_suspend_suppliers()
>>>>>                       rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
>>>>>                   rpm_suspend() – END with return=0
>>>>>           scsi_runtime_idle() END return (-EBUSY) always.
>>>
>>> Not following here either.  Which device is EBUSY and why?
>>
>> scsi_runtime_idle() return -EBUSY always [3]
>> Storage/scsi team can better explain -EBUSY implementation.
>
> EBUSY is returned from below code for consumer dev 0:0:0:0.
> scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
> static int scsi_runtime_idle(struct device *dev)
> {
>     :
>
>     if (scsi_is_sdev_device(dev)) {
>         pm_runtime_mark_last_busy(dev);
>         pm_runtime_autosuspend(dev);
>         return -EBUSY; ---> EBUSY returned from here.
>     }
>
>     
> }
>
>>
>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
>>
>>
>>>>>
>>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>>>>> [2]: https://lkml.org/lkml/2022/10/12/259

It looks to me like __rpm_callback() makes assumptions about
dev->power.runtime_status that are not necessarily true because
dev->power.lock is dropped. AFAICT the intention of the code
would be fulfilled by instead using the status as it was before
the lock was dropped.

Consequently, perhaps you could try this:

diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
index b52049098d4e..3cf9abc3b2c2 100644
--- a/drivers/base/power/runtime.c
+++ b/drivers/base/power/runtime.c
@@ -365,6 +365,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
{
int retval = 0, idx;
bool use_links = dev->power.links_count > 0;
+ enum rpm_status runtime_status = dev->power.runtime_status;

if (dev->power.irq_safe) {
spin_unlock(&dev->power.lock);
@@ -378,7 +379,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
* routine returns, so it is safe to read the status outside of
* the lock.
*/
- if (use_links && dev->power.runtime_status == RPM_RESUMING) {
+ if (use_links && runtime_status == RPM_RESUMING) {
idx = device_links_read_lock();

retval = rpm_get_suppliers(dev);
@@ -405,8 +406,8 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
* Do that if resume fails too.
*/
if (use_links
- && ((dev->power.runtime_status == RPM_SUSPENDING && !retval)
- || (dev->power.runtime_status == RPM_RESUMING && retval))) {
+ && ((runtime_status == RPM_SUSPENDING && !retval)
+ || (runtime_status == RPM_RESUMING && retval))) {
idx = device_links_read_lock();

__rpm_put_suppliers(dev, false);


2022-12-01 15:14:00

by Nitin Rawat

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

Hi Adrian,
Thanks for the patch.
I Agree, using local variable to store the runtime status before
dev->power.lock is released and using this variable later in the code
can meet the intention of code and can help to solve this race.

We will get it tested and update you.


Regards,
Nitin

On 12/1/2022 6:39 PM, Adrian Hunter wrote:
> On 29/11/22 18:56, Nitin Rawat wrote:
>> Hi Adrian,
>>
>> On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
>>> Hi Adrian,
>>>
>>> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
>>>> On 4/11/22 11:19, Tushar Nimkar wrote:
>>>>> Hi linux-pm/linux-scsi,
>>>
>>>>>> Process -1
>>>>>> ufshcd_async_scan context (process 1)
>>>>>> scsi_autopm_put_device() //0:0:0:0
>>>>
>>>> I am having trouble following your description.  What function is calling
>>>> scsi_autopm_put_device() here?
>>>>
>>> Below is flow which calls scsi_autopm_put_device()
>>> Process -1
>>> ufshcd_async_scan()
>>>      scsi_probe_and_add_lun()
>>>          scsi_add_lun()
>>>              slave_configure()
>>>                  scsi_sysfs_add_sdev()
>>>                      scsi_autopm_get_device()
>>>                          device_add()     <- invoked [Process 2] sd_probe()
>>>                              scsi_autopm_put_device()
>>>
>>>>>> pm_runtime_put_sync()
>>>>>> __pm_runtime_idle()
>>>>>> rpm_idle() -- RPM_GET_PUT(4)
>>>>>>       __rpm_callback
>>>>>>           scsi_runtime_idle()
>>>>>>               pm_runtime_mark_last_busy()
>>>>>>               pm_runtime_autosuspend()  --[A]
>>>>>>                   rpm_suspend() -- RPM_AUTO(8)
>>>>>>                       pm_runtime_autosuspend_expiration() use_autosuspend    is false return 0   --- [B]
>>>>>>                           __update_runtime_status to RPM_SUSPENDING
>>>>>>                       __rpm_callback()
>>>>>>                           __rpm_put_suppliers(dev, false)
>>>>>>                       __update_runtime_status to RPM_SUSPENDED
>>>>>>                   rpm_suspend_suppliers()
>>>>>>                       rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
>>>>>>                   rpm_suspend() – END with return=0
>>>>>>           scsi_runtime_idle() END return (-EBUSY) always.
>>>>
>>>> Not following here either.  Which device is EBUSY and why?
>>>
>>> scsi_runtime_idle() return -EBUSY always [3]
>>> Storage/scsi team can better explain -EBUSY implementation.
>>
>> EBUSY is returned from below code for consumer dev 0:0:0:0.
>> scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
>> static int scsi_runtime_idle(struct device *dev)
>> {
>>     :
>>
>>     if (scsi_is_sdev_device(dev)) {
>>         pm_runtime_mark_last_busy(dev);
>>         pm_runtime_autosuspend(dev);
>>         return -EBUSY; ---> EBUSY returned from here.
>>     }
>>
>>
>> }
>>
>>>
>>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
>>>
>>>
>>>>>>
>>>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>>>>>> [2]: https://lkml.org/lkml/2022/10/12/259
>
> It looks to me like __rpm_callback() makes assumptions about
> dev->power.runtime_status that are not necessarily true because
> dev->power.lock is dropped. AFAICT the intention of the code
> would be fulfilled by instead using the status as it was before
> the lock was dropped.
>
> Consequently, perhaps you could try this:
>
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index b52049098d4e..3cf9abc3b2c2 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -365,6 +365,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> {
> int retval = 0, idx;
> bool use_links = dev->power.links_count > 0;
> + enum rpm_status runtime_status = dev->power.runtime_status;
>
> if (dev->power.irq_safe) {
> spin_unlock(&dev->power.lock);
> @@ -378,7 +379,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> * routine returns, so it is safe to read the status outside of
> * the lock.
> */
> - if (use_links && dev->power.runtime_status == RPM_RESUMING) {
> + if (use_links && runtime_status == RPM_RESUMING) {
> idx = device_links_read_lock();
>
> retval = rpm_get_suppliers(dev);
> @@ -405,8 +406,8 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> * Do that if resume fails too.
> */
> if (use_links
> - && ((dev->power.runtime_status == RPM_SUSPENDING && !retval)
> - || (dev->power.runtime_status == RPM_RESUMING && retval))) {
> + && ((runtime_status == RPM_SUSPENDING && !retval)
> + || (runtime_status == RPM_RESUMING && retval))) {
> idx = device_links_read_lock();
>
> __rpm_put_suppliers(dev, false);
>
>

2022-12-01 19:55:23

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

On Thu, Dec 1, 2022 at 2:10 PM Adrian Hunter <[email protected]> wrote:
>
> On 29/11/22 18:56, Nitin Rawat wrote:
> > Hi Adrian,
> >
> > On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
> >> Hi Adrian,
> >>
> >> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
> >>> On 4/11/22 11:19, Tushar Nimkar wrote:
> >>>> Hi linux-pm/linux-scsi,
> >>
> >>>>> Process -1
> >>>>> ufshcd_async_scan context (process 1)
> >>>>> scsi_autopm_put_device() //0:0:0:0
> >>>
> >>> I am having trouble following your description. What function is calling
> >>> scsi_autopm_put_device() here?
> >>>
> >> Below is flow which calls scsi_autopm_put_device()
> >> Process -1
> >> ufshcd_async_scan()
> >> scsi_probe_and_add_lun()
> >> scsi_add_lun()
> >> slave_configure()
> >> scsi_sysfs_add_sdev()
> >> scsi_autopm_get_device()
> >> device_add() <- invoked [Process 2] sd_probe()
> >> scsi_autopm_put_device()
> >>
> >>>>> pm_runtime_put_sync()
> >>>>> __pm_runtime_idle()
> >>>>> rpm_idle() -- RPM_GET_PUT(4)
> >>>>> __rpm_callback
> >>>>> scsi_runtime_idle()
> >>>>> pm_runtime_mark_last_busy()
> >>>>> pm_runtime_autosuspend() --[A]
> >>>>> rpm_suspend() -- RPM_AUTO(8)
> >>>>> pm_runtime_autosuspend_expiration() use_autosuspend is false return 0 --- [B]
> >>>>> __update_runtime_status to RPM_SUSPENDING
> >>>>> __rpm_callback()
> >>>>> __rpm_put_suppliers(dev, false)
> >>>>> __update_runtime_status to RPM_SUSPENDED
> >>>>> rpm_suspend_suppliers()
> >>>>> rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
> >>>>> rpm_suspend() – END with return=0
> >>>>> scsi_runtime_idle() END return (-EBUSY) always.
> >>>
> >>> Not following here either. Which device is EBUSY and why?
> >>
> >> scsi_runtime_idle() return -EBUSY always [3]
> >> Storage/scsi team can better explain -EBUSY implementation.
> >
> > EBUSY is returned from below code for consumer dev 0:0:0:0.
> > scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
> > static int scsi_runtime_idle(struct device *dev)
> > {
> > :
> >
> > if (scsi_is_sdev_device(dev)) {
> > pm_runtime_mark_last_busy(dev);
> > pm_runtime_autosuspend(dev);
> > return -EBUSY; ---> EBUSY returned from here.
> > }
> >
> >
> > }
> >
> >>
> >> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
> >>
> >>
> >>>>>
> >>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
> >>>>> [2]: https://lkml.org/lkml/2022/10/12/259
>
> It looks to me like __rpm_callback() makes assumptions about
> dev->power.runtime_status that are not necessarily true because
> dev->power.lock is dropped.

Well, this happens because rpm_idle() calls __rpm_callback() and
allows it to run concurrently with rpm_suspend() and rpm_resume(), so
one of them may change runtime_status to RPM_SUSPENDING or
RPM_RESUMING while __rpm_callback() is running.

It is somewhat questionable whether or not this should be allowed to
happen, but since it is generally allowed to suspend the device from
its .runtime_idle callback, there is not too much that can be done
about it.

> AFAICT the intention of the code would be fulfilled by instead using the status as it was before
> the lock was dropped.

That's correct, so the patch should help, but it also needs to remove
the comment stating that the runtime status cannot change when
__rpm_callback() is running, which is clearly incorrect.

> Consequently, perhaps you could try this:
>
> diff --git a/drivers/base/power/runtime.c b/drivers/base/power/runtime.c
> index b52049098d4e..3cf9abc3b2c2 100644
> --- a/drivers/base/power/runtime.c
> +++ b/drivers/base/power/runtime.c
> @@ -365,6 +365,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> {
> int retval = 0, idx;
> bool use_links = dev->power.links_count > 0;
> + enum rpm_status runtime_status = dev->power.runtime_status;
>
> if (dev->power.irq_safe) {
> spin_unlock(&dev->power.lock);
> @@ -378,7 +379,7 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> * routine returns, so it is safe to read the status outside of
> * the lock.
> */
> - if (use_links && dev->power.runtime_status == RPM_RESUMING) {
> + if (use_links && runtime_status == RPM_RESUMING) {
> idx = device_links_read_lock();
>
> retval = rpm_get_suppliers(dev);
> @@ -405,8 +406,8 @@ static int __rpm_callback(int (*cb)(struct device *), struct device *dev)
> * Do that if resume fails too.
> */
> if (use_links
> - && ((dev->power.runtime_status == RPM_SUSPENDING && !retval)
> - || (dev->power.runtime_status == RPM_RESUMING && retval))) {
> + && ((runtime_status == RPM_SUSPENDING && !retval)
> + || (runtime_status == RPM_RESUMING && retval))) {
> idx = device_links_read_lock();
>
> __rpm_put_suppliers(dev, false);
>
>

2022-12-01 20:00:13

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

On Thursday, December 1, 2022 8:28:25 PM CET Rafael J. Wysocki wrote:
> On Thu, Dec 1, 2022 at 2:10 PM Adrian Hunter <[email protected]> wrote:
> >
> > On 29/11/22 18:56, Nitin Rawat wrote:
> > > Hi Adrian,
> > >
> > > On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
> > >> Hi Adrian,
> > >>
> > >> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
> > >>> On 4/11/22 11:19, Tushar Nimkar wrote:
> > >>>> Hi linux-pm/linux-scsi,
> > >>
> > >>>>> Process -1
> > >>>>> ufshcd_async_scan context (process 1)
> > >>>>> scsi_autopm_put_device() //0:0:0:0
> > >>>
> > >>> I am having trouble following your description. What function is calling
> > >>> scsi_autopm_put_device() here?
> > >>>
> > >> Below is flow which calls scsi_autopm_put_device()
> > >> Process -1
> > >> ufshcd_async_scan()
> > >> scsi_probe_and_add_lun()
> > >> scsi_add_lun()
> > >> slave_configure()
> > >> scsi_sysfs_add_sdev()
> > >> scsi_autopm_get_device()
> > >> device_add() <- invoked [Process 2] sd_probe()
> > >> scsi_autopm_put_device()
> > >>
> > >>>>> pm_runtime_put_sync()
> > >>>>> __pm_runtime_idle()
> > >>>>> rpm_idle() -- RPM_GET_PUT(4)
> > >>>>> __rpm_callback
> > >>>>> scsi_runtime_idle()
> > >>>>> pm_runtime_mark_last_busy()
> > >>>>> pm_runtime_autosuspend() --[A]
> > >>>>> rpm_suspend() -- RPM_AUTO(8)
> > >>>>> pm_runtime_autosuspend_expiration() use_autosuspend is false return 0 --- [B]
> > >>>>> __update_runtime_status to RPM_SUSPENDING
> > >>>>> __rpm_callback()
> > >>>>> __rpm_put_suppliers(dev, false)
> > >>>>> __update_runtime_status to RPM_SUSPENDED
> > >>>>> rpm_suspend_suppliers()
> > >>>>> rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
> > >>>>> rpm_suspend() – END with return=0
> > >>>>> scsi_runtime_idle() END return (-EBUSY) always.
> > >>>
> > >>> Not following here either. Which device is EBUSY and why?
> > >>
> > >> scsi_runtime_idle() return -EBUSY always [3]
> > >> Storage/scsi team can better explain -EBUSY implementation.
> > >
> > > EBUSY is returned from below code for consumer dev 0:0:0:0.
> > > scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
> > > static int scsi_runtime_idle(struct device *dev)
> > > {
> > > :
> > >
> > > if (scsi_is_sdev_device(dev)) {
> > > pm_runtime_mark_last_busy(dev);
> > > pm_runtime_autosuspend(dev);
> > > return -EBUSY; ---> EBUSY returned from here.
> > > }
> > >
> > >
> > > }
> > >
> > >>
> > >> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
> > >>
> > >>
> > >>>>>
> > >>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
> > >>>>> [2]: https://lkml.org/lkml/2022/10/12/259
> >
> > It looks to me like __rpm_callback() makes assumptions about
> > dev->power.runtime_status that are not necessarily true because
> > dev->power.lock is dropped.
>
> Well, this happens because rpm_idle() calls __rpm_callback() and
> allows it to run concurrently with rpm_suspend() and rpm_resume(), so
> one of them may change runtime_status to RPM_SUSPENDING or
> RPM_RESUMING while __rpm_callback() is running.
>
> It is somewhat questionable whether or not this should be allowed to
> happen, but since it is generally allowed to suspend the device from
> its .runtime_idle callback, there is not too much that can be done
> about it.

But this means that the patch below should help too.

I actually think that we can do both, because rpm_idle() doesn't have to do
the whole device links dance and the fact that it still calls __rpm_callback()
is a clear oversight.

---
drivers/base/power/runtime.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

Index: linux-pm/drivers/base/power/runtime.c
===================================================================
--- linux-pm.orig/drivers/base/power/runtime.c
+++ linux-pm/drivers/base/power/runtime.c
@@ -484,7 +484,17 @@ static int rpm_idle(struct device *dev,

dev->power.idle_notification = true;

- retval = __rpm_callback(callback, dev);
+ if (dev->power.irq_safe)
+ spin_unlock(&dev->power.lock);
+ else
+ spin_unlock_irq(&dev->power.lock);
+
+ retval = callback(dev);
+
+ if (dev->power.irq_safe)
+ spin_lock(&dev->power.lock);
+ else
+ spin_lock_irq(&dev->power.lock);

dev->power.idle_notification = false;
wake_up_all(&dev->power.wait_queue);



2022-12-02 12:51:49

by Tushar Nimkar

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

Thanks Adrian and Rafael,
We are trying both patches separately. And will update result once we get.

Thanks,
Tushar Nimkar

On 12/2/2022 1:14 AM, Rafael J. Wysocki wrote:
> On Thursday, December 1, 2022 8:28:25 PM CET Rafael J. Wysocki wrote:
>> On Thu, Dec 1, 2022 at 2:10 PM Adrian Hunter <[email protected]> wrote:
>>>
>>> On 29/11/22 18:56, Nitin Rawat wrote:
>>>> Hi Adrian,
>>>>
>>>> On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
>>>>> Hi Adrian,
>>>>>
>>>>> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
>>>>>> On 4/11/22 11:19, Tushar Nimkar wrote:
>>>>>>> Hi linux-pm/linux-scsi,
>>>>>
>>>>>>>> Process -1
>>>>>>>> ufshcd_async_scan context (process 1)
>>>>>>>> scsi_autopm_put_device() //0:0:0:0
>>>>>>
>>>>>> I am having trouble following your description. What function is calling
>>>>>> scsi_autopm_put_device() here?
>>>>>>
>>>>> Below is flow which calls scsi_autopm_put_device()
>>>>> Process -1
>>>>> ufshcd_async_scan()
>>>>> scsi_probe_and_add_lun()
>>>>> scsi_add_lun()
>>>>> slave_configure()
>>>>> scsi_sysfs_add_sdev()
>>>>> scsi_autopm_get_device()
>>>>> device_add() <- invoked [Process 2] sd_probe()
>>>>> scsi_autopm_put_device()
>>>>>
>>>>>>>> pm_runtime_put_sync()
>>>>>>>> __pm_runtime_idle()
>>>>>>>> rpm_idle() -- RPM_GET_PUT(4)
>>>>>>>> __rpm_callback
>>>>>>>> scsi_runtime_idle()
>>>>>>>> pm_runtime_mark_last_busy()
>>>>>>>> pm_runtime_autosuspend() --[A]
>>>>>>>> rpm_suspend() -- RPM_AUTO(8)
>>>>>>>> pm_runtime_autosuspend_expiration() use_autosuspend is false return 0 --- [B]
>>>>>>>> __update_runtime_status to RPM_SUSPENDING
>>>>>>>> __rpm_callback()
>>>>>>>> __rpm_put_suppliers(dev, false)
>>>>>>>> __update_runtime_status to RPM_SUSPENDED
>>>>>>>> rpm_suspend_suppliers()
>>>>>>>> rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
>>>>>>>> rpm_suspend() – END with return=0
>>>>>>>> scsi_runtime_idle() END return (-EBUSY) always.
>>>>>>
>>>>>> Not following here either. Which device is EBUSY and why?
>>>>>
>>>>> scsi_runtime_idle() return -EBUSY always [3]
>>>>> Storage/scsi team can better explain -EBUSY implementation.
>>>>
>>>> EBUSY is returned from below code for consumer dev 0:0:0:0.
>>>> scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
>>>> static int scsi_runtime_idle(struct device *dev)
>>>> {
>>>> :
>>>>
>>>> if (scsi_is_sdev_device(dev)) {
>>>> pm_runtime_mark_last_busy(dev);
>>>> pm_runtime_autosuspend(dev);
>>>> return -EBUSY; ---> EBUSY returned from here.
>>>> }
>>>>
>>>>
>>>> }
>>>>
>>>>>
>>>>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
>>>>>
>>>>>
>>>>>>>>
>>>>>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
>>>>>>>> [2]: https://lkml.org/lkml/2022/10/12/259
>>>
>>> It looks to me like __rpm_callback() makes assumptions about
>>> dev->power.runtime_status that are not necessarily true because
>>> dev->power.lock is dropped.
>>
>> Well, this happens because rpm_idle() calls __rpm_callback() and
>> allows it to run concurrently with rpm_suspend() and rpm_resume(), so
>> one of them may change runtime_status to RPM_SUSPENDING or
>> RPM_RESUMING while __rpm_callback() is running.
>>
>> It is somewhat questionable whether or not this should be allowed to
>> happen, but since it is generally allowed to suspend the device from
>> its .runtime_idle callback, there is not too much that can be done
>> about it.
>
> But this means that the patch below should help too.
>
> I actually think that we can do both, because rpm_idle() doesn't have to do
> the whole device links dance and the fact that it still calls __rpm_callback()
> is a clear oversight.
>
> ---
> drivers/base/power/runtime.c | 12 +++++++++++-
> 1 file changed, 11 insertions(+), 1 deletion(-)
>
> Index: linux-pm/drivers/base/power/runtime.c
> ===================================================================
> --- linux-pm.orig/drivers/base/power/runtime.c
> +++ linux-pm/drivers/base/power/runtime.c
> @@ -484,7 +484,17 @@ static int rpm_idle(struct device *dev,
>
> dev->power.idle_notification = true;
>
> - retval = __rpm_callback(callback, dev);
> + if (dev->power.irq_safe)
> + spin_unlock(&dev->power.lock);
> + else
> + spin_unlock_irq(&dev->power.lock);
> +
> + retval = callback(dev);
> +
> + if (dev->power.irq_safe)
> + spin_lock(&dev->power.lock);
> + else
> + spin_lock_irq(&dev->power.lock);
>
> dev->power.idle_notification = false;
> wake_up_all(&dev->power.wait_queue);
>
>
>

2022-12-02 14:22:15

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: PM-runtime: supplier looses track of consumer during probe

On Fri, Dec 2, 2022 at 1:23 PM Tushar Nimkar <[email protected]> wrote:
>
> Thanks Adrian and Rafael,
> We are trying both patches separately. And will update result once we get.

Thank you!

I'm going to submit the change in rpm_idle() regardless of whether or
not it is sufficient to address the issue, because it is a clear
mistake to still call __rpm_callback() from there after adding the
handling of device links to it.

> On 12/2/2022 1:14 AM, Rafael J. Wysocki wrote:
> > On Thursday, December 1, 2022 8:28:25 PM CET Rafael J. Wysocki wrote:
> >> On Thu, Dec 1, 2022 at 2:10 PM Adrian Hunter <[email protected]> wrote:
> >>>
> >>> On 29/11/22 18:56, Nitin Rawat wrote:
> >>>> Hi Adrian,
> >>>>
> >>>> On 11/21/2022 11:38 AM, Tushar Nimkar wrote:
> >>>>> Hi Adrian,
> >>>>>
> >>>>> On 11/18/2022 8:25 PM, Adrian Hunter wrote:
> >>>>>> On 4/11/22 11:19, Tushar Nimkar wrote:
> >>>>>>> Hi linux-pm/linux-scsi,
> >>>>>
> >>>>>>>> Process -1
> >>>>>>>> ufshcd_async_scan context (process 1)
> >>>>>>>> scsi_autopm_put_device() //0:0:0:0
> >>>>>>
> >>>>>> I am having trouble following your description. What function is calling
> >>>>>> scsi_autopm_put_device() here?
> >>>>>>
> >>>>> Below is flow which calls scsi_autopm_put_device()
> >>>>> Process -1
> >>>>> ufshcd_async_scan()
> >>>>> scsi_probe_and_add_lun()
> >>>>> scsi_add_lun()
> >>>>> slave_configure()
> >>>>> scsi_sysfs_add_sdev()
> >>>>> scsi_autopm_get_device()
> >>>>> device_add() <- invoked [Process 2] sd_probe()
> >>>>> scsi_autopm_put_device()
> >>>>>
> >>>>>>>> pm_runtime_put_sync()
> >>>>>>>> __pm_runtime_idle()
> >>>>>>>> rpm_idle() -- RPM_GET_PUT(4)
> >>>>>>>> __rpm_callback
> >>>>>>>> scsi_runtime_idle()
> >>>>>>>> pm_runtime_mark_last_busy()
> >>>>>>>> pm_runtime_autosuspend() --[A]
> >>>>>>>> rpm_suspend() -- RPM_AUTO(8)
> >>>>>>>> pm_runtime_autosuspend_expiration() use_autosuspend is false return 0 --- [B]
> >>>>>>>> __update_runtime_status to RPM_SUSPENDING
> >>>>>>>> __rpm_callback()
> >>>>>>>> __rpm_put_suppliers(dev, false)
> >>>>>>>> __update_runtime_status to RPM_SUSPENDED
> >>>>>>>> rpm_suspend_suppliers()
> >>>>>>>> rpm_idle() for supplier -- RPM_ASYNC(1) return (-EAGAIN) [ Other consumer active for supplier]
> >>>>>>>> rpm_suspend() – END with return=0
> >>>>>>>> scsi_runtime_idle() END return (-EBUSY) always.
> >>>>>>
> >>>>>> Not following here either. Which device is EBUSY and why?
> >>>>>
> >>>>> scsi_runtime_idle() return -EBUSY always [3]
> >>>>> Storage/scsi team can better explain -EBUSY implementation.
> >>>>
> >>>> EBUSY is returned from below code for consumer dev 0:0:0:0.
> >>>> scsi_runtime_idle is called from scsi_autopm_put_device which inturn is called from ufshcd_async_scan (Process 1 as per above call stack)
> >>>> static int scsi_runtime_idle(struct device *dev)
> >>>> {
> >>>> :
> >>>>
> >>>> if (scsi_is_sdev_device(dev)) {
> >>>> pm_runtime_mark_last_busy(dev);
> >>>> pm_runtime_autosuspend(dev);
> >>>> return -EBUSY; ---> EBUSY returned from here.
> >>>> }
> >>>>
> >>>>
> >>>> }
> >>>>
> >>>>>
> >>>>> [3] https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/tree/drivers/scsi/scsi_pm.c?h=next-20221118#n210
> >>>>>
> >>>>>
> >>>>>>>>
> >>>>>>>> [1]: https://lore.kernel.org/lkml/4748074.GXAFRqVoOG@kreacher/T/
> >>>>>>>> [2]: https://lkml.org/lkml/2022/10/12/259
> >>>
> >>> It looks to me like __rpm_callback() makes assumptions about
> >>> dev->power.runtime_status that are not necessarily true because
> >>> dev->power.lock is dropped.
> >>
> >> Well, this happens because rpm_idle() calls __rpm_callback() and
> >> allows it to run concurrently with rpm_suspend() and rpm_resume(), so
> >> one of them may change runtime_status to RPM_SUSPENDING or
> >> RPM_RESUMING while __rpm_callback() is running.
> >>
> >> It is somewhat questionable whether or not this should be allowed to
> >> happen, but since it is generally allowed to suspend the device from
> >> its .runtime_idle callback, there is not too much that can be done
> >> about it.
> >
> > But this means that the patch below should help too.
> >
> > I actually think that we can do both, because rpm_idle() doesn't have to do
> > the whole device links dance and the fact that it still calls __rpm_callback()
> > is a clear oversight.
> >
> > ---
> > drivers/base/power/runtime.c | 12 +++++++++++-
> > 1 file changed, 11 insertions(+), 1 deletion(-)
> >
> > Index: linux-pm/drivers/base/power/runtime.c
> > ===================================================================
> > --- linux-pm.orig/drivers/base/power/runtime.c
> > +++ linux-pm/drivers/base/power/runtime.c
> > @@ -484,7 +484,17 @@ static int rpm_idle(struct device *dev,
> >
> > dev->power.idle_notification = true;
> >
> > - retval = __rpm_callback(callback, dev);
> > + if (dev->power.irq_safe)
> > + spin_unlock(&dev->power.lock);
> > + else
> > + spin_unlock_irq(&dev->power.lock);
> > +
> > + retval = callback(dev);
> > +
> > + if (dev->power.irq_safe)
> > + spin_lock(&dev->power.lock);
> > + else
> > + spin_lock_irq(&dev->power.lock);
> >
> > dev->power.idle_notification = false;
> > wake_up_all(&dev->power.wait_queue);
> >
> >
> >