2023-06-01 16:18:05

by Kai-Heng Feng

[permalink] [raw]
Subject: [PATCH v5] scsi: core: Wait until device is fully resumed before doing rescan

During system resuming process, the resuming order is from top to down.
Namely, the ATA host is resumed before disks connected to it.

When an EH is scheduled while ATA host is resumed and disk device is
still suspended, the device_lock hold by scsi_rescan_device() is never
released so the dpm_resume() of the disk is blocked forerver, therefore
the system can never be resumed back.

That's because scsi_attach_vpd() is expecting the disk device is in
operational state, as it doesn't work on suspended device.

To avoid such deadlock, wait until the scsi device is fully resumed,
before continuing the rescan process.

Signed-off-by: Kai-Heng Feng <[email protected]>
---
v5:
- Use a different approach. Wait until the disk device is resumed.

v4:
- No change.

v3:
- New patch to resolve undefined pm_suspend_target_state.

v2:
- Schedule rescan task at the end of system resume phase.
- Wording.

drivers/scsi/scsi_scan.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/scsi/scsi_scan.c b/drivers/scsi/scsi_scan.c
index d217be323cc6..a59aada98ac5 100644
--- a/drivers/scsi/scsi_scan.c
+++ b/drivers/scsi/scsi_scan.c
@@ -1621,6 +1621,9 @@ void scsi_rescan_device(struct device *dev)
{
struct scsi_device *sdev = to_scsi_device(dev);

+ if (dev->power.is_suspended)
+ wait_for_completion(&dev->power.completion);
+
device_lock(dev);

scsi_attach_vpd(sdev);
--
2.34.1



2023-06-01 19:55:06

by Bart Van Assche

[permalink] [raw]
Subject: Re: [PATCH v5] scsi: core: Wait until device is fully resumed before doing rescan

On 6/1/23 08:56, Kai-Heng Feng wrote:
> During system resuming process, the resuming order is from top to down.
> Namely, the ATA host is resumed before disks connected to it.
>
> When an EH is scheduled while ATA host is resumed and disk device is
> still suspended, the device_lock hold by scsi_rescan_device() is never
> released so the dpm_resume() of the disk is blocked forerver, therefore
> the system can never be resumed back.
>
> That's because scsi_attach_vpd() is expecting the disk device is in
> operational state, as it doesn't work on suspended device.
>
> To avoid such deadlock, wait until the scsi device is fully resumed,
> before continuing the rescan process.

Why doesn't scsi_attach_vpd() support runtime power management? Calling
scsi_attach_vpd() should result in a call of sdev_runtime_resume(),
isn't it?

Thanks,

Bart.


2023-06-02 01:14:15

by Kai-Heng Feng

[permalink] [raw]
Subject: Re: [PATCH v5] scsi: core: Wait until device is fully resumed before doing rescan

On Fri, Jun 2, 2023 at 3:48 AM Bart Van Assche <[email protected]> wrote:
>
> On 6/1/23 08:56, Kai-Heng Feng wrote:
> > During system resuming process, the resuming order is from top to down.
> > Namely, the ATA host is resumed before disks connected to it.
> >
> > When an EH is scheduled while ATA host is resumed and disk device is
> > still suspended, the device_lock hold by scsi_rescan_device() is never
> > released so the dpm_resume() of the disk is blocked forerver, therefore
> > the system can never be resumed back.
> >
> > That's because scsi_attach_vpd() is expecting the disk device is in
> > operational state, as it doesn't work on suspended device.
> >
> > To avoid such deadlock, wait until the scsi device is fully resumed,
> > before continuing the rescan process.
>
> Why doesn't scsi_attach_vpd() support runtime power management? Calling
> scsi_attach_vpd() should result in a call of sdev_runtime_resume(),

It's system-wide resume in this context, so it's dpm_resume() waiting
for the lock to be released by scsi_rescan_device().

Kai-Heng

> isn't it?
>
> Thanks,
>
> Bart.
>

2023-06-02 04:57:50

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v5] scsi: core: Wait until device is fully resumed before doing rescan

Hi Kai-Heng,

kernel test robot noticed the following build errors:

[auto build test ERROR on mkp-scsi/for-next]
[also build test ERROR on jejb-scsi/for-next linus/master v6.4-rc4 next-20230601]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Kai-Heng-Feng/scsi-core-Wait-until-device-is-fully-resumed-before-doing-rescan/20230601-235821
base: https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git for-next
patch link: https://lore.kernel.org/r/20230601155652.1157611-1-kai.heng.feng%40canonical.com
patch subject: [PATCH v5] scsi: core: Wait until device is fully resumed before doing rescan
config: parisc-defconfig (https://download.01.org/0day-ci/archive/20230602/[email protected]/config)
compiler: hppa-linux-gcc (GCC) 12.3.0
reproduce (this is a W=1 build):
mkdir -p ~/bin
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/fe2b65dd8204442bd73db8a6e40d8307e11fcd04
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Kai-Heng-Feng/scsi-core-Wait-until-device-is-fully-resumed-before-doing-rescan/20230601-235821
git checkout fe2b65dd8204442bd73db8a6e40d8307e11fcd04
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=parisc olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=gcc-12.3.0 ~/bin/make.cross W=1 O=build_dir ARCH=parisc SHELL=/bin/bash drivers/

If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All errors (new ones prefixed by >>):

drivers/scsi/scsi_scan.c: In function 'scsi_rescan_device':
>> drivers/scsi/scsi_scan.c:1627:48: error: 'struct dev_pm_info' has no member named 'completion'
1627 | wait_for_completion(&dev->power.completion);
| ^


vim +1627 drivers/scsi/scsi_scan.c

1621
1622 void scsi_rescan_device(struct device *dev)
1623 {
1624 struct scsi_device *sdev = to_scsi_device(dev);
1625
1626 if (dev->power.is_suspended)
> 1627 wait_for_completion(&dev->power.completion);
1628
1629 device_lock(dev);
1630
1631 scsi_attach_vpd(sdev);
1632 scsi_cdl_check(sdev);
1633
1634 if (sdev->handler && sdev->handler->rescan)
1635 sdev->handler->rescan(sdev);
1636
1637 if (dev->driver && try_module_get(dev->driver->owner)) {
1638 struct scsi_driver *drv = to_scsi_driver(dev->driver);
1639
1640 if (drv->rescan)
1641 drv->rescan(dev);
1642 module_put(dev->driver->owner);
1643 }
1644 device_unlock(dev);
1645 }
1646 EXPORT_SYMBOL(scsi_rescan_device);
1647

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki