2023-07-26 14:29:48

by Thorsten Leemhuis

Subject: Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

Hi, Thorsten here, the Linux kernel's regression tracker.

On 26.07.23 13:54, TW wrote:
> I have been having issues with the 6.x series of kernels resuming from
> suspend with one of my drives. As far as I can tell, it has trouble with
> the cache on the drive when coming out of S3 sleep. I tried a few
> different distros (Manjaro, OpenMandriva Rome, EndeavourOS), all of which
> give the same error message. It appears to work just fine on the 5.15
> kernel, however.
>
> These are the errors that I have been getting; I assume they are what is
> holding up the system when resuming from suspend.
>
> Jul 20 04:13:41 rageworks kernel: ata10.00: device reported invalid CHS sector 0
> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command
> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: failed to resume async: error -5
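
For anyone decoding the last two log lines: the value printed by the PM core is a negated errno, and on Linux errno 5 is EIO. A minimal user-space sketch that maps such a return value back to its message follows; `describe_pm_error` is a hypothetical helper name used only for illustration, not a kernel function.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper (not a kernel API): format a negative errno-style
 * return value, such as the -5 reported for scsi_bus_resume, into buf.
 * Returns the number of characters snprintf() would have written.
 */
static int describe_pm_error(int ret, char *buf, size_t len)
{
	int err = ret < 0 ? -ret : ret;	/* kernel callbacks return -errno */

	return snprintf(buf, len, "error %d: %s%s", ret, strerror(err),
			err == EIO ? " (EIO)" : "");
}
```

Calling `describe_pm_error(-5, buf, sizeof(buf))` on a Linux system produces a string that starts with "error -5: " and ends with "(EIO)"; the message text in between comes from the C library's strerror().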

Thx for your report. I CCed a few people; with a bit of luck they have
an idea, but I doubt it. If no one replies, you will likely need a
bisection to find the root of the problem. But before going down that
route, you should check whether the latest mainline kernel (vanilla!)
works better.

FWIW, this is not my area of expertise, so the following might be a
misleading comment, but the problem looks somewhat similar to this one
that iirc was never solved:
https://bugzilla.kernel.org/show_bug.cgi?id=216087

> Jul 20 04:12:51 rageworks systemd[1]: nvidia-suspend.service: Deactivated successfully.
> Jul 20 04:12:51 rageworks systemd[1]: Finished NVIDIA system suspend actions.
> Jul 20 04:12:51 rageworks systemd[1]: Starting System Suspend...

That sounds like you are using out-of-tree drivers, which can cause all
sorts of issues. Please recheck whether the problem happens without those
as well, and do not use them in any further tests to debug the issue.

Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat)
--
Everything you wanna know about Linux kernel regression tracking:
https://linux-regtracking.leemhuis.info/about/#tldr
If I did something stupid, please tell me, as explained on that page.


2023-07-27 00:03:26

by Damien Le Moal

Subject: Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

On 7/26/23 22:47, Thorsten Leemhuis wrote:
> Hi, Thorsten here, the Linux kernel's regression tracker.
>
> On 26.07.23 13:54, TW wrote:
>> I have been having issues with the 6.x series of kernels resuming from
>> suspend with one of my drives. As far as I can tell, it has trouble with
>> the cache on the drive when coming out of S3 sleep. I tried a few
>> different distros (Manjaro, OpenMandriva Rome, EndeavourOS), all of which
>> give the same error message. It appears to work just fine on the 5.15
>> kernel, however.
>>
>> These are the errors that I have been getting; I assume they are what is
>> holding up the system when resuming from suspend.
>>
>> Jul 20 04:13:41 rageworks kernel: ata10.00: device reported invalid CHS sector 0
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Start/Stop Unit failed: Result: hostbyte=DID_OK driverbyte=DRIVER_OK
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Sense Key : Illegal Request [current]
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: [sdc] Add. Sense: Unaligned write command

This sense data is garbage. This issue has been reported already, but it is
hard to deal with as it seems to be due to drives/adapters not correctly
reporting status bits. So for now, let's ignore these sense codes.
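
For context on what the two "sense" log lines encode: in SCSI fixed-format sense data, the sense key is the low nibble of byte 2, and the additional sense code (ASC) and its qualifier (ASCQ) are bytes 12 and 13. The sketch below is only an illustration under assumptions: the struct and function names are hypothetical, and it handles the fixed format only, not the descriptor format.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three fields that sd prints. */
struct sense_triplet {
	uint8_t key;	/* sense key, e.g. 0x5 = Illegal Request */
	uint8_t asc;	/* additional sense code */
	uint8_t ascq;	/* additional sense code qualifier */
};

/*
 * Decode fixed-format sense data (response code 0x70 or 0x71).
 * Returns 0 on success, -1 if the buffer is too short or uses
 * another format.
 */
static int decode_fixed_sense(const uint8_t *buf, size_t len,
			      struct sense_triplet *out)
{
	uint8_t code;

	if (len < 14)
		return -1;
	code = buf[0] & 0x7f;		/* bit 7 is the VALID bit */
	if (code != 0x70 && code != 0x71)
		return -1;

	out->key = buf[2] & 0x0f;
	out->asc = buf[12];
	out->ascq = buf[13];
	return 0;
}
```

A hand-built buffer (not captured from the drive in this report) with key 0x5 and ASC/ASCQ 0x21/0x04 decodes to the "Illegal Request" / "Unaligned write command" pair seen in the log.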

The start/stop unit failure is weird. In another case, I suspect that
this command is causing a delay on resume, but not an error like this.

>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: dpm_run_callback(): scsi_bus_resume+0x0/0x90 returns -5
>> Jul 20 04:13:41 rageworks kernel: sd 9:0:0:0: PM: failed to resume async: error -5
>
> Thx for your report. I CCed a few people; with a bit of luck they have
> an idea, but I doubt it. If no one replies, you will likely need a
> bisection to find the root of the problem. But before going down that
> route, you should check whether the latest mainline kernel (vanilla!)
> works better.
>
> FWIW, this is not my area of expertise, so the following might be a
> misleading comment, but the problem looks somewhat similar to this one
> that iirc was never solved:
> https://bugzilla.kernel.org/show_bug.cgi?id=216087
>
>> Jul 20 04:12:51 rageworks systemd[1]: nvidia-suspend.service: Deactivated successfully.
>> Jul 20 04:12:51 rageworks systemd[1]: Finished NVIDIA system suspend actions.
>> Jul 20 04:12:51 rageworks systemd[1]: Starting System Suspend...
>
> That sounds like you are using out-of-tree drivers, which can cause all
> sorts of issues. Please recheck whether the problem happens without those
> as well, and do not use them in any further tests to debug the issue.

Yes. Please retest with the latest 6.5-rc3.

And can you try this patch to see if it solves your issue?

commit 29e81d11812ee924d19425343ec69acd34af9d35
Author: Damien Le Moal <[email protected]>
Date: Mon Jul 24 13:23:14 2023 +0900

ata,scsi: do not issue START STOP UNIT on resume

Signed-off-by: Damien Le Moal <[email protected]>

diff --git a/drivers/ata/libata-scsi.c b/drivers/ata/libata-scsi.c
index 370d18aca71e..6184c7bcc16c 100644
--- a/drivers/ata/libata-scsi.c
+++ b/drivers/ata/libata-scsi.c
@@ -1100,7 +1100,13 @@ int ata_scsi_dev_config(struct scsi_device *sdev, struct
ata_device *dev)
 		}
 	} else {
 		sdev->sector_size = ata_id_logical_sector_size(dev->id);
+		/*
+		 * Stop the drive on suspend but do not issue START STOP UNIT
+		 * on resume as this is not necessary: the port is reset on
+		 * resume, which wakes up the drive.
+		 */
 		sdev->manage_start_stop = 1;
+		sdev->no_start_on_resume = 1;
 	}

 	/*
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 68b12afa0721..b8584fe3123e 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -3876,7 +3876,7 @@ static int sd_suspend_runtime(struct device *dev)
 static int sd_resume(struct device *dev)
 {
 	struct scsi_disk *sdkp = dev_get_drvdata(dev);
-	int ret;
+	int ret = 0;

 	if (!sdkp)	/* E.g.: runtime resume at the start of sd_probe() */
 		return 0;
@@ -3885,7 +3885,8 @@ static int sd_resume(struct device *dev)
 		return 0;

 	sd_printk(KERN_NOTICE, sdkp, "Starting disk\n");
-	ret = sd_start_stop_device(sdkp, 1);
+	if (!sdkp->device->no_start_on_resume)
+		ret = sd_start_stop_device(sdkp, 1);
 	if (!ret)
 		opal_unlock_from_suspend(sdkp->opal_dev);
 	return ret;
diff --git a/include/scsi/scsi_device.h b/include/scsi/scsi_device.h
index 75b2235b99e2..b9230b6add04 100644
--- a/include/scsi/scsi_device.h
+++ b/include/scsi/scsi_device.h
@@ -194,6 +194,7 @@ struct scsi_device {
 	unsigned no_start_on_add:1;	/* do not issue start on add */
 	unsigned allow_restart:1;	/* issue START_UNIT in error handler */
 	unsigned manage_start_stop:1;	/* Let HLD (sd) manage start/stop */
+	unsigned no_start_on_resume:1;	/* Do not issue START_STOP_UNIT on resume */
 	unsigned start_stop_pwr_cond:1;	/* Set power cond. in START_STOP_UNIT */
 	unsigned no_uld_attach:1;	/* disable connecting to upper level drivers */
 	unsigned select_no_atn:1;
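
The effect of the sd_resume() change can be modelled in user space. The sketch below is not kernel code: `mock_sdev`, `mock_start_unit` and `mock_sd_resume` are hypothetical stand-ins, the early return on `manage_start_stop` is assumed from the surrounding driver code rather than shown in the hunk, the OPAL unlock step is left out, and the -5 failure is hard-coded to mimic the log in the report.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the struct scsi_device bits the patch uses. */
struct mock_sdev {
	unsigned manage_start_stop:1;
	unsigned no_start_on_resume:1;
	bool start_fails;	/* simulate a drive rejecting START STOP UNIT */
	int start_cmds;		/* number of START STOP UNIT commands issued */
};

/* Stand-in for sd_start_stop_device(sdkp, 1). */
static int mock_start_unit(struct mock_sdev *sdev)
{
	sdev->start_cmds++;
	return sdev->start_fails ? -5 /* -EIO */ : 0;
}

/* Mirrors the control flow of the patched sd_resume(). */
static int mock_sd_resume(struct mock_sdev *sdev)
{
	int ret = 0;

	if (!sdev->manage_start_stop)
		return 0;
	if (!sdev->no_start_on_resume)
		ret = mock_start_unit(sdev);
	return ret;
}
```

With `start_fails` set, a device that has only `manage_start_stop` returns -5 from `mock_sd_resume()`, while one that also has `no_start_on_resume` returns 0 without issuing the command. This is also why the patch initializes `ret` to 0.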


--
Damien Le Moal
Western Digital Research


2023-07-27 11:29:28

by TW

Subject: Re: Scsi_bus_resume+0x0/0x90 returns -5 when resuming from s3 sleep

I retried on 6.5-rc3 without the Nvidia drivers and still received the
same error. I then went to try the patch, but got a malformed patch
error on line 6 of the first part, the one for libata-scsi.c. The other
two seem to go through just fine, however.

Also, the bugzilla report is similar to what I have, but in my case the
disk doesn't disappear: it comes back, it just takes a while to come out
of sleep mode.
