2017-08-16 17:00:37

by Abdul Haleem

[permalink] [raw]
Subject: [BUG][bisected 270065e] linux-next fails to boot on powerpc

Hi Bart,

As of next-20170809, linux-next on powerpc boot hung with below trace
message.

Test : Boot
Machine Type : Power 8 bare-metal
Kernel version : 4.13.0-rc4-next-2017081
gcc : 4.8.5
config: Tul-NV-config file attached


Boot logs:
---------
oprofile: using timer interrupt.
ipip: IPv4 and MPLS over IPv4 tunneling driver
NET: Registered protocol family 17
Key type dns_resolver registered
registered taskstats version 1
ima: No TPM chip found, activating TPM-bypass! (rc=-19)
console [netcon0] enabled
netconsole: network logging started
rtc-opal opal-rtc: setting system clock to 2017-08-16 06:34:56 UTC
(1502865296)
.
ready
sd 0:2:0:0: [sda] 272646144 512-byte logical blocks: (140 GB/130 GiB)
sd 0:2:0:0: [sda] 4096-byte physical blocks
sd 0:2:0:0: [sda] Write Protect is off
INFO: task swapper/5:1 blocked for more than 120 seconds.
Not tainted 4.13.0-rc4-next-20170810-autotest #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
message.
swapper/5 D 9936 1 0 0x00000800
Call Trace:
[c0000007f8483a10] [c0000007f8483a80] 0xc0000007f8483a80 (unreliable)
[c0000007f8483be0] [c00000000001b358] __switch_to+0x2e8/0x430
[c0000007f8483c40] [c0000000009d134c] __schedule+0x38c/0xaf0
[c0000007f8483d20] [c0000000009d1af0] schedule+0x40/0xb0
[c0000007f8483d50] [c000000000110bd4] async_synchronize_cookie_domain
+0xd4/0x150
[c0000007f8483dc0] [c00000000000d8f8] kernel_init+0x28/0x140
[c0000007f8483e30] [c00000000000bc60] ret_from_kernel_thread+0x5c/0x7c

A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
Always unprepare ...) in the merge branch 'scsi/for-next'

System booted fine when the below commit is reverted:

commit 270065e92c317845d69095ec8e3d18616b5b39d5
Author: Bart Van Assche <[email protected]>
Date: Thu Aug 3 14:40:14 2017 -0700

scsi: scsi-mq: Always unprepare before requeuing a request

One of the two scsi-mq functions that requeue a request unprepares a
request before requeueing (scsi_io_completion()) but the other
function
not (__scsi_queue_insert()). Make sure that a request is unprepared
before requeuing it.

Fixes: commit d285203cf647 ("scsi: add support for a blk-mq based
I/O path.")
Signed-off-by: Bart Van Assche <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Hannes Reinecke <[email protected]>
Cc: Damien Le Moal <[email protected]>
Cc: Johannes Thumshirn <[email protected]>
Cc: <[email protected]>
Tested-by: Damien Le Moal <[email protected]>
Reviewed-by: Christoph Hellwig <[email protected]>
Reviewed-by: Johannes Thumshirn <[email protected]>
Signed-off-by: Martin K. Petersen <[email protected]>

--
Regard's

Abdul Haleem
IBM Linux Technology Centre



Attachments:
Tul-NV-config (89.66 kB)

2017-08-16 17:22:04

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
> As of next-20170809, linux-next on powerpc boot hung with below trace
> message.
>
> [ ... ]
>
> A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
> Always unprepare ...) in the merge branch 'scsi/for-next'
>
> System booted fine when the below commit is reverted:
>
> commit 270065e92c317845d69095ec8e3d18616b5b39d5
> Author: Bart Van Assche <[email protected]>
> Date: Thu Aug 3 14:40:14 2017 -0700
>
> scsi: scsi-mq: Always unprepare before requeuing a request

Hello Brian and Michael,

Do you agree that this probably indicates a bug in the PowerPC block driver
that is used to access the boot disk? Anyway, since a solution is not yet
available, I will submit a revert for this patch.

Bart.

2017-08-16 20:26:41

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
> As of next-20170809, linux-next on powerpc boot hung with below trace
> message.
> [ ... ]
> System booted fine when the below commit is reverted:

Hello Abdul,

Can you check whether applying the following commit on top of next-20170809
fixes this regression:

https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.13/scsi-fixes&id=b0e17a9b0df29590c45dfb296f541270a5941f41

Thanks,

Bart.

2017-08-16 23:18:34

by Brian King

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On 08/16/2017 12:21 PM, Bart Van Assche wrote:
> On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
>> As of next-20170809, linux-next on powerpc boot hung with below trace
>> message.
>>
>> [ ... ]
>>
>> A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
>> Always unprepare ...) in the merge branch 'scsi/for-next'
>>
>> System booted fine when the below commit is reverted:
>>
>> commit 270065e92c317845d69095ec8e3d18616b5b39d5
>> Author: Bart Van Assche <[email protected]>
>> Date: Thu Aug 3 14:40:14 2017 -0700
>>
>> scsi: scsi-mq: Always unprepare before requeuing a request
>
> Hello Brian and Michael,
>
> Do you agree that this probably indicates a bug in the PowerPC block driver
> that is used to access the boot disk? Anyway, since a solution is not yet
> available, I will submit a revert for this patch.

I've been looking at this a bit, and can recreate the issue, but haven't
got to root cause of the issue as of yet. If I do a sysrq-w while the system is hung
during boot I see this:

[ 25.561523] Workqueue: events_unbound async_run_entry_fn
[ 25.561527] Call Trace:
[ 25.561529] [c0000001697873f0] [c000000169701600] 0xc000000169701600 (unreliable)
[ 25.561534] [c0000001697875c0] [c00000000001ab78] __switch_to+0x2e8/0x430
[ 25.561539] [c000000169787620] [c00000000091ccb0] __schedule+0x310/0xa00
[ 25.561543] [c0000001697876f0] [c00000000091d3e0] schedule+0x40/0xb0
[ 25.561548] [c000000169787720] [c000000000921e40] schedule_timeout+0x200/0x430
[ 25.561553] [c000000169787810] [c00000000091db10] io_schedule_timeout+0x30/0x70
[ 25.561558] [c000000169787840] [c00000000091e978] wait_for_common_io.constprop.3+0x178/0x280
[ 25.561563] [c0000001697878c0] [c00000000047f7ec] blk_execute_rq+0x7c/0xd0
[ 25.561567] [c000000169787910] [c000000000614cd0] scsi_execute+0x100/0x230
[ 25.561572] [c000000169787990] [c00000000060d29c] scsi_report_opcode+0xbc/0x170
[ 25.561577] [c000000169787a50] [d000000004fe6404] sd_revalidate_disk+0xe04/0x1620 [sd_mod]
[ 25.561583] [c000000169787b80] [d000000004fe6d84] sd_probe_async+0xb4/0x230 [sd_mod]
[ 25.561588] [c000000169787c00] [c00000000010fc44] async_run_entry_fn+0x74/0x210
[ 25.561593] [c000000169787c90] [c000000000102f48] process_one_work+0x198/0x480
[ 25.561598] [c000000169787d30] [c0000000001032b8] worker_thread+0x88/0x510
[ 25.561603] [c000000169787dc0] [c00000000010b030] kthread+0x160/0x1a0
[ 25.561608] [c000000169787e30] [c00000000000b3a4] ret_from_kernel_thread+0x5c/0xb8

I was noticing that we are commonly in scsi_report_opcode. Since ipr RAID arrays don't support
the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES, I tried setting sdev->no_report_opcodes = 1
in ipr's slave configure. This seems to eliminate the boot hang for me, but is only working around
the issue. Since this command is not supported by ipr, it should return with an illegal request.
When I'm hung at this point, there is nothing outstanding to the adapter / driver. I'll continue
debugging...

-Brian

--
Brian King
Power Linux I/O
IBM Linux Technology Center

2017-08-17 01:33:11

by Michael Ellerman

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

Bart Van Assche <[email protected]> writes:

> On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
>> As of next-20170809, linux-next on powerpc boot hung with below trace
>> message.
>>
>> [ ... ]
>>
>> A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
>> Always unprepare ...) in the merge branch 'scsi/for-next'
>>
>> System booted fine when the below commit is reverted:
>>
>> commit 270065e92c317845d69095ec8e3d18616b5b39d5
>> Author: Bart Van Assche <[email protected]>
>> Date: Thu Aug 3 14:40:14 2017 -0700
>>
>> scsi: scsi-mq: Always unprepare before requeuing a request
>
> Hello Brian and Michael,
>
> Do you agree that this probably indicates a bug in the PowerPC block driver
> that is used to access the boot disk?

I don't know a scsi device from a block device, so I'm not much help sorry.

It seems likely it is a powerpc specific bug, as it seems no one else
has reported any problems with this commit.

> Anyway, since a solution is not yet available, I will submit a revert
> for this patch.

Thanks. Sorry I haven't been able to debug it further, there's about 10
things on fire right now - ie. situation normal :)

cheers

2017-08-17 07:06:42

by Michael Ellerman

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

Bart Van Assche <[email protected]> writes:

> On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
>> As of next-20170809, linux-next on powerpc boot hung with below trace
>> message.
>> [ ... ]
>> System booted fine when the below commit is reverted:
>
> Hello Abdul,
>
> Can you check whether applying the following commit on top of next-20170809
> fixes this regression:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mkp/scsi.git/commit/?h=4.13/scsi-fixes&id=b0e17a9b0df29590c45dfb296f541270a5941f41

That's already in next-20170816, which still exhibits the bug.

We know that reverting it fixes the bug, but I can also fix it by doing
a partial revert, ie:

diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 582769116f66..95f1428c5bc9 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -190,7 +190,9 @@ static void __scsi_queue_insert(struct scsi_cmnd *cmd, int reason, int unbusy)
*/
cmd->result = 0;
if (q->mq_ops) {
- scsi_mq_requeue_cmd(cmd);
+ struct scsi_device *sdev = cmd->device;
+ blk_mq_requeue_request(cmd->request, true);
+ put_device(&sdev->sdev_gendev);
return;
}
spin_lock_irqsave(q->queue_lock, flags);


That's basically inlining the old version of scsi_mq_requeue_cmd() at
the other call site in __scsi_queue_insert().

Now I guess that was fairly obvious, if you look closely at the commit,
and the fact that we're not hitting the WARN_ON() you added.

Anyway let me know if there's anything else I can try.

cheers

2017-08-17 15:53:37

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On Wed, 2017-08-16 at 18:18 -0500, Brian King wrote:
> On 08/16/2017 12:21 PM, Bart Van Assche wrote:
> > On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
> > > As of next-20170809, linux-next on powerpc boot hung with below trace
> > > message.
> > >
> > > [ ... ]
> > >
> > > A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
> > > Always unprepare ...) in the merge branch 'scsi/for-next'
> > >
> > > System booted fine when the below commit is reverted:
> > >
> > > commit 270065e92c317845d69095ec8e3d18616b5b39d5
> > > Author: Bart Van Assche <[email protected]>
> > > Date: Thu Aug 3 14:40:14 2017 -0700
> > >
> > > scsi: scsi-mq: Always unprepare before requeuing a request
> >
> > Hello Brian and Michael,
> >
> > Do you agree that this probably indicates a bug in the PowerPC block driver
> > that is used to access the boot disk? Anyway, since a solution is not yet
> > available, I will submit a revert for this patch.
>
> I've been looking at this a bit, and can recreate the issue, but haven't
> got to root cause of the issue as of yet. If I do a sysrq-w while the system is hung
> during boot I see this:
>
> [ 25.561523] Workqueue: events_unbound async_run_entry_fn
> [ 25.561527] Call Trace:
> [ 25.561529] [c0000001697873f0] [c000000169701600] 0xc000000169701600 (unreliable)
> [ 25.561534] [c0000001697875c0] [c00000000001ab78] __switch_to+0x2e8/0x430
> [ 25.561539] [c000000169787620] [c00000000091ccb0] __schedule+0x310/0xa00
> [ 25.561543] [c0000001697876f0] [c00000000091d3e0] schedule+0x40/0xb0
> [ 25.561548] [c000000169787720] [c000000000921e40] schedule_timeout+0x200/0x430
> [ 25.561553] [c000000169787810] [c00000000091db10] io_schedule_timeout+0x30/0x70
> [ 25.561558] [c000000169787840] [c00000000091e978] wait_for_common_io.constprop.3+0x178/0x280
> [ 25.561563] [c0000001697878c0] [c00000000047f7ec] blk_execute_rq+0x7c/0xd0
> [ 25.561567] [c000000169787910] [c000000000614cd0] scsi_execute+0x100/0x230
> [ 25.561572] [c000000169787990] [c00000000060d29c] scsi_report_opcode+0xbc/0x170
> [ 25.561577] [c000000169787a50] [d000000004fe6404] sd_revalidate_disk+0xe04/0x1620 [sd_mod]
> [ 25.561583] [c000000169787b80] [d000000004fe6d84] sd_probe_async+0xb4/0x230 [sd_mod]
> [ 25.561588] [c000000169787c00] [c00000000010fc44] async_run_entry_fn+0x74/0x210
> [ 25.561593] [c000000169787c90] [c000000000102f48] process_one_work+0x198/0x480
> [ 25.561598] [c000000169787d30] [c0000000001032b8] worker_thread+0x88/0x510
> [ 25.561603] [c000000169787dc0] [c00000000010b030] kthread+0x160/0x1a0
> [ 25.561608] [c000000169787e30] [c00000000000b3a4] ret_from_kernel_thread+0x5c/0xb8
>
> I was noticing that we are commonly in scsi_report_opcode. Since ipr RAID arrays don't support
> the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES, I tried setting sdev->no_report_opcodes = 1
> in ipr's slave configure. This seems to eliminate the boot hang for me, but is only working around
> the issue. Since this command is not supported by ipr, it should return with an illegal request.
> When I'm hung at this point, there is nothing outstanding to the adapter / driver. I'll continue
> debugging...

(+linux-scsi)

Hello Brian,

Is kernel debugging enabled on your test system? Is lockdep enabled?
Anyway, stack traces like the above usually mean that a request got stuck in
a block or scsi driver (ipr in this case). Information about pending requests,
including the SCSI CDB, is available under /sys/kernel/debug/block (see also
commit 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")).

Bart.

2017-08-18 21:04:20

by Brian King

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On 08/17/2017 10:52 AM, Bart Van Assche wrote:
> On Wed, 2017-08-16 at 18:18 -0500, Brian King wrote:
>> On 08/16/2017 12:21 PM, Bart Van Assche wrote:
>>> On Wed, 2017-08-16 at 22:30 +0530, Abdul Haleem wrote:
>>>> As of next-20170809, linux-next on powerpc boot hung with below trace
>>>> message.
>>>>
>>>> [ ... ]
>>>>
>>>> A bisection resulted in first bad commit (270065e92 - scsi: scsi-mq:
>>>> Always unprepare ...) in the merge branch 'scsi/for-next'
>>>>
>>>> System booted fine when the below commit is reverted:
>>>>
>>>> commit 270065e92c317845d69095ec8e3d18616b5b39d5
>>>> Author: Bart Van Assche <[email protected]>
>>>> Date: Thu Aug 3 14:40:14 2017 -0700
>>>>
>>>> scsi: scsi-mq: Always unprepare before requeuing a request
>>>
>>> Hello Brian and Michael,
>>>
>>> Do you agree that this probably indicates a bug in the PowerPC block driver
>>> that is used to access the boot disk? Anyway, since a solution is not yet
>>> available, I will submit a revert for this patch.
>>
>> I've been looking at this a bit, and can recreate the issue, but haven't
>> got to root cause of the issue as of yet. If I do a sysrq-w while the system is hung
>> during boot I see this:
>>
>> [ 25.561523] Workqueue: events_unbound async_run_entry_fn
>> [ 25.561527] Call Trace:
>> [ 25.561529] [c0000001697873f0] [c000000169701600] 0xc000000169701600 (unreliable)
>> [ 25.561534] [c0000001697875c0] [c00000000001ab78] __switch_to+0x2e8/0x430
>> [ 25.561539] [c000000169787620] [c00000000091ccb0] __schedule+0x310/0xa00
>> [ 25.561543] [c0000001697876f0] [c00000000091d3e0] schedule+0x40/0xb0
>> [ 25.561548] [c000000169787720] [c000000000921e40] schedule_timeout+0x200/0x430
>> [ 25.561553] [c000000169787810] [c00000000091db10] io_schedule_timeout+0x30/0x70
>> [ 25.561558] [c000000169787840] [c00000000091e978] wait_for_common_io.constprop.3+0x178/0x280
>> [ 25.561563] [c0000001697878c0] [c00000000047f7ec] blk_execute_rq+0x7c/0xd0
>> [ 25.561567] [c000000169787910] [c000000000614cd0] scsi_execute+0x100/0x230
>> [ 25.561572] [c000000169787990] [c00000000060d29c] scsi_report_opcode+0xbc/0x170
>> [ 25.561577] [c000000169787a50] [d000000004fe6404] sd_revalidate_disk+0xe04/0x1620 [sd_mod]
>> [ 25.561583] [c000000169787b80] [d000000004fe6d84] sd_probe_async+0xb4/0x230 [sd_mod]
>> [ 25.561588] [c000000169787c00] [c00000000010fc44] async_run_entry_fn+0x74/0x210
>> [ 25.561593] [c000000169787c90] [c000000000102f48] process_one_work+0x198/0x480
>> [ 25.561598] [c000000169787d30] [c0000000001032b8] worker_thread+0x88/0x510
>> [ 25.561603] [c000000169787dc0] [c00000000010b030] kthread+0x160/0x1a0
>> [ 25.561608] [c000000169787e30] [c00000000000b3a4] ret_from_kernel_thread+0x5c/0xb8
>>
>> I was noticing that we are commonly in scsi_report_opcode. Since ipr RAID arrays don't support
>> the MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES, I tried setting sdev->no_report_opcodes = 1
>> in ipr's slave configure. This seems to eliminate the boot hang for me, but is only working around
>> the issue. Since this command is not supported by ipr, it should return with an illegal request.
>> When I'm hung at this point, there is nothing outstanding to the adapter / driver. I'll continue
>> debugging...
>
> (+linux-scsi)
>
> Hello Brian,
>
> Is kernel debugging enabled on your test system? Is lockdep enabled?
> Anyway, stack traces like the above usually mean that a request got stuck in
> a block or scsi driver (ipr in this case). Information about pending requests,
> including the SCSI CDB, is available under /sys/kernel/debug/block (see also
> commit 0eebd005dd07 ("scsi: Implement blk_mq_ops.show_rq()")).

I think I have an understanding what is going on and why Bart's patch is causing problems for ipr.
I can work around the boot hang in ipr, but ultimately I think we need to figure out a fix
in scsi / block. I added some tracing and confirmed its not a matter of commands getting stuck
in ipr. The issue is we are retrying failed commands until we finally run out of time. This is
what I see:

1. sd_revalidate_disk calls scsi_report_opcode
2. ipr RAID arrays don't support MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES
3. ipr returns the command with DID_ERROR
4. scsi_decide_disposition goes to maybe_retry, increments scmd->retries, and returns NEEDS_RETRY
5. scsi_softirq_done calls scsi_queue_insert to requeue the command, which calls scsi_mq_requeue_cmd
6. With Bart's change, we then clear RQF_DONTPREP in this path, while prior we did not
7. This results in the command getting scmd->retries zeroed out when it gets re-queued,
since we go through prep again and we lose our retry counter, resulting in lots and lots of retries.
8. Since the default command timeout for an ipr RAID array is 120 seconds, these retries go on for
quite a long time...
9. Finally, the command has been retried so long we trip over the overall retry timer
in scsi_softirq_done and we timeout the command.

I'll follow up with a patch to ipr to workaround the hang, but I think we need to somehow preserve
the retry counter in the scsi command, as this will likely cause issues with other drivers.

-Brian

--
Brian King
Power Linux I/O
IBM Linux Technology Center

2017-08-18 21:17:40

by Brian King

[permalink] [raw]
Subject: [PATCH] ipr: Set no_report_opcodes for RAID arrays

Since ipr RAID arrays do not support the MAINTENANCE_IN /
MI_REPORT_SUPPORTED_OPERATION_CODES, set no_report_opcodes
to prevent it from being sent.

Signed-off-by: Brian King <[email protected]>
---

Index: linux-2.6.git/drivers/scsi/ipr.c
===================================================================
--- linux-2.6.git.orig/drivers/scsi/ipr.c
+++ linux-2.6.git/drivers/scsi/ipr.c
@@ -4935,6 +4935,7 @@ static int ipr_slave_configure(struct sc
}
if (ipr_is_vset_device(res)) {
sdev->scsi_level = SCSI_SPC_3;
+ sdev->no_report_opcodes = 1;
blk_queue_rq_timeout(sdev->request_queue,
IPR_VSET_RW_TIMEOUT);
blk_queue_max_hw_sectors(sdev->request_queue, IPR_VSET_MAX_SECTORS);

2017-08-18 21:42:05

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On Fri, 2017-08-18 at 16:04 -0500, Brian King wrote:
> I think I have an understanding what is going on and why Bart's patch is causing problems for ipr.
> I can work around the boot hang in ipr, but ultimately I think we need to figure out a fix
> in scsi / block. I added some tracing and confirmed its not a matter of commands getting stuck
> in ipr. The issue is we are retrying failed commands until we finally run out of time. This is
> what I see:
>
> 1. sd_revalidate_disk calls scsi_report_opcode
> 2. ipr RAID arrays don't support MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES
> 3. ipr returns the command with DID_ERROR
> 4. scsi_decide_disposition goes to maybe_retry, increments scmd->retries, and returns NEEDS_RETRY
> 5. scsi_softirq_done calls scsi_queue_insert to requeue the command, which calls scsi_mq_requeue_cmd
> 6. With Bart's change, we then clear RQF_DONTPREP in this path, while prior we did not
> 7. This results in the command getting scmd->retries zeroed out when it gets re-queued,
> since we go through prep again and we lose our retry counter, resulting in lots and lots of retries.
> 8. Since the default command timeout for an ipr RAID array is 120 seconds, these retries go on for
> quite a long time...
> 9. Finally, the command has been retried so long we trip over the overall retry timer
> in scsi_softirq_done and we timeout the command.
>
> I'll follow up with a patch to ipr to workaround the hang, but I think we need to somehow preserve
> the retry counter in the scsi command, as this will likely cause issues with other drivers.

Hello Brian,

Thanks for the detailed analysis. This is very helpful. Have you considered
to change the ipr driver such that it terminates REPORT SUPPORTED OPERATION
CODES commands with the appropriate check condition code instead of DID_ERROR?

Thanks,

Bart.

2017-08-18 21:58:08

by Brian King

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On 08/18/2017 04:41 PM, Bart Van Assche wrote:
> On Fri, 2017-08-18 at 16:04 -0500, Brian King wrote:
>> I think I have an understanding what is going on and why Bart's patch is causing problems for ipr.
>> I can work around the boot hang in ipr, but ultimately I think we need to figure out a fix
>> in scsi / block. I added some tracing and confirmed its not a matter of commands getting stuck
>> in ipr. The issue is we are retrying failed commands until we finally run out of time. This is
>> what I see:
>>
>> 1. sd_revalidate_disk calls scsi_report_opcode
>> 2. ipr RAID arrays don't support MAINTENANCE_IN / MI_REPORT_SUPPORTED_OPERATION_CODES
>> 3. ipr returns the command with DID_ERROR
>> 4. scsi_decide_disposition goes to maybe_retry, increments scmd->retries, and returns NEEDS_RETRY
>> 5. scsi_softirq_done calls scsi_queue_insert to requeue the command, which calls scsi_mq_requeue_cmd
>> 6. With Bart's change, we then clear RQF_DONTPREP in this path, while prior we did not
>> 7. This results in the command getting scmd->retries zeroed out when it gets re-queued,
>> since we go through prep again and we lose our retry counter, resulting in lots and lots of retries.
>> 8. Since the default command timeout for an ipr RAID array is 120 seconds, these retries go on for
>> quite a long time...
>> 9. Finally, the command has been retried so long we trip over the overall retry timer
>> in scsi_softirq_done and we timeout the command.
>>
>> I'll follow up with a patch to ipr to workaround the hang, but I think we need to somehow preserve
>> the retry counter in the scsi command, as this will likely cause issues with other drivers.
>
> Hello Brian,
>
> Thanks for the detailed analysis. This is very helpful. Have you considered
> to change the ipr driver such that it terminates REPORT SUPPORTED OPERATION
> CODES commands with the appropriate check condition code instead of DID_ERROR?

Yes. That data is actually in the sense buffer, but since I'm also setting DID_ERROR,
scsi_decide_disposition isn't using it. I've got a patch to do just as you suggest,
to stop setting DID_ERROR when there is more detailed error data available,
but it will need some additional testing before I submit, as it will impact much
more than just this case.

To add to my analysis above, #9 should not be there... It looks like
jiffies_at_alloc would also be getting reinitialized in this case, resulting in
a perpetual retry, which is what I was seeing.

Thanks,

Brian

--
Brian King
Power Linux I/O
IBM Linux Technology Center

2017-08-18 22:14:31

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc

On Fri, 2017-08-18 at 16:57 -0500, Brian King wrote:
> To add to my analysis above, #9 should not be there... It looks like
> jiffies_at_alloc would also be getting reinitialized in this case, resulting in
> a perpetual retry, which is what I was seeing.

Hello Brian,

Some time ago I noticed that jiffies_at_alloc is indeed set while a command
is being prepared instead of at command allocation time. I think that
behavior was introduced in 2005 through commit b21a41385118 ("[SCSI] add
global timeout to the scsi mid-layer"). At that time SCSI commands were
allocated at prep time and freed at unprep time. Recently that has been
changed such that a SCSI command (struct scsi_cmnd) has the same lifetime as
struct request. In other words, it was not possible in 2005 but it is
possible today to set jiffies_at_alloc at command allocation time instead of
when a command is being prepared. Do you want me to submit a patch that
implements this change?

Bart.


2017-08-21 20:24:13

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [PATCH] ipr: Set no_report_opcodes for RAID arrays


Brian,

> Since ipr RAID arrays do not support the MAINTENANCE_IN /
> MI_REPORT_SUPPORTED_OPERATION_CODES, set no_report_opcodes to prevent
> it from being sent.

Applied to 4.13/scsi-fixes. Thank you!

--
Martin K. Petersen Oracle Linux Engineering

2017-08-21 20:27:32

by Martin K. Petersen

[permalink] [raw]
Subject: Re: [BUG][bisected 270065e] linux-next fails to boot on powerpc


Brian,

>> Thanks for the detailed analysis. This is very helpful. Have you
>> considered to change the ipr driver such that it terminates REPORT
>> SUPPORTED OPERATION CODES commands with the appropriate check
>> condition code instead of DID_ERROR?
>
> Yes. That data is actually in the sense buffer, but since I'm also
> setting DID_ERROR, scsi_decide_disposition isn't using it. I've got a
> patch to do just as you suggest, to stop setting DID_ERROR when there
> is more detailed error data available, but it will need some
> additional testing before I submit, as it will impact much more than
> just this case.

I agree. In this case where a command is not supported, a check
condition would be a better way to signal the failure to the SCSI
midlayer.

--
Martin K. Petersen Oracle Linux Engineering

2017-08-21 22:11:40

by Brian King

[permalink] [raw]
Subject: [PATCH 0/2] Allow scsi_prep_fn to occur for retried commands

The following two patches address the hang issue being observed
with Bart's patch on powerpc. The first patch moves the initialization
of jiffies_at_alloc from scsi_init_command to scsi_init_rq, and ensures
we don't zero jiffies_at_alloc in scsi_init_command. The second patch
saves / restores the retry counter in scsi_init_command which lets us
go through scsi_init_command for retries and not forget why we were
there.

These patches have only been boot tested on my Power machine with ipr
to ensure they fix the issue I was seeing.

-Brian

--
Brian King
Power Linux I/O
IBM Linux Technology Center

2017-08-21 22:13:30

by Brian King

[permalink] [raw]
Subject: [PATCH 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

Move the initialization of scsi_cmd->jiffies_at_alloc to allocation
time rather than prep time. Also ensure that jiffies_at_alloc
is preserved when we go through prep. This lets us send retries
through prep again and not break the overall retry timer logic
in scsi_softirq_done.

Suggested-by: Bart Van Assche <[email protected]>
Signed-off-by: Brian King <[email protected]>
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1154,6 +1154,7 @@ void scsi_init_command(struct scsi_devic
void *buf = cmd->sense_buffer;
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
+ unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;

/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1164,7 +1165,7 @@ void scsi_init_command(struct scsi_devic
cmd->prot_sdb = prot;
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
- cmd->jiffies_at_alloc = jiffies;
+ cmd->jiffies_at_alloc = jiffies_at_alloc;

scsi_add_cmd_to_list(cmd);
}
@@ -2119,6 +2120,7 @@ static int scsi_init_rq(struct request_q
if (!cmd->sense_buffer)
goto fail;
cmd->req.sense = cmd->sense_buffer;
+ cmd->jiffies_at_alloc = jiffies;

if (scsi_host_get_prot(shost) >= SHOST_DIX_TYPE0_PROTECTION) {
cmd->prot_sdb = kmem_cache_zalloc(scsi_sdb_cache, gfp);

2017-08-21 22:14:08

by Brian King

[permalink] [raw]
Subject: [PATCH 2/2] scsi: Preserve retry counter through scsi_prep_fn

Save / restore the retry counter in scsi_cmd in scsi_init_command.
This allows us to go back through scsi_init_command for retries
and not forget we are doing a retry.

Signed-off-by: Brian King <[email protected]>
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1155,6 +1155,7 @@ void scsi_init_command(struct scsi_devic
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;
+ int retries = cmd->retries;

/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1166,6 +1167,7 @@ void scsi_init_command(struct scsi_devic
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
cmd->jiffies_at_alloc = jiffies_at_alloc;
+ cmd->retries = retries;

scsi_add_cmd_to_list(cmd);
}

2017-08-21 22:16:58

by Brian King

[permalink] [raw]
Subject: Re: [PATCH 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

Scratch this one... Version 2 on the way with the corresponding changes
in scsi_init_request...

-Brian


--
Brian King
Power Linux I/O
IBM Linux Technology Center

2017-08-21 22:40:23

by Brian King

[permalink] [raw]
Subject: [PATCHv2 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

This second version also sets up jiffies_at_alloc in scsi_init_request.
This has been tested without the second patch in the series and
I've confirmed I now see the following in the logs after booting:

[ 121.718088] sd 1:2:0:0: timing out command, waited 120s
[ 121.798081] sd 1:2:1:0: timing out command, waited 120s

Without this patch I was never seeing these messages, indicating
the retry timer code wasn't working. Also, after seeing these
messages, I've confirmed there are no longer any hung tasks
in the kernel with sysrq-w, while before, without this patch,
I would see hung tasks for the scsi_report_opcodes calls which
were getting retried forever.

8<

Move the initialization of scsi_cmd->jiffies_at_alloc to allocation
time rather than prep time. Also ensure that jiffies_at_alloc
is preserved when we go through prep. This lets us send retries
through prep again and not break the overall retry timer logic
in scsi_softirq_done.

Suggested-by: Bart Van Assche <[email protected]>
Signed-off-by: Brian King <[email protected]>
---

Index: linux-2.6.git/drivers/scsi/scsi_lib.c
===================================================================
--- linux-2.6.git.orig/drivers/scsi/scsi_lib.c
+++ linux-2.6.git/drivers/scsi/scsi_lib.c
@@ -1154,6 +1154,7 @@ void scsi_init_command(struct scsi_devic
void *buf = cmd->sense_buffer;
void *prot = cmd->prot_sdb;
unsigned int unchecked_isa_dma = cmd->flags & SCMD_UNCHECKED_ISA_DMA;
+ unsigned long jiffies_at_alloc = cmd->jiffies_at_alloc;

/* zero out the cmd, except for the embedded scsi_request */
memset((char *)cmd + sizeof(cmd->req), 0,
@@ -1164,7 +1165,7 @@ void scsi_init_command(struct scsi_devic
cmd->prot_sdb = prot;
cmd->flags = unchecked_isa_dma;
INIT_DELAYED_WORK(&cmd->abort_work, scmd_eh_abort_handler);
- cmd->jiffies_at_alloc = jiffies;
+ cmd->jiffies_at_alloc = jiffies_at_alloc;

scsi_add_cmd_to_list(cmd);
}
@@ -2016,6 +2017,7 @@ static int scsi_init_request(struct blk_
if (!cmd->sense_buffer)
return -ENOMEM;
cmd->req.sense = cmd->sense_buffer;
+ cmd->jiffies_at_alloc = jiffies;

if (scsi_host_get_prot(shost)) {
sg = (void *)cmd + sizeof(struct scsi_cmnd) +
@@ -2119,6 +2121,7 @@ static int scsi_init_rq(struct request_q
if (!cmd->sense_buffer)
goto fail;
cmd->req.sense = cmd->sense_buffer;
+ cmd->jiffies_at_alloc = jiffies;

if (scsi_host_get_prot(shost) >= SHOST_DIX_TYPE0_PROTECTION) {
cmd->prot_sdb = kmem_cache_zalloc(scsi_sdb_cache, gfp);

2017-08-22 06:42:19

by Abdul Haleem

[permalink] [raw]
Subject: Re: [PATCH 0/2] Allow scsi_prep_fn to occur for retried commands

On Mon, 2017-08-21 at 17:11 -0500, Brian King wrote:
> The following two patches address the hang issue being observed
> with Bart's patch on powerpc. The first patch moves the initialization
> of jiffies_at_alloc from scsi_init_command to scsi_init_rq, and ensures
> we don't zero jiffies_at_alloc in scsi_init_command. The second patch
> saves / restores the retry counter in scsi_init_command which lets us
> go through scsi_init_command for retries and not forget why we were
> there.
>
> These patches have only been boot tested on my Power machine with ipr
> to ensure they fix the issue I was seeing.
>
> -Brian
>

Thank you Brian and Bart, for your efforts to fix this bug.

I tested these patches on my PowerPC machine and system booted fine.

[PATCHv2 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to
allocation time
[PATCH 2/2] scsi: Preserve retry counter through scsi_prep_fn

Reported-and-Tested-by: Abdul Haleem <[email protected]>

--
Regard's

Abdul Haleem
IBM Linux Technology Centre



2017-08-22 06:51:07

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 1/2] scsi: Move scsi_cmd->jiffies_at_alloc initialization to allocation time

On Mon, Aug 21, 2017 at 05:13:20PM -0500, Brian King wrote:
> Move the initialization of scsi_cmd->jiffies_at_alloc to allocation
> time rather than prep time. Also ensure that jiffies_at_alloc
> is preserved when we go through prep. This lets us send retries
> through prep again and not break the overall retry timer logic
> in scsi_softirq_done.
>
> Suggested-by: Bart Van Assche <[email protected]>
> Signed-off-by: Brian King <[email protected]>

As far as I can tell this will never set jiffies_at_alloc for the
blk-mq path.

2017-08-22 06:51:36

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH 2/2] scsi: Preserve retry counter through scsi_prep_fn

On Mon, Aug 21, 2017 at 05:14:00PM -0500, Brian King wrote:
> Save / restore the retry counter in scsi_cmd in scsi_init_command.
> This allows us to go back through scsi_init_command for retries
> and not forget we are doing a retry.

So where will we initialize it to zero now?