2014-11-11 23:33:29

by Barto

[permalink] [raw]
Subject: BUG in scsi_lib.c due to a bad commit

Hello everyone,

I notice a bug since kernel 3.17 ( and also with 3.18 branch ), a random
hang at boot on some PC configurations, I did a "git bisect" and I found
that the culprit is :

[045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8

the author of this commit has choosen to inverse the logic of the if
statement in the file drivers/scsi/scsi_lib.c in order to solve an
issue with qemu :

--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
blk_requeue_request(q, req);
atomic_dec(&sdev->device_busy);
out_delay:
- if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+ if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
blk_delay_queue(q, SCSI_QUEUE_DELAY);
}

this change triggers a bug on my PC ( I don't have SCSI devices, but
only 3 SATA harddisks and 2 IDE harddisks, SATA disks are on an ICH7
sata controler on the motherboard gigabyte GA-P31-DS3L, and IDE disk on
a JMicron JMB363/368 Sata/IDE PCIe card ), every 5~10 boots the boot
stops suddenly because of this commit,

If I revert this commit then the bug is gone, more details can be found
here, where I created a patch who reverts commit 045065d8 :

https://bugzilla.kernel.org/show_bug.cgi?id=87581

my question: why Guenter Roeck ( the author of the bad commit ) has
choosen to inverse the logic in the if statement ?
before his commit the if statement was like this :

if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
blk_delay_queue(q, SCSI_QUEUE_DELAY);

if his decision to inverse the logic of
"atomic_read(&sdev->device_busy)" is acceptable then the real bug is
probably located elsewhere in the scsi source code, and we must solve
this mistery because there is obviously a bug regression in SCSI code
because with older kernels ( 3.16.7 and lower ) I don't have the random
hang boot bug with my configuration,

another user in archlinux forums has also this bug and he has a more
modern PC ( intel i7 core cpu, SSD device ), my fear is when linux
distros will move to kernel 3.17 then more users will have this weird
random bug who can occur only on boot and only with a specific PC
configuration, if the boot step is passed despite the random bug then
the bug will not occur, it occurs only during the boot process, which
probably means that the faulty source code is only called during the
boot process,

thanks for anyone who wants to dig this problem with me


2014-11-12 00:17:40

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

[+cc Guenter, linux-scsi]

On Tue, Nov 11, 2014 at 4:33 PM, Barto <[email protected]> wrote:
> Hello everyone,
>
> I notice a bug since kernel 3.17 ( and also with 3.18 branch ), a random
> hang at boot on some PC configurations, I did a "git bisect" and I found
> that the culprit is :
>
> [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8
>
> the author of this commit has choosen to inverse the logic of the if
> statement in the file drivers/scsi/scsi_lib.c in order to solve an
> issue with qemu :
>
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
> blk_requeue_request(q, req);
> atomic_dec(&sdev->device_busy);
> out_delay:
> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
>
> this change triggers a bug on my PC ( I don't have SCSI devices, but
> only 3 SATA harddisks and 2 IDE harddisks, SATA disks are on an ICH7
> sata controler on the motherboard gigabyte GA-P31-DS3L, and IDE disk on
> a JMicron JMB363/368 Sata/IDE PCIe card ), every 5~10 boots the boot
> stops suddenly because of this commit,
>
> If I revert this commit then the bug is gone, more details can be found
> here, where I created a patch who reverts commit 045065d8 :
>
> https://bugzilla.kernel.org/show_bug.cgi?id=87581
>
> my question: why Guenter Roeck ( the author of the bad commit ) has
> choosen to inverse the logic in the if statement ?
> before his commit the if statement was like this :
>
> if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>
> if his decision to inverse the logic of
> "atomic_read(&sdev->device_busy)" is acceptable then the real bug is
> probably located elsewhere in the scsi source code, and we must solve
> this mistery because there is obviously a bug regression in SCSI code
> because with older kernels ( 3.16.7 and lower ) I don't have the random
> hang boot bug with my configuration,
>
> another user in archlinux forums has also this bug and he has a more
> modern PC ( intel i7 core cpu, SSD device ), my fear is when linux
> distros will move to kernel 3.17 then more users will have this weird
> random bug who can occur only on boot and only with a specific PC
> configuration, if the boot step is passed despite the random bug then
> the bug will not occur, it occurs only during the boot process, which
> probably means that the faulty source code is only called during the
> boot process,
>
> thanks for anyone who wants to dig this problem with me
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-11-12 02:53:38

by Guenter Roeck

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

On 11/11/2014 04:17 PM, Bjorn Helgaas wrote:
> [+cc Guenter, linux-scsi]
>
> On Tue, Nov 11, 2014 at 4:33 PM, Barto <[email protected]> wrote:
>> Hello everyone,
>>
>> I notice a bug since kernel 3.17 ( and also with 3.18 branch ), a random
>> hang at boot on some PC configurations, I did a "git bisect" and I found
>> that the culprit is :
>>
>> [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang problem
>>
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8
>>
>> the author of this commit has choosen to inverse the logic of the if
>> statement in the file drivers/scsi/scsi_lib.c in order to solve an
>> issue with qemu :
>>
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
>> blk_requeue_request(q, req);
>> atomic_dec(&sdev->device_busy);
>> out_delay:
>> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>> }
>>

The above commit was a follow-up to commit 71e75c97f97a, which did the following:

- if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
+ if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))

meaning it reversed the polarity of the check in the original code.
Since that commit created problems as observed in qemu, I thought it
would be obvious that restoring the original polarity would make sense.
This is explained in my patch, so I am somewhat at loss why ...

>> this change triggers a bug on my PC ( I don't have SCSI devices, but
>> only 3 SATA harddisks and 2 IDE harddisks, SATA disks are on an ICH7
>> sata controler on the motherboard gigabyte GA-P31-DS3L, and IDE disk on
>> a JMicron JMB363/368 Sata/IDE PCIe card ), every 5~10 boots the boot
>> stops suddenly because of this commit,
>>
>> If I revert this commit then the bug is gone, more details can be found
>> here, where I created a patch who reverts commit 045065d8 :
>>
>> https://bugzilla.kernel.org/show_bug.cgi?id=87581
>>
>> my question: why Guenter Roeck ( the author of the bad commit ) has
>> choosen to inverse the logic in the if statement ?

this question is asked.

Having said that, I don't really care one way or another. I am fine
with reverting my commit; if that is done, I'll simply remove the
related scsi tests from my qemu test runs.

Guenter

>> before his commit the if statement was like this :
>>
>> if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>>
>> if his decision to inverse the logic of
>> "atomic_read(&sdev->device_busy)" is acceptable then the real bug is
>> probably located elsewhere in the scsi source code, and we must solve
>> this mistery because there is obviously a bug regression in SCSI code
>> because with older kernels ( 3.16.7 and lower ) I don't have the random
>> hang boot bug with my configuration,
>>
>> another user in archlinux forums has also this bug and he has a more
>> modern PC ( intel i7 core cpu, SSD device ), my fear is when linux
>> distros will move to kernel 3.17 then more users will have this weird
>> random bug who can occur only on boot and only with a specific PC
>> configuration, if the boot step is passed despite the random bug then
>> the bug will not occur, it occurs only during the boot process, which
>> probably means that the faulty source code is only called during the
>> boot process,
>>
>> thanks for anyone who wants to dig this problem with me
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/
>

2014-11-13 03:28:30

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

reverting your commit 045065d8a300a37218c is a solution, but it's just a
temporary solution,

it's better to search why your commit can create a random hang on boot
on some PC configurations,

--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue *q)
blk_requeue_request(q, req);
atomic_dec(&sdev->device_busy);
out_delay:
- if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
+ if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
blk_delay_queue(q, SCSI_QUEUE_DELAY);
}

perhaps the atomic_read() function doesn't make the expected job on some
rare circonstances, I have the same doubts about the blk_delay_queue()
function


Le 12/11/2014 03:53, Guenter Roeck a écrit :> On 11/11/2014 04:17 PM,
Bjorn Helgaas wrote:
>> [+cc Guenter, linux-scsi]
>>
>> On Tue, Nov 11, 2014 at 4:33 PM, Barto <[email protected]>
>> wrote:
>>> Hello everyone,
>>>
>>> I notice a bug since kernel 3.17 ( and also with 3.18 branch ), a random
>>> hang at boot on some PC configurations, I did a "git bisect" and I found
>>> that the culprit is :
>>>
>>> [045065d8a300a37218c548e9aa7becd581c6a0e8] [SCSI] fix qemu boot hang
>>> problem
>>>
>>>
http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=045065d8a300a37218c548e9aa7becd581c6a0e8
>>>
>>>
>>> the author of this commit has choosen to inverse the logic of the if
>>> statement in the file drivers/scsi/scsi_lib.c in order to solve an
>>> issue with qemu :
>>>
>>> --- a/drivers/scsi/scsi_lib.c
>>> +++ b/drivers/scsi/scsi_lib.c
>>> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct
>>> request_queue *q)
>>> blk_requeue_request(q, req);
>>> atomic_dec(&sdev->device_busy);
>>> out_delay:
>>> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>>> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>>> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>>> }
>>>
>
> The above commit was a follow-up to commit 71e75c97f97a, which did the
> following:
>
> - if (sdev->device_busy == 0 && !scsi_device_blocked(sdev))
> + if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>
> meaning it reversed the polarity of the check in the original code.
> Since that commit created problems as observed in qemu, I thought it
> would be obvious that restoring the original polarity would make sense.
> This is explained in my patch, so I am somewhat at loss why ...
>
>>> this change triggers a bug on my PC ( I don't have SCSI devices, but
>>> only 3 SATA harddisks and 2 IDE harddisks, SATA disks are on an ICH7
>>> sata controler on the motherboard gigabyte GA-P31-DS3L, and IDE disk on
>>> a JMicron JMB363/368 Sata/IDE PCIe card ), every 5~10 boots the boot
>>> stops suddenly because of this commit,
>>>
>>> If I revert this commit then the bug is gone, more details can be found
>>> here, where I created a patch who reverts commit 045065d8 :
>>>
>>> https://bugzilla.kernel.org/show_bug.cgi?id=87581
>>>
>>> my question: why Guenter Roeck ( the author of the bad commit ) has
>>> choosen to inverse the logic in the if statement ?
>
> this question is asked.
>
> Having said that, I don't really care one way or another. I am fine
> with reverting my commit; if that is done, I'll simply remove the
> related scsi tests from my qemu test runs.
>
> Guenter
>
>>> before his commit the if statement was like this :
>>>
>>> if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>>> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>>>
>>> if his decision to inverse the logic of
>>> "atomic_read(&sdev->device_busy)" is acceptable then the real bug is
>>> probably located elsewhere in the scsi source code, and we must solve
>>> this mistery because there is obviously a bug regression in SCSI code
>>> because with older kernels ( 3.16.7 and lower ) I don't have the random
>>> hang boot bug with my configuration,
>>>
>>> another user in archlinux forums has also this bug and he has a more
>>> modern PC ( intel i7 core cpu, SSD device ), my fear is when linux
>>> distros will move to kernel 3.17 then more users will have this weird
>>> random bug who can occur only on boot and only with a specific PC
>>> configuration, if the boot step is passed despite the random bug then
>>> the bug will not occur, it occurs only during the boot process, which
>>> probably means that the faulty source code is only called during the
>>> boot process,
>>>
>>> thanks for anyone who wants to dig this problem with me
>>> --
>>> To unsubscribe from this list: send the line "unsubscribe
>>> linux-kernel" in
>>> the body of a message to [email protected]
>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>>> Please read the FAQ at http://www.tux.org/lkml/
>>
>
>

Subject: RE: BUG in scsi_lib.c due to a bad commit



> -----Original Message-----
> From: [email protected] [mailto:linux-kernel-
> [email protected]] On Behalf Of Barto
> Sent: Wednesday, November 12, 2014 9:28 PM
> To: Guenter Roeck; Bjorn Helgaas
> Cc: [email protected]; [email protected]; Joe
> Perches
> Subject: Re: BUG in scsi_lib.c due to a bad commit
>
> reverting your commit 045065d8a300a37218c is a solution, but it's just a
> temporary solution,
>
> it's better to search why your commit can create a random hang on boot
> on some PC configurations,
>
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue
> *q)
> blk_requeue_request(q, req);
> atomic_dec(&sdev->device_busy);
> out_delay:
> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
> blk_delay_queue(q, SCSI_QUEUE_DELAY);
> }
>
> perhaps the atomic_read() function doesn't make the expected job on some
> rare circonstances, I have the same doubts about the blk_delay_queue()
> function

Were you running with scsi_mod.use_blk_mq=Y or =N?

device_busy is the active queue depth for the device (e.g.
5 means there are 5 commands submitted but not yet completed).

The function reaches this code if it has run out of tags, the host
has reached its limit of outstanding commands, or the target has
reached its limit. It requeus the request:
* with delay if device_busy is zero
* without delay if device_busy is non_zero

I think this is the reasoning:
If device_busy is zero, trying to process the request again will
probably run into the same problem; a delay gives time for the
situation to change. If device_busy is non-zero, then the
requeued command goes behind others and might get a different
result.

With the polarity backwards, the lack of delay hung PA-RISC
and SPARC64 systems), not just QEMU. So, I don't think reverting
the fix is good.

Changing it to an unconditional delay might be safe - delay
regardless of device_busy (until the root cause is understood).

Also, SCSI_QUEUE_DELAY seems like an arbitrary magic number;
maybe that value isn't working correctly anymore?


????{.n?+???????+%?????ݶ??w??{.n?+????{??G?????{ay?ʇڙ?,j??f???h?????????z_??(?階?ݢj"???m??????G????????????&???~???iO???z??v?^?m???? ????????I?

2014-11-13 09:38:55

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello,

> Were you running with scsi_mod.use_blk_mq=Y or =N?

I don't find this value in my .config file related to kernel modules
options,

perhaps you talk about a kernel boot option for grub ?
I don't use this kernel option boot in grub ( scsi_mod.use_blk_mq ),

I use archlinux and his default kernel config file for 64 bits CPU, you
can find this file here :

https://projects.archlinux.org/svntogit/packages.git/tree/trunk?h=packages/linux

I don't have SCSI devices, I have only 3 SATA harddisks, and 2 IDE
harddisks,

before kernel 3.17 I don't have this random hang bug on boot, so we need
to know what's going on, it's probably a change made last months ago in
SCSI source code ( or something else ) who has introduced a bug who
prevents a correct behaviour on some parts in scsi_lib.c ( for example
the if statement related to atomic_read(&sdev->device_busy) and
blk_delay_queue(q, SCSI_QUEUE_DELAY) )

Guenter Roeck was the first to have been hit by this bug ( his qemu test
files hang on boot ), he thought the solution was to revert the polarity
of the if statement, it solves his problem but unfortunately a new bug
is now triggered on some PC configurations,

it would be interesting to find a definitive solution who can solve both
the "qemu bug" and the "random bug on boot" on some PC configurations,

I'm not an expert about the scsi code, perhaps with some unit tests you
can spot the defect element in the scsi source code, check if some parts
in the source code acts really like it should be ?


Le 13/11/2014 06:33, Elliott, Robert (Server Storage) a écrit :
>
>
>> -----Original Message-----
>> From: [email protected] [mailto:linux-kernel-
>> [email protected]] On Behalf Of Barto
>> Sent: Wednesday, November 12, 2014 9:28 PM
>> To: Guenter Roeck; Bjorn Helgaas
>> Cc: [email protected]; [email protected]; Joe
>> Perches
>> Subject: Re: BUG in scsi_lib.c due to a bad commit
>>
>> reverting your commit 045065d8a300a37218c is a solution, but it's just a
>> temporary solution,
>>
>> it's better to search why your commit can create a random hang on boot
>> on some PC configurations,
>>
>> --- a/drivers/scsi/scsi_lib.c
>> +++ b/drivers/scsi/scsi_lib.c
>> @@ -1774,7 +1774,7 @@ static void scsi_request_fn(struct request_queue
>> *q)
>> blk_requeue_request(q, req);
>> atomic_dec(&sdev->device_busy);
>> out_delay:
>> - if (atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>> + if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
>> blk_delay_queue(q, SCSI_QUEUE_DELAY);
>> }
>>
>> perhaps the atomic_read() function doesn't make the expected job on some
>> rare circonstances, I have the same doubts about the blk_delay_queue()
>> function
>
> Were you running with scsi_mod.use_blk_mq=Y or =N?
>
> device_busy is the active queue depth for the device (e.g.
> 5 means there are 5 commands submitted but not yet completed).
>
> The function reaches this code if it has run out of tags, the host
> has reached its limit of outstanding commands, or the target has
> reached its limit. It requeus the request:
> * with delay if device_busy is zero
> * without delay if device_busy is non_zero
>
> I think this is the reasoning:
> If device_busy is zero, trying to process the request again will
> probably run into the same problem; a delay gives time for the
> situation to change. If device_busy is non-zero, then the
> requeued command goes behind others and might get a different
> result.
>
> With the polarity backwards, the lack of delay hung PA-RISC
> and SPARC64 systems), not just QEMU. So, I don't think reverting
> the fix is good.
>
> Changing it to an unconditional delay might be safe - delay
> regardless of device_busy (until the root cause is understood).
>
> Also, SCSI_QUEUE_DELAY seems like an arbitrary magic number;
> maybe that value isn't working correctly anymore?
>
>

2014-11-13 14:29:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Guenters commit just fixes the inversion from commit

71e75c97f97a9645d25fbf3d8e4165a558f18747 scsi: convert device_busy to atomic_t

Can you please test the commit before that:
74665016086615bbaa3fa6f83af410a0a4e029ee scsi: convert host_busy to atomic_t

and check if that works for you or not.

2014-11-13 15:14:23

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

I am not sure to understand your request, it's quite confusing,

you ask me to revert this commit and see if the bug is gone ? :

74665016086615bbaa3fa6f83af410a0a4e029ee scsi: convert host_busy to atomic_t

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=74665016086615bbaa3fa6f83af410a0a4e029ee

Le 13/11/2014 15:29, Christoph Hellwig a ?crit :
> Guenters commit just fixes the inversion from commit
>
> 71e75c97f97a9645d25fbf3d8e4165a558f18747 scsi: convert device_busy to atomic_t
>
> Can you please test the commit before that:
> 74665016086615bbaa3fa6f83af410a0a4e029ee scsi: convert host_busy to atomic_t
>
> and check if that works for you or not.
>

2014-11-13 17:14:47

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello Christoph,

I tested this commit :

7ae65c0f9646c29432b69580b80e08632e6cd813 scsi: convert target_busy to an
atomic_t

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae65c0f9646c29432b69580b80e08632e6cd813

there is no bug, boot process is ok whit this commit,

when I did the "git bisect" the first bad commit was Guenters' commit (
045065d8a300a37218c548e9aa7becd581c6a0e8 ), it seems that all commits
before 045065d8a300a37218c548e9aa7becd581c6a0e8 are ok,

you can find the log of the bisect here :

https://bugzilla.kernel.org/attachment.cgi?id=157211

Le 13/11/2014 15:29, Christoph Hellwig a ?crit :
> Guenters commit just fixes the inversion from commit
>
> 71e75c97f97a9645d25fbf3d8e4165a558f18747 scsi: convert device_busy to atomic_t
>
> Can you please test the commit before that:
> 74665016086615bbaa3fa6f83af410a0a4e029ee scsi: convert host_busy to atomic_t
>
> and check if that works for you or not.
>

2014-11-13 17:54:15

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

On Thu, Nov 13, 2014 at 06:14:14PM +0100, Barto wrote:
> Hello Christoph,
>
> I tested this commit :
>
> 7ae65c0f9646c29432b69580b80e08632e6cd813 scsi: convert target_busy to an
> atomic_t
>
> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae65c0f9646c29432b69580b80e08632e6cd813
>
> there is no bug, boot process is ok whit this commit,
>
> when I did the "git bisect" the first bad commit was Guenters' commit (
> 045065d8a300a37218c548e9aa7becd581c6a0e8 ), it seems that all commits
> before 045065d8a300a37218c548e9aa7becd581c6a0e8 are ok,

The bisect isn't quite interesting here as we're trying to find the real
root ause.

045065d8a300a37218c548e9aa7becd581c6a0e8 reverts the polarity of the
check that 71e75c97f97a9645d25fbf3d8e4165a558f18747 incorrectly
chaged.

The behavior you see doesn't make a lot of sense from a high level view,
so we need to find out what's going on.

When you say all commits before 045065d8a300a37218c548e9aa7becd581c6a0e8
I wonder if the bisect simply skiped to many revisions.

Please explicitly test commits:

71e75c97f97a9645d25fbf3d8e4165a558f18747

and

74665016086615bbaa3fa6f83af410a0a4e029ee

Thanks!

2014-11-13 22:56:12

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello Christoph,

> Please explicitly test commits:
>
> 71e75c97f97a9645d25fbf3d8e4165a558f18747

with commit 71e75c97f97a9645d25fbf3d8e4165a558f18747 there is no problem


> and
>
> 74665016086615bbaa3fa6f83af410a0a4e029ee

it's interesting, with this commit
74665016086615bbaa3fa6f83af410a0a4e029ee I have the bug :

scsi: convert host_busy to atomic_t :

http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=74665016086615bbaa3fa6f83af410a0a4e029ee

Le 13/11/2014 18:54, Christoph Hellwig a ?crit :
> On Thu, Nov 13, 2014 at 06:14:14PM +0100, Barto wrote:
>> Hello Christoph,
>>
>> I tested this commit :
>>
>> 7ae65c0f9646c29432b69580b80e08632e6cd813 scsi: convert target_busy to an
>> atomic_t
>>
>> http://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=7ae65c0f9646c29432b69580b80e08632e6cd813
>>
>> there is no bug, boot process is ok whit this commit,
>>
>> when I did the "git bisect" the first bad commit was Guenters' commit (
>> 045065d8a300a37218c548e9aa7becd581c6a0e8 ), it seems that all commits
>> before 045065d8a300a37218c548e9aa7becd581c6a0e8 are ok,
>
> The bisect isn't quite interesting here as we're trying to find the real
> root ause.
>
> 045065d8a300a37218c548e9aa7becd581c6a0e8 reverts the polarity of the
> check that 71e75c97f97a9645d25fbf3d8e4165a558f18747 incorrectly
> chaged.
>
> The behavior you see doesn't make a lot of sense from a high level view,
> so we need to find out what's going on.
>
> When you say all commits before 045065d8a300a37218c548e9aa7becd581c6a0e8
> I wonder if the bisect simply skiped to many revisions.
>
> Please explicitly test commits:
>
> 71e75c97f97a9645d25fbf3d8e4165a558f18747
>
> and
>
> 74665016086615bbaa3fa6f83af410a0a4e029ee
>
> Thanks!
>
>

2014-11-14 07:32:21

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

On Thu, Nov 13, 2014 at 11:55:38PM +0100, Barto wrote:
> it's interesting, with this commit
> 74665016086615bbaa3fa6f83af410a0a4e029ee I have the bug :
>
> scsi: convert host_busy to atomic_t :

At this point we'll need a bisction between v3.16 as the last good
point, and 74665016086615bbaa3fa6f83af410a0a4e029ee as the known bad
point.

2014-11-14 16:31:11

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello Christoph,

I did the new bisect,

the first bad commit is :

74665016086615bbaa3fa6f83af410a0a4e029ee
scsi: convert host_busy to atomic_t

you can find the bisect log here :

https://bugzilla.kernel.org/attachment.cgi?id=157621


Le 14/11/2014 08:32, Christoph Hellwig a ?crit :
> On Thu, Nov 13, 2014 at 11:55:38PM +0100, Barto wrote:
>> it's interesting, with this commit
>> 74665016086615bbaa3fa6f83af410a0a4e029ee I have the bug :
>>
>> scsi: convert host_busy to atomic_t :
>
> At this point we'll need a bisction between v3.16 as the last good
> point, and 74665016086615bbaa3fa6f83af410a0a4e029ee as the known bad
> point.
>

2014-11-16 18:31:34

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello everyone,

> Also, SCSI_QUEUE_DELAY seems like an arbitrary magic number;
> maybe that value isn't working correctly anymore?

this is an excellent remark from Robert Elliot,

this gives me an idea : try to set manually a value in the if()
statement ( line 1779 in file /drivers/scsi/scsi_lib.c )

by default the value of SCSI_QUEUE_DELAY is 3 ms, which might be
inapropriate with some slow harddisks and with the changes made by the
commit
74665016086615bbaa3fa6f83af410a0a4e029ee ( scsi: convert host_busy to
atomic_t ),

after further tests I discover that the value 40 ms solves my problem,
the bug is gone with this value,

here is the patch who sets 40 ms in the if() statement :

--- a/drivers/scsi/scsi_lib.c 2014-10-05 21:23:04.000000000 +0200
+++ b/drivers/scsi/scsi_lib.c 2014-11-16 17:39:16.819674725 +0100
@@ -1776,7 +1776,7 @@ static void scsi_request_fn(struct reque
atomic_dec(&sdev->device_busy);
out_delay:
if (!atomic_read(&sdev->device_busy) && !scsi_device_blocked(sdev))
- blk_delay_queue(q, SCSI_QUEUE_DELAY);
+ blk_delay_queue(q, 40);
}

static inline int prep_to_mq(int ret)

with this patch the value of SCSI_QUEUE_DELAY is still 3 ms, but here we
use 40 ms only in a specific part of scsi_lib.c file ( line 1779, it's
this part where the bug seems to be triggered, so that's why I set 40 ms
here in the blk_delay_queue() function )

after applying this patch I don't see problems related to I/O
performance/speed, all is ok,

the question is now : why putting a higher value in line 1779 does solve
the bug ?
and why before the commit 74665016086615bbaa3fa6f83af410a0a4e029ee I
don't have problems even with a value of 3 ms for SCSI_QUEUE_DELAY ?



Le 14/11/2014 08:32, Christoph Hellwig a ?crit :
> On Thu, Nov 13, 2014 at 11:55:38PM +0100, Barto wrote:
>> it's interesting, with this commit
>> 74665016086615bbaa3fa6f83af410a0a4e029ee I have the bug :
>>
>> scsi: convert host_busy to atomic_t :
>
> At this point we'll need a bisction between v3.16 as the last good
> point, and 74665016086615bbaa3fa6f83af410a0a4e029ee as the known bad
> point.
>

2014-11-19 20:22:36

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello,

I solved the mystery,

I found that the element who triggers the bug ( random hang at boot with
kernel 3.17 and 3.18 ) is the combination of 3 elements :

- the use of a SATA DVD burner ( Liteon iHAS124 C ) on a ICH7 Sata controler
- the use of a gigabyte motherboard GA-P31-DSL3 ( bios F10A, ICH7
controler, intel P31 chipset )
- commit 74665016086615bbaa3fa6f83af410a0a4e029ee ( scsi: convert
host_busy to atomic_t )

If I connect this Sata DVD burner and a sata harddisk by using the SATA
ports of the motherboard then the bug will occur ( but the bug will
occur only on kernels 3.17 and 3.18, there is no problems with older
kernels, and no problems with Windows 7 )

If I disconnect the SATA DVD burner then the bug is gone, no problems
with kernels 3.17 and 3.18,

And if I connect the SATA DVD burner on my JMicron SATA/IDE PCIe card
then there is no problem, no bugs, this is a perfect workaround for my
problem, because I can use kernel 3.17/3.18 without problem with this
configuration.

But I don't know which element I should blame, my gigabyte motherboard ?
( faulty bios ? ) The use of "atomic_t" in scsi source code ? (
innapropriate way to handle SATA devices, it breaks compatibility with
some PC configurations ? )




Le 14/11/2014 08:32, Christoph Hellwig a ?crit :
> On Thu, Nov 13, 2014 at 11:55:38PM +0100, Barto wrote:
>> it's interesting, with this commit
>> 74665016086615bbaa3fa6f83af410a0a4e029ee I have the bug :
>>
>> scsi: convert host_busy to atomic_t :
>
> At this point we'll need a bisction between v3.16 as the last good
> point, and 74665016086615bbaa3fa6f83af410a0a4e029ee as the known bad
> point.
>

2014-11-20 06:09:40

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

I

Hi Barto,

sorry for the late reply, and thanks for drilling down the exact
conditions.

I think we have some issues with the lack of the host lock vs error
handling, but I still don't undertand the details.

I've got a test patch for you that just adds the host lock back in a few
places while keeing the atomic_t. Can you try to reproduce with that
one?

When breaking an existing setup in software it's always the softwares
fault, btw..


diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
index 994eb08..99b77f7 100644
--- a/drivers/scsi/scsi_lib.c
+++ b/drivers/scsi/scsi_lib.c
@@ -311,16 +311,16 @@ void scsi_device_unbusy(struct scsi_device *sdev)
struct scsi_target *starget = scsi_target(sdev);
unsigned long flags;

+ spin_lock_irqsave(shost->host_lock, flags);
atomic_dec(&shost->host_busy);
if (starget->can_queue > 0)
atomic_dec(&starget->target_busy);

if (unlikely(scsi_host_in_recovery(shost) &&
(shost->host_failed || shost->host_eh_scheduled))) {
- spin_lock_irqsave(shost->host_lock, flags);
scsi_eh_wakeup(shost);
- spin_unlock_irqrestore(shost->host_lock, flags);
}
+ spin_unlock_irqrestore(shost->host_lock, flags);

atomic_dec(&sdev->device_busy);
}
@@ -1504,6 +1504,8 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
if (scsi_host_in_recovery(shost))
return 0;

+ spin_lock_irq(shost->host_lock);
+
busy = atomic_inc_return(&shost->host_busy) - 1;
if (atomic_read(&shost->host_blocked) > 0) {
if (busy)
@@ -1527,21 +1529,19 @@ static inline int scsi_host_queue_ready(struct request_queue *q,

/* We're OK to process the command, so we can't be starved */
if (!list_empty(&sdev->starved_entry)) {
- spin_lock_irq(shost->host_lock);
if (!list_empty(&sdev->starved_entry))
list_del_init(&sdev->starved_entry);
- spin_unlock_irq(shost->host_lock);
}

+ spin_unlock_irq(shost->host_lock);
return 1;

starved:
- spin_lock_irq(shost->host_lock);
if (list_empty(&sdev->starved_entry))
list_add_tail(&sdev->starved_entry, &shost->starved_list);
- spin_unlock_irq(shost->host_lock);
out_dec:
atomic_dec(&shost->host_busy);
+ spin_unlock_irq(shost->host_lock);
return 0;
}

2014-11-20 17:45:17

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello Christoph,

I tested your patch, it works, the bug is gone with your patch, it's a
good news,

I notice that my motherboard doesn't support "AHCI" mode, my motherboard
uses an ICH7 sata controler, ICH7 doesn't support AHCI if I check the
technical specifications of this controler,

it seems that my bios treats SATA devices like IDE devices ( IDE
emulation mode ), for example I can read in the bios these lines "IDE 0
channel master : "the name of Sata harddisk 1", "IDE 3 channel master :
"the name of Sata harddisk 2",

perhaps the bug only occurs when the AHCI mode is not used ?

because on archlinux forums another user has noticed that this bug only
occurs if he enables the "IDE emulation" option in the bios of his
motherboard ( a recent gigabyte motherboard : Z68P-DS3 intel for i7 CPU
), and if he chooses the "AHCI mode" instead of the "IDE emulation mode"
in his bios settings then the bug disappears, no random hangs at boot
when AHCI is enabled,

and this user has also a SATA DVD burner,

so perhaps you can reproduce this bug on your PC if you have the same
option enabled in the bios ( IDE emulation mode ) and a Sata DVD burner,
in this case you will have 20~30% of chance to trigger the bug ( random
hangs will occur approximately every 5~10 boots )


Le 20/11/2014 07:09, Christoph Hellwig a ?crit :
> I
>
> Hi Barto,
>
> sorry for the late reply, and thanks for drilling down the exact
> conditions.
>
> I think we have some issues with the lack of the host lock vs error
> handling, but I still don't undertand the details.
>
> I've got a test patch for you that just adds the host lock back in a few
> places while keeing the atomic_t. Can you try to reproduce with that
> one?
>
> When breaking an existing setup in software it's always the softwares
> fault, btw..
>
>
> diff --git a/drivers/scsi/scsi_lib.c b/drivers/scsi/scsi_lib.c
> index 994eb08..99b77f7 100644
> --- a/drivers/scsi/scsi_lib.c
> +++ b/drivers/scsi/scsi_lib.c
> @@ -311,16 +311,16 @@ void scsi_device_unbusy(struct scsi_device *sdev)
> struct scsi_target *starget = scsi_target(sdev);
> unsigned long flags;
>
> + spin_lock_irqsave(shost->host_lock, flags);
> atomic_dec(&shost->host_busy);
> if (starget->can_queue > 0)
> atomic_dec(&starget->target_busy);
>
> if (unlikely(scsi_host_in_recovery(shost) &&
> (shost->host_failed || shost->host_eh_scheduled))) {
> - spin_lock_irqsave(shost->host_lock, flags);
> scsi_eh_wakeup(shost);
> - spin_unlock_irqrestore(shost->host_lock, flags);
> }
> + spin_unlock_irqrestore(shost->host_lock, flags);
>
> atomic_dec(&sdev->device_busy);
> }
> @@ -1504,6 +1504,8 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
> if (scsi_host_in_recovery(shost))
> return 0;
>
> + spin_lock_irq(shost->host_lock);
> +
> busy = atomic_inc_return(&shost->host_busy) - 1;
> if (atomic_read(&shost->host_blocked) > 0) {
> if (busy)
> @@ -1527,21 +1529,19 @@ static inline int scsi_host_queue_ready(struct request_queue *q,
>
> /* We're OK to process the command, so we can't be starved */
> if (!list_empty(&sdev->starved_entry)) {
> - spin_lock_irq(shost->host_lock);
> if (!list_empty(&sdev->starved_entry))
> list_del_init(&sdev->starved_entry);
> - spin_unlock_irq(shost->host_lock);
> }
>
> + spin_unlock_irq(shost->host_lock);
> return 1;
>
> starved:
> - spin_lock_irq(shost->host_lock);
> if (list_empty(&sdev->starved_entry))
> list_add_tail(&sdev->starved_entry, &shost->starved_list);
> - spin_unlock_irq(shost->host_lock);
> out_dec:
> atomic_dec(&shost->host_busy);
> + spin_unlock_irq(shost->host_lock);
> return 0;
> }
>
>

2014-11-20 17:53:20

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

On Thu, Nov 20, 2014 at 06:44:32PM +0100, Barto wrote:
> I notice that my motherboard doesn't support "AHCI" mode, my motherboard
> uses an ICH7 sata controler, ICH7 doesn't support AHCI if I check the
> technical specifications of this controler,
>
> it seems that my bios treats SATA devices like IDE devices ( IDE
> emulation mode ), for example I can read in the bios these lines "IDE 0
> channel master : "the name of Sata harddisk 1", "IDE 3 channel master :
> "the name of Sata harddisk 2",
>
> perhaps the bug only occurs when the AHCI mode is not used ?

Looks like it. I suspect you device only supports a single outstanding
command without AHCI, can you confirm this by doing:

for i in /sys/class/scsi_host/*; do cat $i/can_queue; done

2014-11-20 18:28:26

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello,

here is the result of the command :

$ for i in /sys/class/scsi_host/*; do cat $i/can_queue; done
1
1
1
1
1
1
31
31

but the sata DVD burner is now connected on a Jmicron sata PCIe card (
with this configuration the bug doesn't occur ), I don't know if this
change can modify the result of your command,

here is the result of another command :

$ for i in /sys/class/scsi_host/*; do cat $i/proc_name; done
pata_jmicron
pata_jmicron
ata_piix
ata_piix
ata_piix
ata_piix
ahci
ahci

$ dmesg | grep ahci
[ 0.883351] ahci 0000:03:00.0: version 3.0
[ 0.883505] ahci 0000:03:00.0: AHCI 0001.0100 32 slots 2 ports 3 Gbps
0x3 impl SATA mode
[ 0.883508] ahci 0000:03:00.0: flags: 64bit ncq pm led clo pmp pio
slum part
[ 0.884329] scsi host6: ahci
[ 0.884502] scsi host7: ahci

Le 20/11/2014 18:53, Christoph Hellwig a ?crit :
> On Thu, Nov 20, 2014 at 06:44:32PM +0100, Barto wrote:
>> I notice that my motherboard doesn't support "AHCI" mode, my motherboard
>> uses an ICH7 sata controler, ICH7 doesn't support AHCI if I check the
>> technical specifications of this controler,
>>
>> it seems that my bios treats SATA devices like IDE devices ( IDE
>> emulation mode ), for example I can read in the bios these lines "IDE 0
>> channel master : "the name of Sata harddisk 1", "IDE 3 channel master :
>> "the name of Sata harddisk 2",
>>
>> perhaps the bug only occurs when the AHCI mode is not used ?
>
> Looks like it. I suspect you device only supports a single outstanding
> command without AHCI, can you confirm this by doing:
>
> for i in /sys/class/scsi_host/*; do cat $i/can_queue; done
>
>

2014-11-24 09:18:45

by Christoph Hellwig

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

On Thu, Nov 20, 2014 at 07:27:49PM +0100, Barto wrote:
> but the sata DVD burner is now connected on a Jmicron sata PCIe card (
> with this configuration the bug doesn't occur ), I don't know if this
> change can modify the result of your command,
>
> here is the result of another command :
>
> $ for i in /sys/class/scsi_host/*; do cat $i/proc_name; done
> pata_jmicron
> pata_jmicron
> ata_piix
> ata_piix
> ata_piix
> ata_piix
> ahci
> ahci

Interesting. So the jmicron card also has a queue depth of just one,
but the error still occurs? I've added the linux-ide list, maybe
someone there has an idea how piix vs jmicron could have such different
behaviors for queueing or error handling.

2014-11-24 15:13:24

by Barto

[permalink] [raw]
Subject: Re: BUG in scsi_lib.c due to a bad commit

Hello Christoph,

to be clear :

my JMicron pcie card JMB363/368 has 2 sata ports and one IDE port,
JMicron supports AHCI on these 2 sata ports,

my gigabyte motherboard has 4 sata ports and one IDE port, my gigabyte
motherboard doesn't support AHCI on these 4 sata ports ( my bios seems
to treats sata devices like IDE devices, a kind of "IDE emulation mode" ),

the bug doesn't occur if the Sata DVD burner is not mixed with a hard
disk on the Sata ports ( motherboard sata ports ),

the configuration who triggers the bug is mixing a slow device ( sata
dvd burner ) with a fast device ( sata harddisk ) on the sata ports of
the motherboard ( who doesn't support AHCI ),

the bug doesn't occur if the sata dvd burner is connected on a JMicron
pcie card because there is no other sata harddisks connected on this
JMicron, I connect only a IDE hardidsk on this JMicron pcie card and
also this Sata dvd burner in order to avoid the bug ( it's a workaround ),

2 pata_jmicron = the IDE port in this JMicron pcie card ( master and slave )
4 ata_piix = 4 sata ports from the gigabyte motherboard, this ports
doesn't support AHCI
2 ahci = 2 sata ports in this JMicron pcie card

but another user has the same problem but with an IDE DVD burner ( mixed
with an IDE hardidsk ) :

https://bugzilla.kernel.org/show_bug.cgi?id=87581#c18

this user has tested your patch and it solves the bug


Le 24/11/2014 10:18, Christoph Hellwig a ?crit :
> On Thu, Nov 20, 2014 at 07:27:49PM +0100, Barto wrote:
>> but the sata DVD burner is now connected on a Jmicron sata PCIe card (
>> with this configuration the bug doesn't occur ), I don't know if this
>> change can modify the result of your command,
>>
>> here is the result of another command :
>>
>> $ for i in /sys/class/scsi_host/*; do cat $i/proc_name; done
>> pata_jmicron
>> pata_jmicron
>> ata_piix
>> ata_piix
>> ata_piix
>> ata_piix
>> ahci
>> ahci
>
> Interesting. So the jmicron card also has a queue depth of just one,
> but the error still occurs? I've added the linux-ide list, maybe
> someone there has an idea how piix vs jmicron could have such different
> behaviors for queueing or error handling.
>
>