2018-04-01 14:41:27

by Sitsofe Wheeler

[permalink] [raw]
Subject: [BISECTED][REGRESSION] Hang while booting EeePC 900

Hi,

While trying to boot an EeePC 900 on the latest git mainline kernel a
hang is encountered while systemd is starting services but this did
not happen with 4.15. A bisection narrowed it down to the commit
358f70da49d77c43f2ca11b5da584213b2add29c ("blk-mq: make
blk_abort_request() trigger timeout path"). Here's the bisection log:

git bisect start
# bad: [9829fcd7ae99d5955bb76a8fb8060e63339d7c9d] drm-tip:
2018y-03m-29d-19h-56m-48s UTC integration manifest
git bisect bad 9829fcd7ae99d5955bb76a8fb8060e63339d7c9d
# good: [bebc6082da0a9f5d47a1ea2edc099bf671058bd4] Linux 4.14
git bisect good bebc6082da0a9f5d47a1ea2edc099bf671058bd4
# skip: [68ae7424583b561c2fda2fb03726e56946b34b19] ixgbe: Drop
l2_accel_priv data pointer from ring struct
git bisect skip 68ae7424583b561c2fda2fb03726e56946b34b19
# good: [d8a5b80568a9cb66810e75b182018e9edb68e8ff] Linux 4.15
git bisect good d8a5b80568a9cb66810e75b182018e9edb68e8ff
# bad: [fe53d1443a146326b49d57fe6336b5c2a725223f] Merge tag
'armsoc-drivers' of
git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad fe53d1443a146326b49d57fe6336b5c2a725223f
# bad: [a103950e0dd2058df5e8a8d4a915707bdcf205f0] Merge branch 'linus'
of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad a103950e0dd2058df5e8a8d4a915707bdcf205f0
# bad: [d8b91dde38f4c43bd0bbbf17a90f735b16aaff2c] Merge branch
'perf-core-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad d8b91dde38f4c43bd0bbbf17a90f735b16aaff2c
# bad: [9697e9da84299d0d715d515dd2cc48f1eceb277d] Merge tag
'edac_for_4.16' of git://git.kernel.org/pub/scm/linux/kernel/git/bp/bp
git bisect bad 9697e9da84299d0d715d515dd2cc48f1eceb277d
# good: [7f3fdd40a7dfaa7405185250974b0fabd08c1f8b] Merge tag
'pm-4.16-rc1' of
git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm
git bisect good 7f3fdd40a7dfaa7405185250974b0fabd08c1f8b
# good: [0fc7e74663447682c904fe375bb680b004ddaa14] Merge tag
'mtd/for-4.16' of git://git.infradead.org/linux-mtd
git bisect good 0fc7e74663447682c904fe375bb680b004ddaa14
# good: [310eb252a78307fc2ac4c4c755290a578c0304d0] mmc: mmci: fix
error return code in mmci_probe()
git bisect good 310eb252a78307fc2ac4c4c755290a578c0304d0
# bad: [3eb2ce825ea1ad89d20f7a3b5780df850e4be274] Linux 4.16-rc7
git bisect bad 3eb2ce825ea1ad89d20f7a3b5780df850e4be274
# bad: [537433b6241e067de2d9da3bed5f4fed9c9eac58] Merge tag
'armsoc-dt' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc
git bisect bad 537433b6241e067de2d9da3bed5f4fed9c9eac58
# bad: [a103950e0dd2058df5e8a8d4a915707bdcf205f0] Merge branch 'linus'
of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6
git bisect bad a103950e0dd2058df5e8a8d4a915707bdcf205f0
# bad: [efd52b5d363e3e3b6224ad39949219c0df117c91] Merge tag
'nfs-for-4.16-1' of git://git.linux-nfs.org/projects/trondmy/linux-nfs
git bisect bad efd52b5d363e3e3b6224ad39949219c0df117c91
# bad: [6304672b7f0a5c010002e63a075160856dc4f88d] Merge branch
'x86-pti-for-linus' of
git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 6304672b7f0a5c010002e63a075160856dc4f88d
# bad: [a4b7fd7d34de5765dece2dd08060d2e1f7be3b39] Merge tag
'iversion-v4.16-1' of
git://git.kernel.org/pub/scm/linux/kernel/git/jlayton/linux
git bisect bad a4b7fd7d34de5765dece2dd08060d2e1f7be3b39
# skip: [8c7a8d1c4b9c30a2be3b31a2e6af1cefd45574eb] lib/scatterlist:
Fix chaining support in sgl_alloc_order()
git bisect skip 8c7a8d1c4b9c30a2be3b31a2e6af1cefd45574eb
# skip: [85ba3effc5a0836b9195f2010684062ad230cc23] blk-mq: add missing
RQF_STARTED to debugfs
git bisect skip 85ba3effc5a0836b9195f2010684062ad230cc23
# good: [73d8003749ebfd9617076213c0033e11aacf823c] Merge
remote-tracking branch 'regulator/topic/core' into regulator-next
git bisect good 73d8003749ebfd9617076213c0033e11aacf823c
# bad: [9b9c63f71bddf3dc897aaaa46c0593abc3e09932] Merge branch
'nvme-4.16' of git://git.infradead.org/nvme into for-4.16/block
git bisect bad 9b9c63f71bddf3dc897aaaa46c0593abc3e09932
# good: [b221fc130c49c50f4c2250d22e873420765a9fa2] bcache: ret IOERR
when read meets metadata error
git bisect good b221fc130c49c50f4c2250d22e873420765a9fa2
# bad: [e14575b3d457f5806d79b85886ef94d9c29e3b2a] block: convert
REQ_ATOM_COMPLETE to stealing rq->__deadline bit
git bisect bad e14575b3d457f5806d79b85886ef94d9c29e3b2a
# bad: [5a61c36398d0626bad377a7f5b9391b21e16e91d] blk-mq: remove
REQ_ATOM_STARTED
git bisect bad 5a61c36398d0626bad377a7f5b9391b21e16e91d
# good: [5138ac6748e381501894976f995fb7d1a63f80f4] bcache: fix
misleading error message in bch_count_io_errors()
git bisect good 5138ac6748e381501894976f995fb7d1a63f80f4
# good: [04ced159cec863f9bc27015d6b970bb13cfa6176] blk-mq: move hctx
lock/unlock into a helper
git bisect good 04ced159cec863f9bc27015d6b970bb13cfa6176
# good: [67818d25738b1c9ffb8541ca875b2ae3304869d5] blk-mq: use
blk_mq_rq_state() instead of testing REQ_ATOM_COMPLETE
git bisect good 67818d25738b1c9ffb8541ca875b2ae3304869d5
# bad: [358f70da49d77c43f2ca11b5da584213b2add29c] blk-mq: make
blk_abort_request() trigger timeout path
git bisect bad 358f70da49d77c43f2ca11b5da584213b2add29c
# first bad commit: [358f70da49d77c43f2ca11b5da584213b2add29c] blk-mq:
make blk_abort_request() trigger timeout path

--
Sitsofe | http://sucs.org/~sits/


2018-04-02 13:49:12

by Bart Van Assche

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

On Sun, 2018-04-01 at 15:35 +0100, Sitsofe Wheeler wrote:
> While trying to boot an EeePC 900 on the latest git mainline kernel a
> hang is encountered while systemd is starting services but this did
> not happen with 4.15. A bisection narrowed it down to the commit
> 358f70da49d77c43f2ca11b5da584213b2add29c ("blk-mq: make
> blk_abort_request() trigger timeout path"). Here's the bisection log:

Please repost on the linux-block mailing list. I think that mailing list is
more appropriate for this bug report.

Thanks,

Bart.


2018-04-02 14:34:49

by Jens Axboe

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

On 4/1/18 8:35 AM, Sitsofe Wheeler wrote:
> Hi,
>
> While trying to boot an EeePC 900 on the latest git mainline kernel a
> hang is encountered while systemd is starting services but this did
> not happen with 4.15. A bisection narrowed it down to the commit
> 358f70da49d77c43f2ca11b5da584213b2add29c ("blk-mq: make
> blk_abort_request() trigger timeout path"). Here's the bisection log:
>
> git bisect bad 358f70da49d77c43f2ca11b5da584213b2add29c
> # first bad commit: [358f70da49d77c43f2ca11b5da584213b2add29c] blk-mq:
> make blk_abort_request() trigger timeout path

Are you using scsi-mq?

Would probably also be handy to see a dmesg log from a 4.15 boot,
and from a failed 4.16 boot.

--
Jens Axboe


2018-04-02 17:34:48

by Sitsofe Wheeler

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

On 2 April 2018 at 15:33, Jens Axboe <[email protected]> wrote:
> On 4/1/18 8:35 AM, Sitsofe Wheeler wrote:
>>
>> While trying to boot an EeePC 900 on the latest git mainline kernel a
>> hang is encountered while systemd is starting services but this did
>> not happen with 4.15. A bisection narrowed it down to the commit
>> 358f70da49d77c43f2ca11b5da584213b2add29c ("blk-mq: make
>> blk_abort_request() trigger timeout path"). Here's the bisection log:
>>
>> git bisect bad 358f70da49d77c43f2ca11b5da584213b2add29c
>> # first bad commit: [358f70da49d77c43f2ca11b5da584213b2add29c] blk-mq:
>> make blk_abort_request() trigger timeout path
>
> Are you using scsi-mq?

Yes:

$ cat /sys/module/scsi_mod/parameters/use_blk_mq
Y

> Would probably also be handy to see a dmesg log from a 4.15 boot,
> and from a failed 4.16 boot.

Attached. See the dmesg-g67818d25738b-ok.txt (parent of
358f70da49d77c43f2ca11b5da584213b2add29c) for the dmesg on an OK
commit. See dmesg-g358f70da49d7-regression.txt for a dmesg of a boot
that hangs (matching commit 358f70da49d77c43f2ca11b5da584213b2add29c).

--
Sitsofe | http://sucs.org/~sits/


Attachments:
dmesg-g358f70da49d7-regression.txt (52.35 kB)
dmesg-g67818d25738b-ok.txt (52.45 kB)
Download all attachments

2018-04-02 20:31:26

by Tejun Heo

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

Hello, Sitsofe.

Can you see whether the following patch makes any difference?

Thanks.

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
* No need for fancy synchronizations.
*/
blk_rq_set_deadline(req, jiffies);
- mod_timer(&req->q->timeout, 0);
+ kblockd_schedule_work(&req->q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;

2018-04-02 21:04:14

by Sitsofe Wheeler

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

Hi Tejun,

On 2 April 2018 at 21:29, Tejun Heo <[email protected]> wrote:
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> * No need for fancy synchronizations.
> */
> blk_rq_set_deadline(req, jiffies);
> - mod_timer(&req->q->timeout, 0);
> + kblockd_schedule_work(&req->q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;

That patch seems to fix the issue.

--
Sitsofe | http://sucs.org/~sits/

2018-04-02 22:06:52

by Tejun Heo

[permalink] [raw]
Subject: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

Request abortion is performed by overriding deadline to now and
scheduling timeout handling immediately. For the latter part, the
code was using mod_timer(timeout, 0) which can't guarantee that the
timer runs afterwards. Let's schedule the underlying work item
directly instead.

This fixes the hangs during probing reported by Sitsofe but it isn't
yet clear to me how the failure can happen reliably if it's just the
above described race condition.

Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Sitsofe Wheeler <[email protected]>
Reported-by: Meelis Roos <[email protected]>
Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
Cc: [email protected] # v4.16
Link: http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8NyOF-EMA@mail.gmail.com
Link: http://lkml.kernel.org/r/[email protected]
---
Hello,

I don't have the full explanation yet but here's a preliminary patch.

Thanks.

block/blk-timeout.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/block/blk-timeout.c b/block/blk-timeout.c
index a05e367..f0e6e41 100644
--- a/block/blk-timeout.c
+++ b/block/blk-timeout.c
@@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
* No need for fancy synchronizations.
*/
blk_rq_set_deadline(req, jiffies);
- mod_timer(&req->q->timeout, 0);
+ kblockd_schedule_work(&req->q->timeout_work);
} else {
if (blk_mark_rq_complete(req))
return;

2018-04-02 22:37:25

by Jens Axboe

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

On 4/2/18 4:04 PM, Tejun Heo wrote:
> Request abortion is performed by overriding deadline to now and
> scheduling timeout handling immediately. For the latter part, the
> code was using mod_timer(timeout, 0) which can't guarantee that the
> timer runs afterwards. Let's schedule the underlying work item
> directly instead.
>
> This fixes the hangs during probing reported by Sitsofe but it isn't
> yet clear to me how the failure can happen reliably if it's just the
> above described race condition.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Sitsofe Wheeler <[email protected]>
> Reported-by: Meelis Roos <[email protected]>
> Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout path")
> Cc: [email protected] # v4.16
> Link: http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8NyOF-EMA@mail.gmail.com
> Link: http://lkml.kernel.org/r/[email protected]
> ---
> Hello,
>
> I don't have the full explanation yet but here's a preliminary patch.
>
> Thanks.
>
> block/blk-timeout.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> * No need for fancy synchronizations.
> */
> blk_rq_set_deadline(req, jiffies);
> - mod_timer(&req->q->timeout, 0);
> + kblockd_schedule_work(&req->q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;

In any case, it's cleaner than relying on mod_timer(.., 0). If that
doesn't guarantee that the timer runs again, I can see how a race
with the running timer could prevent us from seeing the timeout
after an abort.

I'll apply this, thanks.

--
Jens Axboe


2018-04-05 08:16:13

by Sitsofe Wheeler

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

On 2 April 2018 at 21:29, Tejun Heo <[email protected]> wrote:
> Hello, Sitsofe.
>
> Can you see whether the following patch makes any difference?
>
> Thanks.
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> * No need for fancy synchronizations.
> */
> blk_rq_set_deadline(req, jiffies);
> - mod_timer(&req->q->timeout, 0);
> + kblockd_schedule_work(&req->q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;

Just out of interest, does the fact that an abort occurs mean that the
hardware is somehow broken or badly behaved?

--
Sitsofe | http://sucs.org/~sits/

2018-04-05 14:44:13

by Tejun Heo

[permalink] [raw]
Subject: Re: [BISECTED][REGRESSION] Hang while booting EeePC 900

Hello,

On Thu, Apr 05, 2018 at 09:14:15AM +0100, Sitsofe Wheeler wrote:
> Just out of interest, does the fact that an abort occurs mean that the
> hardware is somehow broken or badly behaved?

Not really. For example, ATAPI devices depend on exception handling
to fetch sense data as a part of normal operation which is handled by
libata exception handler which is invoked by aborting the original
command. So, exception handling can often be a part of normal
operation.

Thanks.

--
tejun

2018-04-06 20:10:38

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

Hi,

[This is an automated email]

This commit has been processed because it contains a "Fixes:" tag,
fixing commit: 358f70da49d7 blk-mq: make blk_abort_request() trigger timeout path.

The bot has also determined it's probably a bug fixing patch. (score: 98.7780)

The bot has tested the following trees: v4.16.

v4.16: Build OK!

--
Thanks,
Sasha

2018-04-10 18:46:54

by Martin Steigerwald

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

Tejun Heo - 03.04.18, 00:04:
> Request abortion is performed by overriding deadline to now and
> scheduling timeout handling immediately. For the latter part, the
> code was using mod_timer(timeout, 0) which can't guarantee that the
> timer runs afterwards. Let's schedule the underlying work item
> directly instead.
>
> This fixes the hangs during probing reported by Sitsofe but it isn't
> yet clear to me how the failure can happen reliably if it's just the
> above described race condition.

Compiling a 4.16.1 kernel with that patch to test whether this fixes the boot
hang I reported in:

[Possible REGRESSION, 4.16-rc4] Error updating SMART data during runtime and
boot failures with blk_mq_terminate_expired in backtrace
https://bugzilla.kernel.org/show_bug.cgi?id=199077

The "Error updating SMART data during runtime" thing I reported there as well
may still be another (independent) issue.

> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Sitsofe Wheeler <[email protected]>
> Reported-by: Meelis Roos <[email protected]>
> Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger timeout
> path") Cc: [email protected] # v4.16
> Link:
> http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8NyOF-EM
> [email protected] Link:
> http://lkml.kernel.org/r/[email protected] ---
> Hello,
>
> I don't have the full explanation yet but here's a preliminary patch.
>
> Thanks.
>
> block/blk-timeout.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> index a05e367..f0e6e41 100644
> --- a/block/blk-timeout.c
> +++ b/block/blk-timeout.c
> @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> * No need for fancy synchronizations.
> */
> blk_rq_set_deadline(req, jiffies);
> - mod_timer(&req->q->timeout, 0);
> + kblockd_schedule_work(&req->q->timeout_work);
> } else {
> if (blk_mark_rq_complete(req))
> return;
--
Martin



2018-04-10 20:53:04

by Martin Steigerwald

[permalink] [raw]
Subject: Re: [PATCH] blk-mq: Directly schedule q->timeout_work when aborting a request

Martin Steigerwald - 10.04.18, 20:43:
> Tejun Heo - 03.04.18, 00:04:
> > Request abortion is performed by overriding deadline to now and
> > scheduling timeout handling immediately. For the latter part, the
> > code was using mod_timer(timeout, 0) which can't guarantee that the
> > timer runs afterwards. Let's schedule the underlying work item
> > directly instead.
> >
> > This fixes the hangs during probing reported by Sitsofe but it isn't
> > yet clear to me how the failure can happen reliably if it's just the
> > above described race condition.
>
> Compiling a 4.16.1 kernel with that patch to test whether this fixes
> the boot hang I reported in:
>
> [Possible REGRESSION, 4.16-rc4] Error updating SMART data during
> runtime and boot failures with blk_mq_terminate_expired in backtrace
> https://bugzilla.kernel.org/show_bug.cgi?id=199077

Fails as well, see

https://bugzilla.kernel.org/show_bug.cgi?id=199077#c8

for photo with (part of) backtrace.

> The "Error updating SMART data during runtime" thing I reported there
> as well may still be another (independent) issue.
>
> > Signed-off-by: Tejun Heo <[email protected]>
> > Reported-by: Sitsofe Wheeler <[email protected]>
> > Reported-by: Meelis Roos <[email protected]>
> > Fixes: 358f70da49d7 ("blk-mq: make blk_abort_request() trigger
> > timeout path") Cc: [email protected] # v4.16
> > Link:
> > http://lkml.kernel.org/r/CALjAwxh-PVYFnYFCJpGOja+m5SzZ8Sa4J7ohxdK=r8
> > NyOF-EM [email protected] Link:
> > http://lkml.kernel.org/r/[email protected].
> > ee --- Hello,
> >
> > I don't have the full explanation yet but here's a preliminary
> > patch.
> >
> > Thanks.
> >
> > block/blk-timeout.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/block/blk-timeout.c b/block/blk-timeout.c
> > index a05e367..f0e6e41 100644
> > --- a/block/blk-timeout.c
> > +++ b/block/blk-timeout.c
> > @@ -165,7 +165,7 @@ void blk_abort_request(struct request *req)
> >
> > * No need for fancy synchronizations.
> > */
> >
> > blk_rq_set_deadline(req, jiffies);
> >
> > - mod_timer(&req->q->timeout, 0);
> > + kblockd_schedule_work(&req->q->timeout_work);
> >
> > } else {
> >
> > if (blk_mark_rq_complete(req))
> >
> > return;


--
Martin