2018-04-24 21:02:21

by kernel test robot

[permalink] [raw]
Subject: ed74ae0342 ("blk-mq: Avoid that a completion can be ignored .."): BUG: kernel hang in test stage

Greetings,

0day kernel testing robot got the below dmesg and the first bad commit is

https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-linus

commit ed74ae03424684a6ad8a973c3fa727c6b4162432
Author: Bart Van Assche <[email protected]>
AuthorDate: Thu Apr 19 09:43:53 2018 -0700
Commit: Jens Axboe <[email protected]>
CommitDate: Thu Apr 19 14:21:47 2018 -0600

blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

The blk-mq timeout handling code ignores completions that occur after
blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
has reset rq->aborted_gstate. If a block driver timeout handler always
returns BLK_EH_RESET_TIMER then the result will be that the request
never terminates.

Fix this race as follows:
- Use the deadline instead of the request generation to detect whether
or not a request timer fired after reinitialization of a request.
- Store the request state in the lowest two bits of the deadline instead
of the lowest two bits of 'gstate'.
- Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
enumeration member into a #define such that its type can be changed
into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
~(unsigned long)RQ_STATE_MASK.
- Remove all request member variables that became superfluous due to
this change: gstate, gstate_seq and aborted_gstate_sync.
- Remove the request state information that became superfluous due to this
patch, namely RQF_MQ_TIMEOUT_EXPIRED.
- Remove the code that became superfluous due to this change, namely
the RCU lock and unlock statements in blk_mq_complete_request() and
also the synchronize_rcu() call in the timeout handler.

Signed-off-by: Bart Van Assche <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Christoph Hellwig <[email protected]>
Cc: Ming Lei <[email protected]>
Cc: Sagi Grimberg <[email protected]>
Cc: Israel Rukshin <[email protected]>,
Cc: Max Gurtovoy <[email protected]>
Cc: <[email protected]> # v4.16
Signed-off-by: Jens Axboe <[email protected]>

901932a3f9 blkcg: init root blkcg_gq under lock
ed74ae0342 blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
81f9787a93 block: mq: Add some minor doc for core structs
+-------------------------------+------------+------------+------------+
| | 901932a3f9 | ed74ae0342 | 81f9787a93 |
+-------------------------------+------------+------------+------------+
| boot_successes | 33 | 0 | 0 |
| boot_failures | 0 | 13 | 19 |
| BUG:kernel_hang_in_test_stage | 0 | 13 | 19 |
+-------------------------------+------------+------------+------------+

[ 1291.232051] Writes: Total: 551918540 Max/Min: 0/0 Fail: 0
[ 1352.672080] Writes: Total: 584508421 Max/Min: 0/0 Fail: 0
[ 1414.120542] Writes: Total: 614769476 Max/Min: 0/0 Fail: 0
[ 1475.552055] Writes: Total: 641108602 Max/Min: 0/0 Fail: 0
[ 1536.993079] Writes: Total: 668490942 Max/Min: 0/0 Fail: 0
BUG: kernel hang in test stage


# HH:MM RESULT GOOD BAD GOOD_BUT_DIRTY DIRTY_NOT_BAD
git bisect start aa1ee658e2460d5caaa6c9f20c21f5429e967421 60cc43fc888428bb2f18f08997432d426a243338 --
git bisect bad bc9a2437cdfcbf8906270d81cb5dd973efa260a7 # 19:42 B 0 3 16 0 Merge 'wsa/coccinelle/get_drvdata' into devel-hourly-2018042013
git bisect bad 855b1a282f0fe1da064cd1d26b32ee5e64e1262c # 20:34 B 0 4 17 0 Merge 'drm-drm-intel/topic/core-for-CI' into devel-hourly-2018042013
git bisect good 3cfc4f7d8c25179a3d0198d7231e6ec9f67121e2 # 20:54 G 11 0 4 4 Merge 'linux-review/Baolin-Wang/dt-bindings-input-Add-Add-Spreadtrum-SC27xx-vibrator-documentation/20180417-172158' into devel-hourly-2018042013
git bisect bad 2b9ee19fee29678f1db58a1c56af59d4ae1a8b32 # 21:31 B 0 11 24 0 Merge 'linux-review/Jon-Maloy/tipc-fix-use-after-free-in-tipc_nametbl_stop/20180418-195743' into devel-hourly-2018042013
git bisect bad 7b6af07d1f233f28f49d9d8866bece52c5a24163 # 22:16 B 0 1 14 0 Merge 'linux-nvdimm/libnvdimm-for-next' into devel-hourly-2018042013
git bisect bad bc1f481f45ce7483f02e47bbf4020c7607b1e041 # 22:55 B 0 2 15 0 Merge 'net/master' into devel-hourly-2018042013
git bisect good f2ba5ad5cef9d0984d5e8ab631466405ceba8097 # 23:15 G 11 0 2 2 Merge 'baolu/iommu/siov/0419' into devel-hourly-2018042013
git bisect good 6724b7e7ac4e65a794e2fe5ef666c4dbe1a8b057 # 23:29 G 10 0 4 4 Merge 'chanwoo-extcon/extcon-fixes' into devel-hourly-2018042013
git bisect good e3377f29326e372064057289eac14d65813c24e7 # 23:48 G 11 0 5 5 Merge 'ceph-client/testing' into devel-hourly-2018042013
git bisect good 7592af80646d1cfa4a37f9b125901b8aa558a232 # 00:05 G 10 0 4 4 Merge 'linux-review/Ryder-Lee/thermal-mediatek-use-of_device_get_match_data/20180416-125446' into devel-hourly-2018042013
git bisect bad edc4453ad0d7e61ead2875fdbeea2086f9284143 # 00:45 B 0 2 15 0 Merge 'block/for-linus' into devel-hourly-2018042013
git bisect good b3906535ccc6cd04c42f9b1c7e31d1947b3ebc74 # 01:50 G 11 0 2 2 block/swim: Select appropriate drive on device open
git bisect good 946b81da114b8ba5c74bb01e57c0c6eca2bdc801 # 02:03 G 11 0 1 1 blkcg: don't hold blkcg lock when deactivating policy
git bisect bad ed74ae03424684a6ad8a973c3fa727c6b4162432 # 02:36 B 0 2 15 0 blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
git bisect good 901932a3f9b2b80352896be946c6d577c0a9652c # 03:08 G 11 0 2 2 blkcg: init root blkcg_gq under lock
# first bad commit: [ed74ae03424684a6ad8a973c3fa727c6b4162432] blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
git bisect good 901932a3f9b2b80352896be946c6d577c0a9652c # 03:13 G 30 0 3 5 blkcg: init root blkcg_gq under lock
# extra tests with debug options
git bisect bad ed74ae03424684a6ad8a973c3fa727c6b4162432 # 03:52 B 0 3 16 0 blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
# extra tests on HEAD of linux-devel/devel-hourly-2018042013
git bisect bad aa1ee658e2460d5caaa6c9f20c21f5429e967421 # 03:57 B 0 21 46 0 0day head guard for 'devel-hourly-2018042013'
# extra tests on tree/branch block/for-linus
git bisect bad 81f9787a93a81d284eec6c0327f452b56d337ee0 # 04:39 B 0 2 16 1 block: mq: Add some minor doc for core structs
# extra tests with first bad commit reverted
git bisect good 64c641c666cd37692c81aa2933c1f19262e047f2 # 04:59 G 10 0 1 1 Revert "blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER"

---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/lkp Intel Corporation


Attachments:
(No filename) (7.10 kB)
dmesg-yocto-intel12-16:20180425024228:x86_64-randconfig-s5-04202025:4.16.0-12473-ged74ae0:2.gz (11.63 kB)
dmesg-yocto-intel12-11:20180425030711:x86_64-randconfig-s5-04202025:4.16.0-12472-g901932a:1.gz (27.87 kB)
reproduce-yocto-intel12-16:20180425024228:x86_64-randconfig-s5-04202025:4.16.0-12473-ged74ae0:2 (977.00 B)
config-4.16.0-12473-ged74ae0 (126.10 kB)
Download all attachments

2018-04-28 00:54:26

by Jens Axboe

[permalink] [raw]
Subject: Re: ed74ae0342 ("blk-mq: Avoid that a completion can be ignored .."): BUG: kernel hang in test stage

On 4/24/18 3:00 PM, kernel test robot wrote:
> Greetings,
>
> 0day kernel testing robot got the below dmesg and the first bad commit is
>
> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-linus
>
> commit ed74ae03424684a6ad8a973c3fa727c6b4162432
> Author: Bart Van Assche <[email protected]>
> AuthorDate: Thu Apr 19 09:43:53 2018 -0700
> Commit: Jens Axboe <[email protected]>
> CommitDate: Thu Apr 19 14:21:47 2018 -0600
>
> blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER
>
> The blk-mq timeout handling code ignores completions that occur after
> blk_mq_check_expired() has been called and before blk_mq_rq_timed_out()
> has reset rq->aborted_gstate. If a block driver timeout handler always
> returns BLK_EH_RESET_TIMER then the result will be that the request
> never terminates.
>
> Fix this race as follows:
> - Use the deadline instead of the request generation to detect whether
> or not a request timer fired after reinitialization of a request.
> - Store the request state in the lowest two bits of the deadline instead
> of the lowest two bits of 'gstate'.
> - Rename MQ_RQ_STATE_MASK into RQ_STATE_MASK and change it from an
> enumeration member into a #define such that its type can be changed
> into unsigned long. That allows to write & ~RQ_STATE_MASK instead of
> ~(unsigned long)RQ_STATE_MASK.
> - Remove all request member variables that became superfluous due to
> this change: gstate, gstate_seq and aborted_gstate_sync.
> - Remove the request state information that became superfluous due to this
> patch, namely RQF_MQ_TIMEOUT_EXPIRED.
> - Remove the code that became superfluous due to this change, namely
> the RCU lock and unlock statements in blk_mq_complete_request() and
> also the synchronize_rcu() call in the timeout handler.

Any chance you can try with the newer version?

https://github.com/bvanassche/linux/commit/4acd555fa13087

--
Jens Axboe


2018-04-29 02:02:40

by Wu Fengguang

[permalink] [raw]
Subject: Re: ed74ae0342 ("blk-mq: Avoid that a completion can be ignored .."): BUG: kernel hang in test stage

Hi Jens,

On Fri, Apr 27, 2018 at 06:52:58PM -0600, Jens Axboe wrote:
>On 4/24/18 3:00 PM, kernel test robot wrote:
>> Greetings,
>>
>> 0day kernel testing robot got the below dmesg and the first bad commit is
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block.git for-linus
>>
>> commit ed74ae03424684a6ad8a973c3fa727c6b4162432
>> Author: Bart Van Assche <[email protected]>
>> AuthorDate: Thu Apr 19 09:43:53 2018 -0700
>> Commit: Jens Axboe <[email protected]>
>> CommitDate: Thu Apr 19 14:21:47 2018 -0600
>>
>> blk-mq: Avoid that a completion can be ignored for BLK_EH_RESET_TIMER

>Any chance you can try with the newer version?
>
>https://github.com/bvanassche/linux/commit/4acd555fa13087

That works!

Tested-by: Fengguang Wu <[email protected]>