2016-04-26 11:40:23

by Roman Pen

Subject: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO

A bug in the workqueue code leads to a stalled IO request in the MQ
ctx->rq_list, with the following backtrace:

[ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds.
[ 601.347574] Tainted: G O 4.4.5-1-storage+ #6
[ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000
[ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server]
[ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000
[ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0
[ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38
[ 601.350965] Call Trace:
[ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80
[ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230
[ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220
[ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0
[ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60
[ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110
[ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10
[ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70
[ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90
[ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0
[ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40
[ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140
[ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30
[ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70
[ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50
[ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0
[ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100
[ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0
[ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0
[ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50

The underlying device is a null_blk, with default parameters:

queue_mode = MQ
submit_queues = 1

Verification that nullb0 has something inflight:

root@pserver8:~# cat /sys/block/nullb0/inflight
0 1
root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \;
...
/sys/block/nullb0/mq/0/cpu2/rq_list
CTX pending:
ffff8838038e2400
...

During debugging it became clear that the stalled request is always inserted
into the rq_list from the following path:

save_stack_trace_tsk + 34
blk_mq_insert_requests + 231
blk_mq_flush_plug_list + 281
blk_flush_plug_list + 199
wait_on_page_bit + 192
__filemap_fdatawait_range + 228
filemap_fdatawait_range + 20
filemap_write_and_wait_range + 63
blkdev_fsync + 27
vfs_fsync_range + 73
blkdev_write_iter + 202
__vfs_write + 170
vfs_write + 169
kernel_write + 56

So blk_flush_plug_list() was called with from_schedule == true.

If from_schedule is true, blk_mq_insert_requests() ends up offloading the
execution of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls
kblockd_schedule_delayed_work_on().

That means that we race with another CPU, which is about to execute the
__blk_mq_run_hw_queue() work.
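
For readers less familiar with that path, the hand-off looks roughly like the
condensed sketch below. This is only an illustration, not verbatim 4.4 source;
blk_mq_hctx_next_cpu() and the hctx->run_work member are recalled from memory:

/*
 * Condensed sketch of the blk_mq_insert_requests() ->
 * __blk_mq_run_hw_queue() hand-off described above, as it would sit
 * inside block/blk-mq.c.  Illustrative only, not a copy of the source.
 */
static void run_hw_queue_sketch(struct blk_mq_hw_ctx *hctx, bool from_schedule)
{
	if (!from_schedule) {
		/* Safe to run the hardware queue in the caller's context. */
		__blk_mq_run_hw_queue(hctx);
		return;
	}

	/*
	 * Called from the scheduler's plug flush: the queue must not be
	 * run here, so execution is offloaded to the kblockd workqueue.
	 * If the work item is already PENDING, nothing new is queued and
	 * we rely on the already-scheduled execution to observe the
	 * freshly marked ctx_map bit; that is exactly the expectation
	 * the race below breaks.
	 */
	kblockd_schedule_delayed_work_on(blk_mq_hctx_next_cpu(hctx),
					 &hctx->run_work, 0);
}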

Further debugging shows the following traces from different CPUs:

CPU#0                                 CPU#1
----------------------------------    -------------------------------
request A inserted
STORE hctx->ctx_map[0] bit marked
kblockd_schedule...() returns 1
<schedule to kblockd workqueue>
                                      request B inserted
                                      STORE hctx->ctx_map[1] bit marked
                                      kblockd_schedule...() returns 0
*** WORK PENDING bit is cleared ***
flush_busy_ctxs() is executed, but
bit 1, set by CPU#1, is not observed

As a result, request B stays pending forever.

This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
CPU#0, which is reordered with the clear of the PENDING bit and executed
_before_ the actual STORE of bit 1 on CPU#1.
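
The same window can also be written down as a small, self-contained
litmus-style program in plain C11 atomics. Everything below is invented for
the illustration (pending stands in for WORK_STRUCT_PENDING, ctx_bit for the
hctx->ctx_map bit), and a single run will almost never hit the window; the
point is only to make the required ordering explicit:

/*
 * Litmus-style illustration of the race, in C11 atomics instead of the
 * kernel's bitops and workqueue primitives.  All names are invented.
 * Build with any C11 toolchain that provides <threads.h>.
 */
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>
#include <threads.h>

static atomic_bool pending = true;	/* work already queued for request A */
static atomic_bool ctx_bit = false;	/* "request B is in the rq_list"     */

static bool submitter_saw_pending;	/* what CPU#1 observed */
static bool worker_saw_ctx_bit;		/* what CPU#0 observed */

/* CPU#1: insert request B, then try to (re)queue the work. */
static int submitter(void *arg)
{
	(void)arg;
	atomic_store_explicit(&ctx_bit, true, memory_order_relaxed);
	/* test_and_set_bit() acts as a full barrier; model that here. */
	atomic_thread_fence(memory_order_seq_cst);
	submitter_saw_pending =
		atomic_exchange_explicit(&pending, true, memory_order_relaxed);
	return 0;
}

/* CPU#0: the kblockd worker about to run __blk_mq_run_hw_queue(). */
static int worker(void *arg)
{
	(void)arg;
	/* set_work_pool_and_clear_pending(): drop PENDING ... */
	atomic_store_explicit(&pending, false, memory_order_relaxed);
	/*
	 * Missing full barrier: uncommenting the fence below (the moral
	 * equivalent of the smp_mb() added by this patch) rules out the
	 * bad outcome.
	 */
	/* atomic_thread_fence(memory_order_seq_cst); */

	/* ... then flush_busy_ctxs() scans the ctx map. */
	worker_saw_ctx_bit =
		atomic_load_explicit(&ctx_bit, memory_order_relaxed);
	return 0;
}

int main(void)
{
	thrd_t a, b;

	thrd_create(&a, worker, NULL);
	thrd_create(&b, submitter, NULL);
	thrd_join(a, NULL);
	thrd_join(b, NULL);

	/*
	 * The stall corresponds to submitter_saw_pending == true (request B
	 * is not re-queued) together with worker_saw_ctx_bit == false (the
	 * already running worker never picks it up).
	 */
	printf("submitter saw PENDING=%d, worker saw ctx_bit=%d\n",
	       submitter_saw_pending, worker_saw_ctx_bit);
	return 0;
}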

The proper fix is an explicit full memory barrier (mfence on x86), which
guarantees that the clear of the PENDING bit is ordered before any
speculative LOADS or STORES inside the actual work function.

Signed-off-by: Roman Pen <[email protected]>
Cc: Gioh Kim <[email protected]>
Cc: Michael Wang <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
---
kernel/workqueue.c | 29 +++++++++++++++++++++++++++++
1 file changed, 29 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 450c21f..0ec0594 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -649,6 +649,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
*/
smp_wmb();
set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+ /*
+ * The following mb guarantees that previous clear of a PENDING bit
+ * will not be reordered with any speculative LOADS or STORES from
+ * work->current_func, which is executed afterwards. This possible
+ * reordering can lead to a missed execution on attempt to queue
+ * the same @work. E.g. consider this case:
+ *
+ * CPU#0                            CPU#1
+ * ----------------------------     --------------------------------
+ *
+ * 1  STORE event_indicated
+ * 2  queue_work_on() {
+ * 3    test_and_set_bit(PENDING)
+ * 4  }                             set_..._and_clear_pending() {
+ * 5                                  set_work_data() # clear bit
+ * 6                                  smp_mb()
+ * 7                                  work->current_func() {
+ * 8                                    LOAD event_indicated
+ *                                    }
+ *
+ * Without an explicit full barrier, the speculative LOAD on line 8 can
+ * be executed before CPU#0 does the STORE on line 1. If that happens,
+ * CPU#0 observes the PENDING bit is still set and a new execution of
+ * the @work is not queued in the hope that CPU#1 will eventually
+ * finish the queued @work. Meanwhile CPU#1 does not see that
+ * event_indicated is set, because the speculative LOAD was executed
+ * before the actual STORE.
+ */
+ smp_mb();
}

static void clear_work_data(struct work_struct *work)
--
2.7.4


2016-04-26 15:22:03

by Tejun Heo

Subject: Re: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO

Hello,

On Tue, Apr 26, 2016 at 01:15:35PM +0200, Roman Pen wrote:
...
> This behaviour can be explained by a speculative LOAD of hctx->ctx_map on
> CPU#0, which is reordered with the clear of the PENDING bit and executed
> _before_ the actual STORE of bit 1 on CPU#1.
>
> The proper fix is an explicit full memory barrier (mfence on x86), which
> guarantees that the clear of the PENDING bit is ordered before any
> speculative LOADS or STORES inside the actual work function.

Applied to workqueue/for-4.6-fixes.

Thanks.

--
tejun

2016-04-26 20:58:17

by Peter Hurley

Subject: Re: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO

On 04/26/2016 04:15 AM, Roman Pen wrote:
...
> + /*
> + * The following mb guarantees that previous clear of a PENDING bit
> + * will not be reordered with any speculative LOADS or STORES from
> + * work->current_func, which is executed afterwards.

Just to be clear: reordering with STORES in the work function is not a problem
because no conclusion can be reached regarding the scheduling state of the work
function from outside the work function, other than by testing PENDING.

IOW,

   is_running = 1
   clear PENDING

is equivalent to

   clear PENDING
   is_running = 1

for any other logic like

   if (is_running) {
   }

or

   if (!is_running) {
   }


It's only LOADS in the work function that must follow STORES to PENDING.
That the full barrier will also order stores is an unintended effect.
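
To put that in code: in a hypothetical work item like the one below (every
name here is invented for the illustration, none of it comes from real kernel
code), only the LOAD needs to be ordered after the PENDING clear; the STORE
could become visible on either side of the clear with no observable
difference:

/* Hypothetical example; all names are invented for illustration. */
#include <linux/kernel.h>
#include <linux/workqueue.h>

static int event_indicated;	/* set by the submitter before queueing */
static int work_ran;		/* set by the work function itself      */

static void example_work_fn(struct work_struct *work)
{
	/*
	 * A LOAD like this is what needs ordering against the PENDING
	 * clear: if it is satisfied before the clear becomes visible, a
	 * submitter that still saw PENDING set (and therefore did not
	 * re-queue the work) can have its store to event_indicated
	 * missed entirely.
	 */
	if (READ_ONCE(event_indicated))
		pr_info("event handled\n");

	/*
	 * A STORE like this needs no ordering against the PENDING clear:
	 * the only way another CPU can reason about whether this function
	 * is scheduled is by testing PENDING itself, so it cannot matter
	 * whether work_ran becomes visible before or after the bit is
	 * cleared.
	 */
	WRITE_ONCE(work_ran, 1);
}

static DECLARE_WORK(example_work, example_work_fn);

static void example_submit(void)
{
	WRITE_ONCE(event_indicated, 1);	/* STORE, then ...                */
	schedule_work(&example_work);	/* ... test_and_set_bit(PENDING), */
					/* a full barrier on this side    */
}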



> This possible
> + * reordering can lead to a missed execution on attempt to queue
> + * the same @work. E.g. consider this case:
> + *
> + * CPU#0                            CPU#1
> + * ----------------------------     --------------------------------
> + *
> + * 1  STORE event_indicated
> + * 2  queue_work_on() {
> + * 3    test_and_set_bit(PENDING)
> + * 4  }                             set_..._and_clear_pending() {
> + * 5                                  set_work_data() # clear bit
> + * 6                                  smp_mb()
> + * 7                                  work->current_func() {
> + * 8                                    LOAD event_indicated
> + *                                    }
> + *
> + * Without an explicit full barrier, the speculative LOAD on line 8 can
> + * be executed before CPU#0 does the STORE on line 1. If that happens,
> + * CPU#0 observes the PENDING bit is still set and a new execution of
> + * the @work is not queued in the hope that CPU#1 will eventually
> + * finish the queued @work. Meanwhile CPU#1 does not see that
> + * event_indicated is set, because the speculative LOAD was executed
> + * before the actual STORE.
> + */
> + smp_mb();
> }
>
> static void clear_work_data(struct work_struct *work)
>