Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752951AbcDZU6R (ORCPT ); Tue, 26 Apr 2016 16:58:17 -0400 Received: from mail-pf0-f172.google.com ([209.85.192.172]:34104 "EHLO mail-pf0-f172.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1752149AbcDZU6O (ORCPT ); Tue, 26 Apr 2016 16:58:14 -0400 Subject: Re: [PATCH 1/1] workqueue: fix ghost PENDING flag while doing MQ IO To: Roman Pen , Tejun Heo References: <1461669335-19522-1-git-send-email-roman.penyaev@profitbricks.com> Cc: Gioh Kim , Michael Wang , Jens Axboe , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, stable@vger.kernel.org From: Peter Hurley Message-ID: <571FD65F.2080901@hurleysoftware.com> Date: Tue, 26 Apr 2016 13:58:07 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <1461669335-19522-1-git-send-email-roman.penyaev@profitbricks.com> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 7402 Lines: 188 On 04/26/2016 04:15 AM, Roman Pen wrote: > The bug in a workqueue leads to a stalled IO request in MQ ctx->rq_list > with the following backtrace: > > [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. > [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 > [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. > [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 > [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] > [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 > [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 > [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 > [ 601.350965] Call Trace: > [ 601.351203] [] ? bit_wait+0x60/0x60 > [ 601.351444] [] schedule+0x35/0x80 > [ 601.351709] [] schedule_timeout+0x192/0x230 > [ 601.351958] [] ? blk_flush_plug_list+0xc7/0x220 > [ 601.352208] [] ? ktime_get+0x37/0xa0 > [ 601.352446] [] ? bit_wait+0x60/0x60 > [ 601.352688] [] io_schedule_timeout+0xa4/0x110 > [ 601.352951] [] ? _raw_spin_unlock_irqrestore+0xe/0x10 > [ 601.353196] [] bit_wait_io+0x1b/0x70 > [ 601.353440] [] __wait_on_bit+0x5d/0x90 > [ 601.353689] [] wait_on_page_bit+0xc0/0xd0 > [ 601.353958] [] ? autoremove_wake_function+0x40/0x40 > [ 601.354200] [] __filemap_fdatawait_range+0xe4/0x140 > [ 601.354441] [] filemap_fdatawait_range+0x14/0x30 > [ 601.354688] [] filemap_write_and_wait_range+0x3f/0x70 > [ 601.354932] [] blkdev_fsync+0x1b/0x50 > [ 601.355193] [] vfs_fsync_range+0x49/0xa0 > [ 601.355432] [] blkdev_write_iter+0xca/0x100 > [ 601.355679] [] __vfs_write+0xaa/0xe0 > [ 601.355925] [] vfs_write+0xa9/0x1a0 > [ 601.356164] [] kernel_write+0x38/0x50 > > The underlying device is a null_blk, with default parameters: > > queue_mode = MQ > submit_queues = 1 > > Verification that nullb0 has something inflight: > > root@pserver8:~# cat /sys/block/nullb0/inflight > 0 1 > root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; > ... > /sys/block/nullb0/mq/0/cpu2/rq_list > CTX pending: > ffff8838038e2400 > ... > > During debug it became clear that stalled request is always inserted in > the rq_list from the following path: > > save_stack_trace_tsk + 34 > blk_mq_insert_requests + 231 > blk_mq_flush_plug_list + 281 > blk_flush_plug_list + 199 > wait_on_page_bit + 192 > __filemap_fdatawait_range + 228 > filemap_fdatawait_range + 20 > filemap_write_and_wait_range + 63 > blkdev_fsync + 27 > vfs_fsync_range + 73 > blkdev_write_iter + 202 > __vfs_write + 170 > vfs_write + 169 > kernel_write + 56 > > So blk_flush_plug_list() was called with from_schedule == true. > > If from_schedule is true, that means that finally blk_mq_insert_requests() > offloads execution of __blk_mq_run_hw_queue() and uses kblockd workqueue, > i.e. it calls kblockd_schedule_delayed_work_on(). > > That means, that we race with another CPU, which is about to execute > __blk_mq_run_hw_queue() work. > > Further debugging shows the following traces from different CPUs: > > CPU#0 CPU#1 > ---------------------------------- ------------------------------- > reqeust A inserted > STORE hctx->ctx_map[0] bit marked > kblockd_schedule...() returns 1 > > request B inserted > STORE hctx->ctx_map[1] bit marked > kblockd_schedule...() returns 0 > *** WORK PENDING bit is cleared *** > flush_busy_ctxs() is executed, but > bit 1, set by CPU#1, is not observed > > As a result request B pended forever. > > This behaviour can be explained by speculative LOAD of hctx->ctx_map on > CPU#0, which is reordered with clear of PENDING bit and executed _before_ > actual STORE of bit 1 on CPU#1. > > The proper fix is an explicit full barrier , which guarantees > that clear of PENDING bit is to be executed before all possible > speculative LOADS or STORES inside actual work function. > > Signed-off-by: Roman Pen > Cc: Gioh Kim > Cc: Michael Wang > Cc: Tejun Heo > Cc: Jens Axboe > Cc: linux-block@vger.kernel.org > Cc: linux-kernel@vger.kernel.org > Cc: stable@vger.kernel.org > --- > kernel/workqueue.c | 29 +++++++++++++++++++++++++++++ > 1 file changed, 29 insertions(+) > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 450c21f..0ec0594 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -649,6 +649,35 @@ static void set_work_pool_and_clear_pending(struct work_struct *work, > */ > smp_wmb(); > set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0); > + /* > + * The following mb guarantees that previous clear of a PENDING bit > + * will not be reordered with any speculative LOADS or STORES from > + * work->current_func, which is executed afterwards. Just to be clear: reordering with STORES in the work function is not a problem because no conclusion can be reached regarding the scheduling state of the work function from outside the work function, other than by testing PENDING. IOW, is_running = 1 clear PENDING is equivalent to clear PENDING is_running = 1 for any other logic like if (is_running) { } or if (!is_running) { } It's only LOADS in the work function that must follow STORES to PENDING. That the full barrier will also order stores is unintended effect. > This possible > + * reordering can lead to a missed execution on attempt to qeueue > + * the same @work. E.g. consider this case: > + * > + * CPU#0 CPU#1 > + * ---------------------------- -------------------------------- > + * > + * 1 STORE event_indicated > + * 2 queue_work_on() { > + * 3 test_and_set_bit(PENDING) > + * 4 } set_..._and_clear_pending() { > + * 5 set_work_data() # clear bit > + * 6 smp_mb() > + * 7 work->current_func() { > + * 8 LOAD event_indicated > + * } > + * > + * Without an explicit full barrier speculative LOAD on line 8 can > + * be executed before CPU#0 does STORE on line 1. If that happens, > + * CPU#0 observes the PENDING bit is still set and new execution of > + * a @work is not queued in a hope, that CPU#1 will eventually > + * finish the queued @work. Meanwhile CPU#1 does not see > + * event_indicated is set, because speculative LOAD was executed > + * before actual STORE. > + */ > + smp_mb(); > } > > static void clear_work_data(struct work_struct *work) >