Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751738AbcDZBWJ (ORCPT ); Mon, 25 Apr 2016 21:22:09 -0400 Received: from mail-pf0-f181.google.com ([209.85.192.181]:34101 "EHLO mail-pf0-f181.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1750750AbcDZBWH (ORCPT ); Mon, 25 Apr 2016 21:22:07 -0400 Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO To: Tejun Heo , Roman Pen References: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com> <20160425154847.GZ7822@mtj.duckdns.org> Cc: Jens Axboe , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, David Howells From: Peter Hurley Message-ID: <571EC2B9.6000802@hurleysoftware.com> Date: Mon, 25 Apr 2016 18:22:01 -0700 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:38.0) Gecko/20100101 Thunderbird/38.6.0 MIME-Version: 1.0 In-Reply-To: <20160425154847.GZ7822@mtj.duckdns.org> Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 4100 Lines: 107 On 04/25/2016 08:48 AM, Tejun Heo wrote: > Hello, Roman. > > On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote: > ... >> CPU#6 CPU#2 >> reqeust ffff884000343600 inserted >> hctx marked as pended >> kblockd_schedule...() returns 1 >> >> *** WORK_STRUCT_PENDING_BIT is cleared *** >> flush_busy_ctxs() is executed >> reqeust ffff884000343cc0 inserted >> hctx marked as pended >> kblockd_schedule...() returns 0 >> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ >> WTF? >> >> As a result ffff884000343cc0 request pended forever. >> >> According to the trace output I see that another CPU _always_ observes >> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was >> cleared on another CPU. >> >> Checking the workqueue.c code I see that clearing the bit is nothing >> more, but atomic_long_set(), which is instruction. This >> function: >> >> static inline void set_work_data() >> >> In attempt to "fix" the mystery I replaced atomic_long_set() call with >> atomic_long_xchg(), which is instruction. >> >> The problem has gone. >> >> For me it looks that test_and_set_bit() ( instruction) does >> not require flush of all CPU caches, which can be dirty after executing >> of on another CPU. But really updates the memory and the >> following execution of observes that bit was cleared. >> >> As a conculusion I can say, that I am lucky enough and can reproduce >> this bug in several minutes on a specific load (I tried many other >> simple loads using fio, even using btrecord/btreplay, no success). >> And that easy reproduction on a specific load gives me some freedom >> to test and then to be sure, that problem has gone. > > Heh, excellent debugging. I wonder how old this bug is. This is the same bug I wrote about 2 yrs ago (but with the wrong fix). http://lkml.iu.edu/hypermail/linux/kernel/1402.2/04697.html Unfortunately I didn't have a reproducer at all :/ > cc'ing David > Howells who ISTR to have reported a similar issue. The root problem > here, I think, is that PENDING is used to synchronize between > different queueing instances but we don't have proper memory barrier > after it. > > A B > > queue (test_and_sets PENDING) > dispatch (clears PENDING) > execute queue (test_and_sets PENDING) > > So, for B, the guarantee must be that either A starts executing after > B's test_and_set or B's test_and_set succeeds; however, as we don't > have any memory barrier between dispatch and execute, there's nothing > preventing the processor from scheduling some memory fetch operations > from the execute stage before the clearing of PENDING - ie. A might > not see what B has done prior to queue even if B's test_and_set fails > indicating that A should. Can you please test whether the following > patch fixes the issue? > > diff --git a/kernel/workqueue.c b/kernel/workqueue.c > index 2232ae3..8ec2b5e 100644 > --- a/kernel/workqueue.c > +++ b/kernel/workqueue.c > @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work, > */ > smp_wmb(); > set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0); > + smp_mb(); > } The atomic_long_xchg() patch has several benefits over the naked barrier: 1. set_work_pool_and_clear_pending() has the same requirements as clear_work_data(); note that both require write barrier before and full barrier after. 2. xchg() et al imply full barrier before and full barrier after. 3. The naked barriers could be removed, while improving efficiency. On x86, mov + mfence => xchg 4. Maybe fixes other hidden bugs. For example, I'm wondering if reordering with set_work_pwq/list_add_tail would be a problem; ie., what if work is visible on the worklist _before_ data is initialized by set_work_pwq()? Regards, Peter Hurley