From: Roman Penyaev
Date: Mon, 25 Apr 2016 18:34:45 +0200
Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO
To: Tejun Heo
Cc: Jens Axboe, linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, David Howells
In-Reply-To: <20160425154847.GZ7822@mtj.duckdns.org>

Hello, Tejun,

On Mon, Apr 25, 2016 at 5:48 PM, Tejun Heo wrote:
> Hello, Roman.
>
> On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote:
> ...
>>    CPU#6                                CPU#2
>>    request ffff884000343600 inserted
>>    hctx marked as pended
>>    kblockd_schedule...() returns 1
>>
>>                                         *** WORK_STRUCT_PENDING_BIT is cleared ***
>>                                         flush_busy_ctxs() is executed
>>
>>    request ffff884000343cc0 inserted
>>    hctx marked as pended
>>    kblockd_schedule...() returns 0
>>    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>    WTF?
>>
>> As a result the request ffff884000343cc0 stays pending forever.
>>
>> According to the trace output I see that another CPU _always_ observes
>> WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even though it
>> was cleared on another CPU.
>>
>> Checking the workqueue.c code I see that clearing the bit is nothing
>> more than an atomic_long_set(), which is a plain MOV instruction.  It
>> happens in this function:
>>
>>    static inline void set_work_data()
>>
>> In an attempt to "fix" the mystery I replaced the atomic_long_set()
>> call with atomic_long_xchg(), which is an XCHG instruction.
>>
>> The problem has gone.
>>
>> For me it looks like test_and_set_bit() (a LOCK BTS instruction) does
>> not require a flush of all CPU caches, which can still be dirty after
>> the MOV executed on another CPU.  But XCHG really updates the memory,
>> and the following LOCK BTS then observes that the bit was cleared.
>>
>> As a conclusion I can say that I am lucky enough to be able to
>> reproduce this bug in several minutes on a specific load (I tried many
>> other simple loads using fio, even using btrecord/btreplay, with no
>> success).  That easy reproduction on a specific load gives me some
>> freedom to test and then to be sure that the problem is gone.
>
> Heh, excellent debugging.  I wonder how old this bug is.  cc'ing David
> Howells who ISTR to have reported a similar issue.  The root problem
> here, I think, is that PENDING is used to synchronize between
> different queueing instances but we don't have a proper memory barrier
> after it.
>
>       A                                   B
>
>       queue (test_and_sets PENDING)
>       dispatch (clears PENDING)
>       execute                             queue (test_and_sets PENDING)
>
> So, for B, the guarantee must be that either A starts executing after
> B's test_and_set or B's test_and_set succeeds; however, as we don't
> have any memory barrier between dispatch and execute, there's nothing
> preventing the processor from scheduling some memory fetch operations
> from the execute stage before the clearing of PENDING - ie. A might
> not see what B has done prior to queue even if B's test_and_set fails,
> indicating that A should.  Can you please test whether the following
> patch fixes the issue?
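A rough C11 model of the two sides in the diagram above may help make the
guarantee concrete.  All names below (queue_side, execute_side, pending,
new_work) are invented for the sketch; it only stands in for the pattern of
queue_work() versus set_work_pool_and_clear_pending(), it is not the kernel
code:

#include <stdatomic.h>
#include <stdbool.h>

static atomic_long pending;   /* models the PENDING bit in work->data        */
static atomic_long new_work;  /* models whatever B publishes before queueing */

/*
 * B: publish the work, then try to claim PENDING.  In the kernel this is
 * test_and_set_bit(), which implies a full barrier; atomic_exchange() is
 * used as a stand-in here.
 */
static bool queue_side(long req)
{
        atomic_store_explicit(&new_work, req, memory_order_relaxed);
        return atomic_exchange(&pending, 1) == 0;   /* true: must wake worker */
}

/*
 * A: dispatch clears PENDING, execute then looks for published work.
 */
static long execute_side(void)
{
        /* in the kernel this clear is atomic_long_set(), i.e. a plain store */
        atomic_store_explicit(&pending, 0, memory_order_relaxed);
        /* the proposed smp_mb(): keep the load below from being satisfied
         * before the clear has become visible to B */
        atomic_thread_fence(memory_order_seq_cst);
        return atomic_load_explicit(&new_work, memory_order_relaxed);
}

The intended guarantee then reads: either queue_side() sees PENDING already
clear and wakes the worker itself, or execute_side(), after the full barrier,
sees the work that queue_side() published; without the barrier between the
clear and the subsequent loads, neither may hold.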
I can assure you that smp_mb() helps (at least running for 30 minutes
under IO).  That was my first variant, but I did not like it because I
could not explain to myself why:

1. Why not smp_wmb()?  We need to do a flush after an update.
   (I tried that also, and it does not help.)

2. What protects us from this situation?

       CPU#0                        CPU#1

       set_work_data()
                                    test_and_set_bit()
       smp_mb()

And question 2 was crucial to me, because even a tiny delay "fixes" the
problem, e.g. an ndelay() also "fixes" the bug:

        smp_wmb();
        set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
+       ndelay(40);
 }

Why ndelay(40)?  Because on this machine smp_mb() takes 40 ns on average.

--
Roman

> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 2232ae3..8ec2b5e 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work,
>  	 */
>  	smp_wmb();
>  	set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0);
> +	smp_mb();
>  }
>
>  static void clear_work_data(struct work_struct *work)
>
> --
> tejun
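On question 1 above (why smp_wmb() cannot be enough): the failing pattern is
a store followed by a load on each side ("clear PENDING, then look at the
queue" versus "insert a request, then test PENDING"), which is the classic
store-buffering case; a write barrier only orders stores against stores.  A
small user-space illustration, using C11 atomics and invented names rather
than the kernel primitives, assuming the mapping noted in the comments:

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int pending_cleared; /* CPU#0's store: "PENDING was cleared"  */
static atomic_int request_queued;  /* CPU#1's store: "a request was queued" */
static int r0, r1;

static void *cpu0(void *arg)
{
        atomic_store_explicit(&pending_cleared, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);        /* ~smp_wmb() */
        r0 = atomic_load_explicit(&request_queued, memory_order_relaxed);
        return NULL;
}

static void *cpu1(void *arg)
{
        atomic_store_explicit(&request_queued, 1, memory_order_relaxed);
        atomic_thread_fence(memory_order_release);        /* ~smp_wmb() */
        r1 = atomic_load_explicit(&pending_cleared, memory_order_relaxed);
        return NULL;
}

int main(void)
{
        pthread_t a, b;

        pthread_create(&a, NULL, cpu0, NULL);
        pthread_create(&b, NULL, cpu1, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);

        /*
         * With only write barriers, r0 == 0 && r1 == 0 is a legal outcome:
         * CPU#0 misses the queued request and CPU#1 misses the cleared
         * PENDING bit.  Replacing both fences with memory_order_seq_cst
         * (the smp_mb() analogue) forbids that outcome.
         */
        printf("r0=%d r1=%d\n", r0, r1);
        return 0;
}

This is also consistent with the ndelay() observation above: a delay between
the store and the later loads makes the bad interleaving unlikely, but it
provides no ordering guarantee, which matches the quotes around "fixes".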