Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S932905AbcDYPsw (ORCPT ); Mon, 25 Apr 2016 11:48:52 -0400 Received: from mail-yw0-f171.google.com ([209.85.161.171]:34652 "EHLO mail-yw0-f171.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S932797AbcDYPsu (ORCPT ); Mon, 25 Apr 2016 11:48:50 -0400 Date: Mon, 25 Apr 2016 11:48:47 -0400 From: Tejun Heo To: Roman Pen Cc: Jens Axboe , linux-block@vger.kernel.org, linux-kernel@vger.kernel.org, David Howells Subject: Re: [PATCH 1/1] [RFC] workqueue: fix ghost PENDING flag while doing MQ IO Message-ID: <20160425154847.GZ7822@mtj.duckdns.org> References: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <1461597771-25352-1-git-send-email-roman.penyaev@profitbricks.com> User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 3178 Lines: 82 Hello, Roman. On Mon, Apr 25, 2016 at 05:22:51PM +0200, Roman Pen wrote: ... > CPU#6 CPU#2 > reqeust ffff884000343600 inserted > hctx marked as pended > kblockd_schedule...() returns 1 > > *** WORK_STRUCT_PENDING_BIT is cleared *** > flush_busy_ctxs() is executed > reqeust ffff884000343cc0 inserted > hctx marked as pended > kblockd_schedule...() returns 0 > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ > WTF? > > As a result ffff884000343cc0 request pended forever. > > According to the trace output I see that another CPU _always_ observes > WORK_STRUCT_PENDING_BIT as set for that hctx->run_work, even it was > cleared on another CPU. > > Checking the workqueue.c code I see that clearing the bit is nothing > more, but atomic_long_set(), which is instruction. This > function: > > static inline void set_work_data() > > In attempt to "fix" the mystery I replaced atomic_long_set() call with > atomic_long_xchg(), which is instruction. > > The problem has gone. > > For me it looks that test_and_set_bit() ( instruction) does > not require flush of all CPU caches, which can be dirty after executing > of on another CPU. But really updates the memory and the > following execution of observes that bit was cleared. > > As a conculusion I can say, that I am lucky enough and can reproduce > this bug in several minutes on a specific load (I tried many other > simple loads using fio, even using btrecord/btreplay, no success). > And that easy reproduction on a specific load gives me some freedom > to test and then to be sure, that problem has gone. Heh, excellent debugging. I wonder how old this bug is. cc'ing David Howells who ISTR to have reported a similar issue. The root problem here, I think, is that PENDING is used to synchronize between different queueing instances but we don't have proper memory barrier after it. A B queue (test_and_sets PENDING) dispatch (clears PENDING) execute queue (test_and_sets PENDING) So, for B, the guarantee must be that either A starts executing after B's test_and_set or B's test_and_set succeeds; however, as we don't have any memory barrier between dispatch and execute, there's nothing preventing the processor from scheduling some memory fetch operations from the execute stage before the clearing of PENDING - ie. A might not see what B has done prior to queue even if B's test_and_set fails indicating that A should. Can you please test whether the following patch fixes the issue? diff --git a/kernel/workqueue.c b/kernel/workqueue.c index 2232ae3..8ec2b5e 100644 --- a/kernel/workqueue.c +++ b/kernel/workqueue.c @@ -666,6 +666,7 @@ static void set_work_pool_and_clear_pending(struct work_struct *work, */ smp_wmb(); set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0); + smp_mb(); } static void clear_work_data(struct work_struct *work) -- tejun