Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757557AbbBFSBm (ORCPT ); Fri, 6 Feb 2015 13:01:42 -0500 Received: from mail-qa0-f47.google.com ([209.85.216.47]:63953 "EHLO mail-qa0-f47.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754038AbbBFSBk (ORCPT ); Fri, 6 Feb 2015 13:01:40 -0500 Date: Fri, 6 Feb 2015 13:01:37 -0500 From: Tejun Heo To: Rabin Vincent Cc: Jesper Nilsson , linux-kernel@vger.kernel.org Subject: Re: Soft lockups in __cancel_work_timer() Message-ID: <20150206180137.GC10580@htj.dyndns.org> References: <20150206171156.GA8942@axis.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20150206171156.GA8942@axis.com> User-Agent: Mutt/1.5.23 (2014-03-12) Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 2571 Lines: 72 Hello, On Fri, Feb 06, 2015 at 06:11:56PM +0100, Rabin Vincent wrote: > task1 task2 worker > > add to busy hash > clear pending > exec work > > __cancel_work_timer() > try_to_grab_pending() > get pending, return 0 > set work cancelling > flush_work() > wait_for_completion() > > remove from busy_hash > > __cancel_work_timer() > while (forever) { > try_to_grab_pending() > return -ENOENT due to cancelling > flush_work > return false due to already gone > } > > > On kernels with CONFIG_PREEMPT disabled, this causes a permanent soft > lockup of the system. Ah, drat. > Adding a cond_resched() to the loop in cancel_delayed_work_sync(), like > below, helps the simple !PREEMPT case, but does not completely fix the > problem, because the problem can still be seen if the thread which sees > the ENOENT has for example SCHED_FIFO scheduling class, both on systems > with CONFIG_PREEMPT enabled and on !PREEMPT with the cond_resched(). > > We've seen this problem with the cancel_delayed_work() call in > jffs2_sync_fs(), but I've attached a testing patch which forces the > problem on current mainline without the need for jffs2. ... > @@ -2741,6 +2741,9 @@ static bool __cancel_work_timer(struct work_struct *work, bool is_dwork) > */ > if (unlikely(ret == -ENOENT)) > flush_work(work); > + > + if (ret < 0) > + cond_resched(); > } while (unlikely(ret < 0)); Well, an obvious thing to do would be if (unlikely(ret == -ENOENT)) { if (!flush_work(work)) schedule_timeout(1); } It was gonna block for the work item anyway so I don't think schedule_timeout() there is a problem. That said, given that we're guaranteed to be able to dereference @work there, there probably is a better way. I'll think more about it. Thanks. -- tejun -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/