Return-Path: Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1757069AbbLBJEY (ORCPT ); Wed, 2 Dec 2015 04:04:24 -0500 Received: from eu-smtp-delivery-143.mimecast.com ([207.82.80.143]:22920 "EHLO eu-smtp-delivery-143.mimecast.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755191AbbLBJEU convert rfc822-to-8bit (ORCPT ); Wed, 2 Dec 2015 04:04:20 -0500 Message-ID: <565EB409.3090202@arm.com> Date: Wed, 02 Dec 2015 09:04:09 +0000 From: Vladimir Murzin User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.2.0 MIME-Version: 1.0 To: Peter Zijlstra CC: linux-kernel@vger.kernel.org, neilb@suse.de, oleg@redhat.com, mark.rutland@arm.com, linux-arm-kernel@lists.infradead.org, linux-mm@kvack.org Subject: Re: [BISECTED] rcu_sched self-detected stall since 3.17 References: <564F3DCA.1080907@arm.com> <20151201130404.GL3816@twins.programming.kicks-ass.net> In-Reply-To: <20151201130404.GL3816@twins.programming.kicks-ass.net> X-OriginalArrivalTime: 02 Dec 2015 09:04:15.0818 (UTC) FILETIME=[6B7E22A0:01D12CE0] X-MC-Unique: AR92XXCHQTG_UuQ_ZvSpAA-1 Content-Type: text/plain; charset=WINDOWS-1252 Content-Transfer-Encoding: 8BIT Sender: linux-kernel-owner@vger.kernel.org List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Content-Length: 5131 Lines: 164 On 01/12/15 13:04, Peter Zijlstra wrote: > Sorry for the delay and thanks for the reminder! > > On Fri, Nov 20, 2015 at 03:35:38PM +0000, Vladimir Murzin wrote: >> commit 743162013d40ca612b4cb53d3a200dff2d9ab26e >> Author: NeilBrown >> Date: Mon Jul 7 15:16:04 2014 +1000 >> >> sched: Remove proliferation of wait_on_bit() action functions >> >> The only change I noticed is from (mm/filemap.c) >> >> io_schedule(); >> fatal_signal_pending(current) >> >> to (kernel/sched/wait.c) >> >> signal_pending_state(current->state, current) >> io_schedule(); >> >> and if I apply following diff I don't see stalls anymore. >> >> diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c >> index a104879..2d68cdb 100644 >> --- a/kernel/sched/wait.c >> +++ b/kernel/sched/wait.c >> @@ -514,9 +514,10 @@ EXPORT_SYMBOL(bit_wait); >> >> __sched int bit_wait_io(void *word) >> { >> + io_schedule(); >> + >> if (signal_pending_state(current->state, current)) >> return 1; >> - io_schedule(); >> return 0; >> } >> EXPORT_SYMBOL(bit_wait_io); >> >> Any ideas why it might happen and why diff above helps? > > Yes, the code as presented is simply wrong. And in fact most of the code > it replaced was of the right form (with a few exceptions which would > indeed have been subject to the same problem you've observed. > > Note how the late: > > - cifs_sb_tcon_pending_wait > - fscache_wait_bit_interruptible > - sleep_on_page_killable > - wait_inquiry > - key_wait_bit_intr > > All check the signal state _after_ calling schedule(). > > As opposed to: > > - gfs2_journalid_wait > > which follows the broken pattern. > > Further notice that most expect a return of -EINTR, which also seems > correct given that this is a signal, those that do not return -EINTR > only check for a !0 return value so would work equally well with -EINTR. > > The reason this is broken is that schedule() will no-op when there is a > pending signal, while raising a signal will also issue a wakeup. > Glad to hear confirmation on a problem. Thanks for detailed answer! > Thus the right thing to do is check for the signal state after, that way > you handle both cases: > > - calling schedule() with a signal pending > - receiving a signal while sleeping > > As such, I would propose the below patch. Oleg, do you concur? > > --- > Subject: sched,wait: Fix signal handling in bit wait helpers > > Vladimir reported getting RCU stall warnings and bisected it back to > commit 743162013d40. That commit inadvertently reversed the calls to > schedule() and signal_pending(), thereby not handling the case where the > signal receives while we sleep. > > Fixes: 743162013d40 ("sched: Remove proliferation of wait_on_bit() action functions") > Fixes: cbbce8220949 ("SCHED: add some "wait..on_bit...timeout()" interfaces.") > Reported-by: Vladimir Murzin > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/sched/wait.c | 16 ++++++++-------- > 1 file changed, 8 insertions(+), 8 deletions(-) > > diff --git a/kernel/sched/wait.c b/kernel/sched/wait.c > index 052e02672d12..f10bd873e684 100644 > --- a/kernel/sched/wait.c > +++ b/kernel/sched/wait.c > @@ -583,18 +583,18 @@ EXPORT_SYMBOL(wake_up_atomic_t); > > __sched int bit_wait(struct wait_bit_key *word) > { > - if (signal_pending_state(current->state, current)) > - return 1; > schedule(); > + if (signal_pending(current)) > + return -EINTR; > return 0; > } > EXPORT_SYMBOL(bit_wait); > > __sched int bit_wait_io(struct wait_bit_key *word) > { > - if (signal_pending_state(current->state, current)) > - return 1; > io_schedule(); > + if (signal_pending(current)) > + return -EINTR; > return 0; > } > EXPORT_SYMBOL(bit_wait_io); > @@ -602,11 +602,11 @@ EXPORT_SYMBOL(bit_wait_io); > __sched int bit_wait_timeout(struct wait_bit_key *word) > { > unsigned long now = READ_ONCE(jiffies); > - if (signal_pending_state(current->state, current)) > - return 1; > if (time_after_eq(now, word->timeout)) > return -EAGAIN; > schedule_timeout(word->timeout - now); > + if (signal_pending(current)) > + return -EINTR; > return 0; > } > EXPORT_SYMBOL_GPL(bit_wait_timeout); > @@ -614,11 +614,11 @@ EXPORT_SYMBOL_GPL(bit_wait_timeout); > __sched int bit_wait_io_timeout(struct wait_bit_key *word) > { > unsigned long now = READ_ONCE(jiffies); > - if (signal_pending_state(current->state, current)) > - return 1; > if (time_after_eq(now, word->timeout)) > return -EAGAIN; > io_schedule_timeout(word->timeout - now); > + if (signal_pending(current)) > + return -EINTR; > return 0; > } > EXPORT_SYMBOL_GPL(bit_wait_io_timeout); > I run it overnight on top of 4.3 and didn't see stalls. So in case it helps Tested-by: Vladimir Murzin Cheers Vladimir -- To unsubscribe from this list: send the line "unsubscribe linux-kernel" in the body of a message to majordomo@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/