Received: by 10.192.165.148 with SMTP id m20csp433966imm; Fri, 20 Apr 2018 09:04:39 -0700 (PDT) X-Google-Smtp-Source: AIpwx48+Nc/Z1FwjggGgOaN40lurItykhkeKxyXBAmnrbzZPZjdTF3YuZ4Xz++Db1M5JE5Cvxp9q X-Received: by 10.99.113.80 with SMTP id b16mr9035781pgn.326.1524240279097; Fri, 20 Apr 2018 09:04:39 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524240279; cv=none; d=google.com; s=arc-20160816; b=xd22rnAyBJgjv6rjg24iBVdc+14BwHAhwdO1riSC5emuB4LhgkjF9aJZK+kpxovaRT LwmqmGBXz3tsEmbJjlokY89Xx/JQJptUDvYokclOtW0CWy4pX9qGUK7Fa95LXbj3f0lg di6fJTSQ19bzyRbh22Iy5ErnAXF3uckxU0tRKtvD+lm8GjckzrgfIRXpGgZt2qELJbj8 R21DL59KYdmqnUebYc2bJY0wAcsPUGdJJLa0KVOcR2Ha3BqpxTqIf5R5OA8MYzTpxP4t hRZtpf0+yCTGtZooNGjMp1UhnYn1V1+EOYf8dUlISACJx1Tw65fVps0ozTdYxDlrmVa4 CADA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-transfer-encoding:content-disposition:mime-version :references:message-id:subject:cc:to:from:date:dkim-signature :arc-authentication-results; bh=mJshR31+G0ZlRxM5OAWS5GjMPzNizK27pGqevcx+tk4=; b=svf43DInBqqgD2Nr3BYVMNPCvRAnArTR5UPpKagdMpiWD0xpU8vL2aXPC4Nms5INFW MIv4ntJAvfuuh1tWqMBdvANTZkERTNW25UhJAjCQ5nybXZQHxdz8lQPgoGfJkYVyhZcv oDwCAHDpKhCIM2H9H0gfens1cmmRgriqqgxMd1lO3B1eDLq73ATfi9U+amV6L7Htq98Y IJs9Q7/ZL+XjYbBXK7mEqVLT1t3niiXLlQNWkvCwNArEkOZu854hUWNDIrxZyCyLszK6 y33PJnpJyJY2PinpBFt71FxJsfto00K+VCr6niHzs3gY90mdh23d5G0oL5m+PbgdA7QH q4lg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=MgDTEuzP; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id t1-v6si6584670plb.90.2018.04.20.09.03.53; Fri, 20 Apr 2018 09:04:39 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@linaro.org header.s=google header.b=MgDTEuzP; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=linaro.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755972AbeDTQAU (ORCPT + 99 others); Fri, 20 Apr 2018 12:00:20 -0400 Received: from mail-wr0-f174.google.com ([209.85.128.174]:44211 "EHLO mail-wr0-f174.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1755683AbeDTQAR (ORCPT ); Fri, 20 Apr 2018 12:00:17 -0400 Received: by mail-wr0-f174.google.com with SMTP id o15-v6so24278216wro.11 for ; Fri, 20 Apr 2018 09:00:16 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=linaro.org; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:content-transfer-encoding:in-reply-to :user-agent; bh=mJshR31+G0ZlRxM5OAWS5GjMPzNizK27pGqevcx+tk4=; b=MgDTEuzPACeJjh5onutu1t74LpjiEaUHVdKJIIO7RQ6W2B9ZgxGB3zV+YCvZZ8alOt pmvXeNaJDV2knPBCfbUr0MoOFoom6e+7/6O/puwqZu/oehmUl3P2cYBlYRISyOMol4yO m3tRZxnjjMe1uFluyA+aflKuGJZPjuLKA0+n8= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:content-transfer-encoding :in-reply-to:user-agent; bh=mJshR31+G0ZlRxM5OAWS5GjMPzNizK27pGqevcx+tk4=; b=F3krAKTV/iOkquzZY6N5LWu1FHXC2/Eg3B/CcBNJIqUC9+nJ2SzI5BChbXY6RXk/SQ CDYb2TsO5a9q+XwrkxGG6ws4DnfOBGMG7t8FUx/HiilVG9Hl8NAo+13mnz/QEI8MLSQ9 OtQdCA5ICbM3c6xVCr6rJpT3wPo1gwVy1Gsak58HZz/zVlazx+cGLAV1cyBquqfBPdVb S2ybYkbZDiZyetcgvTRAUsVp2RHK/+uHRTNJF5mAO/ZH58g5ihtRX+LtTszfC/VPU8F4 dIsiwpN7D5446wSr9VLwlzi2aXJ1/j4nWKNMu3ccU0wqXqv6jOlXQw0nwKLbURGyeYGO JDKQ== X-Gm-Message-State: ALQs6tA1UoYdgIZC4kbjBcgN3oH4sO1dcbAqIALLkicBrws71ABjtJZr USt3hk/HNzz8ZjmtvTj+nSP+YUvaV74= X-Received: by 10.28.147.83 with SMTP id v80mr2443437wmd.91.1524240015989; Fri, 20 Apr 2018 09:00:15 -0700 (PDT) Received: from linaro.org ([2a01:e0a:f:6020:575:3484:84df:41ba]) by smtp.gmail.com with ESMTPSA id k79sm2541350wmg.39.2018.04.20.09.00.14 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 20 Apr 2018 09:00:14 -0700 (PDT) Date: Fri, 20 Apr 2018 18:00:13 +0200 From: Vincent Guittot To: Niklas =?iso-8859-1?Q?S=F6derlund?= , Heiner Kallweit Cc: Peter Zijlstra , "Paul E. McKenney" , Ingo Molnar , linux-kernel , linux-renesas-soc@vger.kernel.org Subject: Re: Potential problem with 31e77c93e432dec7 ("sched/fair: Update blocked load when newly idle") Message-ID: <20180420160013.GA13769@linaro.org> References: <20180412091822.GG12256@bigcity.dyn.berto.se> <20180412111519.GH12256@bigcity.dyn.berto.se> <20180412133031.GA551@linaro.org> <20180412223904.GJ12256@bigcity.dyn.berto.se> MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi Heiner and Niklas, Le Saturday 14 Apr 2018 ? 13:24:20 (+0200), Vincent Guittot a ?crit : > Hi Niklas, > > On 13 April 2018 at 00:39, Niklas S?derlund > wrote: > > Hi Vincent, > > > > Thanks for helping trying to figure this out. > > > > On 2018-04-12 15:30:31 +0200, Vincent Guittot wrote: > > > > [snip] > > > >> > >> I'd like to narrow the problem a bit more with the 2 patchies aboves. Can you try > >> them separatly on top of c18bb396d3d261eb ("Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net")) > >> and check if one of them fixes the problem ?i > > > > I tried your suggested changes based on top of c18bb396d3d261eb. > > > >> > >> (They should apply on linux-next as well) > >> > >> First patch always kick ilb instead of doing ilb on local cpu before entering idle > >> > >> --- > >> kernel/sched/fair.c | 3 +-- > >> 1 file changed, 1 insertion(+), 2 deletions(-) > >> > >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c > >> index 0951d1c..b21925b 100644 > >> --- a/kernel/sched/fair.c > >> +++ b/kernel/sched/fair.c > >> @@ -9739,8 +9739,7 @@ static void nohz_newidle_balance(struct rq *this_rq) > >> * candidate for ilb instead of waking up another idle CPU. > >> * Kick an normal ilb if we failed to do the update. > >> */ > >> - if (!_nohz_idle_balance(this_rq, NOHZ_STATS_KICK, CPU_NEWLY_IDLE)) > >> - kick_ilb(NOHZ_STATS_KICK); > >> + kick_ilb(NOHZ_STATS_KICK); > >> raw_spin_lock(&this_rq->lock); > >> } > > > > This change don't seem to effect the issue. I can still get the single > > ssh session and the system to lockup by hitting the return key. And > > opening a second ssh session immediately unblocks both the first ssh > > session and the serial console. And I can still trigger the console > > warning by just letting the system be once it locks-up. I do have > > just as before reset the system a few times to trigger the issue. > > You results are similar to Heiner's ones. The problem is still there > even if we only kick ilb which mainly send an IPI reschedule to the > other CPU if Idle > Could it be possible to record some traces of the problem to get a better view of what happens ? I have a small patch that adds some traces in the functions that seems to create the problem --- kernel/sched/fair.c | 6 ++++++ 1 file changed, 6 insertions(+) diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c index 0951d1c..a951464 100644 --- a/kernel/sched/fair.c +++ b/kernel/sched/fair.c @@ -9606,6 +9606,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, */ WRITE_ONCE(nohz.has_blocked, 0); + trace_printk("_nohz_idle_balance cpu %d idle %d flag %x", this_cpu, idle, flags); + /* * Ensures that if we miss the CPU, we must see the has_blocked * store from nohz_balance_enter_idle(). @@ -9680,6 +9682,8 @@ static bool _nohz_idle_balance(struct rq *this_rq, unsigned int flags, if (likely(update_next_balance)) nohz.next_balance = next_balance; + trace_printk("_nohz_idle_balance return %d", ret); + return ret; } @@ -9732,6 +9736,8 @@ static void nohz_newidle_balance(struct rq *this_rq) time_before(jiffies, READ_ONCE(nohz.next_blocked))) return; + trace_printk("nohz_newidle_balance start update"); + raw_spin_unlock(&this_rq->lock); /* * This CPU is going to be idle and blocked load of idle CPUs -- 2.7.4 Also that would be good to stop tracing when the RCU stall happens In case you are not familiar with the trace tool, I have put below how to configure ftrace to trace scheudler, irq, timer ... events and stop tracing when dump_backtrace function is called trace-cmd reset > /dev/null trace-cmd start -b 40000 -p function -l dump_backtrace:traceoff -e sched -e cpu_idle -e cpu_frequency -e timer -e ipi -e irq -e printk trace-cmd start -b 40000 -p function -l dump_backtrace The trace-cmd start has to be called twice to make sure that we will not trace all function Once the dump happen and the trace is stopped, you can extract the traces with trace-cmd extract -o Thanks Vincent