Received: by 2002:a05:6a10:17d3:0:0:0:0 with SMTP id hz19csp3514858pxb; Tue, 20 Apr 2021 09:55:06 -0700 (PDT) X-Google-Smtp-Source: ABdhPJxpl7UoLB0QK8G6mB6NGmCmJb5iiaSUMQXw6iU0peG3rugjq+MJwXTCfq0amFNZef3EHPL1 X-Received: by 2002:a17:907:9852:: with SMTP id jj18mr3777185ejc.382.1618937706574; Tue, 20 Apr 2021 09:55:06 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1618937706; cv=none; d=google.com; s=arc-20160816; b=jVnxZg6DgXY6AWip8k6Xi6W59GUz6eXCMa6ucyRB1+FNVsLRGdbdfoeFdv/YybfzID 848deNUV8eoDCZWVWK4q6EX8E8s1fKIZSlJrsnHxEiwadhl8ecdGEGSJbGsJZZWZL77z 8JpBtUdOZYtHzCfVnIQMdrYf9hQ9pgphcxqsm3Z2gAlQuLs5E0FLdru7Xz0Pi6pHPC5F tRz2vGQ7LDsUWCt6CFFlQv6snvi6OEBLVGBJptpqy0VZ4+qTrLXfBCUe/TOnEQ8VJ3sT p/esCxS6iQ7LoycHfUoHH5rapip0P50YrRMZH2bgk5V1aNXePnGYd+xyzdQRm3fWf5RE BW3A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:content-disposition :mime-version:references:message-id:subject:cc:to:from:date; bh=YnPnnMbou4txWQp8hxW+eVEFyZEd6R2xp2g74+H8ii8=; b=FfbctonnES4g5xzLfMj/m5LWTz2ZUs4RwOCuzseZMa/4ClTQhCHTyXk7ZC2+3bLOJS ljcys5FiT5VoPQJxUQiwfoTiPOhlNPs9S2PnDIbbH5XidsPlRHpGnVaDf2KI+P/QjaSK wmWmROxRu7bSKi5Hk0QTzxpOyv1r7LESwJAGLjGSbmBFIYqlf0wiQqLWS5kvI9kxkhIs yYPhyThy7hnjQMujEA7Ajy+d0bDoZsizNp52JrpqF3EMoB6y77S3IXahhKHmb1631o5x lC/cMqLpzHHt5XJzL7uJzexdAjMaJ4ihKLbewjx9e+6WqH9+lZcXdYqUuo+Yf01VHCyq elrg== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id a19si14990368ejy.700.2021.04.20.09.54.42; Tue, 20 Apr 2021 09:55:06 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=fail (p=NONE sp=NONE dis=NONE) header.from=arm.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232993AbhDTQyU (ORCPT + 99 others); Tue, 20 Apr 2021 12:54:20 -0400 Received: from foss.arm.com ([217.140.110.172]:38208 "EHLO foss.arm.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232767AbhDTQyT (ORCPT ); Tue, 20 Apr 2021 12:54:19 -0400 Received: from usa-sjc-imap-foss1.foss.arm.com (unknown [10.121.207.14]) by usa-sjc-mx-foss1.foss.arm.com (Postfix) with ESMTP id D9AF31474; Tue, 20 Apr 2021 09:53:47 -0700 (PDT) Received: from e120877-lin.cambridge.arm.com (usa-sjc-imap-foss1.foss.arm.com [10.121.207.14]) by usa-sjc-imap-foss1.foss.arm.com (Postfix) with ESMTPSA id 0203D3F73B; Tue, 20 Apr 2021 09:53:45 -0700 (PDT) Date: Tue, 20 Apr 2021 17:53:40 +0100 From: Vincent Donnefort To: Peter Zijlstra Cc: Valentin Schneider , tglx@linutronix.de, mingo@kernel.org, bigeasy@linutronix.de, swood@redhat.com, juri.lelli@redhat.com, vincent.guittot@linaro.org, dietmar.eggemann@arm.com, rostedt@goodmis.org, bsegall@google.com, mgorman@suse.de, bristot@redhat.com, qais.yousef@arm.com, linux-kernel@vger.kernel.org Subject: Re: [PATCH 3/3] sched: Use cpu_dying() to fix balance_push vs hotplug-rollback Message-ID: <20210420165340.GA231208@e120877-lin.cambridge.arm.com> References: <87r1jfmn8d.mognet@arm.com> <87a6pzmxec.mognet@arm.com> <20210419105541.GA40111@e120877-lin.cambridge.arm.com> <20210420094632.GA165360@e120877-lin.cambridge.arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.5.24 (2015-08-30) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Apr 20, 2021 at 04:58:00PM +0200, Peter Zijlstra wrote: > On Tue, Apr 20, 2021 at 04:39:04PM +0200, Peter Zijlstra wrote: > > On Tue, Apr 20, 2021 at 04:20:56PM +0200, Peter Zijlstra wrote: > > > On Tue, Apr 20, 2021 at 10:46:33AM +0100, Vincent Donnefort wrote: > > > > > > > Found the issue: > > > > > > > > $ cat hotplug/states: > > > > 219: sched:active > > > > 220: online > > > > > > > > CPU0: > > > > > > > > $ echo 219 > hotplug/fail > > > > $ echo 0 > online > > > > > > > > => cpu_active = 1 cpu_dying = 1 > > > > > > > > which means that later on, for another CPU hotunplug, in > > > > __balance_push_cpu_stop(), the fallback rq for a kthread can select that > > > > CPU0, but __migrate_task() would fail and we end-up in an infinite loop, > > > > trying to migrate that task to CPU0. > > > > > > > > The problem is that for a failure in sched:active, as "online" has no callback, > > > > there will be no call to cpuhp_invoke_callback(). Hence, the cpu_dying bit would > > > > not be reset. > > > > > > Urgh! Good find. > > > I seem to have triggered the BUG() in select_fallback_rq() with your recipie. > > Have cpu0 fail on sched:active, then offline all other CPUs. > > > > Now lemme add that patch. > > (which obviously didn't actually build) seems to fix it. > > --- > diff --git a/kernel/cpu.c b/kernel/cpu.c > index 838dcf238f92..e538518556f4 100644 > --- a/kernel/cpu.c > +++ b/kernel/cpu.c > @@ -63,6 +63,7 @@ struct cpuhp_cpu_state { > bool rollback; > bool single; > bool bringup; > + int cpu; > struct hlist_node *node; > struct hlist_node *last; > enum cpuhp_state cb_state; > @@ -160,9 +161,6 @@ static int cpuhp_invoke_callback(unsigned int cpu, enum cpuhp_state state, > int (*cb)(unsigned int cpu); > int ret, cnt; > > - if (cpu_dying(cpu) != !bringup) > - set_cpu_dying(cpu, !bringup); > - > if (st->fail == state) { > st->fail = CPUHP_INVALID; > return -EAGAIN; > @@ -467,13 +465,16 @@ static inline enum cpuhp_state > cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target) > { > enum cpuhp_state prev_state = st->state; > + bool bringup = st->state < target; > > st->rollback = false; > st->last = NULL; > > st->target = target; > st->single = false; > - st->bringup = st->state < target; > + st->bringup = bringup; > + if (cpu_dying(st->cpu) != !bringup) > + set_cpu_dying(st->cpu, !bringup); > > return prev_state; > } > @@ -481,6 +482,8 @@ cpuhp_set_state(struct cpuhp_cpu_state *st, enum cpuhp_state target) > static inline void > cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state) > { > + bool bringup = !st->bringup; > + > st->target = prev_state; > > /* > @@ -503,7 +506,9 @@ cpuhp_reset_state(struct cpuhp_cpu_state *st, enum cpuhp_state prev_state) > st->state++; > } > > - st->bringup = !st->bringup; > + st->bringup = bringup; > + if (cpu_dying(st->cpu) != !bringup) > + set_cpu_dying(st->cpu, !bringup); > } > > /* Regular hotplug invocation of the AP hotplug thread */ > @@ -693,6 +698,7 @@ static void cpuhp_create(unsigned int cpu) > > init_completion(&st->done_up); > init_completion(&st->done_down); > + st->cpu = cpu; > } > > static int cpuhp_should_run(unsigned int cpu) All good with that snippet on my end. I wonder if balance_push() shouldn't use the cpu_of() accessor instead of rq->cpu. Otherwise, + Reviewed-by: Vincent Donnefort