Date: Mon, 21 Nov 2022 03:59:58 +0000
From: Joel Fernandes
To: Dietmar Eggemann
Cc: Connor O'Brien, linux-kernel@vger.kernel.org, kernel-team@android.com,
    John Stultz, Qais Yousef, Ingo Molnar, Peter Zijlstra, Juri Lelli,
    Vincent Guittot, Steven Rostedt, Ben Segall, Mel Gorman,
    Daniel Bristot de Oliveira, Valentin Schneider, Will Deacon,
    Waiman Long, Boqun Feng, "Paul E. McKenney"
McKenney" Subject: Re: [RFC PATCH 07/11] sched: Add proxy execution Message-ID: References: <34B2D8B9-A0C1-4280-944D-17224FB24339@joelfernandes.org> <4e396924-c3be-1932-91a3-5f458cc843fe@arm.com> <4ec6ab79-9f0f-e14b-dd06-d2840a1bf71a@arm.com> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Disposition: inline Content-Transfer-Encoding: 8bit In-Reply-To: X-Spam-Status: No, score=-2.1 required=5.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF,RCVD_IN_DNSWL_NONE, SPF_HELO_NONE,SPF_PASS autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Sun, Nov 20, 2022 at 08:49:22PM -0500, Joel Fernandes wrote: > On Sun, Nov 20, 2022 at 7:22 PM Joel Fernandes wrote: > > > > Hello Dietmar, > > > > On Fri, Nov 04, 2022 at 06:09:26PM +0100, Dietmar Eggemann wrote: > > > On 31/10/2022 19:00, Joel Fernandes wrote: > > > > On Mon, Oct 31, 2022 at 05:39:45PM +0100, Dietmar Eggemann wrote: > > > >> On 29/10/2022 05:31, Joel Fernandes wrote: > > > >>> Hello Dietmar, > > > >>> > > > >>>> On Oct 24, 2022, at 6:13 AM, Dietmar Eggemann wrote: > > > >>>> > > > >>>> On 03/10/2022 23:44, Connor O'Brien wrote: > > > >>>>> From: Peter Zijlstra > > > > > > [...] > > > > > > >>>>> + rq_unpin_lock(rq, rf); > > > >>>>> + raw_spin_rq_unlock(rq); > > > >>>> > > > >>>> Don't we run into rq_pin_lock()'s: > > > >>>> > > > >>>> SCHED_WARN_ON(rq->balance_callback && rq->balance_callback != > > > >>>> &balance_push_callback) > > > >>>> > > > >>>> by releasing rq lock between queue_balance_callback(, push_rt/dl_tasks) > > > >>>> and __balance_callbacks()? > > > >>> > > > >>> Apologies, I’m a bit lost here. The code you are responding to inline does not call rq_pin_lock, it calls rq_unpin_lock. So what scenario does the warning trigger according to you? > > > >> > > > >> True, but the code which sneaks in between proxy()'s > > > >> raw_spin_rq_unlock(rq) and raw_spin_rq_lock(rq) does. > > > >> > > > > > > > > Got it now, thanks a lot for clarifying. Can this be fixed by do a > > > > __balance_callbacks() at: > > > > > > I tried the: > > > > > > head = splice_balance_callbacks(rq) > > > task_rq_unlock(rq, p, &rf); > > > ... > > > balance_callbacks(rq, head); > > > > > > separation known from __sched_setscheduler() in __schedule() (right > > > after pick_next_task()) but it doesn't work. Lot of `BUG: scheduling > > > while atomic:` > > > > How about something like the following? This should exclude concurrent > > balance callback queues from other CPUs and let us release the rq lock early > > in proxy(). I ran locktorture with your diff to make writer threads RT, and I > > cannot reproduce any crash with it: > > > > ---8<----------------------- > > > > From: "Joel Fernandes (Google)" > > Subject: [PATCH] Exclude balance callback queuing during proxy's migrate > > > > In commit 565790d28b1e ("sched: Fix balance_callback()"), it is clear that rq > > lock needs to be held when __balance_callbacks() in schedule() is called. > > However, it is possible that because rq lock is dropped in proxy(), another > > CPU, say in __sched_setscheduler() can queue balancing callbacks and cause > > issues. > > > > To remedy this, exclude balance callback queuing on other CPUs, during the > > proxy(). 
> >
> > Reported-by: Dietmar Eggemann
> > Signed-off-by: Joel Fernandes (Google)
> > ---
> >  kernel/sched/core.c  | 15 +++++++++++++++
> >  kernel/sched/sched.h |  3 +++
> >  2 files changed, 18 insertions(+)
> >
> > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > index 88a5fa34dc06..f1dac21fcd90 100644
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6739,6 +6739,10 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> >  		p->wake_cpu = wake_cpu;
> >  	}
> >
> > +	// Prevent other CPUs from queuing balance callbacks while we migrate
> > +	// tasks in the migrate_list with the rq lock released.
> > +	raw_spin_lock(&rq->balance_lock);
> > +
> >  	rq_unpin_lock(rq, rf);
> >  	raw_spin_rq_unlock(rq);
> >  	raw_spin_rq_lock(that_rq);
> > @@ -6758,7 +6762,18 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
> >  	}
> >
> >  	raw_spin_rq_unlock(that_rq);
> > +
> > +	// This may make lockdep unhappy as we acquire rq->lock with balance_lock
> > +	// held. But that should be a false positive, as the following pattern
> > +	// happens only on the current CPU with interrupts disabled:
> > +	//   rq_lock()
> > +	//   balance_lock();
> > +	//   rq_unlock();
> > +	//   rq_lock();
> >  	raw_spin_rq_lock(rq);
>
> Hmm, I think there's still a chance of deadlock here. I need to
> rethink it a bit, but that's the idea I was going for.
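To spell out the deadlock risk (a reading the v2 comments below confirm):
with v1, queue_balance_callback() takes balance_lock while already holding
the rq lock, while proxy() re-takes the rq lock while still holding
balance_lock — an ABBA inversion across two CPUs:

    CPU 0 (proxy)                        CPU 1 (__sched_setscheduler)
    raw_spin_rq_lock(rq);
    raw_spin_lock(&rq->balance_lock);
    raw_spin_rq_unlock(rq);
                                         raw_spin_rq_lock(rq);
    /* migrate tasks */                  queue_balance_callback():
    raw_spin_rq_lock(rq);    // blocks     raw_spin_lock(&rq->balance_lock); // blocks

v2 breaks the cycle by making the queuing side back off: the new
*_lock_balance() helpers drop the rq lock, cpu_relax(), and retry until
balance_lock is observed free, so a CPU never sits on the rq lock while
waiting for balance_lock.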
Took care of that, and came up with the below. Tested with locktorture
and it survives. Thoughts?

---8<-----------------------

From: "Joel Fernandes (Google)"
Subject: [PATCH v2] Exclude balance callback queuing during proxy's migrate

In commit 565790d28b1e ("sched: Fix balance_callback()"), it is clear
that the rq lock needs to be held when __balance_callbacks() is called
in schedule(). However, because the rq lock is dropped in proxy(),
another CPU, say one in __sched_setscheduler(), can queue balance
callbacks and cause issues.

To remedy this, exclude balance callback queuing on other CPUs during
proxy().

Reported-by: Dietmar Eggemann
Signed-off-by: Joel Fernandes (Google)
---
 kernel/sched/core.c  | 72 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  3 ++
 2 files changed, 73 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 88a5fa34dc06..aba90b3dc3ef 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -633,6 +633,29 @@ struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+/*
+ * Helper to call __task_rq_lock safely, in scenarios where we might be about
+ * to queue a balance callback on a remote CPU. That CPU might be in proxy(),
+ * and could have released its rq lock while holding balance_lock. So release
+ * the rq lock in such a situation to avoid deadlock, and retry.
+ */
+struct rq *__task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf)
+{
+	struct rq *rq;
+	bool locked = false;
+
+	do {
+		if (locked) {
+			__task_rq_unlock(rq, rf);
+			cpu_relax();
+		}
+		rq = __task_rq_lock(p, rf);
+		locked = true;
+	} while (raw_spin_is_locked(&rq->balance_lock));
+
+	return rq;
+}
+
 /*
  * task_rq_lock - lock p->pi_lock and lock the rq @p resides on.
  */
@@ -675,6 +698,29 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	}
 }
 
+/*
+ * Helper to call task_rq_lock safely, in scenarios where we might be about
+ * to queue a balance callback on a remote CPU. That CPU might be in proxy(),
+ * and could have released its rq lock while holding balance_lock. So release
+ * the rq lock in such a situation to avoid deadlock, and retry.
+ */
+struct rq *task_rq_lock_balance(struct task_struct *p, struct rq_flags *rf)
+{
+	struct rq *rq;
+	bool locked = false;
+
+	do {
+		if (locked) {
+			task_rq_unlock(rq, p, rf);
+			cpu_relax();
+		}
+		rq = task_rq_lock(p, rf);
+		locked = true;
+	} while (raw_spin_is_locked(&rq->balance_lock));
+
+	return rq;
+}
+
 /*
  * RQ-clock updating methods:
  */
@@ -6739,6 +6785,12 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 		p->wake_cpu = wake_cpu;
 	}
 
+	/*
+	 * Prevent other CPUs from queuing balance callbacks while we migrate
+	 * tasks in the migrate_list with the rq lock released.
+	 */
+	raw_spin_lock(&rq->balance_lock);
+
 	rq_unpin_lock(rq, rf);
 	raw_spin_rq_unlock(rq);
 	raw_spin_rq_lock(that_rq);
@@ -6758,7 +6810,21 @@ proxy(struct rq *rq, struct task_struct *next, struct rq_flags *rf)
 	}
 
 	raw_spin_rq_unlock(that_rq);
+
+	/*
+	 * This may make lockdep unhappy as we acquire rq->lock with
+	 * balance_lock held. But that should be a false positive, as the
+	 * following pattern happens only on the current CPU with interrupts
+	 * disabled:
+	 *   rq_lock()
+	 *   balance_lock();
+	 *   rq_unlock();
+	 *   rq_lock();
+	 */
 	raw_spin_rq_lock(rq);
+
+	raw_spin_unlock(&rq->balance_lock);
+
 	rq_repin_lock(rq, rf);
 
 	return NULL; /* Retry task selection on _this_ CPU. */
@@ -7489,7 +7555,7 @@ void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task)
 	if (p->pi_top_task == pi_task && prio == p->prio && !dl_prio(prio))
 		return;
 
-	rq = __task_rq_lock(p, &rf);
+	rq = __task_rq_lock_balance(p, &rf);
 	update_rq_clock(rq);
 	/*
 	 * Set under pi_lock && rq->lock, such that the value can be used under
@@ -8093,7 +8159,8 @@ static int __sched_setscheduler(struct task_struct *p,
 	 * To be able to change p->policy safely, the appropriate
 	 * runqueue lock must be held.
 	 */
-	rq = task_rq_lock(p, &rf);
+	rq = task_rq_lock_balance(p, &rf);
+
 	update_rq_clock(rq);
 
 	/*
@@ -10312,6 +10379,7 @@ void __init sched_init(void)
 
 		rq = cpu_rq(i);
 		raw_spin_lock_init(&rq->__lock);
+		raw_spin_lock_init(&rq->balance_lock);
 		rq->nr_running = 0;
 		rq->calc_load_active = 0;
 		rq->calc_load_update = jiffies + LOAD_FREQ;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 354e75587fed..932d32bf9571 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1057,6 +1057,7 @@ struct rq {
 	unsigned long		cpu_capacity_orig;
 
 	struct callback_head	*balance_callback;
+	raw_spinlock_t		balance_lock;
 	unsigned char		nohz_idle_balance;
 	unsigned char		idle_balance;
 
@@ -1748,6 +1749,7 @@ queue_balance_callback(struct rq *rq,
 		       void (*func)(struct rq *rq))
 {
 	lockdep_assert_rq_held(rq);
 
+	raw_spin_lock(&rq->balance_lock);
 	/*
 	 * Don't (re)queue an already queued item; nor queue anything when
@@ -1760,6 +1762,7 @@ queue_balance_callback(struct rq *rq,
 	head->func = (void (*)(struct callback_head *))func;
 	head->next = rq->balance_callback;
 	rq->balance_callback = head;
+	raw_spin_unlock(&rq->balance_lock);
 }
 
 #define rcu_dereference_check_sched_domain(p) \
-- 
2.38.1.584.g0f3c55d4c2-goog
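As an aside, the back-off scheme above can be modeled in userspace to see
why it cannot deadlock. Below is a minimal sketch — not kernel code; all
names and the pthread/C11-atomics mapping are illustrative only. A pthread
mutex stands in for rq->lock, an atomic_bool for rq->balance_lock,
proxy_side() mimics proxy(), and setscheduler_side() mimics
__sched_setscheduler() queuing a callback via the back-off helper. Build
with `gcc -pthread`.

#include <pthread.h>
#include <sched.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t rq_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool balance_lock;	/* models rq->balance_lock */
static int nr_callbacks;		/* models rq->balance_callback */

static void balance_lock_acquire(void)
{
	bool expected = false;

	while (!atomic_compare_exchange_weak(&balance_lock, &expected, true)) {
		expected = false;
		sched_yield();		/* userspace stand-in for cpu_relax() */
	}
}

static void balance_lock_release(void)
{
	atomic_store(&balance_lock, false);
}

/*
 * Models task_rq_lock_balance(): if balance_lock is held once we own the
 * rq lock, drop the rq lock and retry so the proxy side can re-take it.
 */
static void rq_lock_balance(void)
{
	for (;;) {
		pthread_mutex_lock(&rq_lock);
		if (!atomic_load(&balance_lock))
			return;		/* rq lock held, balance_lock free */
		pthread_mutex_unlock(&rq_lock);
		sched_yield();
	}
}

static void *proxy_side(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&rq_lock);
	balance_lock_acquire();		/* taken before dropping the rq lock */
	pthread_mutex_unlock(&rq_lock);	/* rq lock dropped for the migration */
	/* ... tasks on migrate_list would move here ... */
	pthread_mutex_lock(&rq_lock);	/* re-taken while holding balance_lock */
	balance_lock_release();
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

static void *setscheduler_side(void *arg)
{
	(void)arg;
	rq_lock_balance();		/* like task_rq_lock_balance() */
	balance_lock_acquire();		/* like queue_balance_callback() */
	nr_callbacks++;
	balance_lock_release();
	pthread_mutex_unlock(&rq_lock);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, proxy_side, NULL);
	pthread_create(&b, NULL, setscheduler_side, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	printf("queued %d callback(s), no deadlock\n", nr_callbacks);
	return 0;
}

The key property mirrors the patch: balance_lock is only ever acquired
while holding the rq lock, and rq_lock_balance() refuses to keep the rq
lock while balance_lock is held elsewhere, so neither side can end up
holding one lock while waiting on the other.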