From: Vineeth Remanan Pillai
To: Nishanth Aravamudan, Julien Desfossez, Peter Zijlstra, Tim Chen,
    mingo@kernel.org, tglx@linutronix.de, pjt@google.com,
    torvalds@linux-foundation.org
Cc: linux-kernel@vger.kernel.org, subhra.mazumdar@oracle.com,
    fweisbec@gmail.com, keescook@chromium.org, kerrnel@google.com,
    Phil Auld, Aaron Lu, Aubrey Li, Valentin Schneider, Mel Gorman,
    Pawan Gupta, Paolo Bonzini, Joel Fernandes, joel@joelfernandes.org,
    vineethrp@gmail.com, Chen Yu, Christian Brauner,
    Vineeth Remanan Pillai, Aaron Lu
Subject: [RFC PATCH 06/16] sched: Add core wide task selection and scheduling.
Date: Tue, 30 Jun 2020 21:32:27 +0000
X-Mailer: git-send-email 2.17.1

From: Peter Zijlstra

Instead of only selecting a local task, select a task for all SMT siblings
for every reschedule on the core (irrespective of which logical CPU does
the reschedule).

There could be races in core scheduler where a CPU is trying to pick a task
for its sibling in core scheduler, when that CPU has just been offlined.
We should not schedule any tasks on the CPU in this case. Return an idle
task in pick_next_task for this situation.

NOTE: there is still potential for siblings rivalry.
NOTE: this is far too complicated; but thus far I've failed to simplify it
      further.

Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Julien Desfossez
Signed-off-by: Vineeth Remanan Pillai
Signed-off-by: Aaron Lu
Signed-off-by: Tim Chen
---
 kernel/sched/core.c  | 263 ++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h |   6 +-
 2 files changed, 267 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b21bcab20da6..f51e5c4798c8 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4113,7 +4113,7 @@ static inline void schedule_debug(struct task_struct *prev, bool preempt)
  * Pick up the highest-prio task:
  */
 static inline struct task_struct *
-pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+__pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 {
 	const struct sched_class *class;
 	struct task_struct *p;
@@ -4169,6 +4169,262 @@ pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
 	BUG();
 }
 
+#ifdef CONFIG_SCHED_CORE
+
+static inline bool cookie_equals(struct task_struct *a, unsigned long cookie)
+{
+	return is_idle_task(a) || (a->core_cookie == cookie);
+}
+
+static inline bool cookie_match(struct task_struct *a, struct task_struct *b)
+{
+	if (is_idle_task(a) || is_idle_task(b))
+		return true;
+
+	return a->core_cookie == b->core_cookie;
+}
+
+// XXX fairness/fwd progress conditions
+/*
+ * Returns
+ * - NULL if there is no runnable task for this class.
+ * - the highest priority task for this runqueue if it matches
+ *   rq->core->core_cookie or its priority is greater than max.
+ * - Else returns idle_task.
+ */
+static struct task_struct *
+pick_task(struct rq *rq, const struct sched_class *class, struct task_struct *max)
+{
+	struct task_struct *class_pick, *cookie_pick;
+	unsigned long cookie = rq->core->core_cookie;
+
+	class_pick = class->pick_task(rq);
+	if (!class_pick)
+		return NULL;
+
+	if (!cookie) {
+		/*
+		 * If class_pick is tagged, return it only if it has
+		 * higher priority than max.
+		 */
+		if (max && class_pick->core_cookie &&
+		    prio_less(class_pick, max))
+			return idle_sched_class.pick_task(rq);
+
+		return class_pick;
+	}
+
+	/*
+	 * If class_pick is idle or matches cookie, return early.
+	 */
+	if (cookie_equals(class_pick, cookie))
+		return class_pick;
+
+	cookie_pick = sched_core_find(rq, cookie);
+
+	/*
+	 * If class > max && class > cookie, it is the highest priority task on
+	 * the core (so far) and it must be selected, otherwise we must go with
+	 * the cookie pick in order to satisfy the constraint.
+	 */
+	if (prio_less(cookie_pick, class_pick) &&
+	    (!max || prio_less(max, class_pick)))
+		return class_pick;
+
+	return cookie_pick;
+}
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	struct task_struct *next, *max = NULL;
+	const struct sched_class *class;
+	const struct cpumask *smt_mask;
+	int i, j, cpu;
+	bool need_sync;
+
+	if (!sched_core_enabled(rq))
+		return __pick_next_task(rq, prev, rf);
+
+	/*
+	 * If there were no {en,de}queues since we picked (IOW, the task
+	 * pointers are all still valid), and we haven't scheduled the last
+	 * pick yet, do so now.
+	 */
+	if (rq->core->core_pick_seq == rq->core->core_task_seq &&
+	    rq->core->core_pick_seq != rq->core_sched_seq) {
+		WRITE_ONCE(rq->core_sched_seq, rq->core->core_pick_seq);
+
+		next = rq->core_pick;
+		if (next != prev) {
+			put_prev_task(rq, prev);
+			set_next_task(rq, next);
+		}
+		return next;
+	}
+
+	prev->sched_class->put_prev_task(rq, prev);
+	if (!rq->nr_running)
+		newidle_balance(rq, rf);
+
+	cpu = cpu_of(rq);
+	smt_mask = cpu_smt_mask(cpu);
+
+	/*
+	 * core->core_task_seq, core->core_pick_seq, rq->core_sched_seq
+	 *
+	 * @task_seq guards the task state ({en,de}queues)
+	 * @pick_seq is the @task_seq we did a selection on
+	 * @sched_seq is the @pick_seq we scheduled
+	 *
+	 * However, preemptions can cause multiple picks on the same task set.
+	 * 'Fix' this by also increasing @task_seq for every pick.
+	 */
+	rq->core->core_task_seq++;
+	need_sync = !!rq->core->core_cookie;
+
+	/* reset state */
+	rq->core->core_cookie = 0UL;
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		rq_i->core_pick = NULL;
+
+		if (rq_i->core_forceidle) {
+			need_sync = true;
+			rq_i->core_forceidle = false;
+		}
+
+		if (i != cpu)
+			update_rq_clock(rq_i);
+	}
+
+	/*
+	 * Try and select tasks for each sibling in descending sched_class
+	 * order.
+	 */
+	for_each_class(class) {
+again:
+		for_each_cpu_wrap(i, smt_mask, cpu) {
+			struct rq *rq_i = cpu_rq(i);
+			struct task_struct *p;
+
+			if (rq_i->core_pick)
+				continue;
+
+			/*
+			 * If this sibling doesn't yet have a suitable task to
+			 * run, ask for the most eligible task, given the
+			 * highest priority task already selected for this
+			 * core.
+			 */
+			p = pick_task(rq_i, class, max);
+			if (!p) {
+				/*
+				 * If there weren't any cookies, we don't need
+				 * to bother with the other siblings.
+				 */
+				if (i == cpu && !need_sync)
+					goto next_class;
+
+				continue;
+			}
+
+			/*
+			 * Optimize the 'normal' case where there aren't any
+			 * cookies and we don't need to sync up.
+			 */
+			if (i == cpu && !need_sync && !p->core_cookie) {
+				next = p;
+				goto done;
+			}
+
+			rq_i->core_pick = p;
+
+			/*
+			 * If this new candidate is of higher priority than the
+			 * previous; and they're incompatible; we need to wipe
+			 * the slate and start over. pick_task makes sure that
+			 * p's priority is more than max if it doesn't match
+			 * max's cookie.
+			 *
+			 * NOTE: this is a linear max-filter and is thus bounded
+			 * in execution time.
+			 */
+			if (!max || !cookie_match(max, p)) {
+				struct task_struct *old_max = max;
+
+				rq->core->core_cookie = p->core_cookie;
+				max = p;
+
+				if (old_max) {
+					for_each_cpu(j, smt_mask) {
+						if (j == i)
+							continue;
+
+						cpu_rq(j)->core_pick = NULL;
+					}
+					goto again;
+				} else {
+					/*
+					 * Once we select a task for a cpu, we
+					 * should not be doing an unconstrained
+					 * pick because it might starve a task
+					 * on a forced idle cpu.
+					 */
+					need_sync = true;
+				}
+
+			}
+		}
+next_class:;
+	}
+
+	rq->core->core_pick_seq = rq->core->core_task_seq;
+	next = rq->core_pick;
+	rq->core_sched_seq = rq->core->core_pick_seq;
+
+	/*
+	 * Reschedule siblings
+	 *
+	 * NOTE: L1TF -- at this point we're no longer running the old task and
+	 * sending an IPI (below) ensures the sibling will no longer be running
+	 * their task. This ensures there is no inter-sibling overlap between
+	 * non-matching user state.
+	 */
+	for_each_cpu(i, smt_mask) {
+		struct rq *rq_i = cpu_rq(i);
+
+		WARN_ON_ONCE(!rq_i->core_pick);
+
+		if (is_idle_task(rq_i->core_pick) && rq_i->nr_running)
+			rq_i->core_forceidle = true;
+
+		if (i == cpu)
+			continue;
+
+		if (rq_i->curr != rq_i->core_pick)
+			resched_curr(rq_i);
+
+		/* Did we break L1TF mitigation requirements? */
+		WARN_ON_ONCE(!cookie_match(next, rq_i->core_pick));
+	}
+
+done:
+	set_next_task(rq, next);
+	return next;
+}
+
+#else /* !CONFIG_SCHED_CORE */
+
+static struct task_struct *
+pick_next_task(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
+{
+	return __pick_next_task(rq, prev, rf);
+}
+
+#endif /* CONFIG_SCHED_CORE */
+
 /*
  * __schedule() is the main scheduler function.
  *
@@ -6931,7 +7187,12 @@ void __init sched_init(void)
 
 #ifdef CONFIG_SCHED_CORE
 		rq->core = NULL;
+		rq->core_pick = NULL;
 		rq->core_enabled = 0;
+		rq->core_tree = RB_ROOT;
+		rq->core_forceidle = false;
+
+		rq->core_cookie = 0UL;
 #endif
 	}
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c85c5a4bc21f..4a738093d731 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1032,11 +1032,16 @@ struct rq {
 #ifdef CONFIG_SCHED_CORE
 	/* per rq */
 	struct rq		*core;
+	struct task_struct	*core_pick;
 	unsigned int		core_enabled;
+	unsigned int		core_sched_seq;
 	struct rb_root		core_tree;
+	unsigned char		core_forceidle;
 
 	/* shared state */
 	unsigned int		core_task_seq;
+	unsigned int		core_pick_seq;
+	unsigned long		core_cookie;
 #endif
 };
 
@@ -1905,7 +1910,6 @@ static inline void put_prev_task(struct rq *rq, struct task_struct *prev)
 
 static inline void set_next_task(struct rq *rq, struct task_struct *next)
 {
-	WARN_ON_ONCE(rq->curr != next);
 	next->sched_class->set_next_task(rq, next, false);
 }
-- 
2.17.1
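
P.S. for readers following the logic rather than the diff: the heart of pick_task() is a three-way choice between the class pick, the cookie pick, and forced idle. The stand-alone C sketch below models just that comparison structure; the toy_* names, plain integer priorities, and a total-order toy_prio_less() are my simplifying assumptions, not kernel code (the real prio_less() compares across sched classes, and the real function must query each class and the core's rb-tree via sched_core_find()).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy model of a task: higher prio wins; prio 0 stands for idle;
 * cookie 0 means untagged (no core scheduling group). */
struct toy_task {
	int prio;
	unsigned long cookie;
};

static const struct toy_task toy_idle = { 0, 0 };

/* Is a lower priority than b? (stand-in for the kernel's prio_less) */
bool toy_prio_less(const struct toy_task *a, const struct toy_task *b)
{
	return a->prio < b->prio;
}

/* Idle is compatible with any cookie. */
bool toy_cookie_equals(const struct toy_task *a, unsigned long cookie)
{
	return a->prio == 0 || a->cookie == cookie;
}

/*
 * Mirrors pick_task()'s decision: with no core cookie, an untagged class
 * pick wins, but a tagged one must beat max or we force idle; with a
 * cookie, a matching class pick wins early, and otherwise the class pick
 * only beats the cookie pick if it outranks both it and max.
 */
struct toy_task toy_pick(struct toy_task class_pick,
			 struct toy_task cookie_pick,
			 const struct toy_task *max,
			 unsigned long core_cookie)
{
	if (!core_cookie) {
		if (max && class_pick.cookie &&
		    toy_prio_less(&class_pick, max))
			return toy_idle;
		return class_pick;
	}

	if (toy_cookie_equals(&class_pick, core_cookie))
		return class_pick;

	if (toy_prio_less(&cookie_pick, &class_pick) &&
	    (!max || toy_prio_less(max, &class_pick)))
		return class_pick;

	return cookie_pick;
}
```

The sketch keeps only the constraint satisfaction; everything stateful (runqueues, per-class iteration, the max-filter restart) stays in the patch above.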
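
Similarly, the three sequence counters gating the cached-pick fast path are easy to misread in diff form. This toy model (my own simplified types, not the kernel's struct rq) states the invariant directly: reuse rq->core_pick only if no enqueue/dequeue happened since the core-wide selection and this CPU has not yet scheduled that selection.

```c
#include <stdbool.h>

/* Toy model of the shared (per-core) counters:
 *  - task_seq bumps on every enqueue/dequeue (and, per the patch, on
 *    every pick, to cope with preemption-driven repicks);
 *  - pick_seq records the task_seq a core-wide selection was made for. */
struct toy_core {
	unsigned int task_seq;
	unsigned int pick_seq;
};

/* Per-CPU view: sched_seq is the pick_seq this CPU last acted on. */
struct toy_rq {
	struct toy_core *core;
	unsigned int sched_seq;
};

/* The fast-path test from pick_next_task(): the cached pick is usable
 * iff the task set is unchanged since the selection was made AND this
 * CPU has not already scheduled that selection. */
bool toy_can_use_cached_pick(const struct toy_rq *rq)
{
	return rq->core->pick_seq == rq->core->task_seq &&
	       rq->core->pick_seq != rq->sched_seq;
}
```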