Date: Fri, 20 Mar 2020 12:58:05 -0000
From: "tip-bot2 for Johannes Weiner"
Reply-To: linux-kernel@vger.kernel.org
To: linux-tip-commits@vger.kernel.org
Subject: [tip: sched/core] psi: Optimize switching tasks inside shared cgroups
Cc: Peter Zijlstra, Johannes Weiner, x86, LKML
In-Reply-To: <20200316191333.115523-3-hannes@cmpxchg.org>
References: <20200316191333.115523-3-hannes@cmpxchg.org>
Message-ID: <158470908538.28353.4035583045093717977.tip-bot2@tip-bot2>
X-Mailer: tip-git-log-daemon

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     36b238d5717279163859fb6ba0f4360abcafab83
Gitweb:        https://git.kernel.org/tip/36b238d5717279163859fb6ba0f4360abcafab83
Author:        Johannes Weiner
AuthorDate:    Mon, 16 Mar 2020 15:13:32 -04:00
Committer:     Peter Zijlstra
CommitterDate: Fri, 20 Mar 2020 13:06:19 +01:00

psi: Optimize switching tasks inside shared cgroups

When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra
Signed-off-by: Johannes Weiner
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org
---
 include/linux/psi.h  |  2 ++
 kernel/sched/psi.c   | 87 +++++++++++++++++++++++++++++++++----------
 kernel/sched/stats.h |  9 +----
 3 files changed, 70 insertions(+), 28 deletions(-)

diff --git a/include/linux/psi.h b/include/linux/psi.h
index 7b3de73..7361023 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -17,6 +17,8 @@ extern struct psi_group psi_system;
 void psi_init(void);
 
 void psi_task_change(struct task_struct *task, int clear, int set);
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+                     bool sleep);
 
 void psi_memstall_tick(struct task_struct *task, int cpu);
 void psi_memstall_enter(unsigned long *flags);
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 5012829..955a124 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -669,13 +669,14 @@ static void record_times(struct psi_group_cpu *groupc, int cpu,
         groupc->times[PSI_NONIDLE] += delta;
 }
 
-static u32 psi_group_change(struct psi_group *group, int cpu,
-                            unsigned int clear, unsigned int set)
+static void psi_group_change(struct psi_group *group, int cpu,
+                             unsigned int clear, unsigned int set,
+                             bool wake_clock)
 {
         struct psi_group_cpu *groupc;
+        u32 state_mask = 0;
         unsigned int t, m;
         enum psi_states s;
-        u32 state_mask = 0;
 
         groupc = per_cpu_ptr(group->pcpu, cpu);
 
@@ -717,7 +718,11 @@ static u32 psi_group_change(struct psi_group *group, int cpu,
 
         write_seqcount_end(&groupc->seq);
 
-        return state_mask;
+        if (state_mask & group->poll_states)
+                psi_schedule_poll_work(group, 1);
+
+        if (wake_clock && !delayed_work_pending(&group->avgs_work))
+                schedule_delayed_work(&group->avgs_work, PSI_FREQ);
 }
 
 static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
@@ -744,27 +749,32 @@ static struct psi_group *iterate_groups(struct task_struct *task, void **iter)
         return &psi_system;
 }
 
-void psi_task_change(struct task_struct *task, int clear, int set)
+static void psi_flags_change(struct task_struct *task, int clear, int set)
 {
-        int cpu = task_cpu(task);
-        struct psi_group *group;
-        bool wake_clock = true;
-        void *iter = NULL;
-
-        if (!task->pid)
-                return;
-
         if (((task->psi_flags & set) ||
              (task->psi_flags & clear) != clear) &&
             !psi_bug) {
                 printk_deferred(KERN_ERR "psi: inconsistent task state! task=%d:%s cpu=%d psi_flags=%x clear=%x set=%x\n",
-                                task->pid, task->comm, cpu,
+                                task->pid, task->comm, task_cpu(task),
                                 task->psi_flags, clear, set);
                 psi_bug = 1;
         }
 
         task->psi_flags &= ~clear;
         task->psi_flags |= set;
+}
+
+void psi_task_change(struct task_struct *task, int clear, int set)
+{
+        int cpu = task_cpu(task);
+        struct psi_group *group;
+        bool wake_clock = true;
+        void *iter = NULL;
+
+        if (!task->pid)
+                return;
+
+        psi_flags_change(task, clear, set);
 
         /*
          * Periodic aggregation shuts off if there is a period of no
@@ -777,14 +787,51 @@ void psi_task_change(struct task_struct *task, int clear, int set)
              wq_worker_last_func(task) == psi_avgs_work))
                 wake_clock = false;
 
-        while ((group = iterate_groups(task, &iter))) {
-                u32 state_mask = psi_group_change(group, cpu, clear, set);
+        while ((group = iterate_groups(task, &iter)))
+                psi_group_change(group, cpu, clear, set, wake_clock);
+}
+
+void psi_task_switch(struct task_struct *prev, struct task_struct *next,
+                     bool sleep)
+{
+        struct psi_group *group, *common = NULL;
+        int cpu = task_cpu(prev);
+        void *iter;
+
+        if (next->pid) {
+                psi_flags_change(next, 0, TSK_ONCPU);
+                /*
+                 * When moving state between tasks, the group that
+                 * contains them both does not change: we can stop
+                 * updating the tree once we reach the first common
+                 * ancestor. Iterate @next's ancestors until we
+                 * encounter @prev's state.
+                 */
+                iter = NULL;
+                while ((group = iterate_groups(next, &iter))) {
+                        if (per_cpu_ptr(group->pcpu, cpu)->tasks[NR_ONCPU]) {
+                                common = group;
+                                break;
+                        }
+
+                        psi_group_change(group, cpu, 0, TSK_ONCPU, true);
+                }
+        }
+
+        /*
+         * If this is a voluntary sleep, dequeue will have taken care
+         * of the outgoing TSK_ONCPU alongside TSK_RUNNING already. We
+         * only need to deal with it during preemption.
+         */
+        if (sleep)
+                return;
 
-                if (state_mask & group->poll_states)
-                        psi_schedule_poll_work(group, 1);
+        if (prev->pid) {
+                psi_flags_change(prev, TSK_ONCPU, 0);
 
-                if (wake_clock && !delayed_work_pending(&group->avgs_work))
-                        schedule_delayed_work(&group->avgs_work, PSI_FREQ);
+                iter = NULL;
+                while ((group = iterate_groups(prev, &iter)) && group != common)
+                        psi_group_change(group, cpu, TSK_ONCPU, 0, true);
         }
 }
 
diff --git a/kernel/sched/stats.h b/kernel/sched/stats.h
index 6ff0ac1..1339f5b 100644
--- a/kernel/sched/stats.h
+++ b/kernel/sched/stats.h
@@ -141,14 +141,7 @@ static inline void psi_sched_switch(struct task_struct *prev,
         if (static_branch_likely(&psi_disabled))
                 return;
 
-        /*
-         * Clear the TSK_ONCPU state if the task was preempted. If
-         * it's a voluntary sleep, dequeue will have taken care of it.
-         */
-        if (!sleep)
-                psi_task_change(prev, TSK_ONCPU, 0);
-
-        psi_task_change(next, 0, TSK_ONCPU);
+        psi_task_switch(prev, next, sleep);
 }
 
 static inline void psi_task_tick(struct rq *rq)
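
As a rough user-space illustration of the ancestor walk described in the changelog, here is a minimal sketch of the same idea. It is a simplified model, not kernel code: toy_group, toy_task, toy_group_change and toy_task_switch are made-up names standing in for psi_group, task_struct, psi_group_change() and psi_task_switch().

/* toy_psi_switch.c - standalone model of the "stop at the first common
 * ancestor" walk. Build with: cc -Wall -o toy_psi_switch toy_psi_switch.c
 * All names are illustrative only.
 */
#include <stddef.h>
#include <stdio.h>

struct toy_group {
        const char *name;
        struct toy_group *parent;       /* NULL for the root group */
        int nr_oncpu;                   /* like tasks[NR_ONCPU] for one CPU */
        int nr_updates;                 /* how many state updates ran */
};

struct toy_task {
        const char *comm;
        struct toy_group *cgroup;       /* leaf group the task belongs to */
};

/* Stand-in for psi_group_change(): adjust ONCPU state and count the work. */
static void toy_group_change(struct toy_group *g, int clear_oncpu, int set_oncpu)
{
        g->nr_oncpu += set_oncpu - clear_oncpu;
        g->nr_updates++;
}

/*
 * Stand-in for psi_task_switch(): set ONCPU for @next while walking up its
 * ancestry and stop at the first group that already has a task on this CPU;
 * that group and everything above it also contain @prev, so their state does
 * not change. On preemption (!sleep), clear @prev's ONCPU below that point.
 */
static void toy_task_switch(struct toy_task *prev, struct toy_task *next, int sleep)
{
        struct toy_group *g, *common = NULL;

        for (g = next->cgroup; g; g = g->parent) {
                if (g->nr_oncpu) {              /* first shared ancestor */
                        common = g;
                        break;
                }
                toy_group_change(g, 0, 1);
        }

        /* A voluntary sleep already dropped @prev's ONCPU state at dequeue. */
        if (sleep)
                return;

        for (g = prev->cgroup; g && g != common; g = g->parent)
                toy_group_change(g, 1, 0);
}

int main(void)
{
        /* prev runs in workload/a, next sits in workload/b, same parent. */
        struct toy_group root   = { "psi_system", NULL,    1, 0 };
        struct toy_group parent = { "workload",   &root,   1, 0 };
        struct toy_group leaf_a = { "workload/a", &parent, 1, 0 };
        struct toy_group leaf_b = { "workload/b", &parent, 0, 0 };
        struct toy_task prev = { "prev", &leaf_a };
        struct toy_task next = { "next", &leaf_b };
        struct toy_group *all[] = { &leaf_a, &leaf_b, &parent, &root };
        size_t i;

        toy_task_switch(&prev, &next, 0);       /* preemption, not a sleep */

        for (i = 0; i < sizeof(all) / sizeof(all[0]); i++)
                printf("%-10s oncpu=%d updates=%d\n",
                       all[i]->name, all[i]->nr_oncpu, all[i]->nr_updates);
        return 0;
}

In this model a preemption between two leaves of the same parent touches only the two leaf groups; the shared parent and the root see no ONCPU change and are skipped, which is the per-switch work the patch avoids in nested hierarchies.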