From: Johannes Weiner
To: Ingo Molnar, Peter Zijlstra, Andrew Morton, Linus Torvalds
Cc: Tejun Heo, Suren Baghdasaryan, Daniel Drake, Vinayak Menon,
    Christopher Lameter, Mike Galbraith, Shakeel Butt, Peter Enderborg,
    linux-mm@kvack.org, cgroups@vger.kernel.org,
    linux-kernel@vger.kernel.org, kernel-team@fb.com
Subject: [PATCH 9/9] psi: cgroup support
Date: Wed, 1 Aug 2018 11:19:58 -0400
Message-Id: <20180801151958.32590-10-hannes@cmpxchg.org>
In-Reply-To: <20180801151958.32590-1-hannes@cmpxchg.org>
References: <20180801151958.32590-1-hannes@cmpxchg.org>

On a system that executes multiple cgrouped jobs and independent
workloads, we don't just care about the health of the overall system,
but also that of individual jobs, so that we can ensure individual job
health, fairness between jobs, or prioritize some jobs over others.

This patch implements pressure stall tracking for cgroups.
In kernels with CONFIG_PSI=y, cgroup2 groups will have cpu.pressure,
memory.pressure, and io.pressure files that track aggregate pressure
stall times for only the tasks inside the cgroup.

v3:
- fix copy-paste indentation screwups

Acked-by: Tejun Heo
Signed-off-by: Johannes Weiner
---
 Documentation/accounting/psi.txt |  9 ++++
 Documentation/cgroup-v2.txt      | 18 +++++++
 include/linux/cgroup-defs.h      |  4 ++
 include/linux/cgroup.h           | 15 ++++++
 include/linux/psi.h              | 25 ++++++++++
 init/Kconfig                     |  4 ++
 kernel/cgroup/cgroup.c           | 45 +++++++++++++++++-
 kernel/sched/psi.c               | 81 +++++++++++++++++++++++++++++++-
 8 files changed, 197 insertions(+), 4 deletions(-)

diff --git a/Documentation/accounting/psi.txt b/Documentation/accounting/psi.txt
index 51e7ef14142e..e051810d5127 100644
--- a/Documentation/accounting/psi.txt
+++ b/Documentation/accounting/psi.txt
@@ -62,3 +62,12 @@ well as medium and long term trends. The total absolute stall time is
 tracked and exported as well, to allow detection of latency spikes
 which wouldn't necessarily make a dent in the time averages, or to
 average trends over custom time frames.
+
+Cgroup2 interface
+=================
+
+In a system with a CONFIG_CGROUPS=y kernel and the cgroup2 filesystem
+mounted, pressure stall information is also tracked for tasks grouped
+into cgroups. Each subdirectory in the cgroupfs mountpoint contains
+cpu.pressure, memory.pressure, and io.pressure files; the format is
+the same as the /proc/pressure/ files.
diff --git a/Documentation/cgroup-v2.txt b/Documentation/cgroup-v2.txt
index 74cdeaed9f7a..a22879dba019 100644
--- a/Documentation/cgroup-v2.txt
+++ b/Documentation/cgroup-v2.txt
@@ -963,6 +963,12 @@ All time durations are in microseconds.
 	$PERIOD duration. "max" for $MAX indicates no limit. If only
 	one number is written, $MAX is updated.

+  cpu.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for CPU. See
+	Documentation/accounting/psi.txt for details.
+
 Memory
 ------

@@ -1199,6 +1205,12 @@ PAGE_SIZE multiple when read back.
 	Swap usage hard limit. If a cgroup's swap usage reaches this
 	limit, anonymous memory of the cgroup will not be swapped out.

+  memory.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for memory. See
+	Documentation/accounting/psi.txt for details.
+

 Usage Guidelines
 ~~~~~~~~~~~~~~~~
@@ -1334,6 +1346,12 @@ IO Interface Files

 	  8:16 rbps=2097152 wbps=max riops=max wiops=max

+  io.pressure
+	A read-only nested-key file which exists on non-root cgroups.
+
+	Shows pressure stall information for IO. See
+	Documentation/accounting/psi.txt for details.
+
 Writeback
 ~~~~~~~~~

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index dc5b70449dc6..280f18da956a 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -20,6 +20,7 @@
 #include
 #include
 #include
+#include

 #ifdef CONFIG_CGROUPS

@@ -424,6 +425,9 @@ struct cgroup {
 	/* used to schedule release agent */
 	struct work_struct release_agent_work;

+	/* used to track pressure stalls */
+	struct psi_group psi;
+
 	/* used to store eBPF programs */
 	struct cgroup_bpf bpf;

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 473e0c0abb86..fd94c294c207 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -627,6 +627,11 @@ static inline void pr_cont_cgroup_path(struct cgroup *cgrp)
 	pr_cont_kernfs_path(cgrp->kn);
 }

+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return &cgrp->psi;
+}
+
 static inline void cgroup_init_kthreadd(void)
 {
 	/*
@@ -680,6 +685,16 @@ static inline union kernfs_node_id *cgroup_get_kernfs_id(struct cgroup *cgrp)
 	return NULL;
 }

+static inline struct cgroup *cgroup_parent(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
+static inline struct psi_group *cgroup_psi(struct cgroup *cgrp)
+{
+	return NULL;
+}
+
 static inline bool task_under_cgroup_hierarchy(struct task_struct *task,
 					       struct cgroup *ancestor)
 {
diff --git a/include/linux/psi.h b/include/linux/psi.h
index 371af1479699..05c3dae3e9c5 100644
--- a/include/linux/psi.h
+++ b/include/linux/psi.h
@@ -4,6 +4,9 @@
 #include
 #include

+struct seq_file;
+struct css_set;
+
 #ifdef CONFIG_PSI

 extern bool psi_disabled;
@@ -15,6 +18,14 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set);
 void psi_memstall_enter(unsigned long *flags);
 void psi_memstall_leave(unsigned long *flags);

+int psi_show(struct seq_file *s, struct psi_group *group, enum psi_res res);
+
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgrp);
+void psi_cgroup_free(struct cgroup *cgrp);
+void cgroup_move_task(struct task_struct *p, struct css_set *to);
+#endif
+
 #else /* CONFIG_PSI */

 static inline void psi_init(void) {}
@@ -22,6 +33,20 @@ static inline void psi_init(void) {}
 static inline void psi_memstall_enter(unsigned long *flags) {}
 static inline void psi_memstall_leave(unsigned long *flags) {}

+#ifdef CONFIG_CGROUPS
+static inline int psi_cgroup_alloc(struct cgroup *cgrp)
+{
+	return 0;
+}
+static inline void psi_cgroup_free(struct cgroup *cgrp)
+{
+}
+static inline void cgroup_move_task(struct task_struct *p, struct css_set *to)
+{
+	rcu_assign_pointer(p->cgroups, to);
+}
+#endif
+
 #endif /* CONFIG_PSI */

 #endif /* _LINUX_PSI_H */
diff --git a/init/Kconfig b/init/Kconfig
index ad61ddb5d68e..5c029f8d69f1 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -468,6 +468,10 @@ config PSI
 	  the share of walltime in which some or all tasks in the system are
 	  delayed due to contention of the respective resource.

+	  In kernels with cgroup support (cgroup2 only), cgroups will
+	  have cpu.pressure, memory.pressure, and io.pressure files,
+	  which aggregate pressure stalls for the grouped tasks only.
+
 	  For more details see Documentation/accounting/psi.txt.

 	  Say N if unsure.
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a662bfcbea0e..bbb00b3ab752 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -54,6 +54,7 @@
 #include
 #include
 #include
+#include
 #include

 #define CREATE_TRACE_POINTS
@@ -826,7 +827,7 @@ static void css_set_move_task(struct task_struct *task,
 		 */
 		WARN_ON_ONCE(task->flags & PF_EXITING);

-		rcu_assign_pointer(task->cgroups, to_cset);
+		cgroup_move_task(task, to_cset);
 		list_add_tail(&task->cg_list, use_mg_tasks ? &to_cset->mg_tasks :
 							     &to_cset->tasks);
 	}
@@ -3388,6 +3389,21 @@ static int cpu_stat_show(struct seq_file *seq, void *v)
 	return ret;
 }

+#ifdef CONFIG_PSI
+static int cgroup_io_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_IO);
+}
+static int cgroup_memory_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_MEM);
+}
+static int cgroup_cpu_pressure_show(struct seq_file *seq, void *v)
+{
+	return psi_show(seq, &seq_css(seq)->cgroup->psi, PSI_CPU);
+}
+#endif
+
 static int cgroup_file_open(struct kernfs_open_file *of)
 {
 	struct cftype *cft = of->kn->priv;
@@ -4499,6 +4515,23 @@ static struct cftype cgroup_base_files[] = {
 		.flags = CFTYPE_NOT_ON_ROOT,
 		.seq_show = cpu_stat_show,
 	},
+#ifdef CONFIG_PSI
+	{
+		.name = "io.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_io_pressure_show,
+	},
+	{
+		.name = "memory.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_memory_pressure_show,
+	},
+	{
+		.name = "cpu.pressure",
+		.flags = CFTYPE_NOT_ON_ROOT,
+		.seq_show = cgroup_cpu_pressure_show,
+	},
+#endif
 	{ }	/* terminate */
 };

@@ -4559,6 +4592,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 			 */
 			cgroup_put(cgroup_parent(cgrp));
 			kernfs_put(cgrp->kn);
+			psi_cgroup_free(cgrp);
 			if (cgroup_on_dfl(cgrp))
 				cgroup_stat_exit(cgrp);
 			kfree(cgrp);
@@ -4805,10 +4839,15 @@ static struct cgroup *cgroup_create(struct cgroup *parent)
 	cgrp->self.parent = &parent->self;
 	cgrp->root = root;
 	cgrp->level = level;
-	ret = cgroup_bpf_inherit(cgrp);
+
+	ret = psi_cgroup_alloc(cgrp);
 	if (ret)
 		goto out_idr_free;

+	ret = cgroup_bpf_inherit(cgrp);
+	if (ret)
+		goto out_psi_free;
+
 	for (tcgrp = cgrp; tcgrp; tcgrp = cgroup_parent(tcgrp)) {
 		cgrp->ancestor_ids[tcgrp->level] = tcgrp->id;

@@ -4846,6 +4885,8 @@ static struct cgroup *cgroup_create(struct cgroup *parent)

 	return cgrp;

+out_psi_free:
+	psi_cgroup_free(cgrp);
 out_idr_free:
 	cgroup_idr_remove(&root->cgroup_idr, cgrp->id);
 out_stat_exit:
diff --git a/kernel/sched/psi.c b/kernel/sched/psi.c
index 57ec86592b5a..a20f885da66f 100644
--- a/kernel/sched/psi.c
+++ b/kernel/sched/psi.c
@@ -464,6 +464,9 @@ static void psi_group_change(struct psi_group *group, int cpu, u64 now,

 void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
 {
+#ifdef CONFIG_CGROUPS
+	struct cgroup *cgroup, *parent;
+#endif
 	int cpu = task_cpu(task);

 	if (psi_disabled)
@@ -485,6 +488,18 @@ void psi_task_change(struct task_struct *task, u64 now, int clear, int set)
 	task->psi_flags |= set;

 	psi_group_change(&psi_system, cpu, now, clear, set);
+
+#ifdef CONFIG_CGROUPS
+	cgroup = task->cgroups->dfl_cgrp;
+	while (cgroup && (parent = cgroup_parent(cgroup))) {
+		struct psi_group *group;
+
+		group = cgroup_psi(cgroup);
+		psi_group_change(group, cpu, now, clear, set);
+
+		cgroup = parent;
+	}
+#endif
 }

 /**
@@ -551,8 +566,70 @@ void psi_memstall_leave(unsigned long *flags)
 	rq_unlock_irq(rq, &rf);
 }

-static int psi_show(struct seq_file *m, struct psi_group *group,
-		    enum psi_res res)
+#ifdef CONFIG_CGROUPS
+int psi_cgroup_alloc(struct cgroup *cgroup)
+{
+	cgroup->psi.pcpu = alloc_percpu(struct psi_group_cpu);
+	if (!cgroup->psi.pcpu)
+		return -ENOMEM;
+	psi_group_init(&cgroup->psi);
+	return 0;
+}
+
+void psi_cgroup_free(struct cgroup *cgroup)
+{
+	cancel_delayed_work_sync(&cgroup->psi.clock_work);
+	free_percpu(cgroup->psi.pcpu);
+}
+
+/**
+ * cgroup_move_task - move task to a different cgroup
+ * @task: the task
+ * @to: the target css_set
+ *
+ * Move task to a new cgroup and safely migrate its associated stall
+ * state between the different groups.
+ *
+ * This function acquires the task's rq lock to lock out concurrent
+ * changes to the task's scheduling state and - in case the task is
+ * running - concurrent changes to its stall state.
+ */
+void cgroup_move_task(struct task_struct *task, struct css_set *to)
+{
+	unsigned int task_flags = 0;
+	struct rq_flags rf;
+	struct rq *rq;
+	u64 now;
+
+	rq = task_rq_lock(task, &rf);
+
+	if (task_on_rq_queued(task))
+		task_flags = TSK_RUNNING;
+	else if (task->in_iowait)
+		task_flags = TSK_IOWAIT;
+	if (task->flags & PF_MEMSTALL)
+		task_flags |= TSK_MEMSTALL;
+
+	if (task_flags) {
+		update_rq_clock(rq);
+		now = rq_clock(rq);
+		psi_task_change(task, now, task_flags, 0);
+	}
+
+	/*
+	 * Lame to do this here, but the scheduler cannot be locked
+	 * from the outside, so we move cgroups from inside sched/.
+	 */
+	rcu_assign_pointer(task->cgroups, to);
+
+	if (task_flags)
+		psi_task_change(task, now, 0, task_flags);
+
+	task_rq_unlock(rq, task, &rf);
+}
+#endif /* CONFIG_CGROUPS */
+
+int psi_show(struct seq_file *m, struct psi_group *group, enum psi_res res)
 {
 	int full;

-- 
2.18.0
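
Not part of the patch: a minimal user-space sketch of the two ideas in
the kernel/sched/psi.c hunks, in case it helps review. It models (a) the
walk in psi_task_change() that updates the task's cgroup and every
non-root ancestor, and (b) the clear / switch / set sequence in
cgroup_move_task(). All toy_* names are hypothetical stand-ins, a single
counter replaces the per-CPU PSI state, and locking is left out entirely.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct cgroup plus its embedded psi_group. */
struct toy_group {
	struct toy_group *parent;	/* NULL for the root group */
	int nr_running;			/* stand-in for the per-group task counts */
};

/* Hypothetical stand-in for struct task_struct. */
struct toy_task {
	struct toy_group *group;	/* group the task currently belongs to */
	int running;			/* nonzero if the task would carry TSK_RUNNING */
};

/*
 * Apply a delta to the task's group and to each ancestor that has a
 * parent. Same loop shape as in psi_task_change(): the root group is
 * never touched here (the patch accounts the system as a whole
 * separately, through psi_system).
 */
void toy_task_change(struct toy_task *t, int delta)
{
	struct toy_group *g = t->group, *parent;

	while (g && (parent = g->parent)) {
		g->nr_running += delta;
		assert(g->nr_running >= 0);	/* counts must never underflow */
		g = parent;
	}
}

/*
 * Move a task: drop its state from the old hierarchy, switch the
 * pointer, then re-add the same state to the new hierarchy. The kernel
 * version does this under the task's rq lock; this sketch has no locking.
 */
void toy_move_task(struct toy_task *t, struct toy_group *to)
{
	if (t->running)
		toy_task_change(t, -1);
	t->group = to;
	if (t->running)
		toy_task_change(t, +1);
}
```

With a hierarchy root -> a -> b and a sibling root -> c, a running task
in b is counted in b and a but not in root; moving it to c leaves b and
a at zero and c at one.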