2012-11-20 08:33:20

by Glauber Costa

Subject: [PATCH 0/6] Automatic NUMA placement of tasks in cpu cgroup

Hi,

This patchset has absolutely nothing to do with NUMA. But now that I got your
attention:

This is an attempt to resurrect a patchset that Tejun Heo sent a while ago,
aiming at deprecating cpuacct. He only went as far as publishing the files
in the cpu cgroup, but the final work would require us to take advantage
of it by not incurring hierarchy walks more often than necessary.

It is trivial to do this when SCHEDSTATS is enabled: we already record a
statistic that is exactly the same as cpuusage: exec_clock. It is not
collected for rt tasks, so the only thing we need to do is also collect
it for them and print it back as cpuusage.
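
To make the idea concrete, here is a minimal sketch of the read side,
assuming both CONFIG_FAIR_GROUP_SCHED and CONFIG_RT_GROUP_SCHED (the
helper name is made up; the last two patches have the real code,
including the cpuusage reset handling via prev_exec_clock):

static u64 group_cpuusage_sketch(struct task_group *tg, int cpu)
{
	/* exec_clock is already maintained hierarchically by the scheduler */
	return tg->cfs_rq[cpu]->exec_clock + tg->rt_rq[cpu]->exec_clock;
}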

In theory, it would also be possible to avoid hierarchy walks even without
SCHEDSTATS: we could modify task_group_charge() to stop walking, and then
update cpuusage only for the current group during the walk we already do.
I didn't do that because I believe we care more about setups that would enable
a bunch of options anyway - which is likely to include SCHEDSTATS. Custom setups
can take a much easier route and just compile out the whole thing! But let me
know if I should do it.
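
For reference, a purely hypothetical sketch of that alternative, reusing the
task_group_charge() helper introduced later in this series (this is not part
of the patchset):

void task_group_charge(struct task_struct *tsk, u64 cputime)
{
	struct task_group *tg;

	rcu_read_lock();
	tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
			  struct task_group, css);
	/* charge only this group; parents are covered by the walk we already do */
	if (tg)
		*per_cpu_ptr(tg->cpuusage, task_cpu(tsk)) += cputime;
	rcu_read_unlock();
}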


Glauber Costa (3):
don't call cpuacct_charge in stop_task.c
sched: adjust exec_clock to use it as cpu usage metric
cpuacct: don't actually do anything.

Tejun Heo (3):
cgroup: implement CFTYPE_NO_PREFIX
cgroup, sched: let cpu serve the same files as cpuacct
cgroup, sched: deprecate cpuacct

include/linux/cgroup.h | 1 +
init/Kconfig | 11 +-
kernel/cgroup.c | 57 ++++++++++-
kernel/sched/core.c | 262 ++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 2 +
kernel/sched/sched.h | 14 ++-
kernel/sched/stop_task.c | 1 -
8 files changed, 341 insertions(+), 8 deletions(-)

--
1.7.11.7


2012-11-20 08:32:43

by Glauber Costa

Subject: [PATCH 1/6] don't call cpuacct_charge in stop_task.c

Commit 8f618968 changed stop_task to do the same bookkeeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Thomas Gleixner <[email protected]>
---
kernel/sched/stop_task.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index da5eb5b..fda1cbe 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
account_group_exec_runtime(curr, delta_exec);

curr->se.exec_start = rq->clock_task;
- cpuacct_charge(curr, delta_exec);
}

static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
--
1.7.11.7

2012-11-20 08:32:47

by Glauber Costa

Subject: [PATCH 2/6] cgroup: implement CFTYPE_NO_PREFIX

From: Tejun Heo <[email protected]>

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as a prefix. This patch adds CFTYPE_NO_PREFIX,
which disables the automatic prefix.

This will be used to deprecate cpuacct by having cpu create and
serve the cpuacct files.
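
For illustration, the patch that moves the cpuacct.* files into the cpu
controller uses the flag roughly like this, so the file shows up as
"cpuacct.usage" rather than "cpu.cpuacct.usage":

static struct cftype cpu_files[] = {
	{
		.name = "cpuacct.usage",
		.flags = CFTYPE_NO_PREFIX,	/* don't prepend "cpu." */
		.read_u64 = cpucg_cpuusage_read,
		.write_u64 = cpucg_cpuusage_write,
	},
	{ }	/* terminate */
};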

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Glauber Costa <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index a178a91..018468b 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -272,6 +272,7 @@ struct cgroup_map_cb {
/* cftype->flags */
#define CFTYPE_ONLY_ON_ROOT (1U << 0) /* only create on root cg */
#define CFTYPE_NOT_ON_ROOT (1U << 1) /* don't create onp root cg */
+#define CFTYPE_NO_PREFIX (1U << 2) /* skip subsys prefix */

#define MAX_CFTYPE_NAME 64

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 3d68aad..4081fee 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2629,7 +2629,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
if ((cft->flags & CFTYPE_ONLY_ON_ROOT) && cgrp->parent)
return 0;

- if (subsys && !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
+ if (subsys && !(cft->flags & CFTYPE_NO_PREFIX) &&
+ !test_bit(ROOT_NOPREFIX, &cgrp->root->flags)) {
strcpy(name, subsys->name);
strcat(name, ".");
}
--
1.7.11.7

2012-11-20 08:32:50

by Glauber Costa

Subject: [PATCH 5/6] sched: adjust exec_clock to use it as cpu usage metric

exec_clock already provides a per-group cpu usage metric, and can be
reused by cpuacct when cpu and cpuacct are co-mounted.

However, it is only collected for tasks in the fair class. Doing the same
for rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.

Signed-off-by: Glauber Costa <[email protected]>
CC: Dave Jones <[email protected]>
CC: Ben Hutchings <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Lennart Poettering <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Tejun Heo <[email protected]>
---
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 3 +++
2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 0c70807..68e9daf 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -945,6 +945,7 @@ static void update_curr_rt(struct rq *rq)

for_each_sched_rt_entity(rt_se) {
rt_rq = rt_rq_of_se(rt_se);
+ schedstat_add(rt_rq, exec_clock, delta_exec);

if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
raw_spin_lock(&rt_rq->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index bc05c05..854d2e9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -208,6 +208,7 @@ struct cfs_rq {
unsigned int nr_running, h_nr_running;

u64 exec_clock;
+ u64 prev_exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
@@ -299,6 +300,8 @@ struct rt_rq {
struct plist_head pushable_tasks;
#endif
int rt_throttled;
+ u64 exec_clock;
+ u64 prev_exec_clock;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
--
1.7.11.7

2012-11-20 08:33:43

by Glauber Costa

Subject: [PATCH 4/6] cgroup, sched: deprecate cpuacct

From: Tejun Heo <[email protected]>

Now that cpu serves the same files as cpuacct and using cpuacct
separately from cpu is deprecated, we can deprecate the cpuacct
controller itself. To avoid disturbing userland, which has been
co-mounting cpu and cpuacct, implement some hackery in cgroup core so
that cpuacct co-mounting still works even if cpuacct is disabled.

The goal of this patch is to accelerate disabling and removal of
cpuacct by decoupling kernel-side deprecation from userland changes.
Userland is recommended to do the following.

* If /proc/cgroups lists cpuacct, always co-mount it with cpu under
e.g. /sys/fs/cgroup/cpu.

* Optionally create symlinks for compatibility -
e.g. /sys/fs/cgroup/cpuacct and /sys/fs/cgroup/cpu,cpuacct both
pointing to /sys/fs/cgroup/cpu - whether cpuacct exists or not.

This compatibility hack will eventually go away.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Paul Turner <[email protected]>
---
init/Kconfig | 11 ++++++++++-
kernel/cgroup.c | 41 +++++++++++++++++++++++++++++++++++++++--
kernel/sched/core.c | 2 ++
3 files changed, 51 insertions(+), 3 deletions(-)

diff --git a/init/Kconfig b/init/Kconfig
index 3d26eb9..0690a96 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -675,11 +675,20 @@ config PROC_PID_CPUSET
default y

config CGROUP_CPUACCT
- bool "Simple CPU accounting cgroup subsystem"
+ bool "DEPRECATED: Simple CPU accounting cgroup subsystem"
+ default n
help
Provides a simple Resource Controller for monitoring the
total CPU consumed by the tasks in a cgroup.

+ This cgroup subsystem is deprecated. The CPU cgroup
+ subsystem serves the same accounting files and "cpuacct"
+ mount option is ignored if specified with "cpu". As long as
+ userland co-mounts cpu and cpuacct, disabling this
+ controller should be mostly unnoticeable - one notable
+ difference is that /proc/PID/cgroup won't list cpuacct
+ anymore.
+
config RESOURCE_COUNTERS
bool "Resource counters"
help
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index b2ba3e9..13e039c 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1106,6 +1106,7 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
unsigned long mask = (unsigned long)-1;
int i;
bool module_pin_failed = false;
+ bool cpuacct_requested = false;

BUG_ON(!mutex_is_locked(&cgroup_mutex));

@@ -1191,8 +1192,13 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)

break;
}
- if (i == CGROUP_SUBSYS_COUNT)
+ /* handle deprecated cpuacct specially, see below */
+ if (!strcmp(token, "cpuacct")) {
+ cpuacct_requested = true;
+ one_ss = true;
+ } else if (i == CGROUP_SUBSYS_COUNT) {
return -ENOENT;
+ }
}

/*
@@ -1219,8 +1225,25 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
* this creates some discrepancies in /proc/cgroups and
* /proc/PID/cgroup.
*
+ * Accept and ignore "cpuacct" option if comounted with "cpu" even
+ * when cpuacct itself is disabled to allow quick disabling and
+ * removal of cpuacct. This will be removed eventually.
+ *
* https://lkml.org/lkml/2012/9/13/542
*/
+ if (cpuacct_requested) {
+ bool comounted = false;
+
+#if IS_ENABLED(CONFIG_CGROUP_SCHED)
+ comounted = opts->subsys_bits & (1 << cpu_cgroup_subsys_id);
+#endif
+ if (!comounted) {
+ pr_warning("cgroup: mounting cpuacct separately from cpu is deprecated\n");
+#if !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+ return -EINVAL;
+#endif
+ }
+ }
#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
(opts->subsys_bits & (1 << cpuacct_subsys_id)))
@@ -4544,6 +4567,7 @@ const struct file_operations proc_cgroup_operations = {
/* Display information about each subsystem and each hierarchy */
static int proc_cgroupstats_show(struct seq_file *m, void *v)
{
+ struct cgroup_subsys *ss;
int i;

seq_puts(m, "#subsys_name\thierarchy\tnum_cgroups\tenabled\n");
@@ -4554,7 +4578,7 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
*/
mutex_lock(&cgroup_mutex);
for (i = 0; i < CGROUP_SUBSYS_COUNT; i++) {
- struct cgroup_subsys *ss = subsys[i];
+ ss = subsys[i];
if (ss == NULL)
continue;
seq_printf(m, "%s\t%d\t%d\t%d\n",
@@ -4562,6 +4586,19 @@ static int proc_cgroupstats_show(struct seq_file *m, void *v)
ss->root->number_of_cgroups, !ss->disabled);
}
mutex_unlock(&cgroup_mutex);
+
+ /*
+ * Fake /proc/cgroups entry for cpuacct to trick userland into
+ * cpu,cpuacct comounts. This is to allow quick disabling and
+ * removal of cpuacct and will be removed eventually.
+ */
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && !IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+ ss = subsys[cpu_cgroup_subsys_id];
+ if (ss) {
+ seq_printf(m, "cpuacct\t%d\t%d\t%d\n", ss->root->hierarchy_id,
+ ss->root->number_of_cgroups, !ss->disabled);
+ }
+#endif
return 0;
}

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 59cf912..7d85a01 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8518,6 +8518,8 @@ struct cgroup_subsys cpu_cgroup_subsys = {

#ifdef CONFIG_CGROUP_CPUACCT

+#warning CONFIG_CGROUP_CPUACCT is deprecated, read the Kconfig help message
+
/*
* CPU accounting code for task groups.
*
--
1.7.11.7

2012-11-20 08:33:46

by Glauber Costa

Subject: [PATCH 3/6] cgroup, sched: let cpu serve the same files as cpuacct

From: Tejun Heo <[email protected]>

cpuacct being on a separate hierarchy is one of the main cgroup-related
complaints from the scheduler side, and the consensus seems to be:

* Allowing cpuacct to be a separate controller was a mistake. In
general multiple controllers on the same type of resource should be
avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and makes
cgroup core ignore cpuacct if it's co-mounted with cpu. This is a
step in deprecating cpuacct. The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup. The co-mounted cpuacct won't be reflected correctly
there. cpuacct will eventually be removed completely, probably except
for the statistics file names, and I'd like to keep the amount of
compatibility hackery to a minimum.

The cpu statistics implementation isn't optimized in any way. It's
mostly a verbatim copy of cpuacct's. The goal is to allow quick
disabling and removal of CONFIG_CGROUP_CPUACCT and to create a base on
top of which cpu can implement proper optimizations.

[ glommer: don't call *_charge in stop_task.c ]

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/cgroup.c | 13 ++++
kernel/sched/core.c | 194 ++++++++++++++++++++++++++++++++++++++++++++++++++-
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 7 ++
5 files changed, 214 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 4081fee..b2ba3e9 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1214,6 +1214,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
/* Consistency checks */

/*
+ * cpuacct is deprecated and cpu will serve the same stat files.
+ * If co-mount with cpu is requested, ignore cpuacct. Note that
+ * this creates some discrepancies in /proc/cgroups and
+ * /proc/PID/cgroup.
+ *
+ * https://lkml.org/lkml/2012/9/13/542
+ */
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+ if ((opts->subsys_bits & (1 << cpu_cgroup_subsys_id)) &&
+ (opts->subsys_bits & (1 << cpuacct_subsys_id)))
+ opts->subsys_bits &= ~(1 << cpuacct_subsys_id);
+#endif
+ /*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 649c9f8..59cf912 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2817,8 +2817,10 @@ struct cpuacct root_cpuacct;
static inline void task_group_account_field(struct task_struct *p, int index,
u64 tmp)
{
+#ifdef CONFIG_CGROUP_SCHED
+ struct task_group *tg;
+#endif
#ifdef CONFIG_CGROUP_CPUACCT
- struct kernel_cpustat *kcpustat;
struct cpuacct *ca;
#endif
/*
@@ -2829,6 +2831,20 @@ static inline void task_group_account_field(struct task_struct *p, int index,
*/
__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;

+#ifdef CONFIG_CGROUP_SCHED
+ rcu_read_lock();
+ tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ while (tg && (tg != &root_task_group)) {
+ struct kernel_cpustat *kcpustat = this_cpu_ptr(tg->cpustat);
+
+ kcpustat->cpustat[index] += tmp;
+ tg = tg->parent;
+ }
+ rcu_read_unlock();
+#endif
+
#ifdef CONFIG_CGROUP_CPUACCT
if (unlikely(!cpuacct_subsys.active))
return;
@@ -2836,7 +2852,8 @@ static inline void task_group_account_field(struct task_struct *p, int index,
rcu_read_lock();
ca = task_ca(p);
while (ca && (ca != &root_cpuacct)) {
- kcpustat = this_cpu_ptr(ca->cpustat);
+ struct kernel_cpustat *kcpustat = this_cpu_ptr(ca->cpustat);
+
kcpustat->cpustat[index] += tmp;
ca = parent_ca(ca);
}
@@ -7202,6 +7219,7 @@ int in_sched_functions(unsigned long addr)
#ifdef CONFIG_CGROUP_SCHED
struct task_group root_task_group;
LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_tmpmask);
@@ -7260,6 +7278,8 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
+ root_task_group.cpustat = &kernel_cpustat;
+ root_task_group.cpuusage = &root_tg_cpuusage;
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
@@ -7543,6 +7563,8 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+ free_percpu(tg->cpuusage);
+ free_percpu(tg->cpustat);
kfree(tg);
}

@@ -7556,6 +7578,11 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!tg)
return ERR_PTR(-ENOMEM);

+ tg->cpuusage = alloc_percpu(u64);
+ tg->cpustat = alloc_percpu(struct kernel_cpustat);
+ if (!tg->cpuusage || !tg->cpustat)
+ goto err;
+
if (!alloc_fair_sched_group(tg, parent))
goto err;

@@ -7647,6 +7674,24 @@ void sched_move_task(struct task_struct *tsk)

task_rq_unlock(rq, tsk, &flags);
}
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+ struct task_group *tg;
+ int cpu = task_cpu(tsk);
+
+ rcu_read_lock();
+
+ tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for (; tg; tg = tg->parent) {
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ *cpuusage += cputime;
+ }
+
+ rcu_read_unlock();
+}
#endif /* CONFIG_CGROUP_SCHED */

#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -8003,6 +8048,134 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
sched_move_task(task);
}

+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ u64 data;
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ data = *cpuusage;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ data = *cpuusage;
+#endif
+
+ return data;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ *cpuusage = val;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ *cpuusage = val;
+#endif
+}
+
+/* return total cpu usage (in nanoseconds) of a group */
+static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct task_group *tg;
+ u64 totalcpuusage = 0;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_present_cpu(i)
+ totalcpuusage += task_group_cpuusage_read(tg, i);
+
+ return totalcpuusage;
+}
+
+static int cpucg_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+ u64 reset)
+{
+ struct task_group *tg;
+ int err = 0;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ if (reset) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ for_each_present_cpu(i)
+ task_group_cpuusage_write(tg, i, 0);
+
+out:
+ return err;
+}
+
+static int cpucg_percpu_seq_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct task_group *tg;
+ u64 percpu;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_present_cpu(i) {
+ percpu = task_group_cpuusage_read(tg, i);
+ seq_printf(m, "%llu ", (unsigned long long) percpu);
+ }
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const char *cpucg_stat_desc[] = {
+ [CPUACCT_STAT_USER] = "user",
+ [CPUACCT_STAT_SYSTEM] = "system",
+};
+
+static int cpucg_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct task_group *tg;
+ int cpu;
+ s64 val = 0;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_online_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[CPUTIME_USER];
+ val += kcpustat->cpustat[CPUTIME_NICE];
+ }
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_USER], val);
+
+ val = 0;
+ for_each_online_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[CPUTIME_SYSTEM];
+ val += kcpustat->cpustat[CPUTIME_IRQ];
+ val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+ }
+
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpucg_stat_desc[CPUACCT_STAT_SYSTEM], val);
+
+ return 0;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 shareval)
@@ -8309,6 +8482,23 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+ /* cpuacct.* which used to be served by a separate cpuacct controller */
+ {
+ .name = "cpuacct.usage",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_u64 = cpucg_cpuusage_read,
+ .write_u64 = cpucg_cpuusage_write,
+ },
+ {
+ .name = "cpuacct.usage_percpu",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_seq_string = cpucg_percpu_seq_read,
+ },
+ {
+ .name = "cpuacct.stat",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_map = cpucg_stats_show,
+ },
{ } /* terminate */
};

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 96e2b18..2d6a793 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -706,6 +706,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct task_struct *curtask = task_of(curr);

trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+ task_group_charge(curtask, delta_exec);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e0b7ba9..0c70807 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -935,6 +935,7 @@ static void update_curr_rt(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);

curr->se.exec_start = rq->clock_task;
+ task_group_charge(curr, delta_exec);
cpuacct_charge(curr, delta_exec);

sched_rt_avg_update(rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0848fa3..bc05c05 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -104,6 +104,10 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+ /* statistics */
+ u64 __percpu *cpuusage;
+ struct kernel_cpustat __percpu *cpustat;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each cpu */
struct sched_entity **se;
@@ -575,6 +579,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
}

+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+
#else /* CONFIG_CGROUP_SCHED */

static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -582,6 +588,7 @@ static inline struct task_group *task_group(struct task_struct *p)
{
return NULL;
}
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }

#endif /* CONFIG_CGROUP_SCHED */

--
1.7.11.7

2012-11-20 08:33:52

by Glauber Costa

Subject: [PATCH 6/6] cpuacct: don't actually do anything.

All the information needed for cpuusage (and cpuusage_percpu) is
already present in schedstats, and it is recorded in a sane
hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work. All the former functions become empty inlines.

Signed-off-by: Glauber Costa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/sched/core.c | 102 ++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 10 +++--
2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7d85a01..13cc041 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7675,6 +7675,7 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &flags);
}

+#ifndef CONFIG_SCHEDSTATS
void task_group_charge(struct task_struct *tsk, u64 cputime)
{
struct task_group *tg;
@@ -7692,6 +7693,7 @@ void task_group_charge(struct task_struct *tsk, u64 cputime)

rcu_read_unlock();
}
+#endif
#endif /* CONFIG_CGROUP_SCHED */

#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -8048,22 +8050,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
sched_move_task(task);
}

-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
{
- u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
- u64 data;
-
#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+ return tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+ tg->cfs_rq[cpu]->prev_exec_clock = tg->cfs_rq[cpu]->exec_clock;
+}
#else
- data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+ return 0;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+ return tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+ tg->rt_rq[cpu]->prev_exec_clock = tg->rt_rq[cpu]->exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+ return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
#endif

+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 ret = 0;
+
+ lock_rq_dword(cpu);
+ ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+ unlock_rq_dword(cpu);
+
+ return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+ lock_rq_dword(cpu);
+ cfs_exec_clock_reset(tg, cpu);
+ rt_exec_clock_reset(tg, cpu);
+ unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ u64 data;
+
+ lock_rq_dword(cpu);
+ data = *cpuusage;
+ unlock_rq_dword(cpu);
+
return data;
}

@@ -8071,17 +8143,11 @@ static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
{
u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);

-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ lock_rq_dword(cpu);
*cpuusage = val;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
-#endif
+ unlock_rq_dword(cpu);
}
+#endif

/* return total cpu usage (in nanoseconds) of a group */
static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 854d2e9..a6f3ec7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -582,8 +582,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
}

-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
#else /* CONFIG_CGROUP_SCHED */

static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -591,10 +589,14 @@ static inline struct task_group *task_group(struct task_struct *p)
{
return NULL;
}
-static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }
-
#endif /* CONFIG_CGROUP_SCHED */

+#if defined(CONFIG_CGROUP_SCHED) && !defined(CONFIG_SCHEDSTATS)
+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
set_task_rq(p, cpu);
--
1.7.11.7