2013-05-29 11:03:01

by Glauber Costa

Subject: [PATCH v7 00/11] per-cgroup cpu-stat

Peter et al.,

I am coming again with this series, hoping this is a better time for you all to
look at it.

I am *not* going as far as marking cpuacct deprecated, because I think it
deserves a special discussion (even though my position in this matter is widely
known), but all the infrastructure to make it happen is here. But after this,
it should be a matter of setting a flag (or not).

Through this patchset I am making the cpu cgroup provide the same functionality
as cpuacct, now with clearer semantics. I attempt to provide userspace with
enough information to reconstruct a per-container version of files like
"/proc/stat". In particular, we are interested in knowing the per-cgroup slices
of user time, system time, wait time, number of processes, and a variety of
statistics.

To make sure we can count nr of switches correctly, I am resurrecting one of
Peter's patches that apparently got nowhere in the past.
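
For illustration only, here is a minimal sketch of the kind of consumer this
series targets. It reads the cpu.stat_percpu file introduced by the last patch;
the mount point and group name below are made-up assumptions, and a real tool
would parse the header line rather than just relaying the file:

/* Hypothetical consumer: relay a group's cpu.stat_percpu as a local "stat". */
#include <stdio.h>

int main(void)
{
	/* Both the mount point and the group name are assumptions. */
	const char *path = "/sys/fs/cgroup/cpu/mycontainer/cpu.stat_percpu";
	char line[512];
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		fputs(line, stdout);
	fclose(f);
	return 0;
}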

Glauber Costa (8):
don't call cpuacct_charge in stop_task.c
sched: adjust exec_clock to use it as cpu usage metric
cpuacct: don't actually do anything.
sched: document the cpu cgroup.
sched: account guest time per-cgroup as well.
sched: record per-cgroup number of context switches
sched: change nr_context_switches calculation.
sched: introduce cgroup file stat_percpu

Peter Zijlstra (1):
sched: Push put_prev_task() into pick_next_task()

Tejun Heo (2):
cgroup: implement CFTYPE_NO_PREFIX
cgroup, sched: let cpu serve the same files as cpuacct

Documentation/cgroups/cpu.txt | 99 +++++++++++
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 16 +-
kernel/sched/core.c | 385 ++++++++++++++++++++++++++++++++++++++++--
kernel/sched/cputime.c | 33 +++-
kernel/sched/fair.c | 39 ++++-
kernel/sched/idle_task.c | 9 +-
kernel/sched/rt.c | 42 +++--
kernel/sched/sched.h | 35 +++-
kernel/sched/stop_task.c | 8 +-
10 files changed, 626 insertions(+), 41 deletions(-)
create mode 100644 Documentation/cgroups/cpu.txt

--
1.8.1.4


2013-05-29 11:03:10

by Glauber Costa

Subject: [PATCH v7 01/11] don't call cpuacct_charge in stop_task.c

Commit 8f618968 changed stop_task to do the same bookkeeping as the
other classes. However, the call to cpuacct_charge() doesn't affect
the scheduler decisions at all, and doesn't need to be moved over.

Moreover, being a kthread, the migration thread won't belong to any
cgroup anyway, rendering this call quite useless.

Signed-off-by: Glauber Costa <[email protected]>
CC: Mike Galbraith <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Thomas Gleixner <[email protected]>
---
kernel/sched/stop_task.c | 1 -
1 file changed, 1 deletion(-)

diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index e08fbee..4142d7e 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -68,7 +68,6 @@ static void put_prev_task_stop(struct rq *rq, struct task_struct *prev)
account_group_exec_runtime(curr, delta_exec);

curr->se.exec_start = rq_clock_task(rq);
- cpuacct_charge(curr, delta_exec);
}

static void task_tick_stop(struct rq *rq, struct task_struct *curr, int queued)
--
1.8.1.4

2013-05-29 11:03:37

by Glauber Costa

Subject: [PATCH v7 06/11] sched: document the cpu cgroup.

The CPU cgroup is, so far, undocumented. Although data exists in the
Documentation directory about its functioning, it is usually spread,
and/or presented in the context of something else. This file
consolidates all cgroup-related information about it.

Signed-off-by: Glauber Costa <[email protected]>
---
Documentation/cgroups/cpu.txt | 81 +++++++++++++++++++++++++++++++++++++++++++
1 file changed, 81 insertions(+)
create mode 100644 Documentation/cgroups/cpu.txt

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
new file mode 100644
index 0000000..072fd58
--- /dev/null
+++ b/Documentation/cgroups/cpu.txt
@@ -0,0 +1,81 @@
+CPU Controller
+--------------
+
+The CPU controller is responsible for grouping tasks together that will be
+viewed by the scheduler as a single unit. The CFS scheduler will first divide
+CPU time equally between all entities in the same level, and then proceed by
+doing the same in the next level. Basic use cases for that are described in the
+main cgroup documentation file, cgroups.txt.
+
+Users of this functionality should be aware that deep hierarchies will of
+course impose scheduler overhead, since the scheduler will have to take extra
+steps and look up additional data structures to make its final decision.
+
+Through the CPU controller, the scheduler is also able to cap the CPU
+utilization of a particular group. This is particularly useful in environments
+in which CPU is paid for by the hour, and one values predictability over
+performance.
+
+CPU Accounting
+--------------
+
+The CPU cgroup will also provide additional files under the prefix "cpuacct".
+Those files provide accounting statistics and were previously provided by the
+separate cpuacct controller. Although the cpuacct controller will still be kept
+around for compatibility reasons, its usage is discouraged. If both the CPU and
+cpuacct controllers are present in the system, distributors are encouraged to
+always mount them together.
+
+Files
+-----
+
+The CPU controller exposes the following files to the user:
+
+ - cpu.shares: The weight of each group living in the same hierarchy, that
+ translates into the amount of CPU it is expected to get. Upon cgroup creation,
+ each group gets assigned a default of 1024. The percentage of CPU assigned to
+ the cgroup is the value of shares divided by the sum of all shares in all
+ cgroups in the same level.
+
+ - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
+ bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
+ improve throughput at the expense of latency, since the scheduler will be able
+ to sustain a cpu-bound workload for longer. The opposite of true for smaller
+ periods. Note that this only affects non-RT tasks that are scheduled by the
+ CFS scheduler.
+
+- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
+ in for the current group will be allowed to run. For instance, if it is set to
+ half of cfs_period_us, the cgroup will only be able to peak run for 50 % of
+ the time. One should note that this represents aggregate time over all CPUs
+ in the system. Therefore, in order to allow full usage of two CPUs, for
+ instance, one should set this value to twice the value of cfs_period_us.
+
+- cpu.stat: statistics about the bandwidth controls. No data will be presented
+ if cpu.cfs_quota_us is not set. The file presents three
+ numbers:
+ nr_periods: how many full periods have elapsed.
+ nr_throttled: number of times we exhausted the full allowed bandwidth
+ throttled_time: total time the tasks were not run due to being over quota
+
+ - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-tasks
+ analogous to the CFS files cfs_quota_us and cfs_period_us. One important
+ difference, though, is that while the cfs quotas are upper bounds that
+ won't necessarily be met, the rt runtimes form a stricter guarantee.
+ Therefore, no overlap is allowed. Implications of that are that given a
+ hierarchy with multiple children, the sum of all rt_runtime_us may not exceed
+ the runtime of the parent. Also, a rt_runtime_us of 0, means that no rt tasks
+ can ever be run in this cgroup. For more information about rt tasks runtime
+ assignments, see scheduler/sched-rt-group.txt
+
+ - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
+ in this group.
+
+ - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in
+ this group, separated by CPU. The format is a space-separated array of time
+ values, one for each present CPU.
+
+ - cpuacct.stat: aggregate user and system time consumed by tasks in this group.
+ The format is
+ user: x
+ system: y
--
1.8.1.4
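
A quick worked example of the two main knobs documented above (all numbers are
made up): with two sibling groups A and B at the same level, cpu.shares = 1024
for A and 3072 for B gives A 1024/(1024+3072) = 25% and B 75% of the CPU time
at that level when both are busy. For bandwidth, cpu.cfs_period_us = 100000
with cpu.cfs_quota_us = 50000 caps the group at half a CPU, while a quota of
200000 allows up to two full CPUs worth of runtime per period.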

2013-05-29 11:03:40

by Glauber Costa

Subject: [PATCH v7 07/11] sched: account guest time per-cgroup as well.

We already track multiple tick statistics per-cgroup, using
the task_group_account_field facility. This patch accounts
guest_time in that manner as well.

Signed-off-by: Glauber Costa <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
---
kernel/sched/cputime.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 74a44ef..e653e52 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -183,8 +183,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
static void account_guest_time(struct task_struct *p, cputime_t cputime,
cputime_t cputime_scaled)
{
- u64 *cpustat = kcpustat_this_cpu->cpustat;
-
/* Add guest time to process. */
p->utime += cputime;
p->utimescaled += cputime_scaled;
@@ -193,11 +191,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,

/* Add guest time to cpustat. */
if (TASK_NICE(p) > 0) {
- cpustat[CPUTIME_NICE] += (__force u64) cputime;
- cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
+ task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
+ task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
} else {
- cpustat[CPUTIME_USER] += (__force u64) cputime;
- cpustat[CPUTIME_GUEST] += (__force u64) cputime;
+ task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
+ task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
}
}

--
1.8.1.4

2013-05-29 11:03:47

by Glauber Costa

Subject: [PATCH v7 08/11] sched: Push put_prev_task() into pick_next_task()

From: Peter Zijlstra <[email protected]>

In order to avoid having to do put/set on a whole cgroup hierarchy
when we context switch, push the put into pick_next_task() so that
both operations are in the same function. Further changes then allow
us to possibly optimize away redundant work.

[ [email protected]: incorporated mailing list feedback ]

Signed-off-by: Peter Zijlstra <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
---
kernel/sched/core.c | 20 +++++++-------------
kernel/sched/fair.c | 6 +++++-
kernel/sched/idle_task.c | 6 +++++-
kernel/sched/rt.c | 27 ++++++++++++++++-----------
kernel/sched/sched.h | 8 +++++++-
kernel/sched/stop_task.c | 5 ++++-
6 files changed, 44 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0fa0f87..3b1441f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2295,18 +2295,11 @@ static inline void schedule_debug(struct task_struct *prev)
schedstat_inc(this_rq(), sched_count);
}

-static void put_prev_task(struct rq *rq, struct task_struct *prev)
-{
- if (prev->on_rq || rq->skip_clock_update < 0)
- update_rq_clock(rq);
- prev->sched_class->put_prev_task(rq, prev);
-}
-
/*
* Pick up the highest-prio task:
*/
static inline struct task_struct *
-pick_next_task(struct rq *rq)
+pick_next_task(struct rq *rq, struct task_struct *prev)
{
const struct sched_class *class;
struct task_struct *p;
@@ -2316,13 +2309,13 @@ pick_next_task(struct rq *rq)
* the fair class we can call that function directly:
*/
if (likely(rq->nr_running == rq->cfs.h_nr_running)) {
- p = fair_sched_class.pick_next_task(rq);
+ p = fair_sched_class.pick_next_task(rq, prev);
if (likely(p))
return p;
}

for_each_class(class) {
- p = class->pick_next_task(rq);
+ p = class->pick_next_task(rq, prev);
if (p)
return p;
}
@@ -2417,8 +2410,9 @@ need_resched:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- put_prev_task(rq, prev);
- next = pick_next_task(rq);
+ if (prev->on_rq || rq->skip_clock_update < 0)
+ update_rq_clock(rq);
+ next = pick_next_task(rq, prev);
clear_tsk_need_resched(prev);
rq->skip_clock_update = 0;

@@ -4395,7 +4389,7 @@ static void migrate_tasks(unsigned int dead_cpu)
if (rq->nr_running == 1)
break;

- next = pick_next_task(rq);
+ next = pick_next_task(rq, NULL);
BUG_ON(!next);
next->sched_class->put_prev_task(rq, next);

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a7f5c39..414cf1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3628,7 +3628,8 @@ preempt:
set_last_buddy(se);
}

-static struct task_struct *pick_next_task_fair(struct rq *rq)
+static struct task_struct *
+pick_next_task_fair(struct rq *rq, struct task_struct *prev)
{
struct task_struct *p;
struct cfs_rq *cfs_rq = &rq->cfs;
@@ -3637,6 +3638,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
if (!cfs_rq->nr_running)
return NULL;

+ if (prev)
+ prev->sched_class->put_prev_task(rq, prev);
+
do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index d8da010..4f0a332 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -33,8 +33,12 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
resched_task(rq->idle);
}

-static struct task_struct *pick_next_task_idle(struct rq *rq)
+static struct task_struct *
+pick_next_task_idle(struct rq *rq, struct task_struct *prev)
{
+ if (prev)
+ prev->sched_class->put_prev_task(rq, prev);
+
schedstat_inc(rq, sched_goidle);
#ifdef CONFIG_SMP
/* Trigger the post schedule to do an idle_enter for CFS */
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 4a21045..6c41fa1 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1330,15 +1330,7 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
{
struct sched_rt_entity *rt_se;
struct task_struct *p;
- struct rt_rq *rt_rq;
-
- rt_rq = &rq->rt;
-
- if (!rt_rq->rt_nr_running)
- return NULL;
-
- if (rt_rq_throttled(rt_rq))
- return NULL;
+ struct rt_rq *rt_rq = &rq->rt;

do {
rt_se = pick_next_rt_entity(rq, rt_rq);
@@ -1352,9 +1344,22 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
return p;
}

-static struct task_struct *pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+pick_next_task_rt(struct rq *rq, struct task_struct *prev)
{
- struct task_struct *p = _pick_next_task_rt(rq);
+ struct task_struct *p;
+ struct rt_rq *rt_rq = &rq->rt;
+
+ if (!rt_rq->rt_nr_running)
+ return NULL;
+
+ if (rt_rq_throttled(rt_rq))
+ return NULL;
+
+ if (prev)
+ prev->sched_class->put_prev_task(rq, prev);
+
+ p = _pick_next_task_rt(rq);

/* The running task is never eligible for pushing */
if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e5e795..a39eb9b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -987,7 +987,13 @@ struct sched_class {

void (*check_preempt_curr) (struct rq *rq, struct task_struct *p, int flags);

- struct task_struct * (*pick_next_task) (struct rq *rq);
+ /*
+ * It is the responsibility of the pick_next_task() method that will
+ * return the next task to call put_prev_task() on the @prev task or
+ * something equivalent.
+ */
+ struct task_struct * (*pick_next_task) (struct rq *rq,
+ struct task_struct *prev);
void (*put_prev_task) (struct rq *rq, struct task_struct *p);

#ifdef CONFIG_SMP
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 4142d7e..6adf6a9 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -23,12 +23,15 @@ check_preempt_curr_stop(struct rq *rq, struct task_struct *p, int flags)
/* we're never preempted */
}

-static struct task_struct *pick_next_task_stop(struct rq *rq)
+static struct task_struct *
+pick_next_task_stop(struct rq *rq, struct task_struct *prev)
{
struct task_struct *stop = rq->stop;

if (stop && stop->on_rq) {
stop->se.exec_start = rq_clock_task(rq);
+ if (prev)
+ prev->sched_class->put_prev_task(rq, prev);
return stop;
}

--
1.8.1.4
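
The contract change, in short: callers of pick_next_task() no longer call
put_prev_task() themselves. Each class's pick_next_task(rq, prev) must put
@prev (or do something equivalent), and a NULL @prev -- as in migrate_tasks()
above -- means there is nothing to put.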

2013-05-29 11:04:19

by Glauber Costa

Subject: [PATCH v7 11/11] sched: introduce cgroup file stat_percpu

The file cpu.stat_percpu will show various pieces of scheduler-related
information that are usually available to the top level through other
files.

For instance, most of the meaningful data in /proc/stat is presented
here. Given this file, a container can easily construct a local copy of
/proc/stat for internal consumption.

The data we export is comprised of:
* all the tick information, previously available only through cpuacct,
like user time, system time, etc.

* wait time, which can be used to construct analogous information to
steal time in hypervisors,

* nr_switches and nr_running, which are cgroup-local versions of
their global counterparts.

The file format consists of a one-line header that describes the fields
being listed. No guarantee is given that the fields will be kept the
same between kernel releases, and readers should always check the header
in order to introspect it.

Each of the following lines shows the respective field values for one of the
possible cpus in the system. Time values are shown in nanoseconds; nr_switches
and nr_running are plain counts.

One example output for this file is:

cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 471000000 0 15000000 0 0 0 0 1996534 7205 1
cpu1 588000000 0 17000000 0 0 0 0 2848680 6510 1
cpu2 505000000 0 14000000 0 0 0 0 2350771 6183 1
cpu3 472000000 0 16000000 0 0 0 0 19766345 6277 2

Signed-off-by: Glauber Costa <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
---
Documentation/cgroups/cpu.txt | 18 +++++++
kernel/sched/core.c | 109 ++++++++++++++++++++++++++++++++++++++++++
kernel/sched/fair.c | 14 ++++++
kernel/sched/sched.h | 10 +++-
4 files changed, 150 insertions(+), 1 deletion(-)

diff --git a/Documentation/cgroups/cpu.txt b/Documentation/cgroups/cpu.txt
index 072fd58..d40e500 100644
--- a/Documentation/cgroups/cpu.txt
+++ b/Documentation/cgroups/cpu.txt
@@ -68,6 +68,24 @@ The CPU controller exposes the following files to the user:
can ever be run in this cgroup. For more information about rt tasks runtime
assignments, see scheduler/sched-rt-group.txt

+ - cpu.stat_percpu: Various scheduler statistics for the current group. The
+ information provided in this file is akin to the one displayed in /proc/stat,
+ except for the fact that it is cgroup-aware. The file format consists of a
+ one-line header that describes the fields being listed. No guarantee is
+ given that the fields will be kept the same between kernel releases, and
+ readers should always check the header in order to introspect it.
+
+ Each of the following lines shows the respective field values for one of the
+ possible cpus in the system. Time values are shown in nanoseconds; nr_switches
+ and nr_running are plain counts. One example output for this file is:
+
+ cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
+ cpu0 471000000 0 15000000 0 0 0 0 1996534 7205 1
+ cpu1 588000000 0 17000000 0 0 0 0 2848680 6510 1
+ cpu2 505000000 0 14000000 0 0 0 0 2350771 6183 1
+ cpu3 472000000 0 16000000 0 0 0 0 19766345 6277 2
+
+
- cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
in this group.

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 892c828..e2963ec 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7239,6 +7239,7 @@ static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
#else
static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
{
+ return 0;
}

static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
@@ -7670,6 +7671,108 @@ static u64 cpu_rt_period_read_uint(struct cgroup *cgrp, struct cftype *cft)
}
#endif /* CONFIG_RT_GROUP_SCHED */

+#ifdef CONFIG_SCHEDSTATS
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define fair_rq(field, tg, i) (tg)->cfs_rq[i]->field
+#else
+#define fair_rq(field, tg, i) 0
+#endif
+
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_rq(field, tg, i) (tg)->rt_rq[i]->field
+#else
+#define rt_rq(field, tg, i) 0
+#endif
+
+static u64 tg_nr_switches(struct task_group *tg, int cpu)
+{
+ /* nr_switches, which counts idle and stop task, is added to all tgs */
+ return cpu_rq(cpu)->nr_switches +
+ cfs_nr_switches(tg, cpu) + rt_nr_switches(tg, cpu);
+}
+
+static u64 tg_nr_running(struct task_group *tg, int cpu)
+{
+ /*
+ * because of autogrouped groups in root_task_group, the
+ * following does not hold.
+ */
+ if (tg != &root_task_group)
+ return rt_rq(rt_nr_running, tg, cpu) + fair_rq(nr_running, tg, cpu);
+
+ return cpu_rq(cpu)->nr_running;
+}
+
+static u64 tg_wait(struct task_group *tg, int cpu)
+{
+ u64 val;
+
+ if (tg != &root_task_group)
+ val = cfs_read_wait(tg, cpu);
+ else
+ /*
+ * There are many errors here that we are accumulating.
+ * However, we only provide this in the interest of having
+ * a consistent interface for all cgroups. Everybody
+ * probing the root cgroup should be getting its figures
+ * from system-wide files as /proc/stat. That would be faster
+ * to begin with...
+ */
+ val = kcpustat_cpu(cpu).cpustat[CPUTIME_STEAL] * TICK_NSEC;
+
+ return val;
+}
+
+static inline void do_fill_seq(struct seq_file *m, struct task_group *tg,
+ int cpu, int index)
+{
+ u64 val = 0;
+ struct kernel_cpustat *kcpustat;
+ kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val = cputime64_to_clock_t(kcpustat->cpustat[index]) * TICK_NSEC;
+ seq_put_decimal_ull(m, ' ', val);
+}
+
+/*
+ * This will display per-cpu statistics about the running cgroup. The file
+ * format consists of a one-line header that describes the fields being listed.
+ * No guarantee is given that the fields will be kept the same between kernel
+ * releases, and readers should always check the header in order to introspect
+ * it. The first column, however, will always be in the form cpux, where
+ * x is the logical number of the cpu.
+ *
+ * Each of the following lines will show the respective field values for each of
+ * the possible cpus in the system. All time values are shown in nanoseconds.
+ */
+static int cpu_stats_percpu_show(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct task_group *tg = cgroup_tg(cgrp);
+ int cpu;
+
+ seq_printf(m, "cpu user nice system irq softirq guest guest_nice ");
+ seq_printf(m, "wait nr_switches nr_running\n");
+
+ for_each_possible_cpu(cpu) {
+ seq_printf(m, "cpu%d", cpu);
+ do_fill_seq(m, tg, cpu, CPUTIME_USER);
+ do_fill_seq(m, tg, cpu, CPUTIME_NICE);
+ do_fill_seq(m, tg, cpu, CPUTIME_SYSTEM);
+ do_fill_seq(m, tg, cpu, CPUTIME_IRQ);
+ do_fill_seq(m, tg, cpu, CPUTIME_SOFTIRQ);
+ do_fill_seq(m, tg, cpu, CPUTIME_GUEST);
+ do_fill_seq(m, tg, cpu, CPUTIME_GUEST_NICE);
+ seq_put_decimal_ull(m, ' ', tg_wait(tg, cpu));
+ seq_put_decimal_ull(m, ' ', tg_nr_switches(tg, cpu));
+ seq_put_decimal_ull(m, ' ', tg_nr_running(tg, cpu));
+ seq_putc(m, '\n');
+ }
+
+ return 0;
+}
+#endif
+
static struct cftype cpu_files[] = {
#ifdef CONFIG_FAIR_GROUP_SCHED
{
@@ -7723,6 +7826,12 @@ static struct cftype cpu_files[] = {
.flags = CFTYPE_NO_PREFIX,
.read_map = cpucg_stats_show,
},
+#ifdef CONFIG_SCHEDSTATS
+ {
+ .name = "stat_percpu",
+ .read_seq_string = cpu_stats_percpu_show,
+ },
+#endif
{ } /* terminate */
};

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29e5dd..f718c28 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1123,6 +1123,20 @@ static void update_cfs_shares(struct cfs_rq *cfs_rq)

reweight_entity(cfs_rq_of(se), se, shares);
}
+
+#ifdef CONFIG_SCHEDSTATS
+u64 cfs_read_wait(struct task_group *tg, int cpu)
+{
+ struct sched_entity *se = tg->se[cpu];
+ struct cfs_rq *cfs_rq = cfs_rq_of(se);
+ u64 value = se->statistics.wait_sum;
+
+ if (!se->statistics.wait_start)
+ return value;
+
+ return value + rq_of(cfs_rq)->clock - se->statistics.wait_start;
+}
+#endif
#else /* CONFIG_FAIR_GROUP_SCHED */
static inline void update_cfs_shares(struct cfs_rq *cfs_rq)
{
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 39d3284..742870e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -242,8 +242,16 @@ extern void sched_move_task(struct task_struct *tsk);
extern int sched_group_set_shares(struct task_group *tg, unsigned long shares);
#endif

-#else /* CONFIG_CGROUP_SCHED */

+#ifdef CONFIG_FAIR_GROUP_SCHED
+extern u64 cfs_read_wait(struct task_group *tg, int cpu);
+#else
+static inline u64 cfs_read_wait(struct task_group *tg, int cpu)
+{
+ return 0;
+}
+#endif
+#else /* CONFIG_CGROUP_SCHED */
struct cfs_bandwidth { };

#endif /* CONFIG_CGROUP_SCHED */
--
1.8.1.4
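
As a sketch of the "wait time as steal time" use mentioned above: a container
reconstructing /proc/stat would convert the wait column (nanoseconds) into
clock ticks. The helper below is illustrative only; the USER_HZ value is an
assumption, and a real consumer would query sysconf(_SC_CLK_TCK) instead:

/* Illustrative conversion of the per-cgroup wait time (ns) to "steal" ticks. */
#include <stdint.h>

#define NSEC_PER_SEC	1000000000ULL
#define USER_HZ		100ULL	/* assumed; use sysconf(_SC_CLK_TCK) in practice */

static inline uint64_t wait_ns_to_steal_ticks(uint64_t wait_ns)
{
	return wait_ns / (NSEC_PER_SEC / USER_HZ);
}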

2013-05-29 11:04:05

by Glauber Costa

Subject: [PATCH v7 10/11] sched: change nr_context_switches calculation.

This patch changes the calculation of nr_context_switches. The variable
"nr_switches" is now used to account only for the number of transitions to the
idle or stop task, and its increment is removed from the schedule() path.

The total can still be computed, because the transitions to the fair and rt
classes are recorded in the root_task_group. One can easily derive the total
figure by adding those quantities together.

Signed-off-by: Glauber Costa <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
---
kernel/sched/core.c | 17 +++++++++++++++--
kernel/sched/idle_task.c | 3 +++
kernel/sched/stop_task.c | 2 ++
3 files changed, 20 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 3b1441f..892c828 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2031,13 +2031,27 @@ unsigned long nr_running(void)
return sum;
}

+#ifdef CONFIG_FAIR_GROUP_SCHED
+#define cfs_nr_switches(tg, cpu) (tg)->cfs_rq[cpu]->nr_switches
+#else
+#define cfs_nr_switches(tg, cpu) cpu_rq(cpu)->cfs.nr_switches
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+#define rt_nr_switches(tg, cpu) (tg)->rt_rq[cpu]->rt_nr_switches
+#else
+#define rt_nr_switches(tg, cpu) cpu_rq(cpu)->rt.rt_nr_switches
+#endif
+
unsigned long long nr_context_switches(void)
{
int i;
unsigned long long sum = 0;

- for_each_possible_cpu(i)
+ for_each_possible_cpu(i) {
+ sum += cfs_nr_switches(&root_task_group, i);
+ sum += rt_nr_switches(&root_task_group, i);
sum += cpu_rq(i)->nr_switches;
+ }

return sum;
}
@@ -2417,7 +2431,6 @@ need_resched:
rq->skip_clock_update = 0;

if (likely(prev != next)) {
- rq->nr_switches++;
rq->curr = next;
++*switch_count;

diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index 4f0a332..923286c 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -39,6 +39,9 @@ pick_next_task_idle(struct rq *rq, struct task_struct *prev)
if (prev)
prev->sched_class->put_prev_task(rq, prev);

+ if (prev != rq->idle)
+ rq->nr_switches++;
+
schedstat_inc(rq, sched_goidle);
#ifdef CONFIG_SMP
/* Trigger the post schedule to do an idle_enter for CFS */
diff --git a/kernel/sched/stop_task.c b/kernel/sched/stop_task.c
index 6adf6a9..48c572a 100644
--- a/kernel/sched/stop_task.c
+++ b/kernel/sched/stop_task.c
@@ -32,6 +32,8 @@ pick_next_task_stop(struct rq *rq, struct task_struct *prev)
stop->se.exec_start = rq_clock_task(rq);
if (prev)
prev->sched_class->put_prev_task(rq, prev);
+ if (prev != rq->stop)
+ rq->nr_switches++;
return stop;
}

--
1.8.1.4
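
In other words, after this patch the global figure is reconstructed as the
per-cpu sum of cfs_nr_switches(&root_task_group, cpu),
rt_nr_switches(&root_task_group, cpu) and rq->nr_switches, where the last
term now only counts transitions to the idle and stop tasks.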

2013-05-29 11:03:54

by Glauber Costa

Subject: [PATCH v7 09/11] sched: record per-cgroup number of context switches

Context switches are, to this moment, a property of the runqueue. When
running containers, we would like to be able to present a separate
figure for each container (or cgroup, in this context).

The chosen way to accomplish this is to increment a per-cfs_rq or
per-rt_rq counter, depending on the task's class, for each of the sched
entities involved, up to the parent. It is trivial to note that by doing
this we always add 1 for the parent cgroup. Also, we are not introducing
any new hierarchy walks here: an already existing walk is reused.
There are, however, two main issues:

1. The traditional context switch code only increments nr_switches when
a different task is being inserted in the rq. Occasionally, albeit not
likely, we will pick the same task as before. Since for cfs and rt we
only know which task will be next after the walk, we need to do the walk
again, decrementing 1. As this is very unlikely, it seems a fair
price to pay.

2. Those figures do not include switches from and to the idle or stop
task. Those need to be recorded separately, which will happen in a
follow up patch.

Signed-off-by: Glauber Costa <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
---
kernel/sched/fair.c | 18 ++++++++++++++++++
kernel/sched/rt.c | 15 +++++++++++++--
kernel/sched/sched.h | 3 +++
3 files changed, 34 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 414cf1c..b29e5dd 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3642,6 +3642,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
prev->sched_class->put_prev_task(rq, prev);

do {
+ if (likely(prev))
+ cfs_rq->nr_switches++;
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
@@ -3651,6 +3653,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
if (hrtick_enabled(rq))
hrtick_start_fair(rq, p);

+ /*
+ * This condition is extremely unlikely, and most of the time will just
+ * consist of this unlikely branch, which is extremely cheap. But we
+ * still need to have it, because when we first loop through cfs_rq's,
+ * we can't possibly know which task we will pick. The call to
+ * set_next_entity above is not meant to mess up the tree in this case,
+ * so this should give us the same chain, in the same order.
+ */
+ if (unlikely(p == prev)) {
+ se = &p->se;
+ for_each_sched_entity(se) {
+ cfs_rq = cfs_rq_of(se);
+ cfs_rq->nr_switches--;
+ }
+ }
+
return p;
}

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 6c41fa1..894fc0b 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -1326,13 +1326,16 @@ static struct sched_rt_entity *pick_next_rt_entity(struct rq *rq,
return next;
}

-static struct task_struct *_pick_next_task_rt(struct rq *rq)
+static struct task_struct *
+_pick_next_task_rt(struct rq *rq, struct task_struct *prev)
{
struct sched_rt_entity *rt_se;
struct task_struct *p;
struct rt_rq *rt_rq = &rq->rt;

do {
+ if (likely(prev))
+ rt_rq->rt_nr_switches++;
rt_se = pick_next_rt_entity(rq, rt_rq);
BUG_ON(!rt_se);
rt_rq = group_rt_rq(rt_se);
@@ -1341,6 +1344,14 @@ static struct task_struct *_pick_next_task_rt(struct rq *rq)
p = rt_task_of(rt_se);
p->se.exec_start = rq_clock_task(rq);

+ /* See fair.c for an explanation on this */
+ if (unlikely(p == prev)) {
+ for_each_sched_rt_entity(rt_se) {
+ rt_rq = rt_rq_of_se(rt_se);
+ rt_rq->rt_nr_switches--;
+ }
+ }
+
return p;
}

@@ -1359,7 +1370,7 @@ pick_next_task_rt(struct rq *rq, struct task_struct *prev)
if (prev)
prev->sched_class->put_prev_task(rq, prev);

- p = _pick_next_task_rt(rq);
+ p = _pick_next_task_rt(rq, prev);

/* The running task is never eligible for pushing */
if (p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a39eb9b..39d3284 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -273,6 +273,7 @@ struct cfs_rq {
unsigned int nr_spread_over;
#endif

+ u64 nr_switches;
#ifdef CONFIG_SMP
/*
* Load-tracking only depends on SMP, FAIR_GROUP_SCHED dependency below may be
@@ -342,6 +343,8 @@ static inline int rt_bandwidth_enabled(void)
struct rt_rq {
struct rt_prio_array active;
unsigned int rt_nr_running;
+ u64 rt_nr_switches;
+
#if defined CONFIG_SMP || defined CONFIG_RT_GROUP_SCHED
struct {
int curr; /* highest queued rt task prio */
--
1.8.1.4

2013-05-29 11:03:35

by Glauber Costa

Subject: [PATCH v7 04/11] sched: adjust exec_clock to use it as cpu usage metric

exec_clock already provides per-group cpu usage metrics, and can be
reused by cpuacct in case cpu and cpuacct are comounted.

However, it is only provided by tasks in fair class. Doing the same for
rt is easy, and can be done in an already existing hierarchy loop. This
is an improvement over the independent hierarchy walk executed by
cpuacct.

Signed-off-by: Glauber Costa <[email protected]>
CC: Dave Jones <[email protected]>
CC: Ben Hutchings <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: Paul Turner <[email protected]>
CC: Lennart Poettering <[email protected]>
CC: Kay Sievers <[email protected]>
CC: Tejun Heo <[email protected]>
---
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 3 +++
2 files changed, 4 insertions(+)

diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index e9f8dcd..4a21045 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -907,6 +907,7 @@ static void update_curr_rt(struct rq *rq)

for_each_sched_rt_entity(rt_se) {
rt_rq = rt_rq_of_se(rt_se);
+ schedstat_add(rt_rq, exec_clock, delta_exec);

if (sched_rt_runtime(rt_rq) != RUNTIME_INF) {
raw_spin_lock(&rt_rq->rt_runtime_lock);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 765c687..b05dd84 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -254,6 +254,7 @@ struct cfs_rq {
unsigned int nr_running, h_nr_running;

u64 exec_clock;
+ u64 prev_exec_clock;
u64 min_vruntime;
#ifndef CONFIG_64BIT
u64 min_vruntime_copy;
@@ -356,6 +357,8 @@ struct rt_rq {
struct plist_head pushable_tasks;
#endif
int rt_throttled;
+ u64 exec_clock;
+ u64 prev_exec_clock;
u64 rt_time;
u64 rt_runtime;
/* Nests inside the rq lock: */
--
1.8.1.4

2013-05-29 11:08:14

by Glauber Costa

Subject: [PATCH v7 03/11] cgroup, sched: let cpu serve the same files as cpuacct

From: Tejun Heo <[email protected]>

cpuacct being on a separate hierarchy is one of the main cgroup
related complaints from scheduler side and the consensus seems to be

* Allowing cpuacct to be a separate controller was a mistake. In
general multiple controllers on the same type of resource should be
avoided, especially accounting-only ones.

* Statistics provided by cpuacct are useful and should instead be
served by cpu.

This patch makes cpu maintain and serve all cpuacct.* files and makes
cgroup core ignore cpuacct if it's co-mounted with cpu. This is a
step in deprecating cpuacct. The next patch will allow disabling or
dropping cpuacct without affecting userland too much.

Note that this creates some discrepancies in /proc/cgroups and
/proc/PID/cgroup. The co-mounted cpuacct won't be reflected correctly
there. cpuacct will eventually be removed completely, probably except
for the statistics filenames, and I'd like to keep the amount of
compatibility hackery to a minimum.

The cpu statistics implementation isn't optimized in any way. It's
mostly a verbatim copy from cpuacct. The goal is allowing quick
disabling and removal of CONFIG_CGROUP_CPUACCT and creating a base on
top of which cpu can implement proper optimization.

[ glommer: don't call *_charge in stop_task.c ]

Signed-off-by: Tejun Heo <[email protected]>
Signed-off-by: Glauber Costa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/cgroup.c | 13 ++++
kernel/sched/core.c | 173 +++++++++++++++++++++++++++++++++++++++++++++++++
kernel/sched/cputime.c | 23 +++++++
kernel/sched/fair.c | 1 +
kernel/sched/rt.c | 1 +
kernel/sched/sched.h | 7 ++
6 files changed, 218 insertions(+)

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 8bbeb4d..01c3ed0 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -1256,6 +1256,19 @@ static int parse_cgroupfs_options(char *data, struct cgroup_sb_opts *opts)
}

/*
+ * cpuacct is deprecated and cpu will serve the same stat files.
+ * If co-mount with cpu is requested, ignore cpuacct. Note that
+ * this creates some discrepancies in /proc/cgroups and
+ * /proc/PID/cgroup.
+ *
+ * https://lkml.org/lkml/2012/9/13/542
+ */
+#if IS_ENABLED(CONFIG_CGROUP_SCHED) && IS_ENABLED(CONFIG_CGROUP_CPUACCT)
+ if ((opts->subsys_mask & (1 << cpu_cgroup_subsys_id)) &&
+ (opts->subsys_mask & (1 << cpuacct_subsys_id)))
+ opts->subsys_mask &= ~(1 << cpuacct_subsys_id);
+#endif
+ /*
* Option noprefix was introduced just for backward compatibility
* with the old cpuset, so we allow noprefix only if mounting just
* the cpuset subsystem.
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 36f85be..5ae1adf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6327,6 +6327,7 @@ int in_sched_functions(unsigned long addr)
*/
struct task_group root_task_group;
LIST_HEAD(task_groups);
+static DEFINE_PER_CPU(u64, root_tg_cpuusage);
#endif

DECLARE_PER_CPU(cpumask_var_t, load_balance_mask);
@@ -6385,6 +6386,8 @@ void __init sched_init(void)
#endif /* CONFIG_RT_GROUP_SCHED */

#ifdef CONFIG_CGROUP_SCHED
+ root_task_group.cpustat = &kernel_cpustat;
+ root_task_group.cpuusage = &root_tg_cpuusage;
list_add(&root_task_group.list, &task_groups);
INIT_LIST_HEAD(&root_task_group.children);
INIT_LIST_HEAD(&root_task_group.siblings);
@@ -6665,6 +6668,8 @@ static void free_sched_group(struct task_group *tg)
free_fair_sched_group(tg);
free_rt_sched_group(tg);
autogroup_free(tg);
+ free_percpu(tg->cpuusage);
+ free_percpu(tg->cpustat);
kfree(tg);
}

@@ -6677,6 +6682,11 @@ struct task_group *sched_create_group(struct task_group *parent)
if (!tg)
return ERR_PTR(-ENOMEM);

+ tg->cpuusage = alloc_percpu(u64);
+ tg->cpustat = alloc_percpu(struct kernel_cpustat);
+ if (!tg->cpuusage || !tg->cpustat)
+ goto err;
+
if (!alloc_fair_sched_group(tg, parent))
goto err;

@@ -6776,6 +6786,24 @@ void sched_move_task(struct task_struct *tsk)

task_rq_unlock(rq, tsk, &flags);
}
+
+void task_group_charge(struct task_struct *tsk, u64 cputime)
+{
+ struct task_group *tg;
+ int cpu = task_cpu(tsk);
+
+ rcu_read_lock();
+
+ tg = container_of(task_subsys_state(tsk, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for (; tg; tg = tg->parent) {
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ *cpuusage += cputime;
+ }
+
+ rcu_read_unlock();
+}
#endif /* CONFIG_CGROUP_SCHED */

#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7171,6 +7199,134 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
sched_move_task(task);
}

+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ u64 data;
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit read safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ data = *cpuusage;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ data = *cpuusage;
+#endif
+
+ return data;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+
+#ifndef CONFIG_64BIT
+ /*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+ raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ *cpuusage = val;
+ raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#else
+ *cpuusage = val;
+#endif
+}
+
+/* return total cpu usage (in nanoseconds) of a group */
+static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
+{
+ struct task_group *tg;
+ u64 totalcpuusage = 0;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_present_cpu(i)
+ totalcpuusage += task_group_cpuusage_read(tg, i);
+
+ return totalcpuusage;
+}
+
+static int cpucg_cpuusage_write(struct cgroup *cgrp, struct cftype *cftype,
+ u64 reset)
+{
+ struct task_group *tg;
+ int err = 0;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ if (reset) {
+ err = -EINVAL;
+ goto out;
+ }
+
+ for_each_present_cpu(i)
+ task_group_cpuusage_write(tg, i, 0);
+
+out:
+ return err;
+}
+
+static int cpucg_percpu_seq_read(struct cgroup *cgrp, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct task_group *tg;
+ u64 percpu;
+ int i;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_present_cpu(i) {
+ percpu = task_group_cpuusage_read(tg, i);
+ seq_printf(m, "%llu ", (unsigned long long) percpu);
+ }
+ seq_printf(m, "\n");
+ return 0;
+}
+
+static const char *cpucg_stat_desc[] = {
+ [0] = "user",
+ [1] = "system",
+};
+
+static int cpucg_stats_show(struct cgroup *cgrp, struct cftype *cft,
+ struct cgroup_map_cb *cb)
+{
+ struct task_group *tg;
+ int cpu;
+ s64 val = 0;
+
+ tg = container_of(cgroup_subsys_state(cgrp, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ for_each_online_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[CPUTIME_USER];
+ val += kcpustat->cpustat[CPUTIME_NICE];
+ }
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpucg_stat_desc[0], val);
+
+ val = 0;
+ for_each_online_cpu(cpu) {
+ struct kernel_cpustat *kcpustat = per_cpu_ptr(tg->cpustat, cpu);
+ val += kcpustat->cpustat[CPUTIME_SYSTEM];
+ val += kcpustat->cpustat[CPUTIME_IRQ];
+ val += kcpustat->cpustat[CPUTIME_SOFTIRQ];
+ }
+
+ val = cputime64_to_clock_t(val);
+ cb->fill(cb, cpucg_stat_desc[1], val);
+
+ return 0;
+}
+
#ifdef CONFIG_FAIR_GROUP_SCHED
static int cpu_shares_write_u64(struct cgroup *cgrp, struct cftype *cftype,
u64 shareval)
@@ -7477,6 +7633,23 @@ static struct cftype cpu_files[] = {
.write_u64 = cpu_rt_period_write_uint,
},
#endif
+ /* cpuacct.* which used to be served by a separate cpuacct controller */
+ {
+ .name = "cpuacct.usage",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_u64 = cpucg_cpuusage_read,
+ .write_u64 = cpucg_cpuusage_write,
+ },
+ {
+ .name = "cpuacct.usage_percpu",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_seq_string = cpucg_percpu_seq_read,
+ },
+ {
+ .name = "cpuacct.stat",
+ .flags = CFTYPE_NO_PREFIX,
+ .read_map = cpucg_stats_show,
+ },
{ } /* terminate */
};

diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
index 94691bc..74a44ef 100644
--- a/kernel/sched/cputime.c
+++ b/kernel/sched/cputime.c
@@ -112,6 +112,28 @@ static int irqtime_account_si_update(void)

#endif /* !CONFIG_IRQ_TIME_ACCOUNTING */

+#ifdef CONFIG_CGROUP_SCHED
+void cpu_account_field(struct task_struct *p, int index, u64 val)
+{
+ struct task_group *tg;
+
+ rcu_read_lock();
+ tg = container_of(task_subsys_state(p, cpu_cgroup_subsys_id),
+ struct task_group, css);
+
+ while (tg && (tg != &root_task_group)) {
+ struct kernel_cpustat *kcpustat = this_cpu_ptr(tg->cpustat);
+ kcpustat->cpustat[index] += val;
+ tg = tg->parent;
+ }
+ rcu_read_unlock();
+}
+#else
+static inline void cpu_account_field(struct task_struct *p, int index, u64 val)
+{
+
+}
+#endif
static inline void task_group_account_field(struct task_struct *p, int index,
u64 tmp)
{
@@ -124,6 +146,7 @@ static inline void task_group_account_field(struct task_struct *p, int index,
__get_cpu_var(kernel_cpustat).cpustat[index] += tmp;

cpuacct_account_field(p, index, tmp);
+ cpu_account_field(p, index, tmp);
}

/*
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3ee1c2e..a7f5c39 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -726,6 +726,7 @@ static void update_curr(struct cfs_rq *cfs_rq)
struct task_struct *curtask = task_of(curr);

trace_sched_stat_runtime(curtask, delta_exec, curr->vruntime);
+ task_group_charge(curtask, delta_exec);
cpuacct_charge(curtask, delta_exec);
account_group_exec_runtime(curtask, delta_exec);
}
diff --git a/kernel/sched/rt.c b/kernel/sched/rt.c
index 8d85f9a..e9f8dcd 100644
--- a/kernel/sched/rt.c
+++ b/kernel/sched/rt.c
@@ -897,6 +897,7 @@ static void update_curr_rt(struct rq *rq)
account_group_exec_runtime(curr, delta_exec);

curr->se.exec_start = rq_clock_task(rq);
+ task_group_charge(curr, delta_exec);
cpuacct_charge(curr, delta_exec);

sched_rt_avg_update(rq, delta_exec);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 74ff659..765c687 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -141,6 +141,10 @@ struct cfs_bandwidth {
struct task_group {
struct cgroup_subsys_state css;

+ /* statistics */
+ u64 __percpu *cpuusage;
+ struct kernel_cpustat __percpu *cpustat;
+
#ifdef CONFIG_FAIR_GROUP_SCHED
/* schedulable entities of this group on each cpu */
struct sched_entity **se;
@@ -703,6 +707,8 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
}

+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+
#else /* CONFIG_CGROUP_SCHED */

static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -710,6 +716,7 @@ static inline struct task_group *task_group(struct task_struct *p)
{
return NULL;
}
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }

#endif /* CONFIG_CGROUP_SCHED */

--
1.8.1.4
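
In practice this means that a co-mount such as "mount -t cgroup -o cpu,cpuacct
cgroup /sys/fs/cgroup/cpu" (the command is given only for illustration)
silently drops the cpuacct bit, and the cpu controller itself serves
cpuacct.usage, cpuacct.usage_percpu and cpuacct.stat on that hierarchy.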

2013-05-29 11:08:19

by Glauber Costa

Subject: [PATCH v7 02/11] cgroup: implement CFTYPE_NO_PREFIX

From: Tejun Heo <[email protected]>

When cgroup files are created, cgroup core automatically prepends the
name of the subsystem as prefix. This patch adds CFTYPE_NO_PREFIX
which disables the automatic prefix.

This will be used to deprecate cpuacct by making cpu create and
serve the cpuacct files.

Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Glauber Costa <[email protected]>
---
include/linux/cgroup.h | 1 +
kernel/cgroup.c | 3 ++-
2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index 5047355..5a8a093 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -397,6 +397,7 @@ struct cgroup_map_cb {
#define CFTYPE_ONLY_ON_ROOT (1U << 0) /* only create on root cg */
#define CFTYPE_NOT_ON_ROOT (1U << 1) /* don't create on root cg */
#define CFTYPE_INSANE (1U << 2) /* don't create if sane_behavior */
+#define CFTYPE_NO_PREFIX (1U << 3) /* skip subsys prefix */

#define MAX_CFTYPE_NAME 64

diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 2a99262..8bbeb4d 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2681,7 +2681,8 @@ static int cgroup_add_file(struct cgroup *cgrp, struct cgroup_subsys *subsys,
umode_t mode;
char name[MAX_CGROUP_TYPE_NAMELEN + MAX_CFTYPE_NAME + 2] = { 0 };

- if (subsys && !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
+ if (subsys && !(cft->flags & CFTYPE_NO_PREFIX) &&
+ !(cgrp->root->flags & CGRP_ROOT_NOPREFIX)) {
strcpy(name, subsys->name);
strcat(name, ".");
}
--
1.8.1.4

2013-05-29 11:08:18

by Glauber Costa

Subject: [PATCH v7 05/11] cpuacct: don't actually do anything.

All the information we have that is needed for cpuusage (and
cpuusage_percpu) is present in schedstats. It is already recorded
in a sane hierarchical way.

If we have CONFIG_SCHEDSTATS, we don't really need to do any extra
work. All former functions become empty inlines.

Signed-off-by: Glauber Costa <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kay Sievers <[email protected]>
Cc: Lennart Poettering <[email protected]>
Cc: Dave Jones <[email protected]>
Cc: Ben Hutchings <[email protected]>
Cc: Paul Turner <[email protected]>
---
kernel/sched/core.c | 102 ++++++++++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 10 +++--
2 files changed, 90 insertions(+), 22 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5ae1adf..0fa0f87 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6787,6 +6787,7 @@ void sched_move_task(struct task_struct *tsk)
task_rq_unlock(rq, tsk, &flags);
}

+#ifndef CONFIG_SCHEDSTATS
void task_group_charge(struct task_struct *tsk, u64 cputime)
{
struct task_group *tg;
@@ -6804,6 +6805,7 @@ void task_group_charge(struct task_struct *tsk, u64 cputime)

rcu_read_unlock();
}
+#endif
#endif /* CONFIG_CGROUP_SCHED */

#if defined(CONFIG_RT_GROUP_SCHED) || defined(CONFIG_CFS_BANDWIDTH)
@@ -7199,22 +7201,92 @@ cpu_cgroup_exit(struct cgroup *cgrp, struct cgroup *old_cgrp,
sched_move_task(task);
}

-static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+/*
+ * Take rq->lock to make 64-bit write safe on 32-bit platforms.
+ */
+static inline void lock_rq_dword(int cpu)
{
- u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
- u64 data;
-
#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit read safe on 32-bit platforms.
- */
raw_spin_lock_irq(&cpu_rq(cpu)->lock);
- data = *cpuusage;
+#endif
+}
+
+static inline void unlock_rq_dword(int cpu)
+{
+#ifndef CONFIG_64BIT
raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
+#endif
+}
+
+#ifdef CONFIG_SCHEDSTATS
+#ifdef CONFIG_FAIR_GROUP_SCHED
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+ return tg->cfs_rq[cpu]->exec_clock - tg->cfs_rq[cpu]->prev_exec_clock;
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+ tg->cfs_rq[cpu]->prev_exec_clock = tg->cfs_rq[cpu]->exec_clock;
+}
#else
- data = *cpuusage;
+static inline u64 cfs_exec_clock(struct task_group *tg, int cpu)
+{
+}
+
+static inline void cfs_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
+#endif
+#ifdef CONFIG_RT_GROUP_SCHED
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+ return tg->rt_rq[cpu]->exec_clock - tg->rt_rq[cpu]->prev_exec_clock;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+ tg->rt_rq[cpu]->prev_exec_clock = tg->rt_rq[cpu]->exec_clock;
+}
+#else
+static inline u64 rt_exec_clock(struct task_group *tg, int cpu)
+{
+ return 0;
+}
+
+static inline void rt_exec_clock_reset(struct task_group *tg, int cpu)
+{
+}
#endif

+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 ret = 0;
+
+ lock_rq_dword(cpu);
+ ret = cfs_exec_clock(tg, cpu) + rt_exec_clock(tg, cpu);
+ unlock_rq_dword(cpu);
+
+ return ret;
+}
+
+static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
+{
+ lock_rq_dword(cpu);
+ cfs_exec_clock_reset(tg, cpu);
+ rt_exec_clock_reset(tg, cpu);
+ unlock_rq_dword(cpu);
+}
+#else
+static u64 task_group_cpuusage_read(struct task_group *tg, int cpu)
+{
+ u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);
+ u64 data;
+
+ lock_rq_dword(cpu);
+ data = *cpuusage;
+ unlock_rq_dword(cpu);
+
return data;
}

@@ -7222,17 +7294,11 @@ static void task_group_cpuusage_write(struct task_group *tg, int cpu, u64 val)
{
u64 *cpuusage = per_cpu_ptr(tg->cpuusage, cpu);

-#ifndef CONFIG_64BIT
- /*
- * Take rq->lock to make 64-bit write safe on 32-bit platforms.
- */
- raw_spin_lock_irq(&cpu_rq(cpu)->lock);
+ lock_rq_dword(cpu);
*cpuusage = val;
- raw_spin_unlock_irq(&cpu_rq(cpu)->lock);
-#else
- *cpuusage = val;
-#endif
+ unlock_rq_dword(cpu);
}
+#endif

/* return total cpu usage (in nanoseconds) of a group */
static u64 cpucg_cpuusage_read(struct cgroup *cgrp, struct cftype *cft)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index b05dd84..0e5e795 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -710,8 +710,6 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#endif
}

-extern void task_group_charge(struct task_struct *tsk, u64 cputime);
-
#else /* CONFIG_CGROUP_SCHED */

static inline void set_task_rq(struct task_struct *p, unsigned int cpu) { }
@@ -719,10 +717,14 @@ static inline struct task_group *task_group(struct task_struct *p)
{
return NULL;
}
-static inline void task_group_charge(struct task_struct *tsk, u64 cputime) { }
-
#endif /* CONFIG_CGROUP_SCHED */

+#if defined(CONFIG_CGROUP_SCHED) && !defined(CONFIG_SCHEDSTATS)
+extern void task_group_charge(struct task_struct *tsk, u64 cputime);
+#else
+static inline void task_group_charge(struct task_struct *tsk, u64 cputime) {}
+#endif
+
static inline void __set_task_cpu(struct task_struct *p, unsigned int cpu)
{
set_task_rq(p, cpu);
--
1.8.1.4
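
With CONFIG_SCHEDSTATS, the usage read in this patch reduces, per cpu, to
(cfs_rq->exec_clock - cfs_rq->prev_exec_clock) + (rt_rq->exec_clock -
rt_rq->prev_exec_clock), and writing 0 to cpuacct.usage simply snapshots
prev_exec_clock = exec_clock for both classes instead of clearing a counter.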

2013-06-06 01:49:37

by Tejun Heo

Subject: Re: [PATCH v7 00/11] per-cgroup cpu-stat

Hello, Glauber.

On Wed, May 29, 2013 at 03:03:11PM +0400, Glauber Costa wrote:
> I am *not* going as far as marking cpuacct deprecated, because I think it
> deserves a special discussion (even though my position in this matter is widely
> known), but all the infrastructure to make it happen is here. But after this,
> it should be a matter of setting a flag (or not).

I'll ensure that cpuacct can't be used with sane_behavior from cgroup
core side, so that it can be deprecated with multiple hierarchies
eventually.

> Through this patchset I am making the cpu cgroup provide the same functionality
> as cpuacct, now with clearer semantics. I attempt to provide userspace with
> enough information to reconstruct a per-container version of files like
> "/proc/stat". In particular, we are interested in knowing the per-cgroup slices
> of user time, system time, wait time, number of processes, and a variety of
> statistics.
>
> To make sure we can count nr of switches correctly, I am resurrecting one of
> Peter's patches that apparently got nowhere in the past.

I generally agree with the approach but which tree is it based on?
Can you please let me know the base commit so that I can review the
series properly?

Thanks.

--
tejun

2013-06-06 07:57:52

by Glauber Costa

Subject: Re: [PATCH v7 00/11] per-cgroup cpu-stat

On 06/06/2013 05:49 AM, Tejun Heo wrote:
> I generally agree with the approach but which tree is it based on?
> Can you please let me know the base commit so that I can review the
> series properly?
last week's -tip/master

2013-06-06 23:00:18

by Tejun Heo

Subject: Re: [PATCH v7 04/11] sched: adjust exec_clock to use it as cpu usage metric

On Wed, May 29, 2013 at 03:03:15PM +0400, Glauber Costa wrote:
> exec_clock already provides per-group cpu usage metrics, and can be
> reused by cpuacct in case cpu and cpuacct are comounted.
>
> However, it is only provided by tasks in fair class. Doing the same for
> rt is easy, and can be done in an already existing hierarchy loop. This
> is an improvement over the independent hierarchy walk executed by
> cpuacct.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Dave Jones <[email protected]>
> CC: Ben Hutchings <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Paul Turner <[email protected]>
> CC: Lennart Poettering <[email protected]>
> CC: Kay Sievers <[email protected]>
> CC: Tejun Heo <[email protected]>

Reviewed-by: Tejun Heo <[email protected]>

Thanks.

--
tejun

2013-06-06 23:16:13

by Tejun Heo

Subject: Re: [PATCH v7 05/11] cpuacct: don't actually do anything.

Hello, Glauber.

I think the patch title / description are a bit confusing. This is
about the cpuacct stats provided by the cpu controller, right?

> +#ifdef CONFIG_SCHEDSTATS
> +#ifdef CONFIG_FAIR_GROUP_SCHED
...
> #else
...
> +#endif
> +#ifdef CONFIG_RT_GROUP_SCHED
...
> +#else
...
> #endif
...
> +#else
> +#endif

This is getting rather messy. Maybe CONFIG_CGROUP_SCHED should just
select CONFIG_SCHEDSTATS?

Thanks.

--
tejun

2013-06-06 23:28:11

by Tejun Heo

Subject: Re: [PATCH v7 06/11] sched: document the cpu cgroup.

Hello, Glauber.

On Wed, May 29, 2013 at 03:03:17PM +0400, Glauber Costa wrote:
> The CPU cgroup is, so far, undocumented. Although data exists in the
> Documentation directory about its functioning, it is usually spread,
> and/or presented in the context of something else. This file
> consolidates all cgroup-related information about it.
>
> Signed-off-by: Glauber Costa <[email protected]>

Reviewed-by: Tejun Heo <[email protected]>

Some minor points below.

> +Files
> +-----
> +
> +The CPU controller exposes the following files to the user:
> +
> + - cpu.shares: The weight of each group living in the same hierarchy, that
> + translates into the amount of CPU it is expected to get. Upon cgroup creation,
> + each group gets assigned a default of 1024. The percentage of CPU assigned to
> + the cgroup is the value of shares divided by the sum of all shares in all
> + cgroups in the same level.
> +
> + - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for
> + bandwidth decisions. This defaults to 100000us or 100ms. Larger periods will
> + improve throughput at the expense of latency, since the scheduler will be able
> + to sustain a cpu-bound workload for longer. The opposite of true for smaller
^
is?
> + periods. Note that this only affects non-RT tasks that are scheduled by the
> + CFS scheduler.
> +
> +- cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us
> + in for the current group will be allowed to run. For instance, if it is set to
^^^^^^^
in for? doesn't parse for me.

> + half of cfs_period_us, the cgroup will only be able to peak run for 50 % of
^^^^^^^^^
to run at maximum?

> + the time. One should note that this represents aggregate time over all CPUs
> + in the system. Therefore, in order to allow full usage of two CPUs, for
> + instance, one should set this value to twice the value of cfs_period_us.
> +
> +- cpu.stat: statistics about the bandwidth controls. No data will be presented
> + if cpu.cfs_quota_us is not set. The file presents three

Unnecessary line break?

> + numbers:
> + nr_periods: how many full periods have elapsed.
> + nr_throttled: number of times we exhausted the full allowed bandwidth
> + throttled_time: total time the tasks were not run due to being over quota
> +
> + - cpu.rt_runtime_us and cpu.rt_period_us: Those files are the RT-tasks
^^^^^
these

> + analogous to the CFS files cfs_quota_us and cfs_period_us. One important
^^^^^^^^^^^^
counterparts of?

> + difference, though, is that while the cfs quotas are upper bounds that
> + won't necessarily be met, the rt runtimes form a stricter guarantee.
^^^^^^^^^^^^^
runtimes are strict guarantees?

> + Therefore, no overlap is allowed. Implications of that are that given a
^^^^^^^
maybe overcommit is a better term?

> + hierarchy with multiple children, the sum of all rt_runtime_us may not exceed
> + the runtime of the parent. Also, a rt_runtime_us of 0, means that no rt tasks
^
prolly unnecessary

> + can ever be run in this cgroup. For more information about rt tasks runtime
> + assignments, see scheduler/sched-rt-group.txt
^^^^^^^^^^^
configuration?

> +
> + - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks
> + in this group.
> +
> + - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in
> + this group, separated by CPU. The format is a space-separated array of time
> + values, one for each present CPU.
> +
> + - cpuacct.stat: aggregate user and system time consumed by tasks in this group.
> + The format is
> + user: x
> + system: y

Thanks.

--
tejun

2013-06-06 23:48:36

by Tejun Heo

Subject: Re: [PATCH v7 07/11] sched: account guest time per-cgroup as well.

On Wed, May 29, 2013 at 03:03:18PM +0400, Glauber Costa wrote:
> We already track multiple tick statistics per-cgroup, using
> the task_group_account_field facility. This patch accounts
> guest_time in that manner as well.
>
> Signed-off-by: Glauber Costa <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> CC: Paul Turner <[email protected]>
> ---
> kernel/sched/cputime.c | 10 ++++------
> 1 file changed, 4 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/cputime.c b/kernel/sched/cputime.c
> index 74a44ef..e653e52 100644
> --- a/kernel/sched/cputime.c
> +++ b/kernel/sched/cputime.c
> @@ -183,8 +183,6 @@ void account_user_time(struct task_struct *p, cputime_t cputime,
> static void account_guest_time(struct task_struct *p, cputime_t cputime,
> cputime_t cputime_scaled)
> {
> - u64 *cpustat = kcpustat_this_cpu->cpustat;
> -
> /* Add guest time to process. */
> p->utime += cputime;
> p->utimescaled += cputime_scaled;
> @@ -193,11 +191,11 @@ static void account_guest_time(struct task_struct *p, cputime_t cputime,
>
> /* Add guest time to cpustat. */
> if (TASK_NICE(p) > 0) {
> - cpustat[CPUTIME_NICE] += (__force u64) cputime;
> - cpustat[CPUTIME_GUEST_NICE] += (__force u64) cputime;
> + task_group_account_field(p, CPUTIME_NICE, (__force u64) cputime);
> + task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);
> } else {
> - cpustat[CPUTIME_USER] += (__force u64) cputime;
> - cpustat[CPUTIME_GUEST] += (__force u64) cputime;
> + task_group_account_field(p, CPUTIME_USER, (__force u64) cputime);
> + task_group_account_field(p, CPUTIME_GUEST, (__force u64) cputime);

This is a behavior change, right? It seems sane to me but maybe the
patch description should be more explicit?

Thanks.

--
tejun

2013-06-06 23:56:53

by Tejun Heo

Subject: Re: [PATCH v7 08/11] sched: Push put_prev_task() into pick_next_task()

On Wed, May 29, 2013 at 03:03:19PM +0400, Glauber Costa wrote:
> From: Peter Zijlstra <[email protected]>
>
> In order to avoid having to do put/set on a whole cgroup hierarchy
> when we context switch, push the put into pick_next_task() so that
> both operations are in the same function. Further changes then allow
> us to possibly optimize away redundant work.

I suppose this is to optimize out css_put/get() if the previous and
next tasks are in the same group? css refcnt will soon (this week) be
converted to a percpu reference counter, so the optimization might not
be as meaningful unless there are other things which are being
optimized.

Thanks.

--
tejun

2013-06-07 00:04:57

by Tejun Heo

Subject: Re: [PATCH v7 09/11] sched: record per-cgroup number of context switches

Hello,

Maybe we should break off addition of switch stats to a separate set?
They are two separate things.

On Wed, May 29, 2013 at 03:03:20PM +0400, Glauber Costa wrote:
> @@ -3642,6 +3642,8 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> prev->sched_class->put_prev_task(rq, prev);
>
> do {
> + if (likely(prev))
> + cfs_rq->nr_switches++;
> se = pick_next_entity(cfs_rq);
> set_next_entity(cfs_rq, se);
> cfs_rq = group_cfs_rq(se);
> @@ -3651,6 +3653,22 @@ pick_next_task_fair(struct rq *rq, struct task_struct *prev)
> if (hrtick_enabled(rq))
> hrtick_start_fair(rq, p);
>
> + /*
> + * This condition is extremely unlikely, and most of the time will just
> + * consist of this unlikely branch, which is extremely cheap. But we
> + * still need to have it, because when we first loop through cfs_rq's,
> + * we can't possibly know which task we will pick. The call to
> + * set_next_entity above is not meant to mess up the tree in this case,
> + * so this should give us the same chain, in the same order.
> + */
> + if (unlikely(p == prev)) {
> + se = &p->se;
> + for_each_sched_entity(se) {
> + cfs_rq = cfs_rq_of(se);
> + cfs_rq->nr_switches--;
> + }
> + }
> +

This concern may be fringe but the above breaks the monotonically
increasing property of the stat. Depending on the timing, a very
unlucky consumer of the stat may see the counter going backward which
can lead to nasty things. I'm not sure whether the fact that it'd be
very difficult to trigger is a pro or con.

Thanks.

--
tejun

2013-06-07 00:06:42

by Tejun Heo

Subject: Re: [PATCH v7 00/11] per-cgroup cpu-stat

On Wed, May 29, 2013 at 03:03:11PM +0400, Glauber Costa wrote:
> I am coming again with this series, hoping this is a better time for you all to
> look at it.
>
> I am *not* going as far as marking cpuacct deprecated, because I think it
> deserves a special discussion (even though my position in this matter is widely
> known), but all the infrastructure to make it happen is here. But after this,
> it should be a matter of setting a flag (or not).

Things look generally good to me up to patch 6. Maybe we can combine
"adding cpuacct stats to cpu" and "re-using schedstats" but no strong
opinion. Peter, what do you think?

Thanks.

--
tejun