2012-10-03 23:07:20

by Andrea Righi

Subject: [PATCH RFC 0/3] per cpuset load average

[ Premise: I don't consider this patch set ready for inclusion, mostly
because I'm not at all convinced of the interface, and I don't even know if
cpuset is the right place for this feature. However, I just wrote it for a
project and I would like to improve it, integrate it better with the kernel
and share the code in general, in case someone else is also interested in
using this feature. ]

Overview
~~~~~~~~
The cpusets subsystem makes it possible to assign a different set of CPUs to a
cgroup. A typical use case is to split large systems into small CPU/memory
partitions and isolate certain users/applications in these subsets of the
system.

Sometimes, to get a quick overview of each partition's state, we may be
interested in the load average of the CPUs assigned to a particular cpuset,
rather than the global load average of the system.

Proposed solution
~~~~~~~~~~~~~~~~~
The proposal is to add a new file to the cpuset subsystem to report the load
average of the CPUs assigned to a particular cpuset cgroup.

Example:

# echo 0-1 > /sys/fs/cgroup/cpuset/foo/cpuset.cpus
# echo 2-3 > /sys/fs/cgroup/cpuset/bar/cpuset.cpus

# echo $$ > /sys/fs/cgroup/cpuset/foo/tasks
# for i in `seq 4`; do yes > /dev/null & done

... after ~5mins ...

# cat /proc/loadavg /sys/fs/cgroup/cpuset/{foo,bar}/cpuset.loadavg
3.99 2.66 1.24 6/377 2855
3.98 2.64 1.20
0.01 0.02 0.04

In this case we can easily see that the cpuset "foo" is the busiest in the
system.
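
To give an idea of how the file is meant to be consumed, here is a minimal
userspace sketch that reads the values back and prints the 1-minute figure
(the paths match the example above; everything else, including the helper
name, is purely illustrative):

/* Sketch only: read the three loadavg values of a cpuset and print
 * the 1-minute one. Error handling is kept to a minimum. */
#include <stdio.h>

static int read_loadavg(const char *path, double load[3])
{
	FILE *f = fopen(path, "r");
	int n;

	if (!f)
		return -1;
	n = fscanf(f, "%lf %lf %lf", &load[0], &load[1], &load[2]);
	fclose(f);
	return n == 3 ? 0 : -1;
}

int main(void)
{
	static const char *paths[] = {
		"/sys/fs/cgroup/cpuset/foo/cpuset.loadavg",
		"/sys/fs/cgroup/cpuset/bar/cpuset.loadavg",
	};
	unsigned int i;

	for (i = 0; i < sizeof(paths) / sizeof(paths[0]); i++) {
		double load[3];

		if (!read_loadavg(paths[i], load))
			printf("%s: %.2f (1 min)\n", paths[i], load[0]);
	}
	return 0;
}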

[PATCH RFC 1/3] sched: introduce distinct per-cpu load average
[PATCH RFC 2/3] cpusets: add load average interface
[PATCH RFC 3/3] cpusets: add documentation of the loadavg file

include/linux/sched.h | 7 +++++
kernel/cpuset.c | 58 ++++++++++++++++++++++++++++++++++++
kernel/sched/core.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 139 insertions(+), 4 deletions(-)


2012-10-03 23:07:26

by Andrea Righi

Subject: [PATCH RFC 3/3] cpusets: add documentation of the loadavg file

Signed-off-by: Andrea Righi <[email protected]>
---
Documentation/cgroups/cpusets.txt | 1 +
1 file changed, 1 insertion(+)

diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..d5ddc36 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -179,6 +179,7 @@ files describing that cpuset:
- cpuset.memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
- cpuset.sched_load_balance flag: if set, load balance within CPUs on that cpuset
- cpuset.sched_relax_domain_level: the searching range when migrating tasks
+ - cpuset.loadavg: the load average of the CPUs in that cpuset

In addition, only the root cpuset has the following file:
- cpuset.memory_pressure_enabled flag: compute memory_pressure?
--
1.7.9.5

2012-10-03 23:07:18

by Andrea Righi

Subject: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

Account a distinct load average for each cpu, as well as the per-cpu
nr_running and nr_uninterruptible task counts.

A new task_struct field, on_cpu_uninterruptible, is added to properly
keep track of the cpu on which the task entered the uninterruptible
sleep state.

This feature is required by the cpusets cgroup subsystem to report the
load average per-cpuset.

Signed-off-by: Andrea Righi <[email protected]>
---
include/linux/sched.h | 7 +++++
kernel/sched/core.c | 78 ++++++++++++++++++++++++++++++++++++++++++++++---
2 files changed, 81 insertions(+), 4 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9d51e26..fb3df1b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -119,7 +119,10 @@ struct blk_plug;
* 11 bit fractions.
*/
extern unsigned long avenrun[]; /* Load averages */
+extern unsigned long cpu_avenrun[][NR_CPUS] /* Load averages per cpu */;
extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift);
+extern void get_cpu_avenrun(unsigned long *loads, int cpu,
+ unsigned long offset, int shift);

#define FSHIFT 11 /* nr of bits of precision */
#define FIXED_1 (1<<FSHIFT) /* 1.0 as fixed-point */
@@ -138,7 +141,9 @@ extern int nr_threads;
DECLARE_PER_CPU(unsigned long, process_counts);
extern int nr_processes(void);
extern unsigned long nr_running(void);
+extern unsigned long nr_running_cpu(int cpu);
extern unsigned long nr_uninterruptible(void);
+extern unsigned long nr_uninterruptible_cpu(int cpu);
extern unsigned long nr_iowait(void);
extern unsigned long nr_iowait_cpu(int cpu);
extern unsigned long this_cpu_load(void);
@@ -1239,6 +1244,8 @@ struct task_struct {
#ifdef CONFIG_SMP
struct llist_node wake_entry;
int on_cpu;
+ /* Used to keep track of nr_uninterruptible tasks per-cpu */
+ int on_cpu_uninterruptible;
#endif
int on_rq;

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c177472..87a8e62 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -727,15 +727,17 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
void activate_task(struct rq *rq, struct task_struct *p, int flags)
{
if (task_contributes_to_load(p))
- rq->nr_uninterruptible--;
+ cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible--;

enqueue_task(rq, p, flags);
}

void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
{
- if (task_contributes_to_load(p))
- rq->nr_uninterruptible++;
+ if (task_contributes_to_load(p)) {
+ task_rq(p)->nr_uninterruptible++;
+ p->on_cpu_uninterruptible = task_cpu(p);
+ }

dequeue_task(rq, p, flags);
}
@@ -1278,7 +1280,7 @@ ttwu_do_activate(struct rq *rq, struct task_struct *p, int wake_flags)
{
#ifdef CONFIG_SMP
if (p->sched_contributes_to_load)
- rq->nr_uninterruptible--;
+ cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible--;
#endif

ttwu_activate(rq, p, ENQUEUE_WAKEUP | ENQUEUE_WAKING);
@@ -1916,6 +1918,11 @@ unsigned long nr_running(void)
return sum;
}

+unsigned long nr_running_cpu(int cpu)
+{
+ return cpu_rq(cpu)->nr_running;
+}
+
unsigned long nr_uninterruptible(void)
{
unsigned long i, sum = 0;
@@ -1933,6 +1940,11 @@ unsigned long nr_uninterruptible(void)
return sum;
}

+unsigned long nr_uninterruptible_cpu(int cpu)
+{
+ return cpu_rq(cpu)->nr_uninterruptible;
+}
+
unsigned long long nr_context_switches(void)
{
int i;
@@ -2035,6 +2047,9 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
loads[2] = (avenrun[2] + offset) << shift;
}

+unsigned long cpu_avenrun[3][NR_CPUS] __cacheline_aligned_in_smp;
+EXPORT_SYMBOL(cpu_avenrun);
+
static long calc_load_fold_active(struct rq *this_rq)
{
long nr_active, delta = 0;
@@ -2062,6 +2077,24 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
return load >> FSHIFT;
}

+static void calc_global_load_percpu(void)
+{
+ long active;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ active = cpu_rq(cpu)->calc_load_active;
+ active = active > 0 ? active * FIXED_1 : 0;
+
+ cpu_avenrun[0][cpu] = calc_load(cpu_avenrun[0][cpu],
+ EXP_1, active);
+ cpu_avenrun[1][cpu] = calc_load(cpu_avenrun[1][cpu],
+ EXP_5, active);
+ cpu_avenrun[2][cpu] = calc_load(cpu_avenrun[2][cpu],
+ EXP_15, active);
+ }
+}
+
#ifdef CONFIG_NO_HZ
/*
* Handle NO_HZ for the global load-average.
@@ -2248,6 +2281,23 @@ calc_load_n(unsigned long load, unsigned long exp,
return calc_load(load, fixed_power_int(exp, FSHIFT, n), active);
}

+static void calc_global_load_n_percpu(unsigned int n)
+{
+ long active;
+ int cpu;
+
+ for_each_online_cpu(cpu) {
+ active = cpu_rq(cpu)->calc_load_active;
+ active = active > 0 ? active * FIXED_1 : 0;
+
+ cpu_avenrun[0][cpu] = calc_load_n(cpu_avenrun[0][cpu],
+ EXP_1, active, n);
+ cpu_avenrun[1][cpu] = calc_load_n(cpu_avenrun[1][cpu],
+ EXP_5, active, n);
+ cpu_avenrun[2][cpu] = calc_load_n(cpu_avenrun[2][cpu],
+ EXP_15, active, n);
+ }
+}
/*
* NO_HZ can leave us missing all per-cpu ticks calling
* calc_load_account_active(), but since an idle CPU folds its delta into
@@ -2275,6 +2325,8 @@ static void calc_global_nohz(void)
avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

+ calc_global_load_n_percpu(n);
+
calc_load_update += n * LOAD_FREQ;
}

@@ -2320,6 +2372,8 @@ void calc_global_load(unsigned long ticks)
avenrun[1] = calc_load(avenrun[1], EXP_5, active);
avenrun[2] = calc_load(avenrun[2], EXP_15, active);

+ calc_global_load_percpu();
+
calc_load_update += LOAD_FREQ;

/*
@@ -2328,6 +2382,22 @@ void calc_global_load(unsigned long ticks)
calc_global_nohz();
}

+/**
+ * get_cpu_avenrun - get the load average array of a single cpu
+ * @loads: pointer to dest load array
+ * @cpu: the cpu to read the load average
+ * @offset: offset to add
+ * @shift: shift count to shift the result left
+ *
+ * These values are estimates at best, so no need for locking.
+ */
+void get_cpu_avenrun(unsigned long *loads, int cpu,
+ unsigned long offset, int shift)
+{
+ loads[0] = (cpu_avenrun[0][cpu] + offset) << shift;
+ loads[1] = (cpu_avenrun[1][cpu] + offset) << shift;
+ loads[2] = (cpu_avenrun[2][cpu] + offset) << shift;
+}
/*
* Called from update_cpu_load() to periodically update this CPU's
* active count.
--
1.7.9.5

2012-10-03 23:07:56

by Andrea Righi

Subject: [PATCH RFC 2/3] cpusets: add load average interface

Add a new file, loadavg, to report the load average of the CPUs assigned
to the cpuset cgroup.

The load average is reported using the typical three values as they
appear in /proc/loadavg, averaged over 1, 5 and 15 minutes.

Example:
# cat /sys/fs/cgroup/cpuset/foo/cpuset.loadavg
3.98 2.64 1.20

Signed-off-by: Andrea Righi <[email protected]>
---
kernel/cpuset.c | 58 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 58 insertions(+)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f33c715..fef6d3e 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1465,6 +1465,7 @@ typedef enum {
FILE_MEMORY_PRESSURE,
FILE_SPREAD_PAGE,
FILE_SPREAD_SLAB,
+ FILE_LOADAVG,
} cpuset_filetype_t;

static int cpuset_write_u64(struct cgroup *cgrp, struct cftype *cft, u64 val)
@@ -1686,6 +1687,57 @@ static s64 cpuset_read_s64(struct cgroup *cont, struct cftype *cft)
return 0;
}

+/*
+ * XXX: move all of this to a better place and unify the different
+ * re-definitions of these macros.
+ */
+#define LOAD_INT(x) ((x) >> FSHIFT)
+#define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100)
+
+static void cpuset_show_loadavg(struct seq_file *m, const struct cpuset *cs)
+{
+ unsigned long avnrun[3] = {};
+ int cpu;
+
+ for_each_cpu(cpu, cs->cpus_allowed) {
+ unsigned long cpu_avnrun[3];
+ int i;
+
+ get_cpu_avenrun(cpu_avnrun, cpu, FIXED_1/200, 0);
+
+ for (i = 0; i < ARRAY_SIZE(cpu_avnrun); i++)
+ avnrun[i] += cpu_avnrun[i];
+ }
+ /*
+ * TODO: also report nr_running/nr_threads and last_pid, producing the
+ * same output as /proc/loadavg.
+ *
+ * For nr_running we can just sum the nr_running_cpu() of the cores
+ * assigned to this cs; what should we report in nr_threads? maybe
+ * cgroup_task_count()? and what about last_pid?
+ */
+ seq_printf(m, "%lu.%02lu %lu.%02lu %lu.%02lu\n",
+ LOAD_INT(avnrun[0]), LOAD_FRAC(avnrun[0]),
+ LOAD_INT(avnrun[1]), LOAD_FRAC(avnrun[1]),
+ LOAD_INT(avnrun[2]), LOAD_FRAC(avnrun[2]));
+}
+
+static int cpuset_read_seq_string(struct cgroup *cont, struct cftype *cft,
+ struct seq_file *m)
+{
+ struct cpuset *cs = cgroup_cs(cont);
+ cpuset_filetype_t type = cft->private;
+
+ switch (type) {
+ case FILE_LOADAVG:
+ cpuset_show_loadavg(m, cs);
+ break;
+ default:
+ BUG();
+ }
+
+ return 0;
+}

/*
* for the common functions, 'private' gives the type of file
@@ -1780,6 +1832,12 @@ static struct cftype files[] = {
.private = FILE_MEMORY_PRESSURE_ENABLED,
},

+ {
+ .name = "loadavg",
+ .read_seq_string = cpuset_read_seq_string,
+ .private = FILE_LOADAVG,
+ },
+
{ } /* terminate */
};

--
1.7.9.5

2012-10-04 09:00:36

by Peter Zijlstra

Subject: Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

On Thu, 2012-10-04 at 01:05 +0200, Andrea Righi wrote:
> +++ b/kernel/sched/core.c
> @@ -727,15 +727,17 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> void activate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> if (task_contributes_to_load(p))
> - rq->nr_uninterruptible--;
> + cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible--;
>
> enqueue_task(rq, p, flags);
> }

That's completely broken, you cannot do non-atomic cross-cpu
modifications like that. Also, adding an atomic op to the wakeup/sleep
paths isn't going to be popular at all.

> void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> {
> - if (task_contributes_to_load(p))
> - rq->nr_uninterruptible++;
> + if (task_contributes_to_load(p)) {
> + task_rq(p)->nr_uninterruptible++;
> + p->on_cpu_uninterruptible = task_cpu(p);
> + }
>
> dequeue_task(rq, p, flags);
> }

This looks pointless, at deactivate time task_rq() had better be rq or
something is terribly broken.

2012-10-04 09:43:57

by Andrea Righi

Subject: Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

On Thu, Oct 04, 2012 at 10:59:46AM +0200, Peter Zijlstra wrote:
> On Thu, 2012-10-04 at 01:05 +0200, Andrea Righi wrote:
> > +++ b/kernel/sched/core.c
> > @@ -727,15 +727,17 @@ static void dequeue_task(struct rq *rq, struct task_struct *p, int flags)
> > void activate_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > if (task_contributes_to_load(p))
> > - rq->nr_uninterruptible--;
> > + cpu_rq(p->on_cpu_uninterruptible)->nr_uninterruptible--;
> >
> > enqueue_task(rq, p, flags);
> > }
>
> That's completely broken, you cannot do non-atomic cross-cpu
> modifications like that. Also, adding an atomic op to the wakeup/sleep
> paths isn't going to be popular at all.

Right, the update must be atomic to have a coherent nr_uninterruptible
value. And AFAICS the only way to account a coherent nr_uninterruptible
value per-cpu is to go with atomic ops... mmh... I'll think more on
this.

>
> > void deactivate_task(struct rq *rq, struct task_struct *p, int flags)
> > {
> > - if (task_contributes_to_load(p))
> > - rq->nr_uninterruptible++;
> > + if (task_contributes_to_load(p)) {
> > + task_rq(p)->nr_uninterruptible++;
> > + p->on_cpu_uninterruptible = task_cpu(p);
> > + }
> >
> > dequeue_task(rq, p, flags);
> > }
>
> This looks pointless, at deactivate time task_rq() had better be rq or
> something is terribly broken.

Correct, I didn't realize that, sorry.

Many thanks for your review, Peter.

-Andrea

2012-10-04 12:12:59

by Peter Zijlstra

Subject: Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

On Thu, 2012-10-04 at 11:43 +0200, Andrea Righi wrote:
>
> Right, the update must be atomic to have a coherent nr_uninterruptible
> value. And AFAICS the only way to account a coherent
> nr_uninterruptible
> value per-cpu is to go with atomic ops... mmh... I'll think more on
> this.

You could stick it in the cpu controller instead of cpuset, add a
per-cpu nr_uninterruptible counter to struct task_group and update it
from the enqueue/dequeue paths. Those already are per-cgroup (through
cfs_rq, which has a tg pointer).

That would also give you better semantics since it would really be the
load of the tasks of the cgroup, not whatever happened to run on a
particular cpu regardless of groups. Then again, it might be 'fun' to
get the hierarchical semantics right :-)

OTOH it would also make calculating the load-avg O(nr_cgroups) and since
we do this from the tick and people are known to create a shitload (on
the order of 1e3 and upwards) of those this might not actually be a very
good idea.

Also, your patch 2 relies on the load avg function to be additive yet
you completely fail to mention this and state whether this is so or
not.

Furthermore, please look at PER_CPU() and friends as alternatives to
[NR_CPUS] arrays.

2012-10-04 17:19:55

by Andrea Righi

Subject: Re: [PATCH RFC 1/3] sched: introduce distinct per-cpu load average

On Thu, Oct 04, 2012 at 02:12:08PM +0200, Peter Zijlstra wrote:
> On Thu, 2012-10-04 at 11:43 +0200, Andrea Righi wrote:
> >
> > Right, the update must be atomic to have a coherent nr_uninterruptible
> > value. And AFAICS the only way to account a coherent
> > nr_uninterruptible
> > value per-cpu is to go with atomic ops... mmh... I'll think more on
> > this.
>
> You could stick it in the cpu controller instead of cpuset, add a
> per-cpu nr_uninterruptible counter to struct task_group and update it
> from the enqueue/dequeue paths. Those already are per-cgroup (through
> cfs_rq, which has a tg pointer).
>
> That would also give you better semantics since it would really be the
> load of the tasks of the cgroup, not whatever happened to run on a
> particular cpu regardless of groups. Then again, it might be 'fun' to
> get the hierarchical semantics right :-)
>
> OTOH it would also make calculating the load-avg O(nr_cgroups) and since
> we do this from the tick and people are known to create a shitload (on
> the order of 1e3 and upwards) of those this might not actually be a very
> good idea.

That would be an interesting path to explore, although my concern is with
the large hosting companies that want to create something like a cpu
cgroup for each user. In that case we may run into big scalability
issues. Maintaining all the required stats per-cpu seems a more scalable
solution to me (except probably in the large SMP case...).
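
Just to check that I'm reading the task_group suggestion correctly, it
would look more or less like the rough sketch below (the field and helper
names are made up, nothing like this exists yet):

struct task_group {
	/* ... existing fields ... */
	long __percpu *nr_uninterruptible;	/* alloc_percpu(long) */
};

/* enqueue/dequeue paths: plain per-cpu update, no atomic ops needed */
static inline void tg_update_uninterruptible(struct task_group *tg, long delta)
{
	__this_cpu_add(*tg->nr_uninterruptible, delta);
}

/* reader side: only the sum over all CPUs is meaningful, the per-cpu
 * deltas can go negative when a task sleeps and wakes on different CPUs */
static long tg_nr_uninterruptible(struct task_group *tg)
{
	long sum = 0;
	int cpu;

	for_each_possible_cpu(cpu)
		sum += *per_cpu_ptr(tg->nr_uninterruptible, cpu);

	return sum;
}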

I wonder if it is worth defining rq->nr_uninterruptible as a pointer to
percpu data rather than converting it to an atomic var... but that would
be even worse on large SMP systems, especially for those that are not
interested in the loadavg feature.

>
> Also, your patch 2 relies on the load avg function to be additive yet
> you completely fail to mention this and state whether this is so or
> not.

Correct, I'll include a more detailed description in the next version.
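
For the record, below is a standalone userspace sketch of the additivity
that patch 2 relies on when it sums the cpu_avenrun[] entries of a cpuset.
The calc_load() here mirrors the basic fixed-point form (the in-kernel
version may differ in rounding details), and the constants are the ones
from include/linux/sched.h:

#include <stdio.h>

#define FSHIFT	11			/* nr of bits of precision */
#define FIXED_1	(1 << FSHIFT)		/* 1.0 as fixed-point */
#define EXP_1	1884			/* 1/exp(5sec/1min) as fixed-point */

#define LOAD_INT(x)	((x) >> FSHIFT)
#define LOAD_FRAC(x)	LOAD_INT(((x) & (FIXED_1 - 1)) * 100)

static unsigned long calc_load(unsigned long load, unsigned long exp,
			       unsigned long active)
{
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}

int main(void)
{
	/* two hypothetical CPUs: current averages and active task counts */
	unsigned long load0 = 2 * FIXED_1, active0 = 3 * FIXED_1;
	unsigned long load1 = 1 * FIXED_1, active1 = 0;

	unsigned long sum_of_per_cpu = calc_load(load0, EXP_1, active0) +
				       calc_load(load1, EXP_1, active1);
	unsigned long global = calc_load(load0 + load1, EXP_1,
					 active0 + active1);

	/* the two agree up to per-CPU truncation of the low FSHIFT bits */
	printf("sum of per-cpu avgs: %lu.%02lu, global avg: %lu.%02lu\n",
	       LOAD_INT(sum_of_per_cpu), LOAD_FRAC(sum_of_per_cpu),
	       LOAD_INT(global), LOAD_FRAC(global));
	return 0;
}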

>
> Furthermore, please look at PER_CPU() and friends as alternatives to
> [NR_CPUS] arrays.

Will do.

Thanks again for your suggestions.

-Andrea