2012-06-09 10:55:45

by Charles Wang

Subject: [PATCH] sched: Folding nohz load accounting more accurate

After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
the idle into calc_load_tasks_idle between the last cpu load calculation and
the calc_global_load call. However, the problem still exists between the first
cpu load calculation and the last cpu load calculation. Every time we do a load
calculation, calc_load_tasks_idle will be added into calc_load_tasks, even if
the idle load was contributed by cpus whose load has already been calculated.
This problem is also described in the following link:

https://lkml.org/lkml/2012/5/24/419

This bug can be found in our workload. The average number of running processes
is about 15, but the load only shows about 4.

The patch provides a solution, by taking the calculated-load cpus' idle away
from the real effective idle. First it adds a cpumask to record those cpus that
have already calculated their load, and then adds a calc_unmask_cpu_load_idle
to record the go-idle load of the cpus not yet marked. Calc_unmask_cpu_load_idle
takes the place of calc_load_tasks_idle to be added into calc_load_tasks every
5HZ when a cpu calculates its load. Go-idle load on those cpus whose load has
already been calculated will only be added into calc_load_tasks_idle, not into
calc_unmask_cpu_load_idle.

Reported-by: Sha Zhengju <[email protected]>
Signed-off-by: Charles Wang <[email protected]>

---
include/linux/sched.h | 1 +
kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
kernel/time/timekeeping.c | 1 +
3 files changed, 84 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 6029d8c..a2b8df2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);


extern void calc_global_load(unsigned long ticks);
+extern void prepare_idle_mask(unsigned long ticks);
extern void update_cpu_load_nohz(void);

extern unsigned long get_parent_ip(unsigned long addr);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c46958e..bdfe3c2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
/* Variables and functions for calc_load */
static atomic_long_t calc_load_tasks;
static unsigned long calc_load_update;
+static unsigned long idle_mask_update;
unsigned long avenrun[3];
EXPORT_SYMBOL(avenrun);

@@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
*/
static atomic_long_t calc_load_tasks_idle;

+/*
+ * Those cpus whose load already has been calculated in this LOAD_FREQ
+ * period will be masked.
+ */
+struct cpumask cpu_load_update_mask;
+
+/*
+ * Fold unmask cpus' idle load
+ */
+static atomic_long_t calc_unmask_cpu_load_idle;
+
+
void calc_load_account_idle(struct rq *this_rq)
{
long delta;
+ int cpu = smp_processor_id();
+

delta = calc_load_fold_active(this_rq);
if (delta)
+ {
atomic_long_add(delta, &calc_load_tasks_idle);
+ /*
+ * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
+ * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
+ * those unmask cpus.
+ */
+ if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
+ {
+ atomic_long_add(delta, &calc_unmask_cpu_load_idle);
+ }
+ }
}

static long calc_load_fold_idle(void)
@@ -2221,6 +2247,20 @@ static long calc_load_fold_idle(void)
return delta;
}

+static long calc_load_fold_unmask_idle(void)
+{
+ long delta = 0;
+
+ if (atomic_long_read(&calc_unmask_cpu_load_idle))
+ {
+ delta = atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+ atomic_long_sub(delta, &calc_load_tasks_idle);
+ }
+
+ return delta;
+}
+
+
/**
* fixed_power_int - compute: x^n, in O(log n) time
*
@@ -2312,6 +2352,9 @@ static void calc_global_nohz(void)
if (delta)
atomic_long_add(delta, &calc_load_tasks);

+ cpumask_clear(&cpu_load_update_mask);
+ atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+
/*
* It could be the one fold was all it took, we done!
*/
@@ -2395,18 +2438,54 @@ void calc_global_load(unsigned long ticks)
}

/*
+ * Prepare cpu_load_update_mask for the coming per-cpu load calculation
+ */
+void prepare_idle_mask(unsigned long ticks)
+{
+ if (time_before(jiffies, idle_mask_update - 10))
+ return;
+
+ cpumask_clear(&cpu_load_update_mask);
+ /*
+ * calc_unmask_cpu_load_idle is part of calc_load_tasks_idle,
+ * and calc_load_tasks_idle will be folded into calc_load_tasks immediately.
+ * So no need to keep this now.
+ */
+ atomic_long_xchg(&calc_unmask_cpu_load_idle, 0);
+
+ idle_mask_update += LOAD_FREQ;
+}
+
+/*
* Called from update_cpu_load() to periodically update this CPU's
* active count.
*/
static void calc_load_account_active(struct rq *this_rq)
{
long delta;
+ int cpu = smp_processor_id();

if (time_before(jiffies, this_rq->calc_load_update))
return;

+ /*
+ * cpu_load_update_mask empty means the first cpu
+ * doing load calculating. Global idle should be
+ * folded into calc_load_tasks, so we just push it
+ * to calc_unmask_cpu_load_idle.
+ */
+ if (cpumask_empty(&cpu_load_update_mask))
+ atomic_long_set(&calc_unmask_cpu_load_idle, atomic_long_read(&calc_load_tasks_idle));
+ /*
+ * Mask this cpu as load calculated,
+ * then go-idle in this cpu won't take effect
+ * to calc_load_tasks.
+ */
+ cpumask_set_cpu(cpu, &cpu_load_update_mask);
+
delta = calc_load_fold_active(this_rq);
- delta += calc_load_fold_idle();
+ /* Fold unmask cpus' load into calc_load_tasks */
+ delta += calc_load_fold_unmask_idle();
if (delta)
atomic_long_add(delta, &calc_load_tasks);

@@ -7100,6 +7179,8 @@ void __init sched_init(void)

calc_load_update = jiffies + LOAD_FREQ;

+ idle_mask_update = jiffies + LOAD_FREQ;
+
/*
* During early bootup we pretend to be a normal task:
*/
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 6e46cac..afbc06a 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
jiffies_64 += ticks;
update_wall_time();
calc_global_load(ticks);
+ prepare_idle_mask(ticks);
}

/**
--
1.7.9.5


2012-06-11 15:43:10

by Peter Zijlstra

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
> the idle into calc_load_tasks_idle between the last cpu load calculating and
> calc_global_load calling. However problem still exits between the first cpu
> load calculating and the last cpu load calculating. Every time when we do load
> calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
> the idle load is caused by calculated cpus. This problem is also described in
> the following link:
>
> https://lkml.org/lkml/2012/5/24/419

Don't put links in like this, if its useful describe it. As it stands I
think your paragraph and that link are trying to say the same.

Also, since you found Tao's email, why didn't you CC him?

I find it hard to understand either description.

Are you in fact saying:


  cpu0                cpu1                cpu2

  calc_load_account_active()

                      calc_load_account_active()
                        -> idle
              ,---- calc_load_account_idle()
              |
              |                           calc_load_account_active()
              `---------> calc_load_fold_idle()


That when a cpu goes idle during the per-cpu active folding window
it might happen that a cpu goes idle after its fold, but before a next
cpu's fold, so that its calc_load_tasks_idle contribution gets folded
back in. As per the above picture.

Now _why_ is this a problem?

All this code tries to do is:

  nr_active = 0;
  for_each_possible_cpu(cpu)
          nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)->nr_uninterruptible;

Without actually iterating all cpus because on large systems that's
prohibitively expensive.

So instead we do the calc_load_fold_active() thing on each individual
cpu:

nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
delta = nr_active - this_rq->calc_load_active;
this_rq->calc_load_active = nr_active;

atomic_long_add(delta, &calc_load_tasks);

The first does a straight sum, the second does a sum of deltas, both
should give the same number.


\Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
= \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
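
For reference, the per-cpu 'sum of deltas' above is what the existing
calc_load_fold_active() helper computes; a sketch of the 3.x-era version
(details may differ slightly between trees):

static long calc_load_fold_active(struct rq *this_rq)
{
	long nr_active, delta = 0;

	nr_active = this_rq->nr_running;
	nr_active += (long) this_rq->nr_uninterruptible;

	/* only report the change since the last time this cpu folded */
	if (nr_active != this_rq->calc_load_active) {
		delta = nr_active - this_rq->calc_load_active;
		this_rq->calc_load_active = nr_active;
	}

	return delta;
}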

The whole NO_HZ thing complicates this a little further in order to not
have to wake an idle cpu. I'll try and do the above sum with the nohz
thing in.. maybe something obvious falls out.

Now all this is sampling.. we take a nr_active sample once every
LOAD_FREQ (5s). Anything that happens inside this doesn't exist.
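
Each of those 5-second samples then feeds the avenrun[] fixed-point
exponential averages; for reference, a sketch of the 3.x-era calc_load()
filter:

static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
	/* load' = load * (exp/FIXED_1) + active * (1 - exp/FIXED_1) */
	load *= exp;
	load += active * (FIXED_1 - exp);
	return load >> FSHIFT;
}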

So how is the above scenario different from the cpu going idle right
before we take the sample?

> This bug can be found in our work load. The average running processes number
> is about 15, but the load only shows about 4.

Right,. still not really sure what the bug is though.

> The patch provides a solution, by taking calculated load cpus' idle away from
> real effective idle.

-ENOPARSE

> First adds a cpumask to record those cpus that alread
> calculated their load, and then adds a calc_unmask_cpu_load_idle to record
> thoses not marked cpus' go-idle load. Calc_unmask_cpu_load_idle takes place
> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ

5 seconds

> when cpu
> calculate its load. Go-idle load on those cpus which load alread has been calculated
> will only be added into calc_load_tasks_idle, no in calc_unmask_cpu_load_idle.

I don't like the solution, keeping that mask is expensive. Also, I still
don't understand what the problem is.

A few nits on the patch below..

> ---
> include/linux/sched.h | 1 +
> kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
> kernel/time/timekeeping.c | 1 +
> 3 files changed, 84 insertions(+), 1 deletion(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 6029d8c..a2b8df2 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>
>
> extern void calc_global_load(unsigned long ticks);
> +extern void prepare_idle_mask(unsigned long ticks);
> extern void update_cpu_load_nohz(void);
>
> extern unsigned long get_parent_ip(unsigned long addr);
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c46958e..bdfe3c2 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
> /* Variables and functions for calc_load */
> static atomic_long_t calc_load_tasks;
> static unsigned long calc_load_update;
> +static unsigned long idle_mask_update;
> unsigned long avenrun[3];
> EXPORT_SYMBOL(avenrun);
>
> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
> */
> static atomic_long_t calc_load_tasks_idle;
>
> +/*
> + * Those cpus whose load alread has been calculated in this LOAD_FREQ
> + * period will be masked.
> + */
> +struct cpumask cpu_load_update_mask;

This should be a cpumask_var_t and allocated somewhere..

> +
> +/*
> + * Fold unmask cpus' idle load
> + */
> +static atomic_long_t calc_unmask_cpu_load_idle;
> +
> +
> void calc_load_account_idle(struct rq *this_rq)
> {
> long delta;
> + int cpu = smp_processor_id();
> +
>
> delta = calc_load_fold_active(this_rq);
> if (delta)
> + {

wrong bracket placement

> atomic_long_add(delta, &calc_load_tasks_idle);
> + /*
> + * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
> + * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
> + * those unmask cpus.
> + */
> + if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
> + {
> + atomic_long_add(delta, &calc_unmask_cpu_load_idle);

lines are too long, brackets are placed wrong and brackets are
superfluous.


> + }
> + }
> }
>

and many more similar nits.

> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
> index 6e46cac..afbc06a 100644
> --- a/kernel/time/timekeeping.c
> +++ b/kernel/time/timekeeping.c
> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
> jiffies_64 += ticks;
> update_wall_time();
> calc_global_load(ticks);
> + prepare_idle_mask(ticks);

Why not add this to calc_global_load() ?

> }
>
> /**

2012-06-12 08:54:47

by Peter Zijlstra

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate


Please try again without HTML..

2012-06-12 09:34:18

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

OK! I configured Thunderbird as Documentation/email-clients.txt describes.
Sorry the last email contained HTML; no HTML now.

On Monday, June 11, 2012 11:42 PM, Peter Zijlstra wrote:

> On Sat, 2012-06-09 at 18:54 +0800, Charles Wang wrote:
>> After patch 453494c3d4 (sched: Fix nohz load accounting -- again!), we can fold
>> the idle into calc_load_tasks_idle between the last cpu load calculating and
>> calc_global_load calling. However problem still exits between the first cpu
>> load calculating and the last cpu load calculating. Every time when we do load
>> calculating, calc_load_tasks_idle will be added into calc_load_tasks, even if
>> the idle load is caused by calculated cpus. This problem is also described in
>> the following link:
>>
>> https://lkml.org/lkml/2012/5/24/419
>
> Don't put links in like this, if its useful describe it. As it stands I
> think your paragraph and that link are trying to say the same.
>
> Also, since you found Tao's email, why didn't you CC him?
>

My mistake! And this problem was first reported by Sha Zhengju
<[email protected]>, so I also added her in.

> I find it hard to understand either description.
>
> Are you in fact saying:
>
>
> cpu0 cpu1 cpu2
>
> calc_load_account_active()
>
> calc_load_account_active()
> -> idle
> ,---- calc_load_account_idle()
> |
> | calc_load_account_active()
> `---------> calc_load_fold_idle()
>
>
> That when a cpu goes idle during the per-cpu active folding window
> it might happen that a cpu goes idle after its fold, but before a next
> cpu's fold, so that its calc_load_tasks_idle contribution gets folded
> back in. As per the above picture.
>
> Now _why_ is this a problem?
>

consider the following case:

5HZ+1
  |   cpu0_load   cpu1   cpu2   cpu3    calc_load_tasks
  |       1         1      1      1
  |   -->calc_load                             1
  |       1         1      1      1
  |   -->calc_load                             2
  |       0         0      1      0
  |   -->calc_load                             2+1-3=1
  |       1         1      0      1
  |   -->calc_load                             1-1=0
  V
5HZ+11 -->calc_global_load                     0



actually the load should be around 3, but shows nearly 0.

1 tick is much long for some workloads.

> All this code tries to do is:
>
> nr_active = 0;
> for_each_possible_cpu(cpu)
> nr_active += cpu_rq(cpu)->nr_running + cpu_rq(cpu)
>
> Without actually iterating all cpus because on large systems that's
> prohibitively expensive.
>
> So instead we do the calc_load_fold_active() thing on each individual
> cpu:
>
> nr_active = this_rq->nr_running + this_rq->nr_uninterruptible;
> delta = nr_active - this_rq->calc_load_active;
> this_rq->calc_load_active = nr_active;
>
> atomic_long_add(delta, &calc_load_tasks);
>
> The first does a straight sum, the second does a sum of deltas, both
> should give the same number.
>
>
> \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
> = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
>
> The whole NO_HZ thing complicates this a little further in order to not
> have to wake an idle cpu. I'll try and do the above sum with the nohz
> thing in.. maybe something obvious falls out.
>
> Now all this is sampling.. we take a nr_active sample once every
> LOAD_FREQ (5s). Anything that happens inside this doesn't exist.
>
> So how is the above scenario different from the cpu going idle right
> before we take the sample?
>
>> This bug can be found in our work load. The average running processes number
>> is about 15, but the load only shows about 4.
>
> Right,. still not really sure what the bug is though.
>
>> The patch provides a solution, by taking calculated load cpus' idle away from
>> real effective idle.
>
> -ENOPARSE
>
>> First adds a cpumask to record those cpus that alread
>> calculated their load, and then adds a calc_unmask_cpu_load_idle to record
>> thoses not marked cpus' go-idle load. Calc_unmask_cpu_load_idle takes place
>> of calc_load_tasks_idle to be added into calc_load_tasks every 5HZ
>
> 5 seconds
>
>> when cpu
>> calculate its load. Go-idle load on those cpus which load alread has been calculated
>> will only be added into calc_load_tasks_idle, no in calc_unmask_cpu_load_idle.
>
> I don't like the solution, keeping that mask is expensive. Also, I still
> don't understand what the problem is.
>

It's a problem. And I plan to use a per-cpu variable in place of the cpumask.

> A few nits on the patch below..
>
>> ---
>> include/linux/sched.h | 1 +
>> kernel/sched/core.c | 83 ++++++++++++++++++++++++++++++++++++++++++++-
>> kernel/time/timekeeping.c | 1 +
>> 3 files changed, 84 insertions(+), 1 deletion(-)
>>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 6029d8c..a2b8df2 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -145,6 +145,7 @@ extern unsigned long this_cpu_load(void);
>>
>>
>> extern void calc_global_load(unsigned long ticks);
>> +extern void prepare_idle_mask(unsigned long ticks);
>> extern void update_cpu_load_nohz(void);
>>
>> extern unsigned long get_parent_ip(unsigned long addr);
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c46958e..bdfe3c2 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2164,6 +2164,7 @@ unsigned long this_cpu_load(void)
>> /* Variables and functions for calc_load */
>> static atomic_long_t calc_load_tasks;
>> static unsigned long calc_load_update;
>> +static unsigned long idle_mask_update;
>> unsigned long avenrun[3];
>> EXPORT_SYMBOL(avenrun);
>>
>> @@ -2199,13 +2200,38 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
>> */
>> static atomic_long_t calc_load_tasks_idle;
>>
>> +/*
>> + * Those cpus whose load alread has been calculated in this LOAD_FREQ
>> + * period will be masked.
>> + */
>> +struct cpumask cpu_load_update_mask;
>
> This should be a cpumask_var_t and allocated somewhere..
>
>> +
>> +/*
>> + * Fold unmask cpus' idle load
>> + */
>> +static atomic_long_t calc_unmask_cpu_load_idle;
>> +
>> +
>> void calc_load_account_idle(struct rq *this_rq)
>> {
>> long delta;
>> + int cpu = smp_processor_id();
>> +
>>
>> delta = calc_load_fold_active(this_rq);
>> if (delta)
>> + {
>
> wrong bracket placement
>
>> atomic_long_add(delta, &calc_load_tasks_idle);
>> + /*
>> + * calc_unmask_cpu_load_idle is only used between the first cpu load accounting
>> + * and the last cpu load accounting in every LOAD_FREQ period, and records idle load on
>> + * those unmask cpus.
>> + */
>> + if (!cpumask_empty(&cpu_load_update_mask) && !cpumask_test_cpu(cpu, &cpu_load_update_mask))
>> + {
>> + atomic_long_add(delta, &calc_unmask_cpu_load_idle);
>
> lines are too long, brackets are placed wrong and brackets are
> superfluous.
>
>
>> + }
>> + }
>> }
>>
>
> and many more similar nits.
>

Thanks for pointing these out!

>> diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
>> index 6e46cac..afbc06a 100644
>> --- a/kernel/time/timekeeping.c
>> +++ b/kernel/time/timekeeping.c
>> @@ -1222,6 +1222,7 @@ void do_timer(unsigned long ticks)
>> jiffies_64 += ticks;
>> update_wall_time();
>> calc_global_load(ticks);
>> + prepare_idle_mask(ticks);
>
> Why not add this to calc_global_load() ?
>

If some CPUs go into nohz and wake up after calc_global_load, then
calc_load_account_active will be called after calc_global_load.
This would make identifying the first cpu to call calc_load_account_active
much more complicated.

>> }
>>
>> /**
>

2012-06-12 09:56:20

by Peter Zijlstra

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

Also added Doug to CC, hopefully we now have everybody who pokes at this
stuff.

On Tue, 2012-06-12 at 17:34 +0800, Charles Wang wrote:
> consider following case:
>
> 5HZ+1
> | cpu0_load cpu1 cpu2 cpu3 calc_load_tasks
> | 1 1 1 1
> | -->calc_load 1
> | 1 1 1 1
> | -->calc_load 2
> | 0 0 1 0
> | -->calc_load 2+1-3=1

Not sure but last time I did the math 2+1-3 ended up being 0.

> | 1 1 0 1
> | -->calc_load 1-1=0
> V
> 5HZ+11 -->calc_global_load 0
>
> actually the load should be around 3, but shows nearly 0.
>
> 1 tick is much long for some workloads.

Yes, one tick is long for some stuff, but seeing we sample once every 5
seconds a little fuzz around sampling the nr_running+nr_uninterruptible
thing shouldn't be too bad.

But I think I see what you're getting at.. lemme get more tea and ponder
this a bit.

2012-06-13 05:55:47

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.12 02:56 - 0800 (I think), Peter Zijlstra wrote:

>Also added Doug to CC, hopefully we now have everybody who pokes at this
>stuff.

Thanks.

On my computer, and from a different thread from yesterday, I let
the proposed "wang" patch multiple processes test continue for
another 24 hours. The png file showing the results is attached, also
available at [1].

Conclusion: The proposed "wang" patch is worse for the lower load
conditions, giving higher reported load average errors for the same
conditions. The proposed "wang" patch tends towards a load equal to
the number of processes, independent of the actual load of those
processes.

Interestingly, with the "wang" patch I was able to remove the 10
tick grace period without bad side effects (very minimally tested).

@ Charles or Tao: If I could ask: What is your expected load for your 16
processes case? Because you used to get a reported load average of
< 1, we know that the processes enter and exit idle (sleep) at a high
frequency (as that was the only possible way for the older under-reporting
issue, at least as far as I know). You said it now reports a load
average of 8 to 10, but that is too low. How many CPU's do you have?
I have been unable to re-create your situation on my test computer
(an i7 CPU).
When I run 16 processes, where each process would use 0.95 of a cpu,
if the system did not become resource limited, I get a reported load
average of about 15 to 16. Kernel = 3.5 RC2. Process sleep frequency
was about 80 Hertz each.

[1]
http://www.smythies.com/~doug/network/load_average/load_processes_wang.html

Doug Smythies


Attachments:
load_processes_wang.png (38.01 kB)

2012-06-13 07:56:14

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Wednesday, June 13, 2012 01:55 PM, Doug Smythies wrote:

>> On 2012.06.12 02:56 - 0800 (I think), Peter Zijlstra wrote:
>
>> Also added Doug to CC, hopefully we now have everybody who pokes at this
>> stuff.
>
> Thanks.
>
> On my computer, and from a different thread from yesterday, I let
> the proposed "wang" patch multiple processes test continue for
> another 24 hours. The png file showing the results is attached, also
> available at [1].
>
> Conclusion: The proposed "wang" patch is worse for the lower load
> conditions, giving higher reported load average errors for the same
> conditions. The proposed "wang" patch tends towards a load equal to
> the number of processes, independent of the actual load of those
> processes.
>
> Interestingly, with the "wang" patch I was able to remove the 10
> tick grace period without bad side effects (very minimally tested).
>
> @ Charles or Tao: If I could ask: What is your expected load for your 16
> processes case? Because you used to get a reported load average of
> < 1, we know that the processes enter and exit idle (sleep) at a high
> frequency (as that was only possible way for the older under reporting
> issue, at least as far as I know). You said it now reports a load
> average of 8 to 10, but that is too low. How many CPU's do you have?
> I have been unable to re-create your situation on my test computer
> (an i7 CPU).
> When I run 16 processes, where each process would use 0.95 of a cpu,
> if the system did not become resource limited, I get a reported load
> average of about 15 to 16. Kernel = 3.5 RC2. Process sleep frequency
> was about 80 Hertz each.
>
> [1]
> http://www.smythies.com/~doug/network/load_average/load_processes_wang.html
>
> Doug Smythies
>


Thanks Doug for this exhaustive testing!

Every cpu's load should be the load right at the time the calculation is
executed. That is what my patch is expected to achieve.

After my patch, every cpu's load calculation is supposed to be independent
of its own idle time and of other cpus' influence once the calculation is
finished. But it seems another problem exists.

I recorded some traces, shown below:

tasks--->calc_load_tasks
idle--->calc_load_tasks_idle
idle_unmask--->calc_unmask_cpu_load_idle
scheduler_tick---> after calc_load_account_active called
pick_next_task_idle---> after pick_next_task_idle called

# TASK-PID CPU# TIMESTAMP
# | | | | |
<...>-4318  [000]  1864.217498: scheduler_tick: tasks:1  idle:0  idle_unmask:0
<...>-4320  [002]  1864.217534: scheduler_tick: tasks:2  idle:0  idle_unmask:0
<...>-4313  [003]  1864.217555: scheduler_tick: tasks:3  idle:0  idle_unmask:0
<...>-4323  [004]  1864.217577: scheduler_tick: tasks:5  idle:0  idle_unmask:0
<...>-4316  [005]  1864.217596: scheduler_tick: tasks:6  idle:0  idle_unmask:0
<...>-4311  [006]  1864.217617: scheduler_tick: tasks:8  idle:0  idle_unmask:0
<...>-4312  [007]  1864.217637: scheduler_tick: tasks:9  idle:0  idle_unmask:0
<idle>-0    [008]  1864.217659: scheduler_tick: tasks:9  idle:0  idle_unmask:0
<...>-4317  [009]  1864.217679: scheduler_tick: tasks:10 idle:0  idle_unmask:0
<...>-4321  [010]  1864.217700: scheduler_tick: tasks:11 idle:0  idle_unmask:0
<...>-4318  [000]  1864.217716: pick_next_task_idle: go idle! tasks:11 idle:-1  idle_unmask:0
<...>-4319  [011]  1864.217721: scheduler_tick: tasks:12 idle:-1 idle_unmask:0
<...>-4309  [012]  1864.217742: scheduler_tick: tasks:13 idle:-1 idle_unmask:0
<...>-4313  [003]  1864.217758: pick_next_task_idle: go idle! tasks:13 idle:-2  idle_unmask:0
<...>-4310  [013]  1864.217762: scheduler_tick: tasks:14 idle:-2 idle_unmask:0
<...>-4309  [012]  1864.217773: pick_next_task_idle: go idle! tasks:14 idle:-3  idle_unmask:0
<...>-4314  [014]  1864.217783: scheduler_tick: tasks:15 idle:-3 idle_unmask:0
<...>-4322  [015]  1864.217804: scheduler_tick: tasks:16 idle:-3 idle_unmask:0
<...>-4319  [011]  1864.217862: pick_next_task_idle: go idle! tasks:16 idle:-4  idle_unmask:0
<...>-4316  [005]  1864.217886: pick_next_task_idle: go idle! tasks:16 idle:-5  idle_unmask:0
<...>-4322  [015]  1864.217909: pick_next_task_idle: go idle! tasks:16 idle:-6  idle_unmask:0
<...>-4320  [002]  1864.217956: pick_next_task_idle: go idle! tasks:16 idle:-7  idle_unmask:0
<...>-4311  [006]  1864.217976: pick_next_task_idle: go idle! tasks:16 idle:-9  idle_unmask:0
<...>-4321  [010]  1864.218118: pick_next_task_idle: go idle! tasks:16 idle:-10 idle_unmask:0
<...>-4314  [014]  1864.218198: pick_next_task_idle: go idle! tasks:16 idle:-11 idle_unmask:0
<...>-4317  [009]  1864.218930: pick_next_task_idle: go idle! tasks:16 idle:-12 idle_unmask:0
<...>-4323  [004]  1864.219782: pick_next_task_idle: go idle! tasks:16 idle:-14 idle_unmask:0
<...>-4310  [013]  1864.219784: pick_next_task_idle: go idle! tasks:16 idle:-15 idle_unmask:0
<...>-4312  [007]  1864.221397: pick_next_task_idle: go idle! tasks:16 idle:-16 idle_unmask:0

As you can see, at the time we calculate the load almost all of these cpus'
loads are 1, but after the calculation almost all of their loads go down
to 0. All of this happens within 1~2 ticks.
If a cpu's load moves quickly between 0 and 1, then at the time we sample it
the load should sometimes be 0 and sometimes 1, not always 1. This is the
problem I found, and it may lead the reported load to tend towards the number
of processes.
@Peter: will the scheduler disturb the time at which the load calculation is
done? I thought it was determined purely by the timer interrupt.

2012-06-13 08:17:04

by Peter Zijlstra

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

On Tue, 2012-06-12 at 22:55 -0700, Doug Smythies wrote:
> On my computer, and from a different thread from yesterday, I let
> the proposed "wang" patch multiple processes test continue for
> another 24 hours.

So I got waiter_1.txt and load_180.txt and tried running it, but it
takes _forever_... is there anything I can run that gives a reasonable
output in say 30 minutes?

2012-06-13 15:33:49

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate


>> On Tue, 2012-06-12 at 22:55 -0700, Doug Smythies wrote:
> On 2012.06.13 01:17 -0700, Peter Zijlstra wrote:

>> On my computer, and from a different thread from yesterday, I let
>> the proposed "wang" patch multiple processes test continue for
>> another 24 hours.

> So I got waiter_1.txt and load_180.txt and tried running it, but it
> takes _forever_... is there anything I can run that gives a reasonable
> output in say 30 minutes?

Yes, sorry it is painfully slow. The whole thing, if let go to completion
takes 63 hours. And one can not use the computer for other activities,
or it will bias the reported load average results (I sometimes cheat and
do some editing). I only use the long term strongly filtered 15 minute
reported load average. They are IIR (Infinite Impulse Response) filters
and the script waits 4 time constants (1 hour) between graph lines
(number of processes) as a transient response settling time and 2 time
constants (30 minutes) between samples (the assumption being that the
difference between samples is small). I suspect that 1 time constant (15
minutes) between samples would be enough, but I was wanting to avoid bias
due to filter lag.
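
(For the numbers above: with the standard kernel constants FIXED_1 = 2048 and
EXP_15 = 2037, each 5-second sample multiplies the 15-minute average by
2037/2048 ~= exp(-5s/15min), so one time constant is the 15 minutes in the
name; 4 time constants leave about e^-4 ~= 2% of any transient and 2 time
constants leave about 13%, which is the filter lag being waited out here.)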

If I tried to get quick results by looking at the 1 minute reported load
average, I often got confused and jumped to incorrect conclusions. There are
many examples of the high noise level of the 1 minute reported loaded
average in my web notes.

All that being said, what I typically do with a new code test is:

. select a known, previous bad operating point. For example 2
processes, actual load average 0.30 (0.15 for each process) currently
reporting ~1.5.

. find the proper command line for those conditions and execute it for
a long time. (For example look it up in load_180.txt) (yes, my main
program command line stuff is less than friendly. I always forget how to
use it.)

. Observe via "top" and or "uptime". 30 minutes should be enough time here
to know if the code is promising or not.

. Make a decisions to do a longer term test or not. Often, I will do just
one or two processes over a range of actual loads and or sleep frequencies.

Please note: The main time waster loop inside the main program is computer
dependent. It needs to be determined for your computer and then the script
generating program needs to be re-compiled. See:

#define LOOPS_PER_USEC 207.547 /* Tight loops per second. s15.smythies.com.
CONFIG_NO_HZ=n or y kernels */

Which is for my computer with the CPUs locked in powersave mode (to avoid
results confusion due to CPU throttling).

@Peter: My code is done to the coding standards I have used for a long time,
which is likely to annoy you as it is different than the kernel standards.
Sorry. My web notes were a couple of days behind yesterday morning (my time)
when you pulled the files. Suggest you use the "wang" [1] write up for
updated
source files and such.

I am willing to make any special test code or whatever to help with this.
Notice the absence of any test results from my own patch tests. My attempts
have been disappointing.

[1]
http://www.smythies.com/~doug/network/load_average/load_processes_wang.html
General: http://www.smythies.com/~doug/network/load_average/index.html

Doug Smythies


2012-06-13 21:57:44

by Peter Zijlstra

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

On Wed, 2012-06-13 at 08:33 -0700, Doug Smythies wrote:
>
> All that being said, what I typically do with a new code test is:
>
> . select a known, previous bad operating point. For example 2
> processes, actual load average 0.30 (0.15 for each process) currently
> reporting ~1.5.

OK, I'll try and apply this. Waiting 63 hours for feedback on patches is
something I'm not patient enough for.

Would this be:

./waiter 2 900 230608 10000

I haven't even bothered reading the waiter proglet yet, but I did notice
the 'help' provided when started without arguments doesn't seem to
actually match what load_180 does.

2012-06-14 03:13:52

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate


>> On Wed, 2012-06-13 at 08:33 -0700, Doug Smythies wrote:
> On 2012.06.13 14:58 -0700, Peter Zijlstra wrote:

>>
>> All that being said, what I typically do with a new code test is:
>>
>> . select a known, previous bad operating point. For example 2
>> processes, actual load average 0.30 (0.15 for each process) currently
>> reporting ~1.5.

> OK, I'll try and apply this. Waiting 63 hours for feedback on patches is
> something I'm not patient enough for.

> Would this be:

> ./waiter 2 900 230608 10000

Actually it would be:

./waiter 2 900 345912 9444

At least on my computer, with the CPUs locked into powersave mode (lowest
clock rate). It might be different on your computer as the exact numbers are
computer dependent.
I will change the script-generating program to add comment lines describing
the expected execution scenario, as I also have trouble looking up command
lines.

> I haven't even bothered reading the waiter proglet yet, but I did notice
> the 'help' provided when started without arguments doesn't seem to
> actually match what load_180 does.

Right, sorry. I think you will find the newer version (from the "wang"
experiment write up) is O.K.



2012-06-14 04:42:51

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.13 00:56 - 0700, Charles Wang wrote:

> Every cpu's load should be the load right on the time executing
> calculation.
> This is what my patch expected to do.

> After my patch, it's supposed to let every cpu's load calculation be
> independent from
> its idle and other cpus' influence when the calculation is finished.
> But it seems other problem exists.

> I recorded some trace, show as below:

Charles thanks for your detailed reply.
I am not sure I agree.

On purpose, I have left out your trace detail from this reply.
Why? Because I want to take a step back and answer a more fundamental
question first:
Is there (or was there) actually a problem that needed to be solved?

I am still not understanding what your expected reported load averages
for your 16 processes should be. How do you know that the 8 to 10 that
was being reported was incorrect?

I want to understand so that I can re-create your "reported load averages
are still too low" situation on my test computer, before your patch.
Then show that it is fixed after your patch.
I have tried to re-create it and have been unsuccessful. Today, and since you
showed at least 16 CPUs on your system, and I only have 8 CPUs, I tried with
8 processes at high duty cycle. The reported load average was always pretty
close to the actual load (see also attachment).

Perhaps I have some fundamental misunderstanding as to how reported load
averages should work. I have always used kernels compiled with
CONFIG_NO_HZ=no (tick based, rather than tickless) as the control reference.


Attachments:
wang_high_load_example.txt (3.31 kB)

2012-06-14 15:57:44

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Thursday, June 14, 2012 12:41 PM, Doug Smythies wrote:

>> On 2012.06.13 00:56 - 0700, Charles Wang wrote:
>
>> Every cpu's load should be the load right on the time executing
>> calculation.
>> This is what my patch expected to do.
>
>> After my patch, it's supposed to let every cpu's load calculation be
>> independent from
>> its idle and other cpus' influence when the calculation is finished.
>> But it seems other problem exists.
>
>> I recorded some trace, show as below:
>
> Charles thanks for your detailed reply.
> I am not sure I agree.
>
> On purpose, I have left out your trace detail from this reply.
> Why? Because I want to take a step back and answer a more fundamental
> question first:
> Is there (or was there) actually a problem that needed to be solved?
>
> I am still not understanding what your expected reported load averages
> for your 16 processes should be. How do you know that the 8 to 10 that
> was being reported was incorrect?
>

I re-created the problem with waiter, using the following parameters:
./waiter 16 1800 900 9444 100 1

9444 and 100 are the key point: they make
"processing time + sleeping time" less than 1 tick.

We have 16 processors, and the load of every processor is about 0.25~0.3,
so the actual load should be 16*0.25=4. But the loadavg calculated by the
current logic is only about 1.

> I want to understand so that I can re-create your same reported load
> averages are still to low situation on my test computer, before your patch.
> Then show that it is fixed after your patch.
> I have tried to re-create it and have been unsuccessful. Today, and since
> you
> showed at least 16 CPUs on your system, and I only have 8 CPUs, I tried with
> 8 processes at high duty cycle. The reported load average was always pretty
> close to the actual load (see also attachment).
>
> Perhaps I have some fundamental misunderstanding as to how reported load
> averages should work. I have always used kernels compiled with
> CONFIG_NO_HZ=no (tick based, rather than tickless) as the control reference.
>




2012-06-15 14:27:22

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Tuesday, June 12, 2012 05:56 PM, Peter Zijlstra wrote:

> Also added Doug to CC, hopefully we now have everybody who pokes at this
> stuff.
>
> On Tue, 2012-06-12 at 17:34 +0800, Charles Wang wrote:
>> consider following case:
>>
>> 5HZ+1
>> | cpu0_load cpu1 cpu2 cpu3 calc_load_tasks
>> | 1 1 1 1
>> | -->calc_load 1
>> | 1 1 1 1
>> | -->calc_load 2
>> | 0 0 1 0
>> | -->calc_load 2+1-3=1
>
> Not sure but last time I did the math 2+1-3 ended up being 0.
>
>> | 1 1 0 1
>> | -->calc_load 1-1=0
>> V
>> 5HZ+11 -->calc_global_load 0
>>
>> actually the load should be around 3, but shows nearly 0.
>>
>> 1 tick is much long for some workloads.
>
> Yes, one tick is long for some stuff, but seeing we sample once every 5
> seconds a little fuzz around sampling the nr_running+nr_uninterruptible
> thing shouldn't be too bad.
>
> But I think I see what you're getting at.. lemme get more tea and ponder
> this a bit.
> .
>


In our mind per-cpu sampling for cpu idle and non-idle is equal. But
actually may not. For non-idle cpu sampling, it's right the load when
sampling. But for idle, cause of nohz, the sampling will be delayed to
nohz exit(less than 1 tick after nohz exit). Nohz exit is always caused
by processes woken up--non-idle model. It's not fair here, idle
calculated to non-idle.

   time-expect-sampling
    |                         time-do-sampling
    |                          |
    V                          V
  -|-------------------------|--
   start_nohz            stop_nohz


This may explain why using my patch the load shows higher, also may
explain some reports about high load for current.

I tried an experiment, and the results looked better. Now I need more experiments.

Peter, is this right as i thought?

2012-06-15 17:39:59

by Peter Zijlstra

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

Wednesday I ended up with something like the below.. but I haven't
gotten round to trying Doug's latest testing method, nor did I really
read the email I'm now replying to.

I think it does something like what Wang described... every time I try
and write comments related to why it does this I get stuck though.

I ran out of time again for this week, I'll try and prod at it a little
more next week (and try and catch up with the thread).

In the meantime I thought I might as well post this.. who knows somebody
might be bored over the weekend, it might actually work, or not :-)

---
kernel/sched/core.c | 77 +++++++++++++++++++++++++++++++++++----------------
1 file changed, 53 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index ca07ee0..4101a0e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2198,26 +2198,49 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
*
* When making the ILB scale, we should try to pull this in as well.
*/
-static atomic_long_t calc_load_tasks_idle;
+static atomic_long_t calc_load_idle[2];
+static int calc_load_idx;
+
+static inline int calc_load_write_idx(void)
+{
+ int idx = calc_load_idx;
+
+ /*
+ * See calc_global_nohz(), if we observe the new index, we also
+ * need to observe the new update time.
+ */
+ smp_rmb();
+
+ if (!time_before(jiffies, calc_load_update))
+ idx++;
+
+ return idx & 1;
+}
+
+static inline int calc_load_read_idx(void)
+{
+ return calc_load_idx & 1;
+}

void calc_load_account_idle(struct rq *this_rq)
{
long delta;
+ int idx;

delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ if (delta) {
+ idx = calc_load_write_idx();
+ atomic_long_add(delta, &calc_load_idle[idx]);
+ }
}

static long calc_load_fold_idle(void)
{
+ int idx = calc_load_read_idx();
long delta = 0;

- /*
- * Its got a race, we don't care...
- */
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ if (atomic_long_read(&calc_load_idle[idx]))
+ delta = atomic_long_xchg(&calc_load_idle[idx], 0);

return delta;
}
@@ -2313,26 +2336,32 @@ static void calc_global_nohz(void)
if (delta)
atomic_long_add(delta, &calc_load_tasks);

- /*
- * It could be the one fold was all it took, we done!
- */
- if (time_before(jiffies, calc_load_update + 10))
- return;
+ if (!time_before(jiffies, calc_load_update + 10)) {
+ /*
+ * Catch-up, fold however many we are behind still
+ */
+ delta = jiffies - calc_load_update - 10;
+ n = 1 + (delta / LOAD_FREQ);

- /*
- * Catch-up, fold however many we are behind still
- */
- delta = jiffies - calc_load_update - 10;
- n = 1 + (delta / LOAD_FREQ);
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;

- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

- avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
- avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
- avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+ calc_load_update += n * LOAD_FREQ;
+ }

- calc_load_update += n * LOAD_FREQ;
+ /*
+ * Flip the idle index...
+ *
+ * Make sure we first write the new time then flip the index, so that
+ * calc_load_write_idx() will see the new time when it reads the new
+ * index, this avoids a double flip messing things up.
+ */
+ smp_wmb();
+ calc_load_idx++;
}
#else
void calc_load_account_idle(struct rq *this_rq)

2012-06-16 06:43:13

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.14 08:43 - 0700, Charles Wang wrote:

Just trying to catch up...

> I re-create the problem by waiter, using following parameters:
> ./waiter 16 1800 900 9444 100 1

That lists output at a very high rate and has about 5 kilo hertz
sleep frequency per process (at least on my computer).
Suggest something like this instead:

./waiter 16 1800 9000 94444 100 1

Which still makes your point.

> 9444 and 100 is the key point that can make
> "processing time+sleeping time" less than 1 tick.

> We have 16 processors, and the load of every processor is about 0.25~0.3,
> so the actual load should be 16*0.25=4. But the loadavg calculated by the
> current logic is only about 1.

O.K. finally I understand.

Your application has an extremely high rate of switching between
processing and sleeping or waiting for something. On the order of
1000s of Hertz, far exceeding the basic tick rate within default kernel
compiles. Yes, under such conditions it can not be expected that Reported
Load Averages would be accurate.

The kernel would have to be re-compiled with a much higher basic tick rate.
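
(To put numbers on that: with HZ=1000 a tick is 1 ms, so a per-process
"processing time + sleeping time" cycle shorter than one tick means each
process switches between busy and idle at more than 1 kHz; the once-per-tick
per-cpu sampling simply cannot resolve busy/idle transitions that fast,
regardless of how the folding is done.)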

In February/March as part of the fix testing for the reported load
averages way way way too low issue, I tested operating such conditions
where even a tick based kernel would be expected to start breaking down.
All I cared about in those tests was that the tickless kernel started to
degrade in a similar fashion as the tick based kernel, which it did [1].
I was only going to around 420 Hertz per process, and now you are going
even higher.

However, based on your findings, I was able to re-create errors where the
Reported Load Averages were too low, under conditions (low enough
frequencies) where we would expect things to work properly.
See the attached PNG file, also posted at [2]. (Kernel 3.5 RC2
an i7 processor with 8 cpus)

I have back edited Peter's patch (from a subsequent e-mail) into my
working kernel (3.2 based), and will test over the weekend.

@Charles: Early next week maybe we can compare results from your tests.
I'll try your solutions also on my computer if you post the code (either
on list for off list)

[1] http://www.smythies.com/~doug/network/load_average/original.html#higher_freq
[2] http://www.smythies.com/~doug/network/load_average/high_freq_35rc2.png


Attachments:
high_freq_35rc2.png (38.88 kB)

2012-06-16 14:54:04

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.15 10:40 -0700, Peter Zijlstra wrote:

> Wednesday I ended up with something like the below.. but I haven't
> gotten round to trying Doug's latest testing method, nor did I really
> read the email I'm now replying to.

> I think it does something like what Wang described... every time I try
> and write comments related to why it does this I get stuck though.

[...]

>In the meantime I thought I might as well post this.. who knows somebody
> might be bored over the weekend, it might actually work, or not :-)

[...]

I back edited your changes into my working kernel (3.2 based),
in addition to the other back edits which had it at the
equivalent of 3.5 RC2 with respect to this stuff.

I did only the quick test, as described previously:
Selected 2 processes, 90 Hertz per process, and 0.15 load each
For an actual load of 0.30 total and a previously Reported Load
Average of ~1.5. Command used for the load testing program:

./waiter 2 1800 900 345912 9444 1

Long term Reported Load Average was ~1.8.

I like the idea of the index flipping per cycle. My thinking
was that the final solution would have to do some flipping
and maybe even eliminate the 10 tick grace period in favour
of overlapped dual finish, with possible nohz delays,
yet continue to accumulate information during that time.
I'll look at the code patch in more detail, perhaps Monday
or Tuesday.


2012-06-18 06:42:23

by Doug Smythies

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate


>> On 2012.06.15 10:40 -0700, Peter Zijlstra wrote:
> On 2012.06.16 07:53 -0700, Doug Smythies wrote:

>> In the meantime I thought I might as well post this.. who knows somebody
>> might be bored over the weekend, it might actually work, or not :-)

> I back edited your changes into my working kernel (3.2 based),
> in addition to the other back edits which had it at the
> equivalent of 3.5 RC2 with respect to this stuff.

> I did only the quick test, as described previously:
> Selected 2 processes, 90 Hertz per process, and 0.15 load each
> For an actual load of 0.30 total and a previously Reported Load
> Average of ~1.5. Command used for the load testing program:

> ./waiter 2 1800 900 345912 9444 1

> Long term Reported Load Average was ~1.8.

Another data point: 8 processes, actual load = 6.34, 150 Hz sleep
frequency per process:

Kernel 3.5 RC2 reported load average ~= 3.9
Kernel with Peter code reported load average ~= 7.9


2012-06-18 10:07:13

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Saturday, June 16, 2012 01:39 AM, Peter Zijlstra wrote:

> Wednesday I ended up with something like the below.. but I haven't
> gotten round to trying Doug's latest testing method, nor did I really
> read the email I'm now replying to.
>
> I think it does something like what Wang described... every time I try
> and write comments related to why it does this I get stuck though.
>
> I ran out of time again for this week, I'll try and prod at it a little
> more next week (and try and catch up with the thread).
>
> In the meantime I thought I might as well post this.. who knows somebody
> might be bored over the weekend, it might actually work, or not :-)
>
> ---
> kernel/sched/core.c | 77 +++++++++++++++++++++++++++++++++++----------------
> 1 file changed, 53 insertions(+), 24 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index ca07ee0..4101a0e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2198,26 +2198,49 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
> *
> * When making the ILB scale, we should try to pull this in as well.
> */
> -static atomic_long_t calc_load_tasks_idle;
> +static atomic_long_t calc_load_idle[2];
> +static int calc_load_idx;
> +
> +static inline int calc_load_write_idx(void)
> +{
> + int idx = calc_load_idx;
> +
> + /*
> + * See calc_global_nohz(), if we observe the new index, we also
> + * need to observe the new update time.
> + */
> + smp_rmb();
> +
> + if (!time_before(jiffies, calc_load_update))
> + idx++;
> +
> + return idx & 1;
> +}
> +
> +static inline int calc_load_read_idx(void)
> +{
> + return calc_load_idx & 1;
> +}
>
> void calc_load_account_idle(struct rq *this_rq)
> {
> long delta;
> + int idx;
>
> delta = calc_load_fold_active(this_rq);
> - if (delta)
> - atomic_long_add(delta, &calc_load_tasks_idle);
> + if (delta) {
> + idx = calc_load_write_idx();
> + atomic_long_add(delta, &calc_load_idle[idx]);
> + }
> }
>
> static long calc_load_fold_idle(void)
> {
> + int idx = calc_load_read_idx();
> long delta = 0;
>
> - /*
> - * Its got a race, we don't care...
> - */
> - if (atomic_long_read(&calc_load_tasks_idle))
> - delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
> + if (atomic_long_read(&calc_load_idle[idx]))
> + delta = atomic_long_xchg(&calc_load_idle[idx], 0);
>
> return delta;
> }
> @@ -2313,26 +2336,32 @@ static void calc_global_nohz(void)
> if (delta)
> atomic_long_add(delta, &calc_load_tasks);
>
> - /*
> - * It could be the one fold was all it took, we done!
> - */
> - if (time_before(jiffies, calc_load_update + 10))
> - return;
> + if (!time_before(jiffies, calc_load_update + 10)) {
> + /*
> + * Catch-up, fold however many we are behind still
> + */
> + delta = jiffies - calc_load_update - 10;
> + n = 1 + (delta / LOAD_FREQ);
>
> - /*
> - * Catch-up, fold however many we are behind still
> - */
> - delta = jiffies - calc_load_update - 10;
> - n = 1 + (delta / LOAD_FREQ);
> + active = atomic_long_read(&calc_load_tasks);
> + active = active > 0 ? active * FIXED_1 : 0;
>
> - active = atomic_long_read(&calc_load_tasks);
> - active = active > 0 ? active * FIXED_1 : 0;
> + avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> + avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> + avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
>
> - avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> - avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> - avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
> + calc_load_update += n * LOAD_FREQ;
> + }
>
> - calc_load_update += n * LOAD_FREQ;
> + /*
> + * Flip the idle index...
> + *
> + * Make sure we first write the new time then flip the index, so that
> + * calc_load_write_idx() will see the new time when it reads the new
> + * index, this avoids a double flip messing things up.
> + */
> + smp_wmb();
> + calc_load_idx++;
> }
> #else
> void calc_load_account_idle(struct rq *this_rq)
>

I tried to identify the start line precisely, and that made the implementation
a little more complicated. Using calc_load_update as the start line makes it
not quite as accurate, but it may work and keeps things simple. I will test
this in my environment, and try to port the next patch on top of it.

2012-06-18 10:13:28

by Peter Zijlstra

Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

On Wed, 2012-06-13 at 20:13 -0700, Doug Smythies wrote:
> ./waiter 2 900 345912 9444

So we're all simply just trying to burn a particular amount of cpu-time?

I have a little consume.c thing lying about doing something like that.
It's independent of clockspeed, but doesn't really like preemption much,
but it more or less works ;-)

---
#include <sys/time.h>
#include <unistd.h>
#include <stdio.h>
#include <stdlib.h>	/* atoi() */

unsigned long long stamp(void)
{
	struct timeval tv;

	gettimeofday(&tv, NULL);

	return (unsigned long long)tv.tv_sec * 1000000 + tv.tv_usec;
}

void consume(int spin, int total)
{
	unsigned long long begin, now;

	begin = stamp();

	for (;;) {
		/* busy-loop for 'spin' usec, then sleep out the rest of the period */
		now = stamp();
		if ((long long)(now - begin) > spin) {
			usleep(total - spin);
			begin += total;
		}
	}
}

int main(int argc, char **argv)
{
	int period = 100000; /* 100ms */
	int frac;

	if (argc < 2) {
		fprintf(stderr, "%s <frac> [<period>]\n"
				"  frac   -- [1-100] %% of time to burn\n"
				"  period -- [usec] period of burn/sleep cycle\n",
				argv[0]);
		return -1;
	}

	frac = atoi(argv[1]);
	if (argc > 2)
		period = atoi(argv[2]);

	if (frac > 100)
		frac = 100;

	if (frac < 1)
		frac = 1;

	consume((period * frac) / 100, period);

	return 0;
}
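
As a usage example (the invocation numbers are illustrative, not from this
thread): running two copies of

	./consume 15 11111 &

burns ~15% of an 11111 us period in each process, i.e. roughly Doug's
2-process, 0.15-load-per-process, ~90 Hz-per-process test point.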

2012-06-18 14:41:59

by Charles Wang

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

Peter's patch works well. Now I have ported the second patch, which fixes the
high-load problem, on top of Peter's. It works fine in my testing
environment. Doug, please try this. Thanks.


In our mind, per-cpu sampling for idle and non-idle cpus should be equal, but
actually it may not be. For a non-idle cpu the sample really is the load at
sampling time. But for an idle cpu, because of nohz, the sampling is delayed
until nohz exit (less than 1 tick after nohz exit). Nohz exit is always caused
by a process being woken up -- a non-idle moment. It's not fair here: an idle
sample is turned into a non-idle sample, and that makes the loadavg higher
than normal.

   time-expected-sampling
    |                         time-do-sampling
    |                          |
    V                          V
  -|-------------------------|--
   start_nohz            stop_nohz

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4101a0e..180e612 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2228,6 +2228,7 @@ void calc_load_account_idle(struct rq *this_rq)
int idx;

delta = calc_load_fold_active(this_rq);
+ this_rq->last_idle_enter = jiffies;

if (delta) {
idx = calc_load_write_idx();
atomic_long_add(delta, &calc_load_idle[idx]);
@@ -2431,15 +2432,27 @@ void calc_global_load(void)
static void calc_load_account_active(struct rq *this_rq)
{
long delta;
+ unsigned long delta_time;
+ long last_idle_time_elapse;

if (time_before(jiffies, this_rq->calc_load_update))
return;

+ last_idle_time_elapse = this_rq->last_idle_enter - calc_load_update;
+ delta_time = jiffies - this_rq->calc_load_update;
+
+ if (last_idle_time_elapse > 0)
+ goto out;
+
+ if ((last_idle_time_elapse > -1) && (delta_time >= 1))
+ goto out;
+
delta = calc_load_fold_active(this_rq);
delta += calc_load_fold_idle();
if (delta)
atomic_long_add(delta, &calc_load_tasks);

+out:
this_rq->calc_load_update += LOAD_FREQ;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4134d37..a356588 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -438,6 +438,7 @@ struct rq {

/* calc_load related fields */
unsigned long calc_load_update;
+ unsigned long last_idle_enter;
long calc_load_active;

#ifdef CONFIG_SCHED_HRTICK
--
1.7.9.5

2012-06-18 16:03:56

by Peter Zijlstra

Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

Hi Charles,

I'm having difficulties understanding your exact meaning, I suspect its
a language thing, so please excuse me for nit-picking through your
email.


On Fri, 2012-06-15 at 22:27 +0800, Charles Wang wrote:

> In our mind

Are there more people involved?

> per-cpu sampling for cpu idle and non-idle is equal. But
> actually may not.

Are you saying they should be equal but are in fact not so?

I think we can all agree on this. Doug has illustrated this quite
clearly.

The desire is for CONFIG_NOHZ=n,y to function identically, but it's been
clearly demonstrated this is currently not the case.

> For non-idle cpu sampling, it's right the load when
> sampling.

Agreed, sampling of a busy cpu is identical.

> But for idle, cause of nohz, the sampling will be delayed to
> nohz exit(less than 1 tick after nohz exit).

I don't think the nohz exit code calls into the load sampling, but I
think you're saying we'll get a sample tick after we leave nohz, right?

This is only so if the busy period covers a tick, that is, if we wake
and go back to idle before a tick happens we'll still not get sampled.


tick tick
|----====-----|
^ ^
wake sleep


Shows a nohz-exit busy period not sampled.

> Nohz exit is always caused
> by processes woken up--non-idle model. It's not fair here, idle
> calculated to non-idle.
>
> time-expect-sampling
> | time-do-sampling
> | |
> V V
> -|-------------------------|--
> start_nohz stop_nohz

I don't think the delay in sampling is the biggest problem, I think the
problem is the direct interaction between a cpu going idle and another
cpu taking a sample.

So the approach I took was to isolate the going idle before the sample
window from going idle during (and after) the sampling window.

Therefore any going-idle activity will not affect the samples of other
cpus. The only trick is the slight shift in index flip for read vs
write.


0 5 10 15
+10 +10 +10 +10
|-|-----------|-|-----------|-|-----------|-|

r:001 110 001 110
w:011 100 011 100


Shows we'll read the old idle load, but write to the new idle load
during the sample window. Thus including the old idle load in this
sample fold, but leaving new activity for the next.
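
For illustration, here is a minimal user-space sketch of that two-bucket
index flip (a sketch only: the names mirror the patch below, LOAD_FREQ is a
made-up constant, and plain >= comparisons stand in for the kernel's
time_before(), so there is no jiffies-wraparound handling):

#include <stdio.h>

#define LOAD_FREQ 5000	/* pretend jiffies: 5000 ticks = 5s at HZ=1000 */

static int calc_load_idx;				/* flipped once per sample window */
static unsigned long calc_load_update = LOAD_FREQ;	/* start of the next window */

/* cpus entering NOHZ write here: once the window has opened, use the next bucket */
static int write_idx(unsigned long jiffies)
{
	int idx = calc_load_idx;

	if (jiffies >= calc_load_update)
		idx++;
	return idx & 1;
}

/* the global fold reads the bucket that was current before the flip */
static int read_idx(void)
{
	return calc_load_idx & 1;
}

int main(void)
{
	unsigned long jiffies;

	for (jiffies = 0; jiffies <= 3 * LOAD_FREQ; jiffies += LOAD_FREQ / 2) {
		printf("t=%5lu  read bucket %d  write bucket %d\n",
		       jiffies, read_idx(), write_idx(jiffies));
		if (jiffies >= calc_load_update + 10) {
			/* roughly what calc_global_nohz() does: advance the window, flip the index */
			calc_load_update += LOAD_FREQ;
			calc_load_idx++;
		}
	}
	return 0;
}

Running it shows that in the ticks right after each window opens, readers
still use the old bucket while writers already fill the new one -- the r/w
pattern in the diagram above.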

A cpu waking up and doing a sample is similar to the cpu being busy at
the window start.

However since this window is 10 ticks long and any busy spanning a tick
will make it appear 'busy' we can only accurately sample loads of up to
HZ/(2*10) (2 for the sampling theorem). For a regular HZ=1000 kernel
this ends up being 50 Hz.

Higher frequency workloads will appear to over-account.

Now the whole reason we have this window of 10 ticks is that we're
trying to do a completely asynchronous load computation trying to avoid
as much serialization as possible. So have each cpu account its local
state and hope that 10 ticks is sufficient for all to have reported in,
then process the fold.

The 10 tick window is directly related to the worst irq-off latency on
your machine, if you keep IRQs disabled for a few ticks -- something
that quite easily happens on large machines, even a busy cpu will be
'late' reporting its load. I think the current 10 tick came from an SGI
machine with 4k cpus or so.


Hmmm,.. idea.. OK, so we should also have a hook coming out of NOHZ
state, when we come out of NOHZ state during the sample window, simply
push the whole window fwd to the next time.

This finds another bug in the current code.. A cpu that has idled 'long'
could be multiple LOAD_FREQ intervals behind and will take multiple
samples in quick succession instead of 5s apart.


Can someone please think through the below thing? its been compile
tested only...

---
kernel/sched/core.c | 290 ++++++++++++++++++++++++++++++++++------------
kernel/sched/idle_task.c | 1 -
kernel/sched/sched.h | 2 -
kernel/time/tick-sched.c | 2 +
4 files changed, 220 insertions(+), 75 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d5594a4..3a49ee1 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2161,11 +2161,72 @@ unsigned long this_cpu_load(void)
}


+/*
+ * global load-average calculations
+ *
+ * We take a distributed and async approach to calculating the global load-avg
+ * in order to minimize overhead.
+ *
+ * The global load average is an exponentially decaying average of nr_running +
+ * nr_uninterruptible.
+ *
+ * Once every LOAD_FREQ:
+ *
+ * nr_active = 0;
+ * for_each_possible_cpu(cpu)
+ * nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
+ *
+ * avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
+ *
+ * Due to a number of reasons the above turns in the mess below:
+ *
+ * - for_each_possible_cpu() is prohibitively expensive on machines with
+ * serious number of cpus, therefore we need to take a distributed approach
+ * to calculating nr_active.
+ *
+ * \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
+ * = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
+ *
+ * So assuming nr_active := 0 when we start out -- true per definition, we
+ * can simply take per-cpu deltas and fold those into a global accumulate
+ * to obtain the same result. See calc_load_fold_active().
+ *
+ * Furthermore, in order to avoid synchronizing all per-cpu delta folding
+ * across the machine, we assume 10 ticks is sufficient time for every
+ * cpu to have completed this task.
+ *
+ * This places an upper-bound on the IRQ-off latency of the machine.
+ *
+ * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
+ * this would add another cross-cpu cacheline miss and atomic operation
+ * to the wakeup path. Instead we increment on whatever cpu the task ran
+ * when it went into uninterruptible state and decrement on whatever cpu
+ * did the wakeup. This means that only the sum of nr_uninterruptible over
+ * all cpus yields the correct result.
+ *
+ * This covers the NO_HZ=n code, for extra head-aches, see the comment below.
+ */
+
/* Variables and functions for calc_load */
static atomic_long_t calc_load_tasks;
static unsigned long calc_load_update;
unsigned long avenrun[3];
-EXPORT_SYMBOL(avenrun);
+EXPORT_SYMBOL(avenrun); /* should be removed */
+
+/**
+ * get_avenrun - get the load average array
+ * @loads: pointer to dest load array
+ * @offset: offset to add
+ * @shift: shift count to shift the result left
+ *
+ * These values are estimates at best, so no need for locking.
+ */
+void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
+{
+ loads[0] = (avenrun[0] + offset) << shift;
+ loads[1] = (avenrun[1] + offset) << shift;
+ loads[2] = (avenrun[2] + offset) << shift;
+}

static long calc_load_fold_active(struct rq *this_rq)
{
@@ -2182,6 +2243,9 @@ static long calc_load_fold_active(struct rq *this_rq)
return delta;
}

+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
@@ -2193,30 +2257,117 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)

#ifdef CONFIG_NO_HZ
/*
- * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ * Handle NO_HZ for the global load-average.
+ *
+ * Since the above described distributed algorithm to compute the global
+ * load-average relies on per-cpu sampling from the tick, it is affected by
+ * NO_HZ.
+ *
+ * The basic idea is to fold the nr_active delta into a global idle load upon
+ * entering NO_HZ state such that we can include this as an 'extra' cpu delta
+ * when we read the global state.
+ *
+ * Obviously reality has to ruin such a delightfully simple scheme:
+ *
+ * - When we go NO_HZ idle during the window, we can negate our sample
+ * contribution, causing under-accounting.
+ *
+ * We avoid this by keeping two idle-delta counters and flipping them
+ * when the window starts, thus separating old and new NO_HZ load.
+ *
+ * The only trick is the slight shift in index flip for read vs write.
+ *
+ * 0 5 10 15
+ * +10 +10 +10 +10
+ * |-|-----------|-|-----------|-|-----------|-|
+ * r:001 110 001 110
+ * w:011 100 011 100
+ *
+ * This ensures we'll fold the old idle contribution in this window while
+ * accumlating the new one.
+ *
+ * - When we wake up from NO_HZ idle during the window, we push up our
+ * contribution, since we effectively move our sample point to a known
+ * busy state.
+ *
+ * This is solved by pushing the window forward, and thus skipping the
+ * sample, for this cpu (effectively using the idle-delta for this cpu which
+ * was in effect at the time the window opened). This also solves the issue
+ * of having to deal with a cpu having been in NOHZ idle for multiple
+ * LOAD_FREQ intervals.
*
* When making the ILB scale, we should try to pull this in as well.
*/
-static atomic_long_t calc_load_tasks_idle;
+static atomic_long_t calc_load_idle[2];
+static int calc_load_idx;

-void calc_load_account_idle(struct rq *this_rq)
+static inline int calc_load_write_idx(void)
{
+ int idx = calc_load_idx;
+
+ /*
+ * See calc_global_nohz(), if we observe the new index, we also
+ * need to observe the new update time.
+ */
+ smp_rmb();
+
+ /*
+ * If the folding window started, make sure we start writing in the
+ * next idle-load delta.
+ */
+ if (!time_before(jiffies, calc_load_update))
+ idx++;
+
+ return idx & 1;
+}
+
+static inline int calc_load_read_idx(void)
+{
+ return calc_load_idx & 1;
+}
+
+void calc_load_enter_idle(void)
+{
+ struct rq *this_rq = this_rq();
long delta;
+ int idx;

+ /*
+ * We're going into NOHZ mode, if there's any pending delta, fold it
+ * into the pending idle delta.
+ */
delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ if (delta) {
+ idx = calc_load_write_idx();
+ atomic_long_add(delta, &calc_load_idle[idx]);
+ }
}

-static long calc_load_fold_idle(void)
+void calc_load_exit_idle(void)
{
- long delta = 0;
+ struct rq *this_rq = this_rq();

/*
- * Its got a race, we don't care...
+ * If we're still outside the sample window, we're done.
*/
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ if (time_before(jiffies, this_rq->calc_load_update))
+ return;
+
+ /*
+ * We woke inside or after the sample window, this means another cpu
+ * likely already accounted us through the nohz accounting, so skip the
+ * entire deal and sync up for the next window.
+ */
+ this_rq->calc_load_update = calc_load_update + LOAD_FREQ;
+}
+
+static long calc_load_fold_idle(void)
+{
+ int idx = calc_load_read_idx();
+ long delta = 0;
+
+ if (atomic_long_read(&calc_load_idle[idx]))
+ delta = atomic_long_xchg(&calc_load_idle[idx], 0);

return delta;
}
@@ -2302,66 +2453,39 @@ static void calc_global_nohz(void)
{
long delta, active, n;

- /*
- * If we crossed a calc_load_update boundary, make sure to fold
- * any pending idle changes, the respective CPUs might have
- * missed the tick driven calc_load_account_active() update
- * due to NO_HZ.
- */
- delta = calc_load_fold_idle();
- if (delta)
- atomic_long_add(delta, &calc_load_tasks);
-
- /*
- * It could be the one fold was all it took, we done!
- */
- if (time_before(jiffies, calc_load_update + 10))
- return;
-
- /*
- * Catch-up, fold however many we are behind still
- */
- delta = jiffies - calc_load_update - 10;
- n = 1 + (delta / LOAD_FREQ);
+ if (!time_before(jiffies, calc_load_update + 10)) {
+ /*
+ * Catch-up, fold however many we are behind still
+ */
+ delta = jiffies - calc_load_update - 10;
+ n = 1 + (delta / LOAD_FREQ);

- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;

- avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
- avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
- avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

- calc_load_update += n * LOAD_FREQ;
-}
-#else
-void calc_load_account_idle(struct rq *this_rq)
-{
-}
+ calc_load_update += n * LOAD_FREQ;
+ }

-static inline long calc_load_fold_idle(void)
-{
- return 0;
+ /*
+ * Flip the idle index...
+ *
+ * Make sure we first write the new time then flip the index, so that
+ * calc_load_write_idx() will see the new time when it reads the new
+ * index, this avoids a double flip messing things up.
+ */
+ smp_wmb();
+ calc_load_idx++;
}
+#else /* !CONFIG_NO_HZ */

-static void calc_global_nohz(void)
-{
-}
-#endif
+static inline long calc_load_fold_idle(void) { return 0; }
+static inline void calc_global_nohz(void) { }

-/**
- * get_avenrun - get the load average array
- * @loads: pointer to dest load array
- * @offset: offset to add
- * @shift: shift count to shift the result left
- *
- * These values are estimates at best, so no need for locking.
- */
-void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
-{
- loads[0] = (avenrun[0] + offset) << shift;
- loads[1] = (avenrun[1] + offset) << shift;
- loads[2] = (avenrun[2] + offset) << shift;
-}
+#endif /* CONFIG_NO_HZ */

/*
* calc_load - update the avenrun load estimates 10 ticks after the
@@ -2369,11 +2493,35 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
*/
void calc_global_load(unsigned long ticks)
{
- long active;
+ long active, delta;

if (time_before(jiffies, calc_load_update + 10))
return;

+ /*
+ * Fold the 'old' idle-delta to include all NO_HZ cpus.
+ *
+ * cpu0 cpu1 cpu2 ..
+ *
+ * >--- [sample A]
+ *
+ * -> NOHZ
+ * -> NOHZ
+ * ->NOHZ
+ *
+ * >--- [sample B]
+ *
+ * >--- [sample C]
+ *
+ * NOHZ-> (here)
+ *
+ * Since all CPUs went into NOHZ state, all 'missed' samples (B, C)
+ * should include the folded idle-delta.
+ */
+ delta += calc_load_fold_idle();
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
+
active = atomic_long_read(&calc_load_tasks);
active = active > 0 ? active * FIXED_1 : 0;

@@ -2384,12 +2532,7 @@ void calc_global_load(unsigned long ticks)
calc_load_update += LOAD_FREQ;

/*
- * Account one period with whatever state we found before
- * folding in the nohz state and ageing the entire idle period.
- *
- * This avoids loosing a sample when we go idle between
- * calc_load_account_active() (10 ticks ago) and now and thus
- * under-accounting.
+ * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
*/
calc_global_nohz();
}
@@ -2406,7 +2549,6 @@ static void calc_load_account_active(struct rq *this_rq)
return;

delta = calc_load_fold_active(this_rq);
- delta += calc_load_fold_idle();
if (delta)
atomic_long_add(delta, &calc_load_tasks);

@@ -2414,6 +2556,10 @@ static void calc_load_account_active(struct rq *this_rq)
}

/*
+ * End of global load-average stuff
+ */
+
+/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
*
diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
index b44d604..b6baf37 100644
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -25,7 +25,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d52cea..55844f2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -942,8 +942,6 @@ static inline u64 sched_avg_period(void)
return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
}

-void calc_load_account_idle(struct rq *this_rq);
-
#ifdef CONFIG_SCHED_HRTICK

/*
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 8699978..4a08472 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -406,6 +406,7 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
*/
if (!ts->tick_stopped) {
select_nohz_load_balancer(1);
+ calc_load_enter_idle();

ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
@@ -597,6 +598,7 @@ void tick_nohz_idle_exit(void)
account_idle_ticks(ticks);
#endif

+ calc_load_exit_idle();
touch_softlockup_watchdog();
/*
* Cancel the scheduled timer and restore the tick

2012-06-19 06:08:40

by Yong Zhang

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Mon, Jun 18, 2012 at 06:03:37PM +0200, Peter Zijlstra wrote:
> > Nohz exit is always caused
> > by processes woken up--non-idle model. It's not fair here, idle
> > calculated to non-idle.
> >
> > time-expect-sampling
> > | time-do-sampling
> > | |
> > V V
> > -|-------------------------|--
> > start_nohz stop_nohz
>
> I don't think the delay in sampling is the biggest problem, I think the
> problem is the direct interaction between a cpu going idle and another
> cpu taking a sample.

IIUC, your hooking into tick_nohz_idle_exit() will cure Charles's problem.

And comments below.

> ---
> kernel/sched/core.c | 290 ++++++++++++++++++++++++++++++++++------------
> kernel/sched/idle_task.c | 1 -
> kernel/sched/sched.h | 2 -
> kernel/time/tick-sched.c | 2 +
> 4 files changed, 220 insertions(+), 75 deletions(-)
>
> + * - When we go NO_HZ idle during the window, we can negate our sample
> + * contribution, causing under-accounting.
> + *
> + * We avoid this by keeping two idle-delta counters and flipping them
> + * when the window starts, thus separating old and new NO_HZ load.
> + *
> + * The only trick is the slight shift in index flip for read vs write.
> + *
> + * 0 5 10 15
> + * +10 +10 +10 +10
> + * |-|-----------|-|-----------|-|-----------|-|
> + * r:001 110 001 110
> + * w:011 100 011 100

I'm confused by this comment; looking at your code, the index is increased by
1 for each sample window.

> + *
> + * This ensures we'll fold the old idle contribution in this window while
> + * accumlating the new one.
> + *
> + * - When we wake up from NO_HZ idle during the window, we push up our
> + * contribution, since we effectively move our sample point to a known
> + * busy state.
> + *
> + * This is solved by pushing the window forward, and thus skipping the
> + * sample, for this cpu (effectively using the idle-delta for this cpu which
> + * was in effect at the time the window opened). This also solves the issue
> + * of having to deal with a cpu having been in NOHZ idle for multiple
> + * LOAD_FREQ intervals.
> *
> * When making the ILB scale, we should try to pull this in as well.
> */
> -static long calc_load_fold_idle(void)
> +void calc_load_exit_idle(void)
> {
> - long delta = 0;
> + struct rq *this_rq = this_rq();
>
> /*
> - * Its got a race, we don't care...
> + * If we're still outside the sample window, we're done.
> */
> - if (atomic_long_read(&calc_load_tasks_idle))
> - delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
> + if (time_before(jiffies, this_rq->calc_load_update))
> + return;
else if (time_before(jiffies, calc_load_update + 10)
this_rq->calc_load_update = calc_load_update + LOAD_FREQ;
else
this_rq->calc_load_update = calc_load_update;

Otherwise if you woke after the sample window, we lose one sample?
And maybe we need local variable to cache calc_load_update.

Thanks,
Yong

2012-06-19 06:20:25

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.18 09:04 -0700, Peter Zijlstra wrote:

[... lots deleted ...]

> Can someone please think through the below thing? its been compile
> tested only...

[... code patch deleted ...]

Your code patch makes sense to me, but I admit that I still have
difficulty following this area of code.

Note: I didn't have time yet to review or try Charles' solution.

I back edited this new solution into my working kernel and retested
the same two operating points as over the weekend. Summary:

2 Processes @ 90 hertz per process and 0.15 load per process,
or 0.30 total. Reported Load Average (long average):

Kernel 3.5 RC2: ~1.5
Kernel Peter 2012.06.15: ~1.8
Kernel Peter 2012.06.18: ~0.3 (0.28)

8 processes @ 150 hertz per process and 0.7925 load per process,
or 6.34 total. Reported Load Average (long average):

Kernel 3.5 RC2: ~3.9
Kernel Peter 2012.06.15: ~7.9
Kernel Peter 2012.06.18: ~6.3

I will start one of my longer term experiments tonight.
It will take many days to do all the tests.
If things change, the tests can be re-started.

Note: On my computers I have no way to test the catch-
up code path, as my computers never take that path.

A note on the test code for loading (from the other branch
of this thread): Peter, I'll try your code sometime. It was
on purpose that I made mine a mindless code loop, without any
system calls to keep time. But yes, mine is proving a little
annoying to use.


2012-06-19 06:24:55

by Charles Wang

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Tuesday, June 19, 2012 12:03 AM, Peter Zijlstra wrote:

> Hi Charles,
>
> I'm having difficulties understanding your exact meaning, I suspect it's
> a language thing, so please excuse me for nit-picking through your
> email.
>
>
> On Fri, 2012-06-15 at 22:27 +0800, Charles Wang wrote:
>
>> In our mind
>
> Are there more people involved?
>
>> per-cpu sampling for cpu idle and non-idle is equal. But
>> actually may not.
>
> Are you saying they should be equal but are in fact not so?



That's it. When a cpu enters idle, the tick is stopped, and sampling
cannot execute on this idle cpu. Although the idle delta will be folded and
then calculated into the global calc_load_tasks[1], we do not update
calc_load_update on this cpu, so the sampling will be done after idle
exits[2]. The real load for this sampling time should be the load when
this cpu went into idle ([1]); the sample taken after idle exits [2] is
wrong. This is what I meant by "sampling time line fix", which was not clear before. @_@

>
> I think we can all agree on this. Doug has illustrated this quite
> clearly.
>
> The desire is for CONFIG_NOHZ=n,y to function identically, but it's been
> clearly demonstrated this is currently not the case.
>
>> For non-idle cpu sampling, it's right the load when
>> sampling.
>
> Agreed, sampling of a busy cpu is identical.
>
>> But for idle, cause of nohz, the sampling will be delayed to
>> nohz exit(less than 1 tick after nohz exit).
>
> I don't think the nohz exit code calls into the load sampling, but I
> think you're saying we'll get a sample tick after we leave nohz, right?

yes

>
> This is only so if the busy period covers a tick, that is, if we wake
> and go back to idle before a tick happens we'll still not get sampled.
>


>
> tick tick
> |----====-----|
> ^ ^
> wake sleep
>

This is the key point. We should take the sample on this cpu as idle load,
but because of idle the sampling is delayed to the wrong place.

>
> Shows a nohz-exit busy period not sampled.
>
>> Nohz exit is always caused
>> by processes woken up--non-idle model. It's not fair here, idle
>> calculated to non-idle.
>>
>> time-expect-sampling
>> | time-do-sampling
>> | |
>> V V
>> -|-------------------------|--
>> start_nohz stop_nohz
>
> I don't think the delay in sampling is the biggest problem, I think the
> problem is the direct interaction between a cpu going idle and another
> cpu taking a sample.



Maybe "delay" is not the exactly, this sampling is totally wrong.

"A cpu going idle and another cpu taking a sample" is the premise,
missing the right sampling time and taking another wrong sample after
idle exits is the real reason.

>
> So the approach I took was to isolate the going idle before the sample
> window from going idle during (and after) the sampling window.
>
> Therefore any going-idle activity will not affect the samples of other
> cpus. The only trick is the slight shift in index flip for read vs
> write.
>
>
> 0 5 10 15
> +10 +10 +10 +10
> |-|-----------|-|-----------|-|-----------|-|
>
> r:001 110 001 110
> w:011 100 011 100
>

It's a wonderful plan to use index flip. Simple and effective.

>
> Shows we'll read the old idle load, but write to the new idle load
> during the sample window. Thus including the old idle load in this
> sample fold, but leaving new activity for the next.


>
> A cpu waking up and doing a sample is similar to the cpu being busy at
> the window start.
>
> However since this window is 10 ticks long and any busy spanning a tick
> will make it appear 'busy' we can only accurately sample loads of up to
> HZ/(2*10) (2 for the sampling theorem). For a regular HZ=1000 kernel
> this ends up being 50 Hz.

>
> Higher frequency workloads will appear to over-account.
>
> Now the whole reason we have this window of 10 ticks is that we're
> trying to do a completely asynchronous load computation trying to avoid
> as much serialization as possible. So have each cpu account its local
> state and hope that 10 ticks is sufficient for all to have reported in,
> then process the fold.
>
> The 10 tick window is directly related to the worst irq-off latency on
> your machine, if you keep IRQs disabled for a few ticks -- something
> that quite easily happens on large machines, even a busy cpu will be
> 'late' reporting its load. I think the current 10 tick came from an SGI
> machine with 4k cpus or so.
>
>
> Hmmm,.. idea.. OK, so we should also have a hook coming out of NOHZ
> state, when we come out of NOHZ state during the sample window, simply
> push the whole window fwd to the next time.

These two hooks can take over my work. I tried this before, but couldn't
find the right place :(. So I tried the simple solution.

>

> This finds another bug in the current code.. A cpu that has idled 'long'
> could be multiple LOAD_FREQ intervals behind and will take multiple
> samples in quick succession instead of 5s apart.
>
>
> Can someone please think through the below thing? its been compile
> tested only...
>
> ---
> kernel/sched/core.c | 290 ++++++++++++++++++++++++++++++++++------------
> kernel/sched/idle_task.c | 1 -
> kernel/sched/sched.h | 2 -
> kernel/time/tick-sched.c | 2 +
> 4 files changed, 220 insertions(+), 75 deletions(-)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index d5594a4..3a49ee1 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2161,11 +2161,72 @@ unsigned long this_cpu_load(void)
> }
>
>
> +/*
> + * global load-average calculations
> + *
> + * We take a distributed and async approach to calculating the global load-avg
> + * in order to minimize overhead.
> + *
> + * The global load average is an exponentially decaying average of nr_running +
> + * nr_uninterruptible.
> + *
> + * Once every LOAD_FREQ:
> + *
> + * nr_active = 0;
> + * for_each_possible_cpu(cpu)
> + * nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
> + *
> + * avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
> + *
> + * Due to a number of reasons the above turns in the mess below:
> + *
> + * - for_each_possible_cpu() is prohibitively expensive on machines with
> + * serious number of cpus, therefore we need to take a distributed approach
> + * to calculating nr_active.
> + *
> + * \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
> + * = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
> + *
> + * So assuming nr_active := 0 when we start out -- true per definition, we
> + * can simply take per-cpu deltas and fold those into a global accumulate
> + * to obtain the same result. See calc_load_fold_active().
> + *
> + * Furthermore, in order to avoid synchronizing all per-cpu delta folding
> + * across the machine, we assume 10 ticks is sufficient time for every
> + * cpu to have completed this task.
> + *
> + * This places an upper-bound on the IRQ-off latency of the machine.
> + *
> + * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
> + * this would add another cross-cpu cacheline miss and atomic operation
> + * to the wakeup path. Instead we increment on whatever cpu the task ran
> + * when it went into uninterruptible state and decrement on whatever cpu
> + * did the wakeup. This means that only the sum of nr_uninterruptible over
> + * all cpus yields the correct result.
> + *
> + * This covers the NO_HZ=n code, for extra head-aches, see the comment below.
> + */
> +
> /* Variables and functions for calc_load */
> static atomic_long_t calc_load_tasks;
> static unsigned long calc_load_update;
> unsigned long avenrun[3];
> -EXPORT_SYMBOL(avenrun);
> +EXPORT_SYMBOL(avenrun); /* should be removed */
> +
> +/**
> + * get_avenrun - get the load average array
> + * @loads: pointer to dest load array
> + * @offset: offset to add
> + * @shift: shift count to shift the result left
> + *
> + * These values are estimates at best, so no need for locking.
> + */
> +void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
> +{
> + loads[0] = (avenrun[0] + offset) << shift;
> + loads[1] = (avenrun[1] + offset) << shift;
> + loads[2] = (avenrun[2] + offset) << shift;
> +}
>
> static long calc_load_fold_active(struct rq *this_rq)
> {
> @@ -2182,6 +2243,9 @@ static long calc_load_fold_active(struct rq *this_rq)
> return delta;
> }
>
> +/*
> + * a1 = a0 * e + a * (1 - e)
> + */
> static unsigned long
> calc_load(unsigned long load, unsigned long exp, unsigned long active)
> {
> @@ -2193,30 +2257,117 @@ calc_load(unsigned long load, unsigned long exp, unsigned long active)
>
> #ifdef CONFIG_NO_HZ
> /*
> - * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
> + * Handle NO_HZ for the global load-average.
> + *
> + * Since the above described distributed algorithm to compute the global
> + * load-average relies on per-cpu sampling from the tick, it is affected by
> + * NO_HZ.
> + *
> + * The basic idea is to fold the nr_active delta into a global idle load upon
> + * entering NO_HZ state such that we can include this as an 'extra' cpu delta
> + * when we read the global state.
> + *
> + * Obviously reality has to ruin such a delightfully simple scheme:
> + *
> + * - When we go NO_HZ idle during the window, we can negate our sample
> + * contribution, causing under-accounting.
> + *
> + * We avoid this by keeping two idle-delta counters and flipping them
> + * when the window starts, thus separating old and new NO_HZ load.
> + *
> + * The only trick is the slight shift in index flip for read vs write.
> + *
> + * 0 5 10 15
> + * +10 +10 +10 +10
> + * |-|-----------|-|-----------|-|-----------|-|
> + * r:001 110 001 110
> + * w:011 100 011 100
> + *
> + * This ensures we'll fold the old idle contribution in this window while
> + * accumlating the new one.
> + *
> + * - When we wake up from NO_HZ idle during the window, we push up our
> + * contribution, since we effectively move our sample point to a known
> + * busy state.
> + *
> + * This is solved by pushing the window forward, and thus skipping the
> + * sample, for this cpu (effectively using the idle-delta for this cpu which
> + * was in effect at the time the window opened). This also solves the issue
> + * of having to deal with a cpu having been in NOHZ idle for multiple
> + * LOAD_FREQ intervals.
> *
> * When making the ILB scale, we should try to pull this in as well.
> */
> -static atomic_long_t calc_load_tasks_idle;
> +static atomic_long_t calc_load_idle[2];
> +static int calc_load_idx;
>
> -void calc_load_account_idle(struct rq *this_rq)
> +static inline int calc_load_write_idx(void)
> {
> + int idx = calc_load_idx;
> +
> + /*
> + * See calc_global_nohz(), if we observe the new index, we also
> + * need to observe the new update time.
> + */
> + smp_rmb();
> +
> + /*
> + * If the folding window started, make sure we start writing in the
> + * next idle-load delta.
> + */
> + if (!time_before(jiffies, calc_load_update))
> + idx++;

Can we just take calc_load_update as the start time-line here? Will
there be different ticks between cpus?

> +
> + return idx & 1;
> +}
> +
> +static inline int calc_load_read_idx(void)
> +{
> + return calc_load_idx & 1;
> +}
> +
> +void calc_load_enter_idle(void)
> +{
> + struct rq *this_rq = this_rq();
> long delta;
> + int idx;
>
> + /*
> + * We're going into NOHZ mode, if there's any pending delta, fold it
> + * into the pending idle delta.
> + */
> delta = calc_load_fold_active(this_rq);
> - if (delta)
> - atomic_long_add(delta, &calc_load_tasks_idle);
> + if (delta) {
> + idx = calc_load_write_idx();
> + atomic_long_add(delta, &calc_load_idle[idx]);
> + }
> }
>
> -static long calc_load_fold_idle(void)
> +void calc_load_exit_idle(void)
> {
> - long delta = 0;
> + struct rq *this_rq = this_rq();
>
> /*
> - * Its got a race, we don't care...
> + * If we're still outside the sample window, we're done.
> */
> - if (atomic_long_read(&calc_load_tasks_idle))
> - delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
> + if (time_before(jiffies, this_rq->calc_load_update))
> + return;
> +
> + /*
> + * We woke inside or after the sample window, this means another cpu
> + * likely already accounted us through the nohz accounting, so skip the
> + * entire deal and sync up for the next window.
> + */
> + this_rq->calc_load_update = calc_load_update + LOAD_FREQ;
> +}
> +
> +static long calc_load_fold_idle(void)
> +{
> + int idx = calc_load_read_idx();
> + long delta = 0;
> +
> + if (atomic_long_read(&calc_load_idle[idx]))
> + delta = atomic_long_xchg(&calc_load_idle[idx], 0);
>
> return delta;
> }
> @@ -2302,66 +2453,39 @@ static void calc_global_nohz(void)
> {
> long delta, active, n;
>
> - /*
> - * If we crossed a calc_load_update boundary, make sure to fold
> - * any pending idle changes, the respective CPUs might have
> - * missed the tick driven calc_load_account_active() update
> - * due to NO_HZ.
> - */
> - delta = calc_load_fold_idle();
> - if (delta)
> - atomic_long_add(delta, &calc_load_tasks);
> -
> - /*
> - * It could be the one fold was all it took, we done!
> - */
> - if (time_before(jiffies, calc_load_update + 10))
> - return;
> -
> - /*
> - * Catch-up, fold however many we are behind still
> - */
> - delta = jiffies - calc_load_update - 10;
> - n = 1 + (delta / LOAD_FREQ);
> + if (!time_before(jiffies, calc_load_update + 10)) {
> + /*
> + * Catch-up, fold however many we are behind still
> + */
> + delta = jiffies - calc_load_update - 10;
> + n = 1 + (delta / LOAD_FREQ);
>
> - active = atomic_long_read(&calc_load_tasks);
> - active = active > 0 ? active * FIXED_1 : 0;
> + active = atomic_long_read(&calc_load_tasks);
> + active = active > 0 ? active * FIXED_1 : 0;
>
> - avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> - avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> - avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
> + avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
> + avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
> + avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
>
> - calc_load_update += n * LOAD_FREQ;
> -}
> -#else
> -void calc_load_account_idle(struct rq *this_rq)
> -{
> -}
> + calc_load_update += n * LOAD_FREQ;
> + }
>
> -static inline long calc_load_fold_idle(void)
> -{
> - return 0;
> + /*
> + * Flip the idle index...
> + *
> + * Make sure we first write the new time then flip the index, so that
> + * calc_load_write_idx() will see the new time when it reads the new
> + * index, this avoids a double flip messing things up.
> + */
> + smp_wmb();
> + calc_load_idx++;
> }
> +#else /* !CONFIG_NO_HZ */
>
> -static void calc_global_nohz(void)
> -{
> -}
> -#endif
> +static inline long calc_load_fold_idle(void) { return 0; }
> +static inline void calc_global_nohz(void) { }
>
> -/**
> - * get_avenrun - get the load average array
> - * @loads: pointer to dest load array
> - * @offset: offset to add
> - * @shift: shift count to shift the result left
> - *
> - * These values are estimates at best, so no need for locking.
> - */
> -void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
> -{
> - loads[0] = (avenrun[0] + offset) << shift;
> - loads[1] = (avenrun[1] + offset) << shift;
> - loads[2] = (avenrun[2] + offset) << shift;
> -}
> +#endif /* CONFIG_NO_HZ */
>
> /*
> * calc_load - update the avenrun load estimates 10 ticks after the
> @@ -2369,11 +2493,35 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
> */
> void calc_global_load(unsigned long ticks)
> {
> - long active;
> + long active, delta;
>
> if (time_before(jiffies, calc_load_update + 10))
> return;
>
> + /*
> + * Fold the 'old' idle-delta to include all NO_HZ cpus.
> + *
> + * cpu0 cpu1 cpu2 ..
> + *
> + * >--- [sample A]
> + *
> + * -> NOHZ
> + * -> NOHZ
> + * ->NOHZ
> + *
> + * >--- [sample B]
> + *
> + * >--- [sample C]
> + *
> + * NOHZ-> (here)
> + *
> + * Since all CPUs went into NOHZ state, all 'missed' samples (B, C)
> + * should include the folded idle-delta.
> + */
> + delta += calc_load_fold_idle();
> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);
> +
> active = atomic_long_read(&calc_load_tasks);
> active = active > 0 ? active * FIXED_1 : 0;
>
> @@ -2384,12 +2532,7 @@ void calc_global_load(unsigned long ticks)
> calc_load_update += LOAD_FREQ;
>
> /*
> - * Account one period with whatever state we found before
> - * folding in the nohz state and ageing the entire idle period.
> - *
> - * This avoids loosing a sample when we go idle between
> - * calc_load_account_active() (10 ticks ago) and now and thus
> - * under-accounting.
> + * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
> */
> calc_global_nohz();
> }
> @@ -2406,7 +2549,6 @@ static void calc_load_account_active(struct rq *this_rq)
> return;
>
> delta = calc_load_fold_active(this_rq);
> - delta += calc_load_fold_idle();
> if (delta)
> atomic_long_add(delta, &calc_load_tasks);
>
> @@ -2414,6 +2556,10 @@ static void calc_load_account_active(struct rq *this_rq)
> }
>
> /*
> + * End of global load-average stuff
> + */
> +
> +/*
> * The exact cpuload at various idx values, calculated at every tick would be
> * load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
> *
> diff --git a/kernel/sched/idle_task.c b/kernel/sched/idle_task.c
> index b44d604..b6baf37 100644
> --- a/kernel/sched/idle_task.c
> +++ b/kernel/sched/idle_task.c
> @@ -25,7 +25,6 @@ static void check_preempt_curr_idle(struct rq *rq, struct task_struct *p, int fl
> static struct task_struct *pick_next_task_idle(struct rq *rq)
> {
> schedstat_inc(rq, sched_goidle);
> - calc_load_account_idle(rq);
> return rq->idle;
> }
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6d52cea..55844f2 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -942,8 +942,6 @@ static inline u64 sched_avg_period(void)
> return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
> }
>
> -void calc_load_account_idle(struct rq *this_rq);
> -
> #ifdef CONFIG_SCHED_HRTICK
>
> /*
> diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
> index 8699978..4a08472 100644
> --- a/kernel/time/tick-sched.c
> +++ b/kernel/time/tick-sched.c
> @@ -406,6 +406,7 @@ static void tick_nohz_stop_sched_tick(struct tick_sched *ts)
> */
> if (!ts->tick_stopped) {
> select_nohz_load_balancer(1);
> + calc_load_enter_idle();
>
> ts->idle_tick = hrtimer_get_expires(&ts->sched_timer);
> ts->tick_stopped = 1;
> @@ -597,6 +598,7 @@ void tick_nohz_idle_exit(void)
> account_idle_ticks(ticks);
> #endif
>
> + calc_load_exit_idle();
> touch_softlockup_watchdog();
> /*
> * Cancel the scheduled timer and restore the tick
>
>



2012-06-19 09:19:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Tue, 2012-06-19 at 14:08 +0800, Yong Zhang wrote:

> > ---
> > kernel/sched/core.c | 290 ++++++++++++++++++++++++++++++++++------------
> > kernel/sched/idle_task.c | 1 -
> > kernel/sched/sched.h | 2 -
> > kernel/time/tick-sched.c | 2 +
> > 4 files changed, 220 insertions(+), 75 deletions(-)
> >
> > + * - When we go NO_HZ idle during the window, we can negate our sample
> > + * contribution, causing under-accounting.
> > + *
> > + * We avoid this by keeping two idle-delta counters and flipping them
> > + * when the window starts, thus separating old and new NO_HZ load.
> > + *
> > + * The only trick is the slight shift in index flip for read vs write.
> > + *
> > + * 0 5 10 15
> > + * +10 +10 +10 +10
> > + * |-|-----------|-|-----------|-|-----------|-|
> > + * r:001 110 001 110
> > + * w:011 100 011 100
>
> I'm confused by this comment; looking at your code, the index is increased by
> 1 for each sample window.

Also looking at the code you'll find we only ever use idx & 1.

> > + *
> > + * This ensures we'll fold the old idle contribution in this window while
> > + * accumlating the new one.
> > + *
> > + * - When we wake up from NO_HZ idle during the window, we push up our
> > + * contribution, since we effectively move our sample point to a known
> > + * busy state.
> > + *
> > + * This is solved by pushing the window forward, and thus skipping the
> > + * sample, for this cpu (effectively using the idle-delta for this cpu which
> > + * was in effect at the time the window opened). This also solves the issue
> > + * of having to deal with a cpu having been in NOHZ idle for multiple
> > + * LOAD_FREQ intervals.
> > *
> > * When making the ILB scale, we should try to pull this in as well.
> > */
> > +void calc_load_exit_idle(void)
> > {
> > + struct rq *this_rq = this_rq();
> >
> > /*
> > + * If we're still outside the sample window, we're done.
> > */
> > + if (time_before(jiffies, this_rq->calc_load_update))
> > + return;

> else if (time_before(jiffies, calc_load_update + 10)
> this_rq->calc_load_update = calc_load_update + LOAD_FREQ;
> else
> this_rq->calc_load_update = calc_load_update;
>
> Otherwise if you woke after the sample window, we lose one sample?
> And maybe we need local variable to cache calc_load_update.

Ah indeed, although I'd write it like:

this_rq->calc_load_update = calc_load_update;
if (time_before(jiffies, this_rq->calc_load_update + 10)
this_rq->calc_load_update += LOAD_FREQ;

Thanks!

2012-06-19 09:57:42

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Tue, 2012-06-19 at 14:24 +0800, Charles Wang wrote:
> > +static inline int calc_load_write_idx(void)
> > {
> > + int idx = calc_load_idx;
> > +
> > + /*
> > + * See calc_global_nohz(), if we observe the new index, we also
> > + * need to observe the new update time.
> > + */
> > + smp_rmb();
> > +
> > + /*
> > + * If the folding window started, make sure we start writing in the
> > + * next idle-load delta.
> > + */
> > + if (!time_before(jiffies, calc_load_update))
> > + idx++;
>
> Can we just take calc_load_update as the start time-line here? Will
> there be different ticks between cpus?

I'm not exactly sure what you're asking, but yes and probably yes.

The per-cpu ticks are separate, but on the same time-line. That is, they
don't happen at the exact same moment, either due to per-cpu IRQ
disabling or because the architecture spreads the tick. But they do all
get HZ ticks per second.

Remember, jiffies is a global timeline, and the global calc_load_update
is the last to be moved fwd to the next period, so it's ideally suited to
be used to determine the current window.

2012-06-19 15:51:13

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

>> On Tue, 2012-06-19 at 14:08 +0800, Yong Zhang wrote:
> On 2012.06.19 02:19 -0700, Peter Zijlstra wrote:

>>> /*
>>> + * If we're still outside the sample window, we're done.
>>> */
>>> + if (time_before(jiffies, this_rq->calc_load_update))
>>> + return;

>> else if (time_before(jiffies, calc_load_update + 10)
>> this_rq->calc_load_update = calc_load_update + LOAD_FREQ;
>> else
>> this_rq->calc_load_update = calc_load_update;
>>
>> Otherwise if you woke after the sample window, we lose one sample?
>> And maybe we need local variable to cache calc_load_update.

> Ah indeed, although I'd write it like:

> this_rq->calc_load_update = calc_load_update;
> if (time_before(jiffies, this_rq->calc_load_update + 10)
> this_rq->calc_load_update += LOAD_FREQ;

Note the missing closing parenthesis:
if (time_before(jiffies, this_rq->calc_load_update + 10))

My automated 63 hour test has been terminated, the code changed
and the test re-started.

The attached png file is what I had so far, but it will be replaced.
Summary: Looked good, so far.


Attachments:
n_processes_peter35.png (33.96 kB)

2012-06-20 09:45:43

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

On Tue, 2012-06-19 at 08:50 -0700, Doug Smythies wrote:
> My automated 63 hour test has been terminated, the code changed
> and the test re-started.

There's another boo-boo in there:


> @@ -2369,11 +2493,35 @@ void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
> */
> void calc_global_load(unsigned long ticks)
> {
> - long active;
> + long active, delta;
>
> if (time_before(jiffies, calc_load_update + 10))
> return;
>
> + /*
> + * Fold the 'old' idle-delta to include all NO_HZ cpus.
> + *
> + * cpu0 cpu1 cpu2 ..
> + *
> + * >--- [sample A]
> + *
> + * -> NOHZ
> + * -> NOHZ
> + * ->NOHZ
> + *
> + * >--- [sample B]
> + *
> + * >--- [sample C]
> + *
> + * NOHZ-> (here)
> + *
> + * Since all CPUs went into NOHZ state, all 'missed' samples (B, C)
> + * should include the folded idle-delta.
> + */
> + delta += calc_load_fold_idle();

This gives a gcc warning about use without init, which is true, that should be:

delta = calc_load_fold_idle();

> + if (delta)
> + atomic_long_add(delta, &calc_load_tasks);

2012-06-21 04:12:59

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate


>> On Tue, 2012-06-19 at 08:50 -0700, Doug Smythies wrote:
> On 2012.06.20 02:45 -0700, Peter Zijlstra wrote:

>> My automated 63 hour test has been terminated, the code changed
>> and the test re-started.

> There's another boo-boo in there:

O.K. I let the automated test finish the 3 processes stuff
(which is 1/2 way).
The newly compiled code is now running, but I only resumed
the test, I did not re-start it.

A comparative graph of the 1 & 2 process results is attached.
(the previous version of this same graph was on another thread).
Legend:
Control: older kernel compiled with CONFIG_NO_HZ=no
c308: Kernel with the earlier commit
5aaa: Basically, Kernel 3.5 RC2
Wang: Charles Wang's first proposed solution.
Peter35: Peter Zijlstra's current proposed solution.

Summary: Extremely good.


Attachments:
peter_compare.png (88.43 kB)

2012-06-21 06:35:58

by Charles Wang

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Thursday, June 21, 2012 12:12 PM, Doug Smythies wrote:

>
>>> On Tue, 2012-06-19 at 08:50 -0700, Doug Smythies wrote:
>> On 2012.06.20 02:45 -0700, Peter Zijlstra wrote:
>
>>> My automated 63 hour test has been terminated, the code changed
>>> and the test re-started.
>
>> There's another boo-boo in there:
>
> O.K. I let the automated test finish the 3 processes stuff
> (which is 1/2 way).
> The newly compiled code is now running, but I only resumed
> the test, I did not re-start it.
>
> A comparative graph of the 1 & 2 process results is attached.
> (the previous version of this same graph was on another thread).
> Legend:
> Control: older kernel compiled with CONFIG_NO_HZ=no
> c308: Kernel with the earlier commit
> 5aaa: Basically, Kernel 3.5 RC2
> Wang: Charles Wang's first proposed solution.
> Peter35: Peter Zijlstra's current proposed solution.
>
> Summary: Extremely good.
>


Finally resolved!! ^____^

2012-06-21 08:48:40

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

On Wed, 2012-06-20 at 21:12 -0700, Doug Smythies wrote:
> Summary: Extremely good.

Finally... I was starting to question my sanity on this ;-)

2012-06-22 14:03:18

by Peter Zijlstra

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

Doug, Charles,

Can you both confirm that the below patch is what you tested and we
should finally be able to close this issue?


---
Subject: sched, nohz: Rewrite load-avg computation -- again
From: Peter Zijlstra <[email protected]>
Date: Fri Jun 22 15:52:09 CEST 2012

Thanks to Charles Wang for spotting the defects in the current code:

- If we go idle during the sample window -- after sampling, we get a
negative bias because we can negate our own sample.

- If we wake up during the sample window we get a positive bias
because we push the sample to a known active period.

So rewrite the entire nohz load-avg muck once again, now adding
copious documentation to the code.

Cc: Thomas Gleixner <[email protected]>
Reported-and-tested-by: Doug Smythies <[email protected]>
Reported-and-tested-by: Charles Wang <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/n/[email protected]
---
include/linux/sched.h | 8 +
kernel/sched/core.c | 275 ++++++++++++++++++++++++++++++++++-------------
kernel/sched/idle_task.c | 1
kernel/sched/sched.h | 2
kernel/time/tick-sched.c | 2
5 files changed, 213 insertions(+), 75 deletions(-)

--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1928,6 +1928,14 @@ static inline int set_cpus_allowed_ptr(s
}
#endif

+#ifdef CONFIG_NO_HZ
+void calc_load_enter_idle(void);
+void calc_load_exit_idle(void);
+#else
+static inline void calc_load_enter_idle(void) { }
+static inline void calc_load_exit_idle(void) { }
+#endif /* CONFIG_NO_HZ */
+
#ifndef CONFIG_CPUMASK_OFFSTACK
static inline int set_cpus_allowed(struct task_struct *p, cpumask_t new_mask)
{
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2162,11 +2162,73 @@ unsigned long this_cpu_load(void)
}


+/*
+ * Global load-average calculations
+ *
+ * We take a distributed and async approach to calculating the global load-avg
+ * in order to minimize overhead.
+ *
+ * The global load average is an exponentially decaying average of nr_running +
+ * nr_uninterruptible.
+ *
+ * Once every LOAD_FREQ:
+ *
+ * nr_active = 0;
+ * for_each_possible_cpu(cpu)
+ * nr_active += cpu_of(cpu)->nr_running + cpu_of(cpu)->nr_uninterruptible;
+ *
+ * avenrun[n] = avenrun[0] * exp_n + nr_active * (1 - exp_n)
+ *
+ * Due to a number of reasons the above turns in the mess below:
+ *
+ * - for_each_possible_cpu() is prohibitively expensive on machines with
+ * serious number of cpus, therefore we need to take a distributed approach
+ * to calculating nr_active.
+ *
+ * \Sum_i x_i(t) = \Sum_i x_i(t) - x_i(t_0) | x_i(t_0) := 0
+ * = \Sum_i { \Sum_j=1 x_i(t_j) - x_i(t_j-1) }
+ *
+ * So assuming nr_active := 0 when we start out -- true per definition, we
+ * can simply take per-cpu deltas and fold those into a global accumulate
+ * to obtain the same result. See calc_load_fold_active().
+ *
+ * Furthermore, in order to avoid synchronizing all per-cpu delta folding
+ * across the machine, we assume 10 ticks is sufficient time for every
+ * cpu to have completed this task.
+ *
+ * This places an upper-bound on the IRQ-off latency of the machine. Then
+ * again, being late doesn't loose the delta, just wrecks the sample.
+ *
+ * - cpu_rq()->nr_uninterruptible isn't accurately tracked per-cpu because
+ * this would add another cross-cpu cacheline miss and atomic operation
+ * to the wakeup path. Instead we increment on whatever cpu the task ran
+ * when it went into uninterruptible state and decrement on whatever cpu
+ * did the wakeup. This means that only the sum of nr_uninterruptible over
+ * all cpus yields the correct result.
+ *
+ * This covers the NO_HZ=n code, for extra head-aches, see the comment below.
+ */
+
/* Variables and functions for calc_load */
static atomic_long_t calc_load_tasks;
static unsigned long calc_load_update;
unsigned long avenrun[3];
-EXPORT_SYMBOL(avenrun);
+EXPORT_SYMBOL(avenrun); /* should be removed */
+
+/**
+ * get_avenrun - get the load average array
+ * @loads: pointer to dest load array
+ * @offset: offset to add
+ * @shift: shift count to shift the result left
+ *
+ * These values are estimates at best, so no need for locking.
+ */
+void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
+{
+ loads[0] = (avenrun[0] + offset) << shift;
+ loads[1] = (avenrun[1] + offset) << shift;
+ loads[2] = (avenrun[2] + offset) << shift;
+}

static long calc_load_fold_active(struct rq *this_rq)
{
@@ -2183,6 +2245,9 @@ static long calc_load_fold_active(struct
return delta;
}

+/*
+ * a1 = a0 * e + a * (1 - e)
+ */
static unsigned long
calc_load(unsigned long load, unsigned long exp, unsigned long active)
{
@@ -2194,30 +2259,118 @@ calc_load(unsigned long load, unsigned l

#ifdef CONFIG_NO_HZ
/*
- * For NO_HZ we delay the active fold to the next LOAD_FREQ update.
+ * Handle NO_HZ for the global load-average.
+ *
+ * Since the above described distributed algorithm to compute the global
+ * load-average relies on per-cpu sampling from the tick, it is affected by
+ * NO_HZ.
+ *
+ * The basic idea is to fold the nr_active delta into a global idle-delta upon
+ * entering NO_HZ state such that we can include this as an 'extra' cpu delta
+ * when we read the global state.
+ *
+ * Obviously reality has to ruin such a delightfully simple scheme:
+ *
+ * - When we go NO_HZ idle during the window, we can negate our sample
+ * contribution, causing under-accounting.
+ *
+ * We avoid this by keeping two idle-delta counters and flipping them
+ * when the window starts, thus separating old and new NO_HZ load.
+ *
+ * The only trick is the slight shift in index flip for read vs write.
+ *
+ * 0s 5s 10s 15s
+ * +10 +10 +10 +10
+ * |-|-----------|-|-----------|-|-----------|-|
+ * r:0 0 1 1 0 0 1 1 0
+ * w:0 1 1 0 0 1 1 0 0
+ *
+ * This ensures we'll fold the old idle contribution in this window while
+ * accumlating the new one.
+ *
+ * - When we wake up from NO_HZ idle during the window, we push up our
+ * contribution, since we effectively move our sample point to a known
+ * busy state.
+ *
+ * This is solved by pushing the window forward, and thus skipping the
+ * sample, for this cpu (effectively using the idle-delta for this cpu which
+ * was in effect at the time the window opened). This also solves the issue
+ * of having to deal with a cpu having been in NOHZ idle for multiple
+ * LOAD_FREQ intervals.
*
* When making the ILB scale, we should try to pull this in as well.
*/
-static atomic_long_t calc_load_tasks_idle;
+static atomic_long_t calc_load_idle[2];
+static int calc_load_idx;
+
+static inline int calc_load_write_idx(void)
+{
+ int idx = calc_load_idx;

-void calc_load_account_idle(struct rq *this_rq)
+ /*
+ * See calc_global_nohz(), if we observe the new index, we also
+ * need to observe the new update time.
+ */
+ smp_rmb();
+
+ /*
+ * If the folding window started, make sure we start writing in the
+ * next idle-delta.
+ */
+ if (!time_before(jiffies, calc_load_update))
+ idx++;
+
+ return idx & 1;
+}
+
+static inline int calc_load_read_idx(void)
{
+ return calc_load_idx & 1;
+}
+
+void calc_load_enter_idle(void)
+{
+ struct rq *this_rq = this_rq();
long delta;

+ /*
+ * We're going into NOHZ mode, if there's any pending delta, fold it
+ * into the pending idle delta.
+ */
delta = calc_load_fold_active(this_rq);
- if (delta)
- atomic_long_add(delta, &calc_load_tasks_idle);
+ if (delta) {
+ int idx = calc_load_write_idx();
+ atomic_long_add(delta, &calc_load_idle[idx]);
+ }
}

-static long calc_load_fold_idle(void)
+void calc_load_exit_idle(void)
{
- long delta = 0;
+ struct rq *this_rq = this_rq();
+
+ /*
+ * If we're still before the sample window, we're done.
+ */
+ if (time_before(jiffies, this_rq->calc_load_update))
+ return;

/*
- * Its got a race, we don't care...
+ * We woke inside or after the sample window, this means we're already
+ * accounted through the nohz accounting, so skip the entire deal and
+ * sync up for the next window.
*/
- if (atomic_long_read(&calc_load_tasks_idle))
- delta = atomic_long_xchg(&calc_load_tasks_idle, 0);
+ this_rq->calc_load_update = calc_load_update;
+ if (time_before(jiffies, this_rq->calc_load_update + 10))
+ this_rq->calc_load_update += LOAD_FREQ;
+}
+
+static long calc_load_fold_idle(void)
+{
+ int idx = calc_load_read_idx();
+ long delta = 0;
+
+ if (atomic_long_read(&calc_load_idle[idx]))
+ delta = atomic_long_xchg(&calc_load_idle[idx], 0);

return delta;
}
@@ -2303,66 +2456,39 @@ static void calc_global_nohz(void)
{
long delta, active, n;

- /*
- * If we crossed a calc_load_update boundary, make sure to fold
- * any pending idle changes, the respective CPUs might have
- * missed the tick driven calc_load_account_active() update
- * due to NO_HZ.
- */
- delta = calc_load_fold_idle();
- if (delta)
- atomic_long_add(delta, &calc_load_tasks);
+ if (!time_before(jiffies, calc_load_update + 10)) {
+ /*
+ * Catch-up, fold however many we are behind still
+ */
+ delta = jiffies - calc_load_update - 10;
+ n = 1 + (delta / LOAD_FREQ);
+
+ active = atomic_long_read(&calc_load_tasks);
+ active = active > 0 ? active * FIXED_1 : 0;
+
+ avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
+ avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
+ avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

- /*
- * It could be the one fold was all it took, we done!
- */
- if (time_before(jiffies, calc_load_update + 10))
- return;
+ calc_load_update += n * LOAD_FREQ;
+ }

/*
- * Catch-up, fold however many we are behind still
+ * Flip the idle index...
+ *
+ * Make sure we first write the new time then flip the index, so that
+ * calc_load_write_idx() will see the new time when it reads the new
+ * index, this avoids a double flip messing things up.
*/
- delta = jiffies - calc_load_update - 10;
- n = 1 + (delta / LOAD_FREQ);
-
- active = atomic_long_read(&calc_load_tasks);
- active = active > 0 ? active * FIXED_1 : 0;
-
- avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
- avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
- avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);
-
- calc_load_update += n * LOAD_FREQ;
-}
-#else
-void calc_load_account_idle(struct rq *this_rq)
-{
+ smp_wmb();
+ calc_load_idx++;
}
+#else /* !CONFIG_NO_HZ */

-static inline long calc_load_fold_idle(void)
-{
- return 0;
-}
-
-static void calc_global_nohz(void)
-{
-}
-#endif
+static inline long calc_load_fold_idle(void) { return 0; }
+static inline void calc_global_nohz(void) { }

-/**
- * get_avenrun - get the load average array
- * @loads: pointer to dest load array
- * @offset: offset to add
- * @shift: shift count to shift the result left
- *
- * These values are estimates at best, so no need for locking.
- */
-void get_avenrun(unsigned long *loads, unsigned long offset, int shift)
-{
- loads[0] = (avenrun[0] + offset) << shift;
- loads[1] = (avenrun[1] + offset) << shift;
- loads[2] = (avenrun[2] + offset) << shift;
-}
+#endif /* CONFIG_NO_HZ */

/*
* calc_load - update the avenrun load estimates 10 ticks after the
@@ -2370,11 +2496,18 @@ void get_avenrun(unsigned long *loads, u
*/
void calc_global_load(void)
{
- long active;
+ long active, delta;

if (time_before(jiffies, calc_load_update + 10))
return;

+ /*
+ * Fold the 'old' idle-delta to include all NO_HZ cpus.
+ */
+ delta = calc_load_fold_idle();
+ if (delta)
+ atomic_long_add(delta, &calc_load_tasks);
+
active = atomic_long_read(&calc_load_tasks);
active = active > 0 ? active * FIXED_1 : 0;

@@ -2385,12 +2518,7 @@ void calc_global_load(void)
calc_load_update += LOAD_FREQ;

/*
- * Account one period with whatever state we found before
- * folding in the nohz state and ageing the entire idle period.
- *
- * This avoids loosing a sample when we go idle between
- * calc_load_account_active() (10 ticks ago) and now and thus
- * under-accounting.
+ * In case we idled for multiple LOAD_FREQ intervals, catch up in bulk.
*/
calc_global_nohz();
}
@@ -2407,7 +2535,6 @@ static void calc_load_account_active(str
return;

delta = calc_load_fold_active(this_rq);
- delta += calc_load_fold_idle();
if (delta)
atomic_long_add(delta, &calc_load_tasks);

@@ -2415,6 +2542,10 @@ static void calc_load_account_active(str
}

/*
+ * End of global load-average stuff
+ */
+
+/*
* The exact cpuload at various idx values, calculated at every tick would be
* load = (2^idx - 1) / 2^idx * load + 1 / 2^idx * cur_load
*
--- a/kernel/sched/idle_task.c
+++ b/kernel/sched/idle_task.c
@@ -25,7 +25,6 @@ static void check_preempt_curr_idle(stru
static struct task_struct *pick_next_task_idle(struct rq *rq)
{
schedstat_inc(rq, sched_goidle);
- calc_load_account_idle(rq);
return rq->idle;
}

--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -961,8 +961,6 @@ static inline u64 sched_avg_period(void)
return (u64)sysctl_sched_time_avg * NSEC_PER_MSEC / 2;
}

-void calc_load_account_idle(struct rq *this_rq);
-
#ifdef CONFIG_SCHED_HRTICK

/*
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -373,6 +373,7 @@ static ktime_t tick_nohz_stop_sched_tick
*/
if (!ts->tick_stopped) {
select_nohz_load_balancer(1);
+ calc_load_enter_idle();

ts->last_tick = hrtimer_get_expires(&ts->sched_timer);
ts->tick_stopped = 1;
@@ -572,6 +573,7 @@ static void tick_nohz_restart_sched_tick
tick_do_update_jiffies64(now);
update_cpu_load_nohz();

+ calc_load_exit_idle();
touch_softlockup_watchdog();
/*
* Cancel the scheduled timer and restore the tick

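[Editorial aside, not part of the patch: the core of the change above is the two-slot
idle-delta buffer indexed by calc_load_idx. The following is a minimal, single-threaded
userspace sketch of that idea only. The *_model names are hypothetical; the atomics,
memory barriers, per-cpu runqueue state and the +10-tick fold offset of the real kernel
code are deliberately omitted.]

/*
 * Minimal model of the two-slot idle-delta buffer: deltas recorded before
 * the fold window opens go into the current slot, deltas recorded after it
 * opens go into the next slot, and the global fold only consumes the
 * current slot before the index is flipped.
 */
#include <stdio.h>

#define LOAD_FREQ  (5 * 100)            /* 5 s at HZ=100, as in the kernel   */

static long idle_delta_model[2];        /* models calc_load_idle[2]          */
static int idle_idx_model;              /* models calc_load_idx              */
static unsigned long jiffies_model;     /* models jiffies                    */
static unsigned long load_update_model; /* models calc_load_update           */

/* A cpu entering NO_HZ idle adds its delta (cf. calc_load_write_idx()). */
static void enter_idle_model(long delta)
{
        int idx = idle_idx_model;

        /* Once the fold window has opened, start filling the *next* slot. */
        if (jiffies_model >= load_update_model)
                idx++;

        idle_delta_model[idx & 1] += delta;
}

/* The global fold reads and clears only the *current* slot. */
static long fold_idle_model(void)
{
        int idx = idle_idx_model & 1;
        long delta = idle_delta_model[idx];

        idle_delta_model[idx] = 0;
        return delta;
}

/* After the global update, advance the window and flip the slots. */
static void finish_window_model(void)
{
        load_update_model += LOAD_FREQ;
        idle_idx_model++;
}

int main(void)
{
        load_update_model = LOAD_FREQ;

        enter_idle_model(3);            /* idle before the window: old slot  */
        jiffies_model = LOAD_FREQ + 2;  /* fold window is now open           */
        enter_idle_model(5);            /* idle after the window: new slot   */

        printf("folded this window: %ld\n", fold_idle_model());  /* prints 3 */
        finish_window_model();
        printf("folded next window: %ld\n", fold_idle_model());  /* prints 5 */
        return 0;
}

Run as-is it prints 3 for the current window and 5 for the next, illustrating how idle
load accumulated after the window opens is deferred to the following LOAD_FREQ period
instead of being folded twice or lost.
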
2012-06-24 21:46:08

by Doug Smythies

[permalink] [raw]
Subject: RE: [PATCH] sched: Folding nohz load accounting more accurate

> On 2012.06.22 07:03 -0700, Peter Zijlstra wrote:

> Doug, Charles,

> Can you both confirm that the below patch is what you tested and we
> should finally be able to close this issue?

[... patch code deleted ...]

Yes, I confirm that is the patch code I am testing, back-edited to my
working environment (so far, I call this "peter35").

Some more test results are attached as PNG files and tests will
continue for at least another couple of days. I can comment
further on the "finally close this issue" question then, but it
looks very good so far.

Unless I find some issue, I won't push further results to this list,
but results can be pulled from [1].

[1] http://www.smythies.com/~doug/network/load_average/peter35.html


Attachments:
one_low_freq_low_load.png (34.05 kB)
n_processes_peter35.png (43.58 kB)
high_freq_3_8.png (50.27 kB)
background_histograms.png (44.85 kB)

2012-06-25 02:15:27

by Charles Wang

[permalink] [raw]
Subject: Re: [PATCH] sched: Folding nohz load accounting more accurate

On Friday, June 22, 2012 10:03 PM, Peter Zijlstra wrote:

> Doug, Charles,
>
> Can you both confirm that the below patch is what you tested and we
> should finally be able to close this issue?
>
>
> ---

[... patch code deleted ...]

Yes, this patch fixes my problem.