I have an LVS/DR cluster of 10 machines that receive similar traffic
via a round-robin strategy. These machines run Debian Lenny with
2.6.26, and consistently have a 15-minute load average between 4-12
depending on the time of day.
Upgrading any one of these machines to a newer kernel compiled with
NO_HZ=y causes the reported load average to drop significantly. Here,
fe3 and fe12 are running 3.2.4:
fe3: 0.48 0.53 0.55
fe4: 6.73 5.59 5.11
fe5: 5.93 5.29 5.60
fe6: 6.20 5.79 6.08
fe7: 8.32 5.65 5.05
fe8: 6.34 5.85 5.93
fe9: 5.80 5.46 5.53
fe10: 5.49 4.91 5.03
fe11: 6.60 6.11 6.10
fe12: 0.39 0.54 0.46
The newly reported load average is much lower than on the other machines
performing equivalent work, and does not match the CPU usage numbers
reported by vmstat.
Originally, I attempted to upgrade to 2.6.32 (used by lenny-backports
and squeeze). In 2.6.32, load averages are misreported even when using
NO_HZ=n. This bug was fixed in 2.6.34-rc4 (74f5187ac8: sched: Cure
load average vs NO_HZ woes).
The NO_HZ=y case was supposed to be fixed in 2.6.37-rc5 (0f004f5a69:
sched: Cure more NO_HZ load average woes), with the commit stating
"behaviour between CONFIG_NO_HZ=[yn] should be equivalent". In my
environment, however, kernels after this patch was introduced still
misreport load averages when compiled with NO_HZ=y.
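(Background, for anyone not familiar with how these numbers are produced:
every ~5 seconds the kernel samples the number of runnable plus
uninterruptible tasks and folds it into avenrun[] with a fixed-point
exponential moving average; the patches above are all about when and how
that sample is taken and folded once CPUs stop ticking. A self-contained
userspace mimic of the arithmetic, using the kernel's constants but a
purely illustrative harness, looks roughly like this:)

/*
 * Userspace mimic of the kernel's loadavg arithmetic (sketch only; the
 * constants match include/linux/sched.h, the harness is illustrative).
 */
#include <stdio.h>

#define FSHIFT  11                      /* bits of fixed-point precision */
#define FIXED_1 (1UL << FSHIFT)         /* 1.0 in fixed-point */
#define EXP_1   1884                    /* 1/exp(5sec/1min) in fixed-point */
#define EXP_5   2014                    /* 1/exp(5sec/5min) */
#define EXP_15  2037                    /* 1/exp(5sec/15min) */

/* one 5-second decay step, as in the kernel's calc_load() */
static unsigned long calc_load(unsigned long load, unsigned long exp,
                               unsigned long active)
{
        load *= exp;
        load += active * (FIXED_1 - exp);
        return load >> FSHIFT;
}

int main(void)
{
        /* start at load 8.0, then let the machine sit completely idle */
        unsigned long avenrun[3] = { 8 * FIXED_1, 8 * FIXED_1, 8 * FIXED_1 };
        int step;

        for (step = 0; step < 12; step++) {     /* 12 x 5s = one minute */
                avenrun[0] = calc_load(avenrun[0], EXP_1, 0);
                avenrun[1] = calc_load(avenrun[1], EXP_5, 0);
                avenrun[2] = calc_load(avenrun[2], EXP_15, 0);
        }
        printf("after one idle minute: %.2f %.2f %.2f\n",
               avenrun[0] / (double)FIXED_1,
               avenrun[1] / (double)FIXED_1,
               avenrun[2] / (double)FIXED_1);
        return 0;
}

(One idle minute takes a load of 8 down to roughly 2.9 / 6.5 / 7.5 for
the 1/5/15-minute averages.)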
Here's a list of the kernel versions I've tried (more details in
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=620297):
Correct Load Average
2.6.26.25
2.6.32.55-620297patch (CONFIG_NO_HZ=n)
Incorrect Load Average
2.6.32-bpo.5-amd64
2.6.32.55
2.6.32.55-620297patch
2.6.32.55-620297patch (nohz=off)
2.6.37-rc5-cure-more
2.6.39.4
3.2.2
3.2.4
Also worth noting is that fe12 is much newer than fe3, which should
help rule out hardware as a cause of this bug.
fe3: Dell PowerEdge 2950 w/ Xeon(R) CPU L5420 @ 2.50GHz
fe12: Dell PowerEdge R710 w/ Xeon(R) CPU E5645 @ 2.40GHz
I've attached dmesg output from fe12 booting up on 3.2.4. I am happy
to provide any other information that would be useful, and would
appreciate any advice on patches to try or ways to narrow the bug down
further.
Aman
On 02/06/2012 07:51 AM, Aman Gupta wrote:
> I have an LVS/DR cluster of 10 machines that receive similar traffic
> via a round-robin strategy. These machines run Debian Lenny with
> 2.6.26, and consistently have a 15-minute load average between 4-12
> depending on the time of day.
>
> Upgrading any one of these machines to a newer kernel compiled with
> NO_HZ=y causes the reported load average to drop significantly. [...]
I can confirm Aman's results on kernels 2.6.32 and higher on a similar
setup. I did a test on a cluster of diskless PHP workers. Servers were
running on an identical hardware and software platform, so the workload
should have been the same. However, the reported load average differed
depending on which kernel the host was running.
I have tested the following vanilla kernels:
* 2.6.32.55-*
* 2.6.32.55-*-74f5187ac8 (2.6.32.55 with patch 74f5187ac8)
* 2.6.32.55-*-0f004f5a69 (2.6.32.55 with patch 74f5187ac8 and 0f004f5a69)
* 2.6.37-rc5-*-0f004f5a69 (2.6.37 at commit 0f004f5a69)
* 2.6.37-rc5-*-pre-0f004f5a69 (2.6.37 at commit 6313e3c217)
Each kernel was compiled with CONFIG_NO_HZ enabled (no-hz variant) and
disabled (hz variant). Here's a snapshot of load 15 on each kernel:
no-hz hz
2.6.32.55-* 0.59 0.57
2.6.32.55-*-74f5187ac8 3.56 11.79
2.6.32.55-*-0f004f5a69 0.61 11.76
2.6.37-rc5-*-0f004f5a69 0.67 11.65
2.6.37-rc5-*-pre-0f004f5a69 3.97 12.05
I've also uploaded load average [1] and CPU utilization [2] charts for a
visual comparison.
My observations are:
1. On tickless kernels, load is very low when either no patches or both
patches (74f5187ac8 and 0f004f5a69) are applied.

2. Kernels that have only patch 74f5187ac8 applied show the smallest
difference between the hz and no-hz variants. Still, the no-hz kernels
return values lower than their hz siblings.

3. Non-tickless kernels seem to be reporting correct load values. The
overall trend and values match CPU utilization. The only exception is
2.6.32.55-hz, which reports the same values as 2.6.32.55-no-hz.

4. If x processes are using all available cycles, load is correctly
incremented by x. This behavior is consistent across all kernels.
Steps to reproduce: run a bunch of CPU-bound processes that will not use
all available cycles. The biggest difference between expected and
measured load occurs around 30% CPU utilization in my case.
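A minimal partial-load burner along these lines is enough to show it (a
sketch only; the 30% duty cycle and 100 ms period defaults are
illustrative, not the exact load generator I used):

/* burn.c -- keep one process at roughly duty*100% CPU by spinning for
 * duty*period seconds and sleeping for the rest of each period.
 * Build: gcc -O2 -o burn burn.c -lrt
 */
#define _POSIX_C_SOURCE 200112L
#include <stdlib.h>
#include <time.h>

static double now_sec(void)
{
        struct timespec ts;

        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec / 1e9;
}

int main(int argc, char **argv)
{
        double duty   = argc > 1 ? atof(argv[1]) : 0.30;  /* fraction of CPU to use */
        double period = argc > 2 ? atof(argv[2]) : 0.100; /* seconds per on/off cycle */

        for (;;) {
                double start = now_sec();
                struct timespec idle;
                double rest;

                /* busy phase */
                while (now_sec() - start < duty * period)
                        ;

                /* idle phase */
                rest = period * (1.0 - duty);
                idle.tv_sec  = (time_t)rest;
                idle.tv_nsec = (long)((rest - (double)idle.tv_sec) * 1e9);
                nanosleep(&idle, NULL);
        }
        return 0;
}

Start several of them so that total utilization stays well below 100%,
and compare /proc/loadavg between the hz and no-hz kernels.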
Have there been any other patches that correct the load calculation?
Maybe I'm testing it in the wrong way? I'd appreciate any suggestions.
I'd be happy to test new patches. Sadly, I cannot propose any fixes as
the kernel sources are still a mystery to me.
[1] http://img841.imageshack.us/img841/2204/kernelload.png
[2] http://img854.imageshack.us/img854/8194/kernelcpu.png
--
Lesław Kopeć
On Thu, 2012-02-23 at 16:46 +0100, Lesław Kopeć wrote:
> Each kernel was compiled with CONFIG_NO_HZ enabled (no-hz variant) and
> disabled (hz variant). Here's a snapshot of load 15 on each kernel:
> no-hz hz
> 2.6.32.55-* 0.59 0.57
> 2.6.32.55-*-74f5187ac8 3.56 11.79
> 2.6.32.55-*-0f004f5a69 0.61 11.76
> 2.6.37-rc5-*-0f004f5a69 0.67 11.65
> 2.6.37-rc5-*-pre-0f004f5a69 3.97 12.05
Missing here is a kernel built with CONFIG_NO_HZ but booted with
nohz=off; this would be an interesting data point because it includes
all the funny code but still ticks at the right frequency.
> My observations are:
>
> 1. On tickless kernels, load is very low when either no patches or both
> patches (74f5187ac8 and 0f004f5a69) are applied.
>
> 2. Kernels that have only patch 74f5187ac8 applied show the smallest
> difference between the hz and no-hz variants. Still, the no-hz kernels
> return values lower than their hz siblings.
>
> 3. Non-tickless kernels seem to be reporting correct load values. The
> overall trend and values match CPU utilization. The only exception is
> 2.6.32.55-hz, which reports the same values as 2.6.32.55-no-hz.
>
> 4. If x processes are using all available cycles, load is correctly
> incremented by x. This behavior is consistent across all kernels.
Yay! At least we get something right.. Also, I think we actually will go
down to load 0 if the machine is idle; we used to get that wrong for
nohz too.
> Steps to reproduce: run a bunch of CPU-bound processes that will not use
> all available cycles. The biggest difference between expected and
> measured load occurs around 30% CPU utilization in my case.
Hrmm, this suggests we age too hard with the nohz code.. in your test
case is there significant idle time? That is, suppose you run each CPU
at 30%, what is the period of your load? Running 3s out of 10s is
significantly different from running .3ms out of 1ms.
> Have there been any other patches that correct the load calculation?
> Maybe I'm testing it in the wrong way? I'd appreciate any suggestions.
> I'd be happy to test new patches. Sadly, I cannot propose any fixes as
> the kernel sources are still a mystery to me.
Darned load-tracking stuff.. I went over it again but couldn't spot
anything obviously broken. I suspect the tail magic of
calc_global_nohz() is busted, just not seeing it atm.
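(For reference, the "tail magic" here is the NO_HZ catch-up path that
0f004f5a69 added; abbreviated and paraphrased below, so details may
differ from the real thing:)

/*
 * Sketch of calc_global_nohz(): when ticks were stopped across one or
 * more LOAD_FREQ windows, fold the pending idle delta and decay
 * avenrun[] for the n missed intervals in one go.
 */
static void calc_global_nohz(unsigned long ticks)
{
        long delta, active, n;

        if (time_before(jiffies, calc_load_update))
                return;

        /* fold idle changes the stopped ticks never accounted */
        delta = calc_load_fold_idle();
        if (delta)
                atomic_long_add(delta, &calc_load_tasks);

        if (ticks >= LOAD_FREQ) {
                n = ticks / LOAD_FREQ;

                active = atomic_long_read(&calc_load_tasks);
                active = active > 0 ? active * FIXED_1 : 0;

                /* calc_load_n() applies n decay steps at once */
                avenrun[0] = calc_load_n(avenrun[0], EXP_1, active, n);
                avenrun[1] = calc_load_n(avenrun[1], EXP_5, active, n);
                avenrun[2] = calc_load_n(avenrun[2], EXP_15, active, n);

                calc_load_update += n * LOAD_FREQ;
        }
}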
Will go brew myself a fresh pot of tea and stare more.
On Wed, 2012-02-29 at 13:06 +0100, Peter Zijlstra wrote:
>
> > Steps to reproduce: run a bunch of CPU-bound processes that will not use
> > all available cycles. The biggest difference between expected and
> > measured load occurs around 30% CPU utilization in my case.
>
> Hrmm, this suggests we age too hard with the nohz code.. in your test
> case is there significant idle time? That is, suppose you run each CPU
> at 30%, what is the period of your load? Running 3s out of 10s is
> significantly different from running .3ms out of 1ms.
I can indeed see some weirdness, but not only downwards: I can manage to
get a load of 1 with two 20% burners (0.1 ms period). Still need to try
with bigger periods.
> > Have there been any other patches that correct the load calculation?
> > Maybe I'm testing it in the wrong way? I'd appreciate any suggestions.
> > I'd be happy to test new patches. Sadly, I cannot propose any fixes as
> > the kernel sources are still a mystery to me.
>
> Darned load-tracking stuff.. I went over it again but couldn't spot
> anything obviously broken. I suspect the tail magic of
> calc_global_nohz() is busted, just not seeing it atm.
>
> Will go brew myself a fresh pot of tea and stare more.
The only thing I could find is that on nohz we can confuse the per-rq
sample period. Does the below make a difference?
---
kernel/sched/core.c | 9 +--------
kernel/sched/sched.h | 1 -
2 files changed, 1 insertions(+), 9 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7c4322..370c578 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2372,15 +2372,13 @@ static void calc_load_account_active(struct rq *this_rq)
 {
         long delta;
 
-        if (time_before(jiffies, this_rq->calc_load_update))
+        if (time_before(jiffies, calc_load_update))
                 return;
 
         delta  = calc_load_fold_active(this_rq);
         delta += calc_load_fold_idle();
         if (delta)
                 atomic_long_add(delta, &calc_load_tasks);
-
-        this_rq->calc_load_update += LOAD_FREQ;
 }
 
 /*
@@ -5329,10 +5327,6 @@ migration_call(struct notifier_block *nfb, unsigned long action, void *hcpu)
 
         switch (action & ~CPU_TASKS_FROZEN) {
 
-        case CPU_UP_PREPARE:
-                rq->calc_load_update = calc_load_update;
-                break;
-
         case CPU_ONLINE:
                 /* Update our root-domain */
                 raw_spin_lock_irqsave(&rq->lock, flags);
@@ -6879,7 +6873,6 @@ void __init sched_init(void)
                 raw_spin_lock_init(&rq->lock);
                 rq->nr_running = 0;
                 rq->calc_load_active = 0;
-                rq->calc_load_update = jiffies + LOAD_FREQ;
                 init_cfs_rq(&rq->cfs);
                 init_rt_rq(&rq->rt, rq);
 #ifdef CONFIG_FAIR_GROUP_SCHED
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 8a2c768..59b5a33 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -441,7 +441,6 @@ struct rq {
 #endif
 
         /* calc_load related fields */
-        unsigned long calc_load_update;
         long calc_load_active;
 
 #ifdef CONFIG_SCHED_HRTICK
On Wed, 2012-02-29 at 17:24 +0100, Peter Zijlstra wrote:
>
> The only thing I could find is that on nohz we can confuse the per-rq
> sample period. Does the below make a difference?
Uhm, something like so that is..
---
kernel/sched/core.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index d7c4322..44f61df 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2380,7 +2380,8 @@ static void calc_load_account_active(struct rq *this_rq)
         if (delta)
                 atomic_long_add(delta, &calc_load_tasks);
 
-        this_rq->calc_load_update += LOAD_FREQ;
+        while (!time_before(jiffies, this_rq->calc_load_update))
+                this_rq->calc_load_update += LOAD_FREQ;
 }
 
 /*