2009-04-17 05:45:48

by Miao Xie

Subject: [RFC][PATCH] sched: fix the nice-unfairness on SMP when offline a CPU

I tested the fairness of the scheduler on my multi-core box (2 CPUs * 2 cores) and
found that nice-fairness was broken when I offlined a CPU: half of the tasks got
only half as much CPU time as the others.

A test program which reproduces the problem on the current kernel is attached.
This program forks a number of child tasks; the parent task then reads each
child's loop count and computes the average and standard deviation every
5 seconds. (All of the child tasks do the same work: repeatedly computing sqrt().)
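For readers without the attachment, here is a minimal sketch of such a fairness
test. It is a simplification, not the attached sched-fair.c: the task count and
interval are hard-coded where the real program takes -p and -i, and plain
volatile counters stand in for whatever bookkeeping the real program does.

#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/mman.h>

#define NTASKS   8	/* the real program takes this via -p */
#define INTERVAL 5	/* ... and this via -i */

int main(void)
{
	/* one shared loop counter per child */
	volatile unsigned long *count = mmap(NULL, NTASKS * sizeof(*count),
			PROT_READ | PROT_WRITE,
			MAP_SHARED | MAP_ANONYMOUS, -1, 0);
	unsigned long last[NTASKS] = { 0 };
	int i;

	if (count == MAP_FAILED)
		exit(1);

	for (i = 0; i < NTASKS; i++) {
		if (fork() == 0) {
			double x = 0.0;

			for (;;) {	/* all children do the same work */
				x += sqrt((double)count[i]);
				count[i]++;
			}
		}
	}

	for (;;) {	/* parent: sample and report every INTERVAL seconds */
		double sum = 0.0, sq = 0.0, avg, dev;

		sleep(INTERVAL);
		for (i = 0; i < NTASKS; i++) {
			double delta = (double)(count[i] - last[i]);

			last[i] = count[i];
			sum += delta;
			sq += delta * delta;
		}
		avg = sum / NTASKS;
		dev = sqrt(sq / NTASKS - avg * avg);
		printf("AVERAGE %10.3f    STD-DEV %10.3f\n", avg, dev);
	}
}

Compile with gcc -O2 -o sched-fair-sketch sched-fair-sketch.c -lm and run it
after offlining a CPU; an unfair split shows up as a standard deviation that is
large relative to the average.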

Steps to reproduce:
# echo 0 > /sys/devices/system/cpu/cpu3/online
# ./sched-fair -p 8 -i 5 -v

By debugging, we found that it is caused by the __cpu_power of the sched groups.
If I offline a CPU, the partition of sched groups in the CPU-level sched domain
is:
+-----------+----------+
| CPU0 CPU1 | CPU2 |
+-----------+----------+
and the __cpu_power of each sched group was 1024. That is strange: the first
sched group has two logical CPUs, so its __cpu_power should be twice that of the
second sched group. If both sched groups' __cpu_power is 1024, the load balancer
distributes the load fifty-fifty between the two groups, so half of the test
tasks were moved to logical CPU2 and got less CPU time.
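To make the arithmetic concrete, here is a toy model of the scaling the
balancer applies, loosely modelled on the avg_load computation in
find_busiest_group(). The avg_load() helper and the assumption that each busy
task contributes one SCHED_LOAD_SCALE of load are mine, for illustration only:

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

/* hypothetical helper: per-group load normalized by group power */
static unsigned long avg_load(unsigned long group_load,
			      unsigned long cpu_power)
{
	return group_load * SCHED_LOAD_SCALE / cpu_power;
}

int main(void)
{
	unsigned long four_tasks = 4 * SCHED_LOAD_SCALE;

	/* Buggy case: both groups report power 1024, so a 4/4 task
	 * split looks perfectly balanced (4096 vs 4096) even though
	 * one group has two cores behind it. */
	printf("power 1024/1024: %lu vs %lu\n",
	       avg_load(four_tasks, 1024), avg_load(four_tasks, 1024));

	/* Expected case: the two-core group reports power 2048, so
	 * the same 4/4 split looks unbalanced (2048 vs 4096) and the
	 * balancer would pull tasks until the split is roughly 2:1. */
	printf("power 2048/1024: %lu vs %lu\n",
	       avg_load(four_tasks, 2048), avg_load(four_tasks, 1024));

	return 0;
}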

The code that causes this problem is the following:
static void init_sched_groups_power(int cpu, struct sched_domain *sd)
{
	[snip]
	/*
	 * For perf policy, if the groups in child domain share resources
	 * (for example cores sharing some portions of the cache hierarchy
	 * or SMT), then set this domain groups cpu_power such that each group
	 * can handle only one task, when there are other idle groups in the
	 * same sched domain.
	 */
	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
		       (child->flags &
			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
		return;
	}
	[snip]
}
According to the comment above, this design was intended to help performance,
but I found no performance regression after applying this patch.
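For context: with the check reduced to !child, every domain that has children
falls through to the rest of the function. As I recall the sched.c of this era
(treat this as an approximation, since that part is elided by [snip] above),
the fall-through path just sums the child groups' power:

	group = child->groups;
	do {
		sg_inc_cpu_power(sd->groups, group->__cpu_power);
		group = group->next;
	} while (group != child->groups);

For the topology above, that gives the {CPU0, CPU1} group 1024 + 1024 = 2048
while the {CPU2} group keeps 1024, which is exactly the 2:1 power ratio the
balancer needs.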

Test result on a multi-core x86_64 box:
Before applying this patch:
  AVERAGE     STD-DEV
  1297.500    432.518

After applying this patch:
  AVERAGE     STD-DEV
  1297.250    118.857

Test result on a hyper-threading x86_64 box:
Before applying this patch:
  AVERAGE     STD-DEV
  536.750     176.265

After applying this patch:
  AVERAGE     STD-DEV
  535.625     53.979

Maybe we need more testing of this.

Signed-off-by: Miao Xie <[email protected]>
---
kernel/sched.c | 11 +----------
1 files changed, 1 insertions(+), 10 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 5724508..07b08b2 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7956,16 +7956,7 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 
 	sd->groups->__cpu_power = 0;
 
-	/*
-	 * For perf policy, if the groups in child domain share resources
-	 * (for example cores sharing some portions of the cache hierarchy
-	 * or SMT), then set this domain groups cpu_power such that each group
-	 * can handle only one task, when there are other idle groups in the
-	 * same sched domain.
-	 */
-	if (!child || (!(sd->flags & SD_POWERSAVINGS_BALANCE) &&
-		       (child->flags &
-			(SD_SHARE_CPUPOWER | SD_SHARE_PKG_RESOURCES)))) {
+	if (!child) {
 		sg_inc_cpu_power(sd->groups, SCHED_LOAD_SCALE);
 		return;
 	}
--
1.6.0.3


Attachments:
sched-fair.c (9.65 kB)