2009-11-03 02:26:55

by Miao Xie

[permalink] [raw]
Subject: [BUG] cpu controller can't provide fair CPU time for each group

Hi, Peter.

I found two problems about cpu controller:
1) cpu controller didn't provide fair CPU time to groups when the tasks
attached into those groups were bound to the same logic CPU.
2) cpu controller didn't provide fair CPU time to groups when shares of
each group <= 2 * nr_cpus.

The detail is following:
1) The first one is that cpu controller didn't provide fair CPU time to
groups when the tasks attached into those groups were bound to the
same logic CPU.

The reason is that there is something with the computing of the per
cpu shares.

on my test box with 16 logic CPU, I did the following manipulation:
a. create 2 cpu controller groups.
b. attach a task into one group and 2 tasks into the other.
c. bind three tasks to the same logic cpu.
+--------+ +--------+
| group1 | | group2 |
+--------+ +--------+
| |
CPU0 Task A Task B & Task C

The following is the reproduce steps:
# mkdir /dev/cpuctl
# mount -t cgroup -o cpu,noprefix cpuctl /dev/cpuctl
# mkdir /dev/cpuctl/1
# mkdir /dev/cpuctl/2
# cat /dev/zero > /dev/null &
# pid1=$!
# echo $pid1 > /dev/cpuctl/1/tasks
# taskset -p -c 0 $pid1
# cat /dev/zero > /dev/null &
# pid2=$!
# echo $pid2 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid2
# cat /dev/zero > /dev/null &
# pid3=$!
# echo $pid3 > /dev/cpuctl/2/tasks
# taskset -p -c 0 $pid3

some time later, I found the the task in the group1 got the 35% CPU time not
50% CPU time. It was very strange that this result against the expected.

this problem was caused by the wrong computing of the per cpu shares.
According to the design of the cpu controller, the shares of each cpu
controller group will be divided for every CPU by the workload of each
logic CPU.
cpu[i] shares = group shares * CPU[i] workload / sum(CPU workload)

But if the CPU has no task, cpu controller will pretend there is one of
average load, usually this average load is 1024, the load of the task whose
nice is zero. So in the test, the shares of group1 on CPU0 is:
1024 * (1 * 1024) / ((1 * 1024 + 15 * 1024)) = 64
and the shares of group2 on CPU0 is:
1024 * (2 * 1024) / ((2 * 1024 + 15 * 1024)) = 120
The scheduler of the CPU0 provided CPU time to each group by the shares
above. The bug occured.

2) The second problem is that cpu controller didn't provide fair CPU time to
groups when shares of each group <= 2 * nr_cpus

The reason is that per cpu shares was set to MIN_SHARES(=2) if shares of
each group <= 2 * nr_cpus.

on the test box with 16 logic CPU, we do the following test:
a. create two cpu controller groups
b. attach 32 tasks into each group
c. set shares of the first group to 16, the other to 32
+--------+ +--------+
| group1 | | group2 |
+--------+ +--------+
|shares=16 |shares=32
| |
16 Tasks 32 Tasks

some time later, the first group got 50% CPU time, not 33%. It also was very
strange that this result against the expected.

It is because the shares of cpuctl group was small, and there is many logic
CPU. So per cpu shares that was computed was less than MIN_SHARES, and then
was set to MIN_SHARES.

Maybe 16 and 32 is not used usually. We can set a usual number(such as 1024)
to avoid this problem on my box. But the number of CPU on a machine will
become more and more in the future. If the number of CPU is greater than 512,
this bug will occur even we set shares of group to 1024. This is a usual
number. At this rate, the usual user will feel strange.



2009-11-05 02:56:19

by Miao Xie

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

Hi, Ingo

Could you see the following problems?

Regards
Miao

on 2009-11-3 11:26, Miao Xie wrote:
> Hi, Peter.
>
> I found two problems about cpu controller:
> 1) cpu controller didn't provide fair CPU time to groups when the tasks
> attached into those groups were bound to the same logic CPU.
> 2) cpu controller didn't provide fair CPU time to groups when shares of
> each group <= 2 * nr_cpus.
>
> The detail is following:
> 1) The first one is that cpu controller didn't provide fair CPU time to
> groups when the tasks attached into those groups were bound to the
> same logic CPU.
>
> The reason is that there is something with the computing of the per
> cpu shares.
>
> on my test box with 16 logic CPU, I did the following manipulation:
> a. create 2 cpu controller groups.
> b. attach a task into one group and 2 tasks into the other.
> c. bind three tasks to the same logic cpu.
> +--------+ +--------+
> | group1 | | group2 |
> +--------+ +--------+
> | |
> CPU0 Task A Task B & Task C
>
> The following is the reproduce steps:
> # mkdir /dev/cpuctl
> # mount -t cgroup -o cpu,noprefix cpuctl /dev/cpuctl
> # mkdir /dev/cpuctl/1
> # mkdir /dev/cpuctl/2
> # cat /dev/zero > /dev/null &
> # pid1=$!
> # echo $pid1 > /dev/cpuctl/1/tasks
> # taskset -p -c 0 $pid1
> # cat /dev/zero > /dev/null &
> # pid2=$!
> # echo $pid2 > /dev/cpuctl/2/tasks
> # taskset -p -c 0 $pid2
> # cat /dev/zero > /dev/null &
> # pid3=$!
> # echo $pid3 > /dev/cpuctl/2/tasks
> # taskset -p -c 0 $pid3
>
> some time later, I found the the task in the group1 got the 35% CPU
> time not
> 50% CPU time. It was very strange that this result against the expected.
>
> this problem was caused by the wrong computing of the per cpu shares.
> According to the design of the cpu controller, the shares of each cpu
> controller group will be divided for every CPU by the workload of each
> logic CPU.
> cpu[i] shares = group shares * CPU[i] workload / sum(CPU workload)
>
> But if the CPU has no task, cpu controller will pretend there is one of
> average load, usually this average load is 1024, the load of the task
> whose
> nice is zero. So in the test, the shares of group1 on CPU0 is:
> 1024 * (1 * 1024) / ((1 * 1024 + 15 * 1024)) = 64
> and the shares of group2 on CPU0 is:
> 1024 * (2 * 1024) / ((2 * 1024 + 15 * 1024)) = 120
> The scheduler of the CPU0 provided CPU time to each group by the shares
> above. The bug occured.
>
> 2) The second problem is that cpu controller didn't provide fair CPU
> time to
> groups when shares of each group <= 2 * nr_cpus
>
> The reason is that per cpu shares was set to MIN_SHARES(=2) if shares of
> each group <= 2 * nr_cpus.
>
> on the test box with 16 logic CPU, we do the following test:
> a. create two cpu controller groups
> b. attach 32 tasks into each group
> c. set shares of the first group to 16, the other to 32
> +--------+ +--------+
> | group1 | | group2 |
> +--------+ +--------+
> |shares=16 |shares=32
> | |
> 16 Tasks 32 Tasks
>
> some time later, the first group got 50% CPU time, not 33%. It also
> was very
> strange that this result against the expected.
>
> It is because the shares of cpuctl group was small, and there is many
> logic
> CPU. So per cpu shares that was computed was less than MIN_SHARES,
> and then
> was set to MIN_SHARES.
>
> Maybe 16 and 32 is not used usually. We can set a usual number(such
> as 1024)
> to avoid this problem on my box. But the number of CPU on a machine will
> become more and more in the future. If the number of CPU is greater
> than 512,
> this bug will occur even we set shares of group to 1024. This is a usual
> number. At this rate, the usual user will feel strange.
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
>
>
>

2009-11-10 00:23:42

by Andrew Morton

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

(cc [email protected])

On Thu, 05 Nov 2009 11:56:00 +0900
Miao Xie <[email protected]> wrote:

> Hi, Ingo
>
> Could you see the following problems?
>
> Regards
> Miao
>
> on 2009-11-3 11:26, Miao Xie wrote:
> > Hi, Peter.
> >
> > I found two problems about cpu controller:
> > 1) cpu controller didn't provide fair CPU time to groups when the tasks
> > attached into those groups were bound to the same logic CPU.
> > 2) cpu controller didn't provide fair CPU time to groups when shares of
> > each group <= 2 * nr_cpus.
> >
> > The detail is following:
> > 1) The first one is that cpu controller didn't provide fair CPU time to
> > groups when the tasks attached into those groups were bound to the
> > same logic CPU.
> >
> > The reason is that there is something with the computing of the per
> > cpu shares.
> >
> > on my test box with 16 logic CPU, I did the following manipulation:
> > a. create 2 cpu controller groups.
> > b. attach a task into one group and 2 tasks into the other.
> > c. bind three tasks to the same logic cpu.
> > +--------+ +--------+
> > | group1 | | group2 |
> > +--------+ +--------+
> > | |
> > CPU0 Task A Task B & Task C
> >
> > The following is the reproduce steps:
> > # mkdir /dev/cpuctl
> > # mount -t cgroup -o cpu,noprefix cpuctl /dev/cpuctl
> > # mkdir /dev/cpuctl/1
> > # mkdir /dev/cpuctl/2
> > # cat /dev/zero > /dev/null &
> > # pid1=$!
> > # echo $pid1 > /dev/cpuctl/1/tasks
> > # taskset -p -c 0 $pid1
> > # cat /dev/zero > /dev/null &
> > # pid2=$!
> > # echo $pid2 > /dev/cpuctl/2/tasks
> > # taskset -p -c 0 $pid2
> > # cat /dev/zero > /dev/null &
> > # pid3=$!
> > # echo $pid3 > /dev/cpuctl/2/tasks
> > # taskset -p -c 0 $pid3
> >
> > some time later, I found the the task in the group1 got the 35% CPU
> > time not
> > 50% CPU time. It was very strange that this result against the expected.
> >
> > this problem was caused by the wrong computing of the per cpu shares.
> > According to the design of the cpu controller, the shares of each cpu
> > controller group will be divided for every CPU by the workload of each
> > logic CPU.
> > cpu[i] shares = group shares * CPU[i] workload / sum(CPU workload)
> >
> > But if the CPU has no task, cpu controller will pretend there is one of
> > average load, usually this average load is 1024, the load of the task
> > whose
> > nice is zero. So in the test, the shares of group1 on CPU0 is:
> > 1024 * (1 * 1024) / ((1 * 1024 + 15 * 1024)) = 64
> > and the shares of group2 on CPU0 is:
> > 1024 * (2 * 1024) / ((2 * 1024 + 15 * 1024)) = 120
> > The scheduler of the CPU0 provided CPU time to each group by the shares
> > above. The bug occured.
> >
> > 2) The second problem is that cpu controller didn't provide fair CPU
> > time to
> > groups when shares of each group <= 2 * nr_cpus
> >
> > The reason is that per cpu shares was set to MIN_SHARES(=2) if shares of
> > each group <= 2 * nr_cpus.
> >
> > on the test box with 16 logic CPU, we do the following test:
> > a. create two cpu controller groups
> > b. attach 32 tasks into each group
> > c. set shares of the first group to 16, the other to 32
> > +--------+ +--------+
> > | group1 | | group2 |
> > +--------+ +--------+
> > |shares=16 |shares=32
> > | |
> > 16 Tasks 32 Tasks
> >
> > some time later, the first group got 50% CPU time, not 33%. It also
> > was very
> > strange that this result against the expected.
> >
> > It is because the shares of cpuctl group was small, and there is many
> > logic
> > CPU. So per cpu shares that was computed was less than MIN_SHARES,
> > and then
> > was set to MIN_SHARES.
> >
> > Maybe 16 and 32 is not used usually. We can set a usual number(such
> > as 1024)
> > to avoid this problem on my box. But the number of CPU on a machine will
> > become more and more in the future. If the number of CPU is greater
> > than 512,
> > this bug will occur even we set shares of group to 1024. This is a usual
> > number. At this rate, the usual user will feel strange.
> >
> >
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > Please read the FAQ at http://www.tux.org/lkml/
> >
> >
> >
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2009-11-10 09:48:59

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

On Tue, 2009-11-03 at 11:26 +0900, Miao Xie wrote:
> Hi, Peter.
>
> I found two problems about cpu controller:
> 1) cpu controller didn't provide fair CPU time to groups when the tasks
> attached into those groups were bound to the same logic CPU.
> 2) cpu controller didn't provide fair CPU time to groups when shares of
> each group <= 2 * nr_cpus.
3) if you nest them too deep you're too going to see similar funnies.

Too sodding bad gcc messed up unsigned long long for LP64 mode, so we're
stuck with 64bit fixed point math where otherwise we could have used
128bit things.

Also, I don't really care much about fairness vs affinity, if you're
going to constrain the load-balancer and make his life impossible by
using affinities you get to keep the pieces.

But you've got a point, since you can probably see the same issue (1)
with cpusets, and that is because the whole cpu-controller vs cpusets
thing was done wrong.

Someone needs to fix that if they really care.

2009-11-11 06:22:13

by Yasunori Goto

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

Hi.

I have concern about these issues.

> On Tue, 2009-11-03 at 11:26 +0900, Miao Xie wrote:
> > Hi, Peter.
> >
> > I found two problems about cpu controller:
> > 1) cpu controller didn't provide fair CPU time to groups when the tasks
> > attached into those groups were bound to the same logic CPU.
> > 2) cpu controller didn't provide fair CPU time to groups when shares of
> > each group <= 2 * nr_cpus.
> 3) if you nest them too deep you're too going to see similar funnies.
>
> Too sodding bad gcc messed up unsigned long long for LP64 mode, so we're
> stuck with 64bit fixed point math where otherwise we could have used
> 128bit things.
>
> Also, I don't really care much about fairness vs affinity, if you're
> going to constrain the load-balancer and make his life impossible by
> using affinities you get to keep the pieces.


I think 1) differs from 2) and 3). 1) should be fixed at least.
Because 1) is hard to understand for normal users.

------
> > a. create 2 cpu controller groups.
> > b. attach a task into one group and 2 tasks into the other.
> > c. bind three tasks to the same logic cpu.
> > +--------+ +--------+
> > | group1 | | group2 |
> > +--------+ +--------+
> > | |
> > CPU0 Task A Task B & Task C

(snip)

> > some time later, I found the the task in the group1 got the 35% CPU time not
> > 50% CPU time. It was very strange that this result against the expected.
------


Probably, if normal users start to use cpu controller, then they must make
same test by way of trial. They will find same results which differ from their
expectation in spite of SIMPLE test case.
Then, they will think cpu controller must have many bugs, and will never
use it.

When 2) and 3), I can explain users the reason like "It's due to accurate
error of internal calculation."
Beacause the test case is a bit fussy, and user will understand it.

However, I can't explain anything for 1), because it seems not fussy case.


> But you've got a point, since you can probably see the same issue (1)
> with cpusets, and that is because the whole cpu-controller vs cpusets
> thing was done wrong.

When users use cpuset/cpu affinity, then they would like to controll cpu affinity.
Not CPU time.

When users use cpu controller, they would like to controll cpu time, right?
However, the cpu time is far from their expectation. I think it is strange.

>
> Someone needs to fix that if they really care.

To be honest, I don't have any good idea because I'm not familiar with
schduler's code. But I have one question.


1618 static int tg_shares_up(struct task_group *tg, void *data)
1619 {
1620 unsigned long weight, rq_weight = 0, shares = 0;

(snip)

1632 for_each_cpu(i, sched_domain_span(sd)) {
1633 weight = tg->cfs_rq[i]->load.weight;
1634 usd->rq_weight[i] = weight;
1635
1636 /*
1637 * If there are currently no tasks on the cpu pretend there
1638 * is one of average load so that when a new task gets to
1639 * run here it will not get delayed by group starvation.
1640 */
1641 if (!weight)
1642 weight = NICE_0_LOAD; ---------(*)

I heard from test team when (*) was removed, 1) didn't occur.

The comment said (*) is to avoid starvation condition.
However, I don't understand why NICE_0_LOAD must be specified.
Could you tell me why small value (like 2 or 3) is not used for (*)?
What is side effect?



Thanks.

--
Yasunori Goto

2009-11-11 07:20:30

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:

> When users use cpuset/cpu affinity, then they would like to controll cpu affinity.
> Not CPU time.

What are people using affinity for? The only use of affinity is to
restrict or disable the load-balancer. Don't complain the load-balancer
doesn't work when you're taking active steps to hinder its work.

If you don't want things load-balanced, turn it off, if you want the
load-balancer to work on smaller groups of cpus, use cpusets.

Anyway, I said there needs to be done something because the interaction
between cpusets and the cpu-controller is utter crap, they never should
have been separated like they are.

> To be honest, I don't have any good idea because I'm not familiar with
> schduler's code. But I have one question.
>
>
> 1618 static int tg_shares_up(struct task_group *tg, void *data)
> 1619 {
> 1620 unsigned long weight, rq_weight = 0, shares = 0;
>
> (snip)
>
> 1632 for_each_cpu(i, sched_domain_span(sd)) {
> 1633 weight = tg->cfs_rq[i]->load.weight;
> 1634 usd->rq_weight[i] = weight;
> 1635
> 1636 /*
> 1637 * If there are currently no tasks on the cpu pretend there
> 1638 * is one of average load so that when a new task gets to
> 1639 * run here it will not get delayed by group starvation.
> 1640 */
> 1641 if (!weight)
> 1642 weight = NICE_0_LOAD; ---------(*)
>
> I heard from test team when (*) was removed, 1) didn't occur.
>
> The comment said (*) is to avoid starvation condition.
> However, I don't understand why NICE_0_LOAD must be specified.
> Could you tell me why small value (like 2 or 3) is not used for (*)?
> What is side effect?

Exactly what the comment says, it will get delayed because the group
won't get scheduled on that cpu until all the group weights get
re-adjusted again, which can be much longer than the typical runtimes of
the workload in question.

Regular weights are NICE_0_LOAD, if you stick a 3 next to that I'll not
get ran much -> starvation.

2009-11-11 09:59:35

by Yasunori Goto

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group


After receiving your mail, I realized I misunderstood about
test case 1). I thought 1) was occur without cpu affinity due to
mis-communication with test team.
I'm really really sorry for noise. I need much coffee. :-(


> On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
>
> > When users use cpuset/cpu affinity, then they would like to controll cpu affinity.
> > Not CPU time.
>
> What are people using affinity for? The only use of affinity is to
> restrict or disable the load-balancer. Don't complain the load-balancer
> doesn't work when you're taking active steps to hinder its work.
>
> If you don't want things load-balanced, turn it off, if you want the
> load-balancer to work on smaller groups of cpus, use cpusets.

Ok. make sense.

>
> Anyway, I said there needs to be done something because the interaction
> between cpusets and the cpu-controller is utter crap, they never should
> have been separated like they are.

Thanks.

>
> > To be honest, I don't have any good idea because I'm not familiar with
> > schduler's code. But I have one question.
> >
> >
> > 1618 static int tg_shares_up(struct task_group *tg, void *data)
> > 1619 {
> > 1620 unsigned long weight, rq_weight = 0, shares = 0;
> >
> > (snip)
> >
> > 1632 for_each_cpu(i, sched_domain_span(sd)) {
> > 1633 weight = tg->cfs_rq[i]->load.weight;
> > 1634 usd->rq_weight[i] = weight;
> > 1635
> > 1636 /*
> > 1637 * If there are currently no tasks on the cpu pretend there
> > 1638 * is one of average load so that when a new task gets to
> > 1639 * run here it will not get delayed by group starvation.
> > 1640 */
> > 1641 if (!weight)
> > 1642 weight = NICE_0_LOAD; ---------(*)
> >
> > I heard from test team when (*) was removed, 1) didn't occur.
> >
> > The comment said (*) is to avoid starvation condition.
> > However, I don't understand why NICE_0_LOAD must be specified.
> > Could you tell me why small value (like 2 or 3) is not used for (*)?
> > What is side effect?
>
> Exactly what the comment says, it will get delayed because the group
> won't get scheduled on that cpu until all the group weights get
> re-adjusted again, which can be much longer than the typical runtimes of
> the workload in question.
>
> Regular weights are NICE_0_LOAD, if you stick a 3 next to that I'll not
> get ran much -> starvation.

Ok.

Thank you very much for your explanation.

Best Regards.

--
Yasunori Goto

2009-11-11 10:08:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
> > Someone needs to fix that if they really care.
>
> To be honest, I don't have any good idea because I'm not familiar with
> schduler's code.

You could possible try something like the below, its rather gross but
might maybe work,.. has other issues though.


---
kernel/sched.c | 9 +++++++--
1 files changed, 7 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 91642c1..65629a3 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1614,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
*/
static int tg_shares_up(struct task_group *tg, void *data)
{
- unsigned long weight, rq_weight = 0, shares = 0;
+ unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
unsigned long *usd_rq_weight;
struct sched_domain *sd = data;
unsigned long flags;
@@ -1630,6 +1630,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
weight = tg->cfs_rq[i]->load.weight;
usd_rq_weight[i] = weight;

+ rq_weight += weight;
/*
* If there are currently no tasks on the cpu pretend there
* is one of average load so that when a new task gets to
@@ -1638,10 +1639,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
if (!weight)
weight = NICE_0_LOAD;

- rq_weight += weight;
+ sum_weight += weight;
+
shares += tg->cfs_rq[i]->shares;
}

+ if (!rq_weight)
+ rq_weight = sum_weight;
+
if ((!shares && rq_weight) || shares > tg->shares)
shares = tg->shares;

2009-11-11 20:41:06

by Chris Friesen

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

On 11/11/2009 01:20 AM, Peter Zijlstra wrote:
> On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
>
>> When users use cpuset/cpu affinity, then they would like to controll cpu affinity.
>> Not CPU time.
>
> What are people using affinity for? The only use of affinity is to
> restrict or disable the load-balancer. Don't complain the load-balancer
> doesn't work when you're taking active steps to hinder its work.

I have one active user of scheduler groups (using CKRM though, but they
want to switch to a new kernel using CFS and sched groups in the near
future).

They want to run their app on one cpu by itself with as little
interference as possible. Pure cpu processing, not even any I/O except
via shared memory buffers. Everything else gets done on the other cpu,
but they want to control how much of the other cpu is assigned to packet
processing, how much to system maintenance, normal user shell commands, etc.

This would seem like a case where some sort of cpuset/affinity and
sched groups would be expected to play nice together.


Chris

2009-11-11 20:51:43

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

On Wed, 2009-11-11 at 14:39 -0600, Chris Friesen wrote:
> On 11/11/2009 01:20 AM, Peter Zijlstra wrote:
> > On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
> >
> >> When users use cpuset/cpu affinity, then they would like to controll cpu affinity.
> >> Not CPU time.
> >
> > What are people using affinity for? The only use of affinity is to
> > restrict or disable the load-balancer. Don't complain the load-balancer
> > doesn't work when you're taking active steps to hinder its work.
>
> I have one active user of scheduler groups (using CKRM though, but they
> want to switch to a new kernel using CFS and sched groups in the near
> future).
>
> They want to run their app on one cpu by itself with as little
> interference as possible. Pure cpu processing, not even any I/O except
> via shared memory buffers. Everything else gets done on the other cpu,
> but they want to control how much of the other cpu is assigned to packet
> processing, how much to system maintenance, normal user shell commands, etc.
>
> This would seem like a case where some sort of cpuset/affinity and
> sched groups would be expected to play nice together.

Agreed, and I'd like the load-balance partition feature of cpusets and
the cpu task-groups to work well together, except that the current
interface doesn't allow that nicely.


2009-11-12 01:13:06

by Yasunori Goto

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

> On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
> > > Someone needs to fix that if they really care.
> >
> > To be honest, I don't have any good idea because I'm not familiar with
> > schduler's code.
>
> You could possible try something like the below, its rather gross but
> might maybe work,.. has other issues though.

Thank you very much for your patch. I'm very glad.

Unfortunately, my test staff and I don't have test box today,
I'll report the result later.

Thanks again for your time.

>
>
> ---
> kernel/sched.c | 9 +++++++--
> 1 files changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 91642c1..65629a3 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -1614,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
> */
> static int tg_shares_up(struct task_group *tg, void *data)
> {
> - unsigned long weight, rq_weight = 0, shares = 0;
> + unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
> unsigned long *usd_rq_weight;
> struct sched_domain *sd = data;
> unsigned long flags;
> @@ -1630,6 +1630,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
> weight = tg->cfs_rq[i]->load.weight;
> usd_rq_weight[i] = weight;
>
> + rq_weight += weight;
> /*
> * If there are currently no tasks on the cpu pretend there
> * is one of average load so that when a new task gets to
> @@ -1638,10 +1639,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
> if (!weight)
> weight = NICE_0_LOAD;
>
> - rq_weight += weight;
> + sum_weight += weight;
> +
> shares += tg->cfs_rq[i]->shares;
> }
>
> + if (!rq_weight)
> + rq_weight = sum_weight;
> +
> if ((!shares && rq_weight) || shares > tg->shares)
> shares = tg->shares;
>
>

--
Yasunori Goto

2009-11-19 07:10:10

by Yasunori Goto

[permalink] [raw]
Subject: Re: [BUG] cpu controller can't provide fair CPU time for each group

Hi, Peter-san.
Sorry for late response.

> > On Wed, 2009-11-11 at 15:21 +0900, Yasunori Goto wrote:
> > > > Someone needs to fix that if they really care.
> > >
> > > To be honest, I don't have any good idea because I'm not familiar with
> > > schduler's code.
> >
> > You could possible try something like the below, its rather gross but
> > might maybe work,.. has other issues though.
>
> Thank you very much for your patch. I'm very glad.
>
> Unfortunately, my test staff and I don't have test box today,
> I'll report the result later.
>
> Thanks again for your time.

We tested your patch. It works perfectly as we expected.

In addition, we confirmed the environment how this issue occurs.
Even if we doesn't use neither affinity nor cpuset, this issue can be
reproduced.

For example, we made 2 groups which had same share values on the box
which had 2 cores.

CPU time CPU time
# of tasks shares w/o your patch with your patch
group A 1 1024 Avg 71% Avg 100%
group B 3 1024 Avg 129% Avg 100%

Probably, this is the worst case for current cpu controller.
But your patch can fix this.

So, I would like to push this patch into main line if you are ok.
Though you may feel this patch is ugly, I think this is good for stable kernel.

Anyway, thanks a lot for your patch.

Regards.

Acked-by: Yasunori Goto <[email protected]>
Tested-by: Kengo Kuwahara <[email protected]>


>
> >
> >
> > ---
> > kernel/sched.c | 9 +++++++--
> > 1 files changed, 7 insertions(+), 2 deletions(-)
> >
> > diff --git a/kernel/sched.c b/kernel/sched.c
> > index 91642c1..65629a3 100644
> > --- a/kernel/sched.c
> > +++ b/kernel/sched.c
> > @@ -1614,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
> > */
> > static int tg_shares_up(struct task_group *tg, void *data)
> > {
> > - unsigned long weight, rq_weight = 0, shares = 0;
> > + unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
> > unsigned long *usd_rq_weight;
> > struct sched_domain *sd = data;
> > unsigned long flags;
> > @@ -1630,6 +1630,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
> > weight = tg->cfs_rq[i]->load.weight;
> > usd_rq_weight[i] = weight;
> >
> > + rq_weight += weight;
> > /*
> > * If there are currently no tasks on the cpu pretend there
> > * is one of average load so that when a new task gets to
> > @@ -1638,10 +1639,14 @@ static int tg_shares_up(struct task_group *tg, void *data)
> > if (!weight)
> > weight = NICE_0_LOAD;
> >
> > - rq_weight += weight;
> > + sum_weight += weight;
> > +
> > shares += tg->cfs_rq[i]->shares;
> > }
> >
> > + if (!rq_weight)
> > + rq_weight = sum_weight;
> > +
> > if ((!shares && rq_weight) || shares > tg->shares)
> > shares = tg->shares;
> >
> >
>
> --
> Yasunori Goto

--
Yasunori Goto

2009-12-09 09:57:42

by Peter Zijlstra

[permalink] [raw]
Subject: [tip:sched/urgent] sched: cgroup: Implement different treatment for idle shares

Commit-ID: cd8ad40de36c2fe75f3b731bd70198b385895246
Gitweb: http://git.kernel.org/tip/cd8ad40de36c2fe75f3b731bd70198b385895246
Author: Peter Zijlstra <[email protected]>
AuthorDate: Thu, 3 Dec 2009 18:00:07 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Wed, 9 Dec 2009 10:03:09 +0100

sched: cgroup: Implement different treatment for idle shares

When setting the weight for a per-cpu task-group, we have to put in a
phantom weight when there is no work on that cpu, otherwise we'll not
service that cpu when new work gets placed there until we again update
the per-cpu weights.

We used to add these phantom weights to the total, so that the idle
per-cpu shares don't get inflated, this however causes the non-idle
parts to get deflated, causing unexpected weight distibutions.

Reverse this, so that the non-idle shares are correct but the idle
shares are inflated.

Reported-by: Yasunori Goto <[email protected]>
Tested-by: Yasunori Goto <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
LKML-Reference: <1257934048.23203.76.camel@twins>
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 8 ++++++--
1 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 0170735..71eb062 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -1614,7 +1614,7 @@ static void update_group_shares_cpu(struct task_group *tg, int cpu,
*/
static int tg_shares_up(struct task_group *tg, void *data)
{
- unsigned long weight, rq_weight = 0, shares = 0;
+ unsigned long weight, rq_weight = 0, sum_weight = 0, shares = 0;
unsigned long *usd_rq_weight;
struct sched_domain *sd = data;
unsigned long flags;
@@ -1630,6 +1630,7 @@ static int tg_shares_up(struct task_group *tg, void *data)
weight = tg->cfs_rq[i]->load.weight;
usd_rq_weight[i] = weight;

+ rq_weight += weight;
/*
* If there are currently no tasks on the cpu pretend there
* is one of average load so that when a new task gets to
@@ -1638,10 +1639,13 @@ static int tg_shares_up(struct task_group *tg, void *data)
if (!weight)
weight = NICE_0_LOAD;

- rq_weight += weight;
+ sum_weight += weight;
shares += tg->cfs_rq[i]->shares;
}

+ if (!rq_weight)
+ rq_weight = sum_weight;
+
if ((!shares && rq_weight) || shares > tg->shares)
shares = tg->shares;