2008-02-27 22:52:04

by Kei Tokunaga

[permalink] [raw]
Subject: A strange behavior of sched_fair

Hi Ingo,

I am playing around with sched_fair and cgroup, and it seems like
I hit a possible bug. Could you also check if that is a bug?

Description of behavior:
Start a cpu-bound task (t1), attach it to a cgroup (cgA), and let the
task to run for a while (e.g. several tens of seconds or a couple of
minutes would be adequate.) Then, start another cpu-bound task (t2)
and attach it to cgA in the way described in "Steps to Reproduce" section.
You will see t1 does not get run for a while.
(The tasks may not have to be cpu-bound, but it is easier to see the
behavior using cpu-bound tasks.)

How reproducible:
Always.

Environments where I saw the behavior:
2.6.25-rc3 with resource management functions enabled on ia64 box.

Steps to Reproduce:
# mkdir /dev/cgroup
# mount -t cgroup -ocpuset,cpu cpu /dev/cgroup
# mkdir /dev/cgroup/{a,b}
# echo 0 > /dev/cgroup/a/cpuset.cpus
# echo 0 > /dev/cgroup/b/cpuset.cpus
# echo 1 > /dev/cgroup/a/cpuset.mems
# echo 1 > /dev/cgroup/b/cpuset.mems
# echo $$ > /dev/cgroup/b/tasks
# ./a.out & echo $! > /dev/cgroup/a/tasks (a.out is just a for-loop program)
[Wait for several tens of seconds or a couple of minutes.]
# ./a.out2 & echo $! > /dev/cgroup/a/tasks (a.out2 is just a for-loop program)
[You will see a.out does not get run for a while by running top command.]

Additional Info:
a.out2 needs to be started from the shell of cgroup-b in order to
reproduce the problem (, unless the system is UP.) Starting a.out2
in the manner, se->vruntime (or something to create the se->vruntime)
of a.out2 seems to be initialized to a small value, compared to the
value of a.out. And the fair scheduler only runs a.out2 until the
se->vruntime catches up with the se->vruntime of a.out.

Thanks,
Kei
--
Kei Tokunaga
Fujitsu (Red Hat On-site Partner)


2008-02-29 07:08:54

by Andrew Morton

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair


(cc containers list)

On Wed, 27 Feb 2008 17:51:35 -0500 Kei Tokunaga <[email protected]> wrote:

> Hi Ingo,
>
> I am playing around with sched_fair and cgroup, and it seems like
> I hit a possible bug. Could you also check if that is a bug?
>
> Description of behavior:
> Start a cpu-bound task (t1), attach it to a cgroup (cgA), and let the
> task to run for a while (e.g. several tens of seconds or a couple of
> minutes would be adequate.) Then, start another cpu-bound task (t2)
> and attach it to cgA in the way described in "Steps to Reproduce" section.
> You will see t1 does not get run for a while.
> (The tasks may not have to be cpu-bound, but it is easier to see the
> behavior using cpu-bound tasks.)
>
> How reproducible:
> Always.
>
> Environments where I saw the behavior:
> 2.6.25-rc3 with resource management functions enabled on ia64 box.
>
> Steps to Reproduce:
> # mkdir /dev/cgroup
> # mount -t cgroup -ocpuset,cpu cpu /dev/cgroup
> # mkdir /dev/cgroup/{a,b}
> # echo 0 > /dev/cgroup/a/cpuset.cpus
> # echo 0 > /dev/cgroup/b/cpuset.cpus
> # echo 1 > /dev/cgroup/a/cpuset.mems
> # echo 1 > /dev/cgroup/b/cpuset.mems
> # echo $$ > /dev/cgroup/b/tasks
> # ./a.out & echo $! > /dev/cgroup/a/tasks (a.out is just a for-loop program)
> [Wait for several tens of seconds or a couple of minutes.]
> # ./a.out2 & echo $! > /dev/cgroup/a/tasks (a.out2 is just a for-loop program)
> [You will see a.out does not get run for a while by running top command.]
>
> Additional Info:
> a.out2 needs to be started from the shell of cgroup-b in order to
> reproduce the problem (, unless the system is UP.) Starting a.out2
> in the manner, se->vruntime (or something to create the se->vruntime)
> of a.out2 seems to be initialized to a small value, compared to the
> value of a.out. And the fair scheduler only runs a.out2 until the
> se->vruntime catches up with the se->vruntime of a.out.
>
> Thanks,
> Kei
> --
> Kei Tokunaga
> Fujitsu (Red Hat On-site Partner)

2008-02-29 09:35:16

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair


On Wed, 2008-02-27 at 17:51 -0500, Kei Tokunaga wrote:
> Hi Ingo,
>
> I am playing around with sched_fair and cgroup, and it seems like
> I hit a possible bug. Could you also check if that is a bug?
>
> Description of behavior:
> Start a cpu-bound task (t1), attach it to a cgroup (cgA), and let the
> task to run for a while (e.g. several tens of seconds or a couple of
> minutes would be adequate.) Then, start another cpu-bound task (t2)
> and attach it to cgA in the way described in "Steps to Reproduce" section.
> You will see t1 does not get run for a while.
> (The tasks may not have to be cpu-bound, but it is easier to see the
> behavior using cpu-bound tasks.)
>
> How reproducible:
> Always.
>
> Environments where I saw the behavior:
> 2.6.25-rc3 with resource management functions enabled on ia64 box.
>
> Steps to Reproduce:
> # mkdir /dev/cgroup
> # mount -t cgroup -ocpuset,cpu cpu /dev/cgroup
> # mkdir /dev/cgroup/{a,b}
> # echo 0 > /dev/cgroup/a/cpuset.cpus
> # echo 0 > /dev/cgroup/b/cpuset.cpus
> # echo 1 > /dev/cgroup/a/cpuset.mems
> # echo 1 > /dev/cgroup/b/cpuset.mems
> # echo $$ > /dev/cgroup/b/tasks
> # ./a.out & echo $! > /dev/cgroup/a/tasks (a.out is just a for-loop program)
> [Wait for several tens of seconds or a couple of minutes.]
> # ./a.out2 & echo $! > /dev/cgroup/a/tasks (a.out2 is just a for-loop program)
> [You will see a.out does not get run for a while by running top command.]
>
> Additional Info:
> a.out2 needs to be started from the shell of cgroup-b in order to
> reproduce the problem (, unless the system is UP.) Starting a.out2
> in the manner, se->vruntime (or something to create the se->vruntime)
> of a.out2 seems to be initialized to a small value, compared to the
> value of a.out. And the fair scheduler only runs a.out2 until the
> se->vruntime catches up with the se->vruntime of a.out.

Seems the vruntime doesn't get re-set if you move tasks between groups.
sched_move_task() should call place_entity() in the context of the new
cfs_rq.

2008-02-29 10:12:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair


On Fri, 2008-02-29 at 10:34 +0100, Peter Zijlstra wrote:
> On Wed, 2008-02-27 at 17:51 -0500, Kei Tokunaga wrote:
> > Hi Ingo,
> >
> > I am playing around with sched_fair and cgroup, and it seems like
> > I hit a possible bug. Could you also check if that is a bug?
> >
> > Description of behavior:
> > Start a cpu-bound task (t1), attach it to a cgroup (cgA), and let the
> > task to run for a while (e.g. several tens of seconds or a couple of
> > minutes would be adequate.) Then, start another cpu-bound task (t2)
> > and attach it to cgA in the way described in "Steps to Reproduce" section.
> > You will see t1 does not get run for a while.
> > (The tasks may not have to be cpu-bound, but it is easier to see the
> > behavior using cpu-bound tasks.)
> >
> > How reproducible:
> > Always.
> >
> > Environments where I saw the behavior:
> > 2.6.25-rc3 with resource management functions enabled on ia64 box.
> >
> > Steps to Reproduce:
> > # mkdir /dev/cgroup
> > # mount -t cgroup -ocpuset,cpu cpu /dev/cgroup
> > # mkdir /dev/cgroup/{a,b}
> > # echo 0 > /dev/cgroup/a/cpuset.cpus
> > # echo 0 > /dev/cgroup/b/cpuset.cpus
> > # echo 1 > /dev/cgroup/a/cpuset.mems
> > # echo 1 > /dev/cgroup/b/cpuset.mems
> > # echo $$ > /dev/cgroup/b/tasks
> > # ./a.out & echo $! > /dev/cgroup/a/tasks (a.out is just a for-loop program)
> > [Wait for several tens of seconds or a couple of minutes.]
> > # ./a.out2 & echo $! > /dev/cgroup/a/tasks (a.out2 is just a for-loop program)
> > [You will see a.out does not get run for a while by running top command.]
> >
> > Additional Info:
> > a.out2 needs to be started from the shell of cgroup-b in order to
> > reproduce the problem (, unless the system is UP.) Starting a.out2
> > in the manner, se->vruntime (or something to create the se->vruntime)
> > of a.out2 seems to be initialized to a small value, compared to the
> > value of a.out. And the fair scheduler only runs a.out2 until the
> > se->vruntime catches up with the se->vruntime of a.out.
>
> Seems the vruntime doesn't get re-set if you move tasks between groups.
> sched_move_task() should call place_entity() in the context of the new
> cfs_rq.

This should do I guess - uncompiled, untested

---
Subject: sched: fair-group: fixup tasks on group move

Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.

Signed-off-by: Peter Zijlstra <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched.c | 3 +++
kernel/sched_fair.c | 14 ++++++++++++++
3 files changed, 21 insertions(+)

Index: linux-2.6-2/include/linux/sched.h
===================================================================
--- linux-2.6-2.orig/include/linux/sched.h
+++ linux-2.6-2/include/linux/sched.h
@@ -899,6 +899,10 @@ struct sched_class {
int running);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio, int running);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ void (*moved_group) (struct task_struct *p);
+#endif
};

struct load_weight {
Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -7831,6 +7831,9 @@ void sched_move_task(struct task_struct

set_task_rq(tsk, task_cpu(tsk));

+ if (tsk->sched_class->moved_group)
+ tsk->sched_class->moved_group(tsk);
+
if (on_rq) {
if (unlikely(running))
tsk->sched_class->set_curr_task(rq);
Index: linux-2.6-2/kernel/sched_fair.c
===================================================================
--- linux-2.6-2.orig/kernel/sched_fair.c
+++ linux-2.6-2/kernel/sched_fair.c
@@ -1398,6 +1398,16 @@ static void set_curr_task_fair(struct rq
set_next_entity(cfs_rq_of(se), se);
}

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static void moved_group_fair(struct task_struct *p)
+{
+ struct cfs_rq = task_cfs_rq(tsk);
+
+ update_curr(cfs_rq);
+ place_entity(cfs_rq, &p->se, 1);
+}
+#endif
+
/*
* All the scheduling class methods:
*/
@@ -1426,6 +1436,10 @@ static const struct sched_class fair_sch

.prio_changed = prio_changed_fair,
.switched_to = switched_to_fair,
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ .moved_group = moved_group_fair,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG

2008-02-29 19:42:39

by Ingo Molnar

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair


* Peter Zijlstra <[email protected]> wrote:

> > Seems the vruntime doesn't get re-set if you move tasks between
> > groups. sched_move_task() should call place_entity() in the context
> > of the new cfs_rq.
>
> This should do I guess - uncompiled, untested

if someone tests it and if it solves the problem i'll apply it to the
scheduler tree.

Ingo

2008-02-29 20:23:19

by Kei Tokunaga

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair

Ingo Molnar wrote, (2008/02/29 14:42):
> * Peter Zijlstra <[email protected]> wrote:
>
>>> Seems the vruntime doesn't get re-set if you move tasks between
>>> groups. sched_move_task() should call place_entity() in the context
>>> of the new cfs_rq.
>> This should do I guess - uncompiled, untested
>
> if someone tests it and if it solves the problem i'll apply it to the
> scheduler tree.

Hi Ingo, Peter,

Thanks for the patch, Peter. I modified it a bit to compile and
confirmed the issue did not occur on my ia64 box. I am attaching
the revised patch below.

Thanks,
Kei


Subject: sched: fair-group: fixup tasks on group move

Tasks would retain their old vruntime when moved between groups, this
can cause funny lags. Re-set the vruntime on group move to fit within
the new tree.

Signed-off-by: Peter Zijlstra <[email protected]>


---

linux-2.6.25-rc3-cgroup-kei/include/linux/sched.h | 4 ++++
linux-2.6.25-rc3-cgroup-kei/kernel/sched.c | 3 +++
linux-2.6.25-rc3-cgroup-kei/kernel/sched_fair.c | 14 ++++++++++++++
3 files changed, 21 insertions(+)

diff -puN include/linux/sched.h~sched_fair include/linux/sched.h
--- linux-2.6.25-rc3-cgroup/include/linux/sched.h~sched_fair 2008-02-29 12:26:47.000000000 -0500
+++ linux-2.6.25-rc3-cgroup-kei/include/linux/sched.h 2008-02-29 12:26:47.000000000 -0500
@@ -898,6 +898,10 @@ struct sched_class {
int running);
void (*prio_changed) (struct rq *this_rq, struct task_struct *task,
int oldprio, int running);
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ void (*moved_group) (struct task_struct *p);
+#endif
};

struct load_weight {
diff -puN kernel/sched.c~sched_fair kernel/sched.c
--- linux-2.6.25-rc3-cgroup/kernel/sched.c~sched_fair 2008-02-29 12:26:47.000000000 -0500
+++ linux-2.6.25-rc3-cgroup-kei/kernel/sched.c 2008-02-29 12:26:47.000000000 -0500
@@ -7825,6 +7825,9 @@ void sched_move_task(struct task_struct

set_task_rq(tsk, task_cpu(tsk));

+ if (tsk->sched_class->moved_group)
+ tsk->sched_class->moved_group(tsk);
+
if (on_rq) {
if (unlikely(running))
tsk->sched_class->set_curr_task(rq);
diff -puN kernel/sched_fair.c~sched_fair kernel/sched_fair.c
--- linux-2.6.25-rc3-cgroup/kernel/sched_fair.c~sched_fair 2008-02-29 12:26:47.000000000 -0500
+++ linux-2.6.25-rc3-cgroup-kei/kernel/sched_fair.c 2008-02-29 13:49:15.000000000 -0500
@@ -1403,6 +1403,16 @@ static void set_curr_task_fair(struct rq
set_next_entity(cfs_rq_of(se), se);
}

+#ifdef CONFIG_FAIR_GROUP_SCHED
+static void moved_group_fair(struct task_struct *p)
+{
+ struct cfs_rq *cfs_rq = task_cfs_rq(p);
+
+ update_curr(cfs_rq);
+ place_entity(cfs_rq, &p->se, 1);
+}
+#endif
+
/*
* All the scheduling class methods:
*/
@@ -1431,6 +1441,10 @@ static const struct sched_class fair_sch

.prio_changed = prio_changed_fair,
.switched_to = switched_to_fair,
+
+#ifdef CONFIG_FAIR_GROUP_SCHED
+ .moved_group = moved_group_fair,
+#endif
};

#ifdef CONFIG_SCHED_DEBUG

_


2008-02-29 20:33:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair


* Kei Tokunaga <[email protected]> wrote:

> Hi Ingo, Peter,
>
> Thanks for the patch, Peter. I modified it a bit to compile and
> confirmed the issue did not occur on my ia64 box. I am attaching the
> revised patch below.

thanks, queued it up. Could you send me your Tested-by (and Acked-by)
lines?

Ingo

2008-02-29 22:46:19

by Kei Tokunaga

[permalink] [raw]
Subject: Re: A strange behavior of sched_fair

Ingo Molnar wrote, (2008/02/29 15:32):
> * Kei Tokunaga <[email protected]> wrote:
>
>> Hi Ingo, Peter,
>>
>> Thanks for the patch, Peter. I modified it a bit to compile and
>> confirmed the issue did not occur on my ia64 box. I am attaching the
>> revised patch below.
>
> thanks, queued it up. Could you send me your Tested-by (and Acked-by)
> lines?

Tested-by: Kei Tokunaga <[email protected]>

Thanks,
Kei