2008-02-12 18:54:17

by Dhaval Giani

Subject: Regression in latest sched-git

Hi Ingo,

I've been running the latest sched-git through some tests. Here is
essentially what I am doing:

1. Mount the control group
2. Create 3-4 groups
3. Start kernbench inside each group
4. Run cpu hogs in each group

Essentially the idea is to see how the system responds under extreme CPU
load.
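
(For reference, a rough sketch of the steps above; the mount point and
group names are made up here, and the mkdir/echo interface assumes the
cgroup-based group scheduling of this era:)

mount -t cgroup -o cpu none /cgroup
for g in grp1 grp2 grp3; do mkdir /cgroup/$g; done
# move a shell into a group, then launch kernbench and a cpu hog from it
echo $$ > /cgroup/grp1/tasks
kernbench &
while :; do :; done &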

This is what I get with sched-git (and this is in a shell which belongs
to the root group):
[root@llm11 ~]# time sleep 1

real 0m1.212s
user 0m0.004s
sys 0m0.000s
[root@llm11 ~]# time sleep 1

real 0m1.200s
user 0m0.000s
sys 0m0.004s
[root@llm11 ~]# time sleep 1

real 0m1.266s
user 0m0.000s
sys 0m0.000s
[root@llm11 ~]# time sleep 1

real 0m1.113s
user 0m0.000s
sys 0m0.000s
[root@llm11 ~]#

On the sched-devel tree that I have, the same gives me the following
results:

[root@llm11 ~]# time sleep 1

real 0m1.057s
user 0m0.000s
sys 0m0.004s
[root@llm11 ~]# time sleep 1

real 0m1.038s
user 0m0.000s
sys 0m0.004s
[root@llm11 ~]# time sleep 1

real 0m1.075s
user 0m0.000s
sys 0m0.000s
[root@llm11 ~]# time sleep 1

real 0m1.071s
user 0m0.000s
sys 0m0.000s
[root@llm11 ~]# time sleep 1

real 0m1.073s
user 0m0.000s
sys 0m0.004s
[root@llm11 ~]# time sleep 1

real 0m1.055s
user 0m0.000s
sys 0m0.004s

I agree this is not a very good test. It's getting a bit late here. I
will put together a better test case tomorrow morning (and if you have
some, I can try those as well). I just did not want the tree to get
merged without further discussion.

--
regards,
Dhaval


2008-02-12 19:40:55

by Peter Zijlstra

Subject: Re: Regression in latest sched-git


On Wed, 2008-02-13 at 00:23 +0530, Dhaval Giani wrote:
> Hi Ingo,
>
> I've been running the latest sched-git through some tests. Here is
> essentially what I am doing,
>
> 1. Mount the control group
> 2. Create 3-4 groups
> 3. Start kernbench inside each group
> 4. Run cpu hogs in each group
>
> Essentially the idea is to see how the system responds under extreme CPU
> load.

> This is what I get (and this is in a shell which belongs to the root
> group)
> [root@llm11 ~]# time sleep 1
>
> real 0m1.212s
> user 0m0.004s
> sys 0m0.000s

> On the sched-devel tree that I have, the same gives me following
> results.
>
> [root@llm11 ~]# time sleep 1
>
> real 0m1.057s
> user 0m0.000s
> sys 0m0.004s

Yes, latency isolation is the one thing I had to sacrifice in order to
get the normal latencies under control.

The problem with the old code is that even a light load (a kernel make
-j2 as root, under an otherwise idle X session) generates latencies of up
to 120ms on my UP laptop (uid grouping; two active users: peter and root).

Others have reported latencies up to 300ms, and Ingo found a 700ms
latency on his machine.

The source of this problem is, I think, the vruntime-driven wakeup
preemption (but I'm not quite sure). The other things that rely on a
global vruntime are sleeper fairness and yield. Now, while I can't
possibly care less about yield, the loss of sleeper fairness is somewhat
sad (NB: turning it off with the old group scheduling does improve life
somewhat).
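
(A hedged aside on "turning it off": on CONFIG_SCHED_DEBUG kernels of this
era the sleeper-fairness bit can be cleared through the sched_features
bitmask; the bit value below assumes NEW_FAIR_SLEEPERS is bit 0, so check
the SCHED_FEAT_* enum in kernel/sched.c for your tree:)

f=$(cat /proc/sys/kernel/sched_features)
echo $(( f & ~1 )) > /proc/sys/kernel/sched_features  # clear NEW_FAIR_SLEEPERS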

So my first attempt at getting a global vruntime was flattening the
whole RQ structure; you can see that patch in sched.git (I really ought
to have posted that, will do so tomorrow).

With the experience gained from doing that, I think it might be possible
to construct a hierarchical RQ model that has synced vruntime; but
thinking about that still makes my head hurt.

Anyway, yes, it's not ideal, but it handles the more common case of light
load much better - I basically had to tell people to disable
CONFIG_FAIR_GROUP_SCHED in order to use their computer, which is sad,
because it's the default and we want it to be the default in the cgroup
future.

So yes, I share your concern; let's work on this together.

2008-02-13 05:04:01

by Srivatsa Vaddagiri

Subject: Re: Regression in latest sched-git

On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> Yes, latency isolation is the one thing I had to sacrifice in order to
> get the normal latencies under control.

Hi Peter,
I don't have an easy solution in mind either to meet both fairness
and latency goals in an acceptable way.

But I am puzzled by the max latency numbers you have provided below:

> The problem with the old code is that under light load: a kernel make
> -j2 as root, under an otherwise idle X session, generates latencies up
> to 120ms on my UP laptop. (uid grouping; two active users: peter, root).

If it were just two active users, then the max latency should be:

latency to schedule user entity (~10ms?) +
latency to schedule task within that user

20-30 ms seems a more reasonable max latency to expect in this scenario.
120ms seems abnormal, unless the user had a large number of tasks.

Along the same lines, I can't understand how we can be seeing a 700ms
latency (below) unless we had a large number of active groups/users and a
large number of tasks within each group/user.

> Others have reported latencies up to 300ms, and Ingo found a 700ms
> latency on his machine.
>
> The source for this problem is I think the vruntime driven wakeup
> preemption (but I'm not quite sure). The other things that rely on
> global vruntime are sleeper fairness and yield. Now while I can't
> possibly care less about yield, the loss of sleeper fairness is somewhat
> sad (NB. turning it off with the old group scheduling does improve life
> somewhat).
>
> So my first attempt at getting a global vruntime was flattening the
> whole RQ structure, you can see that patch in sched.git (I really ought
> to have posted that, will do so tomorrow).

We will do some exhaustive testing with this approach. My main concern
with this is that it may compromise the level of isolation between two
groups (imagine one group running a fork bomb and how that would affect
fairness for the other groups).
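
(A sketch of the isolation test being described, reusing the hypothetical
/cgroup mount from earlier: confine a runaway spawner to one group and
check whether the other groups still make their usual progress:)

echo $$ > /cgroup/grp1/tasks
for i in $(seq 1 500); do while :; do :; done & done
# meanwhile compare per-group progress, e.g. kernbench times in grp2/grp3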

> With the experience gained from doing that, I think it might be possible
> to construct a hierarchical RQ model that has synced vruntime; but
> thinking about that still makes my head hurt.
>
> Anyway, yes, its not ideal, but it does the more common case of light
> load much better - I basically had to tell people to disable
> CONFIG_FAIR_GROUP_SCHED in order to use their computer, which is sad,
> because its the default and we want it to be the default in the cgroup
> future.
>
> So yes, I share your concern, lets work on this together.

--
Regards,
vatsa

2008-02-13 12:51:46

by Peter Zijlstra

Subject: Re: Regression in latest sched-git


On Wed, 2008-02-13 at 08:30 +0530, Srivatsa Vaddagiri wrote:
> On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> > Yes, latency isolation is the one thing I had to sacrifice in order to
> > get the normal latencies under control.
>
> Hi Peter,
> I don't have easy solution in mind either to meet both fairness
> and latency goals in a acceptable way.

Ah, do be careful with 'fairness' here. The single RQ is fair wrt cpu
time, just not quite as 'fair' wrt latency.

> But I am puzzled at the max latency numbers you have provided below:
>
> > The problem with the old code is that under light load: a kernel make
> > -j2 as root, under an otherwise idle X session, generates latencies up
> > to 120ms on my UP laptop. (uid grouping; two active users: peter, root).
>
> If it was just two active users, then max latency should be:
>
> latency to schedule user entity (~10ms?) +
> latency to schedule task within that user
>
> 20-30 ms seems more reaonable max latency to expect in this scenario.
> 120ms seems abnormal, unless the user had large number of tasks.
>
> On the same lines, I cant understand how we can be seeing 700ms latency
> (below) unless we had: large number of active groups/users and large number of
> tasks within each group/user.

All I can say is that it's trivial to reproduce these horrid latencies.

As for Ingo's setup, the worst that he does is run distcc with (32?)
instances on that machine - and I assume he has that user niced waay
down.

> > Others have reported latencies up to 300ms, and Ingo found a 700ms
> > latency on his machine.
> >
> > The source for this problem is I think the vruntime driven wakeup
> > preemption (but I'm not quite sure). The other things that rely on
> > global vruntime are sleeper fairness and yield. Now while I can't
> > possibly care less about yield, the loss of sleeper fairness is somewhat
> > sad (NB. turning it off with the old group scheduling does improve life
> > somewhat).
> >
> > So my first attempt at getting a global vruntime was flattening the
> > whole RQ structure, you can see that patch in sched.git (I really ought
> > to have posted that, will do so tomorrow).
>
> We will do some exhaustive testing with this approach. My main concern
> with this is that it may compromise the level of isolation between two
> groups (imagine one group does a fork-bomb and how it would affect
> fairness for other groups).

Again, be careful with the fairness issue. CPU time should still be
fair, but yes, other groups might experience some latencies.


2008-02-13 16:35:52

by Dhaval Giani

Subject: Re: Regression in latest sched-git

On Wed, Feb 13, 2008 at 01:51:18PM +0100, Peter Zijlstra wrote:
>
> On Wed, 2008-02-13 at 08:30 +0530, Srivatsa Vaddagiri wrote:
> > On Tue, Feb 12, 2008 at 08:40:08PM +0100, Peter Zijlstra wrote:
> > > Yes, latency isolation is the one thing I had to sacrifice in order to
> > > get the normal latencies under control.
> >
> > Hi Peter,
> > I don't have easy solution in mind either to meet both fairness
> > and latency goals in a acceptable way.
>
> Ah, do be careful with 'fairness' here. The single RQ is fair wrt cpu
> time, just not quite as 'fair' wrt to latency.
>
> > But I am puzzled at the max latency numbers you have provided below:
> >
> > > The problem with the old code is that under light load: a kernel make
> > > -j2 as root, under an otherwise idle X session, generates latencies up
> > > to 120ms on my UP laptop. (uid grouping; two active users: peter, root).
> >
> > If it was just two active users, then max latency should be:
> >
> > latency to schedule user entity (~10ms?) +
> > latency to schedule task within that user
> >
> > 20-30 ms seems more reaonable max latency to expect in this scenario.
> > 120ms seems abnormal, unless the user had large number of tasks.
> >
> > On the same lines, I cant understand how we can be seeing 700ms latency
> > (below) unless we had: large number of active groups/users and large number of
> > tasks within each group/user.
>
> All I can say it that its trivial to reproduce these horrid latencies.
>

Hi Peter,

I've been trying to reproduce the latencies, and the worst I have
managed is only 80ms. On average I am getting around 60ms. This is with a
make -j4 as root and dhaval running other programs (with maxcpus=1).
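
(The thread doesn't say which tool measured these latencies; as a crude
stand-in, one can time short sleeps and report the overshoot, in the same
spirit as the "time sleep 1" runs earlier in the thread:)

for i in $(seq 1 20); do
    t0=$(date +%s%N); sleep 0.1; t1=$(date +%s%N)
    echo "overshoot: $(( (t1 - t0) / 1000000 - 100 ))ms"
done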

> As for Ingo's setup, the worst that he does is run distcc with (32?)
> instances on that machine - and I assume he has that user niced waay
> down.
>
> > > Others have reported latencies up to 300ms, and Ingo found a 700ms
> > > latency on his machine.
> > >
> > > The source for this problem is I think the vruntime driven wakeup
> > > preemption (but I'm not quite sure). The other things that rely on
> > > global vruntime are sleeper fairness and yield. Now while I can't
> > > possibly care less about yield, the loss of sleeper fairness is somewhat
> > > sad (NB. turning it off with the old group scheduling does improve life
> > > somewhat).
> > >
> > > So my first attempt at getting a global vruntime was flattening the
> > > whole RQ structure, you can see that patch in sched.git (I really ought
> > > to have posted that, will do so tomorrow).
> >
> > We will do some exhaustive testing with this approach. My main concern
> > with this is that it may compromise the level of isolation between two
> > groups (imagine one group does a fork-bomb and how it would affect
> > fairness for other groups).
>
> Again, be careful with the fairness issue. CPU time should still be
> fair, but yes, other groups might experience some latencies.
>

I know I am missing something, but aren't we trying to reduce latencies
here?

--
regards,
Dhaval

2008-02-13 16:37:35

by Dhaval Giani

Subject: Re: Regression in latest sched-git

On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> > > On the same lines, I cant understand how we can be seeing 700ms latency
> > > (below) unless we had: large number of active groups/users and large number of
> > > tasks within each group/user.
> >
> > All I can say it that its trivial to reproduce these horrid latencies.
> >
>
> Hi Peter,
>
> I've been trying to reproduce the latencies, and the worst I have
> managed only 80ms. At an average I am getting around 60 ms. This is with
> a make -j4 as root, and dhaval running other programs. (with maxcpus=1).
>

I've totally missed it here. Any more hints on how to reproduce?

--
regards,
Dhaval

2008-02-13 16:43:43

by Peter Zijlstra

Subject: Re: Regression in latest sched-git


On Wed, 2008-02-13 at 22:07 +0530, Dhaval Giani wrote:
> On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> > > > On the same lines, I cant understand how we can be seeing 700ms latency
> > > > (below) unless we had: large number of active groups/users and large number of
> > > > tasks within each group/user.
> > >
> > > All I can say it that its trivial to reproduce these horrid latencies.
> > >
> >
> > Hi Peter,
> >
> > I've been trying to reproduce the latencies, and the worst I have
> > managed only 80ms. At an average I am getting around 60 ms. This is with
> > a make -j4 as root, and dhaval running other programs. (with maxcpus=1).
> >
>
> Totally missed here. Any more hints to reproduce?

Not really; this is the recipe I took from Lukas Hejtmanek's report and
it worked for me.

I'll see if I can find some time to try the ftrace patches to narrow
this down.

2008-02-13 16:57:39

by Srivatsa Vaddagiri

Subject: Re: Regression in latest sched-git

On Wed, Feb 13, 2008 at 10:04:44PM +0530, Dhaval Giani wrote:
> I know I am missing something, but aren't we trying to reduce latencies
> here?

I guess Peter is referring to the latency in seeing fairness results. In
other words, with the single-rq approach, you may require more time for
the groups to converge on fairness.

--
Regards,
vatsa

2008-02-14 11:21:12

by Peter Zijlstra

Subject: Re: Regression in latest sched-git

Hi Dhaval,

How does this patch (on top of today's sched-devel.git) work for you?

It keeps my laptop nice and spiffy when I run

let i=0; while [ $i -lt 100 ]; do let i+=1; while :; do :; done & done

under a third user (nobody). This generates huge latencies for the
nobody user (up to 1.6s), but root and peter don't seem to get above 40ms.
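
(For completeness, a hedged guess at how to run that spawner as the third
user; the exact invocation isn't in the thread, and -s is only needed if
nobody's login shell is nologin:)

su -s /bin/bash -c \
  'let i=0; while [ $i -lt 100 ]; do let i+=1; while :; do :; done & done; wait' nobody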

---
include/linux/sched.h | 1 +
kernel/sched_fair.c | 6 +++++-
2 files changed, 6 insertions(+), 1 deletion(-)

Index: linux-2.6/include/linux/sched.h
===================================================================
--- linux-2.6.orig/include/linux/sched.h
+++ linux-2.6/include/linux/sched.h
@@ -925,6 +925,7 @@ struct sched_entity {
u64 exec_start;
u64 sum_exec_runtime;
u64 vruntime;
+ u64 vperiod;
u64 prev_sum_exec_runtime;

#ifdef CONFIG_SCHEDSTATS
Index: linux-2.6/kernel/sched_fair.c
===================================================================
--- linux-2.6.orig/kernel/sched_fair.c
+++ linux-2.6/kernel/sched_fair.c
@@ -220,9 +220,11 @@ static inline u64 min_vruntime(u64 min_v

static inline s64 entity_key(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
- return se->vruntime - cfs_rq->min_vruntime;
+ return se->vruntime + se->vperiod - cfs_rq->min_vruntime;
}

+static u64 sched_vslice_add(struct cfs_rq *cfs_rq, struct sched_entity *se);
+
/*
* Enqueue an entity into the rb-tree:
*/
@@ -240,6 +242,8 @@ static void __enqueue_entity(struct cfs_
if (se == cfs_rq->curr)
return;

+ se->vperiod = sched_vslice_add(cfs_rq, se);
+
cfs_rq = &rq_of(cfs_rq)->cfs;

link = &cfs_rq->tasks_timeline.rb_node;