Subject: Re: [PATCH 2/6] sched: make sched_slice() group scheduling savvy
From: Peter Zijlstra <peterz@infradead.org>
To: vatsa@linux.vnet.ibm.com
Cc: linux-kernel@vger.kernel.org, Ingo Molnar <mingo@elte.hu>,
       Mike Galbraith <efault@gmx.de>,
       Dmitry Adamushko <dmitry.adamushko@gmail.com>
In-Reply-To: <20071101163120.GB20788@linux.vnet.ibm.com>
References: <20071031211030.310581000@chello.nl>
	 <20071031211248.796653000@chello.nl>
	 <20071101113138.GA20788@linux.vnet.ibm.com>
	 <1193917912.27652.258.camel@twins> <1193918299.27652.260.camel@twins>
	 <1193918598.27652.262.camel@twins> <1193919608.27652.277.camel@twins>
	 <20071101163120.GB20788@linux.vnet.ibm.com>
Content-Type: text/plain
Date: Thu, 01 Nov 2007 17:55:15 +0100
Message-Id: <1193936115.27652.319.camel@twins>
Mime-Version: 1.0
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 3856
Lines: 96

On Thu, 2007-11-01 at 22:01 +0530, Srivatsa Vaddagiri wrote:
> On Thu, Nov 01, 2007 at 01:20:08PM +0100, Peter Zijlstra wrote:
> > On Thu, 2007-11-01 at 13:03 +0100, Peter Zijlstra wrote:
> > > On Thu, 2007-11-01 at 12:58 +0100, Peter Zijlstra wrote:
> > > 
> > > > > sched_slice() is about lantecy, its intended purpose is to ensure each
> > > > > task is ran exactly once during sched_period() - which is
> > > > > sysctl_sched_latency when nr_running <= sysctl_sched_nr_latency, and
> > > > > otherwise linearly scales latency.
> > > 
> > > The thing that got my brain in a twist is what to do about the non-leaf
> > > nodes, for those it seems I'm not doing the right thing - I think.
> > 
> > Ok, suppose a tree like so:
> > 
> > 
> > level 2                   cfs_rq
> >                        A           B
> > 
> > level 1             cfs_rqA     cfs_rqB
> >                      A0        B0 - B99
> > 
> > 
> > So for sake of determinism, we want to impose a period in which all
> > level 1 tasks will have ran (at least) once.
> 
> Peter,
> 	I fail to see why this requirement to "determine a period in
> which all level 1 tasks will have ran (at least) once" is essential.

Because it gives a steady feel to things. For humans its most essential
that things run in a predicable fashion. So no only does it matter how
much time a task gets, it also very much matters when it gets that.

It contributes to the feeling of gradual slow down. 

> I am visualizing each of the groups to be similar to Xen-like partitions
> which are given fair timeslices by the hypervisor (Linux kernel in this
> case). How each partition (group in this case) manages the allocated
> timeslice(s) to provide fairness to tasks within that partition/group should not
> (IMHO) depend on other groups and esp. how many tasks other groups has.

Agreed, I've realised this since my last mail, one group should not
influence another in such a fashion, in this respect you don't want to
flatten it like I did.

> For ex: before this patch, fair time would be allocated to group and
> their tasks as below:
> 
>     A0       B0-B9     A0    B10-B19     A0     B20-B29
>  |--------|--------|--------|--------|--------|--------|-----//--| 
>  0       10ms	   20ms	   30ms     40ms     50ms     60ms
> 
> i.e during the first 10ms allocated to group B, B0-B9 run,
>     during the next  10ms allocated to group B, B10-B19 run etc
> 
> What's wrong with this scheme?

What made me start tinkering here is that the nested level is again
distributing wall-time without being aware of the fraction it gets from
the upper levels.

So if we have two groups, A and B, and B is selected for 1/2 of period,
then Bn will again divide period, even though in reality it will only
have p/2.

So I guess, I need a top down traversal, not a bottom up traversal to
get this fixed up... I'll ponder this.

> By letting __sched_period() be determined for each group independently,
> we are building stronger isolation between them, which is good IMO
> (imagine a rogue container that does a fork bomb).

Agreed.

> > Index: linux-2.6/kernel/sched_fair.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched_fair.c
> > +++ linux-2.6/kernel/sched_fair.c
> > @@ -341,7 +341,7 @@ static u64 sched_slice(struct cfs_rq *cf
> >  		do_div(slice, cfs_rq->load.weight);
> >  	}
> > 
> > -	return slice;
> > +	return min_t(u64, sysctl_sched_latency, slice);

> which seems to be more or less giving what we already have w/o the
> patch?

Well, its basically giving up on overload, admittedly not very nice.

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/