Message-ID: <447A2853.2080901@vilain.net>
Date: Mon, 29 May 2006 10:46:43 +1200
From: Sam Vilain
To: Björn Steinbrink
Cc: Mike Galbraith, Peter Williams, Con Kolivas, Linux Kernel,
    Kingsley Cheung, Ingo Molnar, Rene Herman, Herbert Poetzl,
    Kirill Korotaev
Subject: Re: [RFC 0/5] sched: Add CPU rate caps
In-Reply-To: <20060526161148.GA23680@atjola.homenet>

Björn Steinbrink wrote:

>> The killer problem I see with this approach is that it doesn't address
>> the divide and conquer problem. If a task is capped, and forks off
>> workers, each worker inherits the total cap, effectively extending same.

Yes, although the current thinking is that you need to set a special
clone() flag (which may be restricted via capabilities such as
CAP_SYS_RESOURCE) to set up a new CPU scheduling namespace, so the
workers will inherit the same scheduling namespace and therefore be
accounted against the one resource.

Sorry if I'm replying out of context; I'll catch up on this thread
shortly. Thanks for including me.
>> IMHO, per task resource management is too severely limited in its
>> usefulness, because jobs are what need managing, and they're seldom
>> single threaded. In order to use per task limits to manage any given
>> job, you have to both know the number of components, and manually
>> distribute resources to each component of the job. If a job has a
>> dynamic number of components, it becomes impossible to manage.
>
> Linux-VServer uses a token bucket scheduler (TBS) to limit CPU resources
> for processes in a "context". All processes in a context share one token
> bucket, which has a set of parameters to tune scheduling behaviour.
> As the token bucket is shared by a group of processes, and inherited by
> child processes/threads, management is quite easy. And the parameters
> can be tuned to allow different scheduling behaviours, like allowing a
> process group to burst, i.e. use as much CPU time as is available after
> being idle for some time, while being limited to X% CPU time on average.

This is correct. Basically I read LARTC.org (which explains the Linux
network schedulers, among other things), and its description of the
Token Bucket Scheduler inspired me to write the same thing for CPU
resources. It was originally developed for the 2.4 Alan Cox series
kernels.

The primary design guarantee of the scheduler is a low total performance
impact and maximum CPU utilisation; prioritisation and fairness are
secondary concerns. It was built with the idea that people wanting
different sorts of scheduling policies could at least get a set of
userland controls to implement their approach, to the limit of the
effectiveness of task priorities.

I most recently described this at http://lkml.org/lkml/2006/3/29/59; a
lot of that thread is probably worth catching up on. It would be nice if
we could somehow re-use the scheduling algorithms from the network space
here, if that doesn't impact on performance.
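To make the mechanism concrete, here is a minimal user-space sketch in C
of the token bucket idea described above; all names and parameters are
illustrative, not VServer's actual in-kernel interface:

```c
/* Hypothetical per-context token bucket state. */
struct tb_state {
    int tokens;      /* current fill level */
    int tokens_max;  /* bucket capacity (allows bursting after idle) */
    int fill_rate;   /* tokens added per refill interval */
    int interval;    /* timer ticks between refills */
};

/* Called once per timer tick while the context has runnable tasks:
 * refill every `interval` ticks, then try to consume one token.
 * Returns 1 if the context may use this tick, 0 if the bucket is
 * empty - at which point a soft cap would merely deprioritise the
 * context, while a hard cap would leave the CPU idle instead. */
static int tb_tick(struct tb_state *tb, unsigned long tick)
{
    if (tick % tb->interval == 0) {
        tb->tokens += tb->fill_rate;
        if (tb->tokens > tb->tokens_max)
            tb->tokens = tb->tokens_max;
    }
    if (tb->tokens > 0) {
        tb->tokens--;
        return 1;
    }
    return 0;
}
```

Over the long run a busy context gets roughly fill_rate/interval of the
CPU, while the bucket depth tokens_max lets it burst after being idle,
matching the behaviour described above.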
For instance, the "CBQ" network scheduler takes the same approach as
OpenVZ's CPU scheduler, and the classful Token Bucket Filter is the
approach used in VServer. The "sched_prio" and "sched_hard" distinction
in VServer could probably be compared to ingress policing, where
available CPU that could run a process instead sits idle - similar to
the network world, where data that has already arrived is dropped to try
to convince the application to throttle its network activity.

As in the network space (http://lkml.org/lkml/2006/5/19/216), in this
space we have a continuous scale of possible implementations, marked by
a highly efficient solution akin to "binding" at one end, and
virtualisation at the other. Each delivers guarantees most applicable to
certain users or workloads.

Sam.

> I'm CC'ing Herbert and Sam on this as they can explain the whole thing
> a lot better, and I'm not familiar with the implementation details.
>
> Regards
> Björn

-
To unsubscribe from this list: send the line "unsubscribe linux-kernel"
in the body of a message to majordomo@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Please read the FAQ at http://www.tux.org/lkml/