2007-06-06 01:56:44

by Tong Li

[permalink] [raw]
Subject: [RFC] Extend Linux to support proportional-share scheduling

Hi all,

I've ported my code to mainline 2.6.21.3. You can get it at
http://www.cs.duke.edu/~tongli/linux/. As I said before, the intent of
the patch is not to compete with CFS and SD because the design relies on
the underlying scheduler for interactive performance. The goal here is
to present a complementary design that can ensure stronger MP fairness,
which I think is lacking in the existing proposals. Here's a brief
overview of the design (I call it Trio for the lack of a better name).
Any comments or suggestions will be highly appreciated.

Trio extends the existing Linux scheduler with support for
proportional-share scheduling. It uses a scheduling algorithm, called
Distributed Weighted Round-Robin (DWRR), which retains the existing
scheduler design as much as possible, and extends it to achieve
proportional fairness with O(1) time complexity and a constant error
bound, compared to the ideal fair scheduling algorithm. The goal of Trio
is not to improve interactive performance; rather, it relies on the
existing scheduler for interactivity and extends it to support MP
proportional fairness.

Trio has two unique features: (1) it enables users to control shares of
CPU time for any thread or group of threads (e.g., a process, an
application, etc.), and (2) it enables fair sharing of CPU time across
multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
allows each task to take an equal fraction of the total CPU time,
whereas no existing scheduler achieves such fairness. These features
enable Trio to complement the mainline scheduler and other proposals
such as CFS and SD to enable greater user flexibility and stronger
fairness.

tong


2007-06-06 03:35:26

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC] Extend Linux to support proportional-share scheduling

Hi Tong,

On Tue, Jun 05, 2007 at 06:56:17PM -0700, Li, Tong N wrote:
> Hi all,
>
> I've ported my code to mainline 2.6.21.3. You can get it at
> http://www.cs.duke.edu/~tongli/linux/.

as much as possible, you should post your patch for others to comment
on it. Posting just a URL is often fine to inform people that there's
an update to *try*, but at this stage, it may be more important to
comment on your design and code than trying it.

[...]

> Trio has two unique features: (1) it enables users to control shares of
> CPU time for any thread or group of threads (e.g., a process, an
> application, etc.), and (2) it enables fair sharing of CPU time across
> multiple CPUs. For example, with ten tasks running on eight CPUs, Trio
> allows each task to take an equal fraction of the total CPU time,

While this looks interesting, doesn't it make threads jump to random
CPUs all the time, thus reducing cache efficiency ? Or maybe it would
be good to consider two or three criteria to group CPUs :
- those which share the same caches (multi-core)
- those which share the same local memory on the same mainboard
(multi-socket)
- those which are so far away from each others that it's really
not worth migrating a task

> whereas no existing scheduler achieves such fairness. These features
> enable Trio to complement the mainline scheduler and other proposals
> such as CFS and SD to enable greater user flexibility and stronger
> fairness.

Right now, I think that only benchmarks could tell which design is
better. I understand that running 10 tasks on 8 CPUs may result in
the last batch involving only 2 CPUs with 1 task each, thus increasing
the overall wall time. But maybe cache thrashing between CPUs will
also increase the wall time.

Regards,
Willy

2007-06-06 04:31:44

by Tong Li

[permalink] [raw]
Subject: RE: [RFC] Extend Linux to support proportional-share scheduling

Willy,

These are all good comments. Regarding the cache penalty, I've done some
measurements using benchmarks like SPEC OMP on an 8-processor SMP and
the performance with this patch was nearly identical to that with the
mainline. I'm sure some apps may suffer from the potentially more
migrations with this design. In the end, I think what we want is to
balance fairness and performance. This design currently emphasizes on
fairness, but it could be changed to relax fairness when performance
does become an issue (which could even be a user-tunable knob depending
on which aspect the user cares more).

Thanks,

tong

> -----Original Message-----
> From: Willy Tarreau [mailto:[email protected]]
> Sent: Tuesday, June 05, 2007 8:33 PM
> To: Li, Tong N
> Cc: [email protected]; Ingo Molnar; Con Kolivas; Linus
Torvalds;
> Arjan van de Ven; Siddha, Suresh B; Barnes, Jesse; William Lee Irwin
III;
> Bill Huey (hui); [email protected]; [email protected]; Nick Piggin;
Bill
> Davidsen; John Kingman; Peter Williams; [email protected]
> Subject: Re: [RFC] Extend Linux to support proportional-share
scheduling
>
> Hi Tong,
>
> On Tue, Jun 05, 2007 at 06:56:17PM -0700, Li, Tong N wrote:
> > Hi all,
> >
> > I've ported my code to mainline 2.6.21.3. You can get it at
> > http://www.cs.duke.edu/~tongli/linux/.
>
> as much as possible, you should post your patch for others to comment
> on it. Posting just a URL is often fine to inform people that there's
> an update to *try*, but at this stage, it may be more important to
> comment on your design and code than trying it.
>
> [...]
>
> > Trio has two unique features: (1) it enables users to control shares
of
> > CPU time for any thread or group of threads (e.g., a process, an
> > application, etc.), and (2) it enables fair sharing of CPU time
across
> > multiple CPUs. For example, with ten tasks running on eight CPUs,
Trio
> > allows each task to take an equal fraction of the total CPU time,
>
> While this looks interesting, doesn't it make threads jump to random
> CPUs all the time, thus reducing cache efficiency ? Or maybe it would
> be good to consider two or three criteria to group CPUs :
> - those which share the same caches (multi-core)
> - those which share the same local memory on the same mainboard
> (multi-socket)
> - those which are so far away from each others that it's really
> not worth migrating a task
>
> > whereas no existing scheduler achieves such fairness. These features
> > enable Trio to complement the mainline scheduler and other proposals
> > such as CFS and SD to enable greater user flexibility and stronger
> > fairness.
>
> Right now, I think that only benchmarks could tell which design is
> better. I understand that running 10 tasks on 8 CPUs may result in
> the last batch involving only 2 CPUs with 1 task each, thus increasing
> the overall wall time. But maybe cache thrashing between CPUs will
> also increase the wall time.
>
> Regards,
> Willy

2007-06-06 05:32:06

by Willy Tarreau

[permalink] [raw]
Subject: Re: [RFC] Extend Linux to support proportional-share scheduling

On Tue, Jun 05, 2007 at 09:31:33PM -0700, Li, Tong N wrote:
> Willy,
>
> These are all good comments. Regarding the cache penalty, I've done some
> measurements using benchmarks like SPEC OMP on an 8-processor SMP and
> the performance with this patch was nearly identical to that with the
> mainline. I'm sure some apps may suffer from the potentially more
> migrations with this design. In the end, I think what we want is to
> balance fairness and performance. This design currently emphasizes on
> fairness, but it could be changed to relax fairness when performance
> does become an issue (which could even be a user-tunable knob depending
> on which aspect the user cares more).

Maybe storing in each task a small list of the 2 or 4 last CPUs used would
help the scheduler in trying to place them. I mean, let's say you have 10
tasks and 8 CPUs. You first assign tasks 1..8 CPUs 1..8 for 1 timeslice.
Then you will give 9..10 a run on CPUs 1..2, and CPUs 3..8 will be usable
for other tasks. It wil be optimal to run tasks 3..8 on them. Then you will
stop some of those because they are "in advance", and run 9..10 and 1..2
again. You'll have to switch 1..2 to another group of CPUs to maintain hot
cache on CPUs 1..2 for tasks 9..10. But another possibility would be to
consider that 9..10 and 1..2 have performed the same amount of work, so
let's 9..10 take some advance and benefit from the hot cache, then try to
place 1..2 there again. But it will mean that 3..8 will now have run 2
timeslices more than others. At this moment, it should be wise to make
them sleep and keep their CPU history for future use.

Maybe on end-user systems, the CPUs history is not that important because
of the often small caches, but on high-end systems with large L2/L3 caches,
I think that we can often keep several tasks in the cache, justifying the
ability to select one of the last CPUs used.

Not an easy thing to do, but probably very complementary to your work IMHO.

Regards,
Willy

2007-06-06 13:27:41

by Bill Davidsen

[permalink] [raw]
Subject: Re: [RFC] Extend Linux to support proportional-share scheduling

Willy Tarreau wrote:
> On Tue, Jun 05, 2007 at 09:31:33PM -0700, Li, Tong N wrote:
>
>> Willy,
>>
>> These are all good comments. Regarding the cache penalty, I've done some
>> measurements using benchmarks like SPEC OMP on an 8-processor SMP and
>> the performance with this patch was nearly identical to that with the
>> mainline. I'm sure some apps may suffer from the potentially more
>> migrations with this design. In the end, I think what we want is to
>> balance fairness and performance. This design currently emphasizes on
>> fairness, but it could be changed to relax fairness when performance
>> does become an issue (which could even be a user-tunable knob depending
>> on which aspect the user cares more).
>>
>
> Maybe storing in each task a small list of the 2 or 4 last CPUs used would
> help the scheduler in trying to place them. I mean, let's say you have 10
> tasks and 8 CPUs. You first assign tasks 1..8 CPUs 1..8 for 1 timeslice.
> Then you will give 9..10 a run on CPUs 1..2, and CPUs 3..8 will be usable
> for other tasks. It wil be optimal to run tasks 3..8 on them. Then you will
> stop some of those because they are "in advance", and run 9..10 and 1..2
> again. You'll have to switch 1..2 to another group of CPUs to maintain hot
> cache on CPUs 1..2 for tasks 9..10. But another possibility would be to
> consider that 9..10 and 1..2 have performed the same amount of work, so
> let's 9..10 take some advance and benefit from the hot cache, then try to
> place 1..2 there again. But it will mean that 3..8 will now have run 2
> timeslices more than others. At this moment, it should be wise to make
> them sleep and keep their CPU history for future use.
>
> Maybe on end-user systems, the CPUs history is not that important because
> of the often small caches, but on high-end systems with large L2/L3 caches,
> I think that we can often keep several tasks in the cache, justifying the
> ability to select one of the last CPUs used.
>
>
CPU affinity to preserve cache is a very delicate balance. It makes
sense to try to run a process on the same CPU, but since even a few ms
of running some other process is long enough to refill the cache with
new contents (depending on what it does, obviously) that long delays in
running a process to get it on the "right CPU" are not always a saving,
using the previous CPU becomes less beneficial rapidly.

Some omnipotent scheduler would have a count of pages evicted from cache
as process A runs, and deduct that from the affinity of process B
previously on the same CPU. Then make a perfect decision when it's
better to migrate the task and how far. Since the schedulers now being
advanced are "fair" rather than "perfect," everyone is making educated
guesses on optimal process migration policy, migrating all threads to
improve cache hit vs. spread them to better run threads in parallel, etc.

For a desktop I want a scheduler which "doesn't suck" at the things I do
regularly. For a server I'm more concerned with overall tps than the
latency of one transaction. Most users would trade a slowdown in kernel
compiles for being able to watch youtube while the compile runs, and
conversely people with heavily loaded servers would usually trade a
slower transaction for more of them per second. Obviously within
reason... what people will tolerate is a bounded value.
> Not an easy thing to do, but probably very complementary to your work IMHO.
>
Agree, not easy at all.

--
bill davidsen <[email protected]>
CTO TMR Associates, Inc
Doing interesting things with small computers since 1979