2006-03-07 23:13:21

by Con Kolivas

Subject: [PATCH] mm: yield during swap prefetching

Swap prefetching doesn't use very much cpu but spends a lot of time waiting on
disk in uninterruptible sleep. This means it won't get preempted often even at
a low nice level since it is seen as sleeping most of the time. We want to
minimise its cpu impact so yield where possible.

Signed-off-by: Con Kolivas <[email protected]>
---
mm/swap_prefetch.c | 1 +
1 file changed, 1 insertion(+)

Index: linux-2.6.15-ck5/mm/swap_prefetch.c
===================================================================
--- linux-2.6.15-ck5.orig/mm/swap_prefetch.c 2006-03-02 14:00:46.000000000 +1100
+++ linux-2.6.15-ck5/mm/swap_prefetch.c 2006-03-08 08:49:32.000000000 +1100
@@ -421,6 +421,7 @@ static enum trickle_return trickle_swap(

if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
break;
+ yield();
}

if (sp_stat.prefetched_pages) {


2006-03-07 23:24:47

by Andrew Morton

Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas <[email protected]> wrote:
>
> Swap prefetching doesn't use very much cpu but spends a lot of time waiting on
> disk in uninterruptible sleep. This means it won't get preempted often even at
> a low nice level since it is seen as sleeping most of the time. We want to
> minimise its cpu impact so yield where possible.
>
> Signed-off-by: Con Kolivas <[email protected]>
> ---
> mm/swap_prefetch.c | 1 +
> 1 file changed, 1 insertion(+)
>
> Index: linux-2.6.15-ck5/mm/swap_prefetch.c
> ===================================================================
> --- linux-2.6.15-ck5.orig/mm/swap_prefetch.c 2006-03-02 14:00:46.000000000 +1100
> +++ linux-2.6.15-ck5/mm/swap_prefetch.c 2006-03-08 08:49:32.000000000 +1100
> @@ -421,6 +421,7 @@ static enum trickle_return trickle_swap(
>
> if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
> break;
> + yield();
> }
>
> if (sp_stat.prefetched_pages) {

yield() really sucks if there are a lot of runnable tasks. And the amount
of CPU which that thread uses isn't likely to matter anyway.

I think it'd be better to just not do this. Perhaps alter the thread's
static priority instead? Does the scheduler have a knob which can be used
to disable a task's dynamic priority boost heuristic?

2006-03-07 23:32:17

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

Andrew Morton writes:

> Con Kolivas <[email protected]> wrote:
>>
>> Swap prefetching doesn't use very much cpu but spends a lot of time waiting on
>> disk in uninterruptible sleep. This means it won't get preempted often even at
>> a low nice level since it is seen as sleeping most of the time. We want to
>> minimise its cpu impact so yield where possible.
>>
>> Signed-off-by: Con Kolivas <[email protected]>
>> ---
>> mm/swap_prefetch.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> Index: linux-2.6.15-ck5/mm/swap_prefetch.c
>> ===================================================================
>> --- linux-2.6.15-ck5.orig/mm/swap_prefetch.c 2006-03-02 14:00:46.000000000 +1100
>> +++ linux-2.6.15-ck5/mm/swap_prefetch.c 2006-03-08 08:49:32.000000000 +1100
>> @@ -421,6 +421,7 @@ static enum trickle_return trickle_swap(
>>
>> if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
>> break;
>> + yield();
>> }
>>
>> if (sp_stat.prefetched_pages) {
>
> yield() really sucks if there are a lot of runnable tasks. And the amount
> of CPU which that thread uses isn't likely to matter anyway.
>
> I think it'd be better to just not do this. Perhaps alter the thread's
> static priority instead? Does the scheduler have a knob which can be used
> to disable a task's dynamic priority boost heuristic?

We do have SCHED_BATCH but even that doesn't really have the desired effect.
I know how much yield sucks and I actually want it to suck as much as yield
does.

Cheers,
Con

2006-03-08 00:03:06

by Andrew Morton

Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas <[email protected]> wrote:
>
> > yield() really sucks if there are a lot of runnable tasks. And the amount
> > of CPU which that thread uses isn't likely to matter anyway.
> >
> > I think it'd be better to just not do this. Perhaps alter the thread's
> > static priority instead? Does the scheduler have a knob which can be used
> > to disable a task's dynamic priority boost heuristic?
>
> We do have SCHED_BATCH but even that doesn't really have the desired effect.
> I know how much yield sucks and I actually want it to suck as much as yield
> does.

Why do you want that?

If prefetch is doing its job then it will save the machine from a pile of
major faults in the near future. The fact that the machine happens to be
running a number of busy tasks doesn't alter that. It's _worth_ stealing a
few cycles from those tasks now to avoid lengthy D-state sleeps in the near
future?

2006-03-08 00:50:47

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 11:05 am, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> > > yield() really sucks if there are a lot of runnable tasks. And the
> > > amount of CPU which that thread uses isn't likely to matter anyway.
> > >
> > > I think it'd be better to just not do this. Perhaps alter the thread's
> > > static priority instead? Does the scheduler have a knob which can be
> > > used to disable a task's dynamic priority boost heuristic?
> >
> > We do have SCHED_BATCH but even that doesn't really have the desired
> > effect. I know how much yield sucks and I actually want it to suck as
> > much as yield does.
>
> Why do you want that?
>
> If prefetch is doing its job then it will save the machine from a pile of
> major faults in the near future. The fact that the machine happens to be
> running a number of busy tasks doesn't alter that. It's _worth_ stealing a
> few cycles from those tasks now to avoid lengthy D-state sleeps in the near
> future?

The test case is the 3d (gaming) app that uses 100% cpu. It never sets delay
swap prefetch in any way so swap prefetching starts working. Once swap
prefetching starts reading it is mostly in uninterruptible sleep and always
wakes up on the active array ready for cpu, never expiring even with its
minuscule timeslice. The 3d app is always expiring and landing on the expired
array behind kprefetchd even though kprefetchd is nice 19. The practical
upshot of all this is that kprefetchd does a lot of prefetching with 3d
gaming going on, and no amount of priority fiddling stops it doing this. The
disk access is noticeable during 3d gaming unfortunately. Yielding regularly
means a heck of a lot less prefetching occurs and is no longer noticeable.
When idle, yield()ing doesn't seem to adversely affect the effectiveness of
the prefetching.

Cheers,
Con

2006-03-08 01:09:24

by Andrew Morton

Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas <[email protected]> wrote:
>
> On Wed, 8 Mar 2006 11:05 am, Andrew Morton wrote:
> > Con Kolivas <[email protected]> wrote:
> > > > yield() really sucks if there are a lot of runnable tasks. And the
> > > > amount of CPU which that thread uses isn't likely to matter anyway.
> > > >
> > > > I think it'd be better to just not do this. Perhaps alter the thread's
> > > > static priority instead? Does the scheduler have a knob which can be
> > > > used to disable a tasks's dynamic priority boost heuristic?
> > >
> > > We do have SCHED_BATCH but even that doesn't really have the desired
> > > effect. I know how much yield sucks and I actually want it to suck as
> > > much as yield does.
> >
> > Why do you want that?
> >
> > If prefetch is doing its job then it will save the machine from a pile of
> > major faults in the near future. The fact that the machine happens to be
> > running a number of busy tasks doesn't alter that. It's _worth_ stealing a
> > few cycles from those tasks now to avoid lengthy D-state sleeps in the near
> > future?
>
> The test case is the 3d (gaming) app that uses 100% cpu. It never sets delay
> swap prefetch in any way so swap prefetching starts working. Once swap
> prefetching starts reading it is mostly in uninterruptible sleep and always
> wakes up on the active array ready for cpu, never expiring even with its
> minuscule timeslice. The 3d app is always expiring and landing on the expired
> array behind kprefetchd even though kprefetchd is nice 19. The practical
> upshot of all this is that kprefetchd does a lot of prefetching with 3d
> gaming going on, and no amount of priority fiddling stops it doing this. The
> disk access is noticeable during 3d gaming unfortunately. Yielding regularly
> means a heck of a lot less prefetching occurs and is no longer noticeable.
> When idle, yield()ing doesn't seem to adversely affect the effectiveness of
> the prefetching.
>

but, but. If prefetching is prefetching stuff which that game will soon
use then it'll be an aggregate improvement. If prefetch is prefetching
stuff which that game _won't_ use then prefetch is busted. Using yield()
to artificially cripple kprefetchd is a rather sad workaround isn't it?

2006-03-08 01:11:33

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 12:11 pm, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> > On Wed, 8 Mar 2006 11:05 am, Andrew Morton wrote:
> > > Con Kolivas <[email protected]> wrote:
> > > > > yield() really sucks if there are a lot of runnable tasks. And the
> > > > > amount of CPU which that thread uses isn't likely to matter anyway.
> > > > >
> > > > > I think it'd be better to just not do this. Perhaps alter the
> > > > > thread's static priority instead? Does the scheduler have a knob
> > > > > which can be used to disable a task's dynamic priority boost
> > > > > heuristic?
> > > >
> > > > We do have SCHED_BATCH but even that doesn't really have the desired
> > > > effect. I know how much yield sucks and I actually want it to suck as
> > > > much as yield does.
> > >
> > > Why do you want that?
> > >
> > > If prefetch is doing its job then it will save the machine from a pile
> > > of major faults in the near future. The fact that the machine happens
> > > to be running a number of busy tasks doesn't alter that. It's _worth_
> > > stealing a few cycles from those tasks now to avoid lengthy D-state
> > > sleeps in the near future?
> >
> > The test case is the 3d (gaming) app that uses 100% cpu. It never sets
> > delay swap prefetch in any way so swap prefetching starts working. Once
> > swap prefetching starts reading it is mostly in uninterruptible sleep and
> > always wakes up on the active array ready for cpu, never expiring even
> > with its minuscule timeslice. The 3d app is always expiring and landing
> > on the expired array behind kprefetchd even though kprefetchd is nice 19.
> > The practical upshot of all this is that kprefetchd does a lot of
> > prefetching with 3d gaming going on, and no amount of priority fiddling
> > stops it doing this. The disk access is noticeable during 3d gaming
> > unfortunately. Yielding regularly means a heck of a lot less prefetching
> > occurs and is no longer noticeable. When idle, yield()ing doesn't seem to
> > adversely affect the effectiveness of the prefetching.
>
> but, but. If prefetching is prefetching stuff which that game will soon
> use then it'll be an aggregate improvement. If prefetch is prefetching
> stuff which that game _won't_ use then prefetch is busted. Using yield()
> to artificially cripple kprefetchd is a rather sad workaround isn't it?

It's not the stuff that it prefetches that's the problem; it's the disk
access.

Con

2006-03-08 01:18:52

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 12:12 pm, Con Kolivas wrote:
> On Wed, 8 Mar 2006 12:11 pm, Andrew Morton wrote:
> > Con Kolivas <[email protected]> wrote:
> > > On Wed, 8 Mar 2006 11:05 am, Andrew Morton wrote:
> > > > Con Kolivas <[email protected]> wrote:
> > > > > > yield() really sucks if there are a lot of runnable tasks. And
> > > > > > the amount of CPU which that thread uses isn't likely to matter
> > > > > > anyway.
> > > > > >
> > > > > > I think it'd be better to just not do this. Perhaps alter the
> > > > > > thread's static priority instead? Does the scheduler have a knob
> > > > which can be used to disable a task's dynamic priority boost
> > > > > > heuristic?
> > > > >
> > > > > We do have SCHED_BATCH but even that doesn't really have the
> > > > > desired effect. I know how much yield sucks and I actually want it
> > > > > to suck as much as yield does.
> > > >
> > > > Why do you want that?
> > > >
> > > > If prefetch is doing its job then it will save the machine from a
> > > > pile of major faults in the near future. The fact that the machine
> > > > happens to be running a number of busy tasks doesn't alter that.
> > > > It's _worth_ stealing a few cycles from those tasks now to avoid
> > > > lengthy D-state sleeps in the near future?
> > >
> > > The test case is the 3d (gaming) app that uses 100% cpu. It never sets
> > > delay swap prefetch in any way so swap prefetching starts working. Once
> > > swap prefetching starts reading it is mostly in uninterruptible sleep
> > > and always wakes up on the active array ready for cpu, never expiring
> > > even with its minuscule timeslice. The 3d app is always expiring and
> > > landing on the expired array behind kprefetchd even though kprefetchd
> > > is nice 19. The practical upshot of all this is that kprefetchd does a
> > > lot of prefetching with 3d gaming going on, and no amount of priority
> > > fiddling stops it doing this. The disk access is noticeable during 3d
> > > gaming unfortunately. Yielding regularly means a heck of a lot less
> > > prefetching occurs and is no longer noticeable. When idle, yield()ing
> > > doesn't seem to adversely affect the effectiveness of the prefetching.
> >
> > but, but. If prefetching is prefetching stuff which that game will soon
> > use then it'll be an aggregate improvement. If prefetch is prefetching
> > stuff which that game _won't_ use then prefetch is busted. Using yield()
> > to artificially cripple kprefetchd is a rather sad workaround isn't it?
>
> It's not the stuff that it prefetches that's the problem; it's the disk
> access.

I guess what I'm saying is that there isn't enough information to delay swap
prefetch when cpu usage is high, which was my intention as well. Yielding has
the desired effect without adding further accounting checks to swap_prefetch.

Cheers,
Con

2006-03-08 01:21:25

by Andrew Morton

Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas <[email protected]> wrote:
>
> > but, but. If prefetching is prefetching stuff which that game will soon
> > use then it'll be an aggregate improvement. If prefetch is prefetching
> > stuff which that game _won't_ use then prefetch is busted. Using yield()
> > to artificially cripple kprefetchd is a rather sad workaround isn't it?
>
> It's not the stuff that it prefetches that's the problem; it's the disk
> access.

But the prefetch code tries to avoid prefetching when the disk is otherwise
busy (or it should - we discussed that a bit a while ago).

Sorry, I'm not trying to be awkward here - I think that nobbling prefetch
when there's a lot of CPU activity is just the wrong thing to do and it'll
harm other workloads.

2006-03-08 01:27:36

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 12:23 pm, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> > > but, but. If prefetching is prefetching stuff which that game will
> > > soon use then it'll be an aggregate improvement. If prefetch is
> > > prefetching stuff which that game _won't_ use then prefetch is busted.
> > > Using yield() to artificially cripple kprefetchd is a rather sad
> > > workaround isn't it?
> >
> > It's not the stuff that it prefetches that's the problem; it's the disk
> > access.
>
> But the prefetch code tries to avoid prefetching when the disk is otherwise
> busy (or it should - we discussed that a bit a while ago).

Anything that does disk access delays prefetch fine. Things that only do heavy
cpu do not delay prefetch. Anything reading from disk will be noticeable
during 3d gaming.

> Sorry, I'm not trying to be awkward here - I think that nobbling prefetch
> when there's a lot of CPU activity is just the wrong thing to do and it'll
> harm other workloads.

I can't distinguish between when cpu activity is important (game) and when it
is not (compile), and assuming worst case scenario and not doing any swap
prefetching is my intent. I could add cpu accounting to prefetch_suitable()
instead, but that gets rather messy and yielding achieves the same endpoint.

Cheers,
Con

2006-03-08 02:08:37

by Lee Revell

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 2006-03-08 at 12:28 +1100, Con Kolivas wrote:
> I can't distinguish between when cpu activity is important (game) and when it
> is not (compile), and assuming worst case scenario and not doing any swap
> prefetching is my intent. I could add cpu accounting to prefetch_suitable()
> instead, but that gets rather messy and yielding achieves the same endpoint.

Shouldn't the game be running with RT priority or at least at a low nice
value?

Lee

2006-03-08 02:12:25

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 01:08 pm, Lee Revell wrote:
> On Wed, 2006-03-08 at 12:28 +1100, Con Kolivas wrote:
> > I can't distinguish between when cpu activity is important (game) and
> > when it is not (compile), and assuming worst case scenario and not doing
> > any swap prefetching is my intent. I could add cpu accounting to
> > prefetch_suitable() instead, but that gets rather messy and yielding
> > achieves the same endpoint.
>
> Shouldn't the game be running with RT priority or at least at a low nice
> value?

No way. Games run nice 0 SCHED_NORMAL.

Con

2006-03-08 02:18:20

by Lee Revell

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 2006-03-08 at 13:12 +1100, Con Kolivas wrote:
> On Wed, 8 Mar 2006 01:08 pm, Lee Revell wrote:
> > On Wed, 2006-03-08 at 12:28 +1100, Con Kolivas wrote:
> > > I can't distinguish between when cpu activity is important (game) and
> > > when it is not (compile), and assuming worst case scenario and not doing
> > > any swap prefetching is my intent. I could add cpu accounting to
> > > prefetch_suitable() instead, but that gets rather messy and yielding
> > > achieves the same endpoint.
> >
> > Shouldn't the game be running with RT priority or at least at a low nice
> > value?
>
> No way. Games run nice 0 SCHED_NORMAL.

Maybe this is a stupid/OT question (answer off list if you think so) but
why not? Isn't that the standard way of telling the scheduler that you
have a realtime constraint? It's how pro audio stuff works which I
would think has similar RT requirements.

How is the scheduler supposed to know to penalize a kernel compile
taking 100% CPU but not a game using 100% CPU?

Lee

2006-03-08 02:21:36

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 01:18 pm, Lee Revell wrote:
> On Wed, 2006-03-08 at 13:12 +1100, Con Kolivas wrote:
> > On Wed, 8 Mar 2006 01:08 pm, Lee Revell wrote:
> > > On Wed, 2006-03-08 at 12:28 +1100, Con Kolivas wrote:
> > > > I can't distinguish between when cpu activity is important (game) and
> > > > when it is not (compile), and assuming worst case scenario and not
> > > > doing any swap prefetching is my intent. I could add cpu accounting
> > > > to prefetch_suitable() instead, but that gets rather messy and
> > > > yielding achieves the same endpoint.
> > >
> > > Shouldn't the game be running with RT priority or at least at a low
> > > nice value?
> >
> > No way. Games run nice 0 SCHED_NORMAL.
>
> Maybe this is a stupid/OT question (answer off list if you think so) but
> why not? Isn't that the standard way of telling the scheduler that you
> have a realtime constraint? It's how pro audio stuff works which I
> would think has similar RT requirements.
>
> How is the scheduler supposed to know to penalize a kernel compile
> taking 100% CPU but not a game using 100% CPU?

Because being a serious desktop operating system that we are (bwahahahaha)
means the user should not have special privileges to run something as simple
as a game. Games should not need special scheduling classes. We can always
use 'nice' for a compile though. Real time audio is a completely different
world to this.

Cheers,
Con

2006-03-08 02:27:17

by Lee Revell

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 2006-03-08 at 13:22 +1100, Con Kolivas wrote:
> > How is the scheduler supposed to know to penalize a kernel compile
> > taking 100% CPU but not a game using 100% CPU?
>
> Because being a serious desktop operating system that we are (bwahahahaha)
> means the user should not have special privileges to run something as simple
> as a game. Games should not need special scheduling classes. We can always
> use 'nice' for a compile though. Real time audio is a completely different
> world to this.

Actually recent distros like the upcoming Ubuntu Dapper support the new
RLIMIT_NICE and RLIMIT_RTPRIO so this would Just Work without any
special privileges (well, not root anyway - you'd have to put the user
in the right group and add one line to /etc/security/limits.conf).
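[The one-liners Lee refers to are pam_limits entries; something like the
following (group name hypothetical; `nice` and `rtprio` are the pam_limits
items involved) lets ordinary users in that group request negative nice values
or a modest RT priority:]

```
# /etc/security/limits.conf -- hypothetical entries
@games   -   nice    -10
@games   -   rtprio   50
```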

I think OSX also uses special scheduling classes for stuff with RT
constraints.

The only barrier I see is that games aren't specifically written to take
advantage of RT scheduling because historically it's only been available
to root.

Lee

2006-03-08 02:30:29

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 8 Mar 2006 01:27 pm, Lee Revell wrote:
> On Wed, 2006-03-08 at 13:22 +1100, Con Kolivas wrote:
> > > How is the scheduler supposed to know to penalize a kernel compile
> > > taking 100% CPU but not a game using 100% CPU?
> >
> > Because being a serious desktop operating system that we are
> > (bwahahahaha) means the user should not have special privileges to run
> > something as simple as a game. Games should not need special scheduling
> > classes. We can always use 'nice' for a compile though. Real time audio
> > is a completely different world to this.
>
> Actually recent distros like the upcoming Ubuntu Dapper support the new
> RLIMIT_NICE and RLIMIT_RTPRIO so this would Just Work without any
> special privileges (well, not root anyway - you'd have to put the user
> in the right group and add one line to /etc/security/limits.conf).
>
> I think OSX also uses special scheduling classes for stuff with RT
> constraints.
>
> The only barrier I see is that games aren't specifically written to take
> advantage of RT scheduling because historically it's only been available
> to root.

Well as I said in my previous reply, games should _not_ need special
scheduling classes. They are not written in a real time smart way and they do
not have any realtime constraints or requirements.

Cheers,
Con

2006-03-08 02:52:07

by André Goddard Rosa

Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching

[...]
> > > Because being a serious desktop operating system that we are
> > > (bwahahahaha) means the user should not have special privileges to run
> > > something as simple as a game. Games should not need special scheduling
> > > classes. We can always use 'nice' for a compile though. Real time audio
> > > is a completely different world to this.
[...]
> Well as I said in my previous reply, games should _not_ need special
> scheduling classes. They are not written in a real time smart way and they do
> not have any realtime constraints or requirements.

Sorry Con, but I have to disagree with you on this.

Games are very complex software, involving heavy use of hardware resources,
and they also have a lot of time constraints. So I think they should use RT
priorities if that is necessary to get the resources they need in time.

Thanks,
--
[]s,

André Goddard

2006-03-08 03:03:07

by Lee Revell

Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching

On Tue, 2006-03-07 at 22:52 -0400, André Goddard Rosa wrote:
> Sorry Con, but I have to disagree with you on this.
>
> Games are very complex software, involving heavy use of hardware
> resources
> and they also have a lot of time constraints. So, I think they should
> use RT priorities
> if it is necessary to get the resources needed in time.
>

The main reason I assumed games would want to use the POSIX realtime
features like priority scheduling etc. is that the simulation people all
use them - it seems like a very similar problem.

Lee

2006-03-08 03:05:55

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

André Goddard Rosa writes:

> [...]
>> > > Because being a serious desktop operating system that we are
>> > > (bwahahahaha) means the user should not have special privileges to run
>> > > something as simple as a game. Games should not need special scheduling
>> > > classes. We can always use 'nice' for a compile though. Real time audio
>> > > is a completely different world to this.
> [...]
>> Well as I said in my previous reply, games should _not_ need special
>> scheduling classes. They are not written in a real time smart way and they do
>> not have any realtime constraints or requirements.
>
> Sorry Con, but I have to disagree with you on this.
>
> Games are very complex software, involving heavy use of hardware resources
> and they also have a lot of time constraints. So, I think they should
> use RT priorities
> if it is necessary to get the resources needed in time.

Excellent, I've opened the can of worms.

Yes, games are an incredibly complex beast.

No, they shouldn't need real time scheduling to work well if they are coded
properly. However, witness the fact that most of our games are windows
ports, and therefore lower quality than the originals. Witness also the
fact that now, with dual core support, lots and lots (but not all) of
windows games on _windows_ are having scheduling trouble and jerky playback,
forcing them to crappily bind themselves to one cpu. As much as I'd love to
blame windows, it is almost certainly due to the coding of the application
since better games don't exhibit this problem. Now the games in question
can't be trusted to even run on SMP; do you really think they could cope
with good real time code? Good -complex- real time coding is very difficult.
If you take any game out there that currently exists and throw real time
scheduling at it, almost certainly it will hang the machine. No, I don't
believe games need realtime scheduling to work well; they just need to be
written well and the kernel needs to be unintrusive enough to work well with
them. Otherwise gaming would have needed realtime scheduling from day
one on all operating systems. Generic kernel activities should not cause
game stuttering either as users have little control over them. I do expect
users to not run too many userspace programs while trying to play games
though. I do not believe we should make games work well in the presence of
updatedb running for example.

Cheers,
Con



2006-03-08 07:51:43

by Jan Knutar

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wednesday 08 March 2006 03:28, Con Kolivas wrote:

> Anything that does disk access delays prefetch fine. Things that only do heavy
> cpu do not delay prefetch. Anything reading from disk will be noticeable
> during 3d gaming.

What exactly makes the disk accesses noticeable? Is it because they steal
time from the disk that the game otherwise would need, or do the disk accesses
themselves consume noticeable amounts of CPU time?
Or, do bits of the game's executable drop from memory to make room for the
new stuff being pulled in from memory, causing the game to halt while it waits
for its pages to come back? On a related note, through advanced use of
handwaving and guessing, this seems to be the thing that kills my desktop
experience (*buzzword alert*) most often. Checksumming a large file
seems to be less of an impact than things that seek a lot, like updatedb.

I remember playing vegastrike on my linux desktop machine few years ago,
the game leaked so much memory that it filled my 2G swap rather often,
unleashing OOM killer mayhem. I "solved" this by putting swap on floppy at
lower priority than the 2G, and a 128M swap file as "backup" at even lower
priority than the floppy. I didn't notice the swapping to the hard drive, but when it
started to swap to floppy, it made the game run a bit slower for a few seconds,
plus the floppy light went on, and I knew I had 128M left to save my position
and quit.

If I needed floppy to make disk access noticeable on my very low end
machine... What are these new fancy things doing to make HD access
noticeable?

2006-03-08 08:39:49

by Con Kolivas

Subject: Re: [PATCH] mm: yield during swap prefetching

On Wednesday 08 March 2006 18:51, Jan Knutar wrote:
> On Wednesday 08 March 2006 03:28, Con Kolivas wrote:
> > Anything that does disk access delays prefetch fine. Things that only do
> > heavy cpu do not delay prefetch. Anything reading from disk will be
> > noticeable during 3d gaming.
>
> What exactly makes the disk accesses noticeable? Is it because they steal
> time from the disk that the game otherwise would need, or do the disk
> accesses themselves consume noticeable amounts of CPU time?
> Or, do bits of the game's executable drop from memory to make room for the
> new stuff being pulled in from memory, causing the game to halt while it
> waits for its pages to come back? On a related note, through advanced use
> of handwaving and guessing, this seems to be the thing that kills my desktop
> experience (*buzzword alert*) most often. Checksumming a large file seems
> to be less of an impact than things that seek a lot, like updatedb.
>
> I remember playing vegastrike on my linux desktop machine few years ago,
> the game leaked so much memory that it filled my 2G swap rather often,
> unleashing OOM killer mayhem. I "solved" this by putting swap on floppy at
> lower priority than the 2G, and a 128M swap file as "backup" at even lower
> priority than the floppy. I didn't notice the swapping to the hard drive, but
> when it started to swap to floppy, it made the game run a bit slower for a
> few seconds, plus the floppy light went on, and I knew I had 128M left to
> save my position and quit.
>
> If I needed floppy to make disk access noticeable on my very low end
> machine... What are these new fancy things doing to make HD access
> noticeable?

It's the cumulative effect of the cpu used by the in kernel code paths and the
kprefetchd kernel thread. Even running ultra low priority, if they read a lot
from the hard drive it costs us cpu time (seen as I/O wait in top for
example). Swap prefetch _never_ displaces anything from ram; it only ever
reads things from swap if there is generous free ram available. Not only that
but if it reads something from swap it is put at the end of the "least
recently used" list meaning that if _anything_ needs ram, these are the first
things displaced again.

Cheers,
Con

2006-03-08 08:48:31

by Andreas Mohr

Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching

Hi,

On Tue, Mar 07, 2006 at 03:26:36PM -0800, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> >
> > Swap prefetching doesn't use very much cpu but spends a lot of time waiting on
> > disk in uninterruptible sleep. This means it won't get preempted often even at
> > a low nice level since it is seen as sleeping most of the time. We want to
> > minimise its cpu impact so yield where possible.

> yield() really sucks if there are a lot of runnable tasks. And the amount
> of CPU which that thread uses isn't likely to matter anyway.
>
> I think it'd be better to just not do this. Perhaps alter the thread's
> static priority instead? Does the scheduler have a knob which can be used
> to disable a tasks's dynamic priority boost heuristic?

This problem occurs due to giving a priority boost to processes that are
sleeping a lot (e.g. in this case, I/O, from disk), right?
Forgive me my possibly less insightful comments, but maybe instead of adding
crude specific hacks (namely, yield()) to each specific problematic process as
it comes along (it just happens to be the swap prefetch thread this time)
there is a *general way* to give processes with lots of disk I/O sleeping
much smaller amounts of boost in order to get them preempted more often
in favour of an actually much more critical process (game)?
From the discussion here it seems this problem is caused by a *general*
miscalculation of processes sleeping on disk I/O a lot.

Thus IMHO this problem should be solved in a general way if at all possible.

Andreas Mohr

2006-03-08 08:53:07

by Con Kolivas

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching

On Wednesday 08 March 2006 19:48, Andreas Mohr wrote:
> Hi,
>
> On Tue, Mar 07, 2006 at 03:26:36PM -0800, Andrew Morton wrote:
> > Con Kolivas <[email protected]> wrote:
> > > Swap prefetching doesn't use very much cpu but spends a lot of time
> > > waiting on disk in uninterruptible sleep. This means it won't get
> > > preempted often even at a low nice level since it is seen as sleeping
> > > most of the time. We want to minimise its cpu impact so yield where
> > > possible.
> >
> > yield() really sucks if there are a lot of runnable tasks. And the
> > amount of CPU which that thread uses isn't likely to matter anyway.
> >
> > I think it'd be better to just not do this. Perhaps alter the thread's
> > static priority instead? Does the scheduler have a knob which can be
> > used to disable a tasks's dynamic priority boost heuristic?
>
> This problem occurs due to giving a priority boost to processes that are
> sleeping a lot (e.g. in this case, I/O, from disk), right?
> Forgive me my possibly less insightful comments, but maybe instead of
> adding crude specific hacks (namely, yield()) to each specific problematic
> process as it comes along (it just happens to be the swap prefetch thread
> this time) there is a *general way* to give processes with lots of disk I/O
> sleeping much smaller amounts of boost in order to get them preempted more
> often in favour of an actually much more critical process (game)?
>
> From the discussion here it seems this problem is caused by a *general*
> miscalculation of processes sleeping on disk I/O a lot.
>
> Thus IMHO this problem should be solved in a general way if at all
> possible.

No. We already do special things for tasks waiting on uninterruptible sleep.
This is more about an effect that is exaggerated by the dual-array expiring
scheduler design that mainline has.

Cheers,
Con

2006-03-08 13:37:14

by Con Kolivas

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching

cc'ing Ingo...

On Wednesday 08 March 2006 10:32, Con Kolivas wrote:
> Andrew Morton writes:
> > Con Kolivas <[email protected]> wrote:
> >> Swap prefetching doesn't use very much cpu but spends a lot of time
> >> waiting on disk in uninterruptible sleep. This means it won't get
> >> preempted often even at a low nice level since it is seen as sleeping
> >> most of the time. We want to minimise its cpu impact so yield where
> >> possible.
> >>
> >> Signed-off-by: Con Kolivas <[email protected]>
> >> ---
> >> mm/swap_prefetch.c | 1 +
> >> 1 file changed, 1 insertion(+)
> >>
> >> Index: linux-2.6.15-ck5/mm/swap_prefetch.c
> >> ===================================================================
> >> --- linux-2.6.15-ck5.orig/mm/swap_prefetch.c 2006-03-02
> >> 14:00:46.000000000 +1100 +++
> >> linux-2.6.15-ck5/mm/swap_prefetch.c 2006-03-08 08:49:32.000000000 +1100
> >> @@ -421,6 +421,7 @@ static enum trickle_return trickle_swap(
> >>
> >> if (trickle_swap_cache_async(swp_entry, node) == TRICKLE_DELAY)
> >> break;
> >> + yield();
> >> }
> >>
> >> if (sp_stat.prefetched_pages) {
> >
> > yield() really sucks if there are a lot of runnable tasks. And the
> > amount of CPU which that thread uses isn't likely to matter anyway.
> >
> > I think it'd be better to just not do this. Perhaps alter the thread's
> > static priority instead? Does the scheduler have a knob which can be
> > used to disable a tasks's dynamic priority boost heuristic?
>
> We do have SCHED_BATCH but even that doesn't really have the desired
> effect. I know how much yield sucks and I actually want it to suck as much
> as yield does.

Thinking some more on this I wonder if SCHED_BATCH isn't a strong enough
scheduling hint if it's not suitable for such an application. Ingo do you
think we could make SCHED_BATCH tasks always wake up on the expired array?

Cheers,
Con

2006-03-08 21:08:07

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Wed, 2006-03-08 at 14:05 +1100, Con Kolivas wrote:
> André Goddard Rosa writes:
>
> > [...]
> >> > > Because being a serious desktop operating system that we are
> >> > > (bwahahahaha) means the user should not have special privileges to run
> >> > > something as simple as a game. Games should not need special scheduling
> >> > > classes. We can always use 'nice' for a compile though. Real time audio
> >> > > is a completely different world to this.
> > [...]
> >> Well as I said in my previous reply, games should _not_ need special
> >> scheduling classes. They are not written in a real time smart way and they do
> >> not have any realtime constraints or requirements.
> >
> > Sorry Con, but I have to disagree with you on this.
> >
> > Games are very complex software, involving heavy use of hardware resources
> > and they also have a lot of time constraints. So, I think they should
> > use RT priorities
> > if it is necessary to get the resources needed in time.
>
> Excellent, I've opened the can of worms.
>
> > Yes, games are an incredibly complex beast.
>
> No they shouldn't need real time scheduling to work well if they are coded
> properly. However, witness the fact that most of our games are windows
> ports, therefore being lower quality than the original. Witness also the
> fact that at last with dual core support, lots and lots (but not all) of
> windows games on _windows_ are having scheduling trouble and jerky playback,
> forcing them to crappily force binding to one cpu.
[snip]

Games where you notice frame-rate chop because the *disk system* is
using too much CPU are perfect examples of applications that should be
using real-time.

Multiple CPU cores and multithreading in games is another perfect
example of programming that *needs* predictable real-time thread
priorities. There is no other way to guarantee that physics processing
takes priority over graphics updates or AI, once each task becomes
separated from a monolithic main loop and spread over several CPU cores.

Because games often *are* badly written, a user-friendly Linux gaming
system does need a high-priority real-time task watching for a special
keystroke, like C-A-Del for example, so that it can kill the other
real-time tasks and return to the UI shell.

Games and real-time go together like they were made for each other.
--
Zan Lynx <[email protected]>



2006-03-08 22:24:37

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Út 07-03-06 16:05:15, Andrew Morton wrote:
> Con Kolivas <[email protected]> wrote:
> >
> > > yield() really sucks if there are a lot of runnable tasks. And the amount
> > > of CPU which that thread uses isn't likely to matter anyway.
> > >
> > > I think it'd be better to just not do this. Perhaps alter the thread's
> > > static priority instead? Does the scheduler have a knob which can be used
> > > to disable a tasks's dynamic priority boost heuristic?
> >
> > We do have SCHED_BATCH but even that doesn't really have the desired effect.
> > I know how much yield sucks and I actually want it to suck as much as yield
> > does.
>
> Why do you want that?
>
> If prefetch is doing its job then it will save the machine from a pile of
> major faults in the near future. The fact that the machine happens

Or maybe not.... it is prefetch, it may prefetch wrongly, and you
definitely want it doing nothing when system is loaded.... It only
makes sense to prefetch when system is idle.
Pavel
--
Web maintainer for suspend.sf.net (http://www.sf.net/projects/suspend) wanted...

2006-03-08 23:00:30

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Zan Lynx writes:

> On Wed, 2006-03-08 at 14:05 +1100, Con Kolivas wrote:
>> André Goddard Rosa writes:
>>
>> > [...]
>> >> > > Because being a serious desktop operating system that we are
>> >> > > (bwahahahaha) means the user should not have special privileges to run
>> >> > > something as simple as a game. Games should not need special scheduling
>> >> > > classes. We can always use 'nice' for a compile though. Real time audio
>> >> > > is a completely different world to this.
>> > [...]
>> >> Well as I said in my previous reply, games should _not_ need special
>> >> scheduling classes. They are not written in a real time smart way and they do
>> >> not have any realtime constraints or requirements.
>> >
>> > Sorry Con, but I have to disagree with you on this.
>> >
>> > Games are very complex software, involving heavy use of hardware resources
>> > and they also have a lot of time constraints. So, I think they should
>> > use RT priorities
>> > if it is necessary to get the resources needed in time.
>>
>> Excellent, I've opened the can of worms.
>>
>> Yes, games are an incredibly complex beast.
>>
>> No they shouldn't need real time scheduling to work well if they are coded
>> properly. However, witness the fact that most of our games are windows
>> ports, therefore being lower quality than the original. Witness also the
>> fact that at last with dual core support, lots and lots (but not all) of
>> windows games on _windows_ are having scheduling trouble and jerky playback,
>> forcing them to crappily force binding to one cpu.
> [snip]
>
> Games where you notice frame-rate chop because the *disk system* is
> using too much CPU are perfect examples of applications that should be
> using real-time.
>
> Multiple CPU cores and multithreading in games is another perfect
> example of programming that *needs* predictable real-time thread
> priorities. There is no other way to guarantee that physics processing
> takes priority over graphics updates or AI, once each task becomes
> separated from a monolithic main loop and spread over several CPU cores.
>
> Because games often *are* badly written, a user-friendly Linux gaming
> system does need a high-priority real-time task watching for a special
> keystroke, like C-A-Del for example, so that it can kill the other
> real-time tasks and return to the UI shell.
>
> Games and real-time go together like they were made for each other.

I guess every single well working windows game since the dawn of time is
some sort of anomaly then.

Cheers,
Con



2006-03-08 23:48:28

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thu, 2006-03-09 at 10:00 +1100, Con Kolivas wrote:
> Zan Lynx writes:
[snip]
> > Games and real-time go together like they were made for each other.
>
> I guess every single well working windows game since the dawn of time is
> some sort of anomaly then.

Yes, those Windows games are anomalies that rely on the OS scheduling
them AS IF they were real-time, but without actually claiming that
priority.

Because these games just assume they own the whole system and aren't
explicitly telling the OS about their real-time requirements, the OS has
to guess instead and can get it wrong, especially when hardware
capabilities advance in ways that force changes to the task scheduler
(multi-core, hyper-threading). And you said it yourself, many old games
don't work well on dual-core systems.

I think your effort to improve the guessing is a good idea, and
thanks.

Just don't dismiss the idea that games do have real-time requirements
and if they did things correctly, games would explicitly specify those
requirements.
--
Zan Lynx <[email protected]>



2006-03-09 00:08:12

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Zan Lynx writes:

> On Thu, 2006-03-09 at 10:00 +1100, Con Kolivas wrote:
>> Zan Lynx writes:
> [snip]
>> > Games and real-time go together like they were made for each other.
>>
>> I guess every single well working windows game since the dawn of time is
>> some sort of anomaly then.
>
> Yes, those Windows games are anomalies that rely on the OS scheduling
> them AS IF they were real-time, but without actually claiming that
> priority.
>
> Because these games just assume they own the whole system and aren't
> explicitly telling the OS about their real-time requirements, the OS has
> to guess instead and can get it wrong, especially when hardware
> capabilities advance in ways that force changes to the task scheduler
> (multi-core, hyper-threading). And you said it yourself, many old games
> don't work well on dual-core systems.
>
> I think your effort to improve the guessing is a good idea, and
> thanks.
>
> Just don't dismiss the idea that games do have real-time requirements
> and if they did things correctly, games would explicitly specify those
> requirements.

Games worked on windows for a decade on single core without real time
scheduling because that's what they were written for.

Now that games are written for windows with dual core they work well - again
without real time scheduling.

Why should a port of these games to linux require real time?

Cheers,
Con



2006-03-09 02:22:28

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Pavel Machek wrote:

>On Út 07-03-06 16:05:15, Andrew Morton wrote:
>
>>Why do you want that?
>>
>>If prefetch is doing its job then it will save the machine from a pile of
>>major faults in the near future. The fact that the machine happens
>>
>
>Or maybe not.... it is prefetch, it may prefetch wrongly, and you
>definitely want it doing nothing when system is loaded.... It only
>makes sense to prefetch when system is idle.
>

Right. Prefetching is obviously going to have a very low work/benefit,
assuming your page reclaim is working properly, because a) it doesn't
deal with file pages, and b) it is doing work to reclaim pages that
have already been deemed to be the least important.

What it is good for is working around our interesting VM that apparently
allows updatedb to swap everything out (although I haven't seen this
problem myself), and artificial memory hogs. By moving work to times of
low cost. No problem with the theory behind it.

So as much as a major fault costs in terms of performance, the tiny
chance that prefetching will avoid it means even the CPU usage is
questionable. Using sched_yield() seems like a hack though.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 02:29:52

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thu, 9 Mar 2006 01:22 pm, Nick Piggin wrote:
> Pavel Machek wrote:
> >On Út 07-03-06 16:05:15, Andrew Morton wrote:
> >>Why do you want that?
> >>
> >>If prefetch is doing its job then it will save the machine from a pile of
> >>major faults in the near future. The fact that the machine happens
> >
> >Or maybe not.... it is prefetch, it may prefetch wrongly, and you
> >definitely want it doing nothing when system is loaded.... It only
> >makes sense to prefetch when system is idle.
>
> Right. Prefetching is obviously going to have a very low work/benefit,
> assuming your page reclaim is working properly, because a) it doesn't
> deal with file pages, and b) it is doing work to reclaim pages that
> have already been deemed to be the least important.
>
> What it is good for is working around our interesting VM that apparently
> allows updatedb to swap everything out (although I haven't seen this
> problem myself), and artificial memory hogs. By moving work to times of
> low cost. No problem with the theory behind it.
>
> So as much as a major fault costs in terms of performance, the tiny
> chance that prefetching will avoid it means even the CPU usage is
> questionable. Using sched_yield() seems like a hack though.

Yeah it's a hack alright. Funny how at last I find a place where yield does
exactly what I want and because we hate yield so much no one wants me to use
it at all.

Cheers,
Con

2006-03-09 02:58:03

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas wrote:

>On Thu, 9 Mar 2006 01:22 pm, Nick Piggin wrote:
>
>>
>>So as much as a major fault costs in terms of performance, the tiny
>>chance that prefetching will avoid it means even the CPU usage is
>>questionable. Using sched_yield() seems like a hack though.
>>
>
>Yeah it's a hack alright. Funny how at last I find a place where yield does
>exactly what I want and because we hate yield so much no one wants me to use
>it at all.
>
>

AFAIKS it is a hack for the same reason using it for locking is a hack,
it's just that prefetch doesn't care if it doesn't get the CPU back for
a while.

Given a yield implementation which does something completely different
for SCHED_OTHER tasks, your code may find it doesn't work so well anymore.
This is no different to the java folk using it with decent results for
locking. Just because it happened to work OK for them at the time didn't
mean it was the right thing to do.

I have always maintained that a SCHED_OTHER task calling sched_yield
is basically a bug because it is utterly undefined behaviour.

But being an in-kernel user that "knows" the implementation sort of does
the right thing, maybe you justify it that way.

--

Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-09 03:13:45

by Zan Lynx

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thu, 2006-03-09 at 11:07 +1100, Con Kolivas wrote:
> Games worked on windows for a decade on single core without real time
> scheduling because that's what they were written for.
>
> Now that games are written for windows with dual core they work well -
> again
> without real time scheduling.
>
> Why should a port of these games to linux require real time?

That isn't what I said. I said nothing about *requiring* anything, only
about how to do it better.

Here is what Con said that I was disagreeing with. All the rest was to
justify my disagreement.

Con said, "... games should _not_ need special scheduling classes. They
are not written in a real time smart way and they do not have any
realtime constraints or requirements."

And he said later, "No they shouldn't need real time scheduling to work
well if they are coded properly."

Here is a list of simple statements of what I am saying:
Games do have real-time requirements.
The OS guessing about real-time priorities will sometimes get it wrong.
Guessing task priority is worse than being told and knowing for sure.
Games should, in an ideal world, be using real-time OS scheduling.
Games would work better using real-time OS scheduling.

That is all from me.
--
Zan Lynx <[email protected]>



2006-03-09 04:09:57

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Zan Lynx writes:

> On Thu, 2006-03-09 at 11:07 +1100, Con Kolivas wrote:
>> Games worked on windows for a decade on single core without real time
>> scheduling because that's what they were written for.
>>
>> Now that games are written for windows with dual core they work well -
>> again
>> without real time scheduling.
>>
>> Why should a port of these games to linux require real time?
>
> That isn't what I said. I said nothing about *requiring* anything, only
> about how to do it better.
>
> Here is what Con said that I was disagreeing with. All the rest was to
> justify my disagreement.
>
> Con said, "... games should _not_ need special scheduling classes. They
> are not written in a real time smart way and they do not have any
> realtime constraints or requirements."
>
> And he said later, "No they shouldn't need real time scheduling to work
> well if they are coded properly."
>
> Here is a list of simple statements of what I am saying:
> Games do have real-time requirements.
> The OS guessing about real-time priorities will sometimes get it wrong.
> Guessing task priority is worse than being told and knowing for sure.
> Games should, in an ideal world, be using real-time OS scheduling.
> Games would work better using real-time OS scheduling.

At the risk of being repetitive to the point of tedium, my point is that
there are no real time requirements in games. You're assuming that
everything will be better if we assume that there are rt requirements and
that we're simulating pseudo real time conditions currently. That's just not
the case and never has been. That's why it has worked fine for so long.

Cheers,
Con



2006-03-09 04:54:55

by Lee Revell

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thu, 2006-03-09 at 15:08 +1100, Con Kolivas wrote:
> > Games do have real-time requirements.
> > The OS guessing about real-time priorities will sometimes get it wrong.
> > Guessing task priority is worse than being told and knowing for sure.
> > Games should, in an ideal world, be using real-time OS scheduling.
> > Games would work better using real-time OS scheduling.
>
> At the risk of being repetitive to the point of tiresome, my point is that
> there are no real time requirements in games. You're assuming that
> everything will be better if we assume that there are rt requirements and
> that we're simulating pseudo real time conditions currently. That's just not
> the case and never has been. That's why it has worked fine for so long.

I think you are talking past each other, and are both right - Con is
saying games don't need realtime scheduling (SCHED_FIFO, low nice value,
whatever) to function correctly (true), while Zan is saying that games
have RT constraints in that they must react as fast as possible to user
input (also true).

Anyway, this is getting OT, I wish I had not raised this issue in this
thread.

Lee

2006-03-09 08:57:37

by Helge Hafting

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

Con Kolivas wrote:

>On Wed, 8 Mar 2006 12:11 pm, Andrew Morton wrote:
>
>
>>but, but. If prefetching is prefetching stuff which that game will soon
>>use then it'll be an aggregate improvement. If prefetch is prefetching
>>stuff which that game _won't_ use then prefetch is busted. Using yield()
>>to artificially cripple kprefetchd is a rather sad workaround isn't it?
>>
>>
>
>It's not the stuff that it prefetches that's the problem; it's the disk
>access.
>
>
Well, seems you have some sorry kind of disk driver then?
An ide disk not using dma?

A low-cpu task that only abuses the disk shouldn't make an impact
on a 3D game that hogs the cpu only. Unless the driver for your
harddisk is faulty, using way more cpu than it needs.

Use hdparm, check the basics:
unmaskirq=1, using_dma=1, multcount is some positive number,
such as 8 or 16, readahead is some positive number.
Also use hdparm -i and verify that the disk is using some
nice udma mode. (too old for that, and it probably isn't worth
optimizing this for...)

Also make sure the disk driver isn't sharing an irq with the
3D card.

Come to think of it, if your 3D game happens to saturate the
pci bus for long times, then disk accesses might indeed
be noticeable as they too need the bus. Check if going to
a slower dma mode helps - this might free up the bus a bit.

Helge Hafting

2006-03-09 09:08:42

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thursday 09 March 2006 19:57, Helge Hafting wrote:
> Con Kolivas wrote:
> >On Wed, 8 Mar 2006 12:11 pm, Andrew Morton wrote:
> >>but, but. If prefetching is prefetching stuff which that game will soon
> >>use then it'll be an aggregate improvement. If prefetch is prefetching
> >>stuff which that game _won't_ use then prefetch is busted. Using yield()
> >>to artificially cripple kprefetchd is a rather sad workaround isn't it?
> >
> >It's not the stuff that it prefetches that's the problem; it's the disk
> >access.
>
> Well, seems you have some sorry kind of disk driver then?
> An ide disk not using dma?
>
> A low-cpu task that only abuses the disk shouldn't make an impact
> on a 3D game that hogs the cpu only. Unless the driver for your
> harddisk is faulty, using way more cpu than it need.
>
> Use hdparm, check the basics:
> unmaskirq=1, using_dma=1, multcount is some positive number,
> such as 8 or 16, readahead is some positive number.
> Also use hdparm -i and verify that the disk is using some
> nice udma mode. (too old for that, and it probably isn't worth
> optimizing this for...)
>
> Also make sure the disk driver isn't sharing an irq with the
> 3D card.
>
> Come to think of it, if your 3D game happens to saturate the
> pci bus for long times, then disk accesses might indeed
> be noticeable as they too need the bus. Check if going to
> a slower dma mode helps - this might free up the bus a bit.

Thanks for the hints.

However I actually wrote the swap prefetch code and this is all about changing
its behaviour to make it do what I want. The problem is that nice 19 will
give it up to 5% cpu in the presence of a nice 0 task when I really don't
want swap prefetch doing anything. Furthermore because it is constantly
waking up from sleep (after disk activity) it is always given lower latency
scheduling than a fully cpu bound nice 0 task - this is normally appropriate
behaviour. Yielding regularly works around that issue.

Ideally, taking cpu usage into account and only working below a certain cpu
threshold may be the better mechanism, and it does appear this would be more
popular. It would not be hard to implement, but it does add yet more code to an
increasingly complex heuristic used to detect "idleness". I am seriously
considering it.

Cheers,
Con

2006-03-09 09:12:04

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] mm: yield during swap prefetching

On Thursday 09 March 2006 13:57, Nick Piggin wrote:
> Con Kolivas wrote:
> >On Thu, 9 Mar 2006 01:22 pm, Nick Piggin wrote:
> >>So as much as a major fault costs in terms of performance, the tiny
> >>chance that prefetching will avoid it means even the CPU usage is
> >>questionable. Using sched_yield() seems like a hack though.
> >
> >Yeah it's a hack alright. Funny how at last I find a place where yield
> > does exactly what I want and because we hate yield so much noone wants me
> > to use it all.
>
> AFAIKS it is a hack for the same reason using it for locking is a hack,
> it's just that prefetch doesn't care if it doesn't get the CPU back for
> a while.
>
> Given a yield implementation which does something completely different
> for SCHED_OTHER tasks, you code may find it doesn't work so well anymore.
> This is no different to the java folk using it with decent results for
> locking. Just because it happened to work OK for them at the time didn't
> mean it was the right thing to do.
>
> I have always maintained that a SCHED_OTHER task calling sched_yield
> is basically a bug because it is utterly undefined behaviour.
>
> But being an in-kernel user that "knows" the implementation sort of does
> the right thin, maybe you justify it that way.

You're right. Even if I do know exactly how yield works and am using it to my
advantage, any solution that depends on the way yield works may well not work
in the future. It does look like I should just check cpu usage as well in
prefetch_suitable(). That will probably be the best generalised solution to
this.

Thanks.

Cheers,
Con

2006-03-17 09:09:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] mm: yield during swap prefetching


* Con Kolivas <[email protected]> wrote:

> > We do have SCHED_BATCH but even that doesn't really have the desired
> > effect. I know how much yield sucks and I actually want it to suck as much
> > as yield does.
>
> Thinking some more on this I wonder if SCHED_BATCH isn't a strong
> enough scheduling hint if it's not suitable for such an application.
> Ingo do you think we could make SCHED_BATCH tasks always wake up on
> the expired array?

yep, i think that's a good idea. In the worst case the starvation
timeout should kick in.

Ingo

2006-03-17 10:44:39

by Mike Galbraith

[permalink] [raw]
Subject: interactive task starvation

On Fri, 2006-03-17 at 10:06 +0100, Ingo Molnar wrote:

> yep, i think that's a good idea. In the worst case the starvation
> timeout should kick in.

(I didn't want to hijack that thread ergo name change)

Speaking of the starvation timeout...

I'm beginning to wonder if it might not be a good idea to always have an
expired_timestamp to ensure that there is a limit to how long
interactive tasks can starve _each other_. Yesterday, I ran some tests
with apache, and ended up waiting for over 3 minutes for a
netstat|grep :81|wc -l to finish when competing with 10 copies of httpd. The
problem with the expired_timestamp is that if nobody is already
expired, and if no non-interactive task exists, there's certainly no
expired_timestamp, and thus no starvation limit.

There are other ways to cure 'interactive starvation', but forcing an
array switch if a non-interactive task hasn't run for pick-a-number time
is the easiest.

-Mike

(yup, folks would certainly feel it, and would _very_ likely gripe, so
it would probably have to be configurable)

2006-03-17 12:38:39

by Con Kolivas

[permalink] [raw]
Subject: [PATCH] sched: activate SCHED BATCH expired

On Friday 17 March 2006 20:06, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > Thinking some more on this I wonder if SCHED_BATCH isn't a strong
> > enough scheduling hint if it's not suitable for such an application.
> > Ingo do you think we could make SCHED_BATCH tasks always wake up on
> > the expired array?
>
> yep, i think that's a good idea. In the worst case the starvation
> timeout should kick in.

Ok here's a patch that does exactly that. Without an "inline" hint, gcc 4.1.0
chooses not to inline this function. I can't say I have a strong opinion
about whether it should be inlined or not (93 bytes larger inlined), so I've
decided not to given the current trend.

Cheers,
Con
---
To increase the strength of SCHED_BATCH as a scheduling hint we can activate
batch tasks on the expired array since by definition they are latency
insensitive tasks.

Signed-off-by: Con Kolivas <[email protected]>

---
include/linux/sched.h | 1 +
kernel/sched.c | 9 ++++++---
2 files changed, 7 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc6-mm1.orig/include/linux/sched.h 2006-03-13 20:12:22.000000000 +1100
+++ linux-2.6.16-rc6-mm1/include/linux/sched.h 2006-03-17 23:08:31.000000000 +1100
@@ -485,6 +485,7 @@ struct signal_struct {
#define MAX_PRIO (MAX_RT_PRIO + 40)

#define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO))
+#define batch_task(p) (unlikely((p)->policy == SCHED_BATCH))

/*
* Some day this will be a full-fledged user tracking system..
Index: linux-2.6.16-rc6-mm1/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/sched.c 2006-03-13 20:12:15.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/sched.c 2006-03-17 23:08:12.000000000 +1100
@@ -737,9 +737,12 @@ static inline void dec_nr_running(task_t
/*
* __activate_task - move a task to the runqueue.
*/
-static inline void __activate_task(task_t *p, runqueue_t *rq)
+static void __activate_task(task_t *p, runqueue_t *rq)
{
- enqueue_task(p, rq->active);
+ if (batch_task(p))
+ enqueue_task(p, rq->expired);
+ else
+ enqueue_task(p, rq->active);
inc_nr_running(p, rq);
}

@@ -758,7 +761,7 @@ static int recalc_task_prio(task_t *p, u
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;

- if (unlikely(p->policy == SCHED_BATCH))
+ if (batch_task(p))
sleep_time = 0;
else {
if (__sleep_time > NS_MAX_SLEEP_AVG)

2006-03-17 13:09:23

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired


* Con Kolivas <[email protected]> wrote:

> To increase the strength of SCHED_BATCH as a scheduling hint we can activate
> batch tasks on the expired array since by definition they are latency
> insensitive tasks.
>
> Signed-off-by: Con Kolivas <[email protected]>

Acked-by: Ingo Molnar <[email protected]>

Ingo

2006-03-17 13:26:22

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired

Con Kolivas wrote:

>
> Ok here's a patch that does exactly that. Without an "inline" hint, gcc 4.1.0
> chooses not to inline this function. I can't say I have a strong opinion
> about whether it should be inlined or not (93 bytes larger inlined), so I've
> decided not to given the current trend.
>

Sigh, sacrifice for the common case! :P


> Index: linux-2.6.16-rc6-mm1/kernel/sched.c
> ===================================================================
> --- linux-2.6.16-rc6-mm1.orig/kernel/sched.c 2006-03-13 20:12:15.000000000 +1100
> +++ linux-2.6.16-rc6-mm1/kernel/sched.c 2006-03-17 23:08:12.000000000 +1100
> @@ -737,9 +737,12 @@ static inline void dec_nr_running(task_t
> /*
> * __activate_task - move a task to the runqueue.
> */
> -static inline void __activate_task(task_t *p, runqueue_t *rq)
> +static void __activate_task(task_t *p, runqueue_t *rq)
> {
> - enqueue_task(p, rq->active);
> + if (batch_task(p))
> + enqueue_task(p, rq->expired);
> + else
> + enqueue_task(p, rq->active);
> inc_nr_running(p, rq);
> }
>

I prefer:

prio_array_t *target = rq->active;
if (batch_task(p))
target = rq->expired;
enqueue_task(p, target);

Because gcc can use things like predicated instructions for it.
But perhaps it is smart enough these days to recognise this?
At least in the past I have seen it start using cmov after doing
such a conversion.

At any rate, I think it looks nicer as well. IMO, of course.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-17 13:36:27

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired

On Saturday 18 March 2006 00:26, Nick Piggin wrote:
> Con Kolivas wrote:
> > -static inline void __activate_task(task_t *p, runqueue_t *rq)
> > +static void __activate_task(task_t *p, runqueue_t *rq)
> > {
> > - enqueue_task(p, rq->active);
> > + if (batch_task(p))
> > + enqueue_task(p, rq->expired);
> > + else
> > + enqueue_task(p, rq->active);
> > inc_nr_running(p, rq);
> > }
>
> I prefer:
>
> prio_array_t *target = rq->active;
> if (batch_task(p))
> target = rq->expired;
> enqueue_task(p, target);
>
> Because gcc can use things like predicated instructions for it.
> But perhaps it is smart enough these days to recognise this?
> At least in the past I have seen it start using cmov after doing
> such a conversion.
>
> At any rate, I think it looks nicer as well. IMO, of course.

Well on my one boring architecture here is a before and after, gcc 4.1.0 with
optimise for size kernel config:
0xb01127da <__activate_task+0>: push %ebp
0xb01127db <__activate_task+1>: mov %esp,%ebp
0xb01127dd <__activate_task+3>: push %esi
0xb01127de <__activate_task+4>: push %ebx
0xb01127df <__activate_task+5>: mov %eax,%esi
0xb01127e1 <__activate_task+7>: mov %edx,%ebx
0xb01127e3 <__activate_task+9>: cmpl $0x3,0x58(%eax)
0xb01127e7 <__activate_task+13>: jne 0xb01127ee <__activate_task+20>
0xb01127e9 <__activate_task+15>: mov 0x44(%edx),%edx
0xb01127ec <__activate_task+18>: jmp 0xb01127f1 <__activate_task+23>
0xb01127ee <__activate_task+20>: mov 0x40(%edx),%edx
0xb01127f1 <__activate_task+23>: mov %esi,%eax
0xb01127f3 <__activate_task+25>: call 0xb01124bb <enqueue_task>
0xb01127f8 <__activate_task+30>: incl 0x8(%ebx)
0xb01127fb <__activate_task+33>: mov 0x18(%esi),%eax
0xb01127fe <__activate_task+36>: add %eax,0xc(%ebx)
0xb0112801 <__activate_task+39>: pop %ebx
0xb0112802 <__activate_task+40>: pop %esi
0xb0112803 <__activate_task+41>: pop %ebp
0xb0112804 <__activate_task+42>: ret

Your version:
0xb01127da <__activate_task+0>: push %ebp
0xb01127db <__activate_task+1>: mov %esp,%ebp
0xb01127dd <__activate_task+3>: push %esi
0xb01127de <__activate_task+4>: push %ebx
0xb01127df <__activate_task+5>: mov %eax,%esi
0xb01127e1 <__activate_task+7>: mov %edx,%ebx
0xb01127e3 <__activate_task+9>: mov 0x40(%edx),%edx
0xb01127e6 <__activate_task+12>: cmpl $0x3,0x58(%eax)
0xb01127ea <__activate_task+16>: jne 0xb01127ef <__activate_task+21>
0xb01127ec <__activate_task+18>: mov 0x44(%ebx),%edx
0xb01127ef <__activate_task+21>: mov %esi,%eax
0xb01127f1 <__activate_task+23>: call 0xb01124bb <enqueue_task>
0xb01127f6 <__activate_task+28>: incl 0x8(%ebx)
0xb01127f9 <__activate_task+31>: mov 0x18(%esi),%eax
0xb01127fc <__activate_task+34>: add %eax,0xc(%ebx)
0xb01127ff <__activate_task+37>: pop %ebx
0xb0112800 <__activate_task+38>: pop %esi
0xb0112801 <__activate_task+39>: pop %ebp
0xb0112802 <__activate_task+40>: ret

I'm not attached to the style, just the feature. If you think it's warranted
I'll change it.

Cheers,
Con

2006-03-17 13:46:12

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired

Con Kolivas wrote:
> On Saturday 18 March 2006 00:26, Nick Piggin wrote:
>
>>Con Kolivas wrote:
>>
>>>-static inline void __activate_task(task_t *p, runqueue_t *rq)
>>>+static void __activate_task(task_t *p, runqueue_t *rq)
>>> {
>>>- enqueue_task(p, rq->active);
>>>+ if (batch_task(p))
>>>+ enqueue_task(p, rq->expired);
>>>+ else
>>>+ enqueue_task(p, rq->active);
>>> inc_nr_running(p, rq);
>>> }
>>
>>I prefer:
>>
>> prio_array_t *target = rq->active;
>> if (batch_task(p))
>> target = rq->expired;
>> enqueue_task(p, target);
>>
>>Because gcc can use things like predicated instructions for it.
>>But perhaps it is smart enough these days to recognise this?
>>At least in the past I have seen it start using cmov after doing
>>such a conversion.
>>
>>At any rate, I think it looks nicer as well. IMO, of course.
>
>
> Well on my one boring architecture here is a before and after, gcc 4.1.0 with
> optimise for size kernel config:

> I'm not attached to the style, just the feature. If you think it's warranted
> I'll change it.
>

I guess it isn't doing the cmov because it doesn't want to do the
extra load in the common case, which is fair enough (are you compiling
for a pentiumpro+, without generic x86 support? what about if you
turn off optimise for size?)

At least other architectures might be able to make better use of it,
and I agree even for i386 the code looks better (and slightly smaller).

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-17 13:47:41

by Andreas Mohr

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] sched: activate SCHED BATCH expired

Hi,

On Sat, Mar 18, 2006 at 12:36:10AM +1100, Con Kolivas wrote:
> I'm not attached to the style, just the feature. If you think it's warranted
> I'll change it.

Seconded.

An even nicer way (this solution seems somewhat asymmetric) than

prio_array_t *target = rq->active;
if (batch_task(p))
target = rq->expired;
enqueue_task(p, target);

may be

prio_array_t *target;
if (batch_task(p))
target = rq->expired;
else
target = rq->active;
enqueue_task(p, target);

and thus (but this coding style may be considered overloaded):

prio_array_t *target;
target = batch_task(p) ?
rq->expired : rq->active;
enqueue_task(p, target);


But this discussion is clearly growing out of control now ;)

Andreas Mohr

2006-03-17 13:51:38

by Nick Piggin

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired

Nick Piggin wrote:
> Con Kolivas wrote:

>> I'm not attached to the style, just the feature. If you think it's
>> warranted I'll change it.
>>
>

> At least other architectures might be able to make better use of it,
> and I agree even for i386 the code looks better (and slightly smaller).
>

s/I agree/I think/

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-17 13:59:46

by Con Kolivas

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] sched: activate SCHED BATCH expired

On Saturday 18 March 2006 00:47, Andreas Mohr wrote:
> Hi,
>
> On Sat, Mar 18, 2006 at 12:36:10AM +1100, Con Kolivas wrote:
> > I'm not attached to the style, just the feature. If you think it's
> > warranted I'll change it.
>
> Seconded.
>
> An even nicer way (this solution seems somewhat asymmetric) than
>
> prio_array_t *target = rq->active;
> if (batch_task(p))
> target = rq->expired;
> enqueue_task(p, target);
>
> may be
>
> prio_array_t *target;
> if (batch_task(p))
> target = rq->expired;
> else
> target = rq->active;
> enqueue_task(p, target);

Well I hadn't quite gone to bed so I tried yours for grins too and
interestingly it produced code identical to my original version.

> But this discussion is clearly growing out of control now ;)

I prefer a month's worth of this over a single more email about
cd-fscking-record's amazing perfection.

Cheers,
Con

2006-03-17 14:06:47

by Nick Piggin

[permalink] [raw]
Subject: Re: [ck] Re: [PATCH] sched: activate SCHED BATCH expired

Andreas Mohr wrote:
> Hi,
>
> On Sat, Mar 18, 2006 at 12:36:10AM +1100, Con Kolivas wrote:
>
>>I'm not attached to the style, just the feature. If you think it's warranted
>>I'll change it.
>
>
> Seconded.
>
> An even nicer way (this solution seems somewhat asymmetric) than
>
> prio_array_t *target = rq->active;
> if (batch_task(p))
> target = rq->expired;
> enqueue_task(p, target);
>
> may be
>
> prio_array_t *target;
> if (batch_task(p))
> target = rq->expired;
> else
> target = rq->active;
> enqueue_task(p, target);
>

It doesn't actually generate the same code here (I guess it is good
that gcc gives us this control).

I think my way is (ever so slightly) better because it gets the load
going earlier and comprises one less conditional jump (admittedly in
the slowpath). You'd probably never be able to measure a difference
between any of the variants, however ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-03-17 14:11:18

by Con Kolivas

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired

On Saturday 18 March 2006 00:46, Nick Piggin wrote:
> I guess it isn't doing the cmov because it doesn't want to do the
> extra load in the common case, which is fair enough (are you compiling
> for a pentiumpro+, without generic x86 support?

For pentium4 with no generic support.

> what about if you
> turn off optimise for size?)

Dunno, sleep is taking me...

> At least other architectures might be able to make better use of it,
> and I agree even for i386 the code looks better (and slightly smaller).

Good enough for me. Here's a respin, thanks!

Cheers,
Con
---
To increase the strength of SCHED_BATCH as a scheduling hint we can activate
batch tasks on the expired array since by definition they are latency
insensitive tasks.

Signed-off-by: Con Kolivas <[email protected]>

---
include/linux/sched.h | 1 +
kernel/sched.c | 10 +++++++---
2 files changed, 8 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc6-mm1/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc6-mm1.orig/include/linux/sched.h 2006-03-13 20:12:22.000000000 +1100
+++ linux-2.6.16-rc6-mm1/include/linux/sched.h 2006-03-17 23:08:31.000000000 +1100
@@ -485,6 +485,7 @@ struct signal_struct {
#define MAX_PRIO (MAX_RT_PRIO + 40)

#define rt_task(p) (unlikely((p)->prio < MAX_RT_PRIO))
+#define batch_task(p) (unlikely((p)->policy == SCHED_BATCH))

/*
* Some day this will be a full-fledged user tracking system..
Index: linux-2.6.16-rc6-mm1/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-mm1.orig/kernel/sched.c 2006-03-13 20:12:15.000000000 +1100
+++ linux-2.6.16-rc6-mm1/kernel/sched.c 2006-03-18 01:05:02.000000000 +1100
@@ -737,9 +737,13 @@ static inline void dec_nr_running(task_t
/*
* __activate_task - move a task to the runqueue.
*/
-static inline void __activate_task(task_t *p, runqueue_t *rq)
+static void __activate_task(task_t *p, runqueue_t *rq)
{
- enqueue_task(p, rq->active);
+ prio_array_t *target = rq->active;
+
+ if (batch_task(p))
+ target = rq->expired;
+ enqueue_task(p, target);
inc_nr_running(p, rq);
}

@@ -758,7 +762,7 @@ static int recalc_task_prio(task_t *p, u
unsigned long long __sleep_time = now - p->timestamp;
unsigned long sleep_time;

- if (unlikely(p->policy == SCHED_BATCH))
+ if (batch_task(p))
sleep_time = 0;
else {
if (__sleep_time > NS_MAX_SLEEP_AVG)

2006-03-17 15:02:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] sched: activate SCHED BATCH expired


* Con Kolivas <[email protected]> wrote:

> Good enough for me. Here's a respin, thanks!

> Signed-off-by: Con Kolivas <[email protected]>

Still-Acked-by: Ingo Molnar <[email protected]>

Ingo

2006-03-17 17:13:46

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Fri, 2006-03-17 at 11:46 +0100, Mike Galbraith wrote:
> On Fri, 2006-03-17 at 10:06 +0100, Ingo Molnar wrote:
>
> > yep, i think that's a good idea. In the worst case the starvation
> > timeout should kick in.
>
> (I didn't want to hijack that thread ergo name change)
>
> Speaking of the starvation timeout...
>

<snip day late $ short idea>

Problem solved. I now know why the starvation logic doesn't work.
Wakeups. In the face of 10+ copies of httpd constantly waking up, it
seems it just takes ages to get around to switching arrays.

With the (urp) patch below, I now get...

[root]:# time netstat|grep :81|wc -l
1648

real 0m27.735s
user 0m0.158s
sys 0m0.111s
[root]:# time netstat|grep :81|wc -l
1817

real 0m13.550s
user 0m0.121s
sys 0m0.186s
[root]:# time netstat|grep :81|wc -l
1641

real 0m17.022s
user 0m0.132s
sys 0m0.143s
[root]:#

which certainly isn't pleasant, but it beats the heck out of minutes.

-Mike

--- kernel/sched.c.org 2006-03-17 14:48:35.000000000 +0100
+++ kernel/sched.c 2006-03-17 17:41:25.000000000 +0100
@@ -662,11 +662,30 @@
}

/*
+ * We place interactive tasks back into the active array, if possible.
+ *
+ * To guarantee that this does not starve expired tasks we ignore the
+ * interactivity of a task if the first expired task had to wait more
+ * than a 'reasonable' amount of time. This deadline timeout is
+ * load-dependent, as the frequency of array switched decreases with
+ * increasing number of running tasks. We also ignore the interactivity
+ * if a better static_prio task has expired:
+ */
+#define EXPIRED_STARVING(rq) \
+ ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
+ (jiffies - (rq)->expired_timestamp >= \
+ STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
+ ((rq)->curr->static_prio > (rq)->best_expired_prio))
+
+/*
* __activate_task - move a task to the runqueue.
*/
static inline void __activate_task(task_t *p, runqueue_t *rq)
{
- enqueue_task(p, rq->active);
+ prio_array_t *array = rq->active;
+ if (unlikely(EXPIRED_STARVING(rq)))
+ array = rq->expired;
+ enqueue_task(p, array);
rq->nr_running++;
}

@@ -2461,22 +2480,6 @@
}

/*
- * We place interactive tasks back into the active array, if possible.
- *
- * To guarantee that this does not starve expired tasks we ignore the
- * interactivity of a task if the first expired task had to wait more
- * than a 'reasonable' amount of time. This deadline timeout is
- * load-dependent, as the frequency of array switched decreases with
- * increasing number of running tasks. We also ignore the interactivity
- * if a better static_prio task has expired:
- */
-#define EXPIRED_STARVING(rq) \
- ((STARVATION_LIMIT && ((rq)->expired_timestamp && \
- (jiffies - (rq)->expired_timestamp >= \
- STARVATION_LIMIT * ((rq)->nr_running) + 1))) || \
- ((rq)->curr->static_prio > (rq)->best_expired_prio))
-
-/*
* Account user cpu time to a process.
* @p: the process that the cpu time gets accounted to
* @hardirq_offset: the offset to subtract from hardirq_count()


2006-03-20 07:06:59

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Fri, 2006-03-17 at 18:15 +0100, Mike Galbraith wrote:
> Problem solved. I now know why the starvation logic doesn't work.
> Wakeups. In the face of 10+ copies of httpd constantly waking up, it
> seems it just takes ages to get around to switching arrays.
>
> With the (urp) patch below, I now get...
>
> [root]:# time netstat|grep :81|wc -l
> 1648
>
> real 0m27.735s
> user 0m0.158s
> sys 0m0.111s
> [root]:# time netstat|grep :81|wc -l
> 1817
>
> real 0m13.550s
> user 0m0.121s
> sys 0m0.186s
> [root]:# time netstat|grep :81|wc -l
> 1641
>
> real 0m17.022s
> user 0m0.132s
> sys 0m0.143s
> [root]:#

For those interested in these kind of things, here are the numbers for
2.6.16-rc6-mm2 with my [tarball] throttle patches applied...

[root]:# time netstat|grep :81|wc -l
1681

real 0m1.525s
user 0m0.141s
sys 0m0.136s
[root]:# time netstat|grep :81|wc -l
1491

real 0m0.356s
user 0m0.130s
sys 0m0.114s
[root]:# time netstat|grep :81|wc -l
1527

real 0m0.343s
user 0m0.129s
sys 0m0.114s
[root]:# time netstat|grep :81|wc -l
1568

real 0m0.512s
user 0m0.112s
sys 0m0.138s

...while running with the same apache loadavg of over 10, and tunables
set to server mode (0,0).

<plug>
Even a desktop running with these settings is so interactive that I
could play a game of Maelstrom (asteroids like thing) while doing a make
-j30 in slow nfs mount and barely feel it. In a local filesystem, I
couldn't feel it at all, so I added a thud 3, irman2 and a bonnie -s 2047
for good measure. Try that with stock :)
</plug>


Attachments:
throttle-V22-2.6.16-rc6-mm2.tar.gz (7.04 kB)

2006-03-20 10:25:03

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Mike Galbraith <[email protected]> wrote:

> <plug>
> Even a desktop running with these settings is so interactive that I
> could play a game of Maelstrom (asteroids like thing) while doing a
> make -j30 in slow nfs mount and barely feel it. In a local
> filesystem, I couldn't feel it at all, so I added a thud 3, irman2 and
> a bonnie -s 2047 for good measure. Try that with stock :)
> </plug>

great! Please make sure all the patches make their way into -mm. We
definitely want to try this for v2.6.17. Increasing starvation
resistance _and_ interactivity via the same patchset is a rare feat ;-)

Acked-by: Ingo Molnar <[email protected]>

Ingo

2006-03-21 06:47:40

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

Hi Mike,

On Mon, Mar 20, 2006 at 08:09:13AM +0100, Mike Galbraith wrote:
(...)
> For those interested in these kind of things, here are the numbers for
> 2.6.16-rc6-mm2 with my [tarball] throttle patches applied...
>
> [root]:# time netstat|grep :81|wc -l
> 1681
>
> real 0m1.525s
> user 0m0.141s
> sys 0m0.136s
> [root]:# time netstat|grep :81|wc -l
> 1491
>
> real 0m0.356s
> user 0m0.130s
> sys 0m0.114s
> [root]:# time netstat|grep :81|wc -l
> 1527
>
> real 0m0.343s
> user 0m0.129s
> sys 0m0.114s
> [root]:# time netstat|grep :81|wc -l
> 1568
>
> real 0m0.512s
> user 0m0.112s
> sys 0m0.138s
>
> ...while running with the same apache loadavg of over 10, and tunables
> set to server mode (0,0).
>
> <plug>
> Even a desktop running with these settings is so interactive that I
> could play a game of Maelstrom (asteroids like thing) while doing a make
> -j30 in slow nfs mount and barely feel it. In a local filesystem, I
> couldn't feel it at all, so I added a thud 3, irman2 and a bonnie -s 2047
> for good measure. Try that with stock :)
> </plug>

Very good job!
I told Grant in a private email that I felt confident the problem would
quickly be solved now that someone familiar with the scheduler could
reliably reproduce it. Your numbers look excellent, I'm willing to test.
Could you remind us what kernel and what patches we need to apply to
try the same, please ?

Cheers,
Willy

2006-03-21 07:51:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 07:47 +0100, Willy Tarreau wrote:
> Hi Mike,

Greetings!

> On Mon, Mar 20, 2006 at 08:09:13AM +0100, Mike Galbraith wrote:
> > real 0m0.512s
> > user 0m0.112s
> > sys 0m0.138s
> >
> > ...while running with the same apache loadavg of over 10, and tunables
> > set to server mode (0,0).
...

> Very good job!
> I told Grant in a private email that I felt confident the problem would
> quickly be solved now that someone familiar with the scheduler could
> reliably reproduce it. Your numbers look excellent, I'm willing to test.
> Could you remind us what kernel and what patches we need to apply to
> try the same, please ?

You bet. I'm most happy to have someone try it other than me :)

Apply the patches from the attached tarball in the obvious order to
2.6.16-rc6-mm2. As delivered, its knobs are set up for a desktop box.
For a server, you'll probably want maximum starvation resistance, so
echo 0 > /proc/sys/kernel/grace_g1 and grace_g2. This will set the time
a task can exceed expected cpu (based upon sleep_avg) to zero seconds,
ie immediate throttling upon detection. It will also disable some
interactivity specific code in the scheduler.
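Concretely, server mode as described above amounts to (paths as given
in this mail; the sysctls only exist with the throttle patches applied):

```shell
# maximum starvation resistance: throttle immediately on detection
echo 0 > /proc/sys/kernel/grace_g1
echo 0 > /proc/sys/kernel/grace_g2
```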

If you want to fiddle with the knobs, grace_g1 is the number of CPU
seconds a new task is authorized to run completely free of any
intervention... startup in a desktop environment. grace_g2 is the
amount of CPU seconds a well behaved task can store for later usage.
With the throttling patch, an interactive task must earn the right to
exceed expected cpu by performing within expectations. The longer the
task behaves, the more 'good karma' it earns. This allows interactive
tasks to do a burst of activity, but the user determines how long that
burst==starvation is authorized. Tasks which just use as much cpu as
they can get run headlong into the throttle.

-Mike


Attachments:
throttle-V23-2.6.16-rc6-mm2.tar.gz (7.09 kB)

2006-03-21 09:14:14

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation


On Tue, Mar 21, 2006 at 08:51:38AM +0100, Mike Galbraith wrote:
> On Tue, 2006-03-21 at 07:47 +0100, Willy Tarreau wrote:
> > Hi Mike,
>
> Greetings!

Thanks for the details,
I'll try to find some time to test your code quickly. If this fixes this
long standing problem, we should definitely try to get it into 2.6.17 !

Cheers,
Willy

2006-03-21 09:16:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Willy Tarreau <[email protected]> wrote:

>
> On Tue, Mar 21, 2006 at 08:51:38AM +0100, Mike Galbraith wrote:
> > On Tue, 2006-03-21 at 07:47 +0100, Willy Tarreau wrote:
> > > Hi Mike,
> >
> > Greetings!
>
> Thanks for the details,
> I'll try to find some time to test your code quickly. If this fixes this
> long standing problem, we should definitely try to get it into 2.6.17 !

the time window is quickly closing for that to happen though.

Ingo

2006-03-21 11:16:23

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, Mar 21, 2006 at 10:14:22AM +0100, Ingo Molnar wrote:
>
> * Willy Tarreau <[email protected]> wrote:
>
> >
> > On Tue, Mar 21, 2006 at 08:51:38AM +0100, Mike Galbraith wrote:
> > > On Tue, 2006-03-21 at 07:47 +0100, Willy Tarreau wrote:
> > > > Hi Mike,
> > >
> > > Greetings!
> >
> > Thanks for the details,
> > I'll try to find some time to test your code quickly. If this fixes this
> > long standing problem, we should definitely try to get it into 2.6.17 !
>
> the time window is quickly closing for that to happen though.

Ingo, Mike,

it's a great day :-)

Right now, I'm typing this mail from my notebook which has 8 instances of
my exploit running in background. Previously, 4 of them were enough on this
machine to create pauses of up to 31 seconds. Right now, I can type normally,
and I simply can say that my exploit has no effect anymore! It's just
consuming CPU and nothing else. I also tried to write 0 to grace_g[12] and
I find it even more responsive with 0 in those values. I've not had time to
do more extensive tests, but I can assure you that the problem is clearly
solved for me. I'd like Grant to test ssh on his firewall with it too.

Congratulations!
Willy

2006-03-21 11:21:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Willy Tarreau <[email protected]> wrote:

> On Tue, Mar 21, 2006 at 10:14:22AM +0100, Ingo Molnar wrote:
> >
> > * Willy Tarreau <[email protected]> wrote:
> >
> > >
> > > On Tue, Mar 21, 2006 at 08:51:38AM +0100, Mike Galbraith wrote:
> > > > On Tue, 2006-03-21 at 07:47 +0100, Willy Tarreau wrote:
> > > > > Hi Mike,
> > > >
> > > > Greetings!
> > >
> > > Thanks for the details,
> > > I'll try to find some time to test your code quickly. If this fixes this
> > > long standing problem, we should definitely try to get it into 2.6.17 !
> >
> > the time window is quickly closing for that to happen though.
>
> Ingo, Mike,
>
> it's a great day :-)
>
> Right now, I'm typing this mail from my notebook which has 8 instances
> of my exploit running in background. Previously, 4 of them were enough
> on this machine to create pauses of up to 31 seconds. Right now, I can
> type normally, and I simply can say that my exploit has no effect
> anymore! It's just consuming CPU and nothing else. I also tried to
> write 0 to grace_g[12] and I find it even more responsive with 0 in
> those values. I've not had time to do more extensive tests, but I can
> assure you that the problem is clearly solved for me. I'd like Grant
> to test ssh on his firewall with it too.

great work by Mike! One detail: i'd like there to be just one default
throttling value, i.e. no grace_g tunables [so that we have just one
default scheduler behavior]. Is the default grace_g[12] setting good
enough for your workload?

Ingo

2006-03-21 11:54:24

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Tuesday 21 March 2006 22:18, Ingo Molnar wrote:
> great work by Mike! One detail: i'd like there to be just one default
> throttling value, i.e. no grace_g tunables [so that we have just one
> default scheduler behavior]. Is the default grace_g[12] setting good
> enough for your workload?

I agree. If anything is required, a simple on/off tunable makes much more
sense. Much like I suggested ages ago with an "interactive" switch which was
rather unpopular when I first suggested it. Perhaps my marketing was wrong.
Oh well.

Cheers,
Con

2006-03-21 12:11:26

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 12:18 +0100, Ingo Molnar wrote:

> great work by Mike! One detail: i'd like there to be just one default
> throttling value, i.e. no grace_g tunables [so that we have just one
> default scheduler behavior]. Is the default grace_g[12] setting good
> enough for your workload?

I can make the knobs compile-time so we don't see random behavior
reports, but I don't think they can be totally eliminated. Would that
be sufficient?

If so, the numbers as delivered should be fine for desktop boxen I
think. People who are building custom kernels can bend to fit as
always.

-Mike

2006-03-21 12:59:21

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, Mar 21, 2006 at 01:07:58PM +0100, Mike Galbraith wrote:
> On Tue, 2006-03-21 at 12:18 +0100, Ingo Molnar wrote:
>
> > great work by Mike! One detail: i'd like there to be just one default
> > throttling value, i.e. no grace_g tunables [so that we have just one
> > default scheduler behavior]. Is the default grace_g[12] setting good
> > enough for your workload?

The default values are infinitely better than mainline, but it is still
a huge improvement to reduce them (at least grace_g2) :

default : grace_g1=10, grace_g2=14400, loadavg oscillating between 7 and 12 :

willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m5.759s
user 0m0.028s
sys 0m0.008s
willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m3.476s
user 0m0.020s
sys 0m0.016s
willy@wtap:~$

I can still observe some occasional pauses of 1 to 3 seconds (once
to four times per minute).

- grace_g2 set to 0, load converges to a stable 8 :

willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m0.441s
user 0m0.036s
sys 0m0.004s
willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m0.400s
user 0m0.032s
sys 0m0.008s

I can still observe some rare cases of 1 second pauses (once or twice per
minute).

- grace_g2 and grace_g1 set to zero :

willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m0.214s
user 0m0.028s
sys 0m0.008s
willy@wtap:~$ time ls -la /data/src/tmp/|wc
2271 18250 212211

real 0m0.193s
user 0m0.032s
sys 0m0.008s

=> I never observe any pause, and the numbers above sometimes even
get lower (around 75 ms).

I have also tried injecting traffic on my proxy, and at 16000 hits/s,
it does not impact the system's overall responsiveness, whatever the (g1,g2) values.

> I can make the knobs compile time so we don't see random behavior
> reports, but I don't think they can be totally eliminated. Would that
> be sufficient?
>
> If so, the numbers as delivered should be fine for desktop boxen I
> think. People who are building custom kernels can bend to fit as
> always.

That would suit me perfectly. I think I would set them both to zero.
It's not clear to me what workload they can help; it seems that they
try to allow occasionally unfair scheduling.

> -Mike

Cheers,
Willy

2006-03-21 13:11:55

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 22:53 +1100, Con Kolivas wrote:
> On Tuesday 21 March 2006 22:18, Ingo Molnar wrote:
> > great work by Mike! One detail: i'd like there to be just one default
> > throttling value, i.e. no grace_g tunables [so that we have just one
> > default scheduler behavior]. Is the default grace_g[12] setting good
> > enough for your workload?
>
> I agree. If anything is required, a simple on/off tunable makes much more
> sense. Much like I suggested ages ago with an "interactive" switch which was
> rather unpopular when I first suggested it.

Let me try to explain why on/off is not sufficient.

You notice how Willy said that his notebook is more responsive with
tunables set to 0,0? That's important, because it's absolutely true...
depending what you're doing. Setting tunables to 0,0 cuts off the idle
sleep logic, and the sleep_avg divisor - both of which were put there
specifically for interactivity - and returns the scheduler to more or
less original O(1) scheduler. You and I both know that these are most
definitely needed in a Desktop environment. For instance, if Willy
starts editing code in X, and scrolls while something is running in the
background, he'll suddenly say hey, maybe this _ain't_ more responsive,
because all of a sudden the starvation added with the interactivity
logic will be sorely missed as my throttle wrings X's neck.

How long should Willy be able to scroll without feeling the background,
and how long should Apache be able to starve his shell. They are one
and the same, and I can't say, because I'm not Willy. I don't know how
to get there from here without tunables. Picking defaults is one thing,
but I don't know how to make it one-size-fits-all. For the general
case, the values delivered will work fine. For the apache case, they
absolutely 100% guaranteed will not.

-Mike

2006-03-21 13:13:42

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> On Tue, 2006-03-21 at 22:53 +1100, Con Kolivas wrote:
> > On Tuesday 21 March 2006 22:18, Ingo Molnar wrote:
> > > great work by Mike! One detail: i'd like there to be just one default
> > > throttling value, i.e. no grace_g tunables [so that we have just one
> > > default scheduler behavior]. Is the default grace_g[12] setting good
> > > enough for your workload?
> >
> > I agree. If anything is required, a simple on/off tunable makes much more
> > sense. Much like I suggested ages ago with an "interactive" switch which
> > was rather unpopular when I first suggested it.
>
> Let me try to explain why on/off is not sufficient.
>
> You notice how Willy said that his notebook is more responsive with
> tunables set to 0,0? That's important, because it's absolutely true...
> depending what you're doing. Setting tunables to 0,0 cuts off the idle
> sleep logic, and the sleep_avg divisor - both of which were put there
> specifically for interactivity - and returns the scheduler to more or
> less original O(1) scheduler. You and I both know that these are most
> definitely needed in a Desktop environment. For instance, if Willy
> starts editing code in X, and scrolls while something is running in the
> background, he'll suddenly say hey, maybe this _ain't_ more responsive,
> because all of a sudden the starvation added with the interactivity
> logic will be sorely missed as my throttle wrings X's neck.
>
> How long should Willy be able to scroll without feeling the background,
> and how long should Apache be able to starve his shell. They are one
> and the same, and I can't say, because I'm not Willy. I don't know how
> to get there from here without tunables. Picking defaults is one thing,
> but I don't know how to make it one-size-fits-all. For the general
> case, the values delivered will work fine. For the apache case, they
> absolutely 100% guaranteed will not.

So how do you propose we tune such a beast then? Apache users will use off,
everyone else will have no idea but to use the defaults.

Cheers,
Con

2006-03-21 13:24:16

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 13:59 +0100, Willy Tarreau wrote:
> On Tue, Mar 21, 2006 at 01:07:58PM +0100, Mike Galbraith wrote:

> > I can make the knobs compile time so we don't see random behavior
> > reports, but I don't think they can be totally eliminated. Would that
> > be sufficient?
> >
> > If so, the numbers as delivered should be fine for desktop boxen I
> > think. People who are building custom kernels can bend to fit as
> > always.
>
> That would suit me perfectly. I think I would set them both to zero.
> It's not clear to me what workload they can help, it seems that they
> try to allow a sometimes unfair scheduling.

Correct. Massively unfair scheduling is what interactivity requires.

-Mike

2006-03-21 13:33:19

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 00:13 +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> > How long should Willy be able to scroll without feeling the background,
> > and how long should Apache be able to starve his shell. They are one
> > and the same, and I can't say, because I'm not Willy. I don't know how
> > to get there from here without tunables. Picking defaults is one thing,
> > but I don't know how to make it one-size-fits-all. For the general
> > case, the values delivered will work fine. For the apache case, they
> > absolutely 100% guaranteed will not.
>
> So how do you propose we tune such a beast then? Apache users will use off,
> everyone else will have no idea but to use the defaults.

Set for desktop, which is intended to mostly emulate what we have right
now, which most people are quite happy with. The throttle will still
nail most of the corner cases, and the other adjustments nail the
majority of what's left. That leaves the hefty server type loads as
what certainly will require tuning. They always need tuning.

-Mike

2006-03-21 13:38:18

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 00:33, Mike Galbraith wrote:
> On Wed, 2006-03-22 at 00:13 +1100, Con Kolivas wrote:
> > On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> > > How long should Willy be able to scroll without feeling the background,
> > > and how long should Apache be able to starve his shell. They are one
> > > and the same, and I can't say, because I'm not Willy. I don't know how
> > > to get there from here without tunables. Picking defaults is one
> > > thing, but I don't know how to make it one-size-fits-all. For the
> > > general case, the values delivered will work fine. For the apache
> > > case, they absolutely 100% guaranteed will not.
> >
> > So how do you propose we tune such a beast then? Apache users will use
> > off, everyone else will have no idea but to use the defaults.
>
> Set for desktop, which is intended to mostly emulate what we have right
> now, which most people are quite happy with. The throttle will still
> nail most of the corner cases, and the other adjustments nail the
> majority of what's left. That leaves the hefty server type loads as
> what certainly will require tuning. They always need tuning.

That still sounds like just on/off to me. Default for desktop and 0,0 for
server. Am I missing something?

Cheers,
Con

2006-03-21 13:39:11

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, Mar 22, 2006 at 12:13:15AM +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> > On Tue, 2006-03-21 at 22:53 +1100, Con Kolivas wrote:
> > > On Tuesday 21 March 2006 22:18, Ingo Molnar wrote:
> > > > great work by Mike! One detail: i'd like there to be just one default
> > > > throttling value, i.e. no grace_g tunables [so that we have just one
> > > > default scheduler behavior]. Is the default grace_g[12] setting good
> > > > enough for your workload?
> > >
> > > I agree. If anything is required, a simple on/off tunable makes much more
> > > sense. Much like I suggested ages ago with an "interactive" switch which
> > > was rather unpopular when I first suggested it.
> >
> > Let me try to explain why on/off is not sufficient.
> >
> > You notice how Willy said that his notebook is more responsive with
> > tunables set to 0,0? That's important, because it's absolutely true...
> > depending what you're doing. Setting tunables to 0,0 cuts off the idle
> > sleep logic, and the sleep_avg divisor - both of which were put there
> > specifically for interactivity - and returns the scheduler to more or
> > less original O(1) scheduler. You and I both know that these are most
> > definitely needed in a Desktop environment. For instance, if Willy
> > starts editing code in X, and scrolls while something is running in the
> > background, he'll suddenly say hey, maybe this _ain't_ more responsive,
> > because all of a sudden the starvation added with the interactivity
> > logic will be sorely missed as my throttle wrings X's neck.
> >
> > How long should Willy be able to scroll without feeling the background,
> > and how long should Apache be able to starve his shell. They are one
> > and the same, and I can't say, because I'm not Willy. I don't know how
> > to get there from here without tunables. Picking defaults is one thing,
> > but I don't know how to make it one-size-fits-all. For the general
> > case, the values delivered will work fine. For the apache case, they
> > absolutely 100% guaranteed will not.
>
> So how do you propose we tune such a beast then? Apache users will use off,
> everyone else will have no idea but to use the defaults.

What you describe is exactly a case for a tunable. Different people with
different workloads want different values. Seems fair enough. After all,
we already have /proc/sys/vm/swappiness, and things like that for the same
reason: the default value should suit most users, and the ones with
knowledge and different needs can tune their system. Maybe grace_{g1,g2}
should be renamed to be more explicit, or maybe we can automatically derive
one from the other and keep only one tunable. But if both have a useful
effect, I don't see a reason for hiding them.

> Cheers,
> Con

Cheers,
Willy

2006-03-21 13:44:33

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, Mar 22, 2006 at 12:37:51AM +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 00:33, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 00:13 +1100, Con Kolivas wrote:
> > > On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> > > > How long should Willy be able to scroll without feeling the background,
> > > > and how long should Apache be able to starve his shell. They are one
> > > > and the same, and I can't say, because I'm not Willy. I don't know how
> > > > to get there from here without tunables. Picking defaults is one
> > > > thing, but I don't know how to make it one-size-fits-all. For the
> > > > general case, the values delivered will work fine. For the apache
> > > > case, they absolutely 100% guaranteed will not.
> > >
> > > So how do you propose we tune such a beast then? Apache users will use
> > > off, everyone else will have no idea but to use the defaults.
> >
> > Set for desktop, which is intended to mostly emulate what we have right
> > now, which most people are quite happy with. The throttle will still
> > nail most of the corner cases, and the other adjustments nail the
> > majority of what's left. That leaves the hefty server type loads as
> > what certainly will require tuning. They always need tuning.
>
> That still sounds like just on/off to me. Default for desktop and 0,0 for
> server. Am I missing something?

Believe it or not, there *are* people running their servers with full
graphical environments. At the place we first encountered the interactivity
problem with my load-balancer, they first installed it on a full FC2 with the
OpenGL screen saver... Needless to say, they had scaling difficulties and
trouble logging in!

Although that's a stupid thing to do, what I want to show is that even on
servers, you can't easily predict the workload. Maybe a server which often
forks processes for dedicated tasks (eg: monitoring) would prefer running
between "desktop" and "server" mode.

> Cheers,
> Con

Cheers,
Willy

2006-03-21 13:46:19

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 00:44, Willy Tarreau wrote:
> On Wed, Mar 22, 2006 at 12:37:51AM +1100, Con Kolivas wrote:
> > On Wednesday 22 March 2006 00:33, Mike Galbraith wrote:
> > > On Wed, 2006-03-22 at 00:13 +1100, Con Kolivas wrote:
> > > > On Wednesday 22 March 2006 00:10, Mike Galbraith wrote:
> > > > > How long should Willy be able to scroll without feeling the
> > > > > background, and how long should Apache be able to starve his shell.
> > > > > They are one and the same, and I can't say, because I'm not Willy.
> > > > > I don't know how to get there from here without tunables. Picking
> > > > > defaults is one thing, but I don't know how to make it
> > > > > one-size-fits-all. For the general case, the values delivered will
> > > > > work fine. For the apache case, they absolutely 100% guaranteed
> > > > > will not.
> > > >
> > > > So how do you propose we tune such a beast then? Apache users will
> > > > use off, everyone else will have no idea but to use the defaults.
> > >
> > > Set for desktop, which is intended to mostly emulate what we have right
> > > now, which most people are quite happy with. The throttle will still
> > > nail most of the corner cases, and the other adjustments nail the
> > > majority of what's left. That leaves the hefty server type loads as
> > > what certainly will require tuning. They always need tuning.
> >
> > That still sounds like just on/off to me. Default for desktop and 0,0 for
> > server. Am I missing something?
>
> Believe it or not, there *are* people running their servers with full
> graphical environments. At the place we first encountered the interactivity
> problem with my load-balancer, they first installed in on a full FC2 with
> the OpenGL screen saver... No need to say they had scaling difficulties and
> trouble to log in !
>
> Although that's a stupid thing to do, what I want to show is that even on
> servers, you can't easily predict the workload. Maybe a server which often
> forks processes for dedicated tasks (eg: monitoring) would prefer running
> between "desktop" and "server" mode.

I give up. Add as many tunables as you like in as many places as possible that
even fewer people will understand. You've already told me you'll be running
0,0.

Cheers,
Con

2006-03-21 13:48:10

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 14:38 +0100, Willy Tarreau wrote:
> What you describe is exactly a case for a tunable. Different people with
> different workloads want different values. Seems fair enough. After all,
> we already have /proc/sys/vm/swappiness, and things like that for the same
> reason : the default value should suit most users, and the ones with
> knowledge and different needs can tune their system. Maybe grace_{g1,g2}
> should be renamed to be more explicit, may be we can automatically tune
> one from the other and let only one tunable. But if both have a useful
> effect, I don't see a reason for hiding them.

I'm wide open to suggestions. I tried to make it functional, flexible,
and above all, dirt simple. Adding 'acceptable' would be cool :)

-Mike

2006-03-21 13:54:16

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 00:24, Mike Galbraith wrote:
> On Tue, 2006-03-21 at 13:59 +0100, Willy Tarreau wrote:
> > That would suit me perfectly. I think I would set them both to zero.
> > It's not clear to me what workload they can help, it seems that they
> > try to allow a sometimes unfair scheduling.
>
> Correct. Massively unfair scheduling is what interactivity requires.

To some degree, yes. Transient unfairness was all that it was supposed to do
and clearly it failed at being transient.

I would argue that good interactivity is possible with fairness by changing
the design. I won't go there (to try and push it that is), though, as the
opposition to changing the whole scheduler in place or making it pluggable
has already been voiced numerous times over, and it would kill me to try and
promote such an alternative ever again. Especially since the number of people
willing to test interactive patches and report to lkml has dropped to
virtually nil.

The yardstick for changes is now the speed of 'ls' scrolling in the console.
Where exactly are those extra cycles going I wonder? Do you think the
scheduler somehow makes the cpu idle doing nothing in that timespace? Clearly
that's not true, and userspace is making something spin unnecessarily, but
we're gonna fix that by modifying the scheduler.... sigh

Cheers,
Con

2006-03-21 14:02:27

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 00:45 +1100, Con Kolivas wrote:

> I give up. Add as many tunables as you like in as many places as possible that
> even less people will understand. You've already told me you'll be running
> 0,0.

Instead of giving up, how about look at the code and make a suggestion
for improvement? It's not an easy problem, as you're well aware.

I really don't see why you're (seemingly) getting irate. Tunables for
this are no different than tunables like CHILD_PENALTY etc etc etc. How
many casual users know those exist, much less understand them?

-Mike

2006-03-21 14:17:28

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> The yardstick for changes is now the speed of 'ls' scrolling in the console.
> Where exactly are those extra cycles going I wonder? Do you think the
> scheduler somehow makes the cpu idle doing nothing in that timespace? Clearly
> that's not true, and userspace is making something spin unnecessarily, but
> we're gonna fix that by modifying the scheduler.... sigh

*Blink*

Are you having a bad hair day??

-Mike

2006-03-21 14:18:20

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:01, Mike Galbraith wrote:
> On Wed, 2006-03-22 at 00:45 +1100, Con Kolivas wrote:
> > I give up. Add as many tunables as you like in as many places as possible
> > that even less people will understand. You've already told me you'll be
> > running 0,0.
>
> Instead of giving up, how about look at the code and make a suggestion
> for improvement? It's not an easy problem, as you're well aware.
>
> I really don't see why you're (seemingly) getting irate. Tunables for
> this are no different that tunables like CHILD_PENALTY etc etc etc. How
> many casual users know those exist, much less understand them?

Because I strongly believe that tunables for this sort of thing are wrong.
CHILD_PENALTY and friends have never been exported apart from out-of-tree
patches. These were meant to be tuned in the kernel and never exported. Ingo
didn't want *any* tunables so I'm relatively flexible with an on/off switch
which he doesn't like. I really do believe most users will only have it on or
off though.

Don't think I'm ignoring your code. You inspired me to do the original patches
3 years ago.

I have looked at your patch at length and basically what it does is variably
convert the interactive estimator from full to zero over some timeframe
selectable with your tunables. Since most users will use either full or zero, I
actually believe the same effect can be had by a tiny modification to
enable/disable the estimator anyway. This is not to deny you've done a lot of
work and confirmed that the estimator running indefinitely unthrottled is
bad. What timeframe is correct to throttle is impossible to say
though :-( Most desktop users would be quite happy with indefinite because
they basically do not hit workloads that "exploit" it. Most server/hybrid
setups are willing to sacrifice some interactivity for fairness, and the
basic active->expired design gives them enough interactivity without
virtually any boost anyway. Ironically, audio is fabulous on such a design
since it virtually never consumes a full timeslice.

So any value you place on the timeframe as the default ends up being a
compromise, and this is what Ingo is suggesting. This is similar to when
sleep_avg changed from 10 seconds to 30 seconds to 2 seconds at various
times. Luckily the non-linear decay of sleep_avg circumvents that being
relevant... but it also leads to the exact issue you're trying to fix. Once
again we're left with choosing some number, and as much as I'd like to help
since I really care about the desktop, I don't think any compromise is
correct. Just on or off.
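A standalone sketch of what such an on/off switch might mean for the estimator — the effective_prio() shape and MAX_BONUS loosely follow the 2.6 O(1) scheduler, but the sched_interactive flag and the constants here are illustrative, not Con's actual patch:

```c
#include <assert.h>

#define MAX_BONUS     10
#define MAX_SLEEP_AVG 1000 /* ms; illustrative scale, not the kernel's */

static int sched_interactive = 1; /* the hypothetical on/off sysctl */

/* With the estimator on, tasks that sleep a lot earn a priority bonus
 * of up to MAX_BONUS/2 (lower value = better priority); with it off,
 * every task gets the neutral bonus, so effective priority collapses
 * back to static priority - i.e. plain fair scheduling. */
static int effective_prio(int static_prio, int sleep_avg)
{
	int bonus = MAX_BONUS / 2; /* neutral: cancels out below */

	if (sched_interactive)
		bonus = sleep_avg * MAX_BONUS / MAX_SLEEP_AVG;

	return static_prio - bonus + MAX_BONUS / 2;
}
```

With the switch off, a CPU hog and a heavy sleeper end up at the same priority; with it on, the sleeper is boosted and the hog penalised, which is exactly the unfairness being debated.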

Cheers,
Con

2006-03-21 14:20:12

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > The yardstick for changes is now the speed of 'ls' scrolling in the
> > console. Where exactly are those extra cycles going I wonder? Do you
> > think the scheduler somehow makes the cpu idle doing nothing in that
> > timespace? Clearly that's not true, and userspace is making something
> > spin unnecessarily, but we're gonna fix that by modifying the
> > scheduler.... sigh
>
> *Blink*
>
> Are you having a bad hair day??

My hair is approximately 3mm long so it's kinda hard for that to happen.

What you're fixing with unfairness is worth pursuing. The 'ls' issue just
blows my mind though for reasons I've just said. Where are the magic cycles
going when nothing else is running that make it take ten times longer?

Cheers,
Con

2006-03-21 14:27:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Con Kolivas <[email protected]> wrote:

> On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > console. Where exactly are those extra cycles going I wonder? Do you
> > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > timespace? Clearly that's not true, and userspace is making something
> > > spin unnecessarily, but we're gonna fix that by modifying the
> > > scheduler.... sigh
> >
> > *Blink*
> >
> > Are you having a bad hair day??
>
> My hair is approximately 3mm long so it's kinda hard for that to happen.
>
> What you're fixing with unfairness is worth pursuing. The 'ls' issue
> just blows my mind though for reasons I've just said. Where are the
> magic cycles going when nothing else is running that make it take ten
> times longer?

i believe such artifacts are due to array switches not happening (due to
the workload getting queued back to rq->active, not rq->expired), and
'ls' only gets a timeslice once in a while, every STARVATION_LIMIT
times. I.e. such workloads penalize the CPU-bound 'ls' process quite
heavily.
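A standalone sketch of the starvation check Ingo refers to, modeled on the 2.6-era EXPIRED_STARVING() macro in kernel/sched.c; the struct and constant here are simplified for illustration:

```c
#include <assert.h>

#define STARVATION_LIMIT 10 /* jiffies allowed per runnable task */

struct runqueue {
	unsigned long expired_timestamp; /* jiffy of first expiry, 0 if none */
	unsigned long nr_running;        /* runnable tasks on this CPU */
};

/* Nonzero when tasks on the expired array have waited too long: as long
 * as interactive tasks keep being requeued to rq->active, no array
 * switch happens, and a CPU-bound task like 'ls' only runs once the
 * limit - scaled by runqueue length - is exceeded. */
static int expired_starving(struct runqueue *rq, unsigned long jiffies)
{
	return STARVATION_LIMIT &&
	       rq->expired_timestamp &&
	       (jiffies - rq->expired_timestamp >=
	        STARVATION_LIMIT * rq->nr_running);
}
```

Note the nr_running factor: at loadavg 7-8 the grace period before a forced array switch grows proportionally, which is why the pauses are so visible under exactly those workloads.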

Ingo

2006-03-21 14:28:53

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 01:19 +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > console. Where exactly are those extra cycles going I wonder? Do you
> > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > timespace? Clearly that's not true, and userspace is making something
> > > spin unnecessarily, but we're gonna fix that by modifying the
> > > scheduler.... sigh
> >
> > *Blink*
> >
> > Are you having a bad hair day??
>
> My hair is approximately 3mm long so it's kinda hard for that to happen.
>
> What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> blows my mind though for reasons I've just said. Where are the magic cycles
> going when nothing else is running that make it take ten times longer?

What I was talking about when I mentioned scrolling was rendering.

-Mike

2006-03-21 14:28:30

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:25, Ingo Molnar wrote:
> * Con Kolivas <[email protected]> wrote:
> > What you're fixing with unfairness is worth pursuing. The 'ls' issue
> > just blows my mind though for reasons I've just said. Where are the
> > magic cycles going when nothing else is running that make it take ten
> > times longer?
>
> i believe such artifacts are due to array switches not happening (due to
> the workload getting queued back to rq->active, not rq->expired), and
> 'ls' only gets a timeslice once in a while, every STARVATION_LIMIT
> times. I.e. such workloads penalize the CPU-bound 'ls' process quite
> heavily.

With nothing else running on the machine it should still get all the cpu no
matter which array it's on though.

Con

2006-03-21 14:31:04

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:28, Mike Galbraith wrote:
> On Wed, 2006-03-22 at 01:19 +1100, Con Kolivas wrote:
> > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > blows my mind though for reasons I've just said. Where are the magic
> > cycles going when nothing else is running that make it take ten times
> > longer?
>
> What I was talking about when I mentioned scrolling was rendering.

I'm talking about the long standing report that 'ls' takes 10 times longer on
2.6 90% of the time you run it, and doing 'ls | cat' makes it run as fast as
2.4. This is what Willy has been fighting with.

Cheers,
Con

2006-03-21 14:32:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Con Kolivas <[email protected]> wrote:

> On Wednesday 22 March 2006 01:25, Ingo Molnar wrote:
> > * Con Kolivas <[email protected]> wrote:
> > > What you're fixing with unfairness is worth pursuing. The 'ls' issue
> > > just blows my mind though for reasons I've just said. Where are the
> > > magic cycles going when nothing else is running that make it take ten
> > > times longer?
> >
> > i believe such artifacts are due to array switches not happening (due to
> > the workload getting queued back to rq->active, not rq->expired), and
> > 'ls' only gets a timeslice once in a while, every STARVATION_LIMIT
> > times. I.e. such workloads penalize the CPU-bound 'ls' process quite
> > heavily.
>
> With nothing else running on the machine it should still get all the
> cpu no matter which array it's on though.

yes. I thought you were asking why 'ls' pauses so long during the
aforementioned workloads (of loadavg 7-8) - and i answered that. If you
meant something else then please re-explain it to me.

Ingo

2006-03-21 14:34:48

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Con Kolivas <[email protected]> wrote:

> On Wednesday 22 March 2006 01:28, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 01:19 +1100, Con Kolivas wrote:
> > > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > > blows my mind though for reasons I've just said. Where are the magic
> > > cycles going when nothing else is running that make it take ten times
> > > longer?
> >
> > What I was talking about when I mentioned scrolling was rendering.
>
> I'm talking about the long standing report that 'ls' takes 10 times
> longer on 2.6 90% of the time you run it, and doing 'ls | cat' makes
> it run as fast as 2.4. This is what Willy has been fighting with.

ah. That's i think a gnome-terminal artifact - it does some really
stupid dynamic things while rendering, it 'skips' certain portions of
rendering, depending on the speed of scrolling. Gnome 2.14 ought to have
that fixed i think.

Ingo

2006-03-21 14:36:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 01:30 +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 01:28, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 01:19 +1100, Con Kolivas wrote:
> > > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > > blows my mind though for reasons I've just said. Where are the magic
> > > cycles going when nothing else is running that make it take ten times
> > > longer?
> >
> > What I was talking about when I mentioned scrolling was rendering.
>
> I'm talking about the long standing report that 'ls' takes 10 times longer on
> 2.6 90% of the time you run it, and doing 'ls | cat' makes it run as fast as
> 2.4. This is what Willy has been fighting with.

Oh. I thought you were calling me a _moron_ :)

-Mike

2006-03-21 14:40:25

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:36, Mike Galbraith wrote:
> Oh. I thought you were calling me a _moron_ :)

No, never assume any emotion in email and I'm sorry if you interpreted it that
way.

Since I run my own mailing list I had to make a FAQ on this.
http://ck.kolivas.org/faqs/replying-to-mailing-list.txt

Extract:
4. Be polite

Humans by nature don't realise how much they depend on seeing facial
expressions, voice intonations and body language to determine the emotion
associated with words. In the context of email it is very common to
misinterpret people's emotions based on the text alone. English subtleties
will often be misinterpreted even across English speaking nations, and for
non-English speakers it becomes much harder. Without the author explicitly
stating his emotions, assume neutrality and respond politely.

Cheers,
Con

2006-03-21 14:39:56

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, Mar 22, 2006 at 01:19:49AM +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > console. Where exactly are those extra cycles going I wonder? Do you
> > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > timespace? Clearly that's not true, and userspace is making something
> > > spin unnecessarily, but we're gonna fix that by modifying the
> > > scheduler.... sigh
> >
> > *Blink*
> >
> > Are you having a bad hair day??
>
> My hair is approximately 3mm long so it's kinda hard for that to happen.
>
> What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> blows my mind though for reasons I've just said. Where are the magic cycles
> going when nothing else is running that make it take ten times longer?

Con, those cycles are not "magic"; if you look at the numbers, the time is
not spent in the process itself. From what has been observed since the
beginning, it is spent:
- in other processes which are hogging the CPU (eg: X11 when xterm
scrolls)
- in context switches when you have a pipe somewhere and the CPU is
bouncing between tasks.

Concerning your anger about me being OK with (0,0) and still
asking for tunables, it's precisely because I know that *my* workload
is not everyone else's, and I don't want to conclude too quickly that
there are only two types of workloads. Maybe you're right, maybe you're
wrong. At least you're right for as long as no other workload has been
identified. But thinking like this is like some time ago when we thought
that "if it runs XMMS without skipping, it'll be OK for everyone".

> Cheers,
> Con

Cheers,
Willy

2006-03-21 14:44:30

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, Mar 21, 2006 at 03:32:40PM +0100, Ingo Molnar wrote:
>
> * Con Kolivas <[email protected]> wrote:
>
> > On Wednesday 22 March 2006 01:28, Mike Galbraith wrote:
> > > On Wed, 2006-03-22 at 01:19 +1100, Con Kolivas wrote:
> > > > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > > > blows my mind though for reasons I've just said. Where are the magic
> > > > cycles going when nothing else is running that make it take ten times
> > > > longer?
> > >
> > > What I was talking about when I mentioned scrolling was rendering.
> >
> > I'm talking about the long standing report that 'ls' takes 10 times
> > longer on 2.6 90% of the time you run it, and doing 'ls | cat' makes
> > it run as fast as 2.4. This is what Willy has been fighting with.
>
> ah. That's i think a gnome-terminal artifact - it does some really
> stupid dynamic things while rendering, it 'skips' certain portions of
> rendering, depending on the speed of scrolling. Gnome 2.14 ought to have
> that fixed i think.

Ah no, I never use those monstrous environments! xterm is already heavy.
Don't you remember, we found that doing "ls" in an xterm was waking the
xterm process for every single line, which in turn woke the X server for
a one-line scroll, while adding the "|cat" acted like a buffer with batched
scrolls. Newer xterms have been improved to trigger jump scroll earlier and
don't exhibit this behaviour even on non-patched kernels. However, sshd
still shows the same problem IMHO.

> Ingo

Cheers,
Willy

2006-03-21 14:54:10

by Ingo Molnar

[permalink] [raw]
Subject: Re: interactive task starvation


* Willy Tarreau <[email protected]> wrote:

> Ah no, I never use those monstrous environments! xterm is already
> heavy. [...]

[ offtopic note: gnome-terminal developers claim some massive speedups
in Gnome 2.14, and my experiments on Fedora rawhide seem to
corroborate that - gnome-term is now faster (for me) than xterm. ]

> [...] don't you remember, we found that doing "ls" in an xterm was
> waking the xterm process for every single line, which in turn woke the
> X server for a one-line scroll, while adding the "|cat" acted like a
> buffer with batched scrolls. Newer xterms have been improved to
> trigger jump scroll earlier and don't exhibit this behaviour even on
> non-patched kernels. However, sshd still shows the same problem IMHO.

yeah. The "|cat" changes the workload, which gets rated by the scheduler
differently. Such artifacts are inevitable once interactivity heuristics
are strong enough to significantly distort the equal sharing of CPU
time.

Ingo

2006-03-21 15:20:29

by Con Kolivas

[permalink] [raw]
Subject: Re: interactive task starvation

On Wednesday 22 March 2006 01:17, Con Kolivas wrote:
> I actually believe the same effect can be had by a tiny
> modification to enable/disable the estimator anyway.

Just for argument's sake it would look something like this.

Cheers,
Con
---
Add sysctl to enable/disable CPU scheduler interactivity estimator

Signed-off-by: Con Kolivas <[email protected]>

---
include/linux/sched.h | 1 +
include/linux/sysctl.h | 1 +
kernel/sched.c | 14 +++++++++++---
kernel/sysctl.c | 8 ++++++++
4 files changed, 21 insertions(+), 3 deletions(-)

Index: linux-2.6.16-rc6-mm2/include/linux/sched.h
===================================================================
--- linux-2.6.16-rc6-mm2.orig/include/linux/sched.h 2006-03-19 11:15:27.000000000 +1100
+++ linux-2.6.16-rc6-mm2/include/linux/sched.h 2006-03-22 02:13:55.000000000 +1100
@@ -104,6 +104,7 @@ extern unsigned long nr_uninterruptible(
extern unsigned long nr_active(void);
extern unsigned long nr_iowait(void);
extern unsigned long weighted_cpuload(const int cpu);
+extern int sched_interactive;

#include <linux/time.h>
#include <linux/param.h>
Index: linux-2.6.16-rc6-mm2/include/linux/sysctl.h
===================================================================
--- linux-2.6.16-rc6-mm2.orig/include/linux/sysctl.h 2006-03-19 11:15:27.000000000 +1100
+++ linux-2.6.16-rc6-mm2/include/linux/sysctl.h 2006-03-22 02:14:43.000000000 +1100
@@ -148,6 +148,7 @@ enum
KERN_SPIN_RETRY=70, /* int: number of spinlock retries */
KERN_ACPI_VIDEO_FLAGS=71, /* int: flags for setting up video after ACPI sleep */
KERN_IA64_UNALIGNED=72, /* int: ia64 unaligned userland trap enable */
+ KERN_INTERACTIVE=73, /* int: enable/disable interactivity estimator */
};


Index: linux-2.6.16-rc6-mm2/kernel/sched.c
===================================================================
--- linux-2.6.16-rc6-mm2.orig/kernel/sched.c 2006-03-19 15:41:08.000000000 +1100
+++ linux-2.6.16-rc6-mm2/kernel/sched.c 2006-03-22 02:13:56.000000000 +1100
@@ -128,6 +128,9 @@
* too hard.
*/

+/* Sysctl enable/disable interactive estimator */
+int sched_interactive __read_mostly = 1;
+
#define CURRENT_BONUS(p) \
(NS_TO_JIFFIES((p)->sleep_avg) * MAX_BONUS / \
MAX_SLEEP_AVG)
@@ -151,7 +154,8 @@
INTERACTIVE_DELTA)

#define TASK_INTERACTIVE(p) \
- ((p)->prio <= (p)->static_prio - DELTA(p))
+ ((p)->prio <= (p)->static_prio - DELTA(p) && \
+ sched_interactive)

#define INTERACTIVE_SLEEP(p) \
(JIFFIES_TO_NS(MAX_SLEEP_AVG * \
@@ -662,9 +666,13 @@ static int effective_prio(task_t *p)
if (rt_task(p))
return p->prio;

- bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;
+ prio = p->static_prio;
+
+ if (sched_interactive) {
+ bonus = CURRENT_BONUS(p) - MAX_BONUS / 2;

- prio = p->static_prio - bonus;
+ prio -= bonus;
+ }
if (prio < MAX_RT_PRIO)
prio = MAX_RT_PRIO;
if (prio > MAX_PRIO-1)
Index: linux-2.6.16-rc6-mm2/kernel/sysctl.c
===================================================================
--- linux-2.6.16-rc6-mm2.orig/kernel/sysctl.c 2006-03-19 11:15:27.000000000 +1100
+++ linux-2.6.16-rc6-mm2/kernel/sysctl.c 2006-03-22 02:15:23.000000000 +1100
@@ -684,6 +684,14 @@ static ctl_table kern_table[] = {
.proc_handler = &proc_dointvec,
},
#endif
+ {
+ .ctl_name = KERN_INTERACTIVE,
+ .procname = "interactive",
+ .data = &sched_interactive,
+ .maxlen = sizeof (int),
+ .mode = 0644,
+ .proc_handler = &proc_dointvec,
+ },
{ .ctl_name = 0 }
};

2006-03-21 17:50:43

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, Mar 22, 2006 at 02:20:10AM +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 01:17, Con Kolivas wrote:
> > I actually believe the same effect can be had by a tiny
> > modification to enable/disable the estimator anyway.
>
> Just for argument's sake it would look something like this.
>
> Cheers,
> Con
> ---
> Add sysctl to enable/disable CPU scheduler interactivity estimator

At least, in May 2005, the equivalent of this patch I tested on
2.6.11.7 considerably improved responsiveness, but there was still
this very annoying slowdown when the load increased. vmstat delays
increased by one second every 10 processes. I retried again around
2.6.14 a few months ago, and it was the same. Perhaps Mike's code
and other changes in 2.6-mm really fix the initial problem (array
switching ?) and then only the interactivity boost is causing the
remaining trouble ?

Cheers,
Willy

2006-03-21 17:52:09

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 02:20 +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 01:17, Con Kolivas wrote:
> > I actually believe the same effect can be had by a tiny
> > modification to enable/disable the estimator anyway.
>
> Just for argument's sake it would look something like this.

That won't have the same effect. What you disabled isn't only about
interactivity. It's also about preemption, throughput and fairness.

-Mike

(we now interrupt this thread for an evening of real life;)

2006-03-21 18:40:26

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: interactive task starvation

On Tuesday 21 March 2006 15:39, Willy Tarreau wrote:
> On Wed, Mar 22, 2006 at 01:19:49AM +1100, Con Kolivas wrote:
> > On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > > console. Where exactly are those extra cycles going I wonder? Do you
> > > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > > timespace? Clearly that's not true, and userspace is making something
> > > > spin unnecessarily, but we're gonna fix that by modifying the
> > > > scheduler.... sigh
> > >
> > > *Blink*
> > >
> > > Are you having a bad hair day??
> >
> > My hair is approximately 3mm long so it's kinda hard for that to happen.
> >
> > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > blows my mind though for reasons I've just said. Where are the magic cycles
> > going when nothing else is running that make it take ten times longer?
>
> Con, those cycles are not "magic"; if you look at the numbers, the time is
> not spent in the process itself. From what has been observed since the
> beginning, it is spent:
> - in other processes which are starving the CPU (e.g. X11 when xterm
> scrolls)
> - in context switches when you have a pipe somewhere and the CPU is
> bouncing between tasks.
>
> Concerning your anger about me being OK with (0,0) and still
> asking for tunables, it's precisely because I know that *my* workload
> is not everyone else's, and I don't want to conclude too quickly that
> there are only two types of workloads.

Well, perhaps we can assume there are only two types of workloads and
wait for a test case that will show the assumption is wrong?

> Maybe you're right, maybe you're wrong. At least you're right for as long
> as no other workload has been identified. But thinking like this is like
> some time ago when we thought that "if it runs XMMS without skipping,
> it'll be OK for everyone".

However, we should not try to anticipate every possible kind of workload
IMHO.

Greetings,
Rafael

2006-03-21 19:37:58

by Willy Tarreau

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, Mar 21, 2006 at 07:39:11PM +0100, Rafael J. Wysocki wrote:
> On Tuesday 21 March 2006 15:39, Willy Tarreau wrote:
> > On Wed, Mar 22, 2006 at 01:19:49AM +1100, Con Kolivas wrote:
> > > On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > > > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > > > console. Where exactly are those extra cycles going I wonder? Do you
> > > > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > > > timespace? Clearly that's not true, and userspace is making something
> > > > > spin unnecessarily, but we're gonna fix that by modifying the
> > > > > scheduler.... sigh
> > > >
> > > > *Blink*
> > > >
> > > > Are you having a bad hair day??
> > >
> > > My hair is approximately 3mm long so it's kinda hard for that to happen.
> > >
> > > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > > blows my mind though for reasons I've just said. Where are the magic cycles
> > > going when nothing else is running that make it take ten times longer?
> >
> > Con, those cycles are not "magic"; if you look at the numbers, the time is
> > not spent in the process itself. From what has been observed since the
> > beginning, it is spent:
> > - in other processes which are starving the CPU (e.g. X11 when xterm
> > scrolls)
> > - in context switches when you have a pipe somewhere and the CPU is
> > bouncing between tasks.
> >
> > Concerning your anger about me being OK with (0,0) and still
> > asking for tunables, it's precisely because I know that *my* workload
> > is not everyone else's, and I don't want to conclude too quickly that
> > there are only two types of workloads.
>
> Well, perhaps we can assume there are only two types of workloads and
> wait for a test case that will show the assumption is wrong?

It would certainly fit most usages, but as soon as we find another group
of users complaining, will we add another sysctl just for them? Perhaps
we could just merge the two current sysctls into one called
"interactivity_boost" with a value between 0 and 100, with the ability
for any user to increase or decrease it easily? Mainline would be
pre-configured with something reasonable, like what Mike proposed as
default values for example, and server admins would only set it to
zero while desktop-intensive users could increase it a bit if they like
to.

> > Maybe you're right, maybe you're wrong. At least you're right for as long
> > as no other workload has been identified. But thinking like this is like
> > some time ago when we thought that "if it runs XMMS without skipping,
> > it'll be OK for everyone".
>
> However, we should not try to anticipate every possible kind of workload
> IMHO.

I generally agree on this, except that we got caught once in this area for
this exact reason.

> Greetings,
> Rafael

Regards,
Willy

2006-03-21 21:48:52

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: interactive task starvation

On Tuesday 21 March 2006 20:32, Willy Tarreau wrote:
> On Tue, Mar 21, 2006 at 07:39:11PM +0100, Rafael J. Wysocki wrote:
> > On Tuesday 21 March 2006 15:39, Willy Tarreau wrote:
> > > On Wed, Mar 22, 2006 at 01:19:49AM +1100, Con Kolivas wrote:
> > > > On Wednesday 22 March 2006 01:17, Mike Galbraith wrote:
> > > > > On Wed, 2006-03-22 at 00:53 +1100, Con Kolivas wrote:
> > > > > > The yardstick for changes is now the speed of 'ls' scrolling in the
> > > > > > console. Where exactly are those extra cycles going I wonder? Do you
> > > > > > think the scheduler somehow makes the cpu idle doing nothing in that
> > > > > > timespace? Clearly that's not true, and userspace is making something
> > > > > > spin unnecessarily, but we're gonna fix that by modifying the
> > > > > > scheduler.... sigh
> > > > >
> > > > > *Blink*
> > > > >
> > > > > Are you having a bad hair day??
> > > >
> > > > My hair is approximately 3mm long so it's kinda hard for that to happen.
> > > >
> > > > What you're fixing with unfairness is worth pursuing. The 'ls' issue just
> > > > blows my mind though for reasons I've just said. Where are the magic cycles
> > > > going when nothing else is running that make it take ten times longer?
> > >
> > > Con, those cycles are not "magic"; if you look at the numbers, the time is
> > > not spent in the process itself. From what has been observed since the
> > > beginning, it is spent:
> > > - in other processes which are starving the CPU (e.g. X11 when xterm
> > > scrolls)
> > > - in context switches when you have a pipe somewhere and the CPU is
> > > bouncing between tasks.
> > >
> > > Concerning your anger about me being OK with (0,0) and still
> > > asking for tunables, it's precisely because I know that *my* workload
> > > is not everyone else's, and I don't want to conclude too quickly that
> > > there are only two types of workloads.
> >
> > Well, perhaps we can assume there are only two types of workloads and
> > wait for a test case that will show the assumption is wrong?
>
> It would certainly fit most usages, but as soon as we find another group
> of users complaining, will we add another sysctl just for them? Perhaps
> we could just merge the two current sysctls into one called
> "interactivity_boost" with a value between 0 and 100, with the ability
> for any user to increase or decrease it easily? Mainline would be
> pre-configured with something reasonable, like what Mike proposed as
> default values for example, and server admins would only set it to
> zero while desktop-intensive users could increase it a bit if they like
> to.

Sounds reasonable to me.

Greetings,
Rafael

2006-03-21 22:51:06

by Peter Williams

[permalink] [raw]
Subject: Re: interactive task starvation

Mike Galbraith wrote:
> On Tue, 2006-03-21 at 13:59 +0100, Willy Tarreau wrote:
>
>>On Tue, Mar 21, 2006 at 01:07:58PM +0100, Mike Galbraith wrote:
>
>
>>>I can make the knobs compile time so we don't see random behavior
>>>reports, but I don't think they can be totally eliminated. Would that
>>>be sufficient?
>>>
>>>If so, the numbers as delivered should be fine for desktop boxen I
>>>think. People who are building custom kernels can bend to fit as
>>>always.
>>
>>That would suit me perfectly. I think I would set them both to zero.
>>It's not clear to me what workload they can help, it seems that they
>>try to allow a sometimes unfair scheduling.
>
>
> Correct. Massively unfair scheduling is what interactivity requires.
>

Selective unfairness, not massive unfairness, is what's required. The
hard part is automating the selectiveness especially when there are
three quite different types of task that need special treatment: 1) the
X server, 2) normal interactive tasks and 3) media streamers; each of
which has different behavioural characteristics. A single mechanism
that classifies all of these as "interactive" will unfortunately catch a
lot of tasks that don't belong to any one of these types.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-03-22 03:49:37

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Wed, 2006-03-22 at 09:51 +1100, Peter Williams wrote:
> Mike Galbraith wrote:
> > On Tue, 2006-03-21 at 13:59 +0100, Willy Tarreau wrote:
> >>That would suit me perfectly. I think I would set them both to zero.
> >>It's not clear to me what workload they can help, it seems that they
> >>try to allow a sometimes unfair scheduling.
> >
> >
> > Correct. Massively unfair scheduling is what interactivity requires.
> >
>
> Selective unfairness, not massive unfairness, is what's required. The
> hard part is automating the selectiveness especially when there are
> three quite different types of task that need special treatment: 1) the
> X server, 2) normal interactive tasks and 3) media streamers; each of
> which has different behavioural characteristics. A single mechanism
> that classifies all of these as "interactive" will unfortunately catch a
> lot of tasks that don't belong to any one of these types.

Yes, selective would be nice, but it's still massive unfairness that is
required. There are no criteria available for discrimination, so my
patches don't even try to classify; they only enforce the rules. I
don't classify X as interactive, I merely provide a mechanism which
enables X to accumulate the cycles an interactive task needs to be able
to perform by actually _being_ interactive, by conforming to the
definition of sleep_avg. Fortunately, it uses that mechanism. I do
nothing more than trade stout rope for good behavior. I anchor one end
to a boulder, the other to a task's neck. The mechanism is agnostic.
The task determines whether it gets hung or not, and the user determines
how long the rope is.

-Mike

2006-03-22 03:59:46

by Peter Williams

[permalink] [raw]
Subject: Re: interactive task starvation

Mike Galbraith wrote:
> On Wed, 2006-03-22 at 09:51 +1100, Peter Williams wrote:
>
>>Mike Galbraith wrote:
>>
>>>On Tue, 2006-03-21 at 13:59 +0100, Willy Tarreau wrote:
>>>
>>>>That would suit me perfectly. I think I would set them both to zero.
>>>>It's not clear to me what workload they can help, it seems that they
>>>>try to allow a sometimes unfair scheduling.
>>>
>>>
>>>Correct. Massively unfair scheduling is what interactivity requires.
>>>
>>
>>Selective unfairness, not massive unfairness, is what's required. The
>>hard part is automating the selectiveness especially when there are
>>three quite different types of task that need special treatment: 1) the
>>X server, 2) normal interactive tasks and 3) media streamers; each of
>>which has different behavioural characteristics. A single mechanism
>>that classifies all of these as "interactive" will unfortunately catch a
>>lot of tasks that don't belong to any one of these types.
>
>
> Yes, selective would be nice, but it's still massive unfairness that is
> required. There are no criteria available for discrimination, so my
> patches don't even try to classify; they only enforce the rules. I
> don't classify X as interactive, I merely provide a mechanism which
> enables X to accumulate the cycles an interactive task needs to be able
> to perform by actually _being_ interactive, by conforming to the
> definition of sleep_avg.

That's what I mean by classification :-)

> Fortunately, it uses that mechanism. I do
> nothing more than trade stout rope for good behavior. I anchor one end
> to a boulder, the other to a task's neck. The mechanism is agnostic.
> The task determines whether it gets hung or not, and the user determines
> how long the rope is.

I view that as a modification (hopefully an improvement) of the
classification rules :-). In particular, a variation in the persistence
of a classification and the criteria for losing/downgrading it.

Peter
--
Peter Williams [email protected]

"Learning, n. The kind of ignorance distinguishing the studious."
-- Ambrose Bierce

2006-03-22 04:17:54

by Mike Galbraith

[permalink] [raw]
Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 18:50 +0100, Willy Tarreau wrote:
> On Wed, Mar 22, 2006 at 02:20:10AM +1100, Con Kolivas wrote:
> > On Wednesday 22 March 2006 01:17, Con Kolivas wrote:
> > > I actually believe the same effect can be had by a tiny
> > > modification to enable/disable the estimator anyway.
> >
> > Just for argument's sake it would look something like this.
> >
> > Cheers,
> > Con
> > ---
> > Add sysctl to enable/disable CPU scheduler interactivity estimator
>
> At least, in May 2005, the equivalent of this patch I tested on
> 2.6.11.7 considerably improved responsiveness, but there was still
> this very annoying slowdown when the load increased. vmstat delays
> increased by one second every 10 processes. I retried again around
> 2.6.14 a few months ago, and it was the same. Perhaps Mike's code
> and other changes in 2.6-mm really fix the initial problem (array
> switching ?) and then only the interactivity boost is causing the
> remaining trouble ?

The slowdown you see is because a timeslice is 100ms, and that patch
turned the scheduler into a non-preempting pure round-robin slug.

Array switching is only one aspect, and one I hadn't thought of as I was
tinkering with my patches; I discovered that aspect by accident.

My code does a few things, and all of them are part of the picture. One
of them is to deal with excessive interactive boost. Another is to
tighten timeslice enforcement, and another is to close the fundamental
hole in the concept of sleep_avg. That hole is causing the majority of the
problems that crop up, the interactivity bits only make it worse. The
hole is this. If priority is based solely upon % sleep time, even if
there is no interactive boost, even if accumulation vs consumption is
1:1, if you sleep 51% of the time, you will inevitably rise to max
priority, and be able to use 49% of the CPU at max priority forever.
The current heuristics make that very close to but not quite 95%.

The fact that we don't have _horrendous_ problems shows that the basic
concept of sleep_avg is pretty darn good. Close the hole in any way you
can think of (mine is one), and it's excellent.

2006-03-22 12:13:56

by Mike Galbraith

[permalink] [raw]
Subject: [interbench numbers] Re: interactive task starvation

Greetings,

I was asked to do some interbench runs, with various throttle settings,
see below. I'll not attempt to interpret results, only present raw data
for others to examine.

Tested throttling patches version is V24, because while I was compiling
2.6.16-rc6-mm2 in preparation for comparison, I found I'd introduced an
SMP buglet in V23. Something good came from the added testing whether
the results are informative or not :)

-Mike

1. virgin 2.6.16-rc6-mm2.

Using 1975961 loops per ms, running every load for 30 seconds
Benchmarking kernel 2.6.16-rc6-mm2-smp at datestamp 200603221223

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.024 +/- 0.0486 1 100 100
Video 0.996 +/- 1.31 6.05 100 100
X 0.336 +/- 0.739 5.01 100 100
Burn 0.028 +/- 0.0905 2.05 100 100
Write 0.058 +/- 0.508 12.1 100 100
Read 0.043 +/- 0.115 1.66 100 100
Compile 0.047 +/- 0.126 2.55 100 100
Memload 0.258 +/- 4.57 112 99.8 99.8

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.031 +/- 0.396 16.7 100 99.9
X 0.722 +/- 3.35 30.7 100 97
Burn 0.531 +/- 7.42 246 99.1 98
Write 0.302 +/- 2.31 40.4 99.9 98.5
Read 0.092 +/- 1.11 32.9 99.9 99.7
Compile 0.428 +/- 2.77 36.3 99.9 97.9
Memload 0.235 +/- 3.3 104 99.5 99.1

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 1.25 +/- 6.46 70 85.8 83.2
Video 17.8 +/- 32 92 31.7 22.3
Burn 45.5 +/- 97.5 503 8.35 4.22
Write 3.55 +/- 12.2 66 79.9 73.6
Read 0.739 +/- 3.04 20 87.4 83
Compile 51.9 +/- 122 857 10.7 5.34
Memload 1.81 +/- 6.67 54 85.1 78.3

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU
None 8.65 +/- 14.8 116 92
Video 77.9 +/- 78.5 107 56.2
X 64.2 +/- 72.9 124 60.9
Burn 301 +/- 317 524 24.9
Write 26.8 +/- 45.6 135 78.9
Read 13.1 +/- 16.8 67.9 88.4
Compile 478 +/- 519 765 17.3
Memload 21.1 +/- 28.8 148 82.6


2. 2.6.16-rc6-mm2x with no throttling.

Using 1975961 loops per ms, running every load for 30 seconds
Benchmarking kernel 2.6.16-rc6-mm2x-smp at datestamp 200603220914

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.062 +/- 0.11 1.09 100 100
Video 1.15 +/- 1.53 11.4 100 100
X 0.223 +/- 0.609 6.09 100 100
Burn 0.039 +/- 0.258 6.01 100 100
Write 0.194 +/- 0.837 14 100 100
Read 0.05 +/- 0.202 3.01 100 100
Compile 0.216 +/- 1.36 19 100 100
Memload 0.218 +/- 2.22 51.4 100 99.8

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.185 +/- 1.6 18.8 100 99.1
X 1.27 +/- 4.47 27 100 94.3
Burn 1.57 +/- 13.3 345 98.1 93
Write 0.819 +/- 3.76 34.7 99.9 96
Read 0.301 +/- 2.05 18.7 100 98.5
Compile 4.22 +/- 12.9 233 92.4 80.2
Memload 0.624 +/- 3.46 66.7 99.6 97

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 2.57 +/- 7.94 43 74.6 67.7
Video 17.6 +/- 32.2 99 31.2 22.3
Burn 40.1 +/- 79.4 716 12.9 6.65
Write 6.03 +/- 16.6 80 75.1 64.6
Read 2.52 +/- 7.49 42 74.8 66.7
Compile 54.1 +/- 79.3 410 15.6 6.56
Memload 2.08 +/- 6.93 48 77.3 71.7

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU
None 12.3 +/- 16.6 65.3 89
Video 78.7 +/- 79.4 109 56
X 70.6 +/- 78.2 128 58.6
Burn 468 +/- 492 737 17.6
Write 36.6 +/- 52.7 300 73.2
Read 18.3 +/- 20.6 47.9 84.5
Compile 468 +/- 486 802 17.6
Memload 21.4 +/- 27 132 82.4


3. 2.6.16-rc6-mm2x with default settings.

Using 1975961 loops per ms, running every load for 30 seconds
Benchmarking kernel 2.6.16-rc6-mm2x-smp at datestamp 200603221006

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.033 +/- 0.0989 1.05 100 100
Video 0.859 +/- 1.17 7.45 100 100
X 0.239 +/- 0.662 7.1 100 100
Burn 0.06 +/- 0.382 7.86 100 100
Write 0.123 +/- 0.422 4.12 100 100
Read 0.045 +/- 0.103 1.18 100 100
Compile 0.292 +/- 2.9 65.8 100 99.8
Memload 0.256 +/- 3.78 91.8 100 99.8

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.101 +/- 1.06 16.7 100 99.6
X 1.13 +/- 4.38 33.7 99.9 95.2
Burn 10.7 +/- 47.1 410 67.2 64.7
Write 1.17 +/- 10.9 417 98.2 94.8
Read 0.127 +/- 1.13 16.8 100 99.6
Compile 8.6 +/- 32.6 200 70.7 63.6
Memload 0.512 +/- 3.32 83.5 99.7 97.6

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 2.2 +/- 7.75 51 81.9 74.9
Video 15.8 +/- 29.4 81 33 23.9
Burn 74.1 +/- 124 406 18.5 9.57
Write 4.6 +/- 14 86 55 48.5
Read 1.75 +/- 5.16 26 80.7 73.1
Compile 71.2 +/- 124 468 21.8 12.2
Memload 2.95 +/- 9.31 70 75.6 69.1

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU
None 13.7 +/- 17.9 56.4 87.9
Video 74.6 +/- 75.4 98.5 57.3
X 68.2 +/- 76.1 128 59.4
Burn 515 +/- 526 735 16.3
Write 35.5 +/- 58.3 505 73.8
Read 15.7 +/- 17.8 45.8 86.4
Compile 436 +/- 453 863 18.7
Memload 22.3 +/- 30.1 227 81.8


4. 2.6.16-rc6-mm2x with max throttling.

Using 1975961 loops per ms, running every load for 30 seconds
Benchmarking kernel 2.6.16-rc6-mm2x-smp at datestamp 200603220938

--- Benchmarking simulated cpu of Audio in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.035 +/- 0.118 2.01 100 100
Video 0.043 +/- 0.231 5.02 100 100
X 0.109 +/- 0.737 12.3 100 100
Burn 0.072 +/- 0.574 9.78 100 100
Write 0.11 +/- 0.367 4.14 100 100
Read 0.052 +/- 0.141 2.02 100 100
Compile 0.5 +/- 4.84 112 99.8 99.8
Memload 0.093 +/- 0.461 9.13 100 100

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.187 +/- 1.59 16.7 100 99.1
X 2.4 +/- 6.26 32.8 99.9 90
Burn 59.7 +/- 130 478 27.1 23.8
Write 2.08 +/- 9.24 208 98.3 90.5
Read 0.154 +/- 1.3 18.8 100 99.4
Compile 57.9 +/- 130 714 28.3 22.4
Memload 0.743 +/- 3.7 66.7 99.8 96.3

--- Benchmarking simulated cpu of X in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 1.73 +/- 6.46 42 74.4 70
Video 13.3 +/- 24.5 74 39.8 29.2
Burn 142 +/- 206 579 9.11 4.69
Write 4.51 +/- 14.1 88.4 61.4 55.5
Read 1.38 +/- 4.38 24 85.3 78.3
Compile 126 +/- 190 619 12.4 6.51
Memload 3.61 +/- 11.7 70 61.7 55.8

--- Benchmarking simulated cpu of Gaming in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU
None 12.9 +/- 16.5 67.6 88.6
Video 67.7 +/- 69 97.3 59.6
X 70.7 +/- 77.7 130 58.6
Burn 355 +/- 367 625 22
Write 35.6 +/- 61.3 545 73.8
Read 23.1 +/- 28.4 115 81.3
Compile 467 +/- 485 793 17.6
Memload 25.6 +/- 32.9 138 79.6





2006-03-22 20:33:20

by Con Kolivas

[permalink] [raw]
Subject: Re: [interbench numbers] Re: interactive task starvation

On Wednesday 22 March 2006 23:14, Mike Galbraith wrote:
> Greetings,
>
> I was asked to do some interbench runs, with various throttle settings,
> see below. I'll not attempt to interpret results, only present raw data
> for others to examine.
>
> Tested throttling patches version is V24, because while I was compiling
> 2.6.16-rc6-mm2 in preparation for comparison, I found I'd introduced an
> SMP buglet in V23. Something good came from the added testing whether
> the results are informative or not :)

Thanks!

I wonder why the results are affected even with no throttling settings
configured, just with the patches applied? Specifically I'm talking about
deadlines met with video being sensitive to this. Were there any other
config differences between the tests? Changing HZ would invalidate the
results, for example. Comments?

Cheers,
Con

2006-03-23 03:22:45

by Mike Galbraith

[permalink] [raw]
Subject: Re: [interbench numbers] Re: interactive task starvation

On Thu, 2006-03-23 at 07:27 +1100, Con Kolivas wrote:
> On Wednesday 22 March 2006 23:14, Mike Galbraith wrote:
> > Greetings,
> >
> > I was asked to do some interbench runs, with various throttle settings,
> > see below. I'll not attempt to interpret results, only present raw data
> > for others to examine.
> >
> > Tested throttling patches version is V24, because while I was compiling
> > 2.6.16-rc6-mm2 in preparation for comparison, I found I'd introduced an
> > SMP buglet in V23. Something good came from the added testing whether
> > the results are informative or not :)
>
> Thanks!
>
> I wonder why the results are affected even without any throttling settings but
> just patched in? Specifically I'm talking about deadlines met with video
> being sensitive to this. Were there any other config differences between the
> tests? Changing HZ would invalidate the results for example. Comments?

I wondered the same. The only difference then is the lower idle sleep
prio, tighter timeslice enforcement, and the SMP buglet fix for now <
p->timestamp due to SMP rounding. Configs are identical.

-Mike

2006-03-23 05:43:04

by Con Kolivas

[permalink] [raw]
Subject: Re: [interbench numbers] Re: interactive task starvation

On Thu, 23 Mar 2006 02:22 pm, Mike Galbraith wrote:
> On Thu, 2006-03-23 at 07:27 +1100, Con Kolivas wrote:
> > I wonder why the results are affected even without any throttling
> > settings but just patched in? Specifically I'm talking about deadlines
> > met with video being sensitive to this. Were there any other config
> > differences between the tests? Changing HZ would invalidate the results
> > for example. Comments?
>
> I wondered the same. The only difference then is the lower idle sleep
> prio, tighter timeslice enforcement, and the SMP buglet fix for now <
> p->timestamp due to SMP rounding. Configs are identical.

Ok, well if we're going to run with this set of changes then we need to assess
the effect of each change, and splitting them up into separate patches would
normally be appropriate anyway. That will allow us to track down which
particular patch causes it. That won't mean we will turn down the change
based on that one result, though; it will just help us understand it better.

Cheers,
Con

2006-03-23 05:53:29

by Mike Galbraith

Subject: Re: [interbench numbers] Re: interactive task starvation

On Thu, 2006-03-23 at 16:43 +1100, Con Kolivas wrote:
> On Thu, 23 Mar 2006 02:22 pm, Mike Galbraith wrote:
> > On Thu, 2006-03-23 at 07:27 +1100, Con Kolivas wrote:
> > > I wonder why the results are affected even without any throttling
> > > settings but just patched in? Specifically I'm talking about deadlines
> > > met with video being sensitive to this. Were there any other config
> > > differences between the tests? Changing HZ would invalidate the results
> > > for example. Comments?
> >
> > I wondered the same. The only difference then is the lower idle sleep
> > prio, tighter timeslice enforcement, and the SMP buglet fix for now <
> > p->timestamp due to SMP rounding. Configs are identical.
>
> Ok, well if we're going to run with this set of changes then we need to assess
> the effect of each change, and splitting them up into separate patches would
> normally be appropriate anyway. That will allow us to track down which
> particular patch causes it. That won't mean we will turn down the change
> based on that one result, though; it will just help us understand it better.

I'm investigating now.

-Mike

2006-03-23 11:06:56

by Mike Galbraith

Subject: Re: [interbench numbers] Re: interactive task starvation

On Thu, 2006-03-23 at 06:53 +0100, Mike Galbraith wrote:
> On Thu, 2006-03-23 at 16:43 +1100, Con Kolivas wrote:
> > On Thu, 23 Mar 2006 02:22 pm, Mike Galbraith wrote:
> > > On Thu, 2006-03-23 at 07:27 +1100, Con Kolivas wrote:
> > > > I wonder why the results are affected even without any throttling
> > > > settings but just patched in? Specifically I'm talking about deadlines
> > > > met with video being sensitive to this. Were there any other config
> > > > differences between the tests? Changing HZ would invalidate the results
> > > > for example. Comments?
> > >
> > > I wondered the same. The only difference then is the lower idle sleep
> > > prio, tighter timeslice enforcement, and the SMP buglet fix for now <
> > > p->timestamp due to SMP rounding. Configs are identical.
> >
> > Ok, well if we're going to run with this set of changes then we need to assess
> > the effect of each change, and splitting them up into separate patches would
> > normally be appropriate anyway. That will allow us to track down which
> > particular patch causes it. That won't mean we will turn down the change
> > based on that one result, though; it will just help us understand it better.
>
> I'm investigating now.

Nothing conclusive. Some of the difference may be because interbench
has a dependency on the idle sleep path popping tasks in a prio 16
instead of 18. Some of it may be because I'm not restricting IO, doing
that makes a bit of difference. Some of it is definitely plain old
jitter.

Six hours is long enough. I'm all done chasing interbench numbers.

-Mike

virgin

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.031 +/- 0.396 16.7 100 99.9
X 0.722 +/- 3.35 30.7 100 97
Burn 0.531 +/- 7.42 246 99.1 98
Write 0.302 +/- 2.31 40.4 99.9 98.5
Read 0.092 +/- 1.11 32.9 99.9 99.7
Compile 0.428 +/- 2.77 36.3 99.9 97.9
Memload 0.235 +/- 3.3 104 99.5 99.1

throttle patches with throttling disabled

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.185 +/- 1.6 18.8 100 99.1
X 1.27 +/- 4.47 27 100 94.3
Burn 1.57 +/- 13.3 345 98.1 93
Write 0.819 +/- 3.76 34.7 99.9 96
Read 0.301 +/- 2.05 18.7 100 98.5
Compile 4.22 +/- 12.9 233 92.4 80.2
Memload 0.624 +/- 3.46 66.7 99.6 97

minus idle sleep

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.222 +/- 1.82 16.8 100 98.8
X 1.02 +/- 3.9 30.7 100 95.7
Burn 0.208 +/- 3.67 141 99.8 99.3
Write 0.755 +/- 3.62 37.2 99.9 96.4
Read 0.265 +/- 1.94 16.9 100 98.6
Compile 2.16 +/- 15.2 333 96.7 90.7
Memload 0.723 +/- 3.5 37.4 99.8 96.3

minus don't restrict IO

--- Benchmarking simulated cpu of Video in the presence of simulated ---
Load Latency +/- SD (ms) Max Latency % Desired CPU % Deadlines Met
None 0.226 +/- 1.82 16.8 100 98.8
X 1.38 +/- 4.68 49.4 99.9 93.9
Burn 0.513 +/- 9.62 339 98.8 98.4
Write 0.418 +/- 2.7 30.8 99.9 97.9
Read 0.565 +/- 2.99 16.7 100 96.8
Compile 1.05 +/- 13.6 545 99.1 95.1
Memload 0.345 +/- 3.23 80.5 99.8 98.5



2006-03-24 00:21:27

by Con Kolivas

Subject: Re: [interbench numbers] Re: interactive task starvation

On Thursday 23 March 2006 22:07, Mike Galbraith wrote:
> Nothing conclusive. Some of the difference may be because interbench
> has a dependency on the idle sleep path popping tasks in a prio 16
> instead of 18. Some of it may be because I'm not restricting IO, doing
> that makes a bit of difference. Some of it is definitely plain old
> jitter.

Thanks for those! Just a clarification, please:

> virgin

I assume 2.6.16-rc6-mm2 ?

> throttle patches with throttling disabled

With your full patchset but no throttling enabled?

> minus idle sleep

Full patchset -throttling-idlesleep ?

> minus don't restrict IO

Full patchset -throttling-idlesleep-restrictio ?

Can you please email the latest separate patches so we can see them in
isolation? I promise I won't ask for any more interbench numbers any time
soon :)

Thanks!

Cheers,
Con

2006-03-24 05:02:11

by Mike Galbraith

Subject: Re: [interbench numbers] Re: interactive task starvation

On Fri, 2006-03-24 at 11:21 +1100, Con Kolivas wrote:
> On Thursday 23 March 2006 22:07, Mike Galbraith wrote:
> > Nothing conclusive. Some of the difference may be because interbench
> > has a dependency on the idle sleep path popping tasks in a prio 16
> > instead of 18. Some of it may be because I'm not restricting IO, doing
> > that makes a bit of difference. Some of it is definitely plain old
> > jitter.
>
> Thanks for those! Just a clarification please
>
> > virgin
>
> I assume 2.6.16-rc6-mm2 ?

Yes.

>
> > throttle patches with throttling disabled
>
> With your full patchset but no throttling enabled?

Yes.

>
> > minus idle sleep
>
> Full patchset -throttling-idlesleep ?

Yes, using stock idle sleep bits.

>
> > minus don't restrict IO
>
> Full patchset -throttling-idlesleep-restrictio ?
>

Yes.

> Can you please email the latest separate patches so we can see them in
> isolation? I promise I won't ask for any more interbench numbers any time
> soon :)

I've separated the buglet fix parts from the rest, so there are four
patches instead of two. I've also hidden the knobs, though for the
testing phase at least, I personally think it would be better to leave
the knobs there for people to twiddle. Something Willy said indicated
to me that 'credit' would be more palatable than 'grace', so I've
renamed and updated comments to match. I think it might look better,
but can't know since 'grace' was perfectly fine for my taste buds ;-)

I'll post as soon as I do some more cleanup pondering and verification.

-Mike

2006-03-24 05:04:32

by Con Kolivas

Subject: Re: [interbench numbers] Re: interactive task starvation

On Friday 24 March 2006 16:02, Mike Galbraith wrote:
> I've separated the buglet fix parts from the rest, so there are four
> patches instead of two. I've also hidden the knobs, though for the
> testing phase at least, I personally think it would be better to leave
> the knobs there for people to twiddle. Something Willy said indicated
> to me that 'credit' would be more palatable than 'grace', so I've
> renamed and updated comments to match. I think it might look better,
> but can't know since 'grace' was perfectly fine for my taste buds ;-)
>
> I'll post as soon as I do some more cleanup pondering and verification.

Great. I suggest making the base patch have the values hard coded as #defines
and then have a patch on top that turns those into userspace tunables we can
hand tune while in -mm which can then be dropped if/when merged upstream.

Cheers,
Con

2006-03-29 03:01:21

by Lee Revell

Subject: Re: interactive task starvation

On Tue, 2006-03-21 at 15:52 +0100, Ingo Molnar wrote:
> * Willy Tarreau <[email protected]> wrote:
>
> > Ah no, I never use those monstrous environments! xterm is already
> > heavy. [...]
>
> [ offtopic note: gnome-terminal developers claim some massive speedups
> in Gnome 2.14, and my experiments on Fedora rawhide seem to
> corroborate that - gnome-term is now faster (for me) than xterm. ]
>
> > [...] don't you remember, we found that doing "ls" in an xterm was
> > waking the xterm process for every single line, which in turn woke the
> > X server for a one-line scroll, while adding the "|cat" acted like a
> > buffer with batched scrolls. Newer xterms have been improved to
> > trigger jump scroll earlier and don't exhibit this behaviour even on
> > non-patched kernels. However, sshd still shows the same problem IMHO.
>
> yeah. The "|cat" changes the workload, which gets rated by the scheduler
> differently. Such artifacts are inevitable once interactivity heuristics
> are strong enough to significantly distort the equal sharing of CPU
> time.

Can you explain why terminal output ping-pongs back and forth between
taking a certain amount of time, and approximately 10x longer? For
example here's the result of "time dmesg" 6 times in an xterm with a
constant background workload:

real 0m0.086s
user 0m0.005s
sys 0m0.012s

real 0m0.078s
user 0m0.008s
sys 0m0.009s

real 0m0.082s
user 0m0.004s
sys 0m0.013s

real 0m0.084s
user 0m0.005s
sys 0m0.011s

real 0m0.751s
user 0m0.006s
sys 0m0.017s

real 0m0.749s
user 0m0.005s
sys 0m0.017s

Why does it ping-pong between taking ~0.08s and ~0.75s like that? The
behavior is completely reproducible.

Lee

2006-03-29 05:56:22

by Ray Lee

Subject: Re: interactive task starvation

On 3/28/06, Lee Revell <[email protected]> wrote:
> Can you explain why terminal output ping-pongs back and forth between
> taking a certain amount of time, and approximately 10x longer?
[...]
> Why does it ping-pong between taking ~0.08s and ~0.75s like that? The
> behavior is completely reproducible.

Does the scheduler have any concept of dependent tasks? (If so, hit
<delete> and move on.) If not, then the producer and consumer will be
scheduled randomly w/r/t each other, right? Sometimes producer then
consumer, sometimes vice versa. If so, the ping pong should be half of
the time slow, half of the time fast (+/- sqrt(N)), and the slow time
should scale directly with the number of tasks running on the system.

Do any of the above WAGs match what you see? If so, then perhaps it's
random just due to the order in which the tasks get initially
scheduled (dmesg vs ssh, or dmesg vs xterm vs X -- er, though I guess
in that latter case there's really <thinks> three separate timings
that you'd get back, as the triple set of tasks could be in one of six
orderings, one fast, one slow, and four equally mixed between the
two).

I wonder if on a pipe write, moving the reader to be right after the
writer in the list would even that out. (But only on cases where the
reader didn't just run -- wouldn't want a back and forth conversation
to starve everyone else...)

But like I said, just a WAG.

Ray

2006-03-29 06:16:13

by Lee Revell

Subject: Re: interactive task starvation

On Tue, 2006-03-28 at 21:56 -0800, Ray Lee wrote:
> Do any of the above WAGs match what you see? If so, then perhaps it's
> random just due to the order in which the tasks get initially
> scheduled (dmesg vs ssh, or dmesg vs xterm vs X -- er, though I guess
> in that latter case there's really <thinks> three separate timings
> that you'd get back, as the triple set of tasks could be in one of six
> orderings, one fast, one slow, and four equally mixed between the
> two).
>

Possibly - *very* rarely, like 1 out of 50 or 100 times, it falls
somewhere in the middle.

Lee