2007-11-27 09:34:30

by Yanmin Zhang

Subject: sched_yield: delete sysctl_sched_compat_yield

If echo "1">/proc/sys/kernel/sched_compat_yield before starting volanoMark
testing, the result is very good with kernel 2.6.24-rc3 on my 16-core tigerton.

1) If /proc/sys/kernel/sched_compat_yield=1, compared with 2.6.22,
2.6.24-rc3 has more than 70% improvement;
2) If /proc/sys/kernel/sched_compat_yield=0, compared with 2.6.22,
2.6.24-rc3 has more than 80% regression;

On other machines, the volanoMark result also improves a lot if
/proc/sys/kernel/sched_compat_yield=1.

Would you like to change function yield_task_fair to delete the code around
sysctl_sched_compat_yield, or just initialize it to 1?

Thanks,
Yanmin


2007-11-27 11:17:28

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Zhang, Yanmin <[email protected]> wrote:

> If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> my 16-core tigerton.

yep, that's known and has been discussed in detail on lkml. Java should
use something more suitable than sched_yield for its locking. Yield will
always be dependent on scheduling details and _some_ category of apps
will always be hurt. That's why we offer the sched_compat_yield flag.

Ingo
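
(For concreteness: the "something more suitable" referred to here is
futex-based blocking. Below is a minimal sketch in the style of Ulrich
Drepper's "Futexes Are Tricky" paper - names and structure are illustrative,
not taken from any JVM. States: 0 = unlocked, 1 = locked, 2 = locked with
possible waiters.)

#include <stdatomic.h>
#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

static long futex(atomic_int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

/* compare-and-swap returning the old value */
static int cmpxchg(atomic_int *p, int old, int new)
{
	int expected = old;
	atomic_compare_exchange_strong(p, &expected, new);
	return expected;
}

void futex_lock(atomic_int *m)
{
	int c = cmpxchg(m, 0, 1);	/* uncontended fast path: no syscall */
	if (c != 0) {
		if (c != 2)
			c = atomic_exchange(m, 2);
		while (c != 0) {
			/* sleep in the kernel until woken - no yield-spinning */
			futex(m, FUTEX_WAIT, 2);
			c = atomic_exchange(m, 2);
		}
	}
}

void futex_unlock(atomic_int *m)
{
	if (atomic_fetch_sub(m, 1) != 1) {	/* contended: wake one waiter */
		atomic_store(m, 0);
		futex(m, FUTEX_WAKE, 1);
	}
}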

2007-11-29 15:06:59

by Arjan van de Ven

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Tue, 27 Nov 2007 17:33:05 +0800
"Zhang, Yanmin" <[email protected]> wrote:

> If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> my 16-core tigerton.
>
> 1) If /proc/sys/kernel/sched_compat_yield=1, compared with 2.6.22,
> 2.6.24-rc3 has more than 70% improvement;
> 2) If /proc/sys/kernel/sched_compat_yield=0, compared with 2.6.22,
> 2.6.24-rc3 has more than 80% regression;
>
> On other machines, the volanoMark result also improves a lot if
> /proc/sys/kernel/sched_compat_yield=1.
>
> Would you like to change function yield_task_fair to delete the code
> around sysctl_sched_compat_yield, or just initialize it to 1?
>

sounds like a bad idea; volanomark (well, technically the jvm behind
it) is abusing sched_yield() by assuming it does something it really
doesn't do, and as it happens some of the earlier 2.6 schedulers
accidentally happened to behave in a way that was nice for this
benchmark.
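
(For reference, the abuse pattern being described typically looks like the
sketch below - a userspace "lock" that spins on sched_yield() and therefore
inherits whatever the scheduler happens to do on yield. Illustrative code,
not taken from any actual JVM.)

#include <sched.h>
#include <stdatomic.h>

static atomic_flag lock_word = ATOMIC_FLAG_INIT;

void yield_lock(void)
{
	while (atomic_flag_test_and_set(&lock_word))
		sched_yield();	/* hope the lock holder gets to run now */
}

void yield_unlock(void)
{
	atomic_flag_clear(&lock_word);
}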

Today's kernel has a somewhat different behavior (and before people
scream "regression": sched_yield() behavior isn't really specified and
doesn't make any sense at all, whatever you get is what you get....
it's pretty much an insane de facto behavior that is incredibly tied to
which decisions the scheduler makes and how, and no app can depend on that
in any way). In fact, I've proposed to make sched_yield() just do an
msleep(1)... that'd be closer to what sched_yield is supposed to do
standards-wise than any of the current behaviors .... ;)


--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-11-30 02:46:57

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> On Tue, 27 Nov 2007 17:33:05 +0800
>
> "Zhang, Yanmin" <[email protected]> wrote:
> > If echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > my 16-core tigerton.
> >
> > 1) If /proc/sys/kernel/sched_compat_yield=1, compared with 2.6.22,
> > 2.6.24-rc3 has more than 70% improvement;
> > 2) If /proc/sys/kernel/sched_compat_yield=0, compared with 2.6.22,
> > 2.6.24-rc3 has more than 80% regression;
> >
> > On other machines, the volanoMark result also improves a lot if
> > /proc/sys/kernel/sched_compat_yield=1.
> >
> > Would you like to change function yield_task_fair to delete the code
> > around sysctl_sched_compat_yield, or just initialize it to 1?
>
> sounds like a bad idea; volanomark (well, technically the jvm behind
> it) is abusing sched_yield() by assuming it does something it really
> doesn't do, and as it happens some of the earlier 2.6 schedulers
> accidentally happened to behave in a way that was nice for this
> benchmark.

OK, why is this still happening? Haven't we been asking JVMs to use
futexes or posix locking for years and years now? Are there any sane
jvms that _don't_ use yield?


> Today's kernel has a somewhat different behavior (and before people
> scream "regression": sched_yield() behavior isn't really specified and
> doesn't make any sense at all, whatever you get is what you get....
> it's pretty much an insane de facto behavior that is incredibly tied to
> which decisions the scheduler makes and how, and no app can depend on that

It is a performance regression. Is there any reason *not* to use the
"compat" yield by default? As you say, for SCHED_OTHER tasks, yield
can do almost anything. We may as well do something that isn't a
regression...


> in any way). In fact, I've proposed to make sched_yield() just do an
> msleep(1)... that'd be closer to what sched_yield is supposed to do
> standards-wise than any of the current behaviors .... ;)

What makes you say that? IIRC of all the things that sched_yield can
do, it is not allowed to block. So this is about the only thing that
will break the standard...

2007-11-30 02:52:52

by Arjan van de Ven

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Fri, 30 Nov 2007 13:46:22 +1100
Nick Piggin <[email protected]> wrote:

> > Today's kernel has a somewhat different behavior (and before people
> > scream "regression": sched_yield() behavior isn't really specified
> > and doesn't make any sense at all, whatever you get is what you
> > get.... it's pretty much an insane de facto behavior that is
> > incredibly tied to which decisions the scheduler makes and how, and no
> > app can depend on that
>
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression..

it just makes OTHER tests/benchmarks regress.... this is one of those
things where you just can't win.

>
>
> > in any way). In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standards-wise than any of the current behaviors .... ;)
>
> What makes you say that? IIRC of all the things that sched_yield can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...

sched_yield OF COURSE can block.. it's a schedule call after all!



--
If you want to reach me at my work email, use [email protected]
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2007-11-30 03:03:07

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Friday 30 November 2007 13:51, Arjan van de Ven wrote:
> On Fri, 30 Nov 2007 13:46:22 +1100
>
> Nick Piggin <[email protected]> wrote:
> > > Today's kernel has a somewhat different behavior (and before people
> > > scream "regression": sched_yield() behavior isn't really specified
> > > and doesn't make any sense at all, whatever you get is what you
> > > get.... it's pretty much an insane de facto behavior that is
> > > incredibly tied to which decisions the scheduler makes and how, and no
> > > app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default? As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression..
>
> it just makes OTHER tests/benchmarks regress.... this is one of those
> things where you just can't win.

OK, which ones? Because java is slightly important...


> > > in any way). In fact, I've proposed to make sched_yield() just do an
> > > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > > standards-wise than any of the current behaviors .... ;)
> >
> > What makes you say that? IIRC of all the things that sched_yield can
> > do, it is not allowed to block. So this is about the only thing that
> > will break the standard...
>
> sched_yield OF COURSE can block.. it's a schedule call after all!

In unix, blocking ~= removed from runqueue, no?

OF COURSE it is allowed to cooperatively schedule another task, but
I don't see why you think it should so obviously be allowed to block
/ sleep.

It breaks basically the only invariant of sched_yield, in that the
task would no longer run even when there is nothing else runnable.

2007-11-30 03:16:56

by Yanmin Zhang

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
> > On Tue, 27 Nov 2007 17:33:05 +0800
> >
> > "Zhang, Yanmin" <[email protected]> wrote:
> > > If I echo "1">/proc/sys/kernel/sched_compat_yield before starting
> > > volanoMark testing, the result is very good with kernel 2.6.24-rc3 on
> > > my 16-core Tigerton.
> > >
> > > 1) If /proc/sys/kernel/sched_compat_yield=1, compared with 2.6.22,
> > > 2.6.24-rc3 has more than 70% improvement;
> > > 2) If /proc/sys/kernel/sched_compat_yield=0, compared with 2.6.22,
> > > 2.6.24-rc3 has more than 80% regression;
> > >
> > > On other machines, the volanoMark result also improves a lot if
> > > /proc/sys/kernel/sched_compat_yield=1.
> > >
> > > Would you like to change function yield_task_fair to delete the code
> > > around sysctl_sched_compat_yield, or just initialize it to 1?
> >
> > sounds like a bad idea; volanomark (well, technically the jvm behind
> > it) is abusing sched_yield() by assuming it does something it really
> > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > accidentally happened to behave in a way that was nice for this
> > benchmark.
>
> OK, why is this still happening? Haven't we been asking JVMs to use
> futexes or posix locking for years and years now? Are there any sane
> jvms that _don't_ use yield?
I think it's an issue with volanoMark (a Java application) rather than with the JVM.

>
>
> > Today's kernel has a somewhat different behavior (and before people
> > scream "regression": sched_yield() behavior isn't really specified and
> > doesn't make any sense at all, whatever you get is what you get....
> > it's pretty much an insane de facto behavior that is incredibly tied to
> > which decisions the scheduler makes and how, and no app can depend on that
>
> It is a performance regression. Is there any reason *not* to use the
> "compat" yield by default?
There is none, so I suggest setting sched_compat_yield=1 by default.
If sched_compat_yield=0, the kernel does almost nothing but return. When
sched_compat_yield=1, it is closer to the meaning of the sched_yield man page.

> As you say, for SCHED_OTHER tasks, yield
> can do almost anything. We may as well do something that isn't a
> regression...
I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
the latest kernel?

>
>
> > in any way). In fact, I've proposed to make sched_yield() just do an
> > msleep(1)... that'd be closer to what sched_yield is supposed to do
> > standards-wise than any of the current behaviors .... ;)
>
> What makes you say that? IIRC of all the things that sched_yield can
> do, it is not allowed to block. So this is about the only thing that
> will break the standard...

2007-11-30 03:29:43

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:

> > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > it) is abusing sched_yield() by assuming it does something it really
> > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > accidentally happened to behave in a way that was nice for this
> > > benchmark.
> >
> > OK, why is this still happening? Haven't we been asking JVMs to use
> > futexes or posix locking for years and years now? Are there any sane
> > jvms that _don't_ use yield?
>
> I think it's an issue with volanoMark (a Java application) rather than with
> the JVM.

volanomark itself and not the jvm is calling sched_yield()? Do we have
any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)


> > > Today's kernel has a somewhat different behavior (and before people
> > > scream "regression": sched_yield() behavior isn't really specified and
> > > doesn't make any sense at all, whatever you get is what you get....
> > > it's pretty much an insane de facto behavior that is incredibly tied to
> > > which decisions the scheduler makes and how, and no app can depend on that
> >
> > It is a performance regression. Is there any reason *not* to use the
> > "compat" yield by default?
>
> There is none, so I suggest setting sched_compat_yield=1 by default.
> If sched_compat_yield=0, the kernel does almost nothing but return. When
> sched_compat_yield=1, it is closer to the meaning of the sched_yield man page.

sched_yield() is really only defined for posix realtime scheduling
AFAIK, which talks about priority lists.

SCHED_OTHER is defined to be a single priority, below the rest of the
realtime priorities. So at first you *might* say that the process
should then be made to run only after all other SCHED_OTHER processes,
however there is no such ordering requirement for SCHED_OTHER
scheduling. The SCHED_OTHER scheduler can run any task at any time.

That said, I think people would *expect* that call to be much closer to
the compat behaviour than the current default. And that's definitely
what Linux has done in the past. So there really does need to be a
good reason to change it like this IMO.


> > As you say, for SCHED_OTHER tasks, yield
> > can do almost anything. We may as well do something that isn't a
> > regression...
>
> I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> the latest kernel?

Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...

2007-11-30 04:34:08

by Yanmin Zhang

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Fri, 2007-11-30 at 14:29 +1100, Nick Piggin wrote:
> On Friday 30 November 2007 14:15, Zhang, Yanmin wrote:
> > On Fri, 2007-11-30 at 13:46 +1100, Nick Piggin wrote:
> > > On Wednesday 28 November 2007 09:57, Arjan van de Ven wrote:
>
> > > > sounds like a bad idea; volanomark (well, technically the jvm behind
> > > > it) is abusing sched_yield() by assuming it does something it really
> > > > doesn't do, and as it happens some of the earlier 2.6 schedulers
> > > > accidentally happened to behave in a way that was nice for this
> > > > benchmark.
> > >
> > > OK, why is this still happening? Haven't we been asking JVMs to use
> > > futexes or posix locking for years and years now? Are there any sane
> > > jvms that _don't_ use yield?
> >
> > I think it's an issue with volanoMark (a Java application) rather than with
> > the JVM.
>
> volanomark itself and not the jvm is calling sched_yield()? Do we have
> any non-toy threaded java apps? (what's JAVA in the kernel-perf tests?)
I run lots of well-known benchmarks, and volanoMark is the one that gets the
largest impact from sched_yield.

As for real applications which use sched_yield, mostly they are not open source.
Yesterday, I got to know someone was using sched_yield in his network C programs,
but he didn't want to share the sources with me.

>
>
> > > > Today's kernel has a somewhat different behavior (and before people
> > > > scream "regression": sched_yield() behavior isn't really specified and
> > > > doesn't make any sense at all, whatever you get is what you get....
> > > > it's pretty much an insane de facto behavior that is incredibly tied to
> > > > which decisions the scheduler makes and how, and no app can depend on that
> > >
> > > It is a performance regression. Is there any reason *not* to use the
> > > "compat" yield by default?
> >
> > There is none, so I suggest setting sched_compat_yield=1 by default.
> > If sched_compat_yield=0, the kernel does almost nothing but return. When
> > sched_compat_yield=1, it is closer to the meaning of the sched_yield man page.
>
> sched_yield() is really only defined for posix realtime scheduling
> AFAIK, which talks about priority lists.
>
> SCHED_OTHER is defined to be a single priority, below the rest of the
> realtime priorities. So at first you *might* say that the process
> should then be made to run only after all other SCHED_OTHER processes,
> however there is no such ordering requirement for SCHED_OTHER
> scheduling. The SCHED_OTHER scheduler can run any task at any time.
>
> That said, I think people would *expect* that call to be much closer to
> the compat behaviour than the current default. And that's definitely
> what Linux has done in the past. So there really does need to be a
> good reason to change it like this IMO.
That's indeed what I am thinking.

I am running many tests (SPECjbb/SPECjbb2005/cpu2000/iozone/dbench/tbench...) to
see if there is any regression with sched_compat_yield=1. I think there is no
regression, and the testing is just to double-check.

>
>
> > > As you say, for SCHED_OTHER tasks, yield
> > > can do almost anything. We may as well do something that isn't a
> > > regression...
> >
> > I just found SCHED_OTHER in man sched_setscheduler. Is it SCHED_NORMAL in
> > the latest kernel?
>
> Yes, SCHED_NORMAL is SCHED_OTHER. Don't know why it got renamed...
Thanks.

2007-11-30 10:09:18

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Nick Piggin <[email protected]> wrote:

> Haven't we been asking JVMs to use futexes or posix locking for years
> and years now? [...]

i'm curious, with what JVM was it tested and where's the source so i can
fix their locking for them? Can the problem be reproduced with:

http://download.fedora.redhat.com/pub/fedora/linux/development/source/SRPMS/java-1.7.0-icedtea-1.7.0.0-0.20.b23.snapshot.fc9.src.rpm

?

Ingo

2007-12-03 04:28:18

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > Haven't we been asking JVMs to use futexes or posix locking for years
> > and years now? [...]
>
> i'm curious, with what JVM was it tested and where's the source so i can
> fix their locking for them? Can the problem be reproduced with:

Sure, but why shouldn't the compat behaviour be the default, and the
sysctl go away?

It makes older JVMs work better, it is slightly closer to the old
behaviour, and it is arguably a less surprising result.

2007-12-03 08:46:24

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Nick Piggin <[email protected]> wrote:

> On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > * Nick Piggin <[email protected]> wrote:
> > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > and years now? [...]
> >
> > i'm curious, with what JVM was it tested and where's the source so i
> > can fix their locking for them? Can the problem be reproduced with:
>
> Sure, but why shouldn't the compat behaviour be the default, and the
> sysctl go away?
>
> It makes older JVMs work better, it is slightly closer to the old
> behaviour, and it is arguably a less surprising result.

as far as desktop apps such as firefox go, the exact opposite is true.
We had two choices basically: either a "more aggressive" yield than
before or a "less aggressive" yield. Desktop apps were reported to hurt
from a "more aggressive" yield (firefox for example gets some pretty bad
delays), so we defaulted to the less aggressive method. (and we defaulted
to that in v2.6.23 already) Really, in this sense volanomark is another
test like dbench - we care about it but not unconditionally and in this
case it's a really silly API use that is at the center of the problem.
Talking about the default alone will not bring us forward, but we can
certainly add helpers to identify SCHED_OTHER::yield tasks - a once per
bootup warning perhaps?

Ingo

2007-12-03 09:17:39

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Monday 03 December 2007 19:45, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > * Nick Piggin <[email protected]> wrote:
> > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > and years now? [...]
> > >
> > > i'm curious, with what JVM was it tested and where's the source so i
> > > can fix their locking for them? Can the problem be reproduced with:
> >
> > Sure, but why shouldn't the compat behaviour be the default, and the
> > sysctl go away?
> >
> > It makes older JVMs work better, it is slightly closer to the old
> > behaviour, and it is arguably a less surprising result.
>
> as far as desktop apps such as firefox go, the exact opposite is true.
> We had two choices basically: either a "more aggressive" yield than
> before or a "less aggressive" yield. Desktop apps were reported to hurt
> from a "more aggressive" yield (firefox for example gets some pretty bad
> delays), so we defaulted to the less aggressive method. (and we defaulted
> to that in v2.6.23 already)

Yeah, I doubt the 2.6.23 scheduler will be usable for distros though...


> Really, in this sense volanomark is another
> test like dbench - we care about it but not unconditionally and in this
> case it's a really silly API use that is at the center of the problem.

Sure, but do you know whether _real_ java server applications are OK? Is it
possible to reduce the aggressiveness of yield to a mid-way point? Are the
firefox tests also like dbench (ie. were they done with make -j huge or
some other insane scheduler loads)?


> Talking about the default alone will not bring us forward, but we can
> certainly add helpers to identify SCHED_OTHER::yield tasks - a once per
> bootup warning perhaps?

I don't care about keeping the behaviour for future apps. But for older
code out there, it is very important to still work well.

I was just talking about the default because I didn't know the reason
for the way it was set -- now that I do, we should talk about trying to
improve the actual code so we don't need 2 defaults.

2007-12-03 09:30:58

by Yanmin Zhang

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Fri, 2007-11-30 at 11:08 +0100, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
> > Haven't we been asking JVMs to use futexes or posix locking for years
> > and years now? [...]
>
> i'm curious, with what JVM was it tested and where's the source so i can
> fix their locking for them? Can the problem be reproduced with:
>
> http://download.fedora.redhat.com/pub/fedora/linux/development/source/SRPMS/java-1.7.0-icedtea-1.7.0.0-0.20.b23.snapshot.fc9.src.rpm
I used BEA JRockit to run volanoMark. Because there is no JRockit source code,
I retested volanoMark with the jre-1.7.0-icedtea.x86_64 java of Fedora Core 8 on my
Stoakley (8-core) machine with kernel 2.6.24-rc3.

1) JRockit: sched_compat_yield=0's result is less than 15% of sched_compat_yield=1's.
2) jre-1.7.0-icedtea: sched_compat_yield=0's result is less than 89% of sched_compat_yield=1's.

So the JVM really has a big impact on the regression.

I checked the source code of OpenJDK and found Thread.yield is implemented as a
native sched_yield. If Java applications call Thread.yield, it just calls
sched_yield. Garbage collection and other JVM threads also call Thread.yield.
That's why the 2 different JVMs have different regression percentages.
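
(The mapping described here is essentially one hop: Thread.yield() reaches a
per-OS hook in the VM which, on Linux, is little more than the following -
a simplified paraphrase of HotSpot's os::yield() in os_linux.cpp, not
verbatim OpenJDK source.)

#include <sched.h>

void os_yield(void)
{
	sched_yield();	/* Thread.yield() bottoms out here */
}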

Although there is no source code for volanoMark, I suspect it calls Thread.yield.
volanoMark is a kind of chatroom benchmark. When a client sends out a message,
the server sends the message to all clients. I suspect the client calls
Thread.yield after sending out a couple of messages.

Both JVMs have a regression if sched_compat_yield=0.

I ran some tests, such as iozone/specjbb/tbench/dbench/sysbench, and didn't see any regression.

-yanmin

2007-12-03 09:37:50

by Yanmin Zhang

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Mon, 2007-12-03 at 20:17 +1100, Nick Piggin wrote:
> On Monday 03 December 2007 19:45, Ingo Molnar wrote:
> > * Nick Piggin <[email protected]> wrote:
> > > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > > * Nick Piggin <[email protected]> wrote:
> > > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > > and years now? [...]
> > > >
> > > > i'm curious, with what JVM was it tested and where's the source so i
> > > > can fix their locking for them? Can the problem be reproduced with:
> > >
> > > Sure, but why shouldn't the compat behaviour be the default, and the
> > > sysctl go away?
> > >
> > > It makes older JVMs work better, it is slightly closer to the old
> > > behaviour, and it is arguably a less surprising result.
> >
> > as far as desktop apps such as firefox go, the exact opposite is true.
> > We had two choices basically: either a "more aggressive" yield than
> > before or a "less aggressive" yield. Desktop apps were reported to hurt
> > from a "more aggressive" yield (firefox for example gets some pretty bad
> > delays), so we defaulted to the less aggressive method. (and we defaulted
> > to that in v2.6.23 already)
>
> Yeah, I doubt the 2.6.23 scheduler will be usable for distros though...
>
>
> > Really, in this sense volanomark is another
> > test like dbench - we care about it but not unconditionally and in this
> > case it's a really silly API use that is at the center of the problem.
>
> Sure, but do you know whether _real_ java server applications are OK?
I did a simple check of the OpenJDK source code, and the garbage collector calls
Thread.yield. It really has a big impact on both JRockit and OpenJDK, although
the regression percentage is different.

2007-12-03 09:42:44

by Yanmin Zhang

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Mon, 2007-12-03 at 09:45 +0100, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
>
> > On Friday 30 November 2007 21:08, Ingo Molnar wrote:
> > > * Nick Piggin <[email protected]> wrote:
> > > > Haven't we been asking JVMs to use futexes or posix locking for years
> > > > and years now? [...]
> > >
> > > i'm curious, with what JVM was it tested and where's the source so i
> > > can fix their locking for them? Can the problem be reproduced with:
> >
> > Sure, but why shouldn't the compat behaviour be the default, and the
> > sysctl go away?
> >
> > It makes older JVMs work better, it is slightly closer to the old
> > behaviour, and it is arguably a less surprising result.
>
> as far as desktop apps such as firefox go, the exact opposite is true.
> We had two choices basically: either a "more aggressive" yield than
> before or a "less aggressive" yield. Desktop apps were reported to hurt
> from a "more aggressive" yield (firefox for example gets some pretty bad
> delays),
Why not change the source code of firefox? If sched_compat_yield=0,
sys_sched_yield does almost nothing but return, so firefox could just
not call sched_yield. I assume 'sched_compat_yield=0' ~ no_call_to_sched_yield.

It's easier to delete calls to sched_yield in applications than to tune
calls to sched_yield.

> so we defaulted to the less aggressive method. (and we defaulted
> to that in v2.6.23 already) Really, in this sense volanomark is another
> test like dbench - we care about it but not unconditionally and in this
> case it's a really silly API use that is at the center of the problem.
> Talking about the default alone will not bring us forward, but we can
> certainly add helpers to identify SCHED_OTHER::yield tasks - a once per
> bootup warning perhaps?
>
> Ingo

2007-12-03 09:57:48

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Nick Piggin <[email protected]> wrote:

> > as far as desktop apps such as firefox go, the exact opposite is
> > true. We had two choices basically: either a "more aggressive" yield
> > than before or a "less aggressive" yield. Desktop apps were reported
> > to hurt from a "more aggressive" yield (firefox for example gets some
> > pretty bad delays), so we defaulted to the less aggressive method.
> > (and we defaulted to that in v2.6.23 already)
>
> Yeah, I doubt the 2.6.23 scheduler will be usable for distros
> though...

... which is a pretty gross exaggeration belied by distros already
running v2.6.23. Sure, "enterprise" distros might not run .23 (or .22 or
.21 or .24) because those are slow to adopt and pick _one_ upstream
kernel every 10 releases without bothering much about anything
inbetween. So the enterprise distros might in fact want to see 1-2
iterations of the scheduler before they switch to it. (But by that
argument 80% of the other upstream kernels were not used by enterprise
distros either, so this is nothing new.)

> I was just talking about the default because I didn't know the reason
> for the way it was set -- now that I do, we should talk about trying
> to improve the actual code so we don't need 2 defaults.

I've got the patch below queued up: it uses the more aggressive yield
implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
differentiator, it's an "I don't care about latency, it's all about
throughput for me" signal from the application.

But first and foremost, do you realize that there will be no easy
solutions to this topic, that it's not just about 'flipping a default'?

Ingo

-------------->
Subject: sched: default to more aggressive yield for SCHED_BATCH tasks
From: Ingo Molnar <[email protected]>

do more aggressive yield for SCHED_BATCH tasks.

Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched_fair.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

Index: linux/kernel/sched_fair.c
===================================================================
--- linux.orig/kernel/sched_fair.c
+++ linux/kernel/sched_fair.c
@@ -824,8 +824,9 @@ static void dequeue_task_fair(struct rq
  */
 static void yield_task_fair(struct rq *rq)
 {
-	struct cfs_rq *cfs_rq = task_cfs_rq(rq->curr);
-	struct sched_entity *rightmost, *se = &rq->curr->se;
+	struct task_struct *curr = rq->curr;
+	struct cfs_rq *cfs_rq = task_cfs_rq(curr);
+	struct sched_entity *rightmost, *se = &curr->se;
 
 	/*
 	 * Are we the only task in the tree?
@@ -833,7 +834,7 @@ static void yield_task_fair(struct rq *r
 	if (unlikely(cfs_rq->nr_running == 1))
 		return;
 
-	if (likely(!sysctl_sched_compat_yield)) {
+	if (likely(!sysctl_sched_compat_yield) && curr->policy != SCHED_BATCH) {
 		__update_rq_clock(rq);
 		/*
 		 * Update run-time statistics of the 'current'.
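
(For reference, an application opts into SCHED_BATCH with the standard
sched_setscheduler() API; a minimal sketch, independent of the patch above:)

#define _GNU_SOURCE	/* for SCHED_BATCH */
#include <sched.h>
#include <stdio.h>

int main(void)
{
	/* SCHED_BATCH carries no realtime priority; it must be 0 */
	struct sched_param sp = { .sched_priority = 0 };

	if (sched_setscheduler(0, SCHED_BATCH, &sp) == -1) {
		perror("sched_setscheduler");
		return 1;
	}
	/* ... run the throughput-bound, yield-heavy workload here ... */
	return 0;
}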

2007-12-03 10:06:25

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Zhang, Yanmin <[email protected]> wrote:

> Although there is no source code for volanoMark, I suspect it calls
> Thread.yield. volanoMark is a kind of chatroom benchmark. When a
> client sends out a message, the server sends the message to all
> clients. I suspect the client calls Thread.yield after sending out a
> couple of messages.

yeah, so far only volanomark seems to be affected by this, and if it
indeed calls Thread.yield artificially it's a pretty stupid benchmark
and it's not the fault of the JDK. If we had the source to volanomark we
could fix this easily.

> Both JVMs have a regression if sched_compat_yield=0.
>
> I ran some tests, such as iozone/specjbb/tbench/dbench/sysbench,
> and didn't see any regression.

which JVM was utilized by the specjbb (Java Business Benchmark) tests?

Ingo

2007-12-03 10:16:07

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Monday 03 December 2007 20:57, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > > as far as desktop apps such as firefox go, the exact opposite is
> > > true. We had two choices basically: either a "more aggressive" yield
> > > than before or a "less aggressive" yield. Desktop apps were reported
> > > to hurt from a "more aggressive" yield (firefox for example gets some
> > > pretty bad delays), so we defaulted to the less aggressive method.
> > > (and we defaulted to that in v2.6.23 already)
> >
> > Yeah, I doubt the 2.6.23 scheduler will be usable for distros
> > though...
>
> ... which is a pretty gross exaggeration belied by distros already
> running v2.6.23. Sure, "enterprise" distros might not run .23 (or .22 or

Yeah, that's what I mean of course. And it's because of the performance
and immediate upstream divergence issues with 2.6.23. Specifically I'm
talking about the scheduler: they may run a base 2.6.23, but it would
likely have most or all subsequent scheduler patches.


> > I was just talking about the default because I didn't know the reason
> > for the way it was set -- now that I do, we should talk about trying
> > to improve the actual code so we don't need 2 defaults.
>
> I've got the patch below queued up: it uses the more aggressive yield
> implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
> differentiator, it's an "I don't care about latency, it's all about
> throughput for me" signal from the application.

First and foremost, do you realize that I'm talking about existing
userspace working well on future kernels, right? (ie. backwards
compatibility).


> But first and foremost, do you realize that there will be no easy
> solutions to this topic, that it's not just about 'flipping a default'?

Of course ;) I already answered that in the email that you're replying
to:

> > I was just talking about the default because I didn't know the reason
> > for the way it was set -- now that I do, we should talk about trying
> > to improve the actual code so we don't need 2 defaults.

Anyway, I'd hope it can actually be improved and even the sysctl
removed completely.

2007-12-03 10:17:29

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Zhang, Yanmin <[email protected]> wrote:

> > as far as desktop apps such as firefox go, the exact opposite is
> > true. We had two choices basically: either a "more aggressive" yield
> > than before or a "less aggressive" yield. Desktop apps were reported
> > to hurt from a "more aggressive" yield (firefox for example gets some
> > pretty bad delays),
>
> Why not change the source code of firefox? [...]

because we care a heck of a lot more about a widely used open-source
package's default "user experience" than we care about closed-source
volanomark scores...

do you realize the absurdity of that suggestion: in essence we'd punish
firefox _because it is open-source and can be changed_. So basically
firefox would get a more preferential treatment if it was closed-source
and could not be changed? That's totally backwards.

> If sched_compat_yield=0, sys_sched_yield does almost nothing
> but return, so firefox could just not call sched_yield. I assume
> 'sched_compat_yield=0' ~ no_call_to_sched_yield.
>
> It's easier to delete calls to sched_yield in applications than to
> tune calls to sched_yield.

We are not at all worried about punishing silly benchmark behavior - and
volanomark's call to Thread.yield (if that's indeed what is happening -
could you try to trace it to make sure?) is outright silly. There are
other chatroom benchmarks such as hackbench.c and hackbench_pth.c that i
test frequently, and they are not affected by any yield details. (and
even then it's still taken with a grain of salt - remember dbench)

Ingo

2007-12-03 10:33:58

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Nick Piggin <[email protected]> wrote:

> > > I was just talking about the default because I didn't know the
> > > reason for the way it was set -- now that I do, we should talk
> > > about trying to improve the actual code so we don't need 2
> > > defaults.
> >
> > I've got the patch below queued up: it uses the more aggressive yield
> > implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
> > differentiator, it's an "I don't care about latency, it's all about
> > throughput for me" signal from the application.
>
> First and foremost, do you realize that I'm talking about existing
> userspace working well on future kernels, right? (ie. backwards
> compatibility).

given how poorly sched_yield() is/was defined the only "compatible"
solution would be to go back to the old yield code. (And note that you
are rehashing arguments that were covered on lkml months ago already.)

> > But first and foremost, do you realize that there will be no easy
> > solutions to this topic, that it's not just about 'flipping a
> > default'?
>
> Of course ;) I already answered that in the email that you're replying
> to:
>
> > > I was just talking about the default because I didn't know the
> > > reason for the way it was set -- now that I do, we should talk
> > > about trying to improve the actual code so we don't need 2
> > > defaults.

well, in case you were wondering why i was a bit pointy about this, this
topic of yield has been covered on lkml quite extensively a couple of
months ago. I assumed you knew about that already, but perhaps not?

> Anyway, I'd hope it can actually be improved and even the sysctl
> removed completely.

i think the sanest long-term solution is to strongly discourage the use
of SCHED_OTHER::yield, because there's just no sane definition for yield
that apps could rely upon. (well Linus suggested a pretty sane
definition but that would necessitate the burdening of the scheduler
fastpath - we don't want to do that.) New ideas are welcome of course.

[ also, actual technical feedback on the SCHED_BATCH patch i sent (which
was the only "forward looking" moment in this thread so far ;-) would
be nice too. ]

Ingo

2007-12-03 11:03:00

by Nick Piggin

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Monday 03 December 2007 21:33, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > > > I was just talking about the default because I didn't know the
> > > > reason for the way it was set -- now that I do, we should talk
> > > > about trying to improve the actual code so we don't need 2
> > > > defaults.
> > >
> > > I've got the patch below queued up: it uses the more aggressive yield
> > > implementation for SCHED_BATCH tasks. SCHED_BATCH is a natural
> > > differentiator, it's an "I don't care about latency, it's all about
> > > throughput for me" signal from the application.
> >
> > First and foremost, do you realize that I'm talking about existing
> > userspace working well on future kernels, right? (ie. backwards
> > compatibility).
>
> given how poorly sched_yield() is/was defined the only "compatible"
> solution would be to go back to the old yield code.

While it is technically allowed to do anything with SCHED_OTHER class,
putting the thread to the back of the runnable tasks, or at least having
it give up _some_ priority (like the old scheduler) is less surprising
than having it do _nothing_.

I mean, if firefox really works best if sched_yield does nothing, it
surely shouldn't be calling it at all (nothing to do with it being open
source or not).

Whereas JVMs (eg. that have garbage collectors call yield), presumably
get quite a lot of tuning, and that was probably done with the less
surprising (and more common) sched_yield behaviour.


> (And note that you
> are rehashing arguments that were covered on lkml months ago already.)

I'm just wondering whether the outcome was the right one.


> > > But first and foremost, do you realize that there will be no easy
> > > solutions to this topic, that it's not just about 'flipping a
> > > default'?
> >
> > Of course ;) I already answered that in the email that you're replying
> >
> > to:
> > > > I was just talking about the default because I didn't know the
> > > > reason for the way it was set -- now that I do, we should talk
> > > > about trying to improve the actual code so we don't need 2
> > > > defaults.
>
> well, in case you were wondering why i was a bit pointy about this, this
> topic of yield has been covered on lkml quite extensively a couple of
> months ago. I assumed you knew about that already, but perhaps not?

I did, but I haven't always followed the scheduler discussions closely
recently. I was surprised to find it hasn't changed much.

I appreciate you can never do exactly the right thing for everyone and
you can't (and don't want, by definition) to make behaviour exactly the
same.

Clearly the current default is far less aggressive (almost noop), and the
compat behaviour is probably more aggressive in most cases than the old
scheduler. I would have thought looking for a middle ground might be a
good idea.

Or just ignore firefox and get them to fix it, if the occasional stalls
are during really high scheduler stressing workloads (do you have a pointer
to that thread, btw?).


> > Anyway, I'd hope it can actually be improved and even the sysctl
> > removed completely.
>
> i think the sanest long-term solution is to strongly discourage the use
> of SCHED_OTHER::yield, because there's just no sane definition for yield
> that apps could rely upon. (well Linus suggested a pretty sane
> definition but that would necessitate the burdening of the scheduler
> fastpath - we don't want to do that.) New ideas are welcome of course.

sched_yield is defined to put the calling task at the end of the queue for
the given priority level as you know (ie. at the end of all other priority
0 tasks, for SCHED_OTHER).

So, while SCHED_OTHER technically allows _any_ task to be picked, I think
it would be least surprising to have the calling task go to the end of the
queue, rather than not doing very much at all...


> [ also, actual technical feedback on the SCHED_BATCH patch i sent (which
> was the only "forward looking" moment in this thread so far ;-) would
> be nice too. ]

I dislike a wholesale change in behaviour like that. Especially when it
is changing behaviour of yield among SCHED_BATCH tasks versus yield among
SCHED_OTHER tasks.

2007-12-03 11:37:47

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Nick Piggin <[email protected]> wrote:

> > given how poorly sched_yield() is/was defined the only "compatible"
> > solution would be to go back to the old yield code.
>
> While it is technically allowed to do anything with SCHED_OTHER class,
> putting the thread to the back of the runnable tasks, or at least
> having it give up _some_ priority (like the old scheduler) is less
> surprising than having it do _nothing_.

wrong: it's not "nothing" that the new code does - run two yield-ing
loops and they'll happily switch to each other, at a rate of a few
million context switches per second.
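
(This claim is easy to check with a small harness like the hedged sketch
below - it assumes Linux and pthreads; watch voluntary_ctxt_switches in
/proc/<pid>/status while it runs. Compile with -lpthread.)

#define _GNU_SOURCE	/* for pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void *yielder(void *name)
{
	cpu_set_t set;
	unsigned long n = 0;

	CPU_ZERO(&set);
	CPU_SET(0, &set);	/* put both loops on the same CPU */
	pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

	for (;;) {
		sched_yield();	/* the two loops switch to each other here */
		if (++n % 10000000UL == 0)
			printf("%s: %lu yields\n", (const char *)name, n);
	}
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, yielder, "loop-1");
	pthread_create(&b, NULL, yielder, "loop-2");
	pthread_join(a, NULL);	/* runs until interrupted */
	return 0;
}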

( Note that the old scheduler's yield code did not actually change a
task's priority - so if an interactive task (such as firefox) yielded,
it got different behavior than CPU hogs. )

> Whereas JVMs (eg. that have garbage collectors call yield), presumably
> get quite a lot of tuning, and that was probably done with the less
> surprising (and more common) sched_yield behaviour.

i disagree. To some of them, having a _more_ aggressive yield than 2.6.22
might increase latencies and jitter - which can be seen as a regression
as well. All tests i've seen so far show dramatically lower jitter in
v2.6.23 and upwards kernels.

anyway, right now what we have is a closed-source benchmark (which is a
quite silly one as well) against a popular open-source desktop app and
in that case the choice is obvious. Actual Java app server benchmarks
did not show any regression so maybe Java's use of yield for locking is
not that significant after all and it's only Volanomark that is doing
extra (unnecessary) yields. (and java benchmarks are part of the
upstream kernel test grid anyway so we'd have noticed any serious
regression)

if you insist on flipping the default then that just shows a blatant
disregard for desktop performance - i personally care quite a bit about
desktop performance. (and deterministic scheduling in particular)

> > i think the sanest long-term solution is to strongly discourage the
> > use of SCHED_OTHER::yield, because there's just no sane definition
> > for yield that apps could rely upon. (well Linus suggested a pretty
> > sane definition but that would necessitate the burdening of the
> > scheduler fastpath - we don't want to do that.) New ideas are welcome
> > of course.
>
> sched_yield is defined to put the calling task at the end of the queue
> for the given priority level as you know (ie. at the end of all other
> priority 0 tasks, for SCHED_OTHER).

almost: substitute "priority" with "nice level". One problem is, that's
not what the old scheduler did.

> > [ also, actual technical feedback on the SCHED_BATCH patch i sent
> > (which was the only "forward looking" moment in this thread so far
> > ;-) would be nice too. ]
>
> I dislike a wholesale change in behaviour like that. Especially when
> it is changing behaviour of yield among SCHED_BATCH tasks versus yield
> among SCHED_OTHER tasks.

There's no wholesale change in behavior; SCHED_BATCH tasks have a clear
expectation of throughput over latency, hence the patch makes quite a
bit of sense to me. YMMV.

Ingo

2007-12-03 17:04:40

by David Schwartz

Subject: RE: sched_yield: delete sysctl_sched_compat_yield


I've asked versions of this question at least three times and never gotten
anything approaching a straight answer:

1) What is the current default 'sched_yield' behavior?

2) What is the current alternate 'sched_yield' behavior?

3) Are either of them sensible? Simply acting as if the current thread's
timeslice was up should be sufficient.

The implication I keep getting is that neither the default behavior nor the
alternate behavior is sensible. What is so hard about simply scheduling the
next thread?

We don't need perfection, but it sounds like we have two alternatives of
which neither is sensible.

DS

2007-12-03 17:38:21

by Chris Friesen

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

David Schwartz wrote:
> I've asked versions of this question at least three times and never gotten
> anything approaching a straight answer:
>
> 1) What is the current default 'sched_yield' behavior?
>
> 2) What is the current alternate 'sched_yield' behavior?

I'm pretty sure I've seen responses from Ingo describing this multiple
times in various threads. Google should have them.

If I remember right, the default is to simply recalculate the task's
position in the tree and reinsert it, and the alternate is to yield to
everything currently runnable.
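
(In code terms, the two behaviours described here correspond roughly to the
following paraphrase of yield_task_fair() in kernel/sched_fair.c - trimmed
for readability; see also Ingo's patch earlier in the thread:)

static void yield_task_fair(struct rq *rq)
{
	struct cfs_rq *cfs_rq = task_cfs_rq(rq->curr);
	struct sched_entity *rightmost, *se = &rq->curr->se;

	/* nobody to yield to? */
	if (unlikely(cfs_rq->nr_running == 1))
		return;

	if (likely(!sysctl_sched_compat_yield)) {
		/*
		 * Default: just refresh our position in the time-ordered
		 * tree; a task that has consumed little CPU stays leftmost
		 * and may well be picked again immediately.
		 */
		update_curr(cfs_rq);
		return;
	}
	/*
	 * Compat: jump past the rightmost entity, so that every task
	 * currently runnable gets to run before we do again.
	 */
	rightmost = __pick_last_entity(cfs_rq);
	if (unlikely(rightmost->vruntime < se->vruntime))
		return;		/* already rightmost */
	se->vruntime = rightmost->vruntime + 1;
}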

> 3) Are either of them sensible? Simply acting as if the current thread's
> timeslice was up should be sufficient.

The new scheduler doesn't really have a concept of "timeslice". This is
one of the core problems with determining what to do on sched_yield().

> The implication I keep getting is that neither the default behavior nor the
> alternate behavior is sensible. What is so hard about simply scheduling the
> next thread?

The problem is where do we insert the task that is yielding? CFS is
based around a tree structure ordered by time.

The old scheduler was priority-based, so you could essentially yield to
everyone of the same niceness level.

With the new scheduler, this would be possible, but would involve extra
work tracking the position of the rightmost task at each priority level.
This additional overhead is what Ingo is trying to avoid.

> We don't need perfection, but it sounds like we have two alternatives of
> which neither is sensible.

sched_yield() isn't a great API. It just says to delay the task,
without specifying how long or what the task is waiting *for*. Other
constructs are much more useful because they give the scheduler more
information with which to make a decision.

Chris

2007-12-03 19:12:31

by David Schwartz

Subject: RE: sched_yield: delete sysctl_sched_compat_yield


Chris Friesen wrote:

> David Schwartz wrote:

> > I've asked versions of this question at least three times and never
> > gotten anything approaching a straight answer:
> >
> > 1) What is the current default 'sched_yield' behavior?
> >
> > 2) What is the current alternate 'sched_yield' behavior?

> I'm pretty sure I've seen responses from Ingo describing this multiple
> times in various threads. Google should have them.

> If I remember right, the default is to simply recalculate the task's
> position in the tree and reinsert it, and the alternate is to yield to
> everything currently runnable.

The meaning of the default behavior then depends upon where in the tree it
reinserts it.

> > 3) Are either of them sensible? Simply acting as if the current thread's
> > timeslice was up should be sufficient.

> The new scheduler doesn't really have a concept of "timeslice". This is
> one of the core problems with determining what to do on sched_yield().

Then it should probably just not support 'sched_yield' and return ENOSYS.
Applications should work around an ENOSYS reply (since some versions of
Solaris return this, among other reasons). Perhaps for compatibility, it
could also yield 'lightly' just in case applications ignore the return
value.

It could also handle it the way it handles the smallest sleep time that it
supports. This is sub-optimal if no other tasks are ready-to-run at the same
static priority level and that might be an expensive check.
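
(A hedged sketch of the workaround described here: treat sched_yield() as
advisory and fall back to the shortest supported sleep when it is
unsupported. Illustrative code, not from any particular application.)

#include <errno.h>
#include <sched.h>
#include <time.h>

void portable_yield(void)
{
	if (sched_yield() == -1 && errno == ENOSYS) {
		/* smallest sleep the system supports; the kernel rounds
		 * this up to its timer granularity */
		struct timespec ts = { .tv_sec = 0, .tv_nsec = 1 };
		nanosleep(&ts, NULL);
	}
}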

If CFS really can't support sched_yield's semantics, then it should just
not, and that's that. Return ENOSYS and admit that the behavior sched_yield
is documented to have simply can't be supported by the scheduler.

> > The implication I keep getting is that neither the default behavior nor
> > the alternate behavior is sensible. What is so hard about simply
> > scheduling the next thread?

> The problem is where do we insert the task that is yielding? CFS is
> based around a tree structure ordered by time.

We put it exactly where we would have when its timeslice ran out. If we can
reward it a little bit, that's great. But if not, we can live with that.
Just imagine that the timer interrupt fired to indicate the end of the
thread's run time when the thread called 'sched_yield'.

> The old scheduler was priority-based, so you could essentially yield to
> everyone of the same niceness level.
>
> With the new scheduler, this would be possible, but would involve extra
> work tracking the position of the rightmost task at each priority level.
> This additional overhead is what Ingo is trying to avoid.

Then what does he do when the task runs out of run time? It's hard to
imagine we can't do that when the task calls sched_yield.

> > We don't need perfection, but it sounds like we have two alternatives of
> > which neither is sensible.

> sched_yield() isn't a great API.

I agree.

> It just says to delay the task,
> without specifying how long or what the task is waiting *for*.

That is not true. The task is waiting for something that will be done by
another thread that is ready-to-run and at the same priority level. The task
does not need to wait until the thing is guaranteed done but wishes to wait
until it is more likely to be done. This is an often-misused but sometimes
sensible thing to do.

I think the API gets blamed for two things that are not its fault:

1) It's often misunderstood and misused.

2) It was often chosen as a "best available" solution because no truly good
solutions were available.

> Other
> constructs are much more useful because they give the scheduler more
> information with which to make a decision.

Sure, if there is more information. But if all you really want to do is wait
until other threads at the same static priority level have had a chance to
run, then sched_yield is the right API.

DS

2007-12-03 19:57:27

by Chris Friesen

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

David Schwartz wrote:
> Chris Friesen wrote:


> If CFS really can't support sched_yield's semantics, then it should just
> not, and that's that. Return ENOSYS and admit that the behavior sched_yield
> is documented to have simply can't be supported by the scheduler.

That's just it though...sched_yield() with SCHED_OTHER doesn't have well
defined semantics, so we can do just about anything we want.

The issue is mostly how to work around existing apps that (invalidly)
expect certain behaviour from sched_yield().

>>The problem is where do we insert the task that is yielding? CFS is
>>based around a tree structure ordered by time.

> We put it exactly where we would have when its timeslice ran out. If we can
> reward it a little bit, that's great. But if not, we can live with that.
> Just imagine that the timer interrupt fired to indicate the end of the
> thread's run time when the thread called 'sched_yield'.

CFS doesn't really do "timeslice". But in essence what you are
describing is the default behaviour currently...it simply removes the
task from the tree and reinserts it based on how much cpu time it used up.

> Then what does he do when the task runs out of run time? It's hard to
> imagine we can't do that when the task calls sched_yield.

It gets reinserted into the tree at a position based on how much cpu
time it used. This is exactly the current sched_yield() behaviour.

>>It just says to delay the task,
>>without specifying how long or what the task is waiting *for*.

> That is not true. The task is waiting for something that will be done by
> another thread that is ready-to-run and at the same priority level. The task
> does not need to wait until the thing is guaranteed done but wishes to wait
> until it is more likely to be done. This is an often-misused but sometimes
> sensible thing to do.

The scheduler still doesn't know specifically what the task is waiting for.

> Sure, if there is more information. But if all you really want to do is wait
> until other threads at the same static priority level have had a chance to
> run, then sched_yield is the right API.

Technically, all of SCHED_OTHER has static priority level zero. Thus
the "right" thing to do is to allow all SCHED_OTHER tasks to run,
including the ones with the highest possible nice level.

This is the alternate implementation in the current code, but it has
latency implications that may be unexpected by applications written for
the previous 2.6 behaviour.

Chris

2007-12-03 21:39:34

by Mark Lord

Subject: Re: sched_yield: delete sysctl_sched_compat_yield

Chris Friesen wrote:
> David Schwartz wrote:
>> Chris Friesen wrote:
..
>>> The problem is where do we insert the task that is yielding? CFS is
>>> based around a tree structure ordered by time.
>
>> We put it exactly where we would have when its timeslice ran out. If
>> we can
>> reward it a little bit, that's great. But if not, we can live with that.
>> Just imagine that the timer interrupt fired to indicate the end of the
>> thread's run time when the thread called 'sched_yield'.
>
> CFS doesn't really do "timeslice". But in essence what you are
> describing is the default behaviour currently...it simply removes the
> task from the tree and reinserts it based on how much cpu time it used up.
>
>> Then what does he do when the task runs out of run time? It's hard to
>> imagine we can't do that when the task calls sched_yield.
>
> It gets reinserted into the tree at a position based on how much cpu
> time it used. This is exactly the current sched_yield() behaviour.
..

That's not the same thing at all.
I think that David is suggesting that the reinsertion logic
should pretend that the task used up all of the CPU time it
was offered in the slot leading up to the sched_yield() call.

If it did that, then the task would be far more likely not to
end up as the next task chosen to run.

Without doing that, the task is highly likely to be chosen
to run again immediately, as it will appear to have done
nothing since it was previously chosen -- and so the same
criteria will result in it being chosen again, and again,
and again, until it finally wastes enough cycles to not
be reinserted into the "currently active" slot of the tree.

Cheers

2007-12-03 21:48:35

by Ingo Molnar

Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Mark Lord <[email protected]> wrote:

> That's not the same thing at all. I think that David is suggesting
> that the reinsertion logic should pretend that the task used up all of
> the CPU time it was offered in the slot leading up to the
> sched_yield() call.

we have tried this too, and it has problems of its own (artificial
inflation of the vruntime metric and domino effects on other portions
of the scheduler). So this is a worse solution than what we have now. (and this
has all been pointed out in past discussions in which David
participated. I'll certainly reply to any genuinely new idea.)

Ingo

2007-12-03 21:57:46

by Mark Lord

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

Ingo Molnar wrote:
> * Mark Lord <[email protected]> wrote:
>
>> That's not the same thing at all. I think that David is suggesting
>> that the reinsertion logic should pretend that the task used up all of
>> the CPU time it was offered in the slot leading up to the
>> sched_yield() call.
>
> we have tried this too, and it has problems of its own (artificial
> inflation of the vruntime metric and domino effects on other portions
> of the scheduler). So this is a worse solution than what we have now. (and this
> has all been pointed out in past discussions in which David
> participated. I'll certainly reply to any genuinely new idea.)
..

Ack. And what of the suggestion to try to ensure that a yielding task
simply not end up as the very next one chosen to run? Maybe by swapping
it with another (adjacent?) task in the tree if it comes out on top again?

(I really don't know the proper terminology to use here,
but hopefully Ingo can translate that).

That's probably already been covered too, but are the prior conclusions still valid?

Thanks Ingo -- I *really* like this scheduler!

2007-12-03 22:06:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Mark Lord <[email protected]> wrote:

> Ack. And what of the suggestion to try to ensure that a yielding task
> simply not end up as the very next one chosen to run? Maybe by
> swapping it with another (adjacent?) task in the tree if it comes out
> on top again?

we did that too for quite some time in CFS - it was found to be "not
aggressive enough" by some folks and "too aggressive" by others. Then when
people started bickering over this we added these two simple corner
cases - switchable via a flag. (minimum aggression and maximum aggression)

> (I really don't know the proper terminology to use here, but hopefully
> Ingo can translate that).

the terminology you used is perfectly fine.

> Thanks Ingo -- I *really* like this scheduler!

heh, thanks :) For which workload does it make the biggest difference
for you? (and compared to what other scheduler you used before? 2.6.22?)

Ingo

2007-12-03 22:18:33

by Mark Lord

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

Ingo Molnar wrote:
> * Mark Lord <[email protected]> wrote:
>
>> Ack. And what of the suggestion to try to ensure that a yielding task
>> simply not end up as the very next one chosen to run? Maybe by
>> swapping it with another (adjacent?) task in the tree if it comes out
>> on top again?
>
> we did that too for quite some time in CFS - it was found to be "not
> aggressive enough" by some folks and "too aggressive" by others. Then when
> people started bickering over this we added these two simple corner
> cases - switchable via a flag. (minimum aggression and maximum aggression)
>
>> (I really don't know the proper terminology to use here, but hopefully
>> Ingo can translate that).
>
> the terminology you used is perfectly fine.
>
>> Thanks Ingo -- I *really* like this scheduler!
>
> heh, thanks :) For which workload does it make the biggest difference
> for you? (and compared to what other scheduler you used before? 2.6.22?)
..

Heh.. I'm just a very unsophisticated desktop user, and I like it when
Thunderbird and Firefox are unaffected by the "make -j3" kernel builds
that are often running in another window. BIG difference there.

And on the cool side, the Swarm game (swarm.swf) is a great example of
something that used to get jerky really fast whenever anything else was
running, and now it really doesn't seem to be affected by anything.
(I don't really play computer games, but this one has a very retro feel..).

Cheers

2007-12-03 22:34:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield


* Mark Lord <[email protected]> wrote:

>> heh, thanks :) For which workload does it make the biggest difference
>> for you? (and compared to what other scheduler you used before?
>> 2.6.22?)
> ..
>
> Heh.. I'm just a very unsophisticated desktop user, and I like it when
> Thunderbird and Firefox are unaffected by the "make -j3" kernel builds
> that are often running in another window. BIG difference there.
>
> And on the cool side, the Swarm game (swarm.swf) is a great example of
> something that used to get jerky really fast whenever anything else
> was running, and now it really doesn't seem to be affected by
> anything. (I don't really play computer games, but this one has a
> very retro feel..).

nice! Do you feel any difference between 2.6.23 and 2.6.24-rc for these
workloads? (if you've tried .24 already)

Ingo

2007-12-04 00:18:43

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Tuesday 04 December 2007 09:33, Ingo Molnar wrote:
> * Mark Lord <[email protected]> wrote:
> >> heh, thanks :) For which workload does it make the biggest difference
> >> for you? (and compared to what other scheduler you used before?
> >> 2.6.22?)
> >
> > ..
> >
> > Heh.. I'm just a very unsophisticated desktop user, and I like it when
> > Thunderbird and Firefox are unaffected by the "make -j3" kernel builds
> > that are often running in another window. BIG difference there.
> >
> > And on the cool side, the Swarm game (swarm.swf) is a great example of
> > something that used to get jerky really fast whenever anything else
> > was running, and now it really doesn't seem to be affected by
> > anything. (I don't really play computer games, but this one has a
> > very retro feel..).
>
> nice! Do you feel any difference between 2.6.23 and 2.6.24-rc for these
> workloads? (if you've tried .24 already)

And also, I wonder what the average timeslice and number of context
switches are between 2.6.22 and 2.6.23-4. Would be interesting to see.

2007-12-04 00:31:01

by David Schwartz

[permalink] [raw]
Subject: RE: sched_yield: delete sysctl_sched_compat_yield


> * Mark Lord <[email protected]> wrote:

> > Ack. And what of the suggestion to try to ensure that a yielding task
> > simply not end up as the very next one chosen to run? Maybe by
> > swapping it with another (adjacent?) task in the tree if it comes out
> > on top again?

> we did that too for quite some time in CFS - it was found to be "not
> agressive enough" by some folks and "too agressive" by others. Then when
> people started bickering over this we added these two simple corner
> cases - switchable via a flag. (minimum agression and maximum agression)

They are both correct. It is not aggressive enough if there are tasks other
than those two that are at the same static priority level and ready to run.
It is too aggressive if the task it is swapped with is at a lower static
priority level.

Perhaps it might be possible to scan for the task at the same static
priority level that is ready-to-run but last in line among other
ready-to-run tasks and put it after that task? I think that's about as close
as we can get to the POSIX-specified behavior.

> > Thanks Ingo -- I *really* like this scheduler!

Just in case this isn't clear, I like CFS too and sincerely appreciate the
work Ingo, Con, and others have done on it.

DS

2007-12-04 01:03:09

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Monday 03 December 2007 22:37, Ingo Molnar wrote:
> * Nick Piggin <[email protected]> wrote:
> > > given how poorly sched_yield() is/was defined the only "compatible"
> > > solution would be to go back to the old yield code.
> >
> > While it is technically allowed to do anything with SCHED_OTHER class,
> > putting the thread to the back of the runnable tasks, or at least
> > having it give up _some_ priority (like the old scheduler) is less
> > surprising than having it do _nothing_.
>
> wrong: it's not "nothing" that the new code does - run two yield-ing
> loops and they'll happily switch to each other, at a rate of a few
> million context switches per second.

OK, it's not nothing, it interacts with the quantisation of the
update granularity and wakeup granularity... It's definitely not
what would be expected if you didn't look at the implementation
though.
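
It's easy to demonstrate: run two copies of the loop below pinned to
the same cpu and watch the "ctxt" line in /proc/stat (or vmstat 1)
climb. A minimal sketch, error checking omitted:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
        cpu_set_t set;
        unsigned long n = 0;

        /* pin to cpu 0 so both copies must share one runqueue */
        CPU_ZERO(&set);
        CPU_SET(0, &set);
        sched_setaffinity(0, sizeof(set), &set);

        for (;;) {
                sched_yield();
                if (++n % 10000000UL == 0)
                        printf("%lu yields\n", n);
        }
}

Build with gcc -O2 -o yield yield.c and start two instances.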


> > Whereas JVMs (eg. that have garbage collectors call yield), presumably
> > get quite a lot of tuning, and that was probably done with the less
> > surprising (and more common) sched_yield behaviour.
>
> i disagree. To some of them, having a _more_ aggressive yield than 2.6.22
> might increase latencies and jitter - which can be seen as a regression
> as well. All tests i've seen so far show dramatically lower jitter in
> v2.6.23 and upwards kernels.

Right, so we should have one option with about the _same_ aggressiveness.
Doesn't that make sense?


> anyway, right now what we have is a closed-source benchmark (which is
> quite a silly one as well) against a popular open-source desktop app and
> in that case the choice is obvious. Actual Java app server benchmarks
> did not show any regression so maybe Java's use of yield for locking is
> not that significant after all and it's only Volanomark that is doing
> extra (unnecessary) yields. (and java benchmarks are part of the
> upstream kernel test grid anyway so we'd have noticed any serious
> regression)

Sure I'm not basing this purely on volanomark at all. If you've tested
a reasonable range of actual java app server benchmarks with a range of
jvms then fine.


> if you insist on flipping the default then that just shows a blatant
> disregard for desktop performance

That statement is true. But you know I'm not insisting on flipping
the default, so I don't see how it is relevant.

BTW. can you answer what workload did firefox see the sched_yield
pauses with, and/or where that thread is archived? I still think
firefox should not call sched_yield at all.


> > > i think the sanest long-term solution is to strongly discourage the
> > > use of SCHED_OTHER::yield, because there's just no sane definition
> > > for yield that apps could rely upon. (well Linus suggested a pretty
> > > sane definition but that would necessitate burdening the
> > > scheduler fastpath - we don't want to do that.) New ideas are welcome
> > > of course.
> >
> > sched_yield is defined to put the calling task at the end of the queue
> > for the given priority level as you know (ie. at the end of all other
> > priority 0 tasks, for SCHED_OTHER).
>
> almost: substitute "priority" with "nice level". One problem is, that's
> not what the old scheduler did.

I'm not sure if that's right. Posix realtime scheduling says that all
SCHED_OTHER tasks are priority 0. But I'm not much of a standards reader.
And even if it were just applied to a given nice level, that would be
more intuitive than the current default.
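
For what it's worth, here is what Linux itself reports for the
SCHED_OTHER static priority range (a trivial check; both bounds come
back 0, with nice being a separate, Linux-specific knob on top):

#include <sched.h>
#include <stdio.h>

int main(void)
{
        printf("SCHED_OTHER prio: min=%d max=%d\n",
               sched_get_priority_min(SCHED_OTHER),
               sched_get_priority_max(SCHED_OTHER));
        return 0;
}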


> > > [ also, actual technical feedback on the SCHED_BATCH patch i sent
> > > (which was the only "forward looking" moment in this thread so far
> > > ;-) would be nice too. ]
> >
> > I dislike a wholesale change in behaviour like that. Especially when
> > it is changing behaviour of yield among SCHED_BATCH tasks versus yield
> > among SCHED_OTHER tasks.
>
> There's no wholesale change in behavior, SCHED_BATCH tasks have clear
> expectations of throughput versus latency, hence the patch makes
> quite a bit of sense to me. YMMV.

sched_yield semantics are totally different depending on whether the
process is SCHED_BATCH or not. That's what I was calling a change in
behaviour, so arguing otherwise is just arguing semantics.

I just would think it isn't such a good thing if you suddenly got a
500% speedup by making your jvm SCHED_BATCH, only to find that it
stops working when your batch cron jobs or something start running...
But if there are no real jvm workloads that would see such a speedup,
then I guess the point is moot ;)
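
For anyone who wants to try the comparison, a minimal wrapper to exec
a program under SCHED_BATCH looks like this. A sketch only, with no
error handling (SCHED_BATCH always uses static priority 0 and needs a
2.6.16+ kernel):

#define _GNU_SOURCE
#include <sched.h>
#include <unistd.h>

int main(int argc, char *argv[])
{
        struct sched_param p = { .sched_priority = 0 };

        if (argc < 2)
                return 1;
        /* pid 0 == the calling process */
        sched_setscheduler(0, SCHED_BATCH, &p);
        execvp(argv[1], &argv[1]);
        return 1;       /* only reached if the exec failed */
}

Run it as e.g. "./batchify java ...", where batchify is just whatever
you name the binary.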

2007-12-04 02:10:05

by Nick Piggin

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Tuesday 04 December 2007 11:30, David Schwartz wrote:

> Perhaps it might be possible to scan for the task at the same static
> priority level that is ready-to-run but last in line among other
> ready-to-run tasks and put it after that task?

Nice level versus posix static priority level debate aside, this
is basically the exact behaviour the compat mode gives now, when
you have all tasks running at nice 0 (which I assume is essentially
the case in both the jvm and firefox tests) (some things, eg. kernel
threads or the X server could run at a higher prio, but these are
not the ones calling yield anyway...)


> I think that's about as
> close as we can get to the POSIX-specified behavior.

I don't think it is a question of POSIX being a bit fuzzy, or some
problem we have implementing it. It is explicitly specified to
allow any behaviour.

So the current default is not wrong, any more than the compat mode
is right.

2007-12-04 06:41:48

by Yanmin Zhang

[permalink] [raw]
Subject: Re: sched_yield: delete sysctl_sched_compat_yield

On Mon, 2007-12-03 at 11:05 +0100, Ingo Molnar wrote:
> * Zhang, Yanmin <[email protected]> wrote:
>
> > Although we have no source code for volanoMark, I suspect it calls
> > Thread.yield. volanoMark is a kind of chatroom benchmark. When a
> > client sends out a message, the server sends the message to all
> > clients. I suspect the client calls Thread.yield after sending out a
> > couple of messages.
>
> yeah, so far only volanomark seems to be affected by this, and if it
> indeed calls Thread.yield artificially it's a pretty stupid benchmark
> and it's not the fault of the JDK. If we had the source to volanomark we
> could fix this easily.
>
> > Both JVMs show the regression if sched_compat_yield=0.
> >
> > I ran some testing, such as iozone/specjbb/tbench/dbench/sysbench,
> > and didn't see any regression.
>
> which JVM was utilized by the specjbb (Java Business Benchmark) tests?
BEA Jrockit. It supports huge pages, which improve performance by about 8%~10%.

-yanmin