2007-09-26 08:52:34

by Martin Michlmayr

[permalink] [raw]
Subject: Network slowdown due to CFS

I noticed that my network performance has gone down from 2.6.22
from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.

I did a git bisect today and tracked it back to the commit where CFS
was enabled ("sched: cfs core code; apply the CFS core code",
commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a
kernel from
git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
but things don't improve.

This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
with a r8169 card, SATA disks and 512 MB RAM. My config is attached.

What kind of information can I supply so you can track this down?
--
Martin Michlmayr
http://www.cyrius.com/


Attachments:
(No filename) (805.00 B)
.config (30.42 kB)

2007-09-26 09:35:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Martin Michlmayr <[email protected]> wrote:

> I noticed that my network performance has gone down from 2.6.22
> from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
> to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
> with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.
>
> I did a git bisect today and tracked it back to the commit where CFS
> was enabled ("sched: cfs core code; apply the CFS core code",
> commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a
> kernel from
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
> but things don't improve.
>
> This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
> with a r8169 card, SATA disks and 512 MB RAM. My config is attached.
>
> What kind of information can I supply so you can track this down?

as a starter, could you boot the sched-devel.git kernel, with
CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y enabled and could you run
this script while the iperf test is in the middle of its testrun:

http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh

this will gather a good deal of info about the workload in question.
Please send me the resulting debug file.

Ingo

2007-09-26 09:47:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Ingo Molnar <[email protected]> wrote:

> > What kind of information can I supply so you can track this down?
>
> as a starter, could you boot the sched-devel.git kernel, with
> CONFIG_SCHED_DEBUG=y and CONFIG_SCHEDSTATS=y enabled and could you run
> this script while the iperf test is in the middle of its testrun:
>
> http://people.redhat.com/mingo/cfs-scheduler/tools/cfs-debug-info.sh
>
> this will gather a good deal of info about the workload in question.
> Please send me the resulting debug file.

Another thing: please also do the same with the vanilla v2.6.22 kernel,
and send me that file too. (so that the two cases can be compared)

Ingo

2007-09-26 10:09:21

by Martin Michlmayr

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

* Ingo Molnar <[email protected]> [2007-09-26 11:47]:
> > this will gather a good deal of info about the workload in question.
> > Please send me the resulting debug file.
> Another thing: please also do the same with the vanilla v2.6.22 kernel,
> and send me that file too. (so that the two cases can be compared)

I put the log files here:
http://www.cyrius.com/tmp/2.6.22
http://www.cyrius.com/tmp/2.6.23-rc8-sched-devel

I increased the time iperf ran to 30 secs since your script runs for
over 15 secs. I got:

[ 3] 0.0-30.0 sec 331 MBytes 92.6 Mbits/sec 2.6.22
vs
[ 3] 0.0-30.0 sec 222 MBytes 62.1 Mbits/sec 2.6.23-rc8-sched-devel

--
Martin Michlmayr
http://www.cyrius.com/

2007-09-26 10:19:17

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Martin Michlmayr <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2007-09-26 11:47]:
> > > this will gather a good deal of info about the workload in question.
> > > Please send me the resulting debug file.
> > Another thing: please also do the same with the vanilla v2.6.22 kernel,
> > and send me that file too. (so that the two cases can be compared)
>
> I put the log files here:
> http://www.cyrius.com/tmp/2.6.22
> http://www.cyrius.com/tmp/2.6.23-rc8-sched-devel
>
> I increased the time iperf ran to 30 secs since your script runs for
> over 15 secs. I got:
>
> [ 3] 0.0-30.0 sec 331 MBytes 92.6 Mbits/sec 2.6.22
> vs
> [ 3] 0.0-30.0 sec 222 MBytes 62.1 Mbits/sec 2.6.23-rc8-sched-devel

thanks!

the test does almost no context switches:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
r b swpd free buff cache si so bi bo in cs us sy id wa
2 0 0 462928 3280 37216 0 0 137 75 2306 83 13 38 47 2
2 0 0 462928 3280 37216 0 0 0 0 8600 54 6 94 0 0
2 0 0 462928 3280 37216 0 0 0 36 8667 55 7 93 0 0
2 0 0 462928 3280 37216 0 0 0 0 8592 53 5 95 0 0
2 0 0 462928 3280 37216 0 0 0 0 8638 52 7 93 0 0

(the 'cs' column shows 50-80 context switches per second.)

so there must be some other side-effect, not raw scheduling overhead or
some other direct scheduler performance problem.

Ingo

2007-09-26 10:20:44

by Mike Galbraith

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, 2007-09-26 at 10:52 +0200, Martin Michlmayr wrote:
> I noticed that my network performance has gone down from 2.6.22
> from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
> to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
> with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.

FWIW, on my box blasting localhost, I see the opposite, repeatably.

root@Homer: iperf -v
iperf version 2.0.2 (03 May 2005) pthreads

2.6.22.1-smp
for i in `seq 1 3`; do iperf -c localhost; done >> /xx

------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 21383 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.1 sec 518 MBytes 432 Mbits/sec
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 21384 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.0 sec 325 MBytes 273 Mbits/sec
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 21385 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.0 sec 434 MBytes 363 Mbits/sec

2.6.23-rc8-smp-d
for i in `seq 1 3`; do iperf -c localhost; done >> /xx

------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 11650 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.0 sec 2.01 GBytes 1.72 Gbits/sec
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 11651 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.0 sec 2.02 GBytes 1.74 Gbits/sec
------------------------------------------------------------
Client connecting to localhost, TCP port 5001
TCP window size: 49.2 KByte (default)
------------------------------------------------------------
[ 3] local 127.0.0.1 port 11652 connected with 127.0.0.1 port 5001
[ 3] 0.0-10.0 sec 2.10 GBytes 1.81 Gbits/sec


2007-09-26 10:23:39

by Mike Galbraith

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, 2007-09-26 at 10:52 +0200, Martin Michlmayr wrote:
> I noticed that my network performance has gone down from 2.6.22
> from [ 3] 0.0-10.0 sec 113 MBytes 95.0 Mbits/sec
> to [ 3] 0.0-10.0 sec 75.7 MBytes 63.3 Mbits/sec
> with 2.6.23-rc1 (and 2.6.23-rc8), as measured with iperf.
>
> I did a git bisect today and tracked it back to the commit where CFS
> was enabled ("sched: cfs core code; apply the CFS core code",
> commit dd41f596cda0d7d6e4a8b139ffdfabcefdd46528). I also compiled a
> kernel from
> git://git.kernel.org/pub/scm/linux/kernel/git/mingo/linux-2.6-sched-devel.git
> but things don't improve.
>
> This is on a Thecus N2100, an ARM (Intel IOP32x) based storage device
> with a r8169 card, SATA disks and 512 MB RAM. My config is attached.
>
> What kind of information can I supply so you can track this down?

I noticed on the iperf website a patch which contains sched_yield().

http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt

Do you have that patch applied by any chance? If so, it might be
worthwhile to try it without it.

-Mike

2007-09-26 10:48:41

by Martin Michlmayr

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

* Mike Galbraith <[email protected]> [2007-09-26 12:23]:
> I noticed on the iperf website a patch which contains sched_yield().
> http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
>
> Do you have that patch applied by any chance? If so, it might be
> worthwhile to try it without it.

Yes, this patch was applied. When I revert it, I get the same (high)
performance with both kernels.
--
Martin Michlmayr
http://www.cyrius.com/

2007-09-26 11:22:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Martin Michlmayr <[email protected]> wrote:

> * Mike Galbraith <[email protected]> [2007-09-26 12:23]:
> > I noticed on the iperf website a patch which contains sched_yield().
> > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
> >
> > Do you have that patch applied by any chance? If so, it might be
> > worthwhile to try it without it.
>
> Yes, this patch was applied. When I revert it, I get the same (high)
> performance with both kernels.

great! Could you try this too:

echo 1 > /proc/sys/kernel/sched_compat_yield

does it fix iperf performance too (with the yield patch applied to
iperf)?

I think the real fix would be for iperf to use blocking network IO
though, or maybe to use a POSIX mutex or POSIX semaphores.

Ingo
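
A minimal sketch of the blocking handshake suggested above (illustrative
only, not iperf code; the names are hypothetical): the producer signals
the reporter and then sleeps on a second condition variable until the
reporter has caught up, instead of spinning on sched_yield()/usleep(0).

#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t report_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  report_ready = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  report_done  = PTHREAD_COND_INITIALIZER;
static bool caught_up;

/* producer side: hand work to the reporter, then sleep until it caught up */
void wait_for_reporter(void)
{
    pthread_mutex_lock(&report_lock);
    caught_up = false;
    pthread_cond_signal(&report_ready);     /* wake the reporter        */
    while (!caught_up)                      /* block instead of polling */
        pthread_cond_wait(&report_done, &report_lock);
    pthread_mutex_unlock(&report_lock);
}

/* reporter side (its own loop would wait on report_ready before draining):
 * after draining its queue, release any waiting producer */
void reporter_done_with_batch(void)
{
    pthread_mutex_lock(&report_lock);
    caught_up = true;
    pthread_cond_signal(&report_done);
    pthread_mutex_unlock(&report_lock);
}

Ingo's actual iperf patch further down in the thread does essentially
this with iperf's own Condition_Wait()/Condition_Signal() wrappers and a
new ReportDoneCond condition.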

2007-09-26 11:29:49

by Martin Michlmayr

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

* Ingo Molnar <[email protected]> [2007-09-26 13:21]:
> > > I noticed on the iperf website a patch which contains sched_yield().
> > > http://dast.nlanr.net/Projects/Iperf2.0/patch-iperf-linux-2.6.21.txt
>
> great! Could you try this too:
> echo 1 > /proc/sys/kernel/sched_compat_yield
>
> does it fix iperf performance too (with the yield patch applied to
> iperf)?

Yes, this gives me good performance too.

> I think the real fix would be for iperf to use blocking network IO
> though, or maybe to use a POSIX mutex or POSIX semaphores.

So it's definitely not a bug in the kernel, only in iperf?

(CCing Stephen Hemminger who wrote the iperf patch.)
--
Martin Michlmayr
http://www.cyrius.com/

2007-09-26 12:01:16

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


> > I think the real fix would be for iperf to use blocking network IO
> > though, or maybe to use a POSIX mutex or POSIX semaphores.
>
> So it's definitely not a bug in the kernel, only in iperf?

Martin:

Actually, in this case I think iperf is doing the right thing (though not
the best thing) and the kernel is doing the wrong thing. It's calling
'sched_yield' to ensure that every other thread gets a chance to run before
the current thread runs again. CFS is not doing that, allowing the yielding
thread to hog the CPU to the exclusion of the other threads. (It can allow
the yielding thread to hog the CPU, of course, just not to the exclusion of
other threads.)

It's still better to use some kind of rational synchronization primitive
(like mutex/semaphore) so that the other threads can tell you when there's
something for you to do. It's still better to use blocking network IO, so
the kernel will let you know exactly when to try I/O and your dynamic
priority can rise.

Ingo:

Can you clarify what CFS' current default sched_yield implementation is and
what setting sched_compat_yield to 1 does? Which way do we get the right
semantics (all threads of equal static priority are scheduled, with some
possible SMP fuzziness, before this thread is scheduled again)?

DS


2007-09-26 13:32:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* David Schwartz <[email protected]> wrote:

> > > I think the real fix would be for iperf to use blocking network IO
> > > though, or maybe to use a POSIX mutex or POSIX semaphores.
> >
> > So it's definitely not a bug in the kernel, only in iperf?
>
> Martin:
>
> Actually, in this case I think iperf is doing the right thing (though not
> the best thing) and the kernel is doing the wrong thing. [...]

it's not doing the right thing at all. I had a quick look at the source
code, and the reason for that weird yield usage was that there's a
locking bug in iperf's "Reporter thread" abstraction and apparently
instead of fixing the bug it was worked around via a horrible yield()
based user-space lock.

the (small) patch below fixes the iperf locking bug and removes the
yield() use. There are numerous immediate benefits of this patch:

- iperf uses _much_ less CPU time. On my Core2Duo test system, before
the patch it used up 100% CPU time to saturate 1 gigabit of network
traffic to another box. With the patch applied it now uses 9% of
CPU time.

- sys_sched_yield() is removed altogether

- i was able to measure much higher bandwidth over localhost for
example. This is the case for over-the-network measurements as well.

- the results are also more consistent and more deterministic, hence
more reliable as a benchmarking tool. (the reason for that is that
more CPU time is spent on actually delivering packets, instead of
mindlessly polling on the user-space "lock", so we actually max out
the CPU, instead of relying on the random proportion the workload was
able to make progress versus wasting CPU time on polling.)

sched_yield() is almost always the symptom of broken locking or other
bug. In that sense CFS does the right thing by exposing such bugs B-)

Ingo

------------------------->
Subject: iperf: fix locking
From: Ingo Molnar <[email protected]>

fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.

Signed-off-by: Ingo Molnar <[email protected]>
---
compat/Thread.c | 3 ---
src/Reporter.c | 13 +++++++++----
src/main.cpp | 2 ++
3 files changed, 11 insertions(+), 7 deletions(-)

Index: iperf-2.0.2/compat/Thread.c
===================================================================
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
void thread_rest ( void ) {
#if defined( HAVE_THREAD )
#if defined( HAVE_POSIX_THREAD )
- // TODO add checks for sched_yield or pthread_yield and call that
- // if available
- usleep( 0 );
#else // Win32
SwitchToThread( );
#endif
Index: iperf-2.0.2/src/Reporter.c
===================================================================
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
char buffer[64]; // Buffer for printing
ReportHeader *ReportRoot = NULL;
extern Condition ReportCond;
+extern Condition ReportDoneCond;
int reporter_process_report ( ReportHeader *report );
void process_report ( ReportHeader *report );
int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent,
// item
while ( index == 0 ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent,
// Need to make sure that reporter is not about to be "lapped"
while ( index - 1 == agent->agentindex ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}

@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
}
Condition_Unlock ( ReportCond );

+again:
if ( ReportRoot != NULL ) {
ReportHeader *temp = ReportRoot;
//Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
// finished with report so free it
free( temp );
Condition_Unlock ( ReportCond );
+ Condition_Signal( &ReportDoneCond );
+ if (ReportRoot)
+ goto again;
}
- // yield control of CPU is another thread is waiting
- thread_rest();
+ Condition_Signal( &ReportDoneCond );
+ usleep(10000);
} else {
//Condition_Unlock ( ReportCond );
}
Index: iperf-2.0.2/src/main.cpp
===================================================================
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern "C" {
// records being accessed in a report and also to
// serialize modification of the report list
Condition ReportCond;
+ Condition ReportDoneCond;
}

// global variables only accessed within this file
@@ -141,6 +142,7 @@ int main( int argc, char **argv ) {

// Initialize global mutexes and conditions
Condition_Initialize ( &ReportCond );
+ Condition_Initialize ( &ReportDoneCond );
Mutex_Initialize( &groupCond );
Mutex_Initialize( &clients_mutex );

2007-09-26 15:41:28

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, 26 Sep 2007 15:31:38 +0200
Ingo Molnar <[email protected]> wrote:

>
> * David Schwartz <[email protected]> wrote:
>
> > > > I think the real fix would be for iperf to use blocking network
> > > > IO though, or maybe to use a POSIX mutex or POSIX semaphores.
> > >
> > > So it's definitely not a bug in the kernel, only in iperf?
> >
> > Martin:
> >
> > Actually, in this case I think iperf is doing the right thing
> > (though not the best thing) and the kernel is doing the wrong
> > thing. [...]
>
> it's not doing the right thing at all. I had a quick look at the
> source code, and the reason for that weird yield usage was that
> there's a locking bug in iperf's "Reporter thread" abstraction and
> apparently instead of fixing the bug it was worked around via a
> horrible yield() based user-space lock.
>
> the (small) patch below fixes the iperf locking bug and removes the
> yield() use. There are numerous immediate benefits of this patch:
>
> - iperf uses _much_ less CPU time. On my Core2Duo test system,
> before the patch it used up 100% CPU time to saturate 1 gigabit of
> network traffic to another box. With the patch applied it now uses 9%
> of CPU time.
>
> - sys_sched_yield() is removed altogether
>
> - i was able to measure much higher bandwidth over localhost for
> example. This is the case for over-the-network measurements as
> well.
>
> - the results are also more consistent and more deterministic, hence
> more reliable as a benchmarking tool. (the reason for that is that
> more CPU time is spent on actually delivering packets, instead of
> mindlessly polling on the user-space "lock", so we actually max out
> the CPU, instead of relying on the random proportion the workload
> was able to make progress versus wasting CPU time on polling.)
>
> sched_yield() is almost always the symptom of broken locking or other
> bug. In that sense CFS does the right thing by exposing such bugs B-)
>
> Ingo

A similar patch has already been submitted, since BSD wouldn't work
without it.

2007-09-26 15:47:18

by Stephen Hemminger

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Here are the combined fixes from the iperf-users list.

Begin forwarded message:

Date: Thu, 30 Aug 2007 15:55:22 -0400
From: "Andrew Gallatin" <[email protected]>
To: [email protected]
Subject: [PATCH] performance fixes for non-linux


Hi,

I've attached a patch which gives iperf similar performance to netperf
on my FreeBSD, MacOSX and Solaris hosts. It does not seem to
negatively impact Linux. I only started looking at the iperf source
yesterday, so I don't really expect this to be integrated as is, but a
patch is worth a 1000 words :)

Background: On both Solaris and FreeBSD, there are 2 things slowing
iperf down: The gettimeofday timestamp around each socket read/write
is terribly expensive, and the sched_yield() or usleep(0) causes iperf
to take 100% of the CPU (system time on BSD, split user/system time
on Solaris and MacOSX), which slows things down and confuses the
scheduler.

To address the gettimeofday() issue, I treat TCP differently than UDP,
and TCP tests behave as though only a single (huge) packet was sent.
Rather than ending the test based on polling gettimeofday()
timestamps, an interval timer / sigalarm handler is used. I had
to increase the packetLen from an int to a max_size_t.
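
The interval-timer idea described above could look roughly like this (an
illustrative sketch only, not the actual patch; the function names are
made up): arm a one-shot timer for the test duration and let the SIGALRM
handler end the send loop, so no gettimeofday() call is needed per write.

#include <signal.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

static volatile sig_atomic_t test_running = 1;

static void on_alarm(int sig)
{
    (void)sig;
    test_running = 0;                  /* just flag the send loop to stop */
}

void run_timed_tcp_send(int sock, const char *buf, size_t len, int seconds)
{
    struct sigaction sa;
    struct itimerval timer;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = on_alarm;          /* no SA_RESTART: write() sees EINTR */
    sigaction(SIGALRM, &sa, NULL);

    memset(&timer, 0, sizeof(timer));
    timer.it_value.tv_sec = seconds;   /* one-shot timer for the test length */
    setitimer(ITIMER_REAL, &timer, NULL);

    while (test_running) {
        if (write(sock, buf, len) < 0)
            break;                     /* stop on EINTR from the alarm or on error */
    }
}

With no SA_RESTART set, the alarm also interrupts a blocked write(), so
the loop terminates promptly even when the socket buffer is full.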

To address the sched_yield/usleep issue, I put the reporter thread
to sleep on a condition variable. For the TCP tests at least, there
is no reason to have it running during the test and it is best
to just get it out of the way rather than burning CPU in a tight
loop.

I've also incorporated some fixes from the FreeBSD ports collection:

--- include/headers.h
use a 64-bit type for max_size_t

--- compat/Thread.c
oldTID is not declared anywhere. Make this compile
(seems needed for at least FreeBSD & MacOSX)

--- src/Client.cpp
BSDs can return ENOBUFS during a UDP test when the socket
buffer fills. Don't exit when this happens.

I've run the resulting iperf on FreeBSD, Solaris, MacOSX and Linux,
and it seems to work for me. It is nice not to have a 100% CPU
load when running an iperf test across a 100Mb/s network.

Drew



Attachments:
(No filename) (2.01 kB)
non-linux.diff (8.97 kB)

2007-09-27 09:27:51

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On 26-09-2007 15:31, Ingo Molnar wrote:
> * David Schwartz <[email protected]> wrote:
>
>>>> I think the real fix would be for iperf to use blocking network IO
>>>> though, or maybe to use a POSIX mutex or POSIX semaphores.
>>> So it's definitely not a bug in the kernel, only in iperf?
>> Martin:
>>
>> Actually, in this case I think iperf is doing the right thing (though not
>> the best thing) and the kernel is doing the wrong thing. [...]
>
> it's not doing the right thing at all. I had a quick look at the source
> code, and the reason for that weird yield usage was that there's a
> locking bug in iperf's "Reporter thread" abstraction and apparently
> instead of fixing the bug it was worked around via a horrible yield()
> based user-space lock.
>
> the (small) patch below fixes the iperf locking bug and removes the
> yield() use. There are numerous immediate benefits of this patch:
...
>
> sched_yield() is almost always the symptom of broken locking or other
> bug. In that sense CFS does the right thing by exposing such bugs B-)

...Only if it were under some DEBUG option. Even if iperf is doing
the wrong thing there is no explanation for such a big difference in
the behavior between sched_compat_yield 1 vs. 0. It seems common
interfaces should work similarly and predictably on various
systems, and here, if I didn't miss something, linux looks like a
different kind?

Regards,
Jarek P.

2007-09-27 09:46:57

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> > the (small) patch below fixes the iperf locking bug and removes the
> > yield() use. There are numerous immediate benefits of this patch:
> ...
> >
> > sched_yield() is almost always the symptom of broken locking or other
> > bug. In that sense CFS does the right thing by exposing such bugs B-)
>
> ...Only if it were under some DEBUG option. [...]

note that i qualified my sentence both via "In that sense" and via a
smiley! So i was not suggesting that this is a general rule at all and i
was also joking :-)

> [...] Even if iperf is doing the wrong thing there is no explanation
> for such a big difference in the behavior between sched_compat_yield 1
> vs. 0. It seems common interfaces should work similarly and
> predictably on various systems, and here, if I didn't miss something,
> linux looks like a different kind?

What you missed is that there is no such thing as "predictable yield
behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does
keep the behavior). Please read this thread on lkml for a more detailed
background:

CFS: some bad numbers with Java/database threading [FIXED]

http://lkml.org/lkml/2007/9/19/357
http://lkml.org/lkml/2007/9/19/328

in short: the yield implementation was tied to the O(1) scheduler, so
the only way to have the exact same behavior would be to have the exact
same core scheduler again. If what you said was true we would not be
able to change the scheduler, ever. For something as vaguely defined of
an API as yield, there's just no way to have a different core scheduler
and still behave the same way.

So _generally_ i'd agree with you that normally we want to be bug for
bug compatible, but in this specific (iperf) case there's just no point
in preserving behavior that papers over this _clearly_ broken user-space
app/thread locking (for which now two fixes exist already, plus a third
fix is the twiddling of that sysctl).

Ingo

2007-09-27 09:50:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Martin Michlmayr <[email protected]> wrote:

> > I think the real fix would be for iperf to use blocking network IO
> > though, or maybe to use a POSIX mutex or POSIX semaphores.
>
> So it's definitely not a bug in the kernel, only in iperf?
>
> (CCing Stephen Hemminger who wrote the iperf patch.)

Martin, could you check the iperf patch below instead of the yield patch
- does it solve the iperf performance problem equally well, and does CPU
utilization drop for you too?

Ingo

-------------------------->
Subject: iperf: fix locking
From: Ingo Molnar <[email protected]>

fix iperf locking - it was burning CPU time while polling
unnecessarily, instead of using the proper wait primitives.

Signed-off-by: Ingo Molnar <[email protected]>
---
compat/Thread.c | 3 ---
src/Reporter.c | 13 +++++++++----
src/main.cpp | 2 ++
3 files changed, 11 insertions(+), 7 deletions(-)

Index: iperf-2.0.2/compat/Thread.c
===================================================================
--- iperf-2.0.2.orig/compat/Thread.c
+++ iperf-2.0.2/compat/Thread.c
@@ -405,9 +405,6 @@ int thread_numuserthreads( void ) {
void thread_rest ( void ) {
#if defined( HAVE_THREAD )
#if defined( HAVE_POSIX_THREAD )
- // TODO add checks for sched_yield or pthread_yield and call that
- // if available
- usleep( 0 );
#else // Win32
SwitchToThread( );
#endif
Index: iperf-2.0.2/src/Reporter.c
===================================================================
--- iperf-2.0.2.orig/src/Reporter.c
+++ iperf-2.0.2/src/Reporter.c
@@ -111,6 +111,7 @@ report_statistics multiple_reports[kRepo
char buffer[64]; // Buffer for printing
ReportHeader *ReportRoot = NULL;
extern Condition ReportCond;
+extern Condition ReportDoneCond;
int reporter_process_report ( ReportHeader *report );
void process_report ( ReportHeader *report );
int reporter_handle_packet( ReportHeader *report );
@@ -338,7 +339,7 @@ void ReportPacket( ReportHeader* agent,
// item
while ( index == 0 ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}
agent->agentindex = 0;
@@ -346,7 +347,7 @@ void ReportPacket( ReportHeader* agent,
// Need to make sure that reporter is not about to be "lapped"
while ( index - 1 == agent->agentindex ) {
Condition_Signal( &ReportCond );
- thread_rest();
+ Condition_Wait( &ReportDoneCond );
index = agent->reporterindex;
}

@@ -553,6 +554,7 @@ void reporter_spawn( thread_Settings *th
}
Condition_Unlock ( ReportCond );

+again:
if ( ReportRoot != NULL ) {
ReportHeader *temp = ReportRoot;
//Condition_Unlock ( ReportCond );
@@ -575,9 +577,12 @@ void reporter_spawn( thread_Settings *th
// finished with report so free it
free( temp );
Condition_Unlock ( ReportCond );
+ Condition_Signal( &ReportDoneCond );
+ if (ReportRoot)
+ goto again;
}
- // yield control of CPU is another thread is waiting
- thread_rest();
+ Condition_Signal( &ReportDoneCond );
+ usleep(10000);
} else {
//Condition_Unlock ( ReportCond );
}
Index: iperf-2.0.2/src/main.cpp
===================================================================
--- iperf-2.0.2.orig/src/main.cpp
+++ iperf-2.0.2/src/main.cpp
@@ -96,6 +96,7 @@ extern "C" {
// records being accessed in a report and also to
// serialize modification of the report list
Condition ReportCond;
+ Condition ReportDoneCond;
}

// global variables only accessed within this file
@@ -141,6 +142,7 @@ int main( int argc, char **argv ) {

// Initialize global mutexes and conditions
Condition_Initialize ( &ReportCond );
+ Condition_Initialize ( &ReportDoneCond );
Mutex_Initialize( &groupCond );
Mutex_Initialize( &clients_mutex );


2007-09-27 10:55:27

by Martin Michlmayr

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

* Ingo Molnar <[email protected]> [2007-09-27 11:49]:
> Martin, could you check the iperf patch below instead of the yield
> patch - does it solve the iperf performance problem equally well,
> and does CPU utilization drop for you too?

Yes, it works and CPU goes down too.
--
Martin Michlmayr
http://www.cyrius.com/

2007-09-27 10:57:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Martin Michlmayr <[email protected]> wrote:

> * Ingo Molnar <[email protected]> [2007-09-27 11:49]:
> > Martin, could you check the iperf patch below instead of the yield
> > patch - does it solve the iperf performance problem equally well,
> > and does CPU utilization drop for you too?
>
> Yes, it works and CPU goes down too.

i'm curious by how much does CPU go down, and what's the output of
iperf? (does it saturate full 100mbit network bandwidth)

Ingo

2007-09-27 11:13:24

by Martin Michlmayr

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

* Ingo Molnar <[email protected]> [2007-09-27 12:56]:
> i'm curious by how much does CPU go down, and what's the output of
> iperf? (does it saturate full 100mbit network bandwidth)

I get about 94-95 Mbits/sec and CPU drops from 99% to about 82% (this
is with a 600 MHz ARM CPU).
--
Martin Michlmayr
http://www.cyrius.com/

2007-09-27 12:25:39

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > > the (small) patch below fixes the iperf locking bug and removes the
> > > yield() use. There are numerous immediate benefits of this patch:
> > ...
> > >
> > > sched_yield() is almost always the symptom of broken locking or other
> > > bug. In that sense CFS does the right thing by exposing such bugs B-)
> >
> > ...Only if it were under some DEBUG option. [...]
>
> note that i qualified my sentence both via "In that sense" and via a
> smiley! So i was not suggesting that this is a general rule at all and i
> was also joking :-)

Actually, I've analyzed this smiley for some time but these scheduler
jokes are really hard, and I definitely need more time...

>
> > [...] Even if iperf is doing the wrong thing there is no explanation
> > for such a big difference in the behavior between sched_compat_yield 1
> > vs. 0. It seems common interfaces should work similarly and
> > predictably on various systems, and here, if I didn't miss something,
> > linux looks like a different kind?
>
> What you missed is that there is no such thing as "predictable yield
> behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does
> keep the behavior). Please read this thread on lkml for a more detailed
> background:
>
> CFS: some bad numbers with Java/database threading [FIXED]
>
> http://lkml.org/lkml/2007/9/19/357
> http://lkml.org/lkml/2007/9/19/328
>
> in short: the yield implementation was tied to the O(1) scheduler, so
> the only way to have the exact same behavior would be to have the exact
> same core scheduler again. If what you said was true we would not be
> able to change the scheduler, ever. For something as vaguely defined of
> an API as yield, there's just no way to have a different core scheduler
> and still behave the same way.
>
> So _generally_ i'd agree with you that normally we want to be bug for
> bug compatible, but in this specific (iperf) case there's just no point
> in preserving behavior that papers over this _clearly_ broken user-space
> app/thread locking (for which now two fixes exist already, plus a third
> fix is the twiddling of that sysctl).
>

OK, but let's forget about fixing iperf. Probably I got this wrong,
but I've thought this "bad" iperf patch was tested on a few nixes and
linux was the most different one. The main point is: even if there is
no standard here, it should be a common interest to try to not differ
too much at least. So, it's not about exactness, but 50% (63 -> 95)
change in linux own 'definition' after upgrading seems to be a lot.
So, IMHO, maybe some 'compatibility' test could be prepared to
compare a few different ideas on this yield and some average value
could be a kind of at least linux' own standard, which should be
emulated within some limits by next kernels?

Thanks,
Jarek P.

2007-09-27 13:31:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> On Thu, Sep 27, 2007 at 11:46:03AM +0200, Ingo Molnar wrote:
[...]
> > What you missed is that there is no such thing as "predictable yield
> > behavior" for anything but SCHED_FIFO/RR tasks (for which tasks CFS does
> > keep the behavior). Please read this thread on lkml for a more detailed
> > background:
> >
> > CFS: some bad numbers with Java/database threading [FIXED]
> >
> > http://lkml.org/lkml/2007/9/19/357
> > http://lkml.org/lkml/2007/9/19/328
> >
> > in short: the yield implementation was tied to the O(1) scheduler, so
> > the only way to have the exact same behavior would be to have the exact
> > same core scheduler again. If what you said was true we would not be
> > able to change the scheduler, ever. For something as vaguely defined of
> > an API as yield, there's just no way to have a different core scheduler
> > and still behave the same way.
> >
> > So _generally_ i'd agree with you that normally we want to be bug for
> > bug compatible, but in this specific (iperf) case there's just no point
> > in preserving behavior that papers over this _clearly_ broken user-space
> > app/thread locking (for which now two fixes exist already, plus a third
> > fix is the twiddling of that sysctl).
> >
>
> OK, but let's forget about fixing iperf. Probably I got this wrong,
> but I've thought this "bad" iperf patch was tested on a few nixes and
> linux was the most different one. The main point is: even if there is
> no standard here, it should be a common interest to try to not differ
> too much at least. So, it's not about exactness, but 50% (63 -> 95)
> change in linux own 'definition' after upgrading seems to be a lot.
> So, IMHO, maybe some 'compatibility' test could be prepared to compare
> a few different ideas on this yield and some average value could be a
> kind of at least linux' own standard, which should be emulated within
> some limits by next kernels?

you repeat your point of "emulating yield", and i can only repeat my
point that you should please read this:

http://lkml.org/lkml/2007/9/19/357

because, once you read that, i think you'll agree with me that what you
say is simply not possible in a sane way at this stage. We went through
a number of yield implementations already and each will change behavior
for _some_ category of apps. So right now we offer two implementations,
and the default was chosen empirically to minimize the amount of
complaints. (but it's not possible to eliminate them altogether, for the
reasons outlined above - hence the switch.)

Ingo

2007-09-27 14:40:22

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
...
> > OK, but let's forget about fixing iperf. Probably I got this wrong,
> > but I've thought this "bad" iperf patch was tested on a few nixes and
> > linux was the most different one. The main point is: even if there is
> > no standard here, it should be a common interest to try to not differ
> > too much at least. So, it's not about exactness, but 50% (63 -> 95)
> > change in linux own 'definition' after upgrading seems to be a lot.
> > So, IMHO, maybe some 'compatibility' test could be prepared to compare
> > a few different ideas on this yield and some average value could be a
> > kind of at least linux' own standard, which should be emulated within
> > some limits by next kernels?
>
> you repeat your point of "emulating yield", and i can only repeat my
> point that you should please read this:
>
> http://lkml.org/lkml/2007/9/19/357
>
> because, once you read that, i think you'll agree with me that what you
> say is simply not possible in a sane way at this stage. We went through
> a number of yield implementations already and each will change behavior
> for _some_ category of apps. So right now we offer two implementations,
> and the default was chosen empirically to minimize the amount of
> complaints. (but it's not possible to eliminate them altogether, for the
> reasons outlined above - hence the switch.)

Sorry, but I think you got me wrong: I didn't mean emulation of any
implementation, but probably the same thing you write above: emulation
of time/performance. In my opinion this should be done experimentally
too, but with something more objective and constant than current
"complaints counter". And the first thing could be a try to set some
kind of linux internal "standard of yield" for the future by averaging
a few most popular systems in a test doing things like this iperf or
preferably more.

Jarek P.

2007-09-29 12:45:01

by Nick Piggin

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Friday 28 September 2007 00:42, Jarek Poplawski wrote:
> On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> > * Jarek Poplawski <[email protected]> wrote:
>
> ...
>
> > > OK, but let's forget about fixing iperf. Probably I got this wrong,
> > > but I've thought this "bad" iperf patch was tested on a few nixes and
> > > linux was the most different one. The main point is: even if there is
> > > no standard here, it should be a common interest to try to not differ
> > > too much at least. So, it's not about exactness, but 50% (63 -> 95)
> > > change in linux own 'definition' after upgrading seems to be a lot.
> > > So, IMHO, maybe some 'compatibility' test could be prepared to compare
> > > a few different ideas on this yield and some average value could be a
> > > kind of at least linux' own standard, which should be emulated within
> > > some limits by next kernels?
> >
> > you repeat your point of "emulating yield", and i can only repeat my
> > point that you should please read this:
> >
> > http://lkml.org/lkml/2007/9/19/357
> >
> > because, once you read that, i think you'll agree with me that what you
> > say is simply not possible in a sane way at this stage. We went through
> > a number of yield implementations already and each will change behavior
> > for _some_ category of apps. So right now we offer two implementations,
> > and the default was chosen empirically to minimize the amount of
> > complaints. (but it's not possible to eliminate them altogether, for the
> > reasons outlined above - hence the switch.)
>
> Sorry, but I think you got me wrong: I didn't mean emulation of any
> implementation, but probably the same thing you write above: emulation
> of time/performance. In my opinion this should be done experimentally
> too, but with something more objective and constant than current
> "complaints counter". And the first thing could be a try to set some
> kind of linux internal "standard of yield" for the future by averaging
> a few most popular systems in a test doing things like this iperf or
> preferably more.

By definition, yield is essentially undefined as to the behaviour between
SCHED_OTHER tasks at the same priority level (ie. all of them), because
SCHED_OTHER scheduling behaviour itself is undefined.

It's never going to do exactly what everybody wants, except those using
it for legitimate reasons in realtime applications.

2007-10-01 08:41:36

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Fri, Sep 28, 2007 at 04:10:00PM +1000, Nick Piggin wrote:
> On Friday 28 September 2007 00:42, Jarek Poplawski wrote:
> > On Thu, Sep 27, 2007 at 03:31:23PM +0200, Ingo Molnar wrote:
> > > * Jarek Poplawski <[email protected]> wrote:
> >
> > ...
> >
> > > > OK, but let's forget about fixing iperf. Probably I got this wrong,
> > > > but I've thought this "bad" iperf patch was tested on a few nixes and
> > > > linux was the most different one. The main point is: even if there is
> > > > no standard here, it should be a common interest to try to not differ
> > > > too much at least. So, it's not about exactness, but 50% (63 -> 95)
> > > > change in linux own 'definition' after upgrading seems to be a lot.
> > > > So, IMHO, maybe some 'compatibility' test could be prepared to compare
> > > > a few different ideas on this yield and some average value could be a
> > > > kind of at least linux' own standard, which should be emulated within
> > > > some limits by next kernels?
> > >
> > > you repeat your point of "emulating yield", and i can only repeat my
> > > point that you should please read this:
> > >
> > > http://lkml.org/lkml/2007/9/19/357
> > >
> > > because, once you read that, i think you'll agree with me that what you
> > > say is simply not possible in a sane way at this stage. We went through
> > > a number of yield implementations already and each will change behavior
> > > for _some_ category of apps. So right now we offer two implementations,
> > > and the default was chosen empirically to minimize the amount of
> > > complaints. (but it's not possible to eliminate them altogether, for the
> > > reasons outlined above - hence the switch.)
> >
> > Sorry, but I think you got me wrong: I didn't mean emulation of any
> > implementation, but probably the same thing you write above: emulation
> > of time/performance. In my opinion this should be done experimentally
> > too, but with something more objective and constant than current
> > "complaints counter". And the first thing could be a try to set some
> > kind of linux internal "standard of yield" for the future by averaging
> > a few most popular systems in a test doing things like this iperf or
> > preferably more.
>
> By definition, yield is essentially undefined as to the behaviour between
> SCHED_OTHER tasks at the same priority level (ie. all of them), because
> SCHED_OTHER scheduling behaviour itself is undefined.
>
> It's never going to do exactly what everybody wants, except those using
> it for legitimate reasons in realtime applications.
>

That's why I've used words like: "not differ too much" and "within
some limits" above. So, it's only about being reasonable, compared
to our previous versions, and to others, if possible.

It should not be impossible to additionally control (delay or
accelerate) yielding tasks wrt. current load/weight/number_of_tasks
etc., if we know (after testing) e.g. the average time such yielding
tasks need to get scheduled again with various schedulers. Of course,
such tests and controlling parameters can change for some time until
the problem is explored enough, and there is still no aim for
exactness or to please everybody.

BTW, it looks risky to criticise sched_yield too much: some
people can misinterpret such discussions and stop using this at
all, even where it's right.

Regards,
Jarek P.

2007-10-01 16:25:36

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> BTW, it looks risky to criticise sched_yield too much: some
> people can misinterpret such discussions and stop using this at all,
> even where it's right.

Really, i have never seen a _single_ mainstream app where the use of
sched_yield() was the right choice.

Fortunately, the sched_yield() API is already one of the most rarely
used scheduler functionalities, so it does not really matter. [ In my
experience a Linux scheduler is stabilizing pretty well when the
discussion shifts to yield behavior, because that shows that everything
else is pretty much fine ;-) ]

But, because you assert that it's risky to "criticise sched_yield()
too much", you sure must know at least one real example where it's right
to use it (and cite the line and code where it's used, with
specificity)?

Ingo

2007-10-01 16:49:35

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


> * Jarek Poplawski <[email protected]> wrote:
>
> > BTW, it looks risky to criticise sched_yield too much: some
> > people can misinterpret such discussions and stop using this at all,
> > even where it's right.

> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.

It can occasionally be an optimization. You may have a case where you can do
something very efficiently if a lock is not held, but you cannot afford to
wait for the lock to be released. So you check the lock; if it's held, you
yield and then check again. If that fails, you do it the less optimal way
(for example, dispatching it to a thread that *can* afford to wait).
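
A rough sketch of that pattern (the helper names are hypothetical, not
code from any application mentioned in this thread): try the lock, yield
once and retry, and otherwise hand the work to a thread that can block.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;

/* placeholders for the application-specific work (hypothetical) */
static void do_it_inline(void)            { puts("did it on the fast path"); }
static void queue_for_worker_thread(void) { puts("handed off to a worker"); }

void maybe_do_it_now(void)
{
    if (pthread_mutex_trylock(&pool_lock) == 0) {
        do_it_inline();                  /* lock was free: cheap path */
        pthread_mutex_unlock(&pool_lock);
        return;
    }

    sched_yield();                       /* let the lock holder finish */

    if (pthread_mutex_trylock(&pool_lock) == 0) {
        do_it_inline();
        pthread_mutex_unlock(&pool_lock);
        return;
    }

    queue_for_worker_thread();           /* still busy: can't wait here */
}

The second trylock after the yield is what gives the lock holder "one
more chance" before the work is punted to a thread that can block.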

It is also sometimes used in the implementation of spinlock-type primitives.
After spinning fails, yielding is tried.

I think it's also sometimes appropriate when a thread may monopolize a
mutex. For example, consider a rarely-run task that cleans up some expensive
structures. It may need to hold locks that are only held during this complex
clean up.

One example I know of is a defragmenter for a multi-threaded memory
allocator, and it has to lock whole pools. When it releases these locks, it
calls yield before re-acquiring them to go back to work. The idea is to "go
to the back of the line" if any threads are blocking on those mutexes.

There are certainly other ways to do these things, but I have seen cases
where, IMO, yielding was the best solution. Doing nothing would have been
okay too.

> Fortunately, the sched_yield() API is already one of the most rarely
> used scheduler functionalities, so it does not really matter. [ In my
> experience a Linux scheduler is stabilizing pretty well when the
> discussion shifts to yield behavior, because that shows that everything
> else is pretty much fine ;-) ]

Can you explain what the current sched_yield behavior *is* for CFS and what
the tunable does to change it?

The desired behavior is for the current thread to not be rescheduled until
every thread at the same static priority as this thread has had a chance to
be scheduled.

Of course, it's not clear exactly what a "chance" is.

The semantics with respect to threads at other static priority levels is not
clear. Ditto for SMP issues. It's also not clear whether threads that yield
should be rewarded or punished for doing so.

DS


2007-10-01 16:56:17

by Chris Friesen

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Ingo Molnar wrote:

> But, because you assert that it's risky to "criticise sched_yield()
> too much", you sure must know at least one real example where it's right
> to use it (and cite the line and code where it's used, with
> specificity)?

It's fine to criticise sched_yield(). I agree that new apps should
generally be written to use proper completion mechanisms or to wait for
specific events.

However, there are closed-source and/or frozen-source apps where it's
not practical to rewrite or rebuild the app. Does it make sense to
break the behaviour of all of these?

Chris

2007-10-01 17:09:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Chris Friesen <[email protected]> wrote:

> Ingo Molnar wrote:
>
> >But, because you assert that it's risky to "criticise sched_yield()
> >too much", you sure must know at least one real example where it's right
> >to use it (and cite the line and code where it's used, with
> >specificity)?
>
> It's fine to criticise sched_yield(). I agree that new apps should
> generally be written to use proper completion mechanisms or to wait
> for specific events.

yes.

> However, there are closed-source and/or frozen-source apps where it's
> not practical to rewrite or rebuild the app. Does it make sense to
> break the behaviour of all of these?

See the background and answers to that in:

http://lkml.org/lkml/2007/9/19/357
http://lkml.org/lkml/2007/9/19/328

there's plenty of recourse possible to all possible kinds of apps. Tune
the sysctl flag in one direction or another, depending on which behavior
the app is expecting.

Ingo

2007-10-01 17:32:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* David Schwartz <[email protected]> wrote:

> > > BTW, it looks risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at
> > > all, even where it's right.
>
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
>
> It can occasionally be an optimization. You may have a case where you
> can do something very efficiently if a lock is not held, but you
> cannot afford to wait for the lock to be released. So you check the
> lock, if it's held, you yield and then check again. If that fails, you
> do it the less optimal way (for example, dispatching it to a thread
> that *can* afford to wait).

These are generic statements, but i'm _really_ interested in the
specifics. Real, specific code that i can look at. The typical Linux
distro consists of in excess of 500 million lines of code, in tens
of thousands of apps, so there really must be some good, valid and
"right" use of sched_yield() somewhere in there, in some mainstream app,
right? (because, as you might have guessed it, in the past decade of
sched_yield() existence i _have_ seen my share of sched_yield()
utilizing user-space code, and at the moment i'm not really impressed by
those examples.)

Preferably that example should show that the best quality user-space
lock implementation in a given scenario is best done via sched_yield().
Actual code and numbers. (And this isn't _that_ hard. I'm not asking for
a full RDBMS implementation that must run through the SQL99 spec suite. This
is about a simple locking primitive, or a simple pointer to an existing
codebase.)

> It is also sometimes used in the implementation of spinlock-type
> primitives. After spinning fails, yielding is tried.

(user-space spinlocks are broken beyond words for anything but perhaps
SCHED_FIFO tasks.)

> One example I know of is a defragmenter for a multi-threaded memory
> allocator, and it has to lock whole pools. When it releases these
> locks, it calls yield before re-acquiring them to go back to work. The
> idea is to "go to the back of the line" if any threads are blocking on
> those mutexes.

at a quick glance this seems broken too - but if you show the specific
code i might be able to point out the breakage in detail. (One
underlying problem here appears to be fairness: a quick unlock/lock
sequence may starve out other threads. yield won't solve that fundamental
problem either, and it will introduce random latencies into apps using
this memory allocator.)

> > Fortunately, the sched_yield() API is already one of the most rarely
> > used scheduler functionalities, so it does not really matter. [ In my
> > experience a Linux scheduler is stabilizing pretty well when the
> > discussion shifts to yield behavior, because that shows that everything
> > else is pretty much fine ;-) ]
>
> Can you explain what the current sched_yield behavior *is* for CFS and
> what the tunable does to change it?

sure. (and i described that flag on lkml before) The sched_yield flag
does two things:

- if 0 ("opportunistic mode"), then the task will reschedule to any
other task that is in "bigger need for CPU time" than the currently
running task, as indicated by CFS's ->wait_runtime metric. (or as
indicated by the similar ->vruntime metric in sched-devel.git)

- if 1 ("agressive mode"), then the task will be one-time requeued to
the right end of the CFS rbtree. This means that for one instance,
all other tasks will run before this task will run again - after that
this task's natural ordering within the rbtree is restored.
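
A concrete way to see the difference between the two modes is a small
test program along these lines (a sketch, not from the thread): run it
once with sched_compat_yield set to 0 and once with it set to 1 (via the
echo command shown earlier in the thread), and compare the counts.

#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

static volatile int stop;
static volatile unsigned long busy_loops, yield_loops;

static void *busy(void *arg)        /* plain CPU hog */
{
    (void)arg;
    while (!stop)
        busy_loops++;
    return NULL;
}

static void *yielder(void *arg)     /* yields on every iteration */
{
    (void)arg;
    while (!stop) {
        sched_yield();
        yield_loops++;
    }
    return NULL;
}

int main(void)
{
    pthread_t a, b;

    pthread_create(&a, NULL, busy, NULL);
    pthread_create(&b, NULL, yielder, NULL);
    sleep(5);
    stop = 1;
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    printf("busy: %lu iterations, yielder: %lu iterations\n",
           busy_loops, yield_loops);
    return 0;
}

Built with gcc -O2 -pthread and pinned to a single CPU (e.g. with
taskset -c 0) so the two threads actually compete, the yielder's
iteration count should drop sharply in "aggressive" mode compared to
"opportunistic" mode.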

> The desired behavior is for the current thread to not be rescheduled
> until every thread at the same static priority as this thread has had
> a chance to be scheduled.

do you realize that this "desired behavior" you just described is not
achieved by the old scheduler, and that this random behavior _is_ the
main problem here? If yield was well-specified then we could implement
it in a well-specified way - even if the API was poor.

But fact is that it is _not_ well-specified, and apps grew upon random
scheduler implementation details in random ways. (in the lkml discussion
about this topic, Linus offered a pretty sane theoretical definition for
yield but it's not simple to implement [and no scheduler implements it
at the moment] - nor will it map to the old scheduler's yield behavior
so we'll end up breaking more apps.)

Ingo

2007-10-01 17:46:00

by Chris Friesen

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Ingo Molnar wrote:
> * Chris Friesen <[email protected]> wrote:

>>However, there are closed-source and/or frozen-source apps where it's
>>not practical to rewrite or rebuild the app. Does it make sense to
>>break the behaviour of all of these?
>
>
> See the background and answers to that in:
>
> http://lkml.org/lkml/2007/9/19/357
> http://lkml.org/lkml/2007/9/19/328
>
> there's plenty of recourse possible to all possible kinds of apps. Tune
> the sysctl flag in one direction or another, depending on which behavior
> the app is expecting.

Yeah, I read those threads.

It seems like the fundamental source of the disconnect is that the tasks
used to be sorted by priority (thus making it easy to bump a yielding
task to the end of that priority level) while now they're organized by
time (making it harder to do anything priority-based). Do I have that
right?

Chris

2007-10-01 18:24:17

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


> These are generic statements, but i'm _really_ interested in the
> specifics. Real, specific code that i can look at. The typical Linux
> distro consists of in excess of 500 million lines of code, in tens
> of thousands of apps, so there really must be some good, valid and
> "right" use of sched_yield() somewhere in there, in some mainstream app,
> right? (because, as you might have guessed it, in the past decade of
> sched_yield() existence i _have_ seen my share of sched_yield()
> utilizing user-space code, and at the moment i'm not really impressed by
> those examples.)

Maybe, maybe not. Even if so, it would be very difficult to find. Simply
grepping for sched_yield is not going to help because determining whether a
given use of sched_yield is smart is not going to be easy.

> (user-space spinlocks are broken beyond words for anything but perhaps
> SCHED_FIFO tasks.)

User-space spinlocks are broken so spinlocks can only be implemented in
kernel-space? Even if you use the kernel to schedule/unschedule the tasks,
you still have to spin in user-space.
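
For comparison, what modern user-space locks typically do is spin
briefly in user space and then block in the kernel on a futex rather
than call sched_yield(). A rough sketch (illustrative only, essentially
the classic futex-based mutex design, not code from this thread):

#include <linux/futex.h>
#include <stdatomic.h>
#include <sys/syscall.h>
#include <unistd.h>

/* lock states: 0 = free, 1 = locked, 2 = locked with possible waiters */
static atomic_int lock_word;

static void futex_wait(atomic_int *addr, int val)
{
    syscall(SYS_futex, addr, FUTEX_WAIT, val, NULL, NULL, 0);
}

static void futex_wake(atomic_int *addr, int howmany)
{
    syscall(SYS_futex, addr, FUTEX_WAKE, howmany, NULL, NULL, 0);
}

void my_lock(void)
{
    for (int spin = 0; spin < 100; spin++) {    /* brief user-space spin */
        int expected = 0;
        if (atomic_compare_exchange_strong(&lock_word, &expected, 1))
            return;
    }
    /* contended: mark "waiters possible" and sleep in the kernel */
    while (atomic_exchange(&lock_word, 2) != 0)
        futex_wait(&lock_word, 2);
}

void my_unlock(void)
{
    if (atomic_exchange(&lock_word, 0) == 2)
        futex_wake(&lock_word, 1);              /* someone may be asleep */
}

The point of the design is that a waiter burns only a bounded amount of
CPU before it is put to sleep, and it is woken exactly when the lock is
released rather than whenever the scheduler happens to run it again.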

> > One example I know of is a defragmenter for a multi-threaded memory
> > allocator, and it has to lock whole pools. When it releases these
> > locks, it calls yield before re-acquiring them to go back to work. The
> > idea is to "go to the back of the line" if any threads are blocking on
> > those mutexes.

> at a quick glance this seems broken too - but if you show the specific
> code i might be able to point out the breakage in detail. (One
> underlying problem here appears to be fairness: a quick unlock/lock
> sequence may starve out other threads. yield won't solve that fundamental
> problem either, and it will introduce random latencies into apps using
> this memory allocator.)

You are assuming that random latencies are necessarily bad. Random latencies
may be significantly better than predictable high latency.


> > Can you explain what the current sched_yield behavior *is* for CFS and
> > what the tunable does to change it?

> sure. (and i described that flag on lkml before) The sched_yield flag
> does two things:

> - if 0 ("opportunistic mode"), then the task will reschedule to any
> other task that is in "bigger need for CPU time" than the currently
> running task, as indicated by CFS's ->wait_runtime metric. (or as
> indicated by the similar ->vruntime metric in sched-devel.git)
>
> - if 1 ("agressive mode"), then the task will be one-time requeued to
> the right end of the CFS rbtree. This means that for one instance,
> all other tasks will run before this task will run again - after that
> this task's natural ordering within the rbtree is restored.

Thank you. Unfortunately, neither of these does what sched_yield is really
supposed to do. Opportunistic mode does too little and aggressive mode does
too much.
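
For reference, the knob being discussed is the sched_compat_yield sysctl (the
in-kernel variable is sysctl_sched_compat_yield). A minimal sketch of setting
it from user space -- the /proc path is assumed from that name rather than
quoted from this thread, so treat it as illustrative:

/* flip sched_compat_yield to 1 ("aggressive" requeue-to-the-end mode);
 * needs root, and assumes the path from the in-kernel sysctl name */
#include <stdio.h>

int main(void)
{
	FILE *f = fopen("/proc/sys/kernel/sched_compat_yield", "w");

	if (!f) {
		perror("sched_compat_yield");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}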

> > The desired behavior is for the current thread to not be rescheduled
> > until every thread at the same static priority as this thread has had
> > a chance to be scheduled.

> do you realize that this "desired behavior" you just described is not
> achieved by the old scheduler, and that this random behavior _is_ the
> main problem here? If yield was well-specified then we could implement
> it in a well-specified way - even if the API was poor.

> But fact is that it is _not_ well-specified, and apps grew upon a random
> scheduler implementation details in random ways. (in the lkml discussion
> about this topic, Linus offered a pretty sane theoretical definition for
> yield but it's not simple to implement [and no scheduler implements it
> at the moment] - nor will it map to the old scheduler's yield behavior
> so we'll end up breaking more apps.)

I don't have a problem with failing to emulate the old scheduler's behavior
if we can show that the new behavior has saner semantics. Unfortunately, in
this case, I think CFS' semantics are pretty bad. Neither of these is what
sched_yield is supposed to do.

Note that I'm not saying this is a particularly big deal. And I'm not
calling CFS' behavior a regression, since it's not really better or worse
than the old behavior, simply different.

I'm not familiar enough with CFS' internals to help much on the
implementation, but there may be some simple compromise yield that might
work well enough. How about simply acting as if the task used up its
timeslice and scheduling the next one? (Possibly with a slight reduction in
penalty or reward for not really using all the time, if possible?)

DS


2007-10-01 19:10:32

by Ingo Molnar

[permalink] [raw]
Subject: Re: iperf yield usage


* Chris Friesen <[email protected]> wrote:

> >See the background and answers to that in:
> >
> > http://lkml.org/lkml/2007/9/19/357
> > http://lkml.org/lkml/2007/9/19/328
> >
> >there's plenty of recourse possible to all possible kinds of apps.
> >Tune the sysctl flag in one direction or another, depending on which
> >behavior the app is expecting.
>
> Yeah, I read those threads.
>
> It seems like the fundamental source of the disconnect is that the
> tasks used to be sorted by priority (thus making it easy to bump a
> yielding task to the end of that priority level) while now they're
> organized by time (making it harder to do anything priority-based).
> Do I have that right?

not really - the old yield implementation in essence gave the task a
time hit too, because we rotated through tasks based on timeslices. But
the old one requeued yielding tasks to the 'active array', and the
decision whether a task is in the active or in the expired array was a
totally stochastic, load-dependent thing. As a result, certain tasks,
under certain workloads saw a "stronger" yield, other tasks saw a
"weaker" yield. (The reason for that implementation was simple: yield
was (and is) unimportant and it was implemented in the most
straightforward way that caused no overhead anywhere else in the
scheduler.)

( and to keep perspective it's also important to correct the subject
line here: it's not about "network slowdown" - nothing in networking
slowed down in any way - it was that iperf used yield in a horrible
way. I changed the subject line to reflect that. )

Ingo

2007-10-01 19:55:52

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Mon, 1 Oct 2007 09:49:35 -0700
"David Schwartz" <[email protected]> wrote:

>
> > * Jarek Poplawski <[email protected]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at
> > > all, even where it's right.
>
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
>
> It can occasionally be an optimization. You may have a case where you
> can do something very efficiently if a lock is not held, but you
> cannot afford to wait for the lock to be released. So you check the
> lock, if it's held, you yield and then check again. If that fails,
> you do it the less optimal way (for example, dispatching it to a
> thread that *can* afford to wait).


at this point it's "use a futex" instead; once you're doing system
calls you might as well use the right one for what you're trying to
achieve.

2007-10-01 21:37:05

by Hubert Tonneau

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Ingo Molnar wrote:
>
> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.

Pliant's 'FastSem' semaphore implementation (as opposed to 'Sem') uses 'yield'
http://old.fullpliant.org/

Basically, if the resource you are protecting with the semaphore will be held
for a significant time, then a full semaphore might be better, but if the
resource will be held for just a few cycles, then light acquiring might bring
the best result, because the most significant cost is in acquiring/releasing.

So the acquiring algorithm for fast semaphores might be:
try to acquire with a hardware atomic read-and-set instruction, then if it
fails, call yield and retry (at least on a single-processor, single-core system).
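
A minimal sketch of that acquire loop (illustrative only, not Pliant's actual
code; the names are made up and it assumes GCC's __sync atomic builtins):

#include <sched.h>

typedef struct { volatile int locked; } fast_sem_t;

static void fast_sem_acquire(fast_sem_t *s)
{
	/* hardware atomic read-and-set; returns the previous value */
	while (__sync_lock_test_and_set(&s->locked, 1))
		sched_yield();	/* contended: let the holder run, then retry */
}

static void fast_sem_release(fast_sem_t *s)
{
	__sync_lock_release(&s->locked);
}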


2007-10-01 22:19:18

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


Arjan van de Ven wrote:

> > It can occasionally be an optimization. You may have a case where you
> > can do something very efficiently if a lock is not held, but you
> > cannot afford to wait for the lock to be released. So you check the
> > lock, if it's held, you yield and then check again. If that fails,
> > you do it the less optimal way (for example, dispatching it to a
> > thread that *can* afford to wait).

> at this point it's "use a futex" instead; once you're doing system
> calls you might as well use the right one for what you're trying to
> achieve.

There are two answers to this. One is that you sometimes are writing POSIX
code and Linux-specific optimizations don't change the fact that you still
need a portable implementation.

The other answer is that futexes don't change anything in this case. In
fact, the last time I hit this, the lock was a futex on Linux.
Nevertheless, that doesn't change the basic issue. The lock is locked, you
cannot afford to wait for it, but not getting the lock is expensive. The
solution is to yield and check the lock again. If it's still held, you
dispatch to another thread, but many times, yielding can avoid that.

A futex doesn't change the fact that sometimes you can't afford to block on
a lock but nevertheless would save significant effort if you were able to
acquire it. Odds are the thread that holds it is about to release it anyway.

That is, you need something in-between "non-blocking trylock, fail easily"
and "blocking lock, do not fail", but you'd rather make forward progress
without the lock than actually block/sleep.
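
A rough sketch of that pattern (the helper names are hypothetical, just to
show the shape of it):

#include <pthread.h>
#include <sched.h>

extern void do_work_locked(void);            /* fast path, needs the lock   */
extern void dispatch_to_worker_thread(void); /* slow path, thread can block */

void do_work(pthread_mutex_t *lock)
{
	if (pthread_mutex_trylock(lock) != 0) {
		sched_yield();	/* give the holder a chance to finish */
		if (pthread_mutex_trylock(lock) != 0) {
			dispatch_to_worker_thread();
			return;
		}
	}
	do_work_locked();
	pthread_mutex_unlock(lock);
}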

DS


2007-10-01 22:37:28

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Mon, 1 Oct 2007 15:17:52 -0700
"David Schwartz" <[email protected]> wrote:

>
> Arjan van de Ven wrote:
>
> > > It can occasionally be an optimization. You may have a case where
> > > you can do something very efficiently if a lock is not held, but
> > > you cannot afford to wait for the lock to be released. So you
> > > check the lock, if it's held, you yield and then check again. If
> > > that fails, you do it the less optimal way (for example,
> > > dispatching it to a thread that *can* afford to wait).
>
> > at this point it's "use a futex" instead; once you're doing system
> > calls you might as well use the right one for what you're trying to
> > achieve.
>
> There are two answers to this. One is that you sometimes are writing
> POSIX code and Linux-specific optimizations don't change the fact
> that you still need a portable implementation.
>
> The other answer is that futexes don't change anything in this case.
> In fact, in the last time I hit this, the lock was a futex on Linux.
> Nevertheless, that doesn't change the basic issue. The lock is
> locked, you cannot afford to wait for it, but not getting the lock is
> expensive. The solution is to yield and check the lock again. If it's
> still held, you dispatch to another thread, but many times, yielding
> can avoid that.

yielding IS blocking. Just with indeterminate fuzziness added to it....

2007-10-01 22:45:22

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


> yielding IS blocking. Just with indeterminate fuzzyness added to it....

Yielding is sort of blocking, but the difference is that yielding will not
idle the CPU while blocking might. Yielding is sometimes preferable to
blocking in a case where the thread knows it can make forward progress even
if it doesn't get the resource. (As in the examples I explained.)

DS


2007-10-01 22:57:24

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Mon, 1 Oct 2007 15:44:09 -0700
"David Schwartz" <[email protected]> wrote:

>
> > yielding IS blocking. Just with indeterminate fuzzyness added to
> > it....
>
> Yielding is sort of blocking, but the difference is that yielding
> will not idle the CPU while blocking might.

not really; SOMEONE will make progress, the one holding the lock.
Granted, he can be on some other cpu, but at that point all yielding
gets you is a bunch of cache bounces.

>Yielding is sometimes
> preferable to blocking in a case where the thread knows it can make
> forward progress even if it doesn't get the resource. (As in the
> examples I explained.)

that's also what trylock is for... as well as spinaphores...
(you can argue that futexes should be more intelligent and do
spinaphore stuff etc... and I can buy that, let's improve them in the
kernel by any means.) But userspace yield() isn't the answer. A
yield_to() would have been a ton better (it would return immediately
if the thing you want to yield to is already running somewhere); a
blind "yield" isn't, since it doesn't say what you want to yield to.

Note: The answer to "what to yield to" isn't "everything that might
want to run"; we tried that way back when the early 2.6 scheduler was
designed, and that turned out not to be what people calling yield
expected (it made their things even slower than they thought). So
they want "yield to" semantics, without telling the kernel what they
want to yield to, and complain if the kernel second-guesses wrongly....


not a good api.

2007-10-02 06:06:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* David Schwartz <[email protected]> wrote:

> > These are generic statements, but i'm _really_ interested in the
> > specifics. Real, specific code that i can look at. The typical Linux
> > distro consists of in execess of 500 millions of lines of code, in
> > tens of thousands of apps, so there really must be some good, valid
> > and "right" use of sched_yield() somewhere in there, in some
> > mainstream app, right? (because, as you might have guessed it, in
> > the past decade of sched_yield() existence i _have_ seen my share of
> > sched_yield() utilizing user-space code, and at the moment i'm not
> > really impressed by those examples.)
>
> Maybe, maybe not. Even if so, it would be very difficult to find.
> [...]

google.com/codesearch is your friend. Really.

> Note that I'm not saying this is a particularly big deal. And I'm not
> calling CFS' behavior a regression, since it's not really better or
> worse than the old behavior, simply different.

yes, and that's the core point.

> I'm not familiar enough with CFS' internals to help much on the
> implementation, but there may be some simple compromise yield that
> might work well enough. How about simply acting as if the task used up
> its timeslice and scheduling the next one? (Possibly with a slight
> reduction in penalty or reward for not really using all the time, if
> possible?)

firstly, there's no notion of "timeslices" in CFS. (in CFS tasks "earn"
a right to the CPU, and that "right" is not sliced in the traditional
sense) But we tried a conceptually similar thing: to schedule not to the
end of the tree but into the next position. That too was bad for _some_
apps. CFS literally cycled through 5-6 different yield implementations
in its 22 versions so far. The current flag solution was achieved in
such an iterative fashion and gives an acceptable solution to all app
categories that came up so far. [ and this is driven by compatibility
goals - regardless of how broken we consider yield use. The ideal
solution is of course to almost never use yield. Fortunately 99%+ of
Linux apps follow that ideal solution ;-) ]

Ingo

2007-10-02 06:08:59

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* David Schwartz <[email protected]> wrote:

> > (user-space spinlocks are broken beyond words for anything but
> > perhaps SCHED_FIFO tasks.)
>
> User-space spinlocks are broken so spinlocks can only be implemented
> in kernel-space? Even if you use the kernel to schedule/unschedule the
> tasks, you still have to spin in user-space.

user-space spinlocks (in anything but SCHED_FIFO tasks) are pretty
broken because they waste CPU time. (not as broken as yield, because
"wasting CPU time" is a more deterministic act, but still broken) Could
you cite a single example where user-space spinlocks are technically the
best solution?

Ingo

2007-10-02 06:26:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* David Schwartz <[email protected]> wrote:

> > at a quick glance this seems broken too - but if you show the
> > specific code i might be able to point out the breakage in detail.
> > (One underlying problem here appears to be fairness: a quick
> > unlock/lock sequence may starve out other threads. yield wont solve
> > that fundamental problem either, and it will introduce random
> > latencies into apps using this memory allocator.)
>
> You are assuming that random latencies are necessarily bad. Random
> latencies may be significantly better than predictable high latency.

i'm not really assuming anything, i gave a vague first impression of the
vague example you gave (assuming that the yield was done to combat
fairness problems). This is a case where the human language shows its
boundaries: statements that are hard to refute with certainty because
they are too vague. So i'd really suggest you show me some sample/real
code - that would move this discussion to a much more productive level.

but i'll attempt to weave the chain of argument one step forward (in the
hope of not distorting your point in any way): _if_ the sched_yield()
call in that memory allocator is done because it uses a locking
primitive that is unfair (hence the memory pool lock can be starved),
then the "guaranteed large latency" is caused by "guaranteed
unfairness". The solution is not to insert a random latency (via a
sched_yield() call) that also has a side-effect of fairness to other
tasks, because this random latency introduces guaranteed unfairness for
this particular task. The correct solution IMO is to make the locking
primitive more fair _without_ random delays, and there are a number of
good techniques for that. (they mostly center around the use of futexes)

one thing that is often missed is that most of the cost of a yield() is
in the system call and the context-switch - quite similar to the futex
slowpath. So there's _no_ reason not to use futexes on Linux. (yes,
there might be historic/compatibility or ease-of-porting arguments but
those do not really impact the fundamental argument of whether something
is technically right or not.)

Ingo

2007-10-02 06:46:46

by Ingo Molnar

[permalink] [raw]
Subject: Re: yield API


* David Schwartz <[email protected]> wrote:

> > These are generic statements, but i'm _really_ interested in the
> > specifics. Real, specific code that i can look at. The typical Linux
> > distro consists of in execess of 500 millions of lines of code, in
> > tens of thousands of apps, so there really must be some good, valid
> > and "right" use of sched_yield() somewhere in there, in some
> > mainstream app, right? (because, as you might have guessed it, in
> > the past decade of sched_yield() existence i _have_ seen my share of
> > sched_yield() utilizing user-space code, and at the moment i'm not
> > really impressed by those examples.)
>
> Maybe, maybe not. Even if so, it would be very difficult to find.
> Simply grepping for sched_yield is not going to help because
> determining whether a given use of sched_yield is smart is not going
> to be easy.

sched_yield() has been around for a decade (about three times longer
than futexes were around), so if it's useful, it sure should have grown
some 'crown jewel' app that uses it and shows off its advantages,
compared to other locking approaches, right?

For example, if you asked me whether pipes are the best thing for
certain apps, i could immediately show you tons of examples where they
are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or
just about any other core kernel concept or API. Your notion that
showing a good example of an API would be "difficult" because it's hard
to determine "smart" use is not tenable i believe and does not
adequately refute my pretty plain-meaning "it does not exist" assertion.

If so, then this is one more supporting proof of the fundamental weakness
of the sched_yield() API. Rarely are we able to so universally condemn
an API: real-life is usually more varied and even for theoretically
poorly defined APIs _some_ sort of legitimate use does grow up.

APIs that are not in any real, meaningful use, despite a decade of
presence are not really interesting to me personally. (especially in
this case where we know exactly _why_ the API is used so rarely.) Sure
we'll continue to support it in the best possible way, with the usual
kernel maintenance policy: without hurting other, more commonly used
APIs. That was the principle we followed in previous schedulers too. And
if anyone has a patch to make sched_yield() better than it is today, i'm
of course interested in it.

Ingo

2007-10-02 06:47:23

by Andi Kleen

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Ingo Molnar <[email protected]> writes:

> * David Schwartz <[email protected]> wrote:
>
> > > These are generic statements, but i'm _really_ interested in the
> > > specifics. Real, specific code that i can look at. The typical Linux
> > > distro consists of in execess of 500 millions of lines of code, in
> > > tens of thousands of apps, so there really must be some good, valid
> > > and "right" use of sched_yield() somewhere in there, in some
> > > mainstream app, right? (because, as you might have guessed it, in
> > > the past decade of sched_yield() existence i _have_ seen my share of
> > > sched_yield() utilizing user-space code, and at the moment i'm not
> > > really impressed by those examples.)
> >
> > Maybe, maybe not. Even if so, it would be very difficult to find.
> > [...]
>
> google.com/codesearch is your friend. Really,

http://www.koders.com/ (which has been around for a long time)
actually seems to have more code.

It's also a pity that so much free code is behind passwords
and protected from spiders.

-Andi

2007-10-02 09:01:24

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Mon, Oct 01, 2007 at 06:25:07PM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > BTW, it looks like risky to criticise sched_yield too much: some
> > people can misinterpret such discussions and stop using this at all,
> > even where it's right.
>
> Really, i have never seen a _single_ mainstream app where the use of
> sched_yield() was the right choice.
>
> Fortunately, the sched_yield() API is already one of the most rarely
> used scheduler functionalities, so it does not really matter. [ In my
> experience a Linux scheduler is stabilizing pretty well when the
> discussion shifts to yield behavior, because that shows that everything
> else is pretty much fine ;-) ]
>
> But, because you assert it that it's risky to "criticise sched_yield()
> too much", you sure must know at least one real example where it's right
> to use it (and cite the line and code where it's used, with
> specificity)?

Very clever move! And I see some people have caught on to this...

Since sched_yield() is a very general-purpose tool, it can easily be
replaced by others, of course, just like probably half of all
system calls. And such things are often done in code during
optimization. But, IMHO, the main value of sched_yield() is that it's
an easy-to-use and very readable way to mark some place - sometimes
even to test whether such an idea is reasonable at all. Otherwise, many
such possibilities could easily stay forgotten forever.

But you are right, the value of this call shouldn't be exaggerated,
and my proposal was overkill. Anyway, it seems something better than
the current ways of doing this could be imagined. They look like two
extremes, and something in between, and not too complicated, should
suffice. Currently, I wonder if simply charging such a task (with a
recalculated key) for all the time it could have used isn't one such
method. It seems functionally analogous to going to the end of the
queue of tasks with the same priority under the old scheduler.

Regards,
Jarek P.

2007-10-02 09:24:16

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Mon, Oct 01, 2007 at 10:43:56AM +0200, Jarek Poplawski wrote:
...
> etc., if we know (after testing) eg. average expedition time of such

No new theory - it's only my reverse Polish translation. Should be:
"etc., if we know (after testing) eg. average dispatch time of such".

Sorry,
Jarek P.

2007-10-02 11:51:14

by linux-os (Dick Johnson)

[permalink] [raw]
Subject: Re: yield API


On Tue, 2 Oct 2007, Ingo Molnar wrote:

>
> * David Schwartz <[email protected]> wrote:
>
>>> These are generic statements, but i'm _really_ interested in the
>>> specifics. Real, specific code that i can look at. The typical Linux
>>> distro consists of in execess of 500 millions of lines of code, in
>>> tens of thousands of apps, so there really must be some good, valid
>>> and "right" use of sched_yield() somewhere in there, in some
>>> mainstream app, right? (because, as you might have guessed it, in
>>> the past decade of sched_yield() existence i _have_ seen my share of
>>> sched_yield() utilizing user-space code, and at the moment i'm not
>>> really impressed by those examples.)
>>
>> Maybe, maybe not. Even if so, it would be very difficult to find.
>> Simply grepping for sched_yield is not going to help because
>> determining whether a given use of sched_yield is smart is not going
>> to be easy.
>
> sched_yield() has been around for a decade (about three times longer
> than futexes were around), so if it's useful, it sure should have grown
> some 'crown jewel' app that uses it and shows off its advantages,
> compared to other locking approaches, right?
>
> For example, if you asked me whether pipes are the best thing for
> certain apps, i could immediately show you tons of examples where they
> are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or
> just about any other core kernel concept or API. Your notion that
> showing a good example of an API would be "difficult" because it's hard
> to determine "smart" use is not tenable i believe and does not
> adequately refute my pretty plain-meaning "it does not exist" assertion.
>
> If then this is one more supporting proof for the fundamental weakness
> of the sched_yield() API. Rarely are we able to so universally condemn
> an API: real-life is usually more varied and even for theoretically
> poorly defined APIs _some_ sort of legitimate use does grow up.
>
> APIs that are not in any real, meaningful use, despite a decade of
> presence are not really interesting to me personally. (especially in
> this case where we know exactly _why_ the API is used so rarely.) Sure
> we'll continue to support it in the best possible way, with the usual
> kernel maintainance policy: without hurting other, more commonly used
> APIs. That was the principle we followed in previous schedulers too. And
> if anyone has a patch to make sched_yield() better than it is today, i'm
> of course interested in it.
>
> Ingo

But sched_yield() on Linux never did what the majority of
programmers assumed it would do (give up the CPU to some
runnable processes for the rest of the time-slice). Instead,
it just appeared to spin in the kernel. Therefore, those
who needed a sched_yield() just used usleep().

Whether or not there is a POSIX definition of sched_yield(),
there is a need for something that will give up the CPU
and not busy-wait. There are many control applications
where state-machines are kept in user-mode code. The code
waits for an event. It shouldn't be spinning, wasting
CPU time, when the kernel can be doing file and network
I/O with the wasted CPU cycles.

So, just because sched_yield() doesn't work as expected,
is not the reason to get rid of it.

Cheers,
Dick Johnson
Penguin : Linux version 2.6.16.24 on an i686 machine (5592.59 BogoMips).
My book : http://www.AbominableFirebug.com/
_



2007-10-02 13:37:17

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Tue, Oct 02, 2007 at 11:03:46AM +0200, Jarek Poplawski wrote:
...
> should suffice. Currently, I wonder if simply charging (with a key
> recalculated) such a task for all the time it could've used isn't one
> of such methods. It seems, it's functionally analogous with going to
> the end of que of tasks with the same priority according to the old
> sched.

Only now have I noticed that I'm repeating the idea of David Schwartz
(and probably not only his) from a nearby thread, sorry. But I'm still
trying to find out what was wrong with it.

Jarek P.

2007-10-02 15:38:09

by David Schwartz

[permalink] [raw]
Subject: RE: Network slowdown due to CFS


This is a combined response to Arjan's:

> that's also what trylock is for... as well as spinaphores...
> (you can argue that futexes should be more intelligent and do
> spinaphore stuff etc... and I can buy that, lets improve them in the
> kernel by any means. But userspace yield() isn't the answer. A
> yield_to() would have been a ton better (which would return immediately
> if the thing you want to yield to is running already somethere), a
> blind "yield" isn't, since it doesn't say what you want to yield to.

And Ingo's:

> but i'll attempt to weave the chain of argument one step forward (in the
> hope of not distorting your point in any way): _if_ the sched_yield()
> call in that memory allocator is done because it uses a locking
> primitive that is unfair (hence the memory pool lock can be starved),
> then the "guaranteed large latency" is caused by "guaranteed
> unfairness". The solution is not to insert a random latency (via a
> sched_yield() call) that also has a side-effect of fairness to other
> tasks, because this random latency introduces guaranteed unfairness for
> this particular task. The correct solution IMO is to make the locking
> primitive more fair _without_ random delays, and there are a number of
> good techniques for that. (they mostly center around the use of futexes)

So now I not only have to come up with an example where sched_yield is the
best practical choice, I have to come up with one where sched_yield is the
best conceivable choice? Didn't we start out by agreeing these are very rare
cases? Why are we designing new APIs for them (Arjan) and why do we care
about their performance (Ingo)?

These are *rare* cases. It is a waste of time to optimize them.

In this case, nobody cares about fairness to the service thread. It is a
cleanup task that probably runs every few minutes. It could be delayed for
minutes and nobody would care. What they do care about is the impact of the
service thread on the threads doing real work.

You two challenged me to present any legitimate use case for sched_yield. I
see now that was not a legitimate challenge and you two were determined to
shoot down any response no matter how reasonable on the grounds that there
is some way to do it better, no matter how complex, impractical, or
unjustified by the real-world problem.

I think if a pthread_mutex had a 'yield to others blocking on this mutex'
kind of a 'go to the back of the line' option, that would cover the majority
of cases where sched_yield is your best choice currently. Unfortunately,
POSIX gave us yield.

Note that I think we all agree that any program whose performance relies on
quirks of sched_yield (such as the examples that have been cited as CFS
'regressions') is horribly broken. None of the cases I am suggesting use
sched_yield as anything more than a minor optimization.

DS


2007-10-02 15:56:42

by Douglas McNaught

[permalink] [raw]
Subject: Re: yield API

"linux-os \(Dick Johnson\)" <[email protected]> writes:

> Whether or not there is a POSIX definition of sched_yield(),
> there is a need for something that will give up the CPU
> and not busy-wait. There are many control applications
> where state-machines are kept in user-mode code. The code
> waits for an event. It shouldn't be spinning, wasting
> CPU time, when the kernel can be doing file and network
> I/O with the wasted CPU cycles.

These "control applications" would be real-time processes, for which
(AIUI) sched_yield() behavior is completely well-defined and
implemented as such by Linux. The question here is how useful the
call is for SCHED_OTHER (non-real-time) processes, for which it has no
well-defined semantics.

-Doug

2007-10-02 22:38:36

by Eric St-Laurent

[permalink] [raw]
Subject: Re: yield API


On Tue, 2007-10-02 at 08:46 +0200, Ingo Molnar wrote:

[...]

> APIs that are not in any real, meaningful use, despite a decade of
> presence are not really interesting to me personally. (especially in
> this case where we know exactly _why_ the API is used so rarely.) Sure
> we'll continue to support it in the best possible way, with the usual
> kernel maintainance policy: without hurting other, more commonly used
> APIs. That was the principle we followed in previous schedulers too. And
> if anyone has a patch to make sched_yield() better than it is today, i'm
> of course interested in it.

Do you still have intentions to add a directed yield API? I remember
seeing it in the earlier CFS patches.


- Eric


2007-10-03 07:12:54

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On 02-10-2007 17:37, David Schwartz wrote:
...
> So now I not only have to come up with an example where sched_yield is the
> best practical choice, I have to come up with one where sched_yield is the
> best conceivable choice? Didn't we start out by agreeing these are very rare
> cases? Why are we designing new APIs for them (Arjan) and why do we care
> about their performance (Ingo)?
>
> These are *rare* cases. It is a waste of time to optimize them.

Probably we'll start to care after the first comparison tests are done by
our rivals. It should be a piece of cake for them to find the "right"
code...

Regards,
Jarek P.

2007-10-03 07:59:58

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On 02-10-2007 08:06, Ingo Molnar wrote:
> * David Schwartz <[email protected]> wrote:
...
>> I'm not familiar enough with CFS' internals to help much on the
>> implementation, but there may be some simple compromise yield that
>> might work well enough. How about simply acting as if the task used up
>> its timeslice and scheduling the next one? (Possibly with a slight
>> reduction in penalty or reward for not really using all the time, if
>> possible?)
>
> firstly, there's no notion of "timeslices" in CFS. (in CFS tasks "earn"
> a right to the CPU, and that "right" is not sliced in the traditional
> sense) But we tried a conceptually similar thing [...]

From kernel/sched_fair.c:

"/*
* Targeted preemption latency for CPU-bound tasks:
* (default: 20ms, units: nanoseconds)
*
* NOTE: this latency value is not the same as the concept of
* 'timeslice length' - timeslices in CFS are of variable length.
* (to see the precise effective timeslice length of your workload,
* run vmstat and monitor the context-switches field)
..."

So, no notion of something which is(!) of variable length, and whose
precise effective timeslice length can be seen in nanoseconds? (But
not a timeslice!)

Well, I'm starting to think this new scheduler could still be too simple...


> [...] [ and this is driven by compatibility
> goals - regardless of how broken we consider yield use. The ideal
> solution is of course to almost never use yield. Fortunately 99%+ of
> Linux apps follow that ideal solution ;-) ]

Nevertheless, it seems, this 1% is important enough to boast a little:

"( another detail: due to nanosec accounting and timeline sorting,
sched_yield() support is very simple under CFS, and in fact under
CFS sched_yield() behaves much better than under any other
scheduler i have tested so far. )"
[Documentation/sched-design-CFS.txt]

Cheers,
Jarek P.

2007-10-03 08:16:35

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks
> > "earn" a right to the CPU, and that "right" is not sliced in the
> > traditional sense) But we tried a conceptually similar thing [...]
>
> >From kernel/sched_fair.c:
>
> "/*
> * Targeted preemption latency for CPU-bound tasks:
> * (default: 20ms, units: nanoseconds)
> *
> * NOTE: this latency value is not the same as the concept of
> * 'timeslice length' - timeslices in CFS are of variable length.
> * (to see the precise effective timeslice length of your workload,
> * run vmstat and monitor the context-switches field)
> ..."
>
> So, no notion of something, which are(!) of variable length, and which
> precise effective timeslice lenght can be seen in nanoseconds? (But
> not timeslice!)

You should really read and understand the code you are arguing about :-/

In the 2.6.22 scheduler, there was a p->time_slice per-task variable
that could be manipulated. (Note that 2.6.22's sched_yield() did not
manipulate p->time_slice.)

sysctl_sched_latency, on the other hand, is not something that is per task
(it is global), so there is no pending timeslice to be "cleared" as has
been naively suggested.

Ingo

2007-10-03 08:54:18

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks
> > > "earn" a right to the CPU, and that "right" is not sliced in the
> > > traditional sense) But we tried a conceptually similar thing [...]
> >
> > >From kernel/sched_fair.c:
> >
> > "/*
> > * Targeted preemption latency for CPU-bound tasks:
> > * (default: 20ms, units: nanoseconds)
> > *
> > * NOTE: this latency value is not the same as the concept of
> > * 'timeslice length' - timeslices in CFS are of variable length.
> > * (to see the precise effective timeslice length of your workload,
> > * run vmstat and monitor the context-switches field)
> > ..."
> >
> > So, no notion of something, which are(!) of variable length, and which
> > precise effective timeslice lenght can be seen in nanoseconds? (But
> > not timeslice!)
>
> You should really read and understand the code you are arguing about :-/

Maybe you could help me with better comments? IMHO, it would be enough
to warn that the new timeslices have a different meaning, or to stop using
this term at all. (Btw, in -rc8-mm2 I see a new sched_slice() function which
seems to return... time.)

>
> In the 2.6.22 scheduler, there was a p->time_slice per task variable
> that could be manipulated. (Note, in 2.6.22's sched_yield() did not
> manipulate p->time_slice.)
>
> sysctl_sched_latency on the other hand is not something that is per task
> (it is global) so there is no pending timeslice to be "cleared" as it
> has been suggested naively.

But there is this "something", very similar and very misleading, which you
count e.g. in check_preempt_curr_fair to find out if the time is over, and I
think this could be similar enough to what David Schwartz wanted to
use in his idea, yet you didn't care to explain why it's so different?

Jarek P.

2007-10-03 09:11:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
> >
> > * Jarek Poplawski <[email protected]> wrote:
> >
> > > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks
> > > > "earn" a right to the CPU, and that "right" is not sliced in the
> > > > traditional sense) But we tried a conceptually similar thing [...]
> > >
> > > >From kernel/sched_fair.c:
> > >
> > > "/*
> > > * Targeted preemption latency for CPU-bound tasks:
> > > * (default: 20ms, units: nanoseconds)
> > > *
> > > * NOTE: this latency value is not the same as the concept of
> > > * 'timeslice length' - timeslices in CFS are of variable length.
> > > * (to see the precise effective timeslice length of your workload,
> > > * run vmstat and monitor the context-switches field)
> > > ..."
> > >
> > > So, no notion of something, which are(!) of variable length, and which
> > > precise effective timeslice lenght can be seen in nanoseconds? (But
> > > not timeslice!)
> >
> > You should really read and understand the code you are arguing about :-/
>
> Maybe you could help me with better comments? IMHO, it would be enough
> to warn new timeslices have different meaning, or stop to use this
> term at all. [...]

i'm curious, what better do you need than the very detailed comment
quoted above? Which bit of "this latency value is not the same as the
concept of timeslice length" is difficult to understand? The timeslices
of tasks (i.e. the time they spend on a CPU without scheduling away) is
_not_ maintained directly in CFS as a per-task variable that can be
"cleared", it's not the metric that drives scheduling. Yes, of course
CFS too "slices up CPU time", but those slices are not the per-task
variables of traditional schedulers and cannot be 'cleared'.

> [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems
> to return... time.)

wrong again. That is a function, not a variable to be cleared. (Anyway,
the noise/signal ratio is getting increasingly high in this thread with
no progress in sight, so i cannot guarantee any further replies -
possibly others will pick up the tab and explain/discuss any other
questions that might come up. Patches are welcome of course.)

Ingo

2007-10-03 09:48:32

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, Oct 03, 2007 at 11:10:58AM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > On Wed, Oct 03, 2007 at 10:16:13AM +0200, Ingo Molnar wrote:
> > >
> > > * Jarek Poplawski <[email protected]> wrote:
> > >
> > > > > firstly, there's no notion of "timeslices" in CFS. (in CFS tasks
> > > > > "earn" a right to the CPU, and that "right" is not sliced in the
> > > > > traditional sense) But we tried a conceptually similar thing [...]
> > > >
> > > > >From kernel/sched_fair.c:
> > > >
> > > > "/*
> > > > * Targeted preemption latency for CPU-bound tasks:
> > > > * (default: 20ms, units: nanoseconds)
> > > > *
> > > > * NOTE: this latency value is not the same as the concept of
> > > > * 'timeslice length' - timeslices in CFS are of variable length.
> > > > * (to see the precise effective timeslice length of your workload,
> > > > * run vmstat and monitor the context-switches field)
> > > > ..."
> > > >
> > > > So, no notion of something, which are(!) of variable length, and which
> > > > precise effective timeslice lenght can be seen in nanoseconds? (But
> > > > not timeslice!)
> > >
> > > You should really read and understand the code you are arguing about :-/
> >
> > Maybe you could help me with better comments? IMHO, it would be enough
> > to warn new timeslices have different meaning, or stop to use this
> > term at all. [...]
>
> i'm curious, what better do you need than the very detailed comment
> quoted above? Which bit of "this latency value is not the same as the
> concept of timeslice length" is difficult to understand? The timeslices
> of tasks (i.e. the time they spend on a CPU without scheduling away) is
> _not_ maintained directly in CFS as a per-task variable that can be
> "cleared", it's not the metric that drives scheduling. Yes, of course
> CFS too "slices up CPU time", but those slices are not the per-task
> variables of traditional schedulers and cannot be 'cleared'.

It's not about this comment alone, but this comment plus the "no notion"
comment, which appears in sched-design-CFS.txt too.

>
> > [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems
> > to return... time.)
>
> wrong again. That is a function, not a variable to be cleared. (Anyway,
> the noise/signal ratio is getting increasingly high in this thread with
> no progress in sight, so i cannot guarantee any further replies -
> possibly others will pick up the tab and explain/discuss any other
> questions that might come up. Patches are welcome of course.)

I can't see anything about clearing. I think this was about charging,
which should change the key enough to move a task to, maybe, a better
place in the queue (tree) than the current ways do.

Jarek P.

PS: Don't you think that a nice argument with a celebrity like Ingo
Molnar himself is far more interesting than those dull patches?

2007-10-03 10:56:20

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On 03/10/2007, Jarek Poplawski <[email protected]> wrote:
> I can't see anything about clearing. I think, this was about charging,
> which should change the key enough, to move a task to, maybe, a better
> place in a que (tree) than with current ways.

just a quick patch, not tested and I've not evaluated all possible
implications yet.
But someone might give it a try with his/(her -- are even more
welcomed :-) favourite sched_yield() load.

(and white space damaged)

--- sched_fair-old.c	2007-10-03 12:45:17.010306000 +0200
+++ sched_fair.c	2007-10-03 12:44:46.899851000 +0200
@@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
 		update_curr(cfs_rq);

 		return;
+	} else if (sysctl_sched_compat_yield == 2) {
+		unsigned long ideal_runtime, delta_exec,
+			      delta_exec_weighted;
+
+		__update_rq_clock(rq);
+		/*
+		 * Update run-time statistics of the 'current'.
+		 */
+		update_curr(cfs_rq);
+
+		/*
+		 * Emulate (speed up) the effect of us being preempted
+		 * by scheduler_tick().
+		 */
+		ideal_runtime = sched_slice(cfs_rq, curr);
+		delta_exec = curr->sum_exec_runtime - curr->prev_sum_exec_runtime;
+
+		if (ideal_runtime > delta_exec) {
+			delta_exec_weighted = ideal_runtime - delta_exec;
+
+			if (unlikely(curr->load.weight != NICE_0_LOAD)) {
+				delta_exec_weighted = calc_delta_fair(delta_exec_weighted,
+								      &se->load);
+			}
+			se->vruntime += delta_exec_weighted;
+		}
+		return;
 	}
+
 	/*
 	 * Find the rightmost entry in the rbtree:
 	 */


>
> Jarek P.
>

--
Best regards,
Dmitry Adamushko


Attachments:
(No filename) (1.88 kB)
yield.patch (0.99 kB)
Download all attachments

2007-10-03 10:58:37

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On 03/10/2007, Dmitry Adamushko <[email protected]> wrote:
> On 03/10/2007, Jarek Poplawski <[email protected]> wrote:
> > I can't see anything about clearing. I think, this was about charging,
> > which should change the key enough, to move a task to, maybe, a better
> > place in a que (tree) than with current ways.
>
> just a quick patch, not tested and I've not evaluated all possible
> implications yet.
> But someone might give it a try with his/(her -- are even more
> welcomed :-) favourite sched_yield() load.
>
> (and white space damaged)
>
> --- sched_fair-old.c 2007-10-03 12:45:17.010306000 +0200
> +++ sched_fair.c 2007-10-03 12:44:46.899851000 +0200
> @@ -803,7 +803,35 @@ static void yield_task_fair(struct rq *r
> update_curr(cfs_rq);
>
> return;
> + } else if (sysctl_sched_compat_yield == 2) {
> + unsigned long ideal_runtime, delta_exec,
> + delta_exec_weighted;
> +
> + __update_rq_clock(rq);
> + /*
> + * Update run-time statistics of the 'current'.
> + */
> + update_curr(cfs_rq);
> +
> + /*
> + * Emulate (speed up) the effect of us being preempted
> + * by scheduler_tick().
> + */
> + ideal_runtime = sched_slice(cfs_rq, curr);
> + delta_exec = curr->sum_exec_runtime -
> curr->prev_sum_exec_runtime;
> +
> + if (ideal_runtime > delta_exec) {
> + delta_exec_weighted = ideal_runtime - delta_exec;
> +
> + if (unlikely(curr->load.weight != NICE_0_LOAD)) {
> + delta_exec_weighted =
> calc_delta_fair(delta_exec_weighted,
> +
> &se->load);
> + }


s/curr/se


--
Best regards,
Dmitry Adamushko

2007-10-03 11:18:01

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, Oct 03, 2007 at 12:58:26PM +0200, Dmitry Adamushko wrote:
> On 03/10/2007, Dmitry Adamushko <[email protected]> wrote:
> > On 03/10/2007, Jarek Poplawski <[email protected]> wrote:
> > > I can't see anything about clearing. I think, this was about charging,
> > > which should change the key enough, to move a task to, maybe, a better
> > > place in a que (tree) than with current ways.
> >
> > just a quick patch, not tested and I've not evaluated all possible
> > implications yet.
> > But someone might give it a try with his/(her -- are even more
> > welcomed :-) favourite sched_yield() load.
> >
> > (and white space damaged)
> >
> > --- sched_fair-old.c 2007-10-03 12:45:17.010306000 +0200
> > +++ sched_fair.c 2007-10-03 12:44:46.899851000 +0200
...
> s/curr/se

Thanks very much!

Alas, I'll be able to look at this and try only in the evening.

Best regards,
Jarek P.

2007-10-03 11:22:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Dmitry Adamushko <[email protected]> wrote:

> + se->vruntime += delta_exec_weighted;

thanks Dmitry.

Btw., this is quite similar to the yield_granularity patch i did
originally, just less flexible. It turned out that apps want either zero
granularity or "infinite" granularity, they dont actually want something
inbetween. That's the two extremes that the current sysctl expresses in
essence.

Ingo

2007-10-03 11:32:53

by Helge Hafting

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

David Schwartz wrote:
>> * Jarek Poplawski <[email protected]> wrote:
>>
>>
>>> BTW, it looks like risky to criticise sched_yield too much: some
>>> people can misinterpret such discussions and stop using this at all,
>>> even where it's right.
>>>
>
>
>> Really, i have never seen a _single_ mainstream app where the use of
>> sched_yield() was the right choice.
>>
>
> It can occasionally be an optimization. You may have a case where you can do
> something very efficiently if a lock is not held, but you cannot afford to
> wait for the lock to be released. So you check the lock, if it's held, you
> yield and then check again. If that fails, you do it the less optimal way
> (for example, dispatching it to a thread that *can* afford to wait).
>
How about:
Check the lock. If it is held, sleep for an interval that is shorter
than the acceptable waiting time. If it is still held, sleep for twice as
long. Loop until you get the lock and do the work, or until you
reach the limit for how long you can wait at this point, and then
dispatch to a thread instead.

This approach should be portable, doesn't wake up too often,
and doesn't waste the CPU. (And it won't go idle either; whoever
holds the lock will be running.)
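
A minimal sketch of that backoff loop (the starting interval and the cap are
made up for illustration):

#include <pthread.h>
#include <time.h>

#define MAX_WAIT_NS	(8 * 1000 * 1000)	/* give up after an ~8 ms step */

/* returns 1 with the lock held, 0 if the caller should dispatch instead */
int lock_with_backoff(pthread_mutex_t *lock)
{
	struct timespec ts = { 0, 50 * 1000 };	/* start at 50 us */

	for (;;) {
		if (pthread_mutex_trylock(lock) == 0)
			return 1;
		if (ts.tv_nsec > MAX_WAIT_NS)
			return 0;
		nanosleep(&ts, NULL);
		ts.tv_nsec *= 2;		/* exponential backoff */
	}
}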


Helge Hafting

2007-10-03 11:38:03

by Jarek Poplawski

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

On Wed, Oct 03, 2007 at 12:55:34PM +0200, Dmitry Adamushko wrote:
...
> just a quick patch, not tested and I've not evaluated all possible
> implications yet.
> But someone might give it a try with his/(her -- are even more
> welcomed :-) favourite sched_yield() load.

Of course, after some evaluation by yourself and Ingo, the most
interesting thing would be Martin Michlmayr's testing, so I hope you'll
Cc him too?!

Jarek P.

2007-10-03 11:57:18

by Ingo Molnar

[permalink] [raw]
Subject: Re: yield


* Jarek Poplawski <[email protected]> wrote:

> On Wed, Oct 03, 2007 at 12:55:34PM +0200, Dmitry Adamushko wrote:
> ...
> > just a quick patch, not tested and I've not evaluated all possible
> > implications yet.
> > But someone might give it a try with his/(her -- are even more
> > welcomed :-) favourite sched_yield() load.
>
> Of course, after some evaluation by yourself and Ingo the most
> interesting should be Martin's Michlmayr testing, so I hope you'll Cc
> him too?!

My current take on this: queueing the current task right into the next
position in the tree (which is what this patch achieves in essence) was
one of the yield implementations we already tried in CFS, but it didn't
meet the expectations of some apps. So i can only repeat my argument:
this is not something that can be "solved" in the way you imagine and
your arguments just reiterate the path that CFS has already taken in the
past. So please do not expect _us_ to go out and pester people. If
people feel so inclined, they are of course welcome to test out various
approaches. (they might as well try the original yield-granularity patch
which also makes the amount of "delay" tunable, so the ideal amount of
delay can be figured out. And of course they should also try the
existing yield flag.)

Ingo

2007-10-03 12:14:21

by Jarek Poplawski

[permalink] [raw]
Subject: Re: yield

On Wed, Oct 03, 2007 at 01:56:46PM +0200, Ingo Molnar wrote:
>
> * Jarek Poplawski <[email protected]> wrote:
>
> > On Wed, Oct 03, 2007 at 12:55:34PM +0200, Dmitry Adamushko wrote:
> > ...
> > > just a quick patch, not tested and I've not evaluated all possible
> > > implications yet.
> > > But someone might give it a try with his/(her -- are even more
> > > welcomed :-) favourite sched_yield() load.
> >
> > Of course, after some evaluation by yourself and Ingo the most
> > interesting should be Martin's Michlmayr testing, so I hope you'll Cc
> > him too?!
>
> My current take on this: queue the current task right to the next
> position in the tree (this is what this patch achieves in essence) was
> one of the yield implementations we already tried in CFS but it didnt
> meet the expectations of some apps. So i can only repeat my argument:
> this is not something that can be "solved" in the way you imagine and
> your arguments just reiterate the path that CFS has already taken in the
> past. So please do not expect _us_ to go out and pester people. If
> people feel so inclined, they are of course welcome to test out various
> approaches. (they might as well try the original yield-granularity patch
> which also makes the amount of "delay" tunable, so the ideal amount of
> delay can be figured out. And of course they should also try the
> existing yield flag.)

I'm terribly sorry! Of course, the last thing I would like is to
pester anybody. I simply wasn't sure you were talking about the same
idea. And of course, there is no reason to go back to something
already tried before.

Thanks,
Jarek P.

2007-10-04 00:32:16

by Rusty Russell

[permalink] [raw]
Subject: RE: Network slowdown due to CFS

On Mon, 2007-10-01 at 09:49 -0700, David Schwartz wrote:
> > * Jarek Poplawski <[email protected]> wrote:
> >
> > > BTW, it looks like risky to criticise sched_yield too much: some
> > > people can misinterpret such discussions and stop using this at all,
> > > even where it's right.
>
> > Really, i have never seen a _single_ mainstream app where the use of
> > sched_yield() was the right choice.
>
> It can occasionally be an optimization. You may have a case where you can do
> something very efficiently if a lock is not held, but you cannot afford to
> wait for the lock to be released. So you check the lock, if it's held, you
> yield and then check again. If that fails, you do it the less optimal way
> (for example, dispatching it to a thread that *can* afford to wait).

This used to be true, and still is if you want to be portable. But the
point of futexes was precisely to attack this use case: whereas
sched_yield() says "I'm waiting for something, but I won't tell you
what" the futex ops tells the kernel what you're waiting for.

While the time to do a futex op is slightly slower than sched_yield(),
futexes win in so many cases that we haven't found a benchmark where
yield wins. Yield-lose cases include:
1) There are other unrelated process that yield() ends up queueing
behind.
2) The process you're waiting for doesn't conveniently sleep as soon as
it releases the lock, so you wait for longer than intended,
3) You race between the yield and the lock being dropped.

In summary: spin N times & futex seems optimal. The value of N depends
on the number of CPUs in the machine and other factors, but N=1 has
shown itself pretty reasonable.
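
For illustration, a stripped-down sketch of that "spin N times, then futex"
lock (the three-state 0/1/2 scheme is the usual futex-mutex idiom; the names
and the spin count are placeholders, not from a real benchmark):

#include <linux/futex.h>
#include <sys/syscall.h>
#include <unistd.h>

#define SPIN_COUNT 1	/* "N=1 has shown itself pretty reasonable" */

static long futex(int *uaddr, int op, int val)
{
	return syscall(SYS_futex, uaddr, op, val, NULL, NULL, 0);
}

void lock(int *f)	/* 0 = free, 1 = locked, 2 = locked with waiters */
{
	int i;

	for (i = 0; i < SPIN_COUNT; i++)
		if (__sync_bool_compare_and_swap(f, 0, 1))
			return;			/* got it while spinning */

	/* slow path: mark contended and sleep in the kernel */
	while (__sync_lock_test_and_set(f, 2) != 0)
		futex(f, FUTEX_WAIT, 2);
}

void unlock(int *f)
{
	if (__sync_lock_test_and_set(f, 0) == 2)
		futex(f, FUTEX_WAKE, 1);	/* wake one waiter */
}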

Hope that helps,
Rusty.

2007-10-04 05:34:22

by Casey Dahlin

[permalink] [raw]
Subject: Re: Network slowdown due to CFS

Ingo Molnar wrote:
> * Jarek Poplawski <[email protected]> wrote:
>
>> [...] (Btw, in -rc8-mm2 I see new sched_slice() function which seems
>> to return... time.)
>>
>
> wrong again. That is a function, not a variable to be cleared.

It still gives us a target time, so could we not simply have sched_yield
put the thread completely to sleep for the given amount of time? It
wholly redefines the operation, and it's far more expensive (now there's
a whole new timer involved), but it might emulate the expected behavior.
It's hideous, but so is sched_yield in the first place, so why not?

--CJD

2007-10-07 07:18:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: Network slowdown due to CFS


* Jarek Poplawski <[email protected]> wrote:

> > [...] The timeslices of tasks (i.e. the time they spend on a CPU
> > without scheduling away) is _not_ maintained directly in CFS as a
> > per-task variable that can be "cleared", it's not the metric that
> > drives scheduling. Yes, of course CFS too "slices up CPU time", but
> > those slices are not the per-task variables of traditional
> > schedulers and cannot be 'cleared'.
>
> It's not about this comment alone, but this comment plus "no notion"
> comment, which appears in sched-design-CFS.txt too.

ok - i've re-read it and it indeed is somewhat confusing without
additional context. I'll improve the wording. (sched-design-CFS.txt
needs an update anyway)

Ingo

2007-12-12 22:39:26

by Jesper Juhl

[permalink] [raw]
Subject: Re: yield API

On 02/10/2007, Ingo Molnar <[email protected]> wrote:
>
> * David Schwartz <[email protected]> wrote:
>
> > > These are generic statements, but i'm _really_ interested in the
> > > specifics. Real, specific code that i can look at. The typical Linux
> > > distro consists of in execess of 500 millions of lines of code, in
> > > tens of thousands of apps, so there really must be some good, valid
> > > and "right" use of sched_yield() somewhere in there, in some
> > > mainstream app, right? (because, as you might have guessed it, in
> > > the past decade of sched_yield()'s existence i _have_ seen my share of
> > > sched_yield() utilizing user-space code, and at the moment i'm not
> > > really impressed by those examples.)
> >
> > Maybe, maybe not. Even if so, it would be very difficult to find.
> > Simply grepping for sched_yield is not going to help because
> > determining whether a given use of sched_yield is smart is not going
> > to be easy.
>
> sched_yield() has been around for a decade (about three times longer
> than futexes have been around), so if it's useful, it sure should have grown
> some 'crown jewel' app that uses it and shows off its advantages,
> compared to other locking approaches, right?
>

I have one example of sched_yield() use in a real app. Unfortunately
it's proprietary so I can't show you the source, but I can tell you
how it's used.

The case is this: Process A forks process B. Process B does some work
that takes approximately 50 to 1000 ms to complete (it varies), then it
creates a file and continues to do other work. Process A needs to wait
for the file B creates before it can continue. Process A *could*
immediately go into some kind of "check for file; sleep n ms" loop, but
instead it starts off by calling sched_yield() to give process B a
chance to run and, hopefully, to get to the point where it has created
the file before process A is scheduled again and starts looking for it.
After that single sched_yield() call, process A does indeed go into a
"check for file; sleep 250ms" loop, but most of the time the initial
sched_yield() call actually results in the file being present without
having to loop like that.
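
In rough C, the parent side of that pattern looks something like the
sketch below; the marker-file path, the access() check and the 250 ms
interval are illustrative guesses at the shape of the (proprietary)
code, not its actual source:

#include <sched.h>
#include <time.h>
#include <unistd.h>

/* Sketch of the "yield once, then poll" wait described above. */
static void wait_for_marker(const char *path)
{
        sched_yield();          /* hope B runs now and creates the file */

        while (access(path, F_OK) != 0) {               /* "check for file" */
                struct timespec ts = { 0, 250 * 1000 * 1000 };
                nanosleep(&ts, NULL);                   /* "sleep 250ms" */
        }
}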

Now is this the best way to handle this situation? No.
Does it work better than just doing the wait loop from the start? Yes.
Is this a good way to use sched_yield()? Maybe, maybe not. But it
*is* an actual use of the API in a real app.

> For example, if you asked me whether pipes are the best thing for
> certain apps, i could immediately show you tons of examples where they
> are. Same for sockets. Or RT priorities. Or nice levels. Or futexes. Or
> just about any other core kernel concept or API.

True. But in the app I'm talking about above, rewriting the code to
communicate over a pipe, socket or anything else would have been too
large a change to make (released product, can't risk introducing (new)
bugs).

> Your notion that
> showing a good example of an API would be "difficult" because it's hard
> to determine "smart" use is not tenable, i believe, and does not
> adequately refute my pretty plain-meaning "it does not exist" assertion.
>

I agree that sched_yield() is not a very good API. I also agree that
our use of it is not the best solution to the problem we wanted to
solve, but it actually works pretty well most of the time.

> This, then, is one more supporting proof of the fundamental weakness
> of the sched_yield() API. Rarely are we able to condemn an API so
> universally: real life is usually more varied, and even for theoretically
> poorly defined APIs _some_ sort of legitimate use does grow up.
>
> APIs that are not in any real, meaningful use despite a decade of
> presence are not really interesting to me personally. (especially in
> this case, where we know exactly _why_ the API is used so rarely.) Sure,
> we'll continue to support it in the best possible way, with the usual
> kernel maintenance policy: without hurting other, more commonly used
> APIs. That was the principle we followed in previous schedulers too. And
> if anyone has a patch to make sched_yield() better than it is today, i'm
> of course interested in it.
>
Just for the record: for our use, sched_yield() seems to work just
fine both with older and newer kernels, so from my point of view the
new scheduler is doing fine in this regard.

--
Jesper Juhl <[email protected]>
Don't top-post http://www.catb.org/~esr/jargon/html/T/top-post.html
Plain text mails only, please http://www.expita.com/nomime.html

2007-12-13 04:44:17

by Kyle Moffett

[permalink] [raw]
Subject: Re: yield API

On Dec 12, 2007, at 17:39:15, Jesper Juhl wrote:
> On 02/10/2007, Ingo Molnar <[email protected]> wrote:
>> sched_yield() has been around for a decade (about three times
>> longer than futexes have been around), so if it's useful, it sure
>> should have grown some 'crown jewel' app that uses it and shows
>> off its advantages, compared to other locking approaches, right?
>
> I have one example of sched_yield() use in a real app.
> Unfortunately it's proprietary so I can't show you the source, but
> I can tell you how it's used.
>
> The case is this: Process A forks process B. Process B does some
> work that takes approximately 50 to 1000 ms to complete
> (varies), then it creates a file and continues to do other work.
> Process A needs to wait for the file B creates before it can
> continue. Process A *could* immediately go into some kind of "check
> for file; sleep n ms" loop, but instead it starts off by calling
> sched_yield() to give process B a chance to run and hopefully get
> to the point where it has created the file before process A is
> again scheduled and starts to look for it - after the single
> sched_yield() call, process A does indeed go into a "check for file; sleep
> 250ms;" loop, but most of the time the initial sched_yield() call
> actually results in the file being present without having to loop
> like that.

That is a *terrible*, disgusting way to use yield. Better options:
(1) inotify/dnotify (a sketch follows this list)
(2) create a "foo.lock" file and put the mutex in that
(3) just start with the check-file-and-sleep loop.
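
A minimal sketch of option (1), using inotify to block until the file
appears rather than yielding and polling; the directory and file-name
parameters are placeholders and error handling is omitted:

#include <sys/inotify.h>
#include <unistd.h>

/* Block until marker_path (a file inside dir) shows up. */
static void wait_for_file(const char *dir, const char *marker_path)
{
        char buf[4096];
        int fd = inotify_init();
        int wd = inotify_add_watch(fd, dir, IN_CREATE | IN_MOVED_TO);

        /* Re-check after adding the watch so a file created just before
         * the watch existed is not missed. */
        while (access(marker_path, F_OK) != 0)
                read(fd, buf, sizeof(buf));     /* blocks until events arrive */

        inotify_rm_watch(fd, wd);
        close(fd);
}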


> Now is this the best way to handle this situation? No. Does it
> work better than just doing the wait loop from the start? Yes.

It works better than doing the wait-loop from the start? What
evidence do you provide to support this assertion? Specifically, in
the first case you tell the kernel "I'm waiting for something, but I
don't know what it is or how long it will take", while in the second
case you tell the kernel "I'm waiting for something that will take
exactly X milliseconds, even though I don't know what it is." If you
really want something similar to the old behavior, then just replace
the sched_yield() call with a proper sleep for the estimated time
it will take the program to create the file.
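
In other words, something like the sketch below; the 50 ms figure is
only an illustration, taken from the lower bound of the estimate given
earlier in the thread:

#include <time.h>

/* Instead of sched_yield(), sleep for the estimated minimum time the
 * child needs before the file can exist. */
static void sleep_for_estimate(void)
{
        struct timespec est = { 0, 50 * 1000 * 1000 };  /* ~50 ms */
        nanosleep(&est, NULL);
}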


> Is this a good way to use sched_yield()? Maybe, maybe not. But it
> *is* an actual use of the API in a real app.

We weren't looking for "actual uses", especially not in binary-only
apps. What we are looking for is optimal uses of sched_yield(); ones
where that is the best alternative. This... certainly isn't.

Cheers,
Kyle Moffett

2007-12-13 20:11:13

by David Schwartz

[permalink] [raw]
Subject: RE: yield API


Kyle Moffett wrote:

> That is a *terrible* disgusting way to use yield. Better options:
> (1) inotify/dnotify

Sure, tie yourself to a Linux-specific mechanism that may or may not work
over things like NFS. That's much worse.

> (2) create a "foo.lock" file and put the mutex in that

Right, tie yourself to process-shared mutexes, which historically weren't
available on Linux. That's much better than an option that's been stable for
a decade.

> (3) just start with the check-file-and-sleep loop.

How is that better? There is literally no improvement, since the first check
will (almost) always fail.

> > Now is this the best way to handle this situation? No. Does it
> > work better than just doing the wait loop from the start? Yes.
>
> It works better than doing the wait-loop from the start? What
> evidence do you provide to support this assertion?

The evidence is that more than half the time, this avoids the sleep. That
means it has zero cost, since the yield is no heavier than a sleep would be,
and has a possible benefit, since the first sleep may be too long.

> Specifically, in
> the first case you tell the kernel "I'm waiting for something, but I
> don't know what it is or how long it will take", while in the second
> case you tell the kernel "I'm waiting for something that will take
> exactly X milliseconds, even though I don't know what it is." If you
> really want something similar to the old behavior, then just replace
> the sched_yield() call with a proper sleep for the estimated time
> it will take the program to create the file.

The problem is that if the estimate is too short, pre-emption will result in
a huge performance drop. If the estimate is too long, there will be some
wasted CPU. What was the claimed benefit of doing this again?

> > Is this a good way to use sched_yield()? Maybe, maybe not. But it
> > *is* an actual use of the API in a real app.

> We weren't looking for "actual uses", especially not in binary-only
> apps. What we are looking for is optimal uses of sched_yield(); ones
> where that is the best alternative. This... certainly isn't.

Your standards for "optimal" are totally unrealistic. In his case, it was
optimal. Using platform-specific optimizations would have meant more
development and test time for minimal benefit. Sleeping first would have had
some performance cost and no benefit. In his case, sched_yield was optimal.
Really.

DS