2003-05-21 07:55:25

by David Mosberger

[permalink] [raw]
Subject: web page on O(1) scheduler

Recently, I started to look into some odd performance behaviors of the
O(1) scheduler. I decided to document what I found in a web page
at:

http://www.hpl.hp.com/research/linux/kernel/o1.php

(it may take another couple of hours before the pages show up outside
the HP firewall, so if you get "page not found" at the moment, don't
be surprised).

I should say that I have no direct stake in the CPU scheduler (never
worked on it, not sure I ever would want to), but I feel that it's
worthwhile to at least document the O(1) scheduler a bit better.
Also, I do feel that the O(1) scheduler currently has rather poor
"long-term" properties. It would be nice if some of those properties
could be improved without hurting the other excellent properties of
the current O(1) scheduler.

I think the web pages should be most relevant to the HPTC (high
performance technical computing) community, since this is the
community that is most likely affected by some of the performance
oddities of the O(1) scheduler. Certainly anyone using OpenMP on
Intel platforms (x86 and ia64) may want to take a look.

Comments welcome.

--david


2003-05-21 08:49:18

by Arjan van de Ven

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

On Wed, 2003-05-21 at 08:49, David Mosberger wrote:

>
> I think the web pages should be most relevant to the HPTC (high
> performance technical computing) community, since this is the
> community that is most likely affected by some of the performance
> oddities of the O(1) scheduler. Certainly anyone using OpenMP on
> Intel platforms (x86 and ia64) may want to take a look.

oh you mean the OpenMP broken behavior of calling sched_yield() in a
tight loop to implement spinlocks ?

I'd guess that instead of second guessing the runtime, they should use
the pthreads primitives which are the fastest for the platform one would
hope.. (eg futexes nowadays)


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2003-05-21 09:09:07

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
>Recently, I started to look into some odd performance behaviors of the
>O(1) scheduler. I decided to document what I found in a web page
>at:
>
> http://www.hpl.hp.com/research/linux/kernel/o1.php

<snip>

>Comments welcome.

The page mentions persistent starvation. My own explorations of this issue
indicate that the primary source is always selecting the highest priority
queue. Combine that with the round-robin, and you have a good chance of
being grossly unfair with some workloads. I know for certain that lock
holders in the active array can be starved for very long periods by tasks
entering higher priority queues, thereby causing even more starvation when
they finally get the cpu and can release the lock (sleepers go through the
roof).

Try the attached overly simplistic (KISS:) diff, and watch your starvation
issues be very noticably reduced.

-Mike

2003-05-21 09:13:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 11:26 AM 5/21/2003 +0200, Mike Galbraith wrote:

>Try the attached overly simplistic (KISS:) diff, and watch your starvation
>issues be very noticably reduced.

Ahem ;-)

Now even attached.

-Mike


Attachments:
xx.diff (929.00 B)

2003-05-21 10:27:42

by Duraid Madina

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

Dear Arjan,


///////
// O
// > This is a graduate
/ \__ ~ student, laboratory
|| ///// assistant, automotive
(\ \) (~) // o <--- engineer or other
( \ \ / / // > unfortunate soul
( \ \/ / ____________/ \__O attempting to get
( \__/ / ___ ______\// performance out of a
/ | /@ ( / / ______)/ machine running Linux
( |// \ \ / / (_) by writing a simple
\ () \ \O/ and correct
\ | ) ) multithreaded program.
) ) / /
( |_ / /_
(____> (____>

^
|
|
|
|

This is you.



(with apologies to the haxor brothers,)

Duraid.


Arjan van de Ven wrote:
> oh you mean the OpenMP broken behavior of calling sched_yield() in a
> tight loop to implement spinlocks ?


2003-05-21 12:59:19

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Wed, 2003-05-21 at 12:40, Duraid Madina wrote:
> Dear Arjan,
>
>
> ///////
> // O
> // > This is a graduate
> / \__ ~ student, laboratory
> || ///// assistant, automotive
> (\ \) (~) // o <--- engineer or other
> ( \ \ / / // > unfortunate soul
> ( \ \/ / ____________/ \__O attempting to get
> ( \__/ / ___ ______\// performance out of a
> / | /@ ( / / ______)/ machine running Linux
> ( |// \ \ / / (_) by writing a simple
> \ () \ \O/ and correct
> \ | ) ) multithreaded program.
> ) ) / /
> ( |_ / /_
> (____> (____>
>
> ^
> |
> |
> |
> |
>
> This is you.
>
>

if you had spent the time you spent on this colorful graphic on reading
SUS or Posix about what sched_yield() means, you would actually have
learned something. sched_yield() means "I'm the least important thing in
the system". It's the wrong thing for cross-cpu spinlocks; futexes are
optimal for that. For letting higher priority tasks run a sleep(0) is
also far more closer to the right behavior than a sched_yield().


Attachments:
signature.asc (189.00 B)
This is a digitally signed message part

2003-05-21 13:52:00

by Olivier Galibert

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Wed, May 21, 2003 at 03:12:12PM +0200, Arjan van de Ven wrote:
> if you had spent the time you spent on this colorful graphic on reading
> SUS or Posix about what sched_yield() means, you would actually have
> learned something. sched_yield() means "I'm the least important thing in
> the system".

Susv2:

DESCRIPTION

The sched_yield() function forces the running thread to relinquish
the processor until it again becomes the head of its thread list. It
takes no arguments.


Aka "I skip the rest of my turn, try the others again once", not "I'm
unimportant" nor "please rerun me immediatly".

What is it with you people wanting to make sched_yield() unusable for
anything that makes sense?

OG.

2003-05-21 15:05:13

by David Mosberger

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

>>>>> On 21 May 2003 11:01:33 +0200, Arjan van de Ven <[email protected]> said:

Arjan> On Wed, 2003-05-21 at 08:49, David Mosberger wrote:
>> I think the web pages should be most relevant to the HPTC (high
>> performance technical computing) community, since this is the
>> community that is most likely affected by some of the performance
>> oddities of the O(1) scheduler. Certainly anyone using OpenMP on
>> Intel platforms (x86 and ia64) may want to take a look.

Arjan> oh you mean the OpenMP broken behavior of calling
Arjan> sched_yield() in a tight loop to implement spinlocks ?

Please have the courtesy of reading the web page before jumping to
conclusions.

--david

2003-05-21 17:47:42

by David Mosberger

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

>>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]> said:

Mike> The page mentions persistent starvation. My own explorations
Mike> of this issue indicate that the primary source is always
Mike> selecting the highest priority queue.

My working assumption is that the problem is a bug with the dynamic
prioritization. The task receiving the signals calls sleep() after
handling a signal and hence it's dynamic priority should end up higher
than the priority of the task sending signals (since the sender never
relinquishes the CPU voluntarily).

However, I haven't actually had time to look at the relevant code, so
I may be missing something. If you understand the issue better,
please explain to me why this isn't a dynamic priority issue.

--david

2003-05-21 18:18:37

by David Mosberger

[permalink] [raw]
Subject: Re: [Linux-ia64] web page on O(1) scheduler

It has been pointed out that on the main web page
(http://www.hpl.hp.com/research/linux/kernel/o1.php) it is very
difficult to see that the section entitled "Some problems with the
O(1) scheduler" contains links to the full problem descriptions (the
links aren't underlined and the color looks almost black).

So, in case you missed that, please do click on those links to get the
full description.

I'll see what we can do to improve the layout of the page.

Thanks,

--david

2003-05-21 19:05:35

by Duraid Madina

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

Arjan van de Ven wrote:

> if you had spent the time you spent on this colorful graphic on reading
> SUS or Posix about what sched_yield() means

Quoth the man page,

"A process can relinquish the processor voluntarily without blocking by
calling sched_yield. The process will then be moved to the end of the
queue for its static priority and a new process gets to run."

How you get from there to "I'm the least important thing in the system"
is, once again, beyond me. And even if that were a reasonable
interpretation of the word 'yield', you would still hope that more than
one CPU would get something to do if there was enough work to go around.
Agreed, "spinning" on sched_yield is a very naive way of doing
spinlocks. But that doesn't change the fact that it's a simple and
correct way. One would have hoped that calling sched_yield every few
million cycles wouldn't break the scheduler.

Duraid

2003-05-21 19:47:53

by John Stoffel

[permalink] [raw]
Subject: Cyclades Cyclom-Y ISA on 2.5.69


Hi all,

Has anyone else run into a problem compiling the Cyclades serial board
driver under 2.5.x when you have ISA defined as well? I've done a
quick hack patch to make it compile. I've been running 2.4.21-rc*
lately, so it's time I started to actually test this patch below and
see how it works.

I read the file Documentation cli-sti-removal.txt and I think I've
done the right things here.

John
John Stoffel - Senior Unix Systems Administrator - Lucent Technologies
[email protected] - http://www.lucent.com - 978-399-0479



*** drivers/char/cyclades.c.org Wed May 21 11:45:48 2003
--- drivers/char/cyclades.c Wed May 21 12:07:53 2003
***************
*** 872,877 ****
--- 872,878 ----
static int cyz_issue_cmd(struct cyclades_card *, uclong, ucchar,
uclong);
#ifdef CONFIG_ISA
static unsigned detect_isa_irq (volatile ucchar *);
+ spinlock_t isa_card_lock;
#endif /* CONFIG_ISA */

static int cyclades_get_proc_info(char *, char **, off_t , int , int
*, void *);
***************
*** 1056,1069 ****
udelay(5000L);

/* Enable the Tx interrupts on the CD1400 */
! save_flags(flags); cli();
cy_writeb((u_long)address + (CyCAR<<index), 0);
cyy_issue_cmd(address, CyCHAN_CTL|CyENB_XMTR, index);

cy_writeb((u_long)address + (CyCAR<<index), 0);
cy_writeb((u_long)address + (CySRER<<index),
cy_readb(address + (CySRER<<index)) | CyTxRdy);
! restore_flags(flags);

/* Wait ... */
udelay(5000L);
--- 1057,1070 ----
udelay(5000L);

/* Enable the Tx interrupts on the CD1400 */
! spin_lock_irqsave(&isa_card_lock,flags);
cy_writeb((u_long)address + (CyCAR<<index), 0);
cyy_issue_cmd(address, CyCHAN_CTL|CyENB_XMTR, index);

cy_writeb((u_long)address + (CyCAR<<index), 0);
cy_writeb((u_long)address + (CySRER<<index),
cy_readb(address + (CySRER<<index)) | CyTxRdy);
! spin_unlock_irqrestore(&isa_card_lock, flags);

/* Wait ... */
udelay(5000L);
***************
*** 5762,5768 ****
}
#endif /* CONFIG_CYZ_INTR */

! save_flags(flags); cli();

if ((e1 = tty_unregister_driver(&cy_serial_driver)))
printk("cyc: failed to unregister Cyclades serial
driver(%d)\n",
--- 5763,5769 ----
}
#endif /* CONFIG_CYZ_INTR */

! spin_lock_irqsave(&isa_card_lock, flags);

if ((e1 = tty_unregister_driver(&cy_serial_driver)))
printk("cyc: failed to unregister Cyclades serial
driver(%d)\n",
***************
*** 5771,5777 ****
printk("cyc: failed to unregister Cyclades callout
driver (%d)\n",
e2);

! restore_flags(flags);

for (i = 0; i < NR_CARDS; i++) {
if (cy_card[i].base_addr != 0) {
--- 5772,5778 ----
printk("cyc: failed to unregister Cyclades callout
driver (%d)\n",
e2);

! spin_unlock_irqrestore(&isa_card_lock, flags);

for (i = 0; i < NR_CARDS; i++) {
if (cy_card[i].base_addr != 0) {

2003-05-21 19:51:55

by Helge Hafting

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Thu, May 22, 2003 at 05:18:02AM +1000, Duraid Madina wrote:
> Arjan van de Ven wrote:
>
> >if you had spent the time you spent on this colorful graphic on reading
> >SUS or Posix about what sched_yield() means
>
> Quoth the man page,
>
> "A process can relinquish the processor voluntarily without blocking by
> calling sched_yield. The process will then be moved to the end of the
> queue for its static priority and a new process gets to run."
>
This assumes the implementation uses queues, one per
priority level.
And even if it does, the process may be the only one with that
priority, making this a useless way of giving up "some time".
It'll still get rescheduled over and over and prevent
lower-priority processes from running.


> How you get from there to "I'm the least important thing in the system"
> is, once again, beyond me. And even if that were a reasonable
> interpretation of the word 'yield', you would still hope that more than
> one CPU would get something to do if there was enough work to go around.
> Agreed, "spinning" on sched_yield is a very naive way of doing
> spinlocks. But that doesn't change the fact that it's a simple and
> correct way. One would have hoped that calling sched_yield every few
> million cycles wouldn't break the scheduler.

The way I understand it, the scheduler doesn't "break". You just
get a lot of useless busy waiting.

Helge Hafting

2003-05-21 20:29:31

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]>
> said:
>
> Mike> The page mentions persistent starvation. My own explorations
> Mike> of this issue indicate that the primary source is always
> Mike> selecting the highest priority queue.
>
>My working assumption is that the problem is a bug with the dynamic
>prioritization. The task receiving the signals calls sleep() after
>handling a signal and hence it's dynamic priority should end up higher
>than the priority of the task sending signals (since the sender never
>relinquishes the CPU voluntarily).

The only thing that matters is how much you sleep vs run, so yes, it should
have a higher priority unless that handling is heavy on cpu. If it
doesn't, then you have to have a different problem, because the dynamic
priority portion of the scheduler is dead simple. The only way I can
imagine that priority could end up lower than expected is heavyweight
interrupt load, or spinning out of control.

>However, I haven't actually had time to look at the relevant code, so
>I may be missing something. If you understand the issue better,
>please explain to me why this isn't a dynamic priority issue.

I just saw your other post regarding the web page. Now that I know that
there's a detailed description in there somewhere, I'll go read it and see
if any of what I've gleaned from crawling around the scheduler code is
useful. I thought you might be encountering the same kind of generic
starvation I've seen. Ergo, the simple diag patch.

-Mike

2003-05-21 23:45:14

by Alan

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Mer, 2003-05-21 at 20:18, Duraid Madina wrote:
> "A process can relinquish the processor voluntarily without blocking by
> calling sched_yield. The process will then be moved to the end of the
> queue for its static priority and a new process gets to run."
>
> How you get from there to "I'm the least important thing in the system"
> is, once again, beyond me. And even if that were a reasonable


Assuming nobody is niced the two say the same thing. Note "for its
_static_ priority" not for its dynamic priority. So sched_yield puts you
at the back of a pile of cpu hogs cos thats what the spec says it does

2003-05-22 00:25:58

by Rik van Riel

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

On Wed, 21 May 2003, Mike Galbraith wrote:
> At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
> >Recently, I started to look into some odd performance behaviors of the
> >O(1) scheduler. I decided to document what I found in a web page
> >at:
> >
> > http://www.hpl.hp.com/research/linux/kernel/o1.php
>
> The page mentions persistent starvation. My own explorations of this
> issue indicate that the primary source is always selecting the highest
> priority queue.

It's deeper than that. The O(1) scheduler doesn't consider
actual CPU usage as a factor of CPU priority.


Rik
--
Engineers don't grow up, they grow sideways.
http://www.surriel.com/ http://kernelnewbies.org/

2003-05-22 05:35:14

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 08:38 PM 5/21/2003 -0400, Rik van Riel wrote:
>On Wed, 21 May 2003, Mike Galbraith wrote:
> > At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
> > >Recently, I started to look into some odd performance behaviors of the
> > >O(1) scheduler. I decided to document what I found in a web page
> > >at:
> > >
> > > http://www.hpl.hp.com/research/linux/kernel/o1.php
> >
> > The page mentions persistent starvation. My own explorations of this
> > issue indicate that the primary source is always selecting the highest
> > priority queue.
>
>It's deeper than that. The O(1) scheduler doesn't consider
>actual CPU usage as a factor of CPU priority.

Oh, there's no doubt in my mind that it's _much_ deeper than my little
surface scratchings ;-)

It does consider cpu usage though. Your run history is right there in your
accumulated sleep_avg. Unfortunately (in some ways, fortunate in others..
conflict) that information can be diluted down to nothing instantly by new
input from one wakeup.

-Mike

2003-05-22 09:34:58

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]>
> said:
>
> Mike> The page mentions persistent starvation. My own explorations
> Mike> of this issue indicate that the primary source is always
> Mike> selecting the highest priority queue.
>
>My working assumption is that the problem is a bug with the dynamic
>prioritization. The task receiving the signals calls sleep() after
>handling a signal and hence it's dynamic priority should end up higher
>than the priority of the task sending signals (since the sender never
>relinquishes the CPU voluntarily).
>
>However, I haven't actually had time to look at the relevant code, so
>I may be missing something. If you understand the issue better,
>please explain to me why this isn't a dynamic priority issue.

You're right, it looks like a corner case. It works fine here with the
attached diff.

-Mike


Attachments:
xx.diff (1.52 kB)

2003-05-22 14:34:11

by Valdis Klētnieks

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

On Thu, 22 May 2003 07:52:44 +0200, Mike Galbraith said:

> It does consider cpu usage though. Your run history is right there in your
> accumulated sleep_avg. Unfortunately (in some ways, fortunate in others..
> conflict) that information can be diluted down to nothing instantly by new
> input from one wakeup.

Maybe there should be a scheduler tunable that says how much it should be
diluted?


Attachments:
(No filename) (226.00 B)

2003-05-22 15:55:05

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 10:47 AM 5/22/2003 -0400, [email protected] wrote:
>On Thu, 22 May 2003 07:52:44 +0200, Mike Galbraith said:
>
> > It does consider cpu usage though. Your run history is right there in
> your
> > accumulated sleep_avg. Unfortunately (in some ways, fortunate in others..
> > conflict) that information can be diluted down to nothing instantly by new
> > input from one wakeup.
>
>Maybe there should be a scheduler tunable that says how much it should be
>diluted?

I hope not. Once you use a knob, you _have_ to use it again when you
change workloads.

-Mike

2003-05-22 16:08:25

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 11:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
>> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]>
>> said:
>>
>> Mike> The page mentions persistent starvation. My own explorations
>> Mike> of this issue indicate that the primary source is always
>> Mike> selecting the highest priority queue.
>>
>>My working assumption is that the problem is a bug with the dynamic
>>prioritization. The task receiving the signals calls sleep() after
>>handling a signal and hence it's dynamic priority should end up higher
>>than the priority of the task sending signals (since the sender never
>>relinquishes the CPU voluntarily).
>>
>>However, I haven't actually had time to look at the relevant code, so
>>I may be missing something. If you understand the issue better,
>>please explain to me why this isn't a dynamic priority issue.
>
>You're right, it looks like a corner case.

Out of curiosity, is someone hitting that with a real program?

-Mike

aside:
if so, I suppose nano-ticks may be needed. rounding up gave us too many
"nano-ticks", and was the first problem with irman, which brought round
down into activate_task(). now, test-starve.c appears, and it turns out to
be too many nano-ticks _missing_. (rounding up doesn't "fix" that one btw
[too fast], but what I did to demonstrate the problem does re-break irman
rather nicely:)

2003-05-22 17:45:48

by David Mosberger

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

>>>>> On Thu, 22 May 2003 18:25:54 +0200, Mike Galbraith <[email protected]> said:

Mike> Out of curiosity, is someone hitting that with a real program?

Yes, it caused a failure in a validation program. The test case is a
stripped-down version of the original program and, of course, doesn't
make much sense other than to demonstrate the scheduling problem.

--david

2003-05-23 00:43:58

by Boehm, Hans

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Wed, 21 May 2003, Arjan van de Ven wrote:

> oh you mean the OpenMP broken behavior of calling sched_yield() in a
> tight loop to implement spinlocks ?
>
> I'd guess that instead of second guessing the runtime, they should use
> the pthreads primitives which are the fastest for the platform one would
> hope.. (eg futexes nowadays)
>

That really depends on the circumstances. I think there will always
be applications that use custom synchronization because they need the
last bit of performance in their specific environment.

I just ran a quick test to compare

a) 10,000,000 lock/unlock pairs with a rudimentary custom user-level spinlock
implementation, and
b) 10,000,000 pthread_mutex_lock/pthread_mutex_unlock pairs.

All locking is done by a single thread; this is the completely contention-free
case.

On a 1GHz Itanium 2 I get

Custom lock: 180 msecs
Custom lock: 1382 msecs

On a 2GHz Xeon, I get

Custom lock: 646 msecs
Custom lock: 1659 msecs

There are good reasons for the differences:

1) The pthread implementation needs an atomic operation on lock exit to
check whether waiters need to be awoken. The spin lock just needs a
release barrier, which is cheap on both platforms.

2) The custom lock can be inlined. The pthread one normally involves a
dynamic library call. That has to be the case if you want to be able
to plug in different thread implementations.

3) Pthread locks come in various flavors, and the interface is designed
such that you have to check at runtime which flavor you have.

In the contention case there are other interesting issues, since it's
often far more efficient to spin before attempting to yield, and pthreads
implementations don't always do that.

The original case may also have involved barrier synchronization instead
of locks, in which case there is probably at least as much motivation to
"roll your own".

To reproduce my results from attached files (ao is my current attempt at
a portable atomic operations library):

tar xvfz ao-0.2.tar.gz
cp time_lock.c ao-0.2
cd ao-0.2
gcc -O2 time_lock.c -lpthread
./a.out

--
Hans Boehm
([email protected])


Attachments:
ao-0.2.tar.gz (12.96 kB)
ao-0.2.tar.gz
time_lock.c (1.72 kB)
time_lock.s
Download all attachments

2003-05-23 08:17:34

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Thu, May 22, 2003 at 06:07:46PM -0700, Hans Boehm wrote:
> case.
>
> On a 1GHz Itanium 2 I get
>
> Custom lock: 180 msecs
> Custom lock: 1382 msecs
>
> On a 2GHz Xeon, I get
>
> Custom lock: 646 msecs
> Custom lock: 1659 msecs

is the pthreads one with nptl ?

2003-05-25 09:00:32

by Mike Galbraith

[permalink] [raw]
Subject: Re: web page on O(1) scheduler

At 07:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>At 08:38 PM 5/21/2003 -0400, Rik van Riel wrote:
>>On Wed, 21 May 2003, Mike Galbraith wrote:
>> > At 11:49 PM 5/20/2003 -0700, David Mosberger wrote:
>> > >Recently, I started to look into some odd performance behaviors of the
>> > >O(1) scheduler. I decided to document what I found in a web page
>> > >at:
>> > >
>> > > http://www.hpl.hp.com/research/linux/kernel/o1.php
>> >
>> > The page mentions persistent starvation. My own explorations of this
>> > issue indicate that the primary source is always selecting the highest
>> > priority queue.
>>
>>It's deeper than that. The O(1) scheduler doesn't consider
>>actual CPU usage as a factor of CPU priority.
>
>Oh, there's no doubt in my mind that it's _much_ deeper than my little
>surface scratchings ;-)
>
>It does consider cpu usage though. Your run history is right there in
>your accumulated sleep_avg. Unfortunately (in some ways, fortunate in
>others.. conflict) that information can be diluted down to nothing
>instantly by new input from one wakeup.

Or did you mean that it misses a bunch of cpu usage? I went looking at cpu
usage, and...

Unless there's something seriously b0rked in the attached (don't _think_
so, but...;), trusting an irq that happens 1/ms to tick tasks isn't 100%
effective, even if you aren't context switching a zillion times
faster. The attached diff produced the attached log while running
test-starve.c. I get some pretty serious thievery even when doing ho-hum
stuff. We can deactivate tasks at _really_ high frequency, and besides, if
the timer interrupt doesn't fire while the last runnable task is active, he
can be missed for a while and have accumulated up nearly a full ms. It
seems to me that with the right timing, you could end up stealing a bunch
of cpu (my logs say it's happening a _lot_ under load. again, diff might
be b0rked)

-Mike


Attachments:
xx.log (7.09 kB)
xx.diff (6.42 kB)
Download all attachments

2003-05-27 14:59:03

by Mike Galbraith

[permalink] [raw]
Subject: [case closed] Re: web page on O(1) scheduler

At 06:25 PM 5/22/2003 +0200, Mike Galbraith wrote:
>At 11:52 AM 5/22/2003 +0200, Mike Galbraith wrote:
>>At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
>>> >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith
>>> <[email protected]> said:
>>>
>>> Mike> The page mentions persistent starvation. My own explorations
>>> Mike> of this issue indicate that the primary source is always
>>> Mike> selecting the highest priority queue.
>>>
>>>My working assumption is that the problem is a bug with the dynamic
>>>prioritization. The task receiving the signals calls sleep() after
>>>handling a signal and hence it's dynamic priority should end up higher
>>>than the priority of the task sending signals (since the sender never
>>>relinquishes the CPU voluntarily).
>>>
>>>However, I haven't actually had time to look at the relevant code, so
>>>I may be missing something. If you understand the issue better,
>>>please explain to me why this isn't a dynamic priority issue.
>>
>>You're right, it looks like a corner case.
>
>Out of curiosity, is someone hitting that with a real program?
>
> -Mike
>
>aside:
>if so, I suppose nano-ticks may be needed....

Since I spent a bunch of time fumbling around with this problem, I may as
well post one last time and put it to bed. I bet Ingo has a _lot_ better
way to cure that problem, but...

Yes indeed, pedantic tracking of cpu usage does fix the problem. It also
makes for some interesting changes in contest numbers (well, _I_ find them
interesting;) If anyone wants to see the numbers, they're attached. If
anyone wants to try the attached diff, it's attached too, have fun... the
standard "you get to keep the pieces" lkml warrantee applies :) FWIW, it
works just fine here, and the only time someone now manages to steal ticks
(one) is (apparently) once in a long while when someone (whose initials
appear to be IDE:) keeps interrupts disabled for too long. Nobody has yet
managed to steal more than one tick.

The End,

-Mike

P.S. ok, so they're micro-ticks... what's a few orders of magnitude :)


Attachments:
xx.diff (7.36 kB)
contest.log (1.52 kB)
Download all attachments

2003-05-28 22:05:42

by Bill Davidsen

[permalink] [raw]
Subject: Re: [Linux-ia64] Re: web page on O(1) scheduler

On Wed, 21 May 2003, Olivier Galibert wrote:

> On Wed, May 21, 2003 at 03:12:12PM +0200, Arjan van de Ven wrote:
> > if you had spent the time you spent on this colorful graphic on reading
> > SUS or Posix about what sched_yield() means, you would actually have
> > learned something. sched_yield() means "I'm the least important thing in
> > the system".
>
> Susv2:
>
> DESCRIPTION
>
> The sched_yield() function forces the running thread to relinquish
> the processor until it again becomes the head of its thread list. It
> takes no arguments.
>
>
> Aka "I skip the rest of my turn, try the others again once", not "I'm
> unimportant" nor "please rerun me immediatly".
>
> What is it with you people wanting to make sched_yield() unusable for
> anything that makes sense?

Have to agree, I have a context switching benchmark which uses a spinlock
in shared memory for do-it-yourself gating, and it wants sched_yeild() to
be useful on uni. The SuS is pretty clear about this, the useful behaviour
is also the required behaviour, why are people resisting doing it that
way?

--
bill davidsen <[email protected]>
CTO, TMR Associates, Inc
Doing interesting things with little computers since 1979.

2003-06-03 20:46:12

by J.A. Magallon

[permalink] [raw]
Subject: sched.c gives ICE [Was: Re: web page on O(1) scheduler]


On 05.22, Mike Galbraith wrote:
> At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> > >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]>
> > said:
> >
> > Mike> The page mentions persistent starvation. My own explorations
> > Mike> of this issue indicate that the primary source is always
> > Mike> selecting the highest priority queue.
> >
> >My working assumption is that the problem is a bug with the dynamic
> >prioritization. The task receiving the signals calls sleep() after
> >handling a signal and hence it's dynamic priority should end up higher
> >than the priority of the task sending signals (since the sender never
> >relinquishes the CPU voluntarily).
> >
> >However, I haven't actually had time to look at the relevant code, so
> >I may be missing something. If you understand the issue better,
> >please explain to me why this isn't a dynamic priority issue.
>
> You're right, it looks like a corner case. It works fine here with the
> attached diff.
>

Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
and I just get an ICE:

End of search list.
sched.c: In function `do_schedule':
sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
Please submit a full bug report,
with preprocessed source if appropriate.
See <URL:https://qa.mandrakesoft.com/> for instructions.

sched.c::1003 is the closing brace for do_schedule().

Any idea ?

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-rc6-jam1 (gcc 3.2.3 (Mandrake Linux 9.2 3.2.3-1mdk))

2003-06-03 22:11:51

by Mike Galbraith

[permalink] [raw]
Subject: Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]

At 10:59 PM 6/3/2003 +0200, J.A. Magallon wrote:

>On 05.22, Mike Galbraith wrote:
> > At 10:56 AM 5/21/2003 -0700, David Mosberger wrote:
> > > >>>>> On Wed, 21 May 2003 11:26:31 +0200, Mike Galbraith <[email protected]>
> > > said:
> > >
> > > Mike> The page mentions persistent starvation. My own explorations
> > > Mike> of this issue indicate that the primary source is always
> > > Mike> selecting the highest priority queue.
> > >
> > >My working assumption is that the problem is a bug with the dynamic
> > >prioritization. The task receiving the signals calls sleep() after
> > >handling a signal and hence it's dynamic priority should end up higher
> > >than the priority of the task sending signals (since the sender never
> > >relinquishes the CPU voluntarily).
> > >
> > >However, I haven't actually had time to look at the relevant code, so
> > >I may be missing something. If you understand the issue better,
> > >please explain to me why this isn't a dynamic priority issue.
> >
> > You're right, it looks like a corner case. It works fine here with the
> > attached diff.
> >
>
>Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
>and I just get an ICE:

No, I use 2.95.3.

>End of search list.
>sched.c: In function `do_schedule':
>sched.c:1003: internal compiler error: in merge_assigned_reloads, at
>reload1.c:6134
>Please submit a full bug report,
>with preprocessed source if appropriate.
>See <URL:https://qa.mandrakesoft.com/> for instructions.
>
>sched.c::1003 is the closing brace for do_schedule().
>
>Any idea ?

Nope... gcc can have the blame if it wants it :)

-Mike

2003-06-22 21:49:23

by J.A. Magallon

[permalink] [raw]
Subject: Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]


On 06.03, J.A. Magallon wrote:
>
[...]
>
> Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
> and I just get an ICE:
>
> End of search list.
> sched.c: In function `do_schedule':
> sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
> Please submit a full bug report,
> with preprocessed source if appropriate.
> See <URL:https://qa.mandrakesoft.com/> for instructions.
>
> sched.c::1003 is the closing brace for do_schedule().
>
> Any idea ?
>

FYI, it was a gcc bug. In latest gcc release in mandrake:

* Tue Jun 17 2003 Juan Quintela <[email protected]> 3.3-2mdk

- Add reload1 patch (fix bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10890).

Now it builds fine with the patch attached.

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-jam1 (gcc 3.3 (Mandrake Linux 9.2 3.3-1mdk))

2003-06-22 21:56:27

by J.A. Magallon

[permalink] [raw]
Subject: Re: sched.c gives ICE [Was: Re: web page on O(1) scheduler]


On 06.23, J.A. Magallon wrote:
>
> On 06.03, J.A. Magallon wrote:
> >
> [...]
> >
> > Have you tried to build with gcc-3.3 ? I applied it on top of 2.4.x-aa,
> > and I just get an ICE:
> >
> > End of search list.
> > sched.c: In function `do_schedule':
> > sched.c:1003: internal compiler error: in merge_assigned_reloads, at reload1.c:6134
> > Please submit a full bug report,
> > with preprocessed source if appropriate.
> > See <URL:https://qa.mandrakesoft.com/> for instructions.
> >
> > sched.c::1003 is the closing brace for do_schedule().
> >
> > Any idea ?
> >
>
> FYI, it was a gcc bug. In latest gcc release in mandrake:
>
> * Tue Jun 17 2003 Juan Quintela <[email protected]> 3.3-2mdk
>
> - Add reload1 patch (fix bug http://gcc.gnu.org/bugzilla/show_bug.cgi?id=10890).
>
> Now it builds fine with the patch attached.
>

Ups, ^^^^^^^ == applied

--
J.A. Magallon <[email protected]> \ Software is like sex:
werewolf.able.es \ It's better when it's free
Mandrake Linux release 9.2 (Cooker) for i586
Linux 2.4.21-jam1 (gcc 3.3 (Mandrake Linux 9.2 3.3-2mdk))