2010-08-26 18:15:59

by Mathieu Desnoyers

Subject: [RFC PATCH 00/11] sched: CFS low-latency features

Hi,

Following the findings I presented a few months ago
(http://lkml.org/lkml/2010/4/18/13) about CFS having large vruntime spread
issues, Peter Zijlstra and I pursued the discussion and the implementation
effort (my work on this is funded by Nokia). I recently put the result together
and came up with this patchset, combining both his work and mine.

With this patchset, I got the following results with wakeup-latency.c (a 10ms
periodic timer), running periodic-fork.sh, Xorg, make -j3 and firefox (playing a
youtube video), with Xorg moving terminal windows around, in parallel on a UP
system (links to the test program source in the dyn min_vruntime patch). The
Xorg interactivity is very good with the new features enabled, but was poor
originally with the vanilla mainline scheduler. The 10ms timer delays are as
follows:

                        2.6.35.2 mainline*   with low-latency features**
maximum latency:             34465.2 µs               8261.4 µs
average latency:              6445.5 µs                211.2 µs
missed timer events:             yes                      no

* 2.6.35.2 mainline test needs to run periodic-fork.sh for a few minutes first
to let it rip the spread apart.

** low-latency features:

with the patchset applied and CONFIG_SCHED_DEBUG=y
(with debugfs mounted in /sys/debugfs)

for opt in DYN_MIN_VRUNTIME \
NO_FAIR_SLEEPERS FAIR_SLEEPERS_INTERACTIVE FAIR_SLEEPERS_TIMER \
INTERACTIVE TIMER \
INTERACTIVE_FORK_EXPEDITED TIMER_FORK_EXPEDITED;
do echo $opt > /sys/debugfs/sched_features;
done
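
For reference, a minimal sketch of the kind of measurement wakeup-latency.c
makes -- a hypothetical reconstruction for illustration only, not the actual
test program (the real source is linked in the dyn min_vruntime patch; link
with -lrt):

#define _GNU_SOURCE
#include <signal.h>
#include <stdio.h>
#include <time.h>

#define NSEC_PER_SEC 1000000000L
#define PERIOD_NS      10000000L        /* 10ms period */

int main(void)
{
        timer_t timerid;
        struct sigevent sev = { 0 };
        struct itimerspec its = { 0 };
        struct timespec expected, now;
        sigset_t set;
        siginfo_t info;
        double delta_us, max_us = 0, avg_us = 0;
        long n = 0;

        /* block the timer signal so we can fetch it synchronously */
        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        sigprocmask(SIG_BLOCK, &set, NULL);

        sev.sigev_notify = SIGEV_SIGNAL;
        sev.sigev_signo = SIGRTMIN;
        timer_create(CLOCK_MONOTONIC, &sev, &timerid);

        its.it_value.tv_nsec = PERIOD_NS;       /* first expiry in 10ms */
        its.it_interval.tv_nsec = PERIOD_NS;    /* then every 10ms */
        clock_gettime(CLOCK_MONOTONIC, &expected);
        timer_settime(timerid, 0, &its, NULL);

        for (;;) {
                sigwaitinfo(&set, &info);
                /* where this expiry should have landed */
                expected.tv_nsec += PERIOD_NS;
                if (expected.tv_nsec >= NSEC_PER_SEC) {
                        expected.tv_nsec -= NSEC_PER_SEC;
                        expected.tv_sec++;
                }
                clock_gettime(CLOCK_MONOTONIC, &now);
                delta_us = (now.tv_sec - expected.tv_sec) * 1e6 +
                           (now.tv_nsec - expected.tv_nsec) / 1e3;
                n++;
                avg_us += (delta_us - avg_us) / n;      /* running average */
                if (delta_us > max_us) {
                        max_us = delta_us;
                        printf("max latency: %.1f us (avg %.1f us)\n",
                               max_us, avg_us);
                }
                if (timer_getoverrun(timerid) > 0)
                        printf("missed timer events\n");
        }
}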

These patches are designed to allow individual enabling of each feature and to
make sure that the object size of sched.o does not grow when the features are
disabled on a CONFIG_SCHED_DEBUG=n kernel. Optimization of the try_to_wake_up()
fast path when features are enabled could be done later by merging some of these
features together. This patchset is based on 2.6.35.2.

Feedback is welcome,

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com


2010-08-26 18:57:37

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 2010-08-26 at 14:09 -0400, Mathieu Desnoyers wrote:
> Feedback is welcome,
>
So we have the following components to this patch:

- dynamic min_vruntime -- push the min_vruntime ahead at the
rate of the runqueue wide virtual clock. This approximates
the virtual clock, esp. when turning off sleeper fairness.
And is cheaper than actually computing the virtual clock.

It allows for better insertion and re-weighting behaviour,
but it does increase overhead somewhat.

- special wakeups using the next-buddy to get scheduled 'soon',
used by all wakeups from the input system and timers.

- special fork semantics related to those special wakeups.


So while I would love to simply compute the virtual clock, it would add
an s64 mult to every enqueue/dequeue and an s64 div to each
enqueue/re-weight, which might be somewhat prohibitive. The dyn
min_vruntime approximation seems to work well enough and costs a u32 div
per enqueue.
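
Schematically, the idea is something like the sketch below -- an
illustration only, not the actual patch; the function name and exact
placement are made up here:

/* Illustrative sketch: advance min_vruntime at the rate of the
 * runqueue-wide virtual clock, i.e. real time scaled down by the
 * total queued weight, instead of computing the exact virtual clock.
 */
static void push_min_vruntime(struct cfs_rq *cfs_rq, u64 delta_exec)
{
        unsigned long total = cfs_rq->load.weight;

        if (total)
                /* a single u32 division, instead of per-entity s64
                 * mults/divs on every enqueue/dequeue/re-weight */
                cfs_rq->min_vruntime +=
                        div_u64(delta_exec * NICE_0_LOAD, (u32)total);
}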

Adding a preference to all user generated wakeups (input) and
propagating that state along the wakeup chain seems to make sense.
Adding the same to all timers is something that needs to be discussed; I
can well imagine not all timers are equally important -- do we want to
extend the timer interface?

If we do decide we want both, we should at the very least merge the
try_to_wake_up() conditional blob (they're really identical). Preferably
we should reduce ttwu(), not add more to it...

Fudging fork seems dubious at best; it seems generated by the use of
timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
broken thing to do. It has very ill defined semantics and is utterly
unable to properly cope with error cases. Furthermore it's trivial to
actually correctly implement the desired behaviour, so I'm really
skeptical on this front; friends don't let friends use SIGEV_THREAD.

2010-08-26 21:26:11

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 26 Aug 2010, Peter Zijlstra wrote:
>
> Fudging fork seems dubious at best, it seems generated by the use of
> timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> broken thing to do, it has very ill defined semantics and is utterly
> unable to properly cope with error cases. Furthermore its trivial to
> actually correctly implement the desired behaviour, so I'm really
> skeptical on this front; friends don't let friends use SIGEV_THREAD.

SIGEV_THREAD is the best proof that the whole posix timer interface
was comitte[e]d under the influence of not to be revealed
mind-altering substances.

I completely object to adding timer-specific wakeup magic and support
for braindead fork orgies to the kernel proper. All that mess can be
fixed in user space by using sensible functionality.

Providing support for misdesigned crap just for POSIX compliance
reasons and to make some of the blind abusers of that very same crap
happy would be a completely stupid decision.

In fact that would make a brilliant precedent for forcing the
kernel to solve user space madness at the expense of kernel
complexity. If we follow down that road we get requests for extra
functionality for AIO, networking and whatever in a split second with
no real good reason to reject them anymore.

Thanks,

tglx

2010-08-26 22:23:42

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> >
> > Fudging fork seems dubious at best, it seems generated by the use of
> > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > broken thing to do, it has very ill defined semantics and is utterly
> > unable to properly cope with error cases. Furthermore its trivial to
> > actually correctly implement the desired behaviour, so I'm really
> > skeptical on this front; friends don't let friends use SIGEV_THREAD.
>
> SIGEV_THREAD is the best proof that the whole posix timer interface
> was comitte[e]d under the influence of not to be revealed
> mind-altering substances.
>
> I completely object to add timer specific wakeup magic and support for
> braindead fork orgies to the kernel proper. All that mess can be fixed
> in user space by using sensible functionality.
>
> Providing support for misdesigned crap just for POSIX compliance
> reasons and to make some of the blind abusers of that very same crap
> happy would be a completely stupid decision.
>
> In fact that would make a brilliant precedence case for forcing the
> kernel to solve user space madness at the expense of kernel
> complexity. If we follow down that road we get requests for extra
> functionality for AIO, networking and whatever in a split second with
> no real good reason to reject them anymore.

I really risked eye cancer and dug into the glibc code.

/* There is not much we can do if the allocation fails. */
(void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);

So if the helper thread which gets the signal fails to create the
thread then everything is toast.

What about fixing the f*cked up glibc implementation in the first place
instead of fiddling in the kernel to support this utter madness?

WTF can't the damned delivery thread be created when timer_create
is called and the signal be delivered to that very thread directly via
SIGEV_THREAD_ID?

Thanks,

tglx

2010-08-26 23:09:39

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Thomas Gleixner ([email protected]) wrote:
> On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > >
> > > Fudging fork seems dubious at best, it seems generated by the use of
> > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > broken thing to do, it has very ill defined semantics and is utterly
> > > unable to properly cope with error cases. Furthermore its trivial to
> > > actually correctly implement the desired behaviour, so I'm really
> > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> >
> > SIGEV_THREAD is the best proof that the whole posix timer interface
> > was comitte[e]d under the influence of not to be revealed
> > mind-altering substances.
> >
> > I completely object to add timer specific wakeup magic and support for
> > braindead fork orgies to the kernel proper. All that mess can be fixed
> > in user space by using sensible functionality.
> >
> > Providing support for misdesigned crap just for POSIX compliance
> > reasons and to make some of the blind abusers of that very same crap
> > happy would be a completely stupid decision.
> >
> > In fact that would make a brilliant precedence case for forcing the
> > kernel to solve user space madness at the expense of kernel
> > complexity. If we follow down that road we get requests for extra
> > functionality for AIO, networking and whatever in a split second with
> > no real good reason to reject them anymore.
>
> I really risked eye cancer and digged into the glibc code.
>
> /* There is not much we can do if the allocation fails. */
> (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
>
> So if the helper thread which gets the signal fails to create the
> thread then everything is toast.
>
> What about fixing the f*cked up glibc implementation in the first place
> instead of fiddling in the kernel to support this utter madness?
>
> WTF can't the damned delivery thread not be created when timer_create
> is called and the signal be delivered to that very thread directly via
> SIGEV_THREAD_ID ?

Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
pretty sure that would work.

The only thing we might have to be careful about is what happens if the timer
re-fires before the thread completes its execution. We might want to let the
signal handler detect these overruns somehow.
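
For what it's worth, POSIX already has a hook for that last part:
timer_getoverrun(2) returns the number of additional expirations that
occurred between signal generation and delivery. A minimal sketch
('timerid' is a hypothetical variable naming the timer being serviced):

/* inside the expiration handler */
int missed = timer_getoverrun(timerid);
if (missed > 0)
        fprintf(stderr, "timer re-fired %d time(s) before completion\n",
                missed);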

Thanks,

Mathieu

>
> Thanks,
>
> tglx

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-26 23:18:12

by Paul E. McKenney

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > >
> > > Fudging fork seems dubious at best, it seems generated by the use of
> > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > broken thing to do, it has very ill defined semantics and is utterly
> > > unable to properly cope with error cases. Furthermore its trivial to
> > > actually correctly implement the desired behaviour, so I'm really
> > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> >
> > SIGEV_THREAD is the best proof that the whole posix timer interface
> > was comitte[e]d under the influence of not to be revealed
> > mind-altering substances.
> >
> > I completely object to add timer specific wakeup magic and support for
> > braindead fork orgies to the kernel proper. All that mess can be fixed
> > in user space by using sensible functionality.
> >
> > Providing support for misdesigned crap just for POSIX compliance
> > reasons and to make some of the blind abusers of that very same crap
> > happy would be a completely stupid decision.
> >
> > In fact that would make a brilliant precedence case for forcing the
> > kernel to solve user space madness at the expense of kernel
> > complexity. If we follow down that road we get requests for extra
> > functionality for AIO, networking and whatever in a split second with
> > no real good reason to reject them anymore.
>
> I really risked eye cancer and digged into the glibc code.
>
> /* There is not much we can do if the allocation fails. */
> (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
>
> So if the helper thread which gets the signal fails to create the
> thread then everything is toast.
>
> What about fixing the f*cked up glibc implementation in the first place
> instead of fiddling in the kernel to support this utter madness?
>
> WTF can't the damned delivery thread not be created when timer_create
> is called and the signal be delivered to that very thread directly via
> SIGEV_THREAD_ID ?

C'mon, Thomas!!! That is entirely too sensible!!! ;-)

But if you are going to create the thread at timer_create() time,
why not just have the new thread block for the desired duration?

Thanx, Paul

2010-08-26 23:29:06

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Paul E. McKenney ([email protected]) wrote:
> On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > >
> > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > actually correctly implement the desired behaviour, so I'm really
> > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > >
> > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > was comitte[e]d under the influence of not to be revealed
> > > mind-altering substances.
> > >
> > > I completely object to add timer specific wakeup magic and support for
> > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > in user space by using sensible functionality.
> > >
> > > Providing support for misdesigned crap just for POSIX compliance
> > > reasons and to make some of the blind abusers of that very same crap
> > > happy would be a completely stupid decision.
> > >
> > > In fact that would make a brilliant precedence case for forcing the
> > > kernel to solve user space madness at the expense of kernel
> > > complexity. If we follow down that road we get requests for extra
> > > functionality for AIO, networking and whatever in a split second with
> > > no real good reason to reject them anymore.
> >
> > I really risked eye cancer and digged into the glibc code.
> >
> > /* There is not much we can do if the allocation fails. */
> > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> >
> > So if the helper thread which gets the signal fails to create the
> > thread then everything is toast.
> >
> > What about fixing the f*cked up glibc implementation in the first place
> > instead of fiddling in the kernel to support this utter madness?
> >
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
>
> C'mon, Thomas!!! That is entirely too sensible!!! ;-)
>
> But if you are going to create the thread at timer_create() time,
> why not just have the new thread block for the desired duration?

The timer infrastructure allows things like periodic timers which restart when
they fire, detection of missed timer events, etc. If you try doing this in a
userland thread context with a simple sleep, then your period becomes however
long you sleep for _and_ the thread execution time (e.g., a 10ms sleep plus a
3ms handler yields an effective 13ms period). This is all in all quite
different from the timer semantics.

Thanks,

Mathieu

>
> Thanx, Paul

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-26 23:36:56

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Mathieu Desnoyers ([email protected]) wrote:
> * Thomas Gleixner ([email protected]) wrote:
> > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > >
> > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > actually correctly implement the desired behaviour, so I'm really
> > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > >
> > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > was comitte[e]d under the influence of not to be revealed
> > > mind-altering substances.
> > >
> > > I completely object to add timer specific wakeup magic and support for
> > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > in user space by using sensible functionality.
> > >
> > > Providing support for misdesigned crap just for POSIX compliance
> > > reasons and to make some of the blind abusers of that very same crap
> > > happy would be a completely stupid decision.
> > >
> > > In fact that would make a brilliant precedence case for forcing the
> > > kernel to solve user space madness at the expense of kernel
> > > complexity. If we follow down that road we get requests for extra
> > > functionality for AIO, networking and whatever in a split second with
> > > no real good reason to reject them anymore.
> >
> > I really risked eye cancer and digged into the glibc code.
> >
> > /* There is not much we can do if the allocation fails. */
> > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> >
> > So if the helper thread which gets the signal fails to create the
> > thread then everything is toast.
> >
> > What about fixing the f*cked up glibc implementation in the first place
> > instead of fiddling in the kernel to support this utter madness?
> >
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
>
> Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> pretty sure that would work.
>
> The only thing we might have to be careful about is what happens if the timer
> re-fires before the thread completes its execution. We might want to let the
> signal handler detect these overruns somehow.

Hrm, thinking about it a little more, one of the "plus" sides of these
SIGEV_THREAD timers is that a single timer can fork threads that will run on
many cores on a multi-core system. If we go for preallocation of a single
thread, we lose that. Maybe we could think of a way to preallocate a thread pool
instead?

Thanks,

Mathieu

>
> Thanks,
>
> Mathieu
>
> >
> > Thanks,
> >
> > tglx
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-26 23:38:16

by Paul E. McKenney

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > >
> > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > >
> > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > was comitte[e]d under the influence of not to be revealed
> > > > mind-altering substances.
> > > >
> > > > I completely object to add timer specific wakeup magic and support for
> > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > in user space by using sensible functionality.
> > > >
> > > > Providing support for misdesigned crap just for POSIX compliance
> > > > reasons and to make some of the blind abusers of that very same crap
> > > > happy would be a completely stupid decision.
> > > >
> > > > In fact that would make a brilliant precedence case for forcing the
> > > > kernel to solve user space madness at the expense of kernel
> > > > complexity. If we follow down that road we get requests for extra
> > > > functionality for AIO, networking and whatever in a split second with
> > > > no real good reason to reject them anymore.
> > >
> > > I really risked eye cancer and digged into the glibc code.
> > >
> > > /* There is not much we can do if the allocation fails. */
> > > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > >
> > > So if the helper thread which gets the signal fails to create the
> > > thread then everything is toast.
> > >
> > > What about fixing the f*cked up glibc implementation in the first place
> > > instead of fiddling in the kernel to support this utter madness?
> > >
> > > WTF can't the damned delivery thread not be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> >
> > C'mon, Thomas!!! That is entirely too sensible!!! ;-)
> >
> > But if you are going to create the thread at timer_create() time,
> > why not just have the new thread block for the desired duration?
>
> The timer infrastructure allows things like periodic timer which restarts when
> it fires, detection of missed timer events, etc. If you try doing this in a
> userland thread context with a simple sleep, then your period becomes however
> long you sleep for _and_ the thread execution time. This is all in all quite
> different from the timer semantic.

Hmmm... Why couldn't the thread in question set the next sleep time based
on the timer period? Yes, if the function ran for longer than the period,
there would be a delay, but the POSIX semantics allow such a delay, right?

Thanx, Paul

2010-08-26 23:49:15

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-08-26 at 14:09 -0400, Mathieu Desnoyers wrote:
> > Feedback is welcome,
> >
> So we have the following components to this patch:
>
> - dynamic min_vruntime -- push the min_vruntime ahead at the
> rate of the runqueue wide virtual clock. This approximates
> the virtual clock, esp. when turning off sleeper fairness.
> And is cheaper than actually computing the virtual clock.
>
> It allows for better insertion and re-weighting behaviour,
> but it does increase overhead somewhat.
>
> - special wakeups using the next-buddy to get scheduled 'soon',
> used by all wakeups from the input system and timers.
>
> - special fork semantics related to those special wakeups.
>
>
> So while I would love to simply compute the virtual clock, it would add
> a s64 mult to every enqueue/dequeue and a s64 div to each
> enqueue/re-weight, which might be somewhat prohibitive, the dyn
> min_vruntime approximation seems to work well enough and costs a u32 div
> per enqueue.

Yep, it's cheap enough and seems to work very well as far as my testing
has shown.

> Adding a preference to all user generated wakeups (input) and
> propagating that state along the wakeup chain seems to make sense,

Yes, this is what lets us kill FAIR_SLEEPERS (and thus lets the dynamic
min_vruntime behave as expected), while keeping good Xorg interactivity.

> adding the same to all timers is something that needs to be discussed, I
> can well imagine not all timers are equally important -- do we want to
> extend the timer interface?

I just thought it made sense that when a timer fires and wakes up a thread,
there are pretty good chances that we want to wake up this thread quickly. But
it brings up the question in a more general sense: would we want this kind of
behavior also available for network packets, disk I/O, etc.? IOW, would it make
sense to have next-buddy selection on all these input-triggered wakeups? Since
we're only using the next buddy selection, it will only perform this selection
if the selected buddy is within a specific vruntime range from the minimum, so
AFAIK, I don't think we would end up starving the system in any possible way.

So far I cannot see a situation where selecting the next buddy would _not_ make
sense for any kind of input-driven wakeup (interactive, timer, disk, network,
etc.). But maybe it's just a lack of imagination on my part.


> If we do decide we want both, we should at the very least merge the
> try_to_wake_up() conditional blob (they're really identical). Preferably
> we should reduce ttwu(), not add more to it...

Sure. Or maybe we'll find out we want to target even more input paths... So
far, interactive input seemed absolutely needed, timers seemed logical and
self-rate-limited. For the others, I don't know. It might actually make sense
too.

>
> Fudging fork seems dubious at best, it seems generated by the use of
> timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> broken thing to do, it has very ill defined semantics and is utterly
> unable to properly cope with error cases. Furthermore its trivial to
> actually correctly implement the desired behaviour, so I'm really
> skeptical on this front; friends don't let friends use SIGEV_THREAD.

Agreed for the timer-tied case. I think the direction Thomas proposes for fixing
glibc makes much more sense. However, I am wondering about the interactive case
too: e.g., if you click on "open terminal", it needs to fork a process and you
would normally expect this to come up more quickly than the background kernel
compile you're doing. So there might be some interest in this fork vruntime
boost for interactivity-driven wakeups. Thoughts?

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-26 23:53:28

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Paul E. McKenney ([email protected]) wrote:
> On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney ([email protected]) wrote:
> > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > >
> > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > >
> > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > was comitte[e]d under the influence of not to be revealed
> > > > > mind-altering substances.
> > > > >
> > > > > I completely object to add timer specific wakeup magic and support for
> > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > in user space by using sensible functionality.
> > > > >
> > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > happy would be a completely stupid decision.
> > > > >
> > > > > In fact that would make a brilliant precedence case for forcing the
> > > > > kernel to solve user space madness at the expense of kernel
> > > > > complexity. If we follow down that road we get requests for extra
> > > > > functionality for AIO, networking and whatever in a split second with
> > > > > no real good reason to reject them anymore.
> > > >
> > > > I really risked eye cancer and digged into the glibc code.
> > > >
> > > > /* There is not much we can do if the allocation fails. */
> > > > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > >
> > > > So if the helper thread which gets the signal fails to create the
> > > > thread then everything is toast.
> > > >
> > > > What about fixing the f*cked up glibc implementation in the first place
> > > > instead of fiddling in the kernel to support this utter madness?
> > > >
> > > > WTF can't the damned delivery thread not be created when timer_create
> > > > is called and the signal be delivered to that very thread directly via
> > > > SIGEV_THREAD_ID ?
> > >
> > > C'mon, Thomas!!! That is entirely too sensible!!! ;-)
> > >
> > > But if you are going to create the thread at timer_create() time,
> > > why not just have the new thread block for the desired duration?
> >
> > The timer infrastructure allows things like periodic timer which restarts when
> > it fires, detection of missed timer events, etc. If you try doing this in a
> > userland thread context with a simple sleep, then your period becomes however
> > long you sleep for _and_ the thread execution time. This is all in all quite
> > different from the timer semantic.
>
> Hmmm... Why couldn't the thread in question set the next sleep time based
> on the timer period? Yes, if the function ran for longer than the period,
> there would be a delay, but the POSIX semantics allow such a delay, right?

I'm afraid you'll have a large error accumulation over time, and getting a
precise picture of how much time remains between now and the expected end of
the period is hard to do from user-space. In a few words, this solution would
be terrible for jitter. This is why we usually rely on timers rather than
delays in these periodic workloads.

Thanks,

Mathieu

>
> Thanx, Paul

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 00:09:17

by Paul E. McKenney

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, Aug 26, 2010 at 07:53:23PM -0400, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> > On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > > * Paul E. McKenney ([email protected]) wrote:
> > > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > > >
> > > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > > >
> > > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > > was comitte[e]d under the influence of not to be revealed
> > > > > > mind-altering substances.
> > > > > >
> > > > > > I completely object to add timer specific wakeup magic and support for
> > > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > > in user space by using sensible functionality.
> > > > > >
> > > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > > happy would be a completely stupid decision.
> > > > > >
> > > > > > In fact that would make a brilliant precedence case for forcing the
> > > > > > kernel to solve user space madness at the expense of kernel
> > > > > > complexity. If we follow down that road we get requests for extra
> > > > > > functionality for AIO, networking and whatever in a split second with
> > > > > > no real good reason to reject them anymore.
> > > > >
> > > > > I really risked eye cancer and digged into the glibc code.
> > > > >
> > > > > /* There is not much we can do if the allocation fails. */
> > > > > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > > >
> > > > > So if the helper thread which gets the signal fails to create the
> > > > > thread then everything is toast.
> > > > >
> > > > > What about fixing the f*cked up glibc implementation in the first place
> > > > > instead of fiddling in the kernel to support this utter madness?
> > > > >
> > > > > WTF can't the damned delivery thread not be created when timer_create
> > > > > is called and the signal be delivered to that very thread directly via
> > > > > SIGEV_THREAD_ID ?
> > > >
> > > > C'mon, Thomas!!! That is entirely too sensible!!! ;-)
> > > >
> > > > But if you are going to create the thread at timer_create() time,
> > > > why not just have the new thread block for the desired duration?
> > >
> > > The timer infrastructure allows things like periodic timer which restarts when
> > > it fires, detection of missed timer events, etc. If you try doing this in a
> > > userland thread context with a simple sleep, then your period becomes however
> > > long you sleep for _and_ the thread execution time. This is all in all quite
> > > different from the timer semantic.
> >
> > Hmmm... Why couldn't the thread in question set the next sleep time based
> > on the timer period? Yes, if the function ran for longer than the period,
> > there would be a delay, but the POSIX semantics allow such a delay, right?
>
> I'm afraid you'll have a large error accumulation over time, and getting the
> precise picture of how much time between now and where the the period end is
> expected to be is kind of hard to do precisely from user-space. In a few words,
> this solution would be terrible for jitter. This is why we usually rely on
> timers rather than delays in these periodic workloads.

Why couldn't the timer_create() call record the start time, and then
compute the sleeps from that time? So if timer_create() executed at
time t=100 and the period is 5, upon awakening and completing the first
invocation of the function in question, the thread does a sleep calculated
to wake at t=110.

Thanx, Paul

2010-08-27 07:37:53

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 2010-08-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> > WTF can't the damned delivery thread not be created when timer_create
> > is called and the signal be delivered to that very thread directly via
> > SIGEV_THREAD_ID ?
>
> Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> pretty sure that would work.
>
> The only thing we might have to be careful about is what happens if the timer
> re-fires before the thread completes its execution. We might want to let the
> signal handler detect these overruns somehow.

Simply don't use SIGEV_THREAD; spawn your own thread and use
SIGEV_THREAD_ID yourself. The programmer knows the semantics and knows
if he cares about overlapping timers etc.
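
A minimal sketch of that approach -- one pre-created thread receiving
all expirations directly via SIGEV_THREAD_ID (illustrative code, not
from this thread; most error handling omitted, link with -lrt
-lpthread):

#define _GNU_SOURCE
#include <pthread.h>
#include <signal.h>
#include <stdio.h>
#include <time.h>
#include <unistd.h>
#include <sys/syscall.h>

#ifndef sigev_notify_thread_id          /* Linux-specific sigevent field */
#define sigev_notify_thread_id _sigev_un._tid
#endif

static timer_t timerid;

static void *timer_thread(void *arg)
{
        struct sigevent sev = { 0 };
        struct itimerspec its = { 0 };
        sigset_t set;
        siginfo_t info;

        (void)arg;
        /* keep the timer signal blocked; fetch it synchronously instead */
        sigemptyset(&set);
        sigaddset(&set, SIGRTMIN);
        pthread_sigmask(SIG_BLOCK, &set, NULL);

        sev.sigev_notify = SIGEV_THREAD_ID;
        sev.sigev_signo = SIGRTMIN;
        sev.sigev_notify_thread_id = syscall(SYS_gettid); /* this thread */
        timer_create(CLOCK_MONOTONIC, &sev, &timerid);

        its.it_value.tv_nsec = 10 * 1000 * 1000;        /* 10ms, periodic */
        its.it_interval = its.it_value;
        timer_settime(timerid, 0, &its, NULL);

        for (;;) {
                if (sigwaitinfo(&set, &info) < 0)
                        continue;
                /* overruns tell us the timer re-fired while the previous
                 * expiration was still being handled */
                printf("tick, overruns=%d\n", timer_getoverrun(timerid));
        }
        return NULL;
}

int main(void)
{
        pthread_t th;

        pthread_create(&th, NULL, timer_thread, NULL);
        pause();
        return 0;
}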

2010-08-27 07:39:11

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 2010-08-26 at 19:36 -0400, Mathieu Desnoyers wrote:
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.
>
> Hrm, thinking about it a little more, one of the "plus" sides of these
> SIGEV_THREAD timers is that a single timer can fork threads that will run on
> many cores on a multi-core system. If we go for preallocation of a single
> thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> instead ?

Why try and fix a broken thing? Just let the app spawn threads and use
SIGEV_THREAD_ID itself; it knows its requirements and can do what suits
the situation best. No need to try and patch up braindead posix stuff.

2010-08-27 07:43:36

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> AFAIK, I don't think we would end up starving the system in any possible way.

Correct, it does maintain fairness.

> So far I cannot see a situation where selecting the next buddy would _not_ make
> sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> etc). But maybe it's just a lack of imagination on my part.

The risk is that you end up always using next-buddy; we tried that a
while back and it didn't work well for some, as Mike might remember.

Also, when you use timers for things like time-outs, you really couldn't
care less if it's handled sooner rather than later.

Disk is usually so slow you really don't want to consider it
interactive, but then sometimes you might... it's a really hard problem.

The only clear situation is direct input: that's a direct link
between the user and our wakeup chain, and the user is always important.

2010-08-27 08:19:49

by Mike Galbraith

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > AFAIK, I don't think we would end up starving the system in any possible way.
>
> Correct, it does maintain fairness.
>
> > So far I cannot see a situation where selecting the next buddy would _not_ make
> > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > etc). But maybe it's just a lack of imagination on my part.
>
> The risk is that you end up with always using next-buddy, and we tried
> that a while back and that didn't work well for some, Mike might
> remember.

I turned it off because it was ripping spread apart badly, and last
buddy did a better job of improving scalability without it.

> Also, when you use timers things like time-outs you really couldn't care
> less if its handled sooner rather than later.
>
> Disk is usually so slow you really don't want to consider it
> interactive, but then sometimes you might,.. its a really hard problem.

(very hard)

> The only clear situation is the direct input, that's a direct link
> between the user and our wakeup chain and the user is always important.

Yeah, directly linked wakeups using next could be a good thing, but the
trouble with using any linkage to the user is that you have to pass it
on to reap benefit... so when do you disconnect?

-Mike

2010-08-27 08:44:07

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Thu, 26 Aug 2010, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Thomas Gleixner ([email protected]) wrote:
> > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > >
> > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > >
> > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > was comitte[e]d under the influence of not to be revealed
> > > > mind-altering substances.
> > > >
> > > > I completely object to add timer specific wakeup magic and support for
> > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > in user space by using sensible functionality.
> > > >
> > > > Providing support for misdesigned crap just for POSIX compliance
> > > > reasons and to make some of the blind abusers of that very same crap
> > > > happy would be a completely stupid decision.
> > > >
> > > > In fact that would make a brilliant precedence case for forcing the
> > > > kernel to solve user space madness at the expense of kernel
> > > > complexity. If we follow down that road we get requests for extra
> > > > functionality for AIO, networking and whatever in a split second with
> > > > no real good reason to reject them anymore.
> > >
> > > I really risked eye cancer and digged into the glibc code.
> > >
> > > /* There is not much we can do if the allocation fails. */
> > > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > >
> > > So if the helper thread which gets the signal fails to create the
> > > thread then everything is toast.
> > >
> > > What about fixing the f*cked up glibc implementation in the first place
> > > instead of fiddling in the kernel to support this utter madness?
> > >
> > > WTF can't the damned delivery thread not be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> >
> > Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> > pretty sure that would work.
> >
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.

That's easy.

> Hrm, thinking about it a little more, one of the "plus" sides of these
> SIGEV_THREAD timers is that a single timer can fork threads that will run on
> many cores on a multi-core system. If we go for preallocation of a single
> thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> instead ?

Why should a single timer fork many threads? Just because a previous
thread did not complete before the timer fired again? That's braindamage,
as all threads call the same function, which then needs to be serialized
anyway. We really do not need a function which creates tons of threads
which all get stuck on the same resource.

Thanks,

tglx

2010-08-27 10:58:49

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 12:47 +0200, Indan Zupancic wrote:
>
> Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG
> options. If it doesn't make the scheduler better just don't merge it. If it
> does then it should be enabled by default.

It's an RFC; features are nice for testing things and seeing if/how they
work. Features are not a user option, so you don't have to worry about
them.

Also, 'better' is a very hard thing to quantify. Some people like
throughput, some like latency, and others like raw context switch
performance.

Hopefully we'll soon have a non-privileged sporadic task scheduler for
all our soft-realtime media needs.

2010-08-27 11:06:49

by Indan Zupancic

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

Hello,

On Thu, August 26, 2010 20:09, Mathieu Desnoyers wrote:
> Hi,
>
> Following the findings I presented a few months ago
> (http://lkml.org/lkml/2010/4/18/13) about CFS having large vruntime spread
> issues, Peter Zijlstra and I pursued the discussion and the implementation
> effort (my work on this is funded by Nokia). I recently put the result
> together
> and came up with this patchset, combining both his work and mine.
>
> With this patchset, I got the following results with wakeup-latency.c (a 10ms
> periodic timer), running periodic-fork.sh, Xorg, make -j3 and firefox (playing
> a
> youtube video), with Xorg moving terminal windows around, in parallel on a UP
> system (links to the test program source in the dyn min_vruntime patch). The
> Xorg interactivity is very good with the new features enabled, but was poor
> originally with the vanilla mainline scheduler. The 10ms timer delays are as
> follow:
>
> 2.6.35.2 mainline* with low-latency features**
> maximum latency: 34465.2 µs 8261.4 µs
> average latency: 6445.5 µs 211.2 µs
> missed timer events: yes no
>
> * 2.6.35.2 mainline test needs to run periodic-fork.sh for a few minutes first
> to let it rip the spread apart.
>
> ** low-latency features:
>
> with the patchset applied and CONFIG_SCHED_DEBUG=y
> (with debugfs mounted in /sys/debugfs)
>
> for opt in DYN_MIN_VRUNTIME \
> NO_FAIR_SLEEPERS FAIR_SLEEPERS_INTERACTIVE FAIR_SLEEPERS_TIMER \
> INTERACTIVE TIMER \
> INTERACTIVE_FORK_EXPEDITED TIMER_FORK_EXPEDITED;
> do echo $opt > /sys/debugfs/sched_features;
> done
>
> These patches are designed to allow individual enabling of each feature and to
> make sure that the object size of sched.o does not grow when the features are
> disabled on a CONFIG_SCHED_DEBUG=n kernel.

Please don't hide scheduler improvements behind obscure CONFIG_SCHED_DEBUG
options. If it doesn't make the scheduler better just don't merge it. If it
does then it should be enabled by default.

Thank you,

Indan


P.S. Tony Luck seems to have been chopped off the CC list.

2010-08-27 15:18:13

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Paul E. McKenney ([email protected]) wrote:
> On Thu, Aug 26, 2010 at 07:53:23PM -0400, Mathieu Desnoyers wrote:
> > * Paul E. McKenney ([email protected]) wrote:
> > > On Thu, Aug 26, 2010 at 07:28:58PM -0400, Mathieu Desnoyers wrote:
> > > > * Paul E. McKenney ([email protected]) wrote:
> > > > > On Fri, Aug 27, 2010 at 12:22:46AM +0200, Thomas Gleixner wrote:
> > > > > > On Thu, 26 Aug 2010, Thomas Gleixner wrote:
> > > > > > > On Thu, 26 Aug 2010, Peter Zijlstra wrote:
> > > > > > > >
> > > > > > > > Fudging fork seems dubious at best, it seems generated by the use of
> > > > > > > > timer_create(.evp->sigev_notify = SIGEV_THREAD), which is a really
> > > > > > > > broken thing to do, it has very ill defined semantics and is utterly
> > > > > > > > unable to properly cope with error cases. Furthermore its trivial to
> > > > > > > > actually correctly implement the desired behaviour, so I'm really
> > > > > > > > skeptical on this front; friends don't let friends use SIGEV_THREAD.
> > > > > > >
> > > > > > > SIGEV_THREAD is the best proof that the whole posix timer interface
> > > > > > > was comitte[e]d under the influence of not to be revealed
> > > > > > > mind-altering substances.
> > > > > > >
> > > > > > > I completely object to add timer specific wakeup magic and support for
> > > > > > > braindead fork orgies to the kernel proper. All that mess can be fixed
> > > > > > > in user space by using sensible functionality.
> > > > > > >
> > > > > > > Providing support for misdesigned crap just for POSIX compliance
> > > > > > > reasons and to make some of the blind abusers of that very same crap
> > > > > > > happy would be a completely stupid decision.
> > > > > > >
> > > > > > > In fact that would make a brilliant precedence case for forcing the
> > > > > > > kernel to solve user space madness at the expense of kernel
> > > > > > > complexity. If we follow down that road we get requests for extra
> > > > > > > functionality for AIO, networking and whatever in a split second with
> > > > > > > no real good reason to reject them anymore.
> > > > > >
> > > > > > I really risked eye cancer and digged into the glibc code.
> > > > > >
> > > > > > /* There is not much we can do if the allocation fails. */
> > > > > > (void) pthread_create (&th, &tk->attr, timer_sigev_thread, td);
> > > > > >
> > > > > > So if the helper thread which gets the signal fails to create the
> > > > > > thread then everything is toast.
> > > > > >
> > > > > > What about fixing the f*cked up glibc implementation in the first place
> > > > > > instead of fiddling in the kernel to support this utter madness?
> > > > > >
> > > > > > WTF can't the damned delivery thread not be created when timer_create
> > > > > > is called and the signal be delivered to that very thread directly via
> > > > > > SIGEV_THREAD_ID ?
> > > > >
> > > > > C'mon, Thomas!!! That is entirely too sensible!!! ;-)
> > > > >
> > > > > But if you are going to create the thread at timer_create() time,
> > > > > why not just have the new thread block for the desired duration?
> > > >
> > > > The timer infrastructure allows things like periodic timer which restarts when
> > > > it fires, detection of missed timer events, etc. If you try doing this in a
> > > > userland thread context with a simple sleep, then your period becomes however
> > > > long you sleep for _and_ the thread execution time. This is all in all quite
> > > > different from the timer semantic.
> > >
> > > Hmmm... Why couldn't the thread in question set the next sleep time based
> > > on the timer period? Yes, if the function ran for longer than the period,
> > > there would be a delay, but the POSIX semantics allow such a delay, right?
> >
> > I'm afraid you'll have a large error accumulation over time, and getting the
> > precise picture of how much time between now and where the the period end is
> > expected to be is kind of hard to do precisely from user-space. In a few words,
> > this solution would be terrible for jitter. This is why we usually rely on
> > timers rather than delays in these periodic workloads.
>
> Why couldn't the timer_create() call record the start time, and then
> compute the sleeps from that time? So if timer_create() executed at
> time t=100 and the period is 5, upon awakening and completing the first
> invocation of the function in question, the thread does a sleep calculated
> to wake at t=110.

Let's focus on the userspace thread execution, right between the sampling of
the current time and the call to sleep:

Thread A:
    current_time = read_current_time();
    sleep(period_end - current_time);

If the thread is preempted between these two operations, then we end up sleeping
for longer than what is needed. This kind of imprecision will add up over time,
so that after e.g. one day, instead of having the expected number of timer
executions, we'll have fewer than that. This kind of accumulated drift is an
unwanted side-effect of using delays in lieu of real periodic timers.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 15:21:28

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-08-26 at 19:09 -0400, Mathieu Desnoyers wrote:
> > > WTF can't the damned delivery thread not be created when timer_create
> > > is called and the signal be delivered to that very thread directly via
> > > SIGEV_THREAD_ID ?
> >
> > Yeah, that sounds exactly like what I proposed about an hour ago on IRC ;) I'm
> > pretty sure that would work.
> >
> > The only thing we might have to be careful about is what happens if the timer
> > re-fires before the thread completes its execution. We might want to let the
> > signal handler detect these overruns somehow.
>
> Simply don't use SIGEV_THREAD and spawn you own thread and use
> SIGEV_THREAD_ID yourself, the programmer knows the semantics and knows
> if he cares about overlapping timers etc.

From man timer_create:

SIGEV_THREAD
       Upon timer expiration, invoke sigev_notify_function as if it
       were the start function of a new thread. (Among the
       implementation possibilities here are that each timer
       notification could result in the creation of a new thread, or
       that a single thread is created to receive all notifications.)
       The function is invoked with sigev_value as its sole argument.
       If sigev_notify_attributes is not NULL, it should point to a
       pthread_attr_t structure that defines attributes for the new
       thread (see pthread_attr_init(3)).

So basically, it's the glibc implementation that is broken, not the standard.

The programmer should expect that thread execution can overlap though.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 15:21:51

by Thomas Gleixner

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 27 Aug 2010, Mathieu Desnoyers wrote:
> * Paul E. McKenney ([email protected]) wrote:
> > Why couldn't the timer_create() call record the start time, and then
> > compute the sleeps from that time? So if timer_create() executed at
> > time t=100 and the period is 5, upon awakening and completing the first
> > invocation of the function in question, the thread does a sleep calculated
> > to wake at t=110.
>
> Let's focus on the userspace thread execution, right between the samping of the
> current time and the call to sleep:
>
> Thread A
> current_time = read current time();
> sleep(period_end - current_time);
>
> If the thread is preempted between these two operations, then we end up sleeping
> for longer than what is needed. This kind of imprecision will add up over time,
> so that after e.g. one day, instead of having the expected number of timer
> executions, we'll have less than that. This kind of accumulated drift is an
> unwanted side-effect of using delays in lieue of real periodic timers.

Nonsense, that's why we provide clock_nanosleep(ABSTIME)
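
In other words, something like this minimal sketch, where do_work() is
a hypothetical stand-in for the timer function:

#include <time.h>

#define NSEC_PER_SEC 1000000000L

extern void do_work(void);              /* stand-in for the handler */

static void periodic(long period_ns)
{
        struct timespec next;

        clock_gettime(CLOCK_MONOTONIC, &next); /* start time, read once */
        for (;;) {
                /* advance the *absolute* deadline; preemption between
                 * the clock read and the sleep can delay one wakeup,
                 * but the error no longer accumulates across periods */
                next.tv_nsec += period_ns;
                while (next.tv_nsec >= NSEC_PER_SEC) {
                        next.tv_nsec -= NSEC_PER_SEC;
                        next.tv_sec++;
                }
                clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, &next, NULL);
                do_work();
        }
}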

Thanks,

tglx

2010-08-27 15:23:24

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Thu, 2010-08-26 at 19:36 -0400, Mathieu Desnoyers wrote:
> > > The only thing we might have to be careful about is what happens if the timer
> > > re-fires before the thread completes its execution. We might want to let the
> > > signal handler detect these overruns somehow.
> >
> > Hrm, thinking about it a little more, one of the "plus" sides of these
> > SIGEV_THREAD timers is that a single timer can fork threads that will run on
> > many cores on a multi-core system. If we go for preallocation of a single
> > thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> > instead ?
>
> Why try and fix a broken thing, just let the app spawn threads and use
> SIGEV_THREAD_ID itself, it knows its requirements and can do what suits
> the situation best. No need to try and patch up braindead posix stuff.

As I dig through the code and learn more about the posix standard, my
understanding is that the _implementation_ is broken. The standard just leaves
enough rope for the implementation to either do the right thing or to hang
itself. Sadly glibc seems to have chosen the second option.

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 15:30:12

by Mathieu Desnoyers

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Thomas Gleixner ([email protected]) wrote:
> On Fri, 27 Aug 2010, Mathieu Desnoyers wrote:
> > * Paul E. McKenney ([email protected]) wrote:
> > > Why couldn't the timer_create() call record the start time, and then
> > > compute the sleeps from that time? So if timer_create() executed at
> > > time t=100 and the period is 5, upon awakening and completing the first
> > > invocation of the function in question, the thread does a sleep calculated
> > > to wake at t=110.
> >
> > Let's focus on the userspace thread execution, right between the samping of the
> > current time and the call to sleep:
> >
> > Thread A
> > current_time = read current time();
> > sleep(period_end - current_time);
> >
> > If the thread is preempted between these two operations, then we end up sleeping
> > for longer than what is needed. This kind of imprecision will add up over time,
> > so that after e.g. one day, instead of having the expected number of timer
> > executions, we'll have less than that. This kind of accumulated drift is an
> > unwanted side-effect of using delays in lieue of real periodic timers.
>
> Nonsense, that's why we provide clock_nanosleep(ABSTIME)

If we're using CLOCK_MONOTONIC, you're right, this could work. I was only
thinking of relative delays.

So do you think Paul's idea would be a good candidate for the timer_create
SIGEV_THREAD glibc implementation then?

Thanks,

Mathieu

>
> Thanks,
>
> tglx

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 15:41:39

by Peter Zijlstra

Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 11:21 -0400, Mathieu Desnoyers wrote:
> SIGEV_THREAD
> Upon timer expiration, invoke sigev_notify_function as if it
> were the start function of a new thread. (Among the implementation
> possibilities here are that each timer notification could
> result in the creation of a new thread, or that a single thread
> is created to receive all notifications.) The function is
> invoked with sigev_value as its sole argument. If
> sigev_notify_attributes is not NULL, it should point to a
> pthread_attr_t structure that defines attributes for the new
> thread (see pthread_attr_init(3)).
>
> So basically, it's the glibc implementation that is broken, not the standard.

The standard is broken too: what context will the new thread inherit?
The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t
doesn't cover all inherited task attributes, and allows for some very
'interesting' bugs [1].

The specification also doesn't cover the case where the handler takes
more time to execute than the timer interval.

[1] - consider the case where pthread_attr_t includes the stack, we use a
spawn-thread-on-expire policy, and then the handler is delayed past the next
expiration.

2010-08-27 15:41:50

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 11:30 -0400, Mathieu Desnoyers wrote:
> So do you think Paul's idea would be a good candidate for the timer_create
> SIGEV_THREAD glibc implementation then?

Simply do not _ever_ use SIGEV_THREAD; there really is no reason to.

2010-08-27 15:43:38

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Mike Galbraith ([email protected]) wrote:
> On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > AFAIK, I don't think we would end up starving the system in any possible way.
> >
> > Correct, it does maintain fairness.
> >
> > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > etc). But maybe it's just a lack of imagination on my part.
> >
> > The risk is that you end up with always using next-buddy, and we tried
> > that a while back and that didn't work well for some, Mike might
> > remember.
>
> I turned it off because it was ripping spread apart badly, and last
> buddy did a better job of improving scalability without it.

Maybe with the dyn min_vruntime feature proposed in this patchset we should
reconsider this. Spread being ripped apart is exactly what it addresses.

>
> > Also, when you use timers things like time-outs you really couldn't care
> > less if its handled sooner rather than later.

Maybe we could have a timer delay threshold under which we consider the wakeup
to be "short" rather than "long". E.g., a timer firing every ms very likely
needs its wakeups to proceed quickly, but if the timer fires every hour, we
couldn't care less.
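
A purely hypothetical sketch of such a threshold test (names invented, no such
knob exists in the patchset; assumes kernel context where NSEC_PER_MSEC is
available):

/* Treat only short-period timers as latency-sensitive. */
#define TIMER_INTERACTIVE_THRESHOLD_NS	(10 * NSEC_PER_MSEC)

static inline bool timer_wakeup_wants_next_buddy(u64 period_ns)
{
	return period_ns && period_ns < TIMER_INTERACTIVE_THRESHOLD_NS;
}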

> >
> > Disk is usually so slow you really don't want to consider it
> > interactive, but then sometimes you might,.. its a really hard problem.
>
> (very hard)
>
> > The only clear situation is the direct input, that's a direct link
> > between the user and our wakeup chain and the user is always important.
>
> Yeah, directly linked wakeups using next could be a good thing, but the
> trouble with using any linkage to the user is that you have to pass it
> on to reap benefit.. so when do you disconnect?

What we do here is pass the token on across wakeups, but not across wakeups
of newly created threads (so it's not passed across forks). However, AFAIU,
this only generates a bias: the wakeup target is selected as next buddy only
if it is within a certain range of the leftmost entity. The disconnect is done
(setting se.interactive to 0) on a per-thread basis when threads go to sleep.
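
Roughly, the mechanism is (illustrative pseudo-kernel code approximating the
patches, not quoting them; only the se.interactive flag is real):

/* Pass the "interactive" token from waker to wakee on wakeup... */
static inline void pass_interactive_token(struct sched_entity *waker,
					  struct sched_entity *wakee)
{
	if (waker->interactive)
		wakee->interactive = 1;
}

/* ...and drop it when the thread blocks. */
static inline void interactive_sleep(struct sched_entity *se)
{
	se->interactive = 0;
}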

I've seen versions of Peter's patches that decremented a counter as the token
is passed across a wakeup, but so far I have not found that necessary.

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 15:50:17

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Thomas Gleixner ([email protected]) wrote:
> On Thu, 26 Aug 2010, Mathieu Desnoyers wrote:
[...]
> > Hrm, thinking about it a little more, one of the "plus" sides of these
> > SIGEV_THREAD timers is that a single timer can fork threads that will run on
> > many cores on a multi-core system. If we go for preallocation of a single
> > thread, we lose that. Maybe we could think of a way to preallocate a thread pool
> > instead?
>
> Why should a single timer fork many threads? Just because a previous
> thread did not complete before the timer fires again? That's
> braindamage as all threads call the same function which then needs to
> be serialized anyway. We really do not need a function which creates
> tons of threads which get all stuck on the same resource.

It could make sense if the workload is mostly CPU-bound and there is only a very
short critical section shared between the threads. But I agree that in many
cases this will generate an utter contention mess.

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 16:09:50

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-08-27 at 11:21 -0400, Mathieu Desnoyers wrote:
> > SIGEV_THREAD
> > Upon timer expiration, invoke sigev_notify_function as if it
> > were the start function of a new thread. (Among the implementation
> > possibilities here are that each timer notification could
> > result in the creation of a new thread, or that a single thread
> > is created to receive all notifications.) The function is
> > invoked with sigev_value as its sole argument. If
> > sigev_notify_attributes is not NULL, it should point to a
> > pthread_attr_t structure that defines attributes for the new
> > thread (see pthread_attr_init(3)).
> >
> > So basically, it's the glibc implementation that is broken, not the standard.
>
> The standard is broken too, what context will the new thread inherit?

Besides pthread_attr_t, and thinking of the scheduler/cgroups/etc. state, I'd
expect it to inherit the state of the thread which calls timer_create(). But
this is not what glibc does right now, and it is not spelled out clearly by
the standard.

> The pthread_attr_t stuff tries to cover some of that, but pthread_attr_t
> doesn't cover all inherited task attributes, and allows for some very
> 'interesting' bugs [1].

(see below)

>
> The specification also doesn't cover the case where the handler takes
> more time to execute than the timer interval.

Why should it? It seems valid for a workload to spawn many threads bound to
more than a single core on a multi-core system, so concurrency management
should be performed by the application.

>
> [1] - consider the case where pthread_attr_t includes the stack and we
> use a spawn thread on expire policy and then run into the situation
> where the handler is delayed past the next expiration.

Setting a thread stack and generating the signal more than once is taken into
account by the standard. It leads to unspecified results (IOW: don't do this):

http://www.opengroup.org/onlinepubs/009695399/functions/timer_create.html

"If evp->sigev_sigev_notify is SIGEV_THREAD and sev->sigev_notify_attributes is
not NULL, if the attribute pointed to by sev->sigev_notify_attributes has a
thread stack address specified by a call to pthread_attr_setstack() or
pthread_attr_setstackaddr(), the results are unspecified if the signal is
generated more than once."

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 17:28:05

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 12:09 -0400, Mathieu Desnoyers wrote:
> > The specification also doesn't cover the case where the handler takes
> > more time to execute than the timer interval.
>
> Why should it? It seems valid for a workload to spawn many threads bound to
> more than a single core on a multi-core system, so concurrency management
> should be performed by the application.

The problem with allowing concurrency is that the moment you want to do
that you get the spawner context and error propagation problems.

Thus we're limited to spawning a single thread at timer_create() and
handling expirations in there. At that point you have to specify what
happens to an expiration while the handler is running: will it queue
handlers, or will you have to carefully craft your handler using
timer_getoverrun()?
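
A sketch of the latter, assuming the timer delivers SIGRTMIN to a dedicated
thread (the signal must be blocked so sigwaitinfo() can collect it; handler
body elided):

#include <signal.h>
#include <stdio.h>
#include <time.h>

static void handle_expiration(void)
{
	/* one period worth of work */
}

static void *expiration_thread(void *arg)
{
	timer_t timerid = *(timer_t *)arg;
	sigset_t set;
	siginfo_t info;

	sigemptyset(&set);
	sigaddset(&set, SIGRTMIN);

	for (;;) {
		if (sigwaitinfo(&set, &info) < 0)
			continue;
		handle_expiration();
		/* Expirations that fired while the signal was pending are
		 * reported as an overrun count instead of queueing up. */
		int missed = timer_getoverrun(timerid);
		if (missed > 0)
			fprintf(stderr, "missed %d expirations\n", missed);
	}
	return NULL;
}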

But I really don't understand why POSIX would provide this composite
functionality instead of providing the separate bits to implement it; I
think SIGEV_THREAD_ID is the only missing piece.

2010-08-27 18:32:44

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-08-27 at 12:09 -0400, Mathieu Desnoyers wrote:
> > > The specification also doesn't cover the case where the handler takes
> > > more time to execute than the timer interval.
> >
> > Why should it? It seems valid for a workload to spawn many threads bound to
> > more than a single core on a multi-core system, so concurrency management
> > should be performed by the application.
>
> The problem with allowing concurrency is that the moment you want to do
> that you get the spawner context and error propagation problems.

These are problems you get only when you allow spawning any number of threads.
If, instead, you create a thread pool at timer creation, then you can allow
concurrency without spawner-context and error-propagation problems.

> Thus we're limited to spawning a single thread at timer_create() and
> handling expirations in there. At that point you have to specify what
> happens to an expiration while the handler is running, will it queue
> handlers or will you have to carefully craft your handler using
> timer_getoverrun().

With my statement above in mind, it's true that we'd have to handle what happens
when the thread pool is exhausted. But couldn't we simply increment an overrun
counter from userland? (It might be a different counter from the one the kernel
keeps, maintained internally in userland.)
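
Something like this userland bookkeeping (names hypothetical):

#include <stdatomic.h>

/* When every pool thread is busy at expiration time, count an overrun
 * instead of spawning; distinct from the kernel's timer_getoverrun(). */
struct timer_pool {
	atomic_int  idle_threads;
	atomic_long user_overruns;
};

static void on_expiration(struct timer_pool *pool)
{
	int idle = atomic_load(&pool->idle_threads);

	while (idle > 0) {
		/* Claim an idle worker; on failure, "idle" is reloaded. */
		if (atomic_compare_exchange_weak(&pool->idle_threads,
						 &idle, idle - 1))
			return;	/* a worker runs the handler */
	}
	atomic_fetch_add(&pool->user_overruns, 1);
}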

> But I really don't understand why POSIX would provide this composite
> functionality instead of providing the separate bits to implement this,
> which I think is only SIGEV_THREAD_ID which is missing.

Higher-abstraction API vs. low-level control: a long-running battle. ;)

Thanks,

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 18:38:33

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Mathieu Desnoyers ([email protected]) wrote:
> * Mike Galbraith ([email protected]) wrote:
> > On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > > AFAIK, I don't think we would end up starving the system in any possible way.
> > >
> > > Correct, it does maintain fairness.
> > >
> > > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > > etc). But maybe it's just a lack of imagination on my part.
> > >
> > > The risk is that you end up with always using next-buddy, and we tried
> > > that a while back and that didn't work well for some, Mike might
> > > remember.
> >
> > I turned it off because it was ripping spread apart badly, and last
> > buddy did a better job of improving scalability without it.
>
> Maybe with the dyn min_vruntime feature proposed in this patchset we should
> reconsider this. Spread being ripped apart is exactly what it addresses.

I'm curious: which workload was showing this kind of problem exactly ?

Thanks,

Mathieu

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-27 19:23:29

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
>
> These are problems you get only when you allow spawning any number of threads.
> If, instead, you create a thread pool at timer creation, then you can allow
> concurrency without spawner-context and error-propagation problems.

That would be a massive resource waste for the normal case where the
interval > handler runtime.

2010-08-27 19:57:57

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Peter Zijlstra ([email protected]) wrote:
> On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
> >
> > These are problems you get only when you allow spawning any number of threads.
> > If, instead, you create a thread pool at timer creation, then you can allow
> > concurrency without spawner-context and error-propagation problems.
>
> That would be a massive resource waste for the normal case where the
> interval > handler runtime.

Not if the application can configure this value. So the "normal" case could be
set to a single thread. Only resource-intensive apps would set it differently,
e.g. to the detected number of CPUs.
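
The detection itself is trivial from userland; a sketch:

#include <unistd.h>

/* Hypothetical pool sizing: one notification thread per online CPU,
 * falling back to a single thread. */
static long pool_size(void)
{
	long n = sysconf(_SC_NPROCESSORS_ONLN);
	return n > 0 ? n : 1;
}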

Mathieu


--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com

2010-08-28 07:33:41

by Mike Galbraith

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

On Fri, 2010-08-27 at 14:38 -0400, Mathieu Desnoyers wrote:
> * Mathieu Desnoyers ([email protected]) wrote:
> > * Mike Galbraith ([email protected]) wrote:
> > > On Fri, 2010-08-27 at 09:42 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2010-08-26 at 19:49 -0400, Mathieu Desnoyers wrote:
> > > > > AFAIK, I don't think we would end up starving the system in any possible way.
> > > >
> > > > Correct, it does maintain fairness.
> > > >
> > > > > So far I cannot see a situation where selecting the next buddy would _not_ make
> > > > > sense in any kind of input-driven wakeups (interactive, timer, disk, network,
> > > > > etc). But maybe it's just a lack of imagination on my part.
> > > >
> > > > The risk is that you end up with always using next-buddy, and we tried
> > > > that a while back and that didn't work well for some, Mike might
> > > > remember.
> > >
> > > I turned it off because it was ripping spread apart badly, and last
> > > buddy did a better job of improving scalability without it.
> >
> > Maybe with the dyn min_vruntime feature proposed in this patchset we should
> > reconsider this. Spread being ripped apart is exactly what it addresses.

Dunno. Messing with spread is a very sharp double-edged sword.

I took the patch set out for a spin, and saw some negative effects in
that regard. I did see some positive effects as well though; x264, for
example, really wants round-robin, so it profits. Things where preemption
translates to throughput don't care for the idea much. vmark rather
surprised me: it hated this feature for some reason, and I expected the
opposite.

It's an interesting knob, but I wouldn't turn it on by default on
anything but maybe a UP desktop box.

(I kinda like cgroups classified by pgid for desktop interactivity under
load; works pretty darn well)

> I'm curious: which workload was showing this kind of problem exactly ?

Hm, I don't recall exact details. I was looking at a lot of different
load mixes at the time, mostly interactive and batch, trying to shrink
the too darn high latencies seen with modest load mixes.

I never tried to figure out why next buddy had a worse effect on spread
than last buddy (not obvious to me), just noted the fact that it did. I
recall that it had a large negative effect on x264 throughput as well.

-Mike

some numbers:

35.3 = 2.6.35.3 vanilla, 35.3x = 2.6.35.3 + DYN_MIN_VRUNTIME

netperf TCP_RR
35.3 97624.82 96982.62 97387.50 avg 97331.646 RR/sec 1.000
35.3x 95044.98 95365.52 94581.92 avg 94997.473 RR/sec .976

tbench 8
35.3 1200.36 1200.56 1199.29 avg 1200.070 MB/sec 1.000
35.3x 1106.92 1110.50 1106.58 avg 1108.000 MB/sec .923

x264 8
35.3 407.12 408.80 414.60 avg 410.173 fps 1.000
35.3x 428.07 436.43 438.16 avg 434.220 fps 1.058

vmark
35.3 149678 149269 150584 avg 149843.666 m/sec 1.000
35.3x 120872 120932 121247 avg 121017.000 m/sec .807

mysql+oltp
1 2 4 8 16 32 64 128 256
35.3 10956.33 20747.86 37139.27 36898.70 36575.90 36104.63 34390.26 31574.46 29148.01
10938.44 20835.17 37058.51 37051.71 36630.06 35930.90 34464.88 32024.50 28989.14
10935.72 20792.54 37238.17 36989.97 36568.37 35961.00 34342.54 31532.39 29235.20
avg 10943.49 20791.85 37145.31 36980.12 36591.44 35998.84 34399.22 31710.45 29124.11

35.3x 10944.22 20851.09 35609.32 35744.05 35137.49 33362.16 30796.03 28286.87 25105.84
10958.68 20811.93 35604.57 35610.71 35147.65 33371.81 30877.52 28325.79 25113.85
10962.72 20745.81 35728.36 35638.23 35124.56 33336.20 30794.99 28225.99 25202.88
avg 10955.20 20802.94 35647.41 35664.33 35136.56 33356.72 30822.84 28279.55 25140.85
vs 35.3 1.001 1.000 .959 .964 .960 .926 .896 .891 .863

2010-08-31 15:02:31

by Mathieu Desnoyers

[permalink] [raw]
Subject: Re: [RFC PATCH 00/11] sched: CFS low-latency features

* Mathieu Desnoyers ([email protected]) wrote:
> * Peter Zijlstra ([email protected]) wrote:
> > On Fri, 2010-08-27 at 14:32 -0400, Mathieu Desnoyers wrote:
> > >
> > > These are problems you get only when you allow spawning any number of threads.
> > > If, instead, you create a thread pool at timer creation, then you can allow
> > > concurrency without spawner-context and error-propagation problems.
> >
> > That would be a massive resource waste for the normal case where the
> > interval > handler runtime.
>
> Not if the application can configure this value. So the "normal" case could be
> set to a single thread. Only resource-intensive apps would set it differently,
> e.g. to the detected number of CPUs.

... but as we discussed in private, this would involve adding extra attribute
fields beyond what the standard proposes. I think the problem comes from the
fact that POSIX considers the pthread attributes to contain every possible
thread attribute one can imagine, which fails to take into account the
internal kernel state set by other interfaces.

So this leaves us with a single-threaded SIGEV_THREAD, which is pretty much
useless. But at least it is not totally impossible for glibc to get it right by
moving to an implementation that uses only one thread.

So until someone finds the time to fix glibc's SIGEV_THREAD timers, I would
strongly recommend against using them.

Thanks,

Mathieu

>
> Mathieu
>
>
> --
> Mathieu Desnoyers
> Operating System Efficiency R&D Consultant
> EfficiOS Inc.
> http://www.efficios.com

--
Mathieu Desnoyers
Operating System Efficiency R&D Consultant
EfficiOS Inc.
http://www.efficios.com