2012-02-15 14:40:29

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Thu, 29 Sep 2011, Matthew Garrett wrote:

> sleep(0) is a common construct used by applications that want to trigger
> the scheduler. sched_yield() might make more sense, but only appeared in
> POSIX.1-2001 and so plenty of example code still uses the sleep(0) form.
> This wouldn't normally be a problem, but it means that event-driven
> applications that are merely trying to avoid starving other processes may
> actually end up sleeping due to having large timer_slack values. Special-
> casing this seems reasonable.
>
> Signed-off-by: Matthew Garrett <[email protected]>
> ---
> kernel/hrtimer.c | 8 ++++++++
> 1 files changed, 8 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
> index a9205e3..0bb70a7 100644
> --- a/kernel/hrtimer.c
> +++ b/kernel/hrtimer.c
> @@ -1566,6 +1566,14 @@ long hrtimer_nanosleep(struct timespec *rqtp, struct timespec __user *rmtp,
> if (rt_task(current))
> slack = 0;
>
> + /*
> + * Applications will often sleep(0) to indicate that they wish to
> + * be scheduled. Special case that to avoid actually putting them
> + * to sleep for the duration of the slack.
> + */
> + if (rqtp->tv_sec == 0 && rqtp->tv_nsec == 0)
> + slack = 0;

That's pretty pointless. You can simply return 0 here as
do_nanosleep() will not call the scheduler on an already expired
timer, which is always true for a relative timer with delta 0.

Thanks,

tglx


2012-02-15 14:52:40

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 03:40:24PM +0100, Thomas Gleixner wrote:
> > + * be scheduled. Special case that to avoid actually putting them
> > + * to sleep for the duration of the slack.
> > + */
> > + if (rqtp->tv_sec == 0 && rqtp->tv_nsec == 0)
> > + slack = 0;
>
> That's pretty pointless. You can simply return 0 here as
> do_nanosleep() will not call the scheduler on an already expired
> timer, which is always true for a relative timer with delta 0.

I'm actually starting to wonder about the applications doing this. We
default to adding a small amount of slack even if the application has
done sleep(0), which will mean that the timer hasn't expired at this
point. Do we then go through the scheduler differently? Are these
applications actually relying on an invalid assumption?

--
Matthew Garrett | [email protected]

2012-02-15 14:54:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 2012-02-15 at 15:40 +0100, Thomas Gleixner wrote:
> On Thu, 29 Sep 2011, Matthew Garrett wrote:
>
> > sleep(0) is a common construct used by applications that want to trigger
> > the scheduler. sched_yield() might make more sense, but only appeared in
> > POSIX.1-2001 and so plenty of example code still uses the sleep(0) form.

Note that sched_yield() is unspecified for SCHED_OTHER, so any
application using it that is not a real-time application is outside spec
anyway.

Furthermore, both sched_yield() and sleep(0) are fair indications the
application is broken, wait for proper events, not random delays.

> > This wouldn't normally be a problem, but it means that event-driven
> > applications that are merely trying to avoid starving other processes may
> > actually end up sleeping due to having large timer_slack values. Special-
> > casing this seems reasonable.

Again, sleep granularity isn't specified and thus this behaviour isn't
actually buggy, all the spec promises is we won't sleep shorter than
asked.

Also, starvation avoidance isn't something that userspace should concern
itself with.

2012-02-15 14:59:01

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

Regardless of whether userspace should be concerning itself about this
kind of thing or not, there's plenty of userspace that calls sleep(0) on
the assumption that it'll get rescheduled. This makes using whole-system
timer slack difficult, because there are some applications that do this
even if they're event-driven and sleeping for significant lengths of
time here breaks them. I'd certainly understand the argument for fixing
userspace instead, but that's a massive task for something that's easily
special-cased in the kernel.

--
Matthew Garrett | [email protected]

2012-02-15 20:14:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 15 Feb 2012, Matthew Garrett wrote:

> On Wed, Feb 15, 2012 at 03:40:24PM +0100, Thomas Gleixner wrote:
> > > + * be scheduled. Special case that to avoid actually putting them
> > > + * to sleep for the duration of the slack.
> > > + */
> > > + if (rqtp->tv_sec == 0 && rqtp->tv_nsec == 0)
> > > + slack = 0;
> >
> > That's pretty pointless. You can simply return 0 here as
> > do_nanosleep() will not call the scheduler on an already expired
> > timer, which is always true for a relative timer with delta 0.
>
> I'm actually starting to wonder about the applications doing this. We
> default to adding a small amount of slack even if the application has
> done sleep(0), which will mean that the timer hasn't expired at this
> point. Do we then go through the scheduler differently? Are these

When the slack is large enough that the timer is actually not expired
right away, which is usually the case, then we end up in schedule()
and the task gets scheduled out until the timer fires. With your
approach of making the slack 0 for sleep(0) calls the code does not
call schedule() because the timer is definitely expired.

> applications actually relying on an invalid assumption?

Oh yes. sleep(0) has no guarantee about its behaviour at all. The only
guarantee of sleep() is that it wont return before the requested time
has elapsed, but there is no upper bound when it returns after the
sleep time is over. So it's perfectly fine from the standards POV that
sleep(0) actually sleeps and puts the tasks for some random time away.
It's also correct when it returns right away w/o going through
schedule(). The fact that sleep(0) ended up in schedule() even when
the timer was already and the task state therefor was RUNNING on some
unix implementations does not change that at all.

Just for the extended fun of it: The pre hrtimer implementation in
Linux put the task on sleep as well up to the next jiffies boundary,
so anything which used sleep(0) on a pre hrtimer kernel was going to
sleep. That's also the case today when high resolution timers are
disabled (compile or runtime).

So anything which relies on sleep(0) as a fast scheduling point is and
has been broken forever.

Thanks,

tglx

2012-02-15 20:22:06

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 15 Feb 2012, Matthew Garrett wrote:

> Regardless of whether userspace should be concerning itself about this
> kind of thing or not, there's plenty of userspace that calls sleep(0) on
> the assumption that it'll get rescheduled. This makes using whole-system

Right. And it got always rescheduled because the old pre hrtimer
implementation put it to sleep unconditionally. The hrtimer code
before we introduced the slack stuff returned right away when the
relative delta was 0. Slack changed that back to the pre hrtimer
behaviour. Also if you disable high res timers (compile or runtime)
you go to sleep unconditionally until the next jiffies boundary.

> timer slack difficult, because there are some applications that do this
> even if they're event-driven and sleeping for significant lengths of
> time here breaks them. I'd certainly understand the argument for fixing

What is significant? If the app relies on sleep(0) returning
instantly, then it's going to malfunction on a highres=n kernel with
HZ=100 as well. Also when there are actually other runnable tasks and
it gets scheduled away, then there is no guarantee that it comes back
to the cpu within a defined boundary.

> userspace instead, but that's a massive task for something that's easily
> special-cased in the kernel.

So what's the correct special case solution? Return right away, call
schedule() with state RUNNING or some other magic crap?

Again, if a SCHED_OTHER task cannot cope with the fact that it gets
scheduled away for unbound amount of time, then changing the behaviour
of sleep(0) to some magic yield() variant does not help at all. It's
still broken and no special case in the kernel can fix that.

Thanks,

tglx

2012-02-15 20:22:24

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 09:14:30PM +0100, Thomas Gleixner wrote:

> Just for the extended fun of it: The pre hrtimer implementation in
> Linux put the task on sleep as well up to the next jiffies boundary,
> so anything which used sleep(0) on a pre hrtimer kernel was going to
> sleep. That's also the case today when high resolution timers are
> disabled (compile or runtime).
>
> So anything which relies on sleep(0) as a fast scheduling point is and
> has been broken forever.

Excellent. So the real question is what /should/ sleep(0) do - nothing,
schedule or sleep for an arbitrary period of time that could be years?

--
Matthew Garrett | [email protected]

2012-02-15 20:30:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 15 Feb 2012, Matthew Garrett wrote:

> On Wed, Feb 15, 2012 at 09:14:30PM +0100, Thomas Gleixner wrote:
>
> > Just for the extended fun of it: The pre hrtimer implementation in
> > Linux put the task on sleep as well up to the next jiffies boundary,
> > so anything which used sleep(0) on a pre hrtimer kernel was going to
> > sleep. That's also the case today when high resolution timers are
> > disabled (compile or runtime).
> >
> > So anything which relies on sleep(0) as a fast scheduling point is and
> > has been broken forever.
>
> Excellent. So the real question is what /should/ sleep(0) do - nothing,
> schedule or sleep for an arbitrary period of time that could be years?

Well, I don't expect slack to be set to years and I really don't want
to special case sleep(0), because then we might end up discussing
special casing usleep(1) or nanosleep(1ns) as well.

slack gets added to all timers, period.

You can tune the slack for your app if you think that it's necessary,
but I bet that in 99,9% of the cases the programmer does not even know
why the hell he put sleep(0), usleep(1) or nanosleep(1ns) into his
code.

Thanks,

tglx

2012-02-15 20:38:21

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 09:30:20PM +0100, Thomas Gleixner wrote:
> On Wed, 15 Feb 2012, Matthew Garrett wrote:
> > Excellent. So the real question is what /should/ sleep(0) do - nothing,
> > schedule or sleep for an arbitrary period of time that could be years?
>
> Well, I don't expect slack to be set to years and I really don't want
> to special case sleep(0), because then we might end up discussing
> special casing usleep(1) or nanosleep(1ns) as well.

Increasing slack to the seconds range has measureable power management
benefits, but there's some code that ends up broken as a result even
when they're nominally event driven. I've no problem with us just
declaring that code as broken, but it would be less effort to special
case it. Application authors do seem to have ended up under the belief
that sleep(0) is a meaningful thing to do, and the internet seems to be
full of suggestions to use it rather than sched_yield().

--
Matthew Garrett | [email protected]

2012-02-15 20:40:46

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 2012-02-15 at 20:38 +0000, Matthew Garrett wrote:
> Application authors do seem to have ended up under the belief
> that sleep(0) is a meaningful thing to do, and the internet seems to be
> full of suggestions to use it rather than sched_yield().

http://xkcd.com/386/

2012-02-15 20:43:47

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 09:40:28PM +0100, Peter Zijlstra wrote:
> On Wed, 2012-02-15 at 20:38 +0000, Matthew Garrett wrote:
> > Application authors do seem to have ended up under the belief
> > that sleep(0) is a meaningful thing to do, and the internet seems to be
> > full of suggestions to use it rather than sched_yield().
>
> http://xkcd.com/386/

I'm not even going to pretend to disagree. The question is whether we
want to support their breakage anyway.
--
Matthew Garrett | [email protected]

2012-02-15 20:46:31

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, 15 Feb 2012, Matthew Garrett wrote:

> On Wed, Feb 15, 2012 at 09:30:20PM +0100, Thomas Gleixner wrote:
> > On Wed, 15 Feb 2012, Matthew Garrett wrote:
> > > Excellent. So the real question is what /should/ sleep(0) do - nothing,
> > > schedule or sleep for an arbitrary period of time that could be years?
> >
> > Well, I don't expect slack to be set to years and I really don't want
> > to special case sleep(0), because then we might end up discussing
> > special casing usleep(1) or nanosleep(1ns) as well.
>
> Increasing slack to the seconds range has measureable power management
> benefits, but there's some code that ends up broken as a result even
> when they're nominally event driven. I've no problem with us just
> declaring that code as broken, but it would be less effort to special
> case it. Application authors do seem to have ended up under the belief
> that sleep(0) is a meaningful thing to do, and the internet seems to be
> full of suggestions to use it rather than sched_yield().

The internet is full of crappy suggestions written by absolutely
clueless and advisory resistant morons.

Dammit, we cannot come up with a reasonable definition for special
casing that stuff simply because you cannot draw a clear boundary what
to special case and what not. And there is no sensible definition for
what to do - return right away or go through schedule() or what ever.

sleep(0) is as pointless as sched_yield() and it's about time that we
stop to create a fucking mess in the kernel just because user space
programmers refuse to understand how an operating system works and how
proper programming should be done.

Thanks,

tglx

2012-02-15 20:48:01

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 09:46:27PM +0100, Thomas Gleixner wrote:

> Dammit, we cannot come up with a reasonable definition for special
> casing that stuff simply because you cannot draw a clear boundary what
> to special case and what not. And there is no sensible definition for
> what to do - return right away or go through schedule() or what ever.
>
> sleep(0) is as pointless as sched_yield() and it's about time that we
> stop to create a fucking mess in the kernel just because user space
> programmers refuse to understand how an operating system works and how
> proper programming should be done.

Ok. I'll work on finding how prevelant it is and see if we can kill it
off in userspace.

--
Matthew Garrett | [email protected]

2012-02-16 14:27:39

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Wed, Feb 15, 2012 at 08:47:57PM +0000, Matthew Garrett wrote:

> Ok. I'll work on finding how prevelant it is and see if we can kill it
> off in userspace.

Checking through an exploded Fedora kernel tree suggests around 125
packages out of 11000 or so, so around 1% of userspace seems to use
sleep(0) under certain circumstances. We can probably fix everything in
the distribution, but that suggests that there's also going to be a
significant amount of code in the outside world that's also broken.

Userspace clearly has an expectation that sleep(0) is magic in some
ill-defined way. We'd be well within our rights to break that
expectation, but I think it's common enough to warrant special casing.

--
Matthew Garrett | [email protected]

2012-02-16 14:29:52

by Alan

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

> Userspace clearly has an expectation that sleep(0) is magic in some
> ill-defined way. We'd be well within our rights to break that
> expectation, but I think it's common enough to warrant special casing.

In historical Unix sleep(0) ends up the nearest equivalent it had to
triggering a reschedule and giving up the rest of the timeslice.

I suspect special casing it as yield() isn't far from the right result ?

Alan

2012-02-16 14:51:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Thu, 2012-02-16 at 14:31 +0000, Alan Cox wrote:
> > Userspace clearly has an expectation that sleep(0) is magic in some
> > ill-defined way. We'd be well within our rights to break that
> > expectation, but I think it's common enough to warrant special casing.
>
> In historical Unix sleep(0) ends up the nearest equivalent it had to
> triggering a reschedule and giving up the rest of the timeslice.
>
> I suspect special casing it as yield() isn't far from the right result ?

But why go that way? Using sleep(0) or yield() is pretty much always the
wrong thing to do anyway, this is a great opportunity for all folks to
find these sites and fix them.

Wasn't that what open-source is all about, doing the right thing?

Why should we care about obviously broken crap?

Furthermore, pushing slack to several seconds will also break stuff that
needed those timers to expire sooner, who is going to fix that?

So we've got a stacking of two ill-considered things:
- applications using yield()/sleep(0)
- weirdos pushing timer slack to the seconds range

Individually both cause/are borkage, and now you want to add code to the
kernel to mitigate some, but nowhere near all, of it?

What's next, we're actually going to give people their O_PONIES?

2012-02-16 15:01:54

by Matthew Garrett

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Thu, Feb 16, 2012 at 03:51:16PM +0100, Peter Zijlstra wrote:
> On Thu, 2012-02-16 at 14:31 +0000, Alan Cox wrote:
> > In historical Unix sleep(0) ends up the nearest equivalent it had to
> > triggering a reschedule and giving up the rest of the timeslice.
> >
> > I suspect special casing it as yield() isn't far from the right result ?
>
> But why go that way? Using sleep(0) or yield() is pretty much always the
> wrong thing to do anyway, this is a great opportunity for all folks to
> find these sites and fix them.
>
> Wasn't that what open-source is all about, doing the right thing?

Doing the right thing if practical. How about we special-case for now
with a once-per-process printk and kill it further down the line?

> Why should we care about obviously broken crap?
>
> Furthermore, pushing slack to several seconds will also break stuff that
> needed those timers to expire sooner, who is going to fix that?

The reason to change timer slack is because we're willing to break
polling applications in order to gain power savings. The problem is that
there are event-driven applications that are also going to be broken
because some ridiculous proportion of userspace believes that sleep(0)
is a thing that they can do in an event-driven application. The question
is whether the cost of special-casing that in the kernel is more than
fixing all of them.

> So we've got a stacking of two ill-considered things:
> - applications using yield()/sleep(0)
> - weirdos pushing timer slack to the seconds range
>
> Individually both cause/are borkage, and now you want to add code to the
> kernel to mitigate some, but nowhere near all, of it?
>
> What's next, we're actually going to give people their O_PONIES?

If we could give people O_PONIES then why wouldn't we? The only reason
not to is because it costs too much elsewhere.

--
Matthew Garrett | [email protected]

2012-02-16 19:09:28

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH] hrtimers: Special-case zero length sleeps

On Thu, 16 Feb 2012, Matthew Garrett wrote:

> On Thu, Feb 16, 2012 at 03:51:16PM +0100, Peter Zijlstra wrote:
> > On Thu, 2012-02-16 at 14:31 +0000, Alan Cox wrote:
> > > In historical Unix sleep(0) ends up the nearest equivalent it had to
> > > triggering a reschedule and giving up the rest of the timeslice.
> > >
> > > I suspect special casing it as yield() isn't far from the right result ?
> >
> > But why go that way? Using sleep(0) or yield() is pretty much always the
> > wrong thing to do anyway, this is a great opportunity for all folks to
> > find these sites and fix them.
> >
> > Wasn't that what open-source is all about, doing the right thing?
>
> Doing the right thing if practical. How about we special-case for now
> with a once-per-process printk and kill it further down the line?

We've done this for stuff which was once kinda guaranteed
(intentionally or not), but sleep(0) has never had any guarantee
whatsoever. Quite the contrary it changed its behaviour several times
over the last decades and there is no reason to manifest some weird
ass semantics now.

> > Why should we care about obviously broken crap?
> >
> > Furthermore, pushing slack to several seconds will also break stuff that
> > needed those timers to expire sooner, who is going to fix that?
>
> The reason to change timer slack is because we're willing to break
> polling applications in order to gain power savings. The problem is that
> there are event-driven applications that are also going to be broken
> because some ridiculous proportion of userspace believes that sleep(0)
> is a thing that they can do in an event-driven application. The question

If we are willing to break polling crap, then there is no fricking
reason not to break lousy programmed "event driven" crap while we are
at it.

> is whether the cost of special-casing that in the kernel is more than
> fixing all of them.

The cost of special casing stuff is maintainence, inconsistance of
interfaces and extended confusion. And each of that is worse than
breaking some (i.e. 1% of the total app stack out there) weirdo apps.

> > What's next, we're actually going to give people their O_PONIES?
>
> If we could give people O_PONIES then why wouldn't we? The only reason
> not to is because it costs too much elsewhere.

Wrong. The reason is that O_PONIES is impossible to define and that's
equally true for any exception to the (nano)sleep interface.

Thanks,

tglx