After wraparound-problems in sched_clock() we expand the 32bit
timer in the MTU from 32 to 63 bits so the scheduling and
timeline is more stable. At current max operating frequency for
the MTU, 133 MHz, this should be sufficient for rougly 2200
years.
Cc: Colin Cross <[email protected]>
Cc: Rabin Vincent <[email protected]>
Signed-off-by: Linus Walleij <[email protected]>
---
tglx, nico: can you help out on reviewing this?
The solution in this patch is based on the similar approach
taken by the Tegra platform in arch/arm/mach-tegra/timer.c.
Orion on the other hand will only expand the timer to 63
bits in the sched_clock() function in arch/arm/plat-orion/time.c
What we need to know is whether it's OK to simply blow up
clocksource to 63 bits like this. In that case this
simplification should also apply to Orion, meaning that it
would base it's sched_clock() on the clocksource, using
simply clocksource_cyc2ns() cutting down the code
significantly.
My main obstacle is that I cannot really determine whether
clocksource.read() will be called often enough for the
cnt32_to_63 algorithm.
If this won't work the Tegra platform probably needs fixing
too.
We had a rather heated debate about this internally already
so I'm seeking the advice of the elders.
---
arch/arm/plat-nomadik/timer.c | 5 +++--
1 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/arch/arm/plat-nomadik/timer.c b/arch/arm/plat-nomadik/timer.c
index aedf9c1..e887bf8 100644
--- a/arch/arm/plat-nomadik/timer.c
+++ b/arch/arm/plat-nomadik/timer.c
@@ -16,6 +16,7 @@
#include <linux/clk.h>
#include <linux/jiffies.h>
#include <linux/err.h>
+#include <linux/cnt32_to_63.h>
#include <asm/mach/time.h>
#include <plat/mtu.h>
@@ -34,14 +35,14 @@ static cycle_t nmdk_read_timer_dummy(struct clocksource *cs)
/* clocksource: MTU decrements, so we negate the value being read. */
static cycle_t nmdk_read_timer(struct clocksource *cs)
{
- return -readl(mtu_base + MTU_VAL(0));
+ return cnt32_to_63(-readl(mtu_base + MTU_VAL(0)));
}
static struct clocksource nmdk_clksrc = {
.name = "mtu_0",
.rating = 200,
.read = nmdk_read_timer_dummy,
- .mask = CLOCKSOURCE_MASK(32),
+ .mask = CLOCKSOURCE_MASK(63),
.flags = CLOCK_SOURCE_IS_CONTINUOUS,
};
--
1.6.3.3
On Thu, 11 Nov 2010, Linus Walleij wrote:
> After wraparound-problems in sched_clock() we expand the 32bit
> timer in the MTU from 32 to 63 bits so the scheduling and
> timeline is more stable. At current max operating frequency for
> the MTU, 133 MHz, this should be sufficient for rougly 2200
> years.
>
> Cc: Colin Cross <[email protected]>
> Cc: Rabin Vincent <[email protected]>
> Signed-off-by: Linus Walleij <[email protected]>
> ---
> tglx, nico: can you help out on reviewing this?
>
> The solution in this patch is based on the similar approach
> taken by the Tegra platform in arch/arm/mach-tegra/timer.c.
>
> Orion on the other hand will only expand the timer to 63
> bits in the sched_clock() function in arch/arm/plat-orion/time.c
>
> What we need to know is whether it's OK to simply blow up
> clocksource to 63 bits like this. In that case this
> simplification should also apply to Orion, meaning that it
> would base it's sched_clock() on the clocksource, using
> simply clocksource_cyc2ns() cutting down the code
> significantly.
>
> My main obstacle is that I cannot really determine whether
> clocksource.read() will be called often enough for the
> cnt32_to_63 algorithm.
We talk about 16 seconds here. If we wouldn't read out the 32bit
clocksource at least once in that time we'd run into wrap issues as
well.
There is only one caveat. When nohz is on and you sleep longer than 16
seconds then the limitation we have in place does not work anymore, as
it would say that the long sleep time is less than the 63bit
wraparound time. With 32bit clocksource it limits the sleep correclty
to avoid the clocksource wrap issue.
Aside of that you are trading a bit less source code with extra code
in the clock read() function, which is called pretty frequently.
Thanks,
tglx
Thomas Gleixner wrote:
> There is only one caveat. When nohz is on and you sleep longer than 16
> seconds then the limitation we have in place does not work anymore, as
> it would say that the long sleep time is less than the 63bit
> wraparound time. With 32bit clocksource it limits the sleep correclty
> to avoid the clocksource wrap issue.
Hm! So the .mask has that side effect (as I suspected), and with
this the sleep will be limited to what time you can get into 63 bits
(2200 years). And we sure want NOHZ...
But if I complement the solution with the keepwarm() timer from
Orion doing just a dummy read() on the clocksource every say 15 secs
it will work even if the system sleeps for so long.
> Aside of that you are trading a bit less source code with extra code
> in the clock read() function, which is called pretty frequently.
Hm, yeah. I would have to go to metrics but it's O(n)
is it not? Shouldn't hurt a high-frequency SMP system
I believe.
Yours,
Linus Walleij
On Thu, 11 Nov 2010, Linus Walleij wrote:
> Thomas Gleixner wrote:
>
> > There is only one caveat. When nohz is on and you sleep longer than 16
> > seconds then the limitation we have in place does not work anymore, as
> > it would say that the long sleep time is less than the 63bit
> > wraparound time. With 32bit clocksource it limits the sleep correclty
> > to avoid the clocksource wrap issue.
>
> Hm! So the .mask has that side effect (as I suspected), and with
> this the sleep will be limited to what time you can get into 63 bits
> (2200 years). And we sure want NOHZ...
>
> But if I complement the solution with the keepwarm() timer from
> Orion doing just a dummy read() on the clocksource every say 15 secs
> it will work even if the system sleeps for so long.
Well, you don't need the keepwarm() timer, if you just call
sched_clock() in the idle wakeup path, so the magic in the NOHZ code
will ensure that you reread the thing timely.
Thanks,
tglx
On Thu, 11 Nov 2010, Thomas Gleixner wrote:
> On Thu, 11 Nov 2010, Linus Walleij wrote:
>
> > Thomas Gleixner wrote:
> >
> > > There is only one caveat. When nohz is on and you sleep longer than 16
> > > seconds then the limitation we have in place does not work anymore, as
> > > it would say that the long sleep time is less than the 63bit
> > > wraparound time. With 32bit clocksource it limits the sleep correclty
> > > to avoid the clocksource wrap issue.
> >
> > Hm! So the .mask has that side effect (as I suspected), and with
> > this the sleep will be limited to what time you can get into 63 bits
> > (2200 years). And we sure want NOHZ...
> >
> > But if I complement the solution with the keepwarm() timer from
> > Orion doing just a dummy read() on the clocksource every say 15 secs
> > it will work even if the system sleeps for so long.
>
> Well, you don't need the keepwarm() timer, if you just call
> sched_clock() in the idle wakeup path, so the magic in the NOHZ code
> will ensure that you reread the thing timely.
You still have to make sure you get out of idle before 16 secs has
passed.
Nicolas
On Thu, 11 Nov 2010, Nicolas Pitre wrote:
> On Thu, 11 Nov 2010, Thomas Gleixner wrote:
> > Well, you don't need the keepwarm() timer, if you just call
> > sched_clock() in the idle wakeup path, so the magic in the NOHZ code
> > will ensure that you reread the thing timely.
>
> You still have to make sure you get out of idle before 16 secs has
> passed.
No, I meant to put it into the idle loop, right after mach_idle() or
whatever is called to put the CPU into deep power sleep. The
clockevents core code makes sure that you get interrupted in time.
Thanks,
tglx
On Thu, 11 Nov 2010, Thomas Gleixner wrote:
> On Thu, 11 Nov 2010, Nicolas Pitre wrote:
> > On Thu, 11 Nov 2010, Thomas Gleixner wrote:
> > > Well, you don't need the keepwarm() timer, if you just call
> > > sched_clock() in the idle wakeup path, so the magic in the NOHZ code
> > > will ensure that you reread the thing timely.
> >
> > You still have to make sure you get out of idle before 16 secs has
> > passed.
>
> No, I meant to put it into the idle loop, right after mach_idle() or
> whatever is called to put the CPU into deep power sleep. The
> clockevents core code makes sure that you get interrupted in time.
OK, if the clockevents core code takes care of that then fine.
Of course one could claim that the sched_clock() code could be called
needlessly when frequently entering and leaving idle mode. That's why
I've used a simple timer for this purpose so far, which also makes the
implementation and its algorithm constraint workaround close together.
Nicolas
On Thu, Nov 11, 2010 at 1:05 AM, Linus Walleij
<[email protected]> wrote:
> After wraparound-problems in sched_clock() we expand the 32bit
> timer in the MTU from 32 to 63 bits so the scheduling and
> timeline is more stable. At current max operating frequency for
> the MTU, 133 MHz, this should be sufficient for rougly 2200
> years.
>
[snip]
> What we need to know is whether it's OK to simply blow up
> clocksource to 63 bits like this. In that case this
> simplification should also apply to Orion, meaning that it
> would base it's sched_clock() on the clocksource, using
> simply clocksource_cyc2ns() cutting down the code
> significantly.
I would advise against expanding the hardware counter to 63bits via
software at the clocksource layer. This breaks the
CLOCK_SOURCE_IS_CONTINUOUS flag rule (since the hardware will wrap
below the mask value if not updated in the software layer). And as
Thomas said, this will cause problems in the nohz idle limiting code
(which makes sure we wake up before the clocksource wraps).
Instead, I'd use this extension at the sched_clock level, where it
seems reasonable that it will be called frequently enough to not brake
the cnt32_to_63 magic.
thanks
-john
john stultz wrote:
> I would advise against expanding the hardware counter to 63bits via
> software at the clocksource layer. This breaks the
> CLOCK_SOURCE_IS_CONTINUOUS flag rule (since the hardware will wrap
> below the mask value if not updated in the software layer). And as
> Thomas said, this will cause problems in the nohz idle limiting code
> (which makes sure we wake up before the clocksource wraps).
Hm, yeah that's probably the wisest thing to do right now..
Then the Tegra code is not quite sound. It will probably experience
bad stuff like shaky clocksource when idling too long.
> Instead, I'd use this extension at the sched_clock level, where it
> seems reasonable that it will be called frequently enough to not brake
> the cnt32_to_63 magic.
Yeah we will do it like the Orion does.
Thanks!
Linus Walleij
Hi John,
On Thu, Nov 11, 2010 at 01:33:41PM -0800, john stultz wrote:
> On Thu, Nov 11, 2010 at 1:05 AM, Linus Walleij
> <[email protected]> wrote:
> > After wraparound-problems in sched_clock() we expand the 32bit
> > timer in the MTU from 32 to 63 bits so the scheduling and
> > timeline is more stable. At current max operating frequency for
> > the MTU, 133 MHz, this should be sufficient for rougly 2200
> > years.
> >
> [snip]
> > What we need to know is whether it's OK to simply blow up
> > clocksource to 63 bits like this. In that case this
> > simplification should also apply to Orion, meaning that it
> > would base it's sched_clock() on the clocksource, using
> > simply clocksource_cyc2ns() cutting down the code
> > significantly.
>
> I would advise against expanding the hardware counter to 63bits via
> software at the clocksource layer. This breaks the
> CLOCK_SOURCE_IS_CONTINUOUS flag rule (since the hardware will wrap
> below the mask value if not updated in the software layer). And as
> Thomas said, this will cause problems in the nohz idle limiting code
> (which makes sure we wake up before the clocksource wraps).
>
> Instead, I'd use this extension at the sched_clock level, where it
> seems reasonable that it will be called frequently enough to not brake
> the cnt32_to_63 magic.
Instead of implementing sched_clock for each architecture seperatly,
wouldn't it be nice to have a generic sched_clock that uses the
architecture's clocksource? I tried to implement that some time ago,
but tglx shoot it down because of locking problems.
Maybe doing that became easier since then?
Best regards
Uwe
--
Pengutronix e.K. | Uwe Kleine-K?nig |
Industrial Linux Solutions | http://www.pengutronix.de/ |
Uwe Kleine-K?nig wrote:
> Instead of implementing sched_clock for each architecture seperatly,
> wouldn't it be nice to have a generic sched_clock that uses the
> architecture's clocksource? I tried to implement that some time ago,
> but tglx shoot it down because of locking problems.
I was and still am a big fan of this approach, I am willing to help
testing it if you want to dust off this patch set...
Yours,
Linus Walleij
On Fri, 12 Nov 2010, Linus Walleij wrote:
> Uwe Kleine-K?nig wrote:
>
> > Instead of implementing sched_clock for each architecture seperatly,
> > wouldn't it be nice to have a generic sched_clock that uses the
> > architecture's clocksource? I tried to implement that some time ago,
> > but tglx shoot it down because of locking problems.
>
> I was and still am a big fan of this approach, I am willing to help
> testing it if you want to dust off this patch set...
sched_clock is not necessarily the same as the current clocksource.
Thanks,
tglx
Thomas Gleixner wrote:
> On Fri, 12 Nov 2010, Linus Walleij wrote:
>> Uwe Kleine-K?nig wrote:
>>
>>> Instead of implementing sched_clock for each architecture seperatly,
>>> wouldn't it be nice to have a generic sched_clock that uses the
>>> architecture's clocksource? I tried to implement that some time ago,
>>> but tglx shoot it down because of locking problems.
>> I was and still am a big fan of this approach, I am willing to help
>> testing it if you want to dust off this patch set...
>
> sched_clock is not necessarily the same as the current clocksource.
IIRC Uwe:s approach was that if and only if you would want to use
the clocksource for sched_clock() you provide a special flag on
the clocksource, and it will be attempted to be used for sched_clock().
Incidentally the dual use of a single free-running timer as both
single clocksource and sched_clock() baseline seem to creep up in
a lot of embedded platforms...
Yours,
Linus Walleij
On Fri, 12 Nov 2010, Linus Walleij wrote:
> Thomas Gleixner wrote:
> > On Fri, 12 Nov 2010, Linus Walleij wrote:
> >> Uwe Kleine-K?nig wrote:
> >>
> >>> Instead of implementing sched_clock for each architecture seperatly,
> >>> wouldn't it be nice to have a generic sched_clock that uses the
> >>> architecture's clocksource? I tried to implement that some time ago,
> >>> but tglx shoot it down because of locking problems.
> >> I was and still am a big fan of this approach, I am willing to help
> >> testing it if you want to dust off this patch set...
> >
> > sched_clock is not necessarily the same as the current clocksource.
>
> IIRC Uwe:s approach was that if and only if you would want to use
> the clocksource for sched_clock() you provide a special flag on
> the clocksource, and it will be attempted to be used for sched_clock().
I still don't like the special flag approach. It's mixing
concepts. clock->read() != sched_clock().
I can understand that you want to reuse the conversion factors etc in
struct clocksource, but we have to be more clever than special flags
and magic functions to install sched_clock.
I could accept a solution where we have generic infrastructure which
uses clock->read() and does the magic 63 bit expansion and you'd
simply do sched_clock_install(struct clocksource *) to explicitely
assign that clocksource as sched_clock. No magic, straight forward and
simple.
> Incidentally the dual use of a single free-running timer as both
> single clocksource and sched_clock() baseline seem to creep up in
> a lot of embedded platforms...
No argument about that.
Thanks,
tglx
On Fri, 12 Nov 2010, Uwe Kleine-K?nig wrote:
> On Thu, Nov 11, 2010 at 01:33:41PM -0800, john stultz wrote:
> > Instead, I'd use this extension at the sched_clock level, where it
> > seems reasonable that it will be called frequently enough to not brake
> > the cnt32_to_63 magic.
> Instead of implementing sched_clock for each architecture seperatly,
> wouldn't it be nice to have a generic sched_clock that uses the
> architecture's clocksource? I tried to implement that some time ago,
> but tglx shoot it down because of locking problems.
They also have different goals.
sched_clock: fast implementation with a moderate
accuracy. Used for things like scheduler time slices and printk
timestamping, and since it is called often it needs to be as lightweight
as possible and accuracy is secondary.
clocksource: as accurate as possible, used for POSIX timers and the
like, and is typically more heavyweight than sched_clock().
> Maybe doing that became easier since then?
Maybe having a fallback sched_clock implementation based on clocksource
might make sense, but the current fallback is based on jiffies which is
probably good enough as a fallback.
Nicolas
On Fri, 12 Nov 2010, Linus Walleij wrote:
> Thomas Gleixner wrote:
> > On Fri, 12 Nov 2010, Linus Walleij wrote:
> >> Uwe Kleine-K?nig wrote:
> >>
> >>> Instead of implementing sched_clock for each architecture seperatly,
> >>> wouldn't it be nice to have a generic sched_clock that uses the
> >>> architecture's clocksource? I tried to implement that some time ago,
> >>> but tglx shoot it down because of locking problems.
> >> I was and still am a big fan of this approach, I am willing to help
> >> testing it if you want to dust off this patch set...
> >
> > sched_clock is not necessarily the same as the current clocksource.
>
> IIRC Uwe:s approach was that if and only if you would want to use
> the clocksource for sched_clock() you provide a special flag on
> the clocksource, and it will be attempted to be used for sched_clock().
>
> Incidentally the dual use of a single free-running timer as both
> single clocksource and sched_clock() baseline seem to creep up in
> a lot of embedded platforms...
Sure, but the computations and associated cost around that free running
timer are not the same in both cases.
Nicolas
On Fri, 2010-11-12 at 11:37 +0100, Linus Walleij wrote:
> Incidentally the dual use of a single free-running timer as both
> single clocksource and sched_clock() baseline seem to creep up in
> a lot of embedded platforms...
And really, using the same underlying hardware for both isn't a bad
thing. The main issue is that the product of the clocksource code and
the sched_clock code have different requirements.
1) sched_clock needs to be a fast lightweight ns value, that wraps on
64bit boundaries. Performance is very critical. Transient consistency
errors are acceptable.
2) clocksources care most about correctness, so that the counter is
never mis-read, and that its is accumulated from once every wrap period.
Ideally fine adjustment granularity is desired as well (which
sched_clock doesn't require). Performance is important, but secondary to
correctness.
So there isn't a problem with the same hardware being used for both, but
there is an issue trying to base one implementation off of the other.
I do think a generic sched_clock implementation could be useful,
specifically if it added an accumulation hook so we can more easily and
generically deal with using 32bit counters. I don't think it should try
to re-use the clocksource infrastructure code, as I think that may
stretch that infrastructure too far (ie: having to deal with the
conflicting assumptions of both interfaces).
But having arches register a similar read() and cyc2ns conversion hook
(or freq values to generate such conversion) would probably be a good
approach.
That said, I don't have the space on my plate to look at this in the
near term, but I'd be happy to review code.
thanks
-john