When the system goes out to lunch for a long time, the clocksource
watchdog might get false positives. Clarify the warning so that
people stop blaming their system freezes on the timing code.
This change was Thomas Gleixner's suggestion.
Signed-off-by: Andy Lutomirski <[email protected]>
---
I've only compile-tested on 2.6.36, but it applies cleanly to Linus' tree
and it's rather trivial.
kernel/time/clocksource.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..5b30aa2 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs)
static void clocksource_unstable(struct clocksource *cs, int64_t delta)
{
- printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n",
- cs->name, delta);
+ printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n",
+ cs->name, delta,
+ delta < -5000000000LL ?
+ " or your system lagged for other reasons" : "");
__clocksource_unstable(cs);
}
--
1.7.3.2
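For illustration, with the patch applied a long stall would produce a line like
the following (the clocksource name and delta here are made up; the wording
comes straight from the format string above):

Clocksource tsc unstable (delta = -6000000000 ns) or your system lagged for other reasons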
On Wed, 10 Nov 2010, Andy Lutomirski wrote:
> When the system goes out to lunch for a long time, the clocksource
> watchdog might get false positives. Clarify the warning so that
> people stop blaming their system freezes on the timing code.
>
> This change was Thomas Gleixner's suggestion.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> I've only compile-tested on 2.6.36, but it applies cleanly to Linus' tree
> and it's rather trivial.
>
> kernel/time/clocksource.c | 6 ++++--
> 1 files changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index c18d7ef..5b30aa2 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs)
>
> static void clocksource_unstable(struct clocksource *cs, int64_t delta)
> {
> - printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n",
> - cs->name, delta);
> + printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n",
> + cs->name, delta,
> + delta < -5000000000LL ?
I'm ok with that change, but we should not hardcode the
delta. Instead we should look at the wrap-around time of the
clocksource which we use for reference.
> + " or your system lagged for other reasons" : "");
> __clocksource_unstable(cs);
> }
Thanks,
tglx
When the system goes out to lunch for a long time, the clocksource
watchdog might get false positives. Clarify the warning so that
people stop blaming their system freezes on the timing code.
This change was Thomas Gleixner's suggestion.
Signed-off-by: Andy Lutomirski <[email protected]>
---
This version compares directly (and tediously to avoid overflow) with
max_idle_ns.
kernel/time/clocksource.c | 6 ++++--
1 files changed, 4 insertions(+), 2 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..f8cef8b 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -215,8 +215,10 @@ static void __clocksource_unstable(struct clocksource *cs)
static void clocksource_unstable(struct clocksource *cs, int64_t delta)
{
- printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)\n",
- cs->name, delta);
+ printk(KERN_WARNING "Clocksource %s unstable (delta = %Ld ns)%s\n",
+ cs->name, delta,
+ (delta < 0 && (u64)-delta > watchdog->max_idle_ns) ?
+ " or your system lagged for other reasons" : "");
__clocksource_unstable(cs);
}
--
1.7.3.2
On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <[email protected]> wrote:
> When the system goes out to lunch for a long time, the clocksource
> watchdog might get false positives. Clarify the warning so that
> people stop blaming their system freezes on the timing code.
So this issue has also crept up at times in the -rt kernel, where some
high priority rt task hogs the cpu long enough for the watchdog
clocksource to wrap, and then the TSC seems to have sped up relative
to the watchdog when the system returns to a normal state and causes a
false positive, so the TSC gets kicked out.
So the watchdog heuristic is rough, since it depends on three things:
1) The watchdog clocksource is actually trustworthy
2) The watchdog actually runs near to when it expects to run.
3) When reading the different clocksources, any interruptions when the
watchdog runs are small enough to be handled by the allowed margin
of error.
And in these cases we're breaking #2.
So I had a few ideas to improve the heuristic somewhat (and heuristic
refinement is always ugly). So first of all, the whole reason for the
watchdog code is the TSC. Other platforms may want to make use of it,
but none do at this moment.
The majority of TSC issues are caused when the TSC halts in idle
(probably most common these days) or slows down due to cpufreq
scaling.
When we get these false positives, it's because the watchdog check
interval is too long and the watchdog hardware has wrapped. This makes
it appear that the TSC is running too *fast*. So we could try to be
more skeptical of watchdog failures where the TSC appears to have sped
up, rather than the more common real issue of it slowing down.
Now the one case where this could be an actual positive is if you were on an
old APM based laptop, and booted on battery power and the cpus started
at their slow freq. Then you plugged in the AC and the cpus jumped to
the faster rate. So while I suspect this is a rare case these days (on
ACPI based hardware the kernel has more say in cpufreq transitions, so
I believe we are unlikely to calculate tsc_khz while the cpus are in
low power mode), we still need to address this.
Ideas:
1) Maybe we should check that we get two sequential failures where the
cpu seems fast before we throw out the TSC? This will still fall over
in some stall cases (ie: a poor rt task hogging the cpu for 10
minutes, pausing for a 10th of a second and then continuing to hog the
cpu).
2) We could look at the TSC delta, and if it appears outside the order
of 2-10x faster (I don't think any cpus scale up even close to 10x in
freq, but please correct me if so), then assume we just have been
blocked from running and don't throw out the TSC.
3) Similar to #2 we could look at the max interval that the watchdog
clocksource provides, and if the TSC delta is greater than that, avoid
throwing things out. This combined with #2 might narrow out the false
positives fairly well.
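For concreteness, a minimal standalone sketch of the cutoff in idea #2 (the
helper name and the 10x ratio are made up here, not kernel code):

#include <stdbool.h>
#include <stdint.h>

/* Idea #2 sketch: a cpufreq mishap can only make the TSC look a few times
 * faster than the watchdog, so a wildly larger interval more likely means
 * the watchdog was delayed or wrapped. The 10x cutoff is a guess. */
static bool tsc_interval_implausibly_large(int64_t cs_nsec, int64_t wd_nsec)
{
	return wd_nsec > 0 && cs_nsec > 10 * wd_nsec;
}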
Any additional thoughts here?
thanks
-john
On Fri, 2010-11-12 at 13:31 -0800, john stultz wrote:
> On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <[email protected]> wrote:
> > When the system goes out to lunch for a long time, the clocksource
> > watchdog might get false positives. Clarify the warning so that
> > people stop blaming their system freezes on the timing code.
>
> So this issue has also crept up at times in the -rt kernel, where some
> high priority rt task hogs the cpu long enough for the watchdog
> clocksource to wrap, and then the TSC seems to have sped up relative
> to the watchdog when the system returns to a normal state and causes a
> false positive, so the TSC gets kicked out.
>
> So the watchdog heuristic is rough, since it depends on three things:
> 1) The watchdog clocksource is actually trustworthy
> 2) The watchdog actually runs near to when it expects to run.
> 3) When reading the different clocksources, any interruptions when the
> watchdog runs are small enough to be handled by the allowed margin
> of error.
>
> And in these cases we're breaking #2.
>
> So I had a few ideas to improve the heuristic somewhat (and heuristic
> refinement is always ugly). So first of all, the whole reason for the
> watchdog code is the TSC. Other platforms may want to make use of it,
> but none do at this moment.
>
> The majority of TSC issues are caused when the TSC halts in idle
> (probably most common these days) or slows down due to cpufreq
> scaling.
>
> When we get these false positives, its because the watchdog check
> interval is too long and the watchdog hardware has wrapped. This makes
> it appear that the TSC is running too *fast*. So we could try to be
> more skeptical of watchdog failures where the TSC appears to have sped
> up, rather than the more common real issue of it slowing down.
>
> Now the one case where this could be an actual positive is if you were on an
> old APM based laptop, and booted on battery power and the cpus started
> at their slow freq. Then you plugged in the AC and the cpus jumped to
> the faster rate. So while I suspect this is a rare case these days (on
> ACPI based hardware the kernel has more say in cpufreq transitions, so
> I believe we are unlikely to calculate tsc_khz while the cpus are in
> low power mode), we still need to address this.
>
> Ideas:
> 1) Maybe we should check that we get two sequential failures where the
> cpu seems fast before we throw out the TSC? This will still fall over
> in some stall cases (ie: a poor rt task hogging the cpu for 10
> minutes, pausing for a 10th of a second and then continuing to hog the
> cpu).
>
> 2) We could look at the TSC delta, and if it appears outside the order
> of 2-10x faster (I don't think any cpus scale up even close to 10x in
> freq, but please correct me if so), then assume we just have been
> blocked from running and don't throw out the TSC.
>
> 3) Similar to #2 we could look at the max interval that the watchdog
> clocksource provides, and if the TSC delta is greater than that, avoid
> throwing things out. This combined with #2 might narrow out the false
> positives fairly well.
>
> Any additional thoughts here?
Here's a rough and untested attempt at adding just #3.
thanks
-john
Improve watchdog heuristics to avoid false positives caused by the
watchdog being delayed.
Signed-off-by: John Stultz <[email protected]>
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..d63f6f8 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -247,6 +247,7 @@ static void clocksource_watchdog(unsigned long data)
struct clocksource *cs;
cycle_t csnow, wdnow;
int64_t wd_nsec, cs_nsec;
+ int64_t wd_max_nsec;
int next_cpu;
spin_lock(&watchdog_lock);
@@ -257,6 +258,8 @@ static void clocksource_watchdog(unsigned long data)
wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask,
watchdog->mult, watchdog->shift);
watchdog_last = wdnow;
+ wd_max_nsec = clocksource_cyc2ns(watchdog->mask, watchdog->mult,
+ watchdog->shift);
list_for_each_entry(cs, &watchdog_list, wd_list) {
@@ -280,6 +283,14 @@ static void clocksource_watchdog(unsigned long data)
cs_nsec = clocksource_cyc2ns((csnow - cs->wd_last) &
cs->mask, cs->mult, cs->shift);
cs->wd_last = csnow;
+
+ /* Check to make sure the watchdog wasn't delayed */
+ if (cs_nsec > wd_max_nsec) {
+ WARN_ONCE(1,
+ "clocksource_watchdog: watchdog delayed?\n");
+ continue;
+ }
+
if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
clocksource_unstable(cs, cs_nsec - wd_nsec);
continue;
On Fri, Nov 12, 2010 at 4:31 PM, john stultz <[email protected]> wrote:
> On Wed, Nov 10, 2010 at 2:16 PM, Andy Lutomirski <[email protected]> wrote:
>> When the system goes out to lunch for a long time, the clocksource
>> watchdog might get false positives. Clarify the warning so that
>> people stop blaming their system freezes on the timing code.
>
> So this issue has also crept up at times in the -rt kernel, where some
> high priority rt task hogs the cpu long enough for the watchdog
> clocksource to wrap, and then the TSC seems to have sped up relative
> to the watchdog when the system returns to a normal state and causes a
> false positive, so the TSC gets kicked out.
>
> So the watchdog heuristic is rough, since it depends on three things:
> 1) The watchdog clocksource is actually trustworthy
> 2) The watchdog actually runs near to when it expects to run.
> 3) When reading the different clocksources, any interruptions when the
> watchdog runs are small enough to be handled by the allowed margin
> of error.
>
> And in these cases we're breaking #2.
>
> So I had a few ideas to improve the heuristic somewhat (and heuristic
> refinement is always ugly). So first of all, the whole reason for the
> watchdog code is the TSC. Other platforms may want to make use of it,
> but none do at this moment.
>
> The majority of TSC issues are caused when the TSC halts in idle
> (probably most common these days) or slows down due to cpufreq
> scaling.
>
> When we get these false positives, it's because the watchdog check
> interval is too long and the watchdog hardware has wrapped. This makes
> it appear that the TSC is running too *fast*. So we could try to be
> more skeptical of watchdog failures where the TSC appears to have sped
> up, rather than the more common real issue of it slowing down.
>
> Now the one case where this could be an actual positive is if you were on an
> old APM based laptop, and booted on battery power and the cpus started
> at their slow freq. Then you plugged in the AC and the cpus jumped to
> the faster rate. So while I suspect this is a rare case these days (on
> ACPI based hardware the kernel has more say in cpufreq transitions, so
> I believe we are unlikely to calculate tsc_khz while the cpus are in
> low power mode), we still need to address this.
>
> Ideas:
> 1) Maybe we should check that we get two sequential failures where the
> cpu seems fast before we throw out the TSC? This will still fall over
> in some stall cases (ie: a poor rt task hogging the cpu for 10
> minutes, pausing for a 10th of a second and then continuing to hog the
> cpu).
>
> 2) We could look at the TSC delta, and if it appears outside the order
> of 2-10x faster (I don't think any cpus scale up even close to 10x in
> freq, but please correct me if so), then assume we just have been
> blocked from running and don't throw out the TSC.
>
> 3) Similar to #2 we could look at the max interval that the watchdog
> clocksource provides, and if the TSC delta is greater than that, avoid
> throwing things out. This combined with #2 might narrow out the false
> positives fairly well.
>
> Any additional thoughts here?
Yes. As far as I know, the watchdog doesn't give arbitrary values
when it wraps; it just wraps. Here's a possible heuristic, in
pseudocode:
wd_now_1 = (read watchdog)
cs_now = (read clocksource)
cs_elapsed = cs_now - cs_last;
wd_elapsed = wd_now_1 - wd_last;
if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA)
return; // We're OK.
wd_now_2 = (read watchdog again)
if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
bail; // The clocksource might be unstable, but we either just
lagged or the watchdog is unstable, and in either case we don't gain
anything by marking the clocksource unstable.
if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
wd_wrapping_time ) < (something fairly small) )
bail; // The watchdog most likely wrapped.
mark_unstable;
The idea is that there are three obvious causes for false positives.
1. Something caused the watchdog itself to lag. In this case, the
second watchdog read will differ from the first one by a lot and we'll
bail.
2. The TSC wrapped in the time between two watchdog checks. I don't
think that this is very likely on x86 CPUs. (Isn't the TSC always 64
bits? By the time that a 20GHz CPU wraps a 64-bit TSC, you've
probably already replaced the computer.)
3. The watchdog wrapped in the time between two watchdog checks. In
this case, unless something else went wrong simultaneously, we'll
detect it.
On the other hand, the chance that the TSC was off by almost
exactly a multiple of the watchdog wrapping time is pretty slim.
If we're using a non-TSC clocksource, then you could get false
positives if that clocksource wraps, but maybe the watchdog is
unnecessary in that case.
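Rendered as a self-contained C function, assuming all values are in
nanoseconds and MAX_DELTA is on the order of the watchdog threshold of the
era (NSEC_PER_SEC >> 4, i.e. 62.5 ms); a sketch of the pseudocode above, not
kernel code:

#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define MAX_DELTA 62500000LL	/* ~62.5 ms, assumed threshold */

/* Returns true if the clocksource should be marked unstable, false if one
 * of the three bail conditions above applies. */
static bool should_mark_unstable(int64_t cs_elapsed, int64_t wd_elapsed,
				 int64_t wd_now_1, int64_t wd_now_2,
				 int64_t wd_wrapping_time)
{
	if (llabs(wd_elapsed - cs_elapsed) < MAX_DELTA)
		return false;		/* we're OK */
	if (llabs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
		return false;		/* the watchdog itself lagged */
	if (wd_elapsed < cs_elapsed &&
	    (cs_elapsed - wd_elapsed) % wd_wrapping_time < MAX_DELTA)
		return false;		/* the watchdog most likely wrapped */
	return true;			/* mark unstable */
}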
--Andy
On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote:
> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <[email protected]> wrote:
> > Ideas:
> > 1) Maybe should we check that we get two sequential failures where the
> > cpu seems fast before we throw out the TSC? This will still fall over
> > in some stall cases (ie: a poor rt task hogging the cpu for 10
> > minutes, pausing for a 10th of a second and then continuing to hog the
> > cpu).
> >
> > 2) We could look at the TSC delta, and if it appears outside the order
> > of 2-10x faster (i don't think any cpus scale up even close to 10x in
> > freq, but please correct me if so), then assume we just have been
> > blocked from running and don't throw out the TSC.
> >
> > 3) Similar to #2 we could look at the max interval that the watchdog
> > clocksource provides, and if the TSC delta is greater then that, avoid
> > throwing things out. This combined with #2 might narrow out the false
> > positives fairly well.
> >
> > Any additional thoughts here?
>
> Yes. As far as I know, the watchdog doesn't give arbitrary values
> when it wraps; it just wraps. Here's a possible heuristic, in
> pseudocode:
>
> wd_now_1 = (read watchdog)
> cs_now = (read clocksource)
>
> cs_elapsed = cs_now - cs_last;
> wd_elapsed = wd_now_1 - wd_last;
>
> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA)
> return; // We're OK.
>
> wd_now_2 = (read watchdog again)
> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
> bail; // The clocksource might be unstable, but we either just
> lagged or the watchdog is unstable, and in either case we don't gain
> anything by marking the clocksource unstable.
This is more easily done by just bounding the clocksource read:
wd_now_1 = watchdog->read()
cs_now = clocksource->read()
wd_now_2 = watchdog->read()
if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL)
bail; // hit an SMI or some sort of long preemption
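(As a standalone sketch, with SOMETHING_SMALL and the read hooks as
placeholders rather than the real clocksource API:)

#include <stdbool.h>
#include <stdint.h>

/* Bounded read: if an SMI or a long preemption lands between the two
 * watchdog reads, their masked delta will be large and the sample is
 * discarded rather than used to condemn the clocksource. */
static bool watchdog_sample(uint64_t (*wd_read)(void),
			    uint64_t (*cs_read)(void),
			    uint64_t wd_mask, uint64_t something_small,
			    uint64_t *wd_now, uint64_t *cs_now)
{
	uint64_t wd_now_1 = wd_read();
	uint64_t wd_now_2;

	*cs_now = cs_read();
	wd_now_2 = wd_read();
	*wd_now = wd_now_1;
	return ((wd_now_2 - wd_now_1) & wd_mask) <= something_small;
}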
> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
> wd_wrapping_time ) < (something fairly small) )
> bail; // The watchdog most likely wrapped.
Huh. The modulo bit may need tweaking as it's not immediately clear it's
right. Maybe the following is clearer?:
if ((cs_elapsed > wd_wrapping_time)
&& (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA))
// should be ok.
> The idea is that there are three obvious causes for false positives.
>
> 1. Something caused the watchdog itself to lag. In this case, the
> second watchdog read will differ from the first one by a lot and we'll
> bail.
>
> 2. The TSC wrapped in the time between two watchdog checks. I don't
> think that this is very likely on x86 CPUs. (Isn't the TSC always 64
> bits? By the time that a 20GHz CPU wraps a 64-bit TSC, you've
> probably already replaced the computer.)
Yea, current TSCs won't wrap in our lifetime. And even if it ever does
get close, it will have to wrap between watchdog checks to be an issue
(since two's-complement math makes the subtraction work even if it does
roll over).
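(For example, with a 32-bit counter the unsigned subtraction gives the right
delta across a wrap; a minimal standalone illustration:)

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t last = 0xfffffff0u;	/* read just before the wrap */
	uint32_t now  = 0x00000010u;	/* read just after the wrap  */

	/* Unsigned (two's-complement) subtraction wraps modulo 2^32,
	 * so the delta comes out right: prints 32. */
	printf("delta = %u\n", now - last);
	return 0;
}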
> 3. The watchdog wrapped in the time between two watchdog checks. In
> this case, unless something else went wrong simultaneously, we'll
> detect it.
>
> On the other hand, the chance that the TSC was off by almost
> exactly a multiple of the watchdog wrapping time is pretty slim.
>
>
> If we're using a non-TSC clocksource, then you could get false
> positives if that clocksource wraps, but maybe the watchdog is
> unnecessary in that case.
Yea. Some similar extra modulo conditions could probably handle that
too. But again, the only clocksource that needs the watchdog is the TSC.
thanks
-john
On Fri, Nov 12, 2010 at 6:40 PM, john stultz <[email protected]> wrote:
> On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote:
>> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <[email protected]> wrote:
>> > Ideas:
>> > 1) Maybe we should check that we get two sequential failures where the
>> > cpu seems fast before we throw out the TSC? This will still fall over
>> > in some stall cases (ie: a poor rt task hogging the cpu for 10
>> > minutes, pausing for a 10th of a second and then continuing to hog the
>> > cpu).
>> >
>> > 2) We could look at the TSC delta, and if it appears outside the order
>> > of 2-10x faster (I don't think any cpus scale up even close to 10x in
>> > freq, but please correct me if so), then assume we just have been
>> > blocked from running and don't throw out the TSC.
>> >
>> > 3) Similar to #2 we could look at the max interval that the watchdog
>> > clocksource provides, and if the TSC delta is greater than that, avoid
>> > throwing things out. This combined with #2 might narrow out the false
>> > positives fairly well.
>> >
>> > Any additional thoughts here?
>>
>> Yes. As far as I know, the watchdog doesn't give arbitrary values
>> when it wraps; it just wraps. Here's a possible heuristic, in
>> pseudocode:
>>
>> wd_now_1 = (read watchdog)
>> cs_now = (read clocksource)
>>
>> cs_elapsed = cs_now - cs_last;
>> wd_elapsed = wd_now_1 - wd_last;
>>
>> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA)
>> return; // We're OK.
>>
>> wd_now_2 = (read watchdog again)
>> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
>> bail; // The clocksource might be unstable, but we either just
>> lagged or the watchdog is unstable, and in either case we don't gain
>> anything by marking the clocksource unstable.
>
> This is more easily done by just bounding the clocksource read:
> wd_now_1 = watchdog->read()
> cs_now = clocksource->read()
> wd_now_2 = watchdog->read()
>
> if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL)
> bail; // hit an SMI or some sort of long preemption
>
>> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
>> wd_wrapping_time ) < (something fairly small) )
>> bail; // The watchdog most likely wrapped.
>
> Huh. The modulo bit may need tweaking as it's not immediately clear it's
> right. Maybe the following is clearer?:
>
> if ((cs_elapsed > wd_wrapping_time)
> && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA))
> // should be ok.
I think this is wrong if wd_elapsed is large (which could happen if
the real wd time is something like (2 * wd_wrapping_time -
MAX_DELTA/4)).
--Andy
On Fri, Nov 12, 2010 at 6:48 PM, Andrew Lutomirski <[email protected]> wrote:
> On Fri, Nov 12, 2010 at 6:40 PM, john stultz <[email protected]> wrote:
>> On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote:
>>> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <[email protected]> wrote:
>>> > Ideas:
>>> > 1) Maybe we should check that we get two sequential failures where the
>>> > cpu seems fast before we throw out the TSC? This will still fall over
>>> > in some stall cases (ie: a poor rt task hogging the cpu for 10
>>> > minutes, pausing for a 10th of a second and then continuing to hog the
>>> > cpu).
>>> >
>>> > 2) We could look at the TSC delta, and if it appears outside the order
>>> > of 2-10x faster (I don't think any cpus scale up even close to 10x in
>>> > freq, but please correct me if so), then assume we just have been
>>> > blocked from running and don't throw out the TSC.
>>> >
>>> > 3) Similar to #2 we could look at the max interval that the watchdog
>>> > clocksource provides, and if the TSC delta is greater than that, avoid
>>> > throwing things out. This combined with #2 might narrow out the false
>>> > positives fairly well.
>>> >
>>> > Any additional thoughts here?
>>>
>>> Yes. As far as I know, the watchdog doesn't give arbitrary values
>>> when it wraps; it just wraps. Here's a possible heuristic, in
>>> pseudocode:
>>>
>>> wd_now_1 = (read watchdog)
>>> cs_now = (read clocksource)
>>>
>>> cs_elapsed = cs_now - cs_last;
>>> wd_elapsed = wd_now_1 - wd_last;
>>>
>>> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA)
>>> return; // We're OK.
>>>
>>> wd_now_2 = (read watchdog again)
>>> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
>>> bail; // The clocksource might be unstable, but we either just
>>> lagged or the watchdog is unstable, and in either case we don't gain
>>> anything by marking the clocksource unstable.
>>
>> This is more easily done by just bounding the clocksource read:
>> wd_now_1 = watchdog->read()
>> cs_now = clocksource->read()
>> wd_now_2 = watchdog->read()
>>
>> if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL)
>> bail; // hit an SMI or some sort of long preemption
>>
>>> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
>>> wd_wrapping_time ) < (something fairly small) )
>>> bail; // The watchdog most likely wrapped.
>>
>> Huh. The modulo bit may need tweaking as it's not immediately clear it's
>> right. Maybe the following is clearer?:
>>
>> if ((cs_elapsed > wd_wrapping_time)
>> && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA))
>> // should be ok.
>
> I think this is wrong if wd_elapsed is large (which could happen if
> the real wd time is something like (2 * wd_wrapping_time -
> MAX_DELTA/4)).
Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
but the wd clocksource runs enough faster that it wrapped.
--Andy
>
> --Andy
>
On Fri, 2010-11-12 at 18:48 -0500, Andrew Lutomirski wrote:
> On Fri, Nov 12, 2010 at 6:40 PM, john stultz <[email protected]> wrote:
> > On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote:
> >
> >> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
> >> wd_wrapping_time ) < (something fairly small) )
> >> bail; // The watchdog most likely wrapped.
> >
> > Huh. The modulo bit may need tweaking as it's not immediately clear it's
> > right. Maybe the following is clearer?:
> >
> > if ((cs_elapsed > wd_wrapping_time)
> > && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA))
> > // should be ok.
>
> I think this is wrong if wd_elapsed is large (which could happen if
> the real wd time is something like (2 * wd_wrapping_time -
> MAX_DELTA/4)).
But wd_elapsed can never be larger than wd_wrapping_time, right?
Or am I still missing something?
thanks
-john
On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
> On Fri, Nov 12, 2010 at 6:48 PM, Andrew Lutomirski <[email protected]> wrote:
> > On Fri, Nov 12, 2010 at 6:40 PM, john stultz <[email protected]> wrote:
> >> On Fri, 2010-11-12 at 16:52 -0500, Andrew Lutomirski wrote:
> >>> On Fri, Nov 12, 2010 at 4:31 PM, john stultz <[email protected]> wrote:
> >>> > Ideas:
> >>> > 1) Maybe we should check that we get two sequential failures where the
> >>> > cpu seems fast before we throw out the TSC? This will still fall over
> >>> > in some stall cases (ie: a poor rt task hogging the cpu for 10
> >>> > minutes, pausing for a 10th of a second and then continuing to hog the
> >>> > cpu).
> >>> >
> >>> > 2) We could look at the TSC delta, and if it appears outside the order
> >>> > of 2-10x faster (I don't think any cpus scale up even close to 10x in
> >>> > freq, but please correct me if so), then assume we just have been
> >>> > blocked from running and don't throw out the TSC.
> >>> >
> >>> > 3) Similar to #2 we could look at the max interval that the watchdog
> >>> > clocksource provides, and if the TSC delta is greater than that, avoid
> >>> > throwing things out. This combined with #2 might narrow out the false
> >>> > positives fairly well.
> >>> >
> >>> > Any additional thoughts here?
> >>>
> >>> Yes. As far as I know, the watchdog doesn't give arbitrary values
> >>> when it wraps; it just wraps. Here's a possible heuristic, in
> >>> pseudocode:
> >>>
> >>> wd_now_1 = (read watchdog)
> >>> cs_now = (read clocksource)
> >>>
> >>> cs_elapsed = cs_now - cs_last;
> >>> wd_elapsed = wd_now_1 - wd_last;
> >>>
> >>> if ( abs(wd_elapsed - cs_elapsed) < MAX_DELTA)
> >>> return; // We're OK.
> >>>
> >>> wd_now_2 = (read watchdog again)
> >>> if (abs(wd_now_1 - wd_now_2) > MAX_DELTA / 2)
> >>> bail; // The clocksource might be unstable, but we either just
> >>> lagged or the watchdog is unstable, and in either case we don't gain
> >>> anything by marking the clocksource unstable.
> >>
> >> This is more easily done by just bounding the clocksource read:
> >> wd_now_1 = watchdog->read()
> >> cs_now = clocksource->read()
> >> wd_now_2 = watchdog->read()
> >>
> >> if (((wd_now_2 - wd_now_1)&watchdog->mask) > SOMETHING_SMALL)
> >> bail; // hit an SMI or some sort of long preemption
> >>
> >>> if ( wd_elapsed < cs_elapsed and ( (cs_elapsed - wd_elapsed) %
> >>> wd_wrapping_time ) < (something fairly small) )
> >>> bail; // The watchdog most likely wrapped.
> >>
> >> Huh. The modulo bit may need tweaking as it's not immediately clear it's
> >> right. Maybe the following is clearer?:
> >>
> >> if ((cs_elapsed > wd_wrapping_time)
> >> && (abs((cs_elapsed % wd_wrapping_time)-wd_elapsed) < MAX_DELTA))
> >> // should be ok.
> >
> > I think this is wrong if wd_elapsed is large (which could happen if
> > the real wd time is something like (2 * wd_wrapping_time -
> > MAX_DELTA/4)).
>
> Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
> but the wd clocksource runs enough faster that it wrapped.
Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)
And your implementation above hits the same issue... hrm.
I want to say we can get away with it using the same two's-complement
masking trick we use for subtracting around an overflow to calculate
cycle deltas, but I'm not confident it will work when we've moved from
clean bit field masks to nanosecond intervals.
I think I'll have to revisit this on Monday.
Thanks again for pointing out this think-o.
-john
On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote:
> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
> > but the wd clocksource runs enough faster that it wrapped.
>
> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)
I have a hard time leaving things alone. :)
So this still has the issue that u64%u64 won't work on 32bit systems,
but I think once I rework the modulo bit the following should be what
you were describing.
It is ugly, so let me know if you have a cleaner way.
Signed-off-by: John Stultz <[email protected]>
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..0afbcb5 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -247,6 +247,7 @@ static void clocksource_watchdog(unsigned long data)
struct clocksource *cs;
cycle_t csnow, wdnow;
int64_t wd_nsec, cs_nsec;
+ int64_t wd_max_nsec;
int next_cpu;
spin_lock(&watchdog_lock);
@@ -257,6 +258,8 @@ static void clocksource_watchdog(unsigned long data)
wd_nsec = clocksource_cyc2ns((wdnow - watchdog_last) & watchdog->mask,
watchdog->mult, watchdog->shift);
watchdog_last = wdnow;
+ wd_max_nsec = clocksource_cyc2ns(watchdog->mask, watchdog->mult,
+ watchdog->shift);
list_for_each_entry(cs, &watchdog_list, wd_list) {
@@ -280,6 +283,35 @@ static void clocksource_watchdog(unsigned long data)
cs_nsec = clocksource_cyc2ns((csnow - cs->wd_last) &
cs->mask, cs->mult, cs->shift);
cs->wd_last = csnow;
+
+ /*
+ * If the cs interval is greater than the wd wrap interval,
+ * we were somehow delayed. So check that modulo the wrap
+ * interval, the cs interval is close to the watchdog
+ * interval value.
+ */
+ if (cs_nsec > wd_max_nsec) {
+ int64_t cs_modulo = cs_nsec % wd_max_nsec;
+
+ /* first check the easy case */
+ if (abs(cs_modulo - wd_nsec) < WATCHDOG_THRESHOLD) {
+ WARN_ONCE(1, "clocksource_watchdog: "
+ "watchdog delayed?\n");
+ continue;
+ }
+ /* could be split along the wd_max_nsec boundary */
+ if (wd_nsec < cs_modulo)
+ wd_nsec += wd_max_nsec;
+ else
+ cs_modulo += wd_max_nsec;
+ /* check again */
+ if (abs(cs_modulo - wd_nsec) < WATCHDOG_THRESHOLD) {
+ WARN_ONCE(1, "clocksource_watchdog: "
+ "watchdog delayed?\n");
+ continue;
+ }
+ }
+
if (abs(cs_nsec - wd_nsec) > WATCHDOG_THRESHOLD) {
clocksource_unstable(cs, cs_nsec - wd_nsec);
continue;
On Fri, Nov 12, 2010 at 7:58 PM, john stultz <[email protected]> wrote:
> On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote:
>> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
>> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
>> > but the wd clocksource runs enough faster that it wrapped.
>>
>> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)
>
> I have a hard time leaving things alone. :)
>
> So this still has the issue that u64%u64 won't work on 32bit systems,
> but I think once I rework the modulo bit the following should be what
> you were describing.
>
> It is ugly, so let me know if you have a cleaner way.
>
I'm playing with this stuff now, and it looks like my (invariant,
constant, single-package i7) TSC has a max_idle_ns of just over 3
seconds. I'm confused.
--Andy
On Tue, 2010-11-16 at 19:05 -0500, Andrew Lutomirski wrote:
> On Fri, Nov 12, 2010 at 7:58 PM, john stultz <[email protected]> wrote:
> > On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote:
> >> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
> >> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
> >> > but the wd clocksource runs enough faster that it wrapped.
> >>
> >> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)
> >
> > I have a hard time leaving things alone. :)
> >
> > So this still has the issue that u64%u64 won't work on 32bit systems,
> > but I think once I rework the modulo bit the following should be what
> > you were describing.
> >
> > It is ugly, so let me know if you have a cleaner way.
> >
>
> I'm playing with this stuff now, and it looks like my (invariant,
> constant, single-package i7) TSC has a max_idle_ns of just over 3
> seconds. I'm confused.
Yea. I hit this wall the other day as well. So my patch is invalid
because its assuming the TSC deltas will be large, but for any
unreasonable delay, we'll actually end up with multiply overflows,
causing the tsc ns interval to be invalid as well.
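To see where the ~3 second figure comes from (assuming a TSC shift of 32,
plausible for the mult/shift pairs of the time but an assumption here):
cyc2ns computes (cycles * mult) >> shift in 64 bits, so it is only valid
while cycles * mult < 2^64, i.e. for roughly 2^64 >> shift nanoseconds no
matter what the TSC frequency is:

#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint32_t shift = 32;		/* assumed TSC shift */
	uint64_t max_ns = UINT64_MAX >> shift;

	/* ~4294967295 ns, i.e. ~4.29 s; the kernel's safety margin then
	 * knocks this down to the ~3.x s max_idle_ns observed above. */
	printf("%llu ns (~%.2f s)\n",
	       (unsigned long long)max_ns, (double)max_ns / 1e9);
	return 0;
}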
I'm starting to think we should be pushing the watchdog check into the
timekeeping accumulation loop (or have it hang off of the accumulation
loop).
1) The clocksource cyc2ns conversion code is built with assumptions
linked to how frequently we accumulate time via update_wall_time().
2) update_wall_time() happens in timer irq context, so we don't have to
worry about being delayed. If an irq storm or something does actually
cause the timer irq to be delayed, we have bigger issues.
The only trouble with this is that if we actually push the max_idle_ns
out to something like 10 seconds on the TSC, we could end up having the
watchdog clocksource wrapping while we're in nohz idle. So that could
be ugly. Maybe if the current clocksource needs the watchdog
observations, we should cap the max_idle_ns to the smaller of the
current clocksource and the watchdog clocksource.
Oof. It's just ugly. If I can get some time this week I'll try to take a
rough swing at refactoring that code.
thanks
-john
On Tue, Nov 16, 2010 at 7:26 PM, john stultz <[email protected]> wrote:
> On Tue, 2010-11-16 at 19:05 -0500, Andrew Lutomirski wrote:
>> On Fri, Nov 12, 2010 at 7:58 PM, john stultz <[email protected]> wrote:
>> > On Sat, 2010-11-13 at 00:22 +0000, john stultz wrote:
>> >> On Fri, 2010-11-12 at 18:51 -0500, Andrew Lutomirski wrote:
>> >> > Also wrong if cs_elapsed is just slightly less than wd_wrapping_time
>> >> > but the wd clocksource runs enough faster that it wrapped.
>> >>
>> >> Ok. Good point, that's a problem. Hrmmmm. Too much math for Friday. :)
>> >
>> > I have a hard time leaving things alone. :)
>> >
>> > So this still has the issue that u64%u64 won't work on 32bit systems,
>> > but I think once I rework the modulo bit the following should be what
>> > you were describing.
>> >
>> > It is ugly, so let me know if you have a cleaner way.
>> >
>>
>> I'm playing with this stuff now, and it looks like my (invariant,
>> constant, single-package i7) TSC has a max_idle_ns of just over 3
>> seconds. I'm confused.
>
> Yea. I hit this wall the other day as well. So my patch is invalid
> because it's assuming the TSC deltas will be large, but for any
> unreasonable delay, we'll actually end up with multiply overflows,
> causing the tsc ns interval to be invalid as well.
>
> I'm starting to think we should be pushing the watchdog check into the
> timekeeping accumulation loop (or have it hang off of the accumulation
> loop).
>
> 1) The clocksource cyc2ns conversion code is built with assumptions
> linked to how frequently we accumulate time via update_wall_time().
>
> 2) update_wall_time() happens in timer irq context, so we don't have to
> worry about being delayed. If an irq storm or something does actually
> cause the timer irq to be delayed, we have bigger issues.
That's why I hit this. It would be nice if we didn't respond to irq
storms by calling stop_machine.
>
> The only trouble with this is that if we actually push the max_idle_ns
> out to something like 10 seconds on the TSC, we could end up having the
> watchdog clocksource wrapping while we're in nohz idle. So that could
> be ugly. Maybe if the current clocksource needs the watchdog
> observations, we should cap the max_idle_ns to the smaller of the
> current clocksource and the watchdog clocksource.
>
What would you think about implementing non-overflowing
clocksource_cyc2ns on architectures that can do it efficiently? You'd
have to artificially limit the mask to 2^64 / (rate in GHz), rounded
down to a power of 2, but that shouldn't be a problem for any sensible
clocksource.
x86_64 can do it with one multiply, two shifts, an or, and a subtract
(to figure out the shifts). It should take just a couple cycles
longer than the current code (or maybe the same amount of time,
depending on how good the CPU is at running the whole thing in
parallel).
x86_32 and similar architectures would need two multiplies and one add.
Architectures with only 32x32->32 multiply would need three
multiplies. (They're already presumably doing two multiplies with the
current code, though.)
The benefit would be that sensible clocksources (TSC and 64-bit HPET)
would essentially never overflow and multicore systems could keep most
cores asleep for as long as they liked.
(There's yet another approach: keep the current clocksource_cyc2ns,
but add an exact version and only use it when waking up from a long
sleep.)
--Andy
On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote:
> On Tue, Nov 16, 2010 at 7:26 PM, john stultz <[email protected]> wrote:
> > I'm starting to think we should be pushing the watchdog check into the
> > timekeeping accumulation loop (or have it hang off of the accumulation
> > loop).
> >
> > 1) The clocksource cyc2ns conversion code is built with assumptions
> > linked to how frequently we accumulate time via update_wall_time().
> >
> > 2) update_wall_time() happens in timer irq context, so we don't have to
> > worry about being delayed. If an irq storm or something does actually
> > cause the timer irq to be delayed, we have bigger issues.
>
> That's why I hit this. It would be nice if we didn't respond to irq
> storms by calling stop_machine.
So even if we don't change clocksources, if you have a long enough
interrupt storm that delays the hard timer irq, such that the
clocksources wrap (or hit the mult overflow), your system time will be
lagging behind anyway. So that would be broken regardless of whether the
watchdog kicked in or not.
I suspect that even with such an irq storm, the timer irq will hopefully
be high enough priority to be serviced first, avoiding the accumulation
loss.
> > The only trouble with this, is that if we actually push the max_idle_ns
> > out to something like 10 seconds on the TSC, we could end up having the
> > watchdog clocksource wrapping while we're in nohz idle. So that could
> > be ugly. Maybe if the current clocksource needs the watchdog
> > observations, we should cap the max_idle_ns to the smaller of the
> > current clocksource and the watchdog clocksource.
> >
>
> What would you think about implementing non-overflowing
> clocksource_cyc2ns on architectures that can do it efficiently? You'd
> have to artificially limit the mask to 2^64 / (rate in GHz), rounded
> down to a power of 2, but that shouldn't be a problem for any sensible
> clocksource.
You would run into accuracy issues. The reason we use large
mult/shift pairs for timekeeping is that we need to make very fine
grained adjustments to steer the clock (also just the freq accuracy can
be poor if you use too low a shift value in the cyc2ns conversions).
> The benefit would be that sensible clocksources (TSC and 64-bit HPET)
> would essentially never overflow and multicore systems could keep most
> cores asleep for as long as they liked.
>
> (There's yet another approach: keep the current clocksource_cyc2ns,
> but add an exact version and only use it when waking up from a long
> sleep.)
We're actually running into this very issue in the sched_clock threads
that have been discussed. Namely that the clocksource/timekeeping code
is very intertwined around the assumptions and limits of the freq at
which update_wall_time is called. I'm currently working to consolidate those
assumptions into manageable tunables via clocksource_register_khz/hz
patches so we don't end up with odd cases where some idle limit gets
pushed out and stuff breaks on half of the clocksources out there, but
not the other half.
So yea, there are a couple of approaches here:
It may just be that we should drop the watchdog infrastructure and
either embed that functionality into the TSC code itself (since it is
the only user), allowing more flexible and rough conversion factors with
other watchdog hardware similar to what you suggest, instead of using
the clocksource conversion functions.
Or we can embed the watchdog timer into a function of the timekeeping
accumulation code, the constraints of which the clocksources are
written to.
Neither is really a pretty solution. I suspect the second would end up being
a little cleaner, and would allow other troublesome clocksources to make
use of it in the future. But the first does isolate the issue close to
the problematic hardware.
thanks
-john
On Tue, Nov 16, 2010 at 8:19 PM, john stultz <[email protected]> wrote:
> On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote:
>> On Tue, Nov 16, 2010 at 7:26 PM, john stultz <[email protected]> wrote:
>> > I'm starting to think we should be pushing the watchdog check into the
>> > timekeeping accumulation loop (or have it hang off of the accumulation
>> > loop).
>> >
>> > 1) The clocksource cyc2ns conversion code is built with assumptions
>> > linked to how frequently we accumulate time via update_wall_time().
>> >
>> > 2) update_wall_time() happens in timer irq context, so we don't have to
>> > worry about being delayed. If an irq storm or something does actually
>> > cause the timer irq to be delayed, we have bigger issues.
>>
>> That's why I hit this. It would be nice if we didn't respond to irq
>> storms by calling stop_machine.
>
> So even if we don't change clocksources, if you have a long enough
> interrupt storm that delays the hard timer irq, such that the
> clocksources wrap (or hit the mult overflow), your system time will be
> lagging behind anyway. So that would be broken regardless of whether the
> watchdog kicked in or not.
>
> I suspect that even with such an irq storm, the timer irq will hopefully
> be high enough priority to be serviced first, avoiding the accumulation
> loss.
>
>
>> > The only trouble with this is that if we actually push the max_idle_ns
>> > out to something like 10 seconds on the TSC, we could end up having the
>> > watchdog clocksource wrapping while we're in nohz idle. So that could
>> > be ugly. Maybe if the current clocksource needs the watchdog
>> > observations, we should cap the max_idle_ns to the smaller of the
>> > current clocksource and the watchdog clocksource.
>> >
>>
>> What would you think about implementing non-overflowing
>> clocksource_cyc2ns on architectures that can do it efficiently? You'd
>> have to artificially limit the mask to 2^64 / (rate in GHz), rounded
>> down to a power of 2, but that shouldn't be a problem for any sensible
>> clocksource.
>
> You would run into accuracy issues. The reason we use large
> mult/shift pairs for timekeeping is that we need to make very fine
> grained adjustments to steer the clock (also just the freq accuracy can
> be poor if you use too low a shift value in the cyc2ns conversions).
>
Why would it be any worse than right now? We could keep shift as high
as 32 (or even higher) and use the exact same logic as we use now.
gcc compiles this code:
uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
{
#if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
if (shift >= 32)
__builtin_unreachable();
#endif
return (uint64_t)( ((__uint128_t)a * (__uint128_t)mult) >> shift );
}
To:
0: 89 f0 mov %esi,%eax
2: 89 d1 mov %edx,%ecx
4: 48 f7 e7 mul %rdi
7: 48 0f ad d0 shrd %cl,%rdx,%rax
b: 48 d3 ea shr %cl,%rdx
e: f6 c1 40 test $0x40,%cl
11: 48 0f 45 c2 cmovne %rdx,%rax
15: c3 retq
And if the compiler were a little smarter, it would generate:
mov %esi,%eax
mov %edx,%ecx
mul %rdi
shrd %cl,%rdx,%rax
retq
So it would be essentially free.
--Andy
On Tue, 2010-11-16 at 20:24 -0500, Andrew Lutomirski wrote:
> On Tue, Nov 16, 2010 at 8:19 PM, john stultz <[email protected]> wrote:
> > On Tue, 2010-11-16 at 19:54 -0500, Andrew Lutomirski wrote:
> >> On Tue, Nov 16, 2010 at 7:26 PM, john stultz <[email protected]> wrote:
> >> > I'm starting to think we should be pushing the watchdog check into the
> >> > timekeeping accumulation loop (or have it hang off of the accumulation
> >> > loop).
> >> >
> >> > 1) The clocksource cyc2ns conversion code is built with assumptions
> >> > linked to how frequently we accumulate time via update_wall_time().
> >> >
> >> > 2) update_wall_time() happens in timer irq context, so we don't have to
> >> > worry about being delayed. If an irq storm or something does actually
> >> > cause the timer irq to be delayed, we have bigger issues.
> >>
> >> That's why I hit this. It would be nice if we didn't respond to irq
> >> storms by calling stop_machine.
> >
> > So even if we don't change clocksources, if you have a long enough
> > interrupt storm that delays the hard timer irq, such that the
> > clocksources wrap (or hit the mult overflow), your system time will be
> > lagging behind anyway. So that would be broken regardless of whether the
> > watchdog kicked in or not.
> >
> > I suspect that even with such an irq storm, the timer irq will hopefully
> > be high enough priority to be serviced first, avoiding the accumulation
> > loss.
> >
> >
> >> > The only trouble with this is that if we actually push the max_idle_ns
> >> > out to something like 10 seconds on the TSC, we could end up having the
> >> > watchdog clocksource wrapping while we're in nohz idle. So that could
> >> > be ugly. Maybe if the current clocksource needs the watchdog
> >> > observations, we should cap the max_idle_ns to the smaller of the
> >> > current clocksource and the watchdog clocksource.
> >> >
> >>
> >> What would you think about implementing non-overflowing
> >> clocksource_cyc2ns on architectures that can do it efficiently? You'd
> >> have to artificially limit the mask to 2^64 / (rate in GHz), rounded
> >> down to a power of 2, but that shouldn't be a problem for any sensible
> >> clocksource.
> >
> > You would run into accuracy issues. The reason we use large
> > mult/shift pairs for timekeeping is that we need to make very fine
> > grained adjustments to steer the clock (also just the freq accuracy can
> > be poor if you use too low a shift value in the cyc2ns conversions).
> >
>
> Why would it be any worse than right now? We could keep shift as high
> as 32 (or even higher) and use the exact same logic as we use now.
Oh. My apologies, I thought you were suggesting dropping shift down so
the 64bit mult doesn't overflow, not using a 128 bit mult to just avoid
that issue.
> gcc compiles this code:
>
> uint64_t mul_64_32_shift(uint64_t a, uint32_t mult, uint32_t shift)
> {
> #if __GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 5)
> if (shift >= 32)
> __builtin_unreachable();
> #endif
> return (uint64_t)( ((__uint128_t)a * (__uint128_t)mult) >> shift );
> }
>
> To:
>
> 0: 89 f0 mov %esi,%eax
> 2: 89 d1 mov %edx,%ecx
> 4: 48 f7 e7 mul %rdi
> 7: 48 0f ad d0 shrd %cl,%rdx,%rax
> b: 48 d3 ea shr %cl,%rdx
> e: f6 c1 40 test $0x40,%cl
> 11: 48 0f 45 c2 cmovne %rdx,%rax
> 15: c3 retq
>
> And if the compiler were a little smarter, it would generate:
>
> mov %esi,%eax
> mov %edx,%ecx
> mul %rdi
> shrd %cl,%rdx,%rax
> retq
>
> So it would be essentially free.
So yes, on 64bit systems it won't be so bad, but again, I'm worried a
bit about overhead on 32bit systems, as clocksource_cyc2ns is in the
gettimeofday hot path for quite a lot of applications.
But it is an interesting thought.
And something like the following could avoid the overhead most of the
time.
if (unlikely(delta > cs->max_mult64_cycles))
return cyc2ns128(delta, cs->mult, cs->shift);
return cyc2ns64(delta, cs->mult, cs->shift);
Where we optimize the mult/shift pair for the likely max nohz time
interval, but allow deeper sleeps without problems.
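Fleshed out slightly as a standalone sketch (the struct, the
max_mult64_cycles precomputation, and the helper names are hypothetical, and
the 128-bit path assumes a compiler with __uint128_t, i.e. 64-bit gcc):

#include <stdint.h>

struct cs_sketch {
	uint32_t mult;
	uint32_t shift;
	uint64_t max_mult64_cycles;	/* precompute as UINT64_MAX / mult */
};

/* Fast path: one 64-bit multiply; valid while delta * mult fits in 64 bits. */
static uint64_t cyc2ns64(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (delta * mult) >> shift;
}

/* Slow path: full 128-bit product; cannot overflow for any 64-bit delta. */
static uint64_t cyc2ns128(uint64_t delta, uint32_t mult, uint32_t shift)
{
	return (uint64_t)(((__uint128_t)delta * mult) >> shift);
}

static uint64_t cyc2ns(const struct cs_sketch *cs, uint64_t delta)
{
	if (delta > cs->max_mult64_cycles)	/* rare: very long interval */
		return cyc2ns128(delta, cs->mult, cs->shift);
	return cyc2ns64(delta, cs->mult, cs->shift);
}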
-john