Accumulating one tick at a time works well unless we're using NOHZ. Then
it can be an issue, since we may have to run through the loop a few
thousand times, which can increase the latency caused by the timer
interrupt.
The current solution was to accumulate in half-second intervals with
NOHZ. This kept the number of loop iterations down, but it did slightly
change how we make NTP adjustments. While this is not an issue for NTPd
users, since NTPd makes adjustments over a longer period of time, other
adjtimex() users have noticed the half-second granularity with which we
can apply frequency changes to the clock.
For instance, if an application tries to apply a 100ppm frequency
correction for 20ms to correct a 2us offset, with NOHZ it either gets
no correction at all or a 50us correction.
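To spell out the arithmetic behind those numbers (a standalone userspace
sketch, not kernel code, and the helper name is made up): a frequency
offset of N ppm maintained for T seconds corrects N*T microseconds, so
100ppm over 20ms is 2us, while one whole 500ms accumulation chunk is 50us.

#include <stdio.h>

/* Microseconds corrected by running a 'ppm' frequency adjustment
 * for 'seconds' of accumulated time: ppm * seconds = us.
 */
static double correction_us(double ppm, double seconds)
{
	return ppm * seconds;
}

int main(void)
{
	/* What the application asked for: 100ppm over 20ms. */
	printf("intended:          %4.1f us\n", correction_us(100, 0.020));

	/* With half-second accumulation the adjustment ends up applied
	 * for either zero or one whole 500ms chunk:
	 */
	printf("NOHZ, zero chunks: %4.1f us\n", correction_us(100, 0.0));
	printf("NOHZ, one chunk:   %4.1f us\n", correction_us(100, 0.5));
	return 0;
}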
Now, there will always be some granularity error when applying frequency
corrections. However, users sensitive to this error have seen a
50-500x increase in it with NOHZ compared to running without NOHZ.
So I figured I'd try another approach than simply increasing the
interval. My approach is to consume the time interval logarithmically.
This reduces the number of times through the loop, keeping latency
down, while still preserving the original granularity error for
adjtimex() changes.
This has been lightly tested and appears to work correctly, but I'd
appreciate any feedback or comments on the idea and code.
Signed-off-by: John Stultz <[email protected]>
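As a rough illustration of how the loop count drops (a userspace sketch
of the same consume-in-power-of-two-chunks idea with made-up numbers, not
the patch itself), 5000 pending base intervals get consumed in a number
of loop passes proportional to log2 of the backlog rather than to the
backlog itself:

#include <stdio.h>

int main(void)
{
	/* Say a long NOHZ idle period left 5000 base intervals pending. */
	const unsigned long cycle_interval = 1;	/* in base-interval units */
	unsigned long offset = 5000;
	unsigned long pending = offset;
	int shift = 1, steps = 0;

	/* Estimate the largest power-of-two chunk that fits. */
	while (offset > (cycle_interval << shift))
		shift++;
	shift--;

	/* Consume the offset in progressively smaller chunks. */
	while (offset >= cycle_interval) {
		steps++;
		if (offset < (cycle_interval << shift)) {
			shift--;
			continue;
		}
		offset -= cycle_interval << shift;
		if (shift > 0)
			shift--;
	}

	/* Prints: 5000 pending intervals consumed in 10 loop passes */
	printf("%lu pending intervals consumed in %d loop passes\n",
	       pending, steps);
	return 0;
}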
diff --git a/include/linux/timex.h b/include/linux/timex.h
index 998a55d..680c412 100644
--- a/include/linux/timex.h
+++ b/include/linux/timex.h
@@ -241,11 +241,7 @@ static inline int ntp_synced(void)
#define NTP_SCALE_SHIFT 32
-#ifdef CONFIG_NO_HZ
-#define NTP_INTERVAL_FREQ (2)
-#else
#define NTP_INTERVAL_FREQ (HZ)
-#endif
#define NTP_INTERVAL_LENGTH (NSEC_PER_SEC/NTP_INTERVAL_FREQ)
/* Returns how long ticks are at present, in ns / 2^NTP_SCALE_SHIFT. */
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index 900f1b6..dffe668 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -480,6 +480,7 @@ static void clocksource_adjust(s64 offset)
void update_wall_time(void)
{
cycle_t offset;
+ int shift = 1;
/* Make sure we're fully resumed: */
if (unlikely(timekeeping_suspended))
@@ -492,30 +493,47 @@ void update_wall_time(void)
#endif
clock->xtime_nsec = (s64)xtime.tv_nsec << clock->shift;
+
+ /*
+ * If NO_HZ is enabled, we may spend too much time in the
+ * accumulation loop. So try to accumulate logarithmically
+ */
+ while (offset > (clock->cycle_interval << shift))
+ shift++;
+ shift--;
+
/* normally this loop will run just once, however in the
* case of lost or late ticks, it will accumulate correctly.
*/
while (offset >= clock->cycle_interval) {
+ if (offset < clock->cycle_interval<<shift) {
+ shift--;
+ continue;
+ }
/* accumulate one interval */
- offset -= clock->cycle_interval;
- clock->cycle_last += clock->cycle_interval;
+ offset -= clock->cycle_interval << shift;
+ clock->cycle_last += clock->cycle_interval << shift;
- clock->xtime_nsec += clock->xtime_interval;
+ clock->xtime_nsec += clock->xtime_interval << shift;
if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
xtime.tv_sec++;
second_overflow();
}
- clock->raw_time.tv_nsec += clock->raw_interval;
+ clock->raw_time.tv_nsec += clock->raw_interval << shift;
if (clock->raw_time.tv_nsec >= NSEC_PER_SEC) {
clock->raw_time.tv_nsec -= NSEC_PER_SEC;
clock->raw_time.tv_sec++;
}
/* accumulate error between NTP and clock interval */
- clock->error += tick_length;
- clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift);
+ clock->error += tick_length << shift;
+ clock->error -= (clock->xtime_interval
+ << (NTP_SCALE_SHIFT - clock->shift))
+ << shift;
+ if (shift > 0) /*don't roll under!*/
+ shift--;
}
/* correct the clock when NTP error is too big */
* John Stultz <[email protected]> wrote:
> [...]
Hm, we used to have some sort of problem with a similar patch in the
past.
> /* accumulate error between NTP and clock interval */
> - clock->error += tick_length;
> - clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift);
> + clock->error += tick_length << shift;
> + clock->error -= (clock->xtime_interval
> + << (NTP_SCALE_SHIFT - clock->shift))
> + << shift;
Why not:
clock->error -= clock->xtime_interval
<< (NTP_SCALE_SHIFT - clock->shift + shift);
?
> + if (shift > 0) /*don't roll under!*/
> + shift--;
(nit: watch out the comment style)
that bit looks a bit messy. We estimated the shift:
+ while (offset > (clock->cycle_interval << shift))
+ shift++;
+ shift--;
can it really ever roll under in this loop:
while (offset >= clock->cycle_interval) {
...
offset -= clock->cycle_interval << shift;
?
Ingo
On Tue, 2009-03-24 at 15:13 +0100, Ingo Molnar wrote:
> * John Stultz <[email protected]> wrote:
>
> > [...]
>
> Hm, we used to have some sort of problem with a similar patch in the
> past.
Do you recall any details about the problem? I don't.
> > /* accumulate error between NTP and clock interval */
> > - clock->error += tick_length;
> > - clock->error -= clock->xtime_interval << (NTP_SCALE_SHIFT - clock->shift);
> > + clock->error += tick_length << shift;
> > + clock->error -= (clock->xtime_interval
> > + << (NTP_SCALE_SHIFT - clock->shift))
> > + << shift;
>
> Why not:
>
> clock->error -= clock->xtime_interval
> << (NTP_SCALE_SHIFT - clock->shift + shift);
>
> ?
Yep. Much cleaner.
> > + if (shift > 0) /*don't roll under!*/
> > + shift--;
>
> (nit: watch out the comment style)
Good point.
> that bit looks a bit messy. We estimated the shift:
>
> + while (offset > (clock->cycle_interval << shift))
> + shift++;
> + shift--;
>
> can it really ever roll under in this loop:
It probably can't. I just haven't sat down to work out the full math to
prove it to myself, so I've been cautious.
Thanks for the suggestions, I'll roll those fixes up into the next
version.
So the basic approach seems ok by you?
thanks
-john
* john stultz <[email protected]> wrote:
> On Tue, 2009-03-24 at 15:13 +0100, Ingo Molnar wrote:
> > * John Stultz <[email protected]> wrote:
> >
> > > [...]
> >
> > Hm, we used to have some sort of problem with a similar patch in the
> > past.
>
> Do you recall any details about the problem? I don't.
Do you remember that logarithmic accumulation patch you did for -rt
originally? That was the one which introduced an NTP breakage. My
memories are fuzzy, and it's probably all moot, just wanted to ask :-)
> [...]
> So the basic approach seems ok by you?
Yeah, certainly.
Please mention in the changelog that the logarithmic method is a
perfect replacement for the existing code - not an approximation.
Most of the loop is linear calculations, so going to a logarithmic
loop there is fine (as long as shift does not cause an overflow in
the worst case - that should be double checked).
The only non-linear aspect of this loop is:
if (clock->xtime_nsec >= (u64)NSEC_PER_SEC << clock->shift) {
clock->xtime_nsec -= (u64)NSEC_PER_SEC << clock->shift;
xtime.tv_sec++;
second_overflow();
}
We'll now hit this moment imprecisely - with a
~clock->cycle_interval/2 worst-case.
Fortunately this is not an issue in terms of second_overflow()
internals, except one small (nitpicky) detail: we should make sure
elsewhere (or at least comment on the fact) that
clock->cycle_interval never gets up into the one-second range -
otherwise we could miss a second_overflow() call. That is true
currently (HZ=10 is the lowest timer tick frequency AFAICS, which
keeps the interval well below a second).
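One way to encode that assumption (purely a hypothetical sketch of the
"make sure elsewhere" idea above, not part of the posted patch; the
check and the message text are made up) would be a one-time warning
next to the accumulation loop:

	/* Hypothetical sanity check: the per-chunk second_overflow()
	 * handling assumes a single base interval never reaches a
	 * full second.
	 */
	WARN_ONCE(clock->xtime_interval >= ((u64)NSEC_PER_SEC << clock->shift),
		  "timekeeping: base interval has reached one second\n");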
One more thing:
+ while (offset > (clock->cycle_interval << shift))
+ shift++;
+ shift--;
why not use ilog2()?
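For reference, the shift-estimation loop could be collapsed into
something like the fragment below (a sketch only: ilog2() is the helper
from <linux/log2.h>, and the negative-result clamp is an assumption, not
something settled in this thread). An off-by-one overestimate would be
harmless here, since the accumulation loop already steps shift down
whenever a chunk does not fit:

	/* Sketch: ilog2(x) returns floor(log2(x)), so this picks roughly
	 * the largest power-of-two multiple of cycle_interval that still
	 * fits into the pending offset.
	 */
	shift = ilog2(offset) - ilog2(clock->cycle_interval);
	if (shift < 0)
		shift = 0;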
About your choice of algorithm.
In theory we could use a different implementation here: instead of
the current linear algorithm or your proposed logarithmic one, it
could be plain multiplicative, advancing the clock straight by
'offset' nanoseconds - and calculating the number of second overflows
crossed along the way. Then we could add a seconds_overflow() (note
the plural) call in NTP.
The problem is that sufficiently large 'offset' values would rarely
be tested - and this would make the multiplication math fragile in
practice with respect to overflow conditions.
So i think we are best served by your logarithmic approach - it's
all plain shifts with clear bounds, so easier to validate (and
easier to tweak for overflows), and basically just as fast.
> > that bit looks a bit messy. We estimated the shift:
> >
> > + while (offset > (clock->cycle_interval << shift))
> > + shift++;
> > + shift--;
> >
> > can it really ever roll under in this loop:
>
> It probably can't. I just haven't sat down to work out the full
> math to prove it to myself, so I've been cautious.
That needs to be done though, instead of silently being wrong on the
math for really long intervals. The ilog2 estimate is walked back
down in reverse inside the loop (modulo one step), so it should match
up perfectly. If it does not, we need to emit a WARN_ONCE(). (Note:
release the xtime_lock first in that case, we'll lock up otherwise.)
A quick look suggests that there might be a problem area: we might
need to limit the maximum value of shift, to never overflow this:
clock->error -= clock->xtime_interval
<< (NTP_SCALE_SHIFT - clock->shift + shift);
Either via a clock->max_shift, or - ideally - via a reasonable
constant max shift of around 16 or so.
We cannot assume anything about the maximum value of 'offset' on
NOHZ here - i've seen dynticks sleep times of over 120 seconds in
the past. So an input of 120,000,000,000 is possible (with HZ=1000
that would be a shift of ~17) and i think the shift needs to be
capped to not overflow the 64 bit width.
Fortunately, as HZ goes up, xtime_interval goes down, so a cap of 16
looks realistic and robust. Please re-do the math though :)
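A minimal sketch of such a cap (just an illustration of the constant-cap
suggestion above, not settled code; the value 16 is the ballpark figure
from this mail, not a worked-out bound):

	/* Hypothetical cap: bound the accumulation chunk so that the
	 * shifted tick_length and error terms stay well inside 64 bits.
	 */
	if (shift > 16)
		shift = 16;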
Thanks,
Ingo