LinuxLists.cc - [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp

2010-12-08 18:30:59

Subject: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

While doing my end of year unlikely() cleanup, running the annotate
branch profiler, I came across this:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
122588 65653641 99 timekeeping_adjust timekeeping.c 664
167493 14584927 98 timekeeping_adjust timekeeping.c 658

This shows that the following likely()'s are wrong most of the time:

if (error > interval) {
error >>= 2;
if (likely(error <= interval))
adj = 1;
else
adj = timekeeping_bigadjust(error, &interval, &offset);
} else if (error < -interval) {
error >>= 2;
if (likely(error >= -interval)) {

Talking about this with John Stultz, we both agreed that the annotations
should be correct and is not the issue, but something else is going
wrong.

Adding in trace_printks(), I saw that the adj values that were added to
the "mult" multiplier were sometimes quite large. The time intervals
never got down into a small error, but instead was making large
oscillations, both positive and negative to where it should be.

John noticed that if he removed the commit:

commit 5cd1c9c5cf30d4b33df3d3f74d8142f278d536b7
timekeeping: fix rounding problem during clock update

that the problem would go away and we would get back into a tight
oscillation that would stay within the fast path (and the likely()'s
were again likely).

What the above commit did was to fix a bug that caused time to go
backward a nanosec due to the truncating of the xtime_nsec shifted into
the xtime.tv_nsec. The fix for that bug (and what that commit did) was
to always round up one. It added +1 to the xtime.tv_nsec after it did
the conversion, and then took the difference between this shifted and
the xtime_nsec and stored that into the ntp_error.

The ntp_error is used to control the frequency, and this constant adding
of the shift remainder would cause the large oscillation.

This patch instead adds another field to the timekeeping structure that
stores the remainder separately. On re-entry into update_wall_time(),
the remainder is added back onto the xtime_nsec after it is set to the
xtime.tv_nsec and restoring its original value.

This handles the rounding problem that the original commit addressed but
does not cause the large oscillation that it caused.

The new results of the branch annotation profiler is now:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
736971 129 0 timekeeping_adjust timekeeping.c 685
736943 115 0 timekeeping_adjust timekeeping.c 676

Cc: Roman Zippel <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Signed-off-by: Steven Rostedt <[email protected]>

Index: linux-compile.git/kernel/time/timekeeping.c
===================================================================
--- linux-compile.git.orig/kernel/time/timekeeping.c
+++ linux-compile.git/kernel/time/timekeeping.c
@@ -37,6 +37,10 @@ struct timekeeper {

/* Clock shifted nano seconds remainder not stored in xtime.tv_nsec. */
u64 xtime_nsec;
+
+ /* remainder in xtime subtraction */
+ u64 xtime_nsec_rem;
+
/* Difference between accumulated time and NTP time in ntp
* shifted nano seconds. */
s64 ntp_error;
@@ -84,6 +88,7 @@ static void timekeeper_setup_internals(s
((u64) interval * clock->mult) >> clock->shift;

timekeeper.xtime_nsec = 0;
+ timekeeper.xtime_nsec_rem = 0;
timekeeper.shift = clock->shift;

timekeeper.ntp_error = 0;
@@ -751,6 +756,12 @@ void update_wall_time(void)
timekeeper.xtime_nsec = (s64)xtime.tv_nsec << timekeeper.shift;

/*
+ * Add back the remainder that was left over when adding +1 to
+ * xtime.tv_nsec;
+ */
+ timekeeper.xtime_nsec += timekeeper.xtime_nsec_rem;
+
+ /*
* With NO_HZ we may have to accumulate many cycle_intervals
* (think "ticks") worth of time at once. To do this efficiently,
* we calculate the largest doubling multiple of cycle_intervals
@@ -797,12 +808,11 @@ void update_wall_time(void)

/*
* Store full nanoseconds into xtime after rounding it up and
- * add the remainder to the error difference.
+ * store the remainder to update xtime_nsec on the next iteration.
*/
xtime.tv_nsec = ((s64) timekeeper.xtime_nsec >> timekeeper.shift) + 1;
timekeeper.xtime_nsec -= (s64) xtime.tv_nsec << timekeeper.shift;
- timekeeper.ntp_error += timekeeper.xtime_nsec <<
- timekeeper.ntp_error_shift;
+ timekeeper.xtime_nsec_rem = timekeeper.xtime_nsec;

/*
* Finally, make sure that after the rounding

2010-12-08 19:59:51

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

On Wed, 2010-12-08 at 13:30 -0500, Steven Rostedt wrote:
> While doing my end of year unlikely() cleanup, running the annotate
> branch profiler, I came across this:
>
> correct incorrect % Function File Line
> ------- --------- - -------- ---- ----
> 122588 65653641 99 timekeeping_adjust timekeeping.c 664
> 167493 14584927 98 timekeeping_adjust timekeeping.c 658
>
> This shows that the following likely()'s are wrong most of the time:
>
> if (error > interval) {
> error >>= 2;
> if (likely(error <= interval))
> adj = 1;
> else
> adj = timekeeping_bigadjust(error, &interval, &offset);
> } else if (error < -interval) {
> error >>= 2;
> if (likely(error >= -interval)) {
>
> Talking about this with John Stultz, we both agreed that the annotations
> should be correct and is not the issue, but something else is going
> wrong.
>
> Adding in trace_printks(), I saw that the adj values that were added to
> the "mult" multiplier were sometimes quite large. The time intervals
> never got down into a small error, but instead was making large
> oscillations, both positive and negative to where it should be.
>
> John noticed that if he removed the commit:
>
> commit 5cd1c9c5cf30d4b33df3d3f74d8142f278d536b7
> timekeeping: fix rounding problem during clock update
>
> that the problem would go away and we would get back into a tight
> oscillation that would stay within the fast path (and the likely()'s
> were again likely).
>
> What the above commit did was to fix a bug that caused time to go
> backward a nanosec due to the truncating of the xtime_nsec shifted into
> the xtime.tv_nsec. The fix for that bug (and what that commit did) was
> to always round up one. It added +1 to the xtime.tv_nsec after it did
> the conversion, and then took the difference between this shifted and
> the xtime_nsec and stored that into the ntp_error.
>
> The ntp_error is used to control the frequency, and this constant adding
> of the shift remainder would cause the large oscillation.
>
> This patch instead adds another field to the timekeeping structure that
> stores the remainder separately. On re-entry into update_wall_time(),
> the remainder is added back onto the xtime_nsec after it is set to the
> xtime.tv_nsec and restoring its original value.
>
> This handles the rounding problem that the original commit addressed but
> does not cause the large oscillation that it caused.

Hey Steven!

Thanks for the great analysis and tooling to help find these unexpected
behaviors!

Sadly, I believe your proposed change can still cause minor nsec
inconsistencies from gettimeofday/vgettimeofday. In fact, the previous
implementation where the nsec inconsistency error was observed preserved
the sub-nanosecond remainder in xtime_nsec.

I suspect we may need to still round up and store the error, but tweak
the adjustment code to handle the larger error per-iteration then it was
originally designed for (note: the current code is still functioning
properly, its just not often hitting the expected trivial case).

The only alternative would be to integrate the sub-ns remainder into the
gettime caclculation (including reworking all the vsyscall
implementations to utilize it as well).

thanks
-john

2010-12-08 20:15:23

by Steven Rostedt

[permalink] [raw]

Subject: Re: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

On Wed, 2010-12-08 at 11:59 -0800, john stultz wrote:

> Hey Steven!
>
> Thanks for the great analysis and tooling to help find these unexpected
> behaviors!
>
> Sadly, I believe your proposed change can still cause minor nsec
> inconsistencies from gettimeofday/vgettimeofday. In fact, the previous
> implementation where the nsec inconsistency error was observed preserved
> the sub-nanosecond remainder in xtime_nsec.

What inconsistencies exactly? Not to say they aren't any, but I really
want to know exactly what is happening. Even the change log that Roman
showed in that commit does not explain well what the issue was.

Here's the log:

timekeeping: fix rounding problem during clock update

Due to a rounding problem during a clock update it's possible for readers
to observe the clock jumping back by 1nsec. The following simplified
example demonstrates the problem:

cycle xtime
0 0
1000 999999.6
2000 1999999.2
3000 2999998.8
...

1500 = 1499999.4
= 0.0 + 1499999.4
= 999999.6 + 499999.8

When reading the clock only the full nanosecond part is used, while
timekeeping internally keeps nanosecond fractions. If the clock is now
updated at cycle 1500 here, a nanosecond is missing due to the truncation.

The simple fix is to round up the xtime value during the update, this also
changes the distance to the reference time, but the adjustment will
automatically take care that it stays under control.

I have no idea what he's trying to say with that "simplified example".
Where did that "1500" come from? There's a bit too much black magic
going on to make this comfortable.

>
> I suspect we may need to still round up and store the error, but tweak
> the adjustment code to handle the larger error per-iteration then it was
> originally designed for (note: the current code is still functioning
> properly, its just not often hitting the expected trivial case).

I'm afraid that may also do damage. This code needs real documentation
explaining in detail what the frick it's doing. Having someone that
shows up once ever 10 years that understands it is not something I feel
confident about.

>
> The only alternative would be to integrate the sub-ns remainder into the
> gettime caclculation (including reworking all the vsyscall
> implementations to utilize it as well).
>

First lets talk about all the issues, and perhaps even start documenting
what it currently does. The comments in the code are not enough for a
reviewer to understand the logic. I've spent today reading some RFC's to
understand more about NTP but, it still does not explain the magic
constants that are used in the code.

Thanks,

-- Steve

2010-12-08 22:48:07

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

On Wed, 2010-12-08 at 15:15 -0500, Steven Rostedt wrote:
> On Wed, 2010-12-08 at 11:59 -0800, john stultz wrote:
>
> > Hey Steven!
> >
> > Thanks for the great analysis and tooling to help find these unexpected
> > behaviors!
> >
> > Sadly, I believe your proposed change can still cause minor nsec
> > inconsistencies from gettimeofday/vgettimeofday. In fact, the previous
> > implementation where the nsec inconsistency error was observed preserved
> > the sub-nanosecond remainder in xtime_nsec.
>
> What inconsistencies exactly? Not to say they aren't any, but I really
> want to know exactly what is happening. Even the change log that Roman
> showed in that commit does not explain well what the issue was.

So, it has to do with the fact that X cycles doesn't exactly match Y ns,
and the way we do our accumulation with high precision.

There are two components to the gettimeofday calculation: the
accumulated base, stored in nanoseconds (this is xtime) and the current
cycle delta, converted to nanoseconds.

gtod: xtime + cyc2ns(delta)

During the tick, when we do our accumulation into xtime, we do it with
very high precision (clocksource shifted nanoseconds). For this
accumulation, there are three components: xtime, the cycle delta, and
the sub-ns remainder. The sub-ns remainder may take various forms. In
earlier code it was stored in xtime_nsec, with the current code we round
xtime up to the next nsec, and store -(1ns - remainder) into the
ntp_error, in your patch it is the new remainder value.

Now, when we accumulate the cycle delta, we do so in fixed chunks with
higher precision then we use for gettimeofday (again using shifted
nanoseconds). This may leave some unaccumulated cycles in delta.

accumulation: (xtime + remainder) += high precision chunk of delta

The key here is that a gtod called right before and right after the
accumulation, if given the same cycle value should get the same number
(ie: imagine no time changed).

Now looking at Roman's commit message:

> timekeeping: fix rounding problem during clock update
>
> Due to a rounding problem during a clock update it's possible for readers
> to observe the clock jumping back by 1nsec. The following simplified
> example demonstrates the problem:
>
> cycle xtime
> 0 0
> 1000 999999.6
> 2000 1999999.2
> 3000 2999998.8

The above shows the standard accumulation chunks.
1000 cycles = 999999.6 ns

> ...
>
> 1500 = 1499999.4
> = 0.0 + 1499999.4
> = 999999.6 + 499999.8

This is an illustration of if the accumulation triggers @1500 cycles.

It initially seems a little contrived, because the accumulation chunks
should match the accumulation period (aka: the tick period). But the
tick interrupt may not be exactly matched, due to the timer hardware
freq not being able to match HZ exactly, or some other slight difference
in the hardware speed. So for instance, the tick may always land just a
little early, so 999, then 1998, etc.. In that case the first tick we
won't accumulate anything, then the second we will accumulate 1000,
leaving 998 for the next time. Anyway, the net of it is, that the code
can't assume that the tick will land at a specific time.

So again, assume the accumulation is late or whatever, and triggers at
1500 cycles.

In this case, we accumulate a single chunk of 1000 cycles, leaving 500
cycles for the next tick/accumulation period.

Pre-accumulation:
xtime = 0, rem = 0, delta = 1500
gtod: 0 + cyc2ns(1500)
0 + 1499999 (note cyc2ns truncated the 0.4 remainder)
1499999

Accumulation occurs.

Post-accumulation:
xtime = 999999, rem=0.6, delta = 500

gtod: 999999 + cyc2ns(500)
999999 + 499999 (cyc2ns truncates the .8 remainder)
1499998

And there is the 1ns inconsistency!

With the current code, in accumulation we round xtime up and add the
remainder into the ntp_error.

Post accumulation:
xtime = 1000000, ntp_error=-0.4, delta = 500

gtod: 1000000 + cyc2ns(500)
1000000 + 499999 (cyc2ns truncates the .8 remainder)
1499999

Then in the adjustment step, we take the ntp_error into account and we
will slow the clocksource freq down a bit to make sure the long-term
accuracy is still correct (so we don't gain ~1ns per tick).

Now, as I said, we could modify gtod, so it takes into account the
remainder value, as well as a modified cyc2ns that doesn't truncate the
sub-ns portion until after we've added everything together. But that
requires touching a lot more code (all the vsyscall gtods).

In that case, post accumulation would look like:
xtime = 999999, rem=0.6, delta = 500
gtod: 999999.6 + cyc2shiftedns(500)
999999.6 + 499999.8
1499999.4
1499999 (finally truncing the value before returning)

Alternatively, we could modify the adjustment code to expect larger
errors each iteration due to the rounding up.

> > I suspect we may need to still round up and store the error, but tweak
> > the adjustment code to handle the larger error per-iteration then it was
> > originally designed for (note: the current code is still functioning
> > properly, its just not often hitting the expected trivial case).
>
> I'm afraid that may also do damage. This code needs real documentation
> explaining in detail what the frick it's doing. Having someone that
> shows up once ever 10 years that understands it is not something I feel
> confident about.

Oh yea. I agree the adjustment code is very opaque. And there is risk in
modifying the code that has been "functioning" but just not in the ideal
environment. This stuff is terribly subtle. And there is a real
possibility that in fixing the issue you've found, we may cause bugs
that effect users in a much more negative way then the incorrect
likely() overhead.

> > The only alternative would be to integrate the sub-ns remainder into the
> > gettime caclculation (including reworking all the vsyscall
> > implementations to utilize it as well).
> >
>
> First lets talk about all the issues, and perhaps even start documenting
> what it currently does. The comments in the code are not enough for a
> reviewer to understand the logic. I've spent today reading some RFC's to
> understand more about NTP but, it still does not explain the magic
> constants that are used in the code.

Also, NTP RFCs probably won't help you much with the kernel's
timekeeping.c code and adjustments. Most of the NTP specific code that
maps to the RFC is actually in ntp.c. The timekeeping code consumes the
steering done there via the tick_length value and the error accounting.

But as to your point about better documentation, yes I think that's a
great idea. I do worry that Roman (who has been out of contact for
awhile) is the only person who really gets the adjustment code. I've
always let it be somewhat of a black box, understanding what it does in
concept, but some of the details still escape my grasp. It has
functioned well over the last few years, and in my long discussions with
him he tends to be correct about his approach, but it takes quite a bit
of time and frustration (at least for slow minded folks like me) to
understand why its right.

So more eyes connected to large brains could definitely help here. :)

thanks
-john

2010-12-09 02:09:17

by Steven Rostedt

[permalink] [raw]

Subject: Re: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

John,

Thanks a lot for this explanation. It does explain things much better!

On Wed, 2010-12-08 at 14:47 -0800, john stultz wrote:
> On Wed, 2010-12-08 at 15:15 -0500, Steven Rostedt wrote:
> > On Wed, 2010-12-08 at 11:59 -0800, john stultz wrote:
> >
> > > Hey Steven!
> > >
> > > Thanks for the great analysis and tooling to help find these unexpected
> > > behaviors!
> > >
> > > Sadly, I believe your proposed change can still cause minor nsec
> > > inconsistencies from gettimeofday/vgettimeofday. In fact, the previous
> > > implementation where the nsec inconsistency error was observed preserved
> > > the sub-nanosecond remainder in xtime_nsec.
> >
> > What inconsistencies exactly? Not to say they aren't any, but I really
> > want to know exactly what is happening. Even the change log that Roman
> > showed in that commit does not explain well what the issue was.
>
> So, it has to do with the fact that X cycles doesn't exactly match Y ns,
> and the way we do our accumulation with high precision.
>
> There are two components to the gettimeofday calculation: the
> accumulated base, stored in nanoseconds (this is xtime) and the current
> cycle delta, converted to nanoseconds.
>
> gtod: xtime + cyc2ns(delta)
>
>
> During the tick, when we do our accumulation into xtime, we do it with
> very high precision (clocksource shifted nanoseconds). For this
> accumulation, there are three components: xtime, the cycle delta, and
> the sub-ns remainder. The sub-ns remainder may take various forms. In
> earlier code it was stored in xtime_nsec, with the current code we round
> xtime up to the next nsec, and store -(1ns - remainder) into the
> ntp_error, in your patch it is the new remainder value.
>
> Now, when we accumulate the cycle delta, we do so in fixed chunks with
> higher precision then we use for gettimeofday (again using shifted
> nanoseconds). This may leave some unaccumulated cycles in delta.
>
> accumulation: (xtime + remainder) += high precision chunk of delta
>
> The key here is that a gtod called right before and right after the
> accumulation, if given the same cycle value should get the same number
> (ie: imagine no time changed).
>
> Now looking at Roman's commit message:
>
> > timekeeping: fix rounding problem during clock update
> >
> > Due to a rounding problem during a clock update it's possible for readers
> > to observe the clock jumping back by 1nsec. The following simplified
> > example demonstrates the problem:
> >
> > cycle xtime
> > 0 0
> > 1000 999999.6
> > 2000 1999999.2
> > 3000 2999998.8
>
> The above shows the standard accumulation chunks.
> 1000 cycles = 999999.6 ns
>
> > ...
> >
> > 1500 = 1499999.4
> > = 0.0 + 1499999.4
> > = 999999.6 + 499999.8
>
> This is an illustration of if the accumulation triggers @1500 cycles.
>
> It initially seems a little contrived, because the accumulation chunks
> should match the accumulation period (aka: the tick period). But the
> tick interrupt may not be exactly matched, due to the timer hardware
> freq not being able to match HZ exactly, or some other slight difference
> in the hardware speed. So for instance, the tick may always land just a
> little early, so 999, then 1998, etc.. In that case the first tick we
> won't accumulate anything, then the second we will accumulate 1000,
> leaving 998 for the next time. Anyway, the net of it is, that the code
> can't assume that the tick will land at a specific time.
>
> So again, assume the accumulation is late or whatever, and triggers at
> 1500 cycles.
>
> In this case, we accumulate a single chunk of 1000 cycles, leaving 500
> cycles for the next tick/accumulation period.
>
> Pre-accumulation:
> xtime = 0, rem = 0, delta = 1500
> gtod: 0 + cyc2ns(1500)
> 0 + 1499999 (note cyc2ns truncated the 0.4 remainder)
> 1499999
>
> Accumulation occurs.
>
> Post-accumulation:
> xtime = 999999, rem=0.6, delta = 500
>
> gtod: 999999 + cyc2ns(500)
> 999999 + 499999 (cyc2ns truncates the .8 remainder)
> 1499998
>
> And there is the 1ns inconsistency!
>
>
> With the current code, in accumulation we round xtime up and add the
> remainder into the ntp_error.
>
> Post accumulation:
> xtime = 1000000, ntp_error=-0.4, delta = 500
>
> gtod: 1000000 + cyc2ns(500)
> 1000000 + 499999 (cyc2ns truncates the .8 remainder)
> 1499999
>
> Then in the adjustment step, we take the ntp_error into account and we
> will slow the clocksource freq down a bit to make sure the long-term
> accuracy is still correct (so we don't gain ~1ns per tick).

OK, but have you looked at my patch carefully? It does not do what the
old code does. It still keeps the "round up", but then it subtracts the
remainder from the delta when we come in again. I don't think my way has
the same issue.

Post-accumulation:
xtime = 1000000, rem=-0.4, delta = 500

gtod: 999999 + cyc2ns(500)
999999 + 499999 (cyc2ns truncates the .8 remainder)
1499998

The old code did:

xtime_nsec += xtime.tv_nsec << shift;

[ do stuff ]

xtime.tv_nsec = xtime_nsec >> shift;
xtime_nsec -= xtime.tv_nsec << shift;

So the gtod calculations using xtime.tv_nsec leaves off the remainder.
This causes issues as you shown.

My code still rounds up, but the difference is that it now subtracts
from the new start to get back to where it left off:

xtime_nsec += xtime.tv_nsec << shift;

[ do stuff ]

xtime.tv_nsec = (xtime_nsec >> shift) + 1;
xtime_nsec -= xtime.tv_nsec << shift;

The result of xtime_nsec is now negative, because we added 1 and
shifted.

Heck, I think this is all Roland needed to do to fix the issue. Lets
look at this again:

Pre-accumulation:
xtime = 0, rem = 0, delta = 1500
gtod: 0 + cyc2ns(1500)
0 + 1499999 (note cyc2ns truncated the 0.4 remainder)
1499999

Accumulation occurs.

Post-accumulation:
xtime = 1000000, rem=-0.4, delta = 500

gtod: 1000000 + cyc2ns(500)
1000000 + 499999 (cyc2ns truncates the .8 remainder)
1499999

Even if the subtraction moves the shift backward one, we are still ok,
because when we calculate xtime.tv_nsec at the end, we do the "+1" again
which returns us that missing ns!

>
>
> Now, as I said, we could modify gtod, so it takes into account the
> remainder value, as well as a modified cyc2ns that doesn't truncate the
> sub-ns portion until after we've added everything together. But that
> requires touching a lot more code (all the vsyscall gtods).

I don't think this is needed.

>
> In that case, post accumulation would look like:
> xtime = 999999, rem=0.6, delta = 500
> gtod: 999999.6 + cyc2shiftedns(500)
> 999999.6 + 499999.8
> 1499999.4
> 1499999 (finally truncing the value before returning)
>
>
> Alternatively, we could modify the adjustment code to expect larger
> errors each iteration due to the rounding up.
>
> > > I suspect we may need to still round up and store the error, but tweak
> > > the adjustment code to handle the larger error per-iteration then it was
> > > originally designed for (note: the current code is still functioning
> > > properly, its just not often hitting the expected trivial case).
> >
> > I'm afraid that may also do damage. This code needs real documentation
> > explaining in detail what the frick it's doing. Having someone that
> > shows up once ever 10 years that understands it is not something I feel
> > confident about.
>
> Oh yea. I agree the adjustment code is very opaque. And there is risk in
> modifying the code that has been "functioning" but just not in the ideal
> environment. This stuff is terribly subtle. And there is a real
> possibility that in fixing the issue you've found, we may cause bugs
> that effect users in a much more negative way then the incorrect
> likely() overhead.

It's not the "likely()" I care about, its the fact that we are going
into the slow path (timekeeping_bigadjust()) every time. This does a
hell of a lot more work than the simple "adj=1" that is given.

>
>
> > > The only alternative would be to integrate the sub-ns remainder into the
> > > gettime caclculation (including reworking all the vsyscall
> > > implementations to utilize it as well).
> > >
> >
> > First lets talk about all the issues, and perhaps even start documenting
> > what it currently does. The comments in the code are not enough for a
> > reviewer to understand the logic. I've spent today reading some RFC's to
> > understand more about NTP but, it still does not explain the magic
> > constants that are used in the code.
>
> Also, NTP RFCs probably won't help you much with the kernel's
> timekeeping.c code and adjustments. Most of the NTP specific code that
> maps to the RFC is actually in ntp.c. The timekeeping code consumes the
> steering done there via the tick_length value and the error accounting.

I read it just to understand the general concepts more ;-)

-- Steve

>
> But as to your point about better documentation, yes I think that's a
> great idea. I do worry that Roman (who has been out of contact for
> awhile) is the only person who really gets the adjustment code. I've
> always let it be somewhat of a black box, understanding what it does in
> concept, but some of the details still escape my grasp. It has
> functioned well over the last few years, and in my long discussions with
> him he tends to be correct about his approach, but it takes quite a bit
> of time and frustration (at least for slow minded folks like me) to
> understand why its right.
>
> So more eyes connected to large brains could definitely help here. :)

2010-12-09 03:44:38

by john stultz

[permalink] [raw]

Subject: Re: [RFC][PATCH] timekeeping: Keep xtime_nsec remainder separate from ntp_error

On Wed, 2010-12-08 at 21:09 -0500, Steven Rostedt wrote:
> On Wed, 2010-12-08 at 14:47 -0800, john stultz wrote:
> > On Wed, 2010-12-08 at 15:15 -0500, Steven Rostedt wrote:
> > > On Wed, 2010-12-08 at 11:59 -0800, john stultz wrote:
> > >
> > > > Hey Steven!
> > > >
> > > > Thanks for the great analysis and tooling to help find these unexpected
> > > > behaviors!
> > > >
> > > > Sadly, I believe your proposed change can still cause minor nsec
> > > > inconsistencies from gettimeofday/vgettimeofday. In fact, the previous
> > > > implementation where the nsec inconsistency error was observed preserved
> > > > the sub-nanosecond remainder in xtime_nsec.
[snip]
> OK, but have you looked at my patch carefully? It does not do what the
> old code does. It still keeps the "round up", but then it subtracts the
> remainder from the delta when we come in again. I don't think my way has
> the same issue.

Ah. I didn't read your patch correctly. It looked very close to what was
there before, so I jumped the gun a bit. That said, the fact that your
"remainder" holds the "negative amount we rounded up" shows how quickly
this type of code can get subtle. :)

> My code still rounds up, but the difference is that it now subtracts
> from the new start to get back to where it left off:
>
> xtime_nsec += xtime.tv_nsec << shift;
>
> [ do stuff ]
>
> xtime.tv_nsec = (xtime_nsec >> shift) + 1;
> xtime_nsec -= xtime.tv_nsec << shift;
>
> The result of xtime_nsec is now negative, because we added 1 and
> shifted.
>
> Heck, I think this is all Roland needed to do to fix the issue. Lets
> look at this again:
>
> Pre-accumulation:
> xtime = 0, rem = 0, delta = 1500
> gtod: 0 + cyc2ns(1500)
> 0 + 1499999 (note cyc2ns truncated the 0.4 remainder)
> 1499999
>
> Accumulation occurs.
>
> Post-accumulation:
> xtime = 1000000, rem=-0.4, delta = 500
>
> gtod: 1000000 + cyc2ns(500)
> 1000000 + 499999 (cyc2ns truncates the .8 remainder)
> 1499999
>
> Even if the subtraction moves the shift backward one, we are still ok,
> because when we calculate xtime.tv_nsec at the end, we do the "+1" again
> which returns us that missing ns!

So, yes, this does *seem* comforting. Keep the negative delta, and
re-add it in. It sounds pretty good, but my gut felt like it was prone
to the same issue as keeping the remainder. Took me awhile to find a
case that broke, but I think the following

I've kept everything in ns to simplify the example.

The accumulation interval is 1000.6ns, and with the exception of the
first interval, we accumulate exactly on that boundary, leaving a
1000.4ns delta around.

gtod: 0 + 2001 = 2001
accumulate: 1000.6 + 1000.4 = 2001
1001 + -.4 + 1000.4 = 2001
gtod: 1001 + 1000 = 2001
---------------
gtod: 1001 + 2001 = 3002
accumulate: 1001 + -.4 + 1000.6 + 1000.4 = 3001.6
2001.2 + 1000.4 = 3001.6
2002 + -.8 + 1000.4 = 3001.6
gtod: 2002 + 1000 = 3002
---------------
gtod: 2002 + 2001 = 4003
accumulate: 2002 + -.8 + 1000.6 + 1000.4 = 4002.2
3001.8 + 1000.4 = 4002.2
3002 + -2 + 1000.4 = 4002.2
gtod: 3002 + 1000 = 4002

So I believe your patch will still suffer from the same problem.

Its getting to the end of my day, so I might have flubbed something, so
let me know if you see anything wrong here.

> > Oh yea. I agree the adjustment code is very opaque. And there is risk in
> > modifying the code that has been "functioning" but just not in the ideal
> > environment. This stuff is terribly subtle. And there is a real
> > possibility that in fixing the issue you've found, we may cause bugs
> > that effect users in a much more negative way then the incorrect
> > likely() overhead.
>
> It's not the "likely()" I care about, its the fact that we are going
> into the slow path (timekeeping_bigadjust()) every time. This does a
> hell of a lot more work than the simple "adj=1" that is given.

True, we're taking the slow path every time. But the slow path does
function correctly and we do keep our long term accuracy. Its just not
optimal. The risk here is that is subtle code, so in working to make
things more optimal, we risk actually hurting the sync behavior.

So caution is needed to ensure correct functionality isn't compromised
as we work on improving performance.

thanks
-john