LinuxLists.cc - [PATCH 08/10] -mm clocksource: cleanup on -mm

2006-08-04 03:28:19

Subject: [PATCH 08/10] -mm clocksource: cleanup on -mm

Some additional clean up only on the -mm tree. Moves the adjust
functions into kernel/time/clocksource.c .

These functions directly modify the clocksource multiplier based
on ntp error. These adjustments will effect other users of that
clock. This hasn't been addressed in my patch set, since it
needs some discussion.

Signed-Off-By: Daniel Walker <[email protected]>

---
include/linux/clocksource.h | 1
kernel/time/clocksource.c | 90 ++++++++++++++++++++++++++++++++++++++++++++
kernel/timer.c | 83 ----------------------------------------
3 files changed, 91 insertions(+), 83 deletions(-)

Index: linux-2.6.17/include/linux/clocksource.h
===================================================================
--- linux-2.6.17.orig/include/linux/clocksource.h
+++ linux-2.6.17/include/linux/clocksource.h
@@ -217,6 +217,7 @@ extern int clocksource_sysfs_register(st
extern void clocksource_sysfs_unregister(struct sysdev_attribute*);
extern void clocksource_rating_change(struct clocksource*);
extern struct clocksource * clocksource_get_clock(char*);
+extern void clocksource_adjust(struct clocksource *, s64);

/**
* clocksource_get_best_clock - Finds highest rated clocksource
Index: linux-2.6.17/kernel/time/clocksource.c
===================================================================
--- linux-2.6.17.orig/kernel/time/clocksource.c
+++ linux-2.6.17/kernel/time/clocksource.c
@@ -57,6 +57,96 @@ int clocksource_notifier_register(struct
return atomic_notifier_chain_register(&clocksource_list_notifier, nb);
}

+/*
+ * If the error is already larger, we look ahead even further
+ * to compensate for late or lost adjustments.
+ */
+static __always_inline int
+clocksource_bigadjust(struct clocksource *clock, s64 error, s64 *interval,
+ s64 *offset)
+{
+ s64 tick_error, i;
+ u32 look_ahead, adj;
+ s32 error2, mult;
+
+ /*
+ * Use the current error value to determine how much to look ahead.
+ * The larger the error the slower we adjust for it to avoid problems
+ * with losing too many ticks, otherwise we would overadjust and
+ * produce an even larger error. The smaller the adjustment the
+ * faster we try to adjust for it, as lost ticks can do less harm
+ * here. This is tuned so that an error of about 1 msec is adusted
+ * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
+ */
+ error2 = clock->error >> (TICK_LENGTH_SHIFT + 22 - 2 * SHIFT_HZ);
+ error2 = abs(error2);
+ for (look_ahead = 0; error2 > 0; look_ahead++)
+ error2 >>= 2;
+
+ /*
+ * Now calculate the error in (1 << look_ahead) ticks, but first
+ * remove the single look ahead already included in the error.
+ */
+ tick_error = current_tick_length() >>
+ (TICK_LENGTH_SHIFT - clock->shift + 1);
+ tick_error -= clock->xtime_interval >> 1;
+ error = ((error - tick_error) >> look_ahead) + tick_error;
+
+ /* Finally calculate the adjustment shift value. */
+ i = *interval;
+ mult = 1;
+ if (error < 0) {
+ error = -error;
+ *interval = -*interval;
+ *offset = -*offset;
+ mult = -1;
+ }
+ for (adj = 0; error > i; adj++)
+ error >>= 1;
+
+ *interval <<= adj;
+ *offset <<= adj;
+ return mult << adj;
+}
+
+/*
+ * Adjust the multiplier to reduce the error value,
+ * this is optimized for the most common adjustments of -1,0,1,
+ * for other values we can do a bit more work.
+ */
+void clocksource_adjust(struct clocksource *clock, s64 offset)
+{
+ s64 error, interval = clock->cycle_interval;
+ int adj;
+
+ error = clock->error >> (TICK_LENGTH_SHIFT - clock->shift - 1);
+ if (error > interval) {
+ error >>= 2;
+ if (likely(error <= interval))
+ adj = 1;
+ else
+ adj = clocksource_bigadjust(clock, error, &interval,
+ &offset);
+ } else if (error < -interval) {
+ error >>= 2;
+ if (likely(error >= -interval)) {
+ adj = -1;
+ interval = -interval;
+ offset = -offset;
+ } else
+ adj = clocksource_bigadjust(clock, error, &interval,
+ &offset);
+ } else
+ return;
+
+ clock->mult += adj;
+ clock->xtime_interval += interval;
+ clock->xtime_nsec -= offset;
+ clock->error -= (interval - offset) <<
+ (TICK_LENGTH_SHIFT - clock->shift);
+}
+
+
/**
* __is_registered - Returns a clocksource if it's registered
* @name: name of the clocksource to return
Index: linux-2.6.17/kernel/timer.c
===================================================================
--- linux-2.6.17.orig/kernel/timer.c
+++ linux-2.6.17/kernel/timer.c
@@ -1140,89 +1140,6 @@ static int __init timekeeping_init_devic
device_initcall(timekeeping_init_device);

/*
- * If the error is already larger, we look ahead even further
- * to compensate for late or lost adjustments.
- */
-static __always_inline int clocksource_bigadjust(s64 error, s64 *interval, s64 *offset)
-{
- s64 tick_error, i;
- u32 look_ahead, adj;
- s32 error2, mult;
-
- /*
- * Use the current error value to determine how much to look ahead.
- * The larger the error the slower we adjust for it to avoid problems
- * with losing too many ticks, otherwise we would overadjust and
- * produce an even larger error. The smaller the adjustment the
- * faster we try to adjust for it, as lost ticks can do less harm
- * here. This is tuned so that an error of about 1 msec is adusted
- * within about 1 sec (or 2^20 nsec in 2^SHIFT_HZ ticks).
- */
- error2 = clock->error >> (TICK_LENGTH_SHIFT + 22 - 2 * SHIFT_HZ);
- error2 = abs(error2);
- for (look_ahead = 0; error2 > 0; look_ahead++)
- error2 >>= 2;
-
- /*
- * Now calculate the error in (1 << look_ahead) ticks, but first
- * remove the single look ahead already included in the error.
- */
- tick_error = current_tick_length() >> (TICK_LENGTH_SHIFT - clock->shift + 1);
- tick_error -= clock->xtime_interval >> 1;
- error = ((error - tick_error) >> look_ahead) + tick_error;
-
- /* Finally calculate the adjustment shift value. */
- i = *interval;
- mult = 1;
- if (error < 0) {
- error = -error;
- *interval = -*interval;
- *offset = -*offset;
- mult = -1;
- }
- for (adj = 0; error > i; adj++)
- error >>= 1;
-
- *interval <<= adj;
- *offset <<= adj;
- return mult << adj;
-}
-
-/*
- * Adjust the multiplier to reduce the error value,
- * this is optimized for the most common adjustments of -1,0,1,
- * for other values we can do a bit more work.
- */
-static void clocksource_adjust(struct clocksource *clock, s64 offset)
-{
- s64 error, interval = clock->cycle_interval;
- int adj;
-
- error = clock->error >> (TICK_LENGTH_SHIFT - clock->shift - 1);
- if (error > interval) {
- error >>= 2;
- if (likely(error <= interval))
- adj = 1;
- else
- adj = clocksource_bigadjust(error, &interval, &offset);
- } else if (error < -interval) {
- error >>= 2;
- if (likely(error >= -interval)) {
- adj = -1;
- interval = -interval;
- offset = -offset;
- } else
- adj = clocksource_bigadjust(error, &interval, &offset);
- } else
- return;
-
- clock->mult += adj;
- clock->xtime_interval += interval;
- clock->xtime_nsec -= offset;
- clock->error -= (interval - offset) << (TICK_LENGTH_SHIFT - clock->shift);
-}
-
-/*
* update_wall_time - Uses the current clocksource to increment the wall time
*
* Called from the timer interrupt, must hold a write on xtime_lock.

--

2006-08-04 19:53:46

by john stultz

[permalink] [raw]

Subject: Re: [PATCH 08/10] -mm clocksource: cleanup on -mm

On Thu, 2006-08-03 at 20:24 -0700, [email protected] wrote:
> plain text document attachment (clocksource_api_cleanup_on_mm.patch)
> Some additional clean up only on the -mm tree. Moves the adjust
> functions into kernel/time/clocksource.c .
>
> These functions directly modify the clocksource multiplier based
> on ntp error. These adjustments will effect other users of that
> clock. This hasn't been addressed in my patch set, since it
> needs some discussion.

Hmmmm. Yea, some additional discussion here would probably be needed

At the moment, I'd prefer to keep the clocksource_adjust bits with the
timekeeping code, however I'd also prefer to remove the timekeeping
specific fields (cycle_last, cycle_interval, xtime_nsec, xtime_interval,
error) from the clocksource structure and instead keep them in a
timekeeping specific structure (which may also point to a clocksource).

This would keep a clean separation between the clocksource's abstraction
that keeps as little state as possible and the timekeeping code's
internal state. However the point you bring up above is an interesting
issue: Do all users of the generic clocksource structure want the
clocksource to be NTP adjusted?

If we allow for non-ntp adjusted access to the clocksources, we may have
consistency issues between users comparing say sched_clock() and
clock_gettime() intervals. Further, if those users do want NTP adjusted
counters, why aren't they just using the timekeeping subsystem?

This does put some question as to what exactly would be the uses of the
clocksource structure outside of the timekeeping realm. Sure,
sched_clock() is a reasonable example, although since sched_clock has
such specific latency needs (we probably shouldn't go touching off-chip
hardware on every sched_clock call) and can be careful to avoid TSC skew
unlike the timekeeping code, its selection algorithm is going to be very
arch specific. So I'm not sure its really an ideal use of the
clocksource interface (as its not too difficult to just keep sched_clock
arch specific).

I do feel making the abstraction clean and generic is a good thing just
for code readability (and I very much appreciate your work here!), but
I'm not really sure that the need for clocksource access outside the
timekeeping subsystem has been well expressed. Do you have some other
examples other then sched_clock that might show further uses for this
abstraction?

thanks
-john

2006-08-04 21:11:16

by Daniel Walker

[permalink] [raw]

Subject: Re: [PATCH 08/10] -mm clocksource: cleanup on -mm

On Fri, 2006-08-04 at 12:53 -0700, john stultz wrote:
>
> Hmmmm. Yea, some additional discussion here would probably be needed
>
> At the moment, I'd prefer to keep the clocksource_adjust bits with the
> timekeeping code, however I'd also prefer to remove the timekeeping
> specific fields (cycle_last, cycle_interval, xtime_nsec, xtime_interval,
> error) from the clocksource structure and instead keep them in a
> timekeeping specific structure (which may also point to a clocksource).
>
> This would keep a clean separation between the clocksource's abstraction
> that keeps as little state as possible and the timekeeping code's
> internal state. However the point you bring up above is an interesting
> issue: Do all users of the generic clocksource structure want the
> clocksource to be NTP adjusted?

Since the output from the clocksource is a lowlevel timestamp I don't
think the users of it would want it to be ntp adjusted. It would also be
a little odd, since the ntp adjustment would be attached only to a
single clock.

> If we allow for non-ntp adjusted access to the clocksources, we may have
> consistency issues between users comparing say sched_clock() and
> clock_gettime() intervals. Further, if those users do want NTP adjusted
> counters, why aren't they just using the timekeeping subsystem?

I imagine the users of the interface would be compartmentalized. Taking
sched_clock as an example the output is only compared to itself and not
to output from other interfaces.

> This does put some question as to what exactly would be the uses of the
> clocksource structure outside of the timekeeping realm. Sure,
> sched_clock() is a reasonable example, although since sched_clock has
> such specific latency needs (we probably shouldn't go touching off-chip
> hardware on every sched_clock call) and can be careful to avoid TSC skew
> unlike the timekeeping code, its selection algorithm is going to be very
> arch specific. So I'm not sure its really an ideal use of the
> clocksource interface (as its not too difficult to just keep sched_clock
> arch specific).

Part of the reason to have a generic sched_clock() (and the generic
clocksource interface in general) is to eliminate the inefficienty of
duplicating shift and mult functionality in each arch (and on ARM it's
per board). So if you correctly implement a clocksource structure for
your hardware you will at least expose a usable sched_clock() and
generic timeofday. Then if we add more users of the interface then more
functionality is exposed.

Another instances of this is when instrumentation is needing a of fast
low level timestamp. In the past to accomplish this one would need a per
arch change to read a clock, then potentially duplicate a shift and mult
type computation in order to covert to nanosecond. One good example of
this is latency tracing in the -rt tree. I can imagine some good and
valid instrumentation having a long road of acceptable because the time
stamping portion would need to flow through several different arch and
potentially board maintainers.

I've also imagined that some usage of jiffies could be converted to use
this interface if it was appropriate. Since jiffies is hooked to the
tick, and the tick is getting more and more irregular, a clocksource
might be a relatively good replacement.

> I do feel making the abstraction clean and generic is a good thing just
> for code readability (and I very much appreciate your work here!), but
> I'm not really sure that the need for clocksource access outside the
> timekeeping subsystem has been well expressed. Do you have some other
> examples other then sched_clock that might show further uses for this
> abstraction?

I've converted latency tracing to an earlier version of the API , but I
don't have any other examples prepared. I think it's important to get
the API settled before I start converting anything else.

Daniel

2006-08-04 22:16:39

by john stultz

[permalink] [raw]

Subject: Re: [PATCH 08/10] -mm clocksource: cleanup on -mm

On Fri, 2006-08-04 at 14:11 -0700, Daniel Walker wrote:
> On Fri, 2006-08-04 at 12:53 -0700, john stultz wrote:
> >
> > Hmmmm. Yea, some additional discussion here would probably be needed
> >
> > At the moment, I'd prefer to keep the clocksource_adjust bits with the
> > timekeeping code, however I'd also prefer to remove the timekeeping
> > specific fields (cycle_last, cycle_interval, xtime_nsec, xtime_interval,
> > error) from the clocksource structure and instead keep them in a
> > timekeeping specific structure (which may also point to a clocksource).
> >
> > This would keep a clean separation between the clocksource's abstraction
> > that keeps as little state as possible and the timekeeping code's
> > internal state. However the point you bring up above is an interesting
> > issue: Do all users of the generic clocksource structure want the
> > clocksource to be NTP adjusted?
>
> Since the output from the clocksource is a lowlevel timestamp I don't
> think the users of it would want it to be ntp adjusted. It would also be
> a little odd, since the ntp adjustment would be attached only to a
> single clock.
>
> > If we allow for non-ntp adjusted access to the clocksources, we may have
> > consistency issues between users comparing say sched_clock() and
> > clock_gettime() intervals. Further, if those users do want NTP adjusted
> > counters, why aren't they just using the timekeeping subsystem?
>
> I imagine the users of the interface would be compartmentalized. Taking
> sched_clock as an example the output is only compared to itself and not
> to output from other interfaces.

Agreed on both points. Although I suspect this point will need to be
made explicit.

> > This does put some question as to what exactly would be the uses of the
> > clocksource structure outside of the timekeeping realm. Sure,
> > sched_clock() is a reasonable example, although since sched_clock has
> > such specific latency needs (we probably shouldn't go touching off-chip
> > hardware on every sched_clock call) and can be careful to avoid TSC skew
> > unlike the timekeeping code, its selection algorithm is going to be very
> > arch specific. So I'm not sure its really an ideal use of the
> > clocksource interface (as its not too difficult to just keep sched_clock
> > arch specific).
>
> Part of the reason to have a generic sched_clock() (and the generic
> clocksource interface in general) is to eliminate the inefficienty of
> duplicating shift and mult functionality in each arch (and on ARM it's
> per board).

Well, a coherent accumulation and NTP adjustment method for continuous
clocksources was a big motivator for the timekeeping work. Also the
quantity of duplicated arch specific time code is a bit larger then the
sched_clock(), but that itself isn't a mark against utilizing
clocksources for sched_clock().

> So if you correctly implement a clocksource structure for
> your hardware you will at least expose a usable sched_clock() and
> generic timeofday. Then if we add more users of the interface then more
> functionality is exposed.

Well, this point might need some work. sched_clock has quite a different
correctness/performance tradeoff when compared against timeofday. If one
correctly implements a clocksource for something like the ACPI PM, I
doubt they'd want to use it for sched_clock (due to its ~1us access
latency). Additionally, since sched_clock doesn't require (for its
original purpose, at least) the TSC synchronization that is essential
for timekeeping, how will sched_clock determine which clocksource to use
on a system were the TSC is unsyched and marked bad?

> Another instances of this is when instrumentation is needing a of fast
> low level timestamp. In the past to accomplish this one would need a per
> arch change to read a clock, then potentially duplicate a shift and mult
> type computation in order to covert to nanosecond. One good example of
> this is latency tracing in the -rt tree. I can imagine some good and
> valid instrumentation having a long road of acceptable because the time
> stamping portion would need to flow through several different arch and
> potentially board maintainers.

This sounds reasonable, but also I'd question if sched_clock or
get_cycles would be appropriate here. Further, if the mult/shift cost is
acceptable, why not just use the timeofday as the cost will be similar.

> I've also imagined that some usage of jiffies could be converted to use
> this interface if it was appropriate. Since jiffies is hooked to the
> tick, and the tick is getting more and more irregular, a clocksource
> might be a relatively good replacement.

Hmmm. That'd be a harder sell for me. Probably would want those users to
move to the timeofday, or alternatively, drive jiffies off of the
timekeeping code rather then the interrupt handler to ensure it stays
synced (something I'm plotting once the timekeeping code settles down).

> > I do feel making the abstraction clean and generic is a good thing just
> > for code readability (and I very much appreciate your work here!), but
> > I'm not really sure that the need for clocksource access outside the
> > timekeeping subsystem has been well expressed. Do you have some other
> > examples other then sched_clock that might show further uses for this
> > abstraction?
>
> I've converted latency tracing to an earlier version of the API , but I
> don't have any other examples prepared. I think it's important to get
> the API settled before I start converting anything else.

Again, I think your patch set looks good for the most part (its just the
last few bits I worry about). I'm very much interested to see where you
go with this, as I feel sched_clock (on i386 atleast) needs some love
and attention and I'm excited to see new uses for the clocksource
abstraction. However, I do want to make sure that we think the use cases
out to avoid over-engineering the wrong bits.

thanks
-john

2006-08-04 23:16:35

by Daniel Walker

[permalink] [raw]

Subject: Re: [PATCH 08/10] -mm clocksource: cleanup on -mm

On Fri, 2006-08-04 at 15:16 -0700, john stultz wrote:

> > I imagine the users of the interface would be compartmentalized. Taking
> > sched_clock as an example the output is only compared to itself and not
> > to output from other interfaces.
>
> Agreed on both points. Although I suspect this point will need to be
> made explicit.

Yeah, that's a good idea.

> > So if you correctly implement a clocksource structure for
> > your hardware you will at least expose a usable sched_clock() and
> > generic timeofday. Then if we add more users of the interface then more
> > functionality is exposed.
>
> Well, this point might need some work. sched_clock has quite a different
> correctness/performance tradeoff when compared against timeofday. If one
> correctly implements a clocksource for something like the ACPI PM, I
> doubt they'd want to use it for sched_clock (due to its ~1us access
> latency). Additionally, since sched_clock doesn't require (for its
> original purpose, at least) the TSC synchronization that is essential
> for timekeeping, how will sched_clock determine which clocksource to use
> on a system were the TSC is unsyched and marked bad?

sched_clock would use the highest rated clock in the system, and if that
becomes unstable it uses jiffies. That could mean using the acpi_pm if
it's the highest rated clock. Some code would have the be added to force
sched_clock to use the tsc.

I did some hackbench runs using the tsc vs. acpi_pm, and there was only
minimal differences (within the margin of error). It could be different
with other pm timers, but that was the result on mine. I also did some
tests of tsc vs. pit which showed some extensive differences. It added
10 seconds to "hackbench 80" . So I'm not entirely convinced that
acpi_pm is totally inappropriate as a fall back, in the case of
un-synced TSC.

I wish I had an HPET to test.

> > Another instances of this is when instrumentation is needing a of fast
> > low level timestamp. In the past to accomplish this one would need a per
> > arch change to read a clock, then potentially duplicate a shift and mult
> > type computation in order to covert to nanosecond. One good example of
> > this is latency tracing in the -rt tree. I can imagine some good and
> > valid instrumentation having a long road of acceptable because the time
> > stamping portion would need to flow through several different arch and
> > potentially board maintainers.
>
> This sounds reasonable, but also I'd question if sched_clock or
> get_cycles would be appropriate here. Further, if the mult/shift cost is
> acceptable, why not just use the timeofday as the cost will be similar.

get_cycles() isn't implemented on all arches. sched_clock() sometimes
returns jiffies converted to nanosecond depending on the arch (it does
this sometimes on i386 even). Also sched_clock() has the disadvantage of
converting to nanosecond each time it runs, which isn't always ideal.
get_cycles(), if it's implemented, doesn't come with a standard way to
find a) the clock it accesses b) the frequency of the clock.

So they both have disadvantages over the clocksource interface.

> > I've also imagined that some usage of jiffies could be converted to use
> > this interface if it was appropriate. Since jiffies is hooked to the
> > tick, and the tick is getting more and more irregular, a clocksource
> > might be a relatively good replacement.
>
> Hmmm. That'd be a harder sell for me. Probably would want those users to
> move to the timeofday, or alternatively, drive jiffies off of the
> timekeeping code rather then the interrupt handler to ensure it stays
> synced (something I'm plotting once the timekeeping code settles down).

It's case by case. I wouldn't say all jiffies uses could use timeofday
calls, and I wouldn't say they could all use a clocksource. I'd imagine
some could be converted to a clocksource though.

I'd be interested to see any jiffies changes you make.

> > > I do feel making the abstraction clean and generic is a good thing just
> > > for code readability (and I very much appreciate your work here!), but
> > > I'm not really sure that the need for clocksource access outside the
> > > timekeeping subsystem has been well expressed. Do you have some other
> > > examples other then sched_clock that might show further uses for this
> > > abstraction?
> >
> > I've converted latency tracing to an earlier version of the API , but I
> > don't have any other examples prepared. I think it's important to get
> > the API settled before I start converting anything else.
>
> Again, I think your patch set looks good for the most part (its just the
> last few bits I worry about). I'm very much interested to see where you
> go with this, as I feel sched_clock (on i386 atleast) needs some love
> and attention and I'm excited to see new uses for the clocksource
> abstraction. However, I do want to make sure that we think the use cases
> out to avoid over-engineering the wrong bits.

Your questions are certainly appropriate, and I appreciate the review.
There's not to many other responses so far.

Daniel