The latency tracer needs a way to get an accurate time
without grabbing any locks. Locks themselves might call into
the latency tracer and cause, at best, a slowdown.

This patch adds get_monotonic_cycles(), which returns cycles
from a reliable clock source in a monotonic fashion.
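
As an illustration (not part of the patch), a tracer could use the new
helpers roughly like this to report the length of a traced section:

	cycle_t t0, t1;

	t0 = get_monotonic_cycles();
	/* ... traced section ... */
	t1 = get_monotonic_cycles();
	printk(KERN_INFO "section took %lu usecs\n", cycles_to_usecs(t1 - t0));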
Signed-off-by: Steven Rostedt <[email protected]>
---
include/linux/clocksource.h | 3 ++
kernel/time/timekeeping.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 51 insertions(+)
Index: linux-compile-i386.git/kernel/time/timekeeping.c
===================================================================
--- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.000000000 -0500
+++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.000000000 -0500
@@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
timespec_add_ns(ts, nsecs);
}
+cycle_t notrace get_monotonic_cycles(void)
+{
+	cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
+
+	do {
+		/*
+		 * cycle_raw and cycle_last can change on
+		 * another CPU, and we need the delta calculation
+		 * of cycle_now and cycle_last to happen atomically,
+		 * as well as the addition to cycle_raw. We don't need
+		 * to grab any locks; we just keep trying until we get
+		 * all the calculations together in one consistent state.
+		 *
+		 * In fact, we __can't__ grab any locks. This
+		 * function is called from the latency_tracer which can
+		 * be called anywhere. To grab any locks (including
+		 * seq_locks) we risk putting ourselves into a deadlock.
+		 */
+		cycle_raw = clock->cycle_raw;
+		cycle_last = clock->cycle_last;
+
+		/* read clocksource: */
+		cycle_now = clocksource_read(clock);
+
+		/* calculate the delta since the last update_wall_time: */
+		cycle_delta = (cycle_now - cycle_last) & clock->mask;
+
+	} while (cycle_raw != clock->cycle_raw ||
+		 cycle_last != clock->cycle_last);
+
+	return cycle_raw + cycle_delta;
+}
+
+unsigned long notrace cycles_to_usecs(cycle_t cycles)
+{
+	u64 ret = cyc2ns(clock, cycles);
+
+	ret += NSEC_PER_USEC/2; /* For rounding in do_div() */
+	do_div(ret, NSEC_PER_USEC);
+
+	return ret;
+}
+
+cycle_t notrace usecs_to_cycles(unsigned long usecs)
+{
+	return ns2cyc(clock, (u64)usecs * 1000);
+}
+
/**
* getnstimeofday - Returns the time of day in a timespec
* @ts: pointer to the timespec to be set
Index: linux-compile-i386.git/include/linux/clocksource.h
===================================================================
--- linux-compile-i386.git.orig/include/linux/clocksource.h 2008-01-09 14:27:51.000000000 -0500
+++ linux-compile-i386.git/include/linux/clocksource.h 2008-01-09 14:29:44.000000000 -0500
@@ -273,6 +273,9 @@ extern int clocksource_register(struct c
extern struct clocksource* clocksource_get_next(void);
extern void clocksource_change_rating(struct clocksource *cs, int rating);
extern void clocksource_resume(void);
+extern cycle_t get_monotonic_cycles(void);
+extern unsigned long cycles_to_usecs(cycle_t cycles);
+extern cycle_t usecs_to_cycles(unsigned long usecs);
/* used to initialize clock */
extern struct clocksource clocksource_jiffies;
--
On Wed, 2008-01-09 at 18:29 -0500, Steven Rostedt wrote:
> +cycle_t notrace get_monotonic_cycles(void)
> +{
> + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> +
> + do {
> + /*
> + * cycle_raw and cycle_last can change on
> + * another CPU and we need the delta calculation
> + * of cycle_now and cycle_last happen atomic, as well
> + * as the adding to cycle_raw. We don't need to grab
> + * any locks, we just keep trying until get all the
> + * calculations together in one state.
> + *
> + * In fact, we __cant__ grab any locks. This
> + * function is called from the latency_tracer which can
> + * be called anywhere. To grab any locks (including
> + * seq_locks) we risk putting ourselves into a deadlock.
> + */
> + cycle_raw = clock->cycle_raw;
> + cycle_last = clock->cycle_last;
> +
> + /* read clocksource: */
> + cycle_now = clocksource_read(clock);
> +
> + /* calculate the delta since the last update_wall_time: */
> + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> +
> + } while (cycle_raw != clock->cycle_raw ||
> + cycle_last != clock->cycle_last);
> +
> + return cycle_raw + cycle_delta;
> +}
The last I checked, this change caused problems for me with the -rt
latency tracer. I haven't tested this tree, but all things being equal
I would imagine the problem exists here also.
Daniel
* Steven Rostedt ([email protected]) wrote:
> The latency tracer needs a way to get an accurate time
> without grabbing any locks. Locks themselves might call
> the latency tracer and cause at best a slow down.
>
> This patch adds get_monotonic_cycles that returns cycles
> from a reliable clock source in a monotonic fashion.
>
> Signed-off-by: Steven Rostedt <[email protected]>
> ---
> include/linux/clocksource.h | 3 ++
> kernel/time/timekeeping.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 51 insertions(+)
>
> Index: linux-compile-i386.git/kernel/time/timekeeping.c
> ===================================================================
> --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.000000000 -0500
> +++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.000000000 -0500
> @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> timespec_add_ns(ts, nsecs);
> }
>
> +cycle_t notrace get_monotonic_cycles(void)
> +{
> + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> +
> + do {
> + /*
> + * cycle_raw and cycle_last can change on
> + * another CPU and we need the delta calculation
> + * of cycle_now and cycle_last happen atomic, as well
> + * as the adding to cycle_raw. We don't need to grab
> + * any locks, we just keep trying until get all the
> + * calculations together in one state.
> + *
> + * In fact, we __cant__ grab any locks. This
> + * function is called from the latency_tracer which can
> + * be called anywhere. To grab any locks (including
> + * seq_locks) we risk putting ourselves into a deadlock.
> + */
I wonder what makes the compiler read the clock->cycle_raw and
clock->cycle_last variables twice? I guess some memory barriers could
be welcome here?
> + cycle_raw = clock->cycle_raw;
> + cycle_last = clock->cycle_last;
> +
> + /* read clocksource: */
> + cycle_now = clocksource_read(clock);
> +
> + /* calculate the delta since the last update_wall_time: */
> + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> +
> + } while (cycle_raw != clock->cycle_raw ||
> + cycle_last != clock->cycle_last);
> +
> + return cycle_raw + cycle_delta;
> +}
> +
> +unsigned long notrace cycles_to_usecs(cycle_t cycles)
> +{
> + u64 ret = cyc2ns(clock, cycles);
> +
> + ret += NSEC_PER_USEC/2; /* For rounding in do_div() */
> + do_div(ret, NSEC_PER_USEC);
> +
> + return ret;
> +}
> +
> +cycle_t notrace usecs_to_cycles(unsigned long usecs)
> +{
> + return ns2cyc(clock, (u64)usecs * 1000);
> +}
> +
> /**
> * getnstimeofday - Returns the time of day in a timespec
> * @ts: pointer to the timespec to be set
> Index: linux-compile-i386.git/include/linux/clocksource.h
> ===================================================================
> --- linux-compile-i386.git.orig/include/linux/clocksource.h 2008-01-09 14:27:51.000000000 -0500
> +++ linux-compile-i386.git/include/linux/clocksource.h 2008-01-09 14:29:44.000000000 -0500
> @@ -273,6 +273,9 @@ extern int clocksource_register(struct c
> extern struct clocksource* clocksource_get_next(void);
> extern void clocksource_change_rating(struct clocksource *cs, int rating);
> extern void clocksource_resume(void);
> +extern cycle_t get_monotonic_cycles(void);
> +extern unsigned long cycles_to_usecs(cycle_t cycles);
> +extern cycle_t usecs_to_cycles(unsigned long usecs);
>
> /* used to initialize clock */
> extern struct clocksource clocksource_jiffies;
>
> --
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Signed-off-by: Steven Rostedt <[email protected]>
> > ---
> > include/linux/clocksource.h | 3 ++
> > kernel/time/timekeeping.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > 2 files changed, 51 insertions(+)
> >
> > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > ===================================================================
> > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.000000000 -0500
> > +++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.000000000 -0500
> > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > timespec_add_ns(ts, nsecs);
> > }
> >
> > +cycle_t notrace get_monotonic_cycles(void)
> > +{
> > + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > +
> > + do {
> > + /*
> > + * cycle_raw and cycle_last can change on
> > + * another CPU and we need the delta calculation
> > + * of cycle_now and cycle_last happen atomic, as well
> > + * as the adding to cycle_raw. We don't need to grab
> > + * any locks, we just keep trying until get all the
> > + * calculations together in one state.
> > + *
> > + * In fact, we __cant__ grab any locks. This
> > + * function is called from the latency_tracer which can
> > + * be called anywhere. To grab any locks (including
> > + * seq_locks) we risk putting ourselves into a deadlock.
> > + */
>
> I wonder what makes the compiler read the clock->cycle_raw and
> clock->cycle_last variables twice ? I guess some memory barriers could
> be welcome here ?
We need both cycle_raw and cycle_last to be the same from the time we read
the clock source to the time we calculate cycle_delta. If either one
changes, then the delta is bogus.
Also, it just occurred to me that this is an old patch. I thought I
renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> > + cycle_raw = clock->cycle_raw;
> > + cycle_last = clock->cycle_last;
> > +
> > + /* read clocksource: */
> > + cycle_now = clocksource_read(clock);
> > +
> > + /* calculate the delta since the last update_wall_time: */
> > + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > +
> > + } while (cycle_raw != clock->cycle_raw ||
> > + cycle_last != clock->cycle_last);
> > +
> > + return cycle_raw + cycle_delta;
> > +}
-- Steve
On Tue, 15 Jan 2008, Steven Rostedt wrote:
>
> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
Ah, I changed this in the -rt patch queue, and never moved the patch back
here.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
>
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> > >
> > > Signed-off-by: Steven Rostedt <[email protected]>
> > > ---
> > > include/linux/clocksource.h | 3 ++
> > > kernel/time/timekeeping.c | 48 ++++++++++++++++++++++++++++++++++++++++++++
> > > 2 files changed, 51 insertions(+)
> > >
> > > Index: linux-compile-i386.git/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- linux-compile-i386.git.orig/kernel/time/timekeeping.c 2008-01-09 14:27:26.000000000 -0500
> > > +++ linux-compile-i386.git/kernel/time/timekeeping.c 2008-01-09 14:34:40.000000000 -0500
> > > @@ -103,6 +103,54 @@ static inline void __get_realtime_clock_
> > > timespec_add_ns(ts, nsecs);
> > > }
> > >
> > > +cycle_t notrace get_monotonic_cycles(void)
> > > +{
> > > + cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> > > +
> > > + do {
> > > + /*
> > > + * cycle_raw and cycle_last can change on
> > > + * another CPU and we need the delta calculation
> > > + * of cycle_now and cycle_last happen atomic, as well
> > > + * as the adding to cycle_raw. We don't need to grab
> > > + * any locks, we just keep trying until get all the
> > > + * calculations together in one state.
> > > + *
> > > + * In fact, we __cant__ grab any locks. This
> > > + * function is called from the latency_tracer which can
> > > + * be called anywhere. To grab any locks (including
> > > + * seq_locks) we risk putting ourselves into a deadlock.
> > > + */
> >
> > I wonder what makes the compiler read the clock->cycle_raw and
> > clock->cycle_last variables twice ? I guess some memory barriers could
> > be welcome here ?
>
> We need both cycle_raw and cycle_last to be the same from the time we read
> the clock source to the time we calculate cycle_delta. If either one
> changes then delta is bogus.
>
Ok, but what actually ensures that the clock->cycle_* reads won't be
reordered across the clocksource_read()?
> Also, it just occurred to me that this is an old patch. I thought I
> renamed cycle_raw to cycle_monotonic. But I must have lost that patch :-/
>
> >
> > > + cycle_raw = clock->cycle_raw;
> > > + cycle_last = clock->cycle_last;
> > > +
> > > + /* read clocksource: */
> > > + cycle_now = clocksource_read(clock);
> > > +
> > > + /* calculate the delta since the last update_wall_time: */
> > > + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > +
> > > + } while (cycle_raw != clock->cycle_raw ||
> > > + cycle_last != clock->cycle_last);
> > > +
> > > + return cycle_raw + cycle_delta;
> > > +}
>
> -- Steve
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> Ok, but what actually insures that the clock->cycle_* reads won't be
> reordered across the clocksource_read() ?
<looks at code>
Hmm, interesting. I didn't notice that clocksource_read() is a static
inline. I was thinking that since it was passing a pointer to a function,
gcc could not assume that it could move that code across it. But now I
see that clocksource_read() is simply a static inline that does:

	cs->read();

But still, can gcc assume that it can push loads of variables of unknown
origin across function calls? So something like:
	static int *glob;

	void foo(void) {
		int x;

		x = *glob;

		bar();

		if (x != *glob)
			/* ... */
	}
I can't see how any compiler could honestly move the load of x to after
the call to bar(), with glob pointing to some unknown variable that may
be perfectly fine for bar() to modify.
> > >
> > > > + cycle_raw = clock->cycle_raw;
> > > > + cycle_last = clock->cycle_last;
> > > > +
> > > > + /* read clocksource: */
> > > > + cycle_now = clocksource_read(clock);
So the question here is, can cycle_raw and cycle_last be loaded from
the unknown source that clock points to after the call to
clocksource_read()?
I'm thinking not.
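
If we did not want to rely on the function call alone, an explicit
compiler barrier would make the intent obvious. Something like this
(untested sketch; barrier() is the stock compiler barrier):

	cycle_raw = clock->cycle_raw;
	cycle_last = clock->cycle_last;

	/* compiler barrier: keep gcc from sinking the loads above */
	barrier();

	/* read clocksource: */
	cycle_now = clocksource_read(clock);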
> > > > +
> > > > + /* calculate the delta since the last update_wall_time: */
> > > > + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > +
> > > > + } while (cycle_raw != clock->cycle_raw ||
> > > > + cycle_last != clock->cycle_last);
> > > > +
> > > > + return cycle_raw + cycle_delta;
> > > > +}
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Ok, but what actually insures that the clock->cycle_* reads won't be
> > reordered across the clocksource_read() ?
>
> <looks at code>
>
> Hmm, interesting.I didn't notice that clocksource_read() is a static
> inline. I was thinking that since it was passing a pointer to a function,
> gcc could not assume that it could move that code across it. But now
> looking to see that clocksource_read is simply a static inline that does:
>
> cs->read();
>
> But still, can gcc assume that it can push loads of unknown origin
> variables across function calls? So something like:
>
> static int *glob;
>
> void foo(void) {
> int x;
>
> x = *glob;
>
> bar();
>
> if (x != *glob)
> /* ... */
> }
>
> I can't see how any compiler could honestly move the loading of the first
> x after the calling of bar(). With glob pointing to some unknown
> variable, that may be perfectly fine for bar to modify.
>
>
> > > >
> > > > > + cycle_raw = clock->cycle_raw;
> > > > > + cycle_last = clock->cycle_last;
> > > > > +
> > > > > + /* read clocksource: */
> > > > > + cycle_now = clocksource_read(clock);
>
> So the question here is,can cycle_raw and cycle_last be loaded from
> the unknown source that clock points to after the call to
> clocksource_read()?
>
> I'm thinking not.
>
I agree with you that I don't see how the compiler could reorder this,
so we can forget about compiler barriers. Also, the clock source used is
a synchronized clock source (get_cycles_sync on x86_64), so it should
make sure the TSC is read at the right moment.

However, what happens if the clock source is, say, the jiffies?

In this case, we have:
	static cycle_t jiffies_read(void)
	{
		return (cycle_t) jiffies;
	}

Which is nothing more than a memory read of:

	extern unsigned long volatile __jiffy_data jiffies;
I think it is wrong to assume that reads from clock->cycle_raw and from
jiffies will be ordered correctly in SMP. I am tempted to think that
ordering memory writes to clock->cycle_raw vs jiffies is also needed in this
case (where clock->cycle_raw is updated, or where jiffies is updated).
We can fall into the same kind of issue if we read the HPET, which is
memory-I/O based. It does not seem correct to assume that MMIO vs
normal memory reads are ordered. (pointing back to this article:
http://lwn.net/Articles/198988/)
Mathieu
> > > > > +
> > > > > + /* calculate the delta since the last update_wall_time: */
> > > > > + cycle_delta = (cycle_now - cycle_last) & clock->mask;
> > > > > +
> > > > > + } while (cycle_raw != clock->cycle_raw ||
> > > > > + cycle_last != clock->cycle_last);
> > > > > +
> > > > > + return cycle_raw + cycle_delta;
> > > > > +}
>
>
> -- Steve
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
[ CC'd Daniel Walker, since he had problems with this code ]
On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
>
> I agree with you that I don't see how the compiler could reorder this.
> So we forget about compiler barriers. Also, the clock source used is a
> synchronized clock source (get_cycles_sync on x86_64), so it should make
> sure the TSC is read at the right moment.
>
> However, what happens if the clock source is, say, the jiffies ?
>
> Is this case, we have :
>
> static cycle_t jiffies_read(void)
> {
> return (cycle_t) jiffies;
> }
>
> Which is nothing more than a memory read of
>
> extern unsigned long volatile __jiffy_data jiffies;
Yep, and that's not my concern.
>
> I think it is wrong to assume that reads from clock->cycle_raw and from
> jiffies will be ordered correctly in SMP. I am tempted to think that
> ordering memory writes to clock->cycle_raw vs jiffies is also needed in this
> case (where clock->cycle_raw is updated, or where jiffies is updated).
>
> We can fall in the same kind of issue if we read the HPET, which is
> memory I/O based. It does not seems correct to assume that MMIO vs
> normal memory reads are ordered. (pointing back to this article :
> http://lwn.net/Articles/198988/)
That, and the dreaded memory barrier thread that my head is still
spinning on.

OK, let's take a close look at the code in question. I may be wrong, and
if so, great, we can fix it.
We have this in get_monotonic_cycles:
	{
		cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

		do {
			cycle_monotonic = clock->cycle_monotonic;
			cycle_last = clock->cycle_last;
			cycle_now = clocksource_read(clock);
			cycle_delta = (cycle_now - cycle_last) & clock->mask;
		} while (cycle_monotonic != clock->cycle_monotonic ||
			 cycle_last != clock->cycle_last);

		return cycle_monotonic + cycle_delta;
	}
and this in clocksource.h
	static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
	{
		cycle_t offset = (now - cs->cycle_last) & cs->mask;

		cs->cycle_last = now;
		cs->cycle_accumulated += offset;
		cs->cycle_monotonic += offset;
	}
now is usually just a clocksource_read() passed in.
The goal is to have get_monotonic_cycles always return something that is
greater than what was read the last time.
Let's make a few assumptions now (for others to shoot them down). One
thing is that we don't need to worry too much about MMIO, because we are
doing a read. This means we need the data right now to continue. So the
read being a function call should keep gcc from moving stuff around, and
since we are doing an I/O read, the order of events should be pretty much
synchronized:
1. load cycle_last and cycle_monotonic (we don't care which order)*
2. read clock source
3. calculate delta and while() compare (order doesn't matter)
* we might care (see below)
If the above is incorrect, then we need to fix get_monotonic_cycles.
In clocksource_accumulate, we have:

	offset = ((now = cs->read()) - cycle_last) & cs->mask;
	cycle_last = now;
	cycle_accumulate += offset;
	cycle_monotonic += offset;
The order of events here is as follows. Using the same reasoning as
above, the read must happen first and complete, because for gcc it's a
function, and for I/O, it needs to return data.
1. cs->read
2. update cycle_last, cycle_accumulate, cycle_monotonic.
Can we assume, if the above for get_monotonic_cycles is correct, that
since we read and compare cycle_last and cycle_monotonic, neither of
them has changed over the read? If so, we have a snapshot of
clocksource_accumulate.

So the worst thing that I can think of is that cycle_monotonic is
updated *before* cycle_last:
	cycle_monotonic += offset;
		<get_monotonic_cycles run on other CPU>
	cycle_last = now;
cycle_last = 5
cycle_monotonic = 0

CPU 0                               CPU 1
----------                          -------------
cs->read() = 10
offset = 10 - 5 = 5
cycle_monotonic = 5
                                    cycle_monotonic = 5
                                    cycle_last = 5
                                    cs->read() = 11
                                    delta = 11 - 5 = 6
                                    cycle_monotonic and cycle_last still same
                                    return 5 + 6 = 11
cycle_last = 10
                                    cycle_monotonic = 5
                                    cycle_last = 10
                                    cs->read() = 12
                                    delta = 12 - 10 = 2
                                    cycle_monotonic and cycle_last still same
                                    return 5 + 2 = 7

**** ERROR *****
So, we *do* need memory barriers. Looks like cycle_last and
cycle_monotonic need to be synchronized.
OK, will this do?
	cycle_t notrace get_monotonic_cycles(void)
	{
		cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;

		do {
			cycle_monotonic = clock->cycle_monotonic;
			smp_rmb();
			cycle_last = clock->cycle_last;
			cycle_now = clocksource_read(clock);
			cycle_delta = (cycle_now - cycle_last) & clock->mask;
		} while (cycle_monotonic != clock->cycle_monotonic ||
			 cycle_last != clock->cycle_last);

		return cycle_monotonic + cycle_delta;
	}
and this in clocksource.h
	static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
	{
		cycle_t offset = (now - cs->cycle_last) & cs->mask;

		cs->cycle_last = now;
		smp_wmb();
		cs->cycle_accumulated += offset;
		cs->cycle_monotonic += offset;
	}
We may still get to a situation where cycle_monotonic is of the old value
and cycle_last is of the new value. That would give us a smaller delta
than we want.
Let's look at this with a slightly different situation.

cycle_last = 5
cycle_monotonic = 0

CPU 0                               CPU 1
----------                          -------------
cs->read() = 10
offset = 10 - 5 = 5
cycle_last = 10
cycle_monotonic = 5
                                    cycle_monotonic = 5
                                    cycle_last = 10
                                    cs->read() = 12
                                    delta = 12 - 10 = 2
                                    cycle_monotonic and cycle_last still same
                                    return 5 + 2 = 7
cs->read() = 13
offset = 13 - 10 = 3
cycle_last = 13
                                    cycle_monotonic = 5
                                    cycle_last = 13
                                    cs->read() = 14
                                    delta = 14 - 13 = 1
                                    cycle_monotonic and cycle_last still same
                                    return 5 + 1 = 6

**** ERROR ****
Crap, looks like we do need stronger locking here :-(
Hmm, I might as well just use seq_locks, and make sure that tracing
does not hit them.
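
For reference, a seqlock version of the reader would look roughly like
this (untested sketch; it assumes reusing the existing xtime_lock
seqlock, and the tracer must never trace inside the seqlock primitives
themselves):

	cycle_t notrace get_monotonic_cycles(void)
	{
		cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;
		unsigned long seq;

		do {
			seq = read_seqbegin(&xtime_lock);
			cycle_monotonic = clock->cycle_monotonic;
			cycle_last = clock->cycle_last;
			cycle_now = clocksource_read(clock);
			cycle_delta = (cycle_now - cycle_last) & clock->mask;
		} while (read_seqretry(&xtime_lock, seq));

		return cycle_monotonic + cycle_delta;
	}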
Thanks!
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> [ CC'd Daniel Walker, since he had problems with this code ]
>
> On Tue, 15 Jan 2008, Mathieu Desnoyers wrote:
> >
> > I agree with you that I don't see how the compiler could reorder this.
> > So we forget about compiler barriers. Also, the clock source used is a
> > synchronized clock source (get_cycles_sync on x86_64), so it should make
> > sure the TSC is read at the right moment.
> >
> > However, what happens if the clock source is, say, the jiffies ?
> >
> > Is this case, we have :
> >
> > static cycle_t jiffies_read(void)
> > {
> > return (cycle_t) jiffies;
> > }
> >
> > Which is nothing more than a memory read of
> >
> > extern unsigned long volatile __jiffy_data jiffies;
>
> Yep, and that's not my concern.
>
Hrm, I will reply to the rest of this email in a separate mail, but
there is another concern, simpler than memory ordering, that just hit
me:

What if we have CPU A calling clocksource_accumulate while CPU B is
calling get_monotonic_cycles, but events happen in the following order
(because of preemption or interrupts)? Here, to make things worse, we
would be on x86, where a cycle_t write (64 bits) is not atomic:
CPU A                               CPU B

clocksource read
update cycle_mono (1st 32 bits)
                                    read cycle_mono
                                    read cycle_last
                                    clocksource read
                                    read cycle_mono
                                    read cycle_last
update cycle_mono (2nd 32 bits)
update cycle_last
update cycle_acc
Therefore, we have:
- an inconsistent cycle_monotonic value
- inconsistent cycle_monotonic and cycle_last values.

Or is there something I have missed?
If you really want a seqlock-free algorithm (I _do_ want this for
tracing!) :) maybe going in the RCU direction could help (I refer to my
RCU-based 32-to-64 bit lockless timestamp counter extension, which
could be turned into the clocksource updater).
Mathieu
> >
> > I think it is wrong to assume that reads from clock->cycle_raw and from
> > jiffies will be ordered correctly in SMP. I am tempted to think that
> > ordering memory writes to clock->cycle_raw vs jiffies is also needed in this
> > case (where clock->cycle_raw is updated, or where jiffies is updated).
> >
> > We can fall in the same kind of issue if we read the HPET, which is
> > memory I/O based. It does not seems correct to assume that MMIO vs
> > normal memory reads are ordered. (pointing back to this article :
> > http://lwn.net/Articles/198988/)
>
> That and the dread memory barrier thread that my head is still spinning
> on.
>
> Ok, lets take a close look at the code in question. I may be wrong, and if
> so, great, we can fix it.
>
> We have this in get_monotonic_cycles:
>
> {
> cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;
> do {
> cycle_monotonic = clock->cycle_monotonic;
> cycle_last = clock->cycle_last;
> cycle_now = clocksource_read(clock);
> cycle_delta = (cycle_now - cycle_last) & clock->mask;
> } while (cycle_monotonic != clock->cycle_monotonic ||
> cycle_last != clock->cycle_last);
> return cycle_monotonic + cycle_delta;
> }
>
> and this in clocksource.h
>
> static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> {
> cycle_t offset = (now - cs->cycle_last) & cs->mask;
> cs->cycle_last = now;
> cs->cycle_accumulated += offset;
> cs->cycle_monotonic += offset;
> }
>
> now is usually just a clocksource_read() passed in.
>
> The goal is to have clock_monotonic always return something that is
> greater than what was read the last time.
>
> Let's make a few assumptions now (for others to shoot them down). One
> thing is that we don't need to worry too much about MMIO, because we are
> doing a read. This means we need the data right now to contiune. So the
> read being a function call should keep gcc from moving stuff around, and
> since we are doing an IO read, the order of events should be pretty much
> synchronized. in
>
> 1. load cycle_last and cycle_monotonic (we don't care which order)*
> 2. read clock source
> 3. calculate delta and while() compare (order doesn't matter)
>
> * we might care (see below)
>
> If the above is incorrect, then we need to fix get_monotonic_cycles.
>
> in clocksource_accumulate, we have:
>
> offset = ((now = cs->read()) - cycle_last) & cs->mask;
> cycle_last = now;
> cycle_accumulate += offset;
> cycle_monotonic += offset;
>
> The order of events here are. Using the same reasoning as above, the read
> must be first and completed because for gcc it's a function, and for IO,
> it needs to return data.
>
> 1. cs->read
> 2. update cycle_last, cycle_accumulate, cycle_monotonic.
>
> Can we assume, if the above for get_monotonic_cycles is correct, that
> since we read and compare cycle_last and cycle_monotonic, that neither of
> them have changed over the read? So we have a snapshot of the
> clocksource_accumulate.
>
> So the worst thing that I can think of, is that cycle_monotonic is update
> *before* cycle_last:
>
> cycle_monotonic += offest;
> <get_monotonic_cycles run on other CPU>
> cycle_last = now;
>
>
> cycle_last = 5
> cycle_monotonic = 0
>
>
> CPU 0 CPU 1
> ---------- -------------
> cs->read() = 10
> offset = 10 - 5 = 5
> cycle_monotonic = 5
> cycle_monotonic = 5
> cycle_last = 5
> cs->read() = 11
> delta = 11 - 5 = 6
> cycle_monotonic and cycle_last still same
> return 5 + 6 = 11
>
> cycle_last = 10
>
> cycle_monotonic = 5
> cycle_last = 10
> cs->read() = 12
> delta = 12 - 10 = 2
> cycle_monotonic and cycle_last still same
> return 5 + 2 = 7
>
> **** ERROR *****
>
> So, we *do* need memory barriers. Looks like cycle_last and
> cycle_monotonic need to be synchronized.
>
> OK, will this do?
>
> cycle_t notrace get_monotonic_cycles(void)
> {
> cycle_t cycle_now, cycle_delta, cycle_monotonic, cycle_last;
> do {
> cycle_monotonic = clock->cycle_monotonic;
> smp_rmb();
> cycle_last = clock->cycle_last;
> cycle_now = clocksource_read(clock);
> cycle_delta = (cycle_now - cycle_last) & clock->mask;
> } while (cycle_monotonic != clock->cycle_monotonic ||
> cycle_last != clock->cycle_last);
> return cycle_monotonic + cycle_delta;
> }
>
> and this in clocksource.h
>
> static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> {
> cycle_t offset = (now - cs->cycle_last) & cs->mask;
> cs->cycle_last = now;
> smp_wmb();
> cs->cycle_accumulated += offset;
> cs->cycle_monotonic += offset;
> }
>
> We may still get to a situation where cycle_monotonic is of the old value
> and cycle_last is of the new value. That would give us a smaller delta
> than we want.
>
> Lets look at this, with a slightly different situation.
>
> cycle_last = 5
> cycle_monotonic = 0
>
>
> CPU 0 CPU 1
> ---------- -------------
> cs->read() = 10
> offset = 10 - 5 = 5
> cycle_last = 10
> cycle_monotonic = 5
>
> cycle_monotonic = 5
> cycle_last = 10
> cs->read() = 12
> delta = 12 - 10 = 2
> cycle_monotonic and cycle_last still same
> return 5 + 2 = 7
>
>
> cs->read() = 13
> offset = 13 - 10 = 2
> cycle_last = 13
>
> cycle_monotonic = 5
> cycle_last = 13
> cs->read() = 14
> delta = 14 - 13 = 1
> cycle_monotonic and cycle_last still same
> return 5 + 1 = 6
>
> **** ERROR ****
>
> Crap, looks like we do need a stronger locking here :-(
>
> Hmm, I might as well just use seq_locks, and make sure that tracing
> does not hit them.
>
> Thanks!
>
> -- Steve
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> Hrm, I will reply to the rest of this email in a separate mail, but
> there is another concern, simpler than memory ordering, that just hit
> me :
>
> If we have CPU A calling clocksource_accumulate while CPU B is calling
> get_monotonic_cycles, but events happens in the following order (because
> of preemption or interrupts). Here, to make things worse, we would be on
> x86 where cycle_t is not an atomic write (64 bits) :
>
>
> CPU A CPU B
>
> clocksource read
> update cycle_mono (1st 32 bits)
> read cycle_mono
> read cycle_last
> clocksource read
> read cycle_mono
> read cycle_last
> update cycle_mono (2nd 32 bits)
> update cycle_last
> update cycle_acc
>
> Therefore, we have :
> - an inconsistant cycle_monotonic value
> - inconsistant cycle_monotonic and cycle_last values.
>
> Or is there something I have missed ?
No, there are probably issues there too, but no need to worry about it,
since I already showed that allowing clocksource_accumulate to happen
inside the get_monotonic_cycles loop is already flawed.
>
> If you really want an seqlock free algorithm (I _do_ want this for
> tracing!) :) maybe going in the RCU direction could help (I refer to my
> RCU-based 32-to-64 bits lockless timestamp counter extension, which
> could be turned into the clocksource updater).
I know you pointed me to the code, but let's assume that I'm still
ignorant ;-)

Do you actually use the RCU internals, or do you just reimplement an RCU
algorithm?
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > Hrm, I will reply to the rest of this email in a separate mail, but
> > there is another concern, simpler than memory ordering, that just hit
> > me :
> >
> > If we have CPU A calling clocksource_accumulate while CPU B is calling
> > get_monotonic_cycles, but events happens in the following order (because
> > of preemption or interrupts). Here, to make things worse, we would be on
> > x86 where cycle_t is not an atomic write (64 bits) :
> >
> >
> > CPU A CPU B
> >
> > clocksource read
> > update cycle_mono (1st 32 bits)
> > read cycle_mono
> > read cycle_last
> > clocksource read
> > read cycle_mono
> > read cycle_last
> > update cycle_mono (2nd 32 bits)
> > update cycle_last
> > update cycle_acc
> >
> > Therefore, we have :
> > - an inconsistant cycle_monotonic value
> > - inconsistant cycle_monotonic and cycle_last values.
> >
> > Or is there something I have missed ?
>
> No, there's probably issues there too, but no need to worry about it,
> since I already showed that allowing for clocksource_accumulate to happen
> inside the get_monotonic_cycles loop is already flawed.
>
Yep, I just re-read through your previous email, and totally agree that
the algorithm is flawed in the way you pointed out.
> >
> > If you really want an seqlock free algorithm (I _do_ want this for
> > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > could be turned into the clocksource updater).
>
> I know you pointed me the code, but lets assume that I'm still ignorant
> ;-)
>
> do you actually use the RCU internals? or do you just reimplement an RCU
> algorithm?
>
Nope, I don't use RCU internals in this code. Preempt disable seemed
like the best way to handle this utterly short code path, and I wanted
the write side to be fast enough to be called periodically. What I do is:

- Disable preemption at the read-side:
  it makes sure the pointer I get will point to a data structure that
  will never change while I am in the preempt-disabled code. (see *)
- I use per-cpu data to allow the read-side to be as fast as possible
  (only need to disable preemption, does not race against other CPUs and
  won't generate cache line bouncing). It also allows dealing with
  unsynchronized TSCs if needed.
- Periodic write side: it's called from an IPI running on each CPU.

(*) We expect the read-side (preempt-off region) to be shorter than
the interval between IPI updates so we can guarantee the data structure
it uses won't be modified underneath it. Since the IPI update is
launched every second or so (depends on the frequency of the counter we
are trying to extend), it's more than OK.
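
Roughly, the read side looks like this (illustrative sketch only; the
names and struct layout are made up for the example, not the actual
LTTng code):

	struct time_snapshot {
		u64 base;	/* synthetic 64-bit time at the last IPI update */
		u32 last_lsb;	/* low 32 bits of the counter at that update */
	};

	/* the per-CPU IPI updater prepares a fresh snapshot and publishes it here */
	static DEFINE_PER_CPU(struct time_snapshot *, current_snap);

	static u64 notrace read_synthetic_time(void)
	{
		struct time_snapshot *snap;
		u32 delta;
		u64 ret;

		preempt_disable();	/* stay on this CPU's data */
		snap = __get_cpu_var(current_snap);
		/* unsigned 32-bit math absorbs one counter wrap since the update */
		delta = (u32)get_cycles() - snap->last_lsb;
		ret = snap->base + delta;
		preempt_enable();

		return ret;
	}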
Mathieu
> -- Steve
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > No, there's probably issues there too, but no need to worry about it,
> > since I already showed that allowing for clocksource_accumulate to happen
> > inside the get_monotonic_cycles loop is already flawed.
> >
>
> Yep, I just re-read through your previous email, and totally agree that
> the algorithm is flawed in the way you pointed out.
Yeah, but if we replace the loop with a seq lock, then it would work.
albeit, more cacheline bouncing (caused by writes). (maybe not, see below)
> > do you actually use the RCU internals? or do you just reimplement an RCU
> > algorithm?
> >
>
> Nope, I don't use RCU internals in this code. Preempt disable seemed
> like the best way to handle this utterly short code path and I wanted
> the write side to be fast enough to be called periodically. What I do is:
grmble. Then how do you trace preempt_disable? As my tracer does that
(see the last patch in the series).
>
> - Disable preemption at the read-side :
> it makes sure the pointer I get will point to a data structure that
> will never change while I am in the preempt disabled code. (see *)
> - I use per-cpu data to allow the read-side to be as fast as possible
> (only need to disable preemption, does not race against other CPUs and
> won't generate cache line bouncing). It also allows dealing with
> unsynchronized TSCs if needed.
> - Periodical write side : it's called from an IPI running on each CPU.
>
> (*) We expect the read-side (preempt off region) to last shorter than
> the interval between IPI updates so we can guarantee the data structure
> it uses won't be modified underneath it. Since the IPI update is
> launched each seconds or so (depends on the frequency of the counter we
> are trying to extend), it's more than ok.
One thing I want to clear up. The major difference between this
latency_tracer and LTTng is what we consider fast paths. The latency
tracer is recording things like enabling and disabling interrupts, preempt
count changes, or simply profiling all function calls. Those are what I
consider fast paths. The slow path WRT the latency_tracer are things like
context switches. This is why I don't have a problem with copying the
comm at context switch time. Because that _is_ a slow path for the latency
tracer.
Placing a read_seqlock in get_monotonic_cycles would not be that bad,
since the only slowdown would be the rmb. read_seqlocks don't modify
global data. Only the write_seqlock does. So the cache line bouncing would
only happen on updates in clocksource_accumulate. But then after the
caches are all balanced again, the reads will continue fine.
Question: Is a cache-miss a greater cost than a read to a clocksource
(besides the TSC)?
Also note how I arrange these variables in the clock struct:
	struct {
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;
I could do the following:
	struct {
		seqlock_t cycle_lock;
		cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
		cycle_t cycle_interval;
	} ____cacheline_aligned_in_smp;
Which would help to keep all these in the same cache line. These are all
updated at the same time, and hopefully this will keep the cache line
bouncing limited to a single cacheline.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
>
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> > > No, there's probably issues there too, but no need to worry about it,
> > > since I already showed that allowing for clocksource_accumulate to happen
> > > inside the get_monotonic_cycles loop is already flawed.
> > >
> >
> > Yep, I just re-read through your previous email, and totally agree that
> > the algorithm is flawed in the way you pointed out.
>
> Yeah, but if we replace the loop with a seq lock, then it would work.
> albeit, more cacheline bouncing (caused by writes). (maybe not, see below)
>
Yes, but then you would trigger a deadlock if you instrument code called
from NMI, SMI, MCE contexts :(
grep -ri NMI drivers/* arch/* |grep -vi PNMI
is quite interesting: actually, it shows that a few spots need to handle
those "so special interrupts": watchdogs, oprofile, virtualization and
much more in architecture-specific code.

I just would not like to add a tracer to the kernel that is _so_
intrusive that module writers and architecture maintainers would have to
audit their code and think about the tracer for each implementation that
deals with these kinds of interrupts.
> > > do you actually use the RCU internals? or do you just reimplement an RCU
> > > algorithm?
> > >
> >
> > Nope, I don't use RCU internals in this code. Preempt disable seemed
> > like the best way to handle this utterly short code path and I wanted
> > the write side to be fast enough to be called periodically. What I do is:
>
> grmble. Then how do you trace preempt_disable? As my tracer does that
> (see the last patch in the series).
>
I think using a kind of preempt_disable_notrace() would make sense here.
I mean.. even if you use the seqlock, eventually, you will want to trace
the seqlock behavior. Then you have to find the lightest way to do some
sort of synchronization that will have a predictable execution time. A
seqlock has the following disadvantage: if the seqlock read has to wait
for the write seqlock to end, we add some time to the execution of the
code we are trying to profile, which will mix up the results. On the
other hand, if the read side executes in a constant number of cycles
which does not depend on the write-side activity, then we get a clearer
picture of where the time should be accounted. We can even create a
module that will figure out how many nanoseconds are spent reading the
clock so we can subtract this time from our analysis if required.
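
For instance, something as dumb as the following (untested sketch,
using the get_monotonic_cycles()/cycles_to_usecs() helpers from this
patch) could estimate that read cost:

	static int __init clock_read_cost_init(void)
	{
		cycle_t t0, t1;
		int i;

		t0 = get_monotonic_cycles();
		for (i = 0; i < 100000; i++)
			(void)get_monotonic_cycles();
		t1 = get_monotonic_cycles();

		printk(KERN_INFO "100000 clock reads took %lu usecs\n",
		       cycles_to_usecs(t1 - t0));
		return 0;
	}
	module_init(clock_read_cost_init);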
That's why, having to choose between a read seqlock and preemption
disabling for the read side, I would strongly prefer preemption
disabling (constant execution time, and it's deadlock-free).
> >
> > - Disable preemption at the read-side :
> > it makes sure the pointer I get will point to a data structure that
> > will never change while I am in the preempt disabled code. (see *)
> > - I use per-cpu data to allow the read-side to be as fast as possible
> > (only need to disable preemption, does not race against other CPUs and
> > won't generate cache line bouncing). It also allows dealing with
> > unsynchronized TSCs if needed.
> > - Periodical write side : it's called from an IPI running on each CPU.
> >
> > (*) We expect the read-side (preempt off region) to last shorter than
> > the interval between IPI updates so we can guarantee the data structure
> > it uses won't be modified underneath it. Since the IPI update is
> > launched each seconds or so (depends on the frequency of the counter we
> > are trying to extend), it's more than ok.
>
> One thing I want to clear up. The major difference between this
> latency_tracer and LTTng is what we consider fast paths. The latency
> tracer is recording things like enabling and disabling interrupts, preempt
> count changes, or simply profiling all function calls. Those are what I
> consider fast paths. The slow path WRT the latency_tracer are things like
> context switches. This is why I don't have a problem with copying the
> comm at context switch time. Because that _is_ a slow path for the latency
> tracer.
LTTng hooks into the lockdep tracer to trace irq on/off, spinlocks, etc.
In flight-recorder mode, we have nothing to write to disk and therefore
we can handle very frequent events. We then do the analysis off-line
using the last MB written in the buffers. The advantage is that the
kernel dumbly writes data to a buffer: we therefore move the complexity
to user-space.
I agree that some kind of tracing, like the one you are doing, might be
done more efficiently if you do a first clever analysis phase directly
in the kernel without writing the raw high event rate data in memory
buffers. However, I believe that it would be more powerful if we combine
the two approaches rather than trying to do everything in or out of the
kernel. LTTng could provide the comm names, priorities, etc, and your
tracer could provide the top X list of processes that had a bad
behavior. It would mean that the complete overall information would be
made available after a post-processing phase done in an analysis tool
like LTTV, but I don't see any problem with it.
>
> Placing a read_seqlock in get_monotonic_cycles would not be that bad,
> since the only slow down would be the rmb. read_seqlocks don't modify
> global data. Only the write_seqlock does. So the cache line bouncing would
> only happen on updates in clocksource_accumulate. But then after the
> caches are all balanced again, the reads will continue fine.
>
Yep, cache-line bouncing for rare updates is not much of an issue.
> Question: Is a cache-miss a greater cost than a read to a clocksource
> (besides the TSC)?
>
If HPET reads are as slow as I expect, then no. Even then, a
synchronized TSC read will take about 100 cycles. If we have to hit main
memory, some tests I have done on a P4 showed that it could take about
600 cycles. However, cacheline bouncing, in my understanding, has more
effects than merely burning cycles: the wasted memory I/O becomes
increasingly bad as we increase the number of CPUs.
> Also note how I arrange these variables in the clock struct:
>
> struct {
> cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
> cycle_t cycle_interval;
> } ____cacheline_aligned_in_smp;
>
> I could do the following:
>
> struct {
> seqlock_t cycle_lock;
> cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
> cycle_t cycle_interval;
> } ____cacheline_aligned_in_smp;
>
> Which would help to keep all these in the same cache line. These are all
> updated at the same time, and hopefully this will keep the cache line
> bouncing limited to a single cacheline.
>
And if the cache line only bounces when the write seqlock is taken, it's
not really an issue. I am more concerned about deadlocks ;)
Mathieu
> -- Steve
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Mathieu Desnoyers ([email protected]) wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
....
> >
>
> > One thing I want to clear up. The major difference between this
> > latency_tracer and LTTng is what we consider fast paths. The latency
> > tracer is recording things like enabling and disabling interrupts, preempt
> > count changes, or simply profiling all function calls. Those are what I
> > consider fast paths. The slow path WRT the latency_tracer are things like
> > context switches. This is why I don't have a problem with copying the
> > comm at context switch time. Because that _is_ a slow path for the latency
> > tracer.
>
> LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc..
> in flight recorder mode, we have nothing to write to disk and therefore
> we can handle very frequent events. We then do the analysis off-line
> using the last MB written in the buffers. The advantage is that the
> kernel dumbly writes data to a buffer : we therefore move the complexity
> to user-space.
>
> I agree that some kind of tracing, like the one you are doing, might be
> done more efficiently if you do a first clever analysis phase directly
> in the kernel without writing the raw high event rate data in memory
> buffers. However, I believe that it would be more powerful if we combine
> the two approaches rather than trying to do everything in or out of the
> kernel. LTTng could provide the comm names, priorities, etc, and your
> tracer could provide the top X list of processes that had a bad
> behavior. It would mean that the complete overall information would be
> made available after a post-processing phase done in an analysis tool
> like LTTV, but I don't see any problem with it.
>
Just to expand a bit on the design side of my proposal:

Your module would create "profiles" based on the hooks called. If we
take interrupt on/off for example, it would be called by lockdep and
could keep a table of the top X instructions that disable interrupts
for a long time. (It's just an example; you could want to save the pid
instead...)
Then, whenever a "profile dump" is triggered, you would simply have to
send the current state of your profile to lttng with something like :
struct irq_latency_table {
void *ip;
cycles_t time;
};
/*
* Make sure only one profile at a time is written to the trace for the
* whole system.
*/
static DECLARE_MUTEX(latency_profile_mutex);
static struct irq_latency_table latency_table[NR_ENTRIES];
void irq_latency_dump_profile(void)
{
int i;
char namebuf[KSYM_NAME_LEN];
mutex_lock(&latency_profile_mutex);
trace_mark(irq_latency_dump_begin, MARK_NOARGS);
for (i = 0; i < NR_ENTRIES; i++) {
sprint_symbol(namebuf, (unsigned long)latency_table[i].ip);
trace_mark(irq_latency_entry, "ip %p symbol %s time %llu",
latency_table[i].ip, namebuf,
(unsigned long long)latency_table[i].time);
}
trace_mark(irq_latency_dump_end, MARK_NOARGS);
mutex_unlock(&latency_profile_mutex);
}
You can then create an LTTV module that will format your nice output
each time a profile dump is encountered.
By doing this, your specialized profile generator would only have to
hook into the irq on/off events to gather the information it needs,
nothing more. I think that would trim the code size and the complexity
of your module by an interesting factor.
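
For instance, the hooks themselves could be as small as this
(hypothetical sketch, reusing the latency_table above; locking and
per-cpu tables are ignored for brevity):

	static DEFINE_PER_CPU(cycles_t, irq_off_start);

	void notrace profile_hardirqs_off(unsigned long ip)
	{
		__get_cpu_var(irq_off_start) = get_cycles();
	}

	void notrace profile_hardirqs_on(unsigned long ip)
	{
		cycles_t delta = get_cycles() - __get_cpu_var(irq_off_start);
		int i, min = 0;

		/* replace the smallest entry if this latency beats it */
		for (i = 1; i < NR_ENTRIES; i++)
			if (latency_table[i].time < latency_table[min].time)
				min = i;
		if (delta > latency_table[min].time) {
			latency_table[min].ip = (void *)ip;
			latency_table[min].time = delta;
		}
	}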
Note that I could optimize the way I currently deal with symbols by not
having to dump them in the trace, but since it's only for low rate
events, this optimization has a low priority on my todo list.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Steven Rostedt wrote:
> grmble. Then how do you trace preempt_disable? As my tracer does that
> (see the last patch in the series).
One way is to make a tracer_preempt_disable() and tracer_preempt_enable(),
both of which would be 'notrace'. You could probably optimize them
as well. The standard preempt_disable and preempt_enable don't look
very efficient (e.g. what's up with converting an increment operation into
an addition? - gak!)
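
Something along these lines, perhaps (untested, and it deliberately
skips the reschedule check on enable, on the assumption that the
tracer's critical sections are tiny):

	static inline void notrace tracer_preempt_disable(void)
	{
		current_thread_info()->preempt_count++;
		barrier();
	}

	static inline void notrace tracer_preempt_enable(void)
	{
		barrier();
		current_thread_info()->preempt_count--;
	}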
Any lock you take is going to have a pretty bad effect.
In order to be able to trace as much as possible, for KFT, I implemented
my own synchronization mechanism using cmpxchg, to avoid using any of the
existing kernel locks (which change more often than you'd think, and have
weird side effects).
=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> * Steven Rostedt ([email protected]) wrote:
> >
> > Yeah, but if we replace the loop with a seq lock, then it would work.
> > albeit, more cacheline bouncing (caused by writes). (maybe not, see below)
> >
>
> Yes, but then you would trigger a deadlock if you instrument code called
> from NMI, SMI, MCE contexts :(
>
> grep -ri NMI drivers/* arch/* |grep -vi PNMI
>
> is quite interesting : actually, it show that a few spots need to handle
> those "so special interrupts" : watchdogs, oprofile, virtualization and
> much more in architecture specific code.
>
> I just would not like to add a tracer in the kernel that is _so_
> intrusive that module writers and architecture maintainers would have to
> audit their code and think about the tracer for each implementation that
> would deal with these kind of interrupts.
I don't want driver writers to worry about that either.
>
> > > > do you actually use the RCU internals? or do you just reimplement an RCU
> > > > algorithm?
> > > >
> > >
> > > Nope, I don't use RCU internals in this code. Preempt disable seemed
> > > like the best way to handle this utterly short code path and I wanted
> > > the write side to be fast enough to be called periodically. What I do is:
> >
> > grmble. Then how do you trace preempt_disable? As my tracer does that
> > (see the last patch in the series).
> >
>
> I think using a kind of preempt_disable_notrace() would make sense here.
Actually after hitting the send button, I thought the same.
> I mean.. even if you use the seqlock, eventually, you will want to trace
> the seqlock behavior. Then you have to find the lightest way to do some
> sort of synchronization that will have a predictable execution. seqlock
> has the following disadvantage : if the seqlock read has to wait for the
> write seqlock to end, we add up some time to the execution of the
> code we are trying to profile, which will mix up the results. On the
> other hand, if the read-side executes in a constant amount of cycles
> which does not depend on the write-side activity, then we get a clearer
> picture of what the time should be accounted for. We can even create a
> module that will figure out how many nanoseconds are spent for reading
> the clock so we can substract this time from our analysis if required.
>
> That's why having to choose between read seqlock and preemption disable
> for the read-side, I would strongly prefer the preemption disable.
> (constant execution time and it's deadlock-free)
You may convince me yet ;-)
>
> > >
> > > - Disable preemption at the read-side :
> > > it makes sure the pointer I get will point to a data structure that
> > > will never change while I am in the preempt disabled code. (see *)
> > > - I use per-cpu data to allow the read-side to be as fast as possible
> > > (only need to disable preemption, does not race against other CPUs and
> > > won't generate cache line bouncing). It also allows dealing with
> > > unsynchronized TSCs if needed.
> > > - Periodical write side : it's called from an IPI running on each CPU.
> > >
> > > (*) We expect the read-side (preempt off region) to last shorter than
> > > the interval between IPI updates so we can guarantee the data structure
> > > it uses won't be modified underneath it. Since the IPI update is
> > > launched each seconds or so (depends on the frequency of the counter we
> > > are trying to extend), it's more than ok.
> >
> > One thing I want to clear up. The major difference between this
> > latency_tracer and LTTng is what we consider fast paths. The latency
> > tracer is recording things like enabling and disabling interrupts, preempt
> > count changes, or simply profiling all function calls. Those are what I
> > consider fast paths. The slow path WRT the latency_tracer are things like
> > context switches. This is why I don't have a problem with copying the
> > comm at context switch time. Because that _is_ a slow path for the latency
> > tracer.
>
> LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc..
> in flight recorder mode, we have nothing to write to disk and therefore
> we can handle very frequent events. We then do the analysis off-line
> using the last MB written in the buffers. The advantage is that the
> kernel dumbly writes data to a buffer : we therefore move the complexity
> to user-space.
But you would still need to do something in case you want this information
dumped to console on a kernel crash. Of course you can rely on kexec, but
if the kexec fails (which is possible) then you lose all the information.
Having the ability to dump the output to console on a crash is one of the
benefits of latency_tracer that I want to keep.
>
> I agree that some kind of tracing, like the one you are doing, might be
> done more efficiently if you do a first clever analysis phase directly
> in the kernel without writing the raw high event rate data in memory
> buffers. However, I believe that it would be more powerful if we combine
> the two approaches rather than trying to do everything in or out of the
> kernel. LTTng could provide the comm names, priorities, etc, and your
> tracer could provide the top X list of processes that had a bad
> behavior. It would mean that the complete overall information would be
> made available after a post-processing phase done in an analysis tool
> like LTTV, but I don't see any problem with it.
Of course you don't see any problem with it, because you know LTTV and
LTTng very well ;-)
latency_tracer has been very instrumental in solving -rt patch latencies:
we tell the customer to run with the latency trace on, and then have them
simply set a few sysctl variables and run their app. By combining this
with LTTng, I wouldn't know where to start when telling a customer how to
analyze the problem.

Simply put, latency_tracer has a much smaller learning curve than LTTng.
Not to mention a smaller footprint. The tracer here is very focused on
what to do, and is not meant to be a general profiling tool as LTTng is.
In other words, latency_tracer is LTTng-lite ;-)
>
> >
> > Placing a read_seqlock in get_monotonic_cycles would not be that bad,
> > since the only slow down would be the rmb. read_seqlocks don't modify
> > global data. Only the write_seqlock does. So the cache line bouncing would
> > only happen on updates in clocksource_accumulate. But then after the
> > caches are all balanced again, the reads will continue fine.
> >
>
> Yep, cache-line bouncing for rare updates in not much of an issue.
>
> > Question: Is a cache-miss a greater cost than a read to a clocksource
> > (besides the TSC)?
> >
>
> If HPET reads are as slow as I expect, then no. Even then, a
> synchronized TSC read will take about 100 cycles. If we have to hit main
> memory, some tests I have done on a P4 showed that it could take about
> 600 cycles. However, cacheline bouncing, in my understanding, has more
> effects than merely burning cycles : the wasted memory I/O becomes
> increasingly bad as we increase the number of CPUs.
I haven't seen too much of an effect on a 64 CPU box. For the rare update
that is. But I'm sure it will get much worse when we run a 1024 CPU box.
>
> > Also note how I arrange these variables in the clock struct:
> >
> > struct {
> > cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
> > cycle_t cycle_interval;
> > } ____cacheline_aligned_in_smp;
> >
> > I could do the following:
> >
> > struct {
> > seqlock_t cycle_lock;
> > cycle_t cycle_last, cycle_accumulated, cycle_monotonic;
> > cycle_t cycle_interval;
> > } ____cacheline_aligned_in_smp;
> >
> > Which would help to keep all these in the same cache line. These are all
> > updated at the same time, and hopefully this will keep the cache line
> > bouncing limited to a single cacheline.
> >
>
> And if the cache line only bounces when the write seqlock is taken, it's
> not really an issue. I am more concerned about deadlocks ;)
The write_seqlock is updated at the same time as the other variables are
written (all on the same cacheline, if the cache line is big enough).
But I do share your concern about deadlocks.
-- Steve
* Steven Rostedt ([email protected]) wrote:
>
...
> >
> > > >
> > > > - Disable preemption at the read-side :
> > > > it makes sure the pointer I get will point to a data structure that
> > > > will never change while I am in the preempt disabled code. (see *)
> > > > - I use per-cpu data to allow the read-side to be as fast as possible
> > > > (only need to disable preemption, does not race against other CPUs and
> > > > won't generate cache line bouncing). It also allows dealing with
> > > > unsynchronized TSCs if needed.
> > > > - Periodical write side : it's called from an IPI running on each CPU.
> > > >
> > > > (*) We expect the read-side (preempt off region) to last shorter than
> > > > the interval between IPI updates so we can guarantee the data structure
> > > > it uses won't be modified underneath it. Since the IPI update is
> > > > launched every second or so (depending on the frequency of the counter we
> > > > are trying to extend), it's more than ok.
> > >
> > > One thing I want to clear up. The major difference between this
> > > latency_tracer and LTTng is what we consider fast paths. The latency
> > > tracer is recording things like enabling and disabling interrupts, preempt
> > > count changes, or simply profiling all function calls. Those are what I
> > > consider fast paths. The slow path WRT the latency_tracer are things like
> > > context switches. This is why I don't have a problem with copying the
> > > comm at context switch time. Because that _is_ a slow path for the latency
> > > tracer.
> >
> > LTTng hooks into the lockdep tracer to trace irq on/off, spinlocks, etc.
> > In flight recorder mode, we have nothing to write to disk and therefore
> > we can handle very frequent events. We then do the analysis off-line
> > using the last MB written in the buffers. The advantage is that the
> > kernel dumbly writes data to a buffer : we therefore move the complexity
> > to user-space.
>
> But you would still need to do something in case you want this information
> dumped to console on a kernel crash. Of course you can rely on kexec, but
> if the kexec fails (which is possible) then you lose all the information.
> Having the ability to dump the output to console on a crash is one of the
> benefits of latency_tracer that I want to keep.
>
There has been some integration done between LTTng and the "crash" tool
to extract the buffers from a crashed kernel. I am not an expert in
crash buffer extraction, but I guess all the available mechanisms
depend on kexec and could hit the limits you are referring to.
If you really want to pretty-print the information to the console, I
would propose that you leave that part of the problem to a different
output module. The core of the latency tracer could keep the minimum
information. Then, when a dump is triggered, it either sends the
information to LTTng or to your console pretty-printer.
However, I would not call the pretty-printer a "tracer" module per-se.
We would have to accept that it is a bit more tied to the kernel
internals than the latency tracer. My goal is to separate the core
"profiling" module from the optional "pretty-printing" module as much as
possible so the latency tracer core could be reused by other output
modules.
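To make the split more concrete, here is a rough, purely hypothetical
sketch; none of these names exist anywhere, they only illustrate keeping
the latency tracer core separate from its output module:

/* Hypothetical sketch of the core/output split discussed above. */
struct latency_dump_ops {
        /* called by the core when a latency dump is triggered */
        void (*dump)(const void *trace_data, unsigned long len);
};

/* set to the console pretty-printer, an LTTng bridge, ... */
static struct latency_dump_ops *latency_output;

static void latency_trigger_dump(const void *trace_data, unsigned long len)
{
        if (latency_output && latency_output->dump)
                latency_output->dump(trace_data, len);
}

The core would only ever call latency_trigger_dump(); which output module
is plugged in would be the user's (or distro's) choice.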
>
> >
> > I agree that some kind of tracing, like the one you are doing, might be
> > done more efficiently if you do a first clever analysis phase directly
> > in the kernel without writing the raw high event rate data in memory
> > buffers. However, I believe that it would be more powerful if we combine
> > the two approaches rather than trying to do everything in or out of the
> > kernel. LTTng could provide the comm names, priorities, etc, and your
> > tracer could provide the top X list of processes that had a bad
> > behavior. It would mean that the complete overall information would be
> > made available after a post-processing phase done in an analysis tool
> > like LTTV, but I don't see any problem with it.
>
> Of course you don't see any problem with it, because you know LTTV and
> LTTng very well ;-)
>
> latency_tracer has been very instrumental in solving -rt patch latencies:
> we tell the customer to run with the latency trace on, and they simply
> set a few sysctl variables and run their app. If this were combined with
> LTTng, I wouldn't know where to start when telling a customer how to
> analyze the problem.
>
> Simply put, latency_tracer has a much smaller learning curve than LTTng.
> Not to mention, a smaller footprint. The tracer here is very focused on
> what to do, and is not meant to be a general profiling tool as LTTng is.
>
> In-other-words, latency_tracer is LTTng-lite ;-)
>
If LTTng is already ported to your specific kernel, the learning-curve
is not big at all. Here is what the latency_tracer over LTTng guide
could look like :
Well, once you have LTTng in your kernel and have compiled and installed
the ltt-control and lttv packages (configure, make, make install), all
that would be needed is :
(there may be some bits in the QUICKSTART GUIDE on
http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure
the LTTng modules are loaded)
#arm all the markers
ltt-armall
#start lttng tracing
lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt
-> start latency tracer
-> stop latency tracer
-> trigger latency tracer dump
While the tracing is active, trigger the condition...
(rinse, repeat; it can handle multiple latency tracer dumps)
#stop lttng tracing
lttctl -n test -R
#disarm all markers
ltt-disarmall
You can easily test the trace with :
lttv -m textDump -t /tmp/trace1
Your users would issue something like :
lttv -m latencytracerDump -t /tmp/trace1
that's it. LatencytracerDump would be a new specialized plugin, inspired
from the generic textDump.c plugin and from the state.c module (for
hooking on specific events rather than on _all_ events). It would
generate a text output from the trace collected at each latency tracer
dump.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Mathieu Desnoyers wrote:
> If LTTng is already ported to your specific kernel, the learning-curve
> is not big at all. Here is what the latency_tracer over LTTng guide
> could look like :
>
> Well, once you have LTTng in your kernel and have compiled and installed
> the ltt-control and lttv packages (configure, make, make install), all
> that would be needed is :
>
> (there may be some bits in the QUICKSTART GUIDE on
> http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure
> the LTTng modules are loaded)
>
> #arm all the markers
> ltt-armall
> #start lttng tracing
> lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt
>
> -> start latency tracer
> -> stop latency tracer
> -> trigger latency tracer dump
>
> While the tracing is active, trigger the condition...
>
> (rinse, repeat; it can handle multiple latency tracer dumps)
>
> #stop lttng tracing
> lttctl -n test -R
> #disarm all markers
> ltt-disarmall
>
> You can easily test the trace with :
> lttv -m textDump -t /tmp/trace1
>
> Your users would issue something like :
>
> lttv -m latencytracerDump -t /tmp/trace1
No offense, but this is still quite a bit harder than:
echo 1 >/proc/something
... wait a bit ...
cat /proc/something
(substitute /sys or /debugfs where appropriate)
Having to compile something (besides the kernel) for the target
is sometimes a major hassle. I avoided it completely with KFT.
-- Tim
=============================
Tim Bird
Architecture Group Chair, CE Linux Forum
Senior Staff Engineer, Sony Corporation of America
=============================
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > In-other-words, latency_tracer is LTTng-lite ;-)
> >
>
> If LTTng is already ported to your specific kernel, the learning-curve
> is not big at all. Here is what the latency_tracer over LTTng guide
> could look like :
>
> Well, once you have LTTng in your kernel and have compiled and installed
> the ltt-control and lttv packages (configure, make, make install), all
> that would be needed is :
>
> (there may be some bits in the QUICKSTART GUIDE on
> http://ltt.polymtl.ca, like adding the debugfs mount to fstab and making sure
> the LTTng modules are loaded)
>
> #arm all the markers
> ltt-armall
> #start lttng tracing
> lttctl -n test -t /tmp/trace1 -d -l /mnt/debugfs/ltt
>
> -> start latency tracer
> -> stop latency tracer
> -> trigger latency tracer dump
>
> While the tracing is active, trigger the condition...
>
> (rinse, repeat; it can handle multiple latency tracer dumps)
>
> #stop lttng tracing
> lttctl -n test -R
> #disarm all markers
> ltt-disarmall
>
> You can easily test the trace with :
> lttv -m textDump -t /tmp/trace1
>
> Your users would issue something like :
>
> lttv -m latencytracerDump -t /tmp/trace1
>
> that's it. LatencytracerDump would be a new specialized plugin, inspired
> from the generic textDump.c plugin and from the state.c module (for
> hooking on specific events rather than on _all_ events). It would
> generate a text output from the trace collected at each latency tracer
> dump.
>
Mathieu,
That's quite a bit. Currently we have tools that already hook into
latency_tracer, e.g. Thomas Gleixner's cyclictest. It turns on the latency
tracer, runs a quick test, and turns it off. If the latency is ok, the
trace is discarded and it runs the test again. When the latency is not
acceptable, it stops the test. Then all one needs to do is look at the
given trace. Hooking something as simple as this up to LTTng is not going
to fly.
Don't get me wrong, I'm a huge supporter of LTTng, and even recommend it.
But for some things, using LTTng is like using a sledgehammer to drive a
tack. I like to use the tools that are best for the job (which also means
easiest). I don't buy the one-tracer-fits-all mentality, and by pushing
that too much, I'm thinking we'll never get a tracer into the kernel.
My goal with these patches is to get the infrastructure for tracing into
the kernel. The latency_tracer is to tracing what Rusty Russell's lguest
is to pvops, with LTTng as the Xen equivalent: latency_tracer is nothing
more than the quick and dirty tracer. For administrators that want to
analyze their systems, LTTng is much more appropriate. latency_tracer is
more of a first aid for finding latency troubles, and if that doesn't
work, then we can do the LTTng surgery.
Let's go back and focus on the infrastructure again, namely the mcount
and notrace parts of the kernel, as well as the tracer timer. By the way,
on the timer side we are looking more closely into what you have done ;-)
-- Steve
On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[email protected]> wrote:
> If you really want a seqlock-free algorithm (I _do_ want this for
> tracing!) :) maybe going in the RCU direction could help (I refer to my
> RCU-based 32-to-64 bits lockless timestamp counter extension, which
> could be turned into the clocksource updater).
Yea. After our earlier discussion and talking w/ Steven, I'm taking a
swing at this now. The lock-free method still doesn't apply to the
update_wall_time function, but does work fine for the monotonic cycle
uses. I'll send a patch for review as soon as I get things building.
thanks
-john
On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
> On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[email protected]> wrote:
> > If you really want a seqlock-free algorithm (I _do_ want this for
> > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > could be turned into the clocksource updater).
>
> Yea. After our earlier discussion and talking w/ Steven, I'm taking a
> swing at this now. The lock-free method still doesn't apply to the
> update_wall_time function, but does work fine for the monotonic cycle
> uses. I'll send a patch for review as soon as I get things building.
So here's my first attempt at adding Mathieu's lock-free method to
Steven's get_monotonic_cycles() interface.
Completely un-tested, but it builds, so I figured I'd send it out for
review.
I'm not super sure the update or the read doesn't need something
additional to force a memory access, but as I didn't see anything
special in Mathieu's implementation, I'm going to guess this is ok.
Mathieu, Let me know if this isn't what you're suggesting.
Signed-off-by: John Stultz <[email protected]>
Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
+++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800
@@ -87,9 +87,17 @@
* more than one cache line.
*/
struct {
- cycle_t cycle_last, cycle_accumulated, cycle_raw;
- } ____cacheline_aligned_in_smp;
+ cycle_t cycle_last, cycle_accumulated;
+ /* base structure provides lock-free read
+ * access to a virtualized 64bit counter
+ * Uses RCU-like update.
+ */
+ struct {
+ cycle_t cycle_base_last, cycle_base;
+ } base[2];
+ int base_num;
+ } ____cacheline_aligned_in_smp;
u64 xtime_nsec;
s64 error;
@@ -175,19 +183,21 @@
}
/**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
* @cs: pointer to clocksource being read
* @now: current cycle value
*
* Uses the clocksource to return the current cycle_t value.
* NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
*/
static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
{
- cycle_t offset = (now - cs->cycle_last) & cs->mask;
- offset += cs->cycle_accumulated;
+ int num = cs->base_num;
+ cycle_t offset = (now - cs->base[num].cycle_base_last);
+ offset &= cs->mask;
+ offset += cs->base[num].cycle_base;
return offset;
}
@@ -197,14 +207,25 @@
* @now: current cycle value
*
* Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
*/
static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
{
- cycle_t offset = (now - cs->cycle_last) & cs->mask;
+ /* First update the monotonic base portion.
+ * The dual array update method allows for lock-free reading.
+ */
+ int num = !cs->base_num;
+ cycle_t offset = (now - cs->base[!num].cycle_base_last);
+ offset &= cs->mask;
+ cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
+ cs->base[num].cycle_base_last = now;
+ cs->base_num = num;
+
+ /* Now update the cycle_accumulated portion */
+ offset = (now - cs->cycle_last) & cs->mask;
cs->cycle_last = now;
cs->cycle_accumulated += offset;
- cs->cycle_raw += offset;
}
/**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
@@ -71,10 +71,12 @@
*/
static inline s64 __get_nsec_offset(void)
{
- cycle_t cycle_delta;
+ cycle_t now, cycle_delta;
s64 ns_offset;
- cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+ now = clocksource_read(clock);
+ cycle_delta = (now - clock->cycle_last) & clock->mask;
+ cycle_delta += clock->cycle_accumulated;
ns_offset = cyc2ns(clock, cycle_delta);
return ns_offset;
@@ -105,35 +107,7 @@
cycle_t notrace get_monotonic_cycles(void)
{
- cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
-
- do {
- /*
- * cycle_raw and cycle_last can change on
- * another CPU and we need the delta calculation
- * of cycle_now and cycle_last happen atomic, as well
- * as the adding to cycle_raw. We don't need to grab
- * any locks, we just keep trying until get all the
- * calculations together in one state.
- *
- * In fact, we __cant__ grab any locks. This
- * function is called from the latency_tracer which can
- * be called anywhere. To grab any locks (including
- * seq_locks) we risk putting ourselves into a deadlock.
- */
- cycle_raw = clock->cycle_raw;
- cycle_last = clock->cycle_last;
-
- /* read clocksource: */
- cycle_now = clocksource_read(clock);
-
- /* calculate the delta since the last update_wall_time: */
- cycle_delta = (cycle_now - cycle_last) & clock->mask;
-
- } while (cycle_raw != clock->cycle_raw ||
- cycle_last != clock->cycle_last);
-
- return cycle_raw + cycle_delta;
+ return clocksource_get_basecycles(clock, clocksource_read(clock));
}
unsigned long notrace cycles_to_usecs(cycle_t cycles)
Thanks John for doing this!
(comments embedded)
On Wed, 16 Jan 2008, john stultz wrote:
>
> On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
>
> Completely un-tested, but it builds, so I figured I'd send it out for
> review.
heh, ok, I'll take it and run it.
>
> I'm not super sure the update or the read doesn't need something
> additional to force a memory access, but as I didn't see anything
> special in Mathieu's implementation, I'm going to guess this is ok.
>
> Mathieu, Let me know if this isn't what you're suggesting.
>
> Signed-off-by: John Stultz <[email protected]>
>
> Index: monotonic-cleanup/include/linux/clocksource.h
> ===================================================================
> --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
> +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800
> @@ -87,9 +87,17 @@
> * more than one cache line.
> */
> struct {
> - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> - } ____cacheline_aligned_in_smp;
> + cycle_t cycle_last, cycle_accumulated;
>
> + /* base structure provides lock-free read
> + * access to a virtualized 64bit counter
> + * Uses RCU-like update.
> + */
> + struct {
> + cycle_t cycle_base_last, cycle_base;
> + } base[2];
> + int base_num;
> + } ____cacheline_aligned_in_smp;
> u64 xtime_nsec;
> s64 error;
>
> @@ -175,19 +183,21 @@
> }
>
> /**
> - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
> * @cs: pointer to clocksource being read
> * @now: current cycle value
> *
> * Uses the clocksource to return the current cycle_t value.
> * NOTE!!!: This is different from clocksource_read, because it
> - * returns the accumulated cycle value! Must hold xtime lock!
> + * returns a 64bit wide accumulated value.
> */
> static inline cycle_t
> -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> +clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
> {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> - offset += cs->cycle_accumulated;
> + int num = cs->base_num;
> + cycle_t offset = (now - cs->base[num].cycle_base_last);
> + offset &= cs->mask;
> + offset += cs->base[num].cycle_base;
> return offset;
> }
>
> @@ -197,14 +207,25 @@
> * @now: current cycle value
> *
> * Used to avoids clocksource hardware overflow by periodically
> - * accumulating the current cycle delta. Must hold xtime write lock!
> + * accumulating the current cycle delta. Uses RCU-like update, but
> + * ***still requires the xtime_lock is held for writing!***
> */
> static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> + /* First update the monotonic base portion.
> + * The dual array update method allows for lock-free reading.
> + */
> + int num = !cs->base_num;
> + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> + offset &= cs->mask;
> + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> + cs->base[num].cycle_base_last = now;
I would think that we would need some sort of barrier here. Otherwise,
base_num could be updated before the cycle_base writes are visible. I'd
expect an smp_wmb() is needed.
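Something like this is what I have in mind (sketch only, reusing the field
names from the hunk above; the function name is made up and the smp_wmb()
placement is the suggestion here, not what the posted patch does):

static inline void clocksource_update_base(struct clocksource *cs, cycle_t now)
{
        int num = !cs->base_num;        /* write the currently inactive element */
        cycle_t offset = (now - cs->base[!num].cycle_base_last) & cs->mask;

        cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
        cs->base[num].cycle_base_last = now;

        smp_wmb();              /* make the element writes visible ...      */
        cs->base_num = num;     /* ... before publishing the new index      */
}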
> + cs->base_num = num;
> +
> + /* Now update the cycle_accumulated portion */
> + offset = (now - cs->cycle_last) & cs->mask;
> cs->cycle_last = now;
> cs->cycle_accumulated += offset;
> - cs->cycle_raw += offset;
> }
>
> /**
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===================================================================
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
> @@ -71,10 +71,12 @@
> */
> static inline s64 __get_nsec_offset(void)
> {
> - cycle_t cycle_delta;
> + cycle_t now, cycle_delta;
> s64 ns_offset;
>
> - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> + now = clocksource_read(clock);
> + cycle_delta = (now - clock->cycle_last) & clock->mask;
> + cycle_delta += clock->cycle_accumulated;
Is the above just to decouple the two methods?
> ns_offset = cyc2ns(clock, cycle_delta);
>
> return ns_offset;
> @@ -105,35 +107,7 @@
>
> cycle_t notrace get_monotonic_cycles(void)
> {
> - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> -
> - do {
> - /*
> - * cycle_raw and cycle_last can change on
> - * another CPU and we need the delta calculation
> - * of cycle_now and cycle_last happen atomic, as well
> - * as the adding to cycle_raw. We don't need to grab
> - * any locks, we just keep trying until get all the
> - * calculations together in one state.
> - *
> - * In fact, we __cant__ grab any locks. This
> - * function is called from the latency_tracer which can
> - * be called anywhere. To grab any locks (including
> - * seq_locks) we risk putting ourselves into a deadlock.
> - */
> - cycle_raw = clock->cycle_raw;
> - cycle_last = clock->cycle_last;
> -
> - /* read clocksource: */
> - cycle_now = clocksource_read(clock);
> -
> - /* calculate the delta since the last update_wall_time: */
> - cycle_delta = (cycle_now - cycle_last) & clock->mask;
> -
> - } while (cycle_raw != clock->cycle_raw ||
> - cycle_last != clock->cycle_last);
> -
> - return cycle_raw + cycle_delta;
> + return clocksource_get_basecycles(clock, clocksource_read(clock));
Nice ;-)
> }
>
> unsigned long notrace cycles_to_usecs(cycle_t cycles)
-- Steve
* john stultz ([email protected]) wrote:
>
> On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
> > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[email protected]> wrote:
> > > If you really want a seqlock-free algorithm (I _do_ want this for
> > > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > > could be turned into the clocksource updater).
> >
> > Yea. After our earlier discussion and talking w/ Steven, I'm taking a
> > swing at this now. The lock-free method still doesn't apply to the
> > update_wall_time function, but does work fine for the monotonic cycle
> > uses. I'll send a patch for review as soon as I get things building.
>
> So here's my first attempt at adding Mathieu's lock-free method to
> Steven's get_monotonic_cycles() interface.
>
> Completely un-tested, but it builds, so I figured I'd send it out for
> review.
>
> I'm not super sure the update or the read doesn't need something
> additional to force a memory access, but as I didn't see anything
> special in Mathieu's implementation, I'm going to guess this is ok.
>
> Mathieu, Let me know if this isn't what you're suggesting.
>
> Signed-off-by: John Stultz <[email protected]>
>
> Index: monotonic-cleanup/include/linux/clocksource.h
> ===================================================================
> --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
> +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800
> @@ -87,9 +87,17 @@
> * more than one cache line.
> */
> struct {
> - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> - } ____cacheline_aligned_in_smp;
Shouldn't the cycle_last and cycle_accumulated be in the array too ?
> + cycle_t cycle_last, cycle_accumulated;
>
> + /* base structure provides lock-free read
> + * access to a virtualized 64bit counter
> + * Uses RCU-like update.
> + */
> + struct {
We had cycle_raw before, why do we need the following two ?
> + cycle_t cycle_base_last, cycle_base;
I'm not quite sure why you need both cycle_base_last and cycle_base...
I think I'll need a bit of an explanation of what you are trying to
achieve here to see what to expect from the clock source. Are you trying
to deal with non-synchronized TSCs across CPUs in a way that will
generate a monotonic (sometimes stalling) clock ?
What I am trying to say is : I know you are trying to make a virtual
clock source where time cannot go backward, but what are your
assumptions about the "real" clock source ?
Is the intent to deal with an HPET suddenly reset to 0 or something
like this ?
Basically, I wonder why you have to calculate the current cycle count
from the previous update_wall_time event. Is it because you need to be
consistent when a clocksource change occurs ?
> + } base[2];
> + int base_num;
> + } ____cacheline_aligned_in_smp;
> u64 xtime_nsec;
> s64 error;
>
> @@ -175,19 +183,21 @@
> }
>
> /**
> - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
> * @cs: pointer to clocksource being read
> * @now: current cycle value
> *
> * Uses the clocksource to return the current cycle_t value.
> * NOTE!!!: This is different from clocksource_read, because it
> - * returns the accumulated cycle value! Must hold xtime lock!
> + * returns a 64bit wide accumulated value.
> */
> static inline cycle_t
> -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> +clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
> {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> - offset += cs->cycle_accumulated;
I would disable preemption in clocksource_get_basecycles. We would not
want to be scheduled out while we hold a pointer to the old array
element.
> + int num = cs->base_num;
Since you deal with base_num in a shared manner (not per cpu), you will
need an smp_read_barrier_depends() here after the cs->base_num read.
You should think about reading the cs->base_num first, and _after_ that
read the real clocksource. Here, the clocksource value is passed as
parameter. It means that the read clocksource may have been read in the
previous RCU window.
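Roughly what I have in mind for the read side (a sketch only; the point is
the order of the base_num read, the dependency barrier and the hardware
read, everything else is just the field names from your patch):

static inline cycle_t clocksource_get_basecycles(struct clocksource *cs)
{
        int num;
        cycle_t now, offset;

        preempt_disable();
        num = cs->base_num;
        smp_read_barrier_depends();     /* pairs with the update-side wmb() */
        now = clocksource_read(cs);     /* read the hardware *after* picking
                                         * the element, not before */
        offset = (now - cs->base[num].cycle_base_last) & cs->mask;
        offset += cs->base[num].cycle_base;
        preempt_enable();

        return offset;
}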
> + cycle_t offset = (now - cs->base[num].cycle_base_last);
> + offset &= cs->mask;
> + offset += cs->base[num].cycle_base;
> return offset;
> }
>
> @@ -197,14 +207,25 @@
> * @now: current cycle value
> *
> * Used to avoids clocksource hardware overflow by periodically
> - * accumulating the current cycle delta. Must hold xtime write lock!
> + * accumulating the current cycle delta. Uses RCU-like update, but
> + * ***still requires the xtime_lock is held for writing!***
> */
> static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> {
Why do we still require xtime_lock here ? Can you tell exactly which
contexts this function will be called from (periodical timer interrupt?)
I guess it is called from one and only one CPU periodically.
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> + /* First update the monotonic base portion.
> + * The dual array update method allows for lock-free reading.
> + */
> + int num = !cs->base_num;
> + cycle_t offset = (now - cs->base[!num].cycle_base_last);
!0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in
my code. The two previous lines seem buggy. (I made the same mistake in
my first implementation) ;)
> + offset &= cs->mask;
> + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
Here too.
> + cs->base[num].cycle_base_last = now;
Since you deal with shared data (in my algo, I use per-cpu data), you
have to add a wmb() before the base_num value update. Only then will you
ensure that other CPUs will see consistent values.
> + cs->base_num = num;
> +
> + /* Now update the cycle_accumulated portion */
> + offset = (now - cs->cycle_last) & cs->mask;
The following two updates are racy. I think they should be in the array
too. We want consistent cycle_raw, cycle_last and cycle_accumulated
values; they should therefore be presented to the reader atomically with
a pointer change.
> cs->cycle_last = now;
> cs->cycle_accumulated += offset;
> - cs->cycle_raw += offset;
> }
>
> /**
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===================================================================
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
> @@ -71,10 +71,12 @@
> */
> static inline s64 __get_nsec_offset(void)
> {
> - cycle_t cycle_delta;
> + cycle_t now, cycle_delta;
> s64 ns_offset;
>
> - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> + now = clocksource_read(clock);
Why is there a second level of cycle_last and cycle_accumulated here ?
> + cycle_delta = (now - clock->cycle_last) & clock->mask;
> + cycle_delta += clock->cycle_accumulated;
> ns_offset = cyc2ns(clock, cycle_delta);
>
> return ns_offset;
> @@ -105,35 +107,7 @@
>
> cycle_t notrace get_monotonic_cycles(void)
> {
> - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> -
> - do {
> - /*
> - * cycle_raw and cycle_last can change on
> - * another CPU and we need the delta calculation
> - * of cycle_now and cycle_last happen atomic, as well
> - * as the adding to cycle_raw. We don't need to grab
> - * any locks, we just keep trying until get all the
> - * calculations together in one state.
> - *
> - * In fact, we __cant__ grab any locks. This
> - * function is called from the latency_tracer which can
> - * be called anywhere. To grab any locks (including
> - * seq_locks) we risk putting ourselves into a deadlock.
> - */
> - cycle_raw = clock->cycle_raw;
> - cycle_last = clock->cycle_last;
> -
> - /* read clocksource: */
> - cycle_now = clocksource_read(clock);
> -
> - /* calculate the delta since the last update_wall_time: */
> - cycle_delta = (cycle_now - cycle_last) & clock->mask;
> -
> - } while (cycle_raw != clock->cycle_raw ||
> - cycle_last != clock->cycle_last);
> -
> - return cycle_raw + cycle_delta;
> + return clocksource_get_basecycles(clock, clocksource_read(clock));
> }
>
> unsigned long notrace cycles_to_usecs(cycle_t cycles)
>
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
>
> > - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> > + /* First update the monotonic base portion.
> > + * The dual array update method allows for lock-free reading.
> > + */
> > + int num = !cs->base_num;
> > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
>
> !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in
How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask
me, if all you are doing is flipping it between 0 and 1.
-- Steve
> my code. The two previous lines seem buggy. (I made the same mistake in
> my first implementation) ;)
>
> > + offset &= cs->mask;
> > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
>
> Here too.
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> * john stultz ([email protected]) wrote:
> >
> > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
> > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[email protected]> wrote:
> > > > If you really want a seqlock-free algorithm (I _do_ want this for
> > > > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > > > could be turned into the clocksource updater).
> > >
> > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a
> > > swing at this now. The lock-free method still doesn't apply to the
> > > update_wall_time function, but does work fine for the monotonic cycle
> > > uses. I'll send a patch for review as soon as I get things building.
> >
> > So here's my first attempt at adding Mathieu's lock-free method to
> > Steven's get_monotonic_cycles() interface.
> >
> > Completely un-tested, but it builds, so I figured I'd send it out for
> > review.
> >
> > I'm not super sure the update or the read doesn't need something
> > additional to force a memory access, but as I didn't see anything
> > special in Mathieu's implementation, I'm going to guess this is ok.
> >
> > Mathieu, Let me know if this isn't what you're suggesting.
> >
> > Signed-off-by: John Stultz <[email protected]>
> >
> > Index: monotonic-cleanup/include/linux/clocksource.h
> > ===================================================================
> > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
> > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800
> > @@ -87,9 +87,17 @@
> > * more than one cache line.
> > */
> > struct {
> > - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> > - } ____cacheline_aligned_in_smp;
>
> > Shouldn't the cycle_last and cycle_accumulated be in the array too ?
No, we're leaving cycle_last and cycle_accumulated alone. They're
related to the update_wall_time conversion of cycles to xtime.
> > + cycle_t cycle_last, cycle_accumulated;
> >
> > + /* base structure provides lock-free read
> > + * access to a virtualized 64bit counter
> > + * Uses RCU-like update.
> > + */
> > + struct {
>
> We had cycle_raw before, why do we need the following two ?
>
> > + cycle_t cycle_base_last, cycle_base;
>
> I'm not quite sure why you need both cycle_base_last and cycle_base...
So on my first shot at this, I tried to layer the concepts. Using the
lock-free method to create an abstracted 64bit counter, as provided by
get_monotonic_cycles(). Then I tried to use that abstraction directly in
the update_wall_time() code, reading the abstracted 64bit counter and
using it to update time.
However, then we start keeping cycle_last in 64bit cycles, rather than
an actual counter read. This then caused changes to be needed in the
arch vsyscall implementations, and that started to get ugly, as we had
to also re-implement the abstracted 64bit counter w/ the lock free
method as well.
So I just backed off and tried to make it simple: We have two sets of
data that count cycles from the clocksource. One for timekeeping and
one for get_monotonic_cycles(). It is a little redundant, but I don't
think you can escape that (the layering method above also has
redundancy, but it's just hidden until you implement the vsyscall gtod
methods).
> I think I'll need a bit of an explanation of what you are trying to
> achieve here to see what to expect from the clock source. Are you trying
> to deal with non-synchronized TSCs across CPUs in a way that will
> generate a monotonic (sometimes stalling) clock ?
No no no.. I'm not touching the non-synced TSC issue. I'm just trying to
take clocksource counters, which may be of different bit-widths (ACPI PM
is 24bits, for instance), and create a lock-free method to translate that
into a virtual 64bit wide counter (using an accumulation bucket,
basically).
> What I am trying to say is : I know you are trying to make a virtual
> clock source where time cannot go backward, but what are your
> assumptions about the "real" clock source ?
The assumptions of the real clocksource is the same we keep in the
timekeeping core. It counts forward, at a constant rate and only wraps
after the mask value has been reached.
> Is the intent to deal with an HPET suddenly reset to 0 or something
> like this ?
Well, dealing with clocksources wrapping short of 64bits.
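Just to illustrate the arithmetic (a toy user-space example, not kernel
code; a 24-bit mask like the ACPI PM timer is assumed, and the names are
made up for illustration):

#include <stdio.h>
#include <stdint.h>

#define MASK ((1u << 24) - 1)   /* 24-bit counter, like ACPI PM */

int main(void)
{
        uint32_t last = 0x00fffff0;     /* shortly before the 24-bit wrap */
        uint32_t now  = 0x00000010;     /* the counter has wrapped since  */
        uint64_t accumulated = 0x1234ULL << 32; /* 64-bit total so far    */

        uint32_t delta = (now - last) & MASK;   /* 0x20, despite the wrap */

        accumulated += delta;
        printf("delta=0x%x accumulated=0x%llx\n",
               delta, (unsigned long long)accumulated);
        return 0;
}

As long as we accumulate more often than the counter wraps, the masked
subtraction always yields the right delta.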
> Basically, I wonder why you have to calculate the current cycle count
> from the previous update_wall_time event. Is it because you need to be
> consistent when a clocksource change occurs ?
Actually, we try to do it from the last clocksource_accumulate() call
(which is called from update_wall_time).
> > + } base[2];
> > + int base_num;
> > + } ____cacheline_aligned_in_smp;
> > u64 xtime_nsec;
> > s64 error;
> >
> > @@ -175,19 +183,21 @@
> > }
> >
> > /**
> > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
> > * @cs: pointer to clocksource being read
> > * @now: current cycle value
> > *
> > * Uses the clocksource to return the current cycle_t value.
> > * NOTE!!!: This is different from clocksource_read, because it
> > - * returns the accumulated cycle value! Must hold xtime lock!
> > + * returns a 64bit wide accumulated value.
> > */
> > static inline cycle_t
> > -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
> > {
> > - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> > - offset += cs->cycle_accumulated;
>
> I would disable preemption in clocksource_get_basecycles. We would not
> want to be scheduled out while we hold a pointer to the old array
> element.
Ok. This is the part I wasn't so sure about. But yes, that sounds
reasonable.
> > + int num = cs->base_num;
>
> Since you deal with base_num in a shared manner (not per cpu), you will
> need an smp_read_barrier_depends() here after the cs->base_num read.
Ah, thanks. I'll add that in.
> You should think about reading the cs->base_num first, and _after_ that
> read the real clocksource. Here, the clocksource value is passed as
> parameter. It means that the read clocksource may have been read in the
> previous RCU window.
Hmm. Ok, still need to wrap my head around that one, but I think it
makes sense.
> > + cycle_t offset = (now - cs->base[num].cycle_base_last);
> > + offset &= cs->mask;
> > + offset += cs->base[num].cycle_base;
> > return offset;
> > }
> >
> > @@ -197,14 +207,25 @@
> > * @now: current cycle value
> > *
> > * Used to avoids clocksource hardware overflow by periodically
> > - * accumulating the current cycle delta. Must hold xtime write lock!
> > + * accumulating the current cycle delta. Uses RCU-like update, but
> > + * ***still requires the xtime_lock is held for writing!***
> > */
> > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> > {
>
> Why do we still require xtime_lock here ? Can you tell exactly which
> contexts this function will be called from (periodical timer interrupt?)
> I guess it is called from one and only one CPU periodically.
Well, the main reason we need the xtime_lock, is because the xtime_lock
still protects the cycle_last and cycle_accumulated values (which are
not lock-free). This is part of the redundancy issue above. We're
updating similar structures, that store different data from the same
source. One of the two can be handled lock-free, the other cannot.
In addition however, doing the update under the lock makes sure we don't
do the update in parallel (serializes writers, basically) if
clocksource_accumulate is called on different cpus (it shouldn't happen
right now, but in the past it has been possible).
> > - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> > + /* First update the monotonic base portion.
> > + * The dual array update method allows for lock-free reading.
> > + */
> > + int num = !cs->base_num;
> > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
>
> !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in
> my code. The two previous lines seem buggy. (I made the same mistake in
> my first implementation) ;)
Heh. My first thought on this was just disbelief ("WHAAAH? NOOOO!"). But
Steven made clear the logical issue on irc. Thanks for pointing it out.
I've been using that assumption (as well as the !! trick) for so long it
will be a hard habit to break. :)
I'll add in Steven's method to the code.
> > + offset &= cs->mask;
> > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
>
> Here too.
>
> > + cs->base[num].cycle_base_last = now;
>
> Since you deal with shared data (in my algo, I use per-cpu data), you
> have to add a wmb() before the base_num value update. Only then will you
> ensure that other CPUs will see consistent values.
Ok, thanks. I was worried about that as well.
Thanks so much for the review! I'll go through and make the update
changes you suggested. Do let me know if my explanations to your
questions above make sense.
thanks
-john
On Wed, 16 Jan 2008, Steven Rostedt wrote:
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in
>
> How about simply "cpu_synth->index ^ 1"? Seems the best choice if you ask
> me, if all you are doing is flipping it between 0 and 1.
>
FYI:
rostedt@bilbo:~/c$ cat flipme.c
int flip1 (int x)
{
return !x;
}
int flip2 (int x)
{
return x ? 0 : 1;
}
int flip3(int x)
{
return x ^ 1;
}
rostedt@bilbo:~/c$ gcc -O2 -c flipme.c
rostedt@bilbo:~/c$ objdump -d flipme.o
flipme.o: file format elf32-i386
Disassembly of section .text:
00000000 <flip1>:
0: 55 push %ebp
1: 31 c0 xor %eax,%eax
3: 89 e5 mov %esp,%ebp
5: 83 7d 08 00 cmpl $0x0,0x8(%ebp)
9: 5d pop %ebp
a: 0f 94 c0 sete %al
d: c3 ret
e: 66 90 xchg %ax,%ax
00000010 <flip2>:
10: 55 push %ebp
11: 31 c0 xor %eax,%eax
13: 89 e5 mov %esp,%ebp
15: 83 7d 08 00 cmpl $0x0,0x8(%ebp)
19: 5d pop %ebp
1a: 0f 94 c0 sete %al
1d: c3 ret
1e: 66 90 xchg %ax,%ax
00000020 <flip3>:
20: 55 push %ebp
21: 89 e5 mov %esp,%ebp
23: 8b 45 08 mov 0x8(%ebp),%eax
26: 5d pop %ebp
27: 83 f0 01 xor $0x1,%eax
2a: c3 ret
So, if you know for sure that x is only 1 or 0, then using x ^ 1 to invert
it, seems the most efficient.
-- Steve
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
>
> > + int num = !cs->base_num;
> > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
>
> !0 is not necessarily 1.
Incorrect.
!0 _is_ necessarily 1. It's how all C logical operators work. If you find
a compiler that turns !x into anything but 0/1, you found a compiler for
another language than C.
It's true that any non-zero value counts as "true", but that does not
mean that a logical operator can return any non-zero value for true. As a
return value of the logical operations in C, true is *always* 1.
So !, ||, &&, when used as values, will *always* return either 0 or 1 (but
when used as part of a conditional, the compiler will often optimize out
unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if
the value itself was never used, only branched upon).
So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine.
That's not to say it's necessarily the *best* way.
If you *know* that you started with 0/1 in the first place, the best way
to flip it tends to be to do (1-x) (or possibly (x^1)).
And if you can't guarantee that, !x is probably better than x ? 0 : 1,
but you might also decide to use ((x+1)&1) for example.
And obviously, the compiler may sometimes surprise you, and if *it* also
knows it's always 0/1 (for something like the source being a single-bit
bitfield for example), it may end up doing something else than you coded
that is equivalent. And the particular choice of operation the compiler
chooses may well depend on the code _around_ that sequence.
(One reason to potentially prefer (1-x) over (x^1) is that it's often
easier to combine a subtraction with other operations, while an xor seldom
combines with anything around it)
Linus
* Linus Torvalds ([email protected]) wrote:
>
>
> On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > > + int num = !cs->base_num;
> > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> >
> > !0 is not necessarily 1.
>
> Incorrect.
>
Hrm, *digging in my mailbox*, ah, here it is :
http://listserv.shafik.org/pipermail/ltt-dev/2006-June/001548.html
Richard Purdie reviewed my code back in 2006 and made this modification.
Maybe he will have something to add.
> !0 _is_ necessarily 1. It's how all C logical operators work. If you find
> a compiler that turns !x into anything but 0/1, you found a compiler for
> another language than C.
>
> It's true that any non-zero value counts as "true", but that does not
> mean that a logical operator can return any non-zero value for true. As a
> return value of the logical operations in C, true is *always* 1.
>
> So !, ||, &&, when used as values, will *always* return either 0 or 1 (but
> when used as part of a conditional, the compiler will often optimize out
> unnecessary stuff, so the CPU may not actually ever see a 0/1 value, if
> the value itself was never used, only branched upon).
>
> So doing "!cs->base_num" to turn 0->1 and 1->0 is perfectly fine.
>
> That's not to say it's necessarily the *best* way.
>
> If you *know* that you started with 0/1 in the first place, the best way
> to flip it tends to be to do (1-x) (or possibly (x^1)).
>
> And if you can't guarantee that, !x is probably better than x ? 0 : 1,
> but you might also decide to use ((x+1)&1) for example.
>
> And obviously, the compiler may sometimes surprise you, and if *it* also
> knows it's always 0/1 (for something like the source being a single-bit
> bitfield for example), it may end up doing something else than you coded
> that is equivalent. And the particular choice of operation the compiler
> chooses may well depend on the code _around_ that sequence.
>
> (One reason to potentially prefer (1-x) over (x^1) is that it's often
> easier to combine a subtraction with other operations, while an xor seldom
> combines with anything around it)
>
Ok, I'll adopt (1-x) then. Thanks!
Mathieu
> Linus
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> I would disable preemption in clocksource_get_basecycles. We would not
> want to be scheduled out while we hold a pointer to the old array
> element.
>
> > + int num = cs->base_num;
>
> Since you deal with base_num in a shared manner (not per cpu), you will
> need an smp_read_barrier_depends() here after the cs->base_num read.
>
> You should think about reading the cs->base_num first, and _after_ that
> read the real clocksource. Here, the clocksource value is passed as
> parameter. It means that the read clocksource may have been read in the
> previous RCU window.
Here's an updated version of the patch w/ the suggested memory barrier
changes and the favored (1-x) inversion. ;) Let me know if you see
any other holes, or have any other suggestions or ideas.
Still un-tested (my test box will free up soon, I promise!), but builds.
Signed-off-by: John Stultz <[email protected]>
Index: monotonic-cleanup/include/linux/clocksource.h
===================================================================
--- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
+++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 18:12:53.000000000 -0800
@@ -87,9 +87,17 @@
* more than one cache line.
*/
struct {
- cycle_t cycle_last, cycle_accumulated, cycle_raw;
- } ____cacheline_aligned_in_smp;
+ cycle_t cycle_last, cycle_accumulated;
+ /* base structure provides lock-free read
+ * access to a virtualized 64bit counter
+ * Uses RCU-like update.
+ */
+ struct {
+ cycle_t cycle_base_last, cycle_base;
+ } base[2];
+ int base_num;
+ } ____cacheline_aligned_in_smp;
u64 xtime_nsec;
s64 error;
@@ -175,19 +183,29 @@
}
/**
- * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
+ * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
* @cs: pointer to clocksource being read
* @now: current cycle value
*
* Uses the clocksource to return the current cycle_t value.
* NOTE!!!: This is different from clocksource_read, because it
- * returns the accumulated cycle value! Must hold xtime lock!
+ * returns a 64bit wide accumulated value.
*/
static inline cycle_t
-clocksource_get_cycles(struct clocksource *cs, cycle_t now)
+clocksource_get_basecycles(struct clocksource *cs)
{
- cycle_t offset = (now - cs->cycle_last) & cs->mask;
- offset += cs->cycle_accumulated;
+ int num;
+ cycle_t now, offset;
+
+ preempt_disable();
+ num = cs->base_num;
+ smp_read_barrier_depends();
+ now = clocksource_read(cs);
+ offset = (now - cs->base[num].cycle_base_last);
+ offset &= cs->mask;
+ offset += cs->base[num].cycle_base;
+ preempt_enable();
+
return offset;
}
@@ -197,14 +215,26 @@
* @now: current cycle value
*
* Used to avoids clocksource hardware overflow by periodically
- * accumulating the current cycle delta. Must hold xtime write lock!
+ * accumulating the current cycle delta. Uses RCU-like update, but
+ * ***still requires the xtime_lock is held for writing!***
*/
static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
{
- cycle_t offset = (now - cs->cycle_last) & cs->mask;
+ /* First update the monotonic base portion.
+ * The dual array update method allows for lock-free reading.
+ */
+ int num = 1 - cs->base_num;
+ cycle_t offset = (now - cs->base[1-num].cycle_base_last);
+ offset &= cs->mask;
+ cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
+ cs->base[num].cycle_base_last = now;
+ wmb();
+ cs->base_num = num;
+
+ /* Now update the cycle_accumulated portion */
+ offset = (now - cs->cycle_last) & cs->mask;
cs->cycle_last = now;
cs->cycle_accumulated += offset;
- cs->cycle_raw += offset;
}
/**
Index: monotonic-cleanup/kernel/time/timekeeping.c
===================================================================
--- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
+++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.000000000 -0800
@@ -71,10 +71,12 @@
*/
static inline s64 __get_nsec_offset(void)
{
- cycle_t cycle_delta;
+ cycle_t now, cycle_delta;
s64 ns_offset;
- cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
+ now = clocksource_read(clock);
+ cycle_delta = (now - clock->cycle_last) & clock->mask;
+ cycle_delta += clock->cycle_accumulated;
ns_offset = cyc2ns(clock, cycle_delta);
return ns_offset;
@@ -105,35 +107,7 @@
cycle_t notrace get_monotonic_cycles(void)
{
- cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
-
- do {
- /*
- * cycle_raw and cycle_last can change on
- * another CPU and we need the delta calculation
- * of cycle_now and cycle_last happen atomic, as well
- * as the adding to cycle_raw. We don't need to grab
- * any locks, we just keep trying until get all the
- * calculations together in one state.
- *
- * In fact, we __cant__ grab any locks. This
- * function is called from the latency_tracer which can
- * be called anywhere. To grab any locks (including
- * seq_locks) we risk putting ourselves into a deadlock.
- */
- cycle_raw = clock->cycle_raw;
- cycle_last = clock->cycle_last;
-
- /* read clocksource: */
- cycle_now = clocksource_read(clock);
-
- /* calculate the delta since the last update_wall_time: */
- cycle_delta = (cycle_now - cycle_last) & clock->mask;
-
- } while (cycle_raw != clock->cycle_raw ||
- cycle_last != clock->cycle_last);
-
- return cycle_raw + cycle_delta;
+ return clocksource_get_basecycles(clock);
}
unsigned long notrace cycles_to_usecs(cycle_t cycles)
* john stultz ([email protected]) wrote:
>
> On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> > * john stultz ([email protected]) wrote:
> > >
> > > On Wed, 2008-01-16 at 14:36 -0800, john stultz wrote:
> > > > On Jan 16, 2008 6:56 AM, Mathieu Desnoyers <[email protected]> wrote:
> > > > > If you really want a seqlock-free algorithm (I _do_ want this for
> > > > > tracing!) :) maybe going in the RCU direction could help (I refer to my
> > > > > RCU-based 32-to-64 bits lockless timestamp counter extension, which
> > > > > could be turned into the clocksource updater).
> > > >
> > > > Yea. After our earlier discussion and talking w/ Steven, I'm taking a
> > > > swing at this now. The lock-free method still doesn't apply to the
> > > > update_wall_time function, but does work fine for the monotonic cycle
> > > > uses. I'll send a patch for review as soon as I get things building.
> > >
> > > So here's my first attempt at adding Mathieu's lock-free method to
> > > Steven's get_monotonic_cycles() interface.
> > >
> > > Completely un-tested, but it builds, so I figured I'd send it out for
> > > review.
> > >
> > > I'm not super sure the update or the read doesn't need something
> > > additional to force a memory access, but as I didn't see anything
> > > special in Mathieu's implementation, I'm going to guess this is ok.
> > >
> > > Mathieu, Let me know if this isn't what you're suggesting.
> > >
> > > Signed-off-by: John Stultz <[email protected]>
> > >
> > > Index: monotonic-cleanup/include/linux/clocksource.h
> > > ===================================================================
> > > --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
> > > +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 14:41:31.000000000 -0800
> > > @@ -87,9 +87,17 @@
> > > * more than one cache line.
> > > */
> > > struct {
> > > - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> > > - } ____cacheline_aligned_in_smp;
> >
> > Shouldn't the cycle_last and cycle_accumulated be in the array too ?
>
> No, we're leaving cycle_last and cycle_accumulated alone. They're
> related to the update_wall_time conversion of cycles to xtime.
>
> > > + cycle_t cycle_last, cycle_accumulated;
> > >
> > > + /* base structure provides lock-free read
> > > + * access to a virtualized 64bit counter
> > > + * Uses RCU-like update.
> > > + */
> > > + struct {
> >
> > We had cycle_raw before, why do we need the following two ?
> >
> > > + cycle_t cycle_base_last, cycle_base;
> >
> > I'm not quite sure why you need both cycle_base_last and cycle_base...
>
> So on my first shot at this, I tried to layer the concepts. Using the
> lock-free method to create an abstracted 64bit counter, as provided by
> get_monotonic_cycles(). Then I tried to use that abstraction directly in
> the update_wall_time() code, reading the abstracted 64bit counter and
> using it to update time.
>
> However, then we start keeping cycle_last in 64bit cycles, rather than
> an actual counter read. This then caused changes to be needed in the
> arch vsyscall implementations, and that started to get ugly, as we had
> to also re-implement the abstracted 64bit counter w/ the lock free
> method as well.
>
> So I just backed off and tried to make it simple: We have two sets of
> data that count cycles from the clocksource. One for timekeeping and
> one for get_monotonic_cycles(). It is a little redundant, but I don't
> think you can escape that (the layering method above also has
> redundancy, but it's just hidden until you implement the vsyscall gtod
> methods).
>
>
> > I think I'll need a bit of an explanation of what you are trying to
> > achieve here to see what to expect from the clock source. Are you trying
> > to deal with non-synchronized TSCs across CPUs in a way that will
> > generate a monotonic (sometimes stalling) clock ?
>
> No no no.. I'm not touching the non-synced TSC issue. I'm just trying to
> take clocksource counters, which may be of different bit-widths (ACPI PM
> is 24bits, for instance), and create a lock-free method to translate that
> into a virtual 64bit wide counter (using an accumulation bucket,
> basically).
>
> > What I am trying to say is : I know you are trying to make a virtual
> > clock source where time cannot go backward, but what are your
> > assumptions about the "real" clock source ?
>
> The assumptions about the real clocksource are the same ones we keep in the
> timekeeping core. It counts forward, at a constant rate and only wraps
> after the mask value has been reached.
>
> > Is the intent to deal with an HPET suddenly reset to 0 or something
> > like this ?
>
> Well, dealing with clocksources wrapping short of 64bits.
>
Ah ok, then the problem is clearer :) The main difference between the
approach I use and yours is this: say your clocksource starts a while
after the system began running; your accumulator then starts at 0, so
you have to keep track of both the total time accumulated and the
current lower-order bits of the last accumulation upon
clocksource_accumulate().
I use a different method. I just keep the 64-bit "synthetic counter"
value at each update. The main difference between my synthetic counter
and your accumulated cycles is that my LSBs are always exactly the same
as those of the real clocksource itself. In your case, you always have
to read two 64-bit fields and perform an addition and a subtraction,
while I only need comparisons and bit operations. Only when I detect a
wrap-around of the LSBs (on the read side) do I have to add 1 to the
higher-order bits of the value I return.
So by using my algorithm, you could replace cycle_base_last and
cycle_base with a single "synthetic_tsc" variable.
The periodic calls to clocksource_accumulate() would then only be there
to update the synthetic TSC to get the MSBs right and to make sure the
following reads would be able to detect LSB overflows.
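For illustration only, here is a minimal sketch of the read side of such
a synthetic counter (hypothetical names, not taken from either posted
patch; it assumes the hardware counter wraps short of 64 bits and that
updates come often enough that at most one wrap can happen between two
of them):

struct synthetic_tsc {
	cycle_t last_synth;	/* full 64-bit value at the last periodic update */
	cycle_t hw_mask;	/* mask of the real clocksource, e.g. 0xffffff */
};

static cycle_t synthetic_tsc_read(struct synthetic_tsc *s, cycle_t hw_now)
{
	cycle_t msbs = s->last_synth & ~s->hw_mask;
	cycle_t lsbs = s->last_synth & s->hw_mask;

	if (hw_now < lsbs)		/* LSBs wrapped since the last update */
		msbs += s->hw_mask + 1;	/* carry into the MSBs */

	return msbs | hw_now;		/* LSBs stay identical to the hardware counter */
}

The RCU-style per-CPU double buffering that makes this safe against a
concurrent update is left out for brevity.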
> > Basically, I wonder why you have to calculate the current cycle count
> > from the previous update_wall_time event. Is it because you need to be
> > consistent when a clocksource change occurs ?
>
> Actually, we try to do it from the last clocksource_accumulate() call
> (which is called from update_wall_time).
>
>
> > > + } base[2];
> > > + int base_num;
> > > + } ____cacheline_aligned_in_smp;
> > > u64 xtime_nsec;
> > > s64 error;
> > >
> > > @@ -175,19 +183,21 @@
> > > }
> > >
> > > /**
> > > - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> > > + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
> > > * @cs: pointer to clocksource being read
> > > * @now: current cycle value
> > > *
> > > * Uses the clocksource to return the current cycle_t value.
> > > * NOTE!!!: This is different from clocksource_read, because it
> > > - * returns the accumulated cycle value! Must hold xtime lock!
> > > + * returns a 64bit wide accumulated value.
> > > */
> > > static inline cycle_t
> > > -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> > > +clocksource_get_basecycles(struct clocksource *cs, cycle_t now)
> > > {
> > > - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> > > - offset += cs->cycle_accumulated;
> >
> > I would disable preemption in clocksource_get_basecycles. We would not
> > want to be scheduled out while we hold a pointer to the old array
> > element.
>
> Ok. This is the part I wasn't so sure about. But yes, that sounds
> reasonable.
>
>
> > > + int num = cs->base_num;
> >
> > Since you deal with base_num in a shared manner (not per cpu), you will
> > need a smp_read_barrier_depends() here after the cs->base_num read.
>
> Ah, thanks. I'll add that in.
>
> > You should think about reading the cs->base_num first, and _after_ that
> > read the real clocksource. Here, the clocksource value is passed as
> > parameter. It means that the read clocksource may have been read in the
(the previous line should read "real clocksource", which might help
understanding:)
> > parameter. It means that the real clocksource may have been read in the
> > previous RCU window.
>
> Hmm. Ok, still need to wrap my head around that one, but I think it
> makes sense.
>
>
> > > + cycle_t offset = (now - cs->base[num].cycle_base_last);
> > > + offset &= cs->mask;
> > > + offset += cs->base[num].cycle_base;
> > > return offset;
> > > }
> > >
> > > @@ -197,14 +207,25 @@
> > > * @now: current cycle value
> > > *
> > > * Used to avoids clocksource hardware overflow by periodically
> > > - * accumulating the current cycle delta. Must hold xtime write lock!
> > > + * accumulating the current cycle delta. Uses RCU-like update, but
> > > + * ***still requires the xtime_lock is held for writing!***
> > > */
> > > static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> > > {
> >
> > Why do we still require xtime_lock here ? Can you tell exactly which
> > contexts this function will be called from (periodical timer interrupt?)
> > I guess it is called from one and only one CPU periodically.
>
> Well, the main reason we need the xtime_lock, is because the xtime_lock
> still protects the cycle_last and cycle_accumulated values (which are
> not lock-free). This is part of the redundancy issue above. We're
> updating similar structures, that store different data from the same
> source. One of the two can be handled lock-free, the other cannot.
>
> In addition however, doing the update under the lock makes sure we don't
> do the update in parallel (serializes writers, basically) if
> clocksource_accumulate is called on different cpus (it shouldn't happen
> right now, but in the past it has been possible).
>
Note that if two writers have to be serialized, then you do not respect
the required delay between updates that is necessary to make sure a
reader won't see its data overwritten while it still holds a reference
to its old data.
>
> > > - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> > > + /* First update the monotonic base portion.
> > > + * The dual array update method allows for lock-free reading.
> > > + */
> > > + int num = !cs->base_num;
> > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> >
> > !0 is not necessarily 1. This is why I use cpu_synth->index ? 0 : 1 in
> > my code. The two previous lines seems buggy. (I did the same mistake in
> > my first implementation) ;)
>
> Heh. My first thought to this was just disbelief("WHAAAH? NOOOO!"). But
> Steven made clear the logical issue on irc. Thanks for pointing it out.
> I've been using that assumption (as well as the !! trick) for so long it
> will be a hard habit to break. :)
>
> I'll add in Steven's method to the code.
>
To be continued in the other thread I guess..
> > > + offset &= cs->mask;
> > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> >
> > Here too.
> >
> > > + cs->base[num].cycle_base_last = now;
> >
> > Since you deal with shared data (in my algo, I use per-cpu data), you
> > have to add a wmb() before the base_num value update. Only then will you
> > ensure that other CPUs will see consistent values.
>
> Ok. Thanks I was worried about that as well.
>
I think a smp_wmb() would be enough though.
>
> Thanks so much for the review! I'll go through and make the update
> changes you suggested. Do let me know if my explanations above to your
> questions make sense.
>
We would have to see which method (synthetic tsc vs accumulated cycles)
makes more sense/is fastest/etc...
Mathieu
> thanks
> -john
>
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> Thanks John for doing this!
>
> (comments imbedded)
>
> On Wed, 16 Jan 2008, john stultz wrote:
> > + int num = !cs->base_num;
> > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > + offset &= cs->mask;
> > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > + cs->base[num].cycle_base_last = now;
>
> I would think that we would need some sort of barrier here. Otherwise,
> base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> is needed.
Hopefully addressed in the current version.
> > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > ===================================================================
> > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
> > @@ -71,10 +71,12 @@
> > */
> > static inline s64 __get_nsec_offset(void)
> > {
> > - cycle_t cycle_delta;
> > + cycle_t now, cycle_delta;
> > s64 ns_offset;
> >
> > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > + now = clocksource_read(clock);
> > + cycle_delta = (now - clock->cycle_last) & clock->mask;
> > + cycle_delta += clock->cycle_accumulated;
>
> Is the above just to decouple the two methods?
Yep. clocksource_get_cycles() ended up not being as useful as a helper
function (I was hoping the arch vsyscall implementations could use it,
but they've done too much optimization - although that may reflect a
need up the chain to the clocksource structure).
thanks
-john
* john stultz ([email protected]) wrote:
> On Wed, 2008-01-16 at 18:39 -0500, Mathieu Desnoyers wrote:
> > I would disable preemption in clocksource_get_basecycles. We would not
> > want to be scheduled out while we hold a pointer to the old array
> > element.
> >
> > > + int num = cs->base_num;
> >
> > Since you deal with base_num in a shared manner (not per cpu), you will
> > need a smp_read_barrier_depends() here after the cs->base_num read.
> >
> > You should think about reading the cs->base_num first, and _after_ that
> > read the real clocksource. Here, the clocksource value is passed as
> > parameter. It means that the read clocksource may have been read in the
> > previous RCU window.
>
> Here's an updated version of the patch w/ the suggested memory barrier
> changes and favored (1-x) inversion change. ;) Let me know if you see
> any other holes, or have any other suggestions or ideas.
>
> Still un-tested (my test box will free up soon, I promise!), but builds.
>
> Signed-off-by: John Stultz <[email protected]>
>
> Index: monotonic-cleanup/include/linux/clocksource.h
> ===================================================================
> --- monotonic-cleanup.orig/include/linux/clocksource.h 2008-01-16 12:22:04.000000000 -0800
> +++ monotonic-cleanup/include/linux/clocksource.h 2008-01-16 18:12:53.000000000 -0800
> @@ -87,9 +87,17 @@
> * more than one cache line.
> */
> struct {
> - cycle_t cycle_last, cycle_accumulated, cycle_raw;
> - } ____cacheline_aligned_in_smp;
> + cycle_t cycle_last, cycle_accumulated;
>
> + /* base structure provides lock-free read
> + * access to a virtualized 64bit counter
> + * Uses RCU-like update.
> + */
> + struct {
> + cycle_t cycle_base_last, cycle_base;
> + } base[2];
> + int base_num;
> + } ____cacheline_aligned_in_smp;
> u64 xtime_nsec;
> s64 error;
>
> @@ -175,19 +183,29 @@
> }
>
> /**
> - * clocksource_get_cycles: - Access the clocksource's accumulated cycle value
> + * clocksource_get_basecycles: - get the clocksource's accumulated cycle value
> * @cs: pointer to clocksource being read
> * @now: current cycle value
> *
> * Uses the clocksource to return the current cycle_t value.
> * NOTE!!!: This is different from clocksource_read, because it
> - * returns the accumulated cycle value! Must hold xtime lock!
> + * returns a 64bit wide accumulated value.
> */
> static inline cycle_t
> -clocksource_get_cycles(struct clocksource *cs, cycle_t now)
> +clocksource_get_basecycles(struct clocksource *cs)
> {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> - offset += cs->cycle_accumulated;
> + int num;
> + cycle_t now, offset;
> +
> + preempt_disable();
> + num = cs->base_num;
> + smp_read_barrier_depends();
> + now = clocksource_read(cs);
> + offset = (now - cs->base[num].cycle_base_last);
> + offset &= cs->mask;
> + offset += cs->base[num].cycle_base;
> + preempt_enable();
> +
> return offset;
> }
>
> @@ -197,14 +215,26 @@
> * @now: current cycle value
> *
> * Used to avoids clocksource hardware overflow by periodically
> - * accumulating the current cycle delta. Must hold xtime write lock!
> + * accumulating the current cycle delta. Uses RCU-like update, but
> + * ***still requires the xtime_lock is held for writing!***
> */
> static inline void clocksource_accumulate(struct clocksource *cs, cycle_t now)
> {
> - cycle_t offset = (now - cs->cycle_last) & cs->mask;
> + /* First update the monotonic base portion.
> + * The dual array update method allows for lock-free reading.
> + */
> + int num = 1 - cs->base_num;
(nitpick)
Right here you could probably express 1-num as cs->base_num, since we
are the only ones supposed to touch it.
> + cycle_t offset = (now - cs->base[1-num].cycle_base_last);
> + offset &= cs->mask;
here too.
> + cs->base[num].cycle_base = cs->base[1-num].cycle_base + offset;
> + cs->base[num].cycle_base_last = now;
> + wmb();
As I just emailed : smp_wmb() *should* be enough. I don't see which
architecture could reorder writes wrt local interrupts ? (please tell me
if I am grossly mistaken)
Mathieu
> + cs->base_num = num;
> +
> + /* Now update the cycle_accumulated portion */
> + offset = (now - cs->cycle_last) & cs->mask;
> cs->cycle_last = now;
> cs->cycle_accumulated += offset;
> - cs->cycle_raw += offset;
> }
>
> /**
> Index: monotonic-cleanup/kernel/time/timekeeping.c
> ===================================================================
> --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 17:51:50.000000000 -0800
> @@ -71,10 +71,12 @@
> */
> static inline s64 __get_nsec_offset(void)
> {
> - cycle_t cycle_delta;
> + cycle_t now, cycle_delta;
> s64 ns_offset;
>
> - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> + now = clocksource_read(clock);
> + cycle_delta = (now - clock->cycle_last) & clock->mask;
> + cycle_delta += clock->cycle_accumulated;
> ns_offset = cyc2ns(clock, cycle_delta);
>
> return ns_offset;
> @@ -105,35 +107,7 @@
>
> cycle_t notrace get_monotonic_cycles(void)
> {
> - cycle_t cycle_now, cycle_delta, cycle_raw, cycle_last;
> -
> - do {
> - /*
> - * cycle_raw and cycle_last can change on
> - * another CPU and we need the delta calculation
> - * of cycle_now and cycle_last happen atomic, as well
> - * as the adding to cycle_raw. We don't need to grab
> - * any locks, we just keep trying until get all the
> - * calculations together in one state.
> - *
> - * In fact, we __cant__ grab any locks. This
> - * function is called from the latency_tracer which can
> - * be called anywhere. To grab any locks (including
> - * seq_locks) we risk putting ourselves into a deadlock.
> - */
> - cycle_raw = clock->cycle_raw;
> - cycle_last = clock->cycle_last;
> -
> - /* read clocksource: */
> - cycle_now = clocksource_read(clock);
> -
> - /* calculate the delta since the last update_wall_time: */
> - cycle_delta = (cycle_now - cycle_last) & clock->mask;
> -
> - } while (cycle_raw != clock->cycle_raw ||
> - cycle_last != clock->cycle_last);
> -
> - return cycle_raw + cycle_delta;
> + return clocksource_get_basecycles(clock);
> }
>
> unsigned long notrace cycles_to_usecs(cycle_t cycles)
>
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* john stultz ([email protected]) wrote:
>
> On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > Thanks John for doing this!
> >
> > (comments imbedded)
> >
> > On Wed, 16 Jan 2008, john stultz wrote:
> > > + int num = !cs->base_num;
> > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > + offset &= cs->mask;
> > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > + cs->base[num].cycle_base_last = now;
> >
> > I would think that we would need some sort of barrier here. Otherwise,
> > base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> > is needed.
>
> Hopefully addressed in the current version.
>
>
> > > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > > ===================================================================
> > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
> > > @@ -71,10 +71,12 @@
> > > */
> > > static inline s64 __get_nsec_offset(void)
> > > {
> > > - cycle_t cycle_delta;
> > > + cycle_t now, cycle_delta;
> > > s64 ns_offset;
> > >
> > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > > + now = clocksource_read(clock);
> > > + cycle_delta = (now - clock->cycle_last) & clock->mask;
> > > + cycle_delta += clock->cycle_accumulated;
> >
> > Is the above just to decouple the two methods?
>
> Yep. clocksource_get_cycles() ended up not being as useful as an helper
> function (I was hoping the arch vsyscall implementations could use it,
> but they've done too much optimization - although that may reflect a
> need up the chain to the clocksource structure).
>
The problem with vsyscall is that we will have a hard time disabling
preemption :( Therefore, ensuring that the read of the data is done in a
timely manner is hard to do.
Mathieu
> thanks
> -john
>
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> >
> > Yep. clocksource_get_cycles() ended up not being as useful as an helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
> >
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, insuring that the read of the data is done in a
> timely manner is hard to do.
You'll have more than a hard time disabling preemption for vsyscall. We'll
need to come up with a better solution, then. vsyscall cannot modify any
kernel memory, nor can it disable preemption.
-- Steve
* Mathieu Desnoyers ([email protected]) wrote:
> * john stultz ([email protected]) wrote:
> >
> > On Wed, 2008-01-16 at 18:33 -0500, Steven Rostedt wrote:
> > > Thanks John for doing this!
> > >
> > > (comments imbedded)
> > >
> > > On Wed, 16 Jan 2008, john stultz wrote:
> > > > + int num = !cs->base_num;
> > > > + cycle_t offset = (now - cs->base[!num].cycle_base_last);
> > > > + offset &= cs->mask;
> > > > + cs->base[num].cycle_base = cs->base[!num].cycle_base + offset;
> > > > + cs->base[num].cycle_base_last = now;
> > >
> > > I would think that we would need some sort of barrier here. Otherwise,
> > > base_num could be updated before all the cycle_base. I'd expect a smp_wmb
> > > is needed.
> >
> > Hopefully addressed in the current version.
> >
> >
> > > > Index: monotonic-cleanup/kernel/time/timekeeping.c
> > > > ===================================================================
> > > > --- monotonic-cleanup.orig/kernel/time/timekeeping.c 2008-01-16 12:21:46.000000000 -0800
> > > > +++ monotonic-cleanup/kernel/time/timekeeping.c 2008-01-16 14:15:31.000000000 -0800
> > > > @@ -71,10 +71,12 @@
> > > > */
> > > > static inline s64 __get_nsec_offset(void)
> > > > {
> > > > - cycle_t cycle_delta;
> > > > + cycle_t now, cycle_delta;
> > > > s64 ns_offset;
> > > >
> > > > - cycle_delta = clocksource_get_cycles(clock, clocksource_read(clock));
> > > > + now = clocksource_read(clock);
> > > > + cycle_delta = (now - clock->cycle_last) & clock->mask;
> > > > + cycle_delta += clock->cycle_accumulated;
> > >
> > > Is the above just to decouple the two methods?
> >
> > Yep. clocksource_get_cycles() ended up not being as useful as an helper
> > function (I was hoping the arch vsyscall implementations could use it,
> > but they've done too much optimization - although that may reflect a
> > need up the chain to the clocksource structure).
> >
>
> The problem with vsyscall is that we will have a hard time disabling
> preemption :( Therefore, insuring that the read of the data is done in a
> timely manner is hard to do.
>
Sorry for the self-reply, but I have thought, in the past, of a way to
make this possible.
It would imply the creation of a new vsyscall : vgetschedperiod
It would read a counter that would be incremented each time the thread
is scheduled out (or in). It would be a per-thread counter (not a
per-cpu counter) so we can deal appropriately with a stopped thread
that happens to come back running a loooong time afterward (with
per-cpu counters, we could falsely read the same 32-bit counter value
if it is shared with other thread activity).
Then, the clocksource read code would look like :

	int period;

	do {
		period = vgetschedperiod();

		/* perform the clocksource read.. */

	} while (period != vgetschedperiod());

Therefore, we would be sure that we have not been scheduled out while
reading the value. I think this new vsyscall could be useful to others.
Actually, it would make implementation of RCU in user-space possible (as
long as the read-side can retry the read operation).
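Purely as an illustration (vgetschedperiod() does not exist, and the
clocksource read helper name below is made up), the user-space side
could look like :

long vgetschedperiod(void);			/* proposed vsyscall */
unsigned long long vread_clocksource(void);	/* whatever lock-free read we end up with */

static unsigned long long read_cycles(void)
{
	unsigned long long cycles;
	long period;

	do {
		period = vgetschedperiod();
		cycles = vread_clocksource();	/* the actual clocksource read */
	} while (period != vgetschedperiod());	/* scheduled out in between : retry */

	return cycles;
}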
Mathieu
>
> > thanks
> > -john
> >
> >
>
> --
> Mathieu Desnoyers
> Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
> OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter (not a per cpu
> counter) so we can deal appropriately with a stopped thread that would
> happen to come back running a loooong time afterward (if we do per-cpu
> counters, we could get the same 32 bits counter value falsely if it is
> shared with other thread activity).
>
> Then, the clocksource read code would look like :
>
> int period;
>
> do {
> period = vgetschedperiod();
>
> perform the clocksource read..
>
> } while (period != vgetschedperiod());
>
> Therefore, we would be sure that we have not been scheduled out while
> reading the value. I think this new vsyscall could be useful for others.
> Actually, it would make implementation of RCU in user-space possible (as
> long as the read-side can retry the read operation).
This is something that I would agree is useful.
-- Steve
Mathieu Desnoyers writes:
> Sorry for self-reply, but I thought, in the past, of a way to make this
> possible.
>
> It would imply the creation of a new vsyscall : vgetschedperiod
>
> It would read a counter that would increment each time the thread is
> scheduled out (or in). It would be a per thread counter
It's very hard to do a per-thread counter in the VDSO, since threads
in the same process see the same memory, by definition. You'd have to
have an array of counters and have some way for each thread to know
which entry to read. Also you'd have to find space for tens or
hundreds of thousands of counters, since there can be that many
threads in a process sometimes.
Paul.
On Thu, 17 Jan 2008, Paul Mackerras wrote:
>
> It's very hard to do a per-thread counter in the VDSO, since threads
> in the same process see the same memory, by definition. You'd have to
> have an array of counters and have some way for each thread to know
> which entry to read. Also you'd have to find space for tens or
> hundreds of thousands of counters, since there can be that many
> threads in a process sometimes.
I was thinking about this. What would also work is just the ability to
read the schedule counter for the current cpu. Now this would require that
the task have a way to know which CPU it is currently on.
-- Steve
* Paul Mackerras ([email protected]) wrote:
> Mathieu Desnoyers writes:
>
> > Sorry for self-reply, but I thought, in the past, of a way to make this
> > possible.
> >
> > It would imply the creation of a new vsyscall : vgetschedperiod
> >
> > It would read a counter that would increment each time the thread is
> > scheduled out (or in). It would be a per thread counter
>
> It's very hard to do a per-thread counter in the VDSO, since threads
> in the same process see the same memory, by definition. You'd have to
> have an array of counters and have some way for each thread to know
> which entry to read. Also you'd have to find space for tens or
> hundreds of thousands of counters, since there can be that many
> threads in a process sometimes.
>
> Paul.
>
Crazy ideas :
Could we do something along the lines of thread-local storage ?
Or could we map a per-thread page that would contradict this
"definition" ?
Or could we move the beginning of the user-space thread stack down by 4
bytes (it's already put at a random address anyway) and use those 32
bits for our variable ? We don't care if userspace also modifies it;
the kernel would blindly increment it, so there would be no security
concerns involved.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
>
> Crazy ideas :
>
> Could we do something along the lines of the thread local storage ?
>
> Or could we map a per-thread page that would contradict this
> "definition" ?
When working on lguest64, I implemented a "per CPU" shadow page, so that a
process of a guest running on one real CPU could never see this page for
a process of the same guest running on another real CPU.
This is a nightmare to manage. The thing is that threads share the same
pgd. With shadow pages, it's doable, but hard to manage. It required
making a "copy" of the pgd for each real CPU.
Here we are not talking about guests, but bare-metal. So this complicates
things even more. The problem is that each thread shares the same pgd. If
two threads run at the same time on two different CPUs, then the two
threads would see the same counter. But we need to make this per cpu.
Now if the thread was given a clue to what CPU it was running on, then we
might be able to accomplish something. But then we have the same issue.
How do you tell a thread what CPU it's on, without running into the same
issues as where to store this data?
-- Steve
>
> Or can we move down the beginning of the user-space thread stack of 4
> bytes (it's already put at a random address anyway) and use these 32
> bits to put our variable ? We don't care if userspace also modifies it;
> the kernel would blindly increment it, so there would be no security
> concerns involved.
* Steven Rostedt ([email protected]) wrote:
>
> On Thu, 17 Jan 2008, Paul Mackerras wrote:
> >
> > It's very hard to do a per-thread counter in the VDSO, since threads
> > in the same process see the same memory, by definition. You'd have to
> > have an array of counters and have some way for each thread to know
> > which entry to read. Also you'd have to find space for tens or
> > hundreds of thousands of counters, since there can be that many
> > threads in a process sometimes.
>
> I was thinking about this. What would also work is just the ability to
> read the schedule counter for the current cpu. Now this would require that
> the task had a way to know which CPU it was currently on.
>
> -- Steve
>
The problem with the per cpu schedule counter would be dealing with
stopped tasks that wake up at exactly the wrong moment. With a 32-bit
counter, it could happen.
At 1000HZ, given the scheduler is only called once per tick (an
approximation, it can also be called explicitly), the counter would
wrap after 2^32 / 1000 s, roughly 4.3 million seconds, i.e. after about
49.7 days. But if the kernel calls schedule() more often than that, it
can be sooner.
Having a per-thread variable would make sure we don't have this problem.
By the way, a task cannot "really know" which CPU it is on : it could be
migrated between the cpu ID read and the moment it uses it as an array
index. Actually, knowing the schedule count would help to implement
algorithms that get the cpu id and know it has been valid for a period
of time without pinning to a particular CPU.
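For example, something along these lines (assuming the proposed
vgetschedperiod() and glibc's sched_getcpu(), where available; just a
sketch):

	int cpu;
	long period;

	do {
		period = vgetschedperiod();
		cpu = sched_getcpu();
		/* ... access per-cpu user-space data indexed by cpu ... */
	} while (period != vgetschedperiod());	/* migrated or preempted : retry */

would make sure the cpu number stayed valid for the whole access.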
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Wed, 16 Jan 2008, Mathieu Desnoyers wrote:
>
> Or could we map a per-thread page that would contradict this
> "definition" ?
Over my dead body.
It's been done before. Many times. It's horrible, and means that you need
to flush the TLB on context switches between threads and cannot share the
same hw page tables across CPU's (since now different CPU's run different
threads).
It generally makes threads pointless. You might as well implement them as
processes with shared mmap.
Linus
> > > >
> > > > One thing I want to clear up. The major difference between this
> > > > latency_tracer and LTTng is what we consider fast paths. The latency
> > > > tracer is recording things like enabling and disabling interrupts, preempt
> > > > count changes, or simply profiling all function calls. Those are what I
> > > > consider fast paths. The slow path WRT the latency_tracer are things like
> > > > context switches. This is why I don't have a problem with copying the
> > > > comm at context switch time. Because that _is_ a slow path for the latency
> > > > tracer.
> > >
> > > LTTng hooks in the lockdep tracer to trace irq on/off, spinlocks, etc..
> > > in flight recorder mode, we have nothing to write to disk and therefore
> > > we can handle very frequent events. We then do the analysis off-line
> > > using the last MB written in the buffers. The advantage is that the
> > > kernel dumbly writes data to a buffer : we therefore move the complexity
> > > to user-space.
> >
> > But you would still need to do something in case you want this information
> > dumped to console on a kernel crash. Of course you can rely on kexec, but
> > if the kexec fails (which is possible) then you lose all the information.
> > Having the ability to dump the output to console on a crash is one of the
> > benefits of latency_tracer that I want to keep.
> >
>
> There has been some integration done between LTTng and the "crash" tool
> to extract the buffers from a crashed kernel. I am not an expert in
> crash buffer extraction though, but I guess all the available mechanisms
> depend on kexec and could show the limits you are referring to.
>
> If you really want to pretty-print the information to the console, I
> would propose that you leave that part of the problem to a different
> output module. The core of the latency tracer could keep the minimum
> information. Then, when a dump is triggered, it either sends the
> information to LTTng or to your console pretty-printer.
>
> However, I would not call the pretty-printer a "tracer" module per-se.
> We would have to accept that it is a bit more tied to the kernel
> internals than the latency tracer. My goal is to separate the core
> "profiling" module from the optional "pretty-printing" module as much as
> possible so the latency tracer core could be reused by other output
> modules.
Mathieu,
I've been thinking about the context switch marker, and the printf format
field you have:
prepare_task_switch(rq, prev, next);
+ trace_mark(kernel_sched_schedule,
+ "prev_pid %d next_pid %d prev_state %ld",
+ prev->pid, next->pid, prev->state);
mm = next->mm;
Now I see that this is great for your tracer, since all your hook would
need to do is:
static notrace void simple_trace(const struct marker *mdata,
				 void *private_data,
				 const char *format, ...)
{
	va_list ap;

	va_start(ap, format);
	simple_trace_record(ap, format);
	va_end(ap);
}
And you could hook this up to all your traces. Wonderful!
But...
Tracers that want to do a bit more work, like recording timings and seeing
if we hit some max somewhere, can't do much with that pretty print data.
For example, I like to record the priority of a task that is being swapped
out as well as the one being swapped in. But with this, all I can get is
the priority of the prev task (since it is still current).
You told me that I should put hooks into where the priority gets modified,
so that I can trace it there at a non hot path. Well, I have some issues
with this.
1) to me it's a management nightmare. Yes, I could hook into lttng, but
that too is too much. That takes me back to the days when COTS became the
new buzzword, and we ended up having to incorporate COTS products into
things better left done in house. The end result was more glue code than
what would happen if it was simply done in house, and a product that was
totally inefficient.
2) this requires we put a marker at all the places that might change the
data we want the snapshot of at the marker (here being prio at context
switch time). I could imagine some new code coming into the kernel that
modifies priority but the author has no idea about having to update the
trace marker, and the trace output ends up showing stale data. And this
would force the poor bastard that needs to maintain this code to debug the
tracer on top of the code they are maintaining, which would simply end
with that poor bastard losing all confidence in the tracer and giving up
on it.
I know that if we did something like:
trace_mark(kernel_sched_schedule,
"prev %p next %p",
prev, next);
It would be useless for the simple recording, because what the user would
see is just two meaningless pointer numbers.
So, at a minimum, I'd like to at least have meta data attached:
trace_mark(kernel_sched_schedule,
"prev_pid %d next_pid %d prev_state %ld\0"
"prev %p next %p",
prev->pid, next->pid, prev->state,
prev, next);
This would allow for both the nice pretty print of your trace, as well as
allowing other tracers to get to better meta data.
The '\0' would keep your tracer from recording the extra data, and we
could add some way to ignore the parameters in the printf to let other
tracers get straight to the meta data.
Actually here, since prev == current, we could omit that.
Just a thought.
-- Steve
Hi -
On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> [...]
> + trace_mark(kernel_sched_schedule,
> + "prev_pid %d next_pid %d prev_state %ld",
> + prev->pid, next->pid, prev->state);
> [...]
> But...
>
> Tracers that want to do a bit more work, like recording timings and seeing
> if we hit some max somewhere, can't do much with that pretty print data.
If you find yourself wanting to perform computations like finding
maxima, or responding right there as opposed to later during userspace
trace data extraction, then you're trending toward a tool like
systemtap.
> [...]
> So, at a minimum, I'd like to at least have meta data attached:
> trace_mark(kernel_sched_schedule,
> "prev_pid %d next_pid %d prev_state %ld\0"
> "prev %p next %p",
> prev->pid, next->pid, prev->state,
> prev, next);
>
> This would allow for both the nice pretty print of your trace, as well as
> allowing other tracers to get to better meta data.
Yes, more self-contained marker events are necessary for meaningful
in-situ processing. That needs to be balanced by the increased cost
for computing and passing the extra parameters, multiplied by the event
occurrence rate.
In this case, the prev/next pointers are sufficient to compute the
other values. For particularly performance-critical markers, it may
not be unreasonable to expect the callback functions to dereference
such pointers for pretty-printing or other processing.
> The '\0' would keep your tracer from recording the extra data, and we
> could add some way to ignore the parameters in the printf to let other
> traces get straight to the meta data.
This \0 hack is perhaps too clever. Much of the cost of the extra
parameters is already paid by the time that a simpleminded tracing
callback function starts going through the string. Also, I believe
the systemtap marker interface would break if the format strings were
not singly terminated ordinary strings.
- FChE
On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:
> Hi -
>
> On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > [...]
> > + trace_mark(kernel_sched_schedule,
> > + "prev_pid %d next_pid %d prev_state %ld",
> > + prev->pid, next->pid, prev->state);
> > [...]
> > But...
> >
> > Tracers that want to do a bit more work, like recording timings and seeing
> > if we hit some max somewhere, can't do much with that pretty print data.
>
> If you find yourself wanting to perform computations like finding
> maxima, or responding right there as opposed to later during userspace
> trace data extraction, then you're trending toward a tool like
> systemtap.
Yes, very much so. I'm working on getting the latency_tracer from the -rt
patch into something suitable for mainline. We need to calculate the max
latencies on the fly. If we hit a max, then we save it off, otherwise, we
blow away the trace and start again.
>
> > [...]
> > So, at a minimum, I'd like to at least have meta data attached:
> > trace_mark(kernel_sched_schedule,
> > "prev_pid %d next_pid %d prev_state %ld\0"
> > "prev %p next %p",
> > prev->pid, next->pid, prev->state,
> > prev, next);
> >
> > This would allow for both the nice pretty print of your trace, as well as
> > allowing other tracers to get to better meta data.
>
> Yes, more self-contained marker events are necessary for meaningful
> in-situ processing. That needs to be balanced by the increased cost
> for computing and passing the extra parameters, multiplied the event
> occurrence rate.
The cost is only incurred when the marker is armed, since the marker check
is an unlikely() branch and the marker code will be placed at the end of
the function.
>
> In this case, the prev/next pointers are sufficient to compute the
> other values. For particularly performance-critical markers, it may
> not be unreasonable to expect the callback functions to dereference
> such pointers for pretty-printing or other processing.
This was exactly my point to Mathieu, but I think he has LTTng very much
coupled with the markers. I haven't played with LTTng (yet), but from what
I've read (Mathieu, correct me if I'm wrong), it seems that all the
markers become visible to userspace, and the user can simply turn them on
or off. LTTng doesn't need any knowledge of the marker since the marker
contains how to print the information.
So* by placing a "prev %p next %p" as the only information, we lose out on
this automated way LTTng works. Because the two pointers are just
meaningless numbers to the user.
>
> > The '\0' would keep your tracer from recording the extra data, and we
> > could add some way to ignore the parameters in the printf to let other
> > traces get straight to the meta data.
>
> This \0 hack is perhaps too clever. Much of the cost of the extra
> parameters is already paid by the time that a simpleminded tracing
> callback function starts going through the string. Also, I believe
> the systemtap marker interface would break if the format strings were
> not singly terminated ordinary strings.
Well, actually when I first wrote this letter, I used "--" as a delimiter
to allow a tool to hide the pretty stuff. But then I thought about the
"clever hack" with the '\0', The "--" may be better since it wont break
systemtap.
-- Steve
* dvhart - bah!
* Steven Rostedt ([email protected]) wrote:
>
> On Thu, 17 Jan 2008, Frank Ch. Eigler wrote:
>
> > Hi -
> >
> > On Thu, Jan 17, 2008 at 03:08:33PM -0500, Steven Rostedt wrote:
> > > [...]
> > > + trace_mark(kernel_sched_schedule,
> > > + "prev_pid %d next_pid %d prev_state %ld",
> > > + prev->pid, next->pid, prev->state);
> > > [...]
> > > But...
> > >
> > > Tracers that want to do a bit more work, like recording timings and seeing
> > > if we hit some max somewhere, can't do much with that pretty print data.
> >
> > If you find yourself wanting to perform computations like finding
> > maxima, or responding right there as opposed to later during userspace
> > trace data extraction, then you're trending toward a tool like
> > systemtap.
>
> Yes, very much so. I'm working on getting the latency_tracer from the -rt
> patch into something suitable for mainline. We need to calculate the max
> latencies on the fly. If we hit a max, then we save it off, otherwise, we
> blow away the trace and start again.
>
> >
> > > [...]
> > > So, at a minimum, I'd like to at least have meta data attached:
> > > trace_mark(kernel_sched_schedule,
> > > "prev_pid %d next_pid %d prev_state %ld\0"
> > > "prev %p next %p",
> > > prev->pid, next->pid, prev->state,
> > > prev, next);
> > >
> > > This would allow for both the nice pretty print of your trace, as well as
> > > allowing other tracers to get to better meta data.
> >
> > Yes, more self-contained marker events are necessary for meaningful
> > in-situ processing. That needs to be balanced by the increased cost
> > for computing and passing the extra parameters, multiplied the event
> > occurrence rate.
>
> The cost is only done when the marker is armed. Since the marker is an
> unlikely, and will be placed at the end of the function.
>
> >
> > In this case, the prev/next pointers are sufficient to compute the
> > other values. For particularly performance-critical markers, it may
> > not be unreasonable to expect the callback functions to dereference
> > such pointers for pretty-printing or other processing.
>
> This was exactly my point to Mathieu, but I think he has LTTng very much
> coupled with the markers. I haven't played with LTTng (yet), but from what
> I've read (Mathieu, correct me if I'm wrong), it seems that all the
> markers become visible to userspace, and the user can simple turn them on
> or off. LTTng doesn't need any knowledge of the marker since the marker
> contains how to print the information.
>
> So* by placing a "prev %p next %p" as the only information, we lose out on
> this automated way LTTng works. Because the two pointers are just
> meaningless numbers to the user.
>
Exactly. We have, at the marker site :
- a marker identifier
- format string containing field names and types
- arguments
I would like to keep that as much in a straight line as possible with
what ends up in the trace.
However, I see that it limits what can be done by in-kernel tracers. And
by the way, I also suffer from the same kind of limitation in LTTng. Here
is an example :
I would like to replace blktrace (actually, I already have a quite
complete implementation). However, there is some code run in the kernel
to "prepare" the information for the trace which is blktrace specific.
Since this code is not required to run when tracing is disabled, it can
be seen as "glue-code" between the kernel tracing point and the
extraction of data to trace.
What looked like the least intrusive solution was to create inline
functions that consist of branches over code considered unlikely (could
be a function call) where the glue-code is executed to prepare the data.
It's a bit like what the markers are doing, except that there is no
marker name associated and no format string : the subsystem being traced
must enable its tracing features by itself (could be a /proc file). It
makes sense, since this type of code has to be subsystem-specific
anyway.
But I have not seen a lot of situations where that kind of glue-code was
needed, so I think it makes sense to keep markers simple to use and
efficient for the common case.
Then, in this glue-code, we can put trace_mark() and calls to in-kernel
tracers.
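Something along these lines, purely as a sketch with made-up names (not
the actual blktrace replacement code):

extern int blk_probe_enabled;		/* flipped via e.g. a /proc file */

void __blk_probe_rq_issue(struct request_queue *q, struct request *rq);

static inline void blk_probe_rq_issue(struct request_queue *q,
				      struct request *rq)
{
	if (likely(!blk_probe_enabled))
		return;
	/*
	 * Out-of-line glue : compute the blktrace-specific fields,
	 * then call trace_mark() and/or an in-kernel tracer.
	 */
	__blk_probe_rq_issue(q, rq);
}

When the subsystem's tracing is disabled, all this costs is a single
test-and-branch at the instrumentation site.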
Since the markers are eventually meant to become an API visible from
user-space, I think it makes sense to keep it clean. If an in-kernel
tracer needs extra information, I think it would make sense for it to
get it from a mechanism that does not make the exported information
visible to user-space.
What do you think ?
> >
> > > The '\0' would keep your tracer from recording the extra data, and we
> > > could add some way to ignore the parameters in the printf to let other
> > > traces get straight to the meta data.
> >
> > This \0 hack is perhaps too clever. Much of the cost of the extra
> > parameters is already paid by the time that a simpleminded tracing
> > callback function starts going through the string. Also, I believe
> > the systemtap marker interface would break if the format strings were
> > not singly terminated ordinary strings.
>
> Well, actually when I first wrote this letter, I used "--" as a delimiter
> to allow a tool to hide the pretty stuff. But then I thought about the
> "clever hack" with the '\0', The "--" may be better since it wont break
> systemtap.
>
It could be done I guess. But it looks a bit ugly. :) I would rather
prefer to export the "pretty stuff" through an interface not involving
markers. Or if there is a way to separate the "callback" mechanism from
the "export to user-space" API parts of the markers, I am open to
proposals.
Mathieu
> -- Steve
>
> * dvhart - bah!
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
>
> But I have not seen a lot of situations where that kind of glue-code was
> needed, so I think it makes sense to keep markers simple to use and
> efficient for the common case.
>
> Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> tracers.
I'm almost done with the latency tracer work, and there are only a total
of 6 hooks that I needed.
- schedule context switch
- try_to_wake_up
- hard_irqs_off (which is already there for lockdep)
- hard irqs on (also for lockdep)
- lock_contention (already in for the lock contention code)
- lock acquire (also in there for contention code)
With the above, we could have this (if this is what I think you are
recommending). For example in the context_switch case:
trace_switch_to(prev, next);
switch_to(prev, next, prev);
and in sched.h I could have:
static inline void trace_switch_to(struct task_struct *prev,
				   struct task_struct *next)
{
	trace_mark(kernel_sched_schedule,
		   "prev_pid %d next_pid %d prev_state %ld",
		   prev->pid, next->pid, prev->state);

	trace_context_switch(prev, next);
}
and have the trace_context_switch code be something that is turned on with
the latency tracing utility (config option). That way production code can
keep it off.
-- Steve
* Steven Rostedt ([email protected]) wrote:
> On Fri, 18 Jan 2008, Mathieu Desnoyers wrote:
> >
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
>
> - schedule context switch
> - try_to_wake_up
> - hard_irqs_off (which is already there for lockdep)
> - hard irqs on (also for lockdep)
> - lock_contention (already in for the lock contention code)
> - lock acquire (also in there for contention code)
>
> With the above, we could have this (if this is what I think you are
> recommending). For example in the context_switch case:
>
> trace_switch_to(prev, next);
> switch_to(prev, next, prev);
>
> and in sched.h I could have:
>
Almost.. I would add :
static int trace_switch_to_enabled;
> static inline trace_switch_to(struct task_struct *prev,
> struct task_struct *next)
> {
if (likely(!trace_switch_to_enabled))
return;
> trace_mark(kernel_schedudule,
> "prev_pid %d next_pid %d prev_state %ld",
> prev->pid, next->pid, prev->pid);
>
> trace_context_switch(prev, next);
> }
>
And some code to activate the trace_switch_to_enabled variable (ideally
keeping a refcount).
By doing this, we would have the minimum impact on the scheduler when
disabled.
But remember that this trace_switch_to_enabled could be enabled for both
markers and your tracer, so you might need to put a branch at the
beginning of trace_context_switch() too.
Mathieu
> and have the trace_context_switch code be something that is turned on with
> the latency tracing utility (config option). That way production code can
> keep it off.
>
> -- Steve
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
Hi -
On Fri, Jan 18, 2008 at 05:49:19PM -0500, Steven Rostedt wrote:
> [...]
> > But I have not seen a lot of situations where that kind of glue-code was
> > needed, so I think it makes sense to keep markers simple to use and
> > efficient for the common case.
> >
> > Then, in this glue-code, we can put trace_mark() and calls to in-kernel
> > tracers.
>
> I'm almost done with the latency tracer work, and there are only a total
> of 6 hooks that I needed.
> [...]
> With the above, we could have this (if this is what I think you are
> recommending). [...]
> static inline trace_switch_to(struct task_struct *prev,
> struct task_struct *next)
> {
> trace_mark(kernel_schedudule,
> "prev_pid %d next_pid %d prev_state %ld",
> prev->pid, next->pid, prev->pid);
>
> trace_context_switch(prev, next);
> }
I'm afraid I don't see the point in this. You could use one marker
for all that data (and force the more naive tracer callbacks to ignore
some of them). You could even use two markers (and force the more
naive tracer to attach only to its favorite subset). But to use a
second, different, less efficient, not more configurable tracing hook
mechanism in the same logical spot makes no sense to me.
- FChE
Hi -
On Fri, Jan 18, 2008 at 06:19:29PM -0500, Mathieu Desnoyers wrote:
> [...]
> Almost.. I would add :
>
> static int trace_switch_to_enabled;
>
> > static inline trace_switch_to(struct task_struct *prev,
> > struct task_struct *next)
> > {
> if (likely(!trace_switch_to_enabled))
> return;
> > trace_mark(kernel_schedudule,
> > "prev_pid %d next_pid %d prev_state %ld",
> > prev->pid, next->pid, prev->pid);
> >
> > trace_context_switch(prev, next);
> > }
>
> And some code to activate the trace_switch_to_enabled variable (ideally
> keeping a refcount). [...]
All this complexity is to be justified by keeping the raw prev/next
pointers from being sent to a naive tracer? It seems to me way out of
proportion.
- FChE
On Fri, 18 Jan 2008, Frank Ch. Eigler wrote:
>
> All this complexity is to be justified by keeping the raw prev/next
> pointers from being sent to a naive tracer? It seems to me way out of
> proportion.
Damn, and I just blew away all my marker code for something like this ;-)
Actually, you just gave me a great idea that I think can help all of us.
OK, Mathieu may not be in total agreement, but I think this is the
ultimate compromise.
We have in sched.c the following marker:
trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);
Then Mathieu can add in some code somewhere (or a module, or something)
	ret = marker_probe_register("kernel_sched_scheduler",
				    "prev %p next %p",
				    pretty_print_sched_switch, NULL);

static void pretty_print_sched_switch(const struct marker *mdata,
				      void *private_data,
				      const char *format, ...)
{
	va_list ap;
	struct task_struct *prev;
	struct task_struct *next;

	va_start(ap, format);
	prev = va_arg(ap, typeof(prev));
	next = va_arg(ap, typeof(next));
	va_end(ap);

	trace_mark(kernel_pretty_print_sched_switch,
		   "prev_pid %d next_pid %d prev_state %ld",
		   prev->pid, next->pid, prev->state);
}
Then LTTng on startup could arm the normal kernel_sched_switch code and
have the user see the nice one. All without adding any more goo or
overhead to the non tracing case, and keeping a few critical markers with
enough information to be useful to other tracers!
Thoughts?
-- Steve
Hi -
On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> [...]
> > All this complexity is to be justified by keeping the raw prev/next
> > pointers from being sent to a naive tracer? It seems to me way out of
> > proportion.
>
> Damn, and I just blew away all my marker code for something like this ;-)
Sorry! :-)
> [...]
> We have in sched.c the following marker:
> trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);
Fine so far!
> Then Mathieu can add in some code somewhere (or a module, or something)
> ret = marker_probe_register("kernel_sched_scheduler",
> "prev %p next %p",
> pretty_print_sched_switch, NULL);
> static void pretty_print_sched_switch(const struct marker *mdata,
> void *private_data,
> const char *format, ...)
> {
> [...]
> trace_mark(kernel_pretty_print_sched_switch,
> "prev_pid %d next_pid %d prev_state %ld",
> prev->pid, next->pid, prev->state);
> }
That marker_probe_register call would need to be done only when the
embedded (k_p_p_s_s) marker is actually being used. Otherwise we'd
lose all the savings of a dormant sched.c marker by always calling
into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
was active.
In any case, if the naive tracer agrees to become educated about some
of these markers in the form of intermediary functions like that, they
need not insist on a second hop through marker territory anyway:
static void pretty_print_sched_switch(const struct marker *mdata,
void *private_data,
const char *format, ...)
{
[...]
lttng_backend_trace(kernel_pretty_print_sched_switch,
"prev_pid %d next_pid %d prev_state %ld",
prev->pid, next->pid, prev->state);
}
- FChE
* Frank Ch. Eigler ([email protected]) wrote:
> Hi -
>
> On Fri, Jan 18, 2008 at 10:55:27PM -0500, Steven Rostedt wrote:
> > [...]
> > > All this complexity is to be justified by keeping the raw prev/next
> > > pointers from being sent to a naive tracer? It seems to me way out of
> > > proportion.
> >
> > Damn, and I just blew away all my marker code for something like this ;-)
>
> Sorry! :-)
>
> > [...]
> > We have in sched.c the following marker:
> > trace_mark(kernel_sched_scheduler, "prev %p next %p", prev, next);
>
> Fine so far!
>
> > Then Mathieu can add in some code somewhere (or a module, or something)
> > ret = marker_probe_register("kernel_sched_scheduler",
> > "prev %p next %p",
> > pretty_print_sched_switch, NULL);
>
> > static void pretty_print_sched_switch(const struct marker *mdata,
> > void *private_data,
> > const char *format, ...)
> > {
> > [...]
> > trace_mark(kernel_pretty_print_sched_switch,
> > "prev_pid %d next_pid %d prev_state %ld",
> > prev->pid, next->pid, prev->state);
> > }
>
> That marker_probe_register call would need to be done only when the
> embedded (k_p_p_s_s) marker is actually being used. Otherwise we'd
> lose all the savings of a dormant sched.c marker by always calling
> into pretty_print_sched_switch(), whether or not the k_p_p_s_s marker
> was active.
>
> In any case, if the naive tracer agrees to become educated about some
> of these markers in the form of intermediary functions like that, they
> need not insist on a second hop through marker territory anyway:
>
> static void pretty_print_sched_switch(const struct marker *mdata,
> void *private_data,
> const char *format, ...)
> {
> [...]
> lttng_backend_trace(kernel_pretty_print_sched_switch,
> "prev_pid %d next_pid %d prev_state %ld",
> prev->pid, next->pid, prev->state);
> }
>
Oh! Perfect then :) Since I already planned for my ltt-marker-control kernel
module to connect specialized callbacks instead of the dumb one, it
shouldn't be so hard to do.
I would just have to find another way to declare the trace events (it's
currently embedded in the markers), but it's not a showstopper. I'll try
this.
Thanks to you both for the good proposals,
Mathieu
>
> - FChE
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68