Hi,
I am trying to improve the Linux kernel time source so it can be read
without a seqlock from NMI handlers. I have also seen some interest in
such an accurate monotonic clock readable from user space. It mainly
implies an atomic update of the time value. I am also trying to figure
out a way to support architectures with multiple CPUs whose TSCs are
not synchronized.
I would like to have your comments on the following idea.
Thanks in advance,
Mathieu
Monotonic accurate time
The goal of this design is to provide a monotonic time :
Readable from userspace without a system call
Readable from NMI handler
Readable without disabling interrupts
Readable without disabling preemption
Only one clock source (most precise available : tsc)
Support architectures with variable TSC frequency.
The main difference from the wall time currently implemented in the Linux
kernel is that the time update is done atomically instead of under a write
seqlock. This permits reading the time from NMI handlers and from userspace.
struct time_info {
	u64 tsc;
	u64 freq;
	u64 walltime;
};

struct time_struct {
	struct time_info time_sel[2];
	long update_count;
};

DEFINE_PER_CPU(struct time_struct, cpu_time);
/* Number of times the scheduler is called on each CPU */
DEFINE_PER_CPU(unsigned long, sched_nr);
/* On frequency change event */
/* In irq context */
void freq_change_cb(unsigned int new_freq)
{
	struct time_struct *this_cpu_time =
		&per_cpu(cpu_time, smp_processor_id());
	struct time_info *write_time, *current_time;

	write_time =
		&this_cpu_time->time_sel[(this_cpu_time->update_count + 1) & 1];
	current_time =
		&this_cpu_time->time_sel[this_cpu_time->update_count & 1];
	write_time->tsc = get_cycles();
	write_time->freq = new_freq;
	/* We accumulate the division imprecision. This is the downside of
	 * using a TSC with variable frequency as a time base. */
	write_time->walltime =
		current_time->walltime +
		(write_time->tsc - current_time->tsc) /
		current_time->freq;
	wmb();	/* order the data writes before the slot switch */
	this_cpu_time->update_count++;
}
/* Init cpu freq */
void init_cpu_freq(void)
{
	struct time_struct *this_cpu_time =
		&per_cpu(cpu_time, smp_processor_id());
	struct time_info *current_time;

	memset(this_cpu_time, 0, sizeof(*this_cpu_time));
	current_time = &this_cpu_time->time_sel[this_cpu_time->update_count & 1];
	/* Init current time */
	/* Get frequency */
	/* Reset cpus to 0 ns, 0 tsc, start their tsc. */
}
/* After a CPU comes back from hlt */
/* The trick is to sync all the other CPUs on the first CPU up when they come
 * up. If all CPUs are down, then there is no need to increment the walltime:
 * let's simply define the useful walltime on a machine as the time elapsed
 * while there is a CPU running. If we want, when no CPU is active, we can use
 * a lower resolution clock to somehow keep track of walltime. */
void wake_from_hlt(void)
{
	/* TODO */
}
/* Read time from anywhere in the kernel. Returns the walltime in ns. */
/* If the update_count changes while we read the context, it may be invalid.
 * This would happen if we were scheduled out for a period of time long enough
 * to permit 2 frequency changes. We simply restart the loop if it happens.
 * We detect it by comparing the update_count running counter.
 * We detect preemption by incrementing a counter sched_nr within schedule().
 * This counter is readable by user space through the vsyscall page. */
u64 read_time(void)
{
	u64 walltime;
	long update_count;
	struct time_struct *this_cpu_time;
	struct time_info *current_time;
	unsigned int cpu;
	long prev_sched_nr;

	for (;;) {
		cpu = _smp_processor_id();
		prev_sched_nr = per_cpu(sched_nr, cpu);
		if (cpu != _smp_processor_id())
			continue;	/* changed CPU between CPUID and
					 * getting sched_nr */
		this_cpu_time = &per_cpu(cpu_time, cpu);
		update_count = this_cpu_time->update_count;
		rmb();	/* read the counter before the data it selects */
		current_time = &this_cpu_time->time_sel[update_count & 1];
		walltime = current_time->walltime +
			(get_cycles() - current_time->tsc) /
			current_time->freq;
		if (per_cpu(sched_nr, cpu) != prev_sched_nr)
			continue;	/* been preempted */
		rmb();	/* re-check the counter after reading the data */
		if (this_cpu_time->update_count == update_count)
			break;		/* no concurrent update: value valid */
	}
	return walltime;
}
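For example, an NMI handler could then timestamp its events like this
(a sketch; nmi_buffer_log() is a hypothetical per-cpu buffer helper, not
existing code):

/* Usage sketch: read_time() takes no lock and never disables
 * interrupts or preemption, so it is safe to call here. */
static void my_nmi_handler(void)
{
	u64 now = read_time();

	nmi_buffer_log(now);
}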
/* Userspace */
/* Export all this data to user space through the vsyscall page. Use a function
* like read_time to read the walltime. This function can be implemented as-is
* because it doesn't need to disable preemption. */
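As a rough illustration, a userspace reader could look like the sketch
below. It assumes the kernel maps one struct time_struct read-only at a
known address in the vsyscall page and that the thread is pinned to the
corresponding CPU; none of this is an existing ABI.

/* Hypothetical userspace reader of the exported time_struct. */
#include <stdint.h>

struct time_info { uint64_t tsc; uint64_t freq; uint64_t walltime; };
struct time_struct { struct time_info time_sel[2]; long update_count; };

static inline uint64_t rdtsc(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__("rdtsc" : "=a" (lo), "=d" (hi));
	return ((uint64_t)hi << 32) | lo;
}

uint64_t read_time_user(const volatile struct time_struct *t)
{
	const volatile struct time_info *cur;
	uint64_t walltime;
	long count;

	do {
		count = t->update_count;
		cur = &t->time_sel[count & 1];
		walltime = cur->walltime + (rdtsc() - cur->tsc) / cur->freq;
		/* retry if an update raced with our read */
	} while (t->update_count != count);
	return walltime;
}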
--
Mathieu Desnoyers
Computer Engineering Ph.D. Candidate, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Sat, 2007-02-24 at 11:19 -0500, Mathieu Desnoyers wrote:
> Hi,
>
> I am trying to improve the Linux kernel time source so it can be read
> without a seqlock from NMI handlers. I have also seen some interest in
> such an accurate monotonic clock readable from user space. It mainly
> implies an atomic update of the time value. I am also trying to figure
> out a way to support architectures with multiple CPUs whose TSCs are
> not synchronized.
>
> I would like to have your comments on the following idea.
>
> Thanks in advance,
>
> Mathieu
>
>
> Monotonic accurate time
>
> The goal of this design is to provide a monotonic time :
>
> Readable from userspace without a system call
> Readable from NMI handler
> Readable without disabling interrupts
> Readable without disabling preemption
> Only one clock source (most precise available : tsc)
> Support architectures with variable TSC frequency.
I don't think you could use only the TSC. From reviewing John's and
Thomas's work it's pretty clear the TSC isn't going to work correctly
all the time.
> /* On frequency change event */
> /* In irq context */
> void freq_change_cb(unsigned int new_freq)
> {
It's possible for the TSC to change frequencies without notification. It
can also completely stop when the system goes idle.
> /* Userspace */
> /* Export all this data to user space through the vsyscall page. Use a function
> * like read_time to read the walltime. This function can be implemented as-is
> * because it doesn't need to disable preemption. */
What would be the benefit of using this over the vsyscall gettimeofday()
from userspace ?
Daniel
* Daniel Walker ([email protected]) wrote:
> On Sat, 2007-02-24 at 11:19 -0500, Mathieu Desnoyers wrote:
> > Hi,
> >
> > I am trying to improve the Linux kernel time source so it can be read
> > without a seqlock from NMI handlers. I have also seen some interest in
> > such an accurate monotonic clock readable from user space. It mainly
> > implies an atomic update of the time value. I am also trying to figure
> > out a way to support architectures with multiple CPUs whose TSCs are
> > not synchronized.
> >
> > I would like to have your comments on the following idea.
> >
> > Thanks in advance,
> >
> > Mathieu
> >
> >
> > Monotonic accurate time
> >
> > The goal of this design is to provide a monotonic time :
> >
> > Readable from userspace without a system call
> > Readable from NMI handler
> > Readable without disabling interrupts
> > Readable without disabling preemption
> > Only one clock source (most precise available : tsc)
> > Support architectures with variable TSC frequency.
>
> I don't think you could use only the TSC. From reviewing John's and
> Thomas's work it's pretty clear the TSC isn't going to work correctly
> all the time.
>
Ok, if there are other high precision timers we can use, I guess it may
be better to fall back on them.
> > /* On frequency change event */
> > /* In irq context */
> > void freq_change_cb(unsigned int new_freq)
> > {
>
> It's possible for the TSC to change frequencies without notification. It
> can also completely stop when the system goes idle.
>
Hrm, I see. I thought those frequency changes without notification would
happen rarely and could be dealt with by resynchronizing the CPUs. I
guess I was wrong.
> > /* Userspace */
> > /* Export all this data to user space through the vsyscall page. Use a function
> > * like read_time to read the walltime. This function can be implemented as-is
> > * because it doesn't need to disable preemption. */
>
> What would be the benefit of using this over the vsyscall gettimeofday()
> from userspace ?
>
So we can get a monotonic, non-NTP-corrected timestamp quickly from user
space without going through a system call. Are there other alternatives?
Thanks,
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Candidate, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 2007-02-26 at 15:53 -0500, Mathieu Desnoyers wrote:
> > > /* On frequency change event */
> > > /* In irq context */
> > > void freq_change_cb(unsigned int new_freq)
> > > {
> >
> > It's possible for the TSC to change frequencies without notification. It
> > can also completely stop when the system goes idle.
> >
>
> Hrm, I see. I thought those frequency changes without notification would
> happen rarely and could be dealt with by resynchronizing the CPUs. I
> guess I was wrong.
The system could be UP .. I don't think tracking this kind of thing is
trivial, and given the TSC's track record you have to assume there will
be other issues in the future.
> > > /* Userspace */
> > > /* Export all this data to user space through the vsyscall page. Use a function
> > > * like read_time to read the walltime. This function can be implemented as-is
> > > * because it doesn't need to disable preemption. */
> >
> > What would be the benefit of using this over the vsyscall gettimeofday()
> > from userspace ?
> >
>
> So we can get a monotonic, non-NTP-corrected timestamp quickly from user
> space without going through a system call. Are there other alternatives?
The NTP daemon needs to be active AFAIK before you would start observing
weird time jumps. There are adjustments made without NTP but they are
only seen over short periods.. So I still think gettimeofday() would be
an alternative ..
Have you considered adding something to glibc? You could access only the
TSC from userspace.. I don't think the addition of a vsyscall/syscall
for this would go over too well considering that there are other ways to
get timestamps. It might help if you tell us what you think this would
be used for in userspace ?
Daniel
* Daniel Walker ([email protected]) wrote:
> On Mon, 2007-02-26 at 15:53 -0500, Mathieu Desnoyers wrote:
>
> > > > /* On frequency change event */
> > > > /* In irq context */
> > > > void freq_change_cb(unsigned int new_freq)
> > > > {
> > >
> > > It's possible for the TSC to change frequencies without notification. It
> > > can also completely stop when the system goes idle.
> > >
> >
> > Hrm, I see. I thought those frequency changes without notification would
> > happen rarely and could be dealt with by resynchronizing the CPUs. I
> > guess I was wrong.
>
> The system could be UP .. I don't think tracking this kind of thing is
> trivial, and given the TSC's track record you have to assume there will
> be other issues in the future.
>
The other solution, good for UP, would be to synchronize the TSC with
another clock source (HPET or timer), but that starts to look pretty much
like what is done right now.
> > > > /* Userspace */
> > > > /* Export all this data to user space through the vsyscall page. Use a function
> > > > * like read_time to read the walltime. This function can be implemented as-is
> > > > * because it doesn't need to disable preemption. */
> > >
> > > What would be the benefit of using this over the vsyscall gettimeofday()
> > > from userspace ?
> > >
> >
> > So we can get a monotonic, non-NTP-corrected timestamp quickly from user
> > space without going through a system call. Are there other alternatives?
>
> The NTP daemon needs to be active AFAIK before you would start observing
> weird time jumps. There are adjustments made without NTP but they are
> only seen over short periods.. So I still think gettimeofday() would be
> an alternative ..
>
> Have you considered adding something to glibc? You could access only the
> TSC from userspace.. I don't think the addition of a vsyscall/syscall
> for this would go over too well considering that there are other ways to
> get timestamps. It might help if you tell us what you think this would
> be used for in userspace ?
>
For kernel and user space tracing, those small jumps are very annoying :
they can show, in a trace, a fork() appearing on one CPU after the first
schedule() of the new thread on the other CPU : the scheduling causality
relationship can become very hard to follow. This is only one example.
Inaccuracy and periodic modification of the clock time (non monotonic) can
cause significant inaccuracy in performance tests, even on UP systems. A
monotonic clock, accessible from anywhere in kernel space (including NMI
handlers) and from user space, is very useful for performance analysis
and, more generally, for timestamping data in per-cpu buffers so it can
later be reordered correctly.
Mathieu
> Daniel
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Candidate, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 2007-02-26 at 17:14 -0500, Mathieu Desnoyers wrote:
> For kernel and user space tracing, those small jumps are very annoying :
> they can show, in a trace, a fork() appearing on one CPU after the first
> schedule() of the new thread on the other CPU : the scheduling causality
> relationship can become very hard to follow. This is only one example.
> Inaccuracy and periodic modification of the clock time (non monotonic) can
> cause significant inaccuracy in performance tests, even on UP systems. A
> monotonic clock, accessible from anywhere in kernel space (including NMI
> handlers) and from user space, is very useful for performance analysis
> and, more generally, for timestamping data in per-cpu buffers so it can
> later be reordered correctly.
What about adding a layer below do_gettimeofday() which just sheds the
adjustment process? That might be reasonable .. The NMI and userspace
cases aren't very compelling right now, at least I'm not convinced a
whole new timing interface is needed ..
The latency tracing system in the -rt branch modifies the gettimeofday
facilities. I'm not sure of its correctness, but it gets called
from anyplace in the kernel, including NMIs.
Here's the function,
cycle_t notrace get_monotonic_cycles(void)
{
cycle_t cycle_now, cycle_delta;
/* read clocksource: */
cycle_now = clocksource_read(clock);
/* calculate the delta since the last update_wall_time: */
cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
return clock->cycle_last + cycle_delta;
}
That looks safe. When converting this to nanoseconds you would still get
the time adjustments but it would be all at once instead of in little
increments ..
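For reference, the conversion to nanoseconds would be roughly the
standard clocksource mult/shift arithmetic, along these lines (a sketch
modelled on cyc2ns(), not the exact -rt code):

/* Sketch: scale monotonic cycles to nanoseconds with the current
 * clocksource's mult/shift factors. */
static inline u64 monotonic_cycles_to_ns(void)
{
	u64 cycles = (u64)get_monotonic_cycles();

	return (cycles * clock->mult) >> clock->shift;
}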
Daniel
* Daniel Walker ([email protected]) wrote:
> On Mon, 2007-02-26 at 17:14 -0500, Mathieu Desnoyers wrote:
>
>
> > For kernel and user space tracing, those small jumps are very annoying :
> > they can show, in a trace, a fork() appearing on one CPU after the first
> > schedule() of the new thread on the other CPU : the scheduling causality
> > relationship can become very hard to follow. This is only one example.
> > Inaccuracy and periodic modification of the clock time (non monotonic) can
> > cause significant inaccuracy in performance tests, even on UP systems. A
> > monotonic clock, accessible from anywhere in kernel space (including NMI
> > handlers) and from user space, is very useful for performance analysis
> > and, more generally, for timestamping data in per-cpu buffers so it can
> > later be reordered correctly.
>
> What about adding a layer below do_gettimeofday() which just sheds the
> adjustment process? That might be reasonable .. The NMI and userspace
> cases aren't very compelling right now, at least I'm not convinced a
> whole new timing interface is needed ..
>
> The latency tracing system in the -rt branch modifies the gettimeofday
> facilities. I'm not sure of its correctness, but it gets called
> from anyplace in the kernel, including NMIs.
>
> Here's the function,
>
> cycle_t notrace get_monotonic_cycles(void)
> {
> cycle_t cycle_now, cycle_delta;
>
> /* read clocksource: */
> cycle_now = clocksource_read(clock);
>
> /* calculate the delta since the last update_wall_time: */
> cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;
>
> return clock->cycle_last + cycle_delta;
> }
>
> That looks safe. When converting this to nanoseconds you would still get
> the time adjustments but it would be all at once instead of in little
> increments ..
>
ouch... if the clocksource used is the PIT on x86 :
static cycle_t pit_read(void)
{
unsigned long flags;
int count;
u32 jifs;
static int old_count;
static u32 old_jifs;
spin_lock_irqsave(&i8253_lock, flags);
If an NMI nests over the spinlock, we have a deadlock.
In addition, clock->cycle_last is a cycle_t, defined as 64 bits on
x86. It is therefore not updated atomically by change_clocksource,
timekeeping_init, timekeeping_resume and update_wall_time. If an NMI
fires right on top of the update, especially around the 32-bit
wraparound, the time will be really fuzzy.
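To illustrate the hazard (a sketch, assuming the compiler emits two
32-bit stores for a u64 store on 32-bit x86):

/* cycle_last goes from 0x00000000ffffffff to 0x0000000100000000 */
clock->cycle_last = now;
/* compiled as two 32-bit stores:
 *	mov	[cycle_last], 0x00000000	; low half first
 *	mov	[cycle_last+4], 0x00000001
 * An NMI landing between the two stores reads 0x0, so time warps
 * backwards by ~4 billion cycles; with the opposite store order it
 * reads 0x1ffffffff and time warps forwards. */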
Mathieu
> Daniel
>
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Mon, 2007-02-26 at 22:54 -0500, Mathieu Desnoyers wrote:
> If an NMI nests over the spinlock, we have a deadlock.
Maybe not completely safe ...
> In addition, clock->cycle_last is a cycle_t, defined as 64 bits on
> x86. It is therefore not updated atomically by change_clocksource,
> timekeeping_init, timekeeping_resume and update_wall_time. If an NMI
> fires right on top of the update, especially around the 32-bit
> wraparound, the time will be really fuzzy.
I'm not sure that is particularly significant considering that it's just
a possible bad timestamp, and the probability of that happening seems
rather low .. You could also modify NMI calls so they use a different
timestamping method, like reading the clocksource directly.
The pit clocksource could be dropped pretty easy with my clocksource
update patches, which I'm still working on but you could easily drop
clock sources that aren't atomic like the pit .. Also the pit is
generally undesirable, so it's not going to be missed.
Daniel
* Daniel Walker ([email protected]) wrote:
> On Mon, 2007-02-26 at 22:54 -0500, Mathieu Desnoyers wrote:
> > If an NMI nests over the spinlock, we have a deadlock.
>
> Maybe not completely safe ...
>
> > In addition, clock->cycle_last is a cycle_t, defined as 64 bits on
> > x86. It is therefore not updated atomically by change_clocksource,
> > timekeeping_init, timekeeping_resume and update_wall_time. If an NMI
> > fires right on top of the update, especially around the 32-bit
> > wraparound, the time will be really fuzzy.
>
> I'm not sure that is particularly significant considering that it's just
> a possible bad timestamp, and the probability of that happening seems
> rather low .. You could also modify NMI calls so they use a different
> timestamping method, like reading the clocksource directly.
>
Since you do not disable interrupts around the clocksource read, the
same problem applies to interrupt handlers of higher priority than the
cycle_last updating code.
A bad timestamp can make a trace quite hard to follow and more
error-prone. When someone is looking for _the_ failing case in a system,
the infrastructure used to debug must be reliable. Sometimes error cases
take days before showing up : we can't afford to be unsure about the
precision of the tracer.
Also, the goal is to have a generic monotonic timestamp readable from
everywhere. Excluding execution contexts doesn't seem like such a great
idea (it just replicates the same problem somewhere else).
> The pit clocksource could be dropped pretty easy with my clocksource
> update patches, which I'm still working on but you could easily drop
> clock sources that aren't atomic like the pit .. Also the pit is
> generally undesirable, so it's not going to be missed.
>
The PIT is still important for old architectures where it is the only
available clock source, I guess. However, the clocksource struct should
at least tell whether the time can be read atomically and offer a
different API for atomic vs non-atomic reads of the time source,
returning an error if no atomic time source is available.
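Something along these lines, say (a sketch; the flag and the helper are
made up, not existing clocksource API):

/* Hypothetical flag marking a clocksource whose read is lock-free and
 * therefore NMI-safe; the value is invented for this sketch. */
#define CLOCK_SOURCE_ATOMIC	0x10

/* Return cycles from an atomically readable clocksource, or 0 when the
 * source cannot be read atomically (e.g. the PIT behind its spinlock). */
static inline cycle_t clocksource_read_atomic(struct clocksource *cs)
{
	if (!(cs->flags & CLOCK_SOURCE_ATOMIC))
		return 0;
	return cs->read();
}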
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
* Daniel Walker <[email protected]> wrote:
> The pit clocksource could be dropped pretty easy with my clocksource
> update patches, which I'm still working on but you could easily drop
> clock sources that aren't atomic like the pit .. Also the pit is
> generally undesirable, so it's not going to be missed.
that's totally unacceptable, and i'm amazed you are even suggesting it -
often the PIT ends up being the most reliable hardware clock in a PC.
Btw., what's wrong with the spinlock that is protecting PIT access? It
expresses the non-atomic property of the PIT just fine.
Ingo
* Ingo Molnar ([email protected]) wrote:
>
> * Daniel Walker <[email protected]> wrote:
>
> > The pit clocksource could be dropped pretty easy with my clocksource
> > update patches, which I'm still working on but you could easily drop
> > clock sources that aren't atomic like the pit .. Also the pit is
> > generally undesirable, so it's not going to be missed.
>
> that's totally unacceptable, and i'm amazed you are even suggesting it -
> often the PIT ends up being the most reliable hardware clock in a PC.
> Btw., what's wrong with the spinlock that is protecting PIT access? It
> expresses the non-atomic property of the PIT just fine.
>
I am concerned about the automatic fallback to the PIT when no other
clock source is available. A clocksource read would be atomic when TSC
or HPET are available, but would fall back on PIT otherwise. There
should be some way to specify that a caller is only interested in atomic
clock sources (if none are available, the call should simply return an
error, or 0).
I still think that an RCU style update mechanism would be a good way to
fix the current clocksource read issue. Another, slower and non NMI
safe way to do this would be with a read seqlock and with IRQ disabling.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 2007-02-27 at 02:38 -0500, Mathieu Desnoyers wrote:
> > > The pit clocksource could be dropped pretty easy with my clocksource
> > > update patches, which I'm still working on but you could easily drop
> > > clock sources that aren't atomic like the pit .. Also the pit is
> > > generally undesirable, so it's not going to be missed.
> >
> > that's totally unacceptable, and i'm amazed you are even suggesting it -
> > often the PIT ends up being the most reliable hardware clock in a PC.
> > Btw., what's wrong with the spinlock that is protecting PIT access? It
> > expresses the non-atomic property of the PIT just fine.
> >
>
> I am concerned about the automatic fallback to the PIT when no other
> clock source is available.
And what are you going to use then ? Just kill the box, when nothing
else than PIT is there ?
tglx
On Tue, 2007-02-27 at 07:29 +0100, Ingo Molnar wrote:
> * Daniel Walker <[email protected]> wrote:
>
> > The pit clocksource could be dropped pretty easy with my clocksource
> > update patches, which I'm still working on but you could easily drop
> > clock sources that aren't atomic like the pit .. Also the pit is
> > generally undesirable, so it's not going to be missed.
>
> that's totally unacceptable, and i'm amazed you are even suggesting it -
> often the PIT ends up being the most reliable hardware clock in a PC.
> Btw., what's wrong with the spinlock that is protecting PIT access? It
> expresses the non-atomic property of the PIT just fine.
Just the fact that its rating is lower than the acpi_pm's (and the
TSC's), and that it's not even considered on SMP systems, is enough for
me .. It's just a problematic clock..
Again, I'm not suggesting we drop it all the time, just for the special
case where Mathieu needs it dropped.
Daniel
On Tue, 2007-02-27 at 02:38 -0500, Mathieu Desnoyers wrote:
>
> I am concerned about the automatic fallback to the PIT when no other
> clock source is available. A clocksource read would be atomic when TSC
> or HPET are available, but would fall back on PIT otherwise. There
> should be some way to specify that a caller is only interested in atomic
> clock sources (if none are available, the call should simply return an
> error, or 0).
>
> I still think that an RCU style update mechanism would be a good way to
> fix the current clocksource read issue. Another, slower and non NMI
> safe way to do this would be with a read seqlock and with IRQ disabling.
I'm not sure what you mean by using the RCU, but the pit clocksource
does disable interrupts with a spin_lock_irqsave().
Daniel
* Daniel Walker ([email protected]) wrote:
> On Tue, 2007-02-27 at 02:38 -0500, Mathieu Desnoyers wrote:
>
> >
> > I am concerned about the automatic fallback to the PIT when no other
> > clock source is available. A clocksource read would be atomic when TSC
> > or HPET are available, but would fall back on PIT otherwise. There
> > should be some way to specify that a caller is only interested in atomic
> > clock sources (if none are available, the call should simply return an
> > error, or 0).
> >
> I'm not sure what you mean by using the RCU
The original proposal of this thread uses an RCU (read-copy-update) style
update of the 64-bit counter discussed earlier : it swaps a pointer
(atomically) upon update by incrementing a word-sized counter that is
used, by the reader, to get the offset in the array (with a modulo
operation) for the currently readable data, and as a way to detect
incorrect reads of overwritten information (we re-read the word-sized
counter after having read the data structure to make sure it has not
been incremented; if we detect an increment, we redo the whole operation).
> > I still think that an RCU style update mechanism would be a good way to
> > fix the current clocksource read issue. Another, slower and non NMI
> > safe way to do this would be with a read seqlock and with IRQ disabling.
>
> , but the pit clocksource
> does disable interrupts with a spin_lock_irqsave().
>
When I say "clocksource read issue", I am talking about the race
between the function you proposed earlier, which you say is used in
-rt kernels for latency tracing (get_monotonic_cycles), and the HPET
and TSC "last cycles" updates.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 2007-02-27 at 11:02 -0500, Mathieu Desnoyers wrote:
> * Daniel Walker ([email protected]) wrote:
> > On Tue, 2007-02-27 at 02:38 -0500, Mathieu Desnoyers wrote:
> >
> > >
> > > I am concerned about the automatic fallback to the PIT when no other
> > > clock source is available. A clocksource read would be atomic when TSC
> > > or HPET are available, but would fall back on PIT otherwise. There
> > > should be some way to specify that a caller is only interested in atomic
> > > clock sources (if none are available, the call should simply return an
> > > error, or 0).
> > >
> > I'm not sure what you mean by using the RCU
>
> The original proposal of this thread uses an RCU (read-copy-update) style
> update of the 64-bit counter discussed earlier : it swaps a pointer
> (atomically) upon update by incrementing a word-sized counter that is
> used, by the reader, to get the offset in the array (with a modulo
> operation) for the currently readable data, and as a way to detect
> incorrect reads of overwritten information (we re-read the word-sized
> counter after having read the data structure to make sure it has not
> been incremented; if we detect an increment, we redo the whole operation).
I didn't see RCU at all in your original message, so I'm not sure how
you propose to use it .. My understanding of the RCU was that it
couldn't be used from interrupt context; that could be totally wrong so
I'll let you explain how you planned to use it.
> > > I still think that an RCU style update mechanism would be a good way to
> > > fix the current clocksource read issue. Another, slower and non NMI
> > > safe way to do this would be with a read seqlock and with IRQ disabling.
> >
> > , but the pit clocksource
> > does disable interrupts with a spin_lock_irqsave().
> >
>
> When I say "clocksource read issue", I am talking about the race
> between the function you proposed earlier, which you say is used in
> -rt kernels for latency tracing (get_monotonic_cycles), and the HPET
> and TSC "last cycles" updates.
Right .. You said that regular interrupts would cause this non-atomic
64-bit update race, but the pit disables interrupts, and the
last_cycles update is done with interrupts off .. So I think we're back
to only the NMI case ..
Did you have another scenario ?
Daniel
* Daniel Walker ([email protected]) wrote:
> On Tue, 2007-02-27 at 11:02 -0500, Mathieu Desnoyers wrote:
> > * Daniel Walker ([email protected]) wrote:
> > > On Tue, 2007-02-27 at 02:38 -0500, Mathieu Desnoyers wrote:
> > >
> > > >
> > > > I am concerned about the automatic fallback to the PIT when no other
> > > > clock source is available. A clocksource read would be atomic when TSC
> > > > or HPET are available, but would fall back on PIT otherwise. There
> > > > should be some way to specify that a caller is only interested in atomic
> > > > clock sources (if none are available, the call should simply return an
> > > > error, or 0).
> > > >
> > > I'm not sure what you mean by using the RCU
> >
> > The original proposal of this thread uses an RCU (read-copy-update) style
> > update of the 64-bit counter discussed earlier : it swaps a pointer
> > (atomically) upon update by incrementing a word-sized counter that is
> > used, by the reader, to get the offset in the array (with a modulo
> > operation) for the currently readable data, and as a way to detect
> > incorrect reads of overwritten information (we re-read the word-sized
> > counter after having read the data structure to make sure it has not
> > been incremented; if we detect an increment, we redo the whole operation).
>
> I didn't see RCU at all in your original message, so I'm not sure how
> you propose to use it .. My understanding of the RCU was that it
> couldn't be used from interrupt context; that could be totally wrong so
> I'll let you explain how you planned to use it.
>
1 - I do not plan to use the rcupdate.h API, because it is oriented
towards allowing/freeing data structures after a quiescent state. I
don't need that. I only want a 64-bit data structure that is valid for
reading, with atomic update. Therefore, I keep an array of 2 64-bit
structures. At all times, one is used as the "readable" value and the
other as the "writeable" one. The roles are exchanged at each update.
The word-sized counter is used to select the current read and write
pointers through a mask, and is also used to detect bad reads (if a
read is preempted and there are then 2 updates, the reader could read a
bad value without knowing it). By keeping a word-sized counter of the
number of updates, we have 32 (or 64) bits (depending on the
architecture) before the wraparound, which should not happen even in a
far future.
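Condensed to a generic 64-bit value, the scheme looks like the sketch
below (illustrative names, not an existing kernel API):

/* Two-slot lock-free publication of a 64-bit value: one writer per
 * structure, readers valid in any context, including NMI. */
struct val64_pair {
	u64 slot[2];
	unsigned long update_count;
};

static void val64_update(struct val64_pair *p, u64 new_val)
{
	p->slot[(p->update_count + 1) & 1] = new_val;
	wmb();	/* publish the data before switching slots */
	p->update_count++;
}

static u64 val64_read(struct val64_pair *p)
{
	unsigned long count;
	u64 val;

	do {
		count = p->update_count;
		rmb();	/* read the counter before the slot it selects */
		val = p->slot[count & 1];
		rmb();	/* and check it again after reading the data */
	} while (p->update_count != count);
	return val;
}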
> > > > I still think that an RCU style update mechanism would be a good way to
> > > > fix the current clocksource read issue. Another, slower and non NMI
> > > > safe way to do this would be with a read seqlock and with IRQ disabling.
> > >
> > > , but the pit clocksource
> > > does disable interrupts with a spin_lock_irqsave().
> > >
> >
> > When I say "clocksource read issue", I am talking about the race
> > between the function you proposed earlier, which you say is used in
> > -rt kernels for latency tracing (get_monotonic_cycles), and the HPET
> > and TSC "last cycles" updates.
>
> Right .. You said that regular interrupts would cause this non-atomic
> 64-bit update race, but the pit disables interrupts, and the
> last_cycles update is done with interrupts off .. So I think we're back
> to only the NMI case ..
>
> Did you have another scenario ?
>
__get_nsec_offset : reads clock->cycle_last. Should be called with
xtime_lock held. (ok so far, but see below)
change_clocksource
clock->cycle_last = now; (non-atomic 64-bit update. Not protected by
any lock ?) -> this would race with __get_nsec_offset ?
update_wall_time
Called from timer interrupt. Holds xtime_lock and has a priority higher
than other interrupts. Other clock->cycle_last protected by
write_seqlock_irqsave.
get_monotonic_cycles (as you proposed, in -rt kernels) :
reads clock->cycle_last. Not protected by any read seqlock and does not
disable interrupts. Races with change_clocksource, update_wall_time and
all other time update functions. For instance, if someone uses
get_monotonic_cycles in process context and the timer interrupt fires
update_wall_time right in the middle of the two 32-bit reads, the value
will be wrong.
Mathieu
--
Mathieu Desnoyers
Computer Engineering Ph.D. Student, Ecole Polytechnique de Montreal
OpenPGP key fingerprint: 8CD5 52C3 8E3C 4140 715F BA06 3F25 A8FE 3BAE 9A68
On Tue, 2007-02-27 at 14:04 -0500, Mathieu Desnoyers wrote:
> __get_nsec_offset : reads clock->cycle_last. Should be called with
> xtime_lock held. (ok so far, but see below)
>
> change_clocksource
> clock->cycle_last = now; (non-atomic 64-bit update. Not protected by
> any lock ?) -> this would race with __get_nsec_offset ?
Minor nit (it could probably use a comment fixup): it's only called from
update_wall_time, which holds xtime_lock.
> update_wall_time
> Called from timer interrupt. Holds xtime_lock and has a priority higher
> than other interrupts. Other clock->cycle_last protected by
> write_seqlock_irqsave.
-john
On Tue, 2007-02-27 at 14:04 -0500, Mathieu Desnoyers wrote:
> 1 - I do not plan to use the rcupdate.h API, because it is oriented
> towards allowing/freeing data structures after a quiescent state. I
> don't need that. I only want a 64-bit data structure that is valid for
> reading, with atomic update. Therefore, I keep an array of 2 64-bit
> structures. At all times, one is used as the "readable" value and the
> other as the "writeable" one. The roles are exchanged at each update.
> The word-sized counter is used to select the current read and write
> pointers through a mask, and is also used to detect bad reads (if a
> read is preempted and there are then 2 updates, the reader could read a
> bad value without knowing it). By keeping a word-sized counter of the
> number of updates, we have 32 (or 64) bits (depending on the
> architecture) before the wraparound, which should not happen even in a
> far future.
Sounds like a special-case RCU system .. If you wanted to add this
timestamping system to Linux, the only acceptable way to add it, IMO,
would be to replace or extend gettimeofday .. You would also need a
justification for the changes, which right now is likely only LTT ..
> __get_nsec_offset : reads clock->cycle_last. Should be called with
> xtime_lock held. (ok so far, but see below)
>
> change_clocksource
> > clock->cycle_last = now; (non-atomic 64-bit update. Not protected by
> any lock ?) -> this would race with __get_nsec_offset ?
>
> update_wall_time
> Called from timer interrupt. Holds xtime_lock and has a priority higher
> than other interrupts. Other clock->cycle_last protected by
> write_seqlock_irqsave.
update_wall_time, change_clocksource, __get_nsec_offset all hold the
xtime_lock w/ interrupts disabled.
> get_monotonic_cycles (as you proposed, in -rt kernels) :
> reads clock->cycle_last. Not protected by any read seqlock and does not
> disable interrupts. Races with change_clocksource, update_wall_time and
> all other time update functions. For instance, if someone uses
> get_monotonic_cycles in process context and the timer interrupt fires
> update_wall_time right in the middle of the two 32-bit reads, the value
> will be wrong.
I guess that's true.. You also have to assume that the upper 32 bits
have actually changed; the TSC is the only 64-bit clock in Linux right
now..
Daniel