To me, there isn't a clear reason why we're using stop_machine
when changing clocksources instead of just taking the xtime_lock.
Additionally, using stop_machine prevents us from registering
clocksources from timer context (as needed for a following
patch).
This patch simply removes the stop_machine usage and instead
directly calls change_clocksource, which now takes the xtime_lock.
I could be totally missing something here that necessitates
stop_machine, but in my testing it seems to function fine.
Any clarifications or corrections would be appreciated!
Signed-off-by: John Stultz <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Martin Schwidefsky <[email protected]>
CC: Clark Williams <[email protected]>
---
kernel/time/timekeeping.c | 12 ++++++------
1 files changed, 6 insertions(+), 6 deletions(-)
diff --git a/kernel/time/timekeeping.c b/kernel/time/timekeeping.c
index e14c839..fa2cc41 100644
--- a/kernel/time/timekeeping.c
+++ b/kernel/time/timekeeping.c
@@ -19,7 +19,6 @@
#include <linux/jiffies.h>
#include <linux/time.h>
#include <linux/tick.h>
-#include <linux/stop_machine.h>
/* Structure holding internal timekeeping values. */
struct timekeeper {
@@ -345,12 +344,12 @@ EXPORT_SYMBOL(do_settimeofday);
*
* Accumulates current time interval and initializes new clocksource
*/
-static int change_clocksource(void *data)
+static int change_clocksource(struct clocksource *new)
{
- struct clocksource *new, *old;
-
- new = (struct clocksource *) data;
+ struct clocksource *old;
+ unsigned long flags;
+ write_seqlock_irqsave(&xtime_lock, flags);
timekeeping_forward_now();
if (!new->enable || new->enable(new) == 0) {
old = timekeeper.clock;
@@ -358,6 +357,7 @@ static int change_clocksource(void *data)
if (old->disable)
old->disable(old);
}
+ write_sequnlock_irqrestore(&xtime_lock, flags);
return 0;
}
@@ -372,7 +372,7 @@ void timekeeping_notify(struct clocksource *clock)
{
if (timekeeper.clock == clock)
return;
- stop_machine(change_clocksource, clock, NULL);
+ change_clocksource(clock);
tick_clock_notify();
}
--
1.6.0.4
Boot to boot, the TSC calibration may vary by quite a large amount.
While normal variance of 50-100 ppm can easily be seen, the quick
calibration code only requires 500 ppm accuracy, which is the limit
of what NTP can correct for.
This can cause problems for systems being used as NTP servers, as
every time they reboot it can take hours for them to calculate the
new drift error caused by the calibration.
The classic trade-off here is calibration accuracy vs slow boot times,
as during the calibration nothing else can run.
This patch uses a timer later in the boot process to calibrate the TSC
over a two-second period. This allows very accurate calibration (in my
tests varying by only 1 kHz, or 0.4 ppm, boot to boot). Additionally,
this refined calibration step does not block the boot process, and only
delays the TSC clocksource registration by a few seconds in early boot.
Credit to Andi Kleen, who suggested this idea quite a while back, but
I dismissed it, thinking the timer calibration would be done after
the clocksource was registered (which would break things). Forgive
me for my short-sightedness.
This patch has worked very well in my testing, but TSC hardware is
quite varied, so it would probably be good to get some extended
testing, possibly pushing inclusion out to 2.6.37.
These two patches also apply on top of the changes already picked up in
the -tip timers/clocksource branch.
Signed-off-by: John Stultz <[email protected]>
CC: Thomas Gleixner <[email protected]>
CC: Martin Schwidefsky <[email protected]>
CC: Clark Williams <[email protected]>
CC: Andi Kleen <[email protected]>
---
arch/x86/kernel/tsc.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++-
1 files changed, 69 insertions(+), 2 deletions(-)
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index 5ca6370..28bde64 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -842,6 +842,70 @@ __cpuinit int unsynchronized_tsc(void)
return tsc_unstable;
}
+
+static struct timer_list tsc_calibrate_timer;
+
+static void calibrate_tsc_timer(unsigned long dummy)
+{
+ static u64 tsc_start = -1, ref_start;
+ static int hpet;
+ u64 tsc_stop, ref_stop, delta;
+ unsigned long freq;
+
+ /* Don't bother refining TSC on unstable systems */
+ if (check_tsc_unstable())
+ goto out;
+
+ /*
+ * Since the timer is started early in boot, we may be
+ * delayed the first time we expire. So set the timer
+ * again once we know timers are working.
+ */
+ if (tsc_start == -1) {
+ /*
+ * Only set hpet once, to avoid mixing hardware
+ * if the hpet becomes enabled later.
+ */
+ hpet = is_hpet_enabled();
+
+ /* We limit it to 2 seconds as pmtmr wraps quickly */
+ tsc_calibrate_timer.expires = jiffies + HZ*2;
+ add_timer(&tsc_calibrate_timer);
+ tsc_start = tsc_read_refs(&ref_start, hpet);
+ return;
+ }
+
+ tsc_stop = tsc_read_refs(&ref_stop, hpet);
+
+ /* hpet or pmtimer available ? */
+ if (!hpet && !ref_start && !ref_stop)
+ goto out;
+
+ /* Check, whether the sampling was disturbed by an SMI */
+ if (tsc_start == ULLONG_MAX || tsc_stop == ULLONG_MAX)
+ goto out;
+
+ delta = tsc_stop - tsc_start;
+ delta *= 1000000LL;
+ if (hpet)
+ freq = calc_hpet_ref(delta, ref_start, ref_stop);
+ else
+ freq = calc_pmtimer_ref(delta, ref_start, ref_stop);
+
+ /* Make sure we're within 1% */
+ if (abs(tsc_khz - freq) > tsc_khz/100)
+ goto out;
+
+ tsc_khz = freq;
+ printk(KERN_INFO "Refined TSC clocksource calibration: "
+ "%lu.%03lu MHz.\n", (unsigned long)tsc_khz / 1000,
+ (unsigned long)tsc_khz % 1000);
+
+out:
+ clocksource_register_khz(&clocksource_tsc, tsc_khz);
+}
+
+
static void __init init_tsc_clocksource(void)
{
if (tsc_clocksource_reliable)
@@ -851,8 +915,11 @@ static void __init init_tsc_clocksource(void)
clocksource_tsc.rating = 0;
clocksource_tsc.flags &= ~CLOCK_SOURCE_IS_CONTINUOUS;
}
-
- clocksource_register_khz(&clocksource_tsc, tsc_khz);
+
+ init_timer(&tsc_calibrate_timer);
+ tsc_calibrate_timer.function = calibrate_tsc_timer;
+ tsc_calibrate_timer.expires = jiffies + 1;
+ add_timer(&tsc_calibrate_timer);
}
#ifdef CONFIG_X86_64
--
1.6.0.4
On Tue, 27 Jul 2010 19:06:41 -0700
John Stultz <[email protected]> wrote:
> To me, there isn't a clear reason why we're using stop_machine
> when changing clocksources instead of just taking the xtime_lock.
>
> Additionally, using stop_machine prevents us from registering
> clocksources from timer context (as needed for a following
> patch).
>
> This patch simply removes the stop_machine usage and instead
> directly calls change_clocksource, which now takes the xtime_lock.
>
> I could be totally missing something here that necessitates
> stop_machine, but in my testing it seems to function fine.
>
> Any clarifications or corrections would be appreciated!
Installing a new clocksource updates quite a lot of internal
variables; we need to make sure that no code ever uses these
variables without holding the xtime_lock as writer.
And then there is ktime_get which uses a read_seqbegin/
read_seqretry loop on the xtime_lock to get the time from the
clocksource. Consider the case where a ktime_get call already
did read_seqbegin but did not yet call the read function of
the clocksource. Another cpu registers a better clocksource
which will cause the timekeeper.clock variable to get updated
while the ktime_get call is using it. If I look at
timekeeping_get_ns I don't see anything that prevents the
compiler from generating code that reads timekeeper.clock
multiple times. Which would mix the read function from one
clocksource with the cycle_last / mask values from the new
clock. Now if we add code that prevents the compiler from
reading from timekeeper.clock multiple times we might get
away with it.
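To make that concrete, the guarded helper would look something like
this (a sketch only, against my reading of the current timekeeper
layout):

	/*
	 * Sketch: load timekeeper.clock exactly once, so the read
	 * function, cycle_last and mask all come from the same
	 * clocksource even if another cpu installs a new one.
	 */
	static inline s64 timekeeping_get_ns(void)
	{
		struct clocksource *clock = ACCESS_ONCE(timekeeper.clock);
		cycle_t cycle_now, cycle_delta;

		/* read clocksource: */
		cycle_now = clock->read(clock);

		/* calculate the delta since the last update_wall_time: */
		cycle_delta = (cycle_now - clock->cycle_last) & clock->mask;

		/* return delta converted to nanoseconds: */
		return clocksource_cyc2ns(cycle_delta, timekeeper.mult,
					  timekeeper.shift);
	}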
The reasoning for stop_machine is that the change of a
clocksource is a major change which has subtle side effects,
so we want to make sure that nothing breaks. It is a very rare
event, so we can afford to spend a little bit of time there.
Ergo stop_machine.
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Wed, 2010-07-28 at 09:17 +0200, Martin Schwidefsky wrote:
> On Tue, 27 Jul 2010 19:06:41 -0700
> John Stultz <[email protected]> wrote:
>
> > To me, there isn't a clear reason why we're using stop_machine
> > when changing clocksources instead of just taking the xtime_lock.
> >
> > Additionally, using stop_machine prevents us from registering
> > clocksources from timer context (as needed for a following
> > patch).
> >
> > This patch simply removes the stop_machine usage and instead
> > directly calls change_clocksource, which now takes the xtime_lock.
> >
> > I could be totally missing something here that necessitates
> > stop_machine, but in my testing it seems to function fine.
> >
> > Any clarifications or corrections would be appreciated!
>
> Installing a new clocksource updates quite a lot of internal
> variables; we need to make sure that no code ever uses these
> variables without holding the xtime_lock as writer.
Agreed.
> And then there is ktime_get which uses a read_seqbegin/
> read_seqretry loop on the xtime_lock to get the time from the
> clocksource. Consider the case where a ktime_get call already
> did read_seqbegin but did not yet call the read function of
> the clocksource. Another cpu registers a better clocksource
> which will cause the timekeeper.clock variable to get updated
> while the ktime_get call is using it.
ktime_get will just be forced to loop and try again, since any write
requires holding the xtime_lock write lock. While the xtime_lock
write lock is held, the function could possibly mix the
read/cycle_last/mask/cyc2ns values, but the results from those invalid
calculations will never be returned.
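For reference, the read side follows the usual seqlock pattern,
roughly like this (sketching from memory):

	/*
	 * Sketch of the ktime_get-style read loop: any result computed
	 * while a writer holds xtime_lock fails read_seqretry() and is
	 * simply recomputed, so mixed values are never returned.
	 */
	ktime_t ktime_get_sketch(void)
	{
		unsigned int seq;
		s64 secs, nsecs;

		do {
			seq = read_seqbegin(&xtime_lock);
			secs = xtime.tv_sec + wall_to_monotonic.tv_sec;
			nsecs = xtime.tv_nsec + wall_to_monotonic.tv_nsec;
			nsecs += timekeeping_get_ns();
		} while (read_seqretry(&xtime_lock, seq));

		return ktime_add_ns(ktime_set(secs, 0), nsecs);
	}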
> If I look at
> timekeeping_get_ns I don't see anything that prevents the
> compiler from generating code that reads timekeeper.clock
> multiple times. Which would mix the read function from one
> clocksource with the cycle_last / mask values from the new
> clock. Now if we add code that prevents the compiler from
> reading from timekeeper.clock multiple times we might get
> away with it.
Right, but this should be ok. timekeeping_get_ns is a helper that
requires the xtime_lock to be held (such a comment is probably needed,
but there is no usage of it when the xtime_lock isn't held). While the
function may actually mix values from two clocksources in a calculation,
the results of those calculations will be thrown out and re-done via the
xtime_lock seqlock.
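Something like this above the helper would do (rough wording):

	/*
	 * timekeeping_get_ns - get nsecs since the last update_wall_time
	 *
	 * Must be called with xtime_lock held, or from within a
	 * read_seqbegin()/read_seqretry() loop on xtime_lock, since
	 * intermediate values may mix state from two clocksources
	 * during a clocksource change.
	 */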
> The reasoning for stop_machine is that the change of a
> clocksource is a major change which has subtle side effects,
> so we want to make sure that nothing breaks. It is a very rare
> event, so we can afford to spend a little bit of time there.
> Ergo stop_machine.
I do agree that there can be subtle side effects when dealing with
clocksources (part of why I'm being so cautious introducing this
change), and when the stop_machine code was added it seemed reasonable.
But given the limitations of stop_machine, the more I look at the
clocksource_change code, the more I suspect stop_machine is overkill and
we can safely just take the write lock on xtime_lock.
If I'm still missing something, do let me know.
thanks
-john
On Wed, 28 Jul 2010 09:12:49 -0700
john stultz <[email protected]> wrote:
> On Wed, 2010-07-28 at 09:17 +0200, Martin Schwidefsky wrote:
> > On Tue, 27 Jul 2010 19:06:41 -0700
> > John Stultz <[email protected]> wrote:
> >
> > If I look at
> > timekeeping_get_ns I don't see anything that prevents the
> > compiler from generating code that reads timekeeper.clock
> > multiple times. Which would mix the read function from one
> > clocksource with the cycle_last / mask values from the new
> > clock. Now if we add code that prevents the compiler from
> > reading from timekeeper.clock multiple times we might get
> > away with it.
>
> Right, but this should be ok. timekeeping_get_ns is a helper that
> requires the xtime_lock to be held (such a comment is probably needed,
> but there is no usage of it when the xtime_lock isn't held). While the
> function may actually mix values from two clocksources in a calculation,
> the results of those calculations will be thrown out and re-done via the
> xtime_lock seqlock.
Ok, all callers to timekeeping_get_ns use an xtime_lock loop to
make sure that no inconsistent result gets returned.
> > The reasoning for stop_machine is that the change of a
> > clocksource is a major change which has subtle side effects,
> > so we want to make sure that nothing breaks. It is a very rare
> > event, so we can afford to spend a little bit of time there.
> > Ergo stop_machine.
>
> I do agree that there can be subtle side effects when dealing with
> clocksources (part of why I'm being so cautious introducing this
> change), and when the stop_machine code was added it seemed reasonable.
> But given the limitations of stop_machine, the more I look at the
> clocksource_change code, the more I suspect stop_machine is overkill and
> we can safely just take the write lock on xtime_lock.
>
> If I'm still missing something, do let me know.
What about a clocksource_unregister while a cpu is in the middle of a
read_seqbegin/timekeeping_get_ns/read_seqretry? The clocksource structure
is "free" after the successful call to the unregister. At least in theory
this could be a use after free. The race window is tiny but on virtual
systems there can be an arbitrary delay in the ktime_get sequence.
I agree that stop_machine is the big gun and restricts the way
the clocksource functions may be called. But it is safe, no?
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Thu, 2010-07-29 at 09:11 +0200, Martin Schwidefsky wrote:
> On Wed, 28 Jul 2010 09:12:49 -0700
> john stultz <[email protected]> wrote:
> >
> > I do agree that there can be subtle side effects when dealing with
> > clocksources (part of why I'm being so cautious introducing this
> > change), and when the stop_machine code was added it seemed reasonable.
> > But given the limitations of stop_machine, the more I look at the
> > clocksource_change code, the more I suspect stop_machine is overkill and
> > we can safely just take the write lock on xtime_lock.
> >
> > If I'm still missing something, do let me know.
>
> What about a clocksource_unregister while a cpu is in the middle of a
> read_seqbegin/timekeeping_get_ns/read_seqretry? The clocksource structure
> is "free" after the successful call to the unregister. At least in theory
> this could be a use after free. The race window is tiny but on virtual
> systems there can be an arbitrary delay in the ktime_get sequence.
Huh. At first I thought "but we don't yet implement
clocksource_unregister!", but of course we do now (I guess that TODO
in the clocksource.c header can be yanked :).
So yes, unregister has been contentious in the past for this very
reason. Once registered, it's really hard to find a safe point when it
can be un-registered. Stop machine mostly solves this (although one
should note: vsyscall-enabled clocksources really can't be freed, as
their vread() page needs to be statically mapped into userspace).
So while stop_machine is a solution here, it would make more sense to me
to use stop_machine (or maybe even a different method, as it sort of
screams RCU to me) to make sure all the cpus are out of the xtime_lock
critical section prior to returning from clocksource_unregister, rather
than stopping everything for the clocksource change.
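As a strawman, the RCU flavor would be roughly this (untested sketch;
it assumes the read side were also converted to rcu_read_lock()/
rcu_dereference() on timekeeper.clock, which is exactly the extra
read-side work I'd rather avoid):

	/*
	 * Sketch: RCU-style unregister. Assumes readers take
	 * rcu_read_lock() and rcu_dereference(timekeeper.clock),
	 * which is the read-side cost mentioned above.
	 */
	void clocksource_unregister(struct clocksource *cs)
	{
		mutex_lock(&clocksource_mutex);
		clocksource_dequeue_watchdog(cs);
		list_del(&cs->list);
		clocksource_select();
		mutex_unlock(&clocksource_mutex);

		/* wait for all pre-existing readers to drop their references */
		synchronize_rcu();
	}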
> I agree that stop_machine is the big gun and restricts the code in the way
> how the clocksource functions may be call. But it is safe, no?
Actually, my reading of stop_machine makes me hesitate a bit, as I'm not
sure that, with kernel preemption, we're sure to avoid stopping a thread
mid-syscall to gettimeofday. Anyone have a clue if that's avoided?
Regardless, we need some other method than stop_machine to register
clocksources, as stop_machine is just too limiting.
thanks
-john
On Thu, 2010-07-29 at 13:49 -0700, john stultz wrote:
> On Thu, 2010-07-29 at 09:11 +0200, Martin Schwidefsky wrote:
> > What about a clocksource_unregister while a cpu is in the middle of a
> > read_seqbegin/timekeeping_get_ns/read_seqretry? The clocksource structure
> > is "free" after the successful call to the unregister. At least in theory
> > this could be a use after free. The race window is tiny but on virtual
> > systems there can be an arbitrary delay in the ktime_get sequence.
>
> So yes, unregister has been contentious in the past for this very
> reason. Once registered, it's really hard to find a safe point when it
> can be un-registered. Stop machine mostly solves this (although one
> should note: vsyscall-enabled clocksources really can't be freed, as
> their vread() page needs to be statically mapped into userspace).
>
> So while stop_machine is a solution here, it would make more sense to me
> to use stop_machine (or maybe even a different method, as it sort of
> screams RCU to me) to make sure all the cpus are out of the xtime_lock
> critical section prior to returning from clocksource_unregister, rather
> than stopping everything for the clocksource change.
Below is a rough patch to use stop_machine to get the same level of race
protection for clocksource_unregister as we have currently in Linus's
tree (which may possibly have holes in it?).
Comments or suggestions for other ideas would be appreciated.
I'm thinking RCU might be really close to what we actually want here,
but I'd like to be able to avoid any extra work on the read side (i.e.
even the preempt_disable()), and would even be more prone to disallowing
clocksource unregistration than impacting the xtime_lock read side.
Any other thoughts?
thanks
-john
From c85f3adf525ac457b938f4cf39ace1e2239c509f Mon Sep 17 00:00:00 2001
From: John Stultz <[email protected]>
Date: Thu, 29 Jul 2010 15:28:13 -0700
Subject: [PATCH] [HACK] Avoid clocksource_unregister races
This attempts to avoid use-after-free style races caused
when a clocksource that is in use is unregistered.
The problem arises as the read side of the xtime_lock seqlock
allows for invalid calculations to be made while the write lock
is held. This is normally ok, as the invalid calculations are
never returned. However, if a clocksource is unregistered and
then freed, it's possible that other cpus still in the
read loop hold invalid references to the clocksource
(possibly traversing an invalid clocksource->read function pointer).
This patch (ab)uses stop_machine to delay returning from
clocksource_unregister until all cpus have run the cpu_stopper_thread
and have (we assume, possibly incorrectly) left the xtime_lock-protected
read-side critical section.
I have concerns that stop_machine does not enforce that the other cpus
have in fact left the read-side xtime_lock critical section if
features like kernel preemption are enabled. So the race may
still exist.
However, this restores as much protection as we had earlier
when stop_machine was used to do all clocksource changes.
stop_machine also seems a little heavy-handed for what we're wanting.
I'd appreciate suggestions for better tools to do something similar. It
is very RCU-like, but we'd like to avoid any extra work on the read side.
(I'm suspecting that making sure all the other cpus' idle time counters
have increased might be a better method to ensure everyone has
left the critical section.)
NOT YET FOR INCLUSION! FOR DISCUSSION ONLY.
Signed-off-by: John Stultz <[email protected]>
---
kernel/time/clocksource.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index c18d7ef..9a1ae9e 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -30,6 +30,7 @@
#include <linux/sched.h> /* for spin_unlock_irq() using preempt_count() m68k */
#include <linux/tick.h>
#include <linux/kthread.h>
+#include <linux/stop_machine.h>
void timecounter_init(struct timecounter *tc,
const struct cyclecounter *cc,
@@ -731,6 +732,17 @@ void clocksource_change_rating(struct clocksource *cs, int rating)
}
EXPORT_SYMBOL(clocksource_change_rating);
+
+static int unregister_wait(void *data)
+{
+ /*
+ * We don't really do anything here, just
+ * use this to get to a safe point before
+ * we return from clocksource_unregister.
+ */
+ return 0;
+}
+
/**
* clocksource_unregister - remove a registered clocksource
*/
@@ -741,6 +753,18 @@ void clocksource_unregister(struct clocksource *cs)
list_del(&cs->list);
clocksource_select();
mutex_unlock(&clocksource_mutex);
+
+ /*
+ * There may be cpus that still hold a reference to
+ * the clocksource's read() function, so block here
+ * until they have all gone idle (via stop_machine).
+ *
+ * XXX - stop_machine preempts tasks, which may
+ * not actually wait till the read seq xtime_lock is
+ * "released". Regardless, this is atleast as safe
+ * as what we have been doing in the past.
+ */
+ stop_machine(unregister_wait, NULL, NULL);
}
EXPORT_SYMBOL(clocksource_unregister);
--
1.6.0.4
On Thu, 2010-07-29 at 16:08 -0700, john stultz wrote:
> On Thu, 2010-07-29 at 13:49 -0700, john stultz wrote:
> > On Thu, 2010-07-29 at 09:11 +0200, Martin Schwidefsky wrote:
> > > What about a clocksource_unregister while a cpu is in the middle of a
> > > read_seqbegin/timekeeping_get_ns/read_seqretry? The clocksource structure
> > > is "free" after the successful call to the unregister. At least in theory
> > > this could be a use after free. The race window is tiny but on virtual
> > > systems there can be an arbitrary delay in the ktime_get sequence.
> >
> > So yes, unregister has been contentious in the past for this very
> > reason. Once registered, it's really hard to find a safe point when it
> > can be un-registered. Stop machine mostly solves this (although one
> > should note: vsyscall-enabled clocksources really can't be freed, as
> > their vread() page needs to be statically mapped into userspace).
> >
> > So while stop_machine is a solution here, it would make more sense to me
> > to use stop_machine (or maybe even a different method, as it sort of
> > screams RCU to me) to make sure all the cpus are out of the xtime_lock
> > critical section prior to returning from clocksource_unregister, rather
> > than stopping everything for the clocksource change.
>
>
> Below is a rough patch to use stop_machine to get the same level of race
> protection for clocksource_unregister as we have currently in Linus's
> tree (which may possibly have holes in it?).
>
> Comments or suggestions for other ideas would be appreciated.
>
> I'm thinking RCU might be really close to what we actually want here,
> but I'd like to be able to avoid any extra work on the read side (i.e.
> even the preempt_disable()), and would even be more prone to disallowing
> clocksource unregistration than impacting the xtime_lock read side.
>
>
> Any other thoughts?
Actually, the more I think about it, the more I really just think we
should kill clocksource_unregister and simply not allow it.
Part of the reason is that we have other issues lurking under here, such
as: what do we do if someone unregisters the only HRT-capable
clocksource? There's currently no way to fall back from HRT mode to
non-HRT mode.
It just adds a ton of complexity and issues for really zero gain. The
only reasonable use-case I can come up with is having a clocksource
loaded via a module, and then wanting to unload it.
So while loading clocksources as a module is a nice feature that could
save folks in a pinch (think old distro kernels needing a clock fix on
new hardware), unregister and removal really don't have much
functional use. They're just nice and symmetrical.
So unless anyone else objects, I'm inclined to kill off unregister (and
change the single user's error-handling path to delay registration until
the hardware is known to be good).
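That is, something along these lines (the mydev_* names are
hypothetical, just to show the shape of the error path):

	/*
	 * Hypothetical driver init: only register once the hardware
	 * has passed its sanity checks, so the error path never needs
	 * clocksource_unregister().
	 */
	static int __init mydev_clocksource_init(void)
	{
		if (mydev_hw_selftest() < 0)
			return -ENODEV;	/* never registered, nothing to undo */

		return clocksource_register(&mydev_clocksource);
	}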
Any counterpoints?
thanks
-john
Hi John,
How about using O'Rourke's algorithm?
It's much faster, and it stays accurate even if an SMI disturbance
occurs during the calibration process.
Reference:
O'Rourke, J. (1981). An on-line algorithm for fitting straight lines
between data ranges.
Communications of the ACM, 24, 574-578. doi:10.1145/358746.358758
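Roughly, the core of the range-fitting idea, much simplified to the
slope-only case with the first sample as anchor (an illustrative
sketch, not the full algorithm; needs linux/kernel.h and
linux/math64.h):

	struct sample {
		u64 t_ns;		/* reference time (hpet/pmtimer), in ns */
		u64 tsc_lo, tsc_hi;	/* tsc reads bracketing the ref read */
	};

	/*
	 * Each later sample implies a frequency interval relative to
	 * sample 0; intersecting the intervals rejects SMI-disturbed
	 * samples instead of letting them skew an average.
	 */
	static int fit_freq_khz(const struct sample *s, int n, u64 *khz)
	{
		u64 f_lo = 0, f_hi = ~0ULL;	/* feasible freq range, kHz */
		int i;

		for (i = 1; i < n; i++) {
			u64 dt = s[i].t_ns - s[0].t_ns;
			/* cycles * 1e6 / ns == kHz */
			u64 lo = div64_u64((s[i].tsc_lo - s[0].tsc_hi) * 1000000ULL, dt);
			u64 hi = div64_u64((s[i].tsc_hi - s[0].tsc_lo) * 1000000ULL, dt);

			f_lo = max(f_lo, lo);
			f_hi = min(f_hi, hi);
			if (f_lo > f_hi)
				return -1;	/* inconsistent sample (SMI?) */
		}
		*khz = (f_lo + f_hi) / 2;
		return 0;
	}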