2013-07-04 05:36:02

by Alex Shi

[permalink] [raw]
Subject: [URGENT rfc patch 0/3] tsc clocksource bug fix

We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
like oltp, tbench, hackbench etc. and bisected the commit 5d33b883ae
cause this regression. Due to this commit, the clocksource was changed
to hpet from tsc even tsc will be set CLOCK_SOURCE_VALID_FOR_HRES later
in clocksource_watchdog.

Tim Chen said the hpet reading cost much. That cause this regression.

This patchset fixed this bug by re-select clocksource after this flag set
on tsc.

BTW,
If the tsc is marked as constant and nonstop, could we set it as system
clocksource when do tsc register? w/o checking it on clocksource_watchdog?

regards!
Alex


2013-07-04 05:36:09

by Alex Shi

[permalink] [raw]
Subject: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

commit 5d33b883aed81c6fbcd09c6f7c3619eee850a7e2
clocksource: Always verify highres capability

This commit will reject a clock to be system clocksource if it has no
CLOCK_SOURCE_VALID_FOR_HRES flags. Then the tsc to be rejected as
clocksource, because this flag for tsc is set in clocksource_watchdog
which run after the tsc register.
TSC registered in tsc_refine_calibration_work() and started the watchdog
at that time.

This patch re-select the clocksource after we make sure the tsc has this
flag. Fixed this bug.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/time/clocksource.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 9d6c333..3ad9f29 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -308,6 +308,8 @@ static void clocksource_watchdog(unsigned long data)
* transition into high-res mode:
*/
tick_clock_notify();
+ if (finished_booting)
+ schedule_work(&watchdog_work);
}
}

@@ -404,6 +406,7 @@ static void clocksource_dequeue_watchdog(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

+static void clocksource_select(void);
static int clocksource_watchdog_kthread(void *data)
{
struct clocksource *cs, *tmp;
@@ -412,11 +415,14 @@ static int clocksource_watchdog_kthread(void *data)

mutex_lock(&clocksource_mutex);
spin_lock_irqsave(&watchdog_lock, flags);
- list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
+ list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
list_del_init(&cs->wd_list);
list_add(&cs->wd_list, &unstable);
}
+ if (cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)
+ clocksource_select();
+ }
/* Check if the watchdog timer needs to be stopped. */
clocksource_stop_watchdog();
spin_unlock_irqrestore(&watchdog_lock, flags);
--
1.7.12

2013-07-04 05:36:31

by Alex Shi

[permalink] [raw]
Subject: [PATCH 2/3] clockesource: set override clocksource

Shrink the mutex region. And save a clocksource_select action if set
clocksource is same as current clocksource.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/time/clocksource.c | 12 +++++++-----
1 file changed, 7 insertions(+), 5 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index 021c159..9d6c333 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -885,13 +885,15 @@ static ssize_t sysfs_override_clocksource(struct device *dev,
{
size_t ret;

- mutex_lock(&clocksource_mutex);
-
ret = sysfs_get_uname(buf, override_name, count);
- if (ret >= 0)
- clocksource_select();
+ if (ret >= 0) {
+ if (!strcmp(curr_clocksource->name, override_name))
+ return ret;

- mutex_unlock(&clocksource_mutex);
+ mutex_lock(&clocksource_mutex);
+ clocksource_select();
+ mutex_unlock(&clocksource_mutex);
+ }

return ret;
}
--
1.7.12

2013-07-04 05:36:47

by Alex Shi

[permalink] [raw]
Subject: [PATCH 1/3] clocksource: clean up clocksource_select

After clocksource_find_best() introduced, it is impossible to get into
some code path. so clean them up.

Signed-off-by: Alex Shi <[email protected]>
---
kernel/time/clocksource.c | 16 +---------------
1 file changed, 1 insertion(+), 15 deletions(-)

diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index e713ef7..021c159 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -582,26 +582,12 @@ static void __clocksource_select(bool skipcur)
if (!best)
return;

- /* Check for the override clocksource. */
list_for_each_entry(cs, &clocksource_list, list) {
if (skipcur && cs == curr_clocksource)
continue;
if (strcmp(cs->name, override_name) != 0)
continue;
- /*
- * Check to make sure we don't switch to a non-highres
- * capable clocksource if the tick code is in oneshot
- * mode (highres or nohz)
- */
- if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
- /* Override clocksource cannot be used. */
- printk(KERN_WARNING "Override clocksource %s is not "
- "HRT compatible. Cannot switch while in "
- "HRT/NOHZ mode\n", cs->name);
- override_name[0] = 0;
- } else
- /* Override clocksource can be used. */
- best = cs;
+ best = cs;
break;
}

--
1.7.12

2013-07-04 07:59:27

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Thu, Jul 04, 2013 at 01:34:13PM +0800, Alex Shi wrote:

> If the tsc is marked as constant and nonstop, could we set it as system
> clocksource when do tsc register? w/o checking it on clocksource_watchdog?

I'd not do that; the BIOS can still screw you over, we need some validation.

That said; we do need means to disable the clocksource watchdog -- although I
suppose Frederic might already have provided this for this NOHZ efforts when I
wasn't looking.

2013-07-04 08:14:25

by Alex Shi

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On 07/04/2013 03:58 PM, Peter Zijlstra wrote:
> On Thu, Jul 04, 2013 at 01:34:13PM +0800, Alex Shi wrote:
>
>> If the tsc is marked as constant and nonstop, could we set it as system
>> clocksource when do tsc register? w/o checking it on clocksource_watchdog?
>
> I'd not do that; the BIOS can still screw you over, we need some validation.

I see. thanks!
>
> That said; we do need means to disable the clocksource watchdog -- although I
> suppose Frederic might already have provided this for this NOHZ efforts when I
> wasn't looking.
>


--
Thanks
Alex

2013-07-04 09:57:40

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/3] clocksource: clean up clocksource_select

On Thu, 4 Jul 2013, Alex Shi wrote:

> After clocksource_find_best() introduced, it is impossible to get into
> some code path. so clean them up.

That's wrong.

> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/time/clocksource.c | 16 +---------------
> 1 file changed, 1 insertion(+), 15 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index e713ef7..021c159 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -582,26 +582,12 @@ static void __clocksource_select(bool skipcur)
> if (!best)
> return;
>
> - /* Check for the override clocksource. */
> list_for_each_entry(cs, &clocksource_list, list) {
> if (skipcur && cs == curr_clocksource)
> continue;
> if (strcmp(cs->name, override_name) != 0)
> continue;

We need this check and it is completely unrelated to the problem
you're trying to solve.

Assume the following:

System boots with clocksource A and switches into highres mode.
Now clocksource B gets registered and B is not highres capable.

clocksource_find_best() selects again A, but we have
clocksource=B on the kernel command line to override the kernel
decision.

By removing the check, you install he non highres capable clocksource
B and kill the machine.

> - /*
> - * Check to make sure we don't switch to a non-highres
> - * capable clocksource if the tick code is in oneshot
> - * mode (highres or nohz)
> - */
> - if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) && oneshot) {
> - /* Override clocksource cannot be used. */
> - printk(KERN_WARNING "Override clocksource %s is not "
> - "HRT compatible. Cannot switch while in "
> - "HRT/NOHZ mode\n", cs->name);
> - override_name[0] = 0;
> - } else
> - /* Override clocksource can be used. */
> - best = cs;
> + best = cs;
> break;
> }

Thanks,

tglx

2013-07-04 10:22:58

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 1/3] clocksource: clean up clocksource_select


> We need this check and it is completely unrelated to the problem
> you're trying to solve.
>
> Assume the following:
>
> System boots with clocksource A and switches into highres mode.
> Now clocksource B gets registered and B is not highres capable.
>
> clocksource_find_best() selects again A, but we have
> clocksource=B on the kernel command line to override the kernel
> decision.
>
> By removing the check, you install he non highres capable clocksource
> B and kill the machine.
>

You'r right. my bad, Sorry!

BTW, why we allow user override a second best clocksource? I mean user
can override the tsc with hpet. because undetected unstable tsc?



--
Thanks
Alex

2013-07-04 10:23:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 2/3] clockesource: set override clocksource

On Thu, 4 Jul 2013, Alex Shi wrote:

> Shrink the mutex region. And save a clocksource_select action if set
> clocksource is same as current clocksource.

Again, how is that related to the issue described in 0/3 ?

That's an optimization and not a regression fix. And it's wrong as
well.

> Signed-off-by: Alex Shi <[email protected]>
> ---
> kernel/time/clocksource.c | 12 +++++++-----
> 1 file changed, 7 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 021c159..9d6c333 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -885,13 +885,15 @@ static ssize_t sysfs_override_clocksource(struct device *dev,
> {
> size_t ret;
>
> - mutex_lock(&clocksource_mutex);
> -
> ret = sysfs_get_uname(buf, override_name, count);
> - if (ret >= 0)
> - clocksource_select();
> + if (ret >= 0) {
> + if (!strcmp(curr_clocksource->name, override_name))

What if you get preempted in the middle of sysfs_get_uname() or in the
middle of strcmp() and some other code triggers a clocksource_select()
while you are off the CPU?

That might end up in a half filled override_name buffer or accessing
memory which might have been freed already because curr_clocksource
changed and the old driver was unloaded.

Not pretty.

If at all we can do:

- mutex_lock(&clocksource_mutex);
-
- ret = sysfs_get_uname(buf, override_name, count);
+ ret = sysfs_get_uname(buf, tmp_buf, count);
+ if (ret < 0)
+ return ret;

+ mutex_lock(&clocksource_mutex);
+ if (strcmp(override_name, tmp_buf) != 0) {
+ memcpy(override_name, tmp_buf, sizeof(tmp_buf));
+ clocksource_select();
+ }

mutex_unlock(&clocksource_mutex);
return ret;
}

Thanks,

tglx

2013-07-04 10:27:26

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 1/3] clocksource: clean up clocksource_select

On Thu, 4 Jul 2013, Alex Shi wrote:
> > We need this check and it is completely unrelated to the problem
> > you're trying to solve.
> >
> > Assume the following:
> >
> > System boots with clocksource A and switches into highres mode.
> > Now clocksource B gets registered and B is not highres capable.
> >
> > clocksource_find_best() selects again A, but we have
> > clocksource=B on the kernel command line to override the kernel
> > decision.
> >
> > By removing the check, you install he non highres capable clocksource
> > B and kill the machine.
> >
>
> You'r right. my bad, Sorry!
>
> BTW, why we allow user override a second best clocksource? I mean user
> can override the tsc with hpet. because undetected unstable tsc?

The user can decide to override with clocksource=jiffies if he wants
for testing purposes. That wont switch into highres ever if done from
the kernel command line. Now we have a sysfs interface, so we need to
sanity check the user override against highres.

The override is quite useful to test hpet, pmtimer on a machine which
would always select TSC.

Thanks,

tglx

2013-07-04 10:55:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

On Thu, 4 Jul 2013, Alex Shi wrote:
> commit 5d33b883aed81c6fbcd09c6f7c3619eee850a7e2
> clocksource: Always verify highres capability
>
> This commit will reject a clock to be system clocksource if it has no
> CLOCK_SOURCE_VALID_FOR_HRES flags. Then the tsc to be rejected as

No. It rejects the clocksource if the system is in oneshot mode and
the clocksource does not support HIGHRES.

So at boot time, the tick device mode is PERIODIC and the clocksource
is set to jiffies.

In clocksource_done_booting() we select the best clocksource from the
registered list. We are still in PERIODIC mode, so the selection in
clocksource_find_best() should grab TSC whether the HIGHRES_VALID flag
is set or not. The relevant check in find_best() is:

if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES))
continue;

And oneshot should be definitely false at that point. So we don't care
about the HRES flag at all.

So if we select TSC from clocksource_done_booting() that prevents the
system from switching into highres mode as long as the VALID_FOR_HRES
flag is not set by the watchdog. If it gets set then
tick_clock_notify() tells the tick code to reevaluate.

So now the question is why you are observing that HPET is selected in
the first place.

Can you add instrumentation to the code and provide the data please? I
try to reproduce myself. What's the environment you're using?

> diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> index 9d6c333..3ad9f29 100644
> --- a/kernel/time/clocksource.c
> +++ b/kernel/time/clocksource.c
> @@ -308,6 +308,8 @@ static void clocksource_watchdog(unsigned long data)
> * transition into high-res mode:
> */
> tick_clock_notify();
> + if (finished_booting)
> + schedule_work(&watchdog_work);

This only makes sense, when the clocksource which gets the FLAG set
has the highest rating, but was not selected because the system was in
ONESHOT mode already.

And I can't see why that should suddenly happen.

> static int clocksource_watchdog_kthread(void *data)
> {
> struct clocksource *cs, *tmp;
> @@ -412,11 +415,14 @@ static int clocksource_watchdog_kthread(void *data)
>
> mutex_lock(&clocksource_mutex);
> spin_lock_irqsave(&watchdog_lock, flags);
> - list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
> + list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
> if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
> list_del_init(&cs->wd_list);
> list_add(&cs->wd_list, &unstable);
> }
> + if (cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)
> + clocksource_select();

Unlikely, but if we have 3 watched clocksources which have the HRES
bit set, you'd call 3 times clocksource_select().

Thanks,

tglx

2013-07-04 11:00:51

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Thu, 4 Jul 2013, Alex Shi wrote:

> We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
> like oltp, tbench, hackbench etc. and bisected the commit 5d33b883ae
> cause this regression. Due to this commit, the clocksource was changed
> to hpet from tsc even tsc will be set CLOCK_SOURCE_VALID_FOR_HRES later
> in clocksource_watchdog.

5d33b883ae is not in tip/sched/core. So what are you testing and
bisecting?

Thanks,

tglx

2013-07-04 13:04:21

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

On Thu, 4 Jul 2013, Thomas Gleixner wrote:
> On Thu, 4 Jul 2013, Alex Shi wrote:
> > commit 5d33b883aed81c6fbcd09c6f7c3619eee850a7e2
> > clocksource: Always verify highres capability
> >
> > This commit will reject a clock to be system clocksource if it has no
> > CLOCK_SOURCE_VALID_FOR_HRES flags. Then the tsc to be rejected as
>
> No. It rejects the clocksource if the system is in oneshot mode and
> the clocksource does not support HIGHRES.
>
> So at boot time, the tick device mode is PERIODIC and the clocksource
> is set to jiffies.
>
> In clocksource_done_booting() we select the best clocksource from the
> registered list. We are still in PERIODIC mode, so the selection in
> clocksource_find_best() should grab TSC whether the HIGHRES_VALID flag
> is set or not. The relevant check in find_best() is:
>
> if (oneshot && !(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES))
> continue;
>
> And oneshot should be definitely false at that point. So we don't care
> about the HRES flag at all.
>
> So if we select TSC from clocksource_done_booting() that prevents the
> system from switching into highres mode as long as the VALID_FOR_HRES
> flag is not set by the watchdog. If it gets set then
> tick_clock_notify() tells the tick code to reevaluate.
>
> So now the question is why you are observing that HPET is selected in
> the first place.
>
> Can you add instrumentation to the code and provide the data please? I
> try to reproduce myself. What's the environment you're using?

Ok. Figured it out.

Boot.
start tsc revalidation()
register(hpet)
register(pmtimer);

clocksource_done_booting()
select HPET

switch to oneshot

tsc_revalidation()
register(TSC)
clocksource_select()
keep HPET because TSC is not yet validated by the watchdog

watchdog validates TSC, but that does not make TSC the system
clocksource.

So, the old code was actually installing TSC over HPET even when TSC
was not yet qualified for HRES. Kinda worked :)

So yes, we need somthing like your patch to fix that.

> > diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
> > index 9d6c333..3ad9f29 100644
> > --- a/kernel/time/clocksource.c
> > +++ b/kernel/time/clocksource.c
> > @@ -308,6 +308,8 @@ static void clocksource_watchdog(unsigned long data)
> > * transition into high-res mode:
> > */
> > tick_clock_notify();
> > + if (finished_booting)
> > + schedule_work(&watchdog_work);
>
> This only makes sense, when the clocksource which gets the FLAG set
> has the highest rating, but was not selected because the system was in
> ONESHOT mode already.
>
> And I can't see why that should suddenly happen.

Now I can :)

> > static int clocksource_watchdog_kthread(void *data)
> > {
> > struct clocksource *cs, *tmp;
> > @@ -412,11 +415,14 @@ static int clocksource_watchdog_kthread(void *data)
> >
> > mutex_lock(&clocksource_mutex);
> > spin_lock_irqsave(&watchdog_lock, flags);
> > - list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
> > + list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
> > if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
> > list_del_init(&cs->wd_list);
> > list_add(&cs->wd_list, &unstable);
> > }
> > + if (cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)
> > + clocksource_select();
>
> Unlikely, but if we have 3 watched clocksources which have the HRES
> bit set, you'd call 3 times clocksource_select().

Also the reselect must be called outside the watchdog_lock region.

Thanks,

tglx

2013-07-04 18:11:58

by Davidlohr Bueso

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Thu, 2013-07-04 at 13:00 +0200, Thomas Gleixner wrote:
> On Thu, 4 Jul 2013, Alex Shi wrote:
>
> > We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
> > like oltp, tbench, hackbench etc. and bisected the commit 5d33b883ae
> > cause this regression. Due to this commit, the clocksource was changed
> > to hpet from tsc even tsc will be set CLOCK_SOURCE_VALID_FOR_HRES later
> > in clocksource_watchdog.
>
> 5d33b883ae is not in tip/sched/core. So what are you testing and
> bisecting?

I think he's referring to:

commit 5d33b883aed81c6fbcd09c6f7c3619eee850a7e2
Author: Thomas Gleixner <[email protected]>
Date: Thu Apr 25 20:31:43 2013 +0000

clocksource: Always verify highres capability

If a clocksource has a (wrong) high rating, but can't be used as a
timebase for oneshot tick mode, it is unconditionally selected even
when the system is already in oneshot tick mode. This causes full
system failure.

Verify the clocksource selection against the oneshot mode.

Signed-off-by: Thomas Gleixner <[email protected]>
Acked-by: John Stultz <[email protected]>
Cc: Magnus Damm <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>

2013-07-04 20:27:24

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Thu, 4 Jul 2013, Davidlohr Bueso wrote:

> On Thu, 2013-07-04 at 13:00 +0200, Thomas Gleixner wrote:
> > On Thu, 4 Jul 2013, Alex Shi wrote:
> >
> > > We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
> > > like oltp, tbench, hackbench etc. and bisected the commit 5d33b883ae
> > > cause this regression. Due to this commit, the clocksource was changed
> > > to hpet from tsc even tsc will be set CLOCK_SOURCE_VALID_FOR_HRES later
> > > in clocksource_watchdog.
> >
> > 5d33b883ae is not in tip/sched/core. So what are you testing and
> > bisecting?
>
> I think he's referring to:
>
> commit 5d33b883aed81c6fbcd09c6f7c3619eee850a7e2

I know what he is referring to. He explicitly mentions this commit:

> > > like oltp, tbench, hackbench etc. and bisected the commit 5d33b883ae

What I was pointing out that he was referring to tip sched/core at the
same time

> > > We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,

And that branch does NOT have that commit included. So how can you see
a regression on a branch caused by a commit NOT included into that
branch?

The offending commit is in tip timers/core and not in tip
sched/core. What I'm wanted to say is, that we need a proper
description of problems and not some random association.

It tricked you to assume, that I'm not able to figure it out myself :)

See? These things are complex and subtle, so we need precise
descriptions and not some sloppy semi correct data.

I'm well aware of the issue and with Peters help I got a reasonable
explanation for it. A proper fix is about to be sent out.

Thanks,

tglx

2013-07-04 20:46:51

by Thomas Gleixner

[permalink] [raw]
Subject: [PATCH] clocksource: Reselect clocksource when watchdog validated

Up to commit 5d33b883a (clocksource: Always verify highres capability)
we had no sanity check when selecting a clocksource, which prevented
that a non highres capable clocksource is used when the system already
switched to highres/nohz mode.

The new sanity check works as Alex and Tim found out. It prevents the
TSC from being used. This happens because on x86 the boot process
looks like this:

tsc_start_freqency_validation(TSC);
clocksource_register(HPET);
clocksource_done_booting();
clocksource_select()
Selects HPET which is valid for high-res

switch_to_highres();

clocksource_register(TSC);
TSC is not selected, because it is not yet
flagged as VALID_HIGH_RES

clocksource_watchdog()
Validates TSC for highres, but that does not make TSC
the current clocksource.

Before the sanity check was added, we installed TSC unvalidated which
worked most of the time. If the TSC was really detected as unstable,
then the unstable logic removed it and installed HPET again.

The sanity check is correct and needed. So the watchdog needs to kick
a reselection of the clocksource, when it qualifies TSC as a valid
high res clocksource.

To solve this, we mark the clocksource which got the flag
CLOCK_SOURCE_VALID_FOR_HRES set by the watchdog with an new flag
CLOCK_SOURCE_RESELECT and trigger the watchdog thread. The watchdog
thread evaluates the flag and invokes clocksource_select() when set.

To avoid that the clocksource_done_booting() code, which is about to
install the first real clocksource anyway, needs to go through
clocksource_select and tick_oneshot_notify() pointlessly, split out
the clocksource_watchdog_kthread() list walk code and invoke the
select/notify only when called from clocksource_watchdog_kthread().

So clocksource_done_booting() can utilize the same splitout code
without the select/notify invocation and the clocksource_mutex
unlock/relock dance.

Reported-by: Alex Shi <[email protected]>
Cc: Hans Peter Anvin <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/clocksource.h | 1
kernel/time/clocksource.c | 57 ++++++++++++++++++++++++++++++++------------
2 files changed, 43 insertions(+), 15 deletions(-)

Index: tip/include/linux/clocksource.h
===================================================================
--- tip.orig/include/linux/clocksource.h
+++ tip/include/linux/clocksource.h
@@ -210,6 +210,7 @@ struct clocksource {
#define CLOCK_SOURCE_VALID_FOR_HRES 0x20
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
+#define CLOCK_SOURCE_RESELECT 0x100

/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) (cycle_t)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
Index: tip/kernel/time/clocksource.c
===================================================================
--- tip.orig/kernel/time/clocksource.c
+++ tip/kernel/time/clocksource.c
@@ -181,6 +181,7 @@ static int finished_booting;

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
+static void clocksource_select(void);

static LIST_HEAD(watchdog_list);
static struct clocksource *watchdog;
@@ -301,13 +302,30 @@ static void clocksource_watchdog(unsigne
if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
+ /* Mark it valid for high-res. */
cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
+ /*
+ * clocksource_done_booting() will sort it if
+ * finished_booting is not set yet.
+ */
+ if (!finished_booting)
+ continue;
+
/*
- * We just marked the clocksource as highres-capable,
- * notify the rest of the system as well so that we
- * transition into high-res mode:
+ * If this is not the current clocksource let
+ * the watchdog thread reselect it. Due to the
+ * change to high res this clocksource might
+ * be preferred now. If it is the current
+ * clocksource let the tick code know about
+ * that change.
*/
- tick_clock_notify();
+ if (cs != curr_clocksource) {
+ cs->flags |= CLOCK_SOURCE_RESELECT;
+ schedule_work(&watchdog_work);
+ } else {
+ tick_clock_notify();
+ }
}
}

@@ -404,19 +422,25 @@ static void clocksource_dequeue_watchdog
spin_unlock_irqrestore(&watchdog_lock, flags);
}

-static int clocksource_watchdog_kthread(void *data)
+static int __clocksource_watchdog_kthread(void)
{
struct clocksource *cs, *tmp;
unsigned long flags;
LIST_HEAD(unstable);
+ int select = 0;

- mutex_lock(&clocksource_mutex);
spin_lock_irqsave(&watchdog_lock, flags);
- list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
+ list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
list_del_init(&cs->wd_list);
list_add(&cs->wd_list, &unstable);
+ select = 1;
+ }
+ if (cs->flags & CLOCK_SOURCE_RESELECT) {
+ cs->flags &= ~CLOCK_SOURCE_RESELECT;
+ select = 1;
}
+ }
/* Check if the watchdog timer needs to be stopped. */
clocksource_stop_watchdog();
spin_unlock_irqrestore(&watchdog_lock, flags);
@@ -426,6 +450,14 @@ static int clocksource_watchdog_kthread(
list_del_init(&cs->wd_list);
__clocksource_change_rating(cs, 0);
}
+ return select;
+}
+
+static int clocksource_watchdog_kthread(void *data)
+{
+ mutex_lock(&clocksource_mutex);
+ if (__clocksource_watchdog_kthread())
+ clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
}
@@ -445,7 +477,7 @@ static void clocksource_enqueue_watchdog

static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
static inline void clocksource_resume_watchdog(void) { }
-static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+static inline int __clocksource_watchdog_kthread(void) { return 0; }
static bool clocksource_is_watchdog(struct clocksource *cs) { return false; }

#endif /* CONFIG_CLOCKSOURCE_WATCHDOG */
@@ -647,16 +679,11 @@ static int __init clocksource_done_booti
{
mutex_lock(&clocksource_mutex);
curr_clocksource = clocksource_default_clock();
- mutex_unlock(&clocksource_mutex);
-
finished_booting = 1;
-
/*
* Run the watchdog first to eliminate unstable clock sources
*/
- clocksource_watchdog_kthread(NULL);
-
- mutex_lock(&clocksource_mutex);
+ __clocksource_watchdog_kthread();
clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
@@ -789,7 +816,6 @@ static void __clocksource_change_rating(
list_del(&cs->list);
cs->rating = rating;
clocksource_enqueue(cs);
- clocksource_select();
}

/**
@@ -801,6 +827,7 @@ void clocksource_change_rating(struct cl
{
mutex_lock(&clocksource_mutex);
__clocksource_change_rating(cs, rating);
+ clocksource_select();
mutex_unlock(&clocksource_mutex);
}
EXPORT_SYMBOL(clocksource_change_rating);

2013-07-05 01:14:07

by Alex Shi

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On 07/05/2013 04:27 AM, Thomas Gleixner wrote:
>>>> We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
> And that branch does NOT have that commit included. So how can you see
> a regression on a branch caused by a commit NOT included into that
> branch?
>
> The offending commit is in tip timers/core and not in tip
> sched/core. What I'm wanted to say is, that we need a proper
> description of problems and not some random association.
>
> It tricked you to assume, that I'm not able to figure it out myself :)
>
> See? These things are complex and subtle, so we need precise
> descriptions and not some sloppy semi correct data.
>
> I'm well aware of the issue and with Peters help I got a reasonable
> explanation for it. A proper fix is about to be sent out.
>

Ingo had merged your branch into sched/core. :)

commit f9bed7021710a3e45c331f7d7781de914cc1b939
Merge: 7e76057 67dd331
Author: Ingo Molnar <[email protected]>
Date: Wed May 29 11:21:59 2013 +0200

Merge branch 'timers/urgent'

...

commit 7d194f78bde64ec813c1ed8291181bdd61515e78
Merge: 0298bf7 1eaff67
Author: Ingo Molnar <[email protected]>
Date: Tue May 28 09:53:41 2013 +0200

Merge branch 'timers/core'

commit 0298bf70644d7334bec16ae47f3aa58f4f883b59
Merge: cc662fa 2938d27
Author: Ingo Molnar <[email protected]>
Date: Tue May 28 09:49:51 2013 +0200

Merge branch 'timers/urgent'

--
Thanks
Alex

2013-07-05 01:24:09

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 1/3] clocksource: clean up clocksource_select

On 07/04/2013 06:27 PM, Thomas Gleixner wrote:
>> > BTW, why we allow user override a second best clocksource? I mean user
>> > can override the tsc with hpet. because undetected unstable tsc?
> The user can decide to override with clocksource=jiffies if he wants
> for testing purposes. That wont switch into highres ever if done from
> the kernel command line. Now we have a sysfs interface, so we need to
> sanity check the user override against highres.
>
> The override is quite useful to test hpet, pmtimer on a machine which
> would always select TSC.

Got it. thanks a lot for friendly explanation!

--
Thanks
Alex

2013-07-05 01:39:23

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH] clocksource: Reselect clocksource when watchdog validated

On 07/05/2013 04:46 AM, Thomas Gleixner wrote:
>
> Reported-by: Alex Shi <[email protected]>
> Cc: Hans Peter Anvin <[email protected]>
> Cc: Tim Chen <[email protected]>
> Cc: Andi Kleen <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>

Glad to the problem fixed. Thanks!

--
Thanks
Alex

2013-07-05 02:50:02

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

On 07/04/2013 09:04 PM, Thomas Gleixner wrote:
>>> > > static int clocksource_watchdog_kthread(void *data)
>>> > > {
>>> > > struct clocksource *cs, *tmp;
>>> > > @@ -412,11 +415,14 @@ static int clocksource_watchdog_kthread(void *data)
>>> > >
>>> > > mutex_lock(&clocksource_mutex);
>>> > > spin_lock_irqsave(&watchdog_lock, flags);
>>> > > - list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
>>> > > + list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
>>> > > if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
>>> > > list_del_init(&cs->wd_list);
>>> > > list_add(&cs->wd_list, &unstable);
>>> > > }
>>> > > + if (cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)
>>> > > + clocksource_select();
>> >
>> > Unlikely, but if we have 3 watched clocksources which have the HRES
>> > bit set, you'd call 3 times clocksource_select().
> Also the reselect must be called outside the watchdog_lock region.

Sorry for stupid, the watchdog_lock used protect watchdog related
resource. but clocksource_select doesn't touch them. So, I know it isn't
necessary to put reselect under this lock. Just don't know why the
reselect *must* be called outside of it?

--
Thanks
Alex

2013-07-05 05:41:11

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

On Fri, 5 Jul 2013, Alex Shi wrote:
> On 07/04/2013 09:04 PM, Thomas Gleixner wrote:
> >>> > > static int clocksource_watchdog_kthread(void *data)
> >>> > > {
> >>> > > struct clocksource *cs, *tmp;
> >>> > > @@ -412,11 +415,14 @@ static int clocksource_watchdog_kthread(void *data)
> >>> > >
> >>> > > mutex_lock(&clocksource_mutex);
> >>> > > spin_lock_irqsave(&watchdog_lock, flags);
> >>> > > - list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
> >>> > > + list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
> >>> > > if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
> >>> > > list_del_init(&cs->wd_list);
> >>> > > list_add(&cs->wd_list, &unstable);
> >>> > > }
> >>> > > + if (cs->flags & CLOCK_SOURCE_VALID_FOR_HRES)
> >>> > > + clocksource_select();
> >> >
> >> > Unlikely, but if we have 3 watched clocksources which have the HRES
> >> > bit set, you'd call 3 times clocksource_select().
> > Also the reselect must be called outside the watchdog_lock region.
>
> Sorry for stupid, the watchdog_lock used protect watchdog related
> resource. but clocksource_select doesn't touch them. So, I know it isn't
> necessary to put reselect under this lock. Just don't know why the
> reselect *must* be called outside of it?

clocksource_select()
timekeeping_notify()
stop_machine()
get_online_cpus()
might_sleep()
mutex_lock()

So we need to be in preemptible context when we call clocksource_select().

Thanks,

tglx

2013-07-05 05:58:52

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, 5 Jul 2013, Alex Shi wrote:
> On 07/05/2013 04:27 AM, Thomas Gleixner wrote:
> >>>> We find some benchmarks drop a lot on tip/sched/core on many Intel boxes,
> > And that branch does NOT have that commit included. So how can you see
> > a regression on a branch caused by a commit NOT included into that
> > branch?
> >
> > The offending commit is in tip timers/core and not in tip
> > sched/core. What I'm wanted to say is, that we need a proper
> > description of problems and not some random association.
> >
> > It tricked you to assume, that I'm not able to figure it out myself :)
> >
> > See? These things are complex and subtle, so we need precise
> > descriptions and not some sloppy semi correct data.
> >
> > I'm well aware of the issue and with Peters help I got a reasonable
> > explanation for it. A proper fix is about to be sent out.
> >
>
> Ingo had merged your branch into sched/core. :)
>
> commit f9bed7021710a3e45c331f7d7781de914cc1b939
> Merge: 7e76057 67dd331
> Author: Ingo Molnar <[email protected]>
> Date: Wed May 29 11:21:59 2013 +0200
>
> Merge branch 'timers/urgent'

Not really.

tip$ git branch --contains f9bed70217
* master

tip$ git branch --contains 5d33b883ae
* master
timers/core

So you are testing tip/master not tip/sched/core.

Thanks,

tglx

2013-07-05 06:29:14

by Alex Shi

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On 07/05/2013 01:58 PM, Thomas Gleixner wrote:
>> >
>> > Ingo had merged your branch into sched/core. :)
>> >
>> > commit f9bed7021710a3e45c331f7d7781de914cc1b939
>> > Merge: 7e76057 67dd331
>> > Author: Ingo Molnar <[email protected]>
>> > Date: Wed May 29 11:21:59 2013 +0200
>> >
>> > Merge branch 'timers/urgent'
> Not really.
>
> tip$ git branch --contains f9bed70217
> * master
>
> tip$ git branch --contains 5d33b883ae
> * master
> timers/core
>
> So you are testing tip/master not tip/sched/core.

You'r right. I mixed them. sorry.

--
Thanks
Alex

2013-07-05 06:45:47

by Alex Shi

[permalink] [raw]
Subject: Re: [PATCH 3/3] clocksource: fix can not set tsc as clocksource bug

On 07/05/2013 01:41 PM, Thomas Gleixner wrote:
>>>>> > >> > Unlikely, but if we have 3 watched clocksources which have the HRES
>>>>> > >> > bit set, you'd call 3 times clocksource_select().
>>> > > Also the reselect must be called outside the watchdog_lock region.
>> >
>> > Sorry for stupid, the watchdog_lock used protect watchdog related
>> > resource. but clocksource_select doesn't touch them. So, I know it isn't
>> > necessary to put reselect under this lock. Just don't know why the
>> > reselect *must* be called outside of it?
> clocksource_select()
> timekeeping_notify()
> stop_machine()
> get_online_cpus()
> might_sleep()
> mutex_lock()
>
> So we need to be in preemptible context when we call clocksource_select().

understand! many thanks!

--
Thanks
Alex

Subject: [tip:timers/core] clocksource: Reselect clocksource when watchdog validated high-res capability

Commit-ID: 332962f2c88868ed3cdab466870baaa34dd58612
Gitweb: http://git.kernel.org/tip/332962f2c88868ed3cdab466870baaa34dd58612
Author: Thomas Gleixner <[email protected]>
AuthorDate: Thu, 4 Jul 2013 22:46:45 +0200
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 5 Jul 2013 11:09:28 +0200

clocksource: Reselect clocksource when watchdog validated high-res capability

Up to commit 5d33b883a (clocksource: Always verify highres capability)
we had no sanity check when selecting a clocksource, which prevented
that a non highres capable clocksource is used when the system already
switched to highres/nohz mode.

The new sanity check works as Alex and Tim found out. It prevents the
TSC from being used. This happens because on x86 the boot process
looks like this:

tsc_start_freqency_validation(TSC);
clocksource_register(HPET);
clocksource_done_booting();
clocksource_select()
Selects HPET which is valid for high-res

switch_to_highres();

clocksource_register(TSC);
TSC is not selected, because it is not yet
flagged as VALID_HIGH_RES

clocksource_watchdog()
Validates TSC for highres, but that does not make TSC
the current clocksource.

Before the sanity check was added, we installed TSC unvalidated which
worked most of the time. If the TSC was really detected as unstable,
then the unstable logic removed it and installed HPET again.

The sanity check is correct and needed. So the watchdog needs to kick
a reselection of the clocksource, when it qualifies TSC as a valid
high res clocksource.

To solve this, we mark the clocksource which got the flag
CLOCK_SOURCE_VALID_FOR_HRES set by the watchdog with an new flag
CLOCK_SOURCE_RESELECT and trigger the watchdog thread. The watchdog
thread evaluates the flag and invokes clocksource_select() when set.

To avoid that the clocksource_done_booting() code, which is about to
install the first real clocksource anyway, needs to go through
clocksource_select and tick_oneshot_notify() pointlessly, split out
the clocksource_watchdog_kthread() list walk code and invoke the
select/notify only when called from clocksource_watchdog_kthread().

So clocksource_done_booting() can utilize the same splitout code
without the select/notify invocation and the clocksource_mutex
unlock/relock dance.

Reported-and-tested-by: Alex Shi <[email protected]>
Cc: Hans Peter Anvin <[email protected]>
Cc: Tim Chen <[email protected]>
Cc: Andi Kleen <[email protected]>
Tested-by: Peter Zijlstra <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Davidlohr Bueso <[email protected]>
Cc: John Stultz <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
include/linux/clocksource.h | 1 +
kernel/time/clocksource.c | 57 +++++++++++++++++++++++++++++++++------------
2 files changed, 43 insertions(+), 15 deletions(-)

diff --git a/include/linux/clocksource.h b/include/linux/clocksource.h
index 2f39a49..dbbf8aa 100644
--- a/include/linux/clocksource.h
+++ b/include/linux/clocksource.h
@@ -210,6 +210,7 @@ struct clocksource {
#define CLOCK_SOURCE_VALID_FOR_HRES 0x20
#define CLOCK_SOURCE_UNSTABLE 0x40
#define CLOCK_SOURCE_SUSPEND_NONSTOP 0x80
+#define CLOCK_SOURCE_RESELECT 0x100

/* simplify initialization of mask field */
#define CLOCKSOURCE_MASK(bits) (cycle_t)((bits) < 64 ? ((1ULL<<(bits))-1) : -1)
diff --git a/kernel/time/clocksource.c b/kernel/time/clocksource.c
index e713ef7..50a8736 100644
--- a/kernel/time/clocksource.c
+++ b/kernel/time/clocksource.c
@@ -181,6 +181,7 @@ static int finished_booting;

#ifdef CONFIG_CLOCKSOURCE_WATCHDOG
static void clocksource_watchdog_work(struct work_struct *work);
+static void clocksource_select(void);

static LIST_HEAD(watchdog_list);
static struct clocksource *watchdog;
@@ -301,13 +302,30 @@ static void clocksource_watchdog(unsigned long data)
if (!(cs->flags & CLOCK_SOURCE_VALID_FOR_HRES) &&
(cs->flags & CLOCK_SOURCE_IS_CONTINUOUS) &&
(watchdog->flags & CLOCK_SOURCE_IS_CONTINUOUS)) {
+ /* Mark it valid for high-res. */
cs->flags |= CLOCK_SOURCE_VALID_FOR_HRES;
+
+ /*
+ * clocksource_done_booting() will sort it if
+ * finished_booting is not set yet.
+ */
+ if (!finished_booting)
+ continue;
+
/*
- * We just marked the clocksource as highres-capable,
- * notify the rest of the system as well so that we
- * transition into high-res mode:
+ * If this is not the current clocksource let
+ * the watchdog thread reselect it. Due to the
+ * change to high res this clocksource might
+ * be preferred now. If it is the current
+ * clocksource let the tick code know about
+ * that change.
*/
- tick_clock_notify();
+ if (cs != curr_clocksource) {
+ cs->flags |= CLOCK_SOURCE_RESELECT;
+ schedule_work(&watchdog_work);
+ } else {
+ tick_clock_notify();
+ }
}
}

@@ -404,19 +422,25 @@ static void clocksource_dequeue_watchdog(struct clocksource *cs)
spin_unlock_irqrestore(&watchdog_lock, flags);
}

-static int clocksource_watchdog_kthread(void *data)
+static int __clocksource_watchdog_kthread(void)
{
struct clocksource *cs, *tmp;
unsigned long flags;
LIST_HEAD(unstable);
+ int select = 0;

- mutex_lock(&clocksource_mutex);
spin_lock_irqsave(&watchdog_lock, flags);
- list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list)
+ list_for_each_entry_safe(cs, tmp, &watchdog_list, wd_list) {
if (cs->flags & CLOCK_SOURCE_UNSTABLE) {
list_del_init(&cs->wd_list);
list_add(&cs->wd_list, &unstable);
+ select = 1;
+ }
+ if (cs->flags & CLOCK_SOURCE_RESELECT) {
+ cs->flags &= ~CLOCK_SOURCE_RESELECT;
+ select = 1;
}
+ }
/* Check if the watchdog timer needs to be stopped. */
clocksource_stop_watchdog();
spin_unlock_irqrestore(&watchdog_lock, flags);
@@ -426,6 +450,14 @@ static int clocksource_watchdog_kthread(void *data)
list_del_init(&cs->wd_list);
__clocksource_change_rating(cs, 0);
}
+ return select;
+}
+
+static int clocksource_watchdog_kthread(void *data)
+{
+ mutex_lock(&clocksource_mutex);
+ if (__clocksource_watchdog_kthread())
+ clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
}
@@ -445,7 +477,7 @@ static void clocksource_enqueue_watchdog(struct clocksource *cs)

static inline void clocksource_dequeue_watchdog(struct clocksource *cs) { }
static inline void clocksource_resume_watchdog(void) { }
-static inline int clocksource_watchdog_kthread(void *data) { return 0; }
+static inline int __clocksource_watchdog_kthread(void) { return 0; }
static bool clocksource_is_watchdog(struct clocksource *cs) { return false; }

#endif /* CONFIG_CLOCKSOURCE_WATCHDOG */
@@ -647,16 +679,11 @@ static int __init clocksource_done_booting(void)
{
mutex_lock(&clocksource_mutex);
curr_clocksource = clocksource_default_clock();
- mutex_unlock(&clocksource_mutex);
-
finished_booting = 1;
-
/*
* Run the watchdog first to eliminate unstable clock sources
*/
- clocksource_watchdog_kthread(NULL);
-
- mutex_lock(&clocksource_mutex);
+ __clocksource_watchdog_kthread();
clocksource_select();
mutex_unlock(&clocksource_mutex);
return 0;
@@ -789,7 +816,6 @@ static void __clocksource_change_rating(struct clocksource *cs, int rating)
list_del(&cs->list);
cs->rating = rating;
clocksource_enqueue(cs);
- clocksource_select();
}

/**
@@ -801,6 +827,7 @@ void clocksource_change_rating(struct clocksource *cs, int rating)
{
mutex_lock(&clocksource_mutex);
__clocksource_change_rating(cs, rating);
+ clocksource_select();
mutex_unlock(&clocksource_mutex);
}
EXPORT_SYMBOL(clocksource_change_rating);

2013-07-05 14:23:36

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

2013/7/4 Peter Zijlstra <[email protected]>:
> On Thu, Jul 04, 2013 at 01:34:13PM +0800, Alex Shi wrote:
>
>> If the tsc is marked as constant and nonstop, could we set it as system
>> clocksource when do tsc register? w/o checking it on clocksource_watchdog?
>
> I'd not do that; the BIOS can still screw you over, we need some validation.
>
> That said; we do need means to disable the clocksource watchdog -- although I
> suppose Frederic might already have provided this for this NOHZ efforts when I
> wasn't looking.

Nope, I haven't touched that. I prefer not to fiddle with unstable
clocksource for now :)

As for unstable TSCs, if sched_clock_tick() needs to be fed, we simply
don't stop the tick.

2013-07-05 14:38:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, Jul 05, 2013 at 04:23:33PM +0200, Frederic Weisbecker wrote:
> Nope, I haven't touched that. I prefer not to fiddle with unstable
> clocksource for now :)
>
> As for unstable TSCs, if sched_clock_tick() needs to be fed, we simply
> don't stop the tick.

Not entirely the same thing; I thought the clocksource watchdog was ran
even when we have a 'stable' TSC, just to make sure it stays stable.
There's known cases where the BIOS f*cks us over and wrecks TSC sync.

2013-07-05 15:24:12

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, 5 Jul 2013, Peter Zijlstra wrote:

> On Fri, Jul 05, 2013 at 04:23:33PM +0200, Frederic Weisbecker wrote:
> > Nope, I haven't touched that. I prefer not to fiddle with unstable
> > clocksource for now :)
> >
> > As for unstable TSCs, if sched_clock_tick() needs to be fed, we simply
> > don't stop the tick.
>
> Not entirely the same thing; I thought the clocksource watchdog was ran
> even when we have a 'stable' TSC, just to make sure it stays stable.
> There's known cases where the BIOS f*cks us over and wrecks TSC sync.

See arch/x86/kernel/tsc.c

We disable the watchdog for the TSC when tsc_clocksource_reliable is
set.

tsc_clocksource_reliable is set when:

- you add tsc=reliable to the kernel command line

- boot_cpu_has(X86_FEATURE_TSC_RELIABLE)

X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
moorsetown. So all other machines keep the watchdog enabled.

- On Geode LX (OLPC)

Thanks,

tglx

2013-07-05 21:22:52

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, Jul 05, 2013 at 05:24:09PM +0200, Thomas Gleixner wrote:
> See arch/x86/kernel/tsc.c
>
> We disable the watchdog for the TSC when tsc_clocksource_reliable is
> set.
>
> tsc_clocksource_reliable is set when:
>
> - you add tsc=reliable to the kernel command line

Ah, I didn't know about that one, useful.

> - boot_cpu_has(X86_FEATURE_TSC_RELIABLE)
>
> X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
> moorsetown. So all other machines keep the watchdog enabled.

Right.. I knew it was enabled on my machines even though they normally
have usable TSC.

2013-07-05 21:50:10

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, 5 Jul 2013, Peter Zijlstra wrote:
> On Fri, Jul 05, 2013 at 05:24:09PM +0200, Thomas Gleixner wrote:
> > See arch/x86/kernel/tsc.c
> >
> > We disable the watchdog for the TSC when tsc_clocksource_reliable is
> > set.
> >
> > tsc_clocksource_reliable is set when:
> >
> > - you add tsc=reliable to the kernel command line
>
> Ah, I didn't know about that one, useful.
>
> > - boot_cpu_has(X86_FEATURE_TSC_RELIABLE)
> >
> > X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
> > moorsetown. So all other machines keep the watchdog enabled.
>
> Right.. I knew it was enabled on my machines even though they normally
> have usable TSC.

Yeah, but our well justified paranoia still prevents us from trusting
these CPU flags. Maybe some day BIOS is going to be replaced by
something useful. You know: Hope springs eternal....



2013-07-05 21:58:38

by Borislav Petkov

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, Jul 05, 2013 at 11:50:05PM +0200, Thomas Gleixner wrote:
> Yeah, but our well justified paranoia still prevents us from trusting
> these CPU flags. Maybe some day BIOS is going to be replaced by
> something useful. You know: Hope springs eternal....

Not in the next 10 yrs at least if one took a look at the
overengineered, obese at birth and braindead crap by the name of UEFI.

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-07-05 22:18:00

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, 5 Jul 2013, Borislav Petkov wrote:
> On Fri, Jul 05, 2013 at 11:50:05PM +0200, Thomas Gleixner wrote:
> > Yeah, but our well justified paranoia still prevents us from trusting
> > these CPU flags. Maybe some day BIOS is going to be replaced by
> > something useful. You know: Hope springs eternal....
>
> Not in the next 10 yrs at least if one took a look at the
> overengineered, obese at birth and braindead crap by the name of UEFI.

Good news! 10 years is way less than eternity and just before
retirement :)

2013-07-06 08:38:01

by Borislav Petkov

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Sat, Jul 06, 2013 at 12:17:46AM +0200, Thomas Gleixner wrote:
> Good news! 10 years is way less than eternity and just before
> retirement :)

You know that after the 10 years they'll come up with an even uglier
platform-differentiation-fiddle-with-dong-while-smoking-crack-crap which
will even replace the OS, right?

Hmm, I'm wondering what would be faster: wait *at least* 10 more years
or get an old mainboard and start experimenting with coreboot...

--
Regards/Gruss,
Boris.

Sent from a fat crate under my desk. Formatting is fine.
--

2013-07-06 10:50:58

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [URGENT rfc patch 0/3] tsc clocksource bug fix

On Fri, Jul 05, 2013 at 11:50:05PM +0200, Thomas Gleixner wrote:
> On Fri, 5 Jul 2013, Peter Zijlstra wrote:
> > On Fri, Jul 05, 2013 at 05:24:09PM +0200, Thomas Gleixner wrote:
> > > See arch/x86/kernel/tsc.c
> > >
> > > We disable the watchdog for the TSC when tsc_clocksource_reliable is
> > > set.
> > >
> > > tsc_clocksource_reliable is set when:
> > >
> > > - you add tsc=reliable to the kernel command line
> >
> > Ah, I didn't know about that one, useful.
> >
> > > - boot_cpu_has(X86_FEATURE_TSC_RELIABLE)
> > >
> > > X86_FEATURE_TSC_RELIABLE is a software flag, set by vmware and
> > > moorsetown. So all other machines keep the watchdog enabled.
> >
> > Right.. I knew it was enabled on my machines even though they normally
> > have usable TSC.
>
> Yeah, but our well justified paranoia still prevents us from trusting
> these CPU flags. Maybe some day BIOS is going to be replaced by
> something useful. You know: Hope springs eternal....

Oh quite agreed. Its just that at several times I've wanted to disable the
thing. Now I know you can do using the kernel cmdline. Previously I had to
wreck code -- not that much harder really :-)