2014-01-21 22:29:35

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
> In order to avoid the runtime condition and variable load turn
> sched_clock_stable into a static_key.
>
> Also provide a shorter implementation of local_clock() and
> cpu_clock(int) when sched_clock_stable==1.
>
> MAINLINE PRE POST
>
> sched_clock_stable: 1 1 1
> (cold) sched_clock: 329841 221876 215295
> (cold) local_clock: 301773 234692 220773
> (warm) sched_clock: 38375 25602 25659
> (warm) local_clock: 100371 33265 27242
> (warm) rdtsc: 27340 24214 24208
> sched_clock_stable: 0 0 0
> (cold) sched_clock: 382634 235941 237019
> (cold) local_clock: 396890 297017 294819
> (warm) sched_clock: 38194 25233 25609
> (warm) local_clock: 143452 71234 71232
> (warm) rdtsc: 27345 24245 24243
>
> Signed-off-by: Peter Zijlstra<[email protected]>

Hi Peter,

This patch seems to be causing an issue with booting a KVM guest. It seems that it
causes the time to go random during early boot process:

[ 0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
[ 0.000000] NODE_DATA [mem 0xcfa42000-0xcfa72fff]
[ 0.000000] NODE_DATA(30) on node 1
[ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
[ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
[ 0.000000] NODE_DATA(31) on node 1
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
[133538.294040] Zone ranges:
[133538.294338] DMA [mem 0x00001000-0x00ffffff]
[133538.294804] DMA32 [mem 0x01000000-0xffffffff]
[133538.295223] Normal [mem 0x100000000-0x142fffffff]
[133538.295670] Movable zone start for each node

Looking at the code, initially I though that the problem is with:

+void set_sched_clock_stable(void)
+{
+ if (!sched_clock_stable())
+ static_key_slow_dec(&__sched_clock_stable);
+}
+
+void clear_sched_clock_stable(void)
+{
+ /* XXX worry about clock continuity */
+ if (sched_clock_stable())
+ static_key_slow_inc(&__sched_clock_stable);
+}

I think the jump label inc/dec is reversed here. We would want to inc it when enabling
and dec when disabling, no?
However, trying to reverse the two didn't help. I was still seeing the same odd behaviour.

I tried doing a simple conversion to using a simple var like before, which looks like this:

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a67..a035932 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -76,26 +76,21 @@ EXPORT_SYMBOL_GPL(sched_clock);
__read_mostly int sched_clock_running;

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
-static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable;

int sched_clock_stable(void)
{
- if (static_key_false(&__sched_clock_stable))
- return false;
- return true;
+ return __sched_clock_stable;
}

void set_sched_clock_stable(void)
{
- if (!sched_clock_stable())
- static_key_slow_dec(&__sched_clock_stable);
+ __sched_clock_stable = 1;
}

static void __clear_sched_clock_stable(struct work_struct *work)
{
- /* XXX worry about clock continuity */
- if (sched_clock_stable())
- static_key_slow_inc(&__sched_clock_stable);
+ __sched_clock_stable = 0;
}

static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
@@ -340,7 +335,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
*/
u64 cpu_clock(int cpu)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(cpu);

return sched_clock();
@@ -355,7 +350,7 @@ u64 cpu_clock(int cpu)
*/
u64 local_clock(void)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(raw_smp_processor_id());

return sched_clock();


This has corrected the issue:

[ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
[ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
[ 0.000000] NODE_DATA(31) on node 1
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
[ 0.000000] Zone ranges:
[ 0.000000] DMA [mem 0x00001000-0x00ffffff]
[ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
[ 0.000000] Normal [mem 0x100000000-0x142fffffff]
[ 0.000000] Movable zone start for each node
[ 0.000000] Early memory node ranges
[ timing is correct for the rest of the boot]

At this point, I thought that there's something up with jump labels being used this early (?) and
tried compiling with CONFIG_JUMP_LABELS=n, this didn't solve the issue.

This makes me thing there's something different related to jumplabels we're missing, as the
no-jumplabel config should be very similar to the patch I did above, I just can't figure
out what it is.


Thanks,
Sasha


2014-01-22 10:46:12

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
>
> This patch seems to be causing an issue with booting a KVM guest. It seems that it
> causes the time to go random during early boot process:
>
> [ 0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> [ 0.000000] NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> [ 0.000000] NODE_DATA(30) on node 1
> [ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> [ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> [ 0.000000] NODE_DATA(31) on node 1
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> [133538.294040] Zone ranges:
> [133538.294338] DMA [mem 0x00001000-0x00ffffff]
> [133538.294804] DMA32 [mem 0x01000000-0xffffffff]
> [133538.295223] Normal [mem 0x100000000-0x142fffffff]
> [133538.295670] Movable zone start for each node
>
> Looking at the code, initially I though that the problem is with:
>
> +void set_sched_clock_stable(void)
> +{
> + if (!sched_clock_stable())
> + static_key_slow_dec(&__sched_clock_stable);
> +}
> +
> +void clear_sched_clock_stable(void)
> +{
> + /* XXX worry about clock continuity */
> + if (sched_clock_stable())
> + static_key_slow_inc(&__sched_clock_stable);
> +}
>
> I think the jump label inc/dec is reversed here. We would want to inc
> it when enabling and dec when disabling, no?

I got terribly confused with that static_key trainwreck. I know I
definitely got it wrong a few times.

I helped write the jump label stuff, but the current interface is
horrible, I really couldn't figure out what is what anymore :-(

The current code seems to work for me in that my machine ends up with
sched_clock_stable() == true and when I call clear_sched_clock_stable()
it returns false and nothing explodes.

> However, trying to reverse the two didn't help. I was still seeing the
> same odd behaviour.

I got a crash when I flipped the inc and dec ;-)

> I tried doing a simple conversion to using a simple var like before,
> which looks like this:

<snip>

> This has corrected the issue:
>
> [ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> [ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> [ 0.000000] NODE_DATA(31) on node 1
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> [ 0.000000] Zone ranges:
> [ 0.000000] DMA [mem 0x00001000-0x00ffffff]
> [ 0.000000] DMA32 [mem 0x01000000-0xffffffff]
> [ 0.000000] Normal [mem 0x100000000-0x142fffffff]
> [ 0.000000] Movable zone start for each node
> [ 0.000000] Early memory node ranges
> [ timing is correct for the rest of the boot]
>
> At this point, I thought that there's something up with jump labels
> being used this early (?) and tried compiling with
> CONFIG_JUMP_LABELS=n, this didn't solve the issue.
>
> This makes me thing there's something different related to jumplabels
> we're missing, as the no-jumplabel config should be very similar to
> the patch I did above, I just can't figure out what it is.

Does this kvm thing have sched_clock_stable==true ever? My machine ends
up setting set_sched_clock_stable() very early indeed, in
early_init_intel().

I suspect what happens is that the way I wrote it; the jump_label is
true per boot, so sched_clock_stable() == true. My initial
set_sched_clock_stable() call then does nothing as the state is already
good.

KVM doesn't do this and runs with sched_clock_stable()==true for a
while, detects the TSC really isn't stable and calls
clear_sched_clock_stable() and you get this jump in time.

Now the next patch 'fixes' this patch by adding a workqueue to delay the
actual jump_label poke; that fixed a few wierd lockdep errors.

But the above is still obviously wrong for machines that do not call
set_sched_clock_stable() on boot.

Ho humm.

2014-01-22 11:59:53

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 11:45:32AM +0100, Peter Zijlstra wrote:
> Ho humm.

OK, so I had me a ponder; does the below fix things for you and David?
I've only done a boot test on real proper hardware :-)

---
kernel/sched/clock.c | 42 +++++++++++++++++++++++++++++++++---------
1 file changed, 33 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a6731b21..6bbcd97f4532 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,45 @@ __read_mostly int sched_clock_running;

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;

int sched_clock_stable(void)
{
- if (static_key_false(&__sched_clock_stable))
- return false;
- return true;
+ return static_key_false(&__sched_clock_stable);
}

void set_sched_clock_stable(void)
{
+ __sched_clock_stable_early = 1;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
if (!sched_clock_stable())
- static_key_slow_dec(&__sched_clock_stable);
+ static_key_slow_inc(&__sched_clock_stable);
}

static void __clear_sched_clock_stable(struct work_struct *work)
{
/* XXX worry about clock continuity */
if (sched_clock_stable())
- static_key_slow_inc(&__sched_clock_stable);
+ static_key_slow_dec(&__sched_clock_stable);
}

static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);

void clear_sched_clock_stable(void)
{
- if (keventd_up())
- schedule_work(&sched_clock_work);
- else
- __clear_sched_clock_stable(&sched_clock_work);
+ __sched_clock_stable_early = 0;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
+ schedule_work(&sched_clock_work);
}

struct sched_clock_data {
@@ -140,6 +150,20 @@ void sched_clock_init(void)
}

sched_clock_running = 1;
+
+ /*
+ * Ensure that it is impossible to not do a static_key update.
+ *
+ * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+ * and do the update, or we must see their __sched_clock_stable_early
+ * and do the update, or both.
+ */
+ smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+ if (__sched_clock_stable_early)
+ set_sched_clock_stable();
+ else
+ clear_sched_clock_stable();
}

/*

2014-01-22 12:00:54

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 11:45 +0100, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> > On 12/12/2013 09:08 AM, Peter Zijlstra wrote:
> >
> > This patch seems to be causing an issue with booting a KVM guest. It seems that it
> > causes the time to go random during early boot process:
> >
> > [ 0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> > [ 0.000000] NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> > [ 0.000000] NODE_DATA(30) on node 1
> > [ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> > [ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> > [ 0.000000] NODE_DATA(31) on node 1
> > [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> > [ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> > [133538.294040] Zone ranges:
> > [133538.294338] DMA [mem 0x00001000-0x00ffffff]
> > [133538.294804] DMA32 [mem 0x01000000-0xffffffff]
> > [133538.295223] Normal [mem 0x100000000-0x142fffffff]
> > [133538.295670] Movable zone start for each node
> >

FYI it happens on real hardware on my machine:
...
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] NR_IRQS:4352 nr_irqs:712 16
[ 0.000000] spurious 8259A interrupt: IRQ7.
[ 0.000000] Console: colour VGA+ 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.003333] tsc: Detected 3210.681 MHz processor
[ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
[ 60.375240] pid_max: default: 32768 minimum: 301
[ 60.375259] Mount-cache hash table entries: 256
[ 60.375373] tseg: 0000000000
[ 60.375377] CPU: Physical Processor ID: 0
[ 60.375377] CPU: Processor Core ID: 0
[ 60.375378] mce: CPU supports 6 MCE banks
[ 60.375382] LVT offset 0 assigned for vector 0xf9
[ 60.375384] process: using AMD E400 aware idle routine
[ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
...
--
Markus

2014-01-22 12:08:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> FYI it happens on real hardware on my machine:
> ...
> [ 0.000000] Hierarchical RCU implementation.
> [ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> [ 0.000000] spurious 8259A interrupt: IRQ7.
> [ 0.000000] Console: colour VGA+ 80x25
> [ 0.000000] console [tty0] enabled
> [ 0.000000] hpet clockevent registered
> [ 0.000000] tsc: Fast TSC calibration using PIT
> [ 0.003333] tsc: Detected 3210.681 MHz processor
> [ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> [ 60.375240] pid_max: default: 32768 minimum: 301
> [ 60.375259] Mount-cache hash table entries: 256
> [ 60.375373] tseg: 0000000000
> [ 60.375377] CPU: Physical Processor ID: 0
> [ 60.375377] CPU: Processor Core ID: 0
> [ 60.375378] mce: CPU supports 6 MCE banks
> [ 60.375382] LVT offset 0 assigned for vector 0xf9
> [ 60.375384] process: using AMD E400 aware idle routine
> [ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8

Should have always happened like that I think. From the log it looks
like the moment we switch from jiffies to actual TSC in
arch/x86/kernel/tsc.c:sched_clock().

I don't think I changed the logic there, just switched from a condition
to a jump_label.

2014-01-22 12:16:38

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 01:07:57PM +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > FYI it happens on real hardware on my machine:
> > ...
> > [ 0.000000] Hierarchical RCU implementation.
> > [ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> > [ 0.000000] spurious 8259A interrupt: IRQ7.
> > [ 0.000000] Console: colour VGA+ 80x25
> > [ 0.000000] console [tty0] enabled
> > [ 0.000000] hpet clockevent registered
> > [ 0.000000] tsc: Fast TSC calibration using PIT
> > [ 0.003333] tsc: Detected 3210.681 MHz processor
> > [ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > [ 60.375240] pid_max: default: 32768 minimum: 301
> > [ 60.375259] Mount-cache hash table entries: 256
> > [ 60.375373] tseg: 0000000000
> > [ 60.375377] CPU: Physical Processor ID: 0
> > [ 60.375377] CPU: Processor Core ID: 0
> > [ 60.375378] mce: CPU supports 6 MCE banks
> > [ 60.375382] LVT offset 0 assigned for vector 0xf9
> > [ 60.375384] process: using AMD E400 aware idle routine
> > [ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
>
> Should have always happened like that I think. From the log it looks
> like the moment we switch from jiffies to actual TSC in
> arch/x86/kernel/tsc.c:sched_clock().
>
> I don't think I changed the logic there, just switched from a condition
> to a jump_label.

On that, the bestest solution is to completely rip out the no_tsc
option, its daft. The few people still running chips without TSC can use
X86_TSC=n.

2014-01-22 12:26:15

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > FYI it happens on real hardware on my machine:
> > ...
> > [ 0.000000] Hierarchical RCU implementation.
> > [ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> > [ 0.000000] spurious 8259A interrupt: IRQ7.
> > [ 0.000000] Console: colour VGA+ 80x25
> > [ 0.000000] console [tty0] enabled
> > [ 0.000000] hpet clockevent registered
> > [ 0.000000] tsc: Fast TSC calibration using PIT
> > [ 0.003333] tsc: Detected 3210.681 MHz processor
> > [ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > [ 60.375240] pid_max: default: 32768 minimum: 301
> > [ 60.375259] Mount-cache hash table entries: 256
> > [ 60.375373] tseg: 0000000000
> > [ 60.375377] CPU: Physical Processor ID: 0
> > [ 60.375377] CPU: Processor Core ID: 0
> > [ 60.375378] mce: CPU supports 6 MCE banks
> > [ 60.375382] LVT offset 0 assigned for vector 0xf9
> > [ 60.375384] process: using AMD E400 aware idle routine
> > [ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
>
> Should have always happened like that I think. From the log it looks
> like the moment we switch from jiffies to actual TSC in
> arch/x86/kernel/tsc.c:sched_clock().
>
> I don't think I changed the logic there, just switched from a condition
> to a jump_label.

Well, v3.13 was fine. So it's definitely a regression. But it may be
another issue. I will try to bisect later.

--
Markus

2014-01-22 12:30:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > > FYI it happens on real hardware on my machine:
> > > ...
> > > [ 0.000000] Hierarchical RCU implementation.
> > > [ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> > > [ 0.000000] spurious 8259A interrupt: IRQ7.
> > > [ 0.000000] Console: colour VGA+ 80x25
> > > [ 0.000000] console [tty0] enabled
> > > [ 0.000000] hpet clockevent registered
> > > [ 0.000000] tsc: Fast TSC calibration using PIT
> > > [ 0.003333] tsc: Detected 3210.681 MHz processor
> > > [ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > > [ 60.375240] pid_max: default: 32768 minimum: 301
> > > [ 60.375259] Mount-cache hash table entries: 256
> > > [ 60.375373] tseg: 0000000000
> > > [ 60.375377] CPU: Physical Processor ID: 0
> > > [ 60.375377] CPU: Processor Core ID: 0
> > > [ 60.375378] mce: CPU supports 6 MCE banks
> > > [ 60.375382] LVT offset 0 assigned for vector 0xf9
> > > [ 60.375384] process: using AMD E400 aware idle routine
> > > [ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> >
> > Should have always happened like that I think. From the log it looks
> > like the moment we switch from jiffies to actual TSC in
> > arch/x86/kernel/tsc.c:sched_clock().
> >
> > I don't think I changed the logic there, just switched from a condition
> > to a jump_label.
>
> Well, v3.13 was fine. So it's definitely a regression. But it may be
> another issue. I will try to bisect later.

OK, weird, I'll see if I can spot anything.

2014-01-22 13:15:03

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> > > > FYI it happens on real hardware on my machine:
> > > > ...
> > > > [ 0.000000] Hierarchical RCU implementation.
> > > > [ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> > > > [ 0.000000] spurious 8259A interrupt: IRQ7.
> > > > [ 0.000000] Console: colour VGA+ 80x25
> > > > [ 0.000000] console [tty0] enabled
> > > > [ 0.000000] hpet clockevent registered
> > > > [ 0.000000] tsc: Fast TSC calibration using PIT
> > > > [ 0.003333] tsc: Detected 3210.681 MHz processor
> > > > [ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> > > > [ 60.375240] pid_max: default: 32768 minimum: 301
> > > > [ 60.375259] Mount-cache hash table entries: 256
> > > > [ 60.375373] tseg: 0000000000
> > > > [ 60.375377] CPU: Physical Processor ID: 0
> > > > [ 60.375377] CPU: Processor Core ID: 0
> > > > [ 60.375378] mce: CPU supports 6 MCE banks
> > > > [ 60.375382] LVT offset 0 assigned for vector 0xf9
> > > > [ 60.375384] process: using AMD E400 aware idle routine
> > > > [ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> > >
> > > Should have always happened like that I think. From the log it looks
> > > like the moment we switch from jiffies to actual TSC in
> > > arch/x86/kernel/tsc.c:sched_clock().
> > >
> > > I don't think I changed the logic there, just switched from a condition
> > > to a jump_label.
> >
> > Well, v3.13 was fine. So it's definitely a regression. But it may be
> > another issue. I will try to bisect later.
>
> OK, weird, I'll see if I can spot anything.

Unfortunately the issue is unbisectable (but the remaining commits are
all yours):

git bisect start
# bad: [df32e43a54d04eda35d2859beaf90e3864d53288] Merge branch 'akpm' (incoming from Andrew)
git bisect bad df32e43a54d04eda35d2859beaf90e3864d53288
# good: [d8ec26d7f8287f5788a494f56e8814210f0e64be] Linux 3.13
git bisect good d8ec26d7f8287f5788a494f56e8814210f0e64be
# bad: [de4fe30af1620b5117d65489621a5037913e7a92] Merge tag 'staging-3.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/staging
git bisect bad de4fe30af1620b5117d65489621a5037913e7a92
# good: [2783980525df12b9c49e8b4baaff06abc7f2f8f2] staging: comedi: usbduxsigma: removing unneccesay attached info
git bisect good 2783980525df12b9c49e8b4baaff06abc7f2f8f2
# good: [9326657abe1a83ed4b4f396b923ca1217fd50cba] Merge branch 'perf-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect good 9326657abe1a83ed4b4f396b923ca1217fd50cba
# bad: [82b51734b4f228c76b6064b6e899d9d3d4c17c1a] Merge tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux
git bisect bad 82b51734b4f228c76b6064b6e899d9d3d4c17c1a
# bad: [5d4863e4cc4dc12d1d5e42da3cb5d38c535e4ad6] Merge branch 'x86-cpu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad 5d4863e4cc4dc12d1d5e42da3cb5d38c535e4ad6
# bad: [a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8] Merge branch 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
git bisect bad a0fa1dd3cdbccec9597fe53b6177a9aa6e20f2f8
# good: [0aeeeebac8d8304501680f12862784341f4bee7b] sched: Remove unused parameter from nohz_balancer_kick()
git bisect good 0aeeeebac8d8304501680f12862784341f4bee7b
# bad: [c9c8986847d2f4fc474c10ee08afa57e7474096d] Merge branch 'x86/idle' into sched/core
git bisect bad c9c8986847d2f4fc474c10ee08afa57e7474096d
# skip: [20d1c86a57762f0a33a78988e3fc8818316badd4] sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
git bisect skip 20d1c86a57762f0a33a78988e3fc8818316badd4
# skip: [35af99e646c7f7ea46dc2977601e9e71a51dadd5] sched/clock, x86: Use a static_key for sched_clock_stable
git bisect skip 35af99e646c7f7ea46dc2977601e9e71a51dadd5
# good: [7d590cca7cd2cce4ed7c47d221d6f90566653ba8] x86, idle: Add memory barriers around clflush in mwait_play_dead()
git bisect good 7d590cca7cd2cce4ed7c47d221d6f90566653ba8
# skip: [57c67da274f3fab38e08d2c9edf08b89e1d9c71d] sched/clock, x86: Move some cyc2ns() code around
git bisect skip 57c67da274f3fab38e08d2c9edf08b89e1d9c71d
# good: [c726099ec224be8078d91072207053ff9a1ad6fc] sched: Factor out the on_null_domain() checks in trigger_load_balance()
git bisect good c726099ec224be8078d91072207053ff9a1ad6fc
# skip: [ef08f0fff87630d4f67ceb09514d8b444df833f8] sched/clock: Remove local_irq_disable() from the clocks
git bisect skip ef08f0fff87630d4f67ceb09514d8b444df833f8
# skip: [62b94a08da1bae9d187d49dfcd6665af393750f8] sched/preempt: Take away preempt_enable_no_resched() from modules
git bisect skip 62b94a08da1bae9d187d49dfcd6665af393750f8
# skip: [5dd12c2152743747ca9f50ef80281e54cc416dc0] sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock()
git bisect skip 5dd12c2152743747ca9f50ef80281e54cc416dc0
# skip: [10b033d434c25a6c9e0f4f4dc2418af1b8236c63] sched/clock, x86: Avoid a runtime condition in native_sched_clock()
git bisect skip 10b033d434c25a6c9e0f4f4dc2418af1b8236c63
# skip: [6577e42a3e1633afe762f47da9e00061ee4b9a5e] sched/clock: Fix up clear_sched_clock_stable()
git bisect skip 6577e42a3e1633afe762f47da9e00061ee4b9a5e
# bad: [0bd3a173d711857fc9f583eb5825386cc08f3948] sched/preempt, locking: Rework local_bh_{dis,en}able()
git bisect bad 0bd3a173d711857fc9f583eb5825386cc08f3948
# skip: [9ea4c380066fbe23fe0da7f4abfabc444f2467f4] locking: Optimize lock_bh functions
git bisect skip 9ea4c380066fbe23fe0da7f4abfabc444f2467f4
# only skipped commits left to test
# possible first bad commit: [0bd3a173d711857fc9f583eb5825386cc08f3948] sched/preempt, locking: Rework local_bh_{dis,en}able()
# possible first bad commit: [10b033d434c25a6c9e0f4f4dc2418af1b8236c63] sched/clock, x86: Avoid a runtime condition in native_sched_clock()
# possible first bad commit: [6577e42a3e1633afe762f47da9e00061ee4b9a5e] sched/clock: Fix up clear_sched_clock_stable()
# possible first bad commit: [35af99e646c7f7ea46dc2977601e9e71a51dadd5] sched/clock, x86: Use a static_key for sched_clock_stable
# possible first bad commit: [ef08f0fff87630d4f67ceb09514d8b444df833f8] sched/clock: Remove local_irq_disable() from the clocks
# possible first bad commit: [20d1c86a57762f0a33a78988e3fc8818316badd4] sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
# possible first bad commit: [57c67da274f3fab38e08d2c9edf08b89e1d9c71d] sched/clock, x86: Move some cyc2ns() code around
# possible first bad commit: [5dd12c2152743747ca9f50ef80281e54cc416dc0] sched/clock, x86: Use mul_u64_u32_shr() for native_sched_clock()
# possible first bad commit: [62b94a08da1bae9d187d49dfcd6665af393750f8] sched/preempt: Take away preempt_enable_no_resched() from modules
# possible first bad commit: [9ea4c380066fbe23fe0da7f4abfabc444f2467f4] locking: Optimize lock_bh functions

I've skipped the commits because of compile errors:

make[1]: Nothing to be done for 'all'.
CHK include/config/kernel.release
CHK include/generated/uapi/linux/version.h
make[1]: Nothing to be done for 'relocs'.
CHK include/generated/utsrelease.h
CC arch/x86/kernel/asm-offsets.s
In file included from include/linux/spinlock.h:271:0,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:4,
from include/linux/slab.h:14,
from include/linux/crypto.h:24,
from arch/x86/kernel/asm-offsets.c:8:
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_lock_bh’:
include/linux/spinlock_api_smp.h:134:2: error: implicit declaration of function ‘__local_bh_disable_ip’ [-Werror=implicit-function-declaration]
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/spinlock_api_smp.h:134:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/spinlock_api_smp.h:134:34: note: each undeclared identifier is reported only once for each function it appears in
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_unlock_bh’:
include/linux/spinlock_api_smp.h:176:2: error: implicit declaration of function ‘__local_bh_enable_ip’ [-Werror=implicit-function-declaration]
__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/spinlock_api_smp.h:176:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/spinlock_api_smp.h: In function ‘__raw_spin_trylock_bh’:
include/linux/spinlock_api_smp.h:181:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
In file included from include/linux/spinlock_api_smp.h:190:0,
from include/linux/spinlock.h:271,
from include/linux/mmzone.h:7,
from include/linux/gfp.h:4,
from include/linux/slab.h:14,
from include/linux/crypto.h:24,
from arch/x86/kernel/asm-offsets.c:8:
include/linux/rwlock_api_smp.h: In function ‘__raw_read_lock_bh’:
include/linux/rwlock_api_smp.h:175:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/rwlock_api_smp.h: In function ‘__raw_write_lock_bh’:
include/linux/rwlock_api_smp.h:202:34: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_disable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/rwlock_api_smp.h: In function ‘__raw_read_unlock_bh’:
include/linux/rwlock_api_smp.h:251:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
include/linux/rwlock_api_smp.h: In function ‘__raw_write_unlock_bh’:
include/linux/rwlock_api_smp.h:275:33: error: ‘SOFTIRQ_LOCK_OFFSET’ undeclared (first use in this function)
__local_bh_enable_ip(_RET_IP_, SOFTIRQ_LOCK_OFFSET);
^
cc1: some warnings being treated as errors
/usr/src/linux/./Kbuild:81: recipe for target 'arch/x86/kernel/asm-offsets.s' failed
make[1]: *** [arch/x86/kernel/asm-offsets.s] Error 1
Makefile:859: recipe for target 'prepare0' failed
make: *** [prepare0] Error 2

--
Markus

2014-01-22 14:27:40

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/22/2014 08:14 AM, Markus Trippelsdorf wrote:
> On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
>> >On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
>>> > >On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
>>>> > > >On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
>>>>> > > > >FYI it happens on real hardware on my machine:
>>>>> > > > >...
>>>>> > > > >[ 0.000000] Hierarchical RCU implementation.
>>>>> > > > >[ 0.000000] NR_IRQS:4352 nr_irqs:712 16
>>>>> > > > >[ 0.000000] spurious 8259A interrupt: IRQ7.
>>>>> > > > >[ 0.000000] Console: colour VGA+ 80x25
>>>>> > > > >[ 0.000000] console [tty0] enabled
>>>>> > > > >[ 0.000000] hpet clockevent registered
>>>>> > > > >[ 0.000000] tsc: Fast TSC calibration using PIT
>>>>> > > > >[ 0.003333] tsc: Detected 3210.681 MHz processor
>>>>> > > > >[ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
>>>>> > > > >[ 60.375240] pid_max: default: 32768 minimum: 301
>>>>> > > > >[ 60.375259] Mount-cache hash table entries: 256
>>>>> > > > >[ 60.375373] tseg: 0000000000
>>>>> > > > >[ 60.375377] CPU: Physical Processor ID: 0
>>>>> > > > >[ 60.375377] CPU: Processor Core ID: 0
>>>>> > > > >[ 60.375378] mce: CPU supports 6 MCE banks
>>>>> > > > >[ 60.375382] LVT offset 0 assigned for vector 0xf9
>>>>> > > > >[ 60.375384] process: using AMD E400 aware idle routine
>>>>> > > > >[ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
>>>> > > >
>>>> > > >Should have always happened like that I think. From the log it looks
>>>> > > >like the moment we switch from jiffies to actual TSC in
>>>> > > >arch/x86/kernel/tsc.c:sched_clock().
>>>> > > >
>>>> > > >I don't think I changed the logic there, just switched from a condition
>>>> > > >to a jump_label.
>>> > >
>>> > >Well, v3.13 was fine. So it's definitely a regression. But it may be
>>> > >another issue. I will try to bisect later.
>> >
>> >OK, weird, I'll see if I can spot anything.
> Unfortunately the issue is unbisectable (but the remaining commits are
> all yours):

I've actually bisected it previously by fixing the build errors manually, and that took me
to this patch you see in the subject line :)


Thanks,
Sasha

2014-01-22 17:15:34

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
> [ 0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
> [ 0.000000] NODE_DATA [mem 0xcfa42000-0xcfa72fff]
> [ 0.000000] NODE_DATA(30) on node 1
> [ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
> [ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
> [ 0.000000] NODE_DATA(31) on node 1
> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
> [ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
> [133538.294040] Zone ranges:
> [133538.294338] DMA [mem 0x00001000-0x00ffffff]
> [133538.294804] DMA32 [mem 0x01000000-0xffffffff]
> [133538.295223] Normal [mem 0x100000000-0x142fffffff]
> [133538.295670] Movable zone start for each node

OK, took me a while to fiddle with KVM and all the various muck around
that to reproduce. But I can confirm the below does fix the issue for
me.

I'm hoping to not have to re-introcude the kevents_up() check, but I
need to figure out what hardware triggered that and test again.

---
Subject: sched/clock: Fixup early sched_clock initialization
From: Peter Zijlstra <[email protected]>
Date: Wed, 22 Jan 2014 12:59:18 +0100

The code would assume sched_clock_stable() and switch to !stable
later, this switch brings a discontinuity in time.

The discontinuity on switching from stable to unstable was always
present, but previously we would set stable/unstable before
initializing TSC and usually stick to the one we start out with.

So the static_key bits brought an extra switch where there previously
wasn't one.

Things are further complicated by the fact that we cannot use
static_key as early as we usually call set_sched_clock_stable().

Fix things by tracking the stable state in a regular variable and only
set the static_key to the right state on sched_clock_init(), which is
ran right after late_time_init->tsc_init().

Before this we would not be using the TSC anyway.

Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
Cc: [email protected]
Cc: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Eliezer Tamir <[email protected]>
Cc: [email protected]
Reported-by: Sasha Levin <[email protected]>
Reported-by: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
---
kernel/sched/clock.c | 53 +++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 12 deletions(-)

--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,50 @@ __read_mostly int sched_clock_running;

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;

int sched_clock_stable(void)
{
- if (static_key_false(&__sched_clock_stable))
- return false;
- return true;
+ return static_key_false(&__sched_clock_stable);
}

-void set_sched_clock_stable(void)
+static void __set_sched_clock_stable(void)
{
if (!sched_clock_stable())
- static_key_slow_dec(&__sched_clock_stable);
+ static_key_slow_inc(&__sched_clock_stable);
+}
+
+void set_sched_clock_stable(void)
+{
+ __sched_clock_stable_early = 1;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
+ __set_sched_clock_stable();
}

static void __clear_sched_clock_stable(struct work_struct *work)
{
/* XXX worry about clock continuity */
if (sched_clock_stable())
- static_key_slow_inc(&__sched_clock_stable);
+ static_key_slow_dec(&__sched_clock_stable);
}

static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);

void clear_sched_clock_stable(void)
{
- if (keventd_up())
- schedule_work(&sched_clock_work);
- else
- __clear_sched_clock_stable(&sched_clock_work);
+ __sched_clock_stable_early = 0;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
+ schedule_work(&sched_clock_work);
}

struct sched_clock_data {
@@ -140,6 +155,20 @@ void sched_clock_init(void)
}

sched_clock_running = 1;
+
+ /*
+ * Ensure that it is impossible to not do a static_key update.
+ *
+ * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+ * and do the update, or we must see their __sched_clock_stable_early
+ * and do the update, or both.
+ */
+ smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+ if (__sched_clock_stable_early)
+ __set_sched_clock_stable();
+ else
+ __clear_sched_clock_stable(NULL);
}

/*
@@ -340,7 +369,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeu
*/
u64 cpu_clock(int cpu)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(cpu);

return sched_clock();
@@ -355,7 +384,7 @@ u64 cpu_clock(int cpu)
*/
u64 local_clock(void)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(raw_smp_processor_id());

return sched_clock();

2014-01-22 18:35:42

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 09:26 -0500, Sasha Levin wrote:
> On 01/22/2014 08:14 AM, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 13:30 +0100, Peter Zijlstra wrote:
> >> >On Wed, Jan 22, 2014 at 01:26:09PM +0100, Markus Trippelsdorf wrote:
> >>> > >On 2014.01.22 at 13:07 +0100, Peter Zijlstra wrote:
> >>>> > > >On Wed, Jan 22, 2014 at 01:00:48PM +0100, Markus Trippelsdorf wrote:
> >>>>> > > > >FYI it happens on real hardware on my machine:
> >>>>> > > > >...
> >>>>> > > > >[ 0.000000] Hierarchical RCU implementation.
> >>>>> > > > >[ 0.000000] NR_IRQS:4352 nr_irqs:712 16
> >>>>> > > > >[ 0.000000] spurious 8259A interrupt: IRQ7.
> >>>>> > > > >[ 0.000000] Console: colour VGA+ 80x25
> >>>>> > > > >[ 0.000000] console [tty0] enabled
> >>>>> > > > >[ 0.000000] hpet clockevent registered
> >>>>> > > > >[ 0.000000] tsc: Fast TSC calibration using PIT
> >>>>> > > > >[ 0.003333] tsc: Detected 3210.681 MHz processor
> >>>>> > > > >[ 60.375238] Calibrating delay loop (skipped), value calculated using timer frequency.. 6423.91 BogoMIPS (lpj=10702270)
> >>>>> > > > >[ 60.375240] pid_max: default: 32768 minimum: 301
> >>>>> > > > >[ 60.375259] Mount-cache hash table entries: 256
> >>>>> > > > >[ 60.375373] tseg: 0000000000
> >>>>> > > > >[ 60.375377] CPU: Physical Processor ID: 0
> >>>>> > > > >[ 60.375377] CPU: Processor Core ID: 0
> >>>>> > > > >[ 60.375378] mce: CPU supports 6 MCE banks
> >>>>> > > > >[ 60.375382] LVT offset 0 assigned for vector 0xf9
> >>>>> > > > >[ 60.375384] process: using AMD E400 aware idle routine
> >>>>> > > > >[ 60.375386] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
> >>>> > > >
> >>>> > > >Should have always happened like that I think. From the log it looks
> >>>> > > >like the moment we switch from jiffies to actual TSC in
> >>>> > > >arch/x86/kernel/tsc.c:sched_clock().
> >>>> > > >
> >>>> > > >I don't think I changed the logic there, just switched from a condition
> >>>> > > >to a jump_label.
> >>> > >
> >>> > >Well, v3.13 was fine. So it's definitely a regression. But it may be
> >>> > >another issue. I will try to bisect later.
> >> >
> >> >OK, weird, I'll see if I can spot anything.
> > Unfortunately the issue is unbisectable (but the remaining commits are
> > all yours):
>
> I've actually bisected it previously by fixing the build errors manually, and that took me
> to this patch you see in the subject line :)

But this is a different issue. I've bisected it to:

commit 20d1c86a57762f0a33a78988e3fc8818316badd4
Author: Peter Zijlstra <[email protected]>
Date: Fri Nov 29 15:40:29 2013 +0100

sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs

Reverting the commit "fixes" the issue:
...
[ 0.000000] Hierarchical RCU implementation.
[ 0.000000] NR_IRQS:4352 nr_irqs:712 16
[ 0.000000] spurious 8259A interrupt: IRQ7.
[ 0.000000] Console: colour VGA+ 80x25
[ 0.000000] console [tty0] enabled
[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.003333] tsc: Detected 3211.075 MHz processor
[ 0.000006] Calibrating delay loop (skipped), value calculated using timer frequency.. 6424.73 BogoMIPS (lpj=10703583)
[ 0.000007] pid_max: default: 32768 minimum: 301
[ 0.000026] Mount-cache hash table entries: 256
[ 0.000139] tseg: 0000000000
[ 0.000143] CPU: Physical Processor ID: 0
[ 0.000144] CPU: Processor Core ID: 0
[ 0.000145] mce: CPU supports 6 MCE banks
[ 0.000148] LVT offset 0 assigned for vector 0xf9
[ 0.000151] process: using AMD E400 aware idle routine
[ 0.000152] Last level iTLB entries: 4KB 512, 2MB 16, 4MB 8
...

--
Markus

2014-01-22 18:43:26

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:

> >FYI it happens on real hardware on my machine:

> >[ 60.375384] process: using AMD E400 aware idle routine

> But this is a different issue. I've bisected it to:
>
> commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> Author: Peter Zijlstra <[email protected]>
> Date: Fri Nov 29 15:40:29 2013 +0100
>
> sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
>
> Reverting the commit "fixes" the issue:

Hurm.. what kind of AMD machine is that?

2014-01-22 18:45:15

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
>
> > >FYI it happens on real hardware on my machine:
>
> > >[ 60.375384] process: using AMD E400 aware idle routine
>
> > But this is a different issue. I've bisected it to:
> >
> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > Author: Peter Zijlstra <[email protected]>
> > Date: Fri Nov 29 15:40:29 2013 +0100
> >
> > sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> >
> > Reverting the commit "fixes" the issue:
>
> Hurm.. what kind of AMD machine is that?

It's an ancient one:

processor : 0
vendor_id : AuthenticAMD
cpu family : 16
model : 4
model name : AMD Phenom(tm) II X4 955 Processor
stepping : 2
microcode : 0x10000db
cpu MHz : 800.000
cache size : 512 KB
physical id : 0
siblings : 4
core id : 0
cpu cores : 4
apicid : 0
initial apicid : 0
fpu : yes
fpu_exception : yes
cpuid level : 5
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3
dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
wdt hw_pstate npt lbrv svm_lock nrip_save
bogomips : 6424.73
TLB size : 1024 4K pages
clflush size : 64
cache_alignment : 64
address sizes : 48 bits physical, 48 bits virtual
power management: ts ttp tm stc 100mhzsteps hwpstate



--
Markus

2014-01-22 19:09:09

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
>
> > >FYI it happens on real hardware on my machine:
>
> > >[ 60.375384] process: using AMD E400 aware idle routine
>
> > But this is a different issue. I've bisected it to:
> >
> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > Author: Peter Zijlstra <[email protected]>
> > Date: Fri Nov 29 15:40:29 2013 +0100
> >
> > sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> >
> > Reverting the commit "fixes" the issue:
>
> Hurm..

Turns out the fix is simple:

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac2ee72..d90f7a11d573 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1198,8 +1198,8 @@ void __init tsc_init(void)
* up if their speed diverges)
*/
for_each_possible_cpu(cpu) {
- cyc2ns_init(cpu);
set_cyc2ns_scale(cpu_khz, cpu);
+ cyc2ns_init(cpu);
}

if (tsc_disabled > 0)

--
Markus

2014-01-22 19:12:58

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 20:09 +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> >
> > > >FYI it happens on real hardware on my machine:
> >
> > > >[ 60.375384] process: using AMD E400 aware idle routine
> >
> > > But this is a different issue. I've bisected it to:
> > >
> > > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > > Author: Peter Zijlstra <[email protected]>
> > > Date: Fri Nov 29 15:40:29 2013 +0100
> > >
> > > sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > >
> > > Reverting the commit "fixes" the issue:
> >
> > Hurm..
>
> Turns out the fix is simple:

No. It isn't. I've tested the wrong kernel. Please ignore.

Thanks.

--
Markus

2014-01-22 19:18:00

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 1:45 PM, Markus Trippelsdorf
<[email protected]> wrote:
> On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
>> On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
>>
>> > >FYI it happens on real hardware on my machine:
>>
>> > >[ 60.375384] process: using AMD E400 aware idle routine
>>
>> > But this is a different issue. I've bisected it to:
>> >
>> > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
>> > Author: Peter Zijlstra <[email protected]>
>> > Date: Fri Nov 29 15:40:29 2013 +0100
>> >
>> > sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
>> >
>> > Reverting the commit "fixes" the issue:
>>
>> Hurm.. what kind of AMD machine is that?
>
> It's an ancient one:
>
> processor : 0
> vendor_id : AuthenticAMD
> cpu family : 16
> model : 4
> model name : AMD Phenom(tm) II X4 955 Processor
> stepping : 2
> microcode : 0x10000db
> cpu MHz : 800.000
> cache size : 512 KB
> physical id : 0
> siblings : 4
> core id : 0
> cpu cores : 4
> apicid : 0
> initial apicid : 0
> fpu : yes
> fpu_exception : yes
> cpuid level : 5
> wp : yes
> flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm 3dnowext 3
> dnow constant_tsc rep_good nopl nonstop_tsc extd_apicid pni monitor cx16 popcnt lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit
> wdt hw_pstate npt lbrv svm_lock nrip_save
> bogomips : 6424.73
> TLB size : 1024 4K pages
> clflush size : 64
> cache_alignment : 64
> address sizes : 48 bits physical, 48 bits virtual
> power management: ts ttp tm stc 100mhzsteps hwpstate

I'm seeing the odd printk timestamp jump going from 0.xxx to 49.xx on
an i7-2600 so it isn't just old hardware.

[ 0.000000] kmemleak: Kernel memory leak detector disabled
[ 0.000000] ODEBUG: 32 of 32 active objects replaced
[ 0.000000] hpet clockevent registered
[ 0.000000] tsc: Fast TSC calibration using PIT
[ 0.001000] tsc: Detected 3392.351 MHz processor
[ 49.777360] Calibrating delay loop (skipped), value calculated
using timer frequency.. 6784.70 BogoMIPS (lpj=3392351)
[ 49.777363] pid_max: default: 32768 minimum: 301
[ 49.777572] Security Framework initialized


josh

2014-01-22 20:16:25

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 08:12:54PM +0100, Markus Trippelsdorf wrote:
> On 2014.01.22 at 20:09 +0100, Markus Trippelsdorf wrote:
> > On 2014.01.22 at 19:42 +0100, Peter Zijlstra wrote:
> > > On Wed, Jan 22, 2014 at 07:35:38PM +0100, Markus Trippelsdorf wrote:
> > >
> > > > >FYI it happens on real hardware on my machine:
> > >
> > > > >[ 60.375384] process: using AMD E400 aware idle routine
> > >
> > > > But this is a different issue. I've bisected it to:
> > > >
> > > > commit 20d1c86a57762f0a33a78988e3fc8818316badd4
> > > > Author: Peter Zijlstra <[email protected]>
> > > > Date: Fri Nov 29 15:40:29 2013 +0100
> > > >
> > > > sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs
> > > >
> > > > Reverting the commit "fixes" the issue:
> > >
> > > Hurm..
> >
> > Turns out the fix is simple:
>
> No. It isn't. I've tested the wrong kernel. Please ignore.

I think its the right region to look through. My current suspect is the
linear continuity fit with the initial 'random' multiplier.

That initial 'random' multiplier can get us quite high, and we'll fit
the function to match that but continue at a sane rate.

I'll try and prod a little more later this evening as time permits.

2014-01-22 21:08:39

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

>
> I think its the right region to look through. My current suspect is the
> linear continuity fit with the initial 'random' multiplier.
>
> That initial 'random' multiplier can get us quite high, and we'll fit
> the function to match that but continue at a sane rate.
>
> I'll try and prod a little more later this evening as time permits.

Does this cure things?

---
arch/x86/kernel/tsc.c | 11 +++++++----
1 file changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac2ee72..bb04148c5fe0 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
/* XXX surely we already have this someplace in the kernel?! */
#define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))

-static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
+static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
{
unsigned long long tsc_now, ns_now;
struct cyc2ns_data *data;
@@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
data = cyc2ns_write_begin(cpu);

rdtscll(tsc_now);
- ns_now = cycles_2_ns(tsc_now);
+ if (origin)
+ ns_now = 0;
+ else
+ ns_now = cycles_2_ns(tsc_now);

/*
* Compute a new multiplier as per the above comment and ensure our
@@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
mark_tsc_unstable("cpufreq changes");
}

- set_cyc2ns_scale(tsc_khz, freq->cpu);
+ set_cyc2ns_scale(tsc_khz, freq->cpu, false);

return 0;
}
@@ -1199,7 +1202,7 @@ void __init tsc_init(void)
*/
for_each_possible_cpu(cpu) {
cyc2ns_init(cpu);
- set_cyc2ns_scale(cpu_khz, cpu);
+ set_cyc2ns_scale(cpu_khz, cpu, true);
}

if (tsc_disabled > 0)

2014-01-22 21:17:44

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.22 at 22:08 +0100, Peter Zijlstra wrote:
> >
> > I think its the right region to look through. My current suspect is the
> > linear continuity fit with the initial 'random' multiplier.
> >
> > That initial 'random' multiplier can get us quite high, and we'll fit
> > the function to match that but continue at a sane rate.
> >
> > I'll try and prod a little more later this evening as time permits.
>
> Does this cure things?

Yes. Thanks Peter.

--
Markus

2014-01-22 22:32:28

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/22/2014 12:14 PM, Peter Zijlstra wrote:
> On Tue, Jan 21, 2014 at 05:28:37PM -0500, Sasha Levin wrote:
>> [ 0.000000] Initmem setup node 30 [mem 0x12ee000000-0x138dffffff]
>> [ 0.000000] NODE_DATA [mem 0xcfa42000-0xcfa72fff]
>> [ 0.000000] NODE_DATA(30) on node 1
>> [ 0.000000] Initmem setup node 31 [mem 0x138e000000-0x142fffffff]
>> [ 0.000000] NODE_DATA [mem 0xcfa11000-0xcfa41fff]
>> [ 0.000000] NODE_DATA(31) on node 1
>> [ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
>> [ 0.000000] kvm-clock: cpu 0, msr 0:cf991001, boot clock
>> [133538.294040] Zone ranges:
>> [133538.294338] DMA [mem 0x00001000-0x00ffffff]
>> [133538.294804] DMA32 [mem 0x01000000-0xffffffff]
>> [133538.295223] Normal [mem 0x100000000-0x142fffffff]
>> [133538.295670] Movable zone start for each node
>
> OK, took me a while to fiddle with KVM and all the various muck around
> that to reproduce. But I can confirm the below does fix the issue for
> me.
>
> I'm hoping to not have to re-introcude the kevents_up() check, but I
> need to figure out what hardware triggered that and test again.
>
> ---
> Subject: sched/clock: Fixup early sched_clock initialization
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 22 Jan 2014 12:59:18 +0100
>
> The code would assume sched_clock_stable() and switch to !stable
> later, this switch brings a discontinuity in time.
>
> The discontinuity on switching from stable to unstable was always
> present, but previously we would set stable/unstable before
> initializing TSC and usually stick to the one we start out with.
>
> So the static_key bits brought an extra switch where there previously
> wasn't one.
>
> Things are further complicated by the fact that we cannot use
> static_key as early as we usually call set_sched_clock_stable().
>
> Fix things by tracking the stable state in a regular variable and only
> set the static_key to the right state on sched_clock_init(), which is
> ran right after late_time_init->tsc_init().
>
> Before this we would not be using the TSC anyway.
>
> Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
> Cc: [email protected]
> Cc: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Thomas Gleixner <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Arjan van de Ven <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Eliezer Tamir <[email protected]>
> Cc: [email protected]
> Reported-by: Sasha Levin <[email protected]>
> Reported-by: [email protected]
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]

The patch fixes the problem for me.


Thanks,
Sasha

2014-01-22 23:53:24

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 4:08 PM, Peter Zijlstra <[email protected]> wrote:
>>
>> I think its the right region to look through. My current suspect is the
>> linear continuity fit with the initial 'random' multiplier.
>>
>> That initial 'random' multiplier can get us quite high, and we'll fit
>> the function to match that but continue at a sane rate.
>>
>> I'll try and prod a little more later this evening as time permits.
>
> Does this cure things?

This seems to correct the time jump I was seeing on my i7 box as well.

josh

>
> ---
> arch/x86/kernel/tsc.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a3acbac2ee72..bb04148c5fe0 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> /* XXX surely we already have this someplace in the kernel?! */
> #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
>
> -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
> {
> unsigned long long tsc_now, ns_now;
> struct cyc2ns_data *data;
> @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> data = cyc2ns_write_begin(cpu);
>
> rdtscll(tsc_now);
> - ns_now = cycles_2_ns(tsc_now);
> + if (origin)
> + ns_now = 0;
> + else
> + ns_now = cycles_2_ns(tsc_now);
>
> /*
> * Compute a new multiplier as per the above comment and ensure our
> @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
> mark_tsc_unstable("cpufreq changes");
> }
>
> - set_cyc2ns_scale(tsc_khz, freq->cpu);
> + set_cyc2ns_scale(tsc_khz, freq->cpu, false);
>
> return 0;
> }
> @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
> */
> for_each_possible_cpu(cpu) {
> cyc2ns_init(cpu);
> - set_cyc2ns_scale(cpu_khz, cpu);
> + set_cyc2ns_scale(cpu_khz, cpu, true);
> }
>
> if (tsc_disabled > 0)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

2014-01-23 01:53:19

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/22/14 at 12:59pm, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 11:45:32AM +0100, Peter Zijlstra wrote:
> > Ho humm.
>
> OK, so I had me a ponder; does the below fix things for you and David?
> I've only done a boot test on real proper hardware :-)
>
> ---
> kernel/sched/clock.c | 42 +++++++++++++++++++++++++++++++++---------
> 1 file changed, 33 insertions(+), 9 deletions(-)
>
> diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
> index 6bd6a6731b21..6bbcd97f4532 100644
> --- a/kernel/sched/clock.c
> +++ b/kernel/sched/clock.c
> @@ -77,35 +77,45 @@ __read_mostly int sched_clock_running;
>
> #ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
> static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
> +static int __sched_clock_stable_early;
>
> int sched_clock_stable(void)
> {
> - if (static_key_false(&__sched_clock_stable))
> - return false;
> - return true;
> + return static_key_false(&__sched_clock_stable);
> }
>
> void set_sched_clock_stable(void)
> {
> + __sched_clock_stable_early = 1;
> +
> + smp_mb(); /* matches sched_clock_init() */
> +
> + if (!sched_clock_running)
> + return;
> +
> if (!sched_clock_stable())
> - static_key_slow_dec(&__sched_clock_stable);
> + static_key_slow_inc(&__sched_clock_stable);
> }
>
> static void __clear_sched_clock_stable(struct work_struct *work)
> {
> /* XXX worry about clock continuity */
> if (sched_clock_stable())
> - static_key_slow_inc(&__sched_clock_stable);
> + static_key_slow_dec(&__sched_clock_stable);
> }
>
> static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);
>
> void clear_sched_clock_stable(void)
> {
> - if (keventd_up())
> - schedule_work(&sched_clock_work);
> - else
> - __clear_sched_clock_stable(&sched_clock_work);
> + __sched_clock_stable_early = 0;
> +
> + smp_mb(); /* matches sched_clock_init() */
> +
> + if (!sched_clock_running)
> + return;
> +
> + schedule_work(&sched_clock_work);
> }
>
> struct sched_clock_data {
> @@ -140,6 +150,20 @@ void sched_clock_init(void)
> }
>
> sched_clock_running = 1;
> +
> + /*
> + * Ensure that it is impossible to not do a static_key update.
> + *
> + * Either {set,clear}_sched_clock_stable() must see sched_clock_running
> + * and do the update, or we must see their __sched_clock_stable_early
> + * and do the update, or both.
> + */
> + smp_mb(); /* matches {set,clear}_sched_clock_stable() */
> +
> + if (__sched_clock_stable_early)
> + set_sched_clock_stable();
> + else
> + clear_sched_clock_stable();
> }
>
> /*

It does not fix the prink time issue, here is the log:
[ 0.000000] efi: mem26: type=6, attr=0x800000000000000f, range=[0x000000000dbe0000-0x000000000dc00000) (0MB)
[ 0.000000] DMI not present or invalid.
[ 0.000000] Hypervisor detected: KVM
[ 0.000000] e820: last_pfn = 0xdbe0 max_arch_pfn = 0x400000000
[ 0.000000] PAT not supported by CPU.
[ 0.000000] init_memory_mapping: [mem 0x00000000-0x000fffff]
[ 0.000000] init_memory_mapping: [mem 0x0aa00000-0x0abfffff]
[ 0.000000] init_memory_mapping: [mem 0x08000000-0x0a9fffff]
[ 0.000000] init_memory_mapping: [mem 0x00100000-0x07ffffff]
[ 0.000000] init_memory_mapping: [mem 0x0ac00000-0x0bd93fff]
[ 0.000000] init_memory_mapping: [mem 0x0bdc1000-0x0d580fff]
[ 0.000000] init_memory_mapping: [mem 0x0d5e5000-0x0dbdffff]
[ 0.000000] RAMDISK: [mem 0x0ac0e000-0x0b583fff]
[ 0.000000] ACPI: RSDP 000000000d5e0014 000024 (v02 OVMF )
[ 0.000000] ACPI: XSDT 000000000d5df0e8 00003C (v01 OVMF OVMFEDK2 20130221 01000013)
[ 0.000000] ACPI: FACP 000000000d5de000 0000F4 (v03 OVMF OVMFEDK2 20130221 OVMF 00000099)
[ 0.000000] ACPI: DSDT 000000000d5dc000 000D57 (v01 INTEL OVMF 00000004 INTL 20120913)
[ 0.000000] ACPI: FACS 000000000d5e4000 000040
[ 0.000000] ACPI: APIC 000000000d5dd000 000078 (v01 OVMF OVMFEDK2 20130221 OVMF 00000099)
[ 0.000000] ACPI: SSDT 000000000d5db000 000057 (v01 REDHAT OVMF 00000001 INTL 20120913)
[ 0.000000] crashkernel reservation failed - No suitable area found.
[ 0.000000] kvm-clock: Using msrs 4b564d01 and 4b564d00
[ 0.000000] kvm-clock: cpu 0, msr 0:d401001, boot clock
[65465.267798] Zone ranges:
[65465.268914] DMA [mem 0x00001000-0x00ffffff]
[65465.271107] DMA32 [mem 0x01000000-0xffffffff]
[65465.273348] Normal empty
[65465.274683] Movable zone start for each node
[65465.276646] Early memory node ranges
[65465.278321] node 0: [mem 0x00001000-0x0009ffff]
[65465.280572] node 0: [mem 0x00100000-0x0bd93fff]
[65465.282825] node 0: [mem 0x0bdc1000-0x0d580fff]
[65465.285084] node 0: [mem 0x0d5e5000-0x0dbdffff]
[65465.289251] ACPI: PM-Timer IO Port: 0xb008
[65465.291105] ACPI: LAPIC (acpi_id[0x00] lapic_id[0x00] enabled)
[65465.293766] ACPI: LAPIC_NMI (acpi_id[0xff] dfl dfl lint[0x1])
[65465.296413] ACPI: IOAPIC (id[0x01] address[0xfec00000] gsi_base[0])
[65465.299460] IOAPIC[0]: apic_id 1, version 17, address 0xfec00000, GSI 0-23
[65465.302607] ACPI: INT_SRC_OVR (bus 0 bus_irq 0 global_irq 2 dfl dfl)
[65465.305524] ACPI: INT_SRC_OVR (bus 0 bus_irq 5 global_irq 5 high level)
[65465.308622] ACPI: INT_SRC_OVR (bus 0 bus_irq 9 global_irq 9 high level)
[65465.311685] ACPI: INT_SRC_OVR (bus 0 bus_irq 10 global_irq 10 high level)
[65465.314766] ACPI: INT_SRC_OVR (bus 0 bus_irq 11 global_irq 11 high level)
[65465.317792] Using ACPI (MADT) for SMP configuration information
[65465.320608] smpboot: Allowing 1 CPUs, 0 hotplug CPUs
[65465.322958] PM: Registered nosave memory: [mem 0x000a0000-0x000fffff]
[65465.325861] PM: Registered nosave memory: [mem 0x0bd94000-0x0bdc0fff]
[65465.328809] PM: Registered nosave memory: [mem 0x0d581000-0x0d5d8fff]
[65465.331770] PM: Registered nosave memory: [mem 0x0d5d9000-0x0d5e0fff]
[65465.334716] PM: Registered nosave memory: [mem 0x0d5e1000-0x0d5e4fff]
[65465.337723] e820: [mem 0x0dc00000-0xffffffff] available for PCI devices
[65465.340880] Booting paravirtualized kernel on KVM
[65465.343045] setup_percpu: NR_CPUS:16 nr_cpumask_bits:16 nr_cpu_ids:1 nr_node_ids:1
[65465.346736] PERCPU: Embedded 28 pages/cpu @ffff88000a800000 s83392 r8192 d23104 u2097152
[65465.350469] kvm-clock: cpu 0, msr 0:d401001, primary cpu clock
[65465.353143] KVM setup async PF for cpu 0
[65465.354969] kvm-stealtime: cpu 0, msr a80dfc0
[65465.357124] Built 1 zonelists in Zone order, mobility grouping on. Total pages: 53096
[65465.360905] Kernel command line: root=UUID=4522081c-614f-43ba-927b-1ef26d69fe20 ro console=ttyS0 earlyprintk=serial,ttyS0 nomodeset selinux=0 crashkernel=128M
[65465.367711] PID hash table entries: 1024 (order: 1, 8192 bytes)
[65465.370534] Dentry cache hash table entries: 32768 (order: 6, 262144 bytes)
[65465.373903] Inode-cache hash table entries: 16384 (order: 5, 131072 bytes)
[65465.377467] Memory: 144368K/224184K available (4748K kernel code, 788K rwdata, 2376K rodata, 888K init, 8968K bss, 79816K reserved)
[65465.382968] SLUB: HWalign=64, Order=0-3, MinObjects=0, CPUs=1, Nodes=1
[65465.386095] Preemptible hierarchical RCU implementation.
[65465.388602] RCU debugfs-based tracing is enabled.
[65465.390851] CONFIG_RCU_FANOUT set to non-default value of 32
[65465.393569] RCU dyntick-idle grace-period acceleration is enabled.
[65465.396494] RCU restricting CPUs from NR_CPUS=16 to nr_cpu_ids=1.
[65465.399422] Offload RCU callbacks from all CPUs
[65465.401594] Offload RCU callbacks from CPUs: 0.
[65465.403781] RCU: Adjusting geometry for rcu_fanout_leaf=16, nr_cpu_ids=1
[65465.406906] NO_HZ: Full dynticks CPUs: 1-15.
[65465.408963] NR_IRQS:4352 nr_irqs:256 16
[65465.411104] Console: colour dummy device 80x25
[65465.413229] console [ttyS0] enabled
[65465.413229] console [ttyS0] enabled
[65465.416579] bootconsole [earlyser0] disabled
[65465.416579] bootconsole [earlyser0] disabled
[65465.420729] Lock dependency validator: Copyright (c) 2006 Red Hat, Inc., Ingo Molnar
[65465.424454] ... MAX_LOCKDEP_SUBCLASSES: 8
[65465.426418] ... MAX_LOCK_DEPTH: 48
[65465.428422] ... MAX_LOCKDEP_KEYS: 8191
[65465.430509] ... CLASSHASH_SIZE: 4096
[65465.432576] ... MAX_LOCKDEP_ENTRIES: 16384
[65465.434717] ... MAX_LOCKDEP_CHAINS: 32768
[65465.436858] ... CHAINHASH_SIZE: 16384
[65465.438991] memory used by lock dependency info: 5855 kB
[65465.441630] per task-struct memory footprint: 1920 bytes
[65465.444477] tsc: Detected 2793.268 MHz processor
[65465.446663] BUG: unable to handle kernel NULL pointer dereference at 0000000000000182
[65465.450471] IP: [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.453195] PGD 0
[65465.454270] Oops: 0000 [#1] PREEMPT SMP
[65465.456286] Modules linked in:
[65465.457815] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.13.0+ #15
[65465.460610] task: ffffffff81719490 ti: ffffffff816f8000 task.ti: ffffffff816f8000
[65465.464099] RIP: 0010:[<ffffffff81074023>] [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.467932] RSP: 0000:ffffffff816f9eb8 EFLAGS: 00010046
[65465.470418] RAX: 0000000000000006 RBX: 0000000000000292 RCX: 0000000000000030
[65465.473790] RDX: ffffffff817328e0 RSI: 0000000000000000 RDI: 0000000000000010
[65465.477159] RBP: ffffffff816f9ee8 R08: ffffffff817b6ac8 R09: 00000000ffffffff
[65465.480531] R10: 00000000fffea071 R11: 0000000225c17d03 R12: 0000000000000000
[65465.483927] R13: ffffffff817328e0 R14: ffffffff81857ac0 R15: 000000000b584000
[65465.487348] FS: 0000000000000000(0000) GS:ffff88000a800000(0000) knlGS:0000000000000000
[65465.491138] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[65465.493900] CR2: 0000000000000182 CR3: 0000000002714000 CR4: 00000000000006b0
[65465.497214] Stack:
[65465.498208] 0000001081857ac0 0000000000000292 000000000032dcd5 ffff88000b585680
[65465.501853] ffffffff81857ac0 000000000b584000 ffffffff816f9f20 ffffffff8107420f
[65465.505492] 00000010816f9f30 0000000000000000 ffffffff817328e0 0000000000014280
[65465.509180] Call Trace:
[65465.510385] [<ffffffff8107420f>] queue_work_on+0x43/0x7c
[65465.512932] [<ffffffff810868a5>] clear_sched_clock_stable+0x32/0x34
[65465.515985] [<ffffffff81086921>] sched_clock_init+0x7a/0x7f
[65465.518696] [<ffffffff817d4cd8>] start_kernel+0x351/0x3fa
[65465.521367] [<ffffffff817d4795>] ? repair_env_string+0x58/0x58
[65465.524159] [<ffffffff817d4120>] ? early_idt_handlers+0x120/0x120
[65465.527044] [<ffffffff817d4498>] x86_64_start_reservations+0x2a/0x2c
[65465.530038] [<ffffffff817d458d>] x86_64_start_kernel+0xf3/0x100
[65465.532832] Code: 25 30 d2 72 81 f6 c4 02 74 21 80 3d 91 22 73 00 00 75 18 be 31 05 00 00 48 c7 c7 ca 0b 64 81 e8 80 c9 fe ff c6 05 77 22 73 00 01 <41> f6 84 24 82 01 00 00 01 74 59 65 48 8b 3c 25 c0 c9 00 00 f6
[65465.544401] RIP [<ffffffff81074023>] __queue_work+0x45/0x1ee
[65465.547201] RSP <ffffffff816f9eb8>
[65465.548880] CR2: 0000000000000182
[65465.550462] ---[ end trace 8bf023a4e6e5d79e ]---
[65465.552655] Kernel panic - not syncing: Attempted to kill the idle task!

Thanks
Dave

2014-01-23 01:54:00

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/22/14 at 10:08pm, Peter Zijlstra wrote:
> >
> > I think its the right region to look through. My current suspect is the
> > linear continuity fit with the initial 'random' multiplier.
> >
> > That initial 'random' multiplier can get us quite high, and we'll fit
> > the function to match that but continue at a sane rate.
> >
> > I'll try and prod a little more later this evening as time permits.
>
> Does this cure things?

Peter, the odd timstamp still happens with this patch for me.

>
> ---
> arch/x86/kernel/tsc.c | 11 +++++++----
> 1 file changed, 7 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> index a3acbac2ee72..bb04148c5fe0 100644
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> /* XXX surely we already have this someplace in the kernel?! */
> #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
>
> -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
> {
> unsigned long long tsc_now, ns_now;
> struct cyc2ns_data *data;
> @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> data = cyc2ns_write_begin(cpu);
>
> rdtscll(tsc_now);
> - ns_now = cycles_2_ns(tsc_now);
> + if (origin)
> + ns_now = 0;
> + else
> + ns_now = cycles_2_ns(tsc_now);
>
> /*
> * Compute a new multiplier as per the above comment and ensure our
> @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
> mark_tsc_unstable("cpufreq changes");
> }
>
> - set_cyc2ns_scale(tsc_khz, freq->cpu);
> + set_cyc2ns_scale(tsc_khz, freq->cpu, false);
>
> return 0;
> }
> @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
> */
> for_each_possible_cpu(cpu) {
> cyc2ns_init(cpu);
> - set_cyc2ns_scale(cpu_khz, cpu);
> + set_cyc2ns_scale(cpu_khz, cpu, true);
> }
>
> if (tsc_disabled > 0)

2014-01-23 02:10:38

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/23/14 at 09:53am, Dave Young wrote:
> On 01/22/14 at 10:08pm, Peter Zijlstra wrote:
> > >
> > > I think its the right region to look through. My current suspect is the
> > > linear continuity fit with the initial 'random' multiplier.
> > >
> > > That initial 'random' multiplier can get us quite high, and we'll fit
> > > the function to match that but continue at a sane rate.
> > >
> > > I'll try and prod a little more later this evening as time permits.
> >
> > Does this cure things?
>
> Peter, the odd timstamp still happens with this patch for me.

Hmm, seems the my physical machine is booting fine with this patch. kvm
guest problem still exist, but that kvm thing might be other problem.

>
> >
> > ---
> > arch/x86/kernel/tsc.c | 11 +++++++----
> > 1 file changed, 7 insertions(+), 4 deletions(-)
> >
> > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
> > index a3acbac2ee72..bb04148c5fe0 100644
> > --- a/arch/x86/kernel/tsc.c
> > +++ b/arch/x86/kernel/tsc.c
> > @@ -237,7 +237,7 @@ static inline unsigned long long cycles_2_ns(unsigned long long cyc)
> > /* XXX surely we already have this someplace in the kernel?! */
> > #define DIV_ROUND(n, d) (((n) + ((d) / 2)) / (d))
> >
> > -static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > +static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu, bool origin)
> > {
> > unsigned long long tsc_now, ns_now;
> > struct cyc2ns_data *data;
> > @@ -252,7 +252,10 @@ static void set_cyc2ns_scale(unsigned long cpu_khz, int cpu)
> > data = cyc2ns_write_begin(cpu);
> >
> > rdtscll(tsc_now);
> > - ns_now = cycles_2_ns(tsc_now);
> > + if (origin)
> > + ns_now = 0;
> > + else
> > + ns_now = cycles_2_ns(tsc_now);
> >
> > /*
> > * Compute a new multiplier as per the above comment and ensure our
> > @@ -926,7 +929,7 @@ static int time_cpufreq_notifier(struct notifier_block *nb, unsigned long val,
> > mark_tsc_unstable("cpufreq changes");
> > }
> >
> > - set_cyc2ns_scale(tsc_khz, freq->cpu);
> > + set_cyc2ns_scale(tsc_khz, freq->cpu, false);
> >
> > return 0;
> > }
> > @@ -1199,7 +1202,7 @@ void __init tsc_init(void)
> > */
> > for_each_possible_cpu(cpu) {
> > cyc2ns_init(cpu);
> > - set_cyc2ns_scale(cpu_khz, cpu);
> > + set_cyc2ns_scale(cpu_khz, cpu, true);
> > }
> >
> > if (tsc_disabled > 0)

2014-01-23 09:48:47

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
> Yes. Thanks Peter.
>

Ah much simpler patch that should have the same effect:

---
Subject: sched/x86/tsc: Initialize multiplier to 0
From: Peter Zijlstra <[email protected]>
Date: Wed, 22 Jan 2014 22:08:14 +0100

Since we keep the clock value linearly continuous on frequency change,
make sure the initial multiplier is 0, such that out initial value is
0. Without this we compute the initial value at whatever the TSC has
managed to reach since power-on.

Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
Cc: [email protected]
Cc: [email protected]
Cc: Eliezer Tamir <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Mike Galbraith <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Thomas Gleixner <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: [email protected]
Reported-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
---
arch/x86/kernel/tsc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, st

static void cyc2ns_data_init(struct cyc2ns_data *data)
{
- data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
+ data->cyc2ns_mul = 0;
data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
data->cyc2ns_offset = 0;
data->__count = 0;

2014-01-23 10:01:06

by Markus Trippelsdorf

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 2014.01.23 at 10:48 +0100, Peter Zijlstra wrote:
> On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
>
> Ah much simpler patch that should have the same effect:

Yes. FWIW it also seems to be fine.

--
Markus

2014-01-23 10:04:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Thu, Jan 23, 2014 at 11:01:00AM +0100, Markus Trippelsdorf wrote:
> On 2014.01.23 at 10:48 +0100, Peter Zijlstra wrote:
> > On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
> >
> > Ah much simpler patch that should have the same effect:
>
> Yes. FWIW it also seems to be fine.

Thanks.. the wonders of sleep eh :-)

2014-01-23 13:32:37

by Josh Boyer

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Thu, Jan 23, 2014 at 4:48 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Jan 22, 2014 at 10:17:40PM +0100, Markus Trippelsdorf wrote:
>> Yes. Thanks Peter.
>>
>
> Ah much simpler patch that should have the same effect:

This fixes the issue on my baremetal i7 machine as well.

josh

> ---
> Subject: sched/x86/tsc: Initialize multiplier to 0
> From: Peter Zijlstra <[email protected]>
> Date: Wed, 22 Jan 2014 22:08:14 +0100
>
> Since we keep the clock value linearly continuous on frequency change,
> make sure the initial multiplier is 0, such that out initial value is
> 0. Without this we compute the initial value at whatever the TSC has
> managed to reach since power-on.
>
> Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
> Cc: [email protected]
> Cc: [email protected]
> Cc: Eliezer Tamir <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Mike Galbraith <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: Thomas Gleixner <[email protected]>
> Cc: John Stultz <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Arjan van de Ven <[email protected]>
> Cc: Sasha Levin <[email protected]>
> Cc: [email protected]
> Reported-by: Markus Trippelsdorf <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> arch/x86/kernel/tsc.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> --- a/arch/x86/kernel/tsc.c
> +++ b/arch/x86/kernel/tsc.c
> @@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, st
>
> static void cyc2ns_data_init(struct cyc2ns_data *data)
> {
> - data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
> + data->cyc2ns_mul = 0;
> data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
> data->cyc2ns_offset = 0;
> data->__count = 0;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

Subject: [tip:sched/urgent] sched/clock: Fixup early initialization

Commit-ID: d375b4e0fa3771343b370be0d876a1963c02e0a0
Gitweb: http://git.kernel.org/tip/d375b4e0fa3771343b370be0d876a1963c02e0a0
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 22 Jan 2014 12:59:18 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 23 Jan 2014 14:48:36 +0100

sched/clock: Fixup early initialization

The code would assume sched_clock_stable() and switch to !stable
later, this switch brings a discontinuity in time.

The discontinuity on switching from stable to unstable was always
present, but previously we would set stable/unstable before
initializing TSC and usually stick to the one we start out with.

So the static_key bits brought an extra switch where there previously
wasn't one.

Things are further complicated by the fact that we cannot use
static_key as early as we usually call set_sched_clock_stable().

Fix things by tracking the stable state in a regular variable and only
set the static_key to the right state on sched_clock_init(), which is
ran right after late_time_init->tsc_init().

Before this we would not be using the TSC anyway.

Reported-and-Tested-by: Sasha Levin <[email protected]>
Reported-by: [email protected]
Fixes: 35af99e646c7 ("sched/clock, x86: Use a static_key for sched_clock_stable")
Cc: [email protected]
Cc: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: John Stultz <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Eliezer Tamir <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched/clock.c | 53 ++++++++++++++++++++++++++++++++++++++++------------
1 file changed, 41 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/clock.c b/kernel/sched/clock.c
index 6bd6a67..43c2bcc 100644
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -77,35 +77,50 @@ __read_mostly int sched_clock_running;

#ifdef CONFIG_HAVE_UNSTABLE_SCHED_CLOCK
static struct static_key __sched_clock_stable = STATIC_KEY_INIT;
+static int __sched_clock_stable_early;

int sched_clock_stable(void)
{
- if (static_key_false(&__sched_clock_stable))
- return false;
- return true;
+ return static_key_false(&__sched_clock_stable);
}

-void set_sched_clock_stable(void)
+static void __set_sched_clock_stable(void)
{
if (!sched_clock_stable())
- static_key_slow_dec(&__sched_clock_stable);
+ static_key_slow_inc(&__sched_clock_stable);
+}
+
+void set_sched_clock_stable(void)
+{
+ __sched_clock_stable_early = 1;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
+ __set_sched_clock_stable();
}

static void __clear_sched_clock_stable(struct work_struct *work)
{
/* XXX worry about clock continuity */
if (sched_clock_stable())
- static_key_slow_inc(&__sched_clock_stable);
+ static_key_slow_dec(&__sched_clock_stable);
}

static DECLARE_WORK(sched_clock_work, __clear_sched_clock_stable);

void clear_sched_clock_stable(void)
{
- if (keventd_up())
- schedule_work(&sched_clock_work);
- else
- __clear_sched_clock_stable(&sched_clock_work);
+ __sched_clock_stable_early = 0;
+
+ smp_mb(); /* matches sched_clock_init() */
+
+ if (!sched_clock_running)
+ return;
+
+ schedule_work(&sched_clock_work);
}

struct sched_clock_data {
@@ -140,6 +155,20 @@ void sched_clock_init(void)
}

sched_clock_running = 1;
+
+ /*
+ * Ensure that it is impossible to not do a static_key update.
+ *
+ * Either {set,clear}_sched_clock_stable() must see sched_clock_running
+ * and do the update, or we must see their __sched_clock_stable_early
+ * and do the update, or both.
+ */
+ smp_mb(); /* matches {set,clear}_sched_clock_stable() */
+
+ if (__sched_clock_stable_early)
+ __set_sched_clock_stable();
+ else
+ __clear_sched_clock_stable(NULL);
}

/*
@@ -340,7 +369,7 @@ EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
*/
u64 cpu_clock(int cpu)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(cpu);

return sched_clock();
@@ -355,7 +384,7 @@ u64 cpu_clock(int cpu)
*/
u64 local_clock(void)
{
- if (static_key_false(&__sched_clock_stable))
+ if (!sched_clock_stable())
return sched_clock_cpu(raw_smp_processor_id());

return sched_clock();

Subject: [tip:sched/urgent] sched/x86/tsc: Initialize multiplier to 0

Commit-ID: 5e3c1afd4587e70c201bf7224b51f747c9a3dfa8
Gitweb: http://git.kernel.org/tip/5e3c1afd4587e70c201bf7224b51f747c9a3dfa8
Author: Peter Zijlstra <[email protected]>
AuthorDate: Wed, 22 Jan 2014 22:08:14 +0100
Committer: Ingo Molnar <[email protected]>
CommitDate: Thu, 23 Jan 2014 14:48:36 +0100

sched/x86/tsc: Initialize multiplier to 0

Since we keep the clock value linearly continuous on frequency change,
make sure the initial multiplier is 0, such that our initial value is 0.
Without this we compute the initial value at whatever the TSC has
managed to reach since power-on.

Reported-and-Tested-by: Markus Trippelsdorf <[email protected]>
Fixes: 20d1c86a57762 ("sched/clock, x86: Rewrite cyc2ns() to avoid the need to disable IRQs")
Cc: [email protected]
Cc: [email protected]
Cc: Eliezer Tamir <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: Mike Galbraith <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: John Stultz <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Arjan van de Ven <[email protected]>
Cc: Sasha Levin <[email protected]>
Cc: [email protected]
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/tsc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index a3acbac..19e5adb 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -180,7 +180,7 @@ static void cyc2ns_write_end(int cpu, struct cyc2ns_data *data)

static void cyc2ns_data_init(struct cyc2ns_data *data)
{
- data->cyc2ns_mul = 1U << CYC2NS_SCALE_FACTOR;
+ data->cyc2ns_mul = 0;
data->cyc2ns_shift = CYC2NS_SCALE_FACTOR;
data->cyc2ns_offset = 0;
data->__count = 0;

2014-01-23 16:57:32

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> Hmm, seems the my physical machine is booting fine with this patch. kvm
> guest problem still exist, but that kvm thing might be other problem.

Dave, could you try with tip/master, Ingo pushed out all the fixes
gathered.

Lacking that, could you perhaps provide the .config you use for your
guest and a kvm invocation?

I used a defconfig with KVM_GUEST=y, which was enough to enable kvmclock
support and then started with:

kvm -kernel defconfig-build/arch/x86/boot/bzImage -nographic -append "console=ttyS0" -smp 2

2014-01-24 03:16:13

by Dave Young

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable

On 01/23/14 at 05:56pm, Peter Zijlstra wrote:
> On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> > Hmm, seems the my physical machine is booting fine with this patch. kvm
> > guest problem still exist, but that kvm thing might be other problem.
>
> Dave, could you try with tip/master, Ingo pushed out all the fixes
> gathered.

tip/master works fine for me.

>
> Lacking that, could you perhaps provide the .config you use for your
> guest and a kvm invocation?
>
> I used a defconfig with KVM_GUEST=y, which was enough to enable kvmclock
> support and then started with:
>
> kvm -kernel defconfig-build/arch/x86/boot/bzImage -nographic -append "console=ttyS0" -smp 2

I use qemu-system-x86_64 with efi ovmf firmware, I think the problem
has been resolved in tip/master, if you still want the .config please
let me know.

Thanks
Dave

2014-01-24 07:58:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH 13/15] sched: Use a static_key for sched_clock_stable


* Dave Young <[email protected]> wrote:

> On 01/23/14 at 05:56pm, Peter Zijlstra wrote:
> > On Thu, Jan 23, 2014 at 10:10:28AM +0800, Dave Young wrote:
> > > Hmm, seems the my physical machine is booting fine with this patch. kvm
> > > guest problem still exist, but that kvm thing might be other problem.
> >
> > Dave, could you try with tip/master, Ingo pushed out all the fixes
> > gathered.
>
> tip/master works fine for me.

Ok, thanks for testing - I'll send those bits to Linus later today.

Thanks,

Ingo