2011-05-30 17:39:19

by Markus Trippelsdorf

Subject: Very high CPU values in top on idle system (3.0-rc1)

I get very high CPU values when I run 'top' on my (mostly) idle system
with 3.0-rc1. For example mpd was always in the 1-2% range and is now
constantly over 50%.

This is caused by:

commit 317f394160e9beb97d19a84c39b7e5eb3d7815a8
Author: Peter Zijlstra <[email protected]>
Date: Tue Apr 5 17:23:58 2011 +0200

sched: Move the second half of ttwu() to the remote cpu

When I revert the above I see sane CPU values again.
--
Markus


2011-05-30 18:06:27

by Peter Zijlstra

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On Mon, 2011-05-30 at 19:39 +0200, Markus Trippelsdorf wrote:
> I get very high CPU values when I run 'top' on my (mostly) idle system
> with 3.0-rc1. For example mpd was always in the 1-2% range and is now
> constantly over 50%.
>
> This is caused by:
>
> commit 317f394160e9beb97d19a84c39b7e5eb3d7815a8
> Author: Peter Zijlstra <[email protected]>
> Date: Tue Apr 5 17:23:58 2011 +0200
>
> sched: Move the second half of ttwu() to the remote cpu
>
> When I revert the above I see sane CPU values again.

So: echo NO_TTWU_QUEUE > /debug/sched_features, also cures it?

What architecture, what .config, and can you see it with anything other
than mpd (yum search mpd, only seems to result in mpd clients not the
actual server).
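
(For context, a rough sketch of how this toggle is wired up in the
2.6.39-era kernel/sched.c, reconstructed from memory, so treat details
as approximate. Each SCHED_FEAT(name, enabled) line in sched_features.h
becomes a bit in sysctl_sched_features; writing "NO_TTWU_QUEUE" to
/sys/kernel/debug/sched_features clears the bit that sched_feat() tests:)

enum {
#define SCHED_FEAT(name, enabled)	__SCHED_FEAT_##name,
#include "sched_features.h"
#undef SCHED_FEAT
};

#define SCHED_FEAT(name, enabled)	(1UL << __SCHED_FEAT_##name) * enabled |
const_debug unsigned int sysctl_sched_features =
#include "sched_features.h"
	0;
#undef SCHED_FEAT

/* e.g. sched_feat(TTWU_QUEUE) tests the TTWU_QUEUE bit: */
#define sched_feat(x) (sysctl_sched_features & (1UL << __SCHED_FEAT_##x))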

2011-05-30 18:24:00

by Markus Trippelsdorf

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On 2011.05.30 at 20:05 +0200, Peter Zijlstra wrote:
> On Mon, 2011-05-30 at 19:39 +0200, Markus Trippelsdorf wrote:
> > I get very high CPU values when I run 'top' on my (mostly) idle system
> > with 3.0-rc1. For example mpd was always in the 1-2% range and is now
> > constantly over 50%.
> >
> > This is caused by:
> >
> > commit 317f394160e9beb97d19a84c39b7e5eb3d7815a8
> > Author: Peter Zijlstra <[email protected]>
> > Date: Tue Apr 5 17:23:58 2011 +0200
> >
> > sched: Move the second half of ttwu() to the remote cpu
> >
> > When I revert the above I see sane CPU values again.
>
> So: echo NO_TTWU_QUEUE > /debug/sched_features, also cures it?

Yes.

> What architecture, what .config, and can you see it with anything other
> than mpd (yum search mpd, only seems to result in mpd clients not the
> actual server).

Yes, mpd was just an example. _Every_ program that would normally show in
the 1-5% CPU range is now in the 30-70% range (X, xterm, etc.).

I'm running an AMD Phenom II X4 in 64-bit mode.
My config is attached.

--
Markus


Attachments:
(No filename) (1.04 kB)
config (51.60 kB)

2011-05-30 20:45:51

by Markus Trippelsdorf

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On 2011.05.30 at 20:23 +0200, Markus Trippelsdorf wrote:
> On 2011.05.30 at 20:05 +0200, Peter Zijlstra wrote:
> > On Mon, 2011-05-30 at 19:39 +0200, Markus Trippelsdorf wrote:
> > > I get very high CPU values when I run 'top' on my (mostly) idle system
> > > with 3.0-rc1. For example mpd was always in the 1-2% range and is now
> > > constantly over 50%.
> > >
> > > This is caused by:
> > >
> > > commit 317f394160e9beb97d19a84c39b7e5eb3d7815a8
> > > Author: Peter Zijlstra <[email protected]>
> > > Date: Tue Apr 5 17:23:58 2011 +0200
> > >
> > > sched: Move the second half of ttwu() to the remote cpu
> > >
> > > When I revert the above I see sane CPU values again.
> >
> > So: echo NO_TTWU_QUEUE > /debug/sched_features, also cures it?
>
> Yes.
>
> > What architecture, what .config, and can you see it with anything other
> > than mpd (yum search mpd, only seems to result in mpd clients not the
> > actual server).
>
> Yes, mpd was just an example. _Every_ program that would normally show in
> the 1-5% CPU range is now in the 30-70% range (X, xterm, etc.).

IOW:

% sudo echo TTWU_QUEUE >| /sys/kernel/debug/sched_features
% top -b -n 1
top - 22:40:38 up 1:46, 11 users, load average: 0.00, 0.02, 0.05
Tasks: 109 total, 1 running, 108 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 0.9%sy, 0.0%ni, 98.0%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8183132k total, 1365724k used, 6817408k free, 1632k buffers
Swap: 2097148k total, 0k used, 2097148k free, 655208k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  881 mpd       20   0  122m  20m 4832 S   74  0.3   4:25.22 mpd
 1564 root      20   0 99428  40m 4716 S   36  0.5   3:36.15 X
 1648 markus    20   0 1147m 340m  39m S   16  4.3   1:17.90 firefox-bin
 1649 markus    20   0  237m  33m  17m S    6  0.4   1:02.31 konsole

% sudo echo NO_TTWU_QUEUE >| /sys/kernel/debug/sched_features
% top -b -n 1
top - 22:42:07 up 1:48, 11 users, load average: 0.00, 0.01, 0.05
Tasks: 110 total, 1 running, 109 sleeping, 0 stopped, 0 zombie
Cpu(s): 1.0%us, 0.9%sy, 0.0%ni, 98.0%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 8183132k total, 1377308k used, 6805824k free, 1632k buffers
Swap: 2097148k total, 0k used, 2097148k free, 659896k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 1649 markus    20   0  237m  33m  17m S    2  0.4   1:17.69 konsole
    1 root      20   0   160   28   12 S    0  0.0   0:16.95 minit
    2 root      20   0     0    0    0 S    0  0.0   0:00.36 kthreadd


--
Markus

2011-05-30 22:13:09

by Peter Zijlstra

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On Mon, 2011-05-30 at 22:45 +0200, Markus Trippelsdorf wrote:
> On 2011.05.30 at 20:23 +0200, Markus Trippelsdorf wrote:
> > On 2011.05.30 at 20:05 +0200, Peter Zijlstra wrote:
> > > On Mon, 2011-05-30 at 19:39 +0200, Markus Trippelsdorf wrote:
> > > > I get very high CPU values when I run 'top' on my (mostly) idle system
> > > > with 3.0-rc1. For example mpd was always in the 1-2% range and is now
> > > > constantly over 50%.
> > > >
> > > > This is caused by:
> > > >
> > > > commit 317f394160e9beb97d19a84c39b7e5eb3d7815a8
> > > > Author: Peter Zijlstra <[email protected]>
> > > > Date: Tue Apr 5 17:23:58 2011 +0200
> > > >
> > > > sched: Move the second half of ttwu() to the remote cpu
> > > >
> > > > When I revert the above I see sane CPU values again.
> > >
> > > So: echo NO_TTWU_QUEUE > /debug/sched_features, also cures it?
> >
> > Yes.
> >
> > > What architecture, what .config, and can you see it with anything other
> > > than mpd (yum search mpd, only seems to result in mpd clients not the
> > > actual server).
> >
> > Yes, mpd was just an example. _Every_ program that would normally show in
> > the 1-5% CPU range is now in the 30-70% range (X, xterm, etc.).
>
> IOW:
>
> % sudo echo TTWU_QUEUE >| /sys/kernel/debug/sched_features
> % top -b -n 1
> top - 22:40:38 up 1:46, 11 users, load average: 0.00, 0.02, 0.05
> Tasks: 109 total, 1 running, 108 sleeping, 0 stopped, 0 zombie
> Cpu(s): 1.0%us, 0.9%sy, 0.0%ni, 98.0%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 8183132k total, 1365724k used, 6817408k free, 1632k buffers
> Swap: 2097148k total, 0k used, 2097148k free, 655208k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 881 mpd 20 0 122m 20m 4832 S 74 0.3 4:25.22 mpd
> 1564 root 20 0 99428 40m 4716 S 36 0.5 3:36.15 X
> 1648 markus 20 0 1147m 340m 39m S 16 4.3 1:17.90 firefox-bin
> 1649 markus 20 0 237m 33m 17m S 6 0.4 1:02.31 konsole
>
> % sudo echo NO_TTWU_QUEUE >| /sys/kernel/debug/sched_features
> % top -b -n 1
> top - 22:42:07 up 1:48, 11 users, load average: 0.00, 0.01, 0.05
> Tasks: 110 total, 1 running, 109 sleeping, 0 stopped, 0 zombie
> Cpu(s): 1.0%us, 0.9%sy, 0.0%ni, 98.0%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
> Mem: 8183132k total, 1377308k used, 6805824k free, 1632k buffers
> Swap: 2097148k total, 0k used, 2097148k free, 659896k cached
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 1649 markus 20 0 237m 33m 17m S 2 0.4 1:17.69 konsole
> 1 root 20 0 160 28 12 S 0 0.0 0:16.95 minit
> 2 root 20 0 0 0 0 S 0 0.0 0:00.36 kthreadd

Right, it was easy enough to reproduce once you know what to look for
(I hadn't noticed it as such and simply ascribed it to desktop bloat).

It does look to be an accounting funny, because when I see firefox
consume 60% the Cpu(s) line isn't in fact registering much cpu usage at
all.

I poked around with some skip_rq_update bits but couldn't make it go
away, will try more tomorrow.

2011-05-31 09:55:53

by Peter Zijlstra

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On Tue, 2011-05-31 at 00:12 +0200, Peter Zijlstra wrote:
>
> I poked around with some skip_rq_update bits but couldn't make it go
> away, will try more tomorrow.

Gah, that was annoying, and explains why I wasn't seeing it on my main
dev box but could see it on my desktop (wsm tsc is much better synced
than the core2 tsc).

Could you confirm that this indeed cures the problem? I'll probably
commit a version without the toggle, but the toggle made it easy to
verify that it indeed makes a difference.

---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -2604,6 +2601,8 @@ static void ttwu_queue(struct task_struc
 
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+		if (sched_feat(TTWU_SYNC_CLOCK))
+			sched_clock_cpu(cpu); /* sync clocks x-cpu */
 		ttwu_queue_remote(p, cpu);
 		return;
 	}
Index: linux-2.6/kernel/sched_features.h
===================================================================
--- linux-2.6.orig/kernel/sched_features.h
+++ linux-2.6/kernel/sched_features.h
@@ -70,3 +70,4 @@ SCHED_FEAT(NONIRQ_POWER, 1)
  * using the scheduler IPI. Reduces rq->lock contention/bounces.
  */
 SCHED_FEAT(TTWU_QUEUE, 1)
+SCHED_FEAT(TTWU_SYNC_CLOCK, 1)

2011-05-31 10:04:38

by Markus Trippelsdorf

Subject: Re: Very high CPU values in top on idle system (3.0-rc1)

On 2011.05.31 at 11:55 +0200, Peter Zijlstra wrote:
> On Tue, 2011-05-31 at 00:12 +0200, Peter Zijlstra wrote:
> >
> > I poked around with some skip_rq_update bits but couldn't make it go
> > away, will try more tomorrow.
>
> Gah, that was annoying, and explains why I wasn't seeing it on my main
> dev box but could see it on my desktop (wsm tsc is much better synced
> than the core2 tsc).
>
> Could you confirm that this indeed cures the problem?

Yes, indeed it does. Many thanks Peter.

--
Markus

2011-05-31 12:31:55

by Peter Zijlstra

Subject: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

Commit-ID: f01114cb59d670e9b4f2c335930dd57db96e9360
Gitweb: http://git.kernel.org/tip/f01114cb59d670e9b4f2c335930dd57db96e9360
Author: Peter Zijlstra <[email protected]>
AuthorDate: Tue, 31 May 2011 12:26:55 +0200
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 31 May 2011 14:19:56 +0200

sched: Fix cross-cpu clock sync on remote wakeups

Markus reported that commit 317f394160e ("sched: Move the second half
of ttwu() to the remote cpu") caused some accounting funnies on his AMD
Phenom II X4, such as weird 'top' results.

It turns out that this is due to non-synced TSC and the queued remote
wakeups stopped coupling the two relevant cpu clocks, which leads to
wakeups seeing time jumps, which in turn lead to skewed runtime stats.

Add an explicit call to sched_clock_cpu() to couple the per-cpu clocks
to restore the normal flow of time.

Reported-and-tested-by: Markus Trippelsdorf <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
Link: http://lkml.kernel.org/r/1306835745.2353.3.camel@twins
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/sched.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index cbb3a0e..49cc70b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2600,6 +2600,7 @@ static void ttwu_queue(struct task_struct *p, int cpu)
 
 #if defined(CONFIG_SMP)
 	if (sched_feat(TTWU_QUEUE) && cpu != smp_processor_id()) {
+		sched_clock_cpu(cpu); /* sync clocks x-cpu */
 		ttwu_queue_remote(p, cpu);
 		return;
 	}
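
(For reference, the cross-cpu coupling this sched_clock_cpu(cpu) call
triggers lives in sched_clock_remote(); an abridged sketch of the
2.6.39-era kernel/sched_clock.c, quoted from memory:)

static u64 sched_clock_remote(struct sched_clock_data *scd)
{
	struct sched_clock_data *my_scd = this_scd();
	u64 this_clock, remote_clock;
	u64 *ptr, old_val, val;

	sched_clock_local(my_scd);
again:
	this_clock = my_scd->clock;
	remote_clock = scd->clock;

	/*
	 * Take the larger of the two clocks as the new time for both
	 * runqueues; this keeps the pair monotonic wrt each other.
	 */
	if (likely((s64)(remote_clock - this_clock) < 0)) {
		ptr = &scd->clock;
		old_val = remote_clock;
		val = this_clock;
	} else {
		ptr = &my_scd->clock;
		old_val = this_clock;
		val = remote_clock;
	}

	if (cmpxchg64(ptr, old_val, val) != old_val)
		goto again;

	return val;
}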

2011-05-31 12:56:25

by Borislav Petkov

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

Hey Peter,

On Tue, May 31, 2011 at 12:31:20PM +0000, tip-bot for Peter Zijlstra wrote:
> Commit-ID: f01114cb59d670e9b4f2c335930dd57db96e9360
> Gitweb: http://git.kernel.org/tip/f01114cb59d670e9b4f2c335930dd57db96e9360
> Author: Peter Zijlstra <[email protected]>
> AuthorDate: Tue, 31 May 2011 12:26:55 +0200
> Committer: Ingo Molnar <[email protected]>
> CommitDate: Tue, 31 May 2011 14:19:56 +0200
>
> sched: Fix cross-cpu clock sync on remote wakeups
>
> Markus reported that commit 317f394160e ("sched: Move the second half
> of ttwu() to the remote cpu") caused some accounting funnies on his AMD
> Phenom II X4, such as weird 'top' results.
>
> It turns out that this is due to non-synced TSC

this would mean that his machine doesn't pass the TSC sync check at boot,
but that's an F10h and those usually have synchronized TSCs?

I'm confused.

> and the queued remote
> wakeups stopped coupling the two relevant cpu clocks, which leads to
> wakeups seeing time jumps, which in turn lead to skewed runtime stats.
>
> Add an explicit call to sched_clock_cpu() to couple the per-cpu clocks
> to restore the normal flow of time.
>
> Reported-and-tested-by: Markus Trippelsdorf <[email protected]>
> Signed-off-by: Peter Zijlstra <[email protected]>
> Link: http://lkml.kernel.org/r/1306835745.2353.3.camel@twins
> Signed-off-by: Ingo Molnar <[email protected]>

Thanks.

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-05-31 22:29:28

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Tue, 2011-05-31 at 14:56 +0200, Borislav Petkov wrote:
> Hey Peter,
>
> On Tue, May 31, 2011 at 12:31:20PM +0000, tip-bot for Peter Zijlstra wrote:
> > Commit-ID: f01114cb59d670e9b4f2c335930dd57db96e9360
> > Gitweb: http://git.kernel.org/tip/f01114cb59d670e9b4f2c335930dd57db96e9360
> > Author: Peter Zijlstra <[email protected]>
> > AuthorDate: Tue, 31 May 2011 12:26:55 +0200
> > Committer: Ingo Molnar <[email protected]>
> > CommitDate: Tue, 31 May 2011 14:19:56 +0200
> >
> > sched: Fix cross-cpu clock sync on remote wakeups
> >
> > Markus reported that commit 317f394160e ("sched: Move the second half
> > of ttwu() to the remote cpu") caused some accounting funnies on his AMD
> > Phenom II X4, such as weird 'top' results.
> >
> > It turns out that this is due to non-synced TSC
>
> this would mean that his machine doesn't pass the TSC sync check at boot
> but that's a F10h and they usu. have synchronized TSCs?
>
> I'm confused.

Well, I don't have a modern AMD system to verify on, but the only
explanation is sched_clock weirdness (different code from the GTOD tsc
stuff). I could not reproduce on an Intel Westmere machine, but could on
a Core2.

The sched_clock_cpu() stuff basically takes a GTOD timestamp every tick
and uses sched_clock() (tsc + cyc2ns) to provide delta increments; when
TSCs are synced, every cpu should return the same value and the patch is
a nop.

If they aren't synced, the per-cpu sched_clock_cpu() values can drift up
to about 2 jiffies (when the ticks drift about 1 jiffy apart and the
slower of the two has a stuck tsc while the faster of the two progresses
at the normal rate). In that case doing a clock update cross-cpu will
ensure time monotonicity between those two cpus.
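
(To make the 2-jiffy bound concrete, a sketch of the local clock update
in the 2.6.39-era kernel/sched_clock.c, from memory: the clock is the
last tick's GTOD stamp plus a TSC delta, clamped so it can neither go
backwards nor run more than TICK_NSEC ahead of the previous value;
wrap_min()/wrap_max() are overflow-safe min/max helpers:)

static u64 sched_clock_local(struct sched_clock_data *scd)
{
	u64 now, clock, old_clock, min_clock, max_clock;
	s64 delta;

again:
	now = sched_clock();		/* tsc + cyc2ns */
	delta = now - scd->tick_raw;
	if (unlikely(delta < 0))
		delta = 0;

	old_clock = scd->clock;

	/*
	 * scd->clock = clamp(scd->tick_gtod + delta,
	 *		      max(scd->tick_gtod, scd->clock),
	 *		      scd->clock + TICK_NSEC);
	 */
	clock = scd->tick_gtod + delta;
	min_clock = wrap_max(scd->tick_gtod, old_clock);
	max_clock = old_clock + TICK_NSEC;

	clock = wrap_max(clock, min_clock);
	clock = wrap_min(clock, max_clock);

	if (cmpxchg64(&scd->clock, old_clock, clock) != old_clock)
		goto again;

	return clock;
}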

So by this patch making a difference we prove that the sched_clock_cpu()
values aren't synced on the affected machines. That leaves two options:
either the TSCs are screwy or there's a bug somewhere in the
sched_clock* code.

I'd be more than happy if you could take a look at the relevant code
since all code can use more eyes :-)

2011-06-01 07:05:54

by Borislav Petkov

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Tue, May 31, 2011 at 03:11:56PM +0200, Peter Zijlstra wrote:
> Well, I don't have a modern AMD system to verify on, but the only
> explanation is sched_clock weirdness (different code from the GTOD tsc
> stuff). I could not reproduce on an Intel Westmere machine, but could on
> a Core2.
>
> The sched_clock_cpu stuff basically takes a GTOD timestamp every tick
> and uses sched_clock() (tsc + cyc2ns) to provide delta increments, when
> TSCs are synced every cpu should return the same value and the patch is
> a nop.
>
> If they aren't synced the per-cpu sched_clock_cpu() values can drift up
> to about 2 jiffies (when the ticks drift about 1 and the slower of the
> two has a stuck tsc while the faster of the two does progress at the
> normal rate). In that case doing a clock update cross-cpu will ensure
> time monotonicity between those two cpus.

Hmm, could it be that sched_clock_tick() is seeing different
TSC values due to propagation delays of IPIs and TSCs? Or could it also
be that some TSCs don't start at the same moment after powerup but
still run synchronized?

How can we trace this? Do you just do trace_printk() in the scheduler? I'm
asking because I remember reading somewhere that tracing the scheduler
is not as trivial as, say, a driver :). I could do that on a couple of
machines I have here and see what happens.

Thanks.

--
Regards/Gruss,
Boris.

2011-06-01 12:43:32

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Wed, 2011-06-01 at 09:05 +0200, Borislav Petkov wrote:
> On Tue, May 31, 2011 at 03:11:56PM +0200, Peter Zijlstra wrote:
> > Well, I don't have a modern AMD system to verify on, but the only
> > explanation is sched_clock weirdness (different code from the GTOD tsc
> > stuff). I could not reproduce on an Intel Westmere machine, but could on
> > a Core2.
> >
> > The sched_clock_cpu stuff basically takes a GTOD timestamp every tick
> > and uses sched_clock() (tsc + cyc2ns) to provide delta increments, when
> > TSCs are synced every cpu should return the same value and the patch is
> > a nop.
> >
> > If they aren't synced the per-cpu sched_clock_cpu() values can drift up
> > to about 2 jiffies (when the ticks drift about 1 and the slower of the
> > two has a stuck tsc while the faster of the two does progress at the
> > normal rate). In that case doing a clock update cross-cpu will ensure
> > time monotonicity between those two cpus.
>
> Hmm, could it be that the sched_clock_tick() could be seeing different
> TSC values due to propagation delays of IPIs and TSCs? Or, it could be
> also that some TSCs don't start at the same moment after powerup but
> still run synchronized though?
>
> How can we trace this, do you do trace_printk() in the scheduler? I'm
> asking because I remember reading somewhere that tracing the scheduler
> is not that trivial like say a driver :). I could do that on a couple of
> machines I have here and see what happens.

trace_printk() can go pretty much anywhere, you want to start with
something like the below and go from there, either up into
arch/x86/kernel/tsc.c:native_sched_clock() or down into the scheduler
and instrument rq->clock (although you most likely already get most of
that through the sched_clock_cpu() trace_printk).

Also, it might be good to check on the sched_clock_stable logic in
general and on your platform in particular; if that's set we forgo all
the fancy bits and return sched_clock() directly.
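
(The sched_clock_stable short-circuit sits at the top of
sched_clock_cpu(); roughly, per the 2.6.39-era kernel/sched_clock.c,
from memory:)

u64 sched_clock_cpu(int cpu)
{
	struct sched_clock_data *scd;
	u64 clock;

	WARN_ON_ONCE(!irqs_disabled());

	if (sched_clock_stable)
		return sched_clock();	/* skip all the fancy bits */

	if (unlikely(!sched_clock_running))
		return 0ull;

	scd = cpu_sdc(cpu);

	if (cpu != smp_processor_id())
		clock = sched_clock_remote(scd);
	else
		clock = sched_clock_local(scd);

	return clock;
}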

---
kernel/sched_clock.c | 10 ++++++++++
1 files changed, 10 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_clock.c b/kernel/sched_clock.c
index 9d8af0b..873f50f 100644
--- a/kernel/sched_clock.c
+++ b/kernel/sched_clock.c
@@ -167,6 +167,9 @@ static u64 sched_clock_local(struct sched_clock_data *scd)
 	if (cmpxchg64(&scd->clock, old_clock, clock) != old_clock)
 		goto again;
 
+	trace_printk("now: %llu, gtod: %llu, delta: %llu, old_clock: %llu, clock: %llu\n",
+		     now, scd->tick_gtod, delta, old_clock, clock);
+
 	return clock;
 }
 
@@ -203,6 +206,9 @@ static u64 sched_clock_remote(struct sched_clock_data *scd)
 	if (cmpxchg64(ptr, old_val, val) != old_val)
 		goto again;
 
+	trace_printk("this: %llu, remote: %llu, clock: %llu\n",
+		     this_clock, remote_clock, val);
+
 	return val;
 }
 
@@ -231,6 +237,8 @@ u64 sched_clock_cpu(int cpu)
 	else
 		clock = sched_clock_local(scd);
 
+	trace_printk("clock: %llu\n", clock);
+
 	return clock;
 }
 
@@ -251,6 +259,8 @@ void sched_clock_tick(void)
 	now_gtod = ktime_to_ns(ktime_get());
 	now = sched_clock();
 
+	trace_printk("gtod: %llu, now: %llu\n", now_gtod, now);
+
 	scd->tick_raw = now;
 	scd->tick_gtod = now_gtod;
 	sched_clock_local(scd);

2011-06-01 15:50:36

by Borislav Petkov

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Wed, Jun 01, 2011 at 06:36:52AM -0400, Peter Zijlstra wrote:
> trace_printk() can go pretty much anywhere, you want to start with
> something like the below and go from there, either up into
> arch/x86/kernel/tsc.c:native_sched_clock() or down into the scheduler
> and instrument rq->clock (although you most likely already get most of
> that through the sched_clock_cpu() trace_printk).

See the trace excerpt below.

> Also, it might be good to check on the sched_clock_stable logic in
> general and on your platform in particular, if that's set we forgo all
> the fancy bits and return sched_clock() directly.

Nah, we don't set sched_clock_stable on AMD.

--
<idle>-0 [023] 511.343360: sched_clock_cpu: clock: 511500051225
<idle>-0 [023] 511.343365: sched_clock_tick: gtod: 511500058415, now: 511343364325
<idle>-0 [023] 511.343367: sched_clock_local: now: 511343365506, gtod: 511500058415, delta: 1181, old_clock: 511500051225, clock: 511500059596
<idle>-0 [023] 511.343369: sched_clock_local: now: 511343367760, gtod: 511500058415, delta: 3435, old_clock: 511500059596, clock: 511500061850
<idle>-0 [023] 511.343370: sched_clock_cpu: clock: 511500061850
<idle>-0 [023] 511.343380: sched_clock_local: now: 511343378644, gtod: 511500058415, delta: 14319, old_clock: 511500061850, clock: 511500072734
<idle>-0 [023] 511.343381: sched_clock_cpu: clock: 511500072734
<idle>-0 [000] 511.343519: sched_clock_tick: gtod: 511500210072, now: 511343515910
<idle>-0 [000] 511.343523: sched_clock_local: now: 511343521252, gtod: 511500210072, delta: 5342, old_clock: 511484088885, clock: 511500215414
<...>-102 [023] 511.343527: sched_clock_local: now: 511343524554, gtod: 511500058415, delta: 160229, old_clock: 511500072734, clock: 511500218644

let's take this one for example:

now: 511343524554        the TSC-based sched_clock() value on core 23
gtod: 511500058415       scd->tick_gtod, taken at the last tick on core 23 (511.343365 above)
delta: 160229            now - scd->tick_raw = 511343524554 - 511343364325; > 0, so we're incrementing fine
old_clock: 511500072734  what we returned on the previous sched_clock_cpu() above at tstamp 511.343381
clock: 511500218644      what we actually return (= gtod + delta = 511500058415 + 160229)


<...>-102 [023] 511.343528: sched_clock_cpu: clock: 511500218644
<idle>-0 [000] 511.343542: sched_clock_local: now: 511343540484, gtod: 511500210072, delta: 24574, old_clock: 511500215414, clock: 511500234646


now, on cpu0 we've advanced in the meantime with now=511343540484 (btw,
the now TSC values are almost the same as the ftrace timestamps, the
last being taken only a couple of µs later) and so on ...

So, I don't see anything unusual but I could very well be missing
something. Btw, this trace is without your change to ttwu_queue().

I'll do the same exercise on another F10h system once I get it free to
see whether we see the same monotonicity there too.

Let me know if you want something else traced or the trace tweaked.


<idle>-0 [000] 511.343543: sched_clock_cpu: clock: 511500234646
<...>-1549 [023] 511.343548: sched_clock_local: now: 511343546428, gtod: 511500058415, delta: 182103, old_clock: 511500218644, clock: 511500240518
<...>-1549 [023] 511.343549: sched_clock_cpu: clock: 511500240518
<idle>-0 [023] 511.343560: sched_clock_local: now: 511343558648, gtod: 511500058415, delta: 194323, old_clock: 511500240518, clock: 511500252738
<idle>-0 [023] 511.343561: sched_clock_cpu: clock: 511500252738
<idle>-0 [023] 511.347381: sched_clock_tick: gtod: 511504076961, now: 511347377038
<idle>-0 [000] 511.347386: sched_clock_tick: gtod: 511504078048, now: 511347378123
<idle>-0 [023] 511.347389: sched_clock_local: now: 511347383317, gtod: 511504076961, delta: 6279, old_clock: 511500252738, clock: 511504083240
<idle>-0 [000] 511.347392: sched_clock_local: now: 511347387920, gtod: 511504078048, delta: 9797, old_clock: 511500234646, clock: 511504087845
<idle>-0 [023] 511.347397: sched_clock_local: now: 511347395232, gtod: 511504076961, delta: 18194, old_clock: 511504083240, clock: 511504095155
<idle>-0 [000] 511.347399: sched_clock_local: now: 511347396131, gtod: 511504078048, delta: 18008, old_clock: 511504087845, clock: 511504096056
<idle>-0 [023] 511.347401: sched_clock_cpu: clock: 511504095155
<idle>-0 [000] 511.347402: sched_clock_cpu: clock: 511504096056
<idle>-0 [000] 511.473120: sched_clock_tick: gtod: 511630004325, now: 511473117198
<idle>-0 [000] 511.473124: sched_clock_local: now: 511473122126, gtod: 511630004325, delta: 4928, old_clock: 511504096056, clock: 511630009253
<idle>-0 [000] 511.473156: sched_clock_local: now: 511473154566, gtod: 511630004325, delta: 37368, old_clock: 511630009253, clock: 511630041693
<idle>-0 [000] 511.473157: sched_clock_cpu: clock: 511630041693
<...>-3889 [000] 511.473232: sched_clock_local: now: 511473229761, gtod: 511630004325, delta: 112563, old_clock: 511630041693, clock: 511630116888
<...>-3889 [000] 511.473233: sched_clock_cpu: clock: 511630116888
<...>-1226 [000] 511.473251: sched_clock_local: now: 511473249602, gtod: 511630004325, delta: 132404, old_clock: 511630116888, clock: 511630136729
<...>-1226 [000] 511.473252: sched_clock_cpu: clock: 511630136729
<idle>-0 [012] 511.473260: sched_clock_local: now: 511473255980, gtod: 511448077716, delta: 181793791, old_clock: 511448095543, clock: 511452077966
<idle>-0 [012] 511.473263: sched_clock_cpu: clock: 511452077966
<...>-3889 [000] 511.473266: sched_clock_local: now: 511473263744, gtod: 511630004325, delta: 146546, old_clock: 511630136729, clock: 511630150871
<...>-3889 [000] 511.473267: sched_clock_cpu: clock: 511630150871
<idle>-0 [012] 511.473271: sched_clock_tick: gtod: 511630157193, now: 511473269741
<idle>-0 [012] 511.473273: sched_clock_local: now: 511473271181, gtod: 511630157193, delta: 1440, old_clock: 511452077966, clock: 511630158633
<idle>-0 [000] 511.473280: sched_clock_local: now: 511473277552, gtod: 511630004325, delta: 160354, old_clock: 511630150871, clock: 511630164679
<idle>-0 [000] 511.473282: sched_clock_cpu: clock: 511630164679
<...>-3891 [012] 511.473355: sched_clock_local: now: 511473352521, gtod: 511630157193, delta: 82780, old_clock: 511630158633, clock: 511630239973
<...>-3891 [012] 511.473356: sched_clock_cpu: clock: 511630239973
<...>-1201 [012] 511.473375: sched_clock_local: now: 511473373553, gtod: 511630157193, delta: 103812, old_clock: 511630239973, clock: 511630261005
<...>-1201 [012] 511.473376: sched_clock_cpu: clock: 511630261005
<idle>-0 [000] 511.473380: sched_clock_local: now: 511473375569, gtod: 511630004325, delta: 258371, old_clock: 511630164679, clock: 511630262696
<idle>-0 [000] 511.473383: sched_clock_cpu: clock: 511630262696
<...>-3891 [012] 511.473387: sched_clock_local: now: 511473384172, gtod: 511630157193, delta: 114431, old_clock: 511630261005, clock: 511630271624
<...>-3891 [012] 511.473388: sched_clock_cpu: clock: 511630271624


--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-06-02 07:52:12

by Yong Zhang

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Wed, Jun 1, 2011 at 11:50 PM, Borislav Petkov <[email protected]> wrote:
> On Wed, Jun 01, 2011 at 06:36:52AM -0400, Peter Zijlstra wrote:
>> trace_printk() can go pretty much anywhere, you want to start with
>> something like the below and go from there, either up into
>> arch/x86/kernel/tsc.c:native_sched_clock() or down into the scheduler
>> and instrument rq->clock (although you most likely already get most of
>> that through the sched_clock_cpu() trace_printk).
>
> See for a trace excerpt below.
>
>> Also, it might be good to check on the sched_clock_stable logic in
>> general and on your platform in particular, if that's set we forgo all
>> the fancy bits and return sched_clock() directly.
>
> Nah, we don't set sched_clock_stable on AMD.
>
> --
>          <idle>-0     [023]   511.343360: sched_clock_cpu: clock: 511500051225
>          <idle>-0     [023]   511.343365: sched_clock_tick: gtod: 511500058415, now: 511343364325
>          <idle>-0     [023]   511.343367: sched_clock_local: now: 511343365506, gtod: 511500058415, delta: 1181, old_clock: 511500051225, clock: 511500059596
>          <idle>-0     [023]   511.343369: sched_clock_local: now: 511343367760, gtod: 511500058415, delta: 3435, old_clock: 511500059596, clock: 511500061850
>          <idle>-0     [023]   511.343370: sched_clock_cpu: clock: 511500061850
>          <idle>-0     [023]   511.343380: sched_clock_local: now: 511343378644, gtod: 511500058415, delta: 14319, old_clock: 511500061850, clock: 511500072734
>          <idle>-0     [023]   511.343381: sched_clock_cpu: clock: 511500072734
>          <idle>-0     [000]   511.343519: sched_clock_tick: gtod: 511500210072, now: 511343515910
>          <idle>-0     [000]   511.343523: sched_clock_local: now: 511343521252, gtod: 511500210072, delta: 5342, old_clock: 511484088885, clock: 511500215414
>           <...>-102   [023]   511.343527: sched_clock_local: now: 511343524554, gtod: 511500058415, delta: 160229, old_clock: 511500072734, clock: 511500218644

In sched_clock_local(), clock is calculated around ->tick_gtod even if
that ->tick_gtod has been stale for a long time because we stayed in idle
state. You know ->tick_gtod is only updated in sched_clock_tick().

IOW, when a cpu goes out of idle, sched_clock_tick() is called from
tick_nohz_stop_idle(), which is later than the interrupt.

So if we have any site which calls sched_clock() in an interrupt on an
idle cpu, it could get an incorrect clock.

I'm not sure how to teach sched_clock() about this special case, Peter?

Thanks,
Yong

>
> let's take this one for example
>
> now: 511343524554       is the TSC on core 23
> gtod: 511500058415      tick_gtod we just updated with the delta of now and scd->tick_raw. Btw, delta is > 0 so we're incrementing fine.
> old_clock: 511500072734 what we returned on the previous sched_clock_cpu() above at tstamp 511.343381
> clock: 511500218644     is what we actually return
>
>
>           <...>-102   [023]   511.343528: sched_clock_cpu: clock: 511500218644
>          <idle>-0     [000]   511.343542: sched_clock_local: now: 511343540484, gtod: 511500210072, delta: 24574, old_clock: 511500215414, clock: 511500234646
>
>
> now, on cpu0 we've advanced in the meantime with now=511343540484 (btw,
> the now TSC values are almost the same as the ftrace timestamps, the
> last being taken only a couple of µs later) and so on ...
>
> So, I don't see anything unusual but I could very well be missing
> something. Btw, this trace is without your change to ttwu_queue().
>
> I'll do the same exercise on another F10h system once I get it free to
> see whether we see the same monotonicity there too.
>
> Let me know if you want something else traced or the trace tweaked.
>
>
>          <idle>-0     [000]   511.343543: sched_clock_cpu: clock: 511500234646
>           <...>-1549  [023]   511.343548: sched_clock_local: now: 511343546428, gtod: 511500058415, delta: 182103, old_clock: 511500218644, clock: 511500240518
>           <...>-1549  [023]   511.343549: sched_clock_cpu: clock: 511500240518
>          <idle>-0     [023]   511.343560: sched_clock_local: now: 511343558648, gtod: 511500058415, delta: 194323, old_clock: 511500240518, clock: 511500252738
>          <idle>-0     [023]   511.343561: sched_clock_cpu: clock: 511500252738
>          <idle>-0     [023]   511.347381: sched_clock_tick: gtod: 511504076961, now: 511347377038
>          <idle>-0     [000]   511.347386: sched_clock_tick: gtod: 511504078048, now: 511347378123
>          <idle>-0     [023]   511.347389: sched_clock_local: now: 511347383317, gtod: 511504076961, delta: 6279, old_clock: 511500252738, clock: 511504083240
>          <idle>-0     [000]   511.347392: sched_clock_local: now: 511347387920, gtod: 511504078048, delta: 9797, old_clock: 511500234646, clock: 511504087845
>          <idle>-0     [023]   511.347397: sched_clock_local: now: 511347395232, gtod: 511504076961, delta: 18194, old_clock: 511504083240, clock: 511504095155
>          <idle>-0     [000]   511.347399: sched_clock_local: now: 511347396131, gtod: 511504078048, delta: 18008, old_clock: 511504087845, clock: 511504096056
>          <idle>-0     [023]   511.347401: sched_clock_cpu: clock: 511504095155
>          <idle>-0     [000]   511.347402: sched_clock_cpu: clock: 511504096056
>          <idle>-0     [000]   511.473120: sched_clock_tick: gtod: 511630004325, now: 511473117198
>          <idle>-0     [000]   511.473124: sched_clock_local: now: 511473122126, gtod: 511630004325, delta: 4928, old_clock: 511504096056, clock: 511630009253
>          <idle>-0     [000]   511.473156: sched_clock_local: now: 511473154566, gtod: 511630004325, delta: 37368, old_clock: 511630009253, clock: 511630041693
>          <idle>-0     [000]   511.473157: sched_clock_cpu: clock: 511630041693
>           <...>-3889  [000]   511.473232: sched_clock_local: now: 511473229761, gtod: 511630004325, delta: 112563, old_clock: 511630041693, clock: 511630116888
>           <...>-3889  [000]   511.473233: sched_clock_cpu: clock: 511630116888
>           <...>-1226  [000]   511.473251: sched_clock_local: now: 511473249602, gtod: 511630004325, delta: 132404, old_clock: 511630116888, clock: 511630136729
>           <...>-1226  [000]   511.473252: sched_clock_cpu: clock: 511630136729
>          <idle>-0     [012]   511.473260: sched_clock_local: now: 511473255980, gtod: 511448077716, delta: 181793791, old_clock: 511448095543, clock: 511452077966
>          <idle>-0     [012]   511.473263: sched_clock_cpu: clock: 511452077966
>           <...>-3889  [000]   511.473266: sched_clock_local: now: 511473263744, gtod: 511630004325, delta: 146546, old_clock: 511630136729, clock: 511630150871
>           <...>-3889  [000]   511.473267: sched_clock_cpu: clock: 511630150871
>          <idle>-0     [012]   511.473271: sched_clock_tick: gtod: 511630157193, now: 511473269741
>          <idle>-0     [012]   511.473273: sched_clock_local: now: 511473271181, gtod: 511630157193, delta: 1440, old_clock: 511452077966, clock: 511630158633
>          <idle>-0     [000]   511.473280: sched_clock_local: now: 511473277552, gtod: 511630004325, delta: 160354, old_clock: 511630150871, clock: 511630164679
>          <idle>-0     [000]   511.473282: sched_clock_cpu: clock: 511630164679
>           <...>-3891  [012]   511.473355: sched_clock_local: now: 511473352521, gtod: 511630157193, delta: 82780, old_clock: 511630158633, clock: 511630239973
>           <...>-3891  [012]   511.473356: sched_clock_cpu: clock: 511630239973
>           <...>-1201  [012]   511.473375: sched_clock_local: now: 511473373553, gtod: 511630157193, delta: 103812, old_clock: 511630239973, clock: 511630261005
>           <...>-1201  [012]   511.473376: sched_clock_cpu: clock: 511630261005
>          <idle>-0     [000]   511.473380: sched_clock_local: now: 511473375569, gtod: 511630004325, delta: 258371, old_clock: 511630164679, clock: 511630262696
>          <idle>-0     [000]   511.473383: sched_clock_cpu: clock: 511630262696
>           <...>-3891  [012]   511.473387: sched_clock_local: now: 511473384172, gtod: 511630157193, delta: 114431, old_clock: 511630261005, clock: 511630271624
>           <...>-3891  [012]   511.473388: sched_clock_cpu: clock: 511630271624
>
>
> --
> Regards/Gruss,
> Boris.
>
> Advanced Micro Devices GmbH
> Einsteinring 24, 85609 Dornach
> General Managers: Alberto Bozzo, Andrew Bowd
> Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
> Registergericht Muenchen, HRB Nr. 43632



--
Only stand for myself

2011-06-02 13:01:00

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, 2011-06-02 at 15:52 +0800, Yong Zhang wrote:
> In sched_clock_local(), clock is calculated around ->tick_gtod even if
> that ->tick_gtod has been stale for a long time because we stayed in idle
> state. You know ->tick_gtod is only updated in sched_clock_tick().

(well, no, there's idle callbacks as you said below)

> IOW, when a cpu goes out of idle, sched_clock_tick() is called from
> tick_nohz_stop_idle(), which is later than the interrupt.

Gah, that would be awful and mean wakeups from interrupts were already
broken. /me goes look at code.

irq_enter() -> tick_check_idle() -> tick_check_nohz() ->
tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event()

should update the thing before we run any isrs, right?
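
(That callback is essentially a forced clock tick; a sketch of the
2.6.39-era kernel/sched_clock.c, from memory:)

/*
 * We just came out of idle, so the scd is stale; resync it from
 * the GTOD before anything in the interrupt uses the clock:
 */
void sched_clock_idle_wakeup_event(u64 delta_ns)
{
	if (timekeeping_suspended)
		return;

	sched_clock_tick();
	touch_softlockup_watchdog();
}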

> So if we have any site which calls sched_clock() in an interrupt on an
> idle cpu, it could get an incorrect clock.
>
> I'm not sure how to teach sched_clock() about this special case, Peter?

isn't anything to teach afaict.

2011-06-02 14:23:52

by Yong Zhang

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, Jun 02, 2011 at 03:04:26PM +0200, Peter Zijlstra wrote:
> On Thu, 2011-06-02 at 15:52 +0800, Yong Zhang wrote:
> > In sched_clock_local(), clock is calculated around ->tick_gtod even if
> > that ->tick_gtod has been stale for a long time because we stayed in idle
> > state. You know ->tick_gtod is only updated in sched_clock_tick().
>
> (well, no, there's idle callbacks as you said below)
>
> > IOW, when a cpu goes out of idle, sched_clock_tick() is called from
> > tick_nohz_stop_idle(), which is later than the interrupt.
>
> Gah, that would be awful and mean wakeups from interrupts were already
> broken. /me goes look at code.
>
> irq_enter() -> tick_check_idle() -> tick_check_nohz() ->
> tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event()
>
> should update the thing before we run any isrs, right?

Hmmm, you are right.

But smp_reschedule_interrupt() doesn't call irq_enter()/irq_exit(),
is that correct?
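
(For reference, the x86 handler at the time looks roughly like this,
per the 2.6.39-era arch/x86/kernel/smp.c, from memory; note there is
no irq_enter()/irq_exit() anywhere in it:)

void smp_reschedule_interrupt(struct pt_regs *regs)
{
	ack_APIC_irq();
	inc_irq_stat(irq_resched_count);
	scheduler_ipi();
	/*
	 * KVM uses this interrupt to force a cpu out of guest mode
	 */
}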

Thanks,
Yong

2011-06-02 15:45:08

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, 2011-06-02 at 22:23 +0800, Yong Zhang wrote:
> On Thu, Jun 02, 2011 at 03:04:26PM +0200, Peter Zijlstra wrote:
> > On Thu, 2011-06-02 at 15:52 +0800, Yong Zhang wrote:
> > > In sched_clock_local(), clock is calculated around ->tick_gtod even if
> > > that ->tick_gtod has been stale for a long time because we stayed in idle
> > > state. You know ->tick_gtod is only updated in sched_clock_tick().
> >
> > (well, no, there's idle callbacks as you said below)
> >
> > > IOW, when a cpu goes out of idle, sched_clock_tick() is called from
> > > tick_nohz_stop_idle(), which is later than the interrupt.
> >
> > Gah, that would be awful and mean wakeups from interrupts were already
> > broken. /me goes look at code.
> >
> > irq_enter() -> tick_check_idle() -> tick_check_nohz() ->
> > tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event()
> >
> > should update the thing before we run any isrs, right?
>
> Hmmm, you are right.
>
> But smp_reschedule_interrupt() doesn't call irq_enter()/irq_exit(),
> is that correct?

Crap.. you're right. And I bet other archs don't do that either. With
NO_HZ you really need irq_enter() for pretty much all interrupts so I
was assuming the resched IPI had it, but it's been special and never
really needed it. If it would wake an idle cpu the idle loop exit would
deal with it; if it interrupted userspace the thing was running and
NO_HZ wasn't relevant.

Damn.

And yes, the only reason I didn't see this on my dev box was because we
do indeed set that sched_clock_stable thing on wsm. And I never noticed
on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird
at all.

Adding it to all resched int handlers is of course a possibility but
would slow down the thing, although with the new code, most users are
now indeed wakeups (excepting weird and wonderful users like KVM).

We could of course add it in sched.c since the logic recurses just
fine.. it's not pretty though.. :/

Thoughts?

---
kernel/sched.c | 18 +++++++++++++++++-
1 files changed, 17 insertions(+), 1 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 2fe98ed..365ed6b 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2554,7 +2554,23 @@ static void sched_ttwu_pending(void)
 
 void scheduler_ipi(void)
 {
-	sched_ttwu_pending();
+	struct rq *rq = this_rq();
+	struct task_struct *list = xchg(&rq->wake_list, NULL);
+
+	if (!list)
+		return;
+
+	irq_enter();
+	raw_spin_lock(&rq->lock);
+
+	while (list) {
+		struct task_struct *p = list;
+		list = list->wake_entry;
+		ttwu_do_activate(rq, p, 0);
+	}
+
+	raw_spin_unlock(&rq->lock);
+	irq_exit();
 }
 
 static void ttwu_queue_remote(struct task_struct *p, int cpu)

2011-06-03 06:49:41

by Yong Zhang

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, Jun 2, 2011 at 11:48 PM, Peter Zijlstra <[email protected]> wrote:
> On Thu, 2011-06-02 at 22:23 +0800, Yong Zhang wrote:
>> On Thu, Jun 02, 2011 at 03:04:26PM +0200, Peter Zijlstra wrote:
>> > On Thu, 2011-06-02 at 15:52 +0800, Yong Zhang wrote:
>> > > In sched_clock_local(), clock is calculated around ->tick_gtod even if
>> > > that ->tick_gtod has been stale for a long time because we stayed in idle
>> > > state. You know ->tick_gtod is only updated in sched_clock_tick().
>> >
>> > (well, no, there's idle callbacks as you said below)
>> >
>> > > IOW, when a cpu goes out of idle, sched_clock_tick() is called from
>> > > tick_nohz_stop_idle(), which is later than the interrupt.
>> >
>> > Gah, that would be awful and mean wakeups from interrupts were already
>> > broken. /me goes look at code.
>> >
>> > irq_enter() -> tick_check_idle() -> tick_check_nohz() ->
>> > tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event()
>> >
>> > should update the thing before we run any isrs, right?
>>
>> Hmmm, you are right.
>>
>> But smp_reschedule_interrupt() doesn't call irq_enter()/irq_exit(),
>> is that correct?
>
> Crap.. you're right.
> And I bet other archs don't do that either.

Most of them ;)
I only noticed sparc32 doing that. Maybe there are more,
but I didn't check very carefully.

> With
> NO_HZ you really need irq_enter() for pretty much all interrupts so I
> was assuming the resched IPI had it, but its been special and never
> really needed it. If it would wake an idle cpu the idle loop exit would
> deal with it, if it interrupted userspace the thing was running and
> NO_HZ wasn't relevant.
>
> Damn.
>
> And yes, the only reason I didn't see this on my dev box was because we
> do indeed set that sched_clock_stable thing on wsm. And I never noticed
> on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird
> at all.
>
> Adding it to all resched int handlers is of course a possibility but
> would slow down the thing, although with the new code, most users are
> now indeed wakeups (excepting weird and wonderful users like KVM).
>
> We could of course add it in sched.c since the logic recurses just
> fine.. it's not pretty though.. :/

Yeah, IMHO it's suitable here and my test looks good.

Reviewed-and-Tested-by: Yong Zhang <[email protected]>

BTW, scheduler_ipi() and sched_ttwu_pending() could share a piece of
code now. And we place irq_enter()/irq_exit() in scheduler_ipi() because
it's the only function we can call there, so account_system_vtime() gets
an almost exact time value. IOW, we should pay some attention to future
changes of smp_reschedule_interrupt().
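
(E.g. roughly along these lines; a sketch close to what eventually
landed upstream, quoted from memory, so the helper name may differ:)

static void sched_ttwu_do_pending(struct task_struct *list)
{
	struct rq *rq = this_rq();

	raw_spin_lock(&rq->lock);
	while (list) {
		struct task_struct *p = list;
		list = list->wake_entry;
		ttwu_do_activate(rq, p, 0);
	}
	raw_spin_unlock(&rq->lock);
}

static void sched_ttwu_pending(void)
{
	struct task_struct *list = xchg(&this_rq()->wake_list, NULL);

	if (!list)
		return;

	sched_ttwu_do_pending(list);
}

void scheduler_ipi(void)
{
	struct task_struct *list = xchg(&this_rq()->wake_list, NULL);

	if (!list)
		return;

	irq_enter();
	sched_ttwu_do_pending(list);
	irq_exit();
}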

Thanks,
Yong

>
> Thoughts?
>
> ---
>  kernel/sched.c |   18 +++++++++++++++++-
>  1 files changed, 17 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2fe98ed..365ed6b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2554,7 +2554,23 @@ static void sched_ttwu_pending(void)
>
>  void scheduler_ipi(void)
>  {
> -       sched_ttwu_pending();
> +       struct rq *rq = this_rq();
> +       struct task_struct *list = xchg(&rq->wake_list, NULL);
> +
> +       if (!list)
> +               return;
> +
> +       irq_enter();
> +       raw_spin_lock(&rq->lock);
> +
> +       while (list) {
> +               struct task_struct *p = list;
> +               list = list->wake_entry;
> +               ttwu_do_activate(rq, p, 0);
> +       }
> +
> +       raw_spin_unlock(&rq->lock);
> +       irq_exit();
>  }
>
>  static void ttwu_queue_remote(struct task_struct *p, int cpu)
>
>
>



--
Only stand for myself

2011-06-03 09:57:23

by Milton Miller

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, 02 Jun 2011 about 15:48:31 -0000, Peter Zijlstra wrote:
> On Thu, 2011-06-02 at 22:23 +0800, Yong Zhang wrote:
> > On Thu, Jun 02, 2011 at 03:04:26PM +0200, Peter Zijlstra wrote:
> > > irq_enter() -> tick_check_idle() -> tick_check_nohz() ->
> > > tick_nohz_stop_idle() -> sched_clock_idle_wakeup_event()
> > >
> > > should update the thing before we run any isrs, right?
> >
> > Hmmm, you are right.
> >
> > But smp_reschedule_interrupt() doesn't call irq_enter()/irq_exit(),
> > is that correct?
>
> Crap.. you're right. And I bet other archs don't do that either. With
> NO_HZ you really need irq_enter() for pretty much all interrupts so I
> was assuming the resched IPI had it, but it's been special and never
> really needed it. If it would wake an idle cpu the idle loop exit would
> deal with it; if it interrupted userspace the thing was running and
> NO_HZ wasn't relevant.
>
> Damn.
>
> And yes, the only reason I didn't see this on my dev box was because we
> do indeed set that sched_clock_stable thing on wsm. And I never noticed
> on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird
> at all.
>
> Adding it to all resched int handlers is of course a possibility but
> would slow down the thing, although with the new code, most users are
> now indeed wakeups (excepting weird and wonderful users like KVM).

[me looks closely at patch and finds early return]

>
> We could of course add it in sched.c since the logic recurses just
> fine.. it's not pretty though.. :/
>
> Thoughts?


Many architectures already have an irq_enter because they have a single
interrupt to the cpu for all external causes including software; they
do the irq_enter before reading from the irq controller to know the
reason for the interrupt. A quick glance at irq_enter and irq_exit
shows they will do several things twice when nested, even if that
is safe.
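
(Sketch of the 2.6.39-era irq_enter() from kernel/softirq.c, from
memory: a nested call repeats rcu_irq_enter() and the vtime accounting
in __irq_enter(), though it skips tick_check_idle() because
in_interrupt() is then true:)

void irq_enter(void)
{
	int cpu = smp_processor_id();

	rcu_irq_enter();
	if (idle_cpu(cpu) && !in_interrupt()) {
		/*
		 * Prevent raise_softirq from needlessly waking up ksoftirqd
		 * here, as softirq will be serviced on return from interrupt.
		 */
		local_bh_disable();
		tick_check_idle(cpu);
		_local_bh_enable();
	}

	__irq_enter();
}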

Are there really that many calls with an empty list that it makes
sense to avoid and optimize this on x86 while penalizing the several
architectures with a nested irq_enter and exit? Especially when it also
duplicates sched_ttwu_pending (because it can't be common with the
additional tests)?

We said the perf mon callback (now irq_work) had to be under irq_enter.

Can we get some numbers for how often the two cases occur on some
various workloads?

milton

>
> ---
> kernel/sched.c | 18 +++++++++++++++++-
> 1 files changed, 17 insertions(+), 1 deletions(-)
>
>
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2fe98ed..365ed6b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2554,7 +2554,23 @@ static void sched_ttwu_pending(void)
>  
>  void scheduler_ipi(void)
>  {
> -	sched_ttwu_pending();
> +	struct rq *rq = this_rq();
> +	struct task_struct *list = xchg(&rq->wake_list, NULL);
> +
> +	if (!list)
> +		return;
> +
> +	irq_enter();
> +	raw_spin_lock(&rq->lock);
> +
> +	while (list) {
> +		struct task_struct *p = list;
> +		list = list->wake_entry;
> +		ttwu_do_activate(rq, p, 0);
> +	}
> +
> +	raw_spin_unlock(&rq->lock);
> +	irq_exit();
>  }
>  
>  static void ttwu_queue_remote(struct task_struct *p, int cpu)

2011-06-03 10:36:32

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Fri, 2011-06-03 at 04:57 -0500, Milton Miller wrote:

> [me looks closely at patch and finds early return]

Yeah, in case there's nothing to do, all the old conditions hold and
irq_enter isn't strictly required.

> >
> > We could of course add it in sched.c since the logic recurses just
> > fine.. it's not pretty though.. :/
> >
> > Thoughts?
>
>
> Many architectures already have an irq_enter becuase they have a single
> interrupt to the cpu for all external causes including software; they
> do the irq_enter before reading from the irq controller to know the
> reason for the interrupt. A quick glance at irq_enter and irq_exit
> shows they will do several things twice when nested, even if that
> is safe.

Agreed, and it's a worry I had. The flip side is that doing it in the
arch code means I have to audit all the archs again (not that I mind too
much, but it takes a wee bit longer), also I'll have to look at all the
code using this IPI for the old purpose.

> Are there really that many calls with an empty list that it makes
> sense to avoid and optimize this on x86 while penalizing the several
> architectures with a nested irq_enter and exit?

I _think_ the now predominant case is this remote wakeup, so adding
irq_enter() to all arch paths isn't too big of a problem, but I need to
make sure.

> Especially when it also duplicates
> sched_ttwu_pending (because it can't be common with the additional tests)?

yeah, sad that.

> We said the perf mon callback (now irq_work) had to be under irq_enter.

Correct, anything that actually does something in the handler needs
irq_enter, the problem with the resched ipi was that it never actually
did anything and the idle loop exit took care of the no_hz funnies.

> Can we get some numbers for how often the two cases occur on some
> various workloads?

Sure, let me stick some counters in.
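
(Something like the following, presumably; the counter names are made
up for illustration:)

static atomic_t ttwu_ipi_total, ttwu_ipi_empty;	/* hypothetical counters */

void scheduler_ipi(void)
{
	struct rq *rq = this_rq();
	struct task_struct *list = xchg(&rq->wake_list, NULL);

	atomic_inc(&ttwu_ipi_total);
	if (!list) {
		atomic_inc(&ttwu_ipi_empty);
		return;
	}

	/* ... rest as in the patch above ... */
}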

2011-06-03 10:55:31

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Fri, 2011-06-03 at 12:36 +0200, Peter Zijlstra wrote:
> > Can we get some numbers for how often the two cases occur on some
> > various workloads?
>
> Sure, let me stick some counters in.

Of 128765 IPIs, 17483 were with an empty list (boot + make -j64).

I didn't yet look at where those empty ones originated from.

2011-06-03 10:58:58

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Fri, 2011-06-03 at 12:55 +0200, Peter Zijlstra wrote:
> On Fri, 2011-06-03 at 12:36 +0200, Peter Zijlstra wrote:
> > > Can we get some numbers for how often the two cases occur on some
> > > various workloads?
> >
> > Sure, let me stick some counters in.
>
> Of 128765 IPIs, 17483 were with an empty list (boot + make -j64).
>
> I didn't yet look at where those empty ones originated from.

After two more make clean; make -j64 cycles that's:

250082:24793

So on this load it's 10+:1 (250082 / 24793 ≈ 10.1).

2011-06-07 13:12:38

by Borislav Petkov

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Thu, Jun 02, 2011 at 11:48:31AM -0400, Peter Zijlstra wrote:
> Crap.. you're right. And I bet other archs don't do that either. With
> NO_HZ you really need irq_enter() for pretty much all interrupts so I
> was assuming the resched IPI had it, but it's been special and never
> really needed it. If it would wake an idle cpu the idle loop exit would
> deal with it; if it interrupted userspace the thing was running and
> NO_HZ wasn't relevant.
>
> Damn.
>
> And yes, the only reason I didn't see this on my dev box was because we
> do indeed set that sched_clock_stable thing on wsm. And I never noticed
> on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird
> at all.
>
> Adding it to all resched int handlers is of course a possibility but
> would slow down the thing, although with the new code, most users are
> now indeed wakeups (excepting weird and wonderful users like KVM).

FWIW, we could set sched_clock_stable on AMD too - at least on F10h
and later. This will take care of the problem at hand and defer the
issue of slowing down the resched ipi handlers.

I dunno, however, whether we still would need the proper ->tick_gtod
update for correct ttwu accounting regardless of sched_clock_stable on
K8 (unstable TSCs) and maybe even other arches.
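
(The Intel side already does this in early_init_intel(); an AMD
analogue would be something like the following hypothetical sketch for
arch/x86/kernel/cpu/amd.c:)

static void __cpuinit early_init_amd(struct cpuinfo_x86 *c)
{
	/* ... */

	/* hypothetical: trust the TSC as sched_clock on F10h and later */
	if (c->x86 >= 0x10 && cpu_has(c, X86_FEATURE_CONSTANT_TSC) &&
	    !check_tsc_unstable())
		sched_clock_stable = 1;

	/* ... */
}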

>
> We could of course add it in sched.c since the logic recurses just
> fine.. it's not pretty though.. :/
>
> Thoughts?
>
> ---
> kernel/sched.c | 18 +++++++++++++++++-
> 1 files changed, 17 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 2fe98ed..365ed6b 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -2554,7 +2554,23 @@ static void sched_ttwu_pending(void)
>  
>  void scheduler_ipi(void)
>  {
> -	sched_ttwu_pending();
> +	struct rq *rq = this_rq();
> +	struct task_struct *list = xchg(&rq->wake_list, NULL);
> +
> +	if (!list)
> +		return;
> +
> +	irq_enter();
> +	raw_spin_lock(&rq->lock);
> +
> +	while (list) {
> +		struct task_struct *p = list;
> +		list = list->wake_entry;
> +		ttwu_do_activate(rq, p, 0);
> +	}
> +
> +	raw_spin_unlock(&rq->lock);
> +	irq_exit();
>  }
>  
>  static void ttwu_queue_remote(struct task_struct *p, int cpu)
>
>
>

--
Regards/Gruss,
Boris.

Advanced Micro Devices GmbH
Einsteinring 24, 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Gemeinde Aschheim, Landkreis Muenchen
Registergericht Muenchen, HRB Nr. 43632

2011-06-07 13:16:46

by Peter Zijlstra

Subject: Re: [tip:sched/urgent] sched: Fix cross-cpu clock sync on remote wakeups

On Tue, 2011-06-07 at 15:12 +0200, Borislav Petkov wrote:
> On Thu, Jun 02, 2011 at 11:48:31AM -0400, Peter Zijlstra wrote:
> > Crap.. you're right. And I bet other archs don't do that either. With
> > NO_HZ you really need irq_enter() for pretty much all interrupts so I
> > was assuming the resched IPI had it, but it's been special and never
> > really needed it. If it would wake an idle cpu the idle loop exit would
> > deal with it; if it interrupted userspace the thing was running and
> > NO_HZ wasn't relevant.
> >
> > Damn.
> >
> > And yes, the only reason I didn't see this on my dev box was because we
> > do indeed set that sched_clock_stable thing on wsm. And I never noticed
> > on my desktop because firefox/X/etc. consuming heaps of CPU isn't weird
> > at all.
> >
> > Adding it to all resched int handlers is of course a possibility but
> > would slow down the thing, although with the new code, most users are
> > now indeed wakeups (excepting weird and wonderful users like KVM).
>
> FWIW, we could set the sched_clock_stable on AMD too - at least on F10h
> and later. This will take care of the problem at hand and defer the
> issue of slowing down the resched ipi handlers.
>
> I dunno, however, whether we still would need the proper ->tick_gtod
> update for correct ttwu accounting regardless of sched_clock_stable on
> K8 (unstable TSCs) and maybe even other arches.

Yeah, I think I'm going to commit this extra irq_enter() thing for now,
and then slowly go through all the arches again and remove this one here
once all archs are sorted.