2011-05-04 21:47:13

by Dave Kleikamp

Subject: idle issues running sembench on 128 cpus

Thomas,
I've been looking at performance running sembench on a 128-cpu system
and I'm running into some issues in the idle loop.

Initially, I was seeing a lot of contention on the clockevents_lock in
clockevents_notify(). Assuming it is only protecting clockevents_chain,
and not the handlers themselves, I changed this to an rwlock (with
thoughts of using rcu if successful).
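
Roughly, the change looked like this (a sketch from memory of the idea,
not the actual diff; the notify fast path takes the lock shared, the
rare registration path would take it exclusive):

/* Sketch only.  This assumes clockevents_lock really just serializes
 * clockevents_chain; anything else it covers (e.g. the CPU_DEAD
 * cleanup) would still need the write side.
 */
static DEFINE_RWLOCK(clockevents_lock);

void clockevents_notify(unsigned long reason, void *arg)
{
	unsigned long flags;

	read_lock_irqsave(&clockevents_lock, flags);
	clockevents_do_notify(reason, arg);
	read_unlock_irqrestore(&clockevents_lock, flags);
}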

This didn't help, but exposed an underlying problem with high contention
on tick_broadcast_lock in tick_broadcast_oneshot_control(). I think with
this many cpus, tick_handle_oneshot_broadcast() is holding that lock a
long time, causing the idle cpus to spin on the lock.

I am able to avoid this problem with either of the kernel parameters
"idle=mwait" or "processor.max_cstate=1". Similarly, defining
CONFIG_INTEL_IDLE=y and using the kernel parameter
intel_idle.max_cstate=1 exposes a different spinlock, pm_qos_lock, but I
found this patch which fixes that contention:
https://lists.linux-foundation.org/pipermail/linux-pm/2011-February/030266.html
https://patchwork.kernel.org/patch/550721/

Of course, we'd like to find a way to reduce the spinlock contention and
not resort to prohibiting the cpus from entering C3 state at all. I
don't see a simple fix, and want to know if you've seen anything like
this before and given it any thought.

I also don't know if it makes sense to be able to tune the cpuidle
governors to add more resistance to entering the C3 state, or even to
allow switching to a performance governor at runtime, similar to cpufreq.

I'd like to hear your thoughts before I dive any deeper into this.

Thanks,
Shaggy


2011-05-04 22:04:48

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

Dave,

On Wed, 4 May 2011, Dave Kleikamp wrote:

> Thomas,
> I've been looking at performance running sembench on a 128-cpu system and I'm
> running into some issues in the idle loop.
>
> Initially, I was seeing a lot of contention on the clockevents_lock in
> clockevents_notify(). Assuming it is only protecting clockevents_chain, and
> not the handlers themselves, I changed this to an rwlock (with thoughts of
> using rcu if successful).
>
> This didn't help, but exposed an underlying problem with high contention on
> tick_broadcast_lock in tick_broadcast_oneshot_control(). I think with this
> many cpus, tick_handle_oneshot_broadcast() is holding that lock a long time,
> causing the idle cpus to spin on the lock.
>
> I am able to avoid this problem with either kernel parameter, "idle=mwait" or
> "processor.max_cstate=1". Similarly, defining CONFIG_INTEL_IDLE=y and using
> the kernel parameter intel_idle.max_cstate=1 exposes a different spinlock,
> pm_qos_lock, but I found this patch which fixes that contention:
> https://lists.linux-foundation.org/pipermail/linux-pm/2011-February/030266.html
> https://patchwork.kernel.org/patch/550721/
>
> Of course, we'd like to find a way to reduce the spinlock contention and not
> resort to prohibiting the cpus from entering C3 state at all. I don't see a
> simple fix, and want to know if you've seen anything like this before and
> given it any thought.
>
> I also don't know if it makes sense to be able to tune the cpuidle governors
> to add more resistance to enter the C3 state, or even being able to switch to
> a performance governor at runtime, similar to cpufreq.
>
> I'd like to hear your thoughts before I dive any deeper into this.

Tick broadcasting for more than a few CPUs simply does not scale and
never will.

There is no real way to avoid the global lock if all you have is
_ONE_ global working event device and N cpus which try to work around
their f*cked up local apics when deeper C-States are entered.

The same problem exists with the TSC, which stops in deeper C-States. You
just don't see lock contention because we rely on the HW serialization
of HPET or PM_TIMER, which is a huge bottleneck when you try to do
timekeeping related stuff at high frequency on more than a handful of
cores at the same time. Just benchmark a tight loop of gettimeofday()
or clock_gettime() on such a machine with and without max_cstate=1 on
the kernel command line.
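
Something like this trivial loop is enough to see it (userspace sketch
only; pin one copy per core with taskset and compare the per-call cost
with and without max_cstate=1):

/* Tight clock_gettime() loop.  Build with: gcc -O2 loop.c -lrt */
#include <stdio.h>
#include <time.h>

int main(void)
{
	struct timespec ts, t0, t1;
	long i, loops = 10 * 1000 * 1000;
	double ns;

	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < loops; i++)
		clock_gettime(CLOCK_MONOTONIC, &ts);
	clock_gettime(CLOCK_MONOTONIC, &t1);

	ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
	printf("%.1f ns per call\n", ns / loops);
	return 0;
}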

We could perhaps get away w/o the locking for the NOHZ=n and HIGHRES=n
case, but I doubt that you want to have that given that you don't want
to restrict C-States either. C-states do not make much sense without
NOHZ=y at least.

We tried to beat sense into unnamed HW manufacturers for years and it
took just a little bit more than a decade before they started to act on
it :(

Thanks,

tglx

2011-05-04 22:07:56

by Andi Kleen

Subject: Re: idle issues running sembench on 128 cpus

Dave Kleikamp <[email protected]> writes:
>
> I am able to avoid this problem with either kernel parameter,
> "idle=mwait" or "processor.max_cstate=1". Similarly, defining
> CONFIG_INTEL_IDLE=y and using the kernel parameter
> intel_idle.max_cstate=1 exposes a different spinlock, pm_qos_lock, but
> I found this patch which fixes that contention:
> https://lists.linux-foundation.org/pipermail/linux-pm/2011-February/030266.html
> https://patchwork.kernel.org/patch/550721/

The pm_qos patch really needs to be merged ASAP. Len?

> Of course, we'd like to find a way to reduce the spinlock contention
> and not resort to prohibiting the cpus from entering C3 state at
> all. I don't see a simple fix, and want to know if you've seen
> anything like this before and given it any thought.
>
> I also don't know if it makes sense to be able to tune the cpuidle
> governors to add more resistance to enter the C3 state, or even being
> able to switch to a performance governor at runtime, similar to
> cpufreq.
>
> I'd like to hear your thoughts before I dive any deeper into this.

It's fixed on Westmere. There the APIC timer will always tick
and all that logic is not needed anymore and disabled.

That is mostly fixed. One problem right now is that the
CLOCK_EVT_FEAT_C3STOP test is inside the lock. But we
can easily move it out, assuming the clock_event_device
gets RCU freed or has a reference count.

But yes it would be still good to fix Nehalem too.

One fix would be to make all the masks hierarchical,
similar to what RCU does. Perhaps even some code
could be shared with RCU on that because it's a very
similar problem.

-Andi
--
[email protected] -- Speaking for myself only

2011-05-04 22:34:40

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

On Wed, 4 May 2011, Andi Kleen wrote:
> Dave Kleikamp <[email protected]> writes:
> > I also don't know if it makes sense to be able to tune the cpuidle
> > governors to add more resistance to enter the C3 state, or even being
> > able to switch to a performance governor at runtime, similar to
> > cpufreq.
> >
> > I'd like to hear your thoughts before I dive any deeper into this.
>
> It's fixed on Westmere. There the APIC timer will always tick
> and all that logic is not needed anymore and disabled.
>
> That is mostly fixed. One problem right now is that the
> CLOCK_EVT_FEAT_C3STOP test is inside the lock. But we
> can easily move it out, assuming the clock_event_device
> gets RCU freed or has a reference count.

No, it does not even need refcounting. We can access it outside of the
lock as this is atomic context called on the cpu which is about to go
idle and therefore the device cannot go away. Easy and straightforward
fix.

> But yes it would be still good to fix Nehalem too.
>
> One fix would be to make all the masks hierarchical,
> similar to what RCU does. Perhaps even some code
> could be shared with RCU on that because it's a very
> similar problem.

In theory. It's not about the mask. The mask is uninteresting. It's
about the expiry time, which we have to protect. There is nothing
hierarchical about that. It all boils down to _ONE_ single functional
device and you don't want to miss your deadline just because you
decided to be extra clever. RCU does not care much whether you run the
callbacks a tick later or not. Time and timekeeping does.

Thanks,

tglx

2011-05-04 23:03:52

by Andi Kleen

Subject: Re: idle issues running sembench on 128 cpus

> No, it does not even need refcounting. We can access it outside of the

Ok.

> lock as this is atomic context called on the cpu which is about to go
> idle and therefore the device cannot go away. Easy and straightforward
> fix.

Ok. Patch appended. Looks good?

BTW why must the lock be irqsave?

>
> > But yes it would be still good to fix Nehalem too.
> >
> > One fix would be to make all the masks hierarchical,
> > similar to what RCU does. Perhaps even some code
> > could be shared with RCU on that because it's a very
> > similar problem.
>
> In theory. It's not about the mask. The mask is uninteresting. It's
> about the expiry time, which we have to protect. There is nothing
> hierarchical about that. It all boils down to _ONE_ single functional

The mask can be used to see if another thread on this core is still
running. If yes you don't need that. Right now Linux doesn't
know that, but it could be taught. The only problem is that once
the other guy goes idle too their timeouts have to be merged.

This would cut contention in half.
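
Hand-waving sketch of the check I have in mind (the helper itself is
new, only the topology and broadcast mask accessors exist; it ignores
the hard part of merging the expiry times and any memory ordering):

/* If a sibling thread on this core is not in the broadcast oneshot
 * mask, its lapic is still ticking, so the core cannot be in a deep
 * C-state and we could skip the global broadcast dance entirely.
 */
static bool sibling_keeps_lapic_alive(int cpu)
{
	int i;

	for_each_cpu(i, topology_thread_cpumask(cpu)) {
		if (i == cpu)
			continue;
		if (!cpumask_test_cpu(i, tick_get_broadcast_oneshot_mask()))
			return true;
	}
	return false;
}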

Also if it's HPET you could actually use multiple independent HPET channels.
I remember us discussing this a long time ago... Not sure if it's worth
it, but it may be a small relief.

> device and you don't want to miss your deadline just because you
> decided to be extra clever. RCU does not care much whether you run the
> callbacks a tick later or not. Time and timekeeping does.

You can at least check lockless if someone else has a <= timeout, right?

-Andi

---

Move C3 stop test outside lock

Avoid taking locks in the idle path for systems where the timer
doesn't stop in C3.

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index da800ff..9cf0415 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -456,23 +456,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
unsigned long flags;
int cpu;

- raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
-
/*
* Periodic mode does not care about the enter/exit of power
* states
*/
if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
- goto out;
+ return;

+ cpu = raw_smp_processor_id();
bc = tick_broadcast_device.evtdev;
- cpu = smp_processor_id();
td = &per_cpu(tick_cpu_device, cpu);
dev = td->evtdev;

if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
- goto out;
+ return;

+ raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());

2011-05-04 23:29:57

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

On Thu, 5 May 2011, Andi Kleen wrote:
> > No, it does not even need refcounting. We can access it outside of the
>
> Ok.
>
> > lock as this is atomic context called on the cpu which is about to go
> > idle and therefore the device cannot go away. Easy and straightforward
> > fix.
>
> Ok. Patch appended. Looks good?

Mostly. See below.

> BTW why must the lock be irqsave?

Good question. Probably safety first paranoia :)

Indeed that code should only be called from irq disabled regions, so
we could avoid the irqsave there. Otherwise that needs to be irqsave
for obvious reasons.

> > > But yes it would be still good to fix Nehalem too.
> > >
> > > One fix would be to make all the masks hierarchical,
> > > similar to what RCU does. Perhaps even some code
> > > could be shared with RCU on that because it's a very
> > > similar problem.
> >
> > In theory. It's not about the mask. The mask is uninteresting. It's
> > about the expiry time, which we have to protect. There is nothing
> > hierarchical about that. It all boils down to _ONE_ single functional
>
> The mask can be used to see if another thread on this core is still
> running. If yes you don't need that. Right now Linux doesn't
> know that, but it could be taught. The only problem is that once
> the other guy goes idle too their timeouts have to be merged.
>
> This would cut contention in half.

That makes sense, but merging the timeouts race free will be a real
PITA.

> Also if it's HPET you could actually use multiple independent HPET channels.
> I remember us discussing this a long time ago... Not sure if it's worth
> it, but it may be a small relief.

Multiple broadcast devices. That sounds still horrible :)

> > device and you don't want to miss your deadline just because you
> > decided to be extra clever. RCU does not care much whether you run the
> > callbacks a tick later or not. Time and timekeeping does.
>
> You can at least check lockless if someone else has a <= timeout, right?

Might be worth a try. Need some sleep to remember why I discarded that
idea long ago.

> -Andi
>
> ---
>
> Move C3 stop test outside lock
>
> Avoid taking locks in the idle path for systems where the timer
> doesn't stop in C3.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index da800ff..9cf0415 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -456,23 +456,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
> unsigned long flags;
> int cpu;
>
> - raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
> -
> /*
> * Periodic mode does not care about the enter/exit of power
> * states
> */
> if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> - goto out;
> + return;
>
> + cpu = raw_smp_processor_id();

Why raw_ ? As I said above this should always be called with irqs
disabled.

If that ever gets called from an irq enabled, preemptible and
migratable context then we just open up a very narrow but ugly to
debug race window as we can look at the wrong per cpu device.

> bc = tick_broadcast_device.evtdev;
> - cpu = smp_processor_id();
> td = &per_cpu(tick_cpu_device, cpu);
> dev = td->evtdev;

Thanks,

tglx

2011-05-04 23:42:44

by Andi Kleen

Subject: Re: idle issues running sembench on 128 cpus

On Thu, May 05, 2011 at 01:29:49AM +0200, Thomas Gleixner wrote:
> > > hierarchical about that. It all boils down to _ONE_ single functional
> >
> > The mask can be used to see if another thread on this core is still
> > running. If yes you don't need that. Right now Linux doesn't
> > know that, but it could be taught. The only problem is that once
> > the other guy goes idle too their timeouts have to be merged.
> >
> > This would cut contention in half.
>
> That makes sense, but merging the timeouts race free will be a real
> PITA.

For this case one could actually use a spinlock between the siblings.
That shouldn't be a problem as long as it's not a global spinlock.

>
> > Also if it's HPET you could actually use multiple independent HPET channels.
> > I remember us discussing this a long time ago... Not sure if it's worth
> > it, but it may be a small relief.
>
> Multiple broadcast devices. That sounds still horrible :)


It would cut contention in half or more at least. Not great,
but sometimes you take everything you can get.

> Might be worth a try. Need some sleep to remember why I discarded that
> idea long ago.

Ok.

Here's a new patch without the raw. Boots on my Westmere.

---

From: Andi Kleen <[email protected]>
Subject: [PATCH] Move C3 stop test outside lock

Avoid taking locks in the idle path for systems where the timer
doesn't stop in C3.

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index da800ff..9cf0415 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -456,23 +456,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
unsigned long flags;
int cpu;

- raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
-
/*
* Periodic mode does not care about the enter/exit of power
* states
*/
if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
- goto out;
+ return;

+ cpu = raw_smp_processor_id();
bc = tick_broadcast_device.evtdev;
- cpu = smp_processor_id();
td = &per_cpu(tick_cpu_device, cpu);
dev = td->evtdev;

if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
- goto out;
+ return;

+ raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());

2011-05-04 23:47:53

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

On Thu, 5 May 2011, Andi Kleen wrote:
> On Thu, May 05, 2011 at 01:29:49AM +0200, Thomas Gleixner wrote:
> > That makes sense, but merging the timeouts race free will be a real
> > PITA.
>
> For this case one could actually use a spinlock between the siblings.
> That shouldn't be a problem as long as it's not a global spinlock.

Care to give it a try ?

> > > Also if it's HPET you could actually use multiple independent HPET channels.
> > > I remember us discussing this a long time ago... Not sure if it's worth
> > > it, but it may be a small relief.
> >
> > Multiple broadcast devices. That sounds still horrible :)
>
>
> It would cut contention in half or more at least. Not great,
> but sometimes you take everything you can get.

To a certain degree. If the code pain is larger than the benefit ...

> Here's a new patch without the raw. Boots on my Westmere.

> + cpu = raw_smp_processor_id();

Hmm. quilt refresh perhaps ? I know that feeling :)

Thanks,

tglx

2011-05-04 23:48:09

by Andi Kleen

Subject: Re: idle issues running sembench on 128 cpus II

> Here's a new patch without the raw. Boots on my Westmere.

Sorry, I appended the old patch. Here's the real new version.

From: Andi Kleen <[email protected]>
Date: Wed, 4 May 2011 15:09:27 -0700
Subject: [PATCH] Move C3 stop test outside lock

Avoid taking locks in the idle path for systems where the timer
doesn't stop in C3.

Signed-off-by: Andi Kleen <[email protected]>

diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index da800ff..7f565b7 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -456,23 +456,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
unsigned long flags;
int cpu;

- raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
-
/*
* Periodic mode does not care about the enter/exit of power
* states
*/
if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
- goto out;
+ return;

- bc = tick_broadcast_device.evtdev;
cpu = smp_processor_id();
+ bc = tick_broadcast_device.evtdev;
td = &per_cpu(tick_cpu_device, cpu);
dev = td->evtdev;

if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
- goto out;
+ return;

+ raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());

2011-05-04 23:49:55

by Andi Kleen

Subject: Re: idle issues running sembench on 128 cpus

Thomas Gleixner <[email protected]> writes:

> On Thu, 5 May 2011, Andi Kleen wrote:
>> On Thu, May 05, 2011 at 01:29:49AM +0200, Thomas Gleixner wrote:
>> > That makes sense, but merging the timeouts race free will be a real
>> > PITA.
>>
>> For this case one could actually use a spinlock between the siblings.
>> That shouldn't be a problem as long as it's not a global spinlock.
>
> Care to give it a try ?

Ok, will try tomorrow.

>> Here's a new patch without the raw. Boots on my Westmere.
>
>> + cpu = raw_smp_processor_id();
>
> Hmm. quilt refresh perhaps ? I know that feeling :)

The scp to copy the patch was too slow. Noticed it later and sent a new
one.

-Andi
--
[email protected] -- Speaking for myself only

2011-05-04 23:51:57

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

On Wed, 4 May 2011, Andi Kleen wrote:

> Thomas Gleixner <[email protected]> writes:
>
> > On Thu, 5 May 2011, Andi Kleen wrote:
> >> On Thu, May 05, 2011 at 01:29:49AM +0200, Thomas Gleixner wrote:
> >> > That makes sense, but merging the timeouts race free will be a real
> >> > PITA.
> >>
> >> For this case one could actually use a spinlock between the siblings.
> >> That shouldn't be a problem as long as it's not a global spinlock.
> >
> > Care to give it a try ?
>
> Ok, will try tomorrow.
>
> >> Here's a new patch without the raw. Boots on my Westmere.
> >
> >> + cpu = raw_smp_processor_id();
> >
> > Hmm. quilt refresh perhaps ? I know that feeling :)
>
> The scp to copy the patch was too slow. Noticed it later and sent a new
> one.

NP. Happens to hit everybody from time to time :)

Thanks,

tglx

2011-05-05 13:58:28

by Thomas Gleixner

Subject: Re: idle issues running sembench on 128 cpus

On Thu, 5 May 2011, Thomas Gleixner wrote:
> On Thu, 5 May 2011, Andi Kleen wrote:
> > > No, it does not even need refcounting. We can access it outside of the
> >
> > Ok.
> >
> > > lock as this is atomic context called on the cpu which is about to go
> > > idle and therefore the device cannot go away. Easy and straightforward
> > > fix.
> >
> > Ok. Patch appended. Looks good?
>
> Mostly. See below.
>
> > BTW why must the lock be irqsave?
>
> Good question. Probably safety first paranoia :)
>
> Indeed that code should only be called from irq disabled regions, so
> we could avoid the irqsave there. Otherwise that needs to be irqsave
> for obvious reasons.

Just looked through all the call sites. Both intel_idle and
processor_idle notify ENTER with interrupts disabled, but EXIT with
interrupts enabled. So when we want to remove irqsave from the
spinlock, that needs to be fixed as well.

Thanks,

tglx

2011-05-05 15:25:58

by Dave Kleikamp

Subject: Re: idle issues running sembench on 128 cpus II

On 05/04/2011 06:48 PM, Andi Kleen wrote:
>> Here's a new patch without the raw. Boots on my Westmere.
>
> Sorry, I appended the old patch. Here's the real new version.
>
> From: Andi Kleen <[email protected]>
> Date: Wed, 4 May 2011 15:09:27 -0700
> Subject: [PATCH] Move C3 stop test outside lock
>
> Avoid taking locks in the idle path for systems where the timer
> doesn't stop in C3.
>
> Signed-off-by: Andi Kleen <[email protected]>
>
> diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
> index da800ff..7f565b7 100644
> --- a/kernel/time/tick-broadcast.c
> +++ b/kernel/time/tick-broadcast.c
> @@ -456,23 +456,22 @@ void tick_broadcast_oneshot_control(unsigned long reason)
> unsigned long flags;
> int cpu;
>
> - raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
> -
> /*
> * Periodic mode does not care about the enter/exit of power
> * states
> */
> if (tick_broadcast_device.mode == TICKDEV_MODE_PERIODIC)
> - goto out;
> + return;
>
> - bc = tick_broadcast_device.evtdev;
> cpu = smp_processor_id();
> + bc = tick_broadcast_device.evtdev;
> td = &per_cpu(tick_cpu_device, cpu);
> dev = td->evtdev;
>
> if (!(dev->features & CLOCK_EVT_FEAT_C3STOP))
> - goto out;
> + return;

The out: label is now defined but never used in this function.
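
The tail of the function presumably still looks something like

out:
	raw_spin_unlock_irqrestore(&tick_broadcast_lock, flags);

so with the gotos gone, gcc will warn about the unused label; it
should just be dropped.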

>
> + raw_spin_lock_irqsave(&tick_broadcast_lock, flags);
> if (reason == CLOCK_EVT_NOTIFY_BROADCAST_ENTER) {
> if (!cpumask_test_cpu(cpu, tick_get_broadcast_oneshot_mask())) {
> cpumask_set_cpu(cpu, tick_get_broadcast_oneshot_mask());
>