2009-06-29 06:55:58

by Luming Yu

[permalink] [raw]
Subject: [RFC patch] Use IPI_shortcut for lapic timer broadcast

Hello,

We need to use IPI shortcut to send lapic timer broadcast
to avoid the latency of sending IPI one bye one on systems with many
logical processors when NO_HZ is disabled.
Without this patch,I have seen upstream kernel with RHEL 5 kernel
config boot hang .

The patch also changes physflat_send_IPI_all to IPI shortcut mode.

Please review, and apply.

**The patch is enclosed in text attachment*
**Using web client to send the patch* *
**below is for review, please apply attached patch*/

Thanks,
Luming


Signed-off-by: Yu Luming <[email protected]>

apic.c | 4 +++-
apic_flat_64.c | 7 ++++++-
2 files changed, 9 insertions(+), 2 deletions(-)

--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-28
20:22:55.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-29
00:21:44.000000000 -0600
@@ -419,7 +419,9 @@
static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
- apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
+ if (cpus_empty(*mask))
+ return;
+ apic->send_IPI_all(LOCAL_TIMER_VECTOR);
#endif
}

--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
00:13:26.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
00:11:23.000000000 -0600
@@ -274,7 +274,12 @@

static void physflat_send_IPI_all(int vector)
{
- physflat_send_IPI_mask(cpu_online_mask, vector);
+ if (vector == NMI_VECTOR) {
+ physflat_send_IPI_mask(cpu_online_mask, vector);
+ } else {
+ __default_send_IPI_shortcut(APIC_DEST_ALLINC,
+ vector, apic->dest_logical);
+ }
}

static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)


Attachments:
1.patch (999.00 B)

2009-06-29 07:21:01

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast


* Luming Yu <[email protected]> wrote:

> Hello,
>
> We need to use IPI shortcut to send lapic timer broadcast
> to avoid the latency of sending IPI one bye one on systems with many
> logical processors when NO_HZ is disabled.
> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> config boot hang .

hm, that might be a valid optimization - but why does the lack of
this optimization result in a hang?

Ingo

2009-06-29 08:04:48

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> Hello,
>>
>> We need to use IPI shortcut to send lapic timer broadcast
>> to avoid the latency of sending IPI one bye one on systems with many
>> logical processors when NO_HZ is disabled.
>> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> config boot hang .
>
> hm, that might be a valid optimization - but why does the lack of
> this optimization result in a hang?

It is hang caused by kernel code for work around lapic-timer-stop issue.
With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu 0 will be
busy working on send TIMER IPI instead of making progress in boot
(right after deep-C-state has been used).




>
>        Ingo
>

2009-06-29 08:16:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast


* Luming Yu <[email protected]> wrote:

> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
> >
> > * Luming Yu <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> We need to use IPI shortcut to send lapic timer broadcast
> >> to avoid the latency of sending IPI one bye one on systems with many
> >> logical processors when NO_HZ is disabled.
> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> >> config boot hang .
> >
> > hm, that might be a valid optimization - but why does the lack of
> > this optimization result in a hang?
>
> It is hang caused by kernel code for work around lapic-timer-stop
> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
> 0 will be busy working on send TIMER IPI instead of making
> progress in boot (right after deep-C-state has been used).

that's a bit weird. With HZ=1000 we have 1000 usecs between each
timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
that means that each IPI takes more than ~15 microseconds to
process. On what hardware/platform can this happen realistically?

Ingo

2009-06-29 08:22:10

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>> >
>> > * Luming Yu <[email protected]> wrote:
>> >
>> >> Hello,
>> >>
>> >> We need to use IPI shortcut to send lapic timer broadcast
>> >> to avoid the latency of sending IPI one bye one on systems with many
>> >> logical processors when NO_HZ is disabled.
>> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> >> config boot hang .
>> >
>> > hm, that might be a valid optimization - but why does the lack of
>> > this optimization result in a hang?
>>
>> It is hang caused by kernel code for work around lapic-timer-stop
>> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
>> 0 will be busy working on send TIMER IPI instead of making
>> progress in boot (right after deep-C-state has been used).
>
> that's a bit weird. With HZ=1000 we have 1000 usecs between each
> timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
> that means that each IPI takes more than ~15 microseconds to
> process. On what hardware/platform can this happen realistically?

https://bugzilla.redhat.com/show_bug.cgi?id=499271

Someone has measured that it needs 50-100us latency to send one IPI

>
>        Ingo
>

2009-06-29 08:30:55

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast


* Luming Yu <[email protected]> wrote:

> On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
> >
> > * Luming Yu <[email protected]> wrote:
> >
> >> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
> >> >
> >> > * Luming Yu <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> We need to use IPI shortcut to send lapic timer broadcast
> >> >> to avoid the latency of sending IPI one bye one on systems with many
> >> >> logical processors when NO_HZ is disabled.
> >> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> >> >> config boot hang .
> >> >
> >> > hm, that might be a valid optimization - but why does the lack of
> >> > this optimization result in a hang?
> >>
> >> It is hang caused by kernel code for work around lapic-timer-stop
> >> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
> >> 0 will be busy working on send TIMER IPI instead of making
> >> progress in boot (right after deep-C-state has been used).
> >
> > that's a bit weird. With HZ=1000 we have 1000 usecs between each
> > timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
> > that means that each IPI takes more than ~15 microseconds to
> > process. On what hardware/platform can this happen realistically?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=499271
>
> Someone has measured that it needs 50-100us latency to send one
> IPI

Ugh. What platform is it that takes this much time to pass an IPI?

IPIs are the lifeline of process messaging under Linux. TLB flushes
in threaded apps rely on it (heavily), the scheduler relies on it
for wakeups (heavily) and a lot of other code relies on IPIs as
well.

Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
within 10-20 microseconds more than a decade ago - so 50-100 usecs
latency on a modern platform is totally out of this planet and will
hurt Linux performance big time. And the worst thing about it is
that none of the usual performance metrics will really show _why_
performance is tanking ...

Ingo

2009-06-29 08:43:31

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Mon, Jun 29, 2009 at 4:30 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
>> >
>> > * Luming Yu <[email protected]> wrote:
>> >
>> >> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>> >> >
>> >> > * Luming Yu <[email protected]> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> We need to use IPI shortcut to send lapic timer broadcast
>> >> >> to avoid the latency of sending IPI one bye one on systems with many
>> >> >> logical processors when NO_HZ is disabled.
>> >> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> >> >> config boot hang .
>> >> >
>> >> > hm, that might be a valid optimization - but why does the lack of
>> >> > this optimization result in a hang?
>> >>
>> >> It is hang caused by kernel code for work around lapic-timer-stop
>> >> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
>> >> 0 will be busy working on send TIMER IPI instead of making
>> >> progress in boot (right after deep-C-state has been used).
>> >
>> > that's a bit weird. With HZ=1000 we have 1000 usecs between each
>> > timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
>> > that means that each IPI takes more than ~15 microseconds to
>> > process. On what hardware/platform can this happen realistically?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=499271
>>
>> Someone has measured that it needs 50-100us latency to send one
>> IPI
>
> Ugh. What platform is it that takes this much time to pass an IPI?
>
> IPIs are the lifeline of process messaging under Linux. TLB flushes
> in threaded apps rely on it (heavily), the scheduler relies on it
> for wakeups (heavily) and a lot of other code relies on IPIs as
> well.
>
> Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
> within 10-20 microseconds more than a decade ago - so 50-100 usecs
> latency on a modern platform is totally out of this planet and will
> hurt Linux performance big time. And the worst thing about it is
> that none of the usual performance metrics will really show _why_
> performance is tanking ...
>


Please note this is deep-C-state related.
C state does add extra latency.. but I don't know how much...

2009-06-29 09:15:52

by Ingo Molnar

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast


* Luming Yu <[email protected]> wrote:

> Please note this is deep-C-state related. C state does add extra
> latency.. but I don't know how much...

ok, so if each of the CPUs has to be brought out of deep-C and with
each one 50-100 usecs a pop one could see the total time taking more
than 1000 usecs. That would explain the hang.

Ingo

2009-06-29 14:00:43

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Mon, 29 Jun 2009 16:43:16 +0800
Luming Yu <[email protected]> wrote:

> > Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
> > within 10-20 microseconds more than a decade ago - so 50-100 usecs
> > latency on a modern platform is totally out of this planet and will
> > hurt Linux performance big time. And the worst thing about it is
> > that none of the usual performance metrics will really show _why_
> > performance is tanking ...
> >
>
>
> Please note this is deep-C-state related.
> C state does add extra latency.. but I don't know how much...

C states normally only add on the "wait for" side, not on the "send"
side.

(RHEL5 is rather old, so it may have done things a bit different)

Maybe it is time for mainline to not allow a !NO_HZ config....


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2009-06-29 20:35:56

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Sun, 2009-06-28 at 23:47 -0700, Luming Yu wrote:
> Hello,
>
> We need to use IPI shortcut to send lapic timer broadcast
> to avoid the latency of sending IPI one bye one on systems with many
> logical processors when NO_HZ is disabled.
> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> config boot hang .
>
> The patch also changes physflat_send_IPI_all to IPI shortcut mode.
>
> Please review, and apply.
>
> **The patch is enclosed in text attachment*
> **Using web client to send the patch* *
> **below is for review, please apply attached patch*/
>
> Thanks,
> Luming
>
>
> Signed-off-by: Yu Luming <[email protected]>
>
> apic.c | 4 +++-
> apic_flat_64.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-28
> 20:22:55.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-29
> 00:21:44.000000000 -0600
> @@ -419,7 +419,9 @@
> static void lapic_timer_broadcast(const struct cpumask *mask)
> {
> #ifdef CONFIG_SMP
> - apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> + if (cpus_empty(*mask))
> + return;
> + apic->send_IPI_all(LOCAL_TIMER_VECTOR);
> #endif
> }
>

Hmm.. this change looks wrong. This will cause unnecessary timer
broadcasts when nohz is actually enabled. I mean, when nohz is enabled,
we want to send this interrupt broadcast to one or some subset of CPUs
and this change sends the interrupt to all CPUs.

On a 16 logical CPUs with nohz enabled, powertop without this patch I
see
Wakeups-from-idle per second : 11.6 interval: 15.0s
and with this patch I see
Wakeups-from-idle per second : 24.7 interval: 15.0s

Also, cat /proc/interrupts | grep LOC shows this problem where all CPUs
seem to get roughly same number of local APIC interrupts over a period
of time with this patch.

We should only do the allbutself when nohz is disabled with a big
unlikely() around that code.

Thanks,
Venki

> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
> 00:13:26.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
> 00:11:23.000000000 -0600
> @@ -274,7 +274,12 @@
>
> static void physflat_send_IPI_all(int vector)
> {
> - physflat_send_IPI_mask(cpu_online_mask, vector);
> + if (vector == NMI_VECTOR) {
> + physflat_send_IPI_mask(cpu_online_mask, vector);
> + } else {
> + __default_send_IPI_shortcut(APIC_DEST_ALLINC,
> + vector, apic->dest_logical);
> + }
> }
>
> static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)

2009-06-30 07:01:17

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

Thanks for review. How about the following patch?

/**The patch is enclosed in text attachment**
**Using web client to send the patch**
**below is for review, please apply attached patch*/


Signed-off-by: Yu Luming <[email protected]>

apic.c | 5 ++++-
apic_flat_64.c | 7 ++++++-
2 files changed, 10 insertions(+), 2 deletions(-)


--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-29
23:45:05.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-30
00:37:56.000000000 -0600
@@ -419,7 +419,10 @@
static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
- apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
+ if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
+ apic->send_IPI_all(LOCAL_TIMER_VECTOR);
+ else
+ apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
#endif
}

--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
00:13:26.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
00:11:23.000000000 -0600
@@ -274,7 +274,12 @@

static void physflat_send_IPI_all(int vector)
{
- physflat_send_IPI_mask(cpu_online_mask, vector);
+ if (vector == NMI_VECTOR) {
+ physflat_send_IPI_mask(cpu_online_mask, vector);
+ } else {
+ __default_send_IPI_shortcut(APIC_DEST_ALLINC,
+ vector, apic->dest_logical);
+ }
}

static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)


Attachments:
1.patch (1.06 kB)

2009-07-03 00:25:00

by Pallipadi, Venkatesh

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Tue, 2009-06-30 at 00:01 -0700, Luming Yu wrote:
> Thanks for review. How about the following patch?
>
> /**The patch is enclosed in text attachment**
> **Using web client to send the patch**
> **below is for review, please apply attached patch*/
>
>
> Signed-off-by: Yu Luming <[email protected]>
>
> apic.c | 5 ++++-
> apic_flat_64.c | 7 ++++++-
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-29
> 23:45:05.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-30
> 00:37:56.000000000 -0600
> @@ -419,7 +419,10 @@
> static void lapic_timer_broadcast(const struct cpumask *mask)
> {
> #ifdef CONFIG_SMP
> - apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> + if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
> + apic->send_IPI_all(LOCAL_TIMER_VECTOR);
> + else
> + apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> #endif
> }

I dont think it is a good idea to pay the penalty for cpumask_weight and
num_online_cpus, for the more common tickless case to fix this problem
with less common no tickeless case.

We should be able to add an interface to get tick_nohz_enabled from
tick-sched.c and use that instead.

Thanks,
Venki

>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
> 00:13:26.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
> 00:11:23.000000000 -0600
> @@ -274,7 +274,12 @@
>
> static void physflat_send_IPI_all(int vector)
> {
> - physflat_send_IPI_mask(cpu_online_mask, vector);
> + if (vector == NMI_VECTOR) {
> + physflat_send_IPI_mask(cpu_online_mask, vector);
> + } else {
> + __default_send_IPI_shortcut(APIC_DEST_ALLINC,
> + vector, apic->dest_logical);
> + }
> }
>
> static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)

2009-07-03 02:04:52

by Luming Yu

[permalink] [raw]
Subject: Re: [RFC patch] Use IPI_shortcut for lapic timer broadcast

On Fri, Jul 3, 2009 at 8:23 AM, Pallipadi,
Venkatesh<[email protected]> wrote:
> On Tue, 2009-06-30 at 00:01 -0700, Luming Yu wrote:
>> Thanks for review. How about the following patch?
>>
>> /**The patch is enclosed in text attachment**
>> **Using web client to send the patch**
>> **below is for review, please apply attached  patch*/
>>
>>
>> Signed-off-by: Yu Luming <[email protected]>
>>
>>  apic.c         |    5 ++++-
>>  apic_flat_64.c |    7 ++++++-
>>  2 files changed, 10 insertions(+), 2 deletions(-)
>>
>>
>> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0    2009-06-29
>> 23:45:05.000000000 -0600
>> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c      2009-06-30
>> 00:37:56.000000000 -0600
>> @@ -419,7 +419,10 @@
>>  static void lapic_timer_broadcast(const struct cpumask *mask)
>>  {
>>  #ifdef CONFIG_SMP
>> -     apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
>> +     if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
>> +             apic->send_IPI_all(LOCAL_TIMER_VECTOR);
>> +     else
>> +             apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
>>  #endif
>>  }
>
> I dont think it is a good idea to pay the penalty for cpumask_weight and
> num_online_cpus, for the more common tickless case to fix this problem
> with less common no tickeless case.

In NO_HZ, the lapic_timer_broadcast is rare thing too...

>
> We should be able to add an interface to get tick_nohz_enabled from
> tick-sched.c and use that instead.

If we use send_IPI_all just for NO_HZ is disabled, we will leave a corner case
here that lapic_timer_broadcast could be really really being actively used even
NO_HZ enabled... if this will never happen, I would agree with you..

Thanks,
LUming