Hello,
We need to use IPI shortcut to send lapic timer broadcast
to avoid the latency of sending IPI one bye one on systems with many
logical processors when NO_HZ is disabled.
Without this patch,I have seen upstream kernel with RHEL 5 kernel
config boot hang .
The patch also changes physflat_send_IPI_all to IPI shortcut mode.
Please review, and apply.
**The patch is enclosed in text attachment*
**Using web client to send the patch* *
**below is for review, please apply attached patch*/
Thanks,
Luming
Signed-off-by: Yu Luming <[email protected]>
apic.c | 4 +++-
apic_flat_64.c | 7 ++++++-
2 files changed, 9 insertions(+), 2 deletions(-)
--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-28
20:22:55.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-29
00:21:44.000000000 -0600
@@ -419,7 +419,9 @@
static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
- apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
+ if (cpus_empty(*mask))
+ return;
+ apic->send_IPI_all(LOCAL_TIMER_VECTOR);
#endif
}
--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
00:13:26.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
00:11:23.000000000 -0600
@@ -274,7 +274,12 @@
static void physflat_send_IPI_all(int vector)
{
- physflat_send_IPI_mask(cpu_online_mask, vector);
+ if (vector == NMI_VECTOR) {
+ physflat_send_IPI_mask(cpu_online_mask, vector);
+ } else {
+ __default_send_IPI_shortcut(APIC_DEST_ALLINC,
+ vector, apic->dest_logical);
+ }
}
static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)
* Luming Yu <[email protected]> wrote:
> Hello,
>
> We need to use IPI shortcut to send lapic timer broadcast
> to avoid the latency of sending IPI one bye one on systems with many
> logical processors when NO_HZ is disabled.
> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> config boot hang .
hm, that might be a valid optimization - but why does the lack of
this optimization result in a hang?
Ingo
On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> Hello,
>>
>> We need to use IPI shortcut to send lapic timer broadcast
>> to avoid the latency of sending IPI one bye one on systems with many
>> logical processors when NO_HZ is disabled.
>> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> config boot hang .
>
> hm, that might be a valid optimization - but why does the lack of
> this optimization result in a hang?
It is hang caused by kernel code for work around lapic-timer-stop issue.
With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu 0 will be
busy working on send TIMER IPI instead of making progress in boot
(right after deep-C-state has been used).
>
> Ingo
>
* Luming Yu <[email protected]> wrote:
> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
> >
> > * Luming Yu <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> We need to use IPI shortcut to send lapic timer broadcast
> >> to avoid the latency of sending IPI one bye one on systems with many
> >> logical processors when NO_HZ is disabled.
> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> >> config boot hang .
> >
> > hm, that might be a valid optimization - but why does the lack of
> > this optimization result in a hang?
>
> It is hang caused by kernel code for work around lapic-timer-stop
> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
> 0 will be busy working on send TIMER IPI instead of making
> progress in boot (right after deep-C-state has been used).
that's a bit weird. With HZ=1000 we have 1000 usecs between each
timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
that means that each IPI takes more than ~15 microseconds to
process. On what hardware/platform can this happen realistically?
Ingo
On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>> >
>> > * Luming Yu <[email protected]> wrote:
>> >
>> >> Hello,
>> >>
>> >> We need to use IPI shortcut to send lapic timer broadcast
>> >> to avoid the latency of sending IPI one bye one on systems with many
>> >> logical processors when NO_HZ is disabled.
>> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> >> config boot hang .
>> >
>> > hm, that might be a valid optimization - but why does the lack of
>> > this optimization result in a hang?
>>
>> It is hang caused by kernel code for work around lapic-timer-stop
>> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
>> 0 will be busy working on send TIMER IPI instead of making
>> progress in boot (right after deep-C-state has been used).
>
> that's a bit weird. With HZ=1000 we have 1000 usecs between each
> timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
> that means that each IPI takes more than ~15 microseconds to
> process. On what hardware/platform can this happen realistically?
https://bugzilla.redhat.com/show_bug.cgi?id=499271
Someone has measured that it needs 50-100us latency to send one IPI
>
> Ingo
>
* Luming Yu <[email protected]> wrote:
> On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
> >
> > * Luming Yu <[email protected]> wrote:
> >
> >> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
> >> >
> >> > * Luming Yu <[email protected]> wrote:
> >> >
> >> >> Hello,
> >> >>
> >> >> We need to use IPI shortcut to send lapic timer broadcast
> >> >> to avoid the latency of sending IPI one bye one on systems with many
> >> >> logical processors when NO_HZ is disabled.
> >> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> >> >> config boot hang .
> >> >
> >> > hm, that might be a valid optimization - but why does the lack of
> >> > this optimization result in a hang?
> >>
> >> It is hang caused by kernel code for work around lapic-timer-stop
> >> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
> >> 0 will be busy working on send TIMER IPI instead of making
> >> progress in boot (right after deep-C-state has been used).
> >
> > that's a bit weird. With HZ=1000 we have 1000 usecs between each
> > timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
> > that means that each IPI takes more than ~15 microseconds to
> > process. On what hardware/platform can this happen realistically?
>
> https://bugzilla.redhat.com/show_bug.cgi?id=499271
>
> Someone has measured that it needs 50-100us latency to send one
> IPI
Ugh. What platform is it that takes this much time to pass an IPI?
IPIs are the lifeline of process messaging under Linux. TLB flushes
in threaded apps rely on it (heavily), the scheduler relies on it
for wakeups (heavily) and a lot of other code relies on IPIs as
well.
Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
within 10-20 microseconds more than a decade ago - so 50-100 usecs
latency on a modern platform is totally out of this planet and will
hurt Linux performance big time. And the worst thing about it is
that none of the usual performance metrics will really show _why_
performance is tanking ...
Ingo
On Mon, Jun 29, 2009 at 4:30 PM, Ingo Molnar<[email protected]> wrote:
>
> * Luming Yu <[email protected]> wrote:
>
>> On Mon, Jun 29, 2009 at 4:16 PM, Ingo Molnar<[email protected]> wrote:
>> >
>> > * Luming Yu <[email protected]> wrote:
>> >
>> >> On Mon, Jun 29, 2009 at 3:20 PM, Ingo Molnar<[email protected]> wrote:
>> >> >
>> >> > * Luming Yu <[email protected]> wrote:
>> >> >
>> >> >> Hello,
>> >> >>
>> >> >> We need to use IPI shortcut to send lapic timer broadcast
>> >> >> to avoid the latency of sending IPI one bye one on systems with many
>> >> >> logical processors when NO_HZ is disabled.
>> >> >> Without this patch,I have seen upstream kernel with RHEL 5 kernel
>> >> >> config boot hang .
>> >> >
>> >> > hm, that might be a valid optimization - but why does the lack of
>> >> > this optimization result in a hang?
>> >>
>> >> It is hang caused by kernel code for work around lapic-timer-stop
>> >> issue. With HZ=1000, and a lot of cpus (eg. 64 logical cpus), cpu
>> >> 0 will be busy working on send TIMER IPI instead of making
>> >> progress in boot (right after deep-C-state has been used).
>> >
>> > that's a bit weird. With HZ=1000 we have 1000 usecs between each
>> > timer tick. Assuming a CPU sends to a lot of CPUs (64 logical CPUs)
>> > that means that each IPI takes more than ~15 microseconds to
>> > process. On what hardware/platform can this happen realistically?
>>
>> https://bugzilla.redhat.com/show_bug.cgi?id=499271
>>
>> Someone has measured that it needs 50-100us latency to send one
>> IPI
>
> Ugh. What platform is it that takes this much time to pass an IPI?
>
> IPIs are the lifeline of process messaging under Linux. TLB flushes
> in threaded apps rely on it (heavily), the scheduler relies on it
> for wakeups (heavily) and a lot of other code relies on IPIs as
> well.
>
> Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
> within 10-20 microseconds more than a decade ago - so 50-100 usecs
> latency on a modern platform is totally out of this planet and will
> hurt Linux performance big time. And the worst thing about it is
> that none of the usual performance metrics will really show _why_
> performance is tanking ...
>
Please note this is deep-C-state related.
C state does add extra latency.. but I don't know how much...
* Luming Yu <[email protected]> wrote:
> Please note this is deep-C-state related. C state does add extra
> latency.. but I don't know how much...
ok, so if each of the CPUs has to be brought out of deep-C and with
each one 50-100 usecs a pop one could see the total time taking more
than 1000 usecs. That would explain the hang.
Ingo
On Mon, 29 Jun 2009 16:43:16 +0800
Luming Yu <[email protected]> wrote:
> > Even a Pentium-5 100 MHz dual box was able to do cross-CPU IPIs
> > within 10-20 microseconds more than a decade ago - so 50-100 usecs
> > latency on a modern platform is totally out of this planet and will
> > hurt Linux performance big time. And the worst thing about it is
> > that none of the usual performance metrics will really show _why_
> > performance is tanking ...
> >
>
>
> Please note this is deep-C-state related.
> C state does add extra latency.. but I don't know how much...
C states normally only add on the "wait for" side, not on the "send"
side.
(RHEL5 is rather old, so it may have done things a bit different)
Maybe it is time for mainline to not allow a !NO_HZ config....
--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org
On Sun, 2009-06-28 at 23:47 -0700, Luming Yu wrote:
> Hello,
>
> We need to use IPI shortcut to send lapic timer broadcast
> to avoid the latency of sending IPI one bye one on systems with many
> logical processors when NO_HZ is disabled.
> Without this patch,I have seen upstream kernel with RHEL 5 kernel
> config boot hang .
>
> The patch also changes physflat_send_IPI_all to IPI shortcut mode.
>
> Please review, and apply.
>
> **The patch is enclosed in text attachment*
> **Using web client to send the patch* *
> **below is for review, please apply attached patch*/
>
> Thanks,
> Luming
>
>
> Signed-off-by: Yu Luming <[email protected]>
>
> apic.c | 4 +++-
> apic_flat_64.c | 7 ++++++-
> 2 files changed, 9 insertions(+), 2 deletions(-)
>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-28
> 20:22:55.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-29
> 00:21:44.000000000 -0600
> @@ -419,7 +419,9 @@
> static void lapic_timer_broadcast(const struct cpumask *mask)
> {
> #ifdef CONFIG_SMP
> - apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> + if (cpus_empty(*mask))
> + return;
> + apic->send_IPI_all(LOCAL_TIMER_VECTOR);
> #endif
> }
>
Hmm.. this change looks wrong. This will cause unnecessary timer
broadcasts when nohz is actually enabled. I mean, when nohz is enabled,
we want to send this interrupt broadcast to one or some subset of CPUs
and this change sends the interrupt to all CPUs.
On a 16 logical CPUs with nohz enabled, powertop without this patch I
see
Wakeups-from-idle per second : 11.6 interval: 15.0s
and with this patch I see
Wakeups-from-idle per second : 24.7 interval: 15.0s
Also, cat /proc/interrupts | grep LOC shows this problem where all CPUs
seem to get roughly same number of local APIC interrupts over a period
of time with this patch.
We should only do the allbutself when nohz is disabled with a big
unlikely() around that code.
Thanks,
Venki
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
> 00:13:26.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
> 00:11:23.000000000 -0600
> @@ -274,7 +274,12 @@
>
> static void physflat_send_IPI_all(int vector)
> {
> - physflat_send_IPI_mask(cpu_online_mask, vector);
> + if (vector == NMI_VECTOR) {
> + physflat_send_IPI_mask(cpu_online_mask, vector);
> + } else {
> + __default_send_IPI_shortcut(APIC_DEST_ALLINC,
> + vector, apic->dest_logical);
> + }
> }
>
> static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)
Thanks for review. How about the following patch?
/**The patch is enclosed in text attachment**
**Using web client to send the patch**
**below is for review, please apply attached patch*/
Signed-off-by: Yu Luming <[email protected]>
apic.c | 5 ++++-
apic_flat_64.c | 7 ++++++-
2 files changed, 10 insertions(+), 2 deletions(-)
--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-29
23:45:05.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-30
00:37:56.000000000 -0600
@@ -419,7 +419,10 @@
static void lapic_timer_broadcast(const struct cpumask *mask)
{
#ifdef CONFIG_SMP
- apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
+ if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
+ apic->send_IPI_all(LOCAL_TIMER_VECTOR);
+ else
+ apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
#endif
}
--- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
00:13:26.000000000 -0600
+++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
00:11:23.000000000 -0600
@@ -274,7 +274,12 @@
static void physflat_send_IPI_all(int vector)
{
- physflat_send_IPI_mask(cpu_online_mask, vector);
+ if (vector == NMI_VECTOR) {
+ physflat_send_IPI_mask(cpu_online_mask, vector);
+ } else {
+ __default_send_IPI_shortcut(APIC_DEST_ALLINC,
+ vector, apic->dest_logical);
+ }
}
static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)
On Tue, 2009-06-30 at 00:01 -0700, Luming Yu wrote:
> Thanks for review. How about the following patch?
>
> /**The patch is enclosed in text attachment**
> **Using web client to send the patch**
> **below is for review, please apply attached patch*/
>
>
> Signed-off-by: Yu Luming <[email protected]>
>
> apic.c | 5 ++++-
> apic_flat_64.c | 7 ++++++-
> 2 files changed, 10 insertions(+), 2 deletions(-)
>
>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-29
> 23:45:05.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-30
> 00:37:56.000000000 -0600
> @@ -419,7 +419,10 @@
> static void lapic_timer_broadcast(const struct cpumask *mask)
> {
> #ifdef CONFIG_SMP
> - apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> + if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
> + apic->send_IPI_all(LOCAL_TIMER_VECTOR);
> + else
> + apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
> #endif
> }
I dont think it is a good idea to pay the penalty for cpumask_weight and
num_online_cpus, for the more common tickless case to fix this problem
with less common no tickeless case.
We should be able to add an interface to get tick_nohz_enabled from
tick-sched.c and use that instead.
Thanks,
Venki
>
> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c.0 2009-06-29
> 00:13:26.000000000 -0600
> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic_flat_64.c 2009-06-29
> 00:11:23.000000000 -0600
> @@ -274,7 +274,12 @@
>
> static void physflat_send_IPI_all(int vector)
> {
> - physflat_send_IPI_mask(cpu_online_mask, vector);
> + if (vector == NMI_VECTOR) {
> + physflat_send_IPI_mask(cpu_online_mask, vector);
> + } else {
> + __default_send_IPI_shortcut(APIC_DEST_ALLINC,
> + vector, apic->dest_logical);
> + }
> }
>
> static unsigned int physflat_cpu_mask_to_apicid(const struct cpumask *cpumask)
On Fri, Jul 3, 2009 at 8:23 AM, Pallipadi,
Venkatesh<[email protected]> wrote:
> On Tue, 2009-06-30 at 00:01 -0700, Luming Yu wrote:
>> Thanks for review. How about the following patch?
>>
>> /**The patch is enclosed in text attachment**
>> **Using web client to send the patch**
>> **below is for review, please apply attached patch*/
>>
>>
>> Signed-off-by: Yu Luming <[email protected]>
>>
>> apic.c | 5 ++++-
>> apic_flat_64.c | 7 ++++++-
>> 2 files changed, 10 insertions(+), 2 deletions(-)
>>
>>
>> --- linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c.0 2009-06-29
>> 23:45:05.000000000 -0600
>> +++ linux-2.6.30-rc6/arch/x86/kernel/apic/apic.c 2009-06-30
>> 00:37:56.000000000 -0600
>> @@ -419,7 +419,10 @@
>> static void lapic_timer_broadcast(const struct cpumask *mask)
>> {
>> #ifdef CONFIG_SMP
>> - apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
>> + if (unlikely(cpumask_weight(mask) == num_online_cpus() -1))
>> + apic->send_IPI_all(LOCAL_TIMER_VECTOR);
>> + else
>> + apic->send_IPI_mask(mask, LOCAL_TIMER_VECTOR);
>> #endif
>> }
>
> I dont think it is a good idea to pay the penalty for cpumask_weight and
> num_online_cpus, for the more common tickless case to fix this problem
> with less common no tickeless case.
In NO_HZ, the lapic_timer_broadcast is rare thing too...
>
> We should be able to add an interface to get tick_nohz_enabled from
> tick-sched.c and use that instead.
If we use send_IPI_all just for NO_HZ is disabled, we will leave a corner case
here that lapic_timer_broadcast could be really really being actively used even
NO_HZ enabled... if this will never happen, I would agree with you..
Thanks,
LUming