2008-11-04 17:33:32

by Lukas Hejtmanek

[permalink] [raw]
Subject: Bootup time regression from 2.6.27 to 2.6.28-rc3+

Hello Rafael,

not sure whether this is already reported or not. However, in 2.6.28-rc3
I discovered about 10 secs delay before mounting the root fs and running init
(not sure whether the delay is caused by mount or init). It did not appear in
2.6.27. I did not try any previous 2.6.28-rcX.

This is from dmesg in 2.6.28-rc3.

[ 7.880966] IBM TrackPoint firmware: 0x0e, buttons: 3/3
[ 8.103116] input: TPPS/2 IBM TrackPoint as
/devices/platform/i8042/serio1/serio2/input/input7
[ 28.993523] kjournald starting. Commit interval 5 seconds
[ 28.993581] EXT3-fs: mounted filesystem with ordered data mode.
[ 28.993638] VFS: Mounted root (ext3 filesystem) readonly.
[ 28.993675] Freeing unused kernel memory: 332k freed
[ 37.697648] uhci_hcd: USB Universal Host Controller Interface driver
[ 37.697740] uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ
20
[ 37.697779] uhci_hcd 0000:00:1a.0: setting latency timer to 64
[ 37.697785] uhci_hcd 0000:00:1a.0: UHCI Host Controller
[ 37.698005] drivers/usb/core/inode.c: creating file 'devices'
[ 37.698008] drivers/usb/core/inode.c: creating file '001'
[ 37.698131] uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus
number 1
[ 37.698176] uhci_hcd 0000:00:1a.0: detected 2 ports
[ 37.698205] uhci_hcd 0000:00:1a.0: uhci_check_and_reset_hc: cmd = 0x0000


This is from 2.6.27 kernel:
[ 8.215972] IBM TrackPoint firmware: 0x0e, buttons: 3/3
[ 8.452174] input: TPPS/2 IBM TrackPoint as /devices/platform/i8042/serio1/serio2/input/input7
[ 9.016605] kjournald starting. Commit interval 5 seconds
[ 9.016634] EXT3-fs: mounted filesystem with ordered data mode.
[ 9.016663] VFS: Mounted root (ext3 filesystem) readonly.
[ 9.016756] Freeing unused kernel memory: 296k freed
[ 18.963410] HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
[ 18.963461] hda_intel: probe_mask set to 0x1 for device 17aa:20ac
[ 18.969726] USB Universal Host Controller Interface driver v3.0
[ 18.987703] e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k6
[ 18.987732] e1000e: Copyright (c) 1999-2008 Intel Corporation.
[ 19.023978] sdhci: Secure Digital Host Controller Interface driver
[ 19.024017] sdhci: Copyright(c) Pierre Ossman
[ 19.084812] input: PC Speaker as /devices/platform/pcspkr/input/input8
[ 19.366304] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[ 19.372539] ACPI: WMI: Mapper loaded
[ 19.570065] 0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:1a:6b:d2:bd:81
[ 19.570105] 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
[ 19.570172] 0000:00:19.0: eth0: MAC: 4, PHY: 6, PBA No: ffffff-0ff
[ 19.570301] uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
[ 19.570343] uhci_hcd 0000:00:1a.0: UHCI Host Controller

--
Luk?? Hejtm?nek


2008-11-09 00:02:01

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Tuesday, 4 of November 2008, Lukas Hejtmanek wrote:
> Hello Rafael,

Hi,

> not sure whether this is already reported or not. However, in 2.6.28-rc3
> I discovered about 10 secs delay before mounting the root fs and running init
> (not sure whether the delay is caused by mount or init). It did not appear in
> 2.6.27. I did not try any previous 2.6.28-rcX.

Added to the list of recent regressions as
http://bugzilla.kernel.org/show_bug.cgi?id=11987

Thanks,
Rafael


> This is from dmesg in 2.6.28-rc3.
>
> [ 7.880966] IBM TrackPoint firmware: 0x0e, buttons: 3/3
> [ 8.103116] input: TPPS/2 IBM TrackPoint as
> /devices/platform/i8042/serio1/serio2/input/input7
> [ 28.993523] kjournald starting. Commit interval 5 seconds
> [ 28.993581] EXT3-fs: mounted filesystem with ordered data mode.
> [ 28.993638] VFS: Mounted root (ext3 filesystem) readonly.
> [ 28.993675] Freeing unused kernel memory: 332k freed
> [ 37.697648] uhci_hcd: USB Universal Host Controller Interface driver
> [ 37.697740] uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ
> 20
> [ 37.697779] uhci_hcd 0000:00:1a.0: setting latency timer to 64
> [ 37.697785] uhci_hcd 0000:00:1a.0: UHCI Host Controller
> [ 37.698005] drivers/usb/core/inode.c: creating file 'devices'
> [ 37.698008] drivers/usb/core/inode.c: creating file '001'
> [ 37.698131] uhci_hcd 0000:00:1a.0: new USB bus registered, assigned bus
> number 1
> [ 37.698176] uhci_hcd 0000:00:1a.0: detected 2 ports
> [ 37.698205] uhci_hcd 0000:00:1a.0: uhci_check_and_reset_hc: cmd = 0x0000
>
>
> This is from 2.6.27 kernel:
> [ 8.215972] IBM TrackPoint firmware: 0x0e, buttons: 3/3
> [ 8.452174] input: TPPS/2 IBM TrackPoint as /devices/platform/i8042/serio1/serio2/input/input7
> [ 9.016605] kjournald starting. Commit interval 5 seconds
> [ 9.016634] EXT3-fs: mounted filesystem with ordered data mode.
> [ 9.016663] VFS: Mounted root (ext3 filesystem) readonly.
> [ 9.016756] Freeing unused kernel memory: 296k freed
> [ 18.963410] HDA Intel 0000:00:1b.0: PCI INT B -> GSI 17 (level, low) -> IRQ 17
> [ 18.963461] hda_intel: probe_mask set to 0x1 for device 17aa:20ac
> [ 18.969726] USB Universal Host Controller Interface driver v3.0
> [ 18.987703] e1000e: Intel(R) PRO/1000 Network Driver - 0.3.3.3-k6
> [ 18.987732] e1000e: Copyright (c) 1999-2008 Intel Corporation.
> [ 19.023978] sdhci: Secure Digital Host Controller Interface driver
> [ 19.024017] sdhci: Copyright(c) Pierre Ossman
> [ 19.084812] input: PC Speaker as /devices/platform/pcspkr/input/input8
> [ 19.366304] e1000e 0000:00:19.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> [ 19.372539] ACPI: WMI: Mapper loaded
> [ 19.570065] 0000:00:19.0: eth0: (PCI Express:2.5GB/s:Width x1) 00:1a:6b:d2:bd:81
> [ 19.570105] 0000:00:19.0: eth0: Intel(R) PRO/1000 Network Connection
> [ 19.570172] 0000:00:19.0: eth0: MAC: 4, PHY: 6, PBA No: ffffff-0ff
> [ 19.570301] uhci_hcd 0000:00:1a.0: PCI INT A -> GSI 20 (level, low) -> IRQ 20
> [ 19.570343] uhci_hcd 0000:00:1a.0: UHCI Host Controller
>



--
Everyone knows that debugging is twice as hard as writing a program
in the first place. So if you're as clever as you can be when you write it,
how will you ever debug it? --- Brian Kernighan

2008-11-09 04:55:38

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008 01:06:04 +0100
"Rafael J. Wysocki" <[email protected]> wrote:

> On Tuesday, 4 of November 2008, Lukas Hejtmanek wrote:
> > Hello Rafael,
>
> Hi,
>
> > not sure whether this is already reported or not. However, in
> > 2.6.28-rc3 I discovered about 10 secs delay before mounting the
> > root fs and running init (not sure whether the delay is caused by
> > mount or init). It did not appear in 2.6.27. I did not try any
> > previous 2.6.28-rcX.
>


do you have RAID autoprobing enabled ?
raid autoprobing requires that all devices have been discovered,
to avoid getting partial RAID arrays and other nasty things...

it's CONFIG_MD_AUTODETECT


--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-11-09 08:57:30

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sat, Nov 08, 2008 at 08:55:32PM -0800, Arjan van de Ven wrote:
> do you have RAID autoprobing enabled ?
> raid autoprobing requires that all devices have been discovered,
> to avoid getting partial RAID arrays and other nasty things...
>
> it's CONFIG_MD_AUTODETECT

# CONFIG_MD is not set

I do not have MD at all.

Moreover, I discovered that I have to hit a keyboard several times during the
boot to boot up.

The delay depends on keyboard hit. If I do not touch the keyboard, it waits
forever.. Looks like something needs an interrupt..

--
Luk?? Hejtm?nek

2008-11-09 10:53:43

by Marcin Slusarz

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 09:56:49AM +0100, Lukas Hejtmanek wrote:
> On Sat, Nov 08, 2008 at 08:55:32PM -0800, Arjan van de Ven wrote:
> > do you have RAID autoprobing enabled ?
> > raid autoprobing requires that all devices have been discovered,
> > to avoid getting partial RAID arrays and other nasty things...
> >
> > it's CONFIG_MD_AUTODETECT
>
> # CONFIG_MD is not set
>
> I do not have MD at all.
>
> Moreover, I discovered that I have to hit a keyboard several times during the
> boot to boot up.
>
> The delay depends on keyboard hit. If I do not touch the keyboard, it waits
> forever.. Looks like something needs an interrupt..

Do you have NO_HZ enabled?
(Cc'ing Thomas)

2008-11-09 11:33:18

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 11:52:02AM +0100, Marcin Slusarz wrote:
> > Moreover, I discovered that I have to hit a keyboard several times during the
> > boot to boot up.
> >
> > The delay depends on keyboard hit. If I do not touch the keyboard, it waits
> > forever.. Looks like something needs an interrupt..
>
> Do you have NO_HZ enabled?
> (Cc'ing Thomas)

indeed, I have. Should I try it without NO_HZ?

--
Luk?? Hejtm?nek

2008-11-09 13:19:22

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sunday, 9 of November 2008, Lukas Hejtmanek wrote:
> On Sun, Nov 09, 2008 at 11:52:02AM +0100, Marcin Slusarz wrote:
> > > Moreover, I discovered that I have to hit a keyboard several times during the
> > > boot to boot up.
> > >
> > > The delay depends on keyboard hit. If I do not touch the keyboard, it waits
> > > forever.. Looks like something needs an interrupt..
> >
> > Do you have NO_HZ enabled?
> > (Cc'ing Thomas)
>
> indeed, I have. Should I try it without NO_HZ?

Yes, please.

Thanks,
Rafael


--
Everyone knows that debugging is twice as hard as writing a program
in the first place. So if you're as clever as you can be when you write it,
how will you ever debug it? --- Brian Kernighan

2008-11-09 14:20:28

by Frans Pop

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

Lukas Hejtmanek wrote:
> Moreover, I discovered that I have to hit a keyboard several times
> during the boot to boot up.
>
> The delay depends on keyboard hit. If I do not touch the keyboard, it
> waits forever.. Looks like something needs an interrupt..

That sounds very much like the issues described in this thread:
http://lkml.org/lkml/2008/10/28/226

Do you use LUKS encryption and/or LVM?

Cheers,
FJP

2008-11-09 15:14:17

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 03:20:12PM +0100, Frans Pop wrote:
> That sounds very much like the issues described in this thread:
> http://lkml.org/lkml/2008/10/28/226

yes, looks pretty much the same but it appears on a plain libata driver.

> Do you use LUKS encryption and/or LVM?

nope. just AHCI driver & ext3 file system.

--
Luk?? Hejtm?nek

2008-11-09 20:09:40

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 02:24:10PM +0100, Rafael J. Wysocki wrote:
> > > Do you have NO_HZ enabled?
> > > (Cc'ing Thomas)
> >
> > indeed, I have. Should I try it without NO_HZ?
>
> Yes, please.

I tried it, it is not better. I/O operations on /dev/sda1 often require a key
hit. It is AHCI driver.

--
Luk?? Hejtm?nek

2008-11-09 20:28:42

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008 21:09:18 +0100
Lukas Hejtmanek <[email protected]> wrote:

> On Sun, Nov 09, 2008 at 02:24:10PM +0100, Rafael J. Wysocki wrote:
> > > > Do you have NO_HZ enabled?
> > > > (Cc'ing Thomas)
> > >
> > > indeed, I have. Should I try it without NO_HZ?
> >
> > Yes, please.
>
> I tried it, it is not better. I/O operations on /dev/sda1 often
> require a key hit. It is AHCI driver.

just as a status on this guy:

1) we know what to remove to make it go away and go back to 2.6.27
status.
However we think it just papers over a bug elsewhere
2) Thomas did a lot of work and has a set of bugfixes to hrtimers and
related that fix a bunch of cases and MIGHT fix this one too (we
haven't been able to reproduce it but there's several reports out for
this "wait until hitting a key")

for me, the plan is that we need to get Thomas' fixes tested by someone
who can reproduce this very reliably. If they fix it, great.
If they don't, time is running out and we need to remove the feature
for 2.6.28 (that is 1 line of change), realizing that this just papers
over stuff and isn't a real fix, and get back to it for 2.6.29.



--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-11-09 20:34:49

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 12:28:35PM -0800, Arjan van de Ven wrote:
> for me, the plan is that we need to get Thomas' fixes tested by someone
> who can reproduce this very reliably. If they fix it, great.

as for me, 100% of boots require a key hit to proceed in 2.6.28-rc3 (more
precisely, the current git head).

--
Luk?? Hejtm?nek

2008-11-09 20:40:43

by Linus Torvalds

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+



On Sun, 9 Nov 2008, Arjan van de Ven wrote:
>
> for me, the plan is that we need to get Thomas' fixes tested by someone
> who can reproduce this very reliably. If they fix it, great.
> If they don't, time is running out and we need to remove the feature
> for 2.6.28 (that is 1 line of change), realizing that this just papers
> over stuff and isn't a real fix, and get back to it for 2.6.29.

I'd like to see the one-liner fix, even if it turns out to just be a
workaround that hides the issue and makes us go back to 2.6.27 behaviour.
I'm about to release -rc4 today, we should be aggressive about
regressions.

If somebody can test the "real fix" independently, that's obviously fine,
but it's not an excuse for keeping a pending regression. If people end up
being 100% sure they found the bug later, we can then re-enable it, but
again - that has zero bearing on the fact that we should get rid of the
regression asap.

Linus

2008-11-09 20:48:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008 12:39:50 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

>
>
> On Sun, 9 Nov 2008, Arjan van de Ven wrote:
> >
> > for me, the plan is that we need to get Thomas' fixes tested by
> > someone who can reproduce this very reliably. If they fix it, great.
> > If they don't, time is running out and we need to remove the feature
> > for 2.6.28 (that is 1 line of change), realizing that this just
> > papers over stuff and isn't a real fix, and get back to it for
> > 2.6.29.
>
> I'd like to see the one-liner fix, even if it turns out to just be a
> workaround that hides the issue and makes us go back to 2.6.27
> behaviour. I'm about to release -rc4 today, we should be aggressive
> about regressions.

>From bf1003799b70f68f927d312841a05930856ad716 Mon Sep 17 00:00:00 2001
From: Arjan van de Ven <[email protected]>
Date: Sun, 9 Nov 2008 12:45:10 -0800
Subject: [PATCH] regression: disable timer peek-ahead for 2.6.28

it's showing up as regressions; disabling it very likely just papers
over an underlying issue, but time is running out for 2.6.28,
lets get back to this for 2.6.29

Fixes: #11826 and #11893

Signed-off-by: Arjan van de Ven <[email protected]>
---
drivers/cpuidle/cpuidle.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)

diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 5bed733..67ec4da 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -65,12 +65,14 @@ static void cpuidle_idle_call(void)
return;
}

+#if 0
+ /* shows regressions, re-enable for 2.6.29 */
/*
* run any timers that can be run now, at this point
* before calculating the idle duration etc.
*/
hrtimer_peek_ahead_timers();
-
+#endif
/* ask the governor for the next state */
next_state = cpuidle_curr_governor->select(dev);
if (need_resched())
--
1.6.0.3

2008-11-09 21:24:48

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008 21:34:32 +0100
Lukas Hejtmanek <[email protected]> wrote:

> On Sun, Nov 09, 2008 at 12:28:35PM -0800, Arjan van de Ven wrote:
> > for me, the plan is that we need to get Thomas' fixes tested by
> > someone who can reproduce this very reliably. If they fix it, great.
>
> as for me, 100% of boots require a key hit to proceed in 2.6.28-rc3
> (more precisely, the current git head).
>

it would be greatly appreciated if those that see this issue can test
the patch below (which is a combo patch of thomas' fixes, to make
testing easier) and confirm if this fixes the hangs.
If they do fix the issue we don't need to disable anything and get better power
savings as bonus as well.


Gautham R Shenoy (1):
timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context

Ingo Molnar (2):
irq: fix typo
nohz: fix SMP race

Thomas Gleixner (12):
nohz: check broadcast on tick restart
x86: apic reset counter on shutdown
hrtimer: fix the accounting of hrtimer_peek_ahead()
hrtimer: do not peek ahead when the hardware timer has fired already
tick: clear broadcast mask when timer is reprogrammed
irq: call __irq_enter() before calling the tick_idle_check
x86: apic honour irq affinity which was set in early boot
genirq: keep affinities set from userspace across free/request_irq()
genirq: fix the affinity setting in setup_irq
tick: cleanup idle checks
ACPI: processor_idle: check for spurious wakeups
ACPI: idle check for spurious wakeups accounting fix


arch/x86/kernel/apic.c | 1 +
arch/x86/kernel/io_apic.c | 26 ++++++++++--
drivers/acpi/processor_idle.c | 49 ++++++++++++++++++-----
drivers/cpuidle/cpuidle.c | 5 ++-
include/linux/hrtimer.h | 4 +-
include/linux/irq.h | 8 +---
include/linux/tick.h | 5 ++
kernel/hrtimer.c | 81 +++++++++++++++++++++++++++++--------
kernel/irq/internals.h | 2 +
kernel/irq/manage.c | 66 ++++++++++++++++++++++++------
kernel/irq/migration.c | 11 -----
kernel/irq/proc.c | 2 +-
kernel/softirq.c | 7 ++-
kernel/time/tick-broadcast.c | 27 ++++++++++++
kernel/time/tick-internal.h | 5 ++
kernel/time/tick-sched.c | 88 ++++++++++++++++++++++------------------
kernel/time/timer_list.c | 4 +-
17 files changed, 279 insertions(+), 112 deletions(-)

diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 04a7f96..70003b8 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -441,6 +441,7 @@ static void lapic_timer_setup(enum clock_event_mode mode,
v = apic_read(APIC_LVTT);
v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
apic_write(APIC_LVTT, v);
+ apic_write(APIC_TMICT, 0xffffffff);
break;
case CLOCK_EVT_MODE_RESUME:
/* Nothing to do here */
diff --git a/arch/x86/kernel/io_apic.c b/arch/x86/kernel/io_apic.c
index 7a3f202..988ee89 100644
--- a/arch/x86/kernel/io_apic.c
+++ b/arch/x86/kernel/io_apic.c
@@ -3761,7 +3761,9 @@ int acpi_get_override_irq(int bus_irq, int *trigger, int *polarity)
void __init setup_ioapic_dest(void)
{
int pin, ioapic, irq, irq_entry;
+ struct irq_desc *desc;
struct irq_cfg *cfg;
+ cpumask_t mask;

if (skip_ioapic_setup == 1)
return;
@@ -3778,16 +3780,30 @@ void __init setup_ioapic_dest(void)
* cpu is online.
*/
cfg = irq_cfg(irq);
- if (!cfg->vector)
+ if (!cfg->vector) {
setup_IO_APIC_irq(ioapic, pin, irq,
irq_trigger(irq_entry),
irq_polarity(irq_entry));
+ continue;
+
+ }
+
+ /*
+ * Honour affinities which have been set in early boot
+ */
+ desc = irq_to_desc(irq);
+ if (desc->status &
+ (IRQ_NO_BALANCING | IRQ_AFFINITY_SET))
+ mask = desc->affinity;
+ else
+ mask = TARGET_CPUS;
+
#ifdef CONFIG_INTR_REMAP
- else if (intr_remapping_enabled)
- set_ir_ioapic_affinity_irq(irq, TARGET_CPUS);
-#endif
+ if (intr_remapping_enabled)
+ set_ir_ioapic_affinity_irq(irq, mask);
else
- set_ioapic_affinity_irq(irq, TARGET_CPUS);
+#endif
+ set_ioapic_affinity_irq(irq, mask);
}

}
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 81b40ed..371dc42 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -41,6 +41,7 @@
#include <linux/pm_qos_params.h>
#include <linux/clockchips.h>
#include <linux/cpuidle.h>
+#include <linux/tick.h>

/*
* Include the apic definitions for x86 to have the APIC timer related defines
@@ -1458,6 +1459,7 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
u32 t1, t2;
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
+ int sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1473,6 +1475,7 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
return 0;
}

+again:
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);

@@ -1480,10 +1483,18 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
acpi_idle_do_entry(cx);
t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);

+ sleep_ticks += ticks_elapsed(t1, t2);
+
local_irq_enable();
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
cx->usage++;

- return ticks_elapsed_in_us(t1, t2);
+ return sleep_ticks;
}

/**
@@ -1497,7 +1508,7 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
u32 t1, t2;
- int sleep_ticks = 0;
+ int ticks, sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1527,6 +1538,7 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
*/
acpi_state_timer_broadcast(pr, cx, 1);

+again:
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);

@@ -1544,12 +1556,19 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
if (tsc_halts_in_c(cx->type))
mark_tsc_unstable("TSC halts in idle");;
#endif
- sleep_ticks = ticks_elapsed(t1, t2);
-
+ ticks = ticks_elapsed(t1, t2);
/* Tell the scheduler how much we idled: */
- sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+ sched_clock_idle_wakeup_event(ticks*PM_TIMER_TICK_NS);

local_irq_enable();
+ sleep_ticks += ticks;
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
+
current_thread_info()->status |= TS_POLLING;

cx->usage++;
@@ -1575,7 +1594,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
u32 t1, t2;
- int sleep_ticks = 0;
+ int ticks, sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1613,14 +1632,16 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,

acpi_unlazy_tlb(smp_processor_id());

- /* Tell the scheduler that we are going deep-idle: */
- sched_clock_idle_sleep_event();
/*
* Must be done before busmaster disable as we might need to
* access HPET !
*/
acpi_state_timer_broadcast(pr, cx, 1);

+again:
+ /* Tell the scheduler that we are going deep-idle: */
+ sched_clock_idle_sleep_event();
+
acpi_idle_update_bm_rld(pr, cx);

/*
@@ -1661,11 +1682,19 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
if (tsc_halts_in_c(ACPI_STATE_C3))
mark_tsc_unstable("TSC halts in idle");
#endif
- sleep_ticks = ticks_elapsed(t1, t2);
+ ticks = ticks_elapsed(t1, t2);
/* Tell the scheduler how much we idled: */
- sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+ sched_clock_idle_wakeup_event(ticks*PM_TIMER_TICK_NS);

local_irq_enable();
+ sleep_ticks += ticks;
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
+
current_thread_info()->status |= TS_POLLING;

cx->usage++;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 5bed733..f3ca665 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -69,7 +69,10 @@ static void cpuidle_idle_call(void)
* run any timers that can be run now, at this point
* before calculating the idle duration etc.
*/
- hrtimer_peek_ahead_timers();
+ if (hrtimer_peek_ahead_timers()) {
+ local_irq_enable();
+ return;
+ }

/* ask the governor for the next state */
next_state = cpuidle_curr_governor->select(dev);
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 07e510a..e3d7d9f 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -294,7 +294,7 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
return timer->base->cpu_base->hres_active;
}

-extern void hrtimer_peek_ahead_timers(void);
+extern int hrtimer_peek_ahead_timers(void);

/*
* The resolution of the clocks. The resolution value is returned in
@@ -318,7 +318,7 @@ extern void hrtimer_peek_ahead_timers(void);
* is expired in the next softirq when the clock was advanced.
*/
static inline void clock_was_set(void) { }
-static inline void hrtimer_peek_ahead_timers(void) { }
+static inline int hrtimer_peek_ahead_timers(void) { return 0; }

static inline void hres_timers_resume(void) { }

diff --git a/include/linux/irq.h b/include/linux/irq.h
index d058c57..36b186e 100644
--- a/include/linux/irq.h
+++ b/include/linux/irq.h
@@ -63,7 +63,8 @@ typedef void (*irq_flow_handler_t)(unsigned int irq,
#define IRQ_MOVE_PENDING 0x00200000 /* need to re-target IRQ destination */
#define IRQ_NO_BALANCING 0x00400000 /* IRQ is excluded from balancing */
#define IRQ_SPURIOUS_DISABLED 0x00800000 /* IRQ was disabled by the spurious trap */
-#define IRQ_MOVE_PCNTXT 0x01000000 /* IRQ migration from process context */
+#define IRQ_MOVE_PCNTXT 0x01000000 /* IRQ migration from process context */
+#define IRQ_AFFINITY_SET 0x02000000 /* IRQ affinity was set from userspace*/

#ifdef CONFIG_IRQ_PER_CPU
# define CHECK_IRQ_PER_CPU(var) ((var) & IRQ_PER_CPU)
@@ -210,7 +211,6 @@ extern int setup_irq(unsigned int irq, struct irqaction *new);

#ifdef CONFIG_GENERIC_PENDING_IRQ

-void set_pending_irq(unsigned int irq, cpumask_t mask);
void move_native_irq(int irq);
void move_masked_irq(int irq);

@@ -228,10 +228,6 @@ static inline void move_masked_irq(int irq)
{
}

-static inline void set_pending_irq(unsigned int irq, cpumask_t mask)
-{
-}
-
#endif /* CONFIG_GENERIC_PENDING_IRQ */

#else /* CONFIG_SMP */
diff --git a/include/linux/tick.h b/include/linux/tick.h
index b6ec818..ff202ac 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -64,6 +64,9 @@ struct tick_sched {
unsigned long last_jiffies;
unsigned long next_jiffies;
ktime_t idle_expires;
+ unsigned long irq_wakeups;
+ unsigned long irq_last_wakeups;
+ unsigned long spurious_wakeups;
};

extern void __init tick_init(void);
@@ -116,6 +119,7 @@ extern void tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern int check_idle_spurious_wakeup(int cpu);
# else
static inline void tick_nohz_stop_sched_tick(int inidle) { }
static inline void tick_nohz_restart_sched_tick(void) { }
@@ -126,6 +130,7 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
return len;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
+static inline int check_idle_spurious_wakeup(int cpu) { return 0; }
# endif /* !NO_HZ */

#endif
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 2b465df..075bc8e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1209,6 +1209,7 @@ static void run_hrtimer_pending(struct hrtimer_cpu_base *cpu_base)
enum hrtimer_restart (*fn)(struct hrtimer *);
struct hrtimer *timer;
int restart;
+ int emulate_hardirq_ctx = 0;

timer = list_entry(cpu_base->cb_pending.next,
struct hrtimer, cb_entry);
@@ -1217,10 +1218,24 @@ static void run_hrtimer_pending(struct hrtimer_cpu_base *cpu_base)
timer_stats_account_hrtimer(timer);

fn = timer->function;
+ /*
+ * A timer might have been added to the cb_pending list
+ * when it was migrated during a cpu-offline operation.
+ * Emulate hardirq context for such timers.
+ */
+ if (timer->cb_mode == HRTIMER_CB_IRQSAFE_PERCPU ||
+ timer->cb_mode == HRTIMER_CB_IRQSAFE_UNLOCKED)
+ emulate_hardirq_ctx = 1;
+
__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
spin_unlock_irq(&cpu_base->lock);

- restart = fn(timer);
+ if (unlikely(emulate_hardirq_ctx)) {
+ local_irq_disable();
+ restart = fn(timer);
+ local_irq_enable();
+ } else
+ restart = fn(timer);

spin_lock_irq(&cpu_base->lock);

@@ -1297,23 +1312,26 @@ static void __run_hrtimer(struct hrtimer *timer)
#ifdef CONFIG_HIGH_RES_TIMERS

/*
- * High resolution timer interrupt
- * Called with interrupts disabled
+ * High resolution timer interrupt internal worker function called
+ * with interrupts disabled either from hrtimer_interrupt() or from
+ * hrtimer_peek_ahead()
*/
-void hrtimer_interrupt(struct clock_event_device *dev)
+int __hrtimer_interrupt(struct hrtimer_cpu_base *cpu_base, int peekahead)
{
- struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
struct hrtimer_clock_base *base;
ktime_t expires_next, now;
int i, raise = 0;

- BUG_ON(!cpu_base->hres_active);
- cpu_base->nr_events++;
- dev->next_event.tv64 = KTIME_MAX;
-
retry:
now = ktime_get();

+ /*
+ * In peek ahead mode bail out, if the hw interrupt is
+ * imminent:
+ */
+ if (peekahead && cpu_base->expires_next.tv64 < now.tv64)
+ return -ETIME;
+
expires_next.tv64 = KTIME_MAX;

base = cpu_base->clock_base;
@@ -1370,17 +1388,49 @@ void hrtimer_interrupt(struct clock_event_device *dev)
base++;
}

+ /*
+ * We just peeked ahead. The hardware timer did not expire. So
+ * we can leave the timer armed.
+ */
+ if (peekahead && cpu_base->expires_next.tv64 == expires_next.tv64)
+ goto out;
+
cpu_base->expires_next = expires_next;

/* Reprogramming necessary ? */
if (expires_next.tv64 != KTIME_MAX) {
+ /*
+ * Clear the peeakahead flag once we decided to
+ * reprogram. Otherwise we break out in the check
+ * above.
+ */
+ peekahead = 0;
if (tick_program_event(expires_next, 0))
goto retry;
}

+out:
/* Raise softirq ? */
if (raise)
raise_softirq(HRTIMER_SOFTIRQ);
+
+ return 0;
+}
+
+
+/*
+ * High resolution timer interrupt
+ * Called with interrupts disabled
+ */
+void hrtimer_interrupt(struct clock_event_device *dev)
+{
+ struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+ BUG_ON(!cpu_base->hres_active);
+ cpu_base->nr_events++;
+ dev->next_event.tv64 = KTIME_MAX;
+
+ __hrtimer_interrupt(cpu_base, 0);
}

/**
@@ -1392,19 +1442,14 @@ void hrtimer_interrupt(struct clock_event_device *dev)
* they are run immediately and then removed from the timer queue.
*
*/
-void hrtimer_peek_ahead_timers(void)
+int hrtimer_peek_ahead_timers(void)
{
- struct tick_device *td;
- unsigned long flags;
+ WARN_ON_ONCE(!irqs_disabled());

if (!hrtimer_hres_active())
- return;
+ return 0;

- local_irq_save(flags);
- td = &__get_cpu_var(tick_cpu_device);
- if (td && td->evtdev)
- hrtimer_interrupt(td->evtdev);
- local_irq_restore(flags);
+ return __hrtimer_interrupt(&__get_cpu_var(hrtimer_bases), 1);
}

static void run_hrtimer_softirq(struct softirq_action *h)
diff --git a/kernel/irq/internals.h b/kernel/irq/internals.h
index c9767e6..64c1c72 100644
--- a/kernel/irq/internals.h
+++ b/kernel/irq/internals.h
@@ -25,6 +25,8 @@ static inline void unregister_handler_proc(unsigned int irq,
struct irqaction *action) { }
#endif

+extern int irq_select_affinity_usr(unsigned int irq);
+
/*
* Debugging printout:
*/
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index c498a1b..4358612 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -82,24 +82,27 @@ int irq_can_set_affinity(unsigned int irq)
int irq_set_affinity(unsigned int irq, cpumask_t cpumask)
{
struct irq_desc *desc = irq_to_desc(irq);
+ unsigned long flags;

if (!desc->chip->set_affinity)
return -EINVAL;

+ spin_lock_irqsave(&desc->lock, flags);
+
#ifdef CONFIG_GENERIC_PENDING_IRQ
if (desc->status & IRQ_MOVE_PCNTXT || desc->status & IRQ_DISABLED) {
- unsigned long flags;
-
- spin_lock_irqsave(&desc->lock, flags);
desc->affinity = cpumask;
desc->chip->set_affinity(irq, cpumask);
- spin_unlock_irqrestore(&desc->lock, flags);
- } else
- set_pending_irq(irq, cpumask);
+ } else {
+ desc->status |= IRQ_MOVE_PENDING;
+ desc->pending_mask = cpumask;
+ }
#else
desc->affinity = cpumask;
desc->chip->set_affinity(irq, cpumask);
#endif
+ desc->status |= IRQ_AFFINITY_SET;
+ spin_unlock_irqrestore(&desc->lock, flags);
return 0;
}

@@ -107,24 +110,59 @@ int irq_set_affinity(unsigned int irq, cpumask_t cpumask)
/*
* Generic version of the affinity autoselector.
*/
-int irq_select_affinity(unsigned int irq)
+int do_irq_select_affinity(unsigned int irq, struct irq_desc *desc)
{
cpumask_t mask;
- struct irq_desc *desc;

if (!irq_can_set_affinity(irq))
return 0;

cpus_and(mask, cpu_online_map, irq_default_affinity);

- desc = irq_to_desc(irq);
+ /*
+ * Preserve an userspace affinity setup, but make sure that
+ * one of the targets is online.
+ */
+ if (desc->status & (IRQ_AFFINITY_SET | IRQ_NO_BALANCING)) {
+ if (cpus_intersects(desc->affinity, cpu_online_map))
+ mask = desc->affinity;
+ else
+ desc->status &= ~IRQ_AFFINITY_SET;
+ }
+
desc->affinity = mask;
desc->chip->set_affinity(irq, mask);

return 0;
}
+#else
+static inline int do_irq_select_affinity(unsigned int irq, struct irq_desc *d)
+{
+ return irq_select_affinity(irq);
+}
#endif

+/*
+ * Called when affinity is set via /proc/irq
+ */
+int irq_select_affinity_usr(unsigned int irq)
+{
+ struct irq_desc *desc = irq_to_desc(irq);
+ unsigned long flags;
+ int ret;
+
+ spin_lock_irqsave(&desc->lock, flags);
+ ret = do_irq_select_affinity(irq, desc);
+ spin_unlock_irqrestore(&desc->lock, flags);
+
+ return ret;
+}
+
+#else
+static inline int do_irq_select_affinity(int irq, struct irq_desc *desc)
+{
+ return 0;
+}
#endif

/**
@@ -445,8 +483,12 @@ __setup_irq(unsigned int irq, struct irq_desc * desc, struct irqaction *new)
/* Undo nested disables: */
desc->depth = 1;

+ /* Exclude IRQ from balancing if requested */
+ if (new->flags & IRQF_NOBALANCING)
+ desc->status |= IRQ_NO_BALANCING;
+
/* Set default affinity mask once everything is setup */
- irq_select_affinity(irq);
+ do_irq_select_affinity(irq, desc);

} else if ((new->flags & IRQF_TRIGGER_MASK)
&& (new->flags & IRQF_TRIGGER_MASK)
@@ -459,10 +501,6 @@ __setup_irq(unsigned int irq, struct irq_desc * desc, struct irqaction *new)

*p = new;

- /* Exclude IRQ from balancing */
- if (new->flags & IRQF_NOBALANCING)
- desc->status |= IRQ_NO_BALANCING;
-
/* Reset broken irq detection when installing new handler */
desc->irq_count = 0;
desc->irqs_unhandled = 0;
diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c
index 90b920d..9db681d 100644
--- a/kernel/irq/migration.c
+++ b/kernel/irq/migration.c
@@ -1,17 +1,6 @@

#include <linux/irq.h>

-void set_pending_irq(unsigned int irq, cpumask_t mask)
-{
- struct irq_desc *desc = irq_to_desc(irq);
- unsigned long flags;
-
- spin_lock_irqsave(&desc->lock, flags);
- desc->status |= IRQ_MOVE_PENDING;
- desc->pending_mask = mask;
- spin_unlock_irqrestore(&desc->lock, flags);
-}
-
void move_masked_irq(int irq)
{
struct irq_desc *desc = irq_to_desc(irq);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4d161c7..d257e7d 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -62,7 +62,7 @@ static ssize_t irq_affinity_proc_write(struct file *file,
if (!cpus_intersects(new_value, cpu_online_map))
/* Special case for empty set - allow the architecture
code to set default SMP affinity. */
- return irq_select_affinity(irq) ? -EINVAL : count;
+ return irq_select_affinity_usr(irq) ? -EINVAL : count;

irq_set_affinity(irq, new_value);

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7110dae..e7c69a7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -269,10 +269,11 @@ void irq_enter(void)
{
int cpu = smp_processor_id();

- if (idle_cpu(cpu) && !in_interrupt())
+ if (idle_cpu(cpu) && !in_interrupt()) {
+ __irq_enter();
tick_check_idle(cpu);
-
- __irq_enter();
+ } else
+ __irq_enter();
}

#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f98a1b7..ffb7252 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -397,6 +397,33 @@ void tick_check_oneshot_broadcast(int cpu)
}

/*
+ * Called from tick_nohz_kick_tick() to check, whether the next
+ * broadcast event is less than a tick_period away or already pending
+ * to avoid reprogramming of the per cpu device.
+ */
+int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now)
+{
+ struct clock_event_device *bc;
+ ktime_t delta;
+ int res = 0;
+
+ spin_lock(&tick_broadcast_lock);
+ bc = tick_broadcast_device.evtdev;
+
+ if (bc) {
+ delta = ktime_sub(bc->next_event, now);
+ if (delta.tv64 <= tick_period.tv64)
+ res = 1;
+ else
+ cpu_clear(cpu, tick_broadcast_oneshot_mask);
+ } else
+ cpu_clear(cpu, tick_broadcast_oneshot_mask);
+
+ spin_unlock(&tick_broadcast_lock);
+ return res;
+}
+
+/*
* Handle oneshot mode broadcasting
*/
static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index b1c05bf..b825c39 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -37,6 +37,7 @@ extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
extern int tick_broadcast_oneshot_active(void);
extern void tick_check_oneshot_broadcast(int cpu);
+extern int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now);
# else /* BROADCAST */
static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
{
@@ -47,6 +48,10 @@ static inline void tick_broadcast_switch_to_oneshot(void) { }
static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
static inline int tick_broadcast_oneshot_active(void) { return 0; }
static inline void tick_check_oneshot_broadcast(int cpu) { }
+static inline int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now)
+{
+ return 0;
+}
# endif /* !BROADCAST */

#else /* !ONESHOT */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5bbb104..35a775e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -134,48 +134,31 @@ __setup("nohz=", setup_tick_nohz);
* value. We do this unconditionally on any cpu, as we don't know whether the
* cpu, which has the update task assigned is in a long sleep.
*/
-void tick_nohz_update_jiffies(void)
+static void tick_nohz_update_jiffies(int cpu, struct tick_sched *ts,
+ ktime_t now)
{
- int cpu = smp_processor_id();
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- unsigned long flags;
- ktime_t now;
-
- if (!ts->tick_stopped)
- return;
-
cpu_clear(cpu, nohz_cpu_mask);
- now = ktime_get();
ts->idle_waketime = now;

- local_irq_save(flags);
tick_do_update_jiffies64(now);
- local_irq_restore(flags);

touch_softlockup_watchdog();
}

-static void tick_nohz_stop_idle(int cpu)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
- if (ts->idle_active) {
- ktime_t now, delta;
- now = ktime_get();
- delta = ktime_sub(now, ts->idle_entrytime);
- ts->idle_lastupdate = now;
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_active = 0;
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);

- sched_clock_idle_wakeup_event(0);
- }
+ ts->idle_lastupdate = now;
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ ts->idle_active = 0;
+ sched_clock_idle_wakeup_event(0);
}

static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
- ktime_t now, delta;
+ ktime_t delta, now = ktime_get();

- now = ktime_get();
if (ts->idle_active) {
delta = ktime_sub(now, ts->idle_entrytime);
ts->idle_lastupdate = now;
@@ -203,6 +186,21 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

+int check_idle_spurious_wakeup(int cpu)
+{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ int ret;
+
+ local_irq_disable();
+ ret= ts->irq_last_wakeups == ts->irq_wakeups;
+
+ ts->irq_last_wakeups = ts->irq_wakeups;
+ ts->spurious_wakeups += ret;
+ local_irq_enable();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(check_idle_spurious_wakeup);
+
/**
* tick_nohz_stop_sched_tick - stop the idle tick from the idle task
*
@@ -413,10 +411,10 @@ void tick_nohz_restart_sched_tick(void)
int cpu = smp_processor_id();
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
unsigned long ticks;
- ktime_t now;
+ ktime_t now = ktime_get();

local_irq_disable();
- tick_nohz_stop_idle(cpu);
+ tick_nohz_stop_idle(ts, now);

if (!ts->inidle || !ts->tick_stopped) {
ts->inidle = 0;
@@ -430,7 +428,6 @@ void tick_nohz_restart_sched_tick(void)

/* Update jiffies first */
select_nohz_load_balancer(0);
- now = ktime_get();
tick_do_update_jiffies64(now);
cpu_clear(cpu, nohz_cpu_mask);

@@ -566,24 +563,20 @@ static void tick_nohz_switch_to_nohz(void)
* timer and do not touch the other magic bits which need to be done
* when idle is left.
*/
-static void tick_nohz_kick_tick(int cpu)
+static void tick_nohz_kick_tick(int cpu, struct tick_sched *ts, ktime_t now)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t delta, now;
-
- if (!ts->tick_stopped)
- return;
+ ktime_t delta;

/*
* Do not touch the tick device, when the next expiry is either
* already reached or less/equal than the tick period.
*/
- now = ktime_get();
delta = ktime_sub(hrtimer_get_expires(&ts->sched_timer), now);
if (delta.tv64 <= tick_period.tv64)
return;

- tick_nohz_restart(ts, now);
+ if (!tick_check_oneshot_broadcast_wakeup(cpu, now))
+ tick_nohz_restart(ts, now);
}

#else
@@ -597,11 +590,26 @@ static inline void tick_nohz_switch_to_nohz(void) { }
*/
void tick_check_idle(int cpu)
{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ ktime_t now;
+
+ ts->irq_wakeups++;
+
tick_check_oneshot_broadcast(cpu);
+
#ifdef CONFIG_NO_HZ
- tick_nohz_stop_idle(cpu);
- tick_nohz_update_jiffies();
- tick_nohz_kick_tick(cpu);
+ if (!ts->tick_stopped && !ts->idle_active)
+ return;
+
+ now = ktime_get();
+
+ if (ts->idle_active)
+ tick_nohz_stop_idle(ts, now);
+
+ if (ts->tick_stopped) {
+ tick_nohz_update_jiffies(cpu, ts, now);
+ tick_nohz_kick_tick(cpu, ts, now);
+ }
#endif
}

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index a999b92..67c3d60 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -176,6 +176,8 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
P(last_jiffies);
P(next_jiffies);
P_ns(idle_expires);
+ P(irq_wakeups);
+ P(spurious_wakeups);
SEQ_printf(m, "jiffies: %Lu\n",
(unsigned long long)jiffies);
}
@@ -252,7 +254,7 @@ static int timer_list_show(struct seq_file *m, void *v)
u64 now = ktime_to_ns(ktime_get());
int cpu;

- SEQ_printf(m, "Timer List Version: v0.4\n");
+ SEQ_printf(m, "Timer List Version: v0.5\n");
SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);




--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-11-09 21:38:47

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 01:24:38PM -0800, Arjan van de Ven wrote:
> it would be greatly appreciated if those that see this issue can test the
> patch below (which is a combo patch of thomas' fixes, to make testing
> easier) and confirm if this fixes the hangs. If they do fix the issue we
> don't need to disable anything and get better power savings as bonus as
> well.

well, with this set of patches, my box boots OK without any key hits. I will
keep testing..

--
Luk?? Hejtm?nek

2008-11-09 21:49:00

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 01:24:38PM -0800, Arjan van de Ven wrote:
> it would be greatly appreciated if those that see this issue can test the
> patch below (which is a combo patch of thomas' fixes, to make testing
> easier) and confirm if this fixes the hangs. If they do fix the issue we
> don't need to disable anything and get better power savings as bonus as
> well.

hmm, on the other hand, powertop application shows that C4 state is 51.5%
residency and C0 is 47.3% residency. I have about 5% C0 residency and 95-98%
C3 residency. Is this expected?

--
Luk?? Hejtm?nek

2008-11-09 21:56:34

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008, Arjan van de Ven wrote:
> On Sun, 9 Nov 2008 21:34:32 +0100
> Lukas Hejtmanek <[email protected]> wrote:
>
> > On Sun, Nov 09, 2008 at 12:28:35PM -0800, Arjan van de Ven wrote:
> > > for me, the plan is that we need to get Thomas' fixes tested by
> > > someone who can reproduce this very reliably. If they fix it, great.
> >
> > as for me, 100% of boots require a key hit to proceed in 2.6.28-rc3
> > (more precisely, the current git head).
> >
>
> it would be greatly appreciated if those that see this issue can test
> the patch below (which is a combo patch of thomas' fixes, to make
> testing easier) and confirm if this fixes the hangs.
> If they do fix the issue we don't need to disable anything and get better power
> savings as bonus as well.

Just disabling the peek_ahead does not make the key press issue go
away. It makes it less prominent.

>
> Gautham R Shenoy (1):
> timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context
>
> Ingo Molnar (2):
> irq: fix typo
> nohz: fix SMP race
>
> Thomas Gleixner (12):
> nohz: check broadcast on tick restart

> x86: apic reset counter on shutdown

Not a required bugfix, but harmless and avoids extra wakeups.

> hrtimer: fix the accounting of hrtimer_peek_ahead()
> hrtimer: do not peek ahead when the hardware timer has fired already
> tick: clear broadcast mask when timer is reprogrammed
> irq: call __irq_enter() before calling the tick_idle_check

> x86: apic honour irq affinity which was set in early boot
> genirq: keep affinities set from userspace across free/request_irq()
> genirq: fix the affinity setting in setup_irq

These three are separate. I discoverd the problems via a report from
Kumar Galak and I fixed it while I tried to find out the root cause of
the nohz / keypress issue. They are not part of the timer fixes and in
a separate tree.

> tick: cleanup idle checks
> ACPI: processor_idle: check for spurious wakeups
> ACPI: idle check for spurious wakeups accounting fix

I worked all week after returning from travelling to get down to the
root cause of this. We had several issues:

1) peek_ahead confused the hell out of the nohz code when it expired
timers when the timer interrupt was already pending

2) the changes I did to prevent the stale jiffies in the nohz case
turned out to be incomplete and wrong versus the broadcast of timers
in deeper C-States

3) the consolidation of the nohz idle checks called the nohz jiffies
update code before calling __irq_enter() which caused useless wakeups
of ksoftirqd

4) A long standing - but until now papered over and therefor unnoticed
- problem of spurious wakeups from deep c-states:

If one core gets an interrupt the other core(s) wake up as
well. The stale jiffies fix and the peek_ahead changes made this
more visible and is a major part of the reported nohz (more wakeups,
more power consumption) regression.

This was there forever, but nobody noticed as it was not accounted
for.

Find below the lineup of the timers-fixes-for-linus branch of the tip
tree (the same as Arjan posted minus the irq fixes)

Thanks,

tglx

---

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git timers-fixes-for-linus

Gautham R Shenoy (1):
timers: handle HRTIMER_CB_IRQSAFE_UNLOCKED correctly from softirq context

Ingo Molnar (1):
nohz: fix SMP race

Thomas Gleixner (9):
nohz: check broadcast on tick restart
x86: apic reset counter on shutdown
hrtimer: fix the accounting of hrtimer_peek_ahead()
hrtimer: do not peek ahead when the hardware timer has fired already
tick: clear broadcast mask when timer is reprogrammed
irq: call __irq_enter() before calling the tick_idle_check
tick: cleanup idle checks
ACPI: processor_idle: check for spurious wakeups
ACPI: idle check for spurious wakeups accounting fix


arch/x86/kernel/apic.c | 1 +
drivers/acpi/processor_idle.c | 49 ++++++++++++++++++-----
drivers/cpuidle/cpuidle.c | 5 ++-
include/linux/hrtimer.h | 4 +-
include/linux/tick.h | 5 ++
kernel/hrtimer.c | 81 +++++++++++++++++++++++++++++--------
kernel/softirq.c | 7 ++-
kernel/time/tick-broadcast.c | 27 ++++++++++++
kernel/time/tick-internal.h | 5 ++
kernel/time/tick-sched.c | 88 ++++++++++++++++++++++------------------
kernel/time/timer_list.c | 4 +-
11 files changed, 201 insertions(+), 75 deletions(-)

diff --git a/arch/x86/kernel/apic.c b/arch/x86/kernel/apic.c
index 04a7f96..70003b8 100644
--- a/arch/x86/kernel/apic.c
+++ b/arch/x86/kernel/apic.c
@@ -441,6 +441,7 @@ static void lapic_timer_setup(enum clock_event_mode mode,
v = apic_read(APIC_LVTT);
v |= (APIC_LVT_MASKED | LOCAL_TIMER_VECTOR);
apic_write(APIC_LVTT, v);
+ apic_write(APIC_TMICT, 0xffffffff);
break;
case CLOCK_EVT_MODE_RESUME:
/* Nothing to do here */
diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
index 81b40ed..371dc42 100644
--- a/drivers/acpi/processor_idle.c
+++ b/drivers/acpi/processor_idle.c
@@ -41,6 +41,7 @@
#include <linux/pm_qos_params.h>
#include <linux/clockchips.h>
#include <linux/cpuidle.h>
+#include <linux/tick.h>

/*
* Include the apic definitions for x86 to have the APIC timer related defines
@@ -1458,6 +1459,7 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
u32 t1, t2;
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
+ int sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1473,6 +1475,7 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
return 0;
}

+again:
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);

@@ -1480,10 +1483,18 @@ static int acpi_idle_enter_c1(struct cpuidle_device *dev,
acpi_idle_do_entry(cx);
t2 = inl(acpi_gbl_FADT.xpm_timer_block.address);

+ sleep_ticks += ticks_elapsed(t1, t2);
+
local_irq_enable();
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
cx->usage++;

- return ticks_elapsed_in_us(t1, t2);
+ return sleep_ticks;
}

/**
@@ -1497,7 +1508,7 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
u32 t1, t2;
- int sleep_ticks = 0;
+ int ticks, sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1527,6 +1538,7 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
*/
acpi_state_timer_broadcast(pr, cx, 1);

+again:
if (pr->flags.bm_check)
acpi_idle_update_bm_rld(pr, cx);

@@ -1544,12 +1556,19 @@ static int acpi_idle_enter_simple(struct cpuidle_device *dev,
if (tsc_halts_in_c(cx->type))
mark_tsc_unstable("TSC halts in idle");;
#endif
- sleep_ticks = ticks_elapsed(t1, t2);
-
+ ticks = ticks_elapsed(t1, t2);
/* Tell the scheduler how much we idled: */
- sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+ sched_clock_idle_wakeup_event(ticks*PM_TIMER_TICK_NS);

local_irq_enable();
+ sleep_ticks += ticks;
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
+
current_thread_info()->status |= TS_POLLING;

cx->usage++;
@@ -1575,7 +1594,7 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
struct acpi_processor *pr;
struct acpi_processor_cx *cx = cpuidle_get_statedata(state);
u32 t1, t2;
- int sleep_ticks = 0;
+ int ticks, sleep_ticks = 0;

pr = __get_cpu_var(processors);

@@ -1613,14 +1632,16 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,

acpi_unlazy_tlb(smp_processor_id());

- /* Tell the scheduler that we are going deep-idle: */
- sched_clock_idle_sleep_event();
/*
* Must be done before busmaster disable as we might need to
* access HPET !
*/
acpi_state_timer_broadcast(pr, cx, 1);

+again:
+ /* Tell the scheduler that we are going deep-idle: */
+ sched_clock_idle_sleep_event();
+
acpi_idle_update_bm_rld(pr, cx);

/*
@@ -1661,11 +1682,19 @@ static int acpi_idle_enter_bm(struct cpuidle_device *dev,
if (tsc_halts_in_c(ACPI_STATE_C3))
mark_tsc_unstable("TSC halts in idle");
#endif
- sleep_ticks = ticks_elapsed(t1, t2);
+ ticks = ticks_elapsed(t1, t2);
/* Tell the scheduler how much we idled: */
- sched_clock_idle_wakeup_event(sleep_ticks*PM_TIMER_TICK_NS);
+ sched_clock_idle_wakeup_event(ticks*PM_TIMER_TICK_NS);

local_irq_enable();
+ sleep_ticks += ticks;
+
+ /* Check for spurious wakeup */
+ if (check_idle_spurious_wakeup(pr->id) && !need_resched()) {
+ local_irq_disable();
+ goto again;
+ }
+
current_thread_info()->status |= TS_POLLING;

cx->usage++;
diff --git a/drivers/cpuidle/cpuidle.c b/drivers/cpuidle/cpuidle.c
index 5bed733..f3ca665 100644
--- a/drivers/cpuidle/cpuidle.c
+++ b/drivers/cpuidle/cpuidle.c
@@ -69,7 +69,10 @@ static void cpuidle_idle_call(void)
* run any timers that can be run now, at this point
* before calculating the idle duration etc.
*/
- hrtimer_peek_ahead_timers();
+ if (hrtimer_peek_ahead_timers()) {
+ local_irq_enable();
+ return;
+ }

/* ask the governor for the next state */
next_state = cpuidle_curr_governor->select(dev);
diff --git a/include/linux/hrtimer.h b/include/linux/hrtimer.h
index 07e510a..e3d7d9f 100644
--- a/include/linux/hrtimer.h
+++ b/include/linux/hrtimer.h
@@ -294,7 +294,7 @@ static inline int hrtimer_is_hres_active(struct hrtimer *timer)
return timer->base->cpu_base->hres_active;
}

-extern void hrtimer_peek_ahead_timers(void);
+extern int hrtimer_peek_ahead_timers(void);

/*
* The resolution of the clocks. The resolution value is returned in
@@ -318,7 +318,7 @@ extern void hrtimer_peek_ahead_timers(void);
* is expired in the next softirq when the clock was advanced.
*/
static inline void clock_was_set(void) { }
-static inline void hrtimer_peek_ahead_timers(void) { }
+static inline int hrtimer_peek_ahead_timers(void) { return 0; }

static inline void hres_timers_resume(void) { }

diff --git a/include/linux/tick.h b/include/linux/tick.h
index b6ec818..ff202ac 100644
--- a/include/linux/tick.h
+++ b/include/linux/tick.h
@@ -64,6 +64,9 @@ struct tick_sched {
unsigned long last_jiffies;
unsigned long next_jiffies;
ktime_t idle_expires;
+ unsigned long irq_wakeups;
+ unsigned long irq_last_wakeups;
+ unsigned long spurious_wakeups;
};

extern void __init tick_init(void);
@@ -116,6 +119,7 @@ extern void tick_nohz_stop_sched_tick(int inidle);
extern void tick_nohz_restart_sched_tick(void);
extern ktime_t tick_nohz_get_sleep_length(void);
extern u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time);
+extern int check_idle_spurious_wakeup(int cpu);
# else
static inline void tick_nohz_stop_sched_tick(int inidle) { }
static inline void tick_nohz_restart_sched_tick(void) { }
@@ -126,6 +130,7 @@ static inline ktime_t tick_nohz_get_sleep_length(void)
return len;
}
static inline u64 get_cpu_idle_time_us(int cpu, u64 *unused) { return -1; }
+static inline int check_idle_spurious_wakeup(int cpu) { return 0; }
# endif /* !NO_HZ */

#endif
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index 2b465df..075bc8e 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -1209,6 +1209,7 @@ static void run_hrtimer_pending(struct hrtimer_cpu_base *cpu_base)
enum hrtimer_restart (*fn)(struct hrtimer *);
struct hrtimer *timer;
int restart;
+ int emulate_hardirq_ctx = 0;

timer = list_entry(cpu_base->cb_pending.next,
struct hrtimer, cb_entry);
@@ -1217,10 +1218,24 @@ static void run_hrtimer_pending(struct hrtimer_cpu_base *cpu_base)
timer_stats_account_hrtimer(timer);

fn = timer->function;
+ /*
+ * A timer might have been added to the cb_pending list
+ * when it was migrated during a cpu-offline operation.
+ * Emulate hardirq context for such timers.
+ */
+ if (timer->cb_mode == HRTIMER_CB_IRQSAFE_PERCPU ||
+ timer->cb_mode == HRTIMER_CB_IRQSAFE_UNLOCKED)
+ emulate_hardirq_ctx = 1;
+
__remove_hrtimer(timer, timer->base, HRTIMER_STATE_CALLBACK, 0);
spin_unlock_irq(&cpu_base->lock);

- restart = fn(timer);
+ if (unlikely(emulate_hardirq_ctx)) {
+ local_irq_disable();
+ restart = fn(timer);
+ local_irq_enable();
+ } else
+ restart = fn(timer);

spin_lock_irq(&cpu_base->lock);

@@ -1297,23 +1312,26 @@ static void __run_hrtimer(struct hrtimer *timer)
#ifdef CONFIG_HIGH_RES_TIMERS

/*
- * High resolution timer interrupt
- * Called with interrupts disabled
+ * High resolution timer interrupt internal worker function called
+ * with interrupts disabled either from hrtimer_interrupt() or from
+ * hrtimer_peek_ahead()
*/
-void hrtimer_interrupt(struct clock_event_device *dev)
+int __hrtimer_interrupt(struct hrtimer_cpu_base *cpu_base, int peekahead)
{
- struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
struct hrtimer_clock_base *base;
ktime_t expires_next, now;
int i, raise = 0;

- BUG_ON(!cpu_base->hres_active);
- cpu_base->nr_events++;
- dev->next_event.tv64 = KTIME_MAX;
-
retry:
now = ktime_get();

+ /*
+ * In peek ahead mode bail out, if the hw interrupt is
+ * imminent:
+ */
+ if (peekahead && cpu_base->expires_next.tv64 < now.tv64)
+ return -ETIME;
+
expires_next.tv64 = KTIME_MAX;

base = cpu_base->clock_base;
@@ -1370,17 +1388,49 @@ void hrtimer_interrupt(struct clock_event_device *dev)
base++;
}

+ /*
+ * We just peeked ahead. The hardware timer did not expire. So
+ * we can leave the timer armed.
+ */
+ if (peekahead && cpu_base->expires_next.tv64 == expires_next.tv64)
+ goto out;
+
cpu_base->expires_next = expires_next;

/* Reprogramming necessary ? */
if (expires_next.tv64 != KTIME_MAX) {
+ /*
+ * Clear the peeakahead flag once we decided to
+ * reprogram. Otherwise we break out in the check
+ * above.
+ */
+ peekahead = 0;
if (tick_program_event(expires_next, 0))
goto retry;
}

+out:
/* Raise softirq ? */
if (raise)
raise_softirq(HRTIMER_SOFTIRQ);
+
+ return 0;
+}
+
+
+/*
+ * High resolution timer interrupt
+ * Called with interrupts disabled
+ */
+void hrtimer_interrupt(struct clock_event_device *dev)
+{
+ struct hrtimer_cpu_base *cpu_base = &__get_cpu_var(hrtimer_bases);
+
+ BUG_ON(!cpu_base->hres_active);
+ cpu_base->nr_events++;
+ dev->next_event.tv64 = KTIME_MAX;
+
+ __hrtimer_interrupt(cpu_base, 0);
}

/**
@@ -1392,19 +1442,14 @@ void hrtimer_interrupt(struct clock_event_device *dev)
* they are run immediately and then removed from the timer queue.
*
*/
-void hrtimer_peek_ahead_timers(void)
+int hrtimer_peek_ahead_timers(void)
{
- struct tick_device *td;
- unsigned long flags;
+ WARN_ON_ONCE(!irqs_disabled());

if (!hrtimer_hres_active())
- return;
+ return 0;

- local_irq_save(flags);
- td = &__get_cpu_var(tick_cpu_device);
- if (td && td->evtdev)
- hrtimer_interrupt(td->evtdev);
- local_irq_restore(flags);
+ return __hrtimer_interrupt(&__get_cpu_var(hrtimer_bases), 1);
}

static void run_hrtimer_softirq(struct softirq_action *h)
diff --git a/kernel/softirq.c b/kernel/softirq.c
index 7110dae..e7c69a7 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -269,10 +269,11 @@ void irq_enter(void)
{
int cpu = smp_processor_id();

- if (idle_cpu(cpu) && !in_interrupt())
+ if (idle_cpu(cpu) && !in_interrupt()) {
+ __irq_enter();
tick_check_idle(cpu);
-
- __irq_enter();
+ } else
+ __irq_enter();
}

#ifdef __ARCH_IRQ_EXIT_IRQS_DISABLED
diff --git a/kernel/time/tick-broadcast.c b/kernel/time/tick-broadcast.c
index f98a1b7..ffb7252 100644
--- a/kernel/time/tick-broadcast.c
+++ b/kernel/time/tick-broadcast.c
@@ -397,6 +397,33 @@ void tick_check_oneshot_broadcast(int cpu)
}

/*
+ * Called from tick_nohz_kick_tick() to check, whether the next
+ * broadcast event is less than a tick_period away or already pending
+ * to avoid reprogramming of the per cpu device.
+ */
+int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now)
+{
+ struct clock_event_device *bc;
+ ktime_t delta;
+ int res = 0;
+
+ spin_lock(&tick_broadcast_lock);
+ bc = tick_broadcast_device.evtdev;
+
+ if (bc) {
+ delta = ktime_sub(bc->next_event, now);
+ if (delta.tv64 <= tick_period.tv64)
+ res = 1;
+ else
+ cpu_clear(cpu, tick_broadcast_oneshot_mask);
+ } else
+ cpu_clear(cpu, tick_broadcast_oneshot_mask);
+
+ spin_unlock(&tick_broadcast_lock);
+ return res;
+}
+
+/*
* Handle oneshot mode broadcasting
*/
static void tick_handle_oneshot_broadcast(struct clock_event_device *dev)
diff --git a/kernel/time/tick-internal.h b/kernel/time/tick-internal.h
index b1c05bf..b825c39 100644
--- a/kernel/time/tick-internal.h
+++ b/kernel/time/tick-internal.h
@@ -37,6 +37,7 @@ extern void tick_shutdown_broadcast_oneshot(unsigned int *cpup);
extern int tick_resume_broadcast_oneshot(struct clock_event_device *bc);
extern int tick_broadcast_oneshot_active(void);
extern void tick_check_oneshot_broadcast(int cpu);
+extern int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now);
# else /* BROADCAST */
static inline void tick_broadcast_setup_oneshot(struct clock_event_device *bc)
{
@@ -47,6 +48,10 @@ static inline void tick_broadcast_switch_to_oneshot(void) { }
static inline void tick_shutdown_broadcast_oneshot(unsigned int *cpup) { }
static inline int tick_broadcast_oneshot_active(void) { return 0; }
static inline void tick_check_oneshot_broadcast(int cpu) { }
+static inline int tick_check_oneshot_broadcast_wakeup(int cpu, ktime_t now)
+{
+ return 0;
+}
# endif /* !BROADCAST */

#else /* !ONESHOT */
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index 5bbb104..35a775e 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -134,48 +134,31 @@ __setup("nohz=", setup_tick_nohz);
* value. We do this unconditionally on any cpu, as we don't know whether the
* cpu, which has the update task assigned is in a long sleep.
*/
-void tick_nohz_update_jiffies(void)
+static void tick_nohz_update_jiffies(int cpu, struct tick_sched *ts,
+ ktime_t now)
{
- int cpu = smp_processor_id();
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- unsigned long flags;
- ktime_t now;
-
- if (!ts->tick_stopped)
- return;
-
cpu_clear(cpu, nohz_cpu_mask);
- now = ktime_get();
ts->idle_waketime = now;

- local_irq_save(flags);
tick_do_update_jiffies64(now);
- local_irq_restore(flags);

touch_softlockup_watchdog();
}

-static void tick_nohz_stop_idle(int cpu)
+static void tick_nohz_stop_idle(struct tick_sched *ts, ktime_t now)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
-
- if (ts->idle_active) {
- ktime_t now, delta;
- now = ktime_get();
- delta = ktime_sub(now, ts->idle_entrytime);
- ts->idle_lastupdate = now;
- ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
- ts->idle_active = 0;
+ ktime_t delta = ktime_sub(now, ts->idle_entrytime);

- sched_clock_idle_wakeup_event(0);
- }
+ ts->idle_lastupdate = now;
+ ts->idle_sleeptime = ktime_add(ts->idle_sleeptime, delta);
+ ts->idle_active = 0;
+ sched_clock_idle_wakeup_event(0);
}

static ktime_t tick_nohz_start_idle(struct tick_sched *ts)
{
- ktime_t now, delta;
+ ktime_t delta, now = ktime_get();

- now = ktime_get();
if (ts->idle_active) {
delta = ktime_sub(now, ts->idle_entrytime);
ts->idle_lastupdate = now;
@@ -203,6 +186,21 @@ u64 get_cpu_idle_time_us(int cpu, u64 *last_update_time)
}
EXPORT_SYMBOL_GPL(get_cpu_idle_time_us);

+int check_idle_spurious_wakeup(int cpu)
+{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ int ret;
+
+ local_irq_disable();
+ ret= ts->irq_last_wakeups == ts->irq_wakeups;
+
+ ts->irq_last_wakeups = ts->irq_wakeups;
+ ts->spurious_wakeups += ret;
+ local_irq_enable();
+ return ret;
+}
+EXPORT_SYMBOL_GPL(check_idle_spurious_wakeup);
+
/**
* tick_nohz_stop_sched_tick - stop the idle tick from the idle task
*
@@ -413,10 +411,10 @@ void tick_nohz_restart_sched_tick(void)
int cpu = smp_processor_id();
struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
unsigned long ticks;
- ktime_t now;
+ ktime_t now = ktime_get();

local_irq_disable();
- tick_nohz_stop_idle(cpu);
+ tick_nohz_stop_idle(ts, now);

if (!ts->inidle || !ts->tick_stopped) {
ts->inidle = 0;
@@ -430,7 +428,6 @@ void tick_nohz_restart_sched_tick(void)

/* Update jiffies first */
select_nohz_load_balancer(0);
- now = ktime_get();
tick_do_update_jiffies64(now);
cpu_clear(cpu, nohz_cpu_mask);

@@ -566,24 +563,20 @@ static void tick_nohz_switch_to_nohz(void)
* timer and do not touch the other magic bits which need to be done
* when idle is left.
*/
-static void tick_nohz_kick_tick(int cpu)
+static void tick_nohz_kick_tick(int cpu, struct tick_sched *ts, ktime_t now)
{
- struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
- ktime_t delta, now;
-
- if (!ts->tick_stopped)
- return;
+ ktime_t delta;

/*
* Do not touch the tick device, when the next expiry is either
* already reached or less/equal than the tick period.
*/
- now = ktime_get();
delta = ktime_sub(hrtimer_get_expires(&ts->sched_timer), now);
if (delta.tv64 <= tick_period.tv64)
return;

- tick_nohz_restart(ts, now);
+ if (!tick_check_oneshot_broadcast_wakeup(cpu, now))
+ tick_nohz_restart(ts, now);
}

#else
@@ -597,11 +590,26 @@ static inline void tick_nohz_switch_to_nohz(void) { }
*/
void tick_check_idle(int cpu)
{
+ struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);
+ ktime_t now;
+
+ ts->irq_wakeups++;
+
tick_check_oneshot_broadcast(cpu);
+
#ifdef CONFIG_NO_HZ
- tick_nohz_stop_idle(cpu);
- tick_nohz_update_jiffies();
- tick_nohz_kick_tick(cpu);
+ if (!ts->tick_stopped && !ts->idle_active)
+ return;
+
+ now = ktime_get();
+
+ if (ts->idle_active)
+ tick_nohz_stop_idle(ts, now);
+
+ if (ts->tick_stopped) {
+ tick_nohz_update_jiffies(cpu, ts, now);
+ tick_nohz_kick_tick(cpu, ts, now);
+ }
#endif
}

diff --git a/kernel/time/timer_list.c b/kernel/time/timer_list.c
index a999b92..67c3d60 100644
--- a/kernel/time/timer_list.c
+++ b/kernel/time/timer_list.c
@@ -176,6 +176,8 @@ static void print_cpu(struct seq_file *m, int cpu, u64 now)
P(last_jiffies);
P(next_jiffies);
P_ns(idle_expires);
+ P(irq_wakeups);
+ P(spurious_wakeups);
SEQ_printf(m, "jiffies: %Lu\n",
(unsigned long long)jiffies);
}
@@ -252,7 +254,7 @@ static int timer_list_show(struct seq_file *m, void *v)
u64 now = ktime_to_ns(ktime_get());
int cpu;

- SEQ_printf(m, "Timer List Version: v0.4\n");
+ SEQ_printf(m, "Timer List Version: v0.5\n");
SEQ_printf(m, "HRTIMER_MAX_CLOCK_BASES: %d\n", HRTIMER_MAX_CLOCK_BASES);
SEQ_printf(m, "now at %Ld nsecs\n", (unsigned long long)now);



2008-11-09 21:58:18

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, 9 Nov 2008, Lukas Hejtmanek wrote:
> On Sun, Nov 09, 2008 at 01:24:38PM -0800, Arjan van de Ven wrote:
> > it would be greatly appreciated if those that see this issue can test the
> > patch below (which is a combo patch of thomas' fixes, to make testing
> > easier) and confirm if this fixes the hangs. If they do fix the issue we
> > don't need to disable anything and get better power savings as bonus as
> > well.
>
> hmm, on the other hand, powertop application shows that C4 state is 51.5%
> residency and C0 is 47.3% residency. I have about 5% C0 residency and 95-98%
> C3 residency. Is this expected?

Hmm. I do not understand. Where from are the second numbers ?

Can you please provide your .config ?

Thanks,

tglx

2008-11-09 22:46:18

by Bernhard Schmidt

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

In gmane.linux.kernel, Thomas Gleixner wrote:

> Find below the lineup of the timers-fixes-for-linus branch of the tip
> tree (the same as Arjan posted minus the irq fixes)

I can confirm that this patch fixes the "need keypress to continue
booting" problem for me that hit me 100% before.

Regards,
Bernhard

2008-11-09 22:53:42

by Frans Pop

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sunday 09 November 2008, Thomas Gleixner wrote:
> Find below the lineup of the timers-fixes-for-linus branch of the tip
> tree (the same as Arjan posted minus the irq fixes)

I've tested with Thomas' patch set on top of v2.6.28-rc3-261-g8b805ef.

The notebook boots normally, but spends way too much time in C0.

Cheers,
FJP


POWERTOP BEFORE
===============
This is somewhere around -rc3-200 with peek_ahead disabled with a return
in the function. System is on battery after having been suspended to RAM
a few times

Note the double C0 and missing P-states.

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 3.2%)
C0 0.0ms ( 0.0%)
C1 0.0ms ( 0.0%)
C2 0.9ms ( 0.1%)
C4 3.4ms (96.7%)

Wakeups-from-idle per second : 289.1 interval: 10.0s
Power usage (ACPI estimate): 9.8W (4.5 hours)

Top causes for wakeups:
42.8% (136.5) <interrupt> : extra timer interrupt
30.5% ( 97.2) <interrupt> : PS/2 keyboard/mouse/touchpad
12.4% ( 39.5) <kernel IPI> : Rescheduling interrupts
3.6% ( 11.6) <interrupt> : iwlagn
2.5% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
1.7% ( 5.4) artsd : schedule_hrtimeout_range (hrtimer_wakeup)
1.1% ( 3.6) <kernel core> : hrtimer_start (tick_sched_timer)
1.0% ( 3.1) <interrupt> : acpi
0.6% ( 2.0) kicker : schedule_hrtimeout_range (hrtimer_wakeup)
0.4% ( 1.2) klipper : schedule_hrtimeout_range (hrtimer_wakeup)
0.3% ( 1.1) kdesktop : schedule_hrtimeout_range (hrtimer_wakeup)
0.3% ( 1.0) kwifimanager : schedule_hrtimeout_range (hrtimer_wakeup)
0.3% ( 1.0) <kernel core> : enqueue_task_rt (sched_rt_period_timer)
0.3% ( 1.0) artsd : hrtimer_start (it_real_fn)
0.3% ( 1.0) kwrapper : do_nanosleep (hrtimer_wakeup)
0.3% ( 1.0) kwin : schedule_hrtimeout_range (hrtimer_wakeup)
0.3% ( 0.9) Xorg : schedule_hrtimeout_range (hrtimer_wakeup)
0.2% ( 0.5) iwlagn : mod_timer (ieee80211_sta_timer)
0.2% ( 0.5) gpg-agent : schedule_hrtimeout_range (hrtimer_wakeup)
0.2% ( 0.5) hald-addon-stor : schedule_hrtimeout_range (hrtimer_wakeup)


POWERTOP AFTER
==============
Again on battery; P-states are back and the second C0 is now "polling"
(both same as with 2.6.27.4).

System permanently remains very high in C0!
Wakeups go down to ~30 without user input (keyboard/mouse activity), which
is a bit lower than I get with .27.4.

Cn Avg residency P-states (frequencies)
C0 (cpu running) (49.7%) 1333 Mhz 0.0%
polling 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C2 7.5ms ( 1.9%) 800 Mhz 100.0%
C4 8.5ms (48.4%)

Wakeups-from-idle per second : 59.6 interval: 15.0s
Power usage (ACPI estimate): 9.7W (4.4 hours)

Top causes for wakeups:
34.6% ( 36.3) <interrupt> : PS/2 keyboard/mouse/touchpad
16.0% ( 16.7) <kernel IPI> : Rescheduling interrupts
8.9% ( 9.3) <interrupt> : iwlagn
8.4% ( 8.8) <interrupt> : extra timer interrupt
7.8% ( 8.1) <kernel module> : mod_timer (rh_timer_func)
5.6% ( 5.9) <kernel core> : hrtimer_start (tick_sched_timer)
5.2% ( 5.4) artsd : schedule_hrtimeout_range (hrtimer_wakeup)
2.9% ( 3.0) kicker : schedule_hrtimeout_range (hrtimer_wakeup)
1.1% ( 1.2) kdesktop : schedule_hrtimeout_range (hrtimer_wakeup)
1.0% ( 1.1) Xorg : schedule_hrtimeout_range (hrtimer_wakeup)
1.0% ( 1.0) kwifimanager : schedule_hrtimeout_range (hrtimer_wakeup)
1.0% ( 1.0) artsd : hrtimer_start (it_real_fn)
1.0% ( 1.0) kwrapper : do_nanosleep (hrtimer_wakeup)
1.0% ( 1.0) kwin : schedule_hrtimeout_range (hrtimer_wakeup)
1.0% ( 1.0) klipper : schedule_hrtimeout_range (hrtimer_wakeup)
0.5% ( 0.5) hald-addon-stor : schedule_hrtimeout_range (hrtimer_wakeup)
0.5% ( 0.5) iwlagn : mod_timer (ieee80211_sta_timer)
0.5% ( 0.5) <kernel core> : __mod_timer (neigh_periodic_timer)
0.4% ( 0.5) gpg-agent : schedule_hrtimeout_range (hrtimer_wakeup)
0.4% ( 0.4) knotify : schedule_hrtimeout_range (hrtimer_wakeup)

In both cases there's no C3, but I think that's because C3 changes to C4
when on battery?

Cheers,
FJP

2008-11-10 08:14:23

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 10:57:45PM +0100, Thomas Gleixner wrote:
> > hmm, on the other hand, powertop application shows that C4 state is 51.5%
> > residency and C0 is 47.3% residency. I have about 5% C0 residency and 95-98%
> > C3 residency. Is this expected?
>
> Hmm. I do not understand. Where from are the second numbers ?

sorry, the second numbers are without your patch.

> Can you please provide your .config ?

attached.

--
Luk?? Hejtm?nek


Attachments:
(No filename) (453.00 B)
.config (54.60 kB)
Download all attachments

2008-11-10 12:19:38

by Lukas Hejtmanek

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sun, Nov 09, 2008 at 10:57:45PM +0100, Thomas Gleixner wrote:
> > > it would be greatly appreciated if those that see this issue can test the
> > > patch below (which is a combo patch of thomas' fixes, to make testing
> > > easier) and confirm if this fixes the hangs. If they do fix the issue we
> > > don't need to disable anything and get better power savings as bonus as
> > > well.

another observation, it sometimes happens that a key press causes a repeat too
fast.

--
Luk?? Hejtm?nek

2008-11-10 13:47:30

by Yves-Alexis Perez

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On dim, 2008-11-09 at 13:24 -0800, Arjan van de Ven wrote:
> On Sun, 9 Nov 2008 21:34:32 +0100
> Lukas Hejtmanek <[email protected]> wrote:
>
> > On Sun, Nov 09, 2008 at 12:28:35PM -0800, Arjan van de Ven wrote:
> > > for me, the plan is that we need to get Thomas' fixes tested by
> > > someone who can reproduce this very reliably. If they fix it, great.
> >
> > as for me, 100% of boots require a key hit to proceed in 2.6.28-rc3
> > (more precisely, the current git head).
> >
>
> it would be greatly appreciated if those that see this issue can test
> the patch below (which is a combo patch of thomas' fixes, to make
> testing easier) and confirm if this fixes the hangs.
> If they do fix the issue we don't need to disable anything and get better power
> savings as bonus as well.

Under 2.6.28-rc4 the bootime seems ok, but I have way too much powertop
wakups:

corsac@hidalgo: sudo powertop -d
PowerTOP 1.10 (C) 2007, 2008 Intel Corporation

Collecting data for 15 seconds


Cn Avg residency
C0 (cpu running) ( 4,3%)
C0 0,0ms ( 0,0%)
C1 0,0ms ( 0,0%)
C2 0,5ms ( 0,3%)
C4 2,8ms (95,5%)
P-states (frequencies)
2,21 Ghz 1,3%
2,21 Ghz 0,0%
1,60 Ghz 0,5%
800 Mhz 98,1%
Wakeups-from-idle per second : 346,5 interval: 15,0s
Power usage (ACPI estimate): 15,1W (1,6 hours)
Top causes for wakeups:
74,1% (278,5) <interrupt> : extra timer interrupt
8,8% ( 33,2) <kernel IPI> : Rescheduling interrupts
4,1% ( 15,3) <interrupt> : uhci_hcd:usb5, yenta,
i915@pci:0000:00:02.0
3,6% ( 13,7) <interrupt> : iwlagn
1,7% ( 6,5) <interrupt> : acpi
1,1% ( 4,1) <kernel module> : usb_hcd_poll_rh_status
(rh_timer_func)
1,0% ( 3,9) <interrupt> : ahci

Cheers,
--
Yves-Alexis


Attachments:
signature.asc (197.00 B)
This is a digitally signed message part

2008-11-11 17:33:34

by Len Brown

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+



> Under 2.6.28-rc4 the bootime seems ok, but I have way too much powertop
> wakups:

I see the same on a T61.
There was a boot-stall in -rc3, now gone in rc4.
(as of today's top of tree: 3ad4f597058301c97f362e500a32f63f5c950a45)

However, I'm seeing over 800 "extra timer interrupt" per second when idle

-Len

2008-11-14 01:45:53

by Frans Pop

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

Hi Thomas/Arjan,

On Sunday 09 November 2008, Thomas Gleixner wrote:
> I worked all week after returning from travelling to get down to the
> root cause of this. We had several issues:
>
> 1) peek_ahead confused the hell out of the nohz code when it expired
> timers when the timer interrupt was already pending
>
> 2) the changes I did to prevent the stale jiffies in the nohz case
> turned out to be incomplete and wrong versus the broadcast of timers
> in deeper C-States
>
> 3) the consolidation of the nohz idle checks called the nohz jiffies
> update code before calling __irq_enter() which caused useless wakeups
> of ksoftirqd
>
> 4) A long standing - but until now papered over and therefor unnoticed
> - problem of spurious wakeups from deep c-states:
>
> If one core gets an interrupt the other core(s) wake up as
> well. The stale jiffies fix and the peek_ahead changes made this
> more visible and is a major part of the reported nohz (more wakeups,
> more power consumption) regression.
>
> This was there forever, but nobody noticed as it was not accounted
> for.
>
> Find below the lineup of the timers-fixes-for-linus branch of the tip
> tree (the same as Arjan posted minus the irq fixes)

Could either of you maybe give a status update on this patch set and the
remaining issues with it that were reported (especially the high C0
reported by powertop)?

TIA,
FJP

2008-11-15 17:16:44

by Frans Pop

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

*** 2.6.26.3 on on mains ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.4%) 1333 Mhz 0.0%
polling 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C3 14.0ms (99.6%) 800 Mhz 100.0%


Wakeups-from-idle per second : 71.5 interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
33.1% ( 23.7) <interrupt> : iwl4965
19.6% ( 14.1) <interrupt> : extra timer interrupt
11.4% ( 8.1) <kernel module> : usb_hcd_poll_rh_status (rh_timer_func)
9.0% ( 6.5) artsd : schedule_timeout (process_timeout)
7.1% ( 5.1) <kernel IPI> : Rescheduling interrupts
===================================================================
*** 2.6.27.4 on battery ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.7%) 1333 Mhz 0.0%
polling 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C2 0.8ms ( 0.0%) 800 Mhz 100.0%
C4 20.1ms (99.3%)

Wakeups-from-idle per second : 49.8 interval: 15.0s
Power usage (ACPI estimate): 8.5W (4.9 hours)

Top causes for wakeups:
21.6% ( 9.3) <interrupt> : iwlagn
19.0% ( 8.1) <kernel module> : mod_timer (rh_timer_func)
13.2% ( 5.7) artsd : __mod_timer (process_timeout)
10.4% ( 4.5) <kernel IPI> : Rescheduling interrupts
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*** 2.6.27.4 on mains ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.3%) 1333 Mhz 0.0%
C0 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C3 19.2ms (99.7%) 800 Mhz 100.0%


Wakeups-from-idle per second : 52.0 interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
20.3% ( 8.2) <interrupt> : iwlagn
19.8% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
13.9% ( 5.6) artsd : __mod_timer (process_timeout)
9.3% ( 3.7) <kernel IPI> : Rescheduling interrupts
===================================================================
*** 2.6.28-rc4 on mains ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.4%) 1333 Mhz 0.0%
polling 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C3 16.7ms (99.6%) 800 Mhz 100.0%


Wakeups-from-idle per second : 59.7 interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
21.5% ( 15.6) <kernel IPI> : Rescheduling interrupts
13.5% ( 9.8) <interrupt> : iwlagn
13.5% ( 9.8) <kernel core> : hrtimer_start (tick_sched_timer)
11.0% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
9.8% ( 7.1) <interrupt> : extra timer interrupt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*** 2.6.28-rc4 just after switch to battery ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) (71.6%) 1333 Mhz 0.0%
C0 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms (614891469122. 1067 Mhz 0.0%
C2 0.5ms ( 0.1%) 800 Mhz 100.0%
C4 3.8ms (29.5%)

Wakeups-from-idle per second : 54.4 interval: 15.0s
Power usage (ACPI estimate): 10.0W (4.4 hours)

Top causes for wakeups:
26.8% ( 41.3) <kernel IPI> : Rescheduling interrupts
20.7% ( 31.9) <interrupt> : uhci_hcd:usb4, HDA Intel
11.8% ( 18.1) <interrupt> : iwlagn
10.1% ( 15.5) <interrupt> : extra timer interrupt
5.7% ( 8.7) <kernel core> : hrtimer_start (tick_sched_timer)
5.2% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*** 2.6.28-rc4 on battery after settling down ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.6%) 1333 Mhz 0.0%
C0 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C2 8.1ms ( 0.4%) 800 Mhz 100.0%
C4 14.5ms (99.0%)

Wakeups-from-idle per second : 69.0 interval: 15.0s
Power usage (ACPI estimate): 8.7W (4.9 hours) (long term: 26.2W,/1.6h)

Top causes for wakeups:
24.0% ( 21.4) <kernel IPI> : Rescheduling interrupts
17.8% ( 15.9) <interrupt> : extra timer interrupt
15.7% ( 14.0) <interrupt> : iwlagn
10.9% ( 9.7) <kernel core> : hrtimer_start (tick_sched_timer)
9.0% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
*** 2.6.28-rc4 on mains again ***
PowerTOP version 1.10 (C) 2007 Intel Corporation

Cn Avg residency P-states (frequencies)
C0 (cpu running) ( 0.3%) 1333 Mhz 0.0%
C0 0.0ms ( 0.0%) 1200 Mhz 0.0%
C1 0.0ms ( 0.0%) 1067 Mhz 0.0%
C3 17.5ms (99.7%) 800 Mhz 100.0%


Wakeups-from-idle per second : 57.1 interval: 15.0s
no ACPI power usage estimate available

Top causes for wakeups:
26.4% ( 17.7) <kernel IPI> : Rescheduling interrupts
15.1% ( 10.1) <interrupt> : iwlagn
11.9% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
8.9% ( 6.0) <interrupt> : extra timer interrupt
8.4% ( 5.7) <kernel core> : hrtimer_start (tick_sched_timer)


Attachments:
(No filename) (1.83 kB)
powertop.txt (5.97 kB)
Download all attachments

2008-11-15 18:03:46

by Arjan van de Ven

[permalink] [raw]
Subject: Re: Bootup time regression from 2.6.27 to 2.6.28-rc3+

On Sat, 15 Nov 2008 18:16:31 +0100
Frans Pop <[email protected]> wrote:

> On Friday 14 November 2008, Frans Pop wrote:
> > > Find below the lineup of the timers-fixes-for-linus branch of the
> > > tip tree (the same as Arjan posted minus the irq fixes)
> >
> > Could either of you maybe give a status update on this patch set and
> > the remaining issues with it that were reported (especially the
> > high C0 reported by powertop)?
>
> I guess part of the answer is:
> commit ae99286b4f1be7788f2d6947c66a91dbd6351eec
> Author: Thomas Gleixner <[email protected]>
> Date: Mon Nov 10 13:20:23 2008 +0100
> nohz: disable tick_nohz_kick_tick() for now
>
> I've just done some testing using v2.6.28-rc4-322-g58e20d8 which
> includes this patch.
>
> CPU usage reported by powertop is now normal again (close to 100% in
> lowest C state), but I'm still getting high counts for:
> - <kernel IPI> : Rescheduling interrupts
> Typically 3-5 on .26/.27; 15-17 on .28.

these are caused by the scheduler, not by the timer code.
(And sometimes they're caused by the scheduler on behalf of something
else even)

> - <interrupt> : extra timer interrupt
> Consistently more prominent for .28 (though not with high values)

if it's less than 5 total, don't worry about it, some of that can well
be rounding/measurement effects
(powertop has to make a few approximations around the edges of the
measurement interval because things don't always line up perfectly)

> than for .27, but way better than for .26.
>
>
> powertop questions
> ------------------
> - What's with this change from "polling" to "C0" for the 2nd C state?

that sounds like an interesting bug...

> After boot I always (both when booted on mains and on battery) get
> "polling", but after inserting or removing mains it will continue to
> show "C0" independent of any further power state changes.
> - After switching to battery the first time I get something like:
> C1 0.0ms (614891469122.
> This is reproducible. I guess a rounding error due to the change in
> the number of C states. Later updates clear this.

yeah it's annoying but it is only for one display period

> - Looks like someone could not make up his mind between comma and
> slash here: "(long term: 26.2W,/1.6h)"

that sounds easy to fix ;-)

Thanks for the bugreports; either me or Auke will take a look at these
shortly.

--
Arjan van de Ven Intel Open Source Technology Centre
For development, discussion and tips for power savings,
visit http://www.lesswatts.org

2008-11-16 19:27:57

by Frans Pop

[permalink] [raw]
Subject: [2.6.28-rc5] TSC sync error and high hrtimer_start (was: Bootup time regression from 2.6.27 to 2.6.28-rc3+)

On Friday 14 November 2008, Frans Pop wrote:
> Could either of you maybe give a status update on this patch set and
> the remaining issues with it that were reported (especially the high C0
> reported by powertop)?

My previous tests were on my laptop. This is for my desktop running -rc5.
I had not tested my desktop with earlier .28 releases.

For .28 powertop prominently shows hrtimer_start (tick_sched_timer) which
was not present for .27. The list below that is virtually identical to .27.

Total wakeups for .28 is lower though. Something to look into or not?

Possibly this may be related to this diff in the boot messages:
-checking TSC synchronization [CPU#0 -> CPU#1]: passed.
+checking TSC synchronization [CPU#0 -> CPU#1]:
+Measured 696 cycles TSC warp between CPUs, turning off TSC clock.
+Marking TSC unstable due to check_tsc_sync_source failed

With .27.4 TSC _always_ passed. Now, with .28-rc5 I get this warp. I've
done two boots so far, for the other one the warp was 680 cycles.

Cheers,
FJP

Powertop output for idle system:
*** 2.6.27.4 ***
Wakeups-from-idle per second : 68.7 interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
31.5% ( 17.8) <kernel core> : __mod_timer (neigh_periodic_timer)
14.5% ( 8.2) <kernel module> : mod_timer (rh_timer_func)
11.3% ( 6.4) artsd : __mod_timer (process_timeout)
9.2% ( 5.2) ksysguardd : __mod_timer (process_timeout)
4.2% ( 2.4) kicker : __mod_timer (process_timeout)
3.5% ( 2.0) <kernel core> : add_timer_on (clocksource_watchdog)

*** 2.6.28-rc5 ***
Wakeups-from-idle per second : 55.6 interval: 10.0s
no ACPI power usage estimate available

Top causes for wakeups:
28.5% ( 23.4) <kernel core> : hrtimer_start (tick_sched_timer)
21.7% ( 17.8) <kernel core> : __mod_timer (neigh_periodic_timer)
9.7% ( 8.0) <kernel module> : mod_timer (rh_timer_func)
7.8% ( 6.4) artsd : schedule_hrtimeout_range (hrtimer_wakeup)
5.4% ( 4.4) ksysguardd : __mod_timer (process_timeout)
2.8% ( 2.3) kicker : schedule_hrtimeout_range (hrtimer_wakeup)
2.4% ( 2.0) <kernel core> : add_timer_on (clocksource_watchdog)

2008-11-17 02:18:35

by Frans Pop

[permalink] [raw]
Subject: Re: [2.6.28-rc5] TSC sync error and high hrtimer_start (was: Bootup time regression from 2.6.27 to 2.6.28-rc3+)

On Sunday 16 November 2008, Frans Pop wrote:
> Possibly this may be related to this diff in the boot messages:
> -checking TSC synchronization [CPU#0 -> CPU#1]: passed.
> +checking TSC synchronization [CPU#0 -> CPU#1]:
> +Measured 696 cycles TSC warp between CPUs, turning off TSC clock.
> +Marking TSC unstable due to check_tsc_sync_source failed
>
> With .27.4 TSC _always_ passed. Now, with .28-rc5 I get this warp. I've
> done two boots so far, for the other one the warp was 680 cycles.

Maybe this extra context a bit earlier in dmesg is relevant here:

-Calibrating delay loop (skipped), value calculated using timer frequency.. 6399.76 BogoMIPS (lpj=12799520)
+HPET: 3 timers in total, 0 timers will be used for per-cpu timer
+Calibrating delay loop (skipped), value calculated using timer frequency.. 6399.73 BogoMIPS (lpj=12799476)
[...]
-ACPI: Core revision 20080609
+ACPI: Core revision 20080926
+Setting APIC routing to flat
..TIMER: vector=0x30 apic1=0 pin1=2 apic2=-1 pin2=-1
CPU0: Intel(R) Pentium(R) D CPU 3.20GHz stepping 07
-Using local APIC timer interrupts.
-APIC timer calibration result 12499520
-Detected 12.499 MHz APIC timer.
-Booting processor 1/1 ip 6000
+Booting processor 1 APIC 0x1 ip 0x6000
Initializing CPU#1
-Calibrating delay using timer specific routine.. 6399.94 BogoMIPS (lpj=12799898)
+Calibrating delay using timer specific routine.. 6198.05 BogoMIPS (lpj=12396107)

Note the rather large difference in BogoMIPS between core 0 and 1.