2008-03-02 19:48:32

by Vaidyanathan Srinivasan

Subject: Too many timer interrupts in NO_HZ

Hi,

I am trying to play around with the tickless kernel and the
sched_mc_power_savings tunable to increase the idle duration of the
CPUs in an SMP server. sched_mc_power_savings provides good
consolidation of long-running jobs when the system is lightly loaded.
However, I am interested in seeing CPU idle times of a couple of
minutes on at least a few CPUs in a completely idle SMP system.

I have posted related instrumentation results at
http://lkml.org/lkml/2008/1/8/243

The experimental setup is identical, except that I have tried to
collect more data and also to bias the wakeup logic.

The kernel is 2.6.25-rc3 with NO_HZ, SCHED_MC, HIGH_RES_TIMERS, and
HPET turned on.

I have tried to collect interesting data from tick-sched.c, sched.c
and softirq.c.

As suggested by Ingo in the previous thread, I have hacked the
try_to_wake_up() logic in sched.c so that I can force all wake-ups to
happen on CPU0. This is just an experiment; in reality we will have
to carefully choose a domain and a CPU to which to bias all wakeups
on an idle system.

--- linux-2.6.25-rc3.orig/kernel/sched_fair.c
+++ linux-2.6.25-rc3/kernel/sched_fair.c
@@ -991,6 +991,10 @@ static int select_task_rq_fair(struct ta
this_cpu = smp_processor_id();
new_cpu = cpu;

+ /* Force move all tasks to CPU 0 */
+ if (unlikely(cpu_isset(0, p->cpus_allowed)))
+ return 0;
+
if (cpu == this_cpu)
goto out_set_cpu;

Experiment:
-----------

Hardware:
* Two-socket dual-core Intel Xeon (4 logical CPUs)

Setup:

* Killed most daemons, including irqbalance
* Routed all interrupts to CPU0
* sched_mc_power_savings = 1
* powertop gives an average ~3 wakeups/sec
* Trace data collected for 120 secs on this completely idle machine
* Log /proc/interrupts before and after the observation period

Results:

Interrupt count for 120 seconds:

           CPU0   CPU1   CPU2   CPU3
 17:         55      0      0      0   IO-APIC-fasteoi-aacraid
214:        277      0      0      0   PCI-MSI-edge-eth0
LOC:        676    306    324    296   Local timer interrupts
CAL:          0     12     12     12   Function call interrupts
TLB:          5      3      0      0   TLB shootdowns


                    CPU0   CPU1   CPU2   CPU3
Maximum idle
time (ms):          1000   5250   5320   5250

Some of the processes that were woken up on CPUs 1, 2 & 3 because of
affinity and that could not be moved to CPU0:

Wakeup count   PID    Process
60             7      [ksoftirqd/1]
58             10     [ksoftirqd/2]
58             13     [ksoftirqd/3]
8              17     [events/2]
8              1098   [kondemand/2]

The problem:

There are way too many timer interrupts even though the CPUs have
entered the tickless idle loop. Timer interrupts briefly bring the
CPU out of idle, after which it returns to tickless idle. There are
very few try_to_wake_up() or need_resched() calls in between the
timer interrupts.

What can happen in an idle system in the timer interrupt context that
does not invoke a need_resched() or try_to_wake_up()?

Trace pattern:

CPU2 trace sequence (timestamps and expiry values in ns):

2135960054680 tick_nohz_stop_sched_tick(), expires = 7290612000000
*** tick_stopped = 1 ***
2136363725957 tick_nohz_stop_sched_tick(), expires = 7290612000000
*** Timer Interrupt ***
2136767397163 tick_nohz_stop_sched_tick(), expires = 7290612000000
*** Timer Interrupt ***
2137171068369 tick_nohz_stop_sched_tick(), expires = 7290612000000
*** Timer Interrupt ***
2137574739785 tick_nohz_stop_sched_tick(), expires = 7290612000000
.........
.........
*** Timer Interrupt ***
2140804114397 tick_nohz_stop_sched_tick(), expires = 2140808000000
*** try_to_wake_up() ***
2140804121591 tick_nohz_restart_sched_tick()
*** tick_stopped = 0 ***

There is a similar pattern on CPU1 and CPU3, while CPU0 has
significantly more wake-ups and timer interrupts.

Basically, there are more than 10 timer interrupts at a 4 ms interval
even though the next expected timer event is far in the future. My HZ
is 250, so I assume the tick could not be stopped for some reason.

There are instances in the trace where the tick was stopped
successfully, and we get a much larger interval between successive
timer interrupts and tick_nohz_stop_sched_tick() calls.

Please help me understand the following:

* What can happen in timer interrupt context that does not need to
wake up any process?
* What can prevent tick_nohz_stop_sched_tick() from actually stopping
the tick?
* What's wrong with expecting some of the CPUs to have tickless idle
times of a few minutes?

Thank you for reviewing my experiment. Comments and suggestions are
welcome. I will send the instrumentation patch and trace data on
request.

--Vaidy


2008-03-02 19:58:20

by Arjan van de Ven

Subject: Re: Too many timer interrupts in NO_HZ

On Mon, 3 Mar 2008 01:18:13 +0530
Vaidyanathan Srinivasan <[email protected]> wrote:

> Hi,
>
> I am trying to play around with the tickless kernel and
> sched_mc_power_savings tunables to increase the idle duration of the
> CPU in an SMP server. sched_mc_power_savings provides good
> consolidation of long running jobs when the system is lightly loaded.
> However I am interested to see CPU idle times for couple minutes on
> atleast few CPUs in an completely idle SMP system.
>
> I have posted related instrumentation results at
> http://lkml.org/lkml/2008/1/8/243
>
> The experimental setup is identical. Just that I have tried to
> collect more data and also bias the wakeup logic.
>
> The kernel is 2.6.25-rc3 with NO_HZ, SCHED_MC, HIGH_RES_TIMERS, and
> HPET turned on.
>
> I have tried to collect interesting data from tick-sched.c, sched.c
> and softirq.c.
>
> As suggested by Ingo in the previous thread, I have hacked the sched.c
> try_to_wake_up logic so that I can force all wake-ups to happen on
> CPU0. This is just and experiment, in reality we will have to
> carefully choose a domain and CPU to bias all wakeups on an idle
> system.


please show us powertop -d output; that way we can at least see sort of what
is going on...

2008-03-02 20:25:30

by Vaidyanathan Srinivasan

Subject: Re: Too many timer interrupts in NO_HZ

* Arjan van de Ven <[email protected]> [2008-03-02 11:57:09]:
[snipped]

>
> please show us powertop -d output; that way we can at least see sort of what
> is going on...

Here you go...

[root@XXXX sv]# powertop -d
PowerTOP 1.9 (C) 2007 Intel Corporation

Collecting data for 15 seconds
< Detailed C-state information is only available on Mobile CPUs (laptops) >
P-states (frequencies)
2.40 Ghz 0.0%
2.13 Ghz 0.0%
1.87 Ghz 0.0%
1.60 Ghz 100.0%
Wakeups-from-idle per second : 3.1 interval: 15.0s
no ACPI power usage estimate available
Top causes for wakeups:
41.4% ( 2.4) <interrupt> : eth0
17.2% ( 1.0) ip : bnx2_open (bnx2_timer)
9.2% ( 0.5) <kernel core> : neigh_table_init_no_netlink (neigh_periodic_timer)
3.4% ( 0.2) <kernel IPI> : function call interrupts
3.4% ( 0.2) init : schedule_timeout (process_timeout)
3.4% ( 0.2) python : schedule_timeout (process_timeout)
3.4% ( 0.2) <kernel core> : __netdev_watchdog_up (dev_watchdog)
3.4% ( 0.2) sendmail : schedule_timeout (process_timeout)
3.4% ( 0.2) <kernel core> : page_writeback_init (wb_timer_fn)
2.3% ( 0.1) <interrupt> : aacraid
2.3% ( 0.1) rc.sysinit : start_this_handle (commit_timeout)
1.1% ( 0.1) <kernel core> : sk_reset_timer (tcp_delack_timer)
1.1% ( 0.1) sshd : sk_reset_timer (tcp_write_timer)
1.1% ( 0.1) <kernel core> : cache_register (delayed_work_timer_fn)
1.1% ( 0.1) <kernel core> : init_nonfatal_mce_checker (delayed_work_timer_fn)
1.1% ( 0.1) crond : do_nanosleep (hrtimer_wakeup)
1.1% ( 0.1) <kernel core> : neigh_add_timer (neigh_timer_handler)

Suggestion: increase the VM dirty writeback time from 5.00 to 15 seconds with:
echo 1500 > /proc/sys/vm/dirty_writeback_centisecs
This wakes the disk up less frequently for background VM activity

Suggestion: enable the noatime filesystem option by executing the following command:
mount -o remount,noatime / or by pressing the T key
noatime disables persistent access time of file accesses, which causes lots of disk IO.

--Vaidy

2008-03-05 15:39:42

by Vaidyanathan Srinivasan

Subject: Re: Too many timer interrupts in NO_HZ

* Arjan van de Ven <[email protected]> [2008-03-02 11:57:09]:

> On Mon, 3 Mar 2008 01:18:13 +0530
> Vaidyanathan Srinivasan <[email protected]> wrote:
>
> > Hi,
> >
> > I am trying to play around with the tickless kernel and
> > sched_mc_power_savings tunables to increase the idle duration of the
> > CPU in an SMP server. sched_mc_power_savings provides good
> > consolidation of long running jobs when the system is lightly loaded.
> > However I am interested to see CPU idle times for couple minutes on
> > atleast few CPUs in an completely idle SMP system.
> >
> > I have posted related instrumentation results at
> > http://lkml.org/lkml/2008/1/8/243
> >
> > The experimental setup is identical. Just that I have tried to
> > collect more data and also bias the wakeup logic.
> >
> > The kernel is 2.6.25-rc3 with NO_HZ, SCHED_MC, HIGH_RES_TIMERS, and
> > HPET turned on.
> >
> > I have tried to collect interesting data from tick-sched.c, sched.c
> > and softirq.c.
> >
> > As suggested by Ingo in the previous thread, I have hacked the sched.c
> > try_to_wake_up logic so that I can force all wake-ups to happen on
> > CPU0. This is just and experiment, in reality we will have to
> > carefully choose a domain and CPU to bias all wakeups on an idle
> > system.
>
>
> please show us powertop -d output; that way we can at least see sort of what
> is going on...

I have some more information on the timer interrupt activity in the
system. I have obtained the following trace from irq_exit; the local
timer is the only interrupt source on CPUs 1, 2 & 3.

Trace no.   CPU   Function          Timestamp

U2336208 C1 irq_exit EXIT 4452403.687322 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336212 C3 irq_exit EXIT 4452403.722253 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336214 C2 irq_exit EXIT 4452403.722668 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336236 C1 irq_exit EXIT 4452807.345543 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336241 C3 irq_exit EXIT 4452807.381894 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336242 C2 irq_exit EXIT 4452807.382144 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336248 C1 irq_exit EXIT 4453211.003638 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336253 C3 irq_exit EXIT 4453211.041529 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336254 C2 irq_exit EXIT 4453211.041703 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336260 C1 irq_exit EXIT 4453614.662043 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336265 C3 irq_exit EXIT 4453614.701429 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336266 C2 irq_exit EXIT 4453614.701474 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336293 C1 irq_exit EXIT 4454018.320078 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336298 C3 irq_exit EXIT 4454018.361100 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336299 C2 irq_exit EXIT 4454018.361142 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336313 C1 irq_exit EXIT 4454421.978047 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336318 C2 irq_exit EXIT 4454422.020756 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336319 C3 irq_exit EXIT 4454422.020807 ms in_intr 0 in_idle 1 need_resched 0 pid 0


U2336357 C1 irq_exit EXIT 4454825.636341 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336369 C2 irq_exit EXIT 4454825.681058 ms in_intr 0 in_idle 1 need_resched 0 pid 0
U2336370 C3 irq_exit EXIT 4454825.681191 ms in_intr 0 in_idle 1 need_resched 0 pid 0

This is from a short interval in which, as I complained, the tick is
not really getting stopped. There is actually a pattern of a 400 ms
timer being programmed.

Apart from these, there is a 4 ms pattern on one of the CPUs, which
is used to drive tick_sched_timer. Most of the missing trace entries
are from CPU0, which takes care of all the daemon processes on this
idle system.

There are wakeups due to kondemand and other kernel threads on CPUs
1, 2 & 3, but the intervals between them are on the order of seconds.

Please help me find the source for this timer.

Thanks,
Vaidy

2008-03-16 18:18:46

by Vaidyanathan Srinivasan

Subject: Re: Too many timer interrupts in NO_HZ

* Vaidyanathan Srinivasan <[email protected]> [2008-03-03 01:18:13]:

[snipped]


> The problem:
>
> There are way too many timer interrupts even though the CPUs have
> entered tickless idle loop. Timer interrupts basically bring the CPU
> out of idle, and then return to tickless idle. There are very few
> try_to_wake_up()s or need_resched() in between the timer interrupts.
>
> What can happen in an idle system in the timer interrupt context that
> does not invoke a need_resched() or try_to_wake_up()?
>

[snipped]

> Please help me to understand the following scenario:
>
> * What can happen in timer interrupt context that need not wakeup any
> process?
> * What can prevent tick_nohz_stop_sched_tick() from actually stopping
> the tick?
> * Whats wrong in expecting to see some of the CPUs having tickless
> idle time of few minutes

I think I have the answers to some of the above questions.

Address      Count   Function

c0219922 : 5 blk_unplug_timeout
c014f464 : 55 wb_timer_fn
c02b2e67 : 350 bnx2_timer
c03efcbc : 115 neigh_periodic_timer
c012894a : 220 process_timeout
c027d1c4 : 2 hangcheck_fire
c03fe232 : 3 peer_check_expire
c012e830 : 25 delayed_work_timer_fn
c03afe92 : 2783 ehci_watchdog
c03f1a35 : 10 neigh_timer_handler
c01d7456 : 110 commit_timeout
c04381e9 : 2 addrconf_verify
c03f6d39 : 365 dev_watchdog
c04126bc : 114 tcp_write_timer

These are, roughly, the functions that were responsible for the timer
interrupts across all CPUs, including the idle ones. Most of these
functions complete their job in interrupt context and also re-queue
the timer. The count is the number of calls during the 120 s
observation window across all 4 CPUs.

I got these function addresses from __next_timer_interrupt() in
timer.c. Previously, I had not looked in timer.c, since hrtimers
were enabled and I assumed all timer calls would go through
__run_hrtimer() in hrtimer.c. The call count of __run_hrtimer() was
very low and did not correspond to the local timer interrupt count.

These are device-driver timers that we need to investigate. We
should try to migrate them to CPU0 (or some other package) to get
really long uninterrupted CPU sleep times.

I will post more results after tweaking some of the above timers.

Should PowerTOP include local timer interrupt counts as well during
the observation period? Interrupts significantly affect CPU sleep
time whether or not they wake up any process.

Comments?

Thanks,
Vaidy

2008-03-17 02:34:21

by Alan Stern

Subject: Re: [linux-pm] Too many timer interrupts in NO_HZ

On Sun, 16 Mar 2008, Vaidyanathan Srinivasan wrote:

> > The problem:
> >
> > There are way too many timer interrupts even though the CPUs have
> > entered tickless idle loop. Timer interrupts basically bring the CPU
> > out of idle, and then return to tickless idle. There are very few
> > try_to_wake_up()s or need_resched() in between the timer interrupts.
> >
> > What can happen in an idle system in the timer interrupt context that
> > does not invoke a need_resched() or try_to_wake_up()?

> Function Count Name
> Address
>
> c0219922 : 5 blk_unplug_timeout
> c014f464 : 55 wb_timer_fn
> c02b2e67 : 350 bnx2_timer
> c03efcbc : 115 neigh_periodic_timer
> c012894a : 220 process_timeout
> c027d1c4 : 2 hangcheck_fire
> c03fe232 : 3 peer_check_expire
> c012e830 : 25 delayed_work_timer_fn
> c03afe92 : 2783 ehci_watchdog
> c03f1a35 : 10 neigh_timer_handler
> c01d7456 : 110 commit_timeout
> c04381e9 : 2 addrconf_verify
> c03f6d39 : 365 dev_watchdog
> c04126bc : 114 tcp_write_timer

> These are device driver timers that we need to investigate. We should
> try to migrate them to CPU0 (or some other package) to get really long
> uninterrupted CPU sleep time.

The largest entry is for ehci_watchdog. This timer won't run at all if
your EHCI controllers are allowed to autosuspend, which will happen
automatically if

(1) You enable CONFIG_USB_SUSPEND, and

(2) You have no high-speed USB devices attached, or the
ones that are attached have all been suspended.

On the other hand, if you were actively using some high-speed USB
device during the test then it's understandable that there should be
lots of timer interrupts as a result.

Alan Stern

2008-03-17 08:45:24

by David Brownell

Subject: Re: [linux-pm] Too many timer interrupts in NO_HZ

On Sunday 16 March 2008, Alan Stern wrote:
> On Sun, 16 Mar 2008, Vaidyanathan Srinivasan wrote:
>
> The largest entry is for ehci_watchdog. This timer won't run at all if

... you're not accessing EHCI devices at all. Is HAL or
something else polling them too often? Or are you maybe
doing something else that resembles "real work"?


> your EHCI controllers are allowed to autosuspend, which will happen
> automatically if
>
> (1) You enable CONFIG_USB_SUSPEND, and
>
> (2) You have no high-speed USB devices attached, or the
> ones that are attached have all been suspended.
>
> On the other hand, if you were actively using some high-speed USB
> device during the test then it's understandable that there should be
> lots of timer interrupts as a result.

That watchdog is a bit messy, but it's got two basic tasks:

(a) Take work off the async ring ... bulk and control
transfers will leave an empty QH there for a few
milliseconds before taking it off, to avoid wasting
effort in the common case where another transfer
quickly follows the first one. In the extreme case,
when there's no more work, that ring gets disabled.

(b) A real I/O watchdog ... in case the hardware forgets
to issue some kind of I/O completion interrupt. This
watchdog rarely needs to fire.

So I'm thinking this is most likely a case where something
is sending work to one or more high speed devices.

- Dave