2009-09-01 16:04:11

by Josh Triplett

Subject: [RFC PATCH] Turn off the tick even when not idle

The following patch (not for application any time soon) hacks away the
timer interrupt even when not idle, by triggering the nohz mechanism
even if not running the idle task.

When a process does some number crunching for a while, without involving
the kernel, the kernel still interrupts it HZ times per second to figure
out if it has any work to do. With a system dedicated to doing such
number crunching, the answer will almost always come up "no"; however,
the kernel takes a while figuring out all the "no"s from various
subsystems, every timer tick. On my system, the timer tick takes about
80us, every 1/HZ seconds; that represents a significant overhead. 80us
out of every 1ms, for instance, means 8% overhead. Furthermore, the
time taken varies, and the timer interrupts lead to jitter in the
performance of the number crunching.
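
To make the arithmetic above concrete, here is a minimal C sketch of the
overhead calculation. The function name tick_overhead_percent is made up
for illustration, and it assumes a constant per-tick cost, which (as noted
above) does not hold in practice since the time taken varies.

```c
#include <assert.h>

/*
 * Percentage of CPU time spent in the timer tick.
 * Hypothetical helper for illustration only: it assumes every tick costs
 * the same tick_cost_us microseconds.
 *
 * One second holds hz ticks, so the tick consumes tick_cost_us * hz
 * microseconds out of every 1,000,000; dividing by 10,000 converts that
 * ratio to a percentage.
 */
double tick_overhead_percent(double tick_cost_us, unsigned int hz)
{
	assert(tick_cost_us >= 0.0);
	assert(hz > 0);
	return tick_cost_us * (double)hz / 10000.0;
}
```

With the 80us figure, tick_overhead_percent(80.0, 1000) gives 8% and
tick_overhead_percent(80.0, 250) gives 2%.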

This patch represents an attempt to demonstrate the effect of removing
the timer interrupt. It by no means represents a complete solution; it
just thwacks the timer interrupt over the head, ignoring the various
things it does. Known issues include breaking RCU, process accounting
(using "300%" of one CPU), and POSIX CPU timers, among other things. I
have some fixes in progress for some of those.

Nevertheless, this patch successfully boots, runs, and demonstrates some
good results. I ran the benchmark "Fixed Time Quantum" (ftq), which
repeatedly runs fixed length intervals and counts how many iterations of
a simple loop it can run within those intervals. I've attached a plot
of the results with HZ=1000, HZ=250, and this nohz hack; also available
at http://master.kernel.org/~josh/nohz-hack/ along with the raw numbers.
I sorted the samples by iterations completed, to group similar values
together. (The ~5 bad samples on the far left represent unavoidable
SMIs on the laptop I ran the tests on.)
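
For readers unfamiliar with the benchmark, the core idea of a fixed time
quantum measurement can be sketched in a few lines of C. This is not the
actual ftq source, just a minimal illustration: the function names, the use
of clock_gettime(CLOCK_MONOTONIC) as the time source, and the trivial loop
body are all assumptions.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Current monotonic time in nanoseconds. */
uint64_t ftq_now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

/*
 * Fill counts[0..nsamples-1] with the number of loop iterations completed
 * in each consecutive quantum of quantum_ns nanoseconds.
 *
 * Quantum boundaries are fixed relative to the start time, so time stolen
 * from one quantum (by an interrupt, say) shows up as a lower count for
 * that sample instead of shifting all the later samples.
 */
void ftq_run(uint64_t quantum_ns, size_t nsamples, uint64_t *counts)
{
	uint64_t start, end;
	size_t i;

	assert(quantum_ns > 0);
	assert(counts != NULL);

	start = ftq_now_ns();
	for (i = 0; i < nsamples; i++) {
		uint64_t count = 0;

		end = start + (uint64_t)(i + 1) * quantum_ns;
		do {
			count++;	/* the "work": one loop iteration */
		} while (ftq_now_ns() < end);
		counts[i] = count;
	}
}
```

In this sketch each iteration also pays for one clock read, so the absolute
counts are not comparable to the attached results; only the spread between
samples is of interest.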

Notice how with the timer tick turned off, the results show long
"shelves" of near-identical values. More than half the samples fall
into one such shelf, consistently completing almost the same hundreds of
thousands of iterations within ~20 iterations of each other. With the
timer tick turned on, the results spread out a lot more, in the
direction of worse performance.
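
One way to put a number on such a shelf, offered as an assumption and not
as the method used to produce the plot: sort the samples and find the
largest group whose values all lie within a given tolerance of each other.
The names cmp_u64 and largest_shelf are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* qsort() comparator for uint64_t values, ascending. */
int cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a;
	uint64_t y = *(const uint64_t *)b;

	return (x > y) - (x < y);
}

/*
 * Sort samples[] in place and return the size of the largest group of
 * samples whose values all lie within tolerance of each other
 * (max - min <= tolerance).
 */
size_t largest_shelf(uint64_t *samples, size_t n, uint64_t tolerance)
{
	size_t lo = 0, hi, best = 0;

	assert(samples != NULL);
	qsort(samples, n, sizeof(*samples), cmp_u64);

	/* Sliding window over the sorted samples. */
	for (hi = 0; hi < n; hi++) {
		while (samples[hi] - samples[lo] > tolerance)
			lo++;
		if (hi - lo + 1 > best)
			best = hi - lo + 1;
	}
	return best;
}
```

With a tolerance of 20, a shelf holding more than half of n samples would
correspond to largest_shelf(samples, n, 20) > n / 2.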

Please, give this patch a try and let me know what you think.

I'd like to work towards a patch which really can kill off the timer
tick, making the kernel entirely event-driven and removing the polling
that occurs in the timer tick. I've reviewed everything the timer tick
does, and every last bit of it could occur using an event-driven
approach.

- Josh Triplett

-- >8 --

 kernel/softirq.c         |    2 +-
 kernel/time/tick-sched.c |    8 --------
 2 files changed, 1 insertions(+), 9 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index eb5e131..8bf11b4 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -305,7 +305,7 @@ void irq_exit(void)
 #ifdef CONFIG_NO_HZ
 	/* Make sure that timer wheel updates are propagated */
 	rcu_irq_exit();
-	if (idle_cpu(smp_processor_id()) && !in_interrupt() && !need_resched())
+	if (!in_interrupt() && !need_resched())
 		tick_nohz_stop_sched_tick(0);
 #endif
 	preempt_enable_no_resched();
diff --git a/kernel/time/tick-sched.c b/kernel/time/tick-sched.c
index e0f59a2..707ba98 100644
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -223,14 +223,6 @@ void tick_nohz_stop_sched_tick(int inidle)
 	cpu = smp_processor_id();
 	ts = &per_cpu(tick_cpu_sched, cpu);

-	/*
-	 * Call to tick_nohz_start_idle stops the last_update_time from being
-	 * updated. Thus, it must not be called in the event we are called from
-	 * irq_exit() with the prior state different than idle.
-	 */
-	if (!inidle && !ts->inidle)
-		goto end;
-
 	now = tick_nohz_start_idle(ts);

 	/*
--
1.6.3.3


Attachments:
nohz-hack.png (6.95 kB)

2009-09-01 16:46:34

by jim owens

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

Christoph Lameter wrote:
>
> Good idea. Thought so for a long time. Thanks for working on it. Many of
> the events can be deferred if nothing is happening in the system.
> vm statistics, data expiration from caches etc. Not sure what to do about
> the per cpu threads generated by various kernel subsystems though.

Might be a little easier to sort this out after Jens' code
is in to create workers dynamically instead of on every CPU.

2009-09-01 18:08:04

by Josh Triplett

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

On Tue, Sep 01, 2009 at 03:56:56PM -0400, Christoph Lameter wrote:
> On Tue, 1 Sep 2009, Josh Triplett wrote:
>
> > Please, give this patch a try and let me know what you think.
>
> Looks quite good. The PNG shows more deterministic run time behavior.

Thanks; exactly what I hoped to demonstrate. Actually making the timer
interrupt go away will require finding a more appropriate place to run
all the code that otherwise polls periodically, but this patch lets us
cheat and see the result before that happens. :)

> > I'd like to work towards a patch which really can kill off the timer
> > tick, making the kernel entirely event-driven and removing the polling
> > that occurs in the timer tick. I've reviewed everything the timer tick
> > does, and every last bit of it could occur using an event-driven
> > approach.
>
> Good idea. Thought so for a long time. Thanks for working on it. Many of
> the events can be deferred if nothing is happening in the system.
> vm statistics, data expiration from caches etc. Not sure what to do about
> the per cpu threads generated by various kernel subsystems though.

I ran the benchmark at realtime priority, and affinitized to a single
CPU. I used ftrace to confirm that after the initial program setup
(shared library loads, memory allocation, etc), no code runs in the
kernel during the number-crunching; this makes sense, since I ran at
higher priority than all the random affinitized kernel threads, and I
pushed everything else (tasks and interrupts) onto another CPU.

Long-term I'd like to solve the problem of those kernel threads, but
realtime priority can mitigate those. The new interrupt threading bits
may help with other interrupts and avoid the need to set interrupt
affinity. The timer interrupt, though, represents the one and only
thing I can't mitigate, hence why I'd like to make it go away.

- Josh Triplett

2009-09-01 15:59:11

by Christoph Lameter

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

On Tue, 1 Sep 2009, Josh Triplett wrote:

> Please, give this patch a try and let me know what you think.

Looks quite good. The PNG shows more deterministic run time behavior.

> I'd like to work towards a patch which really can kill off the timer
> tick, making the kernel entirely event-driven and removing the polling
> that occurs in the timer tick. I've reviewed everything the timer tick
> does, and every last bit of it could occur using an event-driven
> approach.

Good idea. Thought so for a long time. Thanks for working on it. Many of
the events can be deferred if nothing is happening in the system.
vm statistics, data expiration from caches etc. Not sure what to do about
the per cpu threads generated by various kernel subsystems though.


Attachments:
nohz-hack.png (6.95 kB)

2009-09-01 20:26:32

by Josh Triplett

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

On Tue, Sep 01, 2009 at 06:35:34PM -0400, Christoph Lameter wrote:
> On Tue, 1 Sep 2009, Josh Triplett wrote:
>
> > Thanks; exactly what I hoped to demonstrate. Actually making the timer
> > interrupt go away will require finding a more appropriate place to run
> > all the code that otherwise polls periodically, but this patch lets us
> > cheat and see the result before that happens. :)
>
> Well not necessarily. Since the process is not doing system calls some of
> the checks can be skipped. In order to bring about a quiet state for the
> VM one could fold the vm counters and dump the queues. Then maintenance is
> unnecessary as long as no system activity occurs on a processor.

Yes, I agree that most of these checks don't need to happen. When I
said "finding a more appropriate place", I primarily meant either making
these things event-driven or making them happen only when needed, not
just moving the polling elsewhere. For instance, process time
accounting need not happen every timer tick; it can happen the next time
the process enters the kernel, which then just adds all the time elapsed
since the previous update. If some rlimit or POSIX CPU timer exists, the
kernel can figure out when that will trigger, and set a timer for that
point.
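
As a rough userspace model of that idea (not kernel code; the struct and
function names are invented for illustration, and it assumes the task runs
uninterrupted between updates), accounting can be done lazily by charging
the whole elapsed interval at the next kernel entry, and a CPU-time limit
becomes a single timer programmed for the moment the limit would be hit:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-task accounting state; not a real kernel structure. */
struct task_acct {
	uint64_t cpu_time_ns;		/* CPU time charged so far */
	uint64_t last_update_ns;	/* timestamp of the last charge */
};

/*
 * Called on kernel entry (syscall, interrupt, context switch) instead of
 * on every tick: charge all the time that elapsed since the last update.
 * Assumes the task was running for that whole interval.
 */
void acct_update(struct task_acct *t, uint64_t now_ns)
{
	assert(now_ns >= t->last_update_ns);
	t->cpu_time_ns += now_ns - t->last_update_ns;
	t->last_update_ns = now_ns;
}

/*
 * Absolute time at which a task that keeps running from its last update
 * would reach limit_ns of CPU time. A one-shot timer set for this moment
 * replaces checking the limit on every tick.
 */
uint64_t acct_limit_expiry(const struct task_acct *t, uint64_t limit_ns)
{
	if (t->cpu_time_ns >= limit_ns)
		return t->last_update_ns;	/* limit already reached */
	return t->last_update_ns + (limit_ns - t->cpu_time_ns);
}
```

If the task leaves the CPU before the expiry time, the timer would have to
be cancelled or re-armed when it runs again; the sketch leaves that out.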

> > I ran the benchmark at realtime priority, and affinitized to a single
> > CPU. I used ftrace to confirm that after the initial program setup
> > (shared library loads, memory allocation, etc), no code runs in the
> > kernel during the number-crunching; this makes sense, since I ran at
> > higher priority than all the random affinitized kernel threads, and I
> > pushed everything else (tasks and interrupts) onto another CPU.
>
> Interesting.
>
> > Long-term I'd like to solve the problem of those kernel threads, but
> > realtime priority can mitigate those. The new interrupt threading bits
> > may help with other interrupts and avoid the need to set interrupt
> > affinity. The timer interrupt, though, represents the one and only
> > thing I can't mitigate, hence why I'd like to make it go away.
>
> Well it would be best if we can guarantee that there is no system activity
> starting. What you have done is analyze all the causes for your particular
> situation and mitigated them. Not everyone is a specialist able to figure
> out these causes.

Agreed entirely. I want cases like this to work without any tuning or
mitigation required. If userspace doesn't need anything from the
kernel, and the hardware doesn't need attention from the kernel, then
the kernel should have no work to do.

Unfortunately, I don't think any blanket solution exists to fix all of
these issues; each cause of random system activity needs addressing. As
it turns out, many of the difficult-to-deal-with bits of activity occur
on the timer interrupt, making it hard to track them down individually,
which is why I wanted to start there.

- Josh Triplett

2009-09-01 18:36:56

by Christoph Lameter

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

On Tue, 1 Sep 2009, Josh Triplett wrote:

> Thanks; exactly what I hoped to demonstrate. Actually making the timer
> interrupt go away will require finding a more appropriate place to run
> all the code that otherwise polls periodically, but this patch lets us
> cheat and see the result before that happens. :)

Well not necessarily. Since the process is not doing system calls some of
the checks can be skipped. In order to bring about a quiet state for the
VM one could fold the vm counters and dump the queues. Then maintenance is
unnecessary as long as no system activity occurs on a processor.

> I ran the benchmark at realtime priority, and affinitized to a single
> CPU. I used ftrace to confirm that after the initial program setup
> (shared library loads, memory allocation, etc), no code runs in the
> kernel during the number-crunching; this makes sense, since I ran at
> higher priority than all the random affinitized kernel threads, and I
> pushed everything else (tasks and interrupts) onto another CPU.

Interesting.

> Long-term I'd like to solve the problem of those kernel threads, but
> realtime priority can mitigate those. The new interrupt threading bits
> may help with other interrupts and avoid the need to set interrupt
> affinity. The timer interrupt, though, represents the one and only
> thing I can't mitigate, hence why I'd like to make it go away.

Well it would be best if we can guarantee that there is no system activity
starting. What you have done is analyze all the causes for your particular
situation and mitigated them. Not everyone is a specialist able to figure
out these causes.

2009-09-02 20:01:38

by Pavel Machek

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

Hi!

> When a process does some number crunching for a while, without involving
> the kernel, the kernel still interrupts it HZ times per second to figure
> out if it has any work to do. With a system dedicated to doing such
> number crunching, the answer will almost always come up "no"; however,
> the kernel takes a while figuring out all the "no"s from various
> subsystems, every timer tick. On my system, the timer tick takes about
> 80us, every 1/HZ seconds; that represents a significant overhead. 80us
> out of every 1ms, for instance, means 8% overhead. Furthermore, the
> time taken varies, and the timer interrupts lead to jitter in the
> performance of the number crunching.

8% overhead on hz=1000 is quite high --- what hw is that?

You should be able to get similar results with HZ=1, right?
Pavel

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-09-02 21:30:20

by Josh Triplett

Subject: Re: [RFC PATCH] Turn off the tick even when not idle

On Wed, Sep 02, 2009 at 10:01:26PM +0200, Pavel Machek wrote:
> Hi!
>
> > When a process does some number crunching for a while, without involving
> > the kernel, the kernel still interrupts it HZ times per second to figure
> > out if it has any work to do. With a system dedicated to doing such
> > number crunching, the answer will almost always come up "no"; however,
> > the kernel takes a while figuring out all the "no"s from various
> > subsystems, every timer tick. On my system, the timer tick takes about
> > 80us, every 1/HZ seconds; that represents a significant overhead. 80us
> > out of every 1ms, for instance, means 8% overhead. Furthermore, the
> > time taken varies, and the timer interrupts lead to jitter in the
> > performance of the number crunching.
>
> 8% overhead on hz=1000 is quite high --- what hw is that?

32-bit x86, ThinkPad T60p (work laptop). I've observed similar
latencies on x86-64, and others have observed them on 64-bit powerpc.

On top of that, almost all of that 80us consists of variations on "Do I
have any work to do? No? OK then."

> You should be able to get similar results with HZ=1, right?

Possibly, yes. But I want good responsiveness when the system *does*
have work to do.

- Josh Triplett

2009-09-03 05:19:59

by Anton Blanchard

Subject: Re: [RFC PATCH] Turn off the tick even when not idle


Hi Josh,

> The following patch (not for application any time soon) hacks away the
> timer interrupt even when not idle, by triggering the nohz mechanism
> even if not running the idle task.

Nice :) I tested it on a ppc64 box, and threw the graphs up here:

http://ozlabs.org/~anton/linux/osjitter/

With the patch applied, we saw only one interruption in the 10 second window
I measured.

Anton