2015-05-26 22:50:23

by Thomas Gleixner

[permalink] [raw]
Subject: [patch 2/7] timer: Remove FIFO guarantee

The FIFO guarantee has been violated by the introduction of timer
slack already. Remove it.

This is a preparatory patch for converting the timer wheel to hlist,
which reduces the memory footprint of the wheel by 50%. It's a
separate patch so any (unlikely) regression caused by this
can be identified clearly.

Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/timer.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

Index: tip/kernel/time/timer.c
===================================================================
--- tip.orig/kernel/time/timer.c
+++ tip/kernel/time/timer.c
@@ -403,10 +403,8 @@ __internal_add_timer(struct tvec_base *b
i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
vec = base->tv5.vec + i;
}
- /*
- * Timers are FIFO:
- */
- list_add_tail(&timer->entry, vec);
+
+ list_add(&timer->entry, vec);
}

static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)


2015-06-05 22:27:11

by George Spelvin

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

Two thoughts:

1) It's not clear that timer slack has violated the FIFO guarantee.

Remember, the guarantee only applies when two timers have the same
(absolute) timeout; i.e. are requested for the same time.

For two entries with the same timeout, applying slack will produce
the same adjusted timeout. That's the whole point of slack; to
force the timeouts to collide more.

Because the unadjusted timeouts aren't stored, timeout ordering
can be violated; i.e. timers for time t and t+1, which round to the
same time after slack is applied, will expire in FIFO order rather
than timeout order.

But in the case where the FIFO guarantee used to exist, I don't think
timer slack broke it.


I'm not disagreeing with the change, but it's not clear to me that
it's as safe as you think.


2) If you want to do some more in that area, one thing I've been meaning
to get around to is eliminating the whole round_jiffies() system.

It does basically the same thing as the slack system, although with
less flexibility, and it would be wonderful to rip it out of
the kernel completely.


Additional rambling you should ignore. It exists because I haven't
figured out why it's impractical yet.

An interesting variant on the slack system would be to apply slack in the
wakeup loop rather than before sleeping. It might permit more bundling
and thus fewer wakeups.

You have the timers in original order. But computing the next expiration
is more complex. Each timer has a minimum and maximum wakeup time, and
they're sorted by minimum time. You scan the list of pending timers,
accumulating a "batch" which can all be woken up together.

You keep intersecting each new timer's wakeup interval with the pending
batch, until the batch maximum is less than the next timer's minimum.

The final result is a range of permissible timeouts, from which one is
chosen as the processor wakeup time. I'd be inclined to err on the low
side, but whatever works.

(It's also permissible to limit the size of a batch arbitrarily.
After scanning N timers, just pick a wakeup time no later than
the N+1st timer's minimum time and stop.)

When a new timer is added, there are three cases:
1) (Common) Its minimum timeout is later than the current pending wakeup.
Add it to the pending queue normally.
2) Its wakeup range includes the pending time. Add it to the batch.
3) Its maximum wakeup time is less than the pending time. Reschedule
the hardware timer. When it goes off, scan the batch for additional
timers that can be fired now, before building the new batch.

The idea is that each timer would be scanned twice: once when
building the batch, and a second time when running it.

Except that's not true if the third case happens frequently, and
batches keep getting preempted and re-scanned.

But there's an easy workaround to avoid O(n^2): at the expense of
possibly non-optimal wakeups, if what's left of the old batch is larger
than reasonable to rescan, just remember the old wakeup range (the
minimum will be accurate, because it's the minimum of the last timer
in the batch, which is still pending, but the maximum might be wrong)
and extend the batch without recomputing the maximum.

There's probably some reason this can't be made to work, but I wanted
to do a brain dump.

2015-06-08 17:48:42

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

On Sat, 5 Jun 2015, George Spelvin wrote:

> Two thoughts:
>
> 1) It's not clear that timer slack has violated the FIFO guarantee.
>
> Remember, the guarantee only applies when two timers have the same
> (absolute) timeout; i.e. are requested for the same time.
>
> For two entries with the same timeout, applying slack will produce
> the same adjusted timeout. That's the whole point of slack; to
> force the timeouts to collide more.
>
> Because the unadjusted timeouts aren't stored, timeout ordering
> can be violated; i.e. timers for time t and t+1, which round to the
> same time after slack is applied, will expire in FIFO order rather
> than timeout order.
>
> But in the case where the FIFO guarantee used to exist, I don't think
> timer slack broke it.

It does, depending on when you enqueue the timer, because the slack is
calculated from the delta (expires - jiffies).

Now, thinking more about it: even before we introduced the slack, the
FIFO guarantee was only there if two timers were enqueued into the
same array level. Assume the following:

T1 expires = 257 enqueue time = 0
T2 expires = 257 enqueue time = 200

So T1 will go into the secondary array and will be recascaded
into the primary array after 256 jiffies.

T2 will go into the primary array BEFORE T1 is recascaded. So T1 will
be queued into the same bucket at cascading AFTER T2.

Another issue is the whole NOHZ target cpu thing. Even if two timers
are enqueued at the same jiffie with the same expiry value, they can
end up on two different cpus. Due to tick skew, which can be explicit,
T2 can expire before T1.

> I'm not disagreeing with the change, but it's not clear to me that
> it's as safe as you think.

After thinking more about it, I'm even more sure that any code which
relies on the FIFO "guarantee" is broken today.

> 2) If you want to do some more in that area, one thing I've been meaning
> to get around to is eliminating the whole round_jiffies() system.
>
> It does basically the same thing as the slack system, although with
> less flexibility, and it would be wonderful to rip it out of
> the kernel completely.

Once we have natural batching in the wheel itself, I think we can kill
that completely.

> Additional rambling you should ignore. It exists because I haven't
> figured out why it's impractical yet.
>
> An interesting variant on the slack system would be to apply slack in the
> wakeup loop rather than before sleeping. It might permit more bundling
> and thus fewer wakeups.
>
> You have the timers in original order. But computing the next expiration
> is more complex. Each timer has a minimum and maximum wakeup time, and
> they're sorted by minimum time. You scan the list of pending timers,
> accumulating a "batch" which can all be woken up together.

Similar to what we do in hrtimers. Though the main issue of the timer
wheel is that we have no fast way to find the next expiring
timer. Maybe we can revisit this once I've got the rework of the wheel
logic completed.

Thanks,

tglx

2015-06-09 05:39:42

by George Spelvin

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

Thomas Gleixner wrote:
> It does, depending on when you enqueue the timer, because the slack
> is calculated from the delta (expires - jiffies).

Ah, right. If slack > 0, the slack amount is absolute and the rounding
will be consistent.

But if slack < 0, which is the default, it's a percentage of remaining
jiffies. Since slack only delays timeouts, an earlier-scheduled
timeout could easily be delayed more.

(There are only six calls to set_timer_slack() to change the default
to something positive in the kernel.)

>> I'm not disagreeing with the change, but it's not clear to me that
>> it's as safe as you think.

> After thinking more about it, I'm even more sure that any code which
> relies on the FIFO "guarantee" is broken today.

Indeed, I am completely convinced. All I might request is a reassignment
of blame in the commit message.



Thank you for your comments on my other blue-sky ideas, too.

I need to look into why we're using wheels, and what the point is.
How much of an advantage do they have over an efficient priority queue
like a pairing heap?

2015-06-09 08:06:07

by Thomas Gleixner

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

On Tue, 9 Jun 2015, George Spelvin wrote:
> Thomas Gleixner wrote:
>
> > After thinking more about it, I'm even more sure that any code which
> > relies on the FIFO "guarantee" is broken today.
>
> Indeed, I am completely convinced. All I might request is a reassignment
> of blame in the commit message.

Will do. Thanks for spotting it!

> Thank you for your comments on my other blue-sky ideas, too.
>
> I need to look into why we're using wheels, and what the point is.
> How much of an advantage do they have over an efficient priority queue
> like a pairing heap?

The only reason is performance. The wheel has O(1) insertion and
deletion time while heaps and trees usually have O(log(n)).

Timer wheel timers are usually timeouts and 99% of them are canceled
before expiry. Networking is probably the heaviest use case followed
by disk I/O.

Thanks,

tglx



2015-06-09 09:43:25

by George Spelvin

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

Thomas Gleixner wrote:
> The only reason is performance. The wheel has O(1) insertion and
> deletion time while heaps and trees usually have O(log(n)).

Pairing heaps also have O(1) insertion, and rescheduling can
be quite lightweight.

The issue I was worried about is the need for an additional pointer
per timer (left, right, parent) and the associated cache line touch
when modifying the heap.

> Timer wheel timers are usually timeouts and 99% of them are canceled
> before expiry. Networking is probably the heaviest use case followed
> by disk I/O.

And that rewards very lazy structures that postpone work in the
hope it will become unnecessary. But a wheel has real problems with
non-tick-based timers, which as you note covers both hrtimers and NOHZ.

Now, two things to note about pairing heaps (and many related heap
structures like Fibonacci heaps, but pairing heaps have a particularly
good constant factor in practice) are:

1) It is O(1) to "meld" two heaps into one.
2) The above is O(1) because it's literally "stick it on a linked list";
the left child pointers are NULL until the minimum (which is kept
at the head/root) is deleted and a new minimum has to be found.

Combining these two facts, we could do something wheel-like: divide
the future into a number of heaps, link future events into the correct
subheap, and meld the subheaps into the main heap when necessary.

Hopefully, by the time it's necessary, the subheap would have been
thinned out by timers being canceled.

On reflection, it wouldn't even be necessary to have explicit code to
handle the melding. Just allocate an array of "internal use" nodes
which are easy to find, and place them in the main tree like
normal. (Each has a timeout which is guaranteed to be earliest in
its subheap, so the subheap will never need sorting.)

When one gets to the root, the internal node gets recycled (because
we set up the callback function to do that!) and the subheap gets
sorted and merged into the main heap automatically.

Alternatively, the internal use node could be made smaller (e.g.
an hlist head rather than a normal node) at the expense of needing
special-case code to handle it.

Have to think on this. Heapifying the sublist is O(n) work, which is
the same as overflowing a bucket, but it means that additional deletions
will be more expensive.

2015-06-09 10:25:06

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

On Tue, Jun 09, 2015 at 05:43:18AM -0400, George Spelvin wrote:
> Thomas Gleixner wrote:
> > The only reason is performance. The wheel has O(1) insertion and
> > deletion time while heaps and trees usually have O(log(n)).
>
> Pairing heaps also have O(1) insertion, and rescheduling can
> be quite lightweight.
>
> The issue I was worried about is the need for an additional pointer
> per timer (left, right, parent) and the associated cache line touch
> when modifying the heap.
>
> > Timer wheel timers are usually timeouts and 99% of them are canceled
> > before expiry. Networking is probably the heaviest use case followed
> > by disk I/O.
>
> And that rewards very lazy structures that postpone work in the
> hope it will become unnecessary. But a wheel has real problems with
> non-tick-based timers, which as you note covers both hrtimers and NOHZ.
>
> Now, two things to note about pairing heaps (and many related heap
> structures like Fibonacci heaps, but pairing heaps have a particularly
> good constant factor in practice) are:
>
> 1) It is O(1) to "meld" two heaps into one.
> 2) The above is O(1) because it's literally "stick it on a linked list";
> the left child pointers are NULL until the minimum (which is kept
> at the head/root) is deleted and a new minimum has to be found.
>
> Combining these two facts, we could do something wheel-like: divide
> the future into a number of heaps, link future events into the correct
> subheap, and meld the subheaps into the main heap when necessary.
>
> Hopefully, by the time it's necessary, the subheap would have been
> thinned out by timers being canceled.

Right, so while pairing (and fibonacci) heaps have O(log n) deletion, if
you can delay heapification you can equally delay that extra cost.

But as you say, the heapification is an O(n) hit, which is not too
different from our current 'problem'.

But yes, interesting problem space this.

2015-06-09 11:15:09

by George Spelvin

[permalink] [raw]
Subject: Re: [patch 2/7] timer: Remove FIFO guarantee

> Right, so while pairing (and fibonacci) heaps have O(log n) deletion, if
> you can delay heapification you can equally delay that extra cost.

> But as you say, the heapification is an O(n) hit, which is not too
> different from our current 'problem'.

> But yes, interesting problem space this.

It definitely gets my mind going.

The issue that's occupying my thoughts right now is this...

Given a "sublist" for a range of future times, the minimum time
can be either a real timeout or a fake one.

If it's a real one, there's a chance you might delete the minimum timeout
on the list, so you have to handle re-sorting after deletion to maintain
a valid minimum time.

You can have a fake list head, which avoids ever re-sorting, but you're
basically forcing a "tick": taking a timer interrupt at that time even
though nothing needs it.


One possibility that occurs to me would be to have fake heads on the
sublists, but recognize them when they get selected as the "next timeout".
If that happens, deletemin() them from the remaining heap now and find
the next *real* timeout.

Basically, just like the current wheel, you'd have to do some scanning
past possibly-empty sublists, but perhaps by using a much more efficient
heap structure, the sublists could be made larger, so not as many heads
would be needed.


Radix sort isn't really O(n); it's O(n log N) with a small constant,
where N is the range of *possible* values (e.g. N = 2^32, so
log N = 32). But if you want high-resolution timers, N goes
up a lot.

Subject: [tip:timers/core] timer: Remove FIFO "guarantee"

Commit-ID: 1bd04bf6f68d65f5422b2b85c495d65d49587a54
Gitweb: http://git.kernel.org/tip/1bd04bf6f68d65f5422b2b85c495d65d49587a54
Author: Thomas Gleixner <[email protected]>
AuthorDate: Tue, 26 May 2015 22:50:26 +0000
Committer: Thomas Gleixner <[email protected]>
CommitDate: Fri, 19 Jun 2015 15:18:27 +0200

timer: Remove FIFO "guarantee"

The FIFO guarantee is only there if two timers are queued into the
same bucket at the same jiffie on the same cpu:

- The slack value depends on the delta between expiry and enqueue
time, so the resulting expiry time can be different for timers
which are queued in different jiffies.

- Timers which are queued into the secondary array end up after a
later queued timer which was queued into the primary array due to
cascading.

- Timers can end up on different cpus due to the NOHZ target moving
around. Obviously there is no guarantee of expiry ordering between
cpus.

So anything which relies on FIFO behaviour of the timer wheel is
broken already.

This is a preparatory patch for converting the timer wheel to hlist,
which reduces the memory footprint of the wheel by 50%.

It's a separate patch so any (unlikely) regression caused by
this can be identified clearly.

Signed-off-by: Thomas Gleixner <[email protected]>
Reviewed-by: Viresh Kumar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Paul McKenney <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: John Stultz <[email protected]>
Cc: Joonwoo Park <[email protected]>
Cc: Wenbo Wang <[email protected]>
Cc: George Spelvin <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Thomas Gleixner <[email protected]>
---
kernel/time/timer.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d5e0179..e212df2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -389,10 +389,8 @@ __internal_add_timer(struct tvec_base *base, struct timer_list *timer)
i = (expires >> (TVR_BITS + 3 * TVN_BITS)) & TVN_MASK;
vec = base->tv5.vec + i;
}
- /*
- * Timers are FIFO:
- */
- list_add_tail(&timer->entry, vec);
+
+ list_add(&timer->entry, vec);
}

static void internal_add_timer(struct tvec_base *base, struct timer_list *timer)