LinuxLists.cc - [PATCH v4 1/2] timer: add a function to adjust timeouts to be upper bound

2022-04-07 08:31:55

Subject: [PATCH v4 1/2] timer: add a function to adjust timeouts to be upper bound

Current timer wheel implementation is optimized for performance and
energy usage but lacks in precision. This, normally, is not a problem as
most timers that use timer wheel are used for timeouts and thus rarely
expire, instead they often get canceled or modified before expiration.
Even when they don't, expiring a bit late is not an issue for timeout
timers.

TCP keepalive timer is a special case, it's aim is to prevent timeouts,
so triggering earlier rather than later is desired behavior. In a
reported case the user had a 3600s keepalive timer for preventing firewall
disconnects (on a 3650s interval). They observed keepalive timers coming
in up to four minutes late, causing unexpected disconnects.

This commit adds upper_bound_timeout() function that takes a relative
timeout and adjusts it based on timer wheel granularity so that supplied
value effectively becomes an upper bound for the timer.

This was previously discussed here:
https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

Suggested-by: Josh Poimboeuf <[email protected]>
Signed-off-by: Artem Savkov <[email protected]>
---
include/linux/timer.h | 1 +
kernel/time/timer.c | 68 ++++++++++++++++++++++++++++++++++---------
2 files changed, 56 insertions(+), 13 deletions(-)

diff --git a/include/linux/timer.h b/include/linux/timer.h
index fda13c9d1256c..b209d31d543f0 100644
--- a/include/linux/timer.h
+++ b/include/linux/timer.h
@@ -168,6 +168,7 @@ static inline int timer_pending(const struct timer_list * timer)
return !hlist_unhashed_lockless(&timer->entry);
}

+extern unsigned long upper_bound_timeout(unsigned long timeout);
extern void add_timer_on(struct timer_list *timer, int cpu);
extern int del_timer(struct timer_list * timer);
extern int mod_timer(struct timer_list *timer, unsigned long expires);
diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index 85f1021ad4595..a645b62e257e2 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -507,28 +507,38 @@ static inline unsigned calc_index(unsigned long expires, unsigned lvl,
return LVL_OFFS(lvl) + (expires & LVL_MASK);
}

-static int calc_wheel_index(unsigned long expires, unsigned long clk,
- unsigned long *bucket_expiry)
+static inline int get_wheel_lvl(unsigned long delta)
{
- unsigned long delta = expires - clk;
- unsigned int idx;
-
if (delta < LVL_START(1)) {
- idx = calc_index(expires, 0, bucket_expiry);
+ return 0;
} else if (delta < LVL_START(2)) {
- idx = calc_index(expires, 1, bucket_expiry);
+ return 1;
} else if (delta < LVL_START(3)) {
- idx = calc_index(expires, 2, bucket_expiry);
+ return 2;
} else if (delta < LVL_START(4)) {
- idx = calc_index(expires, 3, bucket_expiry);
+ return 3;
} else if (delta < LVL_START(5)) {
- idx = calc_index(expires, 4, bucket_expiry);
+ return 4;
} else if (delta < LVL_START(6)) {
- idx = calc_index(expires, 5, bucket_expiry);
+ return 5;
} else if (delta < LVL_START(7)) {
- idx = calc_index(expires, 6, bucket_expiry);
+ return 6;
} else if (LVL_DEPTH > 8 && delta < LVL_START(8)) {
- idx = calc_index(expires, 7, bucket_expiry);
+ return 7;
+ }
+
+ return -1;
+}
+
+static int calc_wheel_index(unsigned long expires, unsigned long clk,
+ unsigned long *bucket_expiry)
+{
+ unsigned long delta = expires - clk;
+ unsigned int idx;
+ int lvl = get_wheel_lvl(delta);
+
+ if (lvl >= 0) {
+ idx = calc_index(expires, lvl, bucket_expiry);
} else if ((long) delta < 0) {
idx = clk & LVL_MASK;
*bucket_expiry = clk;
@@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
return idx;
}

+/**
+ * upper_bound_timeout - return granularity-adjusted timeout
+ * @timeout: timeout value in jiffies
+ *
+ * This function return supplied timeout adjusted based on timer wheel
+ * granularity effectively making supplied value an upper bound at which the
+ * timer will expire. Due to the way timer wheel works timeouts smaller than
+ * LVL_GRAN on their respecrive levels will be _at least_
+ * LVL_GRAN(lvl) - LVL_GRAN(lvl -1)) jiffies early.
+ */
+unsigned long upper_bound_timeout(unsigned long timeout)
+{
+ int lvl = get_wheel_lvl(timeout);
+
+ if (lvl < 0) {
+ if ((long) timeout < 0) {
+ /*
+ * This will expire immediately so no adjustment
+ * needed.
+ */
+ return timeout;
+ } else {
+ if (timeout > WHEEL_TIMEOUT_CUTOFF)
+ timeout = WHEEL_TIMEOUT_CUTOFF;
+ lvl = LVL_DEPTH - 1;
+ }
+ }
+
+ return LVL_GRAN(lvl) > timeout ? 0 : timeout - LVL_GRAN(lvl);
+}
+EXPORT_SYMBOL(upper_bound_timeout);
+
static void
trigger_dyntick_cpu(struct timer_base *base, struct timer_list *timer)
{
--
2.34.1

2022-04-08 01:06:41

by Thomas Gleixner

[permalink] [raw]

Subject: Re: [PATCH v4 1/2] timer: add a function to adjust timeouts to be upper bound

On Thu, Apr 07 2022 at 09:52, Artem Savkov wrote:
> Current timer wheel implementation is optimized for performance and
> energy usage but lacks in precision. This, normally, is not a problem as
> most timers that use timer wheel are used for timeouts and thus rarely
> expire, instead they often get canceled or modified before expiration.
> Even when they don't, expiring a bit late is not an issue for timeout
> timers.
>
> TCP keepalive timer is a special case, it's aim is to prevent timeouts,
> so triggering earlier rather than later is desired behavior. In a
> reported case the user had a 3600s keepalive timer for preventing firewall
> disconnects (on a 3650s interval). They observed keepalive timers coming
> in up to four minutes late, causing unexpected disconnects.
>
> This commit adds upper_bound_timeout() function that takes a relative

s/This commit adds/Add a new function to ..../

See Documentation/process/*

> timeout and adjusts it based on timer wheel granularity so that supplied
> value effectively becomes an upper bound for the timer.
>
> -static int calc_wheel_index(unsigned long expires, unsigned long clk,
> - unsigned long *bucket_expiry)
> +static inline int get_wheel_lvl(unsigned long delta)
> {
> - unsigned long delta = expires - clk;
> - unsigned int idx;
> -
> if (delta < LVL_START(1)) {
> - idx = calc_index(expires, 0, bucket_expiry);
> + return 0;
> } else if (delta < LVL_START(2)) {

Can you please get rid of all those brackets?

> + return -1;
> +}
> +
> +static int calc_wheel_index(unsigned long expires, unsigned long clk,
> + unsigned long *bucket_expiry)
> +{
> + unsigned long delta = expires - clk;
> + unsigned int idx;
> + int lvl = get_wheel_lvl(delta);
> +
> + if (lvl >= 0) {
> + idx = calc_index(expires, lvl, bucket_expiry);
> } else if ((long) delta < 0) {
> idx = clk & LVL_MASK;
> *bucket_expiry = clk;
> @@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
> return idx;
> }

This generates horrible code on various compilers. I ran that through a
couple of perf test scenarios and came up with the following, which
still is a tad slower for the level 0 case depending on the branch
predictor state. But it at least prevents the compilers from doing
stupid things and on average it's on par.

Though the level 0 case matters because of *drumroll* networking.

static int calc_wheel_index(unsigned long expires, unsigned long clk,
unsigned long *bucket_expiry)
{
unsigned long delta = expires - clk;
int lvl = get_wheel_lvl(delta);

if (likely(lvl) >= 0)
return __calc_index(expires, lvl, bucket_expiry);

if ((long) delta < 0) {
*bucket_expiry = clk;
return clk & LVL_MASK;
}

/*
* Force expire obscene large timeouts to expire at the capacity
* limit of the wheel.
*/
if (delta >= WHEEL_TIMEOUT_CUTOFF)
expires = clk + WHEEL_TIMEOUT_MAX;

return __calc_index(expires, LVL_DEPTH - 1, bucket_expiry);
}

Just for the record. I told you last time that your patch creates a
measurable overhead and I explained you in depth why the performance of
this stupid thing matters. So why are you not providing a proper
analysis for that?

> +/**
> + * upper_bound_timeout - return granularity-adjusted timeout
> + * @timeout: timeout value in jiffies
> + *
> + * This function return supplied timeout adjusted based on timer wheel
> + * granularity effectively making supplied value an upper bound at which the
> + * timer will expire. Due to the way timer wheel works timeouts smaller than
> + * LVL_GRAN on their respecrive levels will be _at least_
> + * LVL_GRAN(lvl) - LVL_GRAN(lvl -1)) jiffies early.

Contrary to the simple "timeout - timeout/8" this gives better accuracy
as it does not converge to the early side for long timeouts.

With the quirk that this cuts timeout=1 to 0, which means it expires
immediately. The wonders of integer math avoid that with the simple
timeout -= timeout >> 3 approach for timeouts up to 8 ticks. :)

But that want's to be properly documented.

> +unsigned long upper_bound_timeout(unsigned long timeout)
> +{
> + int lvl = get_wheel_lvl(timeout);

which is equivalent to:

lvl = calc_wheel_index(timeout, 0, &dummy) >> LVL_BITS;

Sorry, could not resist. :)

The more interesting question is, how frequently this upper bounds
function is used. It's definitely not something which you want to
inflict onto a high frequency (re)arming timer.

Did you analyse that? And if so, then why is that analysis missing from
the change log of the keepalive timer patch?

Aside of that it clearly lacks any argument why the simple, stupid, but
fast approach of shortening the timeout by 12.5% is not good enough and
why we need yet another function which is just going to be another
source of 'optimizations' for the wrong reasons.

Seriously, I apprecitate that you want to make this 'perfect', but it's
never going to be perfect and the real question is whether there is any
reasonable difference between 'good' and almost 'perfect'.

And this clearly resonates in your changelog of the network patch:

"Make sure TCP keepalive timer does not expire late. Switching to upper
bound timers means it can fire off early but in case of keepalive
tcp_keepalive_timer() handler checks elapsed time and resets the timer
if it was triggered early. This results in timer "cascading" to a
higher precision and being just a couple of milliseconds off it's
original mark."

Which reinvents the cascading effect of the original timer wheel just
with more overhead. Where is the justification for this?

Is this really true for all the reasons where the keep alive timers are
armed? I seriously doubt that. Why?

On the end which waits for the keep alive packet to arrive in time it
does not matter at all, whether the cutoff is a bit later than defined.

So why do you want to let the timer fire early just to rearm it?

But it matters a lot on the sender side. If that is late and the other
end is strict about the timeout then you lost. But does it matter
whether you send the packet too early? No, it does not matter at all
because the important point is that you send it _before_ the other side
decides to give up.

So why do you want to let the timer fire precise?

You are solving the sender side problem by introducing a receiver side
problem and both suffer from the overhead for no reason.

Aside of the theoerical issue why this matters at all I have yet ot see
a reasonable argument what the practical problen is. If this would be a
real problem in the wild then why haven't we ssen a reassonable bug
report within 6 years?

Thanks,

tglx

2022-04-08 06:10:22

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v4 1/2] timer: add a function to adjust timeouts to be upper bound

On Fri, Apr 08, 2022 at 02:37:25AM +0200, Thomas Gleixner wrote:
> "Make sure TCP keepalive timer does not expire late. Switching to upper
> bound timers means it can fire off early but in case of keepalive
> tcp_keepalive_timer() handler checks elapsed time and resets the timer
> if it was triggered early. This results in timer "cascading" to a
> higher precision and being just a couple of milliseconds off it's
> original mark."
>
> Which reinvents the cascading effect of the original timer wheel just
> with more overhead. Where is the justification for this?
>
> Is this really true for all the reasons where the keep alive timers are
> armed? I seriously doubt that. Why?
>
> On the end which waits for the keep alive packet to arrive in time it
> does not matter at all, whether the cutoff is a bit later than defined.
>
> So why do you want to let the timer fire early just to rearm it?
>
> But it matters a lot on the sender side. If that is late and the other
> end is strict about the timeout then you lost. But does it matter
> whether you send the packet too early? No, it does not matter at all
> because the important point is that you send it _before_ the other side
> decides to give up.
>
> So why do you want to let the timer fire precise?
>
> You are solving the sender side problem by introducing a receiver side
> problem and both suffer from the overhead for no reason.

Here are my thoughts. Maybe some networking folks can chime in to
keep us honest.

I get most of what you're saying, though my understanding is that
keepalive is only involved in sending packets, not receiving them. I do
think there would be two opposing use cases:

1) Client sending packets to prevent server disconnects

2) Server sending packets to detect client disconnects

For #1, it's ok for the timer to pop early. For #2, it's ok for it to
pop late. So my conclusion is about the same as your sender/receiver
scenario: there are two sides to the same coin.

If we assume both use cases are valid (which I'm not entirely convinced
of), doesn't that mean that the keepalive timer needs to be precise?

Otherwise we're going to have broken expectations in one direction or
the other, depending on the use case.

> Aside of the theoerical issue why this matters at all I have yet ot see
> a reasonable argument what the practical problen is. If this would be a
> real problem in the wild then why haven't we ssen a reassonable bug
> report within 6 years?

Good question. At least part of the answer *might* be that enterprise
kernels tend to be adopted very slowly. This issue was reported on RHEL
8.3 which is a 4.18 based kernel:

The time that the 1st TCP keepalive probe is sent can be configured by
the "net.ipv4.tcp_keepalive_time" sysctl or by setsockopt().

We observe that if that value is set to 300 seconds, the timer
actually fires around 15-20 seconds later. So ~317 seconds. The larger
the expiration time the greater the delay. So for the default of 2
hours it can be delayed by minutes. This is causing problems for some
customers that rely on the TCP keepalive timer to keep entries active
in firewalls and expect it to be accurate as TCP keepalive values have
to correspond to the firewall settings.

--
Josh

2022-04-12 20:17:13

by Artem Savkov

[permalink] [raw]

Subject: Re: [PATCH v4 1/2] timer: add a function to adjust timeouts to be upper bound

On Fri, Apr 08, 2022 at 02:37:25AM +0200, Thomas Gleixner wrote:
> On Thu, Apr 07 2022 at 09:52, Artem Savkov wrote:
> > + return -1;
> > +}
> > +
> > +static int calc_wheel_index(unsigned long expires, unsigned long clk,
> > + unsigned long *bucket_expiry)
> > +{
> > + unsigned long delta = expires - clk;
> > + unsigned int idx;
> > + int lvl = get_wheel_lvl(delta);
> > +
> > + if (lvl >= 0) {
> > + idx = calc_index(expires, lvl, bucket_expiry);
> > } else if ((long) delta < 0) {
> > idx = clk & LVL_MASK;
> > *bucket_expiry = clk;
> > @@ -545,6 +555,38 @@ static int calc_wheel_index(unsigned long expires, unsigned long clk,
> > return idx;
> > }
>
> This generates horrible code on various compilers. I ran that through a
> couple of perf test scenarios and came up with the following, which
> still is a tad slower for the level 0 case depending on the branch
> predictor state. But it at least prevents the compilers from doing
> stupid things and on average it's on par.
>
> Though the level 0 case matters because of *drumroll* networking.
>
> Just for the record. I told you last time that your patch creates a
> measurable overhead and I explained you in depth why the performance of
> this stupid thing matters. So why are you not providing a proper
> analysis for that?

I did do a simple check of measuring the time it takes for
calc_wheel_index to execute and it turned out a tad lower after the
patch. Your measurements are sure to be much more refined so would you
give me any pointers on what to measure and which cases to check to have
a better view on the impact of these patches?

> > +/**
> > + * upper_bound_timeout - return granularity-adjusted timeout
> > + * @timeout: timeout value in jiffies
> > + *
> > + * This function return supplied timeout adjusted based on timer wheel
> > + * granularity effectively making supplied value an upper bound at which the
> > + * timer will expire. Due to the way timer wheel works timeouts smaller than
> > + * LVL_GRAN on their respecrive levels will be _at least_
> > + * LVL_GRAN(lvl) - LVL_GRAN(lvl -1)) jiffies early.
>
> Contrary to the simple "timeout - timeout/8" this gives better accuracy
> as it does not converge to the early side for long timeouts.
>
> With the quirk that this cuts timeout=1 to 0, which means it expires
> immediately. The wonders of integer math avoid that with the simple
> timeout -= timeout >> 3 approach for timeouts up to 8 ticks. :)
>
> But that want's to be properly documented.
>
> > +unsigned long upper_bound_timeout(unsigned long timeout)
> > +{
> > + int lvl = get_wheel_lvl(timeout);
>
> which is equivalent to:
>
> lvl = calc_wheel_index(timeout, 0, &dummy) >> LVL_BITS;
>
> Sorry, could not resist. :)

Right, but that would mean a significantly more overhead, right?

> The more interesting question is, how frequently this upper bounds
> function is used. It's definitely not something which you want to
> inflict onto a high frequency (re)arming timer.

As of now not that frequent. Instead of re-arming the timer networking
code allows it to expire and makes the decisions in the handler, so it
shouldn't be a problem because keepalives are usually quite big.

> Did you analyse that? And if so, then why is that analysis missing from
> the change log of the keepalive timer patch?

I agree, this needs to be added.

> Aside of that it clearly lacks any argument why the simple, stupid, but
> fast approach of shortening the timeout by 12.5% is not good enough and
> why we need yet another function which is just going to be another
> source of 'optimizations' for the wrong reasons.

I am just not used to this mode of thinking I guess. This will be more
efficient at the cost of being less obvious and requiring to be
remembered about in case the 12.5% figure changes.

> Seriously, I apprecitate that you want to make this 'perfect', but it's
> never going to be perfect and the real question is whether there is any
> reasonable difference between 'good' and almost 'perfect'.
>
> And this clearly resonates in your changelog of the network patch:
>
> "Make sure TCP keepalive timer does not expire late. Switching to upper
> bound timers means it can fire off early but in case of keepalive
> tcp_keepalive_timer() handler checks elapsed time and resets the timer
> if it was triggered early. This results in timer "cascading" to a
> higher precision and being just a couple of milliseconds off it's
> original mark."
>
> Which reinvents the cascading effect of the original timer wheel just
> with more overhead. Where is the justification for this?
>
> Is this really true for all the reasons where the keep alive timers are
> armed? I seriously doubt that. Why?
>
> On the end which waits for the keep alive packet to arrive in time it
> does not matter at all, whether the cutoff is a bit later than defined.
>
> So why do you want to let the timer fire early just to rearm it?
>
> But it matters a lot on the sender side. If that is late and the other
> end is strict about the timeout then you lost. But does it matter
> whether you send the packet too early? No, it does not matter at all
> because the important point is that you send it _before_ the other side
> decides to give up.
>
> So why do you want to let the timer fire precise?
>
> You are solving the sender side problem by introducing a receiver side
> problem and both suffer from the overhead for no reason.

I was hoping to discuss this during the second patch review. Keepalive
timer handler complexity makes me think that it handles a lot of cases I
am not currently aware of and dropping the "cascading" code would result
in problems in places not obvious to me.

Josh's point also makes sense to me. For cases when keepalive is used to
check for dead clients I don't think we want to be early.

> Aside of the theoerical issue why this matters at all I have yet ot see
> a reasonable argument what the practical problen is. If this would be a
> real problem in the wild then why haven't we ssen a reassonable bug
> report within 6 years?

I think it is unusual for keepalive timers to be set so close to the
timeout so it is not an easy problem to hit. But regardless of that from
what I saw in related discussions nobody sees this keepalive timer behavior
as normal let alone expected. If you think it is better to keep it as is
this will need to be clearly described/justified somewhere, but I am not
sure how one would approach that.

--
Regards,
Artem

2022-05-06 09:26:06

by Artem Savkov

[permalink] [raw]

Subject: [PATCH v5 0/2] Upper bound kernel timers

As previously discussed [1] we had a report of a regression in TCP keepalive
timer where timers were up to 4 minutes late resulting in disconnects.

This patchset tries to fix the problem by introducing upper bound kernel timers
and making tcp keepalive timer use those.

[1] https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u

---
Changes in v5:
- The least intrusive and most straightforward approach. This avoids touching
any of existing code. The timeout supplied is always reduced by a known
timer wheel error margin of 12.5%.
- Commit message adjustments.

Changes in v4:
- Drop any attempts to be smart when calculating timer adjustment and always
substract LVL_GRAN so that base clock lag won't be such a problem. This
means that in some cases we will always be early, but all we want is not to
be late.

Changes in v3:
- A different approach: instead of introducing upper bound timers try to
account for timer wheen granularity on timer (re)arming step.
- Not sure whether moving lvl calculation into a separate function is worth
it.
- Had a hard time naming the upper_bount_timeout() function - any suggestions
welcome.

Changes in v2:
- TIMER_UPPER_BOUND flag description added as a comment in timer.h
- Code style fixes
- More elaborate commit message in timer commit

Artem Savkov (2):
timer: add a function to adjust timeouts to be upper bound
net: make tcp keepalive timer upper bound

include/linux/timer.h | 1 +
kernel/time/timer.c | 14 ++++++++++++++
net/ipv4/inet_connection_sock.c | 2 +-
3 files changed, 16 insertions(+), 1 deletion(-)

--
2.34.1

2022-05-09 02:32:23

by Artem Savkov

[permalink] [raw]

Subject: [PATCH v5 1/2] timer: add a function to adjust timeouts to be upper bound

2022-07-26 23:04:13

by Josh Poimboeuf

[permalink] [raw]

Subject: Re: [PATCH v5 0/2] Upper bound kernel timers

Hi Thomas,

Ping?

On Thu, May 05, 2022 at 03:18:09PM +0200, Artem Savkov wrote:
> As previously discussed [1] we had a report of a regression in TCP keepalive
> timer where timers were up to 4 minutes late resulting in disconnects.
>
> This patchset tries to fix the problem by introducing upper bound kernel timers
> and making tcp keepalive timer use those.
>
> [1] https://lore.kernel.org/all/20210302001054.4qgrvnkltvkgikzr@treble/T/#u
>
> ---
> Changes in v5:
> - The least intrusive and most straightforward approach. This avoids touching
> any of existing code. The timeout supplied is always reduced by a known
> timer wheel error margin of 12.5%.
> - Commit message adjustments.
>
> Changes in v4:
> - Drop any attempts to be smart when calculating timer adjustment and always
> substract LVL_GRAN so that base clock lag won't be such a problem. This
> means that in some cases we will always be early, but all we want is not to
> be late.
>
> Changes in v3:
> - A different approach: instead of introducing upper bound timers try to
> account for timer wheen granularity on timer (re)arming step.
> - Not sure whether moving lvl calculation into a separate function is worth
> it.
> - Had a hard time naming the upper_bount_timeout() function - any suggestions
> welcome.
>
> Changes in v2:
> - TIMER_UPPER_BOUND flag description added as a comment in timer.h
> - Code style fixes
> - More elaborate commit message in timer commit
>
> Artem Savkov (2):
> timer: add a function to adjust timeouts to be upper bound
> net: make tcp keepalive timer upper bound
>
> include/linux/timer.h | 1 +
> kernel/time/timer.c | 14 ++++++++++++++
> net/ipv4/inet_connection_sock.c | 2 +-
> 3 files changed, 16 insertions(+), 1 deletion(-)
>
> --
> 2.34.1
>

--
Josh