2018-07-10 06:55:25

by Jon Maxwell

[permalink] [raw]
Subject: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

v3 contains the following suggestions by Neal Cardwell:

1) Fix up units mismatch regarding msec/jiffies.
2) Address possiblility of time_remaining being negative.
3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto
calculation.
4) Move start_ts logic into helper routine tcp_retrans_stamp() to
validate tcp_sk(sk)->retrans_stamp.
5) Some u32 declation and return refactoring.
6) Return 0 instead of false in tcp_retransmit_stamp(), it's not a bool.

Suggestions by David Laight:

1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
2) Use conditional operator instead of min_t() in
tcp_clamp_rto_to_user_timeout()

Changes:

1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to
inet_csk_reset_xmit_timer() to save on rto declaration.

Every time the TCP retransmission timer fires. It checks to see if there is a
timeout before scheduling the next retransmit timer. The retransmit interval
between each retransmission increases exponentially. The issue is that in order
for the timeout to occur the retransmit timer needs to fire again. If the user
timeout check happens after the 9th retransmit for example. It needs to wait for
the 10th retransmit timer to fire in order to evaluate whether a timeout has
occurred or not. If the interval is large enough then the timeout will be
inaccurate.

For example with a TCP_USER_TIMEOUT of 10 seconds without patch:

1st retransmit:

22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]

Last retransmit:

22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]

Timeout:

send: Connection timed out
Sun Jul 1 22:25:34 EDT 2018

We can see that last retransmit took ~7 seconds. Which pushed the total
timeout to ~15 seconds instead of the expected 10 seconds. This gets more
inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.

Add tcp_clamp_rto_to_user_timeout() to determine if the user rto has expired.
Or whether the rto interval needs to be recalculated. Use the original interval
if user rto is not set.

Test results with the patch is the expected 10 second timeout:

1st retransmit:

01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]

Last retransmit:

01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]

Timeout:

send: Connection timed out
Mon Jul 2 01:38:09 EDT 2018

Signed-off-by: Jon Maxwell <[email protected]>
---
net/ipv4/tcp_timer.c | 49 +++++++++++++++++++++++++++++++++++++++----------
1 file changed, 39 insertions(+), 10 deletions(-)

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3b3611729928..93239e58776d 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -22,6 +22,38 @@
#include <linux/gfp.h>
#include <net/tcp.h>

+u32 tcp_retransmit_stamp(struct sock *sk)
+{
+ u32 start_ts = tcp_sk(sk)->retrans_stamp;
+
+ if (unlikely(!start_ts)) {
+ struct sk_buff *head = tcp_rtx_queue_head(sk);
+
+ if (!head)
+ return 0;
+ start_ts = tcp_skb_timestamp(head);
+ }
+ return start_ts;
+}
+
+static __u32 tcp_clamp_rto_to_user_timeout(struct sock *sk)
+{
+ struct inet_connection_sock *icsk = inet_csk(sk);
+ __u32 elapsed, user_timeout;
+ u32 start_ts;
+
+ start_ts = tcp_retransmit_stamp(sk);
+ if (!icsk->icsk_user_timeout || !start_ts)
+ return icsk->icsk_rto;
+ elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
+ user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
+ if (elapsed >= user_timeout)
+ return 1; /* user timeout has passed; fire ASAP */
+ else
+ return (icsk->icsk_rto < msecs_to_jiffies(user_timeout - elapsed)) ?
+ icsk->icsk_rto : msecs_to_jiffies(user_timeout - elapsed);
+}
+
/**
* tcp_write_err() - close socket and save error info
* @sk: The socket the error has appeared on.
@@ -161,19 +193,15 @@ static bool retransmits_timed_out(struct sock *sk,
unsigned int timeout)
{
const unsigned int rto_base = TCP_RTO_MIN;
- unsigned int linear_backoff_thresh, start_ts;
+ unsigned int linear_backoff_thresh;
+ u32 start_ts;

if (!inet_csk(sk)->icsk_retransmits)
return false;

- start_ts = tcp_sk(sk)->retrans_stamp;
- if (unlikely(!start_ts)) {
- struct sk_buff *head = tcp_rtx_queue_head(sk);
-
- if (!head)
- return false;
- start_ts = tcp_skb_timestamp(head);
- }
+ start_ts = tcp_retransmit_stamp(sk);
+ if (!start_ts)
+ return false;

if (likely(timeout == 0)) {
linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);
@@ -535,7 +563,8 @@ void tcp_retransmit_timer(struct sock *sk)
/* Use normal (exponential) backoff */
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
}
- inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
+ inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
+ tcp_clamp_rto_to_user_timeout(sk), TCP_RTO_MAX);
if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
__sk_dst_reset(sk);

--
2.13.6



2018-07-10 09:17:17

by David Laight

[permalink] [raw]
Subject: RE: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

From: Jon Maxwell
> Sent: 10 July 2018 07:52
...
> +static __u32 tcp_clamp_rto_to_user_timeout(struct sock *sk)
> +{
> + struct inet_connection_sock *icsk = inet_csk(sk);
> + __u32 elapsed, user_timeout;
> + u32 start_ts;
> +
> + start_ts = tcp_retransmit_stamp(sk);
> + if (!icsk->icsk_user_timeout || !start_ts)
> + return icsk->icsk_rto;
> + elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
> + user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
> + if (elapsed >= user_timeout)
> + return 1; /* user timeout has passed; fire ASAP */
> + else
> + return (icsk->icsk_rto < msecs_to_jiffies(user_timeout - elapsed)) ?
> + icsk->icsk_rto : msecs_to_jiffies(user_timeout - elapsed);

You've not looked at what the above actually does - something like:
msecs_to_jiffies() probably rounds up, the old code rounded down
so dividing by HZ is probably right.

int new_tmo = icsk->icsk_user_timeout - (tcp_time_stamp(tcp_sk(sk)) - start_ts)/HZ;

if (new_tmo > (int) icsk->icsk_rto;
return icsk->icsk_rto;
return new_tmo <= 0 ? 1 : new_tmo;
}

2018-07-10 12:40:06

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy



On 07/09/2018 11:51 PM, Jon Maxwell wrote:
> v3 contains the following suggestions by Neal Cardwell:
>
> 1) Fix up units mismatch regarding msec/jiffies.
> 2) Address possiblility of time_remaining being negative.
> 3) Add a helper routine tcp_clamp_rto_to_user_timeout() to do the rto
> calculation.
> 4) Move start_ts logic into helper routine tcp_retrans_stamp() to
> validate tcp_sk(sk)->retrans_stamp.
> 5) Some u32 declation and return refactoring.
> 6) Return 0 instead of false in tcp_retransmit_stamp(), it's not a bool.
>
> Suggestions by David Laight:
>
> 1) Don't cache rto in tcp_clamp_rto_to_user_timeout().
> 2) Use conditional operator instead of min_t() in
> tcp_clamp_rto_to_user_timeout()
>
> Changes:
>
> 1) Call tcp_clamp_rto_to_user_timeout(sk) as an argument to
> inet_csk_reset_xmit_timer() to save on rto declaration.
>
> Every time the TCP retransmission timer fires. It checks to see if there is a
> timeout before scheduling the next retransmit timer. The retransmit interval
> between each retransmission increases exponentially. The issue is that in order
> for the timeout to occur the retransmit timer needs to fire again. If the user
> timeout check happens after the 9th retransmit for example. It needs to wait for
> the 10th retransmit timer to fire in order to evaluate whether a timeout has
> occurred or not. If the interval is large enough then the timeout will be
> inaccurate.
>
> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>
> 1st retransmit:
>
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Sun Jul 1 22:25:34 EDT 2018
>
> We can see that last retransmit took ~7 seconds. Which pushed the total
> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>
> Add tcp_clamp_rto_to_user_timeout() to determine if the user rto has expired.
> Or whether the rto interval needs to be recalculated. Use the original interval
> if user rto is not set.
>
> Test results with the patch is the expected 10 second timeout:
>
> 1st retransmit:
>
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Mon Jul 2 01:38:09 EDT 2018
>
> Signed-off-by: Jon Maxwell <[email protected]>
> ---
> net/ipv4/tcp_timer.c | 49 +++++++++++++++++++++++++++++++++++++++----------
> 1 file changed, 39 insertions(+), 10 deletions(-)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..93239e58776d 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -22,6 +22,38 @@
> #include <linux/gfp.h>
> #include <net/tcp.h>
>
> +u32 tcp_retransmit_stamp(struct sock *sk)

const struct sock *sk;

(To clearly express the fact this helper does not touch the socket)

> +{
> + u32 start_ts = tcp_sk(sk)->retrans_stamp;
> +
> + if (unlikely(!start_ts)) {
> + struct sk_buff *head = tcp_rtx_queue_head(sk);
> +
> + if (!head)
> + return 0;
> + start_ts = tcp_skb_timestamp(head);
> + }
> + return start_ts;
> +}
> +
> +static __u32 tcp_clamp_rto_to_user_timeout(struct sock *sk)

const struct sock *sk

> +{
> + struct inet_connection_sock *icsk = inet_csk(sk);
> + __u32 elapsed, user_timeout;
> + u32 start_ts;

Mixing __u32 and u32 in new code is confusing.

__u32 are needed in uapi include files, not in the C kernel code.

u32 elapsed, user_timeout, start_ts;

> +
> + start_ts = tcp_retransmit_stamp(sk);
> + if (!icsk->icsk_user_timeout || !start_ts)
> + return icsk->icsk_rto;
> + elapsed = tcp_time_stamp(tcp_sk(sk)) - start_ts;
> + user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);

Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
we also could change the convention and store msecs in this field instead of jiffies.

That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.

(That would be done in a patch of its own, of course)

> + if (elapsed >= user_timeout)
> + return 1; /* user timeout has passed; fire ASAP */
> + else
> + return (icsk->icsk_rto < msecs_to_jiffies(user_timeout - elapsed)) ?
> + icsk->icsk_rto : msecs_to_jiffies(user_timeout - elapsed);


return min_t(i32, icsk->icsk_rto, msecs_to_jiffies(user_timeout - elapsed));

> +}
> +
> /**
> * tcp_write_err() - close socket and save error info
> * @sk: The socket the error has appeared on.
> @@ -161,19 +193,15 @@ static bool retransmits_timed_out(struct sock *sk,
> unsigned int timeout)
> {
> const unsigned int rto_base = TCP_RTO_MIN;
> - unsigned int linear_backoff_thresh, start_ts;
> + unsigned int linear_backoff_thresh;
> + u32 start_ts;

This seems a gratuitous change, making your patch bigger than necessary.

u32 and unsigned int are the same essentially.

>
> if (!inet_csk(sk)->icsk_retransmits)
> return false;
>
> - start_ts = tcp_sk(sk)->retrans_stamp;
> - if (unlikely(!start_ts)) {
> - struct sk_buff *head = tcp_rtx_queue_head(sk);
> -
> - if (!head)
> - return false;
> - start_ts = tcp_skb_timestamp(head);
> - }
> + start_ts = tcp_retransmit_stamp(sk);
> + if (!start_ts)
> + return false;

One thing you could do is to make this refactoring in a single patch,
and provide a patch series, with small units, making the review easier.

>
> if (likely(timeout == 0)) {
> linear_backoff_thresh = ilog2(TCP_RTO_MAX/rto_base);
> @@ -535,7 +563,8 @@ void tcp_retransmit_timer(struct sock *sk)
> /* Use normal (exponential) backoff */
> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> }
> - inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
> + inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS,
> + tcp_clamp_rto_to_user_timeout(sk), TCP_RTO_MAX);
> if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
> __sk_dst_reset(sk);
>
>

Thanks !

2018-07-10 12:50:09

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy



On 07/10/2018 05:38 AM, Eric Dumazet wrote:

> Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
> we also could change the convention and store msecs in this field instead of jiffies.
>
> That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.
>
> (That would be done in a patch of its own, of course)

tcp_keepalive_timer() does use icsk->icsk_user_timeout directly in jiffies unit,
but considering keeapalive timers are rarely used, this point would have to
do the msecs_to_jiffies() conversion.

2018-07-11 00:38:44

by Jon Maxwell

[permalink] [raw]
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

On Tue, Jul 10, 2018 at 10:48 PM, Eric Dumazet <[email protected]> wrote:
>
>
> On 07/10/2018 05:38 AM, Eric Dumazet wrote:
>
>> Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
>> we also could change the convention and store msecs in this field instead of jiffies.
>>
>> That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.
>>
>> (That would be done in a patch of its own, of course)
>
> tcp_keepalive_timer() does use icsk->icsk_user_timeout directly in jiffies unit,
> but considering keeapalive timers are rarely used, this point would have to
> do the msecs_to_jiffies() conversion.

and also if icsk->icsk_user_timeout = 0, then timeout in retransmits_timed_out()
is in jiffies and that would need to addressed.

2018-07-11 02:05:13

by Eric Dumazet

[permalink] [raw]
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy



On 07/10/2018 05:33 PM, Jonathan Maxwell wrote:
> On Tue, Jul 10, 2018 at 10:48 PM, Eric Dumazet <[email protected]> wrote:
>>
>>
>> On 07/10/2018 05:38 AM, Eric Dumazet wrote:
>>
>>> Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
>>> we also could change the convention and store msecs in this field instead of jiffies.
>>>
>>> That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.
>>>
>>> (That would be done in a patch of its own, of course)
>>
>> tcp_keepalive_timer() does use icsk->icsk_user_timeout directly in jiffies unit,
>> but considering keeapalive timers are rarely used, this point would have to
>> do the msecs_to_jiffies() conversion.
>
> and also if icsk->icsk_user_timeout = 0, then timeout in retransmits_timed_out()
> is in jiffies and that would need to addressed.

Absolutely, this is what I was suggesting.

Pseudo code for this part, before your changes.

diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3b3611729928f77934e0298bb248e55c7a7c5def..cae7bbc956ed51e9d381650957f54550cc0967d9 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -183,8 +183,9 @@ static bool retransmits_timed_out(struct sock *sk,
else
timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
(boundary - linear_backoff_thresh) * TCP_RTO_MAX;
+ timeout = jiffies_to_msecs(timeout)'
}
- return (tcp_time_stamp(tcp_sk(sk)) - start_ts) >= jiffies_to_msecs(timeout);
+ return (tcp_time_stamp(tcp_sk(sk)) - start_ts) >= timeout;
}

/* A write timeout has occurred. Process the after effects. */



2018-07-11 02:36:08

by Jon Maxwell

[permalink] [raw]
Subject: Re: [net-next,v3] tcp: Improve setsockopt() TCP_USER_TIMEOUT accuracy

On Wed, Jul 11, 2018 at 12:04 PM, Eric Dumazet <[email protected]> wrote:
>
>
> On 07/10/2018 05:33 PM, Jonathan Maxwell wrote:
>> On Tue, Jul 10, 2018 at 10:48 PM, Eric Dumazet <[email protected]> wrote:
>>>
>>>
>>> On 07/10/2018 05:38 AM, Eric Dumazet wrote:
>>>
>>>> Note that if we always do jiffies_to_msecs(icsk->icsk_user_timeout) in TCP,
>>>> we also could change the convention and store msecs in this field instead of jiffies.
>>>>
>>>> That would eliminate the msecs_to_jiffies() and jiffies_to_msecs() dance.
>>>>
>>>> (That would be done in a patch of its own, of course)
>>>
>>> tcp_keepalive_timer() does use icsk->icsk_user_timeout directly in jiffies unit,
>>> but considering keeapalive timers are rarely used, this point would have to
>>> do the msecs_to_jiffies() conversion.
>>
>> and also if icsk->icsk_user_timeout = 0, then timeout in retransmits_timed_out()
>> is in jiffies and that would need to addressed.
>
> Absolutely, this is what I was suggesting.
>
> Pseudo code for this part, before your changes.
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928f77934e0298bb248e55c7a7c5def..cae7bbc956ed51e9d381650957f54550cc0967d9 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -183,8 +183,9 @@ static bool retransmits_timed_out(struct sock *sk,
> else
> timeout = ((2 << linear_backoff_thresh) - 1) * rto_base +
> (boundary - linear_backoff_thresh) * TCP_RTO_MAX;
> + timeout = jiffies_to_msecs(timeout)'
> }
> - return (tcp_time_stamp(tcp_sk(sk)) - start_ts) >= jiffies_to_msecs(timeout);
> + return (tcp_time_stamp(tcp_sk(sk)) - start_ts) >= timeout;
> }
>
> /* A write timeout has occurred. Process the after effects. */
>
>

That makes sense thanks. I'll look into this and get a patch together.