Every time the TCP retransmission timer fires. It checks to see if there is a
timeout before scheduling the next retransmit timer. The retransmit interval
between each retransmission increases exponentially. The issue is that in order
for the timeout to occur the retransmit timer needs to fire again. If the user
timeout check happens after the 9th retransmit for example. It needs to wait for
the 10th retransmit timer to fire in order to evaluate whether a timeout has
occurred or not. If the interval is large enough then the timeout will be
inaccurate.
For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
1st retransmit:
22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
Last retransmit:
22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
Timeout:
send: Connection timed out
Sun Jul 1 22:25:34 EDT 2018
We can see that last retransmit took ~7 seconds. Which pushed the total
timeout to ~15 seconds instead of the expected 10 seconds. This gets more
inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
Fix this by recalculating the last retransmit interval so that it fires when
the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
Test results with the patch is the expected 10 second timeout:
1st retransmit:
01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
Last retransmit:
01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
Timeout:
send: Connection timed out
Mon Jul 2 01:38:09 EDT 2018
Signed-off-by: Jon Maxwell <[email protected]>
---
net/ipv4/tcp_timer.c | 7 +++++++
1 file changed, 7 insertions(+)
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index 3b3611729928..94491a481722 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
struct tcp_sock *tp = tcp_sk(sk);
struct net *net = sock_net(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
+ __u32 time_remaining = 0;
if (tp->fastopen_rsk) {
WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
@@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
/* Use normal (exponential) backoff */
icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
}
+ if (icsk->icsk_user_timeout) {
+ time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
+ (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
+ if (time_remaining < icsk->icsk_rto)
+ icsk->icsk_rto = time_remaining;
+ }
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, icsk->icsk_rto, TCP_RTO_MAX);
if (retransmits_timed_out(sk, net->ipv4.sysctl_tcp_retries1 + 1, 0))
__sk_dst_reset(sk);
--
2.13.6
On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <[email protected]> wrote:
>
> Every time the TCP retransmission timer fires. It checks to see if there is a
> timeout before scheduling the next retransmit timer. The retransmit interval
> between each retransmission increases exponentially. The issue is that in order
> for the timeout to occur the retransmit timer needs to fire again. If the user
> timeout check happens after the 9th retransmit for example. It needs to wait for
> the 10th retransmit timer to fire in order to evaluate whether a timeout has
> occurred or not. If the interval is large enough then the timeout will be
> inaccurate.
>
> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>
> 1st retransmit:
>
> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Sun Jul 1 22:25:34 EDT 2018
>
> We can see that last retransmit took ~7 seconds. Which pushed the total
> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>
> Fix this by recalculating the last retransmit interval so that it fires when
> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>
> Test results with the patch is the expected 10 second timeout:
>
> 1st retransmit:
>
> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>
> Last retransmit:
>
> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>
> Timeout:
>
> send: Connection timed out
> Mon Jul 2 01:38:09 EDT 2018
>
> Signed-off-by: Jon Maxwell <[email protected]>
> ---
> net/ipv4/tcp_timer.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
> index 3b3611729928..94491a481722 100644
> --- a/net/ipv4/tcp_timer.c
> +++ b/net/ipv4/tcp_timer.c
> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
> struct tcp_sock *tp = tcp_sk(sk);
> struct net *net = sock_net(sk);
> struct inet_connection_sock *icsk = inet_csk(sk);
> + __u32 time_remaining = 0;
>
> if (tp->fastopen_rsk) {
> WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
> /* Use normal (exponential) backoff */
> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
> }
> + if (icsk->icsk_user_timeout) {
> + time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
> + (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
> + if (time_remaining < icsk->icsk_rto)
> + icsk->icsk_rto = time_remaining;
> + }
Thanks, a more precise user timeout sounds nice. A couple thoughts:
(a) The icsk->icsk_rto is in jiffies, and the time_remaining is in
msecs, so it looks like there is a units mismatch here in the
comparisons and assignment.
(b) It also seems like the time_remaining could be negative, because
(a) the icsk_user_timeout is not involved in the baseline RTO
calculation, so that perhaps the first RTO to fire might be beyond the
icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
timer handler can be delayed beyond the targeted icsk_user_timeout.
But time_remaining is a __u32, and icsk->icsk_rto is also a __u32, so
it seems like a negative number in time_remaining would usually be
treated as a very large unsigned positive number in this comparison:
+ if (time_remaining < icsk->icsk_rto)
(c) If the user timeout is changed between RTO expirations to push the
user timeout further in the future, then it seems like this commit
will have side effects that left the icsk->icsk_rto in a weird state
that does not do the expected exponential backoff correctly.
(d) There are also wrapping issues to watch out for, since the
tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
milliseconds, which will wrap every 49 days or so. Seems like the code
is OK in that respect.
(e) It also might be nice to put this logic in a helper, rather than
growing the body of tcp_retransmit_timer().
What about something like (pseudocode):
--
static __u32 tcp_clamp_rto_to_user_timeout(sk):
rto = icsk->icsk_rto;
if (!icsk->icsk_user_timeout)
return rto;
elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
if (elapsed >= user_timeout)
rto = 1; /* user timeout has passed; fire ASAP */
else
rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
return rto;
tcp_retransmit_timer():
...
rto = tcp_clamp_rto_to_user_timeout(sk);
inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
--
neal
On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <[email protected]> wrote:
> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <[email protected]> wrote:
>>
>> Every time the TCP retransmission timer fires. It checks to see if there is a
>> timeout before scheduling the next retransmit timer. The retransmit interval
>> between each retransmission increases exponentially. The issue is that in order
>> for the timeout to occur the retransmit timer needs to fire again. If the user
>> timeout check happens after the 9th retransmit for example. It needs to wait for
>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>> occurred or not. If the interval is large enough then the timeout will be
>> inaccurate.
>>
>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>
>> 1st retransmit:
>>
>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Sun Jul 1 22:25:34 EDT 2018
>>
>> We can see that last retransmit took ~7 seconds. Which pushed the total
>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>
>> Fix this by recalculating the last retransmit interval so that it fires when
>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>
>> Test results with the patch is the expected 10 second timeout:
>>
>> 1st retransmit:
>>
>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Last retransmit:
>>
>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>
>> Timeout:
>>
>> send: Connection timed out
>> Mon Jul 2 01:38:09 EDT 2018
>>
>> Signed-off-by: Jon Maxwell <[email protected]>
>> ---
>> net/ipv4/tcp_timer.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>> index 3b3611729928..94491a481722 100644
>> --- a/net/ipv4/tcp_timer.c
>> +++ b/net/ipv4/tcp_timer.c
>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>> struct tcp_sock *tp = tcp_sk(sk);
>> struct net *net = sock_net(sk);
>> struct inet_connection_sock *icsk = inet_csk(sk);
>> + __u32 time_remaining = 0;
>>
>> if (tp->fastopen_rsk) {
>> WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>> /* Use normal (exponential) backoff */
>> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>> }
>> + if (icsk->icsk_user_timeout) {
>> + time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>> + (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>> + if (time_remaining < icsk->icsk_rto)
>> + icsk->icsk_rto = time_remaining;
>> + }
>
> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>
> (a) The icsk->icsk_rto is in jiffies, and the time_remaining is in
> msecs, so it looks like there is a units mismatch here in the
> comparisons and assignment.
>
> (b) It also seems like the time_remaining could be negative, because
> (a) the icsk_user_timeout is not involved in the baseline RTO
> calculation, so that perhaps the first RTO to fire might be beyond the
> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
> timer handler can be delayed beyond the targeted icsk_user_timeout.
> But time_remaining is a __u32, and icsk->icsk_rto is also a __u32, so
> it seems like a negative number in time_remaining would usually be
> treated as a very large unsigned positive number in this comparison:
>
> + if (time_remaining < icsk->icsk_rto)
>
> (c) If the user timeout is changed between RTO expirations to push the
> user timeout further in the future, then it seems like this commit
> will have side effects that left the icsk->icsk_rto in a weird state
> that does not do the expected exponential backoff correctly.
>
> (d) There are also wrapping issues to watch out for, since the
> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
> milliseconds, which will wrap every 49 days or so. Seems like the code
> is OK in that respect.
>
> (e) It also might be nice to put this logic in a helper, rather than
> growing the body of tcp_retransmit_timer().
>
> What about something like (pseudocode):
>
> --
>
> static __u32 tcp_clamp_rto_to_user_timeout(sk):
> rto = icsk->icsk_rto;
> if (!icsk->icsk_user_timeout)
> return rto;
> elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
> user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
> if (elapsed >= user_timeout)
> rto = 1; /* user timeout has passed; fire ASAP */
> else
> rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
> return rto;
>
> tcp_retransmit_timer():
> ...
> rto = tcp_clamp_rto_to_user_timeout(sk);
> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>
Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.
Regards
Jon
> --
>
> neal
On Tue, Jul 3, 2018 at 1:15 PM, Jonathan Maxwell <[email protected]> wrote:
> On Tue, Jul 3, 2018 at 1:00 PM, Neal Cardwell <[email protected]> wrote:
>> On Mon, Jul 2, 2018 at 9:18 PM Jon Maxwell <[email protected]> wrote:
>>>
>>> Every time the TCP retransmission timer fires. It checks to see if there is a
>>> timeout before scheduling the next retransmit timer. The retransmit interval
>>> between each retransmission increases exponentially. The issue is that in order
>>> for the timeout to occur the retransmit timer needs to fire again. If the user
>>> timeout check happens after the 9th retransmit for example. It needs to wait for
>>> the 10th retransmit timer to fire in order to evaluate whether a timeout has
>>> occurred or not. If the interval is large enough then the timeout will be
>>> inaccurate.
>>>
>>> For example with a TCP_USER_TIMEOUT of 10 seconds without patch:
>>>
>>> 1st retransmit:
>>>
>>> 22:25:18.973488 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 22:25:26.205499 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Sun Jul 1 22:25:34 EDT 2018
>>>
>>> We can see that last retransmit took ~7 seconds. Which pushed the total
>>> timeout to ~15 seconds instead of the expected 10 seconds. This gets more
>>> inaccurate the larger the TCP_USER_TIMEOUT value. As the interval increases.
>>>
>>> Fix this by recalculating the last retransmit interval so that it fires when
>>> the timeout should occur. Only implement when icsk->icsk_user_timeout is set.
>>>
>>> Test results with the patch is the expected 10 second timeout:
>>>
>>> 1st retransmit:
>>>
>>> 01:37:59.022555 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Last retransmit:
>>>
>>> 01:38:06.486558 IP host1.49310 > host2.search-agent: Flags [.]
>>>
>>> Timeout:
>>>
>>> send: Connection timed out
>>> Mon Jul 2 01:38:09 EDT 2018
>>>
>>> Signed-off-by: Jon Maxwell <[email protected]>
>>> ---
>>> net/ipv4/tcp_timer.c | 7 +++++++
>>> 1 file changed, 7 insertions(+)
>>>
>>> diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
>>> index 3b3611729928..94491a481722 100644
>>> --- a/net/ipv4/tcp_timer.c
>>> +++ b/net/ipv4/tcp_timer.c
>>> @@ -407,6 +407,7 @@ void tcp_retransmit_timer(struct sock *sk)
>>> struct tcp_sock *tp = tcp_sk(sk);
>>> struct net *net = sock_net(sk);
>>> struct inet_connection_sock *icsk = inet_csk(sk);
>>> + __u32 time_remaining = 0;
>>>
>>> if (tp->fastopen_rsk) {
>>> WARN_ON_ONCE(sk->sk_state != TCP_SYN_RECV &&
>>> @@ -535,6 +536,12 @@ void tcp_retransmit_timer(struct sock *sk)
>>> /* Use normal (exponential) backoff */
>>> icsk->icsk_rto = min(icsk->icsk_rto << 1, TCP_RTO_MAX);
>>> }
>>> + if (icsk->icsk_user_timeout) {
>>> + time_remaining = jiffies_to_msecs(icsk->icsk_user_timeout) -
>>> + (tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp);
>>> + if (time_remaining < icsk->icsk_rto)
>>> + icsk->icsk_rto = time_remaining;
>>> + }
>>
>> Thanks, a more precise user timeout sounds nice. A couple thoughts:
>>
>> (a) The icsk->icsk_rto is in jiffies, and the time_remaining is in
>> msecs, so it looks like there is a units mismatch here in the
>> comparisons and assignment.
>>
>> (b) It also seems like the time_remaining could be negative, because
>> (a) the icsk_user_timeout is not involved in the baseline RTO
>> calculation, so that perhaps the first RTO to fire might be beyond the
>> icsk_user_timeout AFAIK, and (b) if the machine is very busy then the
>> timer handler can be delayed beyond the targeted icsk_user_timeout.
>> But time_remaining is a __u32, and icsk->icsk_rto is also a __u32, so
>> it seems like a negative number in time_remaining would usually be
>> treated as a very large unsigned positive number in this comparison:
>>
>> + if (time_remaining < icsk->icsk_rto)
>>
>> (c) If the user timeout is changed between RTO expirations to push the
>> user timeout further in the future, then it seems like this commit
>> will have side effects that left the icsk->icsk_rto in a weird state
>> that does not do the expected exponential backoff correctly.
>>
>> (d) There are also wrapping issues to watch out for, since the
>> tcp_time_stamp(tcp_sk(sk)) and tcp_sk(sk)->retrans_stamp are
>> milliseconds, which will wrap every 49 days or so. Seems like the code
>> is OK in that respect.
>>
>> (e) It also might be nice to put this logic in a helper, rather than
>> growing the body of tcp_retransmit_timer().
>>
>> What about something like (pseudocode):
>>
>> --
>>
>> static __u32 tcp_clamp_rto_to_user_timeout(sk):
>> rto = icsk->icsk_rto;
>> if (!icsk->icsk_user_timeout)
>> return rto;
>> elapsed = tcp_time_stamp(tcp_sk(sk)) - tcp_sk(sk)->retrans_stamp;
>> user_timeout = jiffies_to_msecs(icsk->icsk_user_timeout);
>> if (elapsed >= user_timeout)
>> rto = 1; /* user timeout has passed; fire ASAP */
>> else
>> rto = min(rto, msecs_to_jiffies(user_timeout - elapsed));
>> return rto;
>>
>> tcp_retransmit_timer():
>> ...
>> rto = tcp_clamp_rto_to_user_timeout(sk);
>> inet_csk_reset_xmit_timer(sk, ICSK_TIME_RETRANS, rto, TCP_RTO_MAX);
>>
>
> Thanks Neal, that looks like a good idea. Let me test that out in my reproducer.
>
> Regards
>
> Jon
>
Thanks for your input and suggestions Neal.
Results were positive in the reproducer. I'll tidy the patch up a bit and submit
as v1 with your ideas.
>> --
>>
>> neal