2011-05-17 07:40:40

by tsuna

Subject: [PATCH] tcp: Expose the initial RTO via a new sysctl.

Hi,
it's not easy to change the initial RTO of TCP as right now you need to
recompile your kernel. In order to make it easier to tune this setting,
I was wondering whether you would consider turning it into a sysctl. I
attached a first attempt at a patch that does this -- this is my first
patch to the Linux kernel so although I've read SubmitChecklist and
SubmittingPatches, and I've run checkpatch.pl, please let me know if I'm
doing something wrong.

I am doing this because I work in a high-throughput low-latency environment
(line-rate GbE with submillisecond RTT) and some of our clients are negatively
affected by the high initial RTO when the servers are unable to accept() new
connections fast enough. While we're working on fixing these servers and/or
giving them larger backlog queues when they listen(), being able to tune
the initial RTO at runtime would be very useful as a quick workaround for the
server-side issues.

Some large Internet websites are also running with a more aggressive initial
RTO, for instance Google documented some of what they're doing here:
http://www.ietf.org/proceedings/75/slides/tcpm-1.pdf
While I'm not arguing to change the default value at this time, I believe
that this patch would also come in handy for those who wish to experiment
with various values in their environment.

If you're willing to consider this patch, bear in mind I have only compiled it;
I haven't tested it yet (not knowing whether you'd want something like that or
not). I would also appreciate it if anyone had any insight on what I did with
`COUNTER_TRIES' in `syncookies.c', as this magic constant is rather mysterious
and the comment didn't help me figure out how it had been derived. I couldn't
find anything online and git blame didn't help me either (it pre-dates Git).
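
As a rough illustration of why the initial RTO matters for connection setup
(this is not part of the patch), the sketch below computes when the n-th SYN
retransmission would go out if the timeout simply doubles after every attempt.
The function name, the millisecond units and the cap parameter are assumptions
made for this example; the kernel's actual timer code is more involved.

```c
#include <assert.h>

/*
 * Illustrative only: elapsed time in milliseconds at which the n-th SYN
 * retransmission is sent (n = 1 is the first retransmission), assuming the
 * timeout starts at initial_rto_ms, doubles after each attempt, and is
 * capped at rto_max_ms. Hypothetical helper, not a kernel function.
 */
unsigned long syn_retransmit_time_ms(unsigned int n,
                                     unsigned long initial_rto_ms,
                                     unsigned long rto_max_ms)
{
    unsigned long elapsed = 0;
    unsigned long rto = initial_rto_ms;
    unsigned int i;

    assert(initial_rto_ms > 0 && rto_max_ms >= initial_rto_ms);

    for (i = 0; i < n; i++) {
        elapsed += rto;
        /* Double the timeout, saturating at the cap. */
        rto = (rto > rto_max_ms / 2) ? rto_max_ms : rto * 2;
    }
    return elapsed;
}
```

Under these assumptions, a 3000 ms initial RTO puts the first three
retransmissions at 3 s, 9 s and 21 s after the original SYN, whereas a
1000 ms initial RTO puts them at 1 s, 3 s and 7 s.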


2011-05-17 07:41:27

by tsuna

Subject: [PATCH] tcp: Expose the initial RTO via a new sysctl.

Instead of hardcoding the initial RTO to 3s and requiring
the kernel to be recompiled to change it, expose it as a
sysctl that can be tuned at runtime. Leave the default
value unchanged.

Signed-off-by: Benoit Sigoure <[email protected]>
---
Documentation/networking/ip-sysctl.txt | 6 ++++++
include/linux/sysctl.h | 1 +
include/net/tcp.h | 3 ++-
kernel/sysctl_binary.c | 1 +
net/ipv4/syncookies.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 11 +++++++++++
net/ipv4/tcp.c | 4 ++--
net/ipv4/tcp_input.c | 8 ++++----
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/tcp_minisocks.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/ipv4/tcp_timer.c | 9 +++++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +++---
14 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d3d653a..c381c68 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,12 @@ tcp_retries2 - INTEGER
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.

+tcp_initial_rto - INTEGER
+ This value sets the initial retransmit timeout, that is how long
+ the kernel will wait before retransmitting the initial SYN packet.
+
+ RFC 1122 says that this SHOULD be 3 seconds, which is the default.
+
tcp_rfc1337 - BOOLEAN
If set, the TCP stack behaves conforming to RFC1337. If unset,
we are not conforming to RFC, but prevent TCP TIME_WAIT
diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
index 11684d9..96a9b41 100644
--- a/include/linux/sysctl.h
+++ b/include/linux/sysctl.h
@@ -425,6 +425,7 @@ enum
NET_TCP_ALLOWED_CONG_CONTROL=123,
NET_TCP_MAX_SSTHRESH=124,
NET_TCP_FRTO_RESPONSE=125,
+ NET_IPV4_TCP_INITIAL_RTO=126,
};

enum {
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..a2bb0f1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
extern int sysctl_tcp_synack_retries;
extern int sysctl_tcp_retries1;
extern int sysctl_tcp_retries2;
+extern int sysctl_tcp_initial_rto;
extern int sysctl_tcp_orphan_retries;
extern int sysctl_tcp_syncookies;
extern int sysctl_tcp_retrans_collapse;
@@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
{
unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
- return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
+ return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
}

extern struct proto tcp_prot;
diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
index 3b8e028..d608d84 100644
--- a/kernel/sysctl_binary.c
+++ b/kernel/sysctl_binary.c
@@ -354,6 +354,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
{ CTL_INT, NET_IPV4_TCP_KEEPALIVE_INTVL, "tcp_keepalive_intvl" },
{ CTL_INT, NET_IPV4_TCP_RETRIES1, "tcp_retries1" },
{ CTL_INT, NET_IPV4_TCP_RETRIES2, "tcp_retries2" },
+ { CTL_INT, NET_IPV4_TCP_INITIAL_RTO, "tcp_initial_rto" },
{ CTL_INT, NET_IPV4_TCP_FIN_TIMEOUT, "tcp_fin_timeout" },
{ CTL_INT, NET_TCP_SYNCOOKIES, "tcp_syncookies" },
{ CTL_INT, NET_TCP_TW_RECYCLE, "tcp_tw_recycle" },
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8b44c6d..089bc92 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)
/*
* Check if a ack sequence number is a valid syncookie.
* Return the decoded mss if it is, or 0 if not.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 321e6e8..24dc21d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
static int tcp_adv_win_scale_max = 31;
static int ip_ttl_min = 1;
static int ip_ttl_max = 255;
+static int tcp_initial_rto_min = TCP_RTO_MIN;
+static int tcp_initial_rto_max = TCP_RTO_MAX;

/* Update system visible IP port range */
static void set_local_port_range(int range[2])
@@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "tcp_initial_rto",
+ .data = &sysctl_tcp_initial_rto,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &tcp_initial_rto_min,
+ .extra2 = &tcp_initial_rto_max,
+ },
{
.procname = "tcp_fin_timeout",
.data = &sysctl_tcp_fin_timeout,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..e9e7c3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_DEFER_ACCEPT:
/* Translate value in seconds to number of retransmits */
icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+ secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,
TCP_RTO_MAX / HZ);
break;

@@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
break;
case TCP_DEFER_ACCEPT:
val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
- TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
+ sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..39f6c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
if (dst_metric(dst, RTAX_RTT) == 0)
goto reset;

- if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+ if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))
goto reset;

/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +916,7 @@ static void tcp_init_metrics(struct sock *sk)
tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
}
tcp_set_rto(sk);
- if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+ if (inet_csk(sk)->icsk_rto < sysctl_tcp_initial_rto && !tp->rx_opt.saw_tstamp) {
reset:
/* Play conservative. If timestamps are not
* supported, TCP will fail to recalculate correct
@@ -924,8 +924,8 @@ reset:
*/
if (!tp->rx_opt.saw_tstamp && tp->srtt) {
tp->srtt = 0;
- tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ tp->mdev = tp->mdev_max = tp->rttvar = sysctl_tcp_initial_rto;
+ inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
}
}
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..21920e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1383,7 +1383,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
want_cookie)
goto drop_and_free;

- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1834,8 +1834,8 @@ static int tcp_v4_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..c63ffa0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,8 +472,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
tcp_init_wl(newtp, treq->rcv_isn);

newtp->srtt = 0;
- newtp->mdev = TCP_TIMEOUT_INIT;
- newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+ newtp->mdev = sysctl_tcp_initial_rto;
+ newicsk->icsk_rto = sysctl_tcp_initial_rto;

newtp->packets_out = 0;
newtp->retrans_out = 0;
@@ -582,7 +582,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* it can be estimated (approximately)
* from another data.
*/
- tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+ tmp_opt.ts_recent_stamp = get_seconds() - ((sysctl_tcp_initial_rto/HZ)<<req->retrans);
paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
}
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17388c7..e34b0f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2599,7 +2599,7 @@ static void tcp_connect_init(struct sock *sk)
tp->rcv_wup = 0;
tp->copied_seq = 0;

- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..b9da62b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+int sysctl_tcp_initial_rto __read_mostly = TCP_TIMEOUT_INIT;
int sysctl_tcp_orphan_retries __read_mostly;
int sysctl_tcp_thin_linear_timeouts __read_mostly;

@@ -135,8 +136,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)

/* This function calculates a "timeout" which is equivalent to the timeout of a
* TCP connection after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
+ * retransmissions with an initial RTO of TCP_RTO_MIN or
+ * sysctl_tcp_initial_rto if syn_set flag is set.
*/
static bool retransmits_timed_out(struct sock *sk,
unsigned int boundary,
@@ -144,7 +145,7 @@ static bool retransmits_timed_out(struct sock *sk,
bool syn_set)
{
unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+ unsigned int rto_base = syn_set ? sysctl_tcp_initial_rto : TCP_RTO_MIN;

if (!inet_csk(sk)->icsk_retransmits)
return false;
@@ -495,7 +496,7 @@ out_unlock:
static void tcp_synack_timer(struct sock *sk)
{
inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
- TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+ sysctl_tcp_initial_rto, TCP_RTO_MAX);
}

void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 352c260..50baaec 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -45,7 +45,7 @@ static __u16 const msstab[] = {
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)

static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..7e791e6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1349,7 +1349,7 @@ have_isn:
want_cookie)
goto drop_and_free;

- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet6_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1957,8 +1957,8 @@ static int tcp_v6_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
--
1.7.0.4

2011-05-17 08:31:38

by Alexander Zimmermann

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

Hi Benoit,

On 17.05.2011 at 09:40, Benoit Sigoure wrote:

> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime. Leave the default
> value unchanged.
>

Regardless of whether netdev accepts this patch or not, the
upcoming initRTO is 1s. See
http://tools.ietf.org/id/draft-paxson-tcpm-rfc2988bis-02.txt

The draft is IESG approved and will become an RFC soon.

Alex

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: [email protected]
// web: http://www.umic-mesh.net
//



2011-05-17 08:08:07

by Eric Dumazet

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Tuesday, 17 May 2011 at 00:40 -0700, Benoit Sigoure wrote:
> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime. Leave the default
> value unchanged.
>

I won't discuss whether introducing a new sysctl is welcome, only the issues
in the patch. I believe some work is being done in the IETF to reduce the
3 sec value to 1 sec anyway.

> Signed-off-by: Benoit Sigoure <[email protected]>
> ---
> Documentation/networking/ip-sysctl.txt | 6 ++++++
> include/linux/sysctl.h | 1 +
> include/net/tcp.h | 3 ++-
> kernel/sysctl_binary.c | 1 +
> net/ipv4/syncookies.c | 2 +-
> net/ipv4/sysctl_net_ipv4.c | 11 +++++++++++
> net/ipv4/tcp.c | 4 ++--
> net/ipv4/tcp_input.c | 8 ++++----
> net/ipv4/tcp_ipv4.c | 6 +++---
> net/ipv4/tcp_minisocks.c | 6 +++---
> net/ipv4/tcp_output.c | 2 +-
> net/ipv4/tcp_timer.c | 9 +++++----
> net/ipv6/syncookies.c | 2 +-
> net/ipv6/tcp_ipv6.c | 6 +++---
> 14 files changed, 44 insertions(+), 23 deletions(-)
>
> diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
> index d3d653a..c381c68 100644
> --- a/Documentation/networking/ip-sysctl.txt
> +++ b/Documentation/networking/ip-sysctl.txt
> @@ -384,6 +384,12 @@ tcp_retries2 - INTEGER
> RFC 1122 recommends at least 100 seconds for the timeout,
> which corresponds to a value of at least 8.
>
> +tcp_initial_rto - INTEGER
> + This value sets the initial retransmit timeout, that is how long
> + the kernel will wait before retransmitting the initial SYN packet.
> +
> + RFC 1122 says that this SHOULD be 3 seconds, which is the default.
> +

Units? Seconds? ms? jiffies? I suggest using ms for the external
interface.

> tcp_rfc1337 - BOOLEAN
> If set, the TCP stack behaves conforming to RFC1337. If unset,
> we are not conforming to RFC, but prevent TCP TIME_WAIT
> diff --git a/include/linux/sysctl.h b/include/linux/sysctl.h
> index 11684d9..96a9b41 100644
> --- a/include/linux/sysctl.h
> +++ b/include/linux/sysctl.h
> @@ -425,6 +425,7 @@ enum
> NET_TCP_ALLOWED_CONG_CONTROL=123,
> NET_TCP_MAX_SSTHRESH=124,
> NET_TCP_FRTO_RESPONSE=125,
> + NET_IPV4_TCP_INITIAL_RTO=126,

We don't add new values here anymore, only anonymous ones.

> };
>
> enum {
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index cda30ea..a2bb0f1 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
> extern int sysctl_tcp_synack_retries;
> extern int sysctl_tcp_retries1;
> extern int sysctl_tcp_retries2;
> +extern int sysctl_tcp_initial_rto;
> extern int sysctl_tcp_orphan_retries;
> extern int sysctl_tcp_syncookies;
> extern int sysctl_tcp_retrans_collapse;
> @@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
> static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
> {
> unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
> - return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
> + return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
> }
>
> extern struct proto tcp_prot;
> diff --git a/kernel/sysctl_binary.c b/kernel/sysctl_binary.c
> index 3b8e028..d608d84 100644
> --- a/kernel/sysctl_binary.c
> +++ b/kernel/sysctl_binary.c
> @@ -354,6 +354,7 @@ static const struct bin_table bin_net_ipv4_table[] = {
> { CTL_INT, NET_IPV4_TCP_KEEPALIVE_INTVL, "tcp_keepalive_intvl" },
> { CTL_INT, NET_IPV4_TCP_RETRIES1, "tcp_retries1" },
> { CTL_INT, NET_IPV4_TCP_RETRIES2, "tcp_retries2" },
> + { CTL_INT, NET_IPV4_TCP_INITIAL_RTO, "tcp_initial_rto" },

No need for this here; sysctl() is deprecated.

> { CTL_INT, NET_IPV4_TCP_FIN_TIMEOUT, "tcp_fin_timeout" },
> { CTL_INT, NET_TCP_SYNCOOKIES, "tcp_syncookies" },
> { CTL_INT, NET_TCP_TW_RECYCLE, "tcp_tw_recycle" },
> diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
> index 8b44c6d..089bc92 100644
> --- a/net/ipv4/syncookies.c
> +++ b/net/ipv4/syncookies.c
> @@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
> * sysctl_tcp_retries1. It's a rather complicated formula (exponential
> * backoff) to compute at runtime so it's currently hardcoded here.
> */
> -#define COUNTER_TRIES 4
> +#define COUNTER_TRIES (sysctl_tcp_initial_rto + 1)

Are you sure about this?

If HZ=1000, sysctl_tcp_initial_rto is 3000

COUNTER_TRIES goes from 4 to 3001

> /*
> * Check if a ack sequence number is a valid syncookie.
> * Return the decoded mss if it is, or 0 if not.
> diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
> index 321e6e8..24dc21d 100644
> --- a/net/ipv4/sysctl_net_ipv4.c
> +++ b/net/ipv4/sysctl_net_ipv4.c
> @@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
> static int tcp_adv_win_scale_max = 31;
> static int ip_ttl_min = 1;
> static int ip_ttl_max = 255;
> +static int tcp_initial_rto_min = TCP_RTO_MIN;

Warning: the units here are jiffies.

> +static int tcp_initial_rto_max = TCP_RTO_MAX;
>
> /* Update system visible IP port range */
> static void set_local_port_range(int range[2])
> @@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
> .mode = 0644,
> .proc_handler = proc_dointvec
> },
> + {
> + .procname = "tcp_initial_rto",
> + .data = &sysctl_tcp_initial_rto,
> + .maxlen = sizeof(int),
> + .mode = 0644,
> + .proc_handler = proc_dointvec_minmax,

So the unit is jiffies? That's really not a good thing. Use ms instead.

Consider proc_dointvec_ms_jiffies() here.

> + .extra1 = &tcp_initial_rto_min,
> + .extra2 = &tcp_initial_rto_max,
> + },
> {
> .procname = "tcp_fin_timeout",
> .data = &sysctl_tcp_fin_timeout,
> diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
> index b22d450..e9e7c3f 100644
> --- a/net/ipv4/tcp.c
> +++ b/net/ipv4/tcp.c
> @@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
> case TCP_DEFER_ACCEPT:
> /* Translate value in seconds to number of retransmits */
> icsk->icsk_accept_queue.rskq_defer_accept =
> - secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
> + secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,

Here you assume sysctl_tcp_initial_rto is expressed in jiffies?
Oh well...

> TCP_RTO_MAX / HZ);
> break;
>
> @@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
> break;
> case TCP_DEFER_ACCEPT:
> val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
> - TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
> + sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
> break;
> case TCP_WINDOW_CLAMP:
> val = tp->window_clamp;
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bef9f04..39f6c27 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
> if (dst_metric(dst, RTAX_RTT) == 0)
> goto reset;
>
> - if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
> + if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))

Here you assume jiffies units again. I wonder how this was tested :(

Please fix this and choose a definitive unit.
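
To make the unit question concrete, here is a minimal userspace sketch of
the millisecond/jiffies conversion that an ms-based external interface
implies. The helper names and the explicit hz parameter are assumptions for
illustration (in the kernel, HZ is a compile-time constant and the real
conversion helpers live elsewhere); rounding up on the way in is also a
choice made for this sketch.

```c
#include <assert.h>

/*
 * Hypothetical helpers for illustration; hz stands in for the kernel's HZ.
 * Convert milliseconds to timer ticks, rounding up so that a short,
 * non-zero timeout never becomes zero ticks.
 */
unsigned long ms_to_jiffies(unsigned long ms, unsigned long hz)
{
    assert(hz > 0);
    return (ms * hz + 999) / 1000;
}

/* Convert timer ticks back to milliseconds (truncating). */
unsigned long jiffies_to_ms(unsigned long ticks, unsigned long hz)
{
    assert(hz > 0);
    return ticks * 1000 / hz;
}
```

The same 3000 ms setting corresponds to 3000 ticks at HZ=1000, 750 at
HZ=250 and 300 at HZ=100, which is why a value exposed in raw jiffies
would mean different things on differently configured kernels.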


2011-05-17 08:34:07

by Eric Dumazet

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Tuesday, 17 May 2011 at 10:01 +0200, Alexander Zimmermann wrote:

>
> Regardless of whether netdev accepts this patch or not, the
> upcoming initRTO is 1s. See
> http://tools.ietf.org/id/draft-paxson-tcpm-rfc2988bis-02.txt
>
> The draft is IESG approved and will become an RFC soon.

Thanks Alex for this link / information.


2011-05-17 11:02:49

by Hagen Paul Pfeifer

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.


On Tue, 17 May 2011 10:07:57 +0200, Eric Dumazet wrote:

> I won't discuss whether introducing a new sysctl is welcome, only the issues
> in the patch. I believe some work is being done in the IETF to reduce the
> 3 sec value to 1 sec anyway.

Why not? I thought all new knobs in this area should be done as per-route
metrics so they can be controlled on a per-path basis. RTO should be
adjustable on a per-path basis, because it depends on the path.

Some months back [1] I posted a patch to enable/disable TCP quick ack
mode, which has nothing to do with network paths, just with a local server
policy. But David rejected the patch with the argument that I should use a
per-path knob (this is a little bit incomprehensible to me, but David has
the last word).

Hagen


[1] http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283640

2011-05-17 12:20:31

by Eric Dumazet

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Tuesday, 17 May 2011 at 13:02 +0200, Hagen Paul Pfeifer wrote:
> On Tue, 17 May 2011 10:07:57 +0200, Eric Dumazet wrote:
>
> > I won't discuss whether introducing a new sysctl is welcome, only the issues
> > in the patch. I believe some work is being done in the IETF to reduce the
> > 3 sec value to 1 sec anyway.
>
> Why not?

Just because I leave this point to David and others. I personally don't
care that much.

> I thought all new knobs in this area should be done as per-route
> metrics so they can be controlled on a per-path basis. RTO should be
> adjustable on a per-path basis, because it depends on the path.
>

Adding many knobs to each clone had a huge cost on previous kernels:
some machines have millions of entries in the IP route cache, so this used
quite a lot of memory.

With David's latest work, we'll consume less RAM, because we can now share
settings instead of copying them into each dst entry.



> Some months back [1] I posted a patch to enable/disable TCP quick ack
> mode, which has nothing to do with network paths, just with a local server
> policy. But David rejected the patch with the argument that I should use a
> per-path knob (this is a little bit incomprehensible to me, but David has
> the last word).

Well, if nobody speaks after David, he has the last word indeed.

BTW, I remember Stephen actually asked for the per-route thing, not David.

http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283641

Then David also stated it:

http://kerneltrap.org/mailarchive/linux-netdev/2010/8/23/6283678

If you really want the tcp_quickack thing, you really should do it as
requested by both Stephen & David ;)

Unfortunately, I don't know if it's really needed or worthwhile.

2011-05-18 10:43:48

by tsuna

Subject: [PATCH] tcp: Expose the initial RTO via a new sysctl.

Instead of hardcoding the initial RTO to 3s and requiring
the kernel to be recompiled to change it, expose it as a
sysctl that can be tuned at runtime. Leave the default
value unchanged.

Signed-off-by: Benoit Sigoure <[email protected]>
---

v2 of the patch to address Eric's comments. Of course I had to forget
to convert things back and forth between jiffies and ms -- /me n00b.
Code compiles. It seems like no one is opposed to this change, but if
one of you guys could express explicit interest in merging this change,
I'd be happy to spend a bit more time to test it.

The new sysctl is exposed in milliseconds but internally the value remains
in jiffies to avoid having to convert back and forth between jiffies and
ms in most places.
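
For reference, a small userspace sketch of how the COUNTER_TRIES expression
behaves in v1 versus v2 once the internal value is known to be in jiffies.
The function names and the explicit hz parameter are assumptions for
illustration only; in the patch both are macros over
sysctl_tcp_initial_rto and HZ. This only evaluates the two expressions and
says nothing about whether the derived value is the right one for
syncookies.

```c
#include <assert.h>

/* v1 expression: the raw jiffies value plus one, so it scales with HZ. */
unsigned long counter_tries_v1(unsigned long rto_jiffies)
{
    return rto_jiffies + 1;
}

/*
 * v2 expression: whole seconds plus one. hz stands in for the kernel's HZ.
 * Integer division means sub-second settings contribute zero seconds.
 */
unsigned long counter_tries_v2(unsigned long rto_jiffies, unsigned long hz)
{
    assert(hz > 0);
    return rto_jiffies / hz + 1;
}
```

With the 3 s default, the v2 form evaluates to 4 regardless of HZ, matching
the old hardcoded constant; a 1 s setting gives 2, and any setting below
one second gives 1.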

I'm glad to hear that the default value will be tuned down to 1s. This
change will help people play with this value and easily revert it back at
runtime if they feel like they preferred the current value.

Thank you for your time.

Documentation/networking/ip-sysctl.txt | 8 ++++++++
include/net/tcp.h | 3 ++-
net/ipv4/syncookies.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 11 +++++++++++
net/ipv4/tcp.c | 4 ++--
net/ipv4/tcp_input.c | 8 ++++----
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/tcp_minisocks.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/ipv4/tcp_timer.c | 9 +++++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +++---
12 files changed, 44 insertions(+), 23 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d3d653a..7f3c7d2 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,14 @@ tcp_retries2 - INTEGER
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.

+tcp_initial_rto - INTEGER
+ This value sets the initial retransmit timeout (in milliseconds),
+ that is how long the kernel will wait before retransmitting the
+ initial SYN packet.
+
+ RFC 1122 says that this SHOULD be 3000 milliseconds, which is the
+ default.
+
tcp_rfc1337 - BOOLEAN
If set, the TCP stack behaves conforming to RFC1337. If unset,
we are not conforming to RFC, but prevent TCP TIME_WAIT
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..d6d7dea 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -213,6 +213,7 @@ extern int sysctl_tcp_syn_retries;
extern int sysctl_tcp_synack_retries;
extern int sysctl_tcp_retries1;
extern int sysctl_tcp_retries2;
+extern int sysctl_tcp_initial_rto; /* in jiffies */
extern int sysctl_tcp_orphan_retries;
extern int sysctl_tcp_syncookies;
extern int sysctl_tcp_retrans_collapse;
@@ -295,7 +296,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
{
unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
- return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
+ return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
}

extern struct proto tcp_prot;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8b44c6d..b035968 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)
/*
* Check if a ack sequence number is a valid syncookie.
* Return the decoded mss if it is, or 0 if not.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 321e6e8..51c778d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
static int tcp_adv_win_scale_max = 31;
static int ip_ttl_min = 1;
static int ip_ttl_max = 255;
+static int tcp_initial_rto_min = TCP_RTO_MIN;
+static int tcp_initial_rto_max = TCP_RTO_MAX;

/* Update system visible IP port range */
static void set_local_port_range(int range[2])
@@ -246,6 +248,15 @@ static struct ctl_table ipv4_table[] = {
.mode = 0644,
.proc_handler = proc_dointvec
},
+ {
+ .procname = "tcp_initial_rto",
+ .data = &sysctl_tcp_initial_rto,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_ms_jiffies,
+ .extra1 = &tcp_initial_rto_min,
+ .extra2 = &tcp_initial_rto_max,
+ },
{
.procname = "tcp_fin_timeout",
.data = &sysctl_tcp_fin_timeout,
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..e9e7c3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_DEFER_ACCEPT:
/* Translate value in seconds to number of retransmits */
icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+ secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,
TCP_RTO_MAX / HZ);
break;

@@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
break;
case TCP_DEFER_ACCEPT:
val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
- TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
+ sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..39f6c27 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -890,7 +890,7 @@ static void tcp_init_metrics(struct sock *sk)
if (dst_metric(dst, RTAX_RTT) == 0)
goto reset;

- if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+ if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (sysctl_tcp_initial_rto << 3))
goto reset;

/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +916,7 @@ static void tcp_init_metrics(struct sock *sk)
tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
}
tcp_set_rto(sk);
- if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+ if (inet_csk(sk)->icsk_rto < sysctl_tcp_initial_rto && !tp->rx_opt.saw_tstamp) {
reset:
/* Play conservative. If timestamps are not
* supported, TCP will fail to recalculate correct
@@ -924,8 +924,8 @@ reset:
*/
if (!tp->rx_opt.saw_tstamp && tp->srtt) {
tp->srtt = 0;
- tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ tp->mdev = tp->mdev_max = tp->rttvar = sysctl_tcp_initial_rto;
+ inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
}
}
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..21920e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1383,7 +1383,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
want_cookie)
goto drop_and_free;

- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1834,8 +1834,8 @@ static int tcp_v4_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..c63ffa0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,8 +472,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
tcp_init_wl(newtp, treq->rcv_isn);

newtp->srtt = 0;
- newtp->mdev = TCP_TIMEOUT_INIT;
- newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+ newtp->mdev = sysctl_tcp_initial_rto;
+ newicsk->icsk_rto = sysctl_tcp_initial_rto;

newtp->packets_out = 0;
newtp->retrans_out = 0;
@@ -582,7 +582,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* it can be estimated (approximately)
* from another data.
*/
- tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+ tmp_opt.ts_recent_stamp = get_seconds() - ((sysctl_tcp_initial_rto/HZ)<<req->retrans);
paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
}
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17388c7..e34b0f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2599,7 +2599,7 @@ static void tcp_connect_init(struct sock *sk)
tp->rcv_wup = 0;
tp->copied_seq = 0;

- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..b9da62b 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,7 @@ int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+int sysctl_tcp_initial_rto __read_mostly = TCP_TIMEOUT_INIT;
int sysctl_tcp_orphan_retries __read_mostly;
int sysctl_tcp_thin_linear_timeouts __read_mostly;

@@ -135,8 +136,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)

/* This function calculates a "timeout" which is equivalent to the timeout of a
* TCP connection after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
+ * retransmissions with an initial RTO of TCP_RTO_MIN or
+ * sysctl_tcp_initial_rto if syn_set flag is set.
*/
static bool retransmits_timed_out(struct sock *sk,
unsigned int boundary,
@@ -144,7 +145,7 @@ static bool retransmits_timed_out(struct sock *sk,
bool syn_set)
{
unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+ unsigned int rto_base = syn_set ? sysctl_tcp_initial_rto : TCP_RTO_MIN;

if (!inet_csk(sk)->icsk_retransmits)
return false;
@@ -495,7 +496,7 @@ out_unlock:
static void tcp_synack_timer(struct sock *sk)
{
inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
- TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+ sysctl_tcp_initial_rto, TCP_RTO_MAX);
}

void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 352c260..f8a07a8 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -45,7 +45,7 @@ static __u16 const msstab[] = {
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)

static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..7e791e6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1349,7 +1349,7 @@ have_isn:
want_cookie)
goto drop_and_free;

- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet6_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1957,8 +1957,8 @@ static int tcp_v6_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
--
1.7.0.4

2011-05-18 19:30:48

by David Miller

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

From: Benoit Sigoure <[email protected]>
Date: Wed, 18 May 2011 03:43:04 -0700

> Instead of hardcoding the initial RTO to 3s and requiring
> the kernel to be recompiled to change it, expose it as a
> sysctl that can be tuned at runtime. Leave the default
> value unchanged.
>
> Signed-off-by: Benoit Sigoure <[email protected]>

If you read the ietf draft that reduces the initial RTO down to 1
second, it states that if we take a timeout during the initial
connection handshake then we have to revert the RTO back up to 3
seconds.

This fallback logic conflicts with being able to only change the
initial RTO via sysctl, I think. Because there are actually two
values at stake and they depend upon each other, the initial RTO and
the value we fall back to on initial handshake retransmissions.

So I'd rather get a patch that implements the 1 second initial
RTO with the 3 second fallback on SYN retransmit, than this patch.

We already have too many knobs.
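
The two-value behaviour described in this message can be sketched as follows. This is only an illustration, assuming the 1 second and 3 second values mentioned above; the names INIT_RTO_MS, FALLBACK_RTO_MS and rto_before_first_sample() are hypothetical and do not exist in the kernel.

```c
/* Hypothetical constants, in milliseconds, using the values
 * discussed in this thread (1 s initial RTO, 3 s fallback).
 */
#define INIT_RTO_MS     1000u
#define FALLBACK_RTO_MS 3000u

/* RTO to use when data transmission begins and no RTT sample is
 * available yet. handshake_timeouts is the number of times the
 * retransmission timer expired during the three-way handshake.
 */
static unsigned int rto_before_first_sample(unsigned int handshake_timeouts)
{
	return handshake_timeouts > 0 ? FALLBACK_RTO_MS : INIT_RTO_MS;
}
```

The point of the sketch is that the two values are coupled: a lone knob for the first one says nothing about what the second one should be.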

2011-05-18 19:40:46

by tsuna

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Wed, May 18, 2011 at 12:26 PM, David Miller <[email protected]> wrote:
> If you read the ietf draft that reduces the initial RTO down to 1
> second, it states that if we take a timeout during the initial
> connection handshake then we have to revert the RTO back up to 3
> seconds.
>
> This fallback logic conflicts with being able to only change the
> initial RTO via sysctl, I think. Because there are actually two
> values at stake and they depend upon each other, the initial RTO and
> the value we fall back to on initial handshake retransmissions.
>
> So I'd rather get a patch that implements the 1 second initial
> RTO with the 3 second fallback on SYN retransmit, than this patch.
>
> We already have too many knobs.

I was hoping this knob would be accepted because this is such an
important issue that it even warrants an IETF draft to attempt to
change the standard. I'm not sure how long it will take for this
draft to be accepted and then implemented, so I thought adding this
simple knob today would really help in the future.

Plus, should the draft be accepted, this knob will still be just as
useful (e.g. to revert back to today's behavior), and people might
want to consider adding another knob for the fallback initRTO (this is
debatable). I don't believe this knob conflicts with the proposed
change to the standard, it actually goes along with it pretty well and
helps us prepare better for this upcoming change.

I agree that there are too many knobs, and I hate feature creep too,
but I've found many of these knobs to be really useful, and the degree
to which Linux's TCP stack can be tuned is part of what makes it so
versatile.

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-18 19:55:51

by David Miller

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

From: tsuna <[email protected]>
Date: Wed, 18 May 2011 12:40:21 -0700

> I was hoping this knob would be accepted because this is such an
> important issue that it even warrants an IETF draft to attempt to
> change the standard. I'm not sure how long it will take for this
> draft to be accepted and then implemented, so I thought adding this
> simple knob today would really help in the future.

I've already changed the initial TCP congestion window in Linux to 10
without some stupid draft being fully accepted.

I'll just as easily accept a patch right now which lowers
the initial RTO to 1 second and adds the 3 second RTO fallback.

2011-05-18 20:20:31

by Hagen Paul Pfeifer

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

* David Miller | 2011-05-18 15:52:00 [-0400]:

>I've already changed the initial TCP congestion window in Linux to 10
>without some stupid draft being fully accepted.
>
>I'll just as easily accept a patch right now which lowers
>the initial RTO to 1 second and adds the 3 second RTO fallback.

I like the idea of making the initial RTO a knob because in our isolated MANET
environment we have an RTT larger than 1 second. In particular, the link layer
setup procedure over several hops takes a considerable amount of time. After
that the RTT is <1 second. The current algorithm works great for us. So this RTO
change will be counterproductive: it will always trigger a needless timeout.

The main problem for us is that Google et al. are pushing their view of the
Internet with a lot of pressure. The same is true for the IETF IW adjustments,
which are unsuitable for networks that operate with the bandwidth
characteristics of some years ago. The _former_ conservative principle "TCP
over everything" is forgotten.

Hagen

2011-05-18 20:26:53

by David Miller

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

From: Hagen Paul Pfeifer <[email protected]>
Date: Wed, 18 May 2011 22:20:25 +0200

> I like the idea to make the initial RTO a knob because in our
> isolated MANET environment we have an RTT larger than 1 second.

Then this gets back to the fact that this is a network
attribute and thus more suitable as a route metric not
a global system-wide sysctl.
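
To illustrate the distinction between a per-route attribute and a system-wide setting, here is a minimal sketch. It assumes a hypothetical route structure with an optional initial-RTO field; struct route_sketch, global_init_rto_ms and effective_init_rto_ms() are invented for this example and are not kernel interfaces.

```c
#include <stddef.h>

/* Hypothetical per-route attributes; 0 means "no value set for this route". */
struct route_sketch {
	unsigned int init_rto_ms;
};

/* Hypothetical system-wide default, in milliseconds. */
static unsigned int global_init_rto_ms = 3000;

/* Prefer the value attached to the route, if any; otherwise use the
 * system-wide default.
 */
static unsigned int effective_init_rto_ms(const struct route_sketch *rt)
{
	if (rt != NULL && rt->init_rto_ms != 0)
		return rt->init_rto_ms;
	return global_init_rto_ms;
}
```

With this shape, a slow MANET route could carry a larger value without affecting connections that use other routes.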

2011-05-18 20:27:18

by Hagen Paul Pfeifer

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

* David Miller | 2011-05-18 16:23:06 [-0400]:

>Then this gets back to the fact that this is a network
>attribute and thus more suitable as a route metric not
>a global system-wide sysctl.

Yes, I already mentioned this in an email response to Eric. The initial RTO is
a perfect candidate for a route metric. I'm waiting for a patch to test it! ;-)

Hagen

2011-05-19 02:23:17

by tsuna

Subject: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

Prior to this patch, Linux would always use 3 seconds (compile-time
constant) as the initial RTO. Draft RFC 2988bis-02 proposes to tune
this down to 1 second and, in case of a timeout during the TCP 3WHS,
revert the RTO back up to 3 seconds when data transmission begins.

This patch implements this behavior but retains default values for
the initial RTO of 3 seconds, instead of 1 second as is suggested
in the draft RFC. This way, in a default configuration, the behavior
of Linux's TCP is unchanged.

This patch also adds 2 knobs to tweak the initial RTO:
- tcp_initial_rto: initial RTO used during the 3WHS (default remains
unchanged: 3 seconds). This was previously a compile-time constant.
- tcp_initial_fallback_rto: the RTO to fall back to if a timeout occurs
during the 3WHS, with a default value of 3 seconds too, as per the
draft RFC.

Signed-off-by: Benoit Sigoure <[email protected]>
---

On Wed, May 18, 2011 at 12:52 PM, David Miller <[email protected]> wrote:
> I'll just as easily accept a patch right now which lowers
> the initial RTO to 1 second and adds the 3 second RTO fallback.

Here's a first attempt at a patch that implements the behavior described in
the draft RFC. I have only compiled it so far; if you would like to move forward
with this approach, I'll go ahead and test it on a real server.

I'm not sure whether COUNTER_TRIES in syncookies.c should be based off
sysctl_tcp_initial_rto or sysctl_tcp_initial_fallback_rto, if we're going
to take the first one down to 1s...
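
The comment above COUNTER_TRIES in the patch mentions exponential backoff. As an illustration only (this is not how the constant was derived, which the thread leaves open), the total time covered by a series of exponentially backed-off timeouts can be sketched like this. The function name backoff_total_secs() and its parameters are hypothetical.

```c
/* Total number of seconds spent waiting across the initial transmission
 * plus `retries` retransmissions, with the timeout doubling after each
 * expiry and capped at max_rto_s.
 */
static unsigned int backoff_total_secs(unsigned int init_rto_s,
				       unsigned int retries,
				       unsigned int max_rto_s)
{
	unsigned int total = 0;
	unsigned int rto = init_rto_s;
	unsigned int i;

	for (i = 0; i <= retries; i++) {
		total += rto;
		rto = (rto * 2 > max_rto_s) ? max_rto_s : rto * 2;
	}
	return total;
}
```

Under these assumptions, an initial RTO of 3 seconds with 3 retransmissions covers 3 + 6 + 12 + 24 = 45 seconds, while an initial RTO of 1 second covers 1 + 2 + 4 + 8 = 15 seconds, so the result depends strongly on which of the two sysctls is used as the base.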

Documentation/networking/ip-sysctl.txt | 19 +++++++++++++++++++
include/net/tcp.h | 4 +++-
net/ipv4/syncookies.c | 2 +-
net/ipv4/sysctl_net_ipv4.c | 20 ++++++++++++++++++++
net/ipv4/tcp.c | 4 ++--
net/ipv4/tcp_input.c | 13 +++++++++----
net/ipv4/tcp_ipv4.c | 6 +++---
net/ipv4/tcp_minisocks.c | 6 +++---
net/ipv4/tcp_output.c | 2 +-
net/ipv4/tcp_timer.c | 10 ++++++----
net/ipv6/syncookies.c | 2 +-
net/ipv6/tcp_ipv6.c | 6 +++---
12 files changed, 71 insertions(+), 23 deletions(-)

diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt
index d3d653a..590042c 100644
--- a/Documentation/networking/ip-sysctl.txt
+++ b/Documentation/networking/ip-sysctl.txt
@@ -384,6 +384,25 @@ tcp_retries2 - INTEGER
RFC 1122 recommends at least 100 seconds for the timeout,
which corresponds to a value of at least 8.

+tcp_initial_rto - INTEGER
+ This value sets the initial retransmit timeout (in milliseconds),
+ that is how long the kernel will wait before retransmitting the
+ initial SYN packet.
+
+ RFC 1122 says that this SHOULD be 3000 milliseconds, which is the
+ default. Note that draft RFC 2988bis-02 says that this SHOULD be
+ 1000 milliseconds, which might become the default value in future
+ versions.
+
+tcp_initial_fallback_rto - INTEGER
+ This value sets the initial retransmit timeout (in milliseconds)
+ to use after completing a three-way handshake during which the
+ initial SYN packet had to be retransmitted after waiting for
+ tcp_initial_rto milliseconds.
+
+ Draft RFC 2988bis-02 says that this MUST be 3000 milliseconds,
+ which is the default.
+
tcp_rfc1337 - BOOLEAN
If set, the TCP stack behaves conforming to RFC1337. If unset,
we are not conforming to RFC, but prevent TCP TIME_WAIT
diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..c974242 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -213,6 +213,8 @@ extern int sysctl_tcp_syn_retries;
extern int sysctl_tcp_synack_retries;
extern int sysctl_tcp_retries1;
extern int sysctl_tcp_retries2;
+extern int sysctl_tcp_initial_rto; /* in jiffies */
+extern int sysctl_tcp_initial_fallback_rto; /* in jiffies */
extern int sysctl_tcp_orphan_retries;
extern int sysctl_tcp_syncookies;
extern int sysctl_tcp_retrans_collapse;
@@ -295,7 +297,7 @@ static inline void tcp_synq_overflow(struct sock *sk)
static inline int tcp_synq_no_recent_overflow(const struct sock *sk)
{
unsigned long last_overflow = tcp_sk(sk)->rx_opt.ts_recent_stamp;
- return time_after(jiffies, last_overflow + TCP_TIMEOUT_INIT);
+ return time_after(jiffies, last_overflow + sysctl_tcp_initial_rto);
}

extern struct proto tcp_prot;
diff --git a/net/ipv4/syncookies.c b/net/ipv4/syncookies.c
index 8b44c6d..b035968 100644
--- a/net/ipv4/syncookies.c
+++ b/net/ipv4/syncookies.c
@@ -186,7 +186,7 @@ __u32 cookie_v4_init_sequence(struct sock *sk, struct sk_buff *skb, __u16 *mssp)
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)
/*
* Check if a ack sequence number is a valid syncookie.
* Return the decoded mss if it is, or 0 if not.
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 321e6e8..abe8cfc 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -30,6 +30,8 @@ static int tcp_adv_win_scale_min = -31;
static int tcp_adv_win_scale_max = 31;
static int ip_ttl_min = 1;
static int ip_ttl_max = 255;
+static int tcp_min_rto = TCP_RTO_MIN;
+static int tcp_max_rto = TCP_RTO_MAX;

/* Update system visible IP port range */
static void set_local_port_range(int range[2])
@@ -247,6 +249,24 @@ static struct ctl_table ipv4_table[] = {
.proc_handler = proc_dointvec
},
{
+ .procname = "tcp_initial_rto",
+ .data = &sysctl_tcp_initial_rto,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_ms_jiffies,
+ .extra1 = &tcp_min_rto,
+ .extra2 = &tcp_max_rto,
+ },
+ {
+ .procname = "tcp_initial_fallback_rto",
+ .data = &sysctl_tcp_initial_fallback_rto,
+ .maxlen = sizeof(int),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_ms_jiffies,
+ .extra1 = &tcp_min_rto,
+ .extra2 = &tcp_max_rto,
+ },
+ {
.procname = "tcp_fin_timeout",
.data = &sysctl_tcp_fin_timeout,
.maxlen = sizeof(int),
diff --git a/net/ipv4/tcp.c b/net/ipv4/tcp.c
index b22d450..e9e7c3f 100644
--- a/net/ipv4/tcp.c
+++ b/net/ipv4/tcp.c
@@ -2352,7 +2352,7 @@ static int do_tcp_setsockopt(struct sock *sk, int level,
case TCP_DEFER_ACCEPT:
/* Translate value in seconds to number of retransmits */
icsk->icsk_accept_queue.rskq_defer_accept =
- secs_to_retrans(val, TCP_TIMEOUT_INIT / HZ,
+ secs_to_retrans(val, sysctl_tcp_initial_rto / HZ,
TCP_RTO_MAX / HZ);
break;

@@ -2539,7 +2539,7 @@ static int do_tcp_getsockopt(struct sock *sk, int level,
break;
case TCP_DEFER_ACCEPT:
val = retrans_to_secs(icsk->icsk_accept_queue.rskq_defer_accept,
- TCP_TIMEOUT_INIT / HZ, TCP_RTO_MAX / HZ);
+ sysctl_tcp_initial_rto / HZ, TCP_RTO_MAX / HZ);
break;
case TCP_WINDOW_CLAMP:
val = tp->window_clamp;
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..513cf7a 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
{
struct tcp_sock *tp = tcp_sk(sk);
struct dst_entry *dst = __sk_dst_get(sk);
+ /* If we had to retransmit anything during the 3WHS,
+ * use the initial fallback RTO.
+ */
+ int init_rto = inet_csk(sk)->icsk_retransmits ?
+ sysctl_tcp_initial_fallback_rto : sysctl_tcp_initial_rto;

if (dst == NULL)
goto reset;
@@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
if (dst_metric(dst, RTAX_RTT) == 0)
goto reset;

- if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+ if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
goto reset;

/* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
}
tcp_set_rto(sk);
- if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+ if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
reset:
/* Play conservative. If timestamps are not
* supported, TCP will fail to recalculate correct
@@ -924,8 +929,8 @@ reset:
*/
if (!tp->rx_opt.saw_tstamp && tp->srtt) {
tp->srtt = 0;
- tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
+ inet_csk(sk)->icsk_rto = init_rto;
}
}
tp->snd_cwnd = tcp_init_cwnd(tp, dst);
diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index f7e6c2c..21920e6 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1383,7 +1383,7 @@ int tcp_v4_conn_request(struct sock *sk, struct sk_buff *skb)
want_cookie)
goto drop_and_free;

- inet_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1834,8 +1834,8 @@ static int tcp_v4_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index 80b1f80..c63ffa0 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -472,8 +472,8 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
tcp_init_wl(newtp, treq->rcv_isn);

newtp->srtt = 0;
- newtp->mdev = TCP_TIMEOUT_INIT;
- newicsk->icsk_rto = TCP_TIMEOUT_INIT;
+ newtp->mdev = sysctl_tcp_initial_rto;
+ newicsk->icsk_rto = sysctl_tcp_initial_rto;

newtp->packets_out = 0;
newtp->retrans_out = 0;
@@ -582,7 +582,7 @@ struct sock *tcp_check_req(struct sock *sk, struct sk_buff *skb,
* it can be estimated (approximately)
* from another data.
*/
- tmp_opt.ts_recent_stamp = get_seconds() - ((TCP_TIMEOUT_INIT/HZ)<<req->retrans);
+ tmp_opt.ts_recent_stamp = get_seconds() - ((sysctl_tcp_initial_rto/HZ)<<req->retrans);
paws_reject = tcp_paws_reject(&tmp_opt, th->rst);
}
}
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 17388c7..e34b0f6 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -2599,7 +2599,7 @@ static void tcp_connect_init(struct sock *sk)
tp->rcv_wup = 0;
tp->copied_seq = 0;

- inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+ inet_csk(sk)->icsk_rto = sysctl_tcp_initial_rto;
inet_csk(sk)->icsk_retransmits = 0;
tcp_clear_retrans(tp);
}
diff --git a/net/ipv4/tcp_timer.c b/net/ipv4/tcp_timer.c
index ecd44b0..47fa600 100644
--- a/net/ipv4/tcp_timer.c
+++ b/net/ipv4/tcp_timer.c
@@ -29,6 +29,8 @@ int sysctl_tcp_keepalive_probes __read_mostly = TCP_KEEPALIVE_PROBES;
int sysctl_tcp_keepalive_intvl __read_mostly = TCP_KEEPALIVE_INTVL;
int sysctl_tcp_retries1 __read_mostly = TCP_RETR1;
int sysctl_tcp_retries2 __read_mostly = TCP_RETR2;
+int sysctl_tcp_initial_rto __read_mostly = TCP_TIMEOUT_INIT;
+int sysctl_tcp_initial_fallback_rto __read_mostly = TCP_TIMEOUT_INIT;
int sysctl_tcp_orphan_retries __read_mostly;
int sysctl_tcp_thin_linear_timeouts __read_mostly;

@@ -135,8 +137,8 @@ static void tcp_mtu_probing(struct inet_connection_sock *icsk, struct sock *sk)

/* This function calculates a "timeout" which is equivalent to the timeout of a
* TCP connection after "boundary" unsuccessful, exponentially backed-off
- * retransmissions with an initial RTO of TCP_RTO_MIN or TCP_TIMEOUT_INIT if
- * syn_set flag is set.
+ * retransmissions with an initial RTO of TCP_RTO_MIN or
+ * sysctl_tcp_initial_rto if syn_set flag is set.
*/
static bool retransmits_timed_out(struct sock *sk,
unsigned int boundary,
@@ -144,7 +146,7 @@ static bool retransmits_timed_out(struct sock *sk,
bool syn_set)
{
unsigned int linear_backoff_thresh, start_ts;
- unsigned int rto_base = syn_set ? TCP_TIMEOUT_INIT : TCP_RTO_MIN;
+ unsigned int rto_base = syn_set ? sysctl_tcp_initial_rto : TCP_RTO_MIN;

if (!inet_csk(sk)->icsk_retransmits)
return false;
@@ -495,7 +497,7 @@ out_unlock:
static void tcp_synack_timer(struct sock *sk)
{
inet_csk_reqsk_queue_prune(sk, TCP_SYNQ_INTERVAL,
- TCP_TIMEOUT_INIT, TCP_RTO_MAX);
+ sysctl_tcp_initial_rto, TCP_RTO_MAX);
}

void tcp_syn_ack_timeout(struct sock *sk, struct request_sock *req)
diff --git a/net/ipv6/syncookies.c b/net/ipv6/syncookies.c
index 352c260..f8a07a8 100644
--- a/net/ipv6/syncookies.c
+++ b/net/ipv6/syncookies.c
@@ -45,7 +45,7 @@ static __u16 const msstab[] = {
* sysctl_tcp_retries1. It's a rather complicated formula (exponential
* backoff) to compute at runtime so it's currently hardcoded here.
*/
-#define COUNTER_TRIES 4
+#define COUNTER_TRIES (sysctl_tcp_initial_rto/HZ + 1)

static inline struct sock *get_cookie_sock(struct sock *sk, struct sk_buff *skb,
struct request_sock *req,
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 4f49e5d..7e791e6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1349,7 +1349,7 @@ have_isn:
want_cookie)
goto drop_and_free;

- inet6_csk_reqsk_queue_hash_add(sk, req, TCP_TIMEOUT_INIT);
+ inet6_csk_reqsk_queue_hash_add(sk, req, sysctl_tcp_initial_rto);
return 0;

drop_and_release:
@@ -1957,8 +1957,8 @@ static int tcp_v6_init_sock(struct sock *sk)
tcp_init_xmit_timers(sk);
tcp_prequeue_init(tp);

- icsk->icsk_rto = TCP_TIMEOUT_INIT;
- tp->mdev = TCP_TIMEOUT_INIT;
+ icsk->icsk_rto = sysctl_tcp_initial_rto;
+ tp->mdev = sysctl_tcp_initial_rto;

/* So many TCP implementations out there (incorrectly) count the
* initial SYN frame in their delayed-ACK and congestion control
--
1.7.0.4

2011-05-19 02:40:21

by David Miller

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

From: Benoit Sigoure <[email protected]>
Date: Wed, 18 May 2011 19:22:24 -0700

> Prior to this patch, Linux would always use 3 seconds (compile-time
> constant) as the initial RTO. Draft RFC 2988bis-02 proposes to tune
> this down to 1 second and, in case of a timeout during the TCP 3WHS,
> revert the RTO back up to 3 seconds when data transmission begins.

We just had a discussion where it was determined that changes to
these settings are "network specific" and therefore that if it
is appropriate at all (I'm still not convinced) it is only suitable
as a routing metric.

2011-05-19 03:56:55

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Wed, May 18, 2011 at 7:36 PM, David Miller <[email protected]> wrote:
> From: Benoit Sigoure <[email protected]>
> Date: Wed, 18 May 2011 19:22:24 -0700
>
>> Prior to this patch, Linux would always use 3 seconds (compile-time
>> constant) as the initial RTO. Draft RFC 2988bis-02 proposes to tune
>> this down to 1 second and, in case of a timeout during the TCP 3WHS,
>> revert the RTO back up to 3 seconds when data transmission begins.
>
> We just had a discussion where it was determined that changes to
> these settings are "network specific" and therefore that if it
> is appropriate at all (I'm still not convinced) it is only suitable
> as a routing metric.

Fair enough. I'll take another stab at it and see if I can change
this to be on a per network basis. Do I need any patch that's not yet
in Linus' tree? I'm referring to this:

On Tue, May 17, 2011 at 5:20 AM, Eric Dumazet <[email protected]> wrote:
> Adding many knobs to each clone had a huge cost on previous kernels.
> (Consider that some machines have millions of entries in the IP route
> cache.) This used quite a lot of memory.
>
> With David's latest work, we'll consume less RAM, because we can now share
> settings instead of copying them on each dst entry.

If this has already been merged then it sounds like I should have
everything I need..?

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 04:18:26

by David Miller

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

From: tsuna <[email protected]>
Date: Wed, 18 May 2011 20:56:33 -0700

> On Wed, May 18, 2011 at 7:36 PM, David Miller <[email protected]> wrote:
>> From: Benoit Sigoure <[email protected]>
>> Date: Wed, 18 May 2011 19:22:24 -0700
>>
>>> Prior to this patch, Linux would always use 3 seconds (compile-time
>>> constant) as the initial RTO. Draft RFC 2988bis-02 proposes to tune
>>> this down to 1 second and, in case of a timeout during the TCP 3WHS,
>>> revert the RTO back up to 3 seconds when data transmission begins.
>>
>> We just had a discussion where it was determined that changes to
>> these settings are "network specific" and therefore that if it
>> is appropriate at all (I'm still not convinced) it is only suitable
>> as a routing metric.
>
> Fair enough. I'll take another stab at it and see if I can change
> this to be on a per network basis. Do I need any patch that's not yet
> in Linus' tree? I'm referring to this:

Keep in mind another thing I do not like about this knob.

The IETF draft has a requirement that we fall back to 3 seconds if the
initial RTO is 1 second.

Nothing in your facilities ensures this, or provides a way for the
kernel to make sure this is the case.

And for other values of initial RTO, what fallback is appropriate?

As a result of all of this, I do not really think this is something
the user should control at all.

I really would rather see the initial RTO be static and be set to 1
with fallback RTO of 3.

2011-05-19 04:33:44

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Wed, May 18, 2011 at 9:14 PM, David Miller <[email protected]> wrote:
> The IETF draft has a requirement that we fall back to 3 seconds if the
> initial RTO is 1 second.
>
> Nothing in your facilities ensures this, or provides a way for the
> kernel to make sure this is the case.

I'm not sure I understand what you're saying. If tcp_initial_rto = 1000
and tcp_initial_fallback_rto = 3000, then you get exactly the behavior
the draft describes. The knobs simply allow you to either revert to
today's behavior or use other settings that would make more sense in
your environment (e.g. very high RTT). Are you concerned about cases
where, say, tcp_initial_fallback_rto < tcp_initial_rto?

> And for other values of initial RTO, what fallback is appropriate?

Presumably if the user decides to tweak these knobs, they'll know
what's appropriate for their environment. Or are you suggesting that
one value be derived from the other? (e.g. tcp_initial_fallback_rto =
3 * tcp_initial_rto)
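
The two options raised here (checking that the pair of values is consistent, or deriving one from the other) can be sketched as below. This is a minimal illustration under the assumption that both values are in milliseconds; rto_pair_is_consistent() and derived_fallback_ms() are hypothetical helpers, not part of the patch.

```c
/* Returns 1 when the pair is consistent: the initial RTO is positive
 * and the fallback is not smaller than the initial RTO.
 */
static int rto_pair_is_consistent(int initial_ms, int fallback_ms)
{
	return initial_ms > 0 && fallback_ms >= initial_ms;
}

/* One possible derivation, as floated in the message above:
 * fallback = 3 * initial.
 */
static int derived_fallback_ms(int initial_ms)
{
	return 3 * initial_ms;
}
```

With the values from the draft, the pair (1000, 3000) passes the check and also matches the derived value, whereas a pair such as (3000, 1000) would be rejected.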

> As a result of all of this, I do not really think this is something
> the user should control at all.
>
> I really would rather see the initial RTO be static and be set to 1
> with fallback RTO of 3.

I can also provide a simple patch for this if you want to start from
there. And then maybe we can discuss having a runtime knob some more
:-)

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 05:50:55

by David Miller

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

From: tsuna <[email protected]>
Date: Wed, 18 May 2011 21:33:21 -0700

> On Wed, May 18, 2011 at 9:14 PM, David Miller <[email protected]> wrote:
>> I really would rather see the initial RTO be static and be set to 1
>> with fallback RTO of 3.
>
> I can also provide a simple patch for this if you want to start from
> there. And then maybe we can discuss having a runtime knob some more
> :-)

Yeah why don't we do that :-)

2011-05-19 06:10:46

by Alexander Zimmermann

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

Hi,

On 19.05.2011 at 06:33, tsuna wrote:

> Presumably if the user decides to tweak these knobs, they'll know
> what's appropriate for their environment.

Are you sure? I'm not. I fully agree with David that minRTO is
something that a user should not control at all.

> Or are you suggesting that
> one value be derived from the other? (e.g. tcp_initial_fallback_rto =
> 3 * tcp_initial_rto)
>
>> As a result of all of this, I do not really think this is something
>> the user should control at all.
>>
>> I really would rather see the initial RTO be static and be set to 1
>> with fallback RTO of 3.
>
> I can also provide a simple patch for this if you want to start from
> there. And then maybe we can discuss having a runtime knob some more
> :-)
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ http://www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: [email protected]
// web: http://www.umic-mesh.net
//


Attachments:
PGP.sig (243.00 B)
Signed part of the message

2011-05-19 06:26:20

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Wed, May 18, 2011 at 11:10 PM, Alexander Zimmermann
<[email protected]> wrote:
> On 19.05.2011 at 06:33, tsuna wrote:
>> Presumably if the user decides to tweak these knobs, they'll know
>> what's appropriate for their environment.
>
> Are you sure? I'm not. I fully agree with David that minRTO is

s/minRTO/initRTO/, right?

> something that a user should not control at all

Personally, I don't like to hold users' hands and spoon-feed them too
much; I want to trust them to be responsible and know what they're
doing. Yes, there will always be people who do stupid things with
whatever knobs you expose. The web is full of people who advise
cranking all the TCP rmem/wmem parameters up to crazy levels, based on
the voodoo belief that this will improve their TCP performance. But as
long as you have knobs in your system, these people will misuse them
and shoot themselves in the foot anyway; there's not much we can do
about that.

There's also a good chunk of people who know what they're doing, and
for them compile-time constants are annoying because it's inconvenient
to experiment and iterate quickly when you need to recompile your
kernel to change a value. If turning the compile time constant into a
knob leaves the code reasonably straightforward and doesn't incur too
much overhead, then why not do it?

Regarding this knob in particular, I can imagine that people who are
in environments where the RTT easily reaches around 1s will be upset by the
change in the default value, and doubly upset that they have to
recompile their kernel to change the value back to 3s. I'm in favor
of the reduction of initRTO, for the same reason Google is, but I can
also understand that the direction we're taking might not be
appropriate for everyone.

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 06:36:15

by Alexander Zimmermann

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.


On 19.05.2011 at 08:25, tsuna wrote:

> On Wed, May 18, 2011 at 11:10 PM, Alexander Zimmermann
> <[email protected]> wrote:
>> On 19.05.2011 at 06:33, tsuna wrote:
>>> Presumably if the user decides to tweak these knobs, they'll know
>>> what's appropriate for their environment.
>>
>> Are you sure? I'm not. I fully agree with David that minRTO is
>
> s/minRTO/initRTO/, right?

Yes of course :-)

>
>> something that a user should not control at all
>
> I personally don't like to hold the hand and spoon feed users too
> much, I want to trust them to be responsible and know what they're
> doing. Yes, there will always be people who will act stupid and do
> stupid things with whatever knobs you expose. The web is full of
> people who advise to tune up all the TCP rmem/wmem parameters to crazy
> high level based on the voodoo belief that they're going to improve
> their TCP performance, but then as long as you have knobs in your
> system, these people will misuse them anyway and shoot themselves in
> the foot, what can we do about that.

But if you tune rmem/wmem to crazy levels, it's only your TCP performance
that suffers (and maybe the receiver's).

If you set the initRTO=0.1s, it's good for me but bad for the rest of the
world. That's the difference.

Or do you want to implement a lower bound of 1 s so that you can ensure
that nobody sets the initRTO lower than 1 s?


>
> There's also a good chunk of people who know what they're doing, and
> for them compile-time constants are annoying because it's inconvenient
> to experiment and iterate quickly when you need to recompile your
> kernel to change a value. If turning the compile time constant into a
> knob leaves the code reasonably straightforward and doesn't incur too
> much overhead, then why not do it?
>
> Regarding this knob in particular, I can imagine that people who are
> in environment where RTT easily gets around 1s will be upset by the
> change in the default value, and doubly upset that they have to
> recompile their kernel to change the value back to 3s. I'm in favor
> of the reduction of initRTO, for the same reason Google is, but I can
> also understand that the direction we're taking might not be
> appropriate for everyone.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ http://www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: [email protected]
// web: http://www.umic-mesh.net
//


Attachments:
PGP.sig (243.00 B)
Signed part of the message

2011-05-19 06:37:25

by tsuna

Subject: [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.

From: Benoit Sigoure <[email protected]>

Draft RFC 2988bis-02 recommends that the initial RTO be lowered
from 3 seconds down to 1 second, and that in case of a timeout
during the TCP 3WHS, the RTO should fallback to 3 seconds when
data transmission begins.
---

On Wed, May 18, 2011 at 10:46 PM, David Miller <[email protected]> wrote:
> From: tsuna <[email protected]>
> Date: Wed, 18 May 2011 21:33:21 -0700
>
>> On Wed, May 18, 2011 at 9:14 PM, David Miller <[email protected]> wrote:
>>> I really would rather see the initial RTO be static and be set to 1
>>> with fallback RTO of 3.
>>
>> I can also provide a simple patch for this if you want to start from
>> there. And then maybe we can discuss having a runtime knob some more
>> :-)
>
> Yeah why don't we do that :-)

Alright, here we go.


 include/net/tcp.h    |    5 ++++-
 net/ipv4/tcp_input.c |   13 +++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..274d761 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -122,7 +122,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #endif
 #define TCP_RTO_MAX    ((unsigned)(120*HZ))
 #define TCP_RTO_MIN    ((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))    /* RFC 1122 initial RTO value   */
+/* The next 2 values come from Draft RFC 2988bis-02. */
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))            /* initial RTO value    */
+#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))   /* initial RTO to fallback to when
+                                                        * a timeout happens during the 3WHS.   */

 #define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
                                                          * for local resources.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..a36bc35 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
 {
        struct tcp_sock *tp = tcp_sk(sk);
        struct dst_entry *dst = __sk_dst_get(sk);
+       /* If we had to retransmit anything during the 3WHS, use
+        * the initial fallback RTO as per draft RFC 2988bis-02.
+        */
+       int init_rto = inet_csk(sk)->icsk_retransmits ?
+               TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;

        if (dst == NULL)
                goto reset;
@@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;

-       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
                goto reset;

        /* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
                tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        }
        tcp_set_rto(sk);
-       if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+       if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
 reset:
                /* Play conservative. If timestamps are not
                 * supported, TCP will fail to recalculate correct
@@ -924,8 +929,8 @@ reset:
                 */
                if (!tp->rx_opt.saw_tstamp && tp->srtt) {
                        tp->srtt = 0;
-                       tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-                       inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+                       tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
+                       inet_csk(sk)->icsk_rto = init_rto;
                }
        }
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
--
1.7.0.4

2011-05-19 06:42:53

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Wed, May 18, 2011 at 11:36 PM, Alexander Zimmermann
<[email protected]> wrote:
> If you set the initRTO=0.1s, it's good for me but bad for the rest of the
> world. That's the difference.
>
> Or do you want to implement a lower bound of 1 s so that you can ensure
> that nobody sets the initRTO lower than 1 s?

Oh, I see. Yes, there is a lower bound (and an upper bound) on what
values the kernel will accept as initRTO. In the patch "Implement a
two-level initial RTO as per draft RFC 2988bis-02" above, I re-used
TCP_RTO_MIN and TCP_RTO_MAX in net/ipv4/sysctl_net_ipv4.c in order to
prevent users from setting an initRTO that's outside this range. They
are defined as follows in tcp.h:

#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))

So we're talking about a [200ms ; 120s] range no matter what.
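
To make the arithmetic behind that range concrete, here is a minimal
user-space sketch in C. It assumes HZ=1000 purely for illustration, and
the helpers jiffies_to_ms() and init_rto_in_range() are hypothetical
names, not kernel functions; only the two macros are taken from tcp.h
as quoted above.

```c
/* Assumed tick rate, for illustration only; real kernels fix HZ at build time. */
#define HZ 1000

#define TCP_RTO_MAX ((unsigned)(120*HZ))
#define TCP_RTO_MIN ((unsigned)(HZ/5))

/* Hypothetical helper: convert a jiffies count to milliseconds. */
static unsigned jiffies_to_ms(unsigned j)
{
	return (unsigned)((unsigned long long)j * 1000ULL / HZ);
}

/* Hypothetical range check: returns 1 if a proposed initial RTO
 * (in jiffies) lies within [TCP_RTO_MIN, TCP_RTO_MAX], else 0. */
static int init_rto_in_range(unsigned rto_jiffies)
{
	return rto_jiffies >= TCP_RTO_MIN && rto_jiffies <= TCP_RTO_MAX;
}
```

Under these assumptions, HZ/5 is 200 jiffies (200 ms) and 120*HZ is
120000 jiffies (120 s), so a 1 s value is accepted and a 100 ms value
is rejected.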

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 06:48:31

by tsuna

Subject: [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.

Draft RFC 2988bis-02 recommends that the initial RTO be lowered
from 3 seconds down to 1 second, and that in case of a timeout
during the TCP 3WHS, the RTO should fallback to 3 seconds when
data transmission begins.

Signed-off-by: Benoit Sigoure <[email protected]>
---

Apologies for the spam, I sent this patch from the wrong address and without
sob'ing it. I build the Linux kernel in a 15G tmpfs (it's faster this way :D)
and I lost my .git/config after a reboot.

 include/net/tcp.h    |    5 ++++-
 net/ipv4/tcp_input.c |   13 +++++++++----
 2 files changed, 13 insertions(+), 5 deletions(-)

diff --git a/include/net/tcp.h b/include/net/tcp.h
index cda30ea..274d761 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -122,7 +122,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
 #endif
 #define TCP_RTO_MAX    ((unsigned)(120*HZ))
 #define TCP_RTO_MIN    ((unsigned)(HZ/5))
-#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))    /* RFC 1122 initial RTO value   */
+/* The next 2 values come from Draft RFC 2988bis-02. */
+#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))            /* initial RTO value    */
+#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))   /* initial RTO to fallback to when
+                                                        * a timeout happens during the 3WHS.   */

 #define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
                                                          * for local resources.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index bef9f04..a36bc35 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
 {
        struct tcp_sock *tp = tcp_sk(sk);
        struct dst_entry *dst = __sk_dst_get(sk);
+       /* If we had to retransmit anything during the 3WHS, use
+        * the initial fallback RTO as per draft RFC 2988bis-02.
+        */
+       int init_rto = inet_csk(sk)->icsk_retransmits ?
+               TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;

        if (dst == NULL)
                goto reset;
@@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
        if (dst_metric(dst, RTAX_RTT) == 0)
                goto reset;

-       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
+       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
                goto reset;

        /* Initial rtt is determined from SYN,SYN-ACK.
@@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
                tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
        }
        tcp_set_rto(sk);
-       if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
+       if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
 reset:
                /* Play conservative. If timestamps are not
                 * supported, TCP will fail to recalculate correct
@@ -924,8 +929,8 @@ reset:
                 */
                if (!tp->rx_opt.saw_tstamp && tp->srtt) {
                        tp->srtt = 0;
-                       tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
-                       inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
+                       tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
+                       inet_csk(sk)->icsk_rto = init_rto;
                }
        }
        tp->snd_cwnd = tcp_init_cwnd(tp, dst);
--
1.7.0.4

2011-05-19 06:52:15

by Alexander Zimmermann

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.


On 19.05.2011 at 08:42, tsuna wrote:

> On Wed, May 18, 2011 at 11:36 PM, Alexander Zimmermann
> <[email protected]> wrote:
>> If you set the initRTO=0.1s, it's good for me but bad for the rest of the
>> world. That's the difference.
>>
>> Or do you want to implement a lower bound of 1 s so that you can ensure
>> that nobody sets the initRTO lower than 1 s?
>
> Oh, I see. Yes, there is a lower bound (and an upper bound) on what
> values the kernel will accept as initRTO. In the patch "Implement a
> two-level initial RTO as per draft RFC 2988bis-02" above, I re-used
> TCP_RTO_MIN and TCP_RTO_MAX in net/ipv4/sysctl_net_ipv4.c in order to
> prevent users from setting an initRTO that's outside this range. They
> are defined as follows in tcp.h:
>
> #define TCP_RTO_MAX ((unsigned)(120*HZ))
> #define TCP_RTO_MIN ((unsigned)(HZ/5))
>
> So we're talking about a [200ms ; 120s] range no matter what.

Why is 200ms a valid lower bound for initRTO? I'm aware of
measurements that 1s is safe for the Internet, but I don't know of any
studies showing that 200ms is safe...

>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ http://www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: [email protected]
// web: http://www.umic-mesh.net
//


Attachments:
PGP.sig (243.00 B)
Signed part of the message

2011-05-19 07:07:31

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Wed, May 18, 2011 at 11:52 PM, Alexander Zimmermann
<[email protected]> wrote:
>> So we're talking about a [200ms ; 120s] range no matter what.
>
> Why is 200ms a valid lower bound for initRTO? I'm aware of
> measurements that 1s is safe for the Internet, but I don't know of any
> studies showing that 200ms is safe...

The constants that are quoted aren't specific to the initRTO. They're
used to bound the RTO as it gets adjusted during the TCP session. See
`tcp_set_rto' in tcp_input.c for reference.
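
As a rough illustration of that bounding (this is not the kernel's
actual fixed-point code in tcp_set_rto), the sketch below applies the
RFC 2988 formula RTO = SRTT + 4*RTTVAR in milliseconds, ignoring the
clock-granularity term, and clamps the result to the range quoted
above. The function name bounded_rto_ms() and the millisecond constants
are assumptions made for this example.

```c
/* Bounds in milliseconds, mirroring the [200ms ; 120s] range from the thread. */
#define RTO_MIN_MS 200u
#define RTO_MAX_MS 120000u

/* Hypothetical helper: RFC 2988 style estimate, RTO = SRTT + 4 * RTTVAR,
 * clamped to [RTO_MIN_MS, RTO_MAX_MS]. Inputs are assumed small enough
 * that the unsigned sum does not wrap. */
static unsigned bounded_rto_ms(unsigned srtt_ms, unsigned rttvar_ms)
{
	unsigned rto = srtt_ms + 4u * rttvar_ms;

	if (rto < RTO_MIN_MS)
		rto = RTO_MIN_MS;
	if (rto > RTO_MAX_MS)
		rto = RTO_MAX_MS;
	return rto;
}
```

The point of the sketch is that these bounds act on every recomputed
RTO during the connection, which is a separate question from what
value the timer starts at before any RTT sample exists.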

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 08:02:11

by Hagen Paul Pfeifer

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.


On Thu, 19 May 2011 08:52:10 +0200, Alexander Zimmermann wrote:

>> #define TCP_RTO_MAX ((unsigned)(120*HZ))
>> #define TCP_RTO_MIN ((unsigned)(HZ/5))
>>
>> So we're talking about a [200ms ; 120s] range no matter what.
>
> Why is 200ms a valid lower bound for initRTO? I'm aware of
> measurements that 1s is safe for the Internet, but I don't know of any
> studies showing that 200ms is safe...

TCP_RTO_MIN and TCP_RTO_MAX are the lower/upper bounds for the RTO in
general, not for the initial RTO. RFC 2988 specifies a lower bound of 1
second, but all operating systems choose a lower one, because at the time
RFC 2988 was written clock granularity was not that fine. The
minimum RTO for FreeBSD is even 30ms! Furthermore, analysis has
demonstrated that a minimum RTO of 1 second badly breaks throughput in
environments faster than 33kB with a minor packet loss rate (e.g. 1%).

So yes, it CAN be wise to choose other lower/upper bounds. But keep in
mind that we should NOT artificially limit ourselves. I can imagine data center
scenarios where an initial RTO of <1 matches perfectly.

Hagen

2011-05-19 16:40:33

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Thu, May 19, 2011 at 1:02 AM, Hagen Paul Pfeifer <[email protected]> wrote:
> So yes, it CAN be wise to choose other lower/upper bounds. But keep in
> mind that we should NOT artificially limit ourselves. I can imagine data center
> scenarios where an initial RTO of <1 matches perfectly.

Yes that's exactly the point I was trying to make when talking to
Alexander offline. On today's Internet, RTTs are easily in the
hundreds of ms, and initRTO is 3s, so there's 2 orders of magnitude of
difference. In my environment, if my RTT is ~2µs, an initRTO of 200ms
means that there's a gap of 6 orders of magnitude (!). And yes,
although I don't work for High Frequency Trading companies in Wall
Street, I'm already buying switches full of line-rate 10Gb ports with
a port-to-port latency of 500ns for L2/L3 forwarding/switching. I
expect this kind of network gear will quickly become prevalent in
datacenter/backend environments.

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 16:56:05

by Alexander Zimmermann

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.


On 19.05.2011 at 18:40, tsuna wrote:

> On Thu, May 19, 2011 at 1:02 AM, Hagen Paul Pfeifer <[email protected]> wrote:
>> So yes, it CAN be wise to choose other lower/upper bounds. But keep in
>> mind that we should NOT artificially limit ourselves. I can imagine data center
>> scenarios where an initial RTO of <1 matches perfectly.
>
> Yes that's exactly the point I was trying to make when talking to
> Alexander offline. On today's Internet, RTTs are easily in the
> hundreds of ms, and initRTO is 3s, so there's 2 orders of magnitude of
> difference. In my environment,

Exactly. This is the point. It's *your* environment. However, TCP is
general purpose. And for the wider Internet 1s is known to be safe. See the
measurements by Mark Allman in the draft.

> if my RTT is ~2µs, an initRTO of 200ms
> means that there's a gap of 6 orders of magnitude (!).

Currently, initRTO is 3s. So the gap is even larger.

> And yes,
> although I don't work for High Frequency Trading companies in Wall
> Street, I'm already buying switches full of line-rate 10Gb ports with
> a port-to-port latency of 500ns for L2/L3 forwarding/switching. I
> expect this kind of network gear will quickly become prevalent in
> datacenter/backend environments.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ http://www.StumbleUpon.com

//
// Dipl.-Inform. Alexander Zimmermann
// Department of Computer Science, Informatik 4
// RWTH Aachen University
// Ahornstr. 55, 52056 Aachen, Germany
// phone: (49-241) 80-21422, fax: (49-241) 80-22222
// email: [email protected]
// web: http://www.umic-mesh.net
//


Attachments:
PGP.sig (243.00 B)
Signed part of the message

2011-05-19 17:12:12

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Thu, May 19, 2011 at 9:55 AM, Alexander Zimmermann
<[email protected]> wrote:
> Exactly. This is the point. It's *your* environment. However, TCP is
> general purpose. And for the wider Internet 1s is known to be safe. See the
> measurements by Mark Allman in the draft.

That's right, there's no one-size-fits-all solution. That's why I'm
in favor of keeping a reasonably conservative default (say 1s to 3s,
so we don't break the Internets) and giving people a knob to adjust it
to whatever makes sense for them.

Looking through the kernel, I see that SCTP already has knobs for
this: sctp_rto_initial, sctp_rto_min, sctp_rto_max. You can even
control the constants used to update rttvar and srtt: sctp_rto_alpha,
sctp_rto_beta

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-19 17:42:44

by Yuchung Cheng

Subject: Re: [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.

Hi Benoit,

AFAICT, the passive open side would not fall back to the 3sec RTO
with this change, because SYN-ACK timeouts are recorded not in
icsk_retransmits but in reqsk->retrans?

Yuchung

On Wed, May 18, 2011 at 11:36 PM, Benoit Sigoure <[email protected]> wrote:
>
> From: Benoit Sigoure <[email protected]>
>
> Draft RFC 2988bis-02 recommends that the initial RTO be lowered
> from 3 seconds down to 1 second, and that in case of a timeout
> during the TCP 3WHS, the RTO should fallback to 3 seconds when
> data transmission begins.
> ---
>
> On Wed, May 18, 2011 at 10:46 PM, David Miller <[email protected]> wrote:
> > From: tsuna <[email protected]>
> > Date: Wed, 18 May 2011 21:33:21 -0700
> >
> >> On Wed, May 18, 2011 at 9:14 PM, David Miller <[email protected]> wrote:
> >>> I really would rather see the initial RTO be static and be set to 1
> >>> with fallback RTO of 3.
> >>
> >> I can also provide a simple patch for this if you want to start from
> >> there.  And then maybe we can discuss having a runtime knob some more
> >> :-)
> >
> > Yeah why don't we do that :-)
>
> Alright, here we go.
>
>
>  include/net/tcp.h    |    5 ++++-
>  net/ipv4/tcp_input.c |   13 +++++++++----
>  2 files changed, 13 insertions(+), 5 deletions(-)
>
> diff --git a/include/net/tcp.h b/include/net/tcp.h
> index cda30ea..274d761 100644
> --- a/include/net/tcp.h
> +++ b/include/net/tcp.h
> @@ -122,7 +122,10 @@ extern void tcp_time_wait(struct sock *sk, int state, int timeo);
>  #endif
>  #define TCP_RTO_MAX    ((unsigned)(120*HZ))
>  #define TCP_RTO_MIN    ((unsigned)(HZ/5))
> -#define TCP_TIMEOUT_INIT ((unsigned)(3*HZ))    /* RFC 1122 initial RTO value   */
> +/* The next 2 values come from Draft RFC 2988bis-02. */
> +#define TCP_TIMEOUT_INIT ((unsigned)(1*HZ))            /* initial RTO value    */
> +#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))   /* initial RTO to fallback to when
> +                                                        * a timeout happens during the 3WHS.   */
>
>  #define TCP_RESOURCE_PROBE_INTERVAL ((unsigned)(HZ/2U)) /* Maximal interval between probes
>                                                           * for local resources.
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index bef9f04..a36bc35 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
>  {
>         struct tcp_sock *tp = tcp_sk(sk);
>         struct dst_entry *dst = __sk_dst_get(sk);
> +       /* If we had to retransmit anything during the 3WHS, use
> +        * the initial fallback RTO as per draft RFC 2988bis-02.
> +        */
> +       int init_rto = inet_csk(sk)->icsk_retransmits ?
> +               TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;
>
>         if (dst == NULL)
>                 goto reset;
> @@ -890,7 +895,7 @@ static void tcp_init_metrics(struct sock *sk)
>         if (dst_metric(dst, RTAX_RTT) == 0)
>                 goto reset;
>
> -       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (TCP_TIMEOUT_INIT << 3))
> +       if (!tp->srtt && dst_metric_rtt(dst, RTAX_RTT) < (init_rto << 3))
>                 goto reset;
>
>         /* Initial rtt is determined from SYN,SYN-ACK.
> @@ -916,7 +921,7 @@ static void tcp_init_metrics(struct sock *sk)
>                 tp->mdev_max = tp->rttvar = max(tp->mdev, tcp_rto_min(sk));
>         }
>         tcp_set_rto(sk);
> -       if (inet_csk(sk)->icsk_rto < TCP_TIMEOUT_INIT && !tp->rx_opt.saw_tstamp) {
> +       if (inet_csk(sk)->icsk_rto < init_rto && !tp->rx_opt.saw_tstamp) {
>  reset:
>                 /* Play conservative. If timestamps are not
>                  * supported, TCP will fail to recalculate correct
> @@ -924,8 +929,8 @@ reset:
>                  */
>                 if (!tp->rx_opt.saw_tstamp && tp->srtt) {
>                         tp->srtt = 0;
> -                       tp->mdev = tp->mdev_max = tp->rttvar = TCP_TIMEOUT_INIT;
> -                       inet_csk(sk)->icsk_rto = TCP_TIMEOUT_INIT;
> +                       tp->mdev = tp->mdev_max = tp->rttvar = init_rto;
> +                       inet_csk(sk)->icsk_rto = init_rto;
>                 }
>         }
>         tp->snd_cwnd = tcp_init_cwnd(tp, dst);
> --
> 1.7.0.4
>
> --
> To unsubscribe from this list: send the line "unsubscribe netdev" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html

2011-05-19 19:31:04

by David Miller

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

From: tsuna <[email protected]>
Date: Thu, 19 May 2011 10:11:50 -0700

> Looking through the kernel, I see that SCTP already has knobs for
> this: sctp_rto_initial, sctp_rto_min, sctp_rto_max. You can even
> control the constants used to update rttvar and srtt: sctp_rto_alpha,
> sctp_rto_beta

SCTP is 1) not even a sliver of deployment compared to TCP and 2)
doesn't get nearly the same scrutiny on patch review that TCP
changes do.

I basically let the SCTP folks play in their own sandbox, because
frankly SCTP doesn't matter.

The only time I care about an SCTP change is when it has an impact on
the rest of the networking code.

So using SCTP as an example of "see we do this already over here" is a
non-starter. Don't do it.

2011-05-19 20:20:09

by David Miller

Subject: Re: [PATCH] tcp: Lower the initial RTO to 1s as per draft RFC 2988bis-02.

From: Benoit Sigoure <[email protected]>
Date: Wed, 18 May 2011 23:47:49 -0700

> @@ -868,6 +868,11 @@ static void tcp_init_metrics(struct sock *sk)
> {
> struct tcp_sock *tp = tcp_sk(sk);
> struct dst_entry *dst = __sk_dst_get(sk);
> + /* If we had to retransmit anything during the 3WHS, use
> + * the initial fallback RTO as per draft RFC 2988bis-02.
> + */
> + int init_rto = inet_csk(sk)->icsk_retransmits ?
> + TCP_TIMEOUT_INIT_FALLBACK : TCP_TIMEOUT_INIT;

Please do not put comments in the middle of a set of function
local variable declarations.

Also, as mentioned already, icsk_retransmits is not where SYN
retransmissions are counted.

It is stored in the TCP minisocket ->retrans field.
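
A simplified model of the intended selection, covering both sides of
the handshake, might look like the following. The struct and its field
names are hypothetical stand-ins for the real socket and request-sock
state, HZ=1000 is assumed, and only the two TCP_TIMEOUT_* macros are
taken from the patch under review.

```c
/* Assumed tick rate, for illustration only. */
#define HZ 1000

#define TCP_TIMEOUT_INIT          ((unsigned)(1*HZ))
#define TCP_TIMEOUT_INIT_FALLBACK ((unsigned)(3*HZ))

/* Hypothetical, simplified handshake state; NOT the kernel's structures. */
struct handshake_state {
	unsigned syn_retransmits;    /* active open: SYNs that had to be resent */
	unsigned synack_retransmits; /* passive open: SYN-ACKs that had to be resent */
};

/* Pick the RTO to start data transmission with: fall back to 3 s if
 * anything was retransmitted during the 3WHS, on either side. */
static unsigned pick_init_rto(const struct handshake_state *hs)
{
	if (hs->syn_retransmits || hs->synack_retransmits)
		return TCP_TIMEOUT_INIT_FALLBACK;
	return TCP_TIMEOUT_INIT;
}
```

The sketch only shows why both counters have to be consulted; where
each one actually lives in the kernel is exactly the point raised in
the review above.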

2011-05-19 20:30:38

by tsuna

Subject: Re: [PATCH] tcp: Implement a two-level initial RTO as per draft RFC 2988bis-02.

On Thu, May 19, 2011 at 12:27 PM, David Miller <[email protected]> wrote:
> So using SCTP as an example of "see we do this already over here" is a
> non-starter.  Don't do it.

Fair enough. I hope that the "there's no one-size-fits-all solution"
argument has more weight than "hey SCTP does it". :)

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-20 02:02:00

by H.K. Jerry Chu

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Wed, May 18, 2011 at 12:40 PM, tsuna <[email protected]> wrote:
> On Wed, May 18, 2011 at 12:26 PM, David Miller <[email protected]> wrote:
>> If you read the ietf draft that reduces the initial RTO down to 1
>> second, it states that if we take a timeout during the initial
>> connection handshake then we have to revert the RTO back up to 3
>> seconds.
>>
>> This fallback logic conflicts with being able to only change the
>> initial RTO via sysctl, I think.  Because there are actually two
>> values at stake and they depend upon each other, the initial RTO and
>> the value we fallback to on initial handshake retransmissions.
>>
>> So I'd rather get a patch that implements the 1 second initial
>> RTO with the 3 second fallback on SYN retransmit, than this patch.
>>
>> We already have too many knobs.
>
> I was hoping this knob would be accepted because this is such an
> important issue that it even warrants an IETF draft to attempt to
> change the standard.  I'm not sure how long it will take for this
> draft to be accepted and then implemented, so I thought adding this
> simple knob today would really help in the future.

As one of the co-authors of rfc2988bis I was planning to provide a patch
as soon as the draft gets approved but it looks like you have beaten
me to it :)

Personally I'm in favor of a knob too. We at Google have had such a
knob for years.

Jerry

>
> Plus, should the draft be accepted, this knob will still be just as
> useful (e.g. to revert back to today's behavior), and people might
> want to consider adding another knob for the fallback initRTO (this is
> debatable).  I don't believe this knob conflicts with the proposed
> change to the standard, it actually goes along with it pretty well and
> helps us prepare better for this upcoming change.
>
> I agree that there are too many knobs, and I hate feature creep too,
> but I've found many of these knobs to be really useful, and the degree
> to which Linux's TCP stack can be tuned is part of what makes it so
> versatile.
>
> --
> Benoit "tsuna" Sigoure
> Software Engineer @ http://www.StumbleUpon.com

2011-05-20 10:27:40

by H.K. Jerry Chu

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Wed, May 18, 2011 at 1:20 PM, Hagen Paul Pfeifer <[email protected]> wrote:
> * David Miller | 2011-05-18 15:52:00 [-0400]:
>
>>I've already changed the initial TCP congestion window in Linux to 10
>>without some stupid draft being fully accepted.
>>
>>I'll just as easily accept right now a patch right now which lowers
>>the initial RTO to 1 second and adds the 3 second RTO fallback.
>
> I like the idea to make the initial RTO a knob because we in a isolated MANET
> environment have a RTT larger then 1 second. Especially the link layer setup
> procedure over several hops demand some time-costly setup time. After that the
> RTT is <1 second. The current algorithm works great for us. So this RTO change
> will be counterproductive: it will always trigger a needless timeout.
>
> The main problem for us is that Google at all pushing their view of Internet
> with a lot of pressure. The same is true for the IETF IW adjustments, which is
> unsuitable for networks which operates at a bandwidth characteristic some
> years ago. The _former_ conservative principle "TCP over everything" is
> forgotten.

Not sure how our various parameter tuning proposals deviate from the "TCP over
everything" principle?

Note that the design goal of rfc2988bis is to try to benefit 98% of traffic
while keeping any negative impact on the remaining 2% to a minimum. This is
why we limit the use of a < 3sec initRTO to at most once. This way the
negative impact of the 1sec initRTO on a path with RTT > 1sec is limited
mostly to one additional, small, spuriously retransmitted SYN or SYN-ACK pkt,
and the unnecessary reduction of IW to 1 segment.
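
The cost on a slow path can be estimated with a small model. The sketch
below assumes no packet loss, a fixed RTT, and a retransmission timer
that doubles after each expiry (the usual exponential backoff); the
function name is made up for this example. It counts how many SYNs
would be retransmitted needlessly before the SYN-ACK arrives.

```c
/* Hypothetical model: number of spurious SYN retransmissions on a
 * loss-free path with round-trip time rtt_ms, when the first timeout
 * is init_rto_ms and the timer doubles after every expiry. */
static unsigned spurious_syn_retransmits(unsigned rtt_ms, unsigned init_rto_ms)
{
	unsigned count = 0;
	unsigned long long rto = init_rto_ms;
	unsigned long long deadline = init_rto_ms; /* time of the next timer expiry */

	if (init_rto_ms == 0)
		return 0; /* a zero timer is meaningless; avoid looping forever */

	while (deadline < rtt_ms) {
		count++;          /* timer fired before the SYN-ACK came back */
		rto *= 2;         /* exponential backoff */
		deadline += rto;
	}
	return count;
}
```

Under this model, a path with a 1500 ms RTT sees one spurious SYN with
a 1000 ms initial RTO and none with a 3000 ms initial RTO, which
matches the "one additional retransmitted SYN" cost described above.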

We actually thought about removing the IW reduction part, but unfortunately
that text belongs to a different document, rfc5681, which is at a higher
maturity level ("draft-standard") than rfc2988 and hence can't be changed as
part of rfc2988bis. Anyway, I have since added the recommendation to the IW10
draft. See draft-ietf-tcpm-initcwnd-01.txt.

The bottom line is that the damage of rfc2988bis to any network with
initRTT > 1sec is limited to one spuriously retransmitted SYN/SYN-ACK. In the
current Linux code, the SYN/SYN-ACK retransmit is forgotten on the passive
open side by the time the 3WHS is completed, so nothing needs to be done
there. But on the active open side a SYN retransmit will cause not only IW to
be reduced to 1, but also a reduction of ssthresh, which is not part of
rfc5681, so some more work is needed.
I can provide a patch (or work with tsuna) to ensure a correct fix is made.

Jerry

>
> Hagen

2011-05-20 11:00:05

by Hagen Paul Pfeifer

Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.


On Fri, 20 May 2011 03:27:37 -0700, "H.K. Jerry Chu" wrote:

Hi Jerry

> Not sure how our various parameter tuning proposals deviate from the
> "TCP over everything" principle?

For our environment it hurts because we _always_ have an initial RTO >1. I
understand and accept that 98% will benefit from this modification, no doubt
Jerry! Try to put yourself in our situation: imagine a proposal to change the
initial RTO to 0.5 seconds, maybe because 98% of Internet traffic is
now localized and the average RTO is now 0.2 seconds. Anyway, this will
always penalize your network, and this will be the situation for one of my
customers. I can live with that, I see the benefits for the rest of the
world. But I am happy to see a knob where I can restore the old behavior.
Maybe some other environments will benefit from an even lower or higher
initial RTO.

Hagen

2011-05-20 12:39:05

by Alan

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

> For our environment it hurts because we _always_ have an initial RTO >1. I
> understand and accept that 98% will benefit from this modification, no doubt
> Jerry! Try to put yourself in our situation: imagine a proposal of an init
> RTO modification to 0.5 seconds. Maybe because 98% of Internet traffic is
> now localized and the RTO is now on average 0.2 seconds. Anyway, this will
> always penalize your network and this will be the situation for one of my
> customers. I can live with that, I see the benefits for the rest of the
> world. But I am happy to see a knob where I can restore the old behavior.
> Maybe some other environments will benefit from an even lower or higher
> initial RTO.

AX.25 is definitely happier with a multi-second round trip but it's a
special case. Some X.25 networks are going to have similar behaviour.

It shouldn't be penalising each connection (and it's worse than that of
course because each node on a shared media network gets in the way of the
rest, plus the queueing effect of all the extra blockages) because done
right multiple connections to the same host can use the previous
connections as estimates (and indeed for the initial RTO there's a good
argument for treating estimates as 'host, then x.y.z.* match, then
average of previous except the x.y.z.* match, then unknown')

The latter would fix an awful lot of the weird cases pretty effectively.

Alan
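The fallback order Alan describes (exact host, then an x.y.z.* match, then the average of everything else seen, then "unknown") could be sketched as follows. This is only an illustration, not kernel code: the cache layout, the function name, and the 3 s default are assumptions, and x.y.z.* is interpreted as an IPv4 /24.

```python
import ipaddress
from statistics import mean


def initial_rto_estimate(dst, cache, default_rto=3.0):
    """Pick an initial RTO (seconds) for dst from earlier connections.

    cache maps IPv4 address strings to RTO estimates remembered from
    previous connections (a hypothetical structure for this sketch).
    """
    # 1. Exact host match.
    if dst in cache:
        return cache[dst]
    # 2. Any previously seen host in the same x.y.z.* (/24) network.
    net = ipaddress.ip_network(dst + "/24", strict=False)
    same_net = [rto for host, rto in cache.items()
                if ipaddress.ip_address(host) in net]
    if same_net:
        return mean(same_net)
    # 3. Average of all other previous estimates (no /24 match exists here).
    if cache:
        return mean(cache.values())
    # 4. Nothing known: fall back to the default.
    return default_rto


cache = {"10.1.2.3": 0.25, "10.1.2.9": 0.75, "192.0.2.1": 2.0}
print(initial_rto_estimate("10.1.2.3", cache))   # exact host: 0.25
print(initial_rto_estimate("10.1.2.50", cache))  # same /24: 0.5
print(initial_rto_estimate("8.8.8.8", cache))    # average of all: 1.0
```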

2011-05-21 00:06:15

by H.K. Jerry Chu

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

Hey Hagen,

On Fri, May 20, 2011 at 4:00 AM, Hagen Paul Pfeifer wrote:
>
> On Fri, 20 May 2011 03:27:37 -0700, "H.K. Jerry Chu" wrote:
>
> Hi Jerry
>
>> Not sure how our various parameter tuning proposals deviate from the
>> "TCP over everything" principle?
>
> For our environment it hurts because we _always_ have an initial RTO >1. I
> understand and accept that 98% will benefit from this modification, no doubt
> Jerry! Try to put yourself in our situation: imagine a proposal of an init

Understood, but my point was that none of the parameter tuning proposals break
"TCP over everything", although they may not help solve "TCP optimized for
everything", but we never had the latter anyway.

We've tried hard to keep the penalty to those initRTT > 1sec paths at a minimum,
i.e., just one extra tinygram. This is also important for us because it may take
more than 1sec for many Android clients to establish connections over a radio
channel that has been put into power saving mode.

> RTO modification to 0.5 seconds. Maybe because 98% of Internet traffic is
> now localized and the RTO is now on average 0.2 seconds. Anyway, this will
> always penalize your network and this will be the situation for one of my
> customers. I can live with that, I see the benefits for the rest of the
> world. But I am happy to see a knob where I can restore the old behavior.
> Maybe some other environments will benefit from an even lower or higher
> initial RTO.

Yep, that's why we've had a knob for this for years.

Jerry

>
> Hagen
>

2011-05-31 14:48:32

by tsuna

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Fri, May 20, 2011 at 5:06 PM, H.K. Jerry Chu wrote:
> Yep, that's why we've had a knob for this for years.

I was traveling last week so sorry for not replying earlier to various
comments people made.

I talked to Jerry and he's agreed to share some patches that Google
has been using internally for years. I started this work because
after leaving Google and taking these changes for granted, I was
surprised to find that they weren't actually part of the mainline
Linux kernel.

It seems that David is willing to accept a change that will lower the
initRTO to 1s (compile-time constant), with a fallback to 3s
(compile-time constant), as per the draft rfc2988bis. Others are
legitimately worried about the impact this would cause in environments
where RTT is typically (or always) in the 1-3s range. Some would like
to see this as a per-destination thing.

Personally what I think would be ideal would be:
1. A sysctl knob for initRTO, to allow people to adjust this
appropriately for their environment.
2. Apply the srtt / rttvar seen on previous connections to new connections.

Does that sound reasonable?

For 2), I'm not sure how the details would work yet, I believe the
kernel already has what's necessary to remember these things on a per
peer basis, but it would be nice if I could specify things like "for
10.x.0.0/16 (local datacenter) use this aggressive setting, for
10.0.0.0/8 (my internal backend network) use that, for everything else
(Internets etc.) use the default".

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com
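The per-prefix configuration tsuna asks for amounts to a longest-prefix match over a small table. A minimal sketch, assuming hypothetical prefixes and RTO values (the thread only says "10.x.0.0/16", so 10.1.0.0/16 below is a placeholder, and none of these numbers come from the discussion):

```python
import ipaddress


def init_rto_for(dst, table, default_rto):
    """Return the initial RTO (seconds) of the most specific matching prefix."""
    addr = ipaddress.ip_address(dst)
    best_net, best_rto = None, default_rto
    for prefix, rto in table.items():
        net = ipaddress.ip_network(prefix)
        # Keep the match with the longest prefix length seen so far.
        if addr in net and (best_net is None
                            or net.prefixlen > best_net.prefixlen):
            best_net, best_rto = net, rto
    return best_rto


# Hypothetical settings: aggressive in the local datacenter, moderate on
# the internal backend network, default for everything else.
table = {"10.1.0.0/16": 0.05, "10.0.0.0/8": 0.2}
print(init_rto_for("10.1.2.3", table, 1.0))  # 0.05
print(init_rto_for("10.9.9.9", table, 1.0))  # 0.2
print(init_rto_for("8.8.8.8", table, 1.0))   # 1.0
```

A routing table already performs this kind of lookup, which is presumably why a per-route setting is attractive for this use case.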

2011-05-31 15:25:26

by Hagen Paul Pfeifer

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.


On Tue, 31 May 2011 07:48:09 -0700, tsuna wrote:

> I talked to Jerry and he's agreed to share some patches that Google
> has been using internally for years.

Great!

> Personally what I think would be ideal would be:
> 1. A sysctl knob for initRTO, to allow people to adjust this
> appropriately for their environment.
> 2. Apply the srtt / rttvar seen on previous connections to new
> connections.
>
> Does that sound reasonable?
>
> For 2), I'm not sure how the details would work yet, I believe the
> kernel already has what's necessary to remember these things on a per
> peer basis, but it would be nice if I could specify things like "for
> 10.x.0.0/16 (local datacenter) use this aggressive setting, for
> 10.0.0.0/8 (my internal backend network) use that, for everything else
> (Internets etc.) use the default".

Skip sysctl, it is deprecated. The initRTO is the ideal candidate for a
per route knob. And happily you will solve 2) with the per route thing too!
;-)

Search the web, you will find some patches where you can see how to extend
the per route system - including iproute2.

Hagen

2011-05-31 15:28:41

by tsuna

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.

On Tue, May 31, 2011 at 8:25 AM, Hagen Paul Pfeifer wrote:
> Skip sysctl, it is deprecated.

Sorry I meant a knob such as /proc/sys/net/ipv4/tcp_initrto.

> The initRTO is the ideal candidate for a
> per route knob. And happily you will solve 2) with the per route thing too!

You still need a knob for the default system-wide value, don't you?

--
Benoit "tsuna" Sigoure
Software Engineer @ http://www.StumbleUpon.com

2011-05-31 15:43:27

by Hagen Paul Pfeifer

[permalink] [raw]
Subject: Re: [PATCH] tcp: Expose the initial RTO via a new sysctl.


On Tue, 31 May 2011 08:28:18 -0700, tsuna wrote:

> Sorry I meant a knob such as /proc/sys/net/ipv4/tcp_initrto.

That's the same! ;-)

>> The initRTO is the ideal candidate for a
>> per route knob. And happily you will solve 2) with the per route thing
>> too!
>
> You still need a knob for the default system-wide value, don't you?

Yes, try to re-read the emails. Sysctl is a no-go, with a per route
interface you have the ability to tune the values. Talk with Jerry once
again - he wrote that at Google they already have a patch for this. And
with a per route knob you can select an even smaller value for your local
network (e.g. datacenter) and a larger value for all other routes. It makes
sense to provide a knob for this on a route basis, not on a global sysctl
basis.

But once again: talk with Jerry - he has the expert knowledge!

Hagen