LinuxLists.cc - query: tcp_sock tcp_header

2010-01-10 12:06:46

Subject: query: tcp_sock tcp_header_len calculations (re-sent)

Apparently, nobody on the network developers list knows about this. I've
stumbled upon a completely undocumented and incomprehensible usage for
tcp_header_len. Is whomever wrote this still around?

linux/tcp.h documents this as:
...
u16 tcp_header_len; /* Bytes of tcp header to send */
...

So far, so good. But it's clearly *not* correct in tcp_output.c:

tcp_connect_init()
...
tp->tcp_header_len = sizeof(struct tcphdr) +
(sysctl_tcp_timestamps ? TCPOLEN_TSTAMP_ALIGNED : 0);

#ifdef CONFIG_TCP_MD5SIG
if (tp->af_specific->md5_lookup(sk, sk) != NULL)
tp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
#endif
...

This combination is actually *impossible* -- current options code
*never* allows both authentication and timestamps, doing SACK instead:

tcp_syn_options()
...
if (likely(sysctl_tcp_timestamps && *md5 == NULL)) {
opts->options |= OPTION_TS;
...

tcp_synack_options()
...
/* We can't fit any SACK blocks in a packet with MD5 + TS
* options. There was discussion about disabling SACK
* rather than TS in order to fit in better with old,
* buggy kernels, but that was deemed to be unnecessary.
*/
doing_ts &= !ireq->sack_ok;
...

Thus, tcp_header_len has the wrong value, resulting in underestimation
for MSS. But even worse usage in minisocks.c:

tcp_create_openreq_child()
...
if (newtp->rx_opt.tstamp_ok) {
newtp->rx_opt.ts_recent = req->ts_recent;
newtp->rx_opt.ts_recent_stamp = get_seconds();
newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
} else {
newtp->rx_opt.ts_recent_stamp = 0;
newtp->tcp_header_len = sizeof(struct tcphdr);
}
#ifdef CONFIG_TCP_MD5SIG
newtp->md5sig_info = NULL; /*XXX*/
#endif
if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
...

This takes an *output* estimation, and then compares it to (and subtracts
from) skb->len, which is *input* length. What's supposed to happen here?

Shouldn't this simply use the real input tcp_hdrlen()?

2010-01-19 17:35:11

by William Allen Simpson

[permalink] [raw]

Subject: Re: [PATCH v2] tcp: input header length, prediction, and timestamp bugs

Over the weekend, I've been reviewing the .lst output that Andi
explained, and I've found a few interesting things.

1) The 4.4 compiler isn't very smart about shifts:

static inline unsigned int tcp_header_len_th(const struct tcphdr *th)
{
return th->doff * 4;
c04ea93e: 0f b6 47 0c movzbl 0xc(%edi),%eax
c04ea942: c0 e8 04 shr $0x4,%al
c04ea945: 0f b6 c0 movzbl %al,%eax
c04ea948: c1 e0 02 shl $0x2,%eax

That could easily be done as shr $0x2 instead.

2) This "fast path" code is under quite a bit of register pressure on
the 32-bit i386. There's a lot of saving and re-loading.

3) Particularly, the seldom used *th and len parameters are saved and
reloaded in the very beginning of the fast path, really wasting time.

4) Since both *th and len are actually indexed loads from *skb (which
the compiler keeps in a register most of the time), doing indexed loads
from the stack (%ebp) is the same, so they shouldn't be sent as
parameters at all!

5) There's already code, added back in 2006 for DMA, that directly
references skb->len instead of the len parameter. Probably lack of
header documentation, so the coder failed to notice:

if (copied_early)
tcp_cleanup_rbuf(sk, skb->len);
c04ead5d: 8b 56 50 mov 0x50(%esi),%edx
c04ead60: 89 d8 mov %ebx,%eax
c04ead62: e8 fc ff ff ff call c04ead63 <tcp_rcv_established+0x633>

Therefore, I'll resubmit this patch, removing the existing len parameter.
And maybe *th, too....

2010-01-19 19:35:26

by William Allen Simpson

[permalink] [raw]

Subject: Re: [PATCH v2] tcp: input header length, prediction, and timestamp bugs

William Allen Simpson wrote:
> Therefore, I'll resubmit this patch, removing the existing len parameter.
> And maybe *th, too....
>
Just to quickly note that gcc 4.4 doesn't properly remember that it has
already loaded *th with this rampant use of an inline function (unlike the
older macro method):

c04ea739: 89 d3 mov %edx,%ebx

static inline struct tcphdr *tcp_hdr(const struct sk_buff *skb)
{
return (struct tcphdr *)skb_transport_header(skb);
c04ea743: 8b 92 94 00 00 00 mov 0x94(%edx),%edx
*
* Our current scheme is not silly either but we take the
* extra cost of the net_bh soft interrupt processing...
* We do checksum and copy also but from device to kernel.
*/
if ((tcp_flag_word(tcp_hdr(skb)) & TCP_HP_BITS) == tp->pred_flags &&
...

Note that the index is in both %edx and %ebx, but it uses replaced %edx.
Although by inspection that result stays in %edx, it reloaded twice more:

res = tcp_validate_incoming(sk, skb, tcp_hdr(skb), 1);
c04ea78c: 8b 8b 94 00 00 00 mov 0x94(%ebx),%ecx
c04ea792: 89 da mov %ebx,%edx
c04ea794: 89 f0 mov %esi,%eax
c04ea796: c7 04 24 01 00 00 00 movl $0x1,(%esp)
c04ea79d: e8 0e c3 ff ff call c04e6ab0 <tcp_validate_incoming>
if (res <= 0)
c04ea7a2: 85 c0 test %eax,%eax
c04ea7a4: 0f 8e 8e 03 00 00 jle c04eab38 <tcp_rcv_established+0x408>

#else /* NET_SKBUFF_DATA_USES_OFFSET */

static inline unsigned char *skb_transport_header(const struct sk_buff *skb)
{
return skb->transport_header;
c04ea7aa: 8b 83 94 00 00 00 mov 0x94(%ebx),%eax
c04ea7b0: f6 40 0d 10 testb $0x10,0xd(%eax)
c04ea7b4: 0f 85 5e 03 00 00 jne c04eab18 <tcp_rcv_established+0x3e8>

This doesn't happen with the parameter *th (undisturbed in %edi):

c04ea78a: 89 f9 mov %edi,%ecx
c04ea78c: 89 f2 mov %esi,%edx
c04ea78e: 89 d8 mov %ebx,%eax
c04ea790: c7 04 24 01 00 00 00 movl $0x1,(%esp)
c04ea797: e8 14 c3 ff ff call c04e6ab0 <tcp_validate_incoming>
if (res <= 0)
c04ea79c: 85 c0 test %eax,%eax
c04ea79e: 0f 8e 8c 03 00 00 jle c04eab30 <tcp_rcv_established+0x400>
return -res;

step5:
if (th->ack && tcp_ack(sk, skb, FLAG_SLOWPATH) < 0)
c04ea7a4: f6 47 0d 10 testb $0x10,0xd(%edi)
c04ea7a8: 0f 85 62 03 00 00 jne c04eab10 <tcp_rcv_established+0x3e0>

Therefore, keeping the parameter *th.

2010-01-19 19:58:11

by William Allen Simpson

[permalink] [raw]

Subject: Re: [PATCH v2] tcp: input header length, prediction, and timestamp bugs

William Allen Simpson wrote:
> Therefore, I'll resubmit this patch, removing the existing len parameter.
>
And for record, there's a jtcp_rcv_established(), and the len isn't used
there either (it uses skb->len instead).

I also discovered that the related tcp_rcv_state_process() and
tcp_rcv_synsent_state_process() never uses their len parameter.
So, I'll remove them, too.

2010-01-19 20:24:43

by William Allen Simpson

[permalink] [raw]

Subject: [PATCH v3] tcp: input header length, prediction, and timestamp bugs

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {

/*
* Header prediction flags
- * 0x5?10 << 16 + snd_wnd in net byte order
+ * S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ * (PSH flag is ignored)
+ * S is 5 (no options), or 8 (timestamp aligned)
+ * otherwise, 0 to turn it off -- for instance, when there are
+ * holes in receive space.
*/
__be32 pred_flags;

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..6b0d7e9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -310,13 +310,11 @@ extern int tcp_ioctl(struct sock *sk,

extern int tcp_rcv_state_process(struct sock *sk,
struct sk_buff *skb,
- struct tcphdr *th,
- unsigned len);
+ struct tcphdr *th);

extern int tcp_rcv_established(struct sock *sk,
struct sk_buff *skb,
- struct tcphdr *th,
- unsigned len);
+ struct tcphdr *th);

extern void tcp_rcv_space_adjust(struct sock *sk);

@@ -533,9 +531,16 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return (tp->srtt >> 3) + tp->rttvar;
}

+static inline u16 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+ return tp->rx_opt.tstamp_ok
+ ? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+ : sizeof(struct tcphdr);
+}
+
static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
{
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |
ntohl(TCP_FLAG_ACK) |
snd_wnd);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28e0296..8e0f6ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
* tcp header plus fixed timestamp option length.
* Resulting "len" is MSS free of SACK jitter.
*/
- len -= tcp_sk(sk)->tcp_header_len;
+ len -= __tcp_fast_path_header_length(tcp_sk(sk));
icsk->icsk_ack.last_seg_size = len;
if (len == lss) {
icsk->icsk_ack.rcv_mss = len;
@@ -5206,7 +5206,7 @@ discard:
* tcp_data_queue when everything is OK.
*/
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
struct tcp_sock *tp = tcp_sk(sk);
int res;
@@ -5225,31 +5225,15 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* extra cost of the net_bh soft interrupt processing...
* We do checksum and copy also but from device to kernel.
*/
-
- tp->rx_opt.saw_tstamp = 0;
-
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
- * '?' will be 0 for the fast path, otherwise pred_flags is 0 to
- * turn it off (when there are holes in the receive
- * space for instance)
- * PSH flag is ignored.
- */
-
if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
- int tcp_header_len = tp->tcp_header_len;
-
- /* Timestamp header prediction: tcp_header_len
- * is automatically equal to th->doff*4 due to pred_flags
- * match.
- */
+ int tcp_header_len = tcp_header_len_th(th);

- /* Check timestamp */
- if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
- /* No? Slow path! */
+ /* Timestamp header prediction */
+ if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+ tp->rx_opt.saw_tstamp = 0; /* false */
+ } else {
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;

@@ -5264,35 +5248,12 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
*/
}

- if (len <= tcp_header_len) {
- /* Bulk data transfer: sender */
- if (len == tcp_header_len) {
- /* Predicted packet is in window by definition.
- * seq == rcv_nxt and rcv_wup <= rcv_nxt.
- * Hence, check seq<=rcv_wup reduces to:
- */
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
- /* We know that such packets are checksummed
- * on entry.
- */
- tcp_ack(sk, skb, 0);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
- return 0;
- } else { /* Header too small */
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
- goto discard;
- }
- } else {
+ if (tcp_header_len < skb->len) {
int eaten = 0;
int copied_early = 0;

if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
+ skb->len - tcp_header_len <= tp->ucopy.len) {
#ifdef CONFIG_NET_DMA
if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
copied_early = 1;
@@ -5311,9 +5272,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) +
- TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5334,8 +5293,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5376,11 +5334,33 @@ no_ack:
else
sk->sk_data_ready(sk, 0);
return 0;
+ } else {
+ /* Bulk data transfer: sender
+ *
+ * tcp_header_len > skb->len never happens,
+ * already checked by tcp_v[4,6]_rcv()
+ *
+ * Predicted packet is in window by definition.
+ * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+ * Hence, check seq<=rcv_wup reduces to:
+ */
+ if (tp->rx_opt.saw_tstamp &&
+ tp->rcv_nxt == tp->rcv_wup)
+ tcp_store_ts_recent(tp);
+
+ /* We know that such packets are checksummed
+ * on entry.
+ */
+ tcp_ack(sk, skb, 0);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
+ return 0;
}
}

slow_path:
- if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+ /* Assumes header and options unchanged since checksum_init() */
+ if (tcp_checksum_complete_user(sk, skb))
goto csum_error;

/*
@@ -5416,7 +5396,7 @@ discard:
}

static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
u8 *hash_location;
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5693,7 +5673,7 @@ reset_and_undo:
*/

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5740,7 +5720,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
goto discard;

case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
+ queued = tcp_rcv_synsent_state_process(sk, skb, th);
if (queued >= 0)
return queued;

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0a76e41..f999e06 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1551,7 +1551,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)

if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
TCP_CHECK_TIMER(sk);
- if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
+ if (tcp_rcv_established(sk, skb, tcp_hdr(skb))) {
rsk = sk;
goto reset;
}
@@ -1578,7 +1578,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
}

TCP_CHECK_TIMER(sk);
- if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
+ if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb))) {
rsk = sk;
goto reset;
}
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index bb110c5..7bcd3ba 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -88,7 +88,7 @@ static inline int tcp_probe_avail(void)
* Note: arguments must match tcp_rcv_established()!
*/
static int jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
const struct tcp_sock *tp = tcp_sk(sk);
const struct inet_sock *inet = inet_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b76939a..3d08a4d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1586,7 +1586,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)

if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
TCP_CHECK_TIMER(sk);
- if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
+ if (tcp_rcv_established(sk, skb, tcp_hdr(skb)))
goto reset;
TCP_CHECK_TIMER(sk);
if (opt_skb)
@@ -1618,7 +1618,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
}

TCP_CHECK_TIMER(sk);
- if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
+ if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb)))
goto reset;
TCP_CHECK_TIMER(sk);
if (opt_skb)
--
1.6.3.3

Attachments:

len_th+4v3.patch (9.14 kB)

2010-01-19 23:58:30

by William Allen Simpson

[permalink] [raw]

Subject: Re: [PATCH v3] tcp: input header length, prediction, and timestamp bugs

Missing 1 part of the patch. My mistake cutting the patches apart
from the bigger testing patch. I'll send a new version promptly.

2010-01-20 00:01:07

by William Allen Simpson

[permalink] [raw]

Subject: [PATCH v4] tcp: input header length, prediction, and timestamp bugs

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {

/*
* Header prediction flags
- * 0x5?10 << 16 + snd_wnd in net byte order
+ * S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ * (PSH flag is ignored)
+ * S is 5 (no options), or 8 (timestamp aligned)
+ * otherwise, 0 to turn it off -- for instance, when there are
+ * holes in receive space.
*/
__be32 pred_flags;

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..6b0d7e9 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -310,13 +310,11 @@ extern int tcp_ioctl(struct sock *sk,

extern int tcp_rcv_state_process(struct sock *sk,
struct sk_buff *skb,
- struct tcphdr *th,
- unsigned len);
+ struct tcphdr *th);

extern int tcp_rcv_established(struct sock *sk,
struct sk_buff *skb,
- struct tcphdr *th,
- unsigned len);
+ struct tcphdr *th);

extern void tcp_rcv_space_adjust(struct sock *sk);

@@ -533,9 +531,16 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return (tp->srtt >> 3) + tp->rttvar;
}

+static inline u16 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+ return tp->rx_opt.tstamp_ok
+ ? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+ : sizeof(struct tcphdr);
+}
+
static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
{
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |
ntohl(TCP_FLAG_ACK) |
snd_wnd);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28e0296..8e0f6ae 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
* tcp header plus fixed timestamp option length.
* Resulting "len" is MSS free of SACK jitter.
*/
- len -= tcp_sk(sk)->tcp_header_len;
+ len -= __tcp_fast_path_header_length(tcp_sk(sk));
icsk->icsk_ack.last_seg_size = len;
if (len == lss) {
icsk->icsk_ack.rcv_mss = len;
@@ -5206,7 +5206,7 @@ discard:
* tcp_data_queue when everything is OK.
*/
int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
struct tcp_sock *tp = tcp_sk(sk);
int res;
@@ -5225,31 +5225,15 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* extra cost of the net_bh soft interrupt processing...
* We do checksum and copy also but from device to kernel.
*/
-
- tp->rx_opt.saw_tstamp = 0;
-
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
- * '?' will be 0 for the fast path, otherwise pred_flags is 0 to
- * turn it off (when there are holes in the receive
- * space for instance)
- * PSH flag is ignored.
- */
-
if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
- int tcp_header_len = tp->tcp_header_len;
-
- /* Timestamp header prediction: tcp_header_len
- * is automatically equal to th->doff*4 due to pred_flags
- * match.
- */
+ int tcp_header_len = tcp_header_len_th(th);

- /* Check timestamp */
- if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
- /* No? Slow path! */
+ /* Timestamp header prediction */
+ if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+ tp->rx_opt.saw_tstamp = 0; /* false */
+ } else {
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;

@@ -5264,35 +5248,12 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
*/
}

- if (len <= tcp_header_len) {
- /* Bulk data transfer: sender */
- if (len == tcp_header_len) {
- /* Predicted packet is in window by definition.
- * seq == rcv_nxt and rcv_wup <= rcv_nxt.
- * Hence, check seq<=rcv_wup reduces to:
- */
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
- /* We know that such packets are checksummed
- * on entry.
- */
- tcp_ack(sk, skb, 0);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
- return 0;
- } else { /* Header too small */
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
- goto discard;
- }
- } else {
+ if (tcp_header_len < skb->len) {
int eaten = 0;
int copied_early = 0;

if (tp->copied_seq == tp->rcv_nxt &&
- len - tcp_header_len <= tp->ucopy.len) {
+ skb->len - tcp_header_len <= tp->ucopy.len) {
#ifdef CONFIG_NET_DMA
if (tcp_dma_try_early_copy(sk, skb, tcp_header_len)) {
copied_early = 1;
@@ -5311,9 +5272,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) +
- TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5334,8 +5293,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5376,11 +5334,33 @@ no_ack:
else
sk->sk_data_ready(sk, 0);
return 0;
+ } else {
+ /* Bulk data transfer: sender
+ *
+ * tcp_header_len > skb->len never happens,
+ * already checked by tcp_v[4,6]_rcv()
+ *
+ * Predicted packet is in window by definition.
+ * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+ * Hence, check seq<=rcv_wup reduces to:
+ */
+ if (tp->rx_opt.saw_tstamp &&
+ tp->rcv_nxt == tp->rcv_wup)
+ tcp_store_ts_recent(tp);
+
+ /* We know that such packets are checksummed
+ * on entry.
+ */
+ tcp_ack(sk, skb, 0);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
+ return 0;
}
}

slow_path:
- if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+ /* Assumes header and options unchanged since checksum_init() */
+ if (tcp_checksum_complete_user(sk, skb))
goto csum_error;

/*
@@ -5416,7 +5396,7 @@ discard:
}

static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
u8 *hash_location;
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5693,7 +5673,7 @@ reset_and_undo:
*/

int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
struct tcp_sock *tp = tcp_sk(sk);
struct inet_connection_sock *icsk = inet_csk(sk);
@@ -5740,7 +5720,7 @@ int tcp_rcv_state_process(struct sock *sk, struct sk_buff *skb,
goto discard;

case TCP_SYN_SENT:
- queued = tcp_rcv_synsent_state_process(sk, skb, th, len);
+ queued = tcp_rcv_synsent_state_process(sk, skb, th);
if (queued >= 0)
return queued;

diff --git a/net/ipv4/tcp_ipv4.c b/net/ipv4/tcp_ipv4.c
index 0a76e41..f999e06 100644
--- a/net/ipv4/tcp_ipv4.c
+++ b/net/ipv4/tcp_ipv4.c
@@ -1551,7 +1551,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)

if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
TCP_CHECK_TIMER(sk);
- if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len)) {
+ if (tcp_rcv_established(sk, skb, tcp_hdr(skb))) {
rsk = sk;
goto reset;
}
@@ -1578,7 +1578,7 @@ int tcp_v4_do_rcv(struct sock *sk, struct sk_buff *skb)
}

TCP_CHECK_TIMER(sk);
- if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len)) {
+ if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb))) {
rsk = sk;
goto reset;
}
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f206ee5..37b7536 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -718,8 +718,7 @@ int tcp_child_process(struct sock *parent, struct sock *child,
int state = child->sk_state;

if (!sock_owned_by_user(child)) {
- ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb),
- skb->len);
+ ret = tcp_rcv_state_process(child, skb, tcp_hdr(skb));
/* Wakeup parent, send SIGIO */
if (state == TCP_SYN_RECV && child->sk_state != state)
parent->sk_data_ready(parent, 0);
diff --git a/net/ipv4/tcp_probe.c b/net/ipv4/tcp_probe.c
index bb110c5..7bcd3ba 100644
--- a/net/ipv4/tcp_probe.c
+++ b/net/ipv4/tcp_probe.c
@@ -88,7 +88,7 @@ static inline int tcp_probe_avail(void)
* Note: arguments must match tcp_rcv_established()!
*/
static int jtcp_rcv_established(struct sock *sk, struct sk_buff *skb,
- struct tcphdr *th, unsigned len)
+ struct tcphdr *th)
{
const struct tcp_sock *tp = tcp_sk(sk);
const struct inet_sock *inet = inet_sk(sk);
diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index b76939a..3d08a4d 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1586,7 +1586,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)

if (sk->sk_state == TCP_ESTABLISHED) { /* Fast path */
TCP_CHECK_TIMER(sk);
- if (tcp_rcv_established(sk, skb, tcp_hdr(skb), skb->len))
+ if (tcp_rcv_established(sk, skb, tcp_hdr(skb)))
goto reset;
TCP_CHECK_TIMER(sk);
if (opt_skb)
@@ -1618,7 +1618,7 @@ static int tcp_v6_do_rcv(struct sock *sk, struct sk_buff *skb)
}

TCP_CHECK_TIMER(sk);
- if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb), skb->len))
+ if (tcp_rcv_state_process(sk, skb, tcp_hdr(skb)))
goto reset;
TCP_CHECK_TIMER(sk);
if (opt_skb)
--
1.6.3.3

Attachments:

len_th+4v4.patch (9.70 kB)

2010-01-15 14:16:38

by William Allen Simpson

[permalink] [raw]

Subject: [PATCH] tcp: input header length, prediction, and timestamp bugs

diff --git a/include/linux/tcp.h b/include/linux/tcp.h
index 74728f7..2987ee8 100644
--- a/include/linux/tcp.h
+++ b/include/linux/tcp.h
@@ -301,7 +301,11 @@ struct tcp_sock {

/*
* Header prediction flags
- * 0x5?10 << 16 + snd_wnd in net byte order
+ * S << 28 + TCP_FLAG_ACK + snd_wnd, in net byte order
+ * (PSH flag is ignored)
+ * S is 5 (no options), or 8 (timestamp aligned)
+ * otherwise, 0 to turn it off -- for instance, when there are
+ * holes in receive space.
*/
__be32 pred_flags;

diff --git a/include/net/tcp.h b/include/net/tcp.h
index 34f5cc2..30817b1 100644
--- a/include/net/tcp.h
+++ b/include/net/tcp.h
@@ -533,9 +533,16 @@ static inline u32 __tcp_set_rto(const struct tcp_sock *tp)
return (tp->srtt >> 3) + tp->rttvar;
}

+static inline u16 __tcp_fast_path_header_length(const struct tcp_sock *tp)
+{
+ return tp->rx_opt.tstamp_ok
+ ? sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED
+ : sizeof(struct tcphdr);
+}
+
static inline void __tcp_fast_path_on(struct tcp_sock *tp, u32 snd_wnd)
{
- tp->pred_flags = htonl((tp->tcp_header_len << 26) |
+ tp->pred_flags = htonl((__tcp_fast_path_header_length(tp) << (28 - 2)) |
ntohl(TCP_FLAG_ACK) |
snd_wnd);
}
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 28e0296..3378883 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -152,7 +152,7 @@ static void tcp_measure_rcv_mss(struct sock *sk, const struct sk_buff *skb)
* tcp header plus fixed timestamp option length.
* Resulting "len" is MSS free of SACK jitter.
*/
- len -= tcp_sk(sk)->tcp_header_len;
+ len -= __tcp_fast_path_header_length(tcp_sk(sk));
icsk->icsk_ack.last_seg_size = len;
if (len == lss) {
icsk->icsk_ack.rcv_mss = len;
@@ -5225,31 +5225,15 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* extra cost of the net_bh soft interrupt processing...
* We do checksum and copy also but from device to kernel.
*/
-
- tp->rx_opt.saw_tstamp = 0;
-
- /* pred_flags is 0xS?10 << 16 + snd_wnd
- * if header_prediction is to be made
- * 'S' will always be tp->tcp_header_len >> 2
- * '?' will be 0 for the fast path, otherwise pred_flags is 0 to
- * turn it off (when there are holes in the receive
- * space for instance)
- * PSH flag is ignored.
- */
-
if ((tcp_flag_word(th) & TCP_HP_BITS) == tp->pred_flags &&
TCP_SKB_CB(skb)->seq == tp->rcv_nxt &&
!after(TCP_SKB_CB(skb)->ack_seq, tp->snd_nxt)) {
- int tcp_header_len = tp->tcp_header_len;
-
- /* Timestamp header prediction: tcp_header_len
- * is automatically equal to th->doff*4 due to pred_flags
- * match.
- */
+ int tcp_header_len = tcp_header_len_th(th);

- /* Check timestamp */
- if (tcp_header_len == sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) {
- /* No? Slow path! */
+ /* Timestamp header prediction */
+ if (tcp_header_len != sizeof(*th) + TCPOLEN_TSTAMP_ALIGNED) {
+ tp->rx_opt.saw_tstamp = 0; /* false */
+ } else {
if (!tcp_parse_aligned_timestamp(tp, th))
goto slow_path;

@@ -5264,30 +5248,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
*/
}

- if (len <= tcp_header_len) {
- /* Bulk data transfer: sender */
- if (len == tcp_header_len) {
- /* Predicted packet is in window by definition.
- * seq == rcv_nxt and rcv_wup <= rcv_nxt.
- * Hence, check seq<=rcv_wup reduces to:
- */
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
- tp->rcv_nxt == tp->rcv_wup)
- tcp_store_ts_recent(tp);
-
- /* We know that such packets are checksummed
- * on entry.
- */
- tcp_ack(sk, skb, 0);
- __kfree_skb(skb);
- tcp_data_snd_check(sk);
- return 0;
- } else { /* Header too small */
- TCP_INC_STATS_BH(sock_net(sk), TCP_MIB_INERRS);
- goto discard;
- }
- } else {
+ if (tcp_header_len < len) {
int eaten = 0;
int copied_early = 0;

@@ -5311,9 +5272,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) +
- TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5334,8 +5293,7 @@ int tcp_rcv_established(struct sock *sk, struct sk_buff *skb,
* seq == rcv_nxt and rcv_wup <= rcv_nxt.
* Hence, check seq<=rcv_wup reduces to:
*/
- if (tcp_header_len ==
- (sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED) &&
+ if (tp->rx_opt.saw_tstamp &&
tp->rcv_nxt == tp->rcv_wup)
tcp_store_ts_recent(tp);

@@ -5376,11 +5334,33 @@ no_ack:
else
sk->sk_data_ready(sk, 0);
return 0;
+ } else {
+ /* Bulk data transfer: sender
+ *
+ * len < tcp_header_len never happens,
+ * already checked by tcp_v[4,6]_rcv()
+ *
+ * Predicted packet is in window by definition.
+ * seq == rcv_nxt and rcv_wup <= rcv_nxt.
+ * Hence, check seq<=rcv_wup reduces to:
+ */
+ if (tp->rx_opt.saw_tstamp &&
+ tp->rcv_nxt == tp->rcv_wup)
+ tcp_store_ts_recent(tp);
+
+ /* We know that such packets are checksummed
+ * on entry.
+ */
+ tcp_ack(sk, skb, 0);
+ __kfree_skb(skb);
+ tcp_data_snd_check(sk);
+ return 0;
}
}

slow_path:
- if (len < (th->doff << 2) || tcp_checksum_complete_user(sk, skb))
+ /* Assumes header and options unchanged since checksum_init() */
+ if (tcp_checksum_complete_user(sk, skb))
goto csum_error;

/*
@@ -5502,12 +5482,8 @@ static int tcp_rcv_synsent_state_process(struct sock *sk, struct sk_buff *skb,

if (tp->rx_opt.saw_tstamp) {
tp->rx_opt.tstamp_ok = 1;
- tp->tcp_header_len =
- sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
tp->advmss -= TCPOLEN_TSTAMP_ALIGNED;
tcp_store_ts_recent(tp);
- } else {
- tp->tcp_header_len = sizeof(struct tcphdr);
}

if (tcp_is_sack(tp) && sysctl_tcp_fack)
@@ -5632,10 +5608,6 @@ discard:
if (tp->rx_opt.saw_tstamp) {
tp->rx_opt.tstamp_ok = 1;
tcp_store_ts_recent(tp);
- tp->tcp_header_len =
- sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
- } else {
- tp->tcp_header_len = sizeof(struct tcphdr);
}

tp->rcv_nxt = TCP_SKB_CB(skb)->seq + 1;
diff --git a/net/ipv4/tcp_minisocks.c b/net/ipv4/tcp_minisocks.c
index f206ee5..d57a7da 100644
--- a/net/ipv4/tcp_minisocks.c
+++ b/net/ipv4/tcp_minisocks.c
@@ -387,6 +387,7 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
struct tcp_sock *newtp = tcp_sk(newsk);
struct tcp_sock *oldtp = tcp_sk(sk);
struct tcp_cookie_values *oldcvp = oldtp->cookie_values;
+ int lss = skb->len - sizeof(struct tcphdr);

/* TCP Cookie Transactions require space for the cookie pair,
* as it differs for each connection. There is no need to
@@ -490,18 +491,15 @@ struct sock *tcp_create_openreq_child(struct sock *sk, struct request_sock *req,
if (newtp->rx_opt.tstamp_ok) {
newtp->rx_opt.ts_recent = req->ts_recent;
newtp->rx_opt.ts_recent_stamp = get_seconds();
- newtp->tcp_header_len = sizeof(struct tcphdr) + TCPOLEN_TSTAMP_ALIGNED;
+ lss -= TCPOLEN_TSTAMP_ALIGNED;
} else {
newtp->rx_opt.ts_recent_stamp = 0;
- newtp->tcp_header_len = sizeof(struct tcphdr);
}
#ifdef CONFIG_TCP_MD5SIG
newtp->md5sig_info = NULL; /*XXX*/
- if (newtp->af_specific->md5_lookup(sk, newsk))
- newtp->tcp_header_len += TCPOLEN_MD5SIG_ALIGNED;
#endif
- if (skb->len >= TCP_MSS_DEFAULT + newtp->tcp_header_len)
- newicsk->icsk_ack.last_seg_size = skb->len - newtp->tcp_header_len;
+ if (lss >= TCP_MSS_DEFAULT)
+ newicsk->icsk_ack.last_seg_size = lss;
newtp->rx_opt.mss_clamp = req->mss;
TCP_ECN_openreq_child(newtp, req);

--
1.6.3.3

Attachments:

len_th+4v1.patch (7.65 kB)

2010-01-15 19:12:33

by William Allen Simpson

[permalink] [raw]

Subject: Re: [PATCH] tcp: input header length, prediction, and timestamp bugs

William Allen Simpson wrote:
> Finally, don't use TCPOLEN_MD5SIG_ALIGNED in the tcp_minisock.c
> last_seg_size calculations. It's not used in all other places, and may
> result in a too short estimate.
>
> Instead, use the same calculations as tp->advmss in tcp_input.c.
>
This part really shouldn't have been included in this patch, as it
depends on similar changes to tcp_output.c. My mistake cutting the
patches apart from the bigger testing patch.

I'll send a new version promptly.

2010-01-15 19:20:40

by William Allen Simpson

[permalink] [raw]

Subject: [PATCH v2] tcp: input header length, prediction, and timestamp bugs

Attachments:

len_th+4v2.patch (5.47 kB)