From: Alexander Duyck <[email protected]>
There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
message in the case of IPv6 or a fragmentation request in the case of
IPv4. This results in the socket stalling for a second or more as it does
not respond to the message by retransmitting the SYN frame.
Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
makes use of the entire MSS. In the case of fastopen it does, and an
additional complication is that the retransmit queue doesn't contain the
original frames. As a result when tcp_simple_retransmit is called and
walks the list of frames in the queue it may not mark the frames as lost
because both the SYN and the data packet each individually are smaller than
the MSS size after the adjustment. This results in the socket being stalled
until the retransmit timer kicks in and forces the SYN frame out again
without the data attached.
In order to resolve this we can generate our best estimate for the original
packet size by detecting the fastopen SYN frame and then adding the
overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
have exceeded the MSS. If so we can mark the frame as lost and retransmit
it.
Signed-off-by: Alexander Duyck <[email protected]>
---
net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
1 file changed, 27 insertions(+), 3 deletions(-)
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 9e8a6c1aa019..79375b58de84 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
void tcp_simple_retransmit(struct sock *sk)
{
const struct inet_connection_sock *icsk = inet_csk(sk);
+ struct sk_buff *skb = tcp_rtx_queue_head(sk);
struct tcp_sock *tp = tcp_sk(sk);
- struct sk_buff *skb;
- unsigned int mss = tcp_current_mss(sk);
+ unsigned int mss;
+
+ /* A fastopen SYN request is stored as two separate packets within
+ * the retransmit queue, this is done by tcp_send_syn_data().
+ * As a result simply checking the MSS of the frames in the queue
+ * will not work for the SYN packet. So instead we must make a best
+ * effort attempt by validating the data frame with the mss size
+ * that would be computed now by tcp_send_syn_data and comparing
+ * that against the data frame that would have been included with
+ * the SYN.
+ */
+ if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
+ struct sk_buff *syn_data = skb_rb_next(skb);
+
+ mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
+ tp->tcp_header_len - sizeof(struct tcphdr) -
+ MAX_TCP_OPTION_SPACE;
- skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
+ if (syn_data && syn_data->len > mss)
+ tcp_mark_skb_lost(sk, skb);
+
+ skb = syn_data;
+ } else {
+ mss = tcp_current_mss(sk);
+ }
+
+ skb_rbtree_walk_from(skb) {
if (tcp_skb_seglen(skb) > mss)
tcp_mark_skb_lost(sk, skb);
}
On Sat, Dec 12, 2020 at 11:01 AM Alexander Duyck
<[email protected]> wrote:
>
> On Sat, Dec 12, 2020 at 10:34 AM Yuchung Cheng <[email protected]> wrote:
> >
> > On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
> > <[email protected]> wrote:
> > >
> > > From: Alexander Duyck <[email protected]>
> > >
> > > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > > message in the case of IPv6 or a fragmentation request in the case of
> > > IPv4. This results in the socket stalling for a second or more as it does
> > > not respond to the message by retransmitting the SYN frame.
> > >
> > > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > > makes use of the entire MSS. In the case of fastopen it does, and an
> > > additional complication is that the retransmit queue doesn't contain the
> > > original frames. As a result when tcp_simple_retransmit is called and
> > > walks the list of frames in the queue it may not mark the frames as lost
> > > because both the SYN and the data packet each individually are smaller than
> > > the MSS size after the adjustment. This results in the socket being stalled
> > > until the retransmit timer kicks in and forces the SYN frame out again
> > > without the data attached.
> > >
> > > In order to resolve this we can generate our best estimate for the original
> > > packet size by detecting the fastopen SYN frame and then adding the
> > > overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> > > have exceeded the MSS. If so we can mark the frame as lost and retransmit
> > > it.
> > >
> > > Signed-off-by: Alexander Duyck <[email protected]>
> > > ---
> > > net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
> > > 1 file changed, 27 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 9e8a6c1aa019..79375b58de84 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
> > > void tcp_simple_retransmit(struct sock *sk)
> > > {
> > > const struct inet_connection_sock *icsk = inet_csk(sk);
> > > + struct sk_buff *skb = tcp_rtx_queue_head(sk);
> > > struct tcp_sock *tp = tcp_sk(sk);
> > > - struct sk_buff *skb;
> > > - unsigned int mss = tcp_current_mss(sk);
> > > + unsigned int mss;
> > > +
> > > + /* A fastopen SYN request is stored as two separate packets within
> > > + * the retransmit queue, this is done by tcp_send_syn_data().
> > > + * As a result simply checking the MSS of the frames in the queue
> > > + * will not work for the SYN packet. So instead we must make a best
> > > + * effort attempt by validating the data frame with the mss size
> > > + * that would be computed now by tcp_send_syn_data and comparing
> > > + * that against the data frame that would have been included with
> > > + * the SYN.
> > > + */
> > > + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> > > + struct sk_buff *syn_data = skb_rb_next(skb);
> > > +
> > > + mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> > > + tp->tcp_header_len - sizeof(struct tcphdr) -
> > > + MAX_TCP_OPTION_SPACE;
> > nice comment! The original syn_data mss needs to be inferred which is
> > a hassle to get right. my sense is path-mtu issue is enough to warrant
> > they are lost.
> > I suggest simply mark syn & its data lost if tcp_simple_retransmit is
> > called during TFO handshake, i.e.
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 62f7aabc7920..7f0c4f2947eb 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
> > unsigned int mss = tcp_current_mss(sk);
> >
> > skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> > - if (tcp_skb_seglen(skb) > mss)
> > + if (tcp_skb_seglen(skb) > mss ||
> > + (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
> > tcp_mark_skb_lost(sk, skb);
> > }
> >
> > We have a TFO packetdrill test that verifies my suggested fix should
> > trigger an immediate retransmit vs 1s wait.
>
> Okay, I will go that route, although I will still probably make one
> minor cleanup. Instead of testing for syn_data and state per packet I
> will probably keep the bit where I overwrite mss since it is only used
> in the loop. What I can do is switch it from unsigned int to int since
> technically tcp_current_mss and tcp_skb_seglen are both a signed int
> anyway. Then I can just set mss to -1 in the syn_data && TCP_SYN_SENT
> case. That way all of the frames in the ring should fail the check
> while only having to add one initial check outside the loop.
sounds good. I thought about that too but decided it's normally just
two skbs and not worth the hassle.
btw I thought about negatively caching this event to disable TFO
completely afterward, but I assume it's not needed as the ICMP error
should correct the route mtu hence future TFO should be fine. If
that's not the case we might want to consider disable TFO to avoid
this future "TFO->ICMP->SYN-retry" repetition.
On Sat, Dec 12, 2020 at 11:07 AM Yuchung Cheng <[email protected]> wrote:
>
> On Sat, Dec 12, 2020 at 11:01 AM Alexander Duyck
> <[email protected]> wrote:
> >
> > On Sat, Dec 12, 2020 at 10:34 AM Yuchung Cheng <[email protected]> wrote:
> > >
> > > On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
> > > <[email protected]> wrote:
> > > >
> > > > From: Alexander Duyck <[email protected]>
> > > >
> > > > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > > > message in the case of IPv6 or a fragmentation request in the case of
> > > > IPv4. This results in the socket stalling for a second or more as it does
> > > > not respond to the message by retransmitting the SYN frame.
> > > >
> > > > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > > > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > > > makes use of the entire MSS. In the case of fastopen it does, and an
> > > > additional complication is that the retransmit queue doesn't contain the
> > > > original frames. As a result when tcp_simple_retransmit is called and
> > > > walks the list of frames in the queue it may not mark the frames as lost
> > > > because both the SYN and the data packet each individually are smaller than
> > > > the MSS size after the adjustment. This results in the socket being stalled
> > > > until the retransmit timer kicks in and forces the SYN frame out again
> > > > without the data attached.
> > > >
> > > > In order to resolve this we can generate our best estimate for the original
> > > > packet size by detecting the fastopen SYN frame and then adding the
> > > > overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> > > > have exceeded the MSS. If so we can mark the frame as lost and retransmit
> > > > it.
> > > >
> > > > Signed-off-by: Alexander Duyck <[email protected]>
> > > > ---
> > > > net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
> > > > 1 file changed, 27 insertions(+), 3 deletions(-)
> > > >
> > > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > > index 9e8a6c1aa019..79375b58de84 100644
> > > > --- a/net/ipv4/tcp_input.c
> > > > +++ b/net/ipv4/tcp_input.c
> > > > @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
> > > > void tcp_simple_retransmit(struct sock *sk)
> > > > {
> > > > const struct inet_connection_sock *icsk = inet_csk(sk);
> > > > + struct sk_buff *skb = tcp_rtx_queue_head(sk);
> > > > struct tcp_sock *tp = tcp_sk(sk);
> > > > - struct sk_buff *skb;
> > > > - unsigned int mss = tcp_current_mss(sk);
> > > > + unsigned int mss;
> > > > +
> > > > + /* A fastopen SYN request is stored as two separate packets within
> > > > + * the retransmit queue, this is done by tcp_send_syn_data().
> > > > + * As a result simply checking the MSS of the frames in the queue
> > > > + * will not work for the SYN packet. So instead we must make a best
> > > > + * effort attempt by validating the data frame with the mss size
> > > > + * that would be computed now by tcp_send_syn_data and comparing
> > > > + * that against the data frame that would have been included with
> > > > + * the SYN.
> > > > + */
> > > > + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> > > > + struct sk_buff *syn_data = skb_rb_next(skb);
> > > > +
> > > > + mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> > > > + tp->tcp_header_len - sizeof(struct tcphdr) -
> > > > + MAX_TCP_OPTION_SPACE;
> > > nice comment! The original syn_data mss needs to be inferred which is
> > > a hassle to get right. my sense is path-mtu issue is enough to warrant
> > > they are lost.
> > > I suggest simply mark syn & its data lost if tcp_simple_retransmit is
> > > called during TFO handshake, i.e.
> > >
> > > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > > index 62f7aabc7920..7f0c4f2947eb 100644
> > > --- a/net/ipv4/tcp_input.c
> > > +++ b/net/ipv4/tcp_input.c
> > > @@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
> > > unsigned int mss = tcp_current_mss(sk);
> > >
> > > skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> > > - if (tcp_skb_seglen(skb) > mss)
> > > + if (tcp_skb_seglen(skb) > mss ||
> > > + (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
> > > tcp_mark_skb_lost(sk, skb);
> > > }
> > >
> > > We have a TFO packetdrill test that verifies my suggested fix should
> > > trigger an immediate retransmit vs 1s wait.
> >
> > Okay, I will go that route, although I will still probably make one
> > minor cleanup. Instead of testing for syn_data and state per packet I
> > will probably keep the bit where I overwrite mss since it is only used
> > in the loop. What I can do is switch it from unsigned int to int since
> > technically tcp_current_mss and tcp_skb_seglen are both a signed int
> > anyway. Then I can just set mss to -1 in the syn_data && TCP_SYN_SENT
> > case. That way all of the frames in the ring should fail the check
> > while only having to add one initial check outside the loop.
> sounds good. I thought about that too but decided it's normally just
> two skbs and not worth the hassle.
>
> btw I thought about negatively caching this event to disable TFO
> completely afterward, but I assume it's not needed as the ICMP error
> should correct the route mtu hence future TFO should be fine. If
> that's not the case we might want to consider disable TFO to avoid
> this future "TFO->ICMP->SYN-retry" repetition.
At least in my testing there are no further occurrences after the
first. Basically the first one hits and then all the follow-on
requests use the correct MSS based on the cached path MTU. As such
there isn't any penalty to leaving the feature enabled as over the
course of several transactions it still provides more benefit than
harm.
On Sat, Dec 12, 2020 at 10:34 AM Yuchung Cheng <[email protected]> wrote:
>
> On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
> <[email protected]> wrote:
> >
> > From: Alexander Duyck <[email protected]>
> >
> > There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> > message in the case of IPv6 or a fragmentation request in the case of
> > IPv4. This results in the socket stalling for a second or more as it does
> > not respond to the message by retransmitting the SYN frame.
> >
> > Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> > ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> > makes use of the entire MSS. In the case of fastopen it does, and an
> > additional complication is that the retransmit queue doesn't contain the
> > original frames. As a result when tcp_simple_retransmit is called and
> > walks the list of frames in the queue it may not mark the frames as lost
> > because both the SYN and the data packet each individually are smaller than
> > the MSS size after the adjustment. This results in the socket being stalled
> > until the retransmit timer kicks in and forces the SYN frame out again
> > without the data attached.
> >
> > In order to resolve this we can generate our best estimate for the original
> > packet size by detecting the fastopen SYN frame and then adding the
> > overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> > have exceeded the MSS. If so we can mark the frame as lost and retransmit
> > it.
> >
> > Signed-off-by: Alexander Duyck <[email protected]>
> > ---
> > net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
> > 1 file changed, 27 insertions(+), 3 deletions(-)
> >
> > diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> > index 9e8a6c1aa019..79375b58de84 100644
> > --- a/net/ipv4/tcp_input.c
> > +++ b/net/ipv4/tcp_input.c
> > @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
> > void tcp_simple_retransmit(struct sock *sk)
> > {
> > const struct inet_connection_sock *icsk = inet_csk(sk);
> > + struct sk_buff *skb = tcp_rtx_queue_head(sk);
> > struct tcp_sock *tp = tcp_sk(sk);
> > - struct sk_buff *skb;
> > - unsigned int mss = tcp_current_mss(sk);
> > + unsigned int mss;
> > +
> > + /* A fastopen SYN request is stored as two separate packets within
> > + * the retransmit queue, this is done by tcp_send_syn_data().
> > + * As a result simply checking the MSS of the frames in the queue
> > + * will not work for the SYN packet. So instead we must make a best
> > + * effort attempt by validating the data frame with the mss size
> > + * that would be computed now by tcp_send_syn_data and comparing
> > + * that against the data frame that would have been included with
> > + * the SYN.
> > + */
> > + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> > + struct sk_buff *syn_data = skb_rb_next(skb);
> > +
> > + mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> > + tp->tcp_header_len - sizeof(struct tcphdr) -
> > + MAX_TCP_OPTION_SPACE;
> nice comment! The original syn_data mss needs to be inferred which is
> a hassle to get right. my sense is path-mtu issue is enough to warrant
> they are lost.
> I suggest simply mark syn & its data lost if tcp_simple_retransmit is
> called during TFO handshake, i.e.
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 62f7aabc7920..7f0c4f2947eb 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
> unsigned int mss = tcp_current_mss(sk);
>
> skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> - if (tcp_skb_seglen(skb) > mss)
> + if (tcp_skb_seglen(skb) > mss ||
> + (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
> tcp_mark_skb_lost(sk, skb);
> }
>
> We have a TFO packetdrill test that verifies my suggested fix should
> trigger an immediate retransmit vs 1s wait.
Okay, I will go that route, although I will still probably make one
minor cleanup. Instead of testing for syn_data and state per packet I
will probably keep the bit where I overwrite mss since it is only used
in the loop. What I can do is switch it from unsigned int to int since
technically tcp_current_mss and tcp_skb_seglen are both a signed int
anyway. Then I can just set mss to -1 in the syn_data && TCP_SYN_SENT
case. That way all of the frames in the ring should fail the check
while only having to add one initial check outside the loop.
On Fri, Dec 11, 2020 at 5:28 PM Alexander Duyck
<[email protected]> wrote:
>
> From: Alexander Duyck <[email protected]>
>
> There are cases where a fastopen SYN may trigger either a ICMP_TOOBIG
> message in the case of IPv6 or a fragmentation request in the case of
> IPv4. This results in the socket stalling for a second or more as it does
> not respond to the message by retransmitting the SYN frame.
>
> Normally a SYN frame should not be able to trigger a ICMP_TOOBIG or
> ICMP_FRAG_NEEDED however in the case of fastopen we can have a frame that
> makes use of the entire MSS. In the case of fastopen it does, and an
> additional complication is that the retransmit queue doesn't contain the
> original frames. As a result when tcp_simple_retransmit is called and
> walks the list of frames in the queue it may not mark the frames as lost
> because both the SYN and the data packet each individually are smaller than
> the MSS size after the adjustment. This results in the socket being stalled
> until the retransmit timer kicks in and forces the SYN frame out again
> without the data attached.
>
> In order to resolve this we can generate our best estimate for the original
> packet size by detecting the fastopen SYN frame and then adding the
> overhead for MAX_TCP_OPTION_SPACE and verifying if the SYN w/ data would
> have exceeded the MSS. If so we can mark the frame as lost and retransmit
> it.
>
> Signed-off-by: Alexander Duyck <[email protected]>
> ---
> net/ipv4/tcp_input.c | 30 +++++++++++++++++++++++++++---
> 1 file changed, 27 insertions(+), 3 deletions(-)
>
> diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
> index 9e8a6c1aa019..79375b58de84 100644
> --- a/net/ipv4/tcp_input.c
> +++ b/net/ipv4/tcp_input.c
> @@ -2686,11 +2686,35 @@ static void tcp_mtup_probe_success(struct sock *sk)
> void tcp_simple_retransmit(struct sock *sk)
> {
> const struct inet_connection_sock *icsk = inet_csk(sk);
> + struct sk_buff *skb = tcp_rtx_queue_head(sk);
> struct tcp_sock *tp = tcp_sk(sk);
> - struct sk_buff *skb;
> - unsigned int mss = tcp_current_mss(sk);
> + unsigned int mss;
> +
> + /* A fastopen SYN request is stored as two separate packets within
> + * the retransmit queue, this is done by tcp_send_syn_data().
> + * As a result simply checking the MSS of the frames in the queue
> + * will not work for the SYN packet. So instead we must make a best
> + * effort attempt by validating the data frame with the mss size
> + * that would be computed now by tcp_send_syn_data and comparing
> + * that against the data frame that would have been included with
> + * the SYN.
> + */
> + if (TCP_SKB_CB(skb)->tcp_flags & TCPHDR_SYN && tp->syn_data) {
> + struct sk_buff *syn_data = skb_rb_next(skb);
> +
> + mss = tcp_mtu_to_mss(sk, icsk->icsk_pmtu_cookie) +
> + tp->tcp_header_len - sizeof(struct tcphdr) -
> + MAX_TCP_OPTION_SPACE;
nice comment! The original syn_data mss needs to be inferred which is
a hassle to get right. my sense is path-mtu issue is enough to warrant
they are lost.
I suggest simply mark syn & its data lost if tcp_simple_retransmit is
called during TFO handshake, i.e.
diff --git a/net/ipv4/tcp_input.c b/net/ipv4/tcp_input.c
index 62f7aabc7920..7f0c4f2947eb 100644
--- a/net/ipv4/tcp_input.c
+++ b/net/ipv4/tcp_input.c
@@ -2864,7 +2864,8 @@ void tcp_simple_retransmit(struct sock *sk)
unsigned int mss = tcp_current_mss(sk);
skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
- if (tcp_skb_seglen(skb) > mss)
+ if (tcp_skb_seglen(skb) > mss ||
+ (tp->syn_data && sk->sk_state == TCP_SYN_SENT))
tcp_mark_skb_lost(sk, skb);
}
We have a TFO packetdrill test that verifies my suggested fix should
trigger an immediate retransmit vs 1s wait.
>
> - skb_rbtree_walk(skb, &sk->tcp_rtx_queue) {
> + if (syn_data && syn_data->len > mss)
> + tcp_mark_skb_lost(sk, skb);
> +
> + skb = syn_data;
> + } else {
> + mss = tcp_current_mss(sk);
> + }
> +
> + skb_rbtree_walk_from(skb) {
> if (tcp_skb_seglen(skb) > mss)
> tcp_mark_skb_lost(sk, skb);
> }
>
>