2023-10-24 02:35:57

by Yan Zhai

[permalink] [raw]
Subject: [PATCH v4 net-next 0/3] ipv6: avoid atomic fragment on GSO output

When the ipv6 stack output a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
Existing report from APNIC also shows that atomic fragments are more
likely to be dropped even it is equivalent to a no-op [1].

The series contains following changes:
* drop feature RTAX_FEATURE_ALLFRAG, which has been broken. This helps
simplifying other changes in this set.
* refactor __ip6_finish_output code to separate GSO and non-GSO packet
processing, mirroring IPv4 side logic.
* avoid generating atomic fragment on GSO packets.

Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1]

change log:
V3 -> V4: cleaned up all RTAX_FEATURE_ALLFRAG code, rather than just
drop the check at IPv6 output.
V2 -> V3: split the changes to separate commits as Willem de Bruijn suggested
V1 is incorrect and omitted

V3: https://lore.kernel.org/netdev/[email protected]/
V2: https://lore.kernel.org/netdev/ZS1%[email protected]/

Yan Zhai (3):
ipv6: drop feature RTAX_FEATURE_ALLFRAG
ipv6: refactor ip6_finish_output for GSO handling
ipv6: avoid atomic fragment on GSO packets

include/net/dst.h | 7 -----
include/net/inet_connection_sock.h | 1 -
include/net/inet_sock.h | 2 +-
include/uapi/linux/rtnetlink.h | 2 +-
net/ipv4/tcp_output.c | 20 +------------
net/ipv6/ip6_output.c | 45 ++++++++++++++++--------------
net/ipv6/tcp_ipv6.c | 1 -
net/ipv6/xfrm6_output.c | 2 +-
net/mptcp/subflow.c | 1 -
9 files changed, 28 insertions(+), 53 deletions(-)

--
2.30.2



2023-10-24 02:36:02

by Yan Zhai

[permalink] [raw]
Subject: [PATCH v4 net-next 1/3] ipv6: drop feature RTAX_FEATURE_ALLFRAG

RTAX_FEATURE_ALLFRAG was added before the first git commit:

https://www.mail-archive.com/[email protected]/msg03399.html

The feature would send packets to the fragmentation path if a box
receives a PMTU value with less than 1280 byte. However, since commit
9d289715eb5c ("ipv6: stop sending PTB packets for MTU < 1280"), such
message would be simply discarded. The feature flag is neither supported
in iproute2 utility. In theory one can still manipulate it with direct
netlink message, but it is not ideal because it was based on obsoleted
guidance of RFC-2460 (replaced by RFC-8200).

The feature would always test false at the moment, so remove related
code or mark them as unused.

Signed-off-by: Yan Zhai <[email protected]>
---
V3 -> V4: cleaned up all RTAX_FEATURE_ALLFRAG code, rather than just
drop the check at IPv6 output.

---
include/net/dst.h | 7 -------
include/net/inet_connection_sock.h | 1 -
include/net/inet_sock.h | 2 +-
include/uapi/linux/rtnetlink.h | 2 +-
net/ipv4/tcp_output.c | 20 +-------------------
net/ipv6/ip6_output.c | 15 ++-------------
net/ipv6/tcp_ipv6.c | 1 -
net/ipv6/xfrm6_output.c | 2 +-
net/mptcp/subflow.c | 1 -
9 files changed, 6 insertions(+), 45 deletions(-)

diff --git a/include/net/dst.h b/include/net/dst.h
index f8b8599a0600..f5dfc8fb7b37 100644
--- a/include/net/dst.h
+++ b/include/net/dst.h
@@ -222,13 +222,6 @@ static inline unsigned long dst_metric_rtt(const struct dst_entry *dst, int metr
return msecs_to_jiffies(dst_metric(dst, metric));
}

-static inline u32
-dst_allfrag(const struct dst_entry *dst)
-{
- int ret = dst_feature(dst, RTAX_FEATURE_ALLFRAG);
- return ret;
-}
-
static inline int
dst_metric_locked(const struct dst_entry *dst, int metric)
{
diff --git a/include/net/inet_connection_sock.h b/include/net/inet_connection_sock.h
index 086d1193c9ef..d0a2f827d5f2 100644
--- a/include/net/inet_connection_sock.h
+++ b/include/net/inet_connection_sock.h
@@ -44,7 +44,6 @@ struct inet_connection_sock_af_ops {
struct request_sock *req_unhash,
bool *own_req);
u16 net_header_len;
- u16 net_frag_header_len;
u16 sockaddr_len;
int (*setsockopt)(struct sock *sk, int level, int optname,
sockptr_t optval, unsigned int optlen);
diff --git a/include/net/inet_sock.h b/include/net/inet_sock.h
index 98e11958cdff..dedbc757b688 100644
--- a/include/net/inet_sock.h
+++ b/include/net/inet_sock.h
@@ -244,7 +244,7 @@ struct inet_sock {
};

#define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
-#define IPCORK_ALLFRAG 2 /* always fragment (for ipv6 for now) */
+#define IPCORK_ALLFRAG 2 /* (unused) always fragment (for ipv6 for now) */

enum {
INET_FLAGS_PKTINFO = 0,
diff --git a/include/uapi/linux/rtnetlink.h b/include/uapi/linux/rtnetlink.h
index aa2482a0614a..3b687d20c9ed 100644
--- a/include/uapi/linux/rtnetlink.h
+++ b/include/uapi/linux/rtnetlink.h
@@ -505,7 +505,7 @@ enum {
#define RTAX_FEATURE_ECN (1 << 0)
#define RTAX_FEATURE_SACK (1 << 1) /* unused */
#define RTAX_FEATURE_TIMESTAMP (1 << 2) /* unused */
-#define RTAX_FEATURE_ALLFRAG (1 << 3)
+#define RTAX_FEATURE_ALLFRAG (1 << 3) /* unused */
#define RTAX_FEATURE_TCP_USEC_TS (1 << 4)

#define RTAX_FEATURE_MASK (RTAX_FEATURE_ECN | \
diff --git a/net/ipv4/tcp_output.c b/net/ipv4/tcp_output.c
index 2866ccbccde0..ca4d7594efd4 100644
--- a/net/ipv4/tcp_output.c
+++ b/net/ipv4/tcp_output.c
@@ -1698,14 +1698,6 @@ static inline int __tcp_mtu_to_mss(struct sock *sk, int pmtu)
*/
mss_now = pmtu - icsk->icsk_af_ops->net_header_len - sizeof(struct tcphdr);

- /* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
- if (icsk->icsk_af_ops->net_frag_header_len) {
- const struct dst_entry *dst = __sk_dst_get(sk);
-
- if (dst && dst_allfrag(dst))
- mss_now -= icsk->icsk_af_ops->net_frag_header_len;
- }
-
/* Clamp it (mss_clamp does not include tcp options) */
if (mss_now > tp->rx_opt.mss_clamp)
mss_now = tp->rx_opt.mss_clamp;
@@ -1733,21 +1725,11 @@ int tcp_mss_to_mtu(struct sock *sk, int mss)
{
const struct tcp_sock *tp = tcp_sk(sk);
const struct inet_connection_sock *icsk = inet_csk(sk);
- int mtu;

- mtu = mss +
+ return mss +
tp->tcp_header_len +
icsk->icsk_ext_hdr_len +
icsk->icsk_af_ops->net_header_len;
-
- /* IPv6 adds a frag_hdr in case RTAX_FEATURE_ALLFRAG is set */
- if (icsk->icsk_af_ops->net_frag_header_len) {
- const struct dst_entry *dst = __sk_dst_get(sk);
-
- if (dst && dst_allfrag(dst))
- mtu += icsk->icsk_af_ops->net_frag_header_len;
- }
- return mtu;
}
EXPORT_SYMBOL(tcp_mss_to_mtu);

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 3c7de89d6755..86efd901ee5a 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -191,7 +191,6 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);

if ((skb->len > mtu && !skb_is_gso(skb)) ||
- dst_allfrag(skb_dst(skb)) ||
(IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
return ip6_fragment(net, sk, skb, ip6_finish_output2);
else
@@ -1017,9 +1016,6 @@ int ip6_fragment(struct net *net, struct sock *sk, struct sk_buff *skb,
return err;

fail_toobig:
- if (skb->sk && dst_allfrag(skb_dst(skb)))
- sk_gso_disable(skb->sk);
-
icmpv6_send(skb, ICMPV6_PKT_TOOBIG, 0, mtu);
err = -EMSGSIZE;

@@ -1384,10 +1380,7 @@ static int ip6_setup_cork(struct sock *sk, struct inet_cork_full *cork,
cork->base.mark = ipc6->sockc.mark;
sock_tx_timestamp(sk, ipc6->sockc.tsflags, &cork->base.tx_flags);

- if (dst_allfrag(xfrm_dst_path(&rt->dst)))
- cork->base.flags |= IPCORK_ALLFRAG;
cork->base.length = 0;
-
cork->base.transmit_time = ipc6->sockc.transmit_time;

return 0;
@@ -1444,8 +1437,6 @@ static int __ip6_append_data(struct sock *sk,

headersize = sizeof(struct ipv6hdr) +
(opt ? opt->opt_flen + opt->opt_nflen : 0) +
- (dst_allfrag(&rt->dst) ?
- sizeof(struct frag_hdr) : 0) +
rt->rt6i_nfheader_len;

if (mtu <= fragheaderlen ||
@@ -1555,7 +1546,7 @@ static int __ip6_append_data(struct sock *sk,

while (length > 0) {
/* Check if the remaining data fits into current packet. */
- copy = (cork->length <= mtu && !(cork->flags & IPCORK_ALLFRAG) ? mtu : maxfraglen) - skb->len;
+ copy = (cork->length <= mtu ? mtu : maxfraglen) - skb->len;
if (copy < length)
copy = maxfraglen - skb->len;

@@ -1586,7 +1577,7 @@ static int __ip6_append_data(struct sock *sk,
*/
datalen = length + fraggap;

- if (datalen > (cork->length <= mtu && !(cork->flags & IPCORK_ALLFRAG) ? mtu : maxfraglen) - fragheaderlen)
+ if (datalen > (cork->length <= mtu ? mtu : maxfraglen) - fragheaderlen)
datalen = maxfraglen - fragheaderlen - rt->dst.trailer_len;
fraglen = datalen + fragheaderlen;
pagedlen = 0;
@@ -1835,7 +1826,6 @@ static void ip6_cork_steal_dst(struct sk_buff *skb, struct inet_cork_full *cork)
struct dst_entry *dst = cork->base.dst;

cork->base.dst = NULL;
- cork->base.flags &= ~IPCORK_ALLFRAG;
skb_dst_set(skb, dst);
}

@@ -1856,7 +1846,6 @@ static void ip6_cork_release(struct inet_cork_full *cork,
if (cork->base.dst) {
dst_release(cork->base.dst);
cork->base.dst = NULL;
- cork->base.flags &= ~IPCORK_ALLFRAG;
}
}

diff --git a/net/ipv6/tcp_ipv6.c b/net/ipv6/tcp_ipv6.c
index 0c8a14ba104f..dc27988512a6 100644
--- a/net/ipv6/tcp_ipv6.c
+++ b/net/ipv6/tcp_ipv6.c
@@ -1895,7 +1895,6 @@ const struct inet_connection_sock_af_ops ipv6_specific = {
.conn_request = tcp_v6_conn_request,
.syn_recv_sock = tcp_v6_syn_recv_sock,
.net_header_len = sizeof(struct ipv6hdr),
- .net_frag_header_len = sizeof(struct frag_hdr),
.setsockopt = ipv6_setsockopt,
.getsockopt = ipv6_getsockopt,
.addr2sockaddr = inet6_csk_addr2sockaddr,
diff --git a/net/ipv6/xfrm6_output.c b/net/ipv6/xfrm6_output.c
index ad07904642ca..5f7b1fdbffe6 100644
--- a/net/ipv6/xfrm6_output.c
+++ b/net/ipv6/xfrm6_output.c
@@ -95,7 +95,7 @@ static int __xfrm6_output(struct net *net, struct sock *sk, struct sk_buff *skb)
return -EMSGSIZE;
}

- if (toobig || dst_allfrag(skb_dst(skb)))
+ if (toobig)
return ip6_fragment(net, sk, skb,
__xfrm6_output_finish);

diff --git a/net/mptcp/subflow.c b/net/mptcp/subflow.c
index 9c1f8d1d63d2..7064543b534d 100644
--- a/net/mptcp/subflow.c
+++ b/net/mptcp/subflow.c
@@ -2044,7 +2044,6 @@ void __init mptcp_subflow_init(void)
subflow_v6m_specific.send_check = ipv4_specific.send_check;
subflow_v6m_specific.net_header_len = ipv4_specific.net_header_len;
subflow_v6m_specific.mtu_reduced = ipv4_specific.mtu_reduced;
- subflow_v6m_specific.net_frag_header_len = 0;
subflow_v6m_specific.rebuild_header = subflow_rebuild_header;

tcpv6_prot_override = tcpv6_prot;
--
2.30.2


2023-10-24 02:36:20

by Yan Zhai

[permalink] [raw]
Subject: [PATCH v4 net-next 3/3] ipv6: avoid atomic fragment on GSO packets

When the ipv6 stack output a GSO packet, if its gso_size is larger than
dst MTU, then all segments would be fragmented. However, it is possible
for a GSO packet to have a trailing segment with smaller actual size
than both gso_size as well as the MTU, which leads to an "atomic
fragment". Atomic fragments are considered harmful in RFC-8021. An
Existing report from APNIC also shows that atomic fragments are more
likely to be dropped even it is equivalent to a no-op [1].

Add an extra check in the GSO slow output path. For each segment from
the original over-sized packet, if it fits with the path MTU, then avoid
generating an atomic fragment.

Link: https://www.potaroo.net/presentations/2022-03-01-ipv6-frag.pdf [1]
Fixes: b210de4f8c97 ("net: ipv6: Validate GSO SKB before finish IPv6 processing")
Reported-by: David Wragg <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>
---
net/ipv6/ip6_output.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 4010dd97aaf8..a722a43dd668 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -164,7 +164,13 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk,
int err;

skb_mark_not_on_list(segs);
- err = ip6_fragment(net, sk, segs, ip6_finish_output2);
+ /* Last GSO segment can be smaller than gso_size (and MTU).
+ * Adding a fragment header would produce an "atomic fragment",
+ * which is considered harmful (RFC-8021). Avoid that.
+ */
+ err = segs->len > mtu ?
+ ip6_fragment(net, sk, segs, ip6_finish_output2) :
+ ip6_finish_output2(net, sk, segs);
if (err && ret == 0)
ret = err;
}
--
2.30.2


2023-10-24 02:36:45

by Yan Zhai

[permalink] [raw]
Subject: [PATCH v4 net-next 2/3] ipv6: refactor ip6_finish_output for GSO handling

Separate GSO and non-GSO packets handling to make the logic cleaner. For
GSO packets, frag_max_size check can be omitted because it is only
useful for packets defragmented by netfilter hooks. Both local output
and GRO logic won't produce GSO packets when defragment is needed. This
also mirrors what IPv4 side code is doing.

Suggested-by: Florian Westphal <[email protected]>
Signed-off-by: Yan Zhai <[email protected]>
Reviewed-by: Willem de Bruijn <[email protected]>
---
net/ipv6/ip6_output.c | 22 +++++++++++++++-------
1 file changed, 15 insertions(+), 7 deletions(-)

diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index 86efd901ee5a..4010dd97aaf8 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -172,6 +172,16 @@ ip6_finish_output_gso_slowpath_drop(struct net *net, struct sock *sk,
return ret;
}

+static int ip6_finish_output_gso(struct net *net, struct sock *sk,
+ struct sk_buff *skb, unsigned int mtu)
+{
+ if (!(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
+ !skb_gso_validate_network_len(skb, mtu))
+ return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
+
+ return ip6_finish_output2(net, sk, skb);
+}
+
static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
{
unsigned int mtu;
@@ -185,16 +195,14 @@ static int __ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff
#endif

mtu = ip6_skb_dst_mtu(skb);
- if (skb_is_gso(skb) &&
- !(IP6CB(skb)->flags & IP6SKB_FAKEJUMBO) &&
- !skb_gso_validate_network_len(skb, mtu))
- return ip6_finish_output_gso_slowpath_drop(net, sk, skb, mtu);
+ if (skb_is_gso(skb))
+ return ip6_finish_output_gso(net, sk, skb, mtu);

- if ((skb->len > mtu && !skb_is_gso(skb)) ||
+ if (skb->len > mtu ||
(IP6CB(skb)->frag_max_size && skb->len > IP6CB(skb)->frag_max_size))
return ip6_fragment(net, sk, skb, ip6_finish_output2);
- else
- return ip6_finish_output2(net, sk, skb);
+
+ return ip6_finish_output2(net, sk, skb);
}

static int ip6_finish_output(struct net *net, struct sock *sk, struct sk_buff *skb)
--
2.30.2


2023-10-24 10:22:28

by Florian Westphal

[permalink] [raw]
Subject: Re: [PATCH v4 net-next 1/3] ipv6: drop feature RTAX_FEATURE_ALLFRAG

Yan Zhai <[email protected]> wrote:
> #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
> -#define IPCORK_ALLFRAG 2 /* always fragment (for ipv6 for now) */
> +#define IPCORK_ALLFRAG 2 /* (unused) always fragment (for ipv6 for now) */

Nit: Why not remove the ALLFRAG define as well?

Otherwise the series looks good to me, thanks!

Reviewed-by: Florian Westphal <[email protected]>

2023-10-24 10:31:00

by Eric Dumazet

[permalink] [raw]
Subject: Re: [PATCH v4 net-next 1/3] ipv6: drop feature RTAX_FEATURE_ALLFRAG

On Tue, Oct 24, 2023 at 12:22 PM Florian Westphal <[email protected]> wrote:
>
> Yan Zhai <[email protected]> wrote:
> > #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
> > -#define IPCORK_ALLFRAG 2 /* always fragment (for ipv6 for now) */
> > +#define IPCORK_ALLFRAG 2 /* (unused) always fragment (for ipv6 for now) */
>
> Nit: Why not remove the ALLFRAG define as well?

I agree, this is not exposed to user space and should be deleted.

Reviewed-by: Eric Dumazet <[email protected]>

>
> Otherwise the series looks good to me, thanks!
>
> Reviewed-by: Florian Westphal <[email protected]>
>

2023-10-24 14:29:20

by Yan Zhai

[permalink] [raw]
Subject: Re: [PATCH v4 net-next 1/3] ipv6: drop feature RTAX_FEATURE_ALLFRAG

On Tue, Oct 24, 2023 at 5:30 AM Eric Dumazet <[email protected]> wrote:
>
> On Tue, Oct 24, 2023 at 12:22 PM Florian Westphal <[email protected]> wrote:
> >
> > Yan Zhai <[email protected]> wrote:
> > > #define IPCORK_OPT 1 /* ip-options has been held in ipcork.opt */
> > > -#define IPCORK_ALLFRAG 2 /* always fragment (for ipv6 for now) */
> > > +#define IPCORK_ALLFRAG 2 /* (unused) always fragment (for ipv6 for now) */
> >
> > Nit: Why not remove the ALLFRAG define as well?
>
> I agree, this is not exposed to user space and should be deleted.
>
> Reviewed-by: Eric Dumazet <[email protected]>
>
> >
> > Otherwise the series looks good to me, thanks!
> >
> > Reviewed-by: Florian Westphal <[email protected]>
> >

I thought there was some convention of not deleting macros. I sent a
V5 to fix this up (not sure if it is the right approach to go) and
carried your review-by tags over since it's just a small change.
Appreciate if there are any more suggestions there.

thanks
Yan