2024-05-06 09:37:25

by Richard Gobert

[permalink] [raw]
Subject: [PATCH net-next v8 0/3] net: gro: remove network_header use, move p->{flush/flush_id} calculations to L4

The cb fields network_offset and inner_network_offset are used instead of
skb->network_header throughout GRO.

These fields are then leveraged in the next commit to remove flush_id state
from napi_gro_cb, and stateful code in {ipv6,inet}_gro_receive which may be
unnecessarily complicated due to encapsulation support in GRO. These fields
are checked in L4 instead.

3rd patch adds tests for different flush_id flows in GRO.

v7 -> v8:
- Remove network_header use in gro
- Re-send commits after the dependent patch to net was applied
- v7:
https://lore.kernel.org/all/[email protected]/

v6 -> v7:
- Moved bug fixes to a separate submission in net
- Added UDP fwd benchmark
- v6:
https://lore.kernel.org/all/[email protected]/

v5 -> v6:
- Write inner_network_offset in vxlan and geneve
- Ignore is_atomic when DF=0
- v5:
https://lore.kernel.org/all/[email protected]/

v4 -> v5:
- Add 1st commit - flush id checks in udp_gro_receive segment which can be
backported by itself
- Add TCP measurements for the 5th commit
- Add flush id tests to ensure flush id logic is preserved in GRO
- Simplify gro_inet_flush by removing a branch
- v4:
https://lore.kernel.org/all/[email protected]/

v3 -> v4:
- Fix code comment and commit message typos
- v3:
https://lore.kernel.org/all/[email protected]/

v2 -> v3:
- Use napi_gro_cb instead of skb->{offset}
- v2:
https://lore.kernel.org/all/[email protected]/

v1 -> v2:
- Pass p_off in *_gro_complete to fix UDP bug
- Remove more conditionals and memory fetches from inet_gro_flush
- v1:
https://lore.kernel.org/netdev/[email protected]/

Richard Gobert (3):
net: gro: use cb instead of skb->network_header
net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment
selftests/net: add flush id selftests

include/net/gro.h | 75 +++++++++++++--
net/core/gro.c | 3 -
net/ipv4/af_inet.c | 45 +--------
net/ipv4/tcp_offload.c | 18 +---
net/ipv4/udp_offload.c | 10 +-
net/ipv6/ip6_offload.c | 16 +---
net/ipv6/tcpv6_offload.c | 3 +-
tools/testing/selftests/net/gro.c | 147 ++++++++++++++++++++++++++++++
8 files changed, 225 insertions(+), 92 deletions(-)

--
2.36.1



2024-05-06 09:37:39

by Richard Gobert

[permalink] [raw]
Subject: [PATCH net-next v8 1/3] net: gro: use cb instead of skb->network_header

This patch converts references of skb->network_header to napi_gro_cb's
network_offset and inner_network_offset.

Signed-off-by: Richard Gobert <[email protected]>
---
include/net/gro.h | 9 +++++++--
net/ipv4/af_inet.c | 4 ----
net/ipv4/tcp_offload.c | 3 ++-
net/ipv6/ip6_offload.c | 5 ++---
net/ipv6/tcpv6_offload.c | 3 ++-
5 files changed, 13 insertions(+), 11 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index c1d4ca0463a1..1faff23b66e8 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -181,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
return ptr;
}

+static inline int skb_gro_network_offset(const struct sk_buff *skb)
+{
+ return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark];
+}
+
static inline void *skb_gro_network_header(const struct sk_buff *skb)
{
if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
- return skb_gro_header_fast(skb, skb_network_offset(skb));
+ return skb_gro_header_fast(skb, skb_gro_network_offset(skb));

- return skb_network_header(skb);
+ return skb->data + skb_gro_network_offset(skb);
}

static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index a7bad18bc8b5..428196e1541f 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1569,10 +1569,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)

NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
NAPI_GRO_CB(skb)->flush |= flush;
- skb_set_network_header(skb, off);
- /* The above will be needed by the transport layer if there is one
- * immediately following this IP hdr.
- */
NAPI_GRO_CB(skb)->inner_network_offset = off;

/* Note : No need to call skb_gro_postpull_rcsum() here,
diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index fab0973f995b..b70ae50e658d 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -330,7 +330,8 @@ struct sk_buff *tcp4_gro_receive(struct list_head *head, struct sk_buff *skb)

INDIRECT_CALLABLE_SCOPE int tcp4_gro_complete(struct sk_buff *skb, int thoff)
{
- const struct iphdr *iph = ip_hdr(skb);
+ const u16 offset = NAPI_GRO_CB(skb)->network_offsets[skb->encapsulation];
+ const struct iphdr *iph = (struct iphdr *)(skb->data + offset);
struct tcphdr *th = tcp_hdr(skb);

th->check = ~tcp_v4_check(skb->len - thoff, iph->saddr,
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index c8b909a9904f..5d6b875a4638 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -67,7 +67,7 @@ static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto)
off += len;
}

- skb_gro_pull(skb, off - skb_network_offset(skb));
+ skb_gro_pull(skb, off - skb_gro_network_offset(skb));
return proto;
}

@@ -236,7 +236,6 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
if (unlikely(!iph))
goto out;

- skb_set_network_header(skb, off);
NAPI_GRO_CB(skb)->inner_network_offset = off;

flush += ntohs(iph->payload_len) != skb->len - hlen;
@@ -260,7 +259,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
NAPI_GRO_CB(skb)->proto = proto;

flush--;
- nlen = skb_network_header_len(skb);
+ nlen = skb_gro_offset(skb) - off;

list_for_each_entry(p, head, list) {
const struct ipv6hdr *iph2;
diff --git a/net/ipv6/tcpv6_offload.c b/net/ipv6/tcpv6_offload.c
index 4b07d1e6c952..e70d60e0f86f 100644
--- a/net/ipv6/tcpv6_offload.c
+++ b/net/ipv6/tcpv6_offload.c
@@ -29,7 +29,8 @@ struct sk_buff *tcp6_gro_receive(struct list_head *head, struct sk_buff *skb)

INDIRECT_CALLABLE_SCOPE int tcp6_gro_complete(struct sk_buff *skb, int thoff)
{
- const struct ipv6hdr *iph = ipv6_hdr(skb);
+ const u16 offset = NAPI_GRO_CB(skb)->network_offsets[skb->encapsulation];
+ const struct ipv6hdr *iph = (struct ipv6hdr *)(skb->data + offset);
struct tcphdr *th = tcp_hdr(skb);

th->check = ~tcp_v6_check(skb->len - thoff, &iph->saddr,
--
2.36.1


2024-05-06 09:47:53

by Richard Gobert

[permalink] [raw]
Subject: [PATCH net-next v8 2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

{inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
iph->id, ...) against all packets in a loop. These flush checks are used in
all merging UDP and TCP flows.

These checks need to be done only once and only against the found p skb,
since they only affect flush and not same_flow.

This patch leverages correct network header offsets from the cb for both
outer and inner network headers - allowing these checks to be done only
once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
more declarative and contained in inet_gro_flush, thus removing the need
for flush_id in napi_gro_cb.

This results in less parsing code for non-loop flush tests for TCP and UDP
flows.

To make sure results are not within noise range - I've made netfilter drop
all TCP packets, and measured CPU performance in GRO (in this case GRO is
responsible for about 50% of the CPU utilization).

perf top while replaying 64 parallel IP/TCP streams merging in GRO:
(gro_network_flush is compiled inline to tcp_gro_receive)
net-next:
6.94% [kernel] [k] inet_gro_receive
3.02% [kernel] [k] tcp_gro_receive

patch applied:
4.27% [kernel] [k] tcp_gro_receive
4.22% [kernel] [k] inet_gro_receive

perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
results for any encapsulation, in this case inet_gro_receive is top
offender in net-next)
net-next:
10.09% [kernel] [k] inet_gro_receive
2.08% [kernel] [k] tcp_gro_receive

patch applied:
6.97% [kernel] [k] inet_gro_receive
3.68% [kernel] [k] tcp_gro_receive

Signed-off-by: Richard Gobert <[email protected]>
---
include/net/gro.h | 66 ++++++++++++++++++++++++++++++++++++++----
net/core/gro.c | 3 --
net/ipv4/af_inet.c | 41 +-------------------------
net/ipv4/tcp_offload.c | 15 ++--------
net/ipv4/udp_offload.c | 10 ++-----
net/ipv6/ip6_offload.c | 11 -------
6 files changed, 65 insertions(+), 81 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index 1faff23b66e8..0565dd716ab7 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -36,15 +36,15 @@ struct napi_gro_cb {
/* This is non-zero if the packet cannot be merged with the new skb. */
u16 flush;

- /* Save the IP ID here and check when we get to the transport layer */
- u16 flush_id;
-
/* Number of segments aggregated. */
u16 count;

/* Used in ipv6_gro_receive() and foo-over-udp and esp-in-udp */
u16 proto;

+ /* used to support CHECKSUM_COMPLETE for tunneling protocols */
+ __wsum csum;
+
/* Used in napi_gro_cb::free */
#define NAPI_GRO_FREE 1
#define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -85,9 +85,6 @@ struct napi_gro_cb {
u8 is_flist:1;
);

- /* used to support CHECKSUM_COMPLETE for tunneling protocols */
- __wsum csum;
-
/* L3 offsets */
union {
struct {
@@ -442,6 +439,63 @@ static inline __wsum ip6_gro_compute_pseudo(const struct sk_buff *skb,
skb_gro_len(skb), proto, 0));
}

+static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
+ struct sk_buff *p, bool outer)
+{
+ const u32 id = ntohl(*(__be32 *)&iph->id);
+ const u32 id2 = ntohl(*(__be32 *)&iph2->id);
+ const u16 flush_id = (id >> 16) - (id2 >> 16);
+ const u16 count = NAPI_GRO_CB(p)->count;
+ const u32 df = id & IP_DF;
+ u32 is_atomic;
+ int flush;
+
+ /* All fields must match except length and checksum. */
+ flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
+
+ if (outer && df)
+ return flush;
+
+ /* When we receive our second frame we can make a decision on if we
+ * continue this flow as an atomic flow with a fixed ID or if we use
+ * an incrementing ID.
+ */
+ NAPI_GRO_CB(p)->is_atomic |= (count == 1 && df && flush_id == 0);
+ is_atomic = (df && NAPI_GRO_CB(p)->is_atomic) - 1;
+
+ return flush | (flush_id ^ (count & is_atomic));
+}
+
+static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
+{
+ /* <Version:4><Traffic_Class:8><Flow_Label:20> */
+ __be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
+
+ /* Flush if Traffic Class fields are different. */
+ return !!((first_word & htonl(0x0FF00000)) |
+ (__force __be32)(iph->hop_limit ^ iph2->hop_limit));
+}
+
+static inline int gro_network_flush(const void *th, const void *th2, struct sk_buff *p, int off)
+{
+ const bool encap_mark = NAPI_GRO_CB(p)->encap_mark;
+ int flush = 0;
+ int i;
+
+ for (i = 0; i <= encap_mark; i++) {
+ const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
+ const void *nh = th - diff;
+ const void *nh2 = th2 - diff;
+
+ if (((struct iphdr *)nh)->version == 6)
+ flush |= ipv6_gro_flush(nh, nh2);
+ else
+ flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
+ }
+
+ return flush;
+}
+
int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);

/* Pass the currently batched GRO_NORMAL SKBs up to the stack. */
diff --git a/net/core/gro.c b/net/core/gro.c
index 99a45a5211c9..3e9422c23bc9 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -331,8 +331,6 @@ static void gro_list_prepare(const struct list_head *head,
list_for_each_entry(p, head, list) {
unsigned long diffs;

- NAPI_GRO_CB(p)->flush = 0;
-
if (hash != skb_get_hash_raw(p)) {
NAPI_GRO_CB(p)->same_flow = 0;
continue;
@@ -472,7 +470,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
sizeof(u32))); /* Avoid slow unaligned acc */
*(u32 *)&NAPI_GRO_CB(skb)->zeroed = 0;
NAPI_GRO_CB(skb)->flush = skb_has_frag_list(skb);
- NAPI_GRO_CB(skb)->is_atomic = 1;
NAPI_GRO_CB(skb)->count = 1;
if (unlikely(skb_is_gso(skb))) {
NAPI_GRO_CB(skb)->count = skb_shinfo(skb)->gso_segs;
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 428196e1541f..44564d009e95 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1482,7 +1482,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
struct sk_buff *p;
unsigned int hlen;
unsigned int off;
- unsigned int id;
int flush = 1;
int proto;

@@ -1508,13 +1507,10 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
goto out;

NAPI_GRO_CB(skb)->proto = proto;
- id = ntohl(*(__be32 *)&iph->id);
- flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (id & ~IP_DF));
- id >>= 16;
+ flush = (u16)((ntohl(*(__be32 *)iph) ^ skb_gro_len(skb)) | (ntohl(*(__be32 *)&iph->id) & ~IP_DF));

list_for_each_entry(p, head, list) {
struct iphdr *iph2;
- u16 flush_id;

if (!NAPI_GRO_CB(p)->same_flow)
continue;
@@ -1531,43 +1527,8 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)
NAPI_GRO_CB(p)->same_flow = 0;
continue;
}
-
- /* All fields must match except length and checksum. */
- NAPI_GRO_CB(p)->flush |=
- (iph->ttl ^ iph2->ttl) |
- (iph->tos ^ iph2->tos) |
- ((iph->frag_off ^ iph2->frag_off) & htons(IP_DF));
-
- NAPI_GRO_CB(p)->flush |= flush;
-
- /* We need to store of the IP ID check to be included later
- * when we can verify that this packet does in fact belong
- * to a given flow.
- */
- flush_id = (u16)(id - ntohs(iph2->id));
-
- /* This bit of code makes it much easier for us to identify
- * the cases where we are doing atomic vs non-atomic IP ID
- * checks. Specifically an atomic check can return IP ID
- * values 0 - 0xFFFF, while a non-atomic check can only
- * return 0 or 0xFFFF.
- */
- if (!NAPI_GRO_CB(p)->is_atomic ||
- !(iph->frag_off & htons(IP_DF))) {
- flush_id ^= NAPI_GRO_CB(p)->count;
- flush_id = flush_id ? 0xFFFF : 0;
- }
-
- /* If the previous IP ID value was based on an atomic
- * datagram we can overwrite the value and ignore it.
- */
- if (NAPI_GRO_CB(skb)->is_atomic)
- NAPI_GRO_CB(p)->flush_id = flush_id;
- else
- NAPI_GRO_CB(p)->flush_id |= flush_id;
}

- NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
NAPI_GRO_CB(skb)->flush |= flush;
NAPI_GRO_CB(skb)->inner_network_offset = off;

diff --git a/net/ipv4/tcp_offload.c b/net/ipv4/tcp_offload.c
index b70ae50e658d..625b4800b3ed 100644
--- a/net/ipv4/tcp_offload.c
+++ b/net/ipv4/tcp_offload.c
@@ -232,9 +232,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
goto out_check_final;

found:
- /* Include the IP ID check below from the inner most IP hdr */
- flush = NAPI_GRO_CB(p)->flush;
- flush |= (__force int)(flags & TCP_FLAG_CWR);
+ flush = (__force int)(flags & TCP_FLAG_CWR);
flush |= (__force int)((flags ^ tcp_flag_word(th2)) &
~(TCP_FLAG_CWR | TCP_FLAG_FIN | TCP_FLAG_PSH));
flush |= (__force int)(th->ack_seq ^ th2->ack_seq);
@@ -242,16 +240,7 @@ struct sk_buff *tcp_gro_receive(struct list_head *head, struct sk_buff *skb)
flush |= *(u32 *)((u8 *)th + i) ^
*(u32 *)((u8 *)th2 + i);

- /* When we receive our second frame we can made a decision on if we
- * continue this flow as an atomic flow with a fixed ID or if we use
- * an incrementing ID.
- */
- if (NAPI_GRO_CB(p)->flush_id != 1 ||
- NAPI_GRO_CB(p)->count != 1 ||
- !NAPI_GRO_CB(p)->is_atomic)
- flush |= NAPI_GRO_CB(p)->flush_id;
- else
- NAPI_GRO_CB(p)->is_atomic = false;
+ flush |= gro_network_flush(th, th2, p, off);

mss = skb_shinfo(p)->gso_size;

diff --git a/net/ipv4/udp_offload.c b/net/ipv4/udp_offload.c
index 8721fe5beca2..5d9696eaab8a 100644
--- a/net/ipv4/udp_offload.c
+++ b/net/ipv4/udp_offload.c
@@ -466,6 +466,7 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
struct sk_buff *skb)
{
struct udphdr *uh = udp_gro_udphdr(skb);
+ int off = skb_gro_offset(skb);
struct sk_buff *pp = NULL;
struct udphdr *uh2;
struct sk_buff *p;
@@ -505,14 +506,7 @@ static struct sk_buff *udp_gro_receive_segment(struct list_head *head,
return p;
}

- flush = NAPI_GRO_CB(p)->flush;
-
- if (NAPI_GRO_CB(p)->flush_id != 1 ||
- NAPI_GRO_CB(p)->count != 1 ||
- !NAPI_GRO_CB(p)->is_atomic)
- flush |= NAPI_GRO_CB(p)->flush_id;
- else
- NAPI_GRO_CB(p)->is_atomic = false;
+ flush = gro_network_flush(uh, uh2, p, off);

/* Terminate the flow on len mismatch or if it grow "too much".
* Under small packet flood GRO count could elsewhere grow a lot
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 5d6b875a4638..72991a02cb30 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -290,19 +290,8 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
nlen - sizeof(struct ipv6hdr)))
goto not_same_flow;
}
- /* flush if Traffic Class fields are different */
- NAPI_GRO_CB(p)->flush |= !!((first_word & htonl(0x0FF00000)) |
- (__force __be32)(iph->hop_limit ^ iph2->hop_limit));
- NAPI_GRO_CB(p)->flush |= flush;
-
- /* If the previous IP ID value was based on an atomic
- * datagram we can overwrite the value and ignore it.
- */
- if (NAPI_GRO_CB(skb)->is_atomic)
- NAPI_GRO_CB(p)->flush_id = 0;
}

- NAPI_GRO_CB(skb)->is_atomic = true;
NAPI_GRO_CB(skb)->flush |= flush;

skb_gro_postpull_rcsum(skb, iph, nlen);
--
2.36.1

2024-05-06 09:55:09

by Richard Gobert

[permalink] [raw]
Subject: [PATCH net-next v8 3/3] selftests/net: add flush id selftests

Added flush id selftests to test different cases where DF flag is set or
unset and id value changes in the following packets. All cases where the
packets should coalesce or should not coalesce are tested.

Signed-off-by: Richard Gobert <[email protected]>
---
tools/testing/selftests/net/gro.c | 147 ++++++++++++++++++++++++++++++
1 file changed, 147 insertions(+)

diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c
index 353e1e867fbb..5dc7b539ccbf 100644
--- a/tools/testing/selftests/net/gro.c
+++ b/tools/testing/selftests/net/gro.c
@@ -617,6 +617,123 @@ static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext
iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE);
}

+static void fix_ip4_checksum(struct iphdr *iph)
+{
+ iph->check = 0;
+ iph->check = checksum_fold(iph, sizeof(struct iphdr), 0);
+}
+
+static void send_flush_id_case(int fd, struct sockaddr_ll *daddr, int tcase)
+{
+ static char buf1[MAX_HDR_LEN + PAYLOAD_LEN];
+ static char buf2[MAX_HDR_LEN + PAYLOAD_LEN];
+ static char buf3[MAX_HDR_LEN + PAYLOAD_LEN];
+ bool send_three = false;
+ struct iphdr *iph1;
+ struct iphdr *iph2;
+ struct iphdr *iph3;
+
+ iph1 = (struct iphdr *)(buf1 + ETH_HLEN);
+ iph2 = (struct iphdr *)(buf2 + ETH_HLEN);
+ iph3 = (struct iphdr *)(buf3 + ETH_HLEN);
+
+ create_packet(buf1, 0, 0, PAYLOAD_LEN, 0);
+ create_packet(buf2, PAYLOAD_LEN, 0, PAYLOAD_LEN, 0);
+ create_packet(buf3, PAYLOAD_LEN * 2, 0, PAYLOAD_LEN, 0);
+
+ switch (tcase) {
+ case 0: /* DF=1, Incrementing - should coalesce */
+ iph1->frag_off |= htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off |= htons(IP_DF);
+ iph2->id = htons(9);
+ fix_ip4_checksum(iph2);
+ break;
+
+ case 1: /* DF=1, Fixed - should coalesce */
+ iph1->frag_off |= htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off |= htons(IP_DF);
+ iph2->id = htons(8);
+ fix_ip4_checksum(iph2);
+ break;
+
+ case 2: /* DF=0, Incrementing - should coalesce */
+ iph1->frag_off &= ~htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off &= ~htons(IP_DF);
+ iph2->id = htons(9);
+ fix_ip4_checksum(iph2);
+ break;
+
+ case 3: /* DF=0, Fixed - should not coalesce */
+ iph1->frag_off &= ~htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off &= ~htons(IP_DF);
+ iph2->id = htons(8);
+ fix_ip4_checksum(iph2);
+ break;
+
+ case 4: /* DF=1, two packets incrementing, and one fixed - should
+ * coalesce only the first two packets
+ */
+ iph1->frag_off |= htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off |= htons(IP_DF);
+ iph2->id = htons(9);
+ fix_ip4_checksum(iph2);
+
+ iph3->frag_off |= htons(IP_DF);
+ iph3->id = htons(9);
+ fix_ip4_checksum(iph3);
+ send_three = true;
+ break;
+
+ case 5: /* DF=1, two packets fixed, and one incrementing - should
+ * coalesce only the first two packets
+ */
+ iph1->frag_off |= htons(IP_DF);
+ iph1->id = htons(8);
+ fix_ip4_checksum(iph1);
+
+ iph2->frag_off |= htons(IP_DF);
+ iph2->id = htons(8);
+ fix_ip4_checksum(iph2);
+
+ iph3->frag_off |= htons(IP_DF);
+ iph3->id = htons(9);
+ fix_ip4_checksum(iph3);
+ send_three = true;
+ break;
+ }
+
+ write_packet(fd, buf1, total_hdr_len + PAYLOAD_LEN, daddr);
+ write_packet(fd, buf2, total_hdr_len + PAYLOAD_LEN, daddr);
+
+ if (send_three)
+ write_packet(fd, buf3, total_hdr_len + PAYLOAD_LEN, daddr);
+}
+
+static void test_flush_id(int fd, struct sockaddr_ll *daddr, char *fin_pkt)
+{
+ for (int i = 0; i < 6; i++) {
+ sleep(1);
+ send_flush_id_case(fd, daddr, i);
+ sleep(1);
+ write_packet(fd, fin_pkt, total_hdr_len, daddr);
+ }
+}
+
static void send_ipv6_exthdr(int fd, struct sockaddr_ll *daddr, char *ext_data1, char *ext_data2)
{
static char buf[MAX_HDR_LEN + PAYLOAD_LEN];
@@ -935,6 +1052,8 @@ static void gro_sender(void)
send_fragment4(txfd, &daddr);
sleep(1);
write_packet(txfd, fin_pkt, total_hdr_len, &daddr);
+
+ test_flush_id(txfd, &daddr, fin_pkt);
} else if (proto == PF_INET6) {
sleep(1);
send_fragment6(txfd, &daddr);
@@ -1061,6 +1180,34 @@ static void gro_receiver(void)

printf("fragmented ip4 doesn't coalesce: ");
check_recv_pkts(rxfd, correct_payload, 2);
+
+ /* is_atomic checks */
+ printf("DF=1, Incrementing - should coalesce: ");
+ correct_payload[0] = PAYLOAD_LEN * 2;
+ check_recv_pkts(rxfd, correct_payload, 1);
+
+ printf("DF=1, Fixed - should coalesce: ");
+ correct_payload[0] = PAYLOAD_LEN * 2;
+ check_recv_pkts(rxfd, correct_payload, 1);
+
+ printf("DF=0, Incrementing - should coalesce: ");
+ correct_payload[0] = PAYLOAD_LEN * 2;
+ check_recv_pkts(rxfd, correct_payload, 1);
+
+ printf("DF=0, Fixed - should not coalesce: ");
+ correct_payload[0] = PAYLOAD_LEN;
+ correct_payload[1] = PAYLOAD_LEN;
+ check_recv_pkts(rxfd, correct_payload, 2);
+
+ printf("DF=1, 2 Incrementing and one fixed - should coalesce only first 2 packets: ");
+ correct_payload[0] = PAYLOAD_LEN * 2;
+ correct_payload[1] = PAYLOAD_LEN;
+ check_recv_pkts(rxfd, correct_payload, 2);
+
+ printf("DF=1, 2 Fixed and one incrementing - should coalesce only first 2 packets: ");
+ correct_payload[0] = PAYLOAD_LEN * 2;
+ correct_payload[1] = PAYLOAD_LEN;
+ check_recv_pkts(rxfd, correct_payload, 2);
} else if (proto == PF_INET6) {
/* GRO doesn't check for ipv6 hop limit when flushing.
* Hence no corresponding test to the ipv4 case.
--
2.36.1


2024-05-06 16:46:33

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [PATCH net-next v8 1/3] net: gro: use cb instead of skb->network_header

Richard Gobert wrote:
> This patch converts references of skb->network_header to napi_gro_cb's
> network_offset and inner_network_offset.
>
> Signed-off-by: Richard Gobert <[email protected]>
> ---
> include/net/gro.h | 9 +++++++--
> net/ipv4/af_inet.c | 4 ----
> net/ipv4/tcp_offload.c | 3 ++-
> net/ipv6/ip6_offload.c | 5 ++---
> net/ipv6/tcpv6_offload.c | 3 ++-
> 5 files changed, 13 insertions(+), 11 deletions(-)
>
> diff --git a/include/net/gro.h b/include/net/gro.h
> index c1d4ca0463a1..1faff23b66e8 100644
> --- a/include/net/gro.h
> +++ b/include/net/gro.h
> @@ -181,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
> return ptr;
> }
>
> +static inline int skb_gro_network_offset(const struct sk_buff *skb)
> +{
> + return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark];
> +}
> +

The fact that .._receive sets encap_mark, but .._complete must read
encapsulation, due to the clear in udp_gro_complete, is non-obvious.

Can you add a comment to clarify this or rename this to
skb_gro_receive_network_offset?

> static inline void *skb_gro_network_header(const struct sk_buff *skb)
> {
> if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
> - return skb_gro_header_fast(skb, skb_network_offset(skb));
> + return skb_gro_header_fast(skb, skb_gro_network_offset(skb));
>
> - return skb_network_header(skb);
> + return skb->data + skb_gro_network_offset(skb);
> }

2024-05-06 17:20:30

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [PATCH net-next v8 2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

Richard Gobert wrote:
> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
> iph->id, ...) against all packets in a loop. These flush checks are used in
> all merging UDP and TCP flows.
>
> These checks need to be done only once and only against the found p skb,
> since they only affect flush and not same_flow.
>
> This patch leverages correct network header offsets from the cb for both
> outer and inner network headers - allowing these checks to be done only
> once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
> NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
> more declarative and contained in inet_gro_flush, thus removing the need
> for flush_id in napi_gro_cb.
>
> This results in less parsing code for non-loop flush tests for TCP and UDP
> flows.
>
> To make sure results are not within noise range - I've made netfilter drop
> all TCP packets, and measured CPU performance in GRO (in this case GRO is
> responsible for about 50% of the CPU utilization).
>
> perf top while replaying 64 parallel IP/TCP streams merging in GRO:
> (gro_network_flush is compiled inline to tcp_gro_receive)
> net-next:
> 6.94% [kernel] [k] inet_gro_receive
> 3.02% [kernel] [k] tcp_gro_receive
>
> patch applied:
> 4.27% [kernel] [k] tcp_gro_receive
> 4.22% [kernel] [k] inet_gro_receive
>
> perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
> results for any encapsulation, in this case inet_gro_receive is top
> offender in net-next)
> net-next:
> 10.09% [kernel] [k] inet_gro_receive
> 2.08% [kernel] [k] tcp_gro_receive
>
> patch applied:
> 6.97% [kernel] [k] inet_gro_receive
> 3.68% [kernel] [k] tcp_gro_receive

Thanks for getting the additional numbers. The savings are not huge.

But +1 on the change also because it simplifies this non-obvious
logic. It makes sense to separate flow matching and flush logic.

Btw please include Alexander Duyck in the Cc: of this series.
> +static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
> + struct sk_buff *p, bool outer)
> +{
> + const u32 id = ntohl(*(__be32 *)&iph->id);
> + const u32 id2 = ntohl(*(__be32 *)&iph2->id);
> + const u16 flush_id = (id >> 16) - (id2 >> 16);
> + const u16 count = NAPI_GRO_CB(p)->count;
> + const u32 df = id & IP_DF;
> + u32 is_atomic;
> + int flush;
> +
> + /* All fields must match except length and checksum. */
> + flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
> +
> + if (outer && df)
> + return flush;

Does the fixed id logic apply equally to inner and outer IPv4?

> +
> + /* When we receive our second frame we can make a decision on if we
> + * continue this flow as an atomic flow with a fixed ID or if we use
> + * an incrementing ID.
> + */
> + NAPI_GRO_CB(p)->is_atomic |= (count == 1 && df && flush_id == 0);
> + is_atomic = (df && NAPI_GRO_CB(p)->is_atomic) - 1;
> +
> + return flush | (flush_id ^ (count & is_atomic));

This is a good time to consider making this logical more obvious.

First off, the flush check can be part of the outer && df above, as
flush is not modified after.

Subjective, but I find the following more readable, and not worth
saving a few branches.

if (count == 1 && df && !flush_id)
NAPI_GRO_CB(p)->is_atomic = true;

ip_fixedid_matches = NAPI_GRO_CB(p)->is_atomic ^ df;
ipid_offset_matches = ipid_offset - count;

return ip_fixedid_matches & ipid_offset_matches;

Have to be a bit careful about types. Have not checked that in detail.

And while nitpicking:
ipid_offset may be a more descriptive variable name than flush_id, and
ip_fixedid than is_atomic. If changing those does not result in a lot
of code churn.

> +}
> +
> +static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
> +{
> + /* <Version:4><Traffic_Class:8><Flow_Label:20> */
> + __be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
> +
> + /* Flush if Traffic Class fields are different. */
> + return !!((first_word & htonl(0x0FF00000)) |
> + (__force __be32)(iph->hop_limit ^ iph2->hop_limit));
> +}
> +
> +static inline int gro_network_flush(const void *th, const void *th2, struct sk_buff *p, int off)
> +{
> + const bool encap_mark = NAPI_GRO_CB(p)->encap_mark;

Is this correct when udp_gro_complete clears this for tunnels?

> + int flush = 0;
> + int i;
> +
> + for (i = 0; i <= encap_mark; i++) {
> + const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
> + const void *nh = th - diff;
> + const void *nh2 = th2 - diff;
> +
> + if (((struct iphdr *)nh)->version == 6)
> + flush |= ipv6_gro_flush(nh, nh2);
> + else
> + flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
> + }

Maybe slightly better for branch prediction, and more obvious, if
creating a helper function __gro_network_flush and calling

__gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[0])
if (NAPI_GRO_CB(p)->encap_mark)
__gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[1])

> +
> + return flush;
> +}
> +
> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
>

2024-05-06 17:26:37

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [PATCH net-next v8 3/3] selftests/net: add flush id selftests

Richard Gobert wrote:
> Added flush id selftests to test different cases where DF flag is set or
> unset and id value changes in the following packets. All cases where the
> packets should coalesce or should not coalesce are tested.
>
> Signed-off-by: Richard Gobert <[email protected]>
> ---
> tools/testing/selftests/net/gro.c | 147 ++++++++++++++++++++++++++++++
> 1 file changed, 147 insertions(+)
>
> diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c
> index 353e1e867fbb..5dc7b539ccbf 100644
> --- a/tools/testing/selftests/net/gro.c
> +++ b/tools/testing/selftests/net/gro.c
> @@ -617,6 +617,123 @@ static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext
> iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE);
> }
>
> +static void fix_ip4_checksum(struct iphdr *iph)
> +{
> + iph->check = 0;
> + iph->check = checksum_fold(iph, sizeof(struct iphdr), 0);
> +}
> +
> +static void send_flush_id_case(int fd, struct sockaddr_ll *daddr, int tcase)
> +{
> + static char buf1[MAX_HDR_LEN + PAYLOAD_LEN];
> + static char buf2[MAX_HDR_LEN + PAYLOAD_LEN];
> + static char buf3[MAX_HDR_LEN + PAYLOAD_LEN];
> + bool send_three = false;
> + struct iphdr *iph1;
> + struct iphdr *iph2;
> + struct iphdr *iph3;
> +
> + iph1 = (struct iphdr *)(buf1 + ETH_HLEN);
> + iph2 = (struct iphdr *)(buf2 + ETH_HLEN);
> + iph3 = (struct iphdr *)(buf3 + ETH_HLEN);
> +
> + create_packet(buf1, 0, 0, PAYLOAD_LEN, 0);
> + create_packet(buf2, PAYLOAD_LEN, 0, PAYLOAD_LEN, 0);
> + create_packet(buf3, PAYLOAD_LEN * 2, 0, PAYLOAD_LEN, 0);
> +
> + switch (tcase) {
> + case 0: /* DF=1, Incrementing - should coalesce */
> + iph1->frag_off |= htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off |= htons(IP_DF);
> + iph2->id = htons(9);
> + fix_ip4_checksum(iph2);
> + break;
> +
> + case 1: /* DF=1, Fixed - should coalesce */
> + iph1->frag_off |= htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off |= htons(IP_DF);
> + iph2->id = htons(8);
> + fix_ip4_checksum(iph2);
> + break;
> +
> + case 2: /* DF=0, Incrementing - should coalesce */
> + iph1->frag_off &= ~htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off &= ~htons(IP_DF);
> + iph2->id = htons(9);
> + fix_ip4_checksum(iph2);
> + break;
> +
> + case 3: /* DF=0, Fixed - should not coalesce */
> + iph1->frag_off &= ~htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off &= ~htons(IP_DF);
> + iph2->id = htons(8);
> + fix_ip4_checksum(iph2);
> + break;
> +
> + case 4: /* DF=1, two packets incrementing, and one fixed - should
> + * coalesce only the first two packets
> + */
> + iph1->frag_off |= htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off |= htons(IP_DF);
> + iph2->id = htons(9);
> + fix_ip4_checksum(iph2);
> +
> + iph3->frag_off |= htons(IP_DF);
> + iph3->id = htons(9);
> + fix_ip4_checksum(iph3);
> + send_three = true;
> + break;
> +
> + case 5: /* DF=1, two packets fixed, and one incrementing - should
> + * coalesce only the first two packets
> + */
> + iph1->frag_off |= htons(IP_DF);
> + iph1->id = htons(8);
> + fix_ip4_checksum(iph1);
> +
> + iph2->frag_off |= htons(IP_DF);
> + iph2->id = htons(8);
> + fix_ip4_checksum(iph2);
> +
> + iph3->frag_off |= htons(IP_DF);
> + iph3->id = htons(9);
> + fix_ip4_checksum(iph3);
> + send_three = true;
> + break;
> + }

Consider moving the fix_ip4_checksum calls out of the switch to reduce
duplication.

> +
> + write_packet(fd, buf1, total_hdr_len + PAYLOAD_LEN, daddr);
> + write_packet(fd, buf2, total_hdr_len + PAYLOAD_LEN, daddr);
> +
> + if (send_three)
> + write_packet(fd, buf3, total_hdr_len + PAYLOAD_LEN, daddr);
> +}
> +
> +static void test_flush_id(int fd, struct sockaddr_ll *daddr, char *fin_pkt)
> +{
> + for (int i = 0; i < 6; i++) {

Please avoid unnamed magic constants. Something like

const int num_flush_id_cases = 6; /* See switch in send_flush_id_case */

Or even define an enum with named tests and and _MAX val. It's
verbose, but helpful to readers.

> + sleep(1);
> + send_flush_id_case(fd, daddr, i);
> + sleep(1);
> + write_packet(fd, fin_pkt, total_hdr_len, daddr);
> + }
> +}
> +

2024-05-07 16:21:41

by Richard Gobert

[permalink] [raw]
Subject: Re: [PATCH net-next v8 2/3] net: gro: move L3 flush checks to tcp_gro_receive and udp_gro_receive_segment

Willem de Bruijn wrote:
> Richard Gobert wrote:
>> {inet,ipv6}_gro_receive functions perform flush checks (ttl, flags,
>> iph->id, ...) against all packets in a loop. These flush checks are used in
>> all merging UDP and TCP flows.
>>
>> These checks need to be done only once and only against the found p skb,
>> since they only affect flush and not same_flow.
>>
>> This patch leverages correct network header offsets from the cb for both
>> outer and inner network headers - allowing these checks to be done only
>> once, in tcp_gro_receive and udp_gro_receive_segment. As a result,
>> NAPI_GRO_CB(p)->flush is not used at all. In addition, flush_id checks are
>> more declarative and contained in inet_gro_flush, thus removing the need
>> for flush_id in napi_gro_cb.
>>
>> This results in less parsing code for non-loop flush tests for TCP and UDP
>> flows.
>>
>> To make sure results are not within noise range - I've made netfilter drop
>> all TCP packets, and measured CPU performance in GRO (in this case GRO is
>> responsible for about 50% of the CPU utilization).
>>
>> perf top while replaying 64 parallel IP/TCP streams merging in GRO:
>> (gro_network_flush is compiled inline to tcp_gro_receive)
>> net-next:
>> 6.94% [kernel] [k] inet_gro_receive
>> 3.02% [kernel] [k] tcp_gro_receive
>>
>> patch applied:
>> 4.27% [kernel] [k] tcp_gro_receive
>> 4.22% [kernel] [k] inet_gro_receive
>>
>> perf top while replaying 64 parallel IP/IP/TCP streams merging in GRO (same
>> results for any encapsulation, in this case inet_gro_receive is top
>> offender in net-next)
>> net-next:
>> 10.09% [kernel] [k] inet_gro_receive
>> 2.08% [kernel] [k] tcp_gro_receive
>>
>> patch applied:
>> 6.97% [kernel] [k] inet_gro_receive
>> 3.68% [kernel] [k] tcp_gro_receive
>
> Thanks for getting the additional numbers. The savings are not huge.
>
> But +1 on the change also because it simplifies this non-obvious
> logic. It makes sense to separate flow matching and flush logic.
>
> Btw please include Alexander Duyck in the Cc: of this series.

Thanks, will do that when I re-post.

>> +static inline int inet_gro_flush(const struct iphdr *iph, const struct iphdr *iph2,
>> + struct sk_buff *p, bool outer)
>> +{
>> + const u32 id = ntohl(*(__be32 *)&iph->id);
>> + const u32 id2 = ntohl(*(__be32 *)&iph2->id);
>> + const u16 flush_id = (id >> 16) - (id2 >> 16);
>> + const u16 count = NAPI_GRO_CB(p)->count;
>> + const u32 df = id & IP_DF;
>> + u32 is_atomic;
>> + int flush;
>> +
>> + /* All fields must match except length and checksum. */
>> + flush = (iph->ttl ^ iph2->ttl) | (iph->tos ^ iph2->tos) | (df ^ (id2 & IP_DF));
>> +
>> + if (outer && df)
>> + return flush;
>
> Does the fixed id logic apply equally to inner and outer IPv4?
>

Fixed id logic is not applied equally to inner and outer IPv4. innermost
IDs are checked, but outer IPv4 IDs are not checked at all if DF is set.
This is the current logic in the code and this patch preserves it. To my
understanding this is explained as intentional by the original commit author
(20160407223218.11142.26592.stgit@ahduyck-xeon-server)

Alexander - could you maybe elaborate further?

>> +
>> + /* When we receive our second frame we can make a decision on if we
>> + * continue this flow as an atomic flow with a fixed ID or if we use
>> + * an incrementing ID.
>> + */
>> + NAPI_GRO_CB(p)->is_atomic |= (count == 1 && df && flush_id == 0);
>> + is_atomic = (df && NAPI_GRO_CB(p)->is_atomic) - 1;
>> +
>> + return flush | (flush_id ^ (count & is_atomic));
>
> This is a good time to consider making this logical more obvious.
>
> First off, the flush check can be part of the outer && df above, as
> flush is not modified after.
>
> Subjective, but I find the following more readable, and not worth
> saving a few branches.
>
> if (count == 1 && df && !flush_id)
> NAPI_GRO_CB(p)->is_atomic = true;
>
> ip_fixedid_matches = NAPI_GRO_CB(p)->is_atomic ^ df;
> ipid_offset_matches = ipid_offset - count;
>
> return ip_fixedid_matches & ipid_offset_matches;
>
> Have to be a bit careful about types. Have not checked that in detail.
>

ip_fixedid_matches should also account for checking whether flush_id is 0
or not, if we're going for readability I'd suggest checking "which check"
should be done (fixedid or offset) and do only the appropriate one.

if (count == 1 && df && !ipid_offset)
NAPI_GRO_CB(p)->is_atomic = true;

if (NAPI_GRO_CB(p)->is_atomic && df)
return flush | ipid_offset;

return flush | (ipid_offset ^ count);

> And while nitpicking:
> ipid_offset may be a more descriptive variable name than flush_id, and
> ip_fixedid than is_atomic. If changing those does not result in a lot
> of code churn.
>

I also think is_atomic is not the best name, I'll change both names in the
next patch.

>> +}
>> +
>> +static inline int ipv6_gro_flush(const struct ipv6hdr *iph, const struct ipv6hdr *iph2)
>> +{
>> + /* <Version:4><Traffic_Class:8><Flow_Label:20> */
>> + __be32 first_word = *(__be32 *)iph ^ *(__be32 *)iph2;
>> +
>> + /* Flush if Traffic Class fields are different. */
>> + return !!((first_word & htonl(0x0FF00000)) |
>> + (__force __be32)(iph->hop_limit ^ iph2->hop_limit));
>> +}
>> +
>> +static inline int gro_network_flush(const void *th, const void *th2, struct sk_buff *p, int off)
>> +{
>> + const bool encap_mark = NAPI_GRO_CB(p)->encap_mark;
>
> Is this correct when udp_gro_complete clears this for tunnels?
>

gro_network_flush is called in receive flow, so udp_gro_complete cannot be
called after gro_network_flush. I think the function name should be changed to
gro_receive_network_flush.

>> + int flush = 0;
>> + int i;
>> +
>> + for (i = 0; i <= encap_mark; i++) {
>> + const u16 diff = off - NAPI_GRO_CB(p)->network_offsets[i];
>> + const void *nh = th - diff;
>> + const void *nh2 = th2 - diff;
>> +
>> + if (((struct iphdr *)nh)->version == 6)
>> + flush |= ipv6_gro_flush(nh, nh2);
>> + else
>> + flush |= inet_gro_flush(nh, nh2, p, i != encap_mark);
>> + }
>
> Maybe slightly better for branch prediction, and more obvious, if
> creating a helper function __gro_network_flush and calling
>
> __gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[0])
> if (NAPI_GRO_CB(p)->encap_mark)
> __gro_network_flush(th, th2, p, off - NAPI_GRO_CB(p)->network_offsets[1])
>
>> +
>> + return flush;
>> +}
>> +
>> int skb_gro_receive(struct sk_buff *p, struct sk_buff *skb);
>>

2024-05-07 16:22:51

by Richard Gobert

[permalink] [raw]
Subject: Re: [PATCH net-next v8 3/3] selftests/net: add flush id selftests

Willem de Bruijn wrote:
> Richard Gobert wrote:
>> Added flush id selftests to test different cases where DF flag is set or
>> unset and id value changes in the following packets. All cases where the
>> packets should coalesce or should not coalesce are tested.
>>
>> Signed-off-by: Richard Gobert <[email protected]>
>> ---
>> tools/testing/selftests/net/gro.c | 147 ++++++++++++++++++++++++++++++
>> 1 file changed, 147 insertions(+)
>>
>> diff --git a/tools/testing/selftests/net/gro.c b/tools/testing/selftests/net/gro.c
>> index 353e1e867fbb..5dc7b539ccbf 100644
>> --- a/tools/testing/selftests/net/gro.c
>> +++ b/tools/testing/selftests/net/gro.c
>> @@ -617,6 +617,123 @@ static void add_ipv6_exthdr(void *buf, void *optpkt, __u8 exthdr_type, char *ext
>> iph->payload_len = htons(ntohs(iph->payload_len) + MIN_EXTHDR_SIZE);
>> }
>>
>> +static void fix_ip4_checksum(struct iphdr *iph)
>> +{
>> + iph->check = 0;
>> + iph->check = checksum_fold(iph, sizeof(struct iphdr), 0);
>> +}
>> +
>> +static void send_flush_id_case(int fd, struct sockaddr_ll *daddr, int tcase)
>> +{
>> + static char buf1[MAX_HDR_LEN + PAYLOAD_LEN];
>> + static char buf2[MAX_HDR_LEN + PAYLOAD_LEN];
>> + static char buf3[MAX_HDR_LEN + PAYLOAD_LEN];
>> + bool send_three = false;
>> + struct iphdr *iph1;
>> + struct iphdr *iph2;
>> + struct iphdr *iph3;
>> +
>> + iph1 = (struct iphdr *)(buf1 + ETH_HLEN);
>> + iph2 = (struct iphdr *)(buf2 + ETH_HLEN);
>> + iph3 = (struct iphdr *)(buf3 + ETH_HLEN);
>> +
>> + create_packet(buf1, 0, 0, PAYLOAD_LEN, 0);
>> + create_packet(buf2, PAYLOAD_LEN, 0, PAYLOAD_LEN, 0);
>> + create_packet(buf3, PAYLOAD_LEN * 2, 0, PAYLOAD_LEN, 0);
>> +
>> + switch (tcase) {
>> + case 0: /* DF=1, Incrementing - should coalesce */
>> + iph1->frag_off |= htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off |= htons(IP_DF);
>> + iph2->id = htons(9);
>> + fix_ip4_checksum(iph2);
>> + break;
>> +
>> + case 1: /* DF=1, Fixed - should coalesce */
>> + iph1->frag_off |= htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off |= htons(IP_DF);
>> + iph2->id = htons(8);
>> + fix_ip4_checksum(iph2);
>> + break;
>> +
>> + case 2: /* DF=0, Incrementing - should coalesce */
>> + iph1->frag_off &= ~htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off &= ~htons(IP_DF);
>> + iph2->id = htons(9);
>> + fix_ip4_checksum(iph2);
>> + break;
>> +
>> + case 3: /* DF=0, Fixed - should not coalesce */
>> + iph1->frag_off &= ~htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off &= ~htons(IP_DF);
>> + iph2->id = htons(8);
>> + fix_ip4_checksum(iph2);
>> + break;
>> +
>> + case 4: /* DF=1, two packets incrementing, and one fixed - should
>> + * coalesce only the first two packets
>> + */
>> + iph1->frag_off |= htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off |= htons(IP_DF);
>> + iph2->id = htons(9);
>> + fix_ip4_checksum(iph2);
>> +
>> + iph3->frag_off |= htons(IP_DF);
>> + iph3->id = htons(9);
>> + fix_ip4_checksum(iph3);
>> + send_three = true;
>> + break;
>> +
>> + case 5: /* DF=1, two packets fixed, and one incrementing - should
>> + * coalesce only the first two packets
>> + */
>> + iph1->frag_off |= htons(IP_DF);
>> + iph1->id = htons(8);
>> + fix_ip4_checksum(iph1);
>> +
>> + iph2->frag_off |= htons(IP_DF);
>> + iph2->id = htons(8);
>> + fix_ip4_checksum(iph2);
>> +
>> + iph3->frag_off |= htons(IP_DF);
>> + iph3->id = htons(9);
>> + fix_ip4_checksum(iph3);
>> + send_three = true;
>> + break;
>> + }
>
> Consider moving the fix_ip4_checksum calls out of the switch to reduce
> duplication.
>
>> +
>> + write_packet(fd, buf1, total_hdr_len + PAYLOAD_LEN, daddr);
>> + write_packet(fd, buf2, total_hdr_len + PAYLOAD_LEN, daddr);
>> +
>> + if (send_three)
>> + write_packet(fd, buf3, total_hdr_len + PAYLOAD_LEN, daddr);
>> +}
>> +
>> +static void test_flush_id(int fd, struct sockaddr_ll *daddr, char *fin_pkt)
>> +{
>> + for (int i = 0; i < 6; i++) {
>
> Please avoid unnamed magic constants. Something like
>
> const int num_flush_id_cases = 6; /* See switch in send_flush_id_case */
>

Will do that, thanks for the review!

> Or even define an enum with named tests and and _MAX val. It's
> verbose, but helpful to readers.
>
>> + sleep(1);
>> + send_flush_id_case(fd, daddr, i);
>> + sleep(1);
>> + write_packet(fd, fin_pkt, total_hdr_len, daddr);
>> + }
>> +}
>> +

2024-05-07 16:37:41

by Richard Gobert

[permalink] [raw]
Subject: Re: [PATCH net-next v8 1/3] net: gro: use cb instead of skb->network_header

Willem de Bruijn wrote:
> Richard Gobert wrote:
>> This patch converts references of skb->network_header to napi_gro_cb's
>> network_offset and inner_network_offset.
>>
>> Signed-off-by: Richard Gobert <[email protected]>
>> ---
>> include/net/gro.h | 9 +++++++--
>> net/ipv4/af_inet.c | 4 ----
>> net/ipv4/tcp_offload.c | 3 ++-
>> net/ipv6/ip6_offload.c | 5 ++---
>> net/ipv6/tcpv6_offload.c | 3 ++-
>> 5 files changed, 13 insertions(+), 11 deletions(-)
>>
>> diff --git a/include/net/gro.h b/include/net/gro.h
>> index c1d4ca0463a1..1faff23b66e8 100644
>> --- a/include/net/gro.h
>> +++ b/include/net/gro.h
>> @@ -181,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
>> return ptr;
>> }
>>
>> +static inline int skb_gro_network_offset(const struct sk_buff *skb)
>> +{
>> + return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark];
>> +}
>> +
>
> The fact that .._receive sets encap_mark, but .._complete must read
> encapsulation, due to the clear in udp_gro_complete, is non-obvious.
>
> Can you add a comment to clarify this or rename this to
> skb_gro_receive_network_offset?
>

I'll rename that to skb_gro_receive_network_offset for better clarity,
thanks.

>> static inline void *skb_gro_network_header(const struct sk_buff *skb)
>> {
>> if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
>> - return skb_gro_header_fast(skb, skb_network_offset(skb));
>> + return skb_gro_header_fast(skb, skb_gro_network_offset(skb));
>>
>> - return skb_network_header(skb);
>> + return skb->data + skb_gro_network_offset(skb);
>> }