by Richard Gobert

Subject: Re: [PATCH net-next v4 4/4] net: gro: move L3 flush checks to tcp_gro_receive

Paolo Abeni wrote:
> On Tue, 2024-03-26 at 18:25 +0100, Richard Gobert wrote:
>> Paolo Abeni wrote:
>>> Hi,
>>>
>>> On Tue, 2024-03-26 at 16:02 +0100, Richard Gobert wrote:
>>>> This patch is meaningful by itself - removing checks against non-relevant
>>>> packets and making the flush/flush_id checks in a single place.
>>>
>>> I'm personally not sure this patch is a win. The code churn is
>>> significant. I understand this is for performance's sake, but I don't
>>> see the benefit???
>>>
>>
>> Could you clarify what you mean by code churn?
>
> The diffstat of this patch is not negligible and touches very sensitive
> areas.
>

The diff mainly touches flush/flush_id/is_atomic, and the new code should
be less complex. I agree this is sensitive, as it is part of core GRO.
I checked all relevant flows manually, but I can also create more
tests to ensure that the logic remains the same.

>>> The changelog shows that perf reports slightly lower figures for
>>> inet_gro_receive(). That is expected, as this patch moves code out of
>>> such function. What about inet_gro_flush()/tcp_gro_receive() where such
>>> code is moved?
>>>
>>
>> Please consider the following 2 common scenarios:
>>
>> 1) Multiple packets in the GRO bucket - the common case with multiple
>> packets in the bucket (i.e. running super_netperf TCP_STREAM) - each layer
>> executes a for loop - going over each packet in the bucket. Specifically,
>> L3 gro_receive loops over the bucket making flush,flush_id,is_atomic
>> checks.
>
> Only for packets with the same rx hash.
>

Right, but there are only 8 GRO buckets, so a collision can still happen
on multiple concurrent streams.
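
To illustrate the collision point, here is a minimal userspace sketch (not
kernel code). It assumes the bucket is chosen by masking the rx hash with
the bucket count minus one; gro_bucket(), same_bucket() and the hash values
mentioned below are made up for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Matches the "8 GRO buckets" mentioned above. */
#define GRO_HASH_BUCKETS 8

/* Assumption: the bucket index is the low bits of the rx hash. */
static unsigned int gro_bucket(uint32_t rx_hash)
{
	return rx_hash & (GRO_HASH_BUCKETS - 1);
}

/* Two flows share a GRO list whenever their hashes map to the same bucket. */
static bool same_bucket(uint32_t hash_a, uint32_t hash_b)
{
	return gro_bucket(hash_a) == gro_bucket(hash_b);
}
```

With 8 buckets only the low three bits of the hash matter, so hypothetical
hashes 0x1a2b3c45 and 0x99887765 both land in bucket 5, while 0x0badf00a
lands in bucket 2. Packets from the first two flows would then be walked by
each other's gro_receive loops.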

>> For most packets in the bucket, these checks are not
>> relevant. (possibly also dirtying cache lines with non-relevant
>> packets). Removing code in the for loop for this case is significant.
>>
>> 2) UDP/TCP streams which do not coalesce in GRO. This is the common case
>> for regular UDP connections (i.e. running netperf UDP_STREAM). In this
>> case, GRO is just overhead. Removing any code from these layers
>> is good (shown in the first measurement of the commit message).
>
> If UDP GRO is not enabled, there are no UDP packets staged in the UDP
> GRO engine; the bucket list is empty.
>
>>> Additionally, the reported deltas are within noise level according to my
>>> personal experience with similar tests.
>>>
>>
>> I've tested the difference between net-next and this patch repeatedly,
>> which showed stable results each time. Is there any specific test you
>> think would be helpful to show the result?
>
> Anything that shows measurable gain.
>
> Reporting the CPU utilization in the inet_gro_receive() function alone
> is not enough, as part of the load has been moved into
> gro_network_flush()/tcp_gro_receive().
>

Got it. The numbers I reported were only relevant to UDP flows (so
measuring with perf top and the -g flag showed the same improvement).
I'll post numbers relevant to TCP as well in v5.

Thanks

2024-03-25 18:42:02

by Richard Gobert

Subject: [PATCH net-next v4 3/4] net: gro: add {inner_}network_offset to napi_gro_cb

This patch adds network_offset and inner_network_offset to napi_gro_cb, and
makes sure both are set correctly. In the common path there's only one
write (skb_gro_reset_offset).

Signed-off-by: Richard Gobert <[email protected]>
---
include/net/gro.h | 18 ++++++++++++++++--
net/8021q/vlan_core.c | 2 ++
net/core/gro.c | 1 +
net/ethernet/eth.c | 1 +
net/ipv4/af_inet.c | 5 +----
net/ipv4/gre_offload.c | 1 +
net/ipv6/ip6_offload.c | 8 ++++----
7 files changed, 26 insertions(+), 10 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index ebead1d642b4..a1cc8e8c2ebd 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -87,6 +87,15 @@ struct napi_gro_cb {

/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
+
+ /* L3 offsets */
+ union {
+ struct {
+ u16 network_offset;
+ u16 inner_network_offset;
+ };
+ u16 network_offsets[2];
+ };
};

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
@@ -172,12 +181,17 @@ static inline void *skb_gro_header(struct sk_buff *skb, unsigned int hlen,
return ptr;
}

+static inline int skb_gro_network_offset(const struct sk_buff *skb)
+{
+ return NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark];
+}
+
static inline void *skb_gro_network_header(const struct sk_buff *skb)
{
if (skb_gro_may_pull(skb, skb_gro_offset(skb)))
- return skb_gro_header_fast(skb, skb_network_offset(skb));
+ return skb_gro_header_fast(skb, skb_gro_network_offset(skb));

- return skb_network_header(skb);
+ return skb->data + skb_gro_network_offset(skb);
}

static inline __wsum inet_gro_compute_pseudo(const struct sk_buff *skb,
diff --git a/net/8021q/vlan_core.c b/net/8021q/vlan_core.c
index 247704cf70af..355cafe23329 100644
--- a/net/8021q/vlan_core.c
+++ b/net/8021q/vlan_core.c
@@ -478,6 +478,8 @@ static struct sk_buff *vlan_gro_receive(struct list_head *head,
if (unlikely(!vhdr))
goto out;

+ NAPI_GRO_CB(skb)->network_offsets[NAPI_GRO_CB(skb)->encap_mark] = hlen;
+
type = vhdr->h_vlan_encapsulated_proto;

ptype = gro_find_receive_by_type(type);
diff --git a/net/core/gro.c b/net/core/gro.c
index 167935396c50..7c468ed805f4 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -370,6 +370,7 @@ static inline void skb_gro_reset_offset(struct sk_buff *skb, u32 nhoff)
const skb_frag_t *frag0;
unsigned int headlen;

+ NAPI_GRO_CB(skb)->network_offset = 0;
NAPI_GRO_CB(skb)->data_offset = 0;
headlen = skb_headlen(skb);
NAPI_GRO_CB(skb)->frag0 = skb->data;
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 7515e6bcbb7d..e3eca605bcc7 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -441,6 +441,7 @@ struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb)

skb_gro_pull(skb, sizeof(*eh));
skb_gro_postpull_rcsum(skb, eh, sizeof(*eh));
+ NAPI_GRO_CB(skb)->inner_network_offset = hlen;

pp = indirect_call_gro_receive_inet(ptype->callbacks.gro_receive,
ipv6_gro_receive, inet_gro_receive,
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 5a747d91cd0a..6546bf376b24 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1568,10 +1568,6 @@ struct sk_buff *inet_gro_receive(struct list_head *head, struct sk_buff *skb)

NAPI_GRO_CB(skb)->is_atomic = !!(iph->frag_off & htons(IP_DF));
NAPI_GRO_CB(skb)->flush |= flush;
- skb_set_network_header(skb, off);
- /* The above will be needed by the transport layer if there is one
- * immediately following this IP hdr.
- */

/* Note : No need to call skb_gro_postpull_rcsum() here,
* as we already checked checksum over ipv4 header was 0
@@ -1597,6 +1593,7 @@ static struct sk_buff *ipip_gro_receive(struct list_head *head,
}

NAPI_GRO_CB(skb)->encap_mark = 1;
+ NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);

return inet_gro_receive(head, skb);
}
diff --git a/net/ipv4/gre_offload.c b/net/ipv4/gre_offload.c
index d4520c3f7c09..ae596285d78c 100644
--- a/net/ipv4/gre_offload.c
+++ b/net/ipv4/gre_offload.c
@@ -224,6 +224,7 @@ static struct sk_buff *gre_gro_receive(struct list_head *head,
/* Adjusted NAPI_GRO_CB(skb)->csum after skb_gro_pull()*/
skb_gro_postpull_rcsum(skb, greh, grehlen);

+ NAPI_GRO_CB(skb)->inner_network_offset = hlen;
pp = call_gro_receive(ptype->callbacks.gro_receive, head, skb);
flush = 0;

diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 10ddbbc0e46d..ba41939537f2 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -67,7 +67,7 @@ static int ipv6_gro_pull_exthdrs(struct sk_buff *skb, int off, int proto)
off += len;
}

- skb_gro_pull(skb, off - skb_network_offset(skb));
+ skb_gro_pull(skb, off - skb_gro_network_offset(skb));
return proto;
}

@@ -236,8 +236,6 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
if (unlikely(!iph))
goto out;

- skb_set_network_header(skb, off);
-
flush += ntohs(iph->payload_len) != skb->len - hlen;

proto = iph->nexthdr;
@@ -259,7 +257,7 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
NAPI_GRO_CB(skb)->proto = proto;

flush--;
- nlen = skb_network_header_len(skb);
+ nlen = skb_gro_offset(skb) - off;

list_for_each_entry(p, head, list) {
const struct ipv6hdr *iph2;
@@ -327,6 +325,7 @@ static struct sk_buff *sit_ip6ip6_gro_receive(struct list_head *head,
}

NAPI_GRO_CB(skb)->encap_mark = 1;
+ NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);

return ipv6_gro_receive(head, skb);
}
@@ -342,6 +341,7 @@ static struct sk_buff *ip4ip6_gro_receive(struct list_head *head,
}

NAPI_GRO_CB(skb)->encap_mark = 1;
+ NAPI_GRO_CB(skb)->inner_network_offset = skb_gro_offset(skb);

return inet_gro_receive(head, skb);
}
--
2.36.1
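
For readers following the patch, the offset selection can be sketched in
userspace roughly as below. This is a simplified stand-in, not the kernel
structure: struct gro_offsets and current_network_offset() are hypothetical
names, encap_mark is widened to a plain integer, and all other napi_gro_cb
fields are left out.

```c
#include <stdint.h>

/* Simplified stand-in for the L3 offsets the patch adds to napi_gro_cb. */
struct gro_offsets {
	unsigned int encap_mark;	/* 0: outer header, 1: inside a tunnel */
	union {
		struct {
			uint16_t network_offset;	/* aliases network_offsets[0] */
			uint16_t inner_network_offset;	/* aliases network_offsets[1] */
		};
		uint16_t network_offsets[2];
	};
};

/* Mirrors the idea of skb_gro_network_offset(): encap_mark picks the slot. */
static int current_network_offset(const struct gro_offsets *cb)
{
	return cb->network_offsets[cb->encap_mark];
}
```

As an illustration with made-up values: if network_offset is 0 and
inner_network_offset is 20 (an outer IPv4 header without options), the
helper returns 0 while encap_mark is 0 and 20 once a tunnel handler has set
encap_mark to 1. The union lets callers index by encap_mark without a
branch.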