2023-03-13 16:26:15

by Richard Gobert

Subject: [PATCH v3 0/2] gro: optimise redundant parsing of packets

Currently the IPv6 extension headers are parsed twice: first in
ipv6_gro_receive, and then again in ipv6_gro_complete.

By using the new ->transport_proto and ->network_proto fields, and also
storing the size of the network header, we can avoid parsing a second time
during the gro complete phase.

The first commit frees up space in the GRO CB. The second commit reduces
the redundant parsing during the complete phase, using the freed CB space.

In addition, the second commit contains a fix for a potential future
problem in BIG TCP, which is detailed in the commit message itself.

Performance tests for TCP stream over IPv6 with extension headers
show an RX throughput improvement of ~0.7%.

For the benchmarks, I used a 100Gbit mlx5 NIC, single-core, with power
management and turbo boost off.

Typical IPv6 traffic (zero extension headers):

for i in {1..5}; do netperf -t TCP_STREAM -H 2001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 16391.20
131072 16384 16384 90.00 16403.50
131072 16384 16384 90.00 16403.30
131072 16384 16384 90.00 16397.84
131072 16384 16384 90.00 16398.00

# after
131072 16384 16384 90.00 16399.85
131072 16384 16384 90.00 16392.37
131072 16384 16384 90.00 16403.06
131072 16384 16384 90.00 16406.97
131072 16384 16384 90.00 16406.09

IPv6 over IPv6 traffic:

for i in {1..5}; do netperf -t TCP_STREAM -H 4001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 14791.61
131072 16384 16384 90.00 14791.66
131072 16384 16384 90.00 14783.47
131072 16384 16384 90.00 14810.17
131072 16384 16384 90.00 14806.15

# after
131072 16384 16384 90.00 14793.49
131072 16384 16384 90.00 14816.10
131072 16384 16384 90.00 14818.41
131072 16384 16384 90.00 14780.35
131072 16384 16384 90.00 14800.48

IPv6 traffic with varying extension headers:

for i in {1..5}; do netperf -t TCP_STREAM -H 2001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 14812.37
131072 16384 16384 90.00 14813.04
131072 16384 16384 90.00 14802.54
131072 16384 16384 90.00 14804.06
131072 16384 16384 90.00 14819.08

# after
131072 16384 16384 90.00 14927.11
131072 16384 16384 90.00 14910.45
131072 16384 16384 90.00 14917.36
131072 16384 16384 90.00 14916.53
131072 16384 16384 90.00 14928.88
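
As a cross-check of the ~0.7% figure (an illustration, not part of the posted series), the sketch below averages the last column of the "varying extension headers" runs and computes the relative change. The helper names are hypothetical, and it assumes the last column is netperf's default TCP_STREAM throughput in 10^6 bits/s.

```c
#include <assert.h>
#include <stddef.h>

/* Arithmetic mean of n throughput samples (0.0 for an empty set). */
static double mean(const double *v, size_t n)
{
	double sum = 0.0;
	size_t i;

	for (i = 0; i < n; i++)
		sum += v[i];
	return n ? sum / (double)n : 0.0;
}

/* Relative change of the "after" mean over the "before" mean, in percent. */
static double improvement_pct(const double *before, const double *after,
			      size_t n)
{
	double b = mean(before, n);

	return b != 0.0 ? (mean(after, n) - b) / b * 100.0 : 0.0;
}

/* Samples copied from the "varying extension headers" runs above. */
static const double exthdr_before[] = {
	14812.37, 14813.04, 14802.54, 14804.06, 14819.08,
};
static const double exthdr_after[] = {
	14927.11, 14910.45, 14917.36, 14916.53, 14928.88,
};
```

Calling improvement_pct(exthdr_before, exthdr_after, 5) gives roughly 0.74, in line with the ~0.7% quoted above. The other two scenarios move by far less than that between their before and after runs.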

Richard Gobert (2):
gro: decrease size of CB
gro: optimise redundant parsing of packets

include/net/gro.h | 33 ++++++++++++++++++++++++---------
net/core/gro.c | 18 +++++++++++-------
net/ethernet/eth.c | 14 +++++++++++---
net/ipv6/ip6_offload.c | 20 +++++++++++++++-----
4 files changed, 61 insertions(+), 24 deletions(-)

--
2.36.1


2023-03-13 16:31:48

by Richard Gobert

Subject: [PATCH v3 1/2] gro: decrease size of CB

The GRO control block (NAPI_GRO_CB) is currently at its maximum size. This
commit reduces its size by putting two groups of fields that are used only
at different times into a union.

Specifically, the fields frag0 and frag0_len make up the frag0
optimisation mechanism, which is used during the initial parsing of the
SKB.

The fields last and age are used after the initial parsing, while the SKB
is stored in the GRO list, waiting for other packets to arrive.

There was one location in dev_gro_receive that modified the frag0 fields
after setting last and age. I moved that frag0 pull ahead of the last and
age assignments, without altering the code behaviour.
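
To illustrate the space saving described above (a standalone userspace sketch, not the kernel structure), the two layouts below hold only the four fields involved. The struct names cb_flat and cb_union and the helper cb_bytes_saved are hypothetical. It relies on C11 anonymous structs and unions.

```c
#include <assert.h>
#include <stddef.h>

struct sk_buff; /* opaque here; only a pointer to it is stored */

/* Layout before: each of the four fields has its own storage. */
struct cb_flat {
	void *frag0;
	unsigned int frag0_len;
	struct sk_buff *last;
	unsigned long age;
};

/* Layout after: the two groups overlap inside an anonymous union. */
struct cb_union {
	union {
		struct {
			void *frag0;            /* used during initial parsing */
			unsigned int frag0_len;
		};
		struct {
			struct sk_buff *last;   /* used while held in the GRO list */
			unsigned long age;
		};
	};
};

/* Bytes freed by overlapping the two groups. */
static size_t cb_bytes_saved(void)
{
	return sizeof(struct cb_flat) - sizeof(struct cb_union);
}
```

On a typical LP64 target this reports 16 bytes saved. Because frag0 and last now share the same offset, any write to last or age overwrites the frag0 state, which is why the frag0 pull has to happen before those assignments.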

Signed-off-by: Richard Gobert <[email protected]>
---
include/net/gro.h | 26 ++++++++++++++++----------
net/core/gro.c | 18 +++++++++++-------
2 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index a4fab706240d..7b47dd6ce94f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -11,11 +11,23 @@
#include <net/udp.h>

struct napi_gro_cb {
- /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
- void *frag0;
+ union {
+ struct {
+ /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
+ void *frag0;

- /* Length of frag0. */
- unsigned int frag0_len;
+ /* Length of frag0. */
+ unsigned int frag0_len;
+ };
+
+ struct {
+ /* used in skb_gro_receive() slow path */
+ struct sk_buff *last;
+
+ /* jiffies when first packet was created/queued */
+ unsigned long age;
+ };
+ };

/* This indicates where we are processing relative to skb->data. */
int data_offset;
@@ -32,9 +44,6 @@ struct napi_gro_cb {
/* Used in ipv6_gro_receive() and foo-over-udp */
u16 proto;

- /* jiffies when first packet was created/queued */
- unsigned long age;
-
/* Used in napi_gro_cb::free */
#define NAPI_GRO_FREE 1
#define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -77,9 +86,6 @@ struct napi_gro_cb {

/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
-
- /* used in skb_gro_receive() slow path */
- struct sk_buff *last;
};

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/core/gro.c b/net/core/gro.c
index a606705a0859..b1fdabd414a5 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -460,6 +460,14 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
}
}

+static inline void gro_try_pull_from_frag0(struct sk_buff *skb)
+{
+ int grow = skb_gro_offset(skb) - skb_headlen(skb);
+
+ if (grow > 0)
+ gro_pull_from_frag0(skb, grow);
+}
+
static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
{
struct sk_buff *oldest;
@@ -489,7 +497,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
struct sk_buff *pp = NULL;
enum gro_result ret;
int same_flow;
- int grow;

if (netif_elide_gro(skb->dev))
goto normal;
@@ -564,17 +571,13 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
else
gro_list->count++;

+ gro_try_pull_from_frag0(skb);
NAPI_GRO_CB(skb)->age = jiffies;
NAPI_GRO_CB(skb)->last = skb;
if (!skb_is_gso(skb))
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
list_add(&skb->list, &gro_list->list);
ret = GRO_HELD;
-
-pull:
- grow = skb_gro_offset(skb) - skb_headlen(skb);
- if (grow > 0)
- gro_pull_from_frag0(skb, grow);
ok:
if (gro_list->count) {
if (!test_bit(bucket, &napi->gro_bitmask))
@@ -587,7 +590,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff

normal:
ret = GRO_NORMAL;
- goto pull;
+ gro_try_pull_from_frag0(skb);
+ goto ok;
}

struct packet_offload *gro_find_receive_by_type(__be16 type)
--
2.36.1

2023-03-13 16:46:27

by Richard Gobert

Subject: [PATCH v3 2/2] gro: optimise redundant parsing of packets

Currently the IPv6 extension headers are parsed twice: first in
ipv6_gro_receive, and then again in ipv6_gro_complete.

By using the new ->transport_proto field, and also storing the size of the
network header, we can avoid parsing extension headers a second time in
ipv6_gro_complete (which saves multiple memory dereferences and conditional
checks inside ipv6_exthdrs_len for a varying number of extension headers in
IPv6 packets).
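
The parse-once idea can be sketched in userspace as follows. This is a simplified illustration under stated assumptions, not the kernel's ipv6_exthdrs_len: struct parse_cache and the function names are hypothetical, and only hop-by-hop, routing and destination options headers are treated as extension headers (they share the "8 * (Hdr Ext Len + 1)" length encoding).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NEXTHDR_HOP     0   /* hop-by-hop options */
#define NEXTHDR_ROUTING 43  /* routing header */
#define NEXTHDR_DEST    60  /* destination options */
#define IPV6_HDR_LEN    40  /* fixed IPv6 header */

/* Hypothetical stand-in for the values cached in the GRO CB. */
struct parse_cache {
	uint16_t network_len;     /* fixed header + extension headers, bytes */
	uint8_t  transport_proto; /* first non-extension Next Header value */
};

static int is_exthdr(uint8_t nh)
{
	return nh == NEXTHDR_HOP || nh == NEXTHDR_ROUTING || nh == NEXTHDR_DEST;
}

/*
 * Receive step: walk the extension header chain once. pkt points at the
 * IPv6 header. Returns 0 and fills *cache, or -1 on a truncated packet.
 */
static int parse_once(const uint8_t *pkt, size_t len, struct parse_cache *cache)
{
	size_t off = IPV6_HDR_LEN;
	uint8_t nh;

	if (len < IPV6_HDR_LEN)
		return -1;
	nh = pkt[6]; /* Next Header field of the fixed header */

	while (is_exthdr(nh)) {
		size_t hlen;

		if (len - off < 8)
			return -1;
		hlen = ((size_t)pkt[off + 1] + 1) * 8;
		if (len - off < hlen)
			return -1;
		nh = pkt[off];
		off += hlen;
	}
	if (off > UINT16_MAX)
		return -1;

	cache->network_len = (uint16_t)off;
	cache->transport_proto = nh;
	return 0;
}

/* Complete step: reuse the cached length instead of walking again. */
static size_t transport_offset(size_t nhoff, const struct parse_cache *cache)
{
	return nhoff + cache->network_len;
}
```

The complete step then costs one addition regardless of how many extension headers the packet carried, which is the saving the commit message describes.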

The implementation had to handle both inner and outer layers in case of
encapsulation (as they can't use the same field). I've applied a similar
optimisation to Ethernet.

Performance tests for TCP stream over IPv6 with a varying number of
extension headers demonstrate throughput improvement of ~0.7%.

In addition, I fixed a potential future problem:
- The call to skb_set_inner_network_header at the beginning of
ipv6_gro_complete calculates inner_network_header based on skb->data,
setting it to point to the beginning of the ip header.
- If a packet is going to be handled by BIG TCP, the following code block
is going to shift the packet header, and skb->data is going to be
changed as well.

When the two flows are combined, inner_network_header will point to the
wrong place. This might happen if encapsulation of BIG TCP is supported in
the future.

The fix is to place the whole encapsulation branch after the BIG TCP code
block. This way, if encapsulation of BIG TCP is supported,
inner_network_header will still be calculated with the correct value of
skb->data.

Also, by arranging the code that way, the optimisation does not add an
additional branch.
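
The ordering concern can be modelled with a toy buffer (an illustration only; struct mini_skb and the helpers are hypothetical and far simpler than the real sk_buff). The offset is recorded relative to the data pointer at the time of the call, so recording it before the data pointer moves leaves it off by the size of the shift.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical miniature of the sk_buff fields involved. */
struct mini_skb {
	unsigned char *head;         /* start of the buffer */
	unsigned char *data;         /* start of packet data */
	size_t inner_network_header; /* offset from head */
};

/* Record the header position from the current data pointer plus nhoff. */
static void set_inner_network_header(struct mini_skb *skb, size_t nhoff)
{
	skb->inner_network_header = (size_t)(skb->data - skb->head) + nhoff;
}

/* Model of the header shift: data moves back by hoplen bytes. */
static void shift_headers(struct mini_skb *skb, size_t hoplen)
{
	skb->data -= hoplen;
}

/*
 * Returns how many bytes the recorded offset is away from data + nhoff
 * after the shift; 0 means it is correct. Assumes hoplen <= 64.
 */
static size_t inner_header_error(int set_before_shift, size_t nhoff,
				 size_t hoplen)
{
	unsigned char buf[128];
	struct mini_skb skb = { .head = buf, .data = buf + 64 };

	if (set_before_shift)
		set_inner_network_header(&skb, nhoff);
	shift_headers(&skb, hoplen);
	if (!set_before_shift)
		set_inner_network_header(&skb, nhoff);

	/* After the shift the IP header starts at data + nhoff. */
	return skb.inner_network_header -
	       ((size_t)(skb.data - skb.head) + nhoff);
}
```

With nhoff = 14 and hoplen = 8, recording before the shift leaves the offset 8 bytes past the header, while recording after the shift gives an error of 0.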

Signed-off-by: Richard Gobert <[email protected]>
---
include/net/gro.h | 9 +++++++++
net/ethernet/eth.c | 14 +++++++++++---
net/ipv6/ip6_offload.c | 20 +++++++++++++++-----
3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index 7b47dd6ce94f..35f60ea99f6c 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -86,6 +86,15 @@ struct napi_gro_cb {

/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
+
+ /* Used in ipv6_gro_receive() */
+ u16 network_len;
+
+ /* Used in eth_gro_receive() */
+ __be16 network_proto;
+
+ /* Used in ipv6_gro_receive() */
+ u8 transport_proto;
};

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 2edc8b796a4e..c2b77d9401e4 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -439,6 +439,9 @@ struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb)
goto out;
}

+ if (!NAPI_GRO_CB(skb)->encap_mark)
+ NAPI_GRO_CB(skb)->network_proto = type;
+
skb_gro_pull(skb, sizeof(*eh));
skb_gro_postpull_rcsum(skb, eh, sizeof(*eh));

@@ -455,13 +458,18 @@ EXPORT_SYMBOL(eth_gro_receive);

int eth_gro_complete(struct sk_buff *skb, int nhoff)
{
- struct ethhdr *eh = (struct ethhdr *)(skb->data + nhoff);
- __be16 type = eh->h_proto;
struct packet_offload *ptype;
+ struct ethhdr *eh;
int err = -ENOSYS;
+ __be16 type;

- if (skb->encapsulation)
+ if (skb->encapsulation) {
+ eh = (struct ethhdr *)(skb->data + nhoff);
skb_set_inner_mac_header(skb, nhoff);
+ type = eh->h_proto;
+ } else {
+ type = NAPI_GRO_CB(skb)->network_proto;
+ }

ptype = gro_find_complete_by_type(type);
if (ptype != NULL)
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 00dc2e3b0184..6e3a923ad573 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -232,6 +232,11 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
flush--;
nlen = skb_network_header_len(skb);

+ if (!NAPI_GRO_CB(skb)->encap_mark) {
+ NAPI_GRO_CB(skb)->transport_proto = proto;
+ NAPI_GRO_CB(skb)->network_len = nlen;
+ }
+
list_for_each_entry(p, head, list) {
const struct ipv6hdr *iph2;
__be32 first_word; /* <Version:4><Traffic_Class:8><Flow_Label:20> */
@@ -324,10 +329,6 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
int err = -ENOSYS;
u32 payload_len;

- if (skb->encapsulation) {
- skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
- skb_set_inner_network_header(skb, nhoff);
- }

payload_len = skb->len - nhoff - sizeof(*iph);
if (unlikely(payload_len > IPV6_MAXPLEN)) {
@@ -341,6 +342,7 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
skb->len += hoplen;
skb->mac_header -= hoplen;
skb->network_header -= hoplen;
+ NAPI_GRO_CB(skb)->network_len += hoplen;
iph = (struct ipv6hdr *)(skb->data + nhoff);
hop_jumbo = (struct hop_jumbo_hdr *)(iph + 1);

@@ -358,7 +360,15 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
iph->payload_len = htons(payload_len);
}

- nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
+ if (skb->encapsulation) {
+ skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
+ skb_set_inner_network_header(skb, nhoff);
+ nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
+ } else {
+ ops = rcu_dereference(inet6_offloads[NAPI_GRO_CB(skb)->transport_proto]);
+ nhoff += NAPI_GRO_CB(skb)->network_len;
+ }
+
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
goto out;

--
2.36.1

2023-03-13 16:52:29

by Eric Dumazet

Subject: Re: [PATCH v3 2/2] gro: optimise redundant parsing of packets

On Mon, Mar 13, 2023 at 9:46 AM Richard Gobert <[email protected]> wrote:
>
> Currently the IPv6 extension headers are parsed twice: first in
> ipv6_gro_receive, and then again in ipv6_gro_complete.
>
> By using the new ->transport_proto field, and also storing the size of the
> network header, we can avoid parsing extension headers a second time in
> ipv6_gro_complete (which saves multiple memory dereferences and conditional
> checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> IPv6 packets).
>
> The implementation had to handle both inner and outer layers in case of
> encapsulation (as they can't use the same field). I've applied a similar
> optimisation to Ethernet.
>
> Performance tests for TCP stream over IPv6 with a varying amount of
> extension headers demonstrate throughput improvement of ~0.7%.
>
> In addition, I fixed a potential future problem:

I would remove all this block.

We fix current problems, not future hypothetical ones.

> - The call to skb_set_inner_network_header at the beginning of
> ipv6_gro_complete calculates inner_network_header based on skb->data by
> calling skb_set_inner_network_header, and setting it to point to the
> beginning of the ip header.
> - If a packet is going to be handled by BIG TCP, the following code block
> is going to shift the packet header, and skb->data is going to be
> changed as well.
>
> When the two flows are combined, inner_network_header will point to the
> wrong place - which might happen if encapsulation of BIG TCP will be
> supported in the future.
>
> The fix is to place the whole encapsulation branch after the BIG TCP code
> block. This way, if encapsulation of BIG TCP will be supported,
> inner_network_header will still be calculated with the correct value of
> skb->data.

We do not support encapsulated BIG TCP yet.
We will do this later, and whoever does it will make sure to also support GRO.

> Also, by arranging the code that way, the optimisation does not
> add an additional branch.
>
> Signed-off-by: Richard Gobert <[email protected]>
> ---
>

Can you give us a good explanation of why exactly extension headers are used?

I am not sure we want to add code to GRO for something that 99.99% of
us do not use.

2023-03-13 17:03:20

by Eric Dumazet

Subject: Re: [PATCH v3 1/2] gro: decrease size of CB

On Mon, Mar 13, 2023 at 9:30 AM Richard Gobert <[email protected]> wrote:
>
> The GRO control block (NAPI_GRO_CB) is currently at its maximum size. This
> commit reduces its size by putting two groups of fields that are used only
> at different times into a union.
>
> Specifically, the fields frag0 and frag0_len are the fields that make up
> the frag0 optimisation mechanism, which is used during the initial parsing
> of the SKB.

Note that these fields could also be stored in some auto variable,
instead of skb.

>
> The fields last and age are used after the initial parsing, while the SKB
> is stored in the GRO list, waiting for other packets to arrive.
>
> There was one location in dev_gro_receive that modified the frag0 fields
> after setting last and age. I changed this accordingly without altering the
> code behaviour.
>
> Signed-off-by: Richard Gobert <[email protected]>
> ---

SGTM, thanks.

Reviewed-by: Eric Dumazet <[email protected]>

2023-03-14 15:56:10

by Richard Gobert

Subject: Re: [PATCH v3 2/2] gro: optimise redundant parsing of packets

> >
> > Currently the IPv6 extension headers are parsed twice: first in
> > ipv6_gro_receive, and then again in ipv6_gro_complete.
> >
> > By using the new ->transport_proto field, and also storing the size of the
> > network header, we can avoid parsing extension headers a second time in
> > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > IPv6 packets).
> >
> > The implementation had to handle both inner and outer layers in case of
> > encapsulation (as they can't use the same field). I've applied a similar
> > optimisation to Ethernet.
> >
> > Performance tests for TCP stream over IPv6 with a varying amount of
> > extension headers demonstrate throughput improvement of ~0.7%.
> >
> > In addition, I fixed a potential future problem:
>
> I would remove all this block.
>
> We fix current problems, not future hypothetical ones.
>

I agree, I did it primarily to avoid an additional branch (the logic
remains exactly the same). I'll remove this part from the commit message.


> > - The call to skb_set_inner_network_header at the beginning of
> > ipv6_gro_complete calculates inner_network_header based on skb->data by
> > calling skb_set_inner_network_header, and setting it to point to the
> > beginning of the ip header.
> > - If a packet is going to be handled by BIG TCP, the following code block
> > is going to shift the packet header, and skb->data is going to be
> > changed as well.
> >
> > When the two flows are combined, inner_network_header will point to the
> > wrong place - which might happen if encapsulation of BIG TCP will be
> > supported in the future.
> >
> > The fix is to place the whole encapsulation branch after the BIG TCP code
> > block. This way, if encapsulation of BIG TCP will be supported,
> > inner_network_header will still be calculated with the correct value of
> > skb->data.
>
> We do not support encapsulated BIG TCP yet.
> We will do this later, and whoever does it will make sure to also support GRO.
>
> > Also, by arranging the code that way, the optimisation does not
> > add an additional branch.
> >
> > Signed-off-by: Richard Gobert <[email protected]>
> > ---
> >
>
> Can you give us a good explanation of why extension headers are used exactly ?
>
> I am not sure we want to add code to GRO for something that 99.99% of
> us do not use.

IMO, some common use cases that will benefit from this patch are:
- Parsing of BIG TCP packets, which include a hop-by-hop (hbh) extension
header.
- Destination options (dstopts) and routing extension headers, which are
used for Mobile IPv6 features.

Generally, when a packet includes extension headers, we avoid recalculating
their length in ipv6_gro_complete. When there are no extension headers, we
do not call ipv6_exthdrs_len at all, so performance isn't negatively
impacted (we potentially even save the few instructions that would run
inside ipv6_exthdrs_len).