2023-03-20 16:43:15

by Richard Gobert

Subject: [PATCH v4 0/2] gro: optimise redundant parsing of packets

Currently the IPv6 extension headers are parsed twice: first in
ipv6_gro_receive, and then again in ipv6_gro_complete.

By using the new ->transport_proto and ->network_proto fields, and also
storing the size of the network header, we can avoid parsing a second time
during the gro complete phase.

The first commit frees up space in the GRO CB. The second commit reduces
the redundant parsing during the complete phase, using the freed CB space.

Performance tests for a TCP stream over IPv6 with extension headers
demonstrate an rx throughput improvement of ~0.7%.

For the benchmarks, I used a 100Gbit mlx5 NIC on a single core, with power
management and turboboost off.

Typical IPv6 traffic (zero extension headers):

for i in {1..5}; do netperf -t TCP_STREAM -H 2001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 16391.20
131072 16384 16384 90.00 16403.50
131072 16384 16384 90.00 16403.30
131072 16384 16384 90.00 16397.84
131072 16384 16384 90.00 16398.00

# after
131072 16384 16384 90.00 16399.85
131072 16384 16384 90.00 16392.37
131072 16384 16384 90.00 16403.06
131072 16384 16384 90.00 16406.97
131072 16384 16384 90.00 16406.09

IPv6 over IPv6 traffic:

for i in {1..5}; do netperf -t TCP_STREAM -H 4001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 14791.61
131072 16384 16384 90.00 14791.66
131072 16384 16384 90.00 14783.47
131072 16384 16384 90.00 14810.17
131072 16384 16384 90.00 14806.15

# after
131072 16384 16384 90.00 14793.49
131072 16384 16384 90.00 14816.10
131072 16384 16384 90.00 14818.41
131072 16384 16384 90.00 14780.35
131072 16384 16384 90.00 14800.48

IPv6 traffic with varying extension headers:

for i in {1..5}; do netperf -t TCP_STREAM -H 2001:db8:2:2::2 -l 90 | tail -1; done
# before
131072 16384 16384 90.00 14812.37
131072 16384 16384 90.00 14813.04
131072 16384 16384 90.00 14802.54
131072 16384 16384 90.00 14804.06
131072 16384 16384 90.00 14819.08

# after
131072 16384 16384 90.00 14927.11
131072 16384 16384 90.00 14910.45
131072 16384 16384 90.00 14917.36
131072 16384 16384 90.00 14916.53
131072 16384 16384 90.00 14928.88

Richard Gobert (2):
gro: decrease size of CB
gro: optimise redundant parsing of packets

include/net/gro.h | 33 ++++++++++++++++++++++++---------
net/core/gro.c | 18 +++++++++++-------
net/ethernet/eth.c | 14 +++++++++++---
net/ipv6/ip6_offload.c | 20 +++++++++++++++-----
4 files changed, 61 insertions(+), 24 deletions(-)

--
2.36.1


2023-03-20 16:56:33

by Richard Gobert

Subject: [PATCH v4 1/2] gro: decrease size of CB

The GRO control block (NAPI_GRO_CB) is currently at its maximum size. This
commit reduces its size by putting two groups of fields that are used only
at different times into a union.

Specifically, the fields frag0 and frag0_len are the fields that make up
the frag0 optimisation mechanism, which is used during the initial parsing
of the SKB.

The fields last and age are used after the initial parsing, while the SKB
is stored in the GRO list, waiting for other packets to arrive.

There was one location in dev_gro_receive that modified the frag0 fields
after setting last and age. I changed this accordingly without altering the
code behaviour.

Signed-off-by: Richard Gobert <[email protected]>
Reviewed-by: Eric Dumazet <[email protected]>
---
include/net/gro.h | 26 ++++++++++++++++----------
net/core/gro.c | 18 +++++++++++-------
2 files changed, 27 insertions(+), 17 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index a4fab706240d..7b47dd6ce94f 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -11,11 +11,23 @@
#include <net/udp.h>

struct napi_gro_cb {
- /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
- void *frag0;
+ union {
+ struct {
+ /* Virtual address of skb_shinfo(skb)->frags[0].page + offset. */
+ void *frag0;

- /* Length of frag0. */
- unsigned int frag0_len;
+ /* Length of frag0. */
+ unsigned int frag0_len;
+ };
+
+ struct {
+ /* used in skb_gro_receive() slow path */
+ struct sk_buff *last;
+
+ /* jiffies when first packet was created/queued */
+ unsigned long age;
+ };
+ };

/* This indicates where we are processing relative to skb->data. */
int data_offset;
@@ -32,9 +44,6 @@ struct napi_gro_cb {
/* Used in ipv6_gro_receive() and foo-over-udp */
u16 proto;

- /* jiffies when first packet was created/queued */
- unsigned long age;
-
/* Used in napi_gro_cb::free */
#define NAPI_GRO_FREE 1
#define NAPI_GRO_FREE_STOLEN_HEAD 2
@@ -77,9 +86,6 @@ struct napi_gro_cb {

/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
-
- /* used in skb_gro_receive() slow path */
- struct sk_buff *last;
};

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/core/gro.c b/net/core/gro.c
index a606705a0859..b1fdabd414a5 100644
--- a/net/core/gro.c
+++ b/net/core/gro.c
@@ -460,6 +460,14 @@ static void gro_pull_from_frag0(struct sk_buff *skb, int grow)
}
}

+static inline void gro_try_pull_from_frag0(struct sk_buff *skb)
+{
+ int grow = skb_gro_offset(skb) - skb_headlen(skb);
+
+ if (grow > 0)
+ gro_pull_from_frag0(skb, grow);
+}
+
static void gro_flush_oldest(struct napi_struct *napi, struct list_head *head)
{
struct sk_buff *oldest;
@@ -489,7 +497,6 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
struct sk_buff *pp = NULL;
enum gro_result ret;
int same_flow;
- int grow;

if (netif_elide_gro(skb->dev))
goto normal;
@@ -564,17 +571,13 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff
else
gro_list->count++;

+ gro_try_pull_from_frag0(skb);
NAPI_GRO_CB(skb)->age = jiffies;
NAPI_GRO_CB(skb)->last = skb;
if (!skb_is_gso(skb))
skb_shinfo(skb)->gso_size = skb_gro_len(skb);
list_add(&skb->list, &gro_list->list);
ret = GRO_HELD;
-
-pull:
- grow = skb_gro_offset(skb) - skb_headlen(skb);
- if (grow > 0)
- gro_pull_from_frag0(skb, grow);
ok:
if (gro_list->count) {
if (!test_bit(bucket, &napi->gro_bitmask))
@@ -587,7 +590,8 @@ static enum gro_result dev_gro_receive(struct napi_struct *napi, struct sk_buff

normal:
ret = GRO_NORMAL;
- goto pull;
+ gro_try_pull_from_frag0(skb);
+ goto ok;
}

struct packet_offload *gro_find_receive_by_type(__be16 type)
--
2.36.1

2023-03-20 17:07:25

by Richard Gobert

Subject: [PATCH v4 2/2] gro: optimise redundant parsing of packets

Currently the IPv6 extension headers are parsed twice: first in
ipv6_gro_receive, and then again in ipv6_gro_complete.

By using the new ->transport_proto field, and also storing the size of the
network header, we can avoid parsing extension headers a second time in
ipv6_gro_complete (which saves multiple memory dereferences and conditional
checks inside ipv6_exthdrs_len for a varying amount of extension headers in
IPv6 packets).

The implementation had to handle both inner and outer layers in case of
encapsulation (as they can't use the same field). I've applied a similar
optimisation to Ethernet.

Performance tests for TCP stream over IPv6 with a varying amount of
extension headers demonstrate throughput improvement of ~0.7%.

Signed-off-by: Richard Gobert <[email protected]>
---
v3 -> v4:
- Updated commit msg as Eric suggested.
- No code changes.
---
include/net/gro.h | 9 +++++++++
net/ethernet/eth.c | 14 +++++++++++---
net/ipv6/ip6_offload.c | 20 +++++++++++++++-----
3 files changed, 35 insertions(+), 8 deletions(-)

diff --git a/include/net/gro.h b/include/net/gro.h
index 7b47dd6ce94f..35f60ea99f6c 100644
--- a/include/net/gro.h
+++ b/include/net/gro.h
@@ -86,6 +86,15 @@ struct napi_gro_cb {

/* used to support CHECKSUM_COMPLETE for tunneling protocols */
__wsum csum;
+
+ /* Used in ipv6_gro_receive() */
+ u16 network_len;
+
+ /* Used in eth_gro_receive() */
+ __be16 network_proto;
+
+ /* Used in ipv6_gro_receive() */
+ u8 transport_proto;
};

#define NAPI_GRO_CB(skb) ((struct napi_gro_cb *)(skb)->cb)
diff --git a/net/ethernet/eth.c b/net/ethernet/eth.c
index 2edc8b796a4e..c2b77d9401e4 100644
--- a/net/ethernet/eth.c
+++ b/net/ethernet/eth.c
@@ -439,6 +439,9 @@ struct sk_buff *eth_gro_receive(struct list_head *head, struct sk_buff *skb)
goto out;
}

+ if (!NAPI_GRO_CB(skb)->encap_mark)
+ NAPI_GRO_CB(skb)->network_proto = type;
+
skb_gro_pull(skb, sizeof(*eh));
skb_gro_postpull_rcsum(skb, eh, sizeof(*eh));

@@ -455,13 +458,18 @@ EXPORT_SYMBOL(eth_gro_receive);

int eth_gro_complete(struct sk_buff *skb, int nhoff)
{
- struct ethhdr *eh = (struct ethhdr *)(skb->data + nhoff);
- __be16 type = eh->h_proto;
struct packet_offload *ptype;
+ struct ethhdr *eh;
int err = -ENOSYS;
+ __be16 type;

- if (skb->encapsulation)
+ if (skb->encapsulation) {
+ eh = (struct ethhdr *)(skb->data + nhoff);
skb_set_inner_mac_header(skb, nhoff);
+ type = eh->h_proto;
+ } else {
+ type = NAPI_GRO_CB(skb)->network_proto;
+ }

ptype = gro_find_complete_by_type(type);
if (ptype != NULL)
diff --git a/net/ipv6/ip6_offload.c b/net/ipv6/ip6_offload.c
index 00dc2e3b0184..6e3a923ad573 100644
--- a/net/ipv6/ip6_offload.c
+++ b/net/ipv6/ip6_offload.c
@@ -232,6 +232,11 @@ INDIRECT_CALLABLE_SCOPE struct sk_buff *ipv6_gro_receive(struct list_head *head,
flush--;
nlen = skb_network_header_len(skb);

+ if (!NAPI_GRO_CB(skb)->encap_mark) {
+ NAPI_GRO_CB(skb)->transport_proto = proto;
+ NAPI_GRO_CB(skb)->network_len = nlen;
+ }
+
list_for_each_entry(p, head, list) {
const struct ipv6hdr *iph2;
__be32 first_word; /* <Version:4><Traffic_Class:8><Flow_Label:20> */
@@ -324,10 +329,6 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
int err = -ENOSYS;
u32 payload_len;

- if (skb->encapsulation) {
- skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
- skb_set_inner_network_header(skb, nhoff);
- }

payload_len = skb->len - nhoff - sizeof(*iph);
if (unlikely(payload_len > IPV6_MAXPLEN)) {
@@ -341,6 +342,7 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
skb->len += hoplen;
skb->mac_header -= hoplen;
skb->network_header -= hoplen;
+ NAPI_GRO_CB(skb)->network_len += hoplen;
iph = (struct ipv6hdr *)(skb->data + nhoff);
hop_jumbo = (struct hop_jumbo_hdr *)(iph + 1);

@@ -358,7 +360,15 @@ INDIRECT_CALLABLE_SCOPE int ipv6_gro_complete(struct sk_buff *skb, int nhoff)
iph->payload_len = htons(payload_len);
}

- nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
+ if (skb->encapsulation) {
+ skb_set_inner_protocol(skb, cpu_to_be16(ETH_P_IPV6));
+ skb_set_inner_network_header(skb, nhoff);
+ nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
+ } else {
+ ops = rcu_dereference(inet6_offloads[NAPI_GRO_CB(skb)->transport_proto]);
+ nhoff += NAPI_GRO_CB(skb)->network_len;
+ }
+
if (WARN_ON(!ops || !ops->callbacks.gro_complete))
goto out;

--
2.36.1

2023-03-22 10:07:42

by Paolo Abeni

Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> Currently the IPv6 extension headers are parsed twice: first in
> ipv6_gro_receive, and then again in ipv6_gro_complete.
>
> By using the new ->transport_proto field, and also storing the size of the
> network header, we can avoid parsing extension headers a second time in
> ipv6_gro_complete (which saves multiple memory dereferences and conditional
> checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> IPv6 packets).
>
> The implementation had to handle both inner and outer layers in case of
> encapsulation (as they can't use the same field). I've applied a similar
> optimisation to Ethernet.
>
> Performance tests for TCP stream over IPv6 with a varying amount of
> extension headers demonstrate throughput improvement of ~0.7%.

I'm surprised that the improvement is measurable: for large aggregate
packets, a single ipv6_exthdrs_len() call is avoided out of tens of calls
for the individual packets. Additionally, such a figure is comparable to
the noise level in my tests.

This adds a couple of additional branches for the common (no extension
headers) case.

While patch 1/2 could be useful, patch 2/2 overall does not look
worthwhile to me.

I suggest re-posting only patch 1 for inclusion, unless others have
strongly different opinions.

Cheers,

Paolo

2023-03-22 10:20:47

by Eric Dumazet

Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <[email protected]> wrote:
>
> On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > Currently the IPv6 extension headers are parsed twice: first in
> > ipv6_gro_receive, and then again in ipv6_gro_complete.
> >
> > By using the new ->transport_proto field, and also storing the size of the
> > network header, we can avoid parsing extension headers a second time in
> > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > IPv6 packets).
> >
> > The implementation had to handle both inner and outer layers in case of
> > encapsulation (as they can't use the same field). I've applied a similar
> > optimisation to Ethernet.
> >
> > Performance tests for TCP stream over IPv6 with a varying amount of
> > extension headers demonstrate throughput improvement of ~0.7%.
>
> I'm surprised that the improvement is measurable: for large aggregate
> packets a single ipv6_exthdrs_len() call is avoided out of tens calls
> for the individual pkts. Additionally such figure is comparable to
> noise level in my tests.
>
> This adds a couple of additional branches for the common (no extensions
> header) case.
>
> while patch 1/2 could be useful, patch 2/2 overall looks not worthy to
> me.
>
> I suggest to re-post for inclusion only patch 1, unless others have
> strong different opinions.
>

+2

I have the same feeling/opinion.

> Cheers,
>
> Paolo
>

2023-03-22 19:35:57

by Richard Gobert

Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

> On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <[email protected]> wrote:
> >
> > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > Currently the IPv6 extension headers are parsed twice: first in
> > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > >
> > > By using the new ->transport_proto field, and also storing the size of the
> > > network header, we can avoid parsing extension headers a second time in
> > > ipv6_gro_complete (which saves multiple memory dereferences and conditional
> > > checks inside ipv6_exthdrs_len for a varying amount of extension headers in
> > > IPv6 packets).
> > >
> > > The implementation had to handle both inner and outer layers in case of
> > > encapsulation (as they can't use the same field). I've applied a similar
> > > optimisation to Ethernet.
> > >
> > > Performance tests for TCP stream over IPv6 with a varying amount of
> > > extension headers demonstrate throughput improvement of ~0.7%.
> >
> > I'm surprised that the improvement is measurable: for large aggregate
> > packets a single ipv6_exthdrs_len() call is avoided out of tens calls
> > for the individual pkts. Additionally such figure is comparable to
> > noise level in my tests.

It's not simple, but I made an effort to set up a quiet environment.
A correct configuration allows this kind of measurement to be made,
as the test is CPU bound and noise is variance that can be reduced with
enough samples.

Environment example (100Gbit mlx5 NIC, physical machine, 12th-gen i9):

# power-management and hyperthreading disabled in BIOS
# sysctl preallocate net mem
echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable turboboost
ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE frames

# Single core performance
for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0 >"$x"; done

./network-testing-master/bin/netfilter_unload_modules.sh 2>/dev/null # unload netfilter
tuned-adm profile latency-performance
cpupower frequency-set -f 2200MHz # Set core to specific frequency
systemctl isolate rescue-ssh.target
# and kill all processes besides init

> > This adds a couple of additional branches for the common (no extensions
> > header) case.

The additional branch in ipv6_gro_receive would be negligible or even
non-existent for a branch predictor in the common case
(non-encapsulated packets).
I could wrap it with a likely macro if you wish.
Inside ipv6_gro_complete a couple of branches are saved for the common
case as demonstrated below.

original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):

// if (skb->encapsulation)

ffffffff81c4962b: f6 87 81 00 00 00 20 testb $0x20,0x81(%rdi)
ffffffff81c49632: 74 2a je ffffffff81c4965e <ipv6_gro_complete+0x3e>

...

// nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);

ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99> <-- jump to beginning of for loop
ffffffff81c4968e: b8 28 00 00 00 mov $0x28,%eax
ffffffff81c49693: 31 f6 xor %esi,%esi
ffffffff81c49695: 48 c7 c7 c0 28 aa 82 mov $0xffffffff82aa28c0,%rdi
ffffffff81c4969c: eb 1b jmp ffffffff81c496b9 <ipv6_gro_complete+0x99>
ffffffff81c4969e: f6 41 18 01 testb $0x1,0x18(%rcx)
ffffffff81c496a2: 74 34 je ffffffff81c496d8 <ipv6_gro_complete+0xb8> <--- 3rd conditional check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
ffffffff81c496a4: 48 98 cltq
ffffffff81c496a6: 48 01 c2 add %rax,%rdx
ffffffff81c496a9: 0f b6 42 01 movzbl 0x1(%rdx),%eax
ffffffff81c496ad: 0f b6 0a movzbl (%rdx),%ecx
ffffffff81c496b0: 8d 04 c5 08 00 00 00 lea 0x8(,%rax,8),%eax
ffffffff81c496b7: 01 c6 add %eax,%esi
ffffffff81c496b9: 85 c9 test %ecx,%ecx <--- for loop starts here
ffffffff81c496bb: 74 e7 je ffffffff81c496a4 <ipv6_gro_complete+0x84> <--- 1st conditional check: proto != NEXTHDR_HOP
ffffffff81c496bd: 48 8b 0c cf mov (%rdi,%rcx,8),%rcx
ffffffff81c496c1: 48 85 c9 test %rcx,%rcx
ffffffff81c496c4: 75 d8 jne ffffffff81c4969e <ipv6_gro_complete+0x7e> <--- 2nd conditional check: unlikely(!(*opps))

... (indirect call ops->callbacks.gro_complete)

ipv6_exthdrs_len contains a loop with 3 conditional checks.
For the common (no extension headers) case, in the new code, *all 3
branches are completely avoided*.

patched code ipv6_gro_complete:

// if (skb->encapsulation)
ffffffff81befe58: f6 83 81 00 00 00 20 testb $0x20,0x81(%rbx)
ffffffff81befe5f: 74 78 je ffffffff81befed9 <ipv6_gro_complete+0xb9>

...

// else
ffffffff81befed9: 0f b6 43 50 movzbl 0x50(%rbx),%eax
ffffffff81befedd: 0f b7 73 4c movzwl 0x4c(%rbx),%esi
ffffffff81befee1: 48 8b 0c c5 c0 3f a9 mov -0x7d56c040(,%rax,8),%rcx

... (indirect call ops->callbacks.gro_complete)

Thus, the patch is beneficial for both the common case and the ext hdr
case. I would appreciate a second consideration :)

> > while patch 1/2 could be useful, patch 2/2 overall looks not worthy to
> > me.
> >
> > I suggest to re-post for inclusion only patch 1, unless others have
> > strong different opinions.
> >
>
> +2
>
> I have the same feeling/opinion.
>
> > Cheers,
> >
> > Paolo
> >

2023-04-03 11:45:53

by Paolo Abeni

Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

On Wed, 2023-03-22 at 20:33 +0100, Richard Gobert wrote:
> > On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <[email protected]>
> > wrote:
> > >
> > > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > > Currently the IPv6 extension headers are parsed twice: first in
> > > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > > >
> > > > By using the new ->transport_proto field, and also storing the
> > > > size of the
> > > > network header, we can avoid parsing extension headers a second
> > > > time in
> > > > ipv6_gro_complete (which saves multiple memory dereferences and
> > > > conditional
> > > > checks inside ipv6_exthdrs_len for a varying amount of
> > > > extension headers in
> > > > IPv6 packets).
> > > >
> > > > The implementation had to handle both inner and outer layers in
> > > > case of
> > > > encapsulation (as they can't use the same field). I've applied
> > > > a similar
> > > > optimisation to Ethernet.
> > > >
> > > > Performance tests for TCP stream over IPv6 with a varying
> > > > amount of
> > > > extension headers demonstrate throughput improvement of ~0.7%.
> > >
> > > I'm surprised that the improvement is measurable: for large
> > > aggregate
> > > packets a single ipv6_exthdrs_len() call is avoided out of tens
> > > calls
> > > for the individual pkts. Additionally such figure is comparable
> > > to
> > > noise level in my tests.
>
> It's not simple but I made an effort to make a quiet environment.
> Correct configuration allows for this kind of measurements to be made
> as the test is CPU bound and noise is a variance that can be reduced
> with
> enough samples.
>
> Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th
> gen)
>
>     # power-management and hyperthreading disabled in BIOS
>     # sysctl preallocate net mem
>     echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable
> turboboost
>     ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE
> frames
>
>     # Single core performance
>     for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0
> >"$x"; done
>
>     ./network-testing-master/bin/netfilter_unload_modules.sh
> 2>/dev/null # unload netfilter
>     tuned-adm profile latency-performance
>     cpupower frequency-set -f 2200MHz # Set core to specific
> frequency
>     systemctl isolate rescue-ssh.target
>     # and kill all processes besides init
>
> > > This adds a couple of additional branches for the common (no
> > > extensions
> > > header) case.
>
> The additional branch in ipv6_gro_receive would be negligible or even
> non-existent for a branch predictor in the common case
> (non-encapsulated packets).
> I could wrap it with a likely macro if you wish.
> Inside ipv6_gro_complete a couple of branches are saved for the
> common
> case as demonstrated below.
>
> original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):
>
>     // if (skb->encapsulation)
>
>     ffffffff81c4962b: f6 87 81 00 00 00 20 testb
> $0x20,0x81(%rdi)
>     ffffffff81c49632: 74 2a je
> ffffffff81c4965e <ipv6_gro_complete+0x3e>
>
>     ...
>
>     // nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
>
>     ffffffff81c4969c: eb 1b jmp
> ffffffff81c496b9 <ipv6_gro_complete+0x99> <-- jump to beginning of
> for loop
>     ffffffff81c4968e: b8 28 00 00 00 mov $0x28,%eax
>     ffffffff81c49693: 31 f6 xor %esi,%esi
>     ffffffff81c49695: 48 c7 c7 c0 28 aa 82 mov
> $0xffffffff82aa28c0,%rdi
>     ffffffff81c4969c: eb 1b jmp
> ffffffff81c496b9 <ipv6_gro_complete+0x99>
>     ffffffff81c4969e: f6 41 18 01 testb
> $0x1,0x18(%rcx)
>     ffffffff81c496a2: 74 34 je
> ffffffff81c496d8 <ipv6_gro_complete+0xb8> <--- 3rd conditional
> check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
>     ffffffff81c496a4: 48 98 cltq
>     ffffffff81c496a6: 48 01 c2 add %rax,%rdx
>     ffffffff81c496a9: 0f b6 42 01 movzbl 0x1(%rdx),%eax
>     ffffffff81c496ad: 0f b6 0a movzbl (%rdx),%ecx
>     ffffffff81c496b0: 8d 04 c5 08 00 00 00 lea
> 0x8(,%rax,8),%eax
>     ffffffff81c496b7: 01 c6 add %eax,%esi
>     ffffffff81c496b9: 85 c9 test %ecx,%ecx
> <--- for loop starts here
>     ffffffff81c496bb: 74 e7 je
> ffffffff81c496a4 <ipv6_gro_complete+0x84> <--- 1st conditional
> check: proto != NEXTHDR_HOP
>     ffffffff81c496bd: 48 8b 0c cf mov
> (%rdi,%rcx,8),%rcx
>     ffffffff81c496c1: 48 85 c9 test %rcx,%rcx
>     ffffffff81c496c4: 75 d8 jne
> ffffffff81c4969e <ipv6_gro_complete+0x7e> <--- 2nd conditional
> check: unlikely(!(*opps))
>     
>     ... (indirect call ops->callbacks.gro_complete)
>
> ipv6_exthdrs_len contains a loop which has 3 conditional checks.
> For the common (no extensions header) case, in the new code, *all 3
> branches are completely avoided*
>
> patched code ipv6_gro_complete:
>
>     // if (skb->encapsulation)
>     ffffffff81befe58: f6 83 81 00 00 00 20 testb
> $0x20,0x81(%rbx)
>     ffffffff81befe5f: 74 78 je
> ffffffff81befed9 <ipv6_gro_complete+0xb9>
>     
>     ...
>     
>     // else
>     ffffffff81befed9: 0f b6 43 50 movzbl
> 0x50(%rbx),%eax
>     ffffffff81befedd: 0f b7 73 4c movzwl
> 0x4c(%rbx),%esi
>     ffffffff81befee1: 48 8b 0c c5 c0 3f a9 mov -
> 0x7d56c040(,%rax,8),%rcx
>     
>     ... (indirect call ops->callbacks.gro_complete)
>
> Thus, the patch is beneficial for both the common case and the ext
> hdr
> case. I would appreciate a second consideration :)

A problem with the above analysis is that it does not take into
consideration the places where the new branches are added:
eth_gro_receive() and ipv6_gro_receive().

Note that these functions are called for each packet on the wire:
multiple times for each aggregate packet.

The above is likely not measurable in terms of pps delta, but the added
CPU cycles spent for the common case are definitely there. In my
opinion, that outweighs the benefit for the extension headers case.

Cheers,

Paolo

p.s. please refrain from off-list pings. They are ignored by most and
considered rude by some.

2023-04-20 17:25:21

by Richard Gobert

Subject: Re: [PATCH v4 2/2] gro: optimise redundant parsing of packets

> On Wed, 2023-03-22 at 20:33 +0100, Richard Gobert wrote:
> > > On Wed, Mar 22, 2023 at 2:59 AM Paolo Abeni <[email protected]>
> > > wrote:
> > > >
> > > > On Mon, 2023-03-20 at 18:00 +0100, Richard Gobert wrote:
> > > > > Currently the IPv6 extension headers are parsed twice: first in
> > > > > ipv6_gro_receive, and then again in ipv6_gro_complete.
> > > > >
> > > > > By using the new ->transport_proto field, and also storing the
> > > > > size of the
> > > > > network header, we can avoid parsing extension headers a second
> > > > > time in
> > > > > ipv6_gro_complete (which saves multiple memory dereferences and
> > > > > conditional
> > > > > checks inside ipv6_exthdrs_len for a varying amount of
> > > > > extension headers in
> > > > > IPv6 packets).
> > > > >
> > > > > The implementation had to handle both inner and outer layers in
> > > > > case of
> > > > > encapsulation (as they can't use the same field). I've applied
> > > > > a similar
> > > > > optimisation to Ethernet.
> > > > >
> > > > > Performance tests for TCP stream over IPv6 with a varying
> > > > > amount of
> > > > > extension headers demonstrate throughput improvement of ~0.7%.
> > > >
> > > > I'm surprised that the improvement is measurable: for large
> > > > aggregate
> > > > packets a single ipv6_exthdrs_len() call is avoided out of tens
> > > > calls
> > > > for the individual pkts. Additionally such figure is comparable
> > > > to
> > > > noise level in my tests.
> >
> > It's not simple but I made an effort to make a quiet environment.
> > Correct configuration allows for this kind of measurements to be made
> > as the test is CPU bound and noise is a variance that can be reduced
> > with
> > enough samples.
> >
> > Environment example: (100Gbit NIC (mlx5), physical machine, i9 12th
> > gen)
> >
> > # power-management and hyperthreading disabled in BIOS
> > # sysctl preallocate net mem
> > echo 0 > /sys/devices/system/cpu/cpufreq/boost # disable
> > turboboost
> > ethtool -A enp1s0f0np0 rx off tx off autoneg off # no PAUSE
> > frames
> >
> > # Single core performance
> > for x in /sys/devices/system/cpu/cpu[1-9]*/online; do echo 0
> > >"$x"; done
> >
> > ./network-testing-master/bin/netfilter_unload_modules.sh
> > 2>/dev/null # unload netfilter
> > tuned-adm profile latency-performance
> > cpupower frequency-set -f 2200MHz # Set core to specific
> > frequency
> > systemctl isolate rescue-ssh.target
> > # and kill all processes besides init
> >
> > > > This adds a couple of additional branches for the common (no
> > > > extensions
> > > > header) case.
> >
> > The additional branch in ipv6_gro_receive would be negligible or even
> > non-existent for a branch predictor in the common case
> > (non-encapsulated packets).
> > I could wrap it with a likely macro if you wish.
> > Inside ipv6_gro_complete a couple of branches are saved for the
> > common
> > case as demonstrated below.
> >
> > original code ipv6_gro_complete (ipv6_exthdrs_len is inlined):
> >
> > // if (skb->encapsulation)
> >
> > ffffffff81c4962b: f6 87 81 00 00 00 20 testb
> > $0x20,0x81(%rdi)
> > ffffffff81c49632: 74 2a je
> > ffffffff81c4965e <ipv6_gro_complete+0x3e>
> >
> > ...
> >
> > // nhoff += sizeof(*iph) + ipv6_exthdrs_len(iph, &ops);
> >
> > ffffffff81c4969c: eb 1b jmp
> > ffffffff81c496b9 <ipv6_gro_complete+0x99> <-- jump to beginning of
> > for loop
> > ffffffff81c4968e: b8 28 00 00 00 mov $0x28,%eax
> > ffffffff81c49693: 31 f6 xor %esi,%esi
> > ffffffff81c49695: 48 c7 c7 c0 28 aa 82 mov
> > $0xffffffff82aa28c0,%rdi
> > ffffffff81c4969c: eb 1b jmp
> > ffffffff81c496b9 <ipv6_gro_complete+0x99>
> > ffffffff81c4969e: f6 41 18 01 testb
> > $0x1,0x18(%rcx)
> > ffffffff81c496a2: 74 34 je
> > ffffffff81c496d8 <ipv6_gro_complete+0xb8> <--- 3rd conditional
> > check: !((*opps)->flags & INET6_PROTO_GSO_EXTHDR)
> > ffffffff81c496a4: 48 98 cltq
> > ffffffff81c496a6: 48 01 c2 add %rax,%rdx
> > ffffffff81c496a9: 0f b6 42 01 movzbl 0x1(%rdx),%eax
> > ffffffff81c496ad: 0f b6 0a movzbl (%rdx),%ecx
> > ffffffff81c496b0: 8d 04 c5 08 00 00 00 lea
> > 0x8(,%rax,8),%eax
> > ffffffff81c496b7: 01 c6 add %eax,%esi
> > ffffffff81c496b9: 85 c9 test %ecx,%ecx
> > <--- for loop starts here
> > ffffffff81c496bb: 74 e7 je
> > ffffffff81c496a4 <ipv6_gro_complete+0x84> <--- 1st conditional
> > check: proto != NEXTHDR_HOP
> > ffffffff81c496bd: 48 8b 0c cf mov
> > (%rdi,%rcx,8),%rcx
> > ffffffff81c496c1: 48 85 c9 test %rcx,%rcx
> > ffffffff81c496c4: 75 d8 jne
> > ffffffff81c4969e <ipv6_gro_complete+0x7e> <--- 2nd conditional
> > check: unlikely(!(*opps))
> >
> > ... (indirect call ops->callbacks.gro_complete)
> >
> > ipv6_exthdrs_len contains a loop which has 3 conditional checks.
> > For the common (no extensions header) case, in the new code, *all 3
> > branches are completely avoided*
> >
> > patched code ipv6_gro_complete:
> >
> > // if (skb->encapsulation)
> > ffffffff81befe58: f6 83 81 00 00 00 20 testb
> > $0x20,0x81(%rbx)
> > ffffffff81befe5f: 74 78 je
> > ffffffff81befed9 <ipv6_gro_complete+0xb9>
> >
> > ...
> >
> > // else
> > ffffffff81befed9: 0f b6 43 50 movzbl
> > 0x50(%rbx),%eax
> > ffffffff81befedd: 0f b7 73 4c movzwl
> > 0x4c(%rbx),%esi
> > ffffffff81befee1: 48 8b 0c c5 c0 3f a9 mov -
> > 0x7d56c040(,%rax,8),%rcx
> >
> > ... (indirect call ops->callbacks.gro_complete)
> >
> > Thus, the patch is beneficial for both the common case and the ext
> > hdr
> > case. I would appreciate a second consideration :)
>
> A problem with the above analysis is that it does not take in
> consideration the places where the new branch are added:
> eth_gro_receive() and ipv6_gro_receive().
>
> Note that such functions are called for each packet on the wire:
> multiple times for each aggregate packets.
>
> The above is likely not measurable in terms on pps delta, but the added
> CPU cycles spent for the common case are definitely there. In my
> opinion that outlast the benefit for the extensions header case.
>
> Cheers,
>
> Paolo
>
> p.s. please refrain from off-list ping. That is ignored by most and
> considered rude by some.

Thanks,
I will re-post the first patch as a new one.
As for the second patch, I get your point; you are correct. I didn't
pay enough attention to the accumulated overhead during the receive
phase, as it wasn't showing up in my measurements. I'll look further
into it and check if I can come up with a better solution.

Sorry for the off-list ping. Is it ok to send a ping via the mailing list?