Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
> The XDP redirect process is two staged:

On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
>
> I need/want to echo Toke's request to benchmark these changes.

I have:
boxA: ixgbe
boxB: i40e

Both are bigger NUMA boxes. I had to patch ixgbe to ignore the 64-CPU
limit, and I boot the box with only 64 CPUs. The IOMMU has been disabled
on both boxes, as well as the CPU mitigations. The link is 10G.

The base for testing is commit a17ef9e6c2c1c ("net_sched: sch_sfq:
annotate data-races around q->perturb_period"), on top of which I
rebased my series.

pktgen_sample03_burst_single_flow.sh has been used to send packets and
"xdp-bench drop $nic -e" to receive them.

baseline
~~~~~~~~
boxB -> boxA | gov performance
-t2 (to pktgen)
| receive total 14,854,233 pkt/s 14,854,233 drop/s 0 error/s

-t1 (to pktgen)
| receive total 10,642,895 pkt/s 10,642,895 drop/s 0 error/s


boxB -> boxA | gov powersave
-t2 (to pktgen)
receive total 10,196,085 pkt/s 10,196,085 drop/s 0 error/s
receive total 10,187,254 pkt/s 10,187,254 drop/s 0 error/s
receive total 10,553,298 pkt/s 10,553,298 drop/s 0 error/s

-t1
receive total 10,427,732 pkt/s 10,427,732 drop/s 0 error/s

======
boxA -> boxB (-t1) gov performance
performance:
receive total 13,171,962 pkt/s 13,171,962 drop/s 0 error/s
receive total 13,368,344 pkt/s 13,368,344 drop/s 0 error/s

powersave:
receive total 13,343,136 pkt/s 13,343,136 drop/s 0 error/s
receive total 13,220,326 pkt/s 13,220,326 drop/s 0 error/s

(i.e. the CPU governor had no impact, just noise)

The series applied (with updated 14/15)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
boxB -> boxA | gov performance
-t2:
receive total 14,880,199 pkt/s 14,880,199 drop/s 0 error/s

-t1:
receive total 10,769,082 pkt/s 10,769,082 drop/s 0 error/s

boxB -> boxA | gov powersave
-t2:
receive total 11,163,323 pkt/s 11,163,323 drop/s 0 error/s

-t1:
receive total 10,756,515 pkt/s 10,756,515 drop/s 0 error/s

boxA -> boxB | gov performance

receive total 13,395,919 pkt/s 13,395,919 drop/s 0 error/s

boxA -> boxB | gov performance
receive total 13,290,527 pkt/s 13,290,527 drop/s 0 error/s


Based on my numbers, there is just noise. BoxA hits the CPU limit during
receive when lowering the CPU frequency. BoxB seems to be unaffected by
lowering the CPU frequency during receive.

I can't comment on anything >10G due to HW limits.

Sebastian


2024-05-14 05:07:42

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.



On 10/05/2024 18.22, Sebastian Andrzej Siewior wrote:
> On 2024-05-10 18:21:24 [+0200], To Jesper Dangaard Brouer wrote:
>> The XDP redirect process is two staged:
> …
> On 2024-05-07 15:27:44 [+0200], Jesper Dangaard Brouer wrote:
>>
>> I need/want to echo Toke's request to benchmark these changes.
>
> I have:
> boxA: ixgbe
> boxB: i40e
>
> Both are bigger NUMA boxes. I had to patch ixgbe to ignore the 64-CPU
> limit, and I boot the box with only 64 CPUs. The IOMMU has been disabled
> on both boxes, as well as the CPU mitigations. The link is 10G.
>
> The base for testing is commit a17ef9e6c2c1c ("net_sched: sch_sfq:
> annotate data-races around q->perturb_period"), on top of which I
> rebased my series.
>
> pktgen_sample03_burst_single_flow.sh has been used to send packets and
> "xdp-bench drop $nic -e" to receive them.
>

Sorry, but an XDP_DROP test will not activate the code you are modifying.
Thus, this test is invalid and doesn't tell us anything about your code
changes.

The code is modifying the XDP_REDIRECT handling system. Thus, the
benchmark test needs to activate this code.
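
For example (interface names are placeholders, and the exact syntax is
from memory, so double-check against xdp-bench --help):

 # xdp-bench redirect eth1 eth2

i.e. something that actually exercises the redirect path rather than
dropping in place.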


> baseline
> ~~~~~~~~
> boxB -> boxA | gov performance
> -t2 (to pktgen)
> | receive total 14,854,233 pkt/s 14,854,233 drop/s 0 error/s
>
> -t1 (to pktgen)
> | receive total 10,642,895 pkt/s 10,642,895 drop/s 0 error/s
>
>
> boxB -> boxA | gov powersave
> -t2 (to pktgen)
> receive total 10,196,085 pkt/s 10,196,085 drop/s 0 error/s
> receive total 10,187,254 pkt/s 10,187,254 drop/s 0 error/s
> receive total 10,553,298 pkt/s 10,553,298 drop/s 0 error/s
>
> -t1
> receive total 10,427,732 pkt/s 10,427,732 drop/s 0 error/s
>
> ======
> boxA -> boxB (-t1) gov performance
> performance:
> receive total 13,171,962 pkt/s 13,171,962 drop/s 0 error/s
> receive total 13,368,344 pkt/s 13,368,344 drop/s 0 error/s
>
> powersave:
> receive total 13,343,136 pkt/s 13,343,136 drop/s 0 error/s
> receive total 13,220,326 pkt/s 13,220,326 drop/s 0 error/s
>
> (i.e. the CPU governor had no impact, just noise)
>
> The series applied (with updated 14/15)
> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> boxB -> boxA | gov performance
> -t2:
> receive total 14,880,199 pkt/s 14,880,199 drop/s 0 error/s
>
> -t1:
> receive total 10,769,082 pkt/s 10,769,082 drop/s 0 error/s
>
> boxB -> boxA | gov powersave
> -t2:
> receive total 11,163,323 pkt/s 11,163,323 drop/s 0 error/s
>
> -t1:
> receive total 10,756,515 pkt/s 10,756,515 drop/s 0 error/s
>
> boxA -> boxB | gov performance
>
> receive total 13,395,919 pkt/s 13,395,919 drop/s 0 error/s
>
> boxA -> boxB | gov performance
> receive total 13,290,527 pkt/s 13,290,527 drop/s 0 error/s
>
>
> Based on my numbers, there is just noise. BoxA hits the CPU limit during
> receive when lowering the CPU frequency. BoxB seems to be unaffected by
> lowering the CPU frequency during receive.
>
> I can't comment on anything >10G due to HW limits.
>
> Sebastian

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
> > pktgen_sample03_burst_single_flow.sh has been used to send packets and
> > "xdp-bench drop $nic -e" to receive them.
> >
>
> Sorry, but an XDP_DROP test will not activate the code you are modifying.
> Thus, this test is invalid and doesn't tell us anything about your code
> changes.
>
> The code is modifying the XDP_REDIRECT handling system. Thus, the
> benchmark test needs to activate this code.

This was a misunderstanding on my side then. What do you suggest
instead? Same setup but "redirect" on the same interface instead of
"drop"?

Sebastian

2024-05-14 12:20:26

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.



On 14/05/2024 07.43, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 07:07:21 [+0200], Jesper Dangaard Brouer wrote:
>>> pktgen_sample03_burst_single_flow.sh has been used to send packets and
>>> "xdp-bench drop $nic -e" to receive them.
>>>
>>
>> Sorry, but an XDP_DROP test will not activate the code you are modifying.
>> Thus, this test is invalid and doesn't tell us anything about your code
>> changes.
>>
>> The code is modifying the XDP_REDIRECT handling system. Thus, the
>> benchmark test needs to activate this code.
>
> This was a misunderstanding on my side then. What do you suggest
> instead? Same setup but "redirect" on the same interface instead of
> "drop"?
>

Redirect is more flexible. Redirecting back out the same interface is one
option, but I've often seen this cause issues, because it overloads the
traffic generator (without people realizing it), leading to false results.
Thus, verify that the packet generator is sending faster than the rate you
are collecting. (I use this tool[2] on the generator machine, in another
terminal, to see if something funky is happening with the ethtool stats.)
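
Typical usage on the generator machine is roughly (check the script's
--help for the exact options):

 # ./ethtool_stats.pl --dev $nic --sec 1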

To work around this issue, I've previously redirected to the 'lo'
(localhost) device, which is obviously invalid, so the packet gets
dropped. But I can see that when we converted from the kernel
samples/bpf/ to this tool, this trick/hack is no longer supported.

The xdp-bench[1] tool also provides a number of redirect sub-commands,
e.g. redirect / redirect-cpu / redirect-map / redirect-multi.
Given you also modify the CPU-map code, I would say we also need a
'redirect-cpu' test case.

Trick for CPU-map to do early drop on remote CPU:

# ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1

I recommend pressing Ctrl+\ while it is running to show more info, like
which CPUs are being used and what the kthread consumes. This helps catch
issues, e.g. if you are CPU-redirecting to the same CPU that RX happens
to run on.

--Jesper

[1] https://github.com/xdp-project/xdp-tools/tree/master/xdp-bench
[2]
https://github.com/netoptimizer/network-testing/blob/master/bin/ethtool_stats.pl

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
> Trick for CPU-map to do early drop on remote CPU:
>
> # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
>
> I recommend pressing Ctrl+\ while it is running to show more info, like
> which CPUs are being used and what the kthread consumes. This helps catch
> issues, e.g. if you are CPU-redirecting to the same CPU that RX happens
> to run on.

Okay. So I reworked the last two patches to make the struct part of
task_struct and then did as you suggested:

Unpatched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat: 952102520 ( 952,102,520) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat: 446045897 ( 446,045,897) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat: 7434098 ( 7,434,098) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat: 446556042 ( 446,556,042) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat: 7442601 ( 7,442,601) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat: 892592523 ( 892,592,523) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14876542 ( 14,876,542) <= tx_packets /sec
|Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_restart /sec
|Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat: 14876622 ( 14,876,622) <= tx_unicast /sec
|
|Receive:
|eth1->? 8,732,508 rx/s 0 err,drop/s
| receive total 8,732,508 pkt/s 0 drop/s 0 error/s
| cpu:10 8,732,508 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
| cpu:10->3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
| kthread total 8,732,506 pkt/s 0 drop/s 205,650 sched
| cpu:3 8,732,506 pkt/s 0 drop/s 205,650 sched
| xdp_stats 0 pass/s 8,732,506 drop/s 0 redir/s
| cpu:3 0 pass/s 8,732,506 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

I verified that the "drop only" case hits 14M packets/s while this
redirect part reports 8M packets/s.

Patched:
|Sending:
|Show adapter(s) (eno2np1) statistics (ONLY that changed!)
|Ethtool(eno2np1 ) stat: 952635404 ( 952,635,404) <= port.tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14884934 ( 14,884,934) <= port.tx_size_64 /sec
|Ethtool(eno2np1 ) stat: 14884928 ( 14,884,928) <= port.tx_unicast /sec
|Ethtool(eno2np1 ) stat: 446496117 ( 446,496,117) <= tx-0.bytes /sec
|Ethtool(eno2np1 ) stat: 7441602 ( 7,441,602) <= tx-0.packets /sec
|Ethtool(eno2np1 ) stat: 446603461 ( 446,603,461) <= tx-1.bytes /sec
|Ethtool(eno2np1 ) stat: 7443391 ( 7,443,391) <= tx-1.packets /sec
|Ethtool(eno2np1 ) stat: 893086506 ( 893,086,506) <= tx_bytes /sec
|Ethtool(eno2np1 ) stat: 14884775 ( 14,884,775) <= tx_packets /sec
|Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_restart /sec
|Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_stopped /sec
|Ethtool(eno2np1 ) stat: 14884937 ( 14,884,937) <= tx_unicast /sec
|
|Receive:
|eth1->? 8,735,198 rx/s 0 err,drop/s
| receive total 8,735,198 pkt/s 0 drop/s 0 error/s
| cpu:6 8,735,198 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
| cpu:6->3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
| kthread total 8,735,191 pkt/s 0 drop/s 208,054 sched
| cpu:3 8,735,191 pkt/s 0 drop/s 208,054 sched
| xdp_stats 0 pass/s 8,735,191 drop/s 0 redir/s
| cpu:3 0 pass/s 8,735,191 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

This looks to be in the same range / noise level. top-wise I have
ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed limit
on a 10G link. perf top shows
| 18.37% bpf_prog_4f0ffbb35139c187_cpumap_l4_hash [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 13.15% [kernel] [k] cpu_map_kthread_run
| 12.96% [kernel] [k] ixgbe_poll
| 6.78% [kernel] [k] page_frag_free
| 5.62% [kernel] [k] xdp_do_redirect

for the top 5. Is this something that looks reasonable?

Sebastian

2024-05-22 07:10:09

by Jesper Dangaard Brouer

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.



On 17/05/2024 18.15, Sebastian Andrzej Siewior wrote:
> On 2024-05-14 14:20:03 [+0200], Jesper Dangaard Brouer wrote:
>> Trick for CPU-map to do early drop on remote CPU:
>>
>> # ./xdp-bench redirect-cpu --cpu 3 --remote-action drop ixgbe1
>>
>> I recommend pressing Ctrl+\ while it is running to show more info, like
>> which CPUs are being used and what the kthread consumes. This helps catch
>> issues, e.g. if you are CPU-redirecting to the same CPU that RX happens
>> to run on.
>
> Okay. So I reworked the last two patches to make the struct part of
> task_struct and then did as you suggested:
>
> Unpatched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat: 952102520 ( 952,102,520) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat: 14876602 ( 14,876,602) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat: 446045897 ( 446,045,897) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat: 7434098 ( 7,434,098) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat: 446556042 ( 446,556,042) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat: 7442601 ( 7,442,601) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat: 892592523 ( 892,592,523) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat: 14876542 ( 14,876,542) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat: 2 ( 2) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat: 14876622 ( 14,876,622) <= tx_unicast /sec
> |
> |Receive:
> |eth1->? 8,732,508 rx/s 0 err,drop/s
> | receive total 8,732,508 pkt/s 0 drop/s 0 error/s
> | cpu:10 8,732,508 pkt/s 0 drop/s 0 error/s
> | enqueue to cpu 3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
> | cpu:10->3 8,732,510 pkt/s 0 drop/s 7.00 bulk-avg
> | kthread total 8,732,506 pkt/s 0 drop/s 205,650 sched
> | cpu:3 8,732,506 pkt/s 0 drop/s 205,650 sched
> | xdp_stats 0 pass/s 8,732,506 drop/s 0 redir/s
> | cpu:3 0 pass/s 8,732,506 drop/s 0 redir/s
> | redirect_err 0 error/s
> | xdp_exception 0 hit/s
>
> I verified that the "drop only" case hits 14M packets/s while this
> redirect part reports 8M packets/s.
>

Great, this is a good test.

The transmit speed of 14.88 Mpps is 10G wirespeed at the smallest Ethernet
packet size (84 bytes with overhead + interframe gap: 10*10^9/(84*8) =
14,880,952 pps).


> Patched:
> |Sending:
> |Show adapter(s) (eno2np1) statistics (ONLY that changed!)
> |Ethtool(eno2np1 ) stat: 952635404 ( 952,635,404) <= port.tx_bytes /sec
> |Ethtool(eno2np1 ) stat: 14884934 ( 14,884,934) <= port.tx_size_64 /sec
> |Ethtool(eno2np1 ) stat: 14884928 ( 14,884,928) <= port.tx_unicast /sec
> |Ethtool(eno2np1 ) stat: 446496117 ( 446,496,117) <= tx-0.bytes /sec
> |Ethtool(eno2np1 ) stat: 7441602 ( 7,441,602) <= tx-0.packets /sec
> |Ethtool(eno2np1 ) stat: 446603461 ( 446,603,461) <= tx-1.bytes /sec
> |Ethtool(eno2np1 ) stat: 7443391 ( 7,443,391) <= tx-1.packets /sec
> |Ethtool(eno2np1 ) stat: 893086506 ( 893,086,506) <= tx_bytes /sec
> |Ethtool(eno2np1 ) stat: 14884775 ( 14,884,775) <= tx_packets /sec
> |Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_restart /sec
> |Ethtool(eno2np1 ) stat: 14 ( 14) <= tx_stopped /sec
> |Ethtool(eno2np1 ) stat: 14884937 ( 14,884,937) <= tx_unicast /sec
> |
> |Receive:
> |eth1->? 8,735,198 rx/s 0 err,drop/s
> | receive total 8,735,198 pkt/s 0 drop/s 0 error/s
> | cpu:6 8,735,198 pkt/s 0 drop/s 0 error/s
> | enqueue to cpu 3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
> | cpu:6->3 8,735,193 pkt/s 0 drop/s 7.00 bulk-avg
> | kthread total 8,735,191 pkt/s 0 drop/s 208,054 sched
> | cpu:3 8,735,191 pkt/s 0 drop/s 208,054 sched
> | xdp_stats 0 pass/s 8,735,191 drop/s 0 redir/s
> | cpu:3 0 pass/s 8,735,191 drop/s 0 redir/s
> | redirect_err 0 error/s
> | xdp_exception 0 hit/s
>

Great, basically zero overhead. Awesome that you verified this!


> This looks to be in the same range / noise level. top-wise I have
> ksoftirqd at 100% and cpumap/./map at ~60%, so I hit the CPU speed limit
> on a 10G link.

For our purpose of testing the XDP_REDIRECT code that you are modifying,
this is what we want: the RX CPU/NAPI is the bottleneck, while the remote
cpumap CPU has idle cycles (also indicated by the 208,054 sched stats).
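
If you want to double-check that split, per-CPU utilization can also be
watched with something like (CPU numbers taken from your unpatched run):

 # mpstat -P 3,10 1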

> perf top shows

I appreciate getting this perf data.

As we are explicitly dealing with splitting the workload across CPUs, it
is worth mentioning that perf supports displaying and filtering on CPUs.

This perf command includes the CPU number (zero-indexed):
# perf report --sort cpu,comm,dso,symbol --no-children

For this benchmark, to focus the output, I would reduce this to:
# perf report --sort cpu,symbol --no-children

The perf tool can also use -C to filter on specific CPUs, like:

# perf report --sort cpu,symbol --no-children -C 3,6
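
The data for these reports can be captured system-wide, while the test is
running, with something like:

 # perf record -a sleep 30

(the duration is just an example).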


> | 18.37% bpf_prog_4f0ffbb35139c187_cpumap_l4_hash [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash

This bpf_prog_4f0ffbb35139c187_cpumap_l4_hash is running on the RX CPU,
doing the load balancing.

> | 13.15% [kernel] [k] cpu_map_kthread_run

This runs on the remote cpumap CPU (in this case CPU 3).

> | 12.96% [kernel] [k] ixgbe_poll
> | 6.78% [kernel] [k] page_frag_free

The page_frag_free call might run on the remote cpumap CPU.

> | 5.62% [kernel] [k] xdp_do_redirect
>
> for the top 5. Is this something that looks reasonable?

Yes, except I had to guess how the workload was split between CPUs ;-)

Thanks for doing these benchmarks! :-)
--Jesper



Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
>
> Yes, except I had to guess how the workload was split between CPUs ;-)
>
> Thanks for doing these benchmarks! :-)

Thank you for all the explanations and walking me through it.
I'm going to update the patches as discussed and redo the numbers as
suggested here.
Thanks.

> --Jesper

Sebastian

Subject: Re: [PATCH net-next 14/15 v2] net: Reference bpf_redirect_info via task_struct on PREEMPT_RT.

On 2024-05-22 09:09:45 [+0200], Jesper Dangaard Brouer wrote:
> For this benchmark, to focus, I would reduce this to:
> # perf report --sort cpu,symbol --no-children

Keeping the bpf_net_ctx_set()/clear(), removing the NULL checks (to align
with Alexei's feedback in his last email).
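
For readers following along, here is a purely illustrative userspace mock
of that set/clear pattern. The type and helper names mirror the series,
but the struct member and the surrounding code are made up; this is not
the kernel implementation:

/*
 * Mock of the pattern only: an on-stack context is registered with the
 * current task for the duration of the processing loop, instead of
 * keeping the redirect state in per-CPU variables.
 */
#include <stdio.h>

struct bpf_redirect_info {
	unsigned int tgt_index;		/* made-up stand-in for the real members */
};

struct bpf_net_context {
	struct bpf_redirect_info ri;
};

/* Stand-in for current->bpf_net_context. */
static __thread struct bpf_net_context *current_bpf_net_ctx;

static struct bpf_net_context *bpf_net_ctx_set(struct bpf_net_context *ctx)
{
	current_bpf_net_ctx = ctx;
	return ctx;
}

static void bpf_net_ctx_clear(struct bpf_net_context *ctx)
{
	(void)ctx;
	current_bpf_net_ctx = NULL;
}

static void napi_poll_mock(void)
{
	/* On-stack context, valid only while this "poll" runs. */
	struct bpf_net_context __ctx, *ctx = bpf_net_ctx_set(&__ctx);

	/* An XDP prog calling bpf_redirect() would fill this in. */
	ctx->ri.tgt_index = 3;
	printf("redirect target: %u\n", current_bpf_net_ctx->ri.tgt_index);

	bpf_net_ctx_clear(ctx);
}

int main(void)
{
	napi_poll_mock();
	return 0;
}

The point being that the context lives on the stack and is only reachable
via the current task while the processing runs, instead of via per-CPU
state.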
Perf numbers wise, I'm using
xdp-bench redirect-cpu --cpu 3 --remote-action drop eth1 -e

Unpatched:

| eth1->? 9,427,705 rx/s 0 err,drop/s
| receive total 9,427,705 pkt/s 0 drop/s 0 error/s
| cpu:17 9,427,705 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 9,427,708 pkt/s 0 drop/s 8.00 bulk-avg
| cpu:17->3 9,427,708 pkt/s 0 drop/s 8.00 bulk-avg
| kthread total 9,427,710 pkt/s 0 drop/s 147,276 sched
| cpu:3 9,427,710 pkt/s 0 drop/s 147,276 sched
| xdp_stats 0 pass/s 9,427,710 drop/s 0 redir/s
| cpu:3 0 pass/s 9,427,710 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

Patched:
| eth1->? 9,557,170 rx/s 0 err,drop/s
| receive total 9,557,170 pkt/s 0 drop/s 0 error/s
| cpu:9 9,557,170 pkt/s 0 drop/s 0 error/s
| enqueue to cpu 3 9,557,170 pkt/s 0 drop/s 8.00 bulk-avg
| cpu:9->3 9,557,170 pkt/s 0 drop/s 8.00 bulk-avg
| kthread total 9,557,195 pkt/s 0 drop/s 126,164 sched
| cpu:3 9,557,195 pkt/s 0 drop/s 126,164 sched
| xdp_stats 0 pass/s 9,557,195 drop/s 0 redir/s
| cpu:3 0 pass/s 9,557,195 drop/s 0 redir/s
| redirect_err 0 error/s
| xdp_exception 0 hit/s

I think this is noise. perf output as suggested (perf report --sort
cpu,symbol --no-children).

unpatched:
| 19.05% 017 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 11.40% 017 [k] ixgbe_poll
| 10.68% 003 [k] cpu_map_kthread_run
| 7.62% 003 [k] intel_idle
| 6.11% 017 [k] xdp_do_redirect
| 6.01% 003 [k] page_frag_free
| 4.72% 017 [k] bq_flush_to_queue
| 3.74% 017 [k] cpu_map_redirect
| 2.35% 003 [k] xdp_return_frame
| 1.55% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop
| 1.49% 017 [k] dma_sync_single_for_device
| 1.41% 017 [k] ixgbe_alloc_rx_buffers
| 1.26% 017 [k] cpu_map_enqueue
| 1.24% 017 [k] dma_sync_single_for_cpu
| 1.12% 003 [k] __xdp_return
| 0.83% 017 [k] bpf_trace_run4
| 0.77% 003 [k] __switch_to

patched:
| 18.20% 009 [k] bpf_prog_4f0ffbb35139c187_cpumap_l4_hash
| 11.64% 009 [k] ixgbe_poll
| 7.74% 003 [k] page_frag_free
| 6.69% 003 [k] cpu_map_bpf_prog_run_xdp
| 6.02% 003 [k] intel_idle
| 5.96% 009 [k] xdp_do_redirect
| 4.45% 003 [k] cpu_map_kthread_run
| 3.71% 009 [k] cpu_map_redirect
| 3.23% 009 [k] bq_flush_to_queue
| 2.55% 003 [k] xdp_return_frame
| 1.67% 003 [k] bpf_prog_57cd311f2e27366b_cpumap_drop
| 1.60% 009 [k] _raw_spin_lock
| 1.57% 009 [k] bpf_prog_d7eca17ddc334d36_tp_xdp_cpumap_enqueue
| 1.48% 009 [k] dma_sync_single_for_device
| 1.47% 009 [k] ixgbe_alloc_rx_buffers
| 1.39% 009 [k] dma_sync_single_for_cpu
| 1.33% 009 [k] cpu_map_enqueue
| 1.19% 003 [k] __xdp_return
| 0.66% 003 [k] __switch_to

I'm going to repost the series once the merge window closes unless there
is something you want me to do.

> --Jesper

Sebastian