WORK IN PROGRESS:
* bpf program loading works!
* txq steering via bpf program return code works!
* bpf program unloading not working.
* bpf program attached query not working.
This patch set provides a bpf hookpoint with goals similar to, but a more
generic implementation than, TUNSETSTEERINGEBPF: userspace-supplied tx queue
selection policy.
TUNSETSTEERINGEBPF is a useful bpf hookpoint, but has some drawbacks.
First, it only works on tun/tap devices.
Second, the current TUNSETSTEERINGEBPF implementation provides no way to
bail out or load a noop bpf prog and fall back to the no-prog tx queue
selection method.
Third, the TUNSETSTEERINGEBPF interface seems to require possession of existing
or creation of new queues/fds.
This most naturally fits in the "wire" implementation since possession of fds
is ensured. However, it also means the various "wire" implementations (e.g.
qemu) have to all be made aware of TUNSETSTEERINGEBPF and expose an interface
to load/unload a bpf prog (or provide a mechanism to pass an fd to another
program).
Alternatively, you can spin up an extra queue and immediately disable it via
IFF_DETACH_QUEUE, but this seems unsafe; packets could be enqueued to this
extra file descriptor, which is part of our bpf prog loader, not our "wire".
Placing this in the XPS code and leveraging iproute2 and rtnetlink to provide
our bpf prog loader in a similar manner to xdp gives us a nice way to separate
the tap "wire" and the loading of tx queue selection policy. It also lets us
use this hookpoint for any device traversing XPS.
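The hook effectively sits in front of the existing xps_cpus/xps_rxqs lookup.
As a rough illustration only (not the patch itself; the dev->xps_prog field
and the function name below are placeholders):

/* context: net/core/dev.c tx path; bpf_prog_run_clear_cb() is in linux/filter.h */
static int xps_bpf_select_queue(struct net_device *dev, struct sk_buff *skb)
{
	/* dev->xps_prog is a placeholder name for the attached prog */
	struct bpf_prog *prog = rcu_dereference_bh(dev->xps_prog);
	int ret;

	if (!prog)
		return -1;	/* no prog: keep using xps_cpus/xps_rxqs */

	ret = bpf_prog_run_clear_cb(prog, skb);
	if (ret < 0)
		return -1;	/* noop/bail-out prog: same fallback */

	/* clamp to the real queue count ("return 2" lands on queue 0 below) */
	return ret % dev->real_num_tx_queues;
}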
This patch only introduces the new hookpoint in the XPS code; it will not yet
be used by tun/tap devices using the in-tree tun.ko (which implements an
.ndo_select_queue and does not traverse the XPS code).
In a future patch set, we can optionally refactor tun.ko to traverse this call
to bpf_prog_run_clear_cb() and bpf prog storage. tun/tap devices could then
leverage iproute2 as a generic loader. The TUNSETSTEERINGEBPF interface could
at this point be optionally deprecated/removed.
Both patches in this set have been tested using a rebuilt tun.ko with no
.ndo_select_queue.
sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
The tap device was instantiated using tap_mq_pong.c, supporting scripts, and
wrapping service found here:
https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
The bpf prog source and test scripts can be found here:
https://github.com/werekraken/xps_ebpf
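For a rough idea of what the hello progs referenced below look like, here is a
minimal sketch (the real sources are in the repo above; the context type,
include path, and exact format string are my assumptions):

#include <linux/bpf.h>
#include "bpf_helpers.h"

SEC("hello")
int select_txq(struct __sk_buff *skb)
{
	char fmt[] = "xps (RET 0): Hello, World!\n";

	bpf_trace_printk(fmt, sizeof(fmt));
	return 0;	/* steer to tx queue 0; a negative return falls back to xps_cpus */
}

char _license[] SEC("license") = "GPL";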
In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check the
queue_mapping.
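(For reference, PACKET_FANOUT_QM fans packets out to the sockets of a fanout
group by skb queue_mapping, so the member socket that sees the packet
identifies the tx queue. A minimal sketch of that setup, not the nstxq code
itself:)

#include <unistd.h>
#include <arpa/inet.h>
#include <sys/socket.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>

/* one such socket per tx queue joins the same fanout group;
 * binding the socket to tap0 is omitted for brevity */
static int open_qm_fanout_socket(int group_id)
{
	int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
	int arg = group_id | (PACKET_FANOUT_QM << 16);

	if (fd < 0)
		return -1;
	if (setsockopt(fd, SOL_PACKET, PACKET_FANOUT, &arg, sizeof(arg)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}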
With no prog loaded, the tx queue selection adheres to our xps_cpus
configuration.
[vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
/sys/class/net/tap0/queues/tx-0/xps_cpus:1
/sys/class/net/tap0/queues/tx-1/xps_cpus:2
cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
With a return 0 bpf prog, our tx queue is 0 (despite xps_cpus).
[vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
With a return 1 bpf prog, our tx queue is 1.
[vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
With a return 2 bpf prog, our tx queue is 0 (we only have 2 tx queues).
[vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
With a return -1 bpf prog, our tx queue selection is once again determined by
xps_cpus. Any negative return should work the same and provides a nice
mechanism to bail out or have a noop bpf prog at this hookpoint.
[vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
bpf prog unloading is not yet working, nor does `ip link show` report when an
"xps" bpf prog is attached. This is my first time touching iproute2 or
rtnetlink, so it may be something obvious to those more familiar.
On 2019/9/20 8:05 AM, Matt Cover wrote:
> On Thu, Sep 19, 2019 at 3:45 PM Matthew Cover <[email protected]> wrote:
>> WORK IN PROGRESS:
>> * bpf program loading works!
>> * txq steering via bpf program return code works!
>> * bpf program unloading not working.
>> * bpf program attached query not working.
>>
>> This patch set provides a bpf hookpoint with goals similar to, but a more
>> generic implementation than, TUNSETSTEERINGEBPF; userspace supplied tx queue
>> selection policy.
One reason I introduced TUNSETSTEERINGEBPF instead of using a generic
mechanism like cls/act bpf is that I needed to make sure the API stays
consistent with macvtap.
In the case of macvtap, TX means transmit from userspace to kernel, but
for TUN, it means transmit from kernel to userspace.
>>
>> TUNSETSTEERINGEBPF is a useful bpf hookpoint, but has some drawbacks.
>>
>> First, it only works on tun/tap devices.
>>
>> Second, there is no way in the current TUNSETSTEERINGEBPF implementation
>> to bail out or load a noop bpf prog and fallback to the no prog tx queue
>> selection method.
I believe the expectation is that the eBPF program handles all of it (even
the fallback part).
>>
>> Third, the TUNSETSTEERINGEBPF interface seems to require possession of existing
>> or creation of new queues/fds.
That's the way TUN has worked for the past 10+ years, because ioctl is the
only way to do configuration and it requires an fd to carry that. David
suggested implementing netlink, but nobody has done that.
>>
>> This most naturally fits in the "wire" implementation since possession of fds
>> is ensured. However, it also means the various "wire" implementations (e.g.
>> qemu) have to all be made aware of TUNSETSTEERINGEBPF and expose an interface
>> to load/unload a bpf prog (or provide a mechanism to pass an fd to another
>> program).
Loading/unloading the eBPF program is the standard bpf() syscall; the ioctl
just attaches it to TUN. This idea is borrowed from packet sockets, where the
bpf program is attached through setsockopt().
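(For reference, that precedent is a plain setsockopt() pointing at an
already-loaded prog fd; illustrative sketch, not the exact code:)

#include <sys/socket.h>

#ifndef SO_ATTACH_BPF
#define SO_ATTACH_BPF 50	/* from asm-generic/socket.h */
#endif

/* attach an eBPF prog (loaded via bpf(), referenced by fd) to an existing
 * packet socket; TUNSETSTEERINGEBPF plays the same role for TUN */
static int attach_sock_steering(int pkt_fd, int prog_fd)
{
	return setsockopt(pkt_fd, SOL_SOCKET, SO_ATTACH_BPF,
			  &prog_fd, sizeof(prog_fd));
}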
>>
>> Alternatively, you can spin up an extra queue and immediately disable via
>> IFF_DETACH_QUEUE, but this seems unsafe; packets could be enqueued to this
>> extra file descriptor which is part of our bpf prog loader, not our "wire".
You can use your 'wire' queue to do the ioctl, but we could invent another API.
>>
>> Placing this in the XPS code and leveraging iproute2 and rtnetlink to provide
>> our bpf prog loader in a similar manner to xdp gives us a nice way to separate
>> the tap "wire" and the loading of tx queue selection policy. It also lets us
>> use this hookpoint for any device traversing XPS.
>>
>> This patch only introduces the new hookpoint to the XPS code and will not yet
>> be used by tun/tap devices using the intree tun.ko (which implements an
>> .ndo_select_queue and does not traverse the XPS code).
>>
>> In a future patch set, we can optionally refactor tun.ko to traverse this call
>> to bpf_prog_run_clear_cb() and bpf prog storage. tun/tap devices could then
>> leverage iproute2 as a generic loader. The TUNSETSTEERINGEBPF interface could
>> at this point be optionally deprecated/removed.
As described above, we need it for macvtap, and what you propose here cannot
work for that.
I'm not against this proposal, just want to clarify some of the considerations
behind TUNSETSTEERINGEBPF. The main goal is to let a VM implement a
sophisticated steering policy like RSS without touching the kernel.
Thanks
>>
>> Both patches in this set have been tested using a rebuilt tun.ko with no
>> .ndo_select_queue.
>>
>> sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
>>
>> The tap device was instantiated using tap_mq_pong.c, supporting scripts, and
>> wrapping service found here:
>>
>> https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
>>
>> The bpf prog source and test scripts can be found here:
>>
>> https://github.com/werekraken/xps_ebpf
>>
>> In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check the
>> queue_mapping.
>>
>> With no prog loaded, the tx queue selection is adhering our xps_cpus
>> configuration.
>>
>> [vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
>> /sys/class/net/tap0/queues/tx-0/xps_cpus:1
>> /sys/class/net/tap0/queues/tx-1/xps_cpus:2
>> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
>> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
>> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>>
>> With a return 0 bpg prog, our tx queue is 0 (despite xps_cpus).
>>
>> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
>> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
>> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
>> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
>> ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
>>
>> With a return 1 bpg prog, our tx queue is 1.
>>
>> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
>> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
>> cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
>> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
>> ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
>>
>> With a return 2 bpg prog, our tx queue is 0 (we only have 2 tx queues).
>>
>> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
>> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
>> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
>> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
>> ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
>>
>> With a return -1 bpf prog, our tx queue selection is once again determined by
>> xps_cpus. Any negative return should work the same and provides a nice
>> mechanism to bail out or have a noop bpf prog at this hookpoint.
>>
>> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
>> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
>> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
>> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>> ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
>> ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
>>
>> bpf prog unloading is not yet working and neither does `ip link show` report
>> when an "xps" bpf prog is attached. This is my first time touching iproute2 or
>> rtnetlink, so it may be something obvious to those more familiar.
> Adding Jason... sorry for missing that the first time.
On Thu, Sep 19, 2019 at 6:42 PM Jason Wang <[email protected]> wrote:
>
>
> On 2019/9/20 8:05 AM, Matt Cover wrote:
> > On Thu, Sep 19, 2019 at 3:45 PM Matthew Cover <[email protected]> wrote:
> >> WORK IN PROGRESS:
> >> * bpf program loading works!
> >> * txq steering via bpf program return code works!
> >> * bpf program unloading not working.
> >> * bpf program attached query not working.
> >>
> >> This patch set provides a bpf hookpoint with goals similar to, but a more
> >> generic implementation than, TUNSETSTEERINGEBPF; userspace supplied tx queue
> >> selection policy.
>
>
> One point that I introduce TUNSETSTEERINGEBPF instead of using a generic
> way like cls/act bpf is that I need make sure to have a consistent API
> with macvtap.
>
> In the case of macvtap, TX means transmit from userspace to kernel, but
> for TUN, it means transmit from kernel to userspace.
>
Ah, ok. I'll have to check that out at some point.
>
> >>
> >> TUNSETSTEERINGEBPF is a useful bpf hookpoint, but has some drawbacks.
> >>
> >> First, it only works on tun/tap devices.
> >>
> >> Second, there is no way in the current TUNSETSTEERINGEBPF implementation
> >> to bail out or load a noop bpf prog and fallback to the no prog tx queue
> >> selection method.
>
>
> I believe it expect that eBPF should take all the parts (even the
> fallback part).
>
This would be easy to change in the existing TUNSETSTEERINGEBPF
implementation if desired. We'd just need a negative return from the bpf prog
to result in falling back to tun_automq_select_queue(). If that behavior
sounds reasonable to you, I can look into that as a separate patch.
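Roughly the following (a sketch only, against my reading of the current
tun.c, so details may be off):

/* drivers/net/tun.c */
static u16 tun_ebpf_select_queue(struct tun_struct *tun, struct sk_buff *skb)
{
	struct tun_prog *prog;
	u32 numqueues;
	int ret = -1;

	numqueues = READ_ONCE(tun->numqueues);
	if (!numqueues)
		return 0;

	prog = rcu_dereference(tun->steering_prog);
	if (prog)
		ret = bpf_prog_run_clear_cb(prog->prog, skb);

	/* negative return (or no prog): fall back to the automq policy */
	if (ret < 0)
		return tun_automq_select_queue(tun, skb);

	return ret % numqueues;
}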
>
> >>
> >> Third, the TUNSETSTEERINGEBPF interface seems to require possession of existing
> >> or creation of new queues/fds.
>
>
> That's the way TUN work for past +10 years because ioctl is the only way
> to do configuration and it requires a fd to carry that. David suggest to
> implement netlink but nobody did that.
>
I see.
>
> >>
> >> This most naturally fits in the "wire" implementation since possession of fds
> >> is ensured. However, it also means the various "wire" implementations (e.g.
> >> qemu) have to all be made aware of TUNSETSTEERINGEBPF and expose an interface
> >> to load/unload a bpf prog (or provide a mechanism to pass an fd to another
> >> program).
>
>
> The load/unload of ebpf program is standard bpf() syscall. Ioctl just
> attach that to TUN. This idea is borrowed from packet socket which the
> bpf program was attached through setsockopt().
>
Yeah, it doesn't take much code to load a prog. In fact, I wrote one earlier
this week that spins up an extra fd and detaches right after.
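Something along these lines (error handling trimmed; prog_fd is assumed to
come from bpf()/libbpf elsewhere):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/if.h>
#include <linux/if_tun.h>

static int attach_steering_prog(const char *ifname, int prog_fd)
{
	struct ifreq ifr;
	int fd = open("/dev/net/tun", O_RDWR);

	if (fd < 0)
		return -1;

	memset(&ifr, 0, sizeof(ifr));
	strncpy(ifr.ifr_name, ifname, IFNAMSIZ - 1);
	ifr.ifr_flags = IFF_TAP | IFF_NO_PI | IFF_MULTI_QUEUE;

	/* spin up an extra queue fd on the existing multi-queue tap ... */
	if (ioctl(fd, TUNSETIFF, &ifr) < 0 ||
	    ioctl(fd, TUNSETSTEERINGEBPF, &prog_fd) < 0) {
		close(fd);
		return -1;
	}

	/* ... and detach it right after */
	ifr.ifr_flags = IFF_DETACH_QUEUE;
	ioctl(fd, TUNSETQUEUE, &ifr);
	return fd;
}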
>
> >>
> >> Alternatively, you can spin up an extra queue and immediately disable via
> >> IFF_DETACH_QUEUE, but this seems unsafe; packets could be enqueued to this
> >> extra file descriptor which is part of our bpf prog loader, not our "wire".
>
>
> You can use you 'wire' queue to do ioctl, but we can invent other API.
>
It might be cool to provide a way to create an already-detached fd (not sure
if that is non-trivial for some reason). Switching over to netlink could be
the longer-term goal.
>
> >>
> >> Placing this in the XPS code and leveraging iproute2 and rtnetlink to provide
> >> our bpf prog loader in a similar manner to xdp gives us a nice way to separate
> >> the tap "wire" and the loading of tx queue selection policy. It also lets us
> >> use this hookpoint for any device traversing XPS.
> >>
> >> This patch only introduces the new hookpoint to the XPS code and will not yet
> >> be used by tun/tap devices using the intree tun.ko (which implements an
> >> .ndo_select_queue and does not traverse the XPS code).
> >>
> >> In a future patch set, we can optionally refactor tun.ko to traverse this call
> >> to bpf_prog_run_clear_cb() and bpf prog storage. tun/tap devices could then
> >> leverage iproute2 as a generic loader. The TUNSETSTEERINGEBPF interface could
> >> at this point be optionally deprecated/removed.
>
>
> As described above, we need it for macvtap and you propose here can not
> work for that.
>
> I'm not against this proposal, just want to clarify some considerations
> when developing TUNSETSTEERINGEPF. The main goal is for VM to implement
> sophisticated steering policy like RSS without touching kernel.
>
Very cool. Thank you for your comments, Jason; they have added clarity to
some things.
I'm still interested in adding this hookpoint, community willing. I believe
it provides value beyond xps_cpus/xps_rxqs.
I also plan to look into adding a similar hookpoint in the rps code. That
will unlock additional possibilities for this xps hookpoint (e.g. rfs
implemented via bpf maps, but only on a subset of traffic [high priority or
especially resource costly] rather than all).
I've had (so far casual) chats with a couple NIC vendors about various
"SmartNICs" supporting custom entropy fields for RSS. I'm playing with the
idea of an "rpsoffload" prog loaded into the NIC being the way custom entropy
is configured. Being able to configure RSS to generate a hash based on fields
of an inner packet, or a packet-type-specific field like the GRE key, would
be super nice for NFV workloads.
Perhaps even an "rpsdrv" or "rpsoffload" hookpoint could leverage bpf helpers
for the RSS hash algorithm (e.g. bpf_rss_hash_toeplitz(), bpf_rss_hash_crc(),
bpf_rss_hash_xor(), etc.).
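To make that concrete, a helper like the (hypothetical) bpf_rss_hash_toeplitz()
would just compute the standard Toeplitz hash over whatever bytes the prog
hands it (e.g. an inner 4-tuple or a GRE key), roughly:

#include <stddef.h>
#include <stdint.h>

/* standard Toeplitz: for every set bit of the input, XOR in the 32-bit
 * window of the secret key starting at that bit position; the key must be
 * at least len + 4 bytes long (e.g. the usual 40-byte RSS key) */
static uint32_t toeplitz_hash(const uint8_t *key, const uint8_t *data, size_t len)
{
	uint32_t hash = 0;
	uint32_t window = ((uint32_t)key[0] << 24) | ((uint32_t)key[1] << 16) |
			  ((uint32_t)key[2] << 8) | key[3];
	const uint8_t *next_key = key + 4;
	size_t i;
	int bit;

	for (i = 0; i < len; i++) {
		for (bit = 7; bit >= 0; bit--) {
			if (data[i] & (1u << bit))
				hash ^= window;
			/* slide the window one bit, pulling in the next key bit */
			window = (window << 1) | ((*next_key >> bit) & 1u);
		}
		next_key++;
	}
	return hash;
}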
The ideas on how things would look for receive are still early, but I think
there is a lot of potential for making things more flexible by leveraging
eBPF in this area.
> Thanks
>
>
> >>
> >> Both patches in this set have been tested using a rebuilt tun.ko with no
> >> .ndo_select_queue.
> >>
> >> sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
> >>
> >> The tap device was instantiated using tap_mq_pong.c, supporting scripts, and
> >> wrapping service found here:
> >>
> >> https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
> >>
> >> The bpf prog source and test scripts can be found here:
> >>
> >> https://github.com/werekraken/xps_ebpf
> >>
> >> In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check the
> >> queue_mapping.
> >>
> >> With no prog loaded, the tx queue selection is adhering our xps_cpus
> >> configuration.
> >>
> >> [vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
> >> /sys/class/net/tap0/queues/tx-0/xps_cpus:1
> >> /sys/class/net/tap0/queues/tx-1/xps_cpus:2
> >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
> >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
> >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >>
> >> With a return 0 bpg prog, our tx queue is 0 (despite xps_cpus).
> >>
> >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
> >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
> >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
> >> ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
> >>
> >> With a return 1 bpg prog, our tx queue is 1.
> >>
> >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
> >> cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
> >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
> >> ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
> >>
> >> With a return 2 bpg prog, our tx queue is 0 (we only have 2 tx queues).
> >>
> >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
> >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
> >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
> >> ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
> >>
> >> With a return -1 bpf prog, our tx queue selection is once again determined by
> >> xps_cpus. Any negative return should work the same and provides a nice
> >> mechanism to bail out or have a noop bpf prog at this hookpoint.
> >>
> >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
> >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
> >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> >> ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
> >> ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
> >>
> >> bpf prog unloading is not yet working and neither does `ip link show` report
> >> when an "xps" bpf prog is attached. This is my first time touching iproute2 or
> >> rtnetlink, so it may be something obvious to those more familiar.
> > Adding Jason... sorry for missing that the first time.
On Thu, Sep 19, 2019 at 3:45 PM Matthew Cover <[email protected]> wrote:
>
> WORK IN PROGRESS:
> * bpf program loading works!
> * txq steering via bpf program return code works!
> * bpf program unloading not working.
> * bpf program attached query not working.
>
> This patch set provides a bpf hookpoint with goals similar to, but a more
> generic implementation than, TUNSETSTEERINGEBPF; userspace supplied tx queue
> selection policy.
>
> TUNSETSTEERINGEBPF is a useful bpf hookpoint, but has some drawbacks.
>
> First, it only works on tun/tap devices.
>
> Second, there is no way in the current TUNSETSTEERINGEBPF implementation
> to bail out or load a noop bpf prog and fallback to the no prog tx queue
> selection method.
>
> Third, the TUNSETSTEERINGEBPF interface seems to require possession of existing
> or creation of new queues/fds.
>
> This most naturally fits in the "wire" implementation since possession of fds
> is ensured. However, it also means the various "wire" implementations (e.g.
> qemu) have to all be made aware of TUNSETSTEERINGEBPF and expose an interface
> to load/unload a bpf prog (or provide a mechanism to pass an fd to another
> program).
>
> Alternatively, you can spin up an extra queue and immediately disable via
> IFF_DETACH_QUEUE, but this seems unsafe; packets could be enqueued to this
> extra file descriptor which is part of our bpf prog loader, not our "wire".
>
> Placing this in the XPS code and leveraging iproute2 and rtnetlink to provide
> our bpf prog loader in a similar manner to xdp gives us a nice way to separate
> the tap "wire" and the loading of tx queue selection policy. It also lets us
> use this hookpoint for any device traversing XPS.
>
> This patch only introduces the new hookpoint to the XPS code and will not yet
> be used by tun/tap devices using the intree tun.ko (which implements an
> .ndo_select_queue and does not traverse the XPS code).
>
> In a future patch set, we can optionally refactor tun.ko to traverse this call
> to bpf_prog_run_clear_cb() and bpf prog storage. tun/tap devices could then
> leverage iproute2 as a generic loader. The TUNSETSTEERINGEBPF interface could
> at this point be optionally deprecated/removed.
>
> Both patches in this set have been tested using a rebuilt tun.ko with no
> .ndo_select_queue.
>
> sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
>
> The tap device was instantiated using tap_mq_pong.c, supporting scripts, and
> wrapping service found here:
>
> https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
>
> The bpf prog source and test scripts can be found here:
>
> https://github.com/werekraken/xps_ebpf
>
> In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check the
> queue_mapping.
>
> With no prog loaded, the tx queue selection is adhering our xps_cpus
> configuration.
>
> [vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
> /sys/class/net/tap0/queues/tx-0/xps_cpus:1
> /sys/class/net/tap0/queues/tx-1/xps_cpus:2
> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
>
> With a return 0 bpg prog, our tx queue is 0 (despite xps_cpus).
>
> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
> ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
>
> With a return 1 bpg prog, our tx queue is 1.
>
> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
> cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
> ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
>
> With a return 2 bpg prog, our tx queue is 0 (we only have 2 tx queues).
>
> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
> ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
>
> With a return -1 bpf prog, our tx queue selection is once again determined by
> xps_cpus. Any negative return should work the same and provides a nice
> mechanism to bail out or have a noop bpf prog at this hookpoint.
>
> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
> ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
>
> bpf prog unloading is not yet working and neither does `ip link show` report
> when an "xps" bpf prog is attached. This is my first time touching iproute2 or
> rtnetlink, so it may be something obvious to those more familiar.
Adding Jason... sorry for missing that the first time.
On Thu, Sep 19, 2019 at 7:45 PM Matt Cover <[email protected]> wrote:
>
> On Thu, Sep 19, 2019 at 6:42 PM Jason Wang <[email protected]> wrote:
> >
> >
> > On 2019/9/20 8:05 AM, Matt Cover wrote:
> > > On Thu, Sep 19, 2019 at 3:45 PM Matthew Cover <[email protected]> wrote:
> > >> WORK IN PROGRESS:
> > >> * bpf program loading works!
> > >> * txq steering via bpf program return code works!
> > >> * bpf program unloading not working.
> > >> * bpf program attached query not working.
> > >>
> > >> This patch set provides a bpf hookpoint with goals similar to, but a more
> > >> generic implementation than, TUNSETSTEERINGEBPF; userspace supplied tx queue
> > >> selection policy.
> >
> >
> > One point that I introduce TUNSETSTEERINGEBPF instead of using a generic
> > way like cls/act bpf is that I need make sure to have a consistent API
> > with macvtap.
> >
> > In the case of macvtap, TX means transmit from userspace to kernel, but
> > for TUN, it means transmit from kernel to userspace.
> >
>
> Ah, ok. I'll have to check that out at some point.
>
> >
> > >>
> > >> TUNSETSTEERINGEBPF is a useful bpf hookpoint, but has some drawbacks.
> > >>
> > >> First, it only works on tun/tap devices.
> > >>
> > >> Second, there is no way in the current TUNSETSTEERINGEBPF implementation
> > >> to bail out or load a noop bpf prog and fallback to the no prog tx queue
> > >> selection method.
> >
> >
> > I believe it expect that eBPF should take all the parts (even the
> > fallback part).
> >
>
> This would be easy to change in the existing TUNSETSTEERINGEBPF
> implementation if desired. We'd just need a negative return from the bpf prog
> to result in falling back to tun_automq_select_queue(). If that behavior
> sounds reasonable to you, I can look into that as a separate patch.
>
> >
> > >>
> > >> Third, the TUNSETSTEERINGEBPF interface seems to require possession of existing
> > >> or creation of new queues/fds.
> >
> >
> > That's the way TUN work for past +10 years because ioctl is the only way
> > to do configuration and it requires a fd to carry that. David suggest to
> > implement netlink but nobody did that.
> >
>
> I see.
>
> >
> > >>
> > >> This most naturally fits in the "wire" implementation since possession of fds
> > >> is ensured. However, it also means the various "wire" implementations (e.g.
> > >> qemu) have to all be made aware of TUNSETSTEERINGEBPF and expose an interface
> > >> to load/unload a bpf prog (or provide a mechanism to pass an fd to another
> > >> program).
> >
> >
> > The load/unload of ebpf program is standard bpf() syscall. Ioctl just
> > attach that to TUN. This idea is borrowed from packet socket which the
> > bpf program was attached through setsockopt().
> >
>
> Yeah, it doesn't take much code to load a prog. I wrote one earlier this week
> in fact which spins up an extra fd and detaches right after.
>
> >
> > >>
> > >> Alternatively, you can spin up an extra queue and immediately disable via
> > >> IFF_DETACH_QUEUE, but this seems unsafe; packets could be enqueued to this
> > >> extra file descriptor which is part of our bpf prog loader, not our "wire".
> >
> >
> > You can use you 'wire' queue to do ioctl, but we can invent other API.
> >
>
> It might be cool to provide a way to create an already detached fd
> (not sure if this
> is non-trivial for some reason). Switching over to netlink could be
> the more long
> term goal.
>
> >
> > >>
> > >> Placing this in the XPS code and leveraging iproute2 and rtnetlink to provide
> > >> our bpf prog loader in a similar manner to xdp gives us a nice way to separate
> > >> the tap "wire" and the loading of tx queue selection policy. It also lets us
> > >> use this hookpoint for any device traversing XPS.
> > >>
> > >> This patch only introduces the new hookpoint to the XPS code and will not yet
> > >> be used by tun/tap devices using the intree tun.ko (which implements an
> > >> .ndo_select_queue and does not traverse the XPS code).
> > >>
> > >> In a future patch set, we can optionally refactor tun.ko to traverse this call
> > >> to bpf_prog_run_clear_cb() and bpf prog storage. tun/tap devices could then
> > >> leverage iproute2 as a generic loader. The TUNSETSTEERINGEBPF interface could
> > >> at this point be optionally deprecated/removed.
> >
> >
> > As described above, we need it for macvtap and you propose here can not
> > work for that.
> >
> > I'm not against this proposal, just want to clarify some considerations
> > when developing TUNSETSTEERINGEPF. The main goal is for VM to implement
> > sophisticated steering policy like RSS without touching kernel.
> >
>
> Very cool. Thank you for your comments Jason; they have added clarity
> to some things.
>
> I'm still interested in adding this hookpoint, community willing. I
> believe it provides
> value beyond xps_cpus/xps_rxqs.
>
> I also plan to look into adding a similar hookpoint in the rps code.
> That will unlock
> additional possibilities for this xps hookpoint (e.g. rfs implemented
> via bpf maps, but
> only on a subset of traffic [high priority or especially resource
> costly] rather than all).
>
> I've had (so far casual) chats with a couple NIC vendors about various
> "SmartNICs" supporting custom entropy fields for RSS. I'm playing with the idea
> of an "rpsoffload" prog loaded into the NIC being the way custom entropy is
> configured. Being able to configure RSS to generate a hash based on an fields
> of an inner packet or a packet type specific field like GRE key would be super
> nice for NFV workloads.
>
Turns out the RSS part is already being done via XDP!
https://github.com/Netronome/bpf-samples/tree/master/programmable_rss
> Perhaps even an "rpsdrv" or "rpsoffload" hookpoint could leverage bpf
> helpers for
> RSS hash algorithm (e.g. bfp_rss_hash_toeplitz(), bpf_rss_hash_crc(),
> bpf_rss_hash_xor(), etc.).
>
> The ideas on how things would look for receive are still early, but I
> think there is
> a lot of potential for making things more flexible by leveraging ebpf
> in this area.
>
> > Thanks
> >
> >
> > >>
> > >> Both patches in this set have been tested using a rebuilt tun.ko with no
> > >> .ndo_select_queue.
> > >>
> > >> sed -i '/\.ndo_select_queue.*=/d' drivers/net/tun.c
> > >>
> > >> The tap device was instantiated using tap_mq_pong.c, supporting scripts, and
> > >> wrapping service found here:
> > >>
> > >> https://github.com/stackpath/rxtxcpu/tree/v1.2.6/helpers
> > >>
> > >> The bpf prog source and test scripts can be found here:
> > >>
> > >> https://github.com/werekraken/xps_ebpf
> > >>
> > >> In nstxq, netsniff-ng using PACKET_FANOUT_QM is leveraged to check the
> > >> queue_mapping.
> > >>
> > >> With no prog loaded, the tx queue selection is adhering our xps_cpus
> > >> configuration.
> > >>
> > >> [vagrant@localhost ~]$ grep . /sys/class/net/tap0/queues/tx-*/xps_cpus; ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe;
> > >> /sys/class/net/tap0/queues/tx-0/xps_cpus:1
> > >> /sys/class/net/tap0/queues/tx-1/xps_cpus:2
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.146 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.121 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >>
> > >> With a return 0 bpg prog, our tx queue is 0 (despite xps_cpus).
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello0.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.160 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.124 ms
> > >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4852 [000] .... 2691.633260: 0: xps (RET 0): Hello, World!
> > >> ping-4869 [001] .... 2695.753588: 0: xps (RET 0): Hello, World!
> > >>
> > >> With a return 1 bpg prog, our tx queue is 1.
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.193 ms
> > >> cpu0: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.135 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4894 [000] .... 2710.652080: 0: xps (RET 1): Hello, World!
> > >> ping-4911 [001] .... 2714.774608: 0: xps (RET 1): Hello, World!
> > >>
> > >> With a return 2 bpg prog, our tx queue is 0 (we only have 2 tx queues).
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello2.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=1.20 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.986 ms
> > >> cpu1: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4936 [000] .... 2729.442668: 0: xps (RET 2): Hello, World!
> > >> ping-4953 [001] .... 2733.614558: 0: xps (RET 2): Hello, World!
> > >>
> > >> With a return -1 bpf prog, our tx queue selection is once again determined by
> > >> xps_cpus. Any negative return should work the same and provides a nice
> > >> mechanism to bail out or have a noop bpf prog at this hookpoint.
> > >>
> > >> [vagrant@localhost ~]$ sudo ip link set dev tap0 xps obj hello_neg1.o sec hello && { ./nstxq; sudo timeout 1 cat /sys/kernel/debug/tracing/trace_pipe; }
> > >> cpu0: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.628 ms
> > >> cpu0: qm0: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> cpu1: ping: 64 bytes from 169.254.254.1: icmp_seq=1 ttl=64 time=0.322 ms
> > >> cpu1: qm1: > tap0 98 Unknown => Unknown IPv4 169.254.254.2/169.254.254.1 Len 84 Type 8 Code 0
> > >> ping-4981 [000] .... 2763.510760: 0: xps (RET -1): Hello, World!
> > >> ping-4998 [001] .... 2767.632583: 0: xps (RET -1): Hello, World!
> > >>
> > >> bpf prog unloading is not yet working and neither does `ip link show` report
> > >> when an "xps" bpf prog is attached. This is my first time touching iproute2 or
> > >> rtnetlink, so it may be something obvious to those more familiar.
> > > Adding Jason... sorry for missing that the first time.