2023-10-08 19:07:58

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>
> virtio-net have two usage of hashes: one is RSS and another is hash
> reporting. Conventionally the hash calculation was done by the VMM.
> However, computing the hash after the queue was chosen defeats the
> purpose of RSS.
>
> Another approach is to use eBPF steering program. This approach has
> another downside: it cannot report the calculated hash due to the
> restrictive nature of eBPF.
>
> Introduce the code to compute hashes to the kernel in order to overcome
> thse challenges. An alternative solution is to extend the eBPF steering
> program so that it will be able to report to the userspace, but it makes
> little sense to allow to implement different hashing algorithms with
> eBPF since the hash value reported by virtio-net is strictly defined by
> the specification.
>
> The hash value already stored in sk_buff is not used and computed
> independently since it may have been computed in a way not conformant
> with the specification.
>
> Signed-off-by: Akihiko Odaki <[email protected]>
> ---

> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> + .max_indirection_table_length =
> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> +
> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> +};

No need to have explicit capabilities exchange like this? Tun either
supports all or none.

> case TUNSETSTEERINGEBPF:
> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> + if (IS_ERR(bpf_ret))
> + ret = PTR_ERR(bpf_ret);
> + else if (bpf_ret)
> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;

Don't make one feature disable another.

TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
functions. If one is enabled the other call should fail, with EBUSY
for instance.

> + case TUNSETVNETHASH:
> + len = sizeof(vnet_hash);
> + if (copy_from_user(&vnet_hash, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> + !tun_is_little_endian(tun))) ||
> + vnet_hash.indirection_table_mask >=
> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> + ret = -EINVAL;
> + break;
> + }
> +
> + argp = (u8 __user *)argp + len;
> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }
> +
> + argp = (u8 __user *)argp + len;
> + len = virtio_net_hash_key_length(vnet_hash.types);
> +
> + if (copy_from_user(vnet_hash_key, argp, len)) {
> + ret = -EFAULT;
> + break;
> + }

Probably easier and less error-prone to define a fixed size control
struct with the max indirection table size.

Btw: please trim the CC: list considerably on future patches.


2023-10-08 20:04:32

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 4:07, Willem de Bruijn wrote:
> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>
>> virtio-net have two usage of hashes: one is RSS and another is hash
>> reporting. Conventionally the hash calculation was done by the VMM.
>> However, computing the hash after the queue was chosen defeats the
>> purpose of RSS.
>>
>> Another approach is to use eBPF steering program. This approach has
>> another downside: it cannot report the calculated hash due to the
>> restrictive nature of eBPF.
>>
>> Introduce the code to compute hashes to the kernel in order to overcome
>> thse challenges. An alternative solution is to extend the eBPF steering
>> program so that it will be able to report to the userspace, but it makes
>> little sense to allow to implement different hashing algorithms with
>> eBPF since the hash value reported by virtio-net is strictly defined by
>> the specification.
>>
>> The hash value already stored in sk_buff is not used and computed
>> independently since it may have been computed in a way not conformant
>> with the specification.
>>
>> Signed-off-by: Akihiko Odaki <[email protected]>
>> ---
>
>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>> + .max_indirection_table_length =
>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>> +
>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>> +};
>
> No need to have explicit capabilities exchange like this? Tun either
> supports all or none.

tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.

It is because the flow dissector does not support IPv6 extensions. The
specification is also vague, and does not tell how many TLVs should be
consumed at most when interpreting destination option header so I chose
to avoid adding code for these hash types to the flow dissector. I doubt
anyone will complain about it since nobody complains for Linux.

I'm also adding this so that we can extend it later.
max_indirection_table_length may grow for systems with 128+ CPUs, or
types may have other bits for new protocols in the future.

>
>> case TUNSETSTEERINGEBPF:
>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>> + if (IS_ERR(bpf_ret))
>> + ret = PTR_ERR(bpf_ret);
>> + else if (bpf_ret)
>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>
> Don't make one feature disable another.
>
> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> functions. If one is enabled the other call should fail, with EBUSY
> for instance.
>
>> + case TUNSETVNETHASH:
>> + len = sizeof(vnet_hash);
>> + if (copy_from_user(&vnet_hash, argp, len)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>> + !tun_is_little_endian(tun))) ||
>> + vnet_hash.indirection_table_mask >=
>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>> + ret = -EINVAL;
>> + break;
>> + }
>> +
>> + argp = (u8 __user *)argp + len;
>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>> +
>> + argp = (u8 __user *)argp + len;
>> + len = virtio_net_hash_key_length(vnet_hash.types);
>> +
>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>> + ret = -EFAULT;
>> + break;
>> + }
>
> Probably easier and less error-prone to define a fixed size control
> struct with the max indirection table size.

I made its size variable because the indirection table and key may grow
in the future as I wrote above.

>
> Btw: please trim the CC: list considerably on future patches.

I'll do so in the next version with the TUNSETSTEERINGEBPF change you
proposed.

2023-10-08 20:09:43

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 4:07, Willem de Bruijn wrote:
> > On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>
> >> virtio-net have two usage of hashes: one is RSS and another is hash
> >> reporting. Conventionally the hash calculation was done by the VMM.
> >> However, computing the hash after the queue was chosen defeats the
> >> purpose of RSS.
> >>
> >> Another approach is to use eBPF steering program. This approach has
> >> another downside: it cannot report the calculated hash due to the
> >> restrictive nature of eBPF.
> >>
> >> Introduce the code to compute hashes to the kernel in order to overcome
> >> thse challenges. An alternative solution is to extend the eBPF steering
> >> program so that it will be able to report to the userspace, but it makes
> >> little sense to allow to implement different hashing algorithms with
> >> eBPF since the hash value reported by virtio-net is strictly defined by
> >> the specification.
> >>
> >> The hash value already stored in sk_buff is not used and computed
> >> independently since it may have been computed in a way not conformant
> >> with the specification.
> >>
> >> Signed-off-by: Akihiko Odaki <[email protected]>
> >> ---
> >
> >> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >> + .max_indirection_table_length =
> >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >> +
> >> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >> +};
> >
> > No need to have explicit capabilities exchange like this? Tun either
> > supports all or none.
>
> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>
> It is because the flow dissector does not support IPv6 extensions. The
> specification is also vague, and does not tell how many TLVs should be
> consumed at most when interpreting destination option header so I chose
> to avoid adding code for these hash types to the flow dissector. I doubt
> anyone will complain about it since nobody complains for Linux.
>
> I'm also adding this so that we can extend it later.
> max_indirection_table_length may grow for systems with 128+ CPUs, or
> types may have other bits for new protocols in the future.
>
> >
> >> case TUNSETSTEERINGEBPF:
> >> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >> + if (IS_ERR(bpf_ret))
> >> + ret = PTR_ERR(bpf_ret);
> >> + else if (bpf_ret)
> >> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >
> > Don't make one feature disable another.
> >
> > TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> > functions. If one is enabled the other call should fail, with EBUSY
> > for instance.
> >
> >> + case TUNSETVNETHASH:
> >> + len = sizeof(vnet_hash);
> >> + if (copy_from_user(&vnet_hash, argp, len)) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >> + !tun_is_little_endian(tun))) ||
> >> + vnet_hash.indirection_table_mask >=
> >> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >> + ret = -EINVAL;
> >> + break;
> >> + }
> >> +
> >> + argp = (u8 __user *)argp + len;
> >> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >> +
> >> + argp = (u8 __user *)argp + len;
> >> + len = virtio_net_hash_key_length(vnet_hash.types);
> >> +
> >> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >> + ret = -EFAULT;
> >> + break;
> >> + }
> >
> > Probably easier and less error-prone to define a fixed size control
> > struct with the max indirection table size.
>
> I made its size variable because the indirection table and key may grow
> in the future as I wrote above.
>
> >
> > Btw: please trim the CC: list considerably on future patches.
>
> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> proposed.

To be clear: please don't just resubmit with that one change.

The skb and cb issues are quite fundamental issues that need to be resolved.

I'd like to understand why adjusting the existing BPF feature for this
exact purpose cannot be amended to return the key it produced.

As you point out, the C flow dissector is insufficient. The BPF flow
dissector does not have this problem. The same argument would go for
the pre-existing BPF steering program.

2023-10-08 20:53:02

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 5:08, Willem de Bruijn wrote:
> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>> However, computing the hash after the queue was chosen defeats the
>>>> purpose of RSS.
>>>>
>>>> Another approach is to use eBPF steering program. This approach has
>>>> another downside: it cannot report the calculated hash due to the
>>>> restrictive nature of eBPF.
>>>>
>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>> program so that it will be able to report to the userspace, but it makes
>>>> little sense to allow to implement different hashing algorithms with
>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>> the specification.
>>>>
>>>> The hash value already stored in sk_buff is not used and computed
>>>> independently since it may have been computed in a way not conformant
>>>> with the specification.
>>>>
>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>> ---
>>>
>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>> + .max_indirection_table_length =
>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>> +
>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>> +};
>>>
>>> No need to have explicit capabilities exchange like this? Tun either
>>> supports all or none.
>>
>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>
>> It is because the flow dissector does not support IPv6 extensions. The
>> specification is also vague, and does not tell how many TLVs should be
>> consumed at most when interpreting destination option header so I chose
>> to avoid adding code for these hash types to the flow dissector. I doubt
>> anyone will complain about it since nobody complains for Linux.
>>
>> I'm also adding this so that we can extend it later.
>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>> types may have other bits for new protocols in the future.
>>
>>>
>>>> case TUNSETSTEERINGEBPF:
>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>> + if (IS_ERR(bpf_ret))
>>>> + ret = PTR_ERR(bpf_ret);
>>>> + else if (bpf_ret)
>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>
>>> Don't make one feature disable another.
>>>
>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>> functions. If one is enabled the other call should fail, with EBUSY
>>> for instance.
>>>
>>>> + case TUNSETVNETHASH:
>>>> + len = sizeof(vnet_hash);
>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>> + ret = -EFAULT;
>>>> + break;
>>>> + }
>>>> +
>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>> + !tun_is_little_endian(tun))) ||
>>>> + vnet_hash.indirection_table_mask >=
>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>> + ret = -EINVAL;
>>>> + break;
>>>> + }
>>>> +
>>>> + argp = (u8 __user *)argp + len;
>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>> + ret = -EFAULT;
>>>> + break;
>>>> + }
>>>> +
>>>> + argp = (u8 __user *)argp + len;
>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>> +
>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>> + ret = -EFAULT;
>>>> + break;
>>>> + }
>>>
>>> Probably easier and less error-prone to define a fixed size control
>>> struct with the max indirection table size.
>>
>> I made its size variable because the indirection table and key may grow
>> in the future as I wrote above.
>>
>>>
>>> Btw: please trim the CC: list considerably on future patches.
>>
>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>> proposed.
>
> To be clear: please don't just resubmit with that one change.
>
> The skb and cb issues are quite fundamental issues that need to be resolved.
>
> I'd like to understand why adjusting the existing BPF feature for this
> exact purpose cannot be amended to return the key it produced.

eBPF steering program is not designed for this particular problem in my
understanding. It was introduced to derive hash values with an
understanding of application-specific semantics of packets instead of
generic IP/TCP/UDP semantics.

This problem is rather different in terms that the hash derivation is
strictly defined by virtio-net. I don't think it makes sense to
introduce the complexity of BPF when you always run the same code.

It can utilize the existing flow dissector and also make it easier to
use for the userspace by implementing this in the kernel.

>
> As you point out, the C flow dissector is insufficient. The BPF flow
> dissector does not have this problem. The same argument would go for
> the pre-existing BPF steering program.
It is possible to extend the C flow dissector just as it is possible to
implement a BPF flow dissector. The more serious problem is that
virtio-net specification (and Microsoft RSS it follows) does not tell
how to implement IPv6 extension support.

2023-10-09 08:06:05

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 5:08, Willem de Bruijn wrote:
> > On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>> However, computing the hash after the queue was chosen defeats the
> >>>> purpose of RSS.
> >>>>
> >>>> Another approach is to use eBPF steering program. This approach has
> >>>> another downside: it cannot report the calculated hash due to the
> >>>> restrictive nature of eBPF.
> >>>>
> >>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>> program so that it will be able to report to the userspace, but it makes
> >>>> little sense to allow to implement different hashing algorithms with
> >>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>> the specification.
> >>>>
> >>>> The hash value already stored in sk_buff is not used and computed
> >>>> independently since it may have been computed in a way not conformant
> >>>> with the specification.
> >>>>
> >>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>> ---
> >>>
> >>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>> + .max_indirection_table_length =
> >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>> +
> >>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>> +};
> >>>
> >>> No need to have explicit capabilities exchange like this? Tun either
> >>> supports all or none.
> >>
> >> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>
> >> It is because the flow dissector does not support IPv6 extensions. The
> >> specification is also vague, and does not tell how many TLVs should be
> >> consumed at most when interpreting destination option header so I chose
> >> to avoid adding code for these hash types to the flow dissector. I doubt
> >> anyone will complain about it since nobody complains for Linux.
> >>
> >> I'm also adding this so that we can extend it later.
> >> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >> types may have other bits for new protocols in the future.
> >>
> >>>
> >>>> case TUNSETSTEERINGEBPF:
> >>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>> + if (IS_ERR(bpf_ret))
> >>>> + ret = PTR_ERR(bpf_ret);
> >>>> + else if (bpf_ret)
> >>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>
> >>> Don't make one feature disable another.
> >>>
> >>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>> functions. If one is enabled the other call should fail, with EBUSY
> >>> for instance.
> >>>
> >>>> + case TUNSETVNETHASH:
> >>>> + len = sizeof(vnet_hash);
> >>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>> + ret = -EFAULT;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>> + !tun_is_little_endian(tun))) ||
> >>>> + vnet_hash.indirection_table_mask >=
> >>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>> + ret = -EINVAL;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + argp = (u8 __user *)argp + len;
> >>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>> + ret = -EFAULT;
> >>>> + break;
> >>>> + }
> >>>> +
> >>>> + argp = (u8 __user *)argp + len;
> >>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>> +
> >>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>> + ret = -EFAULT;
> >>>> + break;
> >>>> + }
> >>>
> >>> Probably easier and less error-prone to define a fixed size control
> >>> struct with the max indirection table size.
> >>
> >> I made its size variable because the indirection table and key may grow
> >> in the future as I wrote above.
> >>
> >>>
> >>> Btw: please trim the CC: list considerably on future patches.
> >>
> >> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >> proposed.
> >
> > To be clear: please don't just resubmit with that one change.
> >
> > The skb and cb issues are quite fundamental issues that need to be resolved.
> >
> > I'd like to understand why adjusting the existing BPF feature for this
> > exact purpose cannot be amended to return the key it produced.
>
> eBPF steering program is not designed for this particular problem in my
> understanding. It was introduced to derive hash values with an
> understanding of application-specific semantics of packets instead of
> generic IP/TCP/UDP semantics.
>
> This problem is rather different in terms that the hash derivation is
> strictly defined by virtio-net. I don't think it makes sense to
> introduce the complexity of BPF when you always run the same code.
>
> It can utilize the existing flow dissector and also make it easier to
> use for the userspace by implementing this in the kernel.

Ok. There does appear to be overlap in functionality. But it might be
easier to deploy to just have standard Toeplitz available without
having to compile and load an eBPF program.

As for the sk_buff and cb[] changes. The first is really not needed.
sk_buff simply would not scale if every edge case needs a few bits.

For the control block, generally it is not safe to use that across
layers. In this case, between qdisc enqueue of a given device and
ndo_start_xmit of that device, I suppose it is. Though uncommon. I
wonder if there is any precedent.

The data will have to be stored in the skb somewhere. A simpler option
is just skb->hash? This code would use skb_get_hash, if it would
always produce a Toeplitz hash, anyway.

> >
> > As you point out, the C flow dissector is insufficient. The BPF flow
> > dissector does not have this problem. The same argument would go for
> > the pre-existing BPF steering program.
> It is possible to extend the C flow dissector just as it is possible to
> implement a BPF flow dissector. The more serious problem is that
> virtio-net specification (and Microsoft RSS it follows) does not tell
> how to implement IPv6 extension support.

2023-10-09 08:57:41

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 17:04, Willem de Bruijn wrote:
> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>> purpose of RSS.
>>>>>>
>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>> restrictive nature of eBPF.
>>>>>>
>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>> the specification.
>>>>>>
>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>> independently since it may have been computed in a way not conformant
>>>>>> with the specification.
>>>>>>
>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>> ---
>>>>>
>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>> + .max_indirection_table_length =
>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>> +
>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>> +};
>>>>>
>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>> supports all or none.
>>>>
>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>
>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>> specification is also vague, and does not tell how many TLVs should be
>>>> consumed at most when interpreting destination option header so I chose
>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>> anyone will complain about it since nobody complains for Linux.
>>>>
>>>> I'm also adding this so that we can extend it later.
>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>> types may have other bits for new protocols in the future.
>>>>
>>>>>
>>>>>> case TUNSETSTEERINGEBPF:
>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>> + if (IS_ERR(bpf_ret))
>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>> + else if (bpf_ret)
>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>
>>>>> Don't make one feature disable another.
>>>>>
>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>> for instance.
>>>>>
>>>>>> + case TUNSETVNETHASH:
>>>>>> + len = sizeof(vnet_hash);
>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>> + ret = -EFAULT;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>> + ret = -EINVAL;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + argp = (u8 __user *)argp + len;
>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>> + ret = -EFAULT;
>>>>>> + break;
>>>>>> + }
>>>>>> +
>>>>>> + argp = (u8 __user *)argp + len;
>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>> +
>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>> + ret = -EFAULT;
>>>>>> + break;
>>>>>> + }
>>>>>
>>>>> Probably easier and less error-prone to define a fixed size control
>>>>> struct with the max indirection table size.
>>>>
>>>> I made its size variable because the indirection table and key may grow
>>>> in the future as I wrote above.
>>>>
>>>>>
>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>
>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>> proposed.
>>>
>>> To be clear: please don't just resubmit with that one change.
>>>
>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>
>>> I'd like to understand why adjusting the existing BPF feature for this
>>> exact purpose cannot be amended to return the key it produced.
>>
>> eBPF steering program is not designed for this particular problem in my
>> understanding. It was introduced to derive hash values with an
>> understanding of application-specific semantics of packets instead of
>> generic IP/TCP/UDP semantics.
>>
>> This problem is rather different in terms that the hash derivation is
>> strictly defined by virtio-net. I don't think it makes sense to
>> introduce the complexity of BPF when you always run the same code.
>>
>> It can utilize the existing flow dissector and also make it easier to
>> use for the userspace by implementing this in the kernel.
>
> Ok. There does appear to be overlap in functionality. But it might be
> easier to deploy to just have standard Toeplitz available without
> having to compile and load an eBPF program.
>
> As for the sk_buff and cb[] changes. The first is really not needed.
> sk_buff simply would not scale if every edge case needs a few bits.

An alternative is to move the bit to cb[] and clear it for every code
paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.

I think we can put the bit in sk_buff for now. We can implement the
alternative when we are short of bits.

>
> For the control block, generally it is not safe to use that across
> layers. In this case, between qdisc enqueue of a given device and
> ndo_start_xmit of that device, I suppose it is. Though uncommon. I
> wonder if there is any precedent.

That's one of the reasons modifying qdisc_skb_cb instead of creating
another struct embedding it as bpf_skb_data_end and tc_skb_cb do. This
clarifies that it is qdisc's responsibility to keep these data intact.

>
> The data will have to be stored in the skb somewhere. A simpler option
> is just skb->hash? This code would use skb_get_hash, if it would
> always produce a Toeplitz hash, anyway.

We still need tun_vnet_hash_report to identify hash type.

skb_get_hash() uses Siphash instead of Toeplitz, and the configuration
of Toeplitz relies on tun's internal state. It is still possible to
store a hash calculated with tun's own logic to skb->hash, but someone
may later overwrite it with __skb_get_hash() though it's unlikely.

2023-10-09 09:58:36

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 17:04, Willem de Bruijn wrote:
> > On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>> purpose of RSS.
> >>>>>>
> >>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>> restrictive nature of eBPF.
> >>>>>>
> >>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>> the specification.
> >>>>>>
> >>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>> independently since it may have been computed in a way not conformant
> >>>>>> with the specification.
> >>>>>>
> >>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>> ---
> >>>>>
> >>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>> + .max_indirection_table_length =
> >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>> +
> >>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>> +};
> >>>>>
> >>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>> supports all or none.
> >>>>
> >>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>
> >>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>> specification is also vague, and does not tell how many TLVs should be
> >>>> consumed at most when interpreting destination option header so I chose
> >>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>> anyone will complain about it since nobody complains for Linux.
> >>>>
> >>>> I'm also adding this so that we can extend it later.
> >>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>> types may have other bits for new protocols in the future.
> >>>>
> >>>>>
> >>>>>> case TUNSETSTEERINGEBPF:
> >>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>> + if (IS_ERR(bpf_ret))
> >>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>> + else if (bpf_ret)
> >>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>
> >>>>> Don't make one feature disable another.
> >>>>>
> >>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>> for instance.
> >>>>>
> >>>>>> + case TUNSETVNETHASH:
> >>>>>> + len = sizeof(vnet_hash);
> >>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>> + ret = -EFAULT;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>> + ret = -EINVAL;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + argp = (u8 __user *)argp + len;
> >>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>> + ret = -EFAULT;
> >>>>>> + break;
> >>>>>> + }
> >>>>>> +
> >>>>>> + argp = (u8 __user *)argp + len;
> >>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>> +
> >>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>> + ret = -EFAULT;
> >>>>>> + break;
> >>>>>> + }
> >>>>>
> >>>>> Probably easier and less error-prone to define a fixed size control
> >>>>> struct with the max indirection table size.
> >>>>
> >>>> I made its size variable because the indirection table and key may grow
> >>>> in the future as I wrote above.
> >>>>
> >>>>>
> >>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>
> >>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>> proposed.
> >>>
> >>> To be clear: please don't just resubmit with that one change.
> >>>
> >>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>
> >>> I'd like to understand why adjusting the existing BPF feature for this
> >>> exact purpose cannot be amended to return the key it produced.
> >>
> >> eBPF steering program is not designed for this particular problem in my
> >> understanding. It was introduced to derive hash values with an
> >> understanding of application-specific semantics of packets instead of
> >> generic IP/TCP/UDP semantics.
> >>
> >> This problem is rather different in terms that the hash derivation is
> >> strictly defined by virtio-net. I don't think it makes sense to
> >> introduce the complexity of BPF when you always run the same code.
> >>
> >> It can utilize the existing flow dissector and also make it easier to
> >> use for the userspace by implementing this in the kernel.
> >
> > Ok. There does appear to be overlap in functionality. But it might be
> > easier to deploy to just have standard Toeplitz available without
> > having to compile and load an eBPF program.
> >
> > As for the sk_buff and cb[] changes. The first is really not needed.
> > sk_buff simply would not scale if every edge case needs a few bits.
>
> An alternative is to move the bit to cb[] and clear it for every code
> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>
> I think we can put the bit in sk_buff for now. We can implement the
> alternative when we are short of bits.

I disagree. sk_buff fields add a cost to every code path. They cannot
be added for every edge case.
>
> >
> > For the control block, generally it is not safe to use that across
> > layers. In this case, between qdisc enqueue of a given device and
> > ndo_start_xmit of that device, I suppose it is. Though uncommon. I
> > wonder if there is any precedent.
>
> That's one of the reasons modifying qdisc_skb_cb instead of creating
> another struct embedding it as bpf_skb_data_end and tc_skb_cb do. This
> clarifies that it is qdisc's responsibility to keep these data intact.
>
> >
> > The data will have to be stored in the skb somewhere. A simpler option
> > is just skb->hash? This code would use skb_get_hash, if it would
> > always produce a Toeplitz hash, anyway.
>
> We still need tun_vnet_hash_report to identify hash type.
>
> skb_get_hash() uses Siphash instead of Toeplitz, and the configuration
> of Toeplitz relies on tun's internal state. It is still possible to
> store a hash calculated with tun's own logic to skb->hash, but someone
> may later overwrite it with __skb_get_hash() though it's unlikely.

That won't happen in this data path.

Fair point on the hash type.

2023-10-09 10:02:23

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 18:57, Willem de Bruijn wrote:
> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>> purpose of RSS.
>>>>>>>>
>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>> restrictive nature of eBPF.
>>>>>>>>
>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>> the specification.
>>>>>>>>
>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>> with the specification.
>>>>>>>>
>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>> ---
>>>>>>>
>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>> + .max_indirection_table_length =
>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>> +
>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>> +};
>>>>>>>
>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>> supports all or none.
>>>>>>
>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>
>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>
>>>>>> I'm also adding this so that we can extend it later.
>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>> types may have other bits for new protocols in the future.
>>>>>>
>>>>>>>
>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>> + else if (bpf_ret)
>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>
>>>>>>> Don't make one feature disable another.
>>>>>>>
>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>> for instance.
>>>>>>>
>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>> + ret = -EFAULT;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>> + ret = -EINVAL;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>> + ret = -EFAULT;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>> +
>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>> +
>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>> + ret = -EFAULT;
>>>>>>>> + break;
>>>>>>>> + }
>>>>>>>
>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>> struct with the max indirection table size.
>>>>>>
>>>>>> I made its size variable because the indirection table and key may grow
>>>>>> in the future as I wrote above.
>>>>>>
>>>>>>>
>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>
>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>> proposed.
>>>>>
>>>>> To be clear: please don't just resubmit with that one change.
>>>>>
>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>
>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>> exact purpose cannot be amended to return the key it produced.
>>>>
>>>> eBPF steering program is not designed for this particular problem in my
>>>> understanding. It was introduced to derive hash values with an
>>>> understanding of application-specific semantics of packets instead of
>>>> generic IP/TCP/UDP semantics.
>>>>
>>>> This problem is rather different in terms that the hash derivation is
>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>> introduce the complexity of BPF when you always run the same code.
>>>>
>>>> It can utilize the existing flow dissector and also make it easier to
>>>> use for the userspace by implementing this in the kernel.
>>>
>>> Ok. There does appear to be overlap in functionality. But it might be
>>> easier to deploy to just have standard Toeplitz available without
>>> having to compile and load an eBPF program.
>>>
>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>> sk_buff simply would not scale if every edge case needs a few bits.
>>
>> An alternative is to move the bit to cb[] and clear it for every code
>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>
>> I think we can put the bit in sk_buff for now. We can implement the
>> alternative when we are short of bits.
>
> I disagree. sk_buff fields add a cost to every code path. They cannot
> be added for every edge case.

It only takes an unused bit and does not grow the sk_buff size so I
think it has practically no cost for now.

2023-10-09 10:07:37

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 18:57, Willem de Bruijn wrote:
> > On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 17:04, Willem de Bruijn wrote:
> >>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>> purpose of RSS.
> >>>>>>>>
> >>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>> restrictive nature of eBPF.
> >>>>>>>>
> >>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>>>> the specification.
> >>>>>>>>
> >>>>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>>>> independently since it may have been computed in a way not conformant
> >>>>>>>> with the specification.
> >>>>>>>>
> >>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>>>> ---
> >>>>>>>
> >>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>> + .max_indirection_table_length =
> >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>> +
> >>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>> +};
> >>>>>>>
> >>>>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>>>> supports all or none.
> >>>>>>
> >>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>
> >>>>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>>>> specification is also vague, and does not tell how many TLVs should be
> >>>>>> consumed at most when interpreting destination option header so I chose
> >>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>>>> anyone will complain about it since nobody complains for Linux.
> >>>>>>
> >>>>>> I'm also adding this so that we can extend it later.
> >>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>> types may have other bits for new protocols in the future.
> >>>>>>
> >>>>>>>
> >>>>>>>> case TUNSETSTEERINGEBPF:
> >>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>> + if (IS_ERR(bpf_ret))
> >>>>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>>>> + else if (bpf_ret)
> >>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>
> >>>>>>> Don't make one feature disable another.
> >>>>>>>
> >>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>>>> for instance.
> >>>>>>>
> >>>>>>>> + case TUNSETVNETHASH:
> >>>>>>>> + len = sizeof(vnet_hash);
> >>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>> + ret = -EFAULT;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>> + ret = -EINVAL;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>> + ret = -EFAULT;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>> +
> >>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>> +
> >>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>> + ret = -EFAULT;
> >>>>>>>> + break;
> >>>>>>>> + }
> >>>>>>>
> >>>>>>> Probably easier and less error-prone to define a fixed size control
> >>>>>>> struct with the max indirection table size.
> >>>>>>
> >>>>>> I made its size variable because the indirection table and key may grow
> >>>>>> in the future as I wrote above.
> >>>>>>
> >>>>>>>
> >>>>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>>>
> >>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>>>> proposed.
> >>>>>
> >>>>> To be clear: please don't just resubmit with that one change.
> >>>>>
> >>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>>>
> >>>>> I'd like to understand why adjusting the existing BPF feature for this
> >>>>> exact purpose cannot be amended to return the key it produced.
> >>>>
> >>>> eBPF steering program is not designed for this particular problem in my
> >>>> understanding. It was introduced to derive hash values with an
> >>>> understanding of application-specific semantics of packets instead of
> >>>> generic IP/TCP/UDP semantics.
> >>>>
> >>>> This problem is rather different in terms that the hash derivation is
> >>>> strictly defined by virtio-net. I don't think it makes sense to
> >>>> introduce the complexity of BPF when you always run the same code.
> >>>>
> >>>> It can utilize the existing flow dissector and also make it easier to
> >>>> use for the userspace by implementing this in the kernel.
> >>>
> >>> Ok. There does appear to be overlap in functionality. But it might be
> >>> easier to deploy to just have standard Toeplitz available without
> >>> having to compile and load an eBPF program.
> >>>
> >>> As for the sk_buff and cb[] changes. The first is really not needed.
> >>> sk_buff simply would not scale if every edge case needs a few bits.
> >>
> >> An alternative is to move the bit to cb[] and clear it for every code
> >> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
> >>
> >> I think we can put the bit in sk_buff for now. We can implement the
> >> alternative when we are short of bits.
> >
> > I disagree. sk_buff fields add a cost to every code path. They cannot
> > be added for every edge case.
>
> It only takes an unused bit and does not grow the sk_buff size so I
> think it has practically no cost for now.

The problem is that that thinking leads to death by a thousand cuts.

"for now" forces the cost of having to think hard how to avoid growing
sk_buff onto the next person. Let's do it right from the start.

2023-10-09 10:13:31

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 19:06, Willem de Bruijn wrote:
> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 18:57, Willem de Bruijn wrote:
>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>> purpose of RSS.
>>>>>>>>>>
>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>
>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>>>> the specification.
>>>>>>>>>>
>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>>>> with the specification.
>>>>>>>>>>
>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>>>> ---
>>>>>>>>>
>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>>>> + .max_indirection_table_length =
>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>>>> +
>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>>>> +};
>>>>>>>>>
>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>>>> supports all or none.
>>>>>>>>
>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>>>
>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>>>
>>>>>>>> I'm also adding this so that we can extend it later.
>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>>>> types may have other bits for new protocols in the future.
>>>>>>>>
>>>>>>>>>
>>>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>>>> + else if (bpf_ret)
>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>>>
>>>>>>>>> Don't make one feature disable another.
>>>>>>>>>
>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>>>> for instance.
>>>>>>>>>
>>>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>> +
>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>>>> +
>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>> + break;
>>>>>>>>>> + }
>>>>>>>>>
>>>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>>>> struct with the max indirection table size.
>>>>>>>>
>>>>>>>> I made its size variable because the indirection table and key may grow
>>>>>>>> in the future as I wrote above.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>>>
>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>>>> proposed.
>>>>>>>
>>>>>>> To be clear: please don't just resubmit with that one change.
>>>>>>>
>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>>>
>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>>>> exact purpose cannot be amended to return the key it produced.
>>>>>>
>>>>>> eBPF steering program is not designed for this particular problem in my
>>>>>> understanding. It was introduced to derive hash values with an
>>>>>> understanding of application-specific semantics of packets instead of
>>>>>> generic IP/TCP/UDP semantics.
>>>>>>
>>>>>> This problem is rather different in terms that the hash derivation is
>>>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>>>> introduce the complexity of BPF when you always run the same code.
>>>>>>
>>>>>> It can utilize the existing flow dissector and also make it easier to
>>>>>> use for the userspace by implementing this in the kernel.
>>>>>
>>>>> Ok. There does appear to be overlap in functionality. But it might be
>>>>> easier to deploy to just have standard Toeplitz available without
>>>>> having to compile and load an eBPF program.
>>>>>
>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>>>> sk_buff simply would not scale if every edge case needs a few bits.
>>>>
>>>> An alternative is to move the bit to cb[] and clear it for every code
>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>>>
>>>> I think we can put the bit in sk_buff for now. We can implement the
>>>> alternative when we are short of bits.
>>>
>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>> be added for every edge case.
>>
>> It only takes an unused bit and does not grow the sk_buff size so I
>> think it has practically no cost for now.
>
> The problem is that that thinking leads to death by a thousand cuts.
>
> "for now" forces the cost of having to think hard how to avoid growing
> sk_buff onto the next person. Let's do it right from the start.

I see. I described an alternative to move the bit to cb[] and clear it
in all code paths that leads to ndo_start_xmit() earlier. Does that
sound good to you?

2023-10-09 10:45:02

by Willem de Bruijn

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 19:06, Willem de Bruijn wrote:
> > On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 18:57, Willem de Bruijn wrote:
> >>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
> >>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>>>> purpose of RSS.
> >>>>>>>>>>
> >>>>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>> restrictive nature of eBPF.
> >>>>>>>>>>
> >>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>>>>>> the specification.
> >>>>>>>>>>
> >>>>>>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>>>>>> independently since it may have been computed in a way not conformant
> >>>>>>>>>> with the specification.
> >>>>>>>>>>
> >>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>>>>>> ---
> >>>>>>>>>
> >>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>>>> + .max_indirection_table_length =
> >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>>>> +
> >>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>>>> +};
> >>>>>>>>>
> >>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>>>>>> supports all or none.
> >>>>>>>>
> >>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>>>
> >>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>>>>>> specification is also vague, and does not tell how many TLVs should be
> >>>>>>>> consumed at most when interpreting destination option header so I chose
> >>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>>>>>> anyone will complain about it since nobody complains for Linux.
> >>>>>>>>
> >>>>>>>> I'm also adding this so that we can extend it later.
> >>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>>>> types may have other bits for new protocols in the future.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> case TUNSETSTEERINGEBPF:
> >>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>> + if (IS_ERR(bpf_ret))
> >>>>>>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>>>>>> + else if (bpf_ret)
> >>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>>>
> >>>>>>>>> Don't make one feature disable another.
> >>>>>>>>>
> >>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>>>>>> for instance.
> >>>>>>>>>
> >>>>>>>>>> + case TUNSETVNETHASH:
> >>>>>>>>>> + len = sizeof(vnet_hash);
> >>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>> + break;
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>>>> + ret = -EINVAL;
> >>>>>>>>>> + break;
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>> + break;
> >>>>>>>>>> + }
> >>>>>>>>>> +
> >>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>>>> +
> >>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>> + break;
> >>>>>>>>>> + }
> >>>>>>>>>
> >>>>>>>>> Probably easier and less error-prone to define a fixed size control
> >>>>>>>>> struct with the max indirection table size.
> >>>>>>>>
> >>>>>>>> I made its size variable because the indirection table and key may grow
> >>>>>>>> in the future as I wrote above.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>>>>>
> >>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>>>>>> proposed.
> >>>>>>>
> >>>>>>> To be clear: please don't just resubmit with that one change.
> >>>>>>>
> >>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>>>>>
> >>>>>>> I'd like to understand why adjusting the existing BPF feature for this
> >>>>>>> exact purpose cannot be amended to return the key it produced.
> >>>>>>
> >>>>>> eBPF steering program is not designed for this particular problem in my
> >>>>>> understanding. It was introduced to derive hash values with an
> >>>>>> understanding of application-specific semantics of packets instead of
> >>>>>> generic IP/TCP/UDP semantics.
> >>>>>>
> >>>>>> This problem is rather different in terms that the hash derivation is
> >>>>>> strictly defined by virtio-net. I don't think it makes sense to
> >>>>>> introduce the complexity of BPF when you always run the same code.
> >>>>>>
> >>>>>> It can utilize the existing flow dissector and also make it easier to
> >>>>>> use for the userspace by implementing this in the kernel.
> >>>>>
> >>>>> Ok. There does appear to be overlap in functionality. But it might be
> >>>>> easier to deploy to just have standard Toeplitz available without
> >>>>> having to compile and load an eBPF program.
> >>>>>
> >>>>> As for the sk_buff and cb[] changes. The first is really not needed.
> >>>>> sk_buff simply would not scale if every edge case needs a few bits.
> >>>>
> >>>> An alternative is to move the bit to cb[] and clear it for every code
> >>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
> >>>>
> >>>> I think we can put the bit in sk_buff for now. We can implement the
> >>>> alternative when we are short of bits.
> >>>
> >>> I disagree. sk_buff fields add a cost to every code path. They cannot
> >>> be added for every edge case.
> >>
> >> It only takes an unused bit and does not grow the sk_buff size so I
> >> think it has practically no cost for now.
> >
> > The problem is that that thinking leads to death by a thousand cuts.
> >
> > "for now" forces the cost of having to think hard how to avoid growing
> > sk_buff onto the next person. Let's do it right from the start.
>
> I see. I described an alternative to move the bit to cb[] and clear it
> in all code paths that leads to ndo_start_xmit() earlier. Does that
> sound good to you?

If you use the control block to pass information between
__dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
the field can be left undefined in all non-tun paths. tun_select_queue
can initialize.

I would still use skb->hash to encode the hash. That hash type of that
field is not strictly defined. It can be siphash from ___skb_get_hash
or a device hash, which most likely also uses Toeplitz. Then you also
don't run into the problem of growing the struct size.

2023-10-10 01:52:36

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/09 19:44, Willem de Bruijn wrote:
> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 19:06, Willem de Bruijn wrote:
>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>>>> purpose of RSS.
>>>>>>>>>>>>
>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>>>
>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>>>>>> the specification.
>>>>>>>>>>>>
>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>>>>>> with the specification.
>>>>>>>>>>>>
>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>>>>>> ---
>>>>>>>>>>>
>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>>>>>> + .max_indirection_table_length =
>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>>>>>> +
>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>>>>>> +};
>>>>>>>>>>>
>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>>>>>> supports all or none.
>>>>>>>>>>
>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>>>>>
>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>>>>>
>>>>>>>>>> I'm also adding this so that we can extend it later.
>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>>>>>> types may have other bits for new protocols in the future.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>>>>>> + else if (bpf_ret)
>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>>>>>
>>>>>>>>>>> Don't make one feature disable another.
>>>>>>>>>>>
>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>>>>>> for instance.
>>>>>>>>>>>
>>>>>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>> +
>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>>>>>> +
>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>> + break;
>>>>>>>>>>>> + }
>>>>>>>>>>>
>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>>>>>> struct with the max indirection table size.
>>>>>>>>>>
>>>>>>>>>> I made its size variable because the indirection table and key may grow
>>>>>>>>>> in the future as I wrote above.
>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>>>>>
>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>>>>>> proposed.
>>>>>>>>>
>>>>>>>>> To be clear: please don't just resubmit with that one change.
>>>>>>>>>
>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>>>>>
>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>>>>>> exact purpose cannot be amended to return the key it produced.
>>>>>>>>
>>>>>>>> eBPF steering program is not designed for this particular problem in my
>>>>>>>> understanding. It was introduced to derive hash values with an
>>>>>>>> understanding of application-specific semantics of packets instead of
>>>>>>>> generic IP/TCP/UDP semantics.
>>>>>>>>
>>>>>>>> This problem is rather different in terms that the hash derivation is
>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>>>>>> introduce the complexity of BPF when you always run the same code.
>>>>>>>>
>>>>>>>> It can utilize the existing flow dissector and also make it easier to
>>>>>>>> use for the userspace by implementing this in the kernel.
>>>>>>>
>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
>>>>>>> easier to deploy to just have standard Toeplitz available without
>>>>>>> having to compile and load an eBPF program.
>>>>>>>
>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
>>>>>>
>>>>>> An alternative is to move the bit to cb[] and clear it for every code
>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>>>>>
>>>>>> I think we can put the bit in sk_buff for now. We can implement the
>>>>>> alternative when we are short of bits.
>>>>>
>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>>>> be added for every edge case.
>>>>
>>>> It only takes an unused bit and does not grow the sk_buff size so I
>>>> think it has practically no cost for now.
>>>
>>> The problem is that that thinking leads to death by a thousand cuts.
>>>
>>> "for now" forces the cost of having to think hard how to avoid growing
>>> sk_buff onto the next person. Let's do it right from the start.
>>
>> I see. I described an alternative to move the bit to cb[] and clear it
>> in all code paths that leads to ndo_start_xmit() earlier. Does that
>> sound good to you?
>
> If you use the control block to pass information between
> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
> the field can be left undefined in all non-tun paths. tun_select_queue
> can initialize.

The problem is that tun_select_queue() is not always called.
netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
calling it, but this variable may change later and result in a race
condition. Another case is that XDP with predefined queue.

>
> I would still use skb->hash to encode the hash. That hash type of that
> field is not strictly defined. It can be siphash from ___skb_get_hash
> or a device hash, which most likely also uses Toeplitz. Then you also
> don't run into the problem of growing the struct size.

I'm concerned exactly because it's not strictly defined. Someone may
decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
also has sufficient space to contain both of the hash value and type.

2023-10-10 05:46:29

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/09 19:44, Willem de Bruijn wrote:
> > On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 19:06, Willem de Bruijn wrote:
> >>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
> >>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
> >>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>>>> restrictive nature of eBPF.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>>>>>>>> the specification.
> >>>>>>>>>>>>
> >>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>>>>>>>> independently since it may have been computed in a way not conformant
> >>>>>>>>>>>> with the specification.
> >>>>>>>>>>>>
> >>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>>>>>>>> ---
> >>>>>>>>>>>
> >>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>>>>>> + .max_indirection_table_length =
> >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>>>>>> +};
> >>>>>>>>>>>
> >>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>>>>>>>> supports all or none.
> >>>>>>>>>>
> >>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>>>>>
> >>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
> >>>>>>>>>> consumed at most when interpreting destination option header so I chose
> >>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>>>>>>>> anyone will complain about it since nobody complains for Linux.
> >>>>>>>>>>
> >>>>>>>>>> I'm also adding this so that we can extend it later.
> >>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>>>>>> types may have other bits for new protocols in the future.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>>> case TUNSETSTEERINGEBPF:
> >>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>> + if (IS_ERR(bpf_ret))
> >>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>>>>>>>> + else if (bpf_ret)
> >>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>>>>>
> >>>>>>>>>>> Don't make one feature disable another.
> >>>>>>>>>>>
> >>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>>>>>>>> for instance.
> >>>>>>>>>>>
> >>>>>>>>>>>> + case TUNSETVNETHASH:
> >>>>>>>>>>>> + len = sizeof(vnet_hash);
> >>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>> + break;
> >>>>>>>>>>>> + }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>>>>>> + ret = -EINVAL;
> >>>>>>>>>>>> + break;
> >>>>>>>>>>>> + }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>> + break;
> >>>>>>>>>>>> + }
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>>>>>> +
> >>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>> + break;
> >>>>>>>>>>>> + }
> >>>>>>>>>>>
> >>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
> >>>>>>>>>>> struct with the max indirection table size.
> >>>>>>>>>>
> >>>>>>>>>> I made its size variable because the indirection table and key may grow
> >>>>>>>>>> in the future as I wrote above.
> >>>>>>>>>>
> >>>>>>>>>>>
> >>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>>>>>>>
> >>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>>>>>>>> proposed.
> >>>>>>>>>
> >>>>>>>>> To be clear: please don't just resubmit with that one change.
> >>>>>>>>>
> >>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>>>>>>>
> >>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
> >>>>>>>>> exact purpose cannot be amended to return the key it produced.
> >>>>>>>>
> >>>>>>>> eBPF steering program is not designed for this particular problem in my
> >>>>>>>> understanding. It was introduced to derive hash values with an
> >>>>>>>> understanding of application-specific semantics of packets instead of
> >>>>>>>> generic IP/TCP/UDP semantics.
> >>>>>>>>
> >>>>>>>> This problem is rather different in terms that the hash derivation is
> >>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
> >>>>>>>> introduce the complexity of BPF when you always run the same code.
> >>>>>>>>
> >>>>>>>> It can utilize the existing flow dissector and also make it easier to
> >>>>>>>> use for the userspace by implementing this in the kernel.
> >>>>>>>
> >>>>>>> Ok. There does appear to be overlap in functionality. But it might be
> >>>>>>> easier to deploy to just have standard Toeplitz available without
> >>>>>>> having to compile and load an eBPF program.
> >>>>>>>
> >>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
> >>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
> >>>>>>
> >>>>>> An alternative is to move the bit to cb[] and clear it for every code
> >>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
> >>>>>>
> >>>>>> I think we can put the bit in sk_buff for now. We can implement the
> >>>>>> alternative when we are short of bits.
> >>>>>
> >>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
> >>>>> be added for every edge case.
> >>>>
> >>>> It only takes an unused bit and does not grow the sk_buff size so I
> >>>> think it has practically no cost for now.
> >>>
> >>> The problem is that that thinking leads to death by a thousand cuts.
> >>>
> >>> "for now" forces the cost of having to think hard how to avoid growing
> >>> sk_buff onto the next person. Let's do it right from the start.
> >>
> >> I see. I described an alternative to move the bit to cb[] and clear it
> >> in all code paths that leads to ndo_start_xmit() earlier. Does that
> >> sound good to you?
> >
> > If you use the control block to pass information between
> > __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
> > the field can be left undefined in all non-tun paths. tun_select_queue
> > can initialize.
>
> The problem is that tun_select_queue() is not always called.
> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
> calling it, but this variable may change later and result in a race
> condition. Another case is that XDP with predefined queue.
>
> >
> > I would still use skb->hash to encode the hash. That hash type of that
> > field is not strictly defined. It can be siphash from ___skb_get_hash
> > or a device hash, which most likely also uses Toeplitz. Then you also
> > don't run into the problem of growing the struct size.
>
> I'm concerned exactly because it's not strictly defined. Someone may
> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
> also has sufficient space to contain both of the hash value and type.

How about using skb extensions?

Thanks

>

2023-10-10 05:51:38

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/10 14:45, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/09 19:44, Willem de Bruijn wrote:
>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>>>>>> purpose of RSS.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>>>>>>>> the specification.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>>>>>>>> with the specification.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>
>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>>>>>>>> + .max_indirection_table_length =
>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>
>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>>>>>>>> supports all or none.
>>>>>>>>>>>>
>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>>>>>>>
>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm also adding this so that we can extend it later.
>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>>>>>>>> types may have other bits for new protocols in the future.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>>>>>>>> + else if (bpf_ret)
>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>>>>>>>
>>>>>>>>>>>>> Don't make one feature disable another.
>>>>>>>>>>>>>
>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>>>>>>>> for instance.
>>>>>>>>>>>>>
>>>>>>>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>>>>>>>> +
>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>
>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>>>>>>>> struct with the max indirection table size.
>>>>>>>>>>>>
>>>>>>>>>>>> I made its size variable because the indirection table and key may grow
>>>>>>>>>>>> in the future as I wrote above.
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>>>>>>>
>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>>>>>>>> proposed.
>>>>>>>>>>>
>>>>>>>>>>> To be clear: please don't just resubmit with that one change.
>>>>>>>>>>>
>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>>>>>>>
>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>>>>>>>> exact purpose cannot be amended to return the key it produced.
>>>>>>>>>>
>>>>>>>>>> eBPF steering program is not designed for this particular problem in my
>>>>>>>>>> understanding. It was introduced to derive hash values with an
>>>>>>>>>> understanding of application-specific semantics of packets instead of
>>>>>>>>>> generic IP/TCP/UDP semantics.
>>>>>>>>>>
>>>>>>>>>> This problem is rather different in terms that the hash derivation is
>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>>>>>>>> introduce the complexity of BPF when you always run the same code.
>>>>>>>>>>
>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to
>>>>>>>>>> use for the userspace by implementing this in the kernel.
>>>>>>>>>
>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
>>>>>>>>> easier to deploy to just have standard Toeplitz available without
>>>>>>>>> having to compile and load an eBPF program.
>>>>>>>>>
>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
>>>>>>>>
>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code
>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>>>>>>>
>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the
>>>>>>>> alternative when we are short of bits.
>>>>>>>
>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>>>>>> be added for every edge case.
>>>>>>
>>>>>> It only takes an unused bit and does not grow the sk_buff size so I
>>>>>> think it has practically no cost for now.
>>>>>
>>>>> The problem is that that thinking leads to death by a thousand cuts.
>>>>>
>>>>> "for now" forces the cost of having to think hard how to avoid growing
>>>>> sk_buff onto the next person. Let's do it right from the start.
>>>>
>>>> I see. I described an alternative to move the bit to cb[] and clear it
>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that
>>>> sound good to you?
>>>
>>> If you use the control block to pass information between
>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
>>> the field can be left undefined in all non-tun paths. tun_select_queue
>>> can initialize.
>>
>> The problem is that tun_select_queue() is not always called.
>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
>> calling it, but this variable may change later and result in a race
>> condition. Another case is that XDP with predefined queue.
>>
>>>
>>> I would still use skb->hash to encode the hash. That hash type of that
>>> field is not strictly defined. It can be siphash from ___skb_get_hash
>>> or a device hash, which most likely also uses Toeplitz. Then you also
>>> don't run into the problem of growing the struct size.
>>
>> I'm concerned exactly because it's not strictly defined. Someone may
>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
>> also has sufficient space to contain both of the hash value and type.
>
> How about using skb extensions?

I think it will work. I'll try it in the next version.

2023-10-10 06:01:59

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/10 14:45, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/09 19:44, Willem de Bruijn wrote:
> >>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
> >>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
> >>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
> >>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>>>>>> restrictive nature of eBPF.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>>>>>>>>>> the specification.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>>>>>>>>>> independently since it may have been computed in a way not conformant
> >>>>>>>>>>>>>> with the specification.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>>>>>>>> + .max_indirection_table_length =
> >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>>>>>>>> +};
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>>>>>>>>>> supports all or none.
> >>>>>>>>>>>>
> >>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
> >>>>>>>>>>>> consumed at most when interpreting destination option header so I chose
> >>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'm also adding this so that we can extend it later.
> >>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>>>>>>>> types may have other bits for new protocols in the future.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
> >>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
> >>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>>>>>>>>>> + else if (bpf_ret)
> >>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Don't make one feature disable another.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>>>>>>>>>> for instance.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>>> + case TUNSETVNETHASH:
> >>>>>>>>>>>>>> + len = sizeof(vnet_hash);
> >>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>>>>>>>> + ret = -EINVAL;
> >>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>>>>>>>> +
> >>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
> >>>>>>>>>>>>> struct with the max indirection table size.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I made its size variable because the indirection table and key may grow
> >>>>>>>>>>>> in the future as I wrote above.
> >>>>>>>>>>>>
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>>>>>>>>>
> >>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>>>>>>>>>> proposed.
> >>>>>>>>>>>
> >>>>>>>>>>> To be clear: please don't just resubmit with that one change.
> >>>>>>>>>>>
> >>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>>>>>>>>>
> >>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
> >>>>>>>>>>> exact purpose cannot be amended to return the key it produced.
> >>>>>>>>>>
> >>>>>>>>>> eBPF steering program is not designed for this particular problem in my
> >>>>>>>>>> understanding. It was introduced to derive hash values with an
> >>>>>>>>>> understanding of application-specific semantics of packets instead of
> >>>>>>>>>> generic IP/TCP/UDP semantics.
> >>>>>>>>>>
> >>>>>>>>>> This problem is rather different in terms that the hash derivation is
> >>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
> >>>>>>>>>> introduce the complexity of BPF when you always run the same code.
> >>>>>>>>>>
> >>>>>>>>>> It can utilize the existing flow dissector and also make it easier to
> >>>>>>>>>> use for the userspace by implementing this in the kernel.
> >>>>>>>>>
> >>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
> >>>>>>>>> easier to deploy to just have standard Toeplitz available without
> >>>>>>>>> having to compile and load an eBPF program.
> >>>>>>>>>
> >>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
> >>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
> >>>>>>>>
> >>>>>>>> An alternative is to move the bit to cb[] and clear it for every code
> >>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
> >>>>>>>>
> >>>>>>>> I think we can put the bit in sk_buff for now. We can implement the
> >>>>>>>> alternative when we are short of bits.
> >>>>>>>
> >>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
> >>>>>>> be added for every edge case.
> >>>>>>
> >>>>>> It only takes an unused bit and does not grow the sk_buff size so I
> >>>>>> think it has practically no cost for now.
> >>>>>
> >>>>> The problem is that that thinking leads to death by a thousand cuts.
> >>>>>
> >>>>> "for now" forces the cost of having to think hard how to avoid growing
> >>>>> sk_buff onto the next person. Let's do it right from the start.
> >>>>
> >>>> I see. I described an alternative to move the bit to cb[] and clear it
> >>>> in all code paths that leads to ndo_start_xmit() earlier. Does that
> >>>> sound good to you?
> >>>
> >>> If you use the control block to pass information between
> >>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
> >>> the field can be left undefined in all non-tun paths. tun_select_queue
> >>> can initialize.
> >>
> >> The problem is that tun_select_queue() is not always called.
> >> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
> >> calling it, but this variable may change later and result in a race
> >> condition. Another case is that XDP with predefined queue.
> >>
> >>>
> >>> I would still use skb->hash to encode the hash. That hash type of that
> >>> field is not strictly defined. It can be siphash from ___skb_get_hash
> >>> or a device hash, which most likely also uses Toeplitz. Then you also
> >>> don't run into the problem of growing the struct size.
> >>
> >> I'm concerned exactly because it's not strictly defined. Someone may
> >> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
> >> also has sufficient space to contain both of the hash value and type.
> >
> > How about using skb extensions?
>
> I think it will work. I'll try it in the next version.

Btw, I still think using eBPF for hash might be better.

Though the hashing rule is defined in the spec, it may be extended in
the future. For example, several extensions has been proposed:

1) RSS context
2) encapsulated packet hashing

Thanks

>

2023-10-10 06:19:47

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/10 15:00, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/10 14:45, Jason Wang wrote:
>>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/09 19:44, Willem de Bruijn wrote:
>>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
>>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
>>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>>>>>>>> purpose of RSS.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>>>>>>>>>> the specification.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>>>>>>>>>> with the specification.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>>>>>>>>>> + .max_indirection_table_length =
>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>>>>>>>>>> supports all or none.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'm also adding this so that we can extend it later.
>>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>>>>>>>>>> types may have other bits for new protocols in the future.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>>>>>>>>>> + else if (bpf_ret)
>>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Don't make one feature disable another.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>>>>>>>>>> for instance.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>>>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>>>>>>>>>> struct with the max indirection table size.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow
>>>>>>>>>>>>>> in the future as I wrote above.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>>>>>>>>>> proposed.
>>>>>>>>>>>>>
>>>>>>>>>>>>> To be clear: please don't just resubmit with that one change.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>>>>>>>>>
>>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced.
>>>>>>>>>>>>
>>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my
>>>>>>>>>>>> understanding. It was introduced to derive hash values with an
>>>>>>>>>>>> understanding of application-specific semantics of packets instead of
>>>>>>>>>>>> generic IP/TCP/UDP semantics.
>>>>>>>>>>>>
>>>>>>>>>>>> This problem is rather different in terms that the hash derivation is
>>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>>>>>>>>>> introduce the complexity of BPF when you always run the same code.
>>>>>>>>>>>>
>>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to
>>>>>>>>>>>> use for the userspace by implementing this in the kernel.
>>>>>>>>>>>
>>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
>>>>>>>>>>> easier to deploy to just have standard Toeplitz available without
>>>>>>>>>>> having to compile and load an eBPF program.
>>>>>>>>>>>
>>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
>>>>>>>>>>
>>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code
>>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>>>>>>>>>
>>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the
>>>>>>>>>> alternative when we are short of bits.
>>>>>>>>>
>>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>>>>>>>> be added for every edge case.
>>>>>>>>
>>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I
>>>>>>>> think it has practically no cost for now.
>>>>>>>
>>>>>>> The problem is that that thinking leads to death by a thousand cuts.
>>>>>>>
>>>>>>> "for now" forces the cost of having to think hard how to avoid growing
>>>>>>> sk_buff onto the next person. Let's do it right from the start.
>>>>>>
>>>>>> I see. I described an alternative to move the bit to cb[] and clear it
>>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that
>>>>>> sound good to you?
>>>>>
>>>>> If you use the control block to pass information between
>>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
>>>>> the field can be left undefined in all non-tun paths. tun_select_queue
>>>>> can initialize.
>>>>
>>>> The problem is that tun_select_queue() is not always called.
>>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
>>>> calling it, but this variable may change later and result in a race
>>>> condition. Another case is that XDP with predefined queue.
>>>>
>>>>>
>>>>> I would still use skb->hash to encode the hash. That hash type of that
>>>>> field is not strictly defined. It can be siphash from ___skb_get_hash
>>>>> or a device hash, which most likely also uses Toeplitz. Then you also
>>>>> don't run into the problem of growing the struct size.
>>>>
>>>> I'm concerned exactly because it's not strictly defined. Someone may
>>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
>>>> also has sufficient space to contain both of the hash value and type.
>>>
>>> How about using skb extensions?
>>
>> I think it will work. I'll try it in the next version.
>
> Btw, I still think using eBPF for hash might be better.
>
> Though the hashing rule is defined in the spec, it may be extended in
> the future. For example, several extensions has been proposed:
>
> 1) RSS context
> 2) encapsulated packet hashing

Looking at the proposals, I'm now more inclined to extend the BPF
steering program.

Yuri, who wrote the RFC patches to extend the BPF steering program, also
raised an concern that it may become hard to implement virtio-net
extensions in the future. It is much easier to deploy a new BPF program
to support extensions since it will be included in QEMU and can be
deployed at once without concerning other kernel stuff.

I was still not sure how likely such an extension will emerge especially
when the hardware RSS capability is not evolving for a decade or so. But
those proposals show that there are more demands of new features for
virtio-net.

2023-10-11 03:19:42

by Jason Wang

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On Tue, Oct 10, 2023 at 2:19 PM Akihiko Odaki <[email protected]> wrote:
>
> On 2023/10/10 15:00, Jason Wang wrote:
> > On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <[email protected]> wrote:
> >>
> >> On 2023/10/10 14:45, Jason Wang wrote:
> >>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
> >>>>
> >>>> On 2023/10/09 19:44, Willem de Bruijn wrote:
> >>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>
> >>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
> >>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>
> >>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
> >>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>
> >>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
> >>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>
> >>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
> >>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
> >>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
> >>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
> >>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
> >>>>>>>>>>>>>>>> purpose of RSS.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
> >>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
> >>>>>>>>>>>>>>>> restrictive nature of eBPF.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
> >>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
> >>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
> >>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
> >>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
> >>>>>>>>>>>>>>>> the specification.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
> >>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant
> >>>>>>>>>>>>>>>> with the specification.
> >>>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
> >>>>>>>>>>>>>>>> ---
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
> >>>>>>>>>>>>>>>> + .max_indirection_table_length =
> >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
> >>>>>>>>>>>>>>>> +};
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
> >>>>>>>>>>>>>>> supports all or none.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
> >>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
> >>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
> >>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose
> >>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
> >>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'm also adding this so that we can extend it later.
> >>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
> >>>>>>>>>>>>>> types may have other bits for new protocols in the future.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
> >>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
> >>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
> >>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
> >>>>>>>>>>>>>>>> + else if (bpf_ret)
> >>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Don't make one feature disable another.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
> >>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
> >>>>>>>>>>>>>>> for instance.
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>> + case TUNSETVNETHASH:
> >>>>>>>>>>>>>>>> + len = sizeof(vnet_hash);
> >>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
> >>>>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
> >>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
> >>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
> >>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
> >>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
> >>>>>>>>>>>>>>>> + ret = -EINVAL;
> >>>>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
> >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
> >>>>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
> >>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
> >>>>>>>>>>>>>>>> +
> >>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
> >>>>>>>>>>>>>>>> + ret = -EFAULT;
> >>>>>>>>>>>>>>>> + break;
> >>>>>>>>>>>>>>>> + }
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
> >>>>>>>>>>>>>>> struct with the max indirection table size.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow
> >>>>>>>>>>>>>> in the future as I wrote above.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>>>
> >>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
> >>>>>>>>>>>>>>
> >>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
> >>>>>>>>>>>>>> proposed.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> To be clear: please don't just resubmit with that one change.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
> >>>>>>>>>>>>>
> >>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
> >>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced.
> >>>>>>>>>>>>
> >>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my
> >>>>>>>>>>>> understanding. It was introduced to derive hash values with an
> >>>>>>>>>>>> understanding of application-specific semantics of packets instead of
> >>>>>>>>>>>> generic IP/TCP/UDP semantics.
> >>>>>>>>>>>>
> >>>>>>>>>>>> This problem is rather different in terms that the hash derivation is
> >>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
> >>>>>>>>>>>> introduce the complexity of BPF when you always run the same code.
> >>>>>>>>>>>>
> >>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to
> >>>>>>>>>>>> use for the userspace by implementing this in the kernel.
> >>>>>>>>>>>
> >>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
> >>>>>>>>>>> easier to deploy to just have standard Toeplitz available without
> >>>>>>>>>>> having to compile and load an eBPF program.
> >>>>>>>>>>>
> >>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
> >>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
> >>>>>>>>>>
> >>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code
> >>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
> >>>>>>>>>>
> >>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the
> >>>>>>>>>> alternative when we are short of bits.
> >>>>>>>>>
> >>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
> >>>>>>>>> be added for every edge case.
> >>>>>>>>
> >>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I
> >>>>>>>> think it has practically no cost for now.
> >>>>>>>
> >>>>>>> The problem is that that thinking leads to death by a thousand cuts.
> >>>>>>>
> >>>>>>> "for now" forces the cost of having to think hard how to avoid growing
> >>>>>>> sk_buff onto the next person. Let's do it right from the start.
> >>>>>>
> >>>>>> I see. I described an alternative to move the bit to cb[] and clear it
> >>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that
> >>>>>> sound good to you?
> >>>>>
> >>>>> If you use the control block to pass information between
> >>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
> >>>>> the field can be left undefined in all non-tun paths. tun_select_queue
> >>>>> can initialize.
> >>>>
> >>>> The problem is that tun_select_queue() is not always called.
> >>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
> >>>> calling it, but this variable may change later and result in a race
> >>>> condition. Another case is that XDP with predefined queue.
> >>>>
> >>>>>
> >>>>> I would still use skb->hash to encode the hash. That hash type of that
> >>>>> field is not strictly defined. It can be siphash from ___skb_get_hash
> >>>>> or a device hash, which most likely also uses Toeplitz. Then you also
> >>>>> don't run into the problem of growing the struct size.
> >>>>
> >>>> I'm concerned exactly because it's not strictly defined. Someone may
> >>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
> >>>> also has sufficient space to contain both of the hash value and type.
> >>>
> >>> How about using skb extensions?
> >>
> >> I think it will work. I'll try it in the next version.
> >
> > Btw, I still think using eBPF for hash might be better.
> >
> > Though the hashing rule is defined in the spec, it may be extended in
> > the future. For example, several extensions has been proposed:
> >
> > 1) RSS context
> > 2) encapsulated packet hashing
>
> Looking at the proposals, I'm now more inclined to extend the BPF
> steering program.

Just to make sure we are at the same page.

If the eBPF program needs to access skb extensions, it would not be a
steering program anymore (not a filter).

Or do you mean it is a dedicated eBPF program that calculates the hash?

>
> Yuri, who wrote the RFC patches to extend the BPF steering program, also
> raised an concern that it may become hard to implement virtio-net
> extensions in the future. It is much easier to deploy a new BPF program
> to support extensions since it will be included in QEMU and can be
> deployed at once without concerning other kernel stuff.
>
> I was still not sure how likely such an extension will emerge especially
> when the hardware RSS capability is not evolving for a decade or so. But
> those proposals show that there are more demands of new features for
> virtio-net.

It's not only the RSS, if you track virtio development, flow directors
are also being proposed.

Thanks

>

2023-10-11 03:58:41

by Akihiko Odaki

[permalink] [raw]
Subject: Re: [RFC PATCH 5/7] tun: Introduce virtio-net hashing feature

On 2023/10/11 12:18, Jason Wang wrote:
> On Tue, Oct 10, 2023 at 2:19 PM Akihiko Odaki <[email protected]> wrote:
>>
>> On 2023/10/10 15:00, Jason Wang wrote:
>>> On Tue, Oct 10, 2023 at 1:51 PM Akihiko Odaki <[email protected]> wrote:
>>>>
>>>> On 2023/10/10 14:45, Jason Wang wrote:
>>>>> On Tue, Oct 10, 2023 at 9:52 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>
>>>>>> On 2023/10/09 19:44, Willem de Bruijn wrote:
>>>>>>> On Mon, Oct 9, 2023 at 3:12 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>
>>>>>>>> On 2023/10/09 19:06, Willem de Bruijn wrote:
>>>>>>>>> On Mon, Oct 9, 2023 at 3:02 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>> On 2023/10/09 18:57, Willem de Bruijn wrote:
>>>>>>>>>>> On Mon, Oct 9, 2023 at 3:57 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> On 2023/10/09 17:04, Willem de Bruijn wrote:
>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 3:46 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2023/10/09 5:08, Willem de Bruijn wrote:
>>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 10:04 PM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 2023/10/09 4:07, Willem de Bruijn wrote:
>>>>>>>>>>>>>>>>> On Sun, Oct 8, 2023 at 7:22 AM Akihiko Odaki <[email protected]> wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> virtio-net have two usage of hashes: one is RSS and another is hash
>>>>>>>>>>>>>>>>>> reporting. Conventionally the hash calculation was done by the VMM.
>>>>>>>>>>>>>>>>>> However, computing the hash after the queue was chosen defeats the
>>>>>>>>>>>>>>>>>> purpose of RSS.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Another approach is to use eBPF steering program. This approach has
>>>>>>>>>>>>>>>>>> another downside: it cannot report the calculated hash due to the
>>>>>>>>>>>>>>>>>> restrictive nature of eBPF.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Introduce the code to compute hashes to the kernel in order to overcome
>>>>>>>>>>>>>>>>>> thse challenges. An alternative solution is to extend the eBPF steering
>>>>>>>>>>>>>>>>>> program so that it will be able to report to the userspace, but it makes
>>>>>>>>>>>>>>>>>> little sense to allow to implement different hashing algorithms with
>>>>>>>>>>>>>>>>>> eBPF since the hash value reported by virtio-net is strictly defined by
>>>>>>>>>>>>>>>>>> the specification.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> The hash value already stored in sk_buff is not used and computed
>>>>>>>>>>>>>>>>>> independently since it may have been computed in a way not conformant
>>>>>>>>>>>>>>>>>> with the specification.
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Signed-off-by: Akihiko Odaki <[email protected]>
>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> +static const struct tun_vnet_hash_cap tun_vnet_hash_cap = {
>>>>>>>>>>>>>>>>>> + .max_indirection_table_length =
>>>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH,
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + .types = VIRTIO_NET_SUPPORTED_HASH_TYPES
>>>>>>>>>>>>>>>>>> +};
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> No need to have explicit capabilities exchange like this? Tun either
>>>>>>>>>>>>>>>>> supports all or none.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> tun does not support VIRTIO_NET_RSS_HASH_TYPE_IP_EX,
>>>>>>>>>>>>>>>> VIRTIO_NET_RSS_HASH_TYPE_TCP_EX, and VIRTIO_NET_RSS_HASH_TYPE_UDP_EX.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> It is because the flow dissector does not support IPv6 extensions. The
>>>>>>>>>>>>>>>> specification is also vague, and does not tell how many TLVs should be
>>>>>>>>>>>>>>>> consumed at most when interpreting destination option header so I chose
>>>>>>>>>>>>>>>> to avoid adding code for these hash types to the flow dissector. I doubt
>>>>>>>>>>>>>>>> anyone will complain about it since nobody complains for Linux.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'm also adding this so that we can extend it later.
>>>>>>>>>>>>>>>> max_indirection_table_length may grow for systems with 128+ CPUs, or
>>>>>>>>>>>>>>>> types may have other bits for new protocols in the future.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> case TUNSETSTEERINGEBPF:
>>>>>>>>>>>>>>>>>> - ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>>>>>> + bpf_ret = tun_set_ebpf(tun, &tun->steering_prog, argp);
>>>>>>>>>>>>>>>>>> + if (IS_ERR(bpf_ret))
>>>>>>>>>>>>>>>>>> + ret = PTR_ERR(bpf_ret);
>>>>>>>>>>>>>>>>>> + else if (bpf_ret)
>>>>>>>>>>>>>>>>>> + tun->vnet_hash.flags &= ~TUN_VNET_HASH_RSS;
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Don't make one feature disable another.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> TUNSETSTEERINGEBPF and TUNSETVNETHASH are mutually exclusive
>>>>>>>>>>>>>>>>> functions. If one is enabled the other call should fail, with EBUSY
>>>>>>>>>>>>>>>>> for instance.
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> + case TUNSETVNETHASH:
>>>>>>>>>>>>>>>>>> + len = sizeof(vnet_hash);
>>>>>>>>>>>>>>>>>> + if (copy_from_user(&vnet_hash, argp, len)) {
>>>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + if (((vnet_hash.flags & TUN_VNET_HASH_REPORT) &&
>>>>>>>>>>>>>>>>>> + (tun->vnet_hdr_sz < sizeof(struct virtio_net_hdr_v1_hash) ||
>>>>>>>>>>>>>>>>>> + !tun_is_little_endian(tun))) ||
>>>>>>>>>>>>>>>>>> + vnet_hash.indirection_table_mask >=
>>>>>>>>>>>>>>>>>> + TUN_VNET_HASH_MAX_INDIRECTION_TABLE_LENGTH) {
>>>>>>>>>>>>>>>>>> + ret = -EINVAL;
>>>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>>>>>> + len = (vnet_hash.indirection_table_mask + 1) * 2;
>>>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_indirection_table, argp, len)) {
>>>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + argp = (u8 __user *)argp + len;
>>>>>>>>>>>>>>>>>> + len = virtio_net_hash_key_length(vnet_hash.types);
>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>> + if (copy_from_user(vnet_hash_key, argp, len)) {
>>>>>>>>>>>>>>>>>> + ret = -EFAULT;
>>>>>>>>>>>>>>>>>> + break;
>>>>>>>>>>>>>>>>>> + }
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Probably easier and less error-prone to define a fixed size control
>>>>>>>>>>>>>>>>> struct with the max indirection table size.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I made its size variable because the indirection table and key may grow
>>>>>>>>>>>>>>>> in the future as I wrote above.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> Btw: please trim the CC: list considerably on future patches.
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I'll do so in the next version with the TUNSETSTEERINGEBPF change you
>>>>>>>>>>>>>>>> proposed.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> To be clear: please don't just resubmit with that one change.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The skb and cb issues are quite fundamental issues that need to be resolved.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I'd like to understand why adjusting the existing BPF feature for this
>>>>>>>>>>>>>>> exact purpose cannot be amended to return the key it produced.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> eBPF steering program is not designed for this particular problem in my
>>>>>>>>>>>>>> understanding. It was introduced to derive hash values with an
>>>>>>>>>>>>>> understanding of application-specific semantics of packets instead of
>>>>>>>>>>>>>> generic IP/TCP/UDP semantics.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> This problem is rather different in terms that the hash derivation is
>>>>>>>>>>>>>> strictly defined by virtio-net. I don't think it makes sense to
>>>>>>>>>>>>>> introduce the complexity of BPF when you always run the same code.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> It can utilize the existing flow dissector and also make it easier to
>>>>>>>>>>>>>> use for the userspace by implementing this in the kernel.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Ok. There does appear to be overlap in functionality. But it might be
>>>>>>>>>>>>> easier to deploy to just have standard Toeplitz available without
>>>>>>>>>>>>> having to compile and load an eBPF program.
>>>>>>>>>>>>>
>>>>>>>>>>>>> As for the sk_buff and cb[] changes. The first is really not needed.
>>>>>>>>>>>>> sk_buff simply would not scale if every edge case needs a few bits.
>>>>>>>>>>>>
>>>>>>>>>>>> An alternative is to move the bit to cb[] and clear it for every code
>>>>>>>>>>>> paths that lead to ndo_start_xmit(), but I'm worried that it is error-prone.
>>>>>>>>>>>>
>>>>>>>>>>>> I think we can put the bit in sk_buff for now. We can implement the
>>>>>>>>>>>> alternative when we are short of bits.
>>>>>>>>>>>
>>>>>>>>>>> I disagree. sk_buff fields add a cost to every code path. They cannot
>>>>>>>>>>> be added for every edge case.
>>>>>>>>>>
>>>>>>>>>> It only takes an unused bit and does not grow the sk_buff size so I
>>>>>>>>>> think it has practically no cost for now.
>>>>>>>>>
>>>>>>>>> The problem is that that thinking leads to death by a thousand cuts.
>>>>>>>>>
>>>>>>>>> "for now" forces the cost of having to think hard how to avoid growing
>>>>>>>>> sk_buff onto the next person. Let's do it right from the start.
>>>>>>>>
>>>>>>>> I see. I described an alternative to move the bit to cb[] and clear it
>>>>>>>> in all code paths that leads to ndo_start_xmit() earlier. Does that
>>>>>>>> sound good to you?
>>>>>>>
>>>>>>> If you use the control block to pass information between
>>>>>>> __dev_queue_xmit on the tun device and tun_net_xmit, using gso_skb_cb,
>>>>>>> the field can be left undefined in all non-tun paths. tun_select_queue
>>>>>>> can initialize.
>>>>>>
>>>>>> The problem is that tun_select_queue() is not always called.
>>>>>> netdev_core_pick_tx() ensures dev->real_num_tx_queues != 1 before
>>>>>> calling it, but this variable may change later and result in a race
>>>>>> condition. Another case is that XDP with predefined queue.
>>>>>>
>>>>>>>
>>>>>>> I would still use skb->hash to encode the hash. That hash type of that
>>>>>>> field is not strictly defined. It can be siphash from ___skb_get_hash
>>>>>>> or a device hash, which most likely also uses Toeplitz. Then you also
>>>>>>> don't run into the problem of growing the struct size.
>>>>>>
>>>>>> I'm concerned exactly because it's not strictly defined. Someone may
>>>>>> decide to overwrite it later if we are not cautious enough. qdisc_skb_cb
>>>>>> also has sufficient space to contain both of the hash value and type.
>>>>>
>>>>> How about using skb extensions?
>>>>
>>>> I think it will work. I'll try it in the next version.
>>>
>>> Btw, I still think using eBPF for hash might be better.
>>>
>>> Though the hashing rule is defined in the spec, it may be extended in
>>> the future. For example, several extensions has been proposed:
>>>
>>> 1) RSS context
>>> 2) encapsulated packet hashing
>>
>> Looking at the proposals, I'm now more inclined to extend the BPF
>> steering program.
>
> Just to make sure we are at the same page.
>
> If the eBPF program needs to access skb extensions, it would not be a
> steering program anymore (not a filter).
>
> Or do you mean it is a dedicated eBPF program that calculates the hash?

I think the BPF program should be a steering program but extended for
hash reporting.

Since we need a hash reporting feature that is not present in a socket
filter, the BPF program should have a dedicated bpf_prog_type (not
BPF_PROG_TYPE_SOCKET_FILTER). However, as its functionality is the
superset of the conventional steering program, I'm planning to use the
existing TUNSETSTEERINGEBPF ioctl to set it.