LinuxLists.cc - [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Conley <[email protected]> writes:

> Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> enables sending/receiving jumbo ethernet frames up to the theoretical
> maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> SKB mode is usable pending future driver work.

Hmm, interesting. So how does this interact with XDP multibuf?

-Toke

2023-04-07 16:30:52

by Maciej Fijalkowski

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Thu, Apr 06, 2023 at 08:38:05PM +0200, Toke H?iland-J?rgensen wrote:
> Kal Conley <[email protected]> writes:
>
> > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > enables sending/receiving jumbo ethernet frames up to the theoretical
> > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > SKB mode is usable pending future driver work.
>
> Hmm, interesting. So how does this interact with XDP multibuf?

To me it currently does not interact with mbuf in any way as it is enabled
only for skb mode which linearizes the skb from what i see.

I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
mode on your side?

>
> -Toke
>

2023-04-08 17:35:24

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > > enables sending/receiving jumbo ethernet frames up to the theoretical
> > > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > > SKB mode is usable pending future driver work.
> >
> > Hmm, interesting. So how does this interact with XDP multibuf?
>
> To me it currently does not interact with mbuf in any way as it is enabled
> only for skb mode which linearizes the skb from what i see.
>
> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
> mode on your side?

Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
AF_XDP in zero-copy mode. This patchset is a step in this direction.
At the very least, it lets you test out the feature in SKB mode
pending future driver support. Currently, XDP multi-buffer does not
support AF_XDP at all. It could support it in theory, but I think it
would need some UAPI design work and a bit of implementation work.

Also, I think that the approach taken in this patchset has some
advantages over XDP multi-buffer:
(1) It should be possible to achieve higher performance
(a) because the packet data is kept together
(b) because you need to acquire and validate less descriptors
and touch the queue pointers less often.
(2) It is a nicer user-space API.
(a) Since the packet data is all available in one linear
buffer. This may even be a requirement to avoid an extra copy if the
data must be handed off contiguously to other code.

The disadvantage of this patchset is requiring the user to allocate
HugeTLB pages which is an extra complication.

I am not sure if this patchset would need to interact with XDP
multi-buffer at all directly. Does anyone have anything to add here?

What other intermediate steps are needed to get to ZC? I think drivers
would already be able to support this now?

Kal

2023-04-12 13:40:49

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
>> > > enables sending/receiving jumbo ethernet frames up to the theoretical
>> > > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
>> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
>> > > SKB mode is usable pending future driver work.
>> >
>> > Hmm, interesting. So how does this interact with XDP multibuf?
>>
>> To me it currently does not interact with mbuf in any way as it is enabled
>> only for skb mode which linearizes the skb from what i see.
>>
>> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
>> mode on your side?
>
> Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
> AF_XDP in zero-copy mode. This patchset is a step in this direction.
> At the very least, it lets you test out the feature in SKB mode
> pending future driver support. Currently, XDP multi-buffer does not
> support AF_XDP at all. It could support it in theory, but I think it
> would need some UAPI design work and a bit of implementation work.
>
> Also, I think that the approach taken in this patchset has some
> advantages over XDP multi-buffer:
> (1) It should be possible to achieve higher performance
> (a) because the packet data is kept together
> (b) because you need to acquire and validate less descriptors
> and touch the queue pointers less often.
> (2) It is a nicer user-space API.
> (a) Since the packet data is all available in one linear
> buffer. This may even be a requirement to avoid an extra copy if the
> data must be handed off contiguously to other code.
>
> The disadvantage of this patchset is requiring the user to allocate
> HugeTLB pages which is an extra complication.
>
> I am not sure if this patchset would need to interact with XDP
> multi-buffer at all directly. Does anyone have anything to add here?

Well, I'm mostly concerned with having two different operation and
configuration modes for the same thing. We'll probably need to support
multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
to create a UAPI for that in any case. And having two APIs is just going
to be more complexity to handle at both the documentation and
maintenance level.

It *might* be worth it to do this if the performance benefit is really
compelling, but, well, you'd need to implement both and compare directly
to know that for sure :)

-Toke

2023-04-12 14:07:24

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Wed, 12 Apr 2023 at 15:40, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Kal Cutter Conley <[email protected]> writes:
>
> >> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> >> > > enables sending/receiving jumbo ethernet frames up to the theoretical
> >> > > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> >> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> >> > > SKB mode is usable pending future driver work.
> >> >
> >> > Hmm, interesting. So how does this interact with XDP multibuf?
> >>
> >> To me it currently does not interact with mbuf in any way as it is enabled
> >> only for skb mode which linearizes the skb from what i see.
> >>
> >> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
> >> mode on your side?
> >
> > Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
> > AF_XDP in zero-copy mode. This patchset is a step in this direction.
> > At the very least, it lets you test out the feature in SKB mode
> > pending future driver support. Currently, XDP multi-buffer does not
> > support AF_XDP at all. It could support it in theory, but I think it
> > would need some UAPI design work and a bit of implementation work.
> >
> > Also, I think that the approach taken in this patchset has some
> > advantages over XDP multi-buffer:
> > (1) It should be possible to achieve higher performance
> > (a) because the packet data is kept together
> > (b) because you need to acquire and validate less descriptors
> > and touch the queue pointers less often.
> > (2) It is a nicer user-space API.
> > (a) Since the packet data is all available in one linear
> > buffer. This may even be a requirement to avoid an extra copy if the
> > data must be handed off contiguously to other code.
> >
> > The disadvantage of this patchset is requiring the user to allocate
> > HugeTLB pages which is an extra complication.
> >
> > I am not sure if this patchset would need to interact with XDP
> > multi-buffer at all directly. Does anyone have anything to add here?
>
> Well, I'm mostly concerned with having two different operation and
> configuration modes for the same thing. We'll probably need to support
> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
> to create a UAPI for that in any case. And having two APIs is just going
> to be more complexity to handle at both the documentation and
> maintenance level.

One does not replace the other. We need them both, unfortunately.
Multi-buff is great for e.g., stitching together different headers
with the same data. Point to different buffers for the header in each
packet but the same piece of data in all of them. This will never be
solved with Kal's approach. We just need multi-buffer support for
this. BTW, we are close to posting multi-buff support for AF_XDP. Just
hang in there a little while longer while the last glitches are fixed.
We have to stage it in two patch sets as it will be too long
otherwise. First one will only contain improvements to the xsk
selftests framework so that multi-buffer tests can be supported. The
second one will be the core code and the actual multi-buffer tests. As
for what Kal's patches are good for, please see below.

> It *might* be worth it to do this if the performance benefit is really
> compelling, but, well, you'd need to implement both and compare directly
> to know that for sure :)

The performance benefit is compelling. As I wrote in a mail to a post
by Kal, there are users out there that state that this feature (for
zero-copy mode nota bene) is a must for them to be able to use AF_XDP
instead of DPDK style user-mode drivers. They have really tough
latency requirements.

> -Toke
>

2023-04-12 23:07:03

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Magnus Karlsson <[email protected]> writes:

> On Wed, 12 Apr 2023 at 15:40, Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Kal Cutter Conley <[email protected]> writes:
>>
>> >> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
>> >> > > enables sending/receiving jumbo ethernet frames up to the theoretical
>> >> > > maxiumum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
>> >> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
>> >> > > SKB mode is usable pending future driver work.
>> >> >
>> >> > Hmm, interesting. So how does this interact with XDP multibuf?
>> >>
>> >> To me it currently does not interact with mbuf in any way as it is enabled
>> >> only for skb mode which linearizes the skb from what i see.
>> >>
>> >> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
>> >> mode on your side?
>> >
>> > Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
>> > AF_XDP in zero-copy mode. This patchset is a step in this direction.
>> > At the very least, it lets you test out the feature in SKB mode
>> > pending future driver support. Currently, XDP multi-buffer does not
>> > support AF_XDP at all. It could support it in theory, but I think it
>> > would need some UAPI design work and a bit of implementation work.
>> >
>> > Also, I think that the approach taken in this patchset has some
>> > advantages over XDP multi-buffer:
>> > (1) It should be possible to achieve higher performance
>> > (a) because the packet data is kept together
>> > (b) because you need to acquire and validate less descriptors
>> > and touch the queue pointers less often.
>> > (2) It is a nicer user-space API.
>> > (a) Since the packet data is all available in one linear
>> > buffer. This may even be a requirement to avoid an extra copy if the
>> > data must be handed off contiguously to other code.
>> >
>> > The disadvantage of this patchset is requiring the user to allocate
>> > HugeTLB pages which is an extra complication.
>> >
>> > I am not sure if this patchset would need to interact with XDP
>> > multi-buffer at all directly. Does anyone have anything to add here?
>>
>> Well, I'm mostly concerned with having two different operation and
>> configuration modes for the same thing. We'll probably need to support
>> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
>> to create a UAPI for that in any case. And having two APIs is just going
>> to be more complexity to handle at both the documentation and
>> maintenance level.
>
> One does not replace the other. We need them both, unfortunately.
> Multi-buff is great for e.g., stitching together different headers
> with the same data. Point to different buffers for the header in each
> packet but the same piece of data in all of them. This will never be
> solved with Kal's approach. We just need multi-buffer support for
> this. BTW, we are close to posting multi-buff support for AF_XDP. Just
> hang in there a little while longer while the last glitches are fixed.
> We have to stage it in two patch sets as it will be too long
> otherwise. First one will only contain improvements to the xsk
> selftests framework so that multi-buffer tests can be supported. The
> second one will be the core code and the actual multi-buffer tests.

Alright, sounds good!

> As for what Kal's patches are good for, please see below.
>
>> It *might* be worth it to do this if the performance benefit is really
>> compelling, but, well, you'd need to implement both and compare directly
>> to know that for sure :)
>
> The performance benefit is compelling. As I wrote in a mail to a post
> by Kal, there are users out there that state that this feature (for
> zero-copy mode nota bene) is a must for them to be able to use AF_XDP
> instead of DPDK style user-mode drivers. They have really tough
> latency requirements.

Hmm, okay, looking forward to seeing the benchmark results, then! :)

-Toke

2023-04-13 10:55:22

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

>
> Well, I'm mostly concerned with having two different operation and
> configuration modes for the same thing. We'll probably need to support
> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
> to create a UAPI for that in any case. And having two APIs is just going
> to be more complexity to handle at both the documentation and
> maintenance level.

I don't know if I would call this another "API". This patchset doesn't
change the semantics of anything. It only lifts the chunk size
restriction when hugepages are used. Furthermore, the changes here are
quite small and easy to understand. The four sentences added to the
documentation shouldn't be too concerning either. :-)

In 30 years when everyone finally migrates to page sizes >= 64K the
maintenance burden will drop to zero. Knock wood. :-)

>
> It *might* be worth it to do this if the performance benefit is really
> compelling, but, well, you'd need to implement both and compare directly
> to know that for sure :)

What about use-cases that require incoming packet data to be
contiguous? Without larger chunk sizes, the user is forced to allocate
extra space per packet and copy the data. This defeats the purpose of
ZC.

2023-04-13 11:11:59

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>>
>> Well, I'm mostly concerned with having two different operation and
>> configuration modes for the same thing. We'll probably need to support
>> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
>> to create a UAPI for that in any case. And having two APIs is just going
>> to be more complexity to handle at both the documentation and
>> maintenance level.
>
> I don't know if I would call this another "API". This patchset doesn't
> change the semantics of anything. It only lifts the chunk size
> restriction when hugepages are used. Furthermore, the changes here are
> quite small and easy to understand. The four sentences added to the
> documentation shouldn't be too concerning either. :-)

Well, you mentioned yourself that:

> The disadvantage of this patchset is requiring the user to allocate
> HugeTLB pages which is an extra complication.

In addition, presumably when using this mode, the other XDP actions
(XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
add special handling for that in the kernel? We'll definitely need to
handle that somehow...

> In 30 years when everyone finally migrates to page sizes >= 64K the
> maintenance burden will drop to zero. Knock wood. :-)

Haha, right, but let's make sure we have something that is consistent in
the intervening decades, shall we? ;)

>> It *might* be worth it to do this if the performance benefit is really
>> compelling, but, well, you'd need to implement both and compare directly
>> to know that for sure :)
>
> What about use-cases that require incoming packet data to be
> contiguous? Without larger chunk sizes, the user is forced to allocate
> extra space per packet and copy the data. This defeats the purpose of
> ZC.

What use cases would that be, exactly? Presumably this is also a
performance issue? Which goes back to me asking for benchmarks :)

-Toke

2023-04-13 12:40:23

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> Well, you mentioned yourself that:
>
> > The disadvantage of this patchset is requiring the user to allocate
> > HugeTLB pages which is an extra complication.

It's a small extra complication *for the user*. However, users that
need this feature are willing to allocate hugepages. We are one such
user. For us, having to deal with packets split into disjoint buffers
(from the XDP multi-buffer paradigm) is a significantly more annoying
complication than allocating hugepages (particularly on the RX side).

2023-04-13 20:51:58

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> Well, you mentioned yourself that:
>>
>> > The disadvantage of this patchset is requiring the user to allocate
>> > HugeTLB pages which is an extra complication.
>
> It's a small extra complication *for the user*. However, users that
> need this feature are willing to allocate hugepages. We are one such
> user. For us, having to deal with packets split into disjoint buffers
> (from the XDP multi-buffer paradigm) is a significantly more annoying
> complication than allocating hugepages (particularly on the RX side).

"More annoying" is not a great argument, though. You're basically saying
"please complicate your code so I don't have to complicate mine". And
since kernel API is essentially frozen forever, adding more of them
carries a pretty high cost, which is why kernel developers tend not to
be easily swayed by convenience arguments (if all you want is a more
convenient API, just build one on top of the kernel primitives and wrap
it into a library).

So you'll need to come up with either (1) a use case that you *can't*
solve without this new API (with specifics as to why that is the case),
or (2) a compelling performance benchmark showing the complexity is
worth it. Magnus indicated he would be able to produce the latter, in
which case I'm happy to be persuaded by the numbers.

In any case, however, the behaviour needs to be consistent wrt the rest
of XDP, so it's not as simple as just increasing the limit (as I
mentioned in my previous email).

-Toke

2023-04-13 22:14:54

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> "More annoying" is not a great argument, though. You're basically saying
> "please complicate your code so I don't have to complicate mine". And
> since kernel API is essentially frozen forever, adding more of them
> carries a pretty high cost, which is why kernel developers tend not to
> be easily swayed by convenience arguments (if all you want is a more
> convenient API, just build one on top of the kernel primitives and wrap
> it into a library).

I was trying to make a fair comparison from the user's perspective
between having to allocate huge pages and deal with discontiguous
buffers. That was all.

I think the "your code" distinction is a bit harsh. The kernel is a
community project. Why isn't it "our" code? I am trying to add a
feature that I think is generally useful to people. The kernel only
exists to serve its users. I believe I am doing more good than harm
sending these patches.

2023-04-13 22:31:15

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> "More annoying" is not a great argument, though. You're basically saying
>> "please complicate your code so I don't have to complicate mine". And
>> since kernel API is essentially frozen forever, adding more of them
>> carries a pretty high cost, which is why kernel developers tend not to
>> be easily swayed by convenience arguments (if all you want is a more
>> convenient API, just build one on top of the kernel primitives and wrap
>> it into a library).
>
> I was trying to make a fair comparison from the user's perspective
> between having to allocate huge pages and deal with discontiguous
> buffers. That was all.
>
> I think the "your code" distinction is a bit harsh. The kernel is a
> community project. Why isn't it "our" code? I am trying to add a
> feature that I think is generally useful to people. The kernel only
> exists to serve its users.

Oh, I'm sorry if that came across as harsh, that was not my intention! I
was certainly not trying to make a "you vs us" distinction; I was just
trying to explain why making changes on the kernel side carries a higher
cost than an equivalent (or even slightly more complex) change on the
userspace side, because of the UAPI consideration.

> I believe I am doing more good than harm sending these patches.

I don't think so! You've certainly sparked a discussion, that is good :)

-Toke

2023-04-14 09:22:11

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> >> "More annoying" is not a great argument, though. You're basically saying
> >> "please complicate your code so I don't have to complicate mine". And
> >> since kernel API is essentially frozen forever, adding more of them
> >> carries a pretty high cost, which is why kernel developers tend not to
> >> be easily swayed by convenience arguments (if all you want is a more
> >> convenient API, just build one on top of the kernel primitives and wrap
> >> it into a library).
> >
> > I was trying to make a fair comparison from the user's perspective
> > between having to allocate huge pages and deal with discontiguous
> > buffers. That was all.
> >
> > I think the "your code" distinction is a bit harsh. The kernel is a
> > community project. Why isn't it "our" code? I am trying to add a
> > feature that I think is generally useful to people. The kernel only
> > exists to serve its users.
>
> Oh, I'm sorry if that came across as harsh, that was not my intention! I
> was certainly not trying to make a "you vs us" distinction; I was just
> trying to explain why making changes on the kernel side carries a higher
> cost than an equivalent (or even slightly more complex) change on the
> userspace side, because of the UAPI consideration.

No problem! I agree. I am just somewhat confused by the "slightly more
complex" view of the situation. Currently, packets > 4K are not
supported _at all_ with AF_XDP. So this patchset adds something that
is not possible _at all_ today. We have been patiently waiting since
2018 for AF_XDP jumbo packet support. It seems that interest in this
feature from maintainers has been... lacking. :-) Now, if I understand
the situation correctly, you are asking for benchmarks to compare this
implementation with AF_XDP multi-buffer which doesn't exist yet? I am
glad it's being worked on but until there is a patchset, AF_XDP
multi-buffer is still VAPORWARE. :-)

Now in our case, we are primarily interested in throughput and
reducing total system load (so we have more compute budget for other
tasks). The faster we can receive packets the better. The packets we
need to receive are (almost) all 8000-9000 bytes... Using AF_XDP
multi-buffer would mean either (1) allocating extra space per packet
and copying the data to linearize it or (2) rewriting a significant
amount of code to handle segmented packets efficiently. If you want
benchmarks for (1) just use xdpsock and compare -z with -c. I think
that is a good approximation...

Hopefully, the HFT crowd can save this patchset in the end. :-)

-Kal

2023-04-14 16:33:00

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> In addition, presumably when using this mode, the other XDP actions
> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> add special handling for that in the kernel? We'll definitely need to
> handle that somehow...

I am not familiar with all the details here. Do you know a reason why
these cases would stop working / why special handling would be needed?
For example, if I have a UMEM that uses hugepages and XDP_PASS is
returned, then the data is just copied into an SKB right? SKBs can
also be created directly from hugepages AFAIK. So I don't understand
what the issue would be. Can someone explain this concern?

2023-04-17 12:15:51

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Thu, 13 Apr 2023 at 22:52, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Kal Cutter Conley <[email protected]> writes:
>
> >> Well, you mentioned yourself that:
> >>
> >> > The disadvantage of this patchset is requiring the user to allocate
> >> > HugeTLB pages which is an extra complication.
> >
> > It's a small extra complication *for the user*. However, users that
> > need this feature are willing to allocate hugepages. We are one such
> > user. For us, having to deal with packets split into disjoint buffers
> > (from the XDP multi-buffer paradigm) is a significantly more annoying
> > complication than allocating hugepages (particularly on the RX side).
>
> "More annoying" is not a great argument, though. You're basically saying
> "please complicate your code so I don't have to complicate mine". And
> since kernel API is essentially frozen forever, adding more of them
> carries a pretty high cost, which is why kernel developers tend not to
> be easily swayed by convenience arguments (if all you want is a more
> convenient API, just build one on top of the kernel primitives and wrap
> it into a library).
>
> So you'll need to come up with either (1) a use case that you *can't*
> solve without this new API (with specifics as to why that is the case),
> or (2) a compelling performance benchmark showing the complexity is
> worth it. Magnus indicated he would be able to produce the latter, in
> which case I'm happy to be persuaded by the numbers.

We will measure it and get back to you. Would be good with some numbers.

> In any case, however, the behaviour needs to be consistent wrt the rest
> of XDP, so it's not as simple as just increasing the limit (as I
> mentioned in my previous email).
>
> -Toke
>

2023-04-17 12:43:29

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Magnus Karlsson <[email protected]> writes:

> On Thu, 13 Apr 2023 at 22:52, Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Kal Cutter Conley <[email protected]> writes:
>>
>> >> Well, you mentioned yourself that:
>> >>
>> >> > The disadvantage of this patchset is requiring the user to allocate
>> >> > HugeTLB pages which is an extra complication.
>> >
>> > It's a small extra complication *for the user*. However, users that
>> > need this feature are willing to allocate hugepages. We are one such
>> > user. For us, having to deal with packets split into disjoint buffers
>> > (from the XDP multi-buffer paradigm) is a significantly more annoying
>> > complication than allocating hugepages (particularly on the RX side).
>>
>> "More annoying" is not a great argument, though. You're basically saying
>> "please complicate your code so I don't have to complicate mine". And
>> since kernel API is essentially frozen forever, adding more of them
>> carries a pretty high cost, which is why kernel developers tend not to
>> be easily swayed by convenience arguments (if all you want is a more
>> convenient API, just build one on top of the kernel primitives and wrap
>> it into a library).
>>
>> So you'll need to come up with either (1) a use case that you *can't*
>> solve without this new API (with specifics as to why that is the case),
>> or (2) a compelling performance benchmark showing the complexity is
>> worth it. Magnus indicated he would be able to produce the latter, in
>> which case I'm happy to be persuaded by the numbers.
>
> We will measure it and get back to you. Would be good with some
> numbers.

Sounds good, thanks! :)

-Toke

2023-04-17 14:00:13

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> > We will measure it and get back to you. Would be good with some
> > numbers.
>
> Sounds good, thanks! :)
>
> -Toke
>

+1. Thanks a lot for doing this! :-)

2023-04-18 10:21:37

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> In addition, presumably when using this mode, the other XDP actions
>> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
>> add special handling for that in the kernel? We'll definitely need to
>> handle that somehow...
>
> I am not familiar with all the details here. Do you know a reason why
> these cases would stop working / why special handling would be needed?
> For example, if I have a UMEM that uses hugepages and XDP_PASS is
> returned, then the data is just copied into an SKB right? SKBs can
> also be created directly from hugepages AFAIK. So I don't understand
> what the issue would be. Can someone explain this concern?

Well, I was asking :) It may well be that the SKB path just works; did
you test this? Pretty sure XDP_REDIRECT to another device won't, though?

-Toke

2023-04-18 11:10:35

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> >> In addition, presumably when using this mode, the other XDP actions
> >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> >> add special handling for that in the kernel? We'll definitely need to
> >> handle that somehow...
> >
> > I am not familiar with all the details here. Do you know a reason why
> > these cases would stop working / why special handling would be needed?
> > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > returned, then the data is just copied into an SKB right? SKBs can
> > also be created directly from hugepages AFAIK. So I don't understand
> > what the issue would be. Can someone explain this concern?
>
> Well, I was asking :) It may well be that the SKB path just works; did
> you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>

I was also asking :-)

I tested that the SKB path is usable today with this patch.
Specifically, sending and receiving large jumbo packets with AF_XDP
and that a non-multi-buffer XDP program could access the whole packet.
I have not specifically tested XDP_REDIRECT to another device or
anything with ZC since that is not possible without driver support.

My feeling is, there wouldn't be non-trivial issues here since this
patchset changes nothing except allowing the maximum chunk size to be
larger. The driver either supports larger MTUs with XDP enabled or it
doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
something supported or not supported irrespective of this patchset.

2023-04-21 09:44:02

by Maciej Fijalkowski

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:

Hi there,

> > >> In addition, presumably when using this mode, the other XDP actions
> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> > >> add special handling for that in the kernel? We'll definitely need to
> > >> handle that somehow...
> > >
> > > I am not familiar with all the details here. Do you know a reason why
> > > these cases would stop working / why special handling would be needed?
> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > > returned, then the data is just copied into an SKB right? SKBs can
> > > also be created directly from hugepages AFAIK. So I don't understand
> > > what the issue would be. Can someone explain this concern?
> >
> > Well, I was asking :) It may well be that the SKB path just works; did
> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?

for XDP_PASS we have to allocate a new buffer and copy the contents from
current xdp_buff that was backed by xsk_buff_pool and give the current one
back to pool. I am not sure if __napi_alloc_skb() is always capable of
handling len > PAGE_SIZE - i believe there might a particular combination
of settings that allows it, but if not we should have a fallback path that
would iterate over data and copy this to a certain (linear + frags) parts.
This implies non-zero effort that is needed for jumbo frames ZC support.

I can certainly test this out and play with it - maybe this just works, I
didn't check yet. Even if it does, then we need some kind of temporary
mechanism that will forbid loading ZC jumbo frames due to what Toke
brought up.

> >
>
> I was also asking :-)
>
> I tested that the SKB path is usable today with this patch.
> Specifically, sending and receiving large jumbo packets with AF_XDP
> and that a non-multi-buffer XDP program could access the whole packet.
> I have not specifically tested XDP_REDIRECT to another device or
> anything with ZC since that is not possible without driver support.
>
> My feeling is, there wouldn't be non-trivial issues here since this
> patchset changes nothing except allowing the maximum chunk size to be
> larger. The driver either supports larger MTUs with XDP enabled or it
> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> something supported or not supported irrespective of this patchset.

Here is the comparison between multi-buffer and jumbo frames that I did
for ZC ice driver. Configured MTU was 8192 as this is the frame size for
aligned mode when working with huge pages. I am presenting plain numbers
over here from xdpsock.

Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
885,705pps - rxdrop frame_size=4096
806,307pps - l2fwd frame_size=4096
877,989pps - rxdrop frame_size=2048
773,331pps - l2fwd frame_size=2048

Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
893,530pps - rxdrop frame_size=8192
841,860pps - l2fwd frame_size=8192

Kal might say that multi-buffer numbers are imaginary as these patches
were never shown to the public ;) but now that we have extensive test
suite I am fixing some last issues that stand out, so we are asking for
some more patience over here... overall i was expecting that they will be
much worse when compared to jumbo frames, but then again i believe this
implementation is not ideal and can be improved. Nevertheless, jumbo
frames support has its value.

2023-04-21 10:02:23

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Maciej Fijalkowski <[email protected]> writes:

> On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
>
> Hi there,
>
>> > >> In addition, presumably when using this mode, the other XDP actions
>> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
>> > >> add special handling for that in the kernel? We'll definitely need to
>> > >> handle that somehow...
>> > >
>> > > I am not familiar with all the details here. Do you know a reason why
>> > > these cases would stop working / why special handling would be needed?
>> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
>> > > returned, then the data is just copied into an SKB right? SKBs can
>> > > also be created directly from hugepages AFAIK. So I don't understand
>> > > what the issue would be. Can someone explain this concern?
>> >
>> > Well, I was asking :) It may well be that the SKB path just works; did
>> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>
> for XDP_PASS we have to allocate a new buffer and copy the contents from
> current xdp_buff that was backed by xsk_buff_pool and give the current one
> back to pool. I am not sure if __napi_alloc_skb() is always capable of
> handling len > PAGE_SIZE - i believe there might a particular combination
> of settings that allows it, but if not we should have a fallback path that
> would iterate over data and copy this to a certain (linear + frags) parts.
> This implies non-zero effort that is needed for jumbo frames ZC support.
>
> I can certainly test this out and play with it - maybe this just works, I
> didn't check yet. Even if it does, then we need some kind of temporary
> mechanism that will forbid loading ZC jumbo frames due to what Toke
> brought up.

Yeah, this was exactly the kind of thing I was worried about (same for
XDP_REDIRECT). Thanks for fleshing it out a bit :)

>> >
>>
>> I was also asking :-)
>>
>> I tested that the SKB path is usable today with this patch.
>> Specifically, sending and receiving large jumbo packets with AF_XDP
>> and that a non-multi-buffer XDP program could access the whole packet.
>> I have not specifically tested XDP_REDIRECT to another device or
>> anything with ZC since that is not possible without driver support.
>>
>> My feeling is, there wouldn't be non-trivial issues here since this
>> patchset changes nothing except allowing the maximum chunk size to be
>> larger. The driver either supports larger MTUs with XDP enabled or it
>> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
>> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
>> something supported or not supported irrespective of this patchset.
>
> Here is the comparison between multi-buffer and jumbo frames that I did
> for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> aligned mode when working with huge pages. I am presenting plain numbers
> over here from xdpsock.
>
> Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> 885,705pps - rxdrop frame_size=4096
> 806,307pps - l2fwd frame_size=4096
> 877,989pps - rxdrop frame_size=2048
> 773,331pps - l2fwd frame_size=2048
>
> Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> 893,530pps - rxdrop frame_size=8192
> 841,860pps - l2fwd frame_size=8192
>
> Kal might say that multi-buffer numbers are imaginary as these patches
> were never shown to the public ;) but now that we have extensive test
> suite I am fixing some last issues that stand out, so we are asking for
> some more patience over here... overall i was expecting that they will be
> much worse when compared to jumbo frames, but then again i believe this
> implementation is not ideal and can be improved. Nevertheless, jumbo
> frames support has its value.

Thank you for doing these! Okay, so that's between 1-4% improvement (vs
the 4k frags). I dunno, I wouldn't consider that a slam dunk; would
depend on the additional complexity if it is worth it to do both, IMO...

-Toke

2023-04-21 12:20:26

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Fri, 21 Apr 2023 at 11:44, Maciej Fijalkowski
<[email protected]> wrote:
>
> On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
>
> Hi there,
>
> > > >> In addition, presumably when using this mode, the other XDP actions
> > > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> > > >> add special handling for that in the kernel? We'll definitely need to
> > > >> handle that somehow...
> > > >
> > > > I am not familiar with all the details here. Do you know a reason why
> > > > these cases would stop working / why special handling would be needed?
> > > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > > > returned, then the data is just copied into an SKB right? SKBs can
> > > > also be created directly from hugepages AFAIK. So I don't understand
> > > > what the issue would be. Can someone explain this concern?
> > >
> > > Well, I was asking :) It may well be that the SKB path just works; did
> > > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>
> for XDP_PASS we have to allocate a new buffer and copy the contents from
> current xdp_buff that was backed by xsk_buff_pool and give the current one
> back to pool. I am not sure if __napi_alloc_skb() is always capable of
> handling len > PAGE_SIZE - i believe there might a particular combination
> of settings that allows it, but if not we should have a fallback path that
> would iterate over data and copy this to a certain (linear + frags) parts.
> This implies non-zero effort that is needed for jumbo frames ZC support.

Thinking aloud, could not our multi-buffer work help with this? Sounds
quite similar to operations that we have to do in that patch set. And
if so, would it not be prudent to get the multi-buffer support in
there first, then implement these things on top of that? What do you
think?

> I can certainly test this out and play with it - maybe this just works, I
> didn't check yet. Even if it does, then we need some kind of temporary
> mechanism that will forbid loading ZC jumbo frames due to what Toke
> brought up.
>
> > >
> >
> > I was also asking :-)
> >
> > I tested that the SKB path is usable today with this patch.
> > Specifically, sending and receiving large jumbo packets with AF_XDP
> > and that a non-multi-buffer XDP program could access the whole packet.
> > I have not specifically tested XDP_REDIRECT to another device or
> > anything with ZC since that is not possible without driver support.
> >
> > My feeling is, there wouldn't be non-trivial issues here since this
> > patchset changes nothing except allowing the maximum chunk size to be
> > larger. The driver either supports larger MTUs with XDP enabled or it
> > doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> > size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> > something supported or not supported irrespective of this patchset.
>
> Here is the comparison between multi-buffer and jumbo frames that I did
> for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> aligned mode when working with huge pages. I am presenting plain numbers
> over here from xdpsock.
>
> Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> 885,705pps - rxdrop frame_size=4096
> 806,307pps - l2fwd frame_size=4096
> 877,989pps - rxdrop frame_size=2048
> 773,331pps - l2fwd frame_size=2048
>
> Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> 893,530pps - rxdrop frame_size=8192
> 841,860pps - l2fwd frame_size=8192
>
> Kal might say that multi-buffer numbers are imaginary as these patches
> were never shown to the public ;) but now that we have extensive test
> suite I am fixing some last issues that stand out, so we are asking for
> some more patience over here... overall i was expecting that they will be
> much worse when compared to jumbo frames, but then again i believe this
> implementation is not ideal and can be improved. Nevertheless, jumbo
> frames support has its value.

2023-04-21 12:27:52

[permalink] [raw]

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Fri, 21 Apr 2023 at 12:01, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Maciej Fijalkowski <[email protected]> writes:
>
> > On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
> >
> > Hi there,
> >
> >> > >> In addition, presumably when using this mode, the other XDP actions
> >> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> >> > >> add special handling for that in the kernel? We'll definitely need to
> >> > >> handle that somehow...
> >> > >
> >> > > I am not familiar with all the details here. Do you know a reason why
> >> > > these cases would stop working / why special handling would be needed?
> >> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> >> > > returned, then the data is just copied into an SKB right? SKBs can
> >> > > also be created directly from hugepages AFAIK. So I don't understand
> >> > > what the issue would be. Can someone explain this concern?
> >> >
> >> > Well, I was asking :) It may well be that the SKB path just works; did
> >> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
> >
> > for XDP_PASS we have to allocate a new buffer and copy the contents from
> > current xdp_buff that was backed by xsk_buff_pool and give the current one
> > back to pool. I am not sure if __napi_alloc_skb() is always capable of
> > handling len > PAGE_SIZE - i believe there might a particular combination
> > of settings that allows it, but if not we should have a fallback path that
> > would iterate over data and copy this to a certain (linear + frags) parts.
> > This implies non-zero effort that is needed for jumbo frames ZC support.
> >
> > I can certainly test this out and play with it - maybe this just works, I
> > didn't check yet. Even if it does, then we need some kind of temporary
> > mechanism that will forbid loading ZC jumbo frames due to what Toke
> > brought up.
>
> Yeah, this was exactly the kind of thing I was worried about (same for
> XDP_REDIRECT). Thanks for fleshing it out a bit :)
>
> >> >
> >>
> >> I was also asking :-)
> >>
> >> I tested that the SKB path is usable today with this patch.
> >> Specifically, sending and receiving large jumbo packets with AF_XDP
> >> and that a non-multi-buffer XDP program could access the whole packet.
> >> I have not specifically tested XDP_REDIRECT to another device or
> >> anything with ZC since that is not possible without driver support.
> >>
> >> My feeling is, there wouldn't be non-trivial issues here since this
> >> patchset changes nothing except allowing the maximum chunk size to be
> >> larger. The driver either supports larger MTUs with XDP enabled or it
> >> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> >> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> >> something supported or not supported irrespective of this patchset.
> >
> > Here is the comparison between multi-buffer and jumbo frames that I did
> > for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> > aligned mode when working with huge pages. I am presenting plain numbers
> > over here from xdpsock.
> >
> > Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> > 885,705pps - rxdrop frame_size=4096
> > 806,307pps - l2fwd frame_size=4096
> > 877,989pps - rxdrop frame_size=2048
> > 773,331pps - l2fwd frame_size=2048
> >
> > Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> > 893,530pps - rxdrop frame_size=8192
> > 841,860pps - l2fwd frame_size=8192
> >
> > Kal might say that multi-buffer numbers are imaginary as these patches
> > were never shown to the public ;) but now that we have extensive test
> > suite I am fixing some last issues that stand out, so we are asking for
> > some more patience over here... overall i was expecting that they will be
> > much worse when compared to jumbo frames, but then again i believe this
> > implementation is not ideal and can be improved. Nevertheless, jumbo
> > frames support has its value.
>
> Thank you for doing these! Okay, so that's between 1-4% improvement (vs
> the 4k frags). I dunno, I wouldn't consider that a slam dunk; would
> depend on the additional complexity if it is worth it to do both, IMO...

If we are using 4K frags, the worst case is that we have to use 3
frags to cover a 9K packet. Interpolating between the results above
would mean somewhere in the 1 - 6% range of improvement with jumbo
frames. Something that is not covered by these tests at all is the
overhead of an abstraction layer for dealing with multiple buffers. I
believe many applications would choose to have one to hide the fact
that there are multiple buffers. So I think these improvement numbers
are on the lower side.

But agree that we should factor in the complexity of covering the
cases you have brought up and see if it is worth it.

> -Toke
>

2023-04-21 15:29:36