2023-04-06 13:04:46

by Kal Cutter Conley

Subject: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
enables sending/receiving jumbo ethernet frames up to the theoretical
maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
SKB mode is usable pending future driver work.

For consistency, check for HugeTLB pages during UMEM registration. This
implies that hugepages are required for XDP_COPY mode despite DMA not
being used. This restriction is desirable since it ensures user software
can take advantage of future driver support.

Even in HugeTLB mode, continue to do page accounting using order-0
(4 KiB) pages. This minimizes the size of this change and reduces the
risk of impacting other code. Taking full advantage of hugepages for
accounting should improve XDP performance in the general case.

No significant change in RX/TX performance was observed with this patch.
A few data points are reproduced below:

Machine : Dell PowerEdge R940
CPU : Intel(R) Xeon(R) Platinum 8168 CPU @ 2.70GHz
NIC : MT27700 Family [ConnectX-4]

+-----+------+------+-------+--------+--------+--------+
|     |      |      | chunk | packet | rxdrop | rxdrop |
|     | mode | mtu  | size  | size   | (Mpps) | (Gbps) |
+-----+------+------+-------+--------+--------+--------+
| old | -z   | 3498 | 4000  | 320    | 16.0   | 40.9   |
| new | -z   | 3498 | 4000  | 320    | 16.0   | 40.9   |
+-----+------+------+-------+--------+--------+--------+
| old | -z   | 3498 | 4096  | 320    | 16.4   | 42.1   |
| new | -z   | 3498 | 4096  | 320    | 16.5   | 42.2   |
+-----+------+------+-------+--------+--------+--------+
| new | -c   | 3498 | 10240 | 320    | 6.1    | 15.6   |
+-----+------+------+-------+--------+--------+--------+
| new | -S   | 9000 | 10240 | 9000   | 0.37   | 26.9   |
+-----+------+------+-------+--------+--------+--------+

Signed-off-by: Kal Conley <[email protected]>
---
Documentation/networking/af_xdp.rst | 36 ++++++++++++--------
include/net/xdp_sock.h              |  1 +
include/net/xdp_sock_drv.h          | 12 +++++++
include/net/xsk_buff_pool.h         |  3 +-
net/xdp/xdp_umem.c                  | 51 ++++++++++++++++++++++++-----
net/xdp/xsk_buff_pool.c             | 28 +++++++++++-----
6 files changed, 99 insertions(+), 32 deletions(-)

diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst
index 247c6c4127e9..ea65cd882af6 100644
--- a/Documentation/networking/af_xdp.rst
+++ b/Documentation/networking/af_xdp.rst
@@ -105,12 +105,13 @@ with AF_XDP". It can be found at https://lwn.net/Articles/750845/.
UMEM
----

-UMEM is a region of virtual contiguous memory, divided into
-equal-sized frames. An UMEM is associated to a netdev and a specific
-queue id of that netdev. It is created and configured (chunk size,
-headroom, start address and size) by using the XDP_UMEM_REG setsockopt
-system call. A UMEM is bound to a netdev and queue id, via the bind()
-system call.
+UMEM is a region of virtual contiguous memory divided into equal-sized
+frames. This is the area that contains all the buffers that packets can
+reside in. A UMEM is associated with a netdev and a specific queue id of
+that netdev. It is created and configured (start address, size,
+chunk size, and headroom) by using the XDP_UMEM_REG setsockopt system
+call. A UMEM is bound to a netdev and queue id via the bind() system
+call.

An AF_XDP is socket linked to a single UMEM, but one UMEM can have
multiple AF_XDP sockets. To share an UMEM created via one socket A,
@@ -418,14 +419,21 @@ negatively impact performance.
XDP_UMEM_REG setsockopt
-----------------------

-This setsockopt registers a UMEM to a socket. This is the area that
-contain all the buffers that packet can reside in. The call takes a
-pointer to the beginning of this area and the size of it. Moreover, it
-also has parameter called chunk_size that is the size that the UMEM is
-divided into. It can only be 2K or 4K at the moment. If you have an
-UMEM area that is 128K and a chunk size of 2K, this means that you
-will be able to hold a maximum of 128K / 2K = 64 packets in your UMEM
-area and that your largest packet size can be 2K.
+This setsockopt registers a UMEM to a socket. The call takes a pointer
+to the beginning of this area and the size of it. Moreover, there is a
+parameter called chunk_size that is the size that the UMEM is divided
+into. The chunk size limits the maximum packet size that can be sent or
+received. For example, if you have a UMEM area that is 128K and a chunk
+size of 2K, then you will be able to hold a maximum of 128K / 2K = 64
+packets in your UMEM. In this case, the maximum packet size will be 2K.
+
+Valid chunk sizes range from 2K to 64K. However, in aligned mode, the
+chunk size must also be a power of two. Additionally, the chunk size
+must not exceed the size of a page (usually 4K). This limitation is
+relaxed for UMEM areas allocated with HugeTLB pages, in which case
+chunk sizes up to 64K are allowed. Note, this only works with hugepages
+allocated from the kernel's persistent pool. Using Transparent Huge
+Pages (THP) has no effect on the maximum chunk size.

There is also an option to set the headroom of each single buffer in
the UMEM. If you set this to N bytes, it means that the packet will
diff --git a/include/net/xdp_sock.h b/include/net/xdp_sock.h
index e96a1151ec75..086fcf065656 100644
--- a/include/net/xdp_sock.h
+++ b/include/net/xdp_sock.h
@@ -30,6 +30,7 @@ struct xdp_umem {
u8 flags;
bool zc;
struct page **pgs;
+ u32 page_size;
int id;
struct list_head xsk_dma_list;
struct work_struct work;
diff --git a/include/net/xdp_sock_drv.h b/include/net/xdp_sock_drv.h
index 9c0d860609ba..83fba3060c9a 100644
--- a/include/net/xdp_sock_drv.h
+++ b/include/net/xdp_sock_drv.h
@@ -12,6 +12,18 @@
#define XDP_UMEM_MIN_CHUNK_SHIFT 11
#define XDP_UMEM_MIN_CHUNK_SIZE (1 << XDP_UMEM_MIN_CHUNK_SHIFT)

+static_assert(XDP_UMEM_MIN_CHUNK_SIZE <= PAGE_SIZE);
+
+/* Allow chunk sizes up to the maximum size of an ethernet frame (64 KiB).
+ * Larger chunks are not guaranteed to fit in a single SKB.
+ */
+#ifdef CONFIG_HUGETLB_PAGE
+#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, HPAGE_SHIFT)
+#else
+#define XDP_UMEM_MAX_CHUNK_SHIFT min(16, PAGE_SHIFT)
+#endif
+#define XDP_UMEM_MAX_CHUNK_SIZE (1 << XDP_UMEM_MAX_CHUNK_SHIFT)
+
#ifdef CONFIG_XDP_SOCKETS

void xsk_tx_completed(struct xsk_buff_pool *pool, u32 nb_entries);
diff --git a/include/net/xsk_buff_pool.h b/include/net/xsk_buff_pool.h
index d318c769b445..6560d053ab88 100644
--- a/include/net/xsk_buff_pool.h
+++ b/include/net/xsk_buff_pool.h
@@ -75,6 +75,7 @@ struct xsk_buff_pool {
u32 chunk_size;
u32 chunk_shift;
u32 frame_len;
+ u32 page_size;
u8 cached_need_wakeup;
bool uses_need_wakeup;
bool dma_need_sync;
@@ -175,7 +176,7 @@ static inline void xp_dma_sync_for_device(struct xsk_buff_pool *pool,
static inline bool xp_desc_crosses_non_contig_pg(struct xsk_buff_pool *pool,
u64 addr, u32 len)
{
- bool cross_pg = (addr & (PAGE_SIZE - 1)) + len > PAGE_SIZE;
+ bool cross_pg = (addr & (pool->page_size - 1)) + len > pool->page_size;

if (likely(!cross_pg))
return false;
diff --git a/net/xdp/xdp_umem.c b/net/xdp/xdp_umem.c
index 4681e8e8ad94..cc07067f7f31 100644
--- a/net/xdp/xdp_umem.c
+++ b/net/xdp/xdp_umem.c
@@ -10,6 +10,8 @@
#include <linux/uaccess.h>
#include <linux/slab.h>
#include <linux/bpf.h>
+#include <linux/hugetlb.h>
+#include <linux/hugetlb_inline.h>
#include <linux/mm.h>
#include <linux/netdevice.h>
#include <linux/rtnetlink.h>
@@ -91,6 +93,36 @@ void xdp_put_umem(struct xdp_umem *umem, bool defer_cleanup)
}
}

+/* NOTE: The mmap_lock must be held by the caller. */
+static void xdp_umem_init_page_size(struct xdp_umem *umem, unsigned long address)
+{
+#ifdef CONFIG_HUGETLB_PAGE
+ struct vm_area_struct *vma;
+ struct vma_iterator vmi;
+ unsigned long end;
+
+ if (!IS_ALIGNED(address, HPAGE_SIZE))
+ goto no_hugetlb;
+
+ vma_iter_init(&vmi, current->mm, address);
+ end = address + umem->size;
+
+ for_each_vma_range(vmi, vma, end) {
+ if (!is_vm_hugetlb_page(vma))
+ goto no_hugetlb;
+ /* Hugepage sizes smaller than the default are not supported. */
+ if (huge_page_size(hstate_vma(vma)) < HPAGE_SIZE)
+ goto no_hugetlb;
+ }
+
+ umem->page_size = HPAGE_SIZE;
+ return;
+no_hugetlb:
+#endif
+ umem->page_size = PAGE_SIZE;
+}
+
+
static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
{
unsigned int gup_flags = FOLL_WRITE;
@@ -102,8 +134,18 @@ static int xdp_umem_pin_pages(struct xdp_umem *umem, unsigned long address)
return -ENOMEM;

mmap_read_lock(current->mm);
+
+ xdp_umem_init_page_size(umem, address);
+
+ if (umem->chunk_size > umem->page_size) {
+ mmap_read_unlock(current->mm);
+ err = -EINVAL;
+ goto out_pgs;
+ }
+
npgs = pin_user_pages(address, umem->npgs,
gup_flags | FOLL_LONGTERM, &umem->pgs[0], NULL);
+
mmap_read_unlock(current->mm);

if (npgs != umem->npgs) {
@@ -156,15 +198,8 @@ static int xdp_umem_reg(struct xdp_umem *umem, struct xdp_umem_reg *mr)
unsigned int chunks, chunks_rem;
int err;

- if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > PAGE_SIZE) {
- /* Strictly speaking we could support this, if:
- * - huge pages, or*
- * - using an IOMMU, or
- * - making sure the memory area is consecutive
- * but for now, we simply say "computer says no".
- */
+ if (chunk_size < XDP_UMEM_MIN_CHUNK_SIZE || chunk_size > XDP_UMEM_MAX_CHUNK_SIZE)
return -EINVAL;
- }

if (mr->flags & ~XDP_UMEM_UNALIGNED_CHUNK_FLAG)
return -EINVAL;
diff --git a/net/xdp/xsk_buff_pool.c b/net/xdp/xsk_buff_pool.c
index b2df1e0f8153..f9e083fa5e6d 100644
--- a/net/xdp/xsk_buff_pool.c
+++ b/net/xdp/xsk_buff_pool.c
@@ -80,9 +80,10 @@ struct xsk_buff_pool *xp_create_and_assign_umem(struct xdp_sock *xs,
pool->headroom = umem->headroom;
pool->chunk_size = umem->chunk_size;
pool->chunk_shift = ffs(umem->chunk_size) - 1;
- pool->unaligned = unaligned;
pool->frame_len = umem->chunk_size - umem->headroom -
XDP_PACKET_HEADROOM;
+ pool->page_size = umem->page_size;
+ pool->unaligned = unaligned;
pool->umem = umem;
pool->addrs = umem->addrs;
INIT_LIST_HEAD(&pool->free_list);
@@ -369,16 +370,25 @@ void xp_dma_unmap(struct xsk_buff_pool *pool, unsigned long attrs)
}
EXPORT_SYMBOL(xp_dma_unmap);

-static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map)
+/* HugeTLB pools consider contiguity at hugepage granularity only. Hence, all
+ * order-0 pages within a hugepage have the same contiguity value.
+ */
+static void xp_check_dma_contiguity(struct xsk_dma_map *dma_map, u32 page_size)
{
- u32 i;
+ u32 stride = page_size >> PAGE_SHIFT; /* in order-0 pages */
+ u32 i, j;

- for (i = 0; i < dma_map->dma_pages_cnt - 1; i++) {
- if (dma_map->dma_pages[i] + PAGE_SIZE == dma_map->dma_pages[i + 1])
- dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
- else
- dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
+ for (i = 0; i + stride < dma_map->dma_pages_cnt;) {
+ if (dma_map->dma_pages[i] + page_size == dma_map->dma_pages[i + stride]) {
+ for (j = 0; j < stride; i++, j++)
+ dma_map->dma_pages[i] |= XSK_NEXT_PG_CONTIG_MASK;
+ } else {
+ for (j = 0; j < stride; i++, j++)
+ dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
+ }
}
+ for (; i < dma_map->dma_pages_cnt; i++)
+ dma_map->dma_pages[i] &= ~XSK_NEXT_PG_CONTIG_MASK;
}

static int xp_init_dma_info(struct xsk_buff_pool *pool, struct xsk_dma_map *dma_map)
@@ -441,7 +451,7 @@ int xp_dma_map(struct xsk_buff_pool *pool, struct device *dev,
}

if (pool->unaligned)
- xp_check_dma_contiguity(dma_map);
+ xp_check_dma_contiguity(dma_map, pool->page_size);

err = xp_init_dma_info(pool, dma_map);
if (err) {
--
2.39.2


2023-04-06 18:43:45

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Conley <[email protected]> writes:

> Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> enables sending/receiving jumbo ethernet frames up to the theoretical
> maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> SKB mode is usable pending future driver work.

Hmm, interesting. So how does this interact with XDP multibuf?

-Toke

2023-04-07 16:30:52

by Maciej Fijalkowski

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Thu, Apr 06, 2023 at 08:38:05PM +0200, Toke Høiland-Jørgensen wrote:
> Kal Conley <[email protected]> writes:
>
> > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > enables sending/receiving jumbo ethernet frames up to the theoretical
> > maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > SKB mode is usable pending future driver work.
>
> Hmm, interesting. So how does this interact with XDP multibuf?

To me it currently does not interact with mbuf in any way, as it is
enabled only for SKB mode, which linearizes the skb from what I see.

I'd like to hear more about Kal's use case - Kal, do you use AF_XDP in
SKB mode on your side?

>
> -Toke
>

2023-04-08 17:35:24

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> > > enables sending/receiving jumbo ethernet frames up to the theoretical
> > > maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> > > SKB mode is usable pending future driver work.
> >
> > Hmm, interesting. So how does this interact with XDP multibuf?
>
> To me it currently does not interact with mbuf in any way as it is enabled
> only for skb mode which linearizes the skb from what i see.
>
> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
> mode on your side?

Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
AF_XDP in zero-copy mode. This patchset is a step in this direction.
At the very least, it lets you test out the feature in SKB mode
pending future driver support. Currently, XDP multi-buffer does not
support AF_XDP at all. It could support it in theory, but I think it
would need some UAPI design work and a bit of implementation work.

Also, I think that the approach taken in this patchset has some
advantages over XDP multi-buffer:
(1) It should be possible to achieve higher performance
    (a) because the packet data is kept together
    (b) because you need to acquire and validate fewer descriptors
        and touch the queue pointers less often.
(2) It is a nicer user-space API
    (a) since the packet data is all available in one linear
        buffer. This may even be a requirement to avoid an extra copy
        if the data must be handed off contiguously to other code.

The disadvantage of this patchset is requiring the user to allocate
HugeTLB pages which is an extra complication.
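
For illustration, a minimal user-space sketch of what that looks like
(error handling omitted; the sizes and hugepage setup are just
placeholders): hugepage-backed memory from the kernel's persistent
pool plus XDP_UMEM_REG with a chunk_size larger than 4K, which is what
this patchset would start allowing:

#include <linux/if_xdp.h>
#include <sys/mman.h>
#include <sys/socket.h>

#define NUM_CHUNKS 4096
#define CHUNK_SIZE (16 * 1024) /* > 4K, so HugeTLB backing is required */
#define UMEM_SIZE  ((__u64)NUM_CHUNKS * CHUNK_SIZE)

static int setup_umem(void)
{
        int fd = socket(AF_XDP, SOCK_RAW, 0);

        /* Hugepage-backed, hugepage-aligned memory from the kernel's
         * persistent pool (preallocated, e.g., via vm.nr_hugepages).
         */
        void *bufs = mmap(NULL, UMEM_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                          -1, 0);

        struct xdp_umem_reg mr = {
                .addr = (__u64)(unsigned long)bufs,
                .len = UMEM_SIZE,
                .chunk_size = CHUNK_SIZE, /* rejected today, allowed by this patchset */
                .headroom = 0,
        };

        setsockopt(fd, SOL_XDP, XDP_UMEM_REG, &mr, sizeof(mr));
        return fd;
}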

I am not sure if this patchset would need to interact with XDP
multi-buffer at all directly. Does anyone have anything to add here?

What other intermediate steps are needed to get to ZC? I think drivers
would already be able to support this now?

Kal

2023-04-12 13:40:49

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
>> > > enables sending/receiving jumbo ethernet frames up to the theoretical
>> > > maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
>> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
>> > > SKB mode is usable pending future driver work.
>> >
>> > Hmm, interesting. So how does this interact with XDP multibuf?
>>
>> To me it currently does not interact with mbuf in any way as it is enabled
>> only for skb mode which linearizes the skb from what i see.
>>
>> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
>> mode on your side?
>
> Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
> AF_XDP in zero-copy mode. This patchset is a step in this direction.
> At the very least, it lets you test out the feature in SKB mode
> pending future driver support. Currently, XDP multi-buffer does not
> support AF_XDP at all. It could support it in theory, but I think it
> would need some UAPI design work and a bit of implementation work.
>
> Also, I think that the approach taken in this patchset has some
> advantages over XDP multi-buffer:
> (1) It should be possible to achieve higher performance
> (a) because the packet data is kept together
> (b) because you need to acquire and validate less descriptors
> and touch the queue pointers less often.
> (2) It is a nicer user-space API.
> (a) Since the packet data is all available in one linear
> buffer. This may even be a requirement to avoid an extra copy if the
> data must be handed off contiguously to other code.
>
> The disadvantage of this patchset is requiring the user to allocate
> HugeTLB pages which is an extra complication.
>
> I am not sure if this patchset would need to interact with XDP
> multi-buffer at all directly. Does anyone have anything to add here?

Well, I'm mostly concerned with having two different operation and
configuration modes for the same thing. We'll probably need to support
multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
to create a UAPI for that in any case. And having two APIs is just going
to be more complexity to handle at both the documentation and
maintenance level.

It *might* be worth it to do this if the performance benefit is really
compelling, but, well, you'd need to implement both and compare directly
to know that for sure :)

-Toke

2023-04-12 14:07:24

by Magnus Karlsson

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Wed, 12 Apr 2023 at 15:40, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Kal Cutter Conley <[email protected]> writes:
>
> >> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
> >> > > enables sending/receiving jumbo ethernet frames up to the theoretical
> >> > > maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
> >> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
> >> > > SKB mode is usable pending future driver work.
> >> >
> >> > Hmm, interesting. So how does this interact with XDP multibuf?
> >>
> >> To me it currently does not interact with mbuf in any way as it is enabled
> >> only for skb mode which linearizes the skb from what i see.
> >>
> >> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
> >> mode on your side?
> >
> > Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
> > AF_XDP in zero-copy mode. This patchset is a step in this direction.
> > At the very least, it lets you test out the feature in SKB mode
> > pending future driver support. Currently, XDP multi-buffer does not
> > support AF_XDP at all. It could support it in theory, but I think it
> > would need some UAPI design work and a bit of implementation work.
> >
> > Also, I think that the approach taken in this patchset has some
> > advantages over XDP multi-buffer:
> > (1) It should be possible to achieve higher performance
> > (a) because the packet data is kept together
> > (b) because you need to acquire and validate less descriptors
> > and touch the queue pointers less often.
> > (2) It is a nicer user-space API.
> > (a) Since the packet data is all available in one linear
> > buffer. This may even be a requirement to avoid an extra copy if the
> > data must be handed off contiguously to other code.
> >
> > The disadvantage of this patchset is requiring the user to allocate
> > HugeTLB pages which is an extra complication.
> >
> > I am not sure if this patchset would need to interact with XDP
> > multi-buffer at all directly. Does anyone have anything to add here?
>
> Well, I'm mostly concerned with having two different operation and
> configuration modes for the same thing. We'll probably need to support
> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
> to create a UAPI for that in any case. And having two APIs is just going
> to be more complexity to handle at both the documentation and
> maintenance level.

One does not replace the other. We need them both, unfortunately.
Multi-buff is great for, e.g., stitching together different headers
with the same data: point to a different buffer for the header of each
packet but the same piece of data in all of them (a rough sketch of
the idea follows below). This will never be solved with Kal's
approach; we just need multi-buffer support for this. BTW, we are
close to posting multi-buff support for AF_XDP. Just hang in there a
little while longer while the last glitches are fixed. We have to
stage it in two patch sets as it will be too long otherwise. The first
one will only contain improvements to the xsk selftests framework so
that multi-buffer tests can be supported. The second one will be the
core code and the actual multi-buffer tests. As for what Kal's patches
are good for, please see below.
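
Purely as a conceptual sketch of the header-stitching idea mentioned
above (the continuation flag name below is a placeholder, since the
real multi-buffer UAPI is part of the not-yet-posted series):

#include <linux/if_xdp.h>

/* Hypothetical "more descriptors follow for this packet" bit. */
#define DESC_CONTINUED (1U << 0)

static void tx_hdr_plus_shared_payload(struct xdp_desc *tx, __u32 *idx,
                                       __u64 hdr_addr, __u32 hdr_len,
                                       __u64 data_addr, __u32 data_len)
{
        /* First frag: this packet's own header chunk, more to come. */
        tx[(*idx)++] = (struct xdp_desc){
                .addr = hdr_addr,
                .len = hdr_len,
                .options = DESC_CONTINUED,
        };
        /* Last frag: the payload chunk shared by many packets. */
        tx[(*idx)++] = (struct xdp_desc){
                .addr = data_addr,
                .len = data_len,
                .options = 0,
        };
}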

> It *might* be worth it to do this if the performance benefit is really
> compelling, but, well, you'd need to implement both and compare directly
> to know that for sure :)

The performance benefit is compelling. As I wrote in a reply to an
earlier post of Kal's, there are users out there who state that this
feature (for zero-copy mode, nota bene) is a must for them to be able
to use AF_XDP instead of DPDK-style user-mode drivers. They have
really tough latency requirements.



> -Toke
>

2023-04-12 23:07:03

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Magnus Karlsson <[email protected]> writes:

> On Wed, 12 Apr 2023 at 15:40, Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Kal Cutter Conley <[email protected]> writes:
>>
>> >> > > Add core AF_XDP support for chunk sizes larger than PAGE_SIZE. This
>> >> > > enables sending/receiving jumbo ethernet frames up to the theoretical
>> >> > > maximum of 64 KiB. For chunk sizes > PAGE_SIZE, the UMEM is required
>> >> > > to consist of HugeTLB VMAs (and be hugepage aligned). Initially, only
>> >> > > SKB mode is usable pending future driver work.
>> >> >
>> >> > Hmm, interesting. So how does this interact with XDP multibuf?
>> >>
>> >> To me it currently does not interact with mbuf in any way as it is enabled
>> >> only for skb mode which linearizes the skb from what i see.
>> >>
>> >> I'd like to hear more about Kal's use case - Kal do you use AF_XDP in SKB
>> >> mode on your side?
>> >
>> > Our use-case is to receive jumbo Ethernet frames up to 9000 bytes with
>> > AF_XDP in zero-copy mode. This patchset is a step in this direction.
>> > At the very least, it lets you test out the feature in SKB mode
>> > pending future driver support. Currently, XDP multi-buffer does not
>> > support AF_XDP at all. It could support it in theory, but I think it
>> > would need some UAPI design work and a bit of implementation work.
>> >
>> > Also, I think that the approach taken in this patchset has some
>> > advantages over XDP multi-buffer:
>> > (1) It should be possible to achieve higher performance
>> > (a) because the packet data is kept together
>> > (b) because you need to acquire and validate less descriptors
>> > and touch the queue pointers less often.
>> > (2) It is a nicer user-space API.
>> > (a) Since the packet data is all available in one linear
>> > buffer. This may even be a requirement to avoid an extra copy if the
>> > data must be handed off contiguously to other code.
>> >
>> > The disadvantage of this patchset is requiring the user to allocate
>> > HugeTLB pages which is an extra complication.
>> >
>> > I am not sure if this patchset would need to interact with XDP
>> > multi-buffer at all directly. Does anyone have anything to add here?
>>
>> Well, I'm mostly concerned with having two different operation and
>> configuration modes for the same thing. We'll probably need to support
>> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
>> to create a UAPI for that in any case. And having two APIs is just going
>> to be more complexity to handle at both the documentation and
>> maintenance level.
>
> One does not replace the other. We need them both, unfortunately.
> Multi-buff is great for e.g., stitching together different headers
> with the same data. Point to different buffers for the header in each
> packet but the same piece of data in all of them. This will never be
> solved with Kal's approach. We just need multi-buffer support for
> this. BTW, we are close to posting multi-buff support for AF_XDP. Just
> hang in there a little while longer while the last glitches are fixed.
> We have to stage it in two patch sets as it will be too long
> otherwise. First one will only contain improvements to the xsk
> selftests framework so that multi-buffer tests can be supported. The
> second one will be the core code and the actual multi-buffer tests.

Alright, sounds good!

> As for what Kal's patches are good for, please see below.
>
>> It *might* be worth it to do this if the performance benefit is really
>> compelling, but, well, you'd need to implement both and compare directly
>> to know that for sure :)
>
> The performance benefit is compelling. As I wrote in a mail to a post
> by Kal, there are users out there that state that this feature (for
> zero-copy mode nota bene) is a must for them to be able to use AF_XDP
> instead of DPDK style user-mode drivers. They have really tough
> latency requirements.

Hmm, okay, looking forward to seeing the benchmark results, then! :)

-Toke

2023-04-13 10:55:22

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

>
> Well, I'm mostly concerned with having two different operation and
> configuration modes for the same thing. We'll probably need to support
> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
> to create a UAPI for that in any case. And having two APIs is just going
> to be more complexity to handle at both the documentation and
> maintenance level.

I don't know if I would call this another "API". This patchset doesn't
change the semantics of anything. It only lifts the chunk size
restriction when hugepages are used. Furthermore, the changes here are
quite small and easy to understand. The four sentences added to the
documentation shouldn't be too concerning either. :-)

In 30 years when everyone finally migrates to page sizes >= 64K the
maintenance burden will drop to zero. Knock wood. :-)

>
> It *might* be worth it to do this if the performance benefit is really
> compelling, but, well, you'd need to implement both and compare directly
> to know that for sure :)

What about use-cases that require incoming packet data to be
contiguous? Without larger chunk sizes, the user is forced to allocate
extra space per packet and copy the data. This defeats the purpose of
ZC.
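
For instance, with 4K chunks the application ends up doing something
like this for every packet (illustrative only; the scratch buffer and
the fragment representation are placeholders):

#include <string.h>
#include <sys/uio.h>

/* Copy n received fragments into one contiguous scratch buffer so that
 * downstream code can consume the packet linearly: the extra copy that
 * a larger chunk size avoids entirely.
 */
static void *linearize(void *scratch, const struct iovec *frags, int n)
{
        char *dst = scratch;
        int i;

        for (i = 0; i < n; i++) {
                memcpy(dst, frags[i].iov_base, frags[i].iov_len);
                dst += frags[i].iov_len;
        }
        return scratch;
}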

2023-04-13 11:11:59

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>>
>> Well, I'm mostly concerned with having two different operation and
>> configuration modes for the same thing. We'll probably need to support
>> multibuf for AF_XDP anyway for the non-ZC path, which means we'll need
>> to create a UAPI for that in any case. And having two APIs is just going
>> to be more complexity to handle at both the documentation and
>> maintenance level.
>
> I don't know if I would call this another "API". This patchset doesn't
> change the semantics of anything. It only lifts the chunk size
> restriction when hugepages are used. Furthermore, the changes here are
> quite small and easy to understand. The four sentences added to the
> documentation shouldn't be too concerning either. :-)

Well, you mentioned yourself that:

> The disadvantage of this patchset is requiring the user to allocate
> HugeTLB pages which is an extra complication.

In addition, presumably when using this mode, the other XDP actions
(XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
add special handling for that in the kernel? We'll definitely need to
handle that somehow...

> In 30 years when everyone finally migrates to page sizes >= 64K the
> maintenance burden will drop to zero. Knock wood. :-)

Haha, right, but let's make sure we have something that is consistent in
the intervening decades, shall we? ;)

>> It *might* be worth it to do this if the performance benefit is really
>> compelling, but, well, you'd need to implement both and compare directly
>> to know that for sure :)
>
> What about use-cases that require incoming packet data to be
> contiguous? Without larger chunk sizes, the user is forced to allocate
> extra space per packet and copy the data. This defeats the purpose of
> ZC.

What use cases would that be, exactly? Presumably this is also a
performance issue? Which goes back to me asking for benchmarks :)

-Toke

2023-04-13 12:40:23

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> Well, you mentioned yourself that:
>
> > The disadvantage of this patchset is requiring the user to allocate
> > HugeTLB pages which is an extra complication.

It's a small extra complication *for the user*. However, users that
need this feature are willing to allocate hugepages. We are one such
user. For us, having to deal with packets split into disjoint buffers
(from the XDP multi-buffer paradigm) is a significantly more annoying
complication than allocating hugepages (particularly on the RX side).

2023-04-13 20:51:58

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> Well, you mentioned yourself that:
>>
>> > The disadvantage of this patchset is requiring the user to allocate
>> > HugeTLB pages which is an extra complication.
>
> It's a small extra complication *for the user*. However, users that
> need this feature are willing to allocate hugepages. We are one such
> user. For us, having to deal with packets split into disjoint buffers
> (from the XDP multi-buffer paradigm) is a significantly more annoying
> complication than allocating hugepages (particularly on the RX side).

"More annoying" is not a great argument, though. You're basically saying
"please complicate your code so I don't have to complicate mine". And
since kernel API is essentially frozen forever, adding more of them
carries a pretty high cost, which is why kernel developers tend not to
be easily swayed by convenience arguments (if all you want is a more
convenient API, just build one on top of the kernel primitives and wrap
it into a library).

So you'll need to come up with either (1) a use case that you *can't*
solve without this new API (with specifics as to why that is the case),
or (2) a compelling performance benchmark showing the complexity is
worth it. Magnus indicated he would be able to produce the latter, in
which case I'm happy to be persuaded by the numbers.

In any case, however, the behaviour needs to be consistent wrt the rest
of XDP, so it's not as simple as just increasing the limit (as I
mentioned in my previous email).

-Toke

2023-04-13 22:14:54

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> "More annoying" is not a great argument, though. You're basically saying
> "please complicate your code so I don't have to complicate mine". And
> since kernel API is essentially frozen forever, adding more of them
> carries a pretty high cost, which is why kernel developers tend not to
> be easily swayed by convenience arguments (if all you want is a more
> convenient API, just build one on top of the kernel primitives and wrap
> it into a library).

I was trying to make a fair comparison from the user's perspective
between having to allocate huge pages and deal with discontiguous
buffers. That was all.

I think the "your code" distinction is a bit harsh. The kernel is a
community project. Why isn't it "our" code? I am trying to add a
feature that I think is generally useful to people. The kernel only
exists to serve its users. I believe I am doing more good than harm
sending these patches.

2023-04-13 22:31:15

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> "More annoying" is not a great argument, though. You're basically saying
>> "please complicate your code so I don't have to complicate mine". And
>> since kernel API is essentially frozen forever, adding more of them
>> carries a pretty high cost, which is why kernel developers tend not to
>> be easily swayed by convenience arguments (if all you want is a more
>> convenient API, just build one on top of the kernel primitives and wrap
>> it into a library).
>
> I was trying to make a fair comparison from the user's perspective
> between having to allocate huge pages and deal with discontiguous
> buffers. That was all.
>
> I think the "your code" distinction is a bit harsh. The kernel is a
> community project. Why isn't it "our" code? I am trying to add a
> feature that I think is generally useful to people. The kernel only
> exists to serve its users.

Oh, I'm sorry if that came across as harsh, that was not my intention! I
was certainly not trying to make a "you vs us" distinction; I was just
trying to explain why making changes on the kernel side carries a higher
cost than an equivalent (or even slightly more complex) change on the
userspace side, because of the UAPI consideration.

> I believe I am doing more good than harm sending these patches.

I don't think so! You've certainly sparked a discussion, that is good :)

-Toke

2023-04-14 09:22:11

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> >> "More annoying" is not a great argument, though. You're basically saying
> >> "please complicate your code so I don't have to complicate mine". And
> >> since kernel API is essentially frozen forever, adding more of them
> >> carries a pretty high cost, which is why kernel developers tend not to
> >> be easily swayed by convenience arguments (if all you want is a more
> >> convenient API, just build one on top of the kernel primitives and wrap
> >> it into a library).
> >
> > I was trying to make a fair comparison from the user's perspective
> > between having to allocate huge pages and deal with discontiguous
> > buffers. That was all.
> >
> > I think the "your code" distinction is a bit harsh. The kernel is a
> > community project. Why isn't it "our" code? I am trying to add a
> > feature that I think is generally useful to people. The kernel only
> > exists to serve its users.
>
> Oh, I'm sorry if that came across as harsh, that was not my intention! I
> was certainly not trying to make a "you vs us" distinction; I was just
> trying to explain why making changes on the kernel side carries a higher
> cost than an equivalent (or even slightly more complex) change on the
> userspace side, because of the UAPI consideration.

No problem! I agree. I am just somewhat confused by the "slightly more
complex" view of the situation. Currently, packets > 4K are not
supported _at all_ with AF_XDP. So this patchset adds something that
is not possible _at all_ today. We have been patiently waiting since
2018 for AF_XDP jumbo packet support. It seems that interest in this
feature from maintainers has been... lacking. :-) Now, if I understand
the situation correctly, you are asking for benchmarks to compare this
implementation with AF_XDP multi-buffer which doesn't exist yet? I am
glad it's being worked on but until there is a patchset, AF_XDP
multi-buffer is still VAPORWARE. :-)

Now in our case, we are primarily interested in throughput and
reducing total system load (so we have more compute budget for other
tasks). The faster we can receive packets the better. The packets we
need to receive are (almost) all 8000-9000 bytes... Using AF_XDP
multi-buffer would mean either (1) allocating extra space per packet
and copying the data to linearize it or (2) rewriting a significant
amount of code to handle segmented packets efficiently. If you want
benchmarks for (1) just use xdpsock and compare -z with -c. I think
that is a good approximation...

Hopefully, the HFT crowd can save this patchset in the end. :-)

-Kal

2023-04-14 16:33:00

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> In addition, presumably when using this mode, the other XDP actions
> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> add special handling for that in the kernel? We'll definitely need to
> handle that somehow...

I am not familiar with all the details here. Do you know a reason why
these cases would stop working / why special handling would be needed?
For example, if I have a UMEM that uses hugepages and XDP_PASS is
returned, then the data is just copied into an SKB right? SKBs can
also be created directly from hugepages AFAIK. So I don't understand
what the issue would be. Can someone explain this concern?

2023-04-17 12:15:51

by Magnus Karlsson

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Thu, 13 Apr 2023 at 22:52, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Kal Cutter Conley <[email protected]> writes:
>
> >> Well, you mentioned yourself that:
> >>
> >> > The disadvantage of this patchset is requiring the user to allocate
> >> > HugeTLB pages which is an extra complication.
> >
> > It's a small extra complication *for the user*. However, users that
> > need this feature are willing to allocate hugepages. We are one such
> > user. For us, having to deal with packets split into disjoint buffers
> > (from the XDP multi-buffer paradigm) is a significantly more annoying
> > complication than allocating hugepages (particularly on the RX side).
>
> "More annoying" is not a great argument, though. You're basically saying
> "please complicate your code so I don't have to complicate mine". And
> since kernel API is essentially frozen forever, adding more of them
> carries a pretty high cost, which is why kernel developers tend not to
> be easily swayed by convenience arguments (if all you want is a more
> convenient API, just build one on top of the kernel primitives and wrap
> it into a library).
>
> So you'll need to come up with either (1) a use case that you *can't*
> solve without this new API (with specifics as to why that is the case),
> or (2) a compelling performance benchmark showing the complexity is
> worth it. Magnus indicated he would be able to produce the latter, in
> which case I'm happy to be persuaded by the numbers.

We will measure it and get back to you. Would be good with some numbers.

> In any case, however, the behaviour needs to be consistent wrt the rest
> of XDP, so it's not as simple as just increasing the limit (as I
> mentioned in my previous email).
>
> -Toke
>

2023-04-17 12:43:29

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Magnus Karlsson <[email protected]> writes:

> On Thu, 13 Apr 2023 at 22:52, Toke Høiland-Jørgensen <[email protected]> wrote:
>>
>> Kal Cutter Conley <[email protected]> writes:
>>
>> >> Well, you mentioned yourself that:
>> >>
>> >> > The disadvantage of this patchset is requiring the user to allocate
>> >> > HugeTLB pages which is an extra complication.
>> >
>> > It's a small extra complication *for the user*. However, users that
>> > need this feature are willing to allocate hugepages. We are one such
>> > user. For us, having to deal with packets split into disjoint buffers
>> > (from the XDP multi-buffer paradigm) is a significantly more annoying
>> > complication than allocating hugepages (particularly on the RX side).
>>
>> "More annoying" is not a great argument, though. You're basically saying
>> "please complicate your code so I don't have to complicate mine". And
>> since kernel API is essentially frozen forever, adding more of them
>> carries a pretty high cost, which is why kernel developers tend not to
>> be easily swayed by convenience arguments (if all you want is a more
>> convenient API, just build one on top of the kernel primitives and wrap
>> it into a library).
>>
>> So you'll need to come up with either (1) a use case that you *can't*
>> solve without this new API (with specifics as to why that is the case),
>> or (2) a compelling performance benchmark showing the complexity is
>> worth it. Magnus indicated he would be able to produce the latter, in
>> which case I'm happy to be persuaded by the numbers.
>
> We will measure it and get back to you. Would be good with some
> numbers.

Sounds good, thanks! :)

-Toke

2023-04-17 14:00:13

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> > We will measure it and get back to you. Would be good with some
> > numbers.
>
> Sounds good, thanks! :)
>
> -Toke
>

+1. Thanks a lot for doing this! :-)

2023-04-18 10:21:37

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Kal Cutter Conley <[email protected]> writes:

>> In addition, presumably when using this mode, the other XDP actions
>> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
>> add special handling for that in the kernel? We'll definitely need to
>> handle that somehow...
>
> I am not familiar with all the details here. Do you know a reason why
> these cases would stop working / why special handling would be needed?
> For example, if I have a UMEM that uses hugepages and XDP_PASS is
> returned, then the data is just copied into an SKB right? SKBs can
> also be created directly from hugepages AFAIK. So I don't understand
> what the issue would be. Can someone explain this concern?

Well, I was asking :) It may well be that the SKB path just works; did
you test this? Pretty sure XDP_REDIRECT to another device won't, though?

-Toke

2023-04-18 11:10:35

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> >> In addition, presumably when using this mode, the other XDP actions
> >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> >> add special handling for that in the kernel? We'll definitely need to
> >> handle that somehow...
> >
> > I am not familiar with all the details here. Do you know a reason why
> > these cases would stop working / why special handling would be needed?
> > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > returned, then the data is just copied into an SKB right? SKBs can
> > also be created directly from hugepages AFAIK. So I don't understand
> > what the issue would be. Can someone explain this concern?
>
> Well, I was asking :) It may well be that the SKB path just works; did
> you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>

I was also asking :-)

I tested that the SKB path is usable today with this patch.
Specifically, I tested sending and receiving large jumbo packets with
AF_XDP and verified that a non-multi-buffer XDP program could access
the whole packet. I have not specifically tested XDP_REDIRECT to
another device or anything with ZC since that is not possible without
driver support.

My feeling is, there wouldn't be non-trivial issues here since this
patchset changes nothing except allowing the maximum chunk size to be
larger. The driver either supports larger MTUs with XDP enabled or it
doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
something supported or not supported irrespective of this patchset.

2023-04-21 09:44:02

by Maciej Fijalkowski

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:

Hi there,

> > >> In addition, presumably when using this mode, the other XDP actions
> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> > >> add special handling for that in the kernel? We'll definitely need to
> > >> handle that somehow...
> > >
> > > I am not familiar with all the details here. Do you know a reason why
> > > these cases would stop working / why special handling would be needed?
> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > > returned, then the data is just copied into an SKB right? SKBs can
> > > also be created directly from hugepages AFAIK. So I don't understand
> > > what the issue would be. Can someone explain this concern?
> >
> > Well, I was asking :) It may well be that the SKB path just works; did
> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?

For XDP_PASS we have to allocate a new buffer, copy the contents from
the current xdp_buff that was backed by xsk_buff_pool, and give the
current one back to the pool. I am not sure if __napi_alloc_skb() is
always capable of handling len > PAGE_SIZE - I believe there might be
a particular combination of settings that allows it, but if not we
should have a fallback path that iterates over the data and copies it
into linear + frags parts of the skb (rough sketch below). This
implies non-zero effort is needed for jumbo frame ZC support.
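
Roughly something like this for the fallback (just a sketch, not
tested; the exact helpers and limits such as MAX_SKB_FRAGS would need
checking):

/* Kernel-internal sketch: copy an xsk_buff_pool backed xdp_buff larger
 * than PAGE_SIZE into an skb with a linear part plus page frags.
 */
static struct sk_buff *xsk_copy_to_skb(struct napi_struct *napi,
                                       struct xdp_buff *xdp)
{
        u32 len = xdp->data_end - xdp->data;
        u32 linear = min_t(u32, len, PAGE_SIZE);
        struct sk_buff *skb;
        u32 off = linear;
        int i = 0;

        skb = napi_alloc_skb(napi, linear);
        if (!skb)
                return NULL;
        skb_put_data(skb, xdp->data, linear);

        /* Anything beyond the linear part goes into page frags. */
        while (off < len) {
                u32 frag_len = min_t(u32, len - off, PAGE_SIZE);
                struct page *page = dev_alloc_page();

                if (!page) {
                        kfree_skb(skb);
                        return NULL;
                }
                memcpy(page_address(page), xdp->data + off, frag_len);
                skb_add_rx_frag(skb, i++, page, 0, frag_len, PAGE_SIZE);
                off += frag_len;
        }
        return skb;
}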

I can certainly test this out and play with it - maybe this just works, I
didn't check yet. Even if it does, then we need some kind of temporary
mechanism that will forbid loading ZC jumbo frames due to what Toke
brought up.

> >
>
> I was also asking :-)
>
> I tested that the SKB path is usable today with this patch.
> Specifically, sending and receiving large jumbo packets with AF_XDP
> and that a non-multi-buffer XDP program could access the whole packet.
> I have not specifically tested XDP_REDIRECT to another device or
> anything with ZC since that is not possible without driver support.
>
> My feeling is, there wouldn't be non-trivial issues here since this
> patchset changes nothing except allowing the maximum chunk size to be
> larger. The driver either supports larger MTUs with XDP enabled or it
> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> something supported or not supported irrespective of this patchset.

Here is the comparison between multi-buffer and jumbo frames that I did
for ZC ice driver. Configured MTU was 8192 as this is the frame size for
aligned mode when working with huge pages. I am presenting plain numbers
over here from xdpsock.

Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
885,705pps - rxdrop frame_size=4096
806,307pps - l2fwd frame_size=4096
877,989pps - rxdrop frame_size=2048
773,331pps - l2fwd frame_size=2048

Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
893,530pps - rxdrop frame_size=8192
841,860pps - l2fwd frame_size=8192

Kal might say that the multi-buffer numbers are imaginary as these
patches were never shown to the public ;) but now that we have an
extensive test suite I am fixing some last issues that stand out, so
we are asking for some more patience over here... Overall I was
expecting them to be much worse when compared to jumbo frames, but
then again I believe this implementation is not ideal and can be
improved. Nevertheless, jumbo frame support has its value.

2023-04-21 10:02:23

by Toke Høiland-Jørgensen

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

Maciej Fijalkowski <[email protected]> writes:

> On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
>
> Hi there,
>
>> > >> In addition, presumably when using this mode, the other XDP actions
>> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
>> > >> add special handling for that in the kernel? We'll definitely need to
>> > >> handle that somehow...
>> > >
>> > > I am not familiar with all the details here. Do you know a reason why
>> > > these cases would stop working / why special handling would be needed?
>> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
>> > > returned, then the data is just copied into an SKB right? SKBs can
>> > > also be created directly from hugepages AFAIK. So I don't understand
>> > > what the issue would be. Can someone explain this concern?
>> >
>> > Well, I was asking :) It may well be that the SKB path just works; did
>> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>
> for XDP_PASS we have to allocate a new buffer and copy the contents from
> current xdp_buff that was backed by xsk_buff_pool and give the current one
> back to pool. I am not sure if __napi_alloc_skb() is always capable of
> handling len > PAGE_SIZE - i believe there might a particular combination
> of settings that allows it, but if not we should have a fallback path that
> would iterate over data and copy this to a certain (linear + frags) parts.
> This implies non-zero effort that is needed for jumbo frames ZC support.
>
> I can certainly test this out and play with it - maybe this just works, I
> didn't check yet. Even if it does, then we need some kind of temporary
> mechanism that will forbid loading ZC jumbo frames due to what Toke
> brought up.

Yeah, this was exactly the kind of thing I was worried about (same for
XDP_REDIRECT). Thanks for fleshing it out a bit :)

>> >
>>
>> I was also asking :-)
>>
>> I tested that the SKB path is usable today with this patch.
>> Specifically, sending and receiving large jumbo packets with AF_XDP
>> and that a non-multi-buffer XDP program could access the whole packet.
>> I have not specifically tested XDP_REDIRECT to another device or
>> anything with ZC since that is not possible without driver support.
>>
>> My feeling is, there wouldn't be non-trivial issues here since this
>> patchset changes nothing except allowing the maximum chunk size to be
>> larger. The driver either supports larger MTUs with XDP enabled or it
>> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
>> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
>> something supported or not supported irrespective of this patchset.
>
> Here is the comparison between multi-buffer and jumbo frames that I did
> for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> aligned mode when working with huge pages. I am presenting plain numbers
> over here from xdpsock.
>
> Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> 885,705pps - rxdrop frame_size=4096
> 806,307pps - l2fwd frame_size=4096
> 877,989pps - rxdrop frame_size=2048
> 773,331pps - l2fwd frame_size=2048
>
> Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> 893,530pps - rxdrop frame_size=8192
> 841,860pps - l2fwd frame_size=8192
>
> Kal might say that multi-buffer numbers are imaginary as these patches
> were never shown to the public ;) but now that we have extensive test
> suite I am fixing some last issues that stand out, so we are asking for
> some more patience over here... overall i was expecting that they will be
> much worse when compared to jumbo frames, but then again i believe this
> implementation is not ideal and can be improved. Nevertheless, jumbo
> frames support has its value.

Thank you for doing these! Okay, so that's between 1-4% improvement (vs
the 4k frags). I dunno, I wouldn't consider that a slam dunk; whether
it is worth it to do both would depend on the additional complexity,
IMO...

-Toke

2023-04-21 12:20:26

by Magnus Karlsson

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Fri, 21 Apr 2023 at 11:44, Maciej Fijalkowski
<[email protected]> wrote:
>
> On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
>
> Hi there,
>
> > > >> In addition, presumably when using this mode, the other XDP actions
> > > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> > > >> add special handling for that in the kernel? We'll definitely need to
> > > >> handle that somehow...
> > > >
> > > > I am not familiar with all the details here. Do you know a reason why
> > > > these cases would stop working / why special handling would be needed?
> > > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> > > > returned, then the data is just copied into an SKB right? SKBs can
> > > > also be created directly from hugepages AFAIK. So I don't understand
> > > > what the issue would be. Can someone explain this concern?
> > >
> > > Well, I was asking :) It may well be that the SKB path just works; did
> > > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
>
> for XDP_PASS we have to allocate a new buffer and copy the contents from
> current xdp_buff that was backed by xsk_buff_pool and give the current one
> back to pool. I am not sure if __napi_alloc_skb() is always capable of
> handling len > PAGE_SIZE - i believe there might a particular combination
> of settings that allows it, but if not we should have a fallback path that
> would iterate over data and copy this to a certain (linear + frags) parts.
> This implies non-zero effort that is needed for jumbo frames ZC support.

Thinking aloud, could not our multi-buffer work help with this? Sounds
quite similar to operations that we have to do in that patch set. And
if so, would it not be prudent to get the multi-buffer support in
there first, then implement these things on top of that? What do you
think?

> I can certainly test this out and play with it - maybe this just works, I
> didn't check yet. Even if it does, then we need some kind of temporary
> mechanism that will forbid loading ZC jumbo frames due to what Toke
> brought up.
>
> > >
> >
> > I was also asking :-)
> >
> > I tested that the SKB path is usable today with this patch.
> > Specifically, sending and receiving large jumbo packets with AF_XDP
> > and that a non-multi-buffer XDP program could access the whole packet.
> > I have not specifically tested XDP_REDIRECT to another device or
> > anything with ZC since that is not possible without driver support.
> >
> > My feeling is, there wouldn't be non-trivial issues here since this
> > patchset changes nothing except allowing the maximum chunk size to be
> > larger. The driver either supports larger MTUs with XDP enabled or it
> > doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> > size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> > something supported or not supported irrespective of this patchset.
>
> Here is the comparison between multi-buffer and jumbo frames that I did
> for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> aligned mode when working with huge pages. I am presenting plain numbers
> over here from xdpsock.
>
> Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> 885,705pps - rxdrop frame_size=4096
> 806,307pps - l2fwd frame_size=4096
> 877,989pps - rxdrop frame_size=2048
> 773,331pps - l2fwd frame_size=2048
>
> Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> 893,530pps - rxdrop frame_size=8192
> 841,860pps - l2fwd frame_size=8192
>
> Kal might say that multi-buffer numbers are imaginary as these patches
> were never shown to the public ;) but now that we have extensive test
> suite I am fixing some last issues that stand out, so we are asking for
> some more patience over here... overall i was expecting that they will be
> much worse when compared to jumbo frames, but then again i believe this
> implementation is not ideal and can be improved. Nevertheless, jumbo
> frames support has its value.

2023-04-21 12:27:52

by Magnus Karlsson

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

On Fri, 21 Apr 2023 at 12:01, Toke Høiland-Jørgensen <[email protected]> wrote:
>
> Maciej Fijalkowski <[email protected]> writes:
>
> > On Tue, Apr 18, 2023 at 01:12:00PM +0200, Kal Cutter Conley wrote:
> >
> > Hi there,
> >
> >> > >> In addition, presumably when using this mode, the other XDP actions
> >> > >> (XDP_PASS, XDP_REDIRECT to other targets) would stop working unless we
> >> > >> add special handling for that in the kernel? We'll definitely need to
> >> > >> handle that somehow...
> >> > >
> >> > > I am not familiar with all the details here. Do you know a reason why
> >> > > these cases would stop working / why special handling would be needed?
> >> > > For example, if I have a UMEM that uses hugepages and XDP_PASS is
> >> > > returned, then the data is just copied into an SKB right? SKBs can
> >> > > also be created directly from hugepages AFAIK. So I don't understand
> >> > > what the issue would be. Can someone explain this concern?
> >> >
> >> > Well, I was asking :) It may well be that the SKB path just works; did
> >> > you test this? Pretty sure XDP_REDIRECT to another device won't, though?
> >
> > for XDP_PASS we have to allocate a new buffer and copy the contents from
> > current xdp_buff that was backed by xsk_buff_pool and give the current one
> > back to pool. I am not sure if __napi_alloc_skb() is always capable of
> > handling len > PAGE_SIZE - i believe there might a particular combination
> > of settings that allows it, but if not we should have a fallback path that
> > would iterate over data and copy this to a certain (linear + frags) parts.
> > This implies non-zero effort that is needed for jumbo frames ZC support.
> >
> > I can certainly test this out and play with it - maybe this just works, I
> > didn't check yet. Even if it does, then we need some kind of temporary
> > mechanism that will forbid loading ZC jumbo frames due to what Toke
> > brought up.
>
> Yeah, this was exactly the kind of thing I was worried about (same for
> XDP_REDIRECT). Thanks for fleshing it out a bit :)
>
> >> >
> >>
> >> I was also asking :-)
> >>
> >> I tested that the SKB path is usable today with this patch.
> >> Specifically, sending and receiving large jumbo packets with AF_XDP
> >> and that a non-multi-buffer XDP program could access the whole packet.
> >> I have not specifically tested XDP_REDIRECT to another device or
> >> anything with ZC since that is not possible without driver support.
> >>
> >> My feeling is, there wouldn't be non-trivial issues here since this
> >> patchset changes nothing except allowing the maximum chunk size to be
> >> larger. The driver either supports larger MTUs with XDP enabled or it
> >> doesn't. If it doesn't, the frames are dropped anyway. Also, chunk
> >> size mismatches between two XSKs (e.g. with XDP_REDIRECT) would be
> >> something supported or not supported irrespective of this patchset.
> >
> > Here is the comparison between multi-buffer and jumbo frames that I did
> > for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> > aligned mode when working with huge pages. I am presenting plain numbers
> > over here from xdpsock.
> >
> > Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> > 885,705pps - rxdrop frame_size=4096
> > 806,307pps - l2fwd frame_size=4096
> > 877,989pps - rxdrop frame_size=2048
> > 773,331pps - l2fwd frame_size=2048
> >
> > Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> > 893,530pps - rxdrop frame_size=8192
> > 841,860pps - l2fwd frame_size=8192
> >
> > Kal might say that multi-buffer numbers are imaginary as these patches
> > were never shown to the public ;) but now that we have extensive test
> > suite I am fixing some last issues that stand out, so we are asking for
> > some more patience over here... overall i was expecting that they will be
> > much worse when compared to jumbo frames, but then again i believe this
> > implementation is not ideal and can be improved. Nevertheless, jumbo
> > frames support has its value.
>
> Thank you for doing these! Okay, so that's between 1-4% improvement (vs
> the 4k frags). I dunno, I wouldn't consider that a slam dunk; would
> depend on the additional complexity if it is worth it to do both, IMO...

If we are using 4K frags, the worst case is that we have to use 3
frags to cover a 9K packet. Interpolating between the results above
would mean somewhere in the 1 - 6% range of improvement with jumbo
frames. Something that is not covered by these tests at all is the
overhead of an abstraction layer for dealing with multiple buffers. I
believe many applications would choose to have one to hide the fact
that there are multiple buffers. So I think these improvement numbers
are on the lower side.

But I agree that we should factor in the complexity of covering the
cases you have brought up and see if it is worth it.

> -Toke
>

2023-04-21 15:29:36

by Kal Cutter Conley

Subject: Re: [PATCH bpf-next v3 1/3] xsk: Support UMEM chunk_size > PAGE_SIZE

> Here is the comparison between multi-buffer and jumbo frames that I did
> for ZC ice driver. Configured MTU was 8192 as this is the frame size for
> aligned mode when working with huge pages. I am presenting plain numbers
> over here from xdpsock.
>
> Mbuf, packet size = 8192 - XDP_PACKET_HEADROOM
> 885,705pps - rxdrop frame_size=4096
> 806,307pps - l2fwd frame_size=4096
> 877,989pps - rxdrop frame_size=2048
> 773,331pps - l2fwd frame_size=2048
>
> Jumbo, packet size = 8192 - XDP_PACKET_HEADROOM
> 893,530pps - rxdrop frame_size=8192
> 841,860pps - l2fwd frame_size=8192

Thanks so much for sharing these initial results! Do you have similar
measurements for ~9000 byte packets in unaligned mode? We typically
receive packets larger than 8192 bytes.

>
> Kal might say that multi-buffer numbers are imaginary as these patches
> were never shown to the public ;) but now that we have extensive test
> suite I am fixing some last issues that stand out, so we are asking for
> some more patience over here... overall i was expecting that they will be
> much worse when compared to jumbo frames, but then again i believe this
> implementation is not ideal and can be improved. Nevertheless, jumbo
> frames support has its value.

You made me chuckle ;-) Any measurements people can provide are
helpful, even if they must be taken with a grain of salt. ;-). How
much of your test suite can be upstreamed in the future? My assumption
was the difference should be measurable, at least you have confirmed
that. :-)