Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway.
__xdp_build_skb_from_frame() missed the moment when the networking stack
became able to recycle skb pages backed by a Page Pool. This was making
e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was
also affected in some scenarios.
A lot of drivers use skb_mark_for_recycle() already, it's been almost
two years and seems like there are no issues in using it in the generic
code too. {__,}xdp_release_frame() can be then removed as it losts its
last user.
Page Pool becomes then zero-alloc (or almost) in the abovementioned
cases, too. Other memory type models (who needs them at this point)
have no changes.
Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte
IPv6 UDP:
Plain %XDP_PASS on baseline, Page Pool driver:
src cpu Rx drops dst cpu Rx
2.1 Mpps N/A 2.1 Mpps
cpumap redirect (w/o leaving its node) on baseline:
6.8 Mpps 5.0 Mpps 1.8 Mpps
cpumap redirect with skb PP recycling:
7.9 Mpps 5.7 Mpps 2.2 Mpps +22%
Alexander Lobakin (2):
xdp: recycle Page Pool backed skbs built from XDP frames
xdp: remove unused {__,}xdp_release_frame()
include/net/xdp.h | 29 -----------------------------
net/core/xdp.c | 19 ++-----------------
2 files changed, 2 insertions(+), 46 deletions(-)
--
2.39.2
__xdp_build_skb_from_frame() state(d):
/* Until page_pool get SKB return path, release DMA here */
Page Pool got skb pages recycling in April 2021, but missed this
function.
xdp_release_frame() is relevant only for Page Pool backed frames and it
detaches the page from the corresponding Pool in order to make it
freeable via page_frag_free(). It can instead just mark the output skb
as eligible for recycling if the frame is backed by a PP. No change for
other memory model types (the same condition check as before).
cpumap redirect and veth on Page Pool drivers now become zero-alloc (or
almost).
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/xdp.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/net/core/xdp.c b/net/core/xdp.c
index 8c92fc553317..a2237cfca8e9 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -658,8 +658,8 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
* - RX ring dev queue index (skb_record_rx_queue)
*/
- /* Until page_pool get SKB return path, release DMA here */
- xdp_release_frame(xdpf);
+ if (xdpf->mem.type == MEM_TYPE_PAGE_POOL)
+ skb_mark_for_recycle(skb);
/* Allow SKB to reuse area used by xdp_frame */
xdp_scrub_frame(xdpf);
--
2.39.2
__xdp_build_skb_from_frame() was the last user of
{__,}xdp_release_frame(), which detaches pages from the Page Pool.
All the consumers now recycle Page Pool skbs and page, except mlx5,
stmmac and tsnep drivers, which use page_pool_release_page() directly
(might change one day). It's safe to assume this functionality is not
needed anymore and can be removed (in favor of recycling).
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/net/xdp.h | 29 -----------------------------
net/core/xdp.c | 15 ---------------
2 files changed, 44 deletions(-)
diff --git a/include/net/xdp.h b/include/net/xdp.h
index d517bfac937b..5393b3ebe56e 100644
--- a/include/net/xdp.h
+++ b/include/net/xdp.h
@@ -317,35 +317,6 @@ void xdp_flush_frame_bulk(struct xdp_frame_bulk *bq);
void xdp_return_frame_bulk(struct xdp_frame *xdpf,
struct xdp_frame_bulk *bq);
-/* When sending xdp_frame into the network stack, then there is no
- * return point callback, which is needed to release e.g. DMA-mapping
- * resources with page_pool. Thus, have explicit function to release
- * frame resources.
- */
-void __xdp_release_frame(void *data, struct xdp_mem_info *mem);
-static inline void xdp_release_frame(struct xdp_frame *xdpf)
-{
- struct xdp_mem_info *mem = &xdpf->mem;
- struct skb_shared_info *sinfo;
- int i;
-
- /* Curr only page_pool needs this */
- if (mem->type != MEM_TYPE_PAGE_POOL)
- return;
-
- if (likely(!xdp_frame_has_frags(xdpf)))
- goto out;
-
- sinfo = xdp_get_shared_info_from_frame(xdpf);
- for (i = 0; i < sinfo->nr_frags; i++) {
- struct page *page = skb_frag_page(&sinfo->frags[i]);
-
- __xdp_release_frame(page_address(page), mem);
- }
-out:
- __xdp_release_frame(xdpf->data, mem);
-}
-
static __always_inline unsigned int xdp_get_frame_len(struct xdp_frame *xdpf)
{
struct skb_shared_info *sinfo;
diff --git a/net/core/xdp.c b/net/core/xdp.c
index a2237cfca8e9..8d3ad315f18d 100644
--- a/net/core/xdp.c
+++ b/net/core/xdp.c
@@ -531,21 +531,6 @@ void xdp_return_buff(struct xdp_buff *xdp)
}
EXPORT_SYMBOL_GPL(xdp_return_buff);
-/* Only called for MEM_TYPE_PAGE_POOL see xdp.h */
-void __xdp_release_frame(void *data, struct xdp_mem_info *mem)
-{
- struct xdp_mem_allocator *xa;
- struct page *page;
-
- rcu_read_lock();
- xa = rhashtable_lookup(mem_id_ht, &mem->id, mem_id_rht_params);
- page = virt_to_head_page(data);
- if (xa)
- page_pool_release_page(xa->page_pool, page);
- rcu_read_unlock();
-}
-EXPORT_SYMBOL_GPL(__xdp_release_frame);
-
void xdp_attachment_setup(struct xdp_attachment_info *info,
struct netdev_bpf *bpf)
{
--
2.39.2
Hi Alexander,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Lobakin/xdp-recycle-Page-Pool-backed-skbs-built-from-XDP-frames/20230302-000635
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20230301160315.1022488-2-aleksander.lobakin%40intel.com
patch subject: [PATCH bpf-next v1 1/2] xdp: recycle Page Pool backed skbs built from XDP frames
config: i386-randconfig-a001 (https://download.01.org/0day-ci/archive/20230302/[email protected]/config)
compiler: gcc-11 (Debian 11.3.0-8) 11.3.0
reproduce (this is a W=1 build):
# https://github.com/intel-lab-lkp/linux/commit/a5ca5578e9bd35220be091fd02df96d492120ee3
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Alexander-Lobakin/xdp-recycle-Page-Pool-backed-skbs-built-from-XDP-frames/20230302-000635
git checkout a5ca5578e9bd35220be091fd02df96d492120ee3
# save the config file
mkdir build_dir && cp config build_dir/.config
make W=1 O=build_dir ARCH=i386 olddefconfig
make W=1 O=build_dir ARCH=i386 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
net/core/xdp.c: In function '__xdp_build_skb_from_frame':
>> net/core/xdp.c:662:17: error: implicit declaration of function 'skb_mark_for_recycle' [-Werror=implicit-function-declaration]
662 | skb_mark_for_recycle(skb);
| ^~~~~~~~~~~~~~~~~~~~
cc1: some warnings being treated as errors
vim +/skb_mark_for_recycle +662 net/core/xdp.c
614
615 struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
616 struct sk_buff *skb,
617 struct net_device *dev)
618 {
619 struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
620 unsigned int headroom, frame_size;
621 void *hard_start;
622 u8 nr_frags;
623
624 /* xdp frags frame */
625 if (unlikely(xdp_frame_has_frags(xdpf)))
626 nr_frags = sinfo->nr_frags;
627
628 /* Part of headroom was reserved to xdpf */
629 headroom = sizeof(*xdpf) + xdpf->headroom;
630
631 /* Memory size backing xdp_frame data already have reserved
632 * room for build_skb to place skb_shared_info in tailroom.
633 */
634 frame_size = xdpf->frame_sz;
635
636 hard_start = xdpf->data - headroom;
637 skb = build_skb_around(skb, hard_start, frame_size);
638 if (unlikely(!skb))
639 return NULL;
640
641 skb_reserve(skb, headroom);
642 __skb_put(skb, xdpf->len);
643 if (xdpf->metasize)
644 skb_metadata_set(skb, xdpf->metasize);
645
646 if (unlikely(xdp_frame_has_frags(xdpf)))
647 xdp_update_skb_shared_info(skb, nr_frags,
648 sinfo->xdp_frags_size,
649 nr_frags * xdpf->frame_sz,
650 xdp_frame_is_frag_pfmemalloc(xdpf));
651
652 /* Essential SKB info: protocol and skb->dev */
653 skb->protocol = eth_type_trans(skb, dev);
654
655 /* Optional SKB info, currently missing:
656 * - HW checksum info (skb->ip_summed)
657 * - HW RX hash (skb_set_hash)
658 * - RX ring dev queue index (skb_record_rx_queue)
659 */
660
661 if (xdpf->mem.type == MEM_TYPE_PAGE_POOL)
> 662 skb_mark_for_recycle(skb);
663
664 /* Allow SKB to reuse area used by xdp_frame */
665 xdp_scrub_frame(xdpf);
666
667 return skb;
668 }
669 EXPORT_SYMBOL_GPL(__xdp_build_skb_from_frame);
670
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
Hi Alexander,
Thank you for the patch! Yet something to improve:
[auto build test ERROR on bpf-next/master]
url: https://github.com/intel-lab-lkp/linux/commits/Alexander-Lobakin/xdp-recycle-Page-Pool-backed-skbs-built-from-XDP-frames/20230302-000635
base: https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next.git master
patch link: https://lore.kernel.org/r/20230301160315.1022488-2-aleksander.lobakin%40intel.com
patch subject: [PATCH bpf-next v1 1/2] xdp: recycle Page Pool backed skbs built from XDP frames
config: x86_64-randconfig-a014 (https://download.01.org/0day-ci/archive/20230302/[email protected]/config)
compiler: clang version 14.0.6 (https://github.com/llvm/llvm-project f28c006a5895fc0e329fe15fead81e37457cb1d1)
reproduce (this is a W=1 build):
wget https://raw.githubusercontent.com/intel/lkp-tests/master/sbin/make.cross -O ~/bin/make.cross
chmod +x ~/bin/make.cross
# https://github.com/intel-lab-lkp/linux/commit/a5ca5578e9bd35220be091fd02df96d492120ee3
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Alexander-Lobakin/xdp-recycle-Page-Pool-backed-skbs-built-from-XDP-frames/20230302-000635
git checkout a5ca5578e9bd35220be091fd02df96d492120ee3
# save the config file
mkdir build_dir && cp config build_dir/.config
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 olddefconfig
COMPILER_INSTALL_PATH=$HOME/0day COMPILER=clang make.cross W=1 O=build_dir ARCH=x86_64 SHELL=/bin/bash
If you fix the issue, kindly add following tag where applicable
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/oe-kbuild-all/[email protected]/
All errors (new ones prefixed by >>):
>> net/core/xdp.c:662:3: error: implicit declaration of function 'skb_mark_for_recycle' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
skb_mark_for_recycle(skb);
^
1 error generated.
vim +/skb_mark_for_recycle +662 net/core/xdp.c
614
615 struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
616 struct sk_buff *skb,
617 struct net_device *dev)
618 {
619 struct skb_shared_info *sinfo = xdp_get_shared_info_from_frame(xdpf);
620 unsigned int headroom, frame_size;
621 void *hard_start;
622 u8 nr_frags;
623
624 /* xdp frags frame */
625 if (unlikely(xdp_frame_has_frags(xdpf)))
626 nr_frags = sinfo->nr_frags;
627
628 /* Part of headroom was reserved to xdpf */
629 headroom = sizeof(*xdpf) + xdpf->headroom;
630
631 /* Memory size backing xdp_frame data already have reserved
632 * room for build_skb to place skb_shared_info in tailroom.
633 */
634 frame_size = xdpf->frame_sz;
635
636 hard_start = xdpf->data - headroom;
637 skb = build_skb_around(skb, hard_start, frame_size);
638 if (unlikely(!skb))
639 return NULL;
640
641 skb_reserve(skb, headroom);
642 __skb_put(skb, xdpf->len);
643 if (xdpf->metasize)
644 skb_metadata_set(skb, xdpf->metasize);
645
646 if (unlikely(xdp_frame_has_frags(xdpf)))
647 xdp_update_skb_shared_info(skb, nr_frags,
648 sinfo->xdp_frags_size,
649 nr_frags * xdpf->frame_sz,
650 xdp_frame_is_frag_pfmemalloc(xdpf));
651
652 /* Essential SKB info: protocol and skb->dev */
653 skb->protocol = eth_type_trans(skb, dev);
654
655 /* Optional SKB info, currently missing:
656 * - HW checksum info (skb->ip_summed)
657 * - HW RX hash (skb_set_hash)
658 * - RX ring dev queue index (skb_record_rx_queue)
659 */
660
661 if (xdpf->mem.type == MEM_TYPE_PAGE_POOL)
> 662 skb_mark_for_recycle(skb);
663
664 /* Allow SKB to reuse area used by xdp_frame */
665 xdp_scrub_frame(xdpf);
666
667 return skb;
668 }
669 EXPORT_SYMBOL_GPL(__xdp_build_skb_from_frame);
670
--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests
On 2023/3/2 0:03, Alexander Lobakin wrote:
> __xdp_build_skb_from_frame() state(d):
>
> /* Until page_pool get SKB return path, release DMA here */
>
> Page Pool got skb pages recycling in April 2021, but missed this
> function.
>
> xdp_release_frame() is relevant only for Page Pool backed frames and it
> detaches the page from the corresponding Pool in order to make it
> freeable via page_frag_free(). It can instead just mark the output skb
> as eligible for recycling if the frame is backed by a PP. No change for
> other memory model types (the same condition check as before).
> cpumap redirect and veth on Page Pool drivers now become zero-alloc (or
> almost).
>
> Signed-off-by: Alexander Lobakin <[email protected]>
> ---
> net/core/xdp.c | 4 ++--
> 1 file changed, 2 insertions(+), 2 deletions(-)
>
> diff --git a/net/core/xdp.c b/net/core/xdp.c
> index 8c92fc553317..a2237cfca8e9 100644
> --- a/net/core/xdp.c
> +++ b/net/core/xdp.c
> @@ -658,8 +658,8 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
> * - RX ring dev queue index (skb_record_rx_queue)
> */
>
> - /* Until page_pool get SKB return path, release DMA here */
> - xdp_release_frame(xdpf);
> + if (xdpf->mem.type == MEM_TYPE_PAGE_POOL)
> + skb_mark_for_recycle(skb);
We both rely on both skb->pp_recycle and page->pp_magic to decide
the page is really from page pool. So there was a few corner case
problem when we are sharing a page for different skb in the driver
level or calling skb_clone() or skb_try_coalesce().
see:
https://github.com/torvalds/linux/commit/2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
https://lore.kernel.org/netdev/MW5PR15MB51214C0513DB08A3607FBC1FBDE19@MW5PR15MB5121.namprd15.prod.outlook.com/t/
https://lore.kernel.org/netdev/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain/
As the 'struct xdp_frame' also use 'struct skb_shared_info' which is
sharable, see xdp_get_shared_info_from_frame().
For now xdpf_clone() does not seems to handling frag page yet,
so it should be fine for now.
IMHO we should find a way to use per-page marker, instead of both
per-skb and per-page markers, in order to avoid the above problem
for xdp if xdp has a similar processing as skb, as suggested by Eric.
https://lore.kernel.org/netdev/CANn89iKgZU4Q+THXupzZi4hETuKuCOvOB=iHpp5JzQTNv_Fg_A@mail.gmail.com/
>
> /* Allow SKB to reuse area used by xdp_frame */
> xdp_scrub_frame(xdpf);
>
On 02/03/2023 03.30, Yunsheng Lin wrote:
> On 2023/3/2 0:03, Alexander Lobakin wrote:
>> __xdp_build_skb_from_frame() state(d):
>>
>> /* Until page_pool get SKB return path, release DMA here */
>>
>> Page Pool got skb pages recycling in April 2021, but missed this
>> function.
>>
>> xdp_release_frame() is relevant only for Page Pool backed frames and it
>> detaches the page from the corresponding Pool in order to make it
>> freeable via page_frag_free(). It can instead just mark the output skb
>> as eligible for recycling if the frame is backed by a PP. No change for
>> other memory model types (the same condition check as before).
>> cpumap redirect and veth on Page Pool drivers now become zero-alloc (or
>> almost).
>>
>> Signed-off-by: Alexander Lobakin <[email protected]>
>> ---
>> net/core/xdp.c | 4 ++--
>> 1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/net/core/xdp.c b/net/core/xdp.c
>> index 8c92fc553317..a2237cfca8e9 100644
>> --- a/net/core/xdp.c
>> +++ b/net/core/xdp.c
>> @@ -658,8 +658,8 @@ struct sk_buff *__xdp_build_skb_from_frame(struct xdp_frame *xdpf,
>> * - RX ring dev queue index (skb_record_rx_queue)
>> */
>>
>> - /* Until page_pool get SKB return path, release DMA here */
>> - xdp_release_frame(xdpf);
>> + if (xdpf->mem.type == MEM_TYPE_PAGE_POOL)
>> + skb_mark_for_recycle(skb);
>
>
> We both rely on both skb->pp_recycle and page->pp_magic to decide
> the page is really from page pool. So there was a few corner case
> problem when we are sharing a page for different skb in the driver
> level or calling skb_clone() or skb_try_coalesce().
> see:
> https://github.com/torvalds/linux/commit/2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> https://lore.kernel.org/netdev/MW5PR15MB51214C0513DB08A3607FBC1FBDE19@MW5PR15MB5121.namprd15.prod.outlook.com/t/
> https://lore.kernel.org/netdev/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain/
>
> As the 'struct xdp_frame' also use 'struct skb_shared_info' which is
> sharable, see xdp_get_shared_info_from_frame().
>
> For now xdpf_clone() does not seems to handling frag page yet,
> so it should be fine for now.
>
> IMHO we should find a way to use per-page marker, instead of both
> per-skb and per-page markers, in order to avoid the above problem
> for xdp if xdp has a similar processing as skb, as suggested by Eric.
>
Moving to a per-page marker can be *more* expensive if the struct-page
memory isn't cache-hot. So, if struct-page is accessed anyhow then sure
we can move it to a per-page marker.
> https://lore.kernel.org/netdev/CANn89iKgZU4Q+THXupzZi4hETuKuCOvOB=iHpp5JzQTNv_Fg_A@mail.gmail.com/
>
>>
>> /* Allow SKB to reuse area used by xdp_frame */
>> xdp_scrub_frame(xdpf);
>>
>
On 01/03/2023 17.03, Alexander Lobakin wrote:
> Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway.
>
> __xdp_build_skb_from_frame() missed the moment when the networking stack
> became able to recycle skb pages backed by a Page Pool. This was making
^^^^^^^^^
When talking about page_pool, can we write "page_pool" instead of
capitalized "Page Pool", please. I looked through the git log, and here
we all used "page_pool".
> e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was
> also affected in some scenarios.
Thanks for working on closing this gap :-)
> A lot of drivers use skb_mark_for_recycle() already, it's been almost
> two years and seems like there are no issues in using it in the generic
> code too. {__,}xdp_release_frame() can be then removed as it losts its
> last user.
> Page Pool becomes then zero-alloc (or almost) in the abovementioned
> cases, too. Other memory type models (who needs them at this point)
> have no changes.
>
> Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte
> IPv6 UDP:
What NIC driver?
>
> Plain %XDP_PASS on baseline, Page Pool driver:
>
> src cpu Rx drops dst cpu Rx
> 2.1 Mpps N/A 2.1 Mpps
>
> cpumap redirect (w/o leaving its node) on baseline:
>
> 6.8 Mpps 5.0 Mpps 1.8 Mpps
>
> cpumap redirect with skb PP recycling:
>
> 7.9 Mpps 5.7 Mpps 2.2 Mpps +22%
>
It is of cause awesome, that cpumap SKBs are faster than normal SKB path.
I do wonder where the +22% number comes from?
> Alexander Lobakin (2):
> xdp: recycle Page Pool backed skbs built from XDP frames
> xdp: remove unused {__,}xdp_release_frame()
>
> include/net/xdp.h | 29 -----------------------------
> net/core/xdp.c | 19 ++-----------------
> 2 files changed, 2 insertions(+), 46 deletions(-)
>
From: Yunsheng Lin <[email protected]>
Date: Thu, 2 Mar 2023 10:30:13 +0800
> On 2023/3/2 0:03, Alexander Lobakin wrote:
>> __xdp_build_skb_from_frame() state(d):
>>
>> /* Until page_pool get SKB return path, release DMA here */
>>
>> Page Pool got skb pages recycling in April 2021, but missed this
>> function.
[...]
> We both rely on both skb->pp_recycle and page->pp_magic to decide
> the page is really from page pool. So there was a few corner case
> problem when we are sharing a page for different skb in the driver
> level or calling skb_clone() or skb_try_coalesce().
> see:
> https://github.com/torvalds/linux/commit/2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
> https://lore.kernel.org/netdev/MW5PR15MB51214C0513DB08A3607FBC1FBDE19@MW5PR15MB5121.namprd15.prod.outlook.com/t/
> https://lore.kernel.org/netdev/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain/
And they are fixed :D
No drivers currently which use Page Pool mix PP pages with non-PP. And
it's impossible to trigger try_coalesce() or so at least on cpumap path
since we're only creating skbs at that moment, they don't come from
anywhere else.
>
> As the 'struct xdp_frame' also use 'struct skb_shared_info' which is
> sharable, see xdp_get_shared_info_from_frame().
>
> For now xdpf_clone() does not seems to handling frag page yet,
> so it should be fine for now.
xdpf_clone() clones a frame to a new full page and doesn't copy its
skb_shared_info.
>
> IMHO we should find a way to use per-page marker, instead of both
> per-skb and per-page markers, in order to avoid the above problem
> for xdp if xdp has a similar processing as skb, as suggested by Eric.
>
> https://lore.kernel.org/netdev/CANn89iKgZU4Q+THXupzZi4hETuKuCOvOB=iHpp5JzQTNv_Fg_A@mail.gmail.com/
As Jesper already pointed out, not having a quick way to check whether
we have to check ::pp_magic at all can decrease performance. So it's
rather a shortcut.
>
>>
>> /* Allow SKB to reuse area used by xdp_frame */
>> xdp_scrub_frame(xdpf);
>>
Thanks,
Olek
From: Jesper Dangaard Brouer <[email protected]>
Date: Fri, 3 Mar 2023 11:39:06 +0100
>
> On 01/03/2023 17.03, Alexander Lobakin wrote:
>> Yeah, I still remember that "Who needs cpumap nowadays" (c), but anyway.
>>
>> __xdp_build_skb_from_frame() missed the moment when the networking stack
>> became able to recycle skb pages backed by a Page Pool. This was making
> ^^^^^^^^^
> When talking about page_pool, can we write "page_pool" instead of
> capitalized "Page Pool", please. I looked through the git log, and here
> we all used "page_pool".
Ah okay, no prob :D Yeah, that's probably more correct. "Page Pool" is
the name of the API, while page_pool is an entity we create via
page_pool_create().
>
>> e.g. cpumap redirect even less effective than simple %XDP_PASS. veth was
>> also affected in some scenarios.
>
> Thanks for working on closing this gap :-)
>
>> A lot of drivers use skb_mark_for_recycle() already, it's been almost
>> two years and seems like there are no issues in using it in the generic
>> code too. {__,}xdp_release_frame() can be then removed as it losts its
>> last user.
>> Page Pool becomes then zero-alloc (or almost) in the abovementioned
>> cases, too. Other memory type models (who needs them at this point)
>> have no changes.
>>
>> Some numbers on 1 Xeon Platinum core bombed with 27 Mpps of 64-byte
>> IPv6 UDP:
>
> What NIC driver?
IAVF with XDP, the series adding XDP support will be sent in a couple
weeks, WIP can be found on my open GH[0].
>
>>
>> Plain %XDP_PASS on baseline, Page Pool driver:
>>
>> src cpu Rx drops dst cpu Rx
>> 2.1 Mpps N/A 2.1 Mpps
>>
>> cpumap redirect (w/o leaving its node) on baseline:
>>
>> 6.8 Mpps 5.0 Mpps 1.8 Mpps
>>
>> cpumap redirect with skb PP recycling:
>>
>> 7.9 Mpps 5.7 Mpps 2.2 Mpps +22%
>>
>
> It is of cause awesome, that cpumap SKBs are faster than normal SKB path.
That's the point of cpumap redirect, right? You separate NAPI poll / IRQ
handling from the skb networking stack traveling to a different CPU,
including page freeing (or recycling). That takes a lot of load from the
source CPU. 0.1 Mpps is not the highest difference I got, cpumap
redirect can boost up to 0.5 Mpps IIRC.
> I do wonder where the +22% number comes from?
(2.2 - 1.8) / 1.8 * 100%. I compare baseline cpumap redirect
before/after here :)
>
>> Alexander Lobakin (2):
>> xdp: recycle Page Pool backed skbs built from XDP frames
>> xdp: remove unused {__,}xdp_release_frame()
>>
>> include/net/xdp.h | 29 -----------------------------
>> net/core/xdp.c | 19 ++-----------------
>> 2 files changed, 2 insertions(+), 46 deletions(-)
>>
>
There's a build failure on non-PP systems due to skb_mark_for_recycle()
being declared only when CONFIG_PAGE_POOL is set. I'll spin v2 in a bit.
[0] https://github.com/alobakin/linux/commits/iavf-xdp
Thanks,
Olek
On 2023/3/3 19:22, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Thu, 2 Mar 2023 10:30:13 +0800
>
>> On 2023/3/2 0:03, Alexander Lobakin wrote:
>>> __xdp_build_skb_from_frame() state(d):
>>>
>>> /* Until page_pool get SKB return path, release DMA here */
>>>
>>> Page Pool got skb pages recycling in April 2021, but missed this
>>> function.
>
> [...]
>
>> We both rely on both skb->pp_recycle and page->pp_magic to decide
>> the page is really from page pool. So there was a few corner case
>> problem when we are sharing a page for different skb in the driver
>> level or calling skb_clone() or skb_try_coalesce().
>> see:
>> https://github.com/torvalds/linux/commit/2cc3aeb5ecccec0d266813172fcd82b4b5fa5803
>> https://lore.kernel.org/netdev/MW5PR15MB51214C0513DB08A3607FBC1FBDE19@MW5PR15MB5121.namprd15.prod.outlook.com/t/
>> https://lore.kernel.org/netdev/167475990764.1934330.11960904198087757911.stgit@localhost.localdomain/
>
> And they are fixed :D
> No drivers currently which use Page Pool mix PP pages with non-PP. And
The wireless adapter which use Page Pool *does* mix PP pages with
non-PP, see below discussion:
https://lore.kernel.org/netdev/[email protected]/
> it's impossible to trigger try_coalesce() or so at least on cpumap path
> since we're only creating skbs at that moment, they don't come from
> anywhere else.
>
>>
>> As the 'struct xdp_frame' also use 'struct skb_shared_info' which is
>> sharable, see xdp_get_shared_info_from_frame().
>>
>> For now xdpf_clone() does not seems to handling frag page yet,
>> so it should be fine for now.
>
> xdpf_clone() clones a frame to a new full page and doesn't copy its
> skb_shared_info.
>
>>
>> IMHO we should find a way to use per-page marker, instead of both
>> per-skb and per-page markers, in order to avoid the above problem
>> for xdp if xdp has a similar processing as skb, as suggested by Eric.
>>
>> https://lore.kernel.org/netdev/CANn89iKgZU4Q+THXupzZi4hETuKuCOvOB=iHpp5JzQTNv_Fg_A@mail.gmail.com/
>
> As Jesper already pointed out, not having a quick way to check whether
> we have to check ::pp_magic at all can decrease performance. So it's
> rather a shortcut.
When we are freeing a page by updating the _refcount, I think
we are already touching the cache of ::pp_magic.
Anyway, I am not sure checking ::pp_magic is correct when a
page will be passing between different subsystem and back to
the network stack eventually, checking ::pp_magic may not be
correct if this happens.
Another way is to use the bottom two bits in bv_page, see:
https://www.spinics.net/lists/netdev/msg874099.html
>
>>
>>>
>>> /* Allow SKB to reuse area used by xdp_frame */
>>> xdp_scrub_frame(xdpf);
>>>
>
> Thanks,
> Olek
> .
>
From: Yunsheng Lin <[email protected]>
Date: Fri, 3 Mar 2023 20:44:24 +0800
> On 2023/3/3 19:22, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Thu, 2 Mar 2023 10:30:13 +0800
[...]
>> And they are fixed :D
>> No drivers currently which use Page Pool mix PP pages with non-PP. And
>
> The wireless adapter which use Page Pool *does* mix PP pages with
> non-PP, see below discussion:
>
> https://lore.kernel.org/netdev/[email protected]/
Ah right, I remember that (also was fixed).
Not that I think it is correct to mix them -- for my PoV, a driver
shoule either give *all* its Rx buffers as PP-backed or not use PP at all.
[...]
>> As Jesper already pointed out, not having a quick way to check whether
>> we have to check ::pp_magic at all can decrease performance. So it's
>> rather a shortcut.
>
> When we are freeing a page by updating the _refcount, I think
> we are already touching the cache of ::pp_magic.
But no page freeing happens before checking for skb->pp_recycle, neither
in skb_pp_recycle() (skb_free_head() etc.)[0] nor in skb_frag_unref()[1].
>
> Anyway, I am not sure checking ::pp_magic is correct when a
> page will be passing between different subsystem and back to
> the network stack eventually, checking ::pp_magic may not be
> correct if this happens.
>
> Another way is to use the bottom two bits in bv_page, see:
> https://www.spinics.net/lists/netdev/msg874099.html
>
>>
>>>
>>>>
>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>> xdp_scrub_frame(xdpf);
>>>>
>>
>> Thanks,
>> Olek
>> .
>>
[0] https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L808
[1]
https://elixir.bootlin.com/linux/latest/source/include/linux/skbuff.h#L3385
Thanks,
Olek
On 2023/3/3 21:26, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Fri, 3 Mar 2023 20:44:24 +0800
>
>> On 2023/3/3 19:22, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <[email protected]>
>>> Date: Thu, 2 Mar 2023 10:30:13 +0800
>
> [...]
>
>>> And they are fixed :D
>>> No drivers currently which use Page Pool mix PP pages with non-PP. And
>>
>> The wireless adapter which use Page Pool *does* mix PP pages with
>> non-PP, see below discussion:
>>
>> https://lore.kernel.org/netdev/[email protected]/
>
> Ah right, I remember that (also was fixed).
> Not that I think it is correct to mix them -- for my PoV, a driver
> shoule either give *all* its Rx buffers as PP-backed or not use PP at all.
>
> [...]
>
>>> As Jesper already pointed out, not having a quick way to check whether
>>> we have to check ::pp_magic at all can decrease performance. So it's
>>> rather a shortcut.
>>
>> When we are freeing a page by updating the _refcount, I think
>> we are already touching the cache of ::pp_magic.
>
> But no page freeing happens before checking for skb->pp_recycle, neither
> in skb_pp_recycle() (skb_free_head() etc.)[0] nor in skb_frag_unref()[1].
If we move to per page marker, we probably do not need checking
skb->pp_recycle.
Note both page_pool_return_skb_page() and skb_free_frag() can
reuse the cache line triggered by per page marker checking if
the per page marker is in the 'struct page'.
>
>>
>> Anyway, I am not sure checking ::pp_magic is correct when a
>> page will be passing between different subsystem and back to
>> the network stack eventually, checking ::pp_magic may not be
>> correct if this happens.
>>
>> Another way is to use the bottom two bits in bv_page, see:
>> https://www.spinics.net/lists/netdev/msg874099.html
>>
>>>
>>>>
>>>>>
>>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>>> xdp_scrub_frame(xdpf);
>>>>>
>>>
>>> Thanks,
>>> Olek
>>> .
>>>
>
> [0] https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L808
> [1]
> https://elixir.bootlin.com/linux/latest/source/include/linux/skbuff.h#L3385
>
> Thanks,
> Olek
> .
>
From: Yunsheng Lin <[email protected]>
Date: Mon, 6 Mar 2023 09:09:31 +0800
> On 2023/3/3 21:26, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Fri, 3 Mar 2023 20:44:24 +0800
>>
>>> On 2023/3/3 19:22, Alexander Lobakin wrote:
>>>> From: Yunsheng Lin <[email protected]>
>>>> Date: Thu, 2 Mar 2023 10:30:13 +0800
>>
>> [...]
>>
>>>> And they are fixed :D
>>>> No drivers currently which use Page Pool mix PP pages with non-PP. And
>>>
>>> The wireless adapter which use Page Pool *does* mix PP pages with
>>> non-PP, see below discussion:
>>>
>>> https://lore.kernel.org/netdev/[email protected]/
>>
>> Ah right, I remember that (also was fixed).
>> Not that I think it is correct to mix them -- for my PoV, a driver
>> shoule either give *all* its Rx buffers as PP-backed or not use PP at all.
>>
>> [...]
>>
>>>> As Jesper already pointed out, not having a quick way to check whether
>>>> we have to check ::pp_magic at all can decrease performance. So it's
>>>> rather a shortcut.
>>>
>>> When we are freeing a page by updating the _refcount, I think
>>> we are already touching the cache of ::pp_magic.
>>
>> But no page freeing happens before checking for skb->pp_recycle, neither
>> in skb_pp_recycle() (skb_free_head() etc.)[0] nor in skb_frag_unref()[1].
>
> If we move to per page marker, we probably do not need checking
> skb->pp_recycle.
>
> Note both page_pool_return_skb_page() and skb_free_frag() can
> reuse the cache line triggered by per page marker checking if
> the per page marker is in the 'struct page'.
Ah, from that perspective. Yes, you're probably right, but would need to
be tested anyway. I don't see any open problems with the PP recycling
right now on the lists, but someone may try to change it one day.
Anyway, this flag is only to do a quick test. We do have
sk_buff::pfmemalloc, but this flag doesn't mean every page from this skb
was pfmemalloced.
>
>>
>>>
>>> Anyway, I am not sure checking ::pp_magic is correct when a
>>> page will be passing between different subsystem and back to
>>> the network stack eventually, checking ::pp_magic may not be
>>> correct if this happens.
>>>
>>> Another way is to use the bottom two bits in bv_page, see:
>>> https://www.spinics.net/lists/netdev/msg874099.html
>>>
>>>>
>>>>>
>>>>>>
>>>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>>>> xdp_scrub_frame(xdpf);
>>>>>>
>>>>
>>>> Thanks,
>>>> Olek
>>>> .
>>>>
>>
>> [0] https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L808
>> [1]
>> https://elixir.bootlin.com/linux/latest/source/include/linux/skbuff.h#L3385
>>
>> Thanks,
>> Olek
>> .
>>
Thanks,
Olek
On 2023/3/6 19:58, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Mon, 6 Mar 2023 09:09:31 +0800
>
>> On 2023/3/3 21:26, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <[email protected]>
>>> Date: Fri, 3 Mar 2023 20:44:24 +0800
>>>
>>>> On 2023/3/3 19:22, Alexander Lobakin wrote:
>>>>> From: Yunsheng Lin <[email protected]>
>>>>> Date: Thu, 2 Mar 2023 10:30:13 +0800
>>>
>>> [...]
>>>
>>>>> And they are fixed :D
>>>>> No drivers currently which use Page Pool mix PP pages with non-PP. And
>>>>
>>>> The wireless adapter which use Page Pool *does* mix PP pages with
>>>> non-PP, see below discussion:
>>>>
>>>> https://lore.kernel.org/netdev/[email protected]/
>>>
>>> Ah right, I remember that (also was fixed).
>>> Not that I think it is correct to mix them -- for my PoV, a driver
>>> shoule either give *all* its Rx buffers as PP-backed or not use PP at all.
>>>
>>> [...]
>>>
>>>>> As Jesper already pointed out, not having a quick way to check whether
>>>>> we have to check ::pp_magic at all can decrease performance. So it's
>>>>> rather a shortcut.
>>>>
>>>> When we are freeing a page by updating the _refcount, I think
>>>> we are already touching the cache of ::pp_magic.
>>>
>>> But no page freeing happens before checking for skb->pp_recycle, neither
>>> in skb_pp_recycle() (skb_free_head() etc.)[0] nor in skb_frag_unref()[1].
>>
>> If we move to per page marker, we probably do not need checking
>> skb->pp_recycle.
>>
>> Note both page_pool_return_skb_page() and skb_free_frag() can
>> reuse the cache line triggered by per page marker checking if
>> the per page marker is in the 'struct page'.
>
> Ah, from that perspective. Yes, you're probably right, but would need to
> be tested anyway. I don't see any open problems with the PP recycling
> right now on the lists, but someone may try to change it one day.
> Anyway, this flag is only to do a quick test. We do have
> sk_buff::pfmemalloc, but this flag doesn't mean every page from this skb
> was pfmemalloced.
The point seems to be that sk_buff::pfmemalloc allow false positive, which
means skb->pfmemalloc can be set to true while every page from this skb is
not pfmemalloced as you mentioned.
While skb->pp_recycle can't allow false positive, if that happens, reference
counting of the page will not be handled properly if pp and non-pp skb shares
the page as the wireless adapter does.
>
>>
>>>
>>>>
>>>> Anyway, I am not sure checking ::pp_magic is correct when a
>>>> page will be passing between different subsystem and back to
>>>> the network stack eventually, checking ::pp_magic may not be
>>>> correct if this happens.
>>>>
>>>> Another way is to use the bottom two bits in bv_page, see:
>>>> https://www.spinics.net/lists/netdev/msg874099.html
>>>>
>>>>>
>>>>>>
>>>>>>>
>>>>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>>>>> xdp_scrub_frame(xdpf);
>>>>>>>
>>>>>
>>>>> Thanks,
>>>>> Olek
>>>>> .
>>>>>
>>>
>>> [0] https://elixir.bootlin.com/linux/latest/source/net/core/skbuff.c#L808
>>> [1]
>>> https://elixir.bootlin.com/linux/latest/source/include/linux/skbuff.h#L3385
>>>
>>> Thanks,
>>> Olek
>>> .
>>>
>
> Thanks,
> Olek
>
> .
>
From: Yunsheng Lin <[email protected]>
Date: Tue, 7 Mar 2023 10:50:34 +0800
> On 2023/3/6 19:58, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Mon, 6 Mar 2023 09:09:31 +0800
[...]
>> Ah, from that perspective. Yes, you're probably right, but would need to
>> be tested anyway. I don't see any open problems with the PP recycling
>> right now on the lists, but someone may try to change it one day.
>> Anyway, this flag is only to do a quick test. We do have
>> sk_buff::pfmemalloc, but this flag doesn't mean every page from this skb
>> was pfmemalloced.
>
> The point seems to be that sk_buff::pfmemalloc allow false positive, which
> means skb->pfmemalloc can be set to true while every page from this skb is
> not pfmemalloced as you mentioned.
>
> While skb->pp_recycle can't allow false positive, if that happens, reference
> counting of the page will not be handled properly if pp and non-pp skb shares
> the page as the wireless adapter does.
You mean false-positives in both directions? Because if ->pp_recycle is
set, the stack can still free non-PP pages. In the opposite case, I mean
when ->pp_recycle is false and an skb page belongs to a page_pool, yes,
there'll be issues.
But I think the deal is to propagate the flag when you want to attach a
PP-backed page to the skb? I mean, if someone decides to mix pages with
different memory models, it's his responsibility to make sure everything
is fine, because it's not a common/intended way. Isn't it?
>
>>
>>>
>>>>
>>>>>
>>>>> Anyway, I am not sure checking ::pp_magic is correct when a
>>>>> page will be passing between different subsystem and back to
>>>>> the network stack eventually, checking ::pp_magic may not be
>>>>> correct if this happens.
>>>>>
>>>>> Another way is to use the bottom two bits in bv_page, see:
>>>>> https://www.spinics.net/lists/netdev/msg874099.html
This one is interesting actually. We'd only need one bit -- which is
100% free and available in case of page pointers.
>>>>>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>>>>>> xdp_scrub_frame(xdpf);
[...]
Thanks,
Olek
On 2023/3/8 2:14, Alexander Lobakin wrote:
> From: Yunsheng Lin <[email protected]>
> Date: Tue, 7 Mar 2023 10:50:34 +0800
>
>> On 2023/3/6 19:58, Alexander Lobakin wrote:
>>> From: Yunsheng Lin <[email protected]>
>>> Date: Mon, 6 Mar 2023 09:09:31 +0800
>
> [...]
>
>>> Ah, from that perspective. Yes, you're probably right, but would need to
>>> be tested anyway. I don't see any open problems with the PP recycling
>>> right now on the lists, but someone may try to change it one day.
>>> Anyway, this flag is only to do a quick test. We do have
>>> sk_buff::pfmemalloc, but this flag doesn't mean every page from this skb
>>> was pfmemalloced.
>>
>> The point seems to be that sk_buff::pfmemalloc allow false positive, which
>> means skb->pfmemalloc can be set to true while every page from this skb is
>> not pfmemalloced as you mentioned.
>>
>> While skb->pp_recycle can't allow false positive, if that happens, reference
>> counting of the page will not be handled properly if pp and non-pp skb shares
>> the page as the wireless adapter does.
>
> You mean false-positives in both directions? Because if ->pp_recycle is
> set, the stack can still free non-PP pages. In the opposite case, I mean
> when ->pp_recycle is false and an skb page belongs to a page_pool, yes,
> there'll be issues.
That may depends on what is a PP pages and what is a non-PP pages, it seems
hard to answer now.
For a skb with ->pp_recycle being true and its frag page with page->pp_magic
being PP_SIGNATURE, when calling skb_clone()/pskb_expand_head() or
skb_try_coalesce(), we may call __skb_frag_ref() for the frag page, which
mean a page with page->pp_magic being PP_SIGNATURE can be both PP page
and non-PP page at the same time. So it is important to set the ->pp_recycle
correctly, and it seems hard to get that right from past experience,that's
why a per page marker is suggested.
> But I think the deal is to propagate the flag when you want to attach a
> PP-backed page to the skb? I mean, if someone decides to mix pages with
> different memory models, it's his responsibility to make sure everything
> is fine, because it's not a common/intended way. Isn't it?
>
>>
>>>
>>>>
>>>>>
>>>>>>
>>>>>> Anyway, I am not sure checking ::pp_magic is correct when a
>>>>>> page will be passing between different subsystem and back to
>>>>>> the network stack eventually, checking ::pp_magic may not be
>>>>>> correct if this happens.
>>>>>>
>>>>>> Another way is to use the bottom two bits in bv_page, see:
>>>>>> https://www.spinics.net/lists/netdev/msg874099.html
>
> This one is interesting actually. We'd only need one bit -- which is
> 100% free and available in case of page pointers.
>
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>>>
>>>>>>>>> /* Allow SKB to reuse area used by xdp_frame */
>>>>>>>>> xdp_scrub_frame(xdpf);
>
> [...]
>
> Thanks,
> Olek
>
> .
>
From: Yunsheng Lin <[email protected]>
Date: Wed, 8 Mar 2023 14:27:13 +0800
> On 2023/3/8 2:14, Alexander Lobakin wrote:
>> From: Yunsheng Lin <[email protected]>
>> Date: Tue, 7 Mar 2023 10:50:34 +0800
[...]
>> You mean false-positives in both directions? Because if ->pp_recycle is
>> set, the stack can still free non-PP pages. In the opposite case, I mean
>> when ->pp_recycle is false and an skb page belongs to a page_pool, yes,
>> there'll be issues.
>
> That may depends on what is a PP pages and what is a non-PP pages, it seems
> hard to answer now.
>
> For a skb with ->pp_recycle being true and its frag page with page->pp_magic
> being PP_SIGNATURE, when calling skb_clone()/pskb_expand_head() or
> skb_try_coalesce(), we may call __skb_frag_ref() for the frag page, which
> mean a page with page->pp_magic being PP_SIGNATURE can be both PP page
> and non-PP page at the same time. So it is important to set the ->pp_recycle
> correctly, and it seems hard to get that right from past experience,that's
> why a per page marker is suggested.
Oh well, I didn't know that :s
Thanks for the expl.
[...]
Olek