Currently, all sorts of skb allocation always do allocate
skbuff_heads one by one via kmem_cache_alloc().
On the other hand, we have percpu napi_alloc_cache to store
skbuff_heads queued up for freeing and flush them by bulks.
We can use this cache not only for bulk-wiping, but also to obtain
heads for new skbs and avoid unconditional allocations, as well as
for bulk-allocating.
As accessing napi_alloc_cache implies NAPI softirq context, decaching
is protected with in_serving_softirq() check, with the option to
bypass the check when the context is 100% known.
iperf3 showed 35-70 Mbps bumps for both TCP and UDP while performing
VLAN NAT on 1.2 GHz MIPS board. The boost is likely to be way bigger
on more powerful hosts and NICs with tens of Mpps.
Note on skbuff_heads from distant slabs or pfmemalloc'ed slabs:
- kmalloc()/kmem_cache_alloc() itself allows by default allocating
memory from the remote nodes to defragment their slabs. This is
controlled by sysctl, but according to this, skbuff_head from a
remote node is an OK case;
- The easiest way to check if the slab of skbuff_head is remote or
pfmemalloc'ed is:
if (!dev_page_is_reusable(virt_to_head_page(skb)))
/* drop it */;
...*but*, regarding that most slabs are built of compound pages,
virt_to_head_page() will hit unlikely-branch every single call.
This check costed at least 20 Mbps in test scenarios and seems
like it'd be better to _not_ do this.
Since v2 [1]:
- also cover {,__}alloc_skb() and {,__}build_skb() cases (became handy
after the changes that pass tiny skbs requests to kmalloc layer);
- cover the cache with KASAN instrumentation (suggested by Eric
Dumazet, help of Dmitry Vyukov);
- completely drop redundant __kfree_skb_flush() (also Eric);
- lots of code cleanups;
- expand the commit message with NUMA and pfmemalloc points (Jakub).
Since v1 [0]:
- use one unified cache instead of two separate to greatly simplify
the logics and reduce hotpath overhead (Edward Cree);
- new: recycle also GRO_MERGED_FREE skbs instead of immediate
freeing;
- correct performance numbers after optimizations and performing
lots of tests for different use cases.
[0] https://lore.kernel.org/netdev/[email protected]
[1] https://lore.kernel.org/netdev/[email protected]
Alexander Lobakin (10):
skbuff: move __alloc_skb() next to the other skb allocation functions
skbuff: simplify kmalloc_reserve()
skbuff: make __build_skb_around() return void
skbuff: simplify __alloc_skb() a bit
skbuff: use __build_skb_around() in __alloc_skb()
skbuff: remove __kfree_skb_flush()
skbuff: move NAPI cache declarations upper in the file
skbuff: reuse NAPI skb cache on allocation path (__build_skb())
skbuff: reuse NAPI skb cache on allocation path (__alloc_skb())
skbuff: queue NAPI_MERGED_FREE skbs into NAPI cache instead of freeing
include/linux/skbuff.h | 4 +-
net/core/dev.c | 15 +-
net/core/skbuff.c | 392 ++++++++++++++++++++-------------------
net/netlink/af_netlink.c | 2 +-
4 files changed, 202 insertions(+), 211 deletions(-)
--
2.30.0
Use unlikely() annotations for skbuff_head and data similarly to the
two other allocation functions and remove totally redundant goto.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 11 +++++------
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index c7d184e11547..88566de26cd1 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -339,8 +339,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
/* Get the HEAD */
skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
- if (!skb)
- goto out;
+ if (unlikely(!skb))
+ return NULL;
prefetchw(skb);
/* We do our best to align skb_shared_info on a separate cache
@@ -351,7 +351,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
size = SKB_DATA_ALIGN(size);
size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
- if (!data)
+ if (unlikely(!data))
goto nodata;
/* kmalloc(size) might give us more room than requested.
* Put skb_shared_info exactly at the end of allocated zone,
@@ -395,12 +395,11 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
skb_set_kcov_handle(skb, kcov_common_handle());
-out:
return skb;
+
nodata:
kmem_cache_free(cache, skb);
- skb = NULL;
- goto out;
+ return NULL;
}
EXPORT_SYMBOL(__alloc_skb);
--
2.30.0
NAPI cache structures will be used for allocating skbuff_heads,
so move their declarations a bit upper.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 90 +++++++++++++++++++++++------------------------
1 file changed, 45 insertions(+), 45 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 4be2bb969535..860a9d4f752f 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -119,6 +119,51 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
skb_panic(skb, sz, addr, __func__);
}
+#define NAPI_SKB_CACHE_SIZE 64
+
+struct napi_alloc_cache {
+ struct page_frag_cache page;
+ unsigned int skb_count;
+ void *skb_cache[NAPI_SKB_CACHE_SIZE];
+};
+
+static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
+static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
+
+static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask,
+ unsigned int align_mask)
+{
+ struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+
+ return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask);
+}
+
+void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+{
+ fragsz = SKB_DATA_ALIGN(fragsz);
+
+ return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask);
+}
+EXPORT_SYMBOL(__napi_alloc_frag_align);
+
+void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
+{
+ struct page_frag_cache *nc;
+ void *data;
+
+ fragsz = SKB_DATA_ALIGN(fragsz);
+ if (in_irq() || irqs_disabled()) {
+ nc = this_cpu_ptr(&netdev_alloc_cache);
+ data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
+ } else {
+ local_bh_disable();
+ data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask);
+ local_bh_enable();
+ }
+ return data;
+}
+EXPORT_SYMBOL(__netdev_alloc_frag_align);
+
/* Caller must provide SKB that is memset cleared */
static void __build_skb_around(struct sk_buff *skb, void *data,
unsigned int frag_size)
@@ -220,51 +265,6 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
}
EXPORT_SYMBOL(build_skb_around);
-#define NAPI_SKB_CACHE_SIZE 64
-
-struct napi_alloc_cache {
- struct page_frag_cache page;
- unsigned int skb_count;
- void *skb_cache[NAPI_SKB_CACHE_SIZE];
-};
-
-static DEFINE_PER_CPU(struct page_frag_cache, netdev_alloc_cache);
-static DEFINE_PER_CPU(struct napi_alloc_cache, napi_alloc_cache);
-
-static void *__alloc_frag_align(unsigned int fragsz, gfp_t gfp_mask,
- unsigned int align_mask)
-{
- struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
- return page_frag_alloc_align(&nc->page, fragsz, gfp_mask, align_mask);
-}
-
-void *__napi_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
-{
- fragsz = SKB_DATA_ALIGN(fragsz);
-
- return __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask);
-}
-EXPORT_SYMBOL(__napi_alloc_frag_align);
-
-void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
-{
- struct page_frag_cache *nc;
- void *data;
-
- fragsz = SKB_DATA_ALIGN(fragsz);
- if (in_irq() || irqs_disabled()) {
- nc = this_cpu_ptr(&netdev_alloc_cache);
- data = page_frag_alloc_align(nc, fragsz, GFP_ATOMIC, align_mask);
- } else {
- local_bh_disable();
- data = __alloc_frag_align(fragsz, GFP_ATOMIC, align_mask);
- local_bh_enable();
- }
- return data;
-}
-EXPORT_SYMBOL(__netdev_alloc_frag_align);
-
/*
* kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
* the caller if emergency pfmemalloc reserves are being used. If it is and
--
2.30.0
napi_frags_finish() and napi_skb_finish() can only be called inside
NAPI Rx context, so we can feed NAPI cache with skbuff_heads that
got NAPI_MERGED_FREE verdict instead of immediate freeing.
Replace __kfree_skb() with __kfree_skb_defer() in napi_skb_finish()
and move napi_skb_free_stolen_head() to skbuff.c, so it can drop skbs
to NAPI cache.
As many drivers call napi_alloc_skb()/napi_get_frags() on their
receive path, this becomes especially useful.
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/skbuff.h | 1 +
net/core/dev.c | 9 +--------
net/core/skbuff.c | 12 +++++++++---
3 files changed, 11 insertions(+), 11 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 5bb443d37bf4..f8737ad91cc7 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2919,6 +2919,7 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
}
void napi_consume_skb(struct sk_buff *skb, int budget);
+void napi_skb_free_stolen_head(struct sk_buff *skb);
void __kfree_skb_defer(struct sk_buff *skb);
/**
diff --git a/net/core/dev.c b/net/core/dev.c
index 135d46c0c3c7..68ad03382f6a 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -6056,13 +6056,6 @@ struct packet_offload *gro_find_complete_by_type(__be16 type)
}
EXPORT_SYMBOL(gro_find_complete_by_type);
-static void napi_skb_free_stolen_head(struct sk_buff *skb)
-{
- skb_dst_drop(skb);
- skb_ext_put(skb);
- kmem_cache_free(skbuff_head_cache, skb);
-}
-
static gro_result_t napi_skb_finish(struct napi_struct *napi,
struct sk_buff *skb,
gro_result_t ret)
@@ -6076,7 +6069,7 @@ static gro_result_t napi_skb_finish(struct napi_struct *napi,
if (NAPI_GRO_CB(skb)->free == NAPI_GRO_FREE_STOLEN_HEAD)
napi_skb_free_stolen_head(skb);
else
- __kfree_skb(skb);
+ __kfree_skb_defer(skb);
break;
case GRO_HELD:
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8850086f8605..6dbc486c1d68 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -872,9 +872,6 @@ static void napi_skb_cache_put(struct sk_buff *skb)
struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
u32 i;
- /* drop skb->head and call any destructors for packet */
- skb_release_all(skb);
-
kasan_poison_object_data(skbuff_head_cache, skb);
nc->skb_cache[nc->skb_count++] = skb;
@@ -891,6 +888,14 @@ static void napi_skb_cache_put(struct sk_buff *skb)
void __kfree_skb_defer(struct sk_buff *skb)
{
+ skb_release_all(skb);
+ napi_skb_cache_put(skb);
+}
+
+void napi_skb_free_stolen_head(struct sk_buff *skb)
+{
+ skb_dst_drop(skb);
+ skb_ext_put(skb);
napi_skb_cache_put(skb);
}
@@ -916,6 +921,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
return;
}
+ skb_release_all(skb);
napi_skb_cache_put(skb);
}
EXPORT_SYMBOL(napi_consume_skb);
--
2.30.0
__build_skb_around() can never fail and always returns passed skb.
Make it return void to simplify and optimize the code.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 13 ++++++-------
1 file changed, 6 insertions(+), 7 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 70289f22a6f4..c7d184e11547 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -120,8 +120,8 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
}
/* Caller must provide SKB that is memset cleared */
-static struct sk_buff *__build_skb_around(struct sk_buff *skb,
- void *data, unsigned int frag_size)
+static void __build_skb_around(struct sk_buff *skb, void *data,
+ unsigned int frag_size)
{
struct skb_shared_info *shinfo;
unsigned int size = frag_size ? : ksize(data);
@@ -144,8 +144,6 @@ static struct sk_buff *__build_skb_around(struct sk_buff *skb,
atomic_set(&shinfo->dataref, 1);
skb_set_kcov_handle(skb, kcov_common_handle());
-
- return skb;
}
/**
@@ -176,8 +174,9 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
return NULL;
memset(skb, 0, offsetof(struct sk_buff, tail));
+ __build_skb_around(skb, data, frag_size);
- return __build_skb_around(skb, data, frag_size);
+ return skb;
}
/* build_skb() is wrapper over __build_skb(), that specifically
@@ -210,9 +209,9 @@ struct sk_buff *build_skb_around(struct sk_buff *skb,
if (unlikely(!skb))
return NULL;
- skb = __build_skb_around(skb, data, frag_size);
+ __build_skb_around(skb, data, frag_size);
- if (skb && frag_size) {
+ if (frag_size) {
skb->head_frag = 1;
if (page_is_pfmemalloc(virt_to_head_page(data)))
skb->pfmemalloc = 1;
--
2.30.0
Try to use the same technique for obtaining skbuff_head from NAPI
cache in {,__}alloc_skb(). Two points here:
- __alloc_skb() can be used for allocating clones or allocating skbs
for distant nodes. Try to grab head from the cache only for
non-clones and for local nodes;
- can be called from any context, so napi_safe == false.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 8747566a8136..8850086f8605 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -354,15 +354,19 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
struct sk_buff *skb;
u8 *data;
bool pfmemalloc;
+ bool clone;
- cache = (flags & SKB_ALLOC_FCLONE)
- ? skbuff_fclone_cache : skbuff_head_cache;
+ clone = !!(flags & SKB_ALLOC_FCLONE);
+ cache = clone ? skbuff_fclone_cache : skbuff_head_cache;
if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
gfp_mask |= __GFP_MEMALLOC;
/* Get the HEAD */
- skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
+ if (clone || unlikely(node != NUMA_NO_NODE && node != numa_mem_id()))
+ skb = kmem_cache_alloc_node(cache, gfp_mask & ~GFP_DMA, node);
+ else
+ skb = napi_skb_cache_get(false);
if (unlikely(!skb))
return NULL;
prefetchw(skb);
@@ -393,7 +397,7 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
__build_skb_around(skb, data, 0);
skb->pfmemalloc = pfmemalloc;
- if (flags & SKB_ALLOC_FCLONE) {
+ if (clone) {
struct sk_buff_fclones *fclones;
fclones = container_of(skb, struct sk_buff_fclones, skb1);
--
2.30.0
Instead of just bulk-flushing skbuff_heads queued up through
napi_consume_skb() or __kfree_skb_defer(), try to reuse them
on allocation path.
If the cache is empty on allocation, bulk-allocate the first
half, which is more efficient than per-skb allocation.
If the cache is full on freeing, bulk-wipe the second half.
This also includes custom KASAN poisoning/unpoisoning to be
double sure there are no use-after-free cases.
Functions that got cache fastpath:
- {,__}build_skb();
- {,__}netdev_alloc_skb();
- {,__}napi_alloc_skb().
Note on "napi_safe" argument:
NAPI cache should be accessed only from BH-disabled or (better)
NAPI context. To make sure access is safe, in_serving_softirq()
check is used.
Hovewer, there are plenty of cases when we know for sure that
we're in such context. This includes: build_skb() (called only
from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
(called from the same place or from kernel network softirq
functions).
We can use that knowledge to avoid unnecessary checks.
Suggested-by: Edward Cree <[email protected]> # Unified cache part
Suggested-by: Eric Dumazet <[email protected]> # KASAN poisoning
Suggested-by: Dmitry Vyukov <[email protected]> # Help with KASAN
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/skbuff.h | 2 +-
net/core/skbuff.c | 61 ++++++++++++++++++++++++++++------------
net/netlink/af_netlink.c | 2 +-
3 files changed, 45 insertions(+), 20 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0e0707296098..5bb443d37bf4 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -1082,7 +1082,7 @@ bool skb_try_coalesce(struct sk_buff *to, struct sk_buff *from,
struct sk_buff *__alloc_skb(unsigned int size, gfp_t priority, int flags,
int node);
-struct sk_buff *__build_skb(void *data, unsigned int frag_size);
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe);
struct sk_buff *build_skb(void *data, unsigned int frag_size);
struct sk_buff *build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size);
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 860a9d4f752f..8747566a8136 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -120,6 +120,7 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
}
#define NAPI_SKB_CACHE_SIZE 64
+#define NAPI_SKB_CACHE_HALF (NAPI_SKB_CACHE_SIZE / 2)
struct napi_alloc_cache {
struct page_frag_cache page;
@@ -164,6 +165,30 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
}
EXPORT_SYMBOL(__netdev_alloc_frag_align);
+static struct sk_buff *napi_skb_cache_get(bool napi_safe)
+{
+ struct napi_alloc_cache *nc;
+ struct sk_buff *skb;
+
+ if (!napi_safe && unlikely(!in_serving_softirq()))
+ return kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+
+ nc = this_cpu_ptr(&napi_alloc_cache);
+
+ if (unlikely(!nc->skb_count))
+ nc->skb_count = kmem_cache_alloc_bulk(skbuff_head_cache,
+ GFP_ATOMIC,
+ NAPI_SKB_CACHE_HALF,
+ nc->skb_cache);
+ if (unlikely(!nc->skb_count))
+ return NULL;
+
+ skb = nc->skb_cache[--nc->skb_count];
+ kasan_unpoison_object_data(skbuff_head_cache, skb);
+
+ return skb;
+}
+
/* Caller must provide SKB that is memset cleared */
static void __build_skb_around(struct sk_buff *skb, void *data,
unsigned int frag_size)
@@ -210,11 +235,11 @@ static void __build_skb_around(struct sk_buff *skb, void *data,
* before giving packet to stack.
* RX rings only contains data buffers, not full skbs.
*/
-struct sk_buff *__build_skb(void *data, unsigned int frag_size)
+struct sk_buff *__build_skb(void *data, unsigned int frag_size, bool napi_safe)
{
struct sk_buff *skb;
- skb = kmem_cache_alloc(skbuff_head_cache, GFP_ATOMIC);
+ skb = napi_skb_cache_get(napi_safe);
if (unlikely(!skb))
return NULL;
@@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
*/
struct sk_buff *build_skb(void *data, unsigned int frag_size)
{
- struct sk_buff *skb = __build_skb(data, frag_size);
+ struct sk_buff *skb = __build_skb(data, frag_size, true);
if (skb && frag_size) {
skb->head_frag = 1;
@@ -443,7 +468,7 @@ struct sk_buff *__netdev_alloc_skb(struct net_device *dev, unsigned int len,
if (unlikely(!data))
return NULL;
- skb = __build_skb(data, len);
+ skb = __build_skb(data, len, false);
if (unlikely(!skb)) {
skb_free_frag(data);
return NULL;
@@ -507,7 +532,7 @@ struct sk_buff *__napi_alloc_skb(struct napi_struct *napi, unsigned int len,
if (unlikely(!data))
return NULL;
- skb = __build_skb(data, len);
+ skb = __build_skb(data, len, true);
if (unlikely(!skb)) {
skb_free_frag(data);
return NULL;
@@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
kfree_skbmem(skb);
}
-static inline void _kfree_skb_defer(struct sk_buff *skb)
+static void napi_skb_cache_put(struct sk_buff *skb)
{
struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
+ u32 i;
/* drop skb->head and call any destructors for packet */
skb_release_all(skb);
- /* record skb to CPU local list */
+ kasan_poison_object_data(skbuff_head_cache, skb);
nc->skb_cache[nc->skb_count++] = skb;
-#ifdef CONFIG_SLUB
- /* SLUB writes into objects when freeing */
- prefetchw(skb);
-#endif
-
- /* flush skb_cache if it is filled */
if (unlikely(nc->skb_count == NAPI_SKB_CACHE_SIZE)) {
- kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_SIZE,
- nc->skb_cache);
- nc->skb_count = 0;
+ for (i = NAPI_SKB_CACHE_HALF; i < NAPI_SKB_CACHE_SIZE; i++)
+ kasan_unpoison_object_data(skbuff_head_cache,
+ nc->skb_cache[i]);
+
+ kmem_cache_free_bulk(skbuff_head_cache, NAPI_SKB_CACHE_HALF,
+ nc->skb_cache + NAPI_SKB_CACHE_HALF);
+ nc->skb_count = NAPI_SKB_CACHE_HALF;
}
}
+
void __kfree_skb_defer(struct sk_buff *skb)
{
- _kfree_skb_defer(skb);
+ napi_skb_cache_put(skb);
}
void napi_consume_skb(struct sk_buff *skb, int budget)
@@ -887,7 +912,7 @@ void napi_consume_skb(struct sk_buff *skb, int budget)
return;
}
- _kfree_skb_defer(skb);
+ napi_skb_cache_put(skb);
}
EXPORT_SYMBOL(napi_consume_skb);
diff --git a/net/netlink/af_netlink.c b/net/netlink/af_netlink.c
index dd488938447f..afba4e11a526 100644
--- a/net/netlink/af_netlink.c
+++ b/net/netlink/af_netlink.c
@@ -1190,7 +1190,7 @@ static struct sk_buff *netlink_alloc_large_skb(unsigned int size,
if (data == NULL)
return NULL;
- skb = __build_skb(data, size);
+ skb = __build_skb(data, size, false);
if (skb == NULL)
vfree(data);
else
--
2.30.0
Eversince the introduction of __kmalloc_reserve(), "ip" argument
hasn't been used. _RET_IP_ is embedded inside
kmalloc_node_track_caller().
Remove the redundant macro and rename the function after it.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index a0f846872d19..70289f22a6f4 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -273,11 +273,8 @@ EXPORT_SYMBOL(__netdev_alloc_frag_align);
* may be used. Otherwise, the packet data may be discarded until enough
* memory is free
*/
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
- __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
- unsigned long ip, bool *pfmemalloc)
+static void *kmalloc_reserve(size_t size, gfp_t flags, int node,
+ bool *pfmemalloc)
{
void *obj;
bool ret_pfmemalloc = false;
--
2.30.0
Just call __build_skb_around() instead of open-coding it.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 18 +-----------------
1 file changed, 1 insertion(+), 17 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 88566de26cd1..1c6f6ef70339 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -326,7 +326,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
int flags, int node)
{
struct kmem_cache *cache;
- struct skb_shared_info *shinfo;
struct sk_buff *skb;
u8 *data;
bool pfmemalloc;
@@ -366,21 +365,8 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
* the tail pointer in struct sk_buff!
*/
memset(skb, 0, offsetof(struct sk_buff, tail));
- /* Account for allocated memory : skb + skb->head */
- skb->truesize = SKB_TRUESIZE(size);
+ __build_skb_around(skb, data, 0);
skb->pfmemalloc = pfmemalloc;
- refcount_set(&skb->users, 1);
- skb->head = data;
- skb->data = data;
- skb_reset_tail_pointer(skb);
- skb->end = skb->tail + size;
- skb->mac_header = (typeof(skb->mac_header))~0U;
- skb->transport_header = (typeof(skb->transport_header))~0U;
-
- /* make sure we initialize shinfo sequentially */
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
if (flags & SKB_ALLOC_FCLONE) {
struct sk_buff_fclones *fclones;
@@ -393,8 +379,6 @@ struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
fclones->skb2.fclone = SKB_FCLONE_CLONE;
}
- skb_set_kcov_handle(skb, kcov_common_handle());
-
return skb;
nodata:
--
2.30.0
In preparation before reusing several functions in all three skb
allocation variants, move __alloc_skb() next to the
__netdev_alloc_skb() and __napi_alloc_skb().
No functional changes.
Signed-off-by: Alexander Lobakin <[email protected]>
---
net/core/skbuff.c | 284 +++++++++++++++++++++++-----------------------
1 file changed, 142 insertions(+), 142 deletions(-)
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index d380c7b5a12d..a0f846872d19 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -119,148 +119,6 @@ static void skb_under_panic(struct sk_buff *skb, unsigned int sz, void *addr)
skb_panic(skb, sz, addr, __func__);
}
-/*
- * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
- * the caller if emergency pfmemalloc reserves are being used. If it is and
- * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
- * may be used. Otherwise, the packet data may be discarded until enough
- * memory is free
- */
-#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
- __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
-
-static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
- unsigned long ip, bool *pfmemalloc)
-{
- void *obj;
- bool ret_pfmemalloc = false;
-
- /*
- * Try a regular allocation, when that fails and we're not entitled
- * to the reserves, fail.
- */
- obj = kmalloc_node_track_caller(size,
- flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
- node);
- if (obj || !(gfp_pfmemalloc_allowed(flags)))
- goto out;
-
- /* Try again but now we are using pfmemalloc reserves */
- ret_pfmemalloc = true;
- obj = kmalloc_node_track_caller(size, flags, node);
-
-out:
- if (pfmemalloc)
- *pfmemalloc = ret_pfmemalloc;
-
- return obj;
-}
-
-/* Allocate a new skbuff. We do this ourselves so we can fill in a few
- * 'private' fields and also do memory statistics to find all the
- * [BEEP] leaks.
- *
- */
-
-/**
- * __alloc_skb - allocate a network buffer
- * @size: size to allocate
- * @gfp_mask: allocation mask
- * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
- * instead of head cache and allocate a cloned (child) skb.
- * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
- * allocations in case the data is required for writeback
- * @node: numa node to allocate memory on
- *
- * Allocate a new &sk_buff. The returned buffer has no headroom and a
- * tail room of at least size bytes. The object has a reference count
- * of one. The return is the buffer. On a failure the return is %NULL.
- *
- * Buffers may only be allocated from interrupts using a @gfp_mask of
- * %GFP_ATOMIC.
- */
-struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
- int flags, int node)
-{
- struct kmem_cache *cache;
- struct skb_shared_info *shinfo;
- struct sk_buff *skb;
- u8 *data;
- bool pfmemalloc;
-
- cache = (flags & SKB_ALLOC_FCLONE)
- ? skbuff_fclone_cache : skbuff_head_cache;
-
- if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
- gfp_mask |= __GFP_MEMALLOC;
-
- /* Get the HEAD */
- skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
- if (!skb)
- goto out;
- prefetchw(skb);
-
- /* We do our best to align skb_shared_info on a separate cache
- * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
- * aligned memory blocks, unless SLUB/SLAB debug is enabled.
- * Both skb->head and skb_shared_info are cache line aligned.
- */
- size = SKB_DATA_ALIGN(size);
- size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
- data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
- if (!data)
- goto nodata;
- /* kmalloc(size) might give us more room than requested.
- * Put skb_shared_info exactly at the end of allocated zone,
- * to allow max possible filling before reallocation.
- */
- size = SKB_WITH_OVERHEAD(ksize(data));
- prefetchw(data + size);
-
- /*
- * Only clear those fields we need to clear, not those that we will
- * actually initialise below. Hence, don't put any more fields after
- * the tail pointer in struct sk_buff!
- */
- memset(skb, 0, offsetof(struct sk_buff, tail));
- /* Account for allocated memory : skb + skb->head */
- skb->truesize = SKB_TRUESIZE(size);
- skb->pfmemalloc = pfmemalloc;
- refcount_set(&skb->users, 1);
- skb->head = data;
- skb->data = data;
- skb_reset_tail_pointer(skb);
- skb->end = skb->tail + size;
- skb->mac_header = (typeof(skb->mac_header))~0U;
- skb->transport_header = (typeof(skb->transport_header))~0U;
-
- /* make sure we initialize shinfo sequentially */
- shinfo = skb_shinfo(skb);
- memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
- atomic_set(&shinfo->dataref, 1);
-
- if (flags & SKB_ALLOC_FCLONE) {
- struct sk_buff_fclones *fclones;
-
- fclones = container_of(skb, struct sk_buff_fclones, skb1);
-
- skb->fclone = SKB_FCLONE_ORIG;
- refcount_set(&fclones->fclone_ref, 1);
-
- fclones->skb2.fclone = SKB_FCLONE_CLONE;
- }
-
- skb_set_kcov_handle(skb, kcov_common_handle());
-
-out:
- return skb;
-nodata:
- kmem_cache_free(cache, skb);
- skb = NULL;
- goto out;
-}
-EXPORT_SYMBOL(__alloc_skb);
-
/* Caller must provide SKB that is memset cleared */
static struct sk_buff *__build_skb_around(struct sk_buff *skb,
void *data, unsigned int frag_size)
@@ -408,6 +266,148 @@ void *__netdev_alloc_frag_align(unsigned int fragsz, unsigned int align_mask)
}
EXPORT_SYMBOL(__netdev_alloc_frag_align);
+/*
+ * kmalloc_reserve is a wrapper around kmalloc_node_track_caller that tells
+ * the caller if emergency pfmemalloc reserves are being used. If it is and
+ * the socket is later found to be SOCK_MEMALLOC then PFMEMALLOC reserves
+ * may be used. Otherwise, the packet data may be discarded until enough
+ * memory is free
+ */
+#define kmalloc_reserve(size, gfp, node, pfmemalloc) \
+ __kmalloc_reserve(size, gfp, node, _RET_IP_, pfmemalloc)
+
+static void *__kmalloc_reserve(size_t size, gfp_t flags, int node,
+ unsigned long ip, bool *pfmemalloc)
+{
+ void *obj;
+ bool ret_pfmemalloc = false;
+
+ /*
+ * Try a regular allocation, when that fails and we're not entitled
+ * to the reserves, fail.
+ */
+ obj = kmalloc_node_track_caller(size,
+ flags | __GFP_NOMEMALLOC | __GFP_NOWARN,
+ node);
+ if (obj || !(gfp_pfmemalloc_allowed(flags)))
+ goto out;
+
+ /* Try again but now we are using pfmemalloc reserves */
+ ret_pfmemalloc = true;
+ obj = kmalloc_node_track_caller(size, flags, node);
+
+out:
+ if (pfmemalloc)
+ *pfmemalloc = ret_pfmemalloc;
+
+ return obj;
+}
+
+/* Allocate a new skbuff. We do this ourselves so we can fill in a few
+ * 'private' fields and also do memory statistics to find all the
+ * [BEEP] leaks.
+ *
+ */
+
+/**
+ * __alloc_skb - allocate a network buffer
+ * @size: size to allocate
+ * @gfp_mask: allocation mask
+ * @flags: If SKB_ALLOC_FCLONE is set, allocate from fclone cache
+ * instead of head cache and allocate a cloned (child) skb.
+ * If SKB_ALLOC_RX is set, __GFP_MEMALLOC will be used for
+ * allocations in case the data is required for writeback
+ * @node: numa node to allocate memory on
+ *
+ * Allocate a new &sk_buff. The returned buffer has no headroom and a
+ * tail room of at least size bytes. The object has a reference count
+ * of one. The return is the buffer. On a failure the return is %NULL.
+ *
+ * Buffers may only be allocated from interrupts using a @gfp_mask of
+ * %GFP_ATOMIC.
+ */
+struct sk_buff *__alloc_skb(unsigned int size, gfp_t gfp_mask,
+ int flags, int node)
+{
+ struct kmem_cache *cache;
+ struct skb_shared_info *shinfo;
+ struct sk_buff *skb;
+ u8 *data;
+ bool pfmemalloc;
+
+ cache = (flags & SKB_ALLOC_FCLONE)
+ ? skbuff_fclone_cache : skbuff_head_cache;
+
+ if (sk_memalloc_socks() && (flags & SKB_ALLOC_RX))
+ gfp_mask |= __GFP_MEMALLOC;
+
+ /* Get the HEAD */
+ skb = kmem_cache_alloc_node(cache, gfp_mask & ~__GFP_DMA, node);
+ if (!skb)
+ goto out;
+ prefetchw(skb);
+
+ /* We do our best to align skb_shared_info on a separate cache
+ * line. It usually works because kmalloc(X > SMP_CACHE_BYTES) gives
+ * aligned memory blocks, unless SLUB/SLAB debug is enabled.
+ * Both skb->head and skb_shared_info are cache line aligned.
+ */
+ size = SKB_DATA_ALIGN(size);
+ size += SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
+ data = kmalloc_reserve(size, gfp_mask, node, &pfmemalloc);
+ if (!data)
+ goto nodata;
+ /* kmalloc(size) might give us more room than requested.
+ * Put skb_shared_info exactly at the end of allocated zone,
+ * to allow max possible filling before reallocation.
+ */
+ size = SKB_WITH_OVERHEAD(ksize(data));
+ prefetchw(data + size);
+
+ /*
+ * Only clear those fields we need to clear, not those that we will
+ * actually initialise below. Hence, don't put any more fields after
+ * the tail pointer in struct sk_buff!
+ */
+ memset(skb, 0, offsetof(struct sk_buff, tail));
+ /* Account for allocated memory : skb + skb->head */
+ skb->truesize = SKB_TRUESIZE(size);
+ skb->pfmemalloc = pfmemalloc;
+ refcount_set(&skb->users, 1);
+ skb->head = data;
+ skb->data = data;
+ skb_reset_tail_pointer(skb);
+ skb->end = skb->tail + size;
+ skb->mac_header = (typeof(skb->mac_header))~0U;
+ skb->transport_header = (typeof(skb->transport_header))~0U;
+
+ /* make sure we initialize shinfo sequentially */
+ shinfo = skb_shinfo(skb);
+ memset(shinfo, 0, offsetof(struct skb_shared_info, dataref));
+ atomic_set(&shinfo->dataref, 1);
+
+ if (flags & SKB_ALLOC_FCLONE) {
+ struct sk_buff_fclones *fclones;
+
+ fclones = container_of(skb, struct sk_buff_fclones, skb1);
+
+ skb->fclone = SKB_FCLONE_ORIG;
+ refcount_set(&fclones->fclone_ref, 1);
+
+ fclones->skb2.fclone = SKB_FCLONE_CLONE;
+ }
+
+ skb_set_kcov_handle(skb, kcov_common_handle());
+
+out:
+ return skb;
+nodata:
+ kmem_cache_free(cache, skb);
+ skb = NULL;
+ goto out;
+}
+EXPORT_SYMBOL(__alloc_skb);
+
/**
* __netdev_alloc_skb - allocate an skbuff for rx on a specific device
* @dev: network device to receive on
--
2.30.0
This function isn't much needed as NAPI skb queue gets bulk-freed
anyway when there's no more room, and even may reduce the efficiency
of bulk operations.
It will be even less needed after reusing skb cache on allocation path,
so remove it and this way lighten network softirqs a bit.
Suggested-by: Eric Dumazet <[email protected]>
Signed-off-by: Alexander Lobakin <[email protected]>
---
include/linux/skbuff.h | 1 -
net/core/dev.c | 6 +-----
net/core/skbuff.c | 12 ------------
3 files changed, 1 insertion(+), 18 deletions(-)
diff --git a/include/linux/skbuff.h b/include/linux/skbuff.h
index 0a4e91a2f873..0e0707296098 100644
--- a/include/linux/skbuff.h
+++ b/include/linux/skbuff.h
@@ -2919,7 +2919,6 @@ static inline struct sk_buff *napi_alloc_skb(struct napi_struct *napi,
}
void napi_consume_skb(struct sk_buff *skb, int budget);
-void __kfree_skb_flush(void);
void __kfree_skb_defer(struct sk_buff *skb);
/**
diff --git a/net/core/dev.c b/net/core/dev.c
index 21d74d30f5d7..135d46c0c3c7 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -4906,8 +4906,6 @@ static __latent_entropy void net_tx_action(struct softirq_action *h)
else
__kfree_skb_defer(skb);
}
-
- __kfree_skb_flush();
}
if (sd->output_queue) {
@@ -6873,7 +6871,7 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
if (list_empty(&list)) {
if (!sd_has_rps_ipi_waiting(sd) && list_empty(&repoll))
- goto out;
+ return;
break;
}
@@ -6900,8 +6898,6 @@ static __latent_entropy void net_rx_action(struct softirq_action *h)
__raise_softirq_irqoff(NET_RX_SOFTIRQ);
net_rps_action_and_irq_enable(sd);
-out:
- __kfree_skb_flush();
}
struct netdev_adjacent {
diff --git a/net/core/skbuff.c b/net/core/skbuff.c
index 1c6f6ef70339..4be2bb969535 100644
--- a/net/core/skbuff.c
+++ b/net/core/skbuff.c
@@ -838,18 +838,6 @@ void __consume_stateless_skb(struct sk_buff *skb)
kfree_skbmem(skb);
}
-void __kfree_skb_flush(void)
-{
- struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
-
- /* flush skb_cache if containing objects */
- if (nc->skb_count) {
- kmem_cache_free_bulk(skbuff_head_cache, nc->skb_count,
- nc->skb_cache);
- nc->skb_count = 0;
- }
-}
-
static inline void _kfree_skb_defer(struct sk_buff *skb)
{
struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
--
2.30.0
Hello,
I'm sorry for the late feedback, I could not step-in before.
Also adding Jesper for awareness, as he introduced the bulk free
infrastructure.
On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> */
> struct sk_buff *build_skb(void *data, unsigned int frag_size)
> {
> - struct sk_buff *skb = __build_skb(data, frag_size);
> + struct sk_buff *skb = __build_skb(data, frag_size, true);
I must admit I'm a bit scared of this. There are several high speed
device drivers that will move to bulk allocation, and we don't have any
performance figure for them.
In my experience with (low end) MIPS board, cache misses cost tend to
be much less visible there compared to reasonably recent server H/W,
because the CPU/memory access time difference is much lower.
When moving to higher end H/W the performance gain you measured could
be completely countered by less optimal cache usage.
I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a
single skb would be visible e.g. in a round-robin test. Generally
speaking bulk allocating 32 skbs looks a bit too much. IIRC, when
Edward added listification to GRO, he did several measures with
different list size and found 8 to be the optimal value (for the tested
workload). Above such number the list become too big and the pressure
on the cache outweighted the bulking benefits.
Perhaps giving the device drivers the ability to opt-in on this infra
via a new helper - as done back then with napi_consume_skb() - would
make this change safer?
> @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> kfree_skbmem(skb);
> }
>
> -static inline void _kfree_skb_defer(struct sk_buff *skb)
> +static void napi_skb_cache_put(struct sk_buff *skb)
> {
> struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> + u32 i;
>
> /* drop skb->head and call any destructors for packet */
> skb_release_all(skb);
>
> - /* record skb to CPU local list */
> + kasan_poison_object_data(skbuff_head_cache, skb);
> nc->skb_cache[nc->skb_count++] = skb;
>
> -#ifdef CONFIG_SLUB
> - /* SLUB writes into objects when freeing */
> - prefetchw(skb);
> -#endif
It looks like this chunk has been lost. Is that intentional?
Thanks!
Paolo
From: Paolo Abeni <[email protected]>
Date: Wed, 10 Feb 2021 11:21:06 +0100
> Hello,
Hi!
> I'm sorry for the late feedback, I could not step-in before.
>
> Also adding Jesper for awareness, as he introduced the bulk free
> infrastructure.
>
> On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > */
> > struct sk_buff *build_skb(void *data, unsigned int frag_size)
> > {
> > - struct sk_buff *skb = __build_skb(data, frag_size);
> > + struct sk_buff *skb = __build_skb(data, frag_size, true);
>
> I must admit I'm a bit scared of this. There are several high speed
> device drivers that will move to bulk allocation, and we don't have any
> performance figure for them.
>
> In my experience with (low end) MIPS board, cache misses cost tend to
> be much less visible there compared to reasonably recent server H/W,
> because the CPU/memory access time difference is much lower.
>
> When moving to higher end H/W the performance gain you measured could
> be completely countered by less optimal cache usage.
>
> I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a
> single skb would be visible e.g. in a round-robin test. Generally
> speaking bulk allocating 32 skbs looks a bit too much. IIRC, when
> Edward added listification to GRO, he did several measures with
> different list size and found 8 to be the optimal value (for the tested
> workload). Above such number the list become too big and the pressure
> on the cache outweighted the bulking benefits.
I can change to logics the way so it would allocate the first 8.
I think I've already seen this batch value somewhere in XDP code,
so this might be a balanced one.
Regarding bulk-freeing: can the batch size make sense when freeing
or it's okay to wipe 32 (currently 64 in baseline) in a row?
> Perhaps giving the device drivers the ability to opt-in on this infra
> via a new helper - as done back then with napi_consume_skb() - would
> make this change safer?
That's actually a very nice idea. There's only a little in the code
to change to introduce an ability to take heads from the cache
optionally. This way developers could switch to it when needed.
Thanks for the suggestions! I'll definitely absorb them into the code
and give it a test.
> > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> > kfree_skbmem(skb);
> > }
> >
> > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > +static void napi_skb_cache_put(struct sk_buff *skb)
> > {
> > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > + u32 i;
> >
> > /* drop skb->head and call any destructors for packet */
> > skb_release_all(skb);
> >
> > - /* record skb to CPU local list */
> > + kasan_poison_object_data(skbuff_head_cache, skb);
> > nc->skb_cache[nc->skb_count++] = skb;
> >
> > -#ifdef CONFIG_SLUB
> > - /* SLUB writes into objects when freeing */
> > - prefetchw(skb);
> > -#endif
>
> It looks like this chunk has been lost. Is that intentional?
Yep. This prefetchw() assumed that skbuff_heads will be wiped
immediately or at the end of network softirq. Reusing this cache
means that heads can be reused later or may be kept in a cache for
some time, so prefetching makes no sense anymore.
> Thanks!
>
> Paolo
Al
On Wed, 2021-02-10 at 12:25 +0000, Alexander Lobakin wrote:
> Paolo Abeni <[email protected]> on Wed, 10 Feb 2021 11:21:06 +0100 wrote:
> > Perhaps giving the device drivers the ability to opt-in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
>
> That's actually a very nice idea. There's only a little in the code
> to change to introduce an ability to take heads from the cache
> optionally. This way developers could switch to it when needed.
>
> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.
Quick reply before is too late. I suggest to wait a bit for others
opinions before coding - if others dislike this I would regret wasting
your time.
Cheers,
Paolo
Alexander Lobakin <[email protected]> wrote:
> we're in such context. This includes: build_skb() (called only
> from NIC drivers in NAPI Rx context) and {,__}napi_alloc_skb()
> (called from the same place or from kernel network softirq
> functions).
build_skb is called from sleepable context in drivers/net/tun.c .
Perhaps there are other cases.
On Wed, 10 Feb 2021 12:25:04 +0000
Alexander Lobakin <[email protected]> wrote:
> From: Paolo Abeni <[email protected]>
> Date: Wed, 10 Feb 2021 11:21:06 +0100
>
> > I'm sorry for the late feedback, I could not step-in before.
> >
> > Also adding Jesper for awareness, as he introduced the bulk free
> > infrastructure.
Thanks (and Alexander Duyck also did part of the work while at Red Hat).
In my initial versions of my patchsets I actually also had reuse of the
SKBs that were defer freed during NAPI context. But I dropped that
part because it was getting nitpicked and the merge window was getting
close, so I ended up dropping that part.
> > On Tue, 2021-02-09 at 20:48 +0000, Alexander Lobakin wrote:
> > > @@ -231,7 +256,7 @@ struct sk_buff *__build_skb(void *data, unsigned int frag_size)
> > > */
> > > struct sk_buff *build_skb(void *data, unsigned int frag_size)
> > > {
> > > - struct sk_buff *skb = __build_skb(data, frag_size);
> > > + struct sk_buff *skb = __build_skb(data, frag_size, true);
> >
> > I must admit I'm a bit scared of this. There are several high speed
> > device drivers that will move to bulk allocation, and we don't have any
> > performance figure for them.
> >
> > In my experience with (low end) MIPS board, cache misses cost tend to
> > be much less visible there compared to reasonably recent server H/W,
> > because the CPU/memory access time difference is much lower.
> >
> > When moving to higher end H/W the performance gain you measured could
> > be completely countered by less optimal cache usage.
> >
> > I fear also latency spikes - I'm unsure if a 32 skbs allocation vs a
> > single skb would be visible e.g. in a round-robin test. Generally
> > speaking bulk allocating 32 skbs looks a bit too much. IIRC, when
> > Edward added listification to GRO, he did several measures with
> > different list size and found 8 to be the optimal value (for the tested
> > workload). Above such number the list become too big and the pressure
> > on the cache outweighted the bulking benefits.
>
> I can change to logics the way so it would allocate the first 8.
> I think I've already seen this batch value somewhere in XDP code,
> so this might be a balanced one.
(Speaking about SLUB code): Bulk ALLOC side disables interrupts, and
can call slow path (___slab_alloc), which is bad for latency sensitive
workloads. This I don't recommend large bulk ALLOCATIONS.
> Regarding bulk-freeing: can the batch size make sense when freeing
> or it's okay to wipe 32 (currently 64 in baseline) in a row?
(Speaking about SLUB code): You can bulk FREE large amount of object
without hurting latency sensitive workloads, because it doesn't disable
interrupts (I'm quite proud that this was possible).
> > Perhaps giving the device drivers the ability to opt-in on this infra
> > via a new helper - as done back then with napi_consume_skb() - would
> > make this change safer?
>
> That's actually a very nice idea. There's only a little in the code
> to change to introduce an ability to take heads from the cache
> optionally. This way developers could switch to it when needed.
Well, I actually disagree that this should be hidden behind a switch
for drivers to enable, as this will take forever to get proper enabled.
> Thanks for the suggestions! I'll definitely absorb them into the code
> and give it a test.
>
> > > @@ -838,31 +863,31 @@ void __consume_stateless_skb(struct sk_buff *skb)
> > > kfree_skbmem(skb);
> > > }
> > >
> > > -static inline void _kfree_skb_defer(struct sk_buff *skb)
> > > +static void napi_skb_cache_put(struct sk_buff *skb)
> > > {
> > > struct napi_alloc_cache *nc = this_cpu_ptr(&napi_alloc_cache);
> > > + u32 i;
> > >
> > > /* drop skb->head and call any destructors for packet */
> > > skb_release_all(skb);
> > >
> > > - /* record skb to CPU local list */
> > > + kasan_poison_object_data(skbuff_head_cache, skb);
> > > nc->skb_cache[nc->skb_count++] = skb;
> > >
> > > -#ifdef CONFIG_SLUB
> > > - /* SLUB writes into objects when freeing */
> > > - prefetchw(skb);
> > > -#endif
> >
> > It looks like this chunk has been lost. Is that intentional?
>
> Yep. This prefetchw() assumed that skbuff_heads will be wiped
> immediately or at the end of network softirq. Reusing this cache
> means that heads can be reused later or may be kept in a cache for
> some time, so prefetching makes no sense anymore.
I agree with this statement, the prefetchw() is no-longer needed.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer