After [1], there are only two implementations for page frag:
1. mm/page_alloc.c: net stack seems to be using it in the
rx part with 'struct page_frag_cache' and the main API
being page_frag_alloc_align().
2. net/core/sock.c: net stack seems to be using it in the
tx part with 'struct page_frag' and the main API being
skb_page_frag_refill().
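A rough sketch of the two call patterns being unified (purely
illustrative; error handling elided):

    /* 1. rx side: 'struct page_frag_cache' in mm/page_alloc.c */
    struct page_frag_cache nc = {};
    void *va = page_frag_alloc_align(&nc, 256, GFP_ATOMIC,
                                     SMP_CACHE_BYTES);

    /* 2. tx side: 'struct page_frag' in net/core/sock.c */
    struct page_frag *pfrag = sk_page_frag(sk);
    if (skb_page_frag_refill(256, pfrag, sk->sk_allocation)) {
            /* use pfrag->page + pfrag->offset, then advance */
            pfrag->offset += 256;
    }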
This patchset tries to unify the page frag implementation
by replacing page_frag with page_frag_cache for sk_page_frag()
first. net_high_order_alloc_disable_key for the implementation
in net/core/sock.c doesn't seem to matter that much now that we
have pcp support for high-order pages in commit 44042b449872
("mm/page_alloc: allow high-order pages to be stored on the
per-cpu lists").
As the changes are mostly related to networking, this is
targeting net-next. The rest of the page_frag usage will be
replaced in a follow-up patchset.
After this patchset, we are not only able to unify the page
frag implementation a little, but also seem to get about a
0.5+% performance boost when testing with the vhost_net_test
introduced in [1] and the page_frag_test.ko introduced in this
patch.
Before this patchset:
Performance counter stats for './vhost_net_test' (10 runs):
603027.29 msec task-clock # 1.756 CPUs utilized ( +- 0.04% )
2097713 context-switches # 3.479 K/sec ( +- 0.00% )
212 cpu-migrations # 0.352 /sec ( +- 4.72% )
40 page-faults # 0.066 /sec ( +- 1.18% )
467215266413 cycles # 0.775 GHz ( +- 0.12% ) (66.02%)
131736729037 stalled-cycles-frontend # 28.20% frontend cycles idle ( +- 2.38% ) (64.34%)
77728393294 stalled-cycles-backend # 16.64% backend cycles idle ( +- 3.98% ) (65.42%)
345874254764 instructions # 0.74 insn per cycle
# 0.38 stalled cycles per insn ( +- 0.75% ) (70.28%)
105166217892 branches # 174.397 M/sec ( +- 0.65% ) (68.56%)
9649321070 branch-misses # 9.18% of all branches ( +- 0.69% ) (65.38%)
343.376 +- 0.147 seconds time elapsed ( +- 0.04% )
Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):
39.12 msec task-clock # 0.001 CPUs utilized ( +- 4.51% )
5 context-switches # 127.805 /sec ( +- 3.76% )
1 cpu-migrations # 25.561 /sec ( +- 15.52% )
197 page-faults # 5.035 K/sec ( +- 0.10% )
10689913 cycles # 0.273 GHz ( +- 9.46% ) (72.72%)
2821237 stalled-cycles-frontend # 26.39% frontend cycles idle ( +- 12.04% ) (76.23%)
5035549 stalled-cycles-backend # 47.11% backend cycles idle ( +- 9.69% ) (49.40%)
5439395 instructions # 0.51 insn per cycle
# 0.93 stalled cycles per insn ( +- 11.58% ) (51.45%)
1274419 branches # 32.575 M/sec ( +- 12.69% ) (77.88%)
49562 branch-misses # 3.89% of all branches ( +- 9.91% ) (72.32%)
30.309 +- 0.305 seconds time elapsed ( +- 1.01% )
After this patchset:
Performance counter stats for './vhost_net_test' (10 runs):
598081.02 msec task-clock # 1.752 CPUs utilized ( +- 0.11% )
2097738 context-switches # 3.507 K/sec ( +- 0.00% )
220 cpu-migrations # 0.368 /sec ( +- 6.58% )
40 page-faults # 0.067 /sec ( +- 0.92% )
469788205101 cycles # 0.785 GHz ( +- 0.27% ) (64.86%)
137108509582 stalled-cycles-frontend # 29.19% frontend cycles idle ( +- 0.96% ) (63.62%)
75499065401 stalled-cycles-backend # 16.07% backend cycles idle ( +- 1.04% ) (65.86%)
345469451681 instructions # 0.74 insn per cycle
# 0.40 stalled cycles per insn ( +- 0.37% ) (70.16%)
102782224964 branches # 171.853 M/sec ( +- 0.62% ) (69.28%)
9295357532 branch-misses # 9.04% of all branches ( +- 1.08% ) (66.21%)
341.466 +- 0.305 seconds time elapsed ( +- 0.09% )
Performance counter stats for 'insmod ./page_frag_test.ko nr_test=99999999' (30 runs):
40.09 msec task-clock # 0.001 CPUs utilized ( +- 4.60% )
5 context-switches # 124.722 /sec ( +- 3.45% )
1 cpu-migrations # 24.944 /sec ( +- 12.62% )
197 page-faults # 4.914 K/sec ( +- 0.11% )
10221721 cycles # 0.255 GHz ( +- 9.05% ) (27.73%)
2459009 stalled-cycles-frontend # 24.06% frontend cycles idle ( +- 10.80% ) (29.05%)
5148423 stalled-cycles-backend # 50.37% backend cycles idle ( +- 7.30% ) (82.47%)
5889929 instructions # 0.58 insn per cycle
# 0.87 stalled cycles per insn ( +- 11.85% ) (87.75%)
1276667 branches # 31.846 M/sec ( +- 11.48% ) (89.80%)
50631 branch-misses # 3.97% of all branches ( +- 8.72% ) (83.20%)
29.341 +- 0.300 seconds time elapsed ( +- 1.02% )
CC: Alexander Duyck <[email protected]>
1. https://lore.kernel.org/all/[email protected]/
Yunsheng Lin (12):
mm: Move the page fragment allocator from page_alloc into its own file
mm: page_frag: use initial zero offset for page_frag_alloc_align()
mm: page_frag: change page_frag_alloc_* API to accept align param
mm: page_frag: add '_va' suffix to page_frag API
mm: page_frag: add two inline helper for page_frag API
mm: page_frag: reuse MSB of 'size' field for pfmemalloc
mm: page_frag: reuse existing bit field of 'va' for pagecnt_bias
net: introduce the skb_copy_to_va_nocache() helper
mm: page_frag: introduce prepare/commit API for page_frag
net: replace page_frag with page_frag_cache
mm: page_frag: add a test module for page_frag
mm: page_frag: update documentation and maintainer for page_frag
Documentation/mm/page_frags.rst | 115 ++++--
MAINTAINERS | 10 +
.../chelsio/inline_crypto/chtls/chtls.h | 3 -
.../chelsio/inline_crypto/chtls/chtls_io.c | 101 ++---
.../chelsio/inline_crypto/chtls/chtls_main.c | 3 -
drivers/net/ethernet/google/gve/gve_rx.c | 4 +-
drivers/net/ethernet/intel/ice/ice_txrx.c | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx.h | 2 +-
drivers/net/ethernet/intel/ice/ice_txrx_lib.c | 2 +-
.../net/ethernet/intel/ixgbevf/ixgbevf_main.c | 4 +-
.../marvell/octeontx2/nic/otx2_common.c | 2 +-
drivers/net/ethernet/mediatek/mtk_wed_wo.c | 4 +-
drivers/net/tun.c | 34 +-
drivers/nvme/host/tcp.c | 8 +-
drivers/nvme/target/tcp.c | 22 +-
drivers/vhost/net.c | 6 +-
include/linux/gfp.h | 22 --
include/linux/mm_types.h | 18 -
include/linux/page_frag_cache.h | 339 ++++++++++++++++
include/linux/sched.h | 4 +-
include/linux/skbuff.h | 15 +-
include/net/sock.h | 29 +-
kernel/bpf/cpumap.c | 2 +-
kernel/exit.c | 3 +-
kernel/fork.c | 2 +-
mm/Kconfig.debug | 8 +
mm/Makefile | 2 +
mm/page_alloc.c | 136 -------
mm/page_frag_cache.c | 185 +++++++++
mm/page_frag_test.c | 366 ++++++++++++++++++
net/core/skbuff.c | 57 +--
net/core/skmsg.c | 22 +-
net/core/sock.c | 46 ++-
net/core/xdp.c | 2 +-
net/ipv4/ip_output.c | 35 +-
net/ipv4/tcp.c | 35 +-
net/ipv4/tcp_output.c | 28 +-
net/ipv6/ip6_output.c | 35 +-
net/kcm/kcmsock.c | 30 +-
net/mptcp/protocol.c | 74 ++--
net/rxrpc/txbuf.c | 16 +-
net/sunrpc/svcsock.c | 4 +-
net/tls/tls_device.c | 139 ++++---
43 files changed, 1404 insertions(+), 572 deletions(-)
create mode 100644 include/linux/page_frag_cache.h
create mode 100644 mm/page_frag_cache.c
create mode 100644 mm/page_frag_test.c
--
2.33.0
Update the documentation about the design, implementation and
API usage of page_frag.
Also update MAINTAINERS for page_frag. Alexander seems to be the
original author of page_frag; we can add him to MAINTAINERS
later if we get an ack from him.
CC: Alexander Duyck <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
---
Documentation/mm/page_frags.rst | 115 ++++++++++++++++++----------
MAINTAINERS | 10 +++
include/linux/page_frag_cache.h | 128 ++++++++++++++++++++++++++++++++
mm/page_frag_cache.c | 51 ++++++++++---
4 files changed, 256 insertions(+), 48 deletions(-)
diff --git a/Documentation/mm/page_frags.rst b/Documentation/mm/page_frags.rst
index 503ca6cdb804..77256dfb58bf 100644
--- a/Documentation/mm/page_frags.rst
+++ b/Documentation/mm/page_frags.rst
@@ -1,43 +1,80 @@
+.. SPDX-License-Identifier: GPL-2.0
+
==============
Page fragments
==============
-A page fragment is an arbitrary-length arbitrary-offset area of memory
-which resides within a 0 or higher order compound page. Multiple
-fragments within that page are individually refcounted, in the page's
-reference counter.
-
-The page_frag functions, page_frag_alloc and page_frag_free, provide a
-simple allocation framework for page fragments. This is used by the
-network stack and network device drivers to provide a backing region of
-memory for use as either an sk_buff->head, or to be used in the "frags"
-portion of skb_shared_info.
-
-In order to make use of the page fragment APIs a backing page fragment
-cache is needed. This provides a central point for the fragment allocation
-and tracks allows multiple calls to make use of a cached page. The
-advantage to doing this is that multiple calls to get_page can be avoided
-which can be expensive at allocation time. However due to the nature of
-this caching it is required that any calls to the cache be protected by
-either a per-cpu limitation, or a per-cpu limitation and forcing interrupts
-to be disabled when executing the fragment allocation.
-
-The network stack uses two separate caches per CPU to handle fragment
-allocation. The netdev_alloc_cache is used by callers making use of the
-netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is
-used by callers of the __napi_alloc_frag and napi_alloc_skb calls. The
-main difference between these two calls is the context in which they may be
-called. The "netdev" prefixed functions are usable in any context as these
-functions will disable interrupts, while the "napi" prefixed functions are
-only usable within the softirq context.
-
-Many network device drivers use a similar methodology for allocating page
-fragments, but the page fragments are cached at the ring or descriptor
-level. In order to enable these cases it is necessary to provide a generic
-way of tearing down a page cache. For this reason __page_frag_cache_drain
-was implemented. It allows for freeing multiple references from a single
-page via a single call. The advantage to doing this is that it allows for
-cleaning up the multiple references that were added to a page in order to
-avoid calling get_page per allocation.
-
-Alexander Duyck, Nov 29, 2016.
+.. kernel-doc:: mm/page_frag_cache.c
+ :doc: page_frag allocator
+
+Architecture overview
+=====================
+
+.. code-block:: none
+
+ +----------------------+
+ | page_frag API caller |
+ +----------------------+
+ ^
+ |
+ |
+ |
+ v
+ +----------------------------------------------+
+ | request page fragment |
+ +----------------------------------------------+
+ ^ ^
+ | |
+ | Cache empty or not enough |
+ | |
+ v |
+ +--------------------------------+ |
+ | refill cache with order 3 page | |
+ +--------------------------------+ |
+ ^ ^ |
+ | | |
+ | | Refill failed |
+ | | | Cache is enough
+ | | |
+ | v |
+ | +----------------------------------+ |
+ | | refill cache with order 0 page | |
+ | +----------------------------------+ |
+ | ^ |
+ | Refill succeed | |
+ | | Refill succeed |
+ | | |
+ v v v
+ +----------------------------------------------+
+ | allocate fragment from cache |
+ +----------------------------------------------+
+
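+The refill fallback sketched above roughly corresponds to the following
+logic (a simplified sketch of __page_frag_cache_refill(), with error
+handling and bookkeeping trimmed):
+
+.. code-block:: c
+
+   /* try a high-order (order 3) page first, without direct reclaim */
+   page = alloc_pages_node(NUMA_NO_NODE,
+                           (gfp & ~__GFP_DIRECT_RECLAIM) | __GFP_COMP |
+                           __GFP_NOWARN | __GFP_NORETRY,
+                           PAGE_FRAG_CACHE_MAX_ORDER);
+   if (unlikely(!page))
+           /* fall back to an order 0 page */
+           page = alloc_pages_node(NUMA_NO_NODE, gfp, 0);
+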
+API interface
+=============
+As designed and implemented, the allocation side of the page_frag API does
+not allow concurrent calls; the caller must ensure there are no concurrent
+alloc calls to the same page_frag_cache instance, either by using its own
+lock or by relying on some lockless guarantee like NAPI softirq.
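+
+A minimal caller-side serialization sketch ('ctx' and its lock are the
+caller's own, purely illustrative; the page_frag core itself takes no lock):
+
+.. code-block:: c
+
+   spin_lock_bh(&ctx->lock);
+   va = page_frag_alloc_va(&ctx->frag_cache, fragsz, GFP_ATOMIC);
+   spin_unlock_bh(&ctx->lock);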
+
+Depending on the use case, a caller expecting to deal with va, page, or both
+va and page may call page_frag_alloc_va(), page_frag_alloc_pg(), or
+page_frag_alloc() respectively.
+
+There is also a use case that needs a minimum amount of memory in order to
+make forward progress, but can do better if more memory is available. The
+page_frag_alloc_prepare() and page_frag_alloc_commit() related APIs are
+introduced for that: the caller requests the minimum memory it needs and the
+prepare API returns the maximum size of the fragment available; the caller
+then reports back to the page_frag core how much memory it actually used by
+calling the commit API, or skips the commit API if it decides not to use any
+memory.
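+
+A minimal usage sketch of the prepare/commit flow ('MIN_NEEDED' and
+'produce_data()' are hypothetical placeholders for the caller's own logic):
+
+.. code-block:: c
+
+   unsigned int offset, used;
+   unsigned int size = MIN_NEEDED; /* in: minimum needed, out: available */
+   void *va;
+
+   va = page_frag_alloc_va_prepare(nc, &offset, &size, GFP_KERNEL);
+   if (!va)
+           return -ENOMEM;
+
+   used = produce_data(va, size);  /* may consume up to 'size' bytes */
+   if (used)
+           page_frag_alloc_commit(nc, offset, used);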
+
+.. kernel-doc:: include/linux/page_frag_cache.h
+ :identifiers: page_frag_cache_init page_frag_cache_is_pfmemalloc
+ page_frag_alloc_va __page_frag_alloc_va_align
+ page_frag_alloc_va_align page_frag_alloc_va_prepare
+ page_frag_alloc_va_prepare_align page_frag_alloc_pg_prepare
+ page_frag_alloc_prepare page_frag_alloc_commit
+ page_frag_alloc_commit_noref page_frag_free_va
+
+.. kernel-doc:: mm/page_frag_cache.c
+ :identifiers: page_frag_cache_drain
diff --git a/MAINTAINERS b/MAINTAINERS
index 4745ea94d463..2f84aba59428 100644
--- a/MAINTAINERS
+++ b/MAINTAINERS
@@ -16683,6 +16683,16 @@ F: mm/page-writeback.c
F: mm/readahead.c
F: mm/truncate.c
+PAGE FRAG
+M: Yunsheng Lin <[email protected]>
+L: [email protected]
+L: [email protected]
+S: Supported
+F: Documentation/mm/page_frags.rst
+F: include/linux/page_frag_cache.h
+F: mm/page_frag_cache.c
+F: mm/page_frag_test.c
+
PAGE POOL
M: Jesper Dangaard Brouer <[email protected]>
M: Ilias Apalodimas <[email protected]>
diff --git a/include/linux/page_frag_cache.h b/include/linux/page_frag_cache.h
index 28185969cd2c..d8edbecdd179 100644
--- a/include/linux/page_frag_cache.h
+++ b/include/linux/page_frag_cache.h
@@ -31,11 +31,23 @@ struct page_frag_cache {
#endif
};
+/**
+ * page_frag_cache_init() - Init page_frag cache.
+ * @nc: the page_frag cache to init
+ *
+ * Inline helper to init the page_frag cache.
+ */
static inline void page_frag_cache_init(struct page_frag_cache *nc)
{
nc->va = NULL;
}
+/**
+ * page_frag_cache_is_pfmemalloc() - Check for pfmemalloc.
+ * @nc: the page_frag cache to check
+ *
+ * Check whether the current page in the page_frag cache was allocated from
+ * the pfmemalloc reserves.
+ */
static inline bool page_frag_cache_is_pfmemalloc(struct page_frag_cache *nc)
{
return !!nc->pfmemalloc;
@@ -46,6 +58,17 @@ void __page_frag_cache_drain(struct page *page, unsigned int count);
void *page_frag_cache_refill(struct page_frag_cache *nc, unsigned int fragsz,
gfp_t gfp_mask);
+/**
+ * page_frag_alloc_va() - Alloc a page fragment.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when the cache needs to be refilled
+ *
+ * Get a page fragment from the page_frag cache.
+ *
+ * Return:
+ * Va of the page fragment on success, otherwise NULL.
+ */
static inline void *page_frag_alloc_va(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask)
{
@@ -63,6 +86,19 @@ static inline void *page_frag_alloc_va(struct page_frag_cache *nc,
return va + offset;
}
+/**
+ * __page_frag_alloc_va_align() - Alloc a page fragment with an alignment
+ * requirement.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when the cache needs to be refilled
+ * @align: the requested alignment requirement
+ *
+ * Get a page fragment from the page_frag cache with an alignment requirement.
+ *
+ * Return:
+ * Va of the page fragment on success, otherwise NULL.
+ */
static inline void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
unsigned int fragsz,
gfp_t gfp_mask,
@@ -75,6 +111,19 @@ static inline void *__page_frag_alloc_va_align(struct page_frag_cache *nc,
return page_frag_alloc_va(nc, fragsz, gfp_mask);
}
+/**
+ * page_frag_alloc_va_align() - Alloc a page fragment with an alignment
+ * requirement.
+ * @nc: page_frag cache from which to allocate
+ * @fragsz: the requested fragment size
+ * @gfp_mask: the allocation gfp to use when the cache needs to be refilled
+ * @align: the requested alignment requirement
+ *
+ * WARN_ON_ONCE() checks align and fragsz before getting a page fragment from
+ * the page_frag cache with an alignment requirement.
+ *
+ * Return:
+ * Va of the page fragment on success, otherwise NULL.
+ */
static inline void *page_frag_alloc_va_align(struct page_frag_cache *nc,
unsigned int fragsz,
gfp_t gfp_mask,
@@ -86,6 +135,19 @@ static inline void *page_frag_alloc_va_align(struct page_frag_cache *nc,
return __page_frag_alloc_va_align(nc, fragsz, gfp_mask, align);
}
+/**
+ * page_frag_alloc_va_prepare() - Prepare allocating a page fragment.
+ * @nc: page_frag cache from which to allocate
+ * @offset: out as the offset of the page fragment
+ * @size: in as the requested size, out as the available size
+ * @gfp_mask: the allocation gfp to use when the cache needs to be refilled
+ *
+ * Prepare a page fragment with a minimum size of 'size'; 'size' is also used
+ * to report the maximum size of the page fragment the caller can use.
+ *
+ * Return:
+ * Va of the page fragment on success, otherwise NULL.
+ */
static inline void *page_frag_alloc_va_prepare(struct page_frag_cache *nc,
unsigned int *offset,
unsigned int *size,
@@ -108,6 +170,21 @@ static inline void *page_frag_alloc_va_prepare(struct page_frag_cache *nc,
return va + *offset;
}
+/**
+ * page_frag_alloc_va_prepare_align() - Prepare allocating a page fragment
+ * with an alignment requirement.
+ * @nc: page_frag cache from which to allocate
+ * @offset: out as the offset of the page fragment
+ * @size: in as the requested size, out as the available size
+ * @align: the requested alignment requirement
+ * @gfp_mask: the allocation gfp to use when the cache needs to be refilled
+ *
+ * Prepare an aligned page fragment with a minimum size of 'size'; 'size' is
+ * also used to report the maximum size of the page fragment the caller can
+ * use.
+ *
+ * Return:
+ * Va of the page fragment on success, otherwise NULL.
+ */
static inline void *page_frag_alloc_va_prepare_align(struct page_frag_cache *nc,
unsigned int *offset,
unsigned int *size,
@@ -144,6 +221,19 @@ static inline void *__page_frag_alloc_pg_prepare(struct page_frag_cache *nc,
return va;
}
+/**
+ * page_frag_alloc_pg_prepare - Prepare allocating a page fragment.
+ * @nc: page_frag cache from which to allocate
+ * @offset: out as the offset of the page fragment
+ * @size: in as the requested size, out as the available size
+ * @gfp: the allocation gfp to use when the cache needs to be refilled
+ *
+ * Prepare a page fragment with a minimum size of 'size'; 'size' is also used
+ * to report the maximum size of the page fragment the caller can use.
+ *
+ * Return:
+ * The page fragment on success, otherwise NULL.
+ */
#define page_frag_alloc_pg_prepare(nc, offset, size, gfp) \
({ \
struct page *__page = NULL; \
@@ -179,6 +269,21 @@ static inline void *__page_frag_alloc_prepare(struct page_frag_cache *nc,
return nc_va;
}
+/**
+ * page_frag_alloc_prepare - Prepare allocating a page fragment.
+ * @nc: page_frag cache from which to allocate
+ * @offset: out as the offset of the page fragment
+ * @size: in as the requested size, out as the available size
+ * @va: out as the va of the returned page fragment
+ * @gfp: the allocation gfp to use when the cache needs to be refilled
+ *
+ * Prepare a page fragment with a minimum size of 'size'; 'size' is also used
+ * to report the maximum size of the page fragment. Return both the 'page'
+ * and the 'va' of the fragment to the caller.
+ *
+ * Return:
+ * The page fragment on success, otherwise NULL.
+ */
#define page_frag_alloc_prepare(nc, offset, size, va, gfp) \
({ \
struct page *__page = NULL; \
@@ -191,6 +296,14 @@ static inline void *__page_frag_alloc_prepare(struct page_frag_cache *nc,
__page; \
})
+/**
+ * page_frag_alloc_commit - Commit allocating a page fragment.
+ * @nc: page_frag cache from which to commit
+ * @offset: offset of the page fragment
+ * @size: size of the page fragment that has been used
+ *
+ * Commit the prepared allocation by passing the offset and the actual size
+ * used.
+ */
static inline void page_frag_alloc_commit(struct page_frag_cache *nc,
unsigned int offset,
unsigned int size)
@@ -199,6 +312,17 @@ static inline void page_frag_alloc_commit(struct page_frag_cache *nc,
nc->offset = offset + size;
}
+/**
+ * page_frag_alloc_commit_noref - Commit allocating a page fragment without
+ * taking a page refcount.
+ * @nc: page_frag cache from which to commit
+ * @offset: offset of the page fragment
+ * @size: size of the page fragment that has been used
+ *
+ * Commit the prepared allocation by passing the offset and the actual size
+ * used, without taking a page refcount. Mostly used for the fragment
+ * coalescing case, when the current fragment can share the same refcount
+ * with the previous fragment.
+ */
static inline void page_frag_alloc_commit_noref(struct page_frag_cache *nc,
unsigned int offset,
unsigned int size)
@@ -206,6 +330,10 @@ static inline void page_frag_alloc_commit_noref(struct page_frag_cache *nc,
nc->offset = offset + size;
}
+/**
+ * page_frag_free_va - Free a page fragment by va.
+ * @addr: va of the page fragment to be freed
+ */
void page_frag_free_va(void *addr);
#endif
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index cbd0ed82a596..0c76ec006c22 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -1,15 +1,44 @@
// SPDX-License-Identifier: GPL-2.0-only
-/* Page fragment allocator
+
+/**
+ * DOC: page_frag allocator
+ *
+ * A page fragment is an arbitrary-length arbitrary-offset area of memory which
+ * resides within a 0 or higher order compound page. Multiple fragments within
+ * that page are individually refcounted, in the page's reference counter.
+ *
+ * The page_frag functions, page_frag_alloc* and page_frag_free*, provide a
+ * simple allocation framework for page fragments. This is used by the network
+ * stack and network device drivers to provide a backing region of memory for
+ * use as either an sk_buff->head, or to be used in the "frags" portion of
+ * skb_shared_info.
*
- * Page Fragment:
- * An arbitrary-length arbitrary-offset area of memory which resides within a
- * 0 or higher order page. Multiple fragments within that page are
- * individually refcounted, in the page's reference counter.
+ * In order to make use of the page fragment APIs a backing page fragment cache
+ * is needed. This provides a central point for the fragment allocation and
+ * tracking, and allows multiple calls to make use of a cached page. The
+ * advantage of doing this is that multiple calls to get_page can be avoided,
+ * which can be expensive at allocation time. However due to the nature of
+ * is required that any calls to the cache be protected by either a per-cpu
+ * limitation, or a per-cpu limitation and forcing interrupts to be disabled
+ * when executing the fragment allocation.
*
- * The page_frag functions provide a simple allocation framework for page
- * fragments. This is used by the network stack and network device drivers to
- * provide a backing region of memory for use as either an sk_buff->head, or to
- * be used in the "frags" portion of skb_shared_info.
+ * The network stack uses two separate caches per CPU to handle fragment
+ * allocation. The netdev_alloc_cache is used by callers making use of the
+ * netdev_alloc_frag and __netdev_alloc_skb calls. The napi_alloc_cache is
+ * used by callers of the __napi_alloc_frag and napi_alloc_skb calls. The
+ * main difference between these two calls is the context in which they may be
+ * called. The "netdev" prefixed functions are usable in any context as these
+ * functions will disable interrupts, while the "napi" prefixed functions are
+ * only usable within the softirq context.
+ *
+ * Many network device drivers use a similar methodology for allocating page
+ * fragments, but the page fragments are cached at the ring or descriptor
+ * level. In order to enable these cases it is necessary to provide a generic
+ * way of tearing down a page cache. For this reason __page_frag_cache_drain
+ * was implemented. It allows for freeing multiple references from a single
+ * page via a single call. The advantage to doing this is that it allows for
+ * cleaning up the multiple references that were added to a page in order to
+ * avoid calling get_page per allocation.
*/
#include <linux/export.h>
@@ -57,6 +86,10 @@ static bool __page_frag_cache_refill(struct page_frag_cache *nc,
return true;
}
+/**
+ * page_frag_cache_drain - Drain the current page from the page_frag cache.
+ * @nc: the page_frag cache to drain
+ */
void page_frag_cache_drain(struct page_frag_cache *nc)
{
if (!nc->va)
--
2.33.0
We are about to use the page_frag_alloc_*() API to not just
allocate memory for skb->data, but also to do the memory
allocation for skb frags. Currently the implementation of
page_frag in the mm subsystem runs the offset as a countdown
rather than a count-up value; there may be several advantages
to that as mentioned in [1], but it also has some disadvantages,
for example, it may prevent skb frag coalescing and more
correct cache prefetching.
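A rough illustration of the coalescing point (offsets assume an
empty cache and no alignment constraint; purely illustrative):

    /* count-up offset: consecutive allocs return adjacent,
     * increasing addresses, so a new frag can coalesce with
     * the previous one
     */
    va1 = page_frag_alloc(nc, 512, gfp); /* offset 0 */
    va2 = page_frag_alloc(nc, 512, gfp); /* offset 512: va2 == va1 + 512 */

    /* countdown offset: va2 would instead sit 512 bytes before va1 */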
We have a trade-off to make in order to have a unified
implementation and API for page_frag, so use an initial zero
offset in this patch, and the following patch will try to make
some optimizations to avoid the disadvantages as much as
possible.
1. https://lore.kernel.org/all/[email protected]/
CC: Alexander Duyck <[email protected]>
Signed-off-by: Yunsheng Lin <[email protected]>
---
mm/page_frag_cache.c | 31 ++++++++++++++-----------------
1 file changed, 14 insertions(+), 17 deletions(-)
diff --git a/mm/page_frag_cache.c b/mm/page_frag_cache.c
index a0f90ba25200..3e3e88d9af90 100644
--- a/mm/page_frag_cache.c
+++ b/mm/page_frag_cache.c
@@ -67,9 +67,8 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
unsigned int fragsz, gfp_t gfp_mask,
unsigned int align_mask)
{
- unsigned int size = PAGE_SIZE;
+ unsigned int size, offset;
struct page *page;
- int offset;
if (unlikely(!nc->va)) {
refill:
@@ -77,10 +76,6 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
if (!page)
return NULL;
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* Even if we own the page, we do not use atomic_set().
* This would break get_page_unless_zero() users.
*/
@@ -89,11 +84,18 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
/* reset page count bias and offset to start of new frag */
nc->pfmemalloc = page_is_pfmemalloc(page);
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- nc->offset = size;
+ nc->offset = 0;
}
- offset = nc->offset - fragsz;
- if (unlikely(offset < 0)) {
+#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
+ /* if size can vary use size else just use PAGE_SIZE */
+ size = nc->size;
+#else
+ size = PAGE_SIZE;
+#endif
+
+ offset = ALIGN(nc->offset, -align_mask);
+ if (unlikely(offset + fragsz > size)) {
page = virt_to_page(nc->va);
if (!page_ref_sub_and_test(page, nc->pagecnt_bias))
@@ -104,17 +106,13 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
goto refill;
}
-#if (PAGE_SIZE < PAGE_FRAG_CACHE_MAX_SIZE)
- /* if size can vary use size else just use PAGE_SIZE */
- size = nc->size;
-#endif
/* OK, page count is 0, we can safely set it */
set_page_count(page, PAGE_FRAG_CACHE_MAX_SIZE + 1);
/* reset page count bias and offset to start of new frag */
nc->pagecnt_bias = PAGE_FRAG_CACHE_MAX_SIZE + 1;
- offset = size - fragsz;
- if (unlikely(offset < 0)) {
+ offset = 0;
+ if (unlikely(fragsz > size)) {
/*
* The caller is trying to allocate a fragment
* with fragsz > PAGE_SIZE but the cache isn't big
@@ -129,8 +127,7 @@ void *__page_frag_alloc_align(struct page_frag_cache *nc,
}
nc->pagecnt_bias--;
- offset &= align_mask;
- nc->offset = offset;
+ nc->offset = offset + fragsz;
return nc->va + offset;
}
--
2.33.0
On Sun, Apr 7, 2024 at 6:10 AM Yunsheng Lin <[email protected]> wrote:
>
> After [1], there are only two implementations for page frag:
>
> 1. mm/page_alloc.c: net stack seems to be using it in the
> rx part with 'struct page_frag_cache' and the main API
> being page_frag_alloc_align().
> 2. net/core/sock.c: net stack seems to be using it in the
> tx part with 'struct page_frag' and the main API being
> skb_page_frag_refill().
>
> This patchset tries to unify the page frag implementation
> by replacing page_frag with page_frag_cache for sk_page_frag()
> first. net_high_order_alloc_disable_key for the implementation
> in net/core/sock.c doesn't seem to matter that much now that we
> have pcp support for high-order pages in commit 44042b449872
> ("mm/page_alloc: allow high-order pages to be stored on the
> per-cpu lists").
>
> As the changes are mostly related to networking, this is
> targeting net-next. The rest of the page_frag usage will be
> replaced in a follow-up patchset.
>
> After this patchset, we are not only able to unify the page
> frag implementation a little, but also seem to get about a
> 0.5+% performance boost when testing with the vhost_net_test
> introduced in [1] and the page_frag_test.ko introduced in this
> patch.
One question that jumps out at me for this is "why?". No offense but
this is a pretty massive set of changes with over 1400 additions and
500+ deletions and I can't help but ask why, and this cover page
doesn't give me any good reason to think about accepting this set.
What is meant to be the benefit to the community for adding this? All
I am seeing is a ton of extra code to have to review as this
unification is adding an additional 1000+ lines without a good
explanation as to why they are needed.
Also I wouldn't bother mentioning the 0.5+% performance gain as a
"bonus". Changes of that amount usually mean it is within the margin
of error. At best it likely means you haven't introduced a noticeable
regression.