2021-03-12 15:44:59

by Mel Gorman

Subject: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
directly to the subsystem maintainers. While sunrpc is low-risk, I'm
vaguely aware that there are other prototype series on netdev that affect
page_pool. The conflict should be obvious in linux-next.

Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation of <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
series, the mm patches have to be merged before the sunrpc and net
users.

Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update

Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2

This series introduces a bulk order-0 page allocator, with sunrpc and
the network page pool as the first users. The implementation is not
particularly efficient; the intention is to iron out what the semantics
of the API should be for its users. Once the semantics are settled, it
can be made more efficient. Despite that, this is a performance-related
series for users that require multiple pages for an operation without
multiple round-trips to the page allocator. Quoting the last patch for
the high-speed networking use-case:

For an XDP-redirect workload with the 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does
XDP_PASS to create an SKB from the xdp_frame, the page cannot be
returned to the page_pool. In this case, we saw[1] an improvement of
18.8% from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

Both users in this series are corner cases (NFS and high-speed networks)
so it is unlikely that most users will see any benefit in the short
term. Potential other users are batch allocations for page cache
readahead, fault around and SLUB allocations when high-order pages are
unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.

Light testing passed. I'm relying on Chuck and Jesper to test the
target users more aggressively, but both reported performance
improvements with the initial RFC.

Patch 1 moves GFP flag initialisation to prepare_alloc_pages

Patch 2 renames a variable whose name proved particularly unpopular

Patch 3 adds a bulk page allocator

Patch 4 is a sunrpc cleanup that is a pre-requisite.

Patch 5 is the sunrpc user. Chuck also has a patch which further caches
pages but is not included in this series. It's not directly
related to the bulk allocator and as it caches pages, it might
have other concerns (e.g. does it need a shrinker?)

Patch 6 is a preparation patch only for the network user

Patch 7 converts the net page pool to the bulk allocator for order-0 pages.

include/linux/gfp.h | 12 ++++
mm/page_alloc.c | 149 +++++++++++++++++++++++++++++++++++++-----
net/core/page_pool.c | 101 +++++++++++++++++-----------
net/sunrpc/svc_xprt.c | 47 +++++++++----
4 files changed, 240 insertions(+), 69 deletions(-)

--
2.26.2


2021-03-12 15:44:59

by Mel Gorman

Subject: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator

This patch adds a new page allocator interface via alloc_pages_bulk
and __alloc_pages_bulk. A caller requests a number of pages to be
allocated and added to a list. They can be freed in bulk using
free_pages_bulk().

The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.
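
Purely as an illustration of the calling convention (a hypothetical
caller, not part of this patch; the names "want" and "got" are made up):

	LIST_HEAD(pages);
	unsigned long want = 16;
	unsigned long got;

	got = alloc_pages_bulk(GFP_KERNEL, want, &pages);
	if (got < want) {
		/* The allocator may supply fewer pages than requested.
		 * The caller decides whether to retry for the remainder,
		 * sleep, or proceed with a partial batch.
		 */
	}

	/* ... consume pages from the list ... */

	/* Any pages still on the list can be returned in bulk */
	free_pages_bulk(&pages);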

Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 12 +++++
mm/page_alloc.c | 116 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 128 insertions(+)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..e2cd98dba72e 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page *page)
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);

+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
@@ -581,6 +592,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);

struct page_frag_cache;
extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 880b1d6368bd..f48f94375b66 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
}
}

+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+ struct page *page, *next;
+
+ list_for_each_entry_safe(page, next, list, lru) {
+ trace_mm_page_free_batched(page);
+ if (put_page_testzero(page)) {
+ list_del(&page->lru);
+ __free_pages_ok(page, 0, FPI_NONE);
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
static inline unsigned int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
@@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
return true;
}

+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *alloc_list)
+{
+ struct page *page;
+ unsigned long flags;
+ struct zone *zone;
+ struct zoneref *z;
+ struct per_cpu_pages *pcp;
+ struct list_head *pcp_list;
+ struct alloc_context ac;
+ gfp_t alloc_gfp;
+ unsigned int alloc_flags;
+ int allocated = 0;
+
+ if (WARN_ON_ONCE(nr_pages <= 0))
+ return 0;
+
+ if (nr_pages == 1)
+ goto failed;
+
+ /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+ if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
+ &alloc_gfp, &alloc_flags))
+ return 0;
+ gfp = alloc_gfp;
+
+ /* Find an allowed local zone that meets the high watermark. */
+ for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+ unsigned long mark;
+
+ if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+ !__cpuset_zone_allowed(zone, gfp)) {
+ continue;
+ }
+
+ if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+ zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+ goto failed;
+ }
+
+ mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+ if (zone_watermark_fast(zone, 0, mark,
+ zonelist_zone_idx(ac.preferred_zoneref),
+ alloc_flags, gfp)) {
+ break;
+ }
+ }
+ if (!zone)
+ return 0;
+
+ /* Attempt the batch allocation */
+ local_irq_save(flags);
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp_list = &pcp->lists[ac.migratetype];
+
+ while (allocated < nr_pages) {
+ page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+ pcp, pcp_list);
+ if (!page) {
+ /* Try and get at least one page */
+ if (!allocated)
+ goto failed_irq;
+ break;
+ }
+
+ list_add(&page->lru, alloc_list);
+ allocated++;
+ }
+
+ __count_zid_vm_events(PGALLOC, zone_idx(zone), allocated);
+ zone_statistics(zone, zone);
+
+ local_irq_restore(flags);
+
+ /* Prep page with IRQs enabled to reduce disabled times */
+ list_for_each_entry(page, alloc_list, lru)
+ prep_new_page(page, 0, gfp, 0);
+
+ return allocated;
+
+failed_irq:
+ local_irq_restore(flags);
+
+failed:
+ page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ if (page) {
+ list_add(&page->lru, alloc_list);
+ allocated = 1;
+ }
+
+ return allocated;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
--
2.26.2

2021-03-12 15:45:33

by Mel Gorman

Subject: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator

From: Chuck Lever <[email protected]>

Reduce the rate at which nfsd threads hammer on the page allocator.
This improves throughput scalability by enabling the threads to run
more independently of each other.

Signed-off-by: Chuck Lever <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
1 file changed, 31 insertions(+), 12 deletions(-)

diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index cfa7e4776d0e..38a8d6283801 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
static int svc_alloc_arg(struct svc_rqst *rqstp)
{
struct svc_serv *serv = rqstp->rq_server;
+ unsigned long needed;
struct xdr_buf *arg;
+ struct page *page;
int pages;
int i;

- /* now allocate needed pages. If we get a failure, sleep briefly */
pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
if (pages > RPCSVC_MAXPAGES) {
pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
@@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
/* use as many pages as possible */
pages = RPCSVC_MAXPAGES;
}
- for (i = 0; i < pages ; i++)
- while (rqstp->rq_pages[i] == NULL) {
- struct page *p = alloc_page(GFP_KERNEL);
- if (!p) {
- set_current_state(TASK_INTERRUPTIBLE);
- if (signalled() || kthread_should_stop()) {
- set_current_state(TASK_RUNNING);
- return -EINTR;
- }
- schedule_timeout(msecs_to_jiffies(500));
+
+ for (needed = 0, i = 0; i < pages ; i++)
+ if (!rqstp->rq_pages[i])
+ needed++;
+ if (needed) {
+ LIST_HEAD(list);
+
+retry:
+ alloc_pages_bulk(GFP_KERNEL, needed, &list);
+ for (i = 0; i < pages; i++) {
+ if (!rqstp->rq_pages[i]) {
+ page = list_first_entry_or_null(&list,
+ struct page,
+ lru);
+ if (unlikely(!page))
+ goto empty_list;
+ list_del(&page->lru);
+ rqstp->rq_pages[i] = page;
+ needed--;
}
- rqstp->rq_pages[i] = p;
}
+ }
rqstp->rq_page_end = &rqstp->rq_pages[pages];
rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */

@@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
arg->len = (pages-1)*PAGE_SIZE;
arg->tail[0].iov_len = 0;
return 0;
+
+empty_list:
+ set_current_state(TASK_INTERRUPTIBLE);
+ if (signalled() || kthread_should_stop()) {
+ set_current_state(TASK_RUNNING);
+ return -EINTR;
+ }
+ schedule_timeout(msecs_to_jiffies(500));
+ goto retry;
}

static bool
--
2.26.2

2021-03-12 15:45:33

by Mel Gorman

Subject: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path

From: Jesper Dangaard Brouer <[email protected]>

There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling them.

For these workloads, it can improve performance to bulk-allocate pages
from the page allocator to refill the alloc cache.

For an XDP-redirect workload with the 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does
XDP_PASS to create an SKB from the xdp_frame, the page cannot be
returned to the page_pool. In this case, we saw[1] an improvement of
18.8% from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).

[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
---
net/core/page_pool.c | 62 ++++++++++++++++++++++++++++----------------
1 file changed, 39 insertions(+), 23 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index 40e1b2beaa6c..a5889f1b86aa 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -208,44 +208,60 @@ noinline
static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t _gfp)
{
+ const int bulk = PP_ALLOC_CACHE_REFILL;
+ struct page *page, *next, *first_page;
unsigned int pp_flags = pool->p.flags;
- struct page *page;
+ unsigned int pp_order = pool->p.order;
+ int pp_nid = pool->p.nid;
+ LIST_HEAD(page_list);
gfp_t gfp = _gfp;

- /* We could always set __GFP_COMP, and avoid this branch, as
- * prep_new_page() can handle order-0 with __GFP_COMP.
- */
- if (pool->p.order)
+ /* Don't support bulk alloc for high-order pages */
+ if (unlikely(pp_order)) {
gfp |= __GFP_COMP;
+ first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+ if (unlikely(!first_page))
+ return NULL;
+ goto out;
+ }

- /* FUTURE development:
- *
- * Current slow-path essentially falls back to single page
- * allocations, which doesn't improve performance. This code
- * need bulk allocation support from the page allocator code.
- */
-
- /* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
- page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
- page = alloc_pages(gfp, pool->p.order);
-#endif
- if (!page)
+ if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
return NULL;

+ /* First page is extracted and returned to caller */
+ first_page = list_first_entry(&page_list, struct page, lru);
+ list_del(&first_page->lru);
+
+ /* Remaining pages store in alloc.cache */
+ list_for_each_entry_safe(page, next, &page_list, lru) {
+ list_del(&page->lru);
+ if ((pp_flags & PP_FLAG_DMA_MAP) &&
+ unlikely(!page_pool_dma_map(pool, page))) {
+ put_page(page);
+ continue;
+ }
+ if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+ pool->alloc.cache[pool->alloc.count++] = page;
+ pool->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pool, page,
+ pool->pages_state_hold_cnt);
+ } else {
+ put_page(page);
+ }
+ }
+out:
if ((pp_flags & PP_FLAG_DMA_MAP) &&
- unlikely(!page_pool_dma_map(pool, page))) {
- put_page(page);
+ unlikely(!page_pool_dma_map(pool, first_page))) {
+ put_page(first_page);
return NULL;
}

/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
- trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+ trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);

/* When page just alloc'ed is should/must have refcnt 1. */
- return page;
+ return first_page;
}

/* For using page_pool replace: alloc_pages() API calls, but provide
--
2.26.2

2021-03-12 15:45:33

by Mel Gorman

Subject: [PATCH 6/7] net: page_pool: refactor dma_map into own function page_pool_dma_map

From: Jesper Dangaard Brouer <[email protected]>

In preparation for the next patch, move the DMA mapping into its own
function, as this will make it easier to follow the changes.

V2: make page_pool_dma_map return boolean (Ilias)

Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
Reviewed-by: Ilias Apalodimas <[email protected]>
---
net/core/page_pool.c | 45 +++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)

diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..40e1b2beaa6c 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,14 +180,37 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
pool->p.dma_dir);
}

+static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
+{
+ dma_addr_t dma;
+
+ /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+ * since dma_addr_t can be either 32 or 64 bits and does not always fit
+ * into page private data (i.e 32bit cpu with 64bit DMA caps)
+ * This mapping is kept for lifetime of page, until leaving pool.
+ */
+ dma = dma_map_page_attrs(pool->p.dev, page, 0,
+ (PAGE_SIZE << pool->p.order),
+ pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+ if (dma_mapping_error(pool->p.dev, dma))
+ return false;
+
+ page->dma_addr = dma;
+
+ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+ return true;
+}
+
/* slow path */
noinline
static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t _gfp)
{
+ unsigned int pp_flags = pool->p.flags;
struct page *page;
gfp_t gfp = _gfp;
- dma_addr_t dma;

/* We could always set __GFP_COMP, and avoid this branch, as
* prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
if (!page)
return NULL;

- if (!(pool->p.flags & PP_FLAG_DMA_MAP))
- goto skip_dma_map;
-
- /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
- * since dma_addr_t can be either 32 or 64 bits and does not always fit
- * into page private data (i.e 32bit cpu with 64bit DMA caps)
- * This mapping is kept for lifetime of page, until leaving pool.
- */
- dma = dma_map_page_attrs(pool->p.dev, page, 0,
- (PAGE_SIZE << pool->p.order),
- pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
- if (dma_mapping_error(pool->p.dev, dma)) {
+ if ((pp_flags & PP_FLAG_DMA_MAP) &&
+ unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
return NULL;
}
- page->dma_addr = dma;

- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
- page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
-skip_dma_map:
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
-
trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);

/* When page just alloc'ed is should/must have refcnt 1. */
--
2.26.2

2021-03-12 18:45:46

by Alexander Duyck

Subject: Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator

On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <[email protected]> wrote:
>
> From: Chuck Lever <[email protected]>
>
> Reduce the rate at which nfsd threads hammer on the page allocator.
> This improves throughput scalability by enabling the threads to run
> more independently of each other.
>
> Signed-off-by: Chuck Lever <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
> 1 file changed, 31 insertions(+), 12 deletions(-)
>
> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
> index cfa7e4776d0e..38a8d6283801 100644
> --- a/net/sunrpc/svc_xprt.c
> +++ b/net/sunrpc/svc_xprt.c
> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
> static int svc_alloc_arg(struct svc_rqst *rqstp)
> {
> struct svc_serv *serv = rqstp->rq_server;
> + unsigned long needed;
> struct xdr_buf *arg;
> + struct page *page;
> int pages;
> int i;
>
> - /* now allocate needed pages. If we get a failure, sleep briefly */
> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
> if (pages > RPCSVC_MAXPAGES) {
> pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> /* use as many pages as possible */
> pages = RPCSVC_MAXPAGES;
> }
> - for (i = 0; i < pages ; i++)
> - while (rqstp->rq_pages[i] == NULL) {
> - struct page *p = alloc_page(GFP_KERNEL);
> - if (!p) {
> - set_current_state(TASK_INTERRUPTIBLE);
> - if (signalled() || kthread_should_stop()) {
> - set_current_state(TASK_RUNNING);
> - return -EINTR;
> - }
> - schedule_timeout(msecs_to_jiffies(500));
> +

> + for (needed = 0, i = 0; i < pages ; i++)
> + if (!rqstp->rq_pages[i])
> + needed++;

I would use an opening and closing braces for the for loop since
technically the if is a multiline statement. It will make this more
readable.

> + if (needed) {
> + LIST_HEAD(list);
> +
> +retry:

Rather than kind of open code a while loop why not just make this
"while (needed)"? Then all you have to do is break out of the for loop
and you will automatically return here instead of having to jump to
two different labels.

> + alloc_pages_bulk(GFP_KERNEL, needed, &list);

Rather than not using the return value would it make sense here to
perhaps subtract it from needed? Then you would know if any of the
allocation requests weren't fulfilled.

> + for (i = 0; i < pages; i++) {

It is probably optimizing for the exception case, but I don't think
you want the "i = 0" here. If you are having to stop because the list
is empty it probably makes sense to resume where you left off. So you
should probably be initializing i to 0 before we check for needed.

> + if (!rqstp->rq_pages[i]) {

It might be cleaner here to just do a "continue" if rq_pages[i] is populated.

> + page = list_first_entry_or_null(&list,
> + struct page,
> + lru);
> + if (unlikely(!page))
> + goto empty_list;

I think I preferred the original code that wasn't jumping away from
the loop here. With the change I suggested above that would switch the
if(needed) to while(needed) you could have it just break out of the
for loop to place itself back in the while loop.
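
Putting those suggestions together, a rough, untested sketch of the
shape I have in mind (reusing the declarations from the patch above):

	for (needed = 0, i = 0; i < pages; i++) {
		if (!rqstp->rq_pages[i])
			needed++;
	}

	i = 0;
	while (needed) {
		LIST_HEAD(list);

		/* The return value says whether anything was allocated */
		if (alloc_pages_bulk(GFP_KERNEL, needed, &list)) {
			for (; i < pages; i++) {
				if (rqstp->rq_pages[i])
					continue;
				page = list_first_entry_or_null(&list,
								struct page,
								lru);
				if (!page)
					break;
				list_del(&page->lru);
				rqstp->rq_pages[i] = page;
				needed--;
			}
		}

		if (!needed)
			break;

		/* Allocation came up short; sleep briefly as before */
		set_current_state(TASK_INTERRUPTIBLE);
		if (signalled() || kthread_should_stop()) {
			set_current_state(TASK_RUNNING);
			return -EINTR;
		}
		schedule_timeout(msecs_to_jiffies(500));
	}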

> + list_del(&page->lru);
> + rqstp->rq_pages[i] = page;
> + needed--;
> }
> - rqstp->rq_pages[i] = p;
> }
> + }
> rqstp->rq_page_end = &rqstp->rq_pages[pages];
> rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
>
> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
> arg->len = (pages-1)*PAGE_SIZE;
> arg->tail[0].iov_len = 0;
> return 0;
> +
> +empty_list:
> + set_current_state(TASK_INTERRUPTIBLE);
> + if (signalled() || kthread_should_stop()) {
> + set_current_state(TASK_RUNNING);
> + return -EINTR;
> + }
> + schedule_timeout(msecs_to_jiffies(500));
> + goto retry;
> }
>
> static bool
> --
> 2.26.2
>

2021-03-12 19:45:31

by Alexander Duyck

Subject: Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path

On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <[email protected]> wrote:
>
> From: Jesper Dangaard Brouer <[email protected]>
>
> There are cases where the page_pool need to refill with pages from the
> page allocator. Some workloads cause the page_pool to release pages
> instead of recycling these pages.
>
> For these workload it can improve performance to bulk alloc pages from
> the page-allocator to refill the alloc cache.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to create
> an SKB from the xdp_frame, which then cannot return the page to the
> page_pool. In this case, we saw[1] an improvement of 18.8% from using
> the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> [1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
>
> Signed-off-by: Jesper Dangaard Brouer <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> Reviewed-by: Ilias Apalodimas <[email protected]>
> ---
> net/core/page_pool.c | 62 ++++++++++++++++++++++++++++----------------
> 1 file changed, 39 insertions(+), 23 deletions(-)
>
> diff --git a/net/core/page_pool.c b/net/core/page_pool.c
> index 40e1b2beaa6c..a5889f1b86aa 100644
> --- a/net/core/page_pool.c
> +++ b/net/core/page_pool.c
> @@ -208,44 +208,60 @@ noinline
> static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> gfp_t _gfp)
> {
> + const int bulk = PP_ALLOC_CACHE_REFILL;
> + struct page *page, *next, *first_page;
> unsigned int pp_flags = pool->p.flags;
> - struct page *page;
> + unsigned int pp_order = pool->p.order;
> + int pp_nid = pool->p.nid;
> + LIST_HEAD(page_list);
> gfp_t gfp = _gfp;
>
> - /* We could always set __GFP_COMP, and avoid this branch, as
> - * prep_new_page() can handle order-0 with __GFP_COMP.
> - */
> - if (pool->p.order)
> + /* Don't support bulk alloc for high-order pages */
> + if (unlikely(pp_order)) {
> gfp |= __GFP_COMP;
> + first_page = alloc_pages_node(pp_nid, gfp, pp_order);
> + if (unlikely(!first_page))
> + return NULL;
> + goto out;
> + }
>
> - /* FUTURE development:
> - *
> - * Current slow-path essentially falls back to single page
> - * allocations, which doesn't improve performance. This code
> - * need bulk allocation support from the page allocator code.
> - */
> -
> - /* Cache was empty, do real allocation */
> -#ifdef CONFIG_NUMA
> - page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> -#else
> - page = alloc_pages(gfp, pool->p.order);
> -#endif
> - if (!page)
> + if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
> return NULL;
>
> + /* First page is extracted and returned to caller */
> + first_page = list_first_entry(&page_list, struct page, lru);
> + list_del(&first_page->lru);
> +

This seems kind of broken to me. If you pull the first page and then
cannot map it you end up returning NULL even if you placed a number of
pages in the cache.

It might make more sense to have the loop below record a pointer to
the last page you processed and handle things in two stages so that on
the first iteration you map one page.

So something along the lines of:
1. Initialize last_page to NULL

for each page in the list
2. Map page
3. If last_page is non-NULL, move to cache
4. Assign page to last_page
5. Return to step 2 for each page in list

6. return last_page
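
Roughly, something like this (untested sketch of the above, reusing the
names from the patch; alloc.cache is assumed empty on entry so the
cache cannot overflow here):

	struct page *page, *next, *last_page = NULL;

	/* ... bulk allocation into page_list as above ... */

	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);

		if ((pp_flags & PP_FLAG_DMA_MAP) &&
		    unlikely(!page_pool_dma_map(pool, page))) {
			put_page(page);
			continue;
		}

		/* Track the mapped page as held by the pool */
		pool->pages_state_hold_cnt++;
		trace_page_pool_state_hold(pool, page,
					   pool->pages_state_hold_cnt);

		/* The previously mapped page goes to the cache; the
		 * newest one becomes the candidate returned to the
		 * caller.
		 */
		if (last_page)
			pool->alloc.cache[pool->alloc.count++] = last_page;
		last_page = page;
	}

	/* NULL only if every mapping failed */
	return last_page;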

> + /* Remaining pages store in alloc.cache */
> + list_for_each_entry_safe(page, next, &page_list, lru) {
> + list_del(&page->lru);
> + if ((pp_flags & PP_FLAG_DMA_MAP) &&
> + unlikely(!page_pool_dma_map(pool, page))) {
> + put_page(page);
> + continue;
> + }

So if you added a last_page pointer what you could do is check for it
here and assign it to the alloc cache. If last_page is not set the
block would be skipped.

> + if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> + pool->alloc.cache[pool->alloc.count++] = page;
> + pool->pages_state_hold_cnt++;
> + trace_page_pool_state_hold(pool, page,
> + pool->pages_state_hold_cnt);
> + } else {
> + put_page(page);

If you are just calling put_page here aren't you leaking DMA mappings?
Wouldn't you need to potentially unmap the page before you call
put_page on it?

> + }
> + }
> +out:
> if ((pp_flags & PP_FLAG_DMA_MAP) &&
> - unlikely(!page_pool_dma_map(pool, page))) {
> - put_page(page);
> + unlikely(!page_pool_dma_map(pool, first_page))) {
> + put_page(first_page);

I would probably move this block up and make it a part of the pp_order
block above. Also since you are doing this in 2 spots it might make
sense to look at possibly making this an inline function.
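
Something along these lines, perhaps (the helper name is made up for
illustration):

	static inline bool page_pool_dma_map_or_put(struct page_pool *pool,
						    struct page *page)
	{
		if ((pool->p.flags & PP_FLAG_DMA_MAP) &&
		    unlikely(!page_pool_dma_map(pool, page))) {
			put_page(page);
			return false;
		}
		return true;
	}

Then both call sites reduce to a check on the return value.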

> return NULL;
> }
>
> /* Track how many pages are held 'in-flight' */
> pool->pages_state_hold_cnt++;
> - trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
> + trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
>
> /* When page just alloc'ed is should/must have refcnt 1. */
> - return page;
> + return first_page;
> }
>
> /* For using page_pool replace: alloc_pages() API calls, but provide
> --
> 2.26.2
>

2021-03-12 19:47:01

by Chuck Lever

Subject: Re: [PATCH 5/7] SUNRPC: Refresh rq_pages using a bulk page allocator

Mel, I can send you a tidied and tested update to this patch,
or you can drop the two NFSD patches and I can submit them via
the NFSD tree when alloc_pages_bulk() is merged.

> On Mar 12, 2021, at 1:44 PM, Alexander Duyck <[email protected]> wrote:
>
> On Fri, Mar 12, 2021 at 7:43 AM Mel Gorman <[email protected]> wrote:
>>
>> From: Chuck Lever <[email protected]>
>>
>> Reduce the rate at which nfsd threads hammer on the page allocator.
>> This improves throughput scalability by enabling the threads to run
>> more independently of each other.
>>
>> Signed-off-by: Chuck Lever <[email protected]>
>> Signed-off-by: Mel Gorman <[email protected]>
>> ---
>> net/sunrpc/svc_xprt.c | 43 +++++++++++++++++++++++++++++++------------
>> 1 file changed, 31 insertions(+), 12 deletions(-)
>>
>> diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
>> index cfa7e4776d0e..38a8d6283801 100644
>> --- a/net/sunrpc/svc_xprt.c
>> +++ b/net/sunrpc/svc_xprt.c
>> @@ -642,11 +642,12 @@ static void svc_check_conn_limits(struct svc_serv *serv)
>> static int svc_alloc_arg(struct svc_rqst *rqstp)
>> {
>> struct svc_serv *serv = rqstp->rq_server;
>> + unsigned long needed;
>> struct xdr_buf *arg;
>> + struct page *page;
>> int pages;
>> int i;
>>
>> - /* now allocate needed pages. If we get a failure, sleep briefly */
>> pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
>> if (pages > RPCSVC_MAXPAGES) {
>> pr_warn_once("svc: warning: pages=%u > RPCSVC_MAXPAGES=%lu\n",
>> @@ -654,19 +655,28 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>> /* use as many pages as possible */
>> pages = RPCSVC_MAXPAGES;
>> }
>> - for (i = 0; i < pages ; i++)
>> - while (rqstp->rq_pages[i] == NULL) {
>> - struct page *p = alloc_page(GFP_KERNEL);
>> - if (!p) {
>> - set_current_state(TASK_INTERRUPTIBLE);
>> - if (signalled() || kthread_should_stop()) {
>> - set_current_state(TASK_RUNNING);
>> - return -EINTR;
>> - }
>> - schedule_timeout(msecs_to_jiffies(500));
>> +
>
>> + for (needed = 0, i = 0; i < pages ; i++)
>> + if (!rqstp->rq_pages[i])
>> + needed++;
>
> I would use an opening and closing braces for the for loop since
> technically the if is a multiline statement. It will make this more
> readable.
>
>> + if (needed) {
>> + LIST_HEAD(list);
>> +
>> +retry:
>
> Rather than kind of open code a while loop why not just make this
> "while (needed)"? Then all you have to do is break out of the for loop
> and you will automatically return here instead of having to jump to
> two different labels.
>
>> + alloc_pages_bulk(GFP_KERNEL, needed, &list);
>
> Rather than not using the return value would it make sense here to
> perhaps subtract it from needed? Then you would know if any of the
> allocation requests weren't fulfilled.
>
>> + for (i = 0; i < pages; i++) {
>
> It is probably optimizing for the exception case, but I don't think
> you want the "i = 0" here. If you are having to stop because the list
> is empty it probably makes sense to resume where you left off. So you
> should probably be initializing i to 0 before we check for needed.
>
>> + if (!rqstp->rq_pages[i]) {
>
> It might be cleaner here to just do a "continue" if rq_pages[i] is populated.
>
>> + page = list_first_entry_or_null(&list,
>> + struct page,
>> + lru);
>> + if (unlikely(!page))
>> + goto empty_list;
>
> I think I preferred the original code that wasn't jumping away from
> the loop here. With the change I suggested above that would switch the
> if(needed) to while(needed) you could have it just break out of the
> for loop to place itself back in the while loop.
>
>> + list_del(&page->lru);
>> + rqstp->rq_pages[i] = page;
>> + needed--;
>> }
>> - rqstp->rq_pages[i] = p;
>> }
>> + }
>> rqstp->rq_page_end = &rqstp->rq_pages[pages];
>> rqstp->rq_pages[pages] = NULL; /* this might be seen in nfsd_splice_actor() */
>>
>> @@ -681,6 +691,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
>> arg->len = (pages-1)*PAGE_SIZE;
>> arg->tail[0].iov_len = 0;
>> return 0;
>> +
>> +empty_list:
>> + set_current_state(TASK_INTERRUPTIBLE);
>> + if (signalled() || kthread_should_stop()) {
>> + set_current_state(TASK_RUNNING);
>> + return -EINTR;
>> + }
>> + schedule_timeout(msecs_to_jiffies(500));
>> + goto retry;
>> }
>>
>> static bool
>> --
>> 2.26.2

--
Chuck Lever



2021-03-15 08:43:25

by Jesper Dangaard Brouer

Subject: Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path

On Sat, 13 Mar 2021 13:30:58 +0000
Mel Gorman <[email protected]> wrote:

> On Fri, Mar 12, 2021 at 11:44:09AM -0800, Alexander Duyck wrote:
> > > - /* FUTURE development:
> > > - *
> > > - * Current slow-path essentially falls back to single page
> > > - * allocations, which doesn't improve performance. This code
> > > - * need bulk allocation support from the page allocator code.
> > > - */
> > > -
> > > - /* Cache was empty, do real allocation */
> > > -#ifdef CONFIG_NUMA
> > > - page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
> > > -#else
> > > - page = alloc_pages(gfp, pool->p.order);
> > > -#endif
> > > - if (!page)
> > > + if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
> > > return NULL;
> > >
> > > + /* First page is extracted and returned to caller */
> > > + first_page = list_first_entry(&page_list, struct page, lru);
> > > + list_del(&first_page->lru);
> > > +
> >
> > This seems kind of broken to me. If you pull the first page and then
> > cannot map it you end up returning NULL even if you placed a number of
> > pages in the cache.
> >
>
> I think you're right but I'm punting this to Jesper to fix. He's more
> familiar with this particular code and can verify the performance is
> still ok for high speed networks.

Yes, I'll take a look at this and update the patch accordingly (and
re-run the performance tests).

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-15 13:41:08

by Jesper Dangaard Brouer

Subject: Re: [PATCH 7/7] net: page_pool: use alloc_pages_bulk in refill code path

On Fri, 12 Mar 2021 22:05:45 +0200
Ilias Apalodimas <[email protected]> wrote:

> [...]
> > 6. return last_page
> >
> > > + /* Remaining pages store in alloc.cache */
> > > + list_for_each_entry_safe(page, next, &page_list, lru) {
> > > + list_del(&page->lru);
> > > + if ((pp_flags & PP_FLAG_DMA_MAP) &&
> > > + unlikely(!page_pool_dma_map(pool, page))) {
> > > + put_page(page);
> > > + continue;
> > > + }
> >
> > So if you added a last_page pointer what you could do is check for it
> > here and assign it to the alloc cache. If last_page is not set the
> > block would be skipped.
> >
> > > + if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
> > > + pool->alloc.cache[pool->alloc.count++] = page;
> > > + pool->pages_state_hold_cnt++;
> > > + trace_page_pool_state_hold(pool, page,
> > > + pool->pages_state_hold_cnt);
> > > + } else {
> > > + put_page(page);
> >
> > If you are just calling put_page here aren't you leaking DMA mappings?
> > Wouldn't you need to potentially unmap the page before you call
> > put_page on it?
>
> Oops, I completely missed that. Alexander is right here.

Well, the put_page() case can never happen as the pool->alloc.cache[]
is known to be empty when this function is called. I do agree that the
code looks cumbersome and should free the DMA mapping, if it could
happen.
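
If it could happen, the teardown would need to mirror
page_pool_dma_map() from the earlier patch, e.g. something like this
before dropping the page (sketch only):

	if (pp_flags & PP_FLAG_DMA_MAP)
		dma_unmap_page_attrs(pool->p.dev, page->dma_addr,
				     PAGE_SIZE << pool->p.order,
				     pool->p.dma_dir,
				     DMA_ATTR_SKIP_CPU_SYNC);
	put_page(page);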

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-15 21:22:01

by Jesper Dangaard Brouer

Subject: [PATCH mel-git] Followup: Update [PATCH 7/7] in Mel's series

This patch is against Mel's git-tree:
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git

Using branch mm-bulk-rebase-v4r2, but replacing the last patch, which
converts the page_pool to use __alloc_pages_bulk().

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/log/?h=mm-bulk-rebase-v4r2

While implementing suggestions from Alexander Duyck, I realised that I
could simplify the code further and simply take the last page from
pool->alloc.cache, as this avoids special-casing the last page.
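
A rough sketch of the idea (illustrative only; the actual patch is in
the branch above, and this reuses the names from the earlier version of
the patch in this thread):

	if (unlikely(!__alloc_pages_bulk(gfp, pp_nid, NULL, bulk, &page_list)))
		return NULL;

	/* alloc.cache is empty when the slow path runs, so it cannot
	 * overflow here.
	 */
	list_for_each_entry_safe(page, next, &page_list, lru) {
		list_del(&page->lru);
		if ((pp_flags & PP_FLAG_DMA_MAP) &&
		    unlikely(!page_pool_dma_map(pool, page))) {
			put_page(page);
			continue;
		}
		pool->alloc.cache[pool->alloc.count++] = page;
		pool->pages_state_hold_cnt++;
		trace_page_pool_state_hold(pool, page,
					   pool->pages_state_hold_cnt);
	}

	if (unlikely(!pool->alloc.count))
		return NULL;

	/* Hand the last cached page back to the caller */
	return pool->alloc.cache[--pool->alloc.count];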

I re-ran the performance tests and the improvement has been reduced to
13% from the earlier 18%, but I don't think the rewrite of this specific
patch has anything to do with that.

Notes on tests:
https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#test-on-mel-git-tree

---

Jesper Dangaard Brouer (1):
net: page_pool: use alloc_pages_bulk in refill code path


net/core/page_pool.c | 73 ++++++++++++++++++++++++++++++++------------------
1 file changed, 47 insertions(+), 26 deletions(-)

--

2021-03-17 16:33:30

by Alexander Lobakin

Subject: Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

From: Mel Gorman <[email protected]>
Date: Fri, 12 Mar 2021 15:43:24 +0000

Hi there,

> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2

I gave this series a go on my setup; it showed a bump of 10 Mbps in
UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.

(4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
allocations with MTU of 1508 bytes, linear frames via build_skb(),
GRO + TSO/USO)

I didn't have time to drill into the code, so for now I can't provide
any additional details. You can request anything you need, though, and
I'll try to find a window to collect it.

> Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> vaguely aware that there are other prototype series on netdev that affect
> page_pool. The conflict should be obvious in linux-next.
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
> series, the mm patches have to be merged before the sunrpc and net
> users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with sunrpc and
> the network page pool being the first users. The implementation is not
> particularly efficient and the intention is to iron out what the semantics
> of the API should have for users. Once the semantics are ironed out, it
> can be made more efficient. Despite that, this is a performance-related
> for users that require multiple pages for an operation without multiple
> round-trips to the page allocator. Quoting the last patch for the
> high-speed networking use-case.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to
> create an SKB from the xdp_frame, which then cannot return the page
> to the page_pool. In this case, we saw[1] an improvement of 18.8%
> from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both users in this series are corner cases (NFS and high-speed networks)
> so it is unlikely that most users will see any benefit in the short
> term. Potential other users are batch allocations for page cache
> readahead, fault around and SLUB allocations when high-order pages are
> unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test the target
> users more aggressively but both report performance improvements with
> the initial RFC.
>
> Patch 1 moves GFP flag initialision to prepare_alloc_pages
>
> Patch 2 renames a variable name that is particularly unpopular
>
> Patch 3 adds a bulk page allocator
>
> Patch 4 is a sunrpc cleanup that is a pre-requisite.
>
> Patch 5 is the sunrpc user. Chuck also has a patch which further caches
> pages but is not included in this series. It's not directly
> related to the bulk allocator and as it caches pages, it might
> have other concerns (e.g. does it need a shrinker?)
>
> Patch 6 is a preparation patch only for the network user
>
> Patch 7 converts the net page pool to the bulk allocator for order-0 pages.
>
> include/linux/gfp.h | 12 ++++
> mm/page_alloc.c | 149 +++++++++++++++++++++++++++++++++++++-----
> net/core/page_pool.c | 101 +++++++++++++++++-----------
> net/sunrpc/svc_xprt.c | 47 +++++++++----
> 4 files changed, 240 insertions(+), 69 deletions(-)
>
> --
> 2.26.2

Thanks,
Al

2021-03-17 16:40:30

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

On Wed, 17 Mar 2021 16:31:07 +0000
Alexander Lobakin <[email protected]> wrote:

> From: Mel Gorman <[email protected]>
> Date: Fri, 12 Mar 2021 15:43:24 +0000
>
> Hi there,
>
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
>
> I gave this series a go on my setup, it showed a bump of 10 Mbps on
> UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
>
> (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> allocations with MTU of 1508 bytes, linear frames via build_skb(),
> GRO + TSO/USO)

What NIC driver is this?

> I didn't have time to drill into the code, so for now can't provide
> any additional details. You can request anything you need though and
> I'll try to find a window to collect it.
>
> > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > vaguely aware that there are other prototype series on netdev that affect
> > page_pool. The conflict should be obvious in linux-next.

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-17 16:55:07

by Alexander Lobakin

Subject: Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

From: Jesper Dangaard Brouer <[email protected]>
Date: Wed, 17 Mar 2021 17:38:44 +0100

> On Wed, 17 Mar 2021 16:31:07 +0000
> Alexander Lobakin <[email protected]> wrote:
>
> > From: Mel Gorman <[email protected]>
> > Date: Fri, 12 Mar 2021 15:43:24 +0000
> >
> > Hi there,
> >
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> >
> > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> >
> > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > GRO + TSO/USO)
>
> What NIC driver is this?

Ah, forgot to mention. It's a WIP driver, not yet mainlined.
The NIC itself is basically on-SoC 1G chip.

> > I didn't have time to drill into the code, so for now can't provide
> > any additional details. You can request anything you need though and
> > I'll try to find a window to collect it.
> >
> > > Note to Chuck and Jesper -- as this is a cross-subsystem series, you may
> > > want to send the sunrpc and page_pool pre-requisites (patches 4 and 6)
> > > directly to the subsystem maintainers. While sunrpc is low-risk, I'm
> > > vaguely aware that there are other prototype series on netdev that affect
> > > page_pool. The conflict should be obvious in linux-next.
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer

Al

2021-03-17 17:27:41

by Jesper Dangaard Brouer

Subject: Re: [PATCH 0/7 v4] Introduce a bulk order-0 page allocator with two in-tree users

On Wed, 17 Mar 2021 16:52:32 +0000
Alexander Lobakin <[email protected]> wrote:

> From: Jesper Dangaard Brouer <[email protected]>
> Date: Wed, 17 Mar 2021 17:38:44 +0100
>
> > On Wed, 17 Mar 2021 16:31:07 +0000
> > Alexander Lobakin <[email protected]> wrote:
> >
> > > From: Mel Gorman <[email protected]>
> > > Date: Fri, 12 Mar 2021 15:43:24 +0000
> > >
> > > Hi there,
> > >
> > > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > > following git tree
> > > >
> > > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v4r2
> > >
> > > I gave this series a go on my setup, it showed a bump of 10 Mbps on
> > > UDP forwarding, but dropped TCP forwarding by almost 50 Mbps.
> > >
> > > (4 core 1.2GHz MIPS32 R2, page size of 16 Kb, Page Pool order-0
> > > allocations with MTU of 1508 bytes, linear frames via build_skb(),
> > > GRO + TSO/USO)
> >
> > What NIC driver is this?
>
> Ah, forgot to mention. It's a WIP driver, not yet mainlined.
> The NIC itself is basically on-SoC 1G chip.

Hmm, then it is really hard to check if your driver is doing something
else that could cause this.

Well, can you try lowering the page_pool bulking size, to test the
theory from Wilcox that we should do smaller bulking to avoid pushing
cachelines into L2 when walking the LRU list? You might have to go as
low as bulk=8 (to match the N-way associativity of the L1 cache).

In function: __page_pool_alloc_pages_slow() adjust variable:
const int bulk = PP_ALLOC_CACHE_REFILL;
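
e.g. as a throwaway test-only change:

-	const int bulk = PP_ALLOC_CACHE_REFILL;
+	const int bulk = 8;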


--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-03-19 18:19:31

by Vlastimil Babka

Subject: Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator

On 3/12/21 4:43 PM, Mel Gorman wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
>
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
>
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
>
> Signed-off-by: Mel Gorman <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

Although maybe premature, if it changes significantly due to the users'
performance feedback, let's see :)

Some nits below:

...

> @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
> return true;
> }
>
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_gfp;
> + unsigned int alloc_flags;
> + int allocated = 0;
> +
> + if (WARN_ON_ONCE(nr_pages <= 0))
> + return 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> + &alloc_gfp, &alloc_flags))

Unusual indentation here.

> + return 0;
> + gfp = alloc_gfp;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> + if (zone_watermark_fast(zone, 0, mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp)) {
> + break;
> + }
> + }
> + if (!zone)
> + return 0;

Why not also "goto failed;" here?

2021-03-22 08:32:01

by Mel Gorman

Subject: Re: [PATCH 3/7] mm/page_alloc: Add a bulk page allocator

On Fri, Mar 19, 2021 at 07:18:32PM +0100, Vlastimil Babka wrote:
> On 3/12/21 4:43 PM, Mel Gorman wrote:
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
> >
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> >
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> >
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
> Although maybe premature, if it changes significantly due to the users'
> performance feedback, let's see :)
>

Indeed. The next version will have no users so that Jesper and Chuck
can check if an array-based or LRU-based version is better. There were
also bugs, such as broken accounting of stats, that had to be fixed,
which increases the overhead.

> Some nits below:
>
> ...
>
> > @@ -4963,6 +4978,107 @@ static inline bool prepare_alloc_pages(gfp_t gfp, unsigned int order,
> > return true;
> > }
> >
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + *
> > + * Returns the number of pages allocated.
> > + */
> > +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> > + nodemask_t *nodemask, int nr_pages,
> > + struct list_head *alloc_list)
> > +{
> > + struct page *page;
> > + unsigned long flags;
> > + struct zone *zone;
> > + struct zoneref *z;
> > + struct per_cpu_pages *pcp;
> > + struct list_head *pcp_list;
> > + struct alloc_context ac;
> > + gfp_t alloc_gfp;
> > + unsigned int alloc_flags;
> > + int allocated = 0;
> > +
> > + if (WARN_ON_ONCE(nr_pages <= 0))
> > + return 0;
> > +
> > + if (nr_pages == 1)
> > + goto failed;
> > +
> > + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > + if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac,
> > + &alloc_gfp, &alloc_flags))
>
> Unusual indentation here.
>

Fixed

> > + return 0;
> > + gfp = alloc_gfp;
> > +
> > + /* Find an allowed local zone that meets the high watermark. */
> > + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > + unsigned long mark;
> > +
> > + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > + !__cpuset_zone_allowed(zone, gfp)) {
> > + continue;
> > + }
> > +
> > + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> > + goto failed;
> > + }
> > +
> > + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> > + if (zone_watermark_fast(zone, 0, mark,
> > + zonelist_zone_idx(ac.preferred_zoneref),
> > + alloc_flags, gfp)) {
> > + break;
> > + }
> > + }
> > + if (!zone)
> > + return 0;
>
> Why not also "goto failed;" here?

Good question. When first written, it was because the zone search for the
normal allocator was almost certainly going to fail to find a zone and
it was expected that callers would prefer to fail fast over blocking.
Now we know that sunrpc can sleep on a failing allocation and it would
be better to enter the single page allocator and reclaim pages instead of
"sleep and hope for the best".

--
Mel Gorman
SUSE Labs