This series introduces a bulk order-0 page allocator, with sunrpc and
the network page pool being the first users. The implementation is not
particularly efficient; the intention is to iron out what semantics the
API should have for its users. Once the semantics are nailed down, the
implementation can be made more efficient.
Improving the implementation requires fairly deep surgery in numerous
places. The lock scope would need to be significantly reduced, particularly
as vmstat, the per-cpu allocator and the buddy allocator have different
locking protocols that overlap -- e.g. all partially depend on IRQs being
disabled at various points. Secondly, the core of the allocator deals with
single pages whereas both the bulk allocator and the per-cpu allocator
operate in batches. All of that has to be reconciled with all the existing
users and their constraints (memory offline, CMA and cpusets being the
trickiest).
Light testing passed. I'm relying on Chuck and Jesper to test the target
users more aggressively, but both report performance improvements with the
initial RFC.
Patch 1 of this series is a cleanup to sunrpc; it could be merged
separately but is included here as a prerequisite.
Patch 2 is the prototype bulk allocator
Patch 3 is the sunrpc user. Chuck also has a patch which further caches
pages but is not included in this series. It's not directly
related to the bulk allocator and as it caches pages, it might
have other concerns (e.g. does it need a shrinker?)
Patch 4 is a preparation patch only for the network user
Patch 5 converts the net page pool to the bulk allocator for order-0 pages.
include/linux/gfp.h | 13 +++++
mm/page_alloc.c | 113 +++++++++++++++++++++++++++++++++++++++++-
net/core/page_pool.c | 102 +++++++++++++++++++++++---------------
net/sunrpc/svc_xprt.c | 47 ++++++++++++------
4 files changed, 220 insertions(+), 55 deletions(-)
--
2.26.2
From: Jesper Dangaard Brouer <[email protected]>
There are cases where the page_pool needs to refill with pages from the
page allocator. Some workloads cause the page_pool to release pages
instead of recycling these pages.
For these workloads it can improve performance to bulk allocate pages
from the page allocator to refill the alloc cache.
For an XDP-redirect workload with the 100G mlx5 driver (which uses
page_pool), redirecting xdp_frame packets into a veth that does XDP_PASS
to create an SKB from the xdp_frame, the page cannot be returned to the
page_pool. In this case, we saw[1] an improvement of 18.8% from using
the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
net/core/page_pool.c | 63 ++++++++++++++++++++++++++++----------------
1 file changed, 40 insertions(+), 23 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index a26f2ceb6a87..567680bd91c4 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -208,44 +208,61 @@ noinline
static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t _gfp)
{
+ const int bulk = PP_ALLOC_CACHE_REFILL;
+ struct page *page, *next, *first_page;
unsigned int pp_flags = pool->p.flags;
- struct page *page;
+ unsigned int pp_order = pool->p.order;
+ int pp_nid = pool->p.nid;
+ LIST_HEAD(page_list);
gfp_t gfp = _gfp;
- /* We could always set __GFP_COMP, and avoid this branch, as
- * prep_new_page() can handle order-0 with __GFP_COMP.
- */
- if (pool->p.order)
+ /* Don't support bulk alloc for high-order pages */
+ if (unlikely(pp_order)) {
gfp |= __GFP_COMP;
+ first_page = alloc_pages_node(pp_nid, gfp, pp_order);
+ if (unlikely(!first_page))
+ return NULL;
+ goto out;
+ }
- /* FUTURE development:
- *
- * Current slow-path essentially falls back to single page
- * allocations, which doesn't improve performance. This code
- * need bulk allocation support from the page allocator code.
- */
-
- /* Cache was empty, do real allocation */
-#ifdef CONFIG_NUMA
- page = alloc_pages_node(pool->p.nid, gfp, pool->p.order);
-#else
- page = alloc_pages(gfp, pool->p.order);
-#endif
- if (!page)
+ if (unlikely(!__alloc_pages_bulk_nodemask(gfp, pp_nid, NULL,
+ bulk, &page_list)))
return NULL;
+ /* First page is extracted and returned to caller */
+ first_page = list_first_entry(&page_list, struct page, lru);
+ list_del(&first_page->lru);
+
+ /* Remaining pages are stored in alloc.cache */
+ list_for_each_entry_safe(page, next, &page_list, lru) {
+ list_del(&page->lru);
+ if (pp_flags & PP_FLAG_DMA_MAP &&
+ unlikely(!page_pool_dma_map(pool, page))) {
+ put_page(page);
+ continue;
+ }
+ if (likely(pool->alloc.count < PP_ALLOC_CACHE_SIZE)) {
+ pool->alloc.cache[pool->alloc.count++] = page;
+ pool->pages_state_hold_cnt++;
+ trace_page_pool_state_hold(pool, page,
+ pool->pages_state_hold_cnt);
+ } else {
+ put_page(page);
+ }
+ }
+out:
if (pp_flags & PP_FLAG_DMA_MAP &&
- unlikely(!page_pool_dma_map(pool, page))) {
- put_page(page);
+ unlikely(!page_pool_dma_map(pool, first_page))) {
+ put_page(first_page);
return NULL;
}
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
- trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
+ trace_page_pool_state_hold(pool, first_page, pool->pages_state_hold_cnt);
/* When page just alloc'ed is should/must have refcnt 1. */
- return page;
+ return first_page;
}
/* For using page_pool replace: alloc_pages() API calls, but provide
--
2.26.2
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk_nodemask. A caller requests a number of pages
to be allocated and added to a list. They can be freed in bulk using
free_pages_bulk().
The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.
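For illustration only, a caller is expected to use the API along the
following lines (a sketch, not taken from either user in this series):
	LIST_HEAD(pagelist);
	unsigned long nr;

	/* Request 16 order-0 pages; fewer may be returned. */
	nr = alloc_pages_bulk(GFP_KERNEL, 16, &pagelist);
	while (nr < 16) {
		unsigned long more;

		/* Ask again for the shortfall, give up if nothing arrives. */
		more = alloc_pages_bulk(GFP_KERNEL, 16 - nr, &pagelist);
		if (!more)
			break;
		nr += more;
	}

	/* ... consume pages linked via page->lru ... */

	/* Drop the references and free whatever is left on the list. */
	free_pages_bulk(&pagelist);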
Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 13 +++++
mm/page_alloc.c | 113 +++++++++++++++++++++++++++++++++++++++++++-
2 files changed, 124 insertions(+), 2 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 8572a1474e16..4903d1cc48dc 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page *page)
}
#endif
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *list);
+
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
@@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
}
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head *list)
+{
+ return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
+ nr_pages, list);
+}
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
@@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);
struct page_frag_cache;
extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3e4b29ee2b1e..ff1e55793786 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
}
}
+/* Drop reference counts and free order-0 pages from a list. */
+void free_pages_bulk(struct list_head *list)
+{
+ struct page *page, *next;
+
+ list_for_each_entry_safe(page, next, list, lru) {
+ trace_mm_page_free_batched(page);
+ if (put_page_testzero(page)) {
+ list_del(&page->lru);
+ __free_pages_ok(page, 0, FPI_NONE);
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
static inline unsigned int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
@@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
struct alloc_context *ac, gfp_t *alloc_mask,
unsigned int *alloc_flags)
{
+ gfp_mask &= gfp_allowed_mask;
+ *alloc_mask = gfp_mask;
+
ac->highest_zoneidx = gfp_zone(gfp_mask);
ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
ac->nodemask = nodemask;
@@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
return true;
}
+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ */
+int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *alloc_list)
+{
+ struct page *page;
+ unsigned long flags;
+ struct zone *zone;
+ struct zoneref *z;
+ struct per_cpu_pages *pcp;
+ struct list_head *pcp_list;
+ struct alloc_context ac;
+ gfp_t alloc_mask;
+ unsigned int alloc_flags;
+ int alloced = 0;
+
+ if (nr_pages == 1)
+ goto failed;
+
+ /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+ if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
+ return 0;
+ gfp_mask = alloc_mask;
+
+ /* Find an allowed local zone that meets the high watermark. */
+ for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+ unsigned long mark;
+
+ if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+ !__cpuset_zone_allowed(zone, gfp_mask)) {
+ continue;
+ }
+
+ if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+ zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+ goto failed;
+ }
+
+ mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+ if (zone_watermark_fast(zone, 0, mark,
+ zonelist_zone_idx(ac.preferred_zoneref),
+ alloc_flags, gfp_mask)) {
+ break;
+ }
+ }
+ if (!zone)
+ return 0;
+
+ /* Attempt the batch allocation */
+ local_irq_save(flags);
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp_list = &pcp->lists[ac.migratetype];
+
+ while (alloced < nr_pages) {
+ page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+ pcp, pcp_list);
+ if (!page)
+ break;
+
+ prep_new_page(page, 0, gfp_mask, 0);
+ list_add(&page->lru, alloc_list);
+ alloced++;
+ }
+
+ if (!alloced)
+ goto failed_irq;
+
+ if (alloced) {
+ __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
+ zone_statistics(zone, zone);
+ }
+
+ local_irq_restore(flags);
+
+ return alloced;
+
+failed_irq:
+ local_irq_restore(flags);
+
+failed:
+ page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
+ if (page) {
+ alloced++;
+ list_add(&page->lru, alloc_list);
+ }
+
+ return alloced;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -4981,8 +5092,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
return NULL;
}
- gfp_mask &= gfp_allowed_mask;
- alloc_mask = gfp_mask;
if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
return NULL;
--
2.26.2
From: Jesper Dangaard Brouer <[email protected]>
In preparation for the next patch, move the DMA mapping into its own
function, as this will make it easier to follow the changes.
V2: make page_pool_dma_map return boolean (Ilias)
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
net/core/page_pool.c | 45 +++++++++++++++++++++++++-------------------
1 file changed, 26 insertions(+), 19 deletions(-)
diff --git a/net/core/page_pool.c b/net/core/page_pool.c
index ad8b0707af04..a26f2ceb6a87 100644
--- a/net/core/page_pool.c
+++ b/net/core/page_pool.c
@@ -180,14 +180,37 @@ static void page_pool_dma_sync_for_device(struct page_pool *pool,
pool->p.dma_dir);
}
+static bool page_pool_dma_map(struct page_pool *pool, struct page *page)
+{
+ dma_addr_t dma;
+
+ /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
+ * since dma_addr_t can be either 32 or 64 bits and does not always fit
+ * into page private data (i.e 32bit cpu with 64bit DMA caps)
+ * This mapping is kept for lifetime of page, until leaving pool.
+ */
+ dma = dma_map_page_attrs(pool->p.dev, page, 0,
+ (PAGE_SIZE << pool->p.order),
+ pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
+ if (dma_mapping_error(pool->p.dev, dma))
+ return false;
+
+ page->dma_addr = dma;
+
+ if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
+ page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
+
+ return true;
+}
+
/* slow path */
noinline
static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
gfp_t _gfp)
{
+ unsigned int pp_flags = pool->p.flags;
struct page *page;
gfp_t gfp = _gfp;
- dma_addr_t dma;
/* We could always set __GFP_COMP, and avoid this branch, as
* prep_new_page() can handle order-0 with __GFP_COMP.
@@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
if (!page)
return NULL;
- if (!(pool->p.flags & PP_FLAG_DMA_MAP))
- goto skip_dma_map;
-
- /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
- * since dma_addr_t can be either 32 or 64 bits and does not always fit
- * into page private data (i.e 32bit cpu with 64bit DMA caps)
- * This mapping is kept for lifetime of page, until leaving pool.
- */
- dma = dma_map_page_attrs(pool->p.dev, page, 0,
- (PAGE_SIZE << pool->p.order),
- pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
- if (dma_mapping_error(pool->p.dev, dma)) {
+ if (pp_flags & PP_FLAG_DMA_MAP &&
+ unlikely(!page_pool_dma_map(pool, page))) {
put_page(page);
return NULL;
}
- page->dma_addr = dma;
- if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
- page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
-
-skip_dma_map:
/* Track how many pages are held 'in-flight' */
pool->pages_state_hold_cnt++;
-
trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
/* When page just alloc'ed is should/must have refcnt 1. */
--
2.26.2
Hi Mel,
Can you please CC me in future revisions. I almost missed that!
On Mon, Mar 01, 2021 at 04:11:59PM +0000, Mel Gorman wrote:
> From: Jesper Dangaard Brouer <[email protected]>
>
> In preparation for next patch, move the dma mapping into its own
> function, as this will make it easier to follow the changes.
>
> V2: make page_pool_dma_map return boolean (Ilias)
>
[...]
> static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> gfp_t _gfp)
> {
> + unsigned int pp_flags = pool->p.flags;
> struct page *page;
> gfp_t gfp = _gfp;
> - dma_addr_t dma;
>
> /* We could always set __GFP_COMP, and avoid this branch, as
> * prep_new_page() can handle order-0 with __GFP_COMP.
> @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> if (!page)
> return NULL;
>
> - if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> - goto skip_dma_map;
> -
> - /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> - * since dma_addr_t can be either 32 or 64 bits and does not always fit
> - * into page private data (i.e 32bit cpu with 64bit DMA caps)
> - * This mapping is kept for lifetime of page, until leaving pool.
> - */
> - dma = dma_map_page_attrs(pool->p.dev, page, 0,
> - (PAGE_SIZE << pool->p.order),
> - pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> - if (dma_mapping_error(pool->p.dev, dma)) {
> + if (pp_flags & PP_FLAG_DMA_MAP &&
Nitpick, but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...
> + unlikely(!page_pool_dma_map(pool, page))) {
> put_page(page);
> return NULL;
> }
> - page->dma_addr = dma;
>
> - if (pool->p.flags & PP_FLAG_DMA_SYNC_DEV)
> - page_pool_dma_sync_for_device(pool, page, pool->p.max_len);
> -
> -skip_dma_map:
> /* Track how many pages are held 'in-flight' */
> pool->pages_state_hold_cnt++;
> -
> trace_page_pool_state_hold(pool, page, pool->pages_state_hold_cnt);
>
> /* When page just alloc'ed is should/must have refcnt 1. */
> --
> 2.26.2
>
Otherwise
Reviewed-by: Ilias Apalodimas <[email protected]>
On Wed, 3 Mar 2021 09:18:25 +0000
Mel Gorman <[email protected]> wrote:
> On Tue, Mar 02, 2021 at 08:49:06PM +0200, Ilias Apalodimas wrote:
> > On Mon, Mar 01, 2021 at 04:11:59PM +0000, Mel Gorman wrote:
> > > From: Jesper Dangaard Brouer <[email protected]>
> > >
> > > In preparation for next patch, move the dma mapping into its own
> > > function, as this will make it easier to follow the changes.
> > >
> > > V2: make page_pool_dma_map return boolean (Ilias)
> > >
> >
> > [...]
> >
> > > @@ -211,30 +234,14 @@ static struct page *__page_pool_alloc_pages_slow(struct page_pool *pool,
> > > if (!page)
> > > return NULL;
> > >
> > > - if (!(pool->p.flags & PP_FLAG_DMA_MAP))
> > > - goto skip_dma_map;
> > > -
> > > - /* Setup DMA mapping: use 'struct page' area for storing DMA-addr
> > > - * since dma_addr_t can be either 32 or 64 bits and does not always fit
> > > - * into page private data (i.e 32bit cpu with 64bit DMA caps)
> > > - * This mapping is kept for lifetime of page, until leaving pool.
> > > - */
> > > - dma = dma_map_page_attrs(pool->p.dev, page, 0,
> > > - (PAGE_SIZE << pool->p.order),
> > > - pool->p.dma_dir, DMA_ATTR_SKIP_CPU_SYNC);
> > > - if (dma_mapping_error(pool->p.dev, dma)) {
> > > + if (pp_flags & PP_FLAG_DMA_MAP &&
> >
> > Nit pick but can we have if ((pp_flags & PP_FLAG_DMA_MAP) && ...
> >
>
> Done.
Thanks for fixing this nitpick, and carrying the patch.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
Would vmalloc be another good user of this API?
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
This crazy long line is really hard to follow.
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
Same here.
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
No need for the curly braces.
> }
>
> - gfp_mask &= gfp_allowed_mask;
> - alloc_mask = gfp_mask;
Is this change intentional?
On Tue, Mar 09, 2021 at 05:12:30PM +0000, Christoph Hellwig wrote:
> Would vmalloc be another good user of this API?
>
> > + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
>
> This crazy long line is really hard to follow.
>
It's not crazier than what is already in alloc_pages_nodemask and it's
done that way to share code.
> > + return 0;
> > + gfp_mask = alloc_mask;
> > +
> > + /* Find an allowed local zone that meets the high watermark. */
> > + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
>
> Same here.
>
Similar to what happens in get_page_from_freelist with the
for_next_zone_zonelist_nodemask iterator.
> > + unsigned long mark;
> > +
> > + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > + !__cpuset_zone_allowed(zone, gfp_mask)) {
> > + continue;
> > + }
>
> No need for the curly braces.
>
Yes, but it's for coding style. MM has no hard coding style guidelines
around this but for sched, it's generally preferred that if the "if"
statement spans multiple lines then it should use {} even if the block
is one line long for clarity.
> > }
> >
> > - gfp_mask &= gfp_allowed_mask;
> > - alloc_mask = gfp_mask;
>
> Is this change intentional?
Yes, so that prepare_alloc_pages() works for both the single page and bulk
allocators. Slightly less code duplication.
--
Mel Gorman
SUSE Labs
Mel Gorman <[email protected]> writes:
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 8572a1474e16..4903d1cc48dc 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -515,6 +515,10 @@ static inline int arch_make_page_accessible(struct page *page)
> }
> #endif
>
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *list);
> +
> struct page *
> __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> nodemask_t *nodemask);
> @@ -525,6 +529,14 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
> return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
> }
>
> +/* Bulk allocate order-0 pages */
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned long nr_pages, struct list_head *list)
> +{
> + return __alloc_pages_bulk_nodemask(gfp_mask, numa_mem_id(), NULL,
> + nr_pages, list);
Is the second line indentation intentional? Why not align it to
the first argument (gfp_mask)?
> +}
> +
> /*
> * Allocate pages, preferring the node given as nid. The node must be valid and
> * online. For more general interface, see alloc_pages_node().
> @@ -594,6 +606,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
>
> extern void __free_pages(struct page *page, unsigned int order);
> extern void free_pages(unsigned long addr, unsigned int order);
> +extern void free_pages_bulk(struct list_head *list);
>
> struct page_frag_cache;
> extern void __page_frag_cache_drain(struct page *page, unsigned int count);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3e4b29ee2b1e..ff1e55793786 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4436,6 +4436,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
> }
> }
> ...
>
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + */
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_mask;
> + unsigned int alloc_flags;
> + int alloced = 0;
Does alloced count the number of allocated pages? Do you mind
renaming it to 'allocated'?
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> + if (zone_watermark_fast(zone, 0, mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp_mask)) {
> + break;
> + }
> + }
> + if (!zone)
> + return 0;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (alloced < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
Same indentation comment as before
> + if (!page)
> + break;
> +
> + prep_new_page(page, 0, gfp_mask, 0);
> + list_add(&page->lru, alloc_list);
> + alloced++;
> + }
> +
> + if (!alloced)
> + goto failed_irq;
> +
> + if (alloced) {
> + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
> + zone_statistics(zone, zone);
> + }
> +
> + local_irq_restore(flags);
> +
> + return alloced;
> +
> +failed_irq:
> + local_irq_restore(flags);
> +
> +failed:
> + page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
> + if (page) {
> + alloced++;
> + list_add(&page->lru, alloc_list);
> + }
> +
> + return alloced;
> +}
> +EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
> +
> /*
> * This is the 'heart' of the zoned buddy allocator.
> */
> @@ -4981,8 +5092,6 @@ __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> return NULL;
> }
>
> - gfp_mask &= gfp_allowed_mask;
> - alloc_mask = gfp_mask;
> if (!prepare_alloc_pages(gfp_mask, order, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> return NULL;
On Wed, 10 Mar 2021 10:46:15 +0000 Mel Gorman <[email protected]> wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> to be allocated and added to a list. They can be freed in bulk using
> free_pages_bulk().
Why am I surprised we don't already have this.
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
>
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
>
> ...
>
> +/* Drop reference counts and free order-0 pages from a list. */
> +void free_pages_bulk(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + trace_mm_page_free_batched(page);
> + if (put_page_testzero(page)) {
> + list_del(&page->lru);
> + __free_pages_ok(page, 0, FPI_NONE);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);
I expect that batching games are planned in here as well?
> static inline unsigned int
> gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> struct alloc_context *ac, gfp_t *alloc_mask,
> unsigned int *alloc_flags)
> {
> + gfp_mask &= gfp_allowed_mask;
> + *alloc_mask = gfp_mask;
> +
> ac->highest_zoneidx = gfp_zone(gfp_mask);
> ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> ac->nodemask = nodemask;
> @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> return true;
> }
>
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + */
Documentation is rather lame. Returns number of pages allocated...
> +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long flags;
> + struct zone *zone;
> + struct zoneref *z;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + struct alloc_context ac;
> + gfp_t alloc_mask;
> + unsigned int alloc_flags;
> + int alloced = 0;
> +
> + if (nr_pages == 1)
> + goto failed;
> +
> + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> + return 0;
> + gfp_mask = alloc_mask;
> +
> + /* Find an allowed local zone that meets the high watermark. */
> + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> + unsigned long mark;
> +
> + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> + !__cpuset_zone_allowed(zone, gfp_mask)) {
> + continue;
> + }
> +
> + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> + goto failed;
> + }
> +
> + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> + if (zone_watermark_fast(zone, 0, mark,
> + zonelist_zone_idx(ac.preferred_zoneref),
> + alloc_flags, gfp_mask)) {
> + break;
> + }
> + }
I suspect the above was stolen from elsewhere and that some code
commonification is planned.
> + if (!zone)
> + return 0;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (alloced < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
> + if (!page)
> + break;
> +
> + prep_new_page(page, 0, gfp_mask, 0);
I wonder if it would be worth running prep_new_page() in a second pass,
after reenabling interrupts.
Speaking of which, will the realtime people get upset about the
irqs-off latency? How many pages are we talking about here?
> + list_add(&page->lru, alloc_list);
> + alloced++;
> + }
> +
> + if (!alloced)
> + goto failed_irq;
> +
> + if (alloced) {
> + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
> + zone_statistics(zone, zone);
> + }
> +
> + local_irq_restore(flags);
> +
> + return alloced;
> +
> +failed_irq:
> + local_irq_restore(flags);
> +
> +failed:
Might we need some counter to show how often this path happens?
> + page = __alloc_pages_nodemask(gfp_mask, 0, preferred_nid, nodemask);
> + if (page) {
> + alloced++;
> + list_add(&page->lru, alloc_list);
> + }
> +
> + return alloced;
> +}
> +EXPORT_SYMBOL_GPL(__alloc_pages_bulk_nodemask);
> +
On Wed, Mar 10, 2021 at 03:46:50PM -0800, Andrew Morton wrote:
> On Wed, 10 Mar 2021 10:46:15 +0000 Mel Gorman <[email protected]> wrote:
>
> > This patch adds a new page allocator interface via alloc_pages_bulk,
> > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > to be allocated and added to a list. They can be freed in bulk using
> > free_pages_bulk().
>
> Why am I surprised we don't already have this.
>
It was prototyped a few years ago and discussed at LSF/MM so it's in
the back of your memory somewhere. It never got merged because it lacked
users and I didn't think carrying dead untested code was appropriate.
> > The API is not guaranteed to return the requested number of pages and
> > may fail if the preferred allocation zone has limited free memory, the
> > cpuset changes during the allocation or page debugging decides to fail
> > an allocation. It's up to the caller to request more pages in batch
> > if necessary.
> >
> > Note that this implementation is not very efficient and could be improved
> > but it would require refactoring. The intent is to make it available early
> > to determine what semantics are required by different callers. Once the
> > full semantics are nailed down, it can be refactored.
> >
> > ...
> >
> > +/* Drop reference counts and free order-0 pages from a list. */
> > +void free_pages_bulk(struct list_head *list)
> > +{
> > + struct page *page, *next;
> > +
> > + list_for_each_entry_safe(page, next, list, lru) {
> > + trace_mm_page_free_batched(page);
> > + if (put_page_testzero(page)) {
> > + list_del(&page->lru);
> > + __free_pages_ok(page, 0, FPI_NONE);
> > + }
> > + }
> > +}
> > +EXPORT_SYMBOL_GPL(free_pages_bulk);
>
> I expect that batching games are planned in here as well?
>
Potentially it could be done but the page allocator would need to be
fundamentally aware of batching to make it tidy or the per-cpu allocator
would need knowledge of how to handle batches in the free path. Batch
freeing to the buddy allocator is problematic as buddy merging has to
happen. Batch freeing to per-cpu hits pcp->high limitations.
There are a couple of ways it *could* be done. Per-cpu lists could be
allowed to temporarily exceed the high limits and reduce them out-of-band
like what happens with counter updates or remote pcp freeing. Care
would need to be taken when memory is low to avoid premature OOM
and to guarantee draining happens in a timely fashion. There would be
additional benefits to this. For example, release_pages() can hammer the
zone lock when freeing very large batches and would benefit from either
large batching or "plugging" the per-cpu list. I prototyped a series to
allow the batch limits to be temporarily exceeded but it did not actually
improve performance because of errors in the implementation and it needs
a lot of work.
> > static inline unsigned int
> > gfp_to_alloc_flags(gfp_t gfp_mask)
> > {
> > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> > struct alloc_context *ac, gfp_t *alloc_mask,
> > unsigned int *alloc_flags)
> > {
> > + gfp_mask &= gfp_allowed_mask;
> > + *alloc_mask = gfp_mask;
> > +
> > ac->highest_zoneidx = gfp_zone(gfp_mask);
> > ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > ac->nodemask = nodemask;
> > @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> > return true;
> > }
> >
> > +/*
> > + * This is a batched version of the page allocator that attempts to
> > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > + */
>
> Documentation is rather lame. Returns number of pages allocated...
>
I added a note on the return value. The documentation is lame because at
this point, we do not know what the required semantics for future users
are. We have two examples at the moment in this series but I think it
would be better to add kerneldoc documentation when there is a reasonable
expectation that the API will not change. For example, SLUB could use
this API when it fails to allocate a high-order page and instead allocate
batches of order-0 pages but I did not investigate how feasible that
is. Similarly, it's possible that we really need to deal with high-order
batch allocations in which case, the per-cpu list should be high-order
aware or the core buddy allocator needs to be batch-allocation aware.
> > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > + nodemask_t *nodemask, int nr_pages,
> > + struct list_head *alloc_list)
> > +{
> > + struct page *page;
> > + unsigned long flags;
> > + struct zone *zone;
> > + struct zoneref *z;
> > + struct per_cpu_pages *pcp;
> > + struct list_head *pcp_list;
> > + struct alloc_context ac;
> > + gfp_t alloc_mask;
> > + unsigned int alloc_flags;
> > + int alloced = 0;
> > +
> > + if (nr_pages == 1)
> > + goto failed;
> > +
> > + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> > + return 0;
> > + gfp_mask = alloc_mask;
> > +
> > + /* Find an allowed local zone that meets the high watermark. */
> > + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > + unsigned long mark;
> > +
> > + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > + !__cpuset_zone_allowed(zone, gfp_mask)) {
> > + continue;
> > + }
> > +
> > + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> > + goto failed;
> > + }
> > +
> > + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> > + if (zone_watermark_fast(zone, 0, mark,
> > + zonelist_zone_idx(ac.preferred_zoneref),
> > + alloc_flags, gfp_mask)) {
> > + break;
> > + }
> > + }
>
> I suspect the above was stolen from elsewhere and that some code
> commonification is planned.
>
It's based on get_page_from_freelist. It would be messy to have them share
common code at this point with a risk that the fast path for the common
path (single page requests) would be impaired. The issue is that the
fast path and slow paths have zonelist iteration, kswapd wakeup, cpuset
enforcement and reclaim actions all mixed together at various different
points. The locking is also mixed up with per-cpu list locking, statistic
locking and buddy locking all having inappropriate overlaps (e.g. IRQ
disabling protects per-cpu list locking, partially and unnecessarily
protects statistics depending on architecture and overlaps with the
IRQ-safe zone lock).
Ironing this out risks hurting the single page allocation path. It would
need to be done incrementally with ultimately the core of the allocator
dealing with batches to avoid false bisections.
>
> > + if (!zone)
> > + return 0;
> > +
> > + /* Attempt the batch allocation */
> > + local_irq_save(flags);
> > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > + pcp_list = &pcp->lists[ac.migratetype];
> > +
> > + while (alloced < nr_pages) {
> > + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > + pcp, pcp_list);
> > + if (!page)
> > + break;
> > +
> > + prep_new_page(page, 0, gfp_mask, 0);
>
> I wonder if it would be worth running prep_new_page() in a second pass,
> after reenabling interrupts.
>
Possibly, I could add another patch on top that does this because it's
trading the time that IRQs are disabled for a list iteration.
> Speaking of which, will the realtime people get upset about the
> irqs-off latency? How many pages are we talking about here?
>
At the moment, it looks like batches of up to a few hundred at worst. I
don't think realtime sensitive applications are likely to be using the
bulk allocator API at this point.
The realtime people have a worse problem in that the per-cpu list does
not use local_lock and disables IRQs more than it needs to on x86 in
particular. I've a prototype series for this as well which splits the
locking for the per-cpu list and statistic handling and then converts the
per-cpu list to local_lock but I'm getting this off the table first because
I don't want multiple page allocator series in flight at the same time.
Thomas, Peter and Ingo would need to be cc'd on that series to review
the local_lock aspects.
Even with local_lock, it's not clear to me why per-cpu lists need to be
locked at all because potentially it could use a lock-free llist with some
struct page overloading. That one is harder to predict when batches are
taken into account as splicing a batch of free pages with llist would be
unsafe so batch free might exchange IRQ disabling overhead with multiple
atomics. I'd need to recheck things like whether NMI handlers ever call
the page allocator (they shouldn't but it should be checked). It would
need a lot of review and testing.
> > + list_add(&page->lru, alloc_list);
> > + alloced++;
> > + }
> > +
> > + if (!alloced)
> > + goto failed_irq;
> > +
> > + if (alloced) {
> > + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
> > + zone_statistics(zone, zone);
> > + }
> > +
> > + local_irq_restore(flags);
> > +
> > + return alloced;
> > +
> > +failed_irq:
> > + local_irq_restore(flags);
> > +
> > +failed:
>
> Might we need some counter to show how often this path happens?
>
I think that would be overkill at this point. It only gives useful
information to a developer using the API for the first time and that can
be done with a debugging patch (or probes if you're feeling creative).
I'm already unhappy with the counter overhead in the page allocator.
zone_statistics in particular has no business being an accurate statistic.
It should have been a best-effort counter like vm_events that does not need
IRQs to be disabled. If that was a simple counter as opposed to an accurate
statistic then a failure counter at failed_irq would be very cheap to add.
--
Mel Gorman
SUSE Labs
On Thu, 11 Mar 2021 08:42:00 +0000
Mel Gorman <[email protected]> wrote:
> On Wed, Mar 10, 2021 at 03:46:50PM -0800, Andrew Morton wrote:
> > On Wed, 10 Mar 2021 10:46:15 +0000 Mel Gorman <[email protected]> wrote:
> >
> > > This patch adds a new page allocator interface via alloc_pages_bulk,
> > > and __alloc_pages_bulk_nodemask. A caller requests a number of pages
> > > to be allocated and added to a list. They can be freed in bulk using
> > > free_pages_bulk().
> >
> > Why am I surprised we don't already have this.
> >
>
> It was prototyped a few years ago and discussed at LSF/MM so it's in
> the back of your memory somewhere. It never got merged because it lacked
> users and I didn't think carrying dead untested code was appropriate.
And I guess I didn't push hard enough and show the use-case in code.
Thus, I will also take part of the blame for this stalling out.
> > > The API is not guaranteed to return the requested number of pages and
> > > may fail if the preferred allocation zone has limited free memory, the
> > > cpuset changes during the allocation or page debugging decides to fail
> > > an allocation. It's up to the caller to request more pages in batch
> > > if necessary.
> > >
> > > Note that this implementation is not very efficient and could be improved
> > > but it would require refactoring. The intent is to make it available early
> > > to determine what semantics are required by different callers. Once the
> > > full semantics are nailed down, it can be refactored.
> > >
> > > ...
> > >
> > > +/* Drop reference counts and free order-0 pages from a list. */
> > > +void free_pages_bulk(struct list_head *list)
> > > +{
> > > + struct page *page, *next;
> > > +
> > > + list_for_each_entry_safe(page, next, list, lru) {
> > > + trace_mm_page_free_batched(page);
> > > + if (put_page_testzero(page)) {
> > > + list_del(&page->lru);
> > > + __free_pages_ok(page, 0, FPI_NONE);
> > > + }
> > > + }
> > > +}
> > > +EXPORT_SYMBOL_GPL(free_pages_bulk);
> >
> > I expect that batching games are planned in here as well?
> >
>
> Potentially it could be done but the page allocator would need to be
> fundamentally aware of batching to make it tidy or the per-cpu allocator
> would need knowledge of how to handle batches in the free path. Batch
> freeing to the buddy allocator is problematic as buddy merging has to
> happen. Batch freeing to per-cpu hits pcp->high limitations.
>
> There are a couple of ways it *could* be done. Per-cpu lists could be
> allowed to temporarily exceed the high limits and reduce them out-of-band
> like what happens with counter updates or remote pcp freeing. Care
> would need to be taken when memory is low to avoid premature OOM
> and to guarantee draining happens in a timely fashion. There would be
> additional benefits to this. For example, release_pages() can hammer the
> zone lock when freeing very large batches and would benefit from either
> large batching or "plugging" the per-cpu list. I prototyped a series to
> allow the batch limits to be temporarily exceeded but it did not actually
> improve performance because of errors in the implementation and it needs
> a lot of work.
>
> > > static inline unsigned int
> > > gfp_to_alloc_flags(gfp_t gfp_mask)
> > > {
> > > @@ -4919,6 +4934,9 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > struct alloc_context *ac, gfp_t *alloc_mask,
> > > unsigned int *alloc_flags)
> > > {
> > > + gfp_mask &= gfp_allowed_mask;
> > > + *alloc_mask = gfp_mask;
> > > +
> > > ac->highest_zoneidx = gfp_zone(gfp_mask);
> > > ac->zonelist = node_zonelist(preferred_nid, gfp_mask);
> > > ac->nodemask = nodemask;
> > > @@ -4960,6 +4978,99 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> > > return true;
> > > }
> > >
> > > +/*
> > > + * This is a batched version of the page allocator that attempts to
> > > + * allocate nr_pages quickly from the preferred zone and add them to list.
> > > + */
> >
> > Documentation is rather lame. Returns number of pages allocated...
> >
>
> I added a note on the return value. The documentation is lame because at
> this point, we do not know what the required semantics for future users
> are. We have two examples at the moment in this series but I think it
> would be better to add kerneldoc documentation when there is a reasonable
> expectation that the API will not change. For example, SLUB could use
> this API when it fails to allocate a high-order page and instead allocate
> batches of order-0 pages but I did not investigate how feasible that
> is. Similarly, it's possible that we really need to deal with high-order
> batch allocations in which case, the per-cpu list should be high-order
> aware or the core buddy allocator needs to be batch-allocation aware.
>
> > > +int __alloc_pages_bulk_nodemask(gfp_t gfp_mask, int preferred_nid,
> > > + nodemask_t *nodemask, int nr_pages,
> > > + struct list_head *alloc_list)
> > > +{
> > > + struct page *page;
> > > + unsigned long flags;
> > > + struct zone *zone;
> > > + struct zoneref *z;
> > > + struct per_cpu_pages *pcp;
> > > + struct list_head *pcp_list;
> > > + struct alloc_context ac;
> > > + gfp_t alloc_mask;
> > > + unsigned int alloc_flags;
> > > + int alloced = 0;
> > > +
> > > + if (nr_pages == 1)
> > > + goto failed;
> > > +
> > > + /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> > > + if (!prepare_alloc_pages(gfp_mask, 0, preferred_nid, nodemask, &ac, &alloc_mask, &alloc_flags))
> > > + return 0;
> > > + gfp_mask = alloc_mask;
> > > +
> > > + /* Find an allowed local zone that meets the high watermark. */
> > > + for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
> > > + unsigned long mark;
> > > +
> > > + if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
> > > + !__cpuset_zone_allowed(zone, gfp_mask)) {
> > > + continue;
> > > + }
> > > +
> > > + if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
> > > + zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
> > > + goto failed;
> > > + }
> > > +
> > > + mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
> > > + if (zone_watermark_fast(zone, 0, mark,
> > > + zonelist_zone_idx(ac.preferred_zoneref),
> > > + alloc_flags, gfp_mask)) {
> > > + break;
> > > + }
> > > + }
> >
> > I suspect the above was stolen from elsewhere and that some code
> > commonification is planned.
> >
>
> It's based on get_page_from_freelist. It would be messy to have them share
> common code at this point with a risk that the fast path for the common
> path (single page requests) would be impaired. The issue is that the
> fast path and slow paths have zonelist iteration, kswapd wakeup, cpuset
> enforcement and reclaim actions all mixed together at various different
> points. The locking is also mixed up with per-cpu list locking, statistic
> locking and buddy locking all having inappropriate overlaps (e.g. IRQ
> disabling protects per-cpu list locking, partially and unnecessarily
> protects statistics depending on architecture and overlaps with the
> IRQ-safe zone lock.
>
> Ironing this out risks hurting the single page allocation path. It would
> need to be done incrementally with ultimately the core of the allocator
> dealing with batches to avoid false bisections.
>
> >
> > > + if (!zone)
> > > + return 0;
> > > +
> > > + /* Attempt the batch allocation */
> > > + local_irq_save(flags);
> > > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > + pcp_list = &pcp->lists[ac.migratetype];
> > > +
> > > + while (alloced < nr_pages) {
> > > + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > > + pcp, pcp_list);
> > > + if (!page)
> > > + break;
> > > +
> > > + prep_new_page(page, 0, gfp_mask, 0);
> >
> > I wonder if it would be worth running prep_new_page() in a second pass,
> > after reenabling interrupts.
> >
>
> Possibly, I could add another patch on top that does this because it's
> trading the time that IRQs are disabled for a list iteration.
I for one like this idea of moving prep_new_page() to a second pass,
as per the realtime concern below, to reduce the time that IRQs are
disabled.
> > Speaking of which, will the realtime people get upset about the
> > irqs-off latency? How many pages are we talking about here?
> >
In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
this is too much? (PP_ALLOC_CACHE_REFILL=64).
The mlx5 driver has a while loop for allocating 64 pages, which it
uses in this case; that is why 64 is chosen. If we choose a lower
bulk number, then the bulk-alloc will just be called more times.
> At the moment, it looks like batches of up to a few hundred at worst. I
> don't think realtime sensitive applications are likely to be using the
> bulk allocator API at this point.
>
> The realtime people have a worse problem in that the per-cpu list does
> not use local_lock and disable IRQs more than it needs to on x86 in
> particular. I've a prototype series for this as well which splits the
> locking for the per-cpu list and statistic handling and then converts the
> per-cpu list to local_lock but I'm getting this off the table first because
> I don't want multiple page allocator series in flight at the same time.
> Thomas, Peter and Ingo would need to be cc'd on that series to review
> the local_lock aspects.
>
> Even with local_lock, it's not clear to me why per-cpu lists need to be
> locked at all because potentially it could use a lock-free llist with some
> struct page overloading. That one is harder to predict when batches are
> taken into account as splicing a batch of free pages with llist would be
> unsafe so batch free might exchange IRQ disabling overhead with multiple
> atomics. I'd need to recheck things like whether NMI handlers ever call
> the page allocator (they shouldn't but it should be checked). It would
> need a lot of review and testing.
The result of the API is to deliver pages as a doubly linked list via
the LRU (page->lru member). If you are planning to use llist, then how
would this API change be handled later?
Have you noticed that the two users store the struct-page pointers in an
array? We could have the caller provide the array to store the struct-page
pointers, like we do with the kmem_cache_alloc_bulk API.
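Something like the sketch below is what I have in mind; the function and
its signature are invented here purely for illustration:
	/* Hypothetical: fill page_array with up to nr_pages order-0 pages
	 * and return the number actually allocated, mirroring the calling
	 * convention of kmem_cache_alloc_bulk().
	 */
	int alloc_pages_bulk_array(gfp_t gfp, int preferred_nid,
				   int nr_pages, struct page **page_array);

	struct page *pages[PP_ALLOC_CACHE_REFILL];
	int nr;

	nr = alloc_pages_bulk_array(gfp, pool->p.nid,
				    PP_ALLOC_CACHE_REFILL, pages);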
You likely have good reasons for returning the pages as a list (via
lru), as I can see/imagine that there is some potential for grabbing
the entire PCP-list.
> > > + list_add(&page->lru, alloc_list);
> > > + alloced++;
> > > + }
> > > +
> > > + if (!alloced)
> > > + goto failed_irq;
> > > +
> > > + if (alloced) {
> > > + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
> > > + zone_statistics(zone, zone);
> > > + }
> > > +
> > > + local_irq_restore(flags);
> > > +
> > > + return alloced;
> > > +
> > > +failed_irq:
> > > + local_irq_restore(flags);
> > > +
> > > +failed:
> >
> > Might we need some counter to show how often this path happens?
> >
>
> I think that would be overkill at this point. It only gives useful
> information to a developer using the API for the first time and that
> can be done with a debugging patch (or probes if you're feeling
> creative). I'm already unhappy with the counter overhead in the page
> allocator. zone_statistics in particular has no business being an
> accurate statistic. It should have been a best-effort counter like
> vm_events that does not need IRQs to be disabled. If that was a
> simply counter as opposed to an accurate statistic then a failure
> counter at failed_irq would be very cheap to add.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > > > <SNIP>
> > > > + if (!zone)
> > > > + return 0;
> > > > +
> > > > + /* Attempt the batch allocation */
> > > > + local_irq_save(flags);
> > > > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > > > + pcp_list = &pcp->lists[ac.migratetype];
> > > > +
> > > > + while (alloced < nr_pages) {
> > > > + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > > > + pcp, pcp_list);
> > > > + if (!page)
> > > > + break;
> > > > +
> > > > + prep_new_page(page, 0, gfp_mask, 0);
> > >
> > > I wonder if it would be worth running prep_new_page() in a second pass,
> > > after reenabling interrupts.
> > >
> >
> > Possibly, I could add another patch on top that does this because it's
> > trading the time that IRQs are disabled for a list iteration.
>
> I for one like this idea, of moving prep_new_page() to a second pass.
> As per below realtime concern, to reduce the time that IRQs are
> disabled.
>
Already done.
> > > Speaking of which, will the realtime people get upset about the
> > > irqs-off latency? How many pages are we talking about here?
> > >
>
> In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> this is too much? (PP_ALLOC_CACHE_REFILL=64).
>
I expect no, it's not too much. The refill path should be short.
> > At the moment, it looks like batches of up to a few hundred at worst. I
> > don't think realtime sensitive applications are likely to be using the
> > bulk allocator API at this point.
> >
> > The realtime people have a worse problem in that the per-cpu list does
> > not use local_lock and disable IRQs more than it needs to on x86 in
> > particular. I've a prototype series for this as well which splits the
> > locking for the per-cpu list and statistic handling and then converts the
> > per-cpu list to local_lock but I'm getting this off the table first because
> > I don't want multiple page allocator series in flight at the same time.
> > Thomas, Peter and Ingo would need to be cc'd on that series to review
> > the local_lock aspects.
> >
> > Even with local_lock, it's not clear to me why per-cpu lists need to be
> > locked at all because potentially it could use a lock-free llist with some
> > struct page overloading. That one is harder to predict when batches are
> > taken into account as splicing a batch of free pages with llist would be
> > unsafe so batch free might exchange IRQ disabling overhead with multiple
> > atomics. I'd need to recheck things like whether NMI handlers ever call
> > the page allocator (they shouldn't but it should be checked). It would
> > need a lot of review and testing.
>
> The result of the API is to deliver pages as a double-linked list via
> LRU (page->lru member). If you are planning to use llist, then how to
> handle this API change later?
>
I would not have to. The per-cpu list can use llist internally while
pages returned to the bulk allocator user can still be on a doubly
linked list. An llist_node fits in less space than the list_head lru.
> Have you notice that the two users store the struct-page pointers in an
> array? We could have the caller provide the array to store struct-page
> pointers, like we do with kmem_cache_alloc_bulk API.
>
That is a possibility but it ties the caller into declaring an array,
either via kmalloc, within an existing struct or on-stack. They would
then need to ensure that nr_pages does not exceed the array size or pass
in the array size. It's more error prone and a harder API to use.
> You likely have good reasons for returning the pages as a list (via
> lru), as I can see/imagine that there are some potential for grabbing
> the entire PCP-list.
>
I used a list so that the user is only required to define a list_head on
the stack to use the API.
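Roughly speaking, a caller needs nothing more than this (illustrative
fragment, not lifted from either user):
	LIST_HEAD(pagelist);
	unsigned long nr;

	nr = alloc_pages_bulk(GFP_KERNEL, nr_pages, &pagelist);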
--
Mel Gorman
SUSE Labs
On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask if
> this is too much? (PP_ALLOC_CACHE_REFILL=64).
>
> The mlx5 driver have a while loop for allocation 64 pages, which it
> used in this case, that is why 64 is chosen. If we choose a lower
> bulk number, then the bulk-alloc will just be called more times.
The thing about batching is that smaller batches are often better.
Let's suppose you need to allocate 100 pages for something, and the page
allocator takes up 90% of your latency budget. Batching just ten pages
at a time is going to reduce the overhead to 9%. Going to 64 pages
reduces the overhead from 9% to 2% -- maybe that's important, but
possibly not.
> The result of the API is to deliver pages as a double-linked list via
> LRU (page->lru member). If you are planning to use llist, then how to
> handle this API change later?
>
> Have you notice that the two users store the struct-page pointers in an
> array? We could have the caller provide the array to store struct-page
> pointers, like we do with kmem_cache_alloc_bulk API.
My preference would be for a pagevec. That does limit you to 15 pages
per call [1], but I do think that might be enough. And the overhead of
manipulating a linked list isn't free.
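Roughly this shape (alloc_pages_bulk_pvec() and consume_page() are
hypothetical, named here only to show what a pagevec-based variant could
look like):
	unsigned int alloc_pages_bulk_pvec(gfp_t gfp, struct pagevec *pvec);

	struct pagevec pvec;
	unsigned int i, nr;

	pagevec_init(&pvec);
	/* Fills pvec.pages[] with up to PAGEVEC_SIZE order-0 pages. */
	nr = alloc_pages_bulk_pvec(GFP_KERNEL, &pvec);
	for (i = 0; i < nr; i++)
		consume_page(pvec.pages[i]);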
[1] Patches exist to increase this, because it turns out that 15 may
not be enough for all systems! But it would be limited to 255 as an
absolute hard cap.
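To make the shape of that suggestion concrete, here is a sketch
assuming a made-up alloc_pages_bulk_pvec() helper; struct pagevec and
PAGEVEC_SIZE (15 at the time of writing) are the existing
<linux/pagevec.h> definitions:

#include <linux/gfp.h>
#include <linux/mm.h>
#include <linux/pagevec.h>

/* Hypothetical pagevec-based call: fills at most PAGEVEC_SIZE pages */
unsigned int alloc_pages_bulk_pvec(gfp_t gfp, struct pagevec *pvec);

static void pagevec_usage_example(void)
{
        struct pagevec pvec;
        unsigned int i, got;

        pagevec_init(&pvec);
        got = alloc_pages_bulk_pvec(GFP_KERNEL, &pvec);

        for (i = 0; i < got; i++)
                set_page_private(pvec.pages[i], 0); /* stand-in for real use */
}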
On Fri, Mar 12, 2021 at 02:58:14PM +0000, Matthew Wilcox wrote:
> On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask
> > whether this is too much (PP_ALLOC_CACHE_REFILL=64).
> >
> > The mlx5 driver has a while loop for allocating 64 pages, which it
> > uses in this case; that is why 64 was chosen. If we choose a lower
> > bulk number, then the bulk-alloc will just be called more times.
>
> The thing about batching is that smaller batches are often better.
> Let's suppose you need to allocate 100 pages for something, and the page
> allocator takes up 90% of your latency budget. Batching just ten pages
> at a time is going to reduce the overhead to 9%. Going to 64 pages
> reduces the overhead from 9% to 2% -- maybe that's important, but
> possibly not.
>
I do not think that something like that can be properly assessed in
advance. It heavily depends on whether the caller is willing to amortise
the cost of the batch allocation or whether the timing of the bulk
request is critical every single time.
> > The result of the API is to deliver pages as a doubly-linked list via
> > the LRU (page->lru) member. If you are planning to use llist, how
> > would you handle this API change later?
> >
> > Have you noticed that the two users store the struct page pointers in
> > an array? We could have the caller provide the array to store the
> > struct page pointers, like we do with the kmem_cache_alloc_bulk API.
>
> My preference would be for a pagevec. That does limit you to 15 pages
> per call [1], but I do think that might be enough. And the overhead of
> manipulating a linked list isn't free.
>
I'm opposed to a pagevec because it unnecessarily limits the caller. The
sunrpc user for example knows how many pages it needs at the time the bulk
allocator is called but it's not the same value every time. When tracing,
I found it sometimes requested 1 page (most common request actually) and
other times requested 200+ pages. Forcing it to call the batch allocator
in chunks of 15 means the caller incurs the cost of multiple allocation
requests which is almost as bad as calling __alloc_pages in a loop.
I think the first version should have an easy API to start with. Optimise
the implementation if it is a bottleneck. Only make the API harder to
use if the callers are really willing to always allocate and size the
array in advance and it's shown that it really makes a big difference
performance-wise.
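To make the round-trip concern concrete, a sketch of the two shapes a
caller needing a variable nr_pages ends up with (all function names as
assumed in the earlier sketches):

#include <linux/gfp.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/mm.h>
#include <linux/pagevec.h>

unsigned int alloc_pages_bulk_pvec(gfp_t gfp, struct pagevec *pvec);

/* List form: one round trip regardless of the request size */
static void fill_with_list(unsigned int nr_pages, struct list_head *pages)
{
        alloc_pages_bulk(GFP_KERNEL, nr_pages, pages);
}

/*
 * Pagevec form: DIV_ROUND_UP(200, PAGEVEC_SIZE) == 14 separate calls
 * for a 200-page request.
 */
static void fill_with_pagevec(unsigned int nr_pages)
{
        struct pagevec pvec;

        while (nr_pages) {
                unsigned int i, got;

                pagevec_init(&pvec);
                got = alloc_pages_bulk_pvec(GFP_KERNEL, &pvec);
                if (!got)
                        break;  /* the sunrpc caller would sleep and retry */

                /* stand-in for copying each page into the caller's array */
                for (i = 0; i < got; i++)
                        set_page_private(pvec.pages[i], 0);
                nr_pages -= min_t(unsigned int, nr_pages, got);
        }
}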
--
Mel Gorman
SUSE Labs
On Fri, Mar 12, 2021 at 04:03:50PM +0000, Mel Gorman wrote:
> On Fri, Mar 12, 2021 at 02:58:14PM +0000, Matthew Wilcox wrote:
> > On Fri, Mar 12, 2021 at 12:46:09PM +0100, Jesper Dangaard Brouer wrote:
> > > In my page_pool patch I'm bulk allocating 64 pages. I wanted to ask
> > > whether this is too much (PP_ALLOC_CACHE_REFILL=64).
> > >
> > > The mlx5 driver has a while loop for allocating 64 pages, which it
> > > uses in this case; that is why 64 was chosen. If we choose a lower
> > > bulk number, then the bulk-alloc will just be called more times.
> >
> > The thing about batching is that smaller batches are often better.
> > Let's suppose you need to allocate 100 pages for something, and the page
> > allocator takes up 90% of your latency budget. Batching just ten pages
> > at a time is going to reduce the overhead to 9%. Going to 64 pages
> > reduces the overhead from 9% to 2% -- maybe that's important, but
> > possibly not.
> >
>
> I do not think that something like that can be properly assessed in
> advance. It heavily depends on whether the caller is willing to amortise
> the cost of the batch allocation or whether the timing of the bulk
> request is critical every single time.
>
> > > The result of the API is to deliver pages as a doubly-linked list via
> > > the LRU (page->lru) member. If you are planning to use llist, how
> > > would you handle this API change later?
> > >
> > > Have you noticed that the two users store the struct page pointers in
> > > an array? We could have the caller provide the array to store the
> > > struct page pointers, like we do with the kmem_cache_alloc_bulk API.
> >
> > My preference would be for a pagevec. That does limit you to 15 pages
> > per call [1], but I do think that might be enough. And the overhead of
> > manipulating a linked list isn't free.
> >
>
> I'm opposed to a pagevec because it unnecessarily limits the caller. The
> sunrpc user for example knows how many pages it needs at the time the bulk
> allocator is called but it's not the same value every time. When tracing,
> I found it sometimes requested 1 page (most common request actually) and
> other times requested 200+ pages. Forcing it to call the batch allocator
> in chunks of 15 means the caller incurs the cost of multiple allocation
> requests which is almost as bad as calling __alloc_pages in a loop.
Well, no. It reduces the cost by a factor of 15 -- or by 93%. 200 is
an interesting example because putting 200 pages on a list costs 200 *
64 bytes of dirty cachelines, or 12KiB. That's larger than some CPU L1
caches (mine's 48KB, 12-way set associative), but I think it's safe to say
some of those 200 cache lines are going to force others out into L2 cache.
Compared to a smaller batch of 15 pages in a pagevec, it'll dirty two cache
lines (admittedly the 15 struct pages are also going to get dirtied by being
allocated and then by being set up for whatever use they're getting, but
they should stay in L1 cache while that's happening).
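Making that arithmetic explicit (assuming 64-byte cache lines and
8-byte pointers):

\[
  200 \times 64\,\mathrm{B} = 12800\,\mathrm{B} \approx 12\,\mathrm{KiB}
  \quad (\text{one dirtied line per listed page})
\]
\[
  15 \times 8\,\mathrm{B} + 8\,\mathrm{B}\ (\text{count and padding})
  = 128\,\mathrm{B} = 2 \times 64\,\mathrm{B}
  \quad (\text{two dirtied lines for the pagevec itself})
\]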
I'm not claiming the pagevec is definitely a win, but it's very
unclear which tradeoff is actually going to lead to better performance.
Hopefully Jesper or Chuck can do some tests and figure out what actually
works better with their hardware & usage patterns.
> I think the first version should have an easy API to start with. Optimise
> the implementation if it is a bottleneck. Only make the API harder to
> use if the callers are really willing to always allocate and size the
> array in advance and it's shown that it really makes a big difference
> performance-wise.
I'm not entirely sure that a pagevec is harder to use than a list_head.
On Fri, Mar 12, 2021 at 09:08:23PM +0000, Matthew Wilcox wrote:
> > > > The result of the API is to deliver pages as a doubly-linked list via
> > > > the LRU (page->lru) member. If you are planning to use llist, how
> > > > would you handle this API change later?
> > > >
> > > > Have you noticed that the two users store the struct page pointers in
> > > > an array? We could have the caller provide the array to store the
> > > > struct page pointers, like we do with the kmem_cache_alloc_bulk API.
> > >
> > > My preference would be for a pagevec. That does limit you to 15 pages
> > > per call [1], but I do think that might be enough. And the overhead of
> > > manipulating a linked list isn't free.
> > >
> >
> > I'm opposed to a pagevec because it unnecessarily limits the caller. The
> > sunrpc user for example knows how many pages it needs at the time the bulk
> > allocator is called but it's not the same value every time. When tracing,
> > I found it sometimes requested 1 page (most common request actually) and
> > other times requested 200+ pages. Forcing it to call the batch allocator
> > in chunks of 15 means the caller incurs the cost of multiple allocation
> > requests which is almost as bad as calling __alloc_pages in a loop.
>
> Well, no. It reduces the cost by a factor of 15 -- or by 93%. 200 is
> an interesting example because putting 200 pages on a list costs 200 *
> 64 bytes of dirty cachelines, or 12KiB.
That's a somewhat limited view. Yes, the overall cost gets reduced by
some factor but forcing the caller to limit the batch sizes incurs an
unnecessary cost. The SUNRPC user is particularly relevant as it cannot
make progress until it gets all the pages it requests -- it sleeps if
it cannot get the pages it needs. The whole point of the bulk allocator
is to avoid multiple round trips through the page allocator. Forcing a
limit into the API that requires multiple round trips is just weird.
> That's larger than some CPU L1
> caches (mine's 48KB, 12-way set associative), but I think it's safe to say
> some of those 200 cache lines are going to force others out into L2 cache.
> Compared to a smaller batch of 15 pages in a pagevec, it'll dirty two cache
> lines (admittedly the 15 struct pages are also going to get dirtied by being
> allocated and then by being set up for whatever use they're getting, but
> they should stay in L1 cache while that's happening).
>
The cache footprint is irrelevant if the caller *requires* the pages. If
the caller has to zero the pages then the cache gets thrashed anyway.
Even if non-temporal zeroing were used, the cache is likely thrashed by the
data copies. The page allocator in general is a cache nightmare because
of the number of cache lines it potentially dirties, particularly if it
has to call into the buddy allocator to split/merge pages for allocations
and frees respectively.
> I'm not claiming the pagevec is definitely a win, but it's very
> unclear which tradeoff is actually going to lead to better performance.
> Hopefully Jesper or Chuck can do some tests and figure out what actually
> works better with their hardware & usage patterns.
>
The NFS user is often going to need to make round trips to get the pages it
needs. The pagevec would have to be copied into the target array,
meaning it's not much better than manipulating a list.
Pagevecs are a bad interface in general simply because they put hard
constraints on how many pages can be bulk allocated. Pagevecs are
primarily there to avoid excessive LRU lock acquisition and they are
bad at that job. These days, the LRU lock protects such a massive
amount of data that the pagevec is barely a band-aid. Increasing its
size just
shifts the problem slightly. I see very little value in introducing a
fundamental limitation into the bulk allocator by mandating pagevecs.
Now, I can see the API moving to arrays if a user shows up whose path
is hot enough, and array-based enough, to justify it, but we're not
there yet. The two callers are somewhat of corner cases and both of
them are limited by the wire speed of the network. Not all users may
require arrays -- SLUB using batched order-0 pages after a high-order
allocation failure, for example, would not need an array. Such an
intensively hot user does not currently exist, so it's premature to
even consider it.
> > I think the first version should have an easy API to start with. Optimise
> > the implementation if it is a bottleneck. Only make the API harder to
> > use if the callers are really willing to always allocate and size the
> > array in advance and it's shown that it really makes a big difference
> > performance-wise.
>
> I'm not entirely sure that a pagevec is harder to use than a list_head.
Leaving aside the limitations of pagevecs, arrays get messy if the
caller does not necessarily use all the pages returned by the
allocator. The arrays would need to be tracked and/or preserved for
some time, and the order in which pages are taken out of the array
potentially matters. With lists, the remaining pages can easily be
spliced onto a private cache or simply handed back to the free API
without having to track exactly how many pages are in the array or
where they are located. With arrays, the elements have to be copied
one at a time.
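A minimal sketch of that difference (helper names invented for
illustration):

#include <linux/gfp.h>
#include <linux/list.h>
#include <linux/mm.h>

/* List form: leftover pages move to a private cache in O(1) */
static void return_leftovers_list(struct list_head *leftover,
                                  struct list_head *private_cache)
{
        list_splice_init(leftover, private_cache);
}

/*
 * Array form: the caller has to track which slots are still valid and
 * handle each leftover element individually (here simply freed).
 */
static void return_leftovers_array(struct page **pages,
                                   unsigned int first_unused, unsigned int nr)
{
        unsigned int i;

        for (i = first_unused; i < nr; i++) {
                __free_page(pages[i]);
                pages[i] = NULL;
        }
}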
I think it's easier overall for the callers to deal with a list in
the initial implementation and only switch to arrays when there is an
extremely hot user that benefits heavily if pages are inserted directly
into an array.
--
Mel Gorman
SUSE Labs