This series is based on top of Matthew Wilcox's series "Rationalise
__alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
test and are not using Andrew's tree as a baseline, I suggest using the
following git tree
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
The users of the API have been dropped in this version as the callers
need to check whether they prefer an array or list interface (whether
preference is based on convenience or performance).
Changelog since v4
o Drop users of the API
o Remove free_pages_bulk interface, no users
o Add array interface
o Allocate single page if watermark checks on local zones fail
Changelog since v3
o Rebase on top of Matthew's series consolidating the alloc_pages API
o Rename alloced to allocated
o Split out preparation patch for prepare_alloc_pages
o Defensive check for bulk allocation or <= 0 pages
o Call single page allocation path only if no pages were allocated
o Minor cosmetic cleanups
o Reorder patch dependencies by subsystem. As this is a cross-subsystem
series, the mm patches have to be merged before the sunrpc and net
users.
Changelog since v2
o Prep new pages with IRQs enabled
o Minor documentation update
Changelog since v1
o Parenthesise binary and boolean comparisons
o Add reviewed-bys
o Rebase to 5.12-rc2
This series introduces a bulk order-0 page allocator with the
intent that sunrpc and the network page pool become the first users.
The implementation is not particularly efficient and the intention is to
iron out what the semantics of the API should be for users. Despite
that, this is a performance-related enhancement for users that require
multiple pages for an operation without multiple round-trips to the page
allocator. Quoting the last patch for the prototype high-speed networking
use-case:
For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
redirecting xdp_frame packets into a veth, that does XDP_PASS to
create an SKB from the xdp_frame, which then cannot return the page
to the page_pool. In this case, we saw[1] an improvement of 18.8%
from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
Both potential users in this series are corner cases (NFS and high-speed
networks) so it is unlikely that most users will see any benefit in the
short term. Other potential users are batch allocations for page
cache readahead, fault-around and SLUB allocations when high-order pages
are unavailable. It's unknown how much benefit would be seen by converting
multiple page allocation calls to a single batch or what difference it may
make to headline performance. It's a chicken and egg problem given that
the potential benefit cannot be investigated without an implementation
to test against.
Light testing passed; I'm relying on Chuck and Jesper to test their
implementations, choose whether to use lists or arrays and document
performance gains/losses in the changelogs.
Patch 1 renames a variable name that is particularly unpopular
Patch 2 adds a bulk page allocator
Patch 3 adds an array-based version of the bulk allocator
include/linux/gfp.h | 18 +++++
mm/page_alloc.c | 171 ++++++++++++++++++++++++++++++++++++++++++--
2 files changed, 185 insertions(+), 4 deletions(-)
--
2.26.2
The proposed callers for the bulk allocator store pages from the bulk
allocator in an array. This patch adds an array-based interface to the API
to avoid multiple list iterations. The page list interface is preserved
to avoid requiring all users of the bulk API to allocate and manage enough
storage to store the pages.
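For reference, the intended calling patterns look roughly like the sketch
below. This is illustrative only and not part of the patch; NR_PAGES is a
made-up constant.

	/* Illustrative sketch only; not from this patch. */
	struct page *page_array[NR_PAGES] = { NULL };
	LIST_HEAD(page_list);
	unsigned long nr;

	/* List interface: allocated pages are chained via page->lru. */
	nr = alloc_pages_bulk_list(GFP_KERNEL, NR_PAGES, &page_list);

	/* Array interface: only NULL elements of page_array are filled in. */
	nr = alloc_pages_bulk_array(GFP_KERNEL, NR_PAGES, page_array);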
Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/gfp.h | 13 ++++++--
mm/page_alloc.c | 75 ++++++++++++++++++++++++++++++++++-----------
2 files changed, 67 insertions(+), 21 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 4a304fd39916..fb6234e1fe59 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -520,13 +520,20 @@ struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
- struct list_head *list);
+ struct list_head *page_list,
+ struct page **page_array);
/* Bulk allocate order-0 pages */
static inline unsigned long
-alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+alloc_pages_bulk_list(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
{
- return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list, NULL);
+}
+
+static inline unsigned long
+alloc_pages_bulk_array(gfp_t gfp, unsigned long nr_pages, struct page **page_array)
+{
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, NULL, page_array);
}
/*
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f4d56854c74..c83d38dfe936 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4966,22 +4966,31 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
}
/*
- * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
+ * __alloc_pages_bulk - Allocate a number of order-0 pages to a list or array
* @gfp: GFP flags for the allocation
* @preferred_nid: The preferred NUMA node ID to allocate from
* @nodemask: Set of nodes to allocate from, may be NULL
* @nr_pages: The number of pages requested
- * @page_list: List to store the allocated pages, must be empty
+ * @page_list: Optional list to store the allocated pages
+ * @page_array: Optional array to store the pages
*
* This is a batched version of the page allocator that attempts to
- * allocate nr_pages quickly and add them to a list. The list must be
- * empty to allow new pages to be prepped with IRQs enabled.
+ * allocate nr_pages quickly. Pages are added to page_list if page_list
+ * is not NULL, otherwise it is assumed that the page_array is valid.
*
- * Returns the number of pages allocated.
+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
+ * is the maximum number of pages that will be stored in the array. Note
+ * that the call may return prematurely if the array has NULL holes in the middle.
+ *
+ * Returns the number of pages added to the page_list or the known
+ * number of populated elements in the page_array.
*/
int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
nodemask_t *nodemask, int nr_pages,
- struct list_head *page_list)
+ struct list_head *page_list,
+ struct page **page_array)
{
struct page *page;
unsigned long flags;
@@ -4992,14 +5001,23 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags;
- int allocated = 0;
+ int nr_populated = 0, prep_index = 0;
if (WARN_ON_ONCE(nr_pages <= 0))
return 0;
- if (WARN_ON_ONCE(!list_empty(page_list)))
+ if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
return 0;
+ /* Skip populated array elements. */
+ if (page_array) {
+ while (nr_populated < nr_pages && page_array[nr_populated])
+ nr_populated++;
+ if (nr_populated == nr_pages)
+ return nr_populated;
+ prep_index = nr_populated;
+ }
+
if (nr_pages == 1)
goto failed;
@@ -5044,12 +5062,22 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
pcp = &this_cpu_ptr(zone->pageset)->pcp;
pcp_list = &pcp->lists[ac.migratetype];
- while (allocated < nr_pages) {
+ while (nr_populated < nr_pages) {
+ /*
+ * Stop allocating if the next index has a populated
+ * page or the page will be prepared a second time when
+ * IRQs are enabled.
+ */
+ if (page_array && page_array[nr_populated]) {
+ nr_populated++;
+ break;
+ }
+
page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
pcp, pcp_list);
if (!page) {
/* Try and get at least one page */
- if (!allocated)
+ if (!nr_populated)
goto failed_irq;
break;
}
@@ -5063,17 +5091,25 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
__count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
zone_statistics(ac.preferred_zoneref->zone, zone);
- list_add(&page->lru, page_list);
- allocated++;
+ if (page_list)
+ list_add(&page->lru, page_list);
+ else
+ page_array[nr_populated] = page;
+ nr_populated++;
}
local_irq_restore(flags);
/* Prep pages with IRQs enabled. */
- list_for_each_entry(page, page_list, lru)
- prep_new_page(page, 0, gfp, 0);
+ if (page_list) {
+ list_for_each_entry(page, page_list, lru)
+ prep_new_page(page, 0, gfp, 0);
+ } else {
+ while (prep_index < nr_populated)
+ prep_new_page(page_array[prep_index++], 0, gfp, 0);
+ }
- return allocated;
+ return nr_populated;
failed_irq:
local_irq_restore(flags);
@@ -5081,11 +5117,14 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
failed:
page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
if (page) {
- list_add(&page->lru, page_list);
- allocated = 1;
+ if (page_list)
+ list_add(&page->lru, page_list);
+ else
+ page_array[nr_populated] = page;
+ nr_populated++;
}
- return allocated;
+ return nr_populated;
}
EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
--
2.26.2
This patch adds a new page allocator interface via alloc_pages_bulk,
and __alloc_pages_bulk. A caller requests a number of pages
to be allocated and added to a list.
The API is not guaranteed to return the requested number of pages and
may fail if the preferred allocation zone has limited free memory, the
cpuset changes during the allocation or page debugging decides to fail
an allocation. It's up to the caller to request more pages in batch
if necessary.
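As an illustration only (not part of this patch, and the names are made up),
a caller that needs a fixed number of pages might drain the list and retry
along these lines:

	LIST_HEAD(list);
	unsigned long want = 16;	/* hypothetical requirement */

	while (want) {
		unsigned long nr = alloc_pages_bulk(GFP_KERNEL, want, &list);

		while (nr--) {
			struct page *page;

			page = list_first_entry(&list, struct page, lru);
			list_del(&page->lru);
			/* stash page wherever the caller needs it */
			want--;
		}

		/* The list must be empty again before calling back in. */
		if (want)
			schedule_timeout_uninterruptible(1);
	}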
Note that this implementation is not very efficient and could be improved
but it would require refactoring. The intent is to make it available early
to determine what semantics are required by different callers. Once the
full semantics are nailed down, it can be refactored.
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
---
include/linux/gfp.h | 11 ++++
mm/page_alloc.c | 124 ++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 135 insertions(+)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 0a88f84b08f4..4a304fd39916 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -518,6 +518,17 @@ static inline int arch_make_page_accessible(struct page *page)
struct page *__alloc_pages(gfp_t gfp, unsigned int order, int preferred_nid,
nodemask_t *nodemask);
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *list);
+
+/* Bulk allocate order-0 pages */
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages, struct list_head *list)
+{
+ return __alloc_pages_bulk(gfp, numa_mem_id(), NULL, nr_pages, list);
+}
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 8a3e13277e22..3f4d56854c74 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4965,6 +4965,130 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
return true;
}
+/*
+ * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
+ * @gfp: GFP flags for the allocation
+ * @preferred_nid: The preferred NUMA node ID to allocate from
+ * @nodemask: Set of nodes to allocate from, may be NULL
+ * @nr_pages: The number of pages requested
+ * @page_list: List to store the allocated pages, must be empty
+ *
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly and add them to a list. The list must be
+ * empty to allow new pages to be prepped with IRQs enabled.
+ *
+ * Returns the number of pages allocated.
+ */
+int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
+ nodemask_t *nodemask, int nr_pages,
+ struct list_head *page_list)
+{
+ struct page *page;
+ unsigned long flags;
+ struct zone *zone;
+ struct zoneref *z;
+ struct per_cpu_pages *pcp;
+ struct list_head *pcp_list;
+ struct alloc_context ac;
+ gfp_t alloc_gfp;
+ unsigned int alloc_flags;
+ int allocated = 0;
+
+ if (WARN_ON_ONCE(nr_pages <= 0))
+ return 0;
+
+ if (WARN_ON_ONCE(!list_empty(page_list)))
+ return 0;
+
+ if (nr_pages == 1)
+ goto failed;
+
+ /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
+ gfp &= gfp_allowed_mask;
+ alloc_gfp = gfp;
+ if (!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask, &ac, &alloc_gfp, &alloc_flags))
+ return 0;
+ gfp = alloc_gfp;
+
+ /* Find an allowed local zone that meets the high watermark. */
+ for_each_zone_zonelist_nodemask(zone, z, ac.zonelist, ac.highest_zoneidx, ac.nodemask) {
+ unsigned long mark;
+
+ if (cpusets_enabled() && (alloc_flags & ALLOC_CPUSET) &&
+ !__cpuset_zone_allowed(zone, gfp)) {
+ continue;
+ }
+
+ if (nr_online_nodes > 1 && zone != ac.preferred_zoneref->zone &&
+ zone_to_nid(zone) != zone_to_nid(ac.preferred_zoneref->zone)) {
+ goto failed;
+ }
+
+ mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK) + nr_pages;
+ if (zone_watermark_fast(zone, 0, mark,
+ zonelist_zone_idx(ac.preferred_zoneref),
+ alloc_flags, gfp)) {
+ break;
+ }
+ }
+
+ /*
+ * If there are no allowed local zones that meets the watermarks then
+ * try to allocate a single page and reclaim if necessary.
+ */
+ if (!zone)
+ goto failed;
+
+ /* Attempt the batch allocation */
+ local_irq_save(flags);
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp_list = &pcp->lists[ac.migratetype];
+
+ while (allocated < nr_pages) {
+ page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
+ pcp, pcp_list);
+ if (!page) {
+ /* Try and get at least one page */
+ if (!allocated)
+ goto failed_irq;
+ break;
+ }
+
+ /*
+ * Ideally this would be batched but the best way to do
+ * that cheaply is to first convert zone_statistics to
+ * be inaccurate per-cpu counter like vm_events to avoid
+ * a RMW cycle then do the accounting with IRQs enabled.
+ */
+ __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
+ zone_statistics(ac.preferred_zoneref->zone, zone);
+
+ list_add(&page->lru, page_list);
+ allocated++;
+ }
+
+ local_irq_restore(flags);
+
+ /* Prep pages with IRQs enabled. */
+ list_for_each_entry(page, page_list, lru)
+ prep_new_page(page, 0, gfp, 0);
+
+ return allocated;
+
+failed_irq:
+ local_irq_restore(flags);
+
+failed:
+ page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
+ if (page) {
+ list_add(&page->lru, page_list);
+ allocated = 1;
+ }
+
+ return allocated;
+}
+EXPORT_SYMBOL_GPL(__alloc_pages_bulk);
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
--
2.26.2
On Mon, Mar 22, 2021 at 01:04:46PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 22 Mar 2021 09:18:42 +0000
> Mel Gorman <[email protected]> wrote:
>
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>
> page_bench04_bulk[1] micro-benchmark on branch: mm-bulk-rebase-v5r9
> [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c
>
> BASELINE
> single_page alloc+put: Per elem: 199 cycles(tsc) 55.472 ns
>
> LIST variant: time_bulk_page_alloc_free_list: step=bulk size
>
> Per elem: 206 cycles(tsc) 57.478 ns (step:1)
> Per elem: 154 cycles(tsc) 42.861 ns (step:2)
> Per elem: 145 cycles(tsc) 40.536 ns (step:3)
> Per elem: 142 cycles(tsc) 39.477 ns (step:4)
> Per elem: 142 cycles(tsc) 39.610 ns (step:8)
> Per elem: 137 cycles(tsc) 38.155 ns (step:16)
> Per elem: 135 cycles(tsc) 37.739 ns (step:32)
> Per elem: 134 cycles(tsc) 37.282 ns (step:64)
> Per elem: 133 cycles(tsc) 36.993 ns (step:128)
>
> ARRAY variant: time_bulk_page_alloc_free_array: step=bulk size
>
> Per elem: 202 cycles(tsc) 56.383 ns (step:1)
> Per elem: 144 cycles(tsc) 40.047 ns (step:2)
> Per elem: 134 cycles(tsc) 37.339 ns (step:3)
> Per elem: 128 cycles(tsc) 35.578 ns (step:4)
> Per elem: 120 cycles(tsc) 33.592 ns (step:8)
> Per elem: 116 cycles(tsc) 32.362 ns (step:16)
> Per elem: 113 cycles(tsc) 31.476 ns (step:32)
> Per elem: 110 cycles(tsc) 30.633 ns (step:64)
> Per elem: 110 cycles(tsc) 30.596 ns (step:128)
>
> Compared to the previous results (see below) the list variant got faster,
> but the array variant is still faster. The array variant lost a little
> performance. I think this can be related to the stats counters being
> added/moved inside the loop in this patchset.
>
If you are feeling particularly brave, take a look at
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-percpu-local_lock-v1r10
It's a prototype series rebased on top of the bulk allocator and this
version has not even been boot tested. While it'll get rough treatment
during review, it should reduce the cost of the stat updates in the
bulk allocator as a side-effect.
--
Mel Gorman
SUSE Labs
> On Mar 22, 2021, at 5:18 AM, Mel Gorman <[email protected]> wrote:
>
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>
> The users of the API have been dropped in this version as the callers
> need to check whether they prefer an array or list interface (whether
> preference is based on convenience or performance).
I now have a consumer implementation that uses the array
API. If I understand the contract correctly, the return
value is the last array index that __alloc_pages_bulk()
visits. My consumer uses the return value to determine
if it needs to call the allocator again.
It is returning some confusing (to me) results. I'd like
to get these resolved before posting any benchmark
results.
1. When it has visited every array element, it returns the
same value as was passed in @nr_pages. That's the N + 1th
array element, which shouldn't be touched. Should the
allocator return nr_pages - 1 in the fully successful case?
Or should the documentation describe the return value as
"the number of elements visited" ?
2. Frequently the allocator returns a number smaller than
the total number of elements. As you may recall, sunrpc
will delay a bit (via a call to schedule_timeout) then call
again. This is supposed to be a rare event, and the delay
is substantial. But with the array-based API, a not-fully-
successful allocator call seems to happen more than half
the time. Is that expected? I'm calling with GFP_KERNEL,
seems like the allocator should be trying harder.
3. Is the current design intended so that if the consumer
does call again, it is supposed to pass in the array address
+ the returned index (and @nr_pages reduced by the returned
index)?
Thanks for all your hard work, Mel.
> Changelog since v4
> o Drop users of the API
> o Remove free_pages_bulk interface, no users
> o Add array interface
> o Allocate single page if watermark checks on local zones fail
>
> Changelog since v3
> o Rebase on top of Matthew's series consolidating the alloc_pages API
> o Rename alloced to allocated
> o Split out preparation patch for prepare_alloc_pages
> o Defensive check for bulk allocation or <= 0 pages
> o Call single page allocation path only if no pages were allocated
> o Minor cosmetic cleanups
> o Reorder patch dependencies by subsystem. As this is a cross-subsystem
> series, the mm patches have to be merged before the sunrpc and net
> users.
>
> Changelog since v2
> o Prep new pages with IRQs enabled
> o Minor documentation update
>
> Changelog since v1
> o Parenthesise binary and boolean comparisons
> o Add reviewed-bys
> o Rebase to 5.12-rc2
>
> This series introduces a bulk order-0 page allocator with the
> intent that sunrpc and the network page pool become the first users.
> The implementation is not particularly efficient and the intention is to
> iron out what the semantics of the API should have for users. Despite
> that, this is a performance-related enhancement for users that require
> multiple pages for an operation without multiple round-trips to the page
> allocator. Quoting the last patch for the prototype high-speed networking
> use-case.
>
> For XDP-redirect workload with 100G mlx5 driver (that use page_pool)
> redirecting xdp_frame packets into a veth, that does XDP_PASS to
> create an SKB from the xdp_frame, which then cannot return the page
> to the page_pool. In this case, we saw[1] an improvement of 18.8%
> from using the alloc_pages_bulk API (3,677,958 pps -> 4,368,926 pps).
>
> Both potential users in this series are corner cases (NFS and high-speed
> networks) so it is unlikely that most users will see any benefit in the
> short term. Other potential other users are batch allocations for page
> cache readahead, fault around and SLUB allocations when high-order pages
> are unavailable. It's unknown how much benefit would be seen by converting
> multiple page allocation calls to a single batch or what difference it may
> make to headline performance. It's a chicken and egg problem given that
> the potential benefit cannot be investigated without an implementation
> to test against.
>
> Light testing passed, I'm relying on Chuck and Jesper to test their
> implementations, choose whether to use lists or arrays and document
> performance gains/losses in the changelogs.
>
> Patch 1 renames a variable name that is particularly unpopular
>
> Patch 2 adds a bulk page allocator
>
> Patch 3 adds an array-based version of the bulk allocator
>
> include/linux/gfp.h | 18 +++++
> mm/page_alloc.c | 171 ++++++++++++++++++++++++++++++++++++++++++--
> 2 files changed, 185 insertions(+), 4 deletions(-)
>
> --
> 2.26.2
>
--
Chuck Lever
On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
>
>
> > On Mar 22, 2021, at 5:18 AM, Mel Gorman <[email protected]> wrote:
> >
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> >
> > The users of the API have been dropped in this version as the callers
> > need to check whether they prefer an array or list interface (whether
> > preference is based on convenience or performance).
>
> I now have a consumer implementation that uses the array
> API. If I understand the contract correctly, the return
> value is the last array index that __alloc_pages_bulk()
> visits. My consumer uses the return value to determine
> if it needs to call the allocator again.
>
For either arrays or lists, the return value is the number of valid
pages. For arrays, the pattern is expected to be
nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
for (i = 0; i < nr_pages; i++) {
do something with page_array[i]
}
There *could* be populated valid elements on and after nr_pages but the
implementation did not visit those elements. The implementation can abort
early if the array looks like this
PPP....PPP
Where P is a page and . is NULL. The implementation would skip the
first three pages, allocate four pages into the holes and then abort when
it encountered the next already-populated element. This is an
implementation detail around how I handled
prep_new_page. It could be addressed if many callers expect to pass in
an array that has holes in the middle.
> It is returning some confusing (to me) results. I'd like
> to get these resolved before posting any benchmark
> results.
>
> 1. When it has visited every array element, it returns the
> same value as was passed in @nr_pages. That's the N + 1th
> array element, which shouldn't be touched. Should the
> allocator return nr_pages - 1 in the fully successful case?
> Or should the documentation describe the return value as
> "the number of elements visited" ?
>
I phrased it as "the known number of populated elements in the
page_array". I did not want to write it as "the number of valid elements
in the array" because that is not necessarily the case if an array is
passed in with holes in the middle. I'm open to any suggestions on how
the __alloc_pages_bulk description can be improved.
The definition of the return value as-is makes sense for either a list
or an array. Returning "nr_pages - 1" suits an array because it's the
last valid index but it makes less sense when returning a list.
> 2. Frequently the allocator returns a number smaller than
> the total number of elements. As you may recall, sunrpc
> will delay a bit (via a call to schedule_timeout) then call
> again. This is supposed to be a rare event, and the delay
> is substantial. But with the array-based API, a not-fully-
> successful allocator call seems to happen more than half
> the time. Is that expected? I'm calling with GFP_KERNEL,
> seems like the allocator should be trying harder.
>
It's not expected that the array implementation would be worse *unless*
you are passing in arrays with holes in the middle. Otherwise, the success
rate should be similar.
> 3. Is the current design intended so that if the consumer
> does call again, is it supposed to pass in the array address
> + the returned index (and @nr_pages reduced by the returned
> index) ?
>
The caller does not have to pass in array address + returned index but
it's more efficient if it does.
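For example, resuming from the returned index could look like the following
untested sketch (nr, array and filled are placeholder names and the holes
are assumed to be at the start):

	unsigned long filled = alloc_pages_bulk_array(GFP_KERNEL, nr, array);

	while (filled < nr) {
		/* Back off or reclaim as the caller sees fit before retrying. */
		filled += alloc_pages_bulk_array(GFP_KERNEL, nr - filled,
						 &array[filled]);
	}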
If you are passing in arrays with holes in the middle then the following
might work (not tested)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4dc38516a5bd 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
gfp_t alloc_gfp;
unsigned int alloc_flags;
int nr_populated = 0, prep_index = 0;
+ bool hole = false;
if (WARN_ON_ONCE(nr_pages <= 0))
return 0;
@@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (!zone)
goto failed;
+retry_hole:
/* Attempt the batch allocation */
local_irq_save(flags);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
@@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
* IRQs are enabled.
*/
if (page_array && page_array[nr_populated]) {
+ hole = true;
nr_populated++;
break;
}
@@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
prep_new_page(page_array[prep_index++], 0, gfp, 0);
}
+ if (hole && nr_populated < nr_pages && hole)
+ goto retry_hole;
+
return nr_populated;
failed_irq:
--
Mel Gorman
SUSE Labs
> On Mar 22, 2021, at 3:49 PM, Mel Gorman <[email protected]> wrote:
>
> On Mon, Mar 22, 2021 at 06:25:03PM +0000, Chuck Lever III wrote:
>>
>>
>>> On Mar 22, 2021, at 5:18 AM, Mel Gorman <[email protected]> wrote:
>>>
>>> This series is based on top of Matthew Wilcox's series "Rationalise
>>> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
>>> test and are not using Andrew's tree as a baseline, I suggest using the
>>> following git tree
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>>>
>>> The users of the API have been dropped in this version as the callers
>>> need to check whether they prefer an array or list interface (whether
>>> preference is based on convenience or performance).
>>
>> I now have a consumer implementation that uses the array
>> API. If I understand the contract correctly, the return
>> value is the last array index that __alloc_pages_bulk()
>> visits. My consumer uses the return value to determine
>> if it needs to call the allocator again.
>>
>
> For either arrays or lists, the return value is the number of valid
> pages. For arrays, the pattern is expected to be
>
> nr_pages = alloc_pages_bulk(gfp, nr_requested, page_array);
> for (i = 0; i < nr_pages; i++) {
> do something with page_array[i]
> }
>
> There *could* be populated valid elements on and after nr_pages but the
> implementation did not visit those elements. The implementation can abort
> early if the array looks like this
>
> PPP....PPP
>
> Where P is a page and . is NULL. The implementation would skip the
> first three pages, allocate four pages and then abort when a new page
> was encountered. This is an implementation detail around how I handled
> prep_new_page. It could be addressed if many callers expect to pass in
> an array that has holes in the middle.
>
>> It is returning some confusing (to me) results. I'd like
>> to get these resolved before posting any benchmark
>> results.
>>
>> 1. When it has visited every array element, it returns the
>> same value as was passed in @nr_pages. That's the N + 1th
>> array element, which shouldn't be touched. Should the
>> allocator return nr_pages - 1 in the fully successful case?
>> Or should the documentation describe the return value as
>> "the number of elements visited" ?
>>
>
> I phrased it as "the known number of populated elements in the
> page_array".
The comment you added states:
+ * For lists, nr_pages is the number of pages that should be allocated.
+ *
+ * For arrays, only NULL elements are populated with pages and nr_pages
+ * is the maximum number of pages that will be stored in the array.
+ *
+ * Returns the number of pages added to the page_list or the index of the
+ * last known populated element of page_array.
> I did not want to write it as "the number of valid elements
> in the array" because that is not necessarily the case if an array is
> passed in with holes in the middle. I'm open to any suggestions on how
> the __alloc_pages_bulk description can be improved.
The comment states that, for the array case, a /count/ of
pages is passed in, and an /index/ is returned. If you want
to return the same type for lists and arrays, it should be
documented as a count in both cases, to match @nr_pages.
Consumers will want to compare @nr_pages with the return
value to see if they need to call again.
Comparing a count to an index is a notorious source of
off-by-one errors.
> The definition of the return value as-is makes sense for either a list
> or an array. Returning "nr_pages - 1" suits an array because it's the
> last valid index but it makes less sense when returning a list.
>
>> 2. Frequently the allocator returns a number smaller than
>> the total number of elements. As you may recall, sunrpc
>> will delay a bit (via a call to schedule_timeout) then call
>> again. This is supposed to be a rare event, and the delay
>> is substantial. But with the array-based API, a not-fully-
>> successful allocator call seems to happen more than half
>> the time. Is that expected? I'm calling with GFP_KERNEL,
>> seems like the allocator should be trying harder.
>>
>
> It's not expected that the array implementation would be worse *unless*
> you are passing in arrays with holes in the middle. Otherwise, the success
> rate should be similar.
Essentially, sunrpc will always pass an array with a hole.
Each RPC consumes the first N elements in the rq_pages array.
Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
pass in an array with more than one hole. Typically:
.....PPPP
My results show that, because svc_alloc_arg() ends up calling
__alloc_pages_bulk() twice in this case, it ends up being
twice as expensive as the list case, on average, for the same
workload.
>> 3. Is the current design intended so that if the consumer
>> does call again, is it supposed to pass in the array address
>> + the returned index (and @nr_pages reduced by the returned
>> index) ?
>>
>
> The caller does not have to pass in array address + returned index but
> it's more efficient if it does.
>
> If you are passing in arrays with holes in the middle then the following
> might work (not tested)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4dc38516a5bd 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5002,6 +5002,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> gfp_t alloc_gfp;
> unsigned int alloc_flags;
> int nr_populated = 0, prep_index = 0;
> + bool hole = false;
>
> if (WARN_ON_ONCE(nr_pages <= 0))
> return 0;
> @@ -5057,6 +5058,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> if (!zone)
> goto failed;
>
> +retry_hole:
> /* Attempt the batch allocation */
> local_irq_save(flags);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> @@ -5069,6 +5071,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> * IRQs are enabled.
> */
> if (page_array && page_array[nr_populated]) {
> + hole = true;
> nr_populated++;
> break;
> }
> @@ -5109,6 +5112,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> prep_new_page(page_array[prep_index++], 0, gfp, 0);
> }
>
> + if (hole && nr_populated < nr_pages && hole)
> + goto retry_hole;
> +
> return nr_populated;
>
> failed_irq:
>
> --
> Mel Gorman
> SUSE Labs
If a local_irq_save() is done more than once in this case, I don't
expect that the result will be much better.
To make the array API as performant as the list API, the sunrpc
consumer will have to check if the N + 1th element is populated,
upon return, rather than checking the return value against
@nr_pages.
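One way to read that, as a rough untested sketch with placeholder names
(and assuming the holes sit at the start of the array):

	/* Treat the array, not the return value, as the completeness check. */
	alloc_pages_bulk_array(GFP_KERNEL, nr_pages, page_array);
	if (!page_array[nr_pages - 1]) {
		/* not fully populated yet; sleep and call again */
	}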
--
Chuck Lever
On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> >> It is returning some confusing (to me) results. I'd like
> >> to get these resolved before posting any benchmark
> >> results.
> >>
> >> 1. When it has visited every array element, it returns the
> >> same value as was passed in @nr_pages. That's the N + 1th
> >> array element, which shouldn't be touched. Should the
> >> allocator return nr_pages - 1 in the fully successful case?
> >> Or should the documentation describe the return value as
> >> "the number of elements visited" ?
> >>
> >
> > I phrased it as "the known number of populated elements in the
> > page_array".
>
> The comment you added states:
>
> + * For lists, nr_pages is the number of pages that should be allocated.
> + *
> + * For arrays, only NULL elements are populated with pages and nr_pages
> + * is the maximum number of pages that will be stored in the array.
> + *
> + * Returns the number of pages added to the page_list or the index of the
> + * last known populated element of page_array.
>
>
> > I did not want to write it as "the number of valid elements
> > in the array" because that is not necessarily the case if an array is
> > passed in with holes in the middle. I'm open to any suggestions on how
> > the __alloc_pages_bulk description can be improved.
>
> The comments states that, for the array case, a /count/ of
> pages is passed in, and an /index/ is returned. If you want
> to return the same type for lists and arrays, it should be
> documented as a count in both cases, to match @nr_pages.
> Consumers will want to compare @nr_pages with the return
> value to see if they need to call again.
>
Then I'll just say it's the known count of pages in the array. That
might still be less than the number of requested pages if holes are
encountered.
> > The definition of the return value as-is makes sense for either a list
> > or an array. Returning "nr_pages - 1" suits an array because it's the
> > last valid index but it makes less sense when returning a list.
> >
> >> 2. Frequently the allocator returns a number smaller than
> >> the total number of elements. As you may recall, sunrpc
> >> will delay a bit (via a call to schedule_timeout) then call
> >> again. This is supposed to be a rare event, and the delay
> >> is substantial. But with the array-based API, a not-fully-
> >> successful allocator call seems to happen more than half
> >> the time. Is that expected? I'm calling with GFP_KERNEL,
> >> seems like the allocator should be trying harder.
> >>
> >
> > It's not expected that the array implementation would be worse *unless*
> > you are passing in arrays with holes in the middle. Otherwise, the success
> > rate should be similar.
>
> Essentially, sunrpc will always pass an array with a hole.
> Each RPC consumes the first N elements in the rq_pages array.
> Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> pass in an array with more than one hole. Typically:
>
> .....PPPP
>
> My results show that, because svc_alloc_arg() ends up calling
> __alloc_pages_bulk() twice in this case, it ends up being
> twice as expensive as the list case, on average, for the same
> workload.
>
Ok, so in this case the caller knows that holes are always at the
start. If the API returns an index that is a valid index and populated,
it can check the next index and if it is valid then the whole array
must be populated.
Right now, the implementation checks for populated elements at the *start*
because it is required for calling prep_new_page starting at the correct
index and the API cannot make assumptions about the location of the hole.
The patch below would check the rest of the array but note that it's
slower for the API to do this check because it has to check every element
while the sunrpc user could check one element. Let me know if a) this
hunk helps and b) is desired behaviour.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c83d38dfe936..4bf20650e5f5 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
} else {
while (prep_index < nr_populated)
prep_new_page(page_array[prep_index++], 0, gfp, 0);
+
+ while (nr_populated < nr_pages && page_array[nr_populated])
+ nr_populated++;
}
return nr_populated;
--
Mel Gorman
SUSE Labs
On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> This series is based on top of Matthew Wilcox's series "Rationalise
> __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> test and are not using Andrew's tree as a baseline, I suggest using the
> following git tree
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
>
Jesper and Chuck, would you mind rebasing on top of the following branch
please?
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
The interface is the same so the rebase should be trivial.
Jesper, I'm hoping you see no differences in performance but it's best
to check.
For Chuck, this version will check for holes and scan the remainder of
the array to see whether nr_pages have been allocated before returning. If
the holes in the array are always at the start (which they should be for
sunrpc) then it should still be a single IRQ disable/enable. Specifically, each
contiguous hole in the array will disable/enable IRQs once. I prototyped
NFS array support and it had a 100% success rate with no sleeps running
dbench over the network with no memory pressure but that's a basic test
on a 10G switch.
The basic patch I used to convert sunrpc from using lists to an array
for testing is as follows:
diff --git a/net/sunrpc/svc_xprt.c b/net/sunrpc/svc_xprt.c
index 922118968986..0ce33c1742d9 100644
--- a/net/sunrpc/svc_xprt.c
+++ b/net/sunrpc/svc_xprt.c
@@ -642,12 +642,10 @@ static void svc_check_conn_limits(struct svc_serv *serv)
static int svc_alloc_arg(struct svc_rqst *rqstp)
{
struct svc_serv *serv = rqstp->rq_server;
- unsigned long needed;
struct xdr_buf *arg;
- struct page *page;
LIST_HEAD(list);
int pages;
- int i;
+ int i = 0;
pages = (serv->sv_max_mesg + 2 * PAGE_SIZE) >> PAGE_SHIFT;
if (pages > RPCSVC_MAXPAGES) {
@@ -657,29 +655,15 @@ static int svc_alloc_arg(struct svc_rqst *rqstp)
pages = RPCSVC_MAXPAGES;
}
- for (needed = 0, i = 0; i < pages ; i++) {
- if (!rqstp->rq_pages[i])
- needed++;
- }
- i = 0;
- while (needed) {
- needed -= alloc_pages_bulk(GFP_KERNEL, needed, &list);
- for (; i < pages; i++) {
- if (rqstp->rq_pages[i])
- continue;
- page = list_first_entry_or_null(&list, struct page, lru);
- if (likely(page)) {
- list_del(&page->lru);
- rqstp->rq_pages[i] = page;
- continue;
- }
+ while (i < pages) {
+ i = alloc_pages_bulk_array(GFP_KERNEL, pages, &rqstp->rq_pages[0]);
+ if (i < pages) {
set_current_state(TASK_INTERRUPTIBLE);
if (signalled() || kthread_should_stop()) {
set_current_state(TASK_RUNNING);
return -EINTR;
}
schedule_timeout(msecs_to_jiffies(500));
- break;
}
}
rqstp->rq_page_end = &rqstp->rq_pages[pages];
--
Mel Gorman
SUSE Labs
On Mon, 22 Mar 2021 20:58:27 +0000
Mel Gorman <[email protected]> wrote:
> On Mon, Mar 22, 2021 at 08:32:54PM +0000, Chuck Lever III wrote:
> > >> It is returning some confusing (to me) results. I'd like
> > >> to get these resolved before posting any benchmark
> > >> results.
> > >>
> > >> 1. When it has visited every array element, it returns the
> > >> same value as was passed in @nr_pages. That's the N + 1th
> > >> array element, which shouldn't be touched. Should the
> > >> allocator return nr_pages - 1 in the fully successful case?
> > >> Or should the documentation describe the return value as
> > >> "the number of elements visited" ?
> > >>
> > >
> > > I phrased it as "the known number of populated elements in the
> > > page_array".
> >
> > The comment you added states:
> >
> > + * For lists, nr_pages is the number of pages that should be allocated.
> > + *
> > + * For arrays, only NULL elements are populated with pages and nr_pages
> > + * is the maximum number of pages that will be stored in the array.
> > + *
> > + * Returns the number of pages added to the page_list or the index of the
> > + * last known populated element of page_array.
> >
> >
> > > I did not want to write it as "the number of valid elements
> > > in the array" because that is not necessarily the case if an array is
> > > passed in with holes in the middle. I'm open to any suggestions on how
> > > the __alloc_pages_bulk description can be improved.
> >
> > The comments states that, for the array case, a /count/ of
> > pages is passed in, and an /index/ is returned. If you want
> > to return the same type for lists and arrays, it should be
> > documented as a count in both cases, to match @nr_pages.
> > Consumers will want to compare @nr_pages with the return
> > value to see if they need to call again.
> >
>
> Then I'll just say it's the known count of pages in the array. That
> might still be less than the number of requested pages if holes are
> encountered.
>
> > > The definition of the return value as-is makes sense for either a list
> > > or an array. Returning "nr_pages - 1" suits an array because it's the
> > > last valid index but it makes less sense when returning a list.
> > >
> > >> 2. Frequently the allocator returns a number smaller than
> > >> the total number of elements. As you may recall, sunrpc
> > >> will delay a bit (via a call to schedule_timeout) then call
> > >> again. This is supposed to be a rare event, and the delay
> > >> is substantial. But with the array-based API, a not-fully-
> > >> successful allocator call seems to happen more than half
> > >> the time. Is that expected? I'm calling with GFP_KERNEL,
> > >> seems like the allocator should be trying harder.
> > >>
> > >
> > > It's not expected that the array implementation would be worse *unless*
> > > you are passing in arrays with holes in the middle. Otherwise, the success
> > > rate should be similar.
> >
> > Essentially, sunrpc will always pass an array with a hole.
> > Each RPC consumes the first N elements in the rq_pages array.
> > Sometimes N == ARRAY_SIZE(rq_pages). AFAIK sunrpc will not
> > pass in an array with more than one hole. Typically:
> >
> > .....PPPP
> >
> > My results show that, because svc_alloc_arg() ends up calling
> > __alloc_pages_bulk() twice in this case, it ends up being
> > twice as expensive as the list case, on average, for the same
> > workload.
> >
>
> Ok, so in this case the caller knows that holes are always at the
> start. If the API returns an index that is a valid index and populated,
> it can check the next index and if it is valid then the whole array
> must be populated.
>
> Right now, the implementation checks for populated elements at the *start*
> because it is required for calling prep_new_page starting at the correct
> index and the API cannot make assumptions about the location of the hole.
>
> The patch below would check the rest of the array but note that it's
> slower for the API to do this check because it has to check every element
> while the sunrpc user could check one element. Let me know if a) this
> hunk helps and b) is desired behaviour.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index c83d38dfe936..4bf20650e5f5 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5107,6 +5107,9 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> } else {
> while (prep_index < nr_populated)
> prep_new_page(page_array[prep_index++], 0, gfp, 0);
> +
> + while (nr_populated < nr_pages && page_array[nr_populated])
> + nr_populated++;
> }
>
> return nr_populated;
I do know that I suggested moving prep_new_page() out of the
IRQ-disabled loop, but maybe it was a bad idea, for several reasons.
All prep_new_page() does is write into struct page, unless some
debugging stuff (like kasan) is enabled. This cache line is hot as the
LRU-list update just wrote into it. As the bulk size goes
up, as Matthew pointed out, this cache line might be pushed into
the L2 cache, and then needs to be accessed again when prep_new_page() is
called.
Another observation is that moving prep_new_page() into the loop reduced
the function size by 253 bytes (which affects the I-cache).
./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
Function old new delta
__alloc_pages_bulk 1965 1712 -253
Total: Before=60799, After=60546, chg -0.42%
Maybe it is better to keep prep_new_page() inside the loop. This also
allows the list and array variants to share the call. And it should simplify
the array variant code.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH] mm: move prep_new_page inside IRQ disabled loop
From: Jesper Dangaard Brouer <[email protected]>
./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
Function old new delta
__alloc_pages_bulk 1965 1712 -253
Total: Before=60799, After=60546, chg -0.42%
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
mm/page_alloc.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 88a5c1ce5b87..b4ff09b320bc 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5096,11 +5096,13 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
else
page_array[nr_populated] = page;
nr_populated++;
+ prep_new_page(page, 0, gfp, 0);
}
local_irq_restore(flags);
/* Prep pages with IRQs enabled. */
+/*
if (page_list) {
list_for_each_entry(page, page_list, lru)
prep_new_page(page, 0, gfp, 0);
@@ -5108,7 +5110,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
while (prep_index < nr_populated)
prep_new_page(page_array[prep_index++], 0, gfp, 0);
}
-
+*/
return nr_populated;
failed_irq:
On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote:
> > > <SNIP>
> > > My results show that, because svc_alloc_arg() ends up calling
> > > __alloc_pages_bulk() twice in this case, it ends up being
> > > twice as expensive as the list case, on average, for the same
> > > workload.
> > >
> >
> > Ok, so in this case the caller knows that holes are always at the
> > start. If the API returns an index that is a valid index and populated,
> > it can check the next index and if it is valid then the whole array
> > must be populated.
> >
> > <SNIP>
>
> I do know that I suggested moving prep_new_page() out of the
> IRQ-disabled loop, but maybe was a bad idea, for several reasons.
>
> All prep_new_page does is to write into struct page, unless some
> debugging stuff (like kasan) is enabled. This cache-line is hot as
> LRU-list update just wrote into this cache-line. As the bulk size goes
> up, as Matthew pointed out, this cache-line might be pushed into
> L2-cache, and then need to be accessed again when prep_new_page() is
> called.
>
> Another observation is that moving prep_new_page() into loop reduced
> function size with 253 bytes (which affect I-cache).
>
> ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
> add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
> Function old new delta
> __alloc_pages_bulk 1965 1712 -253
> Total: Before=60799, After=60546, chg -0.42%
>
> Maybe it is better to keep prep_new_page() inside the loop. This also
> allows list vs array variant to share the call. And it should simplify
> the array variant code.
>
I agree. I did not like the level of complexity it incurred for arrays
or the fact it required the list to be empty when alloc_pages_bulk()
is called. I thought the concern for calling prep_new_page() with IRQs
disabled was a little overblown but did not feel strongly enough to push
back on it hard given that we've had problems with IRQs being disabled
for long periods before. At worst, at some point in the future we'll have
to cap the number of pages that can be requested or enable/disable IRQs
every X pages.
New candidate
git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4
Interface is still the same so a rebase should be trivial. Diff between
v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P
mm/page_alloc.c | 54 +++++++++++++-----------------------------------------
1 file changed, 13 insertions(+), 41 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 547a84f11310..be1e33a4df39 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct alloc_context ac;
gfp_t alloc_gfp;
unsigned int alloc_flags;
- int nr_populated = 0, prep_index = 0;
- bool hole = false;
+ int nr_populated = 0;
if (WARN_ON_ONCE(nr_pages <= 0))
return 0;
- if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
- return 0;
-
- /* Skip populated array elements. */
- if (page_array) {
- while (nr_populated < nr_pages && page_array[nr_populated])
- nr_populated++;
- if (nr_populated == nr_pages)
- return nr_populated;
- prep_index = nr_populated;
- }
+ /*
+ * Skip populated array elements to determine if any pages need
+ * to be allocated before disabling IRQs.
+ */
+ while (page_array && nr_populated < nr_pages && page_array[nr_populated])
+ nr_populated++;
- if (nr_pages == 1)
+ /* Use the single page allocator for one page. */
+ if (nr_pages - nr_populated == 1)
goto failed;
/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
@@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
if (!zone)
goto failed;
-retry_hole:
/* Attempt the batch allocation */
local_irq_save(flags);
pcp = &this_cpu_ptr(zone->pageset)->pcp;
pcp_list = &pcp->lists[ac.migratetype];
while (nr_populated < nr_pages) {
- /*
- * Stop allocating if the next index has a populated
- * page or the page will be prepared a second time when
- * IRQs are enabled.
- */
+
+ /* Skip existing pages */
if (page_array && page_array[nr_populated]) {
- hole = true;
nr_populated++;
- break;
+ continue;
}
page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
@@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
__count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
zone_statistics(ac.preferred_zoneref->zone, zone);
+ prep_new_page(page, 0, gfp, 0);
if (page_list)
list_add(&page->lru, page_list);
else
@@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
local_irq_restore(flags);
- /* Prep pages with IRQs enabled. */
- if (page_list) {
- list_for_each_entry(page, page_list, lru)
- prep_new_page(page, 0, gfp, 0);
- } else {
- while (prep_index < nr_populated)
- prep_new_page(page_array[prep_index++], 0, gfp, 0);
-
- /*
- * If the array is sparse, check whether the array is
- * now fully populated. Continue allocations if
- * necessary.
- */
- while (nr_populated < nr_pages && page_array[nr_populated])
- nr_populated++;
- if (hole && nr_populated < nr_pages)
- goto retry_hole;
- }
-
return nr_populated;
failed_irq:
--
Mel Gorman
SUSE Labs
On Tue, 23 Mar 2021 10:44:21 +0000
Mel Gorman <[email protected]> wrote:
> On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > This series is based on top of Matthew Wilcox's series "Rationalise
> > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > test and are not using Andrew's tree as a baseline, I suggest using the
> > following git tree
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> >
>
> Jesper and Chuck, would you mind rebasing on top of the following branch
> please?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
>
> The interface is the same so the rebase should be trivial.
>
> Jesper, I'm hoping you see no differences in performance but it's best
> to check.
I will rebase and check again.
In the current performance tests that I'm running, I observe that the
compiler lays out the code in unfortunate ways, which causes I-cache
performance issues. I wonder if you could integrate the patch below into
your patchset? (just squash it)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH] mm: optimize code layout for __alloc_pages_bulk
From: Jesper Dangaard Brouer <[email protected]>
Looking at the perf report and ASM code for __alloc_pages_bulk(), the code
layout is suboptimal. The compiler guesses wrong and places unlikely code at
the beginning. Due to the use of the WARN_ON_ONCE() macro, the UD2 asm
instruction is added to the code, which confuses the I-cache prefetcher in
the CPU.
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
mm/page_alloc.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f60f51a97a7b..88a5c1ce5b87 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5003,10 +5003,10 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
unsigned int alloc_flags;
int nr_populated = 0, prep_index = 0;
- if (WARN_ON_ONCE(nr_pages <= 0))
+ if (unlikely(nr_pages <= 0))
return 0;
- if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
+ if (unlikely(page_list && !list_empty(page_list)))
return 0;
/* Skip populated array elements. */
@@ -5018,7 +5018,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
prep_index = nr_populated;
}
- if (nr_pages == 1)
+ if (unlikely(nr_pages == 1))
goto failed;
/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
@@ -5054,7 +5054,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
* If there are no allowed local zones that meets the watermarks then
* try to allocate a single page and reclaim if necessary.
*/
- if (!zone)
+ if (unlikely(!zone))
goto failed;
/* Attempt the batch allocation */
@@ -5075,7 +5075,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
pcp, pcp_list);
- if (!page) {
+ if (unlikely(!page)) {
/* Try and get at least one page */
if (!nr_populated)
goto failed_irq;
On Mon, 22 Mar 2021 09:18:44 +0000
Mel Gorman <[email protected]> wrote:
> This patch adds a new page allocator interface via alloc_pages_bulk,
> and __alloc_pages_bulk. A caller requests a number of pages
> to be allocated and added to a list.
>
> The API is not guaranteed to return the requested number of pages and
> may fail if the preferred allocation zone has limited free memory, the
> cpuset changes during the allocation or page debugging decides to fail
> an allocation. It's up to the caller to request more pages in batch
> if necessary.
>
> Note that this implementation is not very efficient and could be improved
> but it would require refactoring. The intent is to make it available early
> to determine what semantics are required by different callers. Once the
> full semantics are nailed down, it can be refactored.
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Vlastimil Babka <[email protected]>
> ---
> include/linux/gfp.h | 11 ++++
> mm/page_alloc.c | 124 ++++++++++++++++++++++++++++++++++++++++++++
> 2 files changed, 135 insertions(+)
[...]
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 8a3e13277e22..3f4d56854c74 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4965,6 +4965,130 @@ static inline bool prepare_alloc_pages(gfp_t gfp_mask, unsigned int order,
> return true;
> }
>
> +/*
> + * __alloc_pages_bulk - Allocate a number of order-0 pages to a list
> + * @gfp: GFP flags for the allocation
> + * @preferred_nid: The preferred NUMA node ID to allocate from
> + * @nodemask: Set of nodes to allocate from, may be NULL
> + * @nr_pages: The number of pages requested
> + * @page_list: List to store the allocated pages, must be empty
> + *
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly and add them to a list. The list must be
> + * empty to allow new pages to be prepped with IRQs enabled.
> + *
> + * Returns the number of pages allocated.
> + */
> +int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> + nodemask_t *nodemask, int nr_pages,
> + struct list_head *page_list)
> +{
[...]
> + /*
> + * If there are no allowed local zones that meets the watermarks then
> + * try to allocate a single page and reclaim if necessary.
> + */
> + if (!zone)
> + goto failed;
> +
> + /* Attempt the batch allocation */
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[ac.migratetype];
> +
> + while (allocated < nr_pages) {
> + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> + pcp, pcp_list);
The function __rmqueue_pcplist() is now used in two places, which causes
the compiler to stop inlining the static function.
My tests show you should inline __rmqueue_pcplist(). See the patch below
my signature, which also has some benchmark notes. (Please squash it into
your patch and drop these notes.)
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
[PATCH] mm: inline __rmqueue_pcplist
From: Jesper Dangaard Brouer <[email protected]>
When __alloc_pages_bulk() was introduced, two callers of
__rmqueue_pcplist() existed and the compiler chose not to inline
this function.
./scripts/bloat-o-meter vmlinux-before vmlinux-inline__rmqueue_pcplist
add/remove: 0/1 grow/shrink: 2/0 up/down: 164/-125 (39)
Function old new delta
rmqueue 2197 2296 +99
__alloc_pages_bulk 1921 1986 +65
__rmqueue_pcplist 125 - -125
Total: Before=19374127, After=19374166, chg +0.00%
modprobe page_bench04_bulk loops=$((10**7))
Type:time_bulk_page_alloc_free_array
- Per elem: 106 cycles(tsc) 29.595 ns (step:64)
- (measurement period time:0.295955434 sec time_interval:295955434)
- (invoke count:10000000 tsc_interval:1065447105)
Before:
- Per elem: 110 cycles(tsc) 30.633 ns (step:64)
Signed-off-by: Jesper Dangaard Brouer <[email protected]>
---
mm/page_alloc.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2cbb8da811ab..f60f51a97a7b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3415,7 +3415,8 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z)
}
/* Remove page from the per-cpu list, caller must protect the list */
-static struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
+static inline
+struct page *__rmqueue_pcplist(struct zone *zone, int migratetype,
unsigned int alloc_flags,
struct per_cpu_pages *pcp,
struct list_head *list)
On Tue, 23 Mar 2021 16:08:14 +0100
Jesper Dangaard Brouer <[email protected]> wrote:
> On Tue, 23 Mar 2021 10:44:21 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Mon, Mar 22, 2021 at 09:18:42AM +0000, Mel Gorman wrote:
> > > This series is based on top of Matthew Wilcox's series "Rationalise
> > > __alloc_pages wrapper" and does not apply to 5.12-rc2. If you want to
> > > test and are not using Andrew's tree as a baseline, I suggest using the
> > > following git tree
> > >
> > > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v5r9
> > >
I've pushed my benchmark notes for this branch, mm-bulk-rebase-v5r9:
[1] https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org#test-on-mel-git-tree-mm-bulk-rebase-v5r9
> > Jesper and Chuck, would you mind rebasing on top of the following branch
> > please?
> >
> > git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r2
I'll rebase on mm-bulk-rebase-v6r4 tomorrow.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer
On Tue, Mar 23, 2021 at 05:00:08PM +0100, Jesper Dangaard Brouer wrote:
> > + /*
> > + * If there are no allowed local zones that meets the watermarks then
> > + * try to allocate a single page and reclaim if necessary.
> > + */
> > + if (!zone)
> > + goto failed;
> > +
> > + /* Attempt the batch allocation */
> > + local_irq_save(flags);
> > + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> > + pcp_list = &pcp->lists[ac.migratetype];
> > +
> > + while (allocated < nr_pages) {
> > + page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> > + pcp, pcp_list);
>
> The function __rmqueue_pcplist() is now used in two places, which causes
> the compiler to stop inlining the static function.
>
This was expected. It was not something I was particularly happy with
but avoiding it was problematic without major refactoring.
> My tests show you should inline __rmqueue_pcplist(). See the patch
> below my signature, which also has some benchmark notes. (Please
> squash it into your patch and drop these notes.)
>
The cycle savings per element are very marginal at just 4 cycles. I
expect the silly stat updates alone are way more costly, but the series
that addresses that is likely to be controversial. As I know the cycle
budget for processing a packet is tight, I've applied the patch but am
keeping it separate to preserve the data in case someone points out that
it is a big function to inline and "fixes" it.
--
Mel Gorman
SUSE Labs
> On Mar 23, 2021, at 10:45 AM, Mel Gorman <[email protected]> wrote:
>
> On Tue, Mar 23, 2021 at 12:08:51PM +0100, Jesper Dangaard Brouer wrote:
>>>> <SNIP>
>>>> My results show that, because svc_alloc_arg() ends up calling
>>>> __alloc_pages_bulk() twice in this case, it ends up being
>>>> twice as expensive as the list case, on average, for the same
>>>> workload.
>>>>
>>>
>>> Ok, so in this case the caller knows that holes are always at the
>>> start. If the API returns an index that is a valid index and populated,
>>> it can check the next index and if it is valid then the whole array
>>> must be populated.
>>>
>>> <SNIP>
>>
>> I do know that I suggested moving prep_new_page() out of the
>> IRQ-disabled loop, but maybe it was a bad idea, for several reasons.
>>
>> All prep_new_page() does is write into the struct page, unless some
>> debugging option (like KASAN) is enabled. This cache line is hot, as
>> the LRU-list update just wrote into it. As the bulk size goes up, as
>> Matthew pointed out, the cache line might be pushed into the L2 cache
>> and then need to be accessed again when prep_new_page() is called.
>>
>> Another observation is that moving prep_new_page() into the loop
>> reduced the function size by 253 bytes (which affects the I-cache).
>>
>> ./scripts/bloat-o-meter mm/page_alloc.o-prep_new_page-outside mm/page_alloc.o-prep_new_page-inside
>> add/remove: 18/18 grow/shrink: 0/1 up/down: 144/-397 (-253)
>> Function old new delta
>> __alloc_pages_bulk 1965 1712 -253
>> Total: Before=60799, After=60546, chg -0.42%
>>
>> Maybe it is better to keep prep_new_page() inside the loop. This also
>> allows the list and array variants to share the call, and it should
>> simplify the array variant code.
>>
>
> I agree. I did not like the level of complexity it incurred for arrays
> or the fact that it required the list to be empty when alloc_pages_bulk()
> is called. I thought the concern for calling prep_new_page() with IRQs
> disabled was a little overblown but did not feel strongly enough to push
> back on it hard given that we've had problems with IRQs being disabled
> for long periods before. At worst, at some point in the future we'll have
> to cap the number of pages that can be requested or enable/disable IRQs
> every X pages.
>
> New candidate
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-bulk-rebase-v6r4
I have rebased the SUNRPC patches to v6r4. Testing has shown a
minor functional regression, which I'm chasing down. But
performance is in the same ballpark. FYI
> The interface is still the same, so a rebase should be trivial. The diff
> between v6r2 and v6r4 is as follows. I like the diffstat if nothing else :P
>
>
> mm/page_alloc.c | 54 +++++++++++++-----------------------------------------
> 1 file changed, 13 insertions(+), 41 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 547a84f11310..be1e33a4df39 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -4999,25 +4999,20 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> struct alloc_context ac;
> gfp_t alloc_gfp;
> unsigned int alloc_flags;
> - int nr_populated = 0, prep_index = 0;
> - bool hole = false;
> + int nr_populated = 0;
>
> if (WARN_ON_ONCE(nr_pages <= 0))
> return 0;
>
> - if (WARN_ON_ONCE(page_list && !list_empty(page_list)))
> - return 0;
> -
> - /* Skip populated array elements. */
> - if (page_array) {
> - while (nr_populated < nr_pages && page_array[nr_populated])
> - nr_populated++;
> - if (nr_populated == nr_pages)
> - return nr_populated;
> - prep_index = nr_populated;
> - }
> + /*
> + * Skip populated array elements to determine if any pages need
> + * to be allocated before disabling IRQs.
> + */
> + while (page_array && page_array[nr_populated] && nr_populated < nr_pages)
> + nr_populated++;
>
> - if (nr_pages == 1)
> + /* Use the single page allocator for one page. */
> + if (nr_pages - nr_populated == 1)
> goto failed;
>
> /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> @@ -5056,22 +5051,17 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> if (!zone)
> goto failed;
>
> -retry_hole:
> /* Attempt the batch allocation */
> local_irq_save(flags);
> pcp = &this_cpu_ptr(zone->pageset)->pcp;
> pcp_list = &pcp->lists[ac.migratetype];
>
> while (nr_populated < nr_pages) {
> - /*
> - * Stop allocating if the next index has a populated
> - * page or the page will be prepared a second time when
> - * IRQs are enabled.
> - */
> +
> + /* Skip existing pages */
> if (page_array && page_array[nr_populated]) {
> - hole = true;
> nr_populated++;
> - break;
> + continue;
> }
>
> page = __rmqueue_pcplist(zone, ac.migratetype, alloc_flags,
> @@ -5092,6 +5082,7 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> __count_zid_vm_events(PGALLOC, zone_idx(zone), 1);
> zone_statistics(ac.preferred_zoneref->zone, zone);
>
> + prep_new_page(page, 0, gfp, 0);
> if (page_list)
> list_add(&page->lru, page_list);
> else
> @@ -5101,25 +5092,6 @@ int __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
>
> local_irq_restore(flags);
>
> - /* Prep pages with IRQs enabled. */
> - if (page_list) {
> - list_for_each_entry(page, page_list, lru)
> - prep_new_page(page, 0, gfp, 0);
> - } else {
> - while (prep_index < nr_populated)
> - prep_new_page(page_array[prep_index++], 0, gfp, 0);
> -
> - /*
> - * If the array is sparse, check whether the array is
> - * now fully populated. Continue allocations if
> - * necessary.
> - */
> - while (nr_populated < nr_pages && page_array[nr_populated])
> - nr_populated++;
> - if (hole && nr_populated < nr_pages)
> - goto retry_hole;
> - }
> -
> return nr_populated;
>
> failed_irq:
>
> --
> Mel Gorman
> SUSE Labs
--
Chuck Lever
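Closing the loop on the array discussion above, here is a hedged sketch of an
svc_alloc_arg-style caller topping up a partially consumed array, using the
combined list/array signature shown in the v6r4 hunks; the helper name and
policy comments are illustrative:

/*
 * Hypothetical caller, not from the series: consumed slots were set to
 * NULL, and already-populated slots are skipped by the allocator, so a
 * single call refills only the holes.
 */
static bool refill_page_array(struct page **pages, int nr)
{
        int filled;

        filled = __alloc_pages_bulk(GFP_KERNEL, numa_mem_id(), NULL,
                                    nr, NULL, pages);

        /*
         * On a partial result the caller decides whether to retry,
         * sleep, or fall back to alloc_page() for the remaining holes.
         */
        return filled == nr;
}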