2021-02-08 17:55:44

by Chuck Lever III

Subject: Fwd: alloc_pages_bulk()

Sorry for resending. I misremembered the linux-mm address.


> Begin forwarded message:
>
> From: Chuck Lever <[email protected]>
> Subject: alloc_pages_bulk()
> Date: February 8, 2021 at 10:42:07 AM EST
> To: "[email protected]" <[email protected]>, "[email protected]" <[email protected]>
> Cc: "[email protected]" <[email protected]>, Linux NFS Mailing List <[email protected]>
>
> Hi-
>
> [ please Cc: me, I'm not subscribed to linux-mm ]
>
> We've been discussing how NFSD can more efficiently refill its
> receive buffers (currently alloc_page() in a loop; see
> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>
> Neil Brown pointed me to this old thread:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> We see that many of the prerequisites are in v5.11-rc, but
> alloc_page_bulk() is not. I tried forward-porting 4/4 in that
> series, but enough internal APIs have changed since 2017 that
> the patch does not come close to applying and compiling.
>
> I'm wondering:
>
> a) is there a newer version of that work?
>
> b) if not, does there exist a preferred API in 5.11 for bulk
> page allocation?
>
> Many thanks for any guidance!
>
> --
> Chuck Lever
>
>
>

--
Chuck Lever




2021-02-09 10:37:09

by Jesper Dangaard Brouer

Subject: Re: alloc_pages_bulk()

On Mon, 8 Feb 2021 17:50:51 +0000
Chuck Lever <[email protected]> wrote:

> Sorry for resending. I misremembered the linux-mm address.
>
> > Begin forwarded message:
> >
> > [ please Cc: me, I'm not subscribed to linux-mm ]
> >
> > We've been discussing how NFSD can more efficiently refill its
> > receive buffers (currently alloc_page() in a loop; see
> > net/sunrpc/svc_xprt.c::svc_alloc_arg()).
> >

It looks like you could also take advantage of bulk free in:
svc_free_res_pages()

I would like to use the page bulk alloc API here:
https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209
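
To sketch the kind of consumer I have in mind (hand-waving: the function
name, signature and exact semantics below are assumptions, not a final API):

/* Sketch only: assumes a bulk API roughly like
 *   unsigned long alloc_pages_bulk(gfp_t gfp, unsigned int order,
 *                                  unsigned long nr_pages,
 *                                  struct list_head *list);
 * returning how many order-'order' pages it put on 'list'.
 */
static unsigned long refill_array_from_bulk(struct page **cache,
					    unsigned long want, gfp_t gfp)
{
	LIST_HEAD(bulk_list);
	struct page *page, *tmp;
	unsigned long i = 0;

	alloc_pages_bulk(gfp | __GFP_NOWARN, 0, want, &bulk_list);

	/* Move whatever we got into the caller's array-style cache */
	list_for_each_entry_safe(page, tmp, &bulk_list, lru) {
		list_del(&page->lru);
		cache[i++] = page;
	}
	return i;	/* anything from 0 to 'want' */
}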


> > Neil Brown pointed me to this old thread:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > We see that many of the prerequisites are in v5.11-rc, but
> > alloc_page_bulk() is not. I tried forward-porting 4/4 in that
> > series, but enough internal APIs have changed since 2017 that
> > the patch does not come close to applying and compiling.

I forgot that this was never merged. It is sad as Mel showed huge
improvement with his work.

> > I'm wondering:
> >
> > a) is there a newer version of that work?
> >

Mel, why was this work never merged upstream?


> > b) if not, does there exist a preferred API in 5.11 for bulk
> > page allocation?
> >
> > Many thanks for any guidance!

I have a kernel module that micro-benchmarks the alloc_pages_bulk() API here:
https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-02-09 13:40:29

by Chuck Lever III

Subject: Re: alloc_pages_bulk()

Hi Jesper-

> On Feb 9, 2021, at 5:31 AM, Jesper Dangaard Brouer <[email protected]> wrote:
>
> On Mon, 8 Feb 2021 17:50:51 +0000
> Chuck Lever <[email protected]> wrote:
>
>> Sorry for resending. I misremembered the linux-mm address.
>>
>>> Begin forwarded message:
>>>
>>> [ please Cc: me, I'm not subscribed to linux-mm ]
>>>
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>>>
>
> It looks like you could also take advantage of bulk free in:
> svc_free_res_pages()

We started there. Those pages often have a non-zero reference count,
so that call site didn't seem to be a candidate for a bulk free.


> I would like to use the page bulk alloc API here:
> https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209
>
>
>>> Neil Brown pointed me to this old thread:
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> We see that many of the prerequisites are in v5.11-rc, but
>>> alloc_page_bulk() is not. I tried forward-porting 4/4 in that
>>> series, but enough internal APIs have changed since 2017 that
>>> the patch does not come close to applying and compiling.
>
> I forgot that this was never merged. It is sad as Mel showed huge
> improvement with his work.
>
>>> I'm wondering:
>>>
>>> a) is there a newer version of that work?
>>>
>
> Mel, why was this work never merged upstream?
>
>
>>> b) if not, does there exist a preferred API in 5.11 for bulk
>>> page allocation?
>>>
>>> Many thanks for any guidance!
>
> I have a kernel module that micro-bench the API alloc_pages_bulk() here:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97
>
> --
> Best regards,
> Jesper Dangaard Brouer
> MSc.CS, Principal Kernel Engineer at Red Hat
> LinkedIn: http://www.linkedin.com/in/brouer
>

--
Chuck Lever



2021-02-09 17:31:42

by Vlastimil Babka

Subject: Re: alloc_pages_bulk()

On 2/9/21 11:31 AM, Jesper Dangaard Brouer wrote:
> On Mon, 8 Feb 2021 17:50:51 +0000
> Chuck Lever <[email protected]> wrote:
>
>> Sorry for resending. I misremembered the linux-mm address.
>>
>> > Begin forwarded message:
>> >
>> > [ please Cc: me, I'm not subscribed to linux-mm ]
>> >
>> > We've been discussing how NFSD can more efficiently refill its
>> > receive buffers (currently alloc_page() in a loop; see
>> > net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>> >
>
> It looks like you could also take advantage of bulk free in:
> svc_free_res_pages()
>
> I would like to use the page bulk alloc API here:
> https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209
>
>
>> > Neil Brown pointed me to this old thread:
>> >
>> > https://lore.kernel.org/lkml/[email protected]/
>> >
>> > We see that many of the prerequisites are in v5.11-rc, but
>> > alloc_page_bulk() is not. I tried forward-porting 4/4 in that
>> > series, but enough internal APIs have changed since 2017 that
>> > the patch does not come close to applying and compiling.
>
> I forgot that this was never merged. It is sad as Mel showed huge
> improvement with his work.
>
>> > I'm wondering:
>> >
>> > a) is there a newer version of that work?
>> >
>
> Mel, why was this work never merged upstream?

Hmm the cover letter of that work says:

> The fourth patch introduces a bulk page allocator with no
> in-kernel users as an example for Jesper and others who want to
> build a page allocator for DMA-coherent pages. It hopefully is
> relatively easy to modify this API and the one core function to get
> the semantics they require.

So it seems there were no immediate users to finalize the API?

>> > b) if not, does there exist a preferred API in 5.11 for bulk
>> > page allocation?
>> >
>> > Many thanks for any guidance!
>
> I have a kernel module that micro-bench the API alloc_pages_bulk() here:
> https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/bench/page_bench04_bulk.c#L97
>

2021-02-09 23:54:49

by Matthew Wilcox

Subject: Re: Fwd: alloc_pages_bulk()

On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
> > We've been discussing how NFSD can more efficiently refill its
> > receive buffers (currently alloc_page() in a loop; see
> > net/sunrpc/svc_xprt.c::svc_alloc_arg()).

I'm not familiar with the sunrpc architecture, but this feels like you're
trying to optimise something that shouldn't exist. Ideally a write
would ask the page cache for the pages that correspond to the portion
of the file which is being written to. I appreciate that doesn't work
well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
should be possible, right?

2021-02-09 23:55:35

by Chuck Lever III

Subject: Re: alloc_pages_bulk()



> On Feb 9, 2021, at 5:01 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Feb 08, 2021 at 05:50:51PM +0000, Chuck Lever wrote:
>>> We've been discussing how NFSD can more efficiently refill its
>>> receive buffers (currently alloc_page() in a loop; see
>>> net/sunrpc/svc_xprt.c::svc_alloc_arg()).
>
> I'm not familiar with the sunrpc architecture, but this feels like you're
> trying to optimise something that shouldn't exist. Ideally a write
> would ask the page cache for the pages that correspond to the portion
> of the file which is being written to. I appreciate that doesn't work
> well for, eg, NFS-over-TCP, but for NFS over any kind of RDMA, that
> should be possible, right?

(Note there is room for improvement for both transport types).

Since you asked ;-) there are four broad categories of NFSD I/O.

1. Receive an ingress RPC message (typically a Call)

2. Read from a file

3. Write to a file

4. Send an egress RPC message (typically a Reply)

A server RPC transaction is some combination of these, usually
1, 2, and 4; or 1, 3, and 4.

To do 1, the server allocates a set of order-0 pages to form a
receive buffer and a set of order-0 pages to form a send buffer.
We want to handle this with bulk allocation. The Call is then
received into the receive buffer pages.

The receive buffer pages typically stay with the nfsd thread for
its lifetime, but the send buffer pages do not. We want to use a
buffer page size that matches the page cache size (see below) and
is also small enough that allocation is very unlikely to fail.
The largest transactions (NFS READ and WRITE) use up to 1MB
worth of pages.
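
As a concrete reference for category 1, the current refill is roughly
the following (my paraphrase of net/sunrpc/svc_xprt.c::svc_alloc_arg(),
simplified; the real code also checks for signals and sleeps/retries
on allocation failure):

	for (i = 0; i < pages; i++) {
		if (rqstp->rq_pages[i])	/* still held from last time */
			continue;
		rqstp->rq_pages[i] = alloc_page(GFP_KERNEL);
		if (!rqstp->rq_pages[i])
			return -ENOMEM;	/* real code sleeps and retries */
	}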

Category two can be done by copying the file's pages into the send
buffer pages, or it can be done via a splice. When a splice is
done, the send buffer pages allocated above are released first
before being replaced in the buffer with the file's pages.

3 is currently done only by copying receive buffer pages to file
pages. Pages are neither allocated nor released by this category
of I/O. There are various reasons for this, but it's an area that
could stand some attention.

Sending (category 4) passes the send buffer pages to kernel_sendpage(),
which bumps the page count on them. When sendpage() returns, the
server does a put_page() on all of those pages, then goes back to
category 1 to replace the consumed send buffer pages. When the
network layer is finished with the pages, it releases them.

There are two reasons I can see for this:

1. A network send isn't complete until the server gets an ACK from
the client. This can take a while. I'm not aware of a TCP-provided
mechanism to indicate when the ACK has arrived, so the server can't
re-use them. (RDMA has an affirmative send completion event that
we can use to reduce send buffer churn).

2. If a splice was done, the send buffer pages that are also file
pages can't be re-used for the next RPC send buffer because
overwriting their content would corrupt the file. Thus they must
also be released and replaced.


--
Chuck Lever



2021-02-10 08:55:02

by Mel Gorman

Subject: Re: alloc_pages_bulk()

On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote:
> > > Neil Brown pointed me to this old thread:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > We see that many of the prerequisites are in v5.11-rc, but
> > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that
> > > series, but enough internal APIs have changed since 2017 that
> > > the patch does not come close to applying and compiling.
>
> I forgot that this was never merged. It is sad as Mel showed huge
> improvement with his work.
>
> > > I'm wondering:
> > >
> > > a) is there a newer version of that work?
> > >
>
> Mel, why was this work never merged upstream?
>

Lack of realistic consumers to drive it forward, finalise the API and
confirm it was working as expected. It eventually died as a result. If it
was reintroduced, it would need to be forward ported and then at least
one user implemented on top.

--
Mel Gorman
SUSE Labs

2021-02-10 09:56:35

by Christoph Hellwig

Subject: Re: alloc_pages_bulk()

On Tue, Feb 09, 2021 at 06:27:11PM +0100, Vlastimil Babka wrote:
>
> > The fourth patch introduces a bulk page allocator with no
> > in-kernel users as an example for Jesper and others who want to
> > build a page allocator for DMA-coherent pages. It hopefully is
> > relatively easy to modify this API and the one core function to get
> > the semantics they require.
>
> So it seems there were no immediate users to finalize the API?

__iommu_dma_alloc_pages would be a hot candidate.

2021-02-10 11:46:49

by Jesper Dangaard Brouer

Subject: Re: alloc_pages_bulk()

On Wed, 10 Feb 2021 08:41:55 +0000
Mel Gorman <[email protected]> wrote:

> On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote:
> > > > Neil Brown pointed me to this old thread:
> > > >
> > > > https://lore.kernel.org/lkml/[email protected]/
> > > >
> > > > We see that many of the prerequisites are in v5.11-rc, but
> > > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that
> > > > series, but enough internal APIs have changed since 2017 that
> > > > the patch does not come close to applying and compiling.
> >
> > I forgot that this was never merged. It is sad as Mel showed huge
> > improvement with his work.
> >
> > > > I'm wondering:
> > > >
> > > > a) is there a newer version of that work?
> > > >
> >
> > Mel, why was this work never merged upstream?
> >
>
> Lack of realistic consumers to drive it forward, finalise the API and
> confirm it was working as expected. It eventually died as a result. If it
> was reintroduced, it would need to be forward ported and then implement
> at least one user on top.

I guess I misunderstood you back in 2017. I thought that I had presented
a clear use-case/consumer in page_pool[1]. But I guess you wanted the
code as part of the patchset; I thought I could add it later via the
net-next tree.

It seems that Chuck now has an NFS use-case, and Hellwig also has a
use-case for DMA-iommu in __iommu_dma_alloc_pages.

The performance improvement (in the above link) was really impressive!

Quote:
"It's roughly a 50-70% reduction of allocation costs and roughly a halving of the
overall cost of allocating/freeing batches of pages."

Who has time to revive this patchset?

I can only sign up for coding the page_pool usage.
Chuck, do you have time if Mel doesn't?

[1] https://github.com/torvalds/linux/blob/master/net/core/page_pool.c#L201-L209
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-02-10 13:09:37

by Mel Gorman

Subject: Re: alloc_pages_bulk()

On Wed, Feb 10, 2021 at 12:41:03PM +0100, Jesper Dangaard Brouer wrote:
> On Wed, 10 Feb 2021 08:41:55 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote:
> > > > > Neil Brown pointed me to this old thread:
> > > > >
> > > > > https://lore.kernel.org/lkml/[email protected]/
> > > > >
> > > > > We see that many of the prerequisites are in v5.11-rc, but
> > > > > alloc_page_bulk() is not. I tried forward-porting 4/4 in that
> > > > > series, but enough internal APIs have changed since 2017 that
> > > > > the patch does not come close to applying and compiling.
> > >
> > > I forgot that this was never merged. It is sad as Mel showed huge
> > > improvement with his work.
> > >
> > > > > I'm wondering:
> > > > >
> > > > > a) is there a newer version of that work?
> > > > >
> > >
> > > Mel, why was this work never merged upstream?
> > >
> >
> > Lack of realistic consumers to drive it forward, finalise the API and
> > confirm it was working as expected. It eventually died as a result. If it
> > was reintroduced, it would need to be forward ported and then implement
> > at least one user on top.
>
> I guess I misunderstood you back in 2017. I though that I had presented
> a clear use-case/consumer in page_pool[1].

You did but it was never integrated and/or tested AFAIK. I see page_pool
accepts orders so even by the original prototype, it would only have seen
a benefit for order-0 pages. It would also have needed some supporting
data that it actually helped with drivers using the page_pool interface
which I was not in the position to properly test at the time.

> But you wanted the code as
> part of the patchset I guess. I though, I could add it later via the
> net-next tree.
>

Yes, a consumer of the code should go in at the same time with supporting
data showing it actually helps because otherwise it's dead code.

> It seems that Chuck now have a NFS use-case, and Hellwig also have a
> use-case for DMA-iommu in __iommu_dma_alloc_pages.
>
> The performance improvement (in above link) were really impressive!
>
> Quote:
> "It's roughly a 50-70% reduction of allocation costs and roughly a halving of the
> overall cost of allocating/freeing batches of pages."
>
> Who have time to revive this patchset?
>

Not in the short term due to bug load and other obligations.

The original series had "mm, page_allocator: Only use per-cpu allocator
for irq-safe requests" but that was ultimately rejected because softirqs
were affected so it would have to be done without that patch.

The last patch can be rebased easily enough but it only batch allocates
order-0 pages. It's also only build tested and could be completely
miserable in practice; as I didn't even try to boot test it, let alone
actually test it, it could be a giant pile of crap. To make high orders
work, it would need significant reworking, but if the API showed even
partial benefit, it might motivate someone to reimplement the bulk
interfaces to perform better.

Rebased diff, build tested only, might not even work

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 6e479e9c48ce..d1b586e5b4b8 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -511,6 +511,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
}

+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long nr_pages, struct list_head *alloc_list);
+
+static inline unsigned long
+__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, unsigned long nr_pages,
+ struct list_head *list)
+{
+ return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
+ nr_pages, list);
+}
+
+static inline unsigned long
+alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
+ unsigned long nr_pages, struct list_head *list)
+{
+ int nid = numa_mem_id();
+ return __alloc_pages_bulk(gfp_mask, order,
+ node_zonelist(nid, gfp_mask), nr_pages, list);
+}
+
/*
* Allocate pages, preferring the node given as nid. The node must be valid and
* online. For more general interface, see alloc_pages_node().
@@ -580,6 +603,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);

extern void __free_pages(struct page *page, unsigned int order);
extern void free_pages(unsigned long addr, unsigned int order);
+extern void free_pages_bulk(struct list_head *list);

struct page_frag_cache;
extern void __page_frag_cache_drain(struct page *page, unsigned int count);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 519a60d5b6f7..f8353ea7b977 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3254,7 +3254,7 @@ void free_unref_page(struct page *page)
}

/*
- * Free a list of 0-order pages
+ * Free a list of 0-order pages whose reference count is already zero.
*/
void free_unref_page_list(struct list_head *list)
{
@@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
}
}

+/* Drop reference counts and free pages from a list */
+void free_pages_bulk(struct list_head *list)
+{
+ struct page *page, *next;
+
+ list_for_each_entry_safe(page, next, list, lru) {
+ trace_mm_page_free_batched(page);
+ if (put_page_testzero(page)) {
+ list_del(&page->lru);
+ __free_pages_ok(page, 0, FPI_NONE);
+ }
+ }
+}
+EXPORT_SYMBOL_GPL(free_pages_bulk);
+
static inline unsigned int
gfp_to_alloc_flags(gfp_t gfp_mask)
{
@@ -5818,6 +5833,99 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
}


+/*
+ * This is a batched version of the page allocator that attempts to
+ * allocate nr_pages quickly from the preferred zone and add them to list.
+ * Note that there is no guarantee that nr_pages will be allocated although
+ * every effort will be made to allocate at least one. Unlike the core
+ * allocator, no special effort is made to recover from transient
+ * failures caused by changes in cpusets. It should only be used from !IRQ
+ * context. An attempt to allocate a batch of pages from an interrupt
+ * will allocate a single page.
+ */
+unsigned long
+__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
+ struct zonelist *zonelist, nodemask_t *nodemask,
+ unsigned long nr_pages, struct list_head *alloc_list)
+{
+ struct page *page;
+ unsigned long alloced = 0;
+ unsigned int alloc_flags = ALLOC_WMARK_LOW;
+ unsigned long flags;
+ struct zone *zone;
+ struct per_cpu_pages *pcp;
+ struct list_head *pcp_list;
+ int migratetype;
+ gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
+ struct alloc_context ac = { };
+
+ /* If there are already pages on the list, don't bother */
+ if (!list_empty(alloc_list))
+ return 0;
+
+ /* Only order-0 pages can go through the per-cpu lists */
+ if (order)
+ goto failed;
+
+ gfp_mask &= gfp_allowed_mask;
+
+ if (!prepare_alloc_pages(gfp_mask, order, numa_mem_id(), nodemask, &ac, &alloc_mask, &alloc_flags))
+ return 0;
+
+ if (!ac.preferred_zoneref)
+ return 0;
+
+ /*
+ * Only attempt a batch allocation if watermarks on the preferred zone
+ * are safe.
+ */
+ zone = ac.preferred_zoneref->zone;
+ if (!zone_watermark_fast(zone, order, high_wmark_pages(zone) + nr_pages,
+ zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp_mask))
+ goto failed;
+
+ /* Attempt the batch allocation */
+ migratetype = ac.migratetype;
+
+ local_irq_save(flags);
+ pcp = &this_cpu_ptr(zone->pageset)->pcp;
+ pcp_list = &pcp->lists[migratetype];
+
+ while (nr_pages) {
+ page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
+ pcp, pcp_list);
+ if (!page)
+ break;
+
+ prep_new_page(page, order, gfp_mask, 0);
+ nr_pages--;
+ alloced++;
+ list_add(&page->lru, alloc_list);
+ }
+
+ if (!alloced) {
+ preempt_enable_no_resched();
+ goto failed;
+ }
+
+ __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
+ zone_statistics(zone, zone);
+
+ local_irq_restore(flags);
+
+ return alloced;
+
+failed:
+ page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(), nodemask);
+ if (page) {
+ alloced++;
+ list_add(&page->lru, alloc_list);
+ }
+
+ return alloced;
+}
+EXPORT_SYMBOL(__alloc_pages_bulk_nodemask);
+
/*
* Build zonelists ordered by node and zones within node.
* This results in maximum locality--normal zone overflows into local

--
Mel Gorman
SUSE Labs

2021-02-10 23:02:10

by Chuck Lever III

Subject: Re: alloc_pages_bulk()



> On Feb 10, 2021, at 8:07 AM, Mel Gorman <[email protected]> wrote:
>
> On Wed, Feb 10, 2021 at 12:41:03PM +0100, Jesper Dangaard Brouer wrote:
>> On Wed, 10 Feb 2021 08:41:55 +0000
>> Mel Gorman <[email protected]> wrote:
>>
>>> On Tue, Feb 09, 2021 at 11:31:08AM +0100, Jesper Dangaard Brouer wrote:
>>>>>> Neil Brown pointed me to this old thread:
>>>>>>
>>>>>> https://lore.kernel.org/lkml/[email protected]/
>>>>>>
>>>>>> We see that many of the prerequisites are in v5.11-rc, but
>>>>>> alloc_page_bulk() is not. I tried forward-porting 4/4 in that
>>>>>> series, but enough internal APIs have changed since 2017 that
>>>>>> the patch does not come close to applying and compiling.
>>>>
>>>> I forgot that this was never merged. It is sad as Mel showed huge
>>>> improvement with his work.
>>>>
>>>>>> I'm wondering:
>>>>>>
>>>>>> a) is there a newer version of that work?
>>>>>>
>>>>
>>>> Mel, why was this work never merged upstream?
>>>>
>>>
>>> Lack of realistic consumers to drive it forward, finalise the API and
>>> confirm it was working as expected. It eventually died as a result. If it
>>> was reintroduced, it would need to be forward ported and then implement
>>> at least one user on top.
>>
>> I guess I misunderstood you back in 2017. I though that I had presented
>> a clear use-case/consumer in page_pool[1].
>
> You did but it was never integrated and/or tested AFAIK. I see page_pool
> accepts orders so even by the original prototype, it would only have seen
> a benefit for order-0 pages. It would also have needed some supporting
> data that it actually helped with drivers using the page_pool interface
> which I was not in the position to properly test at the time.
>
>> But you wanted the code as
>> part of the patchset I guess. I though, I could add it later via the
>> net-next tree.
>>
>
> Yes, a consumer of the code should go in at the same time with supporting
> data showing it actually helps because otherwise it's dead code.
>
>> It seems that Chuck now have a NFS use-case, and Hellwig also have a
>> use-case for DMA-iommu in __iommu_dma_alloc_pages.
>>
>> The performance improvement (in above link) were really impressive!
>>
>> Quote:
>> "It's roughly a 50-70% reduction of allocation costs and roughly a halving of the
>> overall cost of allocating/freeing batches of pages."
>>
>> Who have time to revive this patchset?
>>
>
> Not in the short term due to bug load and other obligations.
>
> The original series had "mm, page_allocator: Only use per-cpu allocator
> for irq-safe requests" but that was ultimately rejected because softirqs
> were affected so it would have to be done without that patch.
>
> The last patch can be rebased easily enough but it only batch allocates
> order-0 pages. It's also only build tested and could be completely
> miserable in practice and as I didn't even try boot test let, let alone
> actually test it, it could be a giant pile of crap. To make high orders
> work, it would need significant reworking but if the API showed even
> partial benefit, it might motiviate someone to reimplement the bulk
> interfaces to perform better.
>
> Rebased diff, build tested only, might not even work

Thanks, Mel, for kicking off a forward port.

It compiles. I've added a patch to replace the page allocation loop
in svc_alloc_arg() with a call to alloc_pages_bulk().
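
Roughly, the patch does this (simplified sketch using the list-based
API from the diff below; "needed" stands for the count of empty
rq_pages[] slots and is computed by the surrounding code):

	LIST_HEAD(list);

	/* The bulk call may return fewer pages than requested */
	alloc_pages_bulk(GFP_KERNEL, 0, needed, &list);

	for (i = 0; i < pages; i++) {
		if (rqstp->rq_pages[i])
			continue;
		if (list_empty(&list))
			break;	/* sleep and retry, as before */
		rqstp->rq_pages[i] = list_first_entry(&list, struct page, lru);
		list_del(&rqstp->rq_pages[i]->lru);
	}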

The server system deadlocks pretty quickly with any NFS traffic. Based
on some initial debugging, it appears that a pcplist is getting corrupted
and this causes the list_del() in __rmqueue_pcplist() to fail during
a call to alloc_pages_bulk().


> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index 6e479e9c48ce..d1b586e5b4b8 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -511,6 +511,29 @@ __alloc_pages(gfp_t gfp_mask, unsigned int order, int preferred_nid)
> return __alloc_pages_nodemask(gfp_mask, order, preferred_nid, NULL);
> }
>
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, nodemask_t *nodemask,
> + unsigned long nr_pages, struct list_head *alloc_list);
> +
> +static inline unsigned long
> +__alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, unsigned long nr_pages,
> + struct list_head *list)
> +{
> + return __alloc_pages_bulk_nodemask(gfp_mask, order, zonelist, NULL,
> + nr_pages, list);
> +}
> +
> +static inline unsigned long
> +alloc_pages_bulk(gfp_t gfp_mask, unsigned int order,
> + unsigned long nr_pages, struct list_head *list)
> +{
> + int nid = numa_mem_id();
> + return __alloc_pages_bulk(gfp_mask, order,
> + node_zonelist(nid, gfp_mask), nr_pages, list);
> +}
> +
> /*
> * Allocate pages, preferring the node given as nid. The node must be valid and
> * online. For more general interface, see alloc_pages_node().
> @@ -580,6 +603,7 @@ void * __meminit alloc_pages_exact_nid(int nid, size_t size, gfp_t gfp_mask);
>
> extern void __free_pages(struct page *page, unsigned int order);
> extern void free_pages(unsigned long addr, unsigned int order);
> +extern void free_pages_bulk(struct list_head *list);
>
> struct page_frag_cache;
> extern void __page_frag_cache_drain(struct page *page, unsigned int count);
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 519a60d5b6f7..f8353ea7b977 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3254,7 +3254,7 @@ void free_unref_page(struct page *page)
> }
>
> /*
> - * Free a list of 0-order pages
> + * Free a list of 0-order pages whose reference count is already zero.
> */
> void free_unref_page_list(struct list_head *list)
> {
> @@ -4435,6 +4435,21 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
> }
> }
>
> +/* Drop reference counts and free pages from a list */
> +void free_pages_bulk(struct list_head *list)
> +{
> + struct page *page, *next;
> +
> + list_for_each_entry_safe(page, next, list, lru) {
> + trace_mm_page_free_batched(page);
> + if (put_page_testzero(page)) {
> + list_del(&page->lru);
> + __free_pages_ok(page, 0, FPI_NONE);
> + }
> + }
> +}
> +EXPORT_SYMBOL_GPL(free_pages_bulk);
> +
> static inline unsigned int
> gfp_to_alloc_flags(gfp_t gfp_mask)
> {
> @@ -5818,6 +5833,99 @@ static int find_next_best_node(int node, nodemask_t *used_node_mask)
> }
>
>
> +/*
> + * This is a batched version of the page allocator that attempts to
> + * allocate nr_pages quickly from the preferred zone and add them to list.
> + * Note that there is no guarantee that nr_pages will be allocated although
> + * every effort will be made to allocate at least one. Unlike the core
> + * allocator, no special effort is made to recover from transient
> + * failures caused by changes in cpusets. It should only be used from !IRQ
> + * context. An attempt to allocate a batch of pages from an interrupt
> + * will allocate a single page.
> + */
> +unsigned long
> +__alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> + struct zonelist *zonelist, nodemask_t *nodemask,
> + unsigned long nr_pages, struct list_head *alloc_list)
> +{
> + struct page *page;
> + unsigned long alloced = 0;
> + unsigned int alloc_flags = ALLOC_WMARK_LOW;
> + unsigned long flags;
> + struct zone *zone;
> + struct per_cpu_pages *pcp;
> + struct list_head *pcp_list;
> + int migratetype;
> + gfp_t alloc_mask = gfp_mask; /* The gfp_t that was actually used for allocation */
> + struct alloc_context ac = { };
> +
> + /* If there are already pages on the list, don't bother */
> + if (!list_empty(alloc_list))
> + return 0;
> +
> + /* Only order-0 pages can go through the per-cpu lists */
> + if (order)
> + goto failed;
> +
> + gfp_mask &= gfp_allowed_mask;
> +
> + if (!prepare_alloc_pages(gfp_mask, order, numa_mem_id(), nodemask, &ac, &alloc_mask, &alloc_flags))
> + return 0;
> +
> + if (!ac.preferred_zoneref)
> + return 0;
> +
> + /*
> + * Only attempt a batch allocation if watermarks on the preferred zone
> + * are safe.
> + */
> + zone = ac.preferred_zoneref->zone;
> + if (!zone_watermark_fast(zone, order, high_wmark_pages(zone) + nr_pages,
> + zonelist_zone_idx(ac.preferred_zoneref), alloc_flags, gfp_mask))
> + goto failed;
> +
> + /* Attempt the batch allocation */
> + migratetype = ac.migratetype;
> +
> + local_irq_save(flags);
> + pcp = &this_cpu_ptr(zone->pageset)->pcp;
> + pcp_list = &pcp->lists[migratetype];
> +
> + while (nr_pages) {
> + page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> + pcp, pcp_list);
> + if (!page)
> + break;
> +
> + prep_new_page(page, order, gfp_mask, 0);
> + nr_pages--;
> + alloced++;
> + list_add(&page->lru, alloc_list);
> + }
> +
> + if (!alloced) {
> + preempt_enable_no_resched();
> + goto failed;
> + }
> +
> + __count_zid_vm_events(PGALLOC, zone_idx(zone), alloced);
> + zone_statistics(zone, zone);
> +
> + local_irq_restore(flags);
> +
> + return alloced;
> +
> +failed:
> + page = __alloc_pages_nodemask(gfp_mask, order, numa_node_id(), nodemask);
> + if (page) {
> + alloced++;
> + list_add(&page->lru, alloc_list);
> + }
> +
> + return alloced;
> +}
> +EXPORT_SYMBOL(__alloc_pages_bulk_nodemask);
> +
> /*
> * Build zonelists ordered by node and zones within node.
> * This results in maximum locality--normal zone overflows into local
>
> --
> Mel Gorman
> SUSE Labs

--
Chuck Lever



2021-02-11 09:28:12

by Mel Gorman

Subject: Re: alloc_pages_bulk()

On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote:
> > Not in the short term due to bug load and other obligations.
> >
> > The original series had "mm, page_allocator: Only use per-cpu allocator
> > for irq-safe requests" but that was ultimately rejected because softirqs
> > were affected so it would have to be done without that patch.
> >
> > The last patch can be rebased easily enough but it only batch allocates
> > order-0 pages. It's also only build tested and could be completely
> > miserable in practice and as I didn't even try boot test let, let alone
> > actually test it, it could be a giant pile of crap. To make high orders
> > work, it would need significant reworking but if the API showed even
> > partial benefit, it might motiviate someone to reimplement the bulk
> > interfaces to perform better.
> >
> > Rebased diff, build tested only, might not even work
>
> Thanks, Mel, for kicking off a forward port.
>
> It compiles. I've added a patch to replace the page allocation loop
> in svc_alloc_arg() with a call to alloc_pages_bulk().
>
> The server system deadlocks pretty quickly with any NFS traffic. Based
> on some initial debugging, it appears that a pcplist is getting corrupted
> and this causes the list_del() in __rmqueue_pcplist() to fail during a
> a call to alloc_pages_bulk().
>

Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
I'm surprised it didn't blow up in a spectacular fashion. Again, this
hasn't been near any testing and passing a list with high orders to
free_pages_bulk() will corrupt lists too. Mostly it's a curiosity to see
if there is justification for reworking the allocator to fundamentally
deal in batches and then feed batches to pcp lists and the bulk allocator
while leaving the normal GFP API as single page "batches". While that
would be ideal, it's relatively high risk for regressions. There is still
some scope for adding a basic bulk allocator before considering a major
refactoring effort.

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index f8353ea7b977..8f3fe7de2cf7 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
pcp_list = &pcp->lists[migratetype];

while (nr_pages) {
- page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
+ page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
pcp, pcp_list);
if (!page)
break;
--
Mel Gorman
SUSE Labs

2021-02-11 12:38:05

by Jesper Dangaard Brouer

Subject: Re: alloc_pages_bulk()

On Thu, 11 Feb 2021 09:12:35 +0000
Mel Gorman <[email protected]> wrote:

> On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote:
> > > Not in the short term due to bug load and other obligations.
> > >
> > > The original series had "mm, page_allocator: Only use per-cpu allocator
> > > for irq-safe requests" but that was ultimately rejected because softirqs
> > > were affected so it would have to be done without that patch.
> > >
> > > The last patch can be rebased easily enough but it only batch allocates
> > > order-0 pages. It's also only build tested and could be completely
> > > miserable in practice and as I didn't even try boot test let, let alone
> > > actually test it, it could be a giant pile of crap. To make high orders
> > > work, it would need significant reworking but if the API showed even
> > > partial benefit, it might motiviate someone to reimplement the bulk
> > > interfaces to perform better.
> > >
> > > Rebased diff, build tested only, might not even work
> >
> > Thanks, Mel, for kicking off a forward port.
> >
> > It compiles. I've added a patch to replace the page allocation loop
> > in svc_alloc_arg() with a call to alloc_pages_bulk().
> >
> > The server system deadlocks pretty quickly with any NFS traffic. Based
> > on some initial debugging, it appears that a pcplist is getting corrupted
> > and this causes the list_del() in __rmqueue_pcplist() to fail during a
> > a call to alloc_pages_bulk().
> >
>
> Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> I'm surprised it didn't blow up in a spectacular fashion. Again, this
> hasn't been near any testing and passing a list with high orders to
> free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
> if there is justification for reworking the allocator to fundamentally
> deal in batches and then feed batches to pcp lists and the bulk allocator
> while leaving the normal GFP API as single page "batches". While that
> would be ideal, it's relatively high risk for regressions. There is still
> some scope for adding a basic bulk allocator before considering a major
> refactoring effort.

The alloc_flags reminds me that I have some asks around the semantics
of the API. I'm concerned about the latency impact on preemption. I
want us to avoid creating something that runs for too long with
IRQs/preempt disabled.

(With SLUB kmem_cache_free_bulk() we manage to run most of the time with
preempt and IRQs enabled. So, I'm not worried about large slab bulk
free. For SLUB kmem_cache_alloc_bulk() we run with local_irq_disable(),
so I always recommend users not to do excessive bulk-alloc.)

For this page bulk alloc API, I'm fine with limiting it to only support
order-0 pages. (This will also fit nicely with the PCP system, I think.)

I also suggest that the API be allowed to return fewer pages than
requested, because I want it to "exit"/return rather than go into an
expensive code path (like the buddy allocator or compaction). I'm
assuming we have a flag to give us this behavior (via gfp_flags or
alloc_flags)?

My use-case is in page_pool, where I can easily handle not getting the
exact number of pages, and I want to handle low-latency network traffic.
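
I.e. on the caller side, something along these lines (pure sketch;
"budget" is just whatever the pool wants per refill cycle):

	LIST_HEAD(list);
	unsigned long nr;

	nr = alloc_pages_bulk(GFP_ATOMIC | __GFP_NOWARN, 0, budget, &list);

	/* 'nr' may be anything from 0 to budget. Don't insist on the
	 * rest here: use what came back cheaply and let recycling (or
	 * the next refill cycle) cover the shortfall, rather than
	 * pushing this call into the buddy allocator or compaction.
	 */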



> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f8353ea7b977..8f3fe7de2cf7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> pcp_list = &pcp->lists[migratetype];
>
> while (nr_pages) {
> - page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> + page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
> pcp, pcp_list);
> if (!page)
> break;



--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-02-11 16:27:16

by Chuck Lever III

Subject: Re: alloc_pages_bulk()



> On Feb 11, 2021, at 4:12 AM, Mel Gorman <[email protected]> wrote:
>
> On Wed, Feb 10, 2021 at 10:58:37PM +0000, Chuck Lever wrote:
>>> Not in the short term due to bug load and other obligations.
>>>
>>> The original series had "mm, page_allocator: Only use per-cpu allocator
>>> for irq-safe requests" but that was ultimately rejected because softirqs
>>> were affected so it would have to be done without that patch.
>>>
>>> The last patch can be rebased easily enough but it only batch allocates
>>> order-0 pages. It's also only build tested and could be completely
>>> miserable in practice and as I didn't even try boot test let, let alone
>>> actually test it, it could be a giant pile of crap. To make high orders
>>> work, it would need significant reworking but if the API showed even
>>> partial benefit, it might motiviate someone to reimplement the bulk
>>> interfaces to perform better.
>>>
>>> Rebased diff, build tested only, might not even work
>>
>> Thanks, Mel, for kicking off a forward port.
>>
>> It compiles. I've added a patch to replace the page allocation loop
>> in svc_alloc_arg() with a call to alloc_pages_bulk().
>>
>> The server system deadlocks pretty quickly with any NFS traffic. Based
>> on some initial debugging, it appears that a pcplist is getting corrupted
>> and this causes the list_del() in __rmqueue_pcplist() to fail during a
>> a call to alloc_pages_bulk().
>>
>
> Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> I'm surprised it didn't blow up in a spectacular fashion. Again, this
> hasn't been near any testing and passing a list with high orders to
> free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
> if there is justification for reworking the allocator to fundamentally
> deal in batches and then feed batches to pcp lists and the bulk allocator
> while leaving the normal GFP API as single page "batches". While that
> would be ideal, it's relatively high risk for regressions. There is still
> some scope for adding a basic bulk allocator before considering a major
> refactoring effort.
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index f8353ea7b977..8f3fe7de2cf7 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> pcp_list = &pcp->lists[migratetype];
>
> while (nr_pages) {
> - page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> + page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
> pcp, pcp_list);
> if (!page)
> break;

The NFS server is considerably more stable now. Thank you!

I confirmed that my patch is requesting and getting multiple pages.
The new NFSD code and the API seem to be working as expected.

The results are stunning. Each svc_alloc_arg() call here allocates
65 pages to satisfy a 256KB NFS READ request.

Before:

nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg();
nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg();
nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg();
nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg();
nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg();
nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg();

After:

nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg();
nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg();
nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg();
nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg();
nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg();
nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg();
nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg();


There appears to be little cost change for single-page allocations
using the bulk allocator (nr_pages=1):

Before:

nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg();
nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg();
nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg();
nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg();
nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg();
nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg();

After:

nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg();
nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg();
nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg();
nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg();
nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg();
nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg();


--
Chuck Lever



2021-02-15 12:04:28

by Mel Gorman

Subject: Re: alloc_pages_bulk()

On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
> > Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> > I'm surprised it didn't blow up in a spectacular fashion. Again, this
> > hasn't been near any testing and passing a list with high orders to
> > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
> > if there is justification for reworking the allocator to fundamentally
> > deal in batches and then feed batches to pcp lists and the bulk allocator
> > while leaving the normal GFP API as single page "batches". While that
> > would be ideal, it's relatively high risk for regressions. There is still
> > some scope for adding a basic bulk allocator before considering a major
> > refactoring effort.
>
> The alloc_flags reminds me that I have some asks around the semantics
> of the API. I'm concerned about the latency impact on preemption. I
> want us to avoid creating something that runs for too long with
> IRQs/preempt disabled.
>
> (With SLUB kmem_cache_free_bulk() we manage to run most of the time with
> preempt and IRQs enabled. So, I'm not worried about large slab bulk
> free. For SLUB kmem_cache_alloc_bulk() we run with local_irq_disable(),
> so I always recommend users not to do excessive bulk-alloc.)
>

The length of time IRQs/preempt are disabled is partially related to the
bulk API, but the deeper concern is how long IRQs are disabled within
the page allocator in general. Sometimes they are disabled for list
manipulations, but the duration is longer than it has to be in order to
cover per-cpu stat updates, and that may not be necessary for all arches
and configurations. Some may need IRQs disabled, but others may only need
preempt disabled for the counters. That's a more serious issue overall
than just the bulk allocator.

> For this page bulk alloc API, I'm fine with limiting it to only support
> order-0 pages. (This will also fit nicely with the PCP system it think).
>

Ok.

> I also suggest the API can return less pages than requested. Because I
> want to to "exit"/return if it need to go into an expensive code path
> (like buddy allocator or compaction). I'm assuming we have a flags to
> give us this behavior (via gfp_flags or alloc_flags)?
>

The API returns the number of pages returned on a list so policies
around how aggressive it should be allocating the requested number of
pages could be adjusted without changing the API. Passing in policy
requests via gfp_flags may be problematic as most (all?) bits are
already used.

--
Mel Gorman
SUSE Labs

2021-02-15 12:08:50

by Mel Gorman

Subject: Re: alloc_pages_bulk()

On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote:
> > On Feb 11, 2021, at 4:12 AM, Mel Gorman <[email protected]> wrote:
> >
> > <SNIP>
> >
> > Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> > I'm surprised it didn't blow up in a spectacular fashion. Again, this
> > hasn't been near any testing and passing a list with high orders to
> > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
> > if there is justification for reworking the allocator to fundamentally
> > deal in batches and then feed batches to pcp lists and the bulk allocator
> > while leaving the normal GFP API as single page "batches". While that
> > would be ideal, it's relatively high risk for regressions. There is still
> > some scope for adding a basic bulk allocator before considering a major
> > refactoring effort.
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index f8353ea7b977..8f3fe7de2cf7 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> > pcp_list = &pcp->lists[migratetype];
> >
> > while (nr_pages) {
> > - page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> > + page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
> > pcp, pcp_list);
> > if (!page)
> > break;
>
> The NFS server is considerably more stable now. Thank you!
>

Thanks for testing!

> I confirmed that my patch is requesting and getting multiple pages.
> The new NFSD code and the API seem to be working as expected.
>
> The results are stunning. Each svc_alloc_arg() call here allocates
> 65 pages to satisfy a 256KB NFS READ request.
>
> Before:
>
> nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg();
> nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg();
> nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg();
> nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg();
> nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg();
> nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg();
>
> After:
>
> nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg();
> nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg();
> nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg();
> nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg();
> nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg();
> nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg();
> nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg();
>

Uhhhh, that is much better than I expected given how lame the
implementation is. Sure -- it works, but it has more overhead than it
should with the downside that reducing it requires fairly deep surgery. It
may be enough to tidy this up to handle order-0 pages only to start with
and see how far it gets. That's a fairly trivial modification.

> There appears to be little cost change for single-page allocations
> using the bulk allocator (nr_pages=1):
>
> Before:
>
> nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg();
> nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg();
> nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg();
> nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg();
> nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg();
> nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg();
>
> After:
>
> nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg();
> nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg();
> nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg();
> nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg();
> nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg();
> nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg();
>

That is not too surprising given that there would be some additional
overhead to manage a list of 1 page. I would hope that users of the bulk
allocator are not routinely calling it with nr_pages == 1.

--
Mel Gorman
SUSE Labs

2021-02-15 16:12:02

by Chuck Lever III

Subject: Re: alloc_pages_bulk()



> On Feb 15, 2021, at 7:06 AM, Mel Gorman <[email protected]> wrote:
>
> On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote:
>>> On Feb 11, 2021, at 4:12 AM, Mel Gorman <[email protected]> wrote:
>>>
>>> <SNIP>
>>>
>>> Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
>>> I'm surprised it didn't blow up in a spectacular fashion. Again, this
>>> hasn't been near any testing and passing a list with high orders to
>>> free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
>>> if there is justification for reworking the allocator to fundamentally
>>> deal in batches and then feed batches to pcp lists and the bulk allocator
>>> while leaving the normal GFP API as single page "batches". While that
>>> would be ideal, it's relatively high risk for regressions. There is still
>>> some scope for adding a basic bulk allocator before considering a major
>>> refactoring effort.
>>>
>>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>>> index f8353ea7b977..8f3fe7de2cf7 100644
>>> --- a/mm/page_alloc.c
>>> +++ b/mm/page_alloc.c
>>> @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
>>> pcp_list = &pcp->lists[migratetype];
>>>
>>> while (nr_pages) {
>>> - page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
>>> + page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
>>> pcp, pcp_list);
>>> if (!page)
>>> break;
>>
>> The NFS server is considerably more stable now. Thank you!
>>
>
> Thanks for testing!
>
>> I confirmed that my patch is requesting and getting multiple pages.
>> The new NFSD code and the API seem to be working as expected.
>>
>> The results are stunning. Each svc_alloc_arg() call here allocates
>> 65 pages to satisfy a 256KB NFS READ request.
>>
>> Before:
>>
>> nfsd-972 [000] 584.513817: funcgraph_entry: + 35.385 us | svc_alloc_arg();
>> nfsd-979 [002] 584.513870: funcgraph_entry: + 29.051 us | svc_alloc_arg();
>> nfsd-980 [001] 584.513951: funcgraph_entry: + 29.178 us | svc_alloc_arg();
>> nfsd-983 [000] 584.514014: funcgraph_entry: + 29.211 us | svc_alloc_arg();
>> nfsd-976 [002] 584.514059: funcgraph_entry: + 29.315 us | svc_alloc_arg();
>> nfsd-974 [001] 584.514127: funcgraph_entry: + 29.237 us | svc_alloc_arg();
>>
>> After:
>>
>> nfsd-977 [002] 87.049425: funcgraph_entry: 4.293 us | svc_alloc_arg();
>> nfsd-981 [000] 87.049478: funcgraph_entry: 4.059 us | svc_alloc_arg();
>> nfsd-988 [001] 87.049549: funcgraph_entry: 4.474 us | svc_alloc_arg();
>> nfsd-983 [003] 87.049612: funcgraph_entry: 3.819 us | svc_alloc_arg();
>> nfsd-976 [000] 87.049619: funcgraph_entry: 3.869 us | svc_alloc_arg();
>> nfsd-980 [002] 87.049738: funcgraph_entry: 4.124 us | svc_alloc_arg();
>> nfsd-975 [000] 87.049769: funcgraph_entry: 3.734 us | svc_alloc_arg();
>>
>
> Uhhhh, that is much better than I expected given how lame the
> implementation is.

My experience with function tracing is that the entry and exit
timestamping adds significant overhead. I'd bet the actual
timing improvement is still good, but much smaller.


> Sure -- it works, but it has more overhead than it
> should with the downside that reducing it requires fairly deep surgery. It
> may be enough to tidy this up to handle order-0 pages only to start with
> and see how far it gets. That's a fairly trivial modification.

I'd like to see an "order-0 only" implementation go in soon.
The improvement is palpable and there are several worthy
consumers on deck.


>> There appears to be little cost change for single-page allocations
>> using the bulk allocator (nr_pages=1):
>>
>> Before:
>>
>> nfsd-985 [003] 572.324517: funcgraph_entry: 0.332 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324531: funcgraph_entry: 0.311 us | svc_alloc_arg();
>> nfsd-985 [003] 572.324701: funcgraph_entry: 0.311 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324727: funcgraph_entry: 0.424 us | svc_alloc_arg();
>> nfsd-985 [003] 572.324760: funcgraph_entry: 0.332 us | svc_alloc_arg();
>> nfsd-986 [001] 572.324786: funcgraph_entry: 0.390 us | svc_alloc_arg();
>>
>> After:
>>
>> nfsd-989 [002] 75.043226: funcgraph_entry: 0.322 us | svc_alloc_arg();
>> nfsd-988 [001] 75.043436: funcgraph_entry: 0.368 us | svc_alloc_arg();
>> nfsd-989 [002] 75.043464: funcgraph_entry: 0.424 us | svc_alloc_arg();
>> nfsd-988 [001] 75.043490: funcgraph_entry: 0.317 us | svc_alloc_arg();
>> nfsd-989 [002] 75.043517: funcgraph_entry: 0.425 us | svc_alloc_arg();
>> nfsd-988 [001] 75.050025: funcgraph_entry: 0.407 us | svc_alloc_arg();
>>
>
> That is not too surprising given that there would be some additional
> overhead to manage a list of 1 page. I would hope that users of the bulk
> allocator are not routinely calling it with nr_pages == 1.

The NFSD implementation I did uses alloc_pages_bulk() to fill however
many pages are needed. Often that's just one page.

Sometimes it's zero pages. alloc_pages_bulk() does not behave very
well when asked for zero pages, so NFSD avoids calling it in that case.
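
I.e. the call site guards it, roughly (illustrative only, not the exact
patch; "needed" is the number of empty rq_pages[] slots):

	LIST_HEAD(pages_list);

	if (needed) {
		/* Only call the bulk allocator when there is real work;
		 * asking it for zero pages is not handled gracefully.
		 */
		alloc_pages_bulk(GFP_KERNEL, 0, needed, &pages_list);
	}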

I can post the patch for review. I cleaned it up recently but haven't
had a chance to test the clean-ups, so it might not work in its
current state.

--
Chuck Lever



2021-02-15 16:17:08

by Jesper Dangaard Brouer

Subject: Re: alloc_pages_bulk()


On Mon, 15 Feb 2021 12:00:56 +0000
Mel Gorman <[email protected]> wrote:

> On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
[...]
>
> > I also suggest the API can return less pages than requested. Because I
> > want to to "exit"/return if it need to go into an expensive code path
> > (like buddy allocator or compaction). I'm assuming we have a flags to
> > give us this behavior (via gfp_flags or alloc_flags)?
> >
>
> The API returns the number of pages returned on a list so policies
> around how aggressive it should be allocating the requested number of
> pages could be adjusted without changing the API. Passing in policy
> requests via gfp_flags may be problematic as most (all?) bits are
> already used.

Well, I was just thinking that I would use GFP_ATOMIC instead of
GFP_KERNEL to "communicate" that I don't want this call to take too
long (like sleeping). I'm not requesting any fancy policy :-)


For the page_pool use case we use the (GFP_ATOMIC | __GFP_NOWARN) flags:

static inline struct page *page_pool_dev_alloc_pages(struct page_pool *pool)
{
gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);

return page_pool_alloc_pages(pool, gfp);
}

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-02-22 09:45:58

by Mel Gorman

[permalink] [raw]
Subject: Re: alloc_pages_bulk()

On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote:
>
> On Mon, 15 Feb 2021 12:00:56 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
> [...]
> >
> > > I also suggest the API can return fewer pages than requested, because
> > > I want to "exit"/return if it needs to go into an expensive code path
> > > (like the buddy allocator or compaction). I'm assuming we have a flag
> > > to give us this behavior (via gfp_flags or alloc_flags)?
> > >
> >
> > The API returns the number of pages returned on a list so policies
> > around how aggressive it should be allocating the requested number of
> > pages could be adjusted without changing the API. Passing in policy
> > requests via gfp_flags may be problematic as most (all?) bits are
> > already used.
>
> Well, I was just thinking that I would use GFP_ATOMIC instead of
> GFP_KERNEL to "communicate" that I don't want this call to take too
> long (like sleeping). I'm not requesting any fancy policy :-)
>

The NFS use case requires opposite semantics
-- it really needs those allocations to succeed
https://lore.kernel.org/r/[email protected].
I've asked what code it's based on as it's not 5.11 and I'll iron that
out first.

Then it might be clearer what the "can fail" semantics should look like.
I think it would be best to have pairs of patches where the first patch
adjusts the semantics of the bulk allocator and the second adds a user.
That will limit the amount of code carried in the implementation.
When the initial users are in place then the implementation can be
optimised as the optimisations will require significant refactoring and
I do not want to refactor multiple times.

--
Mel Gorman
SUSE Labs

2021-02-22 11:45:21

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: alloc_pages_bulk()

On Mon, 22 Feb 2021 09:42:56 +0000
Mel Gorman <[email protected]> wrote:

> On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote:
> >
> > On Mon, 15 Feb 2021 12:00:56 +0000
> > Mel Gorman <[email protected]> wrote:
> >
> > > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
> > [...]
> > >
> > > > I also suggest the API can return fewer pages than requested, because
> > > > I want to "exit"/return if it needs to go into an expensive code path
> > > > (like the buddy allocator or compaction). I'm assuming we have a flag
> > > > to give us this behavior (via gfp_flags or alloc_flags)?
> > > >
> > >
> > > The API returns the number of pages returned on a list so policies
> > > around how aggressive it should be allocating the requested number of
> > > pages could be adjusted without changing the API. Passing in policy
> > > requests via gfp_flags may be problematic as most (all?) bits are
> > > already used.
> >
> > Well, I was just thinking that I would use GFP_ATOMIC instead of
> > GFP_KERNEL to "communicate" that I don't want this call to take too
> > long (like sleeping). I'm not requesting any fancy policy :-)
> >
>
> The NFS use case requires opposite semantics
> -- it really needs those allocations to succeed
> https://lore.kernel.org/r/[email protected].

Sorry, but that is not how I understand the code.

The code is doing exactly what I'm requesting. If alloc_pages_bulk()
doesn't return the expected number of pages, it checks whether others
need to run. The old code did schedule_timeout(msecs_to_jiffies(500)),
while Chuck's patch changes this to cond_resched(). Thus, it tries to
avoid blocking the CPU for too long (when allocating many pages).

And the nfsd code checks signal_pending(current) and returns -EINTR
when interrupted. Thus, the nfsd code seems to be able to handle the
page allocations failing.
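
A rough sketch of that pattern (the alloc_pages_bulk() signature and
the helper name are assumptions, not the actual nfsd patch): keep
asking the bulk allocator for the remaining pages, yield the CPU when
it comes up short, and bail out if a signal is pending.

static int fill_pages(struct page **pages, unsigned int total)
{
	LIST_HEAD(list);
	struct page *page, *tmp;
	unsigned int filled = 0;

	while (filled < total) {
		alloc_pages_bulk(GFP_KERNEL, total - filled, &list);

		/* Move whatever we got off the list into the array. */
		list_for_each_entry_safe(page, tmp, &list, lru) {
			list_del(&page->lru);
			pages[filled++] = page;
		}

		if (filled < total) {
			if (signal_pending(current))
				return -EINTR;
			/* Let others run instead of sleeping 500ms. */
			cond_resched();
		}
	}
	return 0;
}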


> I've asked what code it's based on as it's not 5.11 and I'll iron that
> out first.
>
> Then it might be clearer what the "can fail" semantics should look like.
> I think it would be best to have pairs of patches where the first patch
> adjusts the semantics of the bulk allocator and the second adds a user.
> That will limit the amount of code carried in the implementation.
> When the initial users are in place then the implementation can be
> optimised as the optimisations will require significant refactoring and
> I do not want to refactor multiple times.

I guess I should try to code up the usage in page_pool.

What is the latest patch for adding alloc_pages_bulk()?

The nfsd code (svc_alloc_arg) is called in a context where it can
sleep, and can thus use GFP_KERNEL. In most cases the page_pool will
be called with GFP_ATOMIC. I don't think page_pool will retry the
call the way Chuck's code does, as it cannot (re)schedule others to
run.
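
A rough sketch of what that page_pool usage might look like; the
alloc_pages_bulk() signature and the refill helper are assumptions,
and the pool-side bookkeeping is elided:

static unsigned int page_pool_refill_bulk(struct page_pool *pool,
					  unsigned int nr)
{
	gfp_t gfp = (GFP_ATOMIC | __GFP_NOWARN);
	LIST_HEAD(list);
	struct page *page, *tmp;
	unsigned int refilled = 0;

	/* One shot: take whatever the bulk allocator hands back. */
	alloc_pages_bulk(gfp, nr, &list);

	list_for_each_entry_safe(page, tmp, &list, lru) {
		list_del(&page->lru);
		/* ... stash @page in the pool's alloc cache ... */
		refilled++;
	}
	return refilled;	/* may be fewer than @nr; no retry */
}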

--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer

2021-02-22 14:11:31

by Mel Gorman

[permalink] [raw]
Subject: Re: alloc_pages_bulk()

On Mon, Feb 22, 2021 at 12:42:46PM +0100, Jesper Dangaard Brouer wrote:
> On Mon, 22 Feb 2021 09:42:56 +0000
> Mel Gorman <[email protected]> wrote:
>
> > On Mon, Feb 15, 2021 at 05:10:38PM +0100, Jesper Dangaard Brouer wrote:
> > >
> > > On Mon, 15 Feb 2021 12:00:56 +0000
> > > Mel Gorman <[email protected]> wrote:
> > >
> > > > On Thu, Feb 11, 2021 at 01:26:28PM +0100, Jesper Dangaard Brouer wrote:
> > > [...]
> > > >
> > > > > I also suggest the API can return fewer pages than requested, because
> > > > > I want to "exit"/return if it needs to go into an expensive code path
> > > > > (like the buddy allocator or compaction). I'm assuming we have a flag
> > > > > to give us this behavior (via gfp_flags or alloc_flags)?
> > > > >
> > > >
> > > > The API returns the number of pages returned on a list so policies
> > > > around how aggressive it should be allocating the requested number of
> > > > pages could be adjusted without changing the API. Passing in policy
> > > > requests via gfp_flags may be problematic as most (all?) bits are
> > > > already used.
> > >
> > > Well, I was just thinking that I would use GFP_ATOMIC instead of
> > > GFP_KERNEL to "communicate" that I don't want this call to take too
> > > long (like sleeping). I'm not requesting any fancy policy :-)
> > >
> >
> > The NFS use case requires opposite semantics
> > -- it really needs those allocations to succeed
> > https://lore.kernel.org/r/[email protected].
>
> Sorry, but that is not how I understand the code.
>
> The code is doing exactly what I'm requesting. If alloc_pages_bulk()
> doesn't return the expected number of pages, it checks whether others
> need to run. The old code did schedule_timeout(msecs_to_jiffies(500)),
> while Chuck's patch changes this to cond_resched(). Thus, it tries to
> avoid blocking the CPU for too long (when allocating many pages).
>
> And the nfsd code checks signal_pending(current) and returns -EINTR
> when interrupted. Thus, the nfsd code seems to be able to handle the
> page allocations failing.
>

I'm waiting to find out exactly what NFSD is currently doing, as the
code in 5.11 is not the same as what Chuck was coding against, so I'm
not 100% certain how it currently works.

>
> > I've asked what code it's based on as it's not 5.11 and I'll iron that
> > out first.
> >
> > Then it might be clearer what the "can fail" semantics should look like.
> > I think it would be best to have pairs of patches where the first patch
> > adjusts the semantics of the bulk allocator and the second adds a user.
> > That will limit the amount of code code carried in the implementation.
> > When the initial users are in place then the implementation can be
> > optimised as the optimisations will require significant refactoring and
> > I not want to refactor multiple times.
>
> I guess, I should try to code-up the usage in page_pool.
>
> What is the latest patch for adding alloc_pages_bulk() ?
>

There isn't a usable latest version until I reconcile the nfsd caller.
The only major change in the API right now is dropping order. It handles
order-0 only.
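
As a sketch of the shape being discussed (an assumption, not the final
merged API), an order-0-only entry point could be as simple as:

/*
 * Assumed prototype only: order-0 pages, @nr_pages requested, pages
 * returned on @list, return value is how many were actually
 * allocated (possibly fewer than @nr_pages).
 */
unsigned long alloc_pages_bulk(gfp_t gfp, unsigned long nr_pages,
			       struct list_head *list);

Callers that need more pages can loop or fall back to alloc_page(),
and callers that cannot sleep pass GFP_ATOMIC, as discussed above.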

--
Mel Gorman
SUSE Labs

2021-02-22 21:31:00

by Jesper Dangaard Brouer

[permalink] [raw]
Subject: Re: alloc_pages_bulk()

On Mon, 15 Feb 2021 12:06:09 +0000
Mel Gorman <[email protected]> wrote:

> On Thu, Feb 11, 2021 at 04:20:31PM +0000, Chuck Lever wrote:
> > > On Feb 11, 2021, at 4:12 AM, Mel Gorman <[email protected]> wrote:
> > >
> > > <SNIP>
> > >
> > > Parameters to __rmqueue_pcplist are garbage as the parameter order changed.
> > > I'm surprised it didn't blow up in a spectacular fashion. Again, this
> > > hasn't been near any testing and passing a list with high orders to
> > > free_pages_bulk() will corrupt lists too. Mostly it's a curiousity to see
> > > if there is justification for reworking the allocator to fundamentally
> > > deal in batches and then feed batches to pcp lists and the bulk allocator
> > > while leaving the normal GFP API as single page "batches". While that
> > > would be ideal, it's relatively high risk for regressions. There is still
> > > some scope for adding a basic bulk allocator before considering a major
> > > refactoring effort.
> > >
> > > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > > index f8353ea7b977..8f3fe7de2cf7 100644
> > > --- a/mm/page_alloc.c
> > > +++ b/mm/page_alloc.c
> > > @@ -5892,7 +5892,7 @@ __alloc_pages_bulk_nodemask(gfp_t gfp_mask, unsigned int order,
> > > pcp_list = &pcp->lists[migratetype];
> > >
> > > while (nr_pages) {
> > > - page = __rmqueue_pcplist(zone, gfp_mask, migratetype,
> > > + page = __rmqueue_pcplist(zone, migratetype, alloc_flags,
> > > pcp, pcp_list);
> > > if (!page)
> > > break;
> >
> > The NFS server is considerably more stable now. Thank you!
> >
>
> Thanks for testing!

I've done some testing here:
https://github.com/xdp-project/xdp-project/blob/master/areas/mem/page_pool06_alloc_pages_bulk.org

Performance summary:
 - Before: 3,677,958 pps
 - After : 4,066,028 pps (+388,070 pps, ~10.6%)

I'll describe/show the page_pool changes tomorrow.
--
Best regards,
Jesper Dangaard Brouer
MSc.CS, Principal Kernel Engineer at Red Hat
LinkedIn: http://www.linkedin.com/in/brouer