Using bulk alloc/free on a cache with percpu array should not be
necessary and the bulk alloc actually bypasses the array (the prefill
functionality currently relies on this).
The simplest change is just to convert the respective maple tree
wrappers to do a loop of normal alloc/free.
---
lib/maple_tree.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 1196d0a17f03..7a8e7c467d7c 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
{
- return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
+ int allocated = 0;
+ for (size_t i = 0; i < size; i++) {
+ nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
+ if (nodes[i])
+ allocated++;
+ }
+ return allocated;
}
static inline void mt_free_bulk(size_t size, void __rcu **nodes)
{
- kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
+ for (size_t i = 0; i < size; i++)
+ kmem_cache_free(maple_node_cache, nodes[i]);
}
static void mt_free_rcu(struct rcu_head *head)
--
2.41.0
On 2023/8/8 17:53, Vlastimil Babka wrote:
> Using bulk alloc/free on a cache with percpu array should not be
> necessary and the bulk alloc actually bypasses the array (the prefill
> functionality currently relies on this).
>
> The simplest change is just to convert the respective maple tree
> wrappers to do a loop of normal alloc/free.
> ---
> lib/maple_tree.c | 11 +++++++++--
> 1 file changed, 9 insertions(+), 2 deletions(-)
>
> diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> index 1196d0a17f03..7a8e7c467d7c 100644
> --- a/lib/maple_tree.c
> +++ b/lib/maple_tree.c
> @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
>
> static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
> {
> - return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> + int allocated = 0;
> + for (size_t i = 0; i < size; i++) {
> + nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
> + if (nodes[i])
If the i-th allocation fails, nodes[i] will be NULL while later slots may
still be filled. This is wrong. We'd better guarantee that mt_alloc_bulk()
either allocates completely successfully or returns 0. The following case
is not allowed:
nodes: [addr1][addr2][NULL][addr3].
> + allocated++;
> + }
> + return allocated;
> }
>
> static inline void mt_free_bulk(size_t size, void __rcu **nodes)
> {
> - kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> + for (size_t i = 0; i < size; i++)
> + kmem_cache_free(maple_node_cache, nodes[i]);
> }
>
> static void mt_free_rcu(struct rcu_head *head)
On 8/8/23 16:29, Liam R. Howlett wrote:
> * Peng Zhang <[email protected]> [230808 07:17]:
>>
>>
>> On 2023/8/8 17:53, Vlastimil Babka wrote:
>> > Using bulk alloc/free on a cache with percpu array should not be
>> > necessary and the bulk alloc actually bypasses the array (the prefill
>> > functionality currently relies on this).
>> >
>> > The simplest change is just to convert the respective maple tree
>> > wrappers to do a loop of normal alloc/free.
>> > ---
>> > lib/maple_tree.c | 11 +++++++++--
>> > 1 file changed, 9 insertions(+), 2 deletions(-)
>> >
>> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
>> > index 1196d0a17f03..7a8e7c467d7c 100644
>> > --- a/lib/maple_tree.c
>> > +++ b/lib/maple_tree.c
>> > @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
>> > static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
>> > {
>> > - return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
>> > + int allocated = 0;
>> > + for (size_t i = 0; i < size; i++) {
>> > + nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
>> > + if (nodes[i])
>> If the i-th allocation fails, nodes[i] will be NULL while later slots may
>> still be filled. This is wrong. We'd better guarantee that mt_alloc_bulk()
>> either allocates completely successfully or returns 0. The following case
>> is not allowed:
>> nodes: [addr1][addr2][NULL][addr3].
Thanks, indeed. I guess it should just break on failure and return
how many allocations succeeded so far.
But note this is really a quick RFC proof-of-concept hack. I'd expect that
if the whole idea is deemed good, the maple tree node handling could be
redesigned (simplified?) around it, and maybe there's no mt_alloc_bulk()
anymore as a result?
> Thanks for pointing this out Peng.
>
> We can handle a lower number than requested being returned, but we
> cannot handle the sparse data.
>
> The kmem_cache_alloc_bulk() can return a failure today, leaving the
> array to be cleaned up by the caller, so if this is changed to full
> success or full fail, then we will also have to change the caller to
> handle whatever state is returned if it differs from
> kmem_cache_alloc_bulk().
>
> It might be best to return the size already allocated when a failure is
> encountered. This will make the caller, mas_alloc_nodes(), request more
> nodes. Only in the case of zero allocations would this be seen as an
> OOM event.
>
> Vlastimil, is the first kmem_cache_alloc() call failing a possibility?
Sure, if there's no memory, it can fail. In practice if gfp is one that
allows reclaim, it will ultimately be the "too small to fail" allocation on
the page allocator level. But there are exceptions, like having received a
fatal signal, IIRC :)
> If so, what should be the corrective action?
Depends on your context, if you can pass on -ENOMEM to the caller, or need
to succeed.
>> > + allocated++;
>> > + }
>> > + return allocated;
>> > }
>> > static inline void mt_free_bulk(size_t size, void __rcu **nodes)
>> > {
>> > - kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
>> > + for (size_t i = 0; i < size; i++)
>> > + kmem_cache_free(maple_node_cache, nodes[i]);
>> > }
>> > static void mt_free_rcu(struct rcu_head *head)
* Peng Zhang <[email protected]> [230808 07:17]:
>
>
> On 2023/8/8 17:53, Vlastimil Babka wrote:
> > Using bulk alloc/free on a cache with percpu array should not be
> > necessary and the bulk alloc actually bypasses the array (the prefill
> > functionality currently relies on this).
> >
> > The simplest change is just to convert the respective maple tree
> > wrappers to do a loop of normal alloc/free.
> > ---
> > lib/maple_tree.c | 11 +++++++++--
> > 1 file changed, 9 insertions(+), 2 deletions(-)
> >
> > diff --git a/lib/maple_tree.c b/lib/maple_tree.c
> > index 1196d0a17f03..7a8e7c467d7c 100644
> > --- a/lib/maple_tree.c
> > +++ b/lib/maple_tree.c
> > @@ -161,12 +161,19 @@ static inline struct maple_node *mt_alloc_one(gfp_t gfp)
> > static inline int mt_alloc_bulk(gfp_t gfp, size_t size, void **nodes)
> > {
> > - return kmem_cache_alloc_bulk(maple_node_cache, gfp, size, nodes);
> > + int allocated = 0;
> > + for (size_t i = 0; i < size; i++) {
> > + nodes[i] = kmem_cache_alloc(maple_node_cache, gfp);
> > + if (nodes[i])
> If the i-th allocation fails, nodes[i] will be NULL while later slots may
> still be filled. This is wrong. We'd better guarantee that mt_alloc_bulk()
> either allocates completely successfully or returns 0. The following case
> is not allowed:
> nodes: [addr1][addr2][NULL][addr3].
Thanks for pointing this out Peng.
We can handle a lower number than requested being returned, but we
cannot handle the sparse data.
The kmem_cache_alloc_bulk() can return a failure today, leaving the
array to be cleaned up by the caller, so if this is changed to full
success or full fail, then we will also have to change the caller to
handle whatever state is returned if it differs from
kmem_cache_alloc_bulk().
It might be best to return the size already allocated when a failure is
encountered. This will make the caller, mas_alloc_nodes(), request more
nodes. Only in the case of zero allocations would this be seen as an
OOM event.
Vlastimil, is the first kmem_cache_alloc() call failing a possibility?
If so, what should be the corrective action?
> > + allocated++;
> > + }
> > + return allocated;
> > }
> > static inline void mt_free_bulk(size_t size, void __rcu **nodes)
> > {
> > - kmem_cache_free_bulk(maple_node_cache, size, (void **)nodes);
> > + for (size_t i = 0; i < size; i++)
> > + kmem_cache_free(maple_node_cache, nodes[i]);
> > }
> > static void mt_free_rcu(struct rcu_head *head)