2021-10-18 11:50:15

by Michal Hocko

Subject: [RFC 0/3] extend vmalloc support for constrained allocations

Hi,
based on a recent discussion with Dave and Neil [1] I have tried to
implement NOFS, NOIO and NOFAIL support for vmalloc to make the life
of kvmalloc users easier.

The requirement of NOFAIL support for kvmalloc was new to me, but it
seems to be really needed by the xfs code.

NOFS/NOIO has been a known long term problem which was hoped to be
handled by the scope API. Those scopes should have been used at the
reclaim recursion boundaries, both to document them and also to remove
the necessity of the NOFS/NOIO constraints for all allocations within
that scope. Instead, workarounds were developed to wrap a single
allocation (like ceph_kvmalloc).
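
For context, the intended use of the scope API versus the workaround
pattern looks roughly like this (a minimal sketch; ptr and size are
placeholders):

	/* intended: mark the whole reclaim recursion boundary once, e.g.
	 * around a lock that is also taken from the reclaim path, and
	 * use plain GFP_KERNEL inside */
	unsigned int flags = memalloc_nofs_save();
	/* ... take locks, allocate with GFP_KERNEL, etc. ... */
	memalloc_nofs_restore(flags);

	/* workaround this series makes unnecessary: wrapping a single
	 * allocation just to get a NOFS-capable kvmalloc */
	flags = memalloc_nofs_save();
	ptr = kvmalloc(size, GFP_KERNEL);
	memalloc_nofs_restore(flags);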

The first patch implements NOFS/NOIO support for vmalloc. The second
one adds NOFAIL support and the third one bundles it all together into
kvmalloc and drops ceph_kvmalloc, which can use kvmalloc directly now.

Please note that this is an RFC and I haven't done any testing on this yet.
I hope I haven't missed anything in the vmalloc allocator. It would be
really great if Christoph and Uladzislau could have a look.

Thanks!

[1] http://lkml.kernel.org/r/[email protected]



2021-10-18 11:50:30

by Michal Hocko

Subject: [RFC 3/3] mm: allow !GFP_KERNEL allocations for kvmalloc

From: Michal Hocko <[email protected]>

Support for GFP_NO{FS,IO} and __GFP_NOFAIL has been implemented by the
previous patches, so we can enable it for kvmalloc as well. This will
allow some external users to simplify or completely remove their
helpers.

The GFP_NOWAIT semantic has never been supported, but as this hasn't
been explicitly documented, let's add a note about that.

ceph_kvmalloc is the first helper to be dropped; its users are
converted to kvmalloc directly.
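
To illustrate the resulting call sites (a hedged sketch, not part of
the patch itself):

	/* was: ceph_kvmalloc(len, GFP_NOFS) */
	p = kvmalloc(len, GFP_NOFS);
	/* and allocations which must not fail can now use */
	q = kvmalloc(size, GFP_KERNEL | __GFP_NOFAIL);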

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/ceph/libceph.h | 1 -
mm/util.c | 15 ++++-----------
net/ceph/buffer.c | 4 ++--
net/ceph/ceph_common.c | 27 ---------------------------
net/ceph/crypto.c | 2 +-
net/ceph/messenger.c | 2 +-
net/ceph/messenger_v2.c | 2 +-
net/ceph/osdmap.c | 12 ++++++------
8 files changed, 15 insertions(+), 50 deletions(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 409d8c29bc4f..309acbcb5a8a 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -295,7 +295,6 @@ extern bool libceph_compatible(void *data);

extern const char *ceph_msg_type_name(int type);
extern int ceph_check_fsid(struct ceph_client *client, struct ceph_fsid *fsid);
-extern void *ceph_kvmalloc(size_t size, gfp_t flags);

struct fs_parameter;
struct fc_log;
diff --git a/mm/util.c b/mm/util.c
index bacabe446906..fdec6b4b1267 100644
--- a/mm/util.c
+++ b/mm/util.c
@@ -549,13 +549,10 @@ EXPORT_SYMBOL(vm_mmap);
* Uses kmalloc to get the memory but if the allocation fails then falls back
* to the vmalloc allocator. Use kvfree for freeing the memory.
*
- * Reclaim modifiers - __GFP_NORETRY and __GFP_NOFAIL are not supported.
+ * Reclaim modifiers - __GFP_NORETRY and GFP_NOWAIT are not supported.
* __GFP_RETRY_MAYFAIL is supported, and it should be used only if kmalloc is
* preferable to the vmalloc fallback, due to visible performance drawbacks.
*
- * Please note that any use of gfp flags outside of GFP_KERNEL is careful to not
- * fall back to vmalloc.
- *
* Return: pointer to the allocated memory of %NULL in case of failure
*/
void *kvmalloc_node(size_t size, gfp_t flags, int node)
@@ -563,13 +560,6 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)
gfp_t kmalloc_flags = flags;
void *ret;

- /*
- * vmalloc uses GFP_KERNEL for some internal allocations (e.g page tables)
- * so the given set of flags has to be compatible.
- */
- if ((flags & GFP_KERNEL) != GFP_KERNEL)
- return kmalloc_node(size, flags, node);
-
/*
* We want to attempt a large physically contiguous block first because
* it is less likely to fragment multiple larger blocks and therefore
@@ -582,6 +572,9 @@ void *kvmalloc_node(size_t size, gfp_t flags, int node)

if (!(kmalloc_flags & __GFP_RETRY_MAYFAIL))
kmalloc_flags |= __GFP_NORETRY;
+
+ /* nofail semantic is implemented by the vmalloc fallback */
+ kmalloc_flags &= ~__GFP_NOFAIL;
}

ret = kmalloc_node(size, kmalloc_flags, node);
diff --git a/net/ceph/buffer.c b/net/ceph/buffer.c
index 5622763ad402..7e51f128045d 100644
--- a/net/ceph/buffer.c
+++ b/net/ceph/buffer.c
@@ -7,7 +7,7 @@

#include <linux/ceph/buffer.h>
#include <linux/ceph/decode.h>
-#include <linux/ceph/libceph.h> /* for ceph_kvmalloc */
+#include <linux/ceph/libceph.h> /* for kvmalloc */

struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
{
@@ -17,7 +17,7 @@ struct ceph_buffer *ceph_buffer_new(size_t len, gfp_t gfp)
if (!b)
return NULL;

- b->vec.iov_base = ceph_kvmalloc(len, gfp);
+ b->vec.iov_base = kvmalloc(len, gfp);
if (!b->vec.iov_base) {
kfree(b);
return NULL;
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index 97d6ea763e32..9441b4a4912b 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -190,33 +190,6 @@ int ceph_compare_options(struct ceph_options *new_opt,
}
EXPORT_SYMBOL(ceph_compare_options);

-/*
- * kvmalloc() doesn't fall back to the vmalloc allocator unless flags are
- * compatible with (a superset of) GFP_KERNEL. This is because while the
- * actual pages are allocated with the specified flags, the page table pages
- * are always allocated with GFP_KERNEL.
- *
- * ceph_kvmalloc() may be called with GFP_KERNEL, GFP_NOFS or GFP_NOIO.
- */
-void *ceph_kvmalloc(size_t size, gfp_t flags)
-{
- void *p;
-
- if ((flags & (__GFP_IO | __GFP_FS)) == (__GFP_IO | __GFP_FS)) {
- p = kvmalloc(size, flags);
- } else if ((flags & (__GFP_IO | __GFP_FS)) == __GFP_IO) {
- unsigned int nofs_flag = memalloc_nofs_save();
- p = kvmalloc(size, GFP_KERNEL);
- memalloc_nofs_restore(nofs_flag);
- } else {
- unsigned int noio_flag = memalloc_noio_save();
- p = kvmalloc(size, GFP_KERNEL);
- memalloc_noio_restore(noio_flag);
- }
-
- return p;
-}
-
static int parse_fsid(const char *str, struct ceph_fsid *fsid)
{
int i = 0;
diff --git a/net/ceph/crypto.c b/net/ceph/crypto.c
index 92d89b331645..051d22c0e4ad 100644
--- a/net/ceph/crypto.c
+++ b/net/ceph/crypto.c
@@ -147,7 +147,7 @@ void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
static const u8 *aes_iv = (u8 *)CEPH_AES_IV;

/*
- * Should be used for buffers allocated with ceph_kvmalloc().
+ * Should be used for buffers allocated with kvmalloc().
* Currently these are encrypt out-buffer (ceph_buffer) and decrypt
* in-buffer (msg front).
*
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index 57d043b382ed..7b891be799d2 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -1920,7 +1920,7 @@ struct ceph_msg *ceph_msg_new2(int type, int front_len, int max_data_items,

/* front */
if (front_len) {
- m->front.iov_base = ceph_kvmalloc(front_len, flags);
+ m->front.iov_base = kvmalloc(front_len, flags);
if (m->front.iov_base == NULL) {
dout("ceph_msg_new can't allocate %d bytes\n",
front_len);
diff --git a/net/ceph/messenger_v2.c b/net/ceph/messenger_v2.c
index cc40ce4e02fb..c4099b641b38 100644
--- a/net/ceph/messenger_v2.c
+++ b/net/ceph/messenger_v2.c
@@ -308,7 +308,7 @@ static void *alloc_conn_buf(struct ceph_connection *con, int len)
if (WARN_ON(con->v2.conn_buf_cnt >= ARRAY_SIZE(con->v2.conn_bufs)))
return NULL;

- buf = ceph_kvmalloc(len, GFP_NOIO);
+ buf = kvmalloc(len, GFP_NOIO);
if (!buf)
return NULL;

diff --git a/net/ceph/osdmap.c b/net/ceph/osdmap.c
index 75b738083523..2823bb3cff55 100644
--- a/net/ceph/osdmap.c
+++ b/net/ceph/osdmap.c
@@ -980,7 +980,7 @@ static struct crush_work *alloc_workspace(const struct crush_map *c)
work_size = crush_work_size(c, CEPH_PG_MAX_SIZE);
dout("%s work_size %zu bytes\n", __func__, work_size);

- work = ceph_kvmalloc(work_size, GFP_NOIO);
+ work = kvmalloc(work_size, GFP_NOIO);
if (!work)
return NULL;

@@ -1190,9 +1190,9 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
if (max == map->max_osd)
return 0;

- state = ceph_kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
- weight = ceph_kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
- addr = ceph_kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
+ state = kvmalloc(array_size(max, sizeof(*state)), GFP_NOFS);
+ weight = kvmalloc(array_size(max, sizeof(*weight)), GFP_NOFS);
+ addr = kvmalloc(array_size(max, sizeof(*addr)), GFP_NOFS);
if (!state || !weight || !addr) {
kvfree(state);
kvfree(weight);
@@ -1222,7 +1222,7 @@ static int osdmap_set_max_osd(struct ceph_osdmap *map, u32 max)
if (map->osd_primary_affinity) {
u32 *affinity;

- affinity = ceph_kvmalloc(array_size(max, sizeof(*affinity)),
+ affinity = kvmalloc(array_size(max, sizeof(*affinity)),
GFP_NOFS);
if (!affinity)
return -ENOMEM;
@@ -1503,7 +1503,7 @@ static int set_primary_affinity(struct ceph_osdmap *map, int osd, u32 aff)
if (!map->osd_primary_affinity) {
int i;

- map->osd_primary_affinity = ceph_kvmalloc(
+ map->osd_primary_affinity = kvmalloc(
array_size(map->max_osd, sizeof(*map->osd_primary_affinity)),
GFP_NOFS);
if (!map->osd_primary_affinity)
--
2.30.2

2021-10-18 11:51:26

by Michal Hocko

Subject: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

From: Michal Hocko <[email protected]>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

A large part of the vmalloc implementation already complies with the
given gfp flags so there is no extra work to be done there. The area
and page table allocations are an exception to that. Implement a retry
loop for those.
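
From a caller's point of view the guarantee then becomes (an
illustrative sketch only; nr_bytes is a placeholder):

	/* may block indefinitely, but is guaranteed to return non-NULL */
	void *buf = __vmalloc(nr_bytes, GFP_KERNEL | __GFP_NOFAIL);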

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmalloc.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 7455c89598d3..3a5a178295d1 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
flags = memalloc_noio_save();

- ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+ do {
+ ret = vmap_pages_range(addr, addr + size, prot, area->pages,
page_shift);
+ } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));

if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
memalloc_nofs_restore(flags);
@@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, vm_struct allocation failed",
real_size);
+ if (gfp_mask && __GFP_NOFAIL)
+ goto again;
goto fail;
}

--
2.30.2

2021-10-18 11:52:18

by Michal Hocko

Subject: [RFC 1/3] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

From: Michal Hocko <[email protected]>

vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
page table allocations do not support an externally provided gfp mask
and perform GFP_KERNEL-like allocations.

For a few years we have had scope (memalloc_no{fs,io}_{save,restore})
APIs to enforce NOFS and NOIO constraints implicitly on all allocations
within the scope. There was a hope that those scopes would be defined
on a higher level where the reclaim recursion boundary starts/stops
(e.g. when a lock required during memory reclaim is taken etc.). It
seems that not all NOFS/NOIO users have adopted this approach and
instead they have taken a workaround approach of wrapping a single
[k]vmalloc allocation in a scope API.

These workarounds do not serve the purpose of better reclaim recursion
documentation and reduction of explicit GFP_NO{FS,IO} usage, so let's
just provide them with the semantic they are asking for without a need
for workarounds.

Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
allocations already comply with the given gfp_mask. The only current
exception is vmap_pages_range, which maps kernel page tables. Infer the
proper scope API based on the given gfp mask.
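
For illustration, constrained contexts can then request vmalloc memory
directly (a sketch; the variable names are made up):

	/* a filesystem path that must not recurse into the fs */
	void *fs_buf = __vmalloc(size, GFP_NOFS);
	/* an IO path that must not recurse into the block layer */
	void *io_buf = __vmalloc(size, GFP_NOIO);

The matching nofs/noio scope is then entered internally only around the
page table population rather than at every call site.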

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmalloc.c | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d77830ff604c..7455c89598d3 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2889,6 +2889,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
unsigned long array_size;
unsigned int nr_small_pages = size >> PAGE_SHIFT;
unsigned int page_order;
+ unsigned int flags;
+ int ret;

array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
gfp_mask |= __GFP_NOWARN;
@@ -2930,8 +2932,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
goto fail;
}

- if (vmap_pages_range(addr, addr + size, prot, area->pages,
- page_shift) < 0) {
+ /*
+ * page tables allocations ignore external gfp mask, enforce it
+ * by the scope API
+ */
+ if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ flags = memalloc_nofs_save();
+ else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
+ flags = memalloc_noio_save();
+
+ ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+ page_shift);
+
+ if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
+ memalloc_nofs_restore(flags);
+ else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
+ memalloc_noio_restore(flags);
+
+ if (ret < 0) {
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, failed to map pages",
area->nr_pages * PAGE_SIZE);
--
2.30.2

2021-10-18 16:50:47

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

Fat fingers on my side. A follow-up to fold:
commit 63d1b80b9a298c9380e5175e2add7025b6bd2600
Author: Michal Hocko <[email protected]>
Date: Mon Oct 18 18:47:04 2021 +0200

fold me "mm/vmalloc: add support for __GFP_NOFAIL"

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3a5a178295d1..4ce9ccc33e33 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3034,7 +3034,7 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, vm_struct allocation failed",
real_size);
- if (gfp_mask && __GFP_NOFAIL)
+ if (gfp_mask & __GFP_NOFAIL)
goto again;
goto fail;
}
--
Michal Hocko
SUSE Labs

2021-10-19 00:50:20

by NeilBrown

Subject: Re: [RFC 1/3] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

On Mon, 18 Oct 2021, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> vmalloc historically hasn't supported GFP_NO{FS,IO} requests because
> page table allocations do not support an externally provided gfp mask
> and perform GFP_KERNEL-like allocations.
>
> For a few years we have had scope (memalloc_no{fs,io}_{save,restore})
> APIs to enforce NOFS and NOIO constraints implicitly on all allocations
> within the scope. There was a hope that those scopes would be defined
> on a higher level where the reclaim recursion boundary starts/stops
> (e.g. when a lock required during memory reclaim is taken etc.). It
> seems that not all NOFS/NOIO users have adopted this approach and
> instead they have taken a workaround approach of wrapping a single
> [k]vmalloc allocation in a scope API.
>
> These workarounds do not serve the purpose of better reclaim recursion
> documentation and reduction of explicit GFP_NO{FS,IO} usage, so let's
> just provide them with the semantic they are asking for without a need
> for workarounds.
>
> Add support for GFP_NOFS and GFP_NOIO to vmalloc directly. All internal
> allocations already comply with the given gfp_mask. The only current
> exception is vmap_pages_range, which maps kernel page tables. Infer the
> proper scope API based on the given gfp mask.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmalloc.c | 22 ++++++++++++++++++++--
> 1 file changed, 20 insertions(+), 2 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d77830ff604c..7455c89598d3 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2889,6 +2889,8 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> unsigned long array_size;
> unsigned int nr_small_pages = size >> PAGE_SHIFT;
> unsigned int page_order;
> + unsigned int flags;
> + int ret;
>
> array_size = (unsigned long)nr_small_pages * sizeof(struct page *);
> gfp_mask |= __GFP_NOWARN;
> @@ -2930,8 +2932,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> goto fail;
> }
>
> - if (vmap_pages_range(addr, addr + size, prot, area->pages,
> - page_shift) < 0) {
> + /*
> + * page tables allocations ignore external gfp mask, enforce it
> + * by the scope API
> + */
> + if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + flags = memalloc_nofs_save();
> + else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))

I would *much* rather this were written

else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)

so that the comparison with the previous test is more obvious. Ditto
for similar code below.
It could even be

switch (gfp_mask & (__GFP_FS | __GFP_IO)) {
case __GFP_IO: flags = memalloc_nofs_save(); break;
case 0: flags = memalloc_noio_save(); break;
}

But I'm not completely convinced that is an improvement.

In terms of functionality this looks good.
Thanks,
NeilBrown


> + flags = memalloc_noio_save();
> +
> + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> + page_shift);
> +
> + if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> + memalloc_nofs_restore(flags);
> + else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
> + memalloc_noio_restore(flags);
> +
> + if (ret < 0) {
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, failed to map pages",
> area->nr_pages * PAGE_SIZE);
> --
> 2.30.2
>
>

2021-10-19 07:03:34

by Michal Hocko

Subject: Re: [RFC 1/3] mm/vmalloc: alloc GFP_NO{FS,IO} for vmalloc

On Tue 19-10-21 11:44:01, Neil Brown wrote:
> On Mon, 18 Oct 2021, Michal Hocko wrote:
[...]
> > @@ -2930,8 +2932,24 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > goto fail;
> > }
> >
> > - if (vmap_pages_range(addr, addr + size, prot, area->pages,
> > - page_shift) < 0) {
> > + /*
> > + * page tables allocations ignore external gfp mask, enforce it
> > + * by the scope API
> > + */
> > + if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > + flags = memalloc_nofs_save();
> > + else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
>
> I would *much* rather this were written
>
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)

Sure, this looks better indeed.

> so that the comparison with the previous test is more obvious. Ditto
> for similar code below.
> It could even be
>
> switch (gfp_mask & (__GFP_FS | __GFP_IO)) {
> case __GFP_IO: flags = memalloc_nofs_save(); break;
> case 0: flags = memalloc_noio_save(); break;
> }
>
> But I'm not completely convinced that is an improvement.

I am not a great fan of this though.

> In terms of functionality this looks good.

Thanks for the review!

--
Michal Hocko
SUSE Labs

2021-10-19 11:11:17

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

> From: Michal Hocko <[email protected]>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> A large part of the vmalloc implementation already complies with the
> given gfp flags so there is no extra work to be done there. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/vmalloc.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 7455c89598d3..3a5a178295d1 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
> flags = memalloc_noio_save();
>
> - ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> + do {
> + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> page_shift);
> + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
>
> if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> memalloc_nofs_restore(flags);
> @@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> warn_alloc(gfp_mask, NULL,
> "vmalloc error: size %lu, vm_struct allocation failed",
> real_size);
> + if (gfp_mask && __GFP_NOFAIL)
> + goto again;
> goto fail;
> }
>
> --
> 2.30.2
>
I have checked how the vmap code aligns with the __GFP_NOFAIL flag.
To me it looks correct from a functional point of view.

There is one place though, kasan_populate_vmalloc_pte(). It does
not use gfp_mask; instead it directly uses GFP_KERNEL for its
internal purpose. If it fails, the code will end up looping in
__vmalloc_node_range().
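
(For reference, the allocation in question is roughly the following,
paraphrased from mm/kasan/shadow.c; note the hardcoded GFP_KERNEL
rather than the caller's gfp mask:

	page = __get_free_page(GFP_KERNEL);

so a NOFS/NOIO/NOFAIL request does not reach it.)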

I am not sure how important it is to pass __GFP_NOFAIL into the KASAN code.

Any thoughts about it?

--
Vlad Rezki

2021-10-19 11:54:37

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> >
> > A large part of the vmalloc implementation already complies with the
> > given gfp flags so there is no extra work to be done there. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/vmalloc.c | 6 +++++-
> > 1 file changed, 5 insertions(+), 1 deletion(-)
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 7455c89598d3..3a5a178295d1 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
> > flags = memalloc_noio_save();
> >
> > - ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > + do {
> > + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > page_shift);
> > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> >
> > if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > memalloc_nofs_restore(flags);
> > @@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > warn_alloc(gfp_mask, NULL,
> > "vmalloc error: size %lu, vm_struct allocation failed",
> > real_size);
> > + if (gfp_mask && __GFP_NOFAIL)
> > + goto again;
> > goto fail;
> > }
> >
> > --
> > 2.30.2
> >
> I have checked how the vmap code aligns with the __GFP_NOFAIL flag.
> To me it looks correct from a functional point of view.
>
> There is one place though, kasan_populate_vmalloc_pte(). It does
> not use gfp_mask; instead it directly uses GFP_KERNEL for its
> internal purpose. If it fails, the code will end up looping in
> __vmalloc_node_range().
>
> I am not sure how important it is to pass __GFP_NOFAIL into the KASAN
> code.
>
> Any thoughts about it?

The flag itself is not really necessary down there as long as we
guarantee that the high level logic doesn't fail. In this case we keep
retrying at the __vmalloc_node_range level, which should make it
possible to cover all callers that can control the gfp mask. I was
thinking of putting it into __get_vm_area_node but that was slightly
more hairy and we would be losing the warning, which might turn out to
be helpful in cases where the failure is due to lack of vmalloc space
or a similar constraint. Btw. do we want some throttling on a retry?

It would be better if the kasan part dealt with the gfp mask properly
though; that is something we can do on top.

--
Michal Hocko
SUSE Labs

2021-10-19 19:50:05

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, Oct 19, 2021 at 01:52:07PM +0200, Michal Hocko wrote:
> On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > >
> > > A large part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no extra work to be done there. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> > > ---
> > > mm/vmalloc.c | 6 +++++-
> > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > index 7455c89598d3..3a5a178295d1 100644
> > > --- a/mm/vmalloc.c
> > > +++ b/mm/vmalloc.c
> > > @@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
> > > flags = memalloc_noio_save();
> > >
> > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > + do {
> > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > page_shift);
> > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > >
> > > if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > memalloc_nofs_restore(flags);
> > > @@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > > warn_alloc(gfp_mask, NULL,
> > > "vmalloc error: size %lu, vm_struct allocation failed",
> > > real_size);
> > > + if (gfp_mask && __GFP_NOFAIL)
> > > + goto again;
> > > goto fail;
> > > }
> > >
> > > --
> > > 2.30.2
> > >
> > I have checked how the vmap code aligns with the __GFP_NOFAIL flag.
> > To me it looks correct from a functional point of view.
> >
> > There is one place though, kasan_populate_vmalloc_pte(). It does
> > not use gfp_mask; instead it directly uses GFP_KERNEL for its
> > internal purpose. If it fails, the code will end up looping in
> > __vmalloc_node_range().
> >
> > I am not sure how important it is to pass __GFP_NOFAIL into the KASAN
> > code.
> >
> > Any thoughts about it?
>
> The flag itself is not really necessary down there as long as we
> guarantee that the high level logic doesn't fail. In this case we keep
> retrying at the __vmalloc_node_range level, which should make it
> possible to cover all callers that can control the gfp mask. I was
> thinking of putting it into __get_vm_area_node but that was slightly
> more hairy and we would be losing the warning, which might turn out to
> be helpful in cases where the failure is due to lack of vmalloc space
> or a similar constraint. Btw. do we want some throttling on a retry?
>
I think adding kind of schedule() will not make things worse and in corner
cases could prevent a power drain by CPU. It is important for mobile devices.

As for vmap space, it can be that a user specifies a short range that does
not contain any free area. In that case we might never return back to a caller.

Maybe add a good comment, something like: think about what you do when
dealing with __vmalloc_node_range() and __GFP_NOFAIL?

--
Vlad Rezki

2021-10-20 08:27:20

by Michal Hocko

Subject: [PATCH] mm/vmalloc: be more explicit about supported gfp flags.

From: Michal Hocko <[email protected]>

The core of the vmalloc allocator, __vmalloc_area_node, doesn't say
anything about the gfp mask argument. Not all gfp flags are supported
though. Be more explicit about the constraints.
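
In short, the supported combinations boil down to something like the
following (an illustrative summary, not part of the patch):

	__vmalloc(size, GFP_KERNEL);			/* supported */
	__vmalloc(size, GFP_NOFS | __GFP_NOFAIL);	/* supported */
	__vmalloc(size, GFP_KERNEL | __GFP_NOWARN);	/* supported */
	__vmalloc(size, GFP_NOWAIT);			/* not supported */
	__vmalloc(size, GFP_KERNEL | __GFP_NORETRY);	/* not supported */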

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmalloc.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index f7098e616883..f57c09e98977 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2979,8 +2979,16 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
* @caller: caller's return address
*
* Allocate enough pages to cover @size from the page level
- * allocator with @gfp_mask flags. Map them into contiguous
- * kernel virtual space, using a pagetable protection of @prot.
+ * allocator with @gfp_mask flags. Please note that the full set of gfp
+ * flags are not supported. GFP_KERNEL would be a preferred allocation mode
+ * but GFP_NOFS and GFP_NOIO are supported as well. Zone modifiers are not
+ * supported. From the reclaim modifiers __GFP_DIRECT_RECLAIM is required (aka
+ * GFP_NOWAIT is not supported) and only __GFP_NOFAIL is supported (aka
+ * __GFP_NORETRY and __GFP_RETRY_MAYFAIL are not supported).
+ * __GFP_NOWARN can be used to suppress error messages about failures.
+ *
+ * Map them into contiguous kernel virtual space, using a pagetable
+ * protection of @prot.
*
* Return: the address of the area or %NULL on failure
*/
--
2.30.2

2021-10-20 08:27:33

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 19-10-21 21:46:58, Uladzislau Rezki wrote:
> On Tue, Oct 19, 2021 at 01:52:07PM +0200, Michal Hocko wrote:
> > On Tue 19-10-21 13:06:49, Uladzislau Rezki wrote:
> > > > From: Michal Hocko <[email protected]>
> > > >
> > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > cannot fail and they do not fit into a single page.
> > > >
> > > > A large part of the vmalloc implementation already complies with the
> > > > given gfp flags so there is no extra work to be done there. The area
> > > > and page table allocations are an exception to that. Implement a retry
> > > > loop for those.
> > > >
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > > > ---
> > > > mm/vmalloc.c | 6 +++++-
> > > > 1 file changed, 5 insertions(+), 1 deletion(-)
> > > >
> > > > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > > > index 7455c89598d3..3a5a178295d1 100644
> > > > --- a/mm/vmalloc.c
> > > > +++ b/mm/vmalloc.c
> > > > @@ -2941,8 +2941,10 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > > > else if (!(gfp_mask & (__GFP_FS | __GFP_IO)))
> > > > flags = memalloc_noio_save();
> > > >
> > > > - ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > + do {
> > > > + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > > > page_shift);
> > > > + } while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));
> > > >
> > > > if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> > > > memalloc_nofs_restore(flags);
> > > > @@ -3032,6 +3034,8 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> > > > warn_alloc(gfp_mask, NULL,
> > > > "vmalloc error: size %lu, vm_struct allocation failed",
> > > > real_size);
> > > > + if (gfp_mask && __GFP_NOFAIL)
> > > > + goto again;
> > > > goto fail;
> > > > }
> > > >
> > > > --
> > > > 2.30.2
> > > >
> > > I have checked how the vmap code aligns with the __GFP_NOFAIL flag.
> > > To me it looks correct from a functional point of view.
> > >
> > > There is one place though, kasan_populate_vmalloc_pte(). It does
> > > not use gfp_mask; instead it directly uses GFP_KERNEL for its
> > > internal purpose. If it fails, the code will end up looping in
> > > __vmalloc_node_range().
> > >
> > > I am not sure how important it is to pass __GFP_NOFAIL into the KASAN
> > > code.
> > >
> > > Any thoughts about it?
> >
> > The flag itself is not really necessary down there as long as we
> > guarantee that the high level logic doesn't fail. In this case we keep
> > retrying at the __vmalloc_node_range level, which should make it
> > possible to cover all callers that can control the gfp mask. I was
> > thinking of putting it into __get_vm_area_node but that was slightly
> > more hairy and we would be losing the warning, which might turn out to
> > be helpful in cases where the failure is due to lack of vmalloc space
> > or a similar constraint. Btw. do we want some throttling on a retry?
> >
> I think adding kind of schedule() will not make things worse and in corner
> cases could prevent a power drain by CPU. It is important for mobile devices.

I suspect you mean schedule_timeout here? Or cond_resched? I went with the
latter for now, I do not have a good idea for how long to sleep here.
I am more than happy to change it to a sleep though.

> As for vmap space, it can be that a user specifies a short range that does
> not contain any free area. In that case we might never return back to a caller.

This is to be expected. The caller cannot fail and if it would be
looping around vmalloc it wouldn't return anyway.

> Maybe add a good comment, something like: think about what you do when
> dealing with __vmalloc_node_range() and __GFP_NOFAIL?

We have generic documentation for gfp flags and __GFP_NOFAIL is
documented as "The allocation could block indefinitely but will never
return with failure." We are discussing improvements for the generic
documentation in another thread [1] and we will likely extend it so I
suspect we do not have to repeat drawbacks here again.

[1] http://lkml.kernel.org/r/[email protected]

Anyway, the gfp mask description and constraints for vmalloc are not
documented. I will add a new patch to fill that gap and send it as a
reply to this one.
--
Michal Hocko
SUSE Labs

2021-10-20 09:20:52

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 20-10-21 10:25:06, Michal Hocko wrote:
[...]
> > > The flag itself is not really necessary down there as long as we
> > > guarantee that the high level logic doesn't fail. In this case we keep
> > > retrying at the __vmalloc_node_range level, which should make it
> > > possible to cover all callers that can control the gfp mask. I was
> > > thinking of putting it into __get_vm_area_node but that was slightly
> > > more hairy and we would be losing the warning, which might turn out to
> > > be helpful in cases where the failure is due to lack of vmalloc space
> > > or a similar constraint. Btw. do we want some throttling on a retry?
> > >
> > I think adding kind of schedule() will not make things worse and in corner
> > cases could prevent a power drain by CPU. It is important for mobile devices.
>
> I suspect you mean schedule_timeout here? Or cond_resched? I went with the
> latter for now, I do not have a good idea for how long to sleep here.
> I am more than happy to change it to a sleep though.

Forgot to paste the follow up I have staged currently
---
commit 66fea55e5543fa234692a70144cd63c7a1bca32f
Author: Michal Hocko <[email protected]>
Date: Wed Oct 20 10:12:45 2021 +0200

fold me "mm/vmalloc: add support for __GFP_NOFAIL"

- add cond_resched

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fb5413d9239..f7098e616883 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
do {
ret = vmap_pages_range(addr, addr + size, prot, area->pages,
page_shift);
+ cond_resched();
} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));

if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
@@ -3034,8 +3035,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, vm_struct allocation failed",
real_size);
- if (gfp_mask & __GFP_NOFAIL)
+ if (gfp_mask & __GFP_NOFAIL) {
+ cond_resched();
goto again;
+ }
goto fail;
}


--
Michal Hocko
SUSE Labs

2021-10-20 13:57:47

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

> > >
> > I think adding kind of schedule() will not make things worse and in corner
> > cases could prevent a power drain by CPU. It is important for mobile devices.
>
> I suspect you mean schedule_timeout here? Or cond_resched? I went with the
> latter for now, I do not have a good idea for how long to sleep here.
> I am more than happy to change it to a sleep though.
>
cond_resched() reschedules only if TIF_NEED_RESCHED is raised, which is
not good here, because in our case we know that we definitely would
like to take a breath. Therefore invoking schedule() is more suitable
here. It will give CPU time to another waiting process (if one exists)
in any case, putting the "current" one to the tail.

As for adding a delay, I am not sure for how long to delay, or rather I
do not see a good explanation why we would delay for example for 10
milliseconds or so.
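
For comparison, the throttling options being discussed here would look
like this in the retry loop (illustrative only):

	cond_resched();		/* yields only if TIF_NEED_RESCHED is set */
	schedule();		/* in TASK_RUNNING just yields the CPU */
	schedule_timeout_uninterruptible(1);	/* really sleeps ~1 tick */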

> > As for vmap space, it can be that a user specifies a short range that does
> > not contain any free area. In that case we might never return back to a caller.
>
> This is to be expected. The caller cannot fail and if it would be
> looping around vmalloc it wouldn't return anyway.
>
> > Maybe add a good comment, something like: think about what you do when
> > dealing with __vmalloc_node_range() and __GFP_NOFAIL?
>
> We have generic documentation for gfp flags and __GFP_NOFAIL is
> documented as "The allocation could block indefinitely but will never
> return with failure." We are discussing improvements for the generic
> documentation in another thread [1] and we will likely extend it so I
> suspect we do not have to repeat drawbacks here again.
>
> [1] http://lkml.kernel.org/r/[email protected]
>
> Anyway, the gfp mask description and constraints for vmalloc are not
> documented. I will add a new patch to fill that gap and send it as a
> reply to this one.
>
This is really good. People should be prepared for the case when it
never returns back to the caller :)

--
Uladzislau Rezki

2021-10-20 14:07:50

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 20-10-21 15:54:23, Uladzislau Rezki wrote:
> > > >
> > > I think adding kind of schedule() will not make things worse and in corner
> > > cases could prevent a power drain by CPU. It is important for mobile devices.
> >
> > I suspect you mean schedule_timeout here? Or cond_resched? I went with the
> > latter for now, I do not have a good idea for how long to sleep here.
> > I am more than happy to change it to a sleep though.
> >
> cond_resched() reschedules only if TIF_NEED_RESCHED is raised, which is
> not good here, because in our case we know that we definitely would
> like to take a breath. Therefore invoking schedule() is more suitable
> here. It will give CPU time to another waiting process (if one exists)
> in any case, putting the "current" one to the tail.

Yes, but there is no explicit event to wait for currently.

> As for adding a delay, I am not sure for how long to delay, or rather I
> do not see a good explanation why we would delay for example for 10
> milliseconds or so.

As I've said I am OK with either of the two. Do you or anybody have any
preference? Without any explicit event to wake up for neither of the two
is more than just an optimistic retry.
--
Michal Hocko
SUSE Labs

2021-10-20 14:35:15

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
>
> On Wed 20-10-21 15:54:23, Uladzislau Rezki wrote:
> > > > >
> > > > I think adding kind of schedule() will not make things worse and in corner
> > > > cases could prevent a power drain by CPU. It is important for mobile devices.
> > >
> > > I suspect you mean schedule_timeout here? Or cond_resched? I went with the
> > > latter for now, I do not have a good idea for how long to sleep here.
> > > I am more than happy to change it to a sleep though.
> > >
> > cond_resched() reschedules only if TIF_NEED_RESCHED is raised, which is
> > not good here, because in our case we know that we definitely would
> > like to take a breath. Therefore invoking schedule() is more suitable
> > here. It will give CPU time to another waiting process (if one exists)
> > in any case, putting the "current" one to the tail.
>
> Yes, but there is no explicit event to wait for currently.
>
> > As for adding a delay, I am not sure for how long to delay, or rather I
> > do not see a good explanation why we would delay for example for 10
> > milliseconds or so.
>
> As I've said I am OK with either of the two. Do you or anybody have any
> preference? Without any explicit event to wake up for neither of the two
> is more than just an optimistic retry.
>
From power perspective it is better to have a delay, so i tend to say
that delay is better.

--
Uladzislau Rezki

2021-10-20 14:55:45

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
[...]
> > As I've said I am OK with either of the two. Do you or anybody have any
> > preference? Without any explicit event to wake up for neither of the two
> > is more than just an optimistic retry.
> >
> From power perspective it is better to have a delay, so i tend to say
> that delay is better.

I am a terrible random number generator. Can you give me a number
please?
--
Michal Hocko
SUSE Labs

2021-10-20 15:04:46

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

>
> On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> [...]
> > > As I've said I am OK with either of the two. Do you or anybody have any
> > > preference? Without any explicit event to wake up for neither of the two
> > > is more than just an optimistic retry.
> > >
> > From power perspective it is better to have a delay, so i tend to say
> > that delay is better.
>
> I am a terrible random number generator. Can you give me a number
> please?
>
Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)

--
Uladzislau Rezki

2021-10-20 19:29:03

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> >
> > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > [...]
> > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > preference? Without any explicit event to wake up for neither of the two
> > > > is more than just an optimistic retry.
> > > >
> > > From power perspective it is better to have a delay, so i tend to say
> > > that delay is better.
> >
> > I am a terrible random number generator. Can you give me a number
> > please?
> >
> Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
>
A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

--
Vlad Rezki

2021-10-21 09:00:04

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 20-10-21 21:24:30, Uladzislau Rezki wrote:
> On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > >
> > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > [...]
> > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > is more than just an optimistic retry.
> > > > >
> > > > From power perspective it is better to have a delay, so i tend to say
> > > > that delay is better.
> > >
> > > I am a terrible random number generator. Can you give me a number
> > > please?
> > >
> > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)

OK, I will go with 1 jiffy.

> A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

I was planning to use schedule_timeout_uninterruptible. Why do you think
msleep is better?
--
Michal Hocko
SUSE Labs

2021-10-21 10:15:00

by NeilBrown

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > >
> > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > [...]
> > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > is more than just an optimistic retry.
> > > > >
> > > > From power perspective it is better to have a delay, so i tend to say
> > > > that delay is better.
> > >
> > > I am a terrible random number generator. Can you give me a number
> > > please?
> > >
> > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> >
> A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));

I disagree. I think schedule_timeout_uninterruptible(1) is the best
way to sleep for 1 tick.

msleep() contains
timeout = msecs_to_jiffies(msecs) + 1;
and both jiffies_to_msecs and msecs_to_jiffies might round up too.
So you will sleep for at least twice as long as you asked for, possibly
more.
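
A worked example, assuming HZ=250 so that one tick is nominally 4 ms:

	msleep(jiffies_to_msecs(1));
	/* jiffies_to_msecs(1) == 4 ms */
	/* msleep(4): msecs_to_jiffies(4) == 1, plus 1 == 2 jiffies */
	/* i.e. a 2-tick timeout for a nominally 1-tick sleep */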

NeilBrown


>
> --
> Vlad Rezki
>
>

2021-10-21 10:30:33

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Thu 21-10-21 21:13:35, Neil Brown wrote:
> On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > >
> > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > > [...]
> > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > is more than just an optimistic retry.
> > > > > >
> > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > that delay is better.
> > > >
> > > > I am a terrible random number generator. Can you give me a number
> > > > please?
> > > >
> > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > >
> > A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));
>
> I disagree. I think schedule_timeout_uninterruptible(1) is the best
> way to sleep for 1 tick.
>
> msleep() contains
> timeout = msecs_to_jiffies(msecs) + 1;
> and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> So you will sleep for at least twice as long as you asked for, possibly
> more.

That was my thinking as well. Not to mention jiffies_to_msecs just to
do msecs_to_jiffies right after, which seems like a pointless waste of
cpu cycles. But maybe I was missing some other reasons why msleep would
be superior.
--
Michal Hocko
SUSE Labs

2021-10-21 10:41:53

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

> On Thu 21-10-21 21:13:35, Neil Brown wrote:
> > On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > > >
> > > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > > > [...]
> > > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > > is more than just an optimistic retry.
> > > > > > >
> > > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > > that delay is better.
> > > > >
> > > > > I am a terrible random number generator. Can you give me a number
> > > > > please?
> > > > >
> > > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > > >
> > > A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));
> >
> > I disagree. I think schedule_timeout_uninterruptible(1) is the best
> > way to sleep for 1 tick.
> >
> > msleep() contains
> > timeout = msecs_to_jiffies(msecs) + 1;
> > and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> > So you will sleep for at least twice as long as you asked for, possibly
> > more.
>
> That was my thinking as well. Not to mention jiffies_to_msecs just to
> do msecs_to_jiffies right after, which seems like a pointless waste of
> cpu cycles. But maybe I was missing some other reasons why msleep would
> be superior.
>

To me the msleep is just simpler from a semantic point of view, i.e. it
is as straightforward as it can be. In case of interruptible vs
uninterruptible sleep it can be more confusing for people.

When it comes to rounding and possibility to sleep more than 1 tick, it
really does not matter here, we do not need to guarantee exact sleeping
time.

Therefore i proposed to switch to the msleep().

--
Vlad Rezki

2021-10-21 22:50:31

by NeilBrown

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > On Thu 21-10-21 21:13:35, Neil Brown wrote:
> > > On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > > > >
> > > > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > > > > [...]
> > > > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > > > is more than just an optimistic retry.
> > > > > > > >
> > > > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > > > that delay is better.
> > > > > >
> > > > > > I am a terrible random number generator. Can you give me a number
> > > > > > please?
> > > > > >
> > > > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > > > >
> > > > A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));
> > >
> > > I disagree. I think schedule_timeout_uninterruptible(1) is the best
> > > way to sleep for 1 tick.
> > >
> > > msleep() contains
> > > timeout = msecs_to_jiffies(msecs) + 1;
> > > and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> > > So you will sleep for at least twice as long as you asked for, possibly
> > > more.
> >
> > That was my thinking as well. Not to mention jiffies_to_msecs just to
> > do msecs_to_jiffies right after, which seems like a pointless waste of
> > cpu cycles. But maybe I was missing some other reasons why msleep would
> > be superior.
> >
>
> To me the msleep is just simpler from a semantic point of view, i.e. it
> is as straightforward as it can be. In case of interruptible vs
> uninterruptible sleep it can be more confusing for people.

I agree that msleep() is simpler. I think adding the
jiffies_to_msec() substantially reduces that simplicity.

>
> When it comes to rounding and possibility to sleep more than 1 tick, it
> really does not matter here, we do not need to guarantee exact sleeping
> time.
>
> Therefore i proposed to switch to the msleep().

If, as you say, the precision doesn't matter that much, then maybe
msleep(0), which would sleep to the start of the next jiffy. Does that
look a bit weird? If so, then msleep(1) would be ok.

However now that I've thought about it some more, I'd much prefer we
introduce something like
memalloc_retry_wait();

and use that everywhere that a memory allocation is retried.
I'm not convinced that we need to wait at all - at least, not when
__GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
- succeed
- make some progress at reclaiming, or
- sleep

However I'm not 100% certain, and the behaviour might change in the
future. So having one place (the definition of memalloc_retry_wait())
where we can change the sleeping behaviour if the alloc_page behaviour
changes, would be ideal. Maybe memalloc_retry_wait() could take a
gfpflags arg.
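
Something along these lines, perhaps (purely a sketch of the proposal;
the exact policy and the helper name are open):

	/* hypothetical helper: one central place to tune how memory
	 * allocation retry loops wait */
	static inline void memalloc_retry_wait(gfp_t gfp_flags)
	{
		if (gfp_flags & __GFP_DIRECT_RECLAIM)
			/* probably waited in reclaim already */
			schedule_timeout_uninterruptible(1);
		else
			/* wait a bit longer */
			schedule_timeout_uninterruptible(HZ / 50);
	}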

Thanks,
NeilBrown

2021-10-22 08:20:27

by Michal Hocko

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Fri 22-10-21 09:49:08, Neil Brown wrote:
[...]
> However now that I've thought about some more, I'd much prefer we
> introduce something like
> memalloc_retry_wait();
>
> and use that everywhere that a memory allocation is retried.
> I'm not convinced that we need to wait at all - at least, not when
> __GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
> - succeed
> - make some progress a reclaiming or
> - sleep

There are two retry loops that we have to do explicitly. The
vmap_pages_range one is due to implicit GFP_KERNEL allocations for page
tables. Those would likely be a good fit for something you suggest
above. Then we have the __get_vm_area_node retry loop which can be
either due to a vmalloc space reservation failure or an implicit
GFP_KERNEL allocation by kasan. The first one is not really related to
memory availability so it doesn't sound like a good fit.

--
Michal Hocko
SUSE Labs

2021-10-25 09:53:32

by Uladzislau Rezki

Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > On Thu 21-10-21 21:13:35, Neil Brown wrote:
> > > > On Thu, 21 Oct 2021, Uladzislau Rezki wrote:
> > > > > On Wed, Oct 20, 2021 at 05:00:28PM +0200, Uladzislau Rezki wrote:
> > > > > > >
> > > > > > > On Wed 20-10-21 16:29:14, Uladzislau Rezki wrote:
> > > > > > > > On Wed, Oct 20, 2021 at 4:06 PM Michal Hocko <[email protected]> wrote:
> > > > > > > [...]
> > > > > > > > > As I've said I am OK with either of the two. Do you or anybody have any
> > > > > > > > > preference? Without any explicit event to wake up for neither of the two
> > > > > > > > > is more than just an optimistic retry.
> > > > > > > > >
> > > > > > > > From power perspective it is better to have a delay, so i tend to say
> > > > > > > > that delay is better.
> > > > > > >
> > > > > > > I am a terrible random number generator. Can you give me a number
> > > > > > > please?
> > > > > > >
> > > > > > Well, we can start from one jiffy so it is one timer tick: schedule_timeout(1)
> > > > > >
> > > > > A small nit, it is better to replace it by the simple msleep() call: msleep(jiffies_to_msecs(1));
> > > >
> > > > I disagree. I think schedule_timeout_uninterruptible(1) is the best
> > > > way to sleep for 1 tick.
> > > >
> > > > msleep() contains
> > > > timeout = msecs_to_jiffies(msecs) + 1;
> > > > and both jiffies_to_msecs and msecs_to_jiffies might round up too.
> > > > So you will sleep for at least twice as long as you asked for, possibly
> > > > more.
> > >
> > > That was my thinking as well. Not to mention jiffies_to_msecs just to
> > > do msecs_to_jiffies right after, which seems like a pointless waste of
> > > cpu cycles. But maybe I was missing some other reasons why msleep would
> > > be superior.
> > >
> >
> > To me the msleep is just simpler from a semantic point of view, i.e. it
> > is as straightforward as it can be. In case of interruptible vs
> > uninterruptible sleep it can be more confusing for people.
>
> I agree that msleep() is simpler. I think adding the
> jiffies_to_msec() substantially reduces that simplicity.
>
> >
> > When it comes to rounding and possibility to sleep more than 1 tick, it
> > really does not matter here, we do not need to guarantee exact sleeping
> > time.
> >
> > Therefore i proposed to switch to the msleep().
>
> If, as you say, the precision doesn't matter that much, then maybe
> msleep(0), which would sleep to the start of the next jiffy. Does that
> look a bit weird? If so, then msleep(1) would be ok.
>
Agree, msleep(1) looks much better than converting 1 jiffy to
milliseconds. The result should be the same.

> However now that I've thought about it some more, I'd much prefer we
> introduce something like
> memalloc_retry_wait();
>
> and use that everywhere that a memory allocation is retried.
> I'm not convinced that we need to wait at all - at least, not when
> __GFP_DIRECT_RECLAIM is used, as in that case alloc_page will either
> - succeed
> - make some progress at reclaiming, or
> - sleep
>
> However I'm not 100% certain, and the behaviour might change in the
> future. So having one place (the definition of memalloc_retry_wait())
> where we can change the sleeping behaviour if the alloc_page behaviour
> changes, would be ideal. Maybe memalloc_retry_wait() could take a
> gfpflags arg.
>
A sleep is required in __get_vm_area_node() because, if vmap space is
lacking, it would otherwise end up in a tight loop without sleeping,
which is really bad.
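
A minimal sketch of the helper Neil describes could look like the
following (purely illustrative; the name, the signature and the backoff
policy are assumptions, not an existing API):

	/*
	 * Hypothetical central backoff helper for allocation retry loops.
	 * If direct reclaim was allowed, alloc_page() has already made
	 * progress or slept, so yielding the CPU is enough; otherwise
	 * sleep for one tick to avoid a tight retry loop.
	 */
	static inline void memalloc_retry_wait(gfp_t gfp_flags)
	{
		if (gfp_flags & __GFP_DIRECT_RECLAIM)
			cond_resched();
		else
			schedule_timeout_uninterruptible(1);
	}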

Thanks!

--
Vlad Rezki

2021-10-25 11:47:23

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Mon 25-10-21 11:48:41, Uladzislau Rezki wrote:
> On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
[...]
> > If, as you say, the precision doesn't matter that much, then maybe
> > msleep(0)
> > which would sleep to the start of the next jiffy. Does that look a bit
> > weird? If so, then msleep(1) would be ok.
> >
> Agreed, msleep(1) looks much better than converting 1 jiffy to
> milliseconds. The result should be the same.

I would really prefer if this was not the main point of argument here.
Unless you feel strongly about msleep I would go with schedule_timeout
here because this is a more widely used interface in the mm code and
also because I feel that relying on the rounding behavior is just
subtle. Here is what I have staged now.

Are there any other concerns you see with this or other patches in the
series?

Thanks!
---
commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
Author: Michal Hocko <[email protected]>
Date: Wed Oct 20 10:12:45 2021 +0200

fold me "mm/vmalloc: add support for __GFP_NOFAIL"

Add a short sleep before retrying. 1 jiffy is a completely random
timeout. Ideally the retry would wait for an explicit event - e.g.
a change to the vmalloc space if the failure was caused by
space fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the retry
simple for now and just sleep to prevent hogging CPUs.

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 0fb5413d9239..a866db0c9c31 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
do {
ret = vmap_pages_range(addr, addr + size, prot, area->pages,
page_shift);
+ schedule_timeout_uninterruptible(1);
} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));

if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
@@ -3034,8 +3035,10 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
warn_alloc(gfp_mask, NULL,
"vmalloc error: size %lu, vm_struct allocation failed",
real_size);
- if (gfp_mask & __GFP_NOFAIL)
+ if (gfp_mask & __GFP_NOFAIL) {
+ schedule_timeout_uninterruptible(1);
goto again;
+ }
goto fail;
}


--
Michal Hocko
SUSE Labs

2021-10-25 14:37:31

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

>
> I would really prefer if this was not the main point of argument here.
> Unless you feel strongly about msleep I would go with schedule_timeout
> here because this is a more widely used interface in the mm code and
> also because I feel that relying on the rounding behavior is just
> subtle. Here is what I have staged now.
>
I have a preference but do not have a strong opinion here. You can go
whichever way you want.

>
> Are there any other concerns you see with this or other patches in the
> series?
>
It would be better if you could send a new vX version because it is hard
to combine every "folded" change into one solid commit. One comment below:

> ---
> commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
> Author: Michal Hocko <[email protected]>
> Date: Wed Oct 20 10:12:45 2021 +0200
>
> fold me "mm/vmalloc: add support for __GFP_NOFAIL"
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space if the failure was caused by
> space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent hogging CPUs.
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index 0fb5413d9239..a866db0c9c31 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> do {
> ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> page_shift);
> + schedule_timeout_uninterruptible(1);
>
We do not want to call schedule_timeout_uninterruptible(1) every time,
only when an error is detected.
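
In other words, something like this (a sketch of the fix, moving the
sleep under the error check):

	do {
		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
				       page_shift);
		if (ret < 0)
			/* back off only when the mapping attempt failed */
			schedule_timeout_uninterruptible(1);
	} while ((gfp_mask & __GFP_NOFAIL) && (ret < 0));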

--
Uladzislau Rezki

2021-10-25 21:23:25

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Mon 25-10-21 16:30:23, Uladzislau Rezki wrote:
> >
> > I would really prefer if this was not the main point of argument here.
> > Unless you feel strongly about msleep I would go with schedule_timeout
> > here because this is a more widely used interface in the mm code and
> > also because I feel that relying on the rounding behavior is just
> > subtle. Here is what I have staged now.
> >
> I have a preference but do not have a strong opinion here. You can go
> whichever way you want.
>
> >
> > Are there any other concerns you see with this or other patches in the
> > series?
> >
> It would be better if you could send a new vX version because it is hard
> to combine every "folded" change

Yeah, I plan to soon. I just wanted to sort out most things before
spamming with a new version.

> into one solid commit. One comment below:
>
> > ---
> > commit c1a7e40e6b56fed5b9e716de7055b77ea29d89d0
> > Author: Michal Hocko <[email protected]>
> > Date: Wed Oct 20 10:12:45 2021 +0200
> >
> > fold me "mm/vmalloc: add support for __GFP_NOFAIL"
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space if the failure was caused by
> > space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent hogging CPUs.
> >
> > diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> > index 0fb5413d9239..a866db0c9c31 100644
> > --- a/mm/vmalloc.c
> > +++ b/mm/vmalloc.c
> > @@ -2944,6 +2944,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> > do {
> > ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> > page_shift);
> > + schedule_timeout_uninterruptible(1);
> >
> We do not want to call schedule_timeout_uninterruptible(1) every time,
> only when an error is detected.

Because I was obviously in a brainless mode when doing that one. Thanks
for pointing this out!
--
Michal Hocko
SUSE Labs

2021-10-26 03:19:08

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > However I'm not 100% certain, and the behaviour might change in the
> > future. So having one place (the definition of memalloc_retry_wait())
> > where we can change the sleeping behaviour if the alloc_page behaviour
> > changes, would be ideal. Maybe memalloc_retry_wait() could take a
> > gfpflags arg.
> >
> A sleep is required in __get_vm_area_node() because, if vmap space is
> lacking, it would otherwise end up in a tight loop without sleeping,
> which is really bad.
>
So vmalloc() has two failure modes. alloc_page() failure and
__alloc_vmap_area() failure. The caller cannot tell which...

Actually, they can. If we pass __GFP_NOFAIL to vmalloc(), and it fails,
then it must have been __alloc_vmap_area() which failed.
What do we do in that case?
Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
finishes?
If we use the spinlock from that waitq in place of free_vmap_area_lock,
then the wakeup would be nearly free if no-one was waiting, and worth
while if someone was waiting.
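
Roughly, the idea might look like this (an illustrative sketch only;
the names and hook points are made up, and races are ignored):

	static DECLARE_WAIT_QUEUE_HEAD(vmap_purge_wq);
	static bool vmap_purge_done;

	/* at the end of __purge_vmap_area_lazy(): */
	vmap_purge_done = true;
	wake_up_all(&vmap_purge_wq);

	/* in the __GFP_NOFAIL retry path, instead of a blind sleep: */
	vmap_purge_done = false;
	wait_event_timeout(vmap_purge_wq, vmap_purge_done, HZ);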

Thanks,
NeilBrown

2021-10-26 07:38:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 26-10-21 10:50:21, Neil Brown wrote:
> On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > However I'm not 100% certain, and the behaviour might change in the
> > > future. So having one place (the definition of memalloc_retry_wait())
> > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > changes, would be ideal. Maybe memalloc_retry_wait() could take a
> > > gfpflags arg.
> > >
> > A sleep is required in __get_vm_area_node() because, if vmap space is
> > lacking, it would otherwise end up in a tight loop without sleeping,
> > which is really bad.
> >
> So vmalloc() has two failure modes. alloc_page() failure and
> __alloc_vmap_area() failure. The caller cannot tell which...
>
> Actually, they can. If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> then it must have been __alloc_vmap_area() which failed.
> What do we do in that case?
> Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> finishes?
> If we use the spinlock from that waitq in place of free_vmap_area_lock,
> then the wakeup would be nearly free if no-one was waiting, and worth
> while if someone was waiting.

Is this really required to be part of the initial support?
--
Michal Hocko
SUSE Labs

2021-10-26 13:54:17

by NeilBrown

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, 26 Oct 2021, Michal Hocko wrote:
> On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > However I'm not 100% certain, and the behaviour might change in the
> > > > future. So having one place (the definition of memalloc_retry_wait())
> > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > changes, would be ideal. Maybe memalloc_retry_wait() could take a
> > > > gfpflags arg.
> > > >
> > > A sleep is required in __get_vm_area_node() because, if vmap space is
> > > lacking, it would otherwise end up in a tight loop without sleeping,
> > > which is really bad.
> > >
> > So vmalloc() has two failure modes. alloc_page() failure and
> > __alloc_vmap_area() failure. The caller cannot tell which...
> >
> > Actually, they can. If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > then it must have been __alloc_vmap_area() which failed.
> > What do we do in that case?
> > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > finishes?
> > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > then the wakeup would be nearly free if no-one was waiting, and worth
> > while if someone was waiting.
>
> Is this really required to be part of the initial support?

No.... I was just thinking out loud.

NeilBrown

2021-10-26 18:09:46

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, Oct 26, 2021 at 12:24 PM NeilBrown <[email protected]> wrote:
>
> On Tue, 26 Oct 2021, Michal Hocko wrote:
> > On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > > However I'm not 100% certain, and the behaviour might change in the
> > > > > future. So having one place (the definition of memalloc_retry_wait())
> > > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > > changes, would be ideal. Maybe memalloc_retry_wait() could take a
> > > > > gfpflags arg.
> > > > >
> > > > A sleep is required in __get_vm_area_node() because, if vmap space is
> > > > lacking, it would otherwise end up in a tight loop without sleeping,
> > > > which is really bad.
> > > >
> > > So vmalloc() has two failure modes. alloc_page() failure and
> > > __alloc_vmap_area() failure. The caller cannot tell which...
> > >
> > > Actually, they can. If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > > then it must have been __alloc_vmap_area() which failed.
> > > What do we do in that case?
> > > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > > finishes?
> > > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > > then the wakeup would be nearly free if no-one was waiting, and worth
> > > while if someone was waiting.
> >
> > Is this really required to be part of the initial support?
>
> No.... I was just thinking out loud.
>
alloc_vmap_area() has a retry path: if it fails, the code will try to
"purge" areas and repeat one more time, so we do not need to purge
anywhere else.
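
For reference, the existing error path looks roughly like this
(paraphrased from the alloc_vmap_area() of that era; details may differ):

	overflow:
		if (!purged) {
			/* reclaim lazily-freed areas once, then retry */
			purge_vmap_area_lazy();
			purged = true;
			goto retry;
		}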

--
Uladzislau Rezki

2021-10-26 20:08:46

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 26-10-21 16:25:07, Uladzislau Rezki wrote:
> On Tue, Oct 26, 2021 at 12:24 PM NeilBrown <[email protected]> wrote:
> >
> > On Tue, 26 Oct 2021, Michal Hocko wrote:
> > > On Tue 26-10-21 10:50:21, Neil Brown wrote:
> > > > On Mon, 25 Oct 2021, Uladzislau Rezki wrote:
> > > > > On Fri, Oct 22, 2021 at 09:49:08AM +1100, NeilBrown wrote:
> > > > > > However I'm not 100% certain, and the behaviour might change in the
> > > > > > future. So having one place (the definition of memalloc_retry_wait())
> > > > > > where we can change the sleeping behaviour if the alloc_page behaviour
> > > > > > changes, would be ideal. Maybe memalloc_retry_wait() could take a
> > > > > > gfpflags arg.
> > > > > >
> > > > > A sleep is required in __get_vm_area_node() because, if vmap space is
> > > > > lacking, it would otherwise end up in a tight loop without sleeping,
> > > > > which is really bad.
> > > > >
> > > > So vmalloc() has two failure modes. alloc_page() failure and
> > > > __alloc_vmap_area() failure. The caller cannot tell which...
> > > >
> > > > Actually, they can. If we pass __GFP_NOFAIL to vmalloc(), and it fails,
> > > > then it must have been __alloc_vmap_area() which failed.
> > > > What do we do in that case?
> > > > Can we add a waitq which gets a wakeup when __purge_vmap_area_lazy()
> > > > finishes?
> > > > If we use the spinlock from that waitq in place of free_vmap_area_lock,
> > > > then the wakeup would be nearly free if no-one was waiting, and worth
> > > > while if someone was waiting.
> > >
> > > Is this really required to be part of the initial support?
> >
> > No.... I was just thinking out loud.
> >
> alloc_vmap_area() has a retry path: if it fails, the code will try to
> "purge" areas and repeat one more time, so we do not need to purge
> anywhere else.

I think that Neil was not concerned about the need for purging something
but rather about a wait event the retry loop could hook into, so that the
sleep wouldn't have to be a random timeout but something that is
actually actionable - like somebody freeing an area.
--
Michal Hocko
SUSE Labs

2021-10-26 21:06:00

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [RFC 2/3] mm/vmalloc: add support for __GFP_NOFAIL

> > > > Is this really required to be part of the initial support?
> > >
> > > No.... I was just thinking out loud.
> > >
> > alloc_vmap_area() has an retry path, basically if it fails the code
> > will try to "purge"
> > areas and repeat it one more time. So we do not need to purge outside some where
> > else.
>
> I think that Neil was not concerned about the need for purging something
> but rather about a wait event the retry loop could hook into, so that the
> sleep wouldn't have to be a random timeout but something that is
> actually actionable - like somebody freeing an area.
>
I see the point. But sometimes it is not as straightforward as it could be.
If we lack vmap space within a specific range, it might not be about
reclaiming (nobody frees into that area and there are no outstanding areas),
so we can do nothing.

--
Uladzislau Rezki