2021-11-22 15:32:55

by Michal Hocko

Subject: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

From: Michal Hocko <[email protected]>

Dave Chinner has mentioned that some of the xfs code would benefit from
kvmalloc support for __GFP_NOFAIL because they have allocations that
cannot fail and they do not fit into a single page.

A large part of the vmalloc implementation already complies with the
given gfp flags, so there is no work to be done for those paths. The area
and page table allocations are an exception to that. Implement a retry
loop for those.

Add a short sleep before retrying. 1 jiffy is a completely random
timeout. Ideally the retry would wait for an explicit event - e.g.
a change to the vmalloc space if the failure was caused by space
fragmentation or depletion. But there are multiple different
reasons to retry and this could become much more complex. Keep the retry
simple for now and just sleep to prevent hogging CPUs.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/vmalloc.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 17ca7001de1f..b6aed4f94a85 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2844,6 +2844,8 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 	 * more permissive.
 	 */
 	if (!order) {
+		gfp_t bulk_gfp = gfp & ~__GFP_NOFAIL;
+
 		while (nr_allocated < nr_pages) {
 			unsigned int nr, nr_pages_request;
 
@@ -2861,12 +2863,12 @@ vm_area_alloc_pages(gfp_t gfp, int nid,
 			 * but mempolcy want to alloc memory by interleaving.
 			 */
 			if (IS_ENABLED(CONFIG_NUMA) && nid == NUMA_NO_NODE)
-				nr = alloc_pages_bulk_array_mempolicy(gfp,
+				nr = alloc_pages_bulk_array_mempolicy(bulk_gfp,
							nr_pages_request,
							pages + nr_allocated);
 
 			else
-				nr = alloc_pages_bulk_array_node(gfp, nid,
+				nr = alloc_pages_bulk_array_node(bulk_gfp, nid,
							nr_pages_request,
							pages + nr_allocated);
 
@@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 {
 	const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
 	const gfp_t orig_gfp_mask = gfp_mask;
+	bool nofail = gfp_mask & __GFP_NOFAIL;
 	unsigned long addr = (unsigned long)area->addr;
 	unsigned long size = get_vm_area_size(area);
 	unsigned long array_size;
@@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
 	else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
 		flags = memalloc_noio_save();
 
-	ret = vmap_pages_range(addr, addr + size, prot, area->pages,
+	do {
+		ret = vmap_pages_range(addr, addr + size, prot, area->pages,
 			page_shift);
+		if (nofail && (ret < 0))
+			schedule_timeout_uninterruptible(1);
+	} while (nofail && (ret < 0));
 
 	if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
 		memalloc_nofs_restore(flags);
@@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 					  VM_UNINITIALIZED | vm_flags, start, end, node,
 					  gfp_mask, caller);
 	if (!area) {
+		bool nofail = gfp_mask & __GFP_NOFAIL;
 		warn_alloc(gfp_mask, NULL,
-			"vmalloc error: size %lu, vm_struct allocation failed",
-			real_size);
+			"vmalloc error: size %lu, vm_struct allocation failed%s",
+			real_size, (nofail) ? ". Retrying." : "");
+		if (nofail) {
+			schedule_timeout_uninterruptible(1);
+			goto again;
+		}
 		goto fail;
 	}

--
2.30.2



2021-11-23 19:01:59

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
[...]
I have raised two concerns in our previous discussion about this change,
well that is sad...

--
Vlad Rezki

2021-11-23 20:09:46

by Michal Hocko

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote:
[...]
> I have raised two concerns in our previous discussion about this change,
> well that is sad...

I definitely didn't mean to ignore any of the feedback. IIRC we were in
disagreement about the failure mode for the retry loop - i.e. whether to
free all the allocated pages in case page table pages cannot be
allocated. I still maintain my position, and until there is a wider
consensus on that I will keep my current implementation. The other
concern was about the failure warning, but that shouldn't be a problem
when based on the current Linus tree. Am I missing something?
--
Michal Hocko
SUSE Labs

2021-11-24 01:02:44

by Andrew Morton

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <[email protected]> wrote:

> On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.

Perhaps we should tell xfs "no, do it internally". Because this is a
rather nasty-looking thing - do we want to encourage other callsites to
start using it?

> > A large part of the vmalloc implementation already complies with the
> > given gfp flags, so there is no work to be done for those paths. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space if the failure was caused by space
> > fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent hogging CPUs.
> >

Yes, the horse has already bolted. But we didn't want that horse anyway ;)

I added GFP_NOFAIL back in the mesozoic era because quite a lot of
sites were doing open-coded try-forever loops. I thought "hey, they
shouldn't be doing that in the first place, but let's at least
centralize the concept to reduce code size, code duplication and so
it's something we can now grep for". But longer term, all GFP_NOFAIL
sites should be reworked to no longer need to do the retry-forever
thing. In retrospect, this bright idea of mine seems to have added
license for more sites to use retry-forever. Sigh.

> > + if (nofail) {
> > + schedule_timeout_uninterruptible(1);
> > + goto again;
> > + }

The idea behind congestion_wait() is to prevent us from having to
hard-wire delays like this. congestion_wait(1) would sleep for up to
one millisecond, but will return earlier if reclaim events happened
which make it likely that the caller can now proceed with the
allocation event, successfully.

However it turns out that congestion_wait() was quietly broken at the
block level some time ago. We could perhaps resurrect the concept at
another level - say by releasing congestion_wait() callers if an amount
of memory newly becomes allocatable. This obviously asks for inclusion
of zone/node/etc info from the congestion_wait() caller. But that's
just an optimization - if the newly-available memory isn't useful to
the congestion_wait() caller, they just fail the allocation attempts
and wait again.
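
Concretely, the two back-off styles compare like this - a minimal sketch
only, where nofail_backoff() is a made-up helper and HZ/50 is merely a
commonly used timeout value:

<snip>
#include <linux/backing-dev.h>
#include <linux/sched.h>

/* Illustrative only: the patch's hard-wired sleep vs a reclaim-aware wait. */
static void nofail_backoff(bool reclaim_aware)
{
	if (reclaim_aware)
		/* Sleeps for up to HZ/50 jiffies but may return earlier
		 * when writeback congestion clears (modulo the breakage
		 * at the block level noted above). */
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	else
		/* What the patch does: an unconditional one-jiffy sleep. */
		schedule_timeout_uninterruptible(1);
}
<snip>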

> well that is sad...
> I have raised two concerns in our previous discussion about this change,

Can you please reiterate those concerns here?

2021-11-24 03:17:10

by NeilBrown

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, 24 Nov 2021, Andrew Morton wrote:
>
> I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> sites were doing open-coded try-forever loops. I thought "hey, they
> shouldn't be doing that in the first place, but let's at least
> centralize the concept to reduce code size, code duplication and so
> it's something we can now grep for". But longer term, all GFP_NOFAIL
> sites should be reworked to no longer need to do the retry-forever
> thing. In retrospect, this bright idea of mine seems to have added
> license for more sites to use retry-forever. Sigh.

One of the costs of not having GFP_NOFAIL (or similar) is lots of
untested failure-path code.

When does an allocation that is allowed to retry and reclaim ever fail
anyway? I think the answer is "only when it has been killed by the oom
killer". That of course cannot happen to kernel threads, so maybe
kernel threads should never need GFP_NOFAIL??

I'm not sure the above is 100%, but I do think that is the sort of
semantic that we want. We want to know what kmalloc failure *means*.
We also need well defined and documented strategies to handle it.
mempools are one such strategy, but not always suitable.
preallocating can also be useful but can be clumsy to implement. Maybe
we should support a process preallocating a bunch of pages which can
only be used by the process - and are auto-freed when the process
returns to user-space. That might allow the "error paths" to be simple
and early, and subsequent allocations that were GFP_USEPREALLOC would be
safe.
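
A rough sketch of what such a call site could look like - everything
below (task_prealloc_pages(), GFP_USEPREALLOC, do_the_work()) is
hypothetical and exists nowhere in the kernel:

<snip>
/* Hypothetical interface only -- nothing below exists today. */
static long some_syscall_worker(struct inode *inode)
{
	long err;

	/* One early, simple error path: reserve the worst case up front. */
	err = task_prealloc_pages(8);
	if (err)
		return err;

	/* Allocations deeper in the call chain draw from the per-task
	 * store and therefore cannot fail; no error paths needed here. */
	do_the_work(inode, kmalloc(256, GFP_KERNEL | GFP_USEPREALLOC));

	/* Unused pages are auto-freed on return to user-space. */
	return 0;
}
<snip>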

i.e. we need a plan for how to rework all those no-fail call-sites.

NeilBrown

2021-11-24 03:48:40

by Andrew Morton

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <[email protected]> wrote:

> On Wed, 24 Nov 2021, Andrew Morton wrote:
> >
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops. I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing. In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever. Sigh.
>
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.

Well that's bad of the relevant developers and testers! It isn't that
hard to fake up allocation failures. Either with the formal fault
injection framework or with ad-hackery.
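
The formal framework lives in <linux/fault-inject.h>, and the stock
failslab/fail_page_alloc instances already cover slab and page
allocations. Hooking it into subsystem code is a few lines - a sketch,
with fail_foo_alloc and foo_alloc() as made-up names:

<snip>
#include <linux/fault-inject.h>
#include <linux/slab.h>

static DECLARE_FAULT_ATTR(fail_foo_alloc);

static void *foo_alloc(size_t size)
{
	/* Inject a NULL return according to the attribute's probability,
	 * interval and times settings, configurable at runtime. */
	if (should_fail(&fail_foo_alloc, size))
		return NULL;

	return kmalloc(size, GFP_KERNEL);
}
<snip>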

> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer". That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??

> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want. We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

Well, mempools aren't "handling" it. They're just another layer to
make memory allocation attempts appear to be magical. The preferred
answer is "just handle the damned error and return ENOMEM".

Obviously this gets very painful at times (arguably because of
high-level design shortcomings). The old radix_tree_preload approach
was bulletproof, but was quite a lot of fuss.

> preallocating can also be useful but can be clumsy to implement. Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space. That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

Yes, I think something like that would be quite feasible. Need to
prevent interrupt code from stealing the task's page store.

It can be a drag calculating (by hand) what the max possible amount of
allocation will be and one can end up allocating and then not using a
lot of memory.

I forget why radix_tree_preload used a cpu-local store rather than a
per-task one.

Plus "what order pages would you like" and "on which node" and "in
which zone", etc...

2021-11-24 05:23:45

by NeilBrown

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, 24 Nov 2021, Andrew Morton wrote:
> On Wed, 24 Nov 2021 14:16:56 +1100 "NeilBrown" <[email protected]> wrote:
>
> > On Wed, 24 Nov 2021, Andrew Morton wrote:
> > >
> > > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > > sites were doing open-coded try-forever loops. I thought "hey, they
> > > shouldn't be doing that in the first place, but let's at least
> > > centralize the concept to reduce code size, code duplication and so
> > > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > > sites should be reworked to no longer need to do the retry-forever
> > > thing. In retrospect, this bright idea of mine seems to have added
> > > license for more sites to use retry-forever. Sigh.
> >
> > One of the costs of not having GFP_NOFAIL (or similar) is lots of
> > untested failure-path code.
>
> Well that's bad of the relevant developers and testers! It isn't that
> hard to fake up allocation failures. Either with the formal fault
> injection framework or with ad-hackery.

If we have CONFIG_RANDOM_ALLOC_PAGE_FAILURE then I might agree.
lockdep is AWESOME. It makes testing for deadlock problems *so* easy.
That is the level of ease-of-use we need if you want people to handle
page-alloc failures reliably.

>
> > When does an allocation that is allowed to retry and reclaim ever fail
> > anyway? I think the answer is "only when it has been killed by the oom
> > killer". That of course cannot happen to kernel threads, so maybe
> > kernel threads should never need GFP_NOFAIL??
>
> > I'm not sure the above is 100%, but I do think that is the sort of
> > semantic that we want. We want to know what kmalloc failure *means*.
> > We also need well defined and documented strategies to handle it.
> > mempools are one such strategy, but not always suitable.
>
> Well, mempools aren't "handling" it. They're just another layer to
> make memory allocation attempts appear to be magical. The preferred
> answer is "just handle the damned error and return ENOMEM".

No. mempools are EXACTLY handling it - in a specific context. When
writing out dirty pages so as to free up memory, you often need to
allocate memory. And you may need a sequence of allocations, typically
at different levels in the stack. Global water-marks cannot help
reliably as it might all be used up with top-level requests, and the
lower levels can starve indefinitely. mempools ensure that when memory
is freed, it is returned to the same level it was allocated for. That
ensures you can get at least one request at a time all the way out to
storage.

If swap-out just returned ENOMEM, you'd really be in trouble.
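
For reference, the pattern looks like this - a minimal sketch where
struct io_req and the pool size of 4 are illustrative; the key property
is that mempool_alloc() with a gfp_mask that allows waiting sleeps for a
returned element instead of failing:

<snip>
#include <linux/mempool.h>
#include <linux/slab.h>

struct io_req {				/* illustrative request structure */
	struct page *page;
	int op;
};

static struct kmem_cache *io_req_cache;
static mempool_t *io_req_pool;

static int io_req_pool_init(void)
{
	io_req_cache = KMEM_CACHE(io_req, 0);
	if (!io_req_cache)
		return -ENOMEM;

	/* Reserve enough objects for 4 in-flight requests, so writeback
	 * can always make forward progress one request at a time. */
	io_req_pool = mempool_create_slab_pool(4, io_req_cache);
	if (!io_req_pool) {
		kmem_cache_destroy(io_req_cache);
		return -ENOMEM;
	}
	return 0;
}

static struct io_req *io_req_get(void)
{
	/* Does not fail: once the pool is empty this waits for an
	 * element to be mempool_free()d back to it. */
	return mempool_alloc(io_req_pool, GFP_NOIO);
}
<snip>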

>
> Obviously this gets very painful at times (arguably because of
> high-level design shortcomings). The old radix_tree_preload approach
> was bulletproof, but was quite a lot of fuss.

It would get particularly painful if some system call started returning
-ENOMEM, which had never returned that before. I note that ext4 uses
__GFP_NOFAIL when handling truncate. I don't think user-space would be
happy with ENOMEM from truncate (or fallocate(PUNCH_HOLE)), though a
recent commit which adds it focuses more on wanting to avoid the need
for fsck.

>
> > preallocating can also be useful but can be clumsy to implement. Maybe
> > we should support a process preallocating a bunch of pages which can
> > only be used by the process - and are auto-freed when the process
> > returns to user-space. That might allow the "error paths" to be simple
> > and early, and subsequent allocations that were GFP_USEPREALLOC would be
> > safe.
>
> Yes, I think something like that would be quite feasible. Need to
> prevent interrupt code from stealing the task's page store.
>
> It can be a drag calculating (by hand) what the max possible amount of
> allocation will be and one can end up allocating and then not using a
> lot of memory.

CONFIG_DEBUG_PREALLOC could force every GFP_USEPREALLOC request to use
a different page, and warn if there weren't enough. Not perfect, but
useful.

Concerning the "lot of memory": having prealloc as an easy option means
people can use it until it becomes too expensive, then find a better
solution. There will likely always be a better solution, but it might
not often be worth the effort.

>
> I forget why radix_tree_preload used a cpu-local store rather than a
> per-task one.
>
> Plus "what order pages would you like" and "on which node" and "in
> which zone", etc...

"what order" - only order-0 I hope. I'd hazard a guess that 90% of
current NOFAIL allocations only need one page (providing slub is used -
slab seems to insist on high-order pages sometimes).
"which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
probably a poor choice.
"which zone" - NORMAL. I cannot find any NOFAIL allocations that want
DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that doesn't
*require* highmem.

Of course, before designing this interface too precisely we should check
if anyone can use it. From a quick look through some of the 100-ish
users of __GFP_NOFAIL I'd guess that mempools would help - the
preallocation should happen at init-time, not request-time. Maybe if we
made mempools even more lightweight .... though that risks allocating a
lot of memory that will never get used.

This brings me back to the idea that
alloc_page(wait and reclaim allowed)
should only fail on OOM_KILL. That way kernel threads are safe, and
user-threads are free to return ENOMEM knowing it won't get to
user-space. If user-thread code really needs NOFAIL, it punts to a
workqueue and waits - aborting the wait if it is killed, while the work
item still runs eventually.
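
A sketch of that punt-to-workqueue idea - all names are made up, and the
abandoned-request handoff is deliberately simplified (a real
implementation would need a lock or refcount around it):

<snip>
#include <linux/completion.h>
#include <linux/err.h>
#include <linux/slab.h>
#include <linux/workqueue.h>

struct nofail_req {
	struct work_struct work;
	struct completion done;
	size_t size;
	void *result;
	bool abandoned;
};

static void nofail_worker(struct work_struct *work)
{
	struct nofail_req *r = container_of(work, struct nofail_req, work);

	/* Runs in a kworker, which the OOM killer never selects. */
	r->result = kmalloc(r->size, GFP_KERNEL | __GFP_NOFAIL);
	if (r->abandoned) {		/* the waiter was killed */
		kfree(r->result);
		kfree(r);
	} else {
		complete(&r->done);
	}
}

static void *nofail_alloc_killable(size_t size)
{
	struct nofail_req *r = kzalloc(sizeof(*r), GFP_KERNEL);
	void *res;

	if (!r)
		return ERR_PTR(-ENOMEM);
	r->size = size;
	init_completion(&r->done);
	INIT_WORK(&r->work, nofail_worker);
	schedule_work(&r->work);

	/* Abort the wait if killed; the work item still runs eventually
	 * and cleans up after itself. */
	if (wait_for_completion_killable(&r->done)) {
		r->abandoned = true;
		return ERR_PTR(-EINTR);
	}

	res = r->result;
	kfree(r);
	return res;
}
<snip>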

NeilBrown

2021-11-24 08:44:07

by Michal Hocko

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <[email protected]> wrote:
>
> > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
>
> Perhaps we should tell xfs "no, do it internally". Because this is a
> rather nasty-looking thing - do we want to encourage other callsites to
> start using it?

This is what xfs is likely going to do if we do not provide the
functionality. I just do not see why that would be a better outcome
though. My longterm experience tells me that whenever we ignore
requirements from other subsystems, those requirements materialize in
some form in the end. In many cases done either suboptimally or outright
wrong. This might not be the case for xfs, as the quality of the
implementation is high there, but this is not the case in general.

Even if people start using vmalloc(GFP_NOFAIL) out of laziness or for
any other stupid reason, then what? Is that something we should worry
about? Retrying within the allocator doesn't make things worse. In
fact it is just easier to find such abusers by grep, which would be more
elaborate with custom retry loops.

[...]
> > > + if (nofail) {
> > > + schedule_timeout_uninterruptible(1);
> > > + goto again;
> > > + }
>
> The idea behind congestion_wait() is to prevent us from having to
> hard-wire delays like this. congestion_wait(1) would sleep for up to
> one millisecond, but will return earlier if reclaim events happened
> which make it likely that the caller can now proceed with the
> allocation event, successfully.
>
> However it turns out that congestion_wait() was quietly broken at the
> block level some time ago. We could perhaps resurrect the concept at
> another level - say by releasing congestion_wait() callers if an amount
> of memory newly becomes allocatable. This obviously asks for inclusion
> of zone/node/etc info from the congestion_wait() caller. But that's
> just an optimization - if the newly-available memory isn't useful to
> the congestion_wait() caller, they just fail the allocation attempts
> and wait again.

vmalloc has two potential failure modes. Depleted memory and vmalloc
space. So there are two different events to wait for. I do agree that
schedule_timeout_uninterruptible is both ugly and very simple but do we
really need a much more sophisticated solution at this stage?
--
Michal Hocko
SUSE Labs

2021-11-24 20:11:51

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, Nov 23, 2021 at 05:02:38PM -0800, Andrew Morton wrote:
> On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <[email protected]> wrote:
[...]
> > well that is sad...
> > I have raised two concerns in our previous discussion about this change,
>
> Can you please reiterate those concerns here?
>
1. I proposed to retry (on failure) in one single place, i.e. to get rid
of duplicating and spreading the logic across several places. This is
about simplification.

2. The second one is about unwinding: releasing everything we have just
accumulated in terms of memory consumption. If the failure occurs, the
condition we are in is low memory or high memory pressure. In this case,
since we are about to sleep for some milliseconds in order to retry
later, IMHO it makes sense to release the memory:

- to prevent killing apps or a possible OOM;
- we can end up looping for a long time, or even forever, if users do
nasty things with the vmalloc API and the __GFP_NOFAIL flag.

--
Vlad Rezki

2021-11-24 20:38:01

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
[...]
> vmalloc has two potential failure modes. Depleted memory and vmalloc
> space. So there are two different events to wait for. I do agree that
> schedule_timeout_uninterruptible is both ugly and very simple but do we
> really need a much more sophisticated solution at this stage?
>
I would say there is at least one more. It is the case when users set
their own range (start:end) in which to allocate. In that scenario we
might never return to the user, because there might not be any free vmap
space in the specified range.

To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
address space, i.e. within VMALLOC_START:VMALLOC_END. By doing so
we guarantee that we will not run out of vmap space, at least
on 64-bit systems; on smaller 32-bit ones we cannot guarantee it,
but space is given back when the lazy-free logic kicks in.

--
Vlad Rezki

2021-11-24 20:46:34

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Tue, Nov 23, 2021 at 09:09:08PM +0100, Michal Hocko wrote:
> On Tue 23-11-21 20:01:50, Uladzislau Rezki wrote:
> [...]
> > I have raised two concerns in our previous discussion about this change,
> > well that is sad...
>
> I definitely didn't mean to ignore any of the feedback. IIRC we were in
> disagreement about the failure mode for the retry loop - i.e. whether to
> free all the allocated pages in case page table pages cannot be
> allocated. I still maintain my position, and until there is a wider
> consensus on that I will keep my current implementation. The other
> concern was about the failure warning, but that shouldn't be a problem
> when based on the current Linus tree. Am I missing something?
>
Sorry, I was answering in reverse order, from the latest message toward the first :)

--
Vlad Rezki

2021-11-24 23:45:19

by Dave Chinner

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, Nov 24, 2021 at 02:16:56PM +1100, NeilBrown wrote:
> On Wed, 24 Nov 2021, Andrew Morton wrote:
> >
> > I added GFP_NOFAIL back in the mesozoic era because quite a lot of
> > sites were doing open-coded try-forever loops. I thought "hey, they
> > shouldn't be doing that in the first place, but let's at least
> > centralize the concept to reduce code size, code duplication and so
> > it's something we can now grep for". But longer term, all GFP_NOFAIL
> > sites should be reworked to no longer need to do the retry-forever
> > thing. In retrospect, this bright idea of mine seems to have added
> > license for more sites to use retry-forever. Sigh.
>
> One of the costs of not having GFP_NOFAIL (or similar) is lots of
> untested failure-path code.
>
> When does an allocation that is allowed to retry and reclaim ever fail
> anyway? I think the answer is "only when it has been killed by the oom
> killer". That of course cannot happen to kernel threads, so maybe
> kernel threads should never need GFP_NOFAIL??
>
> I'm not sure the above is 100%, but I do think that is the sort of
> semantic that we want. We want to know what kmalloc failure *means*.
> We also need well defined and documented strategies to handle it.
> mempools are one such strategy, but not always suitable.

mempools are not suitable for anything that uses demand paging to
hold an unbounded data set in memory before it can free anything.
This is basically the definition of memory demand in an XFS
transaction, and most transaction based filesystems have similar
behaviour.

> preallocating can also be useful but can be clumsy to implement. Maybe
> we should support a process preallocating a bunch of pages which can
> only be used by the process - and are auto-freed when the process
> returns to user-space. That might allow the "error paths" to be simple
> and early, and subsequent allocations that were GFP_USEPREALLOC would be
> safe.

We talked about this years ago at LSFMM (2013 or 2014, IIRC). The
problem is "how much do you preallocate when the worst case
requirement to guarantee forwards progress is at least tens of
megabytes". Considering that there might be thousands of these
contexts running concurrent at any given time and we might be
running through several million preallocation contexts a second,
suddenly preallocation is a great big CPU and memory pit.

Hence preallocation simply doesn't work when the scope to guarantee
forwards progress is in the order of megabytes (even tens of
megabytes) per "no fail" context scope. In situations like this we
need -memory reservations-, not preallocation.

Have the mm guarantee that there is a certain amount of memory
available (even if it requires reclaim to make available) before we
start the operation that cannot tolerate memory allocation failure,
track the memory usage as it goes, warn if it overruns, and release
the unused part of the reservation when context completes.

[ This is what we already do in XFS for guaranteeing forwards
progress for writing modifications into the strictly bound journal
space. We reserve space up front and use tracking tickets to account
for space used across each transaction context. This avoids
overcommit and all the deadlock and corruption problems that come
from running out of physical log space to write all the changes
we've made into the log. We could simply add memory reservations and
tracking structures to the transaction context and we've pretty much
got everything we need on the XFS side covered... ]
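
Expressed as code, a reservation interface along those lines might look
like this - entirely hypothetical; mem_reserve()/mem_reserve_end() exist
nowhere and the sizes are illustrative:

<snip>
#include <linux/err.h>
#include <linux/sizes.h>

/* Hypothetical API sketch -- none of this exists in the kernel. */
int xfs_trans_example(void)
{
	struct mem_reservation *resv;

	/* Block (reclaiming if necessary) until the worst-case memory
	 * footprint of the transaction is guaranteed to be available. */
	resv = mem_reserve(10 * SZ_1M);
	if (IS_ERR(resv))
		return PTR_ERR(resv);

	/*
	 * Run the transaction: allocations in this scope are charged
	 * against the reservation and tracked, with a warning on
	 * overrun -- just like XFS journal space tickets.
	 */

	mem_reserve_end(resv);	/* return the unused remainder */
	return 0;
}
<snip>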

> i.e. we need a plan for how to rework all those no-fail call-sites.

Even if we do make all the filesystems handle ENOMEM errors
gracefully and pass them back up to userspace, how are applications
going to react to random ENOMEM errors when doing data writes or
file creation or any other operation that accesses filesystems?

Given the way applications handle transient errors (i.e. they
don't!) propagating ENOMEM back up to userspace will result in
applications randomly failing under memory pressure. That's a much
worse situation than having to manage the _extremely rare_ issues
that occur because of __GFP_NOFAIL usage in the kernel.

Let's keep that in mind here - __GFP_NOFAIL usage is not causing
system failures in the real world, whilst propagating ENOMEM back
out to userspace is potentially very harmful to users....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2021-11-25 00:34:53

by Theodore Ts'o

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed, Nov 24, 2021 at 04:23:31PM +1100, NeilBrown wrote:
>
> It would get particularly painful if some system call started returned
> -ENOMEM, which had never returned that before. I note that ext4 uses
> __GFP_NOFAIL when handling truncate. I don't think user-space would be
> happy with ENOMEM from truncate (or fallocate(PUNHC_HOLE)), though a
> recent commit which adds it focuses more on wanting to avoid the need
> for fsck.

If the inode is in use (via an open file descriptor) when it is
unlinked, we can't actually do the truncate until the inode is
evicted, and at that point, there is no user space to return to. For
that reason, the evict_inode() method is not *allowed* to fail. So
this is why we need to use GFP_NOFAIL or an open-coded retry loop.
The alternative would be to mark the file system corrupt, and then
either remount the file system, panic the system and reboot, or leave
the file system corrupted ("don't worry, be happy"). I considered
GFP_NOFAIL to be the lesser of the evils. :-)
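
The open-coded variant is the familiar pattern below - a sketch of the
general shape, not a copy of the ext4 code:

<snip>
#include <linux/backing-dev.h>
#include <linux/sched.h>
#include <linux/slab.h>

static void *alloc_retry_forever(size_t size)
{
	void *p;

	while (!(p = kmalloc(size, GFP_NOFS))) {
		/* evict_inode() cannot return failure, so back off and
		 * retry once reclaim has had a chance to make progress. */
		cond_resched();
		congestion_wait(BLK_RW_ASYNC, HZ / 50);
	}
	return p;
}
<snip>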

If the VFS allowed evict_inode() to fail, all it could do is to put
the inode back on the list of inodes to be later evicted --- which is
to say, we would have to add a lot of complexity to effectively add a
gigantic retry loop.

Granted, we wouldn't need to be holding any locks in between retries,
so perhaps it's better than adding a retry loop deep in the guts of
the ext4 truncate codepath. But then we would need to worry about
userspace getting ENOMEM for system calls which, historically, users
have never seen fail. I suppose we could also solve this
problem by adding retry logic in the top-level VFS truncate codepath,
so instead of returning ENOMEM, we just retry the truncate(2) system
call and hope that we have enough memory to succeed this time.

After all, what can userspace do if truncate() fails with ENOMEM? It
can fail the userspace program, which in the case of a long-running
daemon such as mysqld, is basically the userspace equivalent of "panic
and reboot", or it can retry the truncate(2) system call at the
userspace level.

Are we detecting a pattern here? There will always be cases where the
choice is "panic" or "retry".

- Ted

2021-11-25 08:48:43

by Michal Hocko

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote:
[...]
> 1. I proposed to retry (on failure) in one single place, i.e. to get rid
> of duplicating and spreading the logic across several places. This is
> about simplification.

I am all for simplifications. But the presented simplification leads to 2) and ...

> 2. The second one is about unwinding: releasing everything we have just
> accumulated in terms of memory consumption. If the failure occurs, the
> condition we are in is low memory or high memory pressure. In this case,
> since we are about to sleep for some milliseconds in order to retry
> later, IMHO it makes sense to release the memory:
>
> - to prevent killing apps or a possible OOM;
> - we can end up looping for a long time, or even forever, if users do
> nasty things with the vmalloc API and the __GFP_NOFAIL flag.

... this is where we disagree, and I have tried to explain why. The primary
memory to allocate is the pages that back the vmalloc area. Failing to
allocate a few page tables - which btw. do not fail as they are order-0 -
and throwing away the whole, much more expensive work of allocating the
former is really wasteful. You have had a concern about OOM killer
invocation while retrying the page table allocation, but you should
realize that page table allocations might already invoke the OOM killer,
so that is absolutely nothing new.
--
Michal Hocko
SUSE Labs

2021-11-25 08:50:29

by Michal Hocko

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
> On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
[...]
> > vmalloc has two potential failure modes. Depleted memory and vmalloc
> > space. So there are two different events to wait for. I do agree that
> > schedule_timeout_uninterruptible is both ugly and very simple but do we
> > really need a much more sophisticated solution at this stage?
> >
> I would say there is at least one more. It is the case when users set
> their own range (start:end) in which to allocate. In that scenario we
> might never return to the user, because there might not be any free vmap
> space in the specified range.
>
> To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
> address space, i.e. within VMALLOC_START:VMALLOC_END.

How should we do that?
--
Michal Hocko
SUSE Labs

2021-11-25 18:11:36

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu, Nov 25, 2021 at 9:46 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 24-11-21 21:11:42, Uladzislau Rezki wrote:
[...]
> > 1. I proposed to retry (on failure) in one single place, i.e. to get rid
> > of duplicating and spreading the logic across several places. This is
> > about simplification.
>
> I am all for simplifications. But the presented simplification leads to 2) and ...
>
> > 2. The second one is about unwinding: releasing everything we have just
> > accumulated in terms of memory consumption. If the failure occurs, the
> > condition we are in is low memory or high memory pressure. In this case,
> > since we are about to sleep for some milliseconds in order to retry
> > later, IMHO it makes sense to release the memory:
> >
> > - to prevent killing apps or a possible OOM;
> > - we can end up looping for a long time, or even forever, if users do
> > nasty things with the vmalloc API and the __GFP_NOFAIL flag.
>
> ... this is where we disagree, and I have tried to explain why. The primary
> memory to allocate is the pages that back the vmalloc area. Failing to
> allocate a few page tables - which btw. do not fail as they are order-0 -
> and throwing away the whole, much more expensive work of allocating the
> former is really wasteful. You have had a concern about OOM killer
> invocation while retrying the page table allocation, but you should
> realize that page table allocations might already invoke the OOM killer,
> so that is absolutely nothing new.
>
We are in a slow path and this is a corner case; it means we will sleep
for many milliseconds, for example with CONFIG_HZ_100 it is 10
milliseconds. I would agree with you if it were requesting some memory
and retrying in a tight loop because of a time constraint or a
latency-sensitive workload.

Is it sensitive for any workload? If so, we definitely cannot go with
any delay there.

As for OOM, you are right. But it can also be that we are the source
that triggers it, directly or indirectly. Unwinding and cleaning up is
the maximum we can actually do here while staying fair to the OOM
killer.

Therefore I root for the simplification and the OOM-related concerns :)
But maybe there will be other opinions.

Thanks!

--
Uladzislau Rezki

2021-11-25 18:43:15

by Uladzislau Rezki

Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <[email protected]> wrote:
>
> On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
[...]
> > I would say there is at least one more. It is the case when users set
> > their own range (start:end) in which to allocate. In that scenario we
> > might never return to the user, because there might not be any free vmap
> > space in the specified range.
> >
> > To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
> > address space, i.e. within VMALLOC_START:VMALLOC_END.
>
> How should we do that?
>
<snip>
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d2a00ad4e1dd..664935bee2a2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
 		return NULL;
 	}
 
+	if (gfp_mask & __GFP_NOFAIL) {
+		if (start != VMALLOC_START || end != VMALLOC_END) {
+			gfp_mask &= ~__GFP_NOFAIL;
+			WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
+		}
+	}
+
 	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
 		unsigned long size_per_node;
<snip>

Or we could just allow the __GFP_NOFAIL flag only for the high-level
API, i.e. the __vmalloc() one, where a gfp can be passed. Because that
one uses the whole vmalloc address space, we do not need to check the
range and other parameters like align, etc. This variant is preferable.

But the problem is that there are internal functions which are publicly
available to kernel users, like __vmalloc_node_range(). In that case we
could add a big comment saying: the __GFP_NOFAIL flag can be used
__only__ with the high-level API, i.e. the __vmalloc() one.
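For illustration, the only calling pattern that would remain supported
looks like this (a minimal sketch; the wrapper function is hypothetical):

static void *must_succeed_alloc(unsigned long size)
{
	/*
	 * Fine under the proposed restriction: __vmalloc() always
	 * allocates from the whole VMALLOC_START:VMALLOC_END range,
	 * so the internal retry loop can eventually make progress.
	 */
	return __vmalloc(size, GFP_KERNEL | __GFP_NOFAIL);
}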

Any thoughts?

--
Uladzislau Rezki

2021-11-25 19:23:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu 25-11-21 19:40:56, Uladzislau Rezki wrote:
> On Thu, Nov 25, 2021 at 9:48 AM Michal Hocko <[email protected]> wrote:
> >
> > On Wed 24-11-21 21:37:54, Uladzislau Rezki wrote:
> > > On Wed, Nov 24, 2021 at 09:43:12AM +0100, Michal Hocko wrote:
> > > > On Tue 23-11-21 17:02:38, Andrew Morton wrote:
> > > > > On Tue, 23 Nov 2021 20:01:50 +0100 Uladzislau Rezki <[email protected]> wrote:
> > > > >
> > > > > > On Mon, Nov 22, 2021 at 04:32:31PM +0100, Michal Hocko wrote:
> > > > > > > From: Michal Hocko <[email protected]>
> > > > > > >
> > > > > > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > > > > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > > > > > cannot fail and they do not fit into a single page.
> > > > >
> > > > > Perhaps we should tell xfs "no, do it internally". Because this is a
> > > > > rather nasty-looking thing - do we want to encourage other callsites to
> > > > > start using it?
> > > >
> > > > This is what xfs is likely going to do if we do not provide the
> > > > functionality. I just do not see why that would be a better outcome
> > > > though. My long-term experience tells me that whenever we ignore
> > > > requirements from other subsystems, those requirements materialize in
> > > > some form in the end, in many cases done either suboptimally or outright
> > > > wrong. This might not be the case for xfs, as the quality of the
> > > > implementation is high there, but it is not the case in general.
> > > >
> > > > Even if people start using vmalloc(GFP_NOFAIL) out of laziness or for
> > > > any other stupid reason then what? Is that something we should worry
> > > > about? Retrying within the allocator doesn't make things worse. In
> > > > fact it is just easier to find such abusers by grep, which would be more
> > > > elaborate with custom retry loops.
> > > >
> > > > [...]
> > > > > > > + if (nofail) {
> > > > > > > + schedule_timeout_uninterruptible(1);
> > > > > > > + goto again;
> > > > > > > + }
> > > > >
> > > > > The idea behind congestion_wait() is to prevent us from having to
> > > > > hard-wire delays like this. congestion_wait(1) would sleep for up to
> > > > > one millisecond, but will return earlier if reclaim events happened
> > > > > which make it likely that the caller can now proceed with the
> > > > > allocation event, successfully.
> > > > >
> > > > > However it turns out that congestion_wait() was quietly broken at the
> > > > > block level some time ago. We could perhaps resurrect the concept at
> > > > > another level - say by releasing congestion_wait() callers if an amount
> > > > > of memory newly becomes allocatable. This obviously asks for inclusion
> > > > > of zone/node/etc info from the congestion_wait() caller. But that's
> > > > > just an optimization - if the newly-available memory isn't useful to
> > > > > the congestion_wait() caller, they just fail the allocation attempts
> > > > > and wait again.
> > > >
> > > > vmalloc has two potential failure modes. Depleted memory and vmalloc
> > > > space. So there are two different events to wait for. I do agree that
> > > > schedule_timeout_uninterruptible is both ugly and very simple but do we
> > > > really need a much more sophisticated solution at this stage?
> > > >
> > > I would say there is at least one more. It is about when users set their
> > > own range (start:end) in which to allocate. In that scenario we might never
> > > return to the user, because there might not be any free vmap space in the
> > > specified range.
> > >
> > > To address this, we can allow __GFP_NOFAIL only for the entire vmalloc
> > > address space, i.e. within VMALLOC_START:VMALLOC_END.
> >
> > How should we do that?
> >
> <snip>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d2a00ad4e1dd..664935bee2a2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -3029,6 +3029,13 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
>  		return NULL;
>  	}
> 
> +	if (gfp_mask & __GFP_NOFAIL) {
> +		if (start != VMALLOC_START || end != VMALLOC_END) {
> +			gfp_mask &= ~__GFP_NOFAIL;
> +			WARN_ONCE(1, "__GFP_NOFAIL is allowed only for entire vmalloc space.");
> +		}
> +	}

So the called function effectively ignores the flag, which could lead to
an actual failure, and that is something the caller has told us not to
do. I do not consider such an API great, to say the least.

> +
>  	if (vmap_allow_huge && !(vm_flags & VM_NO_HUGE_VMAP)) {
>  		unsigned long size_per_node;
> <snip>
>
> Or we could just allow the __GFP_NOFAIL flag only for the high-level
> API, i.e. the __vmalloc() one, where a gfp can be passed. Because that
> one uses the whole vmalloc address space, we do not need to check the
> range and other parameters like align, etc. This variant is preferable.
>
> But the problem is that there are internal functions which are publicly
> available to kernel users, like __vmalloc_node_range(). In that case we
> could add a big comment saying: the __GFP_NOFAIL flag can be used
> __only__ with the high-level API, i.e. the __vmalloc() one.
>
> Any thoughts?

I dunno. I find it rather ugly. We can surely document that some APIs
shouldn't be used with __GFP_NOFAIL because they could result in an
endless loop, but I find it rather subtle to change the contract under
the caller's feet and cause other problems.
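For illustration, consider a caller that passes a restricted range (a
minimal sketch; the wrapper is hypothetical and the modules range is
just one example of a sub-range of the vmalloc space):

static void *restricted_range_alloc(unsigned long size)
{
	/*
	 * The caller believes this cannot return NULL, but with the
	 * flag silently cleared above it very much can.
	 */
	return __vmalloc_node_range(size, 1, MODULES_VADDR, MODULES_END,
				    GFP_KERNEL | __GFP_NOFAIL, PAGE_KERNEL,
				    0, NUMA_NO_NODE,
				    __builtin_return_address(0));
}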

I am rather curious about other opinions, but at this moment this is
trying to handle a non-existent problem IMHO. vmalloc and, for that
matter, other allocators are not trying to be defensive in their APIs
because we assume in-kernel users to be good citizens.
--
Michal Hocko
SUSE Labs

2021-11-25 19:26:08

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
[...]
> Therefore I root for simplification and the OOM-related concerns :) But
> maybe there will be other opinions.

I have to say that I disagree with your view. I am not sure we have
another precedent where an allocator would throw away the primary
allocation just because of a metadata allocation failure.

In any case, if there is a wider consensus that we really want to do
that, I can rework. I would prefer to keep the patch as-is, though.
--
Michal Hocko
SUSE Labs

2021-11-25 20:05:59

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> [...]
> > Therefore I root for simplification and the OOM-related concerns :) But
> > maybe there will be other opinions.
>
> I have to say that I disagree with your view. I am not sure we have
> another precedent where an allocator would throw away the primary
> allocation just because of a metadata allocation failure.
>
Well, I tried to do some code review and raised some concerns and
proposals. If you do not agree, then of course I will not keep arguing
here.

--
Vlad Rezki

2021-11-25 20:15:12

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote:
> On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> > [...]
> > > Therefore I root for simplification and the OOM-related concerns :) But
> > > maybe there will be other opinions.
> >
> > I have to say that I disagree with your view. I am not sure we have
> > another precedent where an allocator would throw away the primary
> > allocation just because of a metadata allocation failure.
> >
> Well, I tried to do some code review and raised some concerns and
> proposals.

I do appreciate your review! No question about that.

I was just surprised by your reaction suggesting that your review
feedback had been ignored, because I do not think that is the case.

We were in disagreement and that is just fine. It is quite normal to
disagree. The question is whether that disagreement is fundamental and
poses a roadblock for merging. I definitely do not want or mean to push
anything by force. My previous understanding was that your concerns were
mostly about aesthetics rather than blocking further progress.
--
Michal Hocko
SUSE Labs

2021-11-25 20:26:51

by Uladzislau Rezki

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

> On Thu 25-11-21 21:03:23, Uladzislau Rezki wrote:
> > On Thu, Nov 25, 2021 at 08:24:04PM +0100, Michal Hocko wrote:
> > > On Thu 25-11-21 19:02:09, Uladzislau Rezki wrote:
> > > [...]
> > > > Therefore I root for simplification and the OOM-related concerns :) But
> > > > maybe there will be other opinions.
> > >
> > > I have to say that I disagree with your view. I am not sure we have
> > > another precedent where an allocator would throw away the primary
> > > allocation just because of a metadata allocation failure.
> > >
> > Well, I tried to do some code review and raised some concerns and
> > proposals.
>
> I do appreciate your review! No question about that.
>
> I was just surprised by your reaction suggesting that your review
> feedback had been ignored, because I do not think that is the case.
>
> We were in disagreement and that is just fine. It is quite normal to
> disagree. The question is whether that disagreement is fundamental and
> poses a roadblock for merging. I definitely do not want or mean to push
> anything by force. My previous understanding was that your concerns were
> mostly about aesthetics rather than blocking further progress.
>
It is not up to me to block you from further progress and of course that
was not my intention. You asked for a review, I did it, that is it :)

--
Vlad Rezki

2021-11-26 10:50:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <[email protected]>

Are there still any concerns around this patch or the approach in
general?
--
Michal Hocko
SUSE Labs

2021-11-26 14:52:20

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On 11/24/21 06:23, NeilBrown wrote:
>>
>> I forget why radix_tree_preload used a cpu-local store rather than a
>> per-task one.
>>
>> Plus "what order pages would you like" and "on which node" and "in
>> which zone", etc...
>
> "what order" - only order-0 I hope. I'd hazard a guess that 90% of
> current NOFAIL allocations only need one page (providing slub is used -
> slab seems to insist on high-order pages sometimes).

Yeah, AFAIK SLUB can prefer higher orders than SLAB, but it also allows
fallback to the smallest order that's sufficient (thus 0 unless the
objects are larger than a page).

> "which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
> will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> probably a poor choice.
> "which zone" - NORMAL. I cannot find any NOFAIL allocations that want
> DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that that doesn't
> *requre* highmem.
>
> Of course, before designing this interface too precisely we should check
> if anyone can use it. From a quick look through some of the 100-ish
> users of __GFP_NOFAIL I'd guess that mempools would help - the
> preallocation should happen at init-time, not request-time. Maybe if we
> made mempools even more lightweight .... though that risks allocating a
> lot of memory that will never get used.
>
> This brings me back to the idea that
> alloc_page(wait and reclaim allowed)
> should only fail on OOM_KILL. That way kernel threads are safe, and
> user-threads are free to return ENOMEM knowing it won't get to

Hm, I thought that's already pretty much the case with today's "too small
to fail" behavior. IIRC there's exactly the gotcha that an OOM kill can
result in such an allocation failure. But I believe that approach is
rather fragile. If you encounter such an allocation not checking the
resulting page != NULL, you can only guess which one is true:

- the author simply forgot to check at all
- the author relied on "too small to fail" without realizing the gotcha
- at the time of writing, the code was verified to run only in kernel
  thread context, not user context, and
  - it is still true
  - it stopped being true at some later point
- it might be hard to even decide which is the case

IIRC at some point we tried to abolish the "too small to fail" rule because
of this, but Linus rejected that. But the opposite - making it a hard
guarantee in all cases - also didn't happen, so...

> user-space. If user-thread code really needs NOFAIL, it punts to a
> workqueue and waits - aborting the wait if it is killed, while the work
> item still runs eventually.
>
> NeilBrown
>
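For illustration, a minimal sketch of the workqueue punt NeilBrown
describes above (all names are hypothetical; the "abort the wait when
killed" refinement is left out for brevity):

struct nofail_alloc_work {
	struct work_struct work;
	struct completion done;
	size_t size;
	void *ptr;
};

static void nofail_alloc_workfn(struct work_struct *work)
{
	struct nofail_alloc_work *aw =
		container_of(work, struct nofail_alloc_work, work);

	/*
	 * Runs in kworker context rather than in the user thread, so
	 * under the proposal above this allocation would not fail via
	 * the OOM-kill gotcha.
	 */
	aw->ptr = kmalloc(aw->size, GFP_KERNEL);
	complete(&aw->done);
}

static void *nofail_alloc_from_user_context(size_t size)
{
	struct nofail_alloc_work aw = { .size = size };

	init_completion(&aw.done);
	INIT_WORK_ONSTACK(&aw.work, nofail_alloc_workfn);
	schedule_work(&aw.work);
	wait_for_completion(&aw.done);
	destroy_work_on_stack(&aw.work);
	return aw.ptr;
}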


2021-11-26 15:11:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Fri 26-11-21 15:50:15, Vlastimil Babka wrote:
> On 11/24/21 06:23, NeilBrown wrote:
> >>
> >> I forget why radix_tree_preload used a cpu-local store rather than a
> >> per-task one.
> >>
> >> Plus "what order pages would you like" and "on which node" and "in
> >> which zone", etc...
> >
> > "what order" - only order-0 I hope. I'd hazard a guess that 90% of
> > current NOFAIL allocations only need one page (providing slub is used -
> > slab seems to insist on high-order pages sometimes).
>
> Yeah, AFAIK SLUB can prefer higher orders than SLAB, but it also allows
> fallback to the smallest order that's sufficient (thus 0 unless the
> objects are larger than a page).
>
> > "which node" - whichever. Unless __GFP_HARDWALL is set, alloc_page()
> > will fall-back to "whichever" anyway, and NOFAIL with HARDWALL is
> > probably a poor choice.
> > "which zone" - NORMAL. I cannot find any NOFAIL allocations that want
> > DMA. fs/ntfs asks for __GFP_HIGHMEM with NOFAIL, but that that doesn't
> > *requre* highmem.
> >
> > Of course, before designing this interface too precisely we should check
> > if anyone can use it. From a quick look through some of the 100-ish
> > users of __GFP_NOFAIL I'd guess that mempools would help - the
> > preallocation should happen at init-time, not request-time. Maybe if we
> > made mempools even more lightweight .... though that risks allocating a
> > lot of memory that will never get used.
> >
> > This brings me back to the idea that
> > alloc_page(wait and reclaim allowed)
> > should only fail on OOM_KILL. That way kernel threads are safe, and
> > user-threads are free to return ENOMEM knowing it won't get to
>
> Hm, I thought that's already pretty much the case with today's "too small
> to fail" behavior. IIRC there's exactly the gotcha that an OOM kill can
> result in such an allocation failure. But I believe that approach is
> rather fragile. If you encounter such an allocation not checking the
> resulting page != NULL, you can only guess which one is true:
>
> - the author simply forgot to check at all
> - the author relied on "too small to fail" without realizing the gotcha
> - at the time of writing, the code was verified to run only in kernel
>   thread context, not user context, and
>   - it is still true
>   - it stopped being true at some later point
> - it might be hard to even decide which is the case
>
> IIRC at some point we tried to abolish the "too small to fail" rule because
> of this, but Linus rejected that. But the opposite - making it a hard
> guarantee in all cases - also didn't happen, so...

Yeah. IMHO we should treat each missing check for allocation failure
(except for __GFP_NOFAIL) as a bug, regardless of the practical
implementation detail that small allocations do not fail. They can fail,
and we should never officially support an implicit non-fail semantic.
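For illustration, the pattern this implies for every allocation without
__GFP_NOFAIL (a minimal sketch; struct foo is a stand-in):

static int foo_create(struct foo **out)
{
	struct foo *f = kmalloc(sizeof(*f), GFP_KERNEL);

	/* even "small" allocations can fail, e.g. after an OOM kill */
	if (!f)
		return -ENOMEM;

	*out = f;
	return 0;
}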
--
Michal Hocko
SUSE Labs

2021-11-26 15:34:35

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On 11/22/21 16:32, Michal Hocko wrote:
> From: Michal Hocko <[email protected]>
>
> Dave Chinner has mentioned that some of the xfs code would benefit from
> kvmalloc support for __GFP_NOFAIL because they have allocations that
> cannot fail and they do not fit into a single page.
>
> The large part of the vmalloc implementation already complies with the
> given gfp flags so there is no work for those to be done. The area
> and page table allocations are an exception to that. Implement a retry
> loop for those.
>
> Add a short sleep before retrying. 1 jiffy is a completely random
> timeout. Ideally the retry would wait for an explicit event - e.g.
> a change to the vmalloc space change if the failure was caused by
> the space fragmentation or depletion. But there are multiple different
> reasons to retry and this could become much more complex. Keep the retry
> simple for now and just sleep to prevent from hogging CPUs.
>
> Signed-off-by: Michal Hocko <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> @@ -2921,6 +2923,7 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> {
> const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;
> const gfp_t orig_gfp_mask = gfp_mask;
> + bool nofail = gfp_mask & __GFP_NOFAIL;
> unsigned long addr = (unsigned long)area->addr;
> unsigned long size = get_vm_area_size(area);
> unsigned long array_size;
> @@ -2978,8 +2981,12 @@ static void *__vmalloc_area_node(struct vm_struct *area, gfp_t gfp_mask,
> else if ((gfp_mask & (__GFP_FS | __GFP_IO)) == 0)
> flags = memalloc_noio_save();
>
> - ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> + do {
> + ret = vmap_pages_range(addr, addr + size, prot, area->pages,
> page_shift);
> + if (nofail && (ret < 0))
> + schedule_timeout_uninterruptible(1);
> + } while (nofail && (ret < 0));

Kind of ugly to have the same condition twice here, but no hard feelings.
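For illustration, one way to evaluate the condition only once per
iteration (just a sketch of an alternative shape, not part of the patch):

	do {
		ret = vmap_pages_range(addr, addr + size, prot,
				       area->pages, page_shift);
		if (ret >= 0 || !nofail)
			break;
		schedule_timeout_uninterruptible(1);
	} while (1);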

>
> if ((gfp_mask & (__GFP_FS | __GFP_IO)) == __GFP_IO)
> memalloc_nofs_restore(flags);
> @@ -3074,9 +3081,14 @@ void *__vmalloc_node_range(unsigned long size, unsigned long align,
> VM_UNINITIALIZED | vm_flags, start, end, node,
> gfp_mask, caller);
> if (!area) {
> + bool nofail = gfp_mask & __GFP_NOFAIL;
> warn_alloc(gfp_mask, NULL,
> - "vmalloc error: size %lu, vm_struct allocation failed",
> - real_size);
> + "vmalloc error: size %lu, vm_struct allocation failed%s",
> + real_size, (nofail) ? ". Retrying." : "");
> + if (nofail) {
> + schedule_timeout_uninterruptible(1);
> + goto again;
> + }
> goto fail;
> }
>
>


2021-11-28 00:02:50

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <[email protected]> wrote:

> On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > From: Michal Hocko <[email protected]>
> >
> > Dave Chinner has mentioned that some of the xfs code would benefit from
> > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > cannot fail and they do not fit into a single page.
> >
> > The large part of the vmalloc implementation already complies with the
> > given gfp flags so there is no work for those to be done. The area
> > and page table allocations are an exception to that. Implement a retry
> > loop for those.
> >
> > Add a short sleep before retrying. 1 jiffy is a completely random
> > timeout. Ideally the retry would wait for an explicit event - e.g.
> > a change to the vmalloc space change if the failure was caused by
> > the space fragmentation or depletion. But there are multiple different
> > reasons to retry and this could become much more complex. Keep the retry
> > simple for now and just sleep to prevent from hogging CPUs.
> >
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Are there still any concerns around this patch or the approach in
> general?

Well. Like GFP_NOFAIL, every use is a sin. But I don't think I've
ever seen a real-world report of anyone hitting GFP_NOFAIL's
theoretical issues.

I assume there will be a v3?

2021-11-29 08:58:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] mm/vmalloc: add support for __GFP_NOFAIL

On Sat 27-11-21 16:00:43, Andrew Morton wrote:
> On Fri, 26 Nov 2021 11:48:46 +0100 Michal Hocko <[email protected]> wrote:
>
> > On Mon 22-11-21 16:32:31, Michal Hocko wrote:
> > > From: Michal Hocko <[email protected]>
> > >
> > > Dave Chinner has mentioned that some of the xfs code would benefit from
> > > kvmalloc support for __GFP_NOFAIL because they have allocations that
> > > cannot fail and they do not fit into a single page.
> > >
> > > The large part of the vmalloc implementation already complies with the
> > > given gfp flags so there is no work for those to be done. The area
> > > and page table allocations are an exception to that. Implement a retry
> > > loop for those.
> > >
> > > Add a short sleep before retrying. 1 jiffy is a completely random
> > > timeout. Ideally the retry would wait for an explicit event - e.g.
> > > a change to the vmalloc space change if the failure was caused by
> > > the space fragmentation or depletion. But there are multiple different
> > > reasons to retry and this could become much more complex. Keep the retry
> > > simple for now and just sleep to prevent from hogging CPUs.
> > >
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > Are there still any concerns around this patch or the approach in
> > general?
>
> Well. Like GFP_NOFAIL, every use is a sin. But I don't think I've
> ever seen a real-world report of anyone hitting GFP_NOFAIL's
> theoretical issues.

I am not sure what you mean here. If you are missing real GFP_NOFAIL use
cases then some have been mentioned in the discussion.

> I assume there will be a v3?

Currently I do not have any follow-up changes on top of either of the
patches except for acks and review tags. I can repost with those if you
prefer.

--
Michal Hocko
SUSE Labs