2021-09-15 12:08:02

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.

On Wed 15-09-21 07:48:11, Neil Brown wrote:
> On Wed, 15 Sep 2021, Mel Gorman wrote:
> > On Tue, Sep 14, 2021 at 10:13:04AM +1000, NeilBrown wrote:
> > > Indefinite loops waiting for memory allocation are discouraged by
> > > documentation in gfp.h which says the use of __GFP_NOFAIL that it
> > >
> > > is definitely preferable to use the flag rather than opencode endless
> > > loop around allocator.
> > >
> > > Such loops that use congestion_wait() are particularly unwise as
> > > congestion_wait() is indistinguishable from
> > > schedule_timeout_uninterruptible() in practice - and should be
> > > deprecated.
> > >
> > > So this patch changes the two loops in ext4_ext_truncate() to use
> > > __GFP_NOFAIL instead of looping.
> > >
> > > As the allocation is multiple layers deeper in the call stack, this
> > > requires passing the EXT4_EX_NOFAIL flag down and handling it in various
> > > places.
> > >
> > > Of particular interest is the ext4_journal_start family of calls which
> > > can now have EXT4_EX_NOFAIL 'or'ed in to the 'type'. This could be seen
> > > as a blurring of types. However 'type' is 8 bits, and EXT4_EX_NOFAIL is
> > > a high bit, so it is safe in practice.
> > >
> > > jbd2__journal_start() is enhanced so that the gfp_t flags passed are
> > > used for *all* allocations.
> > >
> > > Signed-off-by: NeilBrown <[email protected]>
> >
> > I'm not a fan. GFP_NOFAIL allows access to emergency reserves increasing
> > the risk of a livelock if memory is completely depleted where as some
> > callers can afford to wait.
>
> Maybe we should wind back and focus on the documentation patches.
> As quoted above, mm.h says:
>
> > > is definitely preferable to use the flag rather than opencode endless
> > > loop around allocator.
>
> but you seem to be saying that is wrong. I'd certainly like to get the
> documentation right before changing any code.
>
> Why does __GFP_NOFAIL access the reserves? Why not require that the
> relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> with __GFP_NOFAIL if that is justified?

Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
memory reserves") help?

I would be worried to make the semantic even more complex than already
is. Access to memory reserves is an implementation detail that the page
allocator does currently. Callers shouldn't really be worried about
that. I do not ever remember any actual NOFAIL triggered memory
exhaustion. I have seen that to happen for unrestricted access to memory
reserves by OOM victim though. Hence cd04ae1e2dc8 ("mm, oom: do not rely
on TIF_MEMDIE for memory reserves access"). We can consider something
similar if NOFAIL allocation really tend to show a similar problem. We
do not want callers to care about OOM sitauations for this kind of
requests.

__GFP_NOFAIL | __GFP_HIGH is certainly something that is a valid usage
but I wouldn't base OOM behavior based on that.

> There are over 100 __GFP_NOFAIL allocation sites. I don't feel like
> reviewing them all and seeing if any really need a try-harder flag.
> Can we rename __GFP_NOFAIL to __GFP_NEVERFAIL and then
> #define __GFP_NOFAIL (__GFP_NEVERFAIL | __GFP_ATOMIC)
> and encourage the use of __GFP_NEVERFAIL in future?

Doesn't this add even more complexity?

> When __GFP_NOFAIL loops, it calls congestion_wait() internally. That
> certainly needs to be fixed and the ideas you present below are
> certainly worth considering when trying to understand how to address
> that. I'd rather fix it once there in page_alloc.c rather then export a
> waiting API like congestion_wait(). That would provide more
> flexibility. e.g. a newly freed page could be handed directly back to
> the waiter.

Completely agreed here. We really do not want people to open code NOFAIL
unless they can do something really subsystem specific that would help
to make a forward progress.

--
Michal Hocko
SUSE Labs


2021-09-15 22:37:08

by NeilBrown

[permalink] [raw]
Subject: Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.

On Wed, 15 Sep 2021, Michal Hocko wrote:
> On Wed 15-09-21 07:48:11, Neil Brown wrote:
> >
> > Why does __GFP_NOFAIL access the reserves? Why not require that the
> > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> > with __GFP_NOFAIL if that is justified?
>
> Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
> memory reserves") help?

Yes, that helps. A bit.

I'm not fond of the clause "the allocation request might have come with some
locks held". What if it doesn't? Does it still have to pay the price.

Should we not require that the caller indicate if any locks are held?
That way callers which don't hold locks can use __GFP_NOFAIL without
worrying about imposing on other code.

Or is it so rare that __GFP_NOFAIL would be used without holding a lock
that it doesn't matter?

The other commit of interest is

Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer")

I don't find the reasoning convincing. It is a bit like "Robbing Peter
to pay Paul". It takes from the reserves to allow a __GFP_NOFAIL to
proceed, with out any reason to think this particular allocation has any
more 'right' to the reserves than anything else.

While I don't like the reasoning in either of these, they do make it
clear (to me) that the use of reserves is entirely an internal policy
decision. They should *not* be seen as part of the API and callers
should not have to be concerned about it when deciding whether to use
__GFP_NOFAIL or not.

The use of these reserves is, at most, a hypothetical problem. If it
ever looks like becoming a real practical problem, it needs to be fixed
internally to the page allocator. Maybe an extra water-mark which isn't
quite as permissive as ALLOC_HIGH...

I'm inclined to drop all references to reserves from the documentation
for __GFP_NOFAIL. I think there are enough users already that adding a
couple more isn't going to make problems substantially more likely. And
more will be added anyway that the mm/ team won't have the opportunity
or bandwidth to review.

Meanwhile I'll see if I can understand the intricacies of alloc_page so
that I can contibute to making it more predictable.

Question: In those cases where an open-coded loop is appropriate, such
as when you want to handle signals or can drop locks, how bad would it
be to have a tight loop without any sleep?
should_reclaim_retry() will sleep 100ms (sometimes...). Is that enough?
__GFP_NOFAIL doesn't add any sleep when looping.

Thanks,
NeilBrown

2021-09-16 00:38:36

by Dave Chinner

[permalink] [raw]
Subject: Re: [PATCH 3/6] EXT4: Remove ENOMEM/congestion_wait() loops.

On Thu, Sep 16, 2021 at 08:35:40AM +1000, NeilBrown wrote:
> On Wed, 15 Sep 2021, Michal Hocko wrote:
> > On Wed 15-09-21 07:48:11, Neil Brown wrote:
> > >
> > > Why does __GFP_NOFAIL access the reserves? Why not require that the
> > > relevant "Try harder" flag (__GFP_ATOMIC or __GFP_MEMALLOC) be included
> > > with __GFP_NOFAIL if that is justified?
> >
> > Does 5020e285856c ("mm, oom: give __GFP_NOFAIL allocations access to
> > memory reserves") help?
>
> Yes, that helps. A bit.
>
> I'm not fond of the clause "the allocation request might have come with some
> locks held". What if it doesn't? Does it still have to pay the price.
>
> Should we not require that the caller indicate if any locks are held?
> That way callers which don't hold locks can use __GFP_NOFAIL without
> worrying about imposing on other code.
>
> Or is it so rare that __GFP_NOFAIL would be used without holding a lock
> that it doesn't matter?
>
> The other commit of interest is
>
> Commit: 6c18ba7a1899 ("mm: help __GFP_NOFAIL allocations which do not trigger OOM killer")
>
> I don't find the reasoning convincing. It is a bit like "Robbing Peter
> to pay Paul". It takes from the reserves to allow a __GFP_NOFAIL to
> proceed, with out any reason to think this particular allocation has any
> more 'right' to the reserves than anything else.
>
> While I don't like the reasoning in either of these, they do make it
> clear (to me) that the use of reserves is entirely an internal policy
> decision. They should *not* be seen as part of the API and callers
> should not have to be concerned about it when deciding whether to use
> __GFP_NOFAIL or not.

Agree totally with this - we just want to block until allocation
succeeds, and if the -filesystem- deadlocks because allocation never
succeeds then that's a problem that needs to be solved in the
filesystem with a different memory allocation strategy...

OTOH, setting up a single __GFP_NOFAIL call site with the ability to
take the entire system down seems somewhat misguided.

> The use of these reserves is, at most, a hypothetical problem. If it
> ever looks like becoming a real practical problem, it needs to be fixed
> internally to the page allocator. Maybe an extra water-mark which isn't
> quite as permissive as ALLOC_HIGH...
>
> I'm inclined to drop all references to reserves from the documentation
> for __GFP_NOFAIL. I think there are enough users already that adding a
> couple more isn't going to make problems substantially more likely. And
> more will be added anyway that the mm/ team won't have the opportunity
> or bandwidth to review.

Yup, we've been replacing open coded loops like in kmem_alloc() with
explicit __GFP_NOFAIL usage for a while now:

$ ▶ git grep __GFP_NOFAIL fs/xfs |wc -l
33
$

ANd we've got another 100 or so call sites planned for conversion to
__GFP_NOFAIL. Hence the suggestion to remove the use of
reserves from __GFP_NOFAIL seems like a sensible plan because it has
never been necessary in the past for all the allocation sites we are
converting from open coded loops to __GFP_NOFAIL...

Cheers,

Dave.
--
Dave Chinner
[email protected]