by Vlastimil Babka

Subject: Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

On 04/14/2017 10:37 PM, Christoph Lameter wrote:
> On Thu, 13 Apr 2017, Vlastimil Babka wrote:
>
>>
>> I doubt we can change that now, because that can break existing
>> programs. It also makes some sense at least to me, because a task can
>> control its own mempolicy (for performance reasons), but cpuset changes
>> are admin decisions that the task cannot even anticipate. I think it's
>> better to continue working with suboptimal performance than start
>> failing allocations?
>
> If the expected semantics (hardwall) are that allocations should fail then
> let's be consistent and do so.

It's not "expected" right now. The documented semantics is that (static,
as the others are rebound) mempolicy is ignored when it's not compatible
with cpuset. I'm just reusing the same existing semantic for race
situations. We can discuss whether we can change the semantics now, but
I don't think it should block this fix.
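
To make the semantic described above concrete, here is a minimal sketch in
Python. It is a simplified model of "static mempolicy is ignored when it's
not compatible with cpuset", not the kernel's actual code; the function name
effective_nodes and the set-based representation of node masks are
assumptions made for illustration only.

```python
def effective_nodes(policy_nodes, cpuset_nodes):
    """Simplified model of which nodes an allocation may use.

    policy_nodes: nodes named by a static mempolicy (hypothetical input).
    cpuset_nodes: nodes currently allowed by the task's cpuset.
    """
    policy_nodes = frozenset(policy_nodes)
    cpuset_nodes = frozenset(cpuset_nodes)

    overlap = policy_nodes & cpuset_nodes
    if overlap:
        # Compatible: the policy narrows the choice within the cpuset.
        return overlap

    # Incompatible: the policy is ignored and only the cpuset applies,
    # so the allocation can still succeed instead of failing.
    return cpuset_nodes
```

With a policy of {0} and a cpuset of {2, 3} the model returns {2, 3}: the
task keeps running with suboptimal placement rather than seeing a failed
allocation.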

> Adding more and more exceptions gets this convoluted mess into an even
> worse shape.

Again, it's not a new exception semantics-wise, but I agree that the
code of __alloc_pages_slowpath() is even more subtle. But I don't see
any other easy fix.

> Adding the static binding of nodes was already a screwball
> if used within a cpuset because now one has to anticipate how a user would
> move the nodes of a cpuset and how the static bindings would work in such
> a context.

On the other hand, static mempolicy is the only one that does not need
rebinding, and removing the other modes would allow much simpler
implementation. I thought the outcome of the LSF/MM session was that we
should try to go that way.

> The admin basically needs to know how the application has used memory
> policies if one still wants to move the applications within a cpuset with
> the fixed bindings.
>
> Maybe the best way to handle this is to give up on cpuset migration of
> live applications? After all this can be done with a script in the same
> way as the kernel is doing:
>
> 1. Extend the cpuset to include the new nodes.
>
> 2. Loop over the processes and use the migrate_pages() to move the apps
> one by one.
>
> 3. Remove the nodes no longer to be used.
>
> Then forget about translating memory policies. If an application that is
> supposed to run in a cpuset and supposed to be moveable has fixed bindings
> then the application should be aware of that and be equipped with
> some logic to rebind its memory on its own.
>
> Such an application typically already has such logic and executes a
> binding after discovering its numa node configuration on startup. It would
> have to be modified to redo that action when it gets some sort of a signal
> from the script telling it that the node config would be changed.
>
> Having this logic in the application instead of the kernel avoids all the
> kernel messes that we keep on trying to deal with and IMHO is much
> cleaner.

That would be much simpler for us indeed. But we still IMHO can't
abruptly start denying page fault allocations for existing applications
that don't have the necessary awareness.
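
For reference, the three-step userspace procedure quoted above could be
planned roughly as follows. This is only a sketch in Python that builds the
list of shell commands without running them. It assumes a cpuset v1 style
"cpuset.mems" file whose path is supplied by the caller (the path used in
the usage note is hypothetical) and the migratepages tool from the numactl
package; the helper names are made up for illustration.

```python
def format_nodes(nodes):
    """Render a set of node ids as a comma-separated list, e.g. '0,1'."""
    return ",".join(str(n) for n in sorted(nodes))


def plan_cpuset_move(old_nodes, new_nodes, pids, mems_file):
    """Return the shell commands for moving a cpuset to new_nodes.

    Nothing is executed here; the caller decides what to do with the plan.
    """
    old_nodes, new_nodes = set(old_nodes), set(new_nodes)
    if not new_nodes:
        raise ValueError("new_nodes must not be empty")

    leaving = old_nodes - new_nodes
    steps = []

    # 1. Extend the cpuset to include the new nodes.
    steps.append(f"echo {format_nodes(old_nodes | new_nodes)} > {mems_file}")

    # 2. Loop over the processes and migrate their pages one by one.
    if leaving:
        for pid in pids:
            steps.append(
                f"migratepages {pid} {format_nodes(leaving)} "
                f"{format_nodes(new_nodes)}"
            )

    # 3. Remove the nodes no longer to be used.
    steps.append(f"echo {format_nodes(new_nodes)} > {mems_file}")
    return steps
```

For example, plan_cpuset_move({0, 1}, {2, 3}, [100, 200],
"/sys/fs/cgroup/cpuset/app/cpuset.mems") yields one command widening the
cpuset to 0,1,2,3, one migratepages command per pid, and a final command
shrinking the cpuset to 2,3. Note that the plan does nothing about
mempolicies, which is exactly the part that would be left to the
application.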

2017-04-30 21:33:20

by Christoph Lameter (Ampere)

Subject: Re: [RFC 1/6] mm, page_alloc: fix more premature OOM due to race with cpuset update

On Wed, 26 Apr 2017, Vlastimil Babka wrote:

> > Such an application typically already has such logic and executes a
> > binding after discovering its numa node configuration on startup. It would
> > have to be modified to redo that action when it gets some sort of a signal
> > from the script telling it that the node config would be changed.
> >
> > Having this logic in the application instead of the kernel avoids all the
> > kernel messes that we keep on trying to deal with and IMHO is much
> > cleaner.
>
> That would be much simpler for us indeed. But we still IMHO can't
> abruptly start denying page fault allocations for existing applications
> that don't have the necessary awareness.

We certainly can do that. The failure of the page faults is due to the
admin trying to move an application that is not aware of this and is using
mempols. That could be an error. Trying to move an application that
contains both absolute and relative node numbers is definitely something
that is potentially so screwed up that the kernel should not muck around
with such an app.

Also user space can determine if the application is using memory policies
and can then take appropriate measures (a message to the sysadmin to
evaluate the situation, for example) or mess around with the processes'
memory policies on its own.
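
One way user space might check this is to look at /proc/<pid>/numa_maps,
where the second field of each line names the memory policy in effect for
that mapping. The following is a rough sketch in Python; the function names
are made up for illustration, and the sample text in the usage note is
invented rather than captured from a real system.

```python
def non_default_policies(numa_maps_text):
    """Return (address, policy) pairs for mappings not using 'default'."""
    found = []
    for line in numa_maps_text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        address, policy = fields[0], fields[1]
        if policy != "default":
            found.append((address, policy))
    return found


def read_numa_maps(pid):
    """Read the numa_maps file of a process (needs suitable permissions)."""
    with open(f"/proc/{pid}/numa_maps") as f:
        return f.read()
```

Given text such as "00400000 default file=/bin/cat mapped=3 N0=3" followed
by "7f0000000000 bind:0-1 anon=10 N0=10", the first function reports only
the second mapping. A script could use a non-empty result as the trigger
for notifying the sysadmin before the cpuset is changed.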

So this is certainly a way out of this mess.