2021-04-20 01:45:28

by Shakeel Butt

Subject: [RFC] memory reserve for userspace oom-killer

Proposal: Provide memory guarantees to userspace oom-killer.

Background:

Issues with kernel oom-killer:
1. Very conservative and prefers to reclaim. Applications can suffer
for a long time.
2. Borrows the context of the allocator which can be resource limited
(low sched priority or limited CPU quota).
3. Serialized by global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through userspace oom-killer by:
1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
detect suffering early.
2. Independent process context which can be given dedicated CPU quota
and high scheduling priority.
3. Can be more aggressive as required.
4. Can implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers. One of the biggest challenges for userspace oom-killers is
that they may have to function under intense memory pressure and are
prone to getting stuck in memory reclaim themselves. Current userspace
oom-killers aim to avoid this situation by preallocating user memory
and protecting themselves from global reclaim with either mlock or
memory.min. However, a new allocation from the userspace oom-killer can
still get stuck in reclaim, and a policy-rich oom-killer does trigger
new allocations through syscalls or even the heap.

Our attempt at a userspace oom-killer faces similar challenges.
Particularly at the tail, on very highly utilized machines, we have
observed the userspace oom-killer failing spectacularly in many
possible ways in direct reclaim. We have seen the oom-killer stuck in
direct reclaim throttling, and stuck in reclaim while allocations from
interrupts kept stealing the reclaimed memory. We have even observed
systems where all the processes were stuck in
throttle_direct_reclaim(), only kswapd was running, and the interrupts
kept stealing the memory reclaimed by kswapd.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are considering the
following options and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give the userspace oom-killer (just the one thread
which finds the appropriate victims and sends the SIGKILLs) access to
the MEMALLOC reserves. Most of the time preallocation, mlock and
memory.min will be good enough, but on the rare occasions when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves.

Misuse of this feature would be risky, but it can be limited to
privileged applications, and a userspace oom-killer is the only
appropriate user of this feature. This option is simple to implement.
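
For illustration, usage from the killer thread could look roughly like
the sketch below. PR_SET_MEMALLOC and its value are invented
placeholders for the proposed interface; no such prctl exists today.

#include <errno.h>
#include <stdio.h>
#include <sys/prctl.h>

/* Hypothetical prctl command; the value is made up for this sketch. */
#ifndef PR_SET_MEMALLOC
#define PR_SET_MEMALLOC 0x6d656d01
#endif

/* Called once by the single victim-selection thread. */
static int enable_memalloc_reserves(void)
{
        if (prctl(PR_SET_MEMALLOC, 1, 0, 0, 0) < 0) {
                perror("prctl(PR_SET_MEMALLOC)");
                return -errno;
        }
        return 0;
}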

2. Mempool

The idea is to preallocate a mempool with a given amount of memory for
the userspace oom-killer. Preferably this will be per-thread, and the
oom-killer can preallocate a mempool for its specific threads. Before
going into the reclaim path, the core page allocator can check whether
the task has private access to a mempool and, if so, return a page
from it.

This option would be more complicated than the previous one as the
lifecycle of a page from the mempool would be more sophisticated.
Additionally, the current mempool does not handle higher-order pages
and we might need to extend it to allow such allocations. On the other
hand, this feature might have more use-cases and it would be less
risky than the previous option.
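
To make the allocator-side check concrete, a very rough sketch is
below. The per-task oom_mempool field and the oom_mempool_take_page()
helper are invented names for this sketch; this is not existing kernel
code.

/*
 * Hypothetical hook in the page allocator slowpath, checked before
 * entering direct reclaim.
 */
static struct page *try_task_oom_mempool(gfp_t gfp_mask, unsigned int order)
{
        struct oom_mempool *pool = current->oom_mempool;

        if (!pool)
                return NULL;

        /*
         * The existing mempool code is effectively order-0 only; the
         * higher-order case needs the extension mentioned above.
         */
        if (order > 0)
                return NULL;

        return oom_mempool_take_page(pool, gfp_mask);
}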

Another idea I had was to use a kthread-based oom-killer and provide
the policies through an eBPF program, though I am not sure how to make
it monitor arbitrary metrics and whether that can be done without any
allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel


2021-04-20 06:48:57

by Michal Hocko

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
>
> Background:
>
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
>
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
>
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.

Can you be more specific please?

> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
>
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer.

There is no such thing. Even memory reserves are a finite resource
which can be depleted, as they are shared with other users who are not
necessarily coordinated. So before we start discussing making this
even more muddy by handing over memory reserves to userspace, we
should really examine whether pre-allocation is something that will
not work.
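
For reference, the userspace-only pre-allocation pattern in question
is roughly the following minimal sketch; the 8 MiB reserve size is an
arbitrary example.

#include <string.h>
#include <sys/mman.h>

#define RESERVE_BYTES (8UL << 20)       /* arbitrary example size */

static void *reserve;

/*
 * Preallocate, lock and fault in a private buffer at daemon startup,
 * so later userspace work can be served from it without triggering
 * new page allocations.
 */
static int preallocate_reserve(void)
{
        reserve = mmap(NULL, RESERVE_BYTES, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (reserve == MAP_FAILED)
                return -1;
        if (mlock(reserve, RESERVE_BYTES))
                return -1;
        memset(reserve, 0, RESERVE_BYTES);      /* touch every page */
        return 0;
}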

> At the moment we are contemplating between
> the following options and I would like to get some feedback.
>
> 1. prctl(PF_MEMALLOC)
>
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.

I do not think that handing over an unlimited ticket to the memory
reserves to userspace is a good idea. Even the in-kernel oom killer
is bound to partial access to the reserves. So if we really want this
then it should be in sync with, and bound by, ALLOC_OOM.

> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.
>
> 2. Mempool
>
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.

Could you elaborate some more on how this would be controlled from the
userspace? A dedicated syscall? A driver?

> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

I would tend to agree.

> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.

A kernel module or eBPF to implement oom decisions has already been
discussed a few years back. But I am afraid this would be hard to wire
in for anything except the victim selection. I am not sure it is
maintainable to also control when the OOM handling should trigger.

--
Michal Hocko
SUSE Labs

2021-04-20 16:08:10

by Shakeel Butt

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <[email protected]> wrote:
>
> On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
[...]
> > memory.min. However a new allocation from userspace oom-killer can
> > still get stuck in the reclaim and policy rich oom-killer do trigger
> > new allocations through syscalls or even heap.
>
> Can you be more specific please?
>

To decide when to kill, the oom-killer has to read a lot of metrics.
It has to open a lot of files to read them, and there will definitely
be new allocations involved in those operations. For example, reading
memory.stat does a page-size allocation. Similarly, to perform an
action the oom-killer may have to read the cgroup.procs file, which
again allocates internally.
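
To make the allocation sources concrete, one monitoring pass typically
looks something like the sketch below (the cgroup path is an example).
Every open() and read() of these files can allocate on the kernel
side, even though the daemon itself allocates nothing here.

#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* One naive monitoring pass: the cgroup files are re-opened and read
 * on every cycle. */
static void poll_memcg(const char *cgroup_dir)
{
        char path[256], buf[4096];
        int fd;

        snprintf(path, sizeof(path), "%s/memory.stat", cgroup_dir);
        fd = open(path, O_RDONLY);
        if (fd >= 0) {
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;       /* parse the counters here */
                close(fd);
        }

        snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
        fd = open(path, O_RDONLY);
        if (fd >= 0) {
                while (read(fd, buf, sizeof(buf)) > 0)
                        ;       /* collect candidate pids here */
                close(fd);
        }
}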

Regarding sophisticated oom policy, I can give one example of our
cluster-level policy. For robustness, many user-facing jobs run a lot
of instances in a cluster to handle failures. Such jobs are tolerant
to some amount of failures, but they still have a requirement not to
let the number of running instances drop below some threshold.
Normally killing such jobs is fine, but we do want to make sure that
we do not violate their cluster-level agreement. So, the userspace
oom-killer may dynamically need to confirm whether such a job can be
killed.

[...]
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer.
>
> There is nothing like that. Even memory reserves are a finite resource
> which can be consumed as it is sharing those reserves with other users
> who are not necessarily coordinated. So before we start discussing
> making this even more muddy by handing over memory reserves to the
> userspace we should really examine whether pre-allocation is something
> that will not work.
>

We actually explored whether we could restrict the oom-killer to
syscalls which do not allocate memory. We concluded that this is
neither practical nor maintainable. Whatever list we come up with will
be outdated soon. In addition, converting all the must-have syscalls
to not allocate is not possible/practical.

> > At the moment we are contemplating between
> > the following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
>
> I do not think that handing over an unlimited ticket to the memory
> reserves to userspace is a good idea. Even the in kernel oom killer is
> bound to a partial access to reserves. So if we really want this then
> it should be in sync with and bound by the ALLOC_OOM.
>

Makes sense.

> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
>
> Could you elaborate some more on how this would be controlled from the
> userspace? A dedicated syscall? A driver?
>

I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign a
mempool to a thread (not shared between threads) and
prctl(RESET_MEMPOOL) to free the mempool.
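
In other words, something like the sketch below from the oom-killer's
side. Both command names and their values are placeholders for an
interface that does not exist yet, and the pool size is an arbitrary
example.

#include <sys/prctl.h>

/* Placeholder values; these prctl commands are only a proposal. */
#define PR_SET_MEMPOOL   0x6d656d02
#define PR_RESET_MEMPOOL 0x6d656d03

/* Reserve a per-thread pool for this thread only. */
static int setup_thread_mempool(void)
{
        return prctl(PR_SET_MEMPOOL, 2UL << 20, 0, 0, 0);
}

/* Give the pool back when the thread no longer needs protection. */
static int teardown_thread_mempool(void)
{
        return prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);
}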

> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> I would tend to agree.
>
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> A kernel module or eBPF to implement oom decisions has already been
> discussed few years back. But I am afraid this would be hard to wire in
> for anything except for the victim selection. I am not sure it is
> maintainable to also control when the OOM handling should trigger.
>

I think you are referring to [1]. That patch was only looking at PSI,
and I think we are on the same page that we need more information to
decide when to kill. I also agree with you that it is hard to
implement "when to kill" with eBPF, but I wanted to put the idea out
there to see if eBPF experts have some suggestions.

[1] https://lore.kernel.org/lkml/[email protected]/

thanks,
Shakeel

2021-04-20 19:19:02

by Roman Gushchin

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
>
> Background:
>
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
>
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
>
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.
>
> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
>
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer. At the moment we are contemplating between
> the following options and I would like to get some feedback.
>
> 1. prctl(PF_MEMALLOC)
>
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.
>
> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.

Hello Shakeel!

If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
the system is already in a relatively bad shape. Arguably the
userspace OOM killer should have kicked in earlier; at that point it
is already a bit too late. Allowing the use of reserves just pushes
this even further, so we're risking kernel stability for no good
reason.

But I agree that throttling the oom daemon in direct reclaim makes no
sense. I wonder if we can introduce a per-task flag which will exclude
the task from throttling, but instead all (large) allocations will
just fail more easily under significant memory pressure. In this case,
if there is a significant memory shortage the oom daemon will not be
fully functional (it will get -ENOMEM for an attempt to read some
stats, for example), but it will still be able to kill some processes
and make forward progress.
But maybe it can be done in userspace too: by splitting the daemon
into a core part and an extended part, and avoiding doing anything
beyond the bare minimum in the core part.

>
> 2. Mempool
>
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.
>
> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.

It looks like overkill for protecting the oom daemon, but if there
are other good use cases, maybe it's a good feature to have.

>
> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.

To start this effort it would be nice to understand what metrics
various oom daemons use and how easy it is to gather them from the bpf
side. I like this idea long-term, but I am not sure it has settled
down enough yet. I imagine it will require a fair amount of work on
the bpf side, so we need a good understanding of the features we need.

Thanks!

2021-04-20 19:37:54

by Suren Baghdasaryan

Subject: Re: [RFC] memory reserve for userspace oom-killer

Hi Folks,

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <[email protected]> wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> > Proposal: Provide memory guarantees to userspace oom-killer.
> >
> > Background:
> >
> > Issues with kernel oom-killer:
> > 1. Very conservative and prefer to reclaim. Applications can suffer
> > for a long time.
> > 2. Borrows the context of the allocator which can be resource limited
> > (low sched priority or limited CPU quota).
> > 3. Serialized by global lock.
> > 4. Very simplistic oom victim selection policy.
> >
> > These issues are resolved through userspace oom-killer by:
> > 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> > early detect suffering.
> > 2. Independent process context which can be given dedicated CPU quota
> > and high scheduling priority.
> > 3. Can be more aggressive as required.
> > 4. Can implement sophisticated business logic/policies.
> >
> > Android's LMKD and Facebook's oomd are the prime examples of userspace
> > oom-killers. One of the biggest challenges for userspace oom-killers
> > is to potentially function under intense memory pressure and are prone
> > to getting stuck in memory reclaim themselves. Current userspace
> > oom-killers aim to avoid this situation by preallocating user memory
> > and protecting themselves from global reclaim by either mlocking or
> > memory.min. However a new allocation from userspace oom-killer can
> > still get stuck in the reclaim and policy rich oom-killer do trigger
> > new allocations through syscalls or even heap.
> >
> > Our attempt of userspace oom-killer faces similar challenges.
> > Particularly at the tail on the very highly utilized machines we have
> > observed userspace oom-killer spectacularly failing in many possible
> > ways in the direct reclaim. We have seen oom-killer stuck in direct
> > reclaim throttling, stuck in reclaim and allocations from interrupts
> > keep stealing reclaimed memory. We have even observed systems where
> > all the processes were stuck in throttle_direct_reclaim() and only
> > kswapd was running and the interrupts kept stealing the memory
> > reclaimed by kswapd.
> >
> > To reliably solve this problem, we need to give guaranteed memory to
> > the userspace oom-killer. At the moment we are contemplating between
> > the following options and I would like to get some feedback.
> >
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
> >
> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in a relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier, it's already a bit too late.

I tend to agree here. This is how we are trying to avoid issues with
such severe memory shortages - by tuning the killer a bit more
aggressively. But a more reliable mechanism would definitely be an
improvement.

> Allowing to use reserves just pushes this even further, so we're risking
> the kernel stability for no good reason.
>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task from
> throttling, but instead all (large) allocations will just fail under a
> significant memory pressure more easily. In this case if there is a significant
> memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> for an attempt to read some stats, for example), but still will be able to kill
> some processes and make the forward progress.

This sounds like a good idea to me.

> But maybe it can be done in userspace too: by splitting the daemon into
> a core- and extended part and avoid doing anything behind bare minimum
> in the core part.
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
> >
> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> It looks like an over-kill for the oom daemon protection, but if there
> are other good use cases, maybe it's a good feature to have.
>
> >
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> To start this effort it would be nice to understand what metrics various
> oom daemons use and how easy is to gather them from the bpf side. I like
> this idea long-term, but not sure if it has been settled down enough.
> I imagine it will require a fair amount of work on the bpf side, so we
> need a good understanding of features we need.

For reference, on Android, where we do not really use memcgs, the
low-memory-killer reads global data from the meminfo, vmstat and
zoneinfo procfs nodes.
Thanks,
Suren.

>
> Thanks!

2021-04-21 01:20:42

by Shakeel Butt

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <[email protected]> wrote:
>
> On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
[...]
> > 1. prctl(PF_MEMALLOC)
> >
> > The idea is to give userspace oom-killer (just one thread which is
> > finding the appropriate victims and will be sending SIGKILLs) access
> > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > memory.min will be good enough but for rare occasions, when the
> > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > protect it from reclaim and let the allocation dip into the memory
> > reserves.
> >
> > The misuse of this feature would be risky but it can be limited to
> > privileged applications. Userspace oom-killer is the only appropriate
> > user of this feature. This option is simple to implement.
>
> Hello Shakeel!
>
> If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> the system is already in a relatively bad shape. Arguably the userspace
> OOM killer should kick in earlier, it's already a bit too late.

Please note that these are not allocation failures but rather reclaim
on allocation (which is very normal). Our observation is that this
reclaim is very unpredictable and depends on the type of memory
present on the system, which depends on the workload. If there is a
good amount of easily reclaimable memory (e.g. clean file pages), the
reclaim will be really fast. However, for other types of reclaimable
memory the reclaim time varies a lot. Unreclaimable memory, pinned
memory, too many direct reclaimers, too much isolated memory and many
other things/heuristics/assumptions make the reclaim even more
non-deterministic.

In our observation the global reclaim is very non-deterministic at the
tail and dramatically impacts the reliability of the system. We are
looking for a solution which is independent of the global reclaim.

> Allowing to use reserves just pushes this even further, so we're risking
> the kernel stability for no good reason.

Michal has suggested ALLOC_OOM which is less risky.

>
> But I agree that throttling the oom daemon in direct reclaim makes no sense.
> I wonder if we can introduce a per-task flag which will exclude the task from
> throttling, but instead all (large) allocations will just fail under a
> significant memory pressure more easily. In this case if there is a significant
> memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> for an attempt to read some stats, for example), but still will be able to kill
> some processes and make the forward progress.

So, the suggestion is to have a per-task flag to (1) indicate not to
throttle and (2) fail allocations easily on significant memory
pressure.

For (1), the challenge I see is that there are a lot of places in the
reclaim code paths where a task can get throttled. There are
filesystems that block/throttle in slab shrinking. Any process can get
blocked on an unrelated page or inode writeback within reclaim.

For (2), I am not sure how to deterministically define "significant
memory pressure". One idea is to follow the __GFP_NORETRY semantics,
and along with (1) the userspace oom-killer will see ENOMEM more
reliably instead of getting stuck in reclaim.

So, the oom-killer maintains a list of processes to kill in extreme
conditions, keeps their pidfds open and keeps that list fresh.
Whenever any syscall returns ENOMEM, it starts doing
pidfd_send_signal(SIGKILL) to that list of processes, right?

The idea has merit but I don't see how this is any simpler. (1) is
challenging on its own, and my main concern is that it will be very
hard to maintain, as the reclaim code (particularly the shrinkers)
calls back into many diverse subsystems.
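
For completeness, the pre-armed kill-list part itself is simple with
the existing pidfd_open()/pidfd_send_signal() syscalls; a sketch,
assuming a libc new enough to define SYS_pidfd_open and
SYS_pidfd_send_signal, with error handling trimmed:

#include <signal.h>
#include <sys/syscall.h>
#include <unistd.h>

#define KILL_LIST_MAX 16

/* pidfds opened ahead of time, while memory is still available. */
static int kill_list[KILL_LIST_MAX];
static int kill_list_len;

static int arm_victim(pid_t pid)
{
        int fd = syscall(SYS_pidfd_open, pid, 0);

        if (fd < 0 || kill_list_len >= KILL_LIST_MAX) {
                if (fd >= 0)
                        close(fd);
                return -1;
        }
        kill_list[kill_list_len++] = fd;
        return 0;
}

/*
 * Called when a syscall returns ENOMEM: the descriptors and the list
 * are already in place, so no new allocations are needed here.
 */
static void kill_prearmed_victims(void)
{
        for (int i = 0; i < kill_list_len; i++)
                syscall(SYS_pidfd_send_signal, kill_list[i], SIGKILL,
                        (void *)0, 0);
}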

> But maybe it can be done in userspace too: by splitting the daemon into
> a core- and extended part and avoid doing anything behind bare minimum
> in the core part.
>
> >
> > 2. Mempool
> >
> > The idea is to preallocate mempool with a given amount of memory for
> > userspace oom-killer. Preferably this will be per-thread and
> > oom-killer can preallocate mempool for its specific threads. The core
> > page allocator can check before going to the reclaim path if the task
> > has private access to the mempool and return page from it if yes.
> >
> > This option would be more complicated than the previous option as the
> > lifecycle of the page from the mempool would be more sophisticated.
> > Additionally the current mempool does not handle higher order pages
> > and we might need to extend it to allow such allocations. Though this
> > feature might have more use-cases and it would be less risky than the
> > previous option.
>
> It looks like an over-kill for the oom daemon protection, but if there
> are other good use cases, maybe it's a good feature to have.
>

IMHO it is not overkill, and it is easier to do than to remove all
potential blocking/throttling sites in memory reclaim.

> >
> > Another idea I had was to use kthread based oom-killer and provide the
> > policies through eBPF program. Though I am not sure how to make it
> > monitor arbitrary metrics and if that can be done without any
> > allocations.
>
> To start this effort it would be nice to understand what metrics various
> oom daemons use and how easy is to gather them from the bpf side. I like
> this idea long-term, but not sure if it has been settled down enough.
> I imagine it will require a fair amount of work on the bpf side, so we
> need a good understanding of features we need.
>

Are there any examples of gathering existing metrics from bpf? Suren
has given a list of metrics useful for Android. Is it possible to
gather those metrics?

BTW thanks a lot for taking a look and I really appreciate your time.

thanks,
Shakeel

2021-04-21 03:15:12

by Roman Gushchin

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, Apr 20, 2021 at 06:18:29PM -0700, Shakeel Butt wrote:
> On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> [...]
> > > 1. prctl(PF_MEMALLOC)
> > >
> > > The idea is to give userspace oom-killer (just one thread which is
> > > finding the appropriate victims and will be sending SIGKILLs) access
> > > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > > memory.min will be good enough but for rare occasions, when the
> > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > > protect it from reclaim and let the allocation dip into the memory
> > > reserves.
> > >
> > > The misuse of this feature would be risky but it can be limited to
> > > privileged applications. Userspace oom-killer is the only appropriate
> > > user of this feature. This option is simple to implement.
> >
> > Hello Shakeel!
> >
> > If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> > the system is already in a relatively bad shape. Arguably the userspace
> > OOM killer should kick in earlier, it's already a bit too late.
>
> Please note that these are not allocation failures but rather reclaim
> on allocations (which is very normal). Our observation is that this
> reclaim is very unpredictable and depends on the type of memory
> present on the system which depends on the workload. If there is a
> good amount of easily reclaimable memory (e.g. clean file pages), the
> reclaim would be really fast. However for other types of reclaimable
> memory the reclaim time varies a lot. The unreclaimable memory, pinned
> memory, too many direct reclaimers, too many isolated memory and many
> other things/heuristics/assumptions make the reclaim further
> non-deterministic.
>
> In our observation the global reclaim is very non-deterministic at the
> tail and dramatically impacts the reliability of the system. We are
> looking for a solution which is independent of the global reclaim.
>
> > Allowing to use reserves just pushes this even further, so we're risking
> > the kernel stability for no good reason.
>
> Michal has suggested ALLOC_OOM which is less risky.

The problem is that even if you serve the oom daemon task with pages
from a reserve/custom pool, it doesn't guarantee anything, because the
task can still wait for a long time on some mutex taken by another
process that is throttled somewhere in the reclaim. You're basically
trying to introduce a "higher memory priority", and as always in such
cases there will be priority inversion problems.

So I doubt that you can simply create a common mechanism which will
work flawlessly for all kinds of allocations; I anticipate many
special cases requiring an individual approach.

>
> >
> > But I agree that throttling the oom daemon in direct reclaim makes no sense.
> > I wonder if we can introduce a per-task flag which will exclude the task from
> > throttling, but instead all (large) allocations will just fail under a
> > significant memory pressure more easily. In this case if there is a significant
> > memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> > for an attempt to read some stats, for example), but still will be able to kill
> > some processes and make the forward progress.
>
> So, the suggestion is to have a per-task flag to (1) indicate to not
> throttle and (2) fail allocations easily on significant memory
> pressure.
>
> For (1), the challenge I see is that there are a lot of places in the
> reclaim code paths where a task can get throttled. There are
> filesystems that block/throttle in slab shrinking. Any process can get
> blocked on an unrelated page or inode writeback within reclaim.
>
> For (2), I am not sure how to deterministically define "significant
> memory pressure". One idea is to follow the __GFP_NORETRY semantics
> and along with (1) the userspace oom-killer will see ENOMEM more
> reliably than stucking in the reclaim.
>
> So, the oom-killer maintains a list of processes to kill in extreme
> conditions, have their pidfds open and keep that list fresh. Whenever
> any syscalls returns ENOMEM, it starts doing
> pidfd_send_signal(SIGKILL) to that list of processes, right?
>
> The idea has merit but I don't see how this is any simpler. The (1) is
> challenging on its own and my main concern is that it will be very
> hard to maintain as reclaim code (particularly shrinkers) callbacks
> into many diverse subsystems.

Yeah, I thought about something like this, but I didn't go too deep.
Basically we can emulate __GFP_NOFS | __GFP_NORETRY, but I'm not sure
we can apply it to any random allocation without bad consequences.

Btw, this approach can be easily prototyped using bpf: a bpf program
can be called on each allocation and modify the behavior based on
the pid of the process and other circumstances.

>
> > But maybe it can be done in userspace too: by splitting the daemon into
> > a core- and extended part and avoid doing anything behind bare minimum
> > in the core part.
> >
> > >
> > > 2. Mempool
> > >
> > > The idea is to preallocate mempool with a given amount of memory for
> > > userspace oom-killer. Preferably this will be per-thread and
> > > oom-killer can preallocate mempool for its specific threads. The core
> > > page allocator can check before going to the reclaim path if the task
> > > has private access to the mempool and return page from it if yes.
> > >
> > > This option would be more complicated than the previous option as the
> > > lifecycle of the page from the mempool would be more sophisticated.
> > > Additionally the current mempool does not handle higher order pages
> > > and we might need to extend it to allow such allocations. Though this
> > > feature might have more use-cases and it would be less risky than the
> > > previous option.
> >
> > It looks like an over-kill for the oom daemon protection, but if there
> > are other good use cases, maybe it's a good feature to have.
> >
>
> IMHO it is not an over-kill and easier to do then to remove all
> instances of potential blocking/throttling sites in memory reclaim.
>
> > >
> > > Another idea I had was to use kthread based oom-killer and provide the
> > > policies through eBPF program. Though I am not sure how to make it
> > > monitor arbitrary metrics and if that can be done without any
> > > allocations.
> >
> > To start this effort it would be nice to understand what metrics various
> > oom daemons use and how easy is to gather them from the bpf side. I like
> > this idea long-term, but not sure if it has been settled down enough.
> > I imagine it will require a fair amount of work on the bpf side, so we
> > need a good understanding of features we need.
> >
>
> Are there any examples of gathering existing metrics from bpf? Suren
> has given a list of metrics useful for Android. Is it possible to
> gather those metrics?

First, I have to admit that I haven't followed bpf development too
closely for the last couple of years, so my knowledge may be a bit
outdated.

But in general bpf is great when there is a fixed amount of data as
input (e.g. an skb) and a fixed output (e.g. drop/pass the packet).
There are different maps which are handy for storing persistent data
between calls.

However traversing complex data structures is way more complicated. It's
especially tricky if the data structure is not of a fixed size: bpf programs
have to be deterministic, so there are significant constraints on loops.

Just for example: it's easy to call a bpf program for each task in
the system, provide some stats/access to some fields of struct task
and expect it to return an oom score, which the kernel will then look
at to select the victim. Something like this can be done with cgroups
too.
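
A minimal sketch of that fixed-shape case, assuming a kernel with BPF
task-iterator support, libbpf and a generated vmlinux.h; the printed
fields are placeholders rather than a real oom score:

// SPDX-License-Identifier: GPL-2.0
#include "vmlinux.h"
#include <bpf/bpf_helpers.h>
#include <bpf/bpf_tracing.h>

char LICENSE[] SEC("license") = "GPL";

/* Visit every task and emit a couple of fields; a real policy would
 * compute and report an actual score here. */
SEC("iter/task")
int dump_task(struct bpf_iter__task *ctx)
{
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;

        if (!task)
                return 0;

        BPF_SEQ_PRINTF(seq, "%d %s\n", task->tgid, task->comm);
        return 0;
}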

Writing a kthread which can sleep, poll some data all over the system
and decide what to do (what oomd/... does) will be really challenging.
And, going back to guarantees, it will not provide any unless we avoid
taking any locks, which is already quite challenging.

Thanks!

2021-04-21 08:42:22

by Michal Hocko

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue 20-04-21 18:18:29, Shakeel Butt wrote:
> On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin <[email protected]> wrote:
> >
> > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> [...]
> > > 1. prctl(PF_MEMALLOC)
> > >
> > > The idea is to give userspace oom-killer (just one thread which is
> > > finding the appropriate victims and will be sending SIGKILLs) access
> > > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > > memory.min will be good enough but for rare occasions, when the
> > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > > protect it from reclaim and let the allocation dip into the memory
> > > reserves.
> > >
> > > The misuse of this feature would be risky but it can be limited to
> > > privileged applications. Userspace oom-killer is the only appropriate
> > > user of this feature. This option is simple to implement.
> >
> > Hello Shakeel!
> >
> > If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> > the system is already in a relatively bad shape. Arguably the userspace
> > OOM killer should kick in earlier, it's already a bit too late.
>
> Please note that these are not allocation failures but rather reclaim
> on allocations (which is very normal). Our observation is that this
> reclaim is very unpredictable and depends on the type of memory
> present on the system which depends on the workload. If there is a
> good amount of easily reclaimable memory (e.g. clean file pages), the
> reclaim would be really fast. However for other types of reclaimable
> memory the reclaim time varies a lot. The unreclaimable memory, pinned
> memory, too many direct reclaimers, too many isolated memory and many
> other things/heuristics/assumptions make the reclaim further
> non-deterministic.
>
> In our observation the global reclaim is very non-deterministic at the
> tail and dramatically impacts the reliability of the system. We are
> looking for a solution which is independent of the global reclaim.

I believe it is worth pursuing a solution that would make memory
reclaim more predictable. I have seen direct reclaim memory throttling
in the past. For some reason, which I haven't tried to examine, this
has become less of a problem with newer kernels. Maybe the memory
access patterns have changed, or those problems got replaced by other
issues, but excessive throttling is definitely something that we want
to address rather than work around with some user visible APIs.

> > Allowing to use reserves just pushes this even further, so we're risking
> > the kernel stability for no good reason.
>
> Michal has suggested ALLOC_OOM which is less risky.
>
> >
> > But I agree that throttling the oom daemon in direct reclaim makes no sense.
> > I wonder if we can introduce a per-task flag which will exclude the task from
> > throttling, but instead all (large) allocations will just fail under a
> > significant memory pressure more easily. In this case if there is a significant
> > memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> > for an attempt to read some stats, for example), but still will be able to kill
> > some processes and make the forward progress.
>
> So, the suggestion is to have a per-task flag to (1) indicate to not
> throttle and (2) fail allocations easily on significant memory
> pressure.
>
> For (1), the challenge I see is that there are a lot of places in the
> reclaim code paths where a task can get throttled. There are
> filesystems that block/throttle in slab shrinking. Any process can get
> blocked on an unrelated page or inode writeback within reclaim.
>
> For (2), I am not sure how to deterministically define "significant
> memory pressure". One idea is to follow the __GFP_NORETRY semantics
> and along with (1) the userspace oom-killer will see ENOMEM more
> reliably than stucking in the reclaim.

Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim
strength) could be more relaxed and rather fail than OOM kill, but
wouldn't your OOM handler be effectively dysfunctional when not able
to collect the data to make a decision?
--
Michal Hocko
SUSE Labs

2021-04-21 09:48:12

by Michal Hocko

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue 20-04-21 09:04:21, Shakeel Butt wrote:
> On Mon, Apr 19, 2021 at 11:46 PM Michal Hocko <[email protected]> wrote:
> >
> > On Mon 19-04-21 18:44:02, Shakeel Butt wrote:
> [...]
> > > memory.min. However a new allocation from userspace oom-killer can
> > > still get stuck in the reclaim and policy rich oom-killer do trigger
> > > new allocations through syscalls or even heap.
> >
> > Can you be more specific please?
> >
>
> To decide when to kill, the oom-killer has to read a lot of metrics.
> It has to open a lot of files to read them and there will definitely
> be new allocations involved in those operations. For example reading
> memory.stat does a page size allocation. Similarly, to perform action
> the oom-killer may have to read cgroup.procs file which again has
> allocation inside it.

True, but many of those can be avoided by opening the file early. At
least the seq_file based ones will not allocate later if the output
size doesn't increase, which should be the case for many. I think it
is a general improvement to push those that allocate during read to
allocate at open time instead.
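
In userspace terms the open-early pattern is simply the following
sketch (the file and buffer handling are examples):

#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static int memstat_fd = -1;

/* Open once at daemon startup, while the system is still healthy. */
static int open_metrics(const char *memstat_path)
{
        memstat_fd = open(memstat_path, O_RDONLY);
        return memstat_fd < 0 ? -1 : 0;
}

/*
 * On every poll, rewind and re-read the already-open file instead of
 * re-opening it under memory pressure; for seq_file based interfaces
 * the re-read generally does not allocate again as long as the output
 * does not grow.
 */
static ssize_t read_metrics(char *buf, size_t len)
{
        if (lseek(memstat_fd, 0, SEEK_SET) == (off_t)-1)
                return -1;
        return read(memstat_fd, buf, len);
}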

> Regarding sophisticated oom policy, I can give one example of our
> cluster level policy. For robustness, many user facing jobs run a lot
> of instances in a cluster to handle failures. Such jobs are tolerant
> to some amount of failures but they still have requirements to not let
> the number of running instances below some threshold. Normally killing
> such jobs is fine but we do want to make sure that we do not violate
> their cluster level agreement. So, the userspace oom-killer may
> dynamically need to confirm if such a job can be killed.

What kind of data do you need to examine to make those decisions?

> [...]
> > > To reliably solve this problem, we need to give guaranteed memory to
> > > the userspace oom-killer.
> >
> > There is nothing like that. Even memory reserves are a finite resource
> > which can be consumed as it is sharing those reserves with other users
> > who are not necessarily coordinated. So before we start discussing
> > making this even more muddy by handing over memory reserves to the
> > userspace we should really examine whether pre-allocation is something
> > that will not work.
> >
>
> We actually explored if we can restrict the syscalls for the
> oom-killer which does not do memory allocations. We concluded that is
> not practical and not maintainable. Whatever the list we can come up
> with will be outdated soon. In addition, converting all the must-have
> syscalls to not do allocations is not possible/practical.

I am definitely curious to learn more.

[...]
> > > 2. Mempool
> > >
> > > The idea is to preallocate mempool with a given amount of memory for
> > > userspace oom-killer. Preferably this will be per-thread and
> > > oom-killer can preallocate mempool for its specific threads. The core
> > > page allocator can check before going to the reclaim path if the task
> > > has private access to the mempool and return page from it if yes.
> >
> > Could you elaborate some more on how this would be controlled from the
> > userspace? A dedicated syscall? A driver?
> >
>
> I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool
> to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
> free the mempool.

I am not a great fan of prctl. It has become a dumping ground for all
kinds of unrelated functionality. But let's say this is a minor detail
at this stage. So you are proposing to have a per-mm memory pool that
would be used as a fallback for an allocation which cannot make
forward progress, right? Would that pool be preallocated and sitting
idle? What kind of allocations would be allowed to use the pool? What
if the pool is depleted?
--
Michal Hocko
SUSE Labs

2021-04-21 17:17:37

by Shakeel Butt

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, Apr 20, 2021 at 7:58 PM Roman Gushchin <[email protected]> wrote:
>
[...]
> >
> > Michal has suggested ALLOC_OOM which is less risky.
>
> The problem is that even if you'll serve the oom daemon task with pages
> from a reserve/custom pool, it doesn't guarantee anything, because the task
> still can wait for a long time on some mutex, taken by another process,
> throttled somewhere in the reclaim.

I am assuming that by mutex you are referring to locks which the
oom-killer might have to take to read metrics, or any possible lock
the oom-killer might have to take which some other process can take
too.

Have you observed this situation happening with oomd in production?

> You're basically trying to introduce a
> "higher memory priority" and as always in such cases there will be priority
> inversion problems.
>
> So I doubt that you can simple create a common mechanism which will work
> flawlessly for all kinds of allocations, I anticipate many special cases
> requiring an individual approach.
>
[...]
>
> First, I need to admit that I didn't follow the bpf development too close
> for last couple of years, so my knowledge can be a bit outdated.
>
> But in general bpf is great when there is a fixed amount of data as input
> (e.g. skb) and a fixed output (e.g. drop/pass the packet). There are different
> maps which are handy to store some persistent data between calls.
>
> However traversing complex data structures is way more complicated. It's
> especially tricky if the data structure is not of a fixed size: bpf programs
> have to be deterministic, so there are significant constraints on loops.
>
> Just for example: it's easy to call a bpf program for each task in the system,
> provide some stats/access to some fields of struct task and expect it to return
> an oom score, which then the kernel will look at to select the victim.
> Something like this can be done with cgroups too.
>
> Writing a kthread, which can sleep, poll some data all over the system and
> decide what to do (what oomd/... does), will be really challenging.
> And going back, it will not provide any guarantees unless we're not taking
> any locks, which is already quite challenging.
>

Thanks for the info, and I agree this direction needs much more
thought and time to materialize.

thanks,
Shakeel

2021-04-21 17:41:07

by Shakeel Butt

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <[email protected]> wrote:
>
[...]
> > To decide when to kill, the oom-killer has to read a lot of metrics.
> > It has to open a lot of files to read them and there will definitely
> > be new allocations involved in those operations. For example reading
> > memory.stat does a page size allocation. Similarly, to perform action
> > the oom-killer may have to read cgroup.procs file which again has
> > allocation inside it.
>
> True but many of those can be avoided by opening the file early. At
> least seq_file based ones will not allocate later if the output size
> doesn't increase. Which should be the case for many. I think it is a
> general improvement to push those who allocate during read to an open
> time allocation.
>

I agree that this would be a general improvement but it is not always
possible (see below).

> > Regarding sophisticated oom policy, I can give one example of our
> > cluster level policy. For robustness, many user facing jobs run a lot
> > of instances in a cluster to handle failures. Such jobs are tolerant
> > to some amount of failures but they still have requirements to not let
> > the number of running instances below some threshold. Normally killing
> > such jobs is fine but we do want to make sure that we do not violate
> > their cluster level agreement. So, the userspace oom-killer may
> > dynamically need to confirm if such a job can be killed.
>
> What kind of data do you need to examine to make those decisions?
>

Most of the time the cluster-level scheduler pushes the information
to the node controller, which transfers it to the oom-killer. However,
depending on the freshness of the information, the oom-killer might
request to pull the latest information (over IPC and RPC).

[...]
> >
> > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool
> > to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
> > free the mempool.
>
> I am not a great fan of prctl. It has become a dumping ground for all
> mix of unrelated functionality. But let's say this is a minor detail at
> this stage.

I agree this does not have to be prctl().

> So you are proposing to have a per mm mem pool that would be

I was thinking of per-task_struct instead of per-mm_struct just for simplicity.

> used as a fallback for an allocation which cannot make a forward
> progress, right?

Correct

> Would that pool be preallocated and sitting idle?

Correct

> What kind of allocations would be allowed to use the pool?

I was thinking of any type of allocation from the oom-killer (or
specific threads). Please note that the mempool is the backup and only
used in the slowpath.

> What if the pool is depleted?

This would mean that either the estimate of the mempool size is bad
or the oom-killer is buggy and leaking memory.

I am open to any design direction for the mempool, or some other way
to provide a notion of a memory guarantee to the oom-killer.

thanks,
Shakeel

2021-04-21 19:06:46

by Michal Hocko

Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed 21-04-21 06:57:43, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <[email protected]> wrote:
> >
> [...]
> > > To decide when to kill, the oom-killer has to read a lot of metrics.
> > > It has to open a lot of files to read them and there will definitely
> > > be new allocations involved in those operations. For example reading
> > > memory.stat does a page size allocation. Similarly, to perform action
> > > the oom-killer may have to read cgroup.procs file which again has
> > > allocation inside it.
> >
> > True but many of those can be avoided by opening the file early. At
> > least seq_file based ones will not allocate later if the output size
> > doesn't increase. Which should be the case for many. I think it is a
> > general improvement to push those who allocate during read to an open
> > time allocation.
> >
>
> I agree that this would be a general improvement but it is not always
> possible (see below).

It would still be great to invest in those improvements. And I would
be really grateful to learn about the bottlenecks in the existing
kernel interfaces that you have found along the way.

> > > Regarding sophisticated oom policy, I can give one example of our
> > > cluster level policy. For robustness, many user facing jobs run a lot
> > > of instances in a cluster to handle failures. Such jobs are tolerant
> > > to some amount of failures but they still have requirements to not let
> > > the number of running instances below some threshold. Normally killing
> > > such jobs is fine but we do want to make sure that we do not violate
> > > their cluster level agreement. So, the userspace oom-killer may
> > > dynamically need to confirm if such a job can be killed.
> >
> > What kind of data do you need to examine to make those decisions?
> >
>
> Most of the time the cluster level scheduler pushes the information to
> the node controller which transfers that information to the
> oom-killer. However based on the freshness of the information the
> oom-killer might request to pull the latest information (IPC and RPC).

I cannot imagine any OOM handler being reliable if it has to depend
on another userspace component with a lower resource priority. OOM
handlers are fundamentally complex components which have to reduce
their dependencies to the bare minimum.

> [...]
> > >
> > > I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool
> > > to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
> > > free the mempool.
> >
> > I am not a great fan of prctl. It has become a dumping ground for all
> > mix of unrelated functionality. But let's say this is a minor detail at
> > this stage.
>
> I agree this does not have to be prctl().
>
> > So you are proposing to have a per mm mem pool that would be
>
> I was thinking of per-task_struct instead of per-mm_struct just for simplicity.
>
> > used as a fallback for an allocation which cannot make a forward
> > progress, right?
>
> Correct
>
> > Would that pool be preallocated and sitting idle?
>
> Correct
>
> > What kind of allocations would be allowed to use the pool?
>
> I was thinking of any type of allocation from the oom-killer (or
> specific threads). Please note that the mempool is the backup and only
> used in the slowpath.
>
> > What if the pool is depleted?
>
> This would mean that either the estimate of mempool size is bad or
> oom-killer is buggy and leaking memory.
>
> I am open to any design directions for mempool or some other way where
> we can provide a notion of memory guarantee to oom-killer.

OK, thanks for the clarification. There will certainly be hard problems to
sort out[1] but the overall idea makes sense to me and it sounds like a
much better approach than an OOM-specific solution.


[1] - how the pool is going to be replenished without hitting all the
potential reclaim problems (and thus dependencies on all other tasks,
directly or indirectly), yet without relying on any background workers
to do that on the task's behalf without proper accounting etc...
--
Michal Hocko
SUSE Labs

2021-04-21 19:44:06

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 12:23 AM Michal Hocko <[email protected]> wrote:
>
[...]
> > In our observation the global reclaim is very non-deterministic at the
> > tail and dramatically impacts the reliability of the system. We are
> > looking for a solution which is independent of the global reclaim.
>
> I believe it is worth purusing a solution that would make the memory
> reclaim more predictable. I have seen direct reclaim memory throttling
> in the past. For some reason which I haven't tried to examine this has
> become less of a problem with newer kernels. Maybe the memory access
> patterns have changed or those problems got replaced by other issues but
> an excessive throttling is definitely something that we want to address
> rather than work around by some user visible APIs.
>

I agree we want to address the excessive throttling, but that has to be
done for everyone on the machine and, most importantly, it is a moving
target. The reclaim code continues to evolve and, in addition, it has
callbacks into diverse sets of subsystems.

The user visible API is for one specific use-case, i.e. the oom-killer,
and it will indirectly help in reducing the excessive throttling.

[...]
> > So, the suggestion is to have a per-task flag to (1) indicate to not
> > throttle and (2) fail allocations easily on significant memory
> > pressure.
> >
> > For (1), the challenge I see is that there are a lot of places in the
> > reclaim code paths where a task can get throttled. There are
> > filesystems that block/throttle in slab shrinking. Any process can get
> > blocked on an unrelated page or inode writeback within reclaim.
> >
> > For (2), I am not sure how to deterministically define "significant
> > memory pressure". One idea is to follow the __GFP_NORETRY semantics
> > and along with (1) the userspace oom-killer will see ENOMEM more
> > reliably than stucking in the reclaim.
>
> Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim strength)
> could be more relaxed and rather fail than OOM kill but wouldn't your
> OOM handler be effectivelly dysfunctional when not able to collect data
> to make a decision?
>

Yes it would be. Roman is suggesting a precomputed kill-list (pidfds
ready to send SIGKILL to) so that whenever the oom-killer gets ENOMEM,
it can fall back to the kill-list. Though we are still contemplating
the ways and side-effects of preferably returning ENOMEM in the
slowpath for the oom-killer, and in addition the complexity of
maintaining the kill-list and keeping it up to date.
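
A bare sketch of such a kill-list, assuming pidfds are opened while the
system is still healthy and SIGKILL is sent via pidfd_send_signal();
victim selection and list refresh are left out:

#define _GNU_SOURCE
#include <signal.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

#define KILL_LIST_MAX 16

static int kill_list[KILL_LIST_MAX];    /* pidfds, opened in advance */
static int kill_list_len;

/* Called while the system is still healthy. */
static int kill_list_add(pid_t pid)
{
    int fd = syscall(SYS_pidfd_open, pid, 0);

    if (fd < 0)
        return -1;
    if (kill_list_len >= KILL_LIST_MAX) {
        close(fd);
        return -1;
    }
    kill_list[kill_list_len++] = fd;
    return 0;
}

/* Fallback when metric collection starts failing with ENOMEM. */
static void kill_list_fire(void)
{
    int i;

    for (i = 0; i < kill_list_len; i++)
        syscall(SYS_pidfd_send_signal, kill_list[i], SIGKILL, NULL, 0);
}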

thanks,
Shakeel

2021-04-22 01:03:47

by peter enderborg

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On 4/20/21 3:44 AM, Shakeel Butt wrote:
> Proposal: Provide memory guarantees to userspace oom-killer.
>
> Background:
>
> Issues with kernel oom-killer:
> 1. Very conservative and prefer to reclaim. Applications can suffer
> for a long time.
> 2. Borrows the context of the allocator which can be resource limited
> (low sched priority or limited CPU quota).
> 3. Serialized by global lock.
> 4. Very simplistic oom victim selection policy.
>
> These issues are resolved through userspace oom-killer by:
> 1. Ability to monitor arbitrary metrics (PSI, vmstat, memcg stats) to
> early detect suffering.
> 2. Independent process context which can be given dedicated CPU quota
> and high scheduling priority.
> 3. Can be more aggressive as required.
> 4. Can implement sophisticated business logic/policies.
>
> Android's LMKD and Facebook's oomd are the prime examples of userspace
> oom-killers. One of the biggest challenges for userspace oom-killers
> is to potentially function under intense memory pressure and are prone
> to getting stuck in memory reclaim themselves. Current userspace
> oom-killers aim to avoid this situation by preallocating user memory
> and protecting themselves from global reclaim by either mlocking or
> memory.min. However a new allocation from userspace oom-killer can
> still get stuck in the reclaim and policy rich oom-killer do trigger
> new allocations through syscalls or even heap.
>
> Our attempt of userspace oom-killer faces similar challenges.
> Particularly at the tail on the very highly utilized machines we have
> observed userspace oom-killer spectacularly failing in many possible
> ways in the direct reclaim. We have seen oom-killer stuck in direct
> reclaim throttling, stuck in reclaim and allocations from interrupts
> keep stealing reclaimed memory. We have even observed systems where
> all the processes were stuck in throttle_direct_reclaim() and only
> kswapd was running and the interrupts kept stealing the memory
> reclaimed by kswapd.
>
> To reliably solve this problem, we need to give guaranteed memory to
> the userspace oom-killer. At the moment we are contemplating between
> the following options and I would like to get some feedback.
>
> 1. prctl(PF_MEMALLOC)
>
> The idea is to give userspace oom-killer (just one thread which is
> finding the appropriate victims and will be sending SIGKILLs) access
> to MEMALLOC reserves. Most of the time the preallocation, mlock and
> memory.min will be good enough but for rare occasions, when the
> userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> protect it from reclaim and let the allocation dip into the memory
> reserves.
>
> The misuse of this feature would be risky but it can be limited to
> privileged applications. Userspace oom-killer is the only appropriate
> user of this feature. This option is simple to implement.
>
> 2. Mempool
>
> The idea is to preallocate mempool with a given amount of memory for
> userspace oom-killer. Preferably this will be per-thread and
> oom-killer can preallocate mempool for its specific threads. The core
> page allocator can check before going to the reclaim path if the task
> has private access to the mempool and return page from it if yes.
>
> This option would be more complicated than the previous option as the
> lifecycle of the page from the mempool would be more sophisticated.
> Additionally the current mempool does not handle higher order pages
> and we might need to extend it to allow such allocations. Though this
> feature might have more use-cases and it would be less risky than the
> previous option.
>
> Another idea I had was to use kthread based oom-killer and provide the
> policies through eBPF program. Though I am not sure how to make it
> monitor arbitrary metrics and if that can be done without any
> allocations.
>
> Please do provide feedback on these approaches.
>
> thanks,
> Shakeel

I think this is the wrong way to go.

I sent a patch for the android lowmemorykiller some years ago.

http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html

It has been improved since then, so it can handle oom callbacks and it can act on vmpressure and psi
and as a shrinker. The patches have not been ported to recent kernels though.

I don't think vmpressure and psi are that relevant now (they are what userspace acts on). But the basic idea is to have a priority queue
within the kernel. It needs to pick up new processes and dying processes. It also has an order, which
is set with the oom adj values by the activity manager in android. I see that this model can be reused for
something that sits between the standard oom-killer and userspace. Instead of vmpressure and psi,
a watchdog might be a better way: if userspace (in android the activity manager or lmkd) does not kick the watchdog,
the watchdog bites a task according to the priority and kills it. This priority list does not have to be a list generated
within the kernel, but it has the advantage that you inherit the parent's properties. We use an rb-tree for that.

All that is missing is the watchdog.
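
A bare-bones sketch of what such an in-kernel watchdog could look like;
the pet interface and the actual kill of the top-priority victim are
left as placeholders:

#include <linux/jiffies.h>
#include <linux/module.h>
#include <linux/printk.h>
#include <linux/timer.h>

#define OOM_WD_TIMEOUT_MS 2000

static struct timer_list oom_wd_timer;

static void oom_wd_expire(struct timer_list *t)
{
    /*
     * Placeholder: pick the first task from the priority rb-tree
     * (highest oom adj) and send SIGKILL, as in the LMK code.
     */
    pr_warn("oom watchdog expired, would kill top-priority victim\n");
}

/* Called from whatever interface userspace uses to pet the watchdog. */
static void oom_wd_pet(void)
{
    mod_timer(&oom_wd_timer, jiffies + msecs_to_jiffies(OOM_WD_TIMEOUT_MS));
}

static int __init oom_wd_init(void)
{
    timer_setup(&oom_wd_timer, oom_wd_expire, 0);
    oom_wd_pet();
    return 0;
}

static void __exit oom_wd_exit(void)
{
    del_timer_sync(&oom_wd_timer);
}

module_init(oom_wd_init);
module_exit(oom_wd_exit);
MODULE_LICENSE("GPL");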




2021-04-22 01:18:18

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
<[email protected]> wrote:
>
> On 4/20/21 3:44 AM, Shakeel Butt wrote:
[...]
>
> I think this is the wrong way to go.

Which one? Are you talking about the kernel one? We have already talked
ourselves out of that. To decide when to OOM, we need to look at a very
diverse set of metrics and it seems like that would be very hard to do
flexibly inside the kernel.

>
> I sent a patch for android lowmemorykiller some years ago.
>
> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html
>
> It has been improved since than, so it can act handle oom callbacks, it can act on vmpressure and psi
> and as a shrinker. The patches has not been ported to resent kernels though.
>
> I don't think vmpressure and psi is that relevant now. (They are what userspace act on) But the basic idea is to have a priority queue
> within the kernel. It need pick up new processes and dying process. And then it has a order, and that
> is set with oom adj values by activity manager in android. I see this model can be reused for
> something that is between a standard oom and userspace. Instead of vmpressure and psi
> a watchdog might be a better way. If userspace (in android the activity manager or lmkd) does not kick the watchdog,
> the watchdog bite the task according to the priority and kills it. This priority list does not have to be a list generated
> within kernel. But it has the advantage that you inherent parents properties. We use a rb-tree for that.
>
> All that is missing is the watchdog.
>

Actually no. It is missing the flexibility to monitor the metrics a
user cares about and based on which they decide to trigger an oom-kill.
I am not sure how a watchdog would replace psi/vmpressure: userspace
continuing to pet the watchdog does not mean that the system is not suffering.
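
As an example of that kind of monitoring, a minimal PSI trigger
(upstream interface); the thresholds below are arbitrary and a real
policy would combine several signals:

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    /* Wake up if "some" memory stall exceeds 150ms within a 1s window. */
    const char trig[] = "some 150000 1000000";
    struct pollfd pfd;

    pfd.fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
    pfd.events = POLLPRI;
    if (pfd.fd < 0 || write(pfd.fd, trig, strlen(trig) + 1) < 0)
        return 1;

    for (;;) {
        if (poll(&pfd, 1, -1) > 0 && (pfd.revents & POLLPRI))
            printf("memory pressure threshold crossed\n");
        if (pfd.revents & POLLERR)
            return 1;    /* trigger unregistered / fd error */
    }
}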

In addition, oom priorities change dynamically and changing them in your
system seems very hard. Cgroup awareness is missing too.

Anyway, there are already widely deployed userspace oom-killer
solutions (lmkd, oomd). I am aiming to further improve their
reliability.

2021-04-22 01:18:24

by peter enderborg

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On 4/21/21 8:28 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
> <[email protected]> wrote:
>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
> [...]
>> I think this is the wrong way to go.
> Which one? Are you talking about the kernel one? We already talked out
> of that. To decide to OOM, we need to look at a very diverse set of
> metrics and it seems like that would be very hard to do flexibly
> inside the kernel.
You don't need to decide to oom, but when an oom occurs you
can take a proper action.
>
>> I sent a patch for android lowmemorykiller some years ago.
>>
>> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html
>>
>> It has been improved since than, so it can act handle oom callbacks, it can act on vmpressure and psi
>> and as a shrinker. The patches has not been ported to resent kernels though.
>>
>> I don't think vmpressure and psi is that relevant now. (They are what userspace act on) But the basic idea is to have a priority queue
>> within the kernel. It need pick up new processes and dying process. And then it has a order, and that
>> is set with oom adj values by activity manager in android. I see this model can be reused for
>> something that is between a standard oom and userspace. Instead of vmpressure and psi
>> a watchdog might be a better way. If userspace (in android the activity manager or lmkd) does not kick the watchdog,
>> the watchdog bite the task according to the priority and kills it. This priority list does not have to be a list generated
>> within kernel. But it has the advantage that you inherent parents properties. We use a rb-tree for that.
>>
>> All that is missing is the watchdog.
>>
> Actually no. It is missing the flexibility to monitor metrics which a
> user care and based on which they decide to trigger oom-kill. Not sure
> how will watchdog replace psi/vmpressure? Userspace keeps petting the
> watchdog does not mean that system is not suffering.

The userspace should very much keep doing what it does. But when it
does not do what it should do, including kicking the WD, then
the kernel kicks in and kills a predefined process, or as many
as needed, until the monitoring can kick in again and take back
control.

>
> In addition oom priorities change dynamically and changing it in your
> system seems very hard. Cgroup awareness is missing too.

Why is that hard? Moving an object in an rb-tree is about as good as it gets.


>
> Anyways, there are already widely deployed userspace oom-killer
> solutions (lmkd, oomd). I am aiming to further improve the
> reliability.

Yes, and I totally agree that it is needed. But I don't think
it will be possible until Linux is realtime ready, including a
memory system that can guarantee allocation times.

2021-04-22 01:19:59

by Roman Gushchin

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 06:26:37AM -0700, Shakeel Butt wrote:
> On Tue, Apr 20, 2021 at 7:58 PM Roman Gushchin <[email protected]> wrote:
> >
> [...]
> > >
> > > Michal has suggested ALLOC_OOM which is less risky.
> >
> > The problem is that even if you'll serve the oom daemon task with pages
> > from a reserve/custom pool, it doesn't guarantee anything, because the task
> > still can wait for a long time on some mutex, taken by another process,
> > throttled somewhere in the reclaim.
>
> I am assuming here by mutex you are referring to locks which
> oom-killer might have to take to read metrics or any possible lock
> which oom-killer might have to take which some other process can take
> too.
>
> Have you observed this situation happening with oomd on production?

I'm not aware of any oomd-specific issues. I'm not sure whether they exist
at all, but so far this hasn't been a problem for us. Maybe it is because
you tend to have less pagecache (as I understand), or maybe it comes down
to specific oomd policies/settings.

I know we had different pains with mmap_sem in atop and similar programs,
where reading process data stalled on mmap_sem for a long time.

Thanks!

2021-04-22 01:20:44

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 11:46 AM <[email protected]> wrote:
>
> On 4/21/21 8:28 PM, Shakeel Butt wrote:
> > On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
> > <[email protected]> wrote:
> >> On 4/20/21 3:44 AM, Shakeel Butt wrote:
> > [...]
> >> I think this is the wrong way to go.
> > Which one? Are you talking about the kernel one? We already talked out
> > of that. To decide to OOM, we need to look at a very diverse set of
> > metrics and it seems like that would be very hard to do flexibly
> > inside the kernel.
> You dont need to decide to oom, but when oom occurs you
> can take a proper action.

No, we want the flexibility to decide when to oom-kill. The kernel is very
conservative in triggering the oom-kill.

> >
[...]
> > Actually no. It is missing the flexibility to monitor metrics which a
> > user care and based on which they decide to trigger oom-kill. Not sure
> > how will watchdog replace psi/vmpressure? Userspace keeps petting the
> > watchdog does not mean that system is not suffering.
>
> The userspace should very much do what it do. But when it
> does not do what it should do, including kick the WD. Then
> the kernel kicks in and kill a pre defined process or as many
> as needed until the monitoring can start to kick and have the
> control.
>

Roman already suggested something similar (i.e. splitting the oom-killer
into a core and an extended part, with the core watching the extended one)
but completely in userspace. I don't see why we would want to do that in
the kernel instead.

> >
> > In addition oom priorities change dynamically and changing it in your
> > system seems very hard. Cgroup awareness is missing too.
>
> Why is that hard? Moving a object in a rb-tree is as good it get.
>

It is a group of objects. Anyway, that is an implementation detail.

The message I got from this exchange is that we can have a watchdog
(userspace or kernel) to further improve the reliability of userspace
oom-killers.

2021-04-22 05:41:09

by peter enderborg

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On 4/21/21 9:18 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 11:46 AM <[email protected]> wrote:
>> On 4/21/21 8:28 PM, Shakeel Butt wrote:
>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
>>> <[email protected]> wrote:
>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
>>> [...]
>>>> I think this is the wrong way to go.
>>> Which one? Are you talking about the kernel one? We already talked out
>>> of that. To decide to OOM, we need to look at a very diverse set of
>>> metrics and it seems like that would be very hard to do flexibly
>>> inside the kernel.
>> You dont need to decide to oom, but when oom occurs you
>> can take a proper action.
> No, we want the flexibility to decide when to oom-kill. Kernel is very
> conservative in triggering the oom-kill.

It won't do it for you. We use this code to solve that:

/*
 *  lowmemorykiller_oom
 *
 *  Author: Peter Enderborg <[email protected]>
 *
 *  This program is free software; you can redistribute it and/or modify
 *  it under the terms of the GNU General Public License version 2 as
 *  published by the Free Software Foundation.
 */

/* add fake print format with original module name */
#define pr_fmt(fmt) "lowmemorykiller: " fmt

#include <linux/mm.h>
#include <linux/slab.h>
#include <linux/oom.h>

#include <trace/events/lmk.h>

#include "lowmemorykiller.h"
#include "lowmemorykiller_tng.h"
#include "lowmemorykiller_stats.h"
#include "lowmemorykiller_tasks.h"

/**
 * lowmemorykiller_oom_notify - OOM notifier
 * @self:    notifier block struct
 * @notused:    not used
 * @parm:    returned - number of pages freed
 *
 * Return value:
 *    NOTIFY_OK
 **/

static int lowmemorykiller_oom_notify(struct notifier_block *self,
                      unsigned long notused, void *param)
{
    struct lmk_rb_watch *lrw;
    unsigned long *nfreed = param;

    lowmem_print(2, "oom notify event\n");
    *nfreed = 0;
    lmk_inc_stats(LMK_OOM_COUNT);
    spin_lock_bh(&lmk_task_lock);
    lrw = __lmk_task_first();
    if (lrw) {
        struct task_struct *selected = lrw->tsk;
        struct lmk_death_pending_entry *ldpt;

        if (!task_trylock_lmk(selected)) {
            lmk_inc_stats(LMK_ERROR);
            lowmem_print(1, "Failed to lock task.\n");
            lmk_inc_stats(LMK_BUSY);
            goto unlock_out;
        }

        get_task_struct(selected);
        /* move to kill pending set */
        ldpt = kmem_cache_alloc(lmk_dp_cache, GFP_ATOMIC);
        /* if we fail to alloc we ignore the death pending list */
        if (ldpt) {
            ldpt->tsk = selected;
            __lmk_death_pending_add(ldpt);
        } else {
            WARN_ON(1);
            lmk_inc_stats(LMK_MEM_ERROR);
            trace_lmk_sigkill(selected->pid, selected->comm,
                      LMK_TRACE_MEMERROR, *nfreed, 0);
        }
        if (!__lmk_task_remove(selected, lrw->key))
            WARN_ON(1);

        spin_unlock_bh(&lmk_task_lock);
        *nfreed = get_task_rss(selected);
        send_sig(SIGKILL, selected, 0);

        LMK_TAG_TASK_DIE(selected);
        trace_lmk_sigkill(selected->pid, selected->comm,
                  LMK_TRACE_OOMKILL, *nfreed,
                  0);

        task_unlock(selected);
        put_task_struct(selected);
        lmk_inc_stats(LMK_OOM_KILL_COUNT);
        goto out;
    }
unlock_out:
    spin_unlock_bh(&lmk_task_lock);
out:
    return NOTIFY_OK;
}

static struct notifier_block lowmemorykiller_oom_nb = {
    .notifier_call = lowmemorykiller_oom_notify
};

int __init lowmemorykiller_register_oom_notifier(void)
{
    register_oom_notifier(&lowmemorykiller_oom_nb);
    return 0;
}


So what is needed is a function that knows the
priority; here it is __lmk_task_first(), which walks an rb-tree.

You can pick whatever priority you like. In our case it is
android, so the tree is kept strictly in oom_adj order.

I think you can do the same with the old lowmemorykiller style, with
a full task scan, too.
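
A rough sketch of such an rb-tree keyed on oom_score_adj, with a first()
helper in the spirit of __lmk_task_first(); the names are made up and
locking/removal are omitted:

#include <linux/rbtree.h>
#include <linux/sched.h>

struct lmk_entry {
    struct rb_node node;
    struct task_struct *tsk;
    int key;            /* oom_score_adj at insert time */
};

static struct rb_root lmk_root = RB_ROOT;

static void lmk_insert(struct lmk_entry *new)
{
    struct rb_node **link = &lmk_root.rb_node, *parent = NULL;

    while (*link) {
        struct lmk_entry *e = rb_entry(*link, struct lmk_entry, node);

        parent = *link;
        /* Higher scores sort to the left so rb_first() is the victim. */
        link = (new->key > e->key) ? &(*link)->rb_left : &(*link)->rb_right;
    }
    rb_link_node(&new->node, parent, link);
    rb_insert_color(&new->node, &lmk_root);
}

static struct lmk_entry *lmk_first(void)
{
    struct rb_node *n = rb_first(&lmk_root);

    return n ? rb_entry(n, struct lmk_entry, node) : NULL;
}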

> [...]
>>> Actually no. It is missing the flexibility to monitor metrics which a
>>> user care and based on which they decide to trigger oom-kill. Not sure
>>> how will watchdog replace psi/vmpressure? Userspace keeps petting the
>>> watchdog does not mean that system is not suffering.
>> The userspace should very much do what it do. But when it
>> does not do what it should do, including kick the WD. Then
>> the kernel kicks in and kill a pre defined process or as many
>> as needed until the monitoring can start to kick and have the
>> control.
>>
> Roman already suggested something similar (i.e. oom-killer core and
> extended and core watching extended) but completely in userspace. I
> don't see why we would want to do that in the kernel instead.

A watchdog in the kernel will work even if userspace is completely broken
or starved when memory is low.


>
>>> In addition oom priorities change dynamically and changing it in your
>>> system seems very hard. Cgroup awareness is missing too.
>> Why is that hard? Moving a object in a rb-tree is as good it get.
>>
> It is a group of objects. Anyways that is implementation detail.
>
> The message I got from this exchange is that we can have a watchdog
> (userspace or kernel) to further improve the reliability of userspace
> oom-killers.

2021-04-22 12:35:56

by peter enderborg

[permalink] [raw]
Subject: [RFC PATCH] Android OOM helper proof of concept

On 4/21/21 4:29 PM, Michal Hocko wrote:
> On Wed 21-04-21 06:57:43, Shakeel Butt wrote:
>> On Wed, Apr 21, 2021 at 12:16 AM Michal Hocko <[email protected]> wrote:
>> [...]
>>>> To decide when to kill, the oom-killer has to read a lot of metrics.
>>>> It has to open a lot of files to read them and there will definitely
>>>> be new allocations involved in those operations. For example reading
>>>> memory.stat does a page size allocation. Similarly, to perform action
>>>> the oom-killer may have to read cgroup.procs file which again has
>>>> allocation inside it.
>>> True but many of those can be avoided by opening the file early. At
>>> least seq_file based ones will not allocate later if the output size
>>> doesn't increase. Which should be the case for many. I think it is a
>>> general improvement to push those who allocate during read to an open
>>> time allocation.
>>>
>> I agree that this would be a general improvement but it is not always
>> possible (see below).
> It would be still great to invest into those improvements. And I would
> be really grateful to learn about bottlenecks from the existing kernel
> interfaces you have found on the way.
>
>>>> Regarding sophisticated oom policy, I can give one example of our
>>>> cluster level policy. For robustness, many user facing jobs run a lot
>>>> of instances in a cluster to handle failures. Such jobs are tolerant
>>>> to some amount of failures but they still have requirements to not let
>>>> the number of running instances below some threshold. Normally killing
>>>> such jobs is fine but we do want to make sure that we do not violate
>>>> their cluster level agreement. So, the userspace oom-killer may
>>>> dynamically need to confirm if such a job can be killed.
>>> What kind of data do you need to examine to make those decisions?
>>>
>> Most of the time the cluster level scheduler pushes the information to
>> the node controller which transfers that information to the
>> oom-killer. However based on the freshness of the information the
>> oom-killer might request to pull the latest information (IPC and RPC).
> I cannot imagine any OOM handler to be reliable if it has to depend on
> other userspace component with a lower resource priority. OOM handlers
> are fundamentally complex components which has to reduce their
> dependencies to the bare minimum.


I think we very much need an OOM killer that can help out,
but it is essential that it also plays by android rules.

This is an RFC patch that interacts with the OOM path:

From 09f3a2e401d4ed77e95b7cea7edb7c5c3e6a0c62 Mon Sep 17 00:00:00 2001
From: Peter Enderborg <[email protected]>
Date: Thu, 22 Apr 2021 14:15:46 +0200
Subject: [PATCH] mm/oom: Android oomhelper

This is a proof of concept of a pre-oom-killer that kills tasks
strictly in oom-score-adj order if the score is positive.

It acts as a lifeline when userspace does not have optimal performance.
---
 drivers/staging/Makefile              |  1 +
 drivers/staging/oomhelper/Makefile    |  2 +
 drivers/staging/oomhelper/oomhelper.c | 65 +++++++++++++++++++++++++++
 mm/oom_kill.c                         |  4 +-
 4 files changed, 70 insertions(+), 2 deletions(-)
 create mode 100644 drivers/staging/oomhelper/Makefile
 create mode 100644 drivers/staging/oomhelper/oomhelper.c

diff --git a/drivers/staging/Makefile b/drivers/staging/Makefile
index 2245059e69c7..4a5449b42568 100644
--- a/drivers/staging/Makefile
+++ b/drivers/staging/Makefile
@@ -47,3 +47,4 @@ obj-$(CONFIG_QLGE)        += qlge/
 obj-$(CONFIG_WIMAX)        += wimax/
 obj-$(CONFIG_WFX)        += wfx/
 obj-y                += hikey9xx/
+obj-y                += oomhelper/
diff --git a/drivers/staging/oomhelper/Makefile b/drivers/staging/oomhelper/Makefile
new file mode 100644
index 000000000000..ee9b361957f8
--- /dev/null
+++ b/drivers/staging/oomhelper/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0
+obj-y    += oomhelper.o
diff --git a/drivers/staging/oomhelper/oomhelper.c b/drivers/staging/oomhelper/oomhelper.c
new file mode 100644
index 000000000000..5a3fe0270cb8
--- /dev/null
+++ b/drivers/staging/oomhelper/oomhelper.c
@@ -0,0 +1,65 @@
+// SPDX-License-Identifier: GPL-2.0
+/* proof of concept of an android-aware oom killer */
+/* Author: [email protected] */
+
+#include <linux/kernel.h>
+#include <linux/mm.h>
+#include <linux/slab.h>
+#include <linux/oom.h>
+void wake_oom_reaper(struct task_struct *tsk); /* need to public ... */
+void __oom_kill_process(struct task_struct *victim, const char *message);
+
+static int oomhelper_oom_notify(struct notifier_block *self,
+                      unsigned long notused, void *param)
+{
+  struct task_struct *tsk;
+  struct task_struct *selected = NULL;
+  int highest = 0;
+
+  pr_info("invoked");
+  rcu_read_lock();
+  for_each_process(tsk) {
+      struct task_struct *candidate;
+      if (tsk->flags & PF_KTHREAD)
+          continue;
+
+      /* Ignore task if coredump in progress */
+      if (tsk->mm && tsk->mm->core_state)
+          continue;
+      candidate = find_lock_task_mm(tsk);
+      if (!candidate)
+          continue;
+
+      if (highest < candidate->signal->oom_score_adj) {
+          /* for test dont kill level 0 */
+          highest = candidate->signal->oom_score_adj;
+          selected = candidate;
+          pr_info("new selected %d %d", selected->pid,
+              selected->signal->oom_score_adj);
+      }
+      task_unlock(candidate);
+  }
+  if (selected) {
+      get_task_struct(selected);
+  }
+  rcu_read_unlock();
+  if (selected) {
+      pr_info("oomhelper killing: %d", selected->pid);
+      __oom_kill_process(selected, "oomhelper");
+  }
+
+  return NOTIFY_OK;
+}
+
+static struct notifier_block oomhelper_oom_nb = {
+    .notifier_call = oomhelper_oom_notify
+};
+
+int __init oomhelper_register_oom_notifier(void)
+{
+    register_oom_notifier(&oomhelper_oom_nb);
+    pr_info("oomhelper installed");
+    return 0;
+}
+
+subsys_initcall(oomhelper_register_oom_notifier);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index fa1cf18bac97..a5f7299af9a3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -658,7 +658,7 @@ static int oom_reaper(void *unused)
     return 0;
 }
 
-static void wake_oom_reaper(struct task_struct *tsk)
+void wake_oom_reaper(struct task_struct *tsk)
 {
     /* mm is already queued? */
     if (test_and_set_bit(MMF_OOM_REAP_QUEUED, &tsk->signal->oom_mm->flags))
@@ -856,7 +856,7 @@ static bool task_will_free_mem(struct task_struct *task)
     return ret;
 }
 
-static void __oom_kill_process(struct task_struct *victim, const char *message)
+void __oom_kill_process(struct task_struct *victim, const char *message)
 {
     struct task_struct *p;
     struct mm_struct *mm;
--
2.17.1

Is that something that might be accepted?

It uses the oom notifications, and that is no problem, I guess.

But it also calls some oom-kill functions that are not exported.


>
>> [...]
>>>> I was thinking of simply prctl(SET_MEMPOOL, bytes) to assign mempool
>>>> to a thread (not shared between threads) and prctl(RESET_MEMPOOL) to
>>>> free the mempool.
>>> I am not a great fan of prctl. It has become a dumping ground for all
>>> mix of unrelated functionality. But let's say this is a minor detail at
>>> this stage.
>> I agree this does not have to be prctl().
>>
>>> So you are proposing to have a per mm mem pool that would be
>> I was thinking of per-task_struct instead of per-mm_struct just for simplicity.
>>
>>> used as a fallback for an allocation which cannot make a forward
>>> progress, right?
>> Correct
>>
>>> Would that pool be preallocated and sitting idle?
>> Correct
>>
>>> What kind of allocations would be allowed to use the pool?
>> I was thinking of any type of allocation from the oom-killer (or
>> specific threads). Please note that the mempool is the backup and only
>> used in the slowpath.
>>
>>> What if the pool is depleted?
>> This would mean that either the estimate of mempool size is bad or
>> oom-killer is buggy and leaking memory.
>>
>> I am open to any design directions for mempool or some other way where
>> we can provide a notion of memory guarantee to oom-killer.
> OK, thanks for clarification. There will certainly be hard problems to
> sort out[1] but the overall idea makes sense to me and it sounds like a
> much better approach than a OOM specific solution.
>
>
> [1] - how the pool is going to be replenished without hitting all
> potential reclaim problems (thus dependencies on other all tasks
> directly/indirectly) yet to not rely on any background workers to do
> that on the task behalf without a proper accounting etc...


2021-04-22 13:05:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC PATCH] Android OOM helper proof of concept

On Thu 22-04-21 14:33:45, peter enderborg wrote:
[...]
> I think we very much need a OOM killer that can help out,
> but it is essential that it also play with android rules.

This is completely off topic to the discussion here.
--
Michal Hocko
SUSE Labs

2021-04-22 13:10:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed 21-04-21 19:05:49, peter enderborg wrote:
[...]
> I think this is the wrong way to go.
>
> I sent a patch for android lowmemorykiller some years ago.
>
> http://driverdev.linuxdriverproject.org/pipermail/driverdev-devel/2017-February/100319.html
>
> It has been improved since than, so it can act handle oom callbacks, it can act on vmpressure and psi
> and as a shrinker. The patches has not been ported to resent kernels though.
>
> I don't think vmpressure and psi is that relevant now. (They are what userspace act on)  But the basic idea is to have a priority queue
> within the kernel. It need pick up new processes and dying process.  And then it has a order, and that
> is set with oom adj values by activity manager in android.  I see this model can be reused for
> something that is between a standard oom and userspace.  Instead of vmpressure and psi
> a watchdog might be a better way.  If userspace (in android the activity manager or lmkd) does not kick the watchdog,
> the watchdog bite the task according to the priority and kills it.  This priority list does not have to be a list generated
> within kernel. But it has the advantage that you inherent parents properties.  We use a rb-tree for that.
>
> All that is missing is the watchdog.

And this is off topic for the discussion as well. We are not discussing
how to best handle the OOM situation. Shakeel has brought up challenges that
some userspace based OOM killer implementations are facing. Like it or
not, different workloads have different requirements and what you are
using in Android might not be the best fit for everybody. I will not
comment on the android approach but it doesn't address any of the
concerns that have been brought up.
--
Michal Hocko
SUSE Labs

2021-04-22 14:30:39

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 10:39 PM <[email protected]> wrote:
>
> On 4/21/21 9:18 PM, Shakeel Butt wrote:
> > On Wed, Apr 21, 2021 at 11:46 AM <[email protected]> wrote:
> >> On 4/21/21 8:28 PM, Shakeel Butt wrote:
> >>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
> >>> <[email protected]> wrote:
> >>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
> >>> [...]
> >>>> I think this is the wrong way to go.
> >>> Which one? Are you talking about the kernel one? We already talked out
> >>> of that. To decide to OOM, we need to look at a very diverse set of
> >>> metrics and it seems like that would be very hard to do flexibly
> >>> inside the kernel.
> >> You dont need to decide to oom, but when oom occurs you
> >> can take a proper action.
> > No, we want the flexibility to decide when to oom-kill. Kernel is very
> > conservative in triggering the oom-kill.
>
> It wont do it for you. We use this code to solve that:

Sorry what do you mean by "It wont do it for you"?

[...]
> int __init lowmemorykiller_register_oom_notifier(void)
> {
> register_oom_notifier(&lowmemorykiller_oom_nb);

This code is using oom_notify_list, which is only called when the
kernel has already decided to go for the oom-kill. My point was that the
kernel is very conservative in deciding to trigger the oom-kill and
applications can suffer for a long time. We already have solutions for
this issue in the form of userspace oom-killers (Android's lmkd and
Facebook's oomd) which monitor a diverse set of metrics to detect
application suffering early and send SIGKILLs to release the
memory pressure on the system.

BTW with userspace oom-killers, we would like to avoid the kernel
oom-killer altogether, and memory.swap.high has been introduced in the
kernel for that purpose.
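
For example, memory.swap.high can be set per cgroup so that swap
consumption above the threshold gets throttled and the userspace
oom-killer has a chance to act first (the path below is assumed):

#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Write a byte limit (or "max") to a cgroup's memory.swap.high knob. */
static int set_swap_high(const char *knob_path, const char *value)
{
    int fd = open(knob_path, O_WRONLY | O_CLOEXEC);
    ssize_t ret = -1;

    if (fd >= 0) {
        ret = write(fd, value, strlen(value));
        close(fd);
    }
    return ret < 0 ? -1 : 0;
}

/* e.g. set_swap_high("/sys/fs/cgroup/workload/memory.swap.high", "2G"); */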

2021-04-22 15:42:59

by peter enderborg

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On 4/22/21 4:27 PM, Shakeel Butt wrote:
> On Wed, Apr 21, 2021 at 10:39 PM <[email protected]> wrote:
>> On 4/21/21 9:18 PM, Shakeel Butt wrote:
>>> On Wed, Apr 21, 2021 at 11:46 AM <[email protected]> wrote:
>>>> On 4/21/21 8:28 PM, Shakeel Butt wrote:
>>>>> On Wed, Apr 21, 2021 at 10:06 AM peter enderborg
>>>>> <[email protected]> wrote:
>>>>>> On 4/20/21 3:44 AM, Shakeel Butt wrote:
>>>>> [...]
>>>>>> I think this is the wrong way to go.
>>>>> Which one? Are you talking about the kernel one? We already talked out
>>>>> of that. To decide to OOM, we need to look at a very diverse set of
>>>>> metrics and it seems like that would be very hard to do flexibly
>>>>> inside the kernel.
>>>> You dont need to decide to oom, but when oom occurs you
>>>> can take a proper action.
>>> No, we want the flexibility to decide when to oom-kill. Kernel is very
>>> conservative in triggering the oom-kill.
>> It wont do it for you. We use this code to solve that:
> Sorry what do you mean by "It wont do it for you"?
The oom-killer does not do what you want and need.

You need to add something that kills the "right" task.
The example does that: it picks the task with the highest
oom_score_adj and kills it. It is probably easier
to see in the "proof of concept" patch.

>
> [...]
>> int __init lowmemorykiller_register_oom_notifier(void)
>> {
>> register_oom_notifier(&lowmemorykiller_oom_nb);
> This code is using oom_notify_list. That is only called when the
> kernel has already decided to go for the oom-kill. My point was the
> kernel is very conservative in deciding to trigger the oom-kill and
> the applications can suffer for long. We already have solutions for
> this issue in the form of userspace oom-killers (Android's lmkd and
> Facebook's oomd) which monitors a diverse set of metrics to early
> detect the application suffering and trigger SIGKILLs to release the
> memory pressure on the system.
>
> BTW with the userspace oom-killers, we would like to avoid the kernel
> oom-killer and memory.swap.high has been introduced in the kernel for
> that purpose.

This is a lifeline. It will keep lmkd/the activity manager going. It is not
a replacement, it is a helper. It gives the freedom to tune other
parts without worrying too much about oom. (Assuming that
userspace can still handle kills the way the kernel lmk did.)

2021-05-05 00:40:24

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <[email protected]> wrote:
>
[...]
> > > What if the pool is depleted?
> >
> > This would mean that either the estimate of mempool size is bad or
> > oom-killer is buggy and leaking memory.
> >
> > I am open to any design directions for mempool or some other way where
> > we can provide a notion of memory guarantee to oom-killer.
>
> OK, thanks for clarification. There will certainly be hard problems to
> sort out[1] but the overall idea makes sense to me and it sounds like a
> much better approach than a OOM specific solution.
>
>
> [1] - how the pool is going to be replenished without hitting all
> potential reclaim problems (thus dependencies on other all tasks
> directly/indirectly) yet to not rely on any background workers to do
> that on the task behalf without a proper accounting etc...
> --

I am currently contemplating two paths here:

First, the mempool, exposed through either prctl or a new syscall.
Users would need to trace their userspace oom-killer (or whatever
their use case is) to find an appropriate mempool size and
periodically refill the mempools if allowed by the state of the
machine. The challenge here is finding a good value for the mempool
size and coordinating the refilling of the mempools (a hypothetical
usage sketch is at the end of this mail).

Second is a mix of Roman's and Peter's suggestions, but much more
simplified: a very simple watchdog with a kill-list of processes, and
if userspace doesn't pet the watchdog within a specified time, it
kills all the processes in the kill-list. The challenge here is to
maintain/update the kill-list.

I would prefer the direction which oomd and lmkd are open to adopt.

Any suggestions?
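
A usage sketch of the first option; the PR_SET_MEMPOOL/PR_RESET_MEMPOOL
values below are hypothetical and only illustrate the interface shape:

#include <sys/prctl.h>

#define PR_SET_MEMPOOL   0x6d700001    /* made-up values, nothing upstream */
#define PR_RESET_MEMPOOL 0x6d700002

/* Per-thread reserve, used by the allocator only in the slowpath. */
static int mempool_reserve(unsigned long bytes)
{
    return prctl(PR_SET_MEMPOOL, bytes, 0, 0, 0);
}

static int mempool_release(void)
{
    return prctl(PR_RESET_MEMPOOL, 0, 0, 0, 0);
}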

2021-05-05 01:36:19

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <[email protected]> wrote:
>
> On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <[email protected]> wrote:
> >
> [...]
> > > > What if the pool is depleted?
> > >
> > > This would mean that either the estimate of mempool size is bad or
> > > oom-killer is buggy and leaking memory.
> > >
> > > I am open to any design directions for mempool or some other way where
> > > we can provide a notion of memory guarantee to oom-killer.
> >
> > OK, thanks for clarification. There will certainly be hard problems to
> > sort out[1] but the overall idea makes sense to me and it sounds like a
> > much better approach than a OOM specific solution.
> >
> >
> > [1] - how the pool is going to be replenished without hitting all
> > potential reclaim problems (thus dependencies on other all tasks
> > directly/indirectly) yet to not rely on any background workers to do
> > that on the task behalf without a proper accounting etc...
> > --
>
> I am currently contemplating between two paths here:
>
> First, the mempool, exposed through either prctl or a new syscall.
> Users would need to trace their userspace oom-killer (or whatever
> their use case is) to find an appropriate mempool size they would need
> and periodically refill the mempools if allowed by the state of the
> machine. The challenge here is to find a good value for the mempool
> size and coordinating the refilling of mempools.
>
> Second is a mix of Roman and Peter's suggestions but much more
> simplified. A very simple watchdog with a kill-list of processes and
> if userspace didn't pet the watchdog within a specified time, it will
> kill all the processes in the kill-list. The challenge here is to
> maintain/update the kill-list.

IIUC this solution is designed to identify cases when oomd/lmkd got
stuck while allocating memory due to memory shortages and therefore
can't feed the watchdog. In such a case the kernel goes ahead and
kills some processes to free up memory and unblock the blocked
process. Effectively this would limit the time such a process gets
stuck by the duration of the watchdog timeout. If my understanding of
this proposal is correct, then I see the following downsides:
1. oomd/lmkd are still not prevented from getting stuck, it just limits
the duration of the blocked state. Delaying kills when memory
pressure is high, even for a short duration, is very undesirable. I think
having mempool reserves could address this issue better if they can
always guarantee memory availability (not sure if that's possible in
practice).
2. What would be performance overhead of this watchdog? To limit the
duration of a process being blocked to a small enough value we would
have to have quite a small timeout, which means oomd/lmkd would have
to wake up quite often to feed the watchdog. Frequent wakeups on a
battery-powered system is not a good idea.
3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
can't feed the watchdog? In such a scenario the kernel would assume
that it is stuck due to memory shortages and would go on a killing
spree. If there is a sure way to identify when a process gets stuck
due to memory shortages then this could work better.
4. Additional complexity of keeping the list of potential victims in
the kernel. Maybe we can simply reuse oom_score to choose the best
victims?
Thanks,
Suren.

>
> I would prefer the direction which oomd and lmkd are open to adopt.
>
> Any suggestions?

2021-05-05 03:16:15

by Suren Baghdasaryan

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, May 4, 2021 at 7:45 PM Shakeel Butt <[email protected]> wrote:
>
> On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <[email protected]> wrote:
> >
> > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <[email protected]> wrote:
> > >
> > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <[email protected]> wrote:
> > > >
> > > [...]
> > > > > > What if the pool is depleted?
> > > > >
> > > > > This would mean that either the estimate of mempool size is bad or
> > > > > oom-killer is buggy and leaking memory.
> > > > >
> > > > > I am open to any design directions for mempool or some other way where
> > > > > we can provide a notion of memory guarantee to oom-killer.
> > > >
> > > > OK, thanks for clarification. There will certainly be hard problems to
> > > > sort out[1] but the overall idea makes sense to me and it sounds like a
> > > > much better approach than a OOM specific solution.
> > > >
> > > >
> > > > [1] - how the pool is going to be replenished without hitting all
> > > > potential reclaim problems (thus dependencies on other all tasks
> > > > directly/indirectly) yet to not rely on any background workers to do
> > > > that on the task behalf without a proper accounting etc...
> > > > --
> > >
> > > I am currently contemplating between two paths here:
> > >
> > > First, the mempool, exposed through either prctl or a new syscall.
> > > Users would need to trace their userspace oom-killer (or whatever
> > > their use case is) to find an appropriate mempool size they would need
> > > and periodically refill the mempools if allowed by the state of the
> > > machine. The challenge here is to find a good value for the mempool
> > > size and coordinating the refilling of mempools.
> > >
> > > Second is a mix of Roman and Peter's suggestions but much more
> > > simplified. A very simple watchdog with a kill-list of processes and
> > > if userspace didn't pet the watchdog within a specified time, it will
> > > kill all the processes in the kill-list. The challenge here is to
> > > maintain/update the kill-list.
> >
> > IIUC this solution is designed to identify cases when oomd/lmkd got
> > stuck while allocating memory due to memory shortages and therefore
> > can't feed the watchdog. In such a case the kernel goes ahead and
> > kills some processes to free up memory and unblock the blocked
> > process. Effectively this would limit the time such a process gets
> > stuck by the duration of the watchdog timeout. If my understanding of
> > this proposal is correct,
>
> Your understanding is indeed correct.
>
> > then I see the following downsides:
> > 1. oomd/lmkd are still not prevented from being stuck, it just limits
> > the duration of this blocked state. Delaying kills when memory
> > pressure is high even for short duration is very undesirable.
>
> Yes I agree.
>
> > I think
> > having mempool reserves could address this issue better if it can
> > always guarantee memory availability (not sure if it's possible in
> > practice).
>
> I think "mempool ... always guarantee memory availability" is
> something I should quantify with some experiments.
>
> > 2. What would be performance overhead of this watchdog? To limit the
> > duration of a process being blocked to a small enough value we would
> > have to have quite a small timeout, which means oomd/lmkd would have
> > to wake up quite often to feed the watchdog. Frequent wakeups on a
> > battery-powered system is not a good idea.
>
> This is indeed the downside i.e. the tradeoff between acceptable stall
> vs frequent wakeups.
>
> > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
> > can't feed the watchdog? In such a scenario the kernel would assume
> > that it is stuck due to memory shortages and would go on a killing
> > spree.
>
> This is correct but IMHO killing spree is not worse than oomd/lmkd
> getting stuck for some other reason.
>
> > If there is a sure way to identify when a process gets stuck
> > due to memory shortages then this could work better.
>
> Hmm are you saying looking at the stack traces of the userspace
> oom-killer or some metrics related to oom-killer? It will complicate
> the code.

Well, I don't know of a sure and easy way to identify the reason for
a process blockage, but maybe there is one I'm not aware of. My point is
that we would need some additional indication that memory is the
culprit for the blockage before resorting to a kill.

>
> > 4. Additional complexity of keeping the list of potential victims in
> > the kernel. Maybe we can simply reuse oom_score to choose the best
> > victims?
>
> Your point of additional complexity is correct. Regarding oom_score I
> think you meant oom_score_adj, I would avoid putting more
> policies/complexity in the kernel but I got your point that the
> simplest watchdog might not be helpful at all.
>
> > Thanks,
> > Suren.
> >
> > >
> > > I would prefer the direction which oomd and lmkd are open to adopt.
> > >
> > > Any suggestions?

2021-05-05 04:43:43

by Shakeel Butt

[permalink] [raw]
Subject: Re: [RFC] memory reserve for userspace oom-killer

On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Tue, May 4, 2021 at 5:37 PM Shakeel Butt <[email protected]> wrote:
> >
> > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko <[email protected]> wrote:
> > >
> > [...]
> > > > > What if the pool is depleted?
> > > >
> > > > This would mean that either the estimate of mempool size is bad or
> > > > oom-killer is buggy and leaking memory.
> > > >
> > > > I am open to any design directions for mempool or some other way where
> > > > we can provide a notion of memory guarantee to oom-killer.
> > >
> > > OK, thanks for clarification. There will certainly be hard problems to
> > > sort out[1] but the overall idea makes sense to me and it sounds like a
> > > much better approach than a OOM specific solution.
> > >
> > >
> > > [1] - how the pool is going to be replenished without hitting all
> > > potential reclaim problems (thus dependencies on other all tasks
> > > directly/indirectly) yet to not rely on any background workers to do
> > > that on the task behalf without a proper accounting etc...
> > > --
> >
> > I am currently contemplating between two paths here:
> >
> > First, the mempool, exposed through either prctl or a new syscall.
> > Users would need to trace their userspace oom-killer (or whatever
> > their use case is) to find an appropriate mempool size they would need
> > and periodically refill the mempools if allowed by the state of the
> > machine. The challenge here is to find a good value for the mempool
> > size and coordinating the refilling of mempools.
> >
> > Second is a mix of Roman and Peter's suggestions but much more
> > simplified. A very simple watchdog with a kill-list of processes and
> > if userspace didn't pet the watchdog within a specified time, it will
> > kill all the processes in the kill-list. The challenge here is to
> > maintain/update the kill-list.
>
> IIUC this solution is designed to identify cases when oomd/lmkd got
> stuck while allocating memory due to memory shortages and therefore
> can't feed the watchdog. In such a case the kernel goes ahead and
> kills some processes to free up memory and unblock the blocked
> process. Effectively this would limit the time such a process gets
> stuck by the duration of the watchdog timeout. If my understanding of
> this proposal is correct,

Your understanding is indeed correct.

> then I see the following downsides:
> 1. oomd/lmkd are still not prevented from being stuck, it just limits
> the duration of this blocked state. Delaying kills when memory
> pressure is high even for short duration is very undesirable.

Yes I agree.

> I think
> having mempool reserves could address this issue better if it can
> always guarantee memory availability (not sure if it's possible in
> practice).

I think "mempool ... always guarantee memory availability" is
something I should quantify with some experiments.

> 2. What would be performance overhead of this watchdog? To limit the
> duration of a process being blocked to a small enough value we would
> have to have quite a small timeout, which means oomd/lmkd would have
> to wake up quite often to feed the watchdog. Frequent wakeups on a
> battery-powered system is not a good idea.

This is indeed the downside i.e. the tradeoff between acceptable stall
vs frequent wakeups.

> 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
> can't feed the watchdog? In such a scenario the kernel would assume
> that it is stuck due to memory shortages and would go on a killing
> spree.

This is correct but IMHO a killing spree is not worse than oomd/lmkd
getting stuck for some other reason.

> If there is a sure way to identify when a process gets stuck
> due to memory shortages then this could work better.

Hmm, are you suggesting looking at the stack traces of the userspace
oom-killer or some metrics related to the oom-killer? That would
complicate the code.

> 4. Additional complexity of keeping the list of potential victims in
> the kernel. Maybe we can simply reuse oom_score to choose the best
> victims?

Your point about additional complexity is correct. Regarding oom_score,
I think you meant oom_score_adj. I would avoid putting more
policies/complexity in the kernel, but I got your point that the
simplest watchdog might not be helpful at all.

> Thanks,
> Suren.
>
> >
> > I would prefer the direction which oomd and lmkd are open to adopt.
> >
> > Any suggestions?