2015-06-01 11:59:44

by Michal Hocko

[permalink] [raw]
Subject: [PATCH] oom: always panic on OOM when panic_on_oom is configured

panic_on_oom allows administrator to set OOM policy to panic the system
when it is out of memory to reduce failover time e.g. when resolving
the OOM condition would take much more time than rebooting the system.

out_of_memory tries to be clever and prevent from premature panics
by checking the current task and prevent from panic when the task
has fatal signal pending and so it should die shortly and release some
memory. This is fair enough but Tetsuo Handa has noted that this might
lead to a silent deadlock when current cannot exit because of
dependencies invisible to the OOM killer.

panic_on_oom is disabled by default and if somebody enables it then any
risk of potential deadlock is certainly unwelcome. The risk is really
low because there are usually more sources of allocation requests and
one of them would eventually trigger the panic but it is better to
reduce the risk as much as possible.

Let's move check_panic_on_oom up before the current task is
checked so that the knob value is . Do the same for the memcg in
mem_cgroup_out_of_memory.

Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/memcontrol.c | 3 ++-
mm/oom_kill.c | 18 +++++++++---------
2 files changed, 11 insertions(+), 10 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 86648a718d21..d3c906da6a09 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1532,6 +1532,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,

mutex_lock(&oom_lock);

+ check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
+
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
@@ -1542,7 +1544,6 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
goto unlock;
}

- check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
totalpages = mem_cgroup_get_limit(memcg) ? : 1;
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e0681e..f8c83b791dd5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -667,6 +667,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
goto out;

/*
+ * Check if there were limitations on the allocation (only relevant for
+ * NUMA) that may require different handling.
+ */
+ constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
+ &totalpages);
+ mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
+ check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
+
+ /*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
@@ -680,15 +689,6 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
goto out;
}

- /*
- * Check if there were limitations on the allocation (only relevant for
- * NUMA) that may require different handling.
- */
- constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
- &totalpages);
- mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
- check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
-
if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, nodemask) &&
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
--
2.1.4


2015-06-01 15:12:49

by Eric B Munson

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Mon, 01 Jun 2015, Michal Hocko wrote:

> panic_on_oom allows administrator to set OOM policy to panic the system
> when it is out of memory to reduce failover time e.g. when resolving
> the OOM condition would take much more time than rebooting the system.
>
> out_of_memory tries to be clever and prevent from premature panics
> by checking the current task and prevent from panic when the task
> has fatal signal pending and so it should die shortly and release some
> memory. This is fair enough but Tetsuo Handa has noted that this might
> lead to a silent deadlock when current cannot exit because of
> dependencies invisible to the OOM killer.
>
> panic_on_oom is disabled by default and if somebody enables it then any
> risk of potential deadlock is certainly unwelcome. The risk is really
> low because there are usually more sources of allocation requests and
> one of them would eventually trigger the panic but it is better to
> reduce the risk as much as possible.
>
> Let's move check_panic_on_oom up before the current task is
> checked so that the knob value is . Do the same for the memcg in
> mem_cgroup_out_of_memory.
>
> Reported-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

I was initially going to complain about this causing the machine to
panic when a cgroup is oom, but the machine is not. However after
reading check_panic_on_oom(), that behavior is controllable.

Reviewed-by: Eric B Munson <[email protected]>


Attachments:
(No filename) (1.47 kB)
signature.asc (819.00 B)
Digital signature
Download all attachments

2015-06-04 23:12:31

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Mon, 1 Jun 2015, Michal Hocko wrote:

> panic_on_oom allows administrator to set OOM policy to panic the system
> when it is out of memory to reduce failover time e.g. when resolving
> the OOM condition would take much more time than rebooting the system.
>
> out_of_memory tries to be clever and prevent from premature panics
> by checking the current task and prevent from panic when the task
> has fatal signal pending and so it should die shortly and release some
> memory. This is fair enough but Tetsuo Handa has noted that this might
> lead to a silent deadlock when current cannot exit because of
> dependencies invisible to the OOM killer.
>
> panic_on_oom is disabled by default and if somebody enables it then any
> risk of potential deadlock is certainly unwelcome. The risk is really
> low because there are usually more sources of allocation requests and
> one of them would eventually trigger the panic but it is better to
> reduce the risk as much as possible.
>
> Let's move check_panic_on_oom up before the current task is
> checked so that the knob value is . Do the same for the memcg in
> mem_cgroup_out_of_memory.
>
> Reported-by: Tetsuo Handa <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>

Nack, this is not the appropriate response to exit path livelocks. By
doing this, you are going to start unnecessarily panicking machines that
have panic_on_oom set when it would not have triggered before. If there
is no reclaimable memory and a process that has already been signaled to
die to is in the process of exiting has to allocate memory, it is
perfectly acceptable to give them access to memory reserves so they can
allocate and exit. Under normal circumstances, that allows the process to
naturally exit. With your patch, it will cause the machine to panic.

It's this simple: panic_on_oom is not a solution to workaround oom killer
livelocks and shouldn't be suggested as the canonical way that such
possibilities should be addressed.

2015-06-05 11:13:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Thu 04-06-15 16:12:27, David Rientjes wrote:
> On Mon, 1 Jun 2015, Michal Hocko wrote:
>
> > panic_on_oom allows administrator to set OOM policy to panic the system
> > when it is out of memory to reduce failover time e.g. when resolving
> > the OOM condition would take much more time than rebooting the system.
> >
> > out_of_memory tries to be clever and prevent from premature panics
> > by checking the current task and prevent from panic when the task
> > has fatal signal pending and so it should die shortly and release some
> > memory. This is fair enough but Tetsuo Handa has noted that this might
> > lead to a silent deadlock when current cannot exit because of
> > dependencies invisible to the OOM killer.
> >
> > panic_on_oom is disabled by default and if somebody enables it then any
> > risk of potential deadlock is certainly unwelcome. The risk is really
> > low because there are usually more sources of allocation requests and
> > one of them would eventually trigger the panic but it is better to
> > reduce the risk as much as possible.
> >
> > Let's move check_panic_on_oom up before the current task is
> > checked so that the knob value is . Do the same for the memcg in
> > mem_cgroup_out_of_memory.
> >
> > Reported-by: Tetsuo Handa <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
>
> Nack, this is not the appropriate response to exit path livelocks. By
> doing this, you are going to start unnecessarily panicking machines that
> have panic_on_oom set when it would not have triggered before. If there
> is no reclaimable memory and a process that has already been signaled to
> die to is in the process of exiting has to allocate memory, it is
> perfectly acceptable to give them access to memory reserves so they can
> allocate and exit. Under normal circumstances, that allows the process to
> naturally exit. With your patch, it will cause the machine to panic.

Isn't that what the administrator of the system wants? The system
is _clearly_ out of memory at this point. A coincidental exiting task
doesn't change a lot in that regard. Moreover it increases a risk of
unnecessarily unresponsive system which is what panic_on_oom tries to
prevent from. So from my POV this is a clear violation of the user
policy.

> It's this simple: panic_on_oom is not a solution to workaround oom killer
> livelocks and shouldn't be suggested as the canonical way that such
> possibilities should be addressed.

I wasn't suggesting that at all.

--
Michal Hocko
SUSE Labs

2015-06-06 06:51:58

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

Michal Hocko wrote:
> > > Let's move check_panic_on_oom up before the current task is
> > > checked so that the knob value is . Do the same for the memcg in
> > > mem_cgroup_out_of_memory.
> > >
> > > Reported-by: Tetsuo Handa <[email protected]>
> > > Signed-off-by: Michal Hocko <[email protected]>
> >
> > Nack, this is not the appropriate response to exit path livelocks. By
> > doing this, you are going to start unnecessarily panicking machines that
> > have panic_on_oom set when it would not have triggered before. If there
> > is no reclaimable memory and a process that has already been signaled to
> > die to is in the process of exiting has to allocate memory, it is
> > perfectly acceptable to give them access to memory reserves so they can
> > allocate and exit. Under normal circumstances, that allows the process to
> > naturally exit. With your patch, it will cause the machine to panic.
>
> Isn't that what the administrator of the system wants? The system
> is _clearly_ out of memory at this point. A coincidental exiting task
> doesn't change a lot in that regard. Moreover it increases a risk of
> unnecessarily unresponsive system which is what panic_on_oom tries to
> prevent from. So from my POV this is a clear violation of the user
> policy.

For me, !__GFP_FS allocations not calling out_of_memory() _forever_ is a
violation of the user policy.

If kswapd found nothing more to reclaim and/or kswapd cannot continue
reclaiming due to lock dependency, can't we consider as out of memory
because we already tried to reclaim memory which would have been done by
__GFP_FS allocations?

Why do we do "!__GFP_FS allocations do not call out_of_memory() because
they have very limited reclaim ability"? Both GFP_NOFS and GFP_NOIO
allocations will wake up kswapd due to !__GFP_NO_KSWAPD, doesn't it?

Are objects reclaimed by kswapd and objects reclaimed by __GFP_FS allocations
differ? If yes, we could introduce a proxy kernel thread which does __GFP_FS
allocations on behalf of !__GFP_FS allocators, and notify !__GFP_FS allocators
of completion. If no, why not to call out_of_memory() when kswapd found nothing
more to reclaim and/or kswapd cannot continue reclaiming due to lock dependency?

At least, I expect some warning like check_hung_task() in kernel/hung_task.c
is emitted when memory allocation livelock/deadlock is suspected. That will
help detecting unresponsive systems.

2015-06-08 08:21:52

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Sat 06-06-15 15:51:35, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > > Let's move check_panic_on_oom up before the current task is
> > > > checked so that the knob value is . Do the same for the memcg in
> > > > mem_cgroup_out_of_memory.
> > > >
> > > > Reported-by: Tetsuo Handa <[email protected]>
> > > > Signed-off-by: Michal Hocko <[email protected]>
> > >
> > > Nack, this is not the appropriate response to exit path livelocks. By
> > > doing this, you are going to start unnecessarily panicking machines that
> > > have panic_on_oom set when it would not have triggered before. If there
> > > is no reclaimable memory and a process that has already been signaled to
> > > die to is in the process of exiting has to allocate memory, it is
> > > perfectly acceptable to give them access to memory reserves so they can
> > > allocate and exit. Under normal circumstances, that allows the process to
> > > naturally exit. With your patch, it will cause the machine to panic.
> >
> > Isn't that what the administrator of the system wants? The system
> > is _clearly_ out of memory at this point. A coincidental exiting task
> > doesn't change a lot in that regard. Moreover it increases a risk of
> > unnecessarily unresponsive system which is what panic_on_oom tries to
> > prevent from. So from my POV this is a clear violation of the user
> > policy.
>
> For me, !__GFP_FS allocations not calling out_of_memory() _forever_ is a
> violation of the user policy.

Yes, the current behavior of GFP_NOFS is highly suboptimal, but this has
_nothing_ what so ever to do with this patch and panic_on_oom handling.
The former one is the page allocator proper while we are in the OOM
killer layer here.

This is not the first time you have done that. Please stop it. It makes
a complete mess of the original discussions.

[...]
--
Michal Hocko
SUSE Labs

2015-06-08 11:53:41

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

Michal Hocko wrote:
> On Sat 06-06-15 15:51:35, Tetsuo Handa wrote:
> > For me, !__GFP_FS allocations not calling out_of_memory() _forever_ is a
> > violation of the user policy.
>
> Yes, the current behavior of GFP_NOFS is highly suboptimal, but this has
> _nothing_ what so ever to do with this patch and panic_on_oom handling.
> The former one is the page allocator proper while we are in the OOM
> killer layer here.
>
> This is not the first time you have done that. Please stop it. It makes
> a complete mess of the original discussions.

My specific area of expertise in Linux kernel is security/tomoyo/ directory.
What I could contribute with other areas is to pay attention to potential
possibilities. But I'm such outsider about other areas that I can't
differentiate the page allocator proper from the OOM killer layer.

Excuse me for focusing on the wrong issue, but I will likely unconsciously
make the same mistake again. Please ignore or point out when I made mistakes.

2015-06-08 19:52:06

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Fri, 5 Jun 2015, Michal Hocko wrote:

> > Nack, this is not the appropriate response to exit path livelocks. By
> > doing this, you are going to start unnecessarily panicking machines that
> > have panic_on_oom set when it would not have triggered before. If there
> > is no reclaimable memory and a process that has already been signaled to
> > die to is in the process of exiting has to allocate memory, it is
> > perfectly acceptable to give them access to memory reserves so they can
> > allocate and exit. Under normal circumstances, that allows the process to
> > naturally exit. With your patch, it will cause the machine to panic.
>
> Isn't that what the administrator of the system wants? The system
> is _clearly_ out of memory at this point. A coincidental exiting task
> doesn't change a lot in that regard. Moreover it increases a risk of
> unnecessarily unresponsive system which is what panic_on_oom tries to
> prevent from. So from my POV this is a clear violation of the user
> policy.
>

We rely on the functionality that this patch is short cutting because we
rely on userspace to trigger oom kills. For system oom conditions, we
must then rely on the kernel oom killer to set TIF_MEMDIE since userspace
cannot grant it itself. (I think the memcg case is very similar in that
this patch is short cutting it, but I'm more concerned for the system oom
in this case because it's a show stopper for us.)

We want to send the SIGKILL, which will interrupt things like
get_user_pages() which we find is our culprit most of the time. When the
process enters the exit path, it must allocate other memory (slab,
coredumping and the very problematic proc_exit_connector()) to free
memory. This patch would cause the machine to panic rather than utilizing
memory reserves so that it can exit, not as a result of a kernel oom kill
but rather a userspace kill.

Panic_on_oom is to suppress the kernel oom killer. It's not a sysctl that
triggers whenever watermarks are hit and it doesn't suppress memory
reserves from being used for things like GFP_ATOMIC. Setting TIF_MEMDIE
for an exiting process is another type of memory reserves and is
imperative that we have it to make forward progress. Panic_on_oom should
only trigger when the kernel can't make forward progress without killing
something (not true in this case). I believe that's how the documentation
has always been interpreted and the tunable used in the wild.

It would be interesting to consider your other patch that refactors the
sysrq+f tunable. I think we should make that never trigger panic_on_oom
(the sysadmin can use other sysrqs for that) and allow userspace to use
sysrq+f as a trigger when it is responsive to handle oom conditions.

But this patch itself can't possibly be merged.

2015-06-08 19:58:55

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Sat, 6 Jun 2015, Tetsuo Handa wrote:

> For me, !__GFP_FS allocations not calling out_of_memory() _forever_ is a
> violation of the user policy.
>

I agree that we need work in this area to prevent livelocks that rely on
the contexts of other allocators to make forward progress, but let's be
clear that panicking the system is not the appropriate response. It's
nice and convenient to say we should repurpose panic_on_oom to solve
livelocks because it triggers a fast reboot in your configuration, but we,
and certainly others, can't tolerate reboots when the kernel just gives up
and the majority of the time these situations do resolve themselves.

I think the appropriate place to be attacking this problem is in the page
allocator, which is responsible for the page allocations in the context of
the gfp flags given to it, and not the oom killer. The oom killer is just
supposed to pick a process, send it a SIGKILL, and give it a reasonable
expectation of being able to exit.

> If kswapd found nothing more to reclaim and/or kswapd cannot continue
> reclaiming due to lock dependency, can't we consider as out of memory
> because we already tried to reclaim memory which would have been done by
> __GFP_FS allocations?
>
> Why do we do "!__GFP_FS allocations do not call out_of_memory() because
> they have very limited reclaim ability"? Both GFP_NOFS and GFP_NOIO
> allocations will wake up kswapd due to !__GFP_NO_KSWAPD, doesn't it?
>

The !__GFP_FS exception is historic because the oom killer would trigger
waaay too often if it were removed simply because it doesn't have a great
chance of allowing reclaim to succeed. Instead, we rely on external
memory freeing or other parallel allocators being able to reclaim memory.
What happens when there is no external freeing, nothing else is trying to
reclaim, or nothing else is able to reclaim? Yeah, that's the big
problem. In my opinion, there's three ways of attacking it: (1)
preallocation so we are less dependent on the page allocator in these
contexts, (2) memory reserves in extraordinary circumstances to allow
forward progress (it's already tunable by min_free_kbytes), and (3)
eventual page allocation failure when neither of these succeed.

2015-06-08 21:32:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Mon 08-06-15 12:51:53, David Rientjes wrote:
> On Fri, 5 Jun 2015, Michal Hocko wrote:
>
> > > Nack, this is not the appropriate response to exit path livelocks. By
> > > doing this, you are going to start unnecessarily panicking machines that
> > > have panic_on_oom set when it would not have triggered before. If there
> > > is no reclaimable memory and a process that has already been signaled to
> > > die to is in the process of exiting has to allocate memory, it is
> > > perfectly acceptable to give them access to memory reserves so they can
> > > allocate and exit. Under normal circumstances, that allows the process to
> > > naturally exit. With your patch, it will cause the machine to panic.
> >
> > Isn't that what the administrator of the system wants? The system
> > is _clearly_ out of memory at this point. A coincidental exiting task
> > doesn't change a lot in that regard. Moreover it increases a risk of
> > unnecessarily unresponsive system which is what panic_on_oom tries to
> > prevent from. So from my POV this is a clear violation of the user
> > policy.
> >
>
> We rely on the functionality that this patch is short cutting because we
> rely on userspace to trigger oom kills. For system oom conditions, we
> must then rely on the kernel oom killer to set TIF_MEMDIE since userspace
> cannot grant it itself. (I think the memcg case is very similar in that
> this patch is short cutting it, but I'm more concerned for the system oom
> in this case because it's a show stopper for us.)

Do you actually have panic_on_oops enabled?

> We want to send the SIGKILL, which will interrupt things like

But this patch only changes the ordering of panic_on_oops vs.
fatal_signal_pending(current) shortcut. So it can possible affect the
case when the current is exiting during OOM. Is this the case that you
are worried about?

> get_user_pages() which we find is our culprit most of the time. When the
> process enters the exit path, it must allocate other memory (slab,
> coredumping and the very problematic proc_exit_connector()) to free
> memory. This patch would cause the machine to panic rather than utilizing
> memory reserves so that it can exit, not as a result of a kernel oom kill
> but rather a userspace kill.
>
> Panic_on_oom is to suppress the kernel oom killer. It's not a sysctl that
> triggers whenever watermarks are hit and it doesn't suppress memory
> reserves from being used for things like GFP_ATOMIC. Setting TIF_MEMDIE
> for an exiting process is another type of memory reserves and is
> imperative that we have it to make forward progress.

is your assumption about exiting process actually true? I mean there is
no _guarantee_ the process will manage to die with the available memory
reserves. Dependency blocking the task is another reason why we should
be careful about relying on TIF_MEMDIE.

> Panic_on_oom should
> only trigger when the kernel can't make forward progress without killing
> something (not true in this case). I believe that's how the documentation
> has always been interpreted and the tunable used in the wild.

The documentation clearly states:
"
If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits using nodes by mempolicy/cpusets,
and those nodes become memory exhaustion status, one process
may be killed by oom-killer. No panic occurs in this case.
Because other nodes' memory may be free. This means system total status
may be not fatal yet.

If this is set to 2, the kernel panics compulsorily even on the
above-mentioned. Even oom happens under memory cgroup, the whole
system panics.

The default value is 0.
1 and 2 are for failover of clustering. Please select either
according to your policy of failover.
"

So I read this as a reliability feature to handle oom situation as soon
as possible.

>
> It would be interesting to consider your other patch that refactors the
> sysrq+f tunable. I think we should make that never trigger panic_on_oom
> (the sysadmin can use other sysrqs for that) and allow userspace to use
> sysrq+f as a trigger when it is responsive to handle oom conditions.
>
> But this patch itself can't possibly be merged.


--
Michal Hocko
SUSE Labs

2015-06-08 23:20:32

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Mon, 8 Jun 2015, Michal Hocko wrote:

> On Mon 08-06-15 12:51:53, David Rientjes wrote:
> > On Fri, 5 Jun 2015, Michal Hocko wrote:
> >
> > > > Nack, this is not the appropriate response to exit path livelocks. By
> > > > doing this, you are going to start unnecessarily panicking machines that
> > > > have panic_on_oom set when it would not have triggered before. If there
> > > > is no reclaimable memory and a process that has already been signaled to
> > > > die to is in the process of exiting has to allocate memory, it is
> > > > perfectly acceptable to give them access to memory reserves so they can
> > > > allocate and exit. Under normal circumstances, that allows the process to
> > > > naturally exit. With your patch, it will cause the machine to panic.
> > >
> > > Isn't that what the administrator of the system wants? The system
> > > is _clearly_ out of memory at this point. A coincidental exiting task
> > > doesn't change a lot in that regard. Moreover it increases a risk of
> > > unnecessarily unresponsive system which is what panic_on_oom tries to
> > > prevent from. So from my POV this is a clear violation of the user
> > > policy.
> > >
> >
> > We rely on the functionality that this patch is short cutting because we
> > rely on userspace to trigger oom kills. For system oom conditions, we
> > must then rely on the kernel oom killer to set TIF_MEMDIE since userspace
> > cannot grant it itself. (I think the memcg case is very similar in that
> > this patch is short cutting it, but I'm more concerned for the system oom
> > in this case because it's a show stopper for us.)
>
> Do you actually have panic_on_oops enabled?
>

CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant.

The functionality I'm referring to is that your patch now panics the
machine for configs where /proc/sys/vm/panic_on_oom is set and the same
scenario occurs as described above. You're introducing userspace breakage
because you are using panic_on_oom in a way that it hasn't been used in
the past and isn't described as working in the documentation.

> > We want to send the SIGKILL, which will interrupt things like
>
> But this patch only changes the ordering of panic_on_oops vs.
> fatal_signal_pending(current) shortcut. So it can possible affect the
> case when the current is exiting during OOM. Is this the case that you
> are worried about?
>

Yes, of course, the case specifically when the killed process is in the
exit path due to a userspace oom kill and needs access to memory reserves
to exit. That's needed because the machine is oom (typically the only
time a non-buggy userspace oom handler would kill a process).

This:

check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
...
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
goto out;
}

is obviously buggy in that regard. We need to be able to give a killed
process or an exiting process memory reserves so it may (1) allocate prior
to handling the signal and (2) be assured of exiting once it is in the
exit path.

> The documentation clearly states:
> "
> If this is set to 1, the kernel panics when out-of-memory happens.
> However, if a process limits using nodes by mempolicy/cpusets,
> and those nodes become memory exhaustion status, one process
> may be killed by oom-killer. No panic occurs in this case.
> Because other nodes' memory may be free. This means system total status
> may be not fatal yet.
>
> If this is set to 2, the kernel panics compulsorily even on the
> above-mentioned. Even oom happens under memory cgroup, the whole
> system panics.
>
> The default value is 0.
> 1 and 2 are for failover of clustering. Please select either
> according to your policy of failover.
> "
>
> So I read this as a reliability feature to handle oom situation as soon
> as possible.
>

A userspace process that is killed by userspace that simply needs memory
to handle the signal and exit is not oom. We have always allowed current
to have access to memory reserves to exit without triggering panic_on_oom.
This is nothing new, and is not implied by the documentation to be the
case.

I'm not going to spend all day trying to convince you that you cannot
change the semantics of sysctls that have existed for years with new
behavior especially when users require that behavior to handle userspace
kills while still keeping their machines running.

2015-06-09 09:44:16

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Mon 08-06-15 16:20:21, David Rientjes wrote:
> On Mon, 8 Jun 2015, Michal Hocko wrote:
[...]
> > On Mon 08-06-15 12:51:53, David Rientjes wrote:
> > Do you actually have panic_on_oops enabled?
> >
>
> CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant.

No I meant panic_on_oops > 0.

> The functionality I'm referring to is that your patch now panics the
> machine for configs where /proc/sys/vm/panic_on_oom is set and the same
> scenario occurs as described above. You're introducing userspace breakage
> because you are using panic_on_oom in a way that it hasn't been used in
> the past and isn't described as working in the documentation.

I am sorry, but I do not follow. The knob has been always used to
_panic_ the OOM system. Nothing more and nothing less. Now you
are arguing about the change being buggy because a task might be
killed but that argument doesn't make much sense to me because
basically _any_ other allocation which allows OOM to trigger might hit
check_panic_on_oom() and panic the system well before your killed task
gets a chance to terminate.

I would understand your complain if we waited for oom victim(s) before
check_panic_on_oom but we have not been doing that.

> > > We want to send the SIGKILL, which will interrupt things like
> >
> > But this patch only changes the ordering of panic_on_oops vs.
> > fatal_signal_pending(current) shortcut. So it can possible affect the
> > case when the current is exiting during OOM. Is this the case that you
> > are worried about?
> >
>
> Yes, of course, the case specifically when the killed process is in the
> exit path due to a userspace oom kill and needs access to memory reserves
> to exit. That's needed because the machine is oom (typically the only
> time a non-buggy userspace oom handler would kill a process).
>
> This:
>
> check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
> ...
> if (current->mm &&
> (fatal_signal_pending(current) || task_will_free_mem(current))) {
> mark_oom_victim(current);
> goto out;
> }
>
> is obviously buggy in that regard. We need to be able to give a killed
> process or an exiting process memory reserves so it may (1) allocate prior
> to handling the signal and (2) be assured of exiting once it is in the
> exit path.

Which OOM path are you talking about here? memcg OOM user space handler
killing a task? This one however doesn't go this path unless the memcg
OOM is racing with the global OOM. Is this the case you are worried
about? If yes then this is racy anyway and nothing to rely on as
described above.
If you have a global OOM detection mechanism then it is racy as well
for the very same reason. User space OOM handling with panic_on_oom
is simply not usable.

> > The documentation clearly states:
> > "
> > If this is set to 1, the kernel panics when out-of-memory happens.
> > However, if a process limits using nodes by mempolicy/cpusets,
> > and those nodes become memory exhaustion status, one process
> > may be killed by oom-killer. No panic occurs in this case.
> > Because other nodes' memory may be free. This means system total status
> > may be not fatal yet.
> >
> > If this is set to 2, the kernel panics compulsorily even on the
> > above-mentioned. Even oom happens under memory cgroup, the whole
> > system panics.
> >
> > The default value is 0.
> > 1 and 2 are for failover of clustering. Please select either
> > according to your policy of failover.
> > "
> >
> > So I read this as a reliability feature to handle oom situation as soon
> > as possible.
> >
>
> A userspace process that is killed by userspace that simply needs memory
> to handle the signal and exit is not oom. We have always allowed current
> to have access to memory reserves to exit without triggering panic_on_oom.
> This is nothing new, and is not implied by the documentation to be the
> case.

The documentation doesn't mention anything about exiting task or any
other last minute attempt to be nice and prevent from OOM killing.
And moreover the assumption that TIF_MEMDIE will help to exit the oom
victim or a task with fatal_signal_pending is not true in general (and
you haven't provided sound arguments yet).

> I'm not going to spend all day trying to convince you that you cannot
> change the semantics of sysctls that have existed for years with new
> behavior especially when users require that behavior to handle userspace
> kills while still keeping their machines running.

Let me remind that you are trying to nack this patch and your
argumentation is unclear at best.

The matter seems quite simple to me. Relying on fatal_signal_pending(current)
to help before check_panic_on_oom might help to prevent OOM but it is
racy and cannot be relied on while not going to check_panic_on_oom might
be potentially harmful, albeil unlikely, and lock up machine which is
against the user defined policy to panic machine on OOM.
--
Michal Hocko
SUSE Labs

2015-06-09 11:48:48

by Tetsuo Handa

[permalink] [raw]
Subject: Re: oom: How to handle !__GFP_FS exception?

David Rientjes wrote:
> On Sat, 6 Jun 2015, Tetsuo Handa wrote:
>
> > For me, !__GFP_FS allocations not calling out_of_memory() _forever_ is a
> > violation of the user policy.
> >
>
> I agree that we need work in this area to prevent livelocks that rely on
> the contexts of other allocators to make forward progress, but let's be
> clear that panicking the system is not the appropriate response. It's
> nice and convenient to say we should repurpose panic_on_oom to solve
> livelocks because it triggers a fast reboot in your configuration, but we,
> and certainly others, can't tolerate reboots when the kernel just gives up
> and the majority of the time these situations do resolve themselves.
>
> I think the appropriate place to be attacking this problem is in the page
> allocator, which is responsible for the page allocations in the context of
> the gfp flags given to it, and not the oom killer. The oom killer is just
> supposed to pick a process, send it a SIGKILL, and give it a reasonable
> expectation of being able to exit.
>
> > If kswapd found nothing more to reclaim and/or kswapd cannot continue
> > reclaiming due to lock dependency, can't we consider as out of memory
> > because we already tried to reclaim memory which would have been done by
> > __GFP_FS allocations?
> >
> > Why do we do "!__GFP_FS allocations do not call out_of_memory() because
> > they have very limited reclaim ability"? Both GFP_NOFS and GFP_NOIO
> > allocations will wake up kswapd due to !__GFP_NO_KSWAPD, doesn't it?
> >
>
> The !__GFP_FS exception is historic because the oom killer would trigger
> waaay too often if it were removed simply because it doesn't have a great
> chance of allowing reclaim to succeed. Instead, we rely on external
> memory freeing or other parallel allocators being able to reclaim memory.
> What happens when there is no external freeing, nothing else is trying to
> reclaim, or nothing else is able to reclaim? Yeah, that's the big
> problem. In my opinion, there's three ways of attacking it: (1)
> preallocation so we are less dependent on the page allocator in these
> contexts, (2) memory reserves in extraordinary circumstances to allow
> forward progress (it's already tunable by min_free_kbytes), and (3)
> eventual page allocation failure when neither of these succeed.
>
According to my observations (as posted at
http://marc.info/?l=linux-mm&m=143239200805478 ), (3) is dangerous because
it can potentially kill critical processes including global init process.
Killing a process by invoking the OOM killer sounds safer than (3).

Regarding (2), how can we selectively allow blocking process to access
memory reserves? Since we don't know the dependency, we can't identify the
process which should be allowed to access memory reserves. If we allow all
processes to access memory reserves, unrelated processes could deplete the
memory reserves while the blocking process is waiting for a lock (either in
killable or unkillable state). What we need to do to make forward progress
is not always to allow access to memory reserves. Sometimes making locks
killable (as posted at http://marc.info/?l=linux-mm&m=142408937117294 )
helps more.

Regarding (1), it would help but insufficient because (2) and (3) unlikely
work.

2015-06-09 22:28:57

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Tue, 9 Jun 2015, Michal Hocko wrote:

> > > On Mon 08-06-15 12:51:53, David Rientjes wrote:
> > > Do you actually have panic_on_oops enabled?
> > >
> >
> > CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant.
>
> No I meant panic_on_oops > 0.
>

CONFIG_PANIC_ON_OOPS_VALUE sets panic_on_oops, so it's 0.

> > The functionality I'm referring to is that your patch now panics the
> > machine for configs where /proc/sys/vm/panic_on_oom is set and the same
> > scenario occurs as described above. You're introducing userspace breakage
> > because you are using panic_on_oom in a way that it hasn't been used in
> > the past and isn't described as working in the documentation.
>
> I am sorry, but I do not follow. The knob has been always used to
> _panic_ the OOM system. Nothing more and nothing less. Now you
> are arguing about the change being buggy because a task might be
> killed but that argument doesn't make much sense to me because
> basically _any_ other allocation which allows OOM to trigger might hit
> check_panic_on_oom() and panic the system well before your killed task
> gets a chance to terminate.
>

Not necessarily. We pin a lot of memory with get_user_pages() and
short-circuit it by checking for fatal_signal_pending() specifically for
oom conditions. This was done over six years ago by commit 4779280d1ea4
("mm: make get_user_pages() interruptible"). When such a process is
faulting in memory, and it is killed by userspace as a result of an oom
condition, it needs to be able to allocate (TIF_MEMDIE set by the oom
killer due to SIGKILL), return to __get_user_pages(), abort, handle the
signal, and exit.

I can't possibly make that any more clear.

Your patch causes that to instead panic the system if panic_on_oom is set.
It's inappropriate and userspace breakage. The fact that I don't
personally use panic_on_oom is completely and utterly irrelevant.

There is absolutely nothing wrong with a process that has been killed
either directly by userspace or as part of a group exit, or a process that
is already in the exit path and needs to allocate memory to be able to
free its memory, to get access to memory reserves. That's not an oom
condition, that's memory reserves. Panic_on_oom has nothing to do with
this scenario whatsoever.

Panic_on_oom is not panic_when_reclaim_fails. It's to suppress a kernel
oom kill. That's why it's checked where it is checked and always has
been. This patch cannot possibly be merged.

> I would understand your complain if we waited for oom victim(s) before
> check_panic_on_oom but we have not been doing that.
>

I don't think anybody is changing panic_on_oom after boot, so we wouldn't
have any oom victims if the oom killer never did anything.

2015-06-09 22:42:06

by David Rientjes

[permalink] [raw]
Subject: Re: oom: How to handle !__GFP_FS exception?

On Tue, 9 Jun 2015, Tetsuo Handa wrote:

> > The !__GFP_FS exception is historic because the oom killer would trigger
> > waaay too often if it were removed simply because it doesn't have a great
> > chance of allowing reclaim to succeed. Instead, we rely on external
> > memory freeing or other parallel allocators being able to reclaim memory.
> > What happens when there is no external freeing, nothing else is trying to
> > reclaim, or nothing else is able to reclaim? Yeah, that's the big
> > problem. In my opinion, there's three ways of attacking it: (1)
> > preallocation so we are less dependent on the page allocator in these
> > contexts, (2) memory reserves in extraordinary circumstances to allow
> > forward progress (it's already tunable by min_free_kbytes), and (3)
> > eventual page allocation failure when neither of these succeed.
> >
> According to my observations (as posted at
> http://marc.info/?l=linux-mm&m=143239200805478 ), (3) is dangerous because
> it can potentially kill critical processes including global init process.
> Killing a process by invoking the OOM killer sounds safer than (3).
>

Wow, that's a long changelog :) I tried my best to look through it and
http://marc.info/?l=linux-kernel&m=142676304911566 to find where init
could possibly be killed and it references being killed by SIGBUS at
pagefault? I'm not sure how that could be possible since mm_fault_error()
should be handling VM_FAULT_OOM if any page allocation returns NULL in the
page fault path. If that's not being set appropriately (VM_FAULT_OOM on
page allocation failure), are there stack traces that indicate where that
might be? Perhaps this was testing of a patch that was not upstream?

Being killed by SIGBUS certainly should not be the result of the page
allocator returning NULL, but perhaps I'm missing some failure path that
never happens because the allocator infinite loop never returns NULL
today. Trying option (3), in combination with the others, will
undoubtedly yield some breakage because of bad failure handling that
hasn't been exercised before, but this one seems preventable.

> Regarding (2), how can we selectively allow blocking process to access
> memory reserves? Since we don't know the dependency, we can't identify the
> process which should be allowed to access memory reserves. If we allow all
> processes to access memory reserves, unrelated processes could deplete the
> memory reserves while the blocking process is waiting for a lock (either in
> killable or unkillable state). What we need to do to make forward progress
> is not always to allow access to memory reserves. Sometimes making locks
> killable (as posted at http://marc.info/?l=linux-mm&m=142408937117294 )
> helps more.
>

Yeah, I'm all too familiar with this scenario in the memcg world
unfortunately. The only solution that I've come up with, and implemented
for our kernels to test the theory, is to allow access to memory reserves
(or for memcg, overcharge) if an allocation continually loops due to the
oom killer being deferred as a result of a pending oom victim. Basically,
my patch causes out_of_memory() to return a pointer to the task_struct of
the process that we're waiting to exit and the page allocator continually
checks to ensure it is the same and then when a configurable threshold is
reached, it gives access to memory reserves. The thread holding the mutex
that the oom victim wants will eventually allocate due to this and
hopefully make forward progress. The system grinds to a halt if you're
too conservative in this approach with regards to detecting the infinite
oom killer deferral. (How many iterations do you consider to be stall?
Do you set ALLOC_HIGH | ALLOC_HARDER? Do you set ALLOC_NO_WATERMARKS?)

> Regarding (1), it would help but insufficient because (2) and (3) unlikely
> work.
>

Option (1) is somewhat independent of the others and fixable if we find
situations where memory allocation can be done prior to holding a
potentially contended mutex. We hope that nobody is needlessly holding a
contended mutex while allocating, and that seems to be the case most
often. However, there may still be situations where it happens.

2015-06-10 07:52:31

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Tue 09-06-15 15:28:40, David Rientjes wrote:
> On Tue, 9 Jun 2015, Michal Hocko wrote:
>
> > > > On Mon 08-06-15 12:51:53, David Rientjes wrote:
> > > > Do you actually have panic_on_oops enabled?
> > > >
> > >
> > > CONFIG_PANIC_ON_OOPS_VALUE should be 0, I'm not sure why that's relevant.
> >
> > No I meant panic_on_oops > 0.
> >
>
> CONFIG_PANIC_ON_OOPS_VALUE sets panic_on_oops, so it's 0.

So you are not using this, thanks for making that clear!

> > > The functionality I'm referring to is that your patch now panics the
> > > machine for configs where /proc/sys/vm/panic_on_oom is set and the same
> > > scenario occurs as described above. You're introducing userspace breakage
> > > because you are using panic_on_oom in a way that it hasn't been used in
> > > the past and isn't described as working in the documentation.
> >
> > I am sorry, but I do not follow. The knob has been always used to
> > _panic_ the OOM system. Nothing more and nothing less. Now you
> > are arguing about the change being buggy because a task might be
> > killed but that argument doesn't make much sense to me because
> > basically _any_ other allocation which allows OOM to trigger might hit
> > check_panic_on_oom() and panic the system well before your killed task
> > gets a chance to terminate.
> >
>
> Not necessarily. We pin a lot of memory with get_user_pages() and
> short-circuit it by checking for fatal_signal_pending() specifically for
> oom conditions. This was done over six years ago by commit 4779280d1ea4
> ("mm: make get_user_pages() interruptible"). When such a process is
> faulting in memory, and it is killed by userspace as a result of an oom
> condition, it needs to be able to allocate (TIF_MEMDIE set by the oom
> killer due to SIGKILL), return to __get_user_pages(), abort, handle the
> signal, and exit.
>
> I can't possibly make that any more clear.

Are you even reading what I've written? I will ask for the last
time. What exactly prevents other allocation to trigger to oom path and
panic the system before the killed task has a chance to terminate?

> Your patch causes that to instead panic the system if panic_on_oom is set.
> It's inappropriate and userspace breakage. The fact that I don't
> personally use panic_on_oom is completely and utterly irrelevant.
>
> There is absolutely nothing wrong with a process that has been killed
> either directly by userspace or as part of a group exit, or a process that
> is already in the exit path and needs to allocate memory to be able to
> free its memory, to get access to memory reserves. That's not an oom
> condition, that's memory reserves. Panic_on_oom has nothing to do with
> this scenario whatsoever.

It very much has and I have presented arguments about that which you
didn't bother to comment on. TIF_MEMDIE is not a magic which will help a
task to exit in all cases. It is a heuristic and it can fail.
panic_on_oops is a hand break when things go wrong and you want to
reduce your unresponsive time (read failover part in the documentation).

> Panic_on_oom is not panic_when_reclaim_fails.

OOM is when all other reclaim attempts fail. Jeez we are in
out_of_memory how can this be potentially unclear to you? Yes oom killer
path might use heuristics to reduce the impact of the OOM condition but
once we are in this path _we_are_OOM_.

> It's to suppress a kernel
> oom kill. That's why it's checked where it is checked and always has
> been.

Which makes no sense as well. Because you might have thousands of other
tasks with fatal signal pending on the task list. Why would current be
any different?

> This patch cannot possibly be merged.
>
> > I would understand your complain if we waited for oom victim(s) before
> > check_panic_on_oom but we have not been doing that.
> >
>
> I don't think anybody is changing panic_on_oom after boot, so we wouldn't
> have any oom victims if the oom killer never did anything.

This has nothing to do with setting panic_on_oom but. Please go and read
what I wrote again and try to think about it before you throw another
unrelated response.

</bunfight> from my side.

Andrew do whatever you like with this patch. I think that arguing about
the patch being broken because it _might_ break an already racy behavior
which nobody can possible rely on is not a reason to nack it. The risk
of not triggering check_panic_on_oom at all is really low so I can live
without the patch.
I am just worried about the level of argumentation here.
--
Michal Hocko
SUSE Labs

2015-06-11 00:37:05

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH] oom: always panic on OOM when panic_on_oom is configured

On Wed, 10 Jun 2015, Michal Hocko wrote:

> > Not necessarily. We pin a lot of memory with get_user_pages() and
> > short-circuit it by checking for fatal_signal_pending() specifically for
> > oom conditions. This was done over six years ago by commit 4779280d1ea4
> > ("mm: make get_user_pages() interruptible"). When such a process is
> > faulting in memory, and it is killed by userspace as a result of an oom
> > condition, it needs to be able to allocate (TIF_MEMDIE set by the oom
> > killer due to SIGKILL), return to __get_user_pages(), abort, handle the
> > signal, and exit.
> >
> > I can't possibly make that any more clear.
>
> Are you even reading what I've written? I will ask for the last
> time. What exactly prevents other allocation to trigger to oom path and
> panic the system before the killed task has a chance to terminate?
>

If there are other threads that call into the oom killer that are not in
the exit path or have a SIGKILL to handle, then the machine panics.
That's the purpose of panic_on_oom: the kernel has no way to free memory
without killing a process, so the admin has chosen to panic rather than
wait for memory to become available, which may never happen.

This is how panic_on_oom has always worked.

> > Your patch causes that to instead panic the system if panic_on_oom is set.
> > It's inappropriate and userspace breakage. The fact that I don't
> > personally use panic_on_oom is completely and utterly irrelevant.
> >
> > There is absolutely nothing wrong with a process that has been killed
> > either directly by userspace or as part of a group exit, or a process that
> > is already in the exit path and needs to allocate memory to be able to
> > free its memory, to get access to memory reserves. That's not an oom
> > condition, that's memory reserves. Panic_on_oom has nothing to do with
> > this scenario whatsoever.
>
> It very much has and I have presented arguments about that which you
> didn't bother to comment on. TIF_MEMDIE is not a magic which will help a
> task to exit in all cases. It is a heuristic and it can fail.
> panic_on_oops is a hand break when things go wrong and you want to
> reduce your unresponsive time (read failover part in the documentation).
>

Threads that have been oom killed and have TIF_MEMDIE set should exit.
It's certainly a problem if they do not, since the oom killer relies on it
and will defer forever until it does exit. (We don't actually require
that the thread fully exit, we just require that its memory is freed.) If
you're trying to address the issue that Tetsuo Handa brought up (strange,
because you seemed to not want Tetsuo to talk), then that needs to be
handled in a way that makes forward progress. I suggested three methods
for doing that in this thread that can be pursued to do that, but
panicking the system is not one of them.

> > Panic_on_oom is not panic_when_reclaim_fails.
>
> OOM is when all other reclaim attempts fail. Jeez we are in
> out_of_memory how can this be potentially unclear to you? Yes oom killer
> path might use heuristics to reduce the impact of the OOM condition but
> once we are in this path _we_are_OOM_.
>

Hmm, not exactly. You can't make the same argument for GFP_ATOMIC
allocations, for instance, where we don't have the ability to reclaim.
They get access to a memory reserve so they may succeed in this context.
In the case your patch is short-circuiting, a GFP_KERNEL allocation can
fail to reclaim and then you've decided to panic rather than give an
exiting thread access to memory reserves. It's unnecessary.

(I personally don't care what you do or do not label "oom", I only care
about panic vs. no-panic when the kernel has the ability to allow the
allocation to succeed and make forward progress.)

Let me be clear: the issue that Tetsuo brings up is very real and serious.
It exists for system memory as well as memcg. Trying to address it with
panic_on_oom is absurd. It may be difficult to address, and require
substantial VM work to fix, but panicking is not a solution and would lead
to arbitrary machines in a very large fleet rebooting. There's nothing
the userspace programmer could have done differently to prevent it, this
is solely a kernel issue.