Completely disabling the oom killer for a memcg is problematic if
userspace is unable to address the condition itself, usually because it
is unresponsive. This scenario creates a memcg deadlock: tasks sit in
TASK_KILLABLE waiting for the limit to be increased, for a task to exit or
move, or for the oom killer to be re-enabled, and userspace is unable to
do any of these.
An additional possible use case is to defer oom killing within a memcg
for a set period of time, probably to prevent unnecessary kills due to
temporary memory spikes, before allowing the kernel to handle the
condition.
This patch adds an oom killer delay so that a memcg may be configured to
wait at least a pre-defined number of milliseconds before calling the oom
killer. If the oom condition persists for this number of milliseconds,
the oom killer will be called the next time the memory controller
attempts to charge a page (and memory.oom_control is set to 0). This
allows userspace to have a short period of time to respond to the
condition before deferring to the kernel to kill a task.
Admins may set the oom killer delay using the new interface:
# echo 60000 > memory.oom_delay_millisecs
This will defer oom killing to the kernel only after 60 seconds has
elapsed by putting the task to sleep for 60 seconds.
This expiration is cleared in four cases:
- anytime the oom killer is called so another memory.oom_delay_millisecs
delay is incurred the next time,
- anytime memory is uncharged from the memcg, so that it is no longer
oom and there is now more available memory,
- anytime memory.limit_in_bytes is raised so that there is now more
available memory, and
- anytime memory.oom_delay_millisecs is written from userspace to change
the next delay.
Unless one of these events occurs after an oom delay has expired, all
future oom kills in that memcg will continue without incurring any delay.
When memory.oom_delay_millisecs is set for a memcg, it is propagated to
all of its child memcgs as well and is inherited when a new child memcg
is created.
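For example (paths illustrative), a child memcg created after the delay is
set starts out with its parent's value:

    # echo 30000 > /sys/fs/cgroup/memory/parent/memory.oom_delay_millisecs
    # mkdir /sys/fs/cgroup/memory/parent/child
    # cat /sys/fs/cgroup/memory/parent/child/memory.oom_delay_millisecs
    30000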
Signed-off-by: David Rientjes <[email protected]>
---
Documentation/cgroups/memory.txt | 27 ++++++++++++++++++
mm/memcontrol.c | 59 ++++++++++++++++++++++++++++++++++++----
2 files changed, 81 insertions(+), 5 deletions(-)
diff --git a/Documentation/cgroups/memory.txt b/Documentation/cgroups/memory.txt
--- a/Documentation/cgroups/memory.txt
+++ b/Documentation/cgroups/memory.txt
@@ -71,6 +71,7 @@ Brief summary of control files.
(See sysctl's vm.swappiness)
memory.move_charge_at_immigrate # set/show controls of moving charges
memory.oom_control # set/show oom controls.
+ memory.oom_delay_millisecs # set/show millisecs to wait before oom kill
memory.numa_stat # show the number of memory usage per numa node
memory.kmem.limit_in_bytes # set/show hard limit for kernel memory
@@ -766,6 +767,32 @@ At reading, current status of OOM is shown.
under_oom 0 or 1 (if 1, the memory cgroup is under OOM, tasks may
be stopped.)
+It is possible to configure an oom killer delay to prevent the possibility that
+the memcg will deadlock looking for memory if userspace has disabled the oom
+killer with oom_control but cannot act to fix the condition itself (usually
+because it is unresponsive).
+
+To set an oom killer delay for a memcg, write the number of milliseconds to wait
+before killing a task to memory.oom_delay_millisecs:
+
+ # echo 60000 > memory.oom_delay_millisecs # 60 seconds before kill
+
+When this memcg is oom, it is guaranteed that this delay will elapse before the
+kernel kills a process. If memory is uncharged from this memcg or one of its
+limits is expanded during this period, the oom kill is inhibited.
+
+Disabling the oom killer for a memcg with memory.oom_control takes precedence
+over memory.oom_delay_millisecs, so it must be set to 0 (default) to allow the
+oom kill after the delay has expired.
+
+This value is inherited from the memcg's parent on creation. Setting a delay
+for a memcg sets the same delay for all children.
+
+There is no delay if memory.oom_delay_millisecs is set to 0 (default). This
+tunable's upper bound is MAX_SCHEDULE_TIMEOUT (about 24 days on 32-bit and a
+lifetime on 64-bit).
+
+
11. Memory Pressure
The pressure level notifications can be used to monitor the memory
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -308,6 +308,10 @@ struct mem_cgroup {
int swappiness;
/* OOM-Killer disable */
int oom_kill_disable;
+ /* number of ticks to stall before calling oom killer */
+ int oom_delay;
+ /* expiration of current delay in jiffies, if oom in progress */
+ atomic64_t oom_delay_expire;
/* set when res.limit == memsw.limit */
bool memsw_is_minimum;
@@ -2185,8 +2189,11 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg)
static void memcg_oom_recover(struct mem_cgroup *memcg)
{
- if (memcg && atomic_read(&memcg->under_oom))
- memcg_wakeup_oom(memcg);
+ if (memcg) {
+ atomic64_set(&memcg->oom_delay_expire, 0);
+ if (atomic_read(&memcg->under_oom))
+ memcg_wakeup_oom(memcg);
+ }
}
/*
@@ -2197,6 +2204,7 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
{
struct oom_wait_info owait;
bool locked, need_to_kill;
+ long timeout = MAX_SCHEDULE_TIMEOUT;
owait.memcg = memcg;
owait.wait.flags = 0;
@@ -2217,15 +2225,25 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
- if (locked)
+ if (locked) {
+ if (memcg->oom_delay) {
+ unsigned long expire;
+
+ expire = atomic64_cmpxchg(&memcg->oom_delay_expire, 0,
+ jiffies + memcg->oom_delay);
+ need_to_kill = expire && time_after_eq(jiffies, expire);
+ timeout = memcg->oom_delay;
+ }
mem_cgroup_oom_notify(memcg);
+ }
spin_unlock(&memcg_oom_lock);
if (need_to_kill) {
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
+ atomic64_set(&memcg->oom_delay_expire, 0);
} else {
- schedule();
+ schedule_timeout(timeout);
finish_wait(&memcg_oom_waitq, &owait.wait);
}
spin_lock(&memcg_oom_lock);
@@ -2239,7 +2257,8 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
return false;
/* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+ if (timeout == MAX_SCHEDULE_TIMEOUT)
+ schedule_timeout_uninterruptible(1);
return true;
}
@@ -5855,6 +5874,30 @@ static int mem_cgroup_oom_control_write(struct cgroup *cgrp,
return 0;
}
+static u64 mem_cgroup_oom_delay_millisecs_read(struct cgroup *cgrp,
+ struct cftype *cft)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+
+ return jiffies_to_msecs(memcg->oom_delay);
+}
+
+static int mem_cgroup_oom_delay_millisecs_write(struct cgroup *cgrp,
+ struct cftype *cft, u64 val)
+{
+ struct mem_cgroup *memcg = mem_cgroup_from_cont(cgrp);
+ struct mem_cgroup *iter;
+
+ if (val > MAX_SCHEDULE_TIMEOUT)
+ return -EINVAL;
+
+ for_each_mem_cgroup_tree(iter, memcg) {
+ iter->oom_delay = msecs_to_jiffies(val);
+ memcg_oom_recover(iter);
+ }
+ return 0;
+}
+
#ifdef CONFIG_MEMCG_KMEM
static int memcg_init_kmem(struct mem_cgroup *memcg, struct cgroup_subsys *ss)
{
@@ -5962,6 +6005,11 @@ static struct cftype mem_cgroup_files[] = {
.private = MEMFILE_PRIVATE(_OOM_TYPE, OOM_CONTROL),
},
{
+ .name = "oom_delay_millisecs",
+ .read_u64 = mem_cgroup_oom_delay_millisecs_read,
+ .write_u64 = mem_cgroup_oom_delay_millisecs_write,
+ },
+ {
.name = "pressure_level",
.register_event = vmpressure_register_event,
.unregister_event = vmpressure_unregister_event,
@@ -6271,6 +6319,7 @@ mem_cgroup_css_online(struct cgroup *cont)
memcg->use_hierarchy = parent->use_hierarchy;
memcg->oom_kill_disable = parent->oom_kill_disable;
+ memcg->oom_delay = parent->oom_delay;
memcg->swappiness = mem_cgroup_swappiness(parent);
if (parent->use_hierarchy) {
On Wed 29-05-13 18:18:10, David Rientjes wrote:
> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because it
> is unresponsive.
Isn't this a bug in the userspace oom handler? Why is it unresponsive? It
shouldn't allocate any memory, so nothing should prevent it from running (if
other tasks are preempting it permanently then the priority of the handler
should be increased).
> This scenario creates a memcg deadlock: tasks sit in
> TASK_KILLABLE waiting for the limit to be increased, for a task to exit or
> move, or for the oom killer to be re-enabled, and userspace is unable to
> do any of these.
>
> An additional possible use case is to defer oom killing within a memcg
> for a set period of time, probably to prevent unnecessary kills due to
> temporary memory spikes, before allowing the kernel to handle the
> condition.
I am not sure I like the idea. How does an admin decide what the right value
of the timeout is? And why can't he use a userspace oom handler to do the
same thing?
[...]
--
Michal Hocko
SUSE Labs
On Thu, 30 May 2013, Michal Hocko wrote:
> > Completely disabling the oom killer for a memcg is problematic if
> > userspace is unable to address the condition itself, usually because it
> > is unresponsive.
>
> Isn't this a bug in the userspace oom handler? Why is it unresponsive? It
> shouldn't allocate any memory, so nothing should prevent it from running (if
> other tasks are preempting it permanently then the priority of the handler
> should be increased).
>
Unresponsiveness isn't necessarily only because of memory constraints; you
may have your oom notifier in a parent cgroup that isn't oom. If a
process is stuck on mm->mmap_sem in the oom cgroup, though, the oom
notifier may not be able to scrape /proc/pid and obtain the necessary
information to make an oom kill decision. If the oom notifier is in the
oom cgroup, it may not be able to successfully read the memcg "tasks"
file to even determine the set of eligible processes. There is also no
guarantee that the userspace oom handler will have the necessary memory to
even re-enable the oom killer in the memcg under oom which would allow the
kernel to make forward progress.
We've used this for a few years as a complement to oom notifiers so that a
process would have a grace period to deal with the oom condition itself
before allowing the kernel to terminate a process and free memory. We've
simply had no alternative in the presence of kernel constraints that
prevent it from being done in any other way. We _want_ userspace to deal
with the issue but when it cannot collect the necessary information (and
we're not tracing every fork() that every process in a potentially oom
memcg does) to deal with the condition, we want the kernel to step in
instead of relying on an admin logging in or on a global oom condition.
If you'd like to debate this issue, I'd be more than happy to do so and
show why this patch is absolutely necessary for inclusion, but I'd ask
that you'd present the code from your userspace oom handler so I can
understand how it works without needing such backup support.
On Thu 30-05-13 13:47:30, David Rientjes wrote:
> On Thu, 30 May 2013, Michal Hocko wrote:
>
> > > Completely disabling the oom killer for a memcg is problematic if
> > > userspace is unable to address the condition itself, usually because it
> > > is unresponsive.
> >
> > Isn't this a bug in the userspace oom handler? Why is it unresponsive? It
> > shouldn't allocate any memory, so nothing should prevent it from running (if
> > other tasks are preempting it permanently then the priority of the handler
> > should be increased).
> >
>
> Unresponsiveness isn't necessarily only because of memory constraints; you
> may have your oom notifier in a parent cgroup that isn't oom.
I have always discouraged people from running oom handler in the same
memcg (or even in the same hierarchy).
> If a process is stuck on mm->mmap_sem in the oom cgroup, though, the
> oom notifier may not be able to scrape /proc/pid and obtain the necessary
> information to make an oom kill decision.
Yes, mmap_sem is tricky. Nothing in the proc code should take it for
writing, and charges are done with mmap_sem held for reading, but that
doesn't prevent a non-oom thread from trying to take it for writing,
which would also block all future readers. We have also seen i_mutex
being held during charge, so you have to be careful about that one as
well, but I am not aware of other locks that could be a problem.
The question is, do you really need to open any /proc/<pid>/ files which
depend on mmap_sem (e.g. maps, smaps). /proc/<pid>/status should tell you
about used memory. Or to put it the other way around: what kind of data do
you need for your OOM handler?
I might be thinking about different use cases but user space OOM
handlers I have seen so far had quite a good idea what is going on
in the group and who to kill. So it was more a decision based on the
workload and its semantic rather than based on the used memory (which
is done quite sensibly with the in kernel handler already) or something
that would trip over mmap_sem when trying to get information.
> If the oom notifier is in the oom cgroup, it may not be able to
> successfully read the memcg "tasks" file to even determine the set of
> eligible processes.
It would have to use preallocated buffer and have mlocked all the memory
that will be used during oom event.
> There is also no guarantee that the userspace oom handler will have
> the necessary memory to even re-enable the oom killer in the memcg
> under oom which would allow the kernel to make forward progress.
Why wouldn't it have enough memory to write to the file? Assuming that
the oom handler has the file handle (for limit_in_bytes) open and has
mlocked all the necessary memory (code + buffers), I do not see what
would prevent it from writing to limit_in_bytes.
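Something like this sketch (untested; the group path and the new limit
value are made up):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

static int limit_fd;

/* at startup, long before any oom condition: */
static void handler_setup(void)
{
	limit_fd = open("/sys/fs/cgroup/memory/mygroup/memory.limit_in_bytes",
			O_WRONLY);
	mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* at oom time: a single write on a pre-opened fd, no allocations needed */
static void raise_limit(void)
{
	static const char new_limit[] = "2147483648";

	write(limit_fd, new_limit, sizeof(new_limit) - 1);
}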
> We've used this for a few years as a complement to oom notifiers so that a
> process would have a grace period to deal with the oom condition itself
> before allowing the kernel to terminate a process and free memory. We've
> simply had no alternative in the presence of kernel constraints that
> prevent it from being done in any other way. We _want_ userspace to deal
> with the issue but when it cannot collect the necessary information (and
> we're not tracing every fork() that every process in a potentially oom
> memcg does) to deal with the condition, we want the kernel to step in
> instead of relying on an admin logging in or on a global oom condition.
>
> If you'd like to debate this issue, I'd be more than happy to do so and
> show why this patch is absolutely necessary for inclusion, but I'd ask
> that you'd present the code from your userspace oom handler so I can
> understand how it works without needing such backup support.
I usually do not write those things myself as I am supporting others so
I do not have any code handy right now. But I can come up with a simple
handler which implements your timeout based killer for peak workloads
you have mentioned earlier. That one should be quite easy to do.
--
Michal Hocko
SUSE Labs
On Fri, 31 May 2013, Michal Hocko wrote:
> I have always discouraged people from running oom handler in the same
> memcg (or even in the same hierarchy).
>
We allow users to control their own memcgs by chowning them, so they must
be run in the same hierarchy if they want to run their own userspace oom
handler. There's nothing in the kernel that prevents that and the user
has no other option but to run in a parent cgroup.
> Yes, mmap_sem is tricky. Nothing in the proc code should take it for
> writing, and charges are done with mmap_sem held for reading, but that
> doesn't prevent a non-oom thread from trying to take it for writing,
> which would also block all future readers. We have also seen i_mutex
> being held during charge, so you have to be careful about that one as
> well, but I am not aware of other locks that could be a problem.
>
> The question is, do you really need to open any /proc/<pid>/ files which
> depend on mmap_sem (e.g. maps, smaps). /proc/<pid>/status should tell you
> about used memory. Or to put it the other way around: what kind of data do
> you need for your OOM handler?
>
It's too easy to simply do even a "ps ax" in an oom memcg and make that
thread completely unresponsive because it allocates memory.
> I might be thinking about different use cases but user space OOM
> handlers I have seen so far had quite a good idea what is going on
> in the group and who to kill.
Then perhaps I'm raising constraints that you've never worked with; I
don't know. We choose to have a priority-based approach that is inherited
by children; this priority is kept in userspace and the oom handler
would naturally need to know the set of tasks in the oom memcg at the time
of oom and their parent-child relationship. These priorities are
completely independent of memory usage.
> > If the oom notifier is in the oom cgroup, it may not be able to
> > successfully read the memcg "tasks" file to even determine the set of
> > eligible processes.
>
> It would have to use preallocated buffer and have mlocked all the memory
> that will be used during oom event.
>
Wrong, the kernel itself allocates memory when reading this information
and that would fail in an oom memcg.
On Fri 31-05-13 03:22:59, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
[...]
> > > If the oom notifier is in the oom cgroup, it may not be able to
> > > successfully read the memcg "tasks" file to even determine the set of
> > > eligible processes.
> >
> > It would have to use preallocated buffer and have mlocked all the memory
> > that will be used during oom event.
> >
>
> Wrong, the kernel itself allocates memory when reading this information
> and that would fail in an oom memcg.
But that memory is not charged to a memcg, is it? So unless you are
heading towards global OOM you should be safe.
--
Michal Hocko
SUSE Labs
On Fri 31-05-13 03:22:59, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
>
> > I have always discouraged people from running oom handler in the same
> > memcg (or even in the same hierarchy).
> >
>
> We allow users to control their own memcgs by chowning them, so they must
> be run in the same hierarchy if they want to run their own userspace oom
> handler. There's nothing in the kernel that prevents that and the user
> has no other option but to run in a parent cgroup.
If the access to the oom_control file is controlled by the file
permissions then the oom handler can live inside root cgroup. Why do you
need "must be in the same hierarchy" requirement?
> > Yes, mmap_sem is tricky. Nothing in the proc code should take it for
> > writing, and charges are done with mmap_sem held for reading, but that
> > doesn't prevent a non-oom thread from trying to take it for writing,
> > which would also block all future readers. We have also seen i_mutex
> > being held during charge, so you have to be careful about that one as
> > well, but I am not aware of other locks that could be a problem.
> >
> > The question is, do you really need to open any /proc/<pid>/ files which
> > depend on mmap_sem (e.g. maps, smaps). /proc/<pid>/status should tell you
> > about used memory. Or to put it the other way around: what kind of data do
> > you need for your OOM handler?
> >
>
> It's too easy to simply do even a "ps ax" in an oom memcg and make that
> thread completely unresponsive because it allocates memory.
Yes, but we are talking about oom handler and that one has to be really
careful about what it does. So doing something that simply allocates is
dangerous.
> > I might be thinking about different use cases but user space OOM
> > handlers I have seen so far had quite a good idea what is going on
> > in the group and who to kill.
>
> Then perhaps I'm raising constraints that you've never worked with; I
> don't know. We choose to have a priority-based approach that is inherited
> by children; this priority is kept in userspace and the oom handler
> would naturally need to know the set of tasks in the oom memcg at the time
> of oom and their parent-child relationship. These priorities are
> completely independent of memory usage.
OK, but both reading tasks file and readdir should be doable without
userspace (aka charged) allocations. Moreover if you run those oom
handlers under the root cgroup then it should be even easier.
--
Michal Hocko
SUSE Labs
On Fri, 31 May 2013, Michal Hocko wrote:
> > We allow users to control their own memcgs by chowning them, so they must
> > be run in the same hierarchy if they want to run their own userspace oom
> > handler. There's nothing in the kernel that prevents that and the user
> > has no other option but to run in a parent cgroup.
>
> If the access to the oom_control file is controlled by the file
> permissions then the oom handler can live inside root cgroup. Why do you
> need "must be in the same hierarchy" requirement?
>
Users obviously don't have the ability to attach processes to the root
memcg. They are constrained to their own subtree of memcgs.
> > It's too easy to simply do even a "ps ax" in an oom memcg and make that
> > thread completely unresponsive because it allocates memory.
>
> Yes, but we are talking about oom handler and that one has to be really
> careful about what it does. So doing something that simply allocates is
> dangerous.
>
Show me a userspace oom handler that knows the entire process tree of its
own memcg or a child memcg without being notified of every fork() in the
memcg, which causes a performance degradation of its own for a complete
and utter slowpath.
This discussion is all fine and good from a theoretical point of view
until you actually have to implement one of these yourself. Multiple
users are going to be running their own oom notifiers, and without some
sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too
easily deadlock looking for memory. If that user is constrained to his or
her own subtree, as previously stated, there's also no way to log in and
rectify the situation at that point, so it requires admin intervention or
a reboot.
> > Then perhaps I'm raising constraints that you've never worked with; I
> > don't know. We choose to have a priority-based approach that is inherited
> > by children; this priority is kept in userspace and the oom handler
> > would naturally need to know the set of tasks in the oom memcg at the time
> > of oom and their parent-child relationship. These priorities are
> > completely independent of memory usage.
>
> OK, but both reading tasks file and readdir should be doable without
> userspace (aka charged) allocations. Moreover if you run those oom
> handlers under the root cgroup then it should be even easier.
Then why does "cat tasks" stall when my memcg is totally depleted of all
memory?
This isn't even the argument because memory.oom_delay_millisecs isn't
going to help that situation. I'm talking about a failsafe that ensures a
memcg can't deadlock. The global oom killer will always have to exist in
the kernel, at least in the most simplistic of forms, solely for this
reason; you can't move all of the logic to userspace and expect it to
react 100% of the time.
On Wed, 29 May 2013 18:18:10 -0700 (PDT) David Rientjes <[email protected]> wrote:
> Completely disabling the oom killer for a memcg is problematic if
> userspace is unable to address the condition itself, usually because it
> is unresponsive. This scenario creates a memcg deadlock: tasks sit in
> TASK_KILLABLE waiting for the limit to be increased, for a task to exit or
> move, or for the oom killer to be re-enabled, and userspace is unable to
> do any of these.
>
> An additional possible use case is to defer oom killing within a memcg
> for a set period of time, probably to prevent unnecessary kills due to
> temporary memory spikes, before allowing the kernel to handle the
> condition.
>
> This patch adds an oom killer delay so that a memcg may be configured to
> wait at least a pre-defined number of milliseconds before calling the oom
> killer. If the oom condition persists for this number of milliseconds,
> the oom killer will be called the next time the memory controller
> attempts to charge a page (and memory.oom_control is set to 0). This
> allows userspace to have a short period of time to respond to the
> condition before deferring to the kernel to kill a task.
>
> Admins may set the oom killer delay using the new interface:
>
> # echo 60000 > memory.oom_delay_millisecs
>
> This will defer oom killing to the kernel only after 60 seconds has
> elapsed by putting the task to sleep for 60 seconds.
How often is that delay actually useful, in the real world?
IOW, in what proportion of cases does the system just remain stuck for
60 seconds and then get an oom-killing?
On Fri, May 31, 2013 at 12:29:17PM -0700, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
> > > It's too easy to simply do even a "ps ax" in an oom memcg and make that
> > > thread completely unresponsive because it allocates memory.
> >
> > Yes, but we are talking about oom handler and that one has to be really
> > careful about what it does. So doing something that simply allocates is
> > dangerous.
> >
>
> Show me a userspace oom handler that knows the entire process tree of its
> own memcg or a child memcg without being notified of every fork() in the
> memcg, which causes a performance degradation of its own for a complete
> and utter slowpath.
>
> This discussion is all fine and good from a theoretical point of view
> until you actually have to implement one of these yourself. Multiple
> users are going to be running their own oom notifiers, and without some
> sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too
> easily deadlock looking for memory. If that user is constrained to his or
> her own subtree, as previously stated, there's also no way to log in and
> rectify the situation at that point, so it requires admin intervention or
> a reboot.
Memcg OOM killing is fragile, and doing it in userspace does not help.
But a userspace OOM handler that is subject to the OOM situation it is
supposed to resolve? That's kind of... cruel.
However, I agree with your other point: as long as chargers can get
stuck during OOM while sitting on an unpredictable bunch of locks, we
can not claim that there is a right way to write a reliable userspace
OOM handler. Userspace has no way of knowing which operations are
safe, and it might even vary between kernel versions.
But I would prefer a better solution than the timeout. Michal in the
past already tried to disable the OOM killer for page cache charges in
general due to deadlock scenarios. Could we invoke the killer just
the same but make sure we never trap tasks in the charge context?
I'm currently messing around with the below patch. When a task faults
and charges under OOM, the memcg is remembered in the task struct and
then made to sleep on the memcg's OOM waitqueue only after unwinding
the page fault stack. With the kernel OOM killer disabled, all tasks
in the OOMing group sit nicely in
mem_cgroup_oom_synchronize
pagefault_out_of_memory
mm_fault_error
__do_page_fault
page_fault
0xffffffffffffffff
regardless of whether they were faulting anon or file. They do not
even hold the mmap_sem anymore at this point.
[ I kept syscalls really simple for now and just have them return
-ENOMEM, never trap them at all (just like the global OOM case).
It would be more work to have them wait from a flatter stack too,
but it should be doable if necessary. ]
I suggested this at the MM summit and people were essentially asking
if I was feeling well, so maybe I'm still missing a gaping hole in
this idea.
Patch only works on x86 as of now, on other architectures memcg OOM
will invoke the global OOM killer.
---
From: Johannes Weiner <[email protected]>
Subject: [PATCH] memcg: more robust oom handling
The memcg OOM handling is incredibly fragile because once a memcg goes
OOM, one task (kernel or userspace) is responsible for resolving the
situation. Every other task that gets caught trying to charge memory
gets stuck in a waitqueue while potentially holding various filesystem
and mm locks on which the OOM handling task may now deadlock.
Do two things to charge attempts under OOM:
1. Do not trap system calls (buffered IO and friends), just return
-ENOMEM. Userspace should be able to handle this... right?
2. Do not trap page faults directly in the charging context. Instead,
remember the OOMing memcg in the task struct and fully unwind the
page fault stack first. Then synchronize the memcg OOM from
pagefault_out_of_memory()
Not-quite-signed-off-by: Johannes Weiner <[email protected]>
---
arch/x86/mm/fault.c | 2 +
include/linux/memcontrol.h | 6 +++
include/linux/sched.h | 6 +++
mm/memcontrol.c | 104 +++++++++++++++++++++++++--------------------
mm/oom_kill.c | 7 ++-
5 files changed, 78 insertions(+), 47 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 0e88336..df043a0 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1185,7 +1185,9 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault:
*/
+ current->in_userfault = 1;
fault = handle_mm_fault(mm, vma, address, flags);
+ current->in_userfault = 0;
if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
if (mm_fault_error(regs, error_code, address, fault))
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..e1b84ea 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+bool mem_cgroup_oom_synchronize(void);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
@@ -337,6 +338,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index e692a02..cf60aef 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1282,6 +1282,8 @@ struct task_struct {
* execve */
unsigned in_iowait:1;
+ unsigned in_userfault:1;
+
/* task may not gain privileges */
unsigned no_new_privs:1;
@@ -1570,6 +1572,10 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ struct mem_cgroup *memcg;
+ int wakeups;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index cc3026a..6e13ebe 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -291,6 +291,7 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;
atomic_t refcnt;
@@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
+ atomic_inc(&memcg->oom_wakeups);
}
static void memcg_oom_recover(struct mem_cgroup *memcg)
@@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}
/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg,
+ gfp_t mask, int order,
+ bool in_userfault)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
+ if (!locked || memcg->oom_kill_disable) {
need_to_kill = false;
+ if (in_userfault) {
+ /*
+ * We start sleeping on the OOM waitqueue only
+ * after unwinding the page fault stack, so
+ * make sure we detect wakeups that happen
+ * between now and then.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.memcg = memcg;
+ }
+ }
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);
if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ memcg_oom_recover(memcg);
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ spin_unlock(&memcg_oom_lock);
+ }
+}
- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ memcg = current->memcg_oom.memcg;
+ if (!memcg)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.memcg = NULL;
return true;
}
@@ -2447,7 +2469,6 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
@@ -2509,14 +2530,11 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (oom_check)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize),
+ current->in_userfault);
- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}
/*
@@ -2647,16 +2665,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || oom_check) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e451a..0c9f836 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;
+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.2.3
On Fri 31-05-13 12:29:17, David Rientjes wrote:
> On Fri, 31 May 2013, Michal Hocko wrote:
>
> > > We allow users to control their own memcgs by chowning them, so they must
> > > be run in the same hierarchy if they want to run their own userspace oom
> > > handler. There's nothing in the kernel that prevents that and the user
> > > has no other option but to run in a parent cgroup.
> >
> > If the access to the oom_control file is controlled by the file
> > permissions then the oom handler can live inside root cgroup. Why do you
> > need "must be in the same hierarchy" requirement?
> >
>
> Users obviously don't have the ability to attach processes to the root
> memcg. They are constrained to their own subtree of memcgs.
OK, I assume those groups are generally untrusted, right? So you cannot
let them register their oom handler even via an admin interface. This
makes it a bit complicated because it places much harder demands on the
handler itself, as it has to run in a restricted environment.
> > > It's too easy to simply do even a "ps ax" in an oom memcg and make that
> > > thread completely unresponsive because it allocates memory.
> >
> > Yes, but we are talking about oom handler and that one has to be really
> > careful about what it does. So doing something that simply allocates is
> > dangerous.
> >
>
> Show me a userspace oom handler that knows the entire process tree of its
> own memcg or a child memcg without being notified of every fork() in the
> memcg, which causes a performance degradation of its own for a complete
> and utter slowpath.
I still do not see why you cannot simply read the tasks file into a
preallocated buffer. This would be a few pages even for thousands of pids.
You do not have to track processes as they come and go.
> This discussion is all fine and good from a theoretical point of view
> until you actually have to implement one of these yourself. Multiple
> users are going to be running their own oom notifiers, and without some
> sort of failsafe, such as memory.oom_delay_millisecs, a memcg can too
> easily deadlock looking for memory.
As I said before, oom_delay_millisecs is actually really easy to do
from userspace. If you really need a safety break then you can register
such a handler as a fallback. I am not familiar with eventfd internals
much but I guess that multiple handlers are possible. The fallback might
be enforced by the admin (when a new group is created) or by the
container itself. Would something like this work for your use case?
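For illustration, here is a minimal sketch of such a fallback handler
(untested, error handling omitted; the cgroup v1 mount point and group name
are made up), using the eventfd notification protocol of memory.oom_control:

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <sys/mman.h>
#include <unistd.h>

#define GRP "/sys/fs/cgroup/memory/mygroup/"	/* made-up path */

int main(void)
{
	int oom_fd = open(GRP "memory.oom_control", O_RDWR);
	int ctl_fd = open(GRP "cgroup.event_control", O_WRONLY);
	int ev_fd = eventfd(0, 0);
	uint64_t events;
	char buf[32];

	/* register ev_fd to fire on oom events in this memcg */
	snprintf(buf, sizeof(buf), "%d %d", ev_fd, oom_fd);
	write(ctl_fd, buf, strlen(buf));

	/* disable the kernel oom killer and lock ourselves in memory */
	write(oom_fd, "1", 1);
	mlockall(MCL_CURRENT | MCL_FUTURE);

	for (;;) {
		read(ev_fd, &events, sizeof(events));	/* blocks until oom */
		sleep(10);	/* grace period for the primary handler */
		/*
		 * Still unresolved?  Re-enabling the kernel oom killer makes
		 * it act on the next charge attempt; a real handler would
		 * check under_oom in memory.oom_control first.  Disable it
		 * again once the group has recovered.
		 */
		write(oom_fd, "0", 1);
		sleep(1);
		write(oom_fd, "1", 1);
	}
}

The primary handler is free to do something smarter in the meantime; this
one only provides the safety break.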
> If that user is constrained to his or her own subtree, as previously
> stated, there's also no way to log in and rectify the situation at that
> point, so it requires admin intervention or a reboot.
Yes, insisting on the same subtree makes life much harder for oom
handlers. I totally agree with you on that. I just feel that introducing
a new knob to work around users' "inability" to write a proper handler
(whatever that means) is not justified.
> > > Then perhaps I'm raising constraints that you've never worked with; I
> > > don't know. We choose to have a priority-based approach that is inherited
> > > by children; this priority is kept in userspace and the oom handler
> > > would naturally need to know the set of tasks in the oom memcg at the time
> > > of oom and their parent-child relationship. These priorities are
> > > completely independent of memory usage.
> >
> > OK, but both reading tasks file and readdir should be doable without
> > userspace (aka charged) allocations. Moreover if you run those oom
> > handlers under the root cgroup then it should be even easier.
>
> Then why does "cat tasks" stall when my memcg is totally depleted of all
> memory?
If you run it like this then cat obviously needs some charged
allocations. If you had a proper handler which mlocks its buffer for the
read syscall then you shouldn't require any allocation at oom time.
This shouldn't be that hard to do without too much memory overhead. As I
said, we are talking about a few (dozens of) pages per handler.
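Roughly like this sketch (untested; again the group path is made up), with
the buffer and descriptor prepared before any oom:

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

/* a few dozen pages is plenty for thousands of pids */
static char task_buf[32 * 4096];
static int tasks_fd;

/* at startup: */
static void handler_setup(void)
{
	tasks_fd = open("/sys/fs/cgroup/memory/mygroup/tasks", O_RDONLY);
	mlockall(MCL_CURRENT | MCL_FUTURE);
}

/* at oom time: no userspace allocations, just read into the locked buffer */
static void read_tasks(void)
{
	ssize_t n;

	lseek(tasks_fd, 0, SEEK_SET);
	n = read(tasks_fd, task_buf, sizeof(task_buf) - 1);
	if (n > 0)
		task_buf[n] = '\0';	/* newline-separated pids */
}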
> This isn't even the argument because memory.oom_delay_millisecs isn't
> going to help that situation. I'm talking about a failsafe that ensures a
> memcg can't deadlock. The global oom killer will always have to exist in
> the kernel, at least in the most simplistic of forms, solely for this
> reason; you can't move all of the logic to userspace and expect it to
> react 100% of the time.
The global case is even more complicated because every allocation
matters - not just those that are charged, as in the memcg case. Btw. at
the last LSF there was a discussion about enabling oom_control for the root
cgroup but this is off-topic here and should be discussed separately
when somebody actually tries to implement this.
--
Michal Hocko
SUSE Labs
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
[...]
> I'm currently messing around with the below patch. When a task faults
> and charges under OOM, the memcg is remembered in the task struct and
> then made to sleep on the memcg's OOM waitqueue only after unwinding
> the page fault stack. With the kernel OOM killer disabled, all tasks
> in the OOMing group sit nicely in
>
> mem_cgroup_oom_synchronize
> pagefault_out_of_memory
> mm_fault_error
> __do_page_fault
> page_fault
> 0xffffffffffffffff
>
> regardless of whether they were faulting anon or file. They do not
> even hold the mmap_sem anymore at this point.
>
> [ I kept syscalls really simple for now and just have them return
> -ENOMEM, never trap them at all (just like the global OOM case).
> It would be more work to have them wait from a flatter stack too,
> but it should be doable if necessary. ]
>
> I suggested this at the MM summit and people were essentially asking
> if I was feeling well, so maybe I'm still missing a gaping hole in
> this idea.
I didn't get to look at the patch (will do on Monday) but it doesn't
sound entirely crazy. Well, we would have to drop mmap_sem so things
have to be rechecked, but we are doing that already with VM_FAULT_RETRY
on some archs, so it should work.
> Patch only works on x86 as of now, on other architectures memcg OOM
> will invoke the global OOM killer.
[...]
--
Michal Hocko
SUSE Labs
On Sat, Jun 01, 2013 at 12:29:05PM +0200, Michal Hocko wrote:
> On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> [...]
> > I'm currently messing around with the below patch. When a task faults
> > and charges under OOM, the memcg is remembered in the task struct and
> > then made to sleep on the memcg's OOM waitqueue only after unwinding
> > the page fault stack. With the kernel OOM killer disabled, all tasks
> > in the OOMing group sit nicely in
> >
> > mem_cgroup_oom_synchronize
> > pagefault_out_of_memory
> > mm_fault_error
> > __do_page_fault
> > page_fault
> > 0xffffffffffffffff
> >
> > regardless of whether they were faulting anon or file. They do not
> > even hold the mmap_sem anymore at this point.
> >
> > [ I kept syscalls really simple for now and just have them return
> > -ENOMEM, never trap them at all (just like the global OOM case).
> > It would be more work to have them wait from a flatter stack too,
> > but it should be doable if necessary. ]
> >
> > I suggested this at the MM summit and people were essentially asking
> > if I was feeling well, so maybe I'm still missing a gaping hole in
> > this idea.
>
> I didn't get to look at the patch (will do on Monday) but it doesn't
> sound entirely crazy. Well, we would have to drop mmap_sem so things
> have to be rechecked, but we are doing that already with VM_FAULT_RETRY
> on some archs, so it should work.
The global OOM case has been doing this for a long time (1c0fe6e mm:
invoke oom-killer from page fault), way before VM_FAULT_RETRY. The
fault is aborted with VM_FAULT_OOM and the oom killer is invoked.
Either the task gets killed or it'll just retrigger the fault. The
only difference is that a memcg OOM kill may take longer because of
userspace handling, so memcg needs a waitqueue where the global case
simply does a trylock (try_set_zonelist_oom) and restarts the fault
immediately if somebody else is handling the situation.
In fact, when Nick added the page fault OOM invocation, KAME merged
something similar to my patch, which tried to catch memcg OOM kills
from pagefault_out_of_memory() (a636b32 memcg: avoid unnecessary
system-wide-oom-killer).
Only when he reworked the whole memcg OOM synchronization, added the
ability to disable OOM and the waitqueues etc, the OOMs were trapped
right there in the charge context (867578c memcg: fix oom kill
behavior). But I see no reason why we shouldn't be able to keep the
waitqueues and still go back to synchronizing from the bottom of the
page fault stack.
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
[...]
> From: Johannes Weiner <[email protected]>
> Subject: [PATCH] memcg: more robust oom handling
>
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation. Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
> Do two things to charge attempts under OOM:
>
> 1. Do not trap system calls (buffered IO and friends), just return
> -ENOMEM. Userspace should be able to handle this... right?
>
> 2. Do not trap page faults directly in the charging context. Instead,
> remember the OOMing memcg in the task struct and fully unwind the
> page fault stack first. Then synchronize the memcg OOM from
> pagefault_out_of_memory()
I think this should work and I really like it! Nice work Johannes, I
never dared to go that deep and my opposite approach was also much more
fragile.
I am just afraid about all the other archs that do not support this (from
a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
openrisc, score and tile). What would be an alternative for them?
#ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
would be acceptable for me.
Few comments bellow.
> Not-quite-signed-off-by: Johannes Weiner <[email protected]>
> ---
> arch/x86/mm/fault.c | 2 +
> include/linux/memcontrol.h | 6 +++
> include/linux/sched.h | 6 +++
> mm/memcontrol.c | 104 +++++++++++++++++++++++++--------------------
> mm/oom_kill.c | 7 ++-
> 5 files changed, 78 insertions(+), 47 deletions(-)
>
[...]
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index e692a02..cf60aef 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1282,6 +1282,8 @@ struct task_struct {
> * execve */
> unsigned in_iowait:1;
>
> + unsigned in_userfault:1;
> +
[This is more of a nitpick, but noting it before I forget while I read
through the rest of the patch.]
OK, there is a lot of room around those bit fields, but as this is only
for memcg and you are enlarging the structure by a pointer anyway, you
can reuse the bottom bit of the memcg pointer.
> /* task may not gain privileges */
> unsigned no_new_privs:1;
>
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index cc3026a..6e13ebe 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> }
>
> /*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> + * try to call OOM killer
> */
> -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> - int order)
> +static void mem_cgroup_oom(struct mem_cgroup *memcg,
> + gfp_t mask, int order,
> + bool in_userfault)
> {
> - struct oom_wait_info owait;
> - bool locked, need_to_kill;
> -
> - owait.memcg = memcg;
> - owait.wait.flags = 0;
> - owait.wait.func = memcg_oom_wake_function;
> - owait.wait.private = current;
> - INIT_LIST_HEAD(&owait.wait.task_list);
> - need_to_kill = true;
> - mem_cgroup_mark_under_oom(memcg);
> + bool locked, need_to_kill = true;
>
> /* At first, try to OOM lock hierarchy under memcg.*/
> spin_lock(&memcg_oom_lock);
> locked = mem_cgroup_oom_lock(memcg);
> - /*
> - * Even if signal_pending(), we can't quit charge() loop without
> - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> - * under OOM is always welcomed, use TASK_KILLABLE here.
> - */
> - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> - if (!locked || memcg->oom_kill_disable)
> + if (!locked || memcg->oom_kill_disable) {
> need_to_kill = false;
> + if (in_userfault) {
> + /*
> + * We start sleeping on the OOM waitqueue only
> + * after unwinding the page fault stack, so
> + * make sure we detect wakeups that happen
> + * between now and then.
> + */
> + mem_cgroup_mark_under_oom(memcg);
> + current->memcg_oom.wakeups =
> + atomic_read(&memcg->oom_wakeups);
> + css_get(&memcg->css);
> + current->memcg_oom.memcg = memcg;
> + }
> + }
> if (locked)
> mem_cgroup_oom_notify(memcg);
> spin_unlock(&memcg_oom_lock);
>
> if (need_to_kill) {
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> mem_cgroup_out_of_memory(memcg, mask, order);
> - } else {
> - schedule();
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> + memcg_oom_recover(memcg);
Why do we need to call memcg_oom_recover here? We do not know that any
charges have been released. Say mem_cgroup_out_of_memory selected a task
which migrated to our group (without its charges) so we would kill the
poor guy and free no memory from this group.
Now you wake up oom waiters to refault but they will end up in the same
situation. I think it should be sufficient to wait for memcg_oom_recover
until the memory is uncharged (which we do already).
> }
> - spin_lock(&memcg_oom_lock);
> - if (locked)
> +
> + if (locked) {
> + spin_lock(&memcg_oom_lock);
> mem_cgroup_oom_unlock(memcg);
> - memcg_wakeup_oom(memcg);
> - spin_unlock(&memcg_oom_lock);
> + spin_unlock(&memcg_oom_lock);
> + }
> +}
>
> - mem_cgroup_unmark_under_oom(memcg);
[...]
> @@ -2647,16 +2665,12 @@ again:
> css_put(&memcg->css);
> goto nomem;
> case CHARGE_NOMEM: /* OOM routine works */
> - if (!oom) {
> + if (!oom || oom_check) {
OK, this allows us to remove the confusing nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES
from the branch where oom_check is set to true.
> css_put(&memcg->css);
> goto nomem;
> }
> - /* If oom, we never return -ENOMEM */
> nr_oom_retries--;
> break;
> - case CHARGE_OOM_DIE: /* Killed by OOM Killer */
> - css_put(&memcg->css);
> - goto bypass;
> }
> } while (ret != CHARGE_OK);
>
[...]
--
Michal Hocko
SUSE Labs
On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg)
> {
> /* for filtering, pass "memcg" as argument. */
> __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> + atomic_inc(&memcg->oom_wakeups);
> }
>
> static void memcg_oom_recover(struct mem_cgroup *memcg)
[...]
> + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> + /* Only sleep if we didn't miss any wakeups since OOM */
> + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
> + schedule();
On the way home it occurred to me that the ordering might be wrong here.
The wakeup can be lost:

    waker                               sleeper
    __wake_up(memcg_oom_waitq)
    <preempted>
                                        prepare_to_wait
                                        atomic_read(&memcg->oom_wakeups)
                                        schedule()
    atomic_inc(&memcg->oom_wakeups)

The sleeper sees the pre-increment count, decides it has not missed any
wakeups, and goes to sleep after the wakeup has already been issued. I
guess we want atomic_inc before __wake_up, right?
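I.e. something like (untested):

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
	atomic_inc(&memcg->oom_wakeups);	/* publish the wakeup first... */
	/* for filtering, pass "memcg" as argument. */
	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);	/* ...then wake */
}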
--
Michal Hocko
SUSE Labs
On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> [...]
> > From: Johannes Weiner <[email protected]>
> > Subject: [PATCH] memcg: more robust oom handling
> >
> > The memcg OOM handling is incredibly fragile because once a memcg goes
> > OOM, one task (kernel or userspace) is responsible for resolving the
> > situation. Every other task that gets caught trying to charge memory
> > gets stuck in a waitqueue while potentially holding various filesystem
> > and mm locks on which the OOM handling task may now deadlock.
> >
> > Do two things to charge attempts under OOM:
> >
> > 1. Do not trap system calls (buffered IO and friends), just return
> > -ENOMEM. Userspace should be able to handle this... right?
> >
> > 2. Do not trap page faults directly in the charging context. Instead,
> > remember the OOMing memcg in the task struct and fully unwind the
> > page fault stack first. Then synchronize the memcg OOM from
> > pagefault_out_of_memory()
>
> I think this should work and I really like it! Nice work Johannes, I
> never dared to go that deep and my opposite approach was also much more
> fragile.
>
> I am just afraid about all the other archs that do not support this (from
> a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> openrisc, score and tile). What would be an alternative for them?
> #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> would be acceptable for me.
blackfin is NOMMU but I guess the others should be converted to the
proper OOM protocol anyway and not just kill the faulting task. I can
update them in the next version of the patch (series).
> > Not-quite-signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > arch/x86/mm/fault.c | 2 +
> > include/linux/memcontrol.h | 6 +++
> > include/linux/sched.h | 6 +++
> > mm/memcontrol.c | 104 +++++++++++++++++++++++++--------------------
> > mm/oom_kill.c | 7 ++-
> > 5 files changed, 78 insertions(+), 47 deletions(-)
> >
> [...]
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index e692a02..cf60aef 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1282,6 +1282,8 @@ struct task_struct {
> > * execve */
> > unsigned in_iowait:1;
> >
> > + unsigned in_userfault:1;
> > +
>
> [This is more of a nitpick, but noting it before I forget while I read
> through the rest of the patch.]
>
> OK, there is a lot of room around those bit fields, but as this is only
> for memcg and you are enlarging the structure by a pointer anyway, you
> can reuse the bottom bit of the memcg pointer.
I just didn't want to put anything in the arch code that looks too
memcgish, even though it's the only user right now. But granted, it
will also probably remain the only user for a while.
> > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > }
> >
> > /*
> > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > + * try to call OOM killer
> > */
> > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > - int order)
> > +static void mem_cgroup_oom(struct mem_cgroup *memcg,
> > + gfp_t mask, int order,
> > + bool in_userfault)
> > {
> > - struct oom_wait_info owait;
> > - bool locked, need_to_kill;
> > -
> > - owait.memcg = memcg;
> > - owait.wait.flags = 0;
> > - owait.wait.func = memcg_oom_wake_function;
> > - owait.wait.private = current;
> > - INIT_LIST_HEAD(&owait.wait.task_list);
> > - need_to_kill = true;
> > - mem_cgroup_mark_under_oom(memcg);
> > + bool locked, need_to_kill = true;
> >
> > /* At first, try to OOM lock hierarchy under memcg.*/
> > spin_lock(&memcg_oom_lock);
> > locked = mem_cgroup_oom_lock(memcg);
> > - /*
> > - * Even if signal_pending(), we can't quit charge() loop without
> > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> > - * under OOM is always welcomed, use TASK_KILLABLE here.
> > - */
> > - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > - if (!locked || memcg->oom_kill_disable)
> > + if (!locked || memcg->oom_kill_disable) {
> > need_to_kill = false;
> > + if (in_userfault) {
> > + /*
> > + * We start sleeping on the OOM waitqueue only
> > + * after unwinding the page fault stack, so
> > + * make sure we detect wakeups that happen
> > + * between now and then.
> > + */
> > + mem_cgroup_mark_under_oom(memcg);
> > + current->memcg_oom.wakeups =
> > + atomic_read(&memcg->oom_wakeups);
> > + css_get(&memcg->css);
> > + current->memcg_oom.memcg = memcg;
> > + }
> > + }
> > if (locked)
> > mem_cgroup_oom_notify(memcg);
> > spin_unlock(&memcg_oom_lock);
> >
> > if (need_to_kill) {
> > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > mem_cgroup_out_of_memory(memcg, mask, order);
> > - } else {
> > - schedule();
> > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > + memcg_oom_recover(memcg);
>
> Why do we need to call memcg_oom_recover here? We do not know that any
> charges have been released. Say mem_cgroup_out_of_memory selected a task
> which migrated to our group (without its charges) so we would kill the
> poor guy and free no memory from this group.
> Now you wake up oom waiters to refault but they will end up in the same
> situation. I think it should be sufficient to wait for memcg_oom_recover
> until the memory is uncharged (which we do already).
It's a leftover from how it was before (see the memcg_wakeup_oom
below), but you are right, we can get rid of it.
> > }
> > - spin_lock(&memcg_oom_lock);
> > - if (locked)
> > +
> > + if (locked) {
> > + spin_lock(&memcg_oom_lock);
> > mem_cgroup_oom_unlock(memcg);
> > - memcg_wakeup_oom(memcg);
> > - spin_unlock(&memcg_oom_lock);
> > + spin_unlock(&memcg_oom_lock);
> > + }
> > +}
> >
> > - mem_cgroup_unmark_under_oom(memcg);
> [...]
> > @@ -2647,16 +2665,12 @@ again:
> > css_put(&memcg->css);
> > goto nomem;
> > case CHARGE_NOMEM: /* OOM routine works */
> > - if (!oom) {
> > + if (!oom || oom_check) {
>
> OK, this allows us to remove the confusing nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES
> from the branch where oom_check is set to true
Oops, will update.
Thanks for the review, Michal!
On Mon, Jun 03, 2013 at 06:31:34PM +0200, Michal Hocko wrote:
> On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> > @@ -2076,6 +2077,7 @@ static void memcg_wakeup_oom(struct mem_cgroup *memcg)
> > {
> > /* for filtering, pass "memcg" as argument. */
> > __wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
> > + atomic_inc(&memcg->oom_wakeups);
> > }
> >
> > static void memcg_oom_recover(struct mem_cgroup *memcg)
> [...]
> > + prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > + /* Only sleep if we didn't miss any wakeups since OOM */
> > + if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
> > + schedule();
>
> On the way home it occured to me that the ordering might be wrong here.
> The wake up can be lost here.
> __wake_up(memcg_oom_waitq)
> <preempted>
> prepare_to_wait
> atomic_read(&memcg->oom_wakeups)
> atomic_inc(oom_wakeups)
>
> I guess we want atomic_inc before __wake_up, right?
I think you are right, thanks for spotting this. Will be fixed in
version 2.
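For reference, here is the corrected ordering as a minimal sketch (this
is also what v2 further down in this thread ends up doing):

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
	/*
	 * Publish the wakeup before waking: a sleeper that samples
	 * oom_wakeups after prepare_to_wait() can no longer miss it.
	 */
	atomic_inc(&memcg->oom_wakeups);
	/* for filtering, pass "memcg" as argument. */
	__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}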
On Fri, 31 May 2013, Andrew Morton wrote:
> > Admins may set the oom killer delay using the new interface:
> >
> > # echo 60000 > memory.oom_delay_millisecs
> >
> > This will defer oom killing to the kernel only after 60 seconds has
> > elapsed by putting the task to sleep for 60 seconds.
>
> How often is that delay actually useful, in the real world?
>
> IOW, in what proportion of cases does the system just remain stuck for
> 60 seconds and then get an oom-killing?
>
It wouldn't be the system, it would just be the oom memcg that would be
stuck. We actually use 10s by default, but it's adjustable for users in
their own memcg hierarchies. It gives just enough time for userspace to
deal with the situation and then defer to the kernel if it's unresponsive;
this tends to happen quite regularly when you have many, many servers.
Same situation if the userspace oom handler has died and isn't running,
perhaps because of its own memory constraints (everything on our systems
is memory constrained). Obviously it isn't going to reenable the oom
killer before it dies from SIGSEGV.
I'd argue that the current functionality that allows users to disable the
oom killer for a memcg indefinitely is a bit ridiculous. It requires
admin intervention to fix such a state, and it would be pointless to have
an oom memcg sit there for a week, a month, a year, completely deadlocked,
making no forward progress while still consuming resources.
memory.oom_delay_millisecs in my patch is limited to MAX_SCHEDULE_TIMEOUT
just as a sanity check since we currently allow indefinite oom killer
disabling. If we were to rethink disabling the oom killer entirely via
memory.oom_control and accept that such a condition over a prolonged
period is insane, then this memory.oom_delay_millisecs ceiling would be
better defined as something on the order of minutes.
At the same time, we really like userspace oom notifications so users can
implement their own handlers. So where's the compromise between instantly
oom killing something and waiting forever for userspace to respond? My
suggestion is memory.oom_delay_millisecs.
On Mon 03-06-13 12:48:39, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> > On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> > [...]
> > > From: Johannes Weiner <[email protected]>
> > > Subject: [PATCH] memcg: more robust oom handling
> > >
> > > The memcg OOM handling is incredibly fragile because once a memcg goes
> > > OOM, one task (kernel or userspace) is responsible for resolving the
> > > situation. Every other task that gets caught trying to charge memory
> > > gets stuck in a waitqueue while potentially holding various filesystem
> > > and mm locks on which the OOM handling task may now deadlock.
> > >
> > > Do two things to charge attempts under OOM:
> > >
> > > 1. Do not trap system calls (buffered IO and friends), just return
> > > -ENOMEM. Userspace should be able to handle this... right?
> > >
> > > 2. Do not trap page faults directly in the charging context. Instead,
> > > remember the OOMing memcg in the task struct and fully unwind the
> > > page fault stack first. Then synchronize the memcg OOM from
> > > pagefault_out_of_memory()
> >
> > I think this should work and I really like it! Nice work Johannes, I
> > never dared to go that deep and my opposite approach was also much more
> > fragile.
> >
> > I am just afraid about all the other archs that do not support (from
> > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> > openrisc, score and tile). What would be an alternative for them?
> > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> > would be acceptable for me.
>
> blackfin is NOMMU but I guess the others should be converted to the
> proper OOM protocol anyway and not just kill the faulting task. I can
> update them in the next version of the patch (series).
OK, if you are willing to convert them all then even better.
> > > Not-quite-signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > arch/x86/mm/fault.c | 2 +
> > > include/linux/memcontrol.h | 6 +++
> > > include/linux/sched.h | 6 +++
> > > mm/memcontrol.c | 104 +++++++++++++++++++++++++--------------------
> > > mm/oom_kill.c | 7 ++-
> > > 5 files changed, 78 insertions(+), 47 deletions(-)
> > >
> > [...]
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index e692a02..cf60aef 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1282,6 +1282,8 @@ struct task_struct {
> > > * execve */
> > > unsigned in_iowait:1;
> > >
> > > + unsigned in_userfault:1;
> > > +
> >
> > [This is more a nit pick but before I forget while I am reading through
> > the rest of the patch.]
> >
> > OK there is a lot of room around those bit fields but as this is only
> > for memcg and you are enlarging the structure by the pointer then you
> > can reuse bottom bit of memcg pointer.
>
> I just didn't want to put anything in the arch code that looks too
> memcgish, even though it's the only user right now. But granted, it
> will also probably remain the only user for a while.
OK, no objection. I would just use a more generic name. Something like
memcg_oom_can_block.
[...]
> > > if (locked)
> > > mem_cgroup_oom_notify(memcg);
> > > spin_unlock(&memcg_oom_lock);
> > >
> > > if (need_to_kill) {
> > > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > > mem_cgroup_out_of_memory(memcg, mask, order);
> > > - } else {
> > > - schedule();
> > > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > > + memcg_oom_recover(memcg);
> >
> > Why do we need to call memcg_oom_recover here? We do not know that any
> > charges have been released. Say mem_cgroup_out_of_memory selected a task
> > which migrated to our group (without its charges) so we would kill the
> > poor guy and free no memory from this group.
> > Now you wake up oom waiters to refault but they will end up in the same
> > situation. I think it should be sufficient to wait for memcg_oom_recover
> > until the memory is uncharged (which we do already).
>
> It's a leftover from how it was before (see the memcg_wakeup_oom
> below), but you are right, we can get rid of it.
Right, I had missed that. Then it would deserve a note in the
changelog. Something like:
"
memcg_wakeup_oom should be removed because there is no guarantee that
mem_cgroup_out_of_memory was able to free any memory. It could have
killed a task which doesn't have any charges from the group. Waiters
should be woken up by memcg_oom_recover during uncharge or a limit
change.
"
[...]
--
Michal Hocko
SUSE Labs
On Sat, 1 Jun 2013, Michal Hocko wrote:
> > Users obviously don't have the ability to attach processes to the root
> > memcg. They are constrained to their own subtree of memcgs.
>
> OK, I assume those groups are generally untrusted, right? So you cannot
> let them register their oom handler even via an admin interface. This
> makes it a bit complicated because it makes much harder demands on the
> handler itself as it has to run under restricted environment.
>
That's the point of the patch. We want to allow users to register their
own oom handler in a subtree (they may attach it to their own subtree root
and wait on memory.oom_control of a child memcg with a limit less than
that root) but not insist on an absolutely perfect implementation that can
never fail when you run on many, many servers. Userspace implementations
do fail sometimes; we just accept that.
> I still do not see why you cannot simply read the tasks file into a
> preallocated buffer. This would be a few pages even for thousands of pids.
> You do not have to track processes as they come and go.
>
What do you suggest when you read the "tasks" file and it returns -ENOMEM
because kmalloc() fails because the userspace oom handler's memcg is also
oom? Obviously it's not a situation we want to get into, but unless you
know that handler's exact memory usage across multiple versions, nothing
else is sharing that memcg, and it's a perfect implementation, you can't
guarantee it. We need to address real world problems that occur in
practice.
> As I said before, oom_delay_millisecs is actually really easy to do
> from userspace. If you really need a safety break then you can register
> such a handler as a fallback. I am not familiar with eventfd internals
> much, but I guess that multiple handlers are possible. The fallback might
> be enforced by the admin (when a new group is created) or by the
> container itself. Would something like this work for your use case?
>
You're suggesting another userspace process that solely waits for a set
duration and then reenables the oom killer? It faces all the same
problems as the true userspace oom handler: its own perfect
implementation and its own memcg constraints.
> > If that user is constrained to his or her own subtree, as previously
> > stated, there's also no way to login and rectify the situation at that
> > point and requires admin intervention or a reboot.
>
> Yes, insisting on the same subtree makes the life much harder for oom
> handlers. I totally agree with you on that. I just feel that introducing
> a new knob to workaround user "inability" to write a proper handler
> (what ever that means) is not justified.
>
It's not necessarily harder if you assign the userspace oom handlers to
the root of your subtree with access to more memory than the children.
There is no "inability" to write a proper handler, but when you have
dozens of individual users implementing their own userspace handlers with
changing memcg limits over time, then you might find it hard to have
perfection every time. If we had perfection, we wouldn't have to worry
about oom in the first place. We can't just let these gazillion memcgs
sit spinning forever because they get stuck, either. That's why we've
used this solution for years as a failsafe. Disabling the oom killer
entirely, even for a memcg, is ridiculous, and if you don't have a grace
period then oom handlers themselves just don't work.
> > Then why does "cat tasks" stall when my memcg is totally depleted of all
> > memory?
>
> if you run it like this then cat obviously needs some charged
> allocations. If you had a proper handler which mlocks its buffer for the
> read syscall then you shouldn't require any allocation at the oom time.
> This shouldn't be that hard to do without too much memory overhead. As I
> said we are talking about few (dozens) of pages per handler.
>
I'm talking about the memory the kernel allocates when reading the "tasks"
file, not userspace. This can, and will, return -ENOMEM.
On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> > On Sat 01-06-13 02:11:51, Johannes Weiner wrote:
> > [...]
> > > From: Johannes Weiner <[email protected]>
> > > Subject: [PATCH] memcg: more robust oom handling
> > >
> > > The memcg OOM handling is incredibly fragile because once a memcg goes
> > > OOM, one task (kernel or userspace) is responsible for resolving the
> > > situation. Every other task that gets caught trying to charge memory
> > > gets stuck in a waitqueue while potentially holding various filesystem
> > > and mm locks on which the OOM handling task may now deadlock.
> > >
> > > Do two things to charge attempts under OOM:
> > >
> > > 1. Do not trap system calls (buffered IO and friends), just return
> > > -ENOMEM. Userspace should be able to handle this... right?
> > >
> > > 2. Do not trap page faults directly in the charging context. Instead,
> > > remember the OOMing memcg in the task struct and fully unwind the
> > > page fault stack first. Then synchronize the memcg OOM from
> > > pagefault_out_of_memory()
> >
> > I think this should work and I really like it! Nice work Johannes, I
> > never dared to go that deep and my opposite approach was also much more
> > fragile.
> >
> > I am just afraid about all the other archs that do not support (from
> > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> > openrisc, score and tile). What would be an alternative for them?
> > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> > would be acceptable for me.
>
> blackfin is NOMMU but I guess the others should be converted to the
> proper OOM protocol anyway and not just kill the faulting task. I can
> update them in the next version of the patch (series).
It's no longer necessary since I removed the arch-specific flag
setting, but I converted them anyway while I was at it. Will send
them as a separate patch.
> > > Not-quite-signed-off-by: Johannes Weiner <[email protected]>
> > > ---
> > > arch/x86/mm/fault.c | 2 +
> > > include/linux/memcontrol.h | 6 +++
> > > include/linux/sched.h | 6 +++
> > > mm/memcontrol.c | 104 +++++++++++++++++++++++++--------------------
> > > mm/oom_kill.c | 7 ++-
> > > 5 files changed, 78 insertions(+), 47 deletions(-)
> > >
> > [...]
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index e692a02..cf60aef 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -1282,6 +1282,8 @@ struct task_struct {
> > > * execve */
> > > unsigned in_iowait:1;
> > >
> > > + unsigned in_userfault:1;
> > > +
> >
> > [This is more a nit pick but before I forget while I am reading through
> > the rest of the patch.]
> >
> > OK there is a lot of room around those bit fields but as this is only
> > for memcg and you are enlarging the structure by the pointer then you
> > can reuse bottom bit of memcg pointer.
>
> I just didn't want to put anything in the arch code that looks too
> memcgish, even though it's the only user right now. But granted, it
> will also probably remain the only user for a while.
I tried a couple of variants, including using the lowest memcg bit,
but it all turned into more ugliness. So that .in_userfault is still
there in v2, but it's now set in handle_mm_fault() in a generic manner
depending on a fault flag; please reconsider if you can live with it.
> > > @@ -2085,56 +2087,76 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > > }
> > >
> > > /*
> > > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > > + * try to call OOM killer
> > > */
> > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > > - int order)
> > > +static void mem_cgroup_oom(struct mem_cgroup *memcg,
> > > + gfp_t mask, int order,
> > > + bool in_userfault)
> > > {
> > > - struct oom_wait_info owait;
> > > - bool locked, need_to_kill;
> > > -
> > > - owait.memcg = memcg;
> > > - owait.wait.flags = 0;
> > > - owait.wait.func = memcg_oom_wake_function;
> > > - owait.wait.private = current;
> > > - INIT_LIST_HEAD(&owait.wait.task_list);
> > > - need_to_kill = true;
> > > - mem_cgroup_mark_under_oom(memcg);
> > > + bool locked, need_to_kill = true;
> > >
> > > /* At first, try to OOM lock hierarchy under memcg.*/
> > > spin_lock(&memcg_oom_lock);
> > > locked = mem_cgroup_oom_lock(memcg);
> > > - /*
> > > - * Even if signal_pending(), we can't quit charge() loop without
> > > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> > > - * under OOM is always welcomed, use TASK_KILLABLE here.
> > > - */
> > > - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > > - if (!locked || memcg->oom_kill_disable)
> > > + if (!locked || memcg->oom_kill_disable) {
> > > need_to_kill = false;
> > > + if (in_userfault) {
> > > + /*
> > > + * We start sleeping on the OOM waitqueue only
> > > + * after unwinding the page fault stack, so
> > > + * make sure we detect wakeups that happen
> > > + * between now and then.
> > > + */
> > > + mem_cgroup_mark_under_oom(memcg);
> > > + current->memcg_oom.wakeups =
> > > + atomic_read(&memcg->oom_wakeups);
> > > + css_get(&memcg->css);
> > > + current->memcg_oom.memcg = memcg;
> > > + }
> > > + }
> > > if (locked)
> > > mem_cgroup_oom_notify(memcg);
> > > spin_unlock(&memcg_oom_lock);
> > >
> > > if (need_to_kill) {
> > > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > > mem_cgroup_out_of_memory(memcg, mask, order);
> > > - } else {
> > > - schedule();
> > > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > > + memcg_oom_recover(memcg);
> >
> > Why do we need to call memcg_oom_recover here? We do not know that any
> > charges have been released. Say mem_cgroup_out_of_memory selected a task
> > which migrated to our group (without its charges) so we would kill the
> > poor guy and free no memory from this group.
> > Now you wake up oom waiters to refault but they will end up in the same
> > situation. I think it should be sufficient to wait for memcg_oom_recover
> > until the memory is uncharged (which we do already).
>
> It's a leftover from how it was before (see the memcg_wakeup_oom
> below), but you are right, we can get rid of it.
Removed it.
> > > @@ -2647,16 +2665,12 @@ again:
> > > css_put(&memcg->css);
> > > goto nomem;
> > > case CHARGE_NOMEM: /* OOM routine works */
> > > - if (!oom) {
> > > + if (!oom || oom_check) {
> >
> > OK, this allows us to remove the confusing nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES
> > from the branch where oom_check is set to true
>
> Oops, will update.
I removed it and also renamed oom_check to invoke_oom, which makes
more sense IMO.
The wakeup race is also fixed. Here is version 2:
---
From: Johannes Weiner <[email protected]>
Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
The memcg OOM handling is incredibly fragile because once a memcg goes
OOM, one task (kernel or userspace) is responsible for resolving the
situation. Every other task that gets caught trying to charge memory
gets stuck in a waitqueue while potentially holding various filesystem
and mm locks on which the OOM handling task may now deadlock.
Do two things:
1. When OOMing in a system call (buffered IO and friends), invoke the
OOM killer but do not trap other tasks and just return -ENOMEM for
everyone. Userspace should be able to handle this... right?
2. When OOMing in a page fault, invoke the OOM killer but do not trap
other chargers directly in the charging code. Instead, remember
the OOMing memcg in the task struct and then fully unwind the page
fault stack first. Then synchronize the memcg OOM from
pagefault_out_of_memory().
While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 6 +++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 +++
mm/ksm.c | 2 +-
mm/memcontrol.c | 117 +++++++++++++++++++++++----------------------
mm/memory.c | 40 +++++++++++-----
mm/oom_kill.c | 7 ++-
7 files changed, 108 insertions(+), 71 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..e1b84ea 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,7 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+bool mem_cgroup_oom_synchronize(void);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
@@ -337,6 +338,11 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..93b29ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..639bfc9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1117,6 +1117,8 @@ struct task_struct {
* execve */
unsigned in_iowait:1;
+ unsigned in_userfault:1;
+
/* task may not gain privileges */
unsigned no_new_privs:1;
@@ -1405,6 +1407,10 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ struct mem_cgroup *memcg;
+ int wakeups;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..9dff93b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -372,7 +372,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de22292..97cf32b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -298,6 +298,7 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;
atomic_t refcnt;
@@ -2168,6 +2169,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}
/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
+ if (!locked || memcg->oom_kill_disable) {
need_to_kill = false;
+ if (current->in_userfault) {
+ /*
+ * We start sleeping on the OOM waitqueue only
+ * after unwinding the page fault stack, so
+ * make sure we detect wakeups that happen
+ * between now and then.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.memcg = memcg;
+ }
+ }
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);
- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ if (need_to_kill)
mem_cgroup_out_of_memory(memcg, mask, order);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
- }
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ spin_unlock(&memcg_oom_lock);
+ }
+}
- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ memcg = current->memcg_oom.memcg;
+ if (!memcg)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.memcg = NULL;
return true;
}
@@ -2541,12 +2559,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2603,14 +2620,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}
/*
@@ -2713,7 +2726,7 @@ again:
}
do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;
/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2721,14 +2734,8 @@ again:
goto bypass;
}
- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2741,16 +2748,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..ff5e2d7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
while (!(page = follow_page_mask(vma, start,
foll_flags, &page_mask))) {
int ret;
- unsigned int fault_flags = 0;
+ unsigned int fault_flags = FAULT_FLAG_KERNEL;
/* For mlock, just skip the stack guard page. */
if (foll_flags & FOLL_MLOCK) {
@@ -1943,6 +1943,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || address < vma->vm_start)
return -EFAULT;
+ fault_flags |= FAULT_FLAG_KERNEL;
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -3760,22 +3761,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
@@ -3856,6 +3849,31 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (in_userfault)
+ current->in_userfault = 1;
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (in_userfault)
+ current->in_userfault = 0;
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e451a..0c9f836 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;
+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.2.3
On Mon, Jun 03, 2013 at 11:18:09AM -0700, David Rientjes wrote:
> On Sat, 1 Jun 2013, Michal Hocko wrote:
> > OK, I assume those groups are generally untrusted, right? So you cannot
> > let them register their oom handler even via an admin interface. This
> > makes it a bit complicated because it makes much harder demands on the
> > handler itself as it has to run under restricted environment.
>
> That's the point of the patch. We want to allow users to register their
> own oom handler in a subtree (they may attach it to their own subtree root
> and wait on memory.oom_control of a child memcg with a limit less than
> that root) but not insist on an absolutely perfect implementation that can
> never fail when you run on many, many servers. Userspace implementations
> do fail sometimes; we just accept that.
>
> > I still do not see why you cannot simply read the tasks file into a
> > preallocated buffer. This would be a few pages even for thousands of pids.
> > You do not have to track processes as they come and go.
>
> What do you suggest when you read the "tasks" file and it returns -ENOMEM
> because kmalloc() fails because the userspace oom handler's memcg is also
> oom? Obviously it's not a situation we want to get into, but unless you
> know that handler's exact memory usage across multiple versions, nothing
> else is sharing that memcg, and it's a perfect implementation, you can't
> guarantee it. We need to address real world problems that occur in
> practice.
>
> > As I said before, oom_delay_millisecs is actually really easy to do
> > from userspace. If you really need a safety break then you can register
> > such a handler as a fallback. I am not familiar with eventfd internals
> > much, but I guess that multiple handlers are possible. The fallback might
> > be enforced by the admin (when a new group is created) or by the
> > container itself. Would something like this work for your use case?
>
> You're suggesting another userspace process that solely waits for a set
> duration and then reenables the oom killer? It faces all the same
> problems as the true userspace oom handler: its own perfect
> implementation and its own memcg constraints.
>
> > > If that user is constrained to his or her own subtree, as previously
> > > stated, there's also no way to login and rectify the situation at that
> > > point and requires admin intervention or a reboot.
> >
> > Yes, insisting on the same subtree makes the life much harder for oom
> > handlers. I totally agree with you on that. I just feel that introducing
> > a new knob to workaround user "inability" to write a proper handler
> > (what ever that means) is not justified.
>
> It's not necessarily harder if you assign the userspace oom handlers to
> the root of your subtree with access to more memory than the children.
> There is no "inability" to write a proper handler, but when you have
> dozens of individual users implementing their own userspace handlers with
> changing memcg limits over time, then you might find it hard to have
> perfection every time. If we had perfection, we wouldn't have to worry
> about oom in the first place. We can't just let these gazillion memcgs
> sit spinning forever because they get stuck, either. That's why we've
> used this solution for years as a failsafe. Disabling the oom killer
> entirely, even for a memcg, is ridiculous, and if you don't have a grace
> period then oom handlers themselves just don't work.
It's only ridiculous if your OOM handler is subject to the OOM
situation it's trying to handle.
Don't act as if the oom disabling semantics were unreasonable or
inconsistent with the rest of the system; memcgs were not really meant
to be self-policed by the tasks running in them. That's why we have
the waitqueue, so that everybody sits there and waits until an outside
force resolves the situation. There is nothing wrong with that, you
just have a new requirement.
> > > Then why does "cat tasks" stall when my memcg is totally depleted of all
> > > memory?
> >
> > if you run it like this then cat obviously needs some charged
> > allocations. If you had a proper handler which mlocks its buffer for the
> > read syscall then you shouldn't require any allocation at the oom time.
> > This shouldn't be that hard to do without too much memory overhead. As I
> > said we are talking about few (dozens) of pages per handler.
> >
>
> I'm talking about the memory the kernel allocates when reading the "tasks"
> file, not userspace. This can, and will, return -ENOMEM.
Do you mean due to kmem limitations?
What we could do is allow one task in the group to be the dedicated
OOM handler. If we catch this task in the charge path during an OOM
situation, we fall back to the kernel OOM handler.
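Roughly like this, purely as a sketch - the oom_handler_task member and
the helper name are made up for illustration, they do not exist in the
tree:

static bool mem_cgroup_defer_to_userspace(struct mem_cgroup *memcg)
{
	/* Kernel OOM handling is enabled; nothing to defer. */
	if (!memcg->oom_kill_disable)
		return false;
	/*
	 * The dedicated userspace handler is itself stuck in the
	 * charge path, so it cannot resolve the situation; fall
	 * back to the kernel OOM killer.
	 */
	if (current == memcg->oom_handler_task)
		return false;
	return true;
}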
On Mon, 3 Jun 2013, Johannes Weiner wrote:
> > It's not necessarily harder if you assign the userspace oom handlers to
> > the root of your subtree with access to more memory than the children.
> > There is no "inability" to write a proper handler, but when you have
> > dozens of individual users implementing their own userspace handlers with
> > changing memcg limits over time, then you might find it hard to have
> > perfection every time. If we had perfection, we wouldn't have to worry
> > about oom in the first place. We can't just let these gazillion memcgs
> > sit spinning forever because they get stuck, either. That's why we've
> > used this solution for years as a failsafe. Disabling the oom killer
> > entirely, even for a memcg, is ridiculous, and if you don't have a grace
> > period then oom handlers themselves just don't work.
>
> It's only ridiculous if your OOM handler is subject to the OOM
> situation it's trying to handle.
>
You're suggesting the oom handler can't be subject to its own memcg
limits, independent of the memcg it is handling? If we demand that such a
handler be attached to the root memcg, that breaks the memory isolation
that memcg provides. We constrain processes to a memcg for the purposes
of that isolation, so it cannot use more resources than allotted.
> Don't act as if the oom disabling semantics were unreasonable or
> inconsistent with the rest of the system; memcgs were not really meant
> to be self-policed by the tasks running in them. That's why we have
> the waitqueue, so that everybody sits there and waits until an outside
> force resolves the situation. There is nothing wrong with that, you
> just have a new requirement.
>
The waitqueue doesn't solve anything with regard to the memory; if the
memcg sits there and deadlocks forever then it is using resources (memory,
not cpu) that will never be freed.
> > I'm talking about the memory the kernel allocates when reading the "tasks"
> > file, not userspace. This can, and will, return -ENOMEM.
>
> Do you mean due to kmem limitations?
>
Yes.
> What we could do is allow one task in the group to be the dedicated
> OOM handler. If we catch this task in the charge path during an OOM
> situation, we fall back to the kernel OOM handler.
>
I'm not sure it even makes sense to have more than one oom handler per
memcg, given the synchronization that requires in userspace to get the right
result, so I didn't consider dedicating a single oom handler. That would
be an entirely new interface, though, since we may have multiple processes
waiting on memory.oom_control that aren't necessarily handlers; they grab
a snapshot of memory, do logging, etc.
On Mon 03-06-13 11:18:09, David Rientjes wrote:
> On Sat, 1 Jun 2013, Michal Hocko wrote:
[...]
> > I still do not see why you cannot simply read the tasks file into a
> > preallocated buffer. This would be a few pages even for thousands of pids.
> > You do not have to track processes as they come and go.
> >
>
> What do you suggest when you read the "tasks" file and it returns -ENOMEM
> because kmalloc() fails because the userspace oom handler's memcg is also
> oom?
That would require that you track kernel allocations which is currently
done only for explicit caches.
> Obviously it's not a situation we want to get into, but unless you
> know that handler's exact memory usage across multiple versions, nothing
> else is sharing that memcg, and it's a perfect implementation, you can't
> guarantee it. We need to address real world problems that occur in
> practice.
If you really need to have such a guarantee then you can have a _global_
> watchdog observing the oom_control of all groups that provide such vague
> requirements for oom user handlers.
> > As I said before, oom_delay_millisecs is actually really easy to do
> > from userspace. If you really need a safety break then you can register
> > such a handler as a fallback. I am not familiar with eventfd internals
> > much, but I guess that multiple handlers are possible. The fallback might
> > be enforced by the admin (when a new group is created) or by the
> > container itself. Would something like this work for your use case?
> >
>
> You're suggesting another userspace process that solely waits for a set
> duration and then reenables the oom killer?
Yes, one which kicks the oom killer.
> It faces all the same problems as the true userspace oom handler: its
> own perfect implementation and its own memcg constraints.
But that solution might be implemented as a global policy living in a
group with some reserved memory.
[...]
--
Michal Hocko
SUSE Labs
On Mon, 3 Jun 2013, Michal Hocko wrote:
> > What do you suggest when you read the "tasks" file and it returns -ENOMEM
> > because kmalloc() fails because the userspace oom handler's memcg is also
> > oom?
>
> That would require that you track kernel allocations which is currently
> done only for explicit caches.
>
That will not always be the case, and I think this could be a prerequisite
patch for such support, which we have internally. I'm not sure a userspace
oom notifier would want to keep a preallocated buffer around that is
mlocked in memory for all possible lengths of this file.
> > Obviously it's not a situation we want to get into, but unless you
> > know that handler's exact memory usage across multiple versions, nothing
> > else is sharing that memcg, and it's a perfect implementation, you can't
> > guarantee it. We need to address real world problems that occur in
> > practice.
>
> If you really need to have such a guarantee then you can have a _global_
> > watchdog observing the oom_control of all groups that provide such vague
> > requirements for oom user handlers.
>
The whole point is to allow the user to implement their own oom policy.
If the policy were completely encapsulated in kernel code, we wouldn't need
to ever disable the oom killer even with memory.oom_control. Users may
ever disable the oom killer even with memory.oom_control. Users may
choose to kill the largest process, the newest process, the oldest
process, sacrifice children instead of parents, prevent forkbombs,
implement their own priority scoring (which is what we do), kill the
allocating task, etc.
If this patch is not to be merged, I'd ask that you show an alternative
that allows users to implement their own userspace oom handlers without
requiring admin intervention when things go wrong.
> From: Johannes Weiner <[email protected]>
> Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
>
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation. Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
> Do two things:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the
> OOM killer but do not trap other tasks and just return -ENOMEM for
> everyone. Userspace should be able to handle this... right?
>
> 2. When OOMing in a page fault, invoke the OOM killer but do not trap
> other chargers directly in the charging code. Instead, remember
> the OOMing memcg in the task struct and then fully unwind the page
> fault stack first. Then synchronize the memcg OOM from
> pagefault_out_of_memory().
>
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
>
> Signed-off-by: Johannes Weiner <[email protected]>
From the point of view of memcg oom notification, the case where an oom
handler process is subject to its own memcg limit is NOT supported. All
memcg developers have clearly agreed on that since the beginning. Even
so, people have a right to shoot themselves in the foot. :)
However, this patch fixes more than that. OK, I prefer it. Good fix!
Acked-by: KOSAKI Motohiro <[email protected]>
On Mon, Jun 03, 2013 at 12:09:22PM -0700, David Rientjes wrote:
> On Mon, 3 Jun 2013, Johannes Weiner wrote:
>
> > > It's not necessarily harder if you assign the userspace oom handlers to
> > > the root of your subtree with access to more memory than the children.
> > > There is no "inability" to write a proper handler, but when you have
> > > dozens of individual users implementing their own userspace handlers with
> > > changing memcg limits over time, then you might find it hard to have
> > > perfection every time. If we had perfection, we wouldn't have to worry
> > > about oom in the first place. We can't just let these gazillion memcgs
> > > sit spinning forever because they get stuck, either. That's why we've
> > > used this solution for years as a failsafe. Disabling the oom killer
> > > entirely, even for a memcg, is ridiculous, and if you don't have a grace
> > > period then oom handlers themselves just don't work.
> >
> > It's only ridiculous if your OOM handler is subject to the OOM
> > situation it's trying to handle.
> >
>
> You're suggesting the oom handler can't be subject to its own memcg
> limits, independent of the memcg it is handling? If we demand that such a
> handler be attached to the root memcg, that breaks the memory isolation
> that memcg provides. We constrain processes to a memcg for the purposes
> of that isolation, so it cannot use more resources than allotted.
I guess the traditional use case is that you have a job manager that
you trust, that sets up the groups, configures their limits etc. and
would also know more about the jobs than the kernel to act as the OOM
killer as well. Since it's trusted and not expected to consume any
significant amounts of memory by itself, the memcg code assumes that
it does not run in a cgroup itself; it's just not thought of as being
part of the application class.
I'm not saying it should necessarily stay that way, but it's also not
a completely ridiculous model.
> > What we could do is allow one task in the group to be the dedicated
> > OOM handler. If we catch this task in the charge path during an OOM
> > situation, we fall back to the kernel OOM handler.
> >
>
> I'm not sure it even makes sense to have more than one oom handler per
> memcg, given the synchronization that requires in userspace to get the right
> result, so I didn't consider dedicating a single oom handler. That would
> be an entirely new interface, though, since we may have multiple processes
> waiting on memory.oom_control that aren't necessarily handlers; they grab
> a snapshot of memory, do logging, etc.
It will probably be hard to extend the existing oom_control interface.
But we could add a separate one where you put in a pid or 0 (kernel)
instead of a boolean value, which then enables or disables the
userspace OOM handling task for the whole subtree. If that task
enters the OOM path, the kernel handler is invoked. If the task dies,
the kernel handler is permanently re-enabled. Everybody is free to
poll that interface for OOM notifications, not only that one task.
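Concretely, usage might look something like this (memory.oom_handler is
a made-up name for the sake of illustration, not an existing file):

  # echo $HANDLER_PID > memory.oom_handler   # delegate OOM handling
  # echo 0 > memory.oom_handler              # kernel handles OOM again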
Combined with the "enter waitqueue after unwinding page fault stack"
patch, would this fully cover your usecase?
On Mon 03-06-13 14:30:18, Johannes Weiner wrote:
> On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote:
> > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
[...]
> > > I am just afraid about all the other archs that do not support (from
> > > quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> > > openrisc, score and tile). What would be an alternative for them?
> > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> > > would be acceptable for me.
> >
> > blackfin is NOMMU but I guess the others should be converted to the
> > proper OOM protocol anyway and not just kill the faulting task. I can
> > update them in the next version of the patch (series).
>
> It's no longer necessary since I removed the arch-specific flag
> setting, but I converted them anyway while I was at it. Will send
> them as a separate patch.
I am still not sure doing this unconditionally is the right way. Say
a new arch is added. How would the poor implementer know that memcg
oom handling requires arch-specific code to work properly?
So while I obviously do not have anything against your conversion of
the other archs currently in the tree, I think we need something
like CONFIG_OLD_VERSION_MEMCG_OOM which depends on ARCH_HAS_FAULT_OOM_RETRY.
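As a sketch of that plumbing (Kconfig symbol names are taken from this
thread, not from an existing tree, and the exact polarity is open):

	/* in mm/memcontrol.c, at the OOM invocation site */
#ifdef CONFIG_OLD_VERSION_MEMCG_OOM
	/* unconverted arch: old trap-in-charge behavior */
	mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize));
#else
	/* converted arch: record in the task, unwind, synchronize later */
	mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
#endif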
[...]
> > > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > > index e692a02..cf60aef 100644
> > > > --- a/include/linux/sched.h
> > > > +++ b/include/linux/sched.h
> > > > @@ -1282,6 +1282,8 @@ struct task_struct {
> > > > * execve */
> > > > unsigned in_iowait:1;
> > > >
> > > > + unsigned in_userfault:1;
> > > > +
> > >
> > > [This is more a nit pick but before I forget while I am reading through
> > > the rest of the patch.]
> > >
> > > OK there is a lot of room around those bit fields but as this is only
> > > for memcg and you are enlarging the structure by the pointer then you
> > > can reuse bottom bit of memcg pointer.
> >
> > I just didn't want to put anything in the arch code that looks too
> > memcgish, even though it's the only user right now. But granted, it
> > will also probably remain the only user for a while.
>
> I tried a couple of variants, including using the lowest memcg bit,
> but it all turned into more ugliness. So that .in_userfault is still
> there in v2, but it's now set in handle_mm_fault() in a generic manner
> depending on a fault flag; please reconsider if you can live with it.
Sure thing.
[...]
> From: Johannes Weiner <[email protected]>
> Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
>
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation. Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
> Do two things:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the
> OOM killer but do not trap other tasks and just return -ENOMEM for
> everyone. Userspace should be able to handle this... right?
>
> 2. When OOMing in a page fault, invoke the OOM killer but do not trap
> other chargers directly in the charging code. Instead, remember
> the OOMing memcg in the task struct and then fully unwind the page
> fault stack first. Then synchronize the memcg OOM from
> pagefault_out_of_memory().
>
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 6 +++
> include/linux/mm.h | 1 +
> include/linux/sched.h | 6 +++
> mm/ksm.c | 2 +-
> mm/memcontrol.c | 117 +++++++++++++++++++++++----------------------
> mm/memory.c | 40 +++++++++++-----
> mm/oom_kill.c | 7 ++-
> 7 files changed, 108 insertions(+), 71 deletions(-)
>
[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index de22292..97cf32b 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> }
>
> /*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> + * try to call OOM killer
> */
> -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> - int order)
> +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> {
> - struct oom_wait_info owait;
> - bool locked, need_to_kill;
> -
> - owait.memcg = memcg;
> - owait.wait.flags = 0;
> - owait.wait.func = memcg_oom_wake_function;
> - owait.wait.private = current;
> - INIT_LIST_HEAD(&owait.wait.task_list);
> - need_to_kill = true;
> - mem_cgroup_mark_under_oom(memcg);
> + bool locked, need_to_kill = true;
>
> /* At first, try to OOM lock hierarchy under memcg.*/
> spin_lock(&memcg_oom_lock);
> locked = mem_cgroup_oom_lock(memcg);
> - /*
> - * Even if signal_pending(), we can't quit charge() loop without
> - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> - * under OOM is always welcomed, use TASK_KILLABLE here.
> - */
> - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> - if (!locked || memcg->oom_kill_disable)
> + if (!locked || memcg->oom_kill_disable) {
> need_to_kill = false;
> + if (current->in_userfault) {
> + /*
> + * We start sleeping on the OOM waitqueue only
> + * after unwinding the page fault stack, so
> + * make sure we detect wakeups that happen
> + * between now and then.
> + */
> + mem_cgroup_mark_under_oom(memcg);
> + current->memcg_oom.wakeups =
> + atomic_read(&memcg->oom_wakeups);
> + css_get(&memcg->css);
> + current->memcg_oom.memcg = memcg;
> + }
> + }
> if (locked)
> mem_cgroup_oom_notify(memcg);
> spin_unlock(&memcg_oom_lock);
>
> - if (need_to_kill) {
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> + if (need_to_kill)
> mem_cgroup_out_of_memory(memcg, mask, order);
Now that I am looking at this again I've realized that this
is not correct. The task which triggers memcg OOM will not
have memcg_oom.memcg set so it would trigger a global OOM in
pagefault_out_of_memory. Either we should return CHARGE_RETRY (and
propagate it via mem_cgroup_do_charge) for need_to_kill or set up
current->memcg_oom also for need_to_kill.
Or am I missing something?
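As a sketch of the second option (illustrative only), the bookkeeping
could simply be done for every userspace fault instead of only when the
kill is deferred:

	if (current->in_userfault) {
		/*
		 * Record the OOMing memcg for the killing task as
		 * well, so a later VM_FAULT_OOM is synchronized as
		 * a memcg OOM instead of hitting the global path.
		 */
		mem_cgroup_mark_under_oom(memcg);
		current->memcg_oom.wakeups =
			atomic_read(&memcg->oom_wakeups);
		css_get(&memcg->css);
		current->memcg_oom.memcg = memcg;
	}
	if (need_to_kill)
		mem_cgroup_out_of_memory(memcg, mask, order);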
> - } else {
> - schedule();
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> - }
> - spin_lock(&memcg_oom_lock);
> - if (locked)
> +
> + if (locked) {
> + spin_lock(&memcg_oom_lock);
> mem_cgroup_oom_unlock(memcg);
> - memcg_wakeup_oom(memcg);
> - spin_unlock(&memcg_oom_lock);
> + spin_unlock(&memcg_oom_lock);
> + }
> +}
[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index 6dc1882..ff5e2d7 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> while (!(page = follow_page_mask(vma, start,
> foll_flags, &page_mask))) {
> int ret;
> - unsigned int fault_flags = 0;
> + unsigned int fault_flags = FAULT_FLAG_KERNEL;
>
> /* For mlock, just skip the stack guard page. */
> if (foll_flags & FOLL_MLOCK) {
This is also a bit tricky. Consider the unlikely situation where a
task fails to charge because of memcg OOM and couldn't take the oom
lock, so it ends up with current->memcg_oom set, and __get_user_pages
then turns VM_FAULT_OOM into -ENOMEM while memcg_oom is still set. The
following global OOM condition then gets confused (well, the oom will be
triggered by somebody else so it shouldn't end up in an endless loop,
but still...), doesn't it?
So maybe we need a handle_mm_fault variant, called outside of the page
fault path, which clears things up.
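Something along these lines, as a sketch (the helper name and placement
are assumptions, not a tested patch; CONFIG_MEMCG assumed):

static inline void mem_cgroup_oom_clear(void)
{
	struct mem_cgroup *memcg = current->memcg_oom.memcg;

	/*
	 * A kernel-context fault turned VM_FAULT_OOM into -ENOMEM;
	 * drop the recorded memcg so a later global OOM path does
	 * not mistake this for an unhandled memcg OOM.
	 */
	if (memcg) {
		mem_cgroup_unmark_under_oom(memcg);
		css_put(&memcg->css);
		current->memcg_oom.memcg = NULL;
	}
}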
> @@ -3760,22 +3761,14 @@ unlock:
> /*
> * By the time we get here, we already hold the mm semaphore
> */
> -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
Is this reusable? Who would call this helper, or is it just for code
readability? I would probably prefer a smaller patch, but I do not have a
strong opinion on this.
> {
> pgd_t *pgd;
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
>
> - __set_current_state(TASK_RUNNING);
> -
> - count_vm_event(PGFAULT);
> - mem_cgroup_count_vm_event(mm, PGFAULT);
> -
> - /* do counter updates before entering really critical section. */
> - check_sync_rss_stat(current);
> -
> if (unlikely(is_vm_hugetlb_page(vma)))
> return hugetlb_fault(mm, vma, address, flags);
>
> @@ -3856,6 +3849,31 @@ retry:
> return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> }
>
> +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> +{
> + int in_userfault = !(flags & FAULT_FLAG_KERNEL);
> + int ret;
> +
> + __set_current_state(TASK_RUNNING);
> +
> + count_vm_event(PGFAULT);
> + mem_cgroup_count_vm_event(mm, PGFAULT);
> +
> + /* do counter updates before entering really critical section. */
> + check_sync_rss_stat(current);
> +
> + if (in_userfault)
> + current->in_userfault = 1;
If this is just a memcg thing (although you envision future usage outside
of memcg), then wouldn't it make more sense to use a memcg helper here
which would be a no-op for !CONFIG_MEMCG and disabled for
mem_cgroup_disabled()?
> +
> + ret = __handle_mm_fault(mm, vma, address, flags);
> +
> + if (in_userfault)
> + current->in_userfault = 0;
> +
> + return ret;
> +}
> +
> #ifndef __PAGETABLE_PUD_FOLDED
> /*
> * Allocate page upper directory.
[...]
--
Michal Hocko
SUSE Labs
On Mon 03-06-13 14:17:54, David Rientjes wrote:
> On Mon, 3 Jun 2013, Michal Hocko wrote:
>
> > > What do you suggest when you read the "tasks" file and it returns -ENOMEM
> > > because kmalloc() fails because the userspace oom handler's memcg is also
> > > oom?
> >
> > That would require that you track kernel allocations which is currently
> > done only for explicit caches.
> >
>
> That will not always be the case, and I think this could be a prerequisite
> patch for such support that we have internally.
> I'm not sure a userspace oom notifier would want to keep a
> preallocated buffer around that is mlocked in memory for all possible
> lengths of this file.
Well, an oom handler which allocates memory under the same restricted
memory doesn't make much sense to me. Tracking all kmem allocations
makes it almost impossible to implement a non-trivial handler.
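For illustration, a handler written with this in mind preallocates and pins
everything it needs up front, so it never allocates under its own limit once
things get tight. A rough userspace sketch (the cgroup path and buffer size
are made up, and this only covers the handler's own allocations, not kernel
allocations done on its behalf):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define BUF_SIZE (1 << 20)	/* made-up upper bound for "tasks" et al. */

int main(void)
{
	/* Open every file we will ever read before memory gets tight. */
	int tasks_fd = open("/sys/fs/cgroup/memory/user1/tasks", O_RDONLY);
	char *buf = malloc(BUF_SIZE);
	ssize_t n;

	if (tasks_fd < 0 || !buf)
		return 1;
	memset(buf, 0, BUF_SIZE);		/* fault the pages in now */

	/* Pin everything so the handler itself never faults under oom. */
	if (mlockall(MCL_CURRENT | MCL_FUTURE)) {
		perror("mlockall");
		return 1;
	}

	/* ... block on the oom notification here, then: ... */
	n = read(tasks_fd, buf, BUF_SIZE - 1);	/* no allocation on our side */
	if (n > 0)
		buf[n] = '\0';
	/* ... parse pids from buf, pick a victim, kill(pid, SIGKILL) ... */
	return 0;
}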
> > > Obviously it's not a situation we want to get into, but unless you
> > > know that handler's exact memory usage across multiple versions, nothing
> > > else is sharing that memcg, and it's a perfect implementation, you can't
> > > guarantee it. We need to address real world problems that occur in
> > > practice.
> >
> > If you really need to have such a guarantee then you can have a _global_
> > watchdog observing the oom_control of all groups that provide such vague
> > requirements for oom user handlers.
> >
>
> The whole point is to allow the user to implement their own oom policy.
OK, maybe I just wasn't clear enough or I am missing your point. Your
users _can_ implement and register their oom handlers. But as your
requirements on handler implementations are rather lenient, you would
have a global watchdog which would sit on the oom_control of those
groups (which are allowed to have their own handlers - all of them in your
case, I guess) and start a (user defined/global) timeout when it gets a
notification. If the group stayed under oom for the whole timeout, then
just disable oom_control until the oom is settled (under_oom is 0).
Why wouldn't something like this work for your use case?
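For concreteness, here is a rough sketch of such a watchdog, using the
existing eventfd notification on memory.oom_control (the group path and
timeout are example values, and error handling is omitted):

#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/eventfd.h>

#define GROUP	"/sys/fs/cgroup/memory/user1"	/* example path */
#define TIMEOUT	60				/* example, in seconds */

static int under_oom(void)
{
	char buf[128];
	int fd = open(GROUP "/memory.oom_control", O_RDONLY);
	ssize_t n;

	if (fd < 0)
		return 0;
	n = read(fd, buf, sizeof(buf) - 1);
	close(fd);
	if (n <= 0)
		return 0;
	buf[n] = '\0';
	return strstr(buf, "under_oom 1") != NULL;
}

int main(void)
{
	char cmd[64];
	uint64_t cnt;
	int efd = eventfd(0, 0);
	int ofd = open(GROUP "/memory.oom_control", O_RDONLY);
	int cfd = open(GROUP "/cgroup.event_control", O_WRONLY);

	/* Register for oom notifications on this group. */
	snprintf(cmd, sizeof(cmd), "%d %d", efd, ofd);
	write(cfd, cmd, strlen(cmd));

	for (;;) {
		read(efd, &cnt, sizeof(cnt));	/* blocks until the group ooms */
		sleep(TIMEOUT);			/* grace period for the user handler */
		if (under_oom()) {
			/* The handler did not resolve it: let the kernel kill. */
			int fd = open(GROUP "/memory.oom_control", O_WRONLY);

			write(fd, "0", 1);
			close(fd);
		}
	}
}

The watchdog itself would run in the root memcg with /proc/<pid>/oom_score_adj
set to -1000 so that a global oom cannot take it down.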
> If the policy was completely encapsulated in kernel code, we don't need to
> ever disable the oom killer even with memory.oom_control. Users may
> choose to kill the largest process, the newest process, the oldest
> process, sacrifice children instead of parents, prevent forkbombs,
> implement their own priority scoring (which is what we do), kill the
> allocating task, etc.
>
> To not merge this patch, I'd ask that you show an alternative that allows
> users to implement their own userspace oom handlers and not require admin
> intervention when things go wrong.
Hohmm, so you are insisting on putting something into the kernel that can
be implemented in userspace, because it is more convenient for you and
your use case. This doesn't sound like a good reason for accepting a
feature.
To make this absolutely clear: I do understand your requirements, but you
haven't shown any _argument_ why the timeout you are proposing cannot be
implemented in userspace. I will not ack this without that reasoning.
And yes we should make memcg oom handling less deadlock prone and
Johannes' work in this thread is a good step forward.
--
Michal Hocko
SUSE Labs
On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote:
> On Mon 03-06-13 14:30:18, Johannes Weiner wrote:
> > On Mon, Jun 03, 2013 at 12:48:39PM -0400, Johannes Weiner wrote:
> > > On Mon, Jun 03, 2013 at 05:34:32PM +0200, Michal Hocko wrote:
> [...]
> > > > I am just worried about all the other archs that do not support it (from
> > > > a quick grep it looks like: blackfin, c6x, h8300, metag, mn10300,
> > > > openrisc, score and tile). What would be an alternative for them?
> > > > #ifdefs for the old code (something like ARCH_HAS_FAULT_OOM_RETRY)? This
> > > > would be acceptable for me.
> > >
> > > blackfin is NOMMU but I guess the others should be converted to the
> > > proper OOM protocol anyway and not just kill the faulting task. I can
> > > update them in the next version of the patch (series).
> >
> > It's no longer necessary since I removed the arch-specific flag
> > setting, but I converted them anyway while I was at it. Will send
> > them as a separate patch.
>
> I am still not sure doing this unconditionally is the right way. Say
> that a new arch is added. How does the poor implementer know that memcg
> oom handling requires arch-specific code to work properly?
It doesn't. All that's required is that it follows the generic OOM
killer protocol and calls pagefault_out_of_memory(). Killing the
faulting task right there in the page fault handler is simply a bug
and needs to be fixed so I don't think that any extra care from our
side is necessary.
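Schematically, the protocol on the arch side boils down to something like
this (a sketch with the usual arch fault-handler identifiers, not the exact
code of any particular architecture):

	fault = handle_mm_fault(mm, vma, address, flags);
	if (unlikely(fault & VM_FAULT_OOM)) {
		up_read(&mm->mmap_sem);
		if (!user_mode(regs))
			goto no_context;	/* kernel faults take the oops path */
		pagefault_out_of_memory();	/* defer to the generic OOM killer */
		return;
	}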
> > @@ -2179,56 +2181,72 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > }
> >
> > /*
> > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > + * try to call OOM killer
> > */
> > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > - int order)
> > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > {
> > - struct oom_wait_info owait;
> > - bool locked, need_to_kill;
> > -
> > - owait.memcg = memcg;
> > - owait.wait.flags = 0;
> > - owait.wait.func = memcg_oom_wake_function;
> > - owait.wait.private = current;
> > - INIT_LIST_HEAD(&owait.wait.task_list);
> > - need_to_kill = true;
> > - mem_cgroup_mark_under_oom(memcg);
> > + bool locked, need_to_kill = true;
> >
> > /* At first, try to OOM lock hierarchy under memcg.*/
> > spin_lock(&memcg_oom_lock);
> > locked = mem_cgroup_oom_lock(memcg);
> > - /*
> > - * Even if signal_pending(), we can't quit charge() loop without
> > - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> > - * under OOM is always welcomed, use TASK_KILLABLE here.
> > - */
> > - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > - if (!locked || memcg->oom_kill_disable)
> > + if (!locked || memcg->oom_kill_disable) {
> > need_to_kill = false;
> > + if (current->in_userfault) {
> > + /*
> > + * We start sleeping on the OOM waitqueue only
> > + * after unwinding the page fault stack, so
> > + * make sure we detect wakeups that happen
> > + * between now and then.
> > + */
> > + mem_cgroup_mark_under_oom(memcg);
> > + current->memcg_oom.wakeups =
> > + atomic_read(&memcg->oom_wakeups);
> > + css_get(&memcg->css);
> > + current->memcg_oom.memcg = memcg;
> > + }
> > + }
> > if (locked)
> > mem_cgroup_oom_notify(memcg);
> > spin_unlock(&memcg_oom_lock);
> >
> > - if (need_to_kill) {
> > - finish_wait(&memcg_oom_waitq, &owait.wait);
> > + if (need_to_kill)
> > mem_cgroup_out_of_memory(memcg, mask, order);
>
> Now that I am looking at this again I've realized that this
> is not correct. The task which triggers memcg OOM will not
> have memcg_oom.memcg set so it would trigger a global OOM in
> pagefault_out_of_memory. Either we should return CHARGE_RETRY (and
> propagate it via mem_cgroup_do_charge) for need_to_kill or set up
> current->memcg_oom also for need_to_kill.
You are absolutely right! I fixed it with a separate bit, like so:
+ struct memcg_oom_info {
+ unsigned int in_userfault:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg
OOM and that it should always return true to not invoke the global
killer. But wait_on_memcg might or might not be set, depending on whether the
task needs to sleep.
So the task invoking the memcg OOM killer won't sleep but retry the
fault instead (everybody else might be sleeping and the first OOM
killer invocation might not have freed anything).
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 6dc1882..ff5e2d7 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > while (!(page = follow_page_mask(vma, start,
> > foll_flags, &page_mask))) {
> > int ret;
> > - unsigned int fault_flags = 0;
> > + unsigned int fault_flags = FAULT_FLAG_KERNEL;
> >
> > /* For mlock, just skip the stack guard page. */
> > if (foll_flags & FOLL_MLOCK) {
>
> This is also a bit tricky. Say there is an unlikely situation where a
> task fails to charge because of memcg OOM but couldn't take the oom
> lock, so it ends up with current->memcg_oom set while __get_user_pages
> turns VM_FAULT_OOM into -ENOMEM, and memcg_oom is still there. Then the
> following global OOM condition gets confused (well, the oom will be
> triggered by somebody else so it shouldn't end up in an endless loop,
> but still...), doesn't it?
But current->memcg_oom is not set up unless current->in_userfault.
And get_user_pages does not set this flag.
> > @@ -3760,22 +3761,14 @@ unlock:
> > /*
> > * By the time we get here, we already hold the mm semaphore
> > */
> > -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > - unsigned long address, unsigned int flags)
> > +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, unsigned int flags)
>
> > Is this reusable? Who would call this helper, or is it just for code
> > readability? I would probably prefer a smaller patch, but I do not have a
> strong opinion on this.
I just figured I'd move all the task-specific setup into the outer
function and leave this inner one with the actual page table stuff.
The function is not called from anywhere else, but I don't see a
problem with splitting it into logical parts and making the individual
parts more readable. The alternative would have been to make every
return in handle_mm_fault() a { ret = -EFOO; goto out }, and that
would have been harder to follow.
> > @@ -3856,6 +3849,31 @@ retry:
> > return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> > }
> >
> > +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, unsigned int flags)
> > +{
> > + int in_userfault = !(flags & FAULT_FLAG_KERNEL);
> > + int ret;
> > +
> > + __set_current_state(TASK_RUNNING);
> > +
> > + count_vm_event(PGFAULT);
> > + mem_cgroup_count_vm_event(mm, PGFAULT);
> > +
> > + /* do counter updates before entering really critical section. */
> > + check_sync_rss_stat(current);
> > +
> > + if (in_userfault)
> > + current->in_userfault = 1;
>
> If this is just a memcg thing (although you envision future usage outside
> of memcg), then wouldn't it make more sense to use a memcg helper here which
> would be a noop for !CONFIG_MEMCG and disabled for mem_cgroup_disabled?
Fair enough. I made it conditional on CONFIG_MEMCG, but an extra
branch to set the bit seems a little elaborate, don't you think?
---
From: Johannes Weiner <[email protected]>
Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
The memcg OOM handling is incredibly fragile because once a memcg goes
OOM, one task (kernel or userspace) is responsible for resolving the
situation. Every other task that gets caught trying to charge memory
gets stuck in a waitqueue while potentially holding various filesystem
and mm locks on which the OOM handling task may now deadlock.
Do two things:
1. When OOMing in a system call (buffered IO and friends), invoke the
OOM killer but just return -ENOMEM, never sleep. Userspace should
be able to handle this.
2. When OOMing in a page fault and somebody else is handling the
situation, do not sleep directly in the charging code. Instead,
remember the OOMing memcg in the task struct and then fully unwind
the page fault stack first before going to sleep.
While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 ++
mm/ksm.c | 2 +-
mm/memcontrol.c | 146 ++++++++++++++++++++++++++++-----------------
mm/memory.c | 40 +++++++++----
mm/oom_kill.c | 7 ++-
7 files changed, 154 insertions(+), 70 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index d6183f0..b8ec9d1 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -121,6 +121,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 1;
+}
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 0;
+}
+bool mem_cgroup_oom_synchronize(void);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
@@ -337,6 +346,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e0c8528..93b29ed 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -167,6 +167,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 178a8d9..270420a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1405,6 +1405,12 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ unsigned int in_userfault:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..9dff93b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -372,7 +372,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index de22292..537b74b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -298,6 +298,7 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;
atomic_t refcnt;
@@ -2168,6 +2169,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2179,56 +2181,103 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}
/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);
- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask, order);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this is a
+ * page fault and somebody else is handling the OOM already,
+ * we need to sleep on the OOM waitqueue for this memcg until
+ * the situation is resolved. Which can take some time,
+ * because it might be handled by a userspace task.
+ *
+ * However, this is the charge context, which means that we
+ * may sit on a large call stack and hold various filesystem
+ * locks, the mmap_sem etc., and we don't want the OOM handler
+ * to deadlock on them while we sit here and wait. Store the
+ * current OOM context in the task_struct, then return
+ * -ENOMEM. At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check back
+ * with us by calling mem_cgroup_oom_synchronize(), possibly
+ * putting the task to sleep.
+ */
+ if (current->memcg_oom.in_userfault) {
+ current->memcg_oom.in_memcg_oom = 1;
+ /*
+ * Somebody else is handling the situation. Make sure
+ * no wakeups are missed between now and going to
+ * sleep at the end of the page fault.
+ */
+ if (!need_to_kill) {
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
+ }
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (need_to_kill)
+ mem_cgroup_out_of_memory(memcg, mask, order);
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ spin_unlock(&memcg_oom_lock);
+ }
+}
- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}
@@ -2541,12 +2590,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2603,14 +2651,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}
/*
@@ -2713,7 +2757,7 @@ again:
}
do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;
/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2721,14 +2765,8 @@ again:
goto bypass;
}
- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2741,16 +2779,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);
diff --git a/mm/memory.c b/mm/memory.c
index 6dc1882..6d3d8c3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
while (!(page = follow_page_mask(vma, start,
foll_flags, &page_mask))) {
int ret;
- unsigned int fault_flags = 0;
+ unsigned int fault_flags = FAULT_FLAG_KERNEL;
/* For mlock, just skip the stack guard page. */
if (foll_flags & FOLL_MLOCK) {
@@ -1943,6 +1943,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || address < vma->vm_start)
return -EFAULT;
+ fault_flags |= FAULT_FLAG_KERNEL;
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -3760,22 +3761,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
@@ -3856,6 +3849,31 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (in_userfault)
+ mem_cgroup_set_userfault(current);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (in_userfault)
+ mem_cgroup_clear_userfault(current);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e451a..0c9f836 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;
+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.2.3
On Tue 04-06-13 14:48:52, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote:
[...]
> > > diff --git a/mm/memory.c b/mm/memory.c
> > > index 6dc1882..ff5e2d7 100644
> > > --- a/mm/memory.c
> > > +++ b/mm/memory.c
> > > @@ -1815,7 +1815,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> > > while (!(page = follow_page_mask(vma, start,
> > > foll_flags, &page_mask))) {
> > > int ret;
> > > - unsigned int fault_flags = 0;
> > > + unsigned int fault_flags = FAULT_FLAG_KERNEL;
> > >
> > > /* For mlock, just skip the stack guard page. */
> > > if (foll_flags & FOLL_MLOCK) {
> >
> > This is also a bit tricky. Say there is an unlikely situation where a
> > task fails to charge because of memcg OOM but couldn't take the oom
> > lock, so it ends up with current->memcg_oom set while __get_user_pages
> > turns VM_FAULT_OOM into -ENOMEM, and memcg_oom is still there. Then the
> > following global OOM condition gets confused (well, the oom will be
> > triggered by somebody else so it shouldn't end up in an endless loop,
> > but still...), doesn't it?
>
> But current->memcg_oom is not set up unless current->in_userfault.
> And get_user_pages does not set this flag.
And my selective blindness strikes again :/ For some reason I read
those places as if they enabled the fault flag, which would make some sense
if there was some post-handling...
Anyway, I will get back to the updated patch tomorrow with a clean and
fresh head.
--
Michal Hocko
SUSE Labs
On Tue, 4 Jun 2013, Michal Hocko wrote:
> > I'm not sure a userspace oom notifier would want to keep a
> > preallocated buffer around that is mlocked in memory for all possible
> > lengths of this file.
>
> Well, an oom handler which allocates memory under the same restricted
> memory doesn't make much sense to me. Tracking all kmem allocations
> makes it almost impossible to implement a non-trivial handler.
>
This isn't only about oom notifiers that reside in the same oom memcg;
they can be at a higher level or in a different subtree entirely. The point
is that they can be unresponsive either because userspace is flaky, it's
oom itself, or we do track all slab allocations. The kernel is the only thing in a
position to fix the issue after a sensible user-defined amount of time has
elapsed, and that's what this patch does.
> OK, maybe I just wasn't clear enough or I am missing your point. Your
> users _can_ implement and register their oom handlers. But as your
> requirements on handler implementations are rather lenient, you would
> have a global watchdog which would sit on the oom_control of those
> groups (which are allowed to have their own handlers - all of them in your
> case, I guess) and start a (user defined/global) timeout when it gets a
> notification. If the group stayed under oom for the whole timeout, then
> just disable oom_control until the oom is settled (under_oom is 0).
>
> Why wouldn't something like this work for your use case?
>
For the aforementioned reason that we give users the ability to manipulate
their own memcg trees and userspace is untrusted. Their oom notifiers
cannot be run as root, not only because of security but also because it
would not properly isolate their memory usage to their memcg tree.
> Hohmm, so you are insisting on putting something into the kernel that can
> be implemented in userspace, because it is more convenient for you and
> your use case. This doesn't sound like a good reason for accepting a
> feature.
>
I don't think you yet understand the problem, which is probably my fault.
I'm not insisting this be implemented in the kernel, I'm saying it's not
possible to do it in userspace. Your idea of a timeout implemented in
userspace doesn't work in practice: userspace is both untrusted and cannot
be guaranteed to be perfect, always wake up, get the information it needs
according to its implementation, and issue a SIGKILL.
This is the result of memcg allowing users to disable the oom killer
entirely for a memcg, which is still ridiculous, because if the user
doesn't respond then you've wasted all that memory and cannot get it back
without admin intervention or a reboot. There are no other "features" in
the kernel that put such a responsibility on a userspace process such that
if it doesn't work then the entire memcg deadlocks forever without admin
intervention. We need a failsafe in the kernel.
Real users like this cannot run as root, and we cannot run in the root
memcg without charging that memory usage to the user's container for that
share of a global resource. Real users do have to tolerate buggy and
flaky userspace implementations that cannot be guaranteed to run or do
what they are supposed to do. It's too costly of a problem to not address
with a failsafe. I speak strictly from experience on this.
> And yes we should make memcg oom handling less deadlock prone and
> Johannes' work in this thread is a good step forward.
The long-term solution to that, which I already have patches for, is
something you would cringe even more at: memcg memory reserves that are
shared with per-zone memory reserves that get the global oom killer to
kill off that process without notification in the case the memcg memory
reserves cause a global oom.
On Tue 04-06-13 23:40:16, David Rientjes wrote:
> On Tue, 4 Jun 2013, Michal Hocko wrote:
>
> > > I'm not sure a userspace oom notifier would want to keep a
> > > preallocated buffer around that is mlocked in memory for all possible
> > > lengths of this file.
> >
> > Well, an oom handler which allocates memory under the same restricted
> > memory doesn't make much sense to me. Tracking all kmem allocations
> > makes it almost impossible to implement a non-trivial handler.
> >
>
> This isn't only about oom notifiers that reside in the same oom memcg;
> they can be at a higher level or in a different subtree entirely. The point
> is that they can be unresponsive either because userspace is flaky, it's
> oom itself, or we do track all slab allocations. The kernel is the only thing in a
> position to fix the issue after a sensible user-defined amount of time has
> elapsed, and that's what this patch does.
>
> > OK, maybe I just wasn't clear enough or I am missing your point. Your
> > users _can_ implement and register their oom handlers. But as your
> > requirements on handler implementations are rather lenient, you would
> > have a global watchdog which would sit on the oom_control of those
> > groups (which are allowed to have their own handlers - all of them in your
> > case, I guess) and start a (user defined/global) timeout when it gets a
> > notification. If the group stayed under oom for the whole timeout, then
> > just disable oom_control until the oom is settled (under_oom is 0).
> >
> > Why wouldn't something like this work for your use case?
> >
>
> For the aforementioned reason that we give users the ability to manipulate
> their own memcg trees and userspace is untrusted. Their oom notifiers
> cannot be run as root, not only because of security but also because it
> would not properly isolate their memory usage to their memcg tree.
Yes, but nothing prevents an admin - I hope you trust at least this
entity - from running the global watchdog for the fallback mode. So users can
play as they like, but if they are not able to cope with the oom
situation within the defined timeout then the global (trusted and running
in the root memcg) watchdog re-enables the in-kernel oom handler.
Maybe I just do not see why this setup is a problem for your use case.
> > Hohmm, so you are insisting on putting something into the kernel that can
> > be implemented in userspace, because it is more convenient for you and
> > your use case. This doesn't sound like a good reason for accepting a
> > feature.
> >
>
> I don't think you yet understand the problem, which is probably my fault.
> I'm not insisting this be implemented in the kernel, I'm saying it's not
> possible to do it in userspace.
Because you still insist on having this fallback mode running inside an
untrusted environment AFAIU.
> Your idea of a timeout implemented in userspace doesn't work in
> practice: userspace is both untrusted and cannot be guaranteed to be
> perfect, always wake up, get the information it needs according to
> its implementation, and issue a SIGKILL.
>
> This is the result of memcg allowing users to disable the oom killer
> entirely for a memcg, which is still ridiculous, because if the user
> doesn't respond then you've wasted all that memory and cannot get it back
> without admin intervention or a reboot. There are no other "features" in
> the kernel that put such a responsibility on a userspace process such that
> if it doesn't work then the entire memcg deadlocks forever without admin
> intervention. We need a failsafe in the kernel.
But the memcg would deadlock within constraints assigned by somebody
trusted. So while the memory is basically wasted, the limit assignment
already says that somebody (trusted) dedicated that much memory. So I
think disabling oom forever is not that ridiculous.
> Real users like this cannot run as root, and we cannot run in the root
> memcg without charging that memory usage to the user's container for that
> share of a global resource. Real users do have to tolerate buggy and
> flaky userspace implementations that cannot be guaranteed to run or do
> what they are supposed to do. It's too costly of a problem to not address
> with a failsafe. I speak strictly from experience on this.
>
> > And yes we should make memcg oom handling less deadlock prone and
> > Johannes' work in this thread is a good step forward.
>
> The long-term solution to that, which I already have patches for, is
> something you would cringe even more at: memcg memory reserves that are
> shared with per-zone memory reserves that get the global oom killer to
> kill off that process without notification in the case the memcg memory
> reserves cause a global oom.
I would have to see those patches.
--
Michal Hocko
SUSE Labs
On Tue 04-06-13 14:48:52, Johannes Weiner wrote:
> On Tue, Jun 04, 2013 at 11:17:49AM +0200, Michal Hocko wrote:
[...]
> > Now that I am looking at this again I've realized that this
> > is not correct. The task which triggers memcg OOM will not
> > have memcg_oom.memcg set so it would trigger a global OOM in
> > pagefault_out_of_memory. Either we should return CHARGE_RETRY (and
> > propagate it via mem_cgroup_do_charge) for need_to_kill or set up
> > current->memcg_oom also for need_to_kill.
>
> You are absolutely right! I fixed it with a separate bit, like so:
>
> + struct memcg_oom_info {
> + unsigned int in_userfault:1;
> + unsigned int in_memcg_oom:1;
> + int wakeups;
> + struct mem_cgroup *wait_on_memcg;
> + } memcg_oom;
>
> in_memcg_oom signals mem_cgroup_oom_synchronize() that it's a memcg
> OOM and that it should always return true to not invoke the global
> killer. But wait_on_memcg might or might not be set, depending on whether the
> task needs to sleep.
>
> So the task invoking the memcg OOM killer won't sleep but retry the
> fault instead (everybody else might be sleeping and the first OOM
> killer invocation might not have freed anything).
OK, I was about to say that you should just return CHARGE_RETRY, but that
doesn't solve the locking issue, because the task which
called mem_cgroup_out_of_memory still holds locks which might prevent
the victim from terminating. Maybe worth a note in the changelog?
[...]
> > > @@ -3760,22 +3761,14 @@ unlock:
> > > /*
> > > * By the time we get here, we already hold the mm semaphore
> > > */
> > > -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > - unsigned long address, unsigned int flags)
> > > +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > > + unsigned long address, unsigned int flags)
> >
> > Is this reusable? Who would call this helper, or is it just for code
> > readability? I would probably prefer a smaller patch, but I do not have a
> > strong opinion on this.
>
> I just figured I'd move all the task-specific setup into the outer
> function and leave this inner one with the actual page table stuff.
>
> The function is not called from anywhere else, but I don't see a
> problem with splitting it into logical parts and making the individual
> parts more readable. The alternative would have been to make every
> return in handle_mm_fault() a { ret = -EFOO; goto out }, and that
> would have been harder to follow.
OK
[...]
> From: Johannes Weiner <[email protected]>
> Subject: [PATCH] memcg: do not sleep on OOM waitqueue with full charge context
>
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation. Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
> Do two things:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the
> OOM killer but just return -ENOMEM, never sleep. Userspace should
> be able to handle this.
>
> 2. When OOMing in a page fault and somebody else is handling the
> situation, do not sleep directly in the charging code. Instead,
> remember the OOMing memcg in the task struct and then fully unwind
> the page fault stack first before going to sleep.
>
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
>
> Signed-off-by: Johannes Weiner <[email protected]>
OK, this one looks good. Bitfields are not the nicest piece of art, but
this is just a minor thing.
With the remaining archs moved to pagefault_out_of_memory you can add
Reviewed-by: Michal Hocko <[email protected]>
Thanks!
[...]
--
Michal Hocko
SUSE Labs
On Wed, 5 Jun 2013, Michal Hocko wrote:
> > For the aforementioned reason that we give users the ability to manipulate
> > their own memcg trees and userspace is untrusted. Their oom notifiers
> > cannot be run as root, not only because of security but also because it
> > would not properly isolate their memory usage to their memcg tree.
>
> Yes, but nothing prevents an admin - I hope you trust at least this
> entity - from running the global watchdog for the fallback mode. So users can
> play as they like, but if they are not able to cope with the oom
> situation within the defined timeout then the global (trusted and running
> in the root memcg) watchdog re-enables the in-kernel oom handler.
>
Users have the full ability to manipulate their own memcg hierarchy
under the root, the global entity that schedules these jobs is not aware
of any user subcontainers that are created beneath the user root. These
user subcontainers may be oom and our desire is for the user to be able to
have their own userspace handling implementation at a higher level (or
with memcg memory reserves). Userspace is untrusted, they can't be
expected to register an oom notifier for a child memcg with a global
resource, they may not care that they deadlock and leave behind gigabytes
of memory that can't be freed if they oom. And, if that userspace global
resource dies or becomes unresponsive itself for whatever reason, all oom
memcgs deadlock.
> > I don't think you yet understand the problem, which is probably my fault.
> > I'm not insisting this be implemented in the kernel, I'm saying it's not
> > possible to do it in userspace.
>
> Because you still insist on having this fallback mode running inside an
> untrusted environment AFAIU.
>
-ENOPARSE. The failsafe is the kernel, it ensures that memcgs don't sit
completely deadlocked for days and weeks and take up resources that can
never be freed. The entire purpose of userspace oom notification is so
that users can implement their own policy, whatever is implemented in the
kernel may not apply (they may want to kill the largest process, the
newest, the youngest, one on a priority-based scale, etc).
> > This is the result of memcg allowing users to disable the oom killer
> > entirely for a memcg, which is still ridiculous, because if the user
> > doesn't respond then you've wasted all that memory and cannot get it back
> > without admin intervention or a reboot. There are no other "features" in
> > the kernel that put such a responsibility on a userspace process such that
> > if it doesn't work then the entire memcg deadlocks forever without admin
> > intervention. We need a failsafe in the kernel.
>
> But the memcg would deadlock within constraints assigned by somebody
> trusted. So while the memory is basically wasted, the limit assignment
> already says that somebody (trusted) dedicated that much memory. So I
> think disabling oom forever is not that ridiculous.
>
I don't understand what you're trying to say. Yes, a trusted resource
sets the user root's limits and that is their allotted use. To implement
a sane userspace oom handler, we need to give it time to respond; my
solution is memory.oom_delay_millisecs, your solution is disabling the oom
killer for that memcg. Anything else results in an instant oom kill from
the kernel. If the user has their own implementation, with today's kernel
it is required to disable the oom killer entirely and nothing in that
untrusted environment is ever guaranteed to re-enable the oom killer or
even have the memory to do so if it wanted. Meanwhile, the trusted
resource has no knowledge whatsoever of these user subcontainers and it
can't infinitely scan the memcg tree to find them because that requires
memory that may not be available because of global oom or because of slab
accounting.
On Wed 05-06-13 17:09:17, David Rientjes wrote:
> On Wed, 5 Jun 2013, Michal Hocko wrote:
>
> > > For the aforementioned reason that we give users the ability to manipulate
> > > their own memcg trees and userspace is untrusted. Their oom notifiers
> > > cannot be run as root, not only because of security but also because it
> > > would not properly isolate their memory usage to their memcg tree.
> >
> > Yes, but nothing prevents an admin - I hope you trust at least this
> > entity - from running the global watchdog for the fallback mode. So users can
> > play as they like, but if they are not able to cope with the oom
> > situation within the defined timeout then the global (trusted and running
> > in the root memcg) watchdog re-enables the in-kernel oom handler.
> >
>
> Users have the full ability to manipulate their own memcg hierarchy
> under the root, the global entity that schedules these jobs is not aware
> of any user subcontainers that are created beneath the user root. These
> user subcontainers may be oom and our desire is for the user to be able to
> have their own userspace handling implementation at a higher level (or
> with memcg memory reserves). Userspace is untrusted, they can't be
> expected to register an oom notifier for a child memcg with a global
> resource, they may not care that they deadlock and leave behind gigabytes
> of memory that can't be freed if they oom. And, if that userspace global
> resource dies or becomes unresponsive itself for whatever reason, all oom
> memcgs deadlock.
>
> > > I don't think you yet understand the problem, which is probably my fault.
> > > I'm not insisting this be implemented in the kernel, I'm saying it's not
> > > possible to do it in userspace.
> >
> > Because you still insist on having this fallback mode running inside an
> > untrusted environment AFAIU.
> >
>
> -ENOPARSE.
Either the global watchdog is running as trusted code and you _can_
implement it without deadlocks (this is what I claim and I haven't heard
strong arguments against that), or the watchdog runs as untrusted
code, and then I would ask why.
> The failsafe is the kernel, it ensures that memcgs don't sit
> completely deadlocked for days and weeks and take up resources that
> can never be freed.
Yes, I _understand_ what you are proposing.
Your only objection against userspace handling so far was that admin
has no control over sub-hierarchies. But that is hardly a problem. A
periodic tree scan would solve this easily (just to make sure we are on
the same page - this is still the global watchdog living in a trusted
root cgroup which is marked as unkillable for the global oom).
> The entire purpose of userspace oom notification is so
> that users can implement their own policy, whatever is implemented in the
> kernel may not apply (they may want to kill the largest process, the
> newest, the youngest, one on a priority-based scale, etc).
>
> > > This is the result of memcg allowing users to disable the oom killer
> > > entirely for a memcg, which is still ridiculous, because if the user
> > > doesn't respond then you've wasted all that memory and cannot get it back
> > > without admin intervention or a reboot. There are no other "features" in
> > > the kernel that put such a responsibility on a userspace process such that
> > > if it doesn't work then the entire memcg deadlocks forever without admin
> > > intervention. We need a failsafe in the kernel.
> >
> > But the memcg would deadlock within constraints assigned by somebody
> > trusted. So while the memory is basically wasted, the limit assignment
> > already says that somebody (trusted) dedicated that much memory. So I
> > think disabling oom forever is not that ridiculous.
> >
>
> I don't understand what you're trying to say. Yes, a trusted resource
> sets the user root's limits and that is their allotted use. To implement
> a sane userspace oom handler, we need to give it time to respond; my
> solution is memory.oom_delay_millisecs, your solution is disabling the oom
> killer for that memcg.
No! My solution is to let (trusted) userspace implement the
oom_delay_millisecs fallback if user space is not able to handle the
oom in time.
> Anything else results in an instant oom kill from the kernel. If the
> user has their own implementation, with today's kernel it is required
> to disable the oom killer entirely and nothing in that untrusted
> environment is ever guaranteed to re-enable the oom killer or even
> have the memory to do so if it wanted.
> Meanwhile, the trusted resource has no knowledge whatsoever of these
> user subcontainers and it can't infinitely scan the memcg tree to find
> them because that requires memory that may not be available because of
> global oom or because of slab accounting.
Global oom can be handled by oom_adj for the global watchdog, and slab
accounting should be a non-issue as the limit cannot be set for the
root cgroup - or the watchdog can live in an unlimited group.
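To be concrete, protecting the watchdog is trivial; a minimal sketch
using the newer oom_score_adj interface (error handling omitted):

  #include <fcntl.h>
  #include <unistd.h>

  /* at watchdog startup: exempt the watchdog from the global oom killer */
  static void oom_protect_self(void)
  {
          int fd = open("/proc/self/oom_score_adj", O_WRONLY);

          if (fd >= 0) {
                  write(fd, "-1000", 5); /* OOM_SCORE_ADJ_MIN: never selected */
                  close(fd);
          }
  }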
Besides that, how much memory are we talking about here (per
memcg)? Have you measured that? Is it possible that your untrusted
users could cause a DoS by creating too many groups? I would be really
surprised, and would then argue about the quality of the global watchdog.
David, maybe I am still missing something but it seems to me that the
in-kernel timeout is not necessary because the user-space variant is not
that hard to implement and I really do not want to add new knobs for
something that can live outside.
--
Michal Hocko
SUSE Labs
On Mon, 10 Jun 2013, Michal Hocko wrote:
> > > > I don't think you yet understand the problem, which is probably my fault.
> > > > I'm not insisting this be implemented in the kernel, I'm saying it's not
> > > > possible to do it in userspace.
> > >
> > > Because you still insist on having this fallback mode running inside
> > > an untrusted environment AFAIU.
> > >
> >
> > -ENOPARSE.
>
> Either the global watchdog is running as trusted code and you _can_
> implement it without deadlocks (this is what I claim and I haven't heard
> strong arguments against that), or the watchdog runs as untrusted code,
> and then I would ask why.
>
As I already said, no global entity could possibly know and monitor the
entire set of memcgs when users are given the power to create their own
child memcgs within their tree. Not only would it be unnecessarily
expensive to scan the entire memcg tree all the time, but it would also be
racy and miss oom events when a child memcg is created and hits its limit
before the global entity can scan for it.
> Your only objection against userspace handling so far was that admin
> has no control over sub-hierarchies. But that is hardly a problem. A
> periodic tree scan would solve this easily (just to make sure we are on
> the same page - this is still the global watchdog living in a trusted
> root cgroup which is marked as unkillable for the global oom).
>
There's no way a "periodic tree scan" would solve this easily, it's
completely racy and leaves the possibility of missing oom events when they
happen in a new memcg before the next tree scan.
Not only that, but what happens when the entire machine is oom? The
global watchdog can't allocate any memory to handle the event, yet it is
supposed to always be doing entire memcg tree scans, registering oom
handlers, and handling wakeups when notified?
How does this "global watchdog" know when to stop the timer? In the
kernel that would happen if the user expands the memcg limit or there's an
uncharge to the memcg. The global watchdog stores the limit at the time
of oom and compares it before killing something? How many memcgs do we
have to store this value for without allocating memory in a global oom
condition? You expect us to run with all this memory mlocked forever?
How does the global watchdog know that there was an uncharge and then a
charge in the memcg so it is still oom? This would reset the timer in the
kernel but not in userspace, and may result in unnecessary killing. If
the timeout is 10s and the memcg is oom for 9s, uncharges, recharges,
and the global watchdog checks at 10s that it is still oom, we don't
want the kill, because we do get uncharge events.
This idea for a global watchdog just doesn't work.
> > Meanwhile, the trusted resource has no knowledge whatsoever of these
> > user subcontainers and it can't infinitely scan the memcg tree to find
> > them because that requires memory that may not be available because of
> > global oom or because of slab accounting.
>
> Global oom can be handled by oom_adj for the global watchdog, and slab
> accounting should be a non-issue as the limit cannot be set for the
> root cgroup - or the watchdog can live in an unlimited group.
>
I'm referring to a global oom condition where physical RAM is exhausted,
not anything to do with memcg.
> Besides that, how much memory are we talking about here (per
> memcg)? Have you measured that? Is it possible that your untrusted
> users could cause a DoS by creating too many groups? I would be really
> surprised, and would then argue about the quality of the global watchdog.
>
We limit the number of css_ids in children but not the number of
top-level memcgs created by the trusted resource.
> David, maybe I am still missing something but it seems to me that the
> in-kernel timeout is not necessary because the user-space variant is not
> that hard to implement and I really do not want to add new knobs for
> something that can live outside.
It's vitally necessary, and unless you answer the questions I've asked
about your proposed "global watchdog", which exists only in theory,
we'll just continue to have this circular argument. You cannot implement
a userspace variant that works this way, and insisting that you can is
ridiculous. These are problems that real users face, and we insist on
the problem being addressed.
On Tue 11-06-13 13:33:40, David Rientjes wrote:
> On Mon, 10 Jun 2013, Michal Hocko wrote:
[...]
> > Your only objection against userspace handling so far was that admin
> > has no control over sub-hierarchies. But that is hardly a problem. A
> > periodic tree scan would solve this easily (just to make sure we are on
> > the same page - this is still the global watchdog living in a trusted
> > root cgroup which is marked as unkillable for the global oom).
> >
>
> There's no way a "periodic tree scan" would solve this easily, it's
> completely racy and leaves the possibility of missing oom events when they
> happen in a new memcg before the next tree scan.
But the objective is to handle oom deadlocks gracefully and you cannot
possibly miss those as they are, well, _deadlocks_. So while the
watchdog might really miss some oom events which have already been
handled, that is absolutely not important here. Please note that the
eventfd is also notified during registration if the group is already
under oom. So you cannot miss a deadlocked group.
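For illustration, the registration itself is a handful of syscalls (the
v1 interface; error handling omitted, and the memcg path is whatever the
watchdog is currently scanning):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/eventfd.h>
  #include <unistd.h>

  /* register for oom notifications on one group; if the group is
   * already under oom the eventfd fires immediately, so a deadlocked
   * group cannot be missed */
  static int register_oom_eventfd(const char *memcg)
  {
          char path[256], arg[64];
          int efd = eventfd(0, 0);
          int ofd, cfd;

          snprintf(path, sizeof(path), "%s/memory.oom_control", memcg);
          ofd = open(path, O_RDONLY);
          snprintf(path, sizeof(path), "%s/cgroup.event_control", memcg);
          cfd = open(path, O_WRONLY);
          snprintf(arg, sizeof(arg), "%d %d", efd, ofd);
          write(cfd, arg, strlen(arg)); /* "<event_fd> <ctrl file fd>" */
          close(cfd);
          close(ofd);
          return efd; /* read() on this blocks until the group is oom */
  }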
> Not only that, but what happens when the entire machine is oom?
Then a memcg deadlock is irrelevant, isn't it? Deadlocked tasks are
sitting in the KILLABLE queue so they can be killed. And you can protect
the watchdog from being killed.
> The global watchdog can't allocate any memory to handle the event, yet
> it is supposed to always be doing entire memcg tree scans, registering
> oom handlers, and handling wakeups when notified?
So what? The global OOM will find a victim and life goes on. I do not
think that the watchdog's memory consumption would be significant (a
simple queue for timers, eventfd state for each existing group, a few
threads).
> How does this "global watchdog" know when to stop the timer? In the
> kernel that would happen if the user expands the memcg limit or there's an
> uncharge to the memcg.
Then the group won't be marked under_oom in memory.oom_control at the
time the timeout triggers, so the watchdog knows that the oom has been
handled already and the timer can be discarded.
> The global watchdog stores the limit at the time of oom and compares
> it before killing something?
Why it would need to check the limit? It cares only about the oom
events.
> How many memcgs do we have to store this value for without allocating
> memory in a global oom condition? You expect us to run with all this
> memory mlocked in memory forever?
No, the only thing that the watchdog has to care about is memory.oom_control.
> How does the global watchdog know that there was an uncharge and then a
> charge in the memcg so it is still oom?
It will get a new event for every new oom.
> This would reset the timer in the kernel but not in userspace and may
> result in unnecessary killing.
A user space watchdog would do the same thing. When it receives an
event it will enqueue or requeue the corresponding timer.
> If the timeout is 10s, the memcg is
> oom for 9s, uncharges, recharges, and the global watchdog checks at
> 10s that it is still oom, we don't want the kill because we do get
> uncharge events.
>
> This idea for a global watchdog just doesn't work.
I am not convinced about that. Maybe I am missing some aspect but all
problems you have mentioned are _solvable_.
[...]
> > David, maybe I am still missing something but it seems to me that the
> > in-kernel timeout is not necessary because the user-space variant is not
> > that hard to implement and I really do not want to add new knobs for
> > something that can live outside.
>
> It's vitally necessary and unless you answer the questions I've asked
> about your proposed "global watchdog" that exists only in your theory then
Which questions are still not answered?
> we'll just continue to have this circular argument. You cannot implement
> a userspace variant that works this way and insisting that you can is
> ridiculous. These are problems that real users face and we insist on the
> problem being addressed.
Your use case pushes the interface to an extreme, to be honest. You are
proposing a new knob that would have to be supported forever. I feel
that we should be careful and prefer a user space solution if there is
any.
To be clear, I am not nacking this patch and I will not ack it either.
If Johannes, Kamezawa or Andrew are OK with it I will not block it.
--
Michal Hocko
SUSE Labs
On Wed, 12 Jun 2013, Michal Hocko wrote:
> But the objective is to handle oom deadlocks gracefully and you cannot
> possibly miss those as they are, well, _deadlocks_.
That's not at all the objective; the changelog quite explicitly states
this is a deadlock as the result of userspace having disabled the oom
killer so that its userspace oom handler can resolve the condition, and
it being unresponsive or unable to perform its job.
When you allow users to create their own memcgs, which we do and is
possible by chowning the user's root to be owned by it, and implement
their own userspace oom notifier, you must then rely on their
implementation to work 100% of the time, otherwise all those gigabytes of
memory go unfreed forever. What you're insisting on is that this
userspace is perfect and there is never any memory allocated (otherwise it
may oom its own user root memcg where the notifier is hosted) and it is
always responsive and able to handle the situation. This is not reality.
This is why the kernel has its own oom killer and doesn't wait for a user
to go to kill something. There's no option to disable the kernel oom
killer. It's because we don't want to leave the system in a state where
no progress can be made. The same intention is for memcgs to not be left
in a state where no progress can be made even if userspace has the best
intentions.
Your solution of a global entity to prevent these situations doesn't work
for the same reason we can't implement the kernel oom killer in userspace.
It's the exact same reason. We also want to push patches that allow
global oom conditions to trigger an eventfd notification on the root memcg
with the exact same semantics of a memcg oom: allow it time to respond but
step in and kill something if it fails to respond. Memcg happens to be
the perfect place to implement such a userspace policy and we want to have
a priority-based killing mechanism that is hierarchical and different from
oom_score_adj.
For that to work properly, it cannot possibly allocate memory even on page
fault so it must be mlocked in memory and have enough buffers to store the
priorities of top-level memcgs. Asking a global watchdog to sit there
mlocked in memory to store thousands of memcgs, their priorities, their
last oom, their timeouts, etc, is a non-starter.
I don't buy your argument that we're pushing any interface to an extreme.
Users having the ability to manipulate their own memcgs and subcontainers
isn't extreme, it's explicitly allowed by cgroups! What we're asking for
is that this level of control for memcg be sane and that, if userspace
is unresponsive, we don't lose gigabytes of memory forever. And since
we've supported this type of functionality even before memcg was created
for cpusets and have used and supported it for six years, I have no
problem supporting such a thing upstream.
I do understand that we're the largest user of memcg and use it unlike you
or others on this thread do, but that doesn't mean our usecase is any less
important or that we shouldn't aim for the most robust behavior possible.
On Wed 12-06-13 14:27:05, David Rientjes wrote:
> On Wed, 12 Jun 2013, Michal Hocko wrote:
>
> > But the objective is to handle oom deadlocks gracefully and you cannot
> > possibly miss those as they are, well, _deadlocks_.
>
> That's not at all the objective; the changelog quite explicitly states
> this is a deadlock as the result of userspace having disabled the oom
> killer so that its userspace oom handler can resolve the condition, and
> it being unresponsive or unable to perform its job.
Ohh, so another round. Sigh. You insist on having user space handlers
running in the context of the limited group. OK, I can understand your
use case, although I think it is pushing the limits of the interface and
it is dangerous.
As the problems/deadlocks are unavoidable with this approach, you really
need a backup plan to reduce the damage once they happen. You insist on
an in-kernel solution while a user space alternative is possible IMO.
> When you allow users to create their own memcgs, which we do and is
> possible by chowning the user's root to be owned by it, and implement
> their own userspace oom notifier, you must then rely on their
> implementation to work 100% of the time, otherwise all those gigabytes of
> memory go unfreed forever. What you're insisting on is that this
> userspace is perfect
No, I am not saying that. You can let your untrusted users handle
their OOMs as you like. The watchdog (your fallback solution) _has_
to be trusted though and if you want it to do the job then you better
implement it correctly. Is this requirement a problem?
> and there is never any memory allocated (otherwise it may oom its own
> user root memcg where the notifier is hosted)
If the watchdog runs under root memcg then there is no limit so the only
OOM that might happen is the global one and you can protect the watchdog
by oom_adj as I have mentioned earlier.
> and it is always responsive and able to handle the situation.
If it is a trusted code then it can run with a real time priority.
> This is not reality.
It seems you just do not want to accept that there is another solution
because the kernel solution sounds like an easier option for you.
> This is why the kernel has its own oom killer and doesn't wait for a user
> to go to kill something. There's no option to disable the kernel oom
> killer. It's because we don't want to leave the system in a state where
> no progress can be made. The same intention is for memcgs to not be left
> in a state where no progress can be made even if userspace has the best
> intentions.
>
> Your solution of a global entity to prevent these situations doesn't work
> for the same reason we can't implement the kernel oom killer in userspace.
> It's the exact same reason.
No it is not! The core difference is that there is _always_ some memory
for the watchdog (because something else might be killed to free some
memory) while there is none for the global OOM. So the watchdog doesn't
even need to mlock everything.
> We also want to push patches that allow global oom conditions to
> trigger an eventfd notification on the root memcg with the exact same
> semantics of a memcg oom:
As already mentioned we have discussed this at LSF. I am still not
sure it is the right thing to do though. The interface would be too
tricky. There are other options to implement user defined policy for the
global OOM and they should be considered before there is a decision to
push it into memcg.
> allow it time to respond but
> step in and kill something if it fails to respond. Memcg happens to be
> the perfect place to implement such a userspace policy and we want to have
> a priority-based killing mechanism that is hierarchical and different from
> oom_score_adj.
It might sound perfect for your use cases but we should be _really_
careful not to pull another tricky interface into memcg that fits only
a certain scenario.
Remember the use_hierarchy thingy? It sounded like a good idea at the
time but turned into a nightmare. It too aimed at solving a restriction
of its day; the restriction is not here anymore AFAICT, but we are left
with crippled hierarchy semantics.
> For that to work properly, it cannot possibly allocate memory even on page
> fault so it must be mlocked in memory and have enough buffers to store the
> priorities of top-level memcgs. Asking a global watchdog to sit there
> mlocked in memory to store thousands of memcgs, their priorities, their
> last oom, their timeouts, etc, is a non-starter.
I have already asked you for an estimate. I do not think that the
memory consumption would really matter here. We are talking about a few
megs at most, and even that is exaggerated.
> I don't buy your argument that we're pushing any interface to an extreme.
OK, call it a matter of taste but handling oom in the context of the
oom itself without any requirements to the handler implementation and
without access to any memory reserves because the handler is not trusted
feels weird and scary to me.
> Users having the ability to manipulate their own memcgs and subcontainers
> isn't extreme, it's explicitly allowed by cgroups!
Sure thing and no discussion about that. We are just arguing who should
be responsible here. Admin who allows such a setup or kernel.
> What we're asking for is that this level of control for memcg be sane and
I can argue that it _is_ quite sane in its current form. The interface
allows you to handle the oom either by increasing the limit which
doesn't allocate any kernel memory or by killing a task which doesn't
allocate any memory either.
If the handler cannot make the oom decision without allocating memory,
and it is running in the restricted environment at the same time, then
that calls for trouble. If you, as an admin, want to allow such a setup
then why not, but be aware of the potential problems and handle them
when they happen. If there were no sane way to implement such a stopgap
measure then I wouldn't be objecting.
I do not think that your oom_control-at-root-group-level argument really
matters now because a) there is nothing like that in the kernel yet and
b) it is questionable whether there ever will be (as the discussion
hasn't started yet).
Not to mention that even if there were, I think the watchdog could still
be implemented (e.g. re-enable global oom after a timeout from a
different thread in the global watchdog - you can certainly do that
without any allocations).
> that, if userspace is unresponsive, we don't lose gigabytes of
> memory forever. And since we've supported this type of functionality
> even before memcg was created for cpusets and have used and supported
> it for six years, I have no problem supporting such a thing upstream.
>
> I do understand that we're the largest user of memcg and use it unlike you
> or others on this thread do, but that doesn't mean our usecase is any less
> important or that we shouldn't aim for the most robust behavior possible.
--
Michal Hocko
SUSE Labs
On Thu, 13 Jun 2013, Michal Hocko wrote:
> > That's not at all the objective; the changelog quite explicitly states
> > this is a deadlock as the result of userspace having disabled the oom
> > killer so that its userspace oom handler can resolve the condition, and
> > it being unresponsive or unable to perform its job.
>
> Ohh, so another round. Sigh. You insist on having user space handlers
> running in the context of the limited group. OK, I can understand your
> use case, although I think it is pushing the limits of the interface and
> it is dangerous.
Ok, this is where our misunderstanding is, and I can see why you have
reacted the way you have. It's my fault for not describing where we're
going with this.
Userspace oom handling right now is very inadequate, not only for the
reasons that you and Johannes have described. The requirement that
userspace oom handlers must run with more memory than the memcgs they
are monitoring is where the pain is.
We have to accept the fact that users can be given control of their own
memcg trees by chowning that "user root" directory to the user. The only
real change you must enforce is that the limit cannot be increased unless
you have a special capability. Otherwise, users can be given complete
control over their own tree and enforce their own memory limits (or just
accounting).
I haven't seen any objection to that usecase from anybody, so I'm assuming
that it's something that is understood and acknowledged. Nothing in the
cgroups code prevents it and I believe it gives a much more powerful
interface to the user to control their own destiny much like setting up
root level memcgs for the admin.
The problem is that we want to control global oom conditions as well.
oom_score_adj simply doesn't allow for user-specific implementations
either within user subcontainers or at the root. Users may very well want
to kill the allocating task in their memcg to avoid expensive scans (SGI
required oom_kill_allocating_task at a global level), kill the largest
memory consumer, do a priority-based memcg scoring, kill the newest
process, etc. That's the power of user oom notifications beyond just
statistics collection or creating a tombstone to examine later.
We want to do that at the root level as well for global oom conditions.
We have a priority-based scoring method amongst top-level memcgs. They
are jobs that are scheduled on machines and given a certain priority to
run at. We also overcommit the machine so the sum of all top-level
memory.limit_in_bytes exceeds the amount of physical RAM since jobs seldom
use their entire reservation. When RAM is exhausted, we want to kill the
lowest priority process from the lowest priority memcg.
That's not possible to do currently because we lack global oom
notification and userspace oom handling. We want to change that by
notifying on the root memcg's memory.oom_control for global oom and wakeup
a process that will handle global oom conditions. Memcg is the perfect
vehicle for this since it already does userspace oom notification and
because the oom condition in a "user root" can be handled the same as a
"global oom".
Waking up userspace and asking it to do anything useful in a global oom
condition is problematic if it is unresponsive for whatever reason.
Especially with true kmem accounting of all slab, we cannot be guaranteed
that this global oom handler can actually do things like query the memory
usage of a process or read a memcg's tasks file. For this reason, we want
a kernel failsafe to ensure that if it isn't able to respond in a
reasonable amount of time, the kernel frees memory itself.
There's simply no alternative, then, to doing something like
memory.oom_delay_millisecs at the root level. For notification to
actually work, we need to preempt the kernel from killing for at least a
certain amount of time: the solution right now is to disable the oom
killer entirely with memory.oom_control == 1. If the global oom handler
fails to respond, the machine is essentially dead unless something
starts freeing memory back for some strange reason.
For our own priority-based memcg oom scoring implementation, this global
oom handler would then signal the memcg oom handler in the lowest priority
memcg, if it exists, to handle the situation or it would kill something
from that memcg itself.
The key is that without some sort of kernel-imposed delay, there's no way
to sanely implement a global oom handler that doesn't risk killing the
entire machine because the kernel oom killer is deferred. This patch
makes that delay configurable.
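Roughly, the shape of the in-kernel check is the following (a paraphrase
of the patch's semantics; the field and label names are illustrative,
not necessarily the exact ones in mm/memcontrol.c):

  /* illustrative sketch, not the literal patch: on a failed charge
   * with a delay configured, sleep out the delay once before calling
   * the oom killer */
  if (memcg->oom_delay_millisecs && !memcg->oom_delay_expired) {
          /* 'expired' is cleared again on uncharge, limit raise, an
           * oom kill, or a rewrite of memory.oom_delay_millisecs */
          memcg->oom_delay_expired = true;
          schedule_timeout_killable(
                  msecs_to_jiffies(memcg->oom_delay_millisecs));
          return CHARGE_RETRY;  /* retry the charge after sleeping */
  }
  /* the delay already elapsed and we are still oom: kernel kills */
  mem_cgroup_out_of_memory(memcg, gfp_mask, order);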
I could certainly push the patch that does the global oom notification on
the root level memcg's memory.oom_control first if you'd like, but I
didn't think there would be so much opposition to running a memcg oom
handler in the memcg itself, which is what this essentially is.
If that's done, then we can't simply rely on the global oom killer, as
your suggestion does when things go wrong, to act as a watchdog that
steps in and kills something in such a situation. We always want to
kill from
the lowest-priority memcg in global oom conditions and we don't expect to
implement a multitude of different oom selection algorithms in the kernel.
We want to punt to userspace and give it the functionality in the kernel
that it needs when things go wrong.
(2013/06/14 7:25), David Rientjes wrote:
> On Thu, 13 Jun 2013, Michal Hocko wrote:
>
>>> That's not at all the objective; the changelog quite explicitly states
>>> this is a deadlock as the result of userspace having disabled the oom
>>> killer so that its userspace oom handler can resolve the condition, and
>>> it being unresponsive or unable to perform its job.
>>
>> Ohh, so another round. Sigh. You insist on having user space handlers
>> running in the context of the limited group. OK, I can understand your
>> use case, although I think it is pushing the limits of the interface and
>> it is dangerous.
>
> Ok, this is where our misunderstanding is, and I can see why you have
> reacted the way you have. It's my fault for not describing where we're
> going with this.
>
Reading your discussion, I think I understand your requirements.
The problem is that I don't think you took all the options into
account and found that the best way is this new oom_delay. IOW, I'm not
convinced oom-delay is the best way to handle your issue.
Your requirement is
- Allowing a userland oom-handler within a local memcg.
Considered straightforwardly, the answer should be
- Allowing the oom-handler daemon out of the memcg's control by its limit.
(For example, a flag/capability for a task can achieve this.)
Or attaching some *fixed* resource to the task rather than the cgroup.
Allow setting task->secret_saving=20M.
Going back to your patch, what's confusing is your approach.
Why should a problem caused by an amount of memory be solved by some
delay, i.e. an amount of time?
This exchange sounds confusing to me.
I'm not against what you finally want to do, but I don't like the fix.
Thanks,
-Kame
On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
> Reading your discussion, I think I understand your requirements.
> The problem is that I don't think you took all the options into
> account and found that the best way is this new oom_delay. IOW, I'm not
> convinced oom-delay is the best way to handle your issue.
>
Ok, let's talk about it.
> Your requirement is
> - Allowing a userland oom-handler within a local memcg.
>
Another requirement:
- Allow userland oom handler for global oom conditions.
Hopefully that's hooked into memcg because the functionality is already
there, we can simply duplicate all of the oom functionality that we'll be
adding for the root memcg.
> Considered straightforwardly, the answer should be
> - Allowing the oom-handler daemon out of the memcg's control by its limit.
> (For example, a flag/capability for a task can achieve this.)
> Or attaching some *fixed* resource to the task rather than the cgroup.
>
> Allow setting task->secret_saving=20M.
>
Exactly!
First of all, thanks very much for taking an interest in our usecase and
discussing it with us.
I didn't propose what I referred to earlier in the thread as "memcg
reserves" because I thought it was going to be a more difficult battle.
The fact that you brought it up first actually makes me think it's less
insane :)
We do indeed want memcg reserves and I have patches to add it if you'd
like to see that first. It ensures that this userspace oom handler can
actually do some work in determining which process to kill. The reserve
is a fraction of true memory reserves (the space below the per-zone min
watermarks) which is dependent on min_free_kbytes. This does indeed
become more difficult with true and complete kmem charging. That "work"
could be opening the tasks file (which allocates the pidlist within the
kernel), checking /proc/pid/status for rss, checking for how long a
process has been running, checking for tid, sending a signal to drop
caches, etc.
We'd also like to do this for global oom conditions, which makes it even
more interesting. I was thinking of using a fraction of memory reserves
as the oom killer currently does (that memory below the min watermark) for
these purposes.
Memory charging is simply bypassed for these oom handlers (we only grant
access to those waiting on the memory.oom_control eventfd) up to
memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think
this is entirely insane because these oom handlers should lead to future
memory freeing, just like TIF_MEMDIE processes.
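As a sketch of the charge-side check (hypothetical helper and field
names, not the actual patches I have queued):

  /* hypothetical sketch: let a registered oom handler overcharge the
   * memcg by a small reserve derived from min_free_kbytes */
  static bool charge_may_bypass_limit(struct mem_cgroup *memcg,
                                      struct task_struct *p,
                                      u64 usage, u64 limit)
  {
          u64 reserve = ((u64)min_free_kbytes << 10) / 4; /* bytes */

          /* only tasks waiting on memory.oom_control's eventfd qualify */
          if (!task_is_oom_handler(memcg, p)) /* hypothetical helper */
                  return false;
          return usage < limit + reserve;
  }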
> Going back to your patch, what's confusing is your approach.
> Why should a problem caused by an amount of memory be solved by some
> delay, i.e. an amount of time?
>
> This exchange sounds confusing to me.
>
Even with all of the above (which is not actually that invasive of a
patch), I still think we need memory.oom_delay_millisecs. I probably made
a mistake in describing what that is addressing if it seems like it's
trying to address any of the above.
If a userspace oom handler fails to respond even with access to those
"memcg reserves", the kernel needs to kill within that memcg. Do we do
that above a set time period (this patch) or when the reserves are
completely exhausted? That's debatable, but if we are to allow it for
global oom conditions as well then my opinion was to make it as safe as
possible; today, we can't disable the global oom killer from userspace and
I don't think we should ever allow it to be disabled. I think we should
allow userspace a reasonable amount of time to respond and then kill if it
is exceeded.
For the global oom case, we want to have a priority-based memcg selection.
Select the lowest priority top-level memcg and kill within it. If it has
an oom notifier, send it a signal to kill something. If it fails to
react, kill something after memory.oom_delay_millisecs has elapsed. If
there isn't a userspace oom notifier, kill something within that lowest
priority memcg.
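In rough C, the selection the global handler does is nothing more than
this (the per-job priority is our own bookkeeping inside the handler,
not an existing memcg file):

  #include <stddef.h>

  struct job {
          const char *memcg_path; /* top-level memcg of the job */
          int priority;           /* lower == killed first; our policy */
          int oom_efd;            /* the job's own oom handler eventfd,
                                     or -1 if it has none */
  };

  /* on global oom, pick the lowest-priority job; signal its handler
   * if it has one, otherwise kill inside that memcg directly */
  static struct job *pick_victim(struct job *jobs, size_t n)
  {
          struct job *victim = NULL;
          size_t i;

          for (i = 0; i < n; i++)
                  if (!victim || jobs[i].priority < victim->priority)
                          victim = &jobs[i];
          return victim;
  }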
The bottomline with my approach is that I don't believe there is ever a
reason for an oom memcg to remain oom indefinitely. That's why I hate
memory.oom_control == 1 and I think for the global notification it would
be deemed a nonstarter since you couldn't even login to the machine.
> I'm not against what you finally want to do, but I don't like the fix.
>
I'm thrilled to hear that, and I hope we can work to make userspace oom
handling more effective.
What do you think about that above?
On Fri, 14 Jun 2013, David Rientjes wrote:
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs. I probably made
> a mistake in describing what that is addressing if it seems like it's
> trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves", the kernel needs to kill within that memcg. Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted? That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace and
> I don't think we should ever allow it to be disabled. I think we should
> allow userspace a reasonable amount of time to respond and then kill if it
> is exceeded.
>
> For the global oom case, we want to have a priority-based memcg selection.
> Select the lowest priority top-level memcg and kill within it. If it has
> an oom notifier, send it a signal to kill something. If it fails to
> react, kill something after memory.oom_delay_millisecs has elapsed. If
> there isn't a userspace oom notifier, kill something within that lowest
> priority memcg.
>
> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely. That's why I hate
> memory.oom_control == 1 and I think for the global notification it would
> be deemed a nonstarter since you couldn't even login to the machine.
>
What's the status of this patch? I'd love to be able to introduce memcg
reserves so that userspace oom notifications can actually work, but we
still require a failsafe in the kernel if those reserves are depleted or
userspace cannot respond.
(2013/06/14 19:12), David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
>
>> Reading your discussion, I think I understand your requirements.
>> The problem is that I don't think you took all the options into
>> account and found that the best way is this new oom_delay. IOW, I'm not
>> convinced oom-delay is the best way to handle your issue.
>>
>
> Ok, let's talk about it.
>
I'm sorry that my RTT is long these days.
>> Your requirement is
>> - Allowing a userland oom-handler within a local memcg.
>>
>
> Another requirement:
>
> - Allow userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there, we can simply duplicate all of the oom functionality that we'll be
> adding for the root memcg.
>
At mm-summit this was discussed, and people seemed to think a user-land
oom-handler is impossible. Hm, and in-kernel scripting was discussed,
as far as I remember.
>> Considered straightforwardly, the answer should be
>> - Allowing the oom-handler daemon out of the memcg's control by its limit.
>> (For example, a flag/capability for a task can achieve this.)
>> Or attaching some *fixed* resource to the task rather than the cgroup.
>>
>> Allow setting task->secret_saving=20M.
>>
>
> Exactly!
>
> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)
>
> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first. It ensures that this userspace oom handler can
> actually do some work in determining which process to kill. The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes. This does indeed
> become more difficult with true and complete kmem charging. That "work"
> could be opening the tasks file (which allocates the pidlist within the
> kernel), checking /proc/pid/status for rss, checking for how long a
> process has been running, checking for tid, sending a signal to drop
> caches, etc.
>
Considering only memcg, bypassing all charge-limit checks will work.
But, as you say, that will not work against global oom.
Then, in-kernel scripting was discussed.
> We'd also like to do this for global oom conditions, which makes it even
> more interesting. I was thinking of using a fraction of memory reserves
> as the oom killer currently does (that memory below the min watermark) for
> these purposes.
>
> Memory charging is simply bypassed for these oom handlers (we only grant
> access to those waiting on the memory.oom_control eventfd) up to
> memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think
> this is entirely insane because these oom handlers should lead to future
> memory freeing, just like TIF_MEMDIE processes.
>
I think that kind of bypassing is acceptable.
>> Going back to your patch, what's confusing is your approach.
>> Why should a problem caused by an amount of memory be solved by some
>> delay, i.e. an amount of time?
>>
>> This exchange sounds confusing to me.
>>
>
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs. I probably made
> a mistake in describing what that is addressing if it seems like it's
> trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves",
How does this happen?
> the kernel needs to kill within that memcg. Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted? That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace and
> I don't think we should ever allow it to be disabled. I think we should
> allow userspace a reasonable amount of time to respond and then kill if it
> is exceeded.
>
> For the global oom case, we want to have a priority-based memcg selection.
> Select the lowest priority top-level memcg and kill within it. If it has
> an oom notifier, send it a signal to kill something. If it fails to
> react, kill something after memory.oom_delay_millisecs has elapsed. If
> there isn't a userspace oom notifier, kill something within that lowest
> priority memcg.
>
Someone may be against that kind of control and say "Hey, I have a
better idea". That was another reason oom-scripting was discussed. No
one can implement general-purpose victim-selection logic.
> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely. That's why I hate
> memory.oom_control == 1 and I think for the global notification it would
> be deemed a nonstarter since you couldn't even login to the machine.
>
>> I'm not against what you finally want to do, but I don't like the fix.
>>
>
> I'm thrilled to hear that, and I hope we can work to make userspace oom
> handling more effective.
>
> What do you think about that above?
IMHO, it will be difficult, but allowing users to write a script/filter
for oom-killing will be worth trying, like:
==
for_each_process:
    if comm == mem_manage_daemon:
        continue
    if user == root:
        continue
    score = default_calc_score()
    if score > high_score:
        high_score = score
        selected = current
==
BTW, if you love the logic in the userland oom daemon, why can't you
implement it in the kernel? Does it do some pretty things other than
sending SIGKILL?
Thanks,
-Kame
On Tue, 25 Jun 2013, Kamezawa Hiroyuki wrote:
> Considering only memcg, bypassing all charge-limit checks will work.
> But, as you say, that will not work against global oom.
I think it will since we have per-zone memory reserves that can be
bypassed in the page allocator, not to the level of PF_MEMALLOC or
TIF_MEMDIE but perhaps to the min watermark / 4, for example. A userspace
global oom handler will obviously be mlocked in memory, and this reserve
is used only for true kmem accounting so that reading things like the
memcg tasks file or /proc/pid/stat works, or dynamically allocating a
buffer to store data to iterate over. This is why
memory.oom_delay_millisecs is crucial: we want the same functionality
that the "user root" has for global oom conditions at the memcg root
level, and in case reserves are exhausted, the kernel will kill
something (which should be rare, but possible) and the victim can use
the rest of the memory reserves to exit.
> > Even with all of the above (which is not actually that invasive of a
> > patch), I still think we need memory.oom_delay_millisecs. I probably made
> > a mistake in describing what that is addressing if it seems like it's
> > trying to address any of the above.
> >
> > If a userspace oom handler fails to respond even with access to those
> > "memcg reserves",
>
> How does this happen?
>
If the memcg reserves are exhausted, then the kernel needs to kill
something even in global oom conditions (consider a "user root" memcg tree
to be the same as a global oom condition for processes attached to that
tree) since otherwise the machine hangs. There's no guarantee that some
root process sitting in the root memcg would be able to enforce this delay
as Michal suggests since reserves could be depleted. It's important we
don't do something as extreme as PF_MEMALLOC, which would deplete all
per-zone reserves, so that the kernel can still intervene and kill
something when userspace is unresponsive.
> Someone may be against that kind of control and say "Hey, I have a
> better idea". That was another reason oom-scripting was discussed. No
> one can implement general-purpose victim-selection logic.
>
Completely agreed, and our experience shows that users who manipulate
their own "user root" memcgs have their own logic, this is why we're
trying to make userspace oom handling as powerful as possible without
risking making the machine unresponsive.
> IMHO, it will be difficult, but allowing users to write a script/filter
> for oom-killing will be worth trying, like:
>
> ==
> for_each_process:
>     if comm == mem_manage_daemon:
>         continue
>     if user == root:
>         continue
>     score = default_calc_score()
>     if score > high_score:
>         high_score = score
>         selected = current
> ==
>
This is effectively what already happens with the oom delay as proposed
here: the userspace oom handler is given access to "memcg reserves" and
a period of time to respond; if that fails, then the kernel will kill
something the next time we try to charge to the memcg.
> BTW, if you love the logic in the userland oom daemon, why can't you
> implement it in the kernel? Does it do some pretty things other than
> sending SIGKILL?
>
Some do "pretty" things like collect stats and dump it before killing
something, but we also want oom handlers that don't do SIGKILL at all. An
example: we statically allocate hugepages at boot because we need a large
percentage of memory to be backed by hugepages for a certain class of
applications and it's only available at boot. We also have a userspace
that runs on these machines that is shared between hugepage machines and
non-hugepage machines. At times, this userspace becomes oom because the
remainder of available memory is allocated as hugetlb pages when in
reality they are unmapped and sitting in a free pool. In that case, our
userspace oom handler wants to free those hugetlb pages back to the kernel
down to a certain watermark and then opportunistically reallocate them to
the pool when memory usage on the system is lower due to spikes in the
userspace.
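The core of that handler boils down to something like the following
sketch (the target watermark is our own policy; error handling trimmed):

  #include <fcntl.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* return free hugetlb pages to the kernel by lowering the pool
   * size; only unmapped pages sitting in the free pool are released */
  static void shrink_hugepage_pool(long target_pages)
  {
          char buf[32];
          int fd = open("/proc/sys/vm/nr_hugepages", O_WRONLY);

          if (fd < 0)
                  return;
          snprintf(buf, sizeof(buf), "%ld\n", target_pages);
          write(fd, buf, strlen(buf));
          close(fd);
  }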
On Fri 14-06-13 03:12:52, David Rientjes wrote:
> On Fri, 14 Jun 2013, Kamezawa Hiroyuki wrote:
>
> > Reading your discussion, I think I understand your requirements.
> > The problem is that I don't think you took all the options into
> > account and found that the best way is this new oom_delay. IOW, I'm not
> > convinced oom-delay is the best way to handle your issue.
> >
>
> Ok, let's talk about it.
>
> > Your requirement is
> > - Allowing a userland oom-handler within a local memcg.
> >
>
> Another requirement:
>
> - Allow userland oom handler for global oom conditions.
>
> Hopefully that's hooked into memcg because the functionality is already
> there, we can simply duplicate all of the oom functionality that we'll be
> adding for the root memcg.
I understand this and that's why I mentioned enabling oom_control for
root memcg at LSF. I liked the idea at the time but the more I think
about it the more I am scared about all the consequences. Maybe my fear
is not justified, as in-kernel approaches (scripting or a module
registering its own handler) would be fragile in similar ways. I would
like to see this discussed before we go the memcg way.
I _do_ agree that we really _want_ a way to provide a custom oom
handler.
> > Considered straightforwardly, the answer should be
> > - Allowing the oom-handler daemon out of the memcg's control by its limit.
> > (For example, a flag/capability for a task can achieve this.)
> > Or attaching some *fixed* resource to the task rather than the cgroup.
> >
> > Allow setting task->secret_saving=20M.
> >
>
> Exactly!
I thought your users are untrusted so you cannot give them any
additional reserves. Or is this only about the root cgroup?
> First of all, thanks very much for taking an interest in our usecase and
> discussing it with us.
>
> I didn't propose what I referred to earlier in the thread as "memcg
> reserves" because I thought it was going to be a more difficult battle.
> The fact that you brought it up first actually makes me think it's less
> insane :)
Reserves for the oom handler don't sound like a bad idea, but they
assume that the handler is trusted.
> We do indeed want memcg reserves and I have patches to add it if you'd
> like to see that first. It ensures that this userspace oom handler can
> actually do some work in determining which process to kill. The reserve
> is a fraction of true memory reserves (the space below the per-zone min
> watermarks) which is dependent on min_free_kbytes. This does indeed
> become more difficult with true and complete kmem charging. That "work"
> could be opening the tasks file (which allocates the pidlist within the
> kernel),
I am not familiar with the details of why it has to allocate memory,
but I guess we need a way to read the tasks file without any in-kernel
allocations.
> checking /proc/pid/status for rss,
This doesn't require in-kernel allocation AFAICS.
> checking for how long a process has been running, checking for tid,
> sending a signal to drop caches, etc.
>
> We'd also like to do this for global oom conditions, which makes it even
> more interesting. I was thinking of using a fraction of memory reserves
> as the oom killer currently does (that memory below the min watermark) for
> these purposes.
>
> Memory charging is simply bypassed for these oom handlers (we only grant
> access to those waiting on the memory.oom_control eventfd) up to
> memory.limit_in_bytes + (min_free_kbytes / 4), for example. I don't think
> this is entirely insane because these oom handlers should lead to future
> memory freeing, just like TIF_MEMDIE processes.
>
> > Going back to your patch, what's confusing is your approach.
> > Why should a problem caused by an amount of memory be solved by some
> > delay, i.e. an amount of time?
> >
> > This exchange sounds confusing to me.
> >
>
> Even with all of the above (which is not actually that invasive of a
> patch), I still think we need memory.oom_delay_millisecs.
I am still not convinced. I have already mentioned that this can be
handled from userspace.
A simple watchdog which sits on the oom_control eventfd, arms a timer
on each event (rescheduling it if one is already pending), and, when
the timer fires, reads oom_control to check under_oom before it
re-enables in-kernel oom handling doesn't need any memory allocation,
so there is nothing to prevent it from resurrecting the system. So I
really do not see a reason for a new knob.
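To be explicit about what I mean, a minimal sketch of the loop (one
group, a plain sleep() instead of a proper timer queue, registration
done elsewhere as described earlier, error handling aside):

  #include <fcntl.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <unistd.h>

  /* efd: eventfd already registered on <memcg>/memory.oom_control */
  static void watchdog_loop(const char *memcg, int efd, unsigned grace_s)
  {
          char path[256], buf[128];
          uint64_t events;
          ssize_t n;
          int fd;

          snprintf(path, sizeof(path), "%s/memory.oom_control", memcg);
          for (;;) {
                  read(efd, &events, sizeof(events)); /* blocks until oom */
                  sleep(grace_s);          /* userspace's grace period */

                  fd = open(path, O_RDONLY);
                  n = read(fd, buf, sizeof(buf) - 1);
                  close(fd);
                  if (n <= 0)
                          continue;
                  buf[n] = '\0';
                  if (strstr(buf, "under_oom 1")) {
                          /* still oom: hand back to the kernel */
                          fd = open(path, O_WRONLY);
                          write(fd, "0", 1); /* oom_kill_disable = 0 */
                          close(fd);
                  }
          }
  }

Nothing in that loop allocates user memory after startup, and the
watchdog process itself can be protected with mlockall() and
oom_score_adj.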
> I probably made a mistake in describing what that is addressing if it
> seems like it's trying to address any of the above.
>
> If a userspace oom handler fails to respond even with access to those
> "memcg reserves", the kernel needs to kill within that memcg. Do we do
> that above a set time period (this patch) or when the reserves are
> completely exhausted? That's debatable, but if we are to allow it for
> global oom conditions as well then my opinion was to make it as safe as
> possible; today, we can't disable the global oom killer from userspace and
> I don't think we should ever allow it to be disabled. I think we should
> allow userspace a reasonable amount of time to respond and then kill if it
> is exceeded.
I don't know. I like the way we can say "we are in control here" and
take all the consequences. The scheme you are proposing, "you have some
time and you'd better handle the situation within that time or we will
take over", sounds weird to me. We have traditionally allowed users to
shoot themselves in the foot, expecting they know what they are doing.
Why should we differ here?
> For the global oom case, we want to have a priority-based memcg selection.
> Select the lowest priority top-level memcg and kill within it.
Yes, that makes sense, as do zillions of other heuristics. E.g. killing
whole groups, as they might lose their point if one of their processes
is killed.
> If it has an oom notifier, send it a signal to kill something. If
> it fails to react, kill something after memory.oom_delay_millisecs
> has elapsed. If there isn't a userspace oom notifier, kill something
> within that lowest priority memcg.
>
> The bottomline with my approach is that I don't believe there is ever a
> reason for an oom memcg to remain oom indefinitely. That's why I hate
> memory.oom_control == 1 and I think for the global notification it would
> be deemed a nonstarter since you couldn't even login to the machine.
>
> > I'm not against what you finally want to do, but I don't like the fix.
> >
>
> I'm thrilled to hear that, and I hope we can work to make userspace oom
> handling more effective.
>
> What do you think about that above?
I think the idea about reserves is worth considering. This, however,
requires a stronger trust model for oom handlers. Maybe we can grant
reserves only to processes with some capability, dunno.
I also think that we should explore/discuss other options for custom oom
handling than hooking into memcg oom_control. I do see why oom_control
is a good fit for you (and I thought about it as well - it is really
ironic that I thought you would hate the idea when I mentioned that at
LSF) but once we allow that we will have to live with it forever, which
might turn out to be a problem, so better not to hurry there.
Finally, I still do not like the "you have this amount of time to
handle that or we will take over" approach, so I cannot support your
delay knob. If one wants to play with fire, one has to take all the
consequences, and unless you really convince me that this is not doable
in userspace, I will stick with this.
--
Michal Hocko
SUSE Labs