Since '1c0fe6e mm: invoke oom-killer from page fault', page fault
handlers should not directly kill faulting tasks in an out of memory
condition. Instead, they should be invoking the OOM killer to pick
the right task. Convert the remaining architectures.
Signed-off-by: Johannes Weiner <[email protected]>
---
arch/arc/mm/fault.c | 6 ++++--
arch/metag/mm/fault.c | 6 ++++--
arch/mn10300/mm/fault.c | 7 ++++---
arch/openrisc/mm/fault.c | 8 ++++----
arch/score/mm/fault.c | 8 ++++----
arch/tile/mm/fault.c | 8 ++++----
6 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index c0decc1..d5ec60a 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -207,8 +207,10 @@ out_of_memory:
}
up_read(&mm->mmap_sem);
- if (user_mode(regs))
- do_group_exit(SIGKILL); /* This will never return */
+ if (user_mode(regs)) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 2c75bf7..8fddf46 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -224,8 +224,10 @@ do_sigbus:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
+ if (user_mode(regs)) {
+ pagefault_out_of_memory();
+ return 1;
+ }
no_context:
/* Are we prepared to handle this kernel fault? */
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index d48a84f..8a2e6de 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -345,9 +345,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
- if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
- do_exit(SIGKILL);
+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;
do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index e2bfafc..4a41f84 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -267,10 +267,10 @@ out_of_memory:
__asm__ __volatile__("l.nop 1");
up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..6b18fb0 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -172,10 +172,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 3d2b81c..f7f99f9 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -573,10 +573,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- pr_alert("VM: killing process %s\n", tsk->comm);
- if (!is_kernel_mode)
- do_group_exit(SIGKILL);
- goto no_context;
+ if (is_kernel_mode)
+ goto no_context;
+ pagefault_out_of_memory();
+ return 0;
do_sigbus:
up_read(&mm->mmap_sem);
--
1.8.2.3
The memcg OOM handling is incredibly fragile because once a memcg goes
OOM, one task (kernel or userspace) is responsible for resolving the
situation. Every other task that gets caught trying to charge memory
gets stuck in a waitqueue while potentially holding various filesystem
and mm locks on which the OOM handling task may now deadlock.
Do two things:
1. When OOMing in a system call (buffered IO and friends), invoke the
OOM killer but just return -ENOMEM, never sleep. Userspace should
be able to handle this.
2. When OOMing in a page fault and somebody else is handling the
situation, do not sleep directly in the charging code. Instead,
remember the OOMing memcg in the task struct and then fully unwind
the page fault stack first before going to sleep.
While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.
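For illustration, a minimal userspace model of the control flow in points 1
and 2 above; the names in it (task_oom_state, charge(), oom_synchronize())
are made up for this sketch and are not the interfaces added by this patch:

#include <stdbool.h>
#include <stdio.h>

/* stand-in for the per-task state this patch adds to task_struct */
struct task_oom_state {
	bool in_userfault;	/* charge originated in a user page fault */
	bool pending_oom;	/* memcg OOM seen, not yet synchronized */
};

static struct task_oom_state current_task;

/* charge path: never sleeps; in a syscall it just fails, in a page
 * fault it additionally records the pending OOM state */
static int charge(void)
{
	if (current_task.in_userfault)
		current_task.pending_oom = true;
	return -1;		/* -ENOMEM */
}

/* runs only after the fault stack has fully unwound, i.e. no locks held */
static bool oom_synchronize(void)
{
	if (!current_task.pending_oom)
		return false;	/* not a memcg OOM, let the global OOM killer run */
	current_task.pending_oom = false;
	printf("sleep on the memcg OOM waitqueue here, then restart the fault\n");
	return true;
}

int main(void)
{
	current_task.in_userfault = true;
	if (charge() < 0) {
		current_task.in_userfault = false;	/* fault stack unwinds */
		oom_synchronize();	/* pagefault_out_of_memory() analogue */
	}
	return 0;
}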
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 ++
mm/ksm.c | 2 +-
mm/memcontrol.c | 146 ++++++++++++++++++++++++++++-----------------
mm/memory.c | 40 +++++++++----
mm/oom_kill.c | 7 ++-
7 files changed, 154 insertions(+), 70 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c8b1412..8e0f900 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,6 +124,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 1;
+}
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 0;
+}
+bool mem_cgroup_oom_synchronize(void);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
@@ -343,6 +352,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b87681a..79ee304 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -176,6 +176,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08090e6..0659277 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,12 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ unsigned int in_userfault:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..9dff93b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -372,7 +372,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d169a8d..61d3449 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -298,6 +298,7 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;
atomic_t refcnt;
@@ -2305,6 +2306,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2316,56 +2318,103 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}
/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);
- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask, order);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this is a
+ * page fault and somebody else is handling the OOM already,
+ * we need to sleep on the OOM waitqueue for this memcg until
+ * the situation is resolved. Which can take some time
+ * because it might be handled by a userspace task.
+ *
+ * However, this is the charge context, which means that we
+ * may sit on a large call stack and hold various filesystem
+ * locks, the mmap_sem etc. and we don't want the OOM handler
+ * to deadlock on them while we sit here and wait. Store the
+ * current OOM context in the task_struct, then return
+ * -ENOMEM. At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check back
+ * with us by calling mem_cgroup_oom_synchronize(), possibly
+ * putting the task to sleep.
+ */
+ if (current->memcg_oom.in_userfault) {
+ current->memcg_oom.in_memcg_oom = 1;
+ /*
+ * Somebody else is handling the situation. Make sure
+ * no wakeups are missed between now and going to
+ * sleep at the end of the page fault.
+ */
+ if (!need_to_kill) {
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
+ }
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (need_to_kill)
+ mem_cgroup_out_of_memory(memcg, mask, order);
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ spin_unlock(&memcg_oom_lock);
+ }
+}
- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}
@@ -2678,12 +2727,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2740,14 +2788,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}
/*
@@ -2850,7 +2894,7 @@ again:
}
do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;
/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2858,14 +2902,8 @@ again:
goto bypass;
}
- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2878,16 +2916,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);
diff --git a/mm/memory.c b/mm/memory.c
index 2210b21..05f307b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1819,7 +1819,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
while (!(page = follow_page_mask(vma, start,
foll_flags, &page_mask))) {
int ret;
- unsigned int fault_flags = 0;
+ unsigned int fault_flags = FAULT_FLAG_KERNEL;
/* For mlock, just skip the stack guard page. */
if (foll_flags & FOLL_MLOCK) {
@@ -1947,6 +1947,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || address < vma->vm_start)
return -EFAULT;
+ fault_flags |= FAULT_FLAG_KERNEL;
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -3764,22 +3765,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
@@ -3860,6 +3853,31 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (in_userfault)
+ mem_cgroup_set_userfault(current);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (in_userfault)
+ mem_cgroup_clear_userfault(current);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e451a..0c9f836 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;
+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.2.3
On Wed, 5 Jun 2013, Johannes Weiner wrote:
> Since '1c0fe6e mm: invoke oom-killer from page fault', page fault
> handlers should not directly kill faulting tasks in an out of memory
> condition.
I have no objection to the patch, but there's no explanation given here
why exiting with a kill shouldn't be done. Is it because of memory
reserves and there is no guarantee that current will be able to exit? Or
is it just for consistency with other archs?
> Instead, they should be invoking the OOM killer to pick
> the right task. Convert the remaining architectures.
>
If this is a matter of memory reserves, I guess you could point people who
want the current behavior (avoiding the expensiveness of the tasklist scan
in the oom killer for example) to /proc/sys/vm/oom_kill_allocating_task?
This changelog is a bit cryptic in its motivation.
On Wed, 5 Jun 2013, Johannes Weiner wrote:
> The memcg OOM handling is incredibly fragile because once a memcg goes
> OOM, one task (kernel or userspace) is responsible for resolving the
> situation.
Not sure what this means. Are you referring to the case where the memcg
is disabled from oom killing and we're relying on a userspace handler and
it may be caught on the waitqueue itself? Otherwise, are you referring to
the task as eventually being the only one that takes the hierarchy oom
lock and calls mem_cgroup_out_of_memory()? I don't think you can make a
case for the latter since the one that calls mem_cgroup_out_of_memory()
should return and call memcg_wakeup_oom(). We obviously don't want to do
any memory allocations in this path.
> Every other task that gets caught trying to charge memory
> gets stuck in a waitqueue while potentially holding various filesystem
> and mm locks on which the OOM handling task may now deadlock.
>
What locks? The oom killer quite extensively needs task_lock() but we
shouldn't be calling it in the case where we hold this since it's a
spinlock and mem_cgroup_do_charge() never gets to the point of handling
the oom.
Again, are you referring only to a userspace handler here?
> Do two things:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the
> OOM killer but just return -ENOMEM, never sleep. Userspace should
> be able to handle this.
>
The sleep should never occur for any significant duration currently: the
current implementation ensures one process calls
mem_cgroup_out_of_memory() while all others sit on a waitqueue until that
process returns from the oom killer; then they all wake up again so the
killed process may exit and all others can retry their allocations now
that something has been killed. If that process hasn't exited yet, the
next process that locks the memcg oom hierarchy will see the killed
process with TIF_MEMDIE and the oom killer is deferred. So this sleep is
very temporary already; I don't see why you're trying to make it faster
while another thread finds a candidate task, it works quite well already.
> 2. When OOMing in a page fault and somebody else is handling the
> situation, do not sleep directly in the charging code. Instead,
> remember the OOMing memcg in the task struct and then fully unwind
> the page fault stack first before going to sleep.
>
Are you trying to address a situation here where the memcg oom handler
takes too long to work? We've quite extensively tried to improve that,
especially by reducing its dependency on tasklist_lock which is contended
from the writeside in the exit path and by only iterating processes that
are attached to that memcg hierarchy and not all processes on the system
like before. I don't see the issue with scheduling other oom processes
while one is doing mem_cgroup_out_of_memory(). (I would if the oom killer
required things like mm->mmap_sem, but obviously it doesn't for reasons
not related to memcg.)
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
>
It's not needless at all, it's vitally required! The oom killed process
needs to be removed from the waitqueue and scheduled now with TIF_MEMDIE
that the memcg oom killer provided so the allocation succeeds in the page
allocator and memcg bypasses the charge so it can exit.
Exactly what problem are you trying to address with this patch? I don't
see any description of the user-visible effects or a specific example of
the scenario you're trying to address here.
On Wed, Jun 05, 2013 at 08:57:44PM -0700, David Rientjes wrote:
> On Wed, 5 Jun 2013, Johannes Weiner wrote:
>
> > Since '1c0fe6e mm: invoke oom-killer from page fault', page fault
> > handlers should not directly kill faulting tasks in an out of memory
> > condition.
>
> I have no objection to the patch, but there's no explanation given here
> why exiting with a kill shouldn't be done. Is it because of memory
> reserves and there is no guarantee that current will be able to exit? Or
> is it just for consistency with other archs?
>
> > Instead, they should be invoking the OOM killer to pick
> > the right task. Convert the remaining architectures.
> >
>
> If this is a matter of memory reserves, I guess you could point people who
> want the current behavior (avoiding the expensiveness of the tasklist scan
> in the oom killer for example) to /proc/sys/vm/oom_kill_allocating_task?
>
> This changelog is a bit cryptic in its motivation.
Fixing copy-pasted bitrot^W^W^W^WHow about this?
---
From: Johannes Weiner <[email protected]>
Subject: [patch] mm: invoke oom-killer from remaining unconverted page fault
handlers
A few remaining architectures directly kill the page faulting task in
an out of memory situation. This is usually not a good idea since
that task might not even use a significant amount of memory and so may
not be the optimal victim to resolve the situation.
Since '1c0fe6e mm: invoke oom-killer from page fault' (2.6.29) there
is a hook that architecture page fault handlers are supposed to call
to invoke the OOM killer and let it pick the right task to kill.
Convert the remaining architectures over to this hook.
To have the previous behavior of simply taking out the faulting task
the vm.oom_kill_allocating_task sysctl can be set to 1.
Signed-off-by: Johannes Weiner <[email protected]>
---
arch/arc/mm/fault.c | 6 ++++--
arch/metag/mm/fault.c | 6 ++++--
arch/mn10300/mm/fault.c | 7 ++++---
arch/openrisc/mm/fault.c | 8 ++++----
arch/score/mm/fault.c | 8 ++++----
arch/tile/mm/fault.c | 8 ++++----
6 files changed, 24 insertions(+), 19 deletions(-)
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index c0decc1..d5ec60a 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -207,8 +207,10 @@ out_of_memory:
}
up_read(&mm->mmap_sem);
- if (user_mode(regs))
- do_group_exit(SIGKILL); /* This will never return */
+ if (user_mode(regs)) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;
diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 2c75bf7..8fddf46 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -224,8 +224,10 @@ do_sigbus:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
+ if (user_mode(regs)) {
+ pagefault_out_of_memory();
+ return 1;
+ }
no_context:
/* Are we prepared to handle this kernel fault? */
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index d48a84f..8a2e6de 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -345,9 +345,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
- if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
- do_exit(SIGKILL);
+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;
do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index e2bfafc..4a41f84 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -267,10 +267,10 @@ out_of_memory:
__asm__ __volatile__("l.nop 1");
up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..6b18fb0 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -172,10 +172,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;
do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 3d2b81c..f7f99f9 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -573,10 +573,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- pr_alert("VM: killing process %s\n", tsk->comm);
- if (!is_kernel_mode)
- do_group_exit(SIGKILL);
- goto no_context;
+ if (is_kernel_mode)
+ goto no_context;
+ pagefault_out_of_memory();
+ return 0;
do_sigbus:
up_read(&mm->mmap_sem);
--
1.8.2.3
On Thu, 6 Jun 2013, Johannes Weiner wrote:
> From: Johannes Weiner <[email protected]>
> Subject: [patch] mm: invoke oom-killer from remaining unconverted page fault
> handlers
>
> A few remaining architectures directly kill the page faulting task in
> an out of memory situation. This is usually not a good idea since
> that task might not even use a significant amount of memory and so may
> not be the optimal victim to resolve the situation.
>
> Since '1c0fe6e mm: invoke oom-killer from page fault' (2.6.29) there
> is a hook that architecture page fault handlers are supposed to call
> to invoke the OOM killer and let it pick the right task to kill.
> Convert the remaining architectures over to this hook.
>
> To have the previous behavior of simply taking out the faulting task
> the vm.oom_kill_allocating_task sysctl can be set to 1.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Wed, Jun 05, 2013 at 09:10:51PM -0700, David Rientjes wrote:
> On Wed, 5 Jun 2013, Johannes Weiner wrote:
>
> > The memcg OOM handling is incredibly fragile because once a memcg goes
> > OOM, one task (kernel or userspace) is responsible for resolving the
> > situation.
>
> Not sure what this means. Are you referring to the case where the memcg
> is disabled from oom killing and we're relying on a userspace handler and
> it may be caught on the waitqueue itself? Otherwise, are you referring to
> the task as eventually being the only one that takes the hierarchy oom
> lock and calls mem_cgroup_out_of_memory()? I don't think you can make a
> case for the latter since the one that calls mem_cgroup_out_of_memory()
> should return and call memcg_wakeup_oom(). We obviously don't want to do
> any memory allocations in this path.
If the killing task or one of the sleeping tasks is holding a lock
that the selected victim needs in order to exit, no progress can be
made.
The report we had a few months ago was that a task held the i_mutex
when trying to charge a page cache page and then invoked the OOM
handler and looped on CHARGE_RETRY. Meanwhile, the selected victim
was just entering truncate() and got stuck waiting for the i_mutex.
I'll add this scenario to the changelog, hopefully it will make the
rest a little clearer.
> > Every other task that gets caught trying to charge memory
> > gets stuck in a waitqueue while potentially holding various filesystem
> > and mm locks on which the OOM handling task may now deadlock.
> >
>
> What locks? The oom killer quite extensively needs task_lock() but we
> shouldn't be calling it in the case where we hold this since it's a
> spinlock and mem_cgroup_do_charge() never gets to the point of handling
> the oom.
>
> Again, are you referring only to a userspace handler here?
No. The OOM path (it does not matter whether a user task or the kernel
handles it) is currently entered from the charge context, which may hold
filesystem and mm locks (consider who charges page cache pages, for
example), and those are not released until the situation is resolved,
because the task either loops inside __mem_cgroup_try_charge() on
CHARGE_RETRY or is stuck in the waitqueue.
And waiting for anybody else to make progress while holding mmap_sem,
i_mutex etc. is the problem.
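To make the lock inversion concrete, here is a small userspace analogy
(pthreads standing in for i_mutex and the OOM waitqueue). This is not
kernel code, it only has the shape of the deadlock: the charging thread
sleeps while holding the lock that the victim needs before it could ever
signal progress, so the program below never terminates.

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t i_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t oom_resolved = PTHREAD_COND_INITIALIZER;
static int resolved;

/* think: write() charging a page cache page under the i_mutex */
static void *charger(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&i_mutex);
	/* charge fails; sleep until the OOM situation is "resolved" */
	pthread_mutex_lock(&oom_lock);
	while (!resolved)
		pthread_cond_wait(&oom_resolved, &oom_lock);
	pthread_mutex_unlock(&oom_lock);
	pthread_mutex_unlock(&i_mutex);
	return NULL;
}

/* think: the OOM victim entering truncate() on its way out */
static void *victim(void *arg)
{
	(void)arg;
	pthread_mutex_lock(&i_mutex);		/* never acquired */
	pthread_mutex_lock(&oom_lock);
	resolved = 1;				/* never reached */
	pthread_cond_broadcast(&oom_resolved);
	pthread_mutex_unlock(&oom_lock);
	pthread_mutex_unlock(&i_mutex);
	return NULL;
}

int main(void)
{
	pthread_t a, b;

	pthread_create(&a, NULL, charger, NULL);
	sleep(1);	/* let the charger take the i_mutex first */
	pthread_create(&b, NULL, victim, NULL);
	pthread_join(a, NULL);
	pthread_join(b, NULL);
	puts("never reached");
	return 0;
}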
> > Do two things:
> >
> > 1. When OOMing in a system call (buffered IO and friends), invoke the
> > OOM killer but just return -ENOMEM, never sleep. Userspace should
> > be able to handle this.
> >
>
> The sleep should never occur for any significant duration currently: the
> current implementation ensures one process calls
> mem_cgroup_out_of_memory() while all others sit on a waitqueue until that
> process returns from the oom killer; then they all wake up again so the
> killed process may exit and all others can retry their allocations now
> that something has been killed. If that process hasn't exited yet, the
> next process that locks the memcg oom hierarchy will see the killed
> process with TIF_MEMDIE and the oom killer is deferred. So this sleep is
> very temporary already; I don't see why you're trying to make it faster
> while another thread finds a candidate task, it works quite well already.
It's not the amount of time slept on average, it's that going to sleep
in this context may deadlock the killing or the killed task. I don't
see where you read "making it faster", but I'm trying to make it just
slightly faster than a deadlock.
> > 2. When OOMing in a page fault and somebody else is handling the
> > situation, do not sleep directly in the charging code. Instead,
> > remember the OOMing memcg in the task struct and then fully unwind
> > the page fault stack first before going to sleep.
> >
>
> Are you trying to address a situation here where the memcg oom handler
> takes too long to work? We've quite extensively tried to improve that,
> especially by reducing its dependency on tasklist_lock which is contended
> from the writeside in the exit path and by only iterating processes that
> are attached to that memcg hierarchy and not all processes on the system
> like before. I don't see the issue with scheduling other oom processes
> while one is doing mem_cgroup_out_of_memory(). (I would if the oom killer
> required things like mm->mmap_sem, but obviously it doesn't for reasons
> not related to memcg.)
I really don't see where you read "performance optimization" in this
changelog; the very first paragraph mentions "very fragile" and "may
deadlock".
> > While reworking the OOM routine, also remove a needless OOM waitqueue
> > wakeup when invoking the killer. Only uncharges and limit increases,
> > things that actually change the memory situation, should do wakeups.
> >
>
> It's not needless at all, it's vitally required! The oom killed process
> needs to be removed from the waitqueue and scheduled now with TIF_MEMDIE
> that the memcg oom killer provided so the allocation succeeds in the page
> allocator and memcg bypasses the charge so it can exit.
The OOM killer sets TIF_MEMDIE and then sends a fatal signal to the
victim. That reliably wakes it up, doesn't it?
> Exactly what problem are you trying to address with this patch? I don't
> see any description of the user-visible effects or a specific example of
> the scenario you're trying to address here.
I'll add the example scenarios.
Thanks!
On 06/06/2013 10:06 AM, Johannes Weiner wrote:
> From: Johannes Weiner <[email protected]>
> Subject: [patch] mm: invoke oom-killer from remaining unconverted page fault
> handlers
>
> A few remaining architectures directly kill the page faulting task in
> an out of memory situation. This is usually not a good idea since
> that task might not even use a significant amount of memory and so may
> not be the optimal victim to resolve the situation.
>
> Since '1c0fe6e mm: invoke oom-killer from page fault' (2.6.29) there
> is a hook that architecture page fault handlers are supposed to call
> to invoke the OOM killer and let it pick the right task to kill.
> Convert the remaining architectures over to this hook.
>
> To have the previous behavior of simply taking out the faulting task
> the vm.oom_kill_allocating_task sysctl can be set to 1.
>
> Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Vineet Gupta <[email protected]> # for arch/arc bits
On Thu 06-06-13 00:36:20, Johannes Weiner wrote:
> On Wed, Jun 05, 2013 at 08:57:44PM -0700, David Rientjes wrote:
> > On Wed, 5 Jun 2013, Johannes Weiner wrote:
> >
> > > Since '1c0fe6e mm: invoke oom-killer from page fault', page fault
> > > handlers should not directly kill faulting tasks in an out of memory
> > > condition.
> >
> > I have no objection to the patch, but there's no explanation given here
> > why exiting with a kill shouldn't be done. Is it because of memory
> > reserves and there is no guarantee that current will be able to exit? Or
> > is it just for consistency with other archs?
> >
> > > Instead, they should be invoking the OOM killer to pick
> > > the right task. Convert the remaining architectures.
> > >
> >
> > If this is a matter of memory reserves, I guess you could point people who
> > want the current behavior (avoiding the expensiveness of the tasklist scan
> > in the oom killer for example) to /proc/sys/vm/oom_kill_allocating_task?
> >
> > This changelog is a bit cryptic in its motivation.
>
> Fixing copy-pasted bitrot^W^W^W^WHow about this?
>
> ---
> From: Johannes Weiner <[email protected]>
> Subject: [patch] mm: invoke oom-killer from remaining unconverted page fault
> handlers
>
> A few remaining architectures directly kill the page faulting task in
> an out of memory situation. This is usually not a good idea since
> that task might not even use a significant amount of memory and so may
> not be the optimal victim to resolve the situation.
>
> Since '1c0fe6e mm: invoke oom-killer from page fault' (2.6.29) there
> is a hook that architecture page fault handlers are supposed to call
> to invoke the OOM killer and let it pick the right task to kill.
> Convert the remaining architectures over to this hook.
>
> To have the previous behavior of simply taking out the faulting task
> the vm.oom_kill_allocating_task sysctl can be set to 1.
>
> Signed-off-by: Johannes Weiner <[email protected]>
It was much easier than I thought.
Reviewed-by: Michal Hocko <[email protected]>
> ---
> arch/arc/mm/fault.c | 6 ++++--
> arch/metag/mm/fault.c | 6 ++++--
> arch/mn10300/mm/fault.c | 7 ++++---
> arch/openrisc/mm/fault.c | 8 ++++----
> arch/score/mm/fault.c | 8 ++++----
> arch/tile/mm/fault.c | 8 ++++----
> 6 files changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index c0decc1..d5ec60a 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -207,8 +207,10 @@ out_of_memory:
> }
> up_read(&mm->mmap_sem);
>
> - if (user_mode(regs))
> - do_group_exit(SIGKILL); /* This will never return */
> + if (user_mode(regs)) {
> + pagefault_out_of_memory();
> + return;
> + }
>
> goto no_context;
>
> diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
> index 2c75bf7..8fddf46 100644
> --- a/arch/metag/mm/fault.c
> +++ b/arch/metag/mm/fault.c
> @@ -224,8 +224,10 @@ do_sigbus:
> */
> out_of_memory:
> up_read(&mm->mmap_sem);
> - if (user_mode(regs))
> - do_group_exit(SIGKILL);
> + if (user_mode(regs)) {
> + pagefault_out_of_memory();
> + return 1;
> + }
>
> no_context:
> /* Are we prepared to handle this kernel fault? */
> diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
> index d48a84f..8a2e6de 100644
> --- a/arch/mn10300/mm/fault.c
> +++ b/arch/mn10300/mm/fault.c
> @@ -345,9 +345,10 @@ no_context:
> */
> out_of_memory:
> up_read(&mm->mmap_sem);
> - printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
> - if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
> - do_exit(SIGKILL);
> + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
> + pagefault_out_of_memory();
> + return;
> + }
> goto no_context;
>
> do_sigbus:
> diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
> index e2bfafc..4a41f84 100644
> --- a/arch/openrisc/mm/fault.c
> +++ b/arch/openrisc/mm/fault.c
> @@ -267,10 +267,10 @@ out_of_memory:
> __asm__ __volatile__("l.nop 1");
>
> up_read(&mm->mmap_sem);
> - printk("VM: killing process %s\n", tsk->comm);
> - if (user_mode(regs))
> - do_exit(SIGKILL);
> - goto no_context;
> + if (!user_mode(regs))
> + goto no_context;
> + pagefault_out_of_memory();
> + return;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
> index 47b600e..6b18fb0 100644
> --- a/arch/score/mm/fault.c
> +++ b/arch/score/mm/fault.c
> @@ -172,10 +172,10 @@ out_of_memory:
> down_read(&mm->mmap_sem);
> goto survive;
> }
> - printk("VM: killing process %s\n", tsk->comm);
> - if (user_mode(regs))
> - do_group_exit(SIGKILL);
> - goto no_context;
> + if (!user_mode(regs))
> + goto no_context;
> + pagefault_out_of_memory();
> + return;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
> index 3d2b81c..f7f99f9 100644
> --- a/arch/tile/mm/fault.c
> +++ b/arch/tile/mm/fault.c
> @@ -573,10 +573,10 @@ out_of_memory:
> down_read(&mm->mmap_sem);
> goto survive;
> }
> - pr_alert("VM: killing process %s\n", tsk->comm);
> - if (!is_kernel_mode)
> - do_group_exit(SIGKILL);
> - goto no_context;
> + if (is_kernel_mode)
> + goto no_context;
> + pagefault_out_of_memory();
> + return 0;
>
> do_sigbus:
> up_read(&mm->mmap_sem);
> --
> 1.8.2.3
>
--
Michal Hocko
SUSE Labs
On Wed 05-06-13 21:10:51, David Rientjes wrote:
> On Wed, 5 Jun 2013, Johannes Weiner wrote:
[...]
> > While reworking the OOM routine, also remove a needless OOM waitqueue
> > wakeup when invoking the killer. Only uncharges and limit increases,
> > things that actually change the memory situation, should do wakeups.
> >
>
> It's not needless at all, it's vitally required! The oom killed process
> needs to be removed from the waitqueue and scheduled now with TIF_MEMDIE
> that the memcg oom killer provided so the allocation succeeds in the page
> allocator and memcg bypasses the charge so it can exit.
The tasks are waiting in TASK_KILLABLE state, so the killed task gets
woken up and the bypass happens. Calling memcg_wakeup_oom is actually
wrong here because it wakes all tasks up even though there is no reason
for that. No charges have been released yet, so another retry loop could
be pointless. We need to be patient and wait for a wakeup from the
uncharge path.
> Exactly what problem are you trying to address with this patch? I don't
> see any description of the user-visible effects or a specific xample of
> the scenario you're trying to address here.
Maybe I am biased because I've tried to handle the same problem some time
ago, but the changelog clearly says that memcg oom handling is fragile and
deadlock prone because of locks that are held while memory is charged and
the oom handled: oom targets might not get killed because they are stuck
on the same lock, which will not get released until the charge succeeds.
It addresses the problem by moving oom handling outside of any locks,
which solves this category of deadlocks. I agree that the changelog could
be better (well, each one can be). It could use some examples (e.g. the
i_mutex case we have seen a few months ago, or a simple unkillable brk
which is hanging on mmap_sem for writing while a page fault is handled
and memcg oom triggered).
--
Michal Hocko
SUSE Labs
On Thu, Jun 06, 2013 at 01:33:15AM -0400, Johannes Weiner wrote:
> On Wed, Jun 05, 2013 at 09:10:51PM -0700, David Rientjes wrote:
> > On Wed, 5 Jun 2013, Johannes Weiner wrote:
> >
> > > The memcg OOM handling is incredibly fragile because once a memcg goes
> > > OOM, one task (kernel or userspace) is responsible for resolving the
> > > situation.
> >
> > Not sure what this means. Are you referring to the case where the memcg
> > is disabled from oom killing and we're relying on a userspace handler and
> > it may be caught on the waitqueue itself? Otherwise, are you referring to
> > the task as eventually being the only one that takes the hierarchy oom
> > lock and calls mem_cgroup_out_of_memory()? I don't think you can make a
> > case for the latter since the one that calls mem_cgroup_out_of_memory()
> > should return and call memcg_wakeup_oom(). We obviously don't want to do
> > any memory allocations in this path.
>
> If the killing task or one of the sleeping tasks is holding a lock
> that the selected victim needs in order to exit, no progress can be
> made.
>
> The report we had a few months ago was that a task held the i_mutex
> when trying to charge a page cache page and then invoked the OOM
> handler and looped on CHARGE_RETRY. Meanwhile, the selected victim
> was just entering truncate() and got stuck waiting for the i_mutex.
>
> I'll add this scenario to the changelog, hopefully it will make the
> rest a little clearer.
David, is the updated patch below easier to understand?
---
From: Johannes Weiner <[email protected]>
Subject: [patch] memcg: do not trap chargers with full callstack on OOM
The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.
For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:
OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff
OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff
The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.
A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.
This patch changes the way tasks behave after detecting an OOM and
makes sure nobody loops or sleeps on OOM with locks held:
1. When OOMing in a system call (buffered IO and friends), invoke the
   OOM killer but just return -ENOMEM, never sleep on an OOM waitqueue.
   Userspace should be able to handle this (see the sketch after this
   list) and it prevents anybody from looping or waiting with locks held.
2. When OOMing in a page fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.
3. When detecting an OOM in a page fault but somebody else is handling
it (either the kernel OOM killer or a userspace handler), don't go
to sleep in the charge context. Instead, remember the OOMing memcg
in the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.
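As a purely hypothetical userspace illustration of point 1 (the kernel
side is in the patch below): a program doing buffered IO simply has to be
prepared for write() failing with ENOMEM, e.g. by backing off and
retrying:

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* hypothetical example only: back off and retry a buffered write if the
 * charge fails with -ENOMEM instead of the task sleeping in the kernel */
static ssize_t write_enomem_retry(int fd, const void *buf, size_t len)
{
	int tries;

	for (tries = 0; tries < 3; tries++) {
		ssize_t ret = write(fd, buf, len);

		if (ret >= 0 || errno != ENOMEM)
			return ret;
		sleep(1);	/* wait for charges to be released */
	}
	errno = ENOMEM;
	return -1;
}

int main(void)
{
	const char msg[] = "hello\n";

	if (write_enomem_retry(STDOUT_FILENO, msg, sizeof(msg) - 1) < 0)
		perror("write");
	return 0;
}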
While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.
Reported-by: azurIt <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Reported-by: David Rientjes <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 ++
mm/ksm.c | 2 +-
mm/memcontrol.c | 152 ++++++++++++++++++++++++++++-----------------
mm/memory.c | 40 ++++++++----
mm/oom_kill.c | 7 ++-
7 files changed, 160 insertions(+), 70 deletions(-)
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index c8b1412..8e0f900 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -124,6 +124,15 @@ unsigned long mem_cgroup_get_lru_size(struct lruvec *lruvec, enum lru_list);
void mem_cgroup_update_lru_size(struct lruvec *, enum lru_list, int);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 1;
+}
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 0;
+}
+bool mem_cgroup_oom_synchronize(void);
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);
@@ -343,6 +352,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}
+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_begin_update_page_stat(struct page *page,
bool *locked, unsigned long *flags)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index b87681a..79ee304 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -176,6 +176,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */
/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 08090e6..0659277 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1403,6 +1403,12 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ unsigned int in_userfault:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/ksm.c b/mm/ksm.c
index b6afe0c..9dff93b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -372,7 +372,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index d169a8d..3c9dc93 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -298,6 +298,7 @@ struct mem_cgroup {
bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;
atomic_t refcnt;
@@ -2305,6 +2306,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,
static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2316,56 +2318,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}
/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);
- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask, order);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this is a
+ * page fault and somebody else is handling the OOM already,
+ * we need to sleep on the OOM waitqueue for this memcg until
+ * the situation is resolved. Which can take some time
+ * because it might be handled by a userspace task.
+ *
+ * However, this is the charge context, which means that we
+ * may sit on a large call stack and hold various filesystem
+ * locks, the mmap_sem etc. and we don't want the OOM handler
+ * to deadlock on them while we sit here and wait. Store the
+ * current OOM context in the task_struct, then return
+ * -ENOMEM. At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check back
+ * with us by calling mem_cgroup_oom_synchronize(), possibly
+ * putting the task to sleep.
+ */
+ if (current->memcg_oom.in_userfault) {
+ current->memcg_oom.in_memcg_oom = 1;
+ /*
+ * Somebody else is handling the situation. Make sure
+ * no wakeups are missed between now and going to
+ * sleep at the end of the page fault.
+ */
+ if (!need_to_kill) {
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
+ }
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (need_to_kill)
+ mem_cgroup_out_of_memory(memcg, mask, order);
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ mem_cgroup_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}
- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;
- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}
@@ -2678,12 +2733,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};
static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2740,14 +2794,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;
- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));
- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}
/*
@@ -2850,7 +2900,7 @@ again:
}
do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;
/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2858,14 +2908,8 @@ again:
goto bypass;
}
- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2878,16 +2922,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);
diff --git a/mm/memory.c b/mm/memory.c
index 2210b21..05f307b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1819,7 +1819,7 @@ long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
while (!(page = follow_page_mask(vma, start,
foll_flags, &page_mask))) {
int ret;
- unsigned int fault_flags = 0;
+ unsigned int fault_flags = FAULT_FLAG_KERNEL;
/* For mlock, just skip the stack guard page. */
if (foll_flags & FOLL_MLOCK) {
@@ -1947,6 +1947,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || address < vma->vm_start)
return -EFAULT;
+ fault_flags |= FAULT_FLAG_KERNEL;
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -3764,22 +3765,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);
@@ -3860,6 +3853,31 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}
+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (in_userfault)
+ mem_cgroup_set_userfault(current);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (in_userfault)
+ mem_cgroup_clear_userfault(current);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 79e451a..0c9f836 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;
+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.3
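For reference, here is a condensed sketch of the control flow the patch
establishes, pieced together from the hunks above. It is illustration
only: the *_sketch() helpers are hypothetical stand-ins, not functions in
the tree, and error handling is omitted.

	/* Charge path: invoke the OOM killer if needed, but never sleep here. */
	static int charge_slowpath_sketch(struct mem_cgroup *memcg, gfp_t gfp_mask,
					  bool invoke_oom)
	{
		if (reclaim_and_retry_sketch(memcg, gfp_mask))	/* hypothetical */
			return CHARGE_OK;
		if (invoke_oom)
			/* kills a task, or records the memcg in current->memcg_oom */
			mem_cgroup_oom(memcg, gfp_mask, 0);
		return CHARGE_NOMEM;	/* caller unwinds, dropping fs/mm locks */
	}

	/* End of the page fault: all locks are dropped, so sleeping is safe. */
	void pagefault_oom_sketch(void)
	{
		if (mem_cgroup_oom_synchronize())
			return;	/* waited for (or raced with) a memcg OOM; fault is retried */
		/* otherwise fall through to the global OOM killer as before */
	}

The point is that nothing between the charge attempt and
pagefault_out_of_memory() sleeps on the memcg OOM waitqueue anymore, so a
selected victim can always acquire the locks it needs to exit.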
On Thu, 6 Jun 2013, Johannes Weiner wrote:
> > If the killing task or one of the sleeping tasks is holding a lock
> > that the selected victim needs in order to exit no progress can be
> > made.
> >
> > The report we had a few months ago was that a task held the i_mutex
> > when trying to charge a page cache page and then invoked the OOM
> > handler and looped on CHARGE_RETRY. Meanwhile, the selected victim
> > was just entering truncate() and now stuck waiting for the i_mutex.
> >
> > I'll add this scenario to the changelog, hopefully it will make the
> > rest a little clearer.
>
> David, is the updated patch below easier to understand?
>
I don't understand why memcg is unique in this regard and it doesn't
affect the page allocator as well on system oom conditions. Ignoring
memcg, all allocating processes will loop forever in the page allocator,
waiting for memory to become available, unless there are atypical gfp
flags; only one will call the oom killer at a time, a process is selected
and killed, and the oom killer defers until that process exits because it
finds TIF_MEMDIE. Why is memcg charging any different?
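Schematically, the global behaviour described above looks roughly like
the following; the *_sketch() helpers are hypothetical stand-ins for the
real allocator slowpath, not mainline functions:

	struct page *alloc_slowpath_sketch(gfp_t gfp_mask, unsigned int order)
	{
		struct page *page;

		for (;;) {
			page = reclaim_and_alloc_sketch(gfp_mask, order); /* hypothetical */
			if (page)
				return page;

			if (!(gfp_mask & __GFP_FS) || (gfp_mask & __GFP_NORETRY))
				return NULL;		/* the "atypical gfp flags" case */

			if (oom_trylock_sketch()) {	/* one OOM invocation at a time */
				/* defer while a TIF_MEMDIE victim is still exiting */
				if (!oom_victim_exiting_sketch())
					out_of_memory(NULL, gfp_mask, order, NULL, false);
				oom_unlock_sketch();
			}
			/* everyone else keeps looping, waiting for memory */
		}
	}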
> From: Johannes Weiner <[email protected]>
> Subject: [patch] memcg: do not trap chargers with full callstack on OOM
>
> The memcg OOM handling is incredibly fragile and can deadlock. When a
> task fails to charge memory, it invokes the OOM killer and loops right
> there in the charge code until it succeeds. Comparably, any other
> task that enters the charge path at this point will go to a waitqueue
> right then and there and sleep until the OOM situation is resolved.
> The problem is that these tasks may hold filesystem locks and the
> mmap_sem; locks that the selected OOM victim may need to exit.
>
> For example, in one reported case, the task invoking the OOM killer
> was about to charge a page cache page during a write(), which holds
> the i_mutex. The OOM killer selected a task that was just entering
> truncate() and trying to acquire the i_mutex:
>
> OOM invoking task:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> OOM kill victim:
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The OOM handling task will retry the charge indefinitely while the OOM
> killed task is not releasing any resources.
>
> A similar scenario can happen when the kernel OOM killer for a memcg
> is disabled and a userspace task is in charge of resolving OOM
> situations. In this case, ALL tasks that enter the OOM path will be
> made to sleep on the OOM waitqueue and wait for userspace to free
> resources or increase the group's limit. But a userspace OOM handler
> is prone to deadlock itself on the locks held by the waiting tasks.
> For example one of the sleeping tasks may be stuck in a brk() call
> with the mmap_sem held for writing but the userspace handler, in order
> to pick an optimal victim, may need to read files from /proc/<pid>,
> which tries to acquire the same mmap_sem for reading and deadlocks.
>
> This patch changes the way tasks behave after detecting an OOM and
> makes sure nobody loops or sleeps on OOM with locks held:
>
> 1. When OOMing in a system call (buffered IO and friends), invoke the
> OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue.
> Userspace should be able to handle this and it prevents anybody
> from looping or waiting with locks held.
>
> 2. When OOMing in a page fault, invoke the OOM killer and restart the
> fault instead of looping on the charge attempt. This way, the OOM
> victim can not get stuck on locks the looping task may hold.
>
> 3. When detecting an OOM in a page fault but somebody else is handling
> it (either the kernel OOM killer or a userspace handler), don't go
> to sleep in the charge context. Instead, remember the OOMing memcg
> in the task struct and then fully unwind the page fault stack with
> -ENOMEM. pagefault_out_of_memory() will then call back into the
> memcg code to check if the -ENOMEM came from the memcg, and then
> either put the task to sleep on the memcg's OOM waitqueue or just
> restart the fault. The OOM victim can no longer get stuck on any
> lock a sleeping task may hold.
>
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. Only uncharges and limit increases,
> things that actually change the memory situation, should do wakeups.
>
> Reported-by: azurIt <[email protected]>
> Debugged-by: Michal Hocko <[email protected]>
> Reported-by: David Rientjes <[email protected]>
What exactly did I report? This isn't at all what
memory.oom_delay_millisecs is about, which is a failure of userspace to
respond to the condition and react in time, not because it's stuck on any
lock. We still need that addition regardless of what you're doing here.
On Thu, Jun 06, 2013 at 01:11:32PM -0700, David Rientjes wrote:
> On Thu, 6 Jun 2013, Johannes Weiner wrote:
>
> > > If the killing task or one of the sleeping tasks is holding a lock
> > > that the selected victim needs in order to exit no progress can be
> > > made.
> > >
> > > The report we had a few months ago was that a task held the i_mutex
> > > when trying to charge a page cache page and then invoked the OOM
> > > handler and looped on CHARGE_RETRY. Meanwhile, the selected victim
> > > was just entering truncate() and now stuck waiting for the i_mutex.
> > >
> > > I'll add this scenario to the changelog, hopefully it will make the
> > > rest a little clearer.
> >
> > David, is the updated patch below easier to understand?
> >
>
> I don't understand why memcg is unique in this regard and it doesn't
> affect the page allocator as well on system oom conditions. Ignoring
> memcg, all allocating processes will loop forever in the page allocator,
> waiting for memory to become available, unless there are atypical gfp
> flags; only one will call the oom killer at a time, a process is selected
> and killed, and the oom killer defers until that process exits because it
> finds TIF_MEMDIE. Why is memcg charging any different?
The allocator wakes up kswapd, global OOMs are rarer, with physical
memory the line to OOM is blurrier than with the memcg hard limit?
Anyway, I'm not aware of bug reports in the global case, but there are
bug reports for the memcg case and we have a decent understanding of
those deadlocks. So can we stay focussed and fix this, please?
> > Reported-by: azurIt <[email protected]>
> > Debugged-by: Michal Hocko <[email protected]>
> > Reported-by: David Rientjes <[email protected]>
>
> What exactly did I report? This isn't at all what
> memory.oom_delay_millisecs is about, which is a failure of userspace to
> respond to the condition and react in time, not because it's stuck on any
> lock. We still need that addition regardless of what you're doing here.
Oh, tell me how getting stuck indefinitely on a lock will not result
in "a failure to react in time". This is some seriously misguided
pedantry.
And yes, you talked about deadlocking potential other than the handler
itself OOMing, I quote from
<[email protected]>:
"Unresponsiveness isn't necessarily only because of memory
constraints, you may have your oom notifier in a parent cgroup that
isn't oom. If a process is stuck on mm->mmap_sem in the oom cgroup,
though, the oom notifier may not be able to scrape /proc/pid and
attain necessary information in making an oom kill decision."
These are your words, and my patch sets out to fix the described
problem, so I figured the Reported-by: might be appropriate. But I
can remove it if you like.
On Thu, 6 Jun 2013, Johannes Weiner wrote:
> > I don't understand why memcg is unique in this regard and it doesn't
> > affect the page allocator as well on system oom conditions. Ignoring
> > memcg, all allocating processes will loop forever in the page allocator,
> > waiting for memory to become available, unless there are atypical gfp
> > flags; only one will call the oom killer at a time, a process is selected
> > and killed, and the oom killer defers until that process exits because it
> > finds TIF_MEMDIE. Why is memcg charging any different?
>
> The allocator wakes up kswapd, global OOMs are rarer, with physical
> memory the line to OOM is blurrier than with the memcg hard limit?
>
> Anyway, I'm not aware of bug reports in the global case, but there are
> bug reports for the memcg case and we have a decent understanding of
> those deadlocks. So can we stay focussed and fix this, please?
>
Could you point me to those bug reports? As far as I know, we have never
encountered them so it would be surprising to me that we're running with a
potential landmine and have seemingly never hit it.
Perhaps the answer is solely the stacktraces in your changelog, so I'd be
happy to review that separately from my patch.
> > > Reported-by: azurIt <[email protected]>
> > > Debugged-by: Michal Hocko <[email protected]>
> > > Reported-by: David Rientjes <[email protected]>
> >
> > What exactly did I report? This isn't at all what
> > memory.oom_delay_millisecs is about, which is a failure of userspace to
> > respond to the condition and react in time, not because it's stuck on any
> > lock. We still need that addition regardless of what you're doing here.
>
> Oh, tell me how getting stuck indefinitely on a lock will not result
> in "a failure to react in time". This is some seriously misguided
> pedantry.
>
It certainly would, but it's not the point that memory.oom_delay_millisecs
was intended to address. memory.oom_delay_millisecs would simply delay
calling mem_cgroup_out_of_memory(), only falling back to it if userspace
can't free memory or increase the memory limit in time. Obviously that
delay isn't going to magically address any lock dependency issues.
> And yes, you talked about deadlocking potential other than the handler
> itself OOMing, I quote from
> <[email protected]>:
>
> "Unresponsiveness isn't necessarily only because of memory
> constraints, you may have your oom notifier in a parent cgroup that
> isn't oom. If a process is stuck on mm->mmap_sem in the oom cgroup,
> though, the oom notifier may not be able to scrape /proc/pid and
> attain necessary information in making an oom kill decision."
>
> These are your words, and my patch sets out to fix the described
> problem,
I can review this patch apart from memory.oom_delay_millisecs using the
examples in your changelog, but this isn't the problem statement for my
patch. The paragraph above is describing one way that an oom handler may
encounter issues, it's not the only way and it's not a way that we have
ever faced on our production servers with memcg. I just didn't think the
above was me reporting a bug, perhaps you took it that way.
The point I'm trying to make is that your patch doesn't reduce our need
for memory.oom_delay_millisecs as described in the thread for that patch.
On Thu, Jun 06, 2013 at 03:18:37PM -0700, David Rientjes wrote:
> On Thu, 6 Jun 2013, Johannes Weiner wrote:
>
> > > I don't understand why memcg is unique in this regard and it doesn't
> > > affect the page allocator as well on system oom conditions. Ignoring
> > > memcg, all allocating processes will loop forever in the page allocator,
> > > waiting for memory to become available, unless there are atypical gfp
> > > flags; only one will call the oom killer at a time, a process is selected
> > > and killed, and the oom killer defers until that process exits because it
> > > finds TIF_MEMDIE. Why is memcg charging any different?
> >
> > The allocator wakes up kswapd, global OOMs are rarer, with physical
> > memory the line to OOM is blurrier than with the memcg hard limit?
> >
> > Anyway, I'm not aware of bug reports in the global case, but there are
> > bug reports for the memcg case and we have a decent understanding of
> > those deadlocks. So can we stay focussed and fix this, please?
> >
>
> Could you point me to those bug reports? As far as I know, we have never
> encountered them so it would be surprising to me that we're running with a
> potential landmine and have seemingly never hit it.
Sure thing: https://lkml.org/lkml/2012/11/21/497
During that thread Michal pinned down the problem to i_mutex being
held by the OOM invoking task, which the selected victim is trying to
acquire.
> > > > Reported-by: azurIt <[email protected]>
> > > > Debugged-by: Michal Hocko <[email protected]>
> > > > Reported-by: David Rientjes <[email protected]>
> > >
> > > What exactly did I report? This isn't at all what
> > > memory.oom_delay_millisecs is about, which is a failure of userspace to
> > > respond to the condition and react in time, not because it's stuck on any
> > > lock. We still need that addition regardless of what you're doing here.
> >
> > Oh, tell me how getting stuck indefinitely on a lock will not result
> > in "a failure to react in time". This is some seriously misguided
> > pedantry.
> >
>
> It certainly would, but it's not the point that memory.oom_delay_millisecs
> was intended to address. memory.oom_delay_millisecs would simply delay
> calling mem_cgroup_out_of_memory(), only falling back to it if userspace
> can't free memory or increase the memory limit in time. Obviously that
> delay isn't going to magically address any lock dependency issues.
The delayed fallback would certainly resolve the issue of the
userspace handler getting stuck, be it due to memory shortness or due
to locks.
However, it would not solve the part of the problem where the OOM
killing kernel task is holding locks that the victim requires to exit.
We are definitely looking at multiple related issues, that's why I'm
trying to fix them step by step.
> > And yes, you talked about deadlocking potential other than the handler
> > itself OOMing, I quote from
> > <[email protected]>:
> >
> > "Unresponsiveness isn't necessarily only because of memory
> > constraints, you may have your oom notifier in a parent cgroup that
> > isn't oom. If a process is stuck on mm->mmap_sem in the oom cgroup,
> > though, the oom notifier may not be able to scrape /proc/pid and
> > attain necessary information in making an oom kill decision."
> >
> > These are your words, and my patch sets out to fix the described
> > problem,
>
> I can review this patch apart from memory.oom_delay_millisecs using the
> examples in your changelog, but this isn't the problem statement for my
> patch. The paragraph above is describing one way that an oom handler may
> encounter issues, it's not the only way and it's not a way that we have
> ever faced on our production servers with memcg. I just didn't think the
> above was me reporting a bug, perhaps you took it that way.
Please do consider this fix individually. It's good to know that you
didn't run into this particular issue on your machines so far, but
since you described the problem you must have arrived at the same
conclusion by just reading the code, which was good enough for me.
Again, I can just remove your Reported-by: if you don't think it's
justified.
> The point I'm trying to make is that your patch doesn't reduce our need
> for memory.oom_delay_millisecs as described in the thread for that patch.
It does not. But it does fix a problem that came up during the
discussion and it does fix a problem that you may hit at a random
point in time regardless of the memory.oom_delay_millisecs patch.
I'm sorry for the confusion this created, but yes, it's a separate
albeit related issue.
On Thu, 6 Jun 2013, Johannes Weiner wrote:
> > Could you point me to those bug reports? As far as I know, we have never
> > encountered them so it would be surprising to me that we're running with a
> > potential landmine and have seemingly never hit it.
>
> Sure thing: https://lkml.org/lkml/2012/11/21/497
>
Ok, I think I read most of it, although the lkml.org interface makes it
easy to miss some.
> During that thread Michal pinned down the problem to i_mutex being
> held by the OOM invoking task, which the selected victim is trying to
> acquire.
>
> > > > > Reported-by: azurIt <[email protected]>
Ok, so the key here is that azurIt was able to reliably reproduce this
issue and now it has been resurrected after seven months of silence since
that thread. I also notice that azurIt isn't cc'd on this thread. Do we
know if this is still a problem?
We certainly haven't run into any memcg deadlocks like this.
> > It certainly would, but it's not the point that memory.oom_delay_millisecs
> > was intended to address. memory.oom_delay_millisecs would simply delay
> > calling mem_cgroup_out_of_memory(), only falling back to it if userspace
> > can't free memory or increase the memory limit in time. Obviously that
> > delay isn't going to magically address any lock dependency issues.
>
> The delayed fallback would certainly resolve the issue of the
> userspace handler getting stuck, be it due to memory shortness or due
> to locks.
>
> However, it would not solve the part of the problem where the OOM
> killing kernel task is holding locks that the victim requires to exit.
>
Right.
> We are definitely looking at multiple related issues, that's why I'm
> trying to fix them step by step.
>
I guess my question is why this would be addressed now when nobody has
reported it recently on any recent kernel and then not cc the person who
reported it?
Can anybody, even with an instrumented kernel to make it more probable,
reproduce the issue this is addressing?
On Tue 11-06-13 14:57:08, David Rientjes wrote:
[...]
> > > > > > Reported-by: azurIt <[email protected]>
>
> Ok, so the key here is that azurIt was able to reliably reproduce this
> issue and now it has been resurrected after seven months of silence since
> that thread. I also notice that azurIt isn't cc'd on this thread. Do we
> know if this is still a problem?
I have backported the patch for his 3.2 kernel and am waiting for his feedback.
[...]
--
Michal Hocko
SUSE Labs
On Wed, 12 Jun 2013, Michal Hocko wrote:
> > > > > > > Reported-by: azurIt <[email protected]>
> >
> > Ok, so the key here is that azurIt was able to reliably reproduce this
> > issue and now it has been resurrected after seven months of silence since
> > that thread. I also notice that azurIt isn't cc'd on this thread. Do we
> > know if this is still a problem?
>
> I have backported the patch for his 3.2 kernel and am waiting for his feedback.
>
Ok, thanks. I thought this only went back seven months to when it was
reported; I missed that the issue this patch is trying to address goes
back 1 1/2 years to 3.2 and nobody else has reported it. I think his
feedback would be the key, specifically if he can upgrade to a later
kernel first.
On Wed 12-06-13 13:12:09, David Rientjes wrote:
> On Wed, 12 Jun 2013, Michal Hocko wrote:
>
> > > > > > > > Reported-by: azurIt <[email protected]>
> > >
> > > Ok, so the key here is that azurIt was able to reliably reproduce this
> > > issue and now it has been resurrected after seven months of silence since
> > > that thread. I also notice that azurIt isn't cc'd on this thread. Do we
> > > know if this is still a problem?
> >
> > I have backported the patch for his 3.2 kernel and am waiting for his feedback.
> >
>
> Ok, thanks. I thought this only went back seven months to when it was
> reported; I missed that the issue this patch is trying to address goes
> back 1 1/2 years to 3.2 and nobody else has reported it. I think his
> feedback would be the key, specifically if he can upgrade to a later
> kernel first.
The patch is a big improvement with minimal code overhead. Blocking
any task which sits on top of an unpredictable number of locks is just
broken. So regardless of how many users are affected we should merge it
and backport it to stable trees. The problem has been there forever. We
seem to have been surprisingly lucky not to hit it more often.
I am not quite sure I understand your reservation about the patch, to be
honest. Andrew still hasn't merged this one although 1/2 is in.
--
Michal Hocko
SUSE Labs
On Wed, 12 Jun 2013, Michal Hocko wrote:
> The patch is a big improvement with minimal code overhead. Blocking
> any task which sits on top of an unpredictable number of locks is just
> broken. So regardless of how many users are affected we should merge it
> and backport it to stable trees. The problem has been there forever. We
> seem to have been surprisingly lucky not to hit it more often.
>
Right now it appears that that number of users is 0 and we're talking
about a problem that was reported in 3.2 that was released a year and a
half ago. The rules of inclusion in stable also prohibit such a change
from being backported, specifically "It must fix a real bug that bothers
people (not a, "This could be a problem..." type thing)".
We have deployed memcg on a very large number of machines and I can run a
query over all software watchdog timeouts that have occurred by
deadlocking on i_mutex during memcg oom. It returns 0 results.
> I am not quite sure I understand your reservation about the patch to be
> honest. Andrew still hasn't merged this one although 1/2 is in.
Perhaps he is as unconvinced? The patch adds 100 lines of code, including
fields to task_struct for memcg, for a problem that nobody can reproduce.
My question still stands: can anybody, even with an instrumented kernel to
make it more probable, reproduce the issue this is addressing?
On Wed 12-06-13 13:49:47, David Rientjes wrote:
> On Wed, 12 Jun 2013, Michal Hocko wrote:
>
> > The patch is a big improvement with minimal code overhead. Blocking
> > any task which sits on top of an unpredictable number of locks is just
> > broken. So regardless of how many users are affected we should merge it
> > and backport it to stable trees. The problem has been there forever. We
> > seem to have been surprisingly lucky not to hit it more often.
> >
>
> Right now it appears that that number of users is 0 and we're talking
> about a problem that was reported in 3.2 that was released a year and a
> half ago. The rules of inclusion in stable also prohibit such a change
> from being backported, specifically "It must fix a real bug that bothers
> people (not a, "This could be a problem..." type thing)".
As you can see there is a user seeing this in 3.2. The bug is _real_ and
I do not see what you are objecting to. Do you really think that
sitting on a time bomb is preferable?
> We have deployed memcg on a very large number of machines and I can run a
> query over all software watchdog timeouts that have occurred by
> deadlocking on i_mutex during memcg oom. It returns 0 results.
Do you capture /proc/<pid>/stack for each of them to find out whether your
deadlock (and you have reported that they happen) was in fact caused by
a locking issue? These kinds of deadlocks might go unnoticed, especially
when the oom is handled by userspace by increasing the limit (my memcg
is stuck and increasing the limit a bit always helped).
> > I am not quite sure I understand your reservation about the patch to be
> > honest. Andrew still hasn't merged this one although 1/2 is in.
>
> Perhaps he is as unconvinced? The patch adds 100 lines of code, including
> fields to task_struct for memcg, for a problem that nobody can reproduce.
> My question still stands: can anybody, even with an instrumented kernel to
> make it more probable, reproduce the issue this is addressing?
So the referenced discussion is not sufficient?
--
Michal Hocko
SUSE Labs
On Thu, 13 Jun 2013, Michal Hocko wrote:
> > Right now it appears that that number of users is 0 and we're talking
> > about a problem that was reported in 3.2 that was released a year and a
> > half ago. The rules of inclusion in stable also prohibit such a change
> > from being backported, specifically "It must fix a real bug that bothers
> > people (not a, "This could be a problem..." type thing)".
>
> > As you can see there is a user seeing this in 3.2. The bug is _real_ and
> > I do not see what you are objecting to. Do you really think that
> > sitting on a time bomb is preferable?
>
Nobody has reported the problem in seven months. You're patching a kernel
that's 18 months old. Your "user" hasn't even bothered to respond to your
backport. This isn't a timebomb.
> > We have deployed memcg on a very large number of machines and I can run a
> > query over all software watchdog timeouts that have occurred by
> > deadlocking on i_mutex during memcg oom. It returns 0 results.
>
> Do you capture /proc/<pid>/stack for each of them to find out whether your
> deadlock (and you have reported that they happen) was in fact caused by
> a locking issue? These kinds of deadlocks might go unnoticed, especially
> when the oom is handled by userspace by increasing the limit (my memcg
> is stuck and increasing the limit a bit always helped).
>
We dump stack traces for every thread on the system to the kernel log for
a software watchdog timeout and capture it over the network for searching
later. We have not experienced any deadlock that even remotely resembles
the stack traces in the changelog. We do not reproduce this issue.
On Thu 13-06-13 13:34:46, David Rientjes wrote:
> On Thu, 13 Jun 2013, Michal Hocko wrote:
>
> > > Right now it appears that that number of users is 0 and we're talking
> > > about a problem that was reported in 3.2 that was released a year and a
> > > half ago. The rules of inclusion in stable also prohibit such a change
> > > from being backported, specifically "It must fix a real bug that bothers
> > > people (not a, "This could be a problem..." type thing)".
> >
> > As you can see there is a user seeing this in 3.2. The bug is _real_ and
> > I do not see what you are objecting to. Do you really think that
> > sitting on a time bomb is preferable?
> >
>
> Nobody has reported the problem in seven months. You're patching a kernel
> that's 18 months old. Your "user" hasn't even bothered to respond to your
> backport.
>
> This isn't a timebomb.
Doh. This is getting ridiculous! So you are claiming that blocking on oom
while the task might be sitting on an unpredictable number of locks,
which could prevent oom victims from exiting, is OK? I would consider it
a _bug_ and I am definitely backporting the fix to our kernel, which is
3.0 based, whether it ends up in stable or not.
Whether this is general stable material I will leave for others (I would
be voting for it because it definitely makes sense). The problem is real
regardless of how many users suffer from it. The stable-or-not discussion
shouldn't delay the fix for the current tree though. Or do you disagree
with the patch itself?
> > > We have deployed memcg on a very large number of machines and I can run a
> > > query over all software watchdog timeouts that have occurred by
> > > deadlocking on i_mutex during memcg oom. It returns 0 results.
> >
> > Do you capture /proc/<pid>/stack for each of them to find out whether your
> > deadlock (and you have reported that they happen) was in fact caused by
> > a locking issue? These kinds of deadlocks might go unnoticed, especially
> > when the oom is handled by userspace by increasing the limit (my memcg
> > is stuck and increasing the limit a bit always helped).
> >
>
> We dump stack traces for every thread on the system to the kernel log for
> a software watchdog timeout and capture it over the network for searching
> later. We have not experienced any deadlock that even remotely resembles
> the stack traces in the changelog. We do not reproduce this issue.
OK. This could really rule it out for you. The analysis is not really
trivial because locks might be hidden nicely but having the data is
definitely useful.
--
Michal Hocko
SUSE Labs