2013-06-06 16:04:52

by Michal Hocko

Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

Hi,

I am really sorry it took so long but I was constantly preempted by
other stuff. I hope I have good news for you, though. Johannes has
found a nice way to overcome the deadlock issues arising from memcg OOM
which might help you. Would you be willing to test his patch
(http://permalink.gmane.org/gmane.linux.kernel.mm/101437)? Unlike my
patch, which handles just the i_mutex case, his patch addresses all
possible locks.

I can backport the patch for your kernel (are you still using the 3.2
kernel or have you moved to a newer one?).

On Fri 22-02-13 09:23:32, azurIt wrote:
> >Unfortunately I am not able to reproduce this behavior even if I try
> >to hammer OOM like mad so I am afraid I cannot help you much without
> >further debugging patches.
> >I do realize that experimenting in your environment is a problem but I
> >do not have many options left. Please do not use strace and rather
> >collect /proc/pid/stack instead. It would also be helpful to get the
> >group's tasks file to have a full list of tasks in the group
>
>
>
> Hi Michal,
>
>
> sorry that i didn't respond for a while. Today i installed a kernel with your two patches and i'm running it now. I'm still having problems with OOM, which is not able to handle low memory and is not killing processes. Here is some info:
>
> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM didn't stop and the cgroup was frozen)
> http://watchdog.sk/lkml/memcg-bug-6.tar.gz
>
> I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then i killed process 19864, which seemed to help; other processes probably finished and the cgroup started to work. But the problem occurred again about 20 seconds later, so i killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
>
>
> - kernel log from boot until now
> http://watchdog.sk/lkml/kern3.gz
>
>
> Btw, something probably also happened at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).
>
>
>
> azur

--
Michal Hocko
SUSE Labs


2013-06-06 16:23:05

by azurIt

Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set

Hello Michal,

nice to read you! :) Yes, i'm still on 3.2. Could you be so kind as to try to backport it? Thank you very much!

azur



______________________________________________________________
> From: "Michal Hocko" <[email protected]>
> To: azurIt <[email protected]>
> Date: 06.06.2013 18:04
> Subject: Re: [PATCH for 3.2.34] memcg: do not trigger OOM if PF_NO_MEMCG_OOM is set
>
> CC: [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, "Johannes Weiner" <[email protected]>
>Hi,
>
>I am really sorry it took so long but I was constantly preempted by
>other stuff. I hope I have good news for you, though. Johannes has
>found a nice way to overcome the deadlock issues arising from memcg OOM
>which might help you. Would you be willing to test his patch
>(http://permalink.gmane.org/gmane.linux.kernel.mm/101437)? Unlike my
>patch, which handles just the i_mutex case, his patch addresses all
>possible locks.
>
>I can backport the patch for your kernel (are you still using the 3.2
>kernel or have you moved to a newer one?).
>
>On Fri 22-02-13 09:23:32, azurIt wrote:
>> >Unfortunately I am not able to reproduce this behavior even if I try
>> >to hammer OOM like mad so I am afraid I cannot help you much without
>> >further debugging patches.
>> >I do realize that experimenting in your environment is a problem but I
>> >do not have many options left. Please do not use strace and rather
>> >collect /proc/pid/stack instead. It would also be helpful to get the
>> >group's tasks file to have a full list of tasks in the group
>>
>>
>>
>> Hi Michal,
>>
>>
>> sorry that i didn't respond for a while. Today i installed a kernel with your two patches and i'm running it now. I'm still having problems with OOM, which is not able to handle low memory and is not killing processes. Here is some info:
>>
>> - data from cgroup 1258 while it was under OOM and no processes were killed (so OOM didn't stop and the cgroup was frozen)
>> http://watchdog.sk/lkml/memcg-bug-6.tar.gz
>>
>> I noticed the problem at about 8:39 and waited until 8:57 (nothing happened). Then i killed process 19864, which seemed to help; other processes probably finished and the cgroup started to work. But the problem occurred again about 20 seconds later, so i killed all processes at 8:58. The problem has been occurring all the time since then. All processes (in that cgroup) are always in state 'D' when it occurs.
>>
>>
>> - kernel log from boot until now
>> http://watchdog.sk/lkml/kern3.gz
>>
>>
>> Btw, something probably also happened at about 3:09 but i wasn't able to gather any data because my 'load check script' killed all apache processes (load was more than 100).
>>
>>
>>
>> azur
>
>--
>Michal Hocko
>SUSE Labs
>

2013-06-07 13:12:04

by Michal Hocko

Subject: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Thu 06-06-13 18:16:33, azurIt wrote:
> Hello Michal,
>
> nice to read you! :) Yes, i'm still on 3.2. Could you be so kind as
> to try to backport it? Thank you very much!

Here we go. I hope I didn't screw anything up (Johannes might double check)
because there were quite a few changes in the area since 3.2. Nothing
earth-shattering though. Please note that I have only compile-tested
this. Also make sure you remove the previous patches you have from me.
---
From 9d2801c1f53147ca9134cc5f76ab28d505a37a54 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Fri, 7 Jun 2013 13:52:42 +0200
Subject: [PATCH] memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM handling task will retry the charge indefinitely while the OOM
killed task is not releasing any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.

This patch changes the way tasks behave after detecting an OOM and
makes sure nobody loops or sleeps on OOM with locks held:

1. When OOMing in a system call (buffered IO and friends), invoke the
OOM killer but just return -ENOMEM, never sleep on a OOM waitqueue.
Userspace should be able to handle this and it prevents anybody
from looping or waiting with locks held.

2. When OOMing in a page fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.

3. When detecting an OOM in a page fault but somebody else is handling
it (either the kernel OOM killer or a userspace handler), don't go
to sleep in the charge context. Instead, remember the OOMing memcg
in the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. Only uncharges and limit increases,
things that actually change the memory situation, should do wakeups.

Reported-by: azurIt <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Reported-by: David Rientjes <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 ++
mm/ksm.c | 2 +-
mm/memcontrol.c | 149 ++++++++++++++++++++++++++++----------------
mm/memory.c | 40 ++++++++----
mm/oom_kill.c | 2 +
7 files changed, 156 insertions(+), 66 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..56bfc39 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,15 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 1;
+}
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+ p->memcg_oom.in_userfault = 0;
+}
+bool mem_cgroup_oom_synchronize(void);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +342,19 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline void mem_cgroup_set_userfault(struct task_struct *p)
+{
+}
+
+static inline void mem_cgroup_clear_userfault(struct task_struct *p)
+{
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..91380ef 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_KERNEL 0x80 /* kernel-triggered fault (get_user_pages etc.) */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..d521a70 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,12 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ unsigned int in_userfault:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..3295a3b 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_KERNEL | FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..67189b4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -249,6 +249,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

atomic_t refcnt;

@@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -1857,55 +1859,109 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
-
- owait.mem = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ bool locked, need_to_kill = true;

/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
mem_cgroup_oom_notify(memcg);
spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
- mem_cgroup_out_of_memory(memcg, mask);
- } else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this is a
+ * page fault and somebody else is handling the OOM already,
+ * we need to sleep on the OOM waitqueue for this memcg until
+ * the situation is resolved. Which can take some time
+ * because it might be handled by a userspace task.
+ *
+ * However, this is the charge context, which means that we
+ * may sit on a large call stack and hold various filesystem
+ * locks, the mmap_sem etc. and we don't want the OOM handler
+ * to deadlock on them while we sit here and wait. Store the
+ * current OOM context in the task_struct, then return
+ * -ENOMEM. At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check back
+ * with us by calling mem_cgroup_oom_synchronize(), possibly
+ * putting the task to sleep.
+ */
+ if (current->memcg_oom.in_userfault) {
+ current->memcg_oom.in_memcg_oom = 1;
+ /*
+ * Somebody else is handling the situation. Make sure
+ * no wakeups are missed between now and going to
+ * sleep at the end of the page fault.
+ */
+ if (!need_to_kill) {
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups =
+ atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
+ }
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (need_to_kill)
+ mem_cgroup_out_of_memory(memcg, mask);
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ memcg_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}

- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.mem = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2195,11 +2251,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2312,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2349,7 +2400,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2408,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2421,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..bee177c 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1720,7 +1720,7 @@ int __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
cond_resched();
while (!(page = follow_page(vma, start, foll_flags))) {
int ret;
- unsigned int fault_flags = 0;
+ unsigned int fault_flags = FAULT_FLAG_KERNEL;

/* For mlock, just skip the stack guard page. */
if (foll_flags & FOLL_MLOCK) {
@@ -1842,6 +1842,7 @@ int fixup_user_fault(struct task_struct *tsk, struct mm_struct *mm,
if (!vma || address < vma->vm_start)
return -EFAULT;

+ fault_flags |= FAULT_FLAG_KERNEL;
ret = handle_mm_fault(mm, vma, address, fault_flags);
if (ret & VM_FAULT_ERROR) {
if (ret & VM_FAULT_OOM)
@@ -3439,22 +3440,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3496,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int in_userfault = !(flags & FAULT_FLAG_KERNEL);
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (in_userfault)
+ mem_cgroup_set_userfault(current);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (in_userfault)
+ mem_cgroup_clear_userfault(current);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize())
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
--
1.7.10.4

--
Michal Hocko
SUSE Labs

2013-06-17 10:21:39

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>Here we go. I hope I didn't screw anything up (Johannes might double check)
>because there were quite a few changes in the area since 3.2. Nothing
>earth-shattering though. Please note that I have only compile-tested
>this. Also make sure you remove the previous patches you have from me.


Hi Michal,

it, unfortunately, didn't work. Everything was working fine but the original problem is still occurring. I'm unable to send you stacks or more info because the problem has been taking down the whole server for some time now (i don't know what exactly caused it to start happening, maybe newer versions of 3.2.x). But i'm sure of one thing - when the problem occurs, nothing is able to access the hard drives (every process which tries is frozen until the problem is resolved or the server is rebooted). The problem is fixed after killing the processes from the cgroup which caused it, and everything immediately starts to work normally. I found this out by keeping a terminal open from another server to the one where my problem occurs quite often and running several apps there (htop, iotop, etc.). When the problem occurs, all apps which weren't working with the HDD were ok. htop proved to be very useful here because it only reads the proc filesystem and is also able to send KILL signals - i was able to resolve the problem with it without rebooting the server.

I created a special daemon (about a month ago) which is able to detect and fix the problem, so i'm not having server outages now. The point was to NOT access anything which is stored on the HDDs; the daemon only reads info from the cgroup filesystem and sends KILL signals to processes. Maybe i should also read the stack files before killing; i will try it.
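
The core of such a watchdog can be quite small - a minimal sketch in Python (the /sys/fs/cgroup/memory mount point, the flat group layout, the 5-second poll and the kill-everything policy are assumptions here, not the actual daemon):

#!/usr/bin/env python
# Sketch of an OOM watchdog: poll each memcg's memory.oom_control and
# SIGKILL the tasks of any group that reports under_oom. It reads only
# the cgroup filesystem, so it does not touch the stuck HDDs.
import glob, os, signal, time

MEMCG_ROOT = '/sys/fs/cgroup/memory'    # assumption: v1 memory controller mount

def under_oom(group):
    # memory.oom_control contains lines like "under_oom 1"
    with open(os.path.join(group, 'memory.oom_control')) as f:
        return 'under_oom 1' in f.read()

while True:
    for group in glob.glob(os.path.join(MEMCG_ROOT, '*/')):
        try:
            if not under_oom(group):
                continue
            pids = open(os.path.join(group, 'tasks')).read().split()
        except (IOError, OSError):
            continue                    # group went away meanwhile
        for pid in pids:
            try:
                os.kill(int(pid), signal.SIGKILL)
            except OSError:
                pass                    # task already exited
    time.sleep(5)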

Btw, which vanilla kernel includes this patch?

Thank you and everyone involved very much for time and help.

azur

2013-06-19 13:26:20

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Mon 17-06-13 12:21:34, azurIt wrote:
> >Here we go. I hope I didn't screw anything up (Johannes might double check)
> >because there were quite a few changes in the area since 3.2. Nothing
> >earth-shattering though. Please note that I have only compile-tested
> >this. Also make sure you remove the previous patches you have from me.
>
>
> Hi Michal,
>
> it, unfortunately, didn't work. Everything was working fine but
> the original problem is still occurring.

This would be more than surprising because tasks blocked at memcg OOM
don't hold any locks anymore. Maybe I have messed something up during
the backport but I cannot spot anything.

> I'm unable to send you stacks or more info because the problem has
> been taking down the whole server for some time now (i don't know what
> exactly caused it to start happening, maybe newer versions of 3.2.x).

So you are not testing with the same kernel with just the old patch
replaced by the new one?

> But i'm sure of one thing - when the problem occurs, nothing is able
> to access the hard drives (every process which tries is frozen until
> the problem is resolved or the server is rebooted).

It would be really interesting to see what those tasks are blocked on.

> The problem is fixed after killing the processes from the cgroup
> which caused it, and everything immediately starts to work normally.
> I found this out by keeping a terminal open from another server to
> the one where my problem occurs quite often and running several apps
> there (htop, iotop, etc.). When the problem occurs, all apps which
> weren't working with the HDD were ok. htop proved to be very useful
> here because it only reads the proc filesystem and is also able to
> send KILL signals - i was able to resolve the problem with it
> without rebooting the server.

sysrq+t will give you the list of all tasks and their traces.
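
For example (a sketch; it assumes sysrq is enabled via /proc/sys/kernel/sysrq - the traces end up in the kernel ring buffer, readable with dmesg):

# Trigger sysrq-t (dump all tasks with their stack traces) from a
# script. sysrq-w dumps only uninterruptible (D state) tasks, which
# may be even closer to what we are after here.
with open('/proc/sysrq-trigger', 'w') as f:
    f.write('t')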

> I created a special daemon (about month ago) which is able to detect
> and fix the problem so i'm not having server outages now. The point
> was to NOT access anything which is stored on HDDs, the daemon is
> only reading info from cgroup filesystem and sending KILL signals to
> processes. Maybe i should be able to also read stack files before
> killing, i will try it.
>
> Btw, which vanilla kernel includes this patch?

None yet. But I hope it will be merged into 3.11 and backported to the
stable trees.

> Thank you and everyone involved very much for time and help.
>
> azur

--
Michal Hocko
SUSE Labs

2013-06-22 20:16:05

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

Michal,



>> I'm unable to send you stacks or more info because the problem has
>> been taking down the whole server for some time now (i don't know what
>> exactly caused it to start happening, maybe newer versions of 3.2.x).
>
>So you are not testing with the same kernel with just the old patch
>replaced by the new one?


No, i'm not testing with the same kernel but all are 3.2.x. I cannot even install an older 3.2.x because grsecurity is only available for the newest kernel and there is no archive of older versions (at least i don't know of any).


>> But i'm sure of one thing - when the problem occurs, nothing is able
>> to access the hard drives (every process which tries is frozen until
>> the problem is resolved or the server is rebooted).
>
>It would be really interesting to see what those tasks are blocked on.


I'm trying to get it, stay tuned :)


Today i noticed one bug, not 100% sure it is related to 'your' patch but i hadn't seen this before. I noticed that i have lots of cgroups which cannot be removed - if i do 'rmdir <cgroup_directory>', it just hangs and never completes. Even more, it's not possible to access the whole cgroup filesystem until i kill that rmdir (anything which tries, just hangs). All unremovable cgroups have this in 'memory.oom_control':
oom_kill_disable 0
under_oom 1

And, yes, 'tasks' file is empty.
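
A quick way to spot groups stuck in this state is to scan the cgroup filesystem - a sketch (the /sys/fs/cgroup/memory mount point is an assumption; adjust to wherever the memory controller is mounted):

# List memcgs that report under_oom 1 but have an empty tasks file -
# the signature of the unremovable groups described above.
import glob, os

for group in glob.glob('/sys/fs/cgroup/memory/*/'):
    try:
        oom = open(os.path.join(group, 'memory.oom_control')).read()
        tasks = open(os.path.join(group, 'tasks')).read().split()
    except (IOError, OSError):
        continue
    if 'under_oom 1' in oom and not tasks:
        print('%s: under_oom with empty tasks file' % group)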

azur

2013-06-24 16:48:49

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>It would be really interesting to see what those tasks are blocked on.


Ok, i got it! The problem occurred two times and behaved differently each time; i was running a kernel with that latest patch.

1.) It doesn't have an impact on the whole server, only on one cgroup. Here are the stacks:
http://watchdog.sk/lkml/memcg-bug-7.tar.gz


2.) It almost took down the server because of huge I/O on the HDDs. Unfortunately, i had a bug in my script which was supposed to gather the stacks (i wasn't able to do it by hand like in (1), the server was almost inoperable). But i was lucky and somehow killed the processes from the problematic cgroup (via htop) and the server was ok again EXCEPT for one important thing - processes from that cgroup were still running in D state and i wasn't able to kill them for good. They were holding web server network ports so i had to reboot the server :( BUT, before that, i gathered the stacks:
http://watchdog.sk/lkml/memcg-bug-8.tar.gz
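
The gathering itself can be a short script like the following sketch - the actual script wasn't posted, so the invocation and the <timestamp>/<pid>/stack layout mirroring the tarballs above are assumptions:

# Snapshot /proc/<pid>/stack for every task in a memcg.
import os, sys, time

group = sys.argv[1]                     # e.g. /sys/fs/cgroup/memory/1258
stamp = str(int(time.time()))
for pid in open(os.path.join(group, 'tasks')).read().split():
    try:
        stack = open('/proc/%s/stack' % pid).read()
    except (IOError, OSError):
        continue                        # task exited in the meantime
    os.makedirs(os.path.join(stamp, pid))
    with open(os.path.join(stamp, pid, 'stack'), 'w') as f:
        f.write(stack)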

What do you think?

azur

2013-06-24 20:13:58

by Johannes Weiner

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

Hi guys,

On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> >> But i'm sure of one thing - when problem occurs, nothing is able to
> >> access hard drives (every process which tries it is freezed until
> >> problem is resolved or server is rebooted).
> >
> >It would be really interesting to see what those tasks are blocked on.
>
> I'm trying to get it, stay tuned :)
>
> Today i noticed one bug, not 100% sure it is related to 'your' patch
> but i hadn't seen this before. I noticed that i have lots of cgroups
> which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> just hangs and never completes. Even more, it's not possible to
> access the whole cgroup filesystem until i kill that rmdir
> (anything which tries, just hangs). All unremovable cgroups have
> this in 'memory.oom_control': oom_kill_disable 0 under_oom 1

Somebody acquires the OOM wait reference to the memcg and marks it
under oom but then does not call into mem_cgroup_oom_synchronize() to
clean up. That's why under_oom is set and the rmdir waits for
outstanding references.

> And, yes, 'tasks' file is empty.

It's not a kernel thread that does it because all kernel-context
handle_mm_fault() calls are annotated properly, which means the task
must be userspace and, since tasks is empty, must have exited before
synchronizing.

Can you try with the following patch on top?

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..9a0b152 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
- }
if (!(fault & VM_FAULT_ERROR))
return 0;

2013-06-28 10:06:17

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>It's not a kernel thread that does it because all kernel-context
>handle_mm_fault() calls are annotated properly, which means the task
>must be userspace and, since tasks is empty, must have exited before
>synchronizing.
>
>Can you try with the following patch on top?


Michal and Johannes,

i have some observations:
The original patch from Johannes was really fixing something but definitely not everything, and it was introducing new problems. I've been running an unpatched kernel since the time i sent my last message and problems with freezing cgroups are occurring very often (several times per day) - they were, on the other hand, quite rare with the patch from Johannes.

Johannes, i didn't try your last patch yet. I would like to wait until you or Michal look at my last message, which contained detailed information about the freezing of cgroups on a kernel running your original patch (which was supposed to fix it for good). Even more, i would like to hear your opinion about those stuck processes which were holding the web server port and which forced me to reboot a production server in the middle of the day :( more information is in my last message. Thank you very much for your time.

azur

2013-07-05 18:17:40

by Johannes Weiner

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

Hi azurIt,

On Fri, Jun 28, 2013 at 12:06:13PM +0200, azurIt wrote:
> >It's not a kernel thread that does it because all kernel-context
> >handle_mm_fault() calls are annotated properly, which means the task
> >must be userspace and, since tasks is empty, must have exited before
> >synchronizing.
> >
> >Can you try with the following patch on top?
>
>
> Michal and Johannes,
>
> i have some observations: the original patch from Johannes was really
> fixing something but definitely not everything, and it was introducing
> new problems. I've been running an unpatched kernel since the time i
> sent my last message and problems with freezing cgroups are occurring
> very often (several times per day) - they were, on the other hand,
> quite rare with the patch from Johannes.

That's good!

> Johannes, i didn't try your last patch yet. I would like to wait
> until you or Michal look at my last message, which contained detailed
> information about the freezing of cgroups on a kernel running your
> original patch (which was supposed to fix it for good). Even more, i
> would like to hear your opinion about those stuck processes which
> were holding the web server port and which forced me to reboot a
> production server in the middle of the day :( more information is in
> my last message. Thank you very much for your time.

I looked at your debug messages but could not find anything that would
hint at a deadlock. All tasks are stuck in the refrigerator, so I
assume you use the freezer cgroup and enabled it somehow?

Sorry about your production server locking up, but from the stacks I
don't see any connection to the OOM problems you were having... :/

2013-07-05 19:02:50

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>I looked at your debug messages but could not find anything that would
>hint at a deadlock. All tasks are stuck in the refrigerator, so I
>assume you use the freezer cgroup and enabled it somehow?


Yes, i'm really using the freezer cgroup BUT i was checking whether it was causing problems - unfortunately, several days have passed since that day and now i don't fully remember if i checked it for both cases (unremovable cgroups and those frozen processes holding the web server port). I'm 100% sure i checked it for the unremovable cgroups but not so sure for the other problem (i had to act quickly in that case). Are you sure (from the stacks) that the freezer cgroup was enabled there?

Btw, what about those other stacks? I mean this file:
http://watchdog.sk/lkml/memcg-bug-7.tar.gz

It was taken while running the kernel with your patch and from a cgroup which was under unresolvable OOM (just like my very original problem).

Thank you!


azur

2013-07-05 19:19:05

by Johannes Weiner

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
> >I looked at your debug messages but could not find anything that would
> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
> >assume you use the freezer cgroup and enabled it somehow?
>
>
> Yes, i'm really using the freezer cgroup BUT i was checking whether
> it was causing problems - unfortunately, several days have passed
> since that day and now i don't fully remember if i checked it for
> both cases (unremovable cgroups and those frozen processes holding
> the web server port). I'm 100% sure i checked it for the unremovable
> cgroups but not so sure for the other problem (i had to act quickly
> in that case). Are you sure (from the stacks) that the freezer
> cgroup was enabled there?

Yeah, all the traces without exception look like this:

1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff

so the freezer was already enabled when you took the backtraces.

> Btw, what about those other stacks? I mean this file:
> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>
> It was taken while running the kernel with your patch and from a
> cgroup which was under unresolvable OOM (just like my very original
> problem).

I looked at these traces too, but none of the tasks are stuck in rmdir
or the OOM path. Some /are/ in the page fault path, but they are
happily doing reclaim and don't appear to be stuck. So I'm having a
hard time matching this data to what you otherwise observed.

However, based on what you reported the most likely explanation for
the continued hangs is the unfinished OOM handling for which I sent
the followup patch for arch/x86/mm/fault.c.

2013-07-07 23:42:34

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

> CC: "Michal Hocko" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>
>On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >I looked at your debug messages but could not find anything that would
>> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
>> >assume you use the freezer cgroup and enabled it somehow?
>>
>>
>> Yes, i'm really using the freezer cgroup BUT i was checking whether
>> it was causing problems - unfortunately, several days have passed
>> since that day and now i don't fully remember if i checked it for
>> both cases (unremovable cgroups and those frozen processes holding
>> the web server port). I'm 100% sure i checked it for the unremovable
>> cgroups but not so sure for the other problem (i had to act quickly
>> in that case). Are you sure (from the stacks) that the freezer
>> cgroup was enabled there?
>
>Yeah, all the traces without exception look like this:
>
>1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>
>so the freezer was already enabled when you took the backtraces.
>
>> Btw, what about those other stacks? I mean this file:
>> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>>
>> It was taken while running the kernel with your patch and from a
>> cgroup which was under unresolvable OOM (just like my very original
>> problem).
>
>I looked at these traces too, but none of the tasks are stuck in rmdir
>or the OOM path. Some /are/ in the page fault path, but they are
>happily doing reclaim and don't appear to be stuck. So I'm having a
>hard time matching this data to what you otherwise observed.
>
>However, based on what you reported the most likely explanation for
>the continued hangs is the unfinished OOM handling for which I sent
>the followup patch for arch/x86/mm/fault.c.
>



Johannes,

today I tested both of your patches but the problem with unremovable cgroups, unfortunately, persists.

azur

2013-07-09 13:00:24

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> Hi guys,
>
> On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > >> But i'm sure of one thing - when problem occurs, nothing is able to
> > >> access hard drives (every process which tries it is freezed until
> > >> problem is resolved or server is rebooted).
> > >
> > >I would be really interesting to see what those tasks are blocked on.
> >
> > I'm trying to get it, stay tuned :)
> >
> > Today i noticed one bug, not 100% sure it is related to 'your' patch
> > but i hadn't seen this before. I noticed that i have lots of cgroups
> > which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> > just hangs and never completes. Even more, it's not possible to
> > access the whole cgroup filesystem until i kill that rmdir
> > (anything which tries, just hangs). All unremovable cgroups have
> > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
>
> Somebody acquires the OOM wait reference to the memcg and marks it
> under oom but then does not call into mem_cgroup_oom_synchronize() to
> clean up. That's why under_oom is set and the rmdir waits for
> outstanding references.
>
> > And, yes, 'tasks' file is empty.
>
> It's not a kernel thread that does it because all kernel-context
> handle_mm_fault() calls are annotated properly, which means the task
> must be userspace and, since tasks is empty, must have exited before
> synchronizing.

Yes, well spotted. I missed that while reviewing your patch.
The follow-up fix looks correct.

> Can you try with the following patch on top?
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 5db0490..9a0b152 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
> mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, unsigned int fault)
> {
> - /*
> - * Pagefault was interrupted by SIGKILL. We have no reason to
> - * continue pagefault.
> - */
> - if (fatal_signal_pending(current)) {
> - if (!(fault & VM_FAULT_RETRY))
> - up_read(&current->mm->mmap_sem);
> - if (!(error_code & PF_USER))
> - no_context(regs, error_code, address);
> - return 1;
> - }
> if (!(fault & VM_FAULT_ERROR))
> return 0;
>

--
Michal Hocko
SUSE Labs

2013-07-09 13:08:16

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue 09-07-13 15:00:17, Michal Hocko wrote:
> On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> > Hi guys,
> >
> > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > > >> But i'm sure of one thing - when problem occurs, nothing is able to
> > > >> access hard drives (every process which tries it is freezed until
> > > >> problem is resolved or server is rebooted).
> > > >
> > > >I would be really interesting to see what those tasks are blocked on.
> > >
> > > I'm trying to get it, stay tuned :)
> > >
> > > Today i noticed one bug, not 100% sure it is related to 'your' patch
> > > but i didn't seen this before. I noticed that i have lots of cgroups
> > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> > > just hangs and never complete. Even more, it's not possible to
> > > access the whole cgroup filesystem until i kill that rmdir
> > > (anything, which tries it, just hangs). All unremoveable cgroups has
> > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
> >
> > Somebody acquires the OOM wait reference to the memcg and marks it
> > under oom but then does not call into mem_cgroup_oom_synchronize() to
> > clean up. That's why under_oom is set and the rmdir waits for
> > outstanding references.
> >
> > > And, yes, 'tasks' file is empty.
> >
> > It's not a kernel thread that does it because all kernel-context
> > handle_mm_fault() calls are annotated properly, which means the task
> > must be userspace and, since tasks is empty, must have exited before
> > synchronizing.
>
> Yes, well spotted. I have missed that while reviewing your patch.
> The follow up fix looks correct.

Hmm, I guess you wanted to remove the !(fault & VM_FAULT_ERROR) test as
well; otherwise the else BUG() path would be unreachable and we wouldn't
know that something fishy is going on.

> > Can you try with the following patch on top?
> >
> > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> > index 5db0490..9a0b152 100644
> > --- a/arch/x86/mm/fault.c
> > +++ b/arch/x86/mm/fault.c
> > @@ -846,17 +846,6 @@ static noinline int
> > mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> > unsigned long address, unsigned int fault)
> > {
> > - /*
> > - * Pagefault was interrupted by SIGKILL. We have no reason to
> > - * continue pagefault.
> > - */
> > - if (fatal_signal_pending(current)) {
> > - if (!(fault & VM_FAULT_RETRY))
> > - up_read(&current->mm->mmap_sem);
> > - if (!(error_code & PF_USER))
> > - no_context(regs, error_code, address);
> > - return 1;
> > - }
> > if (!(fault & VM_FAULT_ERROR))
> > return 0;
> >
>
> --
> Michal Hocko
> SUSE Labs

--
Michal Hocko
SUSE Labs

2013-07-09 13:10:05

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue 09-07-13 15:08:08, Michal Hocko wrote:
> On Tue 09-07-13 15:00:17, Michal Hocko wrote:
> > On Mon 24-06-13 16:13:45, Johannes Weiner wrote:
> > > Hi guys,
> > >
> > > On Sat, Jun 22, 2013 at 10:09:58PM +0200, azurIt wrote:
> > > > >> But i'm sure of one thing - when problem occurs, nothing is able to
> > > > >> access hard drives (every process which tries it is freezed until
> > > > >> problem is resolved or server is rebooted).
> > > > >
> > > > >I would be really interesting to see what those tasks are blocked on.
> > > >
> > > > I'm trying to get it, stay tuned :)
> > > >
> > > > Today i noticed one bug, not 100% sure it is related to 'your' patch
> > > > but i didn't seen this before. I noticed that i have lots of cgroups
> > > > which cannot be removed - if i do 'rmdir <cgroup_directory>', it
> > > > just hangs and never complete. Even more, it's not possible to
> > > > access the whole cgroup filesystem until i kill that rmdir
> > > > (anything, which tries it, just hangs). All unremoveable cgroups has
> > > > this in 'memory.oom_control': oom_kill_disable 0 under_oom 1
> > >
> > > Somebody acquires the OOM wait reference to the memcg and marks it
> > > under oom but then does not call into mem_cgroup_oom_synchronize() to
> > > clean up. That's why under_oom is set and the rmdir waits for
> > > outstanding references.
> > >
> > > > And, yes, 'tasks' file is empty.
> > >
> > > It's not a kernel thread that does it because all kernel-context
> > > handle_mm_fault() are annotated properly, which means the task must be
> > > userspace and, since tasks is empty, have exited before synchronizing.
> >
> > Yes, well spotted. I have missed that while reviewing your patch.
> > The follow up fix looks correct.
>
> Hmm, I guess you wanted to remove the !(fault & VM_FAULT_ERROR) test as
> well; otherwise the else BUG() path would be unreachable and we wouldn't
> know that something fishy is going on.

No, scratch it! We need it for VM_FAULT_RETRY. Sorry about the noise.

--
Michal Hocko
SUSE Labs

2013-07-09 13:10:34

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Mon 08-07-13 01:42:24, azurIt wrote:
> > CC: "Michal Hocko" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>
> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
> >> >I looked at your debug messages but could not find anything that would
> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
> >> >assume you use the freezer cgroup and enabled it somehow?
> >>
> >>
> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not
> >> doing problems - unfortunately, several days passed from that day
> >> and now i don't fully remember if i was checking it for both cases
> >> (unremoveabled cgroups and these freezed processes holding web
> >> server port). I'm 100% sure i was checking it for unremoveable
> >> cgroups but not so sure for the other problem (i had to act quickly
> >> in that case). Are you sure (from stacks) that freezer cgroup was
> >> enabled there?
> >
> >Yeah, all the traces without exception look like this:
> >
> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
> >
> >so the freezer was already enabled when you took the backtraces.
> >
> >> Btw, what about that other stacks? I mean this file:
> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
> >>
> >> It was taken while running the kernel with your patch and from
> >> cgroup which was under unresolveable OOM (just like my very original
> >> problem).
> >
> >I looked at these traces too, but none of the tasks are stuck in rmdir
> >or the OOM path. Some /are/ in the page fault path, but they are
> >happily doing reclaim and don't appear to be stuck. So I'm having a
> >hard time matching this data to what you otherwise observed.

Agreed.

> >However, based on what you reported the most likely explanation for
> >the continued hangs is the unfinished OOM handling for which I sent
> >the followup patch for arch/x86/mm/fault.c.
>
> Johannes,
>
> today I tested both of your patches but the problem with unremovable
> cgroups, unfortunately, persists.

Is the group empty again and marked under_oom?
--
Michal Hocko
SUSE Labs

2013-07-09 13:19:28

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>On Mon 08-07-13 01:42:24, azurIt wrote:
>> > CC: "Michal Hocko" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>
>> >On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >> >I looked at your debug messages but could not find anything that would
>> >> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
>> >> >assume you use the freezer cgroup and enabled it somehow?
>> >>
>> >>
>> >> Yes, i'm really using freezer cgroup BUT i was checking if it's not
>> >> doing problems - unfortunately, several days passed from that day
>> >> and now i don't fully remember if i was checking it for both cases
>> >> (unremoveabled cgroups and these freezed processes holding web
>> >> server port). I'm 100% sure i was checking it for unremoveable
>> >> cgroups but not so sure for the other problem (i had to act quickly
>> >> in that case). Are you sure (from stacks) that freezer cgroup was
>> >> enabled there?
>> >
>> >Yeah, all the traces without exception look like this:
>> >
>> >1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>> >1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>> >1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>> >1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>> >1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>> >1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>> >
>> >so the freezer was already enabled when you took the backtraces.
>> >
>> >> Btw, what about that other stacks? I mean this file:
>> >> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>> >>
>> >> It was taken while running the kernel with your patch and from
>> >> cgroup which was under unresolveable OOM (just like my very original
>> >> problem).
>> >
>> >I looked at these traces too, but none of the tasks are stuck in rmdir
>> >or the OOM path. Some /are/ in the page fault path, but they are
>> >happily doing reclaim and don't appear to be stuck. So I'm having a
>> >hard time matching this data to what you otherwise observed.
>
>Agreed.
>
>> >However, based on what you reported the most likely explanation for
>> >the continued hangs is the unfinished OOM handling for which I sent
>> >the followup patch for arch/x86/mm/fault.c.
>>
>> Johannes,
>>
>> today I tested both of your patches but the problem with unremovable
>> cgroups, unfortunately, persists.
>
>Is the group empty again and marked under_oom?


Now i realized that i forgot to remove the UID from that cgroup before trying to remove it, so the cgroup cannot be removed anyway (we are using a third-party cgroup subsystem called cgroup-uid from Andrea Righi, which is able to associate all of a user's processes with a target cgroup). Look here for the cgroup-uid patch:
https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch

ANYWAY, i'm 101% sure that the 'tasks' file was empty and 'under_oom' was permanently '1'.

azur

2013-07-09 13:54:54

by Michal Hocko

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue 09-07-13 15:19:21, azurIt wrote:
[...]
> Now i realized that i forgot to remove the UID from that cgroup before
> trying to remove it, so the cgroup cannot be removed anyway (we are
> using a third-party cgroup subsystem called cgroup-uid from Andrea
> Righi, which is able to associate all of a user's processes with a
> target cgroup). Look here for the cgroup-uid patch:
> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>
> ANYWAY, i'm 101% sure that the 'tasks' file was empty and 'under_oom' was
> permanently '1'.

This is really strange. Could you post the whole diff against the stable
tree you are using (except for the grsecurity stuff and the above
cgroup-uid patch)?

Btw. the below patch might help us pinpoint the exit path which leaves
wait_on_memcg set without calling mem_cgroup_oom_synchronize:
---
diff --git a/kernel/exit.c b/kernel/exit.c
index e6e01b9..ad472e0 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)

profile_task_exit(tsk);

+ WARN_ON(current->memcg_oom.wait_on_memcg);
WARN_ON(blk_needs_flush_plug(tsk));

if (unlikely(in_interrupt()))
--
Michal Hocko
SUSE Labs

2013-07-10 16:25:11

by azurIt

Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

>> Now i realized that i forgot to remove the UID from that cgroup before
>> trying to remove it, so the cgroup cannot be removed anyway (we are
>> using a third-party cgroup subsystem called cgroup-uid from Andrea
>> Righi, which is able to associate all of a user's processes with a
>> target cgroup). Look here for the cgroup-uid patch:
>> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>>
>> ANYWAY, i'm 101% sure that the 'tasks' file was empty and 'under_oom' was
>> permanently '1'.
>
>This is really strange. Could you post the whole diff against the stable
>tree you are using (except for the grsecurity stuff and the above
>cgroup-uid patch)?


Here are all patches which i applied to kernel 3.2.48 in my last test:
http://watchdog.sk/lkml/patches3/

Patches marked as 7-* are from Johannes. I'm applying them in order, except for grsecurity - it goes first.

azur




>Btw. the below patch might help us pinpoint the exit path which leaves
>wait_on_memcg set without calling mem_cgroup_oom_synchronize:
>---
>diff --git a/kernel/exit.c b/kernel/exit.c
>index e6e01b9..ad472e0 100644
>--- a/kernel/exit.c
>+++ b/kernel/exit.c
>@@ -895,6 +895,7 @@ NORET_TYPE void do_exit(long code)
>
> profile_task_exit(tsk);
>
>+ WARN_ON(current->memcg_oom.wait_on_memcg);
> WARN_ON(blk_needs_flush_plug(tsk));
>
> if (unlikely(in_interrupt()))
>--
>Michal Hocko
>SUSE Labs
>

2013-07-11 07:25:14

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Wed 10-07-13 18:25:06, azurIt wrote:
> >> Now i realized that i forgot to remove UID from that cgroup before
> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> >> to associate all user's processes with target cgroup). Look here for
> >> cgroup-uid patch:
> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> >>
> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> >> permanently '1'.
> >
> >This is really strange. Could you post the whole diff against stable
> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> >patch)?
>
>
> Here are all patches which i applied to kernel 3.2.48 in my last test:
> http://watchdog.sk/lkml/patches3/

The two patches from Johannes seem correct.

From a quick look, even the grsecurity patchset shouldn't interfere, as it
doesn't seem to put any code between handle_mm_fault and mm_fault_error,
and there also don't seem to be any new handle_mm_fault call sites.

But I cannot tell that there aren't other code paths which would lead to a
memcg charge, and thus OOM, without proper FAULT_FLAG_KERNEL handling.

--
Michal Hocko
SUSE Labs

2013-07-13 23:26:45

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>On Wed 10-07-13 18:25:06, azurIt wrote:
>> >> Now i realized that i forgot to remove UID from that cgroup before
>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> >> to associate all user's processes with target cgroup). Look here for
>> >> cgroup-uid patch:
>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> >>
>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> >> permanently '1'.
>> >
>> >This is really strange. Could you post the whole diff against stable
>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> >patch)?
>>
>>
>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> http://watchdog.sk/lkml/patches3/
>
>The two patches from Johannes seem correct.
>
>From a quick look even grsecurity patchset shouldn't interfere as it
>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>and there also doesn't seem to be any new handle_mm_fault call sites.
>
>But I cannot tell there aren't other code paths which would lead to a
>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.


Michal,

now I can definitely confirm that the problem with unremovable cgroups persists. What info do you need from me? I also applied your little 'WARN_ON' patch.

azur

2013-07-13 23:51:21

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>>On Wed 10-07-13 18:25:06, azurIt wrote:
>>> >> Now i realized that i forgot to remove UID from that cgroup before
>>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>>> >> to associate all user's processes with target cgroup). Look here for
>>> >> cgroup-uid patch:
>>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>>> >>
>>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>>> >> permanently '1'.
>>> >
>>> >This is really strange. Could you post the whole diff against stable
>>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>>> >patch)?
>>>
>>>
>>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>>> http://watchdog.sk/lkml/patches3/
>>
>>The two patches from Johannes seem correct.
>>
>>From a quick look even grsecurity patchset shouldn't interfere as it
>>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>>and there also doesn't seem to be any new handle_mm_fault call sites.
>>
>>But I cannot tell there aren't other code paths which would lead to a
>>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>
>
>Michal,
>
>now i can definitely confirm that problem with unremovable cgroups persists. What info do you need from me? I applied also your little 'WARN_ON' patch.
>
>azur



Ok, I think you want this:
http://watchdog.sk/lkml/kern4.log

2013-07-14 17:07:34

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

> CC: "Michal Hocko" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>
>On Fri, Jul 05, 2013 at 09:02:46PM +0200, azurIt wrote:
>> >I looked at your debug messages but could not find anything that would
>> >hint at a deadlock. All tasks are stuck in the refrigerator, so I
>> >assume you use the freezer cgroup and enabled it somehow?
>>
>>
>> Yes, i'm really using freezer cgroup BUT i was checking if it's not
>> doing problems - unfortunately, several days passed from that day
>> and now i don't fully remember if i was checking it for both cases
>> (unremoveabled cgroups and these freezed processes holding web
>> server port). I'm 100% sure i was checking it for unremoveable
>> cgroups but not so sure for the other problem (i had to act quickly
>> in that case). Are you sure (from stacks) that freezer cgroup was
>> enabled there?
>
>Yeah, all the traces without exception look like this:
>
>1372089762/23433/stack:[<ffffffff81080925>] refrigerator+0x95/0x160
>1372089762/23433/stack:[<ffffffff8106ab7b>] get_signal_to_deliver+0x1cb/0x540
>1372089762/23433/stack:[<ffffffff8100188b>] do_signal+0x6b/0x750
>1372089762/23433/stack:[<ffffffff81001fc5>] do_notify_resume+0x55/0x80
>1372089762/23433/stack:[<ffffffff815cac77>] int_signal+0x12/0x17
>1372089762/23433/stack:[<ffffffffffffffff>] 0xffffffffffffffff
>
>so the freezer was already enabled when you took the backtraces.
>
>> Btw, what about that other stacks? I mean this file:
>> http://watchdog.sk/lkml/memcg-bug-7.tar.gz
>>
>> It was taken while running the kernel with your patch and from
>> cgroup which was under unresolveable OOM (just like my very original
>> problem).
>
>I looked at these traces too, but none of the tasks are stuck in rmdir
>or the OOM path. Some /are/ in the page fault path, but they are
>happily doing reclaim and don't appear to be stuck. So I'm having a
>hard time matching this data to what you otherwise observed.
>
>However, based on what you reported the most likely explanation for
>the continued hangs is the unfinished OOM handling for which I sent
>the followup patch for arch/x86/mm/fault.c.


Johannes,

this problem happened again but was even worse; now I'm sure it wasn't my fault. This time I wasn't even able to access /proc/<pid> of the hung apache process (which was, again, holding the web server port and forced me to reboot the server). Everything which tried to access /proc/<pid> just hung. The server even wasn't able to reboot correctly; it hung and then did a hard reboot after a few minutes.

azur

2013-07-15 15:41:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Sun 14-07-13 01:51:12, azurIt wrote:
> > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> >>On Wed 10-07-13 18:25:06, azurIt wrote:
> >>> >> Now i realized that i forgot to remove UID from that cgroup before
> >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> >>> >> to associate all user's processes with target cgroup). Look here for
> >>> >> cgroup-uid patch:
> >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> >>> >>
> >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> >>> >> permanently '1'.
> >>> >
> >>> >This is really strange. Could you post the whole diff against stable
> >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> >>> >patch)?
> >>>
> >>>
> >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> >>> http://watchdog.sk/lkml/patches3/
> >>
> >>The two patches from Johannes seem correct.
> >>
> >>From a quick look even grsecurity patchset shouldn't interfere as it
> >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> >>and there also doesn't seem to be any new handle_mm_fault call sites.
> >>
> >>But I cannot tell there aren't other code paths which would lead to a
> >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> >
> >
> >Michal,
> >
> >now i can definitely confirm that problem with unremovable cgroups
> >persists. What info do you need from me? I applied also your little
> >'WARN_ON' patch.
>
> Ok, i think you want this:
> http://watchdog.sk/lkml/kern4.log

Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---

OK, so you had an OOM which has been handled by the in-kernel oom handler
(it killed 12021) and 12037 was in the same group. The warning tells us
that it went through mem_cgroup_oom as well (otherwise it wouldn't have
memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
it exited on userspace request (by the exit syscall).

I do not see any way how this could happen, though. If mem_cgroup_oom
is called then we always return CHARGE_NOMEM, which turns into ENOMEM
returned by __mem_cgroup_try_charge (invoke_oom must have been set to
true). So if nobody screws up the return value on the way up to the
page fault handler then there is no way to escape.
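
For reference, a minimal sketch of the contract described above (an
illustration only, not the 3.2 source; the signature and names are
assumptions following this thread): a charge attempt that arms the OOM
context must fail all the way back to the fault handler.

/*
 * Illustrative sketch: once mem_cgroup_oom() has set up the per-task
 * OOM context (wait_on_memcg), the charge has to end in -ENOMEM so
 * that the fault returns VM_FAULT_OOM and mem_cgroup_oom_synchronize()
 * gets a chance to run.
 */
static int charge_contract_sketch(struct mem_cgroup *memcg,
				  gfp_t gfp_mask, bool invoke_oom)
{
	if (invoke_oom) {
		mem_cgroup_oom(memcg, gfp_mask);	/* arms wait_on_memcg */
		return -ENOMEM;		/* CHARGE_NOMEM; must not be dropped */
	}
	return 0;
}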

I will check the code.
--
Michal Hocko
SUSE Labs

2013-07-15 16:00:11

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> On Sun 14-07-13 01:51:12, azurIt wrote:
> > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > >>> >> to associate all user's processes with target cgroup). Look here for
> > >>> >> cgroup-uid patch:
> > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > >>> >>
> > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > >>> >> permanently '1'.
> > >>> >
> > >>> >This is really strange. Could you post the whole diff against stable
> > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > >>> >patch)?
> > >>>
> > >>>
> > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > >>> http://watchdog.sk/lkml/patches3/
> > >>
> > >>The two patches from Johannes seem correct.
> > >>
> > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > >>
> > >>But I cannot tell there aren't other code paths which would lead to a
> > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > >
> > >
> > >Michal,
> > >
> > >now i can definitely confirm that problem with unremovable cgroups
> > >persists. What info do you need from me? I applied also your little
> > >'WARN_ON' patch.
> >
> > Ok, i think you want this:
> > http://watchdog.sk/lkml/kern4.log
>
> Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
>
> OK, so you had an OOM which has been handled by in-kernel oom handler
> (it killed 12021) and 12037 was in the same group. The warning tells us
> that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> it exited on the userspace request (by exit syscall).
>
> I do not see any way how, this could happen though. If mem_cgroup_oom
> is called then we always return CHARGE_NOMEM which turns into ENOMEM
> returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> true). So if nobody screwed the return value on the way up to page
> fault handler then there is no way to escape.
>
> I will check the code.

OK, I guess I found it:
__do_fault
  fault = filemap_fault
    do_async_mmap_readahead
      page_cache_async_readahead
        ondemand_readahead
          __do_page_cache_readahead
            read_pages
              readpages = ext3_readpages
                mpage_readpages # Doesn't propagate ENOMEM
                  add_to_page_cache_lru
                    add_to_page_cache
                      add_to_page_cache_locked
                        mem_cgroup_cache_charge

So it is the readahead path, most probably. Again! Duhhh. I will try to
think about a fix for this. One obvious place is mpage_readpages, but
__do_page_cache_readahead ignores the read_pages return value as well,
and page_cache_async_readahead, even worse, returns void and is exported
as such.
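
To illustrate where the error is dropped, a stripped-down sketch
(illustration only, not the actual fs/mpage.c):

static int mpage_readpages_sketch(struct address_space *mapping,
				  struct list_head *pages, unsigned nr_pages)
{
	unsigned page_idx;

	for (page_idx = 0; page_idx < nr_pages; page_idx++) {
		struct page *page = list_entry(pages->prev,
					       struct page, lru);

		list_del(&page->lru);
		if (!add_to_page_cache_lru(page, mapping,
					   page->index, GFP_KERNEL)) {
			/* submit the read for this page; omitted */
		}
		/*
		 * A failed charge (-ENOMEM with the OOM context already
		 * set up) lands here and is silently ignored.
		 */
		page_cache_release(page);
	}
	return 0;	/* always "success" as far as the caller knows */
}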

So this smells like a hard-to-fix bugger. One possible, and really ugly,
way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
doesn't return VM_FAULT_ERROR, but that is a crude hack.
--
Michal Hocko
SUSE Labs

2013-07-16 15:36:20

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > >>> >> cgroup-uid patch:
> > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > >>> >>
> > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > >>> >> permanently '1'.
> > > >>> >
> > > >>> >This is really strange. Could you post the whole diff against stable
> > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > >>> >patch)?
> > > >>>
> > > >>>
> > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > >>> http://watchdog.sk/lkml/patches3/
> > > >>
> > > >>The two patches from Johannes seem correct.
> > > >>
> > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > >>
> > > >>But I cannot tell there aren't other code paths which would lead to a
> > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > >
> > > >
> > > >Michal,
> > > >
> > > >now i can definitely confirm that problem with unremovable cgroups
> > > >persists. What info do you need from me? I applied also your little
> > > >'WARN_ON' patch.
> > >
> > > Ok, i think you want this:
> > > http://watchdog.sk/lkml/kern4.log
> >
> > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> >
> > OK, so you had an OOM which has been handled by in-kernel oom handler
> > (it killed 12021) and 12037 was in the same group. The warning tells us
> > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > it exited on the userspace request (by exit syscall).
> >
> > I do not see any way how, this could happen though. If mem_cgroup_oom
> > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > true). So if nobody screwed the return value on the way up to page
> > fault handler then there is no way to escape.
> >
> > I will check the code.
>
> OK, I guess I found it:
> __do_fault
> fault = filemap_fault
> do_async_mmap_readahead
> page_cache_async_readahead
> ondemand_readahead
> __do_page_cache_readahead
> read_pages
> readpages = ext3_readpages
> mpage_readpages # Doesn't propagate ENOMEM
> add_to_page_cache_lru
> add_to_page_cache
> add_to_page_cache_locked
> mem_cgroup_cache_charge
>
> So the read ahead most probably. Again! Duhhh. I will try to think
> about a fix for this. One obvious place is mpage_readpages but
> __do_page_cache_readahead ignores read_pages return value as well and
> page_cache_async_readahead, even worse, is just void and exported as
> such.
>
> So this smells like a hard to fix bugger. One possible, and really ugly
> way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> doesn't return VM_FAULT_ERROR, but that is a crude hack.

Ouch, good spot.

I don't think we need to handle an OOM from the readahead code. If
readahead does not produce the desired page, we retry synchronously
in page_cache_read() and handle the OOM properly. We should not
signal an OOM for optional pages anyway.
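
For illustration, a rough sketch of that synchronous fallback
(simplified, not the exact 3.2 code), where the charge failure is
propagated and the OOM handled properly:

static int page_cache_read_sketch(struct file *file, pgoff_t offset)
{
	struct address_space *mapping = file->f_mapping;
	struct page *page = page_cache_alloc_cold(mapping);
	int ret;

	if (!page)
		return -ENOMEM;
	ret = add_to_page_cache_lru(page, mapping, offset, GFP_KERNEL);
	if (!ret)
		ret = mapping->a_ops->readpage(file, page);
	page_cache_release(page);
	return ret;	/* unlike readahead, -ENOMEM reaches the fault handler */
}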

So either we pass a flag from the readahead code down to
add_to_page_cache and mem_cgroup_cache_charge that tells the charge
code to ignore OOM conditions and not set up an OOM context.

Or we DO call mem_cgroup_oom_synchronize() from read_cache_pages,
with an argument that makes it only clean up the context and not wait.
It would not be completely outlandish to place it there, since it's
right next to where an error from add_to_page_cache() is not further
propagated back through the fault stack.
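
As a rough sketch of that placement (the cleanup-only argument and its
signature are assumptions; this is not a posted patch):

static int read_cache_pages_sketch(struct address_space *mapping,
				   struct list_head *pages,
				   int (*filler)(void *, struct page *),
				   void *data)
{
	struct page *page;
	int ret = 0;

	while (!list_empty(pages)) {
		page = list_entry(pages->prev, struct page, lru);
		list_del(&page->lru);
		if (add_to_page_cache_lru(page, mapping,
					  page->index, GFP_KERNEL)) {
			/*
			 * The error is dropped right here, so also drop
			 * any memcg OOM context the failed charge set up
			 * (false: only clean up, do not wait).
			 */
			mem_cgroup_oom_synchronize(false);
			page_cache_release(page);
			continue;
		}
		page_cache_release(page);
		ret = filler(data, page);
		if (ret)
			break;
	}
	return ret;
}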

I'm travelling right now, I'll send a patch when I get back
(Thursday). Unless you beat me to it :)

2013-07-16 16:09:09

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > >>> >> cgroup-uid patch:
> > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > >>> >>
> > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > >>> >> permanently '1'.
> > > > >>> >
> > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > >>> >patch)?
> > > > >>>
> > > > >>>
> > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > >>> http://watchdog.sk/lkml/patches3/
> > > > >>
> > > > >>The two patches from Johannes seem correct.
> > > > >>
> > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > >>
> > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > >
> > > > >
> > > > >Michal,
> > > > >
> > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > >persists. What info do you need from me? I applied also your little
> > > > >'WARN_ON' patch.
> > > >
> > > > Ok, i think you want this:
> > > > http://watchdog.sk/lkml/kern4.log
> > >
> > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> > >
> > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > it exited on the userspace request (by exit syscall).
> > >
> > > I do not see any way how, this could happen though. If mem_cgroup_oom
> > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > true). So if nobody screwed the return value on the way up to page
> > > fault handler then there is no way to escape.
> > >
> > > I will check the code.
> >
> > OK, I guess I found it:
> > __do_fault
> > fault = filemap_fault
> > do_async_mmap_readahead
> > page_cache_async_readahead
> > ondemand_readahead
> > __do_page_cache_readahead
> > read_pages
> > readpages = ext3_readpages
> > mpage_readpages # Doesn't propagate ENOMEM
> > add_to_page_cache_lru
> > add_to_page_cache
> > add_to_page_cache_locked
> > mem_cgroup_cache_charge
> >
> > So the read ahead most probably. Again! Duhhh. I will try to think
> > about a fix for this. One obvious place is mpage_readpages but
> > __do_page_cache_readahead ignores read_pages return value as well and
> > page_cache_async_readahead, even worse, is just void and exported as
> > such.
> >
> > So this smells like a hard to fix bugger. One possible, and really ugly
> > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
> Ouch, good spot.
>
> I don't think we need to handle an OOM from the readahead code. If
> readahead does not produce the desired page, we retry synchroneously
> in page_cache_read() and handle the OOM properly. We should not
> signal an OOM for optional pages anyway.
>
> So either we pass a flag from the readahead code down to
> add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> code to ignore OOM conditions and do not set up an OOM context.

That was my previous attempt and it was sooo painful.

> Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages,
> with an argument that makes it only clean up the context and not wait.

Yes, I was playing with this idea as well. I just do not like how
fragile this is. We need some way to catch all possible places which
might leak it.

> It would not be completely outlandish to place it there, since it's
> right next to where an error from add_to_page_cache() is not further
> propagated back through the fault stack.
>
> I'm travelling right now, I'll send a patch when I get back
> (Thursday). Unless you beat me to it :)

I can cook something up but there is quite a big pile on my desk
currently (as always :/).

--
Michal Hocko
SUSE Labs

2013-07-16 16:48:51

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
> On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > > >>> >> cgroup-uid patch:
> > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > > >>> >>
> > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > > >>> >> permanently '1'.
> > > > > >>> >
> > > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > > >>> >patch)?
> > > > > >>>
> > > > > >>>
> > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > > >>> http://watchdog.sk/lkml/patches3/
> > > > > >>
> > > > > >>The two patches from Johannes seem correct.
> > > > > >>
> > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > > >>
> > > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > > >
> > > > > >
> > > > > >Michal,
> > > > > >
> > > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > > >persists. What info do you need from me? I applied also your little
> > > > > >'WARN_ON' patch.
> > > > >
> > > > > Ok, i think you want this:
> > > > > http://watchdog.sk/lkml/kern4.log
> > > >
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> > > >
> > > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > > it exited on the userspace request (by exit syscall).
> > > >
> > > > I do not see any way how, this could happen though. If mem_cgroup_oom
> > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > true). So if nobody screwed the return value on the way up to page
> > > > fault handler then there is no way to escape.
> > > >
> > > > I will check the code.
> > >
> > > OK, I guess I found it:
> > > __do_fault
> > > fault = filemap_fault
> > > do_async_mmap_readahead
> > > page_cache_async_readahead
> > > ondemand_readahead
> > > __do_page_cache_readahead
> > > read_pages
> > > readpages = ext3_readpages
> > > mpage_readpages # Doesn't propagate ENOMEM
> > > add_to_page_cache_lru
> > > add_to_page_cache
> > > add_to_page_cache_locked
> > > mem_cgroup_cache_charge
> > >
> > > So the read ahead most probably. Again! Duhhh. I will try to think
> > > about a fix for this. One obvious place is mpage_readpages but
> > > __do_page_cache_readahead ignores read_pages return value as well and
> > > page_cache_async_readahead, even worse, is just void and exported as
> > > such.
> > >
> > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
> >
> > Ouch, good spot.
> >
> > I don't think we need to handle an OOM from the readahead code. If
> > readahead does not produce the desired page, we retry synchroneously
> > in page_cache_read() and handle the OOM properly. We should not
> > signal an OOM for optional pages anyway.
> >
> > So either we pass a flag from the readahead code down to
> > add_to_page_cache and mem_cgroup_cache_charge that tells the charge
> > code to ignore OOM conditions and do not set up an OOM context.
>
> That was my previous attempt and it was sooo painful.
>
> > Or we DO call mem_cgroup_oom_synchronize() from the read_cache_pages,
> > with an argument that makes it only clean up the context and not wait.
>
> Yes, I was playing with this idea as well. I just do not like how
> fragile this is. We need some way to catch all possible places which
> might leak it.

I don't think this is necessary, but we could add a sanity check
in/near mem_cgroup_clear_userfault() that makes sure the OOM context
is only set up when an error is returned.
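
For instance, along these lines (a sketch with assumed names, reusing
the task fields from the WARN_ON debugging patch earlier in the thread):

/*
 * Illustrative sanity check: the OOM context may only survive the
 * fault when the fault actually failed with VM_FAULT_OOM; anything
 * else means some charge path leaked it.
 */
static inline void check_memcg_oom_context(unsigned int fault)
{
	WARN_ON(current->memcg_oom.wait_on_memcg &&
		!(fault & VM_FAULT_OOM));
}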

> > It would not be completely outlandish to place it there, since it's
> > right next to where an error from add_to_page_cache() is not further
> > propagated back through the fault stack.
> >
> > I'm travelling right now, I'll send a patch when I get back
> > (Thursday). Unless you beat me to it :)
>
> I can cook something up but there is quite a big pile on my desk
> currently (as always :/).

No worries, I'll send an update.

2013-07-19 04:21:37

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote:
> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
> > > > > On Sun 14-07-13 01:51:12, azurIt wrote:
> > > > > > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > > > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for
> > > > > > >>> >> cgroup-uid patch:
> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
> > > > > > >>> >>
> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
> > > > > > >>> >> permanently '1'.
> > > > > > >>> >
> > > > > > >>> >This is really strange. Could you post the whole diff against stable
> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
> > > > > > >>> >patch)?
> > > > > > >>>
> > > > > > >>>
> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
> > > > > > >>> http://watchdog.sk/lkml/patches3/
> > > > > > >>
> > > > > > >>The two patches from Johannes seem correct.
> > > > > > >>
> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
> > > > > > >>
> > > > > > >>But I cannot tell there aren't other code paths which would lead to a
> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
> > > > > > >
> > > > > > >
> > > > > > >Michal,
> > > > > > >
> > > > > > >now i can definitely confirm that problem with unremovable cgroups
> > > > > > >persists. What info do you need from me? I applied also your little
> > > > > > >'WARN_ON' patch.
> > > > > >
> > > > > > Ok, i think you want this:
> > > > > > http://watchdog.sk/lkml/kern4.log
> > > > >
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
> > > > >
> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
> > > > > it exited on the userspace request (by exit syscall).
> > > > >
> > > > > I do not see any way how, this could happen though. If mem_cgroup_oom
> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
> > > > > true). So if nobody screwed the return value on the way up to page
> > > > > fault handler then there is no way to escape.
> > > > >
> > > > > I will check the code.
> > > >
> > > > OK, I guess I found it:
> > > > __do_fault
> > > > fault = filemap_fault
> > > > do_async_mmap_readahead
> > > > page_cache_async_readahead
> > > > ondemand_readahead
> > > > __do_page_cache_readahead
> > > > read_pages
> > > > readpages = ext3_readpages
> > > > mpage_readpages # Doesn't propagate ENOMEM
> > > > add_to_page_cache_lru
> > > > add_to_page_cache
> > > > add_to_page_cache_locked
> > > > mem_cgroup_cache_charge
> > > >
> > > > So the read ahead most probably. Again! Duhhh. I will try to think
> > > > about a fix for this. One obvious place is mpage_readpages but
> > > > __do_page_cache_readahead ignores read_pages return value as well and
> > > > page_cache_async_readahead, even worse, is just void and exported as
> > > > such.
> > > >
> > > > So this smells like a hard to fix bugger. One possible, and really ugly
> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.

I fixed it by disabling the OOM killer altogether for the readahead code.
We don't do it globally, so we should not do it in the memcg either;
these are optional allocations/charges.

I also disabled it for kernel faults triggered from within a syscall
(copy_*user, get_user_pages), which should just return -ENOMEM as
usual (unless it's nested inside a userspace fault). The only
downside is that we can't get around annotating userspace faults
anymore, so every architecture fault handler now passes
FAULT_FLAG_USER to handle_mm_fault(). This makes the series a little
less self-contained, but it's not unreasonable.

It's easy to detect leaks now by checking whether the memcg OOM context
is set up while we are not returning VM_FAULT_OOM.

Here is a combined diff based on 3.2. azurIt, any chance you could
give this a shot? I tested it on my local machines, but you have a
known reproducer of fairly unlikely scenarios...

Thanks!
Johannes

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index fadd5f8..fa6b4e4 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
struct mm_struct *mm = current->mm;
const struct exception_table_entry *fixup;
int fault, si_code = SEGV_MAPERR;
+ unsigned long flags = 0;
siginfo_t info;

/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
@@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (cause > 0)
+ flags |= FAULT_FLAG_WRITE;
+
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index aa33949..31b1e69 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)

static int __kprobes
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- struct task_struct *tsk)
+ struct task_struct *tsk, struct pt_regs *regs)
{
struct vm_area_struct *vma;
+ unsigned long flags = 0;
int fault;

vma = find_vma(mm, addr);
@@ -253,11 +254,16 @@ good_area:
goto out;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (fsr & FSR_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
if (unlikely(fault & VM_FAULT_ERROR))
return fault;
if (fault & VM_FAULT_MAJOR)
@@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
#endif
}

- fault = __do_page_fault(mm, addr, fsr, tsk);
+ fault = __do_page_fault(mm, addr, fsr, tsk, regs);
up_read(&mm->mmap_sem);

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index f7040a1..ada6237 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
struct mm_struct *mm;
struct vm_area_struct *vma;
const struct exception_table_entry *fixup;
+ unsigned long flags = 0;
unsigned long address;
unsigned long page;
int writeaccess;
@@ -127,12 +128,17 @@ good_area:
panic("Unhandled case %lu in do_page_fault!", ecr);
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 9dcac8e..35d096a 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess & 1)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index a325d57..2dbf219 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long _pme, lrai, lrad, fixup;
+ unsigned long flags = 0;
siginfo_t info;
pgd_t *pge;
pud_t *pue;
@@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
break;
}

+ if (user_mode(__frame))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index c10b76f..e56baf3 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
siginfo_t info;
int si_code = SEGV_MAPERR;
int fault;
+ unsigned long flags = 0;
const struct exception_table_entry *fixup;

/*
@@ -96,7 +97,12 @@ good_area:
break;
}

- fault = handle_mm_fault(mm, vma, address, (cause > 0));
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (cause > 0)
+ flags |= FAULT_FLAG_WRITE;
+
+ fault = handle_mm_fault(mm, vma, address, flags);

/* The most common case -- we are done. */
if (likely(!(fault & VM_FAULT_ERROR))) {
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 20b3593..ad9ef9d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
int signal = SIGSEGV, code = SEGV_MAPERR;
struct vm_area_struct *vma, *prev_vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
struct siginfo si;
unsigned long mask;
int fault;
@@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
if ((vma->vm_flags & mask) != mask)
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (mask & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
/*
* We ran out of memory, or some other thing happened
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 2c9aeb4..e74f6fa 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
struct mm_struct *mm;
struct vm_area_struct * vma;
unsigned long page, addr;
+ unsigned long flags = 0;
int write;
int fault;
siginfo_t info;
@@ -188,6 +189,11 @@ good_area:
if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC))
goto bad_area;

+ if (error_code & ACE_USERMODE)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -195,7 +201,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2db6099..ab88a91 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
int write, fault;

#ifdef DEBUG
@@ -134,13 +135,18 @@ good_area:
goto acc_err;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
#ifdef DEBUG
printk("handle_mm_fault returns %d\n",fault);
#endif
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index ae97d2c..b002612 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
siginfo_t info;
int code = SEGV_MAPERR;
int is_write = error_code & ESR_S;
@@ -206,12 +207,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 937cf33..e5b9fed 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -139,12 +140,17 @@ good_area:
}
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0945409..031be56 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
{
struct vm_area_struct *vma;
struct task_struct *tsk;
+ unsigned long flags = 0;
struct mm_struct *mm;
unsigned long page;
siginfo_t info;
@@ -247,12 +248,17 @@ good_area:
break;
}

+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
@@ -329,9 +335,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
- if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
- do_exit(SIGKILL);
+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;

do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index a5dce82..d586119 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -153,13 +154,18 @@ good_area:
if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC))
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write_acc)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, write_acc);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
@@ -246,10 +252,10 @@ out_of_memory:
__asm__ __volatile__("l.nop 1");

up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 18162ce..a151e87 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
struct vm_area_struct *vma, *prev_vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned long flags = 0;
unsigned long acc_type;
int fault;

@@ -195,13 +196,18 @@ good_area:
if ((vma->vm_flags & acc_type) != acc_type)
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (acc_type & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/

- fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
/*
* We hit a shared mapping outside of the file, or some
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5efe8c9..2bf339c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
siginfo_t info;
int code = SEGV_MAPERR;
int is_write = 0, ret;
@@ -305,12 +306,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ ret = handle_mm_fault(mm, vma, address, flags);
if (unlikely(ret & VM_FAULT_ERROR)) {
if (ret & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index a9a3018..fe6109c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access,
address = trans_exc_code & __FAIL_ADDR_MASK;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
flags = FAULT_FLAG_ALLOW_RETRY;
+ if (regs->psw.mask & PSW_MASK_PSTATE)
+ flags |= FAULT_FLAG_USER;
if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
flags |= FAULT_FLAG_WRITE;
down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..2ca5ae5 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -101,12 +102,16 @@ good_area:
}

survive:
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
@@ -172,10 +177,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c
index 7bebd04..a61b803 100644
--- a/arch/sh/mm/fault_32.c
+++ b/arch/sh/mm/fault_32.c
@@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
int si_code;
int fault;
siginfo_t info;
@@ -195,12 +196,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c
index e3430e0..0a9d645 100644
--- a/arch/sh/mm/tlbflush_64.c
+++ b/arch/sh/mm/tlbflush_64.c
@@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess,
struct mm_struct *mm;
struct vm_area_struct * vma;
const struct exception_table_entry *fixup;
+ unsigned long flags = 0;
pte_t *pte;
int fault;

@@ -184,12 +185,17 @@ good_area:
}
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 8023fd7..efa3d48 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
struct vm_area_struct *vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned long flags = 0;
unsigned int fixup;
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
@@ -285,12 +286,17 @@ good_area:
goto bad_area;
}

+ if (from_user)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 504c062..bc536ea 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
unsigned int insn = 0;
int si_code, fault_code, fault;
unsigned long address, mm_rss;
@@ -423,7 +424,12 @@ good_area:
goto bad_area;
}

- fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
+ if (!(regs->tstate & TSTATE_PRIV))
+ flags |= FAULT_FLAG_USER;
+ if (fault_code & FAULT_CODE_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 25b7b90..b2a7fd5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs,
struct mm_struct *mm;
struct vm_area_struct *vma;
unsigned long stack_offset;
+ unsigned long flags = 0;
int fault;
int si_code;
int is_kernel_mode;
@@ -415,12 +416,16 @@ good_area:
}

survive:
+ if (!is_kernel_mode)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
@@ -540,10 +545,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- pr_alert("VM: killing process %s\n", tsk->comm);
- if (!is_kernel_mode)
- do_group_exit(SIGKILL);
- goto no_context;
+ if (is_kernel_mode)
+ goto no_context;
+ pagefault_out_of_memory();
+ return 0;

do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index dafc947..626a85e 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -62,10 +63,15 @@ good_area:
if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out;

+ if (is_user)
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
do {
int fault;

- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM) {
goto out_of_memory;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 283aa4b..3026943 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
}

static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- struct task_struct *tsk)
+ struct task_struct *tsk, struct pt_regs *regs)
{
struct vm_area_struct *vma;
+ unsigned long flags = 0;
int fault;

vma = find_vma(mm, addr);
@@ -191,12 +192,16 @@ good_area:
goto out;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (!(fsr ^ 0x12))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK,
- (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
if (unlikely(fault & VM_FAULT_ERROR))
return fault;
if (fault & VM_FAULT_MAJOR)
@@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
#endif
}

- fault = __do_pf(mm, addr, fsr, tsk);
+ fault = __do_pf(mm, addr, fsr, tsk, regs);
up_read(&mm->mmap_sem);

/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..90248c9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
- }
if (!(fault & VM_FAULT_ERROR))
return 0;

@@ -999,8 +988,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
struct mm_struct *mm;
int fault;
int write = error_code & PF_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -1160,6 +1148,11 @@ good_area:
return;
}

+ if (error_code & PF_USER)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e367e30..7db9fbe 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs)
struct mm_struct *mm = current->mm;
unsigned int exccause = regs->exccause;
unsigned int address = regs->excvaddr;
+ unsigned long flags = 0;
siginfo_t info;

int is_write, is_exec;
@@ -101,11 +102,16 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b92e5e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+ unsigned int new)
+{
+ unsigned int old;
+
+ old = p->memcg_oom.may_oom;
+ p->memcg_oom.may_oom = new;
+
+ return old;
+}
+bool mem_cgroup_oom_synchronize(void);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+ unsigned int new)
+{
+ return 0;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
#include <linux/latencytop.h>
#include <linux/cred.h>
#include <linux/llist.h>
+#include <linux/stacktrace.h>

#include <asm/processor.h>

@@ -1568,6 +1569,14 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ struct stack_trace trace;
+ unsigned long trace_entries[16];
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..d18bd47 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
+ unsigned int may_oom;
struct page *page;
pgoff_t size;
int ret = 0;
@@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack, which does not allow proper unwinding of a
+ * memcg OOM state. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page)) {
@@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
} else {
- /* No page in the page cache at all */
+ /* No page in the page cache at all. */
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..ae7e4ae 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/stacktrace.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -249,6 +250,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

atomic_t refcnt;

@@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked, need_to_kill = true;

- owait.mem = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;
+
+ current->memcg_oom.trace.nr_entries = 0;
+ current->memcg_oom.trace.max_entries = 16;
+ current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+ current->memcg_oom.trace.skip = 1;
+ save_stack_trace(&current->memcg_oom.trace);

/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
@@ -1888,24 +1887,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
spin_unlock(&memcg_oom_lock);

if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask);
} else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ memcg_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}

- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.mem = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2195,11 +2256,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2317,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2349,7 +2405,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2413,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2426,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/stacktrace.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3439,22 +3440,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3496,39 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int userfault = flags & FAULT_FLAG_USER;
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (userfault)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (userfault)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
+
+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+ "Fixing unhandled memcg OOM context, set up from:\n")) {
+ print_stack_trace(&current->memcg_oom.trace, 0);
+ mem_cgroup_oom_synchronize();
+ }
+#endif
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize())
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
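
For 3.2 testing, a quick way to exercise the memcg OOM path on the patched kernel is to run a small memory hog inside a memcg with a low limit. A minimal sketch follows; the cgroup mount point and limit in the comment are assumptions about the test setup, not part of the patch:

/* memhog.c -- fault in memory until the memcg limit is hit.
 * Run inside a cgroup, e.g. (paths depend on your setup):
 *   echo 64M > /sys/fs/cgroup/memory/test/memory.limit_in_bytes
 *   echo $$ > /sys/fs/cgroup/memory/test/tasks
 *   ./memhog
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	size_t chunk = 16UL << 20;	/* 16 MB per step */
	size_t total = 0;

	for (;;) {
		char *p = malloc(chunk);
		if (!p) {
			perror("malloc");
			return 1;
		}
		/* touching the pages goes through the page-fault charge path */
		memset(p, 0xaa, chunk);
		total += chunk;
		fprintf(stderr, "allocated %zu MB\n", total >> 20);
	}
}

With the series applied, the faulting task returns through handle_mm_fault() with the memcg OOM context recorded, and pagefault_out_of_memory() calls mem_cgroup_oom_synchronize() with the stack unwound and no locks held, instead of sleeping deep inside the charge path.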

2013-07-19 04:22:43

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/5] mm: invoke oom-killer from remaining unconverted page fault handlers

[already upstream, included for 3.2 reference]

A few remaining architectures directly kill the page faulting task in an
out of memory situation. This is usually not a good idea since that
task might not even use a significant amount of memory and so may not be
the optimal victim to resolve the situation.

Since 2.6.29's 1c0fe6e ("mm: invoke oom-killer from page fault") there
is a hook that architecture page fault handlers are supposed to call to
invoke the OOM killer and let it pick the right task to kill. Convert
the remaining architectures over to this hook.

To have the previous behavior of simply taking out the faulting task, the
vm.oom_kill_allocating_task sysctl can be set to 1.
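
The sysctl mentioned above is exposed at /proc/sys/vm/oom_kill_allocating_task; a trivial sketch that enables it from C, equivalent to `sysctl -w vm.oom_kill_allocating_task=1` (requires root):

#include <stdio.h>

int main(void)
{
	/* write "1" to the procfs knob backing the sysctl */
	FILE *f = fopen("/proc/sys/vm/oom_kill_allocating_task", "w");

	if (!f) {
		perror("fopen");
		return 1;
	}
	fputs("1\n", f);
	fclose(f);
	return 0;
}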

Signed-off-by: Johannes Weiner <[email protected]>
Reviewed-by: Michal Hocko <[email protected]>
Cc: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Vineet Gupta <[email protected]> [arch/arc bits]
Cc: James Hogan <[email protected]>
Cc: David Howells <[email protected]>
Cc: Jonas Bonn <[email protected]>
Cc: Chen Liqin <[email protected]>
Cc: Lennox Wu <[email protected]>
Cc: Chris Metcalf <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
---
arch/mn10300/mm/fault.c | 7 ++++---
arch/openrisc/mm/fault.c | 8 ++++----
arch/score/mm/fault.c | 8 ++++----
arch/tile/mm/fault.c | 8 ++++----
4 files changed, 16 insertions(+), 15 deletions(-)

diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 0945409..5ac4df5 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -329,9 +329,10 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- printk(KERN_ALERT "VM: killing process %s\n", tsk->comm);
- if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
- do_exit(SIGKILL);
+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR) {
+ pagefault_out_of_memory();
+ return;
+ }
goto no_context;

do_sigbus:
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index a5dce82..d78881c 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -246,10 +246,10 @@ out_of_memory:
__asm__ __volatile__("l.nop 1");

up_read(&mm->mmap_sem);
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 47b600e..6b18fb0 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -172,10 +172,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- printk("VM: killing process %s\n", tsk->comm);
- if (user_mode(regs))
- do_group_exit(SIGKILL);
- goto no_context;
+ if (!user_mode(regs))
+ goto no_context;
+ pagefault_out_of_memory();
+ return;

do_sigbus:
up_read(&mm->mmap_sem);
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 25b7b90..3312531 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -540,10 +540,10 @@ out_of_memory:
down_read(&mm->mmap_sem);
goto survive;
}
- pr_alert("VM: killing process %s\n", tsk->comm);
- if (!is_kernel_mode)
- do_group_exit(SIGKILL);
- goto no_context;
+ if (is_kernel_mode)
+ goto no_context;
+ pagefault_out_of_memory();
+ return 0;

do_sigbus:
up_read(&mm->mmap_sem);
--
1.8.3.2

2013-07-19 04:24:30

by Johannes Weiner

[permalink] [raw]
Subject: [patch 2/5] mm: pass userspace fault flag to generic fault handler

The global OOM killer is only invoked for userspace faults (XXX: on most
architectures), not for faults from kernelspace (uaccess, gup).

Memcg OOM handling is currently invoked for all faults. Allow it to
behave like the global case by having the architectures pass a flag to
the generic fault handler code that identifies userspace faults.
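
Each architecture hunk below applies the same mechanical transformation, extracted here as a pattern (the exact user-mode and write-access predicates vary per architecture):

	unsigned long flags = 0;

	if (user_mode(regs))		/* fault originated in userspace */
		flags |= FAULT_FLAG_USER;
	if (is_write)			/* write access, as before */
		flags |= FAULT_FLAG_WRITE;

	fault = handle_mm_fault(mm, vma, address, flags);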

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/alpha/mm/fault.c | 8 +++++++-
arch/arm/mm/fault.c | 12 +++++++++---
arch/avr32/mm/fault.c | 8 +++++++-
arch/cris/mm/fault.c | 8 +++++++-
arch/frv/mm/fault.c | 8 +++++++-
arch/hexagon/mm/vm_fault.c | 8 +++++++-
arch/ia64/mm/fault.c | 8 +++++++-
arch/m32r/mm/fault.c | 8 +++++++-
arch/m68k/mm/fault.c | 8 +++++++-
arch/microblaze/mm/fault.c | 8 +++++++-
arch/mips/mm/fault.c | 8 +++++++-
arch/mn10300/mm/fault.c | 8 +++++++-
arch/openrisc/mm/fault.c | 8 +++++++-
arch/parisc/mm/fault.c | 8 +++++++-
arch/powerpc/mm/fault.c | 8 +++++++-
arch/s390/mm/fault.c | 2 ++
arch/score/mm/fault.c | 7 ++++++-
arch/sh/mm/fault_32.c | 8 +++++++-
arch/sh/mm/tlbflush_64.c | 8 +++++++-
arch/sparc/mm/fault_32.c | 8 +++++++-
arch/sparc/mm/fault_64.c | 8 +++++++-
arch/tile/mm/fault.c | 7 ++++++-
arch/um/kernel/trap.c | 8 +++++++-
arch/unicore32/mm/fault.c | 13 +++++++++----
arch/x86/mm/fault.c | 8 ++++++--
arch/xtensa/mm/fault.c | 8 +++++++-
include/linux/mm.h | 1 +
27 files changed, 179 insertions(+), 31 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index fadd5f8..fa6b4e4 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,6 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
struct mm_struct *mm = current->mm;
const struct exception_table_entry *fixup;
int fault, si_code = SEGV_MAPERR;
+ unsigned long flags = 0;
siginfo_t info;

/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
@@ -142,10 +143,15 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (cause > 0)
+ flags |= FAULT_FLAG_WRITE;
+
/* If for any reason at all we couldn't handle the fault,
make sure we exit gracefully rather than endlessly redo
the fault. */
- fault = handle_mm_fault(mm, vma, address, cause > 0 ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
up_read(&mm->mmap_sem);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index aa33949..31b1e69 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -231,9 +231,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)

static int __kprobes
__do_page_fault(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- struct task_struct *tsk)
+ struct task_struct *tsk, struct pt_regs *regs)
{
struct vm_area_struct *vma;
+ unsigned long flags = 0;
int fault;

vma = find_vma(mm, addr);
@@ -253,11 +254,16 @@ good_area:
goto out;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (fsr & FSR_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, (fsr & FSR_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
if (unlikely(fault & VM_FAULT_ERROR))
return fault;
if (fault & VM_FAULT_MAJOR)
@@ -320,7 +326,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
#endif
}

- fault = __do_page_fault(mm, addr, fsr, tsk);
+ fault = __do_page_fault(mm, addr, fsr, tsk, regs);
up_read(&mm->mmap_sem);

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr);
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index f7040a1..ada6237 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -59,6 +59,7 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
struct mm_struct *mm;
struct vm_area_struct *vma;
const struct exception_table_entry *fixup;
+ unsigned long flags = 0;
unsigned long address;
unsigned long page;
int writeaccess;
@@ -127,12 +128,17 @@ good_area:
panic("Unhandled case %lu in do_page_fault!", ecr);
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 9dcac8e..35d096a 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -55,6 +55,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -156,13 +157,18 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess & 1)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, (writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index a325d57..2dbf219 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -35,6 +35,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long _pme, lrai, lrad, fixup;
+ unsigned long flags = 0;
siginfo_t info;
pgd_t *pge;
pud_t *pue;
@@ -158,12 +159,17 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
break;
}

+ if (user_mode(__frame))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index c10b76f..e56baf3 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -52,6 +52,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
siginfo_t info;
int si_code = SEGV_MAPERR;
int fault;
+ unsigned long flags = 0;
const struct exception_table_entry *fixup;

/*
@@ -96,7 +97,12 @@ good_area:
break;
}

- fault = handle_mm_fault(mm, vma, address, (cause > 0));
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (cause > 0)
+ flags |= FAULT_FLAG_WRITE;
+
+ fault = handle_mm_fault(mm, vma, address, flags);

/* The most common case -- we are done. */
if (likely(!(fault & VM_FAULT_ERROR))) {
diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 20b3593..ad9ef9d 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -79,6 +79,7 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
int signal = SIGSEGV, code = SEGV_MAPERR;
struct vm_area_struct *vma, *prev_vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
struct siginfo si;
unsigned long mask;
int fault;
@@ -149,12 +150,17 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
if ((vma->vm_flags & mask) != mask)
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (mask & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/
- fault = handle_mm_fault(mm, vma, address, (mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
/*
* We ran out of memory, or some other thing happened
diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 2c9aeb4..e74f6fa 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -79,6 +79,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
struct mm_struct *mm;
struct vm_area_struct * vma;
unsigned long page, addr;
+ unsigned long flags = 0;
int write;
int fault;
siginfo_t info;
@@ -188,6 +189,11 @@ good_area:
if ((error_code & ACE_INSTRUCTION) && !(vma->vm_flags & VM_EXEC))
goto bad_area;

+ if (error_code & ACE_USERMODE)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -195,7 +201,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index 2db6099..ab88a91 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -73,6 +73,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
int write, fault;

#ifdef DEBUG
@@ -134,13 +135,18 @@ good_area:
goto acc_err;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
#ifdef DEBUG
printk("handle_mm_fault returns %d\n",fault);
#endif
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index ae97d2c..b002612 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -89,6 +89,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct vm_area_struct *vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
siginfo_t info;
int code = SEGV_MAPERR;
int is_write = error_code & ESR_S;
@@ -206,12 +207,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 937cf33..e5b9fed 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -40,6 +40,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs, unsigned long writ
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -139,12 +140,17 @@ good_area:
}
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 5ac4df5..031be56 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -121,6 +121,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
{
struct vm_area_struct *vma;
struct task_struct *tsk;
+ unsigned long flags = 0;
struct mm_struct *mm;
unsigned long page;
siginfo_t info;
@@ -247,12 +248,17 @@ good_area:
break;
}

+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index d78881c..d586119 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -52,6 +52,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -153,13 +154,18 @@ good_area:
if ((vector == 0x400) && !(vma->vm_page_prot.pgprot & _PAGE_EXEC))
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write_acc)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/

- fault = handle_mm_fault(mm, vma, address, write_acc);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index 18162ce..a151e87 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -173,6 +173,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
struct vm_area_struct *vma, *prev_vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned long flags = 0;
unsigned long acc_type;
int fault;

@@ -195,13 +196,18 @@ good_area:
if ((vma->vm_flags & acc_type) != acc_type)
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (acc_type & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the
* fault.
*/

- fault = handle_mm_fault(mm, vma, address, (acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
/*
* We hit a shared mapping outside of the file, or some
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 5efe8c9..2bf339c 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -122,6 +122,7 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
{
struct vm_area_struct * vma;
struct mm_struct *mm = current->mm;
+ unsigned long flags = 0;
siginfo_t info;
int code = SEGV_MAPERR;
int is_write = 0, ret;
@@ -305,12 +306,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- ret = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ ret = handle_mm_fault(mm, vma, address, flags);
if (unlikely(ret & VM_FAULT_ERROR)) {
if (ret & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index a9a3018..fe6109c 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -301,6 +301,8 @@ static inline int do_exception(struct pt_regs *regs, int access,
address = trans_exc_code & __FAIL_ADDR_MASK;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
flags = FAULT_FLAG_ALLOW_RETRY;
+ if (regs->psw.mask & PSW_MASK_PSTATE)
+ flags |= FAULT_FLAG_USER;
if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
flags |= FAULT_FLAG_WRITE;
down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..2ca5ae5 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -101,12 +102,16 @@ good_area:
}

survive:
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault_32.c b/arch/sh/mm/fault_32.c
index 7bebd04..a61b803 100644
--- a/arch/sh/mm/fault_32.c
+++ b/arch/sh/mm/fault_32.c
@@ -126,6 +126,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct task_struct *tsk;
struct mm_struct *mm;
struct vm_area_struct * vma;
+ unsigned long flags = 0;
int si_code;
int fault;
siginfo_t info;
@@ -195,12 +196,17 @@ good_area:
goto bad_area;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/tlbflush_64.c b/arch/sh/mm/tlbflush_64.c
index e3430e0..0a9d645 100644
--- a/arch/sh/mm/tlbflush_64.c
+++ b/arch/sh/mm/tlbflush_64.c
@@ -96,6 +96,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long writeaccess,
struct mm_struct *mm;
struct vm_area_struct * vma;
const struct exception_table_entry *fixup;
+ unsigned long flags = 0;
pte_t *pte;
int fault;

@@ -184,12 +185,17 @@ good_area:
}
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (writeaccess)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, writeaccess ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index 8023fd7..efa3d48 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -222,6 +222,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
struct vm_area_struct *vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned long flags = 0;
unsigned int fixup;
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
@@ -285,12 +286,17 @@ good_area:
goto bad_area;
}

+ if (from_user)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 504c062..bc536ea 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -276,6 +276,7 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
unsigned int insn = 0;
int si_code, fault_code, fault;
unsigned long address, mm_rss;
@@ -423,7 +424,12 @@ good_area:
goto bad_area;
}

- fault = handle_mm_fault(mm, vma, address, (fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
+ if (!(regs->tstate & TSTATE_PRIV))
+ flags |= FAULT_FLAG_USER;
+ if (fault_code & FAULT_CODE_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index 3312531..b2a7fd5 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -263,6 +263,7 @@ static int handle_page_fault(struct pt_regs *regs,
struct mm_struct *mm;
struct vm_area_struct *vma;
unsigned long stack_offset;
+ unsigned long flags = 0;
int fault;
int si_code;
int is_kernel_mode;
@@ -415,12 +416,16 @@ good_area:
}

survive:
+ if (!is_kernel_mode)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index dafc947..626a85e 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -25,6 +25,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
{
struct mm_struct *mm = current->mm;
struct vm_area_struct *vma;
+ unsigned long flags = 0;
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
@@ -62,10 +63,15 @@ good_area:
if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
goto out;

+ if (is_user)
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
do {
int fault;

- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM) {
goto out_of_memory;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 283aa4b..3026943 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -169,9 +169,10 @@ static inline bool access_error(unsigned int fsr, struct vm_area_struct *vma)
}

static int __do_pf(struct mm_struct *mm, unsigned long addr, unsigned int fsr,
- struct task_struct *tsk)
+ struct task_struct *tsk, struct pt_regs *regs)
{
struct vm_area_struct *vma;
+ unsigned long flags = 0;
int fault;

vma = find_vma(mm, addr);
@@ -191,12 +192,16 @@ good_area:
goto out;
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (!(fsr ^ 0x12))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault, make
* sure we exit gracefully rather than endlessly redo the fault.
*/
- fault = handle_mm_fault(mm, vma, addr & PAGE_MASK,
- (!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr & PAGE_MASK, flags);
if (unlikely(fault & VM_FAULT_ERROR))
return fault;
if (fault & VM_FAULT_MAJOR)
@@ -252,7 +257,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
#endif
}

- fault = __do_pf(mm, addr, fsr, tsk);
+ fault = __do_pf(mm, addr, fsr, tsk, regs);
up_read(&mm->mmap_sem);

/*
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..1cebabe 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -999,8 +999,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
struct mm_struct *mm;
int fault;
int write = error_code & PF_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -1160,6 +1159,11 @@ good_area:
return;
}

+ if (error_code & PF_USER)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index e367e30..7db9fbe 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -41,6 +41,7 @@ void do_page_fault(struct pt_regs *regs)
struct mm_struct *mm = current->mm;
unsigned int exccause = regs->exccause;
unsigned int address = regs->excvaddr;
+ unsigned long flags = 0;
siginfo_t info;

int is_write, is_exec;
@@ -101,11 +102,16 @@ good_area:
if (!(vma->vm_flags & (VM_READ | VM_WRITE)))
goto bad_area;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (is_write)
+ flags |= FAULT_FLAG_WRITE;
+
/* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, is_write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
--
1.8.3.2
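
All of the per-architecture hunks above apply the same idiom; as a condensed sketch (the exact user-mode test and write-access predicate vary by architecture):

        unsigned long flags = 0;

        if (user_mode(regs))            /* fault originated in userspace */
                flags |= FAULT_FLAG_USER;
        if (is_write)                   /* fault was a write access */
                flags |= FAULT_FLAG_WRITE;

        fault = handle_mm_fault(mm, vma, address, flags);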

2013-07-19 04:25:08

by Johannes Weiner

[permalink] [raw]
Subject: [patch 3/5] x86: finish fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has been killed. For the next patch this is a problem, because
it relies on pagefault_out_of_memory() being called even when the task
has been killed, to perform proper OOM state unwinding.

This is a rather minor optimization; just remove it.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/x86/mm/fault.c | 11 -----------
1 file changed, 11 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 1cebabe..90248c9 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -846,17 +846,6 @@ static noinline int
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
- }
if (!(fault & VM_FAULT_ERROR))
return 0;

--
1.8.3.2
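
The effect of the removal, sketched against the surrounding x86 code (the pagefault_out_of_memory() call is the pre-existing OOM path that the next patch hooks into):

        /* with the bailout gone, even a killed task reaches: */
        if (fault & VM_FAULT_OOM) {
                up_read(&current->mm->mmap_sem);
                pagefault_out_of_memory();      /* memcg unwinding, next patch */
        }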

2013-07-19 04:25:53

by Johannes Weiner

[permalink] [raw]
Subject: [patch 4/5] memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock. When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds. Comparably, any other
task that enters the charge path at this point will go to a waitqueue
right then and there and sleep until the OOM situation is resolved.
The problem is that these tasks may hold filesystem locks and the
mmap_sem; locks that the selected OOM victim may need to exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The OOM-handling task will retry the charge indefinitely while the
OOM-killed task is unable to release any resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example, one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing but the userspace handler, in order
to pick an optimal victim, may need to read files from /proc/<pid>,
which tries to acquire the same mmap_sem for reading and deadlocks.
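
Schematically (a simplified sketch, not an actual trace):

  task in memcg:      brk() -> down_write(mmap_sem) -> charge fails
                      -> sleeps on the OOM waitqueue, mmap_sem still held
  userspace handler:  open("/proc/<pid>/...") -> down_read(mmap_sem)
                      -> blocks forever, so the OOM is never resolved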

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

0. When OOMing in a system call (buffered IO and friends), do not
invoke the OOM killer, do not sleep on an OOM waitqueue, just return
-ENOMEM. Userspace should be able to handle this and it prevents
anybody from looping or waiting with locks held.

1. When OOMing in a kernel fault, do not invoke the OOM killer, do not
sleep on the OOM waitqueue, just return -ENOMEM. The kernel fault
stack knows how to handle this. If a kernel fault is nested inside
a user fault, however, user fault handling applies:

2. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim cannot get stuck on locks the looping task may hold.

3. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer. In addition to the wakeup implied in
the kill signal delivery, only uncharges and limit increases, things
that actually change the memory situation, should poke the waitqueue.
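
As a rough sketch of the resulting control flow for case 3, simplified from the code below:

        /* charge path: record the OOM context instead of sleeping */
        current->memcg_oom.wait_on_memcg = memcg;
        return -ENOMEM;                 /* unwinds the entire fault stack */

        /* end of fault handling, all locks dropped: */
        void pagefault_out_of_memory(void)
        {
                if (mem_cgroup_oom_synchronize())
                        return;         /* memcg OOM: sleep and/or restart fault */
                /* otherwise handle as global OOM */
        }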

Reported-by: azurIt <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 22 +++++++
include/linux/sched.h | 6 ++
mm/filemap.c | 14 ++++-
mm/ksm.c | 2 +-
mm/memcontrol.c | 139 +++++++++++++++++++++++++++++----------------
mm/memory.c | 37 ++++++++----
mm/oom_kill.c | 2 +
7 files changed, 159 insertions(+), 63 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..b92e5e7 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,17 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+ unsigned int new)
+{
+ unsigned int old;
+
+ old = p->memcg_oom.may_oom;
+ p->memcg_oom.may_oom = new;
+
+ return old;
+}
+bool mem_cgroup_oom_synchronize(void);
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +344,17 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline unsigned int mem_cgroup_xchg_may_oom(struct task_struct *p,
+ unsigned int new)
+{
+ return 0;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..7e6c9e9 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1568,6 +1568,12 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..d18bd47 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
+ unsigned int may_oom;
struct page *page;
pgoff_t size;
int ret = 0;
@@ -1669,7 +1670,12 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack, which does not allow proper unwinding of a
+ * memcg OOM state. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page)) {
@@ -1677,10 +1683,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
} else {
- /* No page in the page cache at all */
+ /* No page in the page cache at all. */
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/ksm.c b/mm/ksm.c
index 310544a..ae7e4ae 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -338,7 +338,7 @@ static int break_ksm(struct vm_area_struct *vma, unsigned long addr)
break;
if (PageKsm(page))
ret = handle_mm_fault(vma->vm_mm, vma, addr,
- FAULT_FLAG_WRITE);
+ FAULT_FLAG_WRITE);
else
ret = VM_FAULT_WRITE;
put_page(page);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..99b0101 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -249,6 +249,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

atomic_t refcnt;

@@ -1846,6 +1847,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -1857,30 +1859,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked, need_to_kill = true;

- owait.mem = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;

/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
@@ -1888,24 +1880,86 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
spin_unlock(&memcg_oom_lock);

if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask);
} else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ memcg_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}

- mem_cgroup_unmark_under_oom(memcg);
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_put;
+
+ owait.mem = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_put:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2195,11 +2249,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2310,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2349,7 +2398,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2406,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2419,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..2be02b7 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3439,22 +3439,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3495,31 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int userfault = flags & FAULT_FLAG_USER;
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ if (userfault)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, 1) == 1);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (userfault)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize())
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
--
1.8.3.2

2013-07-19 04:26:28

by Johannes Weiner

[permalink] [raw]
Subject: [patch 5/5] mm: memcontrol: sanity check memcg OOM context unwind

Catch the cases where a memcg OOM context is set up in the failed
charge path but the fault handler is not actually returning
VM_FAULT_OOM, which would be required to properly finalize the OOM.

Example output: the first trace shows the stack at the end of
handle_mm_fault() where an unexpected memcg OOM context is detected.
The subsequent trace is of whoever set up that OOM context. In this
case it was the charging of readahead pages in a file fault, which
does not propagate VM_FAULT_OOM on failure and should disable OOM:

[ 27.805359] WARNING: at /home/hannes/src/linux/linux/mm/memory.c:3523 handle_mm_fault+0x1fb/0x3f0()
[ 27.805360] Hardware name: PowerEdge 1950
[ 27.805361] Fixing unhandled memcg OOM context, set up from:
[ 27.805362] Pid: 1599, comm: file Tainted: G W 3.2.0-00005-g6d10010 #97
[ 27.805363] Call Trace:
[ 27.805365] [<ffffffff8103dcea>] warn_slowpath_common+0x6a/0xa0
[ 27.805367] [<ffffffff8103dd91>] warn_slowpath_fmt+0x41/0x50
[ 27.805369] [<ffffffff810c8ffb>] handle_mm_fault+0x1fb/0x3f0
[ 27.805371] [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[ 27.805373] [<ffffffff810cdbfb>] ? do_mmap_pgoff+0x34b/0x360
[ 27.805376] [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[ 27.805377] ---[ end trace 305ec584fba81649 ]---
[ 27.805378] [<ffffffff810f2418>] __mem_cgroup_try_charge+0x5c8/0x7e0
[ 27.805380] [<ffffffff810f38fc>] mem_cgroup_cache_charge+0xac/0x110
[ 27.805381] [<ffffffff810a528e>] add_to_page_cache_locked+0x3e/0x120
[ 27.805383] [<ffffffff810a5385>] add_to_page_cache_lru+0x15/0x40
[ 27.805385] [<ffffffff8112dfa3>] mpage_readpages+0xc3/0x150
[ 27.805387] [<ffffffff8115c6d8>] ext4_readpages+0x18/0x20
[ 27.805388] [<ffffffff810afbe1>] __do_page_cache_readahead+0x1c1/0x270
[ 27.805390] [<ffffffff810b023c>] ra_submit+0x1c/0x20
[ 27.805392] [<ffffffff810a5eb4>] filemap_fault+0x3f4/0x450
[ 27.805394] [<ffffffff810c4a2d>] __do_fault+0x6d/0x510
[ 27.805395] [<ffffffff810c741a>] handle_pte_fault+0x8a/0x920
[ 27.805397] [<ffffffff810c8f9c>] handle_mm_fault+0x19c/0x3f0
[ 27.805398] [<ffffffff81024fa0>] do_page_fault+0x140/0x4a0
[ 27.805400] [<ffffffff813cbc6f>] page_fault+0x1f/0x30
[ 27.805401] [<ffffffffffffffff>] 0xffffffffffffffff

Debug patch only.

Not-signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/sched.h | 3 +++
mm/memcontrol.c | 7 +++++++
mm/memory.c | 9 +++++++++
3 files changed, 19 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7e6c9e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
#include <linux/latencytop.h>
#include <linux/cred.h>
#include <linux/llist.h>
+#include <linux/stacktrace.h>

#include <asm/processor.h>

@@ -1571,6 +1572,8 @@ struct task_struct {
struct memcg_oom_info {
unsigned int may_oom:1;
unsigned int in_memcg_oom:1;
+ struct stack_trace trace;
+ unsigned long trace_entries[16];
int wakeups;
struct mem_cgroup *wait_on_memcg;
} memcg_oom;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 99b0101..c47c77e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/stacktrace.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -1870,6 +1871,12 @@ static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)

current->memcg_oom.in_memcg_oom = 1;

+ current->memcg_oom.trace.nr_entries = 0;
+ current->memcg_oom.trace.max_entries = 16;
+ current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+ current->memcg_oom.trace.skip = 1;
+ save_stack_trace(&current->memcg_oom.trace);
+
/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
diff --git a/mm/memory.c b/mm/memory.c
index 2be02b7..fc6d741 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/stacktrace.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3517,6 +3518,14 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (userfault)
WARN_ON(mem_cgroup_xchg_may_oom(current, 0) == 0);

+#ifdef CONFIG_CGROUP_MEM_RES_CTLR
+ if (WARN(!(ret & VM_FAULT_OOM) && current->memcg_oom.in_memcg_oom,
+ "Fixing unhandled memcg OOM context, set up from:\n")) {
+ print_stack_trace(&current->memcg_oom.trace, 0);
+ mem_cgroup_oom_synchronize();
+ }
+#endif
+
return ret;
}

--
1.8.3.2

2013-07-19 08:23:48

by azurIt

[permalink] [raw]
Subject: Re: [PATCH for 3.2] memcg: do not trap chargers with full callstack on OOM

> CC: [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>On Tue, Jul 16, 2013 at 12:48:30PM -0400, Johannes Weiner wrote:
>> On Tue, Jul 16, 2013 at 06:09:05PM +0200, Michal Hocko wrote:
>> > On Tue 16-07-13 11:35:44, Johannes Weiner wrote:
>> > > On Mon, Jul 15, 2013 at 06:00:06PM +0200, Michal Hocko wrote:
>> > > > On Mon 15-07-13 17:41:19, Michal Hocko wrote:
>> > > > > On Sun 14-07-13 01:51:12, azurIt wrote:
>> > > > > > > CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>> > > > > > >> CC: "Johannes Weiner" <[email protected]>, [email protected], [email protected], "cgroups mailinglist" <[email protected]>, "KAMEZAWA Hiroyuki" <[email protected]>, [email protected]
>> > > > > > >>On Wed 10-07-13 18:25:06, azurIt wrote:
>> > > > > > >>> >> Now i realized that i forgot to remove UID from that cgroup before
>> > > > > > >>> >> trying to remove it, so cgroup cannot be removed anyway (we are using
>> > > > > > >>> >> third party cgroup called cgroup-uid from Andrea Righi, which is able
>> > > > > > >>> >> to associate all user's processes with target cgroup). Look here for
>> > > > > > >>> >> cgroup-uid patch:
>> > > > > > >>> >> https://www.develer.com/~arighi/linux/patches/cgroup-uid/cgroup-uid-v8.patch
>> > > > > > >>> >>
>> > > > > > >>> >> ANYWAY, i'm 101% sure that 'tasks' file was empty and 'under_oom' was
>> > > > > > >>> >> permanently '1'.
>> > > > > > >>> >
>> > > > > > >>> >This is really strange. Could you post the whole diff against stable
>> > > > > > >>> >tree you are using (except for grsecurity stuff and the above cgroup-uid
>> > > > > > >>> >patch)?
>> > > > > > >>>
>> > > > > > >>>
>> > > > > > >>> Here are all patches which i applied to kernel 3.2.48 in my last test:
>> > > > > > >>> http://watchdog.sk/lkml/patches3/
>> > > > > > >>
>> > > > > > >>The two patches from Johannes seem correct.
>> > > > > > >>
>> > > > > > >>From a quick look even grsecurity patchset shouldn't interfere as it
>> > > > > > >>doesn't seem to put any code between handle_mm_fault and mm_fault_error
>> > > > > > >>and there also doesn't seem to be any new handle_mm_fault call sites.
>> > > > > > >>
>> > > > > > >>But I cannot tell there aren't other code paths which would lead to a
>> > > > > > >>memcg charge, thus oom, without proper FAULT_FLAG_KERNEL handling.
>> > > > > > >
>> > > > > > >
>> > > > > > >Michal,
>> > > > > > >
>> > > > > > >now i can definitely confirm that problem with unremovable cgroups
>> > > > > > >persists. What info do you need from me? I applied also your little
>> > > > > > >'WARN_ON' patch.
>> > > > > >
>> > > > > > Ok, i think you want this:
>> > > > > > http://watchdog.sk/lkml/kern4.log
>> > > > >
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589087] [ pid ] uid tgid total_vm rss cpu oom_adj oom_score_adj name
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589451] [12021] 1333 12021 172027 64723 4 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589647] [12030] 1333 12030 172030 64748 2 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.589836] [12031] 1333 12031 172030 64749 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590025] [12032] 1333 12032 170619 63428 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590213] [12033] 1333 12033 167934 60524 2 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590401] [12034] 1333 12034 170747 63496 4 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590588] [12035] 1333 12035 169659 62451 1 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590776] [12036] 1333 12036 167614 60384 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.590984] [12037] 1333 12037 166342 58964 3 0 0 apache2
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591178] Memory cgroup out of memory: Kill process 12021 (apache2) score 847 or sacrifice child
>> > > > > Jul 14 01:11:39 server01 kernel: [ 593.591370] Killed process 12021 (apache2) total-vm:688108kB, anon-rss:255472kB, file-rss:3420kB
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.392920] ------------[ cut here ]------------
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393096] WARNING: at kernel/exit.c:888 do_exit+0x7d0/0x870()
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393256] Hardware name: S5000VSA
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393415] Pid: 12037, comm: apache2 Not tainted 3.2.48-grsec #1
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393577] Call Trace:
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393737] [<ffffffff8105520a>] warn_slowpath_common+0x7a/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.393903] [<ffffffff8105525a>] warn_slowpath_null+0x1a/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394068] [<ffffffff81059c50>] do_exit+0x7d0/0x870
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394231] [<ffffffff81050254>] ? thread_group_times+0x44/0xb0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394392] [<ffffffff81059d41>] do_group_exit+0x51/0xc0
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394551] [<ffffffff81059dc7>] sys_exit_group+0x17/0x20
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394714] [<ffffffff815caea6>] system_call_fastpath+0x18/0x1d
>> > > > > Jul 14 01:11:41 server01 kernel: [ 595.394921] ---[ end trace 738570e688acf099 ]---
>> > > > >
>> > > > > OK, so you had an OOM which has been handled by in-kernel oom handler
>> > > > > (it killed 12021) and 12037 was in the same group. The warning tells us
>> > > > > that it went through mem_cgroup_oom as well (otherwise it wouldn't have
>> > > > > memcg_oom.wait_on_memcg set and the warning wouldn't trigger) and then
>> > > > > it exited on the userspace request (by exit syscall).
>> > > > >
>> > > > > I do not see any way how this could happen, though. If mem_cgroup_oom
>> > > > > is called then we always return CHARGE_NOMEM which turns into ENOMEM
>> > > > > returned by __mem_cgroup_try_charge (invoke_oom must have been set to
>> > > > > true). So if nobody screwed the return value on the way up to page
>> > > > > fault handler then there is no way to escape.
>> > > > >
>> > > > > I will check the code.
>> > > >
>> > > > OK, I guess I found it:
>> > > > __do_fault
>> > > > fault = filemap_fault
>> > > > do_async_mmap_readahead
>> > > > page_cache_async_readahead
>> > > > ondemand_readahead
>> > > > __do_page_cache_readahead
>> > > > read_pages
>> > > > readpages = ext3_readpages
>> > > > mpage_readpages # Doesn't propagate ENOMEM
>> > > > add_to_page_cache_lru
>> > > > add_to_page_cache
>> > > > add_to_page_cache_locked
>> > > > mem_cgroup_cache_charge
>> > > >
>> > > > So the read ahead most probably. Again! Duhhh. I will try to think
>> > > > about a fix for this. One obvious place is mpage_readpages but
>> > > > __do_page_cache_readahead ignores read_pages return value as well and
>> > > > page_cache_async_readahead, even worse, is just void and exported as
>> > > > such.
>> > > >
>> > > > So this smells like a hard to fix bugger. One possible, and really ugly
>> > > > way would be calling mem_cgroup_oom_synchronize even if handle_mm_fault
>> > > > doesn't return VM_FAULT_ERROR, but that is a crude hack.
>
>I fixed it by disabling the OOM killer altogether for the readahead code.
>We don't do it globally, and we should not do it in the memcg; these are
>optional allocations/charges.
>
>I also disabled it for kernel faults triggered from within a syscall
>(copy_*user, get_user_pages), which should just return -ENOMEM as
>usual (unless it's nested inside a userspace fault). The only
>downside is that we can't get around annotating userspace faults
>anymore, so every architecture fault handler now passes
>FAULT_FLAG_USER to handle_mm_fault(). Makes the series a little less
>self-contained, but it's not unreasonable.
>
>It's easy to detect leaks now by checking if the memcg OOM context is
>setup and we are not returning VM_FAULT_OOM.
>
>Here is a combined diff based on 3.2. azurIt, any chance you could
>give this a shot? I tested it on my local machines, but you have a
>known reproducer of fairly unlikely scenarios...


I will be out of the office between 25.7. and 1.8. and I don't want to run anything which could potentially cause an outage of our services. I will test this patch after 2.8. Should I also use the previous patches, or is this one enough? Thank you very much, Johannes.

azur

2013-07-24 20:32:17

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal

On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has been killed. For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization; just remove it.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> arch/x86/mm/fault.c | 11 -----------
> 1 file changed, 11 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 1cebabe..90248c9 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -846,17 +846,6 @@ static noinline int
> mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, unsigned int fault)
> {
> - /*
> - * Pagefault was interrupted by SIGKILL. We have no reason to
> - * continue pagefault.
> - */
> - if (fatal_signal_pending(current)) {
> - if (!(fault & VM_FAULT_RETRY))
> - up_read(&current->mm->mmap_sem);
> - if (!(error_code & PF_USER))
> - no_context(regs, error_code, address);
> - return 1;

This is broken but I only hit it now after testing for a while.

The patch has the right idea: in case of an OOM kill, we should
continue the fault and not abort. What I missed is that in case of a
kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
exit the fault and not do up_read() etc. This introduced a locking
imbalance that would get everybody hung on mmap_sem.

I moved the retry handling outside of mm_fault_error() (come on...)
and stole some documentation from arm. It's now a little bit more
explicit and comparable to other architectures.

I'll send an updated series, patch for reference:

---
From: Johannes Weiner <[email protected]>
Subject: [patch] x86: finish fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has been killed. For the next patch this is a problem, because
it relies on pagefault_out_of_memory() being called even when the task
has been killed, to perform proper OOM state unwinding.

This is a rather minor optimization that cuts short the fault handling
by a few instructions in rare cases. Just remove it.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/x86/mm/fault.c | 33 +++++++++++++--------------------
1 file changed, 13 insertions(+), 20 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..0c18beb 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address, 0, 0);
- return 1;
- }
- if (!(fault & VM_FAULT_ERROR))
- return 0;
-
if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address,
SIGSEGV, SEGV_MAPERR);
- return 1;
+ return;
}

up_read(&current->mm->mmap_sem);
@@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1174,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
--
1.8.3.2

2013-07-25 20:29:00

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal

(7/24/13 4:32 PM), Johannes Weiner wrote:
> On Fri, Jul 19, 2013 at 12:25:02AM -0400, Johannes Weiner wrote:
>> The x86 fault handler bails in the middle of error handling when the
>> task has been killed. For the next patch this is a problem, because
>> it relies on pagefault_out_of_memory() being called even when the task
>> has been killed, to perform proper OOM state unwinding.
>>
>> This is a rather minor optimization; just remove it.
>>
>> Signed-off-by: Johannes Weiner <[email protected]>
>> ---
>> arch/x86/mm/fault.c | 11 -----------
>> 1 file changed, 11 deletions(-)
>>
>> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
>> index 1cebabe..90248c9 100644
>> --- a/arch/x86/mm/fault.c
>> +++ b/arch/x86/mm/fault.c
>> @@ -846,17 +846,6 @@ static noinline int
>> mm_fault_error(struct pt_regs *regs, unsigned long error_code,
>> unsigned long address, unsigned int fault)
>> {
>> - /*
>> - * Pagefault was interrupted by SIGKILL. We have no reason to
>> - * continue pagefault.
>> - */
>> - if (fatal_signal_pending(current)) {
>> - if (!(fault & VM_FAULT_RETRY))
>> - up_read(&current->mm->mmap_sem);
>> - if (!(error_code & PF_USER))
>> - no_context(regs, error_code, address);
>> - return 1;
>
> This is broken but I only hit it now after testing for a while.
>
> The patch has the right idea: in case of an OOM kill, we should
> continue the fault and not abort. What I missed is that in case of a
> kill during lock_page, i.e. VM_FAULT_RETRY && fatal_signal, we have to
> exit the fault and not do up_read() etc. This introduced a locking
> imbalance that would get everybody hung on mmap_sem.
>
> I moved the retry handling outside of mm_fault_error() (come on...)
> and stole some documentation from arm. It's now a little bit more
> explicit and comparable to other architectures.
>
> I'll send an updated series, patch for reference:
>
> ---
> From: Johannes Weiner <[email protected]>
> Subject: [patch] x86: finish fault error path with fatal signal
>
> The x86 fault handler bails in the middle of error handling when the
> task has been killed. For the next patch this is a problem, because
> it relies on pagefault_out_of_memory() being called even when the task
> has been killed, to perform proper OOM state unwinding.
>
> This is a rather minor optimization that cuts short the fault handling
> by a few instructions in rare cases. Just remove it.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> arch/x86/mm/fault.c | 33 +++++++++++++--------------------
> 1 file changed, 13 insertions(+), 20 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 6d77c38..0c18beb 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -842,31 +842,17 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> force_sig_info_fault(SIGBUS, code, address, tsk, fault);
> }
>
> -static noinline int
> +static noinline void
> mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, unsigned int fault)
> {
> - /*
> - * Pagefault was interrupted by SIGKILL. We have no reason to
> - * continue pagefault.
> - */
> - if (fatal_signal_pending(current)) {
> - if (!(fault & VM_FAULT_RETRY))
> - up_read(&current->mm->mmap_sem);
> - if (!(error_code & PF_USER))
> - no_context(regs, error_code, address, 0, 0);
> - return 1;
> - }
> - if (!(fault & VM_FAULT_ERROR))
> - return 0;
> -
> if (fault & VM_FAULT_OOM) {
> /* Kernel mode? Handle exceptions or die: */
> if (!(error_code & PF_USER)) {
> up_read(&current->mm->mmap_sem);
> no_context(regs, error_code, address,
> SIGSEGV, SEGV_MAPERR);
> - return 1;
> + return;
> }
>
> up_read(&current->mm->mmap_sem);
> @@ -884,7 +870,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> else
> BUG();
> }
> - return 1;
> }
>
> static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> @@ -1189,9 +1174,17 @@ good_area:
> */
> fault = handle_mm_fault(mm, vma, address, flags);
>
> - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> - if (mm_fault_error(regs, error_code, address, fault))
> - return;
> + /*
> + * If we need to retry but a fatal signal is pending, handle the
> + * signal first. We do not need to release the mmap_sem because it
> + * would already be released in __lock_page_or_retry in mm/filemap.c.
> + */
> + if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> + return;
> +
> + if (unlikely(fault & VM_FAULT_ERROR)) {
> + mm_fault_error(regs, error_code, address, fault);
> + return;
> }

When I wrote the code you removed here, Ingo suggested that we put all rare-case code
into an if (unlikely()) block. Yes, this is purely a micro-optimization, but it is not
costly to maintain.
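
For reference, the suggestion amounts to (sketch only):

        if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
                return;

        if (unlikely(fault & VM_FAULT_ERROR)) {
                mm_fault_error(regs, error_code, address, fault);
                return;
        }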



2013-07-25 21:50:44

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/5] x86: finish fault error path with fatal signal

On Thu, Jul 25, 2013 at 04:29:13PM -0400, KOSAKI Motohiro wrote:
> (7/24/13 4:32 PM), Johannes Weiner wrote:
> >@@ -1189,9 +1174,17 @@ good_area:
> > */
> > fault = handle_mm_fault(mm, vma, address, flags);
> >
> >- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> >- if (mm_fault_error(regs, error_code, address, fault))
> >- return;
> >+ /*
> >+ * If we need to retry but a fatal signal is pending, handle the
> >+ * signal first. We do not need to release the mmap_sem because it
> >+ * would already be released in __lock_page_or_retry in mm/filemap.c.
> >+ */
> >+ if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> >+ return;
> >+
> >+ if (unlikely(fault & VM_FAULT_ERROR)) {
> >+ mm_fault_error(regs, error_code, address, fault);
> >+ return;
> > }
>
> When I wrote the code you removed here, Ingo suggested that we put all rare-case code
> into an if (unlikely()) block. Yes, this is purely a micro-optimization, but it is not
> costly to maintain.

Fair enough, thanks for the heads up!