2013-07-25 22:25:55

by Johannes Weiner

Subject: [patch 0/6] improve memcg oom killer robustness

The memcg code can trap tasks in the context of the failing allocation
until the OOM situation is resolved.  These tasks can hold all kinds
of locks (fs, mm) at this point, which makes the whole scheme prone to
deadlocks.

This series converts memcg OOM handling into a two-step process that
is started in the charge context, but any waiting is done after the
fault stack is fully unwound.

Patches 1-4 prepare architecture handlers to support the new memcg
requirements, but in doing so they also remove old cruft and unify
out-of-memory behavior across architectures.

Patch 5 disables the memcg OOM handling for syscalls, readahead, and
kernel faults, because these paths can gracefully unwind the stack
with -ENOMEM.  OOM handling is restricted to user-triggered faults,
which have no other option.

Patch 6 implements the two-part OOM handling such that tasks are never
trapped with the full charge stack in an OOM situation.
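
For orientation, the gist of the two-step scheme can be modeled in a
few lines of plain userspace C.  This is only an illustrative sketch
with stand-in types and helper names (task_model, charge_fails(),
oom_synchronize()); the real interfaces are the ones introduced by
patches 5 and 6 below.

#include <errno.h>
#include <stdbool.h>
#include <stdio.h>

struct mem_cgroup { int id; };

struct task_model {
        bool may_oom;                   /* only set for user-triggered faults */
        bool in_memcg_oom;              /* step one ran and recorded an OOM */
        struct mem_cgroup *wait_on_memcg; /* somebody else is handling it */
};

/* Step one, charge context: record the OOM, never sleep here. */
static int charge_fails(struct task_model *tsk, struct mem_cgroup *memcg)
{
        if (tsk->may_oom) {
                tsk->in_memcg_oom = true;
                tsk->wait_on_memcg = memcg;
        }
        return -ENOMEM;                 /* unwind the fault stack */
}

/* Step two, after the fault stack is unwound and no locks are held. */
static bool oom_synchronize(struct task_model *tsk)
{
        if (!tsk->in_memcg_oom)
                return false;           /* not a memcg OOM, use global handling */
        if (tsk->wait_on_memcg)
                printf("sleep on memcg %d's OOM waitqueue, then restart the fault\n",
                       tsk->wait_on_memcg->id);
        tsk->in_memcg_oom = false;
        tsk->wait_on_memcg = NULL;
        return true;
}

int main(void)
{
        struct mem_cgroup cg = { .id = 1 };
        struct task_model tsk = { .may_oom = true };

        charge_fails(&tsk, &cg);        /* what the charge path does on failure */
        oom_synchronize(&tsk);          /* what pagefault_out_of_memory() does */
        return 0;
}

The important property is that the charge path only records state and
returns; all sleeping happens after the stack is unwound.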

arch/alpha/mm/fault.c | 7 ++++---
arch/arc/mm/fault.c | 11 ++++-------
arch/arm/mm/fault.c | 23 +++++++++++++----------
arch/arm64/mm/fault.c | 23 +++++++++++++----------
arch/avr32/mm/fault.c | 4 +++-
arch/cris/mm/fault.c | 6 ++++--
arch/frv/mm/fault.c | 10 ++++++----
arch/hexagon/mm/vm_fault.c | 6 ++++--
arch/ia64/mm/fault.c | 6 ++++--
arch/m32r/mm/fault.c | 10 ++++++----
arch/m68k/mm/fault.c | 2 ++
arch/metag/mm/fault.c | 6 ++++--
arch/microblaze/mm/fault.c | 7 +++++--
arch/mips/mm/fault.c | 8 ++++++--
arch/mn10300/mm/fault.c | 2 ++
arch/openrisc/mm/fault.c | 1 +
arch/parisc/mm/fault.c | 7 +++++--
arch/powerpc/mm/fault.c | 7 ++++---
arch/s390/mm/fault.c | 2 ++
arch/score/mm/fault.c | 13 ++++++-------
arch/sh/mm/fault.c | 9 ++++++---
arch/sparc/mm/fault_32.c | 12 +++++++++---
arch/sparc/mm/fault_64.c | 8 +++++---
arch/tile/mm/fault.c | 13 +++++--------
arch/um/kernel/trap.c | 22 ++++++++++++++--------
arch/unicore32/mm/fault.c | 22 +++++++++++++---------
arch/x86/mm/fault.c | 43 ++++++++++++++++++++++---------------------
arch/xtensa/mm/fault.c | 2 ++
include/linux/memcontrol.h | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/mm.h | 1 +
include/linux/sched.h | 6 ++++++
mm/filemap.c | 11 ++++++++++-
mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++----------------------------------------------------
mm/memory.c | 43 +++++++++++++++++++++++++++++++++----------
mm/oom_kill.c | 7 +++++--
35 files changed, 373 insertions(+), 183 deletions(-)


2013-07-25 22:25:59

by Johannes Weiner

Subject: [patch 2/6] arch: mm: do not invoke OOM killer on kernel fault OOM

Kernel faults are expected to handle OOM conditions gracefully (gup,
uaccess, etc.), so they should never invoke the OOM killer.  Reserve
this for faults triggered in user context, where it is the only option.

Most architectures already do this; fix up the remaining few.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/arm/mm/fault.c | 14 +++++++-------
arch/arm64/mm/fault.c | 14 +++++++-------
arch/avr32/mm/fault.c | 2 +-
arch/mips/mm/fault.c | 2 ++
arch/um/kernel/trap.c | 2 ++
arch/unicore32/mm/fault.c | 14 +++++++-------
6 files changed, 26 insertions(+), 22 deletions(-)

diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index c97f794..217bcbf 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -349,6 +349,13 @@ retry:
if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we
+ * have no context to handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -359,13 +366,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we
- * have no context to handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index 0ecac89..dab1cfd 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -294,6 +294,13 @@ retry:
VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we have no context to
+ * handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -304,13 +311,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we have no context to
- * handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to successfully fix up
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index b2f2d2d..2ca27b0 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -228,9 +228,9 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- pagefault_out_of_memory();
if (!user_mode(regs))
goto no_context;
+ pagefault_out_of_memory();
return;

do_sigbus:
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 85df1cd..94d3a31 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -241,6 +241,8 @@ out_of_memory:
* (which will retry the fault, or kill us if we got oom-killed).
*/
up_read(&mm->mmap_sem);
+ if (!user_mode(regs))
+ goto no_context;
pagefault_out_of_memory();
return;

diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index 089f398..b2f5adf 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -124,6 +124,8 @@ out_of_memory:
* (which will retry the fault, or kill us if we got oom-killed).
*/
up_read(&mm->mmap_sem);
+ if (!is_user)
+ goto out_nosemaphore;
pagefault_out_of_memory();
return 0;
}
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index f9b5c10..8ed3c45 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -278,6 +278,13 @@ retry:
(VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
return 0;

+ /*
+ * If we are in kernel mode at this point, we
+ * have no context to handle this fault with.
+ */
+ if (!user_mode(regs))
+ goto no_context;
+
if (fault & VM_FAULT_OOM) {
/*
* We ran out of memory, call the OOM killer, and return to
@@ -288,13 +295,6 @@ retry:
return 0;
}

- /*
- * If we are in kernel mode at this point, we
- * have no context to handle this fault with.
- */
- if (!user_mode(regs))
- goto no_context;
-
if (fault & VM_FAULT_SIGBUS) {
/*
* We had some memory, but were unable to
--
1.8.3.2

2013-07-25 22:26:07

by Johannes Weiner

Subject: [patch 4/6] x86: finish user fault error path with fatal signal

The x86 fault handler bails in the middle of error handling when the
task has a fatal signal pending.  This is a problem for a subsequent
patch, which in OOM situations relies on pagefault_out_of_memory()
being called even when the task has been killed, in order to perform
proper per-task OOM state unwinding.

Shortcutting the fault like this is a rather minor optimization that
saves a few instructions in rare cases. Just remove it for
user-triggered faults.

Use the opportunity to split the fault retry handling from actual
fault errors and add locking documentation that reads surprisingly
similar to ARM's.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/x86/mm/fault.c | 35 +++++++++++++++++------------------
1 file changed, 17 insertions(+), 18 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 6d77c38..3aaeffc 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,23 +842,15 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address, 0, 0);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address, 0, 0);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
@@ -866,7 +858,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address,
SIGSEGV, SEGV_MAPERR);
- return 1;
+ return;
}

up_read(&current->mm->mmap_sem);
@@ -884,7 +876,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1189,9 +1180,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
--
1.8.3.2

2013-07-25 22:26:05

by Johannes Weiner

Subject: [patch 3/6] arch: mm: pass userspace fault flag to generic fault handler

Unlike global OOM handling, the memory cgroup code will invoke the
OOM killer in any OOM situation because it has no way of telling
faults occurring in kernel context - which could be handled more
gracefully - from user-triggered faults.

Pass a flag that identifies faults originating in user space from the
architecture-specific fault handlers to generic code so that memcg OOM
handling can be improved.
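
The intended division of labor can be sketched roughly as below.  The
regs_model structure and both helpers are made-up stand-ins for
illustration; only the two FAULT_FLAG_* values match the actual
definitions in include/linux/mm.h touched by this patch.

#include <stdbool.h>
#include <stdio.h>

#define FAULT_FLAG_WRITE        0x01    /* existing flag in include/linux/mm.h */
#define FAULT_FLAG_USER         0x80    /* new flag added by this patch */

struct regs_model { bool user_mode; bool write; };

/* Arch side: translate register state into generic fault flags. */
static unsigned int arch_fault_flags(const struct regs_model *regs)
{
        unsigned int flags = 0;

        if (regs->user_mode)
                flags |= FAULT_FLAG_USER;
        if (regs->write)
                flags |= FAULT_FLAG_WRITE;
        return flags;
}

/* Generic side: only user faults may end up in memcg OOM handling. */
static void generic_fault(unsigned int flags)
{
        if (flags & FAULT_FLAG_USER)
                printf("user fault: memcg OOM handling allowed\n");
        else
                printf("kernel fault: fail the charge with -ENOMEM\n");
}

int main(void)
{
        struct regs_model regs = { .user_mode = true, .write = true };

        generic_fault(arch_fault_flags(&regs));
        return 0;
}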

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/alpha/mm/fault.c | 7 ++++---
arch/arc/mm/fault.c | 6 ++++--
arch/arm/mm/fault.c | 9 ++++++---
arch/arm64/mm/fault.c | 9 ++++++---
arch/avr32/mm/fault.c | 2 ++
arch/cris/mm/fault.c | 6 ++++--
arch/frv/mm/fault.c | 10 ++++++----
arch/hexagon/mm/vm_fault.c | 6 ++++--
arch/ia64/mm/fault.c | 6 ++++--
arch/m32r/mm/fault.c | 10 ++++++----
arch/m68k/mm/fault.c | 2 ++
arch/metag/mm/fault.c | 6 ++++--
arch/microblaze/mm/fault.c | 7 +++++--
arch/mips/mm/fault.c | 6 ++++--
arch/mn10300/mm/fault.c | 2 ++
arch/openrisc/mm/fault.c | 1 +
arch/parisc/mm/fault.c | 7 +++++--
arch/powerpc/mm/fault.c | 7 ++++---
arch/s390/mm/fault.c | 2 ++
arch/score/mm/fault.c | 7 ++++++-
arch/sh/mm/fault.c | 9 ++++++---
arch/sparc/mm/fault_32.c | 12 +++++++++---
arch/sparc/mm/fault_64.c | 8 +++++---
arch/tile/mm/fault.c | 7 +++++--
arch/um/kernel/trap.c | 20 ++++++++++++--------
arch/unicore32/mm/fault.c | 8 ++++++--
arch/x86/mm/fault.c | 8 +++++---
arch/xtensa/mm/fault.c | 2 ++
include/linux/mm.h | 1 +
29 files changed, 132 insertions(+), 61 deletions(-)

diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
index 0c4132d..98838a0 100644
--- a/arch/alpha/mm/fault.c
+++ b/arch/alpha/mm/fault.c
@@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
const struct exception_table_entry *fixup;
int fault, si_code = SEGV_MAPERR;
siginfo_t info;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (cause > 0 ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/* As of EV6, a load into $31/$f31 is a prefetch, and never faults
(or is suppressed by the PALcode). Support that for older CPUs
@@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
if (address >= TASK_SIZE)
goto vmalloc_fault;
#endif
-
+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -142,6 +142,7 @@ retry:
} else {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
}

/* If for any reason at all we couldn't handle the fault,
diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 6b0bb41..d63f3de 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
siginfo_t info;
int fault, ret;
int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/*
* We fault-in kernel-space virtual memory on-demand. The
@@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -117,6 +118,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
index 217bcbf..eb8830a 100644
--- a/arch/arm/mm/fault.c
+++ b/arch/arm/mm/fault.c
@@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- int write = fsr & FSR_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

if (notify_page_fault(regs, fsr))
return 0;
@@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (fsr & FSR_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
index dab1cfd..12205b4 100644
--- a/arch/arm64/mm/fault.c
+++ b/arch/arm64/mm/fault.c
@@ -208,9 +208,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- bool write = (esr & ESR_WRITE) && !(esr & ESR_CM);
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -226,6 +224,11 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if ((esr & ESR_WRITE) && !(esr & ESR_CM))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
index 2ca27b0..0eca933 100644
--- a/arch/avr32/mm/fault.c
+++ b/arch/avr32/mm/fault.c
@@ -86,6 +86,8 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)

local_irq_enable();

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
index 73312ab..1790f22 100644
--- a/arch/cris/mm/fault.c
+++ b/arch/cris/mm/fault.c
@@ -58,8 +58,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
struct vm_area_struct * vma;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- ((writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

D(printk(KERN_DEBUG
"Page fault for %lX on %X at %lX, prot %d write %d\n",
@@ -117,6 +116,8 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -155,6 +156,7 @@ retry:
} else if (writeaccess == 1) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
index 331c1e2..9a66372 100644
--- a/arch/frv/mm/fault.c
+++ b/arch/frv/mm/fault.c
@@ -34,11 +34,11 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
struct vm_area_struct *vma;
struct mm_struct *mm;
unsigned long _pme, lrai, lrad, fixup;
+ unsigned long flags = 0;
siginfo_t info;
pgd_t *pge;
pud_t *pue;
pte_t *pte;
- int write;
int fault;

#if 0
@@ -81,6 +81,9 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(__frame))
+ flags |= FAULT_FLAG_USER;
+
down_read(&mm->mmap_sem);

vma = find_vma(mm, ear0);
@@ -129,7 +132,6 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
*/
good_area:
info.si_code = SEGV_ACCERR;
- write = 0;
switch (esr0 & ESR0_ATXC) {
default:
/* handle write to write protected page */
@@ -140,7 +142,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
#endif
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
- write = 1;
+ flags |= FAULT_FLAG_WRITE;
break;

/* handle read from protected page */
@@ -162,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, ear0, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
index 1bd276d..8704c93 100644
--- a/arch/hexagon/mm/vm_fault.c
+++ b/arch/hexagon/mm/vm_fault.c
@@ -53,8 +53,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
int si_code = SEGV_MAPERR;
int fault;
const struct exception_table_entry *fixup;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (cause > 0 ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

/*
* If we're in an interrupt or have no user context,
@@ -65,6 +64,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)

local_irq_enable();

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -96,6 +97,7 @@ good_area:
case FLT_STORE:
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
break;
}

diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
index 6cf0341..7225dad 100644
--- a/arch/ia64/mm/fault.c
+++ b/arch/ia64/mm/fault.c
@@ -90,8 +90,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
| (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));

- flags |= ((mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
-
/* mmap_sem is performance critical.... */
prefetchw(&mm->mmap_sem);

@@ -119,6 +117,10 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
if (notify_page_fault(regs, TRAP_BRKPT))
return;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (mask & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
index 3cdfa9c..e9c6a80 100644
--- a/arch/m32r/mm/fault.c
+++ b/arch/m32r/mm/fault.c
@@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
struct mm_struct *mm;
struct vm_area_struct * vma;
unsigned long page, addr;
- int write;
+ unsigned long flags = 0;
int fault;
siginfo_t info;

@@ -117,6 +117,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (error_code & ACE_USERMODE)
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -166,14 +169,13 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
*/
good_area:
info.si_code = SEGV_ACCERR;
- write = 0;
switch (error_code & (ACE_WRITE|ACE_PROTECTION)) {
default: /* 3: write, present */
/* fall through */
case ACE_WRITE: /* write, not present */
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
- write++;
+ flags |= FAULT_FLAG_WRITE;
break;
case ACE_PROTECTION: /* read, present */
case 0: /* read, not present */
@@ -194,7 +196,7 @@ good_area:
*/
addr = (address & PAGE_MASK);
set_thread_fault_code(error_code);
- fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
+ fault = handle_mm_fault(mm, vma, addr, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
index a563727..eb1d61f 100644
--- a/arch/m68k/mm/fault.c
+++ b/arch/m68k/mm/fault.c
@@ -88,6 +88,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
index 8fddf46..332680e 100644
--- a/arch/metag/mm/fault.c
+++ b/arch/metag/mm/fault.c
@@ -53,8 +53,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
struct vm_area_struct *vma, *prev_vma;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write_access ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;

@@ -109,6 +108,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

@@ -121,6 +122,7 @@ good_area:
if (write_access) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
goto bad_area;
diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
index 731f739..fa4cf52 100644
--- a/arch/microblaze/mm/fault.c
+++ b/arch/microblaze/mm/fault.c
@@ -92,8 +92,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
int code = SEGV_MAPERR;
int is_write = error_code & ESR_S;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (is_write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

regs->ear = address;
regs->esr = error_code;
@@ -121,6 +120,9 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
die("Weird page fault", regs, SIGSEGV);
}

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -199,6 +201,7 @@ good_area:
if (unlikely(is_write)) {
if (unlikely(!(vma->vm_flags & VM_WRITE)))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
/* protection fault */
diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
index 94d3a31..becc42b 100644
--- a/arch/mips/mm/fault.c
+++ b/arch/mips/mm/fault.c
@@ -42,8 +42,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
const int field = sizeof(unsigned long) * 2;
siginfo_t info;
int fault;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

#if 0
printk("Cpu%d[%s:%d:%0*lx:%ld:%0*lx]\n", raw_smp_processor_id(),
@@ -93,6 +92,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -114,6 +115,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (cpu_has_rixi) {
if (address == regs->cp0_epc && !(vma->vm_flags & VM_EXEC)) {
diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
index 8a2e6de..3516cbd 100644
--- a/arch/mn10300/mm/fault.c
+++ b/arch/mn10300/mm/fault.c
@@ -171,6 +171,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
if (in_atomic() || !mm)
goto no_context;

+ if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);

diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
index 4a41f84..0703acf 100644
--- a/arch/openrisc/mm/fault.c
+++ b/arch/openrisc/mm/fault.c
@@ -86,6 +86,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
if (user_mode(regs)) {
/* Exception was in userspace: reenable interrupts */
local_irq_enable();
+ flags |= FAULT_FLAG_USER;
} else {
/* If exception was in a syscall, then IRQ's may have
* been enabled or disabled. If they were enabled,
diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
index f247a34..d10d27a 100644
--- a/arch/parisc/mm/fault.c
+++ b/arch/parisc/mm/fault.c
@@ -180,6 +180,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (acc_type & VM_WRITE)
+ flags |= FAULT_FLAG_WRITE;
retry:
down_read(&mm->mmap_sem);
vma = find_vma_prev(mm, address, &prev_vma);
@@ -203,8 +207,7 @@ good_area:
* fault.
*/

- fault = handle_mm_fault(mm, vma, address,
- flags | ((acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0));
+ fault = handle_mm_fault(mm, vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
return;
diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 8726779..d9196c9 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -223,9 +223,6 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
is_write = error_code & ESR_DST;
#endif /* CONFIG_4xx || CONFIG_BOOKE */

- if (is_write)
- flags |= FAULT_FLAG_WRITE;
-
#ifdef CONFIG_PPC_ICSWX
/*
* we need to do this early because this "data storage
@@ -280,6 +277,9 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,

perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
/* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
* kernel and should generate an OOPS. Unfortunately, in the case of an
@@ -408,6 +408,7 @@ good_area:
} else if (is_write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
/* a read */
} else {
/* protection fault */
diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
index f00aefb..6fa7b05 100644
--- a/arch/s390/mm/fault.c
+++ b/arch/s390/mm/fault.c
@@ -302,6 +302,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
address = trans_exc_code & __FAIL_ADDR_MASK;
perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
+ if (regs->psw.mask & PSW_MASK_PSTATE)
+ flags |= FAULT_FLAG_USER;
if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
flags |= FAULT_FLAG_WRITE;
down_read(&mm->mmap_sem);
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 4b71a62..52238983 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
const int field = sizeof(unsigned long) * 2;
+ unsigned long flags = 0;
siginfo_t info;
int fault;

@@ -75,6 +76,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
if (in_atomic() || !mm)
goto bad_area_nosemaphore;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
if (!vma)
@@ -95,6 +99,7 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
goto bad_area;
@@ -105,7 +110,7 @@ good_area:
* make sure we exit gracefully rather than endlessly redo
* the fault.
*/
- fault = handle_mm_fault(mm, vma, address, write);
+ fault = handle_mm_fault(mm, vma, address, flags);
if (unlikely(fault & VM_FAULT_ERROR)) {
if (fault & VM_FAULT_OOM)
goto out_of_memory;
diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
index 1f49c28..541dc61 100644
--- a/arch/sh/mm/fault.c
+++ b/arch/sh/mm/fault.c
@@ -400,9 +400,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
struct mm_struct *mm;
struct vm_area_struct * vma;
int fault;
- int write = error_code & FAULT_CODE_WRITE;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -476,6 +474,11 @@ good_area:

set_thread_fault_code(error_code);

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (error_code & FAULT_CODE_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
index e98bfda..59dbd46 100644
--- a/arch/sparc/mm/fault_32.c
+++ b/arch/sparc/mm/fault_32.c
@@ -177,8 +177,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
unsigned long g2;
int from_user = !(regs->psr & PSR_PS);
int fault, code;
- unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

if (text_fault)
address = regs->pc;
@@ -235,6 +234,11 @@ good_area:
goto bad_area;
}

+ if (from_user)
+ flags |= FAULT_FLAG_USER;
+ if (write)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -383,6 +387,7 @@ static void force_user_fault(unsigned long address, int write)
struct vm_area_struct *vma;
struct task_struct *tsk = current;
struct mm_struct *mm = tsk->mm;
+ unsigned int flags = FAULT_FLAG_USER;
int code;

code = SEGV_MAPERR;
@@ -402,11 +407,12 @@ good_area:
if (write) {
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}
- switch (handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0)) {
+ switch (handle_mm_fault(mm, vma, address, flags)) {
case VM_FAULT_SIGBUS:
case VM_FAULT_OOM:
goto do_sigbus;
diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
index 5062ff3..c08b9bb 100644
--- a/arch/sparc/mm/fault_64.c
+++ b/arch/sparc/mm/fault_64.c
@@ -314,8 +314,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
} else {
bad_kernel_pc(regs, address);
return;
- }
- }
+ }
+ } else
+ flags |= FAULT_FLAG_USER;

/*
* If we're in an interrupt or have no user
@@ -418,13 +419,14 @@ good_area:
vma->vm_file != NULL)
set_thread_fault_code(fault_code |
FAULT_CODE_BLKCOMMIT);
+
+ flags |= FAULT_FLAG_WRITE;
} else {
/* Allow reads even for write-only mappings */
if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
goto bad_area;
}

- flags |= ((fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
fault = handle_mm_fault(mm, vma, address, flags);

if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index ac553ee..3ff289f 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -280,8 +280,7 @@ static int handle_page_fault(struct pt_regs *regs,
if (!is_page_fault)
write = 1;

- flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0));
+ flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

is_kernel_mode = (EX1_PL(regs->ex1) != USER_PL);

@@ -365,6 +364,9 @@ static int handle_page_fault(struct pt_regs *regs,
goto bad_area_nosemaphore;
}

+ if (!is_kernel_mode)
+ flags |= FAULT_FLAG_USER;
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in the
@@ -425,6 +427,7 @@ good_area:
#endif
if (!(vma->vm_flags & VM_WRITE))
goto bad_area;
+ flags |= FAULT_FLAG_WRITE;
} else {
if (!is_page_fault || !(vma->vm_flags & VM_READ))
goto bad_area;
diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
index b2f5adf..5c3aef7 100644
--- a/arch/um/kernel/trap.c
+++ b/arch/um/kernel/trap.c
@@ -30,8 +30,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
pmd_t *pmd;
pte_t *pte;
int err = -EFAULT;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (is_write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

*code_out = SEGV_MAPERR;

@@ -42,6 +41,8 @@ int handle_page_fault(unsigned long address, unsigned long ip,
if (in_atomic())
goto out_nosemaphore;

+ if (is_user)
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
@@ -58,12 +59,15 @@ retry:

good_area:
*code_out = SEGV_ACCERR;
- if (is_write && !(vma->vm_flags & VM_WRITE))
- goto out;
-
- /* Don't require VM_READ|VM_EXEC for write faults! */
- if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
- goto out;
+ if (is_write) {
+ if (!(vma->vm_flags & VM_WRITE))
+ goto out;
+ flags |= FAULT_FLAG_WRITE;
+ } else {
+ /* Don't require VM_READ|VM_EXEC for write faults! */
+ if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
+ goto out;
+ }

do {
int fault;
diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
index 8ed3c45..0dc922d 100644
--- a/arch/unicore32/mm/fault.c
+++ b/arch/unicore32/mm/fault.c
@@ -209,8 +209,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
struct task_struct *tsk;
struct mm_struct *mm;
int fault, sig, code;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- ((!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -222,6 +221,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
if (in_atomic() || !mm)
goto no_context;

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
+ if (!(fsr ^ 0x12))
+ flags |= FAULT_FLAG_WRITE;
+
/*
* As per x86, we may deadlock here. However, since the kernel only
* validly references user space from well defined areas of the code,
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 654be4a..6d77c38 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1011,9 +1011,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
unsigned long address;
struct mm_struct *mm;
int fault;
- int write = error_code & PF_WRITE;
- unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
- (write ? FAULT_FLAG_WRITE : 0);
+ unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;

tsk = current;
mm = tsk->mm;
@@ -1083,6 +1081,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1109,6 +1108,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
return;
}

+ if (error_code & PF_WRITE)
+ flags |= FAULT_FLAG_WRITE;
+
/*
* When running in the kernel we expect faults to occur only to
* addresses in user space. All other faults represent errors in
diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
index 4b7bc8d..70fa7bc 100644
--- a/arch/xtensa/mm/fault.c
+++ b/arch/xtensa/mm/fault.c
@@ -72,6 +72,8 @@ void do_page_fault(struct pt_regs *regs)
address, exccause, regs->pc, is_write? "w":"", is_exec? "x":"");
#endif

+ if (user_mode(regs))
+ flags |= FAULT_FLAG_USER;
retry:
down_read(&mm->mmap_sem);
vma = find_vma(mm, address);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d5c82dc..c51fc32 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -170,6 +170,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
#define FAULT_FLAG_TRIED 0x40 /* second try */
+#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */

/*
* vm_fault is filled by the the pagefault handler and passed to the vma's
--
1.8.3.2

2013-07-25 22:26:24

by Johannes Weiner

Subject: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

The memcg OOM handling is incredibly fragile and can deadlock.  When a
task fails to charge memory, it invokes the OOM killer and loops right
there in the charge code until it succeeds.  Similarly, any other task
that enters the charge path at this point goes to a waitqueue right
then and there and sleeps until the OOM situation is resolved.  The
problem is that these tasks may hold filesystem locks and the
mmap_sem, locks that the selected OOM victim may need in order to
exit.

For example, in one reported case, the task invoking the OOM killer
was about to charge a page cache page during a write(), which holds
the i_mutex. The OOM killer selected a task that was just entering
truncate() and trying to acquire the i_mutex:

OOM invoking task:
[<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
[<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
[<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
[<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
[<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
[<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
[<ffffffff81193a18>] ext3_write_begin+0x88/0x270
[<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
[<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
[<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
[<ffffffff8111156a>] do_sync_write+0xea/0x130
[<ffffffff81112183>] vfs_write+0xf3/0x1f0
[<ffffffff81112381>] sys_write+0x51/0x90
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

OOM kill victim:
[<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
[<ffffffff81121c90>] do_last+0x250/0xa30
[<ffffffff81122547>] path_openat+0xd7/0x440
[<ffffffff811229c9>] do_filp_open+0x49/0xa0
[<ffffffff8110f7d6>] do_sys_open+0x106/0x240
[<ffffffff8110f950>] sys_open+0x20/0x30
[<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
[<ffffffffffffffff>] 0xffffffffffffffff

The task that invoked the OOM killer will retry the charge
indefinitely while the OOM-killed task is unable to release any
resources.

A similar scenario can happen when the kernel OOM killer for a memcg
is disabled and a userspace task is in charge of resolving OOM
situations. In this case, ALL tasks that enter the OOM path will be
made to sleep on the OOM waitqueue and wait for userspace to free
resources or increase the group's limit. But a userspace OOM handler
is prone to deadlock itself on the locks held by the waiting tasks.
For example, one of the sleeping tasks may be stuck in a brk() call
with the mmap_sem held for writing, while the userspace handler, in
order to pick an optimal victim, may need to read files from
/proc/<pid>, which tries to acquire the same mmap_sem for reading and
deadlocks.

This patch changes the way tasks behave after detecting a memcg OOM
and makes sure nobody loops or sleeps with locks held:

1. When OOMing in a user fault, invoke the OOM killer and restart the
fault instead of looping on the charge attempt. This way, the OOM
victim can not get stuck on locks the looping task may hold.

2. When OOMing in a user fault but somebody else is handling it
(either the kernel OOM killer or a userspace handler), don't go to
sleep in the charge context. Instead, remember the OOMing memcg in
the task struct and then fully unwind the page fault stack with
-ENOMEM. pagefault_out_of_memory() will then call back into the
memcg code to check if the -ENOMEM came from the memcg, and then
either put the task to sleep on the memcg's OOM waitqueue or just
restart the fault. The OOM victim can no longer get stuck on any
lock a sleeping task may hold.

This relies on the memcg OOM killer only being enabled when an
allocation failure will result in a call to pagefault_out_of_memory().

While reworking the OOM routine, also remove a needless OOM waitqueue
wakeup when invoking the killer.  Aside from the wakeup implied by the
kill signal delivery, only uncharges and limit increases (things that
actually change the memory situation) should poke the waitqueue.
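
As a rough illustration of the sleep-or-restart decision that
mem_cgroup_oom_synchronize() makes, here is a toy userspace model.
The types and helpers are stand-ins; only the idea of snapshotting the
memcg's wakeup counter in the charge context and comparing it before
sleeping corresponds to the oom_wakeups logic in the patch below.

#include <stdbool.h>
#include <stdio.h>

struct memcg_model { int oom_wakeups; };

struct task_oom_state {
        struct memcg_model *wait_on_memcg;
        int wakeups;                    /* snapshot taken in the charge context */
};

/* Charge context: remember the memcg and the current wakeup count. */
static void record_oom(struct task_oom_state *st, struct memcg_model *memcg)
{
        st->wakeups = memcg->oom_wakeups;
        st->wait_on_memcg = memcg;
}

/* End of the page fault: sleep only if no wakeup was missed meanwhile. */
static void synchronize_oom(struct task_oom_state *st)
{
        struct memcg_model *memcg = st->wait_on_memcg;

        if (!memcg)
                return;
        if (memcg->oom_wakeups == st->wakeups)
                printf("no wakeups missed: sleep until charges are released\n");
        else
                printf("missed a wakeup: restart the fault right away\n");
        st->wait_on_memcg = NULL;
}

int main(void)
{
        struct memcg_model memcg = { .oom_wakeups = 0 };
        struct task_oom_state st = { 0 };

        record_oom(&st, &memcg);
        memcg.oom_wakeups++;            /* e.g. an uncharge happened elsewhere */
        synchronize_oom(&st);           /* restarts instead of sleeping */
        return 0;
}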

Reported-by: azurIt <[email protected]>
Debugged-by: Michal Hocko <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 17 +++++
include/linux/sched.h | 3 +
mm/memcontrol.c | 156 ++++++++++++++++++++++++++++++---------------
mm/memory.c | 3 +
mm/oom_kill.c | 7 +-
5 files changed, 132 insertions(+), 54 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 9bb5eeb..2489cb6 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -143,6 +143,13 @@ static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
return old;
}

+static inline bool task_in_mem_cgroup_oom(struct task_struct *p)
+{
+ return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -371,6 +378,16 @@ static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
return !new;
}

+static inline bool task_in_mem_cgroup_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4b3effc..eb873fd 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1400,6 +1400,9 @@ struct task_struct {
unsigned int memcg_kmem_skip_account;
struct memcg_oom_info {
unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
} memcg_oom;
#endif
#ifdef CONFIG_UPROBES
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30ae46a..029a3a8 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -280,6 +280,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

int swappiness;
/* OOM-Killer disable */
@@ -2178,6 +2179,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
- int order)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked, need_to_kill = true;

- owait.memcg = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;

/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
@@ -2221,24 +2212,100 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
spin_unlock(&memcg_oom_lock);

if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
} else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ memcg_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}

- mem_cgroup_unmark_under_oom(memcg);
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ *
+ * This has to be called at the end of a page fault if the memcg
+ * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
+ *
+ * Memcg supports userspace OOM handling, so failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to put the task to sleep and clean up the
+ * OOM state.
+ */
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_memcg;
+
+ owait.memcg = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_memcg:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2551,12 +2618,11 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int nr_pages, unsigned int min_pages,
- bool oom_check)
+ bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2613,14 +2679,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check || !current->memcg_oom.may_oom)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask, get_order(csize));

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2723,7 +2785,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2731,14 +2793,8 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, nr_pages,
- oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch,
+ nr_pages, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2751,16 +2807,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 5ea7b47..fff7dfd 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3868,6 +3868,9 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
if (flags & FAULT_FLAG_USER)
WARN_ON(mem_cgroup_xchg_may_oom(current, false) == false);

+ if (WARN_ON(task_in_mem_cgroup_oom(current) && !(ret & VM_FAULT_OOM)))
+ mem_cgroup_oom_synchronize();
+
return ret;
}

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 98e75f2..314e9d2 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -678,9 +678,12 @@ out:
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist = node_zonelist(first_online_node,
- GFP_KERNEL);
+ struct zonelist *zonelist;

+ if (mem_cgroup_oom_synchronize())
+ return;
+
+ zonelist = node_zonelist(first_online_node, GFP_KERNEL);
if (try_set_zonelist_oom(zonelist, GFP_KERNEL)) {
out_of_memory(NULL, 0, 0, NULL, false);
clear_zonelist_oom(zonelist, GFP_KERNEL);
--
1.8.3.2

2013-07-25 22:26:49

by Johannes Weiner

Subject: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

System calls and kernel faults (uaccess, gup) can handle an
out-of-memory situation gracefully and just return -ENOMEM.

Enable the memcg OOM killer only for user faults, where it's really
the only option available.
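
The mechanism, stripped of kernel details, is just a per-task flag
that is set across user-triggered faults and cleared everywhere else
(and temporarily cleared again, e.g. around readahead).  A minimal
sketch, using a stand-in task structure rather than the real
task_struct:

#include <stdbool.h>
#include <stdio.h>

struct task_model { bool memcg_may_oom; };

/* Exchange the flag and return the previous value, xchg-style. */
static bool xchg_may_oom(struct task_model *tsk, bool new_val)
{
        bool old = tsk->memcg_may_oom;

        tsk->memcg_may_oom = new_val;
        return old;
}

static void handle_user_fault(struct task_model *tsk)
{
        bool old = xchg_may_oom(tsk, true);     /* user fault: OOM killing allowed */

        printf("faulting with may_oom=%d (was %d)\n", tsk->memcg_may_oom, old);
        xchg_may_oom(tsk, false);               /* back to plain -ENOMEM afterwards */
}

int main(void)
{
        struct task_model tsk = { .memcg_may_oom = false };

        handle_user_fault(&tsk);        /* syscalls and gup never set the flag */
        return 0;
}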

Signed-off-by: Johannes Weiner <[email protected]>
---
include/linux/memcontrol.h | 23 +++++++++++++++++++++++
include/linux/sched.h | 3 +++
mm/filemap.c | 11 ++++++++++-
mm/memcontrol.c | 2 +-
mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
5 files changed, 67 insertions(+), 12 deletions(-)

diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 7b4d9d7..9bb5eeb 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
extern void mem_cgroup_replace_page_cache(struct page *oldpage,
struct page *newpage);

+/**
+ * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
+ * @p: task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM. Returns the previous toggle state.
+ */
+static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
+{
+ bool old;
+
+ old = p->memcg_oom.may_oom;
+ p->memcg_oom.may_oom = new;
+
+ return old;
+}
+
#ifdef CONFIG_MEMCG_SWAP
extern int do_swap_account;
#endif
@@ -348,6 +366,11 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
{
}

+static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
+{
+ return !new;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/sched.h b/include/linux/sched.h
index fc09d21..4b3effc 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1398,6 +1398,9 @@ struct task_struct {
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
unsigned int memcg_kmem_skip_account;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ } memcg_oom;
#endif
#ifdef CONFIG_UPROBES
struct uprobe_task *utask;
diff --git a/mm/filemap.c b/mm/filemap.c
index a6981fe..2932810 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
+ unsigned int may_oom;
struct page *page;
pgoff_t size;
int ret = 0;
@@ -1626,7 +1627,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
@@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
} else if (!page) {
/* No page in the page cache at all */
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 00a7a66..30ae46a 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2614,7 +2614,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
return CHARGE_RETRY;

/* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
+ if (!oom_check || !current->memcg_oom.may_oom)
return CHARGE_NOMEM;
/* check OOM */
if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
diff --git a/mm/memory.c b/mm/memory.c
index f2ab2a8..5ea7b47 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3752,22 +3752,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3851,6 +3843,34 @@ retry:
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, false) == false);
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
--
1.8.3.2
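
As a rough illustration of the toggle discipline above -- a standalone
userspace sketch, not kernel code; struct task, xchg_may_oom(),
optional_work() and the printf output are invented stand-ins, and only
the save/set/restore semantics mirror mem_cgroup_xchg_may_oom():

#include <stdio.h>
#include <stdbool.h>

struct task {
	bool may_oom;			/* mirrors current->memcg_oom.may_oom */
};

/* set a new toggle state, return the previous one */
static bool xchg_may_oom(struct task *t, bool new)
{
	bool old = t->may_oom;

	t->may_oom = new;
	return old;
}

/* optional work such as readahead: a charge failure here must not OOM-kill */
static void optional_work(struct task *t)
{
	bool saved = xchg_may_oom(t, false);

	printf("  optional work: may_oom=%d\n", t->may_oom);
	xchg_may_oom(t, saved);		/* restore whatever the caller had set */
}

/* fault entry point: only user-triggered faults may invoke the OOM killer */
static void fault(struct task *t, bool user)
{
	if (user)
		xchg_may_oom(t, true);

	optional_work(t);
	printf("  instantiate page: may_oom=%d\n", t->may_oom);

	if (user)
		xchg_may_oom(t, false);
}

int main(void)
{
	struct task t = { .may_oom = false };

	printf("user fault:\n");
	fault(&t, true);
	printf("kernel fault:\n");
	fault(&t, false);
	return 0;
}

Returning the old value is what makes the readahead sections nestable:
they restore whatever state handle_mm_fault() established for the fault
as a whole instead of unconditionally re-enabling the OOM killer.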

2013-07-25 22:27:28

by Johannes Weiner

[permalink] [raw]
Subject: [patch 1/6] arch: mm: remove obsolete init OOM protection

Back before smart OOM killing, when faulting tasks were killed
directly on allocation failures, the arch-specific fault handlers
needed special protection for the init process.

Now that all fault handlers call into the generic OOM killer (609838c
"mm: invoke oom-killer from remaining unconverted page fault
handlers"), which already provides init protection, the arch-specific
leftovers can be removed.

Signed-off-by: Johannes Weiner <[email protected]>
---
arch/arc/mm/fault.c | 5 -----
arch/score/mm/fault.c | 6 ------
arch/tile/mm/fault.c | 6 ------
3 files changed, 17 deletions(-)

diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
index 0fd1f0d..6b0bb41 100644
--- a/arch/arc/mm/fault.c
+++ b/arch/arc/mm/fault.c
@@ -122,7 +122,6 @@ good_area:
goto bad_area;
}

-survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -201,10 +200,6 @@ no_context:
die("Oops", regs, address);

out_of_memory:
- if (is_global_init(tsk)) {
- yield();
- goto survive;
- }
up_read(&mm->mmap_sem);

if (user_mode(regs)) {
diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
index 6b18fb0..4b71a62 100644
--- a/arch/score/mm/fault.c
+++ b/arch/score/mm/fault.c
@@ -100,7 +100,6 @@ good_area:
goto bad_area;
}

-survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -167,11 +166,6 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
if (!user_mode(regs))
goto no_context;
pagefault_out_of_memory();
diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
index f7f99f9..ac553ee 100644
--- a/arch/tile/mm/fault.c
+++ b/arch/tile/mm/fault.c
@@ -430,7 +430,6 @@ good_area:
goto bad_area;
}

- survive:
/*
* If for any reason at all we couldn't handle the fault,
* make sure we exit gracefully rather than endlessly redo
@@ -568,11 +567,6 @@ no_context:
*/
out_of_memory:
up_read(&mm->mmap_sem);
- if (is_global_init(tsk)) {
- yield();
- down_read(&mm->mmap_sem);
- goto survive;
- }
if (is_kernel_mode)
goto no_context;
pagefault_out_of_memory();
--
1.8.3.2

2013-07-25 22:31:35

by Johannes Weiner

[permalink] [raw]
Subject: [patch 3.2] memcg OOM robustness (x86 only)

azurIt, this is the combined backport for 3.2, x86 + generic bits +
debugging. It would be fantastic if you could give this another shot
once you get back from vacation. Thanks!

Johannes

---

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index 5db0490..314fe53 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -842,30 +842,22 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
force_sig_info_fault(SIGBUS, code, address, tsk, fault);
}

-static noinline int
+static noinline void
mm_fault_error(struct pt_regs *regs, unsigned long error_code,
unsigned long address, unsigned int fault)
{
- /*
- * Pagefault was interrupted by SIGKILL. We have no reason to
- * continue pagefault.
- */
- if (fatal_signal_pending(current)) {
- if (!(fault & VM_FAULT_RETRY))
- up_read(&current->mm->mmap_sem);
- if (!(error_code & PF_USER))
- no_context(regs, error_code, address);
- return 1;
+ if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
+ up_read(&current->mm->mmap_sem);
+ no_context(regs, error_code, address);
+ return;
}
- if (!(fault & VM_FAULT_ERROR))
- return 0;

if (fault & VM_FAULT_OOM) {
/* Kernel mode? Handle exceptions or die: */
if (!(error_code & PF_USER)) {
up_read(&current->mm->mmap_sem);
no_context(regs, error_code, address);
- return 1;
+ return;
}

out_of_memory(regs, error_code, address);
@@ -876,7 +868,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
else
BUG();
}
- return 1;
}

static int spurious_fault_check(unsigned long error_code, pte_t *pte)
@@ -1070,6 +1061,7 @@ do_page_fault(struct pt_regs *regs, unsigned long error_code)
if (user_mode_vm(regs)) {
local_irq_enable();
error_code |= PF_USER;
+ flags |= FAULT_FLAG_USER;
} else {
if (regs->flags & X86_EFLAGS_IF)
local_irq_enable();
@@ -1167,9 +1159,17 @@ good_area:
*/
fault = handle_mm_fault(mm, vma, address, flags);

- if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
- if (mm_fault_error(regs, error_code, address, fault))
- return;
+ /*
+ * If we need to retry but a fatal signal is pending, handle the
+ * signal first. We do not need to release the mmap_sem because it
+ * would already be released in __lock_page_or_retry in mm/filemap.c.
+ */
+ if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
+ return;
+
+ if (unlikely(fault & VM_FAULT_ERROR)) {
+ mm_fault_error(regs, error_code, address, fault);
+ return;
}

/*
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index b87068a..315b822 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -120,6 +120,31 @@ mem_cgroup_get_reclaim_stat_from_page(struct page *page);
extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
struct task_struct *p);

+/**
+ * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
+ * @p: task
+ * @new: true to enable, false to disable
+ *
+ * Toggle whether a failed memcg charge should invoke the OOM killer
+ * or just return -ENOMEM. Returns the previous toggle state.
+ */
+static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
+{
+ bool old;
+
+ old = p->memcg_oom.may_oom;
+ p->memcg_oom.may_oom = new;
+
+ return old;
+}
+
+static inline bool task_in_mem_cgroup_oom(struct task_struct *p)
+{
+ return p->memcg_oom.in_memcg_oom;
+}
+
+bool mem_cgroup_oom_synchronize(void);
+
#ifdef CONFIG_CGROUP_MEM_RES_CTLR_SWAP
extern int do_swap_account;
#endif
@@ -333,6 +358,21 @@ mem_cgroup_print_oom_info(struct mem_cgroup *memcg, struct task_struct *p)
{
}

+static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
+{
+ return !new;
+}
+
+static inline bool task_in_mem_cgroup_oom(struct task_struct *p)
+{
+ return false;
+}
+
+static inline bool mem_cgroup_oom_synchronize(void)
+{
+ return false;
+}
+
static inline void mem_cgroup_inc_page_stat(struct page *page,
enum mem_cgroup_page_stat_item idx)
{
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4baadd1..846b82b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -156,6 +156,7 @@ extern pgprot_t protection_map[16];
#define FAULT_FLAG_ALLOW_RETRY 0x08 /* Retry fault if blocking */
#define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
#define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
+#define FAULT_FLAG_USER 0x40 /* The fault originated in userspace */

/*
* This interface is used by x86 PAT code to identify a pfn mapping that is
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 1c4f3e9..a77d198 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -91,6 +91,7 @@ struct sched_param {
#include <linux/latencytop.h>
#include <linux/cred.h>
#include <linux/llist.h>
+#include <linux/stacktrace.h>

#include <asm/processor.h>

@@ -1568,6 +1569,14 @@ struct task_struct {
unsigned long nr_pages; /* uncharged usage */
unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
} memcg_batch;
+ struct memcg_oom_info {
+ unsigned int may_oom:1;
+ unsigned int in_memcg_oom:1;
+ struct stack_trace trace;
+ unsigned long trace_entries[16];
+ int wakeups;
+ struct mem_cgroup *wait_on_memcg;
+ } memcg_oom;
#endif
#ifdef CONFIG_HAVE_HW_BREAKPOINT
atomic_t ptrace_bp_refcnt;
diff --git a/mm/filemap.c b/mm/filemap.c
index 5f0a3c9..3256d06 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1660,6 +1660,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
struct file_ra_state *ra = &file->f_ra;
struct inode *inode = mapping->host;
pgoff_t offset = vmf->pgoff;
+ unsigned int may_oom;
struct page *page;
pgoff_t size;
int ret = 0;
@@ -1669,7 +1670,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
return VM_FAULT_SIGBUS;

/*
- * Do we have something in the page cache already?
+ * Do we have something in the page cache already? Either
+ * way, try readahead, but disable the memcg OOM killer for it
+ * as readahead is optional and no errors are propagated up
+ * the fault stack. The OOM killer is enabled while trying to
+ * instantiate the faulting page individually below.
*/
page = find_get_page(mapping, offset);
if (likely(page)) {
@@ -1677,10 +1682,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
* We found the page, so try async readahead before
* waiting for the lock.
*/
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_async_mmap_readahead(vma, ra, file, page, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
} else {
/* No page in the page cache at all */
+ may_oom = mem_cgroup_xchg_may_oom(current, 0);
do_sync_mmap_readahead(vma, ra, file, offset);
+ mem_cgroup_xchg_may_oom(current, may_oom);
count_vm_event(PGMAJFAULT);
mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
ret = VM_FAULT_MAJOR;
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index b63f5f7..0624d39 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -49,6 +49,7 @@
#include <linux/page_cgroup.h>
#include <linux/cpu.h>
#include <linux/oom.h>
+#include <linux/stacktrace.h>
#include "internal.h"

#include <asm/uaccess.h>
@@ -249,6 +250,7 @@ struct mem_cgroup {

bool oom_lock;
atomic_t under_oom;
+ atomic_t oom_wakeups;

atomic_t refcnt;

@@ -1846,6 +1848,7 @@ static int memcg_oom_wake_function(wait_queue_t *wait,

static void memcg_wakeup_oom(struct mem_cgroup *memcg)
{
+ atomic_inc(&memcg->oom_wakeups);
/* for filtering, pass "memcg" as argument. */
__wake_up(&memcg_oom_waitq, TASK_NORMAL, 0, memcg);
}
@@ -1857,30 +1860,26 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
}

/*
- * try to call OOM killer. returns false if we should exit memory-reclaim loop.
+ * try to call OOM killer
*/
-bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
+static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask)
{
- struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked, need_to_kill = true;

- owait.mem = memcg;
- owait.wait.flags = 0;
- owait.wait.func = memcg_oom_wake_function;
- owait.wait.private = current;
- INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);
+ if (!current->memcg_oom.may_oom)
+ return;
+
+ current->memcg_oom.in_memcg_oom = 1;
+
+ current->memcg_oom.trace.nr_entries = 0;
+ current->memcg_oom.trace.max_entries = 16;
+ current->memcg_oom.trace.entries = current->memcg_oom.trace_entries;
+ current->memcg_oom.trace.skip = 1;
+ save_stack_trace(&current->memcg_oom.trace);

/* At first, try to OOM lock hierarchy under memcg.*/
spin_lock(&memcg_oom_lock);
locked = mem_cgroup_oom_lock(memcg);
- /*
- * Even if signal_pending(), we can't quit charge() loop without
- * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
- * under OOM is always welcomed, use TASK_KILLABLE here.
- */
- prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
if (!locked || memcg->oom_kill_disable)
need_to_kill = false;
if (locked)
@@ -1888,24 +1887,100 @@ bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask)
spin_unlock(&memcg_oom_lock);

if (need_to_kill) {
- finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask);
} else {
- schedule();
- finish_wait(&memcg_oom_waitq, &owait.wait);
+ /*
+ * A system call can just return -ENOMEM, but if this
+ * is a page fault and somebody else is handling the
+ * OOM already, we need to sleep on the OOM waitqueue
+ * for this memcg until the situation is resolved.
+ * Which can take some time because it might be
+ * handled by a userspace task.
+ *
+ * However, this is the charge context, which means
+ * that we may sit on a large call stack and hold
+ * various filesystem locks, the mmap_sem etc. and we
+ * don't want the OOM handler to deadlock on them
+ * while we sit here and wait. Store the current OOM
+ * context in the task_struct, then return -ENOMEM.
+ * At the end of the page fault handler, with the
+ * stack unwound, pagefault_out_of_memory() will check
+ * back with us by calling
+ * mem_cgroup_oom_synchronize(), possibly putting the
+ * task to sleep.
+ */
+ mem_cgroup_mark_under_oom(memcg);
+ current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
+ css_get(&memcg->css);
+ current->memcg_oom.wait_on_memcg = memcg;
}
- spin_lock(&memcg_oom_lock);
- if (locked)
+
+ if (locked) {
+ spin_lock(&memcg_oom_lock);
mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);
+ /*
+ * Sleeping tasks might have been killed, make sure
+ * they get scheduled so they can exit.
+ */
+ if (need_to_kill)
+ memcg_oom_recover(memcg);
+ spin_unlock(&memcg_oom_lock);
+ }
+}

- mem_cgroup_unmark_under_oom(memcg);
+/**
+ * mem_cgroup_oom_synchronize - complete memcg OOM handling
+ *
+ * This has to be called at the end of a page fault if the memcg
+ * OOM handler was enabled and the fault is returning %VM_FAULT_OOM.
+ *
+ * Memcg supports userspace OOM handling, so failed allocations must
+ * sleep on a waitqueue until the userspace task resolves the
+ * situation. Sleeping directly in the charge context with all kinds
+ * of locks held is not a good idea, instead we remember an OOM state
+ * in the task and mem_cgroup_oom_synchronize() has to be called at
+ * the end of the page fault to put the task to sleep and clean up the
+ * OOM state.
+ */
+bool mem_cgroup_oom_synchronize(void)
+{
+ struct oom_wait_info owait;
+ struct mem_cgroup *memcg;

- if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ /* OOM is global, do not handle */
+ if (!current->memcg_oom.in_memcg_oom)
return false;
- /* Give chance to dying process */
- schedule_timeout_uninterruptible(1);
+
+ /*
+ * We invoked the OOM killer but there is a chance that a kill
+ * did not free up any charges. Everybody else might already
+ * be sleeping, so restart the fault and keep the rampage
+ * going until some charges are released.
+ */
+ memcg = current->memcg_oom.wait_on_memcg;
+ if (!memcg)
+ goto out;
+
+ if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
+ goto out_memcg;
+
+ owait.mem = memcg;
+ owait.wait.flags = 0;
+ owait.wait.func = memcg_oom_wake_function;
+ owait.wait.private = current;
+ INIT_LIST_HEAD(&owait.wait.task_list);
+
+ prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
+ /* Only sleep if we didn't miss any wakeups since OOM */
+ if (atomic_read(&memcg->oom_wakeups) == current->memcg_oom.wakeups)
+ schedule();
+ finish_wait(&memcg_oom_waitq, &owait.wait);
+out_memcg:
+ mem_cgroup_unmark_under_oom(memcg);
+ css_put(&memcg->css);
+ current->memcg_oom.wait_on_memcg = NULL;
+out:
+ current->memcg_oom.in_memcg_oom = 0;
return true;
}

@@ -2195,11 +2270,10 @@ enum {
CHARGE_RETRY, /* need to retry but retry is not bad */
CHARGE_NOMEM, /* we can't do more. return -ENOMEM */
CHARGE_WOULDBLOCK, /* GFP_WAIT wasn't set and no enough res. */
- CHARGE_OOM_DIE, /* the current is killed because of OOM */
};

static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
- unsigned int nr_pages, bool oom_check)
+ unsigned int nr_pages, bool invoke_oom)
{
unsigned long csize = nr_pages * PAGE_SIZE;
struct mem_cgroup *mem_over_limit;
@@ -2257,14 +2331,10 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
if (mem_cgroup_wait_acct_move(mem_over_limit))
return CHARGE_RETRY;

- /* If we don't need to call oom-killer at el, return immediately */
- if (!oom_check)
- return CHARGE_NOMEM;
- /* check OOM */
- if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask))
- return CHARGE_OOM_DIE;
+ if (invoke_oom)
+ mem_cgroup_oom(mem_over_limit, gfp_mask);

- return CHARGE_RETRY;
+ return CHARGE_NOMEM;
}

/*
@@ -2349,7 +2419,7 @@ again:
}

do {
- bool oom_check;
+ bool invoke_oom = oom && !nr_oom_retries;

/* If killed, bypass charge */
if (fatal_signal_pending(current)) {
@@ -2357,13 +2427,7 @@ again:
goto bypass;
}

- oom_check = false;
- if (oom && !nr_oom_retries) {
- oom_check = true;
- nr_oom_retries = MEM_CGROUP_RECLAIM_RETRIES;
- }
-
- ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, oom_check);
+ ret = mem_cgroup_do_charge(memcg, gfp_mask, batch, invoke_oom);
switch (ret) {
case CHARGE_OK:
break;
@@ -2376,16 +2440,12 @@ again:
css_put(&memcg->css);
goto nomem;
case CHARGE_NOMEM: /* OOM routine works */
- if (!oom) {
+ if (!oom || invoke_oom) {
css_put(&memcg->css);
goto nomem;
}
- /* If oom, we never return -ENOMEM */
nr_oom_retries--;
break;
- case CHARGE_OOM_DIE: /* Killed by OOM Killer */
- css_put(&memcg->css);
- goto bypass;
}
} while (ret != CHARGE_OK);

diff --git a/mm/memory.c b/mm/memory.c
index 829d437..99cfde3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -57,6 +57,7 @@
#include <linux/swapops.h>
#include <linux/elf.h>
#include <linux/gfp.h>
+#include <linux/stacktrace.h>

#include <asm/io.h>
#include <asm/pgalloc.h>
@@ -3439,22 +3440,14 @@ unlock:
/*
* By the time we get here, we already hold the mm semaphore
*/
-int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long address, unsigned int flags)
+static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
{
pgd_t *pgd;
pud_t *pud;
pmd_t *pmd;
pte_t *pte;

- __set_current_state(TASK_RUNNING);
-
- count_vm_event(PGFAULT);
- mem_cgroup_count_vm_event(mm, PGFAULT);
-
- /* do counter updates before entering really critical section. */
- check_sync_rss_stat(current);
-
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

@@ -3503,6 +3496,40 @@ int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
return handle_pte_fault(mm, vma, address, pte, pmd, flags);
}

+int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
+ unsigned long address, unsigned int flags)
+{
+ int ret;
+
+ __set_current_state(TASK_RUNNING);
+
+ count_vm_event(PGFAULT);
+ mem_cgroup_count_vm_event(mm, PGFAULT);
+
+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
+ /*
+ * Enable the memcg OOM handling for faults triggered in user
+ * space. Kernel faults are handled more gracefully.
+ */
+ if (flags & FAULT_FLAG_USER)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
+
+ ret = __handle_mm_fault(mm, vma, address, flags);
+
+ if (flags & FAULT_FLAG_USER)
+ WARN_ON(mem_cgroup_xchg_may_oom(current, false) == false);
+
+ if (WARN_ON(task_in_mem_cgroup_oom(current) && !(ret & VM_FAULT_OOM))) {
+ printk("Fixing unhandled memcg OOM context set up from:\n");
+ print_stack_trace(&current->memcg_oom.trace, 0);
+ mem_cgroup_oom_synchronize();
+ }
+
+ return ret;
+}
+
#ifndef __PAGETABLE_PUD_FOLDED
/*
* Allocate page upper directory.
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 069b64e..aa60863 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -785,6 +785,8 @@ out:
*/
void pagefault_out_of_memory(void)
{
+ if (mem_cgroup_oom_synchronize())
+ return;
if (try_set_system_oom()) {
out_of_memory(NULL, 0, 0, NULL);
clear_system_oom();
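
For readers following the backport, here is the two-phase scheme boiled
down to a standalone userspace model -- the field names echo the patch,
but charge(), oom_synchronize() and everything else in this sketch are
simplified stand-ins rather than kernel code: the charge path only
records the OOM state and fails the allocation, while any waiting
happens in a synchronize step after the fault stack has unwound and its
locks are gone.

#include <stdio.h>
#include <stdbool.h>
#include <stddef.h>

struct memcg {
	int oom_wakeups;		/* mirrors memcg->oom_wakeups */
};

struct task {
	bool in_memcg_oom;		/* an unhandled memcg OOM is pending */
	int wakeups;			/* wakeup count snapshot taken at OOM time */
	struct memcg *wait_on_memcg;	/* memcg to sleep on, if any */
};

/* charge context: record the OOM state, never sleep here */
static int charge(struct task *t, struct memcg *m)
{
	t->in_memcg_oom = true;
	t->wakeups = m->oom_wakeups;
	t->wait_on_memcg = m;
	return -1;			/* stand-in for -ENOMEM */
}

/* end of the page fault, stack unwound and locks dropped: safe to wait now */
static bool oom_synchronize(struct task *t)
{
	struct memcg *m = t->wait_on_memcg;

	if (!t->in_memcg_oom)
		return false;		/* global OOM, not handled here */

	if (m) {
		if (m->oom_wakeups == t->wakeups)
			printf("no wakeups missed: sleep until the OOM is resolved\n");
		else
			printf("missed a wakeup: skip the sleep, retry the fault\n");
		t->wait_on_memcg = NULL;
	}
	t->in_memcg_oom = false;
	return true;
}

int main(void)
{
	struct memcg cg = { .oom_wakeups = 0 };
	struct task t = { .in_memcg_oom = false };

	if (charge(&t, &cg))
		printf("charge failed: return -ENOMEM up the fault stack\n");
	/* ... the fault handler unwinds and drops its locks here ... */
	if (oom_synchronize(&t))
		printf("memcg OOM completed after the stack was unwound\n");
	return 0;
}

The wakeup snapshot plays the same role as memcg->oom_wakeups above: if
a wakeup arrived between recording the OOM state and synchronizing, the
task skips the sleep and simply retries the fault.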

2013-07-26 13:00:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 1/6] arch: mm: remove obsolete init OOM protection

On Thu 25-07-13 18:25:33, Johannes Weiner wrote:
> Back before smart OOM killing, when faulting tasks were killed
> directly on allocation failures, the arch-specific fault handlers
> needed special protection for the init process.
>
> Now that all fault handlers call into the generic OOM killer (609838c
> "mm: invoke oom-killer from remaining unconverted page fault
> handlers"), which already provides init protection, the arch-specific
> leftovers can be removed.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Looks good to me
Reviewed-by: Michal Hocko <[email protected]>

> ---
> arch/arc/mm/fault.c | 5 -----
> arch/score/mm/fault.c | 6 ------
> arch/tile/mm/fault.c | 6 ------
> 3 files changed, 17 deletions(-)
>
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index 0fd1f0d..6b0bb41 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -122,7 +122,6 @@ good_area:
> goto bad_area;
> }
>
> -survive:
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -201,10 +200,6 @@ no_context:
> die("Oops", regs, address);
>
> out_of_memory:
> - if (is_global_init(tsk)) {
> - yield();
> - goto survive;
> - }
> up_read(&mm->mmap_sem);
>
> if (user_mode(regs)) {
> diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
> index 6b18fb0..4b71a62 100644
> --- a/arch/score/mm/fault.c
> +++ b/arch/score/mm/fault.c
> @@ -100,7 +100,6 @@ good_area:
> goto bad_area;
> }
>
> -survive:
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -167,11 +166,6 @@ no_context:
> */
> out_of_memory:
> up_read(&mm->mmap_sem);
> - if (is_global_init(tsk)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> if (!user_mode(regs))
> goto no_context;
> pagefault_out_of_memory();
> diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
> index f7f99f9..ac553ee 100644
> --- a/arch/tile/mm/fault.c
> +++ b/arch/tile/mm/fault.c
> @@ -430,7 +430,6 @@ good_area:
> goto bad_area;
> }
>
> - survive:
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -568,11 +567,6 @@ no_context:
> */
> out_of_memory:
> up_read(&mm->mmap_sem);
> - if (is_global_init(tsk)) {
> - yield();
> - down_read(&mm->mmap_sem);
> - goto survive;
> - }
> if (is_kernel_mode)
> goto no_context;
> pagefault_out_of_memory();
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-26 13:07:37

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 2/6] arch: mm: do not invoke OOM killer on kernel fault OOM

On Thu 25-07-13 18:25:34, Johannes Weiner wrote:
> Kernel faults are expected to handle OOM conditions gracefully (gup,
> uaccess etc.), so they should never invoke the OOM killer. Reserve
> this for faults triggered in user context when it is the only option.
>
> Most architectures already do this, fix up the remaining few.
>
> Signed-off-by: Johannes Weiner <[email protected]>

I didn't go over all architectures to check whether something slipped
through, but the converted ones look OK to me.

Reviewed-by: Michal Hocko <[email protected]>

> ---
> arch/arm/mm/fault.c | 14 +++++++-------
> arch/arm64/mm/fault.c | 14 +++++++-------
> arch/avr32/mm/fault.c | 2 +-
> arch/mips/mm/fault.c | 2 ++
> arch/um/kernel/trap.c | 2 ++
> arch/unicore32/mm/fault.c | 14 +++++++-------
> 6 files changed, 26 insertions(+), 22 deletions(-)
>
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index c97f794..217bcbf 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -349,6 +349,13 @@ retry:
> if (likely(!(fault & (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
> return 0;
>
> + /*
> + * If we are in kernel mode at this point, we
> + * have no context to handle this fault with.
> + */
> + if (!user_mode(regs))
> + goto no_context;
> +
> if (fault & VM_FAULT_OOM) {
> /*
> * We ran out of memory, call the OOM killer, and return to
> @@ -359,13 +366,6 @@ retry:
> return 0;
> }
>
> - /*
> - * If we are in kernel mode at this point, we
> - * have no context to handle this fault with.
> - */
> - if (!user_mode(regs))
> - goto no_context;
> -
> if (fault & VM_FAULT_SIGBUS) {
> /*
> * We had some memory, but were unable to
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index 0ecac89..dab1cfd 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -294,6 +294,13 @@ retry:
> VM_FAULT_BADACCESS))))
> return 0;
>
> + /*
> + * If we are in kernel mode at this point, we have no context to
> + * handle this fault with.
> + */
> + if (!user_mode(regs))
> + goto no_context;
> +
> if (fault & VM_FAULT_OOM) {
> /*
> * We ran out of memory, call the OOM killer, and return to
> @@ -304,13 +311,6 @@ retry:
> return 0;
> }
>
> - /*
> - * If we are in kernel mode at this point, we have no context to
> - * handle this fault with.
> - */
> - if (!user_mode(regs))
> - goto no_context;
> -
> if (fault & VM_FAULT_SIGBUS) {
> /*
> * We had some memory, but were unable to successfully fix up
> diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
> index b2f2d2d..2ca27b0 100644
> --- a/arch/avr32/mm/fault.c
> +++ b/arch/avr32/mm/fault.c
> @@ -228,9 +228,9 @@ no_context:
> */
> out_of_memory:
> up_read(&mm->mmap_sem);
> - pagefault_out_of_memory();
> if (!user_mode(regs))
> goto no_context;
> + pagefault_out_of_memory();
> return;
>
> do_sigbus:
> diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
> index 85df1cd..94d3a31 100644
> --- a/arch/mips/mm/fault.c
> +++ b/arch/mips/mm/fault.c
> @@ -241,6 +241,8 @@ out_of_memory:
> * (which will retry the fault, or kill us if we got oom-killed).
> */
> up_read(&mm->mmap_sem);
> + if (!user_mode(regs))
> + goto no_context;
> pagefault_out_of_memory();
> return;
>
> diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
> index 089f398..b2f5adf 100644
> --- a/arch/um/kernel/trap.c
> +++ b/arch/um/kernel/trap.c
> @@ -124,6 +124,8 @@ out_of_memory:
> * (which will retry the fault, or kill us if we got oom-killed).
> */
> up_read(&mm->mmap_sem);
> + if (!is_user)
> + goto out_nosemaphore;
> pagefault_out_of_memory();
> return 0;
> }
> diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
> index f9b5c10..8ed3c45 100644
> --- a/arch/unicore32/mm/fault.c
> +++ b/arch/unicore32/mm/fault.c
> @@ -278,6 +278,13 @@ retry:
> (VM_FAULT_ERROR | VM_FAULT_BADMAP | VM_FAULT_BADACCESS))))
> return 0;
>
> + /*
> + * If we are in kernel mode at this point, we
> + * have no context to handle this fault with.
> + */
> + if (!user_mode(regs))
> + goto no_context;
> +
> if (fault & VM_FAULT_OOM) {
> /*
> * We ran out of memory, call the OOM killer, and return to
> @@ -288,13 +295,6 @@ retry:
> return 0;
> }
>
> - /*
> - * If we are in kernel mode at this point, we
> - * have no context to handle this fault with.
> - */
> - if (!user_mode(regs))
> - goto no_context;
> -
> if (fault & VM_FAULT_SIGBUS) {
> /*
> * We had some memory, but were unable to
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-26 13:19:57

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 3/6] arch: mm: pass userspace fault flag to generic fault handler

On Thu 25-07-13 18:25:35, Johannes Weiner wrote:
> Unlike global OOM handling, memory cgroup code will invoke the OOM
> killer in any OOM situation because it has no way of telling faults
> occurring in kernel context - which could be handled more gracefully -
> from user-triggered faults.
>
> Pass a flag that identifies faults originating in user space from the
> architecture-specific fault handlers to generic code so that memcg OOM
> handling can be improved.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Looks good to me, but I guess maintainers of the affected archs should be
CCed.
Reviewed-by: Michal Hocko <[email protected]>

> ---
> arch/alpha/mm/fault.c | 7 ++++---
> arch/arc/mm/fault.c | 6 ++++--
> arch/arm/mm/fault.c | 9 ++++++---
> arch/arm64/mm/fault.c | 9 ++++++---
> arch/avr32/mm/fault.c | 2 ++
> arch/cris/mm/fault.c | 6 ++++--
> arch/frv/mm/fault.c | 10 ++++++----
> arch/hexagon/mm/vm_fault.c | 6 ++++--
> arch/ia64/mm/fault.c | 6 ++++--
> arch/m32r/mm/fault.c | 10 ++++++----
> arch/m68k/mm/fault.c | 2 ++
> arch/metag/mm/fault.c | 6 ++++--
> arch/microblaze/mm/fault.c | 7 +++++--
> arch/mips/mm/fault.c | 6 ++++--
> arch/mn10300/mm/fault.c | 2 ++
> arch/openrisc/mm/fault.c | 1 +
> arch/parisc/mm/fault.c | 7 +++++--
> arch/powerpc/mm/fault.c | 7 ++++---
> arch/s390/mm/fault.c | 2 ++
> arch/score/mm/fault.c | 7 ++++++-
> arch/sh/mm/fault.c | 9 ++++++---
> arch/sparc/mm/fault_32.c | 12 +++++++++---
> arch/sparc/mm/fault_64.c | 8 +++++---
> arch/tile/mm/fault.c | 7 +++++--
> arch/um/kernel/trap.c | 20 ++++++++++++--------
> arch/unicore32/mm/fault.c | 8 ++++++--
> arch/x86/mm/fault.c | 8 +++++---
> arch/xtensa/mm/fault.c | 2 ++
> include/linux/mm.h | 1 +
> 29 files changed, 132 insertions(+), 61 deletions(-)
>
> diff --git a/arch/alpha/mm/fault.c b/arch/alpha/mm/fault.c
> index 0c4132d..98838a0 100644
> --- a/arch/alpha/mm/fault.c
> +++ b/arch/alpha/mm/fault.c
> @@ -89,8 +89,7 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
> const struct exception_table_entry *fixup;
> int fault, si_code = SEGV_MAPERR;
> siginfo_t info;
> - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (cause > 0 ? FAULT_FLAG_WRITE : 0));
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> /* As of EV6, a load into $31/$f31 is a prefetch, and never faults
> (or is suppressed by the PALcode). Support that for older CPUs
> @@ -115,7 +114,8 @@ do_page_fault(unsigned long address, unsigned long mmcsr,
> if (address >= TASK_SIZE)
> goto vmalloc_fault;
> #endif
> -
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -142,6 +142,7 @@ retry:
> } else {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> }
>
> /* If for any reason at all we couldn't handle the fault,
> diff --git a/arch/arc/mm/fault.c b/arch/arc/mm/fault.c
> index 6b0bb41..d63f3de 100644
> --- a/arch/arc/mm/fault.c
> +++ b/arch/arc/mm/fault.c
> @@ -60,8 +60,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
> siginfo_t info;
> int fault, ret;
> int write = regs->ecr_cause & ECR_C_PROTV_STORE; /* ST/EX */
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> /*
> * We fault-in kernel-space virtual memory on-demand. The
> @@ -89,6 +88,8 @@ void do_page_fault(struct pt_regs *regs, unsigned long address)
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -117,6 +118,7 @@ good_area:
> if (write) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> goto bad_area;
> diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c
> index 217bcbf..eb8830a 100644
> --- a/arch/arm/mm/fault.c
> +++ b/arch/arm/mm/fault.c
> @@ -261,9 +261,7 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> struct task_struct *tsk;
> struct mm_struct *mm;
> int fault, sig, code;
> - int write = fsr & FSR_WRITE;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> if (notify_page_fault(regs, fsr))
> return 0;
> @@ -282,6 +280,11 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if (fsr & FSR_WRITE)
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
> diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c
> index dab1cfd..12205b4 100644
> --- a/arch/arm64/mm/fault.c
> +++ b/arch/arm64/mm/fault.c
> @@ -208,9 +208,7 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
> struct task_struct *tsk;
> struct mm_struct *mm;
> int fault, sig, code;
> - bool write = (esr & ESR_WRITE) && !(esr & ESR_CM);
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> tsk = current;
> mm = tsk->mm;
> @@ -226,6 +224,11 @@ static int __kprobes do_page_fault(unsigned long addr, unsigned int esr,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if ((esr & ESR_WRITE) && !(esr & ESR_CM))
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
> diff --git a/arch/avr32/mm/fault.c b/arch/avr32/mm/fault.c
> index 2ca27b0..0eca933 100644
> --- a/arch/avr32/mm/fault.c
> +++ b/arch/avr32/mm/fault.c
> @@ -86,6 +86,8 @@ asmlinkage void do_page_fault(unsigned long ecr, struct pt_regs *regs)
>
> local_irq_enable();
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
>
> diff --git a/arch/cris/mm/fault.c b/arch/cris/mm/fault.c
> index 73312ab..1790f22 100644
> --- a/arch/cris/mm/fault.c
> +++ b/arch/cris/mm/fault.c
> @@ -58,8 +58,7 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
> struct vm_area_struct * vma;
> siginfo_t info;
> int fault;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - ((writeaccess & 1) ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> D(printk(KERN_DEBUG
> "Page fault for %lX on %X at %lX, prot %d write %d\n",
> @@ -117,6 +116,8 @@ do_page_fault(unsigned long address, struct pt_regs *regs,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -155,6 +156,7 @@ retry:
> } else if (writeaccess == 1) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> goto bad_area;
> diff --git a/arch/frv/mm/fault.c b/arch/frv/mm/fault.c
> index 331c1e2..9a66372 100644
> --- a/arch/frv/mm/fault.c
> +++ b/arch/frv/mm/fault.c
> @@ -34,11 +34,11 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
> struct vm_area_struct *vma;
> struct mm_struct *mm;
> unsigned long _pme, lrai, lrad, fixup;
> + unsigned long flags = 0;
> siginfo_t info;
> pgd_t *pge;
> pud_t *pue;
> pte_t *pte;
> - int write;
> int fault;
>
> #if 0
> @@ -81,6 +81,9 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(__frame))
> + flags |= FAULT_FLAG_USER;
> +
> down_read(&mm->mmap_sem);
>
> vma = find_vma(mm, ear0);
> @@ -129,7 +132,6 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
> */
> good_area:
> info.si_code = SEGV_ACCERR;
> - write = 0;
> switch (esr0 & ESR0_ATXC) {
> default:
> /* handle write to write protected page */
> @@ -140,7 +142,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
> #endif
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> - write = 1;
> + flags |= FAULT_FLAG_WRITE;
> break;
>
> /* handle read from protected page */
> @@ -162,7 +164,7 @@ asmlinkage void do_page_fault(int datammu, unsigned long esr0, unsigned long ear
> * make sure we exit gracefully rather than endlessly redo
> * the fault.
> */
> - fault = handle_mm_fault(mm, vma, ear0, write ? FAULT_FLAG_WRITE : 0);
> + fault = handle_mm_fault(mm, vma, ear0, flags);
> if (unlikely(fault & VM_FAULT_ERROR)) {
> if (fault & VM_FAULT_OOM)
> goto out_of_memory;
> diff --git a/arch/hexagon/mm/vm_fault.c b/arch/hexagon/mm/vm_fault.c
> index 1bd276d..8704c93 100644
> --- a/arch/hexagon/mm/vm_fault.c
> +++ b/arch/hexagon/mm/vm_fault.c
> @@ -53,8 +53,7 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
> int si_code = SEGV_MAPERR;
> int fault;
> const struct exception_table_entry *fixup;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (cause > 0 ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> /*
> * If we're in an interrupt or have no user context,
> @@ -65,6 +64,8 @@ void do_page_fault(unsigned long address, long cause, struct pt_regs *regs)
>
> local_irq_enable();
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -96,6 +97,7 @@ good_area:
> case FLT_STORE:
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> break;
> }
>
> diff --git a/arch/ia64/mm/fault.c b/arch/ia64/mm/fault.c
> index 6cf0341..7225dad 100644
> --- a/arch/ia64/mm/fault.c
> +++ b/arch/ia64/mm/fault.c
> @@ -90,8 +90,6 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
> mask = ((((isr >> IA64_ISR_X_BIT) & 1UL) << VM_EXEC_BIT)
> | (((isr >> IA64_ISR_W_BIT) & 1UL) << VM_WRITE_BIT));
>
> - flags |= ((mask & VM_WRITE) ? FAULT_FLAG_WRITE : 0);
> -
> /* mmap_sem is performance critical.... */
> prefetchw(&mm->mmap_sem);
>
> @@ -119,6 +117,10 @@ ia64_do_page_fault (unsigned long address, unsigned long isr, struct pt_regs *re
> if (notify_page_fault(regs, TRAP_BRKPT))
> return;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if (mask & VM_WRITE)
> + flags |= FAULT_FLAG_WRITE;
> retry:
> down_read(&mm->mmap_sem);
>
> diff --git a/arch/m32r/mm/fault.c b/arch/m32r/mm/fault.c
> index 3cdfa9c..e9c6a80 100644
> --- a/arch/m32r/mm/fault.c
> +++ b/arch/m32r/mm/fault.c
> @@ -78,7 +78,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
> struct mm_struct *mm;
> struct vm_area_struct * vma;
> unsigned long page, addr;
> - int write;
> + unsigned long flags = 0;
> int fault;
> siginfo_t info;
>
> @@ -117,6 +117,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
> if (in_atomic() || !mm)
> goto bad_area_nosemaphore;
>
> + if (error_code & ACE_USERMODE)
> + flags |= FAULT_FLAG_USER;
> +
> /* When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> * kernel and should generate an OOPS. Unfortunately, in the case of an
> @@ -166,14 +169,13 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long error_code,
> */
> good_area:
> info.si_code = SEGV_ACCERR;
> - write = 0;
> switch (error_code & (ACE_WRITE|ACE_PROTECTION)) {
> default: /* 3: write, present */
> /* fall through */
> case ACE_WRITE: /* write, not present */
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> - write++;
> + flags |= FAULT_FLAG_WRITE;
> break;
> case ACE_PROTECTION: /* read, present */
> case 0: /* read, not present */
> @@ -194,7 +196,7 @@ good_area:
> */
> addr = (address & PAGE_MASK);
> set_thread_fault_code(error_code);
> - fault = handle_mm_fault(mm, vma, addr, write ? FAULT_FLAG_WRITE : 0);
> + fault = handle_mm_fault(mm, vma, addr, flags);
> if (unlikely(fault & VM_FAULT_ERROR)) {
> if (fault & VM_FAULT_OOM)
> goto out_of_memory;
> diff --git a/arch/m68k/mm/fault.c b/arch/m68k/mm/fault.c
> index a563727..eb1d61f 100644
> --- a/arch/m68k/mm/fault.c
> +++ b/arch/m68k/mm/fault.c
> @@ -88,6 +88,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
>
> diff --git a/arch/metag/mm/fault.c b/arch/metag/mm/fault.c
> index 8fddf46..332680e 100644
> --- a/arch/metag/mm/fault.c
> +++ b/arch/metag/mm/fault.c
> @@ -53,8 +53,7 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> struct vm_area_struct *vma, *prev_vma;
> siginfo_t info;
> int fault;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write_access ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> tsk = current;
>
> @@ -109,6 +108,8 @@ int do_page_fault(struct pt_regs *regs, unsigned long address,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
>
> @@ -121,6 +122,7 @@ good_area:
> if (write_access) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!(vma->vm_flags & (VM_READ | VM_EXEC | VM_WRITE)))
> goto bad_area;
> diff --git a/arch/microblaze/mm/fault.c b/arch/microblaze/mm/fault.c
> index 731f739..fa4cf52 100644
> --- a/arch/microblaze/mm/fault.c
> +++ b/arch/microblaze/mm/fault.c
> @@ -92,8 +92,7 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
> int code = SEGV_MAPERR;
> int is_write = error_code & ESR_S;
> int fault;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (is_write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> regs->ear = address;
> regs->esr = error_code;
> @@ -121,6 +120,9 @@ void do_page_fault(struct pt_regs *regs, unsigned long address,
> die("Weird page fault", regs, SIGSEGV);
> }
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> +
> /* When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> * kernel and should generate an OOPS. Unfortunately, in the case of an
> @@ -199,6 +201,7 @@ good_area:
> if (unlikely(is_write)) {
> if (unlikely(!(vma->vm_flags & VM_WRITE)))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> /* a read */
> } else {
> /* protection fault */
> diff --git a/arch/mips/mm/fault.c b/arch/mips/mm/fault.c
> index 94d3a31..becc42b 100644
> --- a/arch/mips/mm/fault.c
> +++ b/arch/mips/mm/fault.c
> @@ -42,8 +42,7 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
> const int field = sizeof(unsigned long) * 2;
> siginfo_t info;
> int fault;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> #if 0
> printk("Cpu%d[%s:%d:%0*lx:%ld:%0*lx]\n", raw_smp_processor_id(),
> @@ -93,6 +92,8 @@ static void __kprobes __do_page_fault(struct pt_regs *regs, unsigned long write,
> if (in_atomic() || !mm)
> goto bad_area_nosemaphore;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -114,6 +115,7 @@ good_area:
> if (write) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (cpu_has_rixi) {
> if (address == regs->cp0_epc && !(vma->vm_flags & VM_EXEC)) {
> diff --git a/arch/mn10300/mm/fault.c b/arch/mn10300/mm/fault.c
> index 8a2e6de..3516cbd 100644
> --- a/arch/mn10300/mm/fault.c
> +++ b/arch/mn10300/mm/fault.c
> @@ -171,6 +171,8 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long fault_code,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if ((fault_code & MMUFCR_xFC_ACCESS) == MMUFCR_xFC_ACCESS_USR)
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
>
> diff --git a/arch/openrisc/mm/fault.c b/arch/openrisc/mm/fault.c
> index 4a41f84..0703acf 100644
> --- a/arch/openrisc/mm/fault.c
> +++ b/arch/openrisc/mm/fault.c
> @@ -86,6 +86,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long address,
> if (user_mode(regs)) {
> /* Exception was in userspace: reenable interrupts */
> local_irq_enable();
> + flags |= FAULT_FLAG_USER;
> } else {
> /* If exception was in a syscall, then IRQ's may have
> * been enabled or disabled. If they were enabled,
> diff --git a/arch/parisc/mm/fault.c b/arch/parisc/mm/fault.c
> index f247a34..d10d27a 100644
> --- a/arch/parisc/mm/fault.c
> +++ b/arch/parisc/mm/fault.c
> @@ -180,6 +180,10 @@ void do_page_fault(struct pt_regs *regs, unsigned long code,
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if (acc_type & VM_WRITE)
> + flags |= FAULT_FLAG_WRITE;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma_prev(mm, address, &prev_vma);
> @@ -203,8 +207,7 @@ good_area:
> * fault.
> */
>
> - fault = handle_mm_fault(mm, vma, address,
> - flags | ((acc_type & VM_WRITE) ? FAULT_FLAG_WRITE : 0));
> + fault = handle_mm_fault(mm, vma, address, flags);
>
> if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> return;
> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
> index 8726779..d9196c9 100644
> --- a/arch/powerpc/mm/fault.c
> +++ b/arch/powerpc/mm/fault.c
> @@ -223,9 +223,6 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
> is_write = error_code & ESR_DST;
> #endif /* CONFIG_4xx || CONFIG_BOOKE */
>
> - if (is_write)
> - flags |= FAULT_FLAG_WRITE;
> -
> #ifdef CONFIG_PPC_ICSWX
> /*
> * we need to do this early because this "data storage
> @@ -280,6 +277,9 @@ int __kprobes do_page_fault(struct pt_regs *regs, unsigned long address,
>
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> +
> /* When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> * kernel and should generate an OOPS. Unfortunately, in the case of an
> @@ -408,6 +408,7 @@ good_area:
> } else if (is_write) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> /* a read */
> } else {
> /* protection fault */
> diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c
> index f00aefb..6fa7b05 100644
> --- a/arch/s390/mm/fault.c
> +++ b/arch/s390/mm/fault.c
> @@ -302,6 +302,8 @@ static inline int do_exception(struct pt_regs *regs, int access)
> address = trans_exc_code & __FAIL_ADDR_MASK;
> perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, address);
> flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
> + if (regs->psw.mask & PSW_MASK_PSTATE)
> + flags |= FAULT_FLAG_USER;
> if (access == VM_WRITE || (trans_exc_code & store_indication) == 0x400)
> flags |= FAULT_FLAG_WRITE;
> down_read(&mm->mmap_sem);
> diff --git a/arch/score/mm/fault.c b/arch/score/mm/fault.c
> index 4b71a62..52238983 100644
> --- a/arch/score/mm/fault.c
> +++ b/arch/score/mm/fault.c
> @@ -47,6 +47,7 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> struct task_struct *tsk = current;
> struct mm_struct *mm = tsk->mm;
> const int field = sizeof(unsigned long) * 2;
> + unsigned long flags = 0;
> siginfo_t info;
> int fault;
>
> @@ -75,6 +76,9 @@ asmlinkage void do_page_fault(struct pt_regs *regs, unsigned long write,
> if (in_atomic() || !mm)
> goto bad_area_nosemaphore;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> +
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> if (!vma)
> @@ -95,6 +99,7 @@ good_area:
> if (write) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!(vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> goto bad_area;
> @@ -105,7 +110,7 @@ good_area:
> * make sure we exit gracefully rather than endlessly redo
> * the fault.
> */
> - fault = handle_mm_fault(mm, vma, address, write);
> + fault = handle_mm_fault(mm, vma, address, flags);
> if (unlikely(fault & VM_FAULT_ERROR)) {
> if (fault & VM_FAULT_OOM)
> goto out_of_memory;
> diff --git a/arch/sh/mm/fault.c b/arch/sh/mm/fault.c
> index 1f49c28..541dc61 100644
> --- a/arch/sh/mm/fault.c
> +++ b/arch/sh/mm/fault.c
> @@ -400,9 +400,7 @@ asmlinkage void __kprobes do_page_fault(struct pt_regs *regs,
> struct mm_struct *mm;
> struct vm_area_struct * vma;
> int fault;
> - int write = error_code & FAULT_CODE_WRITE;
> - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0));
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> tsk = current;
> mm = tsk->mm;
> @@ -476,6 +474,11 @@ good_area:
>
> set_thread_fault_code(error_code);
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if (error_code & FAULT_CODE_WRITE)
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> diff --git a/arch/sparc/mm/fault_32.c b/arch/sparc/mm/fault_32.c
> index e98bfda..59dbd46 100644
> --- a/arch/sparc/mm/fault_32.c
> +++ b/arch/sparc/mm/fault_32.c
> @@ -177,8 +177,7 @@ asmlinkage void do_sparc_fault(struct pt_regs *regs, int text_fault, int write,
> unsigned long g2;
> int from_user = !(regs->psr & PSR_PS);
> int fault, code;
> - unsigned int flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0));
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> if (text_fault)
> address = regs->pc;
> @@ -235,6 +234,11 @@ good_area:
> goto bad_area;
> }
>
> + if (from_user)
> + flags |= FAULT_FLAG_USER;
> + if (write)
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * If for any reason at all we couldn't handle the fault,
> * make sure we exit gracefully rather than endlessly redo
> @@ -383,6 +387,7 @@ static void force_user_fault(unsigned long address, int write)
> struct vm_area_struct *vma;
> struct task_struct *tsk = current;
> struct mm_struct *mm = tsk->mm;
> + unsigned int flags = FAULT_FLAG_USER;
> int code;
>
> code = SEGV_MAPERR;
> @@ -402,11 +407,12 @@ good_area:
> if (write) {
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> goto bad_area;
> }
> - switch (handle_mm_fault(mm, vma, address, write ? FAULT_FLAG_WRITE : 0)) {
> + switch (handle_mm_fault(mm, vma, address, flags)) {
> case VM_FAULT_SIGBUS:
> case VM_FAULT_OOM:
> goto do_sigbus;
> diff --git a/arch/sparc/mm/fault_64.c b/arch/sparc/mm/fault_64.c
> index 5062ff3..c08b9bb 100644
> --- a/arch/sparc/mm/fault_64.c
> +++ b/arch/sparc/mm/fault_64.c
> @@ -314,8 +314,9 @@ asmlinkage void __kprobes do_sparc64_fault(struct pt_regs *regs)
> } else {
> bad_kernel_pc(regs, address);
> return;
> - }
> - }
> + }
> + } else
> + flags |= FAULT_FLAG_USER;
>
> /*
> * If we're in an interrupt or have no user
> @@ -418,13 +419,14 @@ good_area:
> vma->vm_file != NULL)
> set_thread_fault_code(fault_code |
> FAULT_CODE_BLKCOMMIT);
> +
> + flags |= FAULT_FLAG_WRITE;
> } else {
> /* Allow reads even for write-only mappings */
> if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> goto bad_area;
> }
>
> - flags |= ((fault_code & FAULT_CODE_WRITE) ? FAULT_FLAG_WRITE : 0);
> fault = handle_mm_fault(mm, vma, address, flags);
>
> if ((fault & VM_FAULT_RETRY) && fatal_signal_pending(current))
> diff --git a/arch/tile/mm/fault.c b/arch/tile/mm/fault.c
> index ac553ee..3ff289f 100644
> --- a/arch/tile/mm/fault.c
> +++ b/arch/tile/mm/fault.c
> @@ -280,8 +280,7 @@ static int handle_page_fault(struct pt_regs *regs,
> if (!is_page_fault)
> write = 1;
>
> - flags = (FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0));
> + flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> is_kernel_mode = (EX1_PL(regs->ex1) != USER_PL);
>
> @@ -365,6 +364,9 @@ static int handle_page_fault(struct pt_regs *regs,
> goto bad_area_nosemaphore;
> }
>
> + if (!is_kernel_mode)
> + flags |= FAULT_FLAG_USER;
> +
> /*
> * When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in the
> @@ -425,6 +427,7 @@ good_area:
> #endif
> if (!(vma->vm_flags & VM_WRITE))
> goto bad_area;
> + flags |= FAULT_FLAG_WRITE;
> } else {
> if (!is_page_fault || !(vma->vm_flags & VM_READ))
> goto bad_area;
> diff --git a/arch/um/kernel/trap.c b/arch/um/kernel/trap.c
> index b2f5adf..5c3aef7 100644
> --- a/arch/um/kernel/trap.c
> +++ b/arch/um/kernel/trap.c
> @@ -30,8 +30,7 @@ int handle_page_fault(unsigned long address, unsigned long ip,
> pmd_t *pmd;
> pte_t *pte;
> int err = -EFAULT;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (is_write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> *code_out = SEGV_MAPERR;
>
> @@ -42,6 +41,8 @@ int handle_page_fault(unsigned long address, unsigned long ip,
> if (in_atomic())
> goto out_nosemaphore;
>
> + if (is_user)
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> @@ -58,12 +59,15 @@ retry:
>
> good_area:
> *code_out = SEGV_ACCERR;
> - if (is_write && !(vma->vm_flags & VM_WRITE))
> - goto out;
> -
> - /* Don't require VM_READ|VM_EXEC for write faults! */
> - if (!is_write && !(vma->vm_flags & (VM_READ | VM_EXEC)))
> - goto out;
> + if (is_write) {
> + if (!(vma->vm_flags & VM_WRITE))
> + goto out;
> + flags |= FAULT_FLAG_WRITE;
> + } else {
> + /* Don't require VM_READ|VM_EXEC for write faults! */
> + if (!(vma->vm_flags & (VM_READ | VM_EXEC)))
> + goto out;
> + }
>
> do {
> int fault;
> diff --git a/arch/unicore32/mm/fault.c b/arch/unicore32/mm/fault.c
> index 8ed3c45..0dc922d 100644
> --- a/arch/unicore32/mm/fault.c
> +++ b/arch/unicore32/mm/fault.c
> @@ -209,8 +209,7 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> struct task_struct *tsk;
> struct mm_struct *mm;
> int fault, sig, code;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - ((!(fsr ^ 0x12)) ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> tsk = current;
> mm = tsk->mm;
> @@ -222,6 +221,11 @@ static int do_pf(unsigned long addr, unsigned int fsr, struct pt_regs *regs)
> if (in_atomic() || !mm)
> goto no_context;
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> + if (!(fsr ^ 0x12))
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * As per x86, we may deadlock here. However, since the kernel only
> * validly references user space from well defined areas of the code,
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 654be4a..6d77c38 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1011,9 +1011,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
> unsigned long address;
> struct mm_struct *mm;
> int fault;
> - int write = error_code & PF_WRITE;
> - unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE |
> - (write ? FAULT_FLAG_WRITE : 0);
> + unsigned int flags = FAULT_FLAG_ALLOW_RETRY | FAULT_FLAG_KILLABLE;
>
> tsk = current;
> mm = tsk->mm;
> @@ -1083,6 +1081,7 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
> if (user_mode_vm(regs)) {
> local_irq_enable();
> error_code |= PF_USER;
> + flags |= FAULT_FLAG_USER;
> } else {
> if (regs->flags & X86_EFLAGS_IF)
> local_irq_enable();
> @@ -1109,6 +1108,9 @@ __do_page_fault(struct pt_regs *regs, unsigned long error_code)
> return;
> }
>
> + if (error_code & PF_WRITE)
> + flags |= FAULT_FLAG_WRITE;
> +
> /*
> * When running in the kernel we expect faults to occur only to
> * addresses in user space. All other faults represent errors in
> diff --git a/arch/xtensa/mm/fault.c b/arch/xtensa/mm/fault.c
> index 4b7bc8d..70fa7bc 100644
> --- a/arch/xtensa/mm/fault.c
> +++ b/arch/xtensa/mm/fault.c
> @@ -72,6 +72,8 @@ void do_page_fault(struct pt_regs *regs)
> address, exccause, regs->pc, is_write? "w":"", is_exec? "x":"");
> #endif
>
> + if (user_mode(regs))
> + flags |= FAULT_FLAG_USER;
> retry:
> down_read(&mm->mmap_sem);
> vma = find_vma(mm, address);
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d5c82dc..c51fc32 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -170,6 +170,7 @@ extern pgprot_t protection_map[16];
> #define FAULT_FLAG_RETRY_NOWAIT 0x10 /* Don't drop mmap_sem and wait when retrying */
> #define FAULT_FLAG_KILLABLE 0x20 /* The fault task is in SIGKILL killable region */
> #define FAULT_FLAG_TRIED 0x40 /* second try */
> +#define FAULT_FLAG_USER 0x80 /* The fault originated in userspace */
>
> /*
> * vm_fault is filled by the the pagefault handler and passed to the vma's
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-26 13:52:13

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/6] x86: finish user fault error path with fatal signal

On Thu 25-07-13 18:25:36, Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has a fatal signal pending. For a subsequent patch this is a
> problem in OOM situations because it relies on
> pagefault_out_of_memory() being called even when the task has been
> killed, to perform proper per-task OOM state unwinding.
>
> Shortcutting the fault like this is a rather minor optimization that
> saves a few instructions in rare cases. Just remove it for
> user-triggered faults.

OK, I thought that this optimization tries to prevent calling OOM
because the current might release some memory but that wasn't the
intention of b80ef10e8 (x86: Move do_page_fault()'s error path under
unlikely()).

> Use the opportunity to split the fault retry handling from actual
> fault errors and add locking documentation that reads surprisingly
> similar to ARM's.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Reviewed-by: Michal Hocko <[email protected]>

> ---
> arch/x86/mm/fault.c | 35 +++++++++++++++++------------------
> 1 file changed, 17 insertions(+), 18 deletions(-)
>
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index 6d77c38..3aaeffc 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -842,23 +842,15 @@ do_sigbus(struct pt_regs *regs, unsigned long error_code, unsigned long address,
> force_sig_info_fault(SIGBUS, code, address, tsk, fault);
> }
>
> -static noinline int
> +static noinline void
> mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> unsigned long address, unsigned int fault)
> {
> - /*
> - * Pagefault was interrupted by SIGKILL. We have no reason to
> - * continue pagefault.
> - */
> - if (fatal_signal_pending(current)) {
> - if (!(fault & VM_FAULT_RETRY))
> - up_read(&current->mm->mmap_sem);
> - if (!(error_code & PF_USER))
> - no_context(regs, error_code, address, 0, 0);
> - return 1;
> + if (fatal_signal_pending(current) && !(error_code & PF_USER)) {
> + up_read(&current->mm->mmap_sem);
> + no_context(regs, error_code, address, 0, 0);
> + return;
> }
> - if (!(fault & VM_FAULT_ERROR))
> - return 0;
>
> if (fault & VM_FAULT_OOM) {
> /* Kernel mode? Handle exceptions or die: */
> @@ -866,7 +858,7 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> up_read(&current->mm->mmap_sem);
> no_context(regs, error_code, address,
> SIGSEGV, SEGV_MAPERR);
> - return 1;
> + return;
> }
>
> up_read(&current->mm->mmap_sem);
> @@ -884,7 +876,6 @@ mm_fault_error(struct pt_regs *regs, unsigned long error_code,
> else
> BUG();
> }
> - return 1;
> }
>
> static int spurious_fault_check(unsigned long error_code, pte_t *pte)
> @@ -1189,9 +1180,17 @@ good_area:
> */
> fault = handle_mm_fault(mm, vma, address, flags);
>
> - if (unlikely(fault & (VM_FAULT_RETRY|VM_FAULT_ERROR))) {
> - if (mm_fault_error(regs, error_code, address, fault))
> - return;
> + /*
> + * If we need to retry but a fatal signal is pending, handle the
> + * signal first. We do not need to release the mmap_sem because it
> + * would already be released in __lock_page_or_retry in mm/filemap.c.
> + */
> + if (unlikely((fault & VM_FAULT_RETRY) && fatal_signal_pending(current)))
> + return;
> +
> + if (unlikely(fault & VM_FAULT_ERROR)) {
> + mm_fault_error(regs, error_code, address, fault);
> + return;
> }
>
> /*
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-26 14:16:47

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

On Thu 25-07-13 18:25:37, Johannes Weiner wrote:
> System calls and kernel faults (uaccess, gup) can handle an out of
> memory situation gracefully and just return -ENOMEM.
>
> Enable the memcg OOM killer only for user faults, where it's really
> the only option available.
>
> Signed-off-by: Johannes Weiner <[email protected]>

It looks OK to me, but I have a few comments below. Nothing really huge,
but I do not like mem_cgroup_xchg_may_oom for !MEMCG.

> ---
> include/linux/memcontrol.h | 23 +++++++++++++++++++++++
> include/linux/sched.h | 3 +++
> mm/filemap.c | 11 ++++++++++-
> mm/memcontrol.c | 2 +-
> mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
> 5 files changed, 67 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7b4d9d7..9bb5eeb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> struct page *newpage);
>
> +/**
> + * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
> + * @p: task

Is this ever safe to call on !current? If not, then I wouldn't allow
p to be passed as a parameter.

> + * @new: true to enable, false to disable
> + *
> + * Toggle whether a failed memcg charge should invoke the OOM killer
> + * or just return -ENOMEM. Returns the previous toggle state.
> + */
> +static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
> +{
> + bool old;
> +
> + old = p->memcg_oom.may_oom;
> + p->memcg_oom.may_oom = new;
> +
> + return old;
> +}
> +
> #ifdef CONFIG_MEMCG_SWAP
> extern int do_swap_account;
> #endif
> @@ -348,6 +366,11 @@ static inline void mem_cgroup_end_update_page_stat(struct page *page,
> {
> }
>
> +static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
> +{
> + return !new;
> +}

This looks a bit weird. MEMCG is not compiled in and yet we return that
the previous may_oom could have been true. This is calling for trouble if
somebody tries to make any subsequent decisions based on the returned value.
Why not simply return false unconditionally?

[...]
> diff --git a/mm/filemap.c b/mm/filemap.c
> index a6981fe..2932810 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
[...]
> @@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> * We found the page, so try async readahead before
> * waiting for the lock.
> */
> + may_oom = mem_cgroup_xchg_may_oom(current, 0);

s/0/false/

below ditto

> do_async_mmap_readahead(vma, ra, file, page, offset);
> + mem_cgroup_xchg_may_oom(current, may_oom);
> } else if (!page) {
> /* No page in the page cache at all */
> + may_oom = mem_cgroup_xchg_may_oom(current, 0);
> do_sync_mmap_readahead(vma, ra, file, offset);
> + mem_cgroup_xchg_may_oom(current, may_oom);
> count_vm_event(PGMAJFAULT);
> mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> ret = VM_FAULT_MAJOR;
[...]
> diff --git a/mm/memory.c b/mm/memory.c
> index f2ab2a8..5ea7b47 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
[...]
> @@ -3851,6 +3843,34 @@ retry:
> return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> }
>
> +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> +{
> + int ret;
> +
> + __set_current_state(TASK_RUNNING);
> +
> + count_vm_event(PGFAULT);
> + mem_cgroup_count_vm_event(mm, PGFAULT);
> +
> + /* do counter updates before entering really critical section. */
> + check_sync_rss_stat(current);
> +
> + /*
> + * Enable the memcg OOM handling for faults triggered in user
> + * space. Kernel faults are handled more gracefully.
> + */
> + if (flags & FAULT_FLAG_USER)
> + WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
> +
> + ret = __handle_mm_fault(mm, vma, address, flags);
> +
> + if (flags & FAULT_FLAG_USER)
> + WARN_ON(mem_cgroup_xchg_may_oom(current, false) == false);

Ohh, I see why you used !new in mem_cgroup_xchg_may_oom for !MEMCG case
above. This could be fixed easily if you add mem_cgroup_{enable,disable}_oom
which would be empty for !MEMCG.
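
As an illustration of that suggestion, the helper pair could look roughly like
this (a sketch only; the exact names and the WARN_ON placement are up to the
next revision of the series):

#ifdef CONFIG_MEMCG
static inline void mem_cgroup_enable_oom(void)
{
        WARN_ON(current->memcg_oom.may_oom);
        current->memcg_oom.may_oom = 1;
}

static inline void mem_cgroup_disable_oom(void)
{
        WARN_ON(!current->memcg_oom.may_oom);
        current->memcg_oom.may_oom = 0;
}
#else
static inline void mem_cgroup_enable_oom(void)
{
}

static inline void mem_cgroup_disable_oom(void)
{
}
#endif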

> +
> + return ret;
> +}
> +
> #ifndef __PAGETABLE_PUD_FOLDED
> /*
> * Allocate page upper directory.
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-26 14:43:17

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Thu 25-07-13 18:25:38, Johannes Weiner wrote:
> The memcg OOM handling is incredibly fragile and can deadlock. When a
> task fails to charge memory, it invokes the OOM killer and loops right
> there in the charge code until it succeeds. Comparably, any other
> task that enters the charge path at this point will go to a waitqueue
> right then and there and sleep until the OOM situation is resolved.
> The problem is that these tasks may hold filesystem locks and the
> mmap_sem; locks that the selected OOM victim may need to exit.
>
> For example, in one reported case, the task invoking the OOM killer
> was about to charge a page cache page during a write(), which holds
> the i_mutex. The OOM killer selected a task that was just entering
> truncate() and trying to acquire the i_mutex:
>
> OOM invoking task:
> [<ffffffff8110a9c1>] mem_cgroup_handle_oom+0x241/0x3b0
> [<ffffffff8110b5ab>] T.1146+0x5ab/0x5c0
> [<ffffffff8110c22e>] mem_cgroup_cache_charge+0xbe/0xe0
> [<ffffffff810ca28c>] add_to_page_cache_locked+0x4c/0x140
> [<ffffffff810ca3a2>] add_to_page_cache_lru+0x22/0x50
> [<ffffffff810ca45b>] grab_cache_page_write_begin+0x8b/0xe0
> [<ffffffff81193a18>] ext3_write_begin+0x88/0x270
> [<ffffffff810c8fc6>] generic_file_buffered_write+0x116/0x290
> [<ffffffff810cb3cc>] __generic_file_aio_write+0x27c/0x480
> [<ffffffff810cb646>] generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
> [<ffffffff8111156a>] do_sync_write+0xea/0x130
> [<ffffffff81112183>] vfs_write+0xf3/0x1f0
> [<ffffffff81112381>] sys_write+0x51/0x90
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> OOM kill victim:
> [<ffffffff811109b8>] do_truncate+0x58/0xa0 # takes i_mutex
> [<ffffffff81121c90>] do_last+0x250/0xa30
> [<ffffffff81122547>] path_openat+0xd7/0x440
> [<ffffffff811229c9>] do_filp_open+0x49/0xa0
> [<ffffffff8110f7d6>] do_sys_open+0x106/0x240
> [<ffffffff8110f950>] sys_open+0x20/0x30
> [<ffffffff815b5926>] system_call_fastpath+0x18/0x1d
> [<ffffffffffffffff>] 0xffffffffffffffff
>
> The OOM handling task will retry the charge indefinitely while the OOM
> killed task is not releasing any resources.
>
> A similar scenario can happen when the kernel OOM killer for a memcg
> is disabled and a userspace task is in charge of resolving OOM
> situations. In this case, ALL tasks that enter the OOM path will be
> made to sleep on the OOM waitqueue and wait for userspace to free
> resources or increase the group's limit. But a userspace OOM handler
> is prone to deadlock itself on the locks held by the waiting tasks.
> For example one of the sleeping tasks may be stuck in a brk() call
> with the mmap_sem held for writing but the userspace handler, in order
> to pick an optimal victim, may need to read files from /proc/<pid>,
> which tries to acquire the same mmap_sem for reading and deadlocks.
>
> This patch changes the way tasks behave after detecting a memcg OOM
> and makes sure nobody loops or sleeps with locks held:
>
> 1. When OOMing in a user fault, invoke the OOM killer and restart the
> fault instead of looping on the charge attempt. This way, the OOM
> victim can not get stuck on locks the looping task may hold.
>
> 2. When OOMing in a user fault but somebody else is handling it
> (either the kernel OOM killer or a userspace handler), don't go to
> sleep in the charge context. Instead, remember the OOMing memcg in
> the task struct and then fully unwind the page fault stack with
> -ENOMEM. pagefault_out_of_memory() will then call back into the
> memcg code to check if the -ENOMEM came from the memcg, and then
> either put the task to sleep on the memcg's OOM waitqueue or just
> restart the fault. The OOM victim can no longer get stuck on any
> lock a sleeping task may hold.
>
> This relies on the memcg OOM killer only being enabled when an
> allocation failure will result in a call to pagefault_out_of_memory().
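
Schematically, the flow described above looks like this (an illustrative
sketch, not the patch itself; "charge_failed" stands in for the real charge
failure condition and details are elided):

/* charge path, reached from a user fault with may_oom enabled */
if (charge_failed && current->memcg_oom.may_oom) {
        /* somebody else handles the OOM: remember it, do not sleep here */
        css_get(&memcg->css);
        current->memcg_oom.wait_on_memcg = memcg;
        return -ENOMEM;                 /* unwinds the fault stack as VM_FAULT_OOM */
}

/* later, with the fault stack fully unwound */
void pagefault_out_of_memory(void)
{
        if (mem_cgroup_oom_synchronize())
                return;                 /* memcg OOM handled, the fault is restarted */
        /* otherwise fall through to the global OOM killer */
}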
>
> While reworking the OOM routine, also remove a needless OOM waitqueue
> wakeup when invoking the killer. In addition to the wakeup implied in
> the kill signal delivery, only uncharges and limit increases, things
> that actually change the memory situation, should poke the waitqueue.
>
> Reported-by: azurIt <[email protected]>
> Debugged-by: Michal Hocko <[email protected]>
> Signed-off-by: Johannes Weiner <[email protected]>

Looks good, just one remark below.

[...]
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30ae46a..029a3a8 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
[...]
> @@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> }
>
> /*
> - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> + * try to call OOM killer
> */
> -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> - int order)
> +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> {
> - struct oom_wait_info owait;
> - bool locked, need_to_kill;
> + bool locked, need_to_kill = true;
>
> - owait.memcg = memcg;
> - owait.wait.flags = 0;
> - owait.wait.func = memcg_oom_wake_function;
> - owait.wait.private = current;
> - INIT_LIST_HEAD(&owait.wait.task_list);
> - need_to_kill = true;
> - mem_cgroup_mark_under_oom(memcg);

You are marking memcg under_oom only for the sleepers. So if we have
no sleepers then the memcg will never report it is under oom which
is a behavior change. On the other hand who-ever relies on under_oom
under such conditions (it would basically mean a busy loop reading
memory.oom_control) would be racy anyway so it is questionable it
matters at all. At least now when we do not have any active notification
that under_oom has changed.

Anyway, this shouldn't be a part of this patch so if you want it because
it saves a pointless hierarchy traversal then make it a separate patch
with explanation why the new behavior is still OK.

> + if (!current->memcg_oom.may_oom)
> + return;
> +
> + current->memcg_oom.in_memcg_oom = 1;
>
> /* At first, try to OOM lock hierarchy under memcg.*/
> spin_lock(&memcg_oom_lock);
> locked = mem_cgroup_oom_lock(memcg);
> - /*
> - * Even if signal_pending(), we can't quit charge() loop without
> - * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> - * under OOM is always welcomed, use TASK_KILLABLE here.
> - */
> - prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> if (!locked || memcg->oom_kill_disable)
> need_to_kill = false;
> if (locked)
> @@ -2221,24 +2212,100 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> spin_unlock(&memcg_oom_lock);
>
> if (need_to_kill) {
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> mem_cgroup_out_of_memory(memcg, mask, order);
> } else {
> - schedule();
> - finish_wait(&memcg_oom_waitq, &owait.wait);
> + /*
> + * A system call can just return -ENOMEM, but if this
> + * is a page fault and somebody else is handling the
> + * OOM already, we need to sleep on the OOM waitqueue
> + * for this memcg until the situation is resolved.
> + * Which can take some time because it might be
> + * handled by a userspace task.
> + *
> + * However, this is the charge context, which means
> + * that we may sit on a large call stack and hold
> + * various filesystem locks, the mmap_sem etc. and we
> + * don't want the OOM handler to deadlock on them
> + * while we sit here and wait. Store the current OOM
> + * context in the task_struct, then return -ENOMEM.
> + * At the end of the page fault handler, with the
> + * stack unwound, pagefault_out_of_memory() will check
> + * back with us by calling
> + * mem_cgroup_oom_synchronize(), possibly putting the
> + * task to sleep.
> + */
> + mem_cgroup_mark_under_oom(memcg);
> + current->memcg_oom.wakeups = atomic_read(&memcg->oom_wakeups);
> + css_get(&memcg->css);
> + current->memcg_oom.wait_on_memcg = memcg;
> }
> - spin_lock(&memcg_oom_lock);
> - if (locked)
> +
> + if (locked) {
> + spin_lock(&memcg_oom_lock);
> mem_cgroup_oom_unlock(memcg);
> - memcg_wakeup_oom(memcg);
> - spin_unlock(&memcg_oom_lock);
> + /*
> + * Sleeping tasks might have been killed, make sure
> + * they get scheduled so they can exit.
> + */
> + if (need_to_kill)
> + memcg_oom_recover(memcg);
> + spin_unlock(&memcg_oom_lock);
> + }
> +}
>
> - mem_cgroup_unmark_under_oom(memcg);
[...]
--
Michal Hocko
SUSE Labs

2013-07-26 18:45:16

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3/6] arch: mm: pass userspace fault flag to generic fault handler

On Fri, Jul 26, 2013 at 03:19:47PM +0200, Michal Hocko wrote:
> On Thu 25-07-13 18:25:35, Johannes Weiner wrote:
> > Unlike global OOM handling, memory cgroup code will invoke the OOM
> > killer in any OOM situation because it has no way of telling faults
> > occurring in kernel context - which could be handled more gracefully -
> > from user-triggered faults.
> >
> > Pass a flag that identifies faults originating in user space from the
> > architecture-specific fault handlers to generic code so that memcg OOM
> > handling can be improved.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Looks good to me but I guess maintainers of the affected archs should be
> CCed

linux-arch is on CC, that should do the trick :)

> Reviewed-by: Michal Hocko <[email protected]>

Thanks!

2013-07-26 18:47:08

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 4/6] x86: finish user fault error path with fatal signal

On Fri, Jul 26, 2013 at 03:52:07PM +0200, Michal Hocko wrote:
> On Thu 25-07-13 18:25:36, Johannes Weiner wrote:
> > The x86 fault handler bails in the middle of error handling when the
> > task has a fatal signal pending. For a subsequent patch this is a
> > problem in OOM situations because it relies on
> > pagefault_out_of_memory() being called even when the task has been
> > killed, to perform proper per-task OOM state unwinding.
> >
> > Shortcutting the fault like this is a rather minor optimization that
> > saves a few instructions in rare cases. Just remove it for
> > user-triggered faults.
>
> OK, I thought that this optimization tries to prevent calling OOM
> because the current might release some memory but that wasn't the
> intention of b80ef10e8 (x86: Move do_page_fault()'s error path under
> unlikely()).

out_of_memory() also checks the caller for pending signals, so it
would not actually invoke the OOM killer if the caller is already
dying.
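
For reference, the check in question sits near the top of out_of_memory() and
is roughly of this shape in the kernels of that era (quoted from memory, so
treat it as approximate):

        /*
         * If current has a pending SIGKILL, automatically select it so it
         * can access memory reserves, exit quickly and free its memory,
         * instead of killing something else.
         */
        if (fatal_signal_pending(current)) {
                set_thread_flag(TIF_MEMDIE);
                return;
        }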

> > Use the opportunity to split the fault retry handling from actual
> > fault errors and add locking documentation that reads surprisingly
> > similar to ARM's.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> Reviewed-by: Michal Hocko <[email protected]>

Thanks!

2013-07-26 18:54:18

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

On Fri, Jul 26, 2013 at 04:16:42PM +0200, Michal Hocko wrote:
> On Thu 25-07-13 18:25:37, Johannes Weiner wrote:
> > System calls and kernel faults (uaccess, gup) can handle an out of
> > memory situation gracefully and just return -ENOMEM.
> >
> > Enable the memcg OOM killer only for user faults, where it's really
> > the only option available.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> It looks OK to me, but I have a few comments below. Nothing really huge,
> but I do not like mem_cgroup_xchg_may_oom for !MEMCG.

:-)

> > ---
> > include/linux/memcontrol.h | 23 +++++++++++++++++++++++
> > include/linux/sched.h | 3 +++
> > mm/filemap.c | 11 ++++++++++-
> > mm/memcontrol.c | 2 +-
> > mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
> > 5 files changed, 67 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 7b4d9d7..9bb5eeb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> > struct page *newpage);
> >
> > +/**
> > + * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
> > + * @p: task
>
> Is this ever safe to call on !current? If not, then I wouldn't allow
> p to be passed as a parameter.

Makes sense, I removed the parameter.

> > @@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > * We found the page, so try async readahead before
> > * waiting for the lock.
> > */
> > + may_oom = mem_cgroup_xchg_may_oom(current, 0);
>
> s/0/false/
>
> below ditto

Oops, updated both sites.

> > do_async_mmap_readahead(vma, ra, file, page, offset);
> > + mem_cgroup_xchg_may_oom(current, may_oom);
> > } else if (!page) {
> > /* No page in the page cache at all */
> > + may_oom = mem_cgroup_xchg_may_oom(current, 0);
> > do_sync_mmap_readahead(vma, ra, file, offset);
> > + mem_cgroup_xchg_may_oom(current, may_oom);
> > count_vm_event(PGMAJFAULT);
> > mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> > ret = VM_FAULT_MAJOR;
> [...]
> > diff --git a/mm/memory.c b/mm/memory.c
> > index f2ab2a8..5ea7b47 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> [...]
> > @@ -3851,6 +3843,34 @@ retry:
> > return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> > }
> >
> > +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, unsigned int flags)
> > +{
> > + int ret;
> > +
> > + __set_current_state(TASK_RUNNING);
> > +
> > + count_vm_event(PGFAULT);
> > + mem_cgroup_count_vm_event(mm, PGFAULT);
> > +
> > + /* do counter updates before entering really critical section. */
> > + check_sync_rss_stat(current);
> > +
> > + /*
> > + * Enable the memcg OOM handling for faults triggered in user
> > + * space. Kernel faults are handled more gracefully.
> > + */
> > + if (flags & FAULT_FLAG_USER)
> > + WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
> > +
> > + ret = __handle_mm_fault(mm, vma, address, flags);
> > +
> > + if (flags & FAULT_FLAG_USER)
> > + WARN_ON(mem_cgroup_xchg_may_oom(current, false) == false);
>
> Ohh, I see why you used !new in mem_cgroup_xchg_may_oom for !MEMCG case
> above. This could be fixed easily if you add mem_cgroup_{enable,disable}_oom
> which would be empty for !MEMCG.

You're right, it's much cleaner this way. I added the enable/disable
functions, which has the advantage that the memcg-specific WARN_ON is
also not in generic code but encapsulated nicely.
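
Roughly, the handle_mm_fault() call site then reduces to something like the
following (a sketch assuming the enable/disable helper names discussed above;
the final form is whatever the next posting of the series contains):

        if (flags & FAULT_FLAG_USER)
                mem_cgroup_enable_oom();

        ret = __handle_mm_fault(mm, vma, address, flags);

        if (flags & FAULT_FLAG_USER)
                mem_cgroup_disable_oom();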

Thanks for your feedback!

2013-07-26 21:28:28

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Fri, Jul 26, 2013 at 04:43:10PM +0200, Michal Hocko wrote:
> On Thu 25-07-13 18:25:38, Johannes Weiner wrote:
> > @@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > }
> >
> > /*
> > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > + * try to call OOM killer
> > */
> > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > - int order)
> > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > {
> > - struct oom_wait_info owait;
> > - bool locked, need_to_kill;
> > + bool locked, need_to_kill = true;
> >
> > - owait.memcg = memcg;
> > - owait.wait.flags = 0;
> > - owait.wait.func = memcg_oom_wake_function;
> > - owait.wait.private = current;
> > - INIT_LIST_HEAD(&owait.wait.task_list);
> > - need_to_kill = true;
> > - mem_cgroup_mark_under_oom(memcg);
>
> You are marking memcg under_oom only for the sleepers. So if we have
> no sleepers then the memcg will never report it is under oom which
> is a behavior change. On the other hand who-ever relies on under_oom
> under such conditions (it would basically mean a busy loop reading
> memory.oom_control) would be racy anyway so it is questionable it
> matters at all. At least now when we do not have any active notification
> that under_oom has changed.
>
> Anyway, this shouldn't be a part of this patch so if you want it because
> it saves a pointless hierarchy traversal then make it a separate patch
> with explanation why the new behavior is still OK.

This made me think again about how the locking and waking in there
works and I found a bug in this patch.

Basically, we have an open-coded sleeping lock in there and it's all
obfuscated by having way too much stuffed into the memcg_oom_lock
section.

Removing all the clutter, it becomes clear that I can't remove that
(undocumented) final wakeup at the end of the function. As with any
lock, a contender has to be woken up after unlock. We can't rely on
the lock holder's OOM kill to trigger uncharges and wakeups, because a
contender for the OOM lock could show up after the OOM kill but before
the lock is released. If there weren't any more wakeups, the
contender would sleep indefinitely.

It also becomes clear that I can't remove the
mem_cgroup_mark_under_oom() like that because it is key in receiving
wakeups. And as with any sleeping lock, we need to listen to wakeups
before attempting the trylock, or we might miss the wakeup from the
unlock.

It definitely became a separate patch, which cleans up this unholy
mess first before putting new things on top:

---
From: Johannes Weiner <[email protected]>
Subject: [patch] mm: memcg: rework and document OOM serialization

1. Remove the return value of mem_cgroup_oom_unlock().

2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().

3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
makes it more obvious that the task has to be on the waitqueue
before attempting to OOM-trylock the hierarchy, to not miss any
wakeups before going to sleep. It just didn't matter until now
because it was all lumped together into the global memcg_oom_lock
spinlock section.

4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
It is protected by the hierarchical OOM-lock.

5. The memcg_oom_lock spinlock is only required to propagate the OOM
lock in any given hierarchy atomically. Restrict its scope to
mem_cgroup_oom_(trylock|unlock).

6. Do not wake up the waitqueue unconditionally at the end of the
function. Only the lockholder has to wake up the next in line
after releasing the lock.

Note that the lockholder kicks off the OOM-killer, which in turn
leads to wakeups from the uncharges of the exiting task. But any
contender is not guaranteed to see them if it enters the OOM path
after the OOM kills but before the lockholder releases the lock.
Thus the wakeup has to come explicitly after releasing the lock.

7. Put the OOM task on the waitqueue before marking the hierarchy as
under OOM as that is the point where we start to receive wakeups.
No point in listening before being on the waitqueue.

8. Likewise, unmark the hierarchy before finishing the sleep, for
symmetry.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++--------------------------
1 file changed, 47 insertions(+), 38 deletions(-)

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 30ae46a..0d923df 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
return total;
}

+static DEFINE_SPINLOCK(memcg_oom_lock);
+
/*
* Check OOM-Killer is already running under our hierarchy.
* If someone is running, return false.
- * Has to be called with memcg_oom_lock
*/
-static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
+static bool mem_cgroup_oom_trylock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter, *failed = NULL;

+ spin_lock(&memcg_oom_lock);
+
for_each_mem_cgroup_tree(iter, memcg) {
if (iter->oom_lock) {
/*
@@ -2098,33 +2101,33 @@ static bool mem_cgroup_oom_lock(struct mem_cgroup *memcg)
iter->oom_lock = true;
}

- if (!failed)
- return true;
-
- /*
- * OK, we failed to lock the whole subtree so we have to clean up
- * what we set up to the failing subtree
- */
- for_each_mem_cgroup_tree(iter, memcg) {
- if (iter == failed) {
- mem_cgroup_iter_break(memcg, iter);
- break;
+ if (failed) {
+ /*
+ * OK, we failed to lock the whole subtree so we have
+ * to clean up what we set up to the failing subtree
+ */
+ for_each_mem_cgroup_tree(iter, memcg) {
+ if (iter == failed) {
+ mem_cgroup_iter_break(memcg, iter);
+ break;
+ }
+ iter->oom_lock = false;
}
- iter->oom_lock = false;
- }
- return false;
+ }
+
+ spin_unlock(&memcg_oom_lock);
+
+ return !failed;
}

-/*
- * Has to be called with memcg_oom_lock
- */
-static int mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
+static void mem_cgroup_oom_unlock(struct mem_cgroup *memcg)
{
struct mem_cgroup *iter;

+ spin_lock(&memcg_oom_lock);
for_each_mem_cgroup_tree(iter, memcg)
iter->oom_lock = false;
- return 0;
+ spin_unlock(&memcg_oom_lock);
}

static void mem_cgroup_mark_under_oom(struct mem_cgroup *memcg)
@@ -2148,7 +2151,6 @@ static void mem_cgroup_unmark_under_oom(struct mem_cgroup *memcg)
atomic_add_unless(&iter->under_oom, -1, 0);
}

-static DEFINE_SPINLOCK(memcg_oom_lock);
static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);

struct oom_wait_info {
@@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
int order)
{
struct oom_wait_info owait;
- bool locked, need_to_kill;
+ bool locked;

owait.memcg = memcg;
owait.wait.flags = 0;
owait.wait.func = memcg_oom_wake_function;
owait.wait.private = current;
INIT_LIST_HEAD(&owait.wait.task_list);
- need_to_kill = true;
- mem_cgroup_mark_under_oom(memcg);

- /* At first, try to OOM lock hierarchy under memcg.*/
- spin_lock(&memcg_oom_lock);
- locked = mem_cgroup_oom_lock(memcg);
/*
+ * As with any blocking lock, a contender needs to start
+ * listening for wakeups before attempting the trylock,
+ * otherwise it can miss the wakeup from the unlock and sleep
+ * indefinitely. This is just open-coded because our locking
+ * is so particular to memcg hierarchies.
+ *
* Even if signal_pending(), we can't quit charge() loop without
* accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
* under OOM is always welcomed, use TASK_KILLABLE here.
*/
prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
- if (!locked || memcg->oom_kill_disable)
- need_to_kill = false;
+ mem_cgroup_mark_under_oom(memcg);
+
+ locked = mem_cgroup_oom_trylock(memcg);
+
if (locked)
mem_cgroup_oom_notify(memcg);
- spin_unlock(&memcg_oom_lock);

- if (need_to_kill) {
+ if (locked && !memcg->oom_kill_disable) {
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
mem_cgroup_out_of_memory(memcg, mask, order);
} else {
schedule();
+ mem_cgroup_unmark_under_oom(memcg);
finish_wait(&memcg_oom_waitq, &owait.wait);
}
- spin_lock(&memcg_oom_lock);
- if (locked)
- mem_cgroup_oom_unlock(memcg);
- memcg_wakeup_oom(memcg);
- spin_unlock(&memcg_oom_lock);

- mem_cgroup_unmark_under_oom(memcg);
+ if (locked) {
+ mem_cgroup_oom_unlock(memcg);
+ /*
+ * There is no guarantee that a OOM-lock contender
+ * sees the wakeups triggered by the OOM kill
+ * uncharges. Wake any sleepers explicitly.
+ */
+ memcg_oom_recover(memcg);
+ }

if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
return false;
--
1.8.3.2

2013-07-29 12:45:34

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 4/6] x86: finish user fault error path with fatal signal

On Fri 26-07-13 14:46:57, Johannes Weiner wrote:
> On Fri, Jul 26, 2013 at 03:52:07PM +0200, Michal Hocko wrote:
> > On Thu 25-07-13 18:25:36, Johannes Weiner wrote:
> > > The x86 fault handler bails in the middle of error handling when the
> > > task has a fatal signal pending. For a subsequent patch this is a
> > > problem in OOM situations because it relies on
> > > pagefault_out_of_memory() being called even when the task has been
> > > killed, to perform proper per-task OOM state unwinding.
> > >
> > > Shortcutting the fault like this is a rather minor optimization that
> > > saves a few instructions in rare cases. Just remove it for
> > > user-triggered faults.
> >
> > OK, I thought that this optimization tries to prevent calling OOM
> > because the current might release some memory but that wasn't the
> > intention of b80ef10e8 (x86: Move do_page_fault()'s error path under
> > unlikely()).
>
> out_of_memory() also checks the caller for pending signals, so it
> would not actually invoke the OOM killer if the caller is already
> dying.

Ohh, right you are! I should have checked deeper in the call chain.

> > > Use the opportunity to split the fault retry handling from actual
> > > fault errors and add locking documentation that reads surprisingly
> > > similar to ARM's.
> > >
> > > Signed-off-by: Johannes Weiner <[email protected]>
> >
> > Reviewed-by: Michal Hocko <[email protected]>
>
> Thanks!

--
Michal Hocko
SUSE Labs

2013-07-29 14:12:58

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
> On Fri, Jul 26, 2013 at 04:43:10PM +0200, Michal Hocko wrote:
> > On Thu 25-07-13 18:25:38, Johannes Weiner wrote:
> > > @@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > > }
> > >
> > > /*
> > > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > > + * try to call OOM killer
> > > */
> > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > > - int order)
> > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > {
> > > - struct oom_wait_info owait;
> > > - bool locked, need_to_kill;
> > > + bool locked, need_to_kill = true;
> > >
> > > - owait.memcg = memcg;
> > > - owait.wait.flags = 0;
> > > - owait.wait.func = memcg_oom_wake_function;
> > > - owait.wait.private = current;
> > > - INIT_LIST_HEAD(&owait.wait.task_list);
> > > - need_to_kill = true;
> > > - mem_cgroup_mark_under_oom(memcg);
> >
> > You are marking memcg under_oom only for the sleepers. So if we have
> > no sleepers then the memcg will never report it is under oom which
> > is a behavior change. On the other hand who-ever relies on under_oom
> > under such conditions (it would basically mean a busy loop reading
> > memory.oom_control) would be racy anyway so it is questionable it
> > matters at all. At least now when we do not have any active notification
> > that under_oom has changed.
> >
> > Anyway, this shouldn't be a part of this patch so if you want it because
> > it saves a pointless hierarchy traversal then make it a separate patch
> > with explanation why the new behavior is still OK.
>
> This made me think again about how the locking and waking in there
> works and I found a bug in this patch.
>
> Basically, we have an open-coded sleeping lock in there and it's all
> obfuscated by having way too much stuffed into the memcg_oom_lock
> section.
>
> Removing all the clutter, it becomes clear that I can't remove that
> (undocumented) final wakeup at the end of the function. As with any
> lock, a contender has to be woken up after unlock. We can't rely on
> the lock holder's OOM kill to trigger uncharges and wakeups, because a
> contender for the OOM lock could show up after the OOM kill but before
> the lock is released. If there weren't any more wakeups, the
> contender would sleep indefinitely.

I have checked that path again and I still do not see how wakeup_oom
helps here. What prevents us from the following race then?

spin_lock(&memcg_oom_lock)
locked = mem_cgroup_oom_lock(memcg) # true
spin_unlock(&memcg_oom_lock)
spin_lock(&memcg_oom_lock)
locked = mem_cgroup_oom_lock(memcg) # false
spin_unlock(&memcg_oom_lock)
<resched>
mem_cgroup_out_of_memory()
<uncharge & memcg_oom_recover>
spin_lock(&memcg_oom_lock)
mem_cgroup_oom_unlock(memcg)
memcg_wakeup_oom(memcg)
schedule()
spin_unlock(&memcg_oom_lock)
mem_cgroup_unmark_under_oom(memcg)

> It also becomes clear that I can't remove the
> mem_cgroup_mark_under_oom() like that because it is key in receiving
> wakeups. And as with any sleeping lock, we need to listen to wakeups
> before attempting the trylock, or we might miss the wakeup from the
> unlock.
>
> It definitely became a separate patch, which cleans up this unholy
> mess first before putting new things on top:

I will check the patch tomorrow with a clean head.
[...]
--
Michal Hocko
SUSE Labs

2013-07-29 14:55:46

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Mon, Jul 29, 2013 at 04:12:50PM +0200, Michal Hocko wrote:
> On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
> > On Fri, Jul 26, 2013 at 04:43:10PM +0200, Michal Hocko wrote:
> > > On Thu 25-07-13 18:25:38, Johannes Weiner wrote:
> > > > @@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > > > }
> > > >
> > > > /*
> > > > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > > > + * try to call OOM killer
> > > > */
> > > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > > > - int order)
> > > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > > {
> > > > - struct oom_wait_info owait;
> > > > - bool locked, need_to_kill;
> > > > + bool locked, need_to_kill = true;
> > > >
> > > > - owait.memcg = memcg;
> > > > - owait.wait.flags = 0;
> > > > - owait.wait.func = memcg_oom_wake_function;
> > > > - owait.wait.private = current;
> > > > - INIT_LIST_HEAD(&owait.wait.task_list);
> > > > - need_to_kill = true;
> > > > - mem_cgroup_mark_under_oom(memcg);
> > >
> > > You are marking memcg under_oom only for the sleepers. So if we have
> > > no sleepers then the memcg will never report it is under oom which
> > > is a behavior change. On the other hand who-ever relies on under_oom
> > > under such conditions (it would basically mean a busy loop reading
> > > memory.oom_control) would be racy anyway so it is questionable it
> > > matters at all. At least now when we do not have any active notification
> > > that under_oom has changed.
> > >
> > > Anyway, this shouldn't be a part of this patch so if you want it because
> > > it saves a pointless hierarchy traversal then make it a separate patch
> > > with explanation why the new behavior is still OK.
> >
> > This made me think again about how the locking and waking in there
> > works and I found a bug in this patch.
> >
> > Basically, we have an open-coded sleeping lock in there and it's all
> > obfuscated by having way too much stuffed into the memcg_oom_lock
> > section.
> >
> > Removing all the clutter, it becomes clear that I can't remove that
> > (undocumented) final wakeup at the end of the function. As with any
> > lock, a contender has to be woken up after unlock. We can't rely on
> > the lock holder's OOM kill to trigger uncharges and wakeups, because a
> > contender for the OOM lock could show up after the OOM kill but before
> > the lock is released. If there weren't any more wakeups, the
> > contender would sleep indefinitely.
>
> I have checked that path again and I still do not see how wakeup_oom
> helps here. What prevents us from the following race then?
>
> spin_lock(&memcg_oom_lock)
> locked = mem_cgroup_oom_lock(memcg) # true
> spin_unlock(&memcg_oom_lock)

prepare_to_wait()

> spin_lock(&memcg_oom_lock)
> locked = mem_cgroup_oom_lock(memcg) # false
> spin_unlock(&memcg_oom_lock)
> <resched>
> mem_cgroup_out_of_memory()
> <uncharge & memcg_oom_recover>
> spin_lock(&memcg_oom_lock)
> mem_cgroup_oom_unlock(memcg)
> memcg_wakeup_oom(memcg)
> schedule()
> spin_unlock(&memcg_oom_lock)
> mem_cgroup_unmark_under_oom(memcg)

2013-07-29 15:52:49

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Mon 29-07-13 10:55:29, Johannes Weiner wrote:
> On Mon, Jul 29, 2013 at 04:12:50PM +0200, Michal Hocko wrote:
> > On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
> > > On Fri, Jul 26, 2013 at 04:43:10PM +0200, Michal Hocko wrote:
> > > > On Thu 25-07-13 18:25:38, Johannes Weiner wrote:
> > > > > @@ -2189,31 +2191,20 @@ static void memcg_oom_recover(struct mem_cgroup *memcg)
> > > > > }
> > > > >
> > > > > /*
> > > > > - * try to call OOM killer. returns false if we should exit memory-reclaim loop.
> > > > > + * try to call OOM killer
> > > > > */
> > > > > -static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > > > > - int order)
> > > > > +static void mem_cgroup_oom(struct mem_cgroup *memcg, gfp_t mask, int order)
> > > > > {
> > > > > - struct oom_wait_info owait;
> > > > > - bool locked, need_to_kill;
> > > > > + bool locked, need_to_kill = true;
> > > > >
> > > > > - owait.memcg = memcg;
> > > > > - owait.wait.flags = 0;
> > > > > - owait.wait.func = memcg_oom_wake_function;
> > > > > - owait.wait.private = current;
> > > > > - INIT_LIST_HEAD(&owait.wait.task_list);
> > > > > - need_to_kill = true;
> > > > > - mem_cgroup_mark_under_oom(memcg);
> > > >
> > > > You are marking memcg under_oom only for the sleepers. So if we have
> > > > no sleepers then the memcg will never report it is under oom which
> > > > is a behavior change. On the other hand who-ever relies on under_oom
> > > > under such conditions (it would basically mean a busy loop reading
> > > > memory.oom_control) would be racy anyway so it is questionable it
> > > > matters at all. At least now when we do not have any active notification
> > > > that under_oom has changed.
> > > >
> > > > Anyway, this shouldn't be a part of this patch so if you want it because
> > > > it saves a pointless hierarchy traversal then make it a separate patch
> > > > with explanation why the new behavior is still OK.
> > >
> > > This made me think again about how the locking and waking in there
> > > works and I found a bug in this patch.
> > >
> > > Basically, we have an open-coded sleeping lock in there and it's all
> > > obfuscated by having way too much stuffed into the memcg_oom_lock
> > > section.
> > >
> > > Removing all the clutter, it becomes clear that I can't remove that
> > > (undocumented) final wakeup at the end of the function. As with any
> > > lock, a contender has to be woken up after unlock. We can't rely on
> > > the lock holder's OOM kill to trigger uncharges and wakeups, because a
> > > contender for the OOM lock could show up after the OOM kill but before
> > > the lock is released. If there weren't any more wakeups, the
> > > contender would sleep indefinitely.
> >
> > I have checked that path again and I still do not see how wakeup_oom
> > helps here. What prevents us from the following race then?
> >
> > spin_lock(&memcg_oom_lock)
> > locked = mem_cgroup_oom_lock(memcg) # true
> > spin_unlock(&memcg_oom_lock)
>
> prepare_to_wait()

For some reason that one disappeared from my screen ;)

> > spin_lock(&memcg_oom_lock)
> > locked = mem_cgroup_oom_lock(memcg) # false
> > spin_unlock(&memcg_oom_lock)
> > <resched>
> > mem_cgroup_out_of_memory()
> > <uncharge & memcg_oom_recover>
> > spin_lock(&memcg_oom_lock)
> > mem_cgroup_oom_unlock(memcg)
> > memcg_wakeup_oom(memcg)
> > schedule()
> > spin_unlock(&memcg_oom_lock)
> > mem_cgroup_unmark_under_oom(memcg)

--
Michal Hocko
SUSE Labs

2013-07-29 18:54:33

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 1/6] arch: mm: remove obsolete init OOM protection

(7/25/13 6:25 PM), Johannes Weiner wrote:
> Back before smart OOM killing, when faulting tasks where killed
> directly on allocation failures, the arch-specific fault handlers
> needed special protection for the init process.
>
> Now that all fault handlers call into the generic OOM killer (609838c
> "mm: invoke oom-killer from remaining unconverted page fault
> handlers"), which already provides init protection, the arch-specific
> leftovers can be removed.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Looks good to me.

Acked-by: KOSAKI Motohiro <[email protected]>

2013-07-29 18:57:35

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 2/6] arch: mm: do not invoke OOM killer on kernel fault OOM

(7/25/13 6:25 PM), Johannes Weiner wrote:
> Kernel faults are expected to handle OOM conditions gracefully (gup,
> uaccess etc.), so they should never invoke the OOM killer. Reserve
> this for faults triggered in user context when it is the only option.
>
> Most architectures already do this, fix up the remaining few.
>
> Signed-off-by: Johannes Weiner <[email protected]>

OK, but now almost all arches have the same page fault handler, so I think we
could implement an arch-generic page fault handler in the future. Ah, ok, never
mind if you are not interested.

Acked-by: KOSAKI Motohiro <[email protected]>

2013-07-29 19:00:32

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 4/6] x86: finish user fault error path with fatal signal

(7/25/13 6:25 PM), Johannes Weiner wrote:
> The x86 fault handler bails in the middle of error handling when the
> task has a fatal signal pending. For a subsequent patch this is a
> problem in OOM situations because it relies on
> pagefault_out_of_memory() being called even when the task has been
> killed, to perform proper per-task OOM state unwinding.
>
> Shortcutting the fault like this is a rather minor optimization that
> saves a few instructions in rare cases. Just remove it for
> user-triggered faults.
>
> Use the opportunity to split the fault retry handling from actual
> fault errors and add locking documentation that reads surprisingly
> similar to ARM's.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Acked-by: KOSAKI Motohiro <[email protected]>

2013-07-29 19:18:07

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

(7/25/13 6:25 PM), Johannes Weiner wrote:
> System calls and kernel faults (uaccess, gup) can handle an out of
> memory situation gracefully and just return -ENOMEM.
>
> Enable the memcg OOM killer only for user faults, where it's really
> the only option available.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> include/linux/memcontrol.h | 23 +++++++++++++++++++++++
> include/linux/sched.h | 3 +++
> mm/filemap.c | 11 ++++++++++-
> mm/memcontrol.c | 2 +-
> mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
> 5 files changed, 67 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> index 7b4d9d7..9bb5eeb 100644
> --- a/include/linux/memcontrol.h
> +++ b/include/linux/memcontrol.h
> @@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> struct page *newpage);
>
> +/**
> + * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
> + * @p: task
> + * @new: true to enable, false to disable
> + *
> + * Toggle whether a failed memcg charge should invoke the OOM killer
> + * or just return -ENOMEM. Returns the previous toggle state.
> + */
> +static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
> +{
> + bool old;
> +
> + old = p->memcg_oom.may_oom;
> + p->memcg_oom.may_oom = new;
> +
> + return old;
> +}

The xchg in the name strongly suggests a compare-and-swap op, so it seems a
misleading name. I suggest just using "set_*" or something else. In the Linux
kernel, many setter functions already return the old value. I don't mind, though.

> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index fc09d21..4b3effc 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1398,6 +1398,9 @@ struct task_struct {
> unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
> } memcg_batch;
> unsigned int memcg_kmem_skip_account;
> + struct memcg_oom_info {
> + unsigned int may_oom:1;
> + } memcg_oom;

This ":1" makes slower but doesn't diet any memory space, right? I suggest
to use bool. If anybody need to diet in future, he may change it to bit field.
That's ok, let's stop too early and questionable micro optimization.


> diff --git a/mm/filemap.c b/mm/filemap.c
> index a6981fe..2932810 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> struct file_ra_state *ra = &file->f_ra;
> struct inode *inode = mapping->host;
> pgoff_t offset = vmf->pgoff;
> + unsigned int may_oom;

Why don't you use bool? Your mem_cgroup_xchg_may_oom() uses bool and it seems cleaner.

> @@ -1626,7 +1627,11 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> return VM_FAULT_SIGBUS;
>
> /*
> - * Do we have something in the page cache already?
> + * Do we have something in the page cache already? Either
> + * way, try readahead, but disable the memcg OOM killer for it
> + * as readahead is optional and no errors are propagated up
> + * the fault stack. The OOM killer is enabled while trying to
> + * instantiate the faulting page individually below.
> */
> page = find_get_page(mapping, offset);
> if (likely(page) && !(vmf->flags & FAULT_FLAG_TRIED)) {
> @@ -1634,10 +1639,14 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> * We found the page, so try async readahead before
> * waiting for the lock.
> */
> + may_oom = mem_cgroup_xchg_may_oom(current, 0);
> do_async_mmap_readahead(vma, ra, file, page, offset);
> + mem_cgroup_xchg_may_oom(current, may_oom);
> } else if (!page) {
> /* No page in the page cache at all */
> + may_oom = mem_cgroup_xchg_may_oom(current, 0);
> do_sync_mmap_readahead(vma, ra, file, offset);
> + mem_cgroup_xchg_may_oom(current, may_oom);
> count_vm_event(PGMAJFAULT);
> mem_cgroup_count_vm_event(vma->vm_mm, PGMAJFAULT);
> ret = VM_FAULT_MAJOR;
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 00a7a66..30ae46a 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2614,7 +2614,7 @@ static int mem_cgroup_do_charge(struct mem_cgroup *memcg, gfp_t gfp_mask,
> return CHARGE_RETRY;
>
> /* If we don't need to call oom-killer at el, return immediately */
> - if (!oom_check)
> + if (!oom_check || !current->memcg_oom.may_oom)
> return CHARGE_NOMEM;
> /* check OOM */
> if (!mem_cgroup_handle_oom(mem_over_limit, gfp_mask, get_order(csize)))
> diff --git a/mm/memory.c b/mm/memory.c
> index f2ab2a8..5ea7b47 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3752,22 +3752,14 @@ unlock:
> /*
> * By the time we get here, we already hold the mm semaphore
> */
> -int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long address, unsigned int flags)
> +static int __handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> {
> pgd_t *pgd;
> pud_t *pud;
> pmd_t *pmd;
> pte_t *pte;
>
> - __set_current_state(TASK_RUNNING);
> -
> - count_vm_event(PGFAULT);
> - mem_cgroup_count_vm_event(mm, PGFAULT);
> -
> - /* do counter updates before entering really critical section. */
> - check_sync_rss_stat(current);
> -
> if (unlikely(is_vm_hugetlb_page(vma)))
> return hugetlb_fault(mm, vma, address, flags);
>
> @@ -3851,6 +3843,34 @@ retry:
> return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> }
>
> +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> + unsigned long address, unsigned int flags)
> +{
> + int ret;
> +
> + __set_current_state(TASK_RUNNING);
> +
> + count_vm_event(PGFAULT);
> + mem_cgroup_count_vm_event(mm, PGFAULT);
> +
> + /* do counter updates before entering really critical section. */
> + check_sync_rss_stat(current);
> +
> + /*
> + * Enable the memcg OOM handling for faults triggered in user
> + * space. Kernel faults are handled more gracefully.
> + */
> + if (flags & FAULT_FLAG_USER)
> + WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);

Please don't assume WARN_ON() never erases any code. I wouldn't be surprised if
embedded guys replace WARN_ON() with a nop in the future.
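
The concern is about side effects inside WARN_ON(); a sketch of the safer
arrangement (illustrative only, not the final code):

        if (flags & FAULT_FLAG_USER) {
                bool old;

                old = mem_cgroup_xchg_may_oom(current, true);
                WARN_ON(old);   /* purely diagnostic, no state change inside */
        }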

Thanks.

2013-07-29 19:45:12

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

On Mon, Jul 29, 2013 at 03:18:36PM -0400, KOSAKI Motohiro wrote:
> (7/25/13 6:25 PM), Johannes Weiner wrote:
> > System calls and kernel faults (uaccess, gup) can handle an out of
> > memory situation gracefully and just return -ENOMEM.
> >
> > Enable the memcg OOM killer only for user faults, where it's really
> > the only option available.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > include/linux/memcontrol.h | 23 +++++++++++++++++++++++
> > include/linux/sched.h | 3 +++
> > mm/filemap.c | 11 ++++++++++-
> > mm/memcontrol.c | 2 +-
> > mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
> > 5 files changed, 67 insertions(+), 12 deletions(-)
> >
> > diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
> > index 7b4d9d7..9bb5eeb 100644
> > --- a/include/linux/memcontrol.h
> > +++ b/include/linux/memcontrol.h
> > @@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
> > extern void mem_cgroup_replace_page_cache(struct page *oldpage,
> > struct page *newpage);
> >
> > +/**
> > + * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
> > + * @p: task
> > + * @new: true to enable, false to disable
> > + *
> > + * Toggle whether a failed memcg charge should invoke the OOM killer
> > + * or just return -ENOMEM. Returns the previous toggle state.
> > + */
> > +static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
> > +{
> > + bool old;
> > +
> > + old = p->memcg_oom.may_oom;
> > + p->memcg_oom.may_oom = new;
> > +
> > + return old;
> > +}
>
> The xchg in the name strongly suggests a compare-and-swap op, so it seems a
> misleading name. I suggest just using "set_*" or something else. In the Linux
> kernel, many setter functions already return the old value. I don't mind, though.

I renamed it to bool mem_cgroup_toggle_oom(bool onoff) when I
incorporated Michal's feedback, would you be okay with that?

> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index fc09d21..4b3effc 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1398,6 +1398,9 @@ struct task_struct {
> > unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
> > } memcg_batch;
> > unsigned int memcg_kmem_skip_account;
> > + struct memcg_oom_info {
> > + unsigned int may_oom:1;
> > + } memcg_oom;
>
> This ":1" makes slower but doesn't diet any memory space, right? I suggest
> to use bool. If anybody need to diet in future, he may change it to bit field.
> That's ok, let's stop too early and questionable micro optimization.

It should sit in the same word as the memcg_kmem_skip_account, plus
I'm adding another bit in the next patch (in_memcg_oom), so we save
space. It's also the OOM path, so anything but performance critical.
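
For illustration, the two bits described here end up side by side in the
task_struct (field names per the series; this is a sketch of the layout, not a
verbatim quote):

        struct memcg_oom_info {
                unsigned int may_oom:1;         /* this patch */
                unsigned int in_memcg_oom:1;    /* added by the next patch */
        } memcg_oom;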

> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index a6981fe..2932810 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > struct file_ra_state *ra = &file->f_ra;
> > struct inode *inode = mapping->host;
> > pgoff_t offset = vmf->pgoff;
> > + unsigned int may_oom;
>
> Why don't you use bool? Your mem_cgroup_xchg_may_oom() uses bool and it seems cleaner.

Yup, forgot to convert it with the interface, I changed it to bool.

> > @@ -3851,6 +3843,34 @@ retry:
> > return handle_pte_fault(mm, vma, address, pte, pmd, flags);
> > }
> >
> > +int handle_mm_fault(struct mm_struct *mm, struct vm_area_struct *vma,
> > + unsigned long address, unsigned int flags)
> > +{
> > + int ret;
> > +
> > + __set_current_state(TASK_RUNNING);
> > +
> > + count_vm_event(PGFAULT);
> > + mem_cgroup_count_vm_event(mm, PGFAULT);
> > +
> > + /* do counter updates before entering really critical section. */
> > + check_sync_rss_stat(current);
> > +
> > + /*
> > + * Enable the memcg OOM handling for faults triggered in user
> > + * space. Kernel faults are handled more gracefully.
> > + */
> > + if (flags & FAULT_FLAG_USER)
> > + WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
>
> Please don't assume WARN_ON() never erases any code. I wouldn't be surprised if
> embedded guys replace WARN_ON() with a nop in the future.

That would be really messed up.

But at the same time, the WARN_ON() obfuscates what's going on a
little bit, so putting it separately should make the code more
readable. I'll change it.

Thanks for your input!

2013-07-29 19:47:22

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [patch 5/6] mm: memcg: enable memcg OOM killer only for user faults

(7/29/13 3:44 PM), Johannes Weiner wrote:
> On Mon, Jul 29, 2013 at 03:18:36PM -0400, KOSAKI Motohiro wrote:
>> (7/25/13 6:25 PM), Johannes Weiner wrote:
>>> System calls and kernel faults (uaccess, gup) can handle an out of
>>> memory situation gracefully and just return -ENOMEM.
>>>
>>> Enable the memcg OOM killer only for user faults, where it's really
>>> the only option available.
>>>
>>> Signed-off-by: Johannes Weiner <[email protected]>
>>> ---
>>> include/linux/memcontrol.h | 23 +++++++++++++++++++++++
>>> include/linux/sched.h | 3 +++
>>> mm/filemap.c | 11 ++++++++++-
>>> mm/memcontrol.c | 2 +-
>>> mm/memory.c | 40 ++++++++++++++++++++++++++++++----------
>>> 5 files changed, 67 insertions(+), 12 deletions(-)
>>>
>>> diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
>>> index 7b4d9d7..9bb5eeb 100644
>>> --- a/include/linux/memcontrol.h
>>> +++ b/include/linux/memcontrol.h
>>> @@ -125,6 +125,24 @@ extern void mem_cgroup_print_oom_info(struct mem_cgroup *memcg,
>>> extern void mem_cgroup_replace_page_cache(struct page *oldpage,
>>> struct page *newpage);
>>>
>>> +/**
>>> + * mem_cgroup_xchg_may_oom - toggle the memcg OOM killer for a task
>>> + * @p: task
>>> + * @new: true to enable, false to disable
>>> + *
>>> + * Toggle whether a failed memcg charge should invoke the OOM killer
>>> + * or just return -ENOMEM. Returns the previous toggle state.
>>> + */
>>> +static inline bool mem_cgroup_xchg_may_oom(struct task_struct *p, bool new)
>>> +{
>>> + bool old;
>>> +
>>> + old = p->memcg_oom.may_oom;
>>> + p->memcg_oom.may_oom = new;
>>> +
>>> + return old;
>>> +}
>>
>> The name of xchg strongly suggests the function uses a compare-swap op, so it
>> seems a misleading name. I suggest just using "set_*" or something else. In the
>> Linux kernel, many setter functions already return the old value. Don't mind.
>
> I renamed it to bool mem_cgroup_toggle_oom(bool onoff) when I
> incorporated Michal's feedback, would you be okay with that?

Yes, thank you.

>
>>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>>> index fc09d21..4b3effc 100644
>>> --- a/include/linux/sched.h
>>> +++ b/include/linux/sched.h
>>> @@ -1398,6 +1398,9 @@ struct task_struct {
>>> unsigned long memsw_nr_pages; /* uncharged mem+swap usage */
>>> } memcg_batch;
>>> unsigned int memcg_kmem_skip_account;
>>> + struct memcg_oom_info {
>>> + unsigned int may_oom:1;
>>> + } memcg_oom;
>>
>> This ":1" makes slower but doesn't diet any memory space, right? I suggest
>> to use bool. If anybody need to diet in future, he may change it to bit field.
>> That's ok, let's stop too early and questionable micro optimization.
>
> It should sit in the same word as the memcg_kmem_skip_account, plus
> I'm adding another bit in the next patch (in_memcg_oom), so we save
> space. It's also the OOM path, so anything but performance critical.

Oh, if you're adding another bit too, it's ok, of course.

>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index a6981fe..2932810 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -1617,6 +1617,7 @@ int filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> struct file_ra_state *ra = &file->f_ra;
>>> struct inode *inode = mapping->host;
>>> pgoff_t offset = vmf->pgoff;
>>> + unsigned int may_oom;
>>
>> Why don't you use bool? Your mem_cgroup_xchg_may_oom() uses bool and it seems cleaner.
>
> Yup, forgot to convert it with the interface, I changed it to bool.

thx.


>>> + /*
>>> + * Enable the memcg OOM handling for faults triggered in user
>>> + * space. Kernel faults are handled more gracefully.
>>> + */
>>> + if (flags & FAULT_FLAG_USER)
>>> + WARN_ON(mem_cgroup_xchg_may_oom(current, true) == true);
>>
>> Please don't assume WARN_ON never erases any code. I wouldn't be surprised if
>> embedded guys replace WARN_ON with a nop in the future.
>
> That would be really messed up.
>
> But at the same time, the WARN_ON() obfuscates what's going on a
> little bit, so putting it separately should make the code more
> readable. I'll change it.
>
> Thanks for your input!

No problem. :)


2013-07-30 14:09:19

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
[...]
> From: Johannes Weiner <[email protected]>
> Subject: [patch] mm: memcg: rework and document OOM serialization
>
> 1. Remove the return value of mem_cgroup_oom_unlock().
>
> 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
>
> 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
> makes it more obvious that the task has to be on the waitqueue
> before attempting to OOM-trylock the hierarchy, to not miss any
> wakeups before going to sleep. It just didn't matter until now
> because it was all lumped together into the global memcg_oom_lock
> spinlock section.
>
> 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
>    It is protected by the hierarchical OOM-lock.
>
> 5. The memcg_oom_lock spinlock is only required to propagate the OOM
> lock in any given hierarchy atomically. Restrict its scope to
> mem_cgroup_oom_(trylock|unlock).
>
> 6. Do not wake up the waitqueue unconditionally at the end of the
> function. Only the lockholder has to wake up the next in line
> after releasing the lock.
>
> Note that the lockholder kicks off the OOM-killer, which in turn
> leads to wakeups from the uncharges of the exiting task. But any
> contender is not guaranteed to see them if it enters the OOM path
> after the OOM kills but before the lockholder releases the lock.
>    Thus the wakeup has to happen explicitly after releasing the lock.
>
> 7. Put the OOM task on the waitqueue before marking the hierarchy as
> under OOM as that is the point where we start to receive wakeups.
> No point in listening before being on the waitqueue.
>
> 8. Likewise, unmark the hierarchy before finishing the sleep, for
> symmetry.
>

OK, this looks better than what we have today, but still could be done
better IMO ;)

> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++--------------------------
> 1 file changed, 47 insertions(+), 38 deletions(-)
>
> diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> index 30ae46a..0d923df 100644
> --- a/mm/memcontrol.c
> +++ b/mm/memcontrol.c
> @@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> return total;
> }
>

/* Protects oom_lock hierarchy consistent state and oom_notify chain */

> +static DEFINE_SPINLOCK(memcg_oom_lock);
> +
> /*
> * Check OOM-Killer is already running under our hierarchy.
> * If someone is running, return false.
> - * Has to be called with memcg_oom_lock
> */
[...]
> @@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> int order)
> {
> struct oom_wait_info owait;
> - bool locked, need_to_kill;
> + bool locked;
>
> owait.memcg = memcg;
> owait.wait.flags = 0;
> owait.wait.func = memcg_oom_wake_function;
> owait.wait.private = current;
> INIT_LIST_HEAD(&owait.wait.task_list);
> - need_to_kill = true;
> - mem_cgroup_mark_under_oom(memcg);
>
> - /* At first, try to OOM lock hierarchy under memcg.*/
> - spin_lock(&memcg_oom_lock);
> - locked = mem_cgroup_oom_lock(memcg);
> /*
> + * As with any blocking lock, a contender needs to start
> + * listening for wakeups before attempting the trylock,
> + * otherwise it can miss the wakeup from the unlock and sleep
> + * indefinitely. This is just open-coded because our locking
> + * is so particular to memcg hierarchies.
> + *
> * Even if signal_pending(), we can't quit charge() loop without
> * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> * under OOM is always welcomed, use TASK_KILLABLE here.

Could you take care of this paragraph as well, while you are at it,
please? I've always found it confusing. I would remove it completely.

> */
> prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> - if (!locked || memcg->oom_kill_disable)
> - need_to_kill = false;
> + mem_cgroup_mark_under_oom(memcg);
> +
> + locked = mem_cgroup_oom_trylock(memcg);
> +
> if (locked)
> mem_cgroup_oom_notify(memcg);
> - spin_unlock(&memcg_oom_lock);
>
> - if (need_to_kill) {
> + if (locked && !memcg->oom_kill_disable) {
> + mem_cgroup_unmark_under_oom(memcg);
> finish_wait(&memcg_oom_waitq, &owait.wait);
> mem_cgroup_out_of_memory(memcg, mask, order);

Killing under a hierarchy which is not under_oom sounds strange to me.
Can't we just move finish_wait & unmark down after unlock? It would
also take care of the incorrect memcg_oom_recover you have in the
oom_unlock path. The ordering would also be more natural:
prepare_wait
mark_under_oom
trylock
unlock
unmark_under_oom
finish_wait

> } else {
> schedule();
> + mem_cgroup_unmark_under_oom(memcg);
> finish_wait(&memcg_oom_waitq, &owait.wait);
> }
> - spin_lock(&memcg_oom_lock);
> - if (locked)
> - mem_cgroup_oom_unlock(memcg);
> - memcg_wakeup_oom(memcg);
> - spin_unlock(&memcg_oom_lock);
>
> - mem_cgroup_unmark_under_oom(memcg);
> + if (locked) {
> + mem_cgroup_oom_unlock(memcg);
> + /*
> + * There is no guarantee that a OOM-lock contender
> + * sees the wakeups triggered by the OOM kill
> + * uncharges. Wake any sleepers explicitely.
> + */
> + memcg_oom_recover(memcg);

This will be a noop because memcg is no longer under_oom (you wanted
memcg_wakeup_oom here, I guess). Moreover, even the killed task wouldn't
wake up anybody for the same reason.

> + }
>
> if (test_thread_flag(TIF_MEMDIE) || fatal_signal_pending(current))
> return false;
> --
> 1.8.3.2
>

--
Michal Hocko
SUSE Labs

2013-07-30 14:32:46

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Tue, Jul 30, 2013 at 04:09:13PM +0200, Michal Hocko wrote:
> On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
> [...]
> > From: Johannes Weiner <[email protected]>
> > Subject: [patch] mm: memcg: rework and document OOM serialization
> >
> > 1. Remove the return value of mem_cgroup_oom_unlock().
> >
> > 2. Rename mem_cgroup_oom_lock() to mem_cgroup_oom_trylock().
> >
> > 3. Pull the prepare_to_wait() out of the memcg_oom_lock scope. This
> > makes it more obvious that the task has to be on the waitqueue
> > before attempting to OOM-trylock the hierarchy, to not miss any
> > wakeups before going to sleep. It just didn't matter until now
> > because it was all lumped together into the global memcg_oom_lock
> > spinlock section.
> >
> > 4. Pull the mem_cgroup_oom_notify() out of the memcg_oom_lock scope.
> >    It is protected by the hierarchical OOM-lock.
> >
> > 5. The memcg_oom_lock spinlock is only required to propagate the OOM
> > lock in any given hierarchy atomically. Restrict its scope to
> > mem_cgroup_oom_(trylock|unlock).
> >
> > 6. Do not wake up the waitqueue unconditionally at the end of the
> > function. Only the lockholder has to wake up the next in line
> > after releasing the lock.
> >
> > Note that the lockholder kicks off the OOM-killer, which in turn
> > leads to wakeups from the uncharges of the exiting task. But any
> > contender is not guaranteed to see them if it enters the OOM path
> > after the OOM kills but before the lockholder releases the lock.
> >    Thus the wakeup has to happen explicitly after releasing the lock.
> >
> > 7. Put the OOM task on the waitqueue before marking the hierarchy as
> > under OOM as that is the point where we start to receive wakeups.
> > No point in listening before being on the waitqueue.
> >
> > 8. Likewise, unmark the hierarchy before finishing the sleep, for
> > symmetry.
> >
>
> OK, this looks better than what we have today, but still could be done
> better IMO ;)
>
> > Signed-off-by: Johannes Weiner <[email protected]>
> > ---
> > mm/memcontrol.c | 85 +++++++++++++++++++++++++++++++--------------------------
> > 1 file changed, 47 insertions(+), 38 deletions(-)
> >
> > diff --git a/mm/memcontrol.c b/mm/memcontrol.c
> > index 30ae46a..0d923df 100644
> > --- a/mm/memcontrol.c
> > +++ b/mm/memcontrol.c
> > @@ -2076,15 +2076,18 @@ static int mem_cgroup_soft_reclaim(struct mem_cgroup *root_memcg,
> > return total;
> > }
> >
>
> /* Protects oom_lock hierarchy consistent state and oom_notify chain */

See 4. :-)

> > +static DEFINE_SPINLOCK(memcg_oom_lock);
> > +
> > /*
> > * Check OOM-Killer is already running under our hierarchy.
> > * If someone is running, return false.
> > - * Has to be called with memcg_oom_lock
> > */
> [...]
> > @@ -2195,45 +2197,52 @@ static bool mem_cgroup_handle_oom(struct mem_cgroup *memcg, gfp_t mask,
> > int order)
> > {
> > struct oom_wait_info owait;
> > - bool locked, need_to_kill;
> > + bool locked;
> >
> > owait.memcg = memcg;
> > owait.wait.flags = 0;
> > owait.wait.func = memcg_oom_wake_function;
> > owait.wait.private = current;
> > INIT_LIST_HEAD(&owait.wait.task_list);
> > - need_to_kill = true;
> > - mem_cgroup_mark_under_oom(memcg);
> >
> > - /* At first, try to OOM lock hierarchy under memcg.*/
> > - spin_lock(&memcg_oom_lock);
> > - locked = mem_cgroup_oom_lock(memcg);
> > /*
> > + * As with any blocking lock, a contender needs to start
> > + * listening for wakeups before attempting the trylock,
> > + * otherwise it can miss the wakeup from the unlock and sleep
> > + * indefinitely. This is just open-coded because our locking
> > + * is so particular to memcg hierarchies.
> > + *
> > * Even if signal_pending(), we can't quit charge() loop without
> > * accounting. So, UNINTERRUPTIBLE is appropriate. But SIGKILL
> > * under OOM is always welcomed, use TASK_KILLABLE here.
>
> Could you take care of this paragraph as well, while you are at it,
> please? I've always found it confusing. I would remove it completely.

It's removed completely in the same series, that's why I didn't
bother.

> > */
> > prepare_to_wait(&memcg_oom_waitq, &owait.wait, TASK_KILLABLE);
> > - if (!locked || memcg->oom_kill_disable)
> > - need_to_kill = false;
> > + mem_cgroup_mark_under_oom(memcg);
> > +
> > + locked = mem_cgroup_oom_trylock(memcg);
> > +
> > if (locked)
> > mem_cgroup_oom_notify(memcg);
> > - spin_unlock(&memcg_oom_lock);
> >
> > - if (need_to_kill) {
> > + if (locked && !memcg->oom_kill_disable) {
> > + mem_cgroup_unmark_under_oom(memcg);
> > finish_wait(&memcg_oom_waitq, &owait.wait);
> > mem_cgroup_out_of_memory(memcg, mask, order);
>
> Killing under a hierarchy which is not under_oom sounds strange to me.
> Can't we just move finish_wait & unmark down after unlock? It would
> also take care of the incorrect memcg_oom_recover you have in the
> oom_unlock path. The ordering would also be more natural:
> prepare_wait
> mark_under_oom
> trylock
> unlock
> unmark_under_oom
> finish_wait

That makes no sense. The waitqueue is for when you failed the trylock
and need to sleep until the holder unlocks. If you acquire the lock,
what's the point in staying on the waitqueue?

Look at __mutex_lock_common() in kernel/mutex.c.
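
The open-coded sequence boils down to the usual sleeping-lock shape.
Simplified sketch (waitq, wait, trylock() and do_work() are just placeholders
here, not the actual memcg symbols):

	prepare_to_wait(&waitq, &wait, TASK_KILLABLE);	/* listen before trying */
	if (trylock())
		do_work();	/* lock owner: no reason to stay on the waitqueue */
	else
		schedule();	/* loser: sleep until the owner's unlock wakes us */
	finish_wait(&waitq, &wait);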

There is also no point in leaving under_oom set during the kill; the
flag is so transient when the in-kernel OOM killer is enabled.
Whether we leave it set during mem_cgroup_out_of_memory() or not, how
would you possibly read it and base any meaningful decisions on it
from userspace?

> > } else {
> > schedule();
> > + mem_cgroup_unmark_under_oom(memcg);
> > finish_wait(&memcg_oom_waitq, &owait.wait);
> > }
> > - spin_lock(&memcg_oom_lock);
> > - if (locked)
> > - mem_cgroup_oom_unlock(memcg);
> > - memcg_wakeup_oom(memcg);
> > - spin_unlock(&memcg_oom_lock);
> >
> > - mem_cgroup_unmark_under_oom(memcg);
> > + if (locked) {
> > + mem_cgroup_oom_unlock(memcg);
> > + /*
> > + * There is no guarantee that a OOM-lock contender
> > + * sees the wakeups triggered by the OOM kill
> > + * uncharges. Wake any sleepers explicitely.
> > + */
> > + memcg_oom_recover(memcg);
>
> This will be a noop because memcg is no longer under_oom (you wanted
> memcg_wakeup_oom here, I guess). Moreover, even the killed task wouldn't
> wake up anybody for the same reason.

Anybody entering this path will increase the under_oom counter. The
killer decreases it again, but everybody who is sleeping because they
failed the trylock still hasn't unmarked the hierarchy (they call
schedule() before unmark_under_oom()). So we issue wakeups when there
is somebody waiting for the lock.
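
For reference, memcg_oom_recover() is keyed off that counter, so the explicit
wakeup only does anything while somebody is actually marked as waiting;
roughly:

	static void memcg_oom_recover(struct mem_cgroup *memcg)
	{
		if (memcg && atomic_read(&memcg->under_oom))
			memcg_wakeup_oom(memcg);
	}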

2013-07-30 14:56:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [patch 6/6] mm: memcg: do not trap chargers with full callstack on OOM

On Tue 30-07-13 10:32:28, Johannes Weiner wrote:
> On Tue, Jul 30, 2013 at 04:09:13PM +0200, Michal Hocko wrote:
> > On Fri 26-07-13 17:28:09, Johannes Weiner wrote:
[...]
> > > } else {
> > > schedule();
> > > + mem_cgroup_unmark_under_oom(memcg);
> > > finish_wait(&memcg_oom_waitq, &owait.wait);
> > > }
> > > - spin_lock(&memcg_oom_lock);
> > > - if (locked)
> > > - mem_cgroup_oom_unlock(memcg);
> > > - memcg_wakeup_oom(memcg);
> > > - spin_unlock(&memcg_oom_lock);
> > >
> > > - mem_cgroup_unmark_under_oom(memcg);
> > > + if (locked) {
> > > + mem_cgroup_oom_unlock(memcg);
> > > + /*
> > > +		 * There is no guarantee that an OOM-lock contender
> > > +		 * sees the wakeups triggered by the OOM kill
> > > +		 * uncharges. Wake any sleepers explicitly.
> > > + */
> > > + memcg_oom_recover(memcg);
> >
> > This will be a noop because memcg is no longer under_oom (you wanted
> > memcg_wakeup_oom here, I guess). Moreover, even the killed task wouldn't
> > wake up anybody for the same reason.
>
> Anybody entering this path will increase the under_oom counter. The
> killer decreases it again, but everybody who is sleeping because they
> failed the trylock still hasn't unmarked the hierarchy (they call
> schedule() before unmark_under_oom()). So we issue wakeups when there
> is somebody waiting for the lock.

True, sorry for the noise. Feel free to add

Acked-by: Michal Hocko <[email protected]>
--
Michal Hocko
SUSE Labs

2013-08-01 21:59:47

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 2/6] arch: mm: do not invoke OOM killer on kernel fault OOM

On Mon, Jul 29, 2013 at 02:58:05PM -0400, KOSAKI Motohiro wrote:
> (7/25/13 6:25 PM), Johannes Weiner wrote:
> > Kernel faults are expected to handle OOM conditions gracefully (gup,
> > uaccess etc.), so they should never invoke the OOM killer. Reserve
> > this for faults triggered in user context when it is the only option.
> >
> > Most architectures already do this, fix up the remaining few.
> >
> > Signed-off-by: Johannes Weiner <[email protected]>
>
> OK. But now almost all arches have the same page fault handler, so I think
> we could implement an arch-generic page fault handler in the future. Ah, ok,
> never mind if you are not interested.

Well, I'm already working towards it ;-) Still a long way to go,
though, to fully replace them with generic code...

> Acked-by: KOSAKI Motohiro <[email protected]>

Thanks!

2013-08-03 08:40:06

by azurIt

[permalink] [raw]
Subject: Re: [patch 3.2] memcg OOM robustness (x86 only)

>azurIt, this is the combined backport for 3.2, x86 + generic bits +
>debugging. It would be fantastic if you could give this another shot
>once you get back from vacation. Thanks!
>
>Johannes



Hi Johannes,

is this still up to date? Thank you.

azur

2013-08-03 16:31:09

by Johannes Weiner

[permalink] [raw]
Subject: Re: [patch 3.2] memcg OOM robustness (x86 only)

Hey azur,

On Sat, Aug 03, 2013 at 10:38:52AM +0200, azurIt wrote:
> is this still up to date? Thank you.

no, I'll send an updated version in a bit.

Thanks!
Johannes