2015-04-27 19:06:37

by Johannes Weiner

Subject: [PATCH 0/9] mm: improve OOM mechanism v2

There is a possible deadlock scenario between the page allocator and
the OOM killer. Most allocations currently retry forever inside the
page allocator, but when the OOM killer is invoked the chosen victim
might try taking locks held by the allocating task. This series, on
top of many cleanups in the allocator & OOM killer, grants such OOM-
killing allocations access to the system's memory reserves in order
for them to make progress without relying on their own kill to exit.
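
To make the dependency concrete, here is an illustrative sketch (not taken
from any patch in this series; i_mutex merely stands in for whatever lock
the two tasks happen to share):

	/* Task A: allocates while holding a lock */
	mutex_lock(&inode->i_mutex);
	page = alloc_page(GFP_KERNEL);	/* hits OOM, kills task B, then
					 * loops in the allocator waiting
					 * for memory to be freed */
	...
	mutex_unlock(&inode->i_mutex);

	/* Task B: the chosen OOM victim, trying to exit */
	mutex_lock(&inode->i_mutex);	/* blocks behind task A and never
					 * releases its memory: deadlock,
					 * unless A can use the reserves */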

Changes since v1:
- drop GFP_NOFS deadlock fix (Dave Chinner)
- drop low-order deadlock fix (Michal Hocko)
- fix missing oom_lock in sysrq+f (Michal Hocko)
- fix PAGE_ALLOC_COSTLY retry condition (Michal Hocko)
- ALLOC_NO_WATERMARKS only for OOM victims, not all killed tasks (Tetsuo Handa)
- bump OOM wait timeout from 1s to 5s (Vlastimil Babka & Michal Hocko)

drivers/staging/android/lowmemorykiller.c | 2 +-
drivers/tty/sysrq.c | 2 +
include/linux/oom.h | 12 +-
kernel/exit.c | 2 +-
mm/memcontrol.c | 20 ++--
mm/oom_kill.c | 167 +++++++---------------------
mm/page_alloc.c | 161 ++++++++++++---------------
7 files changed, 137 insertions(+), 229 deletions(-)


2015-04-27 19:06:40

by Johannes Weiner

Subject: [PATCH 1/9] mm: oom_kill: remove unnecessary locking in oom_enable()

Setting oom_killer_disabled to false is atomic, there is no need for
further synchronization with ongoing allocations trying to OOM-kill.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
Acked-by: David Rientjes <[email protected]>
---
mm/oom_kill.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2b665da..73763e4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -488,9 +488,7 @@ bool oom_killer_disable(void)
*/
void oom_killer_enable(void)
{
- down_write(&oom_sem);
oom_killer_disabled = false;
- up_write(&oom_sem);
}

#define K(x) ((x) << (PAGE_SHIFT-10))
--
2.3.4

2015-04-27 19:08:23

by Johannes Weiner

Subject: [PATCH 2/9] mm: oom_kill: clean up victim marking and exiting interfaces

Rename unmark_oom_victim() to exit_oom_victim(). Marking and
unmarking are related in functionality, but the interface is not
symmetrical at all: one is an internal OOM killer function used during
the killing, the other is for an OOM victim to signal its own death on
exit later on. This has locking implications, see follow-up changes.

While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
drivers/staging/android/lowmemorykiller.c | 2 +-
include/linux/oom.h | 7 ++++---
kernel/exit.c | 2 +-
mm/memcontrol.c | 2 +-
mm/oom_kill.c | 16 +++++++---------
5 files changed, 14 insertions(+), 15 deletions(-)

diff --git a/drivers/staging/android/lowmemorykiller.c b/drivers/staging/android/lowmemorykiller.c
index feafa17..2345ee7 100644
--- a/drivers/staging/android/lowmemorykiller.c
+++ b/drivers/staging/android/lowmemorykiller.c
@@ -165,7 +165,7 @@ static unsigned long lowmem_scan(struct shrinker *s, struct shrink_control *sc)
* infrastructure. There is no real reason why the selected
* task should have access to the memory reserves.
*/
- mark_tsk_oom_victim(selected);
+ mark_oom_victim(selected);
send_sig(SIGKILL, selected, 0);
rem += selected_tasksize;
}
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 44b2f6f..a8e6a49 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -47,9 +47,7 @@ static inline bool oom_task_origin(const struct task_struct *p)
return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
}

-extern void mark_tsk_oom_victim(struct task_struct *tsk);
-
-extern void unmark_oom_victim(void);
+extern void mark_oom_victim(struct task_struct *tsk);

extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
@@ -75,6 +73,9 @@ extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,

extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
int order, nodemask_t *mask, bool force_kill);
+
+extern void exit_oom_victim(void);
+
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);

diff --git a/kernel/exit.c b/kernel/exit.c
index 22fcc05..185752a 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -436,7 +436,7 @@ static void exit_mm(struct task_struct *tsk)
mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
- unmark_oom_victim();
+ exit_oom_victim();
}

static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 14c2f20..54f8457 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1536,7 +1536,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
* quickly exit and free its memory.
*/
if (fatal_signal_pending(current) || task_will_free_mem(current)) {
- mark_tsk_oom_victim(current);
+ mark_oom_victim(current);
return;
}

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 73763e4..b2f081f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -408,13 +408,13 @@ bool oom_killer_disabled __read_mostly;
static DECLARE_RWSEM(oom_sem);

/**
- * mark_tsk_oom_victim - marks the given task as OOM victim.
+ * mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
*
* Has to be called with oom_sem taken for read and never after
* oom has been disabled already.
*/
-void mark_tsk_oom_victim(struct task_struct *tsk)
+void mark_oom_victim(struct task_struct *tsk)
{
WARN_ON(oom_killer_disabled);
/* OOM killer might race with memcg OOM */
@@ -431,11 +431,9 @@ void mark_tsk_oom_victim(struct task_struct *tsk)
}

/**
- * unmark_oom_victim - unmarks the current task as OOM victim.
- *
- * Wakes up all waiters in oom_killer_disable()
+ * exit_oom_victim - note the exit of an OOM victim
*/
-void unmark_oom_victim(void)
+void exit_oom_victim(void)
{
if (!test_and_clear_thread_flag(TIF_MEMDIE))
return;
@@ -515,7 +513,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
task_lock(p);
if (p->mm && task_will_free_mem(p)) {
- mark_tsk_oom_victim(p);
+ mark_oom_victim(p);
task_unlock(p);
put_task_struct(p);
return;
@@ -570,7 +568,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

/* mm cannot safely be dereferenced after task_unlock(victim) */
mm = victim->mm;
- mark_tsk_oom_victim(victim);
+ mark_oom_victim(victim);
pr_err("Killed process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(victim), victim->comm, K(victim->mm->total_vm),
K(get_mm_counter(victim->mm, MM_ANONPAGES)),
@@ -728,7 +726,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
- mark_tsk_oom_victim(current);
+ mark_oom_victim(current);
return;
}

--
2.3.4

2015-04-27 19:06:42

by Johannes Weiner

Subject: [PATCH 3/9] mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear

exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody
else can clear it concurrently. Use clear_thread_flag() directly.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: David Rientjes <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b2f081f..4b9547b 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -435,8 +435,7 @@ void mark_oom_victim(struct task_struct *tsk)
*/
void exit_oom_victim(void)
{
- if (!test_and_clear_thread_flag(TIF_MEMDIE))
- return;
+ clear_thread_flag(TIF_MEMDIE);

down_read(&oom_sem);
/*
--
2.3.4

2015-04-27 19:08:21

by Johannes Weiner

Subject: [PATCH 4/9] mm: oom_kill: generalize OOM progress waitqueue

It turns out that the mechanism to wait for exiting OOM victims is
less generic than it looks: it won't issue wakeups unless the OOM
killer is disabled.

The reason this check was added was the thought that, since only the
OOM disabling code would wait on this queue, wakeup operations could
be saved when that specific consumer is known to be absent.

However, this is quite the hand grenade. Later attempts to reuse the
waitqueue for other purposes will lead to completely unexpected bugs
and the failure mode will appear seemingly illogical. Generally,
providers shouldn't make unnecessary assumptions about consumers.

This could have been replaced with waitqueue_active(), but it only
saves a few instructions in one of the coldest paths in the kernel.
Simply remove it.
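
For reference, the waitqueue_active() variant mentioned above would have
looked roughly like this (sketch only, not part of the patch):

	if (!atomic_dec_return(&oom_victims) && waitqueue_active(&oom_victims_wait))
		wake_up_all(&oom_victims_wait);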

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 6 +-----
1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b9547b..472f124 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -438,11 +438,7 @@ void exit_oom_victim(void)
clear_thread_flag(TIF_MEMDIE);

down_read(&oom_sem);
- /*
- * There is no need to signal the lasst oom_victim if there
- * is nobody who cares.
- */
- if (!atomic_dec_return(&oom_victims) && oom_killer_disabled)
+ if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
up_read(&oom_sem);
}
--
2.3.4

2015-04-27 19:08:19

by Johannes Weiner

Subject: [PATCH 5/9] mm: oom_kill: remove unnecessary locking in exit_oom_victim()

Disabling the OOM killer needs to exclude allocators from entering,
not existing victims from exiting.

Right now the only waiter is suspend code, which achieves quiescence
by disabling the OOM killer. But later on we want to add waits that
hold the lock instead to stop new victims from showing up.

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 472f124..d3490b0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -437,10 +437,8 @@ void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

- down_read(&oom_sem);
if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
- up_read(&oom_sem);
}

/**
--
2.3.4

2015-04-27 19:08:04

by Johannes Weiner

Subject: [PATCH 6/9] mm: oom_kill: simplify OOM killer locking

The zonelist locking and the oom_sem are two overlapping locks that
are used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations
with overlapping zonelists against each other to prevent killing more
tasks than necessary in the same memory domain. Only when neither
tasklists nor zonelists from two concurrent OOM kills overlap (tasks
in separate memcgs bound to separate nodes) are OOM kills allowed to
execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing
against the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path, there is really
no reason to optimize for highly performant and concurrent OOM kills.
And the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing
OOM kills regardless of context.
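
Condensed from the hunks below, the resulting serialization looks roughly
like this (a sketch, not a literal excerpt):

	DEFINE_MUTEX(oom_lock);

	/* sysrq+f and memcg OOM: can afford to sleep on the lock */
	mutex_lock(&oom_lock);
	out_of_memory(...);
	mutex_unlock(&oom_lock);

	/* page allocator slowpath: if somebody else holds the lock, they
	 * are making progress for us, so just back off briefly */
	if (!mutex_trylock(&oom_lock)) {
		*did_some_progress = 1;
		schedule_timeout_uninterruptible(1);
		return NULL;
	}
	...
	mutex_unlock(&oom_lock);

pagefault_out_of_memory() likewise uses mutex_trylock() and simply returns
when the lock is contended.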

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
drivers/tty/sysrq.c | 2 +
include/linux/oom.h | 5 +--
mm/memcontrol.c | 18 +++++---
mm/oom_kill.c | 127 +++++++++++-----------------------------------------
mm/page_alloc.c | 8 ++--
5 files changed, 46 insertions(+), 114 deletions(-)

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 843f2cd..b20d2c0 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,9 +356,11 @@ static struct sysrq_key_op sysrq_term_op = {

static void moom_callback(struct work_struct *ignored)
{
+ mutex_lock(&oom_lock);
if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
GFP_KERNEL, 0, NULL, true))
pr_info("OOM request ignored because killer is disabled\n");
+ mutex_unlock(&oom_lock);
}

static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index a8e6a49..7deecb7 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -32,6 +32,8 @@ enum oom_scan_t {
/* Thread is the potential origin of an oom condition; kill first on oom */
#define OOM_FLAG_ORIGIN ((__force oom_flags_t)0x1)

+extern struct mutex oom_lock;
+
static inline void set_current_oom_origin(void)
{
current->signal->oom_flags |= OOM_FLAG_ORIGIN;
@@ -60,9 +62,6 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
struct mem_cgroup *memcg, nodemask_t *nodemask,
const char *message);

-extern bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_flags);
-extern void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_flags);
-
extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
int order, const nodemask_t *nodemask,
struct mem_cgroup *memcg);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 54f8457..5fd273d 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1530,6 +1530,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int points = 0;
struct task_struct *chosen = NULL;

+ mutex_lock(&oom_lock);
+
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
@@ -1537,7 +1539,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
*/
if (fatal_signal_pending(current) || task_will_free_mem(current)) {
mark_oom_victim(current);
- return;
+ goto unlock;
}

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
@@ -1564,7 +1566,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
mem_cgroup_iter_break(memcg, iter);
if (chosen)
put_task_struct(chosen);
- return;
+ goto unlock;
case OOM_SCAN_OK:
break;
};
@@ -1585,11 +1587,13 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_task_iter_end(&it);
}

- if (!chosen)
- return;
- points = chosen_points * 1000 / totalpages;
- oom_kill_process(chosen, gfp_mask, order, points, totalpages, memcg,
- NULL, "Memory cgroup out of memory");
+ if (chosen) {
+ points = chosen_points * 1000 / totalpages;
+ oom_kill_process(chosen, gfp_mask, order, points, totalpages,
+ memcg, NULL, "Memory cgroup out of memory");
+ }
+unlock:
+ mutex_unlock(&oom_lock);
}

#if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d3490b0..5cfda39 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -42,7 +42,8 @@
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
-static DEFINE_SPINLOCK(zone_scan_lock);
+
+DEFINE_MUTEX(oom_lock);

#ifdef CONFIG_NUMA
/**
@@ -405,13 +406,12 @@ static atomic_t oom_victims = ATOMIC_INIT(0);
static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

bool oom_killer_disabled __read_mostly;
-static DECLARE_RWSEM(oom_sem);

/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
*
- * Has to be called with oom_sem taken for read and never after
+ * Has to be called with oom_lock held and never after
* oom has been disabled already.
*/
void mark_oom_victim(struct task_struct *tsk)
@@ -460,14 +460,14 @@ bool oom_killer_disable(void)
* Make sure to not race with an ongoing OOM killer
* and that the current is not the victim.
*/
- down_write(&oom_sem);
+ mutex_lock(&oom_lock);
if (test_thread_flag(TIF_MEMDIE)) {
- up_write(&oom_sem);
+ mutex_unlock(&oom_lock);
return false;
}

oom_killer_disabled = true;
- up_write(&oom_sem);
+ mutex_unlock(&oom_lock);

wait_event(oom_victims_wait, !atomic_read(&oom_victims));

@@ -634,52 +634,6 @@ int unregister_oom_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);

-/*
- * Try to acquire the OOM killer lock for the zones in zonelist. Returns zero
- * if a parallel OOM killing is already taking place that includes a zone in
- * the zonelist. Otherwise, locks all zones in the zonelist and returns 1.
- */
-bool oom_zonelist_trylock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
- struct zoneref *z;
- struct zone *zone;
- bool ret = true;
-
- spin_lock(&zone_scan_lock);
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- if (test_bit(ZONE_OOM_LOCKED, &zone->flags)) {
- ret = false;
- goto out;
- }
-
- /*
- * Lock each zone in the zonelist under zone_scan_lock so a parallel
- * call to oom_zonelist_trylock() doesn't succeed when it shouldn't.
- */
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- set_bit(ZONE_OOM_LOCKED, &zone->flags);
-
-out:
- spin_unlock(&zone_scan_lock);
- return ret;
-}
-
-/*
- * Clears the ZONE_OOM_LOCKED flag for all zones in the zonelist so that failed
- * allocation attempts with zonelists containing them may now recall the OOM
- * killer, if necessary.
- */
-void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
-{
- struct zoneref *z;
- struct zone *zone;
-
- spin_lock(&zone_scan_lock);
- for_each_zone_zonelist(zone, z, zonelist, gfp_zone(gfp_mask))
- clear_bit(ZONE_OOM_LOCKED, &zone->flags);
- spin_unlock(&zone_scan_lock);
-}
-
/**
* __out_of_memory - kill the "best" process when we run out of memory
* @zonelist: zonelist pointer
@@ -693,8 +647,8 @@ void oom_zonelist_unlock(struct zonelist *zonelist, gfp_t gfp_mask)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *nodemask, bool force_kill)
+bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask, bool force_kill)
{
const nodemask_t *mpol_mask;
struct task_struct *p;
@@ -704,10 +658,13 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

+ if (oom_killer_disabled)
+ return false;
+
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
- return;
+ goto out;

/*
* If current has a pending SIGKILL or is exiting, then automatically
@@ -720,7 +677,7 @@ static void __out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
- return;
+ goto out;
}

/*
@@ -760,32 +717,8 @@ out:
*/
if (killed)
schedule_timeout_killable(1);
-}
-
-/**
- * out_of_memory - tries to invoke OOM killer.
- * @zonelist: zonelist pointer
- * @gfp_mask: memory allocation flags
- * @order: amount of memory being requested as a power of 2
- * @nodemask: nodemask passed to page allocator
- * @force_kill: true if a task must be killed, even if others are exiting
- *
- * invokes __out_of_memory if the OOM is not disabled by oom_killer_disable()
- * when it returns false. Otherwise returns true.
- */
-bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *nodemask, bool force_kill)
-{
- bool ret = false;
-
- down_read(&oom_sem);
- if (!oom_killer_disabled) {
- __out_of_memory(zonelist, gfp_mask, order, nodemask, force_kill);
- ret = true;
- }
- up_read(&oom_sem);

- return ret;
+ return true;
}

/*
@@ -795,27 +728,21 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
void pagefault_out_of_memory(void)
{
- struct zonelist *zonelist;
-
- down_read(&oom_sem);
if (mem_cgroup_oom_synchronize(true))
- goto unlock;
+ return;

- zonelist = node_zonelist(first_memory_node, GFP_KERNEL);
- if (oom_zonelist_trylock(zonelist, GFP_KERNEL)) {
- if (!oom_killer_disabled)
- __out_of_memory(NULL, 0, 0, NULL, false);
- else
- /*
- * There shouldn't be any user tasks runable while the
- * OOM killer is disabled so the current task has to
- * be a racing OOM victim for which oom_killer_disable()
- * is waiting for.
- */
- WARN_ON(test_thread_flag(TIF_MEMDIE));
+ if (!mutex_trylock(&oom_lock))
+ return;

- oom_zonelist_unlock(zonelist, GFP_KERNEL);
+ if (!out_of_memory(NULL, 0, 0, NULL, false)) {
+ /*
+ * There shouldn't be any user tasks runnable while the
+ * OOM killer is disabled, so the current task has to
+ * be a racing OOM victim for which oom_killer_disable()
+ * is waiting for.
+ */
+ WARN_ON(test_thread_flag(TIF_MEMDIE));
}
-unlock:
- up_read(&oom_sem);
+
+ mutex_unlock(&oom_lock);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ebffa0e..cdb786b 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2373,10 +2373,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*did_some_progress = 0;

/*
- * Acquire the per-zone oom lock for each zone. If that
- * fails, somebody else is making progress for us.
+ * Acquire the oom lock. If that fails, somebody else is
+ * making progress for us.
*/
- if (!oom_zonelist_trylock(ac->zonelist, gfp_mask)) {
+ if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
schedule_timeout_uninterruptible(1);
return NULL;
@@ -2421,7 +2421,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;
out:
- oom_zonelist_unlock(ac->zonelist, gfp_mask);
+ mutex_unlock(&oom_lock);
return page;
}

--
2.3.4

2015-04-27 19:07:14

by Johannes Weiner

Subject: [PATCH 7/9] mm: page_alloc: inline should_alloc_retry()

The should_alloc_retry() function was meant to encapsulate retry
conditions of the allocator slowpath, but there are still checks
remaining in the main function, and much of how the retrying is
performed also depends on the OOM killer progress. The physical
separation of those conditions make the code hard to follow.

Inline the should_alloc_retry() checks. Notes (a condensed sketch of the
resulting retry flow follows the list):

- The __GFP_NOFAIL check is already done in __alloc_pages_may_oom(),
replace it with looping on OOM killer progress

- The pm_suspended_storage() check is meant to skip the OOM killer
when reclaim has no IO available, move to __alloc_pages_may_oom()

- The order <= PAGE_ALLOC_COSTLY_ORDER check is re-united with its original
counterpart of checking whether reclaim actually made any progress
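
Condensed sketch of the retry flow after this change (it mirrors the hunks
below and is not a verbatim excerpt):

retry:
	page = __alloc_pages_direct_reclaim(..., &did_some_progress);
	if (page)
		goto got_pg;

	/* Do not loop if specifically requested */
	if (gfp_mask & __GFP_NORETRY)
		goto noretry;

	/* Keep reclaiming pages as long as there is reasonable progress */
	pages_reclaimed += did_some_progress;
	if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
	    ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
		wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
		goto retry;
	}

	/* Reclaim has failed us, start killing things */
	page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
	if (page)
		goto got_pg;

	/* Retry as long as the OOM killer is making progress */
	if (did_some_progress)
		goto retry;

noretry:
	/* one last compaction attempt, then give up */
	page = __alloc_pages_direct_compact(...);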

Signed-off-by: Johannes Weiner <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
mm/page_alloc.c | 104 +++++++++++++++++---------------------------------------
1 file changed, 32 insertions(+), 72 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cdb786b..3b4e4f81 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2322,48 +2322,6 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
show_mem(filter);
}

-static inline int
-should_alloc_retry(gfp_t gfp_mask, unsigned int order,
- unsigned long did_some_progress,
- unsigned long pages_reclaimed)
-{
- /* Do not loop if specifically requested */
- if (gfp_mask & __GFP_NORETRY)
- return 0;
-
- /* Always retry if specifically requested */
- if (gfp_mask & __GFP_NOFAIL)
- return 1;
-
- /*
- * Suspend converts GFP_KERNEL to __GFP_WAIT which can prevent reclaim
- * making forward progress without invoking OOM. Suspend also disables
- * storage devices so kswapd will not help. Bail if we are suspending.
- */
- if (!did_some_progress && pm_suspended_storage())
- return 0;
-
- /*
- * In this implementation, order <= PAGE_ALLOC_COSTLY_ORDER
- * means __GFP_NOFAIL, but that may not be true in other
- * implementations.
- */
- if (order <= PAGE_ALLOC_COSTLY_ORDER)
- return 1;
-
- /*
- * For order > PAGE_ALLOC_COSTLY_ORDER, if __GFP_REPEAT is
- * specified, then we retry until we no longer reclaim any pages
- * (above), or we've reclaimed an order of pages at least as
- * large as the allocation's order. In both cases, if the
- * allocation still fails, we stop retrying.
- */
- if (gfp_mask & __GFP_REPEAT && pages_reclaimed < (1 << order))
- return 1;
-
- return 0;
-}
-
static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
const struct alloc_context *ac, unsigned long *did_some_progress)
@@ -2402,16 +2360,18 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
/* The OOM killer does not needlessly kill tasks for lowmem */
if (ac->high_zoneidx < ZONE_NORMAL)
goto out;
- /* The OOM killer does not compensate for light reclaim */
+ /* The OOM killer does not compensate for IO-less reclaim */
if (!(gfp_mask & __GFP_FS)) {
/*
* XXX: Page reclaim didn't yield anything,
* and the OOM killer can't be invoked, but
- * keep looping as per should_alloc_retry().
+ * keep looping as per tradition.
*/
*did_some_progress = 1;
goto out;
}
+ if (pm_suspended_storage())
+ goto out;
/* The OOM killer may not free memory on a specific node */
if (gfp_mask & __GFP_THISNODE)
goto out;
@@ -2794,40 +2754,40 @@ retry:
if (page)
goto got_pg;

- /* Check if we should retry the allocation */
+ /* Do not loop if specifically requested */
+ if (gfp_mask & __GFP_NORETRY)
+ goto noretry;
+
+ /* Keep reclaiming pages as long as there is reasonable progress */
pages_reclaimed += did_some_progress;
- if (should_alloc_retry(gfp_mask, order, did_some_progress,
- pages_reclaimed)) {
- /*
- * If we fail to make progress by freeing individual
- * pages, but the allocation wants us to keep going,
- * start OOM killing tasks.
- */
- if (!did_some_progress) {
- page = __alloc_pages_may_oom(gfp_mask, order, ac,
- &did_some_progress);
- if (page)
- goto got_pg;
- if (!did_some_progress)
- goto nopage;
- }
+ if ((did_some_progress && order <= PAGE_ALLOC_COSTLY_ORDER) ||
+ ((gfp_mask & __GFP_REPEAT) && pages_reclaimed < (1 << order))) {
/* Wait for some write requests to complete then retry */
wait_iff_congested(ac->preferred_zone, BLK_RW_ASYNC, HZ/50);
goto retry;
- } else {
- /*
- * High-order allocations do not necessarily loop after
- * direct reclaim and reclaim/compaction depends on compaction
- * being called after reclaim so call directly if necessary
- */
- page = __alloc_pages_direct_compact(gfp_mask, order,
- alloc_flags, ac, migration_mode,
- &contended_compaction,
- &deferred_compaction);
- if (page)
- goto got_pg;
}

+ /* Reclaim has failed us, start killing things */
+ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ if (page)
+ goto got_pg;
+
+ /* Retry as long as the OOM killer is making progress */
+ if (did_some_progress)
+ goto retry;
+
+noretry:
+ /*
+ * High-order allocations do not necessarily loop after
+ * direct reclaim and reclaim/compaction depends on compaction
+ * being called after reclaim so call directly if necessary
+ */
+ page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags,
+ ac, migration_mode,
+ &contended_compaction,
+ &deferred_compaction);
+ if (page)
+ goto got_pg;
nopage:
warn_alloc_failed(gfp_mask, order, NULL);
got_pg:
--
2.3.4

2015-04-27 19:07:40

by Johannes Weiner

Subject: [PATCH 8/9] mm: page_alloc: wait for OOM killer progress before retrying

There is not much point in rushing back to the freelists and burning
CPU cycles in direct reclaim when somebody else is in the process of
OOM killing, or right after issuing a kill ourselves, because it could
take some time for the OOM victim to release memory.

This is a very cold error path, so there is not much hurry. Use the
OOM victim waitqueue to wait for victims to actually exit, which is a
solid signal that the memory pinned by those tasks has been released.

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/oom_kill.c | 11 +++++++----
mm/page_alloc.c | 43 ++++++++++++++++++++++++++-----------------
2 files changed, 33 insertions(+), 21 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39..823f87e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
killed = 1;
}
out:
+ if (test_thread_flag(TIF_MEMDIE))
+ return true;
/*
- * Give the killed threads a good chance of exiting before trying to
- * allocate memory again.
+ * Wait for any outstanding OOM victims to die. In rare cases
+ * victims can get stuck behind the allocating tasks, so the
+ * wait needs to be bounded. It's crude alright, but cheaper
+ * than keeping a global dependency tree between all tasks.
*/
- if (killed)
- schedule_timeout_killable(1);
+ wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), 5*HZ);

return true;
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b4e4f81..94530db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2323,30 +2323,30 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)
}

static inline struct page *
-__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
+__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
const struct alloc_context *ac, unsigned long *did_some_progress)
{
- struct page *page;
+ struct page *page = NULL;

*did_some_progress = 0;

/*
- * Acquire the oom lock. If that fails, somebody else is
- * making progress for us.
+ * This allocating task can become the OOM victim itself at
+ * any point before acquiring the lock. In that case, exit
+ * quickly and don't block on the lock held by another task
+ * waiting for us to exit.
*/
- if (!mutex_trylock(&oom_lock)) {
- *did_some_progress = 1;
- schedule_timeout_uninterruptible(1);
- return NULL;
+ if (test_thread_flag(TIF_MEMDIE) || mutex_lock_killable(&oom_lock)) {
+ if (test_thread_flag(TIF_MEMDIE))
+ alloc_flags |= ALLOC_NO_WATERMARKS;
+ goto alloc;
}

/*
- * Go through the zonelist yet one more time, keep very high watermark
- * here, this is only to catch a parallel oom killing, we must fail if
- * we're still under heavy pressure.
+ * While we have been waiting for the lock, the previous OOM
+ * kill might have released enough memory for the both of us.
*/
- page = get_page_from_freelist(gfp_mask | __GFP_HARDWALL, order,
- ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
+ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
if (page)
goto out;

@@ -2376,12 +2376,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
if (gfp_mask & __GFP_THISNODE)
goto out;
}
- /* Exhausted what can be done so it's blamo time */
- if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
- || WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+
+ if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)) {
*did_some_progress = 1;
+ } else {
+ /* Oops, these shouldn't happen with the OOM killer disabled */
+ if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
+ *did_some_progress = 1;
+ }
out:
mutex_unlock(&oom_lock);
+alloc:
+ if (!page)
+ page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);
+
return page;
}

@@ -2768,7 +2776,8 @@ retry:
}

/* Reclaim has failed us, start killing things */
- page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ page = __alloc_pages_may_oom(gfp_mask, order, alloc_flags, ac,
+ &did_some_progress);
if (page)
goto got_pg;

--
2.3.4

2015-04-27 19:06:50

by Johannes Weiner

Subject: [PATCH 9/9] mm: page_alloc: memory reserve access for OOM-killing allocations

The OOM killer connects random tasks in the system with unknown
dependencies between them, and the OOM victim might well get blocked
behind locks held by the allocating task. That means that while
allocations can issue OOM kills to improve the low memory situation,
which generally frees more than they are going to take out, they can
not rely on their *own* OOM kills to make forward progress.

However, OOM-killing allocations currently retry forever. Without any
extra measures the above situation will result in a deadlock; between
the allocating task and the OOM victim at first, but it can spread
once other tasks in the system start contending for the same locks.

Allow OOM-killing allocations to dip into the system's memory reserves
to avoid this deadlock scenario. Those reserves are specifically for
operations in the memory reclaim paths which need a small amount of
memory to release a much larger amount. Arguably, the same notion
applies to the OOM killer.
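
Together with the previous patch, the tail of __alloc_pages_may_oom() then
ends up looking roughly like this (a condensed sketch of the combined
result, not a verbatim excerpt):

	if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false))
		*did_some_progress = 1;

	/* An OOM kill has been issued: let the retry bypass the watermarks */
	if (*did_some_progress)
		alloc_flags |= ALLOC_NO_WATERMARKS;
out:
	mutex_unlock(&oom_lock);
alloc:
	if (!page)
		page = get_page_from_freelist(gfp_mask, order, alloc_flags, ac);

	return page;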

Signed-off-by: Johannes Weiner <[email protected]>
---
mm/page_alloc.c | 14 ++++++++++++++
1 file changed, 14 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 94530db..5f3806d 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2384,6 +2384,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;
}
+
+ /*
+ * In the current implementation, an OOM-killing allocation
+ * loops indefinitely inside the allocator. However, it's
+ * possible for the OOM victim to get stuck behind locks held
+ * by the allocating task itself, so we can never rely on the
+ * OOM killer to free memory synchronously without risking a
+ * deadlock. Allow these allocations to dip into the memory
+ * reserves to ensure forward progress once the OOM kill has
+ * been issued. The reserves will be replenished when the
+ * caller releases the locks and the victim exits.
+ */
+ if (*did_some_progress)
+ alloc_flags |= ALLOC_NO_WATERMARKS;
out:
mutex_unlock(&oom_lock);
alloc:
--
2.3.4

2015-04-28 10:34:56

by Tetsuo Handa

Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Johannes Weiner wrote:
> There is a possible deadlock scenario between the page allocator and
> the OOM killer. Most allocations currently retry forever inside the
> page allocator, but when the OOM killer is invoked the chosen victim
> might try taking locks held by the allocating task. This series, on
> top of many cleanups in the allocator & OOM killer, grants such OOM-
> killing allocations access to the system's memory reserves in order
> for them to make progress without relying on their own kill to exit.

I don't think

[PATCH 8/9] mm: page_alloc: wait for OOM killer progress before retrying

and

[PATCH 9/9] mm: page_alloc: memory reserve access for OOM-killing allocations

will work.

[PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
5 seconds before returning true when the OOM victim got stuck. This throttling
also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
regardless of gfp_mask. How long will the OOM victim be blocked when the
allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
a too-long-to-wait stall which is effectively a deadlock for users. I think
we should not sleep with the oom_lock held.

Also, allowing any !fatal_signal_pending() threads doing __GFP_FS allocations
(e.g. malloc() + memset()) to dip into the reserves will deplete them when the
OOM victim is blocked behind a thread doing a !__GFP_FS allocation, because
[PATCH 9/9] does not allow !test_thread_flag(TIF_MEMDIE) threads doing
!__GFP_FS allocations to access the reserves. Of course, updating [PATCH 9/9]
like

-+ if (*did_some_progress)
-+ alloc_flags |= ALLOC_NO_WATERMARKS;
out:
++ if (*did_some_progress)
++ alloc_flags |= ALLOC_NO_WATERMARKS;
mutex_unlock(&oom_lock);

(which means use of "no watermark" without invoking the OOM killer) is
obviously wrong. I think we should not allow __GFP_FS allocations to
access the reserves when the OOM victim is blocked.



By the way, I came up with an idea (incomplete patch on top of patches up to
7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
This patch is meant for

(1) blocking_notifier_call_chain(&oom_notify_list) is called after
the OOM killer is disabled in order to increase possibility of
memory allocation to succeed.

(2) oom_kill_process() can determine when to kill next OOM victim.

(3) oom_scan_process_thread() can take TIF_MEMDIE timeout into
account when choosing an OOM victim.

What do you think?



diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index b20d2c0..843f2cd 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -356,11 +356,9 @@ static struct sysrq_key_op sysrq_term_op = {

static void moom_callback(struct work_struct *ignored)
{
- mutex_lock(&oom_lock);
if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
GFP_KERNEL, 0, NULL, true))
pr_info("OOM request ignored because killer is disabled\n");
- mutex_unlock(&oom_lock);
}

static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7..ed8c53b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -32,8 +32,6 @@ enum oom_scan_t {
/* Thread is the potential origin of an oom condition; kill first on oom */
#define OOM_FLAG_ORIGIN ((__force oom_flags_t)0x1)

-extern struct mutex oom_lock;
-
static inline void set_current_oom_origin(void)
{
current->signal->oom_flags |= OOM_FLAG_ORIGIN;
@@ -49,8 +47,6 @@ static inline bool oom_task_origin(const struct task_struct *p)
return !!(p->signal->oom_flags & OOM_FLAG_ORIGIN);
}

-extern void mark_oom_victim(struct task_struct *tsk);
-
extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
unsigned long totalpages);
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 5fd273d..06a7a9f 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1530,16 +1530,16 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned int points = 0;
struct task_struct *chosen = NULL;

- mutex_lock(&oom_lock);
-
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*/
if (fatal_signal_pending(current) || task_will_free_mem(current)) {
- mark_oom_victim(current);
- goto unlock;
+ get_task_struct(current);
+ oom_kill_process(current, gfp_mask, order, 0, 0, memcg, NULL,
+ "Out of memory (killing exiting task)");
+ return;
}

check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
@@ -1566,7 +1566,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
mem_cgroup_iter_break(memcg, iter);
if (chosen)
put_task_struct(chosen);
- goto unlock;
+ return;
case OOM_SCAN_OK:
break;
};
@@ -1587,13 +1587,11 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_task_iter_end(&it);
}

- if (chosen) {
- points = chosen_points * 1000 / totalpages;
- oom_kill_process(chosen, gfp_mask, order, points, totalpages,
- memcg, NULL, "Memory cgroup out of memory");
- }
-unlock:
- mutex_unlock(&oom_lock);
+ if (!chosen)
+ return;
+ points = chosen_points * 1000 / totalpages;
+ oom_kill_process(chosen, gfp_mask, order, points, totalpages,
+ memcg, NULL, "Memory cgroup out of memory");
}

#if MAX_NUMNODES > 1
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5cfda39..9a8e430 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,7 +43,7 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

-DEFINE_MUTEX(oom_lock);
+static DEFINE_SPINLOCK(oom_lock);

#ifdef CONFIG_NUMA
/**
@@ -414,9 +414,10 @@ bool oom_killer_disabled __read_mostly;
* Has to be called with oom_lock held and never after
* oom has been disabled already.
*/
-void mark_oom_victim(struct task_struct *tsk)
+static void mark_oom_victim(struct task_struct *tsk)
{
WARN_ON(oom_killer_disabled);
+ assert_spin_locked(&oom_lock);
/* OOM killer might race with memcg OOM */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
return;
@@ -456,22 +457,15 @@ void exit_oom_victim(void)
*/
bool oom_killer_disable(void)
{
- /*
- * Make sure to not race with an ongoing OOM killer
- * and that the current is not the victim.
- */
- mutex_lock(&oom_lock);
- if (test_thread_flag(TIF_MEMDIE)) {
- mutex_unlock(&oom_lock);
- return false;
- }
-
+ /* Make sure to not race with an ongoing OOM killer. */
+ spin_lock(&oom_lock);
oom_killer_disabled = true;
- mutex_unlock(&oom_lock);
-
- wait_event(oom_victims_wait, !atomic_read(&oom_victims));
-
- return true;
+ spin_unlock(&oom_lock);
+ /* Prepare for stuck TIF_MEMDIE threads. */
+ if (!wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims),
+ 60 * HZ))
+ oom_killer_disabled = false;
+ return oom_killer_disabled;
}

/**
@@ -482,6 +476,15 @@ void oom_killer_enable(void)
oom_killer_disabled = false;
}

+static bool oom_kill_ok = true;
+
+static void oom_wait_expire(struct work_struct *unused)
+{
+ oom_kill_ok = true;
+}
+
+static DECLARE_DELAYED_WORK(oom_wait_work, oom_wait_expire);
+
#define K(x) ((x) << (PAGE_SHIFT-10))
/*
* Must be called while holding a reference to p, which will be released upon
@@ -500,6 +503,13 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);

+ spin_lock(&oom_lock);
+ /*
+ * Did we race with oom_killer_disable() or another oom_kill_process()?
+ */
+ if (oom_killer_disabled || !oom_kill_ok)
+ goto unlock;
+
/*
* If the task is already exiting, don't alarm the sysadmin or kill
* its children or threads, just set TIF_MEMDIE so it can die quickly
@@ -508,8 +518,7 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
if (p->mm && task_will_free_mem(p)) {
mark_oom_victim(p);
task_unlock(p);
- put_task_struct(p);
- return;
+ goto unlock;
}
task_unlock(p);

@@ -551,8 +560,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

p = find_lock_task_mm(victim);
if (!p) {
- put_task_struct(victim);
- return;
+ p = victim;
+ goto unlock;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
@@ -593,7 +602,20 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
rcu_read_unlock();

do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
- put_task_struct(victim);
+ p = victim;
+ /*
+ * Since choosing an OOM victim is done asynchronously, someone might
+ * have chosen an extra victim which would not have been chosen if we
+ * waited for the previous victim to die. Therefore, wait for a while
+ * before trying to kill the next victim. If oom_scan_process_thread()
+ * uses oom_kill_ok rather than force_kill, it will act like a TIF_MEMDIE
+ * timeout.
+ */
+ oom_kill_ok = false;
+ schedule_delayed_work(&oom_wait_work, HZ);
+unlock:
+ spin_unlock(&oom_lock);
+ put_task_struct(p);
}
#undef K

@@ -658,9 +680,6 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
enum oom_constraint constraint = CONSTRAINT_NONE;
int killed = 0;

- if (oom_killer_disabled)
- return false;
-
blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
if (freed > 0)
/* Got some memory back in the last second. */
@@ -676,7 +695,9 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
*/
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
- mark_oom_victim(current);
+ get_task_struct(current);
+ oom_kill_process(current, gfp_mask, order, 0, 0, NULL, nodemask,
+ "Out of memory (killing exiting task)");
goto out;
}

@@ -731,9 +752,6 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;

- if (!mutex_trylock(&oom_lock))
- return;
-
if (!out_of_memory(NULL, 0, 0, NULL, false)) {
/*
* There shouldn't be any user tasks runnable while the
@@ -743,6 +761,4 @@ void pagefault_out_of_memory(void)
*/
WARN_ON(test_thread_flag(TIF_MEMDIE));
}
-
- mutex_unlock(&oom_lock);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b4e4f81..2b69467 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2331,16 +2331,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
*did_some_progress = 0;

/*
- * Acquire the oom lock. If that fails, somebody else is
- * making progress for us.
- */
- if (!mutex_trylock(&oom_lock)) {
- *did_some_progress = 1;
- schedule_timeout_uninterruptible(1);
- return NULL;
- }
-
- /*
* Go through the zonelist yet one more time, keep very high watermark
* here, this is only to catch a parallel oom killing, we must fail if
* we're still under heavy pressure.
@@ -2381,7 +2371,6 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;
out:
- mutex_unlock(&oom_lock);
return page;
}

2015-04-28 13:18:55

by Michal Hocko

Subject: Re: [PATCH 8/9] mm: page_alloc: wait for OOM killer progress before retrying

On Mon 27-04-15 15:05:54, Johannes Weiner wrote:
> There is not much point in rushing back to the freelists and burning
> CPU cycles in direct reclaim when somebody else is in the process of
> OOM killing, or right after issuing a kill ourselves, because it could
> take some time for the OOM victim to release memory.
>
> This is a very cold error path, so there is not much hurry. Use the
> OOM victim waitqueue to wait for victims to actually exit, which is a
> solid signal that the memory pinned by those tasks has been released.
>
> Signed-off-by: Johannes Weiner <[email protected]>

Looks good to me. One minor thing/suggestion below.
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/oom_kill.c | 11 +++++++----
> mm/page_alloc.c | 43 ++++++++++++++++++++++++++-----------------
> 2 files changed, 33 insertions(+), 21 deletions(-)
>
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5cfda39..823f87e 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -711,12 +711,15 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
> killed = 1;
> }
> out:
> + if (test_thread_flag(TIF_MEMDIE))
> + return true;
> /*
> - * Give the killed threads a good chance of exiting before trying to
> - * allocate memory again.
> + * Wait for any outstanding OOM victims to die. In rare cases
> + * victims can get stuck behind the allocating tasks, so the
> + * wait needs to be bounded. It's crude alright, but cheaper
> + * than keeping a global dependency tree between all tasks.
> */
> - if (killed)
> - schedule_timeout_killable(1);
> + wait_event_timeout(oom_victims_wait, !atomic_read(&oom_victims), 5*HZ);

WARN(!wait_event_timeout(...), "OOM victim has hard time to finish. OOM deadlock?")

or something along those lines? It would tell the admin that something
fishy is going on here.

>
> return true;
> }
[...]
--
Michal Hocko
SUSE Labs

2015-04-28 13:30:14

by Michal Hocko

Subject: Re: [PATCH 9/9] mm: page_alloc: memory reserve access for OOM-killing allocations

On Mon 27-04-15 15:05:55, Johannes Weiner wrote:
> The OOM killer connects random tasks in the system with unknown
> dependencies between them, and the OOM victim might well get blocked
> behind locks held by the allocating task. That means that while
> allocations can issue OOM kills to improve the low memory situation,
> which generally frees more than they are going to take out, they can
> not rely on their *own* OOM kills to make forward progress.
>
> However, OOM-killing allocations currently retry forever. Without any
> extra measures the above situation will result in a deadlock; between
> the allocating task and the OOM victim at first, but it can spread
> once other tasks in the system start contending for the same locks.
>
> Allow OOM-killing allocations to dip into the system's memory reserves
> to avoid this deadlock scenario. Those reserves are specifically for
> operations in the memory reclaim paths which need a small amount of
> memory to release a much larger amount. Arguably, the same notion
> applies to the OOM killer.

This will not work without some throttling. You will basically give a
free ticket to all memory reserves to basically all allocating tasks
(which are allowed to trigger OOM and there might be hundreds of them)
and that itself might prevent the OOM victim from exiting.

Your previous OOM wmark was nicer because it naturally throttled
allocations and still left some room for the exiting task.

> Signed-off-by: Johannes Weiner <[email protected]>
> ---
> mm/page_alloc.c | 14 ++++++++++++++
> 1 file changed, 14 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 94530db..5f3806d 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2384,6 +2384,20 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> if (WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
> *did_some_progress = 1;
> }
> +
> + /*
> + * In the current implementation, an OOM-killing allocation
> + * loops indefinitely inside the allocator. However, it's
> + * possible for the OOM victim to get stuck behind locks held
> + * by the allocating task itself, so we can never rely on the
> > + * OOM killer to free memory synchronously without risking a
> + * deadlock. Allow these allocations to dip into the memory
> + * reserves to ensure forward progress once the OOM kill has
> + * been issued. The reserves will be replenished when the
> + * caller releases the locks and the victim exits.
> + */
> + if (*did_some_progress)
> + alloc_flags |= ALLOC_NO_WATERMARKS;
> out:
> mutex_unlock(&oom_lock);
> alloc:
> --
> 2.3.4
>

--
Michal Hocko
SUSE Labs

2015-04-28 13:55:40

by Michal Hocko

Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
[...]
> [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
> seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
> 5 seconds before returning true when the OOM victim got stuck. This throttling
> also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
> allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
> regardless of gfp_mask.

This is indeed unnecessary.

> How long will the OOM victim be blocked when the
> allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
> the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
> a too-long-to-wait stall which is effectively a deadlock for users. I think
> we should not sleep with the oom_lock held.

I do not see why sleeping with oom_lock would be a problem. It simply
doesn't make much sense to try to trigger the OOM killer when there are
OOM victims still exiting.

> Also, allowing any !fatal_signal_pending() threads doing __GFP_FS allocations
> (e.g. malloc() + memset()) to dip into the reserves will deplete them when the
> OOM victim is blocked behind a thread doing a !__GFP_FS allocation, because
> [PATCH 9/9] does not allow !test_thread_flag(TIF_MEMDIE) threads doing
> !__GFP_FS allocations to access the reserves. Of course, updating [PATCH 9/9]
> like
>
> -+ if (*did_some_progress)
> -+ alloc_flags |= ALLOC_NO_WATERMARKS;
> out:
> ++ if (*did_some_progress)
> ++ alloc_flags |= ALLOC_NO_WATERMARKS;
> mutex_unlock(&oom_lock);
>
> (which means use of "no watermark" without invoking the OOM killer) is
> obviously wrong. I think we should not allow __GFP_FS allocations to
> access the reserves when the OOM victim is blocked.
>
> By the way, I came up with an idea (incomplete patch on top of patches up to
> 7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
> This patch is meant for
>
> (1) blocking_notifier_call_chain(&oom_notify_list) is called after
> the OOM killer is disabled in order to increase possibility of
> memory allocation to succeed.

How do you guarantee that the notifier doesn't wake up any process and
break the oom_disable guarantee?

> (2) oom_kill_process() can determine when to kill next OOM victim.
>
> (3) oom_scan_process_thread() can take TIF_MEMDIE timeout into
> account when choosing an OOM victim.

You have heard my opinions about this and I do not plan to repeat them
here again.

[...]
--
Michal Hocko
SUSE Labs

2015-04-28 14:59:14

by Michal Hocko

Subject: Re: [PATCH 9/9] mm: page_alloc: memory reserve access for OOM-killing allocations

On Tue 28-04-15 15:30:09, Michal Hocko wrote:
> On Mon 27-04-15 15:05:55, Johannes Weiner wrote:
> > The OOM killer connects random tasks in the system with unknown
> > dependencies between them, and the OOM victim might well get blocked
> > behind locks held by the allocating task. That means that while
> > allocations can issue OOM kills to improve the low memory situation,
> > which generally frees more than they are going to take out, they can
> > not rely on their *own* OOM kills to make forward progress.
> >
> > However, OOM-killing allocations currently retry forever. Without any
> > extra measures the above situation will result in a deadlock; between
> > the allocating task and the OOM victim at first, but it can spread
> > once other tasks in the system start contending for the same locks.
> >
> > Allow OOM-killing allocations to dip into the system's memory reserves
> > to avoid this deadlock scenario. Those reserves are specifically for
> > operations in the memory reclaim paths which need a small amount of
> > memory to release a much larger amount. Arguably, the same notion
> > applies to the OOM killer.
>
> This will not work without some throttling.

Hmm, thinking about it some more, it seems that the throttling in
out_of_memory() and its wait_event_timeout might be sufficient to keep
too many tasks from consuming the reserves. If that doesn't help make any
progress then we are screwed anyway. Maybe we should simply panic if
the last get_page_from_freelist() with ALLOC_NO_WATERMARKS fails...

I will think about this some more but it is certainly easier than
a new wmark and that one can be added later should there be a need.

> You will basically give a
> free ticket to all memory reserves to basically all allocating tasks
> (which are allowed to trigger OOM and there might be hundreds of them)
> and that itself might prevent the OOM victim from exiting.
>
> Your previous OOM wmark was nicer because it naturally throttled
> allocations and still left some room for the exiting task.
--
Michal Hocko
SUSE Labs

2015-04-28 15:50:47

by Tetsuo Handa

Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Michal Hocko wrote:
> On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> [...]
> > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
> > seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
> > 5 seconds before returning true when the OOM victim got stuck. This throttling
> > also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
> > allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
> > regardless of gfp_mask.
>
> This is indeed unnecessary.
>
> > How long will the OOM victim be blocked when the
> > allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
> > the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
> > a too-long-to-wait stall which is effectively a deadlock for users. I think
> > we should not sleep with the oom_lock held.
>
> I do not see why sleeping with oom_lock would be a problem. It simply
> doesn't make much sense to try to trigger the OOM killer when there are
> OOM victims still exiting.

Because thread A's memory allocation is deferred by the allocations of
threads B, C, D..., which are holding (or waiting for) the oom_lock, while
the OOM victim is waiting for thread A's allocation. I think a memory
allocator that takes on average 5 seconds per allocation is unusable. If we
sleep without holding the oom_lock, the allocator can instead average
(5 / number_of_allocating_threads) seconds per allocation. Sleeping with the
oom_lock held can effectively prevent thread A from making progress.

> > By the way, I came up with an idea (incomplete patch on top of patches up to
> > 7/9 is shown below) while trying to avoid sleeping with the oom_lock held.
> > This patch is meant for
> >
> > (1) blocking_notifier_call_chain(&oom_notify_list) is called after
> > the OOM killer is disabled in order to increase possibility of
> > memory allocation to succeed.
>
> How do you guarantee that the notifier doesn't wake up any process and
> break the oom_disable guarantee?

I thought oom_notify_list only wakes up kernel threads. OK, is that the
reason we don't call oom_notify_list after the OOM killer is disabled?

>
> > (2) oom_kill_process() can determine when to kill next OOM victim.
> >
> > (3) oom_scan_process_thread() can take TIF_MEMDIE timeout into
> > account when choosing an OOM victim.
>
> You have heard my opinions about this and I do not plan to repeat them
> here again.

Yes, I've heard your opinions. But neither ALLOC_NO_WATERMARKS nor WMARK_OOM
is a perfect measure for avoiding deadlock. We want to solve the "Without any
extra measures the above situation will result in a deadlock" problem.
If WMARK_OOM fails to avoid the deadlock and we don't want to fall back to
ALLOC_NO_WATERMARKS, I think somehow choosing and killing more victims is
the only possible measure. Maybe choose the next OOM victim upon reaching
WMARK_OOM?

2015-04-28 22:40:12

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 4/9] mm: oom_kill: generalize OOM progress waitqueue

On Mon, 27 Apr 2015, Johannes Weiner wrote:

> It turns out that the mechanism to wait for exiting OOM victims is
> less generic than it looks: it won't issue wakeups unless the OOM
> killer is disabled.
>
> The reason this check was added was the thought that, since only the
> OOM disabling code would wait on this queue, wakeup operations could
> be saved when that specific consumer is known to be absent.
>
> However, this is quite the handgrenade. Later attempts to reuse the
> waitqueue for other purposes will lead to completely unexpected bugs
> and the failure mode will appear seemingly illogical. Generally,
> providers shouldn't make unnecessary assumptions about consumers.
>
> This could have been replaced with waitqueue_active(), but it only
> saves a few instructions in one of the coldest paths in the kernel.
> Simply remove it.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Acked-by: David Rientjes <[email protected]>

2015-04-28 22:40:30

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 5/9] mm: oom_kill: remove unnecessary locking in exit_oom_victim()

On Mon, 27 Apr 2015, Johannes Weiner wrote:

> Disabling the OOM killer needs to exclude allocators from entering,
> not existing victims from exiting.
>
> Right now the only waiter is suspend code, which achieves quiescence
> by disabling the OOM killer. But later on we want to add waits that
> hold the lock instead to stop new victims from showing up.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Acked-by: David Rientjes <[email protected]>

2015-04-28 22:43:39

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 6/9] mm: oom_kill: simplify OOM killer locking

On Mon, 27 Apr 2015, Johannes Weiner wrote:

> The zonelist locking and the oom_sem are two overlapping locks that
> are used to serialize global OOM killing against different things.
>
> The historical zonelist locking serializes OOM kills from allocations
> with overlapping zonelists against each other to prevent killing more
> tasks than necessary in the same memory domain. Only when neither
> tasklists nor zonelists from two concurrent OOM kills overlap (tasks
> in separate memcgs bound to separate nodes) are OOM kills allowed to
> execute in parallel.
>
> The younger oom_sem is a read-write lock to serialize OOM killing
> against the PM code trying to disable the OOM killer altogether.
>
> However, the OOM killer is a fairly cold error path, there is really
> no reason to optimize for highly performant and concurrent OOM kills.
> And the oom_sem is just flat-out redundant.
>
> Replace both locking schemes with a single global mutex serializing
> OOM kills regardless of context.
>
> Signed-off-by: Johannes Weiner <[email protected]>
> Acked-by: Michal Hocko <[email protected]>

Acked-by: David Rientjes <[email protected]>

Thanks for doing this, it cleans up the code quite a bit and I think
there's the added benefit of not interleaving oom killer messages in the
kernel log, and that's important since it's the only way we can currently
discover that the kernel has killed something.

It's not vital and somewhat unrelated to your patch, but if we can't grab
the mutex with the trylock in __alloc_pages_may_oom() then I think it
would be more correct to do schedule_timeout_killable() rather than
uninterruptible. I just mention it if you happen to go through another
revision of the series and want to switch it at the same time.
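
For reference, the trylock path in question looks roughly like this after the
series (a sketch from memory, not a quoted hunk), with the suggested change
noted in a comment:

        if (!mutex_trylock(&oom_lock)) {
                *did_some_progress = 1;
                /* suggestion: schedule_timeout_killable(1) instead */
                schedule_timeout_uninterruptible(1);
                return NULL;
        }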

2015-04-29 05:48:25

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 6/9] mm: oom_kill: simplify OOM killer locking

David Rientjes wrote:
> It's not vital and somewhat unrelated to your patch, but if we can't grab
> the mutex with the trylock in __alloc_pages_may_oom() then I think it
> would be more correct to do schedule_timeout_killable() rather than
> uninterruptible. I just mention it if you happen to go through another
> revision of the series and want to switch it at the same time.

It is a difficult choice. Killable sleep is a good thing if

  (1) the OOM victim is the current thread
  (2) the OOM victim is waiting for the current thread to release a lock

but is a bad thing otherwise. And currently, (2) is not true because the
current thread cannot access the memory reserves while it is blocking the
OOM victim. If fatal_signal_pending() threads can access a portion of the
memory reserves (as I said

I don't like allowing only TIF_MEMDIE to get reserve access, for it can be
one of !TIF_MEMDIE threads which really need memory to safely terminate without
failing allocations from do_exit(). Rather, why not to discontinue TIF_MEMDIE
handling and allow getting access to private memory reserves for all
fatal_signal_pending() threads (i.e. replacing WMARK_OOM with WMARK_KILLED
in "[patch 09/12] mm: page_alloc: private memory reserves for OOM-killing
allocations") ?

at https://lkml.org/lkml/2015/3/27/378 ), (2) will become true.

Of course, the threads which the OOM victim is waiting for may not have
SIGKILL pending. WMARK_KILLED helps if the lock contention is happening
among threads sharing the same mm struct, but does not help otherwise.

Well, what about introducing WMARK_OOM as a memory reserve which can be
accessed while atomic_read(&oom_victims) > 0? That way, we can choose the
next OOM victim upon reaching WMARK_OOM.
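
A rough sketch of how that could look in the allocator's zone scan
(WMARK_OOM and oom_wmark_pages() are hypothetical and not part of this
series):

        unsigned long mark = min_wmark_pages(zone);

        /*
         * Hypothetical: while at least one OOM victim exists, let
         * allocations dip below WMARK_MIN down to WMARK_OOM; failing
         * even that check would be the trigger for selecting the next
         * victim.
         */
        if (atomic_read(&oom_victims) > 0)
                mark = oom_wmark_pages(zone);   /* hypothetical accessor */

        if (!zone_watermark_ok(zone, order, mark, classzone_idx, alloc_flags))
                continue;       /* try the next zone */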

2015-04-29 12:55:22

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Wed, Apr 29, 2015 at 12:50:37AM +0900, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> > [...]
> > > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
> > > seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
> > > 5 seconds before returning true when the OOM victim got stuck. This throttling
> > > also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
> > > allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
> > > regardless of gfp_mask.
> >
> > This is indeed unnecessary.
> >
> > > How long will the OOM victim is blocked when the
> > > allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
> > > the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
> > > a too-long-to-wait stall which is effectively a deadlock for users. I think
> > > we should not sleep with the oom_lock held.
> >
> > I do not see why sleeping with oom_lock would be a problem. It simply
> > doesn't make much sense to try to trigger OOM killer when there is/are
> > OOM victims still exiting.
>
> Because thread A's memory allocation is deferred by the allocations of
> threads B, C, D..., which are holding (or waiting for) the oom_lock, while
> the OOM victim is waiting for thread A's allocation. A memory allocator that
> takes on average 5 seconds per allocation is considered unusable. If we sleep
> without holding the oom_lock, the allocator can allocate in
> (5 / number_of_allocating_threads) seconds on average. Sleeping with the
> oom_lock held can therefore effectively prevent thread A from making progress.

I agree with that. The problem with the sleeping is that it has to be
long enough to give the OOM victim a fair chance to exit, but short
enough to not make the page allocator unusable in case there is a
genuine deadlock. And you are right, the context blocking the OOM
victim from exiting might do a whole string of allocations, not just
one, before releasing any locks.

What we can do to mitigate this is tie the timeout to the setting of
TIF_MEMDIE so that the wait is not 5s from the point of calling
out_of_memory() but from the point of where TIF_MEMDIE was set.
Subsequent allocations will then go straight to the reserves.
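
A sketch of that idea (the memdie_start field is hypothetical, used here only
for illustration; the 5s value matches the series):

        void mark_oom_victim(struct task_struct *tsk)
        {
                if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
                        return;
                tsk->memdie_start = jiffies;    /* hypothetical field */
                atomic_inc(&oom_victims);
        }

        /* ...and in the wait path, instead of a fixed 5s sleep per call: */
        if (time_after(jiffies, victim->memdie_start + 5 * HZ))
                return true;    /* victim considered stuck, stop waiting */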

> > > (2) oom_kill_process() can determine when to kill next OOM victim.
> > >
> > > (3) oom_scan_process_thread() can take TIF_MEMDIE timeout into
> > > account when choosing an OOM victim.
> >
> > You have heard my opinions about this and I do not plan to repeat them
> > here again.
>
> Yes, I've heard your opinions. But neither ALLOC_NO_WATERMARKS nor WMARK_OOM
> is a perfect measure for avoiding deadlock. We want to solve "Without any
> extra measures the above situation will result in a deadlock" problem.
> When WMARK_OOM failed to avoid the deadlock, and we don't want to go to
> ALLOC_NO_WATERMARKS, I think somehow choosing and killing more victims is
> the only possible measure. Maybe choosing next OOM victim upon reaching
> WMARK_OOM?

I also think that the argument against moving on to the next victim is
completely missing the tradeoff we are making here. When the victim
is stuck and we run out of memory reserves, the machine *deadlocks*
while there are still tasks that might be able to release memory.
It's irresponsible not to go after them. Why *shouldn't* we try?

2015-04-29 14:40:38

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> On Wed, Apr 29, 2015 at 12:50:37AM +0900, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > On Tue 28-04-15 19:34:47, Tetsuo Handa wrote:
> > > [...]
> > > > [PATCH 8/9] makes the speed of allocating __GFP_FS pages extremely slow (5
> > > > seconds / page) because out_of_memory() serialized by the oom_lock sleeps for
> > > > 5 seconds before returning true when the OOM victim got stuck. This throttling
> > > > also slows down !__GFP_FS allocations when there is a thread doing a __GFP_FS
> > > > allocation, for __alloc_pages_may_oom() is serialized by the oom_lock
> > > > regardless of gfp_mask.
> > >
> > > This is indeed unnecessary.
> > >
> > > > How long will the OOM victim is blocked when the
> > > > allocating task needs to allocate e.g. 1000 !__GFP_FS pages before allowing
> > > > the OOM victim waiting at mutex_lock(&inode->i_mutex) to continue? It will be
> > > > a too-long-to-wait stall which is effectively a deadlock for users. I think
> > > > we should not sleep with the oom_lock held.
> > >
> > > I do not see why sleeping with oom_lock would be a problem. It simply
> > > doesn't make much sense to try to trigger OOM killer when there is/are
> > > OOM victims still exiting.
> >
> > Because thread A's memory allocation is deferred by the allocations of
> > threads B, C, D..., which are holding (or waiting for) the oom_lock, while
> > the OOM victim is waiting for thread A's allocation. A memory allocator that
> > takes on average 5 seconds per allocation is considered unusable. If we sleep
> > without holding the oom_lock, the allocator can allocate in
> > (5 / number_of_allocating_threads) seconds on average. Sleeping with the
> > oom_lock held can therefore effectively prevent thread A from making progress.
>
> I agree with that. The problem with the sleeping is that it has to be
> long enough to give the OOM victim a fair chance to exit, but short
> enough to not make the page allocator unusable in case there is a
> genuine deadlock. And you are right, the context blocking the OOM
> victim from exiting might do a whole string of allocations, not just
> one, before releasing any locks.
>
> What we can do to mitigate this is tie the timeout to the setting of
> TIF_MEMDIE so that the wait is not 5s from the point of calling
> out_of_memory() but from the point of where TIF_MEMDIE was set.
> Subsequent allocations will then go straight to the reserves.

That would deplete the reserves very easily. Shouldn't we rather
go other way around? Allow OOM killer context to dive into memory
reserves some more (ALLOC_OOM on top of current ALLOC flags and
__zone_watermark_ok would allow an additional 1/4 of the reserves) and
start waiting for the victim after that reserve is depleted. We would
still have some room for TIF_MEMDIE to allocate, the reserves consumption
would be throttled somehow and the holders of resources would have some
chance to release them and allow the victim to die.

If the allocation still fails after the timeout then we should consider
failing the allocation as the next step or give NO_WATERMARK to
GFP_NOFAIL.
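
As a rough sketch of what that could mean inside __zone_watermark_ok(),
modelled on the existing ALLOC_HIGH/ALLOC_HARDER handling (ALLOC_OOM itself
is hypothetical here):

        if (alloc_flags & ALLOC_HIGH)
                min -= min / 2;
        if (alloc_flags & ALLOC_HARDER)
                min -= min / 4;
        /* hypothetical: OOM-killing contexts get one more quarter */
        if (alloc_flags & ALLOC_OOM)
                min -= min / 4;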
--
Michal Hocko
SUSE Labs

2015-04-29 17:27:52

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Michal Hocko wrote:
> On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> > What we can do to mitigate this is tie the timeout to the setting of
> > TIF_MEMDIE so that the wait is not 5s from the point of calling
> > out_of_memory() but from the point of where TIF_MEMDIE was set.
> > Subsequent allocations will then go straight to the reserves.
>
> That would deplete the reserves very easily. Shouldn't we rather
> go other way around? Allow OOM killer context to dive into memory
> reserves some more (ALLOC_OOM on top of current ALLOC flags and
> __zone_watermark_ok would allow an additional 1/4 of the reserves) and
> start waiting for the victim after that reserve is depleted. We would
> still have some room for TIF_MEMDIE to allocate, the reserves consumption
> would be throttled somehow and the holders of resources would have some
> chance to release them and allow the victim to die.

Does OOM killer context mean memory allocations which can call out_of_memory()?
If yes, there is no guarantee that such memory reserve is used by threads which
the OOM victim is waiting for, for they might do only !__GFP_FS allocations.
Likewise, there is possibility that such memory reserve is used by threads
which the OOM victim is not waiting for, for malloc() + memset() causes
__GFP_FS allocations.

2015-04-29 18:31:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Thu 30-04-15 02:27:44, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Wed 29-04-15 08:55:06, Johannes Weiner wrote:
> > > What we can do to mitigate this is tie the timeout to the setting of
> > > TIF_MEMDIE so that the wait is not 5s from the point of calling
> > > out_of_memory() but from the point of where TIF_MEMDIE was set.
> > > Subsequent allocations will then go straight to the reserves.
> >
> > That would deplete the reserves very easily. Shouldn't we rather
> > go other way around? Allow OOM killer context to dive into memory
> > reserves some more (ALLOC_OOM on top of current ALLOC flags and
> > __zone_watermark_ok would allow an additional 1/4 of the reserves) and
> > start waiting for the victim after that reserve is depleted. We would
> > still have some room for TIF_MEMDIE to allocate, the reserves consumption
> > would be throttled somehow and the holders of resources would have some
> > chance to release them and allow the victim to die.
>
> Does OOM killer context mean memory allocations which can call out_of_memory()?

Yes, that was the idea, because the other allocation contexts will not reclaim
any memory. Even those which invoke out_of_memory() will not kill a new task,
but the one killed task should compensate for the ALLOC_OOM part of the memory
reserves.

> If yes, there is no guarantee that such memory reserve is used by threads which
> the OOM victim is waiting for, for they might do only !__GFP_FS allocations.

OK, so we are back to GFP_NOFS. Right, those are your main pain point
because you can see i_mutex deadlocks. But really, those allocations
should simply fail, because looping in the allocator and relying on others
to make progress is simply retarded.

I thought that Dave was quite explicit that they do not strictly
need the nofail behavior of GFP_NOFS but rather a GFP flag which
would allow diving into the reserves some more for specific contexts
(http://marc.info/?l=linux-mm&m=142897087230385&w=2). I also do not
remember him or anybody else saying that _every_ GFP_NOFS allocation
should get access to the reserves automatically.

Dave, could you clarify/confirm, please?

We have been going back and forth about GFP_NOFS without any progress
for a very long time now, and it seems one class of issues could already
be handled by this change.

I mean we should eventually fail all the allocation types but GFP_NOFS
is coming from _carefully_ handled code paths which is an easier starting
point than a random code path in the kernel/drivers. So can we finally
move at least in this direction?

> Likewise, there is possibility that such memory reserve is used by threads
> which the OOM victim is not waiting for, for malloc() + memset() causes
> __GFP_FS allocations.

We cannot be certain without complete dependency tracking. This is
just a heuristic.
--
Michal Hocko
SUSE Labs

2015-04-30 09:44:30

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Michal Hocko wrote:
> I mean we should eventually fail all the allocation types but GFP_NOFS
> is coming from _carefully_ handled code paths which is an easier starting
> point than a random code path in the kernel/drivers. So can we finally
> move at least in this direction?

I agree that all the allocation types can fail unless GFP_NOFAIL is given.
But I also expect that all the allocation types should not fail unless
order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given or chosen as an OOM
victim.

We already experienced at Linux 3.19 what happens if !__GFP_FS allocations
fails. out_of_memory() is called by pagefault_out_of_memory() when 0x2015a
(!__GFP_FS) allocation failed. This looks to me that !__GFP_FS allocations
are effectively OOM killer context. It is not fair to kill the thread which
triggered a page fault, for that thread may not be using so much memory
(unfair from memory usage point of view) or that thread may be global init
(unfair because killing the entire system than survive by killing somebody).
Also, failing the GFP_NOFS/GFP_NOIO allocations which are not triggered by
a page fault generally causes more damage (e.g. taking filesystem error
action) than survive by killing somebody. Therefore, I think we should not
hesitate invoking the OOM killer for !__GFP_FS allocation.

> > Likewise, there is possibility that such memory reserve is used by threads
> > which the OOM victim is not waiting for, for malloc() + memset() causes
> > __GFP_FS allocations.
>
> We cannot be certain without complete dependency tracking. This is
> just a heuristic.

Yes, we cannot be certain without complete dependency tracking, and complete
dependency tracking is too expensive to implement. Dave is recommending that
we focus on not triggering the OOM killer rather than on how to handle corner
cases in OOM conditions, isn't he? I still believe that choosing more OOM
victims upon timeout (which is a heuristic after all) and invoking the OOM
killer for !__GFP_FS allocations are the cheapest and least surprising
measures. This is something like automatically and periodically pressing
SysRq-f on behalf of the system administrator when the memory allocator
cannot recover from a low memory situation.
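
A sketch of that timeout idea in the victim scan (OOM_TIMEOUT and the
memdie_start timestamp are hypothetical; OOM_SCAN_CONTINUE/OOM_SCAN_ABORT are
the existing scan results):

        if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
                /* Victim looks stuck: skip it so the scan picks another. */
                if (time_after(jiffies, task->memdie_start + OOM_TIMEOUT))
                        return OOM_SCAN_CONTINUE;
                /* Otherwise keep waiting for it to exit. */
                return OOM_SCAN_ABORT;
        }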

2015-04-30 14:25:43

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Thu 30-04-15 18:44:25, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > I mean we should eventually fail all the allocation types but GFP_NOFS
> > is coming from _carefully_ handled code paths which is an easier starting
> > point than a random code path in the kernel/drivers. So can we finally
> > move at least in this direction?
>
> I agree that all the allocation types can fail unless GFP_NOFAIL is given.
> But I also expect that all the allocation types should not fail unless
> order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given or chosen as an OOM
> victim.

Yeah, let's keep shooting our feet and then look for workarounds to deal
with it...

> We already experienced at Linux 3.19 what happens if !__GFP_FS allocations
> fails. out_of_memory() is called by pagefault_out_of_memory() when 0x2015a
> (!__GFP_FS) allocation failed.

I have posted a patch to deal with this
(http://marc.info/?l=linux-mm&m=142770374521952&w=2). There is no real
reason to do the GFP_NOFS from the page fault context just because the
mapping _always_ insists on it. Page fault simply _has_ to be GFP_FS
safe, we are badly broken otherwise. That patch should go hand in hand with
the "GFP_NOFS might fail" one. I haven't posted it yet because I was waiting
for the merge window to close.

> This looks to me that !__GFP_FS allocations
> are effectively OOM killer context. It is not fair to kill the thread which
> triggered a page fault, for that thread may not be using so much memory
> (unfair from memory usage point of view) or that thread may be global init
> (unfair because killing the entire system than survive by killing somebody).

Why would we kill the faulting process?

> Also, failing the GFP_NOFS/GFP_NOIO allocations which are not triggered by
> a page fault generally causes more damage (e.g. taking filesystem error
> action) than survive by killing somebody. Therefore, I think we should not
> hesitate invoking the OOM killer for !__GFP_FS allocation.

No, we should fix those places and use proper gfp flags rather than
pretend that the problem doesn't exist and deal with all the side
effects.
--
Michal Hocko
SUSE Labs

2015-05-04 18:02:36

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Hi Andrew,

since patches 8 and 9 are still controversial, would you mind picking
up just 1-7 for now? They're cleanups that are nice to have on their own.

Thanks,
Johannes

On Mon, Apr 27, 2015 at 03:05:46PM -0400, Johannes Weiner wrote:
> There is a possible deadlock scenario between the page allocator and
> the OOM killer. Most allocations currently retry forever inside the
> page allocator, but when the OOM killer is invoked the chosen victim
> might try taking locks held by the allocating task. This series, on
> top of many cleanups in the allocator & OOM killer, grants such OOM-
> killing allocations access to the system's memory reserves in order
> for them to make progress without relying on their own kill to exit.
>
> Changes since v1:
> - drop GFP_NOFS deadlock fix (Dave Chinner)
> - drop low-order deadlock fix (Michal Hocko)
> - fix missing oom_lock in sysrq+f (Michal Hocko)
> - fix PAGE_ALLOC_COSTLY retry condition (Michal Hocko)
> - ALLOC_NO_WATERMARKS only for OOM victims, not all killed tasks (Tetsuo Handa)
> - bump OOM wait timeout from 1s to 5s (Vlastimil Babka & Michal Hocko)
>
> drivers/staging/android/lowmemorykiller.c | 2 +-
> drivers/tty/sysrq.c | 2 +
> include/linux/oom.h | 12 +-
> kernel/exit.c | 2 +-
> mm/memcontrol.c | 20 ++--
> mm/oom_kill.c | 167 +++++++---------------------
> mm/page_alloc.c | 161 ++++++++++++---------------
> 7 files changed, 137 insertions(+), 229 deletions(-)
>

2015-05-04 19:01:29

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

On Mon 04-05-15 14:02:10, Johannes Weiner wrote:
> Hi Andrew,
>
> since patches 8 and 9 are still controversial, would you mind picking
> up just 1-7 for now? They're cleanups that are nice to have on their own.

Completely agreed.
--
Michal Hocko
SUSE Labs

2015-05-23 14:42:29

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/9] mm: improve OOM mechanism v2

Michal Hocko wrote:
> On Thu 30-04-15 18:44:25, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > I mean we should eventually fail all the allocation types but GFP_NOFS
> > > is coming from _carefully_ handled code paths which is an easier starting
> > > point than a random code path in the kernel/drivers. So can we finally
> > > move at least in this direction?
> >
> > I agree that all the allocation types can fail unless GFP_NOFAIL is given.
> > But I also expect that all the allocation types should not fail unless
> > order > PAGE_ALLOC_COSTLY_ORDER or GFP_NORETRY is given or chosen as an OOM
> > victim.
>
> Yeah, let's keep shooting our feet and then look for workarounds to deal
> with it...
>
> > We already experienced at Linux 3.19 what happens if !__GFP_FS allocations
> > fails. out_of_memory() is called by pagefault_out_of_memory() when 0x2015a
> > (!__GFP_FS) allocation failed.
>
> I have posted a patch to deal with this
> (http://marc.info/?l=linux-mm&m=142770374521952&w=2). There is no real
> reason to do the GFP_NOFS from the page fault context just because the
> mapping _always_ insists on it. Page fault simply _has_ to be GFP_FS
> safe, we are badly broken otherwise. That patch should go hand in hand with
> the "GFP_NOFS might fail" one. I haven't posted it yet because I was waiting
> for the merge window to close.
>
Converting page fault allocations from GFP_NOFS to GFP_KERNEL is a different
problem for me. My concern is that failing or stalling GFP_NOFS/GFP_NOIO
allocations is more dangerous than failing GFP_KERNEL allocations.

> > This looks to me that !__GFP_FS allocations
> > are effectively OOM killer context. It is not fair to kill the thread which
> > triggered a page fault, for that thread may not be using so much memory
> > (unfair from memory usage point of view) or that thread may be global init
> > (unfair because killing the entire system than survive by killing somebody).
>
> Why would we kill the faulting process?
>
We can see processes killed by SIGBUS if we allow memory allocations
triggered by page faults to fail, can't we? I didn't catch what your question was.

> > Also, failing the GFP_NOFS/GFP_NOIO allocations which are not triggered by
> > a page fault generally causes more damage (e.g. taking filesystem error
> > action) than survive by killing somebody. Therefore, I think we should not
> > hesitate invoking the OOM killer for !__GFP_FS allocation.
>
> No, we should fix those places and use proper gfp flags rather than
> pretend that the problem doesn't exist and deal with all the side
> effects.

Do you think we can identify and fix such places and _backport_ the fixes
before customers bother us with unexplained hangups?

As Andrew Morton picked up patches 1 to 7 of this series, I reposted the
timeout-based OOM killing patch at
http://marc.info/?l=linux-mm&m=143239200805478&w=2 .
Please check it and point out what I'm misunderstanding.

> --
> Michal Hocko
> SUSE Labs
>