2016-03-22 11:00:55

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 0/9] oom reaper v6

Hi,
I am reposting the whole patchset on top of the current Linus tree which should
already contain big pile of Andrew's mm patches. This should serve an easier
reviewability and I also hope that this core part of the work can go to 4.6.

The previous version was posted here [1] Hugh and David have suggested to
drop [2] because the munlock path currently depends on the page lock and
it is better if the initial version was conservative and prevent from
any potential lockups even though it is not clear whether they are real
- nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
will have a look and try to make the munlock path not depend on the page
lock as a follow up work.

Apart from that the feedback revealed one bug for a very unusual
configuration (sysctl_oom_kill_allocating_task) and that has been fixed
by patch 8 and one potential mis interaction with the pm freezer fixed by
patch 7.

I think the current code base is already very useful for many situations.
The rest of the feedback was mostly about potential enhancements of the
current code which I would really prefer to build on top of the current
series. I plan to finish my mmap_sem killable for write in the upcoming
release cycle and hopefully have it merged in the next merge window.
I believe more extensions will follow.

This code has been sitting in the mmotm (thus linux-next) for a while.
Are there any fundamental objections to have this part merged in this
merge window?

Thanks!

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]



2016-03-22 11:01:13

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 1/9] sched: add schedule_timeout_idle()

From: Andrew Morton <[email protected]>

This will be needed in the patch "mm, oom: introduce oom reaper".

Acked-by: Michal Hocko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/sched.h | 1 +
kernel/time/timer.c | 11 +++++++++++
2 files changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 084ed9fba620..9cf5731472fe 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -425,6 +425,7 @@ extern signed long schedule_timeout(signed long timeout);
extern signed long schedule_timeout_interruptible(signed long timeout);
extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);

diff --git a/kernel/time/timer.c b/kernel/time/timer.c
index d1798fa0c743..73164c3aa56b 100644
--- a/kernel/time/timer.c
+++ b/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uninterruptible(signed long timeout)
}
EXPORT_SYMBOL(schedule_timeout_uninterruptible);

+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+ __set_current_state(TASK_IDLE);
+ return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
#ifdef CONFIG_HOTPLUG_CPU
static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
{
--
2.7.0

2016-03-22 11:01:21

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 2/9] mm, oom: introduce oom reaper

From: Michal Hocko <[email protected]>

This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.

The OOM killer currently allows to kill only a single task in a good
hope that the task will terminate in a reasonable time and frees up its
memory. Such a task (oom victim) will get an access to memory reserves
via mark_oom_victim to allow a forward progress should there be a need
for additional memory during exit path.

It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above and
the OOM victim might take unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting for an event
(e.g. lock) which is blocked by another task looping in the page
allocator.

This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim under an assumption that such a memory won't
be needed when its owner is killed and kicked from the userspace anyway.
There is one notable exception to this, though, if the OOM victim was
in the process of coredumping the result would be incomplete. This is
considered a reasonable constrain because the overall system health is
more important than debugability of a particular application.

A kernel thread has been chosen because we need a reliable way of
invocation so workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd which sounds
like another good fit is not appropriate as well because it might get
blocked on locks during reclaim as well.

oom_reaper has to take mmap_sem on the target task for reading so the
solution is not 100% because the semaphore might be held or blocked for
write but the probability is reduced considerably wrt. basically any
lock blocking forward progress as described above. In order to prevent
from blocking on the lock without any forward progress we are using only
a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and reduce allocations
requests done with the lock held to absolute minimum to reduce the risk
even further.

The API between oom killer and oom reaper is quite trivial. wake_oom_reaper
updates mm_to_reap with cmpxchg to guarantee only NULL->mm transition
and oom_reaper clear this atomically once it is done with the work. This
means that only a single mm_struct can be reaped at the time. As the
operation is potentially disruptive we are trying to limit it to the
ncessary minimum and the reaper blocks any updates while it operates on
an mm. mm_struct is pinned by mm_count to allow parallel exit_mmap and a
race is detected by atomic_inc_not_zero(mm_users).

Chnages since v4
- drop MAX_RT_PRIO-1 as per David - memcg/cpuset/mempolicy OOM killing
might interfere with the rest of the system
Changes since v3
- many style/compile fixups by Andrew
- unmap_mapping_range_tree needs full initialization of zap_details
to prevent from missing unmaps and follow up BUG_ON during truncate
resp. misaccounting - Kirill/Andrew
- exclude mlocked pages because they need an explicit munlock by Kirill
- use subsys_initcall instead of module_init - Paul Gortmaker
- do not tear down mm if it is shared with the global init because this
could lead to SEGV and panic - Tetsuo
Changes since v2
- fix mm_count refernce leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo

Suggested-by: Oleg Nesterov <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Acked-by: Mel Gorman <[email protected]>
Acked-by: David Rientjes <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/mm.h | 2 +
mm/internal.h | 5 ++
mm/memory.c | 17 +++---
mm/oom_kill.c | 151 ++++++++++++++++++++++++++++++++++++++++++++++++++---
4 files changed, 162 insertions(+), 13 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 450fc977ed02..ed6407d1b7b5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1132,6 +1132,8 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+ bool ignore_dirty; /* Ignore dirty pages */
+ bool check_swap_entries; /* Check also swap entries */
};

struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 7449392c6faa..b79abb6721cf 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -38,6 +38,11 @@
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);

+void unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details);
+
extern int __do_page_cache_readahead(struct address_space *mapping,
struct file *filp, pgoff_t offset, unsigned long nr_to_read,
unsigned long lookahead_size);
diff --git a/mm/memory.c b/mm/memory.c
index 81dca0083fcd..098f00d05461 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1102,6 +1102,12 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,

if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
+ /*
+ * oom_reaper cannot tear down dirty
+ * pages
+ */
+ if (unlikely(details && details->ignore_dirty))
+ continue;
force_flush = 1;
set_page_dirty(page);
}
@@ -1120,8 +1126,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ /* only check swap_entries if explicitly asked for in details */
+ if (unlikely(details && !details->check_swap_entries))
continue;

entry = pte_to_swp_entry(ptent);
@@ -1226,7 +1232,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
return addr;
}

-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details)
@@ -1234,9 +1240,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;

- if (details && !details->check_mapping)
- details = NULL;
-
BUG_ON(addr >= end);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
@@ -2432,7 +2435,7 @@ static inline void unmap_mapping_range_tree(struct rb_root *root,
void unmap_mapping_range(struct address_space *mapping,
loff_t const holebegin, loff_t const holelen, int even_cows)
{
- struct zap_details details;
+ struct zap_details details = { };
pgoff_t hba = holebegin >> PAGE_SHIFT;
pgoff_t hlen = (holelen + PAGE_SIZE - 1) >> PAGE_SHIFT;

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 06f7e1707847..f7ed6ece0719 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
#include <linux/freezer.h>
#include <linux/ftrace.h>
#include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/init.h>
+
+#include <asm/tlb.h>
+#include "internal.h"

#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
@@ -405,6 +410,133 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);

bool oom_killer_disabled __read_mostly;

+#ifdef CONFIG_MMU
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+ struct mmu_gather tlb;
+ struct vm_area_struct *vma;
+ struct zap_details details = {.check_swap_entries = true,
+ .ignore_dirty = true};
+ bool ret = true;
+
+ /* We might have raced with exit path */
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ return true;
+
+ if (!down_read_trylock(&mm->mmap_sem)) {
+ ret = false;
+ goto out;
+ }
+
+ tlb_gather_mmu(&tlb, mm, 0, -1);
+ for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+ if (is_vm_hugetlb_page(vma))
+ continue;
+
+ /*
+ * mlocked VMAs require explicit munlocking before unmap.
+ * Let's keep it simple here and skip such VMAs.
+ */
+ if (vma->vm_flags & VM_LOCKED)
+ continue;
+
+ /*
+ * Only anonymous pages have a good chance to be dropped
+ * without additional steps which we cannot afford as we
+ * are OOM already.
+ *
+ * We do not even care about fs backed pages because all
+ * which are reclaimable have already been reclaimed and
+ * we do not want to block exit_mmap by keeping mm ref
+ * count elevated without a good reason.
+ */
+ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+ unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+ &details);
+ }
+ tlb_finish_mmu(&tlb, 0, -1);
+ up_read(&mm->mmap_sem);
+out:
+ mmput(mm);
+ return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+ int attempts = 0;
+
+ /* Retry the down_read_trylock(mmap_sem) a few times */
+ while (attempts++ < 10 && !__oom_reap_vmas(mm))
+ schedule_timeout_idle(HZ/10);
+
+ /* Drop a reference taken by wake_oom_reaper */
+ mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+ while (true) {
+ struct mm_struct *mm;
+
+ wait_event_freezable(oom_reaper_wait,
+ (mm = READ_ONCE(mm_to_reap)));
+ oom_reap_vmas(mm);
+ WRITE_ONCE(mm_to_reap, NULL);
+ }
+
+ return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+ struct mm_struct *old_mm;
+
+ if (!oom_reaper_th)
+ return;
+
+ /*
+ * Pin the given mm. Use mm_count instead of mm_users because
+ * we do not want to delay the address space tear down.
+ */
+ atomic_inc(&mm->mm_count);
+
+ /*
+ * Make sure that only a single mm is ever queued for the reaper
+ * because multiple are not necessary and the operation might be
+ * disruptive so better reduce it to the bare minimum.
+ */
+ old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+ if (!old_mm)
+ wake_up(&oom_reaper_wait);
+ else
+ mmdrop(mm);
+}
+
+static int __init oom_init(void)
+{
+ oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+ if (IS_ERR(oom_reaper_th)) {
+ pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+ PTR_ERR(oom_reaper_th));
+ oom_reaper_th = NULL;
+ }
+ return 0;
+}
+subsys_initcall(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
+#endif
+
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
@@ -510,6 +642,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
+ bool can_oom_reap = true;

/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -600,17 +733,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
continue;
if (same_thread_group(p, victim))
continue;
- if (unlikely(p->flags & PF_KTHREAD))
- continue;
- if (is_global_init(p))
- continue;
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ if (unlikely(p->flags & PF_KTHREAD) || is_global_init(p) ||
+ p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ /*
+ * We cannot use oom_reaper for the mm shared by this
+ * process because it wouldn't get killed and so the
+ * memory might be still used.
+ */
+ can_oom_reap = false;
continue;
-
+ }
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();

+ if (can_oom_reap)
+ wake_oom_reaper(mm);
+
mmdrop(mm);
put_task_struct(victim);
}
--
2.7.0

2016-03-22 11:01:31

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 8/9] oom: make oom_reaper freezable

From: Michal Hocko <[email protected]>

After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
address space" oom_reaper will call exit_oom_victim on the target
task after it is done. This might however race with the PM freezer:

CPU0 CPU1 CPU2
freeze_processes
try_to_freeze_tasks
# Allocation request
out_of_memory
oom_killer_disable
wake_oom_reaper(P1)
__oom_reap_task
exit_oom_victim(P1)
wait_event(oom_victims==0)
[...]
do_exit(P1)
perform IO/interfere with the freezer

which breaks the oom_killer_disable semantic. We no longer have a
guarantee that the oom victim won't interfere with the freezer because
it might be anywhere on the way to do_exit while the freezer thinks the
task has already terminated. It might trigger IO or touch devices which
are frozen already.

In order to close this race, make the oom_reaper thread freezable. This
will work because
a) already running oom_reaper will block freezer to enter the
quiescent state
b) wake_oom_reaper will not wake up the reaper after it has been
frozen
c) the only way to call exit_oom_victim after try_to_freeze_tasks
is from the oom victim's context when we know the further
interference shouldn't be possible

Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 2 ++
1 file changed, 2 insertions(+)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index af75260f32c3..bed2885d10b0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -524,6 +524,8 @@ static void oom_reap_task(struct task_struct *tsk)

static int oom_reaper(void *unused)
{
+ set_freezable();
+
while (true) {
struct task_struct *tsk = NULL;

--
2.7.0

2016-03-22 11:01:47

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 7/9] oom: make oom_reaper_list single linked

From: Vladimir Davydov <[email protected]>

Entries are only added/removed from oom_reaper_list at head so we can use
a single linked list and hence save a word in task_struct.

Signed-off-by: Vladimir Davydov <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 2 +-
mm/oom_kill.c | 15 +++++++--------
2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index acb480b581e3..d118445a332e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1841,7 +1841,7 @@ struct task_struct {
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
- struct list_head oom_reaper_list;
+ struct task_struct *oom_reaper_list;
#endif
/* CPU-specific state of this task */
struct thread_struct thread;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index b38a648558f9..af75260f32c3 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -417,7 +417,7 @@ bool oom_killer_disabled __read_mostly;
*/
static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static LIST_HEAD(oom_reaper_list);
+static struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);


@@ -527,13 +527,11 @@ static int oom_reaper(void *unused)
while (true) {
struct task_struct *tsk = NULL;

- wait_event_freezable(oom_reaper_wait,
- (!list_empty(&oom_reaper_list)));
+ wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
spin_lock(&oom_reaper_lock);
- if (!list_empty(&oom_reaper_list)) {
- tsk = list_first_entry(&oom_reaper_list,
- struct task_struct, oom_reaper_list);
- list_del(&tsk->oom_reaper_list);
+ if (oom_reaper_list != NULL) {
+ tsk = oom_reaper_list;
+ oom_reaper_list = tsk->oom_reaper_list;
}
spin_unlock(&oom_reaper_lock);

@@ -552,7 +550,8 @@ static void wake_oom_reaper(struct task_struct *tsk)
get_task_struct(tsk);

spin_lock(&oom_reaper_lock);
- list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+ tsk->oom_reaper_list = oom_reaper_list;
+ oom_reaper_list = tsk;
spin_unlock(&oom_reaper_lock);
wake_up(&oom_reaper_wait);
}
--
2.7.0

2016-03-22 11:02:41

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 9/9] oom, oom_reaper: protect oom_reaper_list using simpler way

From: Tetsuo Handa <[email protected]>

"oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
by simply checking tsk->oom_reaper_list != NULL.

Signed-off-by: Tetsuo Handa <[email protected]>
Acked-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 2 --
mm/oom_kill.c | 8 ++------
2 files changed, 2 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d118445a332e..78434d4f85f2 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -511,8 +511,6 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */

-#define MMF_OOM_KILLED 21 /* OOM killer has chosen this mm */
-
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index bed2885d10b0..cfafb91ebcd9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -546,7 +546,7 @@ static int oom_reaper(void *unused)

static void wake_oom_reaper(struct task_struct *tsk)
{
- if (!oom_reaper_th)
+ if (!oom_reaper_th || tsk->oom_reaper_list)
return;

get_task_struct(tsk);
@@ -680,7 +680,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- bool can_oom_reap;
+ bool can_oom_reap = true;

/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -742,10 +742,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
/* Get a reference to safely compare mm after task_unlock(victim) */
mm = victim->mm;
atomic_inc(&mm->mm_count);
-
- /* Make sure we do not try to oom reap the mm multiple times */
- can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
-
/*
* We should send SIGKILL before setting TIF_MEMDIE in order to prevent
* the OOM victim from depleting the memory reserves from the user
--
2.7.0

2016-03-22 11:02:52

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 6/9] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task

From: Michal Hocko <[email protected]>

Tetsuo has reported that oom_kill_allocating_task=1 will cause
oom_reaper_list corruption because oom_kill_process doesn't follow
standard OOM exclusion (aka ignores TIF_MEMDIE) and allows to enqueue
the same task multiple times - e.g. by sacrificing the same child
multiple times.

This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
which is set in oom_kill_process atomically and oom reaper is disabled
if the flag was already set.

Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 2 ++
mm/oom_kill.c | 6 +++++-
2 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bc5867296f7b..acb480b581e3 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -511,6 +511,8 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */

+#define MMF_OOM_KILLED 21 /* OOM killer has chosen this mm */
+
#define MMF_INIT_MASK (MMF_DUMPABLE_MASK | MMF_DUMP_FILTER_MASK)

struct sighand_struct {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8e0bd279135f..b38a648558f9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -679,7 +679,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
- bool can_oom_reap = true;
+ bool can_oom_reap;

/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -741,6 +741,10 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
/* Get a reference to safely compare mm after task_unlock(victim) */
mm = victim->mm;
atomic_inc(&mm->mm_count);
+
+ /* Make sure we do not try to oom reap the mm multiple times */
+ can_oom_reap = !test_and_set_bit(MMF_OOM_KILLED, &mm->flags);
+
/*
* We should send SIGKILL before setting TIF_MEMDIE in order to prevent
* the OOM victim from depleting the memory reserves from the user
--
2.7.0

2016-03-22 11:03:01

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 5/9] mm, oom_reaper: implement OOM victims queuing

From: Michal Hocko <[email protected]>

wake_oom_reaper has allowed only 1 oom victim to be queued. The main
reason for that was the simplicity as other solutions would require
some way of queuing. The current approach is racy and that was deemed
sufficient as the oom_reaper is considered a best effort approach
to help with oom handling when the OOM victim cannot terminate in a
reasonable time. The race could lead to missing an oom victim which can
get stuck

out_of_memory
wake_oom_reaper
cmpxchg // OK
oom_reaper
oom_reap_task
__oom_reap_task
oom_victim terminates
atomic_inc_not_zero // fail
out_of_memory
wake_oom_reaper
cmpxchg // fails
task_to_reap = NULL

This race requires 2 OOM invocations in a short time period which is not
very likely but certainly not impossible. E.g. the original victim might
have not released a lot of memory for some reason.

The situation would improve considerably if wake_oom_reaper used a more
robust queuing. This is what this patch implements. This means adding
oom_reaper_list list_head into task_struct (eat a hole before embeded
thread_struct for that purpose) and a oom_reaper_lock spinlock for
queuing synchronization. wake_oom_reaper will then add the task on the
queue and oom_reaper will dequeue it.

Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/sched.h | 3 +++
mm/oom_kill.c | 36 +++++++++++++++++++-----------------
2 files changed, 22 insertions(+), 17 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 9cf5731472fe..bc5867296f7b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1838,6 +1838,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_MMU
+ struct list_head oom_reaper_list;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index e627ce235e38..8e0bd279135f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,8 +416,10 @@ bool oom_killer_disabled __read_mostly;
* victim (if that is possible) to help the OOM killer to move on.
*/
static struct task_struct *oom_reaper_th;
-static struct task_struct *task_to_reap;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+static LIST_HEAD(oom_reaper_list);
+static DEFINE_SPINLOCK(oom_reaper_lock);
+

#define K(x) ((x) << (PAGE_SHIFT-10))
static bool __oom_reap_task(struct task_struct *tsk)
@@ -523,12 +525,20 @@ static void oom_reap_task(struct task_struct *tsk)
static int oom_reaper(void *unused)
{
while (true) {
- struct task_struct *tsk;
+ struct task_struct *tsk = NULL;

wait_event_freezable(oom_reaper_wait,
- (tsk = READ_ONCE(task_to_reap)));
- oom_reap_task(tsk);
- WRITE_ONCE(task_to_reap, NULL);
+ (!list_empty(&oom_reaper_list)));
+ spin_lock(&oom_reaper_lock);
+ if (!list_empty(&oom_reaper_list)) {
+ tsk = list_first_entry(&oom_reaper_list,
+ struct task_struct, oom_reaper_list);
+ list_del(&tsk->oom_reaper_list);
+ }
+ spin_unlock(&oom_reaper_lock);
+
+ if (tsk)
+ oom_reap_task(tsk);
}

return 0;
@@ -536,23 +546,15 @@ static int oom_reaper(void *unused)

static void wake_oom_reaper(struct task_struct *tsk)
{
- struct task_struct *old_tsk;
-
if (!oom_reaper_th)
return;

get_task_struct(tsk);

- /*
- * Make sure that only a single mm is ever queued for the reaper
- * because multiple are not necessary and the operation might be
- * disruptive so better reduce it to the bare minimum.
- */
- old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
- if (!old_tsk)
- wake_up(&oom_reaper_wait);
- else
- put_task_struct(tsk);
+ spin_lock(&oom_reaper_lock);
+ list_add(&tsk->oom_reaper_list, &oom_reaper_list);
+ spin_unlock(&oom_reaper_lock);
+ wake_up(&oom_reaper_wait);
}

static int __init oom_init(void)
--
2.7.0

2016-03-22 11:03:10

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 4/9] mm, oom_reaper: report success/failure

From: Michal Hocko <[email protected]>

Inform about the successful/failed oom_reaper attempts and dump all the
held locks to tell us more who is blocking the progress.

Signed-off-by: Michal Hocko <[email protected]>
---
mm/oom_kill.c | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 2830b1c6483e..e627ce235e38 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -419,6 +419,7 @@ static struct task_struct *oom_reaper_th;
static struct task_struct *task_to_reap;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

+#define K(x) ((x) << (PAGE_SHIFT-10))
static bool __oom_reap_task(struct task_struct *tsk)
{
struct mmu_gather tlb;
@@ -479,6 +480,11 @@ static bool __oom_reap_task(struct task_struct *tsk)
&details);
}
tlb_finish_mmu(&tlb, 0, -1);
+ pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+ task_pid_nr(tsk), tsk->comm,
+ K(get_mm_counter(mm, MM_ANONPAGES)),
+ K(get_mm_counter(mm, MM_FILEPAGES)),
+ K(get_mm_counter(mm, MM_SHMEMPAGES)));
up_read(&mm->mmap_sem);

/*
@@ -495,14 +501,21 @@ static bool __oom_reap_task(struct task_struct *tsk)
return ret;
}

+#define MAX_OOM_REAP_RETRIES 10
static void oom_reap_task(struct task_struct *tsk)
{
int attempts = 0;

/* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < 10 && !__oom_reap_task(tsk))
+ while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task(tsk))
schedule_timeout_idle(HZ/10);

+ if (attempts > MAX_OOM_REAP_RETRIES) {
+ pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ debug_show_all_locks();
+ }
+
/* Drop a reference taken by wake_oom_reaper */
put_task_struct(tsk);
}
@@ -649,7 +662,6 @@ static bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
return false;
}

-#define K(x) ((x) << (PAGE_SHIFT-10))
/*
* Must be called while holding a reference to p, which will be released upon
* returning.
--
2.7.0

2016-03-22 11:03:24

by Michal Hocko

[permalink] [raw]
Subject: [PATCH 3/9] oom: clear TIF_MEMDIE after oom_reaper managed to unmap the address space

From: Michal Hocko <[email protected]>

When oom_reaper manages to unmap all the eligible vmas there shouldn't
be much of the freable memory held by the oom victim left anymore so it
makes sense to clear the TIF_MEMDIE flag for the victim and allow the
OOM killer to select another task.

The lack of TIF_MEMDIE also means that the victim cannot access memory
reserves anymore but that shouldn't be a problem because it would get
the access again if it needs to allocate and hits the OOM killer again
due to the fatal_signal_pending resp. PF_EXITING check. We can safely
hide the task from the OOM killer because it is clearly not a good
candidate anymore as everyhing reclaimable has been torn down already.

This patch will allow to cap the time an OOM victim can keep TIF_MEMDIE
and thus hold off further global OOM killer actions granted the oom
reaper is able to take mmap_sem for the associated mm struct. This is
not guaranteed now but further steps should make sure that mmap_sem
for write should be blocked killable which will help to reduce such a
lock contention. This is not done by this patch.

Note that exit_oom_victim might be called on a remote task from
__oom_reap_task now so we have to check and clear the flag atomically
otherwise we might race and underflow oom_victims or wake up
waiters too early.

Suggested-by: Johannes Weiner <[email protected]>
Suggested-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/oom.h | 2 +-
kernel/exit.c | 2 +-
mm/oom_kill.c | 73 +++++++++++++++++++++++++++++++++++------------------
3 files changed, 50 insertions(+), 27 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 03e6257321f0..45993b840ed6 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -91,7 +91,7 @@ extern enum oom_scan_t oom_scan_process_thread(struct oom_control *oc,

extern bool out_of_memory(struct oom_control *oc);

-extern void exit_oom_victim(void);
+extern void exit_oom_victim(struct task_struct *tsk);

extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);
diff --git a/kernel/exit.c b/kernel/exit.c
index 10e088237fed..ba3bd29d7e1d 100644
--- a/kernel/exit.c
+++ b/kernel/exit.c
@@ -434,7 +434,7 @@ static void exit_mm(struct task_struct *tsk)
mm_update_next_owner(mm);
mmput(mm);
if (test_thread_flag(TIF_MEMDIE))
- exit_oom_victim();
+ exit_oom_victim(tsk);
}

static struct task_struct *find_alive_thread(struct task_struct *p)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index f7ed6ece0719..2830b1c6483e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -416,20 +416,36 @@ bool oom_killer_disabled __read_mostly;
* victim (if that is possible) to help the OOM killer to move on.
*/
static struct task_struct *oom_reaper_th;
-static struct mm_struct *mm_to_reap;
+static struct task_struct *task_to_reap;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

-static bool __oom_reap_vmas(struct mm_struct *mm)
+static bool __oom_reap_task(struct task_struct *tsk)
{
struct mmu_gather tlb;
struct vm_area_struct *vma;
+ struct mm_struct *mm;
+ struct task_struct *p;
struct zap_details details = {.check_swap_entries = true,
.ignore_dirty = true};
bool ret = true;

- /* We might have raced with exit path */
- if (!atomic_inc_not_zero(&mm->mm_users))
+ /*
+ * Make sure we find the associated mm_struct even when the particular
+ * thread has already terminated and cleared its mm.
+ * We might have race with exit path so consider our work done if there
+ * is no mm.
+ */
+ p = find_lock_task_mm(tsk);
+ if (!p)
+ return true;
+
+ mm = p->mm;
+ if (!atomic_inc_not_zero(&mm->mm_users)) {
+ task_unlock(p);
return true;
+ }
+
+ task_unlock(p);

if (!down_read_trylock(&mm->mmap_sem)) {
ret = false;
@@ -464,60 +480,66 @@ static bool __oom_reap_vmas(struct mm_struct *mm)
}
tlb_finish_mmu(&tlb, 0, -1);
up_read(&mm->mmap_sem);
+
+ /*
+ * Clear TIF_MEMDIE because the task shouldn't be sitting on a
+ * reasonably reclaimable memory anymore. OOM killer can continue
+ * by selecting other victim if unmapping hasn't led to any
+ * improvements. This also means that selecting this task doesn't
+ * make any sense.
+ */
+ tsk->signal->oom_score_adj = OOM_SCORE_ADJ_MIN;
+ exit_oom_victim(tsk);
out:
mmput(mm);
return ret;
}

-static void oom_reap_vmas(struct mm_struct *mm)
+static void oom_reap_task(struct task_struct *tsk)
{
int attempts = 0;

/* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < 10 && !__oom_reap_vmas(mm))
+ while (attempts++ < 10 && !__oom_reap_task(tsk))
schedule_timeout_idle(HZ/10);

/* Drop a reference taken by wake_oom_reaper */
- mmdrop(mm);
+ put_task_struct(tsk);
}

static int oom_reaper(void *unused)
{
while (true) {
- struct mm_struct *mm;
+ struct task_struct *tsk;

wait_event_freezable(oom_reaper_wait,
- (mm = READ_ONCE(mm_to_reap)));
- oom_reap_vmas(mm);
- WRITE_ONCE(mm_to_reap, NULL);
+ (tsk = READ_ONCE(task_to_reap)));
+ oom_reap_task(tsk);
+ WRITE_ONCE(task_to_reap, NULL);
}

return 0;
}

-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
{
- struct mm_struct *old_mm;
+ struct task_struct *old_tsk;

if (!oom_reaper_th)
return;

- /*
- * Pin the given mm. Use mm_count instead of mm_users because
- * we do not want to delay the address space tear down.
- */
- atomic_inc(&mm->mm_count);
+ get_task_struct(tsk);

/*
* Make sure that only a single mm is ever queued for the reaper
* because multiple are not necessary and the operation might be
* disruptive so better reduce it to the bare minimum.
*/
- old_mm = cmpxchg(&mm_to_reap, NULL, mm);
- if (!old_mm)
+ old_tsk = cmpxchg(&task_to_reap, NULL, tsk);
+ if (!old_tsk)
wake_up(&oom_reaper_wait);
else
- mmdrop(mm);
+ put_task_struct(tsk);
}

static int __init oom_init(void)
@@ -532,7 +554,7 @@ static int __init oom_init(void)
}
subsys_initcall(oom_init)
#else
-static void wake_oom_reaper(struct mm_struct *mm)
+static void wake_oom_reaper(struct task_struct *tsk)
{
}
#endif
@@ -563,9 +585,10 @@ void mark_oom_victim(struct task_struct *tsk)
/**
* exit_oom_victim - note the exit of an OOM victim
*/
-void exit_oom_victim(void)
+void exit_oom_victim(struct task_struct *tsk)
{
- clear_thread_flag(TIF_MEMDIE);
+ if (!test_and_clear_tsk_thread_flag(tsk, TIF_MEMDIE))
+ return;

if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
@@ -748,7 +771,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
rcu_read_unlock();

if (can_oom_reap)
- wake_oom_reaper(mm);
+ wake_oom_reaper(victim);

mmdrop(mm);
put_task_struct(victim);
--
2.7.0

2016-03-22 12:23:57

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:

> extern signed long schedule_timeout_interruptible(signed long timeout);
> extern signed long schedule_timeout_killable(signed long timeout);
> extern signed long schedule_timeout_uninterruptible(signed long timeout);
> +extern signed long schedule_timeout_idle(signed long timeout);

> +/*
> + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> + * to load average.
> + */
> +signed long __sched schedule_timeout_idle(signed long timeout)
> +{
> + __set_current_state(TASK_IDLE);
> + return schedule_timeout(timeout);
> +}
> +EXPORT_SYMBOL(schedule_timeout_idle);

Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
pretty pointless.

Why not kill the lot?

2016-03-22 12:33:24

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
>
> > extern signed long schedule_timeout_interruptible(signed long timeout);
> > extern signed long schedule_timeout_killable(signed long timeout);
> > extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > +extern signed long schedule_timeout_idle(signed long timeout);
>
> > +/*
> > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > + * to load average.
> > + */
> > +signed long __sched schedule_timeout_idle(signed long timeout)
> > +{
> > + __set_current_state(TASK_IDLE);
> > + return schedule_timeout(timeout);
> > +}
> > +EXPORT_SYMBOL(schedule_timeout_idle);
>
> Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> pretty pointless.

It seems it is just too easy to miss the __set_current_state (I am
talking from my own experience). This also seems to be a pretty common
pattern so why not wrap it under a common call.

> Why not kill the lot?

We have over 400 users, would it be much better if we open code all of
them? It doesn't sound like a huge win to me.

--
Michal Hocko
SUSE Labs

2016-03-22 12:51:24

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, Mar 22, 2016 at 01:33:14PM +0100, Michal Hocko wrote:
> On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> >
> > > extern signed long schedule_timeout_interruptible(signed long timeout);
> > > extern signed long schedule_timeout_killable(signed long timeout);
> > > extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > > +extern signed long schedule_timeout_idle(signed long timeout);
> >
> > > +/*
> > > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > > + * to load average.
> > > + */
> > > +signed long __sched schedule_timeout_idle(signed long timeout)
> > > +{
> > > + __set_current_state(TASK_IDLE);
> > > + return schedule_timeout(timeout);
> > > +}
> > > +EXPORT_SYMBOL(schedule_timeout_idle);
> >
> > Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> > pretty pointless.
>
> It seems it is just too easy to miss the __set_current_state (I am
> talking from my own experience).

Well, that's what you get; if you call schedule() and forget to set a
blocking state you also don't block, where the problem?

> This also seems to be a pretty common
> pattern so why not wrap it under a common call.

It just seems extremely silly to create a (out-of-line even) function
for a store and a call.

> > Why not kill the lot?
>
> We have over 400 users, would it be much better if we open code all of
> them? It doesn't sound like a huge win to me.

Dunno, changing them around isn't much work, we've got coccinelle for
that.

2016-03-22 13:08:33

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 01:33:14PM +0100, Michal Hocko wrote:
> > On Tue 22-03-16 13:23:45, Peter Zijlstra wrote:
> > > On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
> > >
> > > > extern signed long schedule_timeout_interruptible(signed long timeout);
> > > > extern signed long schedule_timeout_killable(signed long timeout);
> > > > extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > > > +extern signed long schedule_timeout_idle(signed long timeout);
> > >
> > > > +/*
> > > > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > > > + * to load average.
> > > > + */
> > > > +signed long __sched schedule_timeout_idle(signed long timeout)
> > > > +{
> > > > + __set_current_state(TASK_IDLE);
> > > > + return schedule_timeout(timeout);
> > > > +}
> > > > +EXPORT_SYMBOL(schedule_timeout_idle);
> > >
> > > Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> > > pretty pointless.
> >
> > It seems it is just too easy to miss the __set_current_state (I am
> > talking from my own experience).
>
> Well, that's what you get; if you call schedule() and forget to set a
> blocking state you also don't block, where the problem?

The error prone nature of schedule_timeout usage was the reason to
introduce them in the first place IIRC which makes me think this is
something that is not so uncommon.

[...]

> > > Why not kill the lot?
> >
> > We have over 400 users, would it be much better if we open code all of
> > them? It doesn't sound like a huge win to me.
>
> Dunno, changing them around isn't much work, we've got coccinelle for
> that.

If that sounds like a more appropriate plan I won't object. I can simply
change my patch to do __set_current_state and schedule_timeout.

--
Michal Hocko
SUSE Labs

2016-03-22 13:23:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> If that sounds like a more appropriate plan I won't object. I can simply
> change my patch to do __set_current_state and schedule_timeout.

I dunno, I just think these wrappers are silly.

2016-03-22 17:57:59

by Johannes Weiner

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, Mar 22, 2016 at 02:22:49PM +0100, Peter Zijlstra wrote:
> On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> > On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> > If that sounds like a more appropriate plan I won't object. I can simply
> > change my patch to do __set_current_state and schedule_timeout.
>
> I dunno, I just think these wrappers are silly.

Adding out-of-line, exported wrappers for every single task state is
kind of silly. But it's still a common operation to wait in a certain
state, so having a single function for that makes sense. Kind of like
spin_lock_irqsave and friends.

Maybe this would be better?:

static inline long schedule_timeout_state(long timeout, long state)
{
__set_current_state(state);
return schedule_timeout(timeout);
}

2016-03-22 21:24:07

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, Mar 22, 2016 at 01:56:26PM -0400, Johannes Weiner wrote:
> On Tue, Mar 22, 2016 at 02:22:49PM +0100, Peter Zijlstra wrote:
> > On Tue, Mar 22, 2016 at 02:08:23PM +0100, Michal Hocko wrote:
> > > On Tue 22-03-16 13:51:13, Peter Zijlstra wrote:
> > > If that sounds like a more appropriate plan I won't object. I can simply
> > > change my patch to do __set_current_state and schedule_timeout.
> >
> > I dunno, I just think these wrappers are silly.
>
> Adding out-of-line, exported wrappers for every single task state is
> kind of silly. But it's still a common operation to wait in a certain
> state, so having a single function for that makes sense. Kind of like
> spin_lock_irqsave and friends.
>
> Maybe this would be better?:
>
> static inline long schedule_timeout_state(long timeout, long state)
> {
> __set_current_state(state);
> return schedule_timeout(timeout);
> }

Probably. However, with such semantics the schedule*() name is wrong
too, you cannot use these functions to build actual wait loops etc.

So maybe:

static inline long sleep_in_state(long timeout, long state)
{
__set_current_state(state);
return schedule_timeout(timeout);
}

might be an even better name; but at that point we look very like the
msleep*() class of function, so maybe we should do:

long sleep_in_state(long state, long timeout)
{
while (timeout && !signal_pending_state(state, current)) {
__set_current_state(state);
timeout = schedule_timeout(timeout);
}
return timeout;
}

Hmm ?

2016-03-22 22:08:26

by David Rientjes

[permalink] [raw]
Subject: Re: [PATCH 0/9] oom reaper v6

On Tue, 22 Mar 2016, Michal Hocko wrote:

> Hi,
> I am reposting the whole patchset on top of the current Linus tree which should
> already contain big pile of Andrew's mm patches. This should serve an easier
> reviewability and I also hope that this core part of the work can go to 4.6.
>
> The previous version was posted here [1] Hugh and David have suggested to
> drop [2] because the munlock path currently depends on the page lock and
> it is better if the initial version was conservative and prevent from
> any potential lockups even though it is not clear whether they are real
> - nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
> will have a look and try to make the munlock path not depend on the page
> lock as a follow up work.
>
> Apart from that the feedback revealed one bug for a very unusual
> configuration (sysctl_oom_kill_allocating_task) and that has been fixed
> by patch 8 and one potential mis interaction with the pm freezer fixed by
> patch 7.
>
> I think the current code base is already very useful for many situations.
> The rest of the feedback was mostly about potential enhancements of the
> current code which I would really prefer to build on top of the current
> series. I plan to finish my mmap_sem killable for write in the upcoming
> release cycle and hopefully have it merged in the next merge window.
> I believe more extensions will follow.
>
> This code has been sitting in the mmotm (thus linux-next) for a while.
> Are there any fundamental objections to have this part merged in this
> merge window?
>

Tetsuo, have you been able to run your previous test cases on top of this
version and do you have any concerns about it or possible extensions that
could be made?

2016-03-22 22:45:40

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm, oom: introduce oom reaper

On Tue, 22 Mar 2016 12:00:19 +0100 Michal Hocko <[email protected]> wrote:

> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.

What happened to oom-reaper-handle-mlocked-pages.patch? I have it in
-mm but I don't see it in this v6.


From: Michal Hocko <[email protected]>
Subject: oom reaper: handle mlocked pages

__oom_reap_vmas current skips over all mlocked vmas because they need a
special treatment before they are unmapped. This is primarily done for
simplicity. There is no reason to skip over them and reduce the amount of
reclaimed memory. This is safe from the semantic point of view because
try_to_unmap_one during rmap walk would keep tell the reclaim to cull the
page back and mlock it again.

munlock_vma_pages_all is also safe to be called from the oom reaper
context because it doesn't sit on any locks but mmap_sem (for read).

Signed-off-by: Michal Hocko <[email protected]>
Cc: Andrea Argangeli <[email protected]>
Acked-by: David Rientjes <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Oleg Nesterov <[email protected]>
Cc: Rik van Riel <[email protected]>
Cc: Tetsuo Handa <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

mm/oom_kill.c | 12 ++++--------
1 file changed, 4 insertions(+), 8 deletions(-)

diff -puN mm/oom_kill.c~oom-reaper-handle-mlocked-pages mm/oom_kill.c
--- a/mm/oom_kill.c~oom-reaper-handle-mlocked-pages
+++ a/mm/oom_kill.c
@@ -442,13 +442,6 @@ static bool __oom_reap_vmas(struct mm_st
continue;

/*
- * mlocked VMAs require explicit munlocking before unmap.
- * Let's keep it simple here and skip such VMAs.
- */
- if (vma->vm_flags & VM_LOCKED)
- continue;
-
- /*
* Only anonymous pages have a good chance to be dropped
* without additional steps which we cannot afford as we
* are OOM already.
@@ -458,9 +451,12 @@ static bool __oom_reap_vmas(struct mm_st
* we do not want to block exit_mmap by keeping mm ref
* count elevated without a good reason.
*/
- if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+ if (vma->vm_flags & VM_LOCKED)
+ munlock_vma_pages_all(vma);
unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
&details);
+ }
}
tlb_finish_mmu(&tlb, 0, -1);
up_read(&mm->mmap_sem);
_

2016-03-22 22:58:38

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH 2/9] mm, oom: introduce oom reaper

On Tue, 22 Mar 2016, Andrew Morton wrote:
> On Tue, 22 Mar 2016 12:00:19 +0100 Michal Hocko <[email protected]> wrote:
>
> > This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> > independently brought up by Oleg Nesterov.
>
> What happened to oom-reaper-handle-mlocked-pages.patch? I have it in
> -mm but I don't see it in this v6.

Michal is dropping it for now, explained in his 0/9:

The previous version was posted here [1] Hugh and David have suggested to
drop [2] because the munlock path currently depends on the page lock and
it is better if the initial version was conservative and prevent from
any potential lockups even though it is not clear whether they are real
- nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
will have a look and try to make the munlock path not depend on the page
lock as a follow up work.

[1] http://lkml.kernel.org/r/[email protected]
[2] http://lkml.kernel.org/r/[email protected]

>
> From: Michal Hocko <[email protected]>
> Subject: oom reaper: handle mlocked pages
>
> __oom_reap_vmas current skips over all mlocked vmas because they need a
> special treatment before they are unmapped. This is primarily done for
> simplicity. There is no reason to skip over them and reduce the amount of
> reclaimed memory. This is safe from the semantic point of view because
> try_to_unmap_one during rmap walk would keep tell the reclaim to cull the
> page back and mlock it again.
>
> munlock_vma_pages_all is also safe to be called from the oom reaper
> context because it doesn't sit on any locks but mmap_sem (for read).
>
> Signed-off-by: Michal Hocko <[email protected]>
> Cc: Andrea Argangeli <[email protected]>
> Acked-by: David Rientjes <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Oleg Nesterov <[email protected]>
> Cc: Rik van Riel <[email protected]>
> Cc: Tetsuo Handa <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
> ---
>
> mm/oom_kill.c | 12 ++++--------
> 1 file changed, 4 insertions(+), 8 deletions(-)
>
> diff -puN mm/oom_kill.c~oom-reaper-handle-mlocked-pages mm/oom_kill.c
> --- a/mm/oom_kill.c~oom-reaper-handle-mlocked-pages
> +++ a/mm/oom_kill.c
> @@ -442,13 +442,6 @@ static bool __oom_reap_vmas(struct mm_st
> continue;
>
> /*
> - * mlocked VMAs require explicit munlocking before unmap.
> - * Let's keep it simple here and skip such VMAs.
> - */
> - if (vma->vm_flags & VM_LOCKED)
> - continue;
> -
> - /*
> * Only anonymous pages have a good chance to be dropped
> * without additional steps which we cannot afford as we
> * are OOM already.
> @@ -458,9 +451,12 @@ static bool __oom_reap_vmas(struct mm_st
> * we do not want to block exit_mmap by keeping mm ref
> * count elevated without a good reason.
> */
> - if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
> + if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
> + if (vma->vm_flags & VM_LOCKED)
> + munlock_vma_pages_all(vma);
> unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
> &details);
> + }
> }
> tlb_finish_mmu(&tlb, 0, -1);
> up_read(&mm->mmap_sem);
> _

2016-03-22 23:02:05

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue, 22 Mar 2016 13:23:45 +0100 Peter Zijlstra <[email protected]> wrote:

> On Tue, Mar 22, 2016 at 12:00:18PM +0100, Michal Hocko wrote:
>
> > extern signed long schedule_timeout_interruptible(signed long timeout);
> > extern signed long schedule_timeout_killable(signed long timeout);
> > extern signed long schedule_timeout_uninterruptible(signed long timeout);
> > +extern signed long schedule_timeout_idle(signed long timeout);
>
> > +/*
> > + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> > + * to load average.
> > + */
> > +signed long __sched schedule_timeout_idle(signed long timeout)
> > +{
> > + __set_current_state(TASK_IDLE);
> > + return schedule_timeout(timeout);
> > +}
> > +EXPORT_SYMBOL(schedule_timeout_idle);
>
> Yes we have 3 such other wrappers, but I've gotta ask: why? They seem
> pretty pointless.

I like the wrappers. At least, more than having to read the open-coded
version. The latter is just more stuff to interpret and to check
whereas I can look at "schedule_timeout_idle" and think "yup, I know
what that does".

But whatever. I'll probably be sending this series up for 4.6 and we can
worry about the schedule_timeout_foo() stuff later.



2016-03-23 10:43:54

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 1/9] sched: add schedule_timeout_idle()

On Tue 22-03-16 22:23:52, Peter Zijlstra wrote:
[...]
> Probably. However, with such semantics the schedule*() name is wrong
> too, you cannot use these functions to build actual wait loops etc.
>
> So maybe:
>
> static inline long sleep_in_state(long timeout, long state)
> {
> __set_current_state(state);
> return schedule_timeout(timeout);
> }
>
> might be an even better name; but at that point we look very like the
> msleep*() class of function, so maybe we should do:
>
> long sleep_in_state(long state, long timeout)
> {
> while (timeout && !signal_pending_state(state, current)) {
> __set_current_state(state);
> timeout = schedule_timeout(timeout);
> }
> return timeout;
> }
>
> Hmm ?

I am not sure how many callers do care about premature wake-ups (e.g
I could find a use for it) but this indeed has a better and cleaner
semantic.

--
Michal Hocko
SUSE Labs

2016-03-23 11:26:40

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [PATCH 0/9] oom reaper v6

David Rientjes wrote:
> On Tue, 22 Mar 2016, Michal Hocko wrote:
>
> > Hi,
> > I am reposting the whole patchset on top of the current Linus tree which should
> > already contain big pile of Andrew's mm patches. This should serve an easier
> > reviewability and I also hope that this core part of the work can go to 4.6.
> >
> > The previous version was posted here [1] Hugh and David have suggested to
> > drop [2] because the munlock path currently depends on the page lock and
> > it is better if the initial version was conservative and prevent from
> > any potential lockups even though it is not clear whether they are real
> > - nobody has seen oom_reaper stuck on the page lock AFAICK. Me or Hugh
> > will have a look and try to make the munlock path not depend on the page
> > lock as a follow up work.
> >
> > Apart from that the feedback revealed one bug for a very unusual
> > configuration (sysctl_oom_kill_allocating_task) and that has been fixed
> > by patch 8 and one potential mis interaction with the pm freezer fixed by
> > patch 7.
> >
> > I think the current code base is already very useful for many situations.
> > The rest of the feedback was mostly about potential enhancements of the
> > current code which I would really prefer to build on top of the current
> > series. I plan to finish my mmap_sem killable for write in the upcoming
> > release cycle and hopefully have it merged in the next merge window.
> > I believe more extensions will follow.
> >
> > This code has been sitting in the mmotm (thus linux-next) for a while.
> > Are there any fundamental objections to have this part merged in this
> > merge window?
> >
>
> Tetsuo, have you been able to run your previous test cases on top of this
> version and do you have any concerns about it or possible extensions that
> could be made?
>

I think [PATCH 3/9] [PATCH 4/9] [PATCH 8/9] will be mostly reverted.
My concerns and possible extensions are explained in

Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
http://lkml.kernel.org/r/[email protected]

. Regarding "[PATCH 4/9] mm, oom_reaper: report success/failure",
debug_show_all_locks() may not be safe

commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"

and showing traces might be more useful.
(A discussion for making printk() completely async is in progress.)

But we don't have time to update this series before merge window for 4.6 closes.
We want to send current patchset as is for now, don't we? So, please go ahead.



My other concerns about OOM handling:

Change TIF_MEMDIE strategy from per a thread to per a signal_struct.

[PATCH] mm,oom: Set TIF_MEMDIE on all OOM-killed threads.
http://lkml.kernel.org/r/1458529634-5951-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp

Found a bug in too_many_isolated() assumption.

How to handle infinite too_many_isolated() loop (for OOM detection rework v4) ?
http://lkml.kernel.org/r/[email protected]

Waiting for a patch to be merged.

[PATCH] mm,writeback: Don't use memory reserves for wb_start_writeback
http://lkml.kernel.org/r/[email protected]

And, kmallocwd, __GFP_KILLABLE, and timeout (or something finite one) for TIF_MEMDIE.

2016-03-23 12:07:27

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH 0/9] oom reaper v6

On Wed 23-03-16 20:11:35, Tetsuo Handa wrote:
> David Rientjes wrote:
[...]
> > Tetsuo, have you been able to run your previous test cases on top of this
> > version and do you have any concerns about it or possible extensions that
> > could be made?
> >
>
> I think [PATCH 3/9] [PATCH 4/9] [PATCH 8/9] will be mostly reverted.
> My concerns and possible extensions are explained in
>
> Re: [PATCH 6/5] oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task
> http://lkml.kernel.org/r/[email protected]

I believe issues you have raised there are a matter for further
discussion as they are potential improvements of the existing
functionality rather than fixing a regression of the current code.

> . Regarding "[PATCH 4/9] mm, oom_reaper: report success/failure",
> debug_show_all_locks() may not be safe
>
> commit 856848737bd944c1 "lockdep: fix debug_show_all_locks()"
> commit 82a1fcb90287052a "softlockup: automatically detect hung TASK_UNINTERRUPTIBLE tasks"

Let me ask again. What exactly is unsafe about calling
debug_show_all_locks here? It is true that 856848737bd944c1 has
changed debug_show_all_locks to ignore running tasks which limits
this functionality to some degree but I still think this might be
useful. Proposed alternatives were way too verbose and complex on its
own. This is something to be further discussed as well, though.

> and showing traces might be more useful.
> (A discussion for making printk() completely async is in progress.)
>
> But we don't have time to update this series before merge window for 4.6 closes.
> We want to send current patchset as is for now, don't we? So, please go ahead.

I am happy that we are on the same patch here.

> My other concerns about OOM handling:

Let's stick to oom reaper here, please.

Thanks!
--
Michal Hocko
SUSE Labs