From: Michal Hocko <[email protected]>
This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
independently brought up by Oleg Nesterov.
The OOM killer currently allows killing only a single task, in the hope
that the task will terminate in a reasonable time and free up its
memory. Such a task (the oom victim) gets access to memory reserves
via mark_oom_victim to allow forward progress should there be a need
for additional memory during the exit path.
It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
construct workloads which break the core assumption mentioned above, and
the OOM victim might take an unbounded amount of time to exit because it
might be blocked in the uninterruptible state waiting on an event
(e.g. a lock) held by another task looping in the page allocator.
This patch reduces the probability of such a lockup by introducing a
specialized kernel thread (oom_reaper) which tries to reclaim additional
memory by preemptively reaping the anonymous or swapped out memory
owned by the oom victim, under the assumption that such memory won't
be needed when its owner is killed and kicked out of userspace anyway.
There is one notable exception, though: if the OOM victim was in the
process of coredumping, the result would be incomplete. This is
considered a reasonable constraint because the overall system health is
more important than the debuggability of a particular application.
A kernel thread has been chosen because we need a reliable way of
invocation: a workqueue context is not appropriate because all the
workers might be busy (e.g. allocating memory). Kswapd, which sounds
like another good fit, is not appropriate either because it might get
blocked on locks during reclaim.
oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably wrt.
basically any lock blocking forward progress as described above. In
order to prevent blocking on the lock without any forward progress we
use only a trylock and retry 10 times with a short sleep in between.
Users of mmap_sem which need it for write should be carefully reviewed
to use _killable waiting as much as possible and to reduce allocation
requests done with the lock held to the absolute minimum, to reduce the
risk even further.
The API between the oom killer and the oom reaper is quite trivial.
wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
NULL->mm transition and oom_reaper clears it atomically once it is done
with the work. This means that only a single mm_struct can be reaped at
a time. As the operation is potentially disruptive we try to limit it to
the necessary minimum and the reaper blocks any updates while it operates
on an mm. The mm_struct is pinned by mm_count to allow a parallel
exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
Changes since v2
- fix mm_count reference leak reported by Tetsuo
- make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
- use wait_event_freezable rather than open coded wait loop - suggested
by Tetsuo
Changes since v1
- fix the screwed up detail->check_swap_entries - Johannes
- do not use kthread_should_stop because that would need a cleanup
and we do not have anybody to stop us - Tetsuo
- move wake_oom_reaper to oom_kill_process because we have to wait
for all tasks sharing the same mm to get killed - Tetsuo
- do not reap mm structs which are shared with unkillable tasks - Tetsuo
Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
Hi,
this is version 3 of the patch. Thanks to Tetsuo for his exhaustive
review. There haven't been any fundamental objections to the approach
so I have dropped the RFC status. Any further feedback is welcome of
course.
include/linux/mm.h | 2 +
mm/internal.h | 5 ++
mm/memory.c | 12 ++---
mm/oom_kill.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++--
4 files changed, 149 insertions(+), 9 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 25cdec395f2c..d1ce03569942 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1061,6 +1061,8 @@ struct zap_details {
struct address_space *check_mapping; /* Check page->mapping if set */
pgoff_t first_index; /* Lowest page->index to unmap */
pgoff_t last_index; /* Highest page->index to unmap */
+ bool ignore_dirty; /* Ignore dirty pages */
+ bool check_swap_entries; /* Check also swap entries */
};
struct page *vm_normal_page(struct vm_area_struct *vma, unsigned long addr,
diff --git a/mm/internal.h b/mm/internal.h
index 4ae7b7c7462b..9006ce1960ff 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -41,6 +41,11 @@ extern int do_swap_page(struct mm_struct *mm, struct vm_area_struct *vma,
void free_pgtables(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
unsigned long floor, unsigned long ceiling);
+void unmap_page_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma,
+ unsigned long addr, unsigned long end,
+ struct zap_details *details);
+
static inline void set_page_count(struct page *page, int v)
{
atomic_set(&page->_count, v);
diff --git a/mm/memory.c b/mm/memory.c
index f5b8e8c9f4c3..ccfd48c7286e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1104,6 +1104,9 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
if (!PageAnon(page)) {
if (pte_dirty(ptent)) {
+ /* oom_reaper cannot tear down dirty pages */
+ if (unlikely(details && details->ignore_dirty))
+ continue;
force_flush = 1;
set_page_dirty(page);
}
@@ -1122,8 +1125,8 @@ static unsigned long zap_pte_range(struct mmu_gather *tlb,
}
continue;
}
- /* If details->check_mapping, we leave swap entries. */
- if (unlikely(details))
+ /* only check swap_entries if explicitly asked for in details */
+ if (unlikely(details && !details->check_swap_entries))
continue;
entry = pte_to_swp_entry(ptent);
@@ -1228,7 +1231,7 @@ static inline unsigned long zap_pud_range(struct mmu_gather *tlb,
return addr;
}
-static void unmap_page_range(struct mmu_gather *tlb,
+void unmap_page_range(struct mmu_gather *tlb,
struct vm_area_struct *vma,
unsigned long addr, unsigned long end,
struct zap_details *details)
@@ -1236,9 +1239,6 @@ static void unmap_page_range(struct mmu_gather *tlb,
pgd_t *pgd;
unsigned long next;
- if (details && !details->check_mapping)
- details = NULL;
-
BUG_ON(addr >= end);
tlb_start_vma(tlb, vma);
pgd = pgd_offset(vma->vm_mm, addr);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 5314b206caa5..48025a21f8c4 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,6 +35,11 @@
#include <linux/freezer.h>
#include <linux/ftrace.h>
#include <linux/ratelimit.h>
+#include <linux/kthread.h>
+#include <linux/module.h>
+
+#include <asm/tlb.h>
+#include "internal.h"
#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
@@ -408,6 +413,105 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victims_wait);
bool oom_killer_disabled __read_mostly;
+/*
+ * OOM Reaper kernel thread which tries to reap the memory used by the OOM
+ * victim (if that is possible) to help the OOM killer to move on.
+ */
+static struct task_struct *oom_reaper_th;
+static struct mm_struct *mm_to_reap;
+static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
+
+static bool __oom_reap_vmas(struct mm_struct *mm)
+{
+ struct mmu_gather tlb;
+ struct vm_area_struct *vma;
+ struct zap_details details = {.check_swap_entries = true,
+ .ignore_dirty = true};
+ bool ret = true;
+
+ /* We might have raced with exit path */
+ if (!atomic_inc_not_zero(&mm->mm_users))
+ return true;
+
+ if (!down_read_trylock(&mm->mmap_sem)) {
+ ret = false;
+ goto out;
+ }
+
+ tlb_gather_mmu(&tlb, mm, 0, -1);
+ for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+ if (is_vm_hugetlb_page(vma))
+ continue;
+
+ /*
+ * Only anonymous pages have a good chance to be dropped
+ * without additional steps which we cannot afford as we
+ * are OOM already.
+ *
+ * We do not even care about fs backed pages because all
+ * which are reclaimable have already been reclaimed and
+ * we do not want to block exit_mmap by keeping mm ref
+ * count elevated without a good reason.
+ */
+ if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED))
+ unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
+ &details);
+ }
+ tlb_finish_mmu(&tlb, 0, -1);
+ up_read(&mm->mmap_sem);
+out:
+ mmput(mm);
+ return ret;
+}
+
+static void oom_reap_vmas(struct mm_struct *mm)
+{
+ int attempts = 0;
+
+ while (attempts++ < 10 && !__oom_reap_vmas(mm))
+ schedule_timeout(HZ/10);
+
+ /* Drop a reference taken by wake_oom_reaper */
+ mmdrop(mm);
+}
+
+static int oom_reaper(void *unused)
+{
+ while (true) {
+ struct mm_struct *mm;
+ wait_event_freezable(oom_reaper_wait, (mm = READ_ONCE(mm_to_reap)));
+ oom_reap_vmas(mm);
+ WRITE_ONCE(mm_to_reap, NULL);
+ }
+
+ return 0;
+}
+
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+ struct mm_struct *old_mm;
+
+ if (!oom_reaper_th)
+ return;
+
+ /*
+ * Pin the given mm. Use mm_count instead of mm_users because
+ * we do not want to delay the address space tear down.
+ */
+ atomic_inc(&mm->mm_count);
+
+ /*
+ * Make sure that only a single mm is ever queued for the reaper
+ * because multiple are not necessary and the operation might be
+ * disruptive so better reduce it to the bare minimum.
+ */
+ old_mm = cmpxchg(&mm_to_reap, NULL, mm);
+ if (!old_mm)
+ wake_up(&oom_reaper_wait);
+ else
+ mmdrop(mm);
+}
+
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
@@ -517,6 +621,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
+ bool can_oom_reap = true;
/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -607,15 +712,23 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
continue;
if (same_thread_group(p, victim))
continue;
- if (unlikely(p->flags & PF_KTHREAD))
- continue;
- if (p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN)
+ if (unlikely(p->flags & PF_KTHREAD) ||
+ p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
+ /*
+ * We cannot use oom_reaper for the mm shared by this process
+ * because it wouldn't get killed and so the memory might be
+ * still used.
+ */
+ can_oom_reap = false;
continue;
+ }
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
+ if (can_oom_reap)
+ wake_oom_reaper(mm);
mmdrop(mm);
put_task_struct(victim);
}
@@ -767,3 +880,23 @@ void pagefault_out_of_memory(void)
mutex_unlock(&oom_lock);
}
+
+static int __init oom_init(void)
+{
+ oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+ if (IS_ERR(oom_reaper_th)) {
+ pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+ PTR_ERR(oom_reaper_th));
+ oom_reaper_th = NULL;
+ } else {
+ struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+ /*
+ * Make sure our oom reaper thread will get scheduled when
+ * ASAP and that it won't get preempted by malicious userspace.
+ */
+ sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+ }
+ return 0;
+}
+module_init(oom_init)
--
2.6.2
On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>
> The OOM killer currently allows killing only a single task, in the hope
> that the task will terminate in a reasonable time and free up its
> memory. Such a task (the oom victim) gets access to memory reserves
> via mark_oom_victim to allow forward progress should there be a need
> for additional memory during the exit path.
>
> ...
>
> +static void oom_reap_vmas(struct mm_struct *mm)
> +{
> + int attempts = 0;
> +
> + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> + schedule_timeout(HZ/10);
schedule_timeout() in state TASK_RUNNING doesn't do anything. Use
msleep() or msleep_interruptible(). I can't decide which is more
appropriate - it only affects the load average display.
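For reference, a minimal sketch of the difference (this is roughly what
msleep() does internally via schedule_timeout_uninterruptible();
illustration only, not part of the patch):

	/* state is still TASK_RUNNING: returns immediately, no sleep */
	schedule_timeout(HZ/10);

	/* set a sleeping state first, then the timeout actually sleeps */
	__set_current_state(TASK_UNINTERRUPTIBLE);
	schedule_timeout(HZ/10);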
Which prompts the obvious question: as the no-operativeness of this
call wasn't noticed in testing, why do we have it there...
I guess it means that the __oom_reap_vmas() success rate is nice and
high ;)
On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <[email protected]> wrote:
[...]
> > +static void oom_reap_vmas(struct mm_struct *mm)
> > +{
> > + int attempts = 0;
> > +
> > + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > + schedule_timeout(HZ/10);
>
> schedule_timeout() in state TASK_RUNNING doesn't do anything. Use
> msleep() or msleep_interruptible(). I can't decide which is more
> appropriate - it only affects the load average display.
Ups. You are right. I will go with msleep_interruptible(100).
> Which prompts the obvious question: as the no-operativeness of this
> call wasn't noticed in testing, why do we have it there...
Well, the idea was that an interfering mmap_sem operation which holds
it for write might block us only for a short time period - e.g. when it
does not depend on an allocation itself, or when access to the memory
reserves helps its allocation make progress. If the holder of the
semaphore is stuck then the retry is pointless. On the other hand the
retry shouldn't be harmful. All in all this is just a heuristic and we
do not depend on it. I guess we can drop it and nobody would actually
notice. Let me know if you prefer that and I will respin the patch.
> I guess it means that the __oom_reap_vmas() success rate is nice and
> high ;)
I had debugging trace_printks around this and there were no retries
during my testing so I was probably lucky to not trigger the mmap_sem
contention.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 48025a21f8c4..f53f87cfd899 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
int attempts = 0;
while (attempts++ < 10 && !__oom_reap_vmas(mm))
- schedule_timeout(HZ/10);
+ msleep_interruptible(100);
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
--
Michal Hocko
SUSE Labs
On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <[email protected]> wrote:
> Ups. You are right. I will go with msleep_interruptible(100).
I don't think that's right.
If a signal happens, that loop is now (again) just busy-looping. That
doesn't sound right, although with the maximum limit of 10 attempts,
maybe it's fine - the thing is technically "busylooping", but it will
definitely not busy-loop for very long.
So maybe that code is fine, but I think the signal case might at least
merit a comment?
Also, if you actually do want UNINTERRUPTIBLE (no reaction to signals
at all), but don't want to be seen as being "load" on the system, you
can use TASK_IDLE, which is a combination of TASK_UNINTERRUPTIBLE |
TASK_NOLOAD.
Because if you sleep interruptibly, you do generally need to handle
signals (although that limit count may make it ok in this case).
There's basically three levels:
- TASK_UNINTERRUPTIBLE: no signal handling at all
- TASK_KILLABLE: no normal signal handling, but ok to be killed
(needs to check fatal_signal_pending() and exit)
- TASK_INTERRUPTIBLE: will react to signals
(and then that TASK_IDLE thing that is semantically the same as
uninterruptible, but doesn't count against the load average).
The main use for TASK_KILLABLE is in places where expected semantics
do not allow a EINTR return, but we know that because the process is
about to be killed, we can ignore that, for the simple reason that
nobody will ever *see* the EINTR.
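As a minimal sketch, a killable variant of the retry loop would look
roughly like this (purely illustrative - the reaper itself runs in a
kthread and never has signals pending, as noted below):

	while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
		schedule_timeout_killable(HZ/10);
		if (fatal_signal_pending(current))
			break;	/* we are being killed, stop retrying */
	}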
Btw, I think you might want to re-run your test-case after this
change, since the whole "busy loop vs actually sleeping" might just
have changed the result..
Linus
On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds <[email protected]> wrote:
> On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <[email protected]> wrote:
> > Ups. You are right. I will go with msleep_interruptible(100).
>
> I don't think that's right.
>
> If a signal happens, that loop is now (again) just busy-looping.
It's called only by a kernel thread so no signal_pending(). This
relationship is a bit unobvious and fragile, but we do it in quite a
few places.
On Thu, 17 Dec 2015 14:02:24 +0100 Michal Hocko <[email protected]> wrote:
> > I guess it means that the __oom_reap_vmas() success rate is nice and
> > high ;)
>
> I had debugging trace_printks around this and there were no retries
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.
> ---
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 48025a21f8c4..f53f87cfd899 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -469,7 +469,7 @@ static void oom_reap_vmas(struct mm_struct *mm)
> int attempts = 0;
>
> while (attempts++ < 10 && !__oom_reap_vmas(mm))
> - schedule_timeout(HZ/10);
> + msleep_interruptible(100);
>
> /* Drop a reference taken by wake_oom_reaper */
> mmdrop(mm);
Timeliness matters here. Over on the other CPU, direct reclaim is
pounding along, on its way to declaring oom. Sometimes the oom_reaper
thread will end up scavenging memory on behalf of a caller who gave up
a long time ago. But we shouldn't attempt to "fix" that unless we can
demonstrate that it's a problem.
Also, re-reading your description:
: It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
: construct workloads which break the core assumption mentioned above, and
: the OOM victim might take an unbounded amount of time to exit because it
: might be blocked in the uninterruptible state waiting on an event
: (e.g. a lock) held by another task looping in the page allocator.
So the allocating task has done an oom-kill and is waiting for memory
to become available. The killed task is stuck on some lock, unable to
free memory.
But the problematic lock will sometimes be the killed task's mmap_sem,
so the reaper won't reap anything. This scenario requires that the
mmap_sem is held for writing, which sounds like it will be uncommon.
hm. sigh. I hate the oom-killer. Just buy some more memory already!
On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <[email protected]> wrote:
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper)
CONFIG_MMU=n:
slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
I did the below so I can get an mmotm out the door, but hopefully
there's a cleaner way.
--- a/mm/oom_kill.c~mm-oom-introduce-oom-reaper-fix-3
+++ a/mm/oom_kill.c
@@ -415,6 +415,7 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_victi
bool oom_killer_disabled __read_mostly;
+#ifdef CONFIG_MMU
/*
* OOM Reaper kernel thread which tries to reap the memory used by the OOM
* victim (if that is possible) to help the OOM killer to move on.
@@ -517,6 +518,27 @@ static void wake_oom_reaper(struct mm_st
mmdrop(mm);
}
+static int __init oom_init(void)
+{
+ oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+ if (IS_ERR(oom_reaper_th)) {
+ pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
+ PTR_ERR(oom_reaper_th));
+ oom_reaper_th = NULL;
+ } else {
+ struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
+
+ /*
+ * Make sure our oom reaper thread will get scheduled when
+ * ASAP and that it won't get preempted by malicious userspace.
+ */
+ sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
+ }
+ return 0;
+}
+module_init(oom_init)
+#endif
+
/**
* mark_oom_victim - mark the given task as OOM victim
* @tsk: task to mark
@@ -626,7 +648,9 @@ void oom_kill_process(struct oom_control
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
+#ifdef CONFIG_MMU
bool can_oom_reap = true;
+#endif
/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -719,6 +743,7 @@ void oom_kill_process(struct oom_control
continue;
if (is_global_init(p))
continue;
+#ifdef CONFIG_MMU
if (unlikely(p->flags & PF_KTHREAD) ||
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
/*
@@ -729,13 +754,16 @@ void oom_kill_process(struct oom_control
can_oom_reap = false;
continue;
}
-
+#endif
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
+#ifdef CONFIG_MMU
if (can_oom_reap)
wake_oom_reaper(mm);
+#endif
+
mmdrop(mm);
put_task_struct(victim);
}
@@ -887,23 +915,3 @@ void pagefault_out_of_memory(void)
mutex_unlock(&oom_lock);
}
-
-static int __init oom_init(void)
-{
- oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
- if (IS_ERR(oom_reaper_th)) {
- pr_err("Unable to start OOM reaper %ld. Continuing regardless\n",
- PTR_ERR(oom_reaper_th));
- oom_reaper_th = NULL;
- } else {
- struct sched_param param = { .sched_priority = MAX_RT_PRIO-1 };
-
- /*
- * Make sure our oom reaper thread will get scheduled when
- * ASAP and that it won't get preempted by malicious userspace.
- */
- sched_setscheduler(oom_reaper_th, SCHED_FIFO, &param);
- }
- return 0;
-}
-module_init(oom_init)
_
On Thu 17-12-15 16:15:21, Andrew Morton wrote:
> On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <[email protected]> wrote:
>
> > This patch reduces the probability of such a lockup by introducing a
> > specialized kernel thread (oom_reaper)
>
> CONFIG_MMU=n:
>
> slub.c:(.text+0x4184): undefined reference to `tlb_gather_mmu'
> slub.c:(.text+0x41bc): undefined reference to `unmap_page_range'
> slub.c:(.text+0x41d8): undefined reference to `tlb_finish_mmu'
>
> I did the below so I can get an mmotm out the door, but hopefully
> there's a cleaner way.
Sorry about that and thanks for your fixup! I am not very familiar with
the !MMU world and haven't heard about issues with OOM deadlocks there yet.
So I guess making this MMU-only makes some sense. I would just get rid of
the ifdefs in oom_kill_process and provide an empty wake_oom_reaper for
!CONFIG_MMU. The following on top of yours:
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..56ff1ff18c0e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -537,6 +537,10 @@ static int __init oom_init(void)
return 0;
}
module_init(oom_init)
+#else
+static void wake_oom_reaper(struct mm_struct *mm)
+{
+}
#endif
/**
@@ -648,9 +652,7 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
unsigned int victim_points = 0;
static DEFINE_RATELIMIT_STATE(oom_rs, DEFAULT_RATELIMIT_INTERVAL,
DEFAULT_RATELIMIT_BURST);
-#ifdef CONFIG_MMU
bool can_oom_reap = true;
-#endif
/*
* If the task is already exiting, don't alarm the sysadmin or kill
@@ -743,7 +745,6 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
continue;
if (is_global_init(p))
continue;
-#ifdef CONFIG_MMU
if (unlikely(p->flags & PF_KTHREAD) ||
p->signal->oom_score_adj == OOM_SCORE_ADJ_MIN) {
/*
@@ -754,15 +755,12 @@ void oom_kill_process(struct oom_control *oc, struct task_struct *p,
can_oom_reap = false;
continue;
}
-#endif
do_send_sig_info(SIGKILL, SEND_SIG_FORCED, p, true);
}
rcu_read_unlock();
-#ifdef CONFIG_MMU
if (can_oom_reap)
wake_oom_reaper(mm);
-#endif
mmdrop(mm);
put_task_struct(victim);
--
Michal Hocko
SUSE Labs
On Thu 17-12-15 12:00:04, Andrew Morton wrote:
> On Thu, 17 Dec 2015 11:55:11 -0800 Linus Torvalds <[email protected]> wrote:
>
> > On Thu, Dec 17, 2015 at 5:02 AM, Michal Hocko <[email protected]> wrote:
> > > Ups. You are right. I will go with msleep_interruptible(100).
> >
> > I don't think that's right.
> >
> > If a signal happens, that loop is now (again) just busy-looping.
>
> It's called only by a kernel thread so no signal_pending().
Yes that was the thinking.
> This relationship is a bit unobvious and fragile, but we do it in
> quite a few places.
I guess Linus is right and __set_task_state(current, TASK_IDLE) would be
better and easier to read.
---
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 4b0a5d8b92e1..eed99506b891 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -472,8 +472,10 @@ static void oom_reap_vmas(struct mm_struct *mm)
int attempts = 0;
/* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < 10 && !__oom_reap_vmas(mm))
- msleep_interruptible(100);
+ while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
+ __set_task_state(current, TASK_IDLE);
+ schedule_timeout(HZ/10);
+ }
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> On Wed 16-12-15 16:50:35, Andrew Morton wrote:
> > On Tue, 15 Dec 2015 19:36:15 +0100 Michal Hocko <[email protected]> wrote:
> [...]
> > > +static void oom_reap_vmas(struct mm_struct *mm)
> > > +{
> > > + int attempts = 0;
> > > +
> > > + while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > > + schedule_timeout(HZ/10);
> >
> > schedule_timeout() in state TASK_RUNNING doesn't do anything. Use
> > msleep() or msleep_interruptible(). I can't decide which is more
> > appropriate - it only affects the load average display.
>
> Ups. You are right. I will go with msleep_interruptible(100).
>
I didn't know that. My testing was almost without oom_reap_vmas().
> > I guess it means that the __oom_reap_vmas() success rate is nice and
> > high ;)
>
> I had debugging trace_printks around this and there were no retries
> during my testing so I was probably lucky to not trigger the mmap_sem
> contention.
Yes, you are lucky that you did not hit the mmap_sem contention.
I retested with
static void oom_reap_vmas(struct mm_struct *mm)
{
int attempts = 0;
while (attempts++ < 10 && !__oom_reap_vmas(mm))
- schedule_timeout(HZ/10);
+ msleep_interruptible(100);
+ printk(KERN_WARNING "oom_reaper: attempts=%u\n", attempts);
/* Drop a reference taken by wake_oom_reaper */
mmdrop(mm);
}
and I can observe attempts becoming 11 (i.e. oom_reap_vmas() gives up
waiting) if I run a memory-stressing program with many contending
mmap_sem readers and writers, shown below.
----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>
#include <sys/mman.h>
static cpu_set_t set = { { 1 } }; /* Allow only CPU 0. */
static char filename[32] = { };
/* down_read(&mm->mmap_sem) requester. */
static int reader(void *unused)
{
const int fd = open(filename, O_RDONLY);
char buffer[128];
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (pread(fd, buffer, sizeof(buffer), 0) > 0);
while (1)
pause();
return 0;
}
/* down_write(&mm->mmap_sem) requester. */
static int writer(void *unused)
{
const int fd = open("/proc/self/exe", O_RDONLY);
sched_setaffinity(0, sizeof(set), &set);
sleep(2);
while (1) {
void *ptr = mmap(NULL, 4096, PROT_READ, MAP_PRIVATE, fd, 0);
munmap(ptr, 4096);
}
return 0;
}
static void my_clone(int (*func) (void *))
{
char *stack = malloc(4096);
if (stack)
clone(func, stack + 4096,
CLONE_THREAD | CLONE_SIGHAND | CLONE_VM, NULL);
}
/* Memory consumer for invoking the OOM killer. */
static void memory_eater(void) {
char *buf = NULL;
unsigned long i;
unsigned long size = 0;
sleep(4);
for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
char *cp = realloc(buf, size);
if (!cp) {
size >>= 1;
break;
}
buf = cp;
}
fprintf(stderr, "Start eating memory\n");
for (i = 0; i < size; i += 4096)
buf[i] = '\0'; /* Will cause OOM due to overcommit */
}
int main(int argc, char *argv[])
{
int i;
const pid_t pid = fork();
if (pid == 0) {
for (i = 0; i < 9; i++)
my_clone(writer);
writer(NULL);
_exit(0);
} else if (pid > 0) {
snprintf(filename, sizeof(filename), "/proc/%u/stat", pid);
for (i = 0; i < 1000; i++)
my_clone(reader);
}
memory_eater();
return *(char *) NULL; /* Not reached. */
}
----------
Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
----------
[ 90.790847] Killed process 9560 (oom_reaper-test) total-vm:4312kB, anon-rss:124kB, file-rss:0kB, shmem-rss:0kB
[ 91.803154] oom_reaper: attempts=11
[ 100.701494] MemAlloc-Info: 509 stalling task, 0 dying task, 1 victim task.
[ 102.439082] Killed process 9559 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564600kB, file-rss:0kB, shmem-rss:0kB
[ 102.441937] Killed process 9561 (oom_reaper-test) total-vm:2170960kB, anon-rss:1564776kB, file-rss:0kB, shmem-rss:0kB
[ 102.731326] oom_reaper: attempts=1
[ 125.420727] Killed process 10573 (oom_reaper-test) total-vm:4340kB, anon-rss:80kB, file-rss:0kB, shmem-rss:0kB
[ 126.440392] oom_reaper: attempts=11
[ 135.354193] MemAlloc-Info: 450 stalling task, 0 dying task, 0 victim task.
[ 240.023256] MemAlloc-Info: 1016 stalling task, 0 dying task, 0 victim task.
[ 302.246975] Killed process 10572 (oom_reaper-test) total-vm:2170960kB, anon-rss:1562128kB, file-rss:0kB, shmem-rss:0kB
[ 302.263515] oom_reaper: attempts=1
[ 382.961343] Killed process 11667 (oom_reaper-test) total-vm:4312kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
[ 383.980541] oom_reaper: attempts=11
[ 392.592658] MemAlloc-Info: 758 stalling task, 10 dying task, 1 victim task.
[ 399.497478] Killed process 11666 (oom_reaper-test) total-vm:2170960kB, anon-rss:1556072kB, file-rss:0kB, shmem-rss:0kB
[ 399.499101] Killed process 11668 (oom_reaper-test) total-vm:2170960kB, anon-rss:1556260kB, file-rss:0kB, shmem-rss:0kB
[ 399.778283] oom_reaper: attempts=1
[ 438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
[ 439.318951] oom_reaper: attempts=11
[ 445.581171] MemAlloc-Info: 796 stalling task, 0 dying task, 0 victim task.
[ 618.955215] MemAlloc-Info: 979 stalling task, 0 dying task, 0 victim task.
----------
Yes, this is an insane program. But what is important is that we prepare
for cases when oom_reap_vmas() gave up waiting. A silent hang up is annoying.
Like Andrew said
( http://lkml.kernel.org/r/[email protected] ),
I want to add a watchdog for printk()ing.
( http://lkml.kernel.org/r/[email protected] ).
On Thu 17-12-15 13:13:56, Andrew Morton wrote:
[...]
> Also, re-reading your description:
>
> : It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> : construct workloads which break the core assumption mentioned above, and
> : the OOM victim might take an unbounded amount of time to exit because it
> : might be blocked in the uninterruptible state waiting on an event
> : (e.g. a lock) held by another task looping in the page allocator.
>
> So the allocating task has done an oom-kill and is waiting for memory
> to become available. The killed task is stuck on some lock, unable to
> free memory.
>
> But the problematic lock will sometimes be the killed tasks's mmap_sem,
> so the reaper won't reap anything. This scenario requires that the
> mmap_sem is held for writing, which sounds like it will be uncommon.
Yes, I have mentioned that in the changelog:
"
oom_reaper has to take mmap_sem on the target task for reading, so the
solution is not 100% reliable because the semaphore might be held or
blocked for write, but the probability is reduced considerably wrt.
basically any lock blocking forward progress as described above.
"
Another thing to do is to change down_write(mmap_sem) to
down_write_killable in most cases where we have a clear EINTR semantic.
This is on my todo list.
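The conversion would be roughly the following (down_write_killable does
not exist yet at this point, so this is just a sketch of the intended
semantic):

	/* before */
	down_write(&mm->mmap_sem);

	/* after: back off if the caller has been killed in the meantime */
	if (down_write_killable(&mm->mmap_sem))
		return -EINTR;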
> hm. sigh. I hate the oom-killer. Just buy some more memory already!
Tell me something about that...
--
Michal Hocko
SUSE Labs
On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko <[email protected]> wrote:
> /* Retry the down_read_trylock(mmap_sem) a few times */
> - while (attempts++ < 10 && !__oom_reap_vmas(mm))
> - msleep_interruptible(100);
> + while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> + __set_task_state(current, TASK_IDLE);
> + schedule_timeout(HZ/10);
> + }
If you won't, I shall ;)
From: Andrew Morton <[email protected]>
Subject: sched: add schedule_timeout_idle()
This will be needed in the patch "mm, oom: introduce oom reaper".
Cc: Michal Hocko <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
include/linux/sched.h | 1 +
kernel/time/timer.c | 11 +++++++++++
2 files changed, 12 insertions(+)
diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle kernel/time/timer.c
--- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
+++ a/kernel/time/timer.c
@@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
}
EXPORT_SYMBOL(schedule_timeout_uninterruptible);
+/*
+ * Like schedule_timeout_uninterruptible(), except this task will not contribute
+ * to load average.
+ */
+signed long __sched schedule_timeout_idle(signed long timeout)
+{
+ __set_current_state(TASK_IDLE);
+ return schedule_timeout(timeout);
+}
+EXPORT_SYMBOL(schedule_timeout_idle);
+
#ifdef CONFIG_HOTPLUG_CPU
static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
{
diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle include/linux/sched.h
--- a/include/linux/sched.h~sched-add-schedule_timeout_idle
+++ a/include/linux/sched.h
@@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
extern signed long schedule_timeout_interruptible(signed long timeout);
extern signed long schedule_timeout_killable(signed long timeout);
extern signed long schedule_timeout_uninterruptible(signed long timeout);
+extern signed long schedule_timeout_idle(signed long timeout);
asmlinkage void schedule(void);
extern void schedule_preempt_disabled(void);
_
Tetsuo Handa wrote:
> Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151218.txt.xz .
> ----------
> [ 438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
> [ 439.318951] oom_reaper: attempts=11
> [ 445.581171] MemAlloc-Info: 796 stalling task, 0 dying task, 0 victim task.
> [ 618.955215] MemAlloc-Info: 979 stalling task, 0 dying task, 0 victim task.
> ----------
>
> Yes, this is an insane program. But what is important is that we prepare
> for cases when oom_reap_vmas() gave up waiting. A silent hang up is annoying.
s/gave up waiting/did not help/
I noticed yet another problem with this program.
The OOM victim (a child of the memory hog process) received SIGKILL at
uptime = 438 and it terminated before uptime = 445 even though
oom_reap_vmas() gave up waiting at uptime = 439. However, the OOM killer
was not invoked again in order to kill the memory hog process before I
gave up waiting at uptime = 679. The OOM killer was needlessly kept
disabled for more than 234 seconds after the OOM victim terminated.
----------
[ 438.180596] oom_reaper-test invoked oom-killer: order=0, oom_score_adj=0, gfp_mask=0x26040c0(GFP_KERNEL|GFP_COMP|GFP_NOTRACK)
[ 438.183524] oom_reaper-test cpuset=/ mems_allowed=0
[ 438.185440] CPU: 0 PID: 13451 Comm: oom_reaper-test Not tainted 4.4.0-rc5-next-20151217+ #248
(...snipped...)
[ 438.301687] Out of memory: Kill process 12679 (oom_reaper-test) score 876 or sacrifice child
[ 438.304082] Killed process 12680 (oom_reaper-test) total-vm:4324kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
(...snipped...)
[ 568.185582] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 delay=7403
[ 568.187593] oom_reaper-test R running task 0 13451 8130 0x00000080
[ 568.189546] ffff88007a4cb918 ffff88007a5c4140 ffff88007a4c0000 ffff88007a4cc000
[ 568.191637] ffff88007a4cb950 ffff88007fc10240 0000000100021bb8 ffffffff81c11730
[ 568.193713] ffff88007a4cb930 ffffffff816f5b77 ffff88007fc10240 ffff88007a4cb9d0
[ 568.195798] Call Trace:
[ 568.196878] [<ffffffff816f5b77>] schedule+0x37/0x90
[ 568.198402] [<ffffffff816f9f37>] schedule_timeout+0x117/0x1c0
[ 568.200078] [<ffffffff810dfd00>] ? init_timer_key+0x40/0x40
[ 568.201725] [<ffffffff816fa034>] schedule_timeout_killable+0x24/0x30
[ 568.203512] [<ffffffff81142d49>] out_of_memory+0x1f9/0x5a0
[ 568.205144] [<ffffffff81142dfd>] ? out_of_memory+0x2ad/0x5a0
[ 568.206800] [<ffffffff811486f3>] __alloc_pages_nodemask+0xc43/0xc80
[ 568.208564] [<ffffffff8118f786>] alloc_pages_current+0x96/0x1b0
[ 568.210259] [<ffffffff81198177>] ? new_slab+0x357/0x470
[ 568.211811] [<ffffffff811981ee>] new_slab+0x3ce/0x470
[ 568.213329] [<ffffffff8119a41a>] ___slab_alloc+0x42a/0x5c0
[ 568.214917] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 568.216486] [<ffffffff811aa75d>] ? mem_cgroup_end_page_stat+0x2d/0xb0
[ 568.218240] [<ffffffff810ba2a9>] ? __lock_is_held+0x49/0x70
[ 568.219815] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 568.221388] [<ffffffff811bc45b>] __slab_alloc+0x4a/0x81
[ 568.222980] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 568.224565] [<ffffffff8119aab3>] __kmalloc+0x163/0x1b0
[ 568.226075] [<ffffffff811e5ff5>] seq_buf_alloc+0x35/0x40
[ 568.227710] [<ffffffff811e660b>] seq_read+0x31b/0x3c0
[ 568.229184] [<ffffffff811beaf2>] __vfs_read+0x32/0xf0
[ 568.230673] [<ffffffff81302339>] ? security_file_permission+0xa9/0xc0
[ 568.232408] [<ffffffff811bf49d>] ? rw_verify_area+0x4d/0xd0
[ 568.234068] [<ffffffff811bf59a>] vfs_read+0x7a/0x120
[ 568.235561] [<ffffffff811c0700>] SyS_pread64+0x90/0xb0
[ 568.237062] [<ffffffff816fb0f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 568.238766] 2 locks held by oom_reaper-test/13451:
[ 568.240188] #0: (&p->lock){+.+.+.}, at: [<ffffffff811e6337>] seq_read+0x47/0x3c0
[ 568.242303] #1: (oom_lock){+.+...}, at: [<ffffffff81148358>] __alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[ 658.711079] MemAlloc: oom_reaper-test(13451) seq=2 gfp=0x26040c0 order=0 delay=180777
[ 658.713110] oom_reaper-test R running task 0 13451 8130 0x00000080
[ 658.715073] ffff88007a4cb918 ffff88007a5c4140 ffff88007a4c0000 ffff88007a4cc000
[ 658.717166] ffff88007a4cb950 ffff88007fc10240 0000000100021bb8 ffffffff81c11730
[ 658.719248] ffff88007a4cb930 ffffffff816f5b77 ffff88007fc10240 ffff88007a4cb9d0
[ 658.721345] Call Trace:
[ 658.722426] [<ffffffff816f5b77>] schedule+0x37/0x90
[ 658.723950] [<ffffffff816f9f37>] schedule_timeout+0x117/0x1c0
[ 658.725636] [<ffffffff810dfd00>] ? init_timer_key+0x40/0x40
[ 658.727304] [<ffffffff816fa034>] schedule_timeout_killable+0x24/0x30
[ 658.729113] [<ffffffff81142d49>] out_of_memory+0x1f9/0x5a0
[ 658.730757] [<ffffffff81142dfd>] ? out_of_memory+0x2ad/0x5a0
[ 658.732444] [<ffffffff811486f3>] __alloc_pages_nodemask+0xc43/0xc80
[ 658.734314] [<ffffffff8118f786>] alloc_pages_current+0x96/0x1b0
[ 658.736041] [<ffffffff81198177>] ? new_slab+0x357/0x470
[ 658.737643] [<ffffffff811981ee>] new_slab+0x3ce/0x470
[ 658.739239] [<ffffffff8119a41a>] ___slab_alloc+0x42a/0x5c0
[ 658.740899] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 658.742617] [<ffffffff811aa75d>] ? mem_cgroup_end_page_stat+0x2d/0xb0
[ 658.744489] [<ffffffff810ba2a9>] ? __lock_is_held+0x49/0x70
[ 658.746157] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 658.747801] [<ffffffff811bc45b>] __slab_alloc+0x4a/0x81
[ 658.749392] [<ffffffff811e5ff5>] ? seq_buf_alloc+0x35/0x40
[ 658.751021] [<ffffffff8119aab3>] __kmalloc+0x163/0x1b0
[ 658.752562] [<ffffffff811e5ff5>] seq_buf_alloc+0x35/0x40
[ 658.754145] [<ffffffff811e660b>] seq_read+0x31b/0x3c0
[ 658.755678] [<ffffffff811beaf2>] __vfs_read+0x32/0xf0
[ 658.757194] [<ffffffff81302339>] ? security_file_permission+0xa9/0xc0
[ 658.758959] [<ffffffff811bf49d>] ? rw_verify_area+0x4d/0xd0
[ 658.760571] [<ffffffff811bf59a>] vfs_read+0x7a/0x120
[ 658.762746] [<ffffffff811c0700>] SyS_pread64+0x90/0xb0
[ 658.770489] [<ffffffff816fb0f2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 658.772641] 2 locks held by oom_reaper-test/13451:
[ 658.774321] #0: (&p->lock){+.+.+.}, at: [<ffffffff811e6337>] seq_read+0x47/0x3c0
[ 658.776456] #1: (oom_lock){+.+...}, at: [<ffffffff81148358>] __alloc_pages_nodemask+0x8a8/0xc80
(...snipped...)
[ 679.648918] sysrq: SysRq : Kill All Tasks
----------
Looking at the traces, the process which invoked the OOM killer kept
the oom_lock mutex held because it had been sleeping at
schedule_timeout_killable(1) in out_of_memory(), which was meant to wait
for only one jiffy but actually waited for more than 234 seconds.
if (p && p != (void *)-1UL) {
oom_kill_process(oc, p, points, totalpages, NULL,
"Out of memory");
/*
* Give the killed process a good chance to exit before trying
* to allocate memory again.
*/
schedule_timeout_killable(1);
}
return true;
During that period, nobody was able to call out_of_memory() because
everybody assumed that the process which invoked the OOM killer was
making progress for us.
/*
* Acquire the oom lock. If that fails, somebody else is
* making progress for us.
*/
if (!mutex_trylock(&oom_lock)) {
*did_some_progress = 1;
schedule_timeout_uninterruptible(1);
return NULL;
}
The side effect is not limited to not choosing the next OOM victim.
Tasks with SIGKILL pending but without TIF_MEMDIE (possibly tasks sharing
the OOM victim's mm) cannot use ALLOC_NO_WATERMARKS until they can arrive
at out_of_memory().
Assumptions like
/*
* If current has a pending SIGKILL or is exiting, then automatically
* select it. The goal is to allow it to allocate so that it may
* quickly exit and free its memory.
*
* But don't select if current has already released its mm and cleared
* TIF_MEMDIE flag at exit_mm(), otherwise an OOM livelock may occur.
*/
if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
return true;
}
in out_of_memory() and
/*
* Kill all user processes sharing victim->mm in other thread groups, if
* any. They don't get access to memory reserves, though, to avoid
* depletion of all memory. This prevents mm->mmap_sem livelock when an
* oom killed thread cannot exit because it requires the semaphore and
* its contended by another thread trying to allocate memory itself.
* That thread will now get access to memory reserves since it has a
* pending fatal signal.
*/
in oom_kill_process() cannot work. (Yes, we know the
fatal_signal_pending(current) check in out_of_memory() is wrong.
http://lkml.kernel.org/r/[email protected] )
I think we might want to make sure that the oom_lock mutex is released within
a reasonable period after the OOM killer kills a victim. Maybe change the code
not to depend on TIF_MEMDIE for using memory reserves. Maybe replace the whole
operation between mutex_trylock(&oom_lock) and mutex_unlock(&oom_lock) with a
request_oom_killer() (like request_module() does) and let a kernel thread do
the OOM kill operation (oom_reaper() can do it?), for that would make it easy
to wait for a short period after killing the victim, without worrying about
huge unexpected delays caused by low scheduling priority / limited available
CPUs.
On Fri 18-12-15 13:14:00, Andrew Morton wrote:
> On Fri, 18 Dec 2015 12:54:55 +0100 Michal Hocko <[email protected]> wrote:
>
> > /* Retry the down_read_trylock(mmap_sem) a few times */
> > - while (attempts++ < 10 && !__oom_reap_vmas(mm))
> > - msleep_interruptible(100);
> > + while (attempts++ < 10 && !__oom_reap_vmas(mm)) {
> > + __set_task_state(current, TASK_IDLE);
> > + schedule_timeout(HZ/10);
> > + }
>
> If you won't, I shall ;)
>
>
> From: Andrew Morton <[email protected]>
> Subject: sched: add schedule_timeout_idle()
>
> This will be needed in the patch "mm, oom: introduce oom reaper".
>
> Cc: Michal Hocko <[email protected]>
> Cc: Ingo Molnar <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Signed-off-by: Andrew Morton <[email protected]>
Thanks! This makes more sense.
Acked-by: Michal Hocko <[email protected]>
> ---
>
> include/linux/sched.h | 1 +
> kernel/time/timer.c | 11 +++++++++++
> 2 files changed, 12 insertions(+)
>
> diff -puN kernel/time/timer.c~sched-add-schedule_timeout_idle kernel/time/timer.c
> --- a/kernel/time/timer.c~sched-add-schedule_timeout_idle
> +++ a/kernel/time/timer.c
> @@ -1566,6 +1566,17 @@ signed long __sched schedule_timeout_uni
> }
> EXPORT_SYMBOL(schedule_timeout_uninterruptible);
>
> +/*
> + * Like schedule_timeout_uninterruptible(), except this task will not contribute
> + * to load average.
> + */
> +signed long __sched schedule_timeout_idle(signed long timeout)
> +{
> + __set_current_state(TASK_IDLE);
> + return schedule_timeout(timeout);
> +}
> +EXPORT_SYMBOL(schedule_timeout_idle);
> +
> #ifdef CONFIG_HOTPLUG_CPU
> static void migrate_timer_list(struct tvec_base *new_base, struct hlist_head *head)
> {
> diff -puN include/linux/sched.h~sched-add-schedule_timeout_idle include/linux/sched.h
> --- a/include/linux/sched.h~sched-add-schedule_timeout_idle
> +++ a/include/linux/sched.h
> @@ -423,6 +423,7 @@ extern signed long schedule_timeout(sign
> extern signed long schedule_timeout_interruptible(signed long timeout);
> extern signed long schedule_timeout_killable(signed long timeout);
> extern signed long schedule_timeout_uninterruptible(signed long timeout);
> +extern signed long schedule_timeout_idle(signed long timeout);
> asmlinkage void schedule(void);
> extern void schedule_preempt_disabled(void);
>
> _
--
Michal Hocko
SUSE Labs
On Tue, Dec 15, 2015 at 1:36 PM, Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>
[...]
Since this is built-in always, can we....
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 5314b206caa5..48025a21f8c4 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -35,6 +35,11 @@
> #include <linux/freezer.h>
> #include <linux/ftrace.h>
> #include <linux/ratelimit.h>
> +#include <linux/kthread.h>
> +#include <linux/module.h>
...use <linux/init.h> instead above, and then...
> +
> +#include <asm/tlb.h>
> +#include "internal.h"
>
[...]
> + * Make sure our oom reaper thread will get scheduled when
> + * ASAP and that it won't get preempted by malicious userspace.
> + */
> + sched_setscheduler(oom_reaper_th, SCHED_FIFO, ¶m);
> + }
> + return 0;
> +}
> +module_init(oom_init)
...use one of the non-modular initcalls here? I'm trying to clean up most of
the non-modular uses of modular macros etc. since:
(1) it is easy to accidentally code up an unused module_exit function
(2) it can be misleading when reading the source, thinking it can be
modular when the Makefile and/or Kconfig prohibit it
(3) it requires the include of the module.h header file which in turn
includes nearly everything else, thus increasing CPP overhead.
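Something along these lines, presumably (hypothetical sketch; the exact
initcall level is a judgement call):

-#include <linux/module.h>
+#include <linux/init.h>
 ...
-module_init(oom_init)
+subsys_initcall(oom_init)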
I figured no point in sending a follow on patch since this came in via
the akpm tree into next and that gets rebased/updated regularly.
Thanks,
Paul.
--
On Tue, Dec 15, 2015 at 11:36 AM, Michal Hocko <[email protected]> wrote:
> From: Michal Hocko <[email protected]>
>
> This is based on the idea from Mel Gorman discussed during LSFMM 2015 and
> independently brought up by Oleg Nesterov.
>
> The OOM killer currently allows killing only a single task, in the hope
> that the task will terminate in a reasonable time and free up its
> memory. Such a task (the oom victim) gets access to memory reserves
> via mark_oom_victim to allow forward progress should there be a need
> for additional memory during the exit path.
>
> It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
> construct workloads which break the core assumption mentioned above, and
> the OOM victim might take an unbounded amount of time to exit because it
> might be blocked in the uninterruptible state waiting on an event
> (e.g. a lock) held by another task looping in the page allocator.
>
> This patch reduces the probability of such a lockup by introducing a
> specialized kernel thread (oom_reaper) which tries to reclaim additional
> memory by preemptively reaping the anonymous or swapped out memory
> owned by the oom victim, under the assumption that such memory won't
> be needed when its owner is killed and kicked out of userspace anyway.
> There is one notable exception, though: if the OOM victim was in the
> process of coredumping, the result would be incomplete. This is
> considered a reasonable constraint because the overall system health is
> more important than the debuggability of a particular application.
>
> A kernel thread has been chosen because we need a reliable way of
> invocation: a workqueue context is not appropriate because all the
> workers might be busy (e.g. allocating memory). Kswapd, which sounds
> like another good fit, is not appropriate either because it might get
> blocked on locks during reclaim.
>
> oom_reaper has to take mmap_sem on the target task for reading, so the
> solution is not 100% reliable because the semaphore might be held or
> blocked for write, but the probability is reduced considerably wrt.
> basically any lock blocking forward progress as described above. In
> order to prevent blocking on the lock without any forward progress we
> use only a trylock and retry 10 times with a short sleep in between.
> Users of mmap_sem which need it for write should be carefully reviewed
> to use _killable waiting as much as possible and to reduce allocation
> requests done with the lock held to the absolute minimum, to reduce the
> risk even further.
>
> The API between the oom killer and the oom reaper is quite trivial.
> wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only a
> NULL->mm transition and oom_reaper clears it atomically once it is done
> with the work. This means that only a single mm_struct can be reaped at
> a time. As the operation is potentially disruptive we try to limit it to
> the necessary minimum and the reaper blocks any updates while it operates
> on an mm. The mm_struct is pinned by mm_count to allow a parallel
> exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
>
> Changes since v2
> - fix mm_count reference leak reported by Tetsuo
> - make sure oom_reaper_th is NULL after kthread_run fails - Tetsuo
> - use wait_event_freezable rather than open coded wait loop - suggested
> by Tetsuo
> Changes since v1
> - fix the screwed up detail->check_swap_entries - Johannes
> - do not use kthread_should_stop because that would need a cleanup
> and we do not have anybody to stop us - Tetsuo
> - move wake_oom_reaper to oom_kill_process because we have to wait
> for all tasks sharing the same mm to get killed - Tetsuo
> - do not reap mm structs which are shared with unkillable tasks - Tetsuo
>
> Acked-by: Mel Gorman <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
While running xfstests on next-20151223 I hit a pair of kernel BUGs
that bisected to this commit:
1eb3a80d8239 ("mm, oom: introduce oom reaper")
Here is a BUG produced by generic/029 when run against XFS:
[ 235.751723] ------------[ cut here ]------------
[ 235.752194] kernel BUG at mm/filemap.c:208!
[ 235.752595] invalid opcode: 0000 [#1] SMP
[ 235.753036] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[ 235.753681] CPU: 3 PID: 17586 Comm: xfs_io Not tainted
4.4.0-rc6-next-20151223_new_fsync_v6+ #8
[ 235.754535] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140709_153950- 04/01/2014
[ 235.755451] task: ffff88040bde19c0 ti: ffff8800bab80000 task.ti:
ffff8800bab80000
[ 235.756202] RIP: 0010:[<ffffffff811c81f6>] [<ffffffff811c81f6>]
__delete_from_page_cache+0x206/0x440
[ 235.757151] RSP: 0018:ffff8800bab83b60 EFLAGS: 00010082
[ 235.757679] RAX: 0000000000000021 RBX: ffffea0007d37e00 RCX: 0000000000000006
[ 235.758360] RDX: 0000000000000000 RSI: 0000000000000001 RDI: ffff8804117ce380
[ 235.759043] RBP: ffff8800bab83bb8 R08: 0000000000000001 R09: 0000000000000001
[ 235.759749] R10: 00000000ffffffff R11: 0000000000028dc0 R12: ffff8800b1e7db00
[ 235.760444] R13: ffff8800b1e7daf8 R14: 0000000000000000 R15: 0000000000000003
[ 235.761122] FS: 00007f65dd009700(0000) GS:ffff880411600000(0000)
knlGS:0000000000000000
[ 235.761888] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 235.762432] CR2: 00007f65dd01f000 CR3: 00000000ba9d7000 CR4: 00000000000406e0
[ 235.763150] Stack:
[ 235.763347] ffff88040cf1a800 0000000000000001 0000000000000001
ffff8800ae2a3b50
[ 235.764123] ffff8800ae2a3b80 00000000b4ca5e1a ffffea0007d37e00
ffff8800b1e7db10
[ 235.764900] ffff88040cf1a800 0000000000000000 0000000000000292
ffff8800bab83bf0
[ 235.765638] Call Trace:
[ 235.765903] [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
[ 235.766513] [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
[ 235.767088] [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
[ 235.767725] [<ffffffff81021459>] ? sched_clock+0x9/0x10
[ 235.768239] [<ffffffff810db37c>] ? local_clock+0x1c/0x20
[ 235.768779] [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
[ 235.769385] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[ 235.770010] [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
[ 235.770501] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[ 235.771092] [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
[ 235.771646] [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
[ 235.772276] [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
[ 235.772839] [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
[ 235.773369] [<ffffffff8127af87>] notify_change+0x237/0x350
[ 235.773945] [<ffffffff81257c87>] do_truncate+0x77/0xc0
[ 235.774446] [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
[ 235.775156] [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
[ 235.775650] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
e8 32
[ 235.778695] RIP [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
[ 235.779350] RSP <ffff8800bab83b60>
[ 235.779694] ---[ end trace fac9dd65c4cdd828 ]---
And a different BUG produced by generic/095, also with XFS:
[ 609.398897] ------------[ cut here ]------------
[ 609.399843] kernel BUG at mm/truncate.c:629!
[ 609.400666] invalid opcode: 0000 [#1] SMP
[ 609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
[ 609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G W
4.4.0-rc6-next-20151223+ #1
[ 609.404267] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.7.5-20140709_153950- 04/01/2014
[ 609.405851] task: ffff8801e52119c0 ti: ffff8801f6540000 task.ti:
ffff8801f6540000
[ 609.407272] RIP: 0010:[<ffffffff811dc0ab>] [<ffffffff811dc0ab>]
invalidate_inode_pages2_range+0x30b/0x550
[ 609.409111] RSP: 0018:ffff8801f6543c88 EFLAGS: 00010202
[ 609.410105] RAX: 0000000000000001 RBX: 0000000000000061 RCX: ffff88041180e440
[ 609.411417] RDX: ffffffff00000001 RSI: 0000000000000000 RDI: 0000000000000286
[ 609.412737] RBP: ffff8801f6543dd0 R08: 0000000000000008 R09: 0000000000000001
[ 609.414069] R10: 0000000000000001 R11: 0000000000000000 R12: ffff8801f6dfb438
[ 609.415388] R13: ffffffffffffffff R14: 000000000000000b R15: ffffea0002877c80
[ 609.416681] FS: 00007f48e13ed740(0000) GS:ffff880411800000(0000)
knlGS:0000000000000000
[ 609.418190] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 609.419261] CR2: 00000000012e0000 CR3: 000000040f24c000 CR4: 00000000000406e0
[ 609.420577] Stack:
[ 609.420968] 0000000000000292 ffff8800ba6f7800 ffff8801f6dfb450
0000000000000000
[ 609.422423] 0000000000000001 0000000000000056 0000000000000057
0000000000000058
[ 609.423878] 0000000000000059 000000000000005a 000000000000005b
000000000000005c
[ 609.425325] Call Trace:
[ 609.425797] [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
[ 609.426971] [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
[ 609.428097] [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
[ 609.429073] [<ffffffff8125a319>] vfs_read+0x89/0x130
[ 609.430010] [<ffffffff8125b418>] SyS_read+0x58/0xd0
[ 609.430943] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
74 15
[ 609.436956] RIP [<ffffffff811dc0ab>]
invalidate_inode_pages2_range+0x30b/0x550
[ 609.438373] RSP <ffff8801f6543c88>
[ 609.439080] ---[ end trace 10616a16523ccb2c ]---
They both fail 100% of the time with the above signatures with the
"oom reaper" commit, and succeed 100% of the time with the parent
commit.
My test setup is a qemu guest machine with a pair of 4 GiB PMEM
ramdisk test devices, one for the xfstest test disk and one for the
scratch disk.
Please let me know if you have trouble reproducing this. I'm also
happy to test fixes.
Thanks,
- Ross
On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
[...]
> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> that bisected to this commit:
>
> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
Thank you for the report and the bisection.
> Here is a BUG produced by generic/029 when run against XFS:
>
> [ 235.751723] ------------[ cut here ]------------
> [ 235.752194] kernel BUG at mm/filemap.c:208!
This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
the full kernel log? It all smells like a race when OOM reaper tears
down the mapping and there is a truncate still in progress. But hitting
the BUG_ON just because of that doesn't make much sense to me. OOM
reaper is essentially MADV_DONTNEED. I have to think about this some
more, though, but I am in a holiday mode until early next year so please
bear with me.
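The MADV_DONTNEED comparison can be made concrete with a tiny userspace
program: on a private anonymous mapping the call simply drops the pages, and
any later access refaults zero-filled memory. A minimal sketch (arbitrary
mapping size; this is only an illustration of the semantics, not the reaper
code itself):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 1 << 20;
	unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return 1;

	memset(p, 0xaa, len);			/* dirty the anonymous pages */
	printf("before: %#x\n", p[0]);		/* 0xaa */

	/* Drop the pages; the mapping itself stays intact. */
	madvise(p, len, MADV_DONTNEED);

	/* The next access refaults a fresh zero-filled page. */
	printf("after:  %#x\n", p[0]);		/* 0 */
	return 0;
}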
[...]
> [ 235.765638] Call Trace:
> [ 235.765903] [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
> [ 235.766513] [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
> [ 235.767088] [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
> [ 235.767725] [<ffffffff81021459>] ? sched_clock+0x9/0x10
> [ 235.768239] [<ffffffff810db37c>] ? local_clock+0x1c/0x20
> [ 235.768779] [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
> [ 235.769385] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
> [ 235.770010] [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
> [ 235.770501] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
> [ 235.771092] [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
> [ 235.771646] [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
> [ 235.772276] [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
> [ 235.772839] [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
> [ 235.773369] [<ffffffff8127af87>] notify_change+0x237/0x350
> [ 235.773945] [<ffffffff81257c87>] do_truncate+0x77/0xc0
> [ 235.774446] [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
> [ 235.775156] [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
> [ 235.775650] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> [ 235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
> e8 32
> [ 235.778695] RIP [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
> [ 235.779350] RSP <ffff8800bab83b60>
> [ 235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>
> And a different BUG produced by generic/095, also with XFS:
>
> [ 609.398897] ------------[ cut here ]------------
> [ 609.399843] kernel BUG at mm/truncate.c:629!
Hmm, I do not see any BUG_ON at this line. But there is
BUG_ON(page_mapped(page)) at line 620.
> [ 609.400666] invalid opcode: 0000 [#1] SMP
> [ 609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
> [ 609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G W
There was a warning before this triggered. The full kernel log would be
helpful as well.
[...]
> [ 609.425325] Call Trace:
> [ 609.425797] [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
> [ 609.426971] [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
> [ 609.428097] [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
> [ 609.429073] [<ffffffff8125a319>] vfs_read+0x89/0x130
> [ 609.430010] [<ffffffff8125b418>] SyS_read+0x58/0xd0
> [ 609.430943] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> [ 609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> 74 15
[...]
> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> ramdisk test devices, one for the xfstest test disk and one for the
> scratch disk.
Is this just a plain ramdisk device or does it need a special configuration?
Is this somehow DAX related?
Thanks!
--
Michal Hocko
SUSE Labs
Michal Hocko wrote:
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.
I don't know whether the OOM killer was invoked just before this
VM_BUG_ON_PAGE().
> Is this somehow DAX related?
4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
http://marc.info/?l=linux-mm&m=145068666428057 should be applied
when retesting. (20151223 does not have this fix.)
[ 235.768779] [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
[ 235.769385] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
[ 235.770010] [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
[ 235.770501] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
On Thu, Dec 24, 2015 at 4:06 AM, Tetsuo Handa
<[email protected]> wrote:
> Michal Hocko wrote:
>> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
>> the full kernel log? It all smells like a race when OOM reaper tears
>> down the mapping and there is a truncate still in progress. But hitting
>> the BUG_ON just because of that doesn't make much sense to me. OOM
>> reaper is essentially MADV_DONTNEED. I have to think about this some
>> more, though, but I am in a holiday mode until early next year so please
>> bear with me.
>
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
>
>> Is this somehow DAX related?
>
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)
No, DAX was not turned on, and while my fsync/msync patches were the initial
reason I was testing (hence the new_fsync_v6 kernel name) they were not
applied during the bisect, so I'm sure they are not related to this issue.
I will retest with the patch referenced above, but it probably won't happen
until the new year.
On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko <[email protected]> wrote:
> On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> [...]
>> While running xfstests on next-20151223 I hit a pair of kernel BUGs
>> that bisected to this commit:
>>
>> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
>
> Thank you for the report and the bisection.
>
>> Here is a BUG produced by generic/029 when run against XFS:
>>
>> [ 235.751723] ------------[ cut here ]------------
>> [ 235.752194] kernel BUG at mm/filemap.c:208!
>
> This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> the full kernel log? It all smells like a race when OOM reaper tears
> down the mapping and there is a truncate still in progress. But hitting
> the BUG_ON just because of that doesn't make much sense to me. OOM
> reaper is essentially MADV_DONTNEED. I have to think about this some
> more, though, but I am in a holiday mode until early next year so please
> bear with me.
The two stack traces were gathered with next-20151223, so the line numbers
may have moved around a bit when compared to the actual "mm, oom: introduce
oom reaper" commit.
> [...]
>> [ 235.765638] Call Trace:
>> [ 235.765903] [<ffffffff811c8493>] delete_from_page_cache+0x63/0xd0
>> [ 235.766513] [<ffffffff811dc3e5>] truncate_inode_page+0xa5/0x120
>> [ 235.767088] [<ffffffff811dc648>] truncate_inode_pages_range+0x1a8/0x7f0
>> [ 235.767725] [<ffffffff81021459>] ? sched_clock+0x9/0x10
>> [ 235.768239] [<ffffffff810db37c>] ? local_clock+0x1c/0x20
>> [ 235.768779] [<ffffffff811feba4>] ? unmap_mapping_range+0x64/0x130
>> [ 235.769385] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [ 235.770010] [<ffffffff810f5c3f>] ? up_write+0x1f/0x40
>> [ 235.770501] [<ffffffff811febb4>] ? unmap_mapping_range+0x74/0x130
>> [ 235.771092] [<ffffffff811dcd58>] truncate_pagecache+0x48/0x70
>> [ 235.771646] [<ffffffff811dcdb2>] truncate_setsize+0x32/0x40
>> [ 235.772276] [<ffffffff8148e972>] xfs_setattr_size+0x232/0x470
>> [ 235.772839] [<ffffffff8148ec64>] xfs_vn_setattr+0xb4/0xc0
>> [ 235.773369] [<ffffffff8127af87>] notify_change+0x237/0x350
>> [ 235.773945] [<ffffffff81257c87>] do_truncate+0x77/0xc0
>> [ 235.774446] [<ffffffff8125800f>] do_sys_ftruncate.constprop.15+0xef/0x150
>> [ 235.775156] [<ffffffff812580ae>] SyS_ftruncate+0xe/0x10
>> [ 235.775650] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [ 235.776257] Code: 5f 5d c3 48 8b 43 20 48 8d 78 ff a8 01 48 0f 44
>> fb 8b 47 48 85 c0 0f 88 2b 01 00 00 48 c7 c6 a8 57 f0 81 48 89 df e8
>> fa 1a 03 00 <0f> 0b 4c 89 ce 44 89 fa 4c 89 e7 4c 89 45 b0 4c 89 4d b8
>> e8 32
>> [ 235.778695] RIP [<ffffffff811c81f6>] __delete_from_page_cache+0x206/0x440
>> [ 235.779350] RSP <ffff8800bab83b60>
>> [ 235.779694] ---[ end trace fac9dd65c4cdd828 ]---
>>
>> And a different BUG produced by generic/095, also with XFS:
>>
>> [ 609.398897] ------------[ cut here ]------------
>> [ 609.399843] kernel BUG at mm/truncate.c:629!
>
> Hmm, I do not see any BUG_ON at this line. But there is
> BUG_ON(page_mapped(page)) at line 620.
Ditto - check out next-20151223 for real line numbers.
>> [ 609.400666] invalid opcode: 0000 [#1] SMP
>> [ 609.401512] Modules linked in: nd_pmem nd_btt nd_e820 libnvdimm
>> [ 609.402719] CPU: 4 PID: 26782 Comm: fio Tainted: G W
>
> There was a warning before this triggered. The full kernel log would be
> helpful as well.
Sure, I can gather full kernel logs, but it'll probably be after the new year.
> [...]
>> [ 609.425325] Call Trace:
>> [ 609.425797] [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
>> [ 609.426971] [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
>> [ 609.428097] [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
>> [ 609.429073] [<ffffffff8125a319>] vfs_read+0x89/0x130
>> [ 609.430010] [<ffffffff8125b418>] SyS_read+0x58/0xd0
>> [ 609.430943] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
>> [ 609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
>> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
>> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
>> 74 15
> [...]
>> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
>> ramdisk test devices, one for the xfstest test disk and one for the
>> scratch disk.
>
> > Is this just a plain ramdisk device or does it need a special configuration?
Just a plain PMEM ram disk with DAX turned off. Configuration instructions
for PMEM can be found here:
https://nvdimm.wiki.kernel.org/
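For anyone wanting to reproduce the setup: per the wiki above, one way to get
legacy PMEM test devices in a guest (consistent with the nd_e820 module in the
trace) is to carve them out of guest RAM with the memmap= boot parameter. The
sizes and offsets below are only an example and have to fit the guest's memory
map:

	memmap=4G!4G memmap=4G!8G

With enough guest RAM this should yield /dev/pmem0 and /dev/pmem1, which can
then be pointed at by the xfstests test and scratch device configuration.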
On Thu 24-12-15 13:44:03, Ross Zwisler wrote:
> On Thu, Dec 24, 2015 at 2:47 AM, Michal Hocko <[email protected]> wrote:
> > On Wed 23-12-15 16:00:09, Ross Zwisler wrote:
> > [...]
> >> While running xfstests on next-20151223 I hit a pair of kernel BUGs
> >> that bisected to this commit:
> >>
> >> 1eb3a80d8239 ("mm, oom: introduce oom reaper")
> >
> > Thank you for the report and the bisection.
> >
> >> Here is a BUG produced by generic/029 when run against XFS:
> >>
> >> [ 235.751723] ------------[ cut here ]------------
> >> [ 235.752194] kernel BUG at mm/filemap.c:208!
> >
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
>
> The two stack traces were gathered with next-20151223, so the line numbers
> may have moved around a bit when compared to the actual "mm, oom: introduce
> oom reaper" commit.
I was looking at the same next tree, I believe:
$ git describe
next-20151223
[...]
> > There was a warning before this triggered. The full kernel log would be
> > helpful as well.
>
> Sure, I can gather full kernel logs, but it'll probably be after the new year.
OK, I will wait for the logs. It would be really interesting to see what the
timing was between the OOM killer invocation and this trace.
> > [...]
> >> [ 609.425325] Call Trace:
> >> [ 609.425797] [<ffffffff811dc307>] invalidate_inode_pages2+0x17/0x20
> >> [ 609.426971] [<ffffffff81482167>] xfs_file_read_iter+0x297/0x300
> >> [ 609.428097] [<ffffffff81259ac9>] __vfs_read+0xc9/0x100
> >> [ 609.429073] [<ffffffff8125a319>] vfs_read+0x89/0x130
> >> [ 609.430010] [<ffffffff8125b418>] SyS_read+0x58/0xd0
> >> [ 609.430943] [<ffffffff81a527b2>] entry_SYSCALL_64_fastpath+0x12/0x76
> >> [ 609.432139] Code: 85 d8 fe ff ff 01 00 00 00 f6 c4 40 0f 84 59 ff
> >> ff ff 49 8b 47 20 48 8d 78 ff a8 01 49 0f 44 ff 8b 47 48 85 c0 0f 88
> >> bd 01 00 00 <0f> 0b 4d 3b 67 08 0f 85 70 ff ff ff 49 f7 07 00 18 00 00
> >> 74 15
> > [...]
> >> My test setup is a qemu guest machine with a pair of 4 GiB PMEM
> >> ramdisk test devices, one for the xfstest test disk and one for the
> >> scratch disk.
> >
> > Is this just a plain ramdisk device or does it need a special configuration?
>
> Just a plain PMEM ram disk with DAX turned off. Configuration instructions
> for PMEM can be found here:
>
> https://nvdimm.wiki.kernel.org/
Thanks, I will try to reproduce early next year. But so far I think this
is just a general issue of MADV_DONTNEED vs. truncate and oom_reaper is
just lucky to trigger it. There shouldn't be anything oom_reaper
specific here. Maybe there is some additional locking missing?
--
Michal Hocko
SUSE Labs
On Thu 24-12-15 20:06:50, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > This is VM_BUG_ON_PAGE(page_mapped(page), page), right? Could you attach
> > the full kernel log? It all smells like a race when OOM reaper tears
> > down the mapping and there is a truncate still in progress. But hitting
> > the BUG_ON just because of that doesn't make much sense to me. OOM
> > reaper is essentially MADV_DONTNEED. I have to think about this some
> > more, though, but I am in a holiday mode until early next year so please
> > bear with me.
>
> I don't know whether the OOM killer was invoked just before this
> VM_BUG_ON_PAGE().
>
> > Is this somehow DAX related?
>
> 4.4.0-rc6-next-20151223_new_fsync_v6+ suggests that this kernel
> has "[PATCH v6 0/7] DAX fsync/msync support" applied. But I think
> http://marc.info/?l=linux-mm&m=145068666428057 should be applied
> when retesting. (20151223 does not have this fix.)
Hmm, I think you are right! Very well spotted! If ignore_dirty ends up
being true then we would simply skip over the dirty page and wouldn't end
up doing page_remove_rmap. I can see that the truncation code can later
trip over this page.
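Tetsuo's observation points at a classic hazard: when a structure grows new
fields, callers that fill it in field by field on the stack leave the new
fields (ignore_dirty, check_swap_entries) holding whatever garbage happens to
be there. A self-contained sketch of the hazard and the obvious remedy, using
a made-up struct rather than the real zap_details:

#include <stdbool.h>
#include <stdio.h>

/* Stand-in for a structure that later grew an extra flag. */
struct details {
	void *mapping;
	unsigned long first_index;
	unsigned long last_index;
	bool ignore_dirty;	/* newly added field */
};

static void old_style_caller(void)
{
	struct details d;	/* uninitialized stack memory */

	/* Pre-existing callers only set the fields they knew about... */
	d.mapping = NULL;
	d.first_index = 0;
	d.last_index = ~0UL;

	/*
	 * ...so the new flag is whatever was left on the stack (reading it
	 * is itself undefined behaviour, which is exactly the problem).
	 */
	printf("old-style caller: ignore_dirty=%d (garbage)\n", d.ignore_dirty);
}

static void fixed_caller(void)
{
	/* Designated initialization zeroes all unnamed members. */
	struct details d = { .mapping = NULL, .last_index = ~0UL };

	printf("fixed caller:     ignore_dirty=%d\n", d.ignore_dirty);
}

int main(void)
{
	old_style_caller();
	fixed_caller();
	return 0;
}

Presumably the follow-up fix linked above amounts to the designated/zero
initializer variant for the truncate/unmap_mapping_range path, so the new
flags default to false there.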
Thanks!
--
Michal Hocko
SUSE Labs
On Fri 25-12-15 12:35:37, Michal Hocko wrote:
[...]
> Thanks, I will try to reproduce early next year. But so far I think this
> is just a general issue of MADV_DONTNEED vs. truncate and oom_reaper is
> just lucky to trigger it. There shouldn't be anything oom_reaper
> specific here. Maybe there is some additional locking missing?
Hmm, scratch that. I think Tetsuo has nailed it. It seems like
the missing initialization of the details structure during unmap
is the culprit. So there most probably was no OOM killing
invoked. It is just a side effect of the patch and the missing
http://marc.info/?l=linux-mm&m=145068666428057 follow-up fix.
--
Michal Hocko
SUSE Labs