The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm. This can happen for a variety of reasons,
including:
- the inability to grab mm->mmap_sem in a sufficient amount of time,
- when the mm has blockable mmu notifiers that could cause the oom reaper
to stall indefinitely,
but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:
- when the mm's memory is fully mlocked.
When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily. This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.
This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap().
We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.
To fix this, this patch introduces a per-mm reaping timeout, initially set
at 10s. It requires that the oom reaper's list becomes a properly linked
list so that other mm's may be reaped while waiting for an mm's timeout to
expire.
The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.
The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.
The reaping timeout is intentionally set for a substantial amount of time
since oom livelock is a very rare occurrence and it's better to optimize
for preventing additional (unnecessary) oom killing than a scenario that
is much more unlikely.
Signed-off-by: David Rientjes <[email protected]>
---
include/linux/mm_types.h | 4 ++
include/linux/sched.h | 2 +-
mm/mmap.c | 12 +++---
mm/oom_kill.c | 85 ++++++++++++++++++++++++++--------------
4 files changed, 66 insertions(+), 37 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -462,6 +462,10 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_MMU
+ /* When to give up on oom reaping this mm */
+ unsigned long reap_timeout;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,7 +1151,7 @@ struct task_struct {
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
- struct task_struct *oom_reaper_list;
+ struct list_head oom_reap_list;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
if (unlikely(mm_is_oom_victim(mm))) {
/*
* Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_sem for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_sem is
- * dropped.
+ * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+ * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+ * guarantee that the oom reaper will not run on this mm again
+ * after mmap_sem is dropped.
*
* Nothing can be holding mm->mmap_sem here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm)
__oom_reap_task_mm(mm);
mutex_unlock(&oom_lock);
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ set_bit(MMF_UNSTABLE, &mm->flags);
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
@@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm)
unmap_vmas(&tlb, vma, 0, -1);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -476,7 +476,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
*/
static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
static DEFINE_SPINLOCK(oom_reaper_lock);
void __oom_reap_task_mm(struct mm_struct *mm)
@@ -558,12 +558,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
}
/*
- * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_OOM_SKIP must run
+ * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+ * work on the mm anymore. The check for MMF_UNSTABLE must run
* under mmap_sem for reading because it serializes against the
* down_write();up_write() cycle in exit_mmap().
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ if (test_bit(MMF_UNSTABLE, &mm->flags)) {
up_read(&mm->mmap_sem);
trace_skip_task_reaping(tsk->pid);
goto unlock_oom;
@@ -589,31 +589,49 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
#define MAX_OOM_REAP_RETRIES 10
static void oom_reap_task(struct task_struct *tsk)
{
- int attempts = 0;
struct mm_struct *mm = tsk->signal->oom_mm;
+ bool ret = true;
- /* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
- schedule_timeout_idle(HZ/10);
+ /*
+ * If this mm has either been fully unmapped, or the oom reaper has
+ * given up on it, nothing left to do except drop the refcount.
+ */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ goto drop;
- if (attempts <= MAX_OOM_REAP_RETRIES ||
- test_bit(MMF_OOM_SKIP, &mm->flags))
- goto done;
+ /*
+ * If this mm has already been reaped, doing so again will not likely
+ * free additional memory.
+ */
+ if (!test_bit(MMF_UNSTABLE, &mm->flags))
+ ret = oom_reap_task_mm(tsk, mm);
+
+ if (time_after(jiffies, mm->reap_timeout)) {
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ debug_show_all_locks();
- pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- debug_show_all_locks();
+ /*
+ * Reaping has failed for the timeout period, so give up
+ * and allow additional processes to be oom killed.
+ */
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ }
+ goto drop;
+ }
-done:
- tsk->oom_reaper_list = NULL;
+ if (!ret)
+ schedule_timeout_idle(HZ/10);
- /*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ /* Enqueue to be reaped again */
+ spin_lock(&oom_reaper_lock);
+ list_add(&tsk->oom_reap_list, &oom_reaper_list);
+ spin_unlock(&oom_reaper_lock);
+ return;
- /* Drop a reference taken by wake_oom_reaper */
+drop:
+ /* Drop the reference taken by wake_oom_reaper() */
put_task_struct(tsk);
}
@@ -622,11 +640,13 @@ static int oom_reaper(void *unused)
while (true) {
struct task_struct *tsk = NULL;
- wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+ wait_event_freezable(oom_reaper_wait,
+ !list_empty(&oom_reaper_list));
spin_lock(&oom_reaper_lock);
- if (oom_reaper_list != NULL) {
- tsk = oom_reaper_list;
- oom_reaper_list = tsk->oom_reaper_list;
+ if (!list_empty(&oom_reaper_list)) {
+ tsk = list_entry(&oom_reaper_list, struct task_struct,
+ oom_reap_list);
+ list_del(&tsk->oom_reap_list);
}
spin_unlock(&oom_reaper_lock);
@@ -637,17 +657,22 @@ static int oom_reaper(void *unused)
return 0;
}
+/* How long to wait to oom reap an mm before selecting another process */
+#define OOM_REAP_TIMEOUT_MSECS (10 * 1000)
static void wake_oom_reaper(struct task_struct *tsk)
{
- /* tsk is already queued? */
- if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+ /*
+ * Set the reap timeout; if it's already set, the mm is enqueued and
+ * this tsk can be ignored.
+ */
+ if (cmpxchg(&tsk->signal->oom_mm->reap_timeout, 0UL,
+ jiffies + msecs_to_jiffies(OOM_REAP_TIMEOUT_MSECS)))
return;
get_task_struct(tsk);
spin_lock(&oom_reaper_lock);
- tsk->oom_reaper_list = oom_reaper_list;
- oom_reaper_list = tsk;
+ list_add(&tsk->oom_reap_list, &oom_reaper_list);
spin_unlock(&oom_reaper_lock);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm. This can happen for a variety of reasons,
> including:
>
> - the inability to grab mm->mmap_sem in a sufficient amount of time,
>
> - when the mm has blockable mmu notifiers that could cause the oom reaper
> to stall indefinitely,
>
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
>
> - when the mm's memory is fully mlocked.
- when the mm's memory is fully mlocked (needs privilege) or
fully shared (does not need privilege)
>
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily. This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
s/mlocked/mlocked or shared/g
>
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
>
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
>
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s. It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
I already proposed more simpler one at https://patchwork.kernel.org/patch/9877991/ .
>
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
>
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
>
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.
But before thinking about your proposal, please think about how to guarantee
that the OOM reaper and the exit path can run discussed at
http://lkml.kernel.org/r/[email protected] .
On Thu 24-05-18 14:22:53, David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm. This can happen for a variety of reasons,
> including:
>
> - the inability to grab mm->mmap_sem in a sufficient amount of time,
>
> - when the mm has blockable mmu notifiers that could cause the oom reaper
> to stall indefinitely,
>
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
>
> - when the mm's memory is fully mlocked.
>
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily. This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
>
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
>
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
>
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s. It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
No timeouts please! The proper way to handle this problem is to simply
teach the oom reaper to handle mlocked areas.
--
Michal Hocko
SUSE Labs
On Fri, 25 May 2018, Michal Hocko wrote:
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm. This can happen for a variety of reasons,
> > including:
> >
> > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > to stall indefinitely,
> >
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> >
> > - when the mm's memory is fully mlocked.
> >
> > When all memory is mlocked, the oom reaper will not be able to free any
> > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > chosen unnecessarily. This is trivial to reproduce if all eligible
> > processes on the system have mlocked their memory: the oom killer calls
> > panic() even though forward progress can be made.
> >
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> >
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> >
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s. It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
>
> No timeouts please! The proper way to handle this problem is to simply
> teach the oom reaper to handle mlocked areas.
That's not sufficient since the oom reaper is also not able to oom reap if
the mm has blockable mmu notifiers or all memory is shared filebacked
memory, so it immediately sets MMF_OOM_SKIP and additional processes are
oom killed.
The current implementation that relies on MAX_OOM_REAP_RETRIES is acting
as a timeout already for mm->mmap_sem, but it's doing so without
attempting to oom reap other victims that may actually allow it to grab
mm->mmap_sem if the allocator is waiting on a lock.
The solution, as proposed, is to allow the oom reaper to iterate over all
victims and try to free memory rather than working on each victim one by
one and giving up.
But also note that even if oom reaping is possible, in the presence of an
antagonist that continues to allocate memory, that it is possible to oom
kill additional victims unnecessarily if we aren't able to complete
free_pgtables() in exit_mmap() of the original victim.
So this patch is solving all three issues: allowing a process to *fully*
exit (including free_pgtables()) before setting MMF_OOM_SKIP, allows the
oom reaper to act on parallel victims that may allow a victim to be
reaped, and preventing additional processes from being killed
unnecessarily when oom reaping isn't able to free memory (mlock, blockable
mmu invalidates, all VM_SHARED file backed, small rss, etc).
The vast majority of the time, oom reaping can occur with this change or
the process can reach exit_mmap() itself; oom livelock appears to be very
rare with this patch even for mem cgroup constrained oom kills and very
tight limitation and thus it makes sense to wait for a prolonged period of
time before killing additional processes unnecessarily.
On Fri, 25 May 2018, Tetsuo Handa wrote:
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm. This can happen for a variety of reasons,
> > including:
> >
> > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > to stall indefinitely,
> >
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> >
> > - when the mm's memory is fully mlocked.
>
> - when the mm's memory is fully mlocked (needs privilege) or
> fully shared (does not need privilege)
>
Good point, that is another way that unnecessary oom killing can occur
because the oom reaper sets MMF_OOM_SKIP far too early. I can make the
change to the commit message.
Also, I noticed in my patch that oom_reap_task() should be doing
list_add_tail() rather than list_add() to enqueue the mm for reaping
again.
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
> >
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
> >
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s. It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
>
> I already proposed more simpler one at https://patchwork.kernel.org/patch/9877991/ .
>
It's a similar idea, and I'm glad that we agree that some kind of per-mm
delay is required to avoid this problem. I think yours is simpler, but
consider the other two changes in my patch:
- in the normal exit path, absent any timeout for the mm, we only set
MMF_OOM_SKIP after free_pgtables() when it is known we will not free
any additional memory, which can also cause unnecessary oom killing
because the oom killer races with free_pgtables(), and
- the oom reaper now operates over all concurrent victims instead of
repeatedly trying to take mm->mmap_sem of the first victim, sleeping
many times, retrying, giving up, and moving on the next victim.
Allowing the oom reaper to iterate through all victims can allow
memory freeing such that an allocator may be able to drop mm->mmap_sem.
In fact, with my patch, I don't know of any condition where we kill
additional processes unnecessarily *unless* the victim cannot be oom
reaped or complete memory freeing in the exit path within 10 seconds.
Given how rare oom livelock appears in practice, I think the 10 seconds is
justified because right now it is _trivial_ to oom kill many victims
completely unnecessarily.
On Fri 25-05-18 12:36:08, David Rientjes wrote:
> On Fri, 25 May 2018, Michal Hocko wrote:
>
> > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > > it cannot reap an mm. This can happen for a variety of reasons,
> > > including:
> > >
> > > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > >
> > > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > > to stall indefinitely,
> > >
> > > but we can also add a third when the oom reaper can "reap" an mm but doing
> > > so is unlikely to free any amount of memory:
> > >
> > > - when the mm's memory is fully mlocked.
> > >
> > > When all memory is mlocked, the oom reaper will not be able to free any
> > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> > > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > > chosen unnecessarily. This is trivial to reproduce if all eligible
> > > processes on the system have mlocked their memory: the oom killer calls
> > > panic() even though forward progress can be made.
> > >
> > > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > > unmapping memory and additional processes can be chosen unnecessarily
> > > because the oom killer is racing with exit_mmap().
> > >
> > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > > a true oom livelock in progress, it never gets set and no additional
> > > killing is possible.
> > >
> > > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > > at 10s. It requires that the oom reaper's list becomes a properly linked
> > > list so that other mm's may be reaped while waiting for an mm's timeout to
> > > expire.
> >
> > No timeouts please! The proper way to handle this problem is to simply
> > teach the oom reaper to handle mlocked areas.
>
> That's not sufficient since the oom reaper is also not able to oom reap if
> the mm has blockable mmu notifiers or all memory is shared filebacked
> memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> oom killed.
Could you be more specific with a real world example where that is the
case? I mean the full address space of non-reclaimable file backed
memory where waiting some more would help? Blockable mmu notifiers are
a PITA for sure. I wish we could have a better way to deal with them.
Maybe we can tell them we are in the non-blockable context and have them
release as much as possible. Still something that a random timeout
wouldn't help I am afraid.
> The current implementation that relies on MAX_OOM_REAP_RETRIES is acting
> as a timeout already for mm->mmap_sem, but it's doing so without
> attempting to oom reap other victims that may actually allow it to grab
> mm->mmap_sem if the allocator is waiting on a lock.
Trying to reap a different oom victim when the current one is not making
progress during the lock contention is certainly something that make
sense. It has been proposed in the past and we just gave it up because
it was more complex. Do you have any specific example when this would
help to justify the additional complexity?
> The solution, as proposed, is to allow the oom reaper to iterate over all
> victims and try to free memory rather than working on each victim one by
> one and giving up.
>
> But also note that even if oom reaping is possible, in the presence of an
> antagonist that continues to allocate memory, that it is possible to oom
> kill additional victims unnecessarily if we aren't able to complete
> free_pgtables() in exit_mmap() of the original victim.
If there is unbound source of allocations then we are screwed no matter
what. We just hope that the allocator will get noticed by the oom killer
and it will be stopped.
That being said. I do not object for justified improvements in the oom
reaping. But I absolutely detest some random timeouts and will nack
implementations based on them until it is absolutely clear there is no
other way around.
--
Michal Hocko
SUSE Labs
On Mon, 28 May 2018, Michal Hocko wrote:
> > That's not sufficient since the oom reaper is also not able to oom reap if
> > the mm has blockable mmu notifiers or all memory is shared filebacked
> > memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> > oom killed.
>
> Could you be more specific with a real world example where that is the
> case? I mean the full address space of non-reclaimable file backed
> memory where waiting some more would help? Blockable mmu notifiers are
> a PITA for sure. I wish we could have a better way to deal with them.
> Maybe we can tell them we are in the non-blockable context and have them
> release as much as possible. Still something that a random timeout
> wouldn't help I am afraid.
>
It's not a random timeout, it's sufficiently long such that we don't oom
kill several processes needlessly in the very rare case where oom livelock
would actually prevent the original victim from exiting. The oom reaper
processing an mm, finding everything to be mlocked, and immediately
MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a
large memory hogging process that mlocks all of its memory; we
consistently see spurious and unnecessary oom kills simply because the oom
reaper has set MMF_OOM_SKIP very early.
This patch introduces a "give up" period such that the oom reaper is still
allowed to do its good work but only gives up in the hope the victim can
make forward progress at some substantial period of time in the future. I
would understand the objection if oom livelock where the victim cannot
make forward progress were commonplace, but in the interest of not killing
several processes needlessly every time a large mlocked process is
targeted, I think it compels a waiting period.
> Trying to reap a different oom victim when the current one is not making
> progress during the lock contention is certainly something that make
> sense. It has been proposed in the past and we just gave it up because
> it was more complex. Do you have any specific example when this would
> help to justify the additional complexity?
>
I'm not sure how you're defining complexity, the patch adds ~30 lines of
code and prevents processes from needlessly being oom killed when oom
reaping is largely unsuccessful and before the victim finishes
free_pgtables() and then also allows the oom reaper to operate on multiple
mm's instead of processing one at a time. Obviously if there is a delay
before MMF_OOM_SKIP is set it requires that the oom reaper be able to
process other mm's, otherwise we stall needlessly for 10s. Operating on
multiple mm's in a linked list while waiting for victims to exit during a
timeout period is thus very much needed, it wouldn't make sense without
it.
> > But also note that even if oom reaping is possible, in the presence of an
> > antagonist that continues to allocate memory, that it is possible to oom
> > kill additional victims unnecessarily if we aren't able to complete
> > free_pgtables() in exit_mmap() of the original victim.
>
> If there is unbound source of allocations then we are screwed no matter
> what. We just hope that the allocator will get noticed by the oom killer
> and it will be stopped.
>
It's not unbounded, it's just an allocator that acts as an antagonist. At
the risk of being overly verbose, for system or memcg oom conditions: a
large mlocked process is oom killed, other processes continue to
allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP
without being able to free any memory, and the other important processes
are needlessly oom killed before the original victim can reach
exit_mmap(). This happens a _lot_.
I'm open to hearing any other suggestions that you have other than waiting
some time period before MMF_OOM_SKIP gets set to solve this problem.
On Wed 30-05-18 14:06:51, David Rientjes wrote:
> On Mon, 28 May 2018, Michal Hocko wrote:
>
> > > That's not sufficient since the oom reaper is also not able to oom reap if
> > > the mm has blockable mmu notifiers or all memory is shared filebacked
> > > memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> > > oom killed.
> >
> > Could you be more specific with a real world example where that is the
> > case? I mean the full address space of non-reclaimable file backed
> > memory where waiting some more would help? Blockable mmu notifiers are
> > a PITA for sure. I wish we could have a better way to deal with them.
> > Maybe we can tell them we are in the non-blockable context and have them
> > release as much as possible. Still something that a random timeout
> > wouldn't help I am afraid.
> >
>
> It's not a random timeout, it's sufficiently long such that we don't oom
> kill several processes needlessly in the very rare case where oom livelock
> would actually prevent the original victim from exiting. The oom reaper
> processing an mm, finding everything to be mlocked, and immediately
> MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a
> large memory hogging process that mlocks all of its memory; we
> consistently see spurious and unnecessary oom kills simply because the oom
> reaper has set MMF_OOM_SKIP very early.
It takes quite some additional steps for admin to allow a large amount
of mlocked memory and such an application should be really careful to
not consume too much memory. So how come this is something you see that
consistently? Is this some sort of bug or an unfortunate workload side
effect? I am asking this because I really want to see how relevant this
really is.
> This patch introduces a "give up" period such that the oom reaper is still
> allowed to do its good work but only gives up in the hope the victim can
> make forward progress at some substantial period of time in the future. I
> would understand the objection if oom livelock where the victim cannot
> make forward progress were commonplace, but in the interest of not killing
> several processes needlessly every time a large mlocked process is
> targeted, I think it compels a waiting period.
But the waiting periods just turn out to be a really poor design. There
will be no good timeout to fit for everybody. We can do better and as
long as this is the case the timeout based solution should be really
rejected. It is a shortcut that doesn't really solve the underlying
problem.
> > Trying to reap a different oom victim when the current one is not making
> > progress during the lock contention is certainly something that make
> > sense. It has been proposed in the past and we just gave it up because
> > it was more complex. Do you have any specific example when this would
> > help to justify the additional complexity?
> >
>
> I'm not sure how you're defining complexity, the patch adds ~30 lines of
> code and prevents processes from needlessly being oom killed when oom
> reaping is largely unsuccessful and before the victim finishes
> free_pgtables() and then also allows the oom reaper to operate on multiple
> mm's instead of processing one at a time. Obviously if there is a delay
> before MMF_OOM_SKIP is set it requires that the oom reaper be able to
> process other mm's, otherwise we stall needlessly for 10s. Operating on
> multiple mm's in a linked list while waiting for victims to exit during a
> timeout period is thus very much needed, it wouldn't make sense without
> it.
It needs to keep track of the current retry state of the reaped victim
and that is an additional complexity, isn't it? And I am asking how
often do we have to handle that. Please note that the primary objective
here is to unclutter a locked up situation. The oom reaper doesn't block
the victim to go away on its own while we keep retrying. So a slow
progress on the reaper side is not an issue IMIHO.
> > > But also note that even if oom reaping is possible, in the presence of an
> > > antagonist that continues to allocate memory, that it is possible to oom
> > > kill additional victims unnecessarily if we aren't able to complete
> > > free_pgtables() in exit_mmap() of the original victim.
> >
> > If there is unbound source of allocations then we are screwed no matter
> > what. We just hope that the allocator will get noticed by the oom killer
> > and it will be stopped.
> >
>
> It's not unbounded, it's just an allocator that acts as an antagonist. At
> the risk of being overly verbose, for system or memcg oom conditions: a
> large mlocked process is oom killed, other processes continue to
> allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP
> without being able to free any memory, and the other important processes
> are needlessly oom killed before the original victim can reach
> exit_mmap(). This happens a _lot_.
>
> I'm open to hearing any other suggestions that you have other than waiting
> some time period before MMF_OOM_SKIP gets set to solve this problem.
I've already offered one. Make mlocked pages reapable. This is something
that has been on the todo list for quite some time. I just didn't have
time to work on that. The priority was not at the top because most sane
workloads simply do not mlock large portion of the memory. But if you
can see that happening regularly then this should be the first thing to
try. The main obstable to do so back then was the page_lock currently
taken in the munlock path. I've discussed that with Hugh and he said
that they are mainly for accounting purposes and mostly a relict from
the past IIRC and this should be fixable and a general improvement as
well.
--
Michal Hocko
SUSE Labs
On Thu, 31 May 2018, Michal Hocko wrote:
> > It's not a random timeout, it's sufficiently long such that we don't oom
> > kill several processes needlessly in the very rare case where oom livelock
> > would actually prevent the original victim from exiting. The oom reaper
> > processing an mm, finding everything to be mlocked, and immediately
> > MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a
> > large memory hogging process that mlocks all of its memory; we
> > consistently see spurious and unnecessary oom kills simply because the oom
> > reaper has set MMF_OOM_SKIP very early.
>
> It takes quite some additional steps for admin to allow a large amount
> of mlocked memory and such an application should be really careful to
> not consume too much memory. So how come this is something you see that
> consistently? Is this some sort of bug or an unfortunate workload side
> effect? I am asking this because I really want to see how relevant this
> really is.
>
The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after
the victim has been chosen for oom kill and we get follow-up oom kills,
not that the process is able to mlock a large amount of memory. Mlock
here is only being discussed as a single example. Tetsuo has brought up
the example of all shared file-backed memory. We've discussed the mm
having a single blockable mmu notifier. Regardless of how we arrive at
the point where the oom reaper can't free memory, which could be any of
those three cases, if (1) the original victim is sufficiently large that
follow-up oom kills would become unnecessary and (2) other threads
allocate/charge before the oom victim reaches exit_mmap(), this occurs.
We have examples of cases where oom reaping was successful, but the rss
numbers in the kernel log are very similar to when it was oom killed and
the process is known not to mlock, the reason is because the oom reaper
could free very little memory due to blockable mmu notifiers.
> But the waiting periods just turn out to be a really poor design. There
> will be no good timeout to fit for everybody. We can do better and as
> long as this is the case the timeout based solution should be really
> rejected. It is a shortcut that doesn't really solve the underlying
> problem.
>
The current implementation is a timeout based solution for mmap_sem, it
just has the oom reaper spinning trying to grab the sem and eventually
gives up. This patch allows it to currently work on other mm's and
detects the timeout in a different way, with jiffies instead of an
iterator.
I'd love a solution where we can reliably detect an oom livelock and oom
kill additional processes but only after the original victim has had a
chance to do exit_mmap() without a timeout, but I don't see one being
offered. Given Tetsuo has seen issues with this in the past and suggested
a similar proposal means we are not the only ones feeling pain from this.
> > I'm open to hearing any other suggestions that you have other than waiting
> > some time period before MMF_OOM_SKIP gets set to solve this problem.
>
> I've already offered one. Make mlocked pages reapable.
Making mlocked pages reapable would only solve the most trivial reproducer
of this. Unless the oom reaper can guarantee that it will never block and
can free all memory that exit_mmap() can free, we need to ensure that a
victim has a chance to reach the exit path on its own before killing every
other process on the system.
I'll fix the issue I identified with doing list_add_tail() rather than
list_add(), fix up the commit message per Tetsuo to identify the other
possible ways this can occur other than mlock, remove the rfc tag, and
repost.
On Thu 31-05-18 14:16:34, David Rientjes wrote:
> On Thu, 31 May 2018, Michal Hocko wrote:
>
> > > It's not a random timeout, it's sufficiently long such that we don't oom
> > > kill several processes needlessly in the very rare case where oom livelock
> > > would actually prevent the original victim from exiting. The oom reaper
> > > processing an mm, finding everything to be mlocked, and immediately
> > > MMF_OOM_SKIP is inappropriate. This is rather trivial to reproduce for a
> > > large memory hogging process that mlocks all of its memory; we
> > > consistently see spurious and unnecessary oom kills simply because the oom
> > > reaper has set MMF_OOM_SKIP very early.
> >
> > It takes quite some additional steps for admin to allow a large amount
> > of mlocked memory and such an application should be really careful to
> > not consume too much memory. So how come this is something you see that
> > consistently? Is this some sort of bug or an unfortunate workload side
> > effect? I am asking this because I really want to see how relevant this
> > really is.
> >
>
> The bug is that the oom reaper sets MMF_OOM_SKIP almost immediately after
> the victim has been chosen for oom kill and we get follow-up oom kills,
> not that the process is able to mlock a large amount of memory. Mlock
> here is only being discussed as a single example. Tetsuo has brought up
> the example of all shared file-backed memory.
How is such a case even possible? File backed memory is reclaimable and
as such should be gone by the time we hit the OOM killer. If that is not
the case then I fail how wait slightly longer helps anything.
> We've discussed the mm
> having a single blockable mmu notifier. Regardless of how we arrive at
> the point where the oom reaper can't free memory, which could be any of
> those three cases, if (1) the original victim is sufficiently large that
> follow-up oom kills would become unnecessary and (2) other threads
> allocate/charge before the oom victim reaches exit_mmap(), this occurs.
>
> We have examples of cases where oom reaping was successful, but the rss
> numbers in the kernel log are very similar to when it was oom killed and
> the process is known not to mlock, the reason is because the oom reaper
> could free very little memory due to blockable mmu notifiers.
Please be more specific. Which notifiers these were. Blockable notifiers
are a PITA and we should be addressing them. That requiers identifying
them first.
> > But the waiting periods just turn out to be a really poor design. There
> > will be no good timeout to fit for everybody. We can do better and as
> > long as this is the case the timeout based solution should be really
> > rejected. It is a shortcut that doesn't really solve the underlying
> > problem.
> >
>
> The current implementation is a timeout based solution for mmap_sem, it
> just has the oom reaper spinning trying to grab the sem and eventually
> gives up. This patch allows it to currently work on other mm's and
> detects the timeout in a different way, with jiffies instead of an
> iterator.
And I argue that anything timeout based is just broken by design. Trying
n times will at least give you a consistent behavior. Retrying on mmap
sem makes sense because the lock might be taken for a short time.
Retrying on a memory oom reaper doesn't reclaim is just pointless
waiting for somebody else doing the work. See the difference?
> I'd love a solution where we can reliably detect an oom livelock and oom
> kill additional processes but only after the original victim has had a
> chance to do exit_mmap() without a timeout, but I don't see one being
> offered. Given Tetsuo has seen issues with this in the past and suggested
> a similar proposal means we are not the only ones feeling pain from this.
Tetsuo is doing an artificial stress test which doesn't resemble any
reasonable workload. This is good to catch different corner cases but
nothing even close to base any design on. I will definitely nack any
attempt to add a timeout based solution based on such a non-realistic
tests. If we have realistic workloads then try to address them and
resort to any timeout or other hacks as the last option.
> > > I'm open to hearing any other suggestions that you have other than waiting
> > > some time period before MMF_OOM_SKIP gets set to solve this problem.
> >
> > I've already offered one. Make mlocked pages reapable.
>
> Making mlocked pages reapable would only solve the most trivial reproducer
> of this. Unless the oom reaper can guarantee that it will never block and
> can free all memory that exit_mmap() can free, we need to ensure that a
> victim has a chance to reach the exit path on its own before killing every
> other process on the system.
>
> I'll fix the issue I identified with doing list_add_tail() rather than
> list_add(), fix up the commit message per Tetsuo to identify the other
> possible ways this can occur other than mlock, remove the rfc tag, and
> repost.
As I've already said. I will nack any timeout based solution until we
address all particular problems and still see more to come. Here we have
a clear goal. Address mlocked pages and identify mmu notifier offenders.
--
Michal Hocko
SUSE Labs
FYI, we noticed the following commit (built with gcc-7):
commit: 2d251ff6e66d7978b3e7a9c69e99b7150de26926 ("mm, oom: fix unnecessary killing of additional processes")
url: https://github.com/0day-ci/linux/commits/David-Rientjes/mm-oom-fix-unnecessary-killing-of-additional-processes/20180527-033815
base: git://git.cmpxchg.org/linux-mmotm.git master
in testcase: boot
on test machine: qemu-system-i386 -enable-kvm -cpu Haswell,+smep,+smap -m 360M
caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
+------------------------------------------------------------------+------------+------------+
| | 0b018d19da | 2d251ff6e6 |
+------------------------------------------------------------------+------------+------------+
| boot_successes | 4 | 0 |
| boot_failures | 4 | 4 |
| invoked_oom-killer:gfp_mask=0x | 4 | 2 |
| Mem-Info | 4 | 4 |
| Out_of_memory:Kill_process | 4 | 4 |
| Kernel_panic-not_syncing:Out_of_memory_and_no_killable_processes | 4 | |
| BUG:unable_to_handle_kernel | 0 | 4 |
| Oops:#[##] | 0 | 4 |
| EIP:oom_reaper | 0 | 4 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 4 |
+------------------------------------------------------------------+------------+------------+
[ 11.063029] BUG: unable to handle kernel NULL pointer dereference at 00000204
[ 11.064104] *pde = 00000000
[ 11.064548] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 11.065328] CPU: 0 PID: 21 Comm: oom_reaper Not tainted 4.17.0-rc5-mm1-00218-g2d251ff #1
[ 11.066537] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1 04/01/2014
[ 11.067786] EIP: oom_reaper+0x115/0x296
[ 11.068374] Code: 01 00 00 bb 54 92 65 7e c7 05 ac a8 65 7e 00 02 00 00 b8 80 a8 65 7e e8 9c a2 63 00 85 db 0f 84 fa fe ff ff 8b 83 ac 04 00 00 <8b> b0 04 02 00 00 8b 86 c8 02 00 00 0f ba e0 15 0f 82 55 01 00 00
[ 11.071210] EAX: 00000000 EBX: 7e659254 ECX: 00000001 EDX: 00000000
[ 11.072142] ESI: 78090960 EDI: 7816b500 EBP: 78167f8c ESP: 78167f68
[ 11.073073] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00010202
[ 11.074075] CR0: 80050033 CR2: 00000204 CR3: 033d7000 CR4: 00040690
[ 11.074994] Call Trace:
[ 11.075380] ? wait_woken+0x75/0x75
[ 11.075911] ? kthread+0xef/0xf4
[ 11.076413] ? __oom_reap_task_mm+0x6f/0x6f
[ 11.077036] ? kthread_create_on_node+0x1a/0x1a
[ 11.077718] ? ret_from_fork+0x19/0x24
[ 11.078286] Modules linked in:
[ 11.078747] CR2: 0000000000000204
[ 11.079253] ---[ end trace 881b7ebfce401a98 ]---
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Xiaolong
On Fri, 1 Jun 2018, Michal Hocko wrote:
> > We've discussed the mm
> > having a single blockable mmu notifier. Regardless of how we arrive at
> > the point where the oom reaper can't free memory, which could be any of
> > those three cases, if (1) the original victim is sufficiently large that
> > follow-up oom kills would become unnecessary and (2) other threads
> > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> >
> > We have examples of cases where oom reaping was successful, but the rss
> > numbers in the kernel log are very similar to when it was oom killed and
> > the process is known not to mlock, the reason is because the oom reaper
> > could free very little memory due to blockable mmu notifiers.
>
> Please be more specific. Which notifiers these were. Blockable notifiers
> are a PITA and we should be addressing them. That requiers identifying
> them first.
>
The most common offender seems to be ib_umem_notifier, but I have also
heard of possible occurrences for mv_invl_range_start() for xen, but that
would need more investigation. The rather new invalidate_range callback
for hmm mirroring could also be problematic. Any mmu_notifier without
MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.
For this reason, we see testing harnesses often oom killed immediately
after running a unittest that stresses reclaim or compaction by inducing a
system-wide oom condition. The harness spawns the unittest which spawns
an antagonist memory hog that is intended to be oom killed. When memory
is mlocked or there are a large number of threads faulting memory for the
antagonist, the unittest and the harness itself get oom killed because the
oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
The memory hog has mm->mmap_sem readers queued ahead of a writer that is
doing mmap() so the oom reaper can't grab the sem quickly enough.
I agree that blockable mmu notifiers are a pain, but until such time as
all can implicitly be MMU_INVALIDATE_DOES_NOT_BLOCK, the oom reaper can
free all mlocked memory, and the oom reaper waits long enough to grab
mm->mmap_sem for stalled mm->mmap_sem readers, we need a solution that
won't oom kill everything running on the system. I have doubts we'll ever
reach a point where the oom reaper can do the equivalent of exit_mmap(),
but it's possible to help solve the immediate issue of all oom kills
killing many innocent processes while working in a direction to make oom
reaping more successful at freeing memory.
> > The current implementation is a timeout based solution for mmap_sem, it
> > just has the oom reaper spinning trying to grab the sem and eventually
> > gives up. This patch allows it to currently work on other mm's and
> > detects the timeout in a different way, with jiffies instead of an
> > iterator.
>
> And I argue that anything timeout based is just broken by design. Trying
> n times will at least give you a consistent behavior.
It's not consistent, we see wildly inconsistent results especially on
power because it depends on the number of queued readers of mm->mmap_sem
ahead of a writer until such time that a thread doing mmap() can grab it,
drop it, and allow the oom reaper to grab it for read. It's so
inconsistent that we can see the oom reaper successfully grab the sem for
an oom killed memory hog with 128 faulting threads, and see it fail with 4
faulting threads.
> Retrying on mmap
> sem makes sense because the lock might be taken for a short time.
It isn't a function of how long mmap_sem is taken for write, it's a
function of how many readers are ahead of the queued writer. We don't run
with thp defrag set to "always" under standard configurations, but users
of MADV_HUGEPAGE or configs where defrag is set to "always" can
consistently cause any number of additional processes to be oom killed
unnecessarily because the readers are performing compaction and the writer
is queued behind it.
> > I'd love a solution where we can reliably detect an oom livelock and oom
> > kill additional processes but only after the original victim has had a
> > chance to do exit_mmap() without a timeout, but I don't see one being
> > offered. Given Tetsuo has seen issues with this in the past and suggested
> > a similar proposal means we are not the only ones feeling pain from this.
>
> Tetsuo is doing an artificial stress test which doesn't resemble any
> reasonable workload.
Tetsuo's test cases caught the CVE on powerpc which could trivially
panic the system if configured to panic on any oops and required a
security fix because it made it easy for any user doing a large mlock.
His test case here is trivial to reproduce on powerpc and causes several
additional processes to be oom killed. It's not artificial, I see many
test harnesses killed *nightly* because a memory hog is faulting with a
large number of threads and two or three other threads are doing mmap().
No mlock.
> > Making mlocked pages reapable would only solve the most trivial reproducer
> > of this. Unless the oom reaper can guarantee that it will never block and
> > can free all memory that exit_mmap() can free, we need to ensure that a
> > victim has a chance to reach the exit path on its own before killing every
> > other process on the system.
> >
> > I'll fix the issue I identified with doing list_add_tail() rather than
> > list_add(), fix up the commit message per Tetsuo to identify the other
> > possible ways this can occur other than mlock, remove the rfc tag, and
> > repost.
>
> As I've already said. I will nack any timeout based solution until we
> address all particular problems and still see more to come. Here we have
> a clear goal. Address mlocked pages and identify mmu notifier offenders.
I cannot fix all mmu notifiers to not block, I can't fix the configuration
to allow direct compaction for thp allocations and a large number of
concurrent faulters, and I cannot fix userspace mlocking a lot of memory.
It's worthwhile to work in that direction, but it will never be 100%
possible to avoid. We must have a solution that prevents innocent
processes from consistently being oom killed completely unnecessarily.
On Mon 04-06-18 21:25:39, David Rientjes wrote:
> On Fri, 1 Jun 2018, Michal Hocko wrote:
>
> > > We've discussed the mm
> > > having a single blockable mmu notifier. Regardless of how we arrive at
> > > the point where the oom reaper can't free memory, which could be any of
> > > those three cases, if (1) the original victim is sufficiently large that
> > > follow-up oom kills would become unnecessary and (2) other threads
> > > allocate/charge before the oom victim reaches exit_mmap(), this occurs.
> > >
> > > We have examples of cases where oom reaping was successful, but the rss
> > > numbers in the kernel log are very similar to when it was oom killed and
> > > the process is known not to mlock, the reason is because the oom reaper
> > > could free very little memory due to blockable mmu notifiers.
> >
> > Please be more specific. Which notifiers these were. Blockable notifiers
> > are a PITA and we should be addressing them. That requiers identifying
> > them first.
> >
>
> The most common offender seems to be ib_umem_notifier, but I have also
> heard of possible occurrences for mv_invl_range_start() for xen, but that
> would need more investigation. The rather new invalidate_range callback
> for hmm mirroring could also be problematic. Any mmu_notifier without
> MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to immediately be disregarded.
Yes, this is unfortunate and it was meant as a stop gap quick fix with a
long term vision to be fixed properly. I am pretty sure that we can do
much better here. Teach mmu_notifier_invalidate_range_start to get a
non-block flag and back out on ranges that would block. I am pretty sure
that notifiers can be targeted a lot and so we can still process some
vmas at least.
> For this reason, we see testing harnesses often oom killed immediately
> after running a unittest that stresses reclaim or compaction by inducing a
> system-wide oom condition. The harness spawns the unittest which spawns
> an antagonist memory hog that is intended to be oom killed. When memory
> is mlocked or there are a large number of threads faulting memory for the
> antagonist, the unittest and the harness itself get oom killed because the
> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
> doing mmap() so the oom reaper can't grab the sem quickly enough.
How come the writer doesn't back off. mmap paths should be taking an
exclusive mmap sem in killable sleep so it should back off. Or is the
holder of the lock deep inside mmap path doing something else and not
backing out with the exclusive lock held?
[...]
> > As I've already said. I will nack any timeout based solution until we
> > address all particular problems and still see more to come. Here we have
> > a clear goal. Address mlocked pages and identify mmu notifier offenders.
>
> I cannot fix all mmu notifiers to not block, I can't fix the configuration
> to allow direct compaction for thp allocations and a large number of
> concurrent faulters, and I cannot fix userspace mlocking a lot of memory.
> It's worthwhile to work in that direction, but it will never be 100%
> possible to avoid. We must have a solution that prevents innocent
> processes from consistently being oom killed completely unnecessarily.
None of the above has been attempted and shown not worth doing. The oom
even should be a rare thing to happen so I absolutely do not see any
reason to rush any misdesigned fix to be done right now.
--
Michal Hocko
SUSE Labs
On 2018/06/05 17:57, Michal Hocko wrote:
>> For this reason, we see testing harnesses often oom killed immediately
>> after running a unittest that stresses reclaim or compaction by inducing a
>> system-wide oom condition. The harness spawns the unittest which spawns
>> an antagonist memory hog that is intended to be oom killed. When memory
>> is mlocked or there are a large number of threads faulting memory for the
>> antagonist, the unittest and the harness itself get oom killed because the
>> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
>> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
>> doing mmap() so the oom reaper can't grab the sem quickly enough.
>
> How come the writer doesn't back off. mmap paths should be taking an
> exclusive mmap sem in killable sleep so it should back off. Or is the
> holder of the lock deep inside mmap path doing something else and not
> backing out with the exclusive lock held?
>
Here is an example where the writer doesn't back off.
http://lkml.kernel.org/r/[email protected]
down_write_killable(&mm->mmap_sem) is nothing but increasing the possibility of
successfully back off. There is no guarantee that the owner of that exclusive
mmap sem will not be blocked by other unkillable waits.
On Wed 13-06-18 22:20:49, Tetsuo Handa wrote:
> On 2018/06/05 17:57, Michal Hocko wrote:
> >> For this reason, we see testing harnesses often oom killed immediately
> >> after running a unittest that stresses reclaim or compaction by inducing a
> >> system-wide oom condition. The harness spawns the unittest which spawns
> >> an antagonist memory hog that is intended to be oom killed. When memory
> >> is mlocked or there are a large number of threads faulting memory for the
> >> antagonist, the unittest and the harness itself get oom killed because the
> >> oom reaper sets MMF_OOM_SKIP; this ends up happening a lot on powerpc.
> >> The memory hog has mm->mmap_sem readers queued ahead of a writer that is
> >> doing mmap() so the oom reaper can't grab the sem quickly enough.
> >
> > How come the writer doesn't back off. mmap paths should be taking an
> > exclusive mmap sem in killable sleep so it should back off. Or is the
> > holder of the lock deep inside mmap path doing something else and not
> > backing out with the exclusive lock held?
> >
>
> Here is an example where the writer doesn't back off.
>
> http://lkml.kernel.org/r/[email protected]
>
> down_write_killable(&mm->mmap_sem) is nothing but increasing the possibility of
> successfully back off. There is no guarantee that the owner of that exclusive
> mmap sem will not be blocked by other unkillable waits.
but we are talking about mmap() path here. Sure there are other paths
which might need a back off while the lock is held and that should be
addressed if possible but this is not really related to what David wrote
above and I tried to understand.
--
Michal Hocko
SUSE Labs
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm. This can happen for a variety of reasons,
including:
- the inability to grab mm->mmap_sem in a sufficient amount of time,
- when the mm has blockable mmu notifiers that could cause the oom reaper
to stall indefinitely,
but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:
- when the mm's memory is fully mlocked.
When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily. This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.
This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap().
We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.
To fix this, this patch introduces a per-mm reaping timeout, initially set
at 10s. It requires that the oom reaper's list becomes a properly linked
list so that other mm's may be reaped while waiting for an mm's timeout to
expire.
This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers. It extends it with
timeout to allow an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.
The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.
The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.
The reaping timeout is intentionally set for a substantial amount of time
since oom livelock is a very rare occurrence and it's better to optimize
for preventing additional (unnecessary) oom killing than a scenario that
is much more unlikely.
Signed-off-by: David Rientjes <[email protected]>
---
Note: I understand there is an objection based on timeout based delays.
This is currently the only possible way to avoid oom killing important
processes completely unnecessarily. If the oom reaper can someday free
all memory, including mlocked memory and those mm's with blockable mmu
notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
this can be removed. I do not believe any such guarantee is possible
and consider the massive killing of additional processes unnecessarily
to be a regression introduced by the oom reaper and its very quick
setting of MMF_OOM_SKIP to allow additional processes to be oom killed.
include/linux/mm_types.h | 4 ++
include/linux/sched.h | 2 +-
kernel/fork.c | 4 ++
mm/mmap.c | 12 ++---
mm/oom_kill.c | 112 ++++++++++++++++++++++-----------------
5 files changed, 79 insertions(+), 55 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,10 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_MMU
+ /* When to give up on oom reaping this mm */
+ unsigned long reap_timeout;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,7 +1163,7 @@ struct task_struct {
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
- struct task_struct *oom_reaper_list;
+ struct list_head oom_reap_list;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -835,6 +835,10 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
tsk->fail_nth = 0;
#endif
+#ifdef CONFIG_MMU
+ INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
+
return tsk;
free_stack:
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
if (unlikely(mm_is_oom_victim(mm))) {
/*
* Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_sem for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_sem is
- * dropped.
+ * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+ * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+ * guarantee that the oom reaper will not run on this mm again
+ * after mmap_sem is dropped.
*
* Nothing can be holding mm->mmap_sem here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm)
__oom_reap_task_mm(mm);
mutex_unlock(&oom_lock);
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ set_bit(MMF_UNSTABLE, &mm->flags);
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
@@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm)
unmap_vmas(&tlb, vma, 0, -1);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -476,7 +476,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
*/
static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
static DEFINE_SPINLOCK(oom_reaper_lock);
void __oom_reap_task_mm(struct mm_struct *mm)
@@ -519,10 +519,8 @@ void __oom_reap_task_mm(struct mm_struct *mm)
}
}
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
- bool ret = true;
-
/*
* We have to make sure to not race with the victim exit path
* and cause premature new oom victim selection:
@@ -540,9 +538,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
mutex_lock(&oom_lock);
if (!down_read_trylock(&mm->mmap_sem)) {
- ret = false;
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ goto out_oom;
}
/*
@@ -551,69 +548,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* TODO: we really want to get rid of this ugly hack and make sure that
* notifiers cannot block for unbounded amount of time
*/
- if (mm_has_blockable_invalidate_notifiers(mm)) {
- up_read(&mm->mmap_sem);
- schedule_timeout_idle(HZ);
- goto unlock_oom;
- }
+ if (mm_has_blockable_invalidate_notifiers(mm))
+ goto out_mm;
/*
- * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_OOM_SKIP must run
+ * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+ * work on the mm anymore. The check for MMF_UNSTABLE must run
* under mmap_sem for reading because it serializes against the
* down_write();up_write() cycle in exit_mmap().
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
- up_read(&mm->mmap_sem);
+ if (test_bit(MMF_UNSTABLE, &mm->flags)) {
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ goto out_mm;
}
trace_start_task_reaping(tsk->pid);
-
__oom_reap_task_mm(mm);
+ trace_finish_task_reaping(tsk->pid);
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
up_read(&mm->mmap_sem);
-
- trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
mutex_unlock(&oom_lock);
- return ret;
}
-#define MAX_OOM_REAP_RETRIES 10
static void oom_reap_task(struct task_struct *tsk)
{
- int attempts = 0;
struct mm_struct *mm = tsk->signal->oom_mm;
- /* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
- schedule_timeout_idle(HZ/10);
+ /*
+ * If this mm has either been fully unmapped, or the oom reaper has
+ * given up on it, nothing left to do except drop the refcount.
+ */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ goto drop;
- if (attempts <= MAX_OOM_REAP_RETRIES ||
- test_bit(MMF_OOM_SKIP, &mm->flags))
- goto done;
+ /*
+ * If this mm has already been reaped, doing so again will not likely
+ * free additional memory.
+ */
+ if (!test_bit(MMF_UNSTABLE, &mm->flags))
+ oom_reap_task_mm(tsk, mm);
- pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- debug_show_all_locks();
+ if (time_after_eq(jiffies, mm->reap_timeout)) {
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ debug_show_all_locks();
-done:
- tsk->oom_reaper_list = NULL;
+ /*
+ * Reaping has failed for the timeout period, so give up
+ * and allow additional processes to be oom killed.
+ */
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ }
+ goto drop;
+ }
- /*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ goto drop;
- /* Drop a reference taken by wake_oom_reaper */
+ /* Enqueue to be reaped again */
+ spin_lock(&oom_reaper_lock);
+ list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+ spin_unlock(&oom_reaper_lock);
+
+ schedule_timeout_idle(HZ/10);
+ return;
+
+drop:
+ /* Drop the reference taken by wake_oom_reaper */
put_task_struct(tsk);
}
@@ -622,11 +631,13 @@ static int oom_reaper(void *unused)
while (true) {
struct task_struct *tsk = NULL;
- wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+ wait_event_freezable(oom_reaper_wait,
+ !list_empty(&oom_reaper_list));
spin_lock(&oom_reaper_lock);
- if (oom_reaper_list != NULL) {
- tsk = oom_reaper_list;
- oom_reaper_list = tsk->oom_reaper_list;
+ if (!list_empty(&oom_reaper_list)) {
+ tsk = list_entry(oom_reaper_list.next,
+ struct task_struct, oom_reap_list);
+ list_del(&tsk->oom_reap_list);
}
spin_unlock(&oom_reaper_lock);
@@ -637,17 +648,22 @@ static int oom_reaper(void *unused)
return 0;
}
+/* How long to wait to oom reap an mm before selecting another process */
+#define OOM_REAP_TIMEOUT_MSECS (10 * 1000)
static void wake_oom_reaper(struct task_struct *tsk)
{
- /* tsk is already queued? */
- if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+ /*
+ * Set the reap timeout; if it's already set, the mm is enqueued and
+ * this tsk can be ignored.
+ */
+ if (cmpxchg(&tsk->signal->oom_mm->reap_timeout, 0UL,
+ jiffies + msecs_to_jiffies(OOM_REAP_TIMEOUT_MSECS)))
return;
get_task_struct(tsk);
spin_lock(&oom_reaper_lock);
- tsk->oom_reaper_list = oom_reaper_list;
- oom_reaper_list = tsk;
+ list_add(&tsk->oom_reap_list, &oom_reaper_list);
spin_unlock(&oom_reaper_lock);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
On Thu 14-06-18 13:42:59, David Rientjes wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm. This can happen for a variety of reasons,
> including:
>
> - the inability to grab mm->mmap_sem in a sufficient amount of time,
>
> - when the mm has blockable mmu notifiers that could cause the oom reaper
> to stall indefinitely,
>
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
>
> - when the mm's memory is fully mlocked.
>
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily. This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
>
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
>
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
>
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s. It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
>
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers. It extends it with
> timeout to allow an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
>
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
>
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
>
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.
>
> Signed-off-by: David Rientjes <[email protected]>
Nacked-by: Michal Hocko <[email protected]>
as already explained elsewhere in this email thread.
> ---
> Note: I understand there is an objection based on timeout based delays.
> This is currently the only possible way to avoid oom killing important
> processes completely unnecessarily. If the oom reaper can someday free
> all memory, including mlocked memory and those mm's with blockable mmu
> notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
> this can be removed. I do not believe any such guarantee is possible
> and consider the massive killing of additional processes unnecessarily
> to be a regression introduced by the oom reaper and its very quick
> setting of MMF_OOM_SKIP to allow additional processes to be oom killed.
If you find oom reaper more harmful than useful I would be willing to
ack a comman line option to disable it. Especially when you keep
claiming that the lockups are not really happening in your environment.
Other than that I've already pointed to a more robust solution. If you
are reluctant to try it out I will do, but introducing a timeout is just
papering over the real problem. Maybe we will not reach the state that
_all_ the memory is reapable but we definitely should try to make as
much as possible to be reapable and I do not see any fundamental
problems in that direction.
--
Michal Hocko
SUSE Labs
On Fri, 15 Jun 2018, Michal Hocko wrote:
> > Signed-off-by: David Rientjes <[email protected]>
>
> Nacked-by: Michal Hocko <[email protected]>
> as already explained elsewhere in this email thread.
>
I don't find this to be surprising, but I'm not sure that it actually
matters if you won't fix a regression that you introduced. Tetsuo
initially found this issue and presented a similar solution, so I think
his feedback on this is more important since it would fix a problem for
him as well.
> > ---
> > Note: I understand there is an objection based on timeout based delays.
> > This is currently the only possible way to avoid oom killing important
> > processes completely unnecessarily. If the oom reaper can someday free
> > all memory, including mlocked memory and those mm's with blockable mmu
> > notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
> > this can be removed. I do not believe any such guarantee is possible
> > and consider the massive killing of additional processes unnecessarily
> > to be a regression introduced by the oom reaper and its very quick
> > setting of MMF_OOM_SKIP to allow additional processes to be oom killed.
>
> If you find oom reaper more harmful than useful I would be willing to
> ack a comman line option to disable it. Especially when you keep
> claiming that the lockups are not really happening in your environment.
>
There's no need to disable it, we simply need to ensure that it doesn't
set MMF_OOM_SKIP too early, which my patch does. We also need to avoid
setting MMF_OOM_SKIP in exit_mmap() until after all memory has been freed,
i.e. after free_pgtables().
I'd be happy to make the this timeout configurable, however, and default
it to perhaps one second as the blockable mmu notifier timeout in your own
code does. I find it somewhat sad that we'd need a sysctl for this, but
if that will appease you and it will help to move this into -mm then we
can do that.
> Other than that I've already pointed to a more robust solution. If you
> are reluctant to try it out I will do, but introducing a timeout is just
> papering over the real problem. Maybe we will not reach the state that
> _all_ the memory is reapable but we definitely should try to make as
> much as possible to be reapable and I do not see any fundamental
> problems in that direction.
You introduced the timeout already, I'm sure you realized yourself that
the oom reaper sets MMF_OOM_SKIP much too early. Trying to grab
mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout.
If there are blockable mmu notifiers, your code puts the oom reaper to
sleep for HZ before setting MMF_OOM_SKIP, which is a timeout. This patch
moves the timeout to reaching exit_mmap() where we actually free all
memory possible and still allow for additional oom killing if there is a
very rare oom livelock.
You haven't provided any data that suggests oom livelocking isn't a very
rare event and that we need to respond immediately by randomly killing
more and more processes rather than wait a bounded period of time to allow
for forward progress to be made. I have constantly provided data showing
oom livelock in our fleet is extremely rare, less than 0.04% of the time.
Yet your solution is to kill many processes so this 0.04% is fast.
The reproducer on powerpc is very simple. Do an mmap() and mlock() the
length. Fork one 120MB process that does that and two 60MB processes that
do that in a 128MB memcg.
[ 402.064375] Killed process 17024 (a.out) total-vm:134080kB, anon-rss:122032kB, file-rss:1600kB
[ 402.107521] Killed process 17026 (a.out) total-vm:64448kB, anon-rss:44736kB, file-rss:1600kB
Completely reproducible and completely unnecessary. Killing two processes
pointlessly when the first oom kill would have been successful.
Killing processes is important, optimizing for 0.04% of cases of true oom
livelock by insisting everybody tolerate excessive oom killing is not. If
you have data to suggest the 0.04% is higher, please present it. I'd be
interested in any data you have that suggests its higher and has even
1/1,000,000th oom occurrence rate that I have shown.
It's inappropriate to merge code that oom kills many processes
unnecessarily when one happens to be mlocked or have blockable mmu
notifiers or when mm->mmap_sem can't be grabbed fast enough but forward
progress is actually being made. It's a regression, and it impacts real
users. Insisting that we fix the problem you introduced by making all mmu
notifiers unblockable and mlocked memory can always be reaped and
mm->mmap_sem can always be grabbed within a second is irresponsible.
On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <[email protected]> wrote:
> The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> it cannot reap an mm. This can happen for a variety of reasons,
> including:
>
> - the inability to grab mm->mmap_sem in a sufficient amount of time,
>
> - when the mm has blockable mmu notifiers that could cause the oom reaper
> to stall indefinitely,
Maybe we should have more than one oom reaper thread? I assume the
probability of the oom reaper thread blocking on an mmu notifier is
small, so perhaps just dive in and hope for the best. If the oom
reaper gets stuck then there's another thread ready to take over. And
revisit the decision to use a kernel thread instead of workqueues.
> but we can also add a third when the oom reaper can "reap" an mm but doing
> so is unlikely to free any amount of memory:
>
> - when the mm's memory is fully mlocked.
>
> When all memory is mlocked, the oom reaper will not be able to free any
> substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> unmap and free its memory in exit_mmap() and subsequent oom victims are
> chosen unnecessarily. This is trivial to reproduce if all eligible
> processes on the system have mlocked their memory: the oom killer calls
> panic() even though forward progress can be made.
>
> This is the same issue where the exit path sets MMF_OOM_SKIP before
> unmapping memory and additional processes can be chosen unnecessarily
> because the oom killer is racing with exit_mmap().
So what's actually happening here. A process has a large amount of
mlocked memory, it has been oom-killed and it is in the process of
releasing its memory and exiting, yes?
If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we
just patiently waiting for its attempt to release meory?
> We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> a true oom livelock in progress, it never gets set and no additional
> killing is possible.
I guess that's my answer. What causes this livelock? Process looping
in alloc_pages while holding a lock the oom victim wants?
> To fix this, this patch introduces a per-mm reaping timeout, initially set
> at 10s. It requires that the oom reaper's list becomes a properly linked
> list so that other mm's may be reaped while waiting for an mm's timeout to
> expire.
>
> This replaces the current timeouts in the oom reaper: (1) when trying to
> grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> a HZ sleep if there are blockable mmu notifiers. It extends it with
> timeout to allow an oom victim to reach exit_mmap() before choosing
> additional processes unnecessarily.
>
> The exit path will now set MMF_OOM_SKIP only after all memory has been
> freed, so additional oom killing is justified,
That seems sensible, but why set MMF_OOM_SKIP at all?
> and rely on MMF_UNSTABLE to
> determine when it can race with the oom reaper.
>
> The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> lapsed because it can no longer guarantee forward progress.
>
> The reaping timeout is intentionally set for a substantial amount of time
> since oom livelock is a very rare occurrence and it's better to optimize
> for preventing additional (unnecessary) oom killing than a scenario that
> is much more unlikely.
What happened to the old idea of permitting the task which is blocking
the oom victim to access additional reserves?
Come to that, what happened to the really really old Andreaidea of not
looping in the page allocator anyway? Return NULL instead...
I dunno, I'm thrashing around here. We seem to be piling mess on top
of mess and then being surprised that the result is a mess.
> +#ifdef CONFIG_MMU
> + /* When to give up on oom reaping this mm */
> + unsigned long reap_timeout;
"timeout" implies "interval". To me, anyway. This is an absolute
time, so something like reap_time would be clearer. Along with a
comment explaining that the units are in jiffies.
> +#endif
> #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> #endif
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1163,7 +1163,7 @@ struct task_struct {
> #endif
> int pagefault_disabled;
> #ifdef CONFIG_MMU
> - struct task_struct *oom_reaper_list;
> + struct list_head oom_reap_list;
Can we have a comment explaining its locking.
> #endif
> #ifdef CONFIG_VMAP_STACK
> struct vm_struct *stack_vm_area;
>
> ...
>
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
> if (unlikely(mm_is_oom_victim(mm))) {
> /*
> * Manually reap the mm to free as much memory as possible.
> - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> - * this mm from further consideration. Taking mm->mmap_sem for
> - * write after setting MMF_OOM_SKIP will guarantee that the oom
> - * reaper will not run on this mm again after mmap_sem is
> - * dropped.
> + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
> + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> + * guarantee that the oom reaper will not run on this mm again
> + * after mmap_sem is dropped.
Comment should explain *why* we don't want the reaper to run on this mm
again.
>
> ...
>
On Fri 15-06-18 16:15:39, David Rientjes wrote:
[...]
> I'd be happy to make the this timeout configurable, however, and default
> it to perhaps one second as the blockable mmu notifier timeout in your own
> code does. I find it somewhat sad that we'd need a sysctl for this, but
> if that will appease you and it will help to move this into -mm then we
> can do that.
No. This has been nacked in the past and I do not see anything different
from back than.
> > Other than that I've already pointed to a more robust solution. If you
> > are reluctant to try it out I will do, but introducing a timeout is just
> > papering over the real problem. Maybe we will not reach the state that
> > _all_ the memory is reapable but we definitely should try to make as
> > much as possible to be reapable and I do not see any fundamental
> > problems in that direction.
>
> You introduced the timeout already, I'm sure you realized yourself that
> the oom reaper sets MMF_OOM_SKIP much too early. Trying to grab
> mm->mmap_sem 10 times in a row with HZ/10 sleeps in between is a timeout.
Yes, it is. And it is a timeout based some some feedback. The lock is
held, let's retry later but do not retry for ever. We can do the same
with blockable mmu notifiers. We are currently giving up right away. I
was proposing to add can_sleep parameter to mmu_notifier_invalidate_range_start
and return it EAGAIN if it would block. This would allow to simply retry
on EAGAIN like we do for the mmap_sem.
[...]
> The reproducer on powerpc is very simple. Do an mmap() and mlock() the
> length. Fork one 120MB process that does that and two 60MB processes that
> do that in a 128MB memcg.
And again, to solve this we just need to teach oom_reaper to handle
mlocked memory. There shouldn't be any fundamental reason why this would
be impossible AFAICS. Timeout is not a solution!
[...]
> It's inappropriate to merge code that oom kills many processes
> unnecessarily when one happens to be mlocked or have blockable mmu
> notifiers or when mm->mmap_sem can't be grabbed fast enough but forward
> progress is actually being made. It's a regression, and it impacts real
> users. Insisting that we fix the problem you introduced by making all mmu
> notifiers unblockable and mlocked memory can always be reaped and
> mm->mmap_sem can always be grabbed within a second is irresponsible.
Well, a lack of real world bug reports doesn't really back your story
here. I have asked about non-artificial workloads suffering and your
responsive were quite nonspecific to say the least.
And I do insist to come with a reasonable solution rather than random
hacks. Jeez the oom killer was full of these.
As I've said, if you are not willing to work on a proper solution, I
will, but my nack holds for this patch until we see no other way around
existing and real world problems.
--
Michal Hocko
SUSE Labs
On Mon 18-06-18 17:27:33, Andrew Morton wrote:
> On Thu, 14 Jun 2018 13:42:59 -0700 (PDT) David Rientjes <[email protected]> wrote:
>
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm. This can happen for a variety of reasons,
> > including:
> >
> > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > to stall indefinitely,
>
> Maybe we should have more than one oom reaper thread? I assume the
> probability of the oom reaper thread blocking on an mmu notifier is
> small, so perhaps just dive in and hope for the best. If the oom
> reaper gets stuck then there's another thread ready to take over. And
> revisit the decision to use a kernel thread instead of workqueues.
Well, I think that having more threads would be wasteful for a rare
event like the oom. Creating one on demand could be tricky because we
are under strong memory pressure at the time and a new thread costs some
resoures.
> > but we can also add a third when the oom reaper can "reap" an mm but doing
> > so is unlikely to free any amount of memory:
> >
> > - when the mm's memory is fully mlocked.
> >
> > When all memory is mlocked, the oom reaper will not be able to free any
> > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > chosen unnecessarily. This is trivial to reproduce if all eligible
> > processes on the system have mlocked their memory: the oom killer calls
> > panic() even though forward progress can be made.
> >
> > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > unmapping memory and additional processes can be chosen unnecessarily
> > because the oom killer is racing with exit_mmap().
>
> So what's actually happening here. A process has a large amount of
> mlocked memory, it has been oom-killed and it is in the process of
> releasing its memory and exiting, yes?
>
> If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we
> just patiently waiting for its attempt to release meory?
Because the oom victim has no guarantee to proceed to exit and release
its own memory. OOM reaper jumps in and skip over mlocked ranges because
they require lock page and that egain cannot be taken from the oom
reaper path (the lock might be held and doing an allocation). This in
turn means that the oom_reaper doesn't free mlocked memory before it
sets MMF_OOM_SKIP which will allow a new oom victim to be selected.
At the time we merged the oom reaper this hasn't been seen as a major
issue because tasks usually do not consume a lot of mlocked memory and
there is always some other memory to tear down and help to relief the
memory pressure. mlockall oom victim were deemed unlikely because they
need a large rlimit and as such it should be trusted and therefore quite
safe from runaways. But there was definitely a plan to make mlocked
memory reapable. So time to do it finally.
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
>
> I guess that's my answer. What causes this livelock? Process looping
> in alloc_pages while holding a lock the oom victim wants?
Yes.
> > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > at 10s. It requires that the oom reaper's list becomes a properly linked
> > list so that other mm's may be reaped while waiting for an mm's timeout to
> > expire.
> >
> > This replaces the current timeouts in the oom reaper: (1) when trying to
> > grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
> > a HZ sleep if there are blockable mmu notifiers. It extends it with
> > timeout to allow an oom victim to reach exit_mmap() before choosing
> > additional processes unnecessarily.
> >
> > The exit path will now set MMF_OOM_SKIP only after all memory has been
> > freed, so additional oom killing is justified,
>
> That seems sensible, but why set MMF_OOM_SKIP at all?
MMF_OOM_SKIP is a way to say that the task should be skipped from OOM
victims evaluation.
> > and rely on MMF_UNSTABLE to
> > determine when it can race with the oom reaper.
> >
> > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> > lapsed because it can no longer guarantee forward progress.
> >
> > The reaping timeout is intentionally set for a substantial amount of time
> > since oom livelock is a very rare occurrence and it's better to optimize
> > for preventing additional (unnecessary) oom killing than a scenario that
> > is much more unlikely.
>
> What happened to the old idea of permitting the task which is blocking
> the oom victim to access additional reserves?
How do you find such a task?
> Come to that, what happened to the really really old Andreaidea of not
> looping in the page allocator anyway? Return NULL instead...
Nacked by Linus because too-small-to-fail is a long term semantic that
cannot change easily. We do not have any way to audit syscall paths to
not return ENOMEM when inappropriate.
> I dunno, I'm thrashing around here. We seem to be piling mess on top
> of mess and then being surprised that the result is a mess.
Are we? The current oom_reaper certainly has some shortcomings that
are addressable. We have started simple to cover most cases and move
on with more complex heuristics based on real life bug reports. But we
_do_ have a quite straightforward feedback based algorithm to reclaim
oom victims. This is a solid ground for future development. Something we
never had before. So I am really wondering what is all the mess about.
--
Michal Hocko
SUSE Labs
On Mon, 18 Jun 2018, Andrew Morton wrote:
> > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > it cannot reap an mm. This can happen for a variety of reasons,
> > including:
> >
> > - the inability to grab mm->mmap_sem in a sufficient amount of time,
> >
> > - when the mm has blockable mmu notifiers that could cause the oom reaper
> > to stall indefinitely,
>
> Maybe we should have more than one oom reaper thread? I assume the
> probability of the oom reaper thread blocking on an mmu notifier is
> small, so perhaps just dive in and hope for the best. If the oom
> reaper gets stuck then there's another thread ready to take over. And
> revisit the decision to use a kernel thread instead of workqueues.
>
I'm not sure that we need more than one thread, per se, but we need the
ability to operate on more than one oom victim while deciding whether one
victim can be reaped or not. The current implementation only processes
one victim at a time: it tries to grab mm->mmap_sem, it sleeps, retries,
sleeps, etc. We need to try other oom victims (we do parallel memcg oom
stress testing, and the oom reaper can uncharge memory to a hierarchy that
prevents livelock as well), which my patch does.
> So what's actually happening here. A process has a large amount of
> mlocked memory, it has been oom-killed and it is in the process of
> releasing its memory and exiting, yes?
>
That's one failure mode, yes, and three possible ways:
- the oom reaper immediately sets MMF_OOM_SKIP because it tried to free
memory and completely failed, so it actually declares this as a success
and sets MMF_OOM_SKIP assuming memory was freed, which wasn't,
- to avoid CVE-2018-1000200 exit_mmap() must set MMF_OOM_SKIP before
doing munlock_vma_pages_all() which the oom reaper uses to determine if
it can safely operate on a vma, so the exit path sets MMF_OOM_SKIP
before any possible memory freeing as well, and
- the previous iteration of the oom reaper to set MMF_OOM_SKIP between
unmap_vmas() and free_pgtables() suffered from the same problem for
large amounts of virtual memory whereas subsequent oom kill could have
been prevented if free_pgtables() could have completed.
My patch fixes all these issues because MMF_OOM_SKIP only gets set after
free_pgtables(), i.e. no additional memory freeing is possible through
exit_mmap(), or a process has failed to exit for 10s by the oom reaper. I
will patch this to make the timeout configurable. I use the existing
MMF_UNSTABLE to determine if the oom reaper can safely operate on vmas of
the mm.
> If so, why does this task set MMF_OOM_SKIP on itself? Why aren't we
> just patiently waiting for its attempt to release meory?
>
That's what my patch does, yes, it needs to wait to ensure forward
progress is not being made before setting MMF_OOM_SKIP and allowing all
other processes on the system to be oom killed. Taken to an extreme,
imagine a single large mlocked process or one with a blockable mmu
notifier taking up almost all memory on a machine. If there is a memory
leak, it will be oom killed same as it always has been. The difference
now is that the machine panic()'s because MMF_OOM_SKIP is set with no
memory freeing and the oom killer finds no more eligible processes so its
only alternative is panicking.
> > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > a true oom livelock in progress, it never gets set and no additional
> > killing is possible.
>
> I guess that's my answer. What causes this livelock? Process looping
> in alloc_pages while holding a lock the oom victim wants?
>
That's one way, yes, the other is to be charging memory in the mem cgroup
path while holding a mutex the victim wants. If additional kmem will
start being charged to mem cgroup hierarchies and the oom killer is called
synchronously in the charge path (there is no fault path to unwind to),
which has been discussed, this problem will become much more prolific.
> > The exit path will now set MMF_OOM_SKIP only after all memory has been
> > freed, so additional oom killing is justified,
>
> That seems sensible, but why set MMF_OOM_SKIP at all?
>
The oom reaper will eventually need to set it if its actually livelocked,
which happens extremely rarely in practice, because the oom reaper was
unable to free memory such that an allocator holding our mutex could
successfully allocate. It sets it immediately now for mlocked processes
(it doesn't realize it didn't free a single page). It retries 10 times to
grab mm->mmap_sem and sets it after one second if it fails. If it has a
blockable mmu notifier it sleeps for a second and sets it. I'm replacing
all the current timeouts with a per-mm timeout and volunteering to make it
configurable so that it can be disabled or set to 10s as preferred by us
because we are tired of every process getting oom killed pointlessly.
I'll suggest a default of 1s to match the timeouts currently implemented
in the oom reaper and generalize them to be per-mm.
> > and rely on MMF_UNSTABLE to
> > determine when it can race with the oom reaper.
> >
> > The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
> > lapsed because it can no longer guarantee forward progress.
> >
> > The reaping timeout is intentionally set for a substantial amount of time
> > since oom livelock is a very rare occurrence and it's better to optimize
> > for preventing additional (unnecessary) oom killing than a scenario that
> > is much more unlikely.
>
> What happened to the old idea of permitting the task which is blocking
> the oom victim to access additional reserves?
>
That is an alternative to the oom reaper and worked quite successfully for
us. We'd detect when a process was looping endlessly waiting for the same
victim to exit and then grant it access to additional reserves,
specifically to detect oom livelock scenarios. The oom reaper should
theoretically make this extremely rare since it normally can free *some*
memory so we aren't oom anymore and allocators holding mutexes can
succeed.
> > +#ifdef CONFIG_MMU
> > + /* When to give up on oom reaping this mm */
> > + unsigned long reap_timeout;
>
> "timeout" implies "interval". To me, anyway. This is an absolute
> time, so something like reap_time would be clearer. Along with a
> comment explaining that the units are in jiffies.
>
Ack.
> > +#endif
> > #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
> > pgtable_t pmd_huge_pte; /* protected by page_table_lock */
> > #endif
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1163,7 +1163,7 @@ struct task_struct {
> > #endif
> > int pagefault_disabled;
> > #ifdef CONFIG_MMU
> > - struct task_struct *oom_reaper_list;
> > + struct list_head oom_reap_list;
>
> Can we have a comment explaining its locking.
>
Ok.
> > #endif
> > #ifdef CONFIG_VMAP_STACK
> > struct vm_struct *stack_vm_area;
> >
> > ...
> >
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
> > if (unlikely(mm_is_oom_victim(mm))) {
> > /*
> > * Manually reap the mm to free as much memory as possible.
> > - * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
> > - * this mm from further consideration. Taking mm->mmap_sem for
> > - * write after setting MMF_OOM_SKIP will guarantee that the oom
> > - * reaper will not run on this mm again after mmap_sem is
> > - * dropped.
> > + * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
> > + * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
> > + * guarantee that the oom reaper will not run on this mm again
> > + * after mmap_sem is dropped.
>
> Comment should explain *why* we don't want the reaper to run on this mm
> again.
>
Sounds good.
On Tue 19-06-18 10:33:16, Michal Hocko wrote:
[...]
> As I've said, if you are not willing to work on a proper solution, I
> will, but my nack holds for this patch until we see no other way around
> existing and real world problems.
OK, so I gave it a quick try and it doesn't look all that bad to me.
This is only for blockable mmu notifiers. I didn't really try to
address all the problems down the road - I mean some of the blocking
notifiers can check the range in their interval tree without blocking
locks. It is quite probable that only few ranges will be of interest,
right?
So this is only to give an idea about the change. It probably even
doesn't compile. Does that sound sane?
---
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 6bcecc325e7e..ac08f5d711be 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
}
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
- unsigned long start, unsigned long end)
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+ unsigned long start, unsigned long end,
+ bool blockable)
{
unsigned long apic_address;
@@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
if (start <= apic_address && apic_address < end)
kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
+
+ return 0;
}
void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
diff --git a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
index 83e344fbb50a..d138a526feff 100644
--- a/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
+++ b/drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c
@@ -136,12 +136,18 @@ void amdgpu_mn_unlock(struct amdgpu_mn *mn)
*
* Take the rmn read side lock.
*/
-static void amdgpu_mn_read_lock(struct amdgpu_mn *rmn)
+static int amdgpu_mn_read_lock(struct amdgpu_mn *rmn, bool blockable)
{
- mutex_lock(&rmn->read_lock);
+ if (blockable)
+ mutex_lock(&rmn->read_lock);
+ else if (!mutex_trylock(&rmn->read_lock))
+ return -EAGAIN;
+
if (atomic_inc_return(&rmn->recursion) == 1)
down_read_non_owner(&rmn->lock);
mutex_unlock(&rmn->read_lock);
+
+ return 0;
}
/**
@@ -197,10 +203,11 @@ static void amdgpu_mn_invalidate_node(struct amdgpu_mn_node *node,
* We block for all BOs between start and end to be idle and
* unmap them by move them into system domain again.
*/
-static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
struct interval_tree_node *it;
@@ -208,7 +215,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
/* notification is exclusive, but interval is inclusive */
end -= 1;
- amdgpu_mn_read_lock(rmn);
+ /* TODO we should be able to split locking for interval tree and
+ * amdgpu_mn_invalidate_node
+ */
+ if (amdgpu_mn_read_lock(rmn, blockable))
+ return -EAGAIN;
it = interval_tree_iter_first(&rmn->objects, start, end);
while (it) {
@@ -219,6 +230,8 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
amdgpu_mn_invalidate_node(node, start, end);
}
+
+ return 0;
}
/**
@@ -233,10 +246,11 @@ static void amdgpu_mn_invalidate_range_start_gfx(struct mmu_notifier *mn,
* necessitates evicting all user-mode queues of the process. The BOs
* are restorted in amdgpu_mn_invalidate_range_end_hsa.
*/
-static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
+static int amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct amdgpu_mn *rmn = container_of(mn, struct amdgpu_mn, mn);
struct interval_tree_node *it;
@@ -244,7 +258,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
/* notification is exclusive, but interval is inclusive */
end -= 1;
- amdgpu_mn_read_lock(rmn);
+ if (amdgpu_mn_read_lock(rmn, blockable))
+ return -EAGAIN;
it = interval_tree_iter_first(&rmn->objects, start, end);
while (it) {
@@ -262,6 +277,8 @@ static void amdgpu_mn_invalidate_range_start_hsa(struct mmu_notifier *mn,
amdgpu_amdkfd_evict_userptr(mem, mm);
}
}
+
+ return 0;
}
/**
diff --git a/drivers/gpu/drm/i915/i915_gem_userptr.c b/drivers/gpu/drm/i915/i915_gem_userptr.c
index 854bd51b9478..5285df9331fa 100644
--- a/drivers/gpu/drm/i915/i915_gem_userptr.c
+++ b/drivers/gpu/drm/i915/i915_gem_userptr.c
@@ -112,10 +112,11 @@ static void del_object(struct i915_mmu_object *mo)
mo->attached = false;
}
-static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
+static int i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct i915_mmu_notifier *mn =
container_of(_mn, struct i915_mmu_notifier, mn);
@@ -124,7 +125,7 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
LIST_HEAD(cancelled);
if (RB_EMPTY_ROOT(&mn->objects.rb_root))
- return;
+ return 0;
/* interval ranges are inclusive, but invalidate range is exclusive */
end--;
@@ -152,7 +153,8 @@ static void i915_gem_userptr_mn_invalidate_range_start(struct mmu_notifier *_mn,
del_object(mo);
spin_unlock(&mn->lock);
- if (!list_empty(&cancelled))
+ /* TODO: can we skip waiting here? */
+ if (!list_empty(&cancelled) && blockable)
flush_workqueue(mn->wq);
}
diff --git a/drivers/gpu/drm/radeon/radeon_mn.c b/drivers/gpu/drm/radeon/radeon_mn.c
index abd24975c9b1..b47e828b725d 100644
--- a/drivers/gpu/drm/radeon/radeon_mn.c
+++ b/drivers/gpu/drm/radeon/radeon_mn.c
@@ -118,10 +118,11 @@ static void radeon_mn_release(struct mmu_notifier *mn,
* We block for all BOs between start and end to be idle and
* unmap them by move them into system domain again.
*/
-static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
+static int radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct radeon_mn *rmn = container_of(mn, struct radeon_mn, mn);
struct ttm_operation_ctx ctx = { false, false };
@@ -130,7 +131,13 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
/* notification is exclusive, but interval is inclusive */
end -= 1;
- mutex_lock(&rmn->lock);
+ /* TODO we should be able to split locking for interval tree and
+ * the tear down.
+ */
+ if (blockable)
+ mutex_lock(&rmn->lock);
+ else if (!mutex_trylock(&rmn->lock))
+ return -EAGAIN;
it = interval_tree_iter_first(&rmn->objects, start, end);
while (it) {
@@ -167,6 +174,8 @@ static void radeon_mn_invalidate_range_start(struct mmu_notifier *mn,
}
mutex_unlock(&rmn->lock);
+
+ return 0;
}
static const struct mmu_notifier_ops radeon_mn_ops = {
diff --git a/drivers/infiniband/core/umem_odp.c b/drivers/infiniband/core/umem_odp.c
index 182436b92ba9..f65f6a29daae 100644
--- a/drivers/infiniband/core/umem_odp.c
+++ b/drivers/infiniband/core/umem_odp.c
@@ -207,22 +207,29 @@ static int invalidate_range_start_trampoline(struct ib_umem *item, u64 start,
return 0;
}
-static void ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int ib_umem_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct ib_ucontext *context = container_of(mn, struct ib_ucontext, mn);
if (!context->invalidate_range)
- return;
+ return 0;
+
+ if (blockable)
+ down_read(&context->umem_rwsem);
+ else if (!down_read_trylock(&context->umem_rwsem))
+ return -EAGAIN;
ib_ucontext_notifier_start_account(context);
- down_read(&context->umem_rwsem);
rbt_ib_umem_for_each_in_range(&context->umem_tree, start,
end,
invalidate_range_start_trampoline, NULL);
up_read(&context->umem_rwsem);
+
+ return 0;
}
static int invalidate_range_end_trampoline(struct ib_umem *item, u64 start,
diff --git a/drivers/infiniband/hw/hfi1/mmu_rb.c b/drivers/infiniband/hw/hfi1/mmu_rb.c
index 70aceefe14d5..8780560d1623 100644
--- a/drivers/infiniband/hw/hfi1/mmu_rb.c
+++ b/drivers/infiniband/hw/hfi1/mmu_rb.c
@@ -284,10 +284,11 @@ void hfi1_mmu_rb_remove(struct mmu_rb_handler *handler,
handler->ops->remove(handler->ops_arg, node);
}
-static void mmu_notifier_range_start(struct mmu_notifier *mn,
+static int mmu_notifier_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct mmu_rb_handler *handler =
container_of(mn, struct mmu_rb_handler, mn);
@@ -313,6 +314,8 @@ static void mmu_notifier_range_start(struct mmu_notifier *mn,
if (added)
queue_work(handler->wq, &handler->del_work);
+
+ return 0;
}
/*
diff --git a/drivers/misc/mic/scif/scif_dma.c b/drivers/misc/mic/scif/scif_dma.c
index 63d6246d6dff..d940568bed87 100644
--- a/drivers/misc/mic/scif/scif_dma.c
+++ b/drivers/misc/mic/scif/scif_dma.c
@@ -200,15 +200,18 @@ static void scif_mmu_notifier_release(struct mmu_notifier *mn,
schedule_work(&scif_info.misc_work);
}
-static void scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int scif_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct scif_mmu_notif *mmn;
mmn = container_of(mn, struct scif_mmu_notif, ep_mmu_notifier);
scif_rma_destroy_tcw(mmn, start, end - start);
+
+ return 0
}
static void scif_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/misc/sgi-gru/grutlbpurge.c b/drivers/misc/sgi-gru/grutlbpurge.c
index a3454eb56fbf..be28f05bfafa 100644
--- a/drivers/misc/sgi-gru/grutlbpurge.c
+++ b/drivers/misc/sgi-gru/grutlbpurge.c
@@ -219,9 +219,10 @@ void gru_flush_all_tlb(struct gru_state *gru)
/*
* MMUOPS notifier callout functions
*/
-static void gru_invalidate_range_start(struct mmu_notifier *mn,
+static int gru_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ bool blockable)
{
struct gru_mm_struct *gms = container_of(mn, struct gru_mm_struct,
ms_notifier);
@@ -231,6 +232,8 @@ static void gru_invalidate_range_start(struct mmu_notifier *mn,
gru_dbg(grudev, "gms %p, start 0x%lx, end 0x%lx, act %d\n", gms,
start, end, atomic_read(&gms->ms_range_active));
gru_flush_tlb_range(gms, start, end - start);
+
+ return 0;
}
static void gru_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/drivers/xen/gntdev.c b/drivers/xen/gntdev.c
index bd56653b9bbc..50724d09fe5c 100644
--- a/drivers/xen/gntdev.c
+++ b/drivers/xen/gntdev.c
@@ -465,14 +465,20 @@ static void unmap_if_in_range(struct grant_map *map,
WARN_ON(err);
}
-static void mn_invl_range_start(struct mmu_notifier *mn,
+static int mn_invl_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
- unsigned long start, unsigned long end)
+ unsigned long start, unsigned long end,
+ bool blockable)
{
struct gntdev_priv *priv = container_of(mn, struct gntdev_priv, mn);
struct grant_map *map;
- mutex_lock(&priv->lock);
+ /* TODO do we really need a mutex here? */
+ if (blockable)
+ mutex_lock(&priv->lock);
+ else if (!mutex_trylock(&priv->lock))
+ return -EAGAIN;
+
list_for_each_entry(map, &priv->maps, next) {
unmap_if_in_range(map, start, end);
}
@@ -480,6 +486,8 @@ static void mn_invl_range_start(struct mmu_notifier *mn,
unmap_if_in_range(map, start, end);
}
mutex_unlock(&priv->lock);
+
+ return true;
}
static void mn_release(struct mmu_notifier *mn,
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 4ee7bc548a83..e4181063e755 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1275,7 +1275,7 @@ static inline long kvm_arch_vcpu_async_ioctl(struct file *filp,
}
#endif /* CONFIG_HAVE_KVM_VCPU_ASYNC_IOCTL */
-void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
unsigned long start, unsigned long end);
#ifdef CONFIG_HAVE_KVM_VCPU_RUN_PID_CHANGE
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 392e6af82701..369867501bed 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -230,7 +230,8 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
extern void __mmu_notifier_change_pte(struct mm_struct *mm,
unsigned long address, pte_t pte);
extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end);
+ unsigned long start, unsigned long end,
+ bool blockable);
extern void __mmu_notifier_invalidate_range_end(struct mm_struct *mm,
unsigned long start, unsigned long end,
bool only_end);
@@ -281,7 +282,17 @@ static inline void mmu_notifier_invalidate_range_start(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
if (mm_has_notifiers(mm))
- __mmu_notifier_invalidate_range_start(mm, start, end);
+ __mmu_notifier_invalidate_range_start(mm, start, end, true);
+}
+
+static inline int mmu_notifier_invalidate_range_start_nonblock(struct mm_struct *mm,
+ unsigned long start, unsigned long end)
+{
+ int ret = 0;
+ if (mm_has_notifiers(mm))
+ ret = __mmu_notifier_invalidate_range_start(mm, start, end, false);
+
+ return ret;
}
static inline void mmu_notifier_invalidate_range_end(struct mm_struct *mm,
diff --git a/mm/hmm.c b/mm/hmm.c
index de7b6bf77201..81fd57bd2634 100644
--- a/mm/hmm.c
+++ b/mm/hmm.c
@@ -177,16 +177,19 @@ static void hmm_release(struct mmu_notifier *mn, struct mm_struct *mm)
up_write(&hmm->mirrors_sem);
}
-static void hmm_invalidate_range_start(struct mmu_notifier *mn,
+static int hmm_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct hmm *hmm = mm->hmm;
VM_BUG_ON(!hmm);
atomic_inc(&hmm->sequence);
+
+ return 0;
}
static void hmm_invalidate_range_end(struct mmu_notifier *mn,
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index eff6b88a993f..30cc43121da9 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -174,18 +174,25 @@ void __mmu_notifier_change_pte(struct mm_struct *mm, unsigned long address,
srcu_read_unlock(&srcu, id);
}
-void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
- unsigned long start, unsigned long end)
+int __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
+ unsigned long start, unsigned long end,
+ bool blockable)
{
struct mmu_notifier *mn;
+ int ret = 0;
int id;
id = srcu_read_lock(&srcu);
hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist) {
- if (mn->ops->invalidate_range_start)
- mn->ops->invalidate_range_start(mn, mm, start, end);
+ if (mn->ops->invalidate_range_start) {
+ int _ret = mn->ops->invalidate_range_start(mn, mm, start, end, blockable);
+ if (_ret)
+ ret = _ret;
+ }
}
srcu_read_unlock(&srcu, id);
+
+ return ret;
}
EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e77bc51..7e0c6e78ae5c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -479,9 +479,10 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
static struct task_struct *oom_reaper_list;
static DEFINE_SPINLOCK(oom_reaper_lock);
-void __oom_reap_task_mm(struct mm_struct *mm)
+bool __oom_reap_task_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ bool ret = true;
/*
* Tell all users of get_user/copy_from_user etc... that the content
@@ -511,12 +512,17 @@ void __oom_reap_task_mm(struct mm_struct *mm)
struct mmu_gather tlb;
tlb_gather_mmu(&tlb, mm, start, end);
- mmu_notifier_invalidate_range_start(mm, start, end);
+ if (mmu_notifier_invalidate_range_start_nonblock(mm, start, end)) {
+ ret = false;
+ continue;
+ }
unmap_page_range(&tlb, vma, start, end, NULL);
mmu_notifier_invalidate_range_end(mm, start, end);
tlb_finish_mmu(&tlb, start, end);
}
}
+
+ return ret;
}
static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
@@ -545,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
goto unlock_oom;
}
- /*
- * If the mm has invalidate_{start,end}() notifiers that could block,
- * sleep to give the oom victim some more time.
- * TODO: we really want to get rid of this ugly hack and make sure that
- * notifiers cannot block for unbounded amount of time
- */
- if (mm_has_blockable_invalidate_notifiers(mm)) {
- up_read(&mm->mmap_sem);
- schedule_timeout_idle(HZ);
- goto unlock_oom;
- }
-
/*
* MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
* work on the mm anymore. The check for MMF_OOM_SKIP must run
@@ -571,7 +565,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
trace_start_task_reaping(tsk->pid);
- __oom_reap_task_mm(mm);
+ /* failed to reap part of the address space. Try again later */
+ if (!__oom_reap_task_mm(mm)) {
+ up_read(&mm->mmap_sem);
+ ret = false;
+ goto out_unlock;
+ }
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ada21f47f22b..6f7e709d2944 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -135,7 +135,7 @@ static void kvm_uevent_notify_change(unsigned int type, struct kvm *kvm);
static unsigned long long kvm_createvm_count;
static unsigned long long kvm_active_vms;
-__weak void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
+__weak int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
unsigned long start, unsigned long end)
{
}
@@ -354,13 +354,15 @@ static void kvm_mmu_notifier_change_pte(struct mmu_notifier *mn,
srcu_read_unlock(&kvm->srcu, idx);
}
-static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
+static int kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
struct mm_struct *mm,
unsigned long start,
- unsigned long end)
+ unsigned long end,
+ bool blockable)
{
struct kvm *kvm = mmu_notifier_to_kvm(mn);
int need_tlb_flush = 0, idx;
+ int ret;
idx = srcu_read_lock(&kvm->srcu);
spin_lock(&kvm->mmu_lock);
@@ -378,9 +380,11 @@ static void kvm_mmu_notifier_invalidate_range_start(struct mmu_notifier *mn,
spin_unlock(&kvm->mmu_lock);
- kvm_arch_mmu_notifier_invalidate_range(kvm, start, end);
+ ret = kvm_arch_mmu_notifier_invalidate_range(kvm, start, end, blockable);
srcu_read_unlock(&kvm->srcu, idx);
+
+ return ret;
}
static void kvm_mmu_notifier_invalidate_range_end(struct mmu_notifier *mn,
--
Michal Hocko
SUSE Labs
On Wed, 20 Jun 2018, Michal Hocko wrote:
> On Tue 19-06-18 10:33:16, Michal Hocko wrote:
> [...]
> > As I've said, if you are not willing to work on a proper solution, I
> > will, but my nack holds for this patch until we see no other way around
> > existing and real world problems.
>
> OK, so I gave it a quick try and it doesn't look all that bad to me.
> This is only for blockable mmu notifiers. I didn't really try to
> address all the problems down the road - I mean some of the blocking
> notifiers can check the range in their interval tree without blocking
> locks. It is quite probable that only few ranges will be of interest,
> right?
>
> So this is only to give an idea about the change. It probably even
> doesn't compile. Does that sound sane?
It depends on how invasive we want to make this, it should result in more
memory being freeable if the invalidate callbacks can guarantee that they
won't block. I think it's much more invasive than the proposed patch,
however.
For the same reason as the mm->mmap_sem backoff, however, this should
retry for a longer period of time than HZ. If we can't grab mm->mmap_sem
the first five times with the trylock because of writer queueing, for
example, then we only have five attempts for each blockable mmu notifier
invalidate callback, and any of the numerous locks it can take to declare
it will not block.
Note that this doesn't solve the issue with setting MMF_OOM_SKIP too early
on processes with mm->mmap_sem contention or now invalidate callbacks that
will block; the decision that the mm cannot be reaped should come much
later.
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 6bcecc325e7e..ac08f5d711be 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> }
>
> -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> - unsigned long start, unsigned long end)
> +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> + unsigned long start, unsigned long end,
> + bool blockable)
> {
> unsigned long apic_address;
>
> @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> if (start <= apic_address && apic_address < end)
> kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> +
> + return 0;
> }
>
> void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
Auditing the first change in the patch, this is incorrect because
kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm. This can happen for a variety of reasons,
including:
- the inability to grab mm->mmap_sem in a sufficient amount of time,
- when the mm has blockable mmu notifiers that could cause the oom reaper
to stall indefinitely,
but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:
- when the mm's memory is mostly mlocked.
When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
unmap and free its memory in exit_mmap() and subsequent oom victims are
chosen unnecessarily. This is trivial to reproduce if all eligible
processes on the system have mlocked their memory: the oom killer calls
panic() even though forward progress can be made.
This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap() and is separate from
the oom reaper setting MMF_OOM_SKIP prematurely.
We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.
To fix this, this patch introduces a per-mm reaping period, which is
configurable through the new oom_free_timeout_ms file in debugfs and
defaults to one second to match the current heuristics. This support
requires that the oom reaper's list becomes a proper linked list so that
other mm's may be reaped while waiting for an mm's timeout to expire.
This replaces the current timeouts in the oom reaper: (1) when trying to
grab mm->mmap_sem 10 times in a row with HZ/10 sleeps in between and (2)
a HZ sleep if there are blockable mmu notifiers. It extends it with
timeout to allow an oom victim to reach exit_mmap() before choosing
additional processes unnecessarily.
The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.
The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress. Since the
default oom_free_timeout_ms is one second, the same as current heuristics,
there should be no functional change with this patch for users who do not
tune it to be longer other than MMF_OOM_SKIP is set by exit_mmap() after
free_pgtables(), which is the preferred behavior.
The reaping timeout can intentionally be set for a substantial amount of
time, such as 10s, since oom livelock is a very rare occurrence and it's
better to optimize for preventing additional (unnecessary) oom killing
than a scenario that is much more unlikely.
Signed-off-by: David Rientjes <[email protected]>
---
v2:
- configurable timeout period through debugfs
- change mm->reap_timeout to mm->oom_free_expire and add more
descriptive comment per akpm
- add comment to describe task->oom_reap_list locking based on
oom_reaper_lock per akpm
- rework the exit_mmap() comment and split into two parts to be more
descriptive about the locking and the issue with the oom reaper
racing with munlock_vma_pages_all() per akpm
---
include/linux/mm_types.h | 7 ++
include/linux/sched.h | 3 +-
kernel/fork.c | 3 +
mm/mmap.c | 26 +++++---
mm/oom_kill.c | 140 +++++++++++++++++++++++++--------------
5 files changed, 119 insertions(+), 60 deletions(-)
diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -449,6 +449,13 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+#ifdef CONFIG_MMU
+ /*
+ * When to give up on memory freeing from this mm after its
+ * threads have been oom killed, in jiffies.
+ */
+ unsigned long oom_free_expire;
+#endif
#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
pgtable_t pmd_huge_pte; /* protected by page_table_lock */
#endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1163,7 +1163,8 @@ struct task_struct {
#endif
int pagefault_disabled;
#ifdef CONFIG_MMU
- struct task_struct *oom_reaper_list;
+ /* OOM victim queue for oom reaper, protected by oom_reaper_lock */
+ struct list_head oom_reap_list;
#endif
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
diff --git a/kernel/fork.c b/kernel/fork.c
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -842,6 +842,9 @@ static struct task_struct *dup_task_struct(struct task_struct *orig, int node)
#ifdef CONFIG_FAULT_INJECTION
tsk->fail_nth = 0;
#endif
+#ifdef CONFIG_MMU
+ INIT_LIST_HEAD(&tsk->oom_reap_list);
+#endif
return tsk;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,25 +3059,28 @@ void exit_mmap(struct mm_struct *mm)
if (unlikely(mm_is_oom_victim(mm))) {
/*
* Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_sem for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_sem is
- * dropped.
- *
* Nothing can be holding mm->mmap_sem here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
* __oom_reap_task_mm() will not block.
- *
- * This needs to be done before calling munlock_vma_pages_all(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
- * reliably test it.
*/
mutex_lock(&oom_lock);
__oom_reap_task_mm(mm);
mutex_unlock(&oom_lock);
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ /*
+ * Now, set MMF_UNSTABLE to avoid racing with the oom reaper.
+ * This needs to be done before calling munlock_vma_pages_all(),
+ * which clears VM_LOCKED, otherwise the oom reaper cannot
+ * reliably test for it. If the oom reaper races with
+ * munlock_vma_pages_all(), this can result in a kernel oops if
+ * a pmd is zapped, for example, after follow_page_mask() has
+ * checked pmd_none().
+ *
+ * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+ * guarantee that the oom reaper will not run on this mm again
+ * after mmap_sem is dropped.
+ */
+ set_bit(MMF_UNSTABLE, &mm->flags);
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
@@ -3105,6 +3108,7 @@ void exit_mmap(struct mm_struct *mm)
unmap_vmas(&tlb, vma, 0, -1);
free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
tlb_finish_mmu(&tlb, 0, -1);
+ set_bit(MMF_OOM_SKIP, &mm->flags);
/*
* Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -41,6 +41,7 @@
#include <linux/kthread.h>
#include <linux/init.h>
#include <linux/mmu_notifier.h>
+#include <linux/debugfs.h>
#include <asm/tlb.h>
#include "internal.h"
@@ -476,7 +477,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
*/
static struct task_struct *oom_reaper_th;
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
static DEFINE_SPINLOCK(oom_reaper_lock);
void __oom_reap_task_mm(struct mm_struct *mm)
@@ -519,10 +520,8 @@ void __oom_reap_task_mm(struct mm_struct *mm)
}
}
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+static void oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
{
- bool ret = true;
-
/*
* We have to make sure to not race with the victim exit path
* and cause premature new oom victim selection:
@@ -540,9 +539,8 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
mutex_lock(&oom_lock);
if (!down_read_trylock(&mm->mmap_sem)) {
- ret = false;
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ goto out_oom;
}
/*
@@ -551,69 +549,81 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
* TODO: we really want to get rid of this ugly hack and make sure that
* notifiers cannot block for unbounded amount of time
*/
- if (mm_has_blockable_invalidate_notifiers(mm)) {
- up_read(&mm->mmap_sem);
- schedule_timeout_idle(HZ);
- goto unlock_oom;
- }
+ if (mm_has_blockable_invalidate_notifiers(mm))
+ goto out_mm;
/*
- * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_OOM_SKIP must run
+ * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+ * work on the mm anymore. The check for MMF_UNSTABLE must run
* under mmap_sem for reading because it serializes against the
* down_write();up_write() cycle in exit_mmap().
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
- up_read(&mm->mmap_sem);
+ if (test_bit(MMF_UNSTABLE, &mm->flags)) {
trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
+ goto out_mm;
}
trace_start_task_reaping(tsk->pid);
-
__oom_reap_task_mm(mm);
+ trace_finish_task_reaping(tsk->pid);
pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
task_pid_nr(tsk), tsk->comm,
K(get_mm_counter(mm, MM_ANONPAGES)),
K(get_mm_counter(mm, MM_FILEPAGES)),
K(get_mm_counter(mm, MM_SHMEMPAGES)));
+out_mm:
up_read(&mm->mmap_sem);
-
- trace_finish_task_reaping(tsk->pid);
-unlock_oom:
+out_oom:
mutex_unlock(&oom_lock);
- return ret;
}
-#define MAX_OOM_REAP_RETRIES 10
static void oom_reap_task(struct task_struct *tsk)
{
- int attempts = 0;
struct mm_struct *mm = tsk->signal->oom_mm;
- /* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
- schedule_timeout_idle(HZ/10);
+ /*
+ * If this mm has either been fully unmapped, or the oom reaper has
+ * given up on it, nothing left to do except drop the refcount.
+ */
+ if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ goto drop;
- if (attempts <= MAX_OOM_REAP_RETRIES ||
- test_bit(MMF_OOM_SKIP, &mm->flags))
- goto done;
+ /*
+ * If this mm has already been reaped, doing so again will not likely
+ * free additional memory.
+ */
+ if (!test_bit(MMF_UNSTABLE, &mm->flags))
+ oom_reap_task_mm(tsk, mm);
- pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- debug_show_all_locks();
+ if (time_after_eq(jiffies, mm->oom_free_expire)) {
+ if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+ pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+ task_pid_nr(tsk), tsk->comm);
+ debug_show_all_locks();
-done:
- tsk->oom_reaper_list = NULL;
+ /*
+ * Reaping has failed for the timeout period, so give up
+ * and allow additional processes to be oom killed.
+ */
+ set_bit(MMF_OOM_SKIP, &mm->flags);
+ }
+ goto drop;
+ }
- /*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ goto drop;
+
+ /* Enqueue to be reaped again */
+ spin_lock(&oom_reaper_lock);
+ list_add_tail(&tsk->oom_reap_list, &oom_reaper_list);
+ spin_unlock(&oom_reaper_lock);
- /* Drop a reference taken by wake_oom_reaper */
+ schedule_timeout_idle(HZ/10);
+ return;
+
+drop:
+ /* Drop the reference taken by wake_oom_reaper */
put_task_struct(tsk);
}
@@ -622,11 +632,13 @@ static int oom_reaper(void *unused)
while (true) {
struct task_struct *tsk = NULL;
- wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+ wait_event_freezable(oom_reaper_wait,
+ !list_empty(&oom_reaper_list));
spin_lock(&oom_reaper_lock);
- if (oom_reaper_list != NULL) {
- tsk = oom_reaper_list;
- oom_reaper_list = tsk->oom_reaper_list;
+ if (!list_empty(&oom_reaper_list)) {
+ tsk = list_entry(oom_reaper_list.next,
+ struct task_struct, oom_reap_list);
+ list_del(&tsk->oom_reap_list);
}
spin_unlock(&oom_reaper_lock);
@@ -637,25 +649,57 @@ static int oom_reaper(void *unused)
return 0;
}
+/*
+ * Millisecs to wait for an oom mm to free memory before selecting another
+ * victim.
+ */
+u64 oom_free_timeout_ms = 1000;
static void wake_oom_reaper(struct task_struct *tsk)
{
- /* tsk is already queued? */
- if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+ /*
+ * Set the reap timeout; if it's already set, the mm is enqueued and
+ * this tsk can be ignored.
+ */
+ if (cmpxchg(&tsk->signal->oom_mm->oom_free_expire, 0UL,
+ jiffies + msecs_to_jiffies(oom_free_timeout_ms)))
return;
get_task_struct(tsk);
spin_lock(&oom_reaper_lock);
- tsk->oom_reaper_list = oom_reaper_list;
- oom_reaper_list = tsk;
+ list_add(&tsk->oom_reap_list, &oom_reaper_list);
spin_unlock(&oom_reaper_lock);
trace_wake_reaper(tsk->pid);
wake_up(&oom_reaper_wait);
}
+#ifdef CONFIG_DEBUG_FS
+static int oom_free_timeout_ms_read(void *data, u64 *val)
+{
+ *val = oom_free_timeout_ms;
+ return 0;
+}
+
+static int oom_free_timeout_ms_write(void *data, u64 val)
+{
+ if (val > 60 * 1000)
+ return -EINVAL;
+
+ oom_free_timeout_ms = val;
+ return 0;
+}
+DEFINE_SIMPLE_ATTRIBUTE(oom_free_timeout_ms_fops, oom_free_timeout_ms_read,
+ oom_free_timeout_ms_write, "%llu\n");
+#endif /* CONFIG_DEBUG_FS */
+
static int __init oom_init(void)
{
oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
+#ifdef CONFIG_DEBUG_FS
+ if (!IS_ERR(oom_reaper_th))
+ debugfs_create_file("oom_free_timeout_ms", 0200, NULL, NULL,
+ &oom_free_timeout_ms_fops);
+#endif
return 0;
}
subsys_initcall(oom_init)
On Wed 20-06-18 13:34:52, David Rientjes wrote:
> On Wed, 20 Jun 2018, Michal Hocko wrote:
>
> > On Tue 19-06-18 10:33:16, Michal Hocko wrote:
> > [...]
> > > As I've said, if you are not willing to work on a proper solution, I
> > > will, but my nack holds for this patch until we see no other way around
> > > existing and real world problems.
> >
> > OK, so I gave it a quick try and it doesn't look all that bad to me.
> > This is only for blockable mmu notifiers. I didn't really try to
> > address all the problems down the road - I mean some of the blocking
> > notifiers can check the range in their interval tree without blocking
> > locks. It is quite probable that only few ranges will be of interest,
> > right?
> >
> > So this is only to give an idea about the change. It probably even
> > doesn't compile. Does that sound sane?
>
> It depends on how invasive we want to make this, it should result in more
> memory being freeable if the invalidate callbacks can guarantee that they
> won't block. I think it's much more invasive than the proposed patch,
> however.
It is a larger patch for sure but it heads towards a more deterministic
behavior because we know _why_ we are trying. It is a specific and
rarely taken lock that we need. If we get one step further and examine
the range without blocking then we are almost lockless from the oom
reaper POV for most notifiers.
> For the same reason as the mm->mmap_sem backoff, however, this should
> retry for a longer period of time than HZ. If we can't grab mm->mmap_sem
> the first five times with the trylock because of writer queueing, for
> example, then we only have five attempts for each blockable mmu notifier
> invalidate callback, and any of the numerous locks it can take to declare
> it will not block.
>
> Note that this doesn't solve the issue with setting MMF_OOM_SKIP too early
> on processes with mm->mmap_sem contention or now invalidate callbacks that
> will block; the decision that the mm cannot be reaped should come much
> later.
I do not mind tuning the number of retries or the sleep duration. All
that based on real life examples.
I have asked about a specific mmap_sem contention case several times but
didn't get any answer yet.
> > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > index 6bcecc325e7e..ac08f5d711be 100644
> > --- a/arch/x86/kvm/x86.c
> > +++ b/arch/x86/kvm/x86.c
> > @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> > }
> >
> > -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > - unsigned long start, unsigned long end)
> > +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > + unsigned long start, unsigned long end,
> > + bool blockable)
> > {
> > unsigned long apic_address;
> >
> > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > if (start <= apic_address && apic_address < end)
> > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > +
> > + return 0;
> > }
> >
> > void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
>
> Auditing the first change in the patch, this is incorrect because
> kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
> kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
I would have to check the code closer. But doesn't
kvm_make_all_cpus_request call get_cpu which is preempt_disable? I
definitely plan to talk to respective maintainers about these changes of
course.
--
Michal Hocko
SUSE Labs
On Thu 21-06-18 09:45:37, Michal Hocko wrote:
> On Wed 20-06-18 13:34:52, David Rientjes wrote:
> > On Wed, 20 Jun 2018, Michal Hocko wrote:
[...]
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 6bcecc325e7e..ac08f5d711be 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> > > }
> > >
> > > -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > - unsigned long start, unsigned long end)
> > > +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > + unsigned long start, unsigned long end,
> > > + bool blockable)
> > > {
> > > unsigned long apic_address;
> > >
> > > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > > if (start <= apic_address && apic_address < end)
> > > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > > +
> > > + return 0;
> > > }
> > >
> > > void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >
> > Auditing the first change in the patch, this is incorrect because
> > kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
> > kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
>
> I would have to check the code closer. But doesn't
> kvm_make_all_cpus_request call get_cpu which is preempt_disable?
Sorry I meant kvm_make_vcpus_request_mask. kvm_make_all_cpus_request
only does a GFP_ATOMIC allocation on top.
--
Michal Hocko
SUSE Labs
Fixes: 45c6e373dd94 ("mm, oom: fix unnecessary killing of additional processes")
Signed-off-by: kbuild test robot <[email protected]>
---
oom_kill.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8a775c4..6b776b9 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -653,7 +653,7 @@ static int oom_reaper(void *unused)
* Millisecs to wait for an oom mm to free memory before selecting another
* victim.
*/
-u64 oom_free_timeout_ms = 1000;
+static u64 oom_free_timeout_ms = 1000;
static void wake_oom_reaper(struct task_struct *tsk)
{
/*
Hi David,
I love your patch! Perhaps something to improve:
[auto build test WARNING on linus/master]
[also build test WARNING on v4.18-rc1 next-20180621]
[if your patch is applied to the wrong git tree, please drop us a note to help improve the system]
url: https://github.com/0day-ci/linux/commits/David-Rientjes/mm-oom-fix-unnecessary-killing-of-additional-processes/20180621-060118
reproduce:
# apt-get install sparse
make ARCH=x86_64 allmodconfig
make C=1 CF=-D__CHECK_ENDIAN__
sparse warnings: (new ones prefixed by >>)
include/linux/nodemask.h:265:16: sparse: expression using sizeof(void)
include/linux/nodemask.h:271:16: sparse: expression using sizeof(void)
include/linux/nodemask.h:265:16: sparse: expression using sizeof(void)
include/linux/nodemask.h:271:16: sparse: expression using sizeof(void)
>> mm/oom_kill.c:656:5: sparse: symbol 'oom_free_timeout_ms' was not declared. Should it be static?
include/linux/rcupdate.h:683:9: sparse: context imbalance in 'find_lock_task_mm' - wrong count at exit
include/linux/sched/mm.h:141:37: sparse: dereference of noderef expression
mm/oom_kill.c:218:28: sparse: context imbalance in 'oom_badness' - unexpected unlock
mm/oom_kill.c:398:9: sparse: context imbalance in 'dump_tasks' - different lock contexts for basic block
include/linux/rcupdate.h:683:9: sparse: context imbalance in 'oom_kill_process' - unexpected unlock
Please review and possibly fold the followup patch.
---
0-DAY kernel test infrastructure Open Source Technology Center
https://lists.01.org/pipermail/kbuild-all Intel Corporation
On Thu, 21 Jun 2018, Michal Hocko wrote:
> > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > index 6bcecc325e7e..ac08f5d711be 100644
> > > --- a/arch/x86/kvm/x86.c
> > > +++ b/arch/x86/kvm/x86.c
> > > @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> > > }
> > >
> > > -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > - unsigned long start, unsigned long end)
> > > +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > + unsigned long start, unsigned long end,
> > > + bool blockable)
> > > {
> > > unsigned long apic_address;
> > >
> > > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > > if (start <= apic_address && apic_address < end)
> > > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > > +
> > > + return 0;
> > > }
> > >
> > > void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> >
> > Auditing the first change in the patch, this is incorrect because
> > kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
> > kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
>
> I would have to check the code closer. But doesn't
> kvm_make_all_cpus_request call get_cpu which is preempt_disable? I
> definitely plan to talk to respective maintainers about these changes of
> course.
>
preempt_disable() is required because it calls kvm_kick_many_cpus() with
wait == true because KVM_REQ_APIC_PAGE_RELOAD sets KVM_REQUEST_WAIT and
thus the smp_call_function_many() is going to block until all cpus can run
ack_flush().
On Thu 21-06-18 13:50:53, David Rientjes wrote:
> On Thu, 21 Jun 2018, Michal Hocko wrote:
>
> > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > index 6bcecc325e7e..ac08f5d711be 100644
> > > > --- a/arch/x86/kvm/x86.c
> > > > +++ b/arch/x86/kvm/x86.c
> > > > @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > > kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> > > > }
> > > >
> > > > -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > - unsigned long start, unsigned long end)
> > > > +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > + unsigned long start, unsigned long end,
> > > > + bool blockable)
> > > > {
> > > > unsigned long apic_address;
> > > >
> > > > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > > > if (start <= apic_address && apic_address < end)
> > > > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > > > +
> > > > + return 0;
> > > > }
> > > >
> > > > void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> > >
> > > Auditing the first change in the patch, this is incorrect because
> > > kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
> > > kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
> >
> > I would have to check the code closer. But doesn't
> > kvm_make_all_cpus_request call get_cpu which is preempt_disable? I
> > definitely plan to talk to respective maintainers about these changes of
> > course.
> >
>
> preempt_disable() is required because it calls kvm_kick_many_cpus() with
> wait == true because KVM_REQ_APIC_PAGE_RELOAD sets KVM_REQUEST_WAIT and
> thus the smp_call_function_many() is going to block until all cpus can run
> ack_flush().
I will make sure to talk to the maintainer of the respective code to
do the nonblock case correctly.
--
Michal Hocko
SUSE Labs
On Fri 22-06-18 09:42:57, Michal Hocko wrote:
> On Thu 21-06-18 13:50:53, David Rientjes wrote:
> > On Thu, 21 Jun 2018, Michal Hocko wrote:
> >
> > > > > diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> > > > > index 6bcecc325e7e..ac08f5d711be 100644
> > > > > --- a/arch/x86/kvm/x86.c
> > > > > +++ b/arch/x86/kvm/x86.c
> > > > > @@ -7203,8 +7203,9 @@ static void vcpu_load_eoi_exitmap(struct kvm_vcpu *vcpu)
> > > > > kvm_x86_ops->load_eoi_exitmap(vcpu, eoi_exit_bitmap);
> > > > > }
> > > > >
> > > > > -void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > > - unsigned long start, unsigned long end)
> > > > > +int kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > > + unsigned long start, unsigned long end,
> > > > > + bool blockable)
> > > > > {
> > > > > unsigned long apic_address;
> > > > >
> > > > > @@ -7215,6 +7216,8 @@ void kvm_arch_mmu_notifier_invalidate_range(struct kvm *kvm,
> > > > > apic_address = gfn_to_hva(kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT);
> > > > > if (start <= apic_address && apic_address < end)
> > > > > kvm_make_all_cpus_request(kvm, KVM_REQ_APIC_PAGE_RELOAD);
> > > > > +
> > > > > + return 0;
> > > > > }
> > > > >
> > > > > void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu)
> > > >
> > > > Auditing the first change in the patch, this is incorrect because
> > > > kvm_make_all_cpus_request() for KVM_REQ_APIC_PAGE_RELOAD can block in
> > > > kvm_kick_many_cpus() and that is after kvm_make_request() has been done.
> > >
> > > I would have to check the code closer. But doesn't
> > > kvm_make_all_cpus_request call get_cpu which is preempt_disable? I
> > > definitely plan to talk to respective maintainers about these changes of
> > > course.
> > >
> >
> > preempt_disable() is required because it calls kvm_kick_many_cpus() with
> > wait == true because KVM_REQ_APIC_PAGE_RELOAD sets KVM_REQUEST_WAIT and
> > thus the smp_call_function_many() is going to block until all cpus can run
> > ack_flush().
>
> I will make sure to talk to the maintainer of the respective code to
> do the nonblock case correctly.
I've just double checked this particular code and the wait path and this
one is not a sleep. It is a busy wait for IPI to get handled. So this
one should be OK AFAICS. Anyway I will send an RFC and involve
respective maintainers to make sure I am not making any incorrect
assumptions.
--
Michal Hocko
SUSE Labs
On Fri, 22 Jun 2018, Michal Hocko wrote:
> > > preempt_disable() is required because it calls kvm_kick_many_cpus() with
> > > wait == true because KVM_REQ_APIC_PAGE_RELOAD sets KVM_REQUEST_WAIT and
> > > thus the smp_call_function_many() is going to block until all cpus can run
> > > ack_flush().
> >
> > I will make sure to talk to the maintainer of the respective code to
> > do the nonblock case correctly.
>
> I've just double checked this particular code and the wait path and this
> one is not a sleep. It is a busy wait for IPI to get handled. So this
> one should be OK AFAICS. Anyway I will send an RFC and involve
> respective maintainers to make sure I am not making any incorrect
> assumptions.
Do you believe that having the only potential source of memory freeing
busy waiting for all other cpus on the system to run ack_flush() is
particularly dangerous given the fact that they may be allocating
themselves?
On 2018/06/15 5:42, David Rientjes wrote:
> Note: I understand there is an objection based on timeout based delays.
> This is currently the only possible way to avoid oom killing important
> processes completely unnecessarily. If the oom reaper can someday free
> all memory, including mlocked memory and those mm's with blockable mmu
> notifiers, and is guaranteed to always be able to grab mm->mmap_sem,
> this can be removed. I do not believe any such guarantee is possible
> and consider the massive killing of additional processes unnecessarily
> to be a regression introduced by the oom reaper and its very quick
> setting of MMF_OOM_SKIP to allow additional processes to be oom killed.
>
Here is my version for your proposal including my anti-lockup series.
My version is using OOM badness score as a feedback for deciding when to give up.
---
drivers/tty/sysrq.c | 2 -
include/linux/memcontrol.h | 9 +-
include/linux/oom.h | 7 +-
include/linux/sched.h | 7 +-
include/linux/sched/coredump.h | 1 -
kernel/fork.c | 2 +
mm/memcontrol.c | 24 +--
mm/mmap.c | 17 +-
mm/oom_kill.c | 383 +++++++++++++++++------------------------
mm/page_alloc.c | 73 +++-----
10 files changed, 202 insertions(+), 323 deletions(-)
diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 6364890..c8b66b9 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -376,10 +376,8 @@ static void moom_callback(struct work_struct *ignored)
.order = -1,
};
- mutex_lock(&oom_lock);
if (!out_of_memory(&oc))
pr_info("OOM request ignored. No task eligible\n");
- mutex_unlock(&oom_lock);
}
static DECLARE_WORK(moom_work, moom_callback);
diff --git a/include/linux/memcontrol.h b/include/linux/memcontrol.h
index 6c6fb11..a82360a 100644
--- a/include/linux/memcontrol.h
+++ b/include/linux/memcontrol.h
@@ -382,8 +382,8 @@ struct mem_cgroup *mem_cgroup_iter(struct mem_cgroup *,
struct mem_cgroup *,
struct mem_cgroup_reclaim_cookie *);
void mem_cgroup_iter_break(struct mem_cgroup *, struct mem_cgroup *);
-int mem_cgroup_scan_tasks(struct mem_cgroup *,
- int (*)(struct task_struct *, void *), void *);
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+ void (*fn)(struct task_struct *, void *), void *arg);
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
{
@@ -850,10 +850,9 @@ static inline void mem_cgroup_iter_break(struct mem_cgroup *root,
{
}
-static inline int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
- int (*fn)(struct task_struct *, void *), void *arg)
+static inline void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+ void (*fn)(struct task_struct *, void *), void *arg)
{
- return 0;
}
static inline unsigned short mem_cgroup_id(struct mem_cgroup *memcg)
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 6adac11..09cfa8e 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -44,8 +44,6 @@ struct oom_control {
unsigned long chosen_points;
};
-extern struct mutex oom_lock;
-
static inline void set_current_oom_origin(void)
{
current->signal->oom_flag_origin = true;
@@ -68,7 +66,7 @@ static inline bool tsk_is_oom_victim(struct task_struct * tsk)
/*
* Use this helper if tsk->mm != mm and the victim mm needs a special
- * handling. This is guaranteed to stay true after once set.
+ * handling.
*/
static inline bool mm_is_oom_victim(struct mm_struct *mm)
{
@@ -95,7 +93,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
return 0;
}
-void __oom_reap_task_mm(struct mm_struct *mm);
+extern void oom_reap_mm(struct mm_struct *mm);
+extern bool try_oom_notifier(void);
extern unsigned long oom_badness(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask,
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 87bf02d..e23fc7f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1162,9 +1162,10 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
-#ifdef CONFIG_MMU
- struct task_struct *oom_reaper_list;
-#endif
+ struct list_head oom_victim_list;
+ unsigned long last_oom_compared;
+ unsigned long last_oom_score;
+ unsigned char oom_reap_stall_count;
#ifdef CONFIG_VMAP_STACK
struct vm_struct *stack_vm_area;
#endif
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index ec912d0..d30615e 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -66,7 +66,6 @@ static inline int get_dumpable(struct mm_struct *mm)
#define MMF_HAS_UPROBES 19 /* has uprobes */
#define MMF_RECALC_UPROBES 20 /* MMF_HAS_UPROBES can be wrong */
-#define MMF_OOM_SKIP 21 /* mm is of no interest for the OOM killer */
#define MMF_UNSTABLE 22 /* mm is unstable for copy_from_user */
#define MMF_HUGE_ZERO_PAGE 23 /* mm has ever used the global huge zero page */
#define MMF_DISABLE_THP 24 /* disable THP for all VMAs */
diff --git a/kernel/fork.c b/kernel/fork.c
index 9440d61..5ad2b19 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -977,6 +977,8 @@ static inline void __mmput(struct mm_struct *mm)
}
if (mm->binfmt)
module_put(mm->binfmt->module);
+ if (unlikely(mm_is_oom_victim(mm)))
+ clear_bit(MMF_OOM_VICTIM, &mm->flags);
mmdrop(mm);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index e6f0d5e..35c33bf 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -884,17 +884,14 @@ static void invalidate_reclaim_iterators(struct mem_cgroup *dead_memcg)
* @arg: argument passed to @fn
*
* This function iterates over tasks attached to @memcg or to any of its
- * descendants and calls @fn for each task. If @fn returns a non-zero
- * value, the function breaks the iteration loop and returns the value.
- * Otherwise, it will iterate over all tasks and return 0.
+ * descendants and calls @fn for each task.
*
* This function must not be called for the root memory cgroup.
*/
-int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
- int (*fn)(struct task_struct *, void *), void *arg)
+void mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
+ void (*fn)(struct task_struct *, void *), void *arg)
{
struct mem_cgroup *iter;
- int ret = 0;
BUG_ON(memcg == root_mem_cgroup);
@@ -903,15 +900,10 @@ int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
struct task_struct *task;
css_task_iter_start(&iter->css, 0, &it);
- while (!ret && (task = css_task_iter_next(&it)))
- ret = fn(task, arg);
+ while ((task = css_task_iter_next(&it)))
+ fn(task, arg);
css_task_iter_end(&it);
- if (ret) {
- mem_cgroup_iter_break(memcg, iter);
- break;
- }
}
- return ret;
}
/**
@@ -1206,12 +1198,8 @@ static bool mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
.gfp_mask = gfp_mask,
.order = order,
};
- bool ret;
- mutex_lock(&oom_lock);
- ret = out_of_memory(&oc);
- mutex_unlock(&oom_lock);
- return ret;
+ return out_of_memory(&oc);
}
#if MAX_NUMNODES > 1
diff --git a/mm/mmap.c b/mm/mmap.c
index d1eb87e..2b422dd 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,25 +3059,18 @@ void exit_mmap(struct mm_struct *mm)
if (unlikely(mm_is_oom_victim(mm))) {
/*
* Manually reap the mm to free as much memory as possible.
- * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
- * this mm from further consideration. Taking mm->mmap_sem for
- * write after setting MMF_OOM_SKIP will guarantee that the oom
- * reaper will not run on this mm again after mmap_sem is
- * dropped.
+ * Then, tell oom_has_pending_victims() no longer try to call
+ * oom_reap_mm() by taking mm->mmap_sem for write.
*
* Nothing can be holding mm->mmap_sem here and the above call
* to mmu_notifier_release(mm) ensures mmu notifier callbacks in
- * __oom_reap_task_mm() will not block.
+ * oom_reap_mm() will not block.
*
* This needs to be done before calling munlock_vma_pages_all(),
- * which clears VM_LOCKED, otherwise the oom reaper cannot
+ * which clears VM_LOCKED, otherwise oom_reap_mm() cannot
* reliably test it.
*/
- mutex_lock(&oom_lock);
- __oom_reap_task_mm(mm);
- mutex_unlock(&oom_lock);
-
- set_bit(MMF_OOM_SKIP, &mm->flags);
+ oom_reap_mm(mm);
down_write(&mm->mmap_sem);
up_write(&mm->mmap_sem);
}
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 84081e7..36bc02f 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -38,7 +38,6 @@
#include <linux/freezer.h>
#include <linux/ftrace.h>
#include <linux/ratelimit.h>
-#include <linux/kthread.h>
#include <linux/init.h>
#include <linux/mmu_notifier.h>
@@ -49,11 +48,17 @@
#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>
+static inline unsigned long oom_victim_mm_score(struct mm_struct *mm)
+{
+ return get_mm_rss(mm) + get_mm_counter(mm, MM_SWAPENTS) +
+ mm_pgtables_bytes(mm) / PAGE_SIZE;
+}
+
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;
-DEFINE_MUTEX(oom_lock);
+static DEFINE_MUTEX(oom_lock);
#ifdef CONFIG_NUMA
/**
@@ -201,19 +206,19 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
if (oom_unkillable_task(p, memcg, nodemask))
return 0;
+ if (tsk_is_oom_victim(p))
+ return 0;
+
p = find_lock_task_mm(p);
if (!p)
return 0;
/*
* Do not even consider tasks which are explicitly marked oom
- * unkillable or have been already oom reaped or the are in
- * the middle of vfork
+ * unkillable or they are in the middle of vfork
*/
adj = (long)p->signal->oom_score_adj;
- if (adj == OOM_SCORE_ADJ_MIN ||
- test_bit(MMF_OOM_SKIP, &p->mm->flags) ||
- in_vfork(p)) {
+ if (adj == OOM_SCORE_ADJ_MIN || in_vfork(p)) {
task_unlock(p);
return 0;
}
@@ -222,8 +227,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
* The baseline for the badness score is the proportion of RAM that each
* task's rss, pagetable and swap space use.
*/
- points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
- mm_pgtables_bytes(p->mm) / PAGE_SIZE;
+ points = oom_victim_mm_score(p->mm);
task_unlock(p);
/* Normalize to oom_score_adj units */
@@ -304,25 +308,13 @@ static enum oom_constraint constrained_alloc(struct oom_control *oc)
return CONSTRAINT_NONE;
}
-static int oom_evaluate_task(struct task_struct *task, void *arg)
+static void oom_evaluate_task(struct task_struct *task, void *arg)
{
struct oom_control *oc = arg;
unsigned long points;
if (oom_unkillable_task(task, NULL, oc->nodemask))
- goto next;
-
- /*
- * This task already has access to memory reserves and is being killed.
- * Don't allow any other task to have access to the reserves unless
- * the task has MMF_OOM_SKIP because chances that it would release
- * any memory is quite low.
- */
- if (!is_sysrq_oom(oc) && tsk_is_oom_victim(task)) {
- if (test_bit(MMF_OOM_SKIP, &task->signal->oom_mm->flags))
- goto next;
- goto abort;
- }
+ return;
/*
* If task is allocating a lot of memory and has been marked to be
@@ -335,29 +327,22 @@ static int oom_evaluate_task(struct task_struct *task, void *arg)
points = oom_badness(task, NULL, oc->nodemask, oc->totalpages);
if (!points || points < oc->chosen_points)
- goto next;
+ return;
/* Prefer thread group leaders for display purposes */
if (points == oc->chosen_points && thread_group_leader(oc->chosen))
- goto next;
+ return;
select:
if (oc->chosen)
put_task_struct(oc->chosen);
get_task_struct(task);
oc->chosen = task;
oc->chosen_points = points;
-next:
- return 0;
-abort:
- if (oc->chosen)
- put_task_struct(oc->chosen);
- oc->chosen = (void *)-1UL;
- return 1;
}
/*
* Simple selection loop. We choose the process with the highest number of
- * 'points'. In case scan was aborted, oc->chosen is set to -1.
+ * 'points'.
*/
static void select_bad_process(struct oom_control *oc)
{
@@ -368,8 +353,7 @@ static void select_bad_process(struct oom_control *oc)
rcu_read_lock();
for_each_process(p)
- if (oom_evaluate_task(p, oc))
- break;
+ oom_evaluate_task(p, oc);
rcu_read_unlock();
}
@@ -451,6 +435,29 @@ static void dump_header(struct oom_control *oc, struct task_struct *p)
#define K(x) ((x) << (PAGE_SHIFT-10))
+static bool victim_mm_stalling(struct task_struct *p, struct mm_struct *mm)
+{
+ unsigned long score;
+
+ if (time_before(jiffies, p->last_oom_compared + HZ / 10))
+ return false;
+ score = oom_victim_mm_score(mm);
+ if (score < p->last_oom_score)
+ p->oom_reap_stall_count = 0;
+ else
+ p->oom_reap_stall_count++;
+ p->last_oom_score = oom_victim_mm_score(mm);
+ p->last_oom_compared = jiffies;
+ if (p->oom_reap_stall_count < 30)
+ return false;
+ pr_info("Gave up waiting for process %d (%s) total-vm:%lukB, anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
+ task_pid_nr(p), p->comm, K(mm->total_vm),
+ K(get_mm_counter(mm, MM_ANONPAGES)),
+ K(get_mm_counter(mm, MM_FILEPAGES)),
+ K(get_mm_counter(mm, MM_SHMEMPAGES)));
+ return true;
+}
+
/*
* task->mm can be NULL if the task is the exited group leader. So to
* determine whether the task is using a particular mm, we examine all the
@@ -469,17 +476,10 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
return false;
}
-#ifdef CONFIG_MMU
-/*
- * OOM Reaper kernel thread which tries to reap the memory used by the OOM
- * victim (if that is possible) to help the OOM killer to move on.
- */
-static struct task_struct *oom_reaper_th;
-static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
-static DEFINE_SPINLOCK(oom_reaper_lock);
+static LIST_HEAD(oom_victim_list);
-void __oom_reap_task_mm(struct mm_struct *mm)
+#ifdef CONFIG_MMU
+void oom_reap_mm(struct mm_struct *mm)
{
struct vm_area_struct *vma;
@@ -518,152 +518,20 @@ void __oom_reap_task_mm(struct mm_struct *mm)
}
}
}
-
-static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
-{
- bool ret = true;
-
- /*
- * We have to make sure to not race with the victim exit path
- * and cause premature new oom victim selection:
- * oom_reap_task_mm exit_mm
- * mmget_not_zero
- * mmput
- * atomic_dec_and_test
- * exit_oom_victim
- * [...]
- * out_of_memory
- * select_bad_process
- * # no TIF_MEMDIE task selects new victim
- * unmap_page_range # frees some memory
- */
- mutex_lock(&oom_lock);
-
- if (!down_read_trylock(&mm->mmap_sem)) {
- ret = false;
- trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
- }
-
- /*
- * If the mm has invalidate_{start,end}() notifiers that could block,
- * sleep to give the oom victim some more time.
- * TODO: we really want to get rid of this ugly hack and make sure that
- * notifiers cannot block for unbounded amount of time
- */
- if (mm_has_blockable_invalidate_notifiers(mm)) {
- up_read(&mm->mmap_sem);
- schedule_timeout_idle(HZ);
- goto unlock_oom;
- }
-
- /*
- * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
- * work on the mm anymore. The check for MMF_OOM_SKIP must run
- * under mmap_sem for reading because it serializes against the
- * down_write();up_write() cycle in exit_mmap().
- */
- if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
- up_read(&mm->mmap_sem);
- trace_skip_task_reaping(tsk->pid);
- goto unlock_oom;
- }
-
- trace_start_task_reaping(tsk->pid);
-
- __oom_reap_task_mm(mm);
-
- pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
- task_pid_nr(tsk), tsk->comm,
- K(get_mm_counter(mm, MM_ANONPAGES)),
- K(get_mm_counter(mm, MM_FILEPAGES)),
- K(get_mm_counter(mm, MM_SHMEMPAGES)));
- up_read(&mm->mmap_sem);
-
- trace_finish_task_reaping(tsk->pid);
-unlock_oom:
- mutex_unlock(&oom_lock);
- return ret;
-}
-
-#define MAX_OOM_REAP_RETRIES 10
-static void oom_reap_task(struct task_struct *tsk)
-{
- int attempts = 0;
- struct mm_struct *mm = tsk->signal->oom_mm;
-
- /* Retry the down_read_trylock(mmap_sem) a few times */
- while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
- schedule_timeout_idle(HZ/10);
-
- if (attempts <= MAX_OOM_REAP_RETRIES ||
- test_bit(MMF_OOM_SKIP, &mm->flags))
- goto done;
-
- pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
- task_pid_nr(tsk), tsk->comm);
- debug_show_all_locks();
-
-done:
- tsk->oom_reaper_list = NULL;
-
- /*
- * Hide this mm from OOM killer because it has been either reaped or
- * somebody can't call up_write(mmap_sem).
- */
- set_bit(MMF_OOM_SKIP, &mm->flags);
-
- /* Drop a reference taken by wake_oom_reaper */
- put_task_struct(tsk);
-}
-
-static int oom_reaper(void *unused)
-{
- while (true) {
- struct task_struct *tsk = NULL;
-
- wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
- spin_lock(&oom_reaper_lock);
- if (oom_reaper_list != NULL) {
- tsk = oom_reaper_list;
- oom_reaper_list = tsk->oom_reaper_list;
- }
- spin_unlock(&oom_reaper_lock);
-
- if (tsk)
- oom_reap_task(tsk);
- }
-
- return 0;
-}
+#endif
static void wake_oom_reaper(struct task_struct *tsk)
{
- /* tsk is already queued? */
- if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+ if (tsk->oom_victim_list.next)
return;
get_task_struct(tsk);
-
- spin_lock(&oom_reaper_lock);
- tsk->oom_reaper_list = oom_reaper_list;
- oom_reaper_list = tsk;
- spin_unlock(&oom_reaper_lock);
- trace_wake_reaper(tsk->pid);
- wake_up(&oom_reaper_wait);
-}
-
-static int __init oom_init(void)
-{
- oom_reaper_th = kthread_run(oom_reaper, NULL, "oom_reaper");
- return 0;
-}
-subsys_initcall(oom_init)
-#else
-static inline void wake_oom_reaper(struct task_struct *tsk)
-{
+ tsk->oom_reap_stall_count = 0;
+ tsk->last_oom_compared = jiffies;
+ tsk->last_oom_score = oom_victim_mm_score(tsk->signal->oom_mm);
+ lockdep_assert_held(&oom_lock);
+ list_add_tail(&tsk->oom_victim_list, &oom_victim_list);
}
-#endif /* CONFIG_MMU */
/**
* mark_oom_victim - mark the given task as OOM victim
@@ -806,10 +674,11 @@ static bool task_will_free_mem(struct task_struct *task)
return false;
/*
- * This task has already been drained by the oom reaper so there are
- * only small chances it will free some more
+ * If memory reserves granted to this task was not sufficient, allow
+ * killing more processes after oom_has_pending_victims() completed
+ * reaping this mm.
*/
- if (test_bit(MMF_OOM_SKIP, &mm->flags))
+ if (tsk_is_oom_victim(task))
return false;
if (atomic_read(&mm->mm_users) <= 1)
@@ -946,7 +815,6 @@ static void oom_kill_process(struct oom_control *oc, const char *message)
continue;
if (is_global_init(p)) {
can_oom_reap = false;
- set_bit(MMF_OOM_SKIP, &mm->flags);
pr_info("oom killer %d (%s) has mm pinned by %d (%s)\n",
task_pid_nr(victim), victim->comm,
task_pid_nr(p), p->comm);
@@ -1009,6 +877,72 @@ int unregister_oom_notifier(struct notifier_block *nb)
}
EXPORT_SYMBOL_GPL(unregister_oom_notifier);
+bool try_oom_notifier(void)
+{
+ static DEFINE_MUTEX(lock);
+ unsigned long freed = 0;
+
+ /*
+ * In order to protect OOM notifiers which are not thread safe and to
+ * avoid excessively releasing memory from OOM notifiers which release
+ * memory every time, this lock serializes/excludes concurrent calls to
+ * OOM notifiers.
+ */
+ if (!mutex_trylock(&lock))
+ return true;
+ blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
+ mutex_unlock(&lock);
+ return freed > 0;
+}
+
+/*
+ * Currently a reference to "struct task_struct" taken by wake_oom_reaper()
+ * will remain on the oom_victim_list until somebody finds that this mm has
+ * already completed __mmput() or had not completed for too long.
+ */
+static bool oom_has_pending_victims(struct oom_control *oc)
+{
+ struct task_struct *p, *tmp;
+ bool ret = false;
+ bool gaveup = false;
+
+ lockdep_assert_held(&oom_lock);
+ list_for_each_entry_safe(p, tmp, &oom_victim_list, oom_victim_list) {
+ struct mm_struct *mm = p->signal->oom_mm;
+
+ /* Forget about mm which already completed __mmput(). */
+ if (!test_bit(MMF_OOM_VICTIM, &mm->flags))
+ goto remove;
+ /* Skip OOM victims which current thread cannot select. */
+ if (oom_unkillable_task(p, oc->memcg, oc->nodemask))
+ continue;
+ ret = true;
+#ifdef CONFIG_MMU
+ /*
+ * We need to hold mmap_sem for read, in order to safely test
+ * MMF_UNSTABLE flag and blockable invalidate notifiers.
+ */
+ if (down_read_trylock(&mm->mmap_sem)) {
+ if (!test_bit(MMF_UNSTABLE, &mm->flags) &&
+ !mm_has_blockable_invalidate_notifiers(mm))
+ oom_reap_mm(mm);
+ up_read(&mm->mmap_sem);
+ }
+#endif
+ /* Forget if this mm didn't complete __mmput() for too long. */
+ if (!victim_mm_stalling(p, mm))
+ continue;
+ gaveup = true;
+remove:
+ list_del(&p->oom_victim_list);
+ put_task_struct(p);
+ }
+ if (gaveup)
+ debug_show_all_locks();
+
+ return ret && !is_sysrq_oom(oc);
+}
+
/**
* out_of_memory - kill the "best" process when we run out of memory
* @oc: pointer to struct oom_control
@@ -1020,18 +954,8 @@ int unregister_oom_notifier(struct notifier_block *nb)
*/
bool out_of_memory(struct oom_control *oc)
{
- unsigned long freed = 0;
enum oom_constraint constraint = CONSTRAINT_NONE;
-
- if (oom_killer_disabled)
- return false;
-
- if (!is_memcg_oom(oc)) {
- blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
- if (freed > 0)
- /* Got some memory back in the last second. */
- return true;
- }
+ const char *prompt;
/*
* If current has a pending SIGKILL or is exiting, then automatically
@@ -1045,15 +969,6 @@ bool out_of_memory(struct oom_control *oc)
}
/*
- * The OOM killer does not compensate for IO-less reclaim.
- * pagefault_out_of_memory lost its gfp context so we have to
- * make sure exclude 0 mask - all other users should have at least
- * ___GFP_DIRECT_RECLAIM to get here.
- */
- if (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS))
- return true;
-
- /*
* Check if there were limitations on the allocation (only relevant for
* NUMA and memcg) that may require different handling.
*/
@@ -1067,32 +982,46 @@ bool out_of_memory(struct oom_control *oc)
current->signal->oom_score_adj != OOM_SCORE_ADJ_MIN) {
get_task_struct(current);
oc->chosen = current;
- oom_kill_process(oc, "Out of memory (oom_kill_allocating_task)");
- return true;
+ prompt = "Out of memory (oom_kill_allocating_task)";
+ } else {
+ select_bad_process(oc);
+ prompt = !is_memcg_oom(oc) ? "Out of memory" :
+ "Memory cgroup out of memory";
}
-
- select_bad_process(oc);
/* Found nothing?!?! Either we hang forever, or we panic. */
- if (!oc->chosen && !is_sysrq_oom(oc) && !is_memcg_oom(oc)) {
+ if (!oc->chosen) {
+ if (is_sysrq_oom(oc) || is_memcg_oom(oc))
+ return false;
dump_header(oc, NULL);
panic("Out of memory and no killable processes...\n");
}
- if (oc->chosen && oc->chosen != (void *)-1UL) {
- oom_kill_process(oc, !is_memcg_oom(oc) ? "Out of memory" :
- "Memory cgroup out of memory");
- /*
- * Give the killed process a good chance to exit before trying
- * to allocate memory again.
- */
- schedule_timeout_killable(1);
- }
- return !!oc->chosen;
+ mutex_lock(&oom_lock);
+ /*
+ * If there are OOM victims which current thread can select,
+ * wait for them to reach __mmput().
+ *
+ * If oom_killer_disable() is in progress, we can't select new OOM
+ * victims.
+ *
+ * The OOM killer does not compensate for IO-less reclaim.
+ * pagefault_out_of_memory lost its gfp context so we have to
+ * make sure exclude 0 mask - all other users should have at least
+ * ___GFP_DIRECT_RECLAIM to get here.
+ *
+ * Otherwise, invoke the OOM-killer.
+ */
+ if (oom_has_pending_victims(oc) || oom_killer_disabled ||
+ (oc->gfp_mask && !(oc->gfp_mask & __GFP_FS)))
+ put_task_struct(oc->chosen);
+ else
+ oom_kill_process(oc, prompt);
+ mutex_unlock(&oom_lock);
+ return !oom_killer_disabled;
}
/*
* The pagefault handler calls here because it is out of memory, so kill a
- * memory-hogging task. If oom_lock is held by somebody else, a parallel oom
- * killing is already in progress so do nothing.
+ * memory-hogging task.
*/
void pagefault_out_of_memory(void)
{
@@ -1107,8 +1036,6 @@ void pagefault_out_of_memory(void)
if (mem_cgroup_oom_synchronize(true))
return;
- if (!mutex_trylock(&oom_lock))
- return;
out_of_memory(&oc);
- mutex_unlock(&oom_lock);
+ schedule_timeout_killable(1);
}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 1521100..cd7f9db 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3460,29 +3460,16 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
};
struct page *page;
- *did_some_progress = 0;
-
- /*
- * Acquire the oom lock. If that fails, somebody else is
- * making progress for us.
- */
- if (!mutex_trylock(&oom_lock)) {
- *did_some_progress = 1;
- schedule_timeout_uninterruptible(1);
- return NULL;
- }
+ *did_some_progress = try_oom_notifier();
/*
* Go through the zonelist yet one more time, keep very high watermark
* here, this is only to catch a parallel oom killing, we must fail if
- * we're still under heavy pressure. But make sure that this reclaim
- * attempt shall not depend on __GFP_DIRECT_RECLAIM && !__GFP_NORETRY
- * allocation which will never fail due to oom_lock already held.
+ * we're still under heavy pressure.
*/
- page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL) &
- ~__GFP_DIRECT_RECLAIM, order,
+ page = get_page_from_freelist((gfp_mask | __GFP_HARDWALL), order,
ALLOC_WMARK_HIGH|ALLOC_CPUSET, ac);
- if (page)
+ if (page || *did_some_progress)
goto out;
/* Coredumps can quickly deplete all memory reserves */
@@ -3531,7 +3518,6 @@ void warn_alloc(gfp_t gfp_mask, nodemask_t *nodemask, const char *fmt, ...)
ALLOC_NO_WATERMARKS, ac);
}
out:
- mutex_unlock(&oom_lock);
return page;
}
@@ -3863,21 +3849,6 @@ static void wake_all_kswapds(unsigned int order, gfp_t gfp_mask,
return alloc_flags;
}
-static bool oom_reserves_allowed(struct task_struct *tsk)
-{
- if (!tsk_is_oom_victim(tsk))
- return false;
-
- /*
- * !MMU doesn't have oom reaper so give access to memory reserves
- * only to the thread with TIF_MEMDIE set
- */
- if (!IS_ENABLED(CONFIG_MMU) && !test_thread_flag(TIF_MEMDIE))
- return false;
-
- return true;
-}
-
/*
* Distinguish requests which really need access to full memory
* reserves from oom victims which can live with a portion of it
@@ -3893,7 +3864,7 @@ static inline int __gfp_pfmemalloc_flags(gfp_t gfp_mask)
if (!in_interrupt()) {
if (current->flags & PF_MEMALLOC)
return ALLOC_NO_WATERMARKS;
- else if (oom_reserves_allowed(current))
+ else if (tsk_is_oom_victim(current))
return ALLOC_OOM;
}
@@ -3922,6 +3893,7 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
{
struct zone *zone;
struct zoneref *z;
+ bool ret = false;
/*
* Costly allocations might have made a progress but this doesn't mean
@@ -3985,25 +3957,26 @@ bool gfp_pfmemalloc_allowed(gfp_t gfp_mask)
}
}
- /*
- * Memory allocation/reclaim might be called from a WQ
- * context and the current implementation of the WQ
- * concurrency control doesn't recognize that
- * a particular WQ is congested if the worker thread is
- * looping without ever sleeping. Therefore we have to
- * do a short sleep here rather than calling
- * cond_resched().
- */
- if (current->flags & PF_WQ_WORKER)
- schedule_timeout_uninterruptible(1);
- else
- cond_resched();
-
- return true;
+ ret = true;
+ goto out;
}
}
- return false;
+out:
+ /*
+ * Memory allocation/reclaim might be called from a WQ
+ * context and the current implementation of the WQ
+ * concurrency control doesn't recognize that
+ * a particular WQ is congested if the worker thread is
+ * looping without ever sleeping. Therefore we have to
+ * do a short sleep here rather than calling
+ * cond_resched().
+ */
+ if (current->flags & PF_WQ_WORKER)
+ schedule_timeout_uninterruptible(1);
+ else
+ cond_resched();
+ return ret;
}
static inline bool
--
1.8.3.1
On Fri 22-06-18 11:49:14, David Rientjes wrote:
> On Fri, 22 Jun 2018, Michal Hocko wrote:
>
> > > > preempt_disable() is required because it calls kvm_kick_many_cpus() with
> > > > wait == true because KVM_REQ_APIC_PAGE_RELOAD sets KVM_REQUEST_WAIT and
> > > > thus the smp_call_function_many() is going to block until all cpus can run
> > > > ack_flush().
> > >
> > > I will make sure to talk to the maintainer of the respective code to
> > > do the nonblock case correctly.
> >
> > I've just double checked this particular code and the wait path and this
> > one is not a sleep. It is a busy wait for IPI to get handled. So this
> > one should be OK AFAICS. Anyway I will send an RFC and involve
> > respective maintainers to make sure I am not making any incorrect
> > assumptions.
>
> Do you believe that having the only potential source of memory freeing
> busy waiting for all other cpus on the system to run ack_flush() is
> particularly dangerous given the fact that they may be allocating
> themselves?
These are IPIs. How could they depend on a memory allocation? In other
words we do rely on the very same mechanism for TLB flushing so this is
any different.
Maybe I am missing something here though.
--
Michal Hocko
SUSE Labs