Date: Thu, 24 May 2018 14:22:53 -0700 (PDT)
From: David Rientjes
To: Michal Hocko, Tetsuo Handa
cc: Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [rfc patch] mm, oom: fix unnecessary killing of additional processes
X-Mailing-List: linux-kernel@vger.kernel.org

The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
it cannot reap an mm. This can happen for a variety of reasons,
including:

 - the inability to grab mm->mmap_sem in a sufficient amount of time,

 - when the mm has blockable mmu notifiers that could cause the oom reaper
   to stall indefinitely,

but we can also add a third when the oom reaper can "reap" an mm but doing
so is unlikely to free any amount of memory:

 - when the mm's memory is fully mlocked.

When all memory is mlocked, the oom reaper will not be able to free any
substantial amount of memory.
It sets MMF_OOM_SKIP before the victim can unmap and free its memory in
exit_mmap() and subsequent oom victims are chosen unnecessarily. This is
trivial to reproduce if all eligible processes on the system have mlocked
their memory: the oom killer calls panic() even though forward progress
can be made.

This is the same issue where the exit path sets MMF_OOM_SKIP before
unmapping memory and additional processes can be chosen unnecessarily
because the oom killer is racing with exit_mmap().

We can't simply defer setting MMF_OOM_SKIP, however, because if there is
a true oom livelock in progress, it never gets set and no additional
killing is possible.

To fix this, this patch introduces a per-mm reaping timeout, initially set
at 10s. It requires that the oom reaper's list becomes a properly linked
list so that other mm's may be reaped while waiting for an mm's timeout to
expire.

The exit path will now set MMF_OOM_SKIP only after all memory has been
freed, so additional oom killing is justified, and rely on MMF_UNSTABLE to
determine when it can race with the oom reaper.

The oom reaper will now set MMF_OOM_SKIP only after the reap timeout has
lapsed because it can no longer guarantee forward progress.

The reaping timeout is intentionally set for a substantial amount of time
since oom livelock is a very rare occurrence and it's better to optimize
for preventing additional (unnecessary) oom killing than for a scenario
that is much less likely.

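For readers who want to poke at the timeout idea outside the kernel, here
is a minimal userspace sketch in C11. It is not the patch's code: struct
mm_sketch, tick_after(), arm_reap_timeout() and reap_timed_out() are
hypothetical names, and a plain tick counter stands in for jiffies. It
only shows the two properties the description relies on: the deadline is
armed once (a zero value means "not queued yet"), and the expiry check
stays correct when the counter wraps.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the mm; only the field the sketch needs. */
struct mm_sketch {
	_Atomic unsigned long reap_timeout;	/* 0 means "not queued yet" */
};

/* Wraparound-safe "a is after b", the same idea as a jiffies comparison. */
static bool tick_after(unsigned long a, unsigned long b)
{
	return (long)(b - a) < 0;
}

/*
 * Arm the deadline exactly once. Returns true only for the caller that
 * moved reap_timeout from 0 to a deadline; later callers find it already
 * set and should not queue the mm again.
 */
static bool arm_reap_timeout(struct mm_sketch *mm, unsigned long now,
			     unsigned long timeout_ticks)
{
	unsigned long expected = 0;

	return atomic_compare_exchange_strong(&mm->reap_timeout, &expected,
					      now + timeout_ticks);
}

/* True once "now" has moved past the armed deadline. */
static bool reap_timed_out(struct mm_sketch *mm, unsigned long now)
{
	return tick_after(now, atomic_load(&mm->reap_timeout));
}
```

In this sketch a second arm_reap_timeout() call on the same mm fails and
leaves the first deadline in place, which is what lets the arming step
double as an "already queued?" test.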
Signed-off-by: David Rientjes
---
 include/linux/mm_types.h |  4 ++
 include/linux/sched.h    |  2 +-
 mm/mmap.c                | 12 +++---
 mm/oom_kill.c            | 85 ++++++++++++++++++++++++++--------------
 4 files changed, 66 insertions(+), 37 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -462,6 +462,10 @@ struct mm_struct {
 #ifdef CONFIG_MMU_NOTIFIER
 	struct mmu_notifier_mm *mmu_notifier_mm;
 #endif
+#ifdef CONFIG_MMU
+	/* When to give up on oom reaping this mm */
+	unsigned long reap_timeout;
+#endif
 #if defined(CONFIG_TRANSPARENT_HUGEPAGE) && !USE_SPLIT_PMD_PTLOCKS
 	pgtable_t pmd_huge_pte; /* protected by page_table_lock */
 #endif
diff --git a/include/linux/sched.h b/include/linux/sched.h
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1151,7 +1151,7 @@ struct task_struct {
 #endif
 	int pagefault_disabled;
 #ifdef CONFIG_MMU
-	struct task_struct *oom_reaper_list;
+	struct list_head oom_reap_list;
 #endif
 #ifdef CONFIG_VMAP_STACK
 	struct vm_struct *stack_vm_area;
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3059,11 +3059,10 @@ void exit_mmap(struct mm_struct *mm)
 	if (unlikely(mm_is_oom_victim(mm))) {
 		/*
 		 * Manually reap the mm to free as much memory as possible.
-		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
-		 * this mm from further consideration.  Taking mm->mmap_sem for
-		 * write after setting MMF_OOM_SKIP will guarantee that the oom
-		 * reaper will not run on this mm again after mmap_sem is
-		 * dropped.
+		 * Then, set MMF_UNSTABLE to avoid racing with the oom reaper.
+		 * Taking mm->mmap_sem for write after setting MMF_UNSTABLE will
+		 * guarantee that the oom reaper will not run on this mm again
+		 * after mmap_sem is dropped.
 		 *
 		 * Nothing can be holding mm->mmap_sem here and the above call
 		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
@@ -3077,7 +3076,7 @@ void exit_mmap(struct mm_struct *mm)
 		__oom_reap_task_mm(mm);
 		mutex_unlock(&oom_lock);
 
-		set_bit(MMF_OOM_SKIP, &mm->flags);
+		set_bit(MMF_UNSTABLE, &mm->flags);
 		down_write(&mm->mmap_sem);
 		up_write(&mm->mmap_sem);
 	}
@@ -3105,6 +3104,7 @@ void exit_mmap(struct mm_struct *mm)
 	unmap_vmas(&tlb, vma, 0, -1);
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -476,7 +476,7 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
  */
 static struct task_struct *oom_reaper_th;
 static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
-static struct task_struct *oom_reaper_list;
+static LIST_HEAD(oom_reaper_list);
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
 void __oom_reap_task_mm(struct mm_struct *mm)
@@ -558,12 +558,12 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * MMF_UNSTABLE is set by exit_mmap when the OOM reaper can't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_bit(MMF_UNSTABLE, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
@@ -589,31 +589,49 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 #define MAX_OOM_REAP_RETRIES 10
 static void oom_reap_task(struct task_struct *tsk)
 {
-	int attempts = 0;
 	struct mm_struct *mm = tsk->signal->oom_mm;
+	bool ret = true;
 
-	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
-		schedule_timeout_idle(HZ/10);
+	/*
+	 * If this mm has either been fully unmapped, or the oom reaper has
+	 * given up on it, nothing left to do except drop the refcount.
+	 */
+	if (test_bit(MMF_OOM_SKIP, &mm->flags))
+		goto drop;
 
-	if (attempts <= MAX_OOM_REAP_RETRIES ||
-	    test_bit(MMF_OOM_SKIP, &mm->flags))
-		goto done;
+	/*
+	 * If this mm has already been reaped, doing so again will not likely
+	 * free additional memory.
+	 */
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		ret = oom_reap_task_mm(tsk, mm);
+
+	if (time_after(jiffies, mm->reap_timeout)) {
+		if (!test_bit(MMF_OOM_SKIP, &mm->flags)) {
+			pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
+				task_pid_nr(tsk), tsk->comm);
+			debug_show_all_locks();
 
-	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
-		task_pid_nr(tsk), tsk->comm);
-	debug_show_all_locks();
+			/*
+			 * Reaping has failed for the timeout period, so give up
+			 * and allow additional processes to be oom killed.
+			 */
+			set_bit(MMF_OOM_SKIP, &mm->flags);
+		}
+		goto drop;
+	}
 
-done:
-	tsk->oom_reaper_list = NULL;
+	if (!ret)
+		schedule_timeout_idle(HZ/10);
 
-	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
-	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	/* Enqueue to be reaped again */
+	spin_lock(&oom_reaper_lock);
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
+	spin_unlock(&oom_reaper_lock);
+	return;
 
-	/* Drop a reference taken by wake_oom_reaper */
+drop:
+	/* Drop the reference taken by wake_oom_reaper() */
 	put_task_struct(tsk);
 }
 
@@ -622,11 +640,13 @@ static int oom_reaper(void *unused)
 	while (true) {
 		struct task_struct *tsk = NULL;
 
-		wait_event_freezable(oom_reaper_wait, oom_reaper_list != NULL);
+		wait_event_freezable(oom_reaper_wait,
+				     !list_empty(&oom_reaper_list));
 		spin_lock(&oom_reaper_lock);
-		if (oom_reaper_list != NULL) {
-			tsk = oom_reaper_list;
-			oom_reaper_list = tsk->oom_reaper_list;
+		if (!list_empty(&oom_reaper_list)) {
+			tsk = list_first_entry(&oom_reaper_list,
+					struct task_struct, oom_reap_list);
+			list_del(&tsk->oom_reap_list);
 		}
 		spin_unlock(&oom_reaper_lock);
 
@@ -637,17 +657,22 @@ static int oom_reaper(void *unused)
 	return 0;
 }
 
+/* How long to wait to oom reap an mm before selecting another process */
+#define OOM_REAP_TIMEOUT_MSECS (10 * 1000)
 static void wake_oom_reaper(struct task_struct *tsk)
 {
-	/* tsk is already queued? */
-	if (tsk == oom_reaper_list || tsk->oom_reaper_list)
+	/*
+	 * Set the reap timeout; if it's already set, the mm is enqueued and
+	 * this tsk can be ignored.
+	 */
+	if (cmpxchg(&tsk->signal->oom_mm->reap_timeout, 0UL,
+			jiffies + msecs_to_jiffies(OOM_REAP_TIMEOUT_MSECS)))
 		return;
 
 	get_task_struct(tsk);
 
 	spin_lock(&oom_reaper_lock);
-	tsk->oom_reaper_list = oom_reaper_list;
-	oom_reaper_list = tsk;
+	list_add(&tsk->oom_reap_list, &oom_reaper_list);
 	spin_unlock(&oom_reaper_lock);
 	trace_wake_reaper(tsk->pid);
 	wake_up(&oom_reaper_wait);
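
The queue change above depends on turning a list node back into the task
that embeds it. The following is a rough userspace sketch of that
pattern, not kernel code: struct list_node, node_to_entry(),
list_push_front(), list_unlink(), struct task_sketch and dequeue_task()
are all hypothetical names written for illustration, loosely modelled on
the kernel's circular doubly linked lists.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal userspace re-creation of a circular doubly linked list node. */
struct list_node {
	struct list_node *next, *prev;
};

/* Recover the enclosing struct from a pointer to its embedded node. */
#define node_to_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

static void list_init(struct list_node *head)
{
	head->next = head->prev = head;
}

static int list_is_empty(const struct list_node *head)
{
	return head->next == head;
}

/* Insert right after the head, i.e. at the front of the queue. */
static void list_push_front(struct list_node *node, struct list_node *head)
{
	node->next = head->next;
	node->prev = head;
	head->next->prev = node;
	head->next = node;
}

/* Remove a node from whatever list it is on and leave it self-linked. */
static void list_unlink(struct list_node *node)
{
	node->prev->next = node->next;
	node->next->prev = node->prev;
	node->next = node->prev = node;
}

/* Hypothetical stand-in for task_struct. */
struct task_sketch {
	int pid;
	struct list_node reap_node;
};

/* Pop the first queued task, or NULL when the queue is empty. */
static struct task_sketch *dequeue_task(struct list_node *head)
{
	struct task_sketch *tsk;

	if (list_is_empty(head))
		return NULL;
	/* Convert head->next, not the head: the head is not inside a task. */
	tsk = node_to_entry(head->next, struct task_sketch, reap_node);
	list_unlink(&tsk->reap_node);
	return tsk;
}
```

Because insertion happens at the front, this sketch hands back the most
recently queued task first. Inserting at the tail instead would give
first-in, first-out order, which may matter if a requeued mm should wait
behind the others.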