Date: Mon, 23 Apr 2018 19:31:05 -0700 (PDT)
From: David Rientjes
X-X-Sender: rientjes@chino.kir.corp.google.com
To: Tetsuo Handa
cc: mhocko@kernel.org, Andrew Morton, Andrea Arcangeli, guro@fb.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
In-Reply-To: <201804221248.CHE35432.FtOMOLSHOFJFVQ@I-love.SAKURA.ne.jp>
Message-ID: 
References: <20180419063556.GK17484@dhcp22.suse.cz> <20180420082349.GW17484@dhcp22.suse.cz> <20180420124044.GA17484@dhcp22.suse.cz> <201804221248.CHE35432.FtOMOLSHOFJFVQ@I-love.SAKURA.ne.jp>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
X-Mailing-List: linux-kernel@vger.kernel.org

On Sun, 22 Apr 2018, Tetsuo Handa wrote:

> > I'm wondering why you do not see oom killing of many processes if the
> > victim is a very large process that takes a long time to free memory in
> > exit_mmap() as I do because the oom reaper gives up trying to acquire
> > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> >
> 
> We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> exit_mmap() holds mmap_sem for write. Then, at least memory which could
> have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> 

I think that's an exceptionally good idea and will mitigate the concerns
of others.

It can be done without holding mm->mmap_sem in exit_mmap() and uses the
same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we
don't get dozens of unnecessary oom kills.

What do you think about this?  It passes preliminary testing on powerpc
and I've enqueued it for much more intensive testing.  (I'm wishing there
was a better way to acknowledge your contribution to fixing this issue,
especially since you brought up the exact problem this is addressing in
previous emails.)



mm, oom: fix concurrent munlock and oom reaper unmap

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up dereferencing a NULL ptl.

Fix this by manually freeing all possible memory from the mm before doing
the munlock and then setting MMF_OOM_SKIP.  The oom reaper can not run on
the mm anymore so the munlock is safe to do in exit_mmap().
It also matches the logic that the oom reaper currently uses for
determining when to set MMF_OOM_SKIP itself, so there's no new risk of
excessive oom killing.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Suggested-by: Tetsuo Handa
Signed-off-by: David Rientjes
---
 include/linux/oom.h |  2 ++
 mm/mmap.c           | 39 ++++++++++++----------
 mm/oom_kill.c       | 81 ++++++++++++++++++++++++---------------------
 3 files changed, 66 insertions(+), 56 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
+void __oom_reap_task_mm(struct mm_struct *mm);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,27 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Manually reap the mm to free as much memory as possible.
+		 * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
+		 * mm from further consideration. Taking mm->mmap_sem for write
+		 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
+		 * will not run on this mm again after mmap_sem is dropped.
+		 *
+		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test it.
+		 */
+		mutex_lock(&oom_lock);
+		__oom_reap_task_mm(mm);
+		mutex_unlock(&oom_lock);
+
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,24 +3057,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 
 	return false;
 }
-
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
 {
-	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (!can_madv_dontneed_vma(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			const unsigned long start = vma->vm_start;
+			const unsigned long end = vma->vm_end;
+			struct mmu_gather tlb;
+
+			tlb_gather_mmu(&tlb, mm, start, end);
+			mmu_notifier_invalidate_range_start(mm, start, end);
+			unmap_page_range(&tlb, vma, start, end, NULL);
+			mmu_notifier_invalidate_range_end(mm, start, end);
+			tlb_finish_mmu(&tlb, start, end);
+		}
+	}
+}
+
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+{
 	bool ret = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
+	 * oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
-			continue;
+	__oom_reap_task_mm(mm);
 
-		/*
-		 * Only anonymous pages have a good chance to be dropped
-		 * without additional steps which we cannot afford as we
-		 * are OOM already.
-		 *
-		 * We do not even care about fs backed pages because all
-		 * which are reclaimable have already been reclaimed and
-		 * we do not want to block exit_mmap by keeping mm ref
-		 * count elevated without a good reason.
-		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
-			const unsigned long start = vma->vm_start;
-			const unsigned long end = vma->vm_end;
-
-			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
-			unmap_page_range(&tlb, vma, start, end, NULL);
-			mmu_notifier_invalidate_range_end(mm, start, end);
-			tlb_finish_mmu(&tlb, start, end);
-		}
-	}
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES ||
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();