Date: Tue, 24 Apr 2018 15:34:03 -0700 (PDT)
From: David Rientjes
To: Andrew Morton, Tetsuo Handa, Linus Torvalds
Cc: mhocko@kernel.org, Andrea Arcangeli, guro@fb.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch v3 for-4.17] mm, oom: fix concurrent munlock and oom reaper unmap
References: <201804221248.CHE35432.FtOMOLSHOFJFVQ@I-love.SAKURA.ne.jp>
 <201804240511.w3O5BY4o090598@www262.sakura.ne.jp>
 <201804250657.GFI21363.StOJHOQFOMFVFL@I-love.SAKURA.ne.jp>
User-Agent: Alpine 2.21 (DEB 202 2017-01-01)

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.  This allows munlock_vma_pages_all() to run
concurrently while the oom reaper is operating on a vma.
Since munlock_vma_pages_range() depends on clearing VM_LOCKED from
vm_flags before actually doing the munlock to determine if any other
vmas are locking the same memory, the check for VM_LOCKED in the oom
reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check
for pmd_none() is bypassed, this ends up dereferencing a NULL ptl,
resulting in a kernel oops.

Fix this by manually freeing all possible memory from the mm before
doing the munlock and then setting MMF_OOM_SKIP.  The oom reaper cannot
run on the mm anymore, so the munlock is safe to do in exit_mmap().  It
also matches the logic that the oom reaper currently uses for
determining when to set MMF_OOM_SKIP itself, so there's no new risk of
excessive oom killing.

This patch fixes CVE-2018-1000200.

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Suggested-by: Tetsuo Handa
Signed-off-by: David Rientjes
---
 include/linux/oom.h |  2 ++
 mm/mmap.c           | 44 ++++++++++++++----------
 mm/oom_kill.c       | 81 ++++++++++++++++++++++++---------------------
 3 files changed, 71 insertions(+), 56 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -95,6 +95,8 @@ static inline int check_stable_address_space(struct mm_struct *mm)
 	return 0;
 }
 
+void __oom_reap_task_mm(struct mm_struct *mm);
+
 extern unsigned long oom_badness(struct task_struct *p,
 		struct mem_cgroup *memcg, const nodemask_t *nodemask,
 		unsigned long totalpages);
diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,32 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Manually reap the mm to free as much memory as possible.
+		 * Then, as the oom reaper does, set MMF_OOM_SKIP to disregard
+		 * this mm from further consideration.  Taking mm->mmap_sem for
+		 * write after setting MMF_OOM_SKIP will guarantee that the oom
+		 * reaper will not run on this mm again after mmap_sem is
+		 * dropped.
+		 *
+		 * Nothing can be holding mm->mmap_sem here and the above call
+		 * to mmu_notifier_release(mm) ensures mmu notifier callbacks in
+		 * __oom_reap_task_mm() will not block.
+		 *
+		 * This needs to be done before calling munlock_vma_pages_all(),
+		 * which clears VM_LOCKED, otherwise the oom reaper cannot
+		 * reliably test it.
+		 */
+		mutex_lock(&oom_lock);
+		__oom_reap_task_mm(mm);
+		mutex_unlock(&oom_lock);
+
+		set_bit(MMF_OOM_SKIP, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,24 +3062,6 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -469,7 +469,6 @@ bool process_shares_mm(struct task_struct *p, struct mm_struct *mm)
 
 	return false;
 }
-
 #ifdef CONFIG_MMU
 /*
  * OOM Reaper kernel thread which tries to reap the memory used by the OOM
@@ -480,16 +479,54 @@ static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);
 static struct task_struct *oom_reaper_list;
 static DEFINE_SPINLOCK(oom_reaper_lock);
 
-static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+void __oom_reap_task_mm(struct mm_struct *mm)
 {
-	struct mmu_gather tlb;
 	struct vm_area_struct *vma;
+
+	/*
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over a reaped memory.
+	 */
+	set_bit(MMF_UNSTABLE, &mm->flags);
+
+	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
+		if (!can_madv_dontneed_vma(vma))
+			continue;
+
+		/*
+		 * Only anonymous pages have a good chance to be dropped
+		 * without additional steps which we cannot afford as we
+		 * are OOM already.
+		 *
+		 * We do not even care about fs backed pages because all
+		 * which are reclaimable have already been reclaimed and
+		 * we do not want to block exit_mmap by keeping mm ref
+		 * count elevated without a good reason.
+		 */
+		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
+			const unsigned long start = vma->vm_start;
+			const unsigned long end = vma->vm_end;
+			struct mmu_gather tlb;
+
+			tlb_gather_mmu(&tlb, mm, start, end);
+			mmu_notifier_invalidate_range_start(mm, start, end);
+			unmap_page_range(&tlb, vma, start, end, NULL);
+			mmu_notifier_invalidate_range_end(mm, start, end);
+			tlb_finish_mmu(&tlb, start, end);
+		}
+	}
+}
+
+static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
+{
 	bool ret = true;
 
 	/*
 	 * We have to make sure to not race with the victim exit path
 	 * and cause premature new oom victim selection:
-	 * __oom_reap_task_mm		exit_mm
+	 * oom_reap_task_mm		exit_mm
 	 *   mmget_not_zero
 	 *				  mmput
 	 *				    atomic_dec_and_test
@@ -534,39 +571,8 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
-	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
-		if (!can_madv_dontneed_vma(vma))
-			continue;
+	__oom_reap_task_mm(mm);
 
-		/*
-		 * Only anonymous pages have a good chance to be dropped
-		 * without additional steps which we cannot afford as we
-		 * are OOM already.
-		 *
-		 * We do not even care about fs backed pages because all
-		 * which are reclaimable have already been reclaimed and
-		 * we do not want to block exit_mmap by keeping mm ref
-		 * count elevated without a good reason.
-		 */
-		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
-			const unsigned long start = vma->vm_start;
-			const unsigned long end = vma->vm_end;
-
-			tlb_gather_mmu(&tlb, mm, start, end);
-			mmu_notifier_invalidate_range_start(mm, start, end);
-			unmap_page_range(&tlb, vma, start, end, NULL);
-			mmu_notifier_invalidate_range_end(mm, start, end);
-			tlb_finish_mmu(&tlb, start, end);
-		}
-	}
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -587,14 +593,13 @@ static void oom_reap_task(struct task_struct *tsk)
 	struct mm_struct *mm = tsk->signal->oom_mm;
 
 	/* Retry the down_read_trylock(mmap_sem) a few times */
-	while (attempts++ < MAX_OOM_REAP_RETRIES && !__oom_reap_task_mm(tsk, mm))
+	while (attempts++ < MAX_OOM_REAP_RETRIES && !oom_reap_task_mm(tsk, mm))
 		schedule_timeout_idle(HZ/10);
 
 	if (attempts <= MAX_OOM_REAP_RETRIES ||
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();