Date: Tue, 17 Apr 2018 19:52:41 -0700 (PDT)
From: David Rientjes
To: Andrew Morton, Tetsuo Handa
cc: Michal Hocko, Andrea Arcangeli, Roman Gushchin,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
References: <201804180057.w3I0vieV034949@www262.sakura.ne.jp>
X-Mailing-List: linux-kernel@vger.kernel.org

Since exit_mmap() is done without the protection of mm->mmap_sem, it is
possible for the oom reaper to concurrently operate on an mm until
MMF_OOM_SKIP is set.

This allows munlock_vma_pages_all() to concurrently run while the oom
reaper is operating on a vma.  Since munlock_vma_pages_range() depends on
clearing VM_LOCKED from vm_flags before actually doing the munlock to
determine if any other vmas are locking the same memory, the check for
VM_LOCKED in the oom reaper is racy.

This is especially noticeable on architectures such as powerpc where
clearing a huge pmd requires serialize_against_pte_lookup().  If the pmd
is zapped by the oom reaper during follow_page_mask() after the check for
pmd_none() is bypassed, this ends up dereferencing a NULL ptl.

Fix this by reusing MMF_UNSTABLE to specify that an mm should not be
reaped.  This prevents the concurrent munlock_vma_pages_range() and
unmap_page_range().  The oom reaper will simply not operate on an mm that
has the bit set and leave the unmapping to exit_mmap().

Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
Cc: stable@vger.kernel.org [4.14+]
Signed-off-by: David Rientjes
---
 v2:
  - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either
    by itself or by exit_mmap()), per Tetsuo
  - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more
    isolated way of forcing cpus as non-idle on power

 mm/mmap.c     | 38 ++++++++++++++++++++------------------
 mm/oom_kill.c | 28 +++++++++++++---------------
 2 files changed, 33 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm)
 	/* mm's last user has gone, and its about to be pulled down */
 	mmu_notifier_release(mm);
 
+	if (unlikely(mm_is_oom_victim(mm))) {
+		/*
+		 * Wait for oom_reap_task() to stop working on this mm.  Because
+		 * MMF_UNSTABLE is already set before calling down_read(),
+		 * oom_reap_task() will not run on this mm after up_write().
+		 * oom_reap_task() also depends on a stable VM_LOCKED flag to
+		 * indicate it should not unmap during munlock_vma_pages_all().
+		 *
+		 * mm_is_oom_victim() cannot be set from under us because
+		 * victim->mm is already set to NULL under task_lock before
+		 * calling mmput() and victim->signal->oom_mm is set by the oom
+		 * killer only if victim->mm is non-NULL while holding
+		 * task_lock().
+		 */
+		set_bit(MMF_UNSTABLE, &mm->flags);
+		down_write(&mm->mmap_sem);
+		up_write(&mm->mmap_sem);
+	}
+
 	if (mm->locked_vm) {
 		vma = mm->mmap;
 		while (vma) {
@@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm)
 	/* update_hiwater_rss(mm) here? but nobody should be looking */
 	/* Use -1 here to ensure all VMAs in the mm are unmapped */
 	unmap_vmas(&tlb, vma, 0, -1);
-
-	if (unlikely(mm_is_oom_victim(mm))) {
-		/*
-		 * Wait for oom_reap_task() to stop working on this
-		 * mm. Because MMF_OOM_SKIP is already set before
-		 * calling down_read(), oom_reap_task() will not run
-		 * on this "mm" post up_write().
-		 *
-		 * mm_is_oom_victim() cannot be set from under us
-		 * either because victim->mm is already set to NULL
-		 * under task_lock before calling mmput and oom_mm is
-		 * set not NULL by the OOM killer only if victim->mm
-		 * is found not NULL while holding the task_lock.
-		 */
-		set_bit(MMF_OOM_SKIP, &mm->flags);
-		down_write(&mm->mmap_sem);
-		up_write(&mm->mmap_sem);
-	}
 	free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);
 	tlb_finish_mmu(&tlb, 0, -1);
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/*
 	 * Walk the list again, actually closing and freeing it,
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 	}
 
 	/*
-	 * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-	 * work on the mm anymore. The check for MMF_OOM_SKIP must run
+	 * Tell all users of get_user/copy_from_user etc... that the content
+	 * is no longer stable. No barriers really needed because unmapping
+	 * should imply barriers already and the reader would hit a page fault
+	 * if it stumbled over reaped memory.
+	 *
+	 * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't
+	 * work on the mm anymore. The check for MMF_UNSTABLE must run
 	 * under mmap_sem for reading because it serializes against the
 	 * down_write();up_write() cycle in exit_mmap().
 	 */
-	if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
+	if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) {
 		up_read(&mm->mmap_sem);
 		trace_skip_task_reaping(tsk->pid);
 		goto unlock_oom;
@@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 
 	trace_start_task_reaping(tsk->pid);
 
-	/*
-	 * Tell all users of get_user/copy_from_user etc... that the content
-	 * is no longer stable. No barriers really needed because unmapping
-	 * should imply barriers already and the reader would hit a page fault
-	 * if it stumbled over a reaped memory.
-	 */
-	set_bit(MMF_UNSTABLE, &mm->flags);
-
 	for (vma = mm->mmap ; vma; vma = vma->vm_next) {
 		if (!can_madv_dontneed_vma(vma))
 			continue;
@@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 			tlb_finish_mmu(&tlb, start, end);
 		}
 	}
+	set_bit(MMF_OOM_SKIP, &mm->flags);
 	pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n",
 			task_pid_nr(tsk), tsk->comm,
 			K(get_mm_counter(mm, MM_ANONPAGES)),
@@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk)
 	    test_bit(MMF_OOM_SKIP, &mm->flags))
 		goto done;
 
-
 	pr_info("oom_reaper: unable to reap pid:%d (%s)\n",
 		task_pid_nr(tsk), tsk->comm);
 	debug_show_all_locks();
@@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk)
 	tsk->oom_reaper_list = NULL;
 
 	/*
-	 * Hide this mm from OOM killer because it has been either reaped or
-	 * somebody can't call up_write(mmap_sem).
+	 * If the oom reaper could not get started on this mm and it has not yet
+	 * reached exit_mmap(), set MMF_OOM_SKIP to disregard.
 	 */
-	set_bit(MMF_OOM_SKIP, &mm->flags);
+	if (!test_bit(MMF_UNSTABLE, &mm->flags))
+		set_bit(MMF_OOM_SKIP, &mm->flags);
 
 	/* Drop a reference taken by wake_oom_reaper */
 	put_task_struct(tsk);
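
For review purposes only, the flag handshake described in the changelog
can be modeled in userspace.  The sketch below is not kernel code and is
not part of the patch: every model_* name and the MODEL_* bits are
hypothetical stand-ins, C11 atomic_fetch_or() is assumed as an
approximation of test_and_set_bit(), and mmap_sem is not modeled at all,
so it only shows which side ends up owning the unmap and who publishes
the "skip" bit.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-ins for MMF_UNSTABLE and MMF_OOM_SKIP. */
enum { MODEL_UNSTABLE = 1, MODEL_OOM_SKIP = 2 };

struct model_mm {
	atomic_uint flags;
	bool reaped;	/* true once the modeled reaper has unmapped */
};

static void model_mm_init(struct model_mm *mm)
{
	atomic_init(&mm->flags, 0u);
	mm->reaped = false;
}

static bool model_test_bit(struct model_mm *mm, unsigned int bit)
{
	return (atomic_load(&mm->flags) & bit) != 0;
}

/* Returns the previous state of the bit, like test_and_set_bit(). */
static bool model_test_and_set_bit(struct model_mm *mm, unsigned int bit)
{
	return (atomic_fetch_or(&mm->flags, bit) & bit) != 0;
}

/* exit_mmap() side, step 1: claim the mm before any munlock. */
static void model_exit_mmap_begin(struct model_mm *mm)
{
	model_test_and_set_bit(mm, MODEL_UNSTABLE);
}

/* exit_mmap() side, step 2: page tables are gone, publish the skip bit. */
static void model_exit_mmap_finish(struct model_mm *mm)
{
	model_test_and_set_bit(mm, MODEL_OOM_SKIP);
}

/* Reaper side: returns true only if this call actually reaped the mm. */
static bool model_reap(struct model_mm *mm)
{
	if (model_test_and_set_bit(mm, MODEL_UNSTABLE))
		return false;	/* already claimed: leave it to exit_mmap() */
	mm->reaped = true;
	model_test_and_set_bit(mm, MODEL_OOM_SKIP);
	return true;
}

/* Tail of the reaper: only give up on an mm that nobody has claimed. */
static void model_reaper_done(struct model_mm *mm)
{
	if (!model_test_bit(mm, MODEL_UNSTABLE))
		model_test_and_set_bit(mm, MODEL_OOM_SKIP);
}
```

If model_exit_mmap_begin() runs first, model_reap() backs off and the
skip bit stays clear until model_exit_mmap_finish().  If the reaper never
gets to run model_reap() at all, model_reaper_done() sets the skip bit on
its own, which corresponds to the last hunk in oom_reap_task().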