Received: by 10.192.165.156 with SMTP id m28csp944212imm; Wed, 18 Apr 2018 00:52:29 -0700 (PDT) X-Google-Smtp-Source: AIpwx49WomnWWNtZHxmJKutbRt1r9MxrTfGfV9vcFGkZ7T81pJcWYuojKzU8NqXcuN2olBpOPIUx X-Received: by 2002:a17:902:85:: with SMTP id a5-v6mr1078407pla.99.1524037949736; Wed, 18 Apr 2018 00:52:29 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524037949; cv=none; d=google.com; s=arc-20160816; b=jtT5jyNijDxFy6cOJhBu4BahDWq4d1WHYwARwW342/dclf+tDpw8+RvtZ84Oyzob/f xjkoVdkYkgFhaFIYf87AeguI2i5G3atlQRgPdO74bLnzRq9ZX91TMrOivpGa3CGfAh6j qo/63g0QJTp0Fn7WJT6rJZXrhTAdyeCM2ZRgOzT5H3BiHT8l48SsoW9izIgSbkLzj9aq AR0WfxmsII2g8OcWyQ2ycdrJZRnxyh8Ijkgh4pfM3vlg2updQRaNnEUJnv4m6kQSIdJZ 4z7xifSef8ofnYJT/lORew6WGlw3kDYuGpJTHTYzI2dCYoclQ9Zj2N/OhfszAZdmzLnZ LpYQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:arc-authentication-results; bh=s3EqrZ5Q1M92c03kPQBamL5MvBARehyqTQROOh2naFY=; b=TepowwuPjl/1WWpe7ZsIn+JTK9clmx4GzwnPXTKY+/2N5BRzQgXFZQPEvo2r0dsIhY E/j7ojMwZ9/z97HMiEvSTXRW515w2Vp/sdBhQlwgEsQfTIDmyMNUWM0PVRaju5gkzWvk oOcXtH6q+akIOgzvPrV2dFSt1nUXZdF8es/fav/KJ1MSWlVGQ7WWo1Js/GHq0k0t3sIa T0RFkUedUPRLld0tTpIWRXZSj4PDntkRJpg37izSChC+TiwKGqbrGHbRbPYp79xo91ip 8yuRu/a8JkU16Lo13WBqwQtuQSlFU/LULUExF6RzsFTZdmlifcMQC6KHcuDbpZ6oeKQ+ 7Xzw== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id p9si604233pgu.80.2018.04.18.00.52.15; Wed, 18 Apr 2018 00:52:29 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1752511AbeDRHuy (ORCPT + 99 others); Wed, 18 Apr 2018 03:50:54 -0400 Received: from mx2.suse.de ([195.135.220.15]:38801 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751180AbeDRHux (ORCPT ); Wed, 18 Apr 2018 03:50:53 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 5E2BAADD2; Wed, 18 Apr 2018 07:50:52 +0000 (UTC) Date: Wed, 18 Apr 2018 09:50:51 +0200 From: Michal Hocko To: David Rientjes Cc: Andrew Morton , Tetsuo Handa , Andrea Arcangeli , Roman Gushchin , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Message-ID: <20180418075051.GO17484@dhcp22.suse.cz> References: <201804180057.w3I0vieV034949@www262.sakura.ne.jp> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue 17-04-18 19:52:41, David Rientjes wrote: > Since exit_mmap() is done without the protection of mm->mmap_sem, it is > possible for the oom reaper to concurrently operate on an mm until > MMF_OOM_SKIP is set. > > This allows munlock_vma_pages_all() to concurrently run while the oom > reaper is operating on a vma. Since munlock_vma_pages_range() depends on > clearing VM_LOCKED from vm_flags before actually doing the munlock to > determine if any other vmas are locking the same memory, the check for > VM_LOCKED in the oom reaper is racy. > > This is especially noticeable on architectures such as powerpc where > clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd > is zapped by the oom reaper during follow_page_mask() after the check for > pmd_none() is bypassed, this ends up deferencing a NULL ptl. > > Fix this by reusing MMF_UNSTABLE to specify that an mm should not be > reaped. This prevents the concurrent munlock_vma_pages_range() and > unmap_page_range(). The oom reaper will simply not operate on an mm that > has the bit set and leave the unmapping to exit_mmap(). This will further complicate the protocol and actually theoretically restores the oom lockup issues because the oom reaper doesn't set MMF_OOM_SKIP when racing with exit_mmap so we fully rely that nothing blocks there... So the resulting code is more fragile and tricky. Can we try a simpler way and get back to what I was suggesting before [1] and simply not play tricks with down_write(&mm->mmap_sem); up_write(&mm->mmap_sem); and use the write lock in exit_mmap for oom_victims? Andrea wanted to make this more clever but this is the second fallout which could have been prevented. The patch would be smaller and the locking protocol easier [1] http://lkml.kernel.org/r/20170727065023.GB20970@dhcp22.suse.cz > Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently") > Cc: stable@vger.kernel.org [4.14+] > Signed-off-by: David Rientjes > --- > v2: > - oom reaper only sets MMF_OOM_SKIP if MMF_UNSTABLE was never set (either > by itself or by exit_mmap(), per Tetsuo > - s/kick_all_cpus_sync/serialize_against_pte_lookup/ in changelog as more > isolated way of forcing cpus as non-idle on power > > mm/mmap.c | 38 ++++++++++++++++++++------------------ > mm/oom_kill.c | 28 +++++++++++++--------------- > 2 files changed, 33 insertions(+), 33 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3015,6 +3015,25 @@ void exit_mmap(struct mm_struct *mm) > /* mm's last user has gone, and its about to be pulled down */ > mmu_notifier_release(mm); > > + if (unlikely(mm_is_oom_victim(mm))) { > + /* > + * Wait for oom_reap_task() to stop working on this mm. Because > + * MMF_UNSTABLE is already set before calling down_read(), > + * oom_reap_task() will not run on this mm after up_write(). > + * oom_reap_task() also depends on a stable VM_LOCKED flag to > + * indicate it should not unmap during munlock_vma_pages_all(). > + * > + * mm_is_oom_victim() cannot be set from under us because > + * victim->mm is already set to NULL under task_lock before > + * calling mmput() and victim->signal->oom_mm is set by the oom > + * killer only if victim->mm is non-NULL while holding > + * task_lock(). > + */ > + set_bit(MMF_UNSTABLE, &mm->flags); > + down_write(&mm->mmap_sem); > + up_write(&mm->mmap_sem); > + } > + > if (mm->locked_vm) { > vma = mm->mmap; > while (vma) { > @@ -3036,26 +3055,9 @@ void exit_mmap(struct mm_struct *mm) > /* update_hiwater_rss(mm) here? but nobody should be looking */ > /* Use -1 here to ensure all VMAs in the mm are unmapped */ > unmap_vmas(&tlb, vma, 0, -1); > - > - if (unlikely(mm_is_oom_victim(mm))) { > - /* > - * Wait for oom_reap_task() to stop working on this > - * mm. Because MMF_OOM_SKIP is already set before > - * calling down_read(), oom_reap_task() will not run > - * on this "mm" post up_write(). > - * > - * mm_is_oom_victim() cannot be set from under us > - * either because victim->mm is already set to NULL > - * under task_lock before calling mmput and oom_mm is > - * set not NULL by the OOM killer only if victim->mm > - * is found not NULL while holding the task_lock. > - */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > - } > free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); > tlb_finish_mmu(&tlb, 0, -1); > + set_bit(MMF_OOM_SKIP, &mm->flags); > > /* > * Walk the list again, actually closing and freeing it, > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -521,12 +521,17 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > } > > /* > - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't > - * work on the mm anymore. The check for MMF_OOM_SKIP must run > + * Tell all users of get_user/copy_from_user etc... that the content > + * is no longer stable. No barriers really needed because unmapping > + * should imply barriers already and the reader would hit a page fault > + * if it stumbled over reaped memory. > + * > + * MMF_UNSTABLE is also set by exit_mmap when the OOM reaper shouldn't > + * work on the mm anymore. The check for MMF_OOM_UNSTABLE must run > * under mmap_sem for reading because it serializes against the > * down_write();up_write() cycle in exit_mmap(). > */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > + if (test_and_set_bit(MMF_UNSTABLE, &mm->flags)) { > up_read(&mm->mmap_sem); > trace_skip_task_reaping(tsk->pid); > goto unlock_oom; > @@ -534,14 +539,6 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > > trace_start_task_reaping(tsk->pid); > > - /* > - * Tell all users of get_user/copy_from_user etc... that the content > - * is no longer stable. No barriers really needed because unmapping > - * should imply barriers already and the reader would hit a page fault > - * if it stumbled over a reaped memory. > - */ > - set_bit(MMF_UNSTABLE, &mm->flags); > - > for (vma = mm->mmap ; vma; vma = vma->vm_next) { > if (!can_madv_dontneed_vma(vma)) > continue; > @@ -567,6 +564,7 @@ static bool __oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > tlb_finish_mmu(&tlb, start, end); > } > } > + set_bit(MMF_OOM_SKIP, &mm->flags); > pr_info("oom_reaper: reaped process %d (%s), now anon-rss:%lukB, file-rss:%lukB, shmem-rss:%lukB\n", > task_pid_nr(tsk), tsk->comm, > K(get_mm_counter(mm, MM_ANONPAGES)), > @@ -594,7 +592,6 @@ static void oom_reap_task(struct task_struct *tsk) > test_bit(MMF_OOM_SKIP, &mm->flags)) > goto done; > > - > pr_info("oom_reaper: unable to reap pid:%d (%s)\n", > task_pid_nr(tsk), tsk->comm); > debug_show_all_locks(); > @@ -603,10 +600,11 @@ static void oom_reap_task(struct task_struct *tsk) > tsk->oom_reaper_list = NULL; > > /* > - * Hide this mm from OOM killer because it has been either reaped or > - * somebody can't call up_write(mmap_sem). > + * If the oom reaper could not get started on this mm and it has not yet > + * reached exit_mmap(), set MMF_OOM_SKIP to disregard. > */ > - set_bit(MMF_OOM_SKIP, &mm->flags); > + if (!test_bit(MMF_UNSTABLE, &mm->flags)) > + set_bit(MMF_OOM_SKIP, &mm->flags); > > /* Drop a reference taken by wake_oom_reaper */ > put_task_struct(tsk); -- Michal Hocko SUSE Labs