Message-Id: <201804240511.w3O5BY4o090598@www262.sakura.ne.jp>
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
From: Tetsuo Handa
To: David Rientjes
Cc: mhocko@kernel.org, Andrew Morton, Andrea Arcangeli, guro@fb.com,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Date: Tue, 24 Apr 2018 14:11:34 +0900
References: <201804221248.CHE35432.FtOMOLSHOFJFVQ@I-love.SAKURA.ne.jp>
In-Reply-To:

> On Sun, 22 Apr 2018, Tetsuo Handa wrote:
>
> > > I'm wondering why you do not see oom killing of many processes if the
> > > victim is a very large process that takes a long time to free memory in
> > > exit_mmap() as I do because the oom reaper gives up trying to acquire
> > > mm->mmap_sem and just sets MMF_OOM_SKIP itself.
> > >
> >
> > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before
> > exit_mmap() holds mmap_sem for write. Then, at least memory which could
> > have been reclaimed if exit_mmap() did not hold mmap_sem for write will
> > be guaranteed to be reclaimed before MMF_OOM_SKIP is set.
> >
>
> I think that's an exceptionally good idea and will mitigate the concerns
> of others.
>
> It can be done without holding mm->mmap_sem in exit_mmap() and uses the
> same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we
> don't get dozens of unnecessary oom kills.
>
> What do you think about this? It passes preliminary testing on powerpc
> and I've enqueued it for much more intensive testing. (I'm wishing there
> was a better way to acknowledge your contribution to fixing this issue,
> especially since you brought up the exact problem this is addressing in
> previous emails.)
>

I don't think this patch is safe: exit_mmap() would call
mmu_notifier_invalidate_range_{start,end}(), which might block, while
holding oom_lock, and oom_reap_task_mm() would then be stuck waiting for
the oom_lock held by exit_mmap(). exit_mmap() must not block while
holding oom_lock, in order to guarantee that oom_reap_task_mm() can
give up.

Here is a suggestion on top of your patch (a simplified user-space
sketch of the resulting ordering argument follows after the diff):

 mm/mmap.c     | 13 +++++--------
 mm/oom_kill.c | 51 ++++++++++++++++++++++++++-------------------------
 2 files changed, 31 insertions(+), 33 deletions(-)

diff --git a/mm/mmap.c b/mm/mmap.c
index 981eed4..7b31357 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm)
                 /*
                  * Manually reap the mm to free as much memory as possible.
                  * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this
-                 * mm from further consideration. Taking mm->mmap_sem for write
-                 * after setting MMF_OOM_SKIP will guarantee that the oom reaper
-                 * will not run on this mm again after mmap_sem is dropped.
+                 * mm from further consideration. Setting MMF_OOM_SKIP under
+                 * oom_lock held will guarantee that the OOM reaper will not
+                 * run on this mm again.
                  *
                  * This needs to be done before calling munlock_vma_pages_all(),
                  * which clears VM_LOCKED, otherwise the oom reaper cannot
                  * reliably test it.
                  */
-                mutex_lock(&oom_lock);
                 __oom_reap_task_mm(mm);
-                mutex_unlock(&oom_lock);
-
+                mutex_lock(&oom_lock);
                 set_bit(MMF_OOM_SKIP, &mm->flags);
-                down_write(&mm->mmap_sem);
-                up_write(&mm->mmap_sem);
         }
+                mutex_unlock(&oom_lock);
         }

         if (mm->locked_vm) {
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index 8ba6cb8..9a29df8 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
 {
         bool ret = true;

+        mutex_lock(&oom_lock);
+
         /*
-         * We have to make sure to not race with the victim exit path
-         * and cause premature new oom victim selection:
-         * oom_reap_task_mm             exit_mm
-         *   mmget_not_zero
-         *                                mmput
-         *                                  atomic_dec_and_test
-         *                                exit_oom_victim
-         *                              [...]
-         *                              out_of_memory
-         *                                select_bad_process
-         *                                  # no TIF_MEMDIE task selects new victim
-         *  unmap_page_range # frees some memory
+         * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't
+         * work on the mm anymore. The check for MMF_OOM_SKIP must run
+         * under oom_lock held.
          */
-        mutex_lock(&oom_lock);
+        if (test_bit(MMF_OOM_SKIP, &mm->flags))
+                goto unlock_oom;

         if (!down_read_trylock(&mm->mmap_sem)) {
                 ret = false;
@@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm)
                 goto unlock_oom;
         }

-        /*
-         * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't
-         * work on the mm anymore. The check for MMF_OOM_SKIP must run
-         * under mmap_sem for reading because it serializes against the
-         * down_write();up_write() cycle in exit_mmap().
-         */
-        if (test_bit(MMF_OOM_SKIP, &mm->flags)) {
-                up_read(&mm->mmap_sem);
-                trace_skip_task_reaping(tsk->pid);
-                goto unlock_oom;
-        }
-
         trace_start_task_reaping(tsk->pid);

         __oom_reap_task_mm(mm);
@@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk)
         /*
          * Hide this mm from OOM killer because it has been either reaped or
          * somebody can't call up_write(mmap_sem).
+         *
+         * We have to make sure to not cause premature new oom victim selection:
+         *
+         * __alloc_pages_may_oom()                   oom_reap_task_mm()/exit_mmap()
+         *   mutex_trylock(&oom_lock)
+         *   get_page_from_freelist(ALLOC_WMARK_HIGH) # fails
+         *                                             unmap_page_range() # frees some memory
+         *                                             set_bit(MMF_OOM_SKIP)
+         *   out_of_memory()
+         *     select_bad_process()
+         *       test_bit(MMF_OOM_SKIP) # selects new oom victim
+         *   mutex_unlock(&oom_lock)
+         *
+         * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the
+         * last second allocation attempt is done by __alloc_pages_may_oom()
+         * before out_of_memory() selects the next OOM victim by finding
+         * MMF_OOM_SKIP.
          */
+        mutex_lock(&oom_lock);
         set_bit(MMF_OOM_SKIP, &mm->flags);
+        mutex_unlock(&oom_lock);

         /* Drop a reference taken by wake_oom_reaper */
         put_task_struct(tsk);
--
1.8.3.1
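
To make the ordering argument in the oom_reap_task() comment above easier to
follow outside of kernel context, here is a minimal user-space sketch in plain
C with pthreads (build with e.g. gcc -pthread). It is not kernel code: every
identifier below (fake_oom_lock, mmf_oom_skip, free_pages,
reaper_or_exit_side, allocator_side) is invented for this illustration, and
the kernel functions named in the comments only indicate which step each line
is meant to model. The point it demonstrates is that publishing the skip flag
only under the same lock that the allocator holds across its last-second
allocation attempt and its victim check excludes the interleaving where the
allocator both misses the freed memory and selects a new victim.

/*
 * Minimal user-space model (NOT kernel code) of the locking argument above.
 * All names below are made up for this sketch.
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

static pthread_mutex_t fake_oom_lock = PTHREAD_MUTEX_INITIALIZER;
static bool mmf_oom_skip;      /* models MMF_OOM_SKIP; only touched under fake_oom_lock */
static atomic_long free_pages; /* models pages freed by the reaper / exit_mmap() */

/* Models oom_reap_task_mm()/exit_mmap(): free memory first, set the flag under the lock. */
static void *reaper_or_exit_side(void *arg)
{
        (void)arg;
        atomic_fetch_add(&free_pages, 1024);    /* unmap_page_range() frees some memory */

        pthread_mutex_lock(&fake_oom_lock);
        mmf_oom_skip = true;                    /* set_bit(MMF_OOM_SKIP) under oom_lock */
        pthread_mutex_unlock(&fake_oom_lock);
        return NULL;
}

/* Models __alloc_pages_may_oom(): last-second allocation attempt, then victim check. */
static void *allocator_side(void *arg)
{
        (void)arg;
        pthread_mutex_lock(&fake_oom_lock);     /* mutex_trylock(&oom_lock) in the kernel */

        if (atomic_load(&free_pages) >= 1024) {
                /* The last-second allocation attempt succeeds: no OOM kill needed. */
                printf("last-second allocation succeeded\n");
        } else if (mmf_oom_skip) {
                /*
                 * Unreachable: the flag is set under the same lock *after* the
                 * pages were freed, so seeing it implies the branch above won.
                 */
                printf("model violated: premature new victim selection\n");
        } else {
                /* Victim is still being reaped; do not pick another victim yet. */
                printf("no skip flag yet, keep waiting/retrying\n");
        }

        pthread_mutex_unlock(&fake_oom_lock);
        return NULL;
}

int main(void)
{
        pthread_t reaper, allocator;

        pthread_create(&reaper, NULL, reaper_or_exit_side, NULL);
        pthread_create(&allocator, NULL, allocator_side, NULL);
        pthread_join(reaper, NULL);
        pthread_join(allocator, NULL);
        return 0;
}

Either the allocator takes fake_oom_lock before the reaper sets the flag (it
then sees the flag clear and does not select a new victim yet), or it takes
the lock afterwards, in which case the pages freed before the flag was set are
already visible to its allocation attempt. The "model violated" branch can
therefore never fire, which is exactly the guarantee the suggested patch
relies on when it moves set_bit(MMF_OOM_SKIP) under oom_lock.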