Received: by 10.192.165.148 with SMTP id m20csp4245820imm; Mon, 23 Apr 2018 22:36:32 -0700 (PDT) X-Google-Smtp-Source: AIpwx49PgniY20OJ4Fq9/edjTO39VllokvGekHQlkjWvp+iwisYcKt/tLZaCcsu79q0ZXbavrqb6 X-Received: by 10.98.60.16 with SMTP id j16mr14901470pfa.7.1524548192089; Mon, 23 Apr 2018 22:36:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524548192; cv=none; d=google.com; s=arc-20160816; b=DMlQPbr3jd6xAw6hGaCNJicLyoUG8RBc8RcGNeODrDiOh02jh2kkHYTmbERJ3tF76E GFRnL95QdHhmgW5dqdvlcxvW6Jcb6JHrJcv+n/5LZ4Lk6kPhhIXp3R1/mtWDCLgxnY8q V5IiyY8t5OhG+nMSDs3EDMjbjA3Rubr//eeU8shKe7qXfjPo34SW72E3we4tak3BmKJR YvmHbtgRRz4S6noocMtTKEk+VrGxbBBjj8LTAQ6J5HmJxLFF/iMjvMZ1UyNZFSdfg0Pn n9XnwSA0Z/+jOHkLz+4Lgufbg3JBwJlhccxgGUpykqbM6nx5UXHvJVMos0J1+whJk0u2 N9iw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:mime-version:user-agent:references :message-id:in-reply-to:subject:cc:to:from:date:dkim-signature :arc-authentication-results; bh=zXqZRoEEq4TZZKOYvWD5iO8YiyY+45K/yRVL4lyTApE=; b=pbHsyL2anSJxsW/yPDX6tjfiOFnqt8CLZI/WS1TP7zr6Z6hterLYci+0sz+6vCgFQz gXFO0dtS8pkERkXgeZEycmwbHvHXImye9LIGeo8SrO8Mx8/HHWWkO8J3wJkwzVdG8wda rfxf5AJ2+Brpp+vdnUAi3W/KNh4Vo3K2pIOoi0+YLcQMw0LgXhpjPeP+tWIxVvcWiRuV sTynQsqbJrcULYSROn7j0zqN+O85WhrpgzSfaf0PyL1phxJV0v22b5AGyYfHH0Jeh7O1 XuVpJqJ4t6WpsFlpO4AFWnuQe6Nod7r+J0tOIXJUH5rh1TW1vO1pIgYbgaSb7r63Czyd MrUg== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=mfflzgSD; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o1si6669627pgp.273.2018.04.23.22.36.17; Mon, 23 Apr 2018 22:36:32 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20161025 header.b=mfflzgSD; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1755997AbeDXFfL (ORCPT + 99 others); Tue, 24 Apr 2018 01:35:11 -0400 Received: from mail-pg0-f66.google.com ([74.125.83.66]:46045 "EHLO mail-pg0-f66.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1754824AbeDXFfI (ORCPT ); Tue, 24 Apr 2018 01:35:08 -0400 Received: by mail-pg0-f66.google.com with SMTP id i29so8498197pgn.12 for ; Mon, 23 Apr 2018 22:35:08 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=date:from:to:cc:subject:in-reply-to:message-id:references :user-agent:mime-version; bh=zXqZRoEEq4TZZKOYvWD5iO8YiyY+45K/yRVL4lyTApE=; b=mfflzgSDLhdGHVQWDLndRVbEv/d3ogI2K9GEsxihpm73O31b71g2avbNOz+wCQLIk3 prsMBVt+v8X0BP3+RjASu5rrWQYRth58PGaua9LDp7E5aWCk0LuYBZP4rEojjH+zVbRf E6uIz72uMeGLZr54CIOVBle42pv5Jou069ooXp4JaJFoH1B8qo2QslxyuxDLKSnq0XCa zNLebUXnbyrjQjgkLqkVeVdU2R4qDsECP8BrV07rzY0LQQpmaZKOd5Wxon9olKfKsOxr Jm6I8c9JrfZjPnTAml+WUgjXnovwAV5V+59UaKDhQhoaQYq1SDtdGXt/YKKh/zanO0Gc Wyxw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:in-reply-to:message-id :references:user-agent:mime-version; bh=zXqZRoEEq4TZZKOYvWD5iO8YiyY+45K/yRVL4lyTApE=; b=nVSpX9SD2axcoVL3PEi4lfLojmdzaC3JZGIGGIkjnUQnUSjd/VMOlXPxVmEq8++NXh aHPtjWT7SStts5sBy9UXDaNFEQYDlI1fnpjxOH5g3uZRA346xLmdVxSSwWSCxT+O/UTY 3oaszh5JrnzdrmnAeKZr2XIjgkyZYWRwUW/TiUU/XSAyaD4m14mZHjqpN0BtrZdajF1E wv3PHBAs0G/hwNqJjxRNrrXQNExRh79ijqtWRl8PaGotXIJ6rm/em5KUx/L8RJnF/mu3 Ld1CU+Rgfo3fRrMffJIJGDzCtoMS4z8RWVfIWHAdXnD/laYtTqoN7CsHkUKDzfGpDZRj upKQ== X-Gm-Message-State: ALQs6tA0iLaklWPlzBCKCXxgtcqoozhmu+WU7mP6g7gscSs4mI+vQuOj khcd3BkVeBe+BcnGv0ASisvW6Q== X-Received: by 10.98.101.131 with SMTP id z125mr22669640pfb.208.1524548108086; Mon, 23 Apr 2018 22:35:08 -0700 (PDT) Received: from [2620:15c:17:3:3a5:23a7:5e32:4598] ([2620:15c:17:3:3a5:23a7:5e32:4598]) by smtp.gmail.com with ESMTPSA id 133sm24626458pfy.113.2018.04.23.22.35.06 (version=TLS1_2 cipher=ECDHE-RSA-CHACHA20-POLY1305 bits=256/256); Mon, 23 Apr 2018 22:35:07 -0700 (PDT) Date: Mon, 23 Apr 2018 22:35:06 -0700 (PDT) From: David Rientjes X-X-Sender: rientjes@chino.kir.corp.google.com To: Tetsuo Handa cc: mhocko@kernel.org, Andrew Morton , Andrea Arcangeli , guro@fb.com, linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaperunmap In-Reply-To: <201804240511.w3O5BY4o090598@www262.sakura.ne.jp> Message-ID: References: <201804221248.CHE35432.FtOMOLSHOFJFVQ@I-love.SAKURA.ne.jp> <201804240511.w3O5BY4o090598@www262.sakura.ne.jp> User-Agent: Alpine 2.21 (DEB 202 2017-01-01) MIME-Version: 1.0 Content-Type: text/plain; charset=US-ASCII Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 24 Apr 2018, Tetsuo Handa wrote: > > > We can call __oom_reap_task_mm() from exit_mmap() (or __mmput()) before > > > exit_mmap() holds mmap_sem for write. Then, at least memory which could > > > have been reclaimed if exit_mmap() did not hold mmap_sem for write will > > > be guaranteed to be reclaimed before MMF_OOM_SKIP is set. > > > > > > > I think that's an exceptionally good idea and will mitigate the concerns > > of others. > > > > It can be done without holding mm->mmap_sem in exit_mmap() and uses the > > same criteria that the oom reaper uses to set MMF_OOM_SKIP itself, so we > > don't get dozens of unnecessary oom kills. > > > > What do you think about this? It passes preliminary testing on powerpc > > and I'm enqueued it for much more intensive testing. (I'm wishing there > > was a better way to acknowledge your contribution to fixing this issue, > > especially since you brought up the exact problem this is addressing in > > previous emails.) > > > > I don't think this patch is safe, for exit_mmap() is calling > mmu_notifier_invalidate_range_{start,end}() which might block with oom_lock > held when oom_reap_task_mm() is waiting for oom_lock held by exit_mmap(). One of the reasons that I extracted __oom_reap_task_mm() out of the new oom_reap_task_mm() is to avoid the checks that would be unnecessary when called from exit_mmap(). In this case, we can ignore the mm_has_blockable_invalidate_notifiers() check because exit_mmap() has already done mmu_notifier_release(). So I don't think there's a concern about __oom_reap_task_mm() blocking while holding oom_lock. Unless you are referring to something else? > exit_mmap() must not block while holding oom_lock in order to guarantee that > oom_reap_task_mm() can give up. > > Some suggestion on top of your patch: > > mm/mmap.c | 13 +++++-------- > mm/oom_kill.c | 51 ++++++++++++++++++++++++++------------------------- > 2 files changed, 31 insertions(+), 33 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 981eed4..7b31357 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -3019,21 +3019,18 @@ void exit_mmap(struct mm_struct *mm) > /* > * Manually reap the mm to free as much memory as possible. > * Then, as the oom reaper, set MMF_OOM_SKIP to disregard this > - * mm from further consideration. Taking mm->mmap_sem for write > - * after setting MMF_OOM_SKIP will guarantee that the oom reaper > - * will not run on this mm again after mmap_sem is dropped. > + * mm from further consideration. Setting MMF_OOM_SKIP under > + * oom_lock held will guarantee that the OOM reaper will not > + * run on this mm again. > * > * This needs to be done before calling munlock_vma_pages_all(), > * which clears VM_LOCKED, otherwise the oom reaper cannot > * reliably test it. > */ > - mutex_lock(&oom_lock); > __oom_reap_task_mm(mm); > - mutex_unlock(&oom_lock); > - > + mutex_lock(&oom_lock); > set_bit(MMF_OOM_SKIP, &mm->flags); > - down_write(&mm->mmap_sem); > - up_write(&mm->mmap_sem); > + mutex_unlock(&oom_lock); > } > > if (mm->locked_vm) { > diff --git a/mm/oom_kill.c b/mm/oom_kill.c > index 8ba6cb8..9a29df8 100644 > --- a/mm/oom_kill.c > +++ b/mm/oom_kill.c > @@ -523,21 +523,15 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > { > bool ret = true; > > + mutex_lock(&oom_lock); > + > /* > - * We have to make sure to not race with the victim exit path > - * and cause premature new oom victim selection: > - * oom_reap_task_mm exit_mm > - * mmget_not_zero > - * mmput > - * atomic_dec_and_test > - * exit_oom_victim > - * [...] > - * out_of_memory > - * select_bad_process > - * # no TIF_MEMDIE task selects new victim > - * unmap_page_range # frees some memory > + * MMF_OOM_SKIP is set by exit_mmap() when the OOM reaper can't > + * work on the mm anymore. The check for MMF_OOM_SKIP must run > + * under oom_lock held. > */ > - mutex_lock(&oom_lock); > + if (test_bit(MMF_OOM_SKIP, &mm->flags)) > + goto unlock_oom; > > if (!down_read_trylock(&mm->mmap_sem)) { > ret = false; > @@ -557,18 +551,6 @@ static bool oom_reap_task_mm(struct task_struct *tsk, struct mm_struct *mm) > goto unlock_oom; > } > > - /* > - * MMF_OOM_SKIP is set by exit_mmap when the OOM reaper can't > - * work on the mm anymore. The check for MMF_OOM_SKIP must run > - * under mmap_sem for reading because it serializes against the > - * down_write();up_write() cycle in exit_mmap(). > - */ > - if (test_bit(MMF_OOM_SKIP, &mm->flags)) { > - up_read(&mm->mmap_sem); > - trace_skip_task_reaping(tsk->pid); > - goto unlock_oom; > - } > - > trace_start_task_reaping(tsk->pid); > > __oom_reap_task_mm(mm); > @@ -610,8 +592,27 @@ static void oom_reap_task(struct task_struct *tsk) > /* > * Hide this mm from OOM killer because it has been either reaped or > * somebody can't call up_write(mmap_sem). > + * > + * We have to make sure to not cause premature new oom victim selection: > + * > + * __alloc_pages_may_oom() oom_reap_task_mm()/exit_mmap() > + * mutex_trylock(&oom_lock) > + * get_page_from_freelist(ALLOC_WMARK_HIGH) # fails > + * unmap_page_range() # frees some memory > + * set_bit(MMF_OOM_SKIP) > + * out_of_memory() > + * select_bad_process() > + * test_bit(MMF_OOM_SKIP) # selects new oom victim > + * mutex_unlock(&oom_lock) > + * > + * Setting MMF_OOM_SKIP under oom_lock held will guarantee that the > + * last second alocation attempt is done by __alloc_pages_may_oom() > + * before out_of_memory() selects next OOM victim by finding > + * MMF_OOM_SKIP. > */ > + mutex_lock(&oom_lock); > set_bit(MMF_OOM_SKIP, &mm->flags); > + mutex_unlock(&oom_lock); > > /* Drop a reference taken by wake_oom_reaper */ > put_task_struct(tsk);