To: rientjes@google.com, mhocko@kernel.org
Cc: akpm@linux-foundation.org, aarcange@redhat.com, guro@fb.com,
 linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
From: Tetsuo Handa
References: <20180418075051.GO17484@dhcp22.suse.cz> <20180419063556.GK17484@dhcp22.suse.cz>
Message-Id:
 <201804200713.IJF15701.SOVFOMHtQJOFFL@I-love.SAKURA.ne.jp>
Date: Fri, 20 Apr 2018 07:13:02 +0900
X-Mailing-List: linux-kernel@vger.kernel.org

David Rientjes wrote:
> On Thu, 19 Apr 2018, Michal Hocko wrote:
>
> > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is
> > > entered.
> >
> > Not true. munlock_vma_pages_all might take page_lock which can have
> > unpredictable dependences. This is the reason why we are ruling out
> > mlocked VMAs in the first place when reaping the address space.
> >
>
> I don't find any occurrences in millions of oom kills in real-world
> scenarios where this matters.

Are your OOM events system-wide rather than memcg? It is trivial to hide
bugs in the details if your OOM events are memcg OOMs.

> The solution is certainly not to hold
> down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. If
> exit_mmap() is not making forward progress then that's a separate issue;

Simple memory + CPU pressure is sufficient to keep exit_mmap() from making
forward progress. Try triggering a system-wide OOM event by running the
reproducer below. We keep ignoring this issue.

-----
#include <unistd.h>

/*
 * Fork bomb: every child re-executes this program, so memory and CPU
 * pressure grow without bound. Run only on a disposable test machine.
 */
int main(int argc, char *argv[])
{
	while (1)
		if (fork() == 0)
			execlp(argv[0], argv[0], (char *)NULL);
	return 0;
}
-----

> that would need to be fixed in one of two ways: (1) in oom_reap_task() to
> try over a longer duration before setting MMF_OOM_SKIP itself, but that
> would have to be a long duration to allow a large unmap and page table
> free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but
> only if MMF_UNSTABLE has been set for a long period of time so we target
> another process when the oom killer has given up.
>
> Either of those two fixes are simple to implement, I'd just like to see a
> bug report with stack traces to indicate that a victim getting stalled in
> exit_mmap() is a problem to justify the patch.

It is too hard for normal users to report problems under memory pressure
without a means to help understand what is happening. See a bug report at
https://lists.opensuse.org/opensuse-kernel/2018-04/msg00018.html for example.

>
> I'm trying to fix the page table corruption that is trivial to trigger on
> powerpc. We simply cannot allow the oom reaper's unmap_page_range() to
> race with munlock_vma_pages_range(), ever. Holding down_write on
> mm->mmap_sem otherwise needlessly over a large amount of code is riskier
> (hasn't been done or tested here), more error prone (any code change over
> this large area of code or in functions it calls are unnecessarily
> burdened by unnecessary locking), makes exit_mmap() less extensible for
> the same reason, and causes the oom reaper to give up and go set
> MMF_OOM_SKIP itself because it depends on taking down_read while the
> thread is still exiting.

I suggest reverting 212925802454 ("mm: oom: let oom_reap_task and exit_mmap
run concurrently"). We can check for progress for a while before setting
MMF_OOM_SKIP after the OOM reaper has completed or given up reaping.
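
To make "check for progress for a while" concrete, here is a minimal
user-space sketch of one possible policy. It is not kernel code and not a
proposed patch: struct victim_model, should_give_up() and the max_stalled
threshold are all hypothetical names invented for illustration. The model
samples the victim's remaining memory at fixed intervals and gives up (the
point where the kernel would set MMF_OOM_SKIP) only after several consecutive
samples show no reduction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model; none of these names exist in the kernel. */
struct victim_model {
	const unsigned long *rss_samples; /* memory still held at each check */
	size_t nr_samples;                /* length of the observation window */
};

/*
 * Returns true when the caller should give up on this victim (where the
 * kernel would set MMF_OOM_SKIP), false when the victim released everything.
 * max_stalled is the number of consecutive checks without any reduction
 * that we tolerate before giving up.
 */
static bool should_give_up(const struct victim_model *v,
			   unsigned int max_stalled)
{
	unsigned long last;
	unsigned int stalled = 0;
	size_t i;

	assert(max_stalled > 0);
	if (v->nr_samples == 0)
		return true;	/* nothing observed: treat as no progress */

	last = v->rss_samples[0];
	if (last == 0)
		return false;	/* already fully released */

	for (i = 1; i < v->nr_samples; i++) {
		unsigned long now = v->rss_samples[i];

		if (now == 0)
			return false;	/* victim finished releasing memory */
		if (now < last)
			stalled = 0;	/* progress was made: reset the count */
		else if (++stalled >= max_stalled)
			return true;	/* stalled for too long */
		last = now;
	}
	return true;	/* window ended with memory still held */
}
```

With max_stalled set to 2, a victim whose samples are 100, 100, 90, 90, 0 is
not given up on, because the stall counter is reset when memory drops to 90,
while a victim stuck at 100 for three samples is. How long the window should
be, and what counts as progress in the kernel, are exactly the open questions
in this thread.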