Date: Thu, 19 Apr 2018 12:34:53 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
cc: Andrew Morton, Tetsuo Handa, Andrea Arcangeli, Roman Gushchin,
    linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap
In-Reply-To: <20180419063556.GK17484@dhcp22.suse.cz>
References: <201804180057.w3I0vieV034949@www262.sakura.ne.jp>
    <20180418075051.GO17484@dhcp22.suse.cz>
    <20180419063556.GK17484@dhcp22.suse.cz>

On Thu, 19 Apr 2018, Michal Hocko wrote:

> > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is
> > entered.
>
> Not true. munlock_vma_pages_all might take page_lock which can have
> unpredictable dependences. This is the reason why we are ruling out
> mlocked VMAs in the first place when reaping the address space.
>

I haven't found any occurrences, across millions of oom kills in
real-world scenarios, where this matters. The solution is certainly not
to hold down_write(&mm->mmap_sem) during munlock_vma_pages_all()
instead. If exit_mmap() is not making forward progress then that's a
separate issue; it would need to be fixed in one of two ways:

 (1) in oom_reap_task(), to try over a longer duration before setting
     MMF_OOM_SKIP itself, but that would have to be a long duration to
     allow a large unmap and page table free, or

 (2) in oom_evaluate_task(), so that we defer for MMF_OOM_SKIP but only
     if MMF_UNSTABLE has been set for a long period of time, so we
     target another process when the oom killer has given up.

Either of those two fixes is simple to implement; I'd just like to see a
bug report with stack traces indicating that a victim getting stalled in
exit_mmap() is a problem, to justify the patch.

I'm trying to fix the page table corruption that is trivial to trigger
on powerpc. We simply cannot allow the oom reaper's unmap_page_range()
to race with munlock_vma_pages_range(), ever. Holding down_write on
mm->mmap_sem needlessly over a large amount of code instead is riskier
(it hasn't been done or tested here), more error prone (any code change
in this large area of code, or in the functions it calls, is burdened by
unnecessary locking), makes exit_mmap() less extensible for the same
reason, and causes the oom reaper to give up and go set MMF_OOM_SKIP
itself because it depends on taking down_read while the thread is still
exiting.

> On the
> other hand your lock protocol introduces the MMF_OOM_SKIP problem I've
> mentioned above and that really worries me. The primary objective of the
> reaper is to guarantee a forward progress without relying on any
> externalities. We might kill another OOM victim but that is safer than
> lock up.
>

I understand the concern, but it's the difference between the victim
getting stuck in exit_mmap() and actually taking a long time to free its
memory in exit_mmap(). I don't have evidence of the former. If there are
bug reports of the oom reaper being unable to reap, it would be helpful
to see them. The only reports about the "unable to reap" message were
that the message itself was racy, not that a thread got stuck. This is
more reason not to take down_write unnecessarily in the exit_mmap()
path, because it influences an oom reaper heuristic.

> The current protocol has proven to be error prone so I really believe we
> should back off and turn it into something much simpler and build on top
> of that if needed.
>
> So do you see any _technical_ reasons why not do [1] and have a simpler
> protocol easily backportable to stable trees?

It's not simpler per the above, and I agree with Andrea's assessment
when this was originally implemented. The current method is not error
prone; it works, it just wasn't protecting enough of exit_mmap(). That's
not a criticism of the method itself; it's a bugfix that expands its
critical section.
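
As a rough illustration of the point about the reaper heuristic, here is
a toy userspace model in Python. It is not kernel code and not part of
any posted patch. The names Mm and write_locked_until and the tick-based
clock are hypothetical; the retry budget of 10 and the choice to set the
skip flag after the retry loop regardless of the outcome are assumptions
about how oom_reap_task() behaves. It only shows that a reaper relying
on a read trylock gives up, and marks the mm as skipped, whenever the
exiting thread holds the write lock for longer than the retry window.

```python
from dataclasses import dataclass

# Assumed retry budget for the reaper (illustrative value).
MAX_OOM_REAP_RETRIES = 10


@dataclass
class Mm:
    """Hypothetical stand-in for an mm_struct in this model."""
    write_locked_until: int  # tick at which the exiting thread drops the write lock
    oom_skip: bool = False   # models MMF_OOM_SKIP
    reaped: bool = False     # True if the reaper managed to unmap


def down_read_trylock(mm: Mm, now: int) -> bool:
    """Model of a read trylock: fails while the writer still holds the lock."""
    return now >= mm.write_locked_until


def oom_reap_task(mm: Mm) -> bool:
    """Try to reap once per tick; give up when the retry budget is spent."""
    for attempt in range(MAX_OOM_REAP_RETRIES):
        if down_read_trylock(mm, now=attempt):
            mm.reaped = True
            break
    # Assumption: the flag is set whether the reap succeeded or the
    # reaper gave up, so the oom killer may move on to another victim.
    mm.oom_skip = True
    return mm.reaped
```

With write_locked_until=3 the reaper gets the lock on its fourth attempt
and reaps. With write_locked_until=50 it exhausts its retries and sets
the flag without reclaiming anything, which is the outcome a long
write-side hold in the exit path would push the heuristic toward.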