Received: by 10.192.165.148 with SMTP id m20csp424715imm; Fri, 20 Apr 2018 01:25:16 -0700 (PDT) X-Google-Smtp-Source: AIpwx49RG+puzxWK1LYJ+fIMCRiep3Oj1eDiF1tmdqYgJab3/dAvlaNIkLo8Q19l7Cp47TmpkAkE X-Received: by 10.99.181.30 with SMTP id y30mr6049001pge.279.1524212715965; Fri, 20 Apr 2018 01:25:15 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1524212715; cv=none; d=google.com; s=arc-20160816; b=UuDYF8rC2nrCeIJQNOgkHHd+XduNnYgBtwK9DWnAfj1KJ8cvdA3IABd84qH5l+UR0f 14KBHZ1Lu5h7WcBHEjlEBwOD2oN8RcbQtKQOFlOxeSz9ZEpCeNKGSuHD10u3bkoJeh2E Vn8LXIYl8rzft6rEVQAt6xnEI/HexgTzfzFRYV6yDNxfgUK6mHmhgYnkBUtfMP1pjNHD Y9BT8EDz4jC6JYDx3uXURGWYHHF/bzQfAGzH5080hascQEOZo4oALBQrPN4no4aa8sqp Qf9R4mDfJ+iizT6SrjNNVXHmOuXRoYuFQPI1Xa569cKqi7LmVTgugWu1+2OEx2XzgFMH P7BA== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:arc-authentication-results; bh=YMp3g2OJ2fEBMf3PXOeoEA1+QTFzZlNR48R4lOIc5bI=; b=eEVw2e13epaPO9oXwihBhcf8IIT1zvuDS0LbRQEqG1449Uu0ph+y8q5qSiMmipOO0d 7lyRPoCvdLiAcsj/vphabP9khe2BpfJqpxulzt1ClRhEyhliA08MasVcH2UetrR0svyl Ks+JmDaDB7dUvffGwz9xWUOo6mjkd6VXA2qWgr1XoW8pb+9EQQoQ/2NgrQpIKR9flJ80 MfPyAvXF2lpTpaIK0S3X4gzGkpcRWf/g/BpW6J/D68PLtEa91LqYcXMkMb5itZrCacFc zpYatGWmMeL6DoA55bqkOBJYCVNWiJffhp5kbMAXHjVkiNKewjWUO36aqezNyb89huFD 6M+A== ARC-Authentication-Results: i=1; mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id o30-v6si1675370pli.64.2018.04.20.01.25.01; Fri, 20 Apr 2018 01:25:15 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1754116AbeDTIXw (ORCPT + 99 others); Fri, 20 Apr 2018 04:23:52 -0400 Received: from mx2.suse.de ([195.135.220.15]:51827 "EHLO mx2.suse.de" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1753905AbeDTIXu (ORCPT ); Fri, 20 Apr 2018 04:23:50 -0400 X-Virus-Scanned: by amavisd-new at test-mx.suse.de Received: from relay2.suse.de (charybdis-ext.suse.de [195.135.220.254]) by mx2.suse.de (Postfix) with ESMTP id 9314AAD67; Fri, 20 Apr 2018 08:23:49 +0000 (UTC) Date: Fri, 20 Apr 2018 10:23:49 +0200 From: Michal Hocko To: David Rientjes Cc: Andrew Morton , Tetsuo Handa , Andrea Arcangeli , Roman Gushchin , linux-kernel@vger.kernel.org, linux-mm@kvack.org Subject: Re: [patch v2] mm, oom: fix concurrent munlock and oom reaper unmap Message-ID: <20180420082349.GW17484@dhcp22.suse.cz> References: <201804180057.w3I0vieV034949@www262.sakura.ne.jp> <20180418075051.GO17484@dhcp22.suse.cz> <20180419063556.GK17484@dhcp22.suse.cz> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: User-Agent: Mutt/1.9.4 (2018-02-28) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu 19-04-18 12:34:53, David Rientjes wrote: > On Thu, 19 Apr 2018, Michal Hocko wrote: > > > > exit_mmap() does not block before set_bit(MMF_OOM_SKIP) once it is > > > entered. > > > > Not true. munlock_vma_pages_all might take page_lock which can have > > unpredictable dependences. This is the reason why we are ruling out > > mlocked VMAs in the first place when reaping the address space. > > > > I don't find any occurrences in millions of oom kills in real-world > scenarios where this matters. Which doesn't really mean much. We want a guarantee here. > The solution is certainly not to hold > down_write(&mm->mmap_sem) during munlock_vma_pages_all() instead. Why not? This is what we do for normal paths. exit path just tries to be clever because it knows that it doesn't have to lock because there is no concurent user. At least it wasn't until the oom reaper came. So I really fail to see why we shouldn't do the most obvious thing and use the locking. > If > exit_mmap() is not making forward progress then that's a separate issue; Please read what I wrote. There is a page lock and there is no way to guarantee it will make a forward progress. Or do you consider that not true? > that would need to be fixed in one of two ways: (1) in oom_reap_task() to > try over a longer duration before setting MMF_OOM_SKIP itself, but that > would have to be a long duration to allow a large unmap and page table > free, or (2) in oom_evaluate_task() so that we defer for MMF_OOM_SKIP but > only if MMF_UNSTABLE has been set for a long period of time so we target > another process when the oom killer has given up. > > Either of those two fixes are simple to implement, I'd just like to see a > bug report with stack traces to indicate that a victim getting stalled in > exit_mmap() is a problem to justify the patch. And both are not really needed if we do the proper and straightforward locking. > I'm trying to fix the page table corruption that is trivial to trigger on > powerpc. We simply cannot allow the oom reaper's unmap_page_range() to > race with munlock_vma_pages_range(), ever. There is no discussion about that. Sure, you are right. We are just arguing how to achieve that. > Holding down_write on > mm->mmap_sem otherwise needlessly over a large amount of code is riskier > (hasn't been done or tested here), more error prone (any code change over > this large area of code or in functions it calls are unnecessarily > burdened by unnecessary locking), makes exit_mmap() less extensible for > the same reason, I do not see any of the calls in that path could suffer from holding mmap_sem. Do you? > and causes the oom reaper to give up and go set > MMF_OOM_SKIP itself because it depends on taking down_read while the > thread is still exiting. Which is the standard backoff mechanism. > > On the > > other hand your lock protocol introduces the MMF_OOM_SKIP problem I've > > mentioned above and that really worries me. The primary objective of the > > reaper is to guarantee a forward progress without relying on any > > externalities. We might kill another OOM victim but that is safer than > > lock up. > > > > I understand the concern, but it's the difference between the victim > getting stuck in exit_mmap() and actually taking a long time to free its > memory in exit_mmap(). I don't have evidence of the former. I do not really want to repeat myself. The primary purpose of the oom reaper is to provide a _guarantee_ of the forward progress. I do not care whether there is any evidences. All I know that lock_page has plethora of different dependencies and we cannot clearly state this is safe so we _must not_ depend on it when setting MMF_OOM_SKIP. The way how the oom path was fragile and lockup prone based on optimistic assumptions shouldn't be repeated. That being said, I haven't heard any actual technical argument about why locking the munmap path is a wrong thing to do while the MMF_OOM_SKIP dependency on the page_lock really concerns me so Nacked-by: Michal Hocko If you want to keep the current locking protocol then you really have to make sure that the oom reaper will set MMF_OOM_SKIP when racing with exit_mmap. -- Michal Hocko SUSE Labs