Date: Mon, 4 Jun 2018 21:25:39 -0700 (PDT)
From: David Rientjes
To: Michal Hocko
Cc: Tetsuo Handa, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
In-Reply-To: <20180601074642.GW15278@dhcp22.suse.cz>
References: <20180525072636.GE11881@dhcp22.suse.cz> <20180528081345.GD1517@dhcp22.suse.cz> <20180531063212.GF15278@dhcp22.suse.cz> <20180601074642.GW15278@dhcp22.suse.cz>

On Fri, 1 Jun 2018,
Michal Hocko wrote:

> > We've discussed the mm having a single blockable mmu notifier.
> > Regardless of how we arrive at the point where the oom reaper can't
> > free memory, which could be any of those three cases, if (1) the
> > original victim is sufficiently large that follow-up oom kills would
> > become unnecessary and (2) other threads allocate/charge before the
> > oom victim reaches exit_mmap(), this occurs.
> >
> > We have examples of cases where oom reaping was successful, but the
> > rss numbers in the kernel log are very similar to when it was oom
> > killed and the process is known not to mlock; the reason is that the
> > oom reaper could free very little memory due to blockable mmu
> > notifiers.
>
> Please be more specific. Which notifiers were these? Blockable
> notifiers are a PITA and we should be addressing them. That requires
> identifying them first.
>

The most common offender seems to be ib_umem_notifier, but I have also
heard of possible occurrences with mv_invl_range_start() for xen; that
would need more investigation.  The rather new invalidate_range callback
for hmm mirroring could also be problematic.  Any mmu_notifier without
MMU_INVALIDATE_DOES_NOT_BLOCK causes the mm to be disregarded
immediately.

For this reason, we often see testing harnesses oom killed immediately
after running a unittest that stresses reclaim or compaction by inducing
a system-wide oom condition.  The harness spawns the unittest, which
spawns an antagonist memory hog that is intended to be oom killed.  When
memory is mlocked or there are a large number of threads faulting memory
for the antagonist, the unittest and the harness itself get oom killed
because the oom reaper sets MMF_OOM_SKIP; this ends up happening a lot
on powerpc.  The memory hog has mm->mmap_sem readers queued ahead of a
writer that is doing mmap(), so the oom reaper can't grab the sem
quickly enough.

I agree that blockable mmu notifiers are a pain, but until such time as
all of them can implicitly be MMU_INVALIDATE_DOES_NOT_BLOCK, the oom
reaper can free all mlocked memory, and the oom reaper waits long enough
to grab mm->mmap_sem even with stalled readers queued ahead of a writer,
we need a solution that won't oom kill everything running on the system.
I have doubts we'll ever reach a point where the oom reaper can do the
equivalent of exit_mmap(), but it's possible to help solve the immediate
issue of oom kills killing many innocent processes while working in a
direction to make oom reaping more successful at freeing memory.

> > The current implementation is a timeout based solution for mmap_sem:
> > it just has the oom reaper spinning trying to grab the sem and
> > eventually giving up.  This patch allows it to concurrently work on
> > other mm's and detects the timeout in a different way, with jiffies
> > instead of an iterator.
>
> And I argue that anything timeout based is just broken by design.
> Trying n times will at least give you a consistent behavior.

It's not consistent: we see wildly inconsistent results, especially on
powerpc, because the outcome depends on the number of queued readers of
mm->mmap_sem ahead of a writer until such time that a thread doing
mmap() can grab it, drop it, and allow the oom reaper to grab it for
read.  It's so inconsistent that we can see the oom reaper successfully
grab the sem for an oom killed memory hog with 128 faulting threads, and
see it fail with 4 faulting threads.
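To make the contrast concrete, the two approaches differ roughly as
follows.  This is a simplified sketch, not the exact mm/oom_kill.c code;
the oom_reap_deadline field and the oom_reaper_requeue() helper on the
proposed side are hypothetical, for illustration only.

        /* Simplified sketch of today's behavior: a fixed number of retries. */
        static void oom_reap_task(struct task_struct *tsk)
        {
                int attempts = 0;
                struct mm_struct *mm = tsk->signal->oom_mm;

                while (attempts++ < MAX_OOM_REAP_RETRIES &&
                       !oom_reap_task_mm(tsk, mm))
                        schedule_timeout_idle(HZ/10);

                /* roughly one second later, however contended mmap_sem was */
                set_bit(MMF_OOM_SKIP, &mm->flags);
        }

        /* Sketch of a jiffies based deadline instead (oom_reap_deadline and
         * oom_reaper_requeue() are hypothetical): give up only once the
         * wall-clock period expires, working on other queued mm's meanwhile.
         */
        static void oom_reap_task_deadline(struct task_struct *tsk)
        {
                struct mm_struct *mm = tsk->signal->oom_mm;

                if (oom_reap_task_mm(tsk, mm) ||
                    time_after(jiffies, tsk->oom_reap_deadline))
                        set_bit(MMF_OOM_SKIP, &mm->flags);
                else
                        oom_reaper_requeue(tsk);  /* retry this mm later */
        }

The point is not the exact plumbing, but that the give-up condition
becomes a bound on elapsed time rather than on the number of trylock
attempts, which is what varies so wildly with the reader queue.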
> Retrying on mmap sem makes sense because the lock might be taken for a
> short time.

It isn't a function of how long mmap_sem is taken for write; it's a
function of how many readers are ahead of the queued writer.  We don't
run with thp defrag set to "always" under standard configurations, but
users of MADV_HUGEPAGE or configs where defrag is set to "always" can
consistently cause any number of additional processes to be oom killed
unnecessarily, because the readers are performing compaction and the
writer is queued behind them.

> > I'd love a solution where we can reliably detect an oom livelock and
> > oom kill additional processes, but only after the original victim
> > has had a chance to do exit_mmap() without a timeout, but I don't
> > see one being offered.  Given that Tetsuo has seen issues with this
> > in the past and suggested a similar proposal, we are not the only
> > ones feeling pain from this.
>
> Tetsuo is doing an artificial stress test which doesn't resemble any
> reasonable workload.

Tetsuo's test cases caught the CVE on powerpc that could trivially panic
the system if configured to panic on any oops, and it required a
security fix because any user doing a large mlock could trigger it.  His
test case here is trivial to reproduce on powerpc and causes several
additional processes to be oom killed.  It's not artificial: I see many
test harnesses killed *nightly* because a memory hog is faulting with a
large number of threads while two or three other threads are doing
mmap().  No mlock.

> > Making mlocked pages reapable would only solve the most trivial
> > reproducer of this.  Unless the oom reaper can guarantee that it
> > will never block and can free all memory that exit_mmap() can free,
> > we need to ensure that a victim has a chance to reach the exit path
> > on its own before killing every other process on the system.
> >
> > I'll fix the issue I identified with doing list_add_tail() rather
> > than list_add(), fix up the commit message per Tetsuo to identify
> > the other possible ways this can occur other than mlock, remove the
> > rfc tag, and repost.
>
> As I've already said, I will nack any timeout based solution until we
> address all particular problems and still see more to come.  Here we
> have a clear goal: address mlocked pages and identify mmu notifier
> offenders.

I cannot fix all mmu notifiers to not block, I can't fix configurations
that allow direct compaction for thp allocations combined with a large
number of concurrent faulters, and I cannot fix userspace mlocking a lot
of memory.  It's worthwhile to work in those directions, but it will
never be 100% possible to avoid.  We must have a solution that prevents
innocent processes from consistently being oom killed completely
unnecessarily.
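For reference, all three of those limitations show up directly in the
reap path itself; it is roughly of this shape (a simplified sketch, not
the exact __oom_reap_task_mm() code, function name illustrative):

        static bool oom_reap_mm_sketch(struct mm_struct *mm)
        {
                struct vm_area_struct *vma;
                struct mmu_gather tlb;

                if (!down_read_trylock(&mm->mmap_sem))
                        return false;   /* readers queued ahead of an mmap() writer */

                if (mm_has_blockable_invalidate_notifiers(mm)) {
                        up_read(&mm->mmap_sem);  /* e.g. ib_umem_notifier */
                        return false;
                }

                for (vma = mm->mmap; vma; vma = vma->vm_next) {
                        /* mlocked, hugetlb and pfnmap vmas are skipped entirely */
                        if (vma->vm_flags & (VM_LOCKED | VM_HUGETLB | VM_PFNMAP))
                                continue;

                        /* shared mappings are left alone; only private/anon
                         * memory actually gets unmapped
                         */
                        if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
                                tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                                unmap_page_range(&tlb, vma, vma->vm_start,
                                                 vma->vm_end, NULL);
                                tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
                        }
                }

                up_read(&mm->mmap_sem);
                return true;
        }

Whenever one of those early returns persists for long enough, the caller
eventually sets MMF_OOM_SKIP anyway and the next victim is selected.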