Date: Sat, 9 Sep 2017 01:45:53 -0700 (PDT)
From: David Rientjes
To: Christopher Lameter
cc: Michal Hocko, Johannes Weiner, Roman Gushchin, linux-mm@kvack.org, Vladimir Davydov, Tetsuo Handa, Andrew Morton, Tejun Heo, kernel-team@fb.com, cgroups@vger.kernel.org, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [v7 5/5] mm, oom: cgroup v2 mount option to disable cgroup-aware OOM killer

On Fri, 8 Sep 2017, Christopher Lameter wrote:

> Ok. Certainly there were scalability issues (lots of them) and the sysctl
> may have helped there if set globally. But the ability to kill the
> allocating tasks was primarily used in cpusets for constrained allocation.
>

I remember discussing it with him and he had some data with pretty extreme numbers for how long the tasklist iteration was taking. Regardless, I agree it's not pertinent to the discussion if anybody is actively using the sysctl, just fun to try to remember the discussions from 10 years ago.

The problem I'm having with the removal, though, is that the kernel source actually uses it itself in tools/testing/fault-injection/failcmd.sh. That, to me, suggests there are people outside the kernel source who are probably also using it. We use it as part of our unit testing, although we could convert away from it.

These are things that can probably be worked around, but I'm struggling to see the benefit of removing it. It's only defined, there's generic sysctl handling, and there's a single conditional in the oom killer. I wouldn't risk the potential userspace breakage.

> The issue of scaling is irrelevant in the context of deciding what to do
> about the sysctl. You can address the issue differently if it still
> exists. The systems with super high NUMA nodes (hundreds to a
> thousand) have somehow fallen out of fashion a bit. So I doubt that this
> is still an issue. And no one of the old stakeholders is speaking up.
>
> What is the current approach for an OOM occuring in a cpuset or cgroup
> with a restricted numa node set?
>

It's always been shaky: we simply exclude potential kill victims based on whether or not they share mempolicy nodes or cpuset mems with the allocating process. Of course, this could result in no memory being freed, because a potential victim being allowed to allocate on a particular node right now doesn't mean killing it will free memory on that node. It's just more probable in practice.

Nobody has complained about that methodology, but we do have internal code that simply kills current for mempolicy ooms. That is because we have priority-based oom killing, much like this patchset implements, and extend it even further to processes.
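
To make the node-sharing check described above concrete, here is a minimal userspace sketch. It is not the kernel's code: nodemask_t is modeled as a plain 64-bit bitmask, and struct task, shares_nodes() and eligible_victims() are hypothetical names invented for illustration. It only shows the idea of excluding candidates whose allowed nodes do not intersect those of the allocating process.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a nodemask: bit n set means node n is allowed. */
typedef uint64_t nodemask_t;

/* Hypothetical task record; allowed_nodes models mempolicy nodes or cpuset mems. */
struct task {
    const char *name;
    nodemask_t allowed_nodes;
};

/* A candidate stays eligible only if it shares at least one node with the
 * allocating task. Sharing a node does not guarantee that killing the
 * candidate frees memory on that node; it only makes it more likely. */
static bool shares_nodes(const struct task *candidate,
                         const struct task *allocating)
{
    return (candidate->allowed_nodes & allocating->allowed_nodes) != 0;
}

/* Copy pointers to the eligible candidates into out[] (which must have room
 * for n entries) and return how many were kept. */
static size_t eligible_victims(const struct task *tasks, size_t n,
                               const struct task *allocating,
                               const struct task **out)
{
    size_t kept = 0;

    for (size_t i = 0; i < n; i++) {
        if (shares_nodes(&tasks[i], allocating))
            out[kept++] = &tasks[i];
    }
    return kept;
}
```

With an allocating task restricted to nodes 0-1 (mask 0x3), a candidate on node 2 only (mask 0x4) is excluded, while candidates on node 0 (0x1) or on nodes 1-2 (0x6) remain eligible, even though the latter may hold all of its memory on node 2.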