Date: Thu, 31 May 2018 08:32:12 +0200
From: Michal Hocko
To: David Rientjes
Cc: Tetsuo Handa, Andrew Morton, linux-kernel@vger.kernel.org, linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
Message-ID: <20180531063212.GF15278@dhcp22.suse.cz>
References: <20180525072636.GE11881@dhcp22.suse.cz> <20180528081345.GD1517@dhcp22.suse.cz>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
User-Agent: Mutt/1.9.5 (2018-04-13)
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed 30-05-18 14:06:51, David Rientjes wrote:
> On Mon, 28 May 2018, Michal Hocko wrote:
>
> > > That's not sufficient since the oom reaper is also not able to oom
> > > reap if the mm has blockable mmu notifiers or all memory is shared
> > > file-backed memory, so it immediately sets MMF_OOM_SKIP and
> > > additional processes are oom killed.
> >
> > Could you be more specific with a real world example where that is the
> > case? I mean the full address space of non-reclaimable file-backed
> > memory where waiting some more would help? Blockable mmu notifiers are
> > a PITA for sure. I wish we could have a better way to deal with them.
> > Maybe we can tell them we are in a non-blockable context and have them
> > release as much as possible. Still, that is something a random timeout
> > wouldn't help with, I am afraid.
>
> It's not a random timeout, it's sufficiently long such that we don't oom
> kill several processes needlessly in the very rare case where oom
> livelock would actually prevent the original victim from exiting. The
> oom reaper processing an mm, finding everything to be mlocked, and
> immediately setting MMF_OOM_SKIP is inappropriate. This is rather
> trivial to reproduce for a large memory-hogging process that mlocks all
> of its memory; we consistently see spurious and unnecessary oom kills
> simply because the oom reaper has set MMF_OOM_SKIP very early.

It takes quite some additional steps for an admin to allow a large amount
of mlocked memory, and such an application should be really careful not to
consume too much memory. So how come you are seeing this consistently? Is
it some sort of bug or an unfortunate workload side effect? I am asking
because I really want to see how relevant this actually is.
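
For reference, an unprivileged process cannot lock more than
RLIMIT_MEMLOCK, which defaults to something tiny (64kB on most distros),
so a naive attempt like the untested sketch below fails with ENOMEM
unless the admin has raised the limit in limits.conf or granted the task
CAP_IPC_LOCK:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t size = 1UL << 30;        /* 1GB, way over the default 64kB limit */
            char *buf = malloc(size);

            if (!buf)
                    return 1;
            /* fails with ENOMEM under the default RLIMIT_MEMLOCK */
            if (mlock(buf, size)) {
                    perror("mlock");
                    return 1;
            }
            memset(buf, 0, size);           /* resident and invisible to reclaim from now on */
            return 0;
    }

IOW somebody had to tune the system for such a workload in the first
place.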

> This patch introduces a "give up" period such that the oom reaper is
> still allowed to do its good work but only gives up in the hope that the
> victim can make forward progress at some substantial period of time in
> the future. I would understand the objection if oom livelock, where the
> victim cannot make forward progress, were commonplace, but in the
> interest of not killing several processes needlessly every time a large
> mlocked process is targeted, I think it compels a waiting period.

But the waiting periods just turn out to be a really poor design. There
will be no good timeout that fits everybody. We can do better, and as long
as that is the case a timeout-based solution should be rejected. It is a
shortcut that doesn't really solve the underlying problem.

> > Trying to reap a different oom victim when the current one is not
> > making progress during the lock contention is certainly something that
> > makes sense. It has been proposed in the past and we just gave it up
> > because it was more complex. Do you have any specific example where
> > this would help, to justify the additional complexity?
>
> I'm not sure how you're defining complexity. The patch adds ~30 lines of
> code, prevents processes from needlessly being oom killed when oom
> reaping is largely unsuccessful before the victim finishes
> free_pgtables(), and also allows the oom reaper to operate on multiple
> mm's instead of processing one at a time. Obviously, if there is a delay
> before MMF_OOM_SKIP is set, the oom reaper must be able to process other
> mm's, otherwise we would stall needlessly for 10s. Operating on multiple
> mm's in a linked list while waiting for victims to exit during the
> timeout period is thus very much needed; it wouldn't make sense without
> it.

It needs to keep track of the current retry state of the reaped victim,
and that is additional complexity, isn't it? And I am asking how often we
actually have to handle that case. Please note that the primary objective
here is to unclutter a locked-up situation. The oom reaper doesn't prevent
the victim from going away on its own while we keep retrying, so slow
progress on the reaper side is not an issue IMHO.
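
Just to spell out what I mean by that: the reaper would suddenly have to
carry per-victim state around on its queue, something like the below
(names completely made up, just to show the shape of it):

    /* hypothetical bookkeeping, not taken from the patch */
    struct reap_entry {
            struct list_head  list;      /* linked on the reaper's queue */
            struct mm_struct  *mm;       /* the victim's address space */
            unsigned long     giveup;    /* jiffies when MMF_OOM_SKIP is forced */
            int               attempts;  /* reap retries so far */
    };

plus the locking for that list and a policy for when to rotate to the
next entry. That is the kind of complexity I am talking about.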

> > > But also note that even if oom reaping is possible, in the presence
> > > of an antagonist that continues to allocate memory, it is possible
> > > to oom kill additional victims unnecessarily if we aren't able to
> > > complete free_pgtables() in exit_mmap() of the original victim.
> >
> > If there is an unbounded source of allocations then we are screwed no
> > matter what. We just hope that the allocator will get noticed by the
> > oom killer and will be stopped.
>
> It's not unbounded, it's just an allocator that acts as an antagonist.
> At the risk of being overly verbose, for system or memcg oom conditions:
> a large mlocked process is oom killed, other processes continue to
> allocate/charge, the oom reaper almost immediately grants MMF_OOM_SKIP
> without being able to free any memory, and other important processes are
> needlessly oom killed before the original victim can reach exit_mmap().
> This happens a _lot_.
>
> I'm open to hearing any other suggestions that you have other than
> waiting some time period before MMF_OOM_SKIP gets set to solve this
> problem.

I've already offered one: make mlocked pages reapable. This has been on
the todo list for quite some time; I just didn't have time to work on it.
The priority was not at the top because most sane workloads simply do not
mlock a large portion of their memory. But if you see this happening
regularly, then it should be the first thing to try. The main obstacle
back then was the page lock currently taken in the munlock path. I've
discussed that with Hugh and IIRC he said it is mainly there for
accounting purposes and mostly a relic from the past, so this should be
fixable, and a general improvement as well.
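
The direction would be to let the reaper do for VM_LOCKED vmas what
exit_mmap() already does before unmapping them. Roughly (a sketch only,
and the munlock below is exactly where the per-page page lock gets in the
way, because the reaper must not block):

    /* sketch: munlock first so that mlocked vmas become reapable */
    for (vma = mm->mmap; vma; vma = vma->vm_next) {
            if (vma->vm_flags & VM_LOCKED)
                    munlock_vma_pages_all(vma);  /* takes the page lock per page */
            /* ... then unmap_page_range() as for any other anonymous vma ... */
    }

-- 
Michal Hocko
SUSE Labs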