Date: Mon, 28 May 2018 10:13:45 +0200
From: Michal Hocko
To: David Rientjes
Cc: Tetsuo Handa, Andrew Morton, linux-kernel@vger.kernel.org,
	linux-mm@kvack.org
Subject: Re: [rfc patch] mm, oom: fix unnecessary killing of additional processes
Message-ID: <20180528081345.GD1517@dhcp22.suse.cz>
References: <20180525072636.GE11881@dhcp22.suse.cz>
User-Agent: Mutt/1.9.5 (2018-04-13)

On Fri 25-05-18 12:36:08, David Rientjes wrote:
> On Fri, 25 May 2018, Michal Hocko wrote:
>
> > > The oom reaper ensures forward progress by setting MMF_OOM_SKIP itself if
> > > it cannot reap an mm. This can happen for a variety of reasons,
> > > including:
> > >
> > >  - the inability to grab mm->mmap_sem in a sufficient amount of time,
> > >
> > >  - when the mm has blockable mmu notifiers that could cause the oom reaper
> > >    to stall indefinitely,
> > >
> > > but we can also add a third when the oom reaper can "reap" an mm but doing
> > > so is unlikely to free any amount of memory:
> > >
> > >  - when the mm's memory is fully mlocked.
> > >
> > > When all memory is mlocked, the oom reaper will not be able to free any
> > > substantial amount of memory. It sets MMF_OOM_SKIP before the victim can
> > > unmap and free its memory in exit_mmap() and subsequent oom victims are
> > > chosen unnecessarily. This is trivial to reproduce if all eligible
> > > processes on the system have mlocked their memory: the oom killer calls
> > > panic() even though forward progress can be made.
> > >
> > > This is the same issue where the exit path sets MMF_OOM_SKIP before
> > > unmapping memory and additional processes can be chosen unnecessarily
> > > because the oom killer is racing with exit_mmap().
> > >
> > > We can't simply defer setting MMF_OOM_SKIP, however, because if there is
> > > a true oom livelock in progress, it never gets set and no additional
> > > killing is possible.
> > >
> > > To fix this, this patch introduces a per-mm reaping timeout, initially set
> > > at 10s. It requires that the oom reaper's list becomes a properly linked
> > > list so that other mm's may be reaped while waiting for an mm's timeout to
> > > expire.
> >
> > No timeouts please! The proper way to handle this problem is to simply
> > teach the oom reaper to handle mlocked areas.
>
> That's not sufficient since the oom reaper is also not able to oom reap if
> the mm has blockable mmu notifiers or all memory is shared file-backed
> memory, so it immediately sets MMF_OOM_SKIP and additional processes are
> oom killed.

Could you be more specific with a real world example where that is the
case? I mean, a full address space of non-reclaimable, file-backed memory
where waiting some more would help? Blockable mmu notifiers are a PITA
for sure. I wish we could have a better way to deal with them. Maybe we
can tell them we are in a non-blockable context and have them release as
much as possible. Still, that is something a random timeout wouldn't help
with, I am afraid.
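To make the "teach the oom reaper to handle mlocked areas" point above
concrete, a rough sketch against the current __oom_reap_task_mm() per-vma
loop might look as follows. It is an illustration rather than a patch:
the munlock step is the hard part, because munlock_vma_pages_all()
normally runs with exclusive mmap_sem while the reaper only holds it for
read, and the mmu notifier invalidation and MMF_UNSTABLE handling of the
real loop are omitted here.

static void __oom_reap_task_mm(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	for (vma = mm->mmap; vma; vma = vma->vm_next) {
		/*
		 * Today a VM_LOCKED vma fails this check and is
		 * skipped, which is why a fully mlocked mm reaps
		 * nothing.
		 */
		if (!can_madv_dontneed_vma(vma)) {
			if (!(vma->vm_flags & VM_LOCKED))
				continue;	/* hugetlb, pfnmap */
			/*
			 * Sketch: munlock the pages (this also clears
			 * VM_LOCKED) so that the range becomes
			 * reapable below. Racy as written; see the
			 * locking caveat above.
			 */
			munlock_vma_pages_all(vma);
		}

		/*
		 * Only anonymous private memory has a good chance of
		 * being freed without additional steps.
		 */
		if (vma_is_anonymous(vma) || !(vma->vm_flags & VM_SHARED)) {
			struct mmu_gather tlb;

			tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
			unmap_page_range(&tlb, vma, vma->vm_start,
					 vma->vm_end, NULL);
			tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
		}
	}
}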
> The current implementation that relies on MAX_OOM_REAP_RETRIES is acting
> as a timeout already for mm->mmap_sem, but it's doing so without
> attempting to oom reap other victims that may actually allow it to grab
> mm->mmap_sem if the allocator is waiting on a lock.

Trying to reap a different oom victim when the current one is not making
progress due to lock contention is certainly something that makes sense.
It has been proposed in the past and we gave up on it because it was more
complex. Do you have any specific example where this would help, to
justify the additional complexity?
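For reference, the shape of such a change is roughly the following. This
is only a sketch: oom_reap_entry is a made-up list_head in task_struct
(the current implementation chains victims through a single
oom_reaper_list pointer instead), oom_reap_task_mm() stands for the
existing one-shot reaping attempt that fails when mm->mmap_sem cannot be
taken, and it deliberately leaves open the question this thread is about,
namely when to give up and set MMF_OOM_SKIP.

static LIST_HEAD(oom_reaper_queue);
static DEFINE_SPINLOCK(oom_reaper_lock);
static DECLARE_WAIT_QUEUE_HEAD(oom_reaper_wait);

static int oom_reaper(void *unused)
{
	while (true) {
		struct task_struct *tsk = NULL;
		struct mm_struct *mm;

		wait_event_freezable(oom_reaper_wait,
				     !list_empty(&oom_reaper_queue));

		spin_lock(&oom_reaper_lock);
		if (!list_empty(&oom_reaper_queue)) {
			tsk = list_first_entry(&oom_reaper_queue,
					       struct task_struct,
					       oom_reap_entry);
			list_del_init(&tsk->oom_reap_entry);
		}
		spin_unlock(&oom_reaper_lock);
		if (!tsk)
			continue;

		mm = tsk->signal->oom_mm;
		if (oom_reap_task_mm(tsk, mm)) {
			/* Reaped; now it is safe to move on. */
			set_bit(MMF_OOM_SKIP, &mm->flags);
			put_task_struct(tsk);
		} else {
			/*
			 * mmap_sem is contended: requeue at the tail
			 * and give other victims a turn instead of
			 * retrying this one in place. What is missing
			 * is a bound (a retry count or a timeout)
			 * after which we set MMF_OOM_SKIP anyway;
			 * that bound is the contended part of this
			 * discussion.
			 */
			spin_lock(&oom_reaper_lock);
			list_add_tail(&tsk->oom_reap_entry,
				      &oom_reaper_queue);
			spin_unlock(&oom_reaper_lock);
		}
	}
	return 0;
}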
> The solution, as proposed, is to allow the oom reaper to iterate over all
> victims and try to free memory rather than working on each victim one by
> one and giving up.
>
> But also note that even if oom reaping is possible, in the presence of an
> antagonist that continues to allocate memory, it is possible to oom
> kill additional victims unnecessarily if we aren't able to complete
> free_pgtables() in exit_mmap() of the original victim.

If there is an unbounded source of allocations then we are screwed no
matter what. We just hope that the allocator will get noticed by the oom
killer and it will be stopped.

That being said, I do not object to justified improvements in the oom
reaping. But I absolutely detest random timeouts and will nack
implementations based on them until it is absolutely clear there is no
other way around.

-- 
Michal Hocko
SUSE Labs