Date: Thu, 28 Feb 2019 10:52:50 +0100
From: Michal Hocko
To: Johannes Weiner
Cc: Chris Down, Andrew Morton, Tejun Heo, Roman Gushchin, linux-kernel@vger.kernel.org, cgroups@vger.kernel.org, linux-mm@kvack.org, kernel-team@fb.com
Subject: Re: [PATCH] mm: Throttle allocators when failing reclaim over memory.high
Message-ID: <20190228095250.GA21078@dhcp22.suse.cz>
References: <20190201011352.GA14370@chrisdown.name> <20190201071757.GE11599@dhcp22.suse.cz> <20190201161233.GA11231@cmpxchg.org>
In-Reply-To: <20190201161233.GA11231@cmpxchg.org>
[Sorry for a late reply]

On Fri 01-02-19 11:12:33, Johannes Weiner wrote:
> On Fri, Feb 01, 2019 at 08:17:57AM +0100, Michal Hocko wrote:
> > On Thu 31-01-19 20:13:52, Chris Down wrote:
> > [...]
> > > The current situation goes against both the expectations of users of
> > > memory.high, and our intentions as cgroup v2 developers. In
> > > cgroup-v2.txt, we claim that we will throttle and only under "extreme
> > > conditions" will memory.high protection be breached. Likewise, cgroup v2
> > > users generally also expect that memory.high should throttle workloads
> > > as they exceed their high threshold. However, as seen above, this isn't
> > > always how it works in practice -- even on banal setups like those with
> > > no swap, or where swap has become exhausted, we can end up with
> > > memory.high being breached and us having no weapons left in our arsenal
> > > to combat runaway growth with, since reclaim is futile.
> > >
> > > It's also hard for system monitoring software or users to tell how bad
> > > the situation is, as "high" events for the memcg may in some cases be
> > > benign, and in others be catastrophic. The current status quo is that we
> > > fail containment in a way that doesn't provide any advance warning that
> > > things are about to go horribly wrong (for example, we are about to
> > > invoke the kernel OOM killer).
> > >
> > > This patch introduces explicit throttling when reclaim is failing to
> > > keep memcg size contained at the memory.high setting. It does so by
> > > applying an exponential delay curve derived from the memcg's overage
> > > compared to memory.high. In the normal case where the memcg is either
> > > below or only marginally over its memory.high setting, no throttling
> > > will be performed.
> >
> > How does this play with the actual OOM when the user expects OOM to
> > resolve the situation because reclaim is futile and there is nothing
> > reclaimable except for killing a process?
>
> Hm, can you elaborate on your question a bit?
>
> The idea behind memory.high is to throttle allocations long enough for
> the admin or a management daemon to intervene, but not to trigger the
> kernel oom killer. It was designed as a replacement for the cgroup1
> oom_control, but without the deadlock potential, ptrace problems etc.

Yes, this makes sense. The high limit reclaim is also documented as a
best effort resource guarantee. My understanding is that if the workload
cannot be contained within the high limit then the system cannot do much
and eventually gives up. Having all of the memory unreclaimable is such
an example. And then it is either the global OOM killer or the hard
limit OOM killer that resolves the situation.

[...]

Thanks for describing the usecase.

> Right now, that throttling mechanism works okay with swap enabled, but
> we cannot enable swap everywhere, or sometimes run out of swap, and
> then it breaks down and we run into system OOMs.
>
> This patch makes sure memory.high *always* implements the throttling
> semantics described in cgroup-v2.txt, not just most of the time.

I am not really opposed to throttling in the absence of reclaimable
memory. We do that for the regular allocation paths already
(should_reclaim_retry). A swapless system with anon memory is very
likely to OOM too quickly, so this sounds like a real problem. But I do
not think that we should throttle an allocation to the point of freezing
it completely.
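Just to illustrate the kind of behavior I would find acceptable -- and
this is a completely made up toy example, not code from the patch --
the delay could grow with the overage above the high limit but stay
bounded, so the allocator is slowed down rather than stalled forever:

/* Illustrative toy only -- this is not the code from the patch. */
#include <stdio.h>

#define HZ		250UL		/* ticks per second, just for the example */
#define MAX_DELAY	(2 * HZ)	/* made up upper bound: 2 seconds */

/* usage and high are in pages; returns a delay in jiffies */
static unsigned long high_overage_delay(unsigned long usage, unsigned long high)
{
	unsigned long pct, penalty;

	if (!high || usage <= high)
		return 0;

	/* overage as a percentage of the high limit */
	pct = (usage - high) * 100 / high;
	/* grow quadratically with the overage: ~1s at 100% over, clamped at 2s */
	penalty = pct * pct * HZ / (100 * 100);

	return penalty < MAX_DELAY ? penalty : MAX_DELAY;
}

int main(void)
{
	unsigned long high = 262144;	/* 1G worth of 4k pages */
	unsigned long usage;

	for (usage = high; usage <= 2 * high; usage += high / 4)
		printf("usage=%lu delay=%lu jiffies\n",
		       usage, high_overage_delay(usage, high));
	return 0;
}

The exact curve and numbers do not matter to me; the point is that the
per-allocation delay has an upper bound.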
We should eventually OOM. And that is essentially what my question was
about: how much can/should we throttle to give a consumer of high limit
events enough time to intervene? I am sorry that I still haven't found
the time to study the patch more closely, but this should be explained
in the changelog. Are we talking about seconds/minutes, or about
freezing each allocator to death?
-- 
Michal Hocko
SUSE Labs