Date: Wed, 21 Apr 2021 09:23:05 +0200
From: Michal Hocko
To: Shakeel Butt
Cc: Roman Gushchin, Johannes Weiner, Linux MM, Andrew Morton, Cgroups, David Rientjes, LKML, Suren Baghdasaryan, Greg Thelen, Dragos Sbirlea, Priya Duraisamy
Subject: Re: [RFC] memory reserve for userspace oom-killer
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue 20-04-21 18:18:29, Shakeel Butt wrote:
> On Tue, Apr 20, 2021 at 12:18 PM Roman Gushchin wrote:
> >
> > On Mon, Apr 19, 2021 at 06:44:02PM -0700, Shakeel Butt wrote:
> [...]
> > > 1. prctl(PF_MEMALLOC)
> > >
> > > The idea is to give userspace oom-killer (just one thread which is
> > > finding the appropriate victims and will be sending SIGKILLs) access
> > > to MEMALLOC reserves. Most of the time the preallocation, mlock and
> > > memory.min will be good enough but for rare occasions, when the
> > > userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
> > > protect it from reclaim and let the allocation dip into the memory
> > > reserves.
> > >
> > > The misuse of this feature would be risky but it can be limited to
> > > privileged applications. Userspace oom-killer is the only appropriate
> > > user of this feature. This option is simple to implement.
> >
> > Hello Shakeel!
> >
> > If ordinary PAGE_SIZE and smaller kernel allocations start to fail,
> > the system is already in a relatively bad shape. Arguably the userspace
> > OOM killer should kick in earlier, it's already a bit too late.
>
> Please note that these are not allocation failures but rather reclaim
> on allocations (which is very normal). Our observation is that this
> reclaim is very unpredictable and depends on the type of memory
> present on the system which depends on the workload. If there is a
> good amount of easily reclaimable memory (e.g. clean file pages), the
> reclaim would be really fast. However for other types of reclaimable
> memory the reclaim time varies a lot. The unreclaimable memory, pinned
> memory, too many direct reclaimers, too many isolated memory and many
> other things/heuristics/assumptions make the reclaim further
> non-deterministic.
>
> In our observation the global reclaim is very non-deterministic at the
> tail and dramatically impacts the reliability of the system. We are
> looking for a solution which is independent of the global reclaim.

I believe it is worth pursuing a solution that would make the memory
reclaim more predictable. I have seen direct reclaim memory throttling
in the past. For some reason which I haven't tried to examine, this has
become less of a problem with newer kernels. Maybe the memory access
patterns have changed or those problems got replaced by other issues,
but excessive throttling is definitely something that we want to
address rather than work around with some user-visible APIs.

> > Allowing to use reserves just pushes this even further, so we're risking
> > the kernel stability for no good reason.
>
> Michal has suggested ALLOC_OOM which is less risky.
>
> >
> > But I agree that throttling the oom daemon in direct reclaim makes no sense.
> > I wonder if we can introduce a per-task flag which will exclude the task from
> > throttling, but instead all (large) allocations will just fail under a
> > significant memory pressure more easily. In this case if there is a significant
> > memory shortage the oom daemon will not be fully functional (will get -ENOMEM
> > for an attempt to read some stats, for example), but still will be able to kill
> > some processes and make the forward progress.
>
> So, the suggestion is to have a per-task flag to (1) indicate to not
> throttle and (2) fail allocations easily on significant memory
> pressure.
>
> For (1), the challenge I see is that there are a lot of places in the
> reclaim code paths where a task can get throttled. There are
> filesystems that block/throttle in slab shrinking. Any process can get
> blocked on an unrelated page or inode writeback within reclaim.
>
> For (2), I am not sure how to deterministically define "significant
> memory pressure".
> One idea is to follow the __GFP_NORETRY semantics
> and along with (1) the userspace oom-killer will see ENOMEM more
> reliably than stucking in the reclaim.

Some of the interfaces (e.g. seq_file uses GFP_KERNEL reclaim strength)
could be more relaxed and rather fail than OOM kill, but wouldn't your
OOM handler be effectively dysfunctional when not able to collect data
to make a decision?

--
Michal Hocko
SUSE Labs
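
For illustration only, here is a rough userspace sketch of the
"preallocation, mlock" setup mentioned in the quoted proposal. It is not
code from this thread. It assumes Linux and uses Python's ctypes to reach
mlockall(2); the function names (lock_all_memory, read_stats, kill_victim)
are hypothetical. It also shows where the concern about collecting data
applies: the userspace buffer is preallocated, but the kernel may still
need to allocate while serving the read.

```python
import ctypes
import os
import signal

# Values from <sys/mman.h> on most Linux architectures. This is an
# assumption: a few architectures (e.g. powerpc, sparc) use other values.
MCL_CURRENT = 1
MCL_FUTURE = 2


def lock_all_memory() -> int:
    """Pin current and future mappings via mlockall(2).

    Returns 0 on success or the errno value on failure, for example when
    RLIMIT_MEMLOCK is too low and the caller lacks CAP_IPC_LOCK.
    """
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.mlockall(MCL_CURRENT | MCL_FUTURE) != 0:
        return ctypes.get_errno()
    return 0


def read_stats(path: str, buf: bytearray) -> int:
    """Read a stats file into a preallocated buffer; return bytes read.

    Preallocating buf avoids a userspace allocation for the data, but it
    does not cover kernel-side allocations made while serving the read,
    so the call can still stall in reclaim or raise OSError.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        return os.readv(fd, [buf])
    finally:
        os.close(fd)


def kill_victim(pid: int) -> None:
    """Send SIGKILL to the selected victim."""
    os.kill(pid, signal.SIGKILL)
```

A daemon built this way would call lock_all_memory() once at startup and
reuse a single bytearray for every read_stats() call. Victim selection is
left out on purpose, since it depends on the data the handler manages to
collect.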