From: Suren Baghdasaryan
Date: Tue, 4 May 2021 19:59:13 -0700
Subject: Re: [RFC] memory reserve for userspace oom-killer
To: Shakeel Butt
Cc: Michal Hocko, Johannes Weiner, Roman Gushchin, Linux MM,
    Andrew Morton, Cgroups, David Rientjes, LKML, Greg Thelen,
    Dragos Sbirlea, Priya Duraisamy

On Tue, May 4, 2021 at 7:45 PM Shakeel Butt wrote:
>
> On Tue, May 4, 2021 at 6:26 PM Suren Baghdasaryan wrote:
> >
> > On Tue, May 4, 2021 at 5:37 PM Shakeel Butt wrote:
> > >
> > > On Wed, Apr 21, 2021 at 7:29 AM Michal Hocko wrote:
> > > [...]
> > > > > > What if the pool is depleted?
> > > > >
> > > > > This would mean that either the estimate of the mempool size is
> > > > > bad or the oom-killer is buggy and leaking memory.
> > > > >
> > > > > I am open to any design directions for the mempool or some other
> > > > > way to provide a notion of a memory guarantee to the oom-killer.
> > > >
> > > > OK, thanks for the clarification. There will certainly be hard
> > > > problems to sort out[1], but the overall idea makes sense to me and
> > > > it sounds like a much better approach than an OOM-specific solution.
> > > >
> > > > [1] - how the pool is going to be replenished without hitting all
> > > > the potential reclaim problems (and thus dependencies on all other
> > > > tasks, directly or indirectly), yet without relying on background
> > > > workers to do it on the task's behalf without proper accounting,
> > > > etc...
> > >
> > > I am currently deciding between two paths here:
> > >
> > > First, the mempool, exposed through either prctl or a new syscall.
> > > Users would need to trace their userspace oom-killer (or whatever
> > > their use case is) to find an appropriate mempool size and would
> > > periodically refill the mempool when the state of the machine allows.
> > > The challenge here is finding a good value for the mempool size and
> > > coordinating the refills.
> > >
> > > Second is a mix of Roman's and Peter's suggestions, but much more
> > > simplified: a very simple watchdog with a kill-list of processes. If
> > > userspace does not pet the watchdog within a specified time, it kills
> > > all the processes on the kill-list. The challenge here is maintaining
> > > and updating the kill-list.
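(For illustration only, the first path might look roughly like this from
userspace. PR_SET_MEM_RESERVE and its command number are invented for
this sketch; no such prctl exists today:)

#include <sys/prctl.h>
#include <stdio.h>

/* Hypothetical command, not a real prctl: ask the kernel to set aside
 * a per-task reserve that this task's allocations may dip into when
 * direct reclaim would otherwise block it. */
#define PR_SET_MEM_RESERVE 77

int main(void)
{
        /* Size found empirically by tracing the oom-killer's
         * allocations during past memory emergencies. */
        unsigned long reserve_bytes = 2UL << 20;    /* 2 MiB */

        if (prctl(PR_SET_MEM_RESERVE, reserve_bytes, 0, 0, 0) < 0) {
                perror("prctl(PR_SET_MEM_RESERVE)");
                return 1;
        }

        /* ... oom-killer main loop; re-issue the prctl to top the
         * reserve back up whenever memory pressure is low ... */
        return 0;
}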
> >
> > IIUC this solution is designed to identify cases when oomd/lmkd get
> > stuck while allocating memory due to memory shortages and therefore
> > can't feed the watchdog. In such a case the kernel goes ahead and
> > kills some processes to free up memory and unblock the blocked
> > process. Effectively this would limit the time such a process stays
> > stuck to the duration of the watchdog timeout. If my understanding of
> > this proposal is correct,
>
> Your understanding is indeed correct.
>
> > then I see the following downsides:
> >
> > 1. oomd/lmkd are still not prevented from getting stuck; this only
> > limits the duration of the blocked state. Delaying kills when memory
> > pressure is high, even for a short duration, is very undesirable.
>
> Yes, I agree.
>
> > I think having mempool reserves could address this issue better, if
> > they can always guarantee memory availability (not sure whether that
> > is possible in practice).
>
> I think "mempool ... always guarantee memory availability" is
> something I should quantify with some experiments.
>
> > 2. What would be the performance overhead of this watchdog? To limit
> > the duration a process can stay blocked to a small enough value, we
> > would need quite a short timeout, which means oomd/lmkd would have to
> > wake up quite often to feed the watchdog. Frequent wakeups on a
> > battery-powered system are not a good idea.
>
> This is indeed the downside, i.e. the tradeoff between acceptable
> stall and frequent wakeups.
>
> > 3. What if oomd/lmkd gets stuck for some memory-unrelated reason and
> > can't feed the watchdog? In such a scenario the kernel would assume
> > that it is stuck due to memory shortages and would go on a killing
> > spree.
>
> This is correct, but IMHO a killing spree is no worse than oomd/lmkd
> getting stuck for some other reason.
>
> > If there were a sure way to identify when a process gets stuck due to
> > memory shortages, this could work better.
>
> Hmm, are you suggesting looking at the stack traces of the userspace
> oom-killer, or some metrics related to it? That would complicate the
> code.

Well, I don't know of a sure and easy way to identify the reason for a
process being blocked, but maybe there is one I'm not aware of? My
point is that we would need some additional indication that memory is
the culprit before resorting to kills.

> > 4. Additional complexity of keeping the list of potential victims in
> > the kernel. Maybe we can simply reuse oom_score to choose the best
> > victims?
>
> Your point about the additional complexity is correct. Regarding
> oom_score, I think you meant oom_score_adj. I would avoid putting more
> policies/complexity in the kernel, but I take your point that the
> simplest watchdog might not be helpful at all.
>
> > Thanks,
> > Suren.
> >
> > > I would prefer the direction which oomd and lmkd are open to adopt.
> > >
> > > Any suggestions?
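(For reference, the oom_score_adj knob mentioned above is an existing
interface: each task exposes /proc/<pid>/oom_score_adj, and writing
-1000 (OOM_SCORE_ADJ_MIN) exempts the task from the kernel OOM killer,
which is how daemons like oomd/lmkd typically protect themselves today.
A minimal sketch:)

#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>

/* Write an oom_score_adj value for the given pid: -1000 exempts the
 * task from the kernel OOM killer, while positive values (up to 1000)
 * make it a preferred victim. */
static int set_oom_score_adj(pid_t pid, int adj)
{
        char path[64];
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/oom_score_adj", (int)pid);
        f = fopen(path, "w");
        if (!f)
                return -1;
        fprintf(f, "%d\n", adj);
        return fclose(f);
}

int main(void)
{
        /* Protect the userspace oom-killer itself before it starts
         * handling memory pressure. */
        return set_oom_score_adj(getpid(), -1000) ? 1 : 0;
}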