From: Shakeel Butt
Date: Wed, 17 Feb 2021 06:59:55 -0800
Subject: Re: [RFC PATCH] mm, oom: introduce vm.sacrifice_hugepage_on_oom
To: David Rientjes, Johannes Weiner, Tejun Heo
Cc: Michal Hocko, Eiichi Tsukata, Jonathan Corbet, Mike Kravetz, mcgrof@kernel.org, Kees Cook, yzaikin@google.com, Andrew Morton, linux-doc@vger.kernel.org, LKML, Linux MM, linux-fsdevel, felipe.franciosi@nutanix.com
References: <20210216030713.79101-1-eiichi.tsukata@nutanix.com>
List-ID: linux-kernel@vger.kernel.org

On Tue, Feb 16, 2021 at 5:25 PM David Rientjes wrote:
>
> On Tue, 16 Feb 2021, Michal Hocko wrote:
>
> > > Hugepages can be preallocated to avoid unpredictable allocation latency.
> > > If we run into a 4k page shortage, the kernel can trigger OOM even though
> > > there are free hugepages. When OOM is triggered by the user address page
> > > fault handler, we can use an oom notifier to free hugepages in user space,
> > > but if it's triggered by a memory allocation for the kernel, there is no
> > > way to handle it synchronously in user space.
> >
> > Can you expand some more on what kind of problem you see?
> > Hugetlb pages are, by definition, a preallocated, unreclaimable, and
> > admin-controlled pool of pages.
>
> Small nit: true of non-surplus hugetlb pages.
>
> > Under those conditions it is expected
> > and required that the sizing be done very carefully. Why is that a
> > problem in your particular setup/scenario?
> >
> > If the sizing really is done properly and a random process can still
> > trigger OOM, then this can lead to malfunctioning of those workloads
> > which do depend on the hugetlb pool, right? So isn't this a kind of DoS
> > scenario?
> >
> > > This patch introduces a new sysctl, vm.sacrifice_hugepage_on_oom. If
> > > enabled, it first tries to free a hugepage, if one is available, before
> > > invoking the oom-killer. The default value is disabled, so as not to
> > > change the current behavior.
> >
> > Why is this interface not hugepage-size aware? It is quite different to
> > release a 1GB huge page versus a 2MB one. Or is it expected to release
> > the smallest one? On to the implementation...
> >
> > [...]
> > > +static int sacrifice_hugepage(void)
> > > +{
> > > +	int ret;
> > > +
> > > +	spin_lock(&hugetlb_lock);
> > > +	ret = free_pool_huge_page(&default_hstate, &node_states[N_MEMORY], 0);
> >
> > ... no, it is going to release a page of the default huge page size.
> > This will be 2MB in most cases, but that is not a given.
> >
> > Unless I am mistaken, this will also free up reserved hugetlb pages.
> > That would mean a page fault could SIGBUS, which is very likely not
> > something we want, right? You also want to use the oom nodemask rather
> > than a full one.
> >
> > Overall, I am not really happy about this feature even with the above
> > fixed, but let's hear more about the actual problem first.
>
> Shouldn't this behavior be possible as an oomd plugin instead, perhaps
> triggered by PSI? I'm not sure if oomd is intended only to kill something
> (oomkilld? lol) or if it can be made to do sysadmin-level behavior, such
> as shrinking the hugetlb pool, to solve the oom condition.
The senpai plugin of oomd is actually a proactive reclaimer, so oomd is
already being used for more than oom-killing.

> If so, it seems like we want to do this at the absolute last minute. In
> other words, reclaim has failed to free memory by other means, so we
> would like to shrink the hugetlb pool. (That's the reason it's
> implemented as a predecessor to oom, as opposed to part of reclaim in
> general.)
>
> Do we have the ability to suppress the oom killer until oomd has a
> chance to react in this scenario?

There is no explicit knob, but there are indirect ways to delay the
kernel oom-killer. In the presence of reclaimable memory, the kernel is
very conservative about triggering an oom-kill. I think the way Facebook
achieves this with oomd is by using swap to provide enough reclaimable
memory and then using memory.swap.high to throttle the workload's
allocation rate, which increases the PSI metrics as well. Since oomd
polls PSI, it will be able to react before the kernel oom-killer.