From: Shakeel Butt
Date: Mon, 19 Apr 2021 18:44:02 -0700
Subject: [RFC] memory reserve for userspace oom-killer
To: Johannes Weiner, Roman Gushchin, Michal Hocko, Linux MM, Andrew Morton,
    Cgroups, David Rientjes, LKML, Suren Baghdasaryan
Cc: Greg Thelen, Dragos Sbirlea, Priya Duraisamy

Proposal: Provide memory guarantees to userspace oom-killer.

Background:

Issues with the kernel oom-killer:
1. Very conservative and prefers to reclaim; applications can suffer
   for a long time.
2. Borrows the context of the allocator, which can be resource limited
   (low sched priority or limited CPU quota).
3. Serialized by a global lock.
4. Very simplistic oom victim selection policy.

These issues are resolved through a userspace oom-killer by:
1. The ability to monitor arbitrary metrics (PSI, vmstat, memcg stats)
   to detect suffering early.
2. An independent process context which can be given a dedicated CPU
   quota and high scheduling priority.
3. The freedom to be more aggressive as required.
4. The ability to implement sophisticated business logic/policies.

Android's LMKD and Facebook's oomd are the prime examples of userspace
oom-killers.

One of the biggest challenges for userspace oom-killers is that they
may have to function under intense memory pressure, and they are prone
to getting stuck in memory reclaim themselves. Current userspace
oom-killers aim to avoid this situation by preallocating user memory
and protecting themselves from global reclaim through mlock or
memory.min. However, a new allocation from the userspace oom-killer can
still get stuck in reclaim, and a policy-rich oom-killer does trigger
new allocations through syscalls or even the heap.
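For readers unfamiliar with that pattern, here is a minimal userspace
sketch of the self-protection described above (purely illustrative, not
taken from LMKD or oomd): mlock everything, pre-fault a reserve, and
arm a PSI trigger (kernel >= 5.2) for early detection. The reserve size
and the trigger thresholds are arbitrary example values.

#include <fcntl.h>
#include <poll.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define RESERVE_SZ (8 << 20)    /* example: 8MiB preallocated reserve */

int main(void)
{
        /* Keep all current and future pages resident; needs
         * CAP_IPC_LOCK or a sufficient RLIMIT_MEMLOCK. */
        if (mlockall(MCL_CURRENT | MCL_FUTURE))
                perror("mlockall");

        /* Pre-fault a private reserve so that using it later does not
         * allocate new pages. */
        char *reserve = malloc(RESERVE_SZ);
        if (!reserve)
                return 1;
        memset(reserve, 1, RESERVE_SZ);

        /* PSI trigger: notify when "some" memory stall time exceeds
         * 150ms within any 1s window. */
        const char trig[] = "some 150000 1000000";
        int fd = open("/proc/pressure/memory", O_RDWR | O_NONBLOCK);
        if (fd < 0 || write(fd, trig, strlen(trig)) < 0) {
                perror("psi trigger");
                return 1;
        }

        struct pollfd pfd = { .fd = fd, .events = POLLPRI };
        while (poll(&pfd, 1, -1) >= 0) {
                if (pfd.revents & POLLPRI) {
                        /* Pressure detected early: select a victim and
                         * kill(victim, SIGKILL) without making new
                         * large allocations. */
                }
        }
        return 0;
}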
Our own attempt at a userspace oom-killer faces similar challenges.
Particularly at the tail, on very highly utilized machines, we have
observed the userspace oom-killer failing spectacularly in many
possible ways in direct reclaim. We have seen the oom-killer stuck in
direct reclaim throttling, and stuck in reclaim while allocations from
interrupts kept stealing the reclaimed memory. We have even observed
systems where all the processes were stuck in
throttle_direct_reclaim(), only kswapd was running, and the interrupts
kept stealing the memory reclaimed by kswapd.

To reliably solve this problem, we need to give guaranteed memory to
the userspace oom-killer. At the moment we are weighing the following
options, and I would like to get some feedback.

1. prctl(PF_MEMALLOC)

The idea is to give the userspace oom-killer (just the one thread which
finds the appropriate victims and sends the SIGKILLs) access to the
MEMALLOC reserves. Most of the time the preallocation, mlock and
memory.min will be good enough, but on the rare occasions when the
userspace oom-killer needs to allocate, the PF_MEMALLOC flag will
protect it from reclaim and let the allocation dip into the memory
reserves. Misuse of this feature would be risky, but it can be limited
to privileged applications, and a userspace oom-killer is the only
appropriate user. This option is simple to implement (a rough sketch is
appended at the end of this mail).

2. Mempool

The idea is to preallocate a mempool with a given amount of memory for
the userspace oom-killer. Preferably this would be per-thread, so the
oom-killer can preallocate mempools for its specific threads. Before
going into the reclaim path, the core page allocator can check whether
the task has private access to a mempool and, if so, return a page from
it (see the allocator-side sketch appended below). This option would be
more complicated than the previous one, as the lifecycle of a page from
the mempool would be more involved. Additionally, the current mempool
implementation does not handle higher-order pages, and we might need to
extend it to allow such allocations. On the other hand, this feature
might have more use-cases, and it would be less risky than the previous
option.

Another idea I had was to use a kthread-based oom-killer and provide
the policies through an eBPF program, though I am not sure how to make
it monitor arbitrary metrics, or whether that can be done without any
allocations.

Please do provide feedback on these approaches.

thanks,
Shakeel
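Appendix: rough sketches of the two options. These are illustrations
only: PR_SET_MEMALLOC is a made-up prctl number, and none of the code
below exists in the kernel today. For option 1, the handler could be a
few lines in the prctl() switch in kernel/sys.c:

        case PR_SET_MEMALLOC:
                /* Limit the risky flag to privileged tasks. */
                if (!capable(CAP_SYS_ADMIN))
                        return -EPERM;
                if (arg2)
                        /* Skip reclaim, allow dipping into reserves. */
                        current->flags |= PF_MEMALLOC;
                else
                        current->flags &= ~PF_MEMALLOC;
                break;

The killer thread would then bracket its victim-selection work with
prctl(PR_SET_MEMALLOC, 1, 0, 0, 0) and prctl(PR_SET_MEMALLOC, 0, 0, 0, 0).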
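For option 2, the allocator-side check might look roughly like the
following. The task_struct field (oom_mempool) and the helper are
hypothetical; the pool itself could come from something like
mempool_create_page_pool(), set up when the oom-killer registers its
threads:

/* Called from the allocation slowpath, before entering direct reclaim. */
static struct page *try_private_mempool(gfp_t gfp_mask, unsigned int order)
{
        /* Only tasks that preallocated a private pool get this escape
         * hatch; everyone else takes the normal reclaim path. */
        if (!current->oom_mempool)
                return NULL;

        /* Mempool elements are fixed-size; higher-order allocations
         * would need the extension mentioned above. */
        if (order != 0)
                return NULL;

        /* Clear __GFP_DIRECT_RECLAIM so mempool_alloc() itself cannot
         * sleep in reclaim. */
        return mempool_alloc(current->oom_mempool,
                             gfp_mask & ~__GFP_DIRECT_RECLAIM);
}

Freeing is the harder part: pages handed out this way would have to be
routed back through mempool_free() instead of the normal free path,
which is exactly the page lifecycle complexity mentioned above.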