References: <20190411014353.113252-1-surenb@google.com> <20190411014353.113252-3-surenb@google.com> <20190411103018.tcsinifuj7klh6rp@brauner.io>
From: Daniel Colascione
Date: Thu, 11 Apr 2019 09:25:41 -0700
Subject: Re: [RFC 2/2] signal: extend pidfd_send_signal() to allow expedited process killing
To: Suren Baghdasaryan
Cc: Christian Brauner, Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox, yuzhoujian@didichuxing.com, Souptick Joarder, Roman Gushchin, Johannes Weiner, Tetsuo Handa, "Eric W. Biederman", Shakeel Butt, Minchan Kim, Tim Murray, Daniel Colascione, Joel Fernandes, Jann Horn, linux-mm, lsf-pc@lists.linux-foundation.org, LKML, kernel-team, Oleg Nesterov
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Apr 11, 2019 at 8:23 AM Suren Baghdasaryan wrote:
> > > On Wed, Apr 10, 2019 at 06:43:53PM -0700, Suren Baghdasaryan wrote:
> > > > Add new SS_EXPEDITE flag to be used when sending SIGKILL via
> > > > pidfd_send_signal() syscall to allow expedited memory reclaim of the
> > > > victim process. The usage of this flag is currently limited to SIGKILL
> > > > signal and only to privileged users.

FWIW, I like Suren's general idea, but I was thinking of a different
way of exposing the same general functionality to userspace.
The way I look at it, it's very useful for an auto-balancing memory
system like Android (or, I presume, something that uses oomd) to
recover memory *immediately* after a SIGKILL instead of waiting for
the process to kill itself: a process's death can be delayed for a
long time due to factors like scheduling and being blocked in various
uninterruptible kernel-side operations. Suren's proposal is basically
about pulling forward in time page reclamation that would happen
anyway.

What if we let userspace control exactly when this reclamation
happens? I'm imagining a new* kernel facility that basically looks
like this. It lets lmkd determine for itself how much work the system
should expend on reclaiming memory from dying processes.

    size_t try_reap_dying_process(
        int pidfd,
        int flags /* must be zero */,
        size_t maximum_bytes_to_reap);

    Precondition: process is pending group-exit (someone already sent
    it SIGKILL)
    Postcondition: some memory reclaimed from dying process
    Invariant: doesn't sleep; stops reaping after MAXIMUM_BYTES_TO_REAP

    -> success: return number of bytes reaped
    -> failure: (size_t)-1

    EBUSY: couldn't get mmap_sem
    EINVAL: PIDFD isn't a pidfd or otherwise invalid arguments
    EPERM: process hasn't been sent SIGKILL: try_reap_dying_process on
    a process that isn't dying is illegal

Kernel-side, try_reap_dying_process would try-acquire mmap_sem and
just fail if it couldn't get it. Once acquired, it would release
"easy" pages (using the same check the oom reaper uses) until it
either ran out of pages or hit the MAXIMUM_BYTES_TO_REAP cap. The
purpose of MAXIMUM_BYTES_TO_REAP is to let userspace bound-above the
amount of time we spend reclaiming pages. It'd be up to userspace to
set policy on retries, the actual value of the reap cap, the priority
at which we run TRY_REAP_DYING_PROCESS, and so on. We return the
number of bytes we managed to free this way so that lmkd can make an
informed decision about what to do next, e.g., kill something else or
wait a little while.
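
To make the contract above concrete, here is a rough userspace model
in C. It is only a sketch under loud assumptions: no such syscall
exists, so model_try_reap() operates on a made-up struct model_proc
instead of a pidfd, and the names model_proc, MODEL_PAGE_SIZE,
model_try_reap and reap_until are all hypothetical. The model also
assumes reaping happens in whole pages and never exceeds the cap,
which the proposal does not spell out. reap_until() shows one
possible lmkd-side policy (bounded chunks, limited retries on EBUSY),
not a recommended one.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for the model */

/* Hypothetical stand-in for a dying process; not a kernel structure. */
struct model_proc {
    bool   sigkill_pending;  /* precondition: SIGKILL already sent */
    bool   mmap_sem_busy;    /* models a failed try-acquire of mmap_sem */
    size_t easy_bytes;       /* bytes held in "easy" (reapable) pages */
};

/*
 * Model of the proposed try_reap_dying_process(). Never sleeps: every
 * failure path returns immediately with errno set, as in the proposal.
 */
size_t model_try_reap(struct model_proc *p, int flags, size_t max_bytes)
{
    if (p == NULL || flags != 0) { errno = EINVAL; return (size_t)-1; }
    if (!p->sigkill_pending)     { errno = EPERM;  return (size_t)-1; }
    if (p->mmap_sem_busy)        { errno = EBUSY;  return (size_t)-1; }

    size_t reaped = 0;
    /* Release whole pages until none are left or the cap is reached. */
    while (p->easy_bytes >= MODEL_PAGE_SIZE &&
           max_bytes - reaped >= MODEL_PAGE_SIZE) {
        p->easy_bytes -= MODEL_PAGE_SIZE;
        reaped += MODEL_PAGE_SIZE;
    }
    return reaped;
}

/*
 * One possible userspace policy: reap in chunks of `chunk` bytes until
 * `target` bytes are recovered, retrying at most `max_retries` times on
 * EBUSY. Returns the total number of bytes reaped.
 */
size_t reap_until(struct model_proc *p, size_t target, size_t chunk,
                  int max_retries)
{
    size_t total = 0;
    int retries = 0;

    assert(chunk >= MODEL_PAGE_SIZE);
    while (total < target) {
        size_t n = model_try_reap(p, 0, chunk);
        if (n == (size_t)-1) {
            if (errno == EBUSY && retries++ < max_retries)
                continue;        /* lock contended: try again */
            break;               /* EPERM, EINVAL, or out of retries */
        }
        if (n == 0)
            break;               /* nothing easy left to reap */
        total += n;
    }
    return total;
}
```

The return value is what lets the caller decide what to do next: if
reap_until() comes back short of its target, a daemon like lmkd could
kill something else or wait a little while.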
Personally, I like this approach a bit more than recruiting the oom
reaper, because it doesn't affect any kind of emergency memory
reserve permission and because it frees us from having to think about
whether the oom reaper's thread priority is right for this particular
job.

It also occurred to me that try_reap_dying_process might make a
decent shrinker callback. Shrinkers are there, AIUI, to reclaim
memory that's easy to free and that's not essential for correct
kernel operation. Usually, it's some kind of cache that meets these
criteria. But the private pages of a dying process also meet the
criteria, don't they? I'm imagining the shrinker just picking an
arbitrary doomed (dying but not yet dead) process and freeing some of
its pages. I know there are concerns about slow shrinkers causing
havoc throughout the system, but since this shrinker would be bounded
above on CPU time and would never block, I feel like it'd be pretty
safe.

* insert standard missive about system calls being cheap, but we can
talk about the way in which we expose this functionality after we
agree that it's a good idea generally
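
As a rough illustration of the shrinker idea above, here is a
userspace model in C of a single shrinker pass. It is a sketch under
assumptions, not kernel code: struct doomed_proc, MODEL_PAGE_SIZE and
model_shrink_doomed are hypothetical names, the real shrinker
interface is not used, and scanning the doomed processes in array
order is just one way to read "picking an arbitrary doomed process".
The two properties it tries to show are the ones the safety argument
rests on: the pass skips any process whose lock is contended instead
of blocking, and it stops once a byte budget is spent.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SIZE 4096u   /* assumed page size for the model */

/* Hypothetical stand-in for a doomed (dying but not yet dead) process. */
struct doomed_proc {
    bool   mmap_sem_busy;    /* models a failed try-acquire of mmap_sem */
    size_t easy_bytes;       /* bytes held in "easy" (reapable) pages */
};

/*
 * Model of one shrinker pass over `nprocs` doomed processes. Frees at
 * most `budget` bytes in total and never waits on a contended lock.
 * Returns the number of bytes freed.
 */
size_t model_shrink_doomed(struct doomed_proc *procs, size_t nprocs,
                           size_t budget)
{
    size_t freed = 0;

    assert(procs != NULL || nprocs == 0);
    for (size_t i = 0;
         i < nprocs && budget - freed >= MODEL_PAGE_SIZE; i++) {
        struct doomed_proc *p = &procs[i];

        if (p->mmap_sem_busy)
            continue;            /* would block: skip this process */

        /* Free whole pages until the process or the budget runs out. */
        while (p->easy_bytes >= MODEL_PAGE_SIZE &&
               budget - freed >= MODEL_PAGE_SIZE) {
            p->easy_bytes -= MODEL_PAGE_SIZE;
            freed += MODEL_PAGE_SIZE;
        }
    }
    return freed;
}
```

The budget plays the same role here as MAXIMUM_BYTES_TO_REAP does in
the syscall form: it bounds the work done in one pass.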