From: Suren Baghdasaryan
Date: Wed, 30 Jun 2021 11:43:54 -0700
Subject: Re: [PATCH 1/1] mm: introduce process_reap system call
To: Shakeel Butt
Cc: Andrew Morton, Michal Hocko, David Rientjes, Matthew Wilcox,
    Johannes Weiner, Roman Gushchin, Rik van Riel, Minchan Kim,
    Christian Brauner, Christoph Hellwig, Oleg Nesterov,
    David Hildenbrand, Jann Horn, Tim Murray, Linux API, Linux MM,
    LKML, kernel-team
List-ID: linux-kernel@vger.kernel.org

On Wed, Jun 30, 2021 at 11:01 AM Shakeel Butt wrote:
>
> Hi Suren,
>
> On Wed, Jun 23, 2021 at 12:28 PM Suren Baghdasaryan wrote:
> >
> > In modern systems it's not unusual to have a system component
> > monitoring memory conditions of the system and tasked with keeping
> > system memory pressure under control. One way to accomplish that is
> > to kill non-essential processes to free up memory for more important
> > ones. Examples of this are Facebook's OOM killer daemon called oomd
> > and Android's low memory killer daemon called lmkd.
> > For such a system component it's important to be able to free memory
> > quickly and efficiently.
> > Unfortunately, the time a process takes to free up its memory after
> > receiving a SIGKILL might vary based on the state of the process
> > (e.g. uninterruptible sleep), its size, and the OPP level of the
> > core the process is running on. A mechanism to free the resources of
> > the target process in a more predictable way would improve the
> > system's ability to control its memory pressure.
> >
> > Introduce the process_reap system call, which reclaims the memory of
> > a dying process from the context of the caller. This way the memory
> > is freed in a more controllable way, with the CPU affinity and
> > priority of the caller. The workload of freeing the memory will also
> > be charged to the caller. The operation is allowed only on a dying
> > process.
> >
> > Previously I proposed a number of alternatives to accomplish this:
> > - https://lore.kernel.org/patchwork/patch/1060407 extending
> >   pidfd_send_signal to allow memory reaping using the oom_reaper
> >   thread;
> > - https://lore.kernel.org/patchwork/patch/1338196 extending
> >   pidfd_send_signal to reap memory of the target process
> >   synchronously from the context of the caller;
> > - https://lore.kernel.org/patchwork/patch/1344419/ adding
> >   MADV_DONTNEED support to process_madvise, implementing synchronous
> >   memory reaping.
> >
> > The last discussion culminated with a suggestion to introduce a
> > dedicated system call
> > (https://lore.kernel.org/patchwork/patch/1344418/#1553875).
> > The reasoning was that the new variant of process_madvise
> > a) does not work on an address range,
> > b) is destructive, and
> > c) doesn't share much code at all with the rest of process_madvise.
> > From the userspace point of view it was awkward and inconvenient to
> > provide a memory range for an operation that acts on the entire
> > address space, and using special flags or address values to specify
> > the entire address space was too hacky.
> >
> > The API is as follows,
> >
> >           int process_reap(int pidfd, unsigned int flags);
> >
> > DESCRIPTION
> >     The process_reap() system call is used to free the memory of a
> >     dying process.
> >
> >     The pidfd selects the process referred to by the PID file
> >     descriptor.
> >     (See pidofd_open(2) for further information)
>
> *pidfd_open

Ack

> >
> >     The flags argument is reserved for future use; currently, this
> >     argument must be specified as 0.
> >
> > RETURN VALUE
> >     On success, process_reap() returns 0.  On error, -1 is returned
> >     and errno is set to indicate the error.
> >
> > Signed-off-by: Suren Baghdasaryan
>
> Thanks for continuously pushing this. One question I have is how do
> you envision this syscall to be used for the cgroup based workloads.
> Traverse the target tree, read pids from cgroup.procs files,
> pidfd_open them, send SIGKILL and then process_reap them. Is that
> right?

Yes, at least that's how Android does that. It's a bit more involved
but that's a technical detail. The userspace low memory killer kills a
process (sends SIGKILL and calls process_reap) and another system
component detects that the process died and kills all processes
belonging to the same cgroup (that's how we identify related
processes).

>
> Orthogonal to this patch I wonder if we should have an optimized way
> to reap processes from a cgroup. Something similar to cgroup.kill (or
> maybe overload cgroup.kill with reaping as well).

Seems reasonable to me. We could use that in the above scenario.

>
> [...]
> >
> > +
> > +SYSCALL_DEFINE2(process_reap, int, pidfd, unsigned int, flags)
> > +{
> > +	struct pid *pid;
> > +	struct task_struct *task;
> > +	struct mm_struct *mm = NULL;
> > +	unsigned int f_flags;
> > +	long ret = 0;
> > +
> > +	if (flags != 0)
> > +		return -EINVAL;
> > +
> > +	pid = pidfd_get_pid(pidfd, &f_flags);
> > +	if (IS_ERR(pid))
> > +		return PTR_ERR(pid);
> > +
> > +	task = get_pid_task(pid, PIDTYPE_PID);
> > +	if (!task) {
> > +		ret = -ESRCH;
> > +		goto put_pid;
> > +	}
> > +
> > +	/*
> > +	 * If the task is dying and in the process of releasing its memory
> > +	 * then get its mm.
> > +	 */
> > +	task_lock(task);
> > +	if (task_will_free_mem(task) && (task->flags & PF_KTHREAD) == 0) {
>
> task_will_free_mem() is fine here but I think in parallel we should
> optimize this function. At the moment it is traversing all the
> processes on the machine. It is very normal to have tens of thousands
> of processes on big machines, so it would be really costly when
> reaping a bunch of processes.

Hmm. But I think we still need to make sure that the mm is not shared
with another non-dying process. IIUC that's the point of that
traversal. Am I mistaken?