Date: Wed, 24 Jun 2020 13:00:14 -0700 (PDT)
From: David Rientjes
To: Minchan Kim
cc: Andrew Morton, LKML, Christian Brauner, linux-mm, linux-api@vger.kernel.org, oleksandr@redhat.com, Suren Baghdasaryan, Tim Murray, Sandeep Patil, Sonny Rao, Brian Geffon, Michal Hocko, Johannes Weiner, Shakeel Butt, John Dias, Joel Fernandes, Jann Horn, alexander.h.duyck@linux.intel.com, sj38.park@gmail.com, Arjun Roy, Vlastimil Babka, Christian Brauner, Daniel Colascione, Jens Axboe, Kirill Tkhai, SeongJae Park, linux-man@vger.kernel.org
Subject: Re: [PATCH v8 3/4] mm/madvise: introduce process_madvise() syscall: an external memory hinting API
In-Reply-To: <20200622192900.22757-4-minchan@kernel.org>
References: <20200622192900.22757-1-minchan@kernel.org> <20200622192900.22757-4-minchan@kernel.org>

On Mon, 22 Jun 2020, Minchan Kim wrote:

> diff --git a/mm/madvise.c b/mm/madvise.c
> index 551ed816eefe..23abca3f93fa 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -17,6 +17,7 @@
>  #include
>  #include
>  #include
> +#include
>  #include
>  #include
>  #include
> @@ -995,6 +996,18 @@ madvise_behavior_valid(int behavior)
>  	}
>  }
>
> +static bool
> +process_madvise_behavior_valid(int behavior)
> +{
> +	switch (behavior) {
> +	case MADV_COLD:
> +	case MADV_PAGEOUT:
> +		return true;
> +	default:
> +		return false;
> +	}
> +}
> +
>  /*
>   * The madvise(2) system call.
>   *
> @@ -1042,6 +1055,11 @@ madvise_behavior_valid(int behavior)
>   *  MADV_DONTDUMP - the application wants to prevent pages in the given range
>   *		from being included in its core dump.
>   *  MADV_DODUMP - cancel MADV_DONTDUMP: no longer exclude from core dump.
> + *  MADV_COLD - the application is not expected to use this memory soon,
> + *		deactivate pages in this range so that they can be reclaimed
> + *		easily if memory pressure happens.
> + *  MADV_PAGEOUT - the application is not expected to use this memory soon,
> + *		page out the pages in this range immediately.
>   *
>   * return values:
>   *  zero    - success
> @@ -1176,3 +1194,106 @@ SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in, int, behavior)
>  {
>  	return do_madvise(current, current->mm, start, len_in, behavior);
>  }
> +
> +static int process_madvise_vec(struct task_struct *target_task,
> +		struct mm_struct *mm, struct iov_iter *iter, int behavior)
> +{
> +	struct iovec iovec;
> +	int ret = 0;
> +
> +	while (iov_iter_count(iter)) {
> +		iovec = iov_iter_iovec(iter);
> +		ret = do_madvise(target_task, mm, (unsigned long)iovec.iov_base,
> +					iovec.iov_len, behavior);
> +		if (ret < 0)
> +			break;
> +		iov_iter_advance(iter, iovec.iov_len);
> +	}
> +
> +	return ret;
> +}
> +
> +static ssize_t do_process_madvise(int pidfd, struct iov_iter *iter,
> +				int behavior, unsigned int flags)
> +{
> +	ssize_t ret;
> +	struct pid *pid;
> +	struct task_struct *task;
> +	struct mm_struct *mm;
> +	size_t total_len = iov_iter_count(iter);
> +
> +	if (flags != 0)
> +		return -EINVAL;
> +
> +	pid = pidfd_get_pid(pidfd);
> +	if (IS_ERR(pid))
> +		return PTR_ERR(pid);
> +
> +	task = get_pid_task(pid, PIDTYPE_PID);
> +	if (!task) {
> +		ret = -ESRCH;
> +		goto put_pid;
> +	}
> +
> +	if (task->mm != current->mm &&
> +			!process_madvise_behavior_valid(behavior)) {
> +		ret = -EINVAL;
> +		goto release_task;
> +	}
> +
> +	mm = mm_access(task, PTRACE_MODE_ATTACH_FSCREDS);
> +	if (IS_ERR_OR_NULL(mm)) {
> +		ret = IS_ERR(mm) ? PTR_ERR(mm) : -ESRCH;
> +		goto release_task;
> +	}
>

mm is always task->mm right?  I'm wondering if it would be better to find
the mm directly in process_madvise_vec() rather than passing it into the
function.  I'm not sure why we'd pass both task and mm here.
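
One way to read the suggestion above, sketched as a rough userspace model
rather than kernel code.  Everything here is a hypothetical stand-in:
struct task_model and struct mm_model for task_struct and mm_struct,
mm_get_model()/mm_put_model() for mm_access()/mmput(), advise_model() for
do_madvise(), and madvise_vec_model() for process_madvise_vec().  The
4096-byte page size is an assumption, and the ptrace-style permission
check done by mm_access() is left out.  The only point is the shape: the
helper takes the task and looks up the mm itself.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* Assumed page size for the model; the real value is per-architecture. */
#define MODEL_PAGE_SIZE 4096UL

/* Hypothetical stand-in for struct mm_struct. */
struct mm_model {
	int users;	/* reference count held by callers */
	size_t advised;	/* bytes that received a hint so far */
};

/* Hypothetical stand-in for struct task_struct. */
struct task_model {
	struct mm_model *mm;
};

/* Stand-in for mm_access(): take a reference, or NULL if there is no mm. */
static struct mm_model *mm_get_model(struct task_model *task)
{
	if (!task->mm)
		return NULL;
	task->mm->users++;
	return task->mm;
}

/* Stand-in for mmput(): drop the reference taken above. */
static void mm_put_model(struct mm_model *mm)
{
	mm->users--;
}

/* Stand-in for do_madvise(): reject unaligned starts, otherwise count. */
static int advise_model(struct mm_model *mm, void *start, size_t len)
{
	if ((uintptr_t)start & (MODEL_PAGE_SIZE - 1))
		return -EINVAL;
	mm->advised += len;
	return 0;
}

/*
 * The helper receives only the task and finds the mm on its own, so the
 * caller no longer passes both.  Returns bytes advised or a negative errno.
 */
long madvise_vec_model(struct task_model *task,
		       const struct iovec *vec, size_t vlen)
{
	struct mm_model *mm;
	long ret = 0;
	size_t i;

	assert(task != NULL);

	mm = mm_get_model(task);
	if (!mm)
		return -ESRCH;

	for (i = 0; i < vlen; i++) {
		int err = advise_model(mm, vec[i].iov_base, vec[i].iov_len);

		if (err < 0) {
			ret = err;	/* stop at the first failing range */
			break;
		}
		ret += (long)vec[i].iov_len;
	}

	mm_put_model(mm);
	return ret;
}
```

In this shape the reference is taken and dropped in one place, which is
the tidy-up the question is hinting at.  Whether that is actually better
in the kernel depends on where the permission check should live.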

> +
> +	ret = process_madvise_vec(task, mm, iter, behavior);
> +	if (ret >= 0)
> +		ret = total_len - iov_iter_count(iter);
> +
> +	mmput(mm);
> +release_task:
> +	put_task_struct(task);
> +put_pid:
> +	put_pid(pid);
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE5(process_madvise, int, pidfd, const struct iovec __user *, vec,
> +		unsigned long, vlen, int, behavior, unsigned int, flags)

I love the idea of adding the flags parameter here and I can think of an
immediate use case for MADV_HUGEPAGE, which is overloaded.

Today, MADV_HUGEPAGE controls enablement depending on system config and
controls defrag behavior based on system config.  It also cannot be opted
out of without setting MADV_NOHUGEPAGE :)

I was thinking of a flag that users could use to trigger an immediate
collapse in process context regardless of the system config.

So I'm a big advocate of this flags parameter and consider it an absolute
must for the API.

Acked-by: David Rientjes
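
For readers following along from userspace, a caller of the interface
quoted above might look roughly like the sketch below.  The argument order
is taken from the SYSCALL_DEFINE5() line in the patch.  The rest is
assumption: that the libc headers provide SYS_process_madvise and
SYS_pidfd_open (the wrappers fail with ENOSYS when they do not), and that
MADV_COLD is 20 when <sys/mman.h> does not define it.  The names
process_madvise_sketch(), pidfd_open_sketch() and hint_cold() are made up
for illustration.  flags is passed as 0 because the patch rejects anything
else with -EINVAL.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLD
#define MADV_COLD 20	/* assumed to match the uapi header value */
#endif

/* Thin wrapper; argument order follows the quoted SYSCALL_DEFINE5(). */
long process_madvise_sketch(int pidfd, const struct iovec *vec,
			    unsigned long vlen, int behavior,
			    unsigned int flags)
{
#ifdef SYS_process_madvise
	return syscall(SYS_process_madvise, pidfd, vec, vlen, behavior, flags);
#else
	(void)pidfd; (void)vec; (void)vlen; (void)behavior; (void)flags;
	errno = ENOSYS;	/* headers predate the syscall */
	return -1;
#endif
}

/* Thin wrapper around pidfd_open(2), with flags fixed to 0. */
int pidfd_open_sketch(pid_t pid)
{
#ifdef SYS_pidfd_open
	return (int)syscall(SYS_pidfd_open, pid, 0);
#else
	(void)pid;
	errno = ENOSYS;
	return -1;
#endif
}

/*
 * Hint that [addr, addr + len) in process `pid` is cold.  Per the quoted
 * do_process_madvise(), success returns the number of bytes advised.
 * Returns -1 with errno set on failure.
 */
long hint_cold(pid_t pid, void *addr, size_t len)
{
	struct iovec vec = { .iov_base = addr, .iov_len = len };
	long ret;
	int pidfd, saved;

	assert(addr != NULL && len > 0);

	pidfd = pidfd_open_sketch(pid);
	if (pidfd < 0)
		return -1;

	ret = process_madvise_sketch(pidfd, &vec, 1, MADV_COLD, 0);

	saved = errno;	/* close() must not clobber the error */
	close(pidfd);
	errno = saved;
	return ret;
}
```

Because unknown flags fail with -EINVAL today, a later flag such as the
collapse trigger described above could be probed for by callers without
breaking existing users.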