Date: Mon, 3 Apr 2023 10:21:48 +0200
From: David Hildenbrand
Organization: Red Hat
To: Ackerley Tng, kvm@vger.kernel.org, linux-api@vger.kernel.org,
 linux-arch@vger.kernel.org, linux-doc@vger.kernel.org,
 linux-fsdevel@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, qemu-devel@nongnu.org
Cc: aarcange@redhat.com, ak@linux.intel.com, akpm@linux-foundation.org,
 arnd@arndb.de, bfields@fieldses.org, bp@alien8.de,
 chao.p.peng@linux.intel.com, corbet@lwn.net, dave.hansen@intel.com,
 ddutile@redhat.com, dhildenb@redhat.com, hpa@zytor.com, hughd@google.com,
 jlayton@kernel.org, jmattson@google.com, joro@8bytes.org,
 jun.nakajima@intel.com, kirill.shutemov@linux.intel.com,
 linmiaohe@huawei.com, luto@kernel.org, mail@maciej.szmigiero.name,
 mhocko@suse.com, michael.roth@amd.com, mingo@redhat.com,
 naoya.horiguchi@nec.com, pbonzini@redhat.com, qperret@google.com,
 rppt@kernel.org, seanjc@google.com, shuah@kernel.org, steven.price@arm.com,
 tabba@google.com, tglx@linutronix.de, vannapurve@google.com, vbabka@suse.cz,
 vkuznets@redhat.com, wanpengli@tencent.com, wei.w.wang@intel.com,
 x86@kernel.org, yu.c.zhang@linux.intel.com
Subject: Re: [RFC PATCH v3 1/2] mm: restrictedmem: Allow userspace to specify
 mount for memfd_restricted
In-Reply-To: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>
References: <592ebd9e33a906ba026d56dc68f42d691706f865.1680306489.git.ackerleytng@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 01.04.23 01:50, Ackerley Tng wrote:
> By default, the backing shmem file for a restrictedmem fd is created
> on shmem's kernel space mount.
>
> With this patch, an optional tmpfs mount can be specified via an fd,
> which will be used as the mountpoint for backing the shmem file
> associated with a restrictedmem fd.
>
> This will help restrictedmem fds inherit the properties of the
> provided tmpfs mounts, for example, hugepage allocation hints, NUMA
> binding hints, etc.
>
> Permissions for the fd passed to memfd_restricted() is modeled after
> the openat() syscall, since both of these allow creation of a file
> upon a mount/directory.
>
> Permission to reference the mount the fd represents is checked upon fd
> creation by other syscalls (e.g. fsmount(), open(), or open_tree(),
> etc) and any process that can present memfd_restricted() with a valid
> fd is expected to have obtained permission to use the mount
> represented by the fd. This behavior is intended to parallel that of
> the openat() syscall.
>
> memfd_restricted() will check that the tmpfs superblock is
> writable, and that the mount is also writable, before attempting to
> create a restrictedmem file on the mount.
>
> Signed-off-by: Ackerley Tng
> ---
>  include/linux/syscalls.h           |  2 +-
>  include/uapi/linux/restrictedmem.h |  8 ++++
>  mm/restrictedmem.c                 | 74 +++++++++++++++++++++++++++---
>  3 files changed, 77 insertions(+), 7 deletions(-)
>  create mode 100644 include/uapi/linux/restrictedmem.h
>
> diff --git a/include/linux/syscalls.h b/include/linux/syscalls.h
> index f9e9e0c820c5..a23c4c385cd3 100644
> --- a/include/linux/syscalls.h
> +++ b/include/linux/syscalls.h
> @@ -1056,7 +1056,7 @@ asmlinkage long sys_memfd_secret(unsigned int flags);
>  asmlinkage long sys_set_mempolicy_home_node(unsigned long start, unsigned long len,
>  					    unsigned long home_node,
>  					    unsigned long flags);
> -asmlinkage long sys_memfd_restricted(unsigned int flags);
> +asmlinkage long sys_memfd_restricted(unsigned int flags, int mount_fd);
>
>  /*
>   * Architecture-specific system calls
> diff --git a/include/uapi/linux/restrictedmem.h b/include/uapi/linux/restrictedmem.h
> new file mode 100644
> index 000000000000..22d6f2285f6d
> --- /dev/null
> +++ b/include/uapi/linux/restrictedmem.h
> @@ -0,0 +1,8 @@
> +/* SPDX-License-Identifier: GPL-2.0 WITH Linux-syscall-note */
> +#ifndef _UAPI_LINUX_RESTRICTEDMEM_H
> +#define _UAPI_LINUX_RESTRICTEDMEM_H
> +
> +/* flags for memfd_restricted */
> +#define RMFD_USERMNT		0x0001U

I wonder if we can come up with a more expressive prefix than RMFD.
Sounds more like "rm fd" ;) Maybe it should better match the
"memfd_restricted" syscall name, like "MEMFD_RSTD_USERMNT".

> +
> +#endif /* _UAPI_LINUX_RESTRICTEDMEM_H */
> diff --git a/mm/restrictedmem.c b/mm/restrictedmem.c
> index c5d869d8c2d8..f7b62364a31a 100644
> --- a/mm/restrictedmem.c
> +++ b/mm/restrictedmem.c
> @@ -1,11 +1,12 @@
>  // SPDX-License-Identifier: GPL-2.0
> -#include "linux/sbitmap.h"

Looks like an unrelated change?

> +#include
>  #include
>  #include
>  #include
>  #include
>  #include
>  #include
> +#include
>  #include
>
>  struct restrictedmem {
> @@ -189,19 +190,20 @@ static struct file *restrictedmem_file_create(struct file *memfd)
>  	return file;
>  }
>
> -SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
> +static int restrictedmem_create(struct vfsmount *mount)
>  {
>  	struct file *file, *restricted_file;
>  	int fd, err;
>
> -	if (flags)
> -		return -EINVAL;
> -
>  	fd = get_unused_fd_flags(0);
>  	if (fd < 0)
>  		return fd;
>
> -	file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +	if (mount)
> +		file = shmem_file_setup_with_mnt(mount, "memfd:restrictedmem", 0, VM_NORESERVE);
> +	else
> +		file = shmem_file_setup("memfd:restrictedmem", 0, VM_NORESERVE);
> +
>  	if (IS_ERR(file)) {
>  		err = PTR_ERR(file);
>  		goto err_fd;
> @@ -223,6 +225,66 @@ SYSCALL_DEFINE1(memfd_restricted, unsigned int, flags)
>  	return err;
>  }
>
> +static bool is_shmem_mount(struct vfsmount *mnt)
> +{
> +	return mnt && mnt->mnt_sb && mnt->mnt_sb->s_magic == TMPFS_MAGIC;
> +}
> +
> +static bool is_mount_root(struct file *file)
> +{
> +	return file->f_path.dentry == file->f_path.mnt->mnt_root;
> +}

I'd inline at least that function, pretty self-explaining.

> +
> +static int restrictedmem_create_on_user_mount(int mount_fd)
> +{
> +	int ret;
> +	struct fd f;
> +	struct vfsmount *mnt;
> +
> +	f = fdget_raw(mount_fd);
> +	if (!f.file)
> +		return -EBADF;
> +
> +	ret = -EINVAL;
> +	if (!is_mount_root(f.file))
> +		goto out;
> +
> +	mnt = f.file->f_path.mnt;
> +	if (!is_shmem_mount(mnt))
> +		goto out;
> +
> +	ret = file_permission(f.file, MAY_WRITE | MAY_EXEC);
> +	if (ret)
> +		goto out;
> +
> +	ret = mnt_want_write(mnt);
> +	if (unlikely(ret))
> +		goto out;
> +
> +	ret = restrictedmem_create(mnt);
> +
> +	mnt_drop_write(mnt);
> +out:
> +	fdput(f);
> +
> +	return ret;
> +}
> +
> +SYSCALL_DEFINE2(memfd_restricted, unsigned int, flags, int, mount_fd)
> +{
> +	if (flags & ~RMFD_USERMNT)
> +		return -EINVAL;
> +
> +	if (flags == RMFD_USERMNT) {
> +		if (mount_fd < 0)
> +			return -EINVAL;
> +
> +		return restrictedmem_create_on_user_mount(mount_fd);
> +	} else {
> +		return restrictedmem_create(NULL);
> +	}

You can drop the else case:

if (flags == RMFD_USERMNT) {
	...
	return restrictedmem_create_on_user_mount(mount_fd);
}

return restrictedmem_create(NULL);


I do wonder if you want to properly check for a flag instead of
comparing values. Results in a more natural way to deal with flags:

if (flags & RMFD_USERMNT) {

}

> +}
> +
>  int restrictedmem_bind(struct file *file, pgoff_t start, pgoff_t end,
>  		       struct restrictedmem_notifier *notifier, bool exclusive)
>  {

The "memfd_restricted" vs. "restrictedmem" terminology is a bit
unfortunate, but not your fault here.


I'm not a FS person, but it does look good to me.

-- 
Thanks,

David / dhildenb
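
Editorial illustration of the two suggestions on SYSCALL_DEFINE2 above (drop the else branch, test the flag bit rather than compare the whole value). This is a userspace sketch only, not code from the patch or the review: RMFD_EXAMPLE, RMFD_VALID_FLAGS, memfd_restricted_sketch() and the two stub helpers are hypothetical names made up for the example, and the stubs merely report which path was taken. It assumes a second flag might be added later, which is the case where "flags == RMFD_USERMNT" would stop matching while "flags & RMFD_USERMNT" keeps working.

```c
#include <errno.h>

#define RMFD_USERMNT		0x0001U	/* value taken from the quoted patch */
#define RMFD_EXAMPLE		0x0002U	/* hypothetical future flag, NOT in the patch */
#define RMFD_VALID_FLAGS	(RMFD_USERMNT | RMFD_EXAMPLE)

/* Stub standing in for restrictedmem_create_on_user_mount(); returns 1. */
static int create_on_user_mount_stub(int mount_fd)
{
	(void)mount_fd;
	return 1;
}

/* Stub standing in for restrictedmem_create(NULL); returns 0. */
static int create_default_stub(void)
{
	return 0;
}

int memfd_restricted_sketch(unsigned int flags, int mount_fd)
{
	/* Reject any bit that is not a known flag. */
	if (flags & ~RMFD_VALID_FLAGS)
		return -EINVAL;

	/* Bit test: still taken when other valid flags are set as well. */
	if (flags & RMFD_USERMNT) {
		if (mount_fd < 0)
			return -EINVAL;

		return create_on_user_mount_stub(mount_fd);
	}

	/* No else needed: the branch above always returns. */
	return create_default_stub();
}
```

With only one flag defined, the equality test and the bit test behave the same once unknown bits have been rejected. They diverge only for a call such as memfd_restricted_sketch(RMFD_USERMNT | RMFD_EXAMPLE, fd), which the bit test still routes to the user-mount path.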