Received: by 2002:a05:6358:d09b:b0:dc:cd0c:909e with SMTP id jc27csp2810424rwb; Fri, 2 Dec 2022 15:44:34 -0800 (PST) X-Google-Smtp-Source: AA0mqf4dCfq8JcQM45swjaXm1AC2ju/kXIxkOMSk569/YbckbxAYig7XdbIA63uf1XJ5ZRoMEzkf X-Received: by 2002:a17:902:cecf:b0:189:c02f:c418 with SMTP id d15-20020a170902cecf00b00189c02fc418mr3648341plg.131.1670024674578; Fri, 02 Dec 2022 15:44:34 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1670024674; cv=none; d=google.com; s=arc-20160816; b=PWph844O8S4sDsxcsOfNUJV+tImDgiiHP0ZTOGrFeLN5r7npobBzCHcrBNpRu+85s7 8L+K6lIcSTL5GArNI6U5oLNYEltiY1Gyin8BNKNaZBffpzDldptkzpnbm3aCdgy3DB+J n4r3loXjSw7iBwGWM4O1WnS6dkSL3OtGt7mpPwVRVsXM3pFsyiabAbLRSFTYI3ryO6KI 6itRL9U7g5NOSX2OkLC5BTHXycI16mHUQBNXzsIDn5zkEH355LzPTqOTzq0j/+ftcRu1 6EzABuDeNC45RfzIM/QRgzOPxDIW45hGz1LqDY+Jid8IJmRWrsED4LXniDzLF05xcut/ clTw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:cc:to:subject:message-id:date:from:in-reply-to :references:mime-version:dkim-signature; bh=WfQbQ+1EE0fMDX/Q+9Du4Mqir+JlALTfkpbrkbmDKGs=; b=IxTbTzwv7JXHLDLZzbYr6oIWQH+2QoETzUNKCZnrcv39lmS7Yod8yD23PjrFETBcQa wChyzxFWLsTTv31Cs4GtT2hzBBR9ShYPP3PIbsG5rRTjkEAo/pSXPoGeAqZvzeNvtFVz 5J3Oc9/uPzW41yO3ZIMOVN1+syCliac14BgAjQ2tqLWnqSXiDlZZmlUcWdRQ9xPqCzhy 4BMyox1tFf1ZkV2XB0z1qHiymBMJ4Mq+a+mZwiUeXfMqmSFUzP1Hr9Wxr890Y38GnAFJ GmyLxkRfrhTCWJrTjLoroG8yJBcoBDEu+GzhSN8bVn25LVuHgdU/4mlmDaX6KGER4xez 1zTw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="U7BO6/sC"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id qe18-20020a17090b4f9200b002199a9891desi1678309pjb.141.2022.12.02.15.44.24; Fri, 02 Dec 2022 15:44:34 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@google.com header.s=20210112 header.b="U7BO6/sC"; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=REJECT sp=REJECT dis=NONE) header.from=google.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234620AbiLBXdk (ORCPT + 83 others); Fri, 2 Dec 2022 18:33:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:36400 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234275AbiLBXdi (ORCPT ); Fri, 2 Dec 2022 18:33:38 -0500 Received: from mail-pf1-x436.google.com (mail-pf1-x436.google.com [IPv6:2607:f8b0:4864:20::436]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 758BB101CE for ; Fri, 2 Dec 2022 15:33:37 -0800 (PST) Received: by mail-pf1-x436.google.com with SMTP id 124so6333628pfy.0 for ; Fri, 02 Dec 2022 15:33:37 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:from:to:cc:subject:date:message-id:reply-to; bh=WfQbQ+1EE0fMDX/Q+9Du4Mqir+JlALTfkpbrkbmDKGs=; b=U7BO6/sCr4BvAijXi57j1gjHRzL7ZauFB6uT3qpzHVJiOpDuOiQY1cPj6T6lOy9u7A wHm/yY1/7ftQRfno3ov7QdAHNFpWNy/xvNi9Q/b4M5eQIWOvRnQo0RAfb6sdkxlVRYjC 1eF9FDubS5CeXdxo7VR6ruRlZcmZ6xCJlJH6SsBZKBdM2ml+c6tPmcHdfCuMZUB7a0P/ a/VvDqOcmi0XcyQlkvExZpwlm+5NC9amLuCLxbiZ8guhUkkiJRr+RJ+L67NG17pJ24/M Ue8WNcULJoC1ikQ9EzSbTt6mOr66/Yq9WoqyvGTKKgW3bz4h0ript0zpYo34mfPJEDJD v7ZA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=cc:to:subject:message-id:date:from:in-reply-to:references :mime-version:x-gm-message-state:from:to:cc:subject:date:message-id :reply-to; bh=WfQbQ+1EE0fMDX/Q+9Du4Mqir+JlALTfkpbrkbmDKGs=; b=X7oGek0ZTbv7uqIpmm5DNFSdIedpmL1bB+TKkAyJsxK4dqLaAGWNFSMB1SsZBTw9t2 sTKG+5tUpgHuWstLD29P2t6WARna/MAH/pBEbeVRs2L0q4NSG439QRFnvSHkceeoKr9N jWu1SB44xvRlAoRBFrPlsZyZhRCDM/1pL6wwsPaEyWX+5EP1848AHkPDHgp8+KngE6tP QWUZ5KuBNEPbX8/XFp/Y7QFGGlW89G8WqAjCke93VULteKs+vPoIWAAbpfGRTZNWNhM6 95MSTIRENfipByhft9dNYc5CcLy4E1fnDvOu+XwKjWBSaMEmhhnxIRam2Cq0ty5Cmrch TsnQ== X-Gm-Message-State: ANoB5pm9dMjnJTzpWGABaEGY9n+U/BcGzoNhLutT8sjpITdq2NRy1wdx R52oX2XkfjrRr+hcWABSCJdkKvNOY3sRd9IVDdByfA== X-Received: by 2002:a63:2584:0:b0:478:5d6b:d1fd with SMTP id l126-20020a632584000000b004785d6bd1fdmr12962258pgl.249.1670024016718; Fri, 02 Dec 2022 15:33:36 -0800 (PST) MIME-Version: 1.0 References: <20221202013404.163143-1-jeffxu@google.com> <20221202013404.163143-3-jeffxu@google.com> <202212021446.9D544DDF@keescook> In-Reply-To: <202212021446.9D544DDF@keescook> From: Jeff Xu Date: Fri, 2 Dec 2022 15:32:59 -0800 Message-ID: Subject: Re: [PATCH v3] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC To: Kees Cook Cc: jeffxu@chromium.org, skhan@linuxfoundation.org, akpm@linux-foundation.org, dmitry.torokhov@gmail.com, dverkamp@chromium.org, hughd@google.com, jorgelo@chromium.org, linux-kernel@vger.kernel.org, linux-kselftest@vger.kernel.org, linux-mm@kvack.org, mnissler@chromium.org, jannh@google.com, linux-hardening@vger.kernel.org Content-Type: text/plain; charset="UTF-8" X-Spam-Status: No, score=-17.6 required=5.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,DKIM_VALID_EF, ENV_AND_HDR_SPF_MATCH,RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS, USER_IN_DEF_DKIM_WL,USER_IN_DEF_SPF_WL autolearn=ham autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Dec 2, 2022 at 2:56 PM Kees Cook wrote: > > On Fri, Dec 02, 2022 at 01:34:00AM +0000, jeffxu@chromium.org wrote: > > From: Jeff Xu > > > > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to > > set executable bit at creation time (memfd_create). > > > > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit > > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to > > be executable (mode: 0777) after creation. > > > > when MFD_EXEC flag is set, memfd is created with executable bit > > (mode:0777), this is the same as the old behavior of memfd_create. > > > > The new pid namespaced sysctl vm.memfd_noexec has 3 values: > > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_EXEC was set. > > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like > > MFD_NOEXEC_SEAL was set. > > 2:memfd_create() without MFD_NOEXEC_SEAL will be rejected. > > ^ nit: missing space > > > > > The sysctl allows finer control of memfd_create for old-software > > that doesn't set the executable bit, for example, a container with > > vm.memfd_noexec=1 means the old-software will create non-executable > > memfd by default. > > > > Co-developed-by: Daniel Verkamp > > Signed-off-by: Daniel Verkamp > > Signed-off-by: Jeff Xu > > --- > > include/linux/pid_namespace.h | 19 ++++++++++++++ > > include/uapi/linux/memfd.h | 4 +++ > > kernel/pid_namespace.c | 47 +++++++++++++++++++++++++++++++++++ > > mm/memfd.c | 44 ++++++++++++++++++++++++++++++-- > > 4 files changed, 112 insertions(+), 2 deletions(-) > > > > diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h > > index 07481bb87d4e..a4789a7b34a9 100644 > > --- a/include/linux/pid_namespace.h > > +++ b/include/linux/pid_namespace.h > > @@ -16,6 +16,21 @@ > > > > struct fs_pin; > > > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > +/* > > + * sysctl for vm.memfd_noexec > > + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > > + * acts like MFD_EXEC was set. > > + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL > > + * acts like MFD_NOEXEC_SEAL was set. > > + * 2: memfd_create() without MFD_NOEXEC_SEAL will be > > + * rejected. > > + */ > > +#define MEMFD_NOEXEC_SCOPE_EXEC 0 > > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1 > > +#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2 > > +#endif > > + > > struct pid_namespace { > > struct idr idr; > > struct rcu_head rcu; > > @@ -31,6 +46,10 @@ struct pid_namespace { > > struct ucounts *ucounts; > > int reboot; /* group exit code if this pidns was rebooted */ > > struct ns_common ns; > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > + /* sysctl for vm.memfd_noexec */ > > + int memfd_noexec_scope; > > +#endif > > } __randomize_layout; > > > > extern struct pid_namespace init_pid_ns; > > diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h > > index 7a8a26751c23..273a4e15dfcf 100644 > > --- a/include/uapi/linux/memfd.h > > +++ b/include/uapi/linux/memfd.h > > @@ -8,6 +8,10 @@ > > #define MFD_CLOEXEC 0x0001U > > #define MFD_ALLOW_SEALING 0x0002U > > #define MFD_HUGETLB 0x0004U > > +/* not executable and sealed to prevent changing to executable. */ > > +#define MFD_NOEXEC_SEAL 0x0008U > > +/* executable */ > > +#define MFD_EXEC 0x0010U > > > > /* > > * Huge page size encoding when MFD_HUGETLB is specified, and a huge page > > diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c > > index f4f8cb0435b4..71dd9b0a0f62 100644 > > --- a/kernel/pid_namespace.c > > +++ b/kernel/pid_namespace.c > > @@ -110,6 +110,10 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns > > ns->ucounts = ucounts; > > ns->pid_allocated = PIDNS_ADDING; > > > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > + ns->memfd_noexec_scope = MEMFD_NOEXEC_SCOPE_EXEC; > > +#endif > > I think this should be inherited from the parent pid namespace instead? > OK, I will add logic to make the child name space to inherit from the parent namespace at creation time, but if the parent changes this setting later, the child will not pick up the parent's change automatically, would that be OK ? Or do we want the child to keep in sync with the parent, to the highest value. For example, if init namespace val=1, child namespace=1, later when init namespace = 2, child namespace will automatically become 2. Jeff > > + > > return ns; > > > > out_free_idr: > > @@ -255,6 +259,45 @@ void zap_pid_ns_processes(struct pid_namespace *pid_ns) > > return; > > } > > > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > +int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table, int write, > > + void *buffer, size_t *lenp, loff_t *ppos) > > +{ > > + struct pid_namespace *ns = task_active_pid_ns(current); > > + struct ctl_table table_copy; > > + > > + if (write && !capable(CAP_SYS_ADMIN)) > > + return -EPERM; > > + > > + table_copy = *table; > > + if (ns != &init_pid_ns) > > + table_copy.data = &ns->memfd_noexec_scope; > > + > > + /* > > + * set minimum to current value, the effect is only bigger > > + * value is accepted. > > + */ > > + if (*(int *)table_copy.data > *(int *)table_copy.extra1) > > + table_copy.extra1 = table_copy.data; > > Yeah, I like this kind of enforcement. > > > + > > + return proc_dointvec_minmax(&table_copy, write, buffer, lenp, ppos); > > +} > > + > > +static struct ctl_table pid_ns_ctl_table_vm[] = { > > + { > > + .procname = "memfd_noexec", > > + .data = &init_pid_ns.memfd_noexec_scope, > > + .maxlen = sizeof(init_pid_ns.memfd_noexec_scope), > > + .mode = 0644, > > + .proc_handler = pid_mfd_noexec_dointvec_minmax, > > + .extra1 = SYSCTL_ZERO, > > + .extra2 = SYSCTL_TWO, > > + }, > > + { } > > +}; > > +static struct ctl_path vm_path[] = { { .procname = "vm", }, { } }; > > +#endif > > + > > #ifdef CONFIG_CHECKPOINT_RESTORE > > static int pid_ns_ctl_handler(struct ctl_table *table, int write, > > void *buffer, size_t *lenp, loff_t *ppos) > > @@ -455,6 +498,10 @@ static __init int pid_namespaces_init(void) > > #ifdef CONFIG_CHECKPOINT_RESTORE > > register_sysctl_paths(kern_path, pid_ns_ctl_table); > > #endif > > + > > +#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE) > > + register_sysctl_paths(vm_path, pid_ns_ctl_table_vm); > > +#endif > > return 0; > > } > > > > diff --git a/mm/memfd.c b/mm/memfd.c > > index 4ebeab94aa74..69e897dea6d5 100644 > > --- a/mm/memfd.c > > +++ b/mm/memfd.c > > @@ -18,6 +18,7 @@ > > #include > > #include > > #include > > +#include > > #include > > > > /* > > @@ -263,12 +264,13 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg) > > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1) > > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN) > > > > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB) > > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC) > > > > SYSCALL_DEFINE2(memfd_create, > > const char __user *, uname, > > unsigned int, flags) > > { > > + struct pid_namespace *ns; > > unsigned int *file_seals; > > struct file *file; > > int fd, error; > > @@ -285,6 +287,36 @@ SYSCALL_DEFINE2(memfd_create, > > return -EINVAL; > > } > > > > + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/ > > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL)) > > + return -EINVAL; > > + > > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) { > > +#ifdef CONFIG_SYSCTL > > + int sysctl = MEMFD_NOEXEC_SCOPE_EXEC; > > + > > + ns = task_active_pid_ns(current); > > + if (ns) > > + sysctl = ns->memfd_noexec_scope; > > + > > + if (sysctl == MEMFD_NOEXEC_SCOPE_EXEC) { > > + flags |= MFD_EXEC; > > + } else if (sysctl == MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL) { > > + flags |= MFD_NOEXEC_SEAL; > > + } else { > > + pr_warn_ratelimited( > > + "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d\n", > > + task_pid_nr(current)); > > + return -EINVAL; > > + } > > Not a huge deal, but the above could be a switch statement to improve > readability. > > > +#else > > + flags |= MFD_EXEC; > > +#endif > > + pr_warn_ratelimited( > > + "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d\n", > > + task_pid_nr(current)); > > Perhaps include process name as well -- makes admin lives easier. :) > > pr_warn_ratelimited( > "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL: %s[%d]\n", > current->comm, task_pid_nr(current)); > > > + } > > + > > /* length includes terminating zero */ > > len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1); > > if (len <= 0) > > @@ -328,7 +360,15 @@ SYSCALL_DEFINE2(memfd_create, > > file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE; > > file->f_flags |= O_LARGEFILE; > > > > - if (flags & MFD_ALLOW_SEALING) { > > + if (flags & MFD_NOEXEC_SEAL) { > > + struct inode *inode = file_inode(file); > > + > > + inode->i_mode &= ~0111; > > + file_seals = memfd_file_seals_ptr(file); > > + *file_seals &= ~F_SEAL_SEAL; > > + *file_seals |= F_SEAL_EXEC; > > + } else if (flags & MFD_ALLOW_SEALING) { > > + /* MFD_EXEC and MFD_ALLOW_SEALING are set */ > > file_seals = memfd_file_seals_ptr(file); > > *file_seals &= ~F_SEAL_SEAL; > > } > > -- > > 2.39.0.rc0.267.gcb52ba06e7-goog > > > > Otherwise, looks good to me. > > Reviewed-by: Kees Cook > > -- > Kees Cook