2022-12-15 00:26:32

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 0/5] mm/memfd: introduce MFD_NOEXEC_SEAL and MFD_EXEC

From: Jeff Xu <[email protected]>

Since Linux introduced the memfd feature, memfd have always had their
execute bit set, and the memfd_create() syscall doesn't allow setting
it differently.

However, in a secure by default system, such as ChromeOS, (where all
executables should come from the rootfs, which is protected by Verified
boot), this executable nature of memfd opens a door for NoExec bypass
and enables “confused deputy attack”. E.g, in VRP bug [1]: cros_vm
process created a memfd to share the content with an external process,
however the memfd is overwritten and used for executing arbitrary code
and root escalation. [2] lists more VRP in this kind.

On the other hand, executable memfd has its legit use, runc uses memfd’s
seal and executable feature to copy the contents of the binary then
execute them, for such system, we need a solution to differentiate runc's
use of executable memfds and an attacker's [3].

To address those above, this set of patches add following:
1> Let memfd_create() set X bit at creation time.
2> Let memfd to be sealed for modifying X bit.
3> A new pid namespace sysctl: vm.memfd_noexec to control the behavior of
X bit.For example, if a container has vm.memfd_noexec=2, then
memfd_create() without MFD_NOEXEC_SEAL will be rejected.
4> A new security hook in memfd_create(). This make it possible to a new
LSM, which rejects or allows executable memfd based on its security policy.

Change history:
v8:
- Update ref bug in cover letter.
- Add Reviewed-by field.
- Remove security hook (security_memfd_create) patch, which will have
its own patch set in future.

v7:
- patch 2/6: remove #ifdef and MAX_PATH (memfd_test.c).
- patch 3/6: check capability (CAP_SYS_ADMIN) from userns instead of
global ns (pid_sysctl.h). Add a tab (pid_namespace.h).
- patch 5/6: remove #ifdef (memfd_test.c)
- patch 6/6: remove unneeded security_move_mount(security.c).

v6:https://lore.kernel.org/lkml/[email protected]/
- Address comment and move "#ifdef CONFIG_" from .c file to pid_sysctl.h

v5:https://lore.kernel.org/lkml/[email protected]/
- Pass vm.memfd_noexec from current ns to child ns.
- Fix build issue detected by kernel test robot.
- Add missing security.c

v3:https://lore.kernel.org/lkml/[email protected]/
- Address API design comments in v2.
- Let memfd_create() to set X bit at creation time.
- A new pid namespace sysctl: vm.memfd_noexec to control behavior of X bit.
- A new security hook in memfd_create().

v2:https://lore.kernel.org/lkml/[email protected]/
- address comments in V1.
- add sysctl (vm.mfd_noexec) to set the default file permissions of
memfd_create to be non-executable.

v1:https://lwn.net/Articles/890096/

[1] https://crbug.com/1305267
[2] https://bugs.chromium.org/p/chromium/issues/list?q=type%3Dbug-security%20memfd%20escalation&can=1
[3] https://lwn.net/Articles/781013/

Daniel Verkamp (2):
mm/memfd: add F_SEAL_EXEC
selftests/memfd: add tests for F_SEAL_EXEC

Jeff Xu (3):
mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC
mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd
selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC

include/linux/pid_namespace.h | 19 ++
include/uapi/linux/fcntl.h | 1 +
include/uapi/linux/memfd.h | 4 +
kernel/pid_namespace.c | 5 +
kernel/pid_sysctl.h | 59 ++++
mm/memfd.c | 56 +++-
mm/shmem.c | 6 +
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 341 ++++++++++++++++++++-
9 files changed, 489 insertions(+), 3 deletions(-)
create mode 100644 kernel/pid_sysctl.h


base-commit: eb7081409f94a9a8608593d0fb63a1aa3d6f95d8
--
2.39.0.rc1.256.g54fd8350bd-goog


2022-12-15 00:28:13

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 5/5] selftests/memfd: add tests for MFD_NOEXEC_SEAL MFD_EXEC

From: Jeff Xu <[email protected]>

Tests to verify MFD_NOEXEC, MFD_EXEC and vm.memfd_noexec sysctl.

Signed-off-by: Jeff Xu <[email protected]>
Co-developed-by: Daniel Verkamp <[email protected]>
Signed-off-by: Daniel Verkamp <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
tools/testing/selftests/memfd/fuse_test.c | 1 +
tools/testing/selftests/memfd/memfd_test.c | 228 ++++++++++++++++++++-
2 files changed, 224 insertions(+), 5 deletions(-)

diff --git a/tools/testing/selftests/memfd/fuse_test.c b/tools/testing/selftests/memfd/fuse_test.c
index be675002f918..93798c8c5d54 100644
--- a/tools/testing/selftests/memfd/fuse_test.c
+++ b/tools/testing/selftests/memfd/fuse_test.c
@@ -22,6 +22,7 @@
#include <linux/falloc.h>
#include <fcntl.h>
#include <linux/memfd.h>
+#include <linux/types.h>
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index f18a15a1f275..ae71f15f790d 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -30,6 +30,14 @@

#define F_SEAL_EXEC 0x0020

+#define F_WX_SEALS (F_SEAL_SHRINK | \
+ F_SEAL_GROW | \
+ F_SEAL_WRITE | \
+ F_SEAL_FUTURE_WRITE | \
+ F_SEAL_EXEC)
+
+#define MFD_NOEXEC_SEAL 0x0008U
+
/*
* Default is not to test hugetlbfs
*/
@@ -80,6 +88,37 @@ static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
return fd;
}

+static void sysctl_assert_write(const char *val)
+{
+ int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC);
+
+ if (fd < 0) {
+ printf("open sysctl failed\n");
+ abort();
+ }
+
+ if (write(fd, val, strlen(val)) < 0) {
+ printf("write sysctl failed\n");
+ abort();
+ }
+}
+
+static void sysctl_fail_write(const char *val)
+{
+ int fd = open("/proc/sys/vm/memfd_noexec", O_WRONLY | O_CLOEXEC);
+
+ if (fd < 0) {
+ printf("open sysctl failed\n");
+ abort();
+ }
+
+ if (write(fd, val, strlen(val)) >= 0) {
+ printf("write sysctl %s succeeded, but failure expected\n",
+ val);
+ abort();
+ }
+}
+
static int mfd_assert_reopen_fd(int fd_in)
{
int fd;
@@ -758,6 +797,9 @@ static void test_create(void)
mfd_fail_new("", ~0);
mfd_fail_new("", 0x80000000U);

+ /* verify EXEC and NOEXEC_SEAL can't both be set */
+ mfd_fail_new("", MFD_EXEC | MFD_NOEXEC_SEAL);
+
/* verify MFD_CLOEXEC is allowed */
fd = mfd_assert_new("", 0, MFD_CLOEXEC);
close(fd);
@@ -969,20 +1011,21 @@ static void test_seal_resize(void)

/*
* Test SEAL_EXEC
- * Test that chmod() cannot change x bits after sealing
+ * Test fd is created with exec and allow sealing.
+ * chmod() cannot change x bits after sealing.
*/
-static void test_seal_exec(void)
+static void test_exec_seal(void)
{
int fd;

printf("%s SEAL-EXEC\n", memfd_str);

+ printf("%s Apply SEAL_EXEC\n", memfd_str);
fd = mfd_assert_new("kern_memfd_seal_exec",
mfd_def_size,
- MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC);

mfd_assert_mode(fd, 0777);
-
mfd_assert_chmod(fd, 0644);

mfd_assert_has_seals(fd, 0);
@@ -996,10 +1039,181 @@ static void test_seal_exec(void)
mfd_fail_chmod(fd, 0700);
mfd_fail_chmod(fd, 0100);
mfd_assert_chmod(fd, 0666);
+ mfd_assert_write(fd);
+ close(fd);
+
+ printf("%s Apply ALL_SEALS\n", memfd_str);
+ fd = mfd_assert_new("kern_memfd_seal_exec",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_EXEC);
+
+ mfd_assert_mode(fd, 0777);
+ mfd_assert_chmod(fd, 0700);
+
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_EXEC);
+ mfd_assert_has_seals(fd, F_WX_SEALS);

+ mfd_fail_chmod(fd, 0711);
+ mfd_fail_chmod(fd, 0600);
+ mfd_fail_write(fd);
+ close(fd);
+}
+
+/*
+ * Test EXEC_NO_SEAL
+ * Test fd is created with exec and not allow sealing.
+ */
+static void test_exec_no_seal(void)
+{
+ int fd;
+
+ printf("%s EXEC_NO_SEAL\n", memfd_str);
+
+ /* Create with EXEC but without ALLOW_SEALING */
+ fd = mfd_assert_new("kern_memfd_exec_no_sealing",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_EXEC);
+ mfd_assert_mode(fd, 0777);
+ mfd_assert_has_seals(fd, F_SEAL_SEAL);
+ mfd_assert_chmod(fd, 0666);
close(fd);
}

+/*
+ * Test memfd_create with MFD_NOEXEC flag
+ */
+static void test_noexec_seal(void)
+{
+ int fd;
+
+ printf("%s NOEXEC_SEAL\n", memfd_str);
+
+ /* Create with NOEXEC and ALLOW_SEALING */
+ fd = mfd_assert_new("kern_memfd_noexec",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_NOEXEC_SEAL);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ close(fd);
+
+ /* Create with NOEXEC but without ALLOW_SEALING */
+ fd = mfd_assert_new("kern_memfd_noexec",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_NOEXEC_SEAL);
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ close(fd);
+}
+
+static void test_sysctl_child(void)
+{
+ int fd;
+
+ printf("%s sysctl 0\n", memfd_str);
+ sysctl_assert_write("0");
+ fd = mfd_assert_new("kern_memfd_sysctl_0",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ mfd_assert_mode(fd, 0777);
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_chmod(fd, 0644);
+ close(fd);
+
+ printf("%s sysctl 1\n", memfd_str);
+ sysctl_assert_write("1");
+ fd = mfd_assert_new("kern_memfd_sysctl_1",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ sysctl_fail_write("0");
+ close(fd);
+
+ printf("%s sysctl 2\n", memfd_str);
+ sysctl_assert_write("2");
+ mfd_fail_new("kern_memfd_sysctl_2",
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+ sysctl_fail_write("0");
+ sysctl_fail_write("1");
+}
+
+static int newpid_thread_fn(void *arg)
+{
+ test_sysctl_child();
+ return 0;
+}
+
+static void test_sysctl_child2(void)
+{
+ int fd;
+
+ sysctl_fail_write("0");
+ fd = mfd_assert_new("kern_memfd_sysctl_1",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ mfd_assert_mode(fd, 0666);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+ mfd_fail_chmod(fd, 0777);
+ close(fd);
+}
+
+static int newpid_thread_fn2(void *arg)
+{
+ test_sysctl_child2();
+ return 0;
+}
+static pid_t spawn_newpid_thread(unsigned int flags, int (*fn)(void *))
+{
+ uint8_t *stack;
+ pid_t pid;
+
+ stack = malloc(STACK_SIZE);
+ if (!stack) {
+ printf("malloc(STACK_SIZE) failed: %m\n");
+ abort();
+ }
+
+ pid = clone(fn,
+ stack + STACK_SIZE,
+ SIGCHLD | flags,
+ NULL);
+ if (pid < 0) {
+ printf("clone() failed: %m\n");
+ abort();
+ }
+
+ return pid;
+}
+
+static void join_newpid_thread(pid_t pid)
+{
+ waitpid(pid, NULL, 0);
+}
+
+/*
+ * Test sysctl
+ * A very basic sealing test to see whether setting/retrieving seals works.
+ */
+static void test_sysctl(void)
+{
+ int pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn);
+
+ join_newpid_thread(pid);
+
+ printf("%s child ns\n", memfd_str);
+ sysctl_assert_write("1");
+
+ pid = spawn_newpid_thread(CLONE_NEWPID, newpid_thread_fn2);
+ join_newpid_thread(pid);
+}
+
/*
* Test sharing via dup()
* Test that seals are shared between dupped FDs and they're all equal.
@@ -1173,13 +1387,15 @@ int main(int argc, char **argv)

test_create();
test_basic();
+ test_exec_seal();
+ test_exec_no_seal();
+ test_noexec_seal();

test_seal_write();
test_seal_future_write();
test_seal_shrink();
test_seal_grow();
test_seal_resize();
- test_seal_exec();

test_share_dup("SHARE-DUP", "");
test_share_mmap("SHARE-MMAP", "");
@@ -1195,6 +1411,8 @@ int main(int argc, char **argv)
test_share_fork("SHARE-FORK", SHARED_FT_STR);
join_idle_thread(pid);

+ test_sysctl();
+
printf("memfd: DONE\n");

return 0;
--
2.39.0.rc1.256.g54fd8350bd-goog

2022-12-15 00:32:02

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

From: Jeff Xu <[email protected]>

The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
set executable bit at creation time (memfd_create).

When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
(mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
be executable (mode: 0777) after creation.

when MFD_EXEC flag is set, memfd is created with executable bit
(mode:0777), this is the same as the old behavior of memfd_create.

The new pid namespaced sysctl vm.memfd_noexec has 3 values:
0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_EXEC was set.
1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
MFD_NOEXEC_SEAL was set.
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

The sysctl allows finer control of memfd_create for old-software
that doesn't set the executable bit, for example, a container with
vm.memfd_noexec=1 means the old-software will create non-executable
memfd by default. Also, the value of memfd_noexec is passed to child
namespace at creation time. For example, if the init namespace has
vm.memfd_noexec=2, all its children namespaces will be created with 2.

Signed-off-by: Jeff Xu <[email protected]>
Co-developed-by: Daniel Verkamp <[email protected]>
Signed-off-by: Daniel Verkamp <[email protected]>
Reported-by: kernel test robot <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
include/linux/pid_namespace.h | 19 +++++++++++
include/uapi/linux/memfd.h | 4 +++
kernel/pid_namespace.c | 5 +++
kernel/pid_sysctl.h | 59 +++++++++++++++++++++++++++++++++++
mm/memfd.c | 48 ++++++++++++++++++++++++++--
5 files changed, 133 insertions(+), 2 deletions(-)
create mode 100644 kernel/pid_sysctl.h

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index 07481bb87d4e..c758809d5bcf 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -16,6 +16,21 @@

struct fs_pin;

+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+/*
+ * sysctl for vm.memfd_noexec
+ * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ * acts like MFD_EXEC was set.
+ * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ * acts like MFD_NOEXEC_SEAL was set.
+ * 2: memfd_create() without MFD_NOEXEC_SEAL will be
+ * rejected.
+ */
+#define MEMFD_NOEXEC_SCOPE_EXEC 0
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL 1
+#define MEMFD_NOEXEC_SCOPE_NOEXEC_ENFORCED 2
+#endif
+
struct pid_namespace {
struct idr idr;
struct rcu_head rcu;
@@ -31,6 +46,10 @@ struct pid_namespace {
struct ucounts *ucounts;
int reboot; /* group exit code if this pidns was rebooted */
struct ns_common ns;
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+ /* sysctl for vm.memfd_noexec */
+ int memfd_noexec_scope;
+#endif
} __randomize_layout;

extern struct pid_namespace init_pid_ns;
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 7a8a26751c23..273a4e15dfcf 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -8,6 +8,10 @@
#define MFD_CLOEXEC 0x0001U
#define MFD_ALLOW_SEALING 0x0002U
#define MFD_HUGETLB 0x0004U
+/* not executable and sealed to prevent changing to executable. */
+#define MFD_NOEXEC_SEAL 0x0008U
+/* executable */
+#define MFD_EXEC 0x0010U

/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/kernel/pid_namespace.c b/kernel/pid_namespace.c
index f4f8cb0435b4..8a98b1af9376 100644
--- a/kernel/pid_namespace.c
+++ b/kernel/pid_namespace.c
@@ -23,6 +23,7 @@
#include <linux/sched/task.h>
#include <linux/sched/signal.h>
#include <linux/idr.h>
+#include "pid_sysctl.h"

static DEFINE_MUTEX(pid_caches_mutex);
static struct kmem_cache *pid_ns_cachep;
@@ -110,6 +111,8 @@ static struct pid_namespace *create_pid_namespace(struct user_namespace *user_ns
ns->ucounts = ucounts;
ns->pid_allocated = PIDNS_ADDING;

+ initialize_memfd_noexec_scope(ns);
+
return ns;

out_free_idr:
@@ -455,6 +458,8 @@ static __init int pid_namespaces_init(void)
#ifdef CONFIG_CHECKPOINT_RESTORE
register_sysctl_paths(kern_path, pid_ns_ctl_table);
#endif
+
+ register_pid_ns_sysctl_table_vm();
return 0;
}

diff --git a/kernel/pid_sysctl.h b/kernel/pid_sysctl.h
new file mode 100644
index 000000000000..90a93161a122
--- /dev/null
+++ b/kernel/pid_sysctl.h
@@ -0,0 +1,59 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef LINUX_PID_SYSCTL_H
+#define LINUX_PID_SYSCTL_H
+
+#include <linux/pid_namespace.h>
+
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
+static inline void initialize_memfd_noexec_scope(struct pid_namespace *ns)
+{
+ ns->memfd_noexec_scope =
+ task_active_pid_ns(current)->memfd_noexec_scope;
+}
+
+static int pid_mfd_noexec_dointvec_minmax(struct ctl_table *table,
+ int write, void *buf, size_t *lenp, loff_t *ppos)
+{
+ struct pid_namespace *ns = task_active_pid_ns(current);
+ struct ctl_table table_copy;
+
+ if (write && !ns_capable(ns->user_ns, CAP_SYS_ADMIN))
+ return -EPERM;
+
+ table_copy = *table;
+ if (ns != &init_pid_ns)
+ table_copy.data = &ns->memfd_noexec_scope;
+
+ /*
+ * set minimum to current value, the effect is only bigger
+ * value is accepted.
+ */
+ if (*(int *)table_copy.data > *(int *)table_copy.extra1)
+ table_copy.extra1 = table_copy.data;
+
+ return proc_dointvec_minmax(&table_copy, write, buf, lenp, ppos);
+}
+
+static struct ctl_table pid_ns_ctl_table_vm[] = {
+ {
+ .procname = "memfd_noexec",
+ .data = &init_pid_ns.memfd_noexec_scope,
+ .maxlen = sizeof(init_pid_ns.memfd_noexec_scope),
+ .mode = 0644,
+ .proc_handler = pid_mfd_noexec_dointvec_minmax,
+ .extra1 = SYSCTL_ZERO,
+ .extra2 = SYSCTL_TWO,
+ },
+ { }
+};
+static struct ctl_path vm_path[] = { { .procname = "vm", }, { } };
+static inline void register_pid_ns_sysctl_table_vm(void)
+{
+ register_sysctl_paths(vm_path, pid_ns_ctl_table_vm);
+}
+#else
+static inline void set_memfd_noexec_scope(struct pid_namespace *ns) {}
+static inline void register_pid_ns_ctl_table_vm(void) {}
+#endif
+
+#endif /* LINUX_PID_SYSCTL_H */
diff --git a/mm/memfd.c b/mm/memfd.c
index 4ebeab94aa74..ec70675a7069 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -18,6 +18,7 @@
#include <linux/hugetlb.h>
#include <linux/shmem_fs.h>
#include <linux/memfd.h>
+#include <linux/pid_namespace.h>
#include <uapi/linux/memfd.h>

/*
@@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)

-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
+#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)

SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
unsigned int, flags)
{
+ char comm[TASK_COMM_LEN];
+ struct pid_namespace *ns;
unsigned int *file_seals;
struct file *file;
int fd, error;
@@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}

+ /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
+ if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
+ return -EINVAL;
+
+ if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
+#ifdef CONFIG_SYSCTL
+ int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
+
+ ns = task_active_pid_ns(current);
+ if (ns)
+ sysctl = ns->memfd_noexec_scope;
+
+ switch (sysctl) {
+ case MEMFD_NOEXEC_SCOPE_EXEC:
+ flags |= MFD_EXEC;
+ break;
+ case MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL:
+ flags |= MFD_NOEXEC_SEAL;
+ break;
+ default:
+ pr_warn_ratelimited(
+ "memfd_create(): MFD_NOEXEC_SEAL is enforced, pid=%d '%s'\n",
+ task_pid_nr(current), get_task_comm(comm, current));
+ return -EINVAL;
+ }
+#else
+ flags |= MFD_EXEC;
+#endif
+ pr_warn_ratelimited(
+ "memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=%d '%s'\n",
+ task_pid_nr(current), get_task_comm(comm, current));
+ }
+
/* length includes terminating zero */
len = strnlen_user(uname, MFD_NAME_MAX_LEN + 1);
if (len <= 0)
@@ -328,7 +364,15 @@ SYSCALL_DEFINE2(memfd_create,
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
file->f_flags |= O_LARGEFILE;

- if (flags & MFD_ALLOW_SEALING) {
+ if (flags & MFD_NOEXEC_SEAL) {
+ struct inode *inode = file_inode(file);
+
+ inode->i_mode &= ~0111;
+ file_seals = memfd_file_seals_ptr(file);
+ *file_seals &= ~F_SEAL_SEAL;
+ *file_seals |= F_SEAL_EXEC;
+ } else if (flags & MFD_ALLOW_SEALING) {
+ /* MFD_EXEC and MFD_ALLOW_SEALING are set */
file_seals = memfd_file_seals_ptr(file);
*file_seals &= ~F_SEAL_SEAL;
}
--
2.39.0.rc1.256.g54fd8350bd-goog

2022-12-15 00:41:43

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 1/5] mm/memfd: add F_SEAL_EXEC

From: Daniel Verkamp <[email protected]>

The new F_SEAL_EXEC flag will prevent modification of the exec bits:
written as traditional octal mask, 0111, or as named flags, S_IXUSR |
S_IXGRP | S_IXOTH. Any chmod(2) or similar call that attempts to modify
any of these bits after the seal is applied will fail with errno EPERM.

This will preserve the execute bits as they are at the time of sealing,
so the memfd will become either permanently executable or permanently
un-executable.

Signed-off-by: Daniel Verkamp <[email protected]>
Co-developed-by: Jeff Xu <[email protected]>
Signed-off-by: Jeff Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
include/uapi/linux/fcntl.h | 1 +
mm/memfd.c | 2 ++
mm/shmem.c | 6 ++++++
3 files changed, 9 insertions(+)

diff --git a/include/uapi/linux/fcntl.h b/include/uapi/linux/fcntl.h
index 2f86b2ad6d7e..e8c07da58c9f 100644
--- a/include/uapi/linux/fcntl.h
+++ b/include/uapi/linux/fcntl.h
@@ -43,6 +43,7 @@
#define F_SEAL_GROW 0x0004 /* prevent file from growing */
#define F_SEAL_WRITE 0x0008 /* prevent writes */
#define F_SEAL_FUTURE_WRITE 0x0010 /* prevent future writes while mapped */
+#define F_SEAL_EXEC 0x0020 /* prevent chmod modifying exec bits */
/* (1U << 31) is reserved for signed error codes */

/*
diff --git a/mm/memfd.c b/mm/memfd.c
index 08f5f8304746..4ebeab94aa74 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -147,6 +147,7 @@ static unsigned int *memfd_file_seals_ptr(struct file *file)
}

#define F_ALL_SEALS (F_SEAL_SEAL | \
+ F_SEAL_EXEC | \
F_SEAL_SHRINK | \
F_SEAL_GROW | \
F_SEAL_WRITE | \
@@ -175,6 +176,7 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
* SEAL_SHRINK: Prevent the file from shrinking
* SEAL_GROW: Prevent the file from growing
* SEAL_WRITE: Prevent write access to the file
+ * SEAL_EXEC: Prevent modification of the exec bits in the file mode
*
* As we don't require any trust relationship between two parties, we
* must prevent seals from being removed. Therefore, sealing a file
diff --git a/mm/shmem.c b/mm/shmem.c
index c1d8b8a1aa3b..e18a9cf9d937 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -1085,6 +1085,12 @@ static int shmem_setattr(struct user_namespace *mnt_userns,
if (error)
return error;

+ if ((info->seals & F_SEAL_EXEC) && (attr->ia_valid & ATTR_MODE)) {
+ if ((inode->i_mode ^ attr->ia_mode) & 0111) {
+ return -EPERM;
+ }
+ }
+
if (S_ISREG(inode->i_mode) && (attr->ia_valid & ATTR_SIZE)) {
loff_t oldsize = inode->i_size;
loff_t newsize = attr->ia_size;
--
2.39.0.rc1.256.g54fd8350bd-goog

2022-12-15 01:02:29

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 4/5] mm/memfd: Add write seals when apply SEAL_EXEC to executable memfd

From: Jeff Xu <[email protected]>

In order to avoid WX mappings, add F_SEAL_WRITE when apply
F_SEAL_EXEC to an executable memfd, so W^X from start.

This implys application need to fill the content of the memfd first,
after F_SEAL_EXEC is applied, application can no longer modify the
content of the memfd.

Typically, application seals the memfd right after writing to it.
For example:
1. memfd_create(MFD_EXEC).
2. write() code to the memfd.
3. fcntl(F_ADD_SEALS, F_SEAL_EXEC) to convert the memfd to W^X.
4. call exec() on the memfd.

Signed-off-by: Jeff Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
mm/memfd.c | 6 ++++++
1 file changed, 6 insertions(+)

diff --git a/mm/memfd.c b/mm/memfd.c
index ec70675a7069..92f0a5765f7c 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -222,6 +222,12 @@ static int memfd_add_seals(struct file *file, unsigned int seals)
}
}

+ /*
+ * SEAL_EXEC implys SEAL_WRITE, making W^X from the start.
+ */
+ if (seals & F_SEAL_EXEC && inode->i_mode & 0111)
+ seals |= F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE|F_SEAL_FUTURE_WRITE;
+
*file_seals |= seals;
error = 0;

--
2.39.0.rc1.256.g54fd8350bd-goog

2022-12-15 01:09:46

by Jeff Xu

[permalink] [raw]
Subject: [PATCH v8 2/5] selftests/memfd: add tests for F_SEAL_EXEC

From: Daniel Verkamp <[email protected]>

Basic tests to ensure that user/group/other execute bits cannot be
changed after applying F_SEAL_EXEC to a memfd.

Signed-off-by: Daniel Verkamp <[email protected]>
Co-developed-by: Jeff Xu <[email protected]>
Signed-off-by: Jeff Xu <[email protected]>
Reviewed-by: Kees Cook <[email protected]>
---
tools/testing/selftests/memfd/memfd_test.c | 123 ++++++++++++++++++++-
1 file changed, 122 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/memfd/memfd_test.c b/tools/testing/selftests/memfd/memfd_test.c
index 94df2692e6e4..f18a15a1f275 100644
--- a/tools/testing/selftests/memfd/memfd_test.c
+++ b/tools/testing/selftests/memfd/memfd_test.c
@@ -28,12 +28,38 @@
#define MFD_DEF_SIZE 8192
#define STACK_SIZE 65536

+#define F_SEAL_EXEC 0x0020
+
/*
* Default is not to test hugetlbfs
*/
static size_t mfd_def_size = MFD_DEF_SIZE;
static const char *memfd_str = MEMFD_STR;

+static ssize_t fd2name(int fd, char *buf, size_t bufsize)
+{
+ char buf1[PATH_MAX];
+ int size;
+ ssize_t nbytes;
+
+ size = snprintf(buf1, PATH_MAX, "/proc/self/fd/%d", fd);
+ if (size < 0) {
+ printf("snprintf(%d) failed on %m\n", fd);
+ abort();
+ }
+
+ /*
+ * reserver one byte for string termination.
+ */
+ nbytes = readlink(buf1, buf, bufsize-1);
+ if (nbytes == -1) {
+ printf("readlink(%s) failed %m\n", buf1);
+ abort();
+ }
+ buf[nbytes] = '\0';
+ return nbytes;
+}
+
static int mfd_assert_new(const char *name, loff_t sz, unsigned int flags)
{
int r, fd;
@@ -98,11 +124,14 @@ static unsigned int mfd_assert_get_seals(int fd)

static void mfd_assert_has_seals(int fd, unsigned int seals)
{
+ char buf[PATH_MAX];
+ int nbytes;
unsigned int s;
+ fd2name(fd, buf, PATH_MAX);

s = mfd_assert_get_seals(fd);
if (s != seals) {
- printf("%u != %u = GET_SEALS(%d)\n", seals, s, fd);
+ printf("%u != %u = GET_SEALS(%s)\n", seals, s, buf);
abort();
}
}
@@ -594,6 +623,64 @@ static void mfd_fail_grow_write(int fd)
}
}

+static void mfd_assert_mode(int fd, int mode)
+{
+ struct stat st;
+ char buf[PATH_MAX];
+ int nbytes;
+
+ fd2name(fd, buf, PATH_MAX);
+
+ if (fstat(fd, &st) < 0) {
+ printf("fstat(%s) failed: %m\n", buf);
+ abort();
+ }
+
+ if ((st.st_mode & 07777) != mode) {
+ printf("fstat(%s) wrong file mode 0%04o, but expected 0%04o\n",
+ buf, (int)st.st_mode & 07777, mode);
+ abort();
+ }
+}
+
+static void mfd_assert_chmod(int fd, int mode)
+{
+ char buf[PATH_MAX];
+ int nbytes;
+
+ fd2name(fd, buf, PATH_MAX);
+
+ if (fchmod(fd, mode) < 0) {
+ printf("fchmod(%s, 0%04o) failed: %m\n", buf, mode);
+ abort();
+ }
+
+ mfd_assert_mode(fd, mode);
+}
+
+static void mfd_fail_chmod(int fd, int mode)
+{
+ struct stat st;
+ char buf[PATH_MAX];
+ int nbytes;
+
+ fd2name(fd, buf, PATH_MAX);
+
+ if (fstat(fd, &st) < 0) {
+ printf("fstat(%s) failed: %m\n", buf);
+ abort();
+ }
+
+ if (fchmod(fd, mode) == 0) {
+ printf("fchmod(%s, 0%04o) didn't fail as expected\n",
+ buf, mode);
+ abort();
+ }
+
+ /* verify that file mode bits did not change */
+ mfd_assert_mode(fd, st.st_mode & 07777);
+}
+
static int idle_thread_fn(void *arg)
{
sigset_t set;
@@ -880,6 +967,39 @@ static void test_seal_resize(void)
close(fd);
}

+/*
+ * Test SEAL_EXEC
+ * Test that chmod() cannot change x bits after sealing
+ */
+static void test_seal_exec(void)
+{
+ int fd;
+
+ printf("%s SEAL-EXEC\n", memfd_str);
+
+ fd = mfd_assert_new("kern_memfd_seal_exec",
+ mfd_def_size,
+ MFD_CLOEXEC | MFD_ALLOW_SEALING);
+
+ mfd_assert_mode(fd, 0777);
+
+ mfd_assert_chmod(fd, 0644);
+
+ mfd_assert_has_seals(fd, 0);
+ mfd_assert_add_seals(fd, F_SEAL_EXEC);
+ mfd_assert_has_seals(fd, F_SEAL_EXEC);
+
+ mfd_assert_chmod(fd, 0600);
+ mfd_fail_chmod(fd, 0777);
+ mfd_fail_chmod(fd, 0670);
+ mfd_fail_chmod(fd, 0605);
+ mfd_fail_chmod(fd, 0700);
+ mfd_fail_chmod(fd, 0100);
+ mfd_assert_chmod(fd, 0666);
+
+ close(fd);
+}
+
/*
* Test sharing via dup()
* Test that seals are shared between dupped FDs and they're all equal.
@@ -1059,6 +1179,7 @@ int main(int argc, char **argv)
test_seal_shrink();
test_seal_grow();
test_seal_resize();
+ test_seal_exec();

test_share_dup("SHARE-DUP", "");
test_share_mmap("SHARE-MMAP", "");
--
2.39.0.rc1.256.g54fd8350bd-goog

2023-06-28 12:12:24

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

[email protected] wrote on Thu, Dec 15, 2022 at 12:12:03AM +0000:
> From: Jeff Xu <[email protected]>
>
> The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
> set executable bit at creation time (memfd_create).
>
> When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
> (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
> be executable (mode: 0777) after creation.
>
> when MFD_EXEC flag is set, memfd is created with executable bit
> (mode:0777), this is the same as the old behavior of memfd_create.
>
> The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> MFD_EXEC was set.
> 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> MFD_NOEXEC_SEAL was set.
> 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.

So, erm, I'm a bit late to the party but I was just looking at a way of
blocking memfd_create+exec in a container and this sounded perfect: my
reading is that this is a security feature meant to be set for
container's namespaces that'd totally disable something like
memfd_create followed by fexecve (because we don't want weird binaries
coming from who knows where to be executed on a shiny secure system),
but. . . is this actually supposed to work?
(see below)

> [...]
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> unsigned int, flags)
> {
> + char comm[TASK_COMM_LEN];
> + struct pid_namespace *ns;
> unsigned int *file_seals;
> struct file *file;
> int fd, error;
> @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> + return -EINVAL;
> +
> + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> + [code that checks the sysctl]

If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check
the sysctl at all.

This can be verified easily:
-----
$ cat > memfd_exec.c <<'EOF'
#define _GNU_SOURCE

#include <errno.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>

#ifndef MFD_EXEC
#define MFD_EXEC 0x0010U
#endif

int main() {
int fd = memfd_create("script", MFD_EXEC);
if (fd == -1)l
perror("memfd");

char prog[] = "#!/bin/sh\necho Ran script\n";
if (write(fd, prog, sizeof(prog)-1) != sizeof(prog)-1)
perror("write");

char *const argv[] = { "script", NULL };
char *const envp[] = { NULL };
fexecve(fd, argv, envp);
perror("fexecve");
}
EOF
$ gcc -o memfd_exec memfd_exec.c
$ ./memfd_exec
Ran script
$ sysctl vm.memfd_noexec
vm.memfd_noexec = 2
-----
(as opposed to failing hard on memfd_create if flag unset on sysctl=2,
and failing on fexecve with flag unset and sysctl=1)

What am I missing?



BTW I find the current behaviour rather hard to use: setting this to 2
should still set NOEXEC by default in my opinion, just refuse anything
that explicitly requested EXEC.

Sure there's a warn_once that memfd_create was used without seal, but
right now on my system it's "used up" 5 seconds after boot by systemd:
[ 5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'

And anyway, older kernels will barf up EINVAL when calling memfd_create
with MFD_NOEXEC_SEAL, so even if userspace will want to adapt they'll
need to try calling memfd_create with the flag once and retry on EINVAL,
which let's face it is going to take a while to happen.
(Also, the flag has been added to glibc, but not in any release yet)

Making calls default to noexec AND refuse exec does what you want
(forbid use of exec in an app that wasn't in a namespace that allows
exec) while allowing apps that require it to work; that sounds better
than making all applications that haven't taken the pain of adding the
new flag to me.
Well, I guess an app that did require exec without setting the flag will
fail in a weird place instead of failing at memfd_create and having a
chance to fallback, so it's not like it doesn't make any sense;
I don't have such strong feelings about this if the sysctl works, but
for my use case I'm more likely to want to take a chance at memfd_create
not needing exec than having the flag set. Perhaps a third value if I
cared enough...

--
Dominique Martinet | Asmadeus

2023-06-28 19:41:38

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Dominique Martinet wrote on Wed, Jun 28, 2023 at 08:42:41PM +0900:
> If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check
> the sysctl at all.
> [...repro snipped..]
>
> What am I missing?

(Perhaps the intent is just to force people to use the flag so it is
easier to check for memfd_create in seccomp or other LSM?
But I don't see why such a check couldn't consider the absence of a flag
as well, so I don't see the point.)


> BTW I find the current behaviour rather hard to use: setting this to 2
> should still set NOEXEC by default in my opinion, just refuse anything
> that explicitly requested EXEC.

And I just noticed it's not possible to lower the value despite having
CAP_SYS_ADMIN: what the heck?! I have never seen such a sysctl and it
just forced me to reboot because I willy-nilly tested in the init pid
namespace, and quite a few applications that don't require exec broke
exactly as I described below.

If the user has CAP_SYS_ADMIN there are more container escape methods
than I can count, this is basically free pass to root on main namespace
anyway, you're not protecting anything. Please let people set the sysctl
to what they want.

> Sure there's a warn_once that memfd_create was used without seal, but
> right now on my system it's "used up" 5 seconds after boot by systemd:
> [ 5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
>
> And anyway, older kernels will barf up EINVAL when calling memfd_create
> with MFD_NOEXEC_SEAL, so even if userspace will want to adapt they'll
> need to try calling memfd_create with the flag once and retry on EINVAL,
> which let's face it is going to take a while to happen.
> (Also, the flag has been added to glibc, but not in any release yet)
>
> Making calls default to noexec AND refuse exec does what you want
> (forbid use of exec in an app that wasn't in a namespace that allows
> exec) while allowing apps that require it to work; that sounds better
> than making all applications that haven't taken the pain of adding the
> new flag to me.
> Well, I guess an app that did require exec without setting the flag will
> fail in a weird place instead of failing at memfd_create and having a
> chance to fallback, so it's not like it doesn't make any sense;
> I don't have such strong feelings about this if the sysctl works, but
> for my use case I'm more likely to want to take a chance at memfd_create
> not needing exec than having the flag set. Perhaps a third value if I
> cared enough...


--
Dominique Martinet | Asmadeus

2023-06-29 04:59:16

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Hello!

On Wed, Jun 28, 2023 at 12:31 PM Dominique Martinet
<[email protected]> wrote:
>
> Dominique Martinet wrote on Wed, Jun 28, 2023 at 08:42:41PM +0900:
> > If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check
> > the sysctl at all.
> > [...repro snipped..]
> >
> > What am I missing?
>
> (Perhaps the intent is just to force people to use the flag so it is
> easier to check for memfd_create in seccomp or other LSM?
> But I don't see why such a check couldn't consider the absence of a flag
> as well, so I don't see the point.)
>
Yes. There is consideration to motivate app devs to migrate their code
to use the new EXEC/NOEXEC_SEAL flag for memfd_create, if that answers
your question.

>
> > BTW I find the current behaviour rather hard to use: setting this to 2
> > should still set NOEXEC by default in my opinion, just refuse anything
> > that explicitly requested EXEC.
>
> And I just noticed it's not possible to lower the value despite having
> CAP_SYS_ADMIN: what the heck?! I have never seen such a sysctl and it
> just forced me to reboot because I willy-nilly tested in the init pid
> namespace, and quite a few applications that don't require exec broke
> exactly as I described below.
>
> If the user has CAP_SYS_ADMIN there are more container escape methods
> than I can count, this is basically free pass to root on main namespace
> anyway, you're not protecting anything. Please let people set the sysctl
> to what they want.
>
Yama has a similar setting, for example, 3 (YAMA_SCOPE_NO_ATTACH)
will not allow downgrading at runtime.

Since this is a security feature, not allowing downgrading at run time
is part of the security consideration. I hope you understand.

> --
> Dominique Martinet | Asmadeus

Thanks!
-Jeff

2023-06-29 05:05:07

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Hello.

Thank you for your email and interested in using memfd_noexec !

On Wed, Jun 28, 2023 at 4:43 AM Dominique Martinet
<[email protected]> wrote:
>
> [email protected] wrote on Thu, Dec 15, 2022 at 12:12:03AM +0000:
> > From: Jeff Xu <[email protected]>
> >
> > The new MFD_NOEXEC_SEAL and MFD_EXEC flags allows application to
> > set executable bit at creation time (memfd_create).
> >
> > When MFD_NOEXEC_SEAL is set, memfd is created without executable bit
> > (mode:0666), and sealed with F_SEAL_EXEC, so it can't be chmod to
> > be executable (mode: 0777) after creation.
> >
> > when MFD_EXEC flag is set, memfd is created with executable bit
> > (mode:0777), this is the same as the old behavior of memfd_create.
> >
> > The new pid namespaced sysctl vm.memfd_noexec has 3 values:
> > 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> > MFD_EXEC was set.
> > 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL acts like
> > MFD_NOEXEC_SEAL was set.
> > 2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
>
> So, erm, I'm a bit late to the party but I was just looking at a way of
> blocking memfd_create+exec in a container and this sounded perfect: my
> reading is that this is a security feature meant to be set for
> container's namespaces that'd totally disable something like
> memfd_create followed by fexecve (because we don't want weird binaries
> coming from who knows where to be executed on a shiny secure system),
> but. . . is this actually supposed to work?
> (see below)
>
> > [...]
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -263,12 +264,14 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned long arg)
> > #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> > #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
> >
> > -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB)
> > +#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> >
> > SYSCALL_DEFINE2(memfd_create,
> > const char __user *, uname,
> > unsigned int, flags)
> > {
> > + char comm[TASK_COMM_LEN];
> > + struct pid_namespace *ns;
> > unsigned int *file_seals;
> > struct file *file;
> > int fd, error;
> > @@ -285,6 +288,39 @@ SYSCALL_DEFINE2(memfd_create,
> > return -EINVAL;
> > }
> >
> > + /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> > + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> > + return -EINVAL;
> > +
> > + if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> > + [code that checks the sysctl]
>
> If flags already has either MFD_EXEC or MFD_NOEXEC_SEAL, you don't check
> the sysctl at all.
>
> This can be verified easily:
> -----
> $ cat > memfd_exec.c <<'EOF'
> #define _GNU_SOURCE
>
> #include <errno.h>
> #include <stdio.h>
> #include <sys/mman.h>
> #include <sys/types.h>
> #include <sys/wait.h>
>
> #ifndef MFD_EXEC
> #define MFD_EXEC 0x0010U
> #endif
>
> int main() {
> int fd = memfd_create("script", MFD_EXEC);
> if (fd == -1)l
> perror("memfd");
>
> char prog[] = "#!/bin/sh\necho Ran script\n";
> if (write(fd, prog, sizeof(prog)-1) != sizeof(prog)-1)
> perror("write");
>
> char *const argv[] = { "script", NULL };
> char *const envp[] = { NULL };
> fexecve(fd, argv, envp);
> perror("fexecve");
> }
> EOF
> $ gcc -o memfd_exec memfd_exec.c
> $ ./memfd_exec
> Ran script
> $ sysctl vm.memfd_noexec
> vm.memfd_noexec = 2
> -----
> (as opposed to failing hard on memfd_create if flag unset on sysctl=2,
> and failing on fexecve with flag unset and sysctl=1)
>
> What am I missing?
>
>
At one point, I was thinking of having a security hook to block
executable memfd [1], so this sysctl only works for the application
that doesn't set EXEC bit. Now I think it makes sense to use
vm.memfd_noexec = 2 to block the MFD_EXEC also.
Anyway the commit msg says:
2: memfd_create() without MFD_NOEXEC_SEAL will be rejected.
Not doing that is a bug. I will send a fix for that.

[1] https://lore.kernel.org/lkml/[email protected]/

>
> BTW I find the current behaviour rather hard to use: setting this to 2
> should still set NOEXEC by default in my opinion, just refuse anything
> that explicitly requested EXEC.
>
At one point [2] (v2 of patch) there were two sysctls, one is doing
overwrite, one is enforcing, later I decided with one sysctl, the
rationale is the kernel will eventually get out of the business of
overwriting user space code. Yes. It might take a long time to
migrate all of the userspace.

In the meantime, to meet what you want, the solution is keep
vm.memfd_noexec = 1 (for overwrite), and a new security policy
(SELInux or Landlock) that uses security hook security_memfd_create,
this can block one process from creating executable memfd. Indeed,
security policy is better fit to cases like this than sysctl.

[2] https://lore.kernel.org/linux-mm/CABi2SkWGo9Jrd=i1e2PoDWYGenGhR=pG=yGsQP5VLmizTmg-iA@mail.gmail.com/

> Sure there's a warn_once that memfd_create was used without seal, but
> right now on my system it's "used up" 5 seconds after boot by systemd:
> [ 5.854378] memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL, pid=1 'systemd'
>
> And anyway, older kernels will barf up EINVAL when calling memfd_create
> with MFD_NOEXEC_SEAL, so even if userspace will want to adapt they'll
> need to try calling memfd_create with the flag once and retry on EINVAL,
> which let's face it is going to take a while to happen.
> (Also, the flag has been added to glibc, but not in any release yet)
>
Yes. Application will need to do some detection of the kernel. This is
not avoidable.

> Making calls default to noexec AND refuse exec does what you want
> (forbid use of exec in an app that wasn't in a namespace that allows
> exec) while allowing apps that require it to work; that sounds better
> than making all applications that haven't taken the pain of adding the
> new flag to me.
> Well, I guess an app that did require exec without setting the flag will
> fail in a weird place instead of failing at memfd_create and having a
> chance to fallback, so it's not like it doesn't make any sense;
> I don't have such strong feelings about this if the sysctl works, but
> for my use case I'm more likely to want to take a chance at memfd_create
> not needing exec than having the flag set. Perhaps a third value if I
> cared enough...
>
> --
> Dominique Martinet | Asmadeus

Thanks
-Jeff

2023-06-29 11:11:48

by Dominique Martinet

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Jeff Xu wrote on Wed, Jun 28, 2023 at 09:33:27PM -0700:
> > > BTW I find the current behaviour rather hard to use: setting this to 2
> > > should still set NOEXEC by default in my opinion, just refuse anything
> > > that explicitly requested EXEC.
> >
> > And I just noticed it's not possible to lower the value despite having
> > CAP_SYS_ADMIN: what the heck?! I have never seen such a sysctl and it
> > just forced me to reboot because I willy-nilly tested in the init pid
> > namespace, and quite a few applications that don't require exec broke
> > exactly as I described below.
> >
> > If the user has CAP_SYS_ADMIN there are more container escape methods
> > than I can count, this is basically free pass to root on main namespace
> > anyway, you're not protecting anything. Please let people set the sysctl
> > to what they want.
>
> Yama has a similar setting, for example, 3 (YAMA_SCOPE_NO_ATTACH)
> will not allow downgrading at runtime.
>
> Since this is a security feature, not allowing downgrading at run time
> is part of the security consideration. I hope you understand.

I didn't remember yama had this stuck bit; that still strikes me as
unusual, and if you require a custom LSM rule for memfd anyway I don't
see why it couldn't enforce that the sysctl is unchanged, but sure.

Please, though:
- I have a hard time thinking of 1 as a security flag in general (even
if I do agree a sloppy LSM rule could require it); I would only lock 2
- please make it clear, I don't see any entry in the sysctl
documentation[1] about memfd_noexec, there should be one and you can
copy the wording from yama's doc[2]: "Once set, this sysctl value cannot
be changed"
[1] Documentation/admin-guide/sysctl/vm.rst
[2] Documentation/admin-guide/LSM/Yama.rst


Either way as it stands I still don't think one can expect most
userspace applications to be converted until some libc wrapper takes
care of the retry logic and a couple of years, so I'll go look for
another way of filtering this (and eventually setting this to 1) as you
suggested.
I'll leave the follow-up up to you and won't bother you more.

Thanks,
--
Dominique Martinet | Asmadeus

2023-06-29 21:46:51

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH v8 3/5] mm/memfd: add MFD_NOEXEC_SEAL and MFD_EXEC

Hello

On Thu, Jun 29, 2023 at 3:30 AM Dominique Martinet
<[email protected]> wrote:
>
> Jeff Xu wrote on Wed, Jun 28, 2023 at 09:33:27PM -0700:
> > > > BTW I find the current behaviour rather hard to use: setting this to 2
> > > > should still set NOEXEC by default in my opinion, just refuse anything
> > > > that explicitly requested EXEC.
> > >
> > > And I just noticed it's not possible to lower the value despite having
> > > CAP_SYS_ADMIN: what the heck?! I have never seen such a sysctl and it
> > > just forced me to reboot because I willy-nilly tested in the init pid
> > > namespace, and quite a few applications that don't require exec broke
> > > exactly as I described below.
> > >
> > > If the user has CAP_SYS_ADMIN there are more container escape methods
> > > than I can count, this is basically free pass to root on main namespace
> > > anyway, you're not protecting anything. Please let people set the sysctl
> > > to what they want.
> >
> > Yama has a similar setting, for example, 3 (YAMA_SCOPE_NO_ATTACH)
> > will not allow downgrading at runtime.
> >
> > Since this is a security feature, not allowing downgrading at run time
> > is part of the security consideration. I hope you understand.
>
> I didn't remember yama had this stuck bit; that still strikes me as
> unusual, and if you require a custom LSM rule for memfd anyway I don't
> see why it couldn't enforce that the sysctl is unchanged, but sure.
>
> Please, though:
> - I have a hard time thinking of 1 as a security flag in general (even
> if I do agree a sloppy LSM rule could require it); I would only lock 2
> - please make it clear, I don't see any entry in the sysctl
> documentation[1] about memfd_noexec, there should be one and you can
> copy the wording from yama's doc[2]: "Once set, this sysctl value cannot
> be changed"
> [1] Documentation/admin-guide/sysctl/vm.rst
> [2] Documentation/admin-guide/LSM/Yama.rst
>
Thanks for the suggestion.
Yes, it would be good to have some documentation.
I will send patch to update Documentation/admin-guide/sysctl/vm.rst

>
> Either way as it stands I still don't think one can expect most
> userspace applications to be converted until some libc wrapper takes
> care of the retry logic and a couple of years, so I'll go look for
> another way of filtering this (and eventually setting this to 1) as you
> suggested.
> I'll leave the follow-up up to you and won't bother you more.
>
Not bothered at all! and thanks for reporting the bug to improve the
quality of memfd_noexec !

Much appreciated.
-Jeff


> Thanks,
> --
> Dominique Martinet | Asmadeus