2023-07-14 12:03:57

by David Rheinsberg

[permalink] [raw]
Subject: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

Add a new flag for memfd_create() called MFD_NOEXEC, which has the
opposite effect as MFD_EXEC (i.e., it strips the executable bits from
the inode file mode).

The default mode for memfd_create() has historically been to use 0777 as
file modes. The new `MFD_EXEC` flag has now made this explicit, paving
the way to reduce the default to 0666 and thus dropping the executable
bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has
been added which allows this without changing the default right now.

Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and
`F_SEAL_EXEC`, with no alternatives available. This leads to multiple
issues:

* Applications that do not want to allow sealing either have to use
`MFD_EXEC` (which we don't want, unless they actually need the
executable bits), or they must add `F_SEAL_SEAL` immediately on
return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this
implicitly enables sealing.

Note that we explicitly added `MFD_ALLOW_SEALING` when creating
memfd_create(2), because enabling seals on existing users of shmem
without them knowing about it can easily introduce DoS scenarios. It
is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and
this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL`
set via sysctl, since it will silently enable sealing on every memfd
created.

* Applications that do not want `MFD_EXEC`, but rely on
`F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving
this other than using `MFD_EXEC` and clearing the executable bits
manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set
`F_SEAL_EXEC` and thus break existing code that hard-codes the
seal-set.

This is already an issue when sending log-memfds to systemd-journald
today, which verifies the seal-set of the received FD and fails if
unknown seals are set. Hence, you have to use `MFD_EXEC` when
creating memfds for this purpose, even though you really do not need
the executable bits set.

* Applications that want to enable the executable bit later on,
currently have no way to create the memfd without it. They have to
clear the bits immediately after creating it via fchmod(2), or just
leave them set.

By adding MFD_NOEXEC, user-space is now able to clear the executable
bits on all memfds immediately, clearly stating their intention, without
requiring fixups after creating the memfd. In particular, it is highly
useful for existing use-cases that cannot allow new seals to appear on
memfds.

Signed-off-by: David Rheinsberg <[email protected]>
---
include/linux/pid_namespace.h | 4 ++--
include/uapi/linux/memfd.h | 1 +
mm/memfd.c | 29 ++++++++++++++---------------
3 files changed, 17 insertions(+), 17 deletions(-)

diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
index c758809d5bcf..02f8acf94512 100644
--- a/include/linux/pid_namespace.h
+++ b/include/linux/pid_namespace.h
@@ -19,9 +19,9 @@ struct fs_pin;
#if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
/*
* sysctl for vm.memfd_noexec
- * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC
* acts like MFD_EXEC was set.
- * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
+ * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC
* acts like MFD_NOEXEC_SEAL was set.
* 2: memfd_create() without MFD_NOEXEC_SEAL will be
* rejected.
diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
index 273a4e15dfcf..b9e13bd65817 100644
--- a/include/uapi/linux/memfd.h
+++ b/include/uapi/linux/memfd.h
@@ -12,6 +12,7 @@
#define MFD_NOEXEC_SEAL 0x0008U
/* executable */
#define MFD_EXEC 0x0010U
+#define MFD_NOEXEC 0x0020U

/*
* Huge page size encoding when MFD_HUGETLB is specified, and a huge page
diff --git a/mm/memfd.c b/mm/memfd.c
index e763e76f1106..2afe49368fc5 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -266,7 +266,9 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
#define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
#define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)

-#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
+#define MFD_ALL_FLAGS \
+ (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
+ MFD_NOEXEC_SEAL | MFD_EXEC | MFD_NOEXEC)

SYSCALL_DEFINE2(memfd_create,
const char __user *, uname,
@@ -289,11 +291,13 @@ SYSCALL_DEFINE2(memfd_create,
return -EINVAL;
}

- /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
- if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
+ if (flags & MFD_NOEXEC_SEAL)
+ flags |= MFD_ALLOW_SEALING | MFD_NOEXEC;
+
+ if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC))
return -EINVAL;

- if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
+ if (!(flags & (MFD_EXEC | MFD_NOEXEC))) {
#ifdef CONFIG_SYSCTL
int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
struct pid_namespace *ns;
@@ -366,20 +370,15 @@ SYSCALL_DEFINE2(memfd_create,
file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
file->f_flags |= O_LARGEFILE;

- if (flags & MFD_NOEXEC_SEAL) {
- struct inode *inode = file_inode(file);
+ if (flags & MFD_NOEXEC)
+ file_inode(file)->i_mode &= ~0111;

- inode->i_mode &= ~0111;
- file_seals = memfd_file_seals_ptr(file);
- if (file_seals) {
+ file_seals = memfd_file_seals_ptr(file);
+ if (file_seals) {
+ if (flags & MFD_ALLOW_SEALING)
*file_seals &= ~F_SEAL_SEAL;
+ if (flags & MFD_NOEXEC_SEAL)
*file_seals |= F_SEAL_EXEC;
- }
- } else if (flags & MFD_ALLOW_SEALING) {
- /* MFD_EXEC and MFD_ALLOW_SEALING are set */
- file_seals = memfd_file_seals_ptr(file);
- if (file_seals)
- *file_seals &= ~F_SEAL_SEAL;
}

fd_install(fd, file);
--
2.41.0



2023-07-18 19:46:43

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

Hi David

Thanks email and patch for discussion.

On Fri, Jul 14, 2023 at 4:48 AM David Rheinsberg <[email protected]> wrote:
>
> Add a new flag for memfd_create() called MFD_NOEXEC, which has the
> opposite effect as MFD_EXEC (i.e., it strips the executable bits from
> the inode file mode).
>
I previously thought about having the symmetric flags, such as
MFD_NOEXEC/MFD_EXEC/MFD_NOEXEC_SEAL/MFD_EXEC_SEAL, but decided against
it. The app shall decide beforehand what the memfd is created for, if
it is no-executable, then it should be sealed, such that it can't be
chmod to enable "X" bit.

> The default mode for memfd_create() has historically been to use 0777 as
> file modes. The new `MFD_EXEC` flag has now made this explicit, paving
> the way to reduce the default to 0666 and thus dropping the executable
> bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has
> been added which allows this without changing the default right now.
>
> Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and
> `F_SEAL_EXEC`, with no alternatives available. This leads to multiple
> issues:
>
> * Applications that do not want to allow sealing either have to use
> `MFD_EXEC` (which we don't want, unless they actually need the
> executable bits), or they must add `F_SEAL_SEAL` immediately on
> return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this
> implicitly enables sealing.
>
> Note that we explicitly added `MFD_ALLOW_SEALING` when creating
> memfd_create(2), because enabling seals on existing users of shmem
> without them knowing about it can easily introduce DoS scenarios.

The application that doesn't want MFD_NOEXEC_SEAL can use MFD_EXEC,
the kernel won't add MFD_ALLOW_SEALING implicitly. MFD_EXEC makes the
kernel behave the same as before, this is also why sysctl
vm.memfd_noexec=0 can work seamlessly.

> It
> is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and
> this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL`
> set via sysctl, since it will silently enable sealing on every memfd
> created.
>
Without sealing, chmod(2) can modify the mfd to be executable, that is
the consideration that MFD_NOEXEC is not provided as an option.
Indeed, current design is "biased" towards promoting MFD_NOEXEC_SEAL
as the safest approach, and try to avoid the pitfall that dev
accidently uses "MFD_NOEXEC" without realizing it can still be
chmod().


> * Applications that do not want `MFD_EXEC`, but rely on
> `F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving
> this other than using `MFD_EXEC` and clearing the executable bits
> manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set
> `F_SEAL_EXEC` and thus break existing code that hard-codes the
> seal-set.
>
> This is already an issue when sending log-memfds to systemd-journald
> today, which verifies the seal-set of the received FD and fails if
> unknown seals are set. Hence, you have to use `MFD_EXEC` when
> creating memfds for this purpose, even though you really do not need
> the executable bits set.
>
> * Applications that want to enable the executable bit later on,
> currently have no way to create the memfd without it. They have to
> clear the bits immediately after creating it via fchmod(2), or just
> leave them set.
>
Is it OK to do what you want in two steps ? What is the concern there ? i.e.
memfd_create(MFD_EXEC), then chmod to remove the "X" bit.

I imagine this is probably already what the application does for
creating no-executable mfd before my patch, i.e.:
memfd_create(), then chmod() to remove "X" to remove "X" bit.

Thanks!
-Jeff



-Jeff

> By adding MFD_NOEXEC, user-space is now able to clear the executable
> bits on all memfds immediately, clearly stating their intention, without
> requiring fixups after creating the memfd. In particular, it is highly
> useful for existing use-cases that cannot allow new seals to appear on
> memfds.
>
> Signed-off-by: David Rheinsberg <[email protected]>
> ---
> include/linux/pid_namespace.h | 4 ++--
> include/uapi/linux/memfd.h | 1 +
> mm/memfd.c | 29 ++++++++++++++---------------
> 3 files changed, 17 insertions(+), 17 deletions(-)
>
> diff --git a/include/linux/pid_namespace.h b/include/linux/pid_namespace.h
> index c758809d5bcf..02f8acf94512 100644
> --- a/include/linux/pid_namespace.h
> +++ b/include/linux/pid_namespace.h
> @@ -19,9 +19,9 @@ struct fs_pin;
> #if defined(CONFIG_SYSCTL) && defined(CONFIG_MEMFD_CREATE)
> /*
> * sysctl for vm.memfd_noexec
> - * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + * 0: memfd_create() without MFD_EXEC nor MFD_NOEXEC
> * acts like MFD_EXEC was set.
> - * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC_SEAL
> + * 1: memfd_create() without MFD_EXEC nor MFD_NOEXEC
> * acts like MFD_NOEXEC_SEAL was set.
> * 2: memfd_create() without MFD_NOEXEC_SEAL will be
> * rejected.
> diff --git a/include/uapi/linux/memfd.h b/include/uapi/linux/memfd.h
> index 273a4e15dfcf..b9e13bd65817 100644
> --- a/include/uapi/linux/memfd.h
> +++ b/include/uapi/linux/memfd.h
> @@ -12,6 +12,7 @@
> #define MFD_NOEXEC_SEAL 0x0008U
> /* executable */
> #define MFD_EXEC 0x0010U
> +#define MFD_NOEXEC 0x0020U
>
> /*
> * Huge page size encoding when MFD_HUGETLB is specified, and a huge page
> diff --git a/mm/memfd.c b/mm/memfd.c
> index e763e76f1106..2afe49368fc5 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -266,7 +266,9 @@ long memfd_fcntl(struct file *file, unsigned int cmd, unsigned int arg)
> #define MFD_NAME_PREFIX_LEN (sizeof(MFD_NAME_PREFIX) - 1)
> #define MFD_NAME_MAX_LEN (NAME_MAX - MFD_NAME_PREFIX_LEN)
>
> -#define MFD_ALL_FLAGS (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | MFD_NOEXEC_SEAL | MFD_EXEC)
> +#define MFD_ALL_FLAGS \
> + (MFD_CLOEXEC | MFD_ALLOW_SEALING | MFD_HUGETLB | \
> + MFD_NOEXEC_SEAL | MFD_EXEC | MFD_NOEXEC)
>
> SYSCALL_DEFINE2(memfd_create,
> const char __user *, uname,
> @@ -289,11 +291,13 @@ SYSCALL_DEFINE2(memfd_create,
> return -EINVAL;
> }
>
> - /* Invalid if both EXEC and NOEXEC_SEAL are set.*/
> - if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC_SEAL))
> + if (flags & MFD_NOEXEC_SEAL)
> + flags |= MFD_ALLOW_SEALING | MFD_NOEXEC;
> +
> + if ((flags & MFD_EXEC) && (flags & MFD_NOEXEC))
> return -EINVAL;
>
> - if (!(flags & (MFD_EXEC | MFD_NOEXEC_SEAL))) {
> + if (!(flags & (MFD_EXEC | MFD_NOEXEC))) {
> #ifdef CONFIG_SYSCTL
> int sysctl = MEMFD_NOEXEC_SCOPE_EXEC;
> struct pid_namespace *ns;
> @@ -366,20 +370,15 @@ SYSCALL_DEFINE2(memfd_create,
> file->f_mode |= FMODE_LSEEK | FMODE_PREAD | FMODE_PWRITE;
> file->f_flags |= O_LARGEFILE;
>
> - if (flags & MFD_NOEXEC_SEAL) {
> - struct inode *inode = file_inode(file);
> + if (flags & MFD_NOEXEC)
> + file_inode(file)->i_mode &= ~0111;
>
> - inode->i_mode &= ~0111;
> - file_seals = memfd_file_seals_ptr(file);
> - if (file_seals) {
> + file_seals = memfd_file_seals_ptr(file);
> + if (file_seals) {
> + if (flags & MFD_ALLOW_SEALING)
> *file_seals &= ~F_SEAL_SEAL;
> + if (flags & MFD_NOEXEC_SEAL)
> *file_seals |= F_SEAL_EXEC;
> - }
> - } else if (flags & MFD_ALLOW_SEALING) {
> - /* MFD_EXEC and MFD_ALLOW_SEALING are set */
> - file_seals = memfd_file_seals_ptr(file);
> - if (file_seals)
> - *file_seals &= ~F_SEAL_SEAL;
> }
>
> fd_install(fd, file);
> --
> 2.41.0
>

2023-07-28 08:54:58

by David Rheinsberg

[permalink] [raw]
Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

Hi

On Tue, Jul 18, 2023, at 9:03 PM, Jeff Xu wrote:
> Hi David
>
> Thanks email and patch for discussion.
>
> On Fri, Jul 14, 2023 at 4:48 AM David Rheinsberg <[email protected]> wrote:
>>
>> Add a new flag for memfd_create() called MFD_NOEXEC, which has the
>> opposite effect as MFD_EXEC (i.e., it strips the executable bits from
>> the inode file mode).
>>
> I previously thought about having the symmetric flags, such as
> MFD_NOEXEC/MFD_EXEC/MFD_NOEXEC_SEAL/MFD_EXEC_SEAL, but decided against
> it. The app shall decide beforehand what the memfd is created for, if
> it is no-executable, then it should be sealed, such that it can't be
> chmod to enable "X" bit.

My point is, an application might decide to *not* seal a property, because it knows it has to change it later on. But it might still want to disable the executable bit initially, so to avoid having executable pages around that can be exploited.

>> The default mode for memfd_create() has historically been to use 0777 as
>> file modes. The new `MFD_EXEC` flag has now made this explicit, paving
>> the way to reduce the default to 0666 and thus dropping the executable
>> bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has
>> been added which allows this without changing the default right now.
>>
>> Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and
>> `F_SEAL_EXEC`, with no alternatives available. This leads to multiple
>> issues:
>>
>> * Applications that do not want to allow sealing either have to use
>> `MFD_EXEC` (which we don't want, unless they actually need the
>> executable bits), or they must add `F_SEAL_SEAL` immediately on
>> return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this
>> implicitly enables sealing.
>>
>> Note that we explicitly added `MFD_ALLOW_SEALING` when creating
>> memfd_create(2), because enabling seals on existing users of shmem
>> without them knowing about it can easily introduce DoS scenarios.
>
> The application that doesn't want MFD_NOEXEC_SEAL can use MFD_EXEC,
> the kernel won't add MFD_ALLOW_SEALING implicitly. MFD_EXEC makes the
> kernel behave the same as before, this is also why sysctl
> vm.memfd_noexec=0 can work seamlessly.
>
>> It
>> is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and
>> this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL`
>> set via sysctl, since it will silently enable sealing on every memfd
>> created.
>>
> Without sealing, chmod(2) can modify the mfd to be executable, that is
> the consideration that MFD_NOEXEC is not provided as an option.
> Indeed, current design is "biased" towards promoting MFD_NOEXEC_SEAL
> as the safest approach, and try to avoid the pitfall that dev
> accidently uses "MFD_NOEXEC" without realizing it can still be
> chmod().

I think I didn't get my point across. Imagine an application that does *NOT* use sealing, but uses memfds. This application shares memfds with untrusted clients, and does this in a safe way (SIGBUS protected). Everything works fine, unless someone decides to enable `vm.memfd_noexec=2`. Suddenly, the memfd will have sealing enabled *without* the application ever requesting this. Now any untrusted client that got the memfd can add seals to the memfd, even though the creator of the memfd did not enable sealing. This client can now seal WRITES on the memfd, even though it really should not be able to do that.

(This is not an hypothetical setup, we have such setups for data sharing already)

Thus, setting the security-option `memfd_noexec` *breaks* applications, because it enables sealing. If `MFD_NOEXEC_SEAL` would *not* imply `MFD_ALLOW_SEALING`, this would not be an issue. IOW, why does ´MFD_NOEXEC_SEAL` clear `F_SEAL_SEAL` even if `MFD_ALLOW_SEALING` is not set?

>> * Applications that do not want `MFD_EXEC`, but rely on
>> `F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving
>> this other than using `MFD_EXEC` and clearing the executable bits
>> manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set
>> `F_SEAL_EXEC` and thus break existing code that hard-codes the
>> seal-set.
>>
>> This is already an issue when sending log-memfds to systemd-journald
>> today, which verifies the seal-set of the received FD and fails if
>> unknown seals are set. Hence, you have to use `MFD_EXEC` when
>> creating memfds for this purpose, even though you really do not need
>> the executable bits set.
>>
>> * Applications that want to enable the executable bit later on,
>> currently have no way to create the memfd without it. They have to
>> clear the bits immediately after creating it via fchmod(2), or just
>> leave them set.
>>
> Is it OK to do what you want in two steps ? What is the concern there ? i.e.
> memfd_create(MFD_EXEC), then chmod to remove the "X" bit.
>
> I imagine this is probably already what the application does for
> creating no-executable mfd before my patch, i.e.:
> memfd_create(), then chmod() to remove "X" to remove "X" bit.

Yes, correct, this is not a technical issue, but rather an API issue. I don't think most memfd-users are aware that their inode has the executable bit set, and they likely don't want it. But for backwards-compatibility reasons (as noted above), they cannot use `MFD_NOEXEC_SEAL`. Hence, we have to make them explicitly request an executable memfd via `MFD_EXEC` now, even though they clearly do not want this. And then we have to add code to drop the executable immediately afterwards.

It don't understand why we don't add out `MFD_NOEXEC` and thus make it a lot easier to patch existing applications? And we make it explicit that these applications don't care for the executable-bit, rather than forcing them to request the executable bit just to drop it immediately.

The downside of `MFD_NOEXEC` is that it might be picked over `MFD_NOEXEC_SEAL` by uneducated users, thus reduce security. But right now, the alternative is that existing code picks `MFD_EXEC` instead and never clears the executable bit, because it is a hassle to do so.

Or is there another reason *not* to include `MFD_NOEXEC`? I am not sure I understand fully why you fight it so vehemently?

Thanks
David

2023-08-01 20:11:38

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

Hi David,

On Fri, Jul 28, 2023 at 12:55 AM David Rheinsberg <[email protected]> wrote:
>
> Hi
>
> On Tue, Jul 18, 2023, at 9:03 PM, Jeff Xu wrote:
> > Hi David
> >
> > Thanks email and patch for discussion.
> >
> > On Fri, Jul 14, 2023 at 4:48 AM David Rheinsberg <[email protected]> wrote:
> >>
> >> Add a new flag for memfd_create() called MFD_NOEXEC, which has the
> >> opposite effect as MFD_EXEC (i.e., it strips the executable bits from
> >> the inode file mode).
> >>
> > I previously thought about having the symmetric flags, such as
> > MFD_NOEXEC/MFD_EXEC/MFD_NOEXEC_SEAL/MFD_EXEC_SEAL, but decided against
> > it. The app shall decide beforehand what the memfd is created for, if
> > it is no-executable, then it should be sealed, such that it can't be
> > chmod to enable "X" bit.
>
> My point is, an application might decide to *not* seal a property, because it knows it has to change it later on. But it might still want to disable the executable bit initially, so to avoid having executable pages around that can be exploited.
>

I understand that.
My argument was this application can do this in two steps, as in my
previous email:
1> memfd_create(MFD_EXEC)
2> chmod

Two system calls back to back isn't too terrible, and I know this
might seem to be not optimized for your user case, I will explain it
later, please read on.

> >> The default mode for memfd_create() has historically been to use 0777 as
> >> file modes. The new `MFD_EXEC` flag has now made this explicit, paving
> >> the way to reduce the default to 0666 and thus dropping the executable
> >> bits for security reasons. Additionally, the `MFD_NOEXEC_SEAL` flag has
> >> been added which allows this without changing the default right now.
> >>
> >> Unfortunately, `MFD_NOEXEC_SEAL` enforces `MFD_ALLOW_SEALING` and
> >> `F_SEAL_EXEC`, with no alternatives available. This leads to multiple
> >> issues:
> >>
> >> * Applications that do not want to allow sealing either have to use
> >> `MFD_EXEC` (which we don't want, unless they actually need the
> >> executable bits), or they must add `F_SEAL_SEAL` immediately on
> >> return of memfd_create(2) with `MFD_NOEXEC_SEAL`, since this
> >> implicitly enables sealing.
> >>
> >> Note that we explicitly added `MFD_ALLOW_SEALING` when creating
> >> memfd_create(2), because enabling seals on existing users of shmem
> >> without them knowing about it can easily introduce DoS scenarios.
> >
> > The application that doesn't want MFD_NOEXEC_SEAL can use MFD_EXEC,
> > the kernel won't add MFD_ALLOW_SEALING implicitly. MFD_EXEC makes the
> > kernel behave the same as before, this is also why sysctl
> > vm.memfd_noexec=0 can work seamlessly.
> >
> >> It
> >> is unclear why `MFD_NOEXEC_SEAL` was designed to enable seals, and
> >> this is especially dangerous with `MEMFD_NOEXEC_SCOPE_NOEXEC_SEAL`
> >> set via sysctl, since it will silently enable sealing on every memfd
> >> created.
> >>
> > Without sealing, chmod(2) can modify the mfd to be executable, that is
> > the consideration that MFD_NOEXEC is not provided as an option.
> > Indeed, current design is "biased" towards promoting MFD_NOEXEC_SEAL
> > as the safest approach, and try to avoid the pitfall that dev
> > accidently uses "MFD_NOEXEC" without realizing it can still be
> > chmod().
>
> I think I didn't get my point across. Imagine an application that does *NOT* use sealing, but uses memfds. This application shares memfds with untrusted clients, and does this in a safe way (SIGBUS protected). Everything works fine, unless someone decides to enable `vm.memfd_noexec=2`. Suddenly, the memfd will have sealing enabled *without* the application ever requesting this. Now any untrusted client that got the memfd can add seals to the memfd, even though the creator of the memfd did not enable sealing. This client can now seal WRITES on the memfd, even though it really should not be able to do that.
>
> (This is not an hypothetical setup, we have such setups for data sharing already)

Thanks, this helps me understand your point better.

I'm not convinced that sysctl needs to consider the threat model of
"someone" changing and breaking an application. If we follow that
threat model, there are a lot of other sysctls to worry about.

Also, in the system that you described, if memfd is handled to an
untrusted process, not only "sealing" can cause damage, but also
chmod, arbitrary rw, imo the right approach is to harden the process
or mechanism of passing the memfd.

>
> Thus, setting the security-option `memfd_noexec` *breaks* applications, because it enables sealing. If `MFD_NOEXEC_SEAL` would *not* imply `MFD_ALLOW_SEALING`, this would not be an issue. IOW, why does ´MFD_NOEXEC_SEAL` clear `F_SEAL_SEAL` even if `MFD_ALLOW_SEALING` is not set?
>

If MFD_NOEXEC_SEAL is not desired, then it should not be used to
overwrite memfd_create() in this system.

For the question of why the sysctl adding a seal without application
setting it , the rationale here is, as summary of previous/this
emails:
1> The sysctl helps a system (or container in pid name) to migrate out
of old API, e.g, a container can update only the applications that
need executable memfd, change the sysctl to MFD_NOEXEC_SEAL overwrite,
thus secure the usage of executable memfd to only those need to. This
is faster than updating all of the applications.
2> Since the sysctl does overwrite, it has potential to break an
application's intent, i.e. an unmigrated application which expects
executable memfd, and kernel uses MFD_NOEXEC_SEAL overwrite. It is
important to use this with caution and abundant tests before turning
it on.
3> The seal is added because the majority of users should just care
about MFD_NOEXEC_SEAL, this prevents changing the memfd to executable
through chmod at runtime.

> >> * Applications that do not want `MFD_EXEC`, but rely on
> >> `F_GET_SEALS` to *NOT* return `F_SEAL_EXEC` have no way of achieving
> >> this other than using `MFD_EXEC` and clearing the executable bits
> >> manually via fchmod(2). Using `MFD_NOEXEC_SEAL` will set
> >> `F_SEAL_EXEC` and thus break existing code that hard-codes the
> >> seal-set.
> >>
> >> This is already an issue when sending log-memfds to systemd-journald
> >> today, which verifies the seal-set of the received FD and fails if
> >> unknown seals are set. Hence, you have to use `MFD_EXEC` when
> >> creating memfds for this purpose, even though you really do not need
> >> the executable bits set.
> >>
> >> * Applications that want to enable the executable bit later on,
> >> currently have no way to create the memfd without it. They have to
> >> clear the bits immediately after creating it via fchmod(2), or just
> >> leave them set.
> >>
> > Is it OK to do what you want in two steps ? What is the concern there ? i.e.
> > memfd_create(MFD_EXEC), then chmod to remove the "X" bit.
> >
> > I imagine this is probably already what the application does for
> > creating no-executable mfd before my patch, i.e.:
> > memfd_create(), then chmod() to remove "X" to remove "X" bit.
>
> Yes, correct, this is not a technical issue, but rather an API issue. I don't think most memfd-users are aware that their inode has the executable bit set, and they likely don't want it. But for backwards-compatibility reasons (as noted above), they cannot use `MFD_NOEXEC_SEAL`. Hence, we have to make them explicitly request an executable memfd via `MFD_EXEC` now, even though they clearly do not want this. And then we have to add code to drop the executable immediately afterwards.
>
> It don't understand why we don't add out `MFD_NOEXEC` and thus make it a lot easier to patch existing applications? And we make it explicit that these applications don't care for the executable-bit, rather than forcing them to request the executable bit just to drop it immediately.
>
> The downside of `MFD_NOEXEC` is that it might be picked over `MFD_NOEXEC_SEAL` by uneducated users, thus reducing security. But right now, the alternative is that existing code picks `MFD_EXEC` instead and never clears the executable bit, because it is a hassle to do so.
>

Yes. This is the downside I was thinking about.

I lean to believe the kernel API shouldn't be feature rich, it could
be simple, optimized towards the majority of user cases, and ideally,
is self-explained without devs to look through documentation. For
example, if I had to choose one to implement between MFD_NOEXEC and
MFD_NOEXEC_SEAL, I would choose MFD_NOEXEC_SEAL because it should be
what most users care about.

> Or is there another reason *not* to include `MFD_NOEXEC`? I am not sure I understand fully why you fight it so vehemently?
>

I wouldn't add it myself, I hope to convince you not to :-).
If you still think it is beneficial to add MFD_NOEXEC (saving one
chmod call and making it easy to use), I wouldn't feel bad about that.
I would suggest going with documentation to help devs to choose
between those two, i.e. recommend MFD_NOEXEC_SEAL in most cases.

Thanks
Best regards,
-Jeff

> Thanks
> David
>

2023-08-02 09:01:58

by David Rheinsberg

[permalink] [raw]
Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

Hi Jeff!

On Tue, Aug 1, 2023, at 9:24 PM, Jeff Xu wrote:
>> My point is, an application might decide to *not* seal a property, because it knows it has to change it later on. But it might still want to disable the executable bit initially, so to avoid having executable pages around that can be exploited.
>>
>
> I understand that.
> My argument was this application can do this in two steps, as in my
> previous email:
> 1> memfd_create(MFD_EXEC)
> 2> chmod
>
> Two system calls back to back isn't too terrible, and I know this
> might seem to be not optimized for your user case, I will explain it
> later, please read on.

Yes, I agree that MFD_NOEXEC would be rather simple to imitate. So this is mostly a discussion about the intention and side-effects of this workaround, which is also likely why we haven't found an agreement, yet.

[...]
>> I think I didn't get my point across. Imagine an application that does *NOT* use sealing, but uses memfds. This application shares memfds with untrusted clients, and does this in a safe way (SIGBUS protected). Everything works fine, unless someone decides to enable `vm.memfd_noexec=2`. Suddenly, the memfd will have sealing enabled *without* the application ever requesting this. Now any untrusted client that got the memfd can add seals to the memfd, even though the creator of the memfd did not enable sealing. This client can now seal WRITES on the memfd, even though it really should not be able to do that.
>>
>> (This is not an hypothetical setup, we have such setups for data sharing already)
>
> Thanks, this helps me understand your point better.
>
> I'm not convinced that sysctl needs to consider the threat model of
> "someone" changing and breaking an application. If we follow that
> threat model, there are a lot of other sysctls to worry about.
>
> Also, in the system that you described, if memfd is handled to an
> untrusted process, not only "sealing" can cause damage, but also
> chmod, arbitrary rw, imo the right approach is to harden the process
> or mechanism of passing the memfd.

No. The model I describe is carefully designed to hand out file-descriptors to inodes that the clients have *no* access to. They cannot run fchmod(2), unlink(2), etc. All they can do is operate on the open file. And all access to this shared file is properly guarded against possible damage the other concurrent clients can do. The entire model is already hardened against malicious actors.

With the new sysctl, a new attack-vector is introduced, which was not possible before.

I was *explicitly* told to add `MFD_ALLOW_SEALING` for that exact reason when introducing memfd_create(2). So I am a bit baffled why it is now ok to enable sealing behind the users back.

I agree that the new sysctl is a root-only option. But I fail to see *why* it implies `MFD_ALLOW_SEALING`? This behavior is not documented nor is it explained in the original commit-messages, nor mentioned *anywhere*.

>> Thus, setting the security-option `memfd_noexec` *breaks* applications, because it enables sealing. If `MFD_NOEXEC_SEAL` would *not* imply `MFD_ALLOW_SEALING`, this would not be an issue. IOW, why does ´MFD_NOEXEC_SEAL` clear `F_SEAL_SEAL` even if `MFD_ALLOW_SEALING` is not set?
>>
>
> If MFD_NOEXEC_SEAL is not desired, then it should not be used to
> overwrite memfd_create() in this system.
>
> For the question of why the sysctl adding a seal without application
> setting it , the rationale here is, as summary of previous/this
> emails:

I still think we are not talking about the same thing. I completely understand why you add the seal! I am just questioning why you *CLEAR* `F_SEAL_SEAL`? That is, why do you enable `MFD_ALLOW_SEALING` without the user requesting it? You could just set `F_SEAL_EXEC` without clearing `F_SEAL_SEAL`. And then require `MFD_ALLOW_SEALING` on top to clear `F_SEAL_SEAL`.

[...]
>> The downside of `MFD_NOEXEC` is that it might be picked over `MFD_NOEXEC_SEAL` by uneducated users, thus reducing security. But right now, the alternative is that existing code picks `MFD_EXEC` instead and never clears the executable bit, because it is a hassle to do so.
>>
>
> Yes. This is the downside I was thinking about.
>
> I lean to believe the kernel API shouldn't be feature rich, it could
> be simple, optimized towards the majority of user cases, and ideally,
> is self-explained without devs to look through documentation. For
> example, if I had to choose one to implement between MFD_NOEXEC and
> MFD_NOEXEC_SEAL, I would choose MFD_NOEXEC_SEAL because it should be
> what most users care about.

Well, if we were to go back, we would make MFD_NOEXEC(_SEAL) the default and just add `MFD_EXEC` :)

>> Or is there another reason *not* to include `MFD_NOEXEC`? I am not sure I understand fully why you fight it so vehemently?
>>
>
> I wouldn't add it myself, I hope to convince you not to :-).
> If you still think it is beneficial to add MFD_NOEXEC (saving one
> chmod call and making it easy to use), I wouldn't feel bad about that.
> I would suggest going with documentation to help devs to choose
> between those two, i.e. recommend MFD_NOEXEC_SEAL in most cases.

Any application that cannot use `F_SEAL_EXEC` (e.g., because its peers verify for historic reasons that the seal is not set) now has to do an extra dance to get the "safer" behavior, rather than getting the "safer" behavior by default. That is, we make it easier to get the unsafe behavior than to get the safe behavior (in this particular scenario).

Without `MFD_NOEXEC`, it is easier to end up with a 0777 memfd than not. I want the application that desires `S_IXUSR` to jump through hoops, not the application that does *not* require it.

In other words, I would prefer `MFD_ALLOW_EXEC`, which requires fchmod(2)` to set `S_IXUSR`, rather than requiring a call to fchmod(2) to clear it for everyone that does not need it.

Thanks
David

2023-08-03 00:54:46

by Jeff Xu

[permalink] [raw]
Subject: Re: [PATCH] memfd: support MFD_NOEXEC alongside MFD_EXEC

On Wed, Aug 2, 2023 at 12:58 AM David Rheinsberg <[email protected]> wrote:
> >> I think I didn't get my point across. Imagine an application that does *NOT* use sealing, but uses memfds. This application shares memfds with untrusted clients, and does this in a safe way (SIGBUS protected). Everything works fine, unless someone decides to enable `vm.memfd_noexec=2`. Suddenly, the memfd will have sealing enabled *without* the application ever requesting this. Now any untrusted client that got the memfd can add seals to the memfd, even though the creator of the memfd did not enable sealing. This client can now seal WRITES on the memfd, even though it really should not be able to do that.
> >>
> >> (This is not an hypothetical setup, we have such setups for data sharing already)
> >
> > Thanks, this helps me understand your point better.
> >
> > I'm not convinced that sysctl needs to consider the threat model of
> > "someone" changing and breaking an application. If we follow that
> > threat model, there are a lot of other sysctls to worry about.
> >
> > Also, in the system that you described, if memfd is handled to an
> > untrusted process, not only "sealing" can cause damage, but also
> > chmod, arbitrary rw, imo the right approach is to harden the process
> > or mechanism of passing the memfd.
>
> No. The model I describe is carefully designed to hand out file-descriptors to inodes that the clients have *no* access to. They cannot run fchmod(2), unlink(2), etc. All they can do is operate on the open file. And all access to this shared file is properly guarded against possible damage the other concurrent clients can do. The entire model is already hardened against malicious actors.
>
> With the new sysctl, a new attack-vector is introduced, which was not possible before.
>
> I was *explicitly* told to add `MFD_ALLOW_SEALING` for that exact reason when introducing memfd_create(2). So I am a bit baffled why it is now ok to enable sealing behind the users back.
>
> I agree that the new sysctl is a root-only option. But I fail to see *why* it implies `MFD_ALLOW_SEALING`? This behavior is not documented nor is it explained in the original commit-messages, nor mentioned *anywhere*.
>
> >> Thus, setting the security-option `memfd_noexec` *breaks* applications, because it enables sealing. If `MFD_NOEXEC_SEAL` would *not* imply `MFD_ALLOW_SEALING`, this would not be an issue. IOW, why does ´MFD_NOEXEC_SEAL` clear `F_SEAL_SEAL` even if `MFD_ALLOW_SEALING` is not set?
> >>
> >
> > If MFD_NOEXEC_SEAL is not desired, then it should not be used to
> > overwrite memfd_create() in this system.
> >
> > For the question of why the sysctl adding a seal without application
> > setting it , the rationale here is, as summary of previous/this
> > emails:
>
> I still think we are not talking about the same thing. I completely understand why you add the seal! I am just questioning why you *CLEAR* `F_SEAL_SEAL`? That is, why do you enable `MFD_ALLOW_SEALING` without the user requesting it? You could just set `F_SEAL_EXEC` without clearing `F_SEAL_SEAL`. And then require `MFD_ALLOW_SEALING` on top to clear `F_SEAL_SEAL`.
>
Ah, I apologize. I didn't read it carefully enough and misunderstood
you, thanks for clarification.

The reason that F_SEAL_SEAL is cleared, is that MFD_NOEXEC_SEAL
implies MFD_ALLOW_SEALING, and it seems to be reasonable that
application might want to use sealing e.g I image application write
the content to memfd then adding F_SEAL_WRITE.

Your point is that MFD_ALLOW_SEALING should not be implied by
MFD_NOEXEC_SEAL. An application should still explicitly set
MFD_ALLOW_SEALING.

To me, MFD_NOEXEC_SEAL, the _SEAL part implies to allow sealing, but
of course, this might not be so clear to anyone other than me :-) ,
documentation is indeed necessary.

And with the context you described, now I think your approach is better:
1> application set MFD_NOEXEC_SEAL, with MFD_ALLOW_SEALING
F_SEAL_EXEC is set, F_SEAL_SEAL is clear.
2> Application set MFD_NOEXEC_SEAL, without MFD_ALLOW_SEALING
F_SEAL_EXEC and F_SEAL_SEAL are set.

> [...]
> >> The downside of `MFD_NOEXEC` is that it might be picked over `MFD_NOEXEC_SEAL` by uneducated users, thus reducing security. But right now, the alternative is that existing code picks `MFD_EXEC` instead and never clears the executable bit, because it is a hassle to do so.
> >>
> >
> > Yes. This is the downside I was thinking about.
> >
> > I lean to believe the kernel API shouldn't be feature rich, it could
> > be simple, optimized towards the majority of user cases, and ideally,
> > is self-explained without devs to look through documentation. For
> > example, if I had to choose one to implement between MFD_NOEXEC and
> > MFD_NOEXEC_SEAL, I would choose MFD_NOEXEC_SEAL because it should be
> > what most users care about.
>
> Well, if we were to go back, we would make MFD_NOEXEC(_SEAL) the default and just add `MFD_EXEC` :)
>
> >> Or is there another reason *not* to include `MFD_NOEXEC`? I am not sure I understand fully why you fight it so vehemently?
> >>
> >
> > I wouldn't add it myself, I hope to convince you not to :-).
> > If you still think it is beneficial to add MFD_NOEXEC (saving one
> > chmod call and making it easy to use), I wouldn't feel bad about that.
> > I would suggest going with documentation to help devs to choose
> > between those two, i.e. recommend MFD_NOEXEC_SEAL in most cases.
>
> Any application that cannot use `F_SEAL_EXEC` (e.g., because its peers verify for historic reasons that the seal is not set) now has to do an extra dance to get the "safer" behavior, rather than getting the "safer" behavior by default. That is, we make it easier to get the unsafe behavior than to get the safe behavior (in this particular scenario).
> Without `MFD_NOEXEC`, it is easier to end up with a 0777 memfd than not. I want the application that desires `S_IXUSR` to jump through hoops, not the application that does *not* require it.
>
I see your points now, i.e. the "disallow sealing entirely" is at
least as important as "not able to chmod to add X".
I think the reasonable mid-ground is perhaps adding MFD_NOEXEC
support, with some documentation to help dev to choose between
MFD_NOEXEC and MFD_NOEXEC_SEAL

Would you like to update your patch to the last version on Andrew's
branch, adding selftest, and perhaps help for documentation ?

Thanks!
-Jeff