Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
current with a flag that prevents pages that were previously not executable from
becoming executable.
This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
At Google, we've been using a somewhat similar downstream patch for a few years
now. To make the adoption of this feature easier, we've had it support a mode in
which the W^X flag does not propagate to children. For example, this is handy if
a C process which wants W^X protection suspects it could start children
processes that would use a JIT.
I'd like to align our features with the upstream prctl. This series proposes a
new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
sets a different flag in current that is not in MMF_INIT_MASK and which does not
propagate.
As part of looking into MDWE, I also fixed a couple of things in the MDWE test.
Florent Revest (4):
kselftest: vm: Fix tabs/spaces inconsistency in the mdwe test
kselftest: vm: Fix mdwe's mmap_FIXED test case
mm: Add a NO_INHERIT flag to the PR_SET_MDWE prctl
kselftest: vm: Add tests for no-inherit memory-deny-write-execute
include/linux/mman.h | 8 +-
include/linux/sched/coredump.h | 1 +
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 29 +++++--
tools/include/uapi/linux/prctl.h | 1 +
tools/testing/selftests/mm/mdwe_test.c | 110 +++++++++++++++++++++----
6 files changed, 128 insertions(+), 22 deletions(-)
--
2.40.1.495.gc816e09b53d-goog
Signed-off-by: Florent Revest <[email protected]>
---
tools/testing/selftests/mm/mdwe_test.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/tools/testing/selftests/mm/mdwe_test.c b/tools/testing/selftests/mm/mdwe_test.c
index bc91bef5d254..d0954c657feb 100644
--- a/tools/testing/selftests/mm/mdwe_test.c
+++ b/tools/testing/selftests/mm/mdwe_test.c
@@ -49,19 +49,19 @@ FIXTURE_VARIANT(mdwe)
FIXTURE_VARIANT_ADD(mdwe, stock)
{
- .enabled = false,
+ .enabled = false,
.forked = false,
};
FIXTURE_VARIANT_ADD(mdwe, enabled)
{
- .enabled = true,
+ .enabled = true,
.forked = false,
};
FIXTURE_VARIANT_ADD(mdwe, forked)
{
- .enabled = true,
+ .enabled = true,
.forked = true,
};
--
2.40.1.495.gc816e09b53d-goog
I checked with the original author, the mmap_FIXED test case wasn't
properly tested and fails. Currently, it maps two consecutive (non
overlapping) pages and expects the second mapping to be denied by MDWE
but these two pages have nothing to do with each other so MDWE is
actually out of the picture here.
What the test actually intended to do was to remap a virtual address
using MAP_FIXED. However, this operation unmaps the existing mapping and
creates a new one so the va is backed by a new page and MDWE is again
out of the picture, all remappings should succeed.
This patch keeps the test case to make it clear that this situation is
expected to work.
Signed-off-by: Florent Revest <[email protected]>
---
tools/testing/selftests/mm/mdwe_test.c | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/tools/testing/selftests/mm/mdwe_test.c b/tools/testing/selftests/mm/mdwe_test.c
index d0954c657feb..91aa9c3099e7 100644
--- a/tools/testing/selftests/mm/mdwe_test.c
+++ b/tools/testing/selftests/mm/mdwe_test.c
@@ -168,13 +168,10 @@ TEST_F(mdwe, mmap_FIXED)
self->p = mmap(NULL, self->size, PROT_READ, self->flags, 0, 0);
ASSERT_NE(self->p, MAP_FAILED);
- p = mmap(self->p + self->size, self->size, PROT_READ | PROT_EXEC,
+ /* MAP_FIXED unmaps the existing page before mapping which is allowed */
+ p = mmap(self->p, self->size, PROT_READ | PROT_EXEC,
self->flags | MAP_FIXED, 0, 0);
- if (variant->enabled) {
- EXPECT_EQ(p, MAP_FAILED);
- } else {
- EXPECT_EQ(p, self->p);
- }
+ EXPECT_EQ(p, self->p);
}
TEST_F(mdwe, arm64_BTI)
--
2.40.1.495.gc816e09b53d-goog
This extends the current PR_SET_MDWE prctl arg with a bit to indicate
that the process doesn't want MDWE protection to propagate to children.
To implement this no-inherit mode, the tag in current->mm->flags must be
absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
without inherit" is different in the prctl than in the mm flags. This
leads to a bit of bit-mangling in the prctl implementation.
Signed-off-by: Florent Revest <[email protected]>
---
include/linux/mman.h | 8 +++++++-
include/linux/sched/coredump.h | 1 +
include/uapi/linux/prctl.h | 1 +
kernel/sys.c | 29 +++++++++++++++++++++++------
tools/include/uapi/linux/prctl.h | 1 +
5 files changed, 33 insertions(+), 7 deletions(-)
diff --git a/include/linux/mman.h b/include/linux/mman.h
index cee1e4b566d8..3d7a0b70ad2d 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -157,6 +157,12 @@ calc_vm_flag_bits(unsigned long flags)
unsigned long vm_commit_limit(void);
+static inline bool has_mdwe_enabled(struct task_struct *task)
+{
+ return test_bit(MMF_HAS_MDWE, &task->mm->flags) ||
+ test_bit(MMF_HAS_MDWE_NO_INHERIT, &task->mm->flags);
+}
+
/*
* Denies creating a writable executable mapping or gaining executable permissions.
*
@@ -178,7 +184,7 @@ unsigned long vm_commit_limit(void);
*/
static inline bool map_deny_write_exec(struct vm_area_struct *vma, unsigned long vm_flags)
{
- if (!test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
+ if (!has_mdwe_enabled(current))
return false;
if ((vm_flags & VM_EXEC) && (vm_flags & VM_WRITE))
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 0ee96ea7a0e9..b2d9659ef863 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -91,4 +91,5 @@ static inline int get_dumpable(struct mm_struct *mm)
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK)
#define MMF_VM_MERGE_ANY 29
+#define MMF_HAS_MDWE_NO_INHERIT 30
#endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/include/uapi/linux/prctl.h b/include/uapi/linux/prctl.h
index f23d9a16507f..31ec44728412 100644
--- a/include/uapi/linux/prctl.h
+++ b/include/uapi/linux/prctl.h
@@ -284,6 +284,7 @@ struct prctl_mm_map {
/* Memory deny write / execute */
#define PR_SET_MDWE 65
# define PR_MDWE_REFUSE_EXEC_GAIN 1
+# define PR_MDWE_NO_INHERIT 2
#define PR_GET_MDWE 66
diff --git a/kernel/sys.c b/kernel/sys.c
index 339fee3eff6a..c864fd42ece1 100644
--- a/kernel/sys.c
+++ b/kernel/sys.c
@@ -2368,12 +2368,25 @@ static inline int prctl_set_mdwe(unsigned long bits, unsigned long arg3,
if (arg3 || arg4 || arg5)
return -EINVAL;
- if (bits & ~(PR_MDWE_REFUSE_EXEC_GAIN))
+ if (bits & ~(PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT))
return -EINVAL;
- if (bits & PR_MDWE_REFUSE_EXEC_GAIN)
- set_bit(MMF_HAS_MDWE, ¤t->mm->flags);
- else if (test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
+ /* Cannot set NO_INHERIT without REFUSE_EXEC_GAIN */
+ if (bits & PR_MDWE_NO_INHERIT && !(bits & PR_MDWE_REFUSE_EXEC_GAIN))
+ return -EINVAL;
+
+ if (bits & PR_MDWE_REFUSE_EXEC_GAIN) {
+ if (bits & PR_MDWE_NO_INHERIT) {
+ /* Cannot go from inherit mode to no inherit */
+ if (test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
+ return -EPERM;
+
+ set_bit(MMF_HAS_MDWE_NO_INHERIT, ¤t->mm->flags);
+ } else {
+ set_bit(MMF_HAS_MDWE, ¤t->mm->flags);
+ clear_bit(MMF_HAS_MDWE_NO_INHERIT, ¤t->mm->flags);
+ }
+ } else if (has_mdwe_enabled(current))
return -EPERM; /* Cannot unset the flag */
return 0;
@@ -2385,8 +2398,12 @@ static inline int prctl_get_mdwe(unsigned long arg2, unsigned long arg3,
if (arg2 || arg3 || arg4 || arg5)
return -EINVAL;
- return test_bit(MMF_HAS_MDWE, ¤t->mm->flags) ?
- PR_MDWE_REFUSE_EXEC_GAIN : 0;
+ if (test_bit(MMF_HAS_MDWE, ¤t->mm->flags))
+ return PR_MDWE_REFUSE_EXEC_GAIN;
+ else if (test_bit(MMF_HAS_MDWE_NO_INHERIT, ¤t->mm->flags))
+ return PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT;
+
+ return 0;
}
static int prctl_get_auxv(void __user *addr, unsigned long len)
diff --git a/tools/include/uapi/linux/prctl.h b/tools/include/uapi/linux/prctl.h
index 759b3f53e53f..a3424852d2d6 100644
--- a/tools/include/uapi/linux/prctl.h
+++ b/tools/include/uapi/linux/prctl.h
@@ -284,6 +284,7 @@ struct prctl_mm_map {
/* Memory deny write / execute */
#define PR_SET_MDWE 65
# define PR_MDWE_REFUSE_EXEC_GAIN 1
+# define PR_MDWE_NO_INHERIT 2
#define PR_GET_MDWE 66
--
2.40.1.495.gc816e09b53d-goog
On Thu, May 4, 2023 at 7:10 PM Florent Revest <[email protected]> wrote:
>
> I checked with the original author, the mmap_FIXED test case wasn't
> properly tested and fails. Currently, it maps two consecutive (non
> overlapping) pages and expects the second mapping to be denied by MDWE
> but these two pages have nothing to do with each other so MDWE is
> actually out of the picture here.
>
> What the test actually intended to do was to remap a virtual address
> using MAP_FIXED. However, this operation unmaps the existing mapping and
> creates a new one so the va is backed by a new page and MDWE is again
> out of the picture, all remappings should succeed.
>
> This patch keeps the test case to make it clear that this situation is
> expected to work.
>
> Signed-off-by: Florent Revest <[email protected]>
Ah, snap, I hit send to fast and forgot to add:
Fixes: 4cf1fe34fd18 ("kselftest: vm: add tests for memory-deny-write-execute")
I will do it in the next iteration (surely there'll be other things to
fix... :))
On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> current with a flag that prevents pages that were previously not executable from
> becoming executable.
> This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
>
> At Google, we've been using a somewhat similar downstream patch for a few years
> now. To make the adoption of this feature easier, we've had it support a mode in
> which the W^X flag does not propagate to children. For example, this is handy if
> a C process which wants W^X protection suspects it could start children
> processes that would use a JIT.
>
> I'd like to align our features with the upstream prctl. This series proposes a
> new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> sets a different flag in current that is not in MMF_INIT_MASK and which does not
> propagate.
I don't think I have enough context, so sorry if I'm going to ask a naive
question..
I can understand how current MDWE helps on not allowing any modifi-able
content from becoming executable. How could NO_INHERIT help if it won't
inherit and not in MMF_INIT_MASK?
IIUC it means the restriction will only apply to the current process. Then
I assume the process can escape from this rule simply by a fork(). If so,
what's the point to protect at all?
And, what's the difference of this comparing to disabling MDWE after being
enabled (which seems to be forbidden for now, but it seems fork() can play
a similar role of disabling it)?
Thanks,
--
Peter Xu
On Thu, May 4, 2023 at 10:06 PM Peter Xu <[email protected]> wrote:
>
> On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > current with a flag that prevents pages that were previously not executable from
> > becoming executable.
> > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> >
> > At Google, we've been using a somewhat similar downstream patch for a few years
> > now. To make the adoption of this feature easier, we've had it support a mode in
> > which the W^X flag does not propagate to children. For example, this is handy if
> > a C process which wants W^X protection suspects it could start children
> > processes that would use a JIT.
> >
> > I'd like to align our features with the upstream prctl. This series proposes a
> > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > propagate.
>
> I don't think I have enough context, so sorry if I'm going to ask a naive
> question..
Not at all! :) You're absolutely right, it's important to address these points.
> I can understand how current MDWE helps on not allowing any modifi-able
> content from becoming executable. How could NO_INHERIT help if it won't
> inherit and not in MMF_INIT_MASK?
The way I see it, enabling MDWE is just a small step towards hardening
a binary anyway. It can possibly make exploitation a bit harder in the
case where the attacker has _just_: a write primitive they can use to
write a shellcode somewhere and a primitive to make that page
executable later. It's a fairly narrow protection already and I think
it only really helps as part of a broader "defense in depth" strategy.
> IIUC it means the restriction will only apply to the current process. Then
> I assume the process can escape from this rule simply by a fork(). If so,
> what's the point to protect at all?
If we assume enough control from the attacker, then MDWE is already
useless since it can be bypassed by writing to a file and then
mmapping that file with PROT_EXEC. I think that's a good example of
how "perfect can be the enemy of good" in security hardening. MDWE
isn't a silver-bullet but it's a cheap trick and it makes a small dent
in reducing the attack surface so it seems worth having anyway ?
But indeed, to address your question, if you choose to use this
NO_INHERIT flag: you're no longer protected if the attacker can fork()
as part of their exploitation. I think it's been a useful trade-off
for our internal users since, on the other hand, it also makes
adoption a lot easier: our C++ services developers can trivially opt
into a potpourri of hardening features without having to think too
much about how they work under-the-hood. The default behavior has been
to use a NO_INHERIT strategy so users don't get bad surprises the day
when they try to spawn a JITted subcommand. In the meantime, their C++
service still gets a little bit of extra protection.
> And, what's the difference of this comparing to disabling MDWE after being
> enabled (which seems to be forbidden for now, but it seems fork() can play
> a similar role of disabling it)?
That would be functionally somewhat similar, yes. I think it mostly
comes down to ease of adoption. I imagine that users who would opt
into NO_INHERIT are those who are interested in MDWE for the binary
they are writing but aren't 100% confident in what subprocesses they
will run and so they don't have to think about disabling it after
every fork.
On Thu, May 04, 2023 at 07:09:41PM +0200, Florent Revest wrote:
> This extends the current PR_SET_MDWE prctl arg with a bit to indicate
> that the process doesn't want MDWE protection to propagate to children.
>
> To implement this no-inherit mode, the tag in current->mm->flags must be
> absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
> without inherit" is different in the prctl than in the mm flags. This
> leads to a bit of bit-mangling in the prctl implementation.
That bit mangling is not that bad but it complicates the code a bit,
especially if we'll add new bits in the future. We also need to check
both the original and the no-inherit bits for each feature.
Another question is whether we want to support more fine-grained
inheriting or just a big knob that disables inheriting for all the
(future) MDWE flags.
I think a somewhat simpler way would be to clear the flags on fork(),
either based on a big MMF_HAS_MDWE_NO_INHERIT knob or individual ones.
Something like below (completely untested):
diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
index 0ee96ea7a0e9..ca83a0c8d19c 100644
--- a/include/linux/sched/coredump.h
+++ b/include/linux/sched/coredump.h
@@ -91,4 +91,12 @@ static inline int get_dumpable(struct mm_struct *mm)
MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK)
#define MMF_VM_MERGE_ANY 29
+
+#define MMF_INIT_FLAGS(flags) ({ \
+ unsigned long new_flags = flags; \
+ if (new_flags & (1UL << MMF_HAS_MDWE_NO_INHERIT)) \
+ new_flags &= ~(1UL << MMF_HAS_MDWE_MASK); \
+ new_flags & MMF_INIT_MASK; \
+})
+
#endif /* _LINUX_SCHED_COREDUMP_H */
diff --git a/kernel/fork.c b/kernel/fork.c
index ed4e01daccaa..53f0b68a5451 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1288,7 +1288,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
hugetlb_count_init(mm);
if (current->mm) {
- mm->flags = current->mm->flags & MMF_INIT_MASK;
+ mm->flags = MMF_INIT_FLAGS(current->mm->flags);
mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
} else {
mm->flags = default_dump_filter;
The checks in MMF_INIT_FLAGS() can grow in time if we add more bits in
there but we still only keep a single flag that determines whether the
feature is enabled (maybe that's more like bikeshedding at this moment
when we have a single bit).
(fun remark: I see you cc'ed [email protected]'; that's not a real person, it's
what our IT folk asked us to add on cc so that the Exchange server
doesn't append the legal disclaimer; most lists are covered already
without such cc but I guess people feel safer to add it, just in case)
--
Catalin
On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> On Thu, May 4, 2023 at 10:06 PM Peter Xu <[email protected]> wrote:
> >
> > On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > > current with a flag that prevents pages that were previously not executable from
> > > becoming executable.
> > > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> > >
> > > At Google, we've been using a somewhat similar downstream patch for a few years
> > > now. To make the adoption of this feature easier, we've had it support a mode in
> > > which the W^X flag does not propagate to children. For example, this is handy if
> > > a C process which wants W^X protection suspects it could start children
> > > processes that would use a JIT.
> > >
> > > I'd like to align our features with the upstream prctl. This series proposes a
> > > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > > propagate.
> >
> > I don't think I have enough context, so sorry if I'm going to ask a naive
> > question..
>
> Not at all! :) You're absolutely right, it's important to address these points.
>
> > I can understand how current MDWE helps on not allowing any modifi-able
> > content from becoming executable. How could NO_INHERIT help if it won't
> > inherit and not in MMF_INIT_MASK?
>
> The way I see it, enabling MDWE is just a small step towards hardening
> a binary anyway. It can possibly make exploitation a bit harder in the
> case where the attacker has _just_: a write primitive they can use to
> write a shellcode somewhere and a primitive to make that page
> executable later. It's a fairly narrow protection already and I think
> it only really helps as part of a broader "defense in depth" strategy.
>
> > IIUC it means the restriction will only apply to the current process. Then
> > I assume the process can escape from this rule simply by a fork(). If so,
> > what's the point to protect at all?
>
> If we assume enough control from the attacker, then MDWE is already
> useless since it can be bypassed by writing to a file and then
> mmapping that file with PROT_EXEC. I think that's a good example of
> how "perfect can be the enemy of good" in security hardening. MDWE
> isn't a silver-bullet but it's a cheap trick and it makes a small dent
> in reducing the attack surface so it seems worth having anyway ?
>
> But indeed, to address your question, if you choose to use this
> NO_INHERIT flag: you're no longer protected if the attacker can fork()
> as part of their exploitation. I think it's been a useful trade-off
> for our internal users since, on the other hand, it also makes
> adoption a lot easier: our C++ services developers can trivially opt
> into a potpourri of hardening features without having to think too
> much about how they work under-the-hood. The default behavior has been
> to use a NO_INHERIT strategy so users don't get bad surprises the day
> when they try to spawn a JITted subcommand. In the meantime, their C++
> service still gets a little bit of extra protection.
>
> > And, what's the difference of this comparing to disabling MDWE after being
> > enabled (which seems to be forbidden for now, but it seems fork() can play
> > a similar role of disabling it)?
>
> That would be functionally somewhat similar, yes. I think it mostly
> comes down to ease of adoption. I imagine that users who would opt
> into NO_INHERIT are those who are interested in MDWE for the binary
> they are writing but aren't 100% confident in what subprocesses they
> will run and so they don't have to think about disabling it after
> every fork.
Okay, that makes sense to me. Thanks.
Since the original MDWE was for systemd, I'm wondering what will happen if
some program like what you said is invoked by systemd and with MDWE enabled
already.
Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
enabled process, but then it makes me think whether it makes more sense to
allow converting MDWE->MDWE_NO_INHERIT in this case. It seems to provide a
most broad coverage on system daemons using MDWE starting from systemd
initial process, meanwhile allows specific daemons to fork into anything
like a JIT process so it can make itself NO_INHERIT. Attackers won't
leverage this because MDWE_NO_INHERIT also means MDWE enabled.
--
Peter Xu
On Mon, May 8, 2023 at 3:29 AM Peter Xu <[email protected]> wrote:
>
> On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> > On Thu, May 4, 2023 at 10:06 PM Peter Xu <[email protected]> wrote:
> > >
> > > On Thu, May 04, 2023 at 07:09:38PM +0200, Florent Revest wrote:
> > > > Joey recently introduced a Memory-Deny-Write-Executable (MDWE) prctl which tags
> > > > current with a flag that prevents pages that were previously not executable from
> > > > becoming executable.
> > > > This tag always gets inherited by children tasks. (it's in MMF_INIT_MASK)
> > > >
> > > > At Google, we've been using a somewhat similar downstream patch for a few years
> > > > now. To make the adoption of this feature easier, we've had it support a mode in
> > > > which the W^X flag does not propagate to children. For example, this is handy if
> > > > a C process which wants W^X protection suspects it could start children
> > > > processes that would use a JIT.
> > > >
> > > > I'd like to align our features with the upstream prctl. This series proposes a
> > > > new NO_INHERIT flag to the MDWE prctl to make this kind of adoption easier. It
> > > > sets a different flag in current that is not in MMF_INIT_MASK and which does not
> > > > propagate.
> > >
> > > I don't think I have enough context, so sorry if I'm going to ask a naive
> > > question..
> >
> > Not at all! :) You're absolutely right, it's important to address these points.
> >
> > > I can understand how current MDWE helps on not allowing any modifi-able
> > > content from becoming executable. How could NO_INHERIT help if it won't
> > > inherit and not in MMF_INIT_MASK?
> >
> > The way I see it, enabling MDWE is just a small step towards hardening
> > a binary anyway. It can possibly make exploitation a bit harder in the
> > case where the attacker has _just_: a write primitive they can use to
> > write a shellcode somewhere and a primitive to make that page
> > executable later. It's a fairly narrow protection already and I think
> > it only really helps as part of a broader "defense in depth" strategy.
> >
> > > IIUC it means the restriction will only apply to the current process. Then
> > > I assume the process can escape from this rule simply by a fork(). If so,
> > > what's the point to protect at all?
> >
> > If we assume enough control from the attacker, then MDWE is already
> > useless since it can be bypassed by writing to a file and then
> > mmapping that file with PROT_EXEC. I think that's a good example of
> > how "perfect can be the enemy of good" in security hardening. MDWE
> > isn't a silver-bullet but it's a cheap trick and it makes a small dent
> > in reducing the attack surface so it seems worth having anyway ?
> >
> > But indeed, to address your question, if you choose to use this
> > NO_INHERIT flag: you're no longer protected if the attacker can fork()
> > as part of their exploitation. I think it's been a useful trade-off
> > for our internal users since, on the other hand, it also makes
> > adoption a lot easier: our C++ services developers can trivially opt
> > into a potpourri of hardening features without having to think too
> > much about how they work under-the-hood. The default behavior has been
> > to use a NO_INHERIT strategy so users don't get bad surprises the day
> > when they try to spawn a JITted subcommand. In the meantime, their C++
> > service still gets a little bit of extra protection.
> >
> > > And, what's the difference of this comparing to disabling MDWE after being
> > > enabled (which seems to be forbidden for now, but it seems fork() can play
> > > a similar role of disabling it)?
> >
> > That would be functionally somewhat similar, yes. I think it mostly
> > comes down to ease of adoption. I imagine that users who would opt
> > into NO_INHERIT are those who are interested in MDWE for the binary
> > they are writing but aren't 100% confident in what subprocesses they
> > will run and so they don't have to think about disabling it after
> > every fork.
>
> Okay, that makes sense to me. Thanks.
>
> Since the original MDWE was for systemd, I'm wondering what will happen if
> some program like what you said is invoked by systemd and with MDWE enabled
> already.
Good question
> Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
> enabled process,
Yes, I tried to stay close to the spirit of the existing logic (which
doesn't allow any sort of privilege gains) but this is not
particularly a requirement on our side so I'm quite flexible here.
Maybe Joey has an input here ?
> but then it makes me think whether it makes more sense to
> allow converting MDWE->MDWE_NO_INHERIT in this case. It seems to provide a
> most broad coverage on system daemons using MDWE starting from systemd
> initial process, meanwhile allows specific daemons to fork into anything
> like a JIT process so it can make itself NO_INHERIT. Attackers won't
> leverage this because MDWE_NO_INHERIT also means MDWE enabled.
I should have cc-ed systemd folks who could have opinions on this. I
will for v2.
+ cc Topi & Lennart
On Fri, May 5, 2023 at 8:34 PM Catalin Marinas <[email protected]> wrote:
>
> On Thu, May 04, 2023 at 07:09:41PM +0200, Florent Revest wrote:
> > This extends the current PR_SET_MDWE prctl arg with a bit to indicate
> > that the process doesn't want MDWE protection to propagate to children.
> >
> > To implement this no-inherit mode, the tag in current->mm->flags must be
> > absent from MMF_INIT_MASK. This means that the encoding for "MDWE but
> > without inherit" is different in the prctl than in the mm flags. This
> > leads to a bit of bit-mangling in the prctl implementation.
>
> That bit mangling is not that bad but it complicates the code a bit,
> especially if we'll add new bits in the future. We also need to check
> both the original and the no-inherit bits for each feature.
I agree :)
> Another question is whether we want to support more fine-grained
> inheriting or just a big knob that disables inheriting for all the
> (future) MDWE flags.
Yep, I can't think of a more fine-grained inheritance model off the
top of my head but it would be good to have a sane base for when the
need arises.
> I think a somewhat simpler way would be to clear the flags on fork(),
> either based on a big MMF_HAS_MDWE_NO_INHERIT knob or individual ones.
> Something like below (completely untested):
>
> diff --git a/include/linux/sched/coredump.h b/include/linux/sched/coredump.h
> index 0ee96ea7a0e9..ca83a0c8d19c 100644
> --- a/include/linux/sched/coredump.h
> +++ b/include/linux/sched/coredump.h
> @@ -91,4 +91,12 @@ static inline int get_dumpable(struct mm_struct *mm)
> MMF_DISABLE_THP_MASK | MMF_HAS_MDWE_MASK)
>
> #define MMF_VM_MERGE_ANY 29
> +
> +#define MMF_INIT_FLAGS(flags) ({ \
> + unsigned long new_flags = flags; \
> + if (new_flags & (1UL << MMF_HAS_MDWE_NO_INHERIT)) \
> + new_flags &= ~(1UL << MMF_HAS_MDWE_MASK); \
> + new_flags & MMF_INIT_MASK; \
> +})
> +
> #endif /* _LINUX_SCHED_COREDUMP_H */
> diff --git a/kernel/fork.c b/kernel/fork.c
> index ed4e01daccaa..53f0b68a5451 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1288,7 +1288,7 @@ static struct mm_struct *mm_init(struct mm_struct *mm, struct task_struct *p,
> hugetlb_count_init(mm);
>
> if (current->mm) {
> - mm->flags = current->mm->flags & MMF_INIT_MASK;
> + mm->flags = MMF_INIT_FLAGS(current->mm->flags);
> mm->def_flags = current->mm->def_flags & VM_INIT_DEF_MASK;
> } else {
> mm->flags = default_dump_filter;
>
> The checks in MMF_INIT_FLAGS() can grow in time if we add more bits in
> there but we still only keep a single flag that determines whether the
> feature is enabled (maybe that's more like bikeshedding at this moment
> when we have a single bit).
Sounds good!
I had considered something like this but I was afraid I'd spill too
much logic into fork.c... I didn't think of making it a neat macro in
coredump.h. That's a good point, I'll do this in v2.
> (fun remark: I see you cc'ed [email protected]'; that's not a real person, it's
> what our IT folk asked us to add on cc so that the Exchange server
> doesn't append the legal disclaimer; most lists are covered already
> without such cc but I guess people feel safer to add it, just in case)
Ahah! I mostly just copied the cc list from Joey's series. I remember
wondering why I couldn't find any patch sent by this mysterious ND but
I thought that if they got such a cool username, surely they must have
been at ARM since the early days and have some important role... :)
Then... mister nd won't get to see my v2! Thanks for the heads up.
On Mon, May 08, 2023 at 02:12:21PM +0200, Florent Revest wrote:
> On Mon, May 8, 2023 at 3:29 AM Peter Xu <[email protected]> wrote:
> > On Fri, May 05, 2023 at 06:42:08PM +0200, Florent Revest wrote:
> > > On Thu, May 4, 2023 at 10:06 PM Peter Xu <[email protected]> wrote:
> > > > And, what's the difference of this comparing to disabling MDWE after being
> > > > enabled (which seems to be forbidden for now, but it seems fork() can play
> > > > a similar role of disabling it)?
> > >
> > > That would be functionally somewhat similar, yes. I think it mostly
> > > comes down to ease of adoption. I imagine that users who would opt
> > > into NO_INHERIT are those who are interested in MDWE for the binary
> > > they are writing but aren't 100% confident in what subprocesses they
> > > will run and so they don't have to think about disabling it after
> > > every fork.
> >
> > Okay, that makes sense to me. Thanks.
> >
> > Since the original MDWE was for systemd, I'm wondering what will happen if
> > some program like what you said is invoked by systemd and with MDWE enabled
> > already.
>
> Good question
I think JITs work around this by creating two separate mappings of the
same pages, one RX and the other RW (rather than toggling the permission
with mprotect()). I had the impression Chromium can use memfd to do
something similar but I never checked.
> > Currently in your patch IIUC MDWE_NO_INHERIT will fail directly on MDWE
> > enabled process,
>
> Yes, I tried to stay close to the spirit of the existing logic (which
> doesn't allow any sort of privilege gains) but this is not
> particularly a requirement on our side so I'm quite flexible here.
I think we should keep the original behaviour of systemd here, otherwise
they won't transition to the new interface and keep using the SECCOMP
BPF approach (which, in addition, prevents glibc from setting PROT_BTI
on an already executable mapping).
To me MDWE is not about preventing JITs but rather ensuring buggy
programs don't end up with WX mappings. We ended up this way because of
the SECCOMP BPF limitations (just guessing, I haven't been involved in
its design). With a no-inherit MDWE, one can introduce an additional
policy for systemd. It would be a sysadmin decision which one to enable
and maybe current (inherit) MDWE will disappear in time.
x86 has protection keys and arm64 will soon have permission overlays
that allow user-space to toggle between RX and RW (Joey is looking at
the arm64 support). I'm not sure how we'll end up implemented this on
arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
in the way as the base page table permission will probably need
PROT_WRITE|PROT_EXEC.
--
Catalin
On 8.5.2023 17.10, Catalin Marinas wrote:
> I think we should keep the original behaviour of systemd here, otherwise
> they won't transition to the new interface and keep using the SECCOMP
> BPF approach (which, in addition, prevents glibc from setting PROT_BTI
> on an already executable mapping).
Systemd has transitioned to prctl(PR_SET_MDWE) method since release of
v253, so the original behaviour definitely should be kept.
> To me MDWE is not about preventing JITs but rather ensuring buggy
> programs don't end up with WX mappings. We ended up this way because of
> the SECCOMP BPF limitations (just guessing, I haven't been involved in
> its design). With a no-inherit MDWE, one can introduce an additional
> policy for systemd. It would be a sysadmin decision which one to enable
> and maybe current (inherit) MDWE will disappear in time.
There could be a new setting for this, like
MemoryDenyWriteExecute=no-inherit. I'd only use it for those special
cases where MemoryDenyWriteExecute=yes can't be used.
> x86 has protection keys and arm64 will soon have permission overlays
> that allow user-space to toggle between RX and RW (Joey is looking at
> the arm64 support). I'm not sure how we'll end up implemented this on
> arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
> in the way as the base page table permission will probably need
> PROT_WRITE|PROT_EXEC.
Wouldn't those features defeat any gains from MDWE? The features
probably should be forbidden with MemoryDenyWriteExecute=yes.
-Topi
On Mon, May 08, 2023 at 08:21:16PM +0300, Topi Miettinen wrote:
> On 8.5.2023 17.10, Catalin Marinas wrote:
> > I think we should keep the original behaviour of systemd here, otherwise
> > they won't transition to the new interface and keep using the SECCOMP
> > BPF approach (which, in addition, prevents glibc from setting PROT_BTI
> > on an already executable mapping).
>
> Systemd has transitioned to prctl(PR_SET_MDWE) method since release of v253,
> so the original behaviour definitely should be kept.
That's great. So yes, no ABI changes allowed anymore.
> > x86 has protection keys and arm64 will soon have permission overlays
> > that allow user-space to toggle between RX and RW (Joey is looking at
> > the arm64 support). I'm not sure how we'll end up implemented this on
> > arm64 (and haven't looked at x86) but I have a suspicion MDWE will get
> > in the way as the base page table permission will probably need
> > PROT_WRITE|PROT_EXEC.
>
> Wouldn't those features defeat any gains from MDWE? The features probably
> should be forbidden with MemoryDenyWriteExecute=yes.
The permission overlays (controlled by the user) can only further
restrict the mmap() permissions. So MDWE would still work as expected.
If one wants to toggle between RW and RX with overlays, the overall
mmap() needs to be RWX and it won't work if MDWE=yes. No need to
explicitly disable the overlays feature.
On arm64 at least, with the introduction of permission overlays we also
have the notion of "Read, Execute if not Write". This permission
automatically disables Exec if the mapping becomes writable (overlays
can disable writable, allowing exec). We could have a new MDWE policy
which allows this, though I'm not that keen on using it in Linux since
background permission changes done by the kernel can lead to an
unexpected executable permission (e.g. marking a page read-only for
clean/dirty tracking or in preparation for CoW after fork()).
--
Catalin