2023-04-18 21:08:53

by Waiman Long

[permalink] [raw]
Subject: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

One of the flags of mmap(2) is MAP_STACK to request a memory segment
suitable for a process or thread stack. The kernel currently ignores
this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
selinux has an execstack check in selinux_file_mprotect() which disallows
a stack VMA to be made executable.

Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
with an adjacent anonymous VMA. With that merging, using mprotect(2)
to change a part of the merged anonymous VMA to make it executable may
fail. This can lead to sporadic failure of applications that need to
make those changes.

One possible fix is to make sure that a stack VMA will not be merged
with a non-stack anonymous VMA. That requires a vm flag that can be
used to distinguish a stack VMA from a regular anonymous VMA. One
can add a new dummy vm flag for that purpose. However, there is only
1 bit left in the lower 32 bits of vm_flags. Another alternative is
to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
can certainly be used for this purpose. The downside is that it is a
slight change in existing behavior.

Making a stack VMA growable by default certainly fits the need of a
process or thread stack. This patch now maps MAP_STACK to VM_STACK to
prevent unwanted merging with adjacent non-stack VMAs and make the VMA
more suitable for being used as a stack.

Reported-by: Joe Mario <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/mman.h | 1 +
1 file changed, 1 insertion(+)

diff --git a/include/linux/mman.h b/include/linux/mman.h
index cee1e4b566d8..4b621d30ace9 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
_calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
+ _calc_vm_trans(flags, MAP_STACK, VM_STACK ) |
arch_calc_vm_flag_bits(flags);
}

--
2.31.1


2023-04-18 21:24:45

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:

> One of the flags of mmap(2) is MAP_STACK to request a memory segment
> suitable for a process or thread stack. The kernel currently ignores
> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> selinux has an execstack check in selinux_file_mprotect() which disallows
> a stack VMA to be made executable.
>
> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> with an adjacent anonymous VMA. With that merging, using mprotect(2)
> to change a part of the merged anonymous VMA to make it executable may
> fail. This can lead to sporadic failure of applications that need to
> make those changes.

"Sporadic failure of applications" sounds quite serious. Can you
provide more details?

Did you consider a -stable backport? I'm unable to judge, because the
description of the userspace effects is so thin,

> One possible fix is to make sure that a stack VMA will not be merged
> with a non-stack anonymous VMA. That requires a vm flag that can be
> used to distinguish a stack VMA from a regular anonymous VMA. One
> can add a new dummy vm flag for that purpose. However, there is only
> 1 bit left in the lower 32 bits of vm_flags. Another alternative is
> to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
> can certainly be used for this purpose. The downside is that it is a
> slight change in existing behavior.
>
> Making a stack VMA growable by default certainly fits the need of a
> process or thread stack. This patch now maps MAP_STACK to VM_STACK to
> prevent unwanted merging with adjacent non-stack VMAs and make the VMA
> more suitable for being used as a stack.
>
> ...
>
> --- a/include/linux/mman.h
> +++ b/include/linux/mman.h
> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) |
> arch_calc_vm_flag_bits(flags);
> }

The mmap(2) manpage says

This flag is currently a no-op on Linux. However, by employing
this flag, applications can ensure that they transparently ob- tain
support if the flag is implemented in the future. Thus, it is used
in the glibc threading implementation to allow for the fact that some
architectures may (later) require special treat- ment for stack
allocations. A further reason to employ this flag is portability:
MAP_STACK exists (and has an effect) on some other systems (e.g.,
some of the BSDs).

so please propose an update for this?

2023-04-19 01:41:22

by Hugh Dickins

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Tue, 18 Apr 2023, Waiman Long wrote:
> On 4/18/23 17:18, Andrew Morton wrote:
> > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
> >
> >> One of the flags of mmap(2) is MAP_STACK to request a memory segment
> >> suitable for a process or thread stack. The kernel currently ignores
> >> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> >> selinux has an execstack check in selinux_file_mprotect() which disallows
> >> a stack VMA to be made executable.
> >>
> >> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> >> with an adjacent anonymous VMA. With that merging, using mprotect(2)
> >> to change a part of the merged anonymous VMA to make it executable may
> >> fail. This can lead to sporadic failure of applications that need to
> >> make those changes.
> > "Sporadic failure of applications" sounds quite serious. Can you
> > provide more details?
>
> The problem boils down to the fact that it is possible for user code to mmap a
> region of memory and then for the kernel to merge the VMA for that memory with
> the VMA for one of the application's thread stacks. This is causing random
> SEGVs with one of our large customer application.
>
> At a high level, this is what's happening:
>
>  1) App runs creating lots of threads.
>  2) It mmap's 256K pages of anonymous memory.
>  3) It writes executable code to that memory.
>  4) It calls mprotect() with PROT_EXEC on that memory so
>     it can subsequently execute the code.
>
> The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> VMA for one of the thread stacks.  That's because the default RHEL SELinux
> policy is to not allow executable stacks.

Then wouldn't the bug be at the SELinux end? VMAs may have been merged
already, but the mprotect() with PROT_EXEC of the good non-stack range
will then split that area off from the stack again - maybe the SELinux
check does not understand that must happen?

Hugh

2023-04-19 01:54:37

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK


On 4/18/23 17:18, Andrew Morton wrote:
> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
>
>> One of the flags of mmap(2) is MAP_STACK to request a memory segment
>> suitable for a process or thread stack. The kernel currently ignores
>> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
>> selinux has an execstack check in selinux_file_mprotect() which disallows
>> a stack VMA to be made executable.
>>
>> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
>> with an adjacent anonymous VMA. With that merging, using mprotect(2)
>> to change a part of the merged anonymous VMA to make it executable may
>> fail. This can lead to sporadic failure of applications that need to
>> make those changes.
> "Sporadic failure of applications" sounds quite serious. Can you
> provide more details?

The problem boils down to the fact that it is possible for user code to
mmap a region of memory and then for the kernel to merge the VMA for
that memory with the VMA for one of the application's thread stacks.
This is causing random SEGVs with one of our large customer application.

At a high level, this is what's happening:

 1) App runs creating lots of threads.
 2) It mmap's 256K pages of anonymous memory.
 3) It writes executable code to that memory.
 4) It calls mprotect() with PROT_EXEC on that memory so
    it can subsequently execute the code.

The above mprotect() will fail if the mmap'd region's VMA gets merged
with the VMA for one of the thread stacks.  That's because the default
RHEL SELinux policy is to not allow executable stacks.

>
> Did you consider a -stable backport? I'm unable to judge, because the
> description of the userspace effects is so thin,

Yes, stable backport can be considered.


>
>> One possible fix is to make sure that a stack VMA will not be merged
>> with a non-stack anonymous VMA. That requires a vm flag that can be
>> used to distinguish a stack VMA from a regular anonymous VMA. One
>> can add a new dummy vm flag for that purpose. However, there is only
>> 1 bit left in the lower 32 bits of vm_flags. Another alternative is
>> to use an existing vm flag. VM_STACK (= VM_GROWSDOWN for most arches)
>> can certainly be used for this purpose. The downside is that it is a
>> slight change in existing behavior.
>>
>> Making a stack VMA growable by default certainly fits the need of a
>> process or thread stack. This patch now maps MAP_STACK to VM_STACK to
>> prevent unwanted merging with adjacent non-stack VMAs and make the VMA
>> more suitable for being used as a stack.
>>
>> ...
>>
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
>> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
>> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
>> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) |
>> arch_calc_vm_flag_bits(flags);
>> }
> The mmap(2) manpage says
>
> This flag is currently a no-op on Linux. However, by employing
> this flag, applications can ensure that they transparently ob- tain
> support if the flag is implemented in the future. Thus, it is used
> in the glibc threading implementation to allow for the fact that some
> architectures may (later) require special treat- ment for stack
> allocations. A further reason to employ this flag is portability:
> MAP_STACK exists (and has an effect) on some other systems (e.g.,
> some of the BSDs).
>
> so please propose an update for this?

OK, will do.

Thanks,
Longman

2023-04-19 01:58:19

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK


On 4/18/23 21:36, Hugh Dickins wrote:
> On Tue, 18 Apr 2023, Waiman Long wrote:
>> On 4/18/23 17:18, Andrew Morton wrote:
>>> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
>>>
>>>> One of the flags of mmap(2) is MAP_STACK to request a memory segment
>>>> suitable for a process or thread stack. The kernel currently ignores
>>>> this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
>>>> selinux has an execstack check in selinux_file_mprotect() which disallows
>>>> a stack VMA to be made executable.
>>>>
>>>> Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
>>>> with an adjacent anonymous VMA. With that merging, using mprotect(2)
>>>> to change a part of the merged anonymous VMA to make it executable may
>>>> fail. This can lead to sporadic failure of applications that need to
>>>> make those changes.
>>> "Sporadic failure of applications" sounds quite serious. Can you
>>> provide more details?
>> The problem boils down to the fact that it is possible for user code to mmap a
>> region of memory and then for the kernel to merge the VMA for that memory with
>> the VMA for one of the application's thread stacks. This is causing random
>> SEGVs with one of our large customer application.
>>
>> At a high level, this is what's happening:
>>
>>  1) App runs creating lots of threads.
>>  2) It mmap's 256K pages of anonymous memory.
>>  3) It writes executable code to that memory.
>>  4) It calls mprotect() with PROT_EXEC on that memory so
>>     it can subsequently execute the code.
>>
>> The above mprotect() will fail if the mmap'd region's VMA gets merged with the
>> VMA for one of the thread stacks.  That's because the default RHEL SELinux
>> policy is to not allow executable stacks.
> Then wouldn't the bug be at the SELinux end? VMAs may have been merged
> already, but the mprotect() with PROT_EXEC of the good non-stack range
> will then split that area off from the stack again - maybe the SELinux
> check does not understand that must happen?

The SELinux check is done per VMA, not a region within a VMA. After VMA
merging, SELinux is probably not able to determine which part of a VMA
is a stack unless we keep that information somewhere and provide an API
for SELinux to query. That can be quite a lot of work. So the easiest
way to prevent this problem is to avoid merging a stack VMA with a
regular anonymous VMA.

Cheers,
Longman

2023-04-19 03:27:41

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote:
>
> On 4/18/23 21:36, Hugh Dickins wrote:
> > On Tue, 18 Apr 2023, Waiman Long wrote:
> > > On 4/18/23 17:18, Andrew Morton wrote:
> > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
> > > >
> > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment
> > > > > suitable for a process or thread stack. The kernel currently ignores
> > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> > > > > selinux has an execstack check in selinux_file_mprotect() which disallows
> > > > > a stack VMA to be made executable.
> > > > >
> > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2)
> > > > > to change a part of the merged anonymous VMA to make it executable may
> > > > > fail. This can lead to sporadic failure of applications that need to
> > > > > make those changes.
> > > > "Sporadic failure of applications" sounds quite serious. Can you
> > > > provide more details?
> > > The problem boils down to the fact that it is possible for user code to mmap a
> > > region of memory and then for the kernel to merge the VMA for that memory with
> > > the VMA for one of the application's thread stacks. This is causing random
> > > SEGVs with one of our large customer application.
> > >
> > > At a high level, this is what's happening:
> > >
> > > ?1) App runs creating lots of threads.
> > > ?2) It mmap's 256K pages of anonymous memory.
> > > ?3) It writes executable code to that memory.
> > > ?4) It calls mprotect() with PROT_EXEC on that memory so
> > > ??? it can subsequently execute the code.
> > >
> > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> > > VMA for one of the thread stacks.? That's because the default RHEL SELinux
> > > policy is to not allow executable stacks.
> > Then wouldn't the bug be at the SELinux end? VMAs may have been merged
> > already, but the mprotect() with PROT_EXEC of the good non-stack range
> > will then split that area off from the stack again - maybe the SELinux
> > check does not understand that must happen?
>
> The SELinux check is done per VMA, not a region within a VMA. After VMA
> merging, SELinux is probably not able to determine which part of a VMA is a
> stack unless we keep that information somewhere and provide an API for
> SELinux to query. That can be quite a lot of work. So the easiest way to
> prevent this problem is to avoid merging a stack VMA with a regular
> anonymous VMA.

To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it".

Cc'ing the SELinux people so it can be fixed properly.

2023-04-19 03:49:53

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
> ?1) App runs creating lots of threads.
> ?2) It mmap's 256K pages of anonymous memory.
> ?3) It writes executable code to that memory.
> ?4) It calls mprotect() with PROT_EXEC on that memory so
> ??? it can subsequently execute the code.
>
> The above mprotect() will fail if the mmap'd region's VMA gets merged with
> the VMA for one of the thread stacks.? That's because the default RHEL
> SELinux policy is to not allow executable stacks.

By the way, this is a daft policy. The policy you really want is
EXEC|WRITE is not allowed. A non-writable stack is useless, so it's
actually a superset of your current policy. Forbidding _simultaneous_
write and executable is just good programming. This way, you don't need
to care about the underlying VMA's current permissions, you just need
to do:

if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
return -EACCESS;

2023-04-19 14:40:40

by Paul Moore

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Tue, Apr 18, 2023 at 11:24 PM Matthew Wilcox <[email protected]> wrote:
> On Tue, Apr 18, 2023 at 09:45:34PM -0400, Waiman Long wrote:
> > On 4/18/23 21:36, Hugh Dickins wrote:
> > > On Tue, 18 Apr 2023, Waiman Long wrote:
> > > > On 4/18/23 17:18, Andrew Morton wrote:
> > > > > On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
> > > > >
> > > > > > One of the flags of mmap(2) is MAP_STACK to request a memory segment
> > > > > > suitable for a process or thread stack. The kernel currently ignores
> > > > > > this flags. Glibc uses MAP_STACK when mmapping a thread stack. However,
> > > > > > selinux has an execstack check in selinux_file_mprotect() which disallows
> > > > > > a stack VMA to be made executable.
> > > > > >
> > > > > > Since MAP_STACK is a noop, it is possible for a stack VMA to be merged
> > > > > > with an adjacent anonymous VMA. With that merging, using mprotect(2)
> > > > > > to change a part of the merged anonymous VMA to make it executable may
> > > > > > fail. This can lead to sporadic failure of applications that need to
> > > > > > make those changes.
> > > > > "Sporadic failure of applications" sounds quite serious. Can you
> > > > > provide more details?
> > > > The problem boils down to the fact that it is possible for user code to mmap a
> > > > region of memory and then for the kernel to merge the VMA for that memory with
> > > > the VMA for one of the application's thread stacks. This is causing random
> > > > SEGVs with one of our large customer application.
> > > >
> > > > At a high level, this is what's happening:
> > > >
> > > > 1) App runs creating lots of threads.
> > > > 2) It mmap's 256K pages of anonymous memory.
> > > > 3) It writes executable code to that memory.
> > > > 4) It calls mprotect() with PROT_EXEC on that memory so
> > > > it can subsequently execute the code.
> > > >
> > > > The above mprotect() will fail if the mmap'd region's VMA gets merged with the
> > > > VMA for one of the thread stacks. That's because the default RHEL SELinux
> > > > policy is to not allow executable stacks.
> > > Then wouldn't the bug be at the SELinux end? VMAs may have been merged
> > > already, but the mprotect() with PROT_EXEC of the good non-stack range
> > > will then split that area off from the stack again - maybe the SELinux
> > > check does not understand that must happen?
> >
> > The SELinux check is done per VMA, not a region within a VMA. After VMA
> > merging, SELinux is probably not able to determine which part of a VMA is a
> > stack unless we keep that information somewhere and provide an API for
> > SELinux to query. That can be quite a lot of work. So the easiest way to
> > prevent this problem is to avoid merging a stack VMA with a regular
> > anonymous VMA.
>
> To paraphrase you, "Yes, SELinux is buggy, but we don't want to fix it".
>
> Cc'ing the SELinux people so it can be fixed properly.

SELinux needs some way to determine what memory region is currently
being used by an application's stacks. The current logic can be found
in selinux_file_mprotect(), the relevant snippet is below:

int selinux_file_mprotect(struct vm_area_struct *vma, ...)
{
...
} else if (!vma->vm_file &&
((vma->vm_start <= vma->vm_mm->start_stack &&
vma->vm_end >= vma->vm_mm->start_stack) ||
vma_is_stack_for_current(vma))) {
rc = avc_has_perm(&selinux_state,
sid, sid, SECCLASS_PROCESS,
PROCESS__EXECSTACK, NULL);
}
...
}

If someone has a better, more foolproof way to determine an
application's stack please let us know, or better yet submit a patch
:)

--
paul-moore.com

2023-04-19 15:08:35

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On 4/18/23 23:46, Matthew Wilcox wrote:
> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>>  1) App runs creating lots of threads.
>>  2) It mmap's 256K pages of anonymous memory.
>>  3) It writes executable code to that memory.
>>  4) It calls mprotect() with PROT_EXEC on that memory so
>>     it can subsequently execute the code.
>>
>> The above mprotect() will fail if the mmap'd region's VMA gets merged with
>> the VMA for one of the thread stacks.  That's because the default RHEL
>> SELinux policy is to not allow executable stacks.
> By the way, this is a daft policy. The policy you really want is
> EXEC|WRITE is not allowed. A non-writable stack is useless, so it's
> actually a superset of your current policy. Forbidding _simultaneous_
> write and executable is just good programming. This way, you don't need
> to care about the underlying VMA's current permissions, you just need
> to do:
>
> if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
> return -EACCESS;

I am not totally sure if the application changes the VMA to read-only
first. Even if it does that, it highlights another possible issue when
an anonymous VMA is merged with a stack VMA. Either the mprotect() to
write-protect the VMA will fail or the application will segfault if it
writes stuff to the stack. This particular issue is not related to
SELinux. It provides another good idea why we should avoid merging stack
VMA to anonymous VMA.

Cheers,
Longman

2023-04-19 15:10:01

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote:
> On 4/18/23 23:46, Matthew Wilcox wrote:
> > On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
> > > ?1) App runs creating lots of threads.
> > > ?2) It mmap's 256K pages of anonymous memory.
> > > ?3) It writes executable code to that memory.
> > > ?4) It calls mprotect() with PROT_EXEC on that memory so
> > > ??? it can subsequently execute the code.
> > >
> > > The above mprotect() will fail if the mmap'd region's VMA gets merged with
> > > the VMA for one of the thread stacks.? That's because the default RHEL
> > > SELinux policy is to not allow executable stacks.
> > By the way, this is a daft policy. The policy you really want is
> > EXEC|WRITE is not allowed. A non-writable stack is useless, so it's
> > actually a superset of your current policy. Forbidding _simultaneous_
> > write and executable is just good programming. This way, you don't need
> > to care about the underlying VMA's current permissions, you just need
> > to do:
> >
> > if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
> > return -EACCESS;
>
> I am not totally sure if the application changes the VMA to read-only first.
> Even if it does that, it highlights another possible issue when an anonymous
> VMA is merged with a stack VMA. Either the mprotect() to write-protect the
> VMA will fail or the application will segfault if it writes stuff to the
> stack. This particular issue is not related to SELinux. It provides another
> good idea why we should avoid merging stack VMA to anonymous VMA.

mprotect will split the VMA into two VMAs, one that is
PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC.

2023-04-19 16:12:41

by Joe Mario

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK



On 4/19/23 11:09 AM, Matthew Wilcox wrote:
> On Wed, Apr 19, 2023 at 11:07:04AM -0400, Waiman Long wrote:
>> On 4/18/23 23:46, Matthew Wilcox wrote:
>>> On Tue, Apr 18, 2023 at 09:16:37PM -0400, Waiman Long wrote:
>>>>  1) App runs creating lots of threads.
>>>>  2) It mmap's 256K pages of anonymous memory.
>>>>  3) It writes executable code to that memory.
>>>>  4) It calls mprotect() with PROT_EXEC on that memory so
>>>>     it can subsequently execute the code.
>>>>
>>>> The above mprotect() will fail if the mmap'd region's VMA gets merged with
>>>> the VMA for one of the thread stacks.  That's because the default RHEL
>>>> SELinux policy is to not allow executable stacks.
>>> By the way, this is a daft policy. The policy you really want is
>>> EXEC|WRITE is not allowed. A non-writable stack is useless, so it's
>>> actually a superset of your current policy. Forbidding _simultaneous_
>>> write and executable is just good programming. This way, you don't need
>>> to care about the underlying VMA's current permissions, you just need
>>> to do:
>>>
>>> if ((prot & (PROT_EXEC|PROT_WRITE)) == (PROT_EXEC|PROT_WRITE))
>>> return -EACCESS;
>>
>> I am not totally sure if the application changes the VMA to read-only first.
>> Even if it does that, it highlights another possible issue when an anonymous
>> VMA is merged with a stack VMA. Either the mprotect() to write-protect the
>> VMA will fail or the application will segfault if it writes stuff to the
>> stack. This particular issue is not related to SELinux. It provides another
>> good idea why we should avoid merging stack VMA to anonymous VMA.
>
> mprotect will split the VMA into two VMAs, one that is
> PROT_READ|PROT_WRITE and one the is PROT_READ|PROT_EXEC.
>

But in this case, the latter still has PROT_WRITE.

This was reported by a large data analytics customer. They started getting infrequent random crashes in code they haven't touched in 10 years.

One of the threads in their program mmaps a large region using PROT_READ|PROT_WRITE, and that region just happens to be merged with the thread's stack.

Then they copy a small snipit of code to a location somewhere within that mapped region. For the one page that contains that code, they mprotect it to PROT_READ|PROT_WRITE|PROT_EXEC. I recall they're still reading and writing data elsewhere on that page.

Joe






2023-04-19 23:37:10

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK

On 4/18/2023 2:18 PM, Andrew Morton wrote:
> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]> wrote:
[..]
>> ...
>>
>> --- a/include/linux/mman.h
>> +++ b/include/linux/mman.h
>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>> return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
>> _calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED ) |
>> _calc_vm_trans(flags, MAP_SYNC, VM_SYNC ) |
>> + _calc_vm_trans(flags, MAP_STACK, VM_STACK ) |
>> arch_calc_vm_flag_bits(flags);
>> }
>
> The mmap(2) manpage says
>
> This flag is currently a no-op on Linux. However, by employing
> this flag, applications can ensure that they transparently ob- tain
> support if the flag is implemented in the future. Thus, it is used
> in the glibc threading implementation to allow for the fact that some
> architectures may (later) require special treat- ment for stack
> allocations. A further reason to employ this flag is portability:
> MAP_STACK exists (and has an effect) on some other systems (e.g.,
> some of the BSDs).
>
> so please propose an update for this?
>

Just curious, why isn't MAP_STACK implemented in Linux kernel? what does
it take to implement it?

Also, could there be other potential issue with the vma merge, such as,
the user process start to truncate half of the anonymous memory vma
range oblivious to the fact that the vma has 'grown' into its stack and
it might be attempting to unmap some of its stack range?

If the vma merge is otherwise harmless, does it bring benefit other than
being one vma less?

thanks!
-jane

2023-04-20 00:15:26

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH] mm/mmap: Map MAP_STACK to VM_STACK



On 4/19/2023 4:21 PM, Jane Chu wrote:
> On 4/18/2023 2:18 PM, Andrew Morton wrote:
>> On Tue, 18 Apr 2023 17:02:30 -0400 Waiman Long <[email protected]>
>> wrote:
> [..]
>>> ...
>>>
>>> --- a/include/linux/mman.h
>>> +++ b/include/linux/mman.h
>>> @@ -152,6 +152,7 @@ calc_vm_flag_bits(unsigned long flags)
>>>       return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
>>>              _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    ) |
>>>              _calc_vm_trans(flags, MAP_SYNC,         VM_SYNC      ) |
>>> +           _calc_vm_trans(flags, MAP_STACK,         VM_STACK     ) |
>>>              arch_calc_vm_flag_bits(flags);
>>>   }
>>
>> The mmap(2) manpage says
>>
>>    This flag is currently a no-op on Linux.  However, by employing
>>    this flag, applications can ensure that they transparently ob- tain
>>    support if the flag is implemented in the future.  Thus, it is used
>>    in the glibc threading implementation to allow for the fact that some
>>    architectures may (later) require special treat- ment for stack
>>    allocations.  A further reason to employ this flag is portability:
>>    MAP_STACK exists (and has an effect) on some other systems (e.g.,
>>    some of the BSDs).
>>
>> so please propose an update for this?
>>
>
> Just curious, why isn't MAP_STACK implemented in Linux kernel? what does
> it take to implement it?
>
> Also, could there be other potential issue with the vma merge, such as,
> the user process start to truncate half of the anonymous memory vma
> range oblivious to the fact that the vma has 'grown' into its stack and
> it might be attempting to unmap some of its stack range?

Sorry, not 'oblivious'. how about a malicious user process get an fd via
memfd_create() and attempt to truncate more than it mmap'ed?

>
> If the vma merge is otherwise harmless, does it bring benefit other than
> being one vma less?
>
> thanks!
> -jane
>
>