On Wed, 2017-06-21 at 11:47 +0100, Ben Hutchings wrote:
> On Wed, 2017-06-21 at 11:24 +0200, Michal Hocko wrote:
> > On Wed 21-06-17 02:38:21, Ben Hutchings wrote:
> > > On Mon, 2017-06-19 at 16:23 +0200, Willy Tarreau wrote:
> > > > On Mon, Jun 19, 2017 at 08:44:24PM +0800, Linus Torvalds wrote:
> > > > > The distros are in a different situation and don't have that
> > > > > two-week
> > > > > window until a release, and presumably would not want to cut
> > > > > over to
> > > > > something new and fairly untested on such short notice.
> > > > >
> > > > > The timing for this all sucks, but if somebody has some final
> > > > > comments, please speak up now..
> > > >
> > > > What do you suggest the stable maintainers do here ? I've just
> > > > backported
> > > > this patch back to 3.10 and could boot it on i386 where it
> > > > apparently
> > > > works. But we may need more tests. On the other hand we benefit
> > > > from the
> > > > automated tests on tens of platforms when we push the queues so
> > > > at least
> > > > we'll quickly know if it builds and boots. I just don't feel
> > > > confident in
> > > > my work just because it builds and boots, you know.
> > > >
> > > > I'm appending the patches I currently have if anyone wants to
> > > > have a
> > > > glance. Ben, 3.2 requires much more changes than 3.10 and I'm
> > > > pretty
> > > > sure you won't change your patches at the last minute so I gave
> > > > up.
> > >
> > > Well I'm now dealing with fall-out from the Debian stable updates,
> > > which used a backport of Michal's patch series. That unfortunately
> > > seems to break programs running Java code in the main thread (the
> > > 'java' command doesn't do this, but e.g. 'jsvc' does).
> >
> > Could you share more details please?
>
> https://bugs.debian.org/865303
> https://bugs.debian.org/865311
> https://bugs.debian.org/865343
Unfortunately these regressions have not been completely fixed by
switching to Hugh's fix.
Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
Apparently Rust maps its own guard page at the lower limit of the stack
(determined using pthread_getattr_np() and pthread_attr_getstack()). I
don't think this ever actually worked for the main thread stack, but it
now also blocks expansion as the default stack size of 8 MiB is smaller
than the stack gap of 16 MiB. Would it make sense to skip over
PROT_NONE mappings when checking whether it's safe to expand?
Secondly, LibreOffice is crashing on i386 when running components
implemented in Java. I don't have a diagnosis for this yet.
Ben.
--
Ben Hutchings
The world is coming to an end. Please log off.
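For context, the user-space pattern Ben describes - query the main-thread stack bounds with pthread_getattr_np()/pthread_attr_getstack() and map a PROT_NONE page at the low end - looks roughly like the sketch below (hypothetical; Rust's actual code is linked later in the thread):

#define _GNU_SOURCE
#include <pthread.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	pthread_attr_t attr;
	void *stackaddr;	/* lowest address of the reported stack range */
	size_t stacksize;
	long page = sysconf(_SC_PAGESIZE);

	/* for the main thread glibc reports (roughly) the whole RLIMIT_STACK
	 * range, most of which is not actually mapped yet */
	pthread_getattr_np(pthread_self(), &attr);
	pthread_attr_getstack(&attr, &stackaddr, &stacksize);
	pthread_attr_destroy(&attr);

	/* user-space guard page at the lower limit of the stack */
	void *guard = mmap(stackaddr, page, PROT_NONE,
			   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	printf("stack [%p, %p), guard page at %p\n",
	       stackaddr, (void *)((char *)stackaddr + stacksize), guard);
	return 0;
}

With an 8 MiB RLIMIT_STACK and a 16 MiB kernel gap (256 pages of 64 KiB), any such guard page necessarily sits closer to the stack than the gap allows, which is what blocks expansion.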
On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
>
> Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> Apparently Rust maps its own guard page at the lower limit of the stack
> (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> don't think this ever actually worked for the main thread stack, but it
> now also blocks expansion as the default stack size of 8 MiB is smaller
> than the stack gap of 16 MiB. Would it make sense to skip over
> PROT_NONE mappings when checking whether it's safe to expand?
Hmm. Maybe.
Also, the whole notion that the gap should be relative to the page
size never made sense to me. So I think we could/should just make the
default gap size be one megabyte, not that "256 pages" abortion.
> Secondly, LibreOffice is crashing on i386 when running components
> implemented in Java. I don't have a diagnosis for this yet.
Ugh. Nobody seeing this inside SuSe/Red Hat? I don't think I've heard
about this..
Linus
On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> On Wed, 2017-06-21 at 11:47 +0100, Ben Hutchings wrote:
>> On Wed, 2017-06-21 at 11:24 +0200, Michal Hocko wrote:
>> > On Wed 21-06-17 02:38:21, Ben Hutchings wrote:
>> > > On Mon, 2017-06-19 at 16:23 +0200, Willy Tarreau wrote:
>> > > > On Mon, Jun 19, 2017 at 08:44:24PM +0800, Linus Torvalds wrote:
>> > > > > The distros are in a different situation and don't have that
>> > > > > two-week
>> > > > > window until a release, and presumably would not want to cut
>> > > > > over to
>> > > > > something new and fairly untested on such short notice.
>> > > > >
>> > > > > The timing for this all sucks, but if somebody has some final
>> > > > > comments, please speak up now..
>> > > >
>> > > > What do you suggest the stable maintainers do here ? I've just
>> > > > backported
>> > > > this patch back to 3.10 and could boot it on i386 where it
>> > > > apparently
>> > > > works. But we may need more tests. On the other hand we benefit
>> > > > from the
>> > > > automated tests on tens of platforms when we push the queues so
>> > > > at least
>> > > > we'll quickly know if it builds and boots. I just don't feel
>> > > > confident in
>> > > > my work just because it builds and boots, you know.
>> > > >
>> > > > I'm appending the patches I currently have if anyone wants to
>> > > > have a
>> > > > glance. Ben, 3.2 requires much more changes than 3.10 and I'm
>> > > > pretty
>> > > > sure you won't change your patches at the last minute so I gave
>> > > > up.
>> > >
>> > > Well I'm now dealing with fall-out from the Debian stable updates,
>> > > which used a backport of Michal's patch series. That unfortunately
>> > > seems to break programs running Java code in the main thread (the
>> > > 'java' command doesn't do this, but e.g. 'jsvc' does).
>> >
>> > Could you share more details please?
>>
>> https://bugs.debian.org/865303
>> https://bugs.debian.org/865311
>> https://bugs.debian.org/865343
>
> Unfortunately these regressions have not been completely fixed by
> switching to Hugh's fix.
>
> Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> Apparently Rust maps its own guard page at the lower limit of the stack
> (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> don't think this ever actually worked for the main thread stack, but it
> now also blocks expansion as the default stack size of 8 MiB is smaller
> than the stack gap of 16 MiB. Would it make sense to skip over
> PROT_NONE mappings when checking whether it's safe to expand?
That change makes sense to me.
On Mon 03-07-17 17:05:27, Linus Torvalds wrote:
> On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> >
> > Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> > Apparently Rust maps its own guard page at the lower limit of the stack
> > (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> > don't think this ever actually worked for the main thread stack, but it
> > now also blocks expansion as the default stack size of 8 MiB is smaller
> > than the stack gap of 16 MiB. Would it make sense to skip over
> > PROT_NONE mappings when checking whether it's safe to expand?
This is what my workaround for the older patch was doing, actually. We
have deployed that as a follow-up fix on our older code bases, and it
has fixed various issues with Java, which does a similar thing.
> Hmm. Maybe.
>
> Also, the whole notion that the gap should be relative to the page
> size never made sense to me. So I think we could/should just make the
> default gap size be one megabyte, not that "256 pages" abortion.
The reason for having this in page units was that MAX_ARG_STRLEN is in
page units as well, and that is used as an on-stack variable quite
often. 1 MB wouldn't be sufficient to cover that - we could go with a
larger gap, but who knows how many other traps are out there.
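To make the arithmetic concrete, a small stand-alone illustration (not kernel code; it assumes the kernel's MAX_ARG_STRLEN of 32 pages and the default stack_guard_gap of 256 pages):

#include <stdio.h>

/* Illustration only: shows why a fixed 1 MiB gap would not cover
 * MAX_ARG_STRLEN on 64 KiB page systems, while a page-scaled gap does. */
int main(void)
{
	const long page_sizes[] = { 4096, 65536 };

	for (int i = 0; i < 2; i++) {
		long ps = page_sizes[i];
		long max_arg_strlen = 32 * ps;	/* include/uapi/linux/binfmts.h */
		long gap = 256 * ps;		/* default stack_guard_gap */

		printf("page %2ld KiB: MAX_ARG_STRLEN %4ld KiB, gap %5ld KiB, fixed 1 MiB gap %s\n",
		       ps >> 10, max_arg_strlen >> 10, gap >> 10,
		       max_arg_strlen > (1L << 20) ? "too small" : "sufficient");
	}
	return 0;
}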
> > Secondly, LibreOffice is crashing on i386 when running components
> > implemented in Java. I don't have a diagnosis for this yet.
>
> Ugh. Nobody seeing this inside SuSe/Red Hat? I don't think I've heard
> about this..
No reports yet, but we do not support 32-bit kernels on the newer
kernels which have the upstream fix.
--
Michal Hocko
SUSE Labs
On Tue 04-07-17 10:41:22, Michal Hocko wrote:
> On Mon 03-07-17 17:05:27, Linus Torvalds wrote:
> > On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> > >
> > > Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> > > Apparently Rust maps its own guard page at the lower limit of the stack
> > > (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> > > don't think this ever actually worked for the main thread stack, but it
> > > now also blocks expansion as the default stack size of 8 MiB is smaller
> > > than the stack gap of 16 MiB. Would it make sense to skip over
> > > PROT_NONE mappings when checking whether it's safe to expand?
>
> This is what my workaround for the older patch was doing, actually. We
> have deployed that as a follow up fix on our older code bases. And this
> has fixed verious issues with Java which was doing the similar thing.
Here is a forward port (on top of the current Linus tree) of my earlier
patch. I have dropped the note about the Java stack trace because it
most likely no longer applies with Hugh's patch. The problem is the
same in principle, though. Note that I didn't get to test this properly
yet, but it should be pretty much obvious.
---
From d9f6faccf2c286ed81fbc860c9b0b7fe23ef0836 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 4 Jul 2017 11:27:39 +0200
Subject: [PATCH] mm: mm, mmap: do not blow on PROT_NONE MAP_FIXED holes in the
stack
"mm: enlarge stack guard gap" has introduced a regression in some rust
and Java environments which are trying to implement their own stack
guard page. They are punching a new MAP_FIXED mapping inside the
existing stack Vma.
This will confuse expand_{downwards,upwards} into thinking that the stack
expansion would in fact get us too close to an existing non-stack vma
which is a correct behavior wrt. safety. It is a real regression on
the other hand. Let's work around the problem by considering PROT_NONE
mapping as a part of the stack. This is a gros hack but overflowing to
such a mapping would trap anyway an we only can hope that usespace
knows what it is doing and handle it propely.
Fixes: d4d2d35e6ef9 ("mm: larger stack guard gap, between vmas")
Debugged-by: Vlastimil Babka <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
mm/mmap.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index f60a8bc2869c..2e996cbf4ff3 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2244,7 +2244,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 		gap_addr = TASK_SIZE;
 
 	next = vma->vm_next;
-	if (next && next->vm_start < gap_addr) {
+	if (next && next->vm_start < gap_addr &&
+			(next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
 		if (!(next->vm_flags & VM_GROWSUP))
 			return -ENOMEM;
 		/* Check that both stack segments have the same anon_vma? */
@@ -2325,7 +2326,8 @@ int expand_downwards(struct vm_area_struct *vma,
 	/* Enforce stack_guard_gap */
 	prev = vma->vm_prev;
 	/* Check that both stack segments have the same anon_vma? */
-	if (prev && !(prev->vm_flags & VM_GROWSDOWN)) {
+	if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
+			(prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
 		if (address - prev->vm_end < stack_guard_gap)
 			return -ENOMEM;
 	}
--
2.11.0
--
Michal Hocko
SUSE Labs
On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> On Tue 04-07-17 10:41:22, Michal Hocko wrote:
> > On Mon 03-07-17 17:05:27, Linus Torvalds wrote:
> > > On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> > > >
> > > > Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> > > > Apparently Rust maps its own guard page at the lower limit of the stack
> > > > (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> > > > don't think this ever actually worked for the main thread stack, but it
> > > > now also blocks expansion as the default stack size of 8 MiB is smaller
> > > > than the stack gap of 16 MiB. Would it make sense to skip over
> > > > PROT_NONE mappings when checking whether it's safe to expand?
> >
> > This is what my workaround for the older patch was doing, actually. We
> > have deployed that as a follow up fix on our older code bases. And this
> > has fixed verious issues with Java which was doing the similar thing.
>
> Here is a forward port (on top of the current Linus tree) of my earlier
> patch. I have dropped a note about java stack trace because this would
> most likely be not the case with the Hugh's patch. The problem is the
> same in principle though. Note I didn't get to test this properly yet
> but it should be pretty much obvious.
> ---
> >From d9f6faccf2c286ed81fbc860c9b0b7fe23ef0836 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 4 Jul 2017 11:27:39 +0200
> Subject: [PATCH] mm: mm, mmap: do not blow on PROT_NONE MAP_FIXED holes in the
> stack
>
> "mm: enlarge stack guard gap" has introduced a regression in some rust
> and Java environments which are trying to implement their own stack
> guard page. They are punching a new MAP_FIXED mapping inside the
> existing stack Vma.
>
> This will confuse expand_{downwards,upwards} into thinking that the stack
> expansion would in fact get us too close to an existing non-stack vma
> which is a correct behavior wrt. safety. It is a real regression on
> the other hand. Let's work around the problem by considering PROT_NONE
> mapping as a part of the stack. This is a gros hack but overflowing to
> such a mapping would trap anyway an we only can hope that usespace
> knows what it is doing and handle it propely.
>
> Fixes: d4d2d35e6ef9 ("mm: larger stack guard gap, between vmas")
> Debugged-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> mm/mmap.c | 6 ++++--
> 1 file changed, 4 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mmap.c b/mm/mmap.c
> index f60a8bc2869c..2e996cbf4ff3 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2244,7 +2244,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> gap_addr = TASK_SIZE;
>
> next = vma->vm_next;
> - if (next && next->vm_start < gap_addr) {
> + if (next && next->vm_start < gap_addr &&
> + (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
> if (!(next->vm_flags & VM_GROWSUP))
> return -ENOMEM;
> /* Check that both stack segments have the same anon_vma? */
> @@ -2325,7 +2326,8 @@ int expand_downwards(struct vm_area_struct *vma,
> /* Enforce stack_guard_gap */
> prev = vma->vm_prev;
> /* Check that both stack segments have the same anon_vma? */
> - if (prev && !(prev->vm_flags & VM_GROWSDOWN)) {
> + if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
> + (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
> if (address - prev->vm_end < stack_guard_gap)
> return -ENOMEM;
> }
But wouldn't this completely disable the check in case such a guard page
is installed, and possibly continue to allow the collision when the stack
allocation is large enough to skip this guard page ? Shouldn't we instead
"skip" such a vma and look for the next one ?
I was thinking about something more like :
	prev = vma->vm_prev;
+	/* Don't consider a possible user-space stack guard page */
+	if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
+	    !(prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)))
+		prev = prev->vm_prev;
+
	/* Check that both stack segments have the same anon_vma? */
Willy
On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> > On Tue 04-07-17 10:41:22, Michal Hocko wrote:
> > > On Mon 03-07-17 17:05:27, Linus Torvalds wrote:
> > > > On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> > > > >
> > > > > Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> > > > > Apparently Rust maps its own guard page at the lower limit of the stack
> > > > > (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> > > > > don't think this ever actually worked for the main thread stack, but it
> > > > > now also blocks expansion as the default stack size of 8 MiB is smaller
> > > > > than the stack gap of 16 MiB. Would it make sense to skip over
> > > > > PROT_NONE mappings when checking whether it's safe to expand?
> > >
> > > This is what my workaround for the older patch was doing, actually. We
> > > have deployed that as a follow up fix on our older code bases. And this
> > > has fixed verious issues with Java which was doing the similar thing.
> >
> > Here is a forward port (on top of the current Linus tree) of my earlier
> > patch. I have dropped a note about java stack trace because this would
> > most likely be not the case with the Hugh's patch. The problem is the
> > same in principle though. Note I didn't get to test this properly yet
> > but it should be pretty much obvious.
> > ---
> > >From d9f6faccf2c286ed81fbc860c9b0b7fe23ef0836 Mon Sep 17 00:00:00 2001
> > From: Michal Hocko <[email protected]>
> > Date: Tue, 4 Jul 2017 11:27:39 +0200
> > Subject: [PATCH] mm: mm, mmap: do not blow on PROT_NONE MAP_FIXED holes in the
> > stack
> >
> > "mm: enlarge stack guard gap" has introduced a regression in some rust
> > and Java environments which are trying to implement their own stack
> > guard page. They are punching a new MAP_FIXED mapping inside the
> > existing stack Vma.
> >
> > This will confuse expand_{downwards,upwards} into thinking that the stack
> > expansion would in fact get us too close to an existing non-stack vma
> > which is a correct behavior wrt. safety. It is a real regression on
> > the other hand. Let's work around the problem by considering PROT_NONE
> > mapping as a part of the stack. This is a gros hack but overflowing to
> > such a mapping would trap anyway an we only can hope that usespace
> > knows what it is doing and handle it propely.
> >
> > Fixes: d4d2d35e6ef9 ("mm: larger stack guard gap, between vmas")
> > Debugged-by: Vlastimil Babka <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> > ---
> > mm/mmap.c | 6 ++++--
> > 1 file changed, 4 insertions(+), 2 deletions(-)
> >
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index f60a8bc2869c..2e996cbf4ff3 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2244,7 +2244,8 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > gap_addr = TASK_SIZE;
> >
> > next = vma->vm_next;
> > - if (next && next->vm_start < gap_addr) {
> > + if (next && next->vm_start < gap_addr &&
> > + (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
> > if (!(next->vm_flags & VM_GROWSUP))
> > return -ENOMEM;
> > /* Check that both stack segments have the same anon_vma? */
> > @@ -2325,7 +2326,8 @@ int expand_downwards(struct vm_area_struct *vma,
> > /* Enforce stack_guard_gap */
> > prev = vma->vm_prev;
> > /* Check that both stack segments have the same anon_vma? */
> > - if (prev && !(prev->vm_flags & VM_GROWSDOWN)) {
> > + if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
> > + (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
> > if (address - prev->vm_end < stack_guard_gap)
> > return -ENOMEM;
> > }
>
> But wouldn't this completely disable the check in case such a guard page
> is installed, and possibly continue to allow the collision when the stack
> allocation is large enough to skip this guard page ?
Yes, but a PROT_NONE mapping would fault, and as the changelog says, we
_hope_ that userspace does the right thing.
> Shouldn't we instead
> "skip" such a vma and look for the next one ?
Yeah, that would be possible; I am not sure it is worth it, though. The
gap as it is implemented now prevents regular mappings from getting close
to the stack. So we only care about MAP_FIXED mappings, and those can
already screw things up, so we really have to rely on userspace doing
something semi-reasonable.
> I was thinking about something more like :
>
> prev = vma->vm_prev;
> + /* Don't consider a possible user-space stack guard page */
> + if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
> + !(prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)))
> + prev = prev->vm_prev;
> +
If anything, this would require a loop over all PROT_NONE
mappings so as not to hit other weird use cases.
> /* Check that both stack segments have the same anon_vma? */
>
> Willy
--
Michal Hocko
SUSE Labs
On Tue 04-07-17 11:35:38, Michal Hocko wrote:
> On Tue 04-07-17 10:41:22, Michal Hocko wrote:
> > On Mon 03-07-17 17:05:27, Linus Torvalds wrote:
> > > On Mon, Jul 3, 2017 at 4:55 PM, Ben Hutchings <[email protected]> wrote:
> > > >
> > > > Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> > > > Apparently Rust maps its own guard page at the lower limit of the stack
> > > > (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> > > > don't think this ever actually worked for the main thread stack, but it
> > > > now also blocks expansion as the default stack size of 8 MiB is smaller
> > > > than the stack gap of 16 MiB. Would it make sense to skip over
> > > > PROT_NONE mappings when checking whether it's safe to expand?
> >
> > This is what my workaround for the older patch was doing, actually. We
> > have deployed that as a follow up fix on our older code bases. And this
> > has fixed verious issues with Java which was doing the similar thing.
>
> Here is a forward port (on top of the current Linus tree) of my earlier
> patch. I have dropped a note about java stack trace because this would
> most likely be not the case with the Hugh's patch. The problem is the
> same in principle though. Note I didn't get to test this properly yet
> but it should be pretty much obvious.
Tested with the attached program.
root@test1:~# ./stack_crash
Stack top:0x7fffcdb605ec mmap:0x7fffcc760000
address:0x7fffcc760ff8 aligned:0x7fffcc760000 mapped:[7fffcc760000,7fffcc761000] diff:-8
[...]
so we faulted on the PROT_NONE mapping, while with
#define MAPING_PROT PROT_READ
root@test1:~# ./stack_crash
Stack top:0x7ffe73dde6fc mmap:0x7ffe729de000
address:0x7ffe72adefd8 aligned:0x7ffe72ade000 mapped:[7ffe729de000,7ffe729df000] diff:1048536
[...]
we failed 1MB ahead of the mapping.
--
Michal Hocko
SUSE Labs
On Tue 04-07-17 12:46:52, Michal Hocko wrote:
[...]
> Tested with the attached program.
Err, attached now for real.
--
Michal Hocko
SUSE Labs
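The attachment itself is not reproduced in this archive. A rough, hypothetical sketch of the kind of test described (not Michal's actual program; MAPING_PROT mirrors the #define mentioned in his mail) would map a single page below the main stack, then grow the stack until either the mapping is reached or the kernel refuses expansion:

#define _GNU_SOURCE
#include <alloca.h>
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAPING_PROT PROT_NONE	/* switch to PROT_READ for the second run */

static char *map;		/* single page mapped below the stack */
static long page;

/* runs on the alternate stack; printf is not async-signal-safe but is
 * good enough for a throwaway test */
static void segv_handler(int sig, siginfo_t *si, void *ctx)
{
	(void)sig; (void)ctx;
	printf("address:%p mapped:[%p,%p] diff:%ld\n", si->si_addr,
	       (void *)map, (void *)(map + page),
	       (long)((char *)si->si_addr - (map + page)));
	_exit(0);
}

int main(void)
{
	char stack_top;		/* rough marker near the top of the stack */
	stack_t ss;
	struct sigaction sa;
	uintptr_t target;

	page = sysconf(_SC_PAGESIZE);

	/* the fault handler must run on an alternate stack: the main stack
	 * cannot grow at the point the fault happens */
	memset(&ss, 0, sizeof(ss));
	ss.ss_sp = malloc(SIGSTKSZ);
	ss.ss_size = SIGSTKSZ;
	sigaltstack(&ss, NULL);

	memset(&sa, 0, sizeof(sa));
	sa.sa_sigaction = segv_handler;
	sa.sa_flags = SA_SIGINFO | SA_ONSTACK;
	sigaction(SIGSEGV, &sa, NULL);
	sigaction(SIGBUS, &sa, NULL);

	/* map one page ~6 MiB below the stack top: inside the default
	 * 8 MiB stack rlimit but below the currently mapped stack vma */
	target = ((uintptr_t)&stack_top - 6 * 1024 * 1024) & ~(uintptr_t)(page - 1);
	map = mmap((void *)target, page, MAPING_PROT,
		   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	printf("Stack top:%p mmap:%p\n", (void *)&stack_top, (void *)map);

	/* grow the stack towards the mapping one page at a time */
	while (1) {
		volatile char *p = alloca(page);
		p[0] = 0;
		if ((char *)p <= map + page)
			break;
	}
	printf("reached the mapping without faulting\n");
	return 0;
}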
On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
> On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> > On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
[...]
> > But wouldn't this completely disable the check in case such a guard page
> > is installed, and possibly continue to allow the collision when the stack
> > allocation is large enough to skip this guard page ?
>
> Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
> that userspace does the right thing.
It may well not be large enough, because of the same wrong assumptions
that resulted in the kernel's guard page not being large enough. We
should count it as part of the guard gap but not a substitute.
> > Shouldn't we instead
> > "skip" such a vma and look for the next one ?
>
> Yeah, that would be possible, I am not sure it is worth it though. The
> gap as it is implemented now prevents regular mappings to get close to
> the stack. So we only care about those with MAP_FIXED and those can
> screw things already so we really have to rely on userspace doing some
> semi reasonable.
>
> > I was thinking about something more like :
> >
> > prev = vma->vm_prev;
> > + /* Don't consider a possible user-space stack guard page */
> > + if (prev && !(prev->vm_flags & VM_GROWSDOWN) &&
> > +     !(prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)))
> > + prev = prev->vm_prev;
> > +
>
> If anywhing this would require to have a loop over all PROT_NONE
> mappings to not hit into other weird usecases.
That's what I was thinking of. Tried the following patch:
Subject: mmap: Ignore VM_NONE mappings when checking for space to
expand the stack
Some user-space run-times (in particular, Java and Rust) allocate
their own guard pages in the main stack. This didn't work well
before, but it can now block stack expansion where it is safe and would
previously have been allowed. Ignore such mappings when checking the
size of the gap before expanding.
Reported-by: Ximin Luo <[email protected]>
References: https://bugs.debian.org/865416
Fixes: 1be7107fbe18 ("mm: larger stack guard gap, between vmas")
Cc: [email protected]
Signed-off-by: Ben Hutchings <[email protected]>
---
mm/mmap.c | 19 ++++++++++++++++---
1 file changed, 16 insertions(+), 3 deletions(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..19f3ce04f24f 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2243,7 +2243,14 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
 	if (gap_addr < address || gap_addr > TASK_SIZE)
 		gap_addr = TASK_SIZE;
 
-	next = vma->vm_next;
+	/*
+	 * Allow VM_NONE mappings in the gap as some applications try
+	 * to make their own stack guards
+	 */
+	for (next = vma->vm_next;
+	     next && !(next->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
+	     next = next->vm_next)
+		;
 	if (next && next->vm_start < gap_addr) {
 		if (!(next->vm_flags & VM_GROWSUP))
 			return -ENOMEM;
@@ -2323,11 +2330,17 @@ int expand_downwards(struct vm_area_struct *vma,
 	if (error)
 		return error;
 
-	/* Enforce stack_guard_gap */
+	/*
+	 * Enforce stack_guard_gap, but allow VM_NONE mappings in the gap
+	 * as some applications try to make their own stack guards
+	 */
 	gap_addr = address - stack_guard_gap;
 	if (gap_addr > address)
 		return -ENOMEM;
-	prev = vma->vm_prev;
+	for (prev = vma->vm_prev;
+	     prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
+	     prev = prev->vm_prev)
+		;
 	if (prev && prev->vm_end > gap_addr) {
 		if (!(prev->vm_flags & VM_GROWSDOWN))
 			return -ENOMEM;
--- END ---
I don't have a ppc64el machine where I can change the kernel, but I
tried this on x86_64 with the stack limit reduced to 1 MiB and Rust
is able to expand its stack where previously it would crash.
This *doesn't* fix the LibreOffice regression on i386.
Ben.
--
Ben Hutchings
The world is coming to an end. Please log off.
On Tue 04-07-17 12:36:11, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
> > On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> > > On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> [...]
> > > But wouldn't this completely disable the check in case such a guard page
> > > is installed, and possibly continue to allow the collision when the stack
> > > allocation is large enough to skip this guard page ?
> >
> > Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
> > that userspace does the right thing.
>
> It may well not be large enough, because of the same wrong assumptions
> that resulted in the kernel's guard page not being large enough. We
> should count it as part of the guard gap but not a substitute.
Yes, you are right of course. But isn't this a bug on their side,
considering they are managing their _own_ stack gap? Our stack gap
management is a best-effort thing, and two such approaches competing will
always lead to weird corner cases. That was my assumption when saying
that I am not sure this is really _worth_ it. We should definitely try
to work around clashes, but that's about it. If others think that we
should do everything to prevent even those issues I will not oppose
that, of course. It just adds more cycles to something that is a weird
case already.
[...]
> This *doesn't* fix the LibreOffice regression on i386.
Are there any details about this regression?
--
Michal Hocko
SUSE Labs
On Tue 04-07-17 13:59:59, Michal Hocko wrote:
> On Tue 04-07-17 12:36:11, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
> > > On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> > > > On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> > [...]
> > > > But wouldn't this completely disable the check in case such a guard page
> > > > is installed, and possibly continue to allow the collision when the stack
> > > > allocation is large enough to skip this guard page ?
> > >
> > > Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
> > > that userspace does the right thing.
> >
> > It may well not be large enough, because of the same wrong assumptions
> > that resulted in the kernel's guard page not being large enough. We
> > should count it as part of the guard gap but not a substitute.
>
> yes, you are right of course. But isn't this a bug on their side
> considering they are managing their _own_ stack gap? Our stack gap
> management is a best effort thing and two such approaches competing will
> always lead to weird cornercases. That was my assumption when saying
> that I am not sure this is really _worth_ it. We should definitely try
> to workaround clashes but that's about it. If others think that we
> should do everything to prevent even those issues I will not oppose
> of course. It just adds more cycles to something that is a weird case
> already.
Forgot to mention another point. Currently we do not check further
previous vmas if prev->vm_flags & VM_GROWSDOWN. Consider a userspace
stack gap implemented with mprotect: that wouldn't change the VM_GROWSDOWN
flag, and we are back to square one because the gap might be too small. Do
we want/need to handle those cases? Are they too different from
MAP_FIXED gaps? I am not so sure, but I would be inclined to say no.
--
Michal Hocko
SUSE Labs
On Tue, 2017-07-04 at 14:00 +0200, Michal Hocko wrote:
> On Tue 04-07-17 12:36:11, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
> > > On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> > > > On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> >
> > [...]
> > > > But wouldn't this completely disable the check in case such a guard page
> > > > is installed, and possibly continue to allow the collision when the stack
> > > > allocation is large enough to skip this guard page ?
> > >
> > > Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
> > > that userspace does the right thing.
> >
> > It may well not be large enough, because of the same wrong assumptions
> > that resulted in the kernel's guard page not being large enough. We
> > should count it as part of the guard gap but not a substitute.
>
> yes, you are right of course. But isn't this a bug on their side
> considering they are managing their _own_ stack gap?
Yes it's their bug, but you know the rule - don't break user-space.
> Our stack gap
> management is a best effort thing and two such approaches competing will
> always lead to weird cornercases. That was my assumption when saying
> that I am not sure this is really _worth_ it. We should definitely try
> to workaround clashes but that's about it. If others think that we
> should do everything to prevent even those issues I will not oppose
> of course. It just adds more cycles to something that is a weird case
> already.
I don't want odd behaviour to weaken the stack guard.
> [...]
>
> > This *doesn't* fix the LibreOffice regression on i386.
>
> Are there any details about this regression?
Here:
https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865303#170
I haven't reproduced it in Writer, but if I use Base to create a new
HSQLDB database it reliably crashes (HSQLDB is implemented in Java).
Ben.
--
Ben Hutchings
The world is coming to an end. Please log off.
On 04/07/17 00:55, Ben Hutchings wrote:
> Unfortunately these regressions have not been completely fixed by
> switching to Hugh's fix.
>
> Firstly, some Rust programs are crashing on ppc64el with 64 KiB pages.
> Apparently Rust maps its own guard page at the lower limit of the stack
> (determined using pthread_getattr_np() and pthread_attr_getstack()). I
> don't think this ever actually worked for the main thread stack, but it
> now also blocks expansion as the default stack size of 8 MiB is smaller
> than the stack gap of 16 MiB. Would it make sense to skip over
> PROT_NONE mappings when checking whether it's safe to expand?
>
> Secondly, LibreOffice is crashing on i386 when running components
> implemented in Java. I don't have a diagnosis for this yet.
We found that we needed f4cb767d76cf ("mm: fix new crash in
unmapped_area_topdown()"). Apologies if you've already covered that.
This may be needed in addition to the other patch you proposed.
jch
On Tue 04-07-17 13:21:02, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 14:00 +0200, Michal Hocko wrote:
> > On Tue 04-07-17 12:36:11, Ben Hutchings wrote:
> > > On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
> > > > On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
> > > > > On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
> > >
> > > [...]
> > > > > But wouldn't this completely disable the check in case such a guard page
> > > > > is installed, and possibly continue to allow the collision when the stack
> > > > > allocation is large enough to skip this guard page ?
> > > >
> > > > Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
> > > > that userspace does the right thing.
> > >
> > > It may well not be large enough, because of the same wrong assumptions
> > > that resulted in the kernel's guard page not being large enough. We
> > > should count it as part of the guard gap but not a substitute.
> >
> > yes, you are right of course. But isn't this a bug on their side
> > considering they are managing their _own_ stack gap?
>
> Yes it's their bug, but you know the rule - don't break user-space.
Absolutely, that is why I believe we should consider the prev VMA, but
doing anything more just risks new regressions. Or why do you think
that not checking them would cause a regression?
> > Our stack gap
> > management is a best effort thing and two such approaches competing will
> > always lead to weird cornercases. That was my assumption when saying
> > that I am not sure this is really _worth_ it. We should definitely try
> > to workaround clashes but that's about it. If others think that we
> > should do everything to prevent even those issues I will not oppose
> > of course. It just adds more cycles to something that is a weird case
> > already.
>
> I don't want odd behaviour to weaken the stack guard.
>
> > [...]
> >
> > > This *doesn't* fix the LibreOffice regression on i386.
> >
> > Are there any details about this regression?
>
> Here:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865303#170
>
> I haven't reproduced it in Writer, but if I use Base to create a new
> HSQLDB database it reliably crashes (HSQLDB is implemented in Java).
I haven't read through the previous 169 comments, but I do not see any
stack trace. Ideally with "info proc mapping", which would tell us the
memory layout.
--
Michal Hocko
SUSE Labs
Michal Hocko:
> On Tue 04-07-17 13:21:02, Ben Hutchings wrote:
>> On Tue, 2017-07-04 at 14:00 +0200, Michal Hocko wrote:
>>> On Tue 04-07-17 12:36:11, Ben Hutchings wrote:
>>>> On Tue, 2017-07-04 at 12:42 +0200, Michal Hocko wrote:
>>>>> On Tue 04-07-17 11:47:28, Willy Tarreau wrote:
>>>>>> On Tue, Jul 04, 2017 at 11:35:38AM +0200, Michal Hocko wrote:
>>>>
>>>> [...]
>>>>>> But wouldn't this completely disable the check in case such a guard page
>>>>>> is installed, and possibly continue to allow the collision when the stack
>>>>>> allocation is large enough to skip this guard page ?
>>>>>
>>>>> Yes and but a PROT_NONE would fault and as the changelog says, we _hope_
>>>>> that userspace does the right thing.
>>>>
>>>> It may well not be large enough, because of the same wrong assumptions
>>>> that resulted in the kernel's guard page not being large enough. We
>>>> should count it as part of the guard gap but not a substitute.
>>>
>>> yes, you are right of course. But isn't this a bug on their side
>>> considering they are managing their _own_ stack gap?
>>
>> Yes it's their bug, but you know the rule - don't break user-space.
>
> Absolutely, that is why I belive we should consider the prev VMA but
> doing anything more just risks for new regressions. Or why do you think
> that not-checking them would cause a regression?
>
>>> Our stack gap
>>> management is a best effort thing and two such approaches competing will
>>> always lead to weird cornercases. That was my assumption when saying
>>> that I am not sure this is really _worth_ it. We should definitely try
>>> to workaround clashes but that's about it. If others think that we
>>> should do everything to prevent even those issues I will not oppose
>>> of course. It just adds more cycles to something that is a weird case
>>> already.
>>
>> I don't want odd behaviour to weaken the stack guard.
>>
>>> [...]
>>>
>>>> This *doesn't* fix the LibreOffice regression on i386.
>>>
>>> Are there any details about this regression?
>>
>> Here:
>> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=865303#170
>>
>> I haven't reproduced it in Writer, but if I use Base to create a new
>> HSQLDB database it reliably crashes (HSQLDB is implemented in Java).
>
> I haven't read through previous 169 comments but I do not see any stack
> trace. Ideally with info proc mapping that would tell us the memory
> layout.
>
I've written up an explanation of what happens in the Rust case here:
https://github.com/rust-lang/rust/issues/43052
Hopefully I got the details about Linux correct - I only had them explained to me last night - please reply on that page if not.
X
--
GPG: ed25519/56034877E1F87C35
GPG: rsa4096/1318EFAC5FBBDBCE
https://github.com/infinity0/pubkeys.git
On Tue 04-07-17 14:19:00, Ximin Luo wrote:
[...]
> I've written up an explanation of what happens in the Rust case here:
>
> https://github.com/rust-lang/rust/issues/43052
The most important part is https://github.com/rust-lang/rust/blob/master/src/libstd/sys/unix/thread.rs#L248
// Rellocate the last page of the stack.
// This ensures SIGBUS will be raised on
// stack overflow.
let result = mmap(stackaddr, psize, PROT_NONE, MAP_PRIVATE | MAP_ANON | MAP_FIXED, -1, 0);
so this is basically the same thing Java does, except that Java doesn't
usually do that on the main thread; only some JNI runtimes do.
pthread_attr_getstack() usage on the main thread sounds like a real bug
in Rust to me.
Thanks for the writeup!
--
Michal Hocko
SUSE Labs
On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> > If anywhing this would require to have a loop over all PROT_NONE
> > mappings to not hit into other weird usecases.
>
> That's what I was thinking of. Tried the following patch:
(...)
> - next = vma->vm_next;
> + /*
> + * Allow VM_NONE mappings in the gap as some applications try
> + * to make their own stack guards
> + */
> + for (next = vma->vm_next;
> + next && !(next->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> + next = next->vm_next)
> + ;
That's what I wanted to propose but I feared someone would scream at me
for this loop :-)
+1 for me!
Willy
On Tue, Jul 4, 2017 at 4:36 AM, Ben Hutchings <[email protected]> wrote:
>
> That's what I was thinking of. Tried the following patch:
>
> Subject: mmap: Ignore VM_NONE mappings when checking for space to
> expand the stack
This looks sane to me.
I'm going to ignore it in this thread, and assume that it gets sent as
a patch separately, ok?
It would be good to have more acks on it.
Also, separately, John Haxby kind of implied that the LibreOffice
regression on i386 is already fixed by commit f4cb767d76cf ("mm: fix
new crash in unmapped_area_topdown()").
Or was that a separate issue?
Linus
On 04/07/17 17:18, Linus Torvalds wrote:
> Also, separately, John Haxby kind of implied that the LibreOffice
> regression on i386 is already fixed by commit f4cb767d76cf ("mm: fix
> new crash in unmapped_area_topdown()").
I'm not certain. We had two distinct problems that were avoided by
Hugh's original patch together with f4cb767d76cf: the Oracle RDBMS
started and java programs worked. In my mind I'm conflating the second
of these with problems in LibreOffice.
Alas, the people who could confirm this for me are getting ready to
watch fireworks and generally have a good time.
jch
On Tue, Jul 04, 2017 at 05:27:55PM +0100, John Haxby wrote:
> Alas, the people who could confirm this for me are getting ready to
> watch fireworks and generally have a good time.
Let's hope the fireworks is controlled by Java with on up-to-date
kernel so that they can quickly get back to work :-)
Willy
On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> @@ -2323,11 +2330,17 @@ int expand_downwards(struct vm_area_struct *vma,
> if (error)
> return error;
>
> - /* Enforce stack_guard_gap */
> + /*
> + * Enforce stack_guard_gap, but allow VM_NONE mappings in the gap
> + * as some applications try to make their own stack guards
> + */
> gap_addr = address - stack_guard_gap;
> if (gap_addr > address)
> return -ENOMEM;
> - prev = vma->vm_prev;
> + for (prev = vma->vm_prev;
> + prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> + prev = prev->vm_prev)
> + ;
> if (prev && prev->vm_end > gap_addr) {
> if (!(prev->vm_flags & VM_GROWSDOWN))
> return -ENOMEM;
Hmmm, shouldn't we also stop looping when we're out of the gap? Something
like this:
	for (prev = vma->vm_prev;
	     prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) &&
	     address - prev->vm_end < stack_guard_gap;
	     prev = prev->vm_prev)
		;
This would limit the risk of runaway loops if someone is having fun
allocating a lot of memory in small chunks (e.g. 4 GB in 1 million
independent mmap() calls).
Willy
On Tue 04-07-17 17:51:40, Willy Tarreau wrote:
> On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> > > If anywhing this would require to have a loop over all PROT_NONE
> > > mappings to not hit into other weird usecases.
> >
> > That's what I was thinking of. Tried the following patch:
> (...)
> > - next = vma->vm_next;
> > + /*
> > + * Allow VM_NONE mappings in the gap as some applications try
> > + * to make their own stack guards
> > + */
> > + for (next = vma->vm_next;
> > + next && !(next->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> > + next = next->vm_next)
> > + ;
>
> That's what I wanted to propose but I feared someone would scream at me
> for this loop :-)
Well, I've been thinking about this some more, and the more I think about
it the less I am convinced we should try to be clever here. Why? Because
as soon as somebody tries to manage stacks explicitly, you cannot simply
assume anything about the previous mapping. Say some interpreter uses
[ mngmnt data][red zone] <--[- MAP_GROWSDOWN ]
Now if we consider the red zone's (PROT_NONE) prev mapping, we would fail
the expansion even though we haven't hit the red zone, and that is
essentially what the Java and Rust bugs are about. So we just risk yet
another regression.
Now consider another example:
<--[- MAP_GROWSDOWN][red zone] <--[- MAP_GROWSDOWN]
thread 1 thread 2
Does the more clever code prevent smashing over an unrelated stack?
No, because of our VM_GROWS{DOWN,UP} checks, which are needed for other
cases. Well, we could special-case those as well, but...
That being said, I am not really convinced that mixing two different gap
implementations is sane. I guess it is reasonable to assume that
a PROT_NONE mapping close to the stack is meant to be a red zone, and at
that point we should rather back off and rely on userspace rather
than risk more weird corner cases and regressions.
--
Michal Hocko
SUSE Labs
On Tue, Jul 4, 2017 at 10:22 AM, Michal Hocko <[email protected]> wrote:
>
> Well, I've been thinking about this some more and the more I think about
> it the less I am convinced we should try to be clever here. Why? Because
> as soon as somebody tries to manage stacks explicitly you cannot simply
> assume anything about the previous mapping. Say some interpret uses
> [ mngmnt data][red zone] <--[- MAP_GROWSDOWN ]
>
> Now if we consider the red zone's (PROT_NONE) prev mapping we would fail
> the expansion even though we haven't hit the red zone and that is
> essentially what the Java and rust bugs are about. So we just risk yet
> another regression.
Ack.
Let's make the initial version at least only check the first vma.
The long-term fix for this is to have the binaries do proper stack
expansion probing anyway, and it's quite possible that people who do
their own stack redzoning by adding a PROT_NONE thing already do that
proper fix (eg the Java stack may simply not *have* those big crazy
structures on it in the first place).
Linus
On Tue, Jul 04, 2017 at 11:37:15AM -0700, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 10:22 AM, Michal Hocko <[email protected]> wrote:
> >
> > Well, I've been thinking about this some more and the more I think about
> > it the less I am convinced we should try to be clever here. Why? Because
> > as soon as somebody tries to manage stacks explicitly you cannot simply
> > assume anything about the previous mapping. Say some interpret uses
> > [ mngmnt data][red zone] <--[- MAP_GROWSDOWN ]
> >
> > Now if we consider the red zone's (PROT_NONE) prev mapping we would fail
> > the expansion even though we haven't hit the red zone and that is
> > essentially what the Java and rust bugs are about. So we just risk yet
> > another regression.
>
> Ack.
>
> Let's make the initial version at least only check the first vma.
>
> The long-term fix for this is to have the binaries do proper stack
> expansion probing anyway, and it's quite possible that people who do
> their own stack redzoning by adding a PROT_NONE thing already do that
> proper fix (eg the Java stack may simply not *have* those big crazy
> structures on it in the first place).
But what is wrong with stopping the loop as soon as the distance gets
larger than the stack_guard_gap ?
Willy
On Tue, Jul 4, 2017 at 11:39 AM, Willy Tarreau <[email protected]> wrote:
>
> But what is wrong with stopping the loop as soon as the distance gets
> larger than the stack_guard_gap ?
Absolutely nothing. But that's not the problem with the loop. Let's
say that you are using lots of threads, so that you know your stack
space is limited. What you do is to use MAP_FIXED a lot, and you lay
out your stacks fairly densely (with each other, but also possibly
with other mappings), with that PROT_NONE redzoning mapping in between
the "dense" allocations.
So when the kernel wants to grow the stack, it finds the PROT_NONE
redzone mapping - but there's possibly other maps right under it, so
the stack_guard_gap still hits other mappings.
And the fact that this seems to trigger with
(a) 32-bit x86
(b) Java
actually makes sense in the above scenario: that's _exactly_ when
you'd have dense mappings. Java is very thread-happy, and in a 32-bit
VM, the virtual address space allocation for stacks is a primary issue
with lots of threads.
Of course, the downside to this theory is that apparently the Java
problem is not confirmed to actually be due to this (Ben root-caused
the Rust thing on ppc64), but it still sounds like quite a reasonable
thing to do.
The problem with the Java issue may be that they do that "dense stack
mappings in VM space" (for all the usual "lots of threads, limited VM"
reasons), but they may *not* have that PROT_NONE redzoning at all.
So the patch under discussion works for Rust exactly *because* it does
its redzone to show "this is where I expect the stack to end". The
i386 Java load may simply not have that marker for us to use...
Linus
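A rough stand-alone illustration of the layout Linus describes (hypothetical; addresses and offsets are arbitrary): a user-space PROT_NONE redzone sits just below the stack with another mapping packed directly underneath it, so the stack_guard_gap check can still run into the neighbouring mapping even on a kernel that skips PROT_NONE mappings:

#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	char probe;
	/* pick a spot a couple of MiB below the current stack pointer */
	uintptr_t below = ((uintptr_t)&probe - 2 * 1024 * 1024)
			  & ~(uintptr_t)(page - 1);

	/* user-space redzone marking where the stack is expected to end */
	char *redzone = mmap((void *)below, page, PROT_NONE,
			     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
	/* ... and a "real" mapping packed directly underneath it */
	char *neighbour = mmap((void *)(below - page), page,
			       PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

	printf("redzone %p, neighbour %p\n", (void *)redzone, (void *)neighbour);
	/* stack growth that gets within stack_guard_gap of "neighbour" will
	 * be refused regardless of whether the redzone itself is skipped */
	return 0;
}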
On Tue, Jul 04, 2017 at 11:47:37AM -0700, Linus Torvalds wrote:
> Let's
> say that you are using lots of threads, so that you know your stack
> space is limited. What you do is to use MAP_FIXED a lot, and you lay
> out your stacks fairly densely (with each other, but also possibly
> with other mappings), with that PROT_NONE redzoning mapping in between
> the "dense" allocations.
>
> So when the kernel wants to grow the stack, it finds the PROT_NONE
> redzone mapping - but there's possibly other maps right under it, so
> the stack_guard_gap still hits other mappings.
(...)
OK I didn't get that use case, that totally makes sense indeed! So
now we use PROT_NONE not as something that must be skipped to find
the unmapped area but as a hint that the application apparently
wants the stack to stop here.
Thanks for this clear explanation!
Willy
On Tue, 2017-07-04 at 12:36 +0100, Ben Hutchings wrote:
[...]
> This *doesn't* fix the LibreOffice regression on i386.
gdb shows me that the crash is at the last statement in this function:
static void _expand_stack_to(address bottom) {
  address sp;
  size_t size;
  volatile char *p;

  // Adjust bottom to point to the largest address within the same page, it
  // gives us a one-page buffer if alloca() allocates slightly more memory.
  bottom = (address)align_size_down((uintptr_t)bottom, os::Linux::page_size());
  bottom += os::Linux::page_size() - 1;

  // sp might be slightly above current stack pointer; if that's the case, we
  // will alloca() a little more space than necessary, which is OK. Don't use
  // os::current_stack_pointer(), as its result can be slightly below current
  // stack pointer, causing us to not alloca enough to reach "bottom".
  sp = (address)&sp;

  if (sp > bottom) {
    size = sp - bottom;
    p = (volatile char *)alloca(size);
    assert(p != NULL && p <= (volatile char *)bottom, "alloca problem?");
    p[0] = '\0';
  }
}
We have:
bottom = 0xff803fff
sp = 0xffffb178
The relevant mappings are:
ff7fc000-ff7fd000 rwxp 00000000 00:00 0
fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
So instead of a useless guard page, we have a dangerous WX page
underneath the stack! I suppose I should find out where and why that's
being allocated.
Ben.
--
Ben Hutchings
The world is coming to an end. Please log off.
On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]> wrote:
>
> We have:
>
> bottom = 0xff803fff
> sp = 0xffffb178
>
> The relevant mappings are:
>
> ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
Ugh. So that stack is actually 8MB in size, but the alloca() is about
to use up almost all of it, and there's only about 28kB left between
"bottom" and that 'rwx' mapping.
Still, that rwx mapping is interesting: it is a single page, and it
really is almost exactly 8MB below the stack.
In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
the top of that odd one-page allocation (0xff7fd000).
Can you find out where that is allocated? Perhaps a breakpoint on
mmap, with a condition to catch that particular one?
Because I'm wondering if it was done explicitly as a 8MB stack
boundary allocation, with the "knowledge" that the kernel then adds a
one-page guard page.
I really don't know why somebody would do that (as opposed to just
limiting the stack with ulimit), but the 8MB+4kB distance is kind of
intriguing.
Maybe that one-page mapping is some hack to make sure that no random
mmap() will ever get too close to the stack, so it really is a "guard
mapping", except it's explicitly designed not so much to guard the
stack from growing down further (ulimit does that), but to guard the
brk() and other mmaps from growing *up* into the stack area..
Sometimes user mode does crazy things just because people are insane.
But sometimes there really is a method to the madness.
I would *not* be surprised if the way somebody allocated the stack was
to basically say:
- let's use "mmap()" with a size of 8MB+2 pages to find a
sufficiently sized virtual memory area
- once we've gotten that virtual address space range, let's over-map
the last page as the new stack using MAP_FIXED
- finally, munmap the 8MB in between so that the new stack can grow
down into that gap the munmap creates.
Notice how you end up with exactly the above pattern of allocations,
and how it guarantees that you get a nice 8MB stack without having to
do any locking (you rely on the kernel to just find the 8MB+8kB areas,
and once one has been allocated, it will be "safe").
And yes, it would have been much nicer to just use PROT_NONE for that
initial sizing allocation, but for somebody who is only interested in
carving out a 8MB stack in virtual space, the protections are actually
kind of immaterial, so 'rwx' might be just their mental default.
Linus
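A loose sketch of the allocation pattern hypothesized above (illustrative only, not the JVM's actual code, which is identified later in the thread):

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	long page = sysconf(_SC_PAGESIZE);
	size_t stack_sz = 8UL * 1024 * 1024;

	/* 1) reserve 8 MiB + 2 pages to find a large enough hole; the rwx
	 *    protection matches the single page seen in the maps dump */
	char *base = mmap(NULL, stack_sz + 2 * page,
			  PROT_READ | PROT_WRITE | PROT_EXEC,
			  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (base == MAP_FAILED)
		return 1;

	/* 2) the top page of the reservation becomes the stack area
	 *    (over-mapped with MAP_FIXED in the description above) */
	char *stack_page = base + page + stack_sz;

	/* 3) unmap the 8 MiB in between; the stack can now grow down into
	 *    the hole, and the bottom page is left as a boundary marker */
	munmap(base + page, stack_sz);

	printf("marker page: %p-%p, stack page: %p, hole: %zu MiB\n",
	       (void *)base, (void *)(base + page), (void *)stack_page,
	       stack_sz >> 20);
	return 0;
}

The resulting layout is exactly the observed one: a single stray page roughly 8 MiB + 4 kB below the top of the stack, with nothing but the hole in between.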
This issue occurs post the stack guard patches, correct? Fixing it sounds
like this might go beyond hardening and into CVE territory.
--
Kurt Seifried -- Red Hat -- Product Security -- Cloud
PGP A90B F995 7350 148F 66BF 7554 160D 4553 5E26 7993
Red Hat Product Security contact: [email protected]
On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]> wrote:
> >
> > We have:
> >
> > bottom = 0xff803fff
> > sp = 0xffffb178
> >
> > The relevant mappings are:
> >
> > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
>
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
>
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
>
> In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).
Very interesting! I would be really curious whether changing the ulimit to
something bigger changes the picture, and if this is really the case,
what we are going to do here. We can special-case a single page mapping
under the stack, but that sounds quite dangerous for something that is
dubious in itself. PROT_NONE would explicitly fault, but we would simply
run over this mapping too easily, and who knows what might end up below
it. So to me the guard gap does its job here.
Do you want me to post the earlier patch to ignore PROT_NONE mappings,
or should we rather wait for this one to get more details?
--
Michal Hocko
SUSE Labs
On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> PROT_NONE would explicitly fault but we would simply
> run over this mapping too easily and who knows what might end up below
> it. So to me the guard gap does its job here.
I tend to think that applications that implement their own stack guard
using PROT_NONE also assume that they will never perform unchecked stack
allocations larger than their own guard, thus the condition above should
never happen. Otherwise they're bogus and/or vulnerable by design and it
is their responsibility to fix it.
Thus, if that helps, maybe we could even relax some of the stack guard
checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
packed if the application knows what it's doing. That wouldn't solve
the LibreOffice issue though, given that the lower page is RWX.
Willy
On Wed 05-07-17 10:14:43, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
>
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perfom unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
>
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing.
Yes, this is what my patch does [1]. Or did I miss your point?
> That wouldn't solve the libreoffice issue though, given the lower page
> is RWX.
Unfortunately, yes. We only have limited room to address this issue,
though. We could add a per-task (mm) stack_gap limit (controlled either
via proc or prctl) and revert back to 1 page for the specific program,
but I would be really careful about adding more hacks into the stack
expansion code.
[1] http://lkml.kernel.org/r/[email protected]
--
Michal Hocko
SUSE Labs
On Wed, Jul 05, 2017 at 10:24:41AM +0200, Michal Hocko wrote:
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing.
>
> Yes, this is what my patch does [1]. Or did I miss your point?
Sorry, you're right. I got confused when looking at the
LibreOffice dump and for whatever reason ended up thinking we were
just considering that page as part of the gap and not as a marker
for the bottom. Never mind.
> > That wouldn't solve the libreoffice issue though, given the lower page
> > is RWX.
>
> unfortunatelly yes. We only have limited room to address this issue
> though. We could add per task (mm) stack_gap limit (controlled either
> via proc or prctl) and revert back to 1 page for the specific program
> but I would be really careful to add some more hack into the stack
> expansion code.
Actually one option could be to have a sysctl causing a warning to be
emitted when hitting the stack guard instead of killing the process. We
could think for example that once this warning is emitted, the guard is
reduced to 64kB (I think it was the size before) and the application can
continue to run. That could help problematic applications get fixed
quickly. And in environments with lots of local users it would be
dissuasive enough to avoid users trying their luck on setuid binaries.
Just my two cents,
Willy
On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
> wrote:
> >
> > We have:
> >
> > bottom = 0xff803fff
> > sp = 0xffffb178
> >
> > The relevant mappings are:
> >
> > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > fffdd000-ffffe000 rw-p 00000000 00:00
> > 0 [stack]
>
> Ugh. So that stack is actually 8MB in size, but the alloca() is about
> to use up almost all of it, and there's only about 28kB left between
> "bottom" and that 'rwx' mapping.
>
> Still, that rwx mapping is interesting: it is a single page, and it
> really is almost exactly 8MB below the stack.
>
> In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> the top of that odd one-page allocation (0xff7fd000).
>
> Can you find out where that is allocated? Perhaps a breakpoint on
> mmap, with a condition to catch that particular one?
[...]
Found it, and it's now clear why only i386 is affected:
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > PROT_NONE would explicitly fault but we would simply
> > run over this mapping too easily and who knows what might end up below
> > it. So to me the guard gap does its job here.
>
> I tend to think that applications that implement their own stack guard
> using PROT_NONE also assume that they will never perform unchecked stack
> allocations larger than their own guard, thus the condition above should
> never happen. Otherwise they're bogus and/or vulnerable by design and it
> is their responsibility to fix it.
>
> Thus maybe if that helps we could even relax some of the stack guard
> checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> packed if the application knows what it's doing. That wouldn't solve
> the libreoffice issue though, given the lower page is RWX.
How about, instead of looking at permissions, we remember whether vmas
were allocated with MAP_FIXED and ignore those when evaluating the gap?
Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Tue, 2017-07-04 at 19:11 +0200, Willy Tarreau wrote:
> On Tue, Jul 04, 2017 at 12:36:11PM +0100, Ben Hutchings wrote:
> > @@ -2323,11 +2330,17 @@ int expand_downwards(struct vm_area_struct *vma,
> > if (error)
> > return error;
> >
> > - /* Enforce stack_guard_gap */
> > + /*
> > + * Enforce stack_guard_gap, but allow VM_NONE mappings in the gap
> > + * as some applications try to make their own stack guards
> > + */
> > gap_addr = address - stack_guard_gap;
> > if (gap_addr > address)
> > return -ENOMEM;
> > - prev = vma->vm_prev;
> > + for (prev = vma->vm_prev;
> > + prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC));
> > + prev = prev->vm_prev)
> > + ;
> > if (prev && prev->vm_end > gap_addr) {
> > if (!(prev->vm_flags & VM_GROWSDOWN))
> > return -ENOMEM;
>
> Hmmm shouldn't we also stop looping when we're out of the gap ?
Yes, either that or only allow one such vma.
Ben.
> Something like this :
>
> for (prev = vma->vm_prev;
> prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)) &&
> address - prev->vm_end < stack_guard_gap;
> prev = prev->vm_prev)
> ;
>
> This would limit the risk of runaway loops if someone is having fun
> allocating a lot of memory in small chunks (eg: 4 GB in 1 million
> independent mmap() calls).
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Tue, 2017-07-04 at 09:18 -0700, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:36 AM, Ben Hutchings <[email protected]> wrote:
> >
> > That's what I was thinking of. Tried the following patch:
> >
> > Subject: mmap: Ignore VM_NONE mappings when checking for space to
> > expand the stack
>
> This looks sane to me.
>
> I'm going to ignore it in this thread, and assume that it gets sent as
> a patch separately, ok?
>
> It would be good to have more acks on it.
>
> Also, separately, John Haxby kind of implied that the LibreOffice
> regression on i386 is already fixed by commit f4cb767d76cf ("mm: fix
> new crash in unmapped_area_topdown()").
>
> Or was that a separate issue?
They are separate issues.
Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Wed, Jul 05, 2017 at 01:21:54PM +0100, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> > On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > > PROT_NONE would explicitly fault but we would simply
> > > run over this mapping too easily and who knows what might end up below
> > > it. So to me the guard gap does its job here.
> >
> > I tend to think that applications that implement their own stack guard
> > using PROT_NONE also assume that they will never perform unchecked stack
> > allocations larger than their own guard, thus the condition above should
> > never happen. Otherwise they're bogus and/or vulnerable by design and it
> > is their responsibility to fix it.
> >
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing. That wouldn't solve
> > the libreoffice issue though, given the lower page is RWX.
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?
I like this idea. It leaves complete control to the application. Our
usual principle of letting people shoot themselves in the foot if they
insist on doing so.
Do you think something like this could work (not even build tested) ?
Willy
--
diff --git a/include/linux/mm.h b/include/linux/mm.h
index d16f524..8ad7f40 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -90,6 +90,7 @@
#define VM_PFNMAP 0x00000400 /* Page-ranges managed without "struct page", just pure PFN */
#define VM_DENYWRITE 0x00000800 /* ETXTBSY on write attempts.. */
+#define VM_FIXED 0x00001000 /* MAP_FIXED was used */
#define VM_LOCKED 0x00002000
#define VM_IO 0x00004000 /* Memory mapped I/O or similar */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 9aa863d..4df2659 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -79,6 +79,7 @@ static inline int arch_validate_prot(unsigned long prot)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
+ _calc_vm_trans(flags, MAP_FIXED, VM_FIXED ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
}
#endif /* _LINUX_MMAN_H */
diff --git a/mm/mmap.c b/mm/mmap.c
index 3c4e4d7..b612868 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2145,7 +2145,7 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
next = vma->vm_next;
if (next && next->vm_start < gap_addr) {
- if (!(next->vm_flags & VM_GROWSUP))
+ if (!(next->vm_flags & (VM_GROWSUP|VM_FIXED)))
return -ENOMEM;
/* Check that both stack segments have the same anon_vma? */
}
@@ -2225,7 +2225,7 @@ int expand_downwards(struct vm_area_struct *vma,
return -ENOMEM;
prev = vma->vm_prev;
if (prev && prev->vm_end > gap_addr) {
- if (!(prev->vm_flags & VM_GROWSDOWN))
+ if (!(prev->vm_flags & (VM_GROWSDOWN|VM_FIXED)))
return -ENOMEM;
/* Check that both stack segments have the same anon_vma? */
}
Hi all,
On Tue, Jul 04, 2017 at 07:16:06PM -0600, [email protected] wrote:
> This issue occurs post stackguard patches correct? Fixing it sounds like
> this might go beyond hardening and into CVE territory.
Since this thread is public on LKML, as it should be, it's no longer
valid to be CC'ed to linux-distros, which is for handling of embargoed
issues only. So please drop linux-distros from further CC's (I moved
linux-distros to Bcc on this reply, just so they know what happened).
If specific security issues are identified (such as with LibreOffice and
Java), then ideally those should be posted to oss-security as separate
reports. I'd appreciate it if anyone takes care of that (regardless of
CVE worthiness).
In fact, I already mentioned this thread in:
http://www.openwall.com/lists/oss-security/2017/07/05/11
Thank you!
Alexander
On Wed 05-07-17 13:21:54, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 10:14 +0200, Willy Tarreau wrote:
> > On Wed, Jul 05, 2017 at 08:36:46AM +0200, Michal Hocko wrote:
> > > PROT_NONE would explicitly fault but we would simply
> > > run over this mapping too easily and who knows what might end up below
> > > it. So to me the guard gap does its job here.
> >
> > I tend to think that applications that implement their own stack guard
> > using PROT_NONE also assume that they will never perform unchecked stack
> > allocations larger than their own guard, thus the condition above should
> > never happen. Otherwise they're bogus and/or vulnerable by design and it
> > is their responsibility to fix it.
> >
> > Thus maybe if that helps we could even relax some of the stack guard
> > checks as soon as we meet a PROT_NONE area, allowing VMAs to be tightly
> > packed if the application knows what it's doing. That wouldn't solve
> > the libreoffice issue though, given the lower page is RWX.
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?
To be honest I really hate this, the same way as any other heuristic
where we try to guess which gap will not fault in order to let userspace
know something is wrong. And the Java example just proves the point
AFAIU. The mapping we clash on is _not_ a gap. It is a real mapping we
should rather not scribble over. It contains code to execute and that
is even more worrying. So I guess the _only_ sane way forward for this
case is to reduce the stack gap for this particular code.
--
Michal Hocko
SUSE Labs
On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
> > wrote:
> > >
> > > We have:
> > >
> > > bottom = 0xff803fff
> > > sp = 0xffffb178
> > >
> > > The relevant mappings are:
> > >
> > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > fffdd000-ffffe000 rw-p 00000000 00:00
> > > 0 [stack]
> >
> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > to use up almost all of it, and there's only about 28kB left between
> > "bottom" and that 'rwx' mapping.
> >
> > Still, that rwx mapping is interesting: it is a single page, and it
> > really is almost exactly 8MB below the stack.
> >
> > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > the top of that odd one-page allocation (0xff7fd000).
> >
> > Can you find out where that is allocated? Perhaps a breakpoint on
> > mmap, with a condition to catch that particular one?
> [...]
>
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
This is really worrying. This doesn't look like a gap at all. It is a
mapping which actually contains a code and so we should absolutely not
allow to scribble over it. So I am afraid the only way forward is to
allow per process stack gap and run this particular program to have a
smaller gap. We basically have two ways. Either /proc/<pid>/$file or
a prctl inherited on exec. The latter is less code. What do you
think?
--
Michal Hocko
SUSE Labs
On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
> > > wrote:
> > > >
> > > > We have:
> > > >
> > > > bottom = 0xff803fff
> > > > sp = 0xffffb178
> > > >
> > > > The relevant mappings are:
> > > >
> > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > fffdd000-ffffe000 rw-p 00000000 00:00
> > > > 0 [stack]
> > >
> > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > to use up almost all of it, and there's only about 28kB left between
> > > "bottom" and that 'rwx' mapping.
> > >
> > > Still, that rwx mapping is interesting: it is a single page, and it
> > > really is almost exactly 8MB below the stack.
> > >
> > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > the top of that odd one-page allocation (0xff7fd000).
> > >
> > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > mmap, with a condition to catch that particular one?
> >
> > [...]
> >
> > Found it, and it's now clear why only i386 is affected:
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains a code and so we should absolutely not
> allow to scribble over it. So I am afraid the only way forward is to
> allow per process stack gap and run this particular program to have a
> smaller gap. We basically have two ways. Either /proc/<pid>/$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?
Distributions can do that, but what about all the other apps out there
using JNI and private copies of the JRE?
Something I noticed is that Java doesn't immediately use MAP_FIXED.
Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
mmap() doesn't return the hinted address it then attempts to allocate
huge areas (I'm not sure how intentional this is) and unmaps the
unwanted parts. Then os::workaround_expand_exec_shield_cs_limit() re-
mmap()s the wanted part with MAP_FIXED. If this fails at any point it
is not a fatal error.
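For reference, the hinted half of that dance is roughly the following
(a sketch only, not the actual HotSpot code; the helper name is made up):

#include <stddef.h>
#include <sys/mman.h>

/* Ask for a specific address without MAP_FIXED, and only keep the
 * mapping if the kernel actually honoured the hint. */
static void *reserve_at_hint(void *hint, size_t len)
{
	void *p = mmap(hint, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (p == MAP_FAILED)
		return NULL;
	if (p != hint) {
		/* The kernel picked somewhere else; give it back and
		 * let the caller fall back or give up. */
		munmap(p, len);
		return NULL;
	}
	return p;
}

Only once such a reservation has succeeded does the later MAP_FIXED
re-mmap() pin down the page that ends up just below the stack.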
So if we change vm_start_gap() to take the stack limit into account
(when it's finite) that should neutralise
os::workaround_expand_exec_shield_cs_limit(). I'll try this.
Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Wed 05-07-17 16:25:00, Ben Hutchings wrote:
> On Wed, 2017-07-05 at 16:23 +0200, Michal Hocko wrote:
> > On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
> > > > wrote:
> > > > >
> > > > > We have:
> > > > >
> > > > > bottom = 0xff803fff
> > > > > sp = 0xffffb178
> > > > >
> > > > > The relevant mappings are:
> > > > >
> > > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > > fffdd000-ffffe000 rw-p 00000000 00:00
> > > > > 0 [stack]
> > > >
> > > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > > to use up almost all of it, and there's only about 28kB left between
> > > > "bottom" and that 'rwx' mapping.
> > > >
> > > > Still, that rwx mapping is interesting: it is a single page, and it
> > > > really is almost exactly 8MB below the stack.
> > > >
> > > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > > the top of that odd one-page allocation (0xff7fd000).
> > > >
> > > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > > mmap, with a condition to catch that particular one?
> > >
> > > [...]
> > >
> > > Found it, and it's now clear why only i386 is affected:
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
> >
> > This is really worrying. This doesn't look like a gap at all. It is a
> > mapping which actually contains a code and so we should absolutely not
> > allow to scribble over it. So I am afraid the only way forward is to
> > allow per process stack gap and run this particular program to have a
> > smaller gap. We basically have two ways. Either /proc/<pid>/$file or
> > a prctl inherited on exec. The latter is less code. What do you
> > think?
>
> Distributions can do that, but what about all the other apps out there
> using JNI and private copies of the JRE?
Yes this sucks. I was thinking about something like run_legacy_stack
which would do
prctl(PR_SET_STACK_GAP, 1, 0, 0, 0);
execve(argv[1], argv+1, environ);
so we would have a way to start applications that start crashing with
the new setup without changing the default for all other applications.
The question is what to do if the execed task is suid, because we
definitely do not want to allow tricking anybody into having a smaller gap.
Or maybe just start the Java process with an increased stack rlimit?
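The rlimit route needs nothing new from the kernel; a rough sketch of
such a launcher (the wrapper itself and the 32 MiB figure are arbitrary,
error handling is minimal):

#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	struct rlimit rl;

	if (argc < 2) {
		fprintf(stderr, "usage: %s prog [args...]\n", argv[0]);
		return 1;
	}
	/* Raise the soft stack limit; programs that place their own guard
	 * or workaround mappings at the RLIMIT_STACK-derived stack bottom
	 * then put them further below the live stack. */
	if (getrlimit(RLIMIT_STACK, &rl) == 0) {
		rl.rlim_cur = 32UL * 1024 * 1024;	/* arbitrary figure */
		if (rl.rlim_max != RLIM_INFINITY && rl.rlim_cur > rl.rlim_max)
			rl.rlim_cur = rl.rlim_max;
		setrlimit(RLIMIT_STACK, &rl);
	}
	execvp(argv[1], argv + 1);
	perror("execvp");
	return 127;
}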
> Something I noticed is that Java doesn't immediately use MAP_FIXED.
> Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts. Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED. If this fails at any point it
> is not a fatal error.
>
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit(). I'll try this.
I was already thinking about doing something like that to have better
support for MAP_GROWSDOWN, but then I just gave up because this would
require capping RLIMIT_STACK for large values in order to not break
userspace again. The max value is not really clear to me.
--
Michal Hocko
SUSE Labs
On Wed, Jul 5, 2017 at 5:21 AM, Ben Hutchings <[email protected]> wrote:
>
> How about, instead of looking at permissions, we remember whether vmas
> were allocated with MAP_FIXED and ignore those when evaluating the gap?
No, I think that's a bad idea. There's tons of good reasons to use
MAP_FIXED, and real programs do it all the time.
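One of the common legitimate patterns, as a sketch (the helper name is
made up): reserve address space with PROT_NONE first, then place real
mappings inside the reservation with MAP_FIXED.

#include <stddef.h>
#include <sys/mman.h>

/* MAP_FIXED is safe here because it only ever replaces the program's
 * own PROT_NONE placeholder, never an unrelated mapping. */
static void *reserve_and_commit(size_t reserve, size_t commit)
{
	char *base = mmap(NULL, reserve, PROT_NONE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	if (mmap(base, commit, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) == MAP_FAILED) {
		munmap(base, reserve);
		return NULL;
	}
	return base;
}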
I'd much rather just do something special for the Java case, either
recognizing that particular pattern, or (and this is likely what we'll
have to do) just have a per-process stack limit that
(a) will be reset by suid transitions etc security boundaries
(b) you can just set back to 4kB for the specific Java case.
because I'd rather make this be a very conscious thing rather than a
random hack.
The PROT_NONE thing made tons of conceptual sense ("let people do
their own guard mappings"). The MAP_FIXED thing does not.
Linus
On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko <[email protected]> wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
>> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
>> > wrote:
>> > >
>> > > We have:
>> > >
>> > > bottom = 0xff803fff
>> > > sp = 0xffffb178
>> > >
>> > > The relevant mappings are:
>> > >
>> > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
>> > > fffdd000-ffffe000 rw-p 00000000 00:00
>> > > 0 [stack]
>> >
>> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
>> > to use up almost all of it, and there's only about 28kB left between
>> > "bottom" and that 'rwx' mapping.
>> >
>> > Still, that rwx mapping is interesting: it is a single page, and it
>> > really is almost exactly 8MB below the stack.
>> >
>> > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
>> > the top of that odd one-page allocation (0xff7fd000).
>> >
>> > Can you find out where that is allocated? Perhaps a breakpoint on
>> > mmap, with a condition to catch that particular one?
>> [...]
>>
>> Found it, and it's now clear why only i386 is affected:
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
>> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains a code and so we should absolutely not
> allow to scribble over it. So I am afraid the only way forward is to
> allow per process stack gap and run this particular program to have a
> smaller gap. We basically have two ways. Either /proc/<pid>/$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?
Why inherit on exec?
I think that, if we add a new API, we should do it right rather than
making it even more hackish. Specifically, we'd add a real VMA type
(via flag or whatever) that means "this is a modern stack". A modern
stack wouldn't ever expand and would have no guard page at all. It
would, however, properly account stack space by tracking the pages
used as stack space. Users of the new VMA type would be responsible
for allocating their own guard pages, probably by mapping an extra
page and then mapping PROT_NONE over it.
Also, this doesn't even need a new API, I think. What's wrong with
plain old mmap(2) with MAP_STACK and *without* MAP_GROWSDOWN? Only
new kernels would get the accounting right, but I doubt that matters
much in practice.
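A sketch of what that would look like from userspace (the helper name is
made up; today MAP_STACK is only a hint, so the accounting part is the
piece that would need new kernel behaviour):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Fixed-size stack with a caller-managed guard page at its low end and
 * no MAP_GROWSDOWN at all.  Returns the lowest usable address. */
static void *alloc_stack(size_t size)
{
	size_t page = (size_t)sysconf(_SC_PAGESIZE);
	char *base = mmap(NULL, size + page, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE | MAP_ANONYMOUS | MAP_STACK, -1, 0);

	if (base == MAP_FAILED)
		return NULL;
	/* The lowest page becomes the guard. */
	if (mprotect(base, page, PROT_NONE) != 0) {
		munmap(base, size + page);
		return NULL;
	}
	return base + page;
}

The returned pointer and size would then go to pthread_attr_setstack()
or clone().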
On Wed, Jul 5, 2017 at 5:19 AM, Ben Hutchings <[email protected]> wrote:
> On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
>>
>> Can you find out where that is allocated? Perhaps a breakpoint on
>> mmap, with a condition to catch that particular one?
>
> Found it, and it's now clear why only i386 is affected:
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
Thanks, good work.
Well, good work on *your* part. I will try very hard to refrain from
commenting too much on the f*cking stinking pile of sh*t that was
exec-shield.
But yes, I don't think we can sanely recognize this. The code clearly
very intentionally does that mapping under the stack, and it's very
intentionally not PROT_NONE, since it's meant to be both writable and
executable.
As I said earlier (and I see Michal Hocko suggested the same - sudden
email flurry going on here), I think we need to basically allow people
to set the stack gap per-process to something low.
The good news is that this is probably specialized enough that we can
just keep the defaults as "will break this one case, but we give
people the tools to work around it".
I hate doing that, but distros that still support 32-bit (which is
apparently a shrinking number) can maybe hack the libreoffice launch
scripts up?
Linus
On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko <[email protected]> wrote:
>>
>> This is really worrying. This doesn't look like a gap at all. It is a
>> mapping which actually contains a code and so we should absolutely not
>> allow to scribble over it. So I am afraid the only way forward is to
>> allow per process stack gap and run this particular program to have a
>> smaller gap. We basically have two ways. Either /proc/<pid>/$file or
>> a prctl inherited on exec. The latter is less code. What do you
>> think?
>
> Why inherit on exec?
.. because the whole point is that you have an existing binary that breaks.
So you need to be able to wrap it in "let's lower the stack gap, then
run that known-problematic binary".
If you think the problem is solved by recompiling existing binaries,
then why are we doing this kernel hack to begin with? The *real*
solution was always to just fix the damn compiler and ABI.
That *real* solution is simple and needs no kernel support at all.
In other words, *ALL* of the kernel work in this area is purely to
support existing binaries. Don't overlook that fact.
Linus
On Wed, Jul 05, 2017 at 04:25:00PM +0100, Ben Hutchings wrote:
[...]
> Something I noticed is that Java doesn't immediately use MAP_FIXED.
> Look at os::pd_attempt_reserve_memory_at(). If the first, hinted,
> mmap() doesn't return the hinted address it then attempts to allocate
> huge areas (I'm not sure how intentional this is) and unmaps the
> unwanted parts. Then os::workaround_expand_exec_shield_cs_limit() re-
> mmap()s the wanted part with MAP_FIXED. If this fails at any point it
> is not a fatal error.
>
> So if we change vm_start_gap() to take the stack limit into account
> (when it's finite) that should neutralise
> os::workaround_expand_exec_shield_cs_limit(). I'll try this.
I ended up with the following two patches, which seem to deal with
both the Java and Rust regressions. These don't touch the
stack-grows-up paths at all because Rust doesn't run on those
architectures and the Java weirdness is i386-specific.
They definitely need longer commit messages and comments, but aside
from that do these look reasonable?
Ben.
Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
Signed-off-by: Ben Hutchings <[email protected]>
---
mm/mmap.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)
diff --git a/mm/mmap.c b/mm/mmap.c
index a5e3dcd75e79..c7906ae1a7a1 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2323,11 +2323,16 @@ int expand_downwards(struct vm_area_struct *vma,
if (error)
return error;
- /* Enforce stack_guard_gap */
+ /*
+ * Enforce stack_guard_gap. Some applications allocate a VM_NONE
+ * mapping just below the stack, which we can safely ignore.
+ */
gap_addr = address - stack_guard_gap;
if (gap_addr > address)
return -ENOMEM;
prev = vma->vm_prev;
+ if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
+ prev = prev->vm_prev;
if (prev && prev->vm_end > gap_addr) {
if (!(prev->vm_flags & VM_GROWSDOWN))
return -ENOMEM;
Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent
if finite
Signed-off-by: Ben Hutchings <[email protected]>
---
include/linux/mm.h | 9 ++++-----
mm/mmap.c | 19 +++++++++++++++++++
2 files changed, 23 insertions(+), 5 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a47fc92..2240a0505072 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2223,15 +2223,14 @@ static inline struct vm_area_struct * find_vma_intersection(struct mm_struct * m
return vma;
}
+unsigned long __vm_start_gap(struct vm_area_struct *vma);
+
static inline unsigned long vm_start_gap(struct vm_area_struct *vma)
{
unsigned long vm_start = vma->vm_start;
- if (vma->vm_flags & VM_GROWSDOWN) {
- vm_start -= stack_guard_gap;
- if (vm_start > vma->vm_start)
- vm_start = 0;
- }
+ if (vma->vm_flags & VM_GROWSDOWN)
+ vm_start = __vm_start_gap(vma);
return vm_start;
}
diff --git a/mm/mmap.c b/mm/mmap.c
index c7906ae1a7a1..f8131a94e56e 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
}
#endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
+unsigned long __vm_start_gap(struct vm_area_struct *vma)
+{
+ unsigned long stack_limit =
+ current->signal->rlim[RLIMIT_STACK].rlim_cur;
+ unsigned long vm_start;
+
+ if (stack_limit != RLIM_INFINITY &&
+ vma->vm_end - vma->vm_start < stack_limit)
+ vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
+ else
+ vm_start = vma->vm_start;
+
+ vm_start -= stack_guard_gap;
+ if (vm_start > vma->vm_start)
+ vm_start = 0;
+
+ return vm_start;
+}
+
/*
* vma is the first one with address < vma->vm_start. Have to extend vma.
*/
--
Ben Hutchings
For every complex problem
there is a solution that is simple, neat, and wrong.
On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
[...]
> diff --git a/mm/mmap.c b/mm/mmap.c
> index c7906ae1a7a1..f8131a94e56e 100644
> --- a/mm/mmap.c
> +++ b/mm/mmap.c
> @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> }
> #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
>
> +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> +{
> + unsigned long stack_limit =
> + current->signal->rlim[RLIMIT_STACK].rlim_cur;
> + unsigned long vm_start;
> +
> + if (stack_limit != RLIM_INFINITY &&
> + vma->vm_end - vma->vm_start < stack_limit)
> + vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
This is exactly what I was worried about in my previous email. Say
somebody sets stack ulimit to 1G or so. Should we reduce the available
address space that much? Say you are on 32-bit and you have an application
with multiple stacks each doing its MAP_GROWSDOWN. You are quickly out
of address space. That's why I've said that we would need to find a cap
for the user-defined limit. How much should that be, though? A few (tens,
hundreds of) megs? If we can figure that out I would of course be quite
happy about such a change, because MAP_GROWSDOWN doesn't really work well
these days.
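(With a 1G ulimit, for example, just three such stacks would already
cover the whole ~3G of i386 user address space.)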
> + else
> + vm_start = vma->vm_start;
> +
> + vm_start -= stack_guard_gap;
> + if (vm_start > vma->vm_start)
> + vm_start = 0;
> +
> + return vm_start;
> +}
> +
> /*
> * vma is the first one with address < vma->vm_start. Have to extend vma.
> */
>
> --
> Ben Hutchings
> For every complex problem
> there is a solution that is simple, neat, and wrong.
--
Michal Hocko
SUSE Labs
On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings <[email protected]> wrote:
>
> I ended up with the following two patches, which seem to deal with
> both the Java and Rust regressions. These don't touch the
> stack-grows-up paths at all because Rust doesn't run on those
> architectures and the Java weirdness is i386-specific.
>
> They definitely need longer commit messages and comments, but aside
> from that do these look reasonable?
I think they both look reasonable, but I think we might still want to
massage things a bit (cutting down the quoting to a minimum, hopefully
leaving enough context to still make sense):
> Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
>
> prev = vma->vm_prev;
> + if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> + prev = prev->vm_prev;
> if (prev && prev->vm_end > gap_addr) {
Do we just want to ignore the user-supplied guard mapping, or do we
want to say "if the user does a guard mapping, we use that *instead*
of our stack gap"?
IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
to just return "ok".
> Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite
This is good thinking, but no, I don't think the "if finite" is right.
I've seen people use "really big values" as replacement for
RLIM_INFINITY, for various reasons.
We've had huge confusion about RLIM_INFINITY over the years - look for
things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
we've had.
Some people just use MAX_LONG etc, which is *not* the same as
RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
yadda.
So I'm personally leery of checking and depending on "exactly
RLIM_INFINITY", because I've seen it go wrong so many times.
And I think your second patch breaks that "use a really large value to
approximate infinity" case that definitely has existed as a pattern.
Linus
On Wed, Jul 5, 2017 at 9:20 AM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jul 5, 2017 at 9:15 AM, Andy Lutomirski <[email protected]> wrote:
>> On Wed, Jul 5, 2017 at 7:23 AM, Michal Hocko <[email protected]> wrote:
>>>
>>> This is really worrying. This doesn't look like a gap at all. It is a
>>> mapping which actually contains a code and so we should absolutely not
>>> allow to scribble over it. So I am afraid the only way forward is to
>>> allow per process stack gap and run this particular program to have a
>>> smaller gap. We basically have two ways. Either /proc/<pid>/$file or
>>> a prctl inherited on exec. The latter is less code. What do you
>>> think?
>>
>> Why inherit on exec?
>
> .. because the whole point is that you have an existing binary that breaks.
>
> So you need to be able to wrap it in "let's lower the stack gap, then
> run that known-problematic binary".
>
> If you think the problem is solved by recompiling existing binaries,
> then why are we doing this kernel hack to begin with? The *real*
> solution was always to just fix the damn compiler and ABI.
That's not what I was suggesting at all. I was suggesting that, if
we're going to suggest a new API, that the new API actually be sane.
>
> That *real* solution is simple and needs no kernel support at all.
>
> In other words, *ALL* of the kernel work in this area is purely to
> support existing binaries. Don't overlook that fact.
Right. But I think the approach that we're all taking here is a bit
nutty. We all realize that this issue is a longstanding *GCC* bug
[1], but we're acting like it's a Big Deal (tm) kernel bug that Must
Be Fixed (tm) and therefore is allowed to break ABI. My security hat
is normally pretty hard-line, but I think it may be time to call BS.
Imagine if Kees had sent some symlink hardening patch that was
default-on and broke a stock distro. Or if I had sent a vsyscall
hardening patch that broke real code. It would get reverted right
away, probably along with a diatribe about how we should have known
better. I think this stack gap stuff is the same thing. It's not a
security fix -- it's a hardening patch.
Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
I'm starting to think that the right approach is to mostly revert all
this stuff (the execve fixes are fine). Then start over and think
about it as hardening. I would suggest the following approach:
- The stack gap is one page, just like it's been for years.
- As a hardening feature, if the stack would expand within 64k or
whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
The idea being that, if you deliberately place a mapping under the
stack, you know what you're doing. If you're like LibreOffice and do
something daft and are thus exploitable, you're on your own.
- As a hardening measure, don't let mmap without MAP_FIXED position
something within 64k or whatever of the bottom of the stack unless a
MAP_FIXED mapping is between them.
And that's all. It's not like a 64k gap actually fixes these bugs for
real -- it just makes them harder to exploit.
[1] The code that GCC generates for char buf[big number] and alloca()
is flat-out wrong. Everyone who's ever thought about it at all knows
it and has known about it for years, but no one cared to fix it.
On Wed, 2017-07-05 at 19:05 +0200, Michal Hocko wrote:
> On Wed 05-07-17 17:58:45, Ben Hutchings wrote:
> [...]
> > diff --git a/mm/mmap.c b/mm/mmap.c
> > index c7906ae1a7a1..f8131a94e56e 100644
> > --- a/mm/mmap.c
> > +++ b/mm/mmap.c
> > @@ -2307,6 +2307,25 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
> > }
> > #endif /* CONFIG_STACK_GROWSUP || CONFIG_IA64 */
> >
> > +unsigned long __vm_start_gap(struct vm_area_struct *vma)
> > +{
> > + unsigned long stack_limit =
> > + current->signal->rlim[RLIMIT_STACK].rlim_cur;
> > + unsigned long vm_start;
> > +
> > + if (stack_limit != RLIM_INFINITY &&
> > + vma->vm_end - vma->vm_start < stack_limit)
> > + vm_start = vma->vm_end - PAGE_ALIGN(stack_limit);
>
> This is exactly what I was worried about in my previous email. Say
> somebody sets stack ulimit to 1G or so. Should we reduce the available
> address space that much?
It's not ideal, but why would someone set the stack limit that high
unless it's for an application that will actually use most of that
stack space? Do you think that "increase the stack limit" has been
cargo-culted?
> Say you are on 32-bit and you have an application
> with multiple stacks each doing its MAP_GROWSDOWN.
[...]
So this application is using dietlibc or uclibc? glibc uses fixed-size
mappings for new threads.
I suppose there's a risk that by doing this we would make
MAP_GROWSDOWN useful enough that it is more likely to be used for new
thread stacks in future.
Ben.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Wed, Jul 05, 2017 at 09:17:59AM -0700, Linus Torvalds wrote:
(...)
> The good news is that this is probably specialized enough that we can
> just keep the defaults as "will break this one case, but we give
> people the tools to work around it".
>
> I hate doing that, but distros that still support 32-bit (which is
> apparently a shrinking number) can maybe hack the libreoffice launch
> scripts up?
Don't you think that the option of having a sysctl to relax the check
per task would be easier for distros and safer overall? Ie, emit
a warning the first time the gap is hit instead of segfaulting, then
reduce it to something that used to work (4k or 64k, I don't remember)
and try again ? It would quickly report all these "special" programs
for end-user distros, without leaving too much room for attacks due
to the warning making it pretty obvious what's going on. I just don't
know how to place this stack gap per process but since this was already
discussed for prctl I think it's doable.
Willy
On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau <[email protected]> wrote:
>
> Don't you think that the option of having a sysctl to relax the check
> per task would be easier for distros and safer overall? Ie, emit
> a warning the first time the gap is hit instead of segfaulting, then
> reduce it to something that used to work (4k or 64k, I don't remember)
> and try again ?
It used to be just 4k.
.. and I think that might be a valid way to find these things, but
would it be safer? It basically disables the new stack gap entirely
apart from the warning.
And maybe that's ok and distros prefer that?
Linus
On Wed, Jul 05, 2017 at 12:17:20PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 11:59 AM, Willy Tarreau <[email protected]> wrote:
> >
> > Don't you think that the option of having a sysctl to relax the check
> > per task would be easier for distros and safer overall? Ie, emit
> > a warning the first time the gap is hit instead of segfaulting, then
> > reduce it to something that used to work (4k or 64k, I don't remember)
> > and try again ?
>
> It used to be just 4k.
>
> .. and I think that might be a valid way to find these things, but
> would it be safer? It basically disables the new stack gap entirely
> apart from the warning.
But only if the sysctl is set. It can simply be recommended to set it
if any program fails. We've done this for many years with other ones
like mmap_min_addr or tcp_ecn.
Willy
On Wed, Jul 5, 2017 at 12:18 PM, Willy Tarreau <[email protected]> wrote:
>
> But only if the sysctl is set. It can simply be recommended to set it
> if any program fails. We've done this for many years with other ones
> like mmap_min_addr or tcp_ecn.
Ok, fair enough. I don't hate the approach, and maybe it's simpler
overall, and would help find other potential problem spots.
*Hopefully* it was just that Rust thing and the nasty Java exec-shield
workaround, but yeah, those might just be the first ones that have
been found so far.
Linus
On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine). Then start over and think
> about it as hardening. I would suggest the following approach:
>
> - The stack gap is one page, just like it's been for years.
Given that in the following points you say that something sounding like
a stack gap would be "64k or whatever", what does "the stack gap" mean
in this first point?
> - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing. If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
> - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.
Having tested patches along these lines, I think the above would avoid
the reported regressions.
Ben.
> And that's all. It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
>
> [1] The code that GCC generates for char buf[big number] and alloca()
> is flat-out wrong. Everyone who's ever thought about it at all knows
> it and has known about it for years, but no one cared to fix it.
--
Ben Hutchings
Anthony's Law of Force: Don't force it, get a larger hammer.
On Wed, Jul 05, 2017 at 08:32:43PM +0100, Ben Hutchings wrote:
> >  - As a hardening feature, if the stack would expand within 64k or
> > whatever of a non-MAP_FIXED mapping, refuse to expand it.  (This might
> > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > The idea being that, if you deliberately place a mapping under the
> > stack, you know what you're doing.  If you're like LibreOffice and do
> > something daft and are thus exploitable, you're on your own.
> >  - As a hardening measure, don't let mmap without MAP_FIXED position
> > something within 64k or whatever of the bottom of the stack unless a
> > MAP_FIXED mapping is between them.
>
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.
Stuff like this has already been proposed, but Linus suspects that more
software than we imagine uses MAP_FIXED and could break. I can neither
confirm nor deny that, which probably indicates that there's nothing
fundamentally wrong with this approach from userland's perspective, and
that such software may indeed be more common than we would like.
Willy
> On Jul 5, 2017, at 12:32 PM, Ben Hutchings <[email protected]> wrote:
>
>> On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
>> [...]
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>>
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine). Then start over and think
>> about it as hardening. I would suggest the following approach:
>>
>> - The stack gap is one page, just like it's been for years.
>
> Given that in the following points you say that something sounding like
> a stack gap would be "64k or whatever", what does "the stack gap" mean
> in this first point?
I mean one page, with semantics as close to previous (4.11) behavior as practical.
>
>> - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing. If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>> - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
>
> Having tested patches along these lines, I think the above would avoid
> the reported regressions.
>
FWIW, even this last part may be problematic. It'll break anything that tries to allocate many small MAP_GROWSDOWN stacks on 32-bit. Hopefully nothing does this, but maybe Java does.
> Ben.
>
>> And that's all. It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong. Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
> --
> Ben Hutchings
> Anthony's Law of Force: Don't force it, get a larger hammer.
>
On Wed, 2017-07-05 at 10:15 -0700, Linus Torvalds wrote:
> > On Wed, Jul 5, 2017 at 9:58 AM, Ben Hutchings <[email protected]> wrote:
> >
> > I ended up with the following two patches, which seem to deal with
> > both the Java and Rust regressions. These don't touch the
> > stack-grows-up paths at all because Rust doesn't run on those
> > architectures and the Java weirdness is i386-specific.
> >
> > They definitely need longer commit messages and comments, but aside
> > from that do these look reasonable?
>
> I thin kthey both look reasonable, but I think we might still want to
> massage things a bit (cutting down the quoting to a minimum, hopefully
> leaving enough context to still make sense):
>
> > Subject: [1/2] mmap: Skip a single VM_NONE mapping when checking the stack gap
> >
> > prev = vma->vm_prev;
> > + if (prev && !(prev->vm_flags & (VM_READ | VM_WRITE | VM_EXEC)))
> > + prev = prev->vm_prev;
> > if (prev && prev->vm_end > gap_addr) {
>
> Do we just want to ignore the user-supplied guard mapping, or do we
> want to say "if the user does a guard mapping, we use that *instead*
> of our stack gap"?
>
> IOW, instead of "prev = prev->vm_prev;" and continuing, maybe we want
> to just return "ok".
Rust effectively added a second guard page to the main thread stack.
But it does not (yet) implement stack probing
(https://github.com/rust-lang/rust/issues/16012) so I think it will
benefit from the kernel's larger stack guard gap.
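For reference, that pattern is roughly the following (a sketch, not
Rust's actual code; the helper name is made up):

#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <unistd.h>

/* Query the stack bounds libc reports for the current thread and map a
 * PROT_NONE page at the reported lower limit as a userspace guard. */
static int install_own_guard(void)
{
	pthread_attr_t attr;
	void *lowest;
	size_t size, page = (size_t)sysconf(_SC_PAGESIZE);
	int ret = -1;

	if (pthread_getattr_np(pthread_self(), &attr) != 0)
		return -1;
	if (pthread_attr_getstack(&attr, &lowest, &size) == 0 &&
	    mmap(lowest, page, PROT_NONE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0) != MAP_FAILED)
		ret = 0;
	pthread_attr_destroy(&attr);
	return ret;
}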
> > Subject: [2/2] mmap: Avoid mapping anywhere within the full stack extent if finite
>
> This is good thinking, but no, I don't think the "if finite" is right.
>
> I've seen people use "really big values" as replacement for
> RLIM_INFINITY, for various reasons.
>
> We've had huge confusion about RLIM_INFINITY over the years - look for
> things like COMPAT_RLIM_OLD_INFINITY to see the kinds of confusions
> we've had.
That sounds familiar...
> Some people just use MAX_LONG etc, which is *not* the same as
> RLIM_INFINITY, but in practice ends up doing the same thing. Yadda
> yadda.
>
> So I'm personally leery of checking and depending on "exactly
> RLIM_INFINITY", because I've seen it go wrong so many times.
>
> And I think your second patch breaks that "use a really large value to
> approximate infinity" case that definitely has existed as a pattern.
Right. Well that seems to leave us with remembering the MAP_FIXED flag
and using that as the condition to ignore the previous mapping.
Ben.
--
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin
On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski <[email protected]> wrote:
> Right. But I think the approach that we're all taking here is a bit
> nutty. We all realize that this issue is a longstanding *GCC* bug
> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
> Be Fixed (tm) and therefore is allowed to break ABI. My security hat
> is normally pretty hard-line, but I think it may be time to call BS.
>
> Imagine if Kees had sent some symlink hardening patch that was
> default-on and broke a stock distro. Or if I had sent a vsyscall
> hardening patch that broke real code. It would get reverted right
> away, probably along with a diatribe about how we should have known
> better. I think this stack gap stuff is the same thing. It's not a
> security fix -- it's a hardening patch.
>
> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>
> I'm starting to think that the right approach is to mostly revert all
> this stuff (the execve fixes are fine). Then start over and think
> about it as hardening. I would suggest the following approach:
>
> - The stack gap is one page, just like it's been for years.
> - As a hardening feature, if the stack would expand within 64k or
> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> The idea being that, if you deliberately place a mapping under the
> stack, you know what you're doing. If you're like LibreOffice and do
> something daft and are thus exploitable, you're on your own.
> - As a hardening measure, don't let mmap without MAP_FIXED position
> something within 64k or whatever of the bottom of the stack unless a
> MAP_FIXED mapping is between them.
>
> And that's all. It's not like a 64k gap actually fixes these bugs for
> real -- it just makes them harder to exploit.
>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong. Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
As part of that should we put restrictions on the environment of
set*id exec too? Part of the risks demonstrated by Qualys was that
allowing a privilege-elevating binary to inherit rlimits can lead
to nasty memory layout side-effects. That would fall into the
"hardening" bucket as well. And if it turns out there is some set*id
binary out there that can't run with "only", e.g., 128MB of stack, we
can make it configurable...
-Kees
--
Kees Cook
Pixel Security
On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings <[email protected]> wrote:
>>
>> And I think your second patch breaks that "use a really large value to
>> approximate infinity" case that definitely has existed as a pattern.
>
> Right. Well that seems to leave us with remembering the MAP_FIXED flag
> and using that as the condition to ignore the previous mapping.
I'm not particularly happy about having a MAP_FIXED special case, but
yeah, I'm not seeing a lot of alternatives.
Linus
On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
> On Jul 5, 2017, at 12:32 PM, Ben Hutchings <[email protected]> wrote:
> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
[...]
> > > - As a hardening feature, if the stack would expand within 64k or
> > > whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
> > > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
> > > The idea being that, if you deliberately place a mapping under the
> > > stack, you know what you're doing. If you're like LibreOffice and do
> > > something daft and are thus exploitable, you're on your own.
> > > - As a hardening measure, don't let mmap without MAP_FIXED position
> > > something within 64k or whatever of the bottom of the stack unless a
> > > MAP_FIXED mapping is between them.
> >
> > Having tested patches along these lines, I think the above would avoid
> > the reported regressions.
> >
>
> FWIW, even this last part may be problematic. It'll break anything
> that tries to allocate many small MAP_GROWSDOWN stacks on 32-
> bit. Hopefully nothing does this, but maybe Java does.
glibc (NPTL) does not. Java (at least Hotspot in OpenJDK 6,7, 8) does
not. LinuxThreads *does* and is used by uclibc. dietlibc *does*. I
would be surprised if either was used for applications with very many
threads, but then this issue has thrown up a lot of surprises.
Ben.
--
Ben Hutchings
Man invented language to satisfy his deep need to complain. - Lily
Tomlin
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook <[email protected]> wrote:
>
> As part of that should we put restrictions on the environment of
> set*id exec too?
I'm not seeing what sane limits you could use.
I think the concept of "reset as much of the environment to sane
things when running suid binaries" is a good concept.
But we simply don't have any sane values to reset things to.
Linus
On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook <[email protected]> wrote:
> On Wed, Jul 5, 2017 at 10:23 AM, Andy Lutomirski <[email protected]> wrote:
>> Right. But I think the approach that we're all taking here is a bit
>> nutty. We all realize that this issue is a longstanding *GCC* bug
>> [1], but we're acting like it's a Big Deal (tm) kernel bug that Must
>> Be Fixed (tm) and therefore is allowed to break ABI. My security hat
>> is normally pretty hard-line, but I think it may be time to call BS.
>>
>> Imagine if Kees had sent some symlink hardening patch that was
>> default-on and broke a stock distro. Or if I had sent a vsyscall
>> hardening patch that broke real code. It would get reverted right
>> away, probably along with a diatribe about how we should have known
>> better. I think this stack gap stuff is the same thing. It's not a
>> security fix -- it's a hardening patch.
>>
>> Looking at it that way, I think a new inherited-on-exec flag is nucking futs.
>>
>> I'm starting to think that the right approach is to mostly revert all
>> this stuff (the execve fixes are fine). Then start over and think
>> about it as hardening. I would suggest the following approach:
>>
>> - The stack gap is one page, just like it's been for years.
>> - As a hardening feature, if the stack would expand within 64k or
>> whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
>> have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> The idea being that, if you deliberately place a mapping under the
>> stack, you know what you're doing. If you're like LibreOffice and do
>> something daft and are thus exploitable, you're on your own.
>> - As a hardening measure, don't let mmap without MAP_FIXED position
>> something within 64k or whatever of the bottom of the stack unless a
>> MAP_FIXED mapping is between them.
>>
>> And that's all. It's not like a 64k gap actually fixes these bugs for
>> real -- it just makes them harder to exploit.
>>
>> [1] The code that GCC generates for char buf[big number] and alloca()
>> is flat-out wrong. Everyone who's ever thought about it at all knows
>> it and has known about it for years, but no one cared to fix it.
>
> As part of that should we put restrictions on the environment of
> set*id exec too? Part of the risks demonstrated by Qualys was that
> allowing a privilege-elevating binary to inherit rlimits can lead
> to nasty memory layout side-effects. That would fall into the
> "hardening" bucket as well. And if it turns out there is some set*id
> binary out there that can't run with "only", e.g., 128MB of stack, we
> can make it configurable...
Yes. I think it's ridiculous that you can change rlimits and then
exec a setuid thing. It's not so easy to fix, though. Maybe track,
per-task, inherited by clone and exec, what the rlimits were the last
time the process had privilege and reset to those limits when running
something setuid. But a better approach might be to have some sysctls
that say what the rlimits become when doing setuid.
We need per-user-ns sysctls for stuff like this, and we don't really
have them...
--Andy
On Wed, Jul 5, 2017 at 4:50 PM, Ben Hutchings <[email protected]> wrote:
> On Wed, 2017-07-05 at 13:53 -0700, Andy Lutomirski wrote:
>> On Jul 5, 2017, at 12:32 PM, Ben Hutchings <[email protected]> wrote:
>> > On Wed, 2017-07-05 at 10:23 -0700, Andy Lutomirski wrote:
> [...]
>> > > - As a hardening feature, if the stack would expand within 64k or
>> > > whatever of a non-MAP_FIXED mapping, refuse to expand it. (This might
>> > > have to be a non-hinted mapping, not just a non-MAP_FIXED mapping.)
>> > > The idea being that, if you deliberately place a mapping under the
>> > > stack, you know what you're doing. If you're like LibreOffice and do
>> > > something daft and are thus exploitable, you're on your own.
>> > > - As a hardening measure, don't let mmap without MAP_FIXED position
>> > > something within 64k or whatever of the bottom of the stack unless a
>> > > MAP_FIXED mapping is between them.
>> >
>> > Having tested patches along these lines, I think the above would avoid
>> > the reported regressions.
>> >
>>
>> FWIW, even this last part may be problematic. It'll break anything
>> that tries to allocate many small MAP_GROWSDOWN stacks on 32-
>> bit. Hopefully nothing does this, but maybe Java does.
>
> glibc (NPTL) does not. Java (at least Hotspot in OpenJDK 6,7, 8) does
> not. LinuxThreads *does* and is used by uclibc. dietlibc *does*. I
> would be surprised if either was used for applications with very many
> threads, but then this issue has thrown up a lot of surprises.
>
Ugh. But yeah, I'd be a bit surprised to see heavily threaded apps
using LinuxThreads or dietlibc.
LinuxThreads still uses modify_ldt(), right? modify_ldt() performance
is abysmal, and I have no intention of even trying to optimize it.
Anyhow, you *can't* have more than 8192 threads if you use
modify_ldt() for TLS because you run out of LDT slots. 8192 * 64k
fits in 32 bits with room to spare, so this is unlikely to be a
showstopper.
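(8192 threads * 64k is 512MB, against roughly 3GB of user address space
on i386.)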
--Andy
On Wed, Jul 5, 2017 at 4:55 PM, Linus Torvalds
<[email protected]> wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook <[email protected]> wrote:
>>
>> As part of that should we put restrictions on the environment of
>> set*id exec too?
>
> I'm not seeing what sane limits you could use.
>
> I think the concept of "reset as much of the environment to sane
> things when running suid binaries" is a good concept.
>
> But we simply don't have any sane values to reset things to.
I wonder if we could pull some "sane" values out of our arses and have
it work just fine.
It's worth noting that a lot of the rlimits don't meaningfully
restrict the use of any particular resource, so we could plausibly
drop requirements to have privilege to increase them if we really
cared to. I don't see why we'd make such a change, but it means that,
if we reset on set*id and therefore poke a hole that allows a program
to do "sudo -u $me whatever" and thereby reset limits, it's not so
bad. A tiny survey:
RLIMIT_AS: not a systemwide resource at all.
RLIMIT_CORE: more or less just a policy of what you do when you crash.
I don't see how you could do much damage here.
RLIMIT_CPU: unless you're not allowed to fork(), this doesn't restrict
anything systemwide.
RLIMIT_DATA: ***
RLIMIT_FSIZE: maybe? but I can see this being quite dangerous across set*id
RLIMIT_LOCKS: gone
RLIMIT_MEMLOCK: this one matters, but it also seems nearly worthless
for exploits
RLIMIT_MSGQUEUE: privilege matters here
RLIMIT_NICE: maybe? anyone who actually cares would use cgroups instead
RLIMIT_NOFILE: great for exploits. Only sort of useful for resource management
RLIMIT_NPROC: privilege matters here
RLIMIT_RTTIME: privilege kind of matters. Also dangerous for exploits
(a bit) since it lets you kill your children at controlled times.
RLIMIT_SIGPENDING: not sure
RLIMIT_STACK: ***
*** means that this is a half-arsed resource control. It's half-arsed
because this stuff doesn't cover mmap(2), which seems to me like it
defeats the purpose. This stuff feels like a throwback to the
eighties.
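To make the *** point concrete, here is a minimal sketch; the exact outcome
depends on the kernel version (historically RLIMIT_DATA constrained only
brk(), while some newer kernels also account private writable mappings), but
the gist is that a tiny RLIMIT_DATA does not necessarily stop a large
anonymous mmap():

/* Sketch: RLIMIT_DATA as a half-hearted control.  Lower it to 1 MiB,
 * then ask for 256 MiB of anonymous memory via mmap(); on many kernels
 * the mmap() still succeeds because only brk() was ever accounted.
 */
#include <sys/resource.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    struct rlimit rl = { 1 << 20, 1 << 20 };   /* 1 MiB */

    if (setrlimit(RLIMIT_DATA, &rl) != 0)
        perror("setrlimit(RLIMIT_DATA)");

    void *p = mmap(NULL, 256UL << 20, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    printf("256 MiB anonymous mmap: %s\n",
           p == MAP_FAILED ? "failed" : "succeeded");
    return 0;
}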
On Wed, Jul 5, 2017 at 5:31 PM, Andy Lutomirski <[email protected]> wrote:
>
> I wonder if we could pull some "sane" values out of our arses and have
> it work just fine.
That approach may work, but it's pretty nasty.
But together with at least some way for the distro to set the values
we pick, it would probably be fairly reasonable.
You're right that most of the rlimits are just not very useful.
Linus
On Wed, Jul 5, 2017 at 5:19 PM, Andy Lutomirski <[email protected]> wrote:
> On Wed, Jul 5, 2017 at 4:50 PM, Kees Cook <[email protected]> wrote:
>> As part of that should we put restrictions on the environment of
>> set*id exec too? Part of the risks demonstrated by Qualys was that
>> allowing a privilege-elevating binary to inherit rlimits can lead
>> to nasty memory layout side-effects. That would fall into the
>> "hardening" bucket as well. And if it turns out there is some set*id
>> binary out there that can't run with "only", e.g., 128MB of stack, we
>> can make it configurable...
>
> Yes. I think it's ridiculous that you can change rlimits and then
> exec a setuid thing. It's not so easy to fix, though. Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid. But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.
>
> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...
In userspace, the way PAM selects sensible rlimit defaults when building
an initial environment is simply to examine the rlimits of init. Maybe
we could do the same thing here, which would also give us some level of
namespace control.
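A minimal sketch of that approach, reading init's limits straight out of
procfs (no parsing, it just dumps the table a policy could then apply):

/* Sketch: inspect the rlimits of init via /proc/1/limits, the same
 * information a PAM-style "copy init's limits" policy would consult.
 */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/1/limits", "r");
    char line[256];

    if (!f) {
        perror("/proc/1/limits");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);
    fclose(f);
    return 0;
}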
-Kees
--
Kees Cook
Pixel Security
On Wed, Jul 05, 2017 at 05:19:47PM -0700, Andy Lutomirski wrote:
> I think it's ridiculous that you can change rlimits and then
> exec a setuid thing. It's not so easy to fix, though. Maybe track,
> per-task, inherited by clone and exec, what the rlimits were the last
> time the process had privilege and reset to those limits when running
> something setuid. But a better approach might be to have some sysctls
> that say what the rlimits become when doing setuid.
*Some* rlimits are useful and needed by the user, as you mentioned.
RLIMIT_CORE definitely is one of them, especially for debugging when
combined with suid_dumpable. Some others, like RLIMIT_STACK, should
probably never be configurable at all, as they cause trouble. Probably
simply having a sysctl that sets this one for setuid programs, ignoring
the current limit, would be enough. We could even imagine another one
to set the stack guard gap for setuid programs (this would also limit
the impact of having a large gap for everyone).
> We need per-user-ns sysctls for stuff like this, and we don't really
> have them...
I don't think we need to be this fine-grained. mmap_min_addr is global,
is used to address very similar issues, and nobody seems to complain.
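For comparison, that existing global knob is just a sysctl visible through
procfs, and a hypothetical "stack gap for setuid programs" knob would
presumably be exposed the same way. A trivial sketch reading the current
value:

/* Sketch: read the existing global vm.mmap_min_addr sysctl.  A setuid
 * stack-gap sysctl, if it existed, would be read the same way.
 */
#include <stdio.h>

int main(void)
{
    unsigned long val;
    FILE *f = fopen("/proc/sys/vm/mmap_min_addr", "r");

    if (!f) {
        perror("/proc/sys/vm/mmap_min_addr");
        return 1;
    }
    if (fscanf(f, "%lu", &val) == 1)
        printf("vm.mmap_min_addr = %lu\n", val);
    fclose(f);
    return 0;
}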
Willy
On Wed, Jul 05, 2017 at 04:23:56PM +0200, Michal Hocko wrote:
> On Wed 05-07-17 13:19:40, Ben Hutchings wrote:
> > On Tue, 2017-07-04 at 16:31 -0700, Linus Torvalds wrote:
> > > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]>
> > > wrote:
> > > >
> > > > We have:
> > > >
> > > > bottom = 0xff803fff
> > > > sp     = 0xffffb178
> > > >
> > > > The relevant mappings are:
> > > >
> > > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > > fffdd000-ffffe000 rw-p 00000000 00:00 0          [stack]
> > >
> > > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > > to use up almost all of it, and there's only about 28kB left between
> > > "bottom" and that 'rwx' mapping.
> > >
> > > Still, that rwx mapping is interesting: it is a single page, and it
> > > really is almost exactly 8MB below the stack.
> > >
> > > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > > the top of that odd one-page allocation (0xff7fd000).
> > >
> > > Can you find out where that is allocated? Perhaps a breakpoint on
> > > mmap, with a condition to catch that particular one?
> > [...]
> >
> > Found it, and it's now clear why only i386 is affected:
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os/linux/vm/os_linux.cpp#l4852
> > http://hg.openjdk.java.net/jdk8/jdk8/hotspot/file/tip/src/os_cpu/linux_x86/vm/os_linux_x86.cpp#l881
>
> This is really worrying. This doesn't look like a gap at all. It is a
> mapping which actually contains code, so we should absolutely not
> allow scribbling over it. So I am afraid the only way forward is to
> allow a per-process stack gap and run this particular program with a
> smaller gap. We basically have two ways: either a /proc/<pid>/$file or
> a prctl inherited on exec. The latter is less code. What do you
> think?
On the plus side, the code in that page (a single RET) is only executed
once when the workaround function is called. Notice that 'codebuf'
is never even returned out of that function.
The only reason they even leave that page mapped is to stop the exec
shield limit from being lowered on them again.
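For anyone who has not chased the HotSpot links above, the workaround
amounts to roughly the sketch below: map one RWX page, put a single RET in
it, call it exactly once, and deliberately never unmap it. This is a
simplification; the real os_linux_x86.cpp code computes the page's address
from the thread's stack bottom rather than letting the kernel choose:

/* Rough approximation of HotSpot's i386 exec-shield workaround: one RWX
 * page containing a single RET (0xc3), executed once and then left
 * mapped on purpose.  Address selection is simplified here.
 */
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    unsigned char *codebuf = mmap(NULL, 4096,
                                  PROT_READ | PROT_WRITE | PROT_EXEC,
                                  MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (codebuf == MAP_FAILED) {
        perror("mmap");
        return 1;
    }
    codebuf[0] = 0xc3;                   /* x86 RET */
    ((void (*)(void))codebuf)();         /* executed exactly once */
    /* Intentionally never munmap()ed, mirroring the workaround. */
    printf("workaround page at %p stays mapped\n", codebuf);
    return 0;
}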
- Kevin
On Wed 05-07-17 08:36:45, Michal Hocko wrote:
> On Tue 04-07-17 16:31:52, Linus Torvalds wrote:
> > On Tue, Jul 4, 2017 at 4:01 PM, Ben Hutchings <[email protected]> wrote:
> > >
> > > We have:
> > >
> > > bottom = 0xff803fff
> > > sp = 0xffffb178
> > >
> > > The relevant mappings are:
> > >
> > > ff7fc000-ff7fd000 rwxp 00000000 00:00 0
> > > fffdd000-ffffe000 rw-p 00000000 00:00 0 [stack]
> >
> > Ugh. So that stack is actually 8MB in size, but the alloca() is about
> > to use up almost all of it, and there's only about 28kB left between
> > "bottom" and that 'rwx' mapping.
> >
> > Still, that rwx mapping is interesting: it is a single page, and it
> > really is almost exactly 8MB below the stack.
> >
> > In fact, the top of stack (at 0xffffe000) is *exactly* 8MB+4kB from
> > the top of that odd one-page allocation (0xff7fd000).
>
> Very interesting! I would be really curious whether changing ulimit to
> something bigger changes the picture.
It's a public holiday here today, I haven't read all the new emails, and
I will be mostly offline today; I will catch up tomorrow. But before we
go for trickier workarounds, could you double check that simply
increasing RLIMIT_STACK works around the problem here? Because if it
does, and the other workarounds require some manual intervention, then
changing the ulimit sounds like the least tricky option to me.
--
Michal Hocko
SUSE Labs
On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings <[email protected]> wrote:
> >>
> >> And I think your second patch breaks that "use a really large value to
> >> approximate infinity" case that definitely has existed as a pattern.
> >
> > Right. Well that seems to leave us with remembering the MAP_FIXED flag
> > and using that as the condition to ignore the previous mapping.
>
> I'm not particularly happy about having a MAP_FIXED special case, but
> yeah, I'm not seeing a lot of alternatives.
We can possibly refine it like this:
- use PROT_NONE as a mark for the end of the stack and consider that the
application doing this knows exactly what it's doing;
- use other MAP_FIXED mappings as a limit for a shorter gap (i.e. 4 kB),
considering that 1) it used to work like this for many years, and 2) if an
application forces a MAP_FIXED mapping just below the stack and at the same
time uses a large alloca() or VLA, it's definitely bogus and asking for
unfixable trouble. Not allowing this means we break existing applications
anyway.
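For reference, the PROT_NONE mark in the first point is essentially the
pattern Rust-style runtimes already use for the main thread: query the stack
extent glibc reports and drop a PROT_NONE page at its lowest address. A
minimal sketch (page size hard-coded, error handling trimmed, build with
-pthread; the real runtime code differs in detail):

/* Sketch: place a userspace guard page at the bottom of the main-thread
 * stack range reported by glibc.  For the main thread that address is
 * normally far below the pages mapped so far, so MAP_FIXED lands on
 * unmapped address space.
 */
#define _GNU_SOURCE
#include <pthread.h>
#include <sys/mman.h>
#include <stdio.h>

int main(void)
{
    pthread_attr_t attr;
    void *stackaddr;
    size_t stacksize;

    pthread_getattr_np(pthread_self(), &attr);
    pthread_attr_getstack(&attr, &stackaddr, &stacksize);
    pthread_attr_destroy(&attr);

    void *guard = mmap(stackaddr, 4096, PROT_NONE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (guard == MAP_FAILED)
        perror("mmap(PROT_NONE)");
    else
        printf("guard page at %p, reported stack size %zu\n",
               guard, stacksize);
    return 0;
}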
Willy
On Thu, Jul 06, 2017 at 10:24:06AM +0200, Willy Tarreau wrote:
> On Wed, Jul 05, 2017 at 04:51:06PM -0700, Linus Torvalds wrote:
> > On Wed, Jul 5, 2017 at 4:35 PM, Ben Hutchings <[email protected]> wrote:
> > >>
> > >> And I think your second patch breaks that "use a really large value to
> > >> approximate infinity" case that definitely has existed as a pattern.
> > >
> > > Right. Well that seems to leave us with remembering the MAP_FIXED flag
> > > and using that as the condition to ignore the previous mapping.
> >
> > I'm not particularly happy about having a MAP_FIXED special case, but
> > yeah, I'm not seeing a lot of alternatives.
>
> We can possibly refine it like this:
> - use PROT_NONE as a mark for the end of the stack and consider that the
> application doing this knows exactly what it's doing;
>
> - use other MAP_FIXED mappings as a limit for a shorter gap (i.e. 4 kB),
> considering that 1) it used to work like this for many years, and 2) if an
> application forces a MAP_FIXED mapping just below the stack and at the same
> time uses a large alloca() or VLA, it's definitely bogus and asking for
> unfixable trouble. Not allowing this means we break existing applications
> anyway.
That would probably give the following (only build-tested on x86_64). Do
you think it would make sense and/or be acceptable? It would more easily
avoid the other options, like adding a sysctl plus warnings or making a
special case of setuid.
Willy
---
From 56ae4e57e446bc92fd2647327da281e313930524 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 6 Jul 2017 12:00:54 +0200
Subject: mm: mm, mmap: only apply a one page gap between the stack and
MAP_FIXED
Some programs place a MAP_FIXED mapping below the stack, not leaving enough
room for the stack guard. This patch keeps track of MAP_FIXED, mirroring it
in a new VM_FIXED flag, and reduces the stack guard to a single page (as it
used to be) in that situation, on the assumption that when an application
places a fixed mapping close to the stack, it very likely does so on purpose
and takes full responsibility for the risk of the stack blowing up.
Cc: Ben Hutchings <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Kees Cook <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Signed-off-by: Willy Tarreau <[email protected]>
---
include/linux/mm.h | 1 +
include/linux/mman.h | 1 +
mm/mmap.c | 30 ++++++++++++++++++++----------
3 files changed, 22 insertions(+), 10 deletions(-)
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 6f543a4..41492b9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -188,6 +188,7 @@ extern int overcommit_kbytes_handler(struct ctl_table *, int, void __user *,
#define VM_ACCOUNT 0x00100000 /* Is a VM accounted object */
#define VM_NORESERVE 0x00200000 /* should the VM suppress accounting */
#define VM_HUGETLB 0x00400000 /* Huge TLB Page VM */
+#define VM_FIXED 0x00800000 /* MAP_FIXED was used */
#define VM_ARCH_1 0x01000000 /* Architecture-specific flag */
#define VM_ARCH_2 0x02000000
#define VM_DONTDUMP 0x04000000 /* Do not include in the core dump */
diff --git a/include/linux/mman.h b/include/linux/mman.h
index 634c4c5..3a29069 100644
--- a/include/linux/mman.h
+++ b/include/linux/mman.h
@@ -86,6 +86,7 @@ static inline bool arch_validate_prot(unsigned long prot)
{
return _calc_vm_trans(flags, MAP_GROWSDOWN, VM_GROWSDOWN ) |
_calc_vm_trans(flags, MAP_DENYWRITE, VM_DENYWRITE ) |
+ _calc_vm_trans(flags, MAP_FIXED, VM_FIXED ) |
_calc_vm_trans(flags, MAP_LOCKED, VM_LOCKED );
}
diff --git a/mm/mmap.c b/mm/mmap.c
index ece0f6d..7fc1c29 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2244,12 +2244,17 @@ int expand_upwards(struct vm_area_struct *vma, unsigned long address)
gap_addr = TASK_SIZE;
next = vma->vm_next;
+
+ /* PROT_NONE above a MAP_GROWSUP always serves as a mark and inhibits
+ * the stack guard gap.
+ * MAP_FIXED above a MAP_GROWSUP only requires a single page guard.
+ */
if (next && next->vm_start < gap_addr &&
- (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
- if (!(next->vm_flags & VM_GROWSUP))
- return -ENOMEM;
- /* Check that both stack segments have the same anon_vma? */
- }
+ !(next->vm_flags & VM_GROWSUP) &&
+ (next->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)) &&
+ (!(next->vm_flags & VM_FIXED) ||
+ next->vm_start < address + PAGE_SIZE))
+ return -ENOMEM;
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma)))
@@ -2329,12 +2334,17 @@ int expand_downwards(struct vm_area_struct *vma,
if (gap_addr > address)
return -ENOMEM;
prev = vma->vm_prev;
+
+ /* PROT_NONE below a MAP_GROWSDOWN always serves as a mark and inhibits
+ * the stack guard gap.
+ * MAP_FIXED below a MAP_GROWSDOWN only requires a single page guard.
+ */
if (prev && prev->vm_end > gap_addr &&
- (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC))) {
- if (!(prev->vm_flags & VM_GROWSDOWN))
- return -ENOMEM;
- /* Check that both stack segments have the same anon_vma? */
- }
+ !(prev->vm_flags & VM_GROWSDOWN) &&
+ (prev->vm_flags & (VM_WRITE|VM_READ|VM_EXEC)) &&
+ (!(prev->vm_flags & VM_FIXED) ||
+ prev->vm_end > address - PAGE_SIZE))
+ return -ENOMEM;
/* We must make sure the anon_vma is allocated. */
if (unlikely(anon_vma_prepare(vma)))
--
1.7.12.1
FYI, we noticed the following commit:
commit: a99d848d3bc6586e922584ce8ec673a451a09cf1 ("mm: larger stack guard gap, between vmas")
url: https://github.com/0day-ci/linux/commits/Ben-Hutchings/mmap-Skip-a-single-VM_NONE-mapping-when-checking-the-stack/20170707-131750
in testcase: trinity
with following parameters:
runtime: 300s
test-description: Trinity is a linux system call fuzz tester.
test-url: http://codemonkey.org.uk/projects/trinity/
on test machine: qemu-system-x86_64 -enable-kvm -m 512M
caused the following changes (please refer to the attached dmesg/kmsg for the entire log/backtrace):
+------------------------------------------+------------+------------+
| | 9b51f04424 | a99d848d3b |
+------------------------------------------+------------+------------+
| boot_successes | 88 | 0 |
| boot_failures | 11 | 14 |
| BUG:kernel_hang_in_test_stage | 11 | |
| kernel_BUG_at_mm/mmap.c | 0 | 14 |
| invalid_opcode:#[##] | 0 | 14 |
| Kernel_panic-not_syncing:Fatal_exception | 0 | 14 |
+------------------------------------------+------------+------------+
[ 7.169579] kernel BUG at mm/mmap.c:388!
[ 7.170690] invalid opcode: 0000 [#1] PREEMPT SMP
[ 7.171625] CPU: 0 PID: 1 Comm: init Not tainted 4.12.0-06091-ga99d848 #3
[ 7.172985] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-20161025_171302-gandalf 04/01/2014
[ 7.174982] task: ffff8ab880048000 task.stack: ffffacde40008000
[ 7.176176] RIP: 0010:validate_mm+0x213/0x224
[ 7.177045] RSP: 0000:ffffacde4000bb90 EFLAGS: 00010282
[ 7.178094] RAX: 0000000000000290 RBX: 00000000ffffffff RCX: b0e7f7ea00000000
[ 7.179508] RDX: 00000001b0449a78 RSI: 0000000000000001 RDI: 0000000000000246
[ 7.180915] RBP: ffffacde4000bbd0 R08: ffff8ab880048770 R09: 0000000051472920
[ 7.182313] R10: ffff8ab898919020 R11: ffffffffb12d8eaa R12: ffff8ab89e560b00
[ 7.183758] R13: 0000000000000001 R14: 0000000000000000 R15: 00007fffdd106000
[ 7.185175] FS: 0000000000000000(0000) GS:ffff8ab89f800000(0000) knlGS:0000000000000000
[ 7.186776] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 7.187916] CR2: 0000000000000000 CR3: 0000000017e25000 CR4: 00000000000006f0
[ 7.189313] Call Trace:
[ 7.189828] __vma_adjust+0x657/0x6ca
[ 7.190583] ? tlb_flush_mmu+0x15/0x18
[ 7.191331] shift_arg_pages+0x152/0x167
[ 7.192162] setup_arg_pages+0x1c1/0x1f4
[ 7.192970] load_elf_binary+0x344/0xe48
[ 7.193782] ? kvm_clock_read+0x25/0x35
[ 7.194553] ? kvm_sched_clock_read+0x9/0x12
[ 7.195412] ? search_binary_handler+0x52/0xce
[ 7.196281] search_binary_handler+0x5f/0xce
[ 7.197150] do_execveat_common+0x4dc/0x64c
[ 7.198121] ? rest_init+0x143/0x143
[ 7.198851] do_execve+0x1e/0x20
[ 7.199519] run_init_process+0x26/0x28
[ 7.200288] kernel_init+0x4f/0xe6
[ 7.200977] ret_from_fork+0x25/0x30
[ 7.201679] Code: 41 8b 74 24 70 39 de 74 15 83 fb ff 74 15 89 da 48 c7 c7 d6 c8 23 b0 e8 ba f6 fc ff eb 05 45 85 f6 74 0a 4c 89 e7 e8 67 42 ff ff <0f> 0b 48 83 c4 18 5b 41 5c 41 5d 41 5e 41 5f 5d c3 55 48 89 e5
[ 7.205614] RIP: validate_mm+0x213/0x224 RSP: ffffacde4000bb90
[ 7.206830] ---[ end trace 95e0c74c93056b9b ]---
To reproduce:
git clone https://github.com/01org/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> job-script # job-script is attached in this email
Thanks,
Xiaolong