2019-04-01 14:24:17

by Dave Hansen

Subject: [PATCH] x86/mpx: fix recursive munmap() corruption


This is a bit of a mess, to put it mildly. But, it's a bug
that seems to have gone unnoticed up to now, probably because
nobody uses MPX. The other alternative to this fix is to just
deprecate MPX, even in -stable kernels.

MPX has the arch_unmap() hook inside of munmap() because MPX
uses bounds tables that protect other areas of memory. When
memory is unmapped, there is also a need to unmap the MPX
bounds tables. Barring this, unused bounds tables can eat 80%
of the address space.

But, the recursive do_munmap() that gets called via arch_unmap()
wreaks havoc with __do_munmap()'s state. It can result in
freeing populated page tables, accessing bogus VMA state,
double-freed VMAs and more.

To fix this, call arch_unmap() before __do_munmap() has a chance
to do anything meaningful. Also, remove the 'vma' argument
and force the MPX code to do its own, independent VMA lookup.

For the common success case this is functionally identical to
what was there before. For the munmap() failure case, it's
possible that some MPX tables will be zapped for memory that
continues to be in use. But, this is an extraordinarily
unlikely scenario and the harm would be that MPX provides no
protection since the bounds table got reset (zeroed).

I can't imagine anyone doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because if you're doing munmap(), you are *done* with the
memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as
well as the existing mpx selftests/.

====

The long story:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
free the VMA itself.

I decided to put the arch_unmap() call right after #3. This
was *just* before mmap_sem looked like it might get downgraded
(it won't in this context), but it looked right. It wasn't.
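
To make the new ordering concrete, here is a heavily
simplified, compilable userspace model of the flow (purely
illustrative; the helper names are invented for this sketch
and are not kernel functions):

#include <stddef.h>

struct vma { struct vma *next; };

/*
 * May itself unmap things (MPX bounds tables) and thus recurse
 * into the unmap path. Calling it first means the outer caller
 * has not yet built any VMA/rbtree state that the recursion
 * could corrupt.
 */
static void arch_unmap_model(unsigned long start, unsigned long end)
{
	(void)start;
	(void)end;
}

static struct vma *find_first_vma(unsigned long start)	/* step 1 */
{
	(void)start;
	return NULL;
}

static void split_detach_zap_free(struct vma *vma)	/* steps 2-5 */
{
	(void)vma;
}

static int do_munmap_model(unsigned long start, size_t len)
{
	unsigned long end = start + len;

	arch_unmap_model(start, end);	/* new spot: before any lookup */

	struct vma *vma = find_first_vma(start);
	if (!vma)
		return 0;

	/*
	 * The old arch_unmap() call sat inside here, between
	 * detaching the VMAs (step 3) and zapping memory (step 4).
	 */
	split_detach_zap_free(vma);
	return 0;
}

int main(void)
{
	return do_munmap_model(0x1000, 0x1000);
}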

Richard Biener reported a test that shows this in dmesg:

[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via
arch_unmap(). It was freeing page tables that had not been
properly zapped.

But, the problem was bigger than this. For one, arch_unmap()
can free VMAs. But, the calling __do_munmap() has local
variables that *point* to VMAs and obviously can't handle
those VMAs being freed out from under it while the pointers
are still in use.

I tried a couple of things here. First, I tried to fix the page
table freeing problem in isolation, but I then found the VMA
issue. I also tried having the MPX code return a flag if it
modified the rbtree, which would force __do_munmap() to re-walk
the rbtree and restart. That spiralled out of control in
complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure
case might eat some bounds tables seems like the simplest viable
fix.

Reported-by: Richard Biener <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: [email protected]
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]

---

b/arch/x86/include/asm/mmu_context.h | 6 +++---
b/arch/x86/include/asm/mpx.h | 5 ++---
b/arch/x86/mm/mpx.c | 10 ++++++----
b/include/asm-generic/mm_hooks.h | 1 -
b/mm/mmap.c | 15 ++++++++-------
5 files changed, 19 insertions(+), 18 deletions(-)

diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c
--- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700
+++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700
@@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un
return -EINVAL;

len = PAGE_ALIGN(len);
+ end = start + len;
if (len == 0)
return -EINVAL;

+ /*
+ * arch_unmap() might do unmaps itself. It must be called
+ * and finish any rbtree manipulation before this code
+ * runs and also starts to manipulate the rbtree.
+ */
+ arch_unmap(mm, start, end);
+
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2742,7 +2750,6 @@ int __do_munmap(struct mm_struct *mm, un
/* we have start < vma->vm_end */

/* if it doesn't overlap, we have nothing.. */
- end = start + len;
if (vma->vm_start >= end)
return 0;

@@ -2812,12 +2819,6 @@ int __do_munmap(struct mm_struct *mm, un
/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);

- /*
- * mpx unmap needs to be called with mmap_sem held for write.
- * It is safe to call it before unmap_region().
- */
- arch_unmap(mm, vma, start, end);
-
if (downgrade)
downgrade_write(&mm->mmap_sem);

diff -puN arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma arch/x86/include/asm/mmu_context.h
--- a/arch/x86/include/asm/mmu_context.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.412411123 -0700
+++ b/arch/x86/include/asm/mmu_context.h 2019-04-01 06:56:53.423411123 -0700
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(str
mpx_mm_init(mm);
}

-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
/*
* mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_
* consistently wrong.
*/
if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
- mpx_notify_unmap(mm, vma, start, end);
+ mpx_notify_unmap(mm, start, end);
}

/*
diff -puN include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma include/asm-generic/mm_hooks.h
--- a/include/asm-generic/mm_hooks.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.414411123 -0700
+++ b/include/asm-generic/mm_hooks.h 2019-04-01 06:56:53.423411123 -0700
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct
}

static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff -puN arch/x86/mm/mpx.c~mpx-rss-pass-no-vma arch/x86/mm/mpx.c
--- a/arch/x86/mm/mpx.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.416411123 -0700
+++ b/arch/x86/mm/mpx.c 2019-04-01 06:56:53.423411123 -0700
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_st
* the virtual address region start...end have already been split if
* necessary, and the 'vma' is the first vma in this range (start -> end).
*/
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
+ struct vm_area_struct *vma;
int ret;

/*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *
* which should not occur normally. Being strict about it here
* helps ensure that we do not have an exploitable stack overflow.
*/
- do {
+ vma = find_vma(mm, start);
+ while (vma && vma->vm_start < end) {
if (vma->vm_flags & VM_MPX)
return;
vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
+ }

ret = mpx_unmap_tables(mm, start, end);
if (ret)
diff -puN arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma arch/x86/include/asm/mpx.h
--- a/arch/x86/include/asm/mpx.h~mpx-rss-pass-no-vma 2019-04-01 06:56:53.418411123 -0700
+++ b/arch/x86/include/asm/mpx.h 2019-04-01 06:56:53.424411123 -0700
@@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm
*/
mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
}
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
unsigned long flags);
@@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm
{
}
static inline void mpx_notify_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
_


Subject: [tip:x86/urgent] x86/mpx: Fix recursive munmap() corruption

Commit-ID: 9d812ad8fce2c556538f6efe8ab1ae78ab59ae8f
Gitweb: https://git.kernel.org/tip/9d812ad8fce2c556538f6efe8ab1ae78ab59ae8f
Author: Dave Hansen <[email protected]>
AuthorDate: Mon, 1 Apr 2019 07:15:49 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Apr 2019 14:37:04 +0200

x86/mpx: Fix recursive munmap() corruption

This is a bit of a mess, to put it mildly. But, it's a bug that seems to
have gone unnoticed up to now, probably because nobody uses MPX. The other
alternative to this fix is to just deprecate MPX, even in -stable kernels.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds
tables that protect other areas of memory. When memory is unmapped, there
is also a need to unmap the MPX bounds tables. Barring this, unused bounds
tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks
havoc with __do_munmap()'s state. It can result in freeing populated page
tables, accessing bogus VMA state, double-freed VMAs and more.

To fix this, call arch_unmap() before __do_munmap() has a chance to do
anything meaningful. Also, remove the 'vma' argument and force the MPX
code to do its own, independent VMA lookup.

For the common success case this is functionally identical to what was
there before. For the munmap() failure case, it's possible that some MPX
tables will be zapped for memory that continues to be in use. But, this is
an extraordinarily unlikely scenario and the harm would be that MPX
provides no protection since the bounds table got reset (zeroed).

It's hard to imagine that anyone is doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because after doing munmap(), the application is *done* with the
memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the
existing mpx selftests/.

Further details:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
free the VMA itself.

The original MPX implementation chose to put the arch_unmap() call right
after #3. This was *just* before mmap_sem looked like it might get
downgraded (it won't in this context), but it looked right. It wasn't.

Richard Biener reported a test that shows this in dmesg:

[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap().
It was freeing page tables that had not been properly zapped.

But, the problem was bigger than this. For one, arch_unmap() can free
VMAs. But, the calling __do_munmap() has local variables that *point* to
VMAs and obviously can't handle those VMAs being freed out from under it
while the pointers are still in use.

A couple of attempts were made to fix this. The first was to fix the page
table freeing problem in isolation, but this led to the VMA issue. The
next approach was to let the MPX code return a flag if it modified the
rbtree, which would force __do_munmap() to re-walk the rbtree and restart.
That spiralled out of control in complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure case might
eat some bounds tables is the simplest viable fix.

Fixes: 1de4fa14e ("x86, mpx: Cleanup unused bound tables")
Reported-by: Richard Biener <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]

---
arch/x86/include/asm/mmu_context.h | 6 +++---
arch/x86/include/asm/mpx.h | 5 ++---
arch/x86/mm/mpx.c | 10 ++++++----
include/asm-generic/mm_hooks.h | 1 -
mm/mmap.c | 15 ++++++++-------
5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 19d18fae6ec6..41019af68adf 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
mpx_mm_init(mm);
}

-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
/*
* mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* consistently wrong.
*/
if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
- mpx_notify_unmap(mm, vma, start, end);
+ mpx_notify_unmap(mm, start, end);
}

/*
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index d0b1434fb0b6..2f9d86bc7e48 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm_struct *mm)
*/
mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
}
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
unsigned long flags);
@@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm_struct *mm)
{
}
static inline void mpx_notify_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index c805db6236b4..7aeb9fe2955f 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_struct *mm,
* the virtual address region start...end have already been split if
* necessary, and the 'vma' is the first vma in this range (start -> end).
*/
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
+ struct vm_area_struct *vma;
int ret;

/*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* which should not occur normally. Being strict about it here
* helps ensure that we do not have an exploitable stack overflow.
*/
- do {
+ vma = find_vma(mm, start);
+ while (vma && vma->vm_start < end) {
if (vma->vm_flags & VM_MPX)
return;
vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
+ }

ret = mpx_unmap_tables(mm, start, end);
if (ret)
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index 8ac4e68a12f0..6736ed2f632b 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
}

static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 41eb48d9b527..01944c378d38 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2730,9 +2730,17 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return -EINVAL;

len = PAGE_ALIGN(len);
+ end = start + len;
if (len == 0)
return -EINVAL;

+ /*
+ * arch_unmap() might do unmaps itself. It must be called
+ * and finish any rbtree manipulation before this code
+ * runs and also starts to manipulate the rbtree.
+ */
+ arch_unmap(mm, start, end);
+
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2741,7 +2749,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* we have start < vma->vm_end */

/* if it doesn't overlap, we have nothing.. */
- end = start + len;
if (vma->vm_start >= end)
return 0;

@@ -2811,12 +2818,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);

- /*
- * mpx unmap needs to be called with mmap_sem held for write.
- * It is safe to call it before unmap_region().
- */
- arch_unmap(mm, vma, start, end);
-
if (downgrade)
downgrade_write(&mm->mmap_sem);

Subject: [tip:x86/urgent] x86/mpx: Fix recursive munmap() corruption

Commit-ID: 508b8482ea2227ba8695d1cf8311166a455c2ae0
Gitweb: https://git.kernel.org/tip/508b8482ea2227ba8695d1cf8311166a455c2ae0
Author: Dave Hansen <[email protected]>
AuthorDate: Mon, 1 Apr 2019 07:15:49 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Apr 2019 14:39:08 +0200

x86/mpx: Fix recursive munmap() corruption

This is a bit of a mess, to put it mildly. But, it's a bug that seems to
have gone unnoticed up to now, probably because nobody uses MPX. The other
alternative to this fix is to just deprecate MPX, even in -stable kernels.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds
tables that protect other areas of memory. When memory is unmapped, there
is also a need to unmap the MPX bounds tables. Barring this, unused bounds
tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks
havoc with __do_munmap()'s state. It can result in freeing populated page
tables, accessing bogus VMA state, double-freed VMAs and more.

To fix this, call arch_unmap() before __do_munmap() has a chance to do
anything meaningful. Also, remove the 'vma' argument and force the MPX
code to do its own, independent VMA lookup.

For the common success case this is functionally identical to what was
there before. For the munmap() failure case, it's possible that some MPX
tables will be zapped for memory that continues to be in use. But, this is
an extraordinarily unlikely scenario and the harm would be that MPX
provides no protection since the bounds table got reset (zeroed).

It's hard to imagine that anyone is doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because after doing munmap(), the application is *done* with the
memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the
existing mpx selftests/.

Further details:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
free the VMA itself.

The original MPX implementation chose to put the arch_unmap() call right
after #3. This was *just* before mmap_sem looked like it might get
downgraded (it won't in this context), but it looked right. It wasn't.

Richard Biener reported a test that shows this in dmesg:

[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap().
It was freeing page tables that had not been properly zapped.

But, the problem was bigger than this. For one, arch_unmap() can free
VMAs. But, the calling __do_munmap() has local variables that *point* to
VMAs and obviously can't handle those VMAs being freed out from under it
while the pointers are still in use.

A couple of attempts were made to fix this. The first was to fix the page
table freeing problem in isolation, but this led to the VMA issue. The
next approach was to let the MPX code return a flag if it modified the
rbtree, which would force __do_munmap() to re-walk the rbtree and restart.
That spiralled out of control in complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure case might
eat some bounds tables is the simplest viable fix.

Fixes: 1de4fa14e ("x86, mpx: Cleanup unused bound tables")
Reported-by: Richard Biener <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/mmu_context.h | 6 +++---
arch/x86/include/asm/mpx.h | 5 ++---
arch/x86/mm/mpx.c | 10 ++++++----
include/asm-generic/mm_hooks.h | 1 -
mm/mmap.c | 15 ++++++++-------
5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 19d18fae6ec6..41019af68adf 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
mpx_mm_init(mm);
}

-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
/*
* mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* consistently wrong.
*/
if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
- mpx_notify_unmap(mm, vma, start, end);
+ mpx_notify_unmap(mm, start, end);
}

/*
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index d0b1434fb0b6..2f9d86bc7e48 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm_struct *mm)
*/
mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
}
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
unsigned long flags);
@@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm_struct *mm)
{
}
static inline void mpx_notify_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index c805db6236b4..7aeb9fe2955f 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_struct *mm,
* the virtual address region start...end have already been split if
* necessary, and the 'vma' is the first vma in this range (start -> end).
*/
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
+ struct vm_area_struct *vma;
int ret;

/*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* which should not occur normally. Being strict about it here
* helps ensure that we do not have an exploitable stack overflow.
*/
- do {
+ vma = find_vma(mm, start);
+ while (vma && vma->vm_start < end) {
if (vma->vm_flags & VM_MPX)
return;
vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
+ }

ret = mpx_unmap_tables(mm, start, end);
if (ret)
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index 8ac4e68a12f0..6736ed2f632b 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
}

static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 41eb48d9b527..01944c378d38 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2730,9 +2730,17 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return -EINVAL;

len = PAGE_ALIGN(len);
+ end = start + len;
if (len == 0)
return -EINVAL;

+ /*
+ * arch_unmap() might do unmaps itself. It must be called
+ * and finish any rbtree manipulation before this code
+ * runs and also starts to manipulate the rbtree.
+ */
+ arch_unmap(mm, start, end);
+
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2741,7 +2749,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* we have start < vma->vm_end */

/* if it doesn't overlap, we have nothing.. */
- end = start + len;
if (vma->vm_start >= end)
return 0;

@@ -2811,12 +2818,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);

- /*
- * mpx unmap needs to be called with mmap_sem held for write.
- * It is safe to call it before unmap_region().
- */
- arch_unmap(mm, vma, start, end);
-
if (downgrade)
downgrade_write(&mm->mmap_sem);

Subject: [tip:x86/urgent] x86/mpx: Fix recursive munmap() corruption

Commit-ID: 7d3bca521e76fd14922529a0ddd3ca7005eb2acf
Gitweb: https://git.kernel.org/tip/7d3bca521e76fd14922529a0ddd3ca7005eb2acf
Author: Dave Hansen <[email protected]>
AuthorDate: Mon, 1 Apr 2019 07:15:49 -0700
Committer: Thomas Gleixner <[email protected]>
CommitDate: Thu, 18 Apr 2019 21:20:28 +0200

x86/mpx: Fix recursive munmap() corruption

This is a bit of a mess, to put it mildly. But, it's a bug that seems to
have gone unnoticed up to now, probably because nobody uses MPX. The other
alternative to this fix is to just deprecate MPX, even in -stable kernels.

MPX has the arch_unmap() hook inside of munmap() because MPX uses bounds
tables that protect other areas of memory. When memory is unmapped, there
is also a need to unmap the MPX bounds tables. Barring this, unused bounds
tables can eat 80% of the address space.

But, the recursive do_munmap() that gets called via arch_unmap() wreaks
havoc with __do_munmap()'s state. It can result in freeing populated page
tables, accessing bogus VMA state, double-freed VMAs and more.

To fix this, call arch_unmap() before __do_munmap() has a chance to do
anything meaningful. Also, remove the 'vma' argument and force the MPX
code to do its own, independent VMA lookup.

For the common success case this is functionally identical to what was
there before. For the munmap() failure case, it's possible that some MPX
tables will be zapped for memory that continues to be in use. But, this is
an extraordinarily unlikely scenario and the harm would be that MPX
provides no protection since the bounds table got reset (zeroed).

It's hard to imagine that anyone is doing this:

ptr = mmap();
// use ptr
ret = munmap(ptr);
if (ret)
// oh, there was an error, I'll
// keep using ptr.

Because after doing munmap(), the application is *done* with the
memory. There's probably no good data in there _anyway_.

This passes the original reproducer from Richard Biener as well as the
existing mpx selftests/.

Further details:

munmap() has a couple of pieces:
1. Find the affected VMA(s)
2. Split the start/end one(s) if necessary
3. Pull the VMAs out of the rbtree
4. Actually zap the memory via unmap_region(), including
freeing page tables (or queueing them to be freed).
5. Fixup some of the accounting (like fput()) and actually
free the VMA itself.

The original MPX implementation chose to put the arch_unmap() call right
after #3. This was *just* before mmap_sem looked like it might get
downgraded (it won't in this context), but it looked right. It wasn't.

Richard Biener reported a test that shows this in dmesg:

[1216548.787498] BUG: Bad rss-counter state mm:0000000017ce560b idx:1 val:551
[1216548.787500] BUG: non-zero pgtables_bytes on freeing mm: 24576

What triggered this was the recursive do_munmap() called via arch_unmap().
It was freeing page tables that had not been properly zapped.

But, the problem was bigger than this. For one, arch_unmap() can free
VMAs. But, the calling __do_munmap() has local variables that *point* to
VMAs and obviously can't handle those VMAs being freed out from under it
while the pointers are still in use.

A couple of attempts were made to fix this. The first was to fix the page
table freeing problem in isolation, but this led to the VMA issue. The
next approach was to let the MPX code return a flag if it modified the
rbtree, which would force __do_munmap() to re-walk the rbtree and restart.
That spiralled out of control in complexity pretty fast.

Just moving arch_unmap() and accepting that the bonkers failure case might
eat some bounds tables is the simplest viable fix.

Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
Reported-by: Richard Biener <[email protected]>
Signed-off-by: Dave Hansen <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: [email protected]
Cc: [email protected]
Link: https://lkml.kernel.org/r/[email protected]
---
arch/x86/include/asm/mmu_context.h | 6 +++---
arch/x86/include/asm/mpx.h | 5 ++---
arch/x86/mm/mpx.c | 10 ++++++----
include/asm-generic/mm_hooks.h | 1 -
mm/mmap.c | 15 ++++++++-------
5 files changed, 19 insertions(+), 18 deletions(-)

diff --git a/arch/x86/include/asm/mmu_context.h b/arch/x86/include/asm/mmu_context.h
index 19d18fae6ec6..41019af68adf 100644
--- a/arch/x86/include/asm/mmu_context.h
+++ b/arch/x86/include/asm/mmu_context.h
@@ -277,8 +277,8 @@ static inline void arch_bprm_mm_init(struct mm_struct *mm,
mpx_mm_init(mm);
}

-static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
/*
* mpx_notify_unmap() goes and reads a rarely-hot
@@ -298,7 +298,7 @@ static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* consistently wrong.
*/
if (unlikely(cpu_feature_enabled(X86_FEATURE_MPX)))
- mpx_notify_unmap(mm, vma, start, end);
+ mpx_notify_unmap(mm, start, end);
}

/*
diff --git a/arch/x86/include/asm/mpx.h b/arch/x86/include/asm/mpx.h
index d0b1434fb0b6..2f9d86bc7e48 100644
--- a/arch/x86/include/asm/mpx.h
+++ b/arch/x86/include/asm/mpx.h
@@ -78,8 +78,8 @@ static inline void mpx_mm_init(struct mm_struct *mm)
*/
mm->context.bd_addr = MPX_INVALID_BOUNDS_DIR;
}
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end);
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end);

unsigned long mpx_unmapped_area_check(unsigned long addr, unsigned long len,
unsigned long flags);
@@ -100,7 +100,6 @@ static inline void mpx_mm_init(struct mm_struct *mm)
{
}
static inline void mpx_notify_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/arch/x86/mm/mpx.c b/arch/x86/mm/mpx.c
index c805db6236b4..7aeb9fe2955f 100644
--- a/arch/x86/mm/mpx.c
+++ b/arch/x86/mm/mpx.c
@@ -881,9 +881,10 @@ static int mpx_unmap_tables(struct mm_struct *mm,
* the virtual address region start...end have already been split if
* necessary, and the 'vma' is the first vma in this range (start -> end).
*/
-void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
- unsigned long start, unsigned long end)
+void mpx_notify_unmap(struct mm_struct *mm, unsigned long start,
+ unsigned long end)
{
+ struct vm_area_struct *vma;
int ret;

/*
@@ -902,11 +903,12 @@ void mpx_notify_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
* which should not occur normally. Being strict about it here
* helps ensure that we do not have an exploitable stack overflow.
*/
- do {
+ vma = find_vma(mm, start);
+ while (vma && vma->vm_start < end) {
if (vma->vm_flags & VM_MPX)
return;
vma = vma->vm_next;
- } while (vma && vma->vm_start < end);
+ }

ret = mpx_unmap_tables(mm, start, end);
if (ret)
diff --git a/include/asm-generic/mm_hooks.h b/include/asm-generic/mm_hooks.h
index 8ac4e68a12f0..6736ed2f632b 100644
--- a/include/asm-generic/mm_hooks.h
+++ b/include/asm-generic/mm_hooks.h
@@ -18,7 +18,6 @@ static inline void arch_exit_mmap(struct mm_struct *mm)
}

static inline void arch_unmap(struct mm_struct *mm,
- struct vm_area_struct *vma,
unsigned long start, unsigned long end)
{
}
diff --git a/mm/mmap.c b/mm/mmap.c
index 41eb48d9b527..01944c378d38 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -2730,9 +2730,17 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
return -EINVAL;

len = PAGE_ALIGN(len);
+ end = start + len;
if (len == 0)
return -EINVAL;

+ /*
+ * arch_unmap() might do unmaps itself. It must be called
+ * and finish any rbtree manipulation before this code
+ * runs and also starts to manipulate the rbtree.
+ */
+ arch_unmap(mm, start, end);
+
/* Find the first overlapping VMA */
vma = find_vma(mm, start);
if (!vma)
@@ -2741,7 +2749,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* we have start < vma->vm_end */

/* if it doesn't overlap, we have nothing.. */
- end = start + len;
if (vma->vm_start >= end)
return 0;

@@ -2811,12 +2818,6 @@ int __do_munmap(struct mm_struct *mm, unsigned long start, size_t len,
/* Detach vmas from rbtree */
detach_vmas_to_be_unmapped(mm, vma, prev, end);

- /*
- * mpx unmap needs to be called with mmap_sem held for write.
- * It is safe to call it before unmap_region().
- */
- arch_unmap(mm, vma, start, end);
-
if (downgrade)
downgrade_write(&mm->mmap_sem);

2019-04-19 19:06:27

by Thomas Gleixner

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

On Mon, 1 Apr 2019, Dave Hansen wrote:
> diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c
> --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700
> +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700
> @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un
> return -EINVAL;
>
> len = PAGE_ALIGN(len);
> + end = start + len;
> if (len == 0)
> return -EINVAL;
>
> + /*
> + * arch_unmap() might do unmaps itself. It must be called
> + * and finish any rbtree manipulation before this code
> + * runs and also starts to manipulate the rbtree.
> + */
> + arch_unmap(mm, start, end);

...

> -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
> - unsigned long start, unsigned long end)
> +static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
> + unsigned long end)

While you fixed up the asm-generic thing, this breaks arch/um and
arch/unicore32. For those the fixup is trivial by removing the vma
argument.

But it also breaks powerpc and there I'm not sure whether moving
arch_unmap() to the beginning of __do_munmap() is safe. Michael???

Aside of that the powerpc variant looks suspicious:

static inline void arch_unmap(struct mm_struct *mm,
unsigned long start, unsigned long end)
{
if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
mm->context.vdso_base = 0;
}

Shouldn't that be:

if (start >= mm->context.vdso_base && mm->context.vdso_base < end)

Hmm?

Thanks,

tglx


2019-04-20 10:33:12

by Michael Ellerman

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Thomas Gleixner <[email protected]> writes:
> On Mon, 1 Apr 2019, Dave Hansen wrote:
>> diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c
>> --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700
>> +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700
>> @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un
>> return -EINVAL;
>>
>> len = PAGE_ALIGN(len);
>> + end = start + len;
>> if (len == 0)
>> return -EINVAL;
>>
>> + /*
>> + * arch_unmap() might do unmaps itself. It must be called
>> + * and finish any rbtree manipulation before this code
>> + * runs and also starts to manipulate the rbtree.
>> + */
>> + arch_unmap(mm, start, end);
>
> ...
>
>> -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>> - unsigned long start, unsigned long end)
>> +static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
>> + unsigned long end)
>
> While you fixed up the asm-generic thing, this breaks arch/um and
> arch/unicorn32. For those the fixup is trivial by removing the vma
> argument.
>
> But itt also breaks powerpc and there I'm not sure whether moving
> arch_unmap() to the beginning of __do_munmap() is safe. Micheal???

I don't know for sure but I think it should be fine. That code is just
there to handle CRIU unmapping/remapping the VDSO. So that either needs
to happen while the process is stopped or it needs to handle races
anyway, so I don't see how the placement within the unmap path should
matter.

> Aside of that the powerpc variant looks suspicious:
>
> static inline void arch_unmap(struct mm_struct *mm,
> unsigned long start, unsigned long end)
> {
> if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
> mm->context.vdso_base = 0;
> }
>
> Shouldn't that be:
>
> if (start >= mm->context.vdso_base && mm->context.vdso_base < end)
>
> Hmm?

Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it.
Thanks for spotting it!

cheers

2019-04-23 11:19:31

by Laurent Dufour

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Le 20/04/2019 à 12:31, Michael Ellerman a écrit :
> Thomas Gleixner <[email protected]> writes:
>> On Mon, 1 Apr 2019, Dave Hansen wrote:
>>> diff -puN mm/mmap.c~mpx-rss-pass-no-vma mm/mmap.c
>>> --- a/mm/mmap.c~mpx-rss-pass-no-vma 2019-04-01 06:56:53.409411123 -0700
>>> +++ b/mm/mmap.c 2019-04-01 06:56:53.423411123 -0700
>>> @@ -2731,9 +2731,17 @@ int __do_munmap(struct mm_struct *mm, un
>>> return -EINVAL;
>>>
>>> len = PAGE_ALIGN(len);
>>> + end = start + len;
>>> if (len == 0)
>>> return -EINVAL;
>>>
>>> + /*
>>> + * arch_unmap() might do unmaps itself. It must be called
>>> + * and finish any rbtree manipulation before this code
>>> + * runs and also starts to manipulate the rbtree.
>>> + */
>>> + arch_unmap(mm, start, end);
>>
>> ...
>>
>>> -static inline void arch_unmap(struct mm_struct *mm, struct vm_area_struct *vma,
>>> - unsigned long start, unsigned long end)
>>> +static inline void arch_unmap(struct mm_struct *mm, unsigned long start,
>>> + unsigned long end)
>>
>> While you fixed up the asm-generic thing, this breaks arch/um and
>> arch/unicorn32. For those the fixup is trivial by removing the vma
>> argument.
>>
>> But itt also breaks powerpc and there I'm not sure whether moving
>> arch_unmap() to the beginning of __do_munmap() is safe. Micheal???
>
> I don't know for sure but I think it should be fine. That code is just
> there to handle CRIU unmapping/remapping the VDSO. So that either needs
> to happen while the process is stopped or it needs to handle races
> anyway, so I don't see how the placement within the unmap path should
> matter.

My only concern is the error path.
Calling arch_unmap() before handling any error case means that it will
have to be undone and there is no way to do so.

I don't know what the rationale is for moving arch_unmap() to the
beginning of __do_munmap(), but the error paths must be managed.

>> Aside of that the powerpc variant looks suspicious:
>>
>> static inline void arch_unmap(struct mm_struct *mm,
>> unsigned long start, unsigned long end)
>> {
>> if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
>> mm->context.vdso_base = 0;
>> }
>>
>> Shouldn't that be:
>>
>> if (start >= mm->context.vdso_base && mm->context.vdso_base < end)
>>
>> Hmm?
>
> Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it.
> Thanks for spotting it!

I've to admit that I had to read that code carefully before answering.

There are 2 assumptions here:
1. 'start' and 'end' are page aligned (this is guaranteed by
__do_munmap()).
2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store
on powerpc).

The idea is to handle a munmap() call surrounding the VDSO area:
       | VDSO |
^start          ^end

This is covered by this test, as the munmap() matching the exact
boundaries of the VDSO is handled too.
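
As a quick sanity check of that reasoning, here is a small
standalone program (purely illustrative, not kernel code; the
base address is arbitrary) exercising the condition under those
two assumptions:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

/* The condition as written in the powerpc arch_unmap(). */
static bool vdso_unmapped(unsigned long start, unsigned long end,
			  unsigned long vdso_base)
{
	return start <= vdso_base && vdso_base < end;
}

int main(void)
{
	unsigned long base = 0x100000UL;

	/* munmap() exactly matching the single VDSO page */
	assert(vdso_unmapped(base, base + PAGE_SIZE, base));

	/* munmap() of a larger range surrounding the VDSO */
	assert(vdso_unmapped(base - PAGE_SIZE, base + 2 * PAGE_SIZE, base));

	/* unrelated ranges below and above do not trigger it */
	assert(!vdso_unmapped(base - 2 * PAGE_SIZE, base - PAGE_SIZE, base));
	assert(!vdso_unmapped(base + PAGE_SIZE, base + 2 * PAGE_SIZE, base));

	printf("surrounding and exact-match unmaps are both caught\n");
	return 0;
}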

Am I missing something ?

Cheers,
Laurent.

2019-04-23 13:37:24

by Thomas Gleixner

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

On Tue, 23 Apr 2019, Laurent Dufour wrote:
> Le 20/04/2019 à 12:31, Michael Ellerman a écrit :
> > Thomas Gleixner <[email protected]> writes:
> > > Aside of that the powerpc variant looks suspicious:
> > >
> > > static inline void arch_unmap(struct mm_struct *mm,
> > > unsigned long start, unsigned long end)
> > > {
> > > if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
> > > mm->context.vdso_base = 0;
> > > }
> > >
> > > Shouldn't that be:
> > >
> > > if (start >= mm->context.vdso_base && mm->context.vdso_base < end)
> > >
> > > Hmm?
> >
> > Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it.
> > Thanks for spotting it!
>
> I've to admit that I had to read that code carefully before answering.
>
> There are 2 assumptions here:
> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on
> powerpc).
>
> The idea is to handle a munmap() call surrounding the VDSO area:
> | VDSO |
> ^start ^end
>
> This is covered by this test, as the munmap() matching the exact boundaries of
> the VDSO is handled too.
>
> Am I missing something ?

Well if this is the intention, then you missed to add a comment explaining it :)

Thanks,

tglx

2019-04-23 13:38:37

by Laurent Dufour

Subject: bos

Le 23/04/2019 à 15:34, Thomas Gleixner a écrit :
> On Tue, 23 Apr 2019, Laurent Dufour wrote:
>> Le 20/04/2019 à 12:31, Michael Ellerman a écrit :
>>> Thomas Gleixner <[email protected]> writes:
>>>> Aside of that the powerpc variant looks suspicious:
>>>>
>>>> static inline void arch_unmap(struct mm_struct *mm,
>>>> unsigned long start, unsigned long end)
>>>> {
>>>> if (start <= mm->context.vdso_base && mm->context.vdso_base < end)
>>>> mm->context.vdso_base = 0;
>>>> }
>>>>
>>>> Shouldn't that be:
>>>>
>>>> if (start >= mm->context.vdso_base && mm->context.vdso_base < end)
>>>>
>>>> Hmm?
>>>
>>> Yeah looks pretty suspicious. I'll follow-up with Laurent who wrote it.
>>> Thanks for spotting it!
>>
>> I've to admit that I had to read that code carefully before answering.
>>
>> There are 2 assumptions here:
>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on
>> powerpc).
>>
>> The idea is to handle a munmap() call surrounding the VDSO area:
>> | VDSO |
>> ^start ^end
>>
>> This is covered by this test, as the munmap() matching the exact boundaries of
>> the VDSO is handled too.
>>
>> Am I missing something ?
>
> Well if this is the intention, then you missed to add a comment explaining it :)
>
> Thanks,
>
> tglx

You're right, and I was thinking the same when I read that code this
morning ;)

I'll propose a patch to a add an explicit comment.

Thanks,
Laurent.

2019-04-23 16:07:28

by Dave Hansen

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

On 4/23/19 4:16 AM, Laurent Dufour wrote:
> My only concern is the error path.
> Calling arch_unmap() before handling any error case means that it will
> have to be undo and there is no way to do so.

Is there a practical scenario where munmap() of the VDSO can split a
VMA? If the VDSO is guaranteed to be a single page, it would have to be
a scenario where munmap() was called on a range that included the VDSO
*and* another VMA that we failed to split.

But, the scenario would have to be that someone tried to munmap() the
VDSO and something adjacent, the munmap() failed, and they kept on using
the VDSO and expected the special signal and perf behavior to be maintained.

BTW, what keeps the VDSO from merging with an adjacent VMA? Is it just
the vm_ops->close that comes from special_mapping_vmops?

> I don't know what is the rational to move arch_unmap() to the beginning
> of __do_munmap() but the error paths must be managed.

It's in the changelog:

https://patchwork.kernel.org/patch/10909727/

But, the tl;dr version is: x86 is recursively calling __do_munmap() (via
arch_unmap()) in a spot where the internal rbtree data is inconsistent,
which causes all kinds of fun. If we move arch_unmap() to before
__do_munmap() does any data structure manipulation, the recursive call
doesn't get confused any more.

> There are 2 assumptions here:
> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)

Are you sure about #2? The 'vdso64_pages' variable seems rather
unnecessary if the VDSO is only 1 page. ;)

2019-04-23 17:09:17

by Laurent Dufour

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Le 23/04/2019 à 18:04, Dave Hansen a écrit :
> On 4/23/19 4:16 AM, Laurent Dufour wrote:
>> My only concern is the error path.
>> Calling arch_unmap() before handling any error case means that it will
>> have to be undo and there is no way to do so.
>
> Is there a practical scenario where munmap() of the VDSO can split a
> VMA? If the VDSO is guaranteed to be a single page, it would have to be
> a scenario where munmap() was called on a range that included the VDSO
> *and* other VMA that we failed to split.
>
> But, the scenario would have to be that someone tried to munmap() the
> VDSO and something adjacent, the munmap() failed, and they kept on using
> the VDSO and expected the special signal and perf behavior to be maintained.

I've to admit that this should not be a common scenario, and unmapping
the VDSO is not so common anyway.

> BTW, what keeps the VDSO from merging with an adjacent VMA? Is it just
> the vm_ops->close that comes from special_mapping_vmops?

I'd think so.

>> I don't know what is the rational to move arch_unmap() to the beginning
>> of __do_munmap() but the error paths must be managed.
>
> It's in the changelog:
>
> https://patchwork.kernel.org/patch/10909727/
>
> But, the tl;dr version is: x86 is recursively calling __do_unmap() (via
> arch_unmap()) in a spot where the internal rbtree data is inconsistent,
> which causes all kinds of fun. If we move arch_unmap() to before
> __do_munmap() does any data structure manipulation, the recursive call
> doesn't get confused any more.

If only Powerpc is impacted I guess this would be fine but what about
the other architectures?

>> There are 2 assumptions here:
>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)
>
> Are you sure about #2? The 'vdso64_pages' variable seems rather
> unnecessary if the VDSO is only 1 page. ;)

Hum, not so sure now ;)
I got confused, only the header is one page.
The test works as a best effort, and doesn't cover the case where
only a few pages inside the VDSO are unmapped (start >
mm->context.vdso_base). This is not what CRIU is doing and so this was
enough for CRIU support.

Michael, do you think there is a need to manage all the possibilities
here, since the only user is CRIU and unmapping the VDSO is not such a
good idea for other processes?

2019-05-01 10:34:09

by Michael Ellerman

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Laurent Dufour <[email protected]> writes:
> Le 23/04/2019 à 18:04, Dave Hansen a écrit :
>> On 4/23/19 4:16 AM, Laurent Dufour wrote:
...
>>> There are 2 assumptions here:
>>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)
>>
>> Are you sure about #2? The 'vdso64_pages' variable seems rather
>> unnecessary if the VDSO is only 1 page. ;)
>
> Hum, not so sure now ;)
> I got confused, only the header is one page.
> The test is working as a best effort, and don't cover the case where
> only few pages inside the VDSO are unmmapped (start >
> mm->context.vdso_base). This is not what CRIU is doing and so this was
> enough for CRIU support.
>
> Michael, do you think there is a need to manage all the possibility
> here, since the only user is CRIU and unmapping the VDSO is not a so
> good idea for other processes ?

Couldn't we implement the semantic that if any part of the VDSO is
unmapped then vdso_base is set to zero? That should be fairly easy, eg:

if (start < vdso_end && end >= mm->context.vdso_base)
mm->context.vdso_base = 0;


We might need to add vdso_end to the mm->context, but that should be OK.

That seems like it would work for CRIU and make sense in general?

cheers

2019-05-07 16:37:47

by Laurent Dufour

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Le 01/05/2019 à 12:32, Michael Ellerman a écrit :
> Laurent Dufour <[email protected]> writes:
>> Le 23/04/2019 à 18:04, Dave Hansen a écrit :
>>> On 4/23/19 4:16 AM, Laurent Dufour wrote:
> ...
>>>> There are 2 assumptions here:
>>>> 1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>>>> 2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)
>>>
>>> Are you sure about #2? The 'vdso64_pages' variable seems rather
>>> unnecessary if the VDSO is only 1 page. ;)
>>
>> Hum, not so sure now ;)
>> I got confused, only the header is one page.
>> The test is working as a best effort, and don't cover the case where
>> only few pages inside the VDSO are unmmapped (start >
>> mm->context.vdso_base). This is not what CRIU is doing and so this was
>> enough for CRIU support.
>>
>> Michael, do you think there is a need to manage all the possibility
>> here, since the only user is CRIU and unmapping the VDSO is not a so
>> good idea for other processes ?
>
> Couldn't we implement the semantic that if any part of the VDSO is
> unmapped then vdso_base is set to zero? That should be fairly easy, eg:
>
> if (start < vdso_end && end >= mm->context.vdso_base)
> mm->context.vdso_base = 0;
>
>
> We might need to add vdso_end to the mm->context, but that should be OK.
>
> That seems like it would work for CRIU and make sense in general?

Sorry for the late answer, yes this would make more sense.

Here is a patch doing that.

Cheers,
Laurent



Attachments:
0001-powerpc-vdso-handle-generic-unmap-of-the-VDSO.patch (6.81 kB)

2020-10-23 15:50:09

by Christophe Leroy

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Hi Laurent

Le 07/05/2019 à 18:35, Laurent Dufour a écrit :
> Le 01/05/2019 à 12:32, Michael Ellerman a écrit :
>> Laurent Dufour <[email protected]> writes:
>>> Le 23/04/2019 à 18:04, Dave Hansen a écrit :
>>>> On 4/23/19 4:16 AM, Laurent Dufour wrote:
>> ...
>>>>> There are 2 assumptions here:
>>>>>    1. 'start' and 'end' are page aligned (this is guaranteed by __do_munmap().
>>>>>    2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store on powerpc)
>>>>
>>>> Are you sure about #2?  The 'vdso64_pages' variable seems rather
>>>> unnecessary if the VDSO is only 1 page. ;)
>>>
>>> Hum, not so sure now ;)
>>> I got confused, only the header is one page.
>>> The test is working as a best effort, and don't cover the case where
>>> only few pages inside the VDSO are unmmapped (start >
>>> mm->context.vdso_base). This is not what CRIU is doing and so this was
>>> enough for CRIU support.
>>>
>>> Michael, do you think there is a need to manage all the possibility
>>> here, since the only user is CRIU and unmapping the VDSO is not a so
>>> good idea for other processes ?
>>
>> Couldn't we implement the semantic that if any part of the VDSO is
>> unmapped then vdso_base is set to zero? That should be fairly easy, eg:
>>
>>     if (start < vdso_end && end >= mm->context.vdso_base)
>>         mm->context.vdso_base = 0;
>>
>>
>> We might need to add vdso_end to the mm->context, but that should be OK.
>>
>> That seems like it would work for CRIU and make sense in general?
>
> Sorry for the late answer, yes this would make more sense.
>
> Here is a patch doing that.
>

In your patch, the test seems overkill:

+ if ((start <= vdso_base && vdso_end <= end) || /* 1 */
+ (vdso_base <= start && start < vdso_end) || /* 3,4 */
+ (vdso_base < end && end <= vdso_end)) /* 2,3 */
+ mm->context.vdso_base = mm->context.vdso_end = 0;

What about

if (start < vdso_end && vdso_start < end)
mm->context.vdso_base = mm->context.vdso_end = 0;

This should cover all cases, or am I missing something ?
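
As a standalone illustration (not kernel code; addresses are
arbitrary), that expression is the usual half-open interval overlap
test, so it also catches partial unmaps of a multi-page VDSO:

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL

static bool overlaps(unsigned long start, unsigned long end,
		     unsigned long vdso_start, unsigned long vdso_end)
{
	return start < vdso_end && vdso_start < end;
}

int main(void)
{
	unsigned long vdso_start = 0x100000UL;
	unsigned long vdso_end = vdso_start + 4 * PAGE_SIZE; /* multi-page VDSO */

	/* exact match, surrounding range, and both partial overlaps */
	assert(overlaps(vdso_start, vdso_end, vdso_start, vdso_end));
	assert(overlaps(vdso_start - PAGE_SIZE, vdso_end + PAGE_SIZE, vdso_start, vdso_end));
	assert(overlaps(vdso_start + PAGE_SIZE, vdso_end + PAGE_SIZE, vdso_start, vdso_end));
	assert(overlaps(vdso_start - PAGE_SIZE, vdso_start + PAGE_SIZE, vdso_start, vdso_end));

	/* adjacent but non-overlapping ranges are left alone */
	assert(!overlaps(vdso_end, vdso_end + PAGE_SIZE, vdso_start, vdso_end));
	assert(!overlaps(vdso_start - PAGE_SIZE, vdso_start, vdso_start, vdso_end));

	printf("all overlap cases behave as expected\n");
	return 0;
}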


And do we really need to store vdso_end in the context ?
I think it should be possible to re-calculate it: the size of the VDSO should be (&vdso32_end -
&vdso32_start) + PAGE_SIZE for 32 bits VDSO, and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the
64 bits VDSO.

Christophe

2020-11-03 17:14:04

by Laurent Dufour

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Le 23/10/2020 à 14:28, Christophe Leroy a écrit :
> Hi Laurent
>
> Le 07/05/2019 à 18:35, Laurent Dufour a écrit :
>> Le 01/05/2019 à 12:32, Michael Ellerman a écrit :
>>> Laurent Dufour <[email protected]> writes:
>>>> Le 23/04/2019 à 18:04, Dave Hansen a écrit :
>>>>> On 4/23/19 4:16 AM, Laurent Dufour wrote:
>>> ...
>>>>>> There are 2 assumptions here:
>>>>>>    1. 'start' and 'end' are page aligned (this is guaranteed by
>>>>>> __do_munmap().
>>>>>>    2. the VDSO is 1 page (this is guaranteed by the union vdso_data_store
>>>>>> on powerpc)
>>>>>
>>>>> Are you sure about #2?  The 'vdso64_pages' variable seems rather
>>>>> unnecessary if the VDSO is only 1 page. ;)
>>>>
>>>> Hum, not so sure now ;)
>>>> I got confused, only the header is one page.
>>>> The test is working as a best effort, and don't cover the case where
>>>> only few pages inside the VDSO are unmmapped (start >
>>>> mm->context.vdso_base). This is not what CRIU is doing and so this was
>>>> enough for CRIU support.
>>>>
>>>> Michael, do you think there is a need to manage all the possibility
>>>> here, since the only user is CRIU and unmapping the VDSO is not a so
>>>> good idea for other processes ?
>>>
>>> Couldn't we implement the semantic that if any part of the VDSO is
>>> unmapped then vdso_base is set to zero? That should be fairly easy, eg:
>>>
>>>     if (start < vdso_end && end >= mm->context.vdso_base)
>>>         mm->context.vdso_base = 0;
>>>
>>>
>>> We might need to add vdso_end to the mm->context, but that should be OK.
>>>
>>> That seems like it would work for CRIU and make sense in general?
>>
>> Sorry for the late answer, yes this would make more sense.
>>
>> Here is a patch doing that.
>>
>
> In your patch, the test seems overkill:
>
> +    if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
> +        (vdso_base <= start && start < vdso_end) || /* 3,4 */
> +        (vdso_base < end && end <= vdso_end))       /* 2,3 */
> +        mm->context.vdso_base = mm->context.vdso_end = 0;
>
> What about
>
>     if (start < vdso_end && vdso_start < end)
>         mm->context.vdso_base = mm->context.vdso_end = 0;
>
> This should cover all cases, or am I missing something ?
>
>
> And do we really need to store vdso_end in the context ?
> I think it should be possible to re-calculate it: the size of the VDSO should be
> (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO, and (&vdso64_end -
> &vdso64_start) + PAGE_SIZE for the 64 bits VDSO.

Thanks Christophe for the advice.

That covers all the cases, and indeed is similar to Michael's proposal
that I missed last year.

I'll send a patch fixing this issue following your proposal.

Cheers,
Laurent.

2020-11-03 21:10:51

by Dmitry Safonov

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Hi Laurent, Christophe, Michael, all,

On 11/3/20 5:11 PM, Laurent Dufour wrote:
> Le 23/10/2020 à 14:28, Christophe Leroy a écrit :
[..]
>>>> That seems like it would work for CRIU and make sense in general?
>>>
>>> Sorry for the late answer, yes this would make more sense.
>>>
>>> Here is a patch doing that.
>>>
>>
>> In your patch, the test seems overkill:
>>
>> +    if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
>> +        (vdso_base <= start && start < vdso_end) || /* 3,4 */
>> +        (vdso_base < end && end <= vdso_end))       /* 2,3 */
>> +        mm->context.vdso_base = mm->context.vdso_end = 0;
>>
>> What about
>>
>>      if (start < vdso_end && vdso_start < end)
>>          mm->context.vdso_base = mm->context.vdso_end = 0;
>>
>> This should cover all cases, or am I missing something ?
>>
>>
>> And do we really need to store vdso_end in the context ?
>> I think it should be possible to re-calculate it: the size of the VDSO
>> should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO,
>> and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO.
>
> Thanks Christophe for the advise.
>
> That is covering all the cases, and indeed is similar to the Michael's
> proposal I missed last year.
>
> I'll send a patch fixing this issue following your proposal.

It's probably not necessary anymore. I've sent patches [1], currently in
akpm; the last one forbids splitting of vm_special_mapping.
So, a user is able to munmap() or mremap() the vdso as a whole, but not partially.

[1]:
https://lore.kernel.org/linux-mm/[email protected]/

Thanks,
Dmitry

2020-11-04 09:46:01

by Laurent Dufour

Subject: Re: [PATCH] x86/mpx: fix recursive munmap() corruption

Le 03/11/2020 à 22:08, Dmitry Safonov a écrit :
> Hi Laurent, Christophe, Michael, all,
>
> On 11/3/20 5:11 PM, Laurent Dufour wrote:
>> Le 23/10/2020 à 14:28, Christophe Leroy a écrit :
> [..]
>>>>> That seems like it would work for CRIU and make sense in general?
>>>>
>>>> Sorry for the late answer, yes this would make more sense.
>>>>
>>>> Here is a patch doing that.
>>>>
>>>
>>> In your patch, the test seems overkill:
>>>
>>> +    if ((start <= vdso_base && vdso_end <= end) ||  /* 1   */
>>> +        (vdso_base <= start && start < vdso_end) || /* 3,4 */
>>> +        (vdso_base < end && end <= vdso_end))       /* 2,3 */
>>> +        mm->context.vdso_base = mm->context.vdso_end = 0;
>>>
>>> What about
>>>
>>>      if (start < vdso_end && vdso_start < end)
>>>          mm->context.vdso_base = mm->context.vdso_end = 0;
>>>
>>> This should cover all cases, or am I missing something ?
>>>
>>>
>>> And do we really need to store vdso_end in the context ?
>>> I think it should be possible to re-calculate it: the size of the VDSO
>>> should be (&vdso32_end - &vdso32_start) + PAGE_SIZE for 32 bits VDSO,
>>> and (&vdso64_end - &vdso64_start) + PAGE_SIZE for the 64 bits VDSO.
>>
>> Thanks Christophe for the advise.
>>
>> That is covering all the cases, and indeed is similar to the Michael's
>> proposal I missed last year.
>>
>> I'll send a patch fixing this issue following your proposal.
>
> It's probably not necessary anymore. I've sent patches [1], currently in
> akpm, the last one forbids splitting of vm_special_mapping.
> So, a user is able munmap() or mremap() vdso as a whole, but not partly.

Hi Dmitry,

That's a good thing too, but I think my patch is still valid in the PowerPC
code, fixing a bad check, even if some corner cases are handled earlier in the code.

> [1]:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> Thanks,
> Dmitry
>