LinuxLists.cc - [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area

2019-03-02 15:13:57

Subject: [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps/writes/munmaps a memory area
while another thread tries to read from it:

CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
Call Trace:
([<0000000000000000>] (null))
[<00000000001adae4>] lock_acquire+0xec/0x258
[<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
[<000000000012a780>] page_table_free+0x48/0x1a8
[<00000000002f6e54>] do_fault+0xdc/0x670
[<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
[<00000000002fb138>] handle_mm_fault+0x1b0/0x320
[<00000000001248cc>] do_dat_exception+0x19c/0x2c8
[<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in,
call munmap() and cause "vma" to be returned to the kmem cache,
then zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

This patch pins mm_struct and stores its value, to avoid using
potentially stale "vma" when calling pte_free().

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <[email protected]>
---
mm/memory.c | 10 +++++++++-
1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..1287ee9acbdc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,12 +3517,17 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value. See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
 	vm_fault_t ret;

+	mmgrab(vm_mm);
+
 	/*
 	 * The VMA was not fully populated on mmap() or missing VM_DONTEXPAND
 	 */
@@ -3561,9 +3566,12 @@ static vm_fault_t do_fault(struct vm_fault *vmf)

 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
+
+	mmdrop(vm_mm);
+
 	return ret;
 }

--
1.8.3.1
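
The stale-pointer pattern described in the patch can be modelled in userspace. The following is a minimal, single-threaded sketch and not kernel code: `struct mm`, `struct vma`, `vma_alloc()`, `recycle_vma()` and the one-object "cache" are hypothetical stand-ins, and the concurrent munmap()/mmap() is simulated by a plain function call at the point where mmap_sem would be dropped.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct mm { int id; };
struct vma { struct mm *vm_mm; };

/* A one-object "cache": every allocation hands out the same memory. */
static struct vma slot;

static struct vma *vma_alloc(struct mm *mm)
{
	memset(&slot, 0, sizeof(slot));
	slot.vm_mm = mm;
	return &slot;
}

/*
 * Models the other thread while the lock is dropped: munmap() frees the
 * object, a new mmap() gets the same memory back and has only zeroed it.
 */
static void recycle_vma(void)
{
	memset(&slot, 0, sizeof(slot));
}

/* Pre-patch pattern: dereference vma after the window. */
static struct mm *mm_after_window_stale(struct vma *vma)
{
	recycle_vma();
	return vma->vm_mm;	/* reads recycled memory */
}

/* Patched pattern: read vm_mm while vma is still known to be valid. */
static struct mm *mm_after_window_cached(struct vma *vma)
{
	struct mm *vm_mm = vma->vm_mm;

	recycle_vma();
	return vm_mm;
}

/* Runs both patterns; returns 0 if they behave as described. */
static int run_demo(void)
{
	struct mm mm = { 1 };
	struct mm *stale = mm_after_window_stale(vma_alloc(&mm));
	struct mm *cached = mm_after_window_cached(vma_alloc(&mm));

	assert(stale == NULL);	/* what page_table_free() received */
	assert(cached == &mm);	/* what the patch passes instead */
	return 0;
}
```

Calling `run_demo()` from a `main()` exercises both paths. In this model the stale read yields NULL because the object has been zeroed and not yet re-initialized, which corresponds to the NULL mm seen in the trace.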

2019-03-02 17:12:02

by Matthew Wilcox

Subject: Re: [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

On Sat, Mar 02, 2019 at 04:11:26PM +0100, Jan Stancek wrote:
> The problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" to be returned to the kmem cache,
> then zeroed/re-initialized and re-used:

> This patch pins mm_struct and stores its value, to avoid using
> potentially stale "vma" when calling pte_free().

OK, we need to cache the mm_struct, but why do we need the extra atomic op?
There's surely no way the mm can be freed while the thread is in the middle
of handling a fault.

ie I would drop these lines:

> + mmgrab(vm_mm);
> +
...
> +
> + mmdrop(vm_mm);
> +
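
The point about the extra atomic operations can be illustrated with a small userspace model. This is a sketch under stated assumptions: `struct mm`, `mm_grab()` and `mm_drop()` are hypothetical analogues of the kernel's mm_struct, mmgrab() and mmdrop(), and the model assumes, as the reply argues, that the faulting thread already holds a reference for the whole duration of the fault.

```c
#include <assert.h>
#include <stdatomic.h>

/* Hypothetical model of an mm with a reference count; not the kernel layout. */
struct mm { atomic_int count; };

/* Userspace analogues of mmgrab()/mmdrop(): one atomic op each. */
static void mm_grab(struct mm *mm)
{
	atomic_fetch_add(&mm->count, 1);
}

static void mm_drop(struct mm *mm)
{
	int old = atomic_fetch_sub(&mm->count, 1);

	assert(old > 0);	/* dropping below zero would be a refcount bug */
}

/* v1 pattern: pin the mm around the fault; returns the count seen inside. */
static int count_seen_with_pin(struct mm *mm)
{
	int seen;

	mm_grab(mm);
	seen = atomic_load(&mm->count);
	mm_drop(mm);
	return seen;
}

/* Suggested pattern: rely on the reference the caller already holds. */
static int count_seen_without_pin(struct mm *mm)
{
	return atomic_load(&mm->count);
}
```

If the count starts at 1 (the reference assumed to be held by the faulting thread), the unpinned variant sees 1 and the pinned variant sees 2. Either way the count never reaches zero inside the function, so in this model the grab/drop pair only adds two atomic operations.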

2019-03-02 18:01:08

by Jan Stancek

Subject: Re: [PATCH] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

----- Original Message -----
> On Sat, Mar 02, 2019 at 04:11:26PM +0100, Jan Stancek wrote:
> > The problem is that "vmf->vma" used in do_fault() can become stale.
> > Because mmap_sem may be released, other threads can come in,
> > call munmap() and cause "vma" to be returned to the kmem cache,
> > then zeroed/re-initialized and re-used:
>
> > This patch pins mm_struct and stores its value, to avoid using
> > potentially stale "vma" when calling pte_free().
>
> OK, we need to cache the mm_struct, but why do we need the extra atomic op?
> There's surely no way the mm can be freed while the thread is in the middle
> of handling a fault.

You're right, I was needlessly paranoid.

>
> ie I would drop these lines:

I'll send v2.

Thanks,
Jan

>
> > + mmgrab(vm_mm);
> > +
> ...
> > +
> > + mmdrop(vm_mm);
> > +
>

2019-03-02 18:20:58

by Jan Stancek

Subject: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps/writes/munmaps a memory area
while another thread tries to read from it:

CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
Call Trace:
([<0000000000000000>] (null))
[<00000000001adae4>] lock_acquire+0xec/0x258
[<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
[<000000000012a780>] page_table_free+0x48/0x1a8
[<00000000002f6e54>] do_fault+0xdc/0x670
[<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
[<00000000002fb138>] handle_mm_fault+0x1b0/0x320
[<00000000001248cc>] do_dat_exception+0x19c/0x2c8
[<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in,
call munmap() and cause "vma" to be returned to the kmem cache,
then zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <[email protected]>
---
mm/memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..6c1afc1ece50 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value. See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
 	vm_fault_t ret;

 	/*
@@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)

 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
 	return ret;
--
1.8.3.1

2019-03-02 18:48:19

by Peter Zijlstra

Subject: Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
> static vm_fault_t do_fault(struct vm_fault *vmf)
> {
> struct vm_area_struct *vma = vmf->vma;
> + struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);

Would this not need a corresponding WRITE_ONCE() in vma_init() ?

2019-03-02 18:52:53

by Andrea Arcangeli

Subject: Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

Hello Jan,

On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
> + struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);

vma->vm_mm cannot change under gcc there, so there is no need for
READ_ONCE. The release of mmap_sem has release semantics, so the
vma->vm_mm access cannot be reordered after up_read(mmap_sem) either.

Other than the above detail:

Reviewed-by: Andrea Arcangeli <[email protected]>

Thanks,
Andrea

2019-03-03 07:28:01

by Jan Stancek

Subject: Re: [PATCH v2] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

----- Original Message -----
> Hello Jan,
>
> On Sat, Mar 02, 2019 at 07:19:39PM +0100, Jan Stancek wrote:
> > + struct mm_struct *vm_mm = READ_ONCE(vma->vm_mm);
>
> vma->vm_mm cannot change under gcc there, so there is no need for
> READ_ONCE. The release of mmap_sem has release semantics, so the
> vma->vm_mm access cannot be reordered after up_read(mmap_sem) either.
>
> Other than the above detail:
>
> Reviewed-by: Andrea Arcangeli <[email protected]>

Thank you for the review. I dropped READ_ONCE and sent v3 with your
Reviewed-by included. I also successfully re-ran the tests overnight.

> Would this not need a corresponding WRITE_ONCE() in vma_init() ?

There are at least two context switches in between, so I think it wouldn't matter.
My concern was gcc optimizing out vm_mm, so that the vma->vm_mm access would
happen only after do_read_fault().
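
The READ_ONCE question can be sketched in userspace as follows. This is an illustration only: `READ_ONCE_SKETCH()` is a simplified volatile-cast stand-in (it relies on the GCC/Clang `__typeof__` extension) and is not the kernel's actual READ_ONCE(); `struct mm`, `struct vma`, `clobber()` and the `window` callback are hypothetical. The callback stands in for do_read_fault(), an opaque call that may modify the object.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: forces exactly one load through a volatile lvalue. */
#define READ_ONCE_SKETCH(x) (*(const volatile __typeof__(x) *)&(x))

/* Hypothetical stand-ins for the kernel structures; not the real layouts. */
struct mm { int id; };
struct vma { struct mm *vm_mm; };

/* Plain load, as in v3, followed by an opaque call that may change *vma. */
static struct mm *cache_plain(struct vma *vma, void (*window)(struct vma *))
{
	struct mm *vm_mm = vma->vm_mm;

	window(vma);
	return vm_mm;
}

/* Same sequence with the volatile load, as in v2. */
static struct mm *cache_once(struct vma *vma, void (*window)(struct vma *))
{
	struct mm *vm_mm = READ_ONCE_SKETCH(vma->vm_mm);

	window(vma);
	return vm_mm;
}

/* Example window: wipes the field, like a recycled vma would look. */
static void clobber(struct vma *vma)
{
	vma->vm_mm = NULL;
}
```

In this sketch both variants return the value that was stored before the call, even though the field is NULL afterwards. Because the call may write to the object, the compiler cannot move the plain load past it, which is in line with the review comment that READ_ONCE is not required here.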

2019-03-03 07:29:48

by Jan Stancek

Subject: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
This is a stress test in which one thread mmaps/writes/munmaps a memory area
while another thread tries to read from it:

CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
Call Trace:
([<0000000000000000>] (null))
[<00000000001adae4>] lock_acquire+0xec/0x258
[<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
[<000000000012a780>] page_table_free+0x48/0x1a8
[<00000000002f6e54>] do_fault+0xdc/0x670
[<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
[<00000000002fb138>] handle_mm_fault+0x1b0/0x320
[<00000000001248cc>] do_dat_exception+0x19c/0x2c8
[<000000000080e5ee>] pgm_check_handler+0x19e/0x200

page_table_free() is called with a NULL mm parameter, but because
"0" is a valid address on s390 (see S390_lowcore), it keeps
going until it eventually crashes in lockdep's lock_acquire.
This crash has been reproducible since at least 4.14.

The problem is that "vmf->vma" used in do_fault() can become stale.
Because mmap_sem may be released, other threads can come in,
call munmap() and cause "vma" to be returned to the kmem cache,
then zeroed/re-initialized and re-used:

handle_mm_fault                           |
  __handle_mm_fault                       |
    do_fault                              |
      vma = vmf->vma                      |
      do_read_fault                       |
        __do_fault                        |
          vma->vm_ops->fault(vmf);        |
            mmap_sem is released          |
                                          |
                                          | do_munmap()
                                          |   remove_vma_list()
                                          |     remove_vma()
                                          |       vm_area_free()
                                          |         # vma is released
                                          | ...
                                          | # same vma is allocated
                                          | # from kmem cache
                                          | do_mmap()
                                          |   vm_area_alloc()
                                          |     memset(vma, 0, ...)
                                          |
      pte_free(vma->vm_mm, ...);          |
        page_table_free                   |
          spin_lock_bh(&mm->context.lock);|
            <crash>                       |

Cache mm_struct to avoid using potentially stale "vma".

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c

Signed-off-by: Jan Stancek <[email protected]>
Reviewed-by: Andrea Arcangeli <[email protected]>
---
mm/memory.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/memory.c b/mm/memory.c
index e11ca9dd823f..e8d69ade5acc 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
  * but allow concurrent faults).
  * The mmap_sem may have been released depending on flags and our
  * return value. See filemap_fault() and __lock_page_or_retry().
+ * If mmap_sem is released, vma may become invalid (for example
+ * by other thread calling munmap()).
  */
 static vm_fault_t do_fault(struct vm_fault *vmf)
 {
 	struct vm_area_struct *vma = vmf->vma;
+	struct mm_struct *vm_mm = vma->vm_mm;
 	vm_fault_t ret;

 	/*
@@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)

 	/* preallocated pagetable is unused: free it */
 	if (vmf->prealloc_pte) {
-		pte_free(vma->vm_mm, vmf->prealloc_pte);
+		pte_free(vm_mm, vmf->prealloc_pte);
 		vmf->prealloc_pte = NULL;
 	}
 	return ret;
--
1.8.3.1

2019-03-03 10:38:15

by Matthew Wilcox

Subject: Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> Cache mm_struct to avoid using potentially stale "vma".
>
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
>
> Signed-off-by: Jan Stancek <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>

Reviewed-by: Matthew Wilcox <[email protected]>

2019-03-04 00:14:20

by Rafael Aquini

Subject: Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

On Sun, Mar 03, 2019 at 08:28:04AM +0100, Jan Stancek wrote:
> LTP testcase mtest06 [1] can trigger a crash on s390x running 5.0.0-rc8.
> This is a stress test in which one thread mmaps/writes/munmaps a memory area
> while another thread tries to read from it:
>
> CPU: 0 PID: 2611 Comm: mmap1 Not tainted 5.0.0-rc8+ #51
> Hardware name: IBM 2964 N63 400 (z/VM 6.4.0)
> Krnl PSW : 0404e00180000000 00000000001ac8d8 (__lock_acquire+0x7/0x7a8)
> Call Trace:
> ([<0000000000000000>] (null))
> [<00000000001adae4>] lock_acquire+0xec/0x258
> [<000000000080d1ac>] _raw_spin_lock_bh+0x5c/0x98
> [<000000000012a780>] page_table_free+0x48/0x1a8
> [<00000000002f6e54>] do_fault+0xdc/0x670
> [<00000000002fadae>] __handle_mm_fault+0x416/0x5f0
> [<00000000002fb138>] handle_mm_fault+0x1b0/0x320
> [<00000000001248cc>] do_dat_exception+0x19c/0x2c8
> [<000000000080e5ee>] pgm_check_handler+0x19e/0x200
>
> page_table_free() is called with a NULL mm parameter, but because
> "0" is a valid address on s390 (see S390_lowcore), it keeps
> going until it eventually crashes in lockdep's lock_acquire.
> This crash has been reproducible since at least 4.14.
>
> The problem is that "vmf->vma" used in do_fault() can become stale.
> Because mmap_sem may be released, other threads can come in,
> call munmap() and cause "vma" to be returned to the kmem cache,
> then zeroed/re-initialized and re-used:
>
> handle_mm_fault                           |
>   __handle_mm_fault                       |
>     do_fault                              |
>       vma = vmf->vma                      |
>       do_read_fault                       |
>         __do_fault                        |
>           vma->vm_ops->fault(vmf);        |
>             mmap_sem is released          |
>                                           |
>                                           | do_munmap()
>                                           |   remove_vma_list()
>                                           |     remove_vma()
>                                           |       vm_area_free()
>                                           |         # vma is released
>                                           | ...
>                                           | # same vma is allocated
>                                           | # from kmem cache
>                                           | do_mmap()
>                                           |   vm_area_alloc()
>                                           |     memset(vma, 0, ...)
>                                           |
>       pte_free(vma->vm_mm, ...);          |
>         page_table_free                   |
>           spin_lock_bh(&mm->context.lock);|
>             <crash>                       |
>
> Cache mm_struct to avoid using potentially stale "vma".
>
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/mem/mtest06/mmap1.c
>
> Signed-off-by: Jan Stancek <[email protected]>
> Reviewed-by: Andrea Arcangeli <[email protected]>
> ---
> mm/memory.c | 5 ++++-
> 1 file changed, 4 insertions(+), 1 deletion(-)
>
> diff --git a/mm/memory.c b/mm/memory.c
> index e11ca9dd823f..e8d69ade5acc 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3517,10 +3517,13 @@ static vm_fault_t do_shared_fault(struct vm_fault *vmf)
>   * but allow concurrent faults).
>   * The mmap_sem may have been released depending on flags and our
>   * return value. See filemap_fault() and __lock_page_or_retry().
> + * If mmap_sem is released, vma may become invalid (for example
> + * by other thread calling munmap()).
>   */
>  static vm_fault_t do_fault(struct vm_fault *vmf)
>  {
>  	struct vm_area_struct *vma = vmf->vma;
> +	struct mm_struct *vm_mm = vma->vm_mm;
>  	vm_fault_t ret;
>
>  	/*
> @@ -3561,7 +3564,7 @@ static vm_fault_t do_fault(struct vm_fault *vmf)
>
>  	/* preallocated pagetable is unused: free it */
>  	if (vmf->prealloc_pte) {
> -		pte_free(vma->vm_mm, vmf->prealloc_pte);
> +		pte_free(vm_mm, vmf->prealloc_pte);
>  		vmf->prealloc_pte = NULL;
>  	}
>  	return ret;
> --
> 1.8.3.1
>
Acked-by: Rafael Aquini <[email protected]>

2019-03-04 08:13:04

by Minchan Kim

Subject: Re: [PATCH v3] mm/memory.c: do_fault: avoid usage of stale vm_area_struct

2019-03-04 08:20:09

by Kirill A. Shutemov
