2017-03-02 15:13:04

by Kirill A. Shutemov

Subject: [PATCH 0/4] thp: fix few MADV_DONTNEED races

For MADV_DONTNEED to work properly with huge pages, it's critical not to
clear the pmd intermittently unless you hold down_write(mmap_sem).
Otherwise MADV_DONTNEED can miss the THP, which can lead to userspace
breakage.

See an example of such a race in the commit message of patch 2/4; the toy
model below sketches the pattern in userspace.
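
The dangerous pattern is the unlocked THP check in zap_pmd_range(): if
another CPU transiently clears the pmd, the check reads zero and the zap
is skipped. Below is a toy userspace model of that pattern (an
illustration only, not kernel code; the function names merely mirror the
kernel paths involved):

/*
 * Toy model of the race.  CPU1 transiently clears the "pmd" while CPU0
 * tests it without the page-table lock; if CPU0 reads in that window,
 * the zap is skipped and the mapping survives MADV_DONTNEED.
 * Build: cc -pthread -o race race.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static _Atomic unsigned long pmd = 1;		/* huge pmd present */

static void *cpu1_change_huge_pmd(void *arg)	/* prot_numa path */
{
	/* pmdp_huge_get_and_clear_notify(): pmd is transiently zero */
	unsigned long entry = atomic_exchange(&pmd, 0);

	entry |= 2;				/* pmd_modify() */
	atomic_store(&pmd, entry);		/* set_pmd_at(): re-established */
	return NULL;
}

static void *cpu0_madvise_dontneed(void *arg)
{
	/* zap_pmd_range(): pmd_trans_huge(*pmd) checked without ptl */
	if (atomic_load(&pmd) != 0)
		atomic_store(&pmd, 0);		/* zap it */
	/* else: pmd looked empty, zap skipped, old data survives */
	return NULL;
}

int main(void)
{
	pthread_t t0, t1;

	pthread_create(&t1, NULL, cpu1_change_huge_pmd, NULL);
	pthread_create(&t0, NULL, cpu0_madvise_dontneed, NULL);
	pthread_join(t1, NULL);
	pthread_join(t0, NULL);

	/* Non-zero output models the bug: MADV_DONTNEED "succeeded" but
	 * the mapping is still populated. */
	printf("pmd after MADV_DONTNEED: %#lx\n",
	       (unsigned long)atomic_load(&pmd));
	return 0;
}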

All these races were found by code inspection; I haven't seen any of them
triggered. I don't think it's worth applying them to stable@.

Kirill A. Shutemov (4):
thp: reduce indentation level in change_huge_pmd()
thp: fix MADV_DONTNEED vs. numa balancing race
thp: fix MADV_DONTNEED vs. MADV_FREE race
thp: fix MADV_DONTNEED vs clear soft dirty race

fs/proc/task_mmu.c | 9 +++++-
mm/huge_memory.c | 86 ++++++++++++++++++++++++++++++++++++------------------
2 files changed, 66 insertions(+), 29 deletions(-)

--
2.11.0


2017-03-02 15:52:14

by Kirill A. Shutemov

Subject: [PATCH 1/4] thp: reduce indentation level in change_huge_pmd()

Restructure code in preparation for a fix.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
1 file changed, 26 insertions(+), 26 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 71e3dede95b4..e7ce73b2b208 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
- int ret = 0;
+ pmd_t entry;
+ bool preserve_write;
+ int ret;

ptl = __pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
- pmd_t entry;
- bool preserve_write = prot_numa && pmd_write(*pmd);
- ret = 1;
+ if (!ptl)
+ return 0;

- /*
- * Avoid trapping faults against the zero page. The read-only
- * data is likely to be read-cached on the local CPU and
- * local/remote hits to the zero page are not interesting.
- */
- if (prot_numa && is_huge_zero_pmd(*pmd)) {
- spin_unlock(ptl);
- return ret;
- }
+ preserve_write = prot_numa && pmd_write(*pmd);
+ ret = 1;

- if (!prot_numa || !pmd_protnone(*pmd)) {
- entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
- entry = pmd_modify(entry, newprot);
- if (preserve_write)
- entry = pmd_mk_savedwrite(entry);
- ret = HPAGE_PMD_NR;
- set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
- pmd_write(entry));
- }
- spin_unlock(ptl);
- }
+ /*
+ * Avoid trapping faults against the zero page. The read-only
+ * data is likely to be read-cached on the local CPU and
+ * local/remote hits to the zero page are not interesting.
+ */
+ if (prot_numa && is_huge_zero_pmd(*pmd))
+ goto unlock;

+ if (prot_numa && pmd_protnone(*pmd))
+ goto unlock;
+
+ entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+ entry = pmd_modify(entry, newprot);
+ if (preserve_write)
+ entry = pmd_mk_savedwrite(entry);
+ ret = HPAGE_PMD_NR;
+ set_pmd_at(mm, addr, pmd, entry);
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+unlock:
+ spin_unlock(ptl);
return ret;
}

--
2.11.0

2017-03-02 15:52:56

by Kirill A. Shutemov

Subject: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 2 --
1 file changed, 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..324217c31ec9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);

if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);

--
2.11.0

2017-03-02 15:54:59

by Kirill A. Shutemov

Subject: [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race

Yet another instance of the same race.

The fix is identical to the one in change_huge_pmd().

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb229ef6..0ce5294abc2c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -899,7 +899,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+ pmd_t pmd = *pmdp;
+
+ /* See comment in change_huge_pmd() */
+ pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(*pmdp))
+ pmd = pmd_mkdirty(pmd);
+ if (pmd_young(*pmdp))
+ pmd = pmd_mkyoung(pmd);

pmd = pmd_wrprotect(pmd);
pmd = pmd_clear_soft_dirty(pmd);
--
2.11.0

2017-03-03 02:09:57

by Kirill A. Shutemov

Subject: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

In the prot_numa case, we are under down_read(mmap_sem). It's critical
not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
which is also under down_read(mmap_sem):

CPU0:                          CPU1:
                               change_huge_pmd(prot_numa=1)
                                pmdp_huge_get_and_clear_notify()
madvise_dontneed()
 zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
                                set_pmd_at();
                                // pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd and fail to clear it,
which may break userspace.

Found by code analysis; I have never seen it triggered.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e7ce73b2b208..bb2b3646bd78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (prot_numa && pmd_protnone(*pmd))
goto unlock;

- entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+ /*
+ * In the prot_numa case, we are under down_read(mmap_sem). It's
+ * critical not to clear the pmd intermittently, to avoid racing with
+ * MADV_DONTNEED, which is also under down_read(mmap_sem):
+ *
+ * CPU0:                          CPU1:
+ *                                change_huge_pmd(prot_numa=1)
+ *                                 pmdp_huge_get_and_clear_notify()
+ * madvise_dontneed()
+ *  zap_pmd_range()
+ *   pmd_trans_huge(*pmd) == 0 (without ptl)
+ *   // skip the pmd
+ *                                 set_pmd_at();
+ *                                 // pmd is re-established
+ *
+ * The race makes MADV_DONTNEED miss the huge pmd and fail to clear it,
+ * which may break userspace.
+ *
+ * pmdp_invalidate() is required to make sure we don't miss
+ * dirty/young flags set by hardware.
+ */
+ entry = *pmd;
+ pmdp_invalidate(vma, addr, pmd);
+
+ /*
+ * Recover dirty/young flags. It relies on pmdp_invalidate to not
+ * corrupt them.
+ */
+ if (pmd_dirty(*pmd))
+ entry = pmd_mkdirty(entry);
+ if (pmd_young(*pmd))
+ entry = pmd_mkyoung(entry);
+
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
--
2.11.0

2017-03-03 05:38:32

by Hillf Danton

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race


On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
>
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>
$ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c

/*
* set a new huge pmd. We should not be called for updating
* an existing pmd entry. That should go via pmd_hugepage_update.
*/
void set_pmd_at(struct mm_struct *mm, unsigned long addr,

2017-03-03 10:27:13

by Kirill A. Shutemov

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
>
> On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> >
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > ---
> > mm/huge_memory.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > deactivate_page(page);
> >
> > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > - tlb->fullmm);
> > orig_pmd = pmd_mkold(orig_pmd);
> > orig_pmd = pmd_mkclean(orig_pmd);
> >
> $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
>
> /*
> * set a new huge pmd. We should not be called for updating
> * an existing pmd entry. That should go via pmd_hugepage_update.
> */
> void set_pmd_at(struct mm_struct *mm, unsigned long addr,

+Aneesh.

Urgh... Power is special again.

I think this should work fine.

From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race

Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..23c1b3d58cf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);

if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
+ pmdp_invalidate(vma, addr, pmd);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);

--
Kirill A. Shutemov

2017-03-03 17:18:10

by Dave Hansen

Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

On 03/02/2017 07:10 AM, Kirill A. Shutemov wrote:
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> if (prot_numa && pmd_protnone(*pmd))
> goto unlock;
>
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);

Are there any remaining call sites for pmdp_huge_get_and_clear_notify()
after this?
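
One quick way to check (assuming a tree with this series applied):

$ git grep -n pmdp_huge_get_and_clear_notify -- include/ mm/

If only the definition in include/linux/mmu_notifier.h comes back, the
helper has lost its last user and could presumably be dropped.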

2017-03-03 22:41:01

by Andrew Morton

Subject: Re: [PATCH 4/4] thp: fix MADV_DONTNEED vs clear soft dirty race

On Thu, 2 Mar 2017 18:10:34 +0300 "Kirill A. Shutemov" <[email protected]> wrote:

> Yet another instance of the same race.
>
> The fix is identical to the one in change_huge_pmd().

Nit: someone who is reading this changelog a year from now will be
quite confused - how do they work out what the race was?

I'll add

: See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.

to the changelogs to help them a bit.

Also, it wasn't a great idea to start this series with a "Restructure
code in preparation for a fix". If people later start hitting this
race, the fixes will be difficult to backport. I'm OK with taking that
risk, but please do bear this in mind in the future.


2017-03-06 01:59:53

by Minchan Kim

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

Hello, Kirill,

On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> >
> > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> > >
> > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > here due to MADV_FREE functionality.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Cc: Minchan Kim <[email protected]>
> > > ---
> > > mm/huge_memory.c | 2 --
> > > 1 file changed, 2 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index bb2b3646bd78..324217c31ec9 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > deactivate_page(page);
> > >
> > > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > - tlb->fullmm);
> > > orig_pmd = pmd_mkold(orig_pmd);
> > > orig_pmd = pmd_mkclean(orig_pmd);
> > >
> > $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> >
> > /*
> > * set a new huge pmd. We should not be called for updating
> > * an existing pmd entry. That should go via pmd_hugepage_update.
> > */
> > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>
> +Aneesh.
>
> Urgh... Power is special again.
>
> I think this should work fine.
>
> From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Thu, 2 Mar 2017 16:47:45 +0300
> Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
>
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.

Could you elaborate a bit more here rather than relying on another
patch's description?

And could you say what happens to userspace if that race happens?
Guessing from the title "MADV_DONTNEED vs MADV_FREE", a page cannot be
zapped but is marked lazyfree, or vice versa? Right?

Thanks.

>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..23c1b3d58cf4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> + pmdp_invalidate(vma, addr, pmd);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>
> --
> Kirill A. Shutemov

2017-03-06 02:49:18

by Aneesh Kumar K.V

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

"Kirill A. Shutemov" <[email protected]> writes:

> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>

Instead, can we add a new interface that does something like

pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);

We already have a variant in ptep_set_access_flags. What we need is
something that can be used to update a THP pmd without converting it to
pmd_none, and which doesn't lose the referenced and changed bits.
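
For illustration, a minimal x86-flavoured sketch of such a helper (the
name, signature and use of xchg() are assumptions, not an existing API;
a real version would need per-arch overrides, e.g. powerpc's
pmd_hugepage_update(), plus the appropriate TLB flush and MMU notifier
calls):

/*
 * Hypothetical helper: replace a huge pmd with a new value in one
 * atomic swap, so there is never a pmd_none() window and any
 * hardware-set referenced/changed bits come back in the old entry.
 */
static inline pmd_t pmdp_huge_update(struct mm_struct *mm,
		unsigned long addr, pmd_t *pmdp, pmd_t newpmd)
{
	pmd_t old;

	old.pmd = xchg(&pmdp->pmd, newpmd.pmd);	/* x86 pmd layout assumed */
	return old;
}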

-aneesh

2017-03-07 14:38:32

by Kirill A. Shutemov

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

On Mon, Mar 06, 2017 at 10:44:46AM +0900, Minchan Kim wrote:
> Hello, Kirill,
>
> On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> > >
> > > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> > > >
> > > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > > here due to MADV_FREE functionality.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > Cc: Minchan Kim <[email protected]>
> > > > ---
> > > > mm/huge_memory.c | 2 --
> > > > 1 file changed, 2 deletions(-)
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index bb2b3646bd78..324217c31ec9 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > > deactivate_page(page);
> > > >
> > > > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > > - tlb->fullmm);
> > > > orig_pmd = pmd_mkold(orig_pmd);
> > > > orig_pmd = pmd_mkclean(orig_pmd);
> > > >
> > > $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> > >
> > > /*
> > > * set a new huge pmd. We should not be called for updating
> > > * an existing pmd entry. That should go via pmd_hugepage_update.
> > > */
> > > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> >
> > +Aneesh.
> >
> > Urgh... Power is special again.
> >
> > I think this should work fine.
> >
> > From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Thu, 2 Mar 2017 16:47:45 +0300
> > Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
> >
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
>
> Could you elaborate a bit more here rather than relying on another
> patch's description?

Okay, the updated patch is below.

> And could you say what happens to userspace if that race happens?
> Guessing from the title "MADV_DONTNEED vs MADV_FREE", a page cannot be
> zapped but is marked lazyfree, or vice versa? Right?

"Vise versa" part should be fine. The case I'm worry about is that
MADV_DONTNEED would skip the pmd and it will not be cleared.
Userspace expects the area of memory to be clean after MADV_DONTNEED, but
it's not. It can lead to userspace misbehaviour.
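
To make the expected semantics concrete, here is a minimal userspace
sketch of the contract MADV_DONTNEED provides for private anonymous
memory (illustrative only; error handling omitted):

/* After a successful madvise(MADV_DONTNEED), the range must read back
 * as zeroes.  If the racing MADV_FREE path re-establishes the pmd, the
 * old data survives and the assert below can fire. */
#include <assert.h>
#include <string.h>
#include <sys/mman.h>

int main(void)
{
	size_t len = 2UL << 20;			/* one pmd-sized range */
	char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	memset(p, 0xaa, len);
	madvise(p, len, MADV_DONTNEED);
	assert(p[0] == 0);			/* guaranteed by the interface */
	return 0;
}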

From a0967b0293a6f8053d85785c4d6340e550e849ea Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race

Both MADV_DONTNEED and MADV_FREE are handled with down_read(mmap_sem).
It's critical not to clear the pmd intermittently while handling
MADV_FREE, to avoid racing with MADV_DONTNEED:

CPU0:                          CPU1:
                               madvise_free_huge_pmd()
                                pmdp_huge_get_and_clear_full()
madvise_dontneed()
 zap_pmd_range()
  pmd_trans_huge(*pmd) == 0 (without ptl)
  // skip the pmd
                                set_pmd_at();
                                // pmd is re-established

As a result, MADV_DONTNEED skips the pmd, leaving it not cleared. This
violates the MADV_DONTNEED interface and can result in userspace
misbehaviour.

Basically it's the same race as with numa balancing in change_huge_pmd(),
but a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51a8c376d020..3c9ef1104d85 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1568,8 +1568,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);

if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
+ pmdp_invalidate(vma, addr, pmd);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);

--
Kirill A. Shutemov

2017-03-07 14:46:56

by Kirill A. Shutemov

Subject: Re: [PATCH 3/4] thp: fix MADV_DONTNEED vs. MADV_FREE race

On Mon, Mar 06, 2017 at 08:19:03AM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > ---
> > mm/huge_memory.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > deactivate_page(page);
> >
> > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > - tlb->fullmm);
> > orig_pmd = pmd_mkold(orig_pmd);
> > orig_pmd = pmd_mkclean(orig_pmd);
> >
>
> Instead, can we add a new interface that does something like
>
> pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);
>
> We already have a variant in ptep_set_access_flags. What we need is
> something that can be used to update a THP pmd without converting it to
> pmd_none, and which doesn't lose the referenced and changed bits.

Sounds like a good idea. Would you volunteer to implement it?
I don't have time for this right now.

--
Kirill A. Shutemov

2017-04-12 11:37:55

by Vlastimil Babka

Subject: Re: [PATCH 1/4] thp: reduce indentation level in change_huge_pmd()

On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> Restructure code in preparation for a fix.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
> 1 file changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 71e3dede95b4..e7ce73b2b208 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> {
> struct mm_struct *mm = vma->vm_mm;
> spinlock_t *ptl;
> - int ret = 0;
> + pmd_t entry;
> + bool preserve_write;
> + int ret;
>
> ptl = __pmd_trans_huge_lock(pmd, vma);
> - if (ptl) {
> - pmd_t entry;
> - bool preserve_write = prot_numa && pmd_write(*pmd);
> - ret = 1;
> + if (!ptl)
> + return 0;
>
> - /*
> - * Avoid trapping faults against the zero page. The read-only
> - * data is likely to be read-cached on the local CPU and
> - * local/remote hits to the zero page are not interesting.
> - */
> - if (prot_numa && is_huge_zero_pmd(*pmd)) {
> - spin_unlock(ptl);
> - return ret;
> - }
> + preserve_write = prot_numa && pmd_write(*pmd);
> + ret = 1;
>
> - if (!prot_numa || !pmd_protnone(*pmd)) {
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> - entry = pmd_modify(entry, newprot);
> - if (preserve_write)
> - entry = pmd_mk_savedwrite(entry);
> - ret = HPAGE_PMD_NR;
> - set_pmd_at(mm, addr, pmd, entry);
> - BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> - pmd_write(entry));
> - }
> - spin_unlock(ptl);
> - }
> + /*
> + * Avoid trapping faults against the zero page. The read-only
> + * data is likely to be read-cached on the local CPU and
> + * local/remote hits to the zero page are not interesting.
> + */
> + if (prot_numa && is_huge_zero_pmd(*pmd))
> + goto unlock;
>
> + if (prot_numa && pmd_protnone(*pmd))
> + goto unlock;
> +
> + entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> + entry = pmd_modify(entry, newprot);
> + if (preserve_write)
> + entry = pmd_mk_savedwrite(entry);
> + ret = HPAGE_PMD_NR;
> + set_pmd_at(mm, addr, pmd, entry);
> + BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
> +unlock:
> + spin_unlock(ptl);
> return ret;
> }
>
>

2017-04-12 13:33:40

by Vlastimil Babka

Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In the prot_numa case, we are under down_read(mmap_sem). It's critical
> not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
> which is also under down_read(mmap_sem):
>
> CPU0:                          CPU1:
>                                change_huge_pmd(prot_numa=1)
>                                 pmdp_huge_get_and_clear_notify()
> madvise_dontneed()
>  zap_pmd_range()
>   pmd_trans_huge(*pmd) == 0 (without ptl)
>   // skip the pmd
>                                 set_pmd_at();
>                                 // pmd is re-established
>
> The race makes MADV_DONTNEED miss the huge pmd and fail to clear it,
> which may break userspace.
>
> Found by code analysis; I have never seen it triggered.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> if (prot_numa && pmd_protnone(*pmd))
> goto unlock;
>
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> + /*
> + * In the prot_numa case, we are under down_read(mmap_sem). It's
> + * critical not to clear the pmd intermittently, to avoid racing with
> + * MADV_DONTNEED, which is also under down_read(mmap_sem):
> + *
> + * CPU0:                          CPU1:
> + *                                change_huge_pmd(prot_numa=1)
> + *                                 pmdp_huge_get_and_clear_notify()
> + * madvise_dontneed()
> + *  zap_pmd_range()
> + *   pmd_trans_huge(*pmd) == 0 (without ptl)
> + *   // skip the pmd
> + *                                 set_pmd_at();
> + *                                 // pmd is re-established
> + *
> + * The race makes MADV_DONTNEED miss the huge pmd and fail to clear it,
> + * which may break userspace.
> + *
> + * pmdp_invalidate() is required to make sure we don't miss
> + * dirty/young flags set by hardware.
> + */
> + entry = *pmd;
> + pmdp_invalidate(vma, addr, pmd);
> +
> + /*
> + * Recover dirty/young flags. It relies on pmdp_invalidate to not
> + * corrupt them.
> + */

pmdp_invalidate() does:

pmd_t entry = *pmdp;
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));

so it's not atomic, and if the CPU sets the dirty or accessed bit in the
middle of this, it will be lost?
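
For comparison, an atomic variant would have to swap the not-present
value in and hand the old entry back in a single step. An x86-flavoured
sketch (hypothetical, no such generic helper exists today):

	pmd_t old;
	old.pmd = xchg(&pmdp->pmd, pmd_mknotpresent(*pmdp).pmd);

That way any dirty/accessed bit the CPU manages to set before the swap
ends up in "old" instead of being lost.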

But I don't see how the other invalidate caller
__split_huge_pmd_locked() deals with this either. Andrea, any idea?

Vlastimil

> + if (pmd_dirty(*pmd))
> + entry = pmd_mkdirty(entry);
> + if (pmd_young(*pmd))
> + entry = pmd_mkyoung(entry);
> +
> entry = pmd_modify(entry, newprot);
> if (preserve_write)
> entry = pmd_mk_savedwrite(entry);
>

2017-06-09 08:21:06

by Vlastimil Babka

Subject: Re: [PATCH 2/4] thp: fix MADV_DONTNEED vs. numa balancing race

On 05/23/2017 02:42 PM, Vlastimil Babka wrote:
> On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
>> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>>
>>> pmdp_invalidate() does:
>>>
>>> pmd_t entry = *pmdp;
>>> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>>
>>> so it's not atomic, and if the CPU sets the dirty or accessed bit in
>>> the middle of this, it will be lost?
>>
>> I agree it looks like the dirty bit can be lost. Furthermore, this
>> also loses an MMU notifier invalidate, which will lead to corruption
>> at the secondary MMU level (it will keep using the old protection
>> permissions, potentially continuing to write to a wrprotected page).
>
> Oh, I didn't paste the whole function, just the pmd manipulation.
> There's also a third line:
>
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> so there's no missing invalidate, AFAICS? Sorry for the confusion.

Oh, a TLB flush is not an MMU notifier invalidate...

>>>
>>> But I don't see how the other invalidate caller
>>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
>>
>> The original code I wrote did this in __split_huge_page_map to create
>> the "entry" to establish in the pte pagetables:
>>
>> entry = mk_pte(page + i, vma->vm_page_prot);
>> entry = maybe_mkwrite(pte_mkdirty(entry),
>> vma);
>>
>> For anonymous memory the dirty bit is only meaningful for swapping,
>> and THP couldn't be swapped, so setting it unconditionally avoided any
>> issue with the pmdp_invalidate; pmdp_establish sequence.
>
> Yeah, but now we are going to swap THPs, and we have shmem THPs...
>
>> pmdp_invalidate is needed primarily to avoid aliasing: two different
>> TLB translations pointing from the same virtual address to the same
>> physical address triggered machine checks, and we needed to keep the
>> pmd huge at all times. Back then it also marked the pmd as splitting;
>> splitting is a software bit, so userland could still access the data,
>> and the splitting bit only blocked kernel code from manipulating the
>> pmd. That is similar to what the migration entry does right now
>> upstream, except migration entries prevent userland from accessing the
>> page during the split, which is less efficient than the splitting bit,
>> but at least they're only used for the physical split. Back then there
>> was no difference between virtual and physical split, and the physical
>> split is less frequent than the virtual one right now.
>
> This took me a while to grasp, but I think I understand now :)
>
>> It looks like this needs a pmdp_populate that atomically grabs the
>> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
>> does
>
> pmdp_huge_get_and_clear_notify() is now gone...
>
>> and a _notify variant to use when "freeze" is false (if freeze is true
>> the MMU notifier invalidate must have happened when the pmd was set to
>> a migration entry). If pmdp_populate_notify (freeze==true)
>> /pmd_populate (freeze==false) returned the old pmd value
>> atomically with xchg() (just instead of setting it to 0 we would set
>> it to the mknotpresent one), then we could set the dirty bit on the
>> ptes (__split_huge_pmd_locked) or on the pmd itself in
>> change_huge_pmd accordingly.

I was trying to look into this yesterday, but I know next to nothing
about MMU notifiers (see above :), so I would probably get it wrong.
But it should get fixed, so... Kirill?

> I think the confusion was partially caused by the comment at the
> original caller of pmdp_invalidate():
>
> we first mark the
> * current pmd notpresent (atomically because here the pmd_trans_huge
> * and pmd_trans_splitting must remain set at all times on the pmd
> * until the split is complete for this pmd),
>
> It says "atomically" but in fact that only means that the pmd_trans_huge
> and pmd_trans_splitting flags are not temporarily cleared at any point
> in time, right? It's not a true atomic swap.
>
>> Obtaining the "dirty" flag information from the pmd read before
>> calling pmdp_invalidate is not ok (losing the _notify is also not ok).
>
> Right.
>
>> Thanks!
>> Andrea