For MADV_DONTNEED to work properly with huge pages, it's critical not to
clear the pmd intermittently unless you hold down_write(mmap_sem).
Otherwise MADV_DONTNEED can miss the THP, which can lead to userspace
breakage.

See an example of such a race in the commit message of patch 2/4.

All these races were found by code inspection. I haven't seen them
triggered.

I don't think it's worth applying them to stable@.
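To make the failure mode concrete, here is a minimal userspace sketch
(illustration only, not part of the series) of the contract a missed pmd
would violate: after MADV_DONTNEED, anonymous memory must read back
zero-filled.

	#define _GNU_SOURCE
	#include <assert.h>
	#include <string.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL << 20;	/* assume one 2M THP-sized range */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		assert(p != MAP_FAILED);
		memset(p, 0xaa, len);		/* populate, hopefully as a THP */
		madvise(p, len, MADV_DONTNEED);	/* must zap the whole range */
		assert(p[0] == 0);	/* stale 0xaa here means the pmd was skipped */
		return 0;
	}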
Kirill A. Shutemov (4):
thp: reduce indentation level in change_huge_pmd()
thp: fix MADV_DONTNEED vs. numa balancing race
thp: fix MADV_DONTNEED vs. MADV_FREE race
thp: fix MADV_DONTNEED vs clear soft dirty race
fs/proc/task_mmu.c | 9 +++++-
mm/huge_memory.c | 86 ++++++++++++++++++++++++++++++++++++------------------
2 files changed, 66 insertions(+), 29 deletions(-)
--
2.11.0
Restructure code in preparation for a fix.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
1 file changed, 26 insertions(+), 26 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 71e3dede95b4..e7ce73b2b208 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
- int ret = 0;
+ pmd_t entry;
+ bool preserve_write;
+ int ret;
ptl = __pmd_trans_huge_lock(pmd, vma);
- if (ptl) {
- pmd_t entry;
- bool preserve_write = prot_numa && pmd_write(*pmd);
- ret = 1;
+ if (!ptl)
+ return 0;
- /*
- * Avoid trapping faults against the zero page. The read-only
- * data is likely to be read-cached on the local CPU and
- * local/remote hits to the zero page are not interesting.
- */
- if (prot_numa && is_huge_zero_pmd(*pmd)) {
- spin_unlock(ptl);
- return ret;
- }
+ preserve_write = prot_numa && pmd_write(*pmd);
+ ret = 1;
- if (!prot_numa || !pmd_protnone(*pmd)) {
- entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
- entry = pmd_modify(entry, newprot);
- if (preserve_write)
- entry = pmd_mk_savedwrite(entry);
- ret = HPAGE_PMD_NR;
- set_pmd_at(mm, addr, pmd, entry);
- BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
- pmd_write(entry));
- }
- spin_unlock(ptl);
- }
+ /*
+ * Avoid trapping faults against the zero page. The read-only
+ * data is likely to be read-cached on the local CPU and
+ * local/remote hits to the zero page are not interesting.
+ */
+ if (prot_numa && is_huge_zero_pmd(*pmd))
+ goto unlock;
+ if (prot_numa && pmd_protnone(*pmd))
+ goto unlock;
+
+ entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+ entry = pmd_modify(entry, newprot);
+ if (preserve_write)
+ entry = pmd_mk_savedwrite(entry);
+ ret = HPAGE_PMD_NR;
+ set_pmd_at(mm, addr, pmd, entry);
+ BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
+unlock:
+ spin_unlock(ptl);
return ret;
}
--
2.11.0
Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 2 --
1 file changed, 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..324217c31ec9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);
if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);
--
2.11.0
Yet another instance of the same race.
Fix is identical to change_huge_pmd().
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index ee3efb229ef6..0ce5294abc2c 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -899,7 +899,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmd_t pmd = pmdp_huge_get_and_clear(vma->vm_mm, addr, pmdp);
+ pmd_t pmd = *pmdp;
+
+ /* See comment in change_huge_pmd() */
+ pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(*pmdp))
+ pmd = pmd_mkdirty(pmd);
+ if (pmd_young(*pmdp))
+ pmd = pmd_mkyoung(pmd);
pmd = pmd_wrprotect(pmd);
pmd = pmd_clear_soft_dirty(pmd);
--
2.11.0
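For context, a sketch of the userspace knobs whose kernel paths race here
(illustration only; the two calls must run concurrently in separate
threads to race, which this sequential sketch does not attempt, and error
handling is omitted). Writing "4" to /proc/<pid>/clear_refs is the
documented soft-dirty reset (Documentation/vm/soft-dirty.txt) and ends up
in clear_soft_dirty_pmd() for each huge pmd:

	#define _GNU_SOURCE
	#include <fcntl.h>
	#include <sys/mman.h>
	#include <unistd.h>

	/* path A: clear soft-dirty bits -> clear_soft_dirty_pmd() */
	static void clear_soft_dirty_bits(void)
	{
		int fd = open("/proc/self/clear_refs", O_WRONLY);

		write(fd, "4", 1);
		close(fd);
	}

	int main(void)
	{
		size_t len = 2UL << 20;		/* assume a 2M THP-backed range */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

		p[0] = 1;			/* populate */
		clear_soft_dirty_bits();	/* path A */
		madvise(p, len, MADV_DONTNEED);	/* path B: zap_pmd_range() */
		return 0;
	}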
In the prot_numa case, we are under down_read(mmap_sem). It's critical
not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
which also runs under down_read(mmap_sem):

	CPU0:                           CPU1:
	                                change_huge_pmd(prot_numa=1)
	                                 pmdp_huge_get_and_clear_notify()
	madvise_dontneed()
	 zap_pmd_range()
	  pmd_trans_huge(*pmd) == 0 (without ptl)
	  // skip the pmd
	                                 set_pmd_at();
	                                 // pmd is re-established

The race makes MADV_DONTNEED miss the huge pmd, leaving it uncleared,
which may break userspace.

Found by code analysis; I have never seen it triggered.
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
1 file changed, 33 insertions(+), 1 deletion(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e7ce73b2b208..bb2b3646bd78 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
if (prot_numa && pmd_protnone(*pmd))
goto unlock;
- entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
+ /*
+ * In case prot_numa, we are under down_read(mmap_sem). It's critical
+ * to not clear pmd intermittently to avoid race with MADV_DONTNEED
+ * which is also under down_read(mmap_sem):
+ *
+ * CPU0: CPU1:
+ * change_huge_pmd(prot_numa=1)
+ * pmdp_huge_get_and_clear_notify()
+ * madvise_dontneed()
+ * zap_pmd_range()
+ * pmd_trans_huge(*pmd) == 0 (without ptl)
+ * // skip the pmd
+ * set_pmd_at();
+ * // pmd is re-established
+ *
+ * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
+ * which may break userspace.
+ *
+ * pmdp_invalidate() is required to make sure we don't miss
+ * dirty/young flags set by hardware.
+ */
+ entry = *pmd;
+ pmdp_invalidate(vma, addr, pmd);
+
+ /*
+ * Recover dirty/young flags. It relies on pmdp_invalidate to not
+ * corrupt them.
+ */
+ if (pmd_dirty(*pmd))
+ entry = pmd_mkdirty(entry);
+ if (pmd_young(*pmd))
+ entry = pmd_mkyoung(entry);
+
entry = pmd_modify(entry, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
--
2.11.0
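For what it's worth, the reason the pmdp_invalidate() approach does not
reopen the same window is that the entry never becomes pmd_none(): on x86
at least, pmd_mknotpresent() leaves _PAGE_PSE in place, so the unlocked
pmd_trans_huge() check in zap_pmd_range() still sees a huge pmd and falls
through to the locked zap path. The relevant x86 helpers, abbreviated as
I read the v4.10 tree (a reader's sketch; double-check against the
sources):

	/* arch/x86/include/asm/pgtable.h (v4.10, abbreviated) */
	static inline int pmd_trans_huge(pmd_t pmd)
	{
		/* the present bit is not checked; PSE alone marks it huge */
		return (pmd_val(pmd) & (_PAGE_PSE|_PAGE_DEVMAP)) == _PAGE_PSE;
	}

	static inline pmd_t pmd_mknotpresent(pmd_t pmd)
	{
		/* clears the present bit, but _PAGE_PSE survives */
		return pmd_clear_flags(pmd, _PAGE_PRESENT | _PAGE_PROTNONE);
	}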
On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
>
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>
$ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
/*
* set a new huge pmd. We should not be called for updating
* an existing pmd entry. That should go via pmd_hugepage_update.
*/
void set_pmd_at(struct mm_struct *mm, unsigned long addr,
On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
>
> On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> >
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > ---
> > mm/huge_memory.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > deactivate_page(page);
> >
> > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > - tlb->fullmm);
> > orig_pmd = pmd_mkold(orig_pmd);
> > orig_pmd = pmd_mkclean(orig_pmd);
> >
> $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
>
> /*
> * set a new huge pmd. We should not be called for updating
> * an existing pmd entry. That should go via pmd_hugepage_update.
> */
> void set_pmd_at(struct mm_struct *mm, unsigned long addr,
+Aneesh.
Urgh... Power is special again.
I think this should work fine.
From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
Basically the same race as with numa balancing in change_huge_pmd(), but
a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index bb2b3646bd78..23c1b3d58cf4 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);
if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
+ pmdp_invalidate(vma, addr, pmd);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);
--
Kirill A. Shutemov
On 03/02/2017 07:10 AM, Kirill A. Shutemov wrote:
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> if (prot_numa && pmd_protnone(*pmd))
> goto unlock;
>
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
Are there any remaining call sites for pmdp_huge_get_and_clear_notify()
after this?
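For anyone checking, something like

	$ git grep -n pmdp_huge_get_and_clear_notify -- include/ mm/

run against the patched tree should answer that (output omitted here).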
On Thu, 2 Mar 2017 18:10:34 +0300 "Kirill A. Shutemov" <[email protected]> wrote:
> Yet another instance of the same race.
>
> Fix is identical to change_huge_pmd().
Nit: someone who is reading this changelog a year from now will be
quite confused - how do they work out what the race was?
I'll add
: See "thp: fix MADV_DONTNEED vs. numa balancing race" for more details.
to the changelogs to help them a bit.
Also, it wasn't a great idea to start this series with a "Restructure
code in preparation for a fix". If people later start hitting this
race, the fixes will be difficult to backport. I'm OK with taking that
risk, but please do bear this in mind in the future.
Hello, Kirill,
On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> >
> > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> > >
> > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > here due to MADV_FREE functionality.
> > >
> > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > Cc: Minchan Kim <[email protected]>
> > > ---
> > > mm/huge_memory.c | 2 --
> > > 1 file changed, 2 deletions(-)
> > >
> > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > index bb2b3646bd78..324217c31ec9 100644
> > > --- a/mm/huge_memory.c
> > > +++ b/mm/huge_memory.c
> > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > deactivate_page(page);
> > >
> > > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > - tlb->fullmm);
> > > orig_pmd = pmd_mkold(orig_pmd);
> > > orig_pmd = pmd_mkclean(orig_pmd);
> > >
> > $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> >
> > /*
> > * set a new huge pmd. We should not be called for updating
> > * an existing pmd entry. That should go via pmd_hugepage_update.
> > */
> > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>
> +Aneesh.
>
> Urgh... Power is special again.
>
> I think this should work fine.
>
> From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Thu, 2 Mar 2017 16:47:45 +0300
> Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
>
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
Could you elaborate a bit more here rather than relying on the other
patch's description?

And could you say what happens to userspace if that race happens?
Guessing from the title "MADV_DONTNEED vs MADV_FREE", a page cannot be
zapped but is instead marked lazyfree, or vice versa? Right?
Thanks.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..23c1b3d58cf4 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> + pmdp_invalidate(vma, addr, pmd);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>
> --
> Kirill A. Shutemov
>
"Kirill A. Shutemov" <[email protected]> writes:
> Basically the same race as with numa balancing in change_huge_pmd(), but
> a bit simpler to mitigate: we don't need to preserve dirty/young flags
> here due to MADV_FREE functionality.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> Cc: Minchan Kim <[email protected]>
> ---
> mm/huge_memory.c | 2 --
> 1 file changed, 2 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index bb2b3646bd78..324217c31ec9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> deactivate_page(page);
>
> if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> - tlb->fullmm);
> orig_pmd = pmd_mkold(orig_pmd);
> orig_pmd = pmd_mkclean(orig_pmd);
>
Instead, can we do a new interface that does something like

	pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);

We do have a variant already in ptep_set_access_flags. What we need is
something that can be used to update a THP pmd without converting it to
pmd_none, and one which doesn't lose the reference and change bits?
-aneesh
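To make the suggestion concrete, here is a rough sketch of such a helper.
The name pmdp_huge_update() and the xchg-based body are hypothetical,
modelled on the x86 native_pmdp_get_and_clear() idiom; it only works
where pmd_t fits in one word, and powerpc would presumably supply its own
version built on pmd_hugepage_update():

	/*
	 * Hypothetical sketch, not an existing kernel interface: atomically
	 * install a new huge pmd and return the old one. The entry never
	 * becomes pmd_none(), and hardware-set reference/change (young/dirty)
	 * bits in the old entry are handed to the caller instead of being
	 * lost.
	 */
	static inline pmd_t pmdp_huge_update(struct mm_struct *mm,
					     unsigned long addr,
					     pmd_t *pmdp, pmd_t newpmd)
	{
		return __pmd(xchg(&pmdp->pmd, pmd_val(newpmd)));
	}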
On Mon, Mar 06, 2017 at 10:44:46AM +0900, Minchan Kim wrote:
> Hello, Kirill,
>
> On Fri, Mar 03, 2017 at 01:26:36PM +0300, Kirill A. Shutemov wrote:
> > On Fri, Mar 03, 2017 at 01:35:11PM +0800, Hillf Danton wrote:
> > >
> > > On March 02, 2017 11:11 PM Kirill A. Shutemov wrote:
> > > >
> > > > Basically the same race as with numa balancing in change_huge_pmd(), but
> > > > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > > > here due to MADV_FREE functionality.
> > > >
> > > > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > > > Cc: Minchan Kim <[email protected]>
> > > > ---
> > > > mm/huge_memory.c | 2 --
> > > > 1 file changed, 2 deletions(-)
> > > >
> > > > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > > > index bb2b3646bd78..324217c31ec9 100644
> > > > --- a/mm/huge_memory.c
> > > > +++ b/mm/huge_memory.c
> > > > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > > > deactivate_page(page);
> > > >
> > > > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > > > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > > > - tlb->fullmm);
> > > > orig_pmd = pmd_mkold(orig_pmd);
> > > > orig_pmd = pmd_mkclean(orig_pmd);
> > > >
> > > $ grep -n set_pmd_at linux-4.10/arch/powerpc/mm/pgtable-book3s64.c
> > >
> > > /*
> > > * set a new huge pmd. We should not be called for updating
> > > * an existing pmd entry. That should go via pmd_hugepage_update.
> > > */
> > > void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> >
> > +Aneesh.
> >
> > Urgh... Power is special again.
> >
> > I think this should work fine.
> >
> > From 056914fa025992c0a2212aee057c26307ce60238 Mon Sep 17 00:00:00 2001
> > From: "Kirill A. Shutemov" <[email protected]>
> > Date: Thu, 2 Mar 2017 16:47:45 +0300
> > Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
> >
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
>
> Could you elaborate a bit more here rather than relying on other
> patch's description?
Okay, updated patch is below.
> And could you say what happens to userspace if that race happens?
> Guessing from the title "MADV_DONTNEED vs MADV_FREE", a page cannot be
> zapped but is instead marked lazyfree, or vice versa? Right?

"Vice versa" part should be fine. The case I'm worried about is that
MADV_DONTNEED would skip the pmd and it would not be cleared.

Userspace expects the area of memory to be clean after MADV_DONTNEED, but
it's not. It can lead to userspace misbehaviour.
From a0967b0293a6f8053d85785c4d6340e550e849ea Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Thu, 2 Mar 2017 16:47:45 +0300
Subject: [PATCH] thp: fix MADV_DONTNEED vs. MADV_FREE race
Both MADV_DONTNEED and MADV_FREE are handled with down_read(mmap_sem).

It's critical not to clear the pmd intermittently while handling
MADV_FREE, to avoid racing with MADV_DONTNEED:

	CPU0:                           CPU1:
	                                madvise_free_huge_pmd()
	                                 pmdp_huge_get_and_clear_full()
	madvise_dontneed()
	 zap_pmd_range()
	  pmd_trans_huge(*pmd) == 0 (without ptl)
	  // skip the pmd
	                                 set_pmd_at();
	                                 // pmd is re-established

It results in MADV_DONTNEED skipping the pmd, leaving it not cleared. It
violates the MADV_DONTNEED interface and can result in userspace
misbehaviour.

Basically it's the same race as with numa balancing in change_huge_pmd(),
but a bit simpler to mitigate: we don't need to preserve dirty/young flags
here due to MADV_FREE functionality.
Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Minchan Kim <[email protected]>
---
mm/huge_memory.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51a8c376d020..3c9ef1104d85 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1568,8 +1568,7 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
deactivate_page(page);
if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
- orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
- tlb->fullmm);
+ pmdp_invalidate(vma, addr, pmd);
orig_pmd = pmd_mkold(orig_pmd);
orig_pmd = pmd_mkclean(orig_pmd);
--
Kirill A. Shutemov
On Mon, Mar 06, 2017 at 08:19:03AM +0530, Aneesh Kumar K.V wrote:
> "Kirill A. Shutemov" <[email protected]> writes:
>
> > Basically the same race as with numa balancing in change_huge_pmd(), but
> > a bit simpler to mitigate: we don't need to preserve dirty/young flags
> > here due to MADV_FREE functionality.
> >
> > Signed-off-by: Kirill A. Shutemov <[email protected]>
> > Cc: Minchan Kim <[email protected]>
> > ---
> > mm/huge_memory.c | 2 --
> > 1 file changed, 2 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index bb2b3646bd78..324217c31ec9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -1566,8 +1566,6 @@ bool madvise_free_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
> > deactivate_page(page);
> >
> > if (pmd_young(orig_pmd) || pmd_dirty(orig_pmd)) {
> > - orig_pmd = pmdp_huge_get_and_clear_full(tlb->mm, addr, pmd,
> > - tlb->fullmm);
> > orig_pmd = pmd_mkold(orig_pmd);
> > orig_pmd = pmd_mkclean(orig_pmd);
> >
>
> Instead can we do a new interface that does something like
>
> pmdp_huge_update(tlb->mm, addr, pmd, new_pmd);
>
> We do have a variant already in ptep_set_access_flags. What we need is
> something that can be used to update a THP pmd without converting it to
> pmd_none, and one which doesn't lose the reference and change bits?
Sounds like a good idea. Would you volunteer to implement it?
I don't have time for this right now.
--
Kirill A. Shutemov
On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> Restructure code in preparation for a fix.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
> ---
> mm/huge_memory.c | 52 ++++++++++++++++++++++++++--------------------------
> 1 file changed, 26 insertions(+), 26 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 71e3dede95b4..e7ce73b2b208 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1722,37 +1722,37 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> {
> struct mm_struct *mm = vma->vm_mm;
> spinlock_t *ptl;
> - int ret = 0;
> + pmd_t entry;
> + bool preserve_write;
> + int ret;
>
> ptl = __pmd_trans_huge_lock(pmd, vma);
> - if (ptl) {
> - pmd_t entry;
> - bool preserve_write = prot_numa && pmd_write(*pmd);
> - ret = 1;
> + if (!ptl)
> + return 0;
>
> - /*
> - * Avoid trapping faults against the zero page. The read-only
> - * data is likely to be read-cached on the local CPU and
> - * local/remote hits to the zero page are not interesting.
> - */
> - if (prot_numa && is_huge_zero_pmd(*pmd)) {
> - spin_unlock(ptl);
> - return ret;
> - }
> + preserve_write = prot_numa && pmd_write(*pmd);
> + ret = 1;
>
> - if (!prot_numa || !pmd_protnone(*pmd)) {
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> - entry = pmd_modify(entry, newprot);
> - if (preserve_write)
> - entry = pmd_mk_savedwrite(entry);
> - ret = HPAGE_PMD_NR;
> - set_pmd_at(mm, addr, pmd, entry);
> - BUG_ON(vma_is_anonymous(vma) && !preserve_write &&
> - pmd_write(entry));
> - }
> - spin_unlock(ptl);
> - }
> + /*
> + * Avoid trapping faults against the zero page. The read-only
> + * data is likely to be read-cached on the local CPU and
> + * local/remote hits to the zero page are not interesting.
> + */
> + if (prot_numa && is_huge_zero_pmd(*pmd))
> + goto unlock;
>
> + if (prot_numa && pmd_protnone(*pmd))
> + goto unlock;
> +
> + entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> + entry = pmd_modify(entry, newprot);
> + if (preserve_write)
> + entry = pmd_mk_savedwrite(entry);
> + ret = HPAGE_PMD_NR;
> + set_pmd_at(mm, addr, pmd, entry);
> + BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
> +unlock:
> + spin_unlock(ptl);
> return ret;
> }
>
>
On 03/02/2017 04:10 PM, Kirill A. Shutemov wrote:
> In the prot_numa case, we are under down_read(mmap_sem). It's critical
> not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
> which also runs under down_read(mmap_sem):
>
> 	CPU0:                           CPU1:
> 	                                change_huge_pmd(prot_numa=1)
> 	                                 pmdp_huge_get_and_clear_notify()
> 	madvise_dontneed()
> 	 zap_pmd_range()
> 	  pmd_trans_huge(*pmd) == 0 (without ptl)
> 	  // skip the pmd
> 	                                 set_pmd_at();
> 	                                 // pmd is re-established
>
> The race makes MADV_DONTNEED miss the huge pmd, leaving it uncleared,
> which may break userspace.
>
> Found by code analysis; I have never seen it triggered.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> mm/huge_memory.c | 34 +++++++++++++++++++++++++++++++++-
> 1 file changed, 33 insertions(+), 1 deletion(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e7ce73b2b208..bb2b3646bd78 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1744,7 +1744,39 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> if (prot_numa && pmd_protnone(*pmd))
> goto unlock;
>
> - entry = pmdp_huge_get_and_clear_notify(mm, addr, pmd);
> + /*
> + * In case prot_numa, we are under down_read(mmap_sem). It's critical
> + * to not clear pmd intermittently to avoid race with MADV_DONTNEED
> + * which is also under down_read(mmap_sem):
> + *
> + * CPU0: CPU1:
> + * change_huge_pmd(prot_numa=1)
> + * pmdp_huge_get_and_clear_notify()
> + * madvise_dontneed()
> + * zap_pmd_range()
> + * pmd_trans_huge(*pmd) == 0 (without ptl)
> + * // skip the pmd
> + * set_pmd_at();
> + * // pmd is re-established
> + *
> + * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> + * which may break userspace.
> + *
> + * pmdp_invalidate() is required to make sure we don't miss
> + * dirty/young flags set by hardware.
> + */
> + entry = *pmd;
> + pmdp_invalidate(vma, addr, pmd);
> +
> + /*
> + * Recover dirty/young flags. It relies on pmdp_invalidate to not
> + * corrupt them.
> + */
pmdp_invalidate() does:
pmd_t entry = *pmdp;
set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
so it's not atomic, and if the CPU sets the dirty or accessed bits in the
middle of this, they will be lost?
But I don't see how the other invalidate caller
__split_huge_pmd_locked() deals with this either. Andrea, any idea?
Vlastimil
> + if (pmd_dirty(*pmd))
> + entry = pmd_mkdirty(entry);
> + if (pmd_young(*pmd))
> + entry = pmd_mkyoung(entry);
> +
> entry = pmd_modify(entry, newprot);
> if (preserve_write)
> entry = pmd_mk_savedwrite(entry);
>
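One way to close the window described above would be an invalidate that
swaps the entry atomically and hands the old value back. A rough sketch
follows; the name is hypothetical, the xchg only works where pmd_t is a
single word, and any MMU notifier calls are left out:

	/*
	 * Hypothetical sketch: atomic pmdp_invalidate() variant. Because the
	 * swap is a single xchg, dirty/young bits the CPU sets before the
	 * swap show up in the returned value instead of being silently lost.
	 */
	static pmd_t pmdp_invalidate_get(struct vm_area_struct *vma,
					 unsigned long address, pmd_t *pmdp)
	{
		pmd_t old = __pmd(xchg(&pmdp->pmd,
				       pmd_val(pmd_mknotpresent(*pmdp))));

		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}

If I remember right, mainline later reworked pmdp_invalidate() along
these lines so that it returns the old pmd; treat that as a recollection,
not a citation.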
On 05/23/2017 02:42 PM, Vlastimil Babka wrote:
> On 05/16/2017 10:29 PM, Andrea Arcangeli wrote:
>> On Wed, Apr 12, 2017 at 03:33:35PM +0200, Vlastimil Babka wrote:
>>>
>>> pmdp_invalidate() does:
>>>
>>> pmd_t entry = *pmdp;
>>> set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
>>>
>>> so it's not atomic, and if the CPU sets the dirty or accessed bits in
>>> the middle of this, they will be lost?
>>
>> I agree it looks like the dirty bit can be lost. Furthermore, this also
>> loses an MMU notifier invalidate, which will lead to corruption at the
>> secondary MMU level (it will keep using the old protection
>> permissions, potentially continuing to write to a wrprotected page).
>
> Oh, I didn't paste the whole function, just the pmd manipulation.
> There's also a third line:
>
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> so there's no missing invalidate, AFAICS? Sorry for the confusion.
Oh, a TLB flush is not an MMU notifier invalidate...
>>>
>>> But I don't see how the other invalidate caller
>>> __split_huge_pmd_locked() deals with this either. Andrea, any idea?
>>
>> The original code I wrote did this in __split_huge_page_map to create
>> the "entry" to establish in the pte pagetables:
>>
>> entry = mk_pte(page + i, vma->vm_page_prot);
>> entry = maybe_mkwrite(pte_mkdirty(entry),
>> vma);
>>
>> For anonymous memory the dirty bit is only meaningful for swapping,
>> and THP couldn't be swapped, so setting it unconditionally avoided any
>> issue with the pmdp_invalidate; pmdp_establish sequence.
>
> Yeah, but now we are going to swap THP's, and we have shmem THP's...
>
>> pmdp_invalidate is needed primarily to avoid aliasing of two different
>> TLB translations pointing from the same virtual address to the same
>> physical address, which triggered machine checks (while needing to keep
>> the pmd huge at all times; back then it was also splitting huge.
>> Splitting was a software bit, so userland could still access the data;
>> the splitting bit only blocked kernel code from manipulating the pmd,
>> similar to what the migration entry does right now upstream, except
>> migration entries prevent userland from accessing the page during
>> split, which is less efficient than the splitting bit, but at least
>> they're only used for the physical split; back then there was no
>> difference between virtual and physical split, and the physical split
>> is less frequent than the virtual one right now).
>
> This took me a while to grasp, but I think I understand now :)
>
>> It looks like this needs a pmdp_populate that atomically grabs the
>> value of the pmd and returns it like pmdp_huge_get_and_clear_notify
>> does
>
> pmdp_huge_get_and_clear_notify() is now gone...
>
>> and a _notify variant to use when "freeze" is false (if freeze is true
>> the MMU notifier invalidate must have happened when the pmd was set to
>> a migration entry). If pmdp_populate_notify (freeze==true)
>> /pmd_populate (freeze==false) would return the old pmd value
>> atomically with xchg() (just instead of setting it to 0 we should set
>> it to the mknotpresent one), then we can set the dirty bit on the ptes
>> (__split_huge_pmd_locked) or in the pmd itself in the change_huge_pmd
>> accordingly.
I was trying to look into this yesterday, but, for one, I know next to
nothing about MMU notifiers (see above :) so I would probably get it
wrong. But it should get fixed, so... Kirill?
> I think the confusion was partially caused by the comment at the
> original caller of pmdp_invalidate():
>
> we first mark the
> * current pmd notpresent (atomically because here the pmd_trans_huge
> * and pmd_trans_splitting must remain set at all times on the pmd
> * until the split is complete for this pmd),
>
> It says "atomically" but in fact that only means that the pmd_trans_huge
> and pmd_trans_splitting flags are not temporarily cleared at any point
> in time, right? It's not a true atomic swap.
>
>> Obtaining the "dirty" flag information from the pmd read before
>> calling pmdp_invalidate is not ok (losing the _notify is also not ok).
>
> Right.
>
>> Thanks!
>> Andrea