2017-09-12 15:40:29

by Kirill A. Shutemov

Subject: [PATCHv3 00/11] Do not lose dirty bit on THP pages

Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and accessed bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().
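
Schematically, the lost-update race in the old generic implementation
looks like this (an annotated sketch, not the exact code):

	pmd_t entry = *pmdp;		/* CPU reads the entry */
					/* hardware sets the dirty bit
					 * in *pmdp right here */
	set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
					/* the stale copy overwrites the
					 * entry: the dirty bit is lost */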

The bug can lead to data loss, but the race window is tiny and I haven't
seen any reports suggesting that it happens in reality, so I don't think
it's worth sending to stable.

Unfortunately, there's no way to address the issue generically. We need to
fix all architectures that support THP, one by one.

Every architecture that supports THP has to provide an atomic
pmdp_invalidate() that returns the previous value of the entry.

If the generic implementation of pmdp_invalidate() is used, the
architecture needs to provide an atomic pmdp_establish() instead.

pmdp_establish() is not used outside the generic implementation of
pmdp_invalidate() so far, but I think this can change in the future.
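
To illustrate the contract, here is a rough sketch of the two hooks,
modeled on the generic code later in this series (not a drop-in
implementation for any particular architecture):

	/* Atomically exchange the pmd entry, returning the old value. */
	static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmdp, pmd_t pmd)
	{
		/* Must be a single atomic operation if the CPU can set
		 * dirty/accessed bits in the entry behind our back. */
		return __pmd(xchg(&pmd_val(*pmdp), pmd_val(pmd)));
	}

	/* The generic pmdp_invalidate() can then be built on top of it. */
	pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
			pmd_t *pmdp)
	{
		pmd_t old = pmdp_establish(vma, address, pmdp,
					   pmd_mknotpresent(*pmdp));
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
		return old;
	}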

Aneesh Kumar K.V (2):
powerpc/mm: update pmdp_invalidate to return old pmd value
sparc64: update pmdp_invalidate to return old pmd value

Catalin Marinas (1):
arm64: Provide pmdp_establish() helper

Kirill A. Shutemov (7):
asm-generic: Provide generic_pmdp_establish()
arc: Use generic_pmdp_establish as pmdp_establish
arm/mm: Provide pmdp_establish() helper
mips: Use generic_pmdp_establish as pmdp_establish
x86/mm: Provide pmdp_establish() helper
mm: Do not lose dirty and access bits in pmdp_invalidate()
mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits

Martin Schwidefsky (1):
s390/mm: Modify pmdp_invalidate to return old value.

arch/arc/include/asm/hugepage.h | 3 +++
arch/arm/include/asm/pgtable-3level.h | 3 +++
arch/arm64/include/asm/pgtable.h | 7 ++++++
arch/mips/include/asm/pgtable.h | 3 +++
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 +--
arch/powerpc/mm/pgtable-book3s64.c | 7 ++++--
arch/s390/include/asm/pgtable.h | 5 ++--
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/mm/tlb.c | 23 +++++++++++++----
arch/x86/include/asm/pgtable-3level.h | 37 +++++++++++++++++++++++++++-
arch/x86/include/asm/pgtable.h | 15 +++++++++++
fs/proc/task_mmu.c | 8 +++---
include/asm-generic/pgtable.h | 17 ++++++++++++-
mm/huge_memory.c | 29 +++++++++-------------
mm/pgtable-generic.c | 6 ++---
15 files changed, 131 insertions(+), 38 deletions(-)

--
2.14.1


2017-09-12 15:40:01

by Kirill A. Shutemov

Subject: [PATCHv3 03/11] arm/mm: Provide pmdp_establish() helper

ARM LPAE doesn't have hardware dirty/accessed bits.

generic_pmdp_establish() is the right implementation of pmdp_establish()
for this case.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Catalin Marinas <[email protected]>
---
arch/arm/include/asm/pgtable-3level.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/arm/include/asm/pgtable-3level.h b/arch/arm/include/asm/pgtable-3level.h
index 2a029bceaf2f..57d57cb8cb9a 100644
--- a/arch/arm/include/asm/pgtable-3level.h
+++ b/arch/arm/include/asm/pgtable-3level.h
@@ -250,6 +250,9 @@ PMD_BIT_FUNC(mkyoung, |= PMD_SECT_AF);
#define pfn_pmd(pfn,prot) (__pmd(((phys_addr_t)(pfn) << PAGE_SHIFT) | pgprot_val(prot)))
#define mk_pmd(page,prot) pfn_pmd(page_to_pfn(page),prot)

+/* No hardware dirty/accessed bits -- generic_pmdp_establish() fits */
+#define pmdp_establish generic_pmdp_establish
+
/* represent a notpresent pmd by faulting entry, this is used by pmdp_invalidate */
static inline pmd_t pmd_mknotpresent(pmd_t pmd)
{
--
2.14.1

2017-09-12 15:40:03

by Kirill A. Shutemov

Subject: [PATCHv3 04/11] arm64: Provide pmdp_establish() helper

From: Catalin Marinas <[email protected]>

We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.

Signed-off-by: Catalin Marinas <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/arm64/include/asm/pgtable.h | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h
index bc4e92337d16..09bb86533d32 100644
--- a/arch/arm64/include/asm/pgtable.h
+++ b/arch/arm64/include/asm/pgtable.h
@@ -663,6 +663,13 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
{
ptep_set_wrprotect(mm, address, (pte_t *)pmdp);
}
+
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+ return __pmd(xchg_relaxed(&pmd_val(*pmdp), pmd_val(pmd)));
+}
#endif

extern pgd_t swapper_pg_dir[PTRS_PER_PGD];
--
2.14.1

2017-09-12 15:40:19

by Kirill A. Shutemov

Subject: [PATCHv3 09/11] x86/mm: Provide pmdp_establish() helper

We need an atomic way to set up a pmd page table entry, avoiding races
with the CPU setting dirty/accessed bits. This is required to implement
pmdp_invalidate() that doesn't lose these bits.

On PAE we can avoid the expensive cmpxchg8b when the new page table
entry is not present. If it's present, fall back to a cmpxchg loop.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
---
arch/x86/include/asm/pgtable-3level.h | 37 ++++++++++++++++++++++++++++++++++-
arch/x86/include/asm/pgtable.h | 15 ++++++++++++++
2 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable-3level.h b/arch/x86/include/asm/pgtable-3level.h
index c8821bab938f..cd73be22be1d 100644
--- a/arch/x86/include/asm/pgtable-3level.h
+++ b/arch/x86/include/asm/pgtable-3level.h
@@ -157,7 +157,6 @@ static inline pte_t native_ptep_get_and_clear(pte_t *ptep)
#define native_ptep_get_and_clear(xp) native_local_ptep_get_and_clear(xp)
#endif

-#ifdef CONFIG_SMP
union split_pmd {
struct {
u32 pmd_low;
@@ -165,6 +164,8 @@ union split_pmd {
};
pmd_t pmd;
};
+
+#ifdef CONFIG_SMP
static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
{
union split_pmd res, *orig = (union split_pmd *)pmdp;
@@ -180,6 +181,40 @@ static inline pmd_t native_pmdp_get_and_clear(pmd_t *pmdp)
#define native_pmdp_get_and_clear(xp) native_local_pmdp_get_and_clear(xp)
#endif

+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+ pmd_t old;
+
+ /*
+ * If pmd has present bit cleared we can get away without expensive
+ * cmpxchg64: we can update pmdp half-by-half without racing with
+ * anybody.
+ */
+ if (!(pmd_val(pmd) & _PAGE_PRESENT)) {
+ union split_pmd old, new, *ptr;
+
+ ptr = (union split_pmd *)pmdp;
+
+ new.pmd = pmd;
+
+ /* xchg acts as a barrier before setting of the high bits */
+ old.pmd_low = xchg(&ptr->pmd_low, new.pmd_low);
+ old.pmd_high = ptr->pmd_high;
+ ptr->pmd_high = new.pmd_high;
+ return old.pmd;
+ }
+
+ do {
+ old = *pmdp;
+ } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+ return old;
+}
+#endif
+
#ifdef CONFIG_SMP
union split_pud {
struct {
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 5b4c44d419c5..ff19dbd6c93d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1111,6 +1111,21 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)pmdp);
}

+#ifndef pmdp_establish
+#define pmdp_establish pmdp_establish
+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+ if (IS_ENABLED(CONFIG_SMP)) {
+ return xchg(pmdp, pmd);
+ } else {
+ pmd_t old = *pmdp;
+ *pmdp = pmd;
+ return old;
+ }
+}
+#endif
+
/*
* clone_pgd_range(pgd_t *dst, pgd_t *src, int count);
*
--
2.14.1

2017-09-12 15:40:34

by Kirill A. Shutemov

Subject: [PATCHv3 11/11] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits

This patch uses the modified pmdp_invalidate(), which returns the previous
value of the pmd, to transfer dirty and accessed bits.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
fs/proc/task_mmu.c | 8 ++++----
mm/huge_memory.c | 29 ++++++++++++-----------------
2 files changed, 16 insertions(+), 21 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 7b40e11ede9b..fe5bff79031a 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -979,14 +979,14 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
static inline void clear_soft_dirty_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmd_t pmd = *pmdp;
+ pmd_t old, pmd = *pmdp;

if (pmd_present(pmd)) {
/* See comment in change_huge_pmd() */
- pmdp_invalidate(vma, addr, pmdp);
- if (pmd_dirty(*pmdp))
+ old = pmdp_invalidate(vma, addr, pmdp);
+ if (pmd_dirty(old))
pmd = pmd_mkdirty(pmd);
- if (pmd_young(*pmdp))
+ if (pmd_young(old))
pmd = pmd_mkyoung(pmd);

pmd = pmd_wrprotect(pmd);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 269b5df58543..c288c3ce9658 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1900,17 +1900,7 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* pmdp_invalidate() is required to make sure we don't miss
* dirty/young flags set by hardware.
*/
- entry = *pmd;
- pmdp_invalidate(vma, addr, pmd);
-
- /*
- * Recover dirty/young flags. It relies on pmdp_invalidate to not
- * corrupt them.
- */
- if (pmd_dirty(*pmd))
- entry = pmd_mkdirty(entry);
- if (pmd_young(*pmd))
- entry = pmd_mkyoung(entry);
+ entry = pmdp_invalidate(vma, addr, pmd);

entry = pmd_modify(entry, newprot);
if (preserve_write)
@@ -2051,8 +2041,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct mm_struct *mm = vma->vm_mm;
struct page *page;
pgtable_t pgtable;
- pmd_t _pmd;
- bool young, write, dirty, soft_dirty, pmd_migration = false;
+ pmd_t old, _pmd;
+ bool young, write, soft_dirty, pmd_migration = false;
unsigned long addr;
int i;

@@ -2099,7 +2089,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
page_ref_add(page, HPAGE_PMD_NR - 1);
write = pmd_write(*pmd);
young = pmd_young(*pmd);
- dirty = pmd_dirty(*pmd);
soft_dirty = pmd_soft_dirty(*pmd);

pmdp_huge_split_prepare(vma, haddr, pmd);
@@ -2129,8 +2118,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
}
- if (dirty)
- SetPageDirty(page + i);
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
@@ -2179,7 +2166,15 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
* and finally we write the non-huge version of the pmd entry with
* pmd_populate.
*/
- pmdp_invalidate(vma, haddr, pmd);
+ old = pmdp_invalidate(vma, haddr, pmd);
+
+ /*
+ * Transfer dirty bit using value returned by pmd_invalidate() to be
+ * sure we don't race with CPU that can set the bit under us.
+ */
+ if (pmd_dirty(old))
+ SetPageDirty(page);
+
pmd_populate(mm, pmd, pgtable);

if (freeze) {
--
2.14.1

2017-09-12 15:40:50

by Kirill A. Shutemov

Subject: [PATCHv3 10/11] mm: Do not lose dirty and access bits in pmdp_invalidate()

Vlastimil noted that pmdp_invalidate() is not atomic and we can lose
dirty and accessed bits if the CPU sets them after the pmdp dereference,
but before set_pmd_at().

This patch changes pmdp_invalidate() to make the entry non-present
atomically and to return the previous value of the entry. This value can
be used to check whether the CPU set dirty/accessed bits under us.

The race window is very small and I haven't seen any reports that can be
attributed to the bug. For this reason, I don't think backporting to
stable trees is needed.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Reported-by: Vlastimil Babka <[email protected]>
Cc: Hugh Dickins <[email protected]>
---
include/asm-generic/pgtable.h | 2 +-
mm/pgtable-generic.c | 6 +++---
2 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index bf0889eb774d..9df1da175fb0 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -324,7 +324,7 @@ static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
#endif

#ifndef __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif

diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 1175f6a24fdb..3db8f2f76666 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -180,12 +180,12 @@ pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
#endif

#ifndef __HAVE_ARCH_PMDP_INVALIDATE
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_t entry = *pmdp;
- set_pmd_at(vma->vm_mm, address, pmdp, pmd_mknotpresent(entry));
+ pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mknotpresent(*pmdp));
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return old;
}
#endif

--
2.14.1

2017-09-12 15:39:59

by Kirill A. Shutemov

Subject: [PATCHv3 01/11] asm-generic: Provide generic_pmdp_establish()

This is an implementation of pmdp_establish() that is only suitable for
an architecture that doesn't have hardware dirty/accessed bits. In this
case we can't race with a CPU setting these bits, so a non-atomic
approach is fine.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
include/asm-generic/pgtable.h | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index 8e0243036564..bf0889eb774d 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -308,6 +308,21 @@ extern void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
#endif

+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+/*
+ * This is implementation of pmdp_establish() that is only suitable for an
+ * architecture that doesn't have hardware dirty/accessed bits. In this case we
+ * can't race with CPU which sets these bits and non-atomic approach is fine.
+ */
+static inline pmd_t generic_pmdp_establish(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+ pmd_t old_pmd = *pmdp;
+ set_pmd_at(vma->vm_mm, address, pmdp, pmd);
+ return old_pmd;
+}
+#endif
+
#ifndef __HAVE_ARCH_PMDP_INVALIDATE
extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
--
2.14.1

2017-09-12 15:41:30

by Kirill A. Shutemov

Subject: [PATCHv3 08/11] sparc64: update pmdp_invalidate to return old pmd value

From: "Aneesh Kumar K.V" <[email protected]>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Nitin Gupta <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/sparc/include/asm/pgtable_64.h | 2 +-
arch/sparc/mm/tlb.c | 23 ++++++++++++++++++-----
2 files changed, 19 insertions(+), 6 deletions(-)

diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
index 4fefe3762083..83b06c98bb94 100644
--- a/arch/sparc/include/asm/pgtable_64.h
+++ b/arch/sparc/include/asm/pgtable_64.h
@@ -979,7 +979,7 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
pmd_t *pmd);

#define __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);

#define __HAVE_ARCH_PGTABLE_DEPOSIT
diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
index ee8066c3d96c..d36c65fc55cf 100644
--- a/arch/sparc/mm/tlb.c
+++ b/arch/sparc/mm/tlb.c
@@ -218,17 +218,28 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
}
}

+static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp, pmd_t pmd)
+{
+ pmd_t old;
+
+ do {
+ old = *pmdp;
+ } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
+
+ return old;
+}
+
/*
* This routine is only called when splitting a THP
*/
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_t entry = *pmdp;
-
- pmd_val(entry) &= ~_PAGE_VALID;
+ pmd_t old, entry;

- set_pmd_at(vma->vm_mm, address, pmdp, entry);
+ entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
+ old = pmdp_establish(vma, address, pmdp, entry);
flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

/*
@@ -239,6 +250,8 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
if ((pmd_val(entry) & _PAGE_PMD_HUGE) &&
!is_huge_zero_page(pmd_page(entry)))
(vma->vm_mm)->context.thp_pte_count--;
+
+ return old;
}

void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
--
2.14.1

2017-09-12 15:41:24

by Kirill A. Shutemov

Subject: [PATCHv3 05/11] mips: Use generic_pmdp_establish as pmdp_establish

MIPS doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Ralf Baechle <[email protected]>
Cc: David Daney <[email protected]>
Cc: [email protected]
---
arch/mips/include/asm/pgtable.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/mips/include/asm/pgtable.h b/arch/mips/include/asm/pgtable.h
index 9e9e94415d08..7b3a3139e82d 100644
--- a/arch/mips/include/asm/pgtable.h
+++ b/arch/mips/include/asm/pgtable.h
@@ -534,6 +534,9 @@ static inline int io_remap_pfn_range(struct vm_area_struct *vma,

#ifdef CONFIG_TRANSPARENT_HUGEPAGE

+/* We don't have hardware dirty/accessed bits, generic_pmdp_establish is fine. */
+#define pmdp_establish generic_pmdp_establish
+
#define has_transparent_hugepage has_transparent_hugepage
extern int has_transparent_hugepage(void);

--
2.14.1

2017-09-12 15:42:10

by Kirill A. Shutemov

Subject: [PATCHv3 07/11] s390/mm: Modify pmdp_invalidate to return old value.

From: Martin Schwidefsky <[email protected]>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Martin Schwidefsky <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/s390/include/asm/pgtable.h | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/s390/include/asm/pgtable.h b/arch/s390/include/asm/pgtable.h
index dce708e061ea..d3de8ddc55ec 100644
--- a/arch/s390/include/asm/pgtable.h
+++ b/arch/s390/include/asm/pgtable.h
@@ -1504,10 +1504,11 @@ static inline pmd_t pmdp_huge_clear_flush(struct vm_area_struct *vma,
}

#define __HAVE_ARCH_PMDP_INVALIDATE
-static inline void pmdp_invalidate(struct vm_area_struct *vma,
+static inline pmd_t pmdp_invalidate(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmdp)
{
- pmdp_xchg_direct(vma->vm_mm, addr, pmdp, __pmd(_SEGMENT_ENTRY_EMPTY));
+ return pmdp_xchg_direct(vma->vm_mm, addr, pmdp,
+ __pmd(_SEGMENT_ENTRY_EMPTY));
}

#define __HAVE_ARCH_PMDP_SET_WRPROTECT
--
2.14.1

2017-09-12 15:42:46

by Kirill A. Shutemov

Subject: [PATCHv3 06/11] powerpc/mm: update pmdp_invalidate to return old pmd value

From: "Aneesh Kumar K.V" <[email protected]>

It's required to avoid losing dirty and accessed bits.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/powerpc/include/asm/book3s/64/pgtable.h | 4 ++--
arch/powerpc/mm/pgtable-book3s64.c | 7 +++++--
2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index b9aff515b4de..aca7cfa349eb 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1137,8 +1137,8 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
}

#define __HAVE_ARCH_PMDP_INVALIDATE
-extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
- pmd_t *pmdp);
+extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp);

#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/pgtable-book3s64.c b/arch/powerpc/mm/pgtable-book3s64.c
index 3b65917785a5..422e80253a33 100644
--- a/arch/powerpc/mm/pgtable-book3s64.c
+++ b/arch/powerpc/mm/pgtable-book3s64.c
@@ -90,16 +90,19 @@ void serialize_against_pte_lookup(struct mm_struct *mm)
* We use this to invalidate a pmdp entry before switching from a
* hugepte to regular pmd entry.
*/
-void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
+pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
{
- pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
+ unsigned long old_pmd;
+
+ old_pmd = pmd_hugepage_update(vma->vm_mm, address, pmdp, _PAGE_PRESENT, 0);
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
/*
* This ensures that generic code that rely on IRQ disabling
* to prevent a parallel THP split work as expected.
*/
serialize_against_pte_lookup(vma->vm_mm);
+ return __pmd(old_pmd);
}

static pmd_t pmd_set_protbits(pmd_t pmd, pgprot_t pgprot)
--
2.14.1

2017-09-12 15:39:56

by Kirill A. Shutemov

Subject: [PATCHv3 02/11] arc: Use generic_pmdp_establish as pmdp_establish

ARC doesn't support hardware dirty/accessed bits.
generic_pmdp_establish() is suitable in this case.

Signed-off-by: Kirill A. Shutemov <[email protected]>
Cc: Vineet Gupta <[email protected]>
---
arch/arc/include/asm/hugepage.h | 3 +++
1 file changed, 3 insertions(+)

diff --git a/arch/arc/include/asm/hugepage.h b/arch/arc/include/asm/hugepage.h
index b18fcb606908..dc8ee011882f 100644
--- a/arch/arc/include/asm/hugepage.h
+++ b/arch/arc/include/asm/hugepage.h
@@ -74,4 +74,7 @@ extern pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
extern void flush_pmd_tlb_range(struct vm_area_struct *vma, unsigned long start,
unsigned long end);

+/* We don't have hardware dirty/accessed bits, generic_pmdp_establish is fine. */
+#define pmdp_establish generic_pmdp_establish
+
#endif
--
2.14.1

2017-09-13 02:09:19

by Aneesh Kumar K.V

Subject: Re: [PATCHv3 11/11] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits


How about this additional patch? It results in code reduction.

From fed62d0541ae78206a1a25caeb46a3ffa7ade9c8 Mon Sep 17 00:00:00 2001
From: "Aneesh Kumar K.V" <[email protected]>
Date: Thu, 27 Jul 2017 12:21:33 +0530
Subject: [PATCH] mm/thp: Remove pmd_huge_split_prepare

Instead of marking the pmd ready for split, invalidate the pmd. This should
take care of the powerpc requirement. The only side effect is that we mark
the pmd invalid early, which can block access to the page a bit longer if we
race against a THP split.

Signed-off-by: Aneesh Kumar K.V <[email protected]>
---
arch/powerpc/include/asm/book3s/64/hash-4k.h | 2 -
arch/powerpc/include/asm/book3s/64/hash-64k.h | 2 -
arch/powerpc/include/asm/book3s/64/pgtable.h | 9 ----
arch/powerpc/include/asm/book3s/64/radix.h | 6 ---
arch/powerpc/mm/pgtable-hash64.c | 22 --------
include/asm-generic/pgtable.h | 8 ---
mm/huge_memory.c | 73 +++++++++++++--------------
7 files changed, 35 insertions(+), 87 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/hash-4k.h b/arch/powerpc/include/asm/book3s/64/hash-4k.h
index d65dcb5826ff..2416edb74d28 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-4k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-4k.h
@@ -112,8 +112,6 @@ extern pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma,
extern void hash__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
extern pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
-extern void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp);
extern int hash__has_transparent_hugepage(void);
diff --git a/arch/powerpc/include/asm/book3s/64/hash-64k.h b/arch/powerpc/include/asm/book3s/64/hash-64k.h
index ab36323b8a3e..001202cabedf 100644
--- a/arch/powerpc/include/asm/book3s/64/hash-64k.h
+++ b/arch/powerpc/include/asm/book3s/64/hash-64k.h
@@ -162,8 +162,6 @@ extern pmd_t hash__pmdp_collapse_flush(struct vm_area_struct *vma,
extern void hash__pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
pgtable_t pgtable);
extern pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp);
-extern void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp);
extern pmd_t hash__pmdp_huge_get_and_clear(struct mm_struct *mm,
unsigned long addr, pmd_t *pmdp);
extern int hash__has_transparent_hugepage(void);
diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 6cf53dc70efc..fee01ffe3b60 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -1114,15 +1114,6 @@ static inline pgtable_t pgtable_trans_huge_withdraw(struct mm_struct *mm,
extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);

-#define __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
-static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- if (radix_enabled())
- return radix__pmdp_huge_split_prepare(vma, address, pmdp);
- return hash__pmdp_huge_split_prepare(vma, address, pmdp);
-}
-
#define pmd_move_must_withdraw pmd_move_must_withdraw
struct spinlock;
static inline int pmd_move_must_withdraw(struct spinlock *new_pmd_ptl,
diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index f5ece365d929..389be8b6c9f7 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -272,12 +272,6 @@ static inline pmd_t radix__pmd_mkhuge(pmd_t pmd)
return __pmd(pmd_val(pmd) | _PAGE_PTE | R_PAGE_LARGE);
return __pmd(pmd_val(pmd) | _PAGE_PTE);
}
-static inline void radix__pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- /* Nothing to do for radix. */
- return;
-}

extern unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
pmd_t *pmdp, unsigned long clr,
diff --git a/arch/powerpc/mm/pgtable-hash64.c b/arch/powerpc/mm/pgtable-hash64.c
index ec277913e01b..469808e77e58 100644
--- a/arch/powerpc/mm/pgtable-hash64.c
+++ b/arch/powerpc/mm/pgtable-hash64.c
@@ -296,28 +296,6 @@ pgtable_t hash__pgtable_trans_huge_withdraw(struct mm_struct *mm, pmd_t *pmdp)
return pgtable;
}

-void hash__pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
- VM_BUG_ON(address & ~HPAGE_PMD_MASK);
- VM_BUG_ON(REGION_ID(address) != USER_REGION_ID);
- VM_BUG_ON(pmd_devmap(*pmdp));
-
- /*
- * We can't mark the pmd none here, because that will cause a race
- * against exit_mmap. We need to continue mark pmd TRANS HUGE, while
- * we spilt, but at the same time we wan't rest of the ppc64 code
- * not to insert hash pte on this, because we will be modifying
- * the deposited pgtable in the caller of this function. Hence
- * clear the _PAGE_USER so that we move the fault handling to
- * higher level function and that will serialize against ptl.
- * We need to flush existing hash pte entries here even though,
- * the translation is still valid, because we will withdraw
- * pgtable_t after this.
- */
- pmd_hugepage_update(vma->vm_mm, address, pmdp, 0, _PAGE_PRIVILEGED);
-}
-
/*
* A linux hugepage PMD was changed and the corresponding hash table entries
* need to be flushed.
diff --git a/include/asm-generic/pgtable.h b/include/asm-generic/pgtable.h
index ece5e399567a..b934e41277ac 100644
--- a/include/asm-generic/pgtable.h
+++ b/include/asm-generic/pgtable.h
@@ -313,14 +313,6 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif

-#ifndef __HAVE_ARCH_PMDP_HUGE_SPLIT_PREPARE
-static inline void pmdp_huge_split_prepare(struct vm_area_struct *vma,
- unsigned long address, pmd_t *pmdp)
-{
-
-}
-#endif
-
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d72c2d20e9c6..59ec8c916368 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1944,8 +1944,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
struct mm_struct *mm = vma->vm_mm;
struct page *page;
pgtable_t pgtable;
- pmd_t old, _pmd;
- bool young, write, soft_dirty;
+ pmd_t old_pmd, _pmd;
+ bool young, write, dirty, soft_dirty;
unsigned long addr;
int i;

@@ -1977,14 +1977,39 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
return __split_huge_zero_page_pmd(vma, haddr, pmd);
}

- page = pmd_page(*pmd);
+ /*
+ * Up to this point the pmd is present and huge and userland has the
+ * whole access to the hugepage during the split (which happens in
+ * place). If we overwrite the pmd with the not-huge version pointing
+ * to the pte here (which of course we could if all CPUs were bug
+ * free), userland could trigger a small page size TLB miss on the
+ * small sized TLB while the hugepage TLB entry is still established in
+ * the huge TLB. Some CPU doesn't like that.
+ * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
+ * 383 on page 93. Intel should be safe but is also warns that it's
+ * only safe if the permission and cache attributes of the two entries
+ * loaded in the two TLB is identical (which should be the case here).
+ * But it is generally safer to never allow small and huge TLB entries
+ * for the same virtual address to be loaded simultaneously. So instead
+ * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
+ * current pmd notpresent (atomically because here the pmd_trans_huge
+ * and pmd_trans_splitting must remain set at all times on the pmd
+ * until the split is complete for this pmd), then we flush the SMP TLB
+ * and finally we write the non-huge version of the pmd entry with
+ * pmd_populate.
+ */
+ old_pmd = pmdp_invalidate(vma, haddr, pmd);
+
+ page = pmd_page(old_pmd);
VM_BUG_ON_PAGE(!page_count(page), page);
page_ref_add(page, HPAGE_PMD_NR - 1);
- write = pmd_write(*pmd);
- young = pmd_young(*pmd);
- soft_dirty = pmd_soft_dirty(*pmd);
-
- pmdp_huge_split_prepare(vma, haddr, pmd);
+ write = pmd_write(old_pmd);
+ young = pmd_young(old_pmd);
+ dirty = pmd_dirty(old_pmd);
+ soft_dirty = pmd_soft_dirty(old_pmd);
+ /*
+ * withdraw the table only after we mark the pmd entry invalid
+ */
pgtable = pgtable_trans_huge_withdraw(mm, pmd);
pmd_populate(mm, &_pmd, pgtable);

@@ -2011,6 +2036,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
if (soft_dirty)
entry = pte_mksoft_dirty(entry);
}
+ if (dirty)
+ SetPageDirty(page + i);
pte = pte_offset_map(&_pmd, addr);
BUG_ON(!pte_none(*pte));
set_pte_at(mm, addr, pte, entry);
@@ -2038,36 +2065,6 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
}

smp_wmb(); /* make pte visible before pmd */
- /*
- * Up to this point the pmd is present and huge and userland has the
- * whole access to the hugepage during the split (which happens in
- * place). If we overwrite the pmd with the not-huge version pointing
- * to the pte here (which of course we could if all CPUs were bug
- * free), userland could trigger a small page size TLB miss on the
- * small sized TLB while the hugepage TLB entry is still established in
- * the huge TLB. Some CPU doesn't like that.
- * See http://support.amd.com/us/Processor_TechDocs/41322.pdf, Erratum
- * 383 on page 93. Intel should be safe but is also warns that it's
- * only safe if the permission and cache attributes of the two entries
- * loaded in the two TLB is identical (which should be the case here).
- * But it is generally safer to never allow small and huge TLB entries
- * for the same virtual address to be loaded simultaneously. So instead
- * of doing "pmd_populate(); flush_pmd_tlb_range();" we first mark the
- * current pmd notpresent (atomically because here the pmd_trans_huge
- * and pmd_trans_splitting must remain set at all times on the pmd
- * until the split is complete for this pmd), then we flush the SMP TLB
- * and finally we write the non-huge version of the pmd entry with
- * pmd_populate.
- */
- old = pmdp_invalidate(vma, haddr, pmd);
-
- /*
- * Transfer dirty bit using value returned by pmdp_invalidate() to be
- * sure we don't race with CPU that can set the bit under us.
- */
- if (pmd_dirty(old))
- SetPageDirty(page);
-
pmd_populate(mm, pmd, pgtable);

if (freeze) {
--
2.13.5


2017-09-13 05:06:29

by Aneesh Kumar K.V

Subject: Re: [PATCHv3 08/11] sparc64: update pmdp_invalidate to return old pmd value

"Kirill A. Shutemov" <[email protected]> writes:

> From: "Aneesh Kumar K.V" <[email protected]>


You got the author wrong there.

>
> It's required to avoid losing dirty and accessed bits.
>
> Signed-off-by: Nitin Gupta <[email protected]>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/sparc/include/asm/pgtable_64.h | 2 +-
> arch/sparc/mm/tlb.c | 23 ++++++++++++++++++-----
> 2 files changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/arch/sparc/include/asm/pgtable_64.h b/arch/sparc/include/asm/pgtable_64.h
> index 4fefe3762083..83b06c98bb94 100644
> --- a/arch/sparc/include/asm/pgtable_64.h
> +++ b/arch/sparc/include/asm/pgtable_64.h
> @@ -979,7 +979,7 @@ void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd);
>
> #define __HAVE_ARCH_PMDP_INVALIDATE
> -extern void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmdp);
>
> #define __HAVE_ARCH_PGTABLE_DEPOSIT
> diff --git a/arch/sparc/mm/tlb.c b/arch/sparc/mm/tlb.c
> index ee8066c3d96c..d36c65fc55cf 100644
> --- a/arch/sparc/mm/tlb.c
> +++ b/arch/sparc/mm/tlb.c
> @@ -218,17 +218,28 @@ void set_pmd_at(struct mm_struct *mm, unsigned long addr,
> }
> }
>
> +static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmdp, pmd_t pmd)
> +{
> + pmd_t old;
> +
> + do {
> + old = *pmdp;
> + } while (cmpxchg64(&pmdp->pmd, old.pmd, pmd.pmd) != old.pmd);
> +
> + return old;
> +}
> +
> /*
> * This routine is only called when splitting a THP
> */
> -void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> +pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmdp)
> {
> - pmd_t entry = *pmdp;
> -
> - pmd_val(entry) &= ~_PAGE_VALID;
> + pmd_t old, entry;
>
> - set_pmd_at(vma->vm_mm, address, pmdp, entry);
> + entry = __pmd(pmd_val(*pmdp) & ~_PAGE_VALID);
> + old = pmdp_establish(vma, address, pmdp, entry);
> flush_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> /*
> @@ -239,6 +250,8 @@ void pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> if ((pmd_val(entry) & _PAGE_PMD_HUGE) &&
> !is_huge_zero_page(pmd_page(entry)))
> (vma->vm_mm)->context.thp_pte_count--;
> +
> + return old;
> }
>
> void pgtable_trans_huge_deposit(struct mm_struct *mm, pmd_t *pmdp,
> --
> 2.14.1

2017-12-13 10:14:22

by Kirill A. Shutemov

Subject: Re: [PATCHv3 11/11] mm: Use updated pmdp_invalidate() interface to track dirty/accessed bits

On Wed, Sep 13, 2017 at 02:08:58AM +0000, Aneesh Kumar K.V wrote:
> @@ -2011,6 +2036,8 @@ static void __split_huge_pmd_locked(struct vm_area_struct *vma, pmd_t *pmd,
> if (soft_dirty)
> entry = pte_mksoft_dirty(entry);
> }
> + if (dirty)
> + SetPageDirty(page + i);
> pte = pte_offset_map(&_pmd, addr);
> BUG_ON(!pte_none(*pte));
> set_pte_at(mm, addr, pte, entry);

The patch is fine. But we don't need to set every 4k page dirty. We have a
single dirty bit for the whole THP. I'll change this part and send the patch
as part of the series.
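
Something along these lines, reusing the pmd value returned by
pmdp_invalidate() (a sketch against the variable names in your patch):

	/* The dirty bit is tracked once for the whole compound page. */
	if (pmd_dirty(old_pmd))
		SetPageDirty(page);

instead of calling SetPageDirty(page + i) for each of the HPAGE_PMD_NR
subpages.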

--
Kirill A. Shutemov