From: Nadav Amit <[email protected]>
This patch-set is intended to remove unnecessary TLB flushes. It is
based on feedback from v1 and several bugs I found in v1 myself.
Basically, there are 3 optimizations in this patch-set:
1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
2. Use TLB batching infrastructure to batch flushes across VMAs and
do better/fewer flushes.
3. Avoid TLB flushes on permission demotion.
Andrea asked for the aforementioned (2) to come after (3), but this
is not simple (specifically since change_prot_numa() needs the number
of pages affected).
There are many changes from v1 to v2 so consider the change log as
partial.
v1->v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Nadav Amit (5):
x86: Detection of Knights Landing A/D leak
mm: avoid unnecessary flush on change_huge_pmd()
x86/mm: check exec permissions on fault
mm/mprotect: use mmu_gather
mm/mprotect: do not flush on permission promotion
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/pgtable.h | 8 +++
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 80 +++++++++++++++++++++++
arch/x86/kernel/cpu/intel.c | 5 ++
arch/x86/mm/fault.c | 11 +++-
fs/exec.c | 6 +-
include/asm-generic/tlb.h | 14 +++++
include/linux/huge_mm.h | 5 +-
include/linux/mm.h | 5 +-
include/linux/pgtable.h | 5 ++
mm/huge_memory.c | 22 ++++---
mm/mprotect.c | 94 +++++++++++++++-------------
mm/pgtable-generic.c | 8 +++
mm/userfaultfd.c | 6 +-
15 files changed, 215 insertions(+), 57 deletions(-)
--
2.25.1
From: Nadav Amit <[email protected]>
Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.
The first TLB flush is only necessary to prevent the dirty bit (and with
a lesser importance the access bit) from changing while the PTE is
modified. However, this is not necessary as the x86 CPUs set the
dirty-bit atomically with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing that has a bug and does
not do so.
Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that only prevents the hardware from making further changes to the
access and dirty bits; unlike pmdp_invalidate(), it does not necessarily
flush stale translations from the TLB.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable.h | 8 ++++++++
include/linux/pgtable.h | 5 +++++
mm/huge_memory.c | 7 ++++---
mm/pgtable-generic.c | 8 ++++++++
4 files changed, 25 insertions(+), 3 deletions(-)
diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..18c3366f8f4d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
}
#endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+}
+
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..622efe0a9ef0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -561,6 +561,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5ea5f775d5c..435da011b1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* The race makes MADV_DONTNEED miss the huge pmd and don't clear it
* which may break userspace.
*
- * pmdp_invalidate() is required to make sure we don't miss
- * dirty/young flags set by hardware.
+ * pmdp_invalidate_ad() is required to make sure we don't miss
+ * dirty/young flags (which are also known as access/dirty) cannot be
+ * further modifeid by the hardware.
*/
- entry = pmdp_invalidate(vma, addr, pmd);
+ entry = pmdp_invalidate_ad(vma, addr, pmd);
entry = pmd_modify(entry, newprot);
if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..b0ce6c7391bf 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
}
#endif
+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
#ifndef pmdp_collapse_flush
pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
--
2.25.1
From: Nadav Amit <[email protected]>
Knights Landing has an issue in which a thread setting the A or D bits
may not do so atomically with respect to checking the present bit. A
thread which is going to page-fault may still set those bits, even
though the present bit was already atomically cleared.
This implies that after the kernel clears the present bit atomically,
the supposedly zero entry could later be corrupted with stray A or D
bits.
Since the PTE could be already used for storing a swap index, or a NUMA
migration index, this cannot be tolerated. Most of the time the kernel
detects the problem, but in some rare cases it may not.
This patch adds an interface to detect the bug, which will be used in
the following patch.
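The intended use in the following patch (this patch only adds the
detection) is roughly along the lines of:
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);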
[ Based on a patch by Andi Kleen ]
Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/intel.c | 5 +++++
2 files changed, 6 insertions(+)
diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..32d0aabd788d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -436,5 +436,6 @@
#define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */
#define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */
#define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */
+#define X86_BUG_PTE_LEAK X86_BUG(25) /* PTE may leak A/D bits after clear */
#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..40bcba6e3641 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
}
+ if (c->x86_model == 87) {
+ pr_info_once("Enabling PTE leaking workaround\n");
+ set_cpu_bug(c, X86_BUG_PTE_LEAK);
+ }
+
/*
* Intel Quark Core DevMan_001.pdf section 6.4.11
* "The operating system also is required to invalidate (i.e., flush)
--
2.25.1
From: Nadav Amit <[email protected]>
Currently, unprotecting a memory region with either mprotect() or uffd
causes a TLB flush. At least on x86, no TLB flush is needed when
protection is only promoted.
Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().
For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit: if the dirty-bit is clear and write-protect is
set, no TLB flush is needed, as x86 updates the dirty-bit atomically
on write, and if the bit is clear, the PTE is reread.
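For example, with the x86 implementation below: making a PTE
non-present, clearing its dirty bit, or setting NX requires a flush,
while setting the write bit, clearing NX, or changing only software
bits or the accessed/global bits does not.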
Signed-off-by: Nadav Amit <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 80 ++++++++++++++++++++++++++++
include/asm-generic/tlb.h | 14 +++++
mm/huge_memory.c | 9 ++--
mm/mprotect.c | 3 +-
5 files changed, 103 insertions(+), 5 deletions(-)
diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..8668bc661026 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -110,9 +110,11 @@
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
#else
#define _PAGE_NX (_AT(pteval_t, 0))
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
+#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif
#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index b587a9ee9cb2..a782adde3d62 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,86 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+/*
+ * The enable_mask tells which bits require a flush when they were set and
+ * get cleared.
+ *
+ * The disable_mask tells which bits require a flush when they were clear and
+ * get set.
+ *
+ * All the other bits, except the ignored bits, require a flush whenever they
+ * change, no matter whether they get set or cleared.
+ *
+ * Note that we ignore the accessed bit, since anyhow the kernel does not flush
+ * after clearing it in other situations. We also ignore the global bit, as it
+ * is used for protnone.
+ */
+static inline bool pte_flags_may_need_flush(unsigned long oldflags,
+ unsigned long newflags)
+{
+ const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+ _PAGE_SOFTW3 | _PAGE_SOFTW4 | _PAGE_ACCESSED | _PAGE_GLOBAL;
+ const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT;
+ const pteval_t disable_mask = _PAGE_NX;
+ unsigned long diff = oldflags ^ newflags;
+
+ return diff & ((oldflags & enable_mask) |
+ (newflags & disable_mask) |
+ ~(enable_mask | disable_mask | ignore_mask));
+}
+
+/*
+ * pte_may_need_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace PTEs.
+ */
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+ /* new is non-present: need only if old is present */
+ if (!pte_present(newpte))
+ return pte_present(oldpte);
+
+ /* old is not present: no need for flush */
+ if (!pte_present(oldpte))
+ return false;
+
+ /*
+ * Avoid open-coding to account for protnone_mask() and perform
+ * comparison of the PTEs.
+ */
+ if (pte_pfn(oldpte) != pte_pfn(newpte))
+ return true;
+
+ return pte_flags_may_need_flush(pte_flags(oldpte),
+ pte_flags(newpte));
+}
+#define pte_may_need_flush pte_may_need_flush
+
+/*
+ * huge_pmd_may_need_flush() checks whether permissions were demoted and
+ * require a flush. It should only be used for userspace huge PMDs.
+ */
+static inline bool huge_pmd_may_need_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ /* new is non-present: need only if old is present */
+ if (!pmd_present(newpmd))
+ return pmd_present(oldpmd);
+
+ /* old is not present: no need for flush */
+ if (!pmd_present(oldpmd))
+ return false;
+
+ /*
+ * Avoid open-coding to account for protnone_mask() and perform
+ * comparison of the PTEs.
+ */
+ if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+ return true;
+
+ return pte_flags_may_need_flush(pmd_flags(oldpmd),
+ pmd_flags(newpmd));
+}
+#define huge_pmd_may_need_flush huge_pmd_may_need_flush
+
#endif /* !MODULE */
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..2d3736c62602 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,20 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
} while (0)
#endif
+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+ return true;
+}
+#endif
+
+#ifndef huge_pmd_may_need_flush
+static inline bool huge_pmd_may_need_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ return true;
+}
+#endif
+
#endif /* CONFIG_MMU */
#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5d0357a25ce..f80936324e6a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1726,7 +1726,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
- pmd_t entry;
+ pmd_t oldpmd, entry;
bool preserve_write;
int ret;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -1802,9 +1802,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* dirty/young flags (which are also known as access/dirty) cannot be
* further modifeid by the hardware.
*/
- entry = pmdp_invalidate_ad(vma, addr, pmd);
+ oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
- entry = pmd_modify(entry, newprot);
+ entry = pmd_modify(oldpmd, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
if (uffd_wp) {
@@ -1821,7 +1821,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
- tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+ if (huge_pmd_may_need_flush(oldpmd, entry))
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0f5c87af5c60..6179c82ea72d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+ if (pte_may_need_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
2.25.1
From: Nadav Amit <[email protected]>
access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.
It appears not to be an issue so far, but the next patches avoid TLB
flushes on permission promotion, which can lead to this scenario. nodejs
for instance crashes when TLB flush is avoided on permission promotion.
Add a check to prevent access_error() from returning mistakenly that
page-faults due to instruction fetch are not allowed. Intel SDM does not
indicate whether "instruction fetch" and "write" in the hardware error
code are mutually exclusive, so check both before returning whether the
access is allowed.
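As an illustration of the scenario this prevents:
1. A PTE maps a page with NX set and the translation is cached in the
   TLB.
2. mprotect(PROT_READ|PROT_EXEC) relaxes the protection (NX=1->NX=0)
   and, with the following patches, no TLB flush is issued.
3. The thread then jumps into the page. The stale TLB entry still has
   NX set, so the CPU can raise a spurious page-fault with the
   instruction-fetch bit set in the error code.
4. Both the PTE and the VMA allow execution, so access_error() must not
   report an access error; the handler returns, the fetch is retried,
   and the stale entry is replaced.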
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/mm/fault.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)
diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..e776130473ce 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;
- if (error_code & X86_PF_WRITE) {
+ if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
/* write, present and write, not present: */
- if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ if ((error_code & X86_PF_WRITE) &&
+ unlikely(!(vma->vm_flags & VM_WRITE)))
return 1;
+
+ /* exec, present and exec, not present: */
+ if ((error_code & X86_PF_INSTR) &&
+ unlikely(!(vma->vm_flags & VM_EXEC)))
+ return 1;
+
return 0;
}
--
2.25.1
From: Nadav Amit <[email protected]>
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents opportunities to avoid TLB flushes or perform them
at a finer granularity.
The use of mmu_gather for modified PTEs has benefits in various
scenarios even if pages are not released. For instance, if only a single
page needs to be flushed out of a range of many pages, only that page
would be flushed. If a THP page is flushed, on x86 a single TLB invlpg
instruction can be used instead of 512 instructions (or a full TLB
flush, which Linux would actually use by default). mprotect() over
multiple VMAs requires only a single flush.
Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().
Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
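Callers now set up and tear down the mmu_gather themselves; the
pattern is roughly (the actual call sites are updated below):
	struct mmu_gather tlb;
	tlb_gather_mmu(&tlb, mm);
	change_protection(&tlb, vma, start, end, newprot, cp_flags);
	tlb_finish_mmu(&tlb);	/* performs the deferred TLB flush */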
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
fs/exec.c | 6 ++-
include/linux/huge_mm.h | 5 ++-
include/linux/mm.h | 5 ++-
mm/huge_memory.c | 10 ++++-
mm/mprotect.c | 93 ++++++++++++++++++++++-------------------
mm/userfaultfd.c | 6 ++-
6 files changed, 75 insertions(+), 50 deletions(-)
diff --git a/fs/exec.c b/fs/exec.c
index 5a7a07dfdc81..7f8609bbc6b3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -752,6 +752,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
unsigned long stack_size;
unsigned long stack_expand;
unsigned long rlim_stack;
+ struct mmu_gather tlb;
#ifdef CONFIG_STACK_GROWSUP
/* Limit stack size */
@@ -806,8 +807,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
vm_flags |= mm->def_flags;
vm_flags |= VM_STACK_INCOMPLETE_SETUP;
- ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
+ tlb_gather_mmu(&tlb, mm);
+ ret = mprotect_fixup(&tlb, vma, &prev, vma->vm_start, vma->vm_end,
vm_flags);
+ tlb_finish_mmu(&tlb);
+
if (ret)
goto out_unlock;
BUG_ON(prev != vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..a9b6e03e9c4c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,8 +36,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
unsigned long addr);
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
- pgprot_t newprot, unsigned long cp_flags);
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags);
vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
pgprot_t pgprot, bool write);
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bb2d938df4..f46bab158560 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2001,10 +2001,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)
-extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags);
-extern int mprotect_fixup(struct vm_area_struct *vma,
+extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 435da011b1a2..f5d0357a25ce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1720,8 +1720,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
* or if prot_numa but THP migration is not supported
* - HPAGE_PMD_NR if protections changed and TLB flush necessary
*/
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1732,6 +1733,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
if (prot_numa && !thp_migration_supported())
return 1;
@@ -1817,6 +1820,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
+
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
spin_unlock(ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..0f5c87af5c60 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,12 +32,13 @@
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include "internal.h"
-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
@@ -48,6 +49,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ tlb_change_page_size(tlb, PAGE_SIZE);
+
/*
* Can be called with only the mmap_lock for reading by
* prot_numa so we must check the pmd isn't constantly
@@ -138,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -219,9 +223,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}
-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pmd_t *pmd;
unsigned long next;
@@ -261,8 +265,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
- int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot, cp_flags);
+ /*
+ * change_huge_pmd() does not defer TLB flushes,
+ * so no need to propagate the tlb argument.
+ */
+ int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ addr, newprot, cp_flags);
if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
@@ -276,8 +284,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
/* fall through, the trans huge pmd just split */
}
- this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- cp_flags);
+ this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+ newprot, cp_flags);
pages += this_pages;
next:
cond_resched();
@@ -291,9 +299,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
return pages;
}
-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- p4d_t *p4d, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pud_t *pud;
unsigned long next;
@@ -304,16 +312,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- pages += change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
cp_flags);
} while (pud++, addr = next, addr != end);
return pages;
}
-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
p4d_t *p4d;
unsigned long next;
@@ -324,44 +332,40 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- pages += change_pud_range(vma, p4d, addr, next, newprot,
+ pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
cp_flags);
} while (p4d++, addr = next, addr != end);
return pages;
}
-static unsigned long change_protection_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_protection_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;
- unsigned long start = addr;
unsigned long pages = 0;
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
- flush_cache_range(vma, addr, end);
- inc_tlb_flush_pending(mm);
+ tlb_start_vma(tlb, vma);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_p4d_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
cp_flags);
} while (pgd++, addr = next, addr != end);
- /* Only flush the TLB if we actually modified any entries: */
- if (pages)
- flush_tlb_range(vma, start, end);
- dec_tlb_flush_pending(mm);
+ tlb_end_vma(tlb, vma);
return pages;
}
-unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags)
{
@@ -372,7 +376,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
- pages = change_protection_range(vma, start, end, newprot,
+ pages = change_protection_range(tlb, vma, start, end, newprot,
cp_flags);
return pages;
@@ -406,8 +410,9 @@ static const struct mm_walk_ops prot_none_walk_ops = {
};
int
-mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
- unsigned long start, unsigned long end, unsigned long newflags)
+mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ struct vm_area_struct **pprev, unsigned long start,
+ unsigned long end, unsigned long newflags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long oldflags = vma->vm_flags;
@@ -494,7 +499,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
vma_set_page_prot(vma);
- change_protection(vma, start, end, vma->vm_page_prot,
+ change_protection(tlb, vma, start, end, vma->vm_page_prot,
dirty_accountable ? MM_CP_DIRTY_ACCT : 0);
/*
@@ -528,6 +533,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
(prot & PROT_READ);
+ struct mmu_gather tlb;
start = untagged_addr(start);
@@ -584,6 +590,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (start > vma->vm_start)
prev = vma;
+ tlb_gather_mmu(&tlb, current->mm);
for (nstart = start ; ; ) {
unsigned long mask_off_old_flags;
unsigned long newflags;
@@ -610,18 +617,18 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
/* newflags >> 4 shift VM_MAY% in place of VM_% */
if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) {
error = -EACCES;
- goto out;
+ goto out_tlb;
}
/* Allow architectures to sanity-check the new flags */
if (!arch_validate_flags(newflags)) {
error = -EINVAL;
- goto out;
+ goto out_tlb;
}
error = security_file_mprotect(vma, reqprot, prot);
if (error)
- goto out;
+ goto out_tlb;
tmp = vma->vm_end;
if (tmp > end)
@@ -630,27 +637,29 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (vma->vm_ops && vma->vm_ops->mprotect) {
error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;
}
- error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+ error = mprotect_fixup(&tlb, vma, &prev, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;
nstart = tmp;
if (nstart < prev->vm_end)
nstart = prev->vm_end;
if (nstart >= end)
- goto out;
+ goto out_tlb;
vma = prev->vm_next;
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
- goto out;
+ goto out_tlb;
}
prot = reqprot;
}
+out_tlb:
+ tlb_finish_mmu(&tlb);
out:
mmap_write_unlock(current->mm);
return error;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ac6f036298cd..15a20bb35868 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
#include <linux/hugetlb.h>
#include <linux/shmem_fs.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include "internal.h"
static __always_inline
@@ -674,6 +675,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
atomic_t *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ struct mmu_gather tlb;
pgprot_t newprot;
int err;
@@ -715,8 +717,10 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
else
newprot = vm_get_page_prot(dst_vma->vm_flags);
- change_protection(dst_vma, start, start + len, newprot,
+ tlb_gather_mmu(&tlb, dst_mm);
+ change_protection(&tlb, dst_vma, start, start + len, newprot,
enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+ tlb_finish_mmu(&tlb);
err = 0;
out_unlock:
--
2.25.1
On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
> This patch-set is intended to remove unnecessary TLB flushes. It is
> based on feedback from v1 and several bugs I found in v1 myself.
>
> Basically, there are 3 optimizations in this patch-set:
> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
> prevent the A/D bits from changing.
> 2. Use TLB batching infrastructure to batch flushes across VMAs and
> do better/fewer flushes.
> 3. Avoid TLB flushes on permission demotion.
>
> Andrea asked for the aforementioned (2) to come after (3), but this
> is not simple (specifically since change_prot_numa() needs the number
> of pages affected).
[1/5] appears to be a significant fix which should probably be
backported into -stable kernels. If you agree with this then I suggest
it be prepared as a standalone patch, separate from the other four
patches. With a cc:stable.
And the remaining patches are a performance optimization. Has any
attempt been made to quantify the benefits?
> On Oct 21, 2021, at 8:04 PM, Andrew Morton <[email protected]> wrote:
>
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>
>> This patch-set is intended to remove unnecessary TLB flushes. It is
>> based on feedback from v1 and several bugs I found in v1 myself.
>>
>> Basically, there are 3 optimizations in this patch-set:
>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>> prevent the A/D bits from changing.
>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>> do better/fewer flushes.
>> 3. Avoid TLB flushes on permission demotion.
>>
>> Andrea asked for the aforementioned (2) to come after (3), but this
>> is not simple (specifically since change_prot_numa() needs the number
>> of pages affected).
>
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels. If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches. With a cc:stable.
There is no functionality bug in the kernel. The Knights Landing bug
was circumvented eventually by changing the swap entry structure so
the access/dirty bits would not overlap with the swap entry data.
>
> And the remaining patches are a performance optimization. Has any
> attempt been made to quantify the benefits?
I included some data before [1]. In general the cost that is saved
is the cost of a TLB flush/shootdown.
I will modify my benchmark to test huge-pages (which were not
included in the previous patch-set) and send results later. I would
also try nodejs to see if there is a significant enough benefit.
Nodejs crashed before (hence the 3rd patch added here), as it
exec-protects/unprotects pages - I will see if the benefit shows in
the benchmarks.
[ The motivation behind the patches is to later introduce userfaultfd
writeprotectv interface, and for my use-case that is under
development this proved to improve performance considerably. ]
[1] https://lore.kernel.org/linux-mm/[email protected]/
On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>
> > This patch-set is intended to remove unnecessary TLB flushes. It is
> > based on feedback from v1 and several bugs I found in v1 myself.
> >
> > Basically, there are 3 optimizations in this patch-set:
> > 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
> > prevent the A/D bits from changing.
> > 2. Use TLB batching infrastructure to batch flushes across VMAs and
> > do better/fewer flushes.
> > 3. Avoid TLB flushes on permission demotion.
> >
> > Andrea asked for the aforementioned (2) to come after (3), but this
> > is not simple (specifically since change_prot_numa() needs the number
> > of pages affected).
>
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels. If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches. With a cc:stable.
I am confused, 1/5 doesn't actually do *anything*. I also cannot find
any further usage of the introduced X86_BUG_PTE_LEAK.
I'm thinking patch #2 means to have something like:
if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
In the newly minted: pmdp_invalidate_ad(), but alas, nothing there.
On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 448cd01eb3ec..18c3366f8f4d 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> }
> }
> #endif
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmdp)
> +{
> + return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
Did this want to be something like:
pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return old;
instead?
> +}
> +
> /*
> * Page table pages are page-aligned. The lower half of the top
> * level is used for userspace and the top half for the kernel.
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e5ea5f775d5c..435da011b1a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> * which may break userspace.
> *
> - * pmdp_invalidate() is required to make sure we don't miss
> - * dirty/young flags set by hardware.
> + * pmdp_invalidate_ad() is required to make sure we don't miss
> + * dirty/young flags (which are also known as access/dirty) cannot be
> + * further modifeid by the hardware.
"modified", I think is the more common spelling.
> */
> - entry = pmdp_invalidate(vma, addr, pmd);
> + entry = pmdp_invalidate_ad(vma, addr, pmd);
>
> entry = pmd_modify(entry, newprot);
> if (preserve_write)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4e640baf9794..b0ce6c7391bf 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> }
> #endif
>
> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
/*
* Does this deserve a comment to explain the intended difference vs
* pmdp_invalidate() ?
*/
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + return pmdp_invalidate(vma, address, pmdp);
> +}
> +#endif
> +
> #ifndef pmdp_collapse_flush
> pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmdp)
> --
> 2.25.1
>
On Thu, Oct 21, 2021 at 05:21:12AM -0700, Nadav Amit wrote:
> +/*
> + * pte_may_need_flush() checks whether permissions were demoted and require a
> + * flush. It should only be used for userspace PTEs.
> + */
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> +{
> + /* new is non-present: need only if old is present */
> + if (!pte_present(newpte))
> + return pte_present(oldpte);
> +
> + /* old is not present: no need for flush */
> + if (!pte_present(oldpte))
> + return false;
Would it not be clearer to write the above like:
/* !PRESENT -> * ; no need for flush */
if (!pte_present(oldpte))
return false;
/* PRESENT -> !PRESENT ; needs flush */
if (!pte_present(newpte))
return true;
?
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0f5c87af5c60..6179c82ea72d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> ptent = pte_mkwrite(ptent);
> }
> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> + if (pte_may_need_flush(oldpte, ptent))
> + tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> pages++;
> } else if (is_swap_pte(oldpte)) {
> swp_entry_t entry = pte_to_swp_entry(oldpte);
One question on naming, "may_need" sounds a bit washy to me, either it
does or it does not. I suppose you're trying to convey the fact that we
ought to err towards too many TLBi rather than too few, but that's
always true.
That is, would "needs" not be a better name?
On 10/25/21 3:59 AM, Peter Zijlstra wrote:
>> Add a check to prevent access_error() from returning mistakenly that
>> page-faults due to instruction fetch are not allowed. Intel SDM does not
>> indicate whether "instruction fetch" and "write" in the hardware error
>> code are mutually exclusive, so check both before returning whether the
>> access is allowed.
> Dave, can we get that clarified? It seems a bit naf and leads to
> confusing code IMO.
We can, but there are quite a few implicit relationships in those bits.
PF_INSN and PF_PK can't ever be set together, for instance. It's
pretty clear as long as you have fetch==read in your head.
On 10/21/21 5:21 AM, Nadav Amit wrote:
> access_error() currently does not check for execution permission
> violation.
Ye
> As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.
While I could totally believe that something is goofy when VMAs are
being changed underneath a page fault, I'm having trouble figuring out
why the "if (error_code & X86_PF_WRITE)" code is being modified.
> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.
Just to be clear, "promotion" is going from something like:
W=0->W=1
or
NX=1->NX=0
right? I tend to call that "relaxing" permissions.
Currently, X86_PF_WRITE faults are considered an access error unless the
VMA to which the write occurred allows writes. Returning "no access
error" permits continuing and handling the copy-on-write.
It sounds like you want to expand that. You want to add a whole class
of new faults that can be ignored: not just that some COW handling might
be necessary, but that the PTE itself might be out of date. Just like
a "COW fault" may just result in setting the PTE.W=1 and moving on with
our day, an instruction fault might now just end up with setting
PTE.NX=0 and also moving on with our day.
I'm really confused why the "error_code & X86_PF_WRITE" case is getting
modified. I would have expected it to be something like just adding:
/* read, instruction fetch */
if (error_code & X86_PF_INSN) {
/* Avoid enforcing access error if spurious: */
if (unlikely(!(vma->vm_flags & VM_EXEC)))
return 1;
return 0;
}
I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
other than both being able to (now) be generated spuriously.
> On Oct 25, 2021, at 4:12 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 05:21:12AM -0700, Nadav Amit wrote:
>> +/*
>> + * pte_may_need_flush() checks whether permissions were demoted and require a
>> + * flush. It should only be used for userspace PTEs.
>> + */
>> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
>> +{
>> + /* new is non-present: need only if old is present */
>> + if (!pte_present(newpte))
>> + return pte_present(oldpte);
>> +
>> + /* old is not present: no need for flush */
>> + if (!pte_present(oldpte))
>> + return false;
>
> Would it not be clearer to write the above like:
>
> /* !PRESENT -> * ; no need for flush */
> if (!pte_present(oldpte))
> return false;
>
> /* PRESENT -> !PRESENT ; needs flush */
> if (!pte_present(newpte))
> return true;
>
> ?
I will change the comment to yours. Thanks.
>
>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 0f5c87af5c60..6179c82ea72d 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> ptent = pte_mkwrite(ptent);
>> }
>> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> + if (pte_may_need_flush(oldpte, ptent))
>> + tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> pages++;
>> } else if (is_swap_pte(oldpte)) {
>> swp_entry_t entry = pte_to_swp_entry(oldpte);
>
> One question on naming, "may_need" sounds a bit washy to me, either it
> does or it does not. I suppose you're trying to convey the fact that we
> ought to err towards too many TLBi rather than too few, but that's
> always true.
>
> That is, would "needs" not be a better name?
The “may” is indeed intended to be clear that the function can error
towards too many TLB flushes (of any kind). For instance, in a change
from (!dirty|write)->(!write), no flush is needed in theory. I was too
chicken to add it, at least for now.
I can change the name and indicate in the comment instead though.
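Just to illustrate the case I mean, on top of pte_flags_may_need_flush()
in this patch such a check would be roughly (untested):
	/* write-protecting a clean PTE: the A/D-assist would fault instead */
	if ((diff & ~ignore_mask) == _PAGE_RW &&
	    (oldflags & (_PAGE_RW | _PAGE_DIRTY)) == _PAGE_RW)
		return false;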
> On Oct 25, 2021, at 3:52 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 448cd01eb3ec..18c3366f8f4d 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>> }
>> }
>> #endif
>> +
>> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
>> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp)
>> +{
>> + return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>
> Did this want to be something like:
>
> pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> return old;
>
> instead?
Yes. Of course. Where did my code go to? :(
>
>> +}
>> +
>> /*
>> * Page table pages are page-aligned. The lower half of the top
>> * level is used for userspace and the top half for the kernel.
>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e5ea5f775d5c..435da011b1a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> * which may break userspace.
>> *
>> - * pmdp_invalidate() is required to make sure we don't miss
>> - * dirty/young flags set by hardware.
>> + * pmdp_invalidate_ad() is required to make sure we don't miss
>> + * dirty/young flags (which are also known as access/dirty) cannot be
>> + * further modifeid by the hardware.
>
> "modified", I think is the more common spelling.
I tried to start a new trend. I will fix it.
>
>> */
>> - entry = pmdp_invalidate(vma, addr, pmd);
>> + entry = pmdp_invalidate_ad(vma, addr, pmd);
>>
>> entry = pmd_modify(entry, newprot);
>> if (preserve_write)
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4e640baf9794..b0ce6c7391bf 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> }
>> #endif
>>
>> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
>
> /*
> * Does this deserve a comment to explain the intended difference vs
> * pmdp_invalidate() ?
> */
I will add a comment.
> On Oct 25, 2021, at 3:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
>> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>>
>>> This patch-set is intended to remove unnecessary TLB flushes. It is
>>> based on feedback from v1 and several bugs I found in v1 myself.
>>>
>>> Basically, there are 3 optimizations in this patch-set:
>>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>>> prevent the A/D bits from changing.
>>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>>> do better/fewer flushes.
>>> 3. Avoid TLB flushes on permission demotion.
>>>
>>> Andrea asked for the aforementioned (2) to come after (3), but this
>>> is not simple (specifically since change_prot_numa() needs the number
>>> of pages affected).
>>
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels. If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches. With a cc:stable.
>
> I am confused, 1/5 doesn't actually do *anything*. I also cannot find
> any further usage of the introduced X86_BUG_PTE_LEAK.
>
> I'm thinking patch #2 means to have something like:
>
> if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> In the newly minted: pmdp_invalidate_ad(), but alas, nothing there.
This change was only intended for pmdp_invalidate_ad() but somehow
got lost. I will add it there.
I eventually did not add the optimization to avoid TLB flushes on
(!dirty|write)->!write so I did not use it for the first case that
you mentioned. I am too afraid, although I think this is correct.
Perhaps I will add it as a separate patch.
On 10/25/21 9:19 AM, Nadav Amit wrote:
> That was my first version, but I was concerned that perhaps there is
> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
> be set. That is the reason that Peter asked you whether this is
> something that might happen.
>
> If you confirm they cannot be both set, I would use the version you just
> mentioned.
I'm pretty sure they can't be set together on any sane hardware. A
bonkers hypervisor or CPU could do it of course, but they'd be crazy.
BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
That would be a nice place to talk about the assumption.
> On Oct 25, 2021, at 10:45 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/25/21 9:19 AM, Nadav Amit wrote:
>> That was my first version, but I was concerned that perhaps there is
>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>> be set. That is the reason that Peter asked you whether this is
>> something that might happen.
>>
>> If you confirm they cannot be both set, I would use the version you just
>> mentioned.
>
> I'm pretty sure they can't be set together on any sane hardware. A
> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
>
> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
> That would be a nice place to talk about the assumption.
>
I can do that. But be aware that if the assumption is broken, it might
lead to the application getting stuck in an infinite loop of
page-faults instead of receiving SIGSEGV.
On 10/25/21 10:51 AM, Nadav Amit wrote:
>> On Oct 25, 2021, at 10:45 AM, Dave Hansen <[email protected]> wrote:
>> On 10/25/21 9:19 AM, Nadav Amit wrote:
>>> That was my first version, but I was concerned that perhaps there is
>>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>>> be set. That is the reason that Peter asked you whether this is
>>> something that might happen.
>>>
>>> If you confirm they cannot be both set, I would use the version you just
>>> mentioned.
>> I'm pretty sure they can't be set together on any sane hardware. A
>> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
>>
>> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
>> That would be a nice place to talk about the assumption.
>>
> I can do that. But be aware that if the assumption is broken, it might
> lead to the application getting stuck in an infinite loop of
> page-faults instead of receiving SIGSEGV.
If we have a bonkers hypervisor/CPU, I'm OK with a process that hangs
like that, especially if we can ^C it and see its stream of page faults
with tracing or whatever.
Couldn't we just also do:
if ((code & (X86_PF_WRITE|X86_PF_INSN)) ==
    (X86_PF_WRITE|X86_PF_INSN)) {
WARN_ON_ONCE(1);
return 1;
}
That should give you the WARN_ON_ONCE() and also return an affirmative
access_error(), resulting in a SIGSEGV.
(I'm not sure I like the indentation as I wrote it here... just do what
looks best in the code)
On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
> Add a check to prevent access_error() from returning mistakenly that
> page-faults due to instruction fetch are not allowed. Intel SDM does not
> indicate whether "instruction fetch" and "write" in the hardware error
> code are mutually exclusive, so check both before returning whether the
> access is allowed.
Dave, can we get that clarified? It seems a bit naf and leads to
confusing code IMO.
Other than that, the change looks ok to me.
> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index b2eefdefc108..e776130473ce 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
> (error_code & X86_PF_INSTR), foreign))
> return 1;
>
> - if (error_code & X86_PF_WRITE) {
> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
> /* write, present and write, not present: */
> - if (unlikely(!(vma->vm_flags & VM_WRITE)))
> + if ((error_code & X86_PF_WRITE) &&
> + unlikely(!(vma->vm_flags & VM_WRITE)))
> return 1;
> +
> + /* exec, present and exec, not present: */
> + if ((error_code & X86_PF_INSTR) &&
> + unlikely(!(vma->vm_flags & VM_EXEC)))
> + return 1;
> +
> return 0;
> }
> On Oct 25, 2021, at 7:20 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> access_error() currently does not check for execution permission
>> violation.
> Ye
>
>> As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
>
> While I could totally believe that something is goofy when VMAs are
> being changed underneath a page fault, I'm having trouble figuring out
> why the "if (error_code & X86_PF_WRITE)" code is being modified.
In the scenario I mentioned the VMAs are not changed underneath the
page-fault. They change *before* the page-fault, but there are
residues of the old PTE in the TLB.
>
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
>
> Just to be clear, "promotion" is going from something like:
>
> W=0->W=1
> or
> NX=1->NX=0
>
> right? I tend to call that "relaxing" permissions.
I am specifically talking about NX=1->NX=0.
I can change the language to “relaxing”.
>
> Currently, X86_PF_WRITE faults are considered an access error unless the
> VMA to which the write occurred allows writes. Returning "no access
> error" permits continuing and handling the copy-on-write.
>
> It sounds like you want to expand that. You want to add a whole class
> of new faults that can be ignored: not just that some COW handling might
> be necessary, but that the PTE itself might be out of date. Just like
> a "COW fault" may just result in setting the PTE.W=1 and moving on with
> our day, an instruction fault might now just end up with setting
> PTE.NX=0 and also moving on with our day.
You raise an interesting idea (which can easily be implemented with uffd),
but no - I had none of that in my mind.
My only purpose is to deal with actual spurious page-faults that I
encountered when I removed the TLB flush that happens after NX=1->NX=0.
I am actually surprised that the kernel makes such a strong assumption
that every change of NX=1->NX=0 would be followed by a TLB flush, and
that during these changes the mm is locked for write. But that is the
case. If you do not have this change and a PTE is changed from
NX=1->NX=0 and the page is accessed *later*, you can get a page-fault
due to the stale TLB entry, and get a SIGSEGV since access_error()
wrongly assumes that this is an invalid access.
The VMA is not changed, and there are no changes to it during the
page-fault. The page-fault handler would do pretty much nothing and
return to user-space, which would retry the instruction. [ The
page-fault triggers an implicit TLB flush of the offending PTE. ]
>
> I'm really confused why the "error_code & X86_PF_WRITE" case is getting
> modified. I would have expected it to be something like just adding:
>
> /* read, instruction fetch */
> if (error_code & X86_PF_INSTR) {
> /* Avoid enforcing access error if spurious: */
> if (unlikely(!(vma->vm_flags & VM_EXEC)))
> return 1;
> return 0;
> }
>
> I'm really confused what X86_PF_WRITE and X86_PF_INSTR have in common
> other than both being able to (now) be generated spuriously.
That was my first version, but I was concerned that perhaps there is
some strange scenario in which both X86_PF_WRITE and X86_PF_INSTR can
be set. That is the reason that Peter asked you whether this is
something that might happen.
If you confirm they cannot both be set, I will use the version you just
mentioned.
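For illustration, if the two bits are indeed mutually exclusive, the check
could be split along the lines Dave sketched above. The following is only a
sketch of that shape (not the posted patch) and omits the earlier pkey/read
handling in access_error():

	/* Instruction fetch: may be spurious after NX was just cleared: */
	if (error_code & X86_PF_INSTR) {
		if (unlikely(!(vma->vm_flags & VM_EXEC)))
			return 1;
		return 0;
	}

	if (error_code & X86_PF_WRITE) {
		/* write, present and write, not present: */
		if (unlikely(!(vma->vm_flags & VM_WRITE)))
			return 1;
		return 0;
	}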
On 10/21/21 5:21 AM, Nadav Amit wrote:
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
> }
> }
>
> + if (c->x86_model == 87) {
> + pr_info_once("Enabling PTE leaking workaround\n");
> + set_cpu_bug(c, X86_BUG_PTE_LEAK);
> + }
Please take a look at:
arch/x86/include/asm/intel-family.h
specifically:
#define INTEL_FAM6_XEON_PHI_KNL 0x57 /* Knights Landing */
> On Oct 26, 2021, at 8:54 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>> }
>> }
>>
>> + if (c->x86_model == 87) {
>> + pr_info_once("Enabling PTE leaking workaround\n");
>> + set_cpu_bug(c, X86_BUG_PTE_LEAK);
>> + }
>
> Please take a look at:
>
> arch/x86/include/asm/intel-family.h
>
> specifically:
>
> #define INTEL_FAM6_XEON_PHI_KNL 0x57 /* Knights Landing */
Thanks, I will fix it. I really just copy-pasted from Andi’s patch
(for better or worse).
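For reference, the check rewritten with the symbolic name would look roughly
like this (an illustrative sketch, not the reposted hunk):

	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
		/* Knights Landing can leak A/D bits into non-present PTEs. */
		pr_info_once("Enabling PTE leaking workaround\n");
		set_cpu_bug(c, X86_BUG_PTE_LEAK);
	}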
On 10/21/21 5:21 AM, Nadav Amit wrote:
> The first TLB flush is only necessary to prevent the dirty bit (and with
> a lesser importance the access bit) from changing while the PTE is
> modified. However, this is not necessary as the x86 CPUs set the
> dirty-bit atomically with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing that has a bug and does
> not do so.
First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
see it anywhere.
> - * pmdp_invalidate() is required to make sure we don't miss
> - * dirty/young flags set by hardware.
This got me thinking... In here:
> https://lore.kernel.org/lkml/[email protected]/
I wrote:
> These bits are truly "stray". In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page. This means that we do not have to launder the bit(s); we
> can simply ignore them.
Is the goal of your proposed patch here to ensure that the dirty bit is
not set at *all*? Or, is it to ensure that a dirty bit which we need to
*launder* is never set?
On 10/22/21 2:58 PM, Nadav Amit wrote:
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels. If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches. With a cc:stable.
>
> There is no functionality bug in the kernel. The Knights Landing bug
> was circumvented eventually by changing the swap entry structure so
> the access/dirty bits would not overlap with the swap entry data.
Yeah, it was a significant issue, but we fixed it in here:
> commit 00839ee3b299303c6a5e26a0a2485427a3afcbbf
> Author: Dave Hansen <[email protected]>
> Date: Thu Jul 7 17:19:11 2016 -0700
>
> x86/mm: Move swap offset/type up in PTE to work around erratum
> On Oct 26, 2021, at 9:06 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> The first TLB flush is only necessary to prevent the dirty bit (and with
>> a lesser importance the access bit) from changing while the PTE is
>> modified. However, this is not necessary as the x86 CPUs set the
>> dirty-bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing that has a bug and does
>> not do so.
>
> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
> see it anywhere.
No, it is me who missed it. It should have been in pmdp_invalidate_ad():
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f14f64cc17b5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
}
+/*
+ * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
+ * updated by the CPU.
+ *
+ * Returns the original PTE.
+ *
+ * During an access to a page, x86 CPUs set the dirty and access bit atomically
+ * with an additional check of the present-bit. Therefore, it is possible to
+ * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
+ *
+ * We do not make this optimization on certain CPUs that have a bug that violates
+ * this behavior (specifically Knights Landing).
+ */
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+ if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+ flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return old;
+}
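For completeness, architectures that do not provide their own
pmdp_invalidate_ad() would presumably just fall back to the existing
pmdp_invalidate(). A rough sketch of such a generic fallback (the guard-macro
name is assumed here, not taken from the posted series):

#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
			 pmd_t *pmdp)
{
	/* No arch guarantee about A/D updates: full invalidate plus flush. */
	return pmdp_invalidate(vma, address, pmdp);
}
#endif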
>
>> - * pmdp_invalidate() is required to make sure we don't miss
>> - * dirty/young flags set by hardware.
>
> This got me thinking... In here:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> I wrote:
>
>> These bits are truly "stray". In the case of the Dirty bit, the
>> thread associated with the stray set was *not* allowed to write to
>> the page. This means that we do not have to launder the bit(s); we
>> can simply ignore them.
>
> Is the goal of your proposed patch here to ensure that the dirty bit is
> not set at *all*? Or, is it to ensure that a dirty bit which we need to
> *launder* is never set?
At *all*.
Err… I remembered from our previous discussions that the dirty bit cannot
be set once the R/W bit is cleared atomically. But going back to the SDM,
I see the (relatively new?) note:
"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry’s dirty flag being set (due to the write on the first
logical processor) and the entry’s R/W flag being clear (due to the update
to the entry on the second logical processor). This will never occur on a
processor that supports control-flow enforcement technology (CET)”
So I guess that this optimization can only be enabled when CET is enabled.
:(
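If the optimization really does have to be limited to CET-capable processors,
the helper above could additionally gate the flush on a CET feature bit. A
rough sketch follows; X86_FEATURE_SHSTK is only a placeholder here for
whichever flag ends up indicating CET support:

pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
			 pmd_t *pmdp)
{
	pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

	/*
	 * Flush unless the CPU is known to set the dirty bit atomically with
	 * the present-bit check (placeholder CET flag), and always flush on
	 * parts with the Knights Landing erratum.
	 */
	if (cpu_feature_enabled(X86_BUG_PTE_LEAK) ||
	    !cpu_feature_enabled(X86_FEATURE_SHSTK))
		flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

	return old;
}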
> On Oct 26, 2021, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>
>
>
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <[email protected]> wrote:
>>
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and with
>>> a lesser importance the access bit) from changing while the PTE is
>>> modified. However, this is not necessary as the x86 CPUs set the
>>> dirty-bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing that has a bug and does
>>> not do so.
>>
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
>> see it anywhere.
>
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
> return 0;
> }
>
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
> + * updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit atomically
> + * with an additional check of the present-bit. Therefore, it is possible to
> + * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
> + *
> + * We do not make this optimization on certain CPUs that have a bug that violates
> + * this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> + if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + return old;
> +}
>
>>
>>> - * pmdp_invalidate() is required to make sure we don't miss
>>> - * dirty/young flags set by hardware.
>>
>> This got me thinking... In here:
>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> I wrote:
>>
>>> These bits are truly "stray". In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page. This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>>
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*? Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
>
> At *all*.
>
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
>
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry’s dirty flag being set (due to the write on the first
> logical processor) and the entry’s R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)”
>
> So I guess that this optimization can only be enabled when CET is enabled.
>
> :(
I still wonder whether the SDM comment applies to present bit vs dirty
bit atomicity as well.
On AMD’s APM I find:
"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates
with respect to surrounding loads and stores is discussed below.”
(The latter comment concerns ordering with respect to WC memory.)
I don’t know if I read it too creatively...
On 10/26/21 10:44 AM, Nadav Amit wrote:
>> "If software on one logical processor writes to a page while software on
>> another logical processor concurrently clears the R/W flag in the
>> paging-structure entry that maps the page, execution on some processors may
>> result in the entry’s dirty flag being set (due to the write on the first
>> logical processor) and the entry’s R/W flag being clear (due to the update
>> to the entry on the second logical processor). This will never occur on a
>> processor that supports control-flow enforcement technology (CET)”
>>
>> So I guess that this optimization can only be enabled when CET is enabled.
>>
>> :(
> I still wonder whether the SDM comment applies to present bit vs dirty
> bit atomicity as well.
I think it's implicit. From "4.8 ACCESSED AND DIRTY FLAGS":
"Whenever there is a write to a linear address, the processor
sets the dirty flag (if it is not already set) in the paging-
structure entry"
There can't be a "write to a linear address" without a Present=1 PTE.
If it were a Dirty=1,Present=1 PTE, there's no race because there might
not be a write to the PTE at all.
There's also this from the "4.10.4.3 Optional Invalidation" section:
"no TLB entry or paging-structure cache entry is created with
information from a paging-structure entry in which the P flag
is 0."
That means that we don't have to worry about the TLB doing something
bonkers like caching a Dirty=1 bit from a Present=0 PTE.
Is that what you were worried about?
> On Oct 26, 2021, at 11:44 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/26/21 10:44 AM, Nadav Amit wrote:
>>> "If software on one logical processor writes to a page while software on
>>> another logical processor concurrently clears the R/W flag in the
>>> paging-structure entry that maps the page, execution on some processors may
>>> result in the entry’s dirty flag being set (due to the write on the first
>>> logical processor) and the entry’s R/W flag being clear (due to the update
>>> to the entry on the second logical processor). This will never occur on a
>>> processor that supports control-flow enforcement technology (CET)”
>>>
>>> So I guess that this optimization can only be enabled when CET is enabled.
>>>
>>> :(
>> I still wonder whether the SDM comment applies to present bit vs dirty
>> bit atomicity as well.
>
> I think it's implicit. From "4.8 ACCESSED AND DIRTY FLAGS":
>
> "Whenever there is a write to a linear address, the processor
> sets the dirty flag (if it is not already set) in the paging-
> structure entry"
>
> There can't be a "write to a linear address" without a Present=1 PTE.
> If it were a Dirty=1,Present=1 PTE, there's no race because there might
> not be a write to the PTE at all.
>
> There's also this from the "4.10.4.3 Optional Invalidation" section:
>
> "no TLB entry or paging-structure cache entry is created with
> information from a paging-structure entry in which the P flag
> is 0."
>
> That means that we don't have to worry about the TLB doing something
> bonkers like caching a Dirty=1 bit from a Present=0 PTE.
>
> Is that what you were worried about?
Thanks Dave, but no - that is not my concern.
To make it very clear - consider the following scenario, in which
a volatile pointer p is mapped using a certain PTE, which is RW
(i.e., *p is writable):
CPU0                            CPU1
----                            ----
x = *p
[ PTE cached in TLB;
  PTE is not dirty ]
                                clear_pte(PTE)
*p = x
[ needs to set dirty ]
Note that there is no TLB flush in this scenario. The question
is whether the write access to *p would succeed, setting the
dirty bit on the clear, non-present entry.
I was under the impression that the hardware AD-assist would
recheck the PTE atomically as it sets the dirty bit. But, as I
said, I am not sure anymore whether this is defined architecturally
(or at least would work in practice on all CPUs modulo the
Knights Landing thingy).
On 10/26/21 12:06 PM, Nadav Amit wrote:
>
> To make it very clear - consider the following scenario, in which
> a volatile pointer p is mapped using a certain PTE, which is RW
> (i.e., *p is writable):
>
> CPU0                            CPU1
> ----                            ----
> x = *p
> [ PTE cached in TLB;
>   PTE is not dirty ]
>                                 clear_pte(PTE)
> *p = x
> [ needs to set dirty ]
>
> Note that there is no TLB flush in this scenario. The question
> is whether the write access to *p would succeed, setting the
> dirty bit on the clear, non-present entry.
>
> I was under the impression that the hardware AD-assist would
> recheck the PTE atomically as it sets the dirty bit. But, as I
> said, I am not sure anymore whether this is defined architecturally
> (or at least would work in practice on all CPUs modulo the
> Knights Landing thingy).
Practically, at "x=*p", the thing that gets cached in the TLB will be
Dirty=0. At the "*p=x", the CPU will decide it needs to do a write,
find the Dirty=0 entry and will entirely discard it. In other words, it
*acts* roughly like this:
x = *p
INVLPG(p)
*p = x;
Where the INVLPG() and the "*p=x" are atomic. So, there's no
_practical_ problem with your scenario. This specific behavior isn't
architectural as far as I know, though.
Although it's pretty much just academic, as for the architecture, are
you getting hung up on the difference between the description of "Accessed":
Whenever the processor uses a paging-structure entry as part of
linear-address translation, it sets the accessed flag in that
entry
and "Dirty:"
Whenever there is a write to a linear address, the processor
sets the dirty flag (if it is not already set) in the paging-
structure entry...
Accessed says "as part of linear-address translation", which means that
the address must have a translation. But, the "Dirty" section doesn't
say that. It talks about "a write to a linear address" but not whether
there is a linear address *translation* involved.
If that's it, we could probably add a bit like:
In addition to setting the accessed flag, whenever there is a
write...
before the dirty rules in the SDM.
Or am I being dense and continuing to miss your point? :)
> On Oct 26, 2021, at 12:40 PM, Dave Hansen <[email protected]> wrote:
>
> On 10/26/21 12:06 PM, Nadav Amit wrote:
>>
>> To make it very clear - consider the following scenario, in which
>> a volatile pointer p is mapped using a certain PTE, which is RW
>> (i.e., *p is writable):
>>
>> CPU0                            CPU1
>> ----                            ----
>> x = *p
>> [ PTE cached in TLB;
>>   PTE is not dirty ]
>>                                 clear_pte(PTE)
>> *p = x
>> [ needs to set dirty ]
>>
>> Note that there is no TLB flush in this scenario. The question
>> is whether the write access to *p would succeed, setting the
>> dirty bit on the clear, non-present entry.
>>
>> I was under the impression that the hardware AD-assist would
>> recheck the PTE atomically as it sets the dirty bit. But, as I
>> said, I am not sure anymore whether this is defined architecturally
>> (or at least would work in practice on all CPUs modulo the
>> Knights Landing thingy).
>
> Practically, at "x=*p", the thing that gets cached in the TLB will be
> Dirty=0. At the "*p=x", the CPU will decide it needs to do a write,
> find the Dirty=0 entry and will entirely discard it. In other words, it
> *acts* roughly like this:
>
> x = *p
> INVLPG(p)
> *p = x;
>
> Where the INVLPG() and the "*p=x" are atomic. So, there's no
> _practical_ problem with your scenario. This specific behavior isn't
> architectural as far as I know, though.
>
> Although it's pretty much just academic, as for the architecture, are
> you getting hung up on the difference between the description of "Accessed":
>
> Whenever the processor uses a paging-structure entry as part of
> linear-address translation, it sets the accessed flag in that
> entry
>
> and "Dirty:"
>
> Whenever there is a write to a linear address, the processor
> sets the dirty flag (if it is not already set) in the paging-
> structure entry...
>
> Accessed says "as part of linear-address translation", which means that
> the address must have a translation. But, the "Dirty" section doesn't
> say that. It talks about "a write to a linear address" but not whether
> there is a linear address *translation* involved.
>
> If that's it, we could probably add a bit like:
>
> In addition to setting the accessed flag, whenever there is a
> write...
>
> before the dirty rules in the SDM.
>
> Or am I being dense and continuing to miss your point? :)
I think this time you got my question right.
I was thrown off by the SDM comment on RW permissions vs dirty that I
mentioned before:
"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry’s dirty flag being set (due to the write on the first
logical processor) and the entry’s R/W flag being clear (due to the update
to the entry on the second logical processor).”
I did not pay enough attention to these small differences that you mentioned
between access and dirty this time (although I did notice them before).
I do not think that the change that you offered to the SDM really clarifies
the situation. Setting the access flag is done as part of caching the PTE in
the TLB. The SDM change you propose does not clarify the atomicity of the
permission/PTE-validity check and the dirty-bit setting, or the fact that the
cached PTE is invalidated if the dirty-bit needs to be set while it is cached
as clear. [I do not
presume you would want the latter in the SDM, since it is an implementation
detail.]
I just wonder how come the R/W-clearing and the P-clearing cause concurrent
dirty bit setting to behave differently. I am not a hardware guy, but I would
imagine they would be the same...
On 10/26/21 1:07 PM, Nadav Amit wrote:
> I just wonder how come the R/W-clearing and the P-clearing cause concurrent
> dirty bit setting to behave differently. I am not a hardware guy, but I would
> imagine they would be the same...
First of all, I think the non-atomic properties where a PTE can go:
W=1,D=0 // original
W=0,D=0 // software clears W
W=0,D=1 // hardware sets D
were a total implementation accident. It wasn't someone being clever, and
since the behavior was architecturally allowed and well-tolerated by
software, it was around for a while. I think I was the one that asked
that it get fixed for shadow stacks, and nobody pushed back on it too
hard as far as I remember. I don't think it was super hard to fix.
Why do the Present/Accessed and Write/Dirty pairs act differently? I
think it's a total implementation accident and wasn't by design.
The KNL erratum was an erratum and wasn't codified in the architecture
because it actually broke things. The pre-CET Write/Dirty behavior
didn't break software to a level where it was considered an erratum. It gets
to live on as allowed in the architecture.