2022-03-09 04:10:43

by Nadav Amit

[permalink] [raw]
Subject: [PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes

From: Nadav Amit <[email protected]>

This patch-set is intended to remove unnecessary TLB flushes during
mprotect() syscalls. Once this patch-set make it through, similar
and further optimizations for MADV_COLD and userfaultfd would be
possible.

Sorry for the time between it took me to get to v3.

Basically, there are 3 optimizations in this patch-set:
1. Use TLB batching infrastructure to batch flushes across VMAs and
do better/fewer flushes. This would also be handy for later
userfaultfd enhancements.
2. Avoid TLB flushes on permission demotion. This optimization is
the one that provides most of the performance benefits. Note that
the previous batching infrastructure changes are needed for that to
happen.
3. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.

Andrew asked for some benchmark numbers. I do not have an easy
determinate macrobenchmark in which it is easy to show benefit. I therre
ran a microbenchmark: a loop that does the following on anonymous
memory, just as a sanity check to see that time is saved by avoiding TLB
flushes. The loop goes:

mprotect(p, PAGE_SIZE, PROT_READ)
mprotect(p, PAGE_SIZE, PROT_READ|PROT_WRITE)
*p = 0; // make the page writable

The test was run in KVM guest with 1 or 2 threads (the second thread
was busy-looping). I measured the time (cycles) of each operation:

1 thread 2 threads
mmots +patch mmots +patch
PROT_READ 3494 2725 (-22%) 8630 7788 (-10%)
PROT_READ|WRITE 3952 2724 (-31%) 9075 2865 (-68%)

[ mmots = v5.17-rc6-mmots-2022-03-06-20-38 ]

The exact numbers are really meaningless, but the benefit is clear.
There are 2 interesting results though.

(1) PROT_READ is cheaper, while one can expect it not to be affected.
This is presumably due to TLB miss that is saved

(2) Without memory access (*p = 0), the speedup of the patch is even
greater. In that scenario mprotect(PROT_READ) also avoids the TLB flush.
As a result both operations on the patched kernel take roughly ~1500
cycles (with either 1 or 2 threads), whereas on mmotm their cost is as
high as presented in the table.


Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]

--

v2 -> v3:
* Fix orders of patches (order could lead to breakage)
* Better comments
* Clearer KNL detection [Dave]
* Assertion on PF error-code [Dave]
* Comments, code, function names improvements [PeterZ]
* Flush on access-bit clearing on PMD changes to follow the way
flushing on x86 is done today in the kernel.

v1 -> v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()

Nadav Amit (5):
x86: Detection of Knights Landing A/D leak
x86/mm: check exec permissions on fault
mm/mprotect: use mmu_gather
mm/mprotect: do not flush on permission promotion
mm: avoid unnecessary flush on change_huge_pmd()

arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/pgtable.h | 5 ++
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 82 ++++++++++++++++++++++++
arch/x86/kernel/cpu/intel.c | 5 ++
arch/x86/mm/fault.c | 22 ++++++-
arch/x86/mm/pgtable.c | 10 +++
fs/exec.c | 6 +-
include/asm-generic/tlb.h | 14 +++++
include/linux/huge_mm.h | 5 +-
include/linux/mm.h | 5 +-
include/linux/pgtable.h | 20 ++++++
mm/huge_memory.c | 19 ++++--
mm/mprotect.c | 94 +++++++++++++++-------------
mm/pgtable-generic.c | 8 +++
mm/userfaultfd.c | 6 +-
16 files changed, 248 insertions(+), 56 deletions(-)

--
2.25.1


2022-03-09 04:10:47

by Nadav Amit

[permalink] [raw]
Subject: [PATCH v3 3/5] mm/mprotect: use mmu_gather

From: Nadav Amit <[email protected]>

change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flushes scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents opportunities to avoid TLB flushes or perform them
in finer granularity.

The use of mmu_gather for modified PTEs has benefits in various
scenarios even if pages are not released. For instance, if only a single
page needs to be flushed out of a range of many pages, only that page
would be flushed. If a THP page is flushed, on x86 a single TLB invlpg
instruction can be used instead of 512 instructions (or a full TLB
flush, which would Linux would actually use by default). mprotect() over
multiple VMAs requires a single flush.

Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().

Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
fs/exec.c | 6 ++-
include/linux/huge_mm.h | 5 ++-
include/linux/mm.h | 5 ++-
mm/huge_memory.c | 10 ++++-
mm/mprotect.c | 93 ++++++++++++++++++++++-------------------
mm/userfaultfd.c | 6 ++-
6 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index a39108c1190a..1c46b7d7c32c 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -759,6 +759,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
unsigned long stack_size;
unsigned long stack_expand;
unsigned long rlim_stack;
+ struct mmu_gather tlb;

#ifdef CONFIG_STACK_GROWSUP
/* Limit stack size */
@@ -813,8 +814,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
vm_flags |= mm->def_flags;
vm_flags |= VM_STACK_INCOMPLETE_SETUP;

- ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
+ tlb_gather_mmu(&tlb, mm);
+ ret = mprotect_fixup(&tlb, vma, &prev, vma->vm_start, vma->vm_end,
vm_flags);
+ tlb_finish_mmu(&tlb);
+
if (ret)
goto out_unlock;
BUG_ON(prev != vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index 2999190adc22..9a26bd10e083 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,8 +36,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
unsigned long addr);
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
- pgprot_t newprot, unsigned long cp_flags);
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags);
vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
pgprot_t pgprot, bool write);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index dc69d2a69912..b2574bbdb76d 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1969,10 +1969,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)

-extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags);
-extern int mprotect_fixup(struct vm_area_struct *vma,
+extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 3557aabe86fe..d58a5b498011 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1692,8 +1692,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
* or if prot_numa but THP migration is not supported
* - HPAGE_PMD_NR if protections changed and TLB flush necessary
*/
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1704,6 +1705,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
if (prot_numa && !thp_migration_supported())
return 1;

@@ -1799,6 +1802,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
+
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
spin_unlock(ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index b69ce7a7b2b7..f9730bac2d78 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -33,12 +33,13 @@
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>

#include "internal.h"

-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
@@ -49,6 +50,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

+ tlb_change_page_size(tlb, PAGE_SIZE);
+
/*
* Can be called with only the mmap_lock for reading by
* prot_numa so we must check the pmd isn't constantly
@@ -149,6 +152,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -230,9 +234,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pmd_t *pmd;
unsigned long next;
@@ -272,8 +276,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
- int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot, cp_flags);
+ /*
+ * change_huge_pmd() does not defer TLB flushes,
+ * so no need to propagate the tlb argument.
+ */
+ int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ addr, newprot, cp_flags);

if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
@@ -287,8 +295,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
/* fall through, the trans huge pmd just split */
}
- this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- cp_flags);
+ this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+ newprot, cp_flags);
pages += this_pages;
next:
cond_resched();
@@ -302,9 +310,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
return pages;
}

-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- p4d_t *p4d, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pud_t *pud;
unsigned long next;
@@ -315,16 +323,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- pages += change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
cp_flags);
} while (pud++, addr = next, addr != end);

return pages;
}

-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
p4d_t *p4d;
unsigned long next;
@@ -335,44 +343,40 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- pages += change_pud_range(vma, p4d, addr, next, newprot,
+ pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
cp_flags);
} while (p4d++, addr = next, addr != end);

return pages;
}

-static unsigned long change_protection_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_protection_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;
- unsigned long start = addr;
unsigned long pages = 0;

BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
- flush_cache_range(vma, addr, end);
- inc_tlb_flush_pending(mm);
+ tlb_start_vma(tlb, vma);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_p4d_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
cp_flags);
} while (pgd++, addr = next, addr != end);

- /* Only flush the TLB if we actually modified any entries: */
- if (pages)
- flush_tlb_range(vma, start, end);
- dec_tlb_flush_pending(mm);
+ tlb_end_vma(tlb, vma);

return pages;
}

-unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags)
{
@@ -383,7 +387,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
- pages = change_protection_range(vma, start, end, newprot,
+ pages = change_protection_range(tlb, vma, start, end, newprot,
cp_flags);

return pages;
@@ -417,8 +421,9 @@ static const struct mm_walk_ops prot_none_walk_ops = {
};

int
-mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
- unsigned long start, unsigned long end, unsigned long newflags)
+mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ struct vm_area_struct **pprev, unsigned long start,
+ unsigned long end, unsigned long newflags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long oldflags = vma->vm_flags;
@@ -505,7 +510,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
vma_set_page_prot(vma);

- change_protection(vma, start, end, vma->vm_page_prot,
+ change_protection(tlb, vma, start, end, vma->vm_page_prot,
dirty_accountable ? MM_CP_DIRTY_ACCT : 0);

/*
@@ -539,6 +544,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
(prot & PROT_READ);
+ struct mmu_gather tlb;

start = untagged_addr(start);

@@ -598,6 +604,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
else
prev = vma->vm_prev;

+ tlb_gather_mmu(&tlb, current->mm);
for (nstart = start ; ; ) {
unsigned long mask_off_old_flags;
unsigned long newflags;
@@ -624,18 +631,18 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
/* newflags >> 4 shift VM_MAY% in place of VM_% */
if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) {
error = -EACCES;
- goto out;
+ goto out_tlb;
}

/* Allow architectures to sanity-check the new flags */
if (!arch_validate_flags(newflags)) {
error = -EINVAL;
- goto out;
+ goto out_tlb;
}

error = security_file_mprotect(vma, reqprot, prot);
if (error)
- goto out;
+ goto out_tlb;

tmp = vma->vm_end;
if (tmp > end)
@@ -644,27 +651,29 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (vma->vm_ops && vma->vm_ops->mprotect) {
error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;
}

- error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+ error = mprotect_fixup(&tlb, vma, &prev, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;

nstart = tmp;

if (nstart < prev->vm_end)
nstart = prev->vm_end;
if (nstart >= end)
- goto out;
+ goto out_tlb;

vma = prev->vm_next;
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
- goto out;
+ goto out_tlb;
}
prot = reqprot;
}
+out_tlb:
+ tlb_finish_mmu(&tlb);
out:
mmap_write_unlock(current->mm);
return error;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index e9bb6db002aa..0fd7fc8010d7 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
#include <linux/hugetlb.h>
#include <linux/shmem_fs.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include "internal.h"

static __always_inline
@@ -687,6 +688,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
atomic_t *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ struct mmu_gather tlb;
pgprot_t newprot;
int err;

@@ -728,8 +730,10 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
else
newprot = vm_get_page_prot(dst_vma->vm_flags);

- change_protection(dst_vma, start, start + len, newprot,
+ tlb_gather_mmu(&tlb, dst_mm);
+ change_protection(&tlb, dst_vma, start, start + len, newprot,
enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+ tlb_finish_mmu(&tlb);

err = 0;
out_unlock:
--
2.25.1

2022-03-09 04:10:49

by Nadav Amit

[permalink] [raw]
Subject: [PATCH v3 1/5] x86: Detection of Knights Landing A/D leak

From: Nadav Amit <[email protected]>

Knights Landing has a issue that a thread setting A or D bits may not do
so atomically against checking the present bit. A thread which is going
to page fault may still set those bits, even though the present bit was
already atomically cleared.

This implies that when the kernel clears present atomically, some time
later the supposed to be zero entry could be corrupted with stray A or D
bits.

Since the PTE could be already used for storing a swap index, or a NUMA
migration index, this cannot be tolerated. Most of the time the kernel
detects the problem, but in some rare cases it may not.

This patch adds an interface to detect the bug, which will be used in a
following patch.

Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/intel.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index 67ef0e81c7dc..184b299dbf12 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -442,5 +442,6 @@
#define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */
#define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */
#define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */
+#define X86_BUG_PTE_LEAK X86_BUG(25) /* PTE may leak A/D bits after clear */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..74780fef3f12 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
}

+ if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
+ pr_info_once("Enabling PTE leaking workaround\n");
+ set_cpu_bug(c, X86_BUG_PTE_LEAK);
+ }
+
/*
* Intel Quark Core DevMan_001.pdf section 6.4.11
* "The operating system also is required to invalidate (i.e., flush)
--
2.25.1

2022-03-09 04:11:25

by Nadav Amit

[permalink] [raw]
Subject: [PATCH v3 5/5] mm: avoid unnecessary flush on change_huge_pmd()

From: Nadav Amit <[email protected]>

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.

The first TLB flush is only necessary to prevent the dirty bit (and with
a lesser importance the access bit) from changing while the PTE is
modified. However, this is not necessary as the x86 CPUs set the
dirty-bit atomically with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing that has a bug and does
not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch specific pmdp_invalidate_ad()
that only invalidates the access and dirty bit from further changes.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable.h | 5 +++++
arch/x86/mm/pgtable.c | 10 ++++++++++
include/linux/pgtable.h | 20 ++++++++++++++++++++
mm/huge_memory.c | 4 ++--
mm/pgtable-generic.c | 8 ++++++++
5 files changed, 45 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 62ab07e24aef..23ad34edcc4b 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1173,6 +1173,11 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
}
#endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..b2fcb2c749ce 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -608,6 +608,16 @@ int pmdp_clear_flush_young(struct vm_area_struct *vma,

return young;
}
+
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+ if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+ flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return old;
+}
#endif

/**
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index f4f4077b97aa..5826e8e52619 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -570,6 +570,26 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif

+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+
+/*
+ * pmdp_invalidate_ad() invalidates the PMD while changing a transparent
+ * hugepage mapping in the page tables. This function is similar to
+ * pmdp_invalidate(), but should only be used if the access and dirty bits would
+ * not be cleared by the software in the new PMD value. The function ensures
+ * that hardware changes of the access and dirty bits updates would not be lost.
+ *
+ * Doing so can allow in certain architectures to avoid a TLB flush in most
+ * cases. Yet, another TLB flush might be necessary later if the PMD update
+ * itself requires such flush (e.g., if protection was set to be stricter). Yet,
+ * even when a TLB flush is needed because of the update, the caller may be able
+ * to batch these TLB flushing operations, so fewer TLB flush operations are
+ * needed.
+ */
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 51b0f3cb1ba0..691d80edcfd7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1781,10 +1781,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* The race makes MADV_DONTNEED miss the huge pmd and don't clear it
* which may break userspace.
*
- * pmdp_invalidate() is required to make sure we don't miss
+ * pmdp_invalidate_ad() is required to make sure we don't miss
* dirty/young flags set by hardware.
*/
- oldpmd = pmdp_invalidate(vma, addr, pmd);
+ oldpmd = pmdp_invalidate_ad(vma, addr, pmd);

entry = pmd_modify(oldpmd, newprot);
if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 6523fda274e5..90ab721a12a8 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -201,6 +201,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
}
#endif

+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
#ifndef pmdp_collapse_flush
pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
--
2.25.1

2022-03-09 04:14:26

by Nadav Amit

[permalink] [raw]
Subject: [PATCH v3 4/5] mm/mprotect: do not flush on permission promotion

From: Nadav Amit <[email protected]>

Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. At least on x86, as
protection is promoted, no TLB flush is needed.

Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits does require a flush, also add logic that
considers the dirty-bit. If the dirty-bit is clear and write-protect is
set, no TLB flush is needed, as x86 updates the dirty-bit atomically
on write, and if the bit is clear, the PTE is reread.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 82 ++++++++++++++++++++++++++++
include/asm-generic/tlb.h | 14 +++++
mm/huge_memory.c | 9 +--
mm/mprotect.c | 3 +-
5 files changed, 105 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..8668bc661026 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -110,9 +110,11 @@
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
#else
#define _PAGE_NX (_AT(pteval_t, 0))
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
+#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif

#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 98fa0a114074..ec01e6cff137 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,88 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

+/*
+ * The enabled_mask tells which bits that were present and gets cleared require
+ * flush.
+ *
+ * The disabled_mask tells which bits that were missing and gets set require
+ * flush.
+ *
+ * All the other bits except the ignored bits will require a flush no matter if
+ * they gets set or cleared.
+ *
+ * This function ignores the global bit, as it is used for protnone. This
+ * function should therefore not be used if the global bit might really be
+ * cleared.
+ *
+ * The function allows to ignore_access flags (provide _PAGE_ACCESS as argument
+ * to do so). The expected use is that the access bit would be ignored for PTEs
+ * but not for PMDs. That is the way the kernel perform TLB flushes after
+ * updates of the access-bit in other situations.
+ */
+static inline bool pte_flags_need_flush(unsigned long oldflags,
+ unsigned long newflags,
+ pteval_t ignore_access)
+{
+ const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+ _PAGE_SOFTW3 | _PAGE_SOFTW4 | _PAGE_GLOBAL | ignore_access;
+ const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT |
+ (_PAGE_ACCESSED & ~ignore_access);
+ const pteval_t disable_mask = _PAGE_NX;
+ unsigned long diff = oldflags ^ newflags;
+
+ VM_BUG_ON(ignore_access != 0 && ignore_access != _PAGE_ACCESSED);
+
+ return diff & ((oldflags & enable_mask) |
+ (newflags & disable_mask) |
+ ~(enable_mask | disable_mask | ignore_mask));
+}
+
+/*
+ * pte_needs_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace PTEs.
+ */
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+ /* !PRESENT -> * ; no need for flush */
+ if (!pte_present(oldpte))
+ return false;
+
+ /* PRESENT -> !PRESENT ; needs flush */
+ if (!pte_present(newpte))
+ return true;
+
+ /* PFN changed ; needs flush */
+ if (pte_pfn(oldpte) != pte_pfn(newpte))
+ return true;
+
+ return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+ _PAGE_ACCESSED);
+}
+#define pte_needs_flush pte_needs_flush
+
+/*
+ * huge_pmd_needs_flush() checks whether permissions were demoted and
+ * require a flush. It should only be used for userspace huge PMDs.
+ */
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ /* !PRESENT -> * ; no need for flush */
+ if (!pmd_present(oldpmd))
+ return false;
+
+ /* PRESENT -> !PRESENT ; needs flush */
+ if (!pmd_present(newpmd))
+ return true;
+
+ /* PFN changed ; needs flush */
+ if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+ return true;
+
+ return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd), 0);
+}
+#define huge_pmd_needs_flush huge_pmd_needs_flush
+
#endif /* !MODULE */

static inline void __native_tlb_flush_global(unsigned long cr4)
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index fd7feb5c7894..3a30e23fa35d 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,20 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
} while (0)
#endif

+#ifndef pte_needs_flush
+static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+{
+ return true;
+}
+#endif
+
+#ifndef huge_pmd_needs_flush
+static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ return true;
+}
+#endif
+
#endif /* CONFIG_MMU */

#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index d58a5b498011..51b0f3cb1ba0 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1698,7 +1698,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
- pmd_t entry;
+ pmd_t oldpmd, entry;
bool preserve_write;
int ret;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -1784,9 +1784,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* pmdp_invalidate() is required to make sure we don't miss
* dirty/young flags set by hardware.
*/
- entry = pmdp_invalidate(vma, addr, pmd);
+ oldpmd = pmdp_invalidate(vma, addr, pmd);

- entry = pmd_modify(entry, newprot);
+ entry = pmd_modify(oldpmd, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
if (uffd_wp) {
@@ -1803,7 +1803,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);

- tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+ if (huge_pmd_needs_flush(oldpmd, entry))
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);

BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index f9730bac2d78..97967d589ddb 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -152,7 +152,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+ if (pte_needs_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
2.25.1

2022-03-10 01:34:16

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes

On Tue, 8 Mar 2022 20:10:38 -0800 Nadav Amit <[email protected]> wrote:

> This patch-set is intended to remove unnecessary TLB flushes during
> mprotect() syscalls. Once this patch-set make it through, similar
> and further optimizations for MADV_COLD and userfaultfd would be
> possible.

Thanks. I'll save this away for consideration after the -rc1 release.
Which gives others plenty of time for review (hint) (request).

2022-03-11 22:33:40

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v3 0/5] mm/mprotect: avoid unnecessary TLB flushes



> On Mar 9, 2022, at 1:40 PM, Andrew Morton <[email protected]> wrote:
>
> On Tue, 8 Mar 2022 20:10:38 -0800 Nadav Amit <[email protected]> wrote:
>
>> This patch-set is intended to remove unnecessary TLB flushes during
>> mprotect() syscalls. Once this patch-set make it through, similar
>> and further optimizations for MADV_COLD and userfaultfd would be
>> possible.
>
> Thanks. I'll save this away for consideration after the -rc1 release.
> Which gives others plenty of time for review (hint) (request).
>

I see that some patches did not make it through to the mailing list.

I will resend later today.