2022-07-18 19:52:49

by Nadav Amit

Subject: [RFC PATCH 00/14] mm: relaxed TLB flushes and other optimizations

From: Nadav Amit <[email protected]>

Following the optimizations to avoid unnecessary TLB flushes [1],
mprotect() and userfaultfd() no longer cause TLB flushes when the
protection is unchanged. This enabled userfaultfd to write-unprotect a
page without triggering a TLB flush (and potentially a shootdown).

After these changes, David added another feature to mprotect [2],
allowing pages that can safely be mapped as writable to be mapped as
such directly from mprotect(), instead of going through the page-fault
handler. This saves the overhead of a page-fault when write-unprotecting
private exclusive pages, for instance.

This change, however, introduced some undesired behaviors, especially
if we adopt this new feature for userfaultfd. First, the newly mapped
PTE is not set as dirty, which on x86 can induce over 500 cycles of
overhead (if the page was not dirty before). Second, once again we can
end up with an expensive TLB shootdown when we write-unprotect a page:
when we relax the protection (i.e., grant more permissions), we would
still do a TLB flush. If the application is multithreaded, or a
userfaultfd monitor uses write-unprotect (which is a common case), a TLB
shootdown would be needed.

This patch-set allows userfaultfd to map pages as writable directly
upon the write-(un)protect ioctl, while addressing the undesired
behaviors that occur when userfaultfd write-unprotect or mprotect is
used to add permissions. It also does some cleanup and
micro-optimizations along the way.

The main change in the patch-set - x86-specific, at the moment - is the
introduction of "relaxed" TLB flushes when permissions are added. Upon a
"relaxed" TLB flush, the mm's TLB generation is advanced and the local
TLB is flushed, but no TLB shootdown takes place. If a spurious
page-fault occurs and the local TLB generation is found to be out of
sync with the mm generation, a full TLB flush is performed on the
faulting core to prevent further spurious page-faults.
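
Conceptually, a relaxed flush and the matching spurious-fault fixup boil
down to the following simplified sketch (illustrative only, not the
actual patch code; in the real patches the fixup goes through
flush_tlb_func() so that the per-CPU generation tracking is also
updated):

/*
 * Sketch of a "relaxed" flush: advance the mm's TLB generation and
 * invalidate only the local TLB; no IPIs are sent to remote CPUs.
 */
static void relaxed_flush_sketch(struct mm_struct *mm)
{
	inc_mm_tlb_gen(mm);	/* record that the page tables changed */
	flush_tlb_local();	/* flush only the current CPU */
}

/*
 * Sketch of the spurious-fault fixup: a CPU whose local TLB generation
 * lags behind the mm generation performs a full local flush.
 */
static void spurious_fault_fixup_sketch(struct mm_struct *mm)
{
	u32 asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
	u64 mm_gen = atomic64_read(&mm->context.tlb_gen);
	u64 local_gen = this_cpu_read(cpu_tlbstate.ctxs[asid].tlb_gen);

	if (local_gen != mm_gen)
		flush_tlb_local();
}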

To a certain extent, "relaxed flushes" are similar to the changes that
were proposed some time ago for kernel mappings [3]. However, this
approach does not have any complicated interactions with NMI handlers.

Experiments on Haswell show the performance improvement. Running, for a
single page, a loop of (1) mprotect(PROT_READ); (2)
mprotect(PROT_READ|PROT_WRITE); and then (3) an access to the page gives
the following results (on bare metal this time):

mprotect(PROT_READ) time in cycles:

                     1 Thread    2 Threads
Before (5.19rc4+)    2499        4655
+patch               2495        4363  (-6%)


mprotect(PROT_READ|PROT_WRITE) time in cycles:

                     1 Thread    2 Threads
Before (5.19rc4+)    2529        4675
+patch               2496        2615  (-44%)

If MADV_FREE was run or the page was not dirty, this patch-set can also
shorten the mprotect(PROT_READ) time by skipping the TLB shootdown.
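
For reference, the loop measured above is essentially the following
(an assumed reconstruction, not the actual benchmark code; the timing
instrumentation around the mprotect() calls is omitted):

#include <stddef.h>
#include <sys/mman.h>

/*
 * Assumed reconstruction of the measured loop. Steps (1) and (2) are
 * the timed mprotect() calls from the tables above; step (3) touches
 * the page so the next iteration starts from a present, writable
 * mapping.
 */
static void bench_iteration(char *page, size_t page_size)
{
	mprotect(page, page_size, PROT_READ);			/* (1) */
	mprotect(page, page_size, PROT_READ | PROT_WRITE);	/* (2) */
	*(volatile char *)page = 1;				/* (3) */
}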

[1] https://lore.kernel.org/all/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/
[3] https://lore.kernel.org/all/[email protected]/

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: Axel Rasmussen <[email protected]>
Cc: Mike Rapoport <[email protected]>

Nadav Amit (14):
userfaultfd: set dirty and young on writeprotect
userfaultfd: try to map write-unprotected pages
mm/mprotect: allow exclusive anon pages to be writable
mm/mprotect: preserve write with MM_CP_TRY_CHANGE_WRITABLE
x86/mm: check exec permissions on fault
mm/rmap: avoid flushing on page_vma_mkclean_one() when possible
mm: do fix spurious page-faults for instruction faults
x86/mm: introduce flush_tlb_fix_spurious_fault
mm: introduce relaxed TLB flushes
x86/mm: introduce relaxed TLB flushes
x86/mm: use relaxed TLB flushes when protection is removed
x86/tlb: no flush on PTE change from RW->RO when PTE is clean
mm/mprotect: do not check flush type if a strict flush is needed
mm: conditional check of pfn in pte_flush_type

arch/x86/include/asm/pgtable.h | 4 +-
arch/x86/include/asm/tlb.h | 3 +-
arch/x86/include/asm/tlbflush.h | 90 +++++++++++++++++--------
arch/x86/kernel/alternative.c | 2 +-
arch/x86/kernel/ldt.c | 3 +-
arch/x86/mm/fault.c | 22 +++++-
arch/x86/mm/tlb.c | 21 +++++-
include/asm-generic/tlb.h | 116 +++++++++++++++++++-------------
include/linux/mm.h | 2 +
include/linux/mm_types.h | 6 ++
mm/huge_memory.c | 9 ++-
mm/hugetlb.c | 2 +-
mm/memory.c | 2 +-
mm/mmu_gather.c | 1 +
mm/mprotect.c | 31 ++++++---
mm/rmap.c | 16 +++--
mm/userfaultfd.c | 10 ++-
17 files changed, 237 insertions(+), 103 deletions(-)

--
2.25.1


2022-07-18 19:52:54

by Nadav Amit

Subject: [RFC PATCH 06/14] mm/rmap: avoid flushing on page_vma_mkclean_one() when possible

From: Nadav Amit <[email protected]>

x86 is capable of avoiding a TLB flush on clean writable entries.
page_vma_mkclean_one() does not take advantage of this behavior. Adapt
it to do so.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/rmap.c | 16 ++++++++++++----
1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/mm/rmap.c b/mm/rmap.c
index 83172ee0ea35..23997c387858 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -961,17 +961,25 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)

address = pvmw->address;
if (pvmw->pte) {
- pte_t entry;
+ pte_t entry, oldpte;
pte_t *pte = pvmw->pte;

if (!pte_dirty(*pte) && !pte_write(*pte))
continue;

flush_cache_page(vma, address, pte_pfn(*pte));
- entry = ptep_clear_flush(vma, address, pte);
- entry = pte_wrprotect(entry);
+ oldpte = ptep_modify_prot_start(pvmw->vma, address,
+ pte);
+
+ entry = pte_wrprotect(oldpte);
entry = pte_mkclean(entry);
- set_pte_at(vma->vm_mm, address, pte, entry);
+
+ if (pte_needs_flush(oldpte, entry) ||
+ mm_tlb_flush_pending(vma->vm_mm))
+ flush_tlb_page(vma, address);
+
+ ptep_modify_prot_commit(vma, address, pte, oldpte,
+ entry);
ret = 1;
} else {
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
--
2.25.1

2022-07-18 19:53:42

by Nadav Amit

Subject: [RFC PATCH 11/14] x86/mm: use relaxed TLB flushes when protection is removed

From: Nadav Amit <[email protected]>

When checking x86 PTE flags to determine whether a TLB flush is needed,
also determine whether a relaxed TLB flush is sufficient: if protection
is only added (NX removed or W added), indicate that a relaxed TLB flush
would suffice.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 20 ++++++++++++++++----
1 file changed, 16 insertions(+), 4 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 230cd1d24fe6..4f98735ab07a 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,18 +271,23 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
* dirty/access bit if needed without a fault.
*/
const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
- _PAGE_ACCESSED;
+ _PAGE_ACCESSED | _PAGE_RW;
+ const pteval_t flush_on_set = _PAGE_NX;
+ const pteval_t flush_on_set_relaxed = _PAGE_RW;
+ const pteval_t flush_on_clear_relaxed = _PAGE_NX;
const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
_PAGE_SOFTW3 | _PAGE_SOFTW4;
- const pteval_t flush_on_change = _PAGE_RW | _PAGE_USER | _PAGE_PWT |
+ const pteval_t flush_on_change = _PAGE_USER | _PAGE_PWT |
_PAGE_PCD | _PAGE_PSE | _PAGE_GLOBAL | _PAGE_PAT |
_PAGE_PAT_LARGE | _PAGE_PKEY_BIT0 | _PAGE_PKEY_BIT1 |
- _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3 | _PAGE_NX;
+ _PAGE_PKEY_BIT2 | _PAGE_PKEY_BIT3;
unsigned long diff = oldflags ^ newflags;

BUILD_BUG_ON(flush_on_clear & software_flags);
BUILD_BUG_ON(flush_on_clear & flush_on_change);
BUILD_BUG_ON(flush_on_change & software_flags);
+ BUILD_BUG_ON(flush_on_change & flush_on_clear_relaxed);
+ BUILD_BUG_ON(flush_on_change & flush_on_set_relaxed);

/* Ignore software flags */
diff &= ~software_flags;
@@ -301,9 +306,16 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
if (diff & flush_on_change)
return PTE_FLUSH_STRICT;

+ if (diff & oldflags & flush_on_clear_relaxed)
+ return PTE_FLUSH_RELAXED;
+
+ if (diff & newflags & flush_on_set_relaxed)
+ return PTE_FLUSH_RELAXED;
+
/* Ensure there are no flags that were left behind */
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- (diff & ~(flush_on_clear | software_flags | flush_on_change))) {
+ (diff & ~(flush_on_clear | flush_on_set |
+ software_flags | flush_on_change))) {
VM_WARN_ON_ONCE(1);
return PTE_FLUSH_STRICT;
}
--
2.25.1

2022-07-18 19:54:24

by Nadav Amit

Subject: [RFC PATCH 08/14] x86/mm: introduce flush_tlb_fix_spurious_fault

From: Nadav Amit <[email protected]>

The next patches introduce relaxed TLB flushes for x86, which require a
full TLB flush upon a spurious page-fault. If a spurious page-fault
occurs on x86, check whether the local TLB generation is out of sync
with the mm's generation and perform a full TLB flush if needed.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable.h | 4 +++-
arch/x86/mm/tlb.c | 17 +++++++++++++++++
2 files changed, 20 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 44e2d6f1dbaa..1fbdaff1bb7a 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1079,7 +1079,9 @@ static inline void ptep_set_wrprotect(struct mm_struct *mm,
clear_bit(_PAGE_BIT_RW, (unsigned long *)&ptep->pte);
}

-#define flush_tlb_fix_spurious_fault(vma, address) do { } while (0)
+extern void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+ unsigned long address);
+#define flush_tlb_fix_spurious_fault flush_tlb_fix_spurious_fault

#define mk_pmd(page, pgprot) pfn_pmd(page_to_pfn(page), (pgprot))

diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index d400b6d9d246..ff3bcc55435e 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -955,6 +955,23 @@ static void put_flush_tlb_info(void)
#endif
}

+void flush_tlb_fix_spurious_fault(struct vm_area_struct *vma,
+ unsigned long address)
+{
+ u32 loaded_mm_asid = this_cpu_read(cpu_tlbstate.loaded_mm_asid);
+ u64 mm_tlb_gen = atomic64_read(&vma->vm_mm->context.tlb_gen);
+ u64 local_tlb_gen = this_cpu_read(cpu_tlbstate.ctxs[loaded_mm_asid].tlb_gen);
+ struct flush_tlb_info *info;
+
+ if (local_tlb_gen == mm_tlb_gen)
+ return;
+
+ preempt_disable();
+ info = get_flush_tlb_info(NULL, 0, TLB_FLUSH_ALL, 0, false, 0);
+ flush_tlb_func(info);
+ preempt_enable();
+}
+
void flush_tlb_mm_range(struct mm_struct *mm, unsigned long start,
unsigned long end, unsigned int stride_shift,
bool freed_tables)
--
2.25.1

2022-07-18 19:55:26

by Nadav Amit

Subject: [RFC PATCH 09/14] mm: introduce relaxed TLB flushes

From: Nadav Amit <[email protected]>

Introduce the concept of strict and relaxed TLB flushes. Relaxed TLB
flushes are TLB flushes that can be skipped, although skipping them
might lead to degraded performance. It is up to arch code (in the next
patches) to handle relaxed flushes correctly; one such behavior is
flushing the local TLB eagerly and remote TLBs lazily.

Track whether a flush is strict in the mmu_gather struct and introduce
the constants required for tracking.
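
As a hypothetical, x86-flavored illustration of how arch code might
consume the new bit (the real logic lives in the following x86 patches;
the function below is a sketch, not actual patch code):

static inline void tlb_flush_sketch(struct mmu_gather *tlb)
{
	if (tlb->strict) {
		/* Permissions were removed or mappings changed: shootdown. */
		flush_tlb_mm_range(tlb->mm, tlb->start, tlb->end,
				   tlb_get_unmap_shift(tlb), tlb->freed_tables);
	} else {
		/*
		 * Only permissions were added: advance the mm's TLB
		 * generation and flush locally; remote CPUs catch up
		 * upon spurious page-faults.
		 */
		inc_mm_tlb_gen(tlb->mm);
		flush_tlb_local();
	}
}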

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 41 ++++++------
include/asm-generic/tlb.h | 114 ++++++++++++++++++--------------
include/linux/mm_types.h | 6 ++
mm/huge_memory.c | 7 +-
mm/hugetlb.c | 2 +-
mm/mmu_gather.c | 1 +
mm/mprotect.c | 8 ++-
mm/rmap.c | 2 +-
8 files changed, 107 insertions(+), 74 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4af5579c7ef7..77d4810e5a5d 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,7 +259,7 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

-static inline bool pte_flags_need_flush(unsigned long oldflags,
+static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
unsigned long newflags,
bool ignore_access)
{
@@ -290,71 +290,72 @@ static inline bool pte_flags_need_flush(unsigned long oldflags,
diff &= ~_PAGE_ACCESSED;

/*
- * Did any of the 'flush_on_clear' flags was clleared set from between
- * 'oldflags' and 'newflags'?
+ * Did any of the 'flush_on_clear' flags was cleared between 'oldflags'
+ * and 'newflags'?
*/
if (diff & oldflags & flush_on_clear)
- return true;
+ return PTE_FLUSH_STRICT;

/* Flush on modified flags. */
if (diff & flush_on_change)
- return true;
+ return PTE_FLUSH_STRICT;

/* Ensure there are no flags that were left behind */
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
(diff & ~(flush_on_clear | software_flags | flush_on_change))) {
VM_WARN_ON_ONCE(1);
- return true;
+ return PTE_FLUSH_STRICT;
}

- return false;
+ return PTE_FLUSH_NONE;
}

/*
- * pte_needs_flush() checks whether permissions were demoted and require a
- * flush. It should only be used for userspace PTEs.
+ * pte_flush_type() checks whether permissions were demoted or promoted and
+ * whether a strict or relaxed TLB flush is needed. It should only be used on
+ * userspace PTEs.
*/
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
{
/* !PRESENT -> * ; no need for flush */
if (!(pte_flags(oldpte) & _PAGE_PRESENT))
- return false;
+ return PTE_FLUSH_NONE;

/* PFN changed ; needs flush */
if (pte_pfn(oldpte) != pte_pfn(newpte))
- return true;
+ return PTE_FLUSH_STRICT;

/*
* check PTE flags; ignore access-bit; see comment in
* ptep_clear_flush_young().
*/
- return pte_flags_need_flush(pte_flags(oldpte), pte_flags(newpte),
+ return pte_flags_flush_type(pte_flags(oldpte), pte_flags(newpte),
true);
}
-#define pte_needs_flush pte_needs_flush
+#define pte_flush_type pte_flush_type

/*
- * huge_pmd_needs_flush() checks whether permissions were demoted and require a
+ * huge_pmd_flush_type() checks whether permissions were demoted and require a
* flush. It should only be used for userspace huge PMDs.
*/
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
{
/* !PRESENT -> * ; no need for flush */
if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
- return false;
+ return PTE_FLUSH_NONE;

/* PFN changed ; needs flush */
if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
- return true;
+ return PTE_FLUSH_STRICT;

/*
* check PMD flags; do not ignore access-bit; see
* pmdp_clear_flush_young().
*/
- return pte_flags_need_flush(pmd_flags(oldpmd), pmd_flags(newpmd),
+ return pte_flags_flush_type(pmd_flags(oldpmd), pmd_flags(newpmd),
false);
}
-#define huge_pmd_needs_flush huge_pmd_needs_flush
+#define huge_pmd_flush_type huge_pmd_flush_type

#endif /* !MODULE */

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index ff3e82553a76..07b3eb8caf63 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -289,6 +289,11 @@ struct mmu_gather {
unsigned int vma_exec : 1;
unsigned int vma_huge : 1;

+ /*
+ * whether we made flushing strict (add protection) or changed
+ * mappings.
+ */
+ unsigned int strict : 1;
unsigned int batch_count;

#ifndef CONFIG_MMU_GATHER_NO_GATHER
@@ -325,6 +330,7 @@ static inline void __tlb_reset_range(struct mmu_gather *tlb)
tlb->cleared_pmds = 0;
tlb->cleared_puds = 0;
tlb->cleared_p4ds = 0;
+ tlb->strict = 0;
/*
* Do not reset mmu_gather::vma_* fields here, we do not
* call into tlb_start_vma() again to set them if there is an
@@ -518,31 +524,43 @@ static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vm
* and set corresponding cleared_*.
*/
static inline void tlb_flush_pte_range(struct mmu_gather *tlb,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long size,
+ bool strict)
{
__tlb_adjust_range(tlb, address, size);
tlb->cleared_ptes = 1;
+ if (strict)
+ tlb->strict = 1;
}

static inline void tlb_flush_pmd_range(struct mmu_gather *tlb,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long size,
+ bool strict)
{
__tlb_adjust_range(tlb, address, size);
tlb->cleared_pmds = 1;
+ if (strict)
+ tlb->strict = 1;
}

static inline void tlb_flush_pud_range(struct mmu_gather *tlb,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long size,
+ bool strict)
{
__tlb_adjust_range(tlb, address, size);
tlb->cleared_puds = 1;
+ if (strict)
+ tlb->strict = 1;
}

static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
- unsigned long address, unsigned long size)
+ unsigned long address, unsigned long size,
+ bool strict)
{
__tlb_adjust_range(tlb, address, size);
tlb->cleared_p4ds = 1;
+ if (strict)
+ tlb->strict = 1;
}

#ifndef __tlb_remove_tlb_entry
@@ -556,24 +574,24 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
* so we can later optimise away the tlb invalidate. This helps when
* userspace is unmapping already-unmapped pages, which happens quite a lot.
*/
-#define tlb_remove_tlb_entry(tlb, ptep, address) \
- do { \
- tlb_flush_pte_range(tlb, address, PAGE_SIZE); \
- __tlb_remove_tlb_entry(tlb, ptep, address); \
+#define tlb_remove_tlb_entry(tlb, ptep, address) \
+ do { \
+ tlb_flush_pte_range(tlb, address, PAGE_SIZE, true); \
+ __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

-#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
- do { \
- unsigned long _sz = huge_page_size(h); \
- if (_sz >= P4D_SIZE) \
- tlb_flush_p4d_range(tlb, address, _sz); \
- else if (_sz >= PUD_SIZE) \
- tlb_flush_pud_range(tlb, address, _sz); \
- else if (_sz >= PMD_SIZE) \
- tlb_flush_pmd_range(tlb, address, _sz); \
- else \
- tlb_flush_pte_range(tlb, address, _sz); \
- __tlb_remove_tlb_entry(tlb, ptep, address); \
+#define tlb_remove_huge_tlb_entry(h, tlb, ptep, address) \
+ do { \
+ unsigned long _sz = huge_page_size(h); \
+ if (_sz >= P4D_SIZE) \
+ tlb_flush_p4d_range(tlb, address, _sz, true); \
+ else if (_sz >= PUD_SIZE) \
+ tlb_flush_pud_range(tlb, address, _sz, true); \
+ else if (_sz >= PMD_SIZE) \
+ tlb_flush_pmd_range(tlb, address, _sz, true); \
+ else \
+ tlb_flush_pte_range(tlb, address, _sz, true); \
+ __tlb_remove_tlb_entry(tlb, ptep, address); \
} while (0)

/**
@@ -586,7 +604,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,

#define tlb_remove_pmd_tlb_entry(tlb, pmdp, address) \
do { \
- tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE); \
+ tlb_flush_pmd_range(tlb, address, HPAGE_PMD_SIZE, true);\
__tlb_remove_pmd_tlb_entry(tlb, pmdp, address); \
} while (0)

@@ -600,7 +618,7 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,

#define tlb_remove_pud_tlb_entry(tlb, pudp, address) \
do { \
- tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE); \
+ tlb_flush_pud_range(tlb, address, HPAGE_PUD_SIZE, true);\
__tlb_remove_pud_tlb_entry(tlb, pudp, address); \
} while (0)

@@ -623,52 +641,52 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
*/

#ifndef pte_free_tlb
-#define pte_free_tlb(tlb, ptep, address) \
- do { \
- tlb_flush_pmd_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
- __pte_free_tlb(tlb, ptep, address); \
+#define pte_free_tlb(tlb, ptep, address) \
+ do { \
+ tlb_flush_pmd_range(tlb, address, PAGE_SIZE, true); \
+ tlb->freed_tables = 1; \
+ __pte_free_tlb(tlb, ptep, address); \
} while (0)
#endif

#ifndef pmd_free_tlb
-#define pmd_free_tlb(tlb, pmdp, address) \
- do { \
- tlb_flush_pud_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
- __pmd_free_tlb(tlb, pmdp, address); \
+#define pmd_free_tlb(tlb, pmdp, address) \
+ do { \
+ tlb_flush_pud_range(tlb, address, PAGE_SIZE, true); \
+ tlb->freed_tables = 1; \
+ __pmd_free_tlb(tlb, pmdp, address); \
} while (0)
#endif

#ifndef pud_free_tlb
-#define pud_free_tlb(tlb, pudp, address) \
- do { \
- tlb_flush_p4d_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
- __pud_free_tlb(tlb, pudp, address); \
+#define pud_free_tlb(tlb, pudp, address) \
+ do { \
+ tlb_flush_p4d_range(tlb, address, PAGE_SIZE, true); \
+ tlb->freed_tables = 1; \
+ __pud_free_tlb(tlb, pudp, address); \
} while (0)
#endif

#ifndef p4d_free_tlb
-#define p4d_free_tlb(tlb, pudp, address) \
- do { \
- __tlb_adjust_range(tlb, address, PAGE_SIZE); \
- tlb->freed_tables = 1; \
- __p4d_free_tlb(tlb, pudp, address); \
+#define p4d_free_tlb(tlb, pudp, address) \
+ do { \
+ __tlb_adjust_range(tlb, address, PAGE_SIZE); \
+ tlb->freed_tables = 1; \
+ __p4d_free_tlb(tlb, pudp, address); \
} while (0)
#endif

-#ifndef pte_needs_flush
-static inline bool pte_needs_flush(pte_t oldpte, pte_t newpte)
+#ifndef pte_flush_type
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
{
- return true;
+ return PTE_FLUSH_STRICT;
}
#endif

-#ifndef huge_pmd_needs_flush
-static inline bool huge_pmd_needs_flush(pmd_t oldpmd, pmd_t newpmd)
+#ifndef huge_pmd_flush_type
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
{
- return true;
+ return PTE_FLUSH_STRICT;
}
#endif

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index 6b961a29bf26..8825f1314a28 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -698,6 +698,12 @@ extern void tlb_gather_mmu(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_gather_mmu_fullmm(struct mmu_gather *tlb, struct mm_struct *mm);
extern void tlb_finish_mmu(struct mmu_gather *tlb);

+enum pte_flush_type {
+ PTE_FLUSH_NONE = 0, /* not necessary */
+ PTE_FLUSH_STRICT = 1, /* required */
+ PTE_FLUSH_RELAXED = 2, /* can cause spurious page-faults */
+};
+
struct vm_fault;

/**
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 60d742c33de3..09e6608a6431 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1713,6 +1713,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
+ enum pte_flush_type flush_type;

tlb_change_page_size(tlb, HPAGE_PMD_SIZE);

@@ -1815,8 +1816,10 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);

- if (huge_pmd_needs_flush(oldpmd, entry))
- tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+ flush_type = huge_pmd_flush_type(oldpmd, entry);
+ if (flush_type != PTE_FLUSH_NONE)
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
+ flush_type == PTE_FLUSH_STRICT);

BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 6621d3fe4991..9a667237a69a 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5022,7 +5022,7 @@ static void __unmap_hugepage_range(struct mmu_gather *tlb, struct vm_area_struct
ptl = huge_pte_lock(h, mm, ptep);
if (huge_pmd_unshare(mm, vma, &address, ptep)) {
spin_unlock(ptl);
- tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE);
+ tlb_flush_pmd_range(tlb, address & PUD_MASK, PUD_SIZE, true);
force_flush = true;
continue;
}
diff --git a/mm/mmu_gather.c b/mm/mmu_gather.c
index a71924bd38c0..9a8bd2f23543 100644
--- a/mm/mmu_gather.c
+++ b/mm/mmu_gather.c
@@ -348,6 +348,7 @@ void tlb_finish_mmu(struct mmu_gather *tlb)
tlb->fullmm = 1;
__tlb_reset_range(tlb);
tlb->freed_tables = 1;
+ tlb->strict = 1;
}

tlb_flush_mmu(tlb);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 92bfb17dcb8a..ead20dc66d34 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -117,6 +117,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
pte_t ptent;
bool preserve_write = (prot_numa || try_change_writable) &&
pte_write(oldpte);
+ enum pte_flush_type flush_type;

/*
* Avoid trapping faults against the zero or KSM
@@ -200,8 +201,11 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
}

ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
- if (pte_needs_flush(oldpte, ptent))
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+
+ flush_type = pte_flush_type(oldpte, ptent);
+ if (flush_type != PTE_FLUSH_NONE)
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
+ flush_type == PTE_FLUSH_STRICT);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
diff --git a/mm/rmap.c b/mm/rmap.c
index 23997c387858..62f4b2a4f067 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
entry = pte_wrprotect(oldpte);
entry = pte_mkclean(entry);

- if (pte_needs_flush(oldpte, entry) ||
+ if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
mm_tlb_flush_pending(vma->vm_mm))
flush_tlb_page(vma, address);

--
2.25.1

2022-07-18 19:55:28

by Nadav Amit

Subject: [RFC PATCH 14/14] mm: conditional check of pfn in pte_flush_type

From: Nadav Amit <[email protected]>

Checking whether the PFNs in two PTEs are the same takes a surprisingly
large number of instructions. Yet in most cases the caller of
pte_flush_type() already knows whether the PFN was changed. For
instance, mprotect() does not change the PFN, but only modifies the
protection flags.

Add an argument to pte_flush_type() to indicate whether the PFN should
be checked. Keep checking it in mm-debug builds to catch callers that
wrongly assume the PFN is unchanged.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 14 ++++++++++----
include/asm-generic/tlb.h | 6 ++++--
mm/huge_memory.c | 2 +-
mm/mprotect.c | 2 +-
mm/rmap.c | 2 +-
5 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 58c95e36b098..50349861fdc9 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -340,14 +340,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
* whether a strict or relaxed TLB flush is needed. It should only be used on
* userspace PTEs.
*/
-static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+ bool check_pfn)
{
/* !PRESENT -> * ; no need for flush */
if (!(pte_flags(oldpte) & _PAGE_PRESENT))
return PTE_FLUSH_NONE;

/* PFN changed ; needs flush */
- if (pte_pfn(oldpte) != pte_pfn(newpte))
+ if (!check_pfn)
+ VM_BUG_ON(pte_pfn(oldpte) != pte_pfn(newpte));
+ else if (pte_pfn(oldpte) != pte_pfn(newpte))
return PTE_FLUSH_STRICT;

/*
@@ -363,14 +366,17 @@ static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
* huge_pmd_flush_type() checks whether permissions were demoted and require a
* flush. It should only be used for userspace huge PMDs.
*/
-static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+ bool check_pfn)
{
/* !PRESENT -> * ; no need for flush */
if (!(pmd_flags(oldpmd) & _PAGE_PRESENT))
return PTE_FLUSH_NONE;

/* PFN changed ; needs flush */
- if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+ if (!check_pfn)
+ VM_BUG_ON(pmd_pfn(oldpmd) != pmd_pfn(newpmd));
+ else if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
return PTE_FLUSH_STRICT;

/*
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 07b3eb8caf63..aee9da6cc5d5 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -677,14 +677,16 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
#endif

#ifndef pte_flush_type
-static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte)
+static inline enum pte_flush_type pte_flush_type(pte_t oldpte, pte_t newpte,
+ bool check_pfn)
{
return PTE_FLUSH_STRICT;
}
#endif

#ifndef huge_pmd_flush_type
-static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd)
+static inline enum pte_flush_type huge_pmd_flush_type(pmd_t oldpmd, pmd_t newpmd,
+ bool check_pfn)
{
return PTE_FLUSH_STRICT;
}
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index b32b7da0f6f7..92a7b3ca317f 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1818,7 +1818,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,

flush_type = PTE_FLUSH_STRICT;
if (!tlb->strict)
- flush_type = huge_pmd_flush_type(oldpmd, entry);
+ flush_type = huge_pmd_flush_type(oldpmd, entry, false);
if (flush_type != PTE_FLUSH_NONE)
tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index cf775f6c8c08..78081d7f4edf 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -204,7 +204,7 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,

flush_type = PTE_FLUSH_STRICT;
if (!tlb->strict)
- flush_type = pte_flush_type(oldpte, ptent);
+ flush_type = pte_flush_type(oldpte, ptent, false);
if (flush_type != PTE_FLUSH_NONE)
tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/rmap.c b/mm/rmap.c
index 62f4b2a4f067..63261619b607 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -974,7 +974,7 @@ static int page_vma_mkclean_one(struct page_vma_mapped_walk *pvmw)
entry = pte_wrprotect(oldpte);
entry = pte_mkclean(entry);

- if (pte_flush_type(oldpte, entry) != PTE_FLUSH_NONE ||
+ if (pte_flush_type(oldpte, entry, false) != PTE_FLUSH_NONE ||
mm_tlb_flush_pending(vma->vm_mm))
flush_tlb_page(vma, address);

--
2.25.1

2022-07-18 20:02:42

by Nadav Amit

Subject: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable

From: Nadav Amit <[email protected]>

Anonymous pages might have the dirty bit clear, but this should not
prevent mprotect from making them writable if they are exclusive.
Therefore, skip the test whether the page is dirty in this case.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/mprotect.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 34c2dfb68c42..da5b9bf8204f 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,

VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));

- if (pte_protnone(pte) || !pte_dirty(pte))
+ if (pte_protnone(pte))
return false;

/* Do we need write faults for softdirty tracking? */
@@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
page = vm_normal_page(vma, addr, pte);
if (!page || !PageAnon(page) || !PageAnonExclusive(page))
return false;
- }
+ } else if (!pte_dirty(pte))
+ return false;

return true;
}
--
2.25.1

2022-07-18 20:02:42

by Nadav Amit

Subject: [RFC PATCH 13/14] mm/mprotect: do not check flush type if a strict flush is needed

From: Nadav Amit <[email protected]>

Once it has been determined that a strict TLB flush is needed, it is
likely that other PTEs would also need a strict TLB flush, and there is
little benefit in not extending the range that is flushed.

Skip the check of which TLB flush type is needed if it was already
determined that a strict flush is needed.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
mm/huge_memory.c | 4 +++-
mm/mprotect.c | 4 +++-
2 files changed, 6 insertions(+), 2 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 09e6608a6431..b32b7da0f6f7 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1816,7 +1816,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);

- flush_type = huge_pmd_flush_type(oldpmd, entry);
+ flush_type = PTE_FLUSH_STRICT;
+ if (!tlb->strict)
+ flush_type = huge_pmd_flush_type(oldpmd, entry);
if (flush_type != PTE_FLUSH_NONE)
tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE,
flush_type == PTE_FLUSH_STRICT);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index ead20dc66d34..cf775f6c8c08 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -202,7 +202,9 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,

ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);

- flush_type = pte_flush_type(oldpte, ptent);
+ flush_type = PTE_FLUSH_STRICT;
+ if (!tlb->strict)
+ flush_type = pte_flush_type(oldpte, ptent);
if (flush_type != PTE_FLUSH_NONE)
tlb_flush_pte_range(tlb, addr, PAGE_SIZE,
flush_type == PTE_FLUSH_STRICT);
--
2.25.1

2022-07-18 20:10:05

by Nadav Amit

Subject: [RFC PATCH 12/14] x86/tlb: no flush on PTE change from RW->RO when PTE is clean

From: Nadav Amit <[email protected]>

On x86 it is possible to skip a TLB flush when a RW entry becomes RO
and the PTE is clean. Add logic to detect this case and skip the flush.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 16 ++++++++++++++--
1 file changed, 14 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index 4f98735ab07a..58c95e36b098 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -271,8 +271,9 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
* dirty/access bit if needed without a fault.
*/
const pteval_t flush_on_clear = _PAGE_DIRTY | _PAGE_PRESENT |
- _PAGE_ACCESSED | _PAGE_RW;
+ _PAGE_ACCESSED;
const pteval_t flush_on_set = _PAGE_NX;
+ const pteval_t flush_on_special = _PAGE_RW;
const pteval_t flush_on_set_relaxed = _PAGE_RW;
const pteval_t flush_on_clear_relaxed = _PAGE_NX;
const pteval_t software_flags = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
@@ -302,6 +303,17 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,
if (diff & oldflags & flush_on_clear)
return PTE_FLUSH_STRICT;

+ /*
+ * Did any of the 'flush_on_set' flags was set between 'oldflags' and
+ * 'newflags'?
+ */
+ if (diff & newflags & flush_on_set)
+ return PTE_FLUSH_STRICT;
+
+ /* On RW->RO, a flush is needed if the old entry is dirty */
+ if ((diff & oldflags & _PAGE_RW) && (oldflags & _PAGE_DIRTY))
+ return PTE_FLUSH_STRICT;
+
/* Flush on modified flags. */
if (diff & flush_on_change)
return PTE_FLUSH_STRICT;
@@ -314,7 +326,7 @@ static inline enum pte_flush_type pte_flags_flush_type(unsigned long oldflags,

/* Ensure there are no flags that were left behind */
if (IS_ENABLED(CONFIG_DEBUG_VM) &&
- (diff & ~(flush_on_clear | flush_on_set |
+ (diff & ~(flush_on_clear | flush_on_set | flush_on_special |
software_flags | flush_on_change))) {
VM_WARN_ON_ONCE(1);
return PTE_FLUSH_STRICT;
--
2.25.1

2022-07-20 15:28:24

by David Hildenbrand

Subject: Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable

On 18.07.22 14:02, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> Anonymous pages might have the dirty bit clear, but this should not
> prevent mprotect from making them writable if they are exclusive.
> Therefore, skip the test whether the page is dirty in this case.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Cooper <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Andy Lutomirski <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Peter Xu <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Will Deacon <[email protected]>
> Cc: Yu Zhao <[email protected]>
> Cc: Nick Piggin <[email protected]>
> Signed-off-by: Nadav Amit <[email protected]>
> ---
> mm/mprotect.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 34c2dfb68c42..da5b9bf8204f 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>
> VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>
> - if (pte_protnone(pte) || !pte_dirty(pte))
> + if (pte_protnone(pte))
> return false;
>
> /* Do we need write faults for softdirty tracking? */
> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> page = vm_normal_page(vma, addr, pte);
> if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> return false;
> - }
> + } else if (!pte_dirty(pte))
> + return false;
>
> return true;
> }

When I wrote that code, I was wondering how often that would actually
happen in practice -- and if we care about optimizing that. Do you have
a gut feeling in which scenarios this would happen and if we care?

If the page is in the swapcache and was swapped out, you'd be requiring
a writeback even though nobody modified the page and possibly isn't
going to do so in the near future.

--
Thanks,

David / dhildenb

2022-07-20 17:53:23

by Nadav Amit

Subject: Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable

On Jul 20, 2022, at 8:19 AM, David Hildenbrand <[email protected]> wrote:

> On 18.07.22 14:02, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>>
>> Anonymous pages might have the dirty bit clear, but this should not
>> prevent mprotect from making them writable if they are exclusive.
>> Therefore, skip the test whether the page is dirty in this case.
>>
>> Cc: Andrea Arcangeli <[email protected]>
>> Cc: Andrew Cooper <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Andy Lutomirski <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Peter Xu <[email protected]>
>> Cc: Peter Zijlstra <[email protected]>
>> Cc: Thomas Gleixner <[email protected]>
>> Cc: Will Deacon <[email protected]>
>> Cc: Yu Zhao <[email protected]>
>> Cc: Nick Piggin <[email protected]>
>> Signed-off-by: Nadav Amit <[email protected]>
>> ---
>> mm/mprotect.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 34c2dfb68c42..da5b9bf8204f 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>
>> VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>
>> - if (pte_protnone(pte) || !pte_dirty(pte))
>> + if (pte_protnone(pte))
>> return false;
>>
>> /* Do we need write faults for softdirty tracking? */
>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> page = vm_normal_page(vma, addr, pte);
>> if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> return false;
>> - }
>> + } else if (!pte_dirty(pte))
>> + return false;
>>
>> return true;
>> }
>
> When I wrote that code, I was wondering how often that would actually
> happen in practice -- and if we care about optimizing that. Do you have
> a gut feeling in which scenarios this would happen and if we care?
>
> If the page is in the swapcache and was swapped out, you'd be requiring
> a writeback even though nobody modified the page and possibly isn't
> going to do so in the near future.

So here is my due diligence: I did not really encounter a scenario in which
it showed up. When I looked at your code, I assumed this was an oversight
and not a thoughtful decision. For me the issue is more of the discrepancy
between how a certain page is handled before and after it was paged out.

The way that I see it, there is a tradeoff in the way dirty-bit should
be handled:
(1) Writable-clean PTEs introduce some non-negligible overhead.
(2) Marking a PTE dirty speculatively would require a write back.

… But this tradeoff should not affect whether a PTE is writable, i.e.,
mapping the PTE as writable-clean should not cause a writeback. In other
words, if you are concerned about unnecessary writebacks, which I think is a
fair concern, then do not set the dirty-bit. In a later patch I try to avoid
TLB flushes on clean-writable entries that are write-protected.

So I do not think that the writeback you mentioned should be a real issue.
Yet if you think that using the fact that the page is not-dirty is a good
heuristic to avoid future TLB flushes (for P->NP; as I said there is a
solution for RW->RO), or if you are concerned about the cost of
vm_normal_page(), perhaps those are valid concerns (although I do not think
so).

--

[ Regarding (1): After some discussions with Peter and reading more code, I
thought at some point that perhaps avoiding having writable-clean PTE as
much as possible makes sense [*], since setting the dirty-bit costs ~550
cycles and a page fault is not a lot more than 1000. But with all the
mitigations (and after adding IBRS for retbleed) page-fault entry is kind of
expensive.

[*] At least on x86 ]

2022-07-21 07:58:21

by David Hildenbrand

Subject: Re: [RFC PATCH 03/14] mm/mprotect: allow exclusive anon pages to be writable

On 20.07.22 19:25, Nadav Amit wrote:
> On Jul 20, 2022, at 8:19 AM, David Hildenbrand <[email protected]> wrote:
>
>> On 18.07.22 14:02, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>>
>>> Anonymous pages might have the dirty bit clear, but this should not
>>> prevent mprotect from making them writable if they are exclusive.
>>> Therefore, skip the test whether the page is dirty in this case.
>>>
>>> Cc: Andrea Arcangeli <[email protected]>
>>> Cc: Andrew Cooper <[email protected]>
>>> Cc: Andrew Morton <[email protected]>
>>> Cc: Andy Lutomirski <[email protected]>
>>> Cc: Dave Hansen <[email protected]>
>>> Cc: David Hildenbrand <[email protected]>
>>> Cc: Peter Xu <[email protected]>
>>> Cc: Peter Zijlstra <[email protected]>
>>> Cc: Thomas Gleixner <[email protected]>
>>> Cc: Will Deacon <[email protected]>
>>> Cc: Yu Zhao <[email protected]>
>>> Cc: Nick Piggin <[email protected]>
>>> Signed-off-by: Nadav Amit <[email protected]>
>>> ---
>>> mm/mprotect.c | 5 +++--
>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>>> index 34c2dfb68c42..da5b9bf8204f 100644
>>> --- a/mm/mprotect.c
>>> +++ b/mm/mprotect.c
>>> @@ -45,7 +45,7 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>>
>>> VM_BUG_ON(!(vma->vm_flags & VM_WRITE) || pte_write(pte));
>>>
>>> - if (pte_protnone(pte) || !pte_dirty(pte))
>>> + if (pte_protnone(pte))
>>> return false;
>>>
>>> /* Do we need write faults for softdirty tracking? */
>>> @@ -66,7 +66,8 @@ static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>>> page = vm_normal_page(vma, addr, pte);
>>> if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>>> return false;
>>> - }
>>> + } else if (!pte_dirty(pte))
>>> + return false;
>>>
>>> return true;
>>> }
>>
>> When I wrote that code, I was wondering how often that would actually
>> happen in practice -- and if we care about optimizing that. Do you have
>> a gut feeling in which scenarios this would happen and if we care?
>>
>> If the page is in the swapcache and was swapped out, you'd be requiring
>> a writeback even though nobody modified the page and possibly isn't
>> going to do so in the near future.
>
> So here is my due diligence: I did not really encounter a scenario in which
> it showed up. When I looked at your code, I assumed this was an oversight
> and not a thoughtful decision. For me the issue is more of the discrepancy
> between how a certain page is handled before and after it was paged out.
>
> The way that I see it, there is a tradeoff in the way dirty-bit should
> be handled:
> (1) Writable-clean PTEs introduce some non-negligible overhead.
> (2) Marking a PTE dirty speculatively would require a write back.
>
> … But this tradeoff should not affect whether a PTE is writable, i.e.,
> mapping the PTE as writable-clean should not cause a writeback. In other
> words, if you are concerned about unnecessary writebacks, which I think is a
> fair concern, then do not set the dirty-bit. In a later patch I try to avoid
> TLB flushes on clean-writable entries that are write-protected.
>
> So I do not think that the writeback you mentioned should be a real issue.
> Yet if you think that using the fact that the page is not-dirty is a good
> heuristic to avoid future TLB flushes (for P->NP; as I said there is a
> solution for RW->RO), or if you are concerned about the cost of
> vm_normal_page(), perhaps those are valid concerns (although I do not think
> so).

I think I now understand what you mean. I somehow remembered that some
architectures set a PTE dirty when marking it writable, but I guess this
is not true -- and setting it writable will keep it !dirty until really
accessed (either by the HW or by a fault). [I'll do some more digging
just to confirm]

With that in mind, your patch makes sense, and I guess you'd want that
as patch #1, because otherwise I fail to see how current patch #1 would
even succeed in reaching the "pte_mkdirty" call -- if !pte_dirty()
protects the code from running.

>
> --
>
> [ Regarding (1): After some discussions with Peter and reading more code, I
> thought at some point that perhaps avoiding having writable-clean PTE as
> much as possible makes sense [*], since setting the dirty-bit costs ~550
> cycles and a page fault is not a lot more than 1000. But with all the
> mitigations (and after adding IBRS for retbleed) page-fault entry is kind of
> expensive.

I understand the reasoning for anonymous memory, but not for page cache
pages. And for anonymous memory I think there are still cases where we
don't want to do that (swapcache and MADV_FREE come to mind) -- and
IMHO, in some of the scenarios where we have clean anonymous memory at all,
we just don't care about optimizing for setting the dirty bit faster on
x86 ;) .

So my gut feeling is to keep it as simple as possible. To me, it
translates to setting the dirty bit only if we have clear indication
that write access is likely next.

Thanks Nadav for refreshing my memory :) !

--
Thanks,

David / dhildenb