From: Nadav Amit <[email protected]>
This patch-set is based on a very small subset of an old RFC (see link
below), and is intended to avoid TLB flushes when they are not necessary
architecturally. Specifically, memory unprotection using userfaultfd
(i.e., using the userfaultfd ioctl) triggers a TLB flush when in fact no
architectural data, other than a software flag, is updated. This
overhead shows up in my development workload profiles.
Instead of tailoring a solution for this specific scenario, it is
arguably better to use this opportunity to consolidate the interfaces
that are used for TLB batching, avoid the open-coded
[inc|dec]_tlb_flush_pending() and use the tlb_[gather|finish]_mmu()
interface instead.
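For illustration, the resulting pattern in change_protection_range() is
roughly the following (a simplified sketch of the conversion, not the
exact diff):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	tlb_start_vma(&tlb, vma);
	/* walk the page tables; for every PTE that is modified: */
	tlb_flush_pte_range(&tlb, addr, PAGE_SIZE);
	/* ... */
	tlb_end_vma(&tlb, vma);
	tlb_finish_mmu(&tlb);	/* performs the deferred TLB flush, if any */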
Avoiding the TLB flushes is done very conservatively (unlike the RFC):
1. According to the x86 specifications, no flushes are necessary on
permission promotion or changes to software bits.
2. Linux does not flush PTEs after the access bit is cleared.
I considered the feedback of Andy Lutomirski and Andrew Cooper on the
RFC regarding avoiding TLB invalidations when RW is cleared for clean
PTEs. Although the bugs they pointed out can be easily addressed, I am
concerned because I could not find a specification that explicitly
clarifies that this optimization is valid.
--
RFC -> v1:
* Do not skip TLB flushes when clearing RW on clean PTEs
* Do not defer huge PMD flush as it is already done inline
Link: https://lore.kernel.org/lkml/[email protected]/
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Nadav Amit (2):
mm/mprotect: use mmu_gather
mm/mprotect: do not flush on permission promotion
arch/x86/include/asm/tlbflush.h | 40 ++++++++++++++++++++++++++
include/asm-generic/tlb.h | 4 +++
mm/mprotect.c | 51 +++++++++++++++++++--------------
3 files changed, 73 insertions(+), 22 deletions(-)
--
2.25.1
From: Nadav Amit <[email protected]>
change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents opportunities to avoid TLB flushes or perform them
at finer granularity.
Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
mm/mprotect.c | 50 ++++++++++++++++++++++++++++----------------------
1 file changed, 28 insertions(+), 22 deletions(-)
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..075ff94aa51c 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,12 +32,13 @@
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include "internal.h"
-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
@@ -138,6 +139,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -219,9 +221,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}
-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pmd_t *pmd;
unsigned long next;
@@ -261,6 +263,10 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
+ /*
+ * change_huge_pmd() does not defer TLB flushes,
+ * so no need to propagate the tlb argument.
+ */
int nr_ptes = change_huge_pmd(vma, pmd, addr,
newprot, cp_flags);
@@ -276,8 +282,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
/* fall through, the trans huge pmd just split */
}
- this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- cp_flags);
+ this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+ newprot, cp_flags);
pages += this_pages;
next:
cond_resched();
@@ -291,9 +297,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
return pages;
}
-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- p4d_t *p4d, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pud_t *pud;
unsigned long next;
@@ -304,16 +310,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- pages += change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
cp_flags);
} while (pud++, addr = next, addr != end);
return pages;
}
-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
p4d_t *p4d;
unsigned long next;
@@ -324,7 +330,7 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- pages += change_pud_range(vma, p4d, addr, next, newprot,
+ pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
cp_flags);
} while (p4d++, addr = next, addr != end);
@@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;
- unsigned long start = addr;
unsigned long pages = 0;
+ struct mmu_gather tlb;
BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
flush_cache_range(vma, addr, end);
inc_tlb_flush_pending(mm);
+ tlb_gather_mmu(&tlb, mm);
+ tlb_start_vma(&tlb, vma);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_p4d_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
cp_flags);
} while (pgd++, addr = next, addr != end);
- /* Only flush the TLB if we actually modified any entries: */
- if (pages)
- flush_tlb_range(vma, start, end);
- dec_tlb_flush_pending(mm);
+ tlb_end_vma(&tlb, vma);
+ tlb_finish_mmu(&tlb);
return pages;
}
--
2.25.1
From: Nadav Amit <[email protected]>
Currently, using mprotect() to unprotect a memory region or uffd to
unprotect a memory region causes a TLB flush. At least on x86, as
protection is promoted, no TLB flush is needed.
Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().
For x86, PTE protection promotion or changes of software bits do not
require a flush; also add logic that considers the dirty-bit. Changes to
the access-bit do not trigger a TLB flush, although architecturally they
should, as Linux considers the access-bit as a hint.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/tlbflush.h | 40 +++++++++++++++++++++++++++++++++
include/asm-generic/tlb.h | 4 ++++
mm/mprotect.c | 3 ++-
3 files changed, 46 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index b587a9ee9cb2..e74d13d174d1 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,46 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
+/*
+ * pte_may_need_flush() checks whether a TLB flush is necessary according to x86
+ * architectural definition. Mere promotion of permissions does not require a
+ * TLB flush. Changes of software bits do not require a TLB flush. Demotion
+ * of the access-bit also does not trigger a TLB flush (although it is required
+ * architecturally) - see the comment in ptep_clear_flush_young().
+ *
+ * Further optimizations may be possible, such as avoiding a flush when clearing
+ * the write-bit on non-dirty entries. As those do not explicitly follow the
+ * specifications, they are not implemented (at least for now).
+ */
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+ const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+ _PAGE_SOFTW3 | _PAGE_ACCESSED;
+ const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_GLOBAL;
+ pteval_t oldval = pte_val(oldpte);
+ pteval_t newval = pte_val(newpte);
+ pteval_t diff = oldval ^ newval;
+ pteval_t disable_mask = 0;
+
+ if (IS_ENABLED(CONFIG_X86_64) || IS_ENABLED(CONFIG_X86_PAE))
+ disable_mask = _PAGE_NX;
+
+ /* new is non-present: need only if old is present */
+ if (pte_none(newpte))
+ return !pte_none(oldpte);
+
+ /*
+ * Any change of PFN and any flag other than those that we consider
+ * requires a flush (e.g., PAT, protection keys). To save flushes we do
+ * not consider the access bit as it is considered by the kernel as
+ * best-effort.
+ */
+ return diff & ((oldval & enable_mask) |
+ (newval & disable_mask) |
+ ~(enable_mask | disable_mask | ignore_mask));
+}
+#define pte_may_need_flush pte_may_need_flush
+
#endif /* !MODULE */
#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..5ca49f44bc38 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,10 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
} while (0)
#endif
+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte) { return true; }
+#endif
+
#endif /* CONFIG_MMU */
#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 075ff94aa51c..ae79df59a7a8 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -139,7 +139,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+ if (pte_may_need_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
2.25.1
On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> pgd_t *pgd;
> unsigned long next;
> - unsigned long start = addr;
> unsigned long pages = 0;
> + struct mmu_gather tlb;
>
> BUG_ON(addr >= end);
> pgd = pgd_offset(mm, addr);
> flush_cache_range(vma, addr, end);
> inc_tlb_flush_pending(mm);
That seems unbalanced...
> + tlb_gather_mmu(&tlb, mm);
> + tlb_start_vma(&tlb, vma);
> do {
> next = pgd_addr_end(addr, end);
> if (pgd_none_or_clear_bad(pgd))
> continue;
> - pages += change_p4d_range(vma, pgd, addr, next, newprot,
> + pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
> cp_flags);
> } while (pgd++, addr = next, addr != end);
>
> - /* Only flush the TLB if we actually modified any entries: */
> - if (pages)
> - flush_tlb_range(vma, start, end);
> - dec_tlb_flush_pending(mm);
... seeing you do remove the extra decrement.
> + tlb_end_vma(&tlb, vma);
> + tlb_finish_mmu(&tlb);
>
> return pages;
> }
> --
> 2.25.1
>
> On Oct 3, 2021, at 5:10 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
>
>> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>> struct mm_struct *mm = vma->vm_mm;
>> pgd_t *pgd;
>> unsigned long next;
>> - unsigned long start = addr;
>> unsigned long pages = 0;
>> + struct mmu_gather tlb;
>>
>> BUG_ON(addr >= end);
>> pgd = pgd_offset(mm, addr);
>> flush_cache_range(vma, addr, end);
>> inc_tlb_flush_pending(mm);
>
> That seems unbalanced...
Bad rebase. Thanks for catching it!
>
>> + tlb_gather_mmu(&tlb, mm);
>> + tlb_start_vma(&tlb, vma);
>> do {
>> next = pgd_addr_end(addr, end);
>> if (pgd_none_or_clear_bad(pgd))
>> continue;
>> - pages += change_p4d_range(vma, pgd, addr, next, newprot,
>> + pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
>> cp_flags);
>> } while (pgd++, addr = next, addr != end);
>>
>> - /* Only flush the TLB if we actually modified any entries: */
>> - if (pages)
>> - flush_tlb_range(vma, start, end);
>> - dec_tlb_flush_pending(mm);
>
> ... seeing you do remove the extra decrement.
Is it really needed? We do not put this comment elsewhere for
tlb_finish_mmu(). But no problem, I’ll keep it.
On Mon, Oct 04, 2021 at 12:24:14PM -0700, Nadav Amit wrote:
>
>
> > On Oct 3, 2021, at 5:10 AM, Peter Zijlstra <[email protected]> wrote:
> >
> > On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
> >
> >> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> >> struct mm_struct *mm = vma->vm_mm;
> >> pgd_t *pgd;
> >> unsigned long next;
> >> - unsigned long start = addr;
> >> unsigned long pages = 0;
> >> + struct mmu_gather tlb;
> >>
> >> BUG_ON(addr >= end);
> >> pgd = pgd_offset(mm, addr);
> >> flush_cache_range(vma, addr, end);
> >> inc_tlb_flush_pending(mm);
> >
> > That seems unbalanced...
>
> Bad rebase. Thanks for catching it!
>
> >
> >> + tlb_gather_mmu(&tlb, mm);
> >> + tlb_start_vma(&tlb, vma);
> >> do {
> >> next = pgd_addr_end(addr, end);
> >> if (pgd_none_or_clear_bad(pgd))
> >> continue;
> >> - pages += change_p4d_range(vma, pgd, addr, next, newprot,
> >> + pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
> >> cp_flags);
> >> } while (pgd++, addr = next, addr != end);
> >>
> >> - /* Only flush the TLB if we actually modified any entries: */
> >> - if (pages)
> >> - flush_tlb_range(vma, start, end);
> >> - dec_tlb_flush_pending(mm);
> >
> > ... seeing you do remove the extra decrement.
>
> Is it really needed? We do not put this comment elsewhere for
> tlb_finish_mmu(). But no problem, I’ll keep it.
-ENOPARSE, did you read decrement as comment? In any case, I don't
particularly care about the comment, and tlb_*_mmu() imply the inc/dec
thingies.
All I tried to do is point out that removing the dec but leaving the inc
is somewhat inconsistent :-)
> On Oct 4, 2021, at 11:53 PM, Peter Zijlstra <[email protected]> wrote:
>
> On Mon, Oct 04, 2021 at 12:24:14PM -0700, Nadav Amit wrote:
>>
>>
>>> On Oct 3, 2021, at 5:10 AM, Peter Zijlstra <[email protected]> wrote:
>>>
>>> On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
>>>
>>>> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>>>> struct mm_struct *mm = vma->vm_mm;
>>>> pgd_t *pgd;
>>>> unsigned long next;
>>>> - unsigned long start = addr;
>>>> unsigned long pages = 0;
>>>> + struct mmu_gather tlb;
>>>>
>>>> BUG_ON(addr >= end);
>>>> pgd = pgd_offset(mm, addr);
>>>> flush_cache_range(vma, addr, end);
>>>> inc_tlb_flush_pending(mm);
>>>
>>> That seems unbalanced...
>>
>> Bad rebase. Thanks for catching it!
>>
>>>
>>>> + tlb_gather_mmu(&tlb, mm);
>>>> + tlb_start_vma(&tlb, vma);
>>>> do {
>>>> next = pgd_addr_end(addr, end);
>>>> if (pgd_none_or_clear_bad(pgd))
>>>> continue;
>>>> - pages += change_p4d_range(vma, pgd, addr, next, newprot,
>>>> + pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
>>>> cp_flags);
>>>> } while (pgd++, addr = next, addr != end);
>>>>
>>>> - /* Only flush the TLB if we actually modified any entries: */
>>>> - if (pages)
>>>> - flush_tlb_range(vma, start, end);
>>>> - dec_tlb_flush_pending(mm);
>>>
>>> ... seeing you do remove the extra decrement.
>>
>> Is it really needed? We do not put this comment elsewhere for
>> tlb_finish_mmu(). But no problem, I’ll keep it.
>
> -ENOPARSE, did you read decrement as comment? In any case, I don't
> particularly care about the comment, and tlb_*_mmu() imply the inc/dec
> thingies.
>
> All I tried to do is point out that removing the dec but leaving the inc
> is somewhat inconsistent :-)
The autocorrect in my mind was broken so I read as “documentation”
instead of “decrement”.
I will send v2 soon.
Thanks again!
Nadav
On 25.09.21 22:54, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>
>
> Currently, using mprotect() to unprotect a memory region or uffd to
> unprotect a memory region causes a TLB flush. At least on x86, as
> protection is promoted, no TLB flush is needed.
>
> Add an arch-specific pte_may_need_flush() which tells whether a TLB
> flush is needed based on the old PTE and the new one. Implement an x86
> pte_may_need_flush().
>
> For x86, PTE protection promotion or changes of software bits do not
> require a flush; also add logic that considers the dirty-bit. Changes to
> the access-bit do not trigger a TLB flush, although architecturally they
> should, as Linux considers the access-bit as a hint.
Is the added LOC worth the benefit? IOW, do we have some benchmark that
really benefits from that?
--
Thanks,
David / dhildenb
> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <[email protected]> wrote:
>
> On 25.09.21 22:54, Nadav Amit wrote:
>> From: Nadav Amit <[email protected]>
>> Currently, using mprotect() to unprotect a memory region or uffd to
>> unprotect a memory region causes a TLB flush. At least on x86, as
>> protection is promoted, no TLB flush is needed.
>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>> flush is needed based on the old PTE and the new one. Implement an x86
>> pte_may_need_flush().
>> For x86, PTE protection promotion or changes of software bits do not
>> require a flush; also add logic that considers the dirty-bit. Changes to
>> the access-bit do not trigger a TLB flush, although architecturally they
>> should, as Linux considers the access-bit as a hint.
>
> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?
So you ask whether the added ~10 LOC (net) are worth the benefit?
Let’s start with the cost of this patch.
If you ask about complexity, I think that it is a rather simple
patch and documented as needed. Please be more concrete if you
think otherwise.
If you ask about the runtime overhead, my experience is that
such code, which mostly does bit operations, has negligible cost.
The execution time of mprotect code, and other similar pieces of
code, is mostly dominated by walking the page-tables & getting
the pages (which might require cold or random memory accesses),
acquiring the locks, and of course the TLB flushes that this
patch tries to eliminate.
As for the benefit: TLB flush on x86 of a single PTE has an
overhead of ~200 cycles. If a TLB shootdown is needed, for instance
on multithreaded applications, this overhead can grow to few
microseconds or even more, depending on the number of sockets,
whether the workload runs in a VM (and worse if CPUs are
overcommitted) and so on.
This overhead is completely unnecessary on many occasions, for
example when you run mprotect() to add permissions or, as I noted, in
my case when doing something similar using userfaultfd. Note that the
potentially unnecessary TLB flush/shootdown takes place while
you hold the mmap-lock for write in the case of mprotect(),
thereby potentially preventing other threads from making
progress during that time.
On my in-development workload it was a considerable overhead
(I didn’t collect numbers though). Basically, I track dirty
pages using uffd, and every page-fault that can be easily
resolved by unprotecting causes a TLB flush/shootdown.
If you want, I will write a microbenchmarks and give you numbers.
If you look for further optimizations (although you did not indicate
so), such as doing the TLB batching from do_mprotect_key(),
(i.e. batching across VMAs), we can discuss it and apply it on
top of these patches.
On 07.10.21 18:16, Nadav Amit wrote:
>
>> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <[email protected]> wrote:
>>
>> On 25.09.21 22:54, Nadav Amit wrote:
>>> From: Nadav Amit <[email protected]>
>>> Currently, using mprotect() to unprotect a memory region or uffd to
>>> unprotect a memory region causes a TLB flush. At least on x86, as
>>> protection is promoted, no TLB flush is needed.
>>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>>> flush is needed based on the old PTE and the new one. Implement an x86
>>> pte_may_need_flush().
>>> For x86, PTE protection promotion or changes of software bits do not
>>> require a flush; also add logic that considers the dirty-bit. Changes to
>>> the access-bit do not trigger a TLB flush, although architecturally they
>>> should, as Linux considers the access-bit as a hint.
>>
>> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?
>
> So you ask whether the added ~10 LOC (net) worth the benefit?
I read "3 files changed, 46 insertions(+), 1 deletion(-)" to optimize
something without proof, so I naturally have to ask. So this is just a
"usually we optimize and show numbers to proof" comment.
>
> Let’s start with the cost of this patch.
>
> If you ask about complexity, I think that it is a rather simple
> patch and documented as needed. Please be more concrete if you
> think otherwise.
It is most certainly added complexity, although documented cleanly.
>
> If you ask about the runtime overhead, my experience is that
> such code, which mostly does bit operations, has negligible cost.
> The execution time of mprotect code, and other similar pieces of
> code, is mostly dominated by walking the page-tables & getting
> the pages (which might require cold or random memory accesses),
> acquiring the locks, and of course the TLB flushes that this
> patch tries to eliminate.
I'm absolutely not concerned about runtime overhead :)
>
> As for the benefit: TLB flush on x86 of a single PTE has an
> overhead of ~200 cycles. If a TLB shootdown is needed, for instance
> on multithreaded applications, this overhead can grow to few
> microseconds or even more, depending on the number of sockets,
> whether the workload runs in a VM (and worse if CPUs are
> overcommitted) and so on.
>
> This overhead is completely unnecessary on many occasions. If
> you run mprotect() to add permissions, or as I noted in my case,
> to do something similar using userfaultfd. Note that the
> potentially unnecessary TLB flush/shootdown takes place while
> you hold the mmap-lock for write in the case of mprotect(),
> thereby potentially preventing other threads from making
> progress during that time.
>
> On my in-development workload it was a considerable overhead
> (I didn’t collect numbers though). Basically, I track dirty
> pages using uffd, and every page-fault that can be easily
> resolved by unprotecting cause a TLB flush/shootdown.
Any numbers would be helpful.
>
> If you want, I will write a microbenchmarks and give you numbers.
> If you look for further optimizations (although you did not indicate
> so), such as doing the TLB batching from do_mprotect_key(),
> (i.e. batching across VMAs), we can discuss it and apply it on
> top of these patches.
I think this patch itself is sufficient if we can show a benefit; I do
wonder if existing benchmarks could already show a benefit, I feel like
they should if this makes a difference. Excessive mprotect() usage
(protect<>unprotect) isn't something unusual.
--
Thanks,
David / dhildenb
> On Oct 7, 2021, at 10:07 AM, David Hildenbrand <[email protected]> wrote:
>
> On 07.10.21 18:16, Nadav Amit wrote:
>>> On Oct 7, 2021, at 5:13 AM, David Hildenbrand <[email protected]> wrote:
>>>
>>> On 25.09.21 22:54, Nadav Amit wrote:
>>>> From: Nadav Amit <[email protected]>
>>>> Currently, using mprotect() to unprotect a memory region or uffd to
>>>> unprotect a memory region causes a TLB flush. At least on x86, as
>>>> protection is promoted, no TLB flush is needed.
>>>> Add an arch-specific pte_may_need_flush() which tells whether a TLB
>>>> flush is needed based on the old PTE and the new one. Implement an x86
>>>> pte_may_need_flush().
>>>> For x86, PTE protection promotion or changes of software bits do not
>>>> require a flush; also add logic that considers the dirty-bit. Changes to
>>>> the access-bit do not trigger a TLB flush, although architecturally they
>>>> should, as Linux considers the access-bit as a hint.
>>>
>>> Is the added LOC worth the benefit? IOW, do we have some benchmark that really benefits from that?
>> So you ask whether the added ~10 LOC (net) worth the benefit?
>
> I read "3 files changed, 46 insertions(+), 1 deletion(-)" to optimize something without proof, so I naturally have to ask. So this is just a "usually we optimize and show numbers to proof" comment.
These numbers are misleading, as they count comments and other non-code.
[snip]
>
> Any numbers would be helpful.
>
>> If you want, I will write a microbenchmarks and give you numbers.
>> If you look for further optimizations (although you did not indicate
>> so), such as doing the TLB batching from do_mprotect_key(),
>> (i.e. batching across VMAs), we can discuss it and apply it on
>> top of these patches.
>
> I think this patch itself is sufficient if we can show a benefit; I do wonder if existing benchmarks could already show a benefit, I feel like they should if this makes a difference. Excessive mprotect() usage (protect<>unprotect) isn't something unusual.
I do not know of a concrete benchmark (other than my workload, which I cannot share right now) that does excessive mprotect() in a way that would be measurable in overall performance. I would argue that many optimizations in the kernel are such that they would not have a measurable benefit by themselves on common macrobenchmarks.
Anyhow, per your request I created a small micro-benchmark that runs mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) in a loop and measured the time it took to do the latter (where a writeprotect is not needed). I ran the benchmark on a VM (guest) on top of KVM.
The cost (cycles) per mprotect(PROT_READ|PROT_WRITE) operation:
            1 thread    2 threads
            --------    ---------
w/patch:        2496         2505
w/o patch:      5342        10458
[ The results for 1 thread might seem strange, as one could expect the overhead in this case to be no more than ~250 cycles, which is the time a TLB invalidation of a single PTE takes. Yet, this overhead is probably related to “page fracturing”, which happens when the VM memory is backed by 4KB pages. In such scenarios, a single PTE invalidation in the VM can cause, on Intel, a full TLB flush. The full flush is needed to ensure that if the invalidated address was mapped through a huge page in the VM, any relevant 4KB mapping that is cached in the TLB (after fracturing due to the 4KB GPA->HPA mapping) would be removed. ]
Let me know if you want me to share the micro-benchmark with you. I am not going to mention the results in the commit log, because I think the overhead of unnecessary TLB invalidation is well established.
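For reference, a minimal sketch of such a micro-benchmark follows; the region
size, iteration count and the rdtsc-based timing are illustrative assumptions
of this sketch, not necessarily what was actually run:

#define _GNU_SOURCE
#include <stdio.h>
#include <inttypes.h>
#include <sys/mman.h>
#include <x86intrin.h>

#define LEN	4096UL		/* one page; size is an arbitrary choice */
#define ITERS	100000UL

int main(void)
{
	/* error checking omitted for brevity */
	char *p = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | MAP_POPULATE, -1, 0);
	uint64_t cycles = 0;

	for (unsigned long i = 0; i < ITERS; i++) {
		/* write-protect, then time the unprotect (permission promotion) */
		mprotect(p, LEN, PROT_READ);
		uint64_t t = __rdtsc();
		mprotect(p, LEN, PROT_READ | PROT_WRITE);
		cycles += __rdtsc() - t;
	}

	printf("cycles per unprotect: %" PRIu64 "\n", cycles / ITERS);
	return 0;
}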
>>
>> Any numbers would be helpful.
>>
>>> If you want, I will write a microbenchmarks and give you numbers.
>>> If you look for further optimizations (although you did not indicate
>>> so), such as doing the TLB batching from do_mprotect_key(),
>>> (i.e. batching across VMAs), we can discuss it and apply it on
>>> top of these patches.
>>
>> I think this patch itself is sufficient if we can show a benefit; I do wonder if existing benchmarks could already show a benefit, I feel like they should if this makes a difference. Excessive mprotect() usage (protect<>unprotect) isn't something unusual.
>
> I do not know about a concrete benchmark (other than my workload, which I cannot share right now) that does excessive mprotect() in a way that would actually be measurable on the overall performance. I would argue that many many optimizations in the kernel are such that would not have so measurable benefit by themselves on common macrobenchmarks.
>
> Anyhow, per your request I created a small micro-benchmark that runs mprotect(PROT_READ) and mprotect(PROT_READ|PROT_WRITE) in a loop and measured the time it took to do the latter (where a writeprotect is not needed). I ran the benchmark on a VM (guest) on top of KVM.
>
> The cost (cycles) per mprotect(PROT_READ|PROT_WRITE) operation:
>
>             1 thread    2 threads
>             --------    ---------
> w/patch:        2496         2505
> w/o patch:      5342        10458
>
For my taste, the above numbers are sufficient, thanks!
> [ The results for 1 thread might seem strange as one can expect the overhead in this case to be no more than ~250 cycles, which is the time a TLB invalidation of a single PTE takes. Yet, this overhead are probably related to “page fracturing”, which happens when the VM memory is backed by 4KB pages. In such scenarios, a single PTE invalidation in the VM can cause on Intel a full TLB flush. The full flush is needed to ensure that if the invalidated address was mapped through huge page in the VM, any relevant 4KB mapping that is cached in the TLB (after fracturing due to the 4KB GPA->HPA mapping) would be removed.]
Very nice analysis :)
>
> Let me know if you want me to share the micro-benchmark with you. I am not going to mention the results in the commit log, because I think the overhead of unnecessary TLB invalidation is well established.
Just let me clarify why I am asking at all, it could be that:
a) The optimization is effective and applicable to many workloads
b) The optimization is effective and applicable to some workloads
("micro benchmark")
c) The optimization is ineffective
d) The optimization is wrong
IMHO: We can rule out d) by review and tests. We can rule out c) by
simple benchmarks easily.
Maybe extend the patch description by something like:
"The benefit of this optimization can already be visible when doing
mprotect(PROT_READ) -> mprotect(PROT_READ|PROT_WRITE) on a single
thread, because we end up requiring basically no TLB flushes. The
optimization gets even more significant with more threads. See [1] for
simple micro benchmark results."
Something like that would be good enough for my taste.
--
Thanks,
David / dhildenb
> On Sep 25, 2021, at 1:54 PM, Nadav Amit <[email protected]> wrote:
>
> From: Nadav Amit <[email protected]>
>
> change_pXX_range() currently does not use mmu_gather, but instead
> implements its own deferred TLB flushes scheme. This both complicates
> the code, as developers need to be aware of different invalidation
> schemes, and prevents opportunities to avoid TLB flushes or perform them
> in finer granularity.
>
> Use mmu_gather in change_pXX_range(). As the pages are not released,
> only record the flushed range using tlb_flush_pXX_range().
Andrea pointed out that I do not take care of THP. Actually, there is
indeed a missing TLB flush on THP, but it is not required due to the
pmdp_invalidate(). Anyhow, the patch needs to address it cleanly, and
to try to avoid the flush on pmdp_invalidate(), which at least on x86
does not appear to be necessary.
There is an additional bug, as tlb_change_page_size() needs to be
called.
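For context, I assume the fix is to record the page size the way other
mmu_gather users (e.g., zap_pte_range()) do, i.e. something along these
lines at the top of change_pte_range() - a sketch, not the actual
follow-up patch:

	static unsigned long change_pte_range(struct mmu_gather *tlb,
			struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
			unsigned long end, pgprot_t newprot, unsigned long cp_flags)
	{
		...
		/* let mmu_gather know the granularity of the entries we record */
		tlb_change_page_size(tlb, PAGE_SIZE);

		pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
		...
	}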
-- Jerome,
While reviewing my (bad) code, I wanted to understand whether updating
migration entries requires a TLB flush, because I do not think I got
that right either.
I thought it should not, but now I am not so sure. I am very confused
by the following code in migrate_vma_collect_pmd():
pte_unmap_unlock(ptep - 1, ptl);
/* Only flush the TLB if we actually modified any entries */
if (unmapped)
flush_tlb_range(walk->vma, start, end);
According to this code flush_tlb_range() is called without the ptl.
So theoretically there is a possible race:
CPU0                            CPU1
----                            ----
migrate_vma_collect_pmd()
  set_pte_at() [ present ->
                 non-present ]
  pte_unmap_unlock()
                                madvise(MADV_DONTNEED)
                                  zap_pte_range()
                                  [ PTE non-present =>
                                    no flush ]
So my questions:
1. Is there a reason the above scenario is invalid?
2. Does one need to flush the TLB when he updates a migration entry?
Thanks,
Nadav
On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> struct mm_struct *mm = vma->vm_mm;
> pgd_t *pgd;
> unsigned long next;
> - unsigned long start = addr;
> unsigned long pages = 0;
> + struct mmu_gather tlb;
>
> BUG_ON(addr >= end);
> pgd = pgd_offset(mm, addr);
> flush_cache_range(vma, addr, end);
> inc_tlb_flush_pending(mm);
> + tlb_gather_mmu(&tlb, mm);
> + tlb_start_vma(&tlb, vma);
Pure question:
I actually have no idea why tlb_start_vma() is needed here, as protection range
can be just a single page, but anyway.. I do see that tlb_start_vma() contains
a whole-vma flush_cache_range() when the arch needs it, then does it mean that
besides the inc_tlb_flush_pending() to be dropped, so as to the other call to
flush_cache_range() above?
> do {
> next = pgd_addr_end(addr, end);
> if (pgd_none_or_clear_bad(pgd))
> continue;
> - pages += change_p4d_range(vma, pgd, addr, next, newprot,
> + pages += change_p4d_range(&tlb, vma, pgd, addr, next, newprot,
> cp_flags);
> } while (pgd++, addr = next, addr != end);
>
> - /* Only flush the TLB if we actually modified any entries: */
> - if (pages)
> - flush_tlb_range(vma, start, end);
> - dec_tlb_flush_pending(mm);
> + tlb_end_vma(&tlb, vma);
> + tlb_finish_mmu(&tlb);
>
> return pages;
> }
> --
> 2.25.1
>
--
Peter Xu
> On Oct 12, 2021, at 3:16 AM, Peter Xu <[email protected]> wrote:
>
> On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
>> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>> struct mm_struct *mm = vma->vm_mm;
>> pgd_t *pgd;
>> unsigned long next;
>> - unsigned long start = addr;
>> unsigned long pages = 0;
>> + struct mmu_gather tlb;
>>
>> BUG_ON(addr >= end);
>> pgd = pgd_offset(mm, addr);
>> flush_cache_range(vma, addr, end);
>> inc_tlb_flush_pending(mm);
>> + tlb_gather_mmu(&tlb, mm);
>> + tlb_start_vma(&tlb, vma);
>
> Pure question:
>
> I actually have no idea why tlb_start_vma() is needed here, as protection range
> can be just a single page, but anyway.. I do see that tlb_start_vma() contains
> a whole-vma flush_cache_range() when the arch needs it, then does it mean that
> besides the inc_tlb_flush_pending() to be dropped, so as to the other call to
> flush_cache_range() above?
Good point.
tlb_start_vma() and tlb_end_vma() are required since some archs do not
batch TLB flushes across VMAs (e.g., ARM). I am not sure whether that’s
the best behavior for all archs, but I do not want to change it.
Anyhow, you make a valid point that the flush_cache_range() should be
dropped as well. I will do so in the next version.
Regards,
Nadav
On Tue, Oct 12, 2021 at 10:31:45AM -0700, Nadav Amit wrote:
>
>
> > On Oct 12, 2021, at 3:16 AM, Peter Xu <[email protected]> wrote:
> >
> > On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
> >> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
> >> struct mm_struct *mm = vma->vm_mm;
> >> pgd_t *pgd;
> >> unsigned long next;
> >> - unsigned long start = addr;
> >> unsigned long pages = 0;
> >> + struct mmu_gather tlb;
> >>
> >> BUG_ON(addr >= end);
> >> pgd = pgd_offset(mm, addr);
> >> flush_cache_range(vma, addr, end);
> >> inc_tlb_flush_pending(mm);
> >> + tlb_gather_mmu(&tlb, mm);
> >> + tlb_start_vma(&tlb, vma);
> >
> > Pure question:
> >
> > I actually have no idea why tlb_start_vma() is needed here, as protection range
> > can be just a single page, but anyway.. I do see that tlb_start_vma() contains
> > a whole-vma flush_cache_range() when the arch needs it, then does it mean that
> > besides the inc_tlb_flush_pending() to be dropped, so as to the other call to
> > flush_cache_range() above?
>
> Good point.
>
> tlb_start_vma() and tlb_end_vma() are required since some archs do not
> batch TLB flushes across VMAs (e.g., ARM).
Sorry, I didn't follow here - change_protection() is per-vma anyway, so I
don't see why it needs to consider vma crossing.
In any case, it'll be great if you could add some explanation to the commit
message on why we need tlb_{start|end}_vma(), as I think it may not be
obvious to all people.
> I am not sure whether that’s the best behavior for all archs, but I do not
> want to change it.
>
> Anyhow, you make a valid point that the flush_cache_range() should be
> dropped as well. I will do so for next version.
Thanks,
--
Peter Xu
> On Oct 12, 2021, at 4:20 PM, Peter Xu <[email protected]> wrote:
>
> On Tue, Oct 12, 2021 at 10:31:45AM -0700, Nadav Amit wrote:
>>
>>
>>> On Oct 12, 2021, at 3:16 AM, Peter Xu <[email protected]> wrote:
>>>
>>> On Sat, Sep 25, 2021 at 01:54:22PM -0700, Nadav Amit wrote:
>>>> @@ -338,25 +344,25 @@ static unsigned long change_protection_range(struct vm_area_struct *vma,
>>>> struct mm_struct *mm = vma->vm_mm;
>>>> pgd_t *pgd;
>>>> unsigned long next;
>>>> - unsigned long start = addr;
>>>> unsigned long pages = 0;
>>>> + struct mmu_gather tlb;
>>>>
>>>> BUG_ON(addr >= end);
>>>> pgd = pgd_offset(mm, addr);
>>>> flush_cache_range(vma, addr, end);
>>>> inc_tlb_flush_pending(mm);
>>>> + tlb_gather_mmu(&tlb, mm);
>>>> + tlb_start_vma(&tlb, vma);
>>>
>>> Pure question:
>>>
>>> I actually have no idea why tlb_start_vma() is needed here, as protection range
>>> can be just a single page, but anyway.. I do see that tlb_start_vma() contains
>>> a whole-vma flush_cache_range() when the arch needs it, then does it mean that
>>> besides the inc_tlb_flush_pending() to be dropped, so as to the other call to
>>> flush_cache_range() above?
>>
>> Good point.
>>
>> tlb_start_vma() and tlb_end_vma() are required since some archs do not
>> batch TLB flushes across VMAs (e.g., ARM).
>
> Sorry I didn't follow here - as change_protection() is per-vma anyway, so I
> don't see why it needs to consider vma crossing.
>
> In all cases, it'll be great if you could add some explanation into commit
> message on why we need tlb_{start|end}_vma(), as I think it could not be
> obvious to all people.
tlb_start_vma() is required when we switch from flush_tlb_range() because
certain properties of the VMA (e.g., whether it is executable) are needed
on certain arches. That is the reason flush_tlb_range() requires the VMA
that is invalidated to be provided.
Regardless, there is an interface and that is the way it is used. I am not
inclined to break it, even if it were possible, for unclear performance
benefits.
As I discussed offline with Andrea and David, switching to the
tlb_gather_mmu() interface has advantages beyond batching and avoiding
unnecessary flushes on PTE permission promotion (as done in patch 2). If a
single PTE is updated out of a bigger range, flush_tlb_range() currently
flushes the whole range instead of the single page. In addition, once I fix
this patch-set, if you update a THP, you would (at least on x86) be able to
flush a single PTE instead of flushing 512 entries (which would actually be
done using a full TLB flush).
As I mentioned in a different thread, and was not upfront about before,
one of my motivations behind this patch-set is that I need a vectored
UFFDIO_WRITEPROTECTV interface for performance. Nevertheless, I think
these two patches stand by themselves and have independent value.