2022-04-02 16:08:33

by David Hildenbrand

Subject: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection

Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
can try mapping anonymous pages writable if they are exclusive,
the PTE is already dirty, and no special handling applies. Mapping the
PTE writable is essentially the same thing the write fault handler would do
in this case.

Special handling is required for uffd-wp and softdirty tracking, so take
care of that properly. Also, leave PROT_NONE handling alone for now;
in the future, we could similarly extend the logic in do_numa_page() or
use pte_mk_savedwrite() here. Note that we'll now also check for uffd-wp in
case of VM_SHARED -- which is harmless and prepares for uffd-wp support for
shmem.

While this improves mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE)
performance, it should also be a valuable optimization for uffd-wp, when
un-protecting.
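
For the uffd-wp case, un-protecting boils down to a UFFDIO_WRITEPROTECT call
that clears the protection again, roughly like the following sketch (it
assumes "uffd", "addr" and "len" were set up beforehand and the range was
registered with UFFDIO_REGISTER_MODE_WP):

	struct uffdio_writeprotect wp = {
		.range = { .start = (unsigned long long)addr, .len = len },
		.mode = 0,	/* no UFFDIO_WRITEPROTECT_MODE_WP: un-protect */
	};

	/* With this change, already-dirty exclusive anonymous PTEs can be
	 * mapped writable right here instead of on the next write fault. */
	ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);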

Applying the same logic to PMDs (anonymous THP, anonymous hugetlb) is
probably not worth the trouble, but could similarly be added if there is
demand.

Results of a simple microbenchmark on my Ryzen 9 3900X, comparing the new
optimization (avoiding write faults) during mprotect() with softdirty
tracking, where a write fault is still required.
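
Each test case essentially performs the following sequence; this is only a
rough sketch of the measured accesses (timing and error handling omitted,
not the exact benchmark code), and the "single byte access per page" cases
write one byte per page instead of using memset():

#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/mman.h>

#define SIZE 4096	/* also run with 16384, 65536 and 524288 */

int main(void)
{
	char *map = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	int fd;

	memset(map, 1, SIZE);		/* first write access */
	memset(map, 1, SIZE);		/* second write access */

	/* write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE) */
	mprotect(map, SIZE, PROT_READ);
	mprotect(map, SIZE, PROT_READ | PROT_WRITE);
	memset(map, 1, SIZE);

	/* write access after clearing softdirty via /proc/self/clear_refs */
	fd = open("/proc/self/clear_refs", O_WRONLY);
	write(fd, "4", 1);
	close(fd);
	memset(map, 1, SIZE);

	munmap(map, SIZE);
	return 0;
}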

Running 1000 iterations each

==========================================================
Measuring memset() of 4096 bytes
First write access:
Min: 741 ns, Max: 3566 ns, Avg: 770 ns
Second write access:
Min: 150 ns, Max: 441 ns, Avg: 158 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 170 ns, Max: 420 ns, Avg: 177 ns
Write access after clearing softdirty:
Min: 440 ns, Max: 1533 ns, Avg: 454 ns
-> mprotect = 1.120 * second [avg]
-> mprotect = 0.390 * softdirty [avg]
----------------------------------------------------------
Measuring single byte access per page of 4096 bytes
First write access:
Min: 281 ns, Max: 1022 ns, Avg: 732 ns
Second write access:
Min: 120 ns, Max: 160 ns, Avg: 129 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 140 ns, Max: 191 ns, Avg: 146 ns
Write access after clearing softdirty:
Min: 301 ns, Max: 561 ns, Avg: 416 ns
-> mprotect = 1.132 * second [avg]
-> mprotect = 0.351 * softdirty [avg]
==========================================================
Measuring memset() of 16384 bytes
First write access:
Min: 1923 ns, Max: 3497 ns, Avg: 1986 ns
Second write access:
Min: 211 ns, Max: 310 ns, Avg: 249 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 9 ns, Max: 361 ns, Avg: 281 ns
Write access after clearing softdirty:
Min: 1203 ns, Max: 1974 ns, Avg: 1232 ns
-> mprotect = 1.129 * second [avg]
-> mprotect = 0.228 * softdirty [avg]
----------------------------------------------------------
Measuring single byte access per page of 16384 bytes
First write access:
Min: 961 ns, Max: 9317 ns, Avg: 1855 ns
Second write access:
Min: 130 ns, Max: 171 ns, Avg: 132 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 150 ns, Max: 191 ns, Avg: 153 ns
Write access after clearing softdirty:
Min: 1061 ns, Max: 1513 ns, Avg: 1085 ns
-> mprotect = 1.159 * second [avg]
-> mprotect = 0.141 * softdirty [avg]
==========================================================
Measuring memset() of 65536 bytes
First write access:
Min: 6933 ns, Max: 14366 ns, Avg: 7068 ns
Second write access:
Min: 601 ns, Max: 772 ns, Avg: 614 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 671 ns, Max: 7795 ns, Avg: 715 ns
Write access after clearing softdirty:
Min: 4238 ns, Max: 11703 ns, Avg: 4367 ns
-> mprotect = 1.164 * second [avg]
-> mprotect = 0.164 * softdirty [avg]
----------------------------------------------------------
Measuring single byte access per page of 65536 bytes
First write access:
Min: 6082 ns, Max: 13866 ns, Avg: 6637 ns
Second write access:
Min: 130 ns, Max: 190 ns, Avg: 145 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 170 ns, Max: 992 ns, Avg: 184 ns
Write access after clearing softdirty:
Min: 3367 ns, Max: 4709 ns, Avg: 3759 ns
-> mprotect = 1.269 * second [avg]
-> mprotect = 0.049 * softdirty [avg]
==========================================================
Measuring memset() of 524288 bytes
First write access:
Min: 54712 ns, Max: 86162 ns, Avg: 55544 ns
Second write access:
Min: 4989 ns, Max: 7714 ns, Avg: 5106 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 5561 ns, Max: 7044 ns, Avg: 5710 ns
Write access after clearing softdirty:
Min: 34224 ns, Max: 41848 ns, Avg: 34610 ns
-> mprotect = 1.118 * second [avg]
-> mprotect = 0.165 * softdirty [avg]
----------------------------------------------------------
Measuring single byte access per page of 524288 bytes
First write access:
Min: 50695 ns, Max: 56617 ns, Avg: 51353 ns
Second write access:
Min: 390 ns, Max: 1553 ns, Avg: 1090 ns
Write access after mprotect(PROT_READ)+mprotect(PROT_READ|PROT_WRITE):
Min: 471 ns, Max: 2074 ns, Avg: 675 ns
Write access after clearing softdirty:
Min: 29115 ns, Max: 35076 ns, Avg: 29521 ns
-> mprotect = 0.619 * second [avg]
-> mprotect = 0.023 * softdirty [avg]

Cc: Linus Torvalds <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Nadav Amit <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Yang Shi <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: Mel Gorman <[email protected]>
Signed-off-by: David Hildenbrand <[email protected]>
---

This is based on:
"[PATCH v2 00/15] mm: COW fixes part 2: reliable GUP pins of
anonymous pages"
-> https://lkml.kernel.org/r/[email protected]

... which is in -mm but not yet in -next. Sending this out for early
discussion.

---
mm/mprotect.c | 70 ++++++++++++++++++++++++++++++++++++++++++---------
1 file changed, 58 insertions(+), 12 deletions(-)

diff --git a/mm/mprotect.c b/mm/mprotect.c
index 56060acdabd3..69770b547ec1 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -36,6 +36,49 @@

#include "internal.h"

+static inline bool can_change_pte_writable(struct vm_area_struct *vma,
+ unsigned long addr, pte_t pte,
+ unsigned long cp_flags)
+{
+ struct page *page;
+
+ if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
+ /*
+ * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
+ * without MM_CP_DIRTY_ACCT, there is nothing to do.
+ */
+ return false;
+
+ if (!(vma->vm_flags & VM_WRITE))
+ return false;
+
+ if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
+ return false;
+
+ /* Do we need write faults for softdirty tracking? */
+ if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
+ (vma->vm_flags & VM_SOFTDIRTY))
+ return false;
+
+ /* Do we need write faults for uffd-wp tracking? */
+ if (userfaultfd_pte_wp(vma, pte))
+ return false;
+
+ if (!(vma->vm_flags & VM_SHARED)) {
+ /*
+ * We can only special-case on exclusive anonymous pages,
+ * because we know that our write-fault handler similarly would
+ * map them writable without any additional checks while holding
+ * the PT lock.
+ */
+ page = vm_normal_page(vma, addr, pte);
+ if (!page || !PageAnon(page) || !PageAnonExclusive(page))
+ return false;
+ }
+
+ return true;
+}
+
static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
unsigned long addr, unsigned long end, pgprot_t newprot,
unsigned long cp_flags)
@@ -44,7 +87,6 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
spinlock_t *ptl;
unsigned long pages = 0;
int target_node = NUMA_NO_NODE;
- bool dirty_accountable = cp_flags & MM_CP_DIRTY_ACCT;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;
@@ -133,21 +175,25 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_wrprotect(ptent);
ptent = pte_mkuffd_wp(ptent);
} else if (uffd_wp_resolve) {
- /*
- * Leave the write bit to be handled
- * by PF interrupt handler, then
- * things like COW could be properly
- * handled.
- */
ptent = pte_clear_uffd_wp(ptent);
}

- /* Avoid taking write faults for known dirty pages */
- if (dirty_accountable && pte_dirty(ptent) &&
- (pte_soft_dirty(ptent) ||
- !(vma->vm_flags & VM_SOFTDIRTY))) {
+ /*
+ * In some writable, shared mappings, we might want
+ * to catch actual write access -- see
+ * vma_wants_writenotify().
+ *
+ * In all writable, private mappings, we have to
+ * properly handle COW.
+ *
+ * In both cases, we can sometimes still map PTEs
+ * writable and avoid the write-fault handler, for
+ * example, if the PTE is already dirty and no other
+ * COW or special handling is required.
+ */
+ if (can_change_pte_writable(vma, addr, ptent, cp_flags))
ptent = pte_mkwrite(ptent);
- }
+
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
pages++;
} else if (is_swap_pte(oldpte)) {
--
2.35.1


2022-04-03 15:37:58

by Nadav Amit

Subject: Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection

[ +Rick ]

> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <[email protected]> wrote:
>
> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
> can try mapping anonymous pages writable if they are exclusive,
> the PTE is already dirty, and no special handling applies. Mapping the
> PTE writable is essentially the same thing the write fault handler would do
> in this case.

In general I am all supportive of such a change.

I do have some mostly-minor concerns.

>
> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
> + unsigned long addr, pte_t pte,
> + unsigned long cp_flags)
> +{
> + struct page *page;
> +
> + if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
> + /*
> + * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
> + * without MM_CP_DIRTY_ACCT, there is nothing to do.
> + */
> + return false;
> +
> + if (!(vma->vm_flags & VM_WRITE))
> + return false;
> +
> + if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
> + return false;

If pte_write() is already true then return false? I understand you want
to do so because the page is already writable, but it is confusing.

In addition, I am not sure whether the pte_dirty() check is really robust.
I mean I think it is ok, but is there any issue with shadow-stack?

And this also assumes the kernel does not clear the dirty bit without
clearing the present bit, as otherwise the note in Intel SDM section 4.8
("Accessed and Dirty Flags") will be relevant and the dirty bit might be
set unnecessarily. I think it is ok.

> +
> + /* Do we need write faults for softdirty tracking? */
> + if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
> + (vma->vm_flags & VM_SOFTDIRTY))

If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
think the IS_ENABLED() is necessary (unless you think it is clearer this
way).

> + return false;
> +
> + /* Do we need write faults for uffd-wp tracking? */
> + if (userfaultfd_pte_wp(vma, pte))
> + return false;
> +
> + if (!(vma->vm_flags & VM_SHARED)) {
> + /*
> + * We can only special-case on exclusive anonymous pages,
> + * because we know that our write-fault handler similarly would
> + * map them writable without any additional checks while holding
> + * the PT lock.
> + */
> + page = vm_normal_page(vma, addr, pte);

I guess we cannot call vm_normal_page() twice, once for prot_numa and once
here, in practice...

> + if (!page || !PageAnon(page) || !PageAnonExclusive(page))
> + return false;
> + }
> +
> + return true;
> +}

Note that there is a small downside to all of that. Assume you mprotect()
a single page from RO to RW and you have many threads.

With my pending patch you would avoid the TLB shootdown (and get a PF).
With this patch you would get a TLB shootdown and save the PF. IOW, I
think it is worthy to skip the shootdown as well in such a case and
instead flush the TLB on spurious page-faults. But I guess that’s for
another patch.

2022-04-05 03:09:55

by David Hildenbrand

Subject: Re: [PATCH v1 mmotm] mm/mprotect: try avoiding write faults for exclusive anonymous pages when changing protection

On 01.04.22 21:15, Nadav Amit wrote:
> [ +Rick ]
>
>> On Apr 1, 2022, at 3:13 AM, David Hildenbrand <[email protected]> wrote:
>>
>> Similar to our MM_CP_DIRTY_ACCT handling for shared, writable mappings, we
>> can try mapping anonymous pages writable if they are exclusive,
>> the PTE is already dirty, and no special handling applies. Mapping the
>> PTE writable is essentially the same thing the write fault handler would do
>> in this case.
>
> In general I am all supportive of such a change.
>
> I do have some mostly-minor concerns.

Hi Nadav,

thanks a lot for your review!

>
>>
>> +static inline bool can_change_pte_writable(struct vm_area_struct *vma,
>> + unsigned long addr, pte_t pte,
>> + unsigned long cp_flags)
>> +{
>> + struct page *page;
>> +
>> + if ((vma->vm_flags & VM_SHARED) && !(cp_flags & MM_CP_DIRTY_ACCT))
>> + /*
>> + * MM_CP_DIRTY_ACCT is only expressive for shared mappings;
>> + * without MM_CP_DIRTY_ACCT, there is nothing to do.
>> + */
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_WRITE))
>> + return false;
>> +
>> + if (pte_write(pte) || pte_protnone(pte) || !pte_dirty(pte))
>> + return false;
>
> If pte_write() is already true then return false? I understand you want
> to do so because the page is already writable, but it is confusing.


I thought about just doing this outside of the function:

	if ((vma->vm_flags & VM_WRITE) && !pte_write(pte) &&
	    can_change_pte_writable()...


I refrained from doing so because the sequence of checks might be
sub-optimal. But most probably we don't really care about that and it
might make the code easier to grasp.
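
Spelled out, the call site would then look roughly like (completely untested,
just to illustrate the idea):

	if ((vma->vm_flags & VM_WRITE) && !pte_write(ptent) &&
	    can_change_pte_writable(vma, addr, ptent, cp_flags))
		ptent = pte_mkwrite(ptent);

with the VM_WRITE and pte_write() checks dropped from
can_change_pte_writable() itself.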

Would that make it clearer?

>
> In addition, I am not sure whether the pte_dirty() check is really robust.
> I mean I think it is ok, but is there any issue with shadow-stack?

Judging that it's already used that way for VMAs with dirty tracking, I
assume it's ok. Without checking that the PTE is dirty, we'd have to do a:

pte_mkdirty(pte_mkwrite(ptent));

Which would set the pte and consequently the page dirty, although there
might not even be a write access. That's what we want to avoid here.

>
> And this also assumes the kernel does not clear the dirty bit without
> clearing the present bit, as otherwise the note in Intel SDM section 4.8
> ("Accessed and Dirty Flags") will be relevant and the dirty bit might be
> set unnecessarily. I think it is ok.

Yeah, I think so as well.

>
>> +
>> + /* Do we need write faults for softdirty tracking? */
>> + if (IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) && !pte_soft_dirty(pte) &&
>> + (vma->vm_flags & VM_SOFTDIRTY))
>
> If !IS_ENABLED(CONFIG_MEM_SOFT_DIRTY) then VM_SOFTDIRTY == 0. So I do not
> think the IS_ENABLED() is necessary (unless you think it is clearer this
> way).

Right, we can just do

if ((vma->vm_flags & VM_SOFTDIRTY) && !pte_soft_dirty(pte))

and it should get fully optimized out. Thanks!

>
>> + return false;
>> +
>> + /* Do we need write faults for uffd-wp tracking? */
>> + if (userfaultfd_pte_wp(vma, pte))
>> + return false;
>> +
>> + if (!(vma->vm_flags & VM_SHARED)) {
>> + /*
>> + * We can only special-case on exclusive anonymous pages,
>> + * because we know that our write-fault handler similarly would
>> + * map them writable without any additional checks while holding
>> + * the PT lock.
>> + */
>> + page = vm_normal_page(vma, addr, pte);
>
> I guess we cannot call vm_normal_page() twice, once for prot_numa and once
> here, in practice...

I guess we could, but it doesn't necessarily make the code easier to
read :) And we want to skip protnone either way.

>
>> + if (!page || !PageAnon(page) || !PageAnonExclusive(page))
>> + return false;
>> + }
>> +
>> + return true;
>> +}
>
> Note that there is a small downside to all of that. Assume you mprotect()
> a single page from RO to RW and you have many threads.
>
> With my pending patch you would avoid the TLB shootdown (and get a PF).
> With this patch you would get a TLB shootdown and save the PF. IOW, I
> think it is worthwhile to skip the shootdown as well in such a case and
> instead flush the TLB on spurious page-faults. But I guess that’s for
> another patch.

Just so I understand correctly: your optimization avoids the flush when,
effectively, nothing changed (R/O -> R/O).

And the optimization for this case here would be to avoid the TLB flush
when it is similarly not required (R/O -> R/W).

Correct?

--
Thanks,

David / dhildenb