On Mon, Dec 21, 2020 at 01:49:55PM -0800, Nadav Amit wrote:
> BTW: In general, I think that you are right, and that changing of PTEs
> should not require taking mmap_lock for write. However, I am not sure
> cow_user_page() is not the only one that poses a problem and whether a more
> systematic solution is needed. If cow_user_page() is the only problem, do
> you think it is possible to do the copying while holding the PTL? It works
> for normal-pages, but I am not sure whether special-pages pose special
> problems.
>
> Anyhow, this is an enhancement that we can try later.
AFAIU mprotect() is the only one that modifies the ptes while holding the mmap
write lock. NUMA balancing also uses the mmap read lock when changing pte
protections, and my understanding is that mprotect() takes the write lock only
because it manipulates the address space itself (i.e. the vma layout) rather
than just the ptes, so it has to.
At the pte level, it seems always to be the pgtable lock that serializes things.
So it's perfectly legal to me for e.g. a driver to modify ptes with the read
lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
taken when doing so.
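For illustration, a minimal sketch of what such a helper could look like
(hypothetical function, not taken from any real driver; a permission downgrade
would additionally need a TLB flush before anything relies on it):

/*
 * Illustrative sketch only: rewrite one pte while holding mmap_lock
 * for read.  The page table lock (PTL), not mmap_lock, is what
 * serializes this against the page fault path.
 */
static void example_update_pte(struct mm_struct *mm, pmd_t *pmd,
			       unsigned long addr, pte_t newpte)
{
	spinlock_t *ptl;
	pte_t *pte;

	mmap_assert_locked(mm);			/* the read lock is enough */
	pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
	set_pte_at(mm, addr, pte, newpte);	/* serialized by the PTL */
	pte_unmap_unlock(pte, ptl);
}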
If a driver manipulated the ptes, changed the content of the page, then
restored the ptes to their original values, and all of that happened right
after wp_page_copy() dropped the pgtable lock but before wp_page_copy()
retook that same lock, we would hit the same issue: the copied page would end
up containing corrupted data. And I wouldn't know what to blame on the driver
either, because it would seem to be following the rules exactly.
I believe changing to the write lock would solve the race here because tlb
flushing would be guaranteed along the way, but I'm just a bit worried it's
not the best way to go...
Thanks,
--
Peter Xu
On Mon, Dec 21, 2020 at 05:30:41PM -0500, Peter Xu wrote:
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock. NUMA balancing is also using read mmap lock when changing pte
> protections,
NUMA balancing doesn't clear pte_write() -- I would not call setting
pte_protnone() a change of protection.
> while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.
Yes, and personally, I would only take mmap lock for write when I
change VMAs, not PTE protections.
> At the pte level, it seems always to be the pgtable lock that serializes things.
>
> So it's perfectly legal to me for e.g. a driver to modify ptes with the read
> lock of mmap_sem, unless I'm severely mistaken.. as long as the pgtable lock is
> taken when doing so.
>
> If there's a driver that manipulated the ptes, changed the content of the page,
> recover the ptes to origin, and all these happen right after wp_page_copy()
> unlocked the pgtable lock but before wp_page_copy() retakes the same lock
> again, we may face the same issue finding that the page got copied contains
> corrupted data at last. While I don't know what to blame on the driver either
> because it seems to be exactly following the rules.
>
> I believe changing into write lock would solve the race here because tlb
> flushing would be guaranteed along the way, but I'm just a bit worried it's not
> the best way to go..
I can't say I disagree with you but the man has made the call and I
think we should just move on.
On Mon, Dec 21, 2020 at 2:30 PM Peter Xu <[email protected]> wrote:
>
> AFAIU mprotect() is the only one who modifies the pte using the mmap write
> lock. NUMA balancing is also using read mmap lock when changing pte
> protections, while my understanding is mprotect() used write lock only because
> it manipulates the address space itself (aka. vma layout) rather than modifying
> the ptes, so it needs to.
So it's ok to change the pte holding only the PTE lock, if it's a
*one*way* conversion.
That doesn't break the "re-check the PTE contents" model (which
predates _all_ of the rest: NUMA, userfaultfd, everything - it's
pretty much the original model for our page table operations, and goes
back to the dark ages even before SMP and the existence of a page
table lock).
So for example, a COW will always create a different pte (not just
because the page number itself changes - you could imagine a page
getting re-used and changing back - but because it's always a RO->RW
transition).
So two COW operations cannot "undo" each other and fool us into
thinking nothing changed.
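To make that "re-check" model concrete, here is a heavily simplified sketch of
the shape of wp_page_copy() (alloc_and_copy() and install_new_pte() are
made-up helper names standing in for the real details):

/*
 * Simplified sketch of the "re-check the PTE contents" pattern.
 * Error handling, accounting and mmu notifiers are all omitted.
 */
static vm_fault_t cow_sketch(struct vm_fault *vmf)
{
	pte_t orig = vmf->orig_pte;		/* snapshotted under the PTL */
	struct page *new_page;

	pte_unmap_unlock(vmf->pte, vmf->ptl);	/* drop the PTL ... */
	new_page = alloc_and_copy(vmf);		/* ... and do the slow work */

	vmf->pte = pte_offset_map_lock(vmf->vma->vm_mm, vmf->pmd,
				       vmf->address, &vmf->ptl);
	if (!pte_same(*vmf->pte, orig)) {
		/* somebody changed the pte: throw the copy away and retry */
		pte_unmap_unlock(vmf->pte, vmf->ptl);
		put_page(new_page);
		return 0;
	}
	/* pte unchanged: install the copy as a writable pte */
	install_new_pte(vmf, new_page);
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	return VM_FAULT_WRITE;
}

The pte_same() check is only meaningful because legal concurrent changes are
one-way: under these rules a pte cannot change and then come back to the same
bits behind our back.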
Anything that changes RW->RO - like fork(), for example - needs to
take the mmap_lock.
NUMA balancing should be ok wrt COW, because it doesn't do that RW->RO
thing, it uses the present bit.
I think that you are right that NUMA balancing itself might cause
other issues, because it can cause that "pte changed and then came
back" (for numa protectoipn and then a numa fault) all with just the
mmap lock for reading.
However, even that shouldn't matter for COW, because the write protect
bit is the one that protects the *contents* of the page, so even if
NUMA balancing caused that "load original PTE, then re-check later" to
succeed (despite the PTE actually changing in the middle), the
_contents_ of the page cannot have changed, so COW is ok. NUMA
balancing won't be making a read-only page temporarily writable.
Linus
On Mon, Dec 21, 2020 at 3:12 PM Yu Zhao <[email protected]> wrote:
>
> I can't say I disagree with you but the man has made the call and I
> think we should just move on.
"The man" can always be convinced by numbers.
So if somebody comes up with an alternate patch, and explains it, and
shows that it is better - go for it.
I just think that if mprotect() can take the mmap lock for writing,
then userfaultfd sure as hell can. What odd load do people have
where userfaultfd is more important than mprotect?
So as far as the man is concerned, I think "just fix userfaultfd" is
simply the default obvious operation.
Not necessarily a final endpoint.
Linus
On Mon, Dec 21, 2020 at 3:22 PM Linus Torvalds
<[email protected]> wrote:
> So it's ok to change the pte holding only the PTE lock, if it's a
> *one*way* conversion.
>
> That doesn't break the "re-check the PTE contents" model (which
> predates _all_ of the rest: NUMA, userfaultfd, everything - it's
> pretty much the original model for our page table operations, and goes
> back to the dark ages even before SMP and the existence of a page
> table lock).
>
> So for example, a COW will always create a different pte (not just
> because the page number itself changes - you could imagine a page
> getting re-used and changing back - but because it's always a RO->RW
> transition).
>
> So two COW operations cannot "undo" each other and fool us into
> thinking nothing changed.
>
> Anything that changes RW->RO - like fork(), for example - needs to
> take the mmap_lock.
Ugh, this is unpleasantly complicated. I will admit that any API that
takes an address and more-or-less-blindly marks it RO makes me quite
nervous even assuming all the relevant locks are held. At least
userfaultfd refuses to operate on VM_SHARED VMAs, but we have another
instance of this (with mmap_sem held for write) in x86:
mark_screen_rdonly(). Dare I ask how broken this is? We could likely
get away with deleting it entirely.
--Andy
On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski <[email protected]> wrote:
>
> Ugh, this is unpleasantly complicated.
I probably should have phrased it differently, because the case you
quote is actually a good one:
> I will admit that any API that
> takes an address and more-or-less-blindly marks it RO makes me quite
> nervous even assuming all the relevant locks are held. At least
> userfaultfd refuses to operate on VM_SHARED VMAs, but we have another
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly(). Dare I ask how broken this is? We could likely
> get away with deleting it entirely.
So I think the basic rule is that "if you hold mmap_sem for writing,
you're always safe". And that really should be considered the
"default" locking.
ANY time you make a modification to the VM layer, you should basically
always treat it as a write operation, and get the mmap_sem for
writing.
Yeah, yeah, that's a bit simplified, and it ignores various special
cases (and the hardware page table walkers that obviously take no
locks at all), but if you hold the mmap_sem for writing you won't
really race with anything else - not page faults, and not other
"modify this VM".
So mark_screen_rdonly() looks trivially fine to me. I don't think it
really necessarily matters any more, and it's a legacy thing for a
legacy hardware issue, but I don't think there's any reason at all to
remove it either.
To a first approximation, everybody that changes the VM should take
the mmap_sem for writing, and the readers should just be just about
page fault handling (and I count GUP as "page fault handling" too -
it's kind of the same "look up page" rather than "modify vm" kind of
operation).
And there are just a _lot_ more page faults than there are things that
modify the page tables and the vma's.
So having that mental model of "lookup of pages in a VM takes mmap_sem
for reading, any modification of the VM uses it for writing" makes
sense both from a performance angle and a logical standpoint. It's the
correct model.
And it's worth noting that COW is still "lookup of pages", even though
it might modify the page tables in the process. The same way lookup
can modify the page tables to mark things accessed or dirty.
So COW is still a lookup operation, in ways that "change the
writabiility of this range" very much is not. COW is "lookup for
write", and the magic we do to copy to make that write valid is still
all about the lookup of the page.
Which brings up another mental mistake I saw earlier in this thread:
you should not think "mmap_sem is for vma, and the page table lock is
for the page table changes".
mmap_sem is the primary lock for any modifications to the VM layout,
whether it be in the vma's or in the page tables.
Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
it is partly because
(a) we have things that historically walked the page tables _without_
walking the vma's (notably the virtual memory scanning)
(b) we do allow concurrent page faults, so we then need a lower-level
lock to serialize the parallelism we _do_ have.
And yes, we have ended up allowing some of those "modify the VM state"
to then take the mmap_sem for reading, just because their
modifications are slight enough that we can then say "ok, this is a
write modification, but the writes it does are protected by the page
table lock, we'll just get the mmap_sem for reading". It's an
optimization, but it should be considered exactly that: not
fundamental, but just a clever trick and an optimization.
It's why I went "userfaultfd is buggy" immediately when I noticed. It
basically became clear that "oh, that's an optimization, but it's an
*invalid* one", in that it didn't actually work and give the same end
result.
So when I said:
> Anything that changes RW->RO - like fork(), for example - needs to
> take the mmap_lock.
I didn't mean that we should consider that RW->RO change to be this
complex semantic marker that we should honor and say "ok, because it's
a RW->RO change, we need to hold the mmap_sem".
I phrased it really badly, in other words.
What I *should* have said is that *because* userfaultfd is changing
the VM layout, it should always act as if it had to take the mmap_sem
for writing, and that the RW->RO change is an example of when
downgrading that write-lock to a read lock is simply not valid -
because it basically breaks the rules about what a lookup (ie a read)
can do.
A lookup can never turn a writable page non-writable. A lookup -
through COW - _can_ turn a read-only page writable. So that's why
"RO->RW" can be ok under the read lock, but RW->RO is not.
Does that clarify the situation when I phrase it that way instead?
Linus
On Mon, Dec 21, 2020 at 07:19:35PM -0800, Andy Lutomirski wrote:
> instance of this (with mmap_sem held for write) in x86:
> mark_screen_rdonly(). Dare I ask how broken this is? We could likely
That one is buggy despite taking the mmap_write_lock... swapping the last
two lines would fix it though.
- mmap_write_unlock(mm);
flush_tlb_mm_range(mm, 0xA0000, 0xA0000 + 32*PAGE_SIZE, PAGE_SHIFT, false);
+ mmap_write_unlock(mm);
The issue here is that wp_page_copy rightfully copies outside the PT lock
(a good thing) and correctly assumes that nobody can modify the page during
the copy if the fault happened on a write access and the pte had no
_PAGE_WRITE bit set.
The bug is clearly in userfaultfd_writeprotect that doesn't enforce
the above invariant.
If userfaultfd_wrprotect(mode_wp = false) could only run after the deferred
TLB flush, this could never happen, because the uffd_wp bit would still be
set and the page fault would hit the handle_userfault dead end and have to
be restarted from scratch.
Leaving stale tlb entries after dropping the PT lock and with only the
mmap_lock_read is only ok if the page fault has a "check" that catches
that specific case, so that specific case is fully serialized by the
PT lock alone and the "catch" path is fully aware about the stale TLB
entries that were left around.
So looking at the two cases that predate uffd-wp and already did a
RW->RO transition with the mmap_lock_read, they both comply with the
wp_page_copy invariant but with two different techniques:
1) change_protection is called by the scheduler with the
mmap_read_lock and with a deferred TLB flush, and things don't fall
apart completely there simply because we catch that in do_numa_page:
	if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
+		/* catch it: no risk to end up in wp_page_copy even if the
+		 * change_protection is running concurrently and the tlb
+		 * flush is still pending */
		return do_numa_page(vmf);
do_numa_page doesn't issue any further tlb flush, but basically
restores the pte to the previous value, so then the pending flush
becomes a noop, since there was no modification to the pte in the
first place after do_numa_page runs.
2) ksm write_protect_page is also called with mmap_read_lock, and it
will simply not defer the flush. If the tlb flush is done inside
the PT lock it is ok to take mmap_read_lock since the page fault
code will not make any assumption about the TLB before reading the
pagetable content under PT lock.
In fact if mm_tlb_flush_pending(mm) is true, write_protect_page in
KSM has to issue an extra synchronous flush before dropping the PT
lock, even if it finds the _PAGE_WRITE bit is already clear,
precisely because there can be another deferred flush that cleared
the pte but didn't flush the TLB yet.
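In code, the technique in 2) looks roughly like this (condensed and
simplified from write_protect_page() in mm/ksm.c; mmu notifier and
page_vma_mapped_walk details omitted):

	if (pte_write(*ptep) || pte_dirty(*ptep) ||
	    mm_tlb_flush_pending(mm)) {
		pte_t entry;

		flush_cache_page(vma, addr, page_to_pfn(page));
		/* clears the pte AND flushes the TLB before going on */
		entry = ptep_clear_flush(vma, addr, ptep);
		if (pte_dirty(entry))
			set_page_dirty(page);
		entry = pte_mkclean(pte_wrprotect(entry));
		set_pte_at(mm, addr, ptep, entry);
	}

The mm_tlb_flush_pending(mm) clause is the interesting part: even when the
pte already looks read-only, a pending deferred flush from some other
modifier forces the synchronous ptep_clear_flush(), so no stale writable TLB
entry can survive past the PT lock.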
userfaultfd_writeprotect() is already using the technique in 1) to be
safe, except the un-wrprotect can break it if run while the tlb flush
is still pending...
The good thing is that this guarantee only needs to be provided per-vma, not
per-mm, and a vma cannot be registered on two different uffd ctxs at the same
time, so the locking can happen entirely inside the uffd ctx.
My previous suggestion to serialize userfaultfd_writeprotect with a mutex
will still work, but since we can run as many wrprotects (or as many
un-wrprotects) in parallel as we want, as long as the two kinds aren't
simultaneous, we can do much better than a mutex.
Ideally we would need a new two_group_semaphore, where each group can
run as many parallel instances as it wants, but no instance of one
group can run in parallel with any instance of the other group. AFAIK
such a kind of lock doesn't exist right now.
Wrprotect left alone never had an issue, since we had a working "catch"
check for it well before wp_page_copy could run. Un-wrprotect left alone was
also ok, since it's a permission upgrade.
Alternatively, we can solve this with the same technique as in 2),
because all that matters is that the 4k pte modification doesn't take
the mmap_lock_write. A modification to a single 4k pte or a single 2m
hugepmd is very likely indication that there's going to be a flood of
those in parallel from all CPUs and likely there's also a flood of
page faults from all CPUs in parallel. In addition for a 4k wrprotect
or un-wrprotect there's not a significant benefit in deferring the TLB
flush.
I don't think we can take the mmap_write_lock unconditionally because
one of the main reasons to deal with the extra complexity of resolving
page faults in userland is to bypass mprotect entirely.
For example, JITs can use unprivileged uffd to eliminate the software-set
dirty bit every time they modify memory; that would then require frequent
wrprotect/un-wrprotect during page faults, and the less likely we are to
block for I/O during those CPU-only events, the better. This is one of the
potential use cases, but there are plenty more.
So the alternative would be to take mmap_read_lock for "len <=
HPAGE_PMD_SIZE" and the mmap_write_lock otherwise, and to add a
parameter in change_protection to tell it to flush before dropping the
PT lock and not defer the flush. So this is going to be more intrusive
in the VM and it's overall unnecessary.
The below is mostly untested... but it'd be good to hear some feedback
before doing more work in this direction.
From 4ace4d1b53f5cb3b22a5c2dc33facc4150b112d6 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Tue, 22 Dec 2020 14:30:16 -0500
Subject: [PATCH 1/1] mm: userfaultfd: avoid leaving stale TLB after
userfaultfd_writeprotect()
change_protection() is called by userfaultfd_writeprotect() with the
mmap_lock_read like in change_prot_numa().
The page fault code in wp_page_copy() rightfully assumes that if the CPU
issued a write fault and the write bit in the pagetable is not set, no CPU
can write to the page. That's a wrong assumption after
userfaultfd_writeprotect(). It's also a wrong assumption after
change_prot_numa(), where the regular page fault code would likewise assume
that if the present bit is not set while the page fault is running, there
can be no stale TLB entry, but there still is.
So to stay safe, the page fault code must be prevented from running as
long as the TLB flush remains pending. That is already achieved by
the do_numa_page() path for change_prot_numa() and by the
userfaultfd_pte_wp() path for userfaultfd_writeprotect().
The problem that needs fixing is that an un-wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP not
set) could run in between the original wrprotect
(i.e. userfaultfd_writeprotect() with UFFDIO_WRITEPROTECT_MODE_WP set)
and wp_page_copy(), while the TLB flush remains pending.
In such a case the userfaultfd_pte_wp() check stops being effective at
preventing the regular page fault code from running.
CPU0: userfaultfd_wrprotect(mode_wp = true)
CPU0:   PT lock
CPU0:   atomic set _PAGE_UFFD_WP and clear _PAGE_WRITE
CPU0:   PT unlock
CPU2: do_page_fault FAULT_FLAG_WRITE
CPU1: userfaultfd_wrprotect(mode_wp = false)
CPU1:   PT lock
CPU1:   ATOMIC clear _PAGE_UFFD_WP  <- problem
CPU1:   /* _PAGE_WRITE not set */
CPU1:   PT unlock
      XXXXXXXXXXXXXX BUG RACE window open here
CPU2:   PT lock
CPU2:   FAULT_FLAG_WRITE is set by CPU
CPU2:   _PAGE_WRITE is still clear in pte
CPU2:   PT unlock
CPU2: wp_page_copy
CPU2:   copy_user_page runs with stale TLB
CPU0: deferred tlb flush  <- too late
      XXXXXXXXXXXXXX BUG RACE window close here
================================================================================
If the userfaultfd_wrprotect(mode_wp = false) can only run after the
deferred TLB flush the above cannot happen either, because the uffd_wp
bit will remain set until after the TLB flush and the page fault would
reliably hit the handle_userfault() dead end as long as the TLB is
stale.
So to fix this we need to prevent any un-wrprotect from starting until the
last outstanding wrprotect has completed, and to prevent any further
wrprotect until the last outstanding un-wrprotect has completed. Then
wp_page_copy can still run in parallel, but only with the un-wrprotect, and
that's fine since it's a permission promotion.
We would need a new two_group_semaphore, where each group can run as
many parallel instances as it wants, but no instance of one group can
run in parallel with any instance of the other group. The below rwsem
with two atomics approximates that kind of lock.
Signed-off-by: Andrea Arcangeli <[email protected]>
---
fs/userfaultfd.c | 39 +++++++++++++++++++++++++++++++++++++++
mm/memory.c | 20 ++++++++++++++++++++
2 files changed, 59 insertions(+)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index 894cc28142e7..3729ca99dae5 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -76,6 +76,20 @@ struct userfaultfd_ctx {
bool mmap_changing;
/* mm with one ore more vmas attached to this userfaultfd_ctx */
struct mm_struct *mm;
+ /*
+ * We cannot let userfaultfd_writeprotect(mode_wp = false)
+ * clear _PAGE_UFFD_WP from the pgtable, until the last
+ * outstanding userfaultfd_writeprotect(mode_wp = true)
+ * completes, because the TLB flush is deferred. The page
+ * fault must be forced to enter handle_userfault() while the
+ * TLB flush is deferred, and that's achieved with the
+ * _PAGE_UFFD_WP bit. The _PAGE_UFFD_WP can only be cleared
+ * after there are no stale TLB entries left, only then it'll
+ * be safe again for the page fault to continue and not hit
+ * the handle_userfault() dead end anymore.
+ */
+ struct rw_semaphore wrprotect_rwsem;
+ atomic64_t wrprotect_count[2];
};
struct userfaultfd_fork_ctx {
@@ -1783,6 +1797,7 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
struct uffdio_writeprotect __user *user_uffdio_wp;
struct userfaultfd_wake_range range;
bool mode_wp, mode_dontwake;
+ bool contention;
if (READ_ONCE(ctx->mmap_changing))
return -EAGAIN;
@@ -1808,9 +1823,30 @@ static int userfaultfd_writeprotect(struct userfaultfd_ctx *ctx,
if (mode_wp && mode_dontwake)
return -EINVAL;
+ VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[0]) < 0);
+ VM_WARN_ON(atomic64_read(&ctx->wrprotect_count[1]) < 0);
+ atomic64_inc(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
+ smp_mb__after_atomic();
+ contention = atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0]) > 0;
+ if (!contention)
+ down_read(&ctx->wrprotect_rwsem);
+ else {
+ down_write(&ctx->wrprotect_rwsem);
+ if (!atomic64_read(&ctx->wrprotect_count[mode_wp ? 1 : 0])) {
+ contention = false;
+ downgrade_write(&ctx->wrprotect_rwsem);
+ }
+
+ }
ret = mwriteprotect_range(ctx->mm, uffdio_wp.range.start,
uffdio_wp.range.len, mode_wp,
&ctx->mmap_changing);
+ if (!contention)
+ up_read(&ctx->wrprotect_rwsem);
+ else
+ up_write(&ctx->wrprotect_rwsem);
+ smp_mb__before_atomic();
+ atomic64_dec(&ctx->wrprotect_count[mode_wp ? 0 : 1]);
if (ret)
return ret;
@@ -1958,6 +1994,9 @@ static void init_once_userfaultfd_ctx(void *mem)
init_waitqueue_head(&ctx->fault_wqh);
init_waitqueue_head(&ctx->event_wqh);
init_waitqueue_head(&ctx->fd_wqh);
+ init_rwsem(&ctx->wrprotect_rwsem);
+ atomic64_set(&ctx->wrprotect_count[0], 0);
+ atomic64_set(&ctx->wrprotect_count[1], 0);
seqcount_spinlock_init(&ctx->refile_seq, &ctx->fault_pending_wqh.lock);
}
diff --git a/mm/memory.c b/mm/memory.c
index 7d608765932b..4ff9f21d5a7b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3085,6 +3085,12 @@ static vm_fault_t do_wp_page(struct vm_fault *vmf)
{
struct vm_area_struct *vma = vmf->vma;
+ /*
+ * STALE_TLB_WARNING: while the uffd_wp bit is set, the TLB
+ * can be stale. We cannot allow do_wp_page to proceed or
+ * it'll wrongly assume that nobody can still be writing to
+ * the page if !pte_write.
+ */
if (userfaultfd_pte_wp(vma, *vmf->pte)) {
pte_unmap_unlock(vmf->pte, vmf->ptl);
return handle_userfault(vmf, VM_UFFD_WP);
@@ -4274,6 +4280,12 @@ static inline vm_fault_t create_huge_pmd(struct vm_fault *vmf)
static inline vm_fault_t wp_huge_pmd(struct vm_fault *vmf, pmd_t orig_pmd)
{
if (vma_is_anonymous(vmf->vma)) {
+ /*
+ * STALE_TLB_WARNING: while the uffd_wp bit is set,
+ * the TLB can be stale. We cannot allow wp_huge_pmd()
+ * to proceed or it'll wrongly assume that nobody can
+ * still be writing to the page if !pmd_write.
+ */
if (userfaultfd_huge_pmd_wp(vmf->vma, orig_pmd))
return handle_userfault(vmf, VM_UFFD_WP);
return do_huge_pmd_wp_page(vmf, orig_pmd);
@@ -4388,6 +4400,10 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
if (!pte_present(vmf->orig_pte))
return do_swap_page(vmf);
+ /*
+ * STALE_TLB_WARNING: if the pte is NUMA protnone the TLB can
+ * be stale.
+ */
if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
return do_numa_page(vmf);
@@ -4503,6 +4519,10 @@ static vm_fault_t __handle_mm_fault(struct vm_area_struct *vma,
return 0;
}
if (pmd_trans_huge(orig_pmd) || pmd_devmap(orig_pmd)) {
+ /*
+ * STALE_TLB_WARNING: if the pmd is NUMA
+ * protnone the TLB can be stale.
+ */
if (pmd_protnone(orig_pmd) && vma_is_accessible(vma))
return do_huge_pmd_numa_page(&vmf, orig_pmd);
On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> My previous suggestion to use a mutex to serialize
> userfaultfd_writeprotect with a mutex will still work, but we can run
> as many wrprotect and un-wrprotect as we want in parallel, as long as
> they're not simultaneous, we can do much better than a mutex.
>
> Ideally we would need a new two_group_semaphore, where each group can
> run as many parallel instances as it wants, but no instance of one
> group can run in parallel with any instance of the other group. AFIK
> such a kind of lock doesn't exist right now.
Kent and I worked on one for a bit, and we called it a red-black mutex.
If team red had the lock, more members of team red could join in.
If team black had the lock, more members of team black could join in.
I forget what our rule was around fairness (if team red has the lock,
and somebody from team black is waiting, can another member of team red
take the lock, or must they block?)
It was to solve the direct-IO vs buffered-IO problem (you can have as many
direct-IO readers/writers at once or you can have as many buffered-IO
readers/writers at once, but exclude a mix of direct and buffered I/O).
In the end, we decided it didn't work all that well.
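For reference, the naive shape of such a lock is easy to sketch (purely
illustrative, not the implementation we had, and with no fairness at all:
one team can starve the other; needs <linux/spinlock.h> and <linux/wait.h>):

/*
 * Naive "two group" lock: any number of holders from the same group
 * may run concurrently, but the two groups exclude each other.
 */
struct two_group_lock {
	spinlock_t		lock;
	int			active[2];	/* holders per group */
	wait_queue_head_t	wq;
};

static void two_group_lock_init(struct two_group_lock *tgl)
{
	spin_lock_init(&tgl->lock);
	tgl->active[0] = tgl->active[1] = 0;
	init_waitqueue_head(&tgl->wq);
}

static void two_group_lock_acquire(struct two_group_lock *tgl, int group)
{
	spin_lock(&tgl->lock);
	/* wait for the *other* group to drain completely */
	wait_event_cmd(tgl->wq, !tgl->active[1 - group],
		       spin_unlock(&tgl->lock), spin_lock(&tgl->lock));
	tgl->active[group]++;
	spin_unlock(&tgl->lock);
}

static void two_group_lock_release(struct two_group_lock *tgl, int group)
{
	spin_lock(&tgl->lock);
	if (!--tgl->active[group])
		wake_up_all(&tgl->wq);
	spin_unlock(&tgl->lock);
}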
On Mon, Dec 21, 2020 at 8:16 PM Linus Torvalds
<[email protected]> wrote:
>
> On Mon, Dec 21, 2020 at 7:19 PM Andy Lutomirski <[email protected]> wrote:
> >
> > Ugh, this is unpleasantly complicated.
>
> What I *should* have said is that *because* userfaultfd is changing
> the VM layout, it should always act as if it had to take the mmap_sem
> for writing, and that the RW->RO change is an example of when
> downgrading that write-lock to a read lock is simply not valid -
> because it basically breaks the rules about what a lookup (ie a read)
> can do.
>
> A lookup can never turn a writable page non-writable. A lookup -
> through COW - _can_ turn a read-only page writable. So that's why
> "RO->RW" can be ok under the read lock, but RW->RO is not.
>
> Does that clarify the situation when I phrase it that way instead?
It's certainly more clear, but I'm still not thrilled with a bunch of
functions in different files maintained by different people that
cooperate using an undocumented lockless protocol. It makes me
nervous from a maintainability standpoint even if the code is, in
fact, entirely correct.
But I didn't make my question clear either: when I asked about
mark_screen_rdonly(), I wasn't asking about locking or races at all.
mark_screen_rdonly() will happily mark *any* PTE readonly. I can
easily believe that writable mappings of non-shared mappings all
follow the same CoW rules in the kernel and that, assuming all the
locking is correct, marking them readonly is conceptually okay. But
shared mappings are a whole different story. Is it actually the case
that all, or even most, drivers and other sources of shared mappings
will function correctly if one of their PTEs becomes readonly behind
their back? Or maybe there are less bizarre ways for this to happen
without vm86 shenanigans and this is completely safe after all.
P.S. This whole mess reminds me that I need to take a closer look at
the upcoming SHSTK code. Shadow stack pages violate all common sense
about how the various PTE bits work.
On Tue, Dec 22, 2020 at 08:15:53PM +0000, Matthew Wilcox wrote:
> On Tue, Dec 22, 2020 at 02:31:52PM -0500, Andrea Arcangeli wrote:
> > My previous suggestion to use a mutex to serialize
> > userfaultfd_writeprotect with a mutex will still work, but we can run
> > as many wrprotect and un-wrprotect as we want in parallel, as long as
> > they're not simultaneous, we can do much better than a mutex.
> >
> > Ideally we would need a new two_group_semaphore, where each group can
> > run as many parallel instances as it wants, but no instance of one
> > group can run in parallel with any instance of the other group. AFIK
> > such a kind of lock doesn't exist right now.
>
> Kent and I worked on one for a bit, and we called it a red-black mutex.
> If team red had the lock, more members of team red could join in.
> If team black had the lock, more members of team black could join in.
> I forget what our rule was around fairness (if team red has the lock,
> and somebody from team black is waiting, can another member of team red
> take the lock, or must they block?)
In this case they would need to block and provide full fairness.
Well, maybe just a bit of unfairness (to let a few more through the door
before it shuts) wouldn't be a deal breaker, but it would need to be bounded
or it'd starve the other color/side indefinitely. Otherwise an ioctl with
mode_wp = true would block forever if more ioctls with mode_wp = false kept
coming in on other CPUs (or the other way around).
The approximation with rwsem and two atomics provides full fairness in both
read and write mode (originally the read side would starve the write side
IIRC, which was an issue for mprotect etc... not anymore thankfully).
> It was to solve the direct-IO vs buffered-IO problem (you can have as many
> direct-IO readers/writers at once or you can have as many buffered-IO
> readers/writers at once, but exclude a mix of direct and buffered I/O).
> In the end, we decided it didn't work all that well.
Well, mixing buffered and direct-IO is certainly not a good practice, so
it's reasonable to leave it up to userland to serialize if such a mix is
needed; the kernel behavior is undefined if the mix happens concurrently and
out of order.
On Mon, Dec 21, 2020 at 08:16:11PM -0800, Linus Torvalds wrote:
> So I think the basic rule is that "if you hold mmap_sem for writing,
> you're always safe". And that really should be considered the
> "default" locking.
>
> ANY time you make a modification to the VM layer, you should basically
> always treat it as a write operation, and get the mmap_sem for
> writing.
>
> Yeah, yeah, that's a bit simplified, and it ignores various special
> cases (and the hardware page table walkers that obviously take no
> locks at all), but if you hold the mmap_sem for writing you won't
> really race with anything else - not page faults, and not other
> "modify this VM".
> To a first approximation, everybody that changes the VM should take
> the mmap_sem for writing, and the readers should just be just about
> page fault handling (and I count GUP as "page fault handling" too -
> it's kind of the same "look up page" rather than "modify vm" kind of
> operation).
>
> And there are just a _lot_ more page faults than there are things that
> modify the page tables and the vma's.
>
> So having that mental model of "lookup of pages in a VM takes mmap_sem
> for reading, any modification of the VM uses it for writing" makes
> sense both from a performance angle and a logical standpoint. It's the
> correct model.
> And it's worth noting that COW is still "lookup of pages", even though
> it might modify the page tables in the process. The same way lookup
> can modify the page tables to mark things accessed or dirty.
>
> So COW is still a lookup operation, in ways that "change the
> writabiility of this range" very much is not. COW is "lookup for
> write", and the magic we do to copy to make that write valid is still
> all about the lookup of the page.
(your other email clarified this point; the COW needs to copy while
holding the PTL and we need TLBI under PTL if we're to change this)
> Which brings up another mental mistake I saw earlier in this thread:
> you should not think "mmap_sem is for vma, and the page table lock is
> for the page table changes".
>
> mmap_sem is the primary lock for any modifications to the VM layout,
> whether it be in the vma's or in the page tables.
>
> Now, the page table lock does exist _in_addition_to_ the mmap_sem, but
> it is partly because
>
> (a) we have things that historically walked the page tables _without_
> walking the vma's (notably the virtual memory scanning)
>
> (b) we do allow concurrent page faults, so we then need a lower-level
> lock to serialize the parallelism we _do_ have.
And I'm thinking the speculative page fault series steps right into all
this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
Which opens it up to exactly these races explored here.
The range lock approach does not suffer this, but I'm still worried
about the actual performance of that thing.
On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> (your other email clarified this point; the COW needs to copy while
> holding the PTL and we need TLBI under PTL if we're to change this)
The COW doesn't need to hold the PT lock, the TLBI broadcast doesn't
need to be delivered under PT lock either.
Simply, there needs to be a TLBI broadcast before the copy. The patch I
sent here https://lkml.kernel.org/r/[email protected] that
needs to be cleaned up with some abstraction and better commentary
also misses a smp_mb() in the case flush_tlb_page is not called, but
that's a small detail.
> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
I thought about that but that only applies to some kind of "anon" page
fault.
Here the problem isn't just the page fault; the problem is not to regress
clear_refs into blocking on page fault I/O, and all MAP_PRIVATE/MAP_SHARED
file-backed faults hitting the disk to read /usr/ would still prevent
clear_refs from running (and the other way around) if it had to take the
mmap_sem for writing.
I haven't looked at the speculative page fault series for a while, but last
I checked there was nothing there that could tame the above major regression
from CPU speed to disk I/O speed that would be inflicted on both clear_refs
on a huge mm and on uffd-wp.
Thanks,
Andrea
On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
> And I'm thinking the speculative page fault series steps right into all
> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>
> Which opens it up to exactly these races explored here.
>
> The range lock approach does not suffer this, but I'm still worried
> about the actual performance of that thing.
Some thoughts on why there may not be an issue with speculative page fault.
Adding Laurent as well.
Possibility of race against other PTE modifiers:
1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and
   that is described and fixed here:
   https://lore.kernel.org/patchwork/patch/1062672/
2) mprotect - change_protection in mprotect which does the deferred flush is
   marked under vm_write_begin/vm_write_end, thus SPF bails out on faults on
   those VMAs.
3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
   But SPF does not take UFFD faults.
4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
   (2) above.
5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
   Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
   transitions without tlb flush. And I hope do_wp_page with RO->RW is fine
   as well.
I could not see a case where speculative path cannot see a PTE update done
via a fault on another CPU.
--
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of the Code Aurora Forum, hosted by The Linux Foundation
On 12/01/2021 at 12:43, Vinayak Menon wrote:
> On 1/5/2021 9:07 PM, Peter Zijlstra wrote:
>> And I'm thinking the speculative page fault series steps right into all
>> this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>>
>> Which opens it up to exactly these races explored here.
>>
>> The range lock approach does not suffer this, but I'm still worried
>> about the actual performance of that thing.
>
>
> Some thoughts on why there may not be an issue with speculative page fault.
> Adding Laurent as well.
>
> Possibility of race against other PTE modifiers
>
> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> 2) mprotect - change_protection in mprotect which does the deferred flush is
> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults on those
> VMAs.
> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> But SPF does not take UFFD faults.
> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> (2) above.
> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> I could not see a case where speculative path cannot see a PTE update done via
> a fault on another CPU.
>
Thanks Vinayak,
You explained it fine. Indeed SPF is handling deferred TLB invalidation by
marking the VMA through vm_write_begin/end(), as for the fork case you
mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
values read are valid.
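Roughly, the handshake looks like this (sketch only; vm_write_begin()/
vm_write_end() and the vm_sequence seqcount are from the out-of-tree SPF
series, and the exact helpers there may differ):

	/* writer side, e.g. a protection change with a deferred flush */
	vm_write_begin(vma);			/* vm_sequence becomes odd */
	change_protection(vma, start, end, newprot, cp_flags);
	flush_tlb_range(vma, start, end);
	vm_write_end(vma);			/* vm_sequence is even again */

	/* speculative fault side, simplified */
	seq = raw_read_seqcount(&vma->vm_sequence);
	if (seq & 1)
		goto fallback;	/* writer in flight: take the classic path */
	/* ... walk the page table and take the PTL ... */
	if (read_seqcount_retry(&vma->vm_sequence, seq))
		goto fallback;	/* the vma changed under us: bail out */
	/* PTE values read under the PTL are now known to be consistent */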
Cheers,
Laurent.
On Tue, Jan 05, 2021 at 01:03:48PM -0500, Andrea Arcangeli wrote:
> On Tue, Jan 05, 2021 at 04:37:27PM +0100, Peter Zijlstra wrote:
> > (your other email clarified this point; the COW needs to copy while
> > holding the PTL and we need TLBI under PTL if we're to change this)
>
> The COW doesn't need to hold the PT lock, the TLBI broadcast doesn't
> need to be delivered under PT lock either.
>
> Simply there need to be a TLBI broadcast before the copy. The patch I
> sent here https://lkml.kernel.org/r/[email protected] that
> needs to be cleaned up with some abstraction and better commentary
> also misses a smp_mb() in the case flush_tlb_page is not called, but
> that's a small detail.
That's horrific crap. All of that tlb-pending stuff is batshit, and this
makes it worse.
> > And I'm thinking the speculative page fault series steps right into all
> > this, it fundamentally avoids mmap_sem and entirely relies on the PTL.
>
> I thought about that but that only applies to some kind of "anon" page
> fault.
That must be something new; it used to handle all faults. I specifically
spent quite a bit of time getting the file crud right (which Linus
initially fingered for being horribly broken).
SPF fundamentally elides the mmap_sem, which Linus said must serialize
faults.
> Here the problem isn't just the page fault, the problem is not to
> regress clear_refs to block on page fault I/O, and all
IIRC we do the actual reads without any locks held, just like
VM_FAULT_RETRY does today. You take the fault, find you need IO, drop
locks, do IO, retake fault.
> MAP_PRIVATE/MAP_SHARED filebacked faults bitting the disk to read
> /usr/ will still prevent clear_refs from running (and the other way
> around) if it has to take the mmap_sem for writing.
>
> I don't look at the speculative page fault for a while but last I
> checked there was nothing there that can tame the above major
> regression from CPU speed to disk I/O speed that would be inflicted on
> both clear_refs on huge mm and on uffd-wp.
All of the clear_refs nonsense is immaterial to SPF. Also, who again
cares about clear_refs? Why is it important?
On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> On 12/01/2021 at 12:43, Vinayak Menon wrote:
> > Possibility of race against other PTE modifiers
> >
> > 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> > is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
Right, that's exactly the kind of thing I was worried about.
> > 2) mprotect - change_protection in mprotect which does the deferred flush is
> > marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> > on those VMAs.
Sure, mprotect also changes vm_flags, so it really needs that anyway.
> > 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> > But SPF does not take UFFD faults.
> > 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> > (2) above.
> > 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
What happened to shared/file-backed stuff? ISTR I had that working.
> > Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> > transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
The tricky one is demotion, specifically write to non-write.
> > I could not see a case where speculative path cannot see a PTE update done via
> > a fault on another CPU.
One you didn't mention is the NUMA balancing scanning crud; although I
think that's fine, losing a PTE update there is harmless. But I've not
thought overly hard on it.
> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> marking the VMA through vm_write_begin/end(), as for the fork case you
> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> values read are valid.
That should indeed work, but are we really sure we covered them all?
Should we invest in better TLBI APIs to make sure we can't get this
wrong?
On 12/01/2021 at 17:57, Peter Zijlstra wrote:
> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>> On 12/01/2021 at 12:43, Vinayak Menon wrote:
>
>>> Possibility of race against other PTE modifiers
>>>
>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
>
> Right, that's exactly the kind of thing I was worried about.
>
>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>> on those VMAs.
>
> Sure, mprotect also changes vm_flags, so it really needs that anyway.
>
>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>> But SPF does not take UFFD faults.
>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>> (2) above.
>
>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
>
> What happened to shared/file-backed stuff? ISTR I had that working.
File-backed mappings are not processed in a speculative way; there were
options to manage some of them depending on the underlying file system, but
that's still not done.
Shared anonymous mappings are also not yet handled in a speculative way
(vm_ops is not NULL).
>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
>
> The tricky one is demotion, specifically write to non-write.
>
>>> I could not see a case where speculative path cannot see a PTE update done via
>>> a fault on another CPU.
>
> One you didn't mention is the NUMA balancing scanning crud; although I
> think that's fine, losing a PTE update there is harmless. But I've not
> thought overly hard on it.
That's a good point, I need to double check on that side.
>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>> marking the VMA through vm_write_begin/end(), as for the fork case you
>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>> values read are valid.
>
> That should indeed work, but are we really sure we covered them all?
> Should we invest in better TLBI APIs to make sure we can't get this
> wrong?
That may be a good option to identify deferred TLB invalidation but I've no clue
on what this API would look like.
> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <[email protected]> wrote:
>> That should indeed work, but are we really sure we covered them all?
>> Should we invest in better TLBI APIs to make sure we can't get this
>> wrong?
>
> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
I will send an RFC soon for per-table deferred TLB flushes tracking.
The basic idea is to save a generation in the page-struct that tracks
when a deferred PTE change took place, and to track whenever a TLB flush
completed. In addition, other users - such as mprotect - would use
the tlb_gather interface.
Unfortunately, due to limited space in page-struct this would only
be possible for 64-bit (and my implementation is only for x86-64).
It would still require to do the copying while holding the PTL though.
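To make the idea concrete before the RFC is out, a rough sketch (all names
hypothetical, and it ignores the races a real implementation has to close,
e.g. flushes already in flight when the stamp is taken):

/* hypothetical per-page-table stamp, e.g. squeezed into struct page */
struct pt_flush_gen {
	u64 defer_gen;	/* generation of the last deferred PTE change */
};

static atomic64_t mm_flush_done_gen = ATOMIC64_INIT(0);

/* a PTE change whose TLB flush is deferred records a stamp */
static void note_deferred_pte_change(struct pt_flush_gen *pt)
{
	pt->defer_gen = atomic64_read(&mm_flush_done_gen) + 1;
}

/* called once a TLB flush covering this page table has completed */
static void note_tlb_flush_done(void)
{
	atomic64_inc(&mm_flush_done_gen);
}

/* the fault path (e.g. before a COW copy) can then ask: */
static bool pt_tlb_flush_pending(struct pt_flush_gen *pt)
{
	return pt->defer_gen > atomic64_read(&mm_flush_done_gen);
}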
On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> > That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
>
> I will send an RFC soon for per-table deferred TLB flushes tracking.
> The basic idea is to save a generation in the page-struct that tracks
> when deferred PTE change took place, and track whenever a TLB flush
> completed. In addition, other users - such as mprotect - would use
> the tlb_gather interface.
>
> Unfortunately, due to limited space in page-struct this would only
> be possible for 64-bit (and my implementation is only for x86-64).
I don't want to discourage you but I don't think this would end up
well. PPC doesn't necessarily follow the one-page-struct-per-table rule,
and I've run into problems with this before while trying to do
something similar.
I'd recommend per-vma and per-category (unmapping, clearing writable
and clearing dirty) tracking, which only rely on arch-independent data
structures, i.e., vm_area_struct and mm_struct.
> It would still require to do the copying while holding the PTL though.
IMO, this is unacceptable. Most archs don't support per-table PTL, and
even x86_64 can be configured to use per-mm PTL. What if we want to
support a larger page size in the future?
It seems to me the only way to solve the problem with self-explanatory
code and without performance impact is to check mm_tlb_flush_pending
and the writable bit (and two other cases I mentioned above) at the
same time. Of course, this requires a lot of effort to audit the
existing uses, clean them up and properly wrap them up with new
primitives, BUG_ON all invalid cases and document the exact workflow
to prevent misuses.
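As a sketch of what one such primitive could look like -- the helper name is
hypothetical, while mm_tlb_flush_pending() and flush_tlb_page() are the
existing interfaces:

#include <linux/mm.h>
#include <linux/pgtable.h>
#include <asm/tlbflush.h>

/*
 * Hypothetical wrapper: before acting on a protection bit that may have
 * been cleared with a deferred flush, synchronize with any pending
 * batched TLB flush for this mm.
 */
static inline pte_t ptep_get_synced(struct vm_area_struct *vma,
				    unsigned long addr, pte_t *ptep)
{
	pte_t pte = ptep_get(ptep);

	if (unlikely(mm_tlb_flush_pending(vma->vm_mm)))
		flush_tlb_page(vma, addr);

	return pte;
}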
I've mentioned the following before -- it only demonstrates the rough
idea.
diff --git a/mm/memory.c b/mm/memory.c
index 5e9ca612d7d7..af38c5ee327e 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
goto unlock;
}
if (vmf->flags & FAULT_FLAG_WRITE) {
- if (!pte_write(entry))
+ if (!pte_write(entry)) {
+ if (mm_tlb_flush_pending(vmf->vma->vm_mm))
+ flush_tlb_page(vmf->vma, vmf->address);
return do_wp_page(vmf);
+ }
entry = pte_mkdirty(entry);
}
entry = pte_mkyoung(entry);
> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>
> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <[email protected]> wrote:
>>>
>>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
>>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
>>>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
>>>>>> Possibility of race against other PTE modifiers
>>>>>>
>>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
>>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
>>>> Right, that's exactly the kind of thing I was worried about.
>>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
>>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
>>>>>> on those VMAs.
>>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
>>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
>>>>>> But SPF does not take UFFD faults.
>>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
>>>>>> (2) above.
>>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
>>>> What happened to shared/file-backed stuff? ISTR I had that working.
>>>
>>> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
>>>
>>> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
>>>
>>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
>>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
>>>> The tricky one is demotion, specifically write to non-write.
>>>>>> I could not see a case where speculative path cannot see a PTE update done via
>>>>>> a fault on another CPU.
>>>> One you didn't mention is the NUMA balancing scanning crud; although I
>>>> think that's fine, loosing a PTE update there is harmless. But I've not
>>>> thought overly hard on it.
>>>
>>> That's a good point, I need to double check on that side.
>>>
>>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
>>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
>>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
>>>>> values read are valid.
>>>> That should indeed work, but are we really sure we covered them all?
>>>> Should we invest in better TLBI APIs to make sure we can't get this
>>>> wrong?
>>>
>>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
>>
>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>> The basic idea is to save a generation in the page-struct that tracks
>> when deferred PTE change took place, and track whenever a TLB flush
>> completed. In addition, other users - such as mprotect - would use
>> the tlb_gather interface.
>>
>> Unfortunately, due to limited space in page-struct this would only
>> be possible for 64-bit (and my implementation is only for x86-64).
>
> I don't want to discourage you but I don't think this would end up
> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> and I've run into problems with this before while trying to do
> something similar.
Discourage, discourage. Better now than later.
It will be relatively easy to extend the scheme to be per-VMA instead of
per-table for architectures that prefer it this way. It does require
TLB-generation tracking though, which Andy only implemented for x86, so I
will focus on x86-64 right now.
[ For per-VMA tracking it would require an additional cmpxchg, I presume, to
save the last deferred generation. ]
> I'd recommend per-vma and per-category (unmapping, clearing writable
> and clearing dirty) tracking, which only rely on arch-independent data
> structures, i.e., vm_area_struct and mm_struct.
I think that tracking changes at “what was changed” granularity is harder
and more fragile.
Let me finish trying to clean up the mess first, since fullmm and
need_flush_all semantics were mixed; there are 3 different flushing schemes
for mprotect(), munmap() and try_to_unmap(); there are missing memory
barriers; mprotect() performs TLB flushes even when permissions are
promoted; etc.
There are also some optimizations that we discussed before, such as on x86 -
RW->RO does not require a TLB flush if the PTE is not dirty, etc.
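For reference, the reasoning behind that last optimization, as a rough
sketch -- the helper name is made up, and it leans on the x86 behavior that
setting the dirty bit atomically re-walks the page tables, so take it as an
illustration rather than a claim about every architecture:

#include <linux/pgtable.h>

/*
 * If the old PTE was writable but clean, any cached writable translation
 * is also clean, and a write through it must first set the dirty bit --
 * which re-walks the page table and sees the new read-only PTE. So the
 * immediate flush can be skipped for RW->RO of a clean PTE.
 */
static inline bool wrprotect_needs_flush(pte_t oldpte)
{
	/* Only a previously writable and dirty entry needs the flush. */
	return pte_write(oldpte) && pte_dirty(oldpte);
}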
I am trying to finish something so you can say how terrible it is, so I will
not waste too much time. ;-)
>> It would still require to do the copying while holding the PTL though.
>
> IMO, this is unacceptable. Most archs don't support per-table PTL, and
> even x86_64 can be configured to use per-mm PTL. What if we want to
> support a larger page size in the feature?
>
> It seems to me the only way to solve the problem with self-explanatory
> code and without performance impact is to check mm_tlb_flush_pending
> and the writable bit (and two other cases I mentioned above) at the
> same time. Of course, this requires a lot of effort to audit the
> existing uses, clean them up and properly wrap them up with new
> primitives, BUG_ON all invalid cases and document the exact workflow
> to prevent misuses.
>
> I've mentioned the following before -- it only demonstrates the rough
> idea.
>
> diff --git a/mm/memory.c b/mm/memory.c
> index 5e9ca612d7d7..af38c5ee327e 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> goto unlock;
> }
> if (vmf->flags & FAULT_FLAG_WRITE) {
> - if (!pte_write(entry))
> + if (!pte_write(entry)) {
> + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> + flush_tlb_page(vmf->vma, vmf->address);
> return do_wp_page(vmf);
> + }
> entry = pte_mkdirty(entry);
> }
> entry = pte_mkyoung(entry);
I understand. This might be required, regardless of the deferred-flush
detection scheme. If we assume that no write-unprotect requires a COW (which
should be true in this case, since we take a reference on the page), your
proposal should be sufficient.
Still, I think that there are many unnecessary TLB flushes right now,
and others that might be missed due to the overly complicated invalidation
schemes.
Regardless, as Andrea pointed out, this first requires figuring out the
semantics of mprotect() and friends when pages are pinned.
On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>> On Jan 12, 2021, at 11:02 AM, Laurent Dufour <[email protected]> wrote:
> >>>
> >>> Le 12/01/2021 à 17:57, Peter Zijlstra a écrit :
> >>>> On Tue, Jan 12, 2021 at 04:47:17PM +0100, Laurent Dufour wrote:
> >>>>> Le 12/01/2021 à 12:43, Vinayak Menon a écrit :
> >>>>>> Possibility of race against other PTE modifiers
> >>>>>>
> >>>>>> 1) Fork - We have seen a case of SPF racing with fork marking PTEs RO and that
> >>>>>> is described and fixed here https://lore.kernel.org/patchwork/patch/1062672/
> >>>> Right, that's exactly the kind of thing I was worried about.
> >>>>>> 2) mprotect - change_protection in mprotect which does the deferred flush is
> >>>>>> marked under vm_write_begin/vm_write_end, thus SPF bails out on faults
> >>>>>> on those VMAs.
> >>>> Sure, mprotect also changes vm_flags, so it really needs that anyway.
> >>>>>> 3) userfaultfd - mwriteprotect_range is not protected unlike in (2) above.
> >>>>>> But SPF does not take UFFD faults.
> >>>>>> 4) hugetlb - hugetlb_change_protection - called from mprotect and covered by
> >>>>>> (2) above.
> >>>>>> 5) Concurrent faults - SPF does not handle all faults. Only anon page faults.
> >>>> What happened to shared/file-backed stuff? ISTR I had that working.
> >>>
> >>> File-backed mappings are not processed in a speculative way, there were options to manage some of them depending on the underlying file system but that's still not done.
> >>>
> >>> Shared anonymous mapping, are also not yet handled in a speculative way (vm_ops is not null).
> >>>
> >>>>>> Of which do_anonymous_page and do_swap_page are NONE/NON-PRESENT->PRESENT
> >>>>>> transitions without tlb flush. And I hope do_wp_page with RO->RW is fine as well.
> >>>> The tricky one is demotion, specifically write to non-write.
> >>>>>> I could not see a case where speculative path cannot see a PTE update done via
> >>>>>> a fault on another CPU.
> >>>> One you didn't mention is the NUMA balancing scanning crud; although I
> >>>> think that's fine, loosing a PTE update there is harmless. But I've not
> >>>> thought overly hard on it.
> >>>
> >>> That's a good point, I need to double check on that side.
> >>>
> >>>>> You explained it fine. Indeed SPF is handling deferred TLB invalidation by
> >>>>> marking the VMA through vm_write_begin/end(), as for the fork case you
> >>>>> mentioned. Once the PTL is held, and the VMA's seqcount is checked, the PTE
> >>>>> values read are valid.
> >>>> That should indeed work, but are we really sure we covered them all?
> >>>> Should we invest in better TLBI APIs to make sure we can't get this
> >>>> wrong?
> >>>
> >>> That may be a good option to identify deferred TLB invalidation but I've no clue on what this API would look like.
> >>
> >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >> The basic idea is to save a generation in the page-struct that tracks
> >> when deferred PTE change took place, and track whenever a TLB flush
> >> completed. In addition, other users - such as mprotect - would use
> >> the tlb_gather interface.
> >>
> >> Unfortunately, due to limited space in page-struct this would only
> >> be possible for 64-bit (and my implementation is only for x86-64).
> >
> > I don't want to discourage you but I don't think this would end up
> > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > and I've run into problems with this before while trying to do
> > something similar.
>
> Discourage, discourage. Better now than later.
>
> It will be relatively easy to extend the scheme to be per-VMA instead of
> per-table for architectures that prefer it this way. It does require
> TLB-generation tracking though, which Andy only implemented for x86, so I
> will focus on x86-64 right now.
>
> [ For per-VMA it would require an additional cmpxchg, I presume to save the
> last deferred generation though. ]
>
> > I'd recommend per-vma and per-category (unmapping, clearing writable
> > and clearing dirty) tracking, which only rely on arch-independent data
> > structures, i.e., vm_area_struct and mm_struct.
>
> I think that tracking changes on “what was changed” granularity is harder
> and more fragile.
>
> Let me finish trying the clean up the mess first, since fullmm and
> need_flush_all semantics were mixed; there are 3 different flushing schemes
> for mprotect(), munmap() and try_to_unmap(); there are missing memory
> barriers; mprotect() performs TLB flushes even when permissions are
> promoted; etc.
>
> There are also some optimizations that we discussed before, such on x86 -
> RW->RO does not require a TLB flush if a PTE is not dirty, etc.
>
> I am trying to finish something so you can say how terrible it is, so I will
> not waste too much time. ;-)
>
> >> It would still require to do the copying while holding the PTL though.
> >
> > IMO, this is unacceptable. Most archs don't support per-table PTL, and
> > even x86_64 can be configured to use per-mm PTL. What if we want to
> > support a larger page size in the feature?
> >
> > It seems to me the only way to solve the problem with self-explanatory
> > code and without performance impact is to check mm_tlb_flush_pending
> > and the writable bit (and two other cases I mentioned above) at the
> > same time. Of course, this requires a lot of effort to audit the
> > existing uses, clean them up and properly wrap them up with new
> > primitives, BUG_ON all invalid cases and document the exact workflow
> > to prevent misuses.
> >
> > I've mentioned the following before -- it only demonstrates the rough
> > idea.
> >
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 5e9ca612d7d7..af38c5ee327e 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4403,8 +4403,11 @@ static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> > goto unlock;
> > }
> > if (vmf->flags & FAULT_FLAG_WRITE) {
> > - if (!pte_write(entry))
> > + if (!pte_write(entry)) {
> > + if (mm_tlb_flush_pending(vmf->vma->vm_mm))
> > + flush_tlb_page(vmf->vma, vmf->address);
> > return do_wp_page(vmf);
> > + }
> > entry = pte_mkdirty(entry);
> > }
> > entry = pte_mkyoung(entry);
>
> I understand. This might be required, regardless of the deferred flushes
> detection scheme. If we assume that no write-unprotect requires a COW (which
> should be true in this case, since we take a reference on the page), your
> proposal should be sufficient.
>
> Still, I think that there are many unnecessary TLB flushes right now,
> and others that might be missed due to the overly complicated invalidation
> schemes.
>
> Regardless, as Andrea pointed, this requires first to figure out the
> semantics of mprotect() and friends when pages are pinned.
Thanks, I appreciate your effort. I'd be glad to review whatever you
come up with.
On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >> The basic idea is to save a generation in the page-struct that tracks
> >> when deferred PTE change took place, and track whenever a TLB flush
> >> completed. In addition, other users - such as mprotect - would use
> >> the tlb_gather interface.
> >>
> >> Unfortunately, due to limited space in page-struct this would only
> >> be possible for 64-bit (and my implementation is only for x86-64).
> >
> > I don't want to discourage you but I don't think this would end up
> > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > and I've run into problems with this before while trying to do
> > something similar.
>
> Discourage, discourage. Better now than later.
>
> It will be relatively easy to extend the scheme to be per-VMA instead of
> per-table for architectures that prefer it this way. It does require
> TLB-generation tracking though, which Andy only implemented for x86, so I
> will focus on x86-64 right now.
Can you remind me of what we're missing on arm64 in this area, please? I'm
happy to help get this up and running once you have something I can build
on.
Will
> On Jan 12, 2021, at 1:43 PM, Will Deacon <[email protected]> wrote:
>
> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>> The basic idea is to save a generation in the page-struct that tracks
>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>> completed. In addition, other users - such as mprotect - would use
>>>> the tlb_gather interface.
>>>>
>>>> Unfortunately, due to limited space in page-struct this would only
>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>
>>> I don't want to discourage you but I don't think this would end up
>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>> and I've run into problems with this before while trying to do
>>> something similar.
>>
>> Discourage, discourage. Better now than later.
>>
>> It will be relatively easy to extend the scheme to be per-VMA instead of
>> per-table for architectures that prefer it this way. It does require
>> TLB-generation tracking though, which Andy only implemented for x86, so I
>> will focus on x86-64 right now.
>
> Can you remind me of what we're missing on arm64 in this area, please? I'm
> happy to help get this up and running once you have something I can build
> on.
Let me first finish making something that we can use as a basis for a
discussion. I do not want to waste your time before I have something ready.
On Tue, Jan 12, 2021 at 02:29:51PM -0800, Nadav Amit wrote:
> > On Jan 12, 2021, at 1:43 PM, Will Deacon <[email protected]> wrote:
> >
> > On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> >>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>> The basic idea is to save a generation in the page-struct that tracks
> >>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>> completed. In addition, other users - such as mprotect - would use
> >>>> the tlb_gather interface.
> >>>>
> >>>> Unfortunately, due to limited space in page-struct this would only
> >>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>>
> >>> I don't want to discourage you but I don't think this would end up
> >>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>> and I've run into problems with this before while trying to do
> >>> something similar.
> >>
> >> Discourage, discourage. Better now than later.
> >>
> >> It will be relatively easy to extend the scheme to be per-VMA instead of
> >> per-table for architectures that prefer it this way. It does require
> >> TLB-generation tracking though, which Andy only implemented for x86, so I
> >> will focus on x86-64 right now.
> >
> > Can you remind me of what we're missing on arm64 in this area, please? I'm
> > happy to help get this up and running once you have something I can build
> > on.
>
> Let me first finish making something that we can use as a basis for a
> discussion. I do not waste your time before I have something ready.
Sure thing! Give me a shout when you're ready.
Will
> On Jan 12, 2021, at 2:29 PM, Nadav Amit <[email protected]> wrote:
>
>
>>
>> On Jan 12, 2021, at 1:43 PM, Will Deacon <[email protected]> wrote:
>>
>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>> the tlb_gather interface.
>>>>>>
>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>>
>>>>> I don't want to discourage you but I don't think this would end up
>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>> and I've run into problems with this before while trying to do
>>>>> something similar.
>>>
>>> Discourage, discourage. Better now than later.
>>>
>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>> per-table for architectures that prefer it this way. It does require
>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>> will focus on x86-64 right now.
>>
>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>> happy to help get this up and running once you have something I can build
>> on.
>
> Let me first finish making something that we can use as a basis for a
> discussion. I do not waste your time before I have something ready.
If you want a hand, let me know. I suspect you understand the x86 code as well as I do at this point, though :)
On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> > > On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> > > On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> > >> I will send an RFC soon for per-table deferred TLB flushes tracking.
> > >> The basic idea is to save a generation in the page-struct that tracks
> > >> when deferred PTE change took place, and track whenever a TLB flush
> > >> completed. In addition, other users - such as mprotect - would use
> > >> the tlb_gather interface.
> > >>
> > >> Unfortunately, due to limited space in page-struct this would only
> > >> be possible for 64-bit (and my implementation is only for x86-64).
> > >
> > > I don't want to discourage you but I don't think this would end up
> > > well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> > > and I've run into problems with this before while trying to do
> > > something similar.
> >
> > Discourage, discourage. Better now than later.
> >
> > It will be relatively easy to extend the scheme to be per-VMA instead of
> > per-table for architectures that prefer it this way. It does require
> > TLB-generation tracking though, which Andy only implemented for x86, so I
> > will focus on x86-64 right now.
>
> Can you remind me of what we're missing on arm64 in this area, please? I'm
> happy to help get this up and running once you have something I can build
> on.
I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
Would it be something worth pursuing? Arm has been using mm_cpumask,
so it might not be too difficult I guess?
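For reference, the per-arch part is fairly small. From memory, the x86 hook
in arch/x86/include/asm/tlbflush.h is roughly:

static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,
					struct mm_struct *mm)
{
	/* Bump the mm's TLB generation and remember which CPUs may cache it. */
	inc_mm_tlb_gen(mm);
	cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
}

An arm64 implementation would presumably need the same two hooks,
arch_tlbbatch_add_mm() and arch_tlbbatch_flush(), either accumulating
mm_cpumask() like this or leaning on broadcast TLBI.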
> On Jan 16, 2021, at 8:41 PM, Yu Zhao <[email protected]> wrote:
>
> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>> completed. In addition, other users - such as mprotect - would use
>>>>> the tlb_gather interface.
>>>>>
>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>
>>>> I don't want to discourage you but I don't think this would end up
>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>> and I've run into problems with this before while trying to do
>>>> something similar.
>>>
>>> Discourage, discourage. Better now than later.
>>>
>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>> per-table for architectures that prefer it this way. It does require
>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>> will focus on x86-64 right now.
>>
>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>> happy to help get this up and running once you have something I can build
>> on.
>
> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> Would it be something worth pursuing? Arm has been using mm_cpumask,
> so it might not be too difficult I guess?
[ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
IIUC, there are at least two bugs in the x86 implementation.
First, there is a missing memory barrier in tlbbatch_add_mm() between
inc_mm_tlb_gen() and the read of mm_cpumask().
Second, try_to_unmap_flush() clears flush_required after flushing. Another
thread can call set_tlb_ubc_flush_pending() after the flush and before
flush_required is cleared, and the indication that a TLB flush is pending
can be lost.
I am working on addressing these issues among others, but, as you already
saw, I am a bit slow.
On a different but related topic: another thing I noticed is that Arm does
not batch TLB flushes across VMAs. Since Arm does not have its own
tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
separately. Peter Zijlstra’s comment says that there are advantages in
flushing each VMA separately, but I am not sure it is better or intentional
(especially since x86 does not do so).
I am trying to remove the arch-specific tlb_end_vma() and have a config
option to control this behavior.
Again, sorry for being slow. I hope to send an RFC soon.
On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
> > On Jan 16, 2021, at 8:41 PM, Yu Zhao <[email protected]> wrote:
> >
> > On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> >> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> >>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>>> The basic idea is to save a generation in the page-struct that tracks
> >>>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>>> completed. In addition, other users - such as mprotect - would use
> >>>>> the tlb_gather interface.
> >>>>>
> >>>>> Unfortunately, due to limited space in page-struct this would only
> >>>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>>>
> >>>> I don't want to discourage you but I don't think this would end up
> >>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>>> and I've run into problems with this before while trying to do
> >>>> something similar.
> >>>
> >>> Discourage, discourage. Better now than later.
> >>>
> >>> It will be relatively easy to extend the scheme to be per-VMA instead of
> >>> per-table for architectures that prefer it this way. It does require
> >>> TLB-generation tracking though, which Andy only implemented for x86, so I
> >>> will focus on x86-64 right now.
> >>
> >> Can you remind me of what we're missing on arm64 in this area, please? I'm
> >> happy to help get this up and running once you have something I can build
> >> on.
> >
> > I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> > Would it be something worth pursuing? Arm has been using mm_cpumask,
> > so it might not be too difficult I guess?
>
> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
>
> IIUC, there are at least two bugs in x86 implementation.
>
> First, there is a missing memory barrier in tlbbatch_add_mm() between
> inc_mm_tlb_gen() and the read of mm_cpumask().
In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has a builtin barrier, as its
comment says -- atomic update ops that return values are also full
memory barriers.
> Second, try_to_unmap_flush() clears flush_required after flushing. Another
> thread can call set_tlb_ubc_flush_pending() after the flush and before
> flush_required is cleared, and the indication that a TLB flush is pending
> can be lost.
This isn't a problem either because flush_required is per thread.
> I am working on addressing these issues among others, but, as you already
> saw, I am a bit slow.
>
> On a different but related topic: Another thing that I noticed that Arm does
> not do is batching TLB flushes across VMAs. Since Arm does not have its own
> tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
> separately. Peter Zijlstra’s comment says that there are advantages in
> flushing each VMA separately, but I am not sure it is better or intentional
> (especially since x86 does not do so).
>
> I am trying to remove the arch-specific tlb_end_vma() and have a config
> option to control this behavior.
One thing worth noting is that not all arm/arm64 hw versions support range
invalidations (system_supports_tlb_range()). But IIUC what you are trying
to do, this isn't a problem.
> Again, sorry for being slow. I hope to send an RFC soon.
No worries. I brought it up only because I noticed it and didn't want
it to slip away.
> On Jan 17, 2021, at 1:16 AM, Yu Zhao <[email protected]> wrote:
>
> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <[email protected]> wrote:
>>>
>>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>>> the tlb_gather interface.
>>>>>>>
>>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>>>
>>>>>> I don't want to discourage you but I don't think this would end up
>>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>>> and I've run into problems with this before while trying to do
>>>>>> something similar.
>>>>>
>>>>> Discourage, discourage. Better now than later.
>>>>>
>>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>>>> per-table for architectures that prefer it this way. It does require
>>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>>>> will focus on x86-64 right now.
>>>>
>>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>>>> happy to help get this up and running once you have something I can build
>>>> on.
>>>
>>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>>> Would it be something worth pursuing? Arm has been using mm_cpumask,
>>> so it might not be too difficult I guess?
>>
>> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
>>
>> IIUC, there are at least two bugs in x86 implementation.
>>
>> First, there is a missing memory barrier in tlbbatch_add_mm() between
>> inc_mm_tlb_gen() and the read of mm_cpumask().
>
> In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
> comment says -- atomic update ops that return values are also full
> memory barriers.
Yes, you are correct.
>
>> Second, try_to_unmap_flush() clears flush_required after flushing. Another
>> thread can call set_tlb_ubc_flush_pending() after the flush and before
>> flush_required is cleared, and the indication that a TLB flush is pending
>> can be lost.
>
> This isn't a problem either because flush_required is per thread.
Sorry, I meant mm->tlb_flush_batched. It is not per-thread.
flush_tlb_batched_pending() clears it after the flush, so an indication that
set_tlb_ubc_flush_pending() sets in between can be lost.
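To spell out the interleaving I have in mind (simplified from my reading of
mm/rmap.c, where mm->tlb_flush_batched is a plain bool):

/*
 * reclaim                                 madvise/munmap/... path
 * (set_tlb_ubc_flush_pending)             (flush_tlb_batched_pending)
 * -----------------------------           ------------------------------
 *                                         if (mm->tlb_flush_batched) {
 *                                                 flush_tlb_mm(mm);
 * clear another PTE, defer its flush
 * mm->tlb_flush_batched = true;
 *                                                 mm->tlb_flush_batched = false;
 *                                         }
 *
 * The final store overwrites the indication set by reclaim, so a later
 * caller of flush_tlb_batched_pending() no longer sees that a flush is
 * pending, even though the PTE cleared in between has not been flushed.
 */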
>> I am working on addressing these issues among others, but, as you already
>> saw, I am a bit slow.
>>
>> On a different but related topic: Another thing that I noticed that Arm does
>> not do is batching TLB flushes across VMAs. Since Arm does not have its own
>> tlb_end_vma(), it uses the default tlb_end_vma(), which flushes each VMA
>> separately. Peter Zijlstra’s comment says that there are advantages in
>> flushing each VMA separately, but I am not sure it is better or intentional
>> (especially since x86 does not do so).
>>
>> I am trying to remove the arch-specific tlb_end_vma() and have a config
>> option to control this behavior.
>
> One thing worth noting is not all arm/arm64 hw versions support ranges.
> (system_supports_tlb_range()). But IIUC what you are trying to do, this
> isn't a problem.
I just wanted to get rid of the arch-specific tlb_start_vma() in order to
clean up the code (I am doing the VMA-deferred tracking first, as you asked).
While I was doing that, I noticed that Arm does per-VMA TLB flushes. I do
not know whether that is intentional, but it seems rather easy to get this
behavior unintentionally.
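For context, my understanding is that the generic fallback looks roughly
like this (simplified from include/asm-generic/tlb.h):

static inline void tlb_end_vma(struct mmu_gather *tlb, struct vm_area_struct *vma)
{
	if (tlb->fullmm)
		return;

	/*
	 * Flush and reset the gathered range at every VMA boundary, so the
	 * tracked range does not grow to span the unused space between
	 * VMAs -- at the cost of one flush per VMA.
	 */
	tlb_flush_mmu_tlbonly(tlb);
}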
On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote:
> > On Jan 17, 2021, at 1:16 AM, Yu Zhao <[email protected]> wrote:
> >
> > On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
> >>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <[email protected]> wrote:
> >>>
> >>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
> >>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
> >>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
> >>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
> >>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
> >>>>>>> The basic idea is to save a generation in the page-struct that tracks
> >>>>>>> when deferred PTE change took place, and track whenever a TLB flush
> >>>>>>> completed. In addition, other users - such as mprotect - would use
> >>>>>>> the tlb_gather interface.
> >>>>>>>
> >>>>>>> Unfortunately, due to limited space in page-struct this would only
> >>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
> >>>>>>
> >>>>>> I don't want to discourage you but I don't think this would end up
> >>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
> >>>>>> and I've run into problems with this before while trying to do
> >>>>>> something similar.
> >>>>>
> >>>>> Discourage, discourage. Better now than later.
> >>>>>
> >>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
> >>>>> per-table for architectures that prefer it this way. It does require
> >>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
> >>>>> will focus on x86-64 right now.
> >>>>
> >>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
> >>>> happy to help get this up and running once you have something I can build
> >>>> on.
> >>>
> >>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
> >>> Would it be something worth pursuing? Arm has been using mm_cpumask,
> >>> so it might not be too difficult I guess?
> >>
> >> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
> >>
> >> IIUC, there are at least two bugs in x86 implementation.
> >>
> >> First, there is a missing memory barrier in tlbbatch_add_mm() between
> >> inc_mm_tlb_gen() and the read of mm_cpumask().
> >
> > In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
> > comment says -- atomic update ops that return values are also full
> > memory barriers.
>
> Yes, you are correct.
>
> >
> >> Second, try_to_unmap_flush() clears flush_required after flushing. Another
> >> thread can call set_tlb_ubc_flush_pending() after the flush and before
> >> flush_required is cleared, and the indication that a TLB flush is pending
> >> can be lost.
> >
> > This isn't a problem either because flush_required is per thread.
>
> Sorry, I meant mm->tlb_flush_batched . It is not per-thread.
> flush_tlb_batched_pending() clears it after flush and indications that
> set_tlb_ubc_flush_pending() sets in between can be lost.
Hmm, the PTL argument above flush_tlb_batched_pending() doesn't seem
to hold when USE_SPLIT_PTE_PTLOCKS is set. Do you have a reproducer?
KCSAN might be able to help in this case.
> On Jan 17, 2021, at 11:25 AM, Yu Zhao <[email protected]> wrote:
>
> On Sun, Jan 17, 2021 at 02:13:43AM -0800, Nadav Amit wrote:
>>> On Jan 17, 2021, at 1:16 AM, Yu Zhao <[email protected]> wrote:
>>>
>>> On Sat, Jan 16, 2021 at 11:32:22PM -0800, Nadav Amit wrote:
>>>>> On Jan 16, 2021, at 8:41 PM, Yu Zhao <[email protected]> wrote:
>>>>>
>>>>> On Tue, Jan 12, 2021 at 09:43:38PM +0000, Will Deacon wrote:
>>>>>> On Tue, Jan 12, 2021 at 12:38:34PM -0800, Nadav Amit wrote:
>>>>>>>> On Jan 12, 2021, at 11:56 AM, Yu Zhao <[email protected]> wrote:
>>>>>>>> On Tue, Jan 12, 2021 at 11:15:43AM -0800, Nadav Amit wrote:
>>>>>>>>> I will send an RFC soon for per-table deferred TLB flushes tracking.
>>>>>>>>> The basic idea is to save a generation in the page-struct that tracks
>>>>>>>>> when deferred PTE change took place, and track whenever a TLB flush
>>>>>>>>> completed. In addition, other users - such as mprotect - would use
>>>>>>>>> the tlb_gather interface.
>>>>>>>>>
>>>>>>>>> Unfortunately, due to limited space in page-struct this would only
>>>>>>>>> be possible for 64-bit (and my implementation is only for x86-64).
>>>>>>>>
>>>>>>>> I don't want to discourage you but I don't think this would end up
>>>>>>>> well. PPC doesn't necessarily follow one-page-struct-per-table rule,
>>>>>>>> and I've run into problems with this before while trying to do
>>>>>>>> something similar.
>>>>>>>
>>>>>>> Discourage, discourage. Better now than later.
>>>>>>>
>>>>>>> It will be relatively easy to extend the scheme to be per-VMA instead of
>>>>>>> per-table for architectures that prefer it this way. It does require
>>>>>>> TLB-generation tracking though, which Andy only implemented for x86, so I
>>>>>>> will focus on x86-64 right now.
>>>>>>
>>>>>> Can you remind me of what we're missing on arm64 in this area, please? I'm
>>>>>> happy to help get this up and running once you have something I can build
>>>>>> on.
>>>>>
>>>>> I noticed arm/arm64 don't support ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
>>>>> Would it be something worth pursuing? Arm has been using mm_cpumask,
>>>>> so it might not be too difficult I guess?
>>>>
>>>> [ +Mel Gorman who implemented ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH ]
>>>>
>>>> IIUC, there are at least two bugs in x86 implementation.
>>>>
>>>> First, there is a missing memory barrier in tlbbatch_add_mm() between
>>>> inc_mm_tlb_gen() and the read of mm_cpumask().
>>>
>>> In arch_tlbbatch_add_mm()? inc_mm_tlb_gen() has builtin barrier as its
>>> comment says -- atomic update ops that return values are also full
>>> memory barriers.
>>
>> Yes, you are correct.
>>
>>>> Second, try_to_unmap_flush() clears flush_required after flushing. Another
>>>> thread can call set_tlb_ubc_flush_pending() after the flush and before
>>>> flush_required is cleared, and the indication that a TLB flush is pending
>>>> can be lost.
>>>
>>> This isn't a problem either because flush_required is per thread.
>>
>> Sorry, I meant mm->tlb_flush_batched . It is not per-thread.
>> flush_tlb_batched_pending() clears it after flush and indications that
>> set_tlb_ubc_flush_pending() sets in between can be lost.
>
> Hmm, the PTL argument above flush_tlb_batched_pending() doesn't seem
> to hold when USE_SPLIT_PTE_PTLOCKS is set. Do you have a reproducer?
> KCSAN might be able to help in this case.
I do not have a reproducer. It is just based on my understanding of this
code.
I will give building a reproducer a short try, although for some reason
“you guys” complain that my reproducers do not work for you (is it PTI that
I disable? idle=poll? running in a VM?). It is also not likely to be easy
to build a reproducer that actually triggers memory corruption.
Anyhow, apparently KCSAN has already complained about this code, which led
Qian Cai to add data_race() to silence it (9c1177b62a8c
"mm/rmap: annotate a data race at tlb_flush_batched").
Note that Andrea asked me not to hijack this thread and to start a separate
one on this issue.