2023-07-26 22:13:12

by Jann Horn

Subject: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

Hi!

Patch 1 here is a straightforward fix for a race in per-VMA locking code
that can lead to use-after-free; I hope we can get this one into
mainline and stable quickly.

Patch 2 is a fix for what I believe is a longstanding memory ordering
issue in how vma->anon_vma is used across the MM subsystem; I expect
that this one will have to go through a few iterations of review and
potentially rewrites, because memory ordering is tricky.
(If someone else wants to take over patch 2, I would be very happy.)

These patches don't really belong together all that much, I'm just
sending them as a series because they'd otherwise conflict.

I am CCing:

- Suren because patch 1 touches his code
- Matthew Wilcox because he is also currently working on per-VMA
locking stuff
- all the maintainers/reviewers for the Kernel Memory Consistency Model
so they can help figure out the READ_ONCE() vs smp_load_acquire()
thing
- people involved in the previous discussion on the security list


Jann Horn (2):
mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
mm: Fix anon_vma memory ordering

include/linux/rmap.h | 15 ++++++++++++++-
mm/huge_memory.c | 4 +++-
mm/khugepaged.c | 2 +-
mm/ksm.c | 16 +++++++++++-----
mm/memory.c | 32 ++++++++++++++++++++------------
mm/mmap.c | 13 ++++++++++---
mm/rmap.c | 6 ++++--
mm/swapfile.c | 3 ++-
8 files changed, 65 insertions(+), 26 deletions(-)


base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
--
2.41.0.487.g6d72f3e995-goog



2023-07-26 22:33:40

by Jann Horn

Subject: [PATCH 2/2] mm: Fix anon_vma memory ordering

A read of vma->anon_vma under mmap_lock in read mode (in particular in
anon_vma_prepare()) can race with a concurrent update under mmap_lock
in read mode plus pagetable lock (in __anon_vma_prepare()).
However, the only allowed concurrent update is one that changes
vma->anon_vma from NULL to a non-NULL pointer; once vma->anon_vma has
been set to a non-NULL value, it will keep that value as long as the
mmap lock is held in read mode.
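
To sketch the problematic pattern (illustrative only, with the
function names used in this series):

  writer (__anon_vma_prepare()):      reader (e.g. anon_vma_prepare()):
    [initialize *anon_vma]
    spin_lock(&mm->page_table_lock)
    vma->anon_vma = anon_vma            if (vma->anon_vma != NULL)
    spin_unlock(&mm->page_table_lock)       use(vma->anon_vma->root)

Without smp_store_release()/smp_load_acquire() (or dependency
ordering) on the vma->anon_vma accesses, nothing orders the reader's
use of the anon_vma fields against the writer's initialization of
them, because the reader does not take the page table lock.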

This clearly means that we need an smp_store_release() in
__anon_vma_prepare() (unless we happen to have some equivalent barrier
there, which doesn't seem to be the case).

The harder choice is what to do for all the readers.
In the page fault path, we probably want to only have a single piece of
code (in anon_vma_prepare()) that has to deal with ensuring that
vma->anon_vma is stable, and then be able to use it with normal memory
accesses afterwards.
One way we can do this is with smp_load_acquire(), which is what this
patch does; however, I expect that we can have a lot of bikeshedding on
whether that is the right barrier to use and whether it should be moved
into a helper function for other places outside the page fault path that
look at vma->anon_vma in read mode.

The other approach is to use address-dependency ordering with READ_ONCE(),
maybe combined with a barrier() afterwards to prevent compiler
reordering. Based on some discussion we had on the security list (and in
particular repeating much of what Linus explained), this would probably
be correct on all currently-supported architectures, because reads from
the same location are guaranteed to not be reordered by the CPU
(I think?), so subsequent loads from the same location would be
guaranteed to see the same value for vma->anon_vma.
And then on anything other than alpha, we'd get implicit
address-dependency ordering against the plain load; while on alpha, we
have a memory barrier inside the READ_ONCE().

Since smp_load_acquire() is allegedly fairly cheap on the main
architectures with weak memory ordering, and the reasoning for why
READ_ONCE() works is kinda iffy, I figured I should use smp_load_acquire()
for v1 of the patch and then we can bikeshed it from there.

Also I tried looking through every single access to vma->anon_vma and
put smp_load_acquire() on every one that might reasonably need it, but
again we can probably bikeshed on that.
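
For the memory-model folks: the release/acquire variant that this patch
implements boils down to something like the following hand-written
litmus-style sketch (modeled on the tests in
tools/memory-model/litmus-tests/, but not actually run through herd7;
all names are made up):

C anon-vma-publish

{}

P0(int **anon_vma, int *root)
{
        *root = 1;                         /* initialize the anon_vma */
        smp_store_release(anon_vma, root); /* publish the pointer */
}

P1(int **anon_vma, int *root)
{
        int *r0;
        int r1;

        r1 = 0;
        r0 = smp_load_acquire(anon_vma);   /* as in anon_vma_prepare() */
        if (r0)
                r1 = *r0;                  /* later plain use */
}

exists (1:r0=root /\ 1:r1=0)

The "exists" clause describes the bad outcome (pointer observed, but
the initialization not observed), which the release/acquire pairing is
supposed to forbid.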

Cc: [email protected]
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Jann Horn <[email protected]>
---
include/linux/rmap.h | 15 ++++++++++++++-
mm/huge_memory.c | 4 +++-
mm/khugepaged.c | 2 +-
mm/ksm.c | 16 +++++++++++-----
mm/memory.c | 6 +++++-
mm/mmap.c | 13 ++++++++++---
mm/rmap.c | 6 ++++--
mm/swapfile.c | 3 ++-
8 files changed, 50 insertions(+), 15 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..f7f669a7149b 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -153,9 +153,22 @@ int anon_vma_fork(struct vm_area_struct *, struct vm_area_struct *);

static inline int anon_vma_prepare(struct vm_area_struct *vma)
{
- if (likely(vma->anon_vma))
+ /*
+ * Pairs with smp_store_release() in __anon_vma_prepare().
+ *
+ * Holding the mmap lock in read mode guarantees that the only possible
+ * concurrent modification of vma->anon_vma is that it changes from NULL
+ * to another value. Therefore, if we see a non-NULL value here, future
+ * plain loads of vma->anon_vma are okay.
+ */
+ if (likely(smp_load_acquire(&vma->anon_vma)))
return 0;

+ /*
+ * If __anon_vma_prepare() succeeds, it has loaded or stored a non-NULL
+ * value in vma->anon_vma while protected against concurrent changes by
+ * the page table lock.
+ */
return __anon_vma_prepare(vma);
}

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index eb3678360b97..1d0322d45187 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -144,8 +144,10 @@ bool hugepage_vma_check(struct vm_area_struct *vma, unsigned long vm_flags,
*
* Allow page fault since anon_vma may be not initialized until
* the first page fault.
+ *
+ * See anon_vma_prepare() for barrier.
*/
- if (!vma->anon_vma)
+ if (!smp_load_acquire(&vma->anon_vma))
return (smaps || in_pf);

return true;
diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 78c8d5d8b628..4553efe08264 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -939,7 +939,7 @@ static int hugepage_vma_revalidate(struct mm_struct *mm, unsigned long address,
* hugepage_vma_check may return true for qualified file
* vmas.
*/
- if (expect_anon && (!(*vmap)->anon_vma || !vma_is_anonymous(*vmap)))
+ if (expect_anon && (!smp_load_acquire(&(*vmap)->anon_vma) || !vma_is_anonymous(*vmap)))
return SCAN_PAGE_ANON;
return SCAN_SUCCEED;
}
diff --git a/mm/ksm.c b/mm/ksm.c
index ba266359da55..89e69c43a857 100644
--- a/mm/ksm.c
+++ b/mm/ksm.c
@@ -545,7 +545,9 @@ static struct vm_area_struct *find_mergeable_vma(struct mm_struct *mm,
if (ksm_test_exit(mm))
return NULL;
vma = vma_lookup(mm, addr);
- if (!vma || !(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+ /* see anon_vma_prepare() */
+ if (!vma || !(vma->vm_flags & VM_MERGEABLE) ||
+ !smp_load_acquire(&vma->anon_vma))
return NULL;
return vma;
}
@@ -1026,7 +1028,8 @@ static int unmerge_and_remove_all_rmap_items(void)
goto mm_exiting;

for_each_vma(vmi, vma) {
- if (!(vma->vm_flags & VM_MERGEABLE) || !vma->anon_vma)
+ if (!(vma->vm_flags & VM_MERGEABLE) ||
+ !smp_load_acquire(&vma->anon_vma))
continue;
err = unmerge_ksm_pages(vma,
vma->vm_start, vma->vm_end);
@@ -2367,7 +2370,8 @@ static struct ksm_rmap_item *scan_get_next_rmap_item(struct page **page)
continue;
if (ksm_scan.address < vma->vm_start)
ksm_scan.address = vma->vm_start;
- if (!vma->anon_vma)
+ /* see anon_vma_prepare() */
+ if (!smp_load_acquire(&vma->anon_vma))
ksm_scan.address = vma->vm_end;

while (ksm_scan.address < vma->vm_end) {
@@ -2529,7 +2533,8 @@ static int __ksm_del_vma(struct vm_area_struct *vma)
if (!(vma->vm_flags & VM_MERGEABLE))
return 0;

- if (vma->anon_vma) {
+ /* see anon_vma_prepare() */
+ if (smp_load_acquire(&vma->anon_vma)) {
err = unmerge_ksm_pages(vma, vma->vm_start, vma->vm_end);
if (err)
return err;
@@ -2667,7 +2672,8 @@ int ksm_madvise(struct vm_area_struct *vma, unsigned long start,
if (!(*vm_flags & VM_MERGEABLE))
return 0; /* just ignore the advice */

- if (vma->anon_vma) {
+ /* see anon_vma_prepare() */
+ if (smp_load_acquire(&vma->anon_vma)) {
err = unmerge_ksm_pages(vma, start, end);
if (err)
return err;
diff --git a/mm/memory.c b/mm/memory.c
index 603b2f419948..ac602eb70eb4 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5401,8 +5401,12 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm,
* This check must happen after vma_start_read(); otherwise, a
* concurrent mremap() with MREMAP_DONTUNMAP could dissociate the VMA
* from its anon_vma.
+ *
+ * We don't rely on the READ_ONCE() here for ordering; we're guaranteed
+ * to go through anon_vma_prepare() before we actually access
+ * vma->anon_vma.
*/
- if (unlikely(!vma->anon_vma && !vma_is_tcp(vma)))
+ if (unlikely(!READ_ONCE(vma->anon_vma) && !vma_is_tcp(vma)))
goto inval_end_read;

/*
diff --git a/mm/mmap.c b/mm/mmap.c
index 3eda23c9ebe7..092f64d30522 100644
--- a/mm/mmap.c
+++ b/mm/mmap.c
@@ -1057,8 +1057,7 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
* the anon_vma of 'old' is concurrently in the process of being set up
* by another page fault trying to merge _that_. But that's ok: if it
* is being set up, that automatically means that it will be a singleton
- * acceptable for merging, so we can do all of this optimistically. But
- * we do that READ_ONCE() to make sure that we never re-load the pointer.
+ * acceptable for merging, so we can do all of this optimistically.
*
* IOW: that the "list_is_singular()" test on the anon_vma_chain only
* matters for the 'stable anon_vma' case (ie the thing we want to avoid
@@ -1072,7 +1071,15 @@ static int anon_vma_compatible(struct vm_area_struct *a, struct vm_area_struct *
static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old, struct vm_area_struct *a, struct vm_area_struct *b)
{
if (anon_vma_compatible(a, b)) {
- struct anon_vma *anon_vma = READ_ONCE(old->anon_vma);
+ /*
+ * Pairs with smp_store_release() in __anon_vma_prepare().
+ *
+ * We could get away with a READ_ONCE() here, but
+ * smp_load_acquire() ensures that the following
+ * list_is_singular() check on old->anon_vma_chain doesn't race
+ * with __anon_vma_prepare().
+ */
+ struct anon_vma *anon_vma = smp_load_acquire(&old->anon_vma);

if (anon_vma && list_is_singular(&old->anon_vma_chain))
return anon_vma;
diff --git a/mm/rmap.c b/mm/rmap.c
index 0c0d8857dfce..83bc4267269f 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -210,8 +210,9 @@ int __anon_vma_prepare(struct vm_area_struct *vma)
anon_vma_lock_write(anon_vma);
/* page_table_lock to protect against threads */
spin_lock(&mm->page_table_lock);
+ /* no need for smp_load_acquire() here, the lock prevents concurrency */
if (likely(!vma->anon_vma)) {
- vma->anon_vma = anon_vma;
+ smp_store_release(&vma->anon_vma, anon_vma);
anon_vma_chain_link(vma, avc, anon_vma);
anon_vma->num_active_vmas++;
allocated = NULL;
@@ -755,8 +756,9 @@ unsigned long page_address_in_vma(struct page *page, struct vm_area_struct *vma)
/*
* Note: swapoff's unuse_vma() is more efficient with this
* check, and needs it to match anon_vma when KSM is active.
+ * See anon_vma_prepare() for barrier.
*/
- if (!vma->anon_vma || !page__anon_vma ||
+ if (!smp_load_acquire(&vma->anon_vma) || !page__anon_vma ||
vma->anon_vma->root != page__anon_vma->root)
return -EFAULT;
} else if (!vma->vm_file) {
diff --git a/mm/swapfile.c b/mm/swapfile.c
index 8e6dde68b389..bdc71829df85 100644
--- a/mm/swapfile.c
+++ b/mm/swapfile.c
@@ -1986,7 +1986,8 @@ static int unuse_mm(struct mm_struct *mm, unsigned int type)

mmap_read_lock(mm);
for_each_vma(vmi, vma) {
- if (vma->anon_vma) {
+ /* see anon_vma_prepare() */
+ if (smp_load_acquire(&vma->anon_vma)) {
ret = unuse_vma(vma, type);
if (ret)
break;
--
2.41.0.487.g6d72f3e995-goog


2023-07-26 23:37:12

by Paul E. McKenney

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> Hi!
>
> Patch 1 here is a straightforward fix for a race in per-VMA locking code
> that can lead to use-after-free; I hope we can get this one into
> mainline and stable quickly.
>
> Patch 2 is a fix for what I believe is a longstanding memory ordering
> issue in how vma->anon_vma is used across the MM subsystem; I expect
> that this one will have to go through a few iterations of review and
> potentially rewrites, because memory ordering is tricky.
> (If someone else wants to take over patch 2, I would be very happy.)
>
> These patches don't really belong together all that much, I'm just
> sending them as a series because they'd otherwise conflict.
>
> I am CCing:
>
> - Suren because patch 1 touches his code
> - Matthew Wilcox because he is also currently working on per-VMA
> locking stuff
> - all the maintainers/reviewers for the Kernel Memory Consistency Model
> so they can help figure out the READ_ONCE() vs smp_load_acquire()
> thing

READ_ONCE() has weaker ordering properties than smp_load_acquire().

For example, given a pointer gp:

p = whichever(gp);
a = 1;
r1 = p->b;
if ((uintptr_t)p & 0x1)
        WRITE_ONCE(b, 1);
WRITE_ONCE(c, 1);

Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
"READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
ordered after the load from gp (the former due to an address dependency
and the latter due to a (fragile) control dependency). The compiler
is within its rights to reorder the store to "a" to precede the load
from gp. The compiler is forbidden from reordering the store to "c"
with the load from gp (because both are volatile accesses), but the CPU
is completely within its rights to do this reordering.

But if "whichever" is "smp_load_acquire()", all four of the subsequent
memory accesses are ordered after the load from gp.

Similarly, for WRITE_ONCE() and smp_store_release():

p = READ_ONCE(gp);
r1 = READ_ONCE(gi);
r2 = READ_ONCE(gj);
a = 1;
WRITE_ONCE(b, 1);
if (r1 & 0x1)
        whichever(p->q, r2);

Again leaving aside the "&" needed by smp_store_release(), if "whichever"
is WRITE_ONCE(), then the load from gp, the load from gi, and the load
from gj are all ordered before the store to p->q (by address dependency,
control dependency, and data dependency, respectively). The store to "a"
can be reordered with the store to p->q by the compiler. The store to
"b" cannot be reordered with the store to p->q by the compiler (again,
both are volatile), but the CPU is free to reorder them, especially when
whichever() is implemented as a conditional store.

But if "whichever" is "smp_store_release()", all five of the earlier
memory accesses are ordered before the store to p->q.

Does that help, or am I missing the point of your question?

Thanx, Paul

> - people involved in the previous discussion on the security list
>
>
> Jann Horn (2):
> mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
> mm: Fix anon_vma memory ordering
>
> include/linux/rmap.h | 15 ++++++++++++++-
> mm/huge_memory.c | 4 +++-
> mm/khugepaged.c | 2 +-
> mm/ksm.c | 16 +++++++++++-----
> mm/memory.c | 32 ++++++++++++++++++++------------
> mm/mmap.c | 13 ++++++++++---
> mm/rmap.c | 6 ++++--
> mm/swapfile.c | 3 ++-
> 8 files changed, 65 insertions(+), 26 deletions(-)
>
>
> base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> --
> 2.41.0.487.g6d72f3e995-goog
>

2023-07-27 14:59:28

by Jann Horn

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
>
> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> > Hi!
> >
> > Patch 1 here is a straightforward fix for a race in per-VMA locking code
> > that can lead to use-after-free; I hope we can get this one into
> > mainline and stable quickly.
> >
> > Patch 2 is a fix for what I believe is a longstanding memory ordering
> > issue in how vma->anon_vma is used across the MM subsystem; I expect
> > that this one will have to go through a few iterations of review and
> > potentially rewrites, because memory ordering is tricky.
> > (If someone else wants to take over patch 2, I would be very happy.)
> >
> > These patches don't really belong together all that much, I'm just
> > sending them as a series because they'd otherwise conflict.
> >
> > I am CCing:
> >
> > - Suren because patch 1 touches his code
> > - Matthew Wilcox because he is also currently working on per-VMA
> > locking stuff
> > - all the maintainers/reviewers for the Kernel Memory Consistency Model
> > so they can help figure out the READ_ONCE() vs smp_load_acquire()
> > thing
>
> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>
> For example, given a pointer gp:
>
> p = whichever(gp);
> a = 1;
> r1 = p->b;
> if ((uintptr_t)p & 0x1)
> WRITE_ONCE(b, 1);
> WRITE_ONCE(c, 1);
>
> Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
> "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
> ordered after the load from gp (the former due to an address dependency
> and the latter due to a (fragile) control dependency). The compiler
> is within its rights to reorder the store to "a" to precede the load
> from gp. The compiler is forbidden from reordering the store to "c"
> with the load from gp (because both are volatile accesses), but the CPU
> is completely within its rights to do this reordering.
>
> But if "whichever" is "smp_load_acquire()", all four of the subsequent
> memory accesses are ordered after the load from gp.
>
> Similarly, for WRITE_ONCE() and smp_store_release():
>
> p = READ_ONCE(gp);
> r1 = READ_ONCE(gi);
> r2 = READ_ONCE(gj);
> a = 1;
> WRITE_ONCE(b, 1);
> if (r1 & 0x1)
> whichever(p->q, r2);
>
> Again leaving aside the "&" needed by smp_store_release(), if "whichever"
> is WRITE_ONCE(), then the load from gp, the load from gi, and the load
> from gj are all ordered before the store to p->q (by address dependency,
> control dependency, and data dependency, respectively). The store to "a"
> can be reordered with the store to p->q by the compiler. The store to
> "b" cannot be reordered with the store to p->q by the compiler (again,
> both are volatile), but the CPU is free to reorder them, especially when
> whichever() is implemented as a conditional store.
>
> But if "whichever" is "smp_store_release()", all five of the earlier
> memory accesses are ordered before the store to p->q.
>
> Does that help, or am I missing the point of your question?

My main question is how permissible/ugly you think the following use
of READ_ONCE() would be, and whether you think it ought to be an
smp_load_acquire() instead.

Assume that we are holding some kind of lock that ensures that the
only possible concurrent update to "vma->anon_vma" is that it changes
from a NULL pointer to a non-NULL pointer (using smp_store_release()).


if (READ_ONCE(vma->anon_vma) != NULL) {
        // we now know that vma->anon_vma cannot change anymore

        // access the same memory location again with a plain load
        struct anon_vma *a = vma->anon_vma;

        // this needs to be address-dependency-ordered against one of
        // the loads from vma->anon_vma
        struct anon_vma *root = a->root;
}


Is this fine? If it is not fine just because the compiler might
reorder the plain load of vma->anon_vma before the READ_ONCE() load,
would it be fine after adding a barrier() directly after the
READ_ONCE()?

I initially suggested using READ_ONCE() for this, and then Linus and
I tried to reason it out and Linus suggested (if I understood him
correctly) that you could make the ugly argument that this works
because loads from the same location will not be reordered by the
hardware. So on anything other than alpha, we'd still have the
required address-dependency ordering because that happens for all
loads, even plain loads, while on alpha, the READ_ONCE() includes a
memory barrier. But that argument is weirdly reliant on
architecture-specific implementation details.

The other option is to replace the READ_ONCE() with a
smp_load_acquire(), at which point it becomes a lot simpler to show
that the code is correct.
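
(For comparison, the smp_load_acquire() version of the same fragment
would be something like:

if (smp_load_acquire(&vma->anon_vma) != NULL) {
        // everything below is ordered after the acquire load, so
        // plain accesses to vma->anon_vma and its fields are fine
        struct anon_vma *a = vma->anon_vma;
        struct anon_vma *root = a->root;
}

with no need for the same-location argument at all.)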


> Thanx, Paul
>
> > - people involved in the previous discussion on the security list
> >
> >
> > Jann Horn (2):
> > mm: lock_vma_under_rcu() must check vma->anon_vma under vma lock
> > mm: Fix anon_vma memory ordering
> >
> > include/linux/rmap.h | 15 ++++++++++++++-
> > mm/huge_memory.c | 4 +++-
> > mm/khugepaged.c | 2 +-
> > mm/ksm.c | 16 +++++++++++-----
> > mm/memory.c | 32 ++++++++++++++++++++------------
> > mm/mmap.c | 13 ++++++++++---
> > mm/rmap.c | 6 ++++--
> > mm/swapfile.c | 3 ++-
> > 8 files changed, 65 insertions(+), 26 deletions(-)
> >
> >
> > base-commit: 20ea1e7d13c1b544fe67c4a8dc3943bb1ab33e6f
> > --
> > 2.41.0.487.g6d72f3e995-goog
> >

2023-07-27 15:04:32

by Will Deacon

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
> >
> > On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
> > > Hi!
> > >
> > > Patch 1 here is a straightforward fix for a race in per-VMA locking code
> > > that can lead to use-after-free; I hope we can get this one into
> > > mainline and stable quickly.
> > >
> > > Patch 2 is a fix for what I believe is a longstanding memory ordering
> > > issue in how vma->anon_vma is used across the MM subsystem; I expect
> > > that this one will have to go through a few iterations of review and
> > > potentially rewrites, because memory ordering is tricky.
> > > (If someone else wants to take over patch 2, I would be very happy.)
> > >
> > > These patches don't really belong together all that much, I'm just
> > > sending them as a series because they'd otherwise conflict.
> > >
> > > I am CCing:
> > >
> > > - Suren because patch 1 touches his code
> > > - Matthew Wilcox because he is also currently working on per-VMA
> > > locking stuff
> > > - all the maintainers/reviewers for the Kernel Memory Consistency Model
> > > so they can help figure out the READ_ONCE() vs smp_load_acquire()
> > > thing
> >
> > READ_ONCE() has weaker ordering properties than smp_load_acquire().
> >
> > For example, given a pointer gp:
> >
> > p = whichever(gp);
> > a = 1;
> > r1 = p->b;
> > if ((uintptr_t)p & 0x1)
> > WRITE_ONCE(b, 1);
> > WRITE_ONCE(c, 1);
> >
> > Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
> > "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
> > ordered after the load from gp (the former due to an address dependency
> > and the latter due to a (fragile) control dependency). The compiler
> > is within its rights to reorder the store to "a" to precede the load
> > from gp. The compiler is forbidden from reordering the store to "c"
> > with the load from gp (because both are volatile accesses), but the CPU
> > is completely within its rights to do this reordering.
> >
> > But if "whichever" is "smp_load_acquire()", all four of the subsequent
> > memory accesses are ordered after the load from gp.
> >
> > Similarly, for WRITE_ONCE() and smp_store_release():
> >
> > p = READ_ONCE(gp);
> > r1 = READ_ONCE(gi);
> > r2 = READ_ONCE(gj);
> > a = 1;
> > WRITE_ONCE(b, 1);
> > if (r1 & 0x1)
> > whichever(p->q, r2);
> >
> > Again leaving aside the "&" needed by smp_store_release(), if "whichever"
> > is WRITE_ONCE(), then the load from gp, the load from gi, and the load
> > from gj are all ordered before the store to p->q (by address dependency,
> > control dependency, and data dependency, respectively). The store to "a"
> > can be reordered with the store to p->q by the compiler. The store to
> > "b" cannot be reordered with the store to p->q by the compiler (again,
> > both are volatile), but the CPU is free to reorder them, especially when
> > whichever() is implemented as a conditional store.
> >
> > But if "whichever" is "smp_store_release()", all five of the earlier
> > memory accesses are ordered before the store to p->q.
> >
> > Does that help, or am I missing the point of your question?
>
> My main question is how permissible/ugly you think the following use
> of READ_ONCE() would be, and whether you think it ought to be an
> smp_load_acquire() instead.
>
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>
>
> if (READ_ONCE(vma->anon_vma) != NULL) {
> // we now know that vma->anon_vma cannot change anymore
>
> // access the same memory location again with a plain load
> struct anon_vma *a = vma->anon_vma;
>
> // this needs to be address-dependency-ordered against one of
> // the loads from vma->anon_vma
> struct anon_vma *root = a->root;
> }
>
>
> Is this fine? If it is not fine just because the compiler might
> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> would it be fine after adding a barrier() directly after the
> READ_ONCE()?

I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
as I've run into cases where you have sequences such as:

// Assume *ptr is initially 0 and somebody else writes it to 1
// concurrently

foo = *ptr;
bar = READ_ONCE(*ptr);
baz = *ptr;

and you can get foo == baz == 0 but bar == 1 because the compiler only
ends up reading from memory twice.

That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
when dereferencing pointer to pte table"), which was very unpleasant to
debug.

Will

2023-07-27 15:21:17

by Matthew Wilcox

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> Assume that we are holding some kind of lock that ensures that the
> only possible concurrent update to "vma->anon_vma" is that it changes
> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>
>
> if (READ_ONCE(vma->anon_vma) != NULL) {
> // we now know that vma->anon_vma cannot change anymore
>
> // access the same memory location again with a plain load
> struct anon_vma *a = vma->anon_vma;
>
> // this needs to be address-dependency-ordered against one of
> // the loads from vma->anon_vma
> struct anon_vma *root = a->root;
> }
>
>
> Is this fine? If it is not fine just because the compiler might
> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> would it be fine after adding a barrier() directly after the
> READ_ONCE()?
>
> I initially suggested using READ_ONCE() for this, and then Linus and
> I tried to reason it out and Linus suggested (if I understood him
> correctly) that you could make the ugly argument that this works
> because loads from the same location will not be reordered by the
> hardware. So on anything other than alpha, we'd still have the
> required address-dependency ordering because that happens for all
> loads, even plain loads, while on alpha, the READ_ONCE() includes a
> memory barrier. But that argument is weirdly reliant on
> architecture-specific implementation details.
>
> The other option is to replace the READ_ONCE() with a
> smp_load_acquire(), at which point it becomes a lot simpler to show
> that the code is correct.

Aren't we straining at gnats here? The context of this is handling a
page fault, and we used to take an entire rwsem for read. I'm having
a hard time caring about "the extra expense" of an unnecessarily broad
barrier.

Cost of an L3 cacheline miss is in the thousands of cycles. Cost of a
barrier is ... tens?

2023-07-27 15:58:57

by Jann Horn

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 5:07 PM Matthew Wilcox <[email protected]> wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
>
> Aren't we straining at gnats here? The context of this is handling a
> page fault, and we used to take an entire rwsem for read. I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
>
> Cost of an L3 cacheline miss is in the thousands of cycles. Cost of a
> barrier is ... tens?

Yeah, fair point. If it's hard to show correctness with READ_ONCE() we
can just use smp_load_acquire() and call it a day.

2023-07-27 16:14:26

by Alan Stern

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:

> > Assume that we are holding some kind of lock that ensures that the
> > only possible concurrent update to "vma->anon_vma" is that it changes
> > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> >
> >
> > if (READ_ONCE(vma->anon_vma) != NULL) {
> > // we now know that vma->anon_vma cannot change anymore
> >
> > // access the same memory location again with a plain load
> > struct anon_vma *a = vma->anon_vma;
> >
> > // this needs to be address-dependency-ordered against one of
> > // the loads from vma->anon_vma
> > struct anon_vma *root = a->root;
> > }

This reads a little oddly, perhaps because it's a fragment from a larger
piece of code. Still, if I were doing something like this, I'd write it
as:

struct anon_vma *a;

a = READ_ONCE(vma->anon_vma);
if (a != NULL) {
        struct anon_vma *root = a->root;
        ...

thus eliminating the possibility of confusion from multiple reads of the
same address.

In this situation, the ordering of the two reads is guaranteed by the
address dependency. And people shouldn't worry too much about using
that sort of ordering; RCU relies on it critically, all the time.
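
For instance, the canonical RCU read-side pattern is essentially this
(sketch):

        rcu_read_lock();
        p = rcu_dereference(gp);
        if (p)
                r1 = p->a;      /* ordered after the load of gp by
                                   the address dependency */
        rcu_read_unlock();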

> > Is this fine? If it is not fine just because the compiler might
> > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > would it be fine after adding a barrier() directly after the
> > READ_ONCE()?
>
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
>
> // Assume *ptr is initially 0 and somebody else writes it to 1
> // concurrently
>
> foo = *ptr;
> bar = READ_ONCE(*ptr);
> baz = *ptr;
>
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
>
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Indeed, that's the sort of thing that can happen when plain accesses are
involved in a race.

Alan Stern

2023-07-27 16:30:51

by Paul E. McKenney

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 11:44:02AM -0400, Alan Stern wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>
> > > Assume that we are holding some kind of lock that ensures that the
> > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > >
> > >
> > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > > // we now know that vma->anon_vma cannot change anymore
> > >
> > > // access the same memory location again with a plain load
> > > struct anon_vma *a = vma->anon_vma;
> > >
> > > // this needs to be address-dependency-ordered against one of
> > > // the loads from vma->anon_vma
> > > struct anon_vma *root = a->root;
> > > }
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code. Still, if I were doing something like this, I'd write it
> as:
>
> struct anon_vma *a;
>
> a = READ_ONCE(vma->anon_vma);
> if (a != NULL) {
> struct anon_vma *root = a->root;
> ...
>
> thus eliminating the possibility of confusion from multiple reads of the
> same address.
>
> In this situation, the ordering of the two reads is guaranteed by the
> address dependency. And people shouldn't worry too much about using
> that sort of ordering; RCU relies on it critically, all the time.

Agreed. In contrast, control dependencies require quite a bit more care
and feeding, and are usually best avoided.

But even with the normal RCU address/data dependencies, it is possible
to get in trouble. For but one example, comparing a pointer obtained
from rcu_dereference() to the address of a static structure is a good
way to break your address dependency. (Just yesterday evening I talked
to someone who had spent quite a bit of time chasing one of these down,
so yes, this is quite real.)
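
The pattern looks roughly like this (names made up):

        struct foo *p = rcu_dereference(gp);

        if (p == &default_foo) {
                /*
                 * The compiler now knows the value of p, so it can
                 * replace p->a with default_foo.a -- a load with no
                 * address dependency on the rcu_dereference(), which
                 * destroys the intended ordering.
                 */
                r1 = p->a;
        }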

> > > Is this fine? If it is not fine just because the compiler might
> > > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > > would it be fine after adding a barrier() directly after the
> > > READ_ONCE()?
> >
> > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > as I've run into cases where you have sequences such as:
> >
> > // Assume *ptr is initially 0 and somebody else writes it to 1
> > // concurrently
> >
> > foo = *ptr;
> > bar = READ_ONCE(*ptr);
> > baz = *ptr;
> >
> > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > ends up reading from memory twice.
> >
> > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > when dereferencing pointer to pte table"), which was very unpleasant to
> > debug.
>
> Indeed, that's the sort of thing that can happen when plain accesses are
> involved in a race.

Agreed. Furthermore, it is more important to comment plain C-language
accesses to shared variables than to comment the likes of READ_ONCE().
"OK, tell me again exactly why you think the compiler cannot mess you
up here?"

Thanx, Paul

2023-07-27 16:30:53

by Paul E. McKenney

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 06:10:12PM +0200, Jann Horn wrote:
> On Thu, Jul 27, 2023 at 5:44 PM Alan Stern <[email protected]> wrote:
> > On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> >
> > > > Assume that we are holding some kind of lock that ensures that the
> > > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > > >
> > > >
> > > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > > > // we now know that vma->anon_vma cannot change anymore
> > > >
> > > > // access the same memory location again with a plain load
> > > > struct anon_vma *a = vma->anon_vma;
> > > >
> > > > // this needs to be address-dependency-ordered against one of
> > > > // the loads from vma->anon_vma
> > > > struct anon_vma *root = a->root;
> > > > }
> >
> > This reads a little oddly, perhaps because it's a fragment from a larger
> > piece of code.
>
> Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
> a helper used to ensure that a VMA is associated with an anon_vma, and
> then the vma->anon_vma is used further down inside the fault handling
> path. Something like:
>
> do_cow_fault
>   anon_vma_prepare
>     READ_ONCE(vma->anon_vma)
>     barrier()
>   finish_fault
>     do_set_pte
>       page_add_new_anon_rmap
>         folio_add_new_anon_rmap
>           __page_set_anon_rmap
>             [reads vma->anon_vma]
>
> Anyway, I guess I'll follow what Paul and Matthew said and go with
> smp_load_acquire().

I thank you now, and you will thank youself later. ;-)

Thanx, Paul

2023-07-27 16:51:34

by Paul E. McKenney

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 04:07:32PM +0100, Matthew Wilcox wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > Assume that we are holding some kind of lock that ensures that the
> > only possible concurrent update to "vma->anon_vma" is that it changes
> > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> >
> >
> > if (READ_ONCE(vma->anon_vma) != NULL) {
> > // we now know that vma->anon_vma cannot change anymore
> >
> > // access the same memory location again with a plain load
> > struct anon_vma *a = vma->anon_vma;
> >
> > // this needs to be address-dependency-ordered against one of
> > // the loads from vma->anon_vma
> > struct anon_vma *root = a->root;
> > }
> >
> >
> > Is this fine? If it is not fine just because the compiler might
> > reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > would it be fine after adding a barrier() directly after the
> > READ_ONCE()?
> >
> > I initially suggested using READ_ONCE() for this, and then Linus and
> > I tried to reason it out and Linus suggested (if I understood him
> > correctly) that you could make the ugly argument that this works
> > because loads from the same location will not be reordered by the
> > hardware. So on anything other than alpha, we'd still have the
> > required address-dependency ordering because that happens for all
> > loads, even plain loads, while on alpha, the READ_ONCE() includes a
> > memory barrier. But that argument is weirdly reliant on
> > architecture-specific implementation details.
> >
> > The other option is to replace the READ_ONCE() with a
> > smp_load_acquire(), at which point it becomes a lot simpler to show
> > that the code is correct.
>
> Aren't we straining at gnats here? The context of this is handling a
> page fault, and we used to take an entire rwsem for read. I'm having
> a hard time caring about "the extra expense" of an unnecessarily broad
> barrier.
>
> Cost of an L3 cacheline miss is in the thousands of cycles. Cost of a
> barrier is ... tens?

Couldn't agree more!

Thanx, Paul

2023-07-27 17:19:08

by Joel Fernandes

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering




> On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
>>> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
>>>> Hi!
>>>> Patch 1 here is a straightforward fix for a race in per-VMA locking code
>>>> that can lead to use-after-free; I hope we can get this one into
>>>> mainline and stable quickly.
>>>> Patch 2 is a fix for what I believe is a longstanding memory ordering
>>>> issue in how vma->anon_vma is used across the MM subsystem; I expect
>>>> that this one will have to go through a few iterations of review and
>>>> potentially rewrites, because memory ordering is tricky.
>>>> (If someone else wants to take over patch 2, I would be very happy.)
>>>> These patches don't really belong together all that much, I'm just
>>>> sending them as a series because they'd otherwise conflict.
>>>> I am CCing:
>>>> - Suren because patch 1 touches his code
>>>> - Matthew Wilcox because he is also currently working on per-VMA
>>>> locking stuff
>>>> - all the maintainers/reviewers for the Kernel Memory Consistency Model
>>>> so they can help figure out the READ_ONCE() vs smp_load_acquire()
>>>> thing
>>> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>>> For example, given a pointer gp:
>>> p = whichever(gp);
>>> a = 1;
>>> r1 = p->b;
>>> if ((uintptr_t)p & 0x1)
>>> WRITE_ONCE(b, 1);
>>> WRITE_ONCE(c, 1);
>>> Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
>>> "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
>>> ordered after the load from gp (the former due to an address dependency
>>> and the latter due to a (fragile) control dependency). The compiler
>>> is within its rights to reorder the store to "a" to precede the load
>>> from gp. The compiler is forbidden from reordering the store to "c"
>>> with the load from gp (because both are volatile accesses), but the CPU
>>> is completely within its rights to do this reordering.
>>> But if "whichever" is "smp_load_acquire()", all four of the subsequent
>>> memory accesses are ordered after the load from gp.
>>> Similarly, for WRITE_ONCE() and smp_store_release():
>>> p = READ_ONCE(gp);
>>> r1 = READ_ONCE(gi);
>>> r2 = READ_ONCE(gj);
>>> a = 1;
>>> WRITE_ONCE(b, 1);
>>> if (r1 & 0x1)
>>> whichever(p->q, r2);
>>> Again leaving aside the "&" needed by smp_store_release(), if "whichever"
>>> is WRITE_ONCE(), then the load from gp, the load from gi, and the load
>>> from gj are all ordered before the store to p->q (by address dependency,
>>> control dependency, and data dependency, respectively). The store to "a"
>>> can be reordered with the store to p->q by the compiler. The store to
>>> "b" cannot be reordered with the store to p->q by the compiler (again,
>>> both are volatile), but the CPU is free to reorder them, especially when
>>> whichever() is implemented as a conditional store.
>>> But if "whichever" is "smp_store_release()", all five of the earlier
>>> memory accesses are ordered before the store to p->q.
>>> Does that help, or am I missing the point of your question?
>>
>> My main question is how permissible/ugly you think the following use
>> of READ_ONCE() would be, and whether you think it ought to be an
>> smp_load_acquire() instead.
>>
>> Assume that we are holding some kind of lock that ensures that the
>> only possible concurrent update to "vma->anon_vma" is that it changes
>> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>>
>>
>> if (READ_ONCE(vma->anon_vma) != NULL) {
>> // we now know that vma->anon_vma cannot change anymore
>>
>> // access the same memory location again with a plain load
>> struct anon_vma *a = vma->anon_vma;
>>
>> // this needs to be address-dependency-ordered against one of
>> // the loads from vma->anon_vma
>> struct anon_vma *root = a->root;
>> }
>>
>>
>> Is this fine? If it is not fine just because the compiler might
>> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
>> would it be fine after adding a barrier() directly after the
>> READ_ONCE()?
>
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
>
> // Assume *ptr is initially 0 and somebody else writes it to 1
> // concurrently
>
> foo = *ptr;
> bar = READ_ONCE(*ptr);
> baz = *ptr;
>
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
>
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Will, unless I am missing something fundamental, this case is different
though. This case does not care about fewer reads: as long as the first
read is volatile, the subsequent loads (even plain ones) should work
fine, no? I am not seeing how the compiler can screw that up, so please
do enlighten :).

Also, RCU read-side dereferencing follows a similar pattern (as Alan
also pointed out).

Cheers,

- Joel

>
> Will

2023-07-27 17:20:11

by Jann Horn

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 5:44 PM Alan Stern <[email protected]> wrote:
> On Thu, Jul 27, 2023 at 03:57:47PM +0100, Will Deacon wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>
> > > Assume that we are holding some kind of lock that ensures that the
> > > only possible concurrent update to "vma->anon_vma" is that it changes
> > > from a NULL pointer to a non-NULL pointer (using smp_store_release()).
> > >
> > >
> > > if (READ_ONCE(vma->anon_vma) != NULL) {
> > > // we now know that vma->anon_vma cannot change anymore
> > >
> > > // access the same memory location again with a plain load
> > > struct anon_vma *a = vma->anon_vma;
> > >
> > > // this needs to be address-dependency-ordered against one of
> > > // the loads from vma->anon_vma
> > > struct anon_vma *root = a->root;
> > > }
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes, exactly. The READ_ONCE() would be in anon_vma_prepare(), which is
a helper used to ensure that a VMA is associated with an anon_vma, and
then the vma->anon_vma is used further down inside the fault handling
path. Something like:

do_cow_fault
  anon_vma_prepare
    READ_ONCE(vma->anon_vma)
    barrier()
  finish_fault
    do_set_pte
      page_add_new_anon_rmap
        folio_add_new_anon_rmap
          __page_set_anon_rmap
            [reads vma->anon_vma]

Anyway, I guess I'll follow what Paul and Matthew said and go with
smp_load_acquire().

2023-07-27 18:02:22

by Linus Torvalds

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, 27 Jul 2023 at 08:44, Alan Stern <[email protected]> wrote:
>
> This reads a little oddly, perhaps because it's a fragment from a larger
> piece of code.

Yes. As Jann already said, this is basically a preparatory step in a
much longer sequence, and the code simply wants to make sure that any
later code (possibly quite a bit later) will not see a NULL value.

I do believe it happens to be safe to use READ_ONCE() for a number of
reasons, but I will argue that we should *never* use a bare READ_ONCE
if there is any actual real question about what any memory ordering
might be.

And yes, the RCU code obviously does use READ_ONCE(), so that's why I
say "bare" - the RCU code wraps it in helper accessors with strict
semantics.

The reason I think it would be technically *safe* to do here is that

(a) code like this:

if (READ_ONCE(..))

may end up having the conditional speculated by the CPU, and have
the actual read done without any ordering by the CPU, but we do have
*one* guarantee: the memory load instruction will be before the
conditional branch (or whatever) in the instruction stream.

So the code may be *executed* out of order, but we know the
memory load can not be moved after the conditional (whatever form that
conditional then takes) by a compiler.

(We do have various barriers like "rcu_read_unlock()" that
guarantee that the READ_ONCE() couldn't be moved much later even in
the absence of the conditional, but we can ignore that).

(b) the later use of the anon_vma (that depends on the value being
stable) is *so* much later, and behind things that the compiler sees
as barriers (again, that rcu_read_unlock() acts at a minimum as a
compiler barrier) that any subsequent use would not have its load
moved up to before the READ_ONCE() in the instruction stream.

Again, this is purely a "location in the instruction stream"
ordering argument, not a "execution order" ordering argument.

And finally

(c) I do think that all our architectures have the rule that when
you read from the *same* memory location from the same CPU, the
accesses are ordered.

Now, I didn't actually find (c) spelled out anywhere, but even alpha -
the worst memory ordering ever devised - had that particular rule (the
alpha architecture manual calls it "Location Access Order").

Now, with that said, I did argue to Jann that we should use
smp_store_release and smp_load_acquire pairs here, because honestly,
the above (a)-(c) argument is too damn subtle for me, and I don't
think this particular use requires it.

With smp_load_acquire(), you don't need any of the (a)-(c) rules. The
rule is "the load is done before any subsequent memory operations".
End of story.

So while I do think READ_ONCE() is sufficient here, I actually think
that once you start going down that path you can argue that
READ_ONCE() is actually entirely unnecessary, because we also have
other ordering rules that mean that the compiler can't really do
anything else even *without* the READ_ONCE().

End result: I can trivially extend the (a)-(c) to say "even
READ_ONCE() isn't strictly necessary here, because even any access
tearing - which won't happen anyway - wouldn't actually change the
result".

So if we want to make it *obvious* that it's safe, we should use
smp_load_acquire().

And if we do find that there are situations where we care so much
about the (generally fairly cheap) possible additional
synchronization, and we really want to use READ_ONCE() rather than
smp_load_acquire(), I'd rather try to wrap a READ_ONCE in some kind of
better abstraction (maybe make the conditional part of the operation,
and make it clear that you are doing a "one-time test which depends on
the same-location rule").
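
Something like this, purely as a sketch (this is not an existing
helper):

        /*
         * One-time NULL check that documents the reliance on the
         * same-location rule for any later plain loads of 'x'.
         */
        #define test_once_nonnull(x)    (READ_ONCE(x) != NULL)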

Do we even have the same-location rule in the LKMM?

Linus

2023-07-27 18:16:06

by Alan Stern

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 10:11:29AM -0700, Linus Torvalds wrote:
> On Thu, 27 Jul 2023 at 08:44, Alan Stern <[email protected]> wrote:
> >
> > This reads a little oddly, perhaps because it's a fragment from a larger
> > piece of code.
>
> Yes. As Jann already said, this is basically a preparatory step in a
> much longer sequence, and the code simply wants to make sure that any
> later code (possibly quite a bit later) will not see a NULL value.

...

> Do we even have the same-location rule in the LKMM?

Yes. The comment in the source file calls it "Sequential Consistency
Per Variable", under the category of "Fundamental coherence ordering".
It applies even to plain accesses, not just to READ_ONCE or stronger.

But in the presence of data races (as in the example that Will posted
earlier), all bets are off. So if you want to use a plain access rather
than READ_ONCE, you need to be certain that it won't race with anything.
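
The corresponding litmus test ships with the kernel; from memory it is
along these lines (see
tools/memory-model/litmus-tests/CoRR+poonceonce+Once.litmus, details
may differ):

C CoRR+poonceonce+Once

{}

P0(int *x)
{
        WRITE_ONCE(*x, 1);
}

P1(int *x)
{
        int r0;
        int r1;

        r0 = READ_ONCE(*x);
        r1 = READ_ONCE(*x);
}

exists (1:r0=1 /\ 1:r1=0)

The "exists" outcome, in which the second read observes an older value
than the first, is forbidden.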

Alan Stern

2023-07-27 18:24:17

by Linus Torvalds

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, 27 Jul 2023 at 10:41, Alan Stern <[email protected]> wrote:
>
> But in the presence of data races (as in the example that Will posted
> earlier), all bets are off. So if you want to use a plain access rather
> than READ_ONCE, you need to be certain that it won't race with anything.

So in this case, the initial NULL check really is just checking for
"has the smp_store_release() happened". The reason even tearing
wouldn't matter is that seeing *any* value other than all-zeroes (aka
NULL) is already sufficient information.

Because once the smp_store_release() has been seen by this CPU, the
data race no longer exists, and the value is stable.

Which is why I argue that even without READ_ONCE() the code would
*work* due to all the circumstances around it (one of them is that we
just took a lock, after doing an optimistic check that really has no
semantic effect at all except for the "don't even take the lock if it
looks like things are going to fail").

But if we want to have the code be obvious, and not have to refer to
those kinds of arguments, I think smp_load_acquire() is the only
actual "obvious" thing to use. At that point it's no longer some chain
of "because of X, Y and Z".

Linus

2023-07-27 19:43:34

by Nadav Amit

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering



> On Jul 27, 2023, at 7:57 AM, Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>> On Thu, Jul 27, 2023 at 1:19 AM Paul E. McKenney <[email protected]> wrote:
>>>
>>> On Wed, Jul 26, 2023 at 11:41:01PM +0200, Jann Horn wrote:
>>>> Hi!
>>>>
>>>> Patch 1 here is a straightforward fix for a race in per-VMA locking code
>>>> that can lead to use-after-free; I hope we can get this one into
>>>> mainline and stable quickly.
>>>>
>>>> Patch 2 is a fix for what I believe is a longstanding memory ordering
>>>> issue in how vma->anon_vma is used across the MM subsystem; I expect
>>>> that this one will have to go through a few iterations of review and
>>>> potentially rewrites, because memory ordering is tricky.
>>>> (If someone else wants to take over patch 2, I would be very happy.)
>>>>
>>>> These patches don't really belong together all that much, I'm just
>>>> sending them as a series because they'd otherwise conflict.
>>>>
>>>> I am CCing:
>>>>
>>>> - Suren because patch 1 touches his code
>>>> - Matthew Wilcox because he is also currently working on per-VMA
>>>> locking stuff
>>>> - all the maintainers/reviewers for the Kernel Memory Consistency Model
>>>> so they can help figure out the READ_ONCE() vs smp_load_acquire()
>>>> thing
>>>
>>> READ_ONCE() has weaker ordering properties than smp_load_acquire().
>>>
>>> For example, given a pointer gp:
>>>
>>> p = whichever(gp);
>>> a = 1;
>>> r1 = p->b;
>>> if ((uintptr_t)p & 0x1)
>>> WRITE_ONCE(b, 1);
>>> WRITE_ONCE(c, 1);
>>>
>>> Leaving aside the "&" needed by smp_load_acquire(), if "whichever" is
>>> "READ_ONCE", then the load from p->b and the WRITE_ONCE() to "b" are
>>> ordered after the load from gp (the former due to an address dependency
>>> and the latter due to a (fragile) control dependency). The compiler
>>> is within its rights to reorder the store to "a" to precede the load
>>> from gp. The compiler is forbidden from reordering the store to "c"
>>> with the load from gp (because both are volatile accesses), but the CPU
>>> is completely within its rights to do this reordering.
>>>
>>> But if "whichever" is "smp_load_acquire()", all four of the subsequent
>>> memory accesses are ordered after the load from gp.
>>>
>>> Similarly, for WRITE_ONCE() and smp_store_release():
>>>
>>> p = READ_ONCE(gp);
>>> r1 = READ_ONCE(gi);
>>> r2 = READ_ONCE(gj);
>>> a = 1;
>>> WRITE_ONCE(b, 1);
>>> if (r1 & 0x1)
>>> whichever(p->q, r2);
>>>
>>> Again leaving aside the "&" needed by smp_store_release(), if "whichever"
>>> is WRITE_ONCE(), then the load from gp, the load from gi, and the load
>>> from gj are all ordered before the store to p->q (by address dependency,
>>> control dependency, and data dependency, respectively). The store to "a"
>>> can be reordered with the store to p->q by the compiler. The store to
>>> "b" cannot be reordered with the store to p->q by the compiler (again,
>>> both are volatile), but the CPU is free to reorder them, especially when
>>> whichever() is implemented as a conditional store.
>>>
>>> But if "whichever" is "smp_store_release()", all five of the earlier
>>> memory accesses are ordered before the store to p->q.
>>>
>>> Does that help, or am I missing the point of your question?
>>
>> My main question is how permissible/ugly you think the following use
>> of READ_ONCE() would be, and whether you think it ought to be an
>> smp_load_acquire() instead.
>>
>> Assume that we are holding some kind of lock that ensures that the
>> only possible concurrent update to "vma->anon_vma" is that it changes
>> from a NULL pointer to a non-NULL pointer (using smp_store_release()).
>>
>>
>> if (READ_ONCE(vma->anon_vma) != NULL) {
>> // we now know that vma->anon_vma cannot change anymore
>>
>> // access the same memory location again with a plain load
>> struct anon_vma *a = vma->anon_vma;
>>
>> // this needs to be address-dependency-ordered against one of
>> // the loads from vma->anon_vma
>> struct anon_vma *root = a->root;
>> }
>>
>>
>> Is this fine? If it is not fine just because the compiler might
>> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
>> would it be fine after adding a barrier() directly after the
>> READ_ONCE()?
>
> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> as I've run into cases where you have sequences such as:
>
> // Assume *ptr is initially 0 and somebody else writes it to 1
> // concurrently
>
> foo = *ptr;
> bar = READ_ONCE(*ptr);
> baz = *ptr;
>
> and you can get foo == baz == 0 but bar == 1 because the compiler only
> ends up reading from memory twice.
>
> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> when dereferencing pointer to pte table"), which was very unpleasant to
> debug.

Interesting. I wonder if you considered adding to READ_ONCE() something
like:

asm volatile("" : "+g" (x) );

So later loads (such as baz = *ptr) would reload the updated value.


2023-07-27 19:50:49

by Linus Torvalds

Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, 27 Jul 2023 at 12:10, Nadav Amit <[email protected]> wrote:
>
> Interesting. I wonder if you considered adding to READ_ONCE() something
> like:
>
> asm volatile("" : "+g" (x) );
>
> So later loads (such as baz = *ptr) would reload the updated value.

Not necessarily a bad idea. Although I suspect you'd want to add
*two* of them - on either side - to make sure any previous loads
wouldn't be moved around it either.

Linus
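
Put together, the two-sided variant might look roughly like this (a sketch
of a hypothetical macro, not the actual rwonce.h implementation):

	#define READ_ONCE_SYNCED(x)					\
	({								\
		__typeof__(x) ___val;					\
		/* earlier plain loads of x cannot sink below here */	\
		asm volatile("" : "+g" (x));				\
		___val = *(volatile __typeof__(x) *)&(x);		\
		/* later plain loads of x must re-read memory */	\
		asm volatile("" : "+g" (x));				\
		___val;							\
	})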

2023-07-27 21:23:06

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering



> On Jul 27, 2023, at 12:39 PM, Linus Torvalds <[email protected]> wrote:
>
> On Thu, 27 Jul 2023 at 12:10, Nadav Amit <[email protected]> wrote:
>>
>> Interesting. I wonder if you considered adding to READ_ONCE() something
>> like:
>>
>> asm volatile("" : "+g" (x) );
>>
>> So later loads (such as baz = *ptr) would reload the updated value.
>
> Not necessarily a bad idea. Although I suspect you'd want to add
> *two* of them - on either side - to make sure any previous loads
> wouldn't be moved around it either.

You are right, two are needed.

I’ll give it a shot and see if I see changes to the binary.


2023-07-28 10:29:52

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering



> On Jul 27, 2023, at 1:11 PM, Nadav Amit <[email protected]> wrote:
>
>
>
>> On Jul 27, 2023, at 12:39 PM, Linus Torvalds <[email protected]> wrote:
>>
>> On Thu, 27 Jul 2023 at 12:10, Nadav Amit <[email protected]> wrote:
>>>
>>> Interesting. I wonder if you considered adding to READ_ONCE() something
>>> like:
>>>
>>> asm volatile("" : "+g" (x) );
>>>
>>> So later loads (such as baz = *ptr) would reload the updated value.
>>
>> Not necessarily a bad idea. Although I suspect you'd want to add
>> *two* of them - on either side - to make sure any previous loads
>> wouldn't be moved around it either.
>
> You are right, two are needed.
>
> I’ll give it a shot and see if I see changes to the binary.

Just for the record (so nobody will make my mistakes):

I implemented it, but it works poorly. It appears to communicate the
constraints but the generated code is much worse.

The problem is that when the GCC inline asm uses a memory operand
(e.g. “+m”), it computes the address up front (an LEA before the MOV) and
allocates a register to hold that address. Using “+g” also causes the
compiler to allocate a register.

-- >8 --

---
include/asm-generic/rwonce.h | 32 ++++++++++++++++++++++++++++----
1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/include/asm-generic/rwonce.h b/include/asm-generic/rwonce.h
index 8d0a6280e982..c6a2f8e3013c 100644
--- a/include/asm-generic/rwonce.h
+++ b/include/asm-generic/rwonce.h
@@ -44,10 +44,34 @@
#define __READ_ONCE(x) (*(const volatile __unqual_scalar_typeof(x) *)&(x))
#endif

-#define READ_ONCE(x) \
-({ \
- compiletime_assert_rwonce_type(x); \
- __READ_ONCE(x); \
+#define as_scalar(x) \
+ __builtin_choose_expr(sizeof(x) == sizeof(char), \
+ (__force char *)&(x), \
+ __builtin_choose_expr(sizeof(x) == sizeof(short), \
+ (__force short *)&(x), \
+ __builtin_choose_expr(sizeof(x) == sizeof(int), \
+ (__force int *)&(x), \
+ __builtin_choose_expr(sizeof(x) == sizeof(long), \
+ (__force long *)&(x), \
+ __builtin_choose_expr(sizeof(x) == sizeof(long long), \
+ (__force long long *)&(x), \
+ (void*)0)))))
+
+#define READ_ONCE(x) \
+({ \
+ typeof(x) * px = &(x); \
+ union { \
+ typeof(x) __orig; \
+ typeof(*as_scalar(x)) __alias; \
+ } __u; \
+ \
+ compiletime_assert_rwonce_type(x); \
+ asm volatile ("" : \
+ "+g" (*(__force typeof(*as_scalar(x)) *)(px))); \
+ __u.__alias = __READ_ONCE(*as_scalar(*px)); \
+ asm volatile ("" : \
+ "+g" (*(__force typeof(*as_scalar(x)) *)(px))); \
+ __u.__orig; \
})

#define __WRITE_ONCE(x, val) \
--
2.34.1
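
For what it's worth, the codegen concern can be seen even in isolation (an
illustrative fragment, not from the thread): with "+m" the compiler has to
materialize the operand's address, and with "+g" it may pull the value
into a register and write it back, so either constraint can add
instructions and register pressure at every READ_ONCE() site.

	static inline void forget_m(int *p)
	{
		/* forces an addressable memory operand (address is computed) */
		asm volatile("" : "+m" (*p));
	}

	static inline void forget_g(int *p)
	{
		/* may load *p into a register and store it back afterwards */
		asm volatile("" : "+g" (*p));
	}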



2023-07-28 13:27:57

by Will Deacon

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
> > On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> >> if (READ_ONCE(vma->anon_vma) != NULL) {
> >> // we now know that vma->anon_vma cannot change anymore
> >>
> >> // access the same memory location again with a plain load
> >> struct anon_vma *a = vma->anon_vma;
> >>
> >> // this needs to be address-dependency-ordered against one of
> >> // the loads from vma->anon_vma
> >> struct anon_vma *root = a->root;
> >> }
> >>
> >>
> >> Is this fine? If it is not fine just because the compiler might
> >> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> >> would it be fine after adding a barrier() directly after the
> >> READ_ONCE()?
> >
> > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > as I've run into cases where you have sequences such as:
> >
> > // Assume *ptr is initially 0 and somebody else writes it to 1
> > // concurrently
> >
> > foo = *ptr;
> > bar = READ_ONCE(*ptr);
> > baz = *ptr;
> >
> > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > ends up reading from memory twice.
> >
> > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > when dereferencing pointer to pte table"), which was very unpleasant to
> > debug.
>
> Will, Unless I am missing something fundamental, this case is different though.
> This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
> should work fine, no?
> I am not seeing how the compiler can screw that up, so please do enlighten :).

I guess the thing I'm worried about is if there is some previous read of
'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
result around in a register. In that case, 'a' could be NULL, even if
the READ_ONCE(vma->anon_vma) returned non-NULL.

The crux of the issue is that the compiler can break read-after-read
ordering if you don't use READ_ONCE() consistently. Sadly, judging by
the other part of the thread from Nadav, it's fiddly to fix this without
wrecking the codegen.

Will
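
To make the failure mode concrete, here is a sketch of the pattern Will is
worried about (the function and variable names are made up for
illustration):

	static struct anon_vma *lookup_unsafe(struct vm_area_struct *vma)
	{
		struct anon_vma *early = vma->anon_vma;	/* plain load, may see NULL */

		if (early)
			return early;

		if (READ_ONCE(vma->anon_vma) != NULL) {
			/* The compiler may reuse 'early' from a register for
			 * this plain load, so 'a' can be NULL even though the
			 * READ_ONCE() above returned non-NULL. */
			struct anon_vma *a = vma->anon_vma;

			return a;
		}
		return NULL;
	}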

2023-07-28 18:25:15

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Fri, Jul 28, 2023 at 1:51 PM Alan Stern <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 01:35:43PM -0400, Joel Fernandes wrote:
> > On Fri, Jul 28, 2023 at 8:44 AM Will Deacon <[email protected]> wrote:
> > >
> > > On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
> > > > > On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> > > > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > > > >> if (READ_ONCE(vma->anon_vma) != NULL) {
> > > > >> // we now know that vma->anon_vma cannot change anymore
> > > > >>
> > > > >> // access the same memory location again with a plain load
> > > > >> struct anon_vma *a = vma->anon_vma;
> > > > >>
> > > > >> // this needs to be address-dependency-ordered against one of
> > > > >> // the loads from vma->anon_vma
> > > > >> struct anon_vma *root = a->root;
> > > > >> }
> > > > >>
> > > > >>
> > > > >> Is this fine? If it is not fine just because the compiler might
> > > > >> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > > > >> would it be fine after adding a barrier() directly after the
> > > > >> READ_ONCE()?
> > > > >
> > > > > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > > > > as I've run into cases where you have sequences such as:
> > > > >
> > > > > // Assume *ptr is initially 0 and somebody else writes it to 1
> > > > > // concurrently
> > > > >
> > > > > foo = *ptr;
> > > > > bar = READ_ONCE(*ptr);
> > > > > baz = *ptr;
> > > > >
> > > > > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > > > > ends up reading from memory twice.
> > > > >
> > > > > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > > > > when dereferencing pointer to pte table"), which was very unpleasant to
> > > > > debug.
> > > >
> > > > Will, Unless I am missing something fundamental, this case is different though.
> > > > This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
> > > > should work fine, no?
> > > > I am not seeing how the compiler can screw that up, so please do enlighten :).
> > >
> > > I guess the thing I'm worried about is if there is some previous read of
> > > 'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
> > > result around in a register. In that case, 'a' could be NULL, even if
> > > the READ_ONCE(vma->anon_vma) returned non-NULL.
> >
> > If I can be a bit brave enough to say -- that appears to be a compiler
> > bug to me. It seems that the compiler in such an instance violates the
> > "Sequential Consistency Per Variable" rule? I mean if it can't even
> > keep SCPV true for a same memory-location load (plain or not) for a
> > sequence of code, how can it expect the hardware to.
>
> It's not a compiler bug. In this example, some other thread performs a
> write that changes vma->anon_vma from NULL to non-NULL. This write
> races with the plain reads, and compilers are not required to obey the
> "Sequential Consistency Per Variable" rule (or indeed, any rule) when
> there is a data race.

So you're saying the following code behavior is OK?

/* Say anon_vma can only ever transition from NULL to non-NULL values */
a = vma->anon_vma; // Reads NULL
b = READ_ONCE(vma->anon_vma); // Reads non-NULL
c = vma->anon_vma; // Reads NULL!!!
if (b) {
c->some_attribute++; // Oopsie
}

thanks,

- Joel
(On the road now, so my replies will be slightly delayed for a few hours)

2023-07-28 18:25:40

by Paul E. McKenney

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Fri, Jul 28, 2023 at 02:03:09PM -0400, Joel Fernandes wrote:
> On Fri, Jul 28, 2023 at 1:51 PM Alan Stern <[email protected]> wrote:
> >
> > On Fri, Jul 28, 2023 at 01:35:43PM -0400, Joel Fernandes wrote:
> > > On Fri, Jul 28, 2023 at 8:44 AM Will Deacon <[email protected]> wrote:
> > > >
> > > > On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
> > > > > > On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> > > > > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > > > > >> if (READ_ONCE(vma->anon_vma) != NULL) {
> > > > > >> // we now know that vma->anon_vma cannot change anymore
> > > > > >>
> > > > > >> // access the same memory location again with a plain load
> > > > > >> struct anon_vma *a = vma->anon_vma;
> > > > > >>
> > > > > >> // this needs to be address-dependency-ordered against one of
> > > > > >> // the loads from vma->anon_vma
> > > > > >> struct anon_vma *root = a->root;
> > > > > >> }
> > > > > >>
> > > > > >>
> > > > > >> Is this fine? If it is not fine just because the compiler might
> > > > > >> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > > > > >> would it be fine after adding a barrier() directly after the
> > > > > >> READ_ONCE()?
> > > > > >
> > > > > > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > > > > > as I've run into cases where you have sequences such as:
> > > > > >
> > > > > > // Assume *ptr is initially 0 and somebody else writes it to 1
> > > > > > // concurrently
> > > > > >
> > > > > > foo = *ptr;
> > > > > > bar = READ_ONCE(*ptr);
> > > > > > baz = *ptr;
> > > > > >
> > > > > > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > > > > > ends up reading from memory twice.
> > > > > >
> > > > > > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > > > > > when dereferencing pointer to pte table"), which was very unpleasant to
> > > > > > debug.
> > > > >
> > > > > Will, Unless I am missing something fundamental, this case is different though.
> > > > > This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
> > > > > should work fine, no?
> > > > > I am not seeing how the compiler can screw that up, so please do enlighten :).
> > > >
> > > > I guess the thing I'm worried about is if there is some previous read of
> > > > 'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
> > > > result around in a register. In that case, 'a' could be NULL, even if
> > > > the READ_ONCE(vma->anon_vma) returned non-NULL.
> > >
> > > If I can be a bit brave enough to say -- that appears to be a compiler
> > > bug to me. It seems that the compiler in such an instance violates the
> > > "Sequential Consistency Per Variable" rule? I mean if it can't even
> > > keep SCPV true for a same memory-location load (plain or not) for a
> > > sequence of code, how can it expect the hardware to.
> >
> > It's not a compiler bug. In this example, some other thread performs a
> > write that changes vma->anon_vma from NULL to non-NULL. This write
> > races with the plain reads, and compilers are not required to obey the
> > "Sequential Consistency Per Variable" rule (or indeed, any rule) when
> > there is a data race.
>
> So you're saying the following code behavior is OK?
>
> /* Say anon_vma can only ever transition from NULL to non-NULL values */
> a = vma->anon_vma; // Reads NULL
> b = READ_ONCE(vma->anon_vma); // Reads non-NULL
> c = vma->anon_vma; // Reads NULL!!!
> if (b) {
> c->some_attribute++; // Oopsie
> }

Is there some way to obtain (a && !b) that does not involve a data race,
and thus carte blanche for the compiler to do whatever it pleases?
I am not seeing one.

What am I missing?

Thanx, Paul

2023-07-28 19:05:33

by Alan Stern

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Fri, Jul 28, 2023 at 01:35:43PM -0400, Joel Fernandes wrote:
> On Fri, Jul 28, 2023 at 8:44 AM Will Deacon <[email protected]> wrote:
> >
> > On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
> > > > On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> > > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > > >> if (READ_ONCE(vma->anon_vma) != NULL) {
> > > >> // we now know that vma->anon_vma cannot change anymore
> > > >>
> > > >> // access the same memory location again with a plain load
> > > >> struct anon_vma *a = vma->anon_vma;
> > > >>
> > > >> // this needs to be address-dependency-ordered against one of
> > > >> // the loads from vma->anon_vma
> > > >> struct anon_vma *root = a->root;
> > > >> }
> > > >>
> > > >>
> > > >> Is this fine? If it is not fine just because the compiler might
> > > >> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > > >> would it be fine after adding a barrier() directly after the
> > > >> READ_ONCE()?
> > > >
> > > > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > > > as I've run into cases where you have sequences such as:
> > > >
> > > > // Assume *ptr is initially 0 and somebody else writes it to 1
> > > > // concurrently
> > > >
> > > > foo = *ptr;
> > > > bar = READ_ONCE(*ptr);
> > > > baz = *ptr;
> > > >
> > > > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > > > ends up reading from memory twice.
> > > >
> > > > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > > > when dereferencing pointer to pte table"), which was very unpleasant to
> > > > debug.
> > >
> > > Will, Unless I am missing something fundamental, this case is different though.
> > > This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
> > > should work fine, no?
> > > I am not seeing how the compiler can screw that up, so please do enlighten :).
> >
> > I guess the thing I'm worried about is if there is some previous read of
> > 'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
> > result around in a register. In that case, 'a' could be NULL, even if
> > the READ_ONCE(vma->anon_vma) returned non-NULL.
>
> If I can be a bit brave enough to say -- that appears to be a compiler
> bug to me. It seems that the compiler in such an instance violates the
> "Sequential Consistency Per Variable" rule? I mean if it can't even
> keep SCPV true for a same memory-location load (plain or not) for a
> sequence of code, how can it expect the hardware to.

It's not a compiler bug. In this example, some other thread performs a
write that changes vma->anon_vma from NULL to non-NULL. This write
races with the plain reads, and compilers are not required to obey the
"Sequential Consistency Per Variable" rule (or indeed, any rule) when
there is a data race.

Alan Stern
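
Following that logic, the obviously safe shape (a sketch, reusing the
fields from Jann's example) is to read the pointer exactly once with
READ_ONCE() and dereference only that result, so that no plain load of
vma->anon_vma can race with the concurrent NULL-to-non-NULL store:

	struct anon_vma *a = READ_ONCE(vma->anon_vma);

	if (a != NULL) {
		/* address dependency orders this load after the READ_ONCE() */
		struct anon_vma *root = a->root;
	}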

2023-07-28 20:02:21

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering

On Fri, Jul 28, 2023 at 8:44 AM Will Deacon <[email protected]> wrote:
>
> On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
> > > On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
> > > On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
> > >> if (READ_ONCE(vma->anon_vma) != NULL) {
> > >> // we now know that vma->anon_vma cannot change anymore
> > >>
> > >> // access the same memory location again with a plain load
> > >> struct anon_vma *a = vma->anon_vma;
> > >>
> > >> // this needs to be address-dependency-ordered against one of
> > >> // the loads from vma->anon_vma
> > >> struct anon_vma *root = a->root;
> > >> }
> > >>
> > >>
> > >> Is this fine? If it is not fine just because the compiler might
> > >> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
> > >> would it be fine after adding a barrier() directly after the
> > >> READ_ONCE()?
> > >
> > > I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
> > > as I've run into cases where you have sequences such as:
> > >
> > > // Assume *ptr is initially 0 and somebody else writes it to 1
> > > // concurrently
> > >
> > > foo = *ptr;
> > > bar = READ_ONCE(*ptr);
> > > baz = *ptr;
> > >
> > > and you can get foo == baz == 0 but bar == 1 because the compiler only
> > > ends up reading from memory twice.
> > >
> > > That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
> > > when dereferencing pointer to pte table"), which was very unpleasant to
> > > debug.
> >
> > Will, Unless I am missing something fundamental, this case is different though.
> > This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
> > should work fine, no?
> > I am not seeing how the compiler can screw that up, so please do enlighten :).
>
> I guess the thing I'm worried about is if there is some previous read of
> 'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
> result around in a register. In that case, 'a' could be NULL, even if
> the READ_ONCE(vma->anon_vma) returned non-NULL.

If I can be a bit brave enough to say -- that appears to be a compiler
bug to me. It seems that the compiler in such an instance violates the
"Sequential Consistency Per Variable" rule? I mean if it can't even
keep SCPV true for a same memory-location load (plain or not) for a
sequence of code, how can it expect the hardware to.

In other words, with that kind of caching, the value of the variable
goes back in time, which would be tough luck even for a fully ordered,
sequentially consistent processor!!!

> The crux of the issue is that the compiler can break read-after-read
> ordering if you don't use READ_ONCE() consistently. Sadly, judging by
> the other part of the thread from Nadav, it's fiddly to fix this without
> wrecking the codegen.

Right. Thanks to you and others for sharing your informative
perspective as always,

- Joel

2023-07-28 20:36:35

by Joel Fernandes

[permalink] [raw]
Subject: Re: [PATCH 0/2] fix vma->anon_vma check for per-VMA locking; fix anon_vma memory ordering



> On Jul 28, 2023, at 2:18 PM, Paul E. McKenney <[email protected]> wrote:
>
> On Fri, Jul 28, 2023 at 02:03:09PM -0400, Joel Fernandes wrote:
>>> On Fri, Jul 28, 2023 at 1:51 PM Alan Stern <[email protected]> wrote:
>>>
>>> On Fri, Jul 28, 2023 at 01:35:43PM -0400, Joel Fernandes wrote:
>>>> On Fri, Jul 28, 2023 at 8:44 AM Will Deacon <[email protected]> wrote:
>>>>>
>>>>> On Thu, Jul 27, 2023 at 12:34:44PM -0400, Joel Fernandes wrote:
>>>>>>> On Jul 27, 2023, at 10:57 AM, Will Deacon <[email protected]> wrote:
>>>>>>> On Thu, Jul 27, 2023 at 04:39:34PM +0200, Jann Horn wrote:
>>>>>>>> if (READ_ONCE(vma->anon_vma) != NULL) {
>>>>>>>> // we now know that vma->anon_vma cannot change anymore
>>>>>>>>
>>>>>>>> // access the same memory location again with a plain load
>>>>>>>> struct anon_vma *a = vma->anon_vma;
>>>>>>>>
>>>>>>>> // this needs to be address-dependency-ordered against one of
>>>>>>>> // the loads from vma->anon_vma
>>>>>>>> struct anon_vma *root = a->root;
>>>>>>>> }
>>>>>>>>
>>>>>>>>
>>>>>>>> Is this fine? If it is not fine just because the compiler might
>>>>>>>> reorder the plain load of vma->anon_vma before the READ_ONCE() load,
>>>>>>>> would it be fine after adding a barrier() directly after the
>>>>>>>> READ_ONCE()?
>>>>>>>
>>>>>>> I'm _very_ wary of mixing READ_ONCE() and plain loads to the same variable,
>>>>>>> as I've run into cases where you have sequences such as:
>>>>>>>
>>>>>>> // Assume *ptr is initially 0 and somebody else writes it to 1
>>>>>>> // concurrently
>>>>>>>
>>>>>>> foo = *ptr;
>>>>>>> bar = READ_ONCE(*ptr);
>>>>>>> baz = *ptr;
>>>>>>>
>>>>>>> and you can get foo == baz == 0 but bar == 1 because the compiler only
>>>>>>> ends up reading from memory twice.
>>>>>>>
>>>>>>> That was the root cause behind f069faba6887 ("arm64: mm: Use READ_ONCE
>>>>>>> when dereferencing pointer to pte table"), which was very unpleasant to
>>>>>>> debug.
>>>>>>
>>>>>> Will, Unless I am missing something fundamental, this case is different though.
>>>>>> This case does not care about fewer reads. As long as the first read is volatile, the subsequent loads (even plain)
>>>>>> should work fine, no?
>>>>>> I am not seeing how the compiler can screw that up, so please do enlighten :).
>>>>>
>>>>> I guess the thing I'm worried about is if there is some previous read of
>>>>> 'vma->anon_vma' which didn't use READ_ONCE() and the compiler kept the
>>>>> result around in a register. In that case, 'a' could be NULL, even if
>>>>> the READ_ONCE(vma->anon_vma) returned non-NULL.
>>>>
>>>> If I can be a bit brave enough to say -- that appears to be a compiler
>>>> bug to me. It seems that the compiler in such an instance violates the
>>>> "Sequential Consistency Per Variable" rule? I mean if it can't even
>>>> keep SCPV true for a same memory-location load (plain or not) for a
>>>> sequence of code, how can it expect the hardware to.
>>>
>>> It's not a compiler bug. In this example, some other thread performs a
>>> write that changes vma->anon_vma from NULL to non-NULL. This write
>>> races with the plain reads, and compilers are not required to obey the
>>> "Sequential Consistency Per Variable" rule (or indeed, any rule) when
>>> there is a data race.
>>
>> So you're saying the following code behavior is OK?
>>
>> /* Say anon_vma can only ever transition from NULL to non-NULL values */
>> a = vma->anon_vma; // Reads NULL
>> b = READ_ONCE(vma->anon_vma); // Reads non-NULL
>> c = vma->anon_vma; // Reads NULL!!!
>> if (b) {
>> c->some_attribute++; // Oopsie
>> }
>
> Is there some way to obtain (a && !b) that does not involve a data race,
> and thus carte blanche for the compiler to do whatever it pleases?
> I am not seeing one.
>
> What am I missing?

Probably nothing. I think I was briefly living in a fantasy world where I
expected predictable compiler behavior for same-memory-location accesses
amidst data races. It is good to come back to reality.

thanks,

- Joel

>
> Thanx, Paul