Users of UFFDIO_CONTINUE may reasonably assume that a write memory
barrier is included as part of UFFDIO_CONTINUE. That is, a user may
(mistakenly) believe that all writes it has done to a page that it is
now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone
subsequently reading the page through the newly mapped virtual memory
region.
Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all
that is necessary. While we're at it, optimize the smp_wmb() that is
already incidentally present for the HugeTLB case.
Documentation doesn't specify if the kernel does a wmb(), so it's not
wrong not to include it. But by not including it, we are making is easy
for a user to have a very hard-to-detect bug. Include it now to be safe.
A user that decides to update the contents of the page in one thread and
UFFDIO_CONTINUE that page in another must already take additional steps
to synchronize properly.
Signed-off-by: James Houghton <[email protected]>
---
I'm not sure if this patch should be merged. I think it's the right
thing to do, as it is very easy for a user to get this wrong. (I have
been using UFFDIO_CONTINUE for >2 years and only realized this problem
recently.) Given that it's not a "bug" strictly speaking, even if this
patch is a good idea, I'm unsure if it needs to be backported.
This quirk has existed since minor fault support was added for shmem[1].
I've tried to see if I can legitimately get a user to read stale data,
and a few attempts with this test[2] have been unsuccessful.
[1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
[2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
mm/hugetlb.c | 15 +++++++++------
mm/userfaultfd.c | 18 ++++++++++++++++++
2 files changed, 27 insertions(+), 6 deletions(-)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bb17e5c22759..533bf6b2d94d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
}
}
- /*
- * The memory barrier inside __folio_mark_uptodate makes sure that
- * preceding stores to the page contents become visible before
- * the set_pte_at() write.
- */
- __folio_mark_uptodate(folio);
+ if (!is_continue) {
+ /*
+ * The memory barrier inside __folio_mark_uptodate makes sure
+ * that preceding stores to the page contents become visible
+ * before the set_pte_at() write.
+ */
+ __folio_mark_uptodate(folio);
+ } else
+ WARN_ON_ONCE(!folio_test_uptodate(folio));
/* Add shared, newly allocated pages to the page cache. */
if (vm_shared && !is_continue) {
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 503ea77c81aa..d515b640ca48 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
goto out_unlock;
}
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ /* See the comment in mfill_atomic. */
+ smp_wmb();
+
while (src_addr < src_start + len) {
BUG_ON(dst_addr >= dst_start + len);
@@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
goto out_unlock;
+ if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
+ /*
+ * A caller might reasonably assume that UFFDIO_CONTINUE
+ * contains a wmb() to ensure that any writes to the
+ * about-to-be-mapped page by the thread doing the
+ * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
+ * reads of the page through the newly mapped address.
+ *
+ * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
+ * page. We can do the wmb() now for CONTINUE as the user has
+ * already prepared the page contents.
+ */
+ smp_wmb();
+
while (src_addr < src_start + len) {
pmd_t dst_pmdval;
base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf
--
2.44.0.278.ge034bb2e1d-goog
On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> Users of UFFDIO_CONTINUE may reasonably assume that a write memory
> barrier is included as part of UFFDIO_CONTINUE. That is, a user may
> (mistakenly) believe that all writes it has done to a page that it is
> now UFFDIO_CONTINUE'ing are guaranteed to be visible to anyone
> subsequently reading the page through the newly mapped virtual memory
> region.
>
> Include only a single smp_wmb() for each UFFDIO_CONTINUE, as that is all
> that is necessary. While we're at it, optimize the smp_wmb() that is
> already incidentally present for the HugeTLB case.
>
> Documentation doesn't specify if the kernel does a wmb(), so it's not
> wrong not to include it. But by not including it, we are making is easy
> for a user to have a very hard-to-detect bug. Include it now to be safe.
>
> A user that decides to update the contents of the page in one thread and
> UFFDIO_CONTINUE that page in another must already take additional steps
> to synchronize properly.
>
> Signed-off-by: James Houghton <[email protected]>
> ---
>
> I'm not sure if this patch should be merged. I think it's the right
> thing to do, as it is very easy for a user to get this wrong. (I have
> been using UFFDIO_CONTINUE for >2 years and only realized this problem
> recently.) Given that it's not a "bug" strictly speaking, even if this
> patch is a good idea, I'm unsure if it needs to be backported.
>
> This quirk has existed since minor fault support was added for shmem[1].
>
> I've tried to see if I can legitimately get a user to read stale data,
> and a few attempts with this test[2] have been unsuccessful.
AFAICT that won't easily reproduce even if the problem existed, as we
contain so many implict memory barriers here and there. E.g. right at the
entry of ioctl(), mmget_not_zero() already contains a full ordering
constraint:
/**
* atomic_inc_not_zero() - atomic increment unless zero with full ordering
* @v: pointer to atomic_t
I was expecting the syscall routine will guarantee an ordering already but
indeed I can't find any. I also checked up Intel's spec and SYSCALL inst
document only has one paragraph on ordering:
Instruction ordering. Instructions following a SYSCALL may be
fetched from memory before earlier instructions complete execution,
but they will not execute (even speculatively) until all
instructions prior to the SYSCALL have completed execution (the
later instructions may execute before data stored by the earlier
instructions have become globally visible).
I guess it implies a hardware reordering is indeed possible in this case?
>
> [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
>
> mm/hugetlb.c | 15 +++++++++------
> mm/userfaultfd.c | 18 ++++++++++++++++++
> 2 files changed, 27 insertions(+), 6 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bb17e5c22759..533bf6b2d94d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> }
> }
>
> - /*
> - * The memory barrier inside __folio_mark_uptodate makes sure that
> - * preceding stores to the page contents become visible before
> - * the set_pte_at() write.
> - */
> - __folio_mark_uptodate(folio);
> + if (!is_continue) {
> + /*
> + * The memory barrier inside __folio_mark_uptodate makes sure
> + * that preceding stores to the page contents become visible
> + * before the set_pte_at() write.
> + */
> + __folio_mark_uptodate(folio);
Can we move the comment above the "if", explaining both conditions?
> + } else
> + WARN_ON_ONCE(!folio_test_uptodate(folio));
>
> /* Add shared, newly allocated pages to the page cache. */
> if (vm_shared && !is_continue) {
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 503ea77c81aa..d515b640ca48 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> goto out_unlock;
> }
>
> + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> + /* See the comment in mfill_atomic. */
> + smp_wmb();
> +
> while (src_addr < src_start + len) {
> BUG_ON(dst_addr >= dst_start + len);
>
> @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> goto out_unlock;
>
> + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> + /*
> + * A caller might reasonably assume that UFFDIO_CONTINUE
> + * contains a wmb() to ensure that any writes to the
> + * about-to-be-mapped page by the thread doing the
> + * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> + * reads of the page through the newly mapped address.
> + *
> + * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> + * page. We can do the wmb() now for CONTINUE as the user has
> + * already prepared the page contents.
> + */
> + smp_wmb();
> +
Why you did it twice separately? Can we still share the code?
I'm wildly guessing: I don't worry on an extra wmb() in failure paths, as
that's never a performance concern to make failure slightly slower, IMHO.
Thanks,
> while (src_addr < src_start + len) {
> pmd_t dst_pmdval;
>
>
> base-commit: a7f399ae964e1d2a11d88d863a1d64392678ccaf
> --
> 2.44.0.278.ge034bb2e1d-goog
>
--
Peter Xu
On Tue, Mar 5, 2024 at 7:41 PM Peter Xu <[email protected]> wrote:
>
> On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> > I've tried to see if I can legitimately get a user to read stale data,
> > and a few attempts with this test[2] have been unsuccessful.
>
> AFAICT that won't easily reproduce even if the problem existed, as we
> contain so many implict memory barriers here and there. E.g. right at the
> entry of ioctl(), mmget_not_zero() already contains a full ordering
> constraint:
>
> /**
> * atomic_inc_not_zero() - atomic increment unless zero with full ordering
> * @v: pointer to atomic_t
Oh yes, of course. Thanks for pointing this out. So we definitely
don't need a Fixes.
>
> I was expecting the syscall routine will guarantee an ordering already but
> indeed I can't find any. I also checked up Intel's spec and SYSCALL inst
> document only has one paragraph on ordering:
>
> Instruction ordering. Instructions following a SYSCALL may be
> fetched from memory before earlier instructions complete execution,
> but they will not execute (even speculatively) until all
> instructions prior to the SYSCALL have completed execution (the
> later instructions may execute before data stored by the earlier
> instructions have become globally visible).
>
> I guess it implies a hardware reordering is indeed possible in this case?
That's my understanding as well.
>
> >
> > [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> > [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> >
> > mm/hugetlb.c | 15 +++++++++------
> > mm/userfaultfd.c | 18 ++++++++++++++++++
> > 2 files changed, 27 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb17e5c22759..533bf6b2d94d 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> > }
> > }
> >
> > - /*
> > - * The memory barrier inside __folio_mark_uptodate makes sure that
> > - * preceding stores to the page contents become visible before
> > - * the set_pte_at() write.
> > - */
> > - __folio_mark_uptodate(folio);
> > + if (!is_continue) {
> > + /*
> > + * The memory barrier inside __folio_mark_uptodate makes sure
> > + * that preceding stores to the page contents become visible
> > + * before the set_pte_at() write.
> > + */
> > + __folio_mark_uptodate(folio);
>
> Can we move the comment above the "if", explaining both conditions?
Yes, I'll do that. I think the explanation for
WARN_ON_ONCE(!folio_test_uptodate(folio)) is:
We only need to `__folio_mark_uptodate(folio)` if we have
allocated a new folio, and HugeTLB pages will always be Uptodate if
they are in the pagecache.
We could even drop the WARN_ON_ONCE.
>
> > + } else
> > + WARN_ON_ONCE(!folio_test_uptodate(folio));
> >
> > /* Add shared, newly allocated pages to the page cache. */
> > if (vm_shared && !is_continue) {
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 503ea77c81aa..d515b640ca48 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> > goto out_unlock;
> > }
> >
> > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > + /* See the comment in mfill_atomic. */
> > + smp_wmb();
> > +
> > while (src_addr < src_start + len) {
> > BUG_ON(dst_addr >= dst_start + len);
> >
> > @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> > uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > goto out_unlock;
> >
> > + if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > + /*
> > + * A caller might reasonably assume that UFFDIO_CONTINUE
> > + * contains a wmb() to ensure that any writes to the
> > + * about-to-be-mapped page by the thread doing the
> > + * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> > + * reads of the page through the newly mapped address.
> > + *
> > + * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> > + * page. We can do the wmb() now for CONTINUE as the user has
> > + * already prepared the page contents.
> > + */
> > + smp_wmb();
> > +
>
> Why you did it twice separately? Can we still share the code?
>
> I'm wildly guessing: I don't worry on an extra wmb() in failure paths, as
> that's never a performance concern to make failure slightly slower, IMHO.
You're right, I shouldn't care so much about making the slow paths a
little bit slower. I'll move the wmb earlier so that we only need it
in one place.
>
> Thanks,
Thanks Peter!
On Wed, Mar 06, 2024 at 09:57:17AM -0800, James Houghton wrote:
> On Tue, Mar 5, 2024 at 7:41 PM Peter Xu <[email protected]> wrote:
> >
> > On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> > > I've tried to see if I can legitimately get a user to read stale data,
> > > and a few attempts with this test[2] have been unsuccessful.
> >
> > AFAICT that won't easily reproduce even if the problem existed, as we
> > contain so many implict memory barriers here and there. E.g. right at the
> > entry of ioctl(), mmget_not_zero() already contains a full ordering
> > constraint:
> >
> > /**
> > * atomic_inc_not_zero() - atomic increment unless zero with full ordering
> > * @v: pointer to atomic_t
>
> Oh yes, of course. Thanks for pointing this out. So we definitely
> don't need a Fixes.
>
> >
> > I was expecting the syscall routine will guarantee an ordering already but
> > indeed I can't find any. I also checked up Intel's spec and SYSCALL inst
> > document only has one paragraph on ordering:
> >
> > Instruction ordering. Instructions following a SYSCALL may be
> > fetched from memory before earlier instructions complete execution,
> > but they will not execute (even speculatively) until all
> > instructions prior to the SYSCALL have completed execution (the
> > later instructions may execute before data stored by the earlier
> > instructions have become globally visible).
> >
> > I guess it implies a hardware reordering is indeed possible in this case?
>
> That's my understanding as well.
Let's also mention that in the commit message when repost? Just in case
it'll answer other readers when they read this patch.
>
> >
> > >
> > > [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> > > [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> > >
> > > mm/hugetlb.c | 15 +++++++++------
> > > mm/userfaultfd.c | 18 ++++++++++++++++++
> > > 2 files changed, 27 insertions(+), 6 deletions(-)
> > >
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index bb17e5c22759..533bf6b2d94d 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> > > }
> > > }
> > >
> > > - /*
> > > - * The memory barrier inside __folio_mark_uptodate makes sure that
> > > - * preceding stores to the page contents become visible before
> > > - * the set_pte_at() write.
> > > - */
> > > - __folio_mark_uptodate(folio);
> > > + if (!is_continue) {
> > > + /*
> > > + * The memory barrier inside __folio_mark_uptodate makes sure
> > > + * that preceding stores to the page contents become visible
> > > + * before the set_pte_at() write.
> > > + */
> > > + __folio_mark_uptodate(folio);
> >
> > Can we move the comment above the "if", explaining both conditions?
>
> Yes, I'll do that. I think the explanation for
> WARN_ON_ONCE(!folio_test_uptodate(folio)) is:
>
> We only need to `__folio_mark_uptodate(folio)` if we have
> allocated a new folio, and HugeTLB pages will always be Uptodate if
> they are in the pagecache.
>
> We could even drop the WARN_ON_ONCE.
No strong opinions, keeping it still makes sense to me. Btw, it'll also be
important to document "how is the ordering guaranteed for CONTINUE", then
you can reference the new code you add, as readers can get confused on why
CONTINUE doesn't need such ordering.
Thanks,
--
Peter Xu