The faulting path is:
do_user_addr_fault (flags = FAULT_FLAG_ALLOW_RETRY)
handle_mm_fault
__handle_mm_fault
do_anonymous
handle_userfault (ret = VM_FAULT_RETRY)
At this point the fault is handled and when this call sequence unwinds
it is expected that the PTEs are present as handle_userfault took care
of that and returned VM_FAULT_RETRY. When we unwind back down to
do_user_addr_fault it will attempt to retry after clearing
FAULT_FLAG_ALLOW_RETRY and setting FAULT_FLAG_TRIED.
At any point between the fault being handle and when it's retried a
userspace thread was to zap the page range, let's say via
madvise(MADV_DONTNEED). Now as this fault is being retried the PTEs would
be missing again and we land right back in handle_userfault. Although this
time since FAULT_FLAG_ALLOW_RETRY has been cleared we're going to bail and
return VM_FAULT_SIGBUS.
This scenario is easy to reproduce and I observed it while writing tests for
MREMAP_DONTUNMAP in the userfaultfd selftests. I have a standalone example of
this that uses madvise(MADV_DONTNEED) to cause this race:
https://gist.github.com/bgaff/3a8b2a890ae4771be22456e014c2e5aa
Given that this is genuinely a new fault, is a SIGBUS the best way to go about
this? Since userfaultfd is designed to be used in a non-cooperative case, could
it be more resilient and instead retry for the new fault?
With that being said for VM_UFFD_MISSING userfaults can we just remove the
check in handle_userfault that FAULT_FLAG_ALLOW_RETRY is set in vmf->flags
and allow it to retry the fault to address this race scenario?
In my testing that does solve the problem, but does it possibly create others?
Is this the best way to go about it? I'll defer to the domain experts for any
recommendations here in case I'm way off.
Thanks for your time.
Signed-off-by: Brian Geffon <[email protected]>
---
fs/userfaultfd.c | 28 ----------------------------
1 file changed, 28 deletions(-)
diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
index ebf17d7e1093..6407fec798ff 100644
--- a/fs/userfaultfd.c
+++ b/fs/userfaultfd.c
@@ -416,34 +416,6 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
goto out;
}
- /*
- * Check that we can return VM_FAULT_RETRY.
- *
- * NOTE: it should become possible to return VM_FAULT_RETRY
- * even if FAULT_FLAG_TRIED is set without leading to gup()
- * -EBUSY failures, if the userfaultfd is to be extended for
- * VM_UFFD_WP tracking and we intend to arm the userfault
- * without first stopping userland access to the memory. For
- * VM_UFFD_MISSING userfaults this is enough for now.
- */
- if (unlikely(!(vmf->flags & FAULT_FLAG_ALLOW_RETRY))) {
- /*
- * Validate the invariant that nowait must allow retry
- * to be sure not to return SIGBUS erroneously on
- * nowait invocations.
- */
- BUG_ON(vmf->flags & FAULT_FLAG_RETRY_NOWAIT);
-#ifdef CONFIG_DEBUG_VM
- if (printk_ratelimit()) {
- printk(KERN_WARNING
- "FAULT_FLAG_ALLOW_RETRY missing %x\n",
- vmf->flags);
- dump_stack();
- }
-#endif
- goto out;
- }
-
/*
* Handle nowait, not much to do other than tell it to retry
* and wait.
--
2.25.0.265.gbab2e86ba0-goog
Hello,
this and other enhancements have already implemented by Peter (CC'ed)
and in the right way, by altering the retry logic in the page fault
code. This is a requirement for other kind of usages too, notably the
UFFD_WRITEPROTECT ioctl after which multiple consecutive faults can
happen and must be handled.
IIRC Kirill asked at last LSF-MM uffd-wp talk if there's any
particular reason the fault couldn't be retried currently. I had no
sure answer other than there's apparently no strong reason why
VM_FAULT_RETRY is only allowed 1 time currently, so there should be no
issue in lifting that artificial restriction.
I'm running with this patchset applied in my systems since Nov with no
regression at all. I got sidetracked by various other issues, so
unfortunately I didn' post a proper reviewed-by on the last submit yet
(pending), but I did at least test it and it was rock solid so far.
https://lore.kernel.org/lkml/[email protected]/
Can you test and verify it too if it solves your use case?
Also note the complete uffd-WP support submit also from Peter:
https://lore.kernel.org/lkml/[email protected]/
https://github.com/xzpeter/linux/tree/uffd-wp-merged
Thanks,
Andrea
Hi Andrea,
Thanks for the quick reply. That's great to hear that Peter has been
working on those improvements. I didn't try the entire patchset but I
did confirm that patch 13, not surprisingly, also resolves that issue
on at least on x86:
https://lkml.org/lkml/2019/9/26/179
Given that seems pretty low risk and it definitely resolves a pretty
big issue for the non-cooperative userfaultfd case, any chance it
could be landed ahead of the rest of the series?
Thanks,
Brian
On Fri, Feb 14, 2020 at 6:20 PM Andrea Arcangeli <[email protected]> wrote:
>
> Hello,
>
> this and other enhancements have already implemented by Peter (CC'ed)
> and in the right way, by altering the retry logic in the page fault
> code. This is a requirement for other kind of usages too, notably the
> UFFD_WRITEPROTECT ioctl after which multiple consecutive faults can
> happen and must be handled.
>
> IIRC Kirill asked at last LSF-MM uffd-wp talk if there's any
> particular reason the fault couldn't be retried currently. I had no
> sure answer other than there's apparently no strong reason why
> VM_FAULT_RETRY is only allowed 1 time currently, so there should be no
> issue in lifting that artificial restriction.
>
> I'm running with this patchset applied in my systems since Nov with no
> regression at all. I got sidetracked by various other issues, so
> unfortunately I didn' post a proper reviewed-by on the last submit yet
> (pending), but I did at least test it and it was rock solid so far.
>
> https://lore.kernel.org/lkml/[email protected]/
>
> Can you test and verify it too if it solves your use case?
>
> Also note the complete uffd-WP support submit also from Peter:
>
> https://lore.kernel.org/lkml/[email protected]/
>
> https://github.com/xzpeter/linux/tree/uffd-wp-merged
>
> Thanks,
> Andrea
>
On Sat, Feb 15, 2020 at 09:29:46AM -0500, Brian Geffon wrote:
> Hi Andrea,
> Thanks for the quick reply. That's great to hear that Peter has been
> working on those improvements. I didn't try the entire patchset but I
> did confirm that patch 13, not surprisingly, also resolves that issue
> on at least on x86:
> https://lkml.org/lkml/2019/9/26/179
>
> Given that seems pretty low risk and it definitely resolves a pretty
> big issue for the non-cooperative userfaultfd case, any chance it
> could be landed ahead of the rest of the series?
Thanks Andrea & Brian! Yes it would be great if the series (or some
of the patches) could be moved forward. Please just let me know if
there's still anything I can do from my side.
Thanks,
>
> Thanks,
> Brian
>
> On Fri, Feb 14, 2020 at 6:20 PM Andrea Arcangeli <[email protected]> wrote:
> >
> > Hello,
> >
> > this and other enhancements have already implemented by Peter (CC'ed)
> > and in the right way, by altering the retry logic in the page fault
> > code. This is a requirement for other kind of usages too, notably the
> > UFFD_WRITEPROTECT ioctl after which multiple consecutive faults can
> > happen and must be handled.
> >
> > IIRC Kirill asked at last LSF-MM uffd-wp talk if there's any
> > particular reason the fault couldn't be retried currently. I had no
> > sure answer other than there's apparently no strong reason why
> > VM_FAULT_RETRY is only allowed 1 time currently, so there should be no
> > issue in lifting that artificial restriction.
> >
> > I'm running with this patchset applied in my systems since Nov with no
> > regression at all. I got sidetracked by various other issues, so
> > unfortunately I didn' post a proper reviewed-by on the last submit yet
> > (pending), but I did at least test it and it was rock solid so far.
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > Can you test and verify it too if it solves your use case?
> >
> > Also note the complete uffd-WP support submit also from Peter:
> >
> > https://lore.kernel.org/lkml/[email protected]/
> >
> > https://github.com/xzpeter/linux/tree/uffd-wp-merged
> >
> > Thanks,
> > Andrea
> >
>
--
Peter Xu
Hi Peter,
I'll try helping out by giving the entire patchset a try.
But in the meantime, if the plan of record will be to always allow
retrying then shouldn't the block I mailed a patch on be removed
regardless because do_user_addr_fault always starts with
FAULT_FLAG_ALLOW_RETRY and we shouldn't ever land there without it in
the future and allows userfaultfd to retry?
Thanks,
Brian
On Mon, Feb 17, 2020 at 10:07 AM Peter Xu <[email protected]> wrote:
>
> On Sat, Feb 15, 2020 at 09:29:46AM -0500, Brian Geffon wrote:
> > Hi Andrea,
> > Thanks for the quick reply. That's great to hear that Peter has been
> > working on those improvements. I didn't try the entire patchset but I
> > did confirm that patch 13, not surprisingly, also resolves that issue
> > on at least on x86:
> > https://lkml.org/lkml/2019/9/26/179
> >
> > Given that seems pretty low risk and it definitely resolves a pretty
> > big issue for the non-cooperative userfaultfd case, any chance it
> > could be landed ahead of the rest of the series?
>
> Thanks Andrea & Brian! Yes it would be great if the series (or some
> of the patches) could be moved forward. Please just let me know if
> there's still anything I can do from my side.
>
> Thanks,
>
> >
> > Thanks,
> > Brian
> >
> > On Fri, Feb 14, 2020 at 6:20 PM Andrea Arcangeli <[email protected]> wrote:
> > >
> > > Hello,
> > >
> > > this and other enhancements have already implemented by Peter (CC'ed)
> > > and in the right way, by altering the retry logic in the page fault
> > > code. This is a requirement for other kind of usages too, notably the
> > > UFFD_WRITEPROTECT ioctl after which multiple consecutive faults can
> > > happen and must be handled.
> > >
> > > IIRC Kirill asked at last LSF-MM uffd-wp talk if there's any
> > > particular reason the fault couldn't be retried currently. I had no
> > > sure answer other than there's apparently no strong reason why
> > > VM_FAULT_RETRY is only allowed 1 time currently, so there should be no
> > > issue in lifting that artificial restriction.
> > >
> > > I'm running with this patchset applied in my systems since Nov with no
> > > regression at all. I got sidetracked by various other issues, so
> > > unfortunately I didn' post a proper reviewed-by on the last submit yet
> > > (pending), but I did at least test it and it was rock solid so far.
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > Can you test and verify it too if it solves your use case?
> > >
> > > Also note the complete uffd-WP support submit also from Peter:
> > >
> > > https://lore.kernel.org/lkml/[email protected]/
> > >
> > > https://github.com/xzpeter/linux/tree/uffd-wp-merged
> > >
> > > Thanks,
> > > Andrea
> > >
> >
>
> --
> Peter Xu
>
On Mon, Feb 17, 2020 at 07:50:19PM -0600, Brian Geffon wrote:
> But in the meantime, if the plan of record will be to always allow
> retrying then shouldn't the block I mailed a patch on be removed
> regardless because do_user_addr_fault always starts with
> FAULT_FLAG_ALLOW_RETRY and we shouldn't ever land there without it in
> the future and allows userfaultfd to retry?
It might hide the limitation but only if the page fault originated in
userland (Android's case), but that's not something userfault users
should depend on. Userfaults (unlike sigsegv trapping) are meant to be
reliable and transparent to all user and kernel accesses alike.
It is also is unclear how long Android will be forced to keep doing
bounce buffers copies in RAM before considering passing any memory to
kernel syscalls.
For all other users where the kernel access may be the one triggering
the fault the patch will remove a debug aid and the kernel fault would
then fail by hitting on the below:
/* Not returning to user mode? Handle exceptions or die: */
no_context(regs, hw_error_code, address, SIGBUS, BUS_ADRERR);
There may be more side effects in other archs I didn't evaluate
because there's no other place where the common code can return
VM_FAULT_RETRY despite the arch code explicitly told the common code
it can't do that (by not setting FAULT_FLAG_ALLOW_RETRY) so it doesn't
look very safe and it doesn't seem a generic enough solution to the
problem.
That dump_stack() helped a lot to identify those kernel outliers that
erroneously use get_user_pages instead of the gup_locked/unlocked
variant that are uffd-capable.
Thanks,
Andrea
Hi Andrea,
That all makes sense, thanks so much for that detailed explanation.
Brian
On Mon, Feb 17, 2020 at 8:27 PM Andrea Arcangeli <[email protected]> wrote:
>
> On Mon, Feb 17, 2020 at 07:50:19PM -0600, Brian Geffon wrote:
> > But in the meantime, if the plan of record will be to always allow
> > retrying then shouldn't the block I mailed a patch on be removed
> > regardless because do_user_addr_fault always starts with
> > FAULT_FLAG_ALLOW_RETRY and we shouldn't ever land there without it in
> > the future and allows userfaultfd to retry?
>
> It might hide the limitation but only if the page fault originated in
> userland (Android's case), but that's not something userfault users
> should depend on. Userfaults (unlike sigsegv trapping) are meant to be
> reliable and transparent to all user and kernel accesses alike.
>
> It is also is unclear how long Android will be forced to keep doing
> bounce buffers copies in RAM before considering passing any memory to
> kernel syscalls.
>
> For all other users where the kernel access may be the one triggering
> the fault the patch will remove a debug aid and the kernel fault would
> then fail by hitting on the below:
>
> /* Not returning to user mode? Handle exceptions or die: */
> no_context(regs, hw_error_code, address, SIGBUS, BUS_ADRERR);
>
> There may be more side effects in other archs I didn't evaluate
> because there's no other place where the common code can return
> VM_FAULT_RETRY despite the arch code explicitly told the common code
> it can't do that (by not setting FAULT_FLAG_ALLOW_RETRY) so it doesn't
> look very safe and it doesn't seem a generic enough solution to the
> problem.
>
> That dump_stack() helped a lot to identify those kernel outliers that
> erroneously use get_user_pages instead of the gup_locked/unlocked
> variant that are uffd-capable.
>
> Thanks,
> Andrea
>