2018-04-05 01:55:09

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Fri, Mar 30, 2018 at 08:37:45PM +0300, Michael S. Tsirkin wrote:
> get_user_pages_fast is supposed to be a faster drop-in equivalent of
> get_user_pages. As such, callers expect it to return a negative return
> code when passed an invalid address, and never expect it to
> return 0 when passed a positive number of pages, since
> its documentation says:
>
> * Returns number of pages pinned. This may be fewer than the number
> * requested. If nr_pages is 0 or negative, returns 0. If no pages
> * were pinned, returns -errno.
>
> Unfortunately this is not what the implementation does: it returns 0 if
> passed a kernel address, confusing callers: for example, the following
> is pretty common but does not appear to do the right thing with a kernel
> address:
>
> ret = get_user_pages_fast(addr, 1, writeable, &page);
> if (ret < 0)
> return ret;
>
> Change get_user_pages_fast to return -EFAULT when supplied a
> kernel address to make it match expectations.
>
> __get_user_pages_fast does not seem to be used like this, but let's
> change __get_user_pages_fast as well for consistency and to match
> documentation.
>
> Lightly tested.
>
> Cc: Kirill A. Shutemov <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Huang Ying <[email protected]>
> Cc: Jonathan Corbet <[email protected]>
> Cc: Linus Torvalds <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> Cc: Thomas Gleixner <[email protected]>
> Cc: Thorsten Leemhuis <[email protected]>
> Cc: [email protected]
> Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
> Reported-by: [email protected]
> Signed-off-by: Michael S. Tsirkin <[email protected]>

Any feedback on this? As this fixes a bug in vhost, I'll merge
through the vhost tree unless someone objects.

> ---
> mm/gup.c | 10 ++++++++--
> 1 file changed, 8 insertions(+), 2 deletions(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index 6afae32..5642521 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -1749,6 +1749,9 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> unsigned long flags;
> int nr = 0;
>
> + if (nr_pages <= 0)
> + return 0;
> +
> start &= PAGE_MASK;
> addr = start;
> len = (unsigned long) nr_pages << PAGE_SHIFT;
> @@ -1756,7 +1759,7 @@ int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
>
> if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> (void __user *)start, len)))
> - return 0;
> + return -EFAULT;
>
> /*
> * Disable interrupts. We use the nested form as we can already have
> @@ -1806,9 +1809,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
> len = (unsigned long) nr_pages << PAGE_SHIFT;
> end = start + len;
>
> + if (nr_pages <= 0)
> + return 0;
> +
> if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
> (void __user *)start, len)))
> - return 0;
> + return -EFAULT;
>
> if (gup_fast_permitted(start, nr_pages, write)) {
> local_irq_disable();
> --
> MST


2018-04-05 02:44:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Wed, Apr 4, 2018 at 6:53 PM, Michael S. Tsirkin <[email protected]> wrote:
>
> Any feedback on this? As this fixes a bug in vhost, I'll merge
> through the vhost tree unless someone objects.

NAK.

__get_user_pages_fast() returns the number of pages it gets.

It has never returned an error code, and all the other versions of it
(architecture-specific) don't either.

If you ask for one page, and get zero pages, then that's an -EFAULT.
Note that that's an EFAULT regardless of whether that zero page
happened due to kernel addresses or just lack of mapping in user
space.

The documentation is simply wrong if it says anything else. Fix the
docs, and fix the users.

The correct use has always been to check the number of pages returned.

Just looking around, returning an error number looks like it could
seriously confuse some things. You have things like the kvm code that
does the *right* thing:

unsigned long ... npinned ...

npinned = get_user_pages_fast(uaddr, npages, write ?
FOLL_WRITE : 0, pages);
if (npinned != npages) {
...

err:
if (npinned > 0)
release_pages(pages, npinned);

and the above code clearly depends on the actual behavior, not on the
documentation.

Any changes in this area would need some *extreme* care, exactly
because of code like the above that clearly depends on the existing
semantics.

In fact, the documentation really seems to be just buggy. The actual
get_user_pages() function itself is expressly being careful *not* to
return an error code, it even has a comment to the effect ("Have to be
a bit careful with return values").

So the "If no pages were pinned, returns -errno" comment is just bogus.

Linus

2018-04-05 14:20:36

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Wed, Apr 04, 2018 at 07:40:36PM -0700, Linus Torvalds wrote:
> On Wed, Apr 4, 2018 at 6:53 PM, Michael S. Tsirkin <[email protected]> wrote:
> >
> > Any feedback on this? As this fixes a bug in vhost, I'll merge
> > through the vhost tree unless someone objects.
>
> NAK.
>
> __get_user_pages_fast() returns the number of pages it gets.
>
> It has never returned an error code, and all the other versions of it
> (architecture-specific) don't either.

Thanks Linus. I can change the docs and all the callers.


I wonder however whether all the following should be changed then:

static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,

...

if (!vma || check_vma_flags(vma, gup_flags))
return i ? : -EFAULT;

is this a bug in __get_user_pages?


Another example:

ret = get_gate_page(mm, start & PAGE_MASK,
gup_flags, &vma,
pages ? &pages[i] : NULL);
if (ret)
return i ? : ret;

and ret is -EFAULT on error.


Another example:
switch (ret) {
case 0:
goto retry;
case -EFAULT:
case -ENOMEM:
case -EHWPOISON:
return i ? i : ret;
case -EBUSY:
return i;
case -ENOENT:
goto next_page;
}

it looks like this will return -EFAULT/-ENOMEM/-EHWPOISON
if i is 0.


> If you ask for one page, and get zero pages, then that's an -EFAULT.
> Note that that's an EFAULT regardless of whether that zero page
> happened due to kernel addresses or just lack of mapping in user
> space.
>
> The documentation is simply wrong if it says anything else. Fix the
> docs, and fix the users.
>
> The correct use has always been to check the number of pages returned.
>
> Just looking around, returning an error number looks like it could
> seriously confuse some things.
>
> You have things like the kvm code that
> does the *right* thing:
>
> unsigned long ... npinned ...
>
> npinned = get_user_pages_fast(uaddr, npages, write ?
> FOLL_WRITE : 0, pages);
> if (npinned != npages) {
> ...
>
> err:
> if (npinned > 0)
> release_pages(pages, npinned);
>
> and the above code clearly depends on the actual behavior, not on the
> documentation.

This seems to work fine with my patch since it only changes the
case where npinned == 0.

> Any changes in this area would need some *extreme* care, exactly
> because of code like the above that clearly depends on the existing
> semantics.
>
> In fact, the documentation really seems to be just buggy. The actual
> get_user_pages() function itself is expressly being careful *not* to
> return an error code, it even has a comment to the effect ("Have to be
> a bit careful with return values").
>
> So the "If no pages were pinned, returns -errno" comment is just bogus.
>
> Linus

I'd like to change the doc then, but it seems that I'll have to change
the implementation in that case too.

--
MST

2018-04-05 15:41:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Thu, Apr 5, 2018 at 7:17 AM, Michael S. Tsirkin <[email protected]> wrote:
>
> I wonder however whether all the following should be changed then:
>
> static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
>
> ...
>
> if (!vma || check_vma_flags(vma, gup_flags))
> return i ? : -EFAULT;
>
> is this a bug in __get_user_pages?

Note the difference between "get_user_pages()", and "get_user_pages_fast()".

It's the *fast* versions that just return the number of pages pinned.

The non-fast ones will return an error code for various cases.

Why?

The non-fast cases actually *have* various error cases. They can block
and get interrupted etc.

The fast cases are basically "just get me the pages, dammit, and if
you can't get some page, stop".

At least that's one excuse for the difference in behavior.

The real excuse is probably just "that's how it worked" - the fast
case just walked the page tables and that was it.

Linus

2018-04-05 18:29:55

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Thu, Apr 05, 2018 at 08:40:05AM -0700, Linus Torvalds wrote:
> On Thu, Apr 5, 2018 at 7:17 AM, Michael S. Tsirkin <[email protected]> wrote:
> >
> > I wonder however whether all the following should be changed then:
> >
> > static long __get_user_pages(struct task_struct *tsk, struct mm_struct *mm,
> >
> > ...
> >
> > if (!vma || check_vma_flags(vma, gup_flags))
> > return i ? : -EFAULT;
> >
> > is this a bug in __get_user_pages?
>
> Note the difference between "get_user_pages()", and "get_user_pages_fast()".
>
> It's the *fast* versions that just return the number of pages pinned.
>
> The non-fast ones will return an error code for various cases.
>
> Why?
>
> The non-fast cases actually *have* various error cases. They can block
> and get interrupted etc.
>
> The fast cases are basically "just get me the pages, dammit, and if
> you can't get some page, stop".
>
> At least that's one excuse for the difference in behavior.
>
> The real excuse is probably just "that's how it worked" - the fast
> case just walked the page tables and that was it.
>
> Linus

I see, thanks for the clarification Linus.

to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't
pin any pages and that is by design. Returning 0 on error isn't usual I think
so I guess this behaviour should we well documented.

That part of my patch was wrong and should be replaced with a doc
update.

What about get_user_pages_fast though? That's the other part of the
patch. Right now get_user_pages_fast does:

ret = get_user_pages_unlocked(start, nr_pages - nr, pages,
write ? FOLL_WRITE : 0);

/* Have to be a bit careful with return values */
if (nr > 0) {
if (ret < 0)
ret = nr;
else
ret += nr;
}

so an error on the 1st page gets propagated to the caller,
and that get_user_pages_unlocked eventually calls __get_user_pages
so it does return an error sometimes.

Would it be correct to apply the second part of the patch then
(pasted below for reference) or should get_user_pages_fast
and all its callers be changed to return 0 on error instead?

@@ -1806,9 +1809,12 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
len = (unsigned long) nr_pages << PAGE_SHIFT;
end = start + len;

+ if (nr_pages <= 0)
+ return 0;
+
if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
(void __user *)start, len)))
- return 0;
+ return -EFAULT;

if (gup_fast_permitted(start, nr_pages, write)) {
local_irq_disable();

--
MST

2018-04-05 18:45:00

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin <[email protected]> wrote:
>
> to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't
> pin any pages and that is by design. Returning 0 on error isn't usual I think
> so I guess this behaviour should we well documented.

Arguably it happens elsewhere too, and not just in the kernel.
"read()" at past the end of a file is not an error, you'll just get 0
for EOF.

So it's not really "returning 0 on error".

It really is simply returning the number of pages it got. End of
story. That number of pages can be smaller than the requested number
of pages, and _that_ is due to some error, but note how it can return
"5" on error too - you asked for 10 pages, but the error happened in
the middle!

So the right way to check for error is to bverify that you get the
number of pages that you asked for. If you don't, something bad
happened.

Of course, many users don't actually care about "I didn't get
everything". They only care about "did I get _something_". Then that 0
ends up being the error case, but note how it depends on the caller.

> What about get_user_pages_fast though?

We do seem to special-case the first page there. I'm not sure it's a
good idea. But like the __get_user_pages_fast(), we seem to have users
that know about the particular semantics and depend on it.

It's all ugly, I agree.

End result: we can't just change semantics of either of them.

At least not without going through every single user and checking that
they are ok with it.

Which I guess I could be ok with. Maybe changing the semantics of
__get_user_pages_fast() is acceptable, if you just change it
*everywhere* (which includes not just he users, but also the couple of
architecture-specific versions of that same function that we have.

Linus

2018-04-05 19:35:42

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
> On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin <[email protected]> wrote:
> >
> > to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't
> > pin any pages and that is by design. Returning 0 on error isn't usual I think
> > so I guess this behaviour should we well documented.
>
> Arguably it happens elsewhere too, and not just in the kernel.
> "read()" at past the end of a file is not an error, you'll just get 0
> for EOF.
>
> So it's not really "returning 0 on error".
>
> It really is simply returning the number of pages it got. End of
> story. That number of pages can be smaller than the requested number
> of pages, and _that_ is due to some error, but note how it can return
> "5" on error too - you asked for 10 pages, but the error happened in
> the middle!
>
> So the right way to check for error is to bverify that you get the
> number of pages that you asked for. If you don't, something bad
> happened.
>
> Of course, many users don't actually care about "I didn't get
> everything". They only care about "did I get _something_". Then that 0
> ends up being the error case, but note how it depends on the caller.
>
> > What about get_user_pages_fast though?
>
> We do seem to special-case the first page there. I'm not sure it's a
> good idea. But like the __get_user_pages_fast(), we seem to have users
> that know about the particular semantics and depend on it.
>
> It's all ugly, I agree.
>
> End result: we can't just change semantics of either of them.
>
> At least not without going through every single user and checking that
> they are ok with it.
>
> Which I guess I could be ok with. Maybe changing the semantics of
> __get_user_pages_fast() is acceptable, if you just change it
> *everywhere* (which includes not just he users, but also the couple of
> architecture-specific versions of that same function that we have.
>
> Linus

OK I hope I understood what you are saying here.

At least drivers/gpu/drm/i915/i915_gem_userptr.c seems to
get it wrong:

pinned = __get_user_pages_fast(obj->userptr.ptr,

if (pinned < 0) {
pages = ERR_PTR(pinned);
pinned = 0;
} else if (pinned < num_pages) {
pages = __i915_gem_userptr_get_pages_schedule(obj);
active = pages == ERR_PTR(-EAGAIN);
} else {
pages = __i915_gem_userptr_alloc_pages(obj, pvec, num_pages);
active = !IS_ERR(pages);
}

The <0 path is never taken.

Cc maintainers - should that driver be changed to use
get_user_pages_fast?



2018-04-05 19:41:47

by Chris Wilson

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

Quoting Michael S. Tsirkin (2018-04-05 20:34:08)
> On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
> > On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin <[email protected]> wrote:
> > >
> > > to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't
> > > pin any pages and that is by design. Returning 0 on error isn't usual I think
> > > so I guess this behaviour should we well documented.
> >
> > Arguably it happens elsewhere too, and not just in the kernel.
> > "read()" at past the end of a file is not an error, you'll just get 0
> > for EOF.
> >
> > So it's not really "returning 0 on error".
> >
> > It really is simply returning the number of pages it got. End of
> > story. That number of pages can be smaller than the requested number
> > of pages, and _that_ is due to some error, but note how it can return
> > "5" on error too - you asked for 10 pages, but the error happened in
> > the middle!
> >
> > So the right way to check for error is to bverify that you get the
> > number of pages that you asked for. If you don't, something bad
> > happened.
> >
> > Of course, many users don't actually care about "I didn't get
> > everything". They only care about "did I get _something_". Then that 0
> > ends up being the error case, but note how it depends on the caller.
> >
> > > What about get_user_pages_fast though?
> >
> > We do seem to special-case the first page there. I'm not sure it's a
> > good idea. But like the __get_user_pages_fast(), we seem to have users
> > that know about the particular semantics and depend on it.
> >
> > It's all ugly, I agree.
> >
> > End result: we can't just change semantics of either of them.
> >
> > At least not without going through every single user and checking that
> > they are ok with it.
> >
> > Which I guess I could be ok with. Maybe changing the semantics of
> > __get_user_pages_fast() is acceptable, if you just change it
> > *everywhere* (which includes not just he users, but also the couple of
> > architecture-specific versions of that same function that we have.
> >
> > Linus
>
> OK I hope I understood what you are saying here.
>
> At least drivers/gpu/drm/i915/i915_gem_userptr.c seems to
> get it wrong:
>
> pinned = __get_user_pages_fast(obj->userptr.ptr,
>
> if (pinned < 0) {
> pages = ERR_PTR(pinned);
> pinned = 0;
> } else if (pinned < num_pages) {
> pages = __i915_gem_userptr_get_pages_schedule(obj);
> active = pages == ERR_PTR(-EAGAIN);
> } else {
> pages = __i915_gem_userptr_alloc_pages(obj, pvec, num_pages);
> active = !IS_ERR(pages);
> }
>
> The <0 path is never taken.

Please note that it only recently lost other paths that set an error
beforehand. Not exactly wrong since the short return is expected and
handled.

> Cc maintainers - should that driver be changed to use
> get_user_pages_fast?

It's not allowed to fault. __gup_fast has that comment, gup_fast does
not.
-Chris

2018-04-05 21:10:06

by Michael S. Tsirkin

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

On Thu, Apr 05, 2018 at 11:43:27AM -0700, Linus Torvalds wrote:
> On Thu, Apr 5, 2018 at 11:28 AM, Michael S. Tsirkin <[email protected]> wrote:
> >
> > to repeat what you are saying IIUC __get_user_pages_fast returns 0 if it can't
> > pin any pages and that is by design. Returning 0 on error isn't usual I think
> > so I guess this behaviour should we well documented.
>
> Arguably it happens elsewhere too, and not just in the kernel.
> "read()" at past the end of a file is not an error, you'll just get 0
> for EOF.
>
> So it's not really "returning 0 on error".
>
> It really is simply returning the number of pages it got. End of
> story. That number of pages can be smaller than the requested number
> of pages, and _that_ is due to some error, but note how it can return
> "5" on error too - you asked for 10 pages, but the error happened in
> the middle!
>
> So the right way to check for error is to bverify that you get the
> number of pages that you asked for. If you don't, something bad
> happened.
>
> Of course, many users don't actually care about "I didn't get
> everything". They only care about "did I get _something_". Then that 0
> ends up being the error case, but note how it depends on the caller.
>
> > What about get_user_pages_fast though?
>
> We do seem to special-case the first page there. I'm not sure it's a
> good idea. But like the __get_user_pages_fast(), we seem to have users
> that know about the particular semantics and depend on it.
>
> It's all ugly, I agree.
>
> End result: we can't just change semantics of either of them.
>
> At least not without going through every single user and checking that
> they are ok with it.
>
> Which I guess I could be ok with. Maybe changing the semantics of
> __get_user_pages_fast() is acceptable, if you just change it
> *everywhere* (which includes not just he users, but also the couple of
> architecture-specific versions of that same function that we have.
>
> Linus

For now I sent a patchset
1. documenting current behaviour for __get_user_pages_fast.
2. fixing get_user_pages_fast for consistency.

--
MST

2018-04-06 11:37:53

by Alan Cox

[permalink] [raw]
Subject: Re: [PATCH] gup: return -EFAULT on access_ok failure

> so an error on the 1st page gets propagated to the caller,
> and that get_user_pages_unlocked eventually calls __get_user_pages
> so it does return an error sometimes.
>
> Would it be correct to apply the second part of the patch then
> (pasted below for reference) or should get_user_pages_fast
> and all its callers be changed to return 0 on error instead?

0 isn't an error. As SuS sees it (ie from the userspace end of the pile)

returning the number you asked for means it worked

returning a smaller number means it worked partially and that much was
consumed (or in some cases more and the rest if so was lost - depends
what you are reading/writing)

returning 0 means you read nothing as you were at the end of file

returning an error code means it broke, or you should try again
(EAGAIN/EWOULDBLOCK)

The ugly bit there is the try-again semantics needs to exactly match the
attached poll() behaviour or you get busy loops.

Alan