In 2019, we introduced pin_user_pages*() and now we are converting
get_user_pages*() to the new API as appropriate. [1] & [2] could
be referred for more information. This is case 5 as per document [1].
As discussed, pages need to be marked as dirty before unpinned it.
Previously, if lock_pages() end up partially mapping pages, it used
to return -ERRNO due to which unlock_pages() have to go through
each pages[i] till *nr_pages* to validate them. This can be avoided
by passing correct number partially mapped pages & -ERRNO separately
while returning from lock_pages() due to error.
With this fix unlock_pages() doesn't need to validate pages[i] till
*nr_pages* for error scenario.
[1] Documentation/core-api/pin_user_pages.rst
[2] "Explicit pinning of user-space pages":
https://lwn.net/Articles/807108/
Signed-off-by: Souptick Joarder <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Boris Ostrovsky <[email protected]>
---
drivers/xen/privcmd.c | 34 +++++++++++++++++++---------------
1 file changed, 19 insertions(+), 15 deletions(-)
diff --git a/drivers/xen/privcmd.c b/drivers/xen/privcmd.c
index a250d11..013d23b 100644
--- a/drivers/xen/privcmd.c
+++ b/drivers/xen/privcmd.c
@@ -580,25 +580,30 @@ static long privcmd_ioctl_mmap_batch(
static int lock_pages(
struct privcmd_dm_op_buf kbufs[], unsigned int num,
- struct page *pages[], unsigned int nr_pages)
+ struct page *pages[], unsigned int nr_pages, int *errno)
{
unsigned int i;
+ int pinned = 0, rc = 0;
for (i = 0; i < num; i++) {
unsigned int requested;
- int pinned;
+ rc += pinned;
requested = DIV_ROUND_UP(
offset_in_page(kbufs[i].uptr) + kbufs[i].size,
PAGE_SIZE);
- if (requested > nr_pages)
- return -ENOSPC;
+ if (requested > nr_pages) {
+ *errno = -ENOSPC;
+ return rc;
+ }
- pinned = get_user_pages_fast(
+ pinned = pin_user_pages_fast(
(unsigned long) kbufs[i].uptr,
requested, FOLL_WRITE, pages);
- if (pinned < 0)
- return pinned;
+ if (pinned < 0) {
+ *errno = pinned;
+ return rc;
+ }
nr_pages -= pinned;
pages += pinned;
@@ -609,15 +614,10 @@ static int lock_pages(
static void unlock_pages(struct page *pages[], unsigned int nr_pages)
{
- unsigned int i;
-
if (!pages)
return;
- for (i = 0; i < nr_pages; i++) {
- if (pages[i])
- put_page(pages[i]);
- }
+ unpin_user_pages_dirty_lock(pages, nr_pages, 1);
}
static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
@@ -630,6 +630,7 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
struct xen_dm_op_buf *xbufs = NULL;
unsigned int i;
long rc;
+ int errno = 0;
if (copy_from_user(&kdata, udata, sizeof(kdata)))
return -EFAULT;
@@ -683,9 +684,12 @@ static long privcmd_ioctl_dm_op(struct file *file, void __user *udata)
goto out;
}
- rc = lock_pages(kbufs, kdata.num, pages, nr_pages);
- if (rc)
+ rc = lock_pages(kbufs, kdata.num, pages, nr_pages, &errno);
+ if (errno < 0) {
+ nr_pages = rc;
+ rc = errno;
goto out;
+ }
for (i = 0; i < kdata.num; i++) {
set_xen_guest_handle(xbufs[i].h, kbufs[i].uptr);
--
1.9.1
On 6/23/20 7:58 AM, Souptick Joarder wrote:
> In 2019, we introduced pin_user_pages*() and now we are converting
> get_user_pages*() to the new API as appropriate. [1] & [2] could
> be referred for more information. This is case 5 as per document [1].
>
> As discussed, pages need to be marked as dirty before unpinned it.
>
> Previously, if lock_pages() end up partially mapping pages, it used
> to return -ERRNO due to which unlock_pages() have to go through
> each pages[i] till *nr_pages* to validate them. This can be avoided
> by passing correct number partially mapped pages & -ERRNO separately
> while returning from lock_pages() due to error.
> With this fix unlock_pages() doesn't need to validate pages[i] till
> *nr_pages* for error scenario.
This should be split into two patches please. The first one will fix the
return value bug (and will need to go to stable branches) and the second
will use new routine to pin pages.
> @@ -580,25 +580,30 @@ static long privcmd_ioctl_mmap_batch(
>
> static int lock_pages(
> struct privcmd_dm_op_buf kbufs[], unsigned int num,
> - struct page *pages[], unsigned int nr_pages)
> + struct page *pages[], unsigned int nr_pages, int *errno)
I'd prefer if you used more traditional way of returning error code by
the function, and pass the number of pinned pages as an argument. This
will also make call site simpler.
-boris
On Tue, Jun 23, 2020 at 11:11 PM Boris Ostrovsky
<[email protected]> wrote:
>
> On 6/23/20 7:58 AM, Souptick Joarder wrote:
> > In 2019, we introduced pin_user_pages*() and now we are converting
> > get_user_pages*() to the new API as appropriate. [1] & [2] could
> > be referred for more information. This is case 5 as per document [1].
> >
> > As discussed, pages need to be marked as dirty before unpinned it.
> >
> > Previously, if lock_pages() end up partially mapping pages, it used
> > to return -ERRNO due to which unlock_pages() have to go through
> > each pages[i] till *nr_pages* to validate them. This can be avoided
> > by passing correct number partially mapped pages & -ERRNO separately
> > while returning from lock_pages() due to error.
> > With this fix unlock_pages() doesn't need to validate pages[i] till
> > *nr_pages* for error scenario.
>
>
> This should be split into two patches please. The first one will fix the
> return value bug (and will need to go to stable branches) and the second
> will use new routine to pin pages.
Initially I split the patches into 2 commits. But at last moment I
figure out that,
this bug fix ( better to call coding error, doesn't looks like lead to
any runtime bug) is tightly coupled to 2nd commit for
pin_user_pages*() conversion,
which means we don't need the bug fix patch if we are not converting the API to
pin_user_pages*()/ unpin_user_pages_dirty_lock(). That's the reason to
clubbed these two
commits into a single one.
If this looks unreasonable, will split it into 2 patches again.
>
>
> > @@ -580,25 +580,30 @@ static long privcmd_ioctl_mmap_batch(
> >
> > static int lock_pages(
> > struct privcmd_dm_op_buf kbufs[], unsigned int num,
> > - struct page *pages[], unsigned int nr_pages)
> > + struct page *pages[], unsigned int nr_pages, int *errno)
>
>
> I'd prefer if you used more traditional way of returning error code by
> the function, and pass the number of pinned pages as an argument. This
> will also make call site simpler.
Sure, Will do it.
>
>
> -boris
>
On 6/23/20 9:36 PM, Souptick Joarder wrote:
> On Tue, Jun 23, 2020 at 11:11 PM Boris Ostrovsky
> <[email protected]> wrote:
>> On 6/23/20 7:58 AM, Souptick Joarder wrote:
>>> In 2019, we introduced pin_user_pages*() and now we are converting
>>> get_user_pages*() to the new API as appropriate. [1] & [2] could
>>> be referred for more information. This is case 5 as per document [1].
>>>
>>> As discussed, pages need to be marked as dirty before unpinned it.
>>>
>>> Previously, if lock_pages() end up partially mapping pages, it used
>>> to return -ERRNO due to which unlock_pages() have to go through
>>> each pages[i] till *nr_pages* to validate them. This can be avoided
>>> by passing correct number partially mapped pages & -ERRNO separately
>>> while returning from lock_pages() due to error.
>>> With this fix unlock_pages() doesn't need to validate pages[i] till
>>> *nr_pages* for error scenario.
>>
>> This should be split into two patches please. The first one will fix the
>> return value bug (and will need to go to stable branches) and the second
>> will use new routine to pin pages.
> Initially I split the patches into 2 commits. But at last moment I
> figure out that,
> this bug fix ( better to call coding error, doesn't looks like lead to
> any runtime bug) is tightly coupled to 2nd commit for
> pin_user_pages*() conversion,
> which means we don't need the bug fix patch if we are not converting the API to
> pin_user_pages*()/ unpin_user_pages_dirty_lock(). That's the reason to
> clubbed these two
> commits into a single one.
I am not sure I understand why the two issues are connected. Failure of
either get_user_pages_fast() or pin_user_pages() will result in the same
kind of overall ioctl failure, won't it?
One concern though is that we are changing how user will see this error.
Currently Xen devicemodel (which AFAIK is the only caller) ignores it,
which is I think is wrong. So another option would be to fix this in Xen
and continue returning positive number here. I guess it's back to Paul
again.
-boris
On Wed, Jun 24, 2020 at 9:07 PM Boris Ostrovsky
<[email protected]> wrote:
>
> On 6/23/20 9:36 PM, Souptick Joarder wrote:
> > On Tue, Jun 23, 2020 at 11:11 PM Boris Ostrovsky
> > <[email protected]> wrote:
> >> On 6/23/20 7:58 AM, Souptick Joarder wrote:
> >>> In 2019, we introduced pin_user_pages*() and now we are converting
> >>> get_user_pages*() to the new API as appropriate. [1] & [2] could
> >>> be referred for more information. This is case 5 as per document [1].
> >>>
> >>> As discussed, pages need to be marked as dirty before unpinned it.
> >>>
> >>> Previously, if lock_pages() end up partially mapping pages, it used
> >>> to return -ERRNO due to which unlock_pages() have to go through
> >>> each pages[i] till *nr_pages* to validate them. This can be avoided
> >>> by passing correct number partially mapped pages & -ERRNO separately
> >>> while returning from lock_pages() due to error.
> >>> With this fix unlock_pages() doesn't need to validate pages[i] till
> >>> *nr_pages* for error scenario.
> >>
> >> This should be split into two patches please. The first one will fix the
> >> return value bug (and will need to go to stable branches) and the second
> >> will use new routine to pin pages.
> > Initially I split the patches into 2 commits. But at last moment I
> > figure out that,
> > this bug fix ( better to call coding error, doesn't looks like lead to
> > any runtime bug) is tightly coupled to 2nd commit for
> > pin_user_pages*() conversion,
> > which means we don't need the bug fix patch if we are not converting the API to
> > pin_user_pages*()/ unpin_user_pages_dirty_lock(). That's the reason to
> > clubbed these two
> > commits into a single one.
>
>
> I am not sure I understand why the two issues are connected. Failure of
> either get_user_pages_fast() or pin_user_pages() will result in the same
> kind of overall ioctl failure, won't it?
>
Sorry, I think, I need to go through the bug again for my clarity.
My understanding is ->
First, In case of success lock_pages() returns 0, so what privcmd_ioctl_dm_op()
will return to user is depend on the return value of HYPERVISOR_dm_op()
and at last all the pages are unpinned. At present I am not clear about the
return value of HYPERVISOR_dm_op(). But this path of code looks to be correct.
second, incase of failure from lock_pages() will return either -ENOSPC / -ERRNO
receive from {get|pin|_user_pages_fast() inside for loop. (at that
point there might
be some partially mapped pages as it is mapping inside the loop). Upon
receiving
-ERRNO privcmd_ioctl_dm_op() will pass the same -ERRNO to user (not the partial
mapped page count). This looks to be correct behaviour from user space.
The problem I was trying to address is, in the second scenario when
lock_pages() returns
-ERRNO and there are already existing mapped pages. In this scenario,
when unlcok_pages()
is called it will iterate till *nr_pages* to validate and unmap the
pages. But it is supposed to
unmap only the mapped pages which I am trying to address as part of bug fix.
Let me know if I am able to be in sync with what you are expecting ?
> One concern though is that we are changing how user will see this error.
My understanding from above is user will always see right -ERRNO and
will not be impacted.
Another scenario I was thinking, if {get|pin|_user_pages_fast() return
number of pages pinned
which may be fewer than the number requested. Then lock_pages()
returns 0 to caller
and caller/user space will continue with the assumption that all pages
are pinned correctly.
Is this an expected behaviour as per current code ?
> Currently Xen devicemodel (which AFAIK is the only caller) ignores it,
> which is I think is wrong. So another option would be to fix this in Xen
> and continue returning positive number here. I guess it's back to Paul
> again.
>
>
> -boris
>
>
On 6/24/20 12:41 PM, Souptick Joarder wrote:
> On Wed, Jun 24, 2020 at 9:07 PM Boris Ostrovsky
> <[email protected]> wrote:
>> On 6/23/20 9:36 PM, Souptick Joarder wrote:
>>> On Tue, Jun 23, 2020 at 11:11 PM Boris Ostrovsky
>>> <[email protected]> wrote:
>>>> On 6/23/20 7:58 AM, Souptick Joarder wrote:
>>>>> In 2019, we introduced pin_user_pages*() and now we are converting
>>>>> get_user_pages*() to the new API as appropriate. [1] & [2] could
>>>>> be referred for more information. This is case 5 as per document [1].
>>>>>
>>>>> As discussed, pages need to be marked as dirty before unpinned it.
>>>>>
>>>>> Previously, if lock_pages() end up partially mapping pages, it used
>>>>> to return -ERRNO due to which unlock_pages() have to go through
>>>>> each pages[i] till *nr_pages* to validate them. This can be avoided
>>>>> by passing correct number partially mapped pages & -ERRNO separately
>>>>> while returning from lock_pages() due to error.
>>>>> With this fix unlock_pages() doesn't need to validate pages[i] till
>>>>> *nr_pages* for error scenario.
>>>> This should be split into two patches please. The first one will fix the
>>>> return value bug (and will need to go to stable branches) and the second
>>>> will use new routine to pin pages.
>>> Initially I split the patches into 2 commits. But at last moment I
>>> figure out that,
>>> this bug fix ( better to call coding error, doesn't looks like lead to
>>> any runtime bug) is tightly coupled to 2nd commit for
>>> pin_user_pages*() conversion,
>>> which means we don't need the bug fix patch if we are not converting the API to
>>> pin_user_pages*()/ unpin_user_pages_dirty_lock(). That's the reason to
>>> clubbed these two
>>> commits into a single one.
>>
>> I am not sure I understand why the two issues are connected. Failure of
>> either get_user_pages_fast() or pin_user_pages() will result in the same
>> kind of overall ioctl failure, won't it?
>>
> Sorry, I think, I need to go through the bug again for my clarity.
>
> My understanding is ->
>
> First, In case of success lock_pages() returns 0, so what privcmd_ioctl_dm_op()
> will return to user is depend on the return value of HYPERVISOR_dm_op()
> and at last all the pages are unpinned. At present I am not clear about the
> return value of HYPERVISOR_dm_op(). But this path of code looks to be correct.
>
> second, incase of failure from lock_pages() will return either -ENOSPC / -ERRNO
> receive from {get|pin|_user_pages_fast() inside for loop. (at that
> point there might
> be some partially mapped pages as it is mapping inside the loop). Upon
> receiving
> -ERRNO privcmd_ioctl_dm_op() will pass the same -ERRNO to user (not the partial
> mapped page count). This looks to be correct behaviour from user space.
Sigh... I am sorry, you are absolutely correct. It does return -errno
on get_user_pages_fast() failure. I don't know what (or whether) I was
thinking. (And so Paul, ignore my question)
-boris
>
> The problem I was trying to address is, in the second scenario when
> lock_pages() returns
> -ERRNO and there are already existing mapped pages. In this scenario,
> when unlcok_pages()
> is called it will iterate till *nr_pages* to validate and unmap the
> pages. But it is supposed to
> unmap only the mapped pages which I am trying to address as part of bug fix.
>
> Let me know if I am able to be in sync with what you are expecting ?
>
>
>> One concern though is that we are changing how user will see this error.
> My understanding from above is user will always see right -ERRNO and
> will not be impacted.
>
> Another scenario I was thinking, if {get|pin|_user_pages_fast() return
> number of pages pinned
> which may be fewer than the number requested. Then lock_pages()
> returns 0 to caller
> and caller/user space will continue with the assumption that all pages
> are pinned correctly.
> Is this an expected behaviour as per current code ?
>
>
>> Currently Xen devicemodel (which AFAIK is the only caller) ignores it,
>> which is I think is wrong. So another option would be to fix this in Xen
>> and continue returning positive number here. I guess it's back to Paul
>> again.
>>
>>
>> -boris
>>
>>