Alex reported invalid page pointer returned with pin_user_pages_remote() from
vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
pinning with struct vfio_batch"). This problem breaks NVIDIA vfio mdev.
It turns out that it's not the fault of the vfio commit; however after vfio
switches to a full page buffer to store the page pointers it starts to expose
the problem easier.
The problem is for VM_PFNMAP vmas we should normally fail with an -EFAULT then
vfio will carry on to handle the MMIO regions. However when the bug triggered,
follow_page_mask() returned -EEXIST for such a page, which will jump over the
current page, leaving that entry in **pages untouched. However the caller is
not aware of it, hence the caller will reference the page as usual even if the
pointer data can be anything.
We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
mapping unless FOLL_GET is requested") which seems very reasonable. It could
be that when we reworked GUP with FOLL_PIN we could have overlooked that
special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
it needs to return an -EEXIST.
Since at it, add another WARN_ON_ONCE() at the -EEXIST handling to make sure we
mustn't have **pages set when reaching there, because otherwise it means the
caller will try to read a garbage right after __get_user_pages() returns.
Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than
1027e4436b6a.
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Jan Kara <[email protected]>
Cc: Jérôme Glisse <[email protected]>
Cc: John Hubbard <[email protected]>
Cc: Kirill A. Shutemov <[email protected]>
Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
Reported-by: Alex Williamson <[email protected]>
Debugged-by: Alex Williamson <[email protected]>
Signed-off-by: Peter Xu <[email protected]>
---
mm/gup.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)
diff --git a/mm/gup.c b/mm/gup.c
index f0af462ac1e2..8ebc04058e97 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, unsigned int flags)
{
/* No page to get reference */
- if (flags & FOLL_GET)
+ if (flags & (FOLL_GET | FOLL_PIN))
return -EFAULT;
if (flags & FOLL_TOUCH) {
@@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm,
/*
* Proper page table entry exists, but no corresponding
* struct page.
+ *
+ * Warn if we jumped over even with a valid **pages.
+ * It shouldn't trigger in practise, but when there's
+ * buggy returns on -EEXIST we'll warn before returning
+ * an invalid page pointer in the array.
*/
+ WARN_ON_ONCE(pages);
goto next_page;
} else if (IS_ERR(page)) {
ret = PTR_ERR(page);
--
2.32.0
On 1/24/22 19:37, Peter Xu wrote:
> Alex reported invalid page pointer returned with pin_user_pages_remote() from
> vfio after upstream commit 4b6c33b32296 ("vfio/type1: Prepare for batched
> pinning with struct vfio_batch"). This problem breaks NVIDIA vfio mdev.
>
> It turns out that it's not the fault of the vfio commit; however after vfio
> switches to a full page buffer to store the page pointers it starts to expose
> the problem easier.
>
> The problem is for VM_PFNMAP vmas we should normally fail with an -EFAULT then
> vfio will carry on to handle the MMIO regions. However when the bug triggered,
> follow_page_mask() returned -EEXIST for such a page, which will jump over the
> current page, leaving that entry in **pages untouched. However the caller is
> not aware of it, hence the caller will reference the page as usual even if the
> pointer data can be anything.
>
> We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
> mapping unless FOLL_GET is requested") which seems very reasonable. It could
> be that when we reworked GUP with FOLL_PIN we could have overlooked that
> special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
> that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
> it needs to return an -EEXIST.
>
> Since at it, add another WARN_ON_ONCE() at the -EEXIST handling to make sure we
> mustn't have **pages set when reaching there, because otherwise it means the
> caller will try to read a garbage right after __get_user_pages() returns.
>
> Attaching the Fixes to the FOLL_PIN rework commit, as it happened later than
> 1027e4436b6a.
>
> Cc: Andrea Arcangeli <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Jan Kara <[email protected]>
> Cc: Jérôme Glisse <[email protected]>
> Cc: John Hubbard <[email protected]>
> Cc: Kirill A. Shutemov <[email protected]>
> Fixes: 3faa52c03f44 ("mm/gup: track FOLL_PIN pages")
> Reported-by: Alex Williamson <[email protected]>
> Debugged-by: Alex Williamson <[email protected]>
> Signed-off-by: Peter Xu <[email protected]>
> ---
> mm/gup.c | 8 +++++++-
> 1 file changed, 7 insertions(+), 1 deletion(-)
>
> diff --git a/mm/gup.c b/mm/gup.c
> index f0af462ac1e2..8ebc04058e97 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> pte_t *pte, unsigned int flags)
> {
> /* No page to get reference */
> - if (flags & FOLL_GET)
> + if (flags & (FOLL_GET | FOLL_PIN))
> return -EFAULT;
Yes. This clearly fixes the problem that the patch describes, and also
clearly matches up with the Fixes tag. So that's correct.
>
> if (flags & FOLL_TOUCH) {
> @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm,
> /*
> * Proper page table entry exists, but no corresponding
> * struct page.
> + *
> + * Warn if we jumped over even with a valid **pages.
> + * It shouldn't trigger in practise, but when there's
> + * buggy returns on -EEXIST we'll warn before returning
> + * an invalid page pointer in the array.
> */
> + WARN_ON_ONCE(pages);
Here, however, I think we need to consider this a little more carefully,
and attempt to actually fix up this case. It is never going to be OK
here, to return a **pages array that has these little landmines of
potentially uninitialized pointers. And so continuing on *at all* seems
very wrong.
Can we bail out at this point, without breaking the world? I think we can...
Also: this part, even if it remains as is, should be a separate fix and a
separate patch, IMHO.
thanks,
--
John Hubbard
NVIDIA
> goto next_page;
> } else if (IS_ERR(page)) {
> ret = PTR_ERR(page);
On Wed, Jan 26, 2022 at 04:15:02PM -0800, John Hubbard wrote:
> > We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
> > mapping unless FOLL_GET is requested") which seems very reasonable. It could
> > be that when we reworked GUP with FOLL_PIN we could have overlooked that
> > special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
> > that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
> > it needs to return an -EEXIST.
It sounds like this commit was all about changing the behavior of
follow_page()
It feels like that is another ill-fated holdover from the effort to
make pageless DAX that doesn't exist anymore.
Can we safely drop it now?
Regardless..
> > diff --git a/mm/gup.c b/mm/gup.c
> > index f0af462ac1e2..8ebc04058e97 100644
> > +++ b/mm/gup.c
> > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > pte_t *pte, unsigned int flags)
> > {
> > /* No page to get reference */
> > - if (flags & FOLL_GET)
> > + if (flags & (FOLL_GET | FOLL_PIN))
> > return -EFAULT;
>
> Yes. This clearly fixes the problem that the patch describes, and also
> clearly matches up with the Fixes tag. So that's correct.
It is a really confusing though, why not just always return -EEXIST
here?
The caller will always see the error code and refrain from trying to
pin it and unwind upwards, just the same as -EFAULT.
We shouldn't need to test the flags at this point at all.
> > if (flags & FOLL_TOUCH) {
> > @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm,
> > /*
> > * Proper page table entry exists, but no corresponding
> > * struct page.
> > + *
> > + * Warn if we jumped over even with a valid **pages.
> > + * It shouldn't trigger in practise, but when there's
> > + * buggy returns on -EEXIST we'll warn before returning
> > + * an invalid page pointer in the array.
> > */
> > + WARN_ON_ONCE(pages);
>
> Here, however, I think we need to consider this a little more carefully,
> and attempt to actually fix up this case. It is never going to be OK
> here, to return a **pages array that has these little landmines of
> potentially uninitialized pointers. And so continuing on *at all* seems
> very wrong.
Indeed, it should just be like this:
@@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm,
* Proper page table entry exists, but no corresponding
* struct page.
*/
+ if (pages) {
+ page = ERR_PTR(-EFAULT);
+ goto out;
+ }
goto next_page;
} else if (IS_ERR(page)) {
ret = PTR_ERR(page);
Jason
Hi, Jason, John,
On Wed, Jan 26, 2022 at 08:42:06PM -0400, Jason Gunthorpe wrote:
> On Wed, Jan 26, 2022 at 04:15:02PM -0800, John Hubbard wrote:
>
> > > We had that -EEXIST logic since commit 1027e4436b6a ("mm: make GUP handle pfn
> > > mapping unless FOLL_GET is requested") which seems very reasonable. It could
> > > be that when we reworked GUP with FOLL_PIN we could have overlooked that
> > > special path in commit 3faa52c03f44 ("mm/gup: track FOLL_PIN pages"), even if
> > > that commit rightfully touched up follow_devmap_pud() on checking FOLL_PIN when
> > > it needs to return an -EEXIST.
>
> It sounds like this commit was all about changing the behavior of
> follow_page()
>
> It feels like that is another ill-fated holdover from the effort to
> make pageless DAX that doesn't exist anymore.
>
> Can we safely drop it now?
Not quite familiar with the dax effort, but just to note again that this patch
fixes an immediate breakage, hence IMHO we should still merge it first (not
mention it's a one-liner..).
>
> Regardless..
>
> > > diff --git a/mm/gup.c b/mm/gup.c
> > > index f0af462ac1e2..8ebc04058e97 100644
> > > +++ b/mm/gup.c
> > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > > pte_t *pte, unsigned int flags)
> > > {
> > > /* No page to get reference */
> > > - if (flags & FOLL_GET)
> > > + if (flags & (FOLL_GET | FOLL_PIN))
> > > return -EFAULT;
> >
> > Yes. This clearly fixes the problem that the patch describes, and also
> > clearly matches up with the Fixes tag. So that's correct.
>
> It is a really confusing though, why not just always return -EEXIST
> here?
Because in current code GUP handles -EEXIST and -EFAULT differently?
We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from
Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6).
Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too.
They seem to service the same goal and it seems to be designed that -EEXIST
shouldn't fail GUP immediately.
>
> The caller will always see the error code and refrain from trying to
> pin it and unwind upwards, just the same as -EFAULT.
>
> We shouldn't need to test the flags at this point at all.
>
> > > if (flags & FOLL_TOUCH) {
> > > @@ -1181,7 +1181,13 @@ static long __get_user_pages(struct mm_struct *mm,
> > > /*
> > > * Proper page table entry exists, but no corresponding
> > > * struct page.
> > > + *
> > > + * Warn if we jumped over even with a valid **pages.
> > > + * It shouldn't trigger in practise, but when there's
> > > + * buggy returns on -EEXIST we'll warn before returning
> > > + * an invalid page pointer in the array.
> > > */
> > > + WARN_ON_ONCE(pages);
> >
> > Here, however, I think we need to consider this a little more carefully,
> > and attempt to actually fix up this case. It is never going to be OK
> > here, to return a **pages array that has these little landmines of
> > potentially uninitialized pointers. And so continuing on *at all* seems
> > very wrong.
>
> Indeed, it should just be like this:
>
> @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm,
> * Proper page table entry exists, but no corresponding
> * struct page.
> */
> + if (pages) {
> + page = ERR_PTR(-EFAULT);
> + goto out;
> + }
> goto next_page;
> } else if (IS_ERR(page)) {
> ret = PTR_ERR(page);
IIUC not failing -EEXIST immediately seems to be what we want. We've discussed
the only (unless I overlooked something...) two sites that can return -EEXIST
in follow_page_mask() context; if it is triggered somewhere else unexpectedly
it is a programming error and needs fixing, imho. Hence the gup code shouldn't
rely on the -EEXIST -> -EFAULT transition to work at all in any existing case.
From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of
-EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO.
Thanks,
--
Peter Xu
On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote:
> > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > index f0af462ac1e2..8ebc04058e97 100644
> > > > +++ b/mm/gup.c
> > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > > > pte_t *pte, unsigned int flags)
> > > > {
> > > > /* No page to get reference */
> > > > - if (flags & FOLL_GET)
> > > > + if (flags & (FOLL_GET | FOLL_PIN))
> > > > return -EFAULT;
> > >
> > > Yes. This clearly fixes the problem that the patch describes, and also
> > > clearly matches up with the Fixes tag. So that's correct.
> >
> > It is a really confusing though, why not just always return -EEXIST
> > here?
>
> Because in current code GUP handles -EEXIST and -EFAULT differently?
That has nothing to do with here. We shouldn't be deciding what the
top layer does way down here. Return the correct error code for what
was discovered at this layer the upper loop should make the decision
what it should do
> We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from
> Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6).
> Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too.
> They seem to service the same goal and it seems to be designed that -EEXIST
> shouldn't fail GUP immediately.
It must fail GUP immeidately if there is a pages list.
Callers that want an early failure must pass in NULL for pages, it is
just that simple. It has nothing to do with the FOLL flags.
A WARN_ON would be appropriate to compare the FOLL flags against the
pages. eg FOLL_GET without a pages is nonsense and should be
immediately aborted. On the other hand, we avoid this by construction
internal to gup.c
> > > Here, however, I think we need to consider this a little more carefully,
> > > and attempt to actually fix up this case. It is never going to be OK
> > > here, to return a **pages array that has these little landmines of
> > > potentially uninitialized pointers. And so continuing on *at all* seems
> > > very wrong.
> >
> > Indeed, it should just be like this:
> >
> > @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm,
> > * Proper page table entry exists, but no corresponding
> > * struct page.
> > */
> > + if (pages) {
> > + page = ERR_PTR(-EFAULT);
> > + goto out;
> > + }
> > goto next_page;
> > } else if (IS_ERR(page)) {
> > ret = PTR_ERR(page);
>
> IIUC not failing -EEXIST immediately seems to be what we want.
Which is what this does, for the only case it is acceptable - a null
page list.
> From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of
> -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO.
Again, that is upside down, -EEXIST should not be a illegal return. It
should be valid, have a defined meaning 'the vaddr exists but has no
struct page' and the top loop, and only the top loop, makes the
decision what to do about it.
Jason
On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote:
> On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote:
>
> > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > index f0af462ac1e2..8ebc04058e97 100644
> > > > > +++ b/mm/gup.c
> > > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > > > > pte_t *pte, unsigned int flags)
> > > > > {
> > > > > /* No page to get reference */
> > > > > - if (flags & FOLL_GET)
> > > > > + if (flags & (FOLL_GET | FOLL_PIN))
> > > > > return -EFAULT;
> > > >
> > > > Yes. This clearly fixes the problem that the patch describes, and also
> > > > clearly matches up with the Fixes tag. So that's correct.
> > >
> > > It is a really confusing though, why not just always return -EEXIST
> > > here?
> >
> > Because in current code GUP handles -EEXIST and -EFAULT differently?
>
> That has nothing to do with here. We shouldn't be deciding what the
> top layer does way down here. Return the correct error code for what
> was discovered at this layer the upper loop should make the decision
> what it should do
>
> > We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from
> > Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6).
> > Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too.
> > They seem to service the same goal and it seems to be designed that -EEXIST
> > shouldn't fail GUP immediately.
>
> It must fail GUP immeidately if there is a pages list.
Right, but my point is we don't have an user at all for follow_page_mask()
returning -EEXIST with a **page which is non-NULL. Or did I miss it?
>
> Callers that want an early failure must pass in NULL for pages, it is
> just that simple. It has nothing to do with the FOLL flags.
>
> A WARN_ON would be appropriate to compare the FOLL flags against the
> pages. eg FOLL_GET without a pages is nonsense and should be
> immediately aborted. On the other hand, we avoid this by construction
> internal to gup.c
We have something like that already, although it's only a VM_BUG_ON() not a
BUG_ON() or WARN_ON() at the entry of __get_user_pages():
VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>
> > > > Here, however, I think we need to consider this a little more carefully,
> > > > and attempt to actually fix up this case. It is never going to be OK
> > > > here, to return a **pages array that has these little landmines of
> > > > potentially uninitialized pointers. And so continuing on *at all* seems
> > > > very wrong.
> > >
> > > Indeed, it should just be like this:
> > >
> > > @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm,
> > > * Proper page table entry exists, but no corresponding
> > > * struct page.
> > > */
> > > + if (pages) {
> > > + page = ERR_PTR(-EFAULT);
> > > + goto out;
> > > + }
> > > goto next_page;
> > > } else if (IS_ERR(page)) {
> > > ret = PTR_ERR(page);
> >
> > IIUC not failing -EEXIST immediately seems to be what we want.
>
> Which is what this does, for the only case it is acceptable - a null
> page list.
>
> > From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of
> > -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO.
>
> Again, that is upside down, -EEXIST should not be a illegal return. It
> should be valid, have a defined meaning 'the vaddr exists but has no
> struct page' and the top loop, and only the top loop, makes the
> decision what to do about it.
I believe this works too and I think I get your point, but as stated above it's
just not used yet so the path is not useful to any real code path.
Especially with above VM_BUG_ON() it means if we'll go into the "if (pages)" we
should have already triggered the VM_BUG_ON() condition when entering the function.
Thanks,
--
Peter Xu
On Fri, Jan 28, 2022 at 09:36:14AM +0800, Peter Xu wrote:
> On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote:
> > On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote:
> >
> > > > > > diff --git a/mm/gup.c b/mm/gup.c
> > > > > > index f0af462ac1e2..8ebc04058e97 100644
> > > > > > +++ b/mm/gup.c
> > > > > > @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> > > > > > pte_t *pte, unsigned int flags)
> > > > > > {
> > > > > > /* No page to get reference */
> > > > > > - if (flags & FOLL_GET)
> > > > > > + if (flags & (FOLL_GET | FOLL_PIN))
> > > > > > return -EFAULT;
> > > > >
> > > > > Yes. This clearly fixes the problem that the patch describes, and also
> > > > > clearly matches up with the Fixes tag. So that's correct.
> > > >
> > > > It is a really confusing though, why not just always return -EEXIST
> > > > here?
> > >
> > > Because in current code GUP handles -EEXIST and -EFAULT differently?
> >
> > That has nothing to do with here. We shouldn't be deciding what the
> > top layer does way down here. Return the correct error code for what
> > was discovered at this layer the upper loop should make the decision
> > what it should do
> >
> > > We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from
> > > Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6).
> > > Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too.
> > > They seem to service the same goal and it seems to be designed that -EEXIST
> > > shouldn't fail GUP immediately.
> >
> > It must fail GUP immeidately if there is a pages list.
>
> Right, but my point is we don't have an user at all for follow_page_mask()
> returning -EEXIST with a **page which is non-NULL. Or did I miss
> it?
We don't, but that isn't the point - it is about understandability not
utility.
There are only two call chains that involve follow_page_mask(), one
always wants the -EEXIST and the other wants -EEXIST sometimes and
sometimes wants to return -EFAULT for -EEXIST
The solution is not to change follow_page_mask() but to be consistent
throughout that -EEXIST means we encountered a populated but
non-struct page PTE and the single place that wants -EEXIST to be
reported as -EFAULT (__get_user_pages) should make that
transformation.
For instance, should we return -EEXIST in other cases like devmap and
more that have PTEs present but are not returnable? It is already
really confusing (and probably wrong) that we sometimes return NULL
for populated PTEs. NULL causes faultin_page() to be called on
something we already know is populated, seems like nonsense.
I certainly don't want to see every place like that doing some if to
transform -EEXIST into -EFAULT.
> > Callers that want an early failure must pass in NULL for pages, it is
> > just that simple. It has nothing to do with the FOLL flags.
> >
> > A WARN_ON would be appropriate to compare the FOLL flags against the
> > pages. eg FOLL_GET without a pages is nonsense and should be
> > immediately aborted. On the other hand, we avoid this by construction
> > internal to gup.c
>
> We have something like that already, although it's only a VM_BUG_ON() not a
> BUG_ON() or WARN_ON() at the entry of __get_user_pages():
>
> VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
Right
> I believe this works too and I think I get your point, but as stated above it's
> just not used yet so the path is not useful to any real code path.
It is not about useful, it is about understandable and
consistency.. There should be clear rules about when and what errno to
return in every case, we should trend in that direction.
Jason
On 1/27/22 17:36, Peter Xu wrote:
> On Thu, Jan 27, 2022 at 11:25:38AM -0400, Jason Gunthorpe wrote:
>> On Thu, Jan 27, 2022 at 05:19:56PM +0800, Peter Xu wrote:
>>
>>>>>> diff --git a/mm/gup.c b/mm/gup.c
>>>>>> index f0af462ac1e2..8ebc04058e97 100644
>>>>>> +++ b/mm/gup.c
>>>>>> @@ -440,7 +440,7 @@ static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
>>>>>> pte_t *pte, unsigned int flags)
>>>>>> {
>>>>>> /* No page to get reference */
>>>>>> - if (flags & FOLL_GET)
>>>>>> + if (flags & (FOLL_GET | FOLL_PIN))
>>>>>> return -EFAULT;
>>>>>
>>>>> Yes. This clearly fixes the problem that the patch describes, and also
>>>>> clearly matches up with the Fixes tag. So that's correct.
>>>>
>>>> It is a really confusing though, why not just always return -EEXIST
>>>> here?
>>>
>>> Because in current code GUP handles -EEXIST and -EFAULT differently?
>>
>> That has nothing to do with here. We shouldn't be deciding what the
>> top layer does way down here. Return the correct error code for what
>> was discovered at this layer the upper loop should make the decision
>> what it should do
>>
>>> We do early bail out on -EFAULT. -EEXIST was first introduced in 2015 from
>>> Kirill for not failing some mlock() or mmap(MAP_POPULATE) on dax (1027e4436b6).
>>> Then in 2017 it got used again with pud-sized thp (a00cc7d9dd93d) on dax too.
>>> They seem to service the same goal and it seems to be designed that -EEXIST
>>> shouldn't fail GUP immediately.
>>
>> It must fail GUP immeidately if there is a pages list.
>
> Right, but my point is we don't have an user at all for follow_page_mask()
> returning -EEXIST with a **page which is non-NULL. Or did I miss it?
>
What you are missing is that other people are potentially writing code
that we haven't seen yet, and that code may use follow_page_mask(). The
idea, therefore, is to make it a good building block.
>>
>> Callers that want an early failure must pass in NULL for pages, it is
>> just that simple. It has nothing to do with the FOLL flags.
>>
>> A WARN_ON would be appropriate to compare the FOLL flags against the
>> pages. eg FOLL_GET without a pages is nonsense and should be
>> immediately aborted. On the other hand, we avoid this by construction
>> internal to gup.c
>
> We have something like that already, although it's only a VM_BUG_ON() not a
> BUG_ON() or WARN_ON() at the entry of __get_user_pages():
>
> VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>
>>
>>>>> Here, however, I think we need to consider this a little more carefully,
>>>>> and attempt to actually fix up this case. It is never going to be OK
>>>>> here, to return a **pages array that has these little landmines of
>>>>> potentially uninitialized pointers. And so continuing on *at all* seems
>>>>> very wrong.
>>>>
>>>> Indeed, it should just be like this:
>>>>
>>>> @@ -1182,6 +1182,10 @@ static long __get_user_pages(struct mm_struct *mm,
>>>> * Proper page table entry exists, but no corresponding
>>>> * struct page.
>>>> */
>>>> + if (pages) {
>>>> + page = ERR_PTR(-EFAULT);
>>>> + goto out;
>>>> + }
>>>> goto next_page;
>>>> } else if (IS_ERR(page)) {
>>>> ret = PTR_ERR(page);
>>>
>>> IIUC not failing -EEXIST immediately seems to be what we want.
>>
>> Which is what this does, for the only case it is acceptable - a null
>> page list.
>>
>>> From that POV, WARN_ON_ONCE() helps better on exposing an illegal return of
>>> -EEXIST (as mentioned in the commit message) than the -EFAULT convertion, IMHO.
>>
>> Again, that is upside down, -EEXIST should not be a illegal return. It
>> should be valid, have a defined meaning 'the vaddr exists but has no
>> struct page' and the top loop, and only the top loop, makes the
>> decision what to do about it.
>
> I believe this works too and I think I get your point, but as stated above it's
> just not used yet so the path is not useful to any real code path.
It's a really important point. This lower level routine needs to be fixed up
so that it behaves in a way that is both correct and reasonable. And it is not
reasonable to load up **pages with garbage pointers, regardless of whether any
callers are suffering from that yet.
Again, the argument that "no one uses this code path yet, so it's OK for it to
have pitfalls and weirdness" is not the way to go, at least, not if it's
feasible to provide a better alternative. And here, it is very feasible.
>
> Especially with above VM_BUG_ON() it means if we'll go into the "if (pages)" we
> should have already triggered the VM_BUG_ON() condition when entering the function.
>
And let's not forget that VM_BUG_ON() is a debug-only assertion. So you
may very well *not* have triggered anything.
Even if code is functionally correct, there are still API design
considerations, such as, "how usable is it?", and such. For example,
Rusty Russell's "How do I make this hard to misuse?" rules [1].
[1] https://ozlabs.org/~rusty/index.cgi/tech/2008-03-30.html
thanks,
--
John Hubbard
NVIDIA
On Thu, Jan 27, 2022 at 10:31:27PM -0400, Jason Gunthorpe wrote:
> We don't, but that isn't the point - it is about understandability not
> utility.
>
> There are only two call chains that involve follow_page_mask(), one
> always wants the -EEXIST and the other wants -EEXIST sometimes and
> sometimes wants to return -EFAULT for -EEXIST
>
> The solution is not to change follow_page_mask() but to be consistent
> throughout that -EEXIST means we encountered a populated but
> non-struct page PTE and the single place that wants -EEXIST to be
> reported as -EFAULT (__get_user_pages) should make that
> transformation.
Currently -EEXIST is not defined like that. To me, it's defined as:
This pte is populated but there's no page struct that binds to it. Let's skip
this pte and continue (because we've checked there's no **pages requested).
That services use cases like mlock() very well. IMHO that's fine.
If I understand correctly, what is proposed above is actually to redefine
-EEXIST into:
This pte is populated but there's no page struct that binds to it.
Then the caller will decide how we should react on it.
I kind of agree that the latter one looks cleaner, but I don't have extremely
strong reason to refactor it immediately.
More importantly, I still don't know what will be the side effect if we
immediately remove the two checks in follow_page_mask(), and how much changeset
we'll need to redefine -EEXIST here.
Let's discuss the two sites that returns -EEXIST one by one..
follow_pfn_pte() should be relatively easy. If with the -EEXIST to -EFAULT
convertion proposed in __get_user_pages(), we could have dropped below two
lines:
---8<---
diff --git a/mm/gup.c b/mm/gup.c
index 8ebc04058e97..b3a109114968 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
pte_t *pte, unsigned int flags)
{
- /* No page to get reference */
- if (flags & (FOLL_GET | FOLL_PIN))
- return -EFAULT;
-
if (flags & FOLL_TOUCH) {
pte_t entry = *pte;
---8<---
Then the convertion will make sure it behaves like before. There is a very
slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at
least not in the vfio use case - I believe most of the guest pages should
already have the dirty bit set anyways, and since they're pinned they won't be
a good candidate for reclaims already.
IMHO the other site follow_devmap_pud() is tricker, because we can't simply
remove the two lines here:
---8<---
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 406a3c28c026..0fe2253f922c 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1171,15 +1171,6 @@ struct page *follow_devmap_pud(struct vm_area_struct *vma, unsigned long addr,
if (flags & FOLL_TOUCH)
touch_pud(vma, addr, pud, flags);
- /*
- * device mapped pages can only be returned if the
- * caller will manage the page reference count.
- *
- * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
- */
- if (!(flags & (FOLL_GET | FOLL_PIN)))
- return ERR_PTR(-EEXIST);
-
pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
*pgmap = get_dev_pagemap(pfn, *pgmap);
if (!*pgmap)
---8<---
Because with the old code we'll 100% return -EEXIST as long as we don't have
GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to
try call get_dev_pagemap(), if it returned something we'll start to fetch the
page.
That part looks fine, __get_user_pages() will still continue but ignore the
page* returned.
But what if get_dev_pagemap() returned NULL? Then we'll start to return
-EFAULT where we used to constantly return -EEXIST. I'm afraid it'll stop
working for some mlock() or MAP_POPULATE case that Kirill wanted to fix before,
then it could break things.
We could definitely add more code to fix up above to make sure mlock() on huge
pud of DAX will keep working. However I hope I have somehow clarified the
point that it's not as trivial as it looks like to change the semantics of
-EEXIST for follow_page_mask() immediately, and to fix the vfio mdev breakage
we'd better merge the oneliner fix first even if we want to rework -EEXIST.
Summary: I see no problem at all on either way, it's just that for this bug fix
it's straightforward to keep the -EEXIST definition, rather than clean it up,
because otherwise the change will grow. Note that we will need to backport
this patch, I think it's proper we keep the changeset small and leave the
cleanups for later.
>
> For instance, should we return -EEXIST in other cases like devmap and
> more that have PTEs present but are not returnable? It is already
> really confusing (and probably wrong) that we sometimes return NULL
> for populated PTEs. NULL causes faultin_page() to be called on
> something we already know is populated, seems like nonsense.
Could you elaborate what's the "return NULL" confusion you mentioned?
>
> I certainly don't want to see every place like that doing some if to
> transform -EEXIST into -EFAULT.
>
> > > Callers that want an early failure must pass in NULL for pages, it is
> > > just that simple. It has nothing to do with the FOLL flags.
> > >
> > > A WARN_ON would be appropriate to compare the FOLL flags against the
> > > pages. eg FOLL_GET without a pages is nonsense and should be
> > > immediately aborted. On the other hand, we avoid this by construction
> > > internal to gup.c
> >
> > We have something like that already, although it's only a VM_BUG_ON() not a
> > BUG_ON() or WARN_ON() at the entry of __get_user_pages():
> >
> > VM_BUG_ON(!!pages != !!(gup_flags & (FOLL_GET | FOLL_PIN)));
>
> Right
>
> > I believe this works too and I think I get your point, but as stated above it's
> > just not used yet so the path is not useful to any real code path.
>
> It is not about useful, it is about understandable and
> consistency.. There should be clear rules about when and what errno to
> return in every case, we should trend in that direction.
I see that both you and John has a strong preference on at least the
WARN_ON_ONCE() in the patch.
Do you think it's okay I repost with only the one-liner fix, which will keep
the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up.
For my own preference, I really like to keep the patch as-is, because as I
mentioned it helps to identify issues with existing code already (it'll even
help the reader when looking at the -EEXIST handling because that's not
obvious). But I don't strongly ask for that, I still sincerely wish the
discussion won't block the real simple bugfix.
Thanks,
--
Peter Xu
On Thu, Jan 27, 2022 at 06:32:27PM -0800, John Hubbard wrote:
> What you are missing is that other people are potentially writing code
> that we haven't seen yet, and that code may use follow_page_mask(). The
> idea, therefore, is to make it a good building block.
Yes, actually that's why I attached the WARN_ON_ONCE() since when people do add
that new code they'll quickly discover before posting, as long as they'll still
test the patch..
Then people may even start to wonder whether we should rework -EEXIST if
necessary, and IMHO that's the better place for a rework, not within the bugfix
here.
Please check the email I replied to Jason. I hope both of you would agree that
I can repost with the fix only but drop the WARN_ON_ONCE(), or any explicit
suggestion on how we should move forward with the fix would be welcomed.
Thanks,
--
Peter Xu
On 1/27/22 19:26, Peter Xu wrote:
...
> I see that both you and John has a strong preference on at least the
> WARN_ON_ONCE() in the patch.
>
> Do you think it's okay I repost with only the one-liner fix, which will keep
> the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up.
>
I think that's OK with me, anyway. You'll recall that I initially requested
that you split this into two patches, after all.
Would you like me to post a follow-up that does the refactoring that Jason
and I are requesting? I see that we have some fundamental differences in
opinion about how this should be done, so rather than drive you crazy with
debating, maybe that would be smoother? :)
thanks,
--
John Hubbard
NVIDIA
On Thu, Jan 27, 2022 at 09:57:34PM -0800, John Hubbard wrote:
> On 1/27/22 19:26, Peter Xu wrote:
> ...
> > I see that both you and John has a strong preference on at least the
> > WARN_ON_ONCE() in the patch.
> >
> > Do you think it's okay I repost with only the one-liner fix, which will keep
> > the Fixes but drop the WARN_ON_ONCE? Then we can leave the rest as follow up.
> >
>
> I think that's OK with me, anyway. You'll recall that I initially requested
> that you split this into two patches, after all.
>
> Would you like me to post a follow-up that does the refactoring that Jason
> and I are requesting? I see that we have some fundamental differences in
> opinion about how this should be done, so rather than drive you crazy with
> debating, maybe that would be smoother? :)
Sure thing. :-)
Please just double check that the pud devmap will still always work on mlock().
I believe both of you are much more familiar than me on that; it just still
seems a little bit tricky.
I'll repost this one, thanks.
--
Peter Xu
On Fri, Jan 28, 2022 at 11:26:16AM +0800, Peter Xu wrote:
> diff --git a/mm/gup.c b/mm/gup.c
> index 8ebc04058e97..b3a109114968 100644
> +++ b/mm/gup.c
> @@ -439,10 +439,6 @@ static struct page *no_page_table(struct vm_area_struct *vma,
> static int follow_pfn_pte(struct vm_area_struct *vma, unsigned long address,
> pte_t *pte, unsigned int flags)
> {
> - /* No page to get reference */
> - if (flags & (FOLL_GET | FOLL_PIN))
> - return -EFAULT;
> -
> if (flags & FOLL_TOUCH) {
> pte_t entry = *pte;
>
> Then the convertion will make sure it behaves like before. There is a very
> slight side effect on FOLL_TOUCH above but it shouldn't be a major concern, at
> least not in the vfio use case
You mean that it touches the page even though it will not return the
page? This is what devmap is doing already :\
> IMHO the other site follow_devmap_pud() is tricker, because we can't simply
> remove the two lines here:
Sure, but I didn't ask to fix everything, just trend toward something
sane - if you drop the 4 line above then your patch will do what it
needs to do, we don't need to jump into devmap to fix this bug.
> - /*
> - * device mapped pages can only be returned if the
> - * caller will manage the page reference count.
> - *
> - * At least one of FOLL_GET | FOLL_PIN must be set, so assert that here:
> - */
> - if (!(flags & (FOLL_GET | FOLL_PIN)))
> - return ERR_PTR(-EEXIST);
> -
> pfn += (addr & ~PUD_MASK) >> PAGE_SHIFT;
> *pgmap = get_dev_pagemap(pfn, *pgmap);
> if (!*pgmap)
>
> Because with the old code we'll 100% return -EEXIST as long as we don't have
> GET|PIN (e.g. mlock), but then later when the two lines removed we'll start to
> try call get_dev_pagemap(), if it returned something we'll start to fetch the
> page.
This is part of the "return codes don't make any sense" I was
complaining about..
devmap has a populated PTE and a struct page, it shouldn't be
returning -EFAULT. Failure to get the struct page is either EEXIST or
NULL (take a fault to resolve the race).
[as we've discussed before the get_dev_pagemap shouldn't even be here
at all, I suppose this is why it is so weird looking]
> That part looks fine, __get_user_pages() will still continue but ignore the
> page* returned.
IMHO it would be clearer if the __get_user_pages() checked the struct
page zone and rejected the cases it doesn't want than to sprinkle
these checks all over the place based on PMD flags.
We want to go in this direction anyhow so we can remove the pmd_dax()
flag..
> But what if get_dev_pagemap() returned NULL? Then we'll start to return
> -EFAULT where we used to constantly return -EEXIST.
IHMO most of the EFAULTs in the walker call chain are pretty
questionable..
they should ideally be
EEXIST or NULL. eg why does get_gate_page() return EFAULT for a
non-present page? That should be NULL and faultin_page() should
generate the EFAULT
> > For instance, should we return -EEXIST in other cases like devmap and
> > more that have PTEs present but are not returnable? It is already
> > really confusing (and probably wrong) that we sometimes return NULL
> > for populated PTEs. NULL causes faultin_page() to be called on
> > something we already know is populated, seems like nonsense.
>
> Could you elaborate what's the "return NULL" confusion you mentioned?
Returning null for a popoulated PTE as a permanent failure is
nonsense.
NULL means 'call faultin_page()', it should be used on populated pages
that ought to have a struct page but for some racy reason is
unavailable. The faultin should resolve that race and make the pte
either fully non-present or restore the struct page.
> > It is not about useful, it is about understandable and
> > consistency.. There should be clear rules about when and what errno to
> > return in every case, we should trend in that direction.
>
> I see that both you and John has a strong preference on at least the
> WARN_ON_ONCE() in the patch.
Well, I wanted to see the WARN_ON_ONCE avoided by making it
unnecessary by construction, not just deleted :\
Jason