2020-03-03 00:25:38

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v2 2/2] mm/gup/writeback: add callbacks for inaccessible pages

With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.

While cpu accesses will trigger a fault that can be resolved, I/O
accesses will just fail. We need to add a callback into architecture
code for places that will do I/O, namely when writeback is started or
when a page reference is taken.

This is not only to enable paging, file backing etc, it is also
necessary to protect the host against a malicious user space. For
example a bad QEMU could simply start direct I/O on such protected
memory. We do not want userspace to be able to trigger I/O errors and
thus the logic is "whenever somebody accesses that page (gup) or does
I/O, make sure that this page can be accessed". When the guest tries
to access that page we will wait in the page fault handler for
writeback to have finished and for the page_ref to be the expected
value.

On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling
we can tackle this when we know the details.

Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Will Deacon <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
---
include/linux/gfp.h | 6 ++++++
mm/gup.c | 27 ++++++++++++++++++++++++---
mm/page-writeback.c | 5 +++++
3 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+ return 0;
+}
+#endif

struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 81a95fbe9901..15c47e0e86f8 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
struct page *page;
spinlock_t *ptl;
pte_t *ptep, pte;
+ int ret;

/* FOLL_GET and FOLL_PIN are mutually exclusive. */
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
@@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (is_zero_pfn(pte_pfn(pte))) {
page = pte_page(pte);
} else {
- int ret;
-
ret = follow_pfn_pte(vma, address, ptep, flags);
page = ERR_PTR(ret);
goto out;
@@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}

if (flags & FOLL_SPLIT && PageTransCompound(page)) {
- int ret;
get_page(page);
pte_unmap_unlock(ptep, ptl);
lock_page(page);
@@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(-ENOMEM);
goto out;
}
+ /*
+ * We need to make the page accessible if we are actually going to
+ * poke at its content (pin), otherwise we can leave it inaccessible.
+ * If we cannot make the page accessible, fail.
+ */
+ if (flags & FOLL_PIN) {
+ ret = arch_make_page_accessible(page);
+ if (ret) {
+ unpin_user_page(page);
+ page = ERR_PTR(ret);
+ goto out;
+ }
+ }
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
@@ -2162,6 +2173,16 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,

VM_BUG_ON_PAGE(compound_head(page) != head, page);

+ /*
+ * We need to make the page accessible if we are actually
+ * going to poke at its content (pin), otherwise we can
+ * leave it inaccessible. If the page cannot be made
+ * accessible, fail.
+ */
+ if ((flags & FOLL_PIN) && arch_make_page_accessible(page)) {
+ unpin_user_page(page);
+ goto pte_unmap;
+ }
SetPageReferenced(page);
pages[*nr] = page;
(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..8384be5a2758 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
}
unlock_page_memcg(page);
+ /*
+ * If writeback has been triggered on a page that cannot be made
+ * accessible, it is too late.
+ */
+ WARN_ON(arch_make_page_accessible(page));
return ret;

}
--
2.24.1


2020-03-03 08:03:59

by John Hubbard

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/gup/writeback: add callbacks for inaccessible pages

On 3/2/20 4:25 PM, Claudio Imbrenda wrote:
> With the introduction of protected KVM guests on s390 there is now a
> concept of inaccessible pages. These pages need to be made accessible
> before the host can access them.
>
> While cpu accesses will trigger a fault that can be resolved, I/O
> accesses will just fail. We need to add a callback into architecture
> code for places that will do I/O, namely when writeback is started or
> when a page reference is taken.
>
> This is not only to enable paging, file backing etc, it is also
> necessary to protect the host against a malicious user space. For
> example a bad QEMU could simply start direct I/O on such protected
> memory. We do not want userspace to be able to trigger I/O errors and
> thus the logic is "whenever somebody accesses that page (gup) or does
> I/O, make sure that this page can be accessed". When the guest tries
> to access that page we will wait in the page fault handler for
> writeback to have finished and for the page_ref to be the expected
> value.
>
> On s390x the function is not supposed to fail, so it is ok to use a
> WARN_ON on failure. If we ever need some more finegrained handling
> we can tackle this when we know the details.
>
> Signed-off-by: Claudio Imbrenda <[email protected]>
> Acked-by: Will Deacon <[email protected]>
> Reviewed-by: David Hildenbrand <[email protected]>
> Reviewed-by: Christian Borntraeger <[email protected]>
> ---
> include/linux/gfp.h | 6 ++++++
> mm/gup.c | 27 ++++++++++++++++++++++++---
> mm/page-writeback.c | 5 +++++
> 3 files changed, 35 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> index e5b817cb86e7..be2754841369 100644
> --- a/include/linux/gfp.h
> +++ b/include/linux/gfp.h
> @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
> #ifndef HAVE_ARCH_ALLOC_PAGE
> static inline void arch_alloc_page(struct page *page, int order) { }
> #endif
> +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> +static inline int arch_make_page_accessible(struct page *page)
> +{
> + return 0;
> +}
> +#endif
>
> struct page *
> __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
> diff --git a/mm/gup.c b/mm/gup.c
> index 81a95fbe9901..15c47e0e86f8 100644
> --- a/mm/gup.c
> +++ b/mm/gup.c
> @@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> struct page *page;
> spinlock_t *ptl;
> pte_t *ptep, pte;
> + int ret;
>
> /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> @@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> if (is_zero_pfn(pte_pfn(pte))) {
> page = pte_page(pte);
> } else {
> - int ret;
> -
> ret = follow_pfn_pte(vma, address, ptep, flags);
> page = ERR_PTR(ret);
> goto out;
> @@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> }
>
> if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> - int ret;
> get_page(page);
> pte_unmap_unlock(ptep, ptl);
> lock_page(page);
> @@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
> page = ERR_PTR(-ENOMEM);
> goto out;
> }
> + /*
> + * We need to make the page accessible if we are actually going to
> + * poke at its content (pin), otherwise we can leave it inaccessible.
> + * If we cannot make the page accessible, fail.
> + */
> + if (flags & FOLL_PIN) {
> + ret = arch_make_page_accessible(page);
> + if (ret) {
> + unpin_user_page(page);
> + page = ERR_PTR(ret);
> + goto out;
> + }
> + }


That looks good.


> if (flags & FOLL_TOUCH) {
> if ((flags & FOLL_WRITE) &&
> !pte_dirty(pte) && !PageDirty(page))
> @@ -2162,6 +2173,16 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
>
> VM_BUG_ON_PAGE(compound_head(page) != head, page);
>
> + /*
> + * We need to make the page accessible if we are actually
> + * going to poke at its content (pin), otherwise we can
> + * leave it inaccessible. If the page cannot be made
> + * accessible, fail.
> + */


This part looks good, so these two points are just nits:

That's a little bit of repeating what the code does, in the comments. How about:

/*
* We need to make the page accessible if and only if we are
* going to access its content (the FOLL_PIN case). Please see
* Documentation/core-api/pin_user_pages.rst for details.
*/


> + if ((flags & FOLL_PIN) && arch_make_page_accessible(page)) {
> + unpin_user_page(page);
> + goto pte_unmap;
> + }


Your style earlier in the patch was easier on the reader, why not stay consistent
with that (and with this file, which tends also to do this), so:

if (flags & FOLL_PIN) {
ret = arch_make_page_accessible(page);
if (ret) {
unpin_user_page(page);
goto pte_unmap;
}
}




> SetPageReferenced(page);
> pages[*nr] = page;
> (*nr)++;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index ab5a3cee8ad3..8384be5a2758 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> }
> unlock_page_memcg(page);
> + /*
> + * If writeback has been triggered on a page that cannot be made
> + * accessible, it is too late.
> + */
> + WARN_ON(arch_make_page_accessible(page));


I'm not deep enough into this area to know if a) this is correct, and b) if there are any
other places that need arch_make_page_accessible() calls. So I'll rely on other
reviewers to help check on that.


> return ret;
>
> }
>

Anyway, I don't see any problems, and as I said, those documentation and style points are
just nitpicks, not bugs.


thanks,
--
John Hubbard
NVIDIA

2020-03-03 08:08:16

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/gup/writeback: add callbacks for inaccessible pages


> Your style earlier in the patch was easier on the reader, why not stay consistent
> with that (and with this file, which tends also to do this), so:
>
> if (flags & FOLL_PIN) {
> ret = arch_make_page_accessible(page);
> if (ret) {
> unpin_user_page(page);
> goto pte_unmap;
> }
> }
>

+1

--
Thanks,

David / dhildenb

2020-03-03 10:44:04

by Claudio Imbrenda

[permalink] [raw]
Subject: Re: [PATCH v2 2/2] mm/gup/writeback: add callbacks for inaccessible pages

On Mon, 2 Mar 2020 23:59:32 -0800
John Hubbard <[email protected]> wrote:

> On 3/2/20 4:25 PM, Claudio Imbrenda wrote:
> > With the introduction of protected KVM guests on s390 there is now a
> > concept of inaccessible pages. These pages need to be made
> > accessible before the host can access them.
> >
> > While cpu accesses will trigger a fault that can be resolved, I/O
> > accesses will just fail. We need to add a callback into
> > architecture code for places that will do I/O, namely when
> > writeback is started or when a page reference is taken.
> >
> > This is not only to enable paging, file backing etc, it is also
> > necessary to protect the host against a malicious user space. For
> > example a bad QEMU could simply start direct I/O on such protected
> > memory. We do not want userspace to be able to trigger I/O errors
> > and thus the logic is "whenever somebody accesses that page (gup)
> > or does I/O, make sure that this page can be accessed". When the
> > guest tries to access that page we will wait in the page fault
> > handler for writeback to have finished and for the page_ref to be
> > the expected value.
> >
> > On s390x the function is not supposed to fail, so it is ok to use a
> > WARN_ON on failure. If we ever need some more finegrained handling
> > we can tackle this when we know the details.
> >
> > Signed-off-by: Claudio Imbrenda <[email protected]>
> > Acked-by: Will Deacon <[email protected]>
> > Reviewed-by: David Hildenbrand <[email protected]>
> > Reviewed-by: Christian Borntraeger <[email protected]>
> > ---
> > include/linux/gfp.h | 6 ++++++
> > mm/gup.c | 27 ++++++++++++++++++++++++---
> > mm/page-writeback.c | 5 +++++
> > 3 files changed, 35 insertions(+), 3 deletions(-)
> >
> > diff --git a/include/linux/gfp.h b/include/linux/gfp.h
> > index e5b817cb86e7..be2754841369 100644
> > --- a/include/linux/gfp.h
> > +++ b/include/linux/gfp.h
> > @@ -485,6 +485,12 @@ static inline void arch_free_page(struct page
> > *page, int order) { } #ifndef HAVE_ARCH_ALLOC_PAGE
> > static inline void arch_alloc_page(struct page *page, int order)
> > { } #endif
> > +#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
> > +static inline int arch_make_page_accessible(struct page *page)
> > +{
> > + return 0;
> > +}
> > +#endif
> >
> > struct page *
> > __alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int
> > preferred_nid, diff --git a/mm/gup.c b/mm/gup.c
> > index 81a95fbe9901..15c47e0e86f8 100644
> > --- a/mm/gup.c
> > +++ b/mm/gup.c
> > @@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, struct page *page;
> > spinlock_t *ptl;
> > pte_t *ptep, pte;
> > + int ret;
> >
> > /* FOLL_GET and FOLL_PIN are mutually exclusive. */
> > if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
> > @@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, if (is_zero_pfn(pte_pfn(pte))) {
> > page = pte_page(pte);
> > } else {
> > - int ret;
> > -
> > ret = follow_pfn_pte(vma, address, ptep,
> > flags); page = ERR_PTR(ret);
> > goto out;
> > @@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, }
> >
> > if (flags & FOLL_SPLIT && PageTransCompound(page)) {
> > - int ret;
> > get_page(page);
> > pte_unmap_unlock(ptep, ptl);
> > lock_page(page);
> > @@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct
> > vm_area_struct *vma, page = ERR_PTR(-ENOMEM);
> > goto out;
> > }
> > + /*
> > + * We need to make the page accessible if we are actually
> > going to
> > + * poke at its content (pin), otherwise we can leave it
> > inaccessible.
> > + * If we cannot make the page accessible, fail.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + page = ERR_PTR(ret);
> > + goto out;
> > + }
> > + }
>
>
> That looks good.
>
>
> > if (flags & FOLL_TOUCH) {
> > if ((flags & FOLL_WRITE) &&
> > !pte_dirty(pte) && !PageDirty(page))
> > @@ -2162,6 +2173,16 @@ static int gup_pte_range(pmd_t pmd, unsigned
> > long addr, unsigned long end,
> > VM_BUG_ON_PAGE(compound_head(page) != head, page);
> >
> > + /*
> > + * We need to make the page accessible if we are
> > actually
> > + * going to poke at its content (pin), otherwise
> > we can
> > + * leave it inaccessible. If the page cannot be
> > made
> > + * accessible, fail.
> > + */
>
>
> This part looks good, so these two points are just nits:
>
> That's a little bit of repeating what the code does, in the comments.
> How about:
>
> /*
> * We need to make the page accessible if and only if
> we are
> * going to access its content (the FOLL_PIN case).
> Please see
> * Documentation/core-api/pin_user_pages.rst for
> details. */
>
>
> > + if ((flags & FOLL_PIN) &&
> > arch_make_page_accessible(page)) {
> > + unpin_user_page(page);
> > + goto pte_unmap;
> > + }
>
>
> Your style earlier in the patch was easier on the reader, why not
> stay consistent with that (and with this file, which tends also to do
> this), so:
>
> if (flags & FOLL_PIN) {
> ret = arch_make_page_accessible(page);
> if (ret) {
> unpin_user_page(page);
> goto pte_unmap;
> }
> }
>
>
>
>
> > SetPageReferenced(page);
> > pages[*nr] = page;
> > (*nr)++;
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index ab5a3cee8ad3..8384be5a2758 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2807,6 +2807,11 @@ int __test_set_page_writeback(struct page
> > *page, bool keep_write) inc_zone_page_state(page,
> > NR_ZONE_WRITE_PENDING); }
> > unlock_page_memcg(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot
> > be made
> > + * accessible, it is too late.
> > + */
> > + WARN_ON(arch_make_page_accessible(page));
>
>
> I'm not deep enough into this area to know if a) this is correct, and
> b) if there are any other places that need
> arch_make_page_accessible() calls. So I'll rely on other reviewers to
> help check on that.
>
>
> > return ret;
> >
> > }
> >
>
> Anyway, I don't see any problems, and as I said, those documentation
> and style points are just nitpicks, not bugs.


these are minor fixes, and I mostly agree with you. I'll fix them and
send a v3 soon™

thanks for the comments!