With the introduction of protected KVM guests on s390 there is now a
concept of inaccessible pages. These pages need to be made accessible
before the host can access them.
While cpu accesses will trigger a fault that can be resolved, I/O
accesses will just fail. We need to add a callback into architecture
code for places that will do I/O, namely when writeback is started or
when a page reference is taken.
This is not only to enable paging, file backing etc, it is also
necessary to protect the host against a malicious user space. For
example a bad QEMU could simply start direct I/O on such protected
memory. We do not want userspace to be able to trigger I/O errors and
thus the logic is "whenever somebody accesses that page (gup) or does
I/O, make sure that this page can be accessed". When the guest tries
to access that page we will wait in the page fault handler for
writeback to have finished and for the page_ref to be the expected
value.
On s390x the function is not supposed to fail, so it is ok to use a
WARN_ON on failure. If we ever need some more finegrained handling
we can tackle this when we know the details.
Signed-off-by: Claudio Imbrenda <[email protected]>
Acked-by: Will Deacon <[email protected]>
Reviewed-by: David Hildenbrand <[email protected]>
Reviewed-by: Christian Borntraeger <[email protected]>
Reviewed-by: John Hubbard <[email protected]>
---
include/linux/gfp.h | 6 ++++++
mm/gup.c | 30 +++++++++++++++++++++++++++---
mm/page-writeback.c | 9 ++++++++-
3 files changed, 41 insertions(+), 4 deletions(-)
diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index e5b817cb86e7..be2754841369 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -485,6 +485,12 @@ static inline void arch_free_page(struct page *page, int order) { }
#ifndef HAVE_ARCH_ALLOC_PAGE
static inline void arch_alloc_page(struct page *page, int order) { }
#endif
+#ifndef HAVE_ARCH_MAKE_PAGE_ACCESSIBLE
+static inline int arch_make_page_accessible(struct page *page)
+{
+ return 0;
+}
+#endif
struct page *
__alloc_pages_nodemask(gfp_t gfp_mask, unsigned int order, int preferred_nid,
diff --git a/mm/gup.c b/mm/gup.c
index 81a95fbe9901..d0c4c6f336bb 100644
--- a/mm/gup.c
+++ b/mm/gup.c
@@ -413,6 +413,7 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
struct page *page;
spinlock_t *ptl;
pte_t *ptep, pte;
+ int ret;
/* FOLL_GET and FOLL_PIN are mutually exclusive. */
if (WARN_ON_ONCE((flags & (FOLL_PIN | FOLL_GET)) ==
@@ -471,8 +472,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
if (is_zero_pfn(pte_pfn(pte))) {
page = pte_page(pte);
} else {
- int ret;
-
ret = follow_pfn_pte(vma, address, ptep, flags);
page = ERR_PTR(ret);
goto out;
@@ -480,7 +479,6 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
}
if (flags & FOLL_SPLIT && PageTransCompound(page)) {
- int ret;
get_page(page);
pte_unmap_unlock(ptep, ptl);
lock_page(page);
@@ -497,6 +495,19 @@ static struct page *follow_page_pte(struct vm_area_struct *vma,
page = ERR_PTR(-ENOMEM);
goto out;
}
+ /*
+ * We need to make the page accessible if and only if we are going
+ * to access its content (the FOLL_PIN case). Please see
+ * Documentation/core-api/pin_user_pages.rst for details.
+ */
+ if (flags & FOLL_PIN) {
+ ret = arch_make_page_accessible(page);
+ if (ret) {
+ unpin_user_page(page);
+ page = ERR_PTR(ret);
+ goto out;
+ }
+ }
if (flags & FOLL_TOUCH) {
if ((flags & FOLL_WRITE) &&
!pte_dirty(pte) && !PageDirty(page))
@@ -2162,6 +2173,19 @@ static int gup_pte_range(pmd_t pmd, unsigned long addr, unsigned long end,
VM_BUG_ON_PAGE(compound_head(page) != head, page);
+ /*
+ * We need to make the page accessible if and only if we are
+ * going to access its content (the FOLL_PIN case). Please
+ * see Documentation/core-api/pin_user_pages.rst for
+ * details.
+ */
+ if (flags & FOLL_PIN) {
+ ret = arch_make_page_accessible(page);
+ if (ret) {
+ unpin_user_page(page);
+ goto pte_unmap;
+ }
+ }
SetPageReferenced(page);
pages[*nr] = page;
(*nr)++;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index ab5a3cee8ad3..b7f3d0766a5f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
int __test_set_page_writeback(struct page *page, bool keep_write)
{
struct address_space *mapping = page_mapping(page);
- int ret;
+ int ret, access_ret;
lock_page_memcg(page);
if (mapping && mapping_use_writeback_tags(mapping)) {
@@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
}
unlock_page_memcg(page);
+ access_ret = arch_make_page_accessible(page);
+ /*
+ * If writeback has been triggered on a page that cannot be made
+ * accessible, it is too late to recover here.
+ */
+ VM_BUG_ON_PAGE(access_ret != 0, page);
+
return ret;
}
--
2.24.1
On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> On s390x the function is not supposed to fail, so it is ok to use a
> WARN_ON on failure. If we ever need some more finegrained handling
> we can tackle this when we know the details.
Could you explain a bit why the function can't fail?
If the guest has secret data in the page, then it *can* and does fail.
It won't fail, though, if the host and guest agree on whether the page
is protected.
Right?
> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> }
> unlock_page_memcg(page);
> + access_ret = arch_make_page_accessible(page);
> + /*
> + * If writeback has been triggered on a page that cannot be made
> + * accessible, it is too late to recover here.
> + */
> + VM_BUG_ON_PAGE(access_ret != 0, page);
> +
> return ret;
>
> }
This seems like a really odd place to do this. Writeback is specific to
block I/O. I would have thought there were other kinds of devices that
matter, not just block devices.
Also, this patch seems odd that it only does the
arch_make_page_accessible() half. Where's the other half where the page
is made inaccessible?
I assume it's OK to "leak" things like this, it's just not clear to me
_why_ it's OK.
On Mon, 13 Apr 2020 13:22:24 -0700
Dave Hansen <[email protected]> wrote:
> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> > On s390x the function is not supposed to fail, so it is ok to use a
> > WARN_ON on failure. If we ever need some more finegrained handling
> > we can tackle this when we know the details.
>
> Could you explain a bit why the function can't fail?
the concept of "making accessible" is only to make sure that accessing
the page will not trigger faults or I/O or DMA errors. in general it
does not mean freely accessing the content of the page in cleartext.
on s390x, protected guest pages can be shared. the guest has to
actively share its pages, and in that case those pages are both part of
the protected VM and freely accessible by the host.
pages that are not shared cannot be accessed by the host.
in our case "making the page accessible" means:
- if the page was shared, make sure it stays shared
- if the page was not shared, first encrypt it and then make it
accessible to the host (both operations performed securely and
atomically by the hardware)
then the page can be swapped out, or used for direct I/O (obviously if
you do I/O on a page that was not shared, you cannot expect good
things to happen, since you basically corrupt the memory of the guest).
on s390x performing I/O directly on protected pages results in (in
practice) unrecoverable I/O errors, so we want to avoid it at all costs.
accessing protected pages from the CPU triggers an exception that can
be handled (and we do handle it, in fact)
now imagine a buggy or malicious qemu process crashing the whole machine
just because it did I/O to/from a protected page. we clearly don't want
that.
> If the guest has secret data in the page, then it *can* and does fail.
no, that's the whole point of this mechanism. in fact, most of the
guest pages will be "secret data", only the few pages used for guest I/O
bounce buffers will be shared with the host
> It won't fail, though, if the host and guest agree on whether the page
> is protected.
>
> Right?
>
> > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> > *page, bool keep_write) inc_zone_page_state(page,
> > NR_ZONE_WRITE_PENDING); }
> > unlock_page_memcg(page);
> > + access_ret = arch_make_page_accessible(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot
> > be made
> > + * accessible, it is too late to recover here.
> > + */
> > + VM_BUG_ON_PAGE(access_ret != 0, page);
> > +
> > return ret;
> >
> > }
>
> This seems like a really odd place to do this. Writeback is specific
> to block I/O. I would have thought there were other kinds of devices
> that matter, not just block devices.
well, yes and no. for writeback (block I/O and swap) this is the right
place. at this point we know that the page is present and nobody else
has started doing I/O yet, and I/O will happen soon-ish. so we make the
page accessible. there is no turning back here, unlike pinning. we
are not allowed to fail, we can't
regarding the other kinds of devices: yes, they will use pinning, which
is covered by the rest of the patch. the semantics of get page and pin
page (if the documentation has not changed meanwhile) is that the
traditional get_page is used for when the page is needed but not its
content, and pin_page is used when the content of the page is accessed.
since the only issue here is accessing the content of the page, we
don't need to make it accessible for get_page, but only for pin_page.
get_page and pin_page are allowed to fail, so in this case we return an
error code, so other architectures can potentially abort the pinning if
needed. on s390x we will never fail, for the same reasons written
above.
> Also, this patch seems odd that it only does the
> arch_make_page_accessible() half. Where's the other half where the
> page is made inaccessible?
that is very arch-specific. for s390x, you can look at this patch and
the ones immediately before/after: 214d9bbcd3a67230b932f6ce
> I assume it's OK to "leak" things like this, it's just not clear to me
> _why_ it's OK.
nothing is being leaked :)
I hope I clarified a little how this works on s390x :)
feel free to poke me again if some things are still unclear
best regards,
Claudio Imbrenda
On 4/14/20 9:03 AM, Claudio Imbrenda wrote:
> On Mon, 13 Apr 2020 13:22:24 -0700
> Dave Hansen <[email protected]> wrote:
>
>> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
>>> On s390x the function is not supposed to fail, so it is ok to use a
>>> WARN_ON on failure. If we ever need some more finegrained handling
>>> we can tackle this when we know the details.
>>
>> Could you explain a bit why the function can't fail?
>
> the concept of "making accessible" is only to make sure that accessing
> the page will not trigger faults or I/O or DMA errors. in general it
> does not mean freely accessing the content of the page in cleartext.
>
> on s390x, protected guest pages can be shared. the guest has to
> actively share its pages, and in that case those pages are both part of
> the protected VM and freely accessible by the host.
Oh, that's interesting.
It sounds like there are three separate concepts:
1. Protection
2. Sharing
3. Accessibility
Protected pages may be shared and the request of the guest.
Shared pages' plaintext can be accessed by the host. For unshared
pages, the host can only see ciphertext.
I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with
some of this information. It seems a bit sparse on this topic.
As it stands, if I were modifying generic code, I don't think I'd have
even a chance of getting an arch_make_page_accessible() in the right spot.
> in our case "making the page accessible" means:
...
> - if the page was not shared, first encrypt it and then make it
> accessible to the host (both operations performed securely and
> atomically by the hardware)
What happens to the guest's view of the page when this happens? Does it
keep seeing plaintext?
> then the page can be swapped out, or used for direct I/O (obviously if
> you do I/O on a page that was not shared, you cannot expect good
> things to happen, since you basically corrupt the memory of the guest).
So why even allow access to the encrypted contents if the host can't do
anything useful with it? Is there some reason for going to the trouble
of encrypting it and exposing it to the host?
> on s390x performing I/O directly on protected pages results in (in
> practice) unrecoverable I/O errors, so we want to avoid it at all costs.
This is understandable, but we usually steer I/O operations in places
like the DMA API, not in the core VM.
We *have* the concept of pages to which I/O can't be done. There are
plenty of crippled devices where we have to bounce data into a low
buffer before it can go where we really want it to. I think the AMD SEV
patches do this, for instance.
> accessing protected pages from the CPU triggers an exception that can
> be handled (and we do handle it, in fact)
>
> now imagine a buggy or malicious qemu process crashing the whole machine
> just because it did I/O to/from a protected page. we clearly don't want
> that.
Is DMA disallowed to *all* protected pages? Even pages which the guest
has explicitly shared with the host?
>>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
>>> *page, bool keep_write) inc_zone_page_state(page,
>>> NR_ZONE_WRITE_PENDING); }
>>> unlock_page_memcg(page);
>>> + access_ret = arch_make_page_accessible(page);
>>> + /*
>>> + * If writeback has been triggered on a page that cannot
>>> be made
>>> + * accessible, it is too late to recover here.
>>> + */
>>> + VM_BUG_ON_PAGE(access_ret != 0, page);
>>> +
>>> return ret;
>>>
>>> }
>>
>> This seems like a really odd place to do this. Writeback is specific
>> to block I/O. I would have thought there were other kinds of devices
>> that matter, not just block devices.
>
> well, yes and no. for writeback (block I/O and swap) this is the right
> place. at this point we know that the page is present and nobody else
> has started doing I/O yet, and I/O will happen soon-ish. so we make the
> page accessible. there is no turning back here, unlike pinning. we
> are not allowed to fail, we can't
This description sounds really incomplete to me.
Not all swap involved device I/O. For instance, zswap doesn't involve
any devices. Would zswap need this hook?
On Tue, 14 Apr 2020 11:50:16 -0700
Dave Hansen <[email protected]> wrote:
> On 4/14/20 9:03 AM, Claudio Imbrenda wrote:
> > On Mon, 13 Apr 2020 13:22:24 -0700
> > Dave Hansen <[email protected]> wrote:
> >
> >> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> >>> On s390x the function is not supposed to fail, so it is ok to use
> >>> a WARN_ON on failure. If we ever need some more finegrained
> >>> handling we can tackle this when we know the details.
> >>
> >> Could you explain a bit why the function can't fail?
> >
> > the concept of "making accessible" is only to make sure that
> > accessing the page will not trigger faults or I/O or DMA errors. in
> > general it does not mean freely accessing the content of the page
> > in cleartext.
> >
> > on s390x, protected guest pages can be shared. the guest has to
> > actively share its pages, and in that case those pages are both
> > part of the protected VM and freely accessible by the host.
>
> Oh, that's interesting.
>
> It sounds like there are three separate concepts:
> 1. Protection
> 2. Sharing
> 3. Accessibility
>
> Protected pages may be shared and the request of the guest.
> Shared pages' plaintext can be accessed by the host. For unshared
> pages, the host can only see ciphertext.
>
> I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with
> some of this information. It seems a bit sparse on this topic.
that is definitely something that can be fixed.
I will improve the documentation and make sure it properly explains
all the details of how protected VMs work on s390x.
> As it stands, if I were modifying generic code, I don't think I'd have
> even a chance of getting an arch_make_page_accessible() in the right
> spot.
>
> > in our case "making the page accessible" means:
> ...
> > - if the page was not shared, first encrypt it and then make it
> > accessible to the host (both operations performed securely and
> > atomically by the hardware)
>
> What happens to the guest's view of the page when this happens? Does
> it keep seeing plaintext?
>
> > then the page can be swapped out, or used for direct I/O (obviously
> > if you do I/O on a page that was not shared, you cannot expect good
> > things to happen, since you basically corrupt the memory of the
> > guest).
>
> So why even allow access to the encrypted contents if the host can't
> do anything useful with it? Is there some reason for going to the
> trouble of encrypting it and exposing it to the host?
you should not overwrite it, but you can/should write it out verbatim,
e.g. for swap
> > on s390x performing I/O directly on protected pages results in (in
> > practice) unrecoverable I/O errors, so we want to avoid it at all
> > costs.
>
> This is understandable, but we usually steer I/O operations in places
> like the DMA API, not in the core VM.
>
> We *have* the concept of pages to which I/O can't be done. There are
> plenty of crippled devices where we have to bounce data into a low
> buffer before it can go where we really want it to. I think the AMD
> SEV patches do this, for instance.
>
> > accessing protected pages from the CPU triggers an exception that
> > can be handled (and we do handle it, in fact)
> >
> > now imagine a buggy or malicious qemu process crashing the whole
> > machine just because it did I/O to/from a protected page. we
> > clearly don't want that.
>
> Is DMA disallowed to *all* protected pages? Even pages which the
> guest has explicitly shared with the host?
>
>
> >>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> >>> *page, bool keep_write) inc_zone_page_state(page,
> >>> NR_ZONE_WRITE_PENDING); }
> >>> unlock_page_memcg(page);
> >>> + access_ret = arch_make_page_accessible(page);
> >>> + /*
> >>> + * If writeback has been triggered on a page that cannot
> >>> be made
> >>> + * accessible, it is too late to recover here.
> >>> + */
> >>> + VM_BUG_ON_PAGE(access_ret != 0, page);
> >>> +
> >>> return ret;
> >>>
> >>> }
> >>
> >> This seems like a really odd place to do this. Writeback is
> >> specific to block I/O. I would have thought there were other
> >> kinds of devices that matter, not just block devices.
> >
> > well, yes and no. for writeback (block I/O and swap) this is the
> > right place. at this point we know that the page is present and
> > nobody else has started doing I/O yet, and I/O will happen
> > soon-ish. so we make the page accessible. there is no turning back
> > here, unlike pinning. we are not allowed to fail, we can't
>
> This description sounds really incomplete to me.
>
> Not all swap involved device I/O. For instance, zswap doesn't involve
> any devices. Would zswap need this hook?
please feel free to write to me privately if you have any further
questions or doubts :)
best regards,
Claudio Imbrenda
On 4/15/20 11:26 AM, Claudio Imbrenda wrote:
> On Tue, 14 Apr 2020 11:50:16 -0700
> Dave Hansen <[email protected]> wrote:
>
>> On 4/14/20 9:03 AM, Claudio Imbrenda wrote:
>>> On Mon, 13 Apr 2020 13:22:24 -0700
>>> Dave Hansen <[email protected]> wrote:
>>>
>>>> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
>>>>> On s390x the function is not supposed to fail, so it is ok to use
>>>>> a WARN_ON on failure. If we ever need some more finegrained
>>>>> handling we can tackle this when we know the details.
>>>>
>>>> Could you explain a bit why the function can't fail?
>>>
>>> the concept of "making accessible" is only to make sure that
>>> accessing the page will not trigger faults or I/O or DMA errors. in
>>> general it does not mean freely accessing the content of the page
>>> in cleartext.
>>>
>>> on s390x, protected guest pages can be shared. the guest has to
>>> actively share its pages, and in that case those pages are both
>>> part of the protected VM and freely accessible by the host.
>>
>> Oh, that's interesting.
>>
>> It sounds like there are three separate concepts:
>> 1. Protection
>> 2. Sharing
>> 3. Accessibility
>>
>> Protected pages may be shared and the request of the guest.
>> Shared pages' plaintext can be accessed by the host. For unshared
>> pages, the host can only see ciphertext.
>>
>> I wonder if Documentation/virt/kvm/s390-pv.rst can be beefed up with
>> some of this information. It seems a bit sparse on this topic.
>
> that is definitely something that can be fixed.
>
> I will improve the documentation and make sure it properly explains
> all the details of how protected VMs work on s390x.
I'd also definitely appreciate more people looking over that document
and adding things I forgot ;-)
>
>> As it stands, if I were modifying generic code, I don't think I'd have
>> even a chance of getting an arch_make_page_accessible() in the right
>> spot.
>>
>>> in our case "making the page accessible" means:
>> ...
>>> - if the page was not shared, first encrypt it and then make it
>>> accessible to the host (both operations performed securely and
>>> atomically by the hardware)
>>
>> What happens to the guest's view of the page when this happens? Does
>> it keep seeing plaintext?
>>
>>> then the page can be swapped out, or used for direct I/O (obviously
>>> if you do I/O on a page that was not shared, you cannot expect good
>>> things to happen, since you basically corrupt the memory of the
>>> guest).
>>
>> So why even allow access to the encrypted contents if the host can't
>> do anything useful with it? Is there some reason for going to the
>> trouble of encrypting it and exposing it to the host?
>
> you should not overwrite it, but you can/should write it out verbatim,
> e.g. for swap
>
>>> on s390x performing I/O directly on protected pages results in (in
>>> practice) unrecoverable I/O errors, so we want to avoid it at all
>>> costs.
>>
>> This is understandable, but we usually steer I/O operations in places
>> like the DMA API, not in the core VM.
>>
>> We *have* the concept of pages to which I/O can't be done. There are
>> plenty of crippled devices where we have to bounce data into a low
>> buffer before it can go where we really want it to. I think the AMD
>> SEV patches do this, for instance.
>>
>>> accessing protected pages from the CPU triggers an exception that
>>> can be handled (and we do handle it, in fact)
>>>
>>> now imagine a buggy or malicious qemu process crashing the whole
>>> machine just because it did I/O to/from a protected page. we
>>> clearly don't want that.
>>
>> Is DMA disallowed to *all* protected pages? Even pages which the
>> guest has explicitly shared with the host?
>>
>>
>>>>> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
>>>>> *page, bool keep_write) inc_zone_page_state(page,
>>>>> NR_ZONE_WRITE_PENDING); }
>>>>> unlock_page_memcg(page);
>>>>> + access_ret = arch_make_page_accessible(page);
>>>>> + /*
>>>>> + * If writeback has been triggered on a page that cannot
>>>>> be made
>>>>> + * accessible, it is too late to recover here.
>>>>> + */
>>>>> + VM_BUG_ON_PAGE(access_ret != 0, page);
>>>>> +
>>>>> return ret;
>>>>>
>>>>> }
>>>>
>>>> This seems like a really odd place to do this. Writeback is
>>>> specific to block I/O. I would have thought there were other
>>>> kinds of devices that matter, not just block devices.
>>>
>>> well, yes and no. for writeback (block I/O and swap) this is the
>>> right place. at this point we know that the page is present and
>>> nobody else has started doing I/O yet, and I/O will happen
>>> soon-ish. so we make the page accessible. there is no turning back
>>> here, unlike pinning. we are not allowed to fail, we can't
>>
>> This description sounds really incomplete to me.
>>
>> Not all swap involved device I/O. For instance, zswap doesn't involve
>> any devices. Would zswap need this hook?
>
> please feel free to write to me privately if you have any further
> questions or doubts :)
>
>
> best regards,
>
> Claudio Imbrenda
>
On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> + /*
> + * We need to make the page accessible if and only if we are going
> + * to access its content (the FOLL_PIN case). Please see
> + * Documentation/core-api/pin_user_pages.rst for details.
> + */
> + if (flags & FOLL_PIN) {
> + ret = arch_make_page_accessible(page);
> + if (ret) {
> + unpin_user_page(page);
> + page = ERR_PTR(ret);
> + goto out;
> + }
> + }
Thanks, Claudio, for a really thorough refresher on this in private mail.
But, I think this mechanism probably hooks into the wrong place. I
don't doubt that it *functions* on s390, but I think these calls are
misplaced. I think the end result is that no other architecture will
have a chance to use the same hooks. They're far too s390-specific even
for a concept that's not limited to s390.
get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this
page's contents". The kmap() family is really what we use for that.
kmap()s are often *preceded* by get_user_pages(), which is probably why
this works for you, though.
Yes, the docs do say that FOLL_PIN is for accessing the pages. But,
there's a crucial thing that it leaves out: *WHO* will be accessing the
pages. For Direct IO, for instance, the CPU isn't touching the page at
all. It's always a device. Also, crucially, the page contents are
*not* accessible from the CPU's perspective after a gup. They're not
accessible until a kmap(). They're also not even accessible for
*devices* after a gup. There's a _separate_ mapping process that's
requires to make them accessible to the CPU.
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
> int __test_set_page_writeback(struct page *page, bool keep_write)
> {
> struct address_space *mapping = page_mapping(page);
> - int ret;
> + int ret, access_ret;
>
> lock_page_memcg(page);
> if (mapping && mapping_use_writeback_tags(mapping)) {
> @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> }
> unlock_page_memcg(page);
> + access_ret = arch_make_page_accessible(page);
> + /*
> + * If writeback has been triggered on a page that cannot be made
> + * accessible, it is too late to recover here.
> + */
> + VM_BUG_ON_PAGE(access_ret != 0, page);
> +
> return ret;
>
> }
I think this one really shows the cracks in the approach. Pages being
swapped *don't* have get_user_pages() done on them since we've already
got the physical page at the time writeback and aren't looking at PTEs.
They're read by I/O devices sending them out to storage, but also by the
CPU if you're doing something like zswap. But, again, critically,
accessing page contents won't be done until kmap().
I suspect you saw crashes underneath __swap_writepage()->submit_bio()
and looked a few lines up to the set_page_writeback() and decided to
hook in there. I think a better spot, again, is to hook into kmap()
which is called in the block layer.
Why do I care?
I was looking at AMD's SEV (Secure Encrypted Virtualization) code which
is in the kernel which shares some implementation details with the
not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages
being encrypted and being gibberish to the host, plus a handshake to
share guest-selected pages. Some of the side-effects of exposing the
gibberish to the host aren't great (I think it can break cache coherency
if a stray write occurs) and it would be nice to get better behavior.
But, to get better behavior, the host kernel might need to remove pages
from its direct map, making them inaccessible. I was hoping to reuse
arch_make_page_accessible() for obvious reasons. But, get_user_pages()
is not the right spot to map pages because they might not *ever* be
accessed by the CPU, only devices.
Anyway, I know it's late feedback, but I'd hate to have core code like
this that has no hope of ever getting reused.
On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote:
> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> > + /*
> > + * We need to make the page accessible if and only if we are going
> > + * to access its content (the FOLL_PIN case). Please see
> > + * Documentation/core-api/pin_user_pages.rst for details.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + page = ERR_PTR(ret);
> > + goto out;
> > + }
> > + }
>
> Thanks, Claudio, for a really thorough refresher on this in private mail.
>
> But, I think this mechanism probably hooks into the wrong place. I
> don't doubt that it *functions* on s390, but I think these calls are
> misplaced. I think the end result is that no other architecture will
> have a chance to use the same hooks. They're far too s390-specific even
> for a concept that's not limited to s390.
>
> get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this
> page's contents". The kmap() family is really what we use for that.
> kmap()s are often *preceded* by get_user_pages(), which is probably why
> this works for you, though.
>
> Yes, the docs do say that FOLL_PIN is for accessing the pages. But,
> there's a crucial thing that it leaves out: *WHO* will be accessing the
> pages. For Direct IO, for instance, the CPU isn't touching the page at
> all. It's always a device. Also, crucially, the page contents are
> *not* accessible from the CPU's perspective after a gup. They're not
> accessible until a kmap(). They're also not even accessible for
> *devices* after a gup. There's a _separate_ mapping process that's
> requires to make them accessible to the CPU.
I think the crucial detail is that we can fail gup(), while we cannot
ever fail kmap() or whatever else a device needs to do.
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page *page)
> > int __test_set_page_writeback(struct page *page, bool keep_write)
> > {
> > struct address_space *mapping = page_mapping(page);
> > - int ret;
> > + int ret, access_ret;
> >
> > lock_page_memcg(page);
> > if (mapping && mapping_use_writeback_tags(mapping)) {
> > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page *page, bool keep_write)
> > inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);
> > }
> > unlock_page_memcg(page);
> > + access_ret = arch_make_page_accessible(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot be made
> > + * accessible, it is too late to recover here.
> > + */
> > + VM_BUG_ON_PAGE(access_ret != 0, page);
> > +
> > return ret;
> >
> > }
>
> I think this one really shows the cracks in the approach. Pages being
> swapped *don't* have get_user_pages() done on them since we've already
> got the physical page at the time writeback and aren't looking at PTEs.
I suspect this happens because FOLL_TOUCH or something later does
set_page_dirty() on the page, which then eventually gets it in
writeback.
Failing gup() ealier, should ensure the above VM_BUG never happens,
unless someone is doing dodgy things.
> Why do I care?
>
> I was looking at AMD's SEV (Secure Encrypted Virtualization) code which
> is in the kernel which shares some implementation details with the
> not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages
> being encrypted and being gibberish to the host, plus a handshake to
> share guest-selected pages. Some of the side-effects of exposing the
> gibberish to the host aren't great (I think it can break cache coherency
> if a stray write occurs) and it would be nice to get better behavior.
>
> But, to get better behavior, the host kernel might need to remove pages
> from its direct map, making them inaccessible.
But for SEV we would actually need to fail this
arch_make_page_acesssible() thing, right? The encrypted guest pages
cannot be sanely accessed by the host IIRC, ever. Isn't their encryption
key linked to the phys addr of the page?
> I was hoping to reuse
> arch_make_page_accessible() for obvious reasons. But, get_user_pages()
> is not the right spot to map pages because they might not *ever* be
> accessed by the CPU, only devices.
I'm confused, why does it matter who accesses it? The point is that they
want to access it through this vaddr/mapping.
On 4/15/20 3:17 PM, Peter Zijlstra wrote:
> On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote:
>> Yes, the docs do say that FOLL_PIN is for accessing the pages. But,
>> there's a crucial thing that it leaves out: *WHO* will be accessing the
>> pages. For Direct IO, for instance, the CPU isn't touching the page at
>> all. It's always a device. Also, crucially, the page contents are
>> *not* accessible from the CPU's perspective after a gup. They're not
>> accessible until a kmap(). They're also not even accessible for
>> *devices* after a gup. There's a _separate_ mapping process that's
>> requires to make them accessible to the CPU.
>
> I think the crucial detail is that we can fail gup(), while we cannot
> ever fail kmap() or whatever else a device needs to do.
Yep, good point.
The patch in question puts says that you now need to do something to the
page before it can be accessed by the kernel. Because this is
presumably anonymous-only, and the only main way to get to anonymous
pages is the page tables, and the gup code is our de facto user page
table walker, this works OK-ish.
But, the gup code isn't out only walker. Grepping for pte_page() finds
a bunch of sites that this approach misses. They'll theoretically each
have to be patched if we want to extend this gup-time approach for
anything other than anonymous, small pages.
(Most of the additional pte_page() sites are for huge pages, which can't
be protected on s390.)
>>> + access_ret = arch_make_page_accessible(page);
>>> + /*
>>> + * If writeback has been triggered on a page that cannot be made
>>> + * accessible, it is too late to recover here.
>>> + */
>>> + VM_BUG_ON_PAGE(access_ret != 0, page);
>>> +
>>> return ret;
>>>
>>> }
>>
>> I think this one really shows the cracks in the approach. Pages being
>> swapped *don't* have get_user_pages() done on them since we've already
>> got the physical page at the time writeback and aren't looking at PTEs.
>
> I suspect this happens because FOLL_TOUCH or something later does
> set_page_dirty() on the page, which then eventually gets it in
> writeback.
I assumed that this was all anonymous-only so it's always dirty before
writeback starts.
>> Why do I care?
>>
>> I was looking at AMD's SEV (Secure Encrypted Virtualization) code which
>> is in the kernel which shares some implementation details with the
>> not-in-the-tree Intel MKTME. SEV currently has a concept of guest pages
>> being encrypted and being gibberish to the host, plus a handshake to
>> share guest-selected pages. Some of the side-effects of exposing the
>> gibberish to the host aren't great (I think it can break cache coherency
>> if a stray write occurs) and it would be nice to get better behavior.
>>
>> But, to get better behavior, the host kernel might need to remove pages
>> from its direct map, making them inaccessible.
>
> But for SEV we would actually need to fail this
> arch_make_page_acesssible() thing, right?
Yeah, we would ideally fail it, but not at the current
arch_make_page_acesssible() site. If the PTE isn't usable, we shouldn't
be creating it in the first place, or shouldn't leave it Present=1 and
GUP'able in the page tables once the underlying memory is no longer
accessible.
I _think_ vm_normal_page() is the right place to do it, when we're
dealing with a PTE but don't yet have a 'struct page'.
> The encrypted guest pages cannot be sanely accessed by the host IIRC,
> ever. Isn't their encryption key linked to the phys addr of the
> page?
Yes, and the keys can't even be used unless you are inside the VM.
But this begs the question: why did we create PTEs which can't be used?
Just to have something to gup?
>> I was hoping to reuse
>> arch_make_page_accessible() for obvious reasons. But, get_user_pages()
>> is not the right spot to map pages because they might not *ever* be
>> accessed by the CPU, only devices.
>
> I'm confused, why does it matter who accesses it? The point is that they
> want to access it through this vaddr/mapping.
To me, that's the entire *point* of get_user_pages(). It's someone
saying: "I want to find out what this mapping does, but I actually can't
use *that* mapping." I'm either:
1. A device that does I/O to the paddr space or through an IOMMU, or
2. The kernel but I want access to that page via *another* mapping (if
we could use the gup'd mapping, we would, but we know we can't)
and I need the physical address space to stay consistent for a bit so I
can do those things via other address spaces.
On Thu, 16 Apr 2020 07:20:48 -0700
Dave Hansen <[email protected]> wrote:
> On 4/16/20 5:15 AM, Claudio Imbrenda wrote:
> >> I assumed that this was all anonymous-only so it's always dirty
> >> before writeback starts.
> > it could also be mmapped
>
> Let's say you have a mmap()'d ramfs file. Another process calls which
> doesn't have it mapped calls sys_write() and writes to the file.
>
> This means that host host has to write to the physical page and must
> do arch_make_page_accessible() in the sys_write() path somewhere.
>
> There is a get_user_pages() in that path, but it's on the _source_
> buffer, not the ramfs page because the ramfs page is not mapped.
> There's also no __test_set_page_writeback() because you can't write
> back ramfs.
>
> Where is the arch_make_page_accessible() in this case on the ramfs
> page?
it's in the fault handler for the exception the CPU will get when
attempting to write the data to the protected page
On 4/16/20 7:59 AM, Claudio Imbrenda wrote:
> On Thu, 16 Apr 2020 07:20:48 -0700
> Dave Hansen <[email protected]> wrote:
>> On 4/16/20 5:15 AM, Claudio Imbrenda wrote:
>>>> I assumed that this was all anonymous-only so it's always dirty
>>>> before writeback starts.
>>> it could also be mmapped
>>
>> Let's say you have a mmap()'d ramfs file. Another process calls which
>> doesn't have it mapped calls sys_write() and writes to the file.
...
>> Where is the arch_make_page_accessible() in this case on the ramfs
>> page?
>
> it's in the fault handler for the exception the CPU will get when
> attempting to write the data to the protected page
Ahh, so this is *just* intended to precede I/O done on the page, when a
non-host entity is touching the memory?
That seems inconsistent with the process_vm_readv/writev() paths which
set FOLL_PIN on their pin_remote_user_pages() requests, but don't do I/O
to the memory.
On Thu, 16 Apr 2020 08:36:50 -0700
Dave Hansen <[email protected]> wrote:
> On 4/16/20 7:59 AM, Claudio Imbrenda wrote:
> > On Thu, 16 Apr 2020 07:20:48 -0700
> > Dave Hansen <[email protected]> wrote:
> >> On 4/16/20 5:15 AM, Claudio Imbrenda wrote:
> >>>> I assumed that this was all anonymous-only so it's always dirty
> >>>> before writeback starts.
> >>> it could also be mmapped
> >>
> >> Let's say you have a mmap()'d ramfs file. Another process calls
> >> which doesn't have it mapped calls sys_write() and writes to the
> >> file.
> ...
> >> Where is the arch_make_page_accessible() in this case on the ramfs
> >> page?
> >
> > it's in the fault handler for the exception the CPU will get when
> > attempting to write the data to the protected page
>
> Ahh, so this is *just* intended to precede I/O done on the page, when
> a non-host entity is touching the memory?
yep
> That seems inconsistent with the process_vm_readv/writev() paths which
> set FOLL_PIN on their pin_remote_user_pages() requests, but don't do
> I/O to the memory.
FOLL_PIN simply indicates potential access to the content of the page,
not just for I/O.
so yes, we are overdoing arch_make_page_accessible() in some cases,
because we can't tell when a page will be used for I/O and when not.
In most cases this will boil down to checking a flag and doing nothing,
for example in case the page was already accessible.
Also note that making the page accessible because of a FOLL_PIN in
absence of I/O will probably later on spare us from triggering and
handling the exception that would have caused us to make the page
accessible anyway.
On Wed, 15 Apr 2020 14:52:31 -0700
Dave Hansen <[email protected]> wrote:
> On 3/6/20 5:25 AM, Claudio Imbrenda wrote:
> > + /*
> > + * We need to make the page accessible if and only if we
> > are going
> > + * to access its content (the FOLL_PIN case). Please see
> > + * Documentation/core-api/pin_user_pages.rst for details.
> > + */
> > + if (flags & FOLL_PIN) {
> > + ret = arch_make_page_accessible(page);
> > + if (ret) {
> > + unpin_user_page(page);
> > + page = ERR_PTR(ret);
> > + goto out;
> > + }
> > + }
>
> Thanks, Claudio, for a really thorough refresher on this in private
> mail.
>
> But, I think this mechanism probably hooks into the wrong place. I
> don't doubt that it *functions* on s390, but I think these calls are
> misplaced. I think the end result is that no other architecture will
> have a chance to use the same hooks. They're far too s390-specific
> even for a concept that's not limited to s390.
>
> get_user_pages(FOLL_PIN) does *not* mean "the kernel will access this
> page's contents". The kmap() family is really what we use for that.
it means that _something_ _might_ access the content of the
physical page. be it kernel or device, and the device can access the
page through DMA or through other means (and yes on s390 many devices
read and write directly from/to memory without using DMA... it's
complicated)
also, not all architectures use kmap (e.g. s390 doesn't)
> kmap()s are often *preceded* by get_user_pages(), which is probably
> why this works for you, though.
>
> Yes, the docs do say that FOLL_PIN is for accessing the pages. But,
> there's a crucial thing that it leaves out: *WHO* will be accessing
exactly
> the pages. For Direct IO, for instance, the CPU isn't touching the
> page at all. It's always a device. Also, crucially, the page
exactly. and that is the one case we need to protect ourselves from.
letting a device touch directly a protected page causes an
unrecoverable error state in the device, potentially bringing down the
whole system. and this would be triggerable by userspace.
> contents are *not* accessible from the CPU's perspective after a gup.
depends on the architecture, I think
> They're not accessible until a kmap(). They're also not even
> accessible for *devices* after a gup. There's a _separate_ mapping
also depends on the architecture
> process that's requires to make them accessible to the CPU.
>
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2764,7 +2764,7 @@ int test_clear_page_writeback(struct page
> > *page) int __test_set_page_writeback(struct page *page, bool
> > keep_write) {
> > struct address_space *mapping = page_mapping(page);
> > - int ret;
> > + int ret, access_ret;
> >
> > lock_page_memcg(page);
> > if (mapping && mapping_use_writeback_tags(mapping)) {
> > @@ -2807,6 +2807,13 @@ int __test_set_page_writeback(struct page
> > *page, bool keep_write) inc_zone_page_state(page,
> > NR_ZONE_WRITE_PENDING); }
> > unlock_page_memcg(page);
> > + access_ret = arch_make_page_accessible(page);
> > + /*
> > + * If writeback has been triggered on a page that cannot
> > be made
> > + * accessible, it is too late to recover here.
> > + */
> > + VM_BUG_ON_PAGE(access_ret != 0, page);
> > +
> > return ret;
> >
> > }
>
> I think this one really shows the cracks in the approach. Pages being
> swapped *don't* have get_user_pages() done on them since we've already
> got the physical page at the time writeback and aren't looking at
> PTEs.
correct. that's why we are doing it when setting the writeback bit.
> They're read by I/O devices sending them out to storage, but also by
> the CPU if you're doing something like zswap. But, again, critically,
> accessing page contents won't be done until kmap().
is kmap called for direct I/O too?
> I suspect you saw crashes underneath __swap_writepage()->submit_bio()
> and looked a few lines up to the set_page_writeback() and decided to
> hook in there. I think a better spot, again, is to hook into kmap()
> which is called in the block layer.
making a page accessible is a potentially long operation (e.g. on s390
requires the hardware to encrypt the page and do some other expensive
operations), while kmap is a nop.
> Why do I care?
>
> I was looking at AMD's SEV (Secure Encrypted Virtualization) code
> which is in the kernel which shares some implementation details with
> the not-in-the-tree Intel MKTME. SEV currently has a concept of
> guest pages being encrypted and being gibberish to the host, plus a
> handshake to share guest-selected pages. Some of the side-effects of
> exposing the gibberish to the host aren't great (I think it can break
> cache coherency if a stray write occurs) and it would be nice to get
> better behavior.
>
> But, to get better behavior, the host kernel might need to remove
> pages from its direct map, making them inaccessible. I was hoping to
> reuse arch_make_page_accessible() for obvious reasons. But,
we are talking about physical pages being inaccessible, not mappings.
you can have the page correctly mapped and still inaccessible.
> get_user_pages() is not the right spot to map pages because they
> might not *ever* be accessed by the CPU, only devices.
>
> Anyway, I know it's late feedback, but I'd hate to have core code like
> this that has no hope of ever getting reused.
On Wed, 15 Apr 2020 16:34:14 -0700
Dave Hansen <[email protected]> wrote:
> On 4/15/20 3:17 PM, Peter Zijlstra wrote:
> > On Wed, Apr 15, 2020 at 02:52:31PM -0700, Dave Hansen wrote:
> >> Yes, the docs do say that FOLL_PIN is for accessing the pages.
> >> But, there's a crucial thing that it leaves out: *WHO* will be
> >> accessing the pages. For Direct IO, for instance, the CPU isn't
> >> touching the page at all. It's always a device. Also, crucially,
> >> the page contents are *not* accessible from the CPU's perspective
> >> after a gup. They're not accessible until a kmap(). They're also
> >> not even accessible for *devices* after a gup. There's a
> >> _separate_ mapping process that's requires to make them accessible
> >> to the CPU.
> >
> > I think the crucial detail is that we can fail gup(), while we
> > cannot ever fail kmap() or whatever else a device needs to do.
>
> Yep, good point.
>
> The patch in question puts says that you now need to do something to
> the page before it can be accessed by the kernel. Because this is
actually, before it can be accessed by anything. In fact, if the kernel
touches a protected page on s390 it gets a recoverable fault, and our
fault handler basically calls arch_make_page_accessible. the problem is
that I/O devices can touch the memory without the CPU touching it
first. so they would try to read or write on protected memory, causing
all kinds of issues.
> presumably anonymous-only, and the only main way to get to anonymous
> pages is the page tables, and the gup code is our de facto user page
> table walker, this works OK-ish.
actually this also works for mmapped memory, you can always write to
memory and then call sync, the effect is similar to swapping (both use
writeback IIRC)
> But, the gup code isn't out only walker. Grepping for pte_page()
> finds a bunch of sites that this approach misses. They'll
> theoretically each have to be patched if we want to extend this
> gup-time approach for anything other than anonymous, small pages.
well, no. walking the page tables gives you a physical address, but you
have no guarantees that the page will still be there later. that's the
whole point of gup - you make sure the page stays pinned in memory
e.g. stuff like follow_page without FOLL_PIN or FOLL_GET offers no
guarantee that the page will still be there one nanosecond later
> (Most of the additional pte_page() sites are for huge pages, which
> can't be protected on s390.)
>
> >>> + access_ret = arch_make_page_accessible(page);
> >>> + /*
> >>> + * If writeback has been triggered on a page that cannot
> >>> be made
> >>> + * accessible, it is too late to recover here.
> >>> + */
> >>> + VM_BUG_ON_PAGE(access_ret != 0, page);
> >>> +
> >>> return ret;
> >>>
> >>> }
> >>
> >> I think this one really shows the cracks in the approach. Pages
> >> being swapped *don't* have get_user_pages() done on them since
> >> we've already got the physical page at the time writeback and
> >> aren't looking at PTEs.
> >
> > I suspect this happens because FOLL_TOUCH or something later does
> > set_page_dirty() on the page, which then eventually gets it in
> > writeback.
>
> I assumed that this was all anonymous-only so it's always dirty before
> writeback starts.
it could also be mmapped
> >> Why do I care?
> >>
> >> I was looking at AMD's SEV (Secure Encrypted Virtualization) code
> >> which is in the kernel which shares some implementation details
> >> with the not-in-the-tree Intel MKTME. SEV currently has a concept
> >> of guest pages being encrypted and being gibberish to the host,
> >> plus a handshake to share guest-selected pages. Some of the
> >> side-effects of exposing the gibberish to the host aren't great (I
> >> think it can break cache coherency if a stray write occurs) and it
> >> would be nice to get better behavior.
> >>
> >> But, to get better behavior, the host kernel might need to remove
> >> pages from its direct map, making them inaccessible.
> >
> > But for SEV we would actually need to fail this
> > arch_make_page_acesssible() thing, right?
>
> Yeah, we would ideally fail it, but not at the current
> arch_make_page_acesssible() site. If the PTE isn't usable, we
> shouldn't be creating it in the first place, or shouldn't leave it
> Present=1 and GUP'able in the page tables once the underlying memory
> is no longer accessible.
the problem is that the page needs to be present, otherwise it cannot
be used in the VM. the protected VM can access the content of otherwise
inaccessible pages -- that's the whole point. we have userspace pages
that need to be mapped, but whose content should not be touched by I/O
without being "made accessible" first. and the I/O devices don't
necessarily use the DMA API
> I _think_ vm_normal_page() is the right place to do it, when we're
> dealing with a PTE but don't yet have a 'struct page'.
>
> > The encrypted guest pages cannot be sanely accessed by the host
> > IIRC, ever. Isn't their encryption key linked to the phys addr of
> > the page?
> Yes, and the keys can't even be used unless you are inside the VM.
>
> But this begs the question: why did we create PTEs which can't be
> used? Just to have something to gup?
the userspace PTEs (which are used for the protected guest) are needed
by the protected guest to access its memory. memory that is
"inaccessible" for the kernel is accessible by the hardware/firmware
and by the protected guest it belongs to. if the PTE is not valid, the
guest cannot run.
> >> I was hoping to reuse
> >> arch_make_page_accessible() for obvious reasons. But,
> >> get_user_pages() is not the right spot to map pages because they
> >> might not *ever* be accessed by the CPU, only devices.
> >
> > I'm confused, why does it matter who accesses it? The point is that
> > they want to access it through this vaddr/mapping.
>
> To me, that's the entire *point* of get_user_pages(). It's someone
> saying: "I want to find out what this mapping does, but I actually
> can't use *that* mapping." I'm either:
>
> 1. A device that does I/O to the paddr space or through an IOMMU, or
this is exactly the thing we want to protect ourselves from
> 2. The kernel but I want access to that page via *another* mapping (if
> we could use the gup'd mapping, we would, but we know we can't)
on s390 this is also true sometimes, because the kernel has a 1-to-1
mapping of all physical memory, and that's the mapping used to access
all pages when we have only the physical address.
copy to user actually uses the userspace mapping
s390 has separate address spaces for kernel and userspace
also, notice that in this case which mapping is used is irrelevant: you
could work with paging disabled, you still need to make the physical
pages accessible before the kernel or anyone else (except
hardware/firmware and the protected VM it belongs to) can touch them.
> and I need the physical address space to stay consistent for a bit so
> I can do those things via other address spaces.
On 4/16/20 5:15 AM, Claudio Imbrenda wrote:
>> I assumed that this was all anonymous-only so it's always dirty before
>> writeback starts.
> it could also be mmapped
Let's say you have a mmap()'d ramfs file. Another process calls which
doesn't have it mapped calls sys_write() and writes to the file.
This means that host host has to write to the physical page and must do
arch_make_page_accessible() in the sys_write() path somewhere.
There is a get_user_pages() in that path, but it's on the _source_
buffer, not the ramfs page because the ramfs page is not mapped.
There's also no __test_set_page_writeback() because you can't write back
ramfs.
Where is the arch_make_page_accessible() in this case on the ramfs page?
On 4/16/20 9:34 AM, Claudio Imbrenda wrote:
>> Ahh, so this is *just* intended to precede I/O done on the page, when
>> a non-host entity is touching the memory?
> yep
OK, so we've got to do an action that precedes *all* I/O to a page.
That's not too bad.
I still don't understand how this could work generally, though There
are lots of places where I/O is done to a page without either going
through __test_set_page_writeback() or gup() with FOLL_PIN set.
sendfile() is probably the best example of this:
fd = open("/normal/ext4/file", O_RDONLY);
sendfile(socket_fd, fd, &off, count);
There's no gup in sight since the file doesn't have an address and it's
not being written to so there's no writeback.
How does sendfile work?
On 4/16/20 12:02 PM, Dave Hansen wrote:
> On 4/16/20 9:34 AM, Claudio Imbrenda wrote:
>>> Ahh, so this is *just* intended to precede I/O done on the page, when
>>> a non-host entity is touching the memory?
>> yep
> OK, so we've got to do an action that precedes *all* I/O to a page.
> That's not too bad.
>
> I still don't understand how this could work generally, though There
> are lots of places where I/O is done to a page without either going
> through __test_set_page_writeback() or gup() with FOLL_PIN set.
>
> sendfile() is probably the best example of this:
>
> fd = open("/normal/ext4/file", O_RDONLY);
> sendfile(socket_fd, fd, &off, count);
>
> There's no gup in sight since the file doesn't have an address and it's
> not being written to so there's no writeback.
>
> How does sendfile work?
Did you manage to see if sendfile works (or any other operation that
DMAs file-backed data without being preceded by a gup)?
I suspect it's actually not that hard to fix. As long as you have a
dma_ops for the devices in question either via dev->dma_ops or you add
an s390 get_arch_vm_ops(), you can fix *all* the DMA sites, sendfile()
included.
BTW, device drivers do need to know how to use the DMA mapping API. If
s390 has drivers that need to be updated, I think that's vastly
preferable to incomplete hooks in core mm code.
On 4/21/20 2:31 PM, Dave Hansen wrote:
> On 4/16/20 12:02 PM, Dave Hansen wrote:
>> On 4/16/20 9:34 AM, Claudio Imbrenda wrote:
>>>> Ahh, so this is *just* intended to precede I/O done on the page, when
>>>> a non-host entity is touching the memory?
>>> yep
>> OK, so we've got to do an action that precedes *all* I/O to a page.
>> That's not too bad.
>>
>> I still don't understand how this could work generally, though There
>> are lots of places where I/O is done to a page without either going
>> through __test_set_page_writeback() or gup() with FOLL_PIN set.
>>
>> sendfile() is probably the best example of this:
>>
>> fd = open("/normal/ext4/file", O_RDONLY);
>> sendfile(socket_fd, fd, &off, count);
>>
>> There's no gup in sight since the file doesn't have an address and it's
>> not being written to so there's no writeback.
>>
>> How does sendfile work?
>
> Did you manage to see if sendfile works (or any other operation that
> DMAs file-backed data without being preceded by a gup)?
It's been a couple of weeks with no response on this.
From where I'm standing, we have a hook in the core VM that can't
possibly work with some existing kernel functionality and has virtually
no chance of getting used on a second architecture.
It sounds like there may need to be some additional work here, but
should these hooks stay in for 5.7? Or, should we revert this patch and
try again for 5.8?
On 28.04.20 21:43, Dave Hansen wrote:
> On 4/21/20 2:31 PM, Dave Hansen wrote:
>> On 4/16/20 12:02 PM, Dave Hansen wrote:
>>> On 4/16/20 9:34 AM, Claudio Imbrenda wrote:
>>>>> Ahh, so this is *just* intended to precede I/O done on the page, when
>>>>> a non-host entity is touching the memory?
>>>> yep
>>> OK, so we've got to do an action that precedes *all* I/O to a page.
>>> That's not too bad.
>>>
>>> I still don't understand how this could work generally, though There
>>> are lots of places where I/O is done to a page without either going
>>> through __test_set_page_writeback() or gup() with FOLL_PIN set.
>>>
>>> sendfile() is probably the best example of this:
>>>
>>> fd = open("/normal/ext4/file", O_RDONLY);
>>> sendfile(socket_fd, fd, &off, count);
>>>
>>> There's no gup in sight since the file doesn't have an address and it's
>>> not being written to so there's no writeback.
>>>
>>> How does sendfile work?
>>
>> Did you manage to see if sendfile works (or any other operation that
>> DMAs file-backed data without being preceded by a gup)?
>
> It's been a couple of weeks with no response on this.
>
> From where I'm standing, we have a hook in the core VM that can't
> possibly work with some existing kernel functionality and has virtually
> no chance of getting used on a second architecture.
>
> It sounds like there may need to be some additional work here, but
> should these hooks stay in for 5.7? Or, should we revert this patch and
> try again for 5.8?
We have something and Claudio is going to send out an addon patch soon.
Reverting would be harmful basically allowing for I/O errors on a valid
path that can happen with normal use. The sendfile case is also valid, but
it only triggers for a handcrafted cases (you have to change QEMU to use
MAP_SHARED and you have to use a file as memory backing and you have to
splice this file then to another file that is using DIRECT_IO). Those cases
then run into an I/O error case that is recovered by the driver. the same
would not work for a swap backing.
It is also not trivial (or doable) to change all device drivers that exist
on s390 to use the DMA API. In the long run we might be able to get the
necessary changes into the DMA api.
Claudio will send the details tomorrow.
On Tue, 28 Apr 2020 12:43:45 -0700
Dave Hansen <[email protected]> wrote:
> On 4/21/20 2:31 PM, Dave Hansen wrote:
> > On 4/16/20 12:02 PM, Dave Hansen wrote:
> >> On 4/16/20 9:34 AM, Claudio Imbrenda wrote:
> >>>> Ahh, so this is *just* intended to precede I/O done on the page,
> >>>> when a non-host entity is touching the memory?
> >>> yep
> >> OK, so we've got to do an action that precedes *all* I/O to a page.
> >> That's not too bad.
> >>
> >> I still don't understand how this could work generally, though
> >> There are lots of places where I/O is done to a page without
> >> either going through __test_set_page_writeback() or gup() with
> >> FOLL_PIN set.
> >>
> >> sendfile() is probably the best example of this:
> >>
> >> fd = open("/normal/ext4/file", O_RDONLY);
> >> sendfile(socket_fd, fd, &off, count);
> >>
> >> There's no gup in sight since the file doesn't have an address and
> >> it's not being written to so there's no writeback.
> >>
> >> How does sendfile work?
> >
> > Did you manage to see if sendfile works (or any other operation that
> > DMAs file-backed data without being preceded by a gup)?
>
> It's been a couple of weeks with no response on this.
sorry, I've been busy with things
> From where I'm standing, we have a hook in the core VM that can't
> possibly work with some existing kernel functionality and has
> virtually no chance of getting used on a second architecture.
it seems to work at least for us, so it does possibly work :)
regarding second architectures: when we started sending these patches
around, there has been interest from some other architectures, so
just because nobody else needs them now, it doesn't mean nobody will
use them ever. Moreover this is the only way for us to reasonably
implement this (see below).
> It sounds like there may need to be some additional work here, but
> should these hooks stay in for 5.7? Or, should we revert this patch
> and try again for 5.8?
I don't see why we should revert a patch that works as intended and
poses no overhead for non-users, whereas reverting it would break
functionality.
Now let me elaborate a little on the DMA API. There are some issues
with some of the bus types used on s390 when it comes to the DMA API.
Most I/O instructions on s390 need to allocate some small control blocks
for each operation, and those need to be under 2GB. Those control blocks
will be accessed directly by the hardware. The rest of the actual I/O
buffers have no restriction and can be anywhere (64 bits).
Setting the DMA mask to 2GB means that all other buffers will be
almost always bounced, which is unacceptable. Especially since there are
no bounce buffers set up for s390x hosts anyway (they are set up only in
protected guests (and not in normal guests), so this was also introduced
quite recently).
Also notice that, until now, there has been no actual need to use the
DMA API on most s390 device drivers, hence why it's not being used
there. I know that you are used to need the DMA API for DMA operations
otherwise Bad Thingsā¢ happen, but this is not the case on s390 (for
non-PCI devices at least).
So until the DMA API is fixed, there is no way to convert all the
drivers to the DMA API (which would be quite a lot of work in itself
already, but that's not the point here). A fix for the DMA API was
actually proposed by my colleague Halil several months ago now, but it
did not go through. His proposal was to allow architectures to override
the GFP flags for DMA allocations, to allow allocating some buffers
from some areas and some other buffers from other areas.
I hope this clarifies the matter a little :)
On 4/28/20 4:39 PM, Claudio Imbrenda wrote:
>> From where I'm standing, we have a hook in the core VM that can't
>> possibly work with some existing kernel functionality and has
>> virtually no chance of getting used on a second architecture.
> it seems to work at least for us, so it does possibly work :)
I think all you're saying is that it's been very lightly tested. :)
> regarding second architectures: when we started sending these patches
> around, there has been interest from some other architectures, so
> just because nobody else needs them now, it doesn't mean nobody will
> use them ever.
I was really interested in using them... until I looked at them.
Conceptually, they do something really useful, but the _implementation_
falls short of its promises.
I can't imagine ever using these hooks on x86.