2020-04-30 14:41:36

by Claudio Imbrenda

[permalink] [raw]
Subject: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

Inaccessible pages are pages that should not be accessed by any device,
and belong to a protected VM. If any such pages are passed to a
device, there will be I/O errors, which will not always be recoverable,
depending on the architecture and on the specific device.

CPU accesses to inaccessible pages are less problematic, since they are
always recoverable.

Page cache and direct I/O were fixed in a previous patch, in which a
architecture hook is provided to make the page accessible by I/O
devices.

One possible remaining path to sneak a protected page directly to a
device is sendfile and similar syscalls. Those syscalls take a page
directly from the page cache and give them directly to the device with
zero-copy. This bypasses both existing hooks in gup and in writeback.

This patch fixes the issue by adding a call to arch_make_page_accessible
in page_cache_pipe_buf_confirm, thus fixing the issue.

Notice that we only need to make sure the source is accessible, since
zero-copy only works in one direction, and CPU accesses to inaccessible
pages are not a problem. Pagecache-to-pagecache is also not a problem
since that is done by the CPU.

The hook has no overhead for architectures that do not need to deal
with inaccessible pages.

Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages")
Signed-off-by: Claudio Imbrenda <[email protected]>
---
fs/splice.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/fs/splice.c b/fs/splice.c
index 4735defc46ee..f026e0ce9acd 100644
--- a/fs/splice.c
+++ b/fs/splice.c
@@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
struct page *page = buf->page;
int err;

+ if (arch_make_page_accessible(page))
+ return -EIO;
+
if (!PageUptodate(page)) {
lock_page(page);

--
2.25.4


2020-04-30 20:07:01

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 30.04.20 16:38, Claudio Imbrenda wrote:
> Inaccessible pages are pages that should not be accessed by any device,
> and belong to a protected VM. If any such pages are passed to a
> device, there will be I/O errors, which will not always be recoverable,
> depending on the architecture and on the specific device.
>
> CPU accesses to inaccessible pages are less problematic, since they are
> always recoverable.
>
> Page cache and direct I/O were fixed in a previous patch, in which a
> architecture hook is provided to make the page accessible by I/O
> devices.
>
> One possible remaining path to sneak a protected page directly to a
> device is sendfile and similar syscalls. Those syscalls take a page
> directly from the page cache and give them directly to the device with
> zero-copy. This bypasses both existing hooks in gup and in writeback.
>
> This patch fixes the issue by adding a call to arch_make_page_accessible
> in page_cache_pipe_buf_confirm, thus fixing the issue.
>
> Notice that we only need to make sure the source is accessible, since
> zero-copy only works in one direction, and CPU accesses to inaccessible
> pages are not a problem. Pagecache-to-pagecache is also not a problem
> since that is done by the CPU.
>
> The hook has no overhead for architectures that do not need to deal
> with inaccessible pages.
>
> Fixes: f28d43636d6f ("mm/gup/writeback: add callbacks for inaccessible pages")

You should add a Reported-by Dave Hansen, I guess.

Acked-by: Christian Borntraeger <[email protected]>



> Signed-off-by: Claudio Imbrenda <[email protected]>
> ---
> fs/splice.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/fs/splice.c b/fs/splice.c
> index 4735defc46ee..f026e0ce9acd 100644
> --- a/fs/splice.c
> +++ b/fs/splice.c
> @@ -106,6 +106,9 @@ static int page_cache_pipe_buf_confirm(struct pipe_inode_info *pipe,
> struct page *page = buf->page;
> int err;
>
> + if (arch_make_page_accessible(page))
> + return -EIO;
> +
> if (!PageUptodate(page)) {
> lock_page(page);
>
>

2020-04-30 22:12:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

I was also wondering if Claudio was right about the debug patch having
races. I went to go look how the s390 code avoids races when pages go
from accessible->inaccessible.

Because, if if all of the traps are in place to transform pages from
inaccessible->accessible, the code *after* those traps is still
vulnerable. What *keeps* pages accessible?

The race avoidance is this, basically:

down_read(&gmap->mm->mmap_sem);
lock_page(page);
ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
...
> expected = expected_page_refs(page);
> if (!page_ref_freeze(page, expected))
> return -EBUSY;
> set_bit(PG_arch_1, &page->flags);
> rc = uv_call(0, (u64)uvcb);
> page_ref_unfreeze(page, expected);

... up_read(mmap_sem) / unlock_page() / unlock pte

I'm assuming that after the uv_call(), the page is inaccessible and I/O
devices will go boom if they touch the page.

The page_ref_freeze() ensures that references come between the
freeze/unfreeze are noticed, but it doesn't actually *stop* new ones for
users that hold references already. For the page cache, especially,
someone could do:

page = find_get_page();
arch_make_page_accessible();
lock_page();
... make_secure_pte();
unlock_page();
get_page();
// ^ OK because I have a ref
// do DMA on inaccessible page

Because the make_secure_pte() code isn't looking for a *specific*
'expected' value, it has no way of noticing that the extra ref snuck in
there.

I _think_ expected actually needs to be checked for having a specific
(low) value so that if there's a *possibility* of a reference holder
acquiring additional references, the page is known to be off-limits.
mm/migrate.c has a few examples of this, but I'm not quite sure how
bulletproof they are. Some of it appears to just be optimizations.



2020-04-30 22:22:27

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

One other thing... The gup code will not take references on ref-frozen
pages:

> static inline __must_check bool try_get_page(struct page *page)
> {
> page = compound_head(page);
> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> return false;
> page_ref_inc(page);
> return true;
> }

*But*, notice that the path that skips taking a ref is also a
WARN_ON_ONCE(). Basically, if you get to try_get_page() on a ref-frozen
page, it's considered buggy. This makes sense because you fundamentally
can't freeze refs on a page that might have more refs taken on it.

I think all the other users do this by ensuring that any PTE that could
be gup'd is set non-present before the refs are frozen and remote TLBs
are flushed which also ensures no GUPs are running. I don't know if the
s390 code has some other way of preventing GUPs, but leaving Present=1
PTEs while you freeze refs would be quite troublesome on x86.

2020-05-01 07:22:47

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 01.05.20 00:06, Dave Hansen wrote:
> I was also wondering if Claudio was right about the debug patch having
> races. I went to go look how the s390 code avoids races when pages go
> from accessible->inaccessible.
>
> Because, if if all of the traps are in place to transform pages from
> inaccessible->accessible, the code *after* those traps is still
> vulnerable. What *keeps* pages accessible?
>
> The race avoidance is this, basically:
>
> down_read(&gmap->mm->mmap_sem);
> lock_page(page);
> ptep = get_locked_pte(gmap->mm, uaddr, &ptelock);
> ...
>> expected = expected_page_refs(page);
>> if (!page_ref_freeze(page, expected))
>> return -EBUSY;
>> set_bit(PG_arch_1, &page->flags);
>> rc = uv_call(0, (u64)uvcb);
>> page_ref_unfreeze(page, expected);
>
> ... up_read(mmap_sem) / unlock_page() / unlock pte
>
> I'm assuming that after the uv_call(), the page is inaccessible and I/O
> devices will go boom if they touch the page.
>
> The page_ref_freeze() ensures that references come between the
> freeze/unfreeze are noticed, but it doesn't actually *stop* new ones for
> users that hold references already. For the page cache, especially,
> someone could do:
>
> page = find_get_page();
> arch_make_page_accessible();
> lock_page();
> ... make_secure_pte();

Not sure if I got your point here, but this make_secure_pte should bail
out because we actually do check for a calculated refcount value and return
-EBUSY. The find_get_page should have raised this refcount to a value that
would go beyond the expected value, No?


> unlock_page();
> get_page();
> // ^ OK because I have a ref
> // do DMA on inaccessible page
>
> Because the make_secure_pte() code isn't looking for a *specific*
> 'expected' value, it has no way of noticing that the extra ref snuck in
> there.

I think the expected calcution is actually doing that,giving back the minimum
value when no one else has any references that are valid for I/O.

But I might not have understood what you are trying to tell me?

>
> I _think_ expected actually needs to be checked for having a specific
> (low) value so that if there's a *possibility* of a reference holder
> acquiring additional references, the page is known to be off-limits.
> mm/migrate.c has a few examples of this, but I'm not quite sure how
> bulletproof they are. Some of it appears to just be optimizations.
>
>
>
b

2020-05-01 16:36:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On 5/1/20 12:18 AM, Christian Borntraeger wrote:
>> unlock_page();
>> get_page();
>> // ^ OK because I have a ref
>> // do DMA on inaccessible page
>>
>> Because the make_secure_pte() code isn't looking for a *specific*
>> 'expected' value, it has no way of noticing that the extra ref snuck in
>> there.
> I think the expected calcution is actually doing that,giving back the minimum
> value when no one else has any references that are valid for I/O.
>
> But I might not have understood what you are trying to tell me?

I was wrong. I was looking at migrate_page_move_mapping():

> int expected_count = expected_page_refs(mapping, page) + extra_count;
...
> xas_lock_irq(&xas);
> if (page_count(page) != expected_count || xas_load(&xas) != page) {
> xas_unlock_irq(&xas);
> return -EAGAIN;
> }
>
> if (!page_ref_freeze(page, expected_count)) {
> xas_unlock_irq(&xas);
> return -EAGAIN;
> }

I saw the check for page_count(page) *and* the page_ref_freeze() call.
My assumption was that both were needed. My assumption was wrong. (I
think the migrate_page_move_mapping() code may actually be doing a
superfluous check.)

The larger point, though, is that the s390 code ensures no extra
references exist upon entering make_secure_pte(), but it still has no
mechanism to prevent future, new references to page cache pages from
being created.

The one existing user of expected_page_refs() freezes the refs then
*removes* the page from the page cache (that's what the xas_lock_irq()
is for). That stops *new* refs from being acquired.

The s390 code is missing an equivalent mechanism.

One example:

page_freeze_refs();
// page->_count==0 now
find_get_page();
// ^ sees a "freed" page
page_unfreeze_refs();

find_get_page() will either fail to *find* the page because it will see
page->_refcount==0 think it is freed (not great), or it will
VM_BUG_ON_PAGE() in __page_cache_add_speculative().

My bigger point is that this patches doesn't systematically stop finding
page cache pages that are arch-inaccessible. This patch hits *one* of
those sites.

2020-05-04 13:46:34

by Ulrich Weigand

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
> The larger point, though, is that the s390 code ensures no extra
> references exist upon entering make_secure_pte(), but it still has no
> mechanism to prevent future, new references to page cache pages from
> being created.

Hi Dave, I worked with Claudio and Christian on the initial design
of our approach, so let me chime in here as well.

You're right that there is no mechanism to prevent new references,
but that's really never been the goal either. We're simply trying
to ensure that no I/O is ever done on a page that is in the "secure"
(or inaccessible) state. To do so, we rely on the assumption that
all code that starts I/O on a page cache page will *first*:
- mark the page as pending I/O by either taking an extra page
count, or by setting the Writeback flag; then:
- call arch_make_page_accessible(); then:
- start I/O; and only after I/O has finished:
- remove the "pending I/O" marker (Writeback and/or extra ref)

We thought we had identified all places where we needed to place
arch_make_page_accessible so that the above assumption is satisfied.
You've found at least two instances where this wasn't true (thanks!);
but I still think that this can be fixed by just adding those calls.

Now, if the above assumption holds, then I believe we're safe:
- before we make any page secure, we verify that it is not
"pending I/O" as defined above (neither Writeback flag, nor
and extra page count)
- *during* the process of making the page secure, we're protected
against any potential races due to changes in that status, since
we hold the page lock (and therefore the Writeback flag cannot
change), and we've frozen page references (so those cannot change).

This implies that before I/O has started, the page was made
accessible; and as long as the page is marked "pending I/O"
it will not be made inaccessible again.

> The one existing user of expected_page_refs() freezes the refs then
> *removes* the page from the page cache (that's what the xas_lock_irq()
> is for). That stops *new* refs from being acquired.
>
> The s390 code is missing an equivalent mechanism.
>
> One example:
>
> page_freeze_refs();
> // page->_count==0 now
> find_get_page();
> // ^ sees a "freed" page
> page_unfreeze_refs();
>
> find_get_page() will either fail to *find* the page because it will see
> page->_refcount==0 think it is freed (not great), or it will
> VM_BUG_ON_PAGE() in __page_cache_add_speculative().

I don't really see how that could happen; my understanding is that
page_freeze_refs simply causes potential users to spin and wait
until it is no longer frozen. For example, find_get_page will
in the end call down to find_get_entry, which does:

if (!page_cache_get_speculative(page))
goto repeat;

Am I misunderstanding anything here?

> My bigger point is that this patches doesn't systematically stop finding
> page cache pages that are arch-inaccessible. This patch hits *one* of
> those sites.

As I said above, that wasn't really the goal for our approach.

In particular, note that we *must* have secure pages present in the
page table of the secure guest (that is a requirement of the architecture;
note that the "secure" status doesn't just apply to the phyiscal page,
but a triple of "*this* host physical page is the secure backing store
of *this* guest physical page in *this* secure guest", which the HW/FW
tracks based on the specific page table entry).

As a consequence, the page really also has to remain present in the
page cache (I don't think Linux mm code would be able to handle the
case where a file-backed page is in the page table but not page cache).

I'm not sure what exactly the requirements for your use case are; if those
are significantly differently, maybe we can work together to find an
approach that works for both?

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU/Linux compilers and toolchain
[email protected]

2020-05-05 12:37:04

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On 5/4/20 6:41 AM, Ulrich Weigand wrote:
> On Fri, May 01, 2020 at 09:32:45AM -0700, Dave Hansen wrote:
>> The larger point, though, is that the s390 code ensures no extra
>> references exist upon entering make_secure_pte(), but it still has no
>> mechanism to prevent future, new references to page cache pages from
>> being created.
>
> Hi Dave, I worked with Claudio and Christian on the initial design
> of our approach, so let me chime in here as well.

Hi Ulrich!

> You're right that there is no mechanism to prevent new references,
> but that's really never been the goal either. We're simply trying
> to ensure that no I/O is ever done on a page that is in the "secure"
> (or inaccessible) state. To do so, we rely on the assumption that
> all code that starts I/O on a page cache page will *first*:
> - mark the page as pending I/O by either taking an extra page
> count, or by setting the Writeback flag; then:
> - call arch_make_page_accessible(); then:
> - start I/O; and only after I/O has finished:
> - remove the "pending I/O" marker (Writeback and/or extra ref)

Let's ignore writeback for a moment because get_page() is the more
general case. The locking sequence is:

1. get_page() (or equivalent) "locks out" a page from converting to
inaccessbile,
2. followed by a make_page_accessible() guarantees that the page
*stays* accessible until
3. I/O is safe in this region
4. put_page(), removes the "lock out", I/O now unsafe

They key is, though, the get_page() must happen before
make_page_accessible() and *every* place that acquires a new reference
needs a make_page_accessible().

try_get_page() is obviously one of those "new reference sites" and it
only has one call site outisde of the gup code: generic_pipe_buf_get(),
which is effectively patched by the patch that started this thread. The
fact that this one oddball site _and_ gup are patched now is a good sign.

*But*, I still don't know how that could work nicely:

> static inline __must_check bool try_get_page(struct page *page)
> {
> page = compound_head(page);
> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> return false;
> page_ref_inc(page);
> return true;
> }

If try_get_page() collides with a freeze_page_refs(), it'll hit the
WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
that warning is _actually_ valid since freeze_page_refs() isn't truly a
0 refcount. But, the fact that this hasn't been encountered means that
the testing here is potentially lacking.

> We thought we had identified all places where we needed to place
> arch_make_page_accessible so that the above assumption is satisfied.
> You've found at least two instances where this wasn't true (thanks!);
> but I still think that this can be fixed by just adding those calls.

Why do you think that's the extent of the problem? Because the crashes
stopped?

I'd feel a lot more comfortable if you explained the audits that you've
performed or _why_ you think that. What I've heard thus far is
basically that you've been able to boot a guest and you're ready to ship
this code.

>> The one existing user of expected_page_refs() freezes the refs then
>> *removes* the page from the page cache (that's what the xas_lock_irq()
>> is for). That stops *new* refs from being acquired.
>>
>> The s390 code is missing an equivalent mechanism.
>>
>> One example:
>>
>> page_freeze_refs();
>> // page->_count==0 now
>> find_get_page();
>> // ^ sees a "freed" page
>> page_unfreeze_refs();
>>
>> find_get_page() will either fail to *find* the page because it will see
>> page->_refcount==0 think it is freed (not great), or it will
>> VM_BUG_ON_PAGE() in __page_cache_add_speculative().
>
> I don't really see how that could happen; my understanding is that
> page_freeze_refs simply causes potential users to spin and wait
> until it is no longer frozen. For example, find_get_page will
> in the end call down to find_get_entry, which does:
>
> if (!page_cache_get_speculative(page))
> goto repeat;
>
> Am I misunderstanding anything here?

Dang, I think I was looking at the TINY_RCU code, which is unfortunately
first in page_cache_get_speculative(). It doesn't support PREEMPT or
SMP, so it can take some shortcuts.

But, with regular RCU, you're right, it _does_ appear that it would hit
that retry loop, but then it would *succeed* in getting a reference. In
the end, this just supports the sequence I wrote above:
arch_make_page_accessible() is only valid when called with an elevated
refcount and the refcount must be held to lock out make_secure_pte().

>> My bigger point is that this patches doesn't systematically stop finding
>> page cache pages that are arch-inaccessible. This patch hits *one* of
>> those sites.
>
> As I said above, that wasn't really the goal for our approach.
>
> In particular, note that we *must* have secure pages present in the
> page table of the secure guest (that is a requirement of the architecture;
> note that the "secure" status doesn't just apply to the phyiscal page,
> but a triple of "*this* host physical page is the secure backing store
> of *this* guest physical page in *this* secure guest", which the HW/FW
> tracks based on the specific page table entry).
>
> As a consequence, the page really also has to remain present in the
> page cache (I don't think Linux mm code would be able to handle the
> case where a file-backed page is in the page table but not page cache).

It actually happens transiently, at least. I believe inode truncation
removes from the page cache before it zaps the PTEs.

> I'm not sure what exactly the requirements for your use case are; if those
> are significantly differently, maybe we can work together to find an
> approach that works for both?

I'm actually trying to figure out what to do with AMD's SEV. The
current state isn't great and, for instance, allows userspace to read
guest ciphertext. But, the pages come and go out of the encrypted state
at the behest of the guest, and the kernel needs *some* mapping for the
pages to do things like instruction emulation.

I started looking at s390 because someone said there was a similar
problem there and suggested the hooks might work. I couldn't figure out
how they worked comprehensively on s390, and that's how we got here.

2020-05-05 13:59:44

by Ulrich Weigand

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
> > You're right that there is no mechanism to prevent new references,
> > but that's really never been the goal either. We're simply trying
> > to ensure that no I/O is ever done on a page that is in the "secure"
> > (or inaccessible) state. To do so, we rely on the assumption that
> > all code that starts I/O on a page cache page will *first*:
> > - mark the page as pending I/O by either taking an extra page
> > count, or by setting the Writeback flag; then:
> > - call arch_make_page_accessible(); then:
> > - start I/O; and only after I/O has finished:
> > - remove the "pending I/O" marker (Writeback and/or extra ref)
>
> Let's ignore writeback for a moment because get_page() is the more
> general case. The locking sequence is:
>
> 1. get_page() (or equivalent) "locks out" a page from converting to
> inaccessbile,
> 2. followed by a make_page_accessible() guarantees that the page
> *stays* accessible until
> 3. I/O is safe in this region
> 4. put_page(), removes the "lock out", I/O now unsafe

Yes, exactly.

> They key is, though, the get_page() must happen before
> make_page_accessible() and *every* place that acquires a new reference
> needs a make_page_accessible().

Well, sort of: every place that acquires a new reference *and then
performs I/O* needs a make_page_accessible(). There seem to be a
lot of plain get_page() calls that aren't related to I/O.

> try_get_page() is obviously one of those "new reference sites" and it
> only has one call site outisde of the gup code: generic_pipe_buf_get(),
> which is effectively patched by the patch that started this thread. The
> fact that this one oddball site _and_ gup are patched now is a good sign.
>
> *But*, I still don't know how that could work nicely:
>
> > static inline __must_check bool try_get_page(struct page *page)
> > {
> > page = compound_head(page);
> > if (WARN_ON_ONCE(page_ref_count(page) <= 0))
> > return false;
> > page_ref_inc(page);
> > return true;
> > }
>
> If try_get_page() collides with a freeze_page_refs(), it'll hit the
> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
> that warning is _actually_ valid since freeze_page_refs() isn't truly a
> 0 refcount. But, the fact that this hasn't been encountered means that
> the testing here is potentially lacking.

This is indeed interesting. In particular if you compare try_get_page
with try_get_compound_head in gup.c, which does instead:

if (WARN_ON_ONCE(page_ref_count(head) < 0))
return NULL;

which seems more reasonable to me, given the presence of the
page_ref_freeze method. So I'm not sure why try_get_page has <= 0.

I think I understand why we haven't seen this in testing: all the
places in gup.c where try_get_page is called hold the pte lock;
and in the usual case, the pte lock itself already suffices to
lock out make_secure_pte before it even tries to use page_ref_freeze.
(The intent of holding the pte lock there was really to ensure that
the PTE entry is and remains valid throughout the execution of
the ultravisor call, which will look at the PTE entry.)

However, I guess we could construct cases where the pte lock doesn't
suffice to prevent the try_get_page warning: if we create a shared
mapping of the secure guest backing store file into a second process.
That doesn't ever happen in normal qemu operation, so that's likely
why we haven't seen that case.

> > We thought we had identified all places where we needed to place
> > arch_make_page_accessible so that the above assumption is satisfied.
> > You've found at least two instances where this wasn't true (thanks!);
> > but I still think that this can be fixed by just adding those calls.
>
> Why do you think that's the extent of the problem? Because the crashes
> stopped?
>
> I'd feel a lot more comfortable if you explained the audits that you've
> performed or _why_ you think that. What I've heard thus far is
> basically that you've been able to boot a guest and you're ready to ship
> this code.

Not sure if you can really call this an "audit", but we were really
coming from identifying places where I/O can happen on a page cache
page, and everything we found (except writeback) went through gup.
We obviously missed the sendfile case here; not sure what the best
way would be to verify nothing else was missed.

> But, with regular RCU, you're right, it _does_ appear that it would hit
> that retry loop, but then it would *succeed* in getting a reference. In
> the end, this just supports the sequence I wrote above:
> arch_make_page_accessible() is only valid when called with an elevated
> refcount and the refcount must be held to lock out make_secure_pte().

Yes, exactly. That's what comment ahead of our arch_make_page_accesible
says: To be called with the page locked or with an extra reference!
(Either is enough to lock out make_secure_pte.)

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU/Linux compilers and toolchain
[email protected]

2020-05-05 14:04:01

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 14:34, Dave Hansen wrote:[...]
>> I'm not sure what exactly the requirements for your use case are; if those
>> are significantly differently, maybe we can work together to find an
>> approach that works for both?
>
> I'm actually trying to figure out what to do with AMD's SEV. The
> current state isn't great and, for instance, allows userspace to read
> guest ciphertext. But, the pages come and go out of the encrypted state
> at the behest of the guest, and the kernel needs *some* mapping for the
> pages to do things like instruction emulation.
>
> I started looking at s390 because someone said there was a similar
> problem there and suggested the hooks might work. I couldn't figure out
> how they worked comprehensively on s390, and that's how we got here.

We are certainly not married to our approach. I would happily extend/change
this to anything that works for your case and the s390 case. So can you outline
your requirements a bit more?

2020-05-05 14:06:57

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 15:55, Ulrich Weigand wrote:
> On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
>> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
>>> You're right that there is no mechanism to prevent new references,
>>> but that's really never been the goal either. We're simply trying
>>> to ensure that no I/O is ever done on a page that is in the "secure"
>>> (or inaccessible) state. To do so, we rely on the assumption that
>>> all code that starts I/O on a page cache page will *first*:
>>> - mark the page as pending I/O by either taking an extra page
>>> count, or by setting the Writeback flag; then:
>>> - call arch_make_page_accessible(); then:
>>> - start I/O; and only after I/O has finished:
>>> - remove the "pending I/O" marker (Writeback and/or extra ref)
>>
>> Let's ignore writeback for a moment because get_page() is the more
>> general case. The locking sequence is:
>>
>> 1. get_page() (or equivalent) "locks out" a page from converting to
>> inaccessbile,
>> 2. followed by a make_page_accessible() guarantees that the page
>> *stays* accessible until
>> 3. I/O is safe in this region
>> 4. put_page(), removes the "lock out", I/O now unsafe
>
> Yes, exactly.
>
>> They key is, though, the get_page() must happen before
>> make_page_accessible() and *every* place that acquires a new reference
>> needs a make_page_accessible().
>
> Well, sort of: every place that acquires a new reference *and then
> performs I/O* needs a make_page_accessible(). There seem to be a
> lot of plain get_page() calls that aren't related to I/O.
>
>> try_get_page() is obviously one of those "new reference sites" and it
>> only has one call site outisde of the gup code: generic_pipe_buf_get(),
>> which is effectively patched by the patch that started this thread. The
>> fact that this one oddball site _and_ gup are patched now is a good sign.
>>
>> *But*, I still don't know how that could work nicely:
>>
>>> static inline __must_check bool try_get_page(struct page *page)
>>> {
>>> page = compound_head(page);
>>> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>> return false;
>>> page_ref_inc(page);
>>> return true;
>>> }
>>
>> If try_get_page() collides with a freeze_page_refs(), it'll hit the
>> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
>> that warning is _actually_ valid since freeze_page_refs() isn't truly a
>> 0 refcount. But, the fact that this hasn't been encountered means that
>> the testing here is potentially lacking.
>
> This is indeed interesting. In particular if you compare try_get_page
> with try_get_compound_head in gup.c, which does instead:
>
> if (WARN_ON_ONCE(page_ref_count(head) < 0))
> return NULL;
>
> which seems more reasonable to me, given the presence of the
> page_ref_freeze method. So I'm not sure why try_get_page has <= 0.


Just looked at
commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function

which says:
Also like 'get_page()', you can't use this function unless you already
had a reference to the page. The intent is that you can use this
exactly like get_page(), but in situations where you want to limit the
maximum reference count.

The code currently does an unconditional WARN_ON_ONCE() if we ever hit
the reference count issues (either zero or negative), as a notification
that the conditional non-increment actually happened.

If try_get_page must not be called with an existing reference, that means
that when we call it the page reference is already higher and our freeze
will never succeed. That would imply that we cannot trigger this. No?

2020-05-05 14:07:49

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 16:01, Christian Borntraeger wrote:
>
>
> On 05.05.20 15:55, Ulrich Weigand wrote:
>> On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
>>> On 5/4/20 6:41 AM, Ulrich Weigand wrote:
>>>> You're right that there is no mechanism to prevent new references,
>>>> but that's really never been the goal either. We're simply trying
>>>> to ensure that no I/O is ever done on a page that is in the "secure"
>>>> (or inaccessible) state. To do so, we rely on the assumption that
>>>> all code that starts I/O on a page cache page will *first*:
>>>> - mark the page as pending I/O by either taking an extra page
>>>> count, or by setting the Writeback flag; then:
>>>> - call arch_make_page_accessible(); then:
>>>> - start I/O; and only after I/O has finished:
>>>> - remove the "pending I/O" marker (Writeback and/or extra ref)
>>>
>>> Let's ignore writeback for a moment because get_page() is the more
>>> general case. The locking sequence is:
>>>
>>> 1. get_page() (or equivalent) "locks out" a page from converting to
>>> inaccessbile,
>>> 2. followed by a make_page_accessible() guarantees that the page
>>> *stays* accessible until
>>> 3. I/O is safe in this region
>>> 4. put_page(), removes the "lock out", I/O now unsafe
>>
>> Yes, exactly.
>>
>>> They key is, though, the get_page() must happen before
>>> make_page_accessible() and *every* place that acquires a new reference
>>> needs a make_page_accessible().
>>
>> Well, sort of: every place that acquires a new reference *and then
>> performs I/O* needs a make_page_accessible(). There seem to be a
>> lot of plain get_page() calls that aren't related to I/O.
>>
>>> try_get_page() is obviously one of those "new reference sites" and it
>>> only has one call site outisde of the gup code: generic_pipe_buf_get(),
>>> which is effectively patched by the patch that started this thread. The
>>> fact that this one oddball site _and_ gup are patched now is a good sign.
>>>
>>> *But*, I still don't know how that could work nicely:
>>>
>>>> static inline __must_check bool try_get_page(struct page *page)
>>>> {
>>>> page = compound_head(page);
>>>> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>>> return false;
>>>> page_ref_inc(page);
>>>> return true;
>>>> }
>>>
>>> If try_get_page() collides with a freeze_page_refs(), it'll hit the
>>> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
>>> that warning is _actually_ valid since freeze_page_refs() isn't truly a
>>> 0 refcount. But, the fact that this hasn't been encountered means that
>>> the testing here is potentially lacking.
>>
>> This is indeed interesting. In particular if you compare try_get_page
>> with try_get_compound_head in gup.c, which does instead:
>>
>> if (WARN_ON_ONCE(page_ref_count(head) < 0))
>> return NULL;
>>
>> which seems more reasonable to me, given the presence of the
>> page_ref_freeze method. So I'm not sure why try_get_page has <= 0.
>
>
> Just looked at
> commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function
>
> which says:
> Also like 'get_page()', you can't use this function unless you already
> had a reference to the page. The intent is that you can use this
> exactly like get_page(), but in situations where you want to limit the
> maximum reference count.
>
> The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> the reference count issues (either zero or negative), as a notification
> that the conditional non-increment actually happened.
>
> If try_get_page must not be called with an existing reference, that means
s/not//
> that when we call it the page reference is already higher and our freeze
> will never succeed. That would imply that we cannot trigger this. No?
>

2020-05-05 14:26:59

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On 5/5/20 7:00 AM, Christian Borntraeger wrote:
> We are certainly not married to our approach. I would happily extend/change
> this to anything that works for your case and the s390 case. So can you outline
> your requirements a bit more?

For SEV, the guest define which pages are encrypted or not. You could
theoretically do DMA to them or have the CPU access their contents, but
you'd get either get ciphertext for reads, or data corruption and loss
of cache coherency for writes. That's not so cool.

Ideally, we would stop the CPU from ever accessing those pages by
unmapping them. But, the pages go in and out of the encrypted state and
the host really needs to be *sure* about what's going on before it
restores its mapping and messes with the page. That includes situations
where someone does a gup, starts an I/O to an unencrypted page, then the
guest tries to convert that page over to being encrypted.

So, the requirements are:

1. Allow host-side DMA and CPU access to shared pages
2. Stop host-side DMA and CPU access to encrypted pages
3. Allow pages to be converted between the states at the request of the
guest

Stopping the DMA is pretty easy, even across the gazillions of drivers
in the tree because even random ethernet drivers do stuff like:

txdr->buffer_info[i].dma =
dma_map_single(&pdev->dev, skb->data, skb->len,
DMA_TO_DEVICE);

So the DMA can be stopped at the mapping layer. It's a *LOT* easier to
catch there since the IOMMUs already provide isolation between the I/O
and CPU address spaces.

2020-05-05 14:34:21

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 16:24, Dave Hansen wrote:
> On 5/5/20 7:00 AM, Christian Borntraeger wrote:
>> We are certainly not married to our approach. I would happily extend/change
>> this to anything that works for your case and the s390 case. So can you outline
>> your requirements a bit more?
>
> For SEV, the guest define which pages are encrypted or not. You could
> theoretically do DMA to them or have the CPU access their contents, but
> you'd get either get ciphertext for reads, or data corruption and loss
> of cache coherency for writes. That's not so cool.
>
> Ideally, we would stop the CPU from ever accessing those pages by
> unmapping them. But, the pages go in and out of the encrypted state and
> the host really needs to be *sure* about what's going on before it
> restores its mapping and messes with the page. That includes situations
> where someone does a gup, starts an I/O to an unencrypted page, then the
> guest tries to convert that page over to being encrypted.
>
> So, the requirements are:
>
> 1. Allow host-side DMA and CPU access to shared pages
> 2. Stop host-side DMA and CPU access to encrypted pages
> 3. Allow pages to be converted between the states at the request of the
> guest
>
> Stopping the DMA is pretty easy, even across the gazillions of drivers
> in the tree because even random ethernet drivers do stuff like:
>
> txdr->buffer_info[i].dma =
> dma_map_single(&pdev->dev, skb->data, skb->len,
> DMA_TO_DEVICE);
>
> So the DMA can be stopped at the mapping layer. It's a *LOT* easier to
> catch there since the IOMMUs already provide isolation between the I/O
> and CPU address spaces.

And your problem is that the guest could convert this after the dma_map?
So you looked into our code if this would help?

2020-05-05 14:35:56

by Ulrich Weigand

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
> > Just looked at
> > commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function
> >
> > which says:
> > Also like 'get_page()', you can't use this function unless you already
> > had a reference to the page. The intent is that you can use this
> > exactly like get_page(), but in situations where you want to limit the
> > maximum reference count.
> >
> > The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> > the reference count issues (either zero or negative), as a notification
> > that the conditional non-increment actually happened.
> >
> > If try_get_page must not be called with an existing reference, that means
> s/not//
> > that when we call it the page reference is already higher and our freeze
> > will never succeed. That would imply that we cannot trigger this. No?

Well, my understanding is that the "existing" reference may be one of the
references that is expected by our freeze code, in particular in gup the
existing reference is simply the one from the pte. So in this case our
freeze *would* succeeed.

Bye,
Ulrich

--
Dr. Ulrich Weigand
GNU/Linux compilers and toolchain
[email protected]

2020-05-05 14:37:46

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On 5/5/20 7:31 AM, Christian Borntraeger wrote:
>> So, the requirements are:
>>
>> 1. Allow host-side DMA and CPU access to shared pages
>> 2. Stop host-side DMA and CPU access to encrypted pages
>> 3. Allow pages to be converted between the states at the request of the
>> guest
>>
>> Stopping the DMA is pretty easy, even across the gazillions of drivers
>> in the tree because even random ethernet drivers do stuff like:
>>
>> txdr->buffer_info[i].dma =
>> dma_map_single(&pdev->dev, skb->data, skb->len,
>> DMA_TO_DEVICE);
>>
>> So the DMA can be stopped at the mapping layer. It's a *LOT* easier to
>> catch there since the IOMMUs already provide isolation between the I/O
>> and CPU address spaces.
> And your problem is that the guest could convert this after the dma_map?
> So you looked into our code if this would help?

Yep, it seemed like a close-enough problem.

2020-05-05 14:42:02

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 16:34, Dave Hansen wrote:
> On 5/5/20 7:31 AM, Christian Borntraeger wrote:
>>> So, the requirements are:
>>>
>>> 1. Allow host-side DMA and CPU access to shared pages
>>> 2. Stop host-side DMA and CPU access to encrypted pages
>>> 3. Allow pages to be converted between the states at the request of the
>>> guest
>>>
>>> Stopping the DMA is pretty easy, even across the gazillions of drivers
>>> in the tree because even random ethernet drivers do stuff like:
>>>
>>> txdr->buffer_info[i].dma =
>>> dma_map_single(&pdev->dev, skb->data, skb->len,
>>> DMA_TO_DEVICE);
>>>
>>> So the DMA can be stopped at the mapping layer. It's a *LOT* easier to
>>> catch there since the IOMMUs already provide isolation between the I/O
>>> and CPU address spaces.
>> And your problem is that the guest could convert this after the dma_map?
>> So you looked into our code if this would help?
>
> Yep, it seemed like a close-enough problem.

Is there a way to prevent the guest from switching the state? We also have 2
variants of secure pages
1. those that are secure and when the host accesses them will be encrypted
2. those that are marked by the guest as shared. When you look at
arch_make_page_accessible we first try to "pin" the shared state. So the guest
trying to "unshare" such a page would trigger an exit that we can handle.

2020-05-05 14:52:42

by Christian Borntraeger

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages



On 05.05.20 16:33, Ulrich Weigand wrote:
> On Tue, May 05, 2020 at 04:03:00PM +0200, Christian Borntraeger wrote:
>>> Just looked at
>>> commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function
>>>
>>> which says:
>>> Also like 'get_page()', you can't use this function unless you already
>>> had a reference to the page. The intent is that you can use this
>>> exactly like get_page(), but in situations where you want to limit the
>>> maximum reference count.
>>>
>>> The code currently does an unconditional WARN_ON_ONCE() if we ever hit
>>> the reference count issues (either zero or negative), as a notification
>>> that the conditional non-increment actually happened.
>>>
>>> If try_get_page must not be called with an existing reference, that means
>> s/not//
>>> that when we call it the page reference is already higher and our freeze
>>> will never succeed. That would imply that we cannot trigger this. No?
>
> Well, my understanding is that the "existing" reference may be one of the
> references that is expected by our freeze code, in particular in gup the
> existing reference is simply the one from the pte. So in this case our
> freeze *would* succeeed.

If thats the case then "<0" looks indeed better than "<=0" for the check.
I think try_get_page was never meant to exclude a parallel page_ref_freeze/unfreeze.

2020-05-05 14:59:50

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/1] fs/splice: add missing callback for inaccessible pages

On 5/5/20 7:01 AM, Christian Borntraeger wrote:
> On 05.05.20 15:55, Ulrich Weigand wrote:
>> On Tue, May 05, 2020 at 05:34:45AM -0700, Dave Hansen wrote:
>>>> static inline __must_check bool try_get_page(struct page *page)
>>>> {
>>>> page = compound_head(page);
>>>> if (WARN_ON_ONCE(page_ref_count(page) <= 0))
>>>> return false;
>>>> page_ref_inc(page);
>>>> return true;
>>>> }
>>>
>>> If try_get_page() collides with a freeze_page_refs(), it'll hit the
>>> WARN_ON_ONCE(), which is surely there for a good reason. I'm not sure
>>> that warning is _actually_ valid since freeze_page_refs() isn't truly a
>>> 0 refcount. But, the fact that this hasn't been encountered means that
>>> the testing here is potentially lacking.
>>
>> This is indeed interesting. In particular if you compare try_get_page
>> with try_get_compound_head in gup.c, which does instead:
>>
>> if (WARN_ON_ONCE(page_ref_count(head) < 0))
>> return NULL;
>>
>> which seems more reasonable to me, given the presence of the
>> page_ref_freeze method. So I'm not sure why try_get_page has <= 0.
>
> Just looked at
> commit 88b1a17dfc3ed7728316478fae0f5ad508f50397 mm: add 'try_get_page()' helper function
>
> which says:
> Also like 'get_page()', you can't use this function unless you already
> had a reference to the page. The intent is that you can use this
> exactly like get_page(), but in situations where you want to limit the
> maximum reference count.
>
> The code currently does an unconditional WARN_ON_ONCE() if we ever hit
> the reference count issues (either zero or negative), as a notification
> that the conditional non-increment actually happened.
>
> If try_get_page must be called with an existing reference, that means
> that when we call it the page reference is already higher and our freeze
> will never succeed. That would imply that we cannot trigger this. No?

For gup, we hold the page table lock over the try_grab_page(). That
ensures that nobody can drop the reference while try_grab_page() is in
progress. The migration page_ref_freeze() code also never races with
this because it first shoots down the PTEs before freezing refs.

My worry with the s390 code is that it leaves the PTEs in place while
freezing refs. This seems new, otherwise we would have been tripping
the gup warning.

For the page cache, there's a reference taken because of the page's
presence in the page cache xarray. But, the page cache uses
page_cache_get_speculative(), not try_grab_page(). It doesn't have the
warning on the <=0 refcount.

Either way, I agree that the try_get_page()
WARN_ON_ONCE(page_ref_count(page) <= 0) is looking fishy.