On Thu, Sep 21, 2023, [email protected] wrote:
> From: Isaku Yamahata <[email protected]>
>
> Although kvm_gmem_punch_hole() keeps all pages in mapping on punching hole,
> it's common expectation that pages are truncated. Truncate pages on
> punching hole. As page contents can be encrypted, avoid zeroing partial
> folio by refusing partial punch hole.
>
> Signed-off-by: Isaku Yamahata <[email protected]>
> ---
> virt/kvm/guest_mem.c | 14 ++++++++++++--
> 1 file changed, 12 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> index a819367434e9..01fb4ca861d0 100644
> --- a/virt/kvm/guest_mem.c
> +++ b/virt/kvm/guest_mem.c
> @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> {
> struct list_head *gmem_list = &inode->i_mapping->private_list;
> + struct address_space *mapping = inode->i_mapping;
> pgoff_t start = offset >> PAGE_SHIFT;
> pgoff_t end = (offset + len) >> PAGE_SHIFT;
> struct kvm_gmem *gmem;
>
> + /*
> + * punch hole may result in zeroing partial area. As pages can be
> + * encrypted, prohibit zeroing partial area.
> + */
> + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> + return -EINVAL;
This should be unnecessary, kvm_gmem_fallocate() does

	if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
		return -EINVAL;

before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
fails, then that code needs to be fixed. I'll run your test to double-check,
but AFAICT this is unnecessary.
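
For reference, here's roughly what the caller looks like (reconstructed from
memory of the series, so details may differ from the exact code in your tree):

	static long kvm_gmem_fallocate(struct file *file, int mode, loff_t offset,
				       loff_t len)
	{
		int ret;

		if (!(mode & FALLOC_FL_KEEP_SIZE))
			return -EOPNOTSUPP;

		if (mode & ~(FALLOC_FL_KEEP_SIZE | FALLOC_FL_PUNCH_HOLE))
			return -EOPNOTSUPP;

		/* Misaligned offsets/lengths never reach kvm_gmem_punch_hole(). */
		if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
			return -EINVAL;

		if (mode & FALLOC_FL_PUNCH_HOLE)
			ret = kvm_gmem_punch_hole(file_inode(file), offset, len);
		else
			ret = kvm_gmem_allocate(file_inode(file), offset, len);

		if (!ret)
			file_modified(file);

		return ret;
	}
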
> +
> /*
> * Bindings must stable across invalidation to ensure the start+end
> * are balanced.
> */
> - filemap_invalidate_lock(inode->i_mapping);
> + filemap_invalidate_lock(mapping);
>
> list_for_each_entry(gmem, gmem_list, entry) {
> kvm_gmem_invalidate_begin(gmem, start, end);
> kvm_gmem_invalidate_end(gmem, start, end);
> }
>
> - filemap_invalidate_unlock(inode->i_mapping);
> + truncate_inode_pages_range(mapping, offset, offset + len - 1);
The truncate needs to happen between begin() and end(), otherwise KVM can create
mappings to the memory between end() and truncate().
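
I.e. the flow needs to be something along these lines (untested, just to
illustrate the ordering):

	filemap_invalidate_lock(mapping);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_begin(gmem, start, end);

	truncate_inode_pages_range(mapping, offset, offset + len - 1);

	list_for_each_entry(gmem, gmem_list, entry)
		kvm_gmem_invalidate_end(gmem, start, end);

	filemap_invalidate_unlock(mapping);
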
> +
> + filemap_invalidate_unlock(mapping);
>
> return 0;
> }
> --
> 2.25.1
>

On Thu, Sep 21, 2023, Sean Christopherson wrote:
> On Thu, Sep 21, 2023, [email protected] wrote:
> > From: Isaku Yamahata <[email protected]>
> >
> > Although kvm_gmem_punch_hole() keeps all pages in mapping on punching hole,
> > it's common expectation that pages are truncated. Truncate pages on
> > punching hole. As page contents can be encrypted, avoid zeroing partial
> > folio by refusing partial punch hole.
> >
> > Signed-off-by: Isaku Yamahata <[email protected]>
> > ---
> > virt/kvm/guest_mem.c | 14 ++++++++++++--
> > 1 file changed, 12 insertions(+), 2 deletions(-)
> >
> > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > index a819367434e9..01fb4ca861d0 100644
> > --- a/virt/kvm/guest_mem.c
> > +++ b/virt/kvm/guest_mem.c
> > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > {
> > struct list_head *gmem_list = &inode->i_mapping->private_list;
> > + struct address_space *mapping = inode->i_mapping;
> > pgoff_t start = offset >> PAGE_SHIFT;
> > pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > struct kvm_gmem *gmem;
> >
> > + /*
> > + * punch hole may result in zeroing partial area. As pages can be
> > + * encrypted, prohibit zeroing partial area.
> > + */
> > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > + return -EINVAL;
>
> This should be unnecessary, kvm_gmem_fallocate() does
>
> if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> return -EINVAL;
>
> before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
> fails, then that code needs to be fixed. I'll run your test to double-check,
> but AFAICT this is unnecessary.
I confirmed that the testcase passes without the extra checks. Just to close the
loop, what prompted adding more checks to kvm_gmem_punch_hole()?

On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote:
> On Thu, Sep 21, 2023, Sean Christopherson wrote:
> > On Thu, Sep 21, 2023, [email protected] wrote:
> > > From: Isaku Yamahata <[email protected]>
> > >
> > > Although kvm_gmem_punch_hole() keeps all pages in mapping on punching hole,
> > > it's common expectation that pages are truncated. Truncate pages on
> > > punching hole. As page contents can be encrypted, avoid zeroing partial
> > > folio by refusing partial punch hole.
> > >
> > > Signed-off-by: Isaku Yamahata <[email protected]>
> > > ---
> > > virt/kvm/guest_mem.c | 14 ++++++++++++--
> > > 1 file changed, 12 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > > index a819367434e9..01fb4ca861d0 100644
> > > --- a/virt/kvm/guest_mem.c
> > > +++ b/virt/kvm/guest_mem.c
> > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > > {
> > > struct list_head *gmem_list = &inode->i_mapping->private_list;
> > > + struct address_space *mapping = inode->i_mapping;
> > > pgoff_t start = offset >> PAGE_SHIFT;
> > > pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > > struct kvm_gmem *gmem;
> > >
> > > + /*
> > > + * punch hole may result in zeroing partial area. As pages can be
> > > + * encrypted, prohibit zeroing partial area.
> > > + */
> > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > > + return -EINVAL;
> >
> > This should be unnecessary, kvm_gmem_fallocate() does
> >
> > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > return -EINVAL;
> >
> > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
> > fails, then that code needs to be fixed. I'll run your test to double-check,
> > but AFAICT this is unnecessary.
>
> I confirmed that the testcase passes without the extra checks. Just to close the
> loop, what prompted adding more checks to kvm_gmem_punch_hole()?
I don't know if it's the same issue that Isaku ran into, but for SNP we
hit a similar issue with the truncate_inode_pages_range(lstart, lend) call.
The issue in that case was a bit more subtle:
- userspace does a hole-punch on a 4K range of its gmem FD, which happens
to be backed by a 2MB folio.
- truncate_inode_pages_range() gets called for that 4K range
- truncate_inode_pages_range() does special handling on the folios at the
start/end of the range in case they are partial and passes these to
truncate_inode_partial_folio(folio, lstart, lend). In this case, there's
just the 1 backing folio. But it *still* gets the special treatment, and
so gets passed to truncate_inode_partial_folio().
- truncate_inode_partial_folio() will then zero that 4K range, even though
it is page-aligned, based on the following rationale in the comments:

	/*
	 * We may be zeroing pages we're about to discard, but it avoids
	 * doing a complex calculation here, and then doing the zeroing
	 * anyway if the page split fails.
	 */
	folio_zero_range(folio, offset, length);

- after that, .invalidate_folio callback is issued, then the folio is split,
and the caller (truncate_inode_pages_range()) does another pass through
the whole range and can free the now-split folio then .free_folio callbacks
are issued.
Because of that, we can't rely on .invalidate_folio/.free_folio to handle
putting the page back into a normal host-accessible state, because the
zero'ing will happen beforehand. That's why we ended up needing to do this
for SNP patches to make sure arch-specific invalidation callbacks are issued
before the truncation occurs:
https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233
I'd planned to post those as a separate RFC to discuss, but when I came across
this it seemed like it might be relevant to what the TDX folks may have run
into here.
If not for the zero'ing logic mentioned above, for SNP at least, the
.free_folio() callback ends up working pretty nicely for both truncation and
fput(), and it even plays nicely with the live update use-case where the
destination gmem instance shares the inode->i_mapping: since iput() won't
trigger truncate_inode_pages_final() until the last reference goes away, we
don't have to do anything special in kvm_gmem_release() to determine when we
should/shouldn't issue the arch-invalidations to clean up things like the
RMP table.
It seems like the above zero'ing logic could be reworked to only zero non-page
aligned ranges (as the comments above truncate_inode_pages_range() claim
should be the case), which would avoid the issue for the gmem use-case. But I
wonder if some explicit "dont-zero-these-pages" flag might be more robust.
Or maybe there's some other way we should be going about this?
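
For the first option, the rework could be about as simple as this (purely
illustrative and untested; a real version would probably still want to zero
just the unaligned head/tail sub-pages rather than skipping entirely):

	/*
	 * Hypothetical: skip the unconditional zeroing when the sub-range is
	 * page-aligned. Those pages either get freed once the folio is
	 * split, or, if the split fails, are left intact, which is what gmem
	 * wants for guest-private/encrypted contents.
	 */
	if (offset & ~PAGE_MASK || length & ~PAGE_MASK)
		folio_zero_range(folio, offset, length);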

On Thu, Oct 05, 2023, Michael Roth wrote:
> On Thu, Sep 21, 2023 at 02:34:46PM -0700, Sean Christopherson wrote:
> > On Thu, Sep 21, 2023, Sean Christopherson wrote:
> > > > diff --git a/virt/kvm/guest_mem.c b/virt/kvm/guest_mem.c
> > > > index a819367434e9..01fb4ca861d0 100644
> > > > --- a/virt/kvm/guest_mem.c
> > > > +++ b/virt/kvm/guest_mem.c
> > > > @@ -130,22 +130,32 @@ static void kvm_gmem_invalidate_end(struct kvm_gmem *gmem, pgoff_t start,
> > > > static long kvm_gmem_punch_hole(struct inode *inode, loff_t offset, loff_t len)
> > > > {
> > > > struct list_head *gmem_list = &inode->i_mapping->private_list;
> > > > + struct address_space *mapping = inode->i_mapping;
> > > > pgoff_t start = offset >> PAGE_SHIFT;
> > > > pgoff_t end = (offset + len) >> PAGE_SHIFT;
> > > > struct kvm_gmem *gmem;
> > > >
> > > > + /*
> > > > + * punch hole may result in zeroing partial area. As pages can be
> > > > + * encrypted, prohibit zeroing partial area.
> > > > + */
> > > > + if (offset & ~PAGE_MASK || len & ~PAGE_MASK)
> > > > + return -EINVAL;
> > >
> > > This should be unnecessary, kvm_gmem_fallocate() does
> > >
> > > if (!PAGE_ALIGNED(offset) || !PAGE_ALIGNED(len))
> > > return -EINVAL;
> > >
> > > before invoking kvm_gmem_punch_hole(). If that's not working, i.e. your test
> > > fails, then that code needs to be fixed. I'll run your test to double-check,
> > > but AFAICT this is unnecessary.
> >
> > I confirmed that the testcase passes without the extra checks. Just to close the
> > loop, what prompted adding more checks to kvm_gmem_punch_hole()?
>
> I don't know if it's the same issue that Isaku ran into, but for SNP we
> hit a similar issue with the truncate_inode_pages_range(lstart, lend) call.
>
> The issue in that case was a bit more subtle:
>
> - userspace does a hole-punch on a 4K range of its gmem FD, which happens
> to be backed by a 2MB folio.
> - truncate_inode_pages_range() gets called for that 4K range
> - truncate_inode_pages_range() does special handling on the folios at the
> start/end of the range in case they are partial and passes these to
> truncate_inode_partial_folio(folio, lstart, lend). In this case, there's
> just the 1 backing folio. But it *still* gets the special treatment, and
> so gets passed to truncate_inode_partial_folio().
> - truncate_inode_partial_folio() will then zero that 4K range, even though
> it is page-aligned, based on the following rationale in the comments:
>
> /*
> * We may be zeroing pages we're about to discard, but it avoids
> * doing a complex calculation here, and then doing the zeroing
> * anyway if the page split fails.
> */
> folio_zero_range(folio, offset, length);
>
> - after that, .invalidate_folio callback is issued, then the folio is split,
> and the caller (truncate_inode_pages_range()) does another pass through
> the whole range and can free the now-split folio then .free_folio callbacks
> are issued.
>
> Because of that, we can't rely on .invalidate_folio/.free_folio to handle
> putting the page back into a normal host-accessible state, because the
> zero'ing will happen beforehand.
Argh, and that causes an RMP violation #PF.
FWIW, I don't *think* zeroing would be problematic for TDX. The page would get
poisoned, but KVM would re-zero the memory with MOVDIR64B and flush the cache.
> That's why we ended up needing to do this for SNP patches to make sure
> arch-specific invalidation callbacks are issued before the truncation occurs:
>
> https://github.com/mdroth/linux/commit/4ebcc04b84dd691fc6daccb9b7438402520b0704#diff-77306411fdaeb7f322a1ca756dead9feb75363aa6117b703ac118576153ddb37R233
>
> I'd planned to post those as a separate RFC to discuss, but when I came across
> this it seemed like it might be relevant to what the TDX folks may have run
> into here.
>
> If not for the zero'ing logic mentioned above, for SNP at least, the
> .free_folio() callback ends up working pretty nicely for both truncation and
> fput(), and it even plays nicely with the live update use-case where the
> destination gmem instance shares the inode->i_mapping: since iput() won't
> trigger truncate_inode_pages_final() until the last reference goes away, we
> don't have to do anything special in kvm_gmem_release() to determine when we
> should/shouldn't issue the arch-invalidations to clean up things like the
> RMP table.
>
> It seems like the above zero'ing logic could be reworked to only zero non-page
> aligned ranges (as the comments above truncate_inode_pages_range() claim
> should be the case), which would avoid the issue for the gmem use-case. But I
> wonder if some explicit "dont-zero-these-pages" flag might be more robust.
>
> Or maybe there's some other way we should be going about this?
Skipping the write seems like the obvious solution. An address_space flag,
e.g. AS_INACCESSIBLE, would be the easiest thing to implement. Or maybe even
make it AS_DONT_ZERO_ON_TRUNCATE_DAMMIT (mostly joking).
Or a hook in address_space_operations to zero the folio, which conceptually is
better in many ways, but feels like overkill.