2023-07-04 08:07:25

by David Stevens

Subject: [PATCH v7 0/8] KVM: allow mapping non-refcounted pages

From: David Stevens <[email protected]>

This patch series adds support for mapping VM_IO and VM_PFNMAP memory
that is backed by struct pages that aren't currently being refcounted
(e.g. tail pages of non-compound higher order allocations) into the
guest.

Our use case is virtio-gpu blob resources [1], which directly map host
graphics buffers into the guest as "vram" for the virtio-gpu device.
This feature currently does not work on systems using the amdgpu driver,
as that driver allocates non-compound higher order pages via
ttm_pool_alloc_page.

First, this series replaces the __gfn_to_pfn_memslot API with a more
extensible __kvm_follow_pfn API. The updated API packs
__gfn_to_pfn_memslot's arguments into a struct and, where possible,
folds the boolean arguments into a single FOLL_ flags argument. The
refactoring does not change any behavior, except as noted in the PPC
change.

When introduced in the refactoring, __kvm_follow_pfn implies FOLL_GET
to preserve existing behavior. From there, the API is extended to
support mapping non-refcounted pages by only taking a page reference
when FOLL_GET is set.
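
As a rough sketch of what this means for a call site (condensed from
the conversions in the individual patches, not a complete example):

    /* Old API: positional bools and out-parameters. */
    pfn = __gfn_to_pfn_memslot(slot, gfn, false, false, NULL,
                               write_fault, &writable, &hva);

    /* New API: inputs in a struct, bools folded into FOLL_ flags. */
    struct kvm_follow_pfn foll = {
        .slot = slot,
        .gfn = gfn,
        .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
        .allow_write_mapping = true,
    };
    pfn = __kvm_follow_pfn(&foll);
    /* Outputs come back through the struct. */
    hva = foll.hva;
    writable = foll.writable;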

This series only adds support for non-refcounted pages to the x86 MMU.
Other MMUs can likely be updated without too much difficulty, but it is
not needed at this point. Updating other parts of KVM (e.g. pfncache) is
not straightforward [2].

[1]
https://patchwork.kernel.org/project/dri-devel/cover/[email protected]/
[2] https://lore.kernel.org/all/[email protected]/

v6 -> v7:
- Replace __gfn_to_pfn_memslot with a more flexible __kvm_follow_pfn,
and extend that API to support non-refcounted pages.
v5 -> v6:
- rebase on kvm next branch
- rename gfn_to_pfn_page to gfn_to_pfn_noref
- fix uninitialized outparam in error case of __kvm_faultin_pfn
- add kvm_release_pfn_noref_clean for releasing pfn/page pair
v4 -> v5:
- rebase on kvm next branch again
v3 -> v4:
- rebase on kvm next branch again
- Add some more context to a comment in ensure_pfn_ref
v2 -> v3:
- rebase on kvm next branch
v1 -> v2:
- Introduce new gfn_to_pfn_page functions instead of modifying the
behavior of existing gfn_to_pfn functions, to make the change less
invasive.
- Drop changes to mmu_audit.c
- Include Nicholas Piggin's patch to avoid corrupting refcount in the
follow_pte case, and use it in deprecated gfn_to_pfn functions.
- Rebase on kvm/next
David Stevens (7):
KVM: Introduce __kvm_follow_pfn function
KVM: Make __kvm_follow_pfn not imply FOLL_GET
KVM: x86/mmu: Migrate to __kvm_follow_pfn
KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn
KVM: arm64: Migrate to __kvm_follow_pfn
KVM: PPC: Migrate to __kvm_follow_pfn
KVM: remove __gfn_to_pfn_memslot

Sean Christopherson (1):
KVM: Assert that a page's refcount is elevated when marking
accessed/dirty

arch/arm64/kvm/mmu.c | 25 +--
arch/powerpc/include/asm/kvm_book3s.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 38 ++---
arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++---
arch/powerpc/kvm/book3s_hv_nested.c | 4 +-
arch/x86/kvm/mmu/mmu.c | 77 ++++++---
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 9 +-
arch/x86/kvm/mmu/spte.c | 4 +-
arch/x86/kvm/mmu/spte.h | 12 +-
arch/x86/kvm/mmu/tdp_mmu.c | 22 +--
include/linux/kvm_host.h | 26 +++
virt/kvm/kvm_main.c | 210 +++++++++++++------------
virt/kvm/kvm_mm.h | 3 +-
virt/kvm/pfncache.c | 8 +-
15 files changed, 282 insertions(+), 209 deletions(-)

--
2.41.0.255.g8b1d071c50-goog



2023-07-04 08:07:49

by David Stevens

Subject: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

From: David Stevens <[email protected]>

Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
__kvm_follow_pfn refactors the old API's arguments into a struct and,
where possible, combines the boolean arguments into a single flags
argument.
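
One behavioral mapping worth spelling out (taken from the
__gfn_to_pfn_memslot compatibility wrapper below): the old async
out-parameter becomes FOLL_NOWAIT plus a new KVM_PFN_ERR_NEEDS_IO
return value, which the wrapper translates back for existing callers:

    if (async)
        foll.flags |= FOLL_NOWAIT;
    pfn = __kvm_follow_pfn(&foll);
    if (pfn == KVM_PFN_ERR_NEEDS_IO) {
        *async = true;
        pfn = KVM_PFN_ERR_FAULT;
    }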

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 16 ++++
virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
virt/kvm/kvm_mm.h | 3 +-
virt/kvm/pfncache.c | 8 +-
4 files changed, 122 insertions(+), 76 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 9d3ac7720da9..ef2763c2b12e 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -97,6 +97,7 @@
#define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
#define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
#define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
+#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)

/*
* error pfns indicate that the gfn is in slot but faild to
@@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+struct kvm_follow_pfn {
+ const struct kvm_memory_slot *slot;
+ gfn_t gfn;
+ unsigned int flags;
+ bool atomic;
+ /* Allow a read fault to create a writeable mapping. */
+ bool allow_write_mapping;
+
+ /* Outputs of __kvm_follow_pfn */
+ hva_t hva;
+ bool writable;
+};
+
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
+
kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 371bd783ff2b..b13f22861d2f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
* true indicates success, otherwise false is returned. It's also the
* only part that runs if we can in atomic context.
*/
-static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *pfn)
+static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
struct page *page[1];
+ bool write_fault = foll->flags & FOLL_WRITE;

/*
* Fast pin a writable pfn only if it is a write fault request
* or the caller allows to map a writable pfn for a read fault
* request.
*/
- if (!(write_fault || writable))
+ if (!(write_fault || foll->allow_write_mapping))
return false;

- if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
+ if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
-
- if (writable)
- *writable = true;
+ foll->writable = foll->allow_write_mapping;
return true;
}

@@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
* The slow path to get the pfn of the specified host virtual address,
* 1 indicates success, -errno is returned if error is detected.
*/
-static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
- bool interruptible, bool *writable, kvm_pfn_t *pfn)
+static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
- unsigned int flags = FOLL_HWPOISON;
+ unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
struct page *page;
int npages;

might_sleep();

- if (writable)
- *writable = write_fault;
-
- if (write_fault)
- flags |= FOLL_WRITE;
- if (async)
- flags |= FOLL_NOWAIT;
- if (interruptible)
- flags |= FOLL_INTERRUPTIBLE;
-
- npages = get_user_pages_unlocked(addr, 1, &page, flags);
+ npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
if (npages != 1)
return npages;

+ foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+
/* map read fault as writable if possible */
- if (unlikely(!write_fault) && writable) {
+ if (unlikely(!foll->writable) && foll->allow_write_mapping) {
struct page *wpage;

- if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
- *writable = true;
+ if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
+ foll->writable = true;
put_page(page);
page = wpage;
}
@@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
return get_page_unless_zero(page);
}

-static int hva_to_pfn_remapped(struct vm_area_struct *vma,
- unsigned long addr, bool write_fault,
- bool *writable, kvm_pfn_t *p_pfn)
+static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
+ kvm_pfn_t *p_pfn)
{
kvm_pfn_t pfn;
pte_t *ptep;
spinlock_t *ptl;
+ bool write_fault = foll->flags & FOLL_WRITE;
int r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
if (r) {
/*
* get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
* not call the fault handler, so do it here.
*/
bool unlocked = false;
- r = fixup_user_fault(current->mm, addr,
+ r = fixup_user_fault(current->mm, foll->hva,
(write_fault ? FAULT_FLAG_WRITE : 0),
&unlocked);
if (unlocked)
@@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
if (r)
return r;

- r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
+ r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
if (r)
return r;
}
@@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
goto out;
}

- if (writable)
- *writable = pte_write(*ptep);
+ foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
pfn = pte_pfn(*ptep);

/*
@@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
* 2): @write_fault = false && @writable, @writable will tell the caller
* whether the mapping is writable.
*/
-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable)
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
{
struct vm_area_struct *vma;
kvm_pfn_t pfn;
int npages, r;

/* we can do it either atomically or asynchronously, not both */
- BUG_ON(atomic && async);
+ BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));

- if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
+ if (hva_to_pfn_fast(foll, &pfn))
return pfn;

- if (atomic)
+ if (foll->atomic)
return KVM_PFN_ERR_FAULT;

- npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
- writable, &pfn);
+ npages = hva_to_pfn_slow(foll, &pfn);
if (npages == 1)
return pfn;
if (npages == -EINTR)
@@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,

mmap_read_lock(current->mm);
if (npages == -EHWPOISON ||
- (!async && check_user_page_hwpoison(addr))) {
+ (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
pfn = KVM_PFN_ERR_HWPOISON;
goto exit;
}

retry:
- vma = vma_lookup(current->mm, addr);
+ vma = vma_lookup(current->mm, foll->hva);

if (vma == NULL)
pfn = KVM_PFN_ERR_FAULT;
else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
- r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
+ r = hva_to_pfn_remapped(vma, foll, &pfn);
if (r == -EAGAIN)
goto retry;
if (r < 0)
pfn = KVM_PFN_ERR_FAULT;
} else {
- if (async && vma_is_valid(vma, write_fault))
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
+ if ((foll->flags & FOLL_NOWAIT) &&
+ vma_is_valid(vma, foll->flags & FOLL_WRITE))
+ pfn = KVM_PFN_ERR_NEEDS_IO;
+ else
+ pfn = KVM_PFN_ERR_FAULT;
}
exit:
mmap_read_unlock(current->mm);
return pfn;
}

-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
+kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
{
- unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
-
- if (hva)
- *hva = addr;
+ foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
+ foll->flags & FOLL_WRITE);

- if (addr == KVM_HVA_ERR_RO_BAD) {
- if (writable)
- *writable = false;
+ if (foll->hva == KVM_HVA_ERR_RO_BAD)
return KVM_PFN_ERR_RO_FAULT;
- }

- if (kvm_is_error_hva(addr)) {
- if (writable)
- *writable = false;
+ if (kvm_is_error_hva(foll->hva))
return KVM_PFN_NOSLOT;
- }

- /* Do not map writable pfn in the readonly memslot. */
- if (writable && memslot_is_readonly(slot)) {
- *writable = false;
- writable = NULL;
- }
+ if (memslot_is_readonly(foll->slot))
+ foll->allow_write_mapping = false;
+
+ return hva_to_pfn(foll);
+}
+EXPORT_SYMBOL_GPL(__kvm_follow_pfn);

- return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
- writable);
+kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
+ bool atomic, bool interruptible, bool *async,
+ bool write_fault, bool *writable, hva_t *hva)
+{
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = 0,
+ .atomic = atomic,
+ .allow_write_mapping = !!writable,
+ };
+
+ if (write_fault)
+ foll.flags |= FOLL_WRITE;
+ if (async)
+ foll.flags |= FOLL_NOWAIT;
+ if (interruptible)
+ foll.flags |= FOLL_INTERRUPTIBLE;
+
+ pfn = __kvm_follow_pfn(&foll);
+ if (pfn == KVM_PFN_ERR_NEEDS_IO) {
+ *async = true;
+ pfn = KVM_PFN_ERR_FAULT;
+ }
+ if (hva)
+ *hva = foll.hva;
+ if (writable)
+ *writable = foll.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
- return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
- NULL, write_fault, writable, NULL);
+ kvm_pfn_t pfn;
+ struct kvm_follow_pfn foll = {
+ .slot = gfn_to_memslot(kvm, gfn),
+ .gfn = gfn,
+ .flags = write_fault ? FOLL_WRITE : 0,
+ .allow_write_mapping = !!writable,
+ };
+ pfn = __kvm_follow_pfn(&foll);
+ if (writable)
+ *writable = foll.writable;
+ return pfn;
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);

kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ };
+ return __kvm_follow_pfn(&foll);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);

kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
{
- return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
- NULL, NULL);
+ struct kvm_follow_pfn foll = {
+ .slot = slot,
+ .gfn = gfn,
+ .flags = FOLL_WRITE,
+ .atomic = true,
+ };
+ return __kvm_follow_pfn(&foll);
}
EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);

diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
index 180f1a09e6ba..ed896aee5396 100644
--- a/virt/kvm/kvm_mm.h
+++ b/virt/kvm/kvm_mm.h
@@ -20,8 +20,7 @@
#define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
#endif /* KVM_HAVE_MMU_RWLOCK */

-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
- bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);

#ifdef CONFIG_HAVE_KVM_PFNCACHE
void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index 2d6aba677830..e3fefa753a51 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
void *new_khva = NULL;
unsigned long mmu_seq;
+ struct kvm_follow_pfn foll = {
+ .slot = gpc->memslot,
+ .gfn = gpa_to_gfn(gpc->gpa),
+ .flags = FOLL_WRITE,
+ .hva = gpc->uhva,
+ };

lockdep_assert_held(&gpc->refresh_lock);

@@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
}

/* We always request a writeable mapping */
- new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
+ new_pfn = hva_to_pfn(&foll);
if (is_error_noslot_pfn(new_pfn))
goto out_error;

--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:08:53

by David Stevens

Subject: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

From: David Stevens <[email protected]>

Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
callers to resolve a gfn when the associated pfn has a valid struct page
that isn't being actively refcounted (e.g. tail pages of non-compound
higher order pages). For a caller to safely omit FOLL_GET, all usages of
the returned pfn must be guarded by an mmu notifier.

This also adds an is_refcounted_page output to struct kvm_follow_pfn
that is set when the returned pfn has an associated struct page with a
valid refcount. Callers that don't pass FOLL_GET should remember this
value and use it to avoid places like kvm_is_ad_tracked_page that
assume a non-zero refcount.
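
A minimal sketch of the intended calling convention for a caller that
omits FOLL_GET (hypothetical call site; the real conversion is the x86
fault path in patch 5/8):

    struct kvm_follow_pfn foll = {
        .slot = slot,
        .gfn = gfn,
        .flags = write_fault ? FOLL_WRITE : 0,  /* no FOLL_GET */
        /* All uses of the resulting pfn sit under an mmu notifier. */
        .guarded_by_mmu_notifier = true,
        .allow_write_mapping = true,
    };

    pfn = __kvm_follow_pfn(&foll);
    ...
    /* Only touch the struct page if it is actually refcounted. */
    if (foll.is_refcounted_page)
        kvm_set_page_accessed(pfn_to_page(pfn));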

Signed-off-by: David Stevens <[email protected]>
---
include/linux/kvm_host.h | 10 ++++++
virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
virt/kvm/pfncache.c | 2 +-
3 files changed, 47 insertions(+), 32 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index ef2763c2b12e..a45308c7d2d9 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
void kvm_release_page_clean(struct page *page);
void kvm_release_page_dirty(struct page *page);

+void kvm_set_page_accessed(struct page *page);
+void kvm_set_page_dirty(struct page *page);
+
struct kvm_follow_pfn {
const struct kvm_memory_slot *slot;
gfn_t gfn;
@@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
bool atomic;
/* Allow a read fault to create a writeable mapping. */
bool allow_write_mapping;
+ /*
+ * Usage of the returned pfn will be guarded by an mmu notifier. Must
+ * be true if FOLL_GET is not set.
+ */
+ bool guarded_by_mmu_notifier;

/* Outputs of __kvm_follow_pfn */
hva_t hva;
bool writable;
+ /* True if the returned pfn is for a page with a valid refcount. */
+ bool is_refcounted_page;
};

kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b13f22861d2f..0f7b41f220b6 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
*pfn = page_to_pfn(page[0]);
foll->writable = foll->allow_write_mapping;
+ foll->is_refcounted_page = true;
+ if (!(foll->flags & FOLL_GET))
+ put_page(page[0]);
return true;
}

@@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
return npages;

foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
+ foll->is_refcounted_page = true;

/* map read fault as writable if possible */
if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
}
}
*pfn = page_to_pfn(page);
+ if (!(foll->flags & FOLL_GET))
+ put_page(page);
return npages;
}

@@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
return true;
}

-static int kvm_try_get_pfn(kvm_pfn_t pfn)
-{
- struct page *page = kvm_pfn_to_refcounted_page(pfn);
-
- if (!page)
- return 1;
-
- return get_page_unless_zero(page);
-}
-
static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
kvm_pfn_t *p_pfn)
{
@@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
pte_t *ptep;
spinlock_t *ptl;
bool write_fault = foll->flags & FOLL_WRITE;
+ struct page *page;
int r;

r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
@@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
pfn = pte_pfn(*ptep);

/*
- * Get a reference here because callers of *hva_to_pfn* and
- * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
- * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
- * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
- * simply do nothing for reserved pfns.
- *
- * Whoever called remap_pfn_range is also going to call e.g.
- * unmap_mapping_range before the underlying pages are freed,
- * causing a call to our MMU notifier.
+ * Now deal with reference counting. If kvm_pfn_to_refcounted_page
+ * returns NULL, then there's no refcount to worry about.
*
- * Certain IO or PFNMAP mappings can be backed with valid
- * struct pages, but be allocated without refcounting e.g.,
- * tail pages of non-compound higher order allocations, which
- * would then underflow the refcount when the caller does the
- * required put_page. Don't allow those pages here.
+ * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+ * struct pages but be allocated without refcounting e.g., tail pages of
+ * non-compound higher order allocations. If FOLL_GET is set and we
+ * increment such a refcount, then when that pfn is eventually passed to
+ * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+ * freed. Therefore don't allow those pages here when FOLL_GET is set.
*/
- if (!kvm_try_get_pfn(pfn))
+ page = kvm_pfn_to_refcounted_page(pfn);
+ if (!page)
+ goto out;
+
+ if (get_page_unless_zero(page)) {
+ foll->is_refcounted_page = true;
+ if (!(foll->flags & FOLL_GET))
+ put_page(page);
+ } else if (foll->flags & FOLL_GET) {
r = -EFAULT;
+ }

out:
pte_unmap_unlock(ptep, ptl);
@@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)

kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
{
+ if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
+ return KVM_PFN_ERR_FAULT;
+
foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
foll->flags & FOLL_WRITE);

@@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = 0,
+ .flags = FOLL_GET,
.atomic = atomic,
.allow_write_mapping = !!writable,
};
@@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
struct kvm_follow_pfn foll = {
.slot = gfn_to_memslot(kvm, gfn),
.gfn = gfn,
- .flags = write_fault ? FOLL_WRITE : 0,
+ .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
.allow_write_mapping = !!writable,
};
pfn = __kvm_follow_pfn(&foll);
@@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
};
return __kvm_follow_pfn(&foll);
}
@@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = gfn,
- .flags = FOLL_WRITE,
+ .flags = FOLL_GET | FOLL_WRITE,
.atomic = true,
};
return __kvm_follow_pfn(&foll);
@@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
return !PageReserved(page);
}

-static void kvm_set_page_dirty(struct page *page)
+void kvm_set_page_dirty(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
SetPageDirty(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_dirty);

-static void kvm_set_page_accessed(struct page *page)
+void kvm_set_page_accessed(struct page *page)
{
if (kvm_is_ad_tracked_page(page))
mark_page_accessed(page);
}
+EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

void kvm_release_page_clean(struct page *page)
{
diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
index e3fefa753a51..87caafce3dd0 100644
--- a/virt/kvm/pfncache.c
+++ b/virt/kvm/pfncache.c
@@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
struct kvm_follow_pfn foll = {
.slot = gpc->memslot,
.gfn = gpa_to_gfn(gpc->gpa),
- .flags = FOLL_WRITE,
+ .flags = FOLL_WRITE | FOLL_GET,
.hva = gpc->uhva,
};

--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:08:55

by David Stevens

Subject: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

From: David Stevens <[email protected]>

Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
memory into the guest that is backed by un-refcounted struct pages - for
example, higher order non-compound pages allocated by the amdgpu driver
via ttm_pool_alloc_page.

The bulk of this change is tracking the is_refcounted_page flag so that
non-refcounted pages don't trigger page_count() == 0 warnings. This is
done by storing the flag in an unused bit in the sptes.
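
In concrete terms (condensed from the spte.h and mmu.c hunks below), a
leaf SPTE records whether its backing page is refcounted in an
otherwise-unused bit, and the accessed/dirty helpers check it before
touching the struct page:

    #define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)

    static inline bool is_refcounted_page_pte(u64 spte)
    {
        return spte & SPTE_MMU_PAGE_REFCOUNTED;
    }

    /* e.g. when dropping a dirty SPTE: */
    if (is_dirty_spte(old_spte) && is_refcounted_page_pte(old_spte))
        kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));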

Signed-off-by: David Stevens <[email protected]>
---
arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
arch/x86/kvm/mmu/mmu_internal.h | 1 +
arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
arch/x86/kvm/mmu/spte.c | 4 ++-
arch/x86/kvm/mmu/spte.h | 12 ++++++++-
arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
6 files changed, 62 insertions(+), 30 deletions(-)

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index e44ab512c3a1..b1607e314497 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)

if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
flush = true;
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
flush = true;
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ if (is_refcounted_page_pte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
}

return flush;
@@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
* before they are reclaimed. Sanity check that, if the pfn is backed
* by a refcounted page, the refcount is elevated.
*/
- page = kvm_pfn_to_refcounted_page(pfn);
- WARN_ON(page && !page_count(page));
+ if (is_refcounted_page_pte(old_spte)) {
+ page = kvm_pfn_to_refcounted_page(pfn);
+ WARN_ON(!page || !page_count(page));
+ }

- if (is_accessed_spte(old_spte))
- kvm_set_pfn_accessed(pfn);
+ if (is_refcounted_page_pte(old_spte)) {
+ if (is_accessed_spte(old_spte))
+ kvm_set_page_accessed(pfn_to_page(pfn));

- if (is_dirty_spte(old_spte))
- kvm_set_pfn_dirty(pfn);
+ if (is_dirty_spte(old_spte))
+ kvm_set_page_dirty(pfn_to_page(pfn));
+ }

return old_spte;
}
@@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(spte))
- kvm_set_pfn_dirty(spte_to_pfn(spte));
+ if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));

spte = mark_spte_for_access_track(spte);
mmu_spte_update_no_track(sptep, spte);
@@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
{
bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
(unsigned long *)sptep);
- if (was_writable && !spte_ad_enabled(*sptep))
- kvm_set_pfn_dirty(spte_to_pfn(*sptep));
+ if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));

return was_writable;
}
@@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
bool host_writable = !fault || fault->map_writable;
bool prefetch = !fault || fault->prefetch;
bool write_fault = fault && fault->write;
+ bool is_refcounted = !fault || fault->is_refcounted_page;

pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
*sptep, write_fault, gfn);
@@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
}

wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
- true, host_writable, &spte);
+ true, host_writable, is_refcounted, &spte);

if (*sptep == spte) {
ret = RET_PF_SPURIOUS;
@@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
struct kvm_follow_pfn foll = {
.slot = slot,
.gfn = fault->gfn,
- .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
+ .flags = fault->write ? FOLL_WRITE : 0,
.allow_write_mapping = true,
+ .guarded_by_mmu_notifier = true,
};

/*
@@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
fault->slot = NULL;
fault->pfn = KVM_PFN_NOSLOT;
fault->map_writable = false;
+ fault->is_refcounted_page = false;
return RET_PF_CONTINUE;
}
/*
@@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
success:
fault->hva = foll.hva;
fault->map_writable = foll.writable;
+ fault->is_refcounted_page = foll.is_refcounted_page;
return RET_PF_CONTINUE;
}

@@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}

@@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,

out_unlock:
read_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}
#endif
diff --git a/arch/x86/kvm/mmu/mmu_internal.h b/arch/x86/kvm/mmu/mmu_internal.h
index d39af5639ce9..55790085884f 100644
--- a/arch/x86/kvm/mmu/mmu_internal.h
+++ b/arch/x86/kvm/mmu/mmu_internal.h
@@ -240,6 +240,7 @@ struct kvm_page_fault {
kvm_pfn_t pfn;
hva_t hva;
bool map_writable;
+ bool is_refcounted_page;

/*
* Indicates the guest is trying to write a gfn that contains one or
diff --git a/arch/x86/kvm/mmu/paging_tmpl.h b/arch/x86/kvm/mmu/paging_tmpl.h
index 0662e0278e70..3284e7bd9619 100644
--- a/arch/x86/kvm/mmu/paging_tmpl.h
+++ b/arch/x86/kvm/mmu/paging_tmpl.h
@@ -829,7 +829,8 @@ static int FNAME(page_fault)(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault

out_unlock:
write_unlock(&vcpu->kvm->mmu_lock);
- kvm_release_pfn_clean(fault->pfn);
+ if (fault->is_refcounted_page)
+ kvm_set_page_accessed(pfn_to_page(fault->pfn));
return r;
}

@@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
*/
static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
{
- bool host_writable;
+ bool host_writable, is_refcounted;
gpa_t first_pte_gpa;
u64 *sptep, spte;
struct kvm_memory_slot *slot;
@@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
sptep = &sp->spt[i];
spte = *sptep;
host_writable = spte & shadow_host_writable_mask;
+ // TODO: is this correct?
+ is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
make_spte(vcpu, sp, slot, pte_access, gfn,
spte_to_pfn(spte), spte, true, false,
- host_writable, &spte);
+ host_writable, is_refcounted, &spte);

return mmu_spte_update(sptep, spte);
}
diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
index cf2c6426a6fc..46c681dc45e6 100644
--- a/arch/x86/kvm/mmu/spte.c
+++ b/arch/x86/kvm/mmu/spte.c
@@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte)
+ bool host_writable, bool is_refcounted, u64 *new_spte)
{
int level = sp->role.level;
u64 spte = SPTE_MMU_PRESENT_MASK;
@@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,

if (level > PG_LEVEL_4K)
spte |= PT_PAGE_SIZE_MASK;
+ else if (is_refcounted)
+ spte |= SPTE_MMU_PAGE_REFCOUNTED;

if (shadow_memtype_mask)
spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,
diff --git a/arch/x86/kvm/mmu/spte.h b/arch/x86/kvm/mmu/spte.h
index 1279db2eab44..be93dd061ae3 100644
--- a/arch/x86/kvm/mmu/spte.h
+++ b/arch/x86/kvm/mmu/spte.h
@@ -95,6 +95,11 @@ static_assert(!(EPT_SPTE_MMU_WRITABLE & SHADOW_ACC_TRACK_SAVED_MASK));
/* Defined only to keep the above static asserts readable. */
#undef SHADOW_ACC_TRACK_SAVED_MASK

+/*
+ * Indicates that the SPTE refers to a page with a valid refcount.
+ */
+#define SPTE_MMU_PAGE_REFCOUNTED BIT_ULL(59)
+
/*
* Due to limited space in PTEs, the MMIO generation is a 19 bit subset of
* the memslots generation and is derived as follows:
@@ -332,6 +337,11 @@ static inline bool is_dirty_spte(u64 spte)
return dirty_mask ? spte & dirty_mask : spte & PT_WRITABLE_MASK;
}

+static inline bool is_refcounted_page_pte(u64 spte)
+{
+ return spte & SPTE_MMU_PAGE_REFCOUNTED;
+}
+
static inline u64 get_rsvd_bits(struct rsvd_bits_validate *rsvd_check, u64 pte,
int level)
{
@@ -462,7 +472,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
const struct kvm_memory_slot *slot,
unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
u64 old_spte, bool prefetch, bool can_unsync,
- bool host_writable, u64 *new_spte);
+ bool host_writable, bool is_refcounted, u64 *new_spte);
u64 make_huge_page_split_spte(struct kvm *kvm, u64 huge_spte,
union kvm_mmu_page_role role, int index);
u64 make_nonleaf_spte(u64 *child_pt, bool ad_disabled);
diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index 512163d52194..a9b1b14d2e26 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -474,6 +474,7 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
bool was_leaf = was_present && is_last_spte(old_spte, level);
bool is_leaf = is_present && is_last_spte(new_spte, level);
bool pfn_changed = spte_to_pfn(old_spte) != spte_to_pfn(new_spte);
+ bool is_refcounted = is_refcounted_page_pte(old_spte);

WARN_ON(level > PT64_ROOT_MAX_LEVEL);
WARN_ON(level < PG_LEVEL_4K);
@@ -538,9 +539,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
if (is_leaf != was_leaf)
kvm_update_page_stats(kvm, level, is_leaf ? 1 : -1);

- if (was_leaf && is_dirty_spte(old_spte) &&
+ if (was_leaf && is_dirty_spte(old_spte) && is_refcounted &&
(!is_present || !is_dirty_spte(new_spte) || pfn_changed))
- kvm_set_pfn_dirty(spte_to_pfn(old_spte));
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));

/*
* Recursively handle child PTs if the change removed a subtree from
@@ -552,9 +553,9 @@ static void handle_changed_spte(struct kvm *kvm, int as_id, gfn_t gfn,
(is_leaf || !is_present || WARN_ON_ONCE(pfn_changed)))
handle_removed_pt(kvm, spte_to_child_pt(old_spte, level), shared);

- if (was_leaf && is_accessed_spte(old_spte) &&
+ if (was_leaf && is_accessed_spte(old_spte) && is_refcounted &&
(!is_present || !is_accessed_spte(new_spte) || pfn_changed))
- kvm_set_pfn_accessed(spte_to_pfn(old_spte));
+ kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
}

/*
@@ -988,8 +989,9 @@ static int tdp_mmu_map_handle_target_level(struct kvm_vcpu *vcpu,
new_spte = make_mmio_spte(vcpu, iter->gfn, ACC_ALL);
else
wrprot = make_spte(vcpu, sp, fault->slot, ACC_ALL, iter->gfn,
- fault->pfn, iter->old_spte, fault->prefetch, true,
- fault->map_writable, &new_spte);
+ fault->pfn, iter->old_spte, fault->prefetch, true,
+ fault->map_writable, fault->is_refcounted_page,
+ &new_spte);

if (new_spte == iter->old_spte)
ret = RET_PF_SPURIOUS;
@@ -1205,8 +1207,9 @@ static bool age_gfn_range(struct kvm *kvm, struct tdp_iter *iter,
* Capture the dirty status of the page, so that it doesn't get
* lost when the SPTE is marked for access tracking.
*/
- if (is_writable_pte(iter->old_spte))
- kvm_set_pfn_dirty(spte_to_pfn(iter->old_spte));
+ if (is_writable_pte(iter->old_spte) &&
+ is_refcounted_page_pte(iter->old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter->old_spte)));

new_spte = mark_spte_for_access_track(iter->old_spte);
iter->old_spte = kvm_tdp_mmu_write_spte(iter->sptep,
@@ -1626,7 +1629,8 @@ static void clear_dirty_pt_masked(struct kvm *kvm, struct kvm_mmu_page *root,
trace_kvm_tdp_mmu_spte_changed(iter.as_id, iter.gfn, iter.level,
iter.old_spte,
iter.old_spte & ~dbit);
- kvm_set_pfn_dirty(spte_to_pfn(iter.old_spte));
+ if (is_refcounted_page_pte(iter.old_spte))
+ kvm_set_page_dirty(pfn_to_page(spte_to_pfn(iter.old_spte)));
}

rcu_read_unlock();
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:24:21

by David Stevens

Subject: [PATCH v7 7/8] KVM: PPC: Migrate to __kvm_follow_pfn

From: David Stevens <[email protected]>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn. As part of the
refactoring, remove the redundant calls to get_user_page_fast_only,
since the check for !async && !atomic was removed from the KVM generic
code in commit b9b33da2aa74. Also, remove the kvm_ro parameter because
the KVM generic code handles RO memslots.
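
Schematically, each converted fault path goes from the open-coded
fast/slow split to a single call (condensed from the
book3s_64_mmu_hv.c hunk below):

    /* Before: manual fast path, then the generic slow path. */
    if (get_user_page_fast_only(hva, FOLL_WRITE, &page))
        write_ok = true;
    else
        pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
                                   writing, &write_ok, NULL);

    /* After: one call; the generic code already tries the fast path
     * and refuses writable mappings of read-only memslots. */
    struct kvm_follow_pfn foll = {
        .slot = memslot,
        .gfn = gfn,
        .flags = FOLL_GET | (writing ? FOLL_WRITE : 0),
        .allow_write_mapping = true,
    };
    pfn = __kvm_follow_pfn(&foll);
    write_ok = foll.writable;
    hva = foll.hva;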

Signed-off-by: David Stevens <[email protected]>
---
I have checked that this patch compiles, but I don't have the hardware
to test it myself.

arch/powerpc/include/asm/kvm_book3s.h | 2 +-
arch/powerpc/kvm/book3s_64_mmu_hv.c | 38 +++++++++-----------
arch/powerpc/kvm/book3s_64_mmu_radix.c | 50 +++++++++++---------------
arch/powerpc/kvm/book3s_hv_nested.c | 4 +--
4 files changed, 38 insertions(+), 56 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..bf48c511e700 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -202,7 +202,7 @@ extern bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested,
extern int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
- bool writing, bool kvm_ro,
+ bool writing,
pte_t *inserted_pte, unsigned int *levelp);
extern int kvmppc_init_vm_radix(struct kvm *kvm);
extern void kvmppc_free_radix(struct kvm *kvm);
diff --git a/arch/powerpc/kvm/book3s_64_mmu_hv.c b/arch/powerpc/kvm/book3s_64_mmu_hv.c
index 7f765d5ad436..9a4715e73937 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_hv.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_hv.c
@@ -523,6 +523,9 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
unsigned long rcbits;
long mmio_update;
pte_t pte, *ptep;
+ struct kvm_follow_pfn foll = {
+ .allow_write_mapping = true,
+ };

if (kvm_is_radix(kvm))
return kvmppc_book3s_radix_page_fault(vcpu, ea, dsisr);
@@ -599,29 +602,20 @@ int kvmppc_book3s_hv_page_fault(struct kvm_vcpu *vcpu,
page = NULL;
writing = (dsisr & DSISR_ISSTORE) != 0;
/* If writing != 0, then the HPTE must allow writing, if we get here */
- write_ok = writing;
- hva = gfn_to_hva_memslot(memslot, gfn);

- /*
- * Do a fast check first, since __gfn_to_pfn_memslot doesn't
- * do it with !atomic && !async, which is how we call it.
- * We always ask for write permission since the common case
- * is that the page is writable.
- */
- if (get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
- write_ok = true;
- } else {
- /* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- writing, &write_ok, NULL);
- if (is_error_noslot_pfn(pfn))
- return -EFAULT;
- page = NULL;
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageReserved(page))
- page = NULL;
- }
+ foll.slot = memslot;
+ foll.gfn = gfn;
+ foll.flags = FOLL_GET | (writing ? FOLL_WRITE : 0);
+ pfn = __kvm_follow_pfn(&foll);
+ if (is_error_noslot_pfn(pfn))
+ return -EFAULT;
+ page = NULL;
+ write_ok = foll.writable;
+ hva = foll.hva;
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
+ page = NULL;
}

/*
diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c b/arch/powerpc/kvm/book3s_64_mmu_radix.c
index 461307b89c3a..339d1efcb6c9 100644
--- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
+++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
@@ -815,47 +815,39 @@ bool kvmppc_hv_handle_set_rc(struct kvm *kvm, bool nested, bool writing,
int kvmppc_book3s_instantiate_page(struct kvm_vcpu *vcpu,
unsigned long gpa,
struct kvm_memory_slot *memslot,
- bool writing, bool kvm_ro,
+ bool writing,
pte_t *inserted_pte, unsigned int *levelp)
{
struct kvm *kvm = vcpu->kvm;
struct page *page = NULL;
unsigned long mmu_seq;
- unsigned long hva, gfn = gpa >> PAGE_SHIFT;
- bool upgrade_write = false;
- bool *upgrade_p = &upgrade_write;
+ unsigned long hva, pfn, gfn = gpa >> PAGE_SHIFT;
+ bool upgrade_write;
pte_t pte, *ptep;
unsigned int shift, level;
int ret;
bool large_enable;
+ struct kvm_follow_pfn foll = {
+ .slot = memslot,
+ .gfn = gfn,
+ .flags = FOLL_GET | (writing ? FOLL_WRITE : 0),
+ .allow_write_mapping = true,
+ };

/* used to check for invalidations in progress */
mmu_seq = kvm->mmu_invalidate_seq;
smp_rmb();

- /*
- * Do a fast check first, since __gfn_to_pfn_memslot doesn't
- * do it with !atomic && !async, which is how we call it.
- * We always ask for write permission since the common case
- * is that the page is writable.
- */
- hva = gfn_to_hva_memslot(memslot, gfn);
- if (!kvm_ro && get_user_page_fast_only(hva, FOLL_WRITE, &page)) {
- upgrade_write = true;
- } else {
- unsigned long pfn;
-
- /* Call KVM generic code to do the slow-path check */
- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- writing, upgrade_p, NULL);
- if (is_error_noslot_pfn(pfn))
- return -EFAULT;
- page = NULL;
- if (pfn_valid(pfn)) {
- page = pfn_to_page(pfn);
- if (PageReserved(page))
- page = NULL;
- }
+ pfn = __kvm_follow_pfn(&foll);
+ if (is_error_noslot_pfn(pfn))
+ return -EFAULT;
+ page = NULL;
+ hva = foll.hva;
+ upgrade_write = foll.writable;
+ if (pfn_valid(pfn)) {
+ page = pfn_to_page(pfn);
+ if (PageReserved(page))
+ page = NULL;
}

/*
@@ -944,7 +936,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
struct kvm_memory_slot *memslot;
long ret;
bool writing = !!(dsisr & DSISR_ISSTORE);
- bool kvm_ro = false;

/* Check for unusual errors */
if (dsisr & DSISR_UNSUPP_MMU) {
@@ -997,7 +988,6 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,
ea, DSISR_ISSTORE | DSISR_PROTFAULT);
return RESUME_GUEST;
}
- kvm_ro = true;
}

/* Failed to set the reference/change bits */
@@ -1015,7 +1005,7 @@ int kvmppc_book3s_radix_page_fault(struct kvm_vcpu *vcpu,

/* Try to insert a pte */
ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot, writing,
- kvm_ro, NULL, NULL);
+ NULL, NULL);

if (ret == 0 || ret == -EAGAIN)
ret = RESUME_GUEST;
diff --git a/arch/powerpc/kvm/book3s_hv_nested.c b/arch/powerpc/kvm/book3s_hv_nested.c
index 377d0b4a05ee..6d531051df04 100644
--- a/arch/powerpc/kvm/book3s_hv_nested.c
+++ b/arch/powerpc/kvm/book3s_hv_nested.c
@@ -1497,7 +1497,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
unsigned long n_gpa, gpa, gfn, perm = 0UL;
unsigned int shift, l1_shift, level;
bool writing = !!(dsisr & DSISR_ISSTORE);
- bool kvm_ro = false;
long int ret;

if (!gp->l1_gr_to_hr) {
@@ -1577,7 +1576,6 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
ea, DSISR_ISSTORE | DSISR_PROTFAULT);
return RESUME_GUEST;
}
- kvm_ro = true;
}

/* 2. Find the host pte for this L1 guest real address */
@@ -1599,7 +1597,7 @@ static long int __kvmhv_nested_page_fault(struct kvm_vcpu *vcpu,
if (!pte_present(pte) || (writing && !(pte_val(pte) & _PAGE_WRITE))) {
/* No suitable pte found -> try to insert a mapping */
ret = kvmppc_book3s_instantiate_page(vcpu, gpa, memslot,
- writing, kvm_ro, &pte, &level);
+ writing, &pte, &level);
if (ret == -EAGAIN)
return RESUME_GUEST;
else if (ret)
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:32:33

by David Stevens

Subject: [PATCH v7 8/8] KVM: remove __gfn_to_pfn_memslot

From: David Stevens <[email protected]>

All callers have been migrated to __kvm_follow_pfn.

Signed-off-by: David Stevens <[email protected]>
---
virt/kvm/kvm_main.c | 33 ---------------------------------
1 file changed, 33 deletions(-)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0f7b41f220b6..5b5afd70f239 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2712,39 +2712,6 @@ kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
}
EXPORT_SYMBOL_GPL(__kvm_follow_pfn);

-kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
- bool atomic, bool interruptible, bool *async,
- bool write_fault, bool *writable, hva_t *hva)
-{
- kvm_pfn_t pfn;
- struct kvm_follow_pfn foll = {
- .slot = slot,
- .gfn = gfn,
- .flags = FOLL_GET,
- .atomic = atomic,
- .allow_write_mapping = !!writable,
- };
-
- if (write_fault)
- foll.flags |= FOLL_WRITE;
- if (async)
- foll.flags |= FOLL_NOWAIT;
- if (interruptible)
- foll.flags |= FOLL_INTERRUPTIBLE;
-
- pfn = __kvm_follow_pfn(&foll);
- if (pfn == KVM_PFN_ERR_NEEDS_IO) {
- *async = true;
- pfn = KVM_PFN_ERR_FAULT;
- }
- if (hva)
- *hva = foll.hva;
- if (writable)
- *writable = foll.writable;
- return pfn;
-}
-EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
-
kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
bool *writable)
{
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:32:34

by David Stevens

Subject: [PATCH v7 6/8] KVM: arm64: Migrate to __kvm_follow_pfn

From: David Stevens <[email protected]>

Migrate from __gfn_to_pfn_memslot to __kvm_follow_pfn.

Signed-off-by: David Stevens <[email protected]>
---
arch/arm64/kvm/mmu.c | 25 ++++++++++++++-----------
1 file changed, 14 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/kvm/mmu.c b/arch/arm64/kvm/mmu.c
index 6db9ef288ec3..c706530d304d 100644
--- a/arch/arm64/kvm/mmu.c
+++ b/arch/arm64/kvm/mmu.c
@@ -1334,7 +1334,8 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
unsigned long fault_status)
{
int ret = 0;
- bool write_fault, writable, force_pte = false;
+ bool write_fault = kvm_is_write_fault(vcpu);
+ bool force_pte = false;
bool exec_fault, mte_allowed;
bool device = false;
unsigned long mmu_seq;
@@ -1342,16 +1343,19 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
struct kvm_mmu_memory_cache *memcache = &vcpu->arch.mmu_page_cache;
struct vm_area_struct *vma;
short vma_shift;
- gfn_t gfn;
kvm_pfn_t pfn;
bool logging_active = memslot_is_logging(memslot);
unsigned long fault_level = kvm_vcpu_trap_get_fault_level(vcpu);
long vma_pagesize, fault_granule;
enum kvm_pgtable_prot prot = KVM_PGTABLE_PROT_R;
struct kvm_pgtable *pgt;
+ struct kvm_follow_pfn foll = {
+ .slot = memslot,
+ .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
+ .allow_write_mapping = true,
+ };

fault_granule = 1UL << ARM64_HW_PGTABLE_LEVEL_SHIFT(fault_level);
- write_fault = kvm_is_write_fault(vcpu);
exec_fault = kvm_vcpu_trap_is_exec_fault(vcpu);
VM_BUG_ON(write_fault && exec_fault);

@@ -1425,7 +1429,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
if (vma_pagesize == PMD_SIZE || vma_pagesize == PUD_SIZE)
fault_ipa &= ~(vma_pagesize - 1);

- gfn = fault_ipa >> PAGE_SHIFT;
+ foll.gfn = fault_ipa >> PAGE_SHIFT;
mte_allowed = kvm_vma_mte_allowed(vma);

/* Don't use the VMA after the unlock -- it may have vanished */
@@ -1433,7 +1437,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,

/*
* Read mmu_invalidate_seq so that KVM can detect if the results of
- * vma_lookup() or __gfn_to_pfn_memslot() become stale prior to
+ * vma_lookup() or __kvm_follow_pfn() become stale prior to
* acquiring kvm->mmu_lock.
*
* Rely on mmap_read_unlock() for an implicit smp_rmb(), which pairs
@@ -1442,8 +1446,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
mmu_seq = vcpu->kvm->mmu_invalidate_seq;
mmap_read_unlock(current->mm);

- pfn = __gfn_to_pfn_memslot(memslot, gfn, false, false, NULL,
- write_fault, &writable, NULL);
+ pfn = __kvm_follow_pfn(&foll);
if (pfn == KVM_PFN_ERR_HWPOISON) {
kvm_send_hwpoison_signal(hva, vma_shift);
return 0;
@@ -1468,7 +1471,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
* Only actually map the page as writable if this was a write
* fault.
*/
- writable = false;
+ foll.writable = false;
}

if (exec_fault && device)
@@ -1508,7 +1511,7 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
}
}

- if (writable)
+ if (foll.writable)
prot |= KVM_PGTABLE_PROT_W;

if (exec_fault)
@@ -1534,9 +1537,9 @@ static int user_mem_abort(struct kvm_vcpu *vcpu, phys_addr_t fault_ipa,
KVM_PGTABLE_WALK_SHARED);

/* Mark the page dirty only if the fault is handled successfully */
- if (writable && !ret) {
+ if (foll.writable && !ret) {
kvm_set_pfn_dirty(pfn);
- mark_page_dirty_in_slot(kvm, memslot, gfn);
+ mark_page_dirty_in_slot(kvm, memslot, foll.gfn);
}

out_unlock:
--
2.41.0.255.g8b1d071c50-goog


2023-07-04 08:38:43

by David Stevens

Subject: [PATCH v7 1/8] KVM: Assert that a page's refcount is elevated when marking accessed/dirty

From: Sean Christopherson <[email protected]>

Assert that a page's refcount is elevated, i.e. that _something_ holds a
reference to the page, when KVM marks a page as accessed and/or dirty.
KVM typically doesn't hold a reference to pages that are mapped into the
guest, e.g. to allow page migration, compaction, swap, etc., and instead
relies on mmu_notifiers to react to changes in the primary MMU.

Incorrect handling of mmu_notifier events (or similar mechanisms) can
result in KVM keeping a mapping beyond the lifetime of the backing page,
i.e. can (and often does) result in use-after-free. Yelling if KVM marks
a freed page as accessed/dirty doesn't prevent badness as KVM usually
only does A/D updates when unmapping memory from the guest, i.e. the
assertion fires well after an underlying bug has occurred, but yelling
does help detect, triage, and debug use-after-free bugs.

Note, the assertion must use page_count(), NOT page_ref_count()! For
hugepages, the returned struct page may be a tailpage and thus not have
its own refcount.
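
To make that concrete, a rough sketch of the difference (illustrative
only, not part of the patch):

    /*
     * page_ref_count(page) reads page->_refcount directly; for a tail
     * page of a compound allocation that is typically zero.
     * page_count(page) resolves to the head page/folio refcount, which
     * is the count that is actually elevated while the page is pinned.
     */
    WARN_ON_ONCE(!page_count(page));      /* what the patch asserts   */
    WARN_ON_ONCE(!page_ref_count(page));  /* would misfire on tails   */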

Signed-off-by: Sean Christopherson <[email protected]>
---
virt/kvm/kvm_main.c | 13 +++++++++++++
1 file changed, 13 insertions(+)

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index b838c8f71349..371bd783ff2b 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2885,6 +2885,19 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_unmap);

static bool kvm_is_ad_tracked_page(struct page *page)
{
+ /*
+ * Assert that KVM isn't attempting to mark a freed page as Accessed or
+ * Dirty, i.e. that KVM's MMU doesn't have a use-after-free bug. KVM
+ * (typically) doesn't pin pages that are mapped in KVM's MMU, and
+ * instead relies on mmu_notifiers to know when a mapping needs to be
+ * zapped/invalidated. Unmapping from KVM's MMU must happen _before_
+ * KVM returns from its mmu_notifier, i.e. the page should have an
+ * elevated refcount at this point even though KVM doesn't hold a
+ * reference of its own.
+ */
+ if (WARN_ON_ONCE(!page_count(page)))
+ return false;
+
/*
* Per page-flags.h, pages tagged PG_reserved "should in general not be
* touched (e.g. set dirty) except by its owner".
--
2.41.0.255.g8b1d071c50-goog


2023-07-05 03:53:28

by Yu Zhang

Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {

I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.

> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
...

> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {

Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
seems unnecessary.

> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>

B.R.
Yu

2023-07-05 07:56:15

by Yu Zhang

Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
No one calls these two routines in this patch. How about moving this
change to [PATCH v7 5/8] "KVM: x86/mmu: Don't pass FOLL_GET to
__kvm_follow_pfn"?

...

> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
> return !PageReserved(page);
> }
>
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> SetPageDirty(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> mark_page_accessed(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);

Same here.

B.R.
Yu

2023-07-05 09:19:43

by Zhi Wang

Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, 4 Jul 2023 16:50:47 +0900
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +struct kvm_follow_pfn {
> + const struct kvm_memory_slot *slot;
> + gfn_t gfn;
> + unsigned int flags;
> + bool atomic;
> + /* Allow a read fault to create a writeable mapping. */
> + bool allow_write_mapping;
> +
> + /* Outputs of __kvm_follow_pfn */
> + hva_t hva;
> + bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
> kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> * true indicates success, otherwise false is returned. It's also the
> * only part that runs if we can in atomic context.
> */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> struct page *page[1];
> + bool write_fault = foll->flags & FOLL_WRITE;
>
> /*
> * Fast pin a writable pfn only if it is a write fault request
> * or the caller allows to map a writable pfn for a read fault
> * request.
> */
> - if (!(write_fault || writable))
> + if (!(write_fault || foll->allow_write_mapping))
> return false;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> -
> - if (writable)
> - *writable = true;
> + foll->writable = foll->allow_write_mapping;
> return true;
> }
>
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)
> {
> kvm_pfn_t pfn;
> pte_t *ptep;
> spinlock_t *ptl;
> + bool write_fault = foll->flags & FOLL_WRITE;
> int r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r) {
> /*
> * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> * not call the fault handler, so do it here.
> */
> bool unlocked = false;
> - r = fixup_user_fault(current->mm, addr,
> + r = fixup_user_fault(current->mm, foll->hva,
> (write_fault ? FAULT_FLAG_WRITE : 0),
> &unlocked);
> if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> if (r)
> return r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r)
> return r;
> }
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> pfn = KVM_PFN_ERR_HWPOISON;
> goto exit;
> }
>
> retry:
> - vma = vma_lookup(current->mm, addr);
> + vma = vma_lookup(current->mm, foll->hva);
>
> if (vma == NULL)
> pfn = KVM_PFN_ERR_FAULT;
> else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> + r = hva_to_pfn_remapped(vma, foll, &pfn);
> if (r == -EAGAIN)
> goto retry;
> if (r < 0)
> pfn = KVM_PFN_ERR_FAULT;
> } else {
> - if (async && vma_is_valid(vma, write_fault))
> - *async = true;
> - pfn = KVM_PFN_ERR_FAULT;
> + if ((foll->flags & FOLL_NOWAIT) &&
> + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> + pfn = KVM_PFN_ERR_NEEDS_IO;
> + else
> + pfn = KVM_PFN_ERR_FAULT;
> }
> exit:
> mmap_read_unlock(current->mm);
> return pfn;
> }
>
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> - bool atomic, bool interruptible, bool *async,
> - bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> - if (hva)
> - *hva = addr;
> + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> + foll->flags & FOLL_WRITE);
>
> - if (addr == KVM_HVA_ERR_RO_BAD) {
> - if (writable)
> - *writable = false;
> + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> return KVM_PFN_ERR_RO_FAULT;
> - }
>

Can you explain why the update of foll->writable to false (previously
*writable = false) is omitted here?

In the caller where the struct kvm_follow_pfn is initialized, e.g.
__gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
the upper caller.

As it is one of the outputs, it would be better to either initialize it in
the caller or update it in __kvm_follow_pfn(). Otherwise
__gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data to
the caller via bool *writable. That doesn't sound nice.

BTW: it seems both "writable" and "writeable" are used in this patch.
Perhaps we can make the spelling consistent.

> - if (kvm_is_error_hva(addr)) {
> - if (writable)
> - *writable = false;
> + if (kvm_is_error_hva(foll->hva))
> return KVM_PFN_NOSLOT;
> - }
>
> - /* Do not map writable pfn in the readonly memslot. */
> - if (writable && memslot_is_readonly(slot)) {
> - *writable = false;
> - writable = NULL;
> - }
> + if (memslot_is_readonly(foll->slot))
> + foll->allow_write_mapping = false;
> +
> + return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>
> - return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> - writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable)
> {
> - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> - NULL, write_fault, writable, NULL);
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = gfn_to_memslot(kvm, gfn),
> + .gfn = gfn,
> + .flags = write_fault ? FOLL_WRITE : 0,
> + .allow_write_mapping = !!writable,
> + };
> + pfn = __kvm_follow_pfn(&foll);
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>
> kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + .atomic = true,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
> #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>
> #ifdef CONFIG_HAVE_KVM_PFNCACHE
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
> void *new_khva = NULL;
> unsigned long mmu_seq;
> + struct kvm_follow_pfn foll = {
> + .slot = gpc->memslot,
> + .gfn = gpa_to_gfn(gpc->gpa),
> + .flags = FOLL_WRITE,
> + .hva = gpc->uhva,
> + };
>
> lockdep_assert_held(&gpc->refresh_lock);
>
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> }
>
> /* We always request a writeable mapping */
> - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> + new_pfn = hva_to_pfn(&foll);
> if (is_error_noslot_pfn(new_pfn))
> goto out_error;
>


2023-07-05 09:36:01

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 16:50:47 +0900
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 16 ++++
> > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > virt/kvm/kvm_mm.h | 3 +-
> > virt/kvm/pfncache.c | 8 +-
> > 4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> > /*
> > * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > + const struct kvm_memory_slot *slot;
> > + gfn_t gfn;
> > + unsigned int flags;
> > + bool atomic;
> > + /* Allow a read fault to create a writeable mapping. */
> > + bool allow_write_mapping;
> > +
> > + /* Outputs of __kvm_follow_pfn */
> > + hva_t hva;
> > + bool writable;
> > +};
> > +
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > +
> > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > bool *writable);
> > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > index 371bd783ff2b..b13f22861d2f 100644
> > --- a/virt/kvm/kvm_main.c
> > +++ b/virt/kvm/kvm_main.c
> > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > * true indicates success, otherwise false is returned. It's also the
> > * only part that runs if we can in atomic context.
> > */
> > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > - bool *writable, kvm_pfn_t *pfn)
> > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > struct page *page[1];
> > + bool write_fault = foll->flags & FOLL_WRITE;
> >
> > /*
> > * Fast pin a writable pfn only if it is a write fault request
> > * or the caller allows to map a writable pfn for a read fault
> > * request.
> > */
> > - if (!(write_fault || writable))
> > + if (!(write_fault || foll->allow_write_mapping))
> > return false;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > *pfn = page_to_pfn(page[0]);
> > -
> > - if (writable)
> > - *writable = true;
> > + foll->writable = foll->allow_write_mapping;
> > return true;
> > }
> >
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > * The slow path to get the pfn of the specified host virtual address,
> > * 1 indicates success, -errno is returned if error is detected.
> > */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > - unsigned int flags = FOLL_HWPOISON;
> > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > struct page *page;
> > int npages;
> >
> > might_sleep();
> >
> > - if (writable)
> > - *writable = write_fault;
> > -
> > - if (write_fault)
> > - flags |= FOLL_WRITE;
> > - if (async)
> > - flags |= FOLL_NOWAIT;
> > - if (interruptible)
> > - flags |= FOLL_INTERRUPTIBLE;
> > -
> > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > if (npages != 1)
> > return npages;
> >
> > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> > /* map read fault as writable if possible */
> > - if (unlikely(!write_fault) && writable) {
> > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > struct page *wpage;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > - *writable = true;
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > + foll->writable = true;
> > put_page(page);
> > page = wpage;
> > }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > return get_page_unless_zero(page);
> > }
> >
> > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > - unsigned long addr, bool write_fault,
> > - bool *writable, kvm_pfn_t *p_pfn)
> > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > + kvm_pfn_t *p_pfn)
> > {
> > kvm_pfn_t pfn;
> > pte_t *ptep;
> > spinlock_t *ptl;
> > + bool write_fault = foll->flags & FOLL_WRITE;
> > int r;
> >
> > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > if (r) {
> > /*
> > * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> > * not call the fault handler, so do it here.
> > */
> > bool unlocked = false;
> > - r = fixup_user_fault(current->mm, addr,
> > + r = fixup_user_fault(current->mm, foll->hva,
> > (write_fault ? FAULT_FLAG_WRITE : 0),
> > &unlocked);
> > if (unlocked)
> > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > if (r)
> > return r;
> >
> > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > if (r)
> > return r;
> > }
> > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > goto out;
> > }
> >
> > - if (writable)
> > - *writable = pte_write(*ptep);
> > + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> > pfn = pte_pfn(*ptep);
> >
> > /*
> > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > * 2): @write_fault = false && @writable, @writable will tell the caller
> > * whether the mapping is writable.
> > */
> > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > - bool *async, bool write_fault, bool *writable)
> > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> > {
> > struct vm_area_struct *vma;
> > kvm_pfn_t pfn;
> > int npages, r;
> >
> > /* we can do it either atomically or asynchronously, not both */
> > - BUG_ON(atomic && async);
> > + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> >
> > - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > + if (hva_to_pfn_fast(foll, &pfn))
> > return pfn;
> >
> > - if (atomic)
> > + if (foll->atomic)
> > return KVM_PFN_ERR_FAULT;
> >
> > - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > - writable, &pfn);
> > + npages = hva_to_pfn_slow(foll, &pfn);
> > if (npages == 1)
> > return pfn;
> > if (npages == -EINTR)
> > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> >
> > mmap_read_lock(current->mm);
> > if (npages == -EHWPOISON ||
> > - (!async && check_user_page_hwpoison(addr))) {
> > + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> > pfn = KVM_PFN_ERR_HWPOISON;
> > goto exit;
> > }
> >
> > retry:
> > - vma = vma_lookup(current->mm, addr);
> > + vma = vma_lookup(current->mm, foll->hva);
> >
> > if (vma == NULL)
> > pfn = KVM_PFN_ERR_FAULT;
> > else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > + r = hva_to_pfn_remapped(vma, foll, &pfn);
> > if (r == -EAGAIN)
> > goto retry;
> > if (r < 0)
> > pfn = KVM_PFN_ERR_FAULT;
> > } else {
> > - if (async && vma_is_valid(vma, write_fault))
> > - *async = true;
> > - pfn = KVM_PFN_ERR_FAULT;
> > + if ((foll->flags & FOLL_NOWAIT) &&
> > + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > + pfn = KVM_PFN_ERR_NEEDS_IO;
> > + else
> > + pfn = KVM_PFN_ERR_FAULT;
> > }
> > exit:
> > mmap_read_unlock(current->mm);
> > return pfn;
> > }
> >
> > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > - bool atomic, bool interruptible, bool *async,
> > - bool write_fault, bool *writable, hva_t *hva)
> > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > {
> > - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > -
> > - if (hva)
> > - *hva = addr;
> > + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > + foll->flags & FOLL_WRITE);
> >
> > - if (addr == KVM_HVA_ERR_RO_BAD) {
> > - if (writable)
> > - *writable = false;
> > + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> > return KVM_PFN_ERR_RO_FAULT;
> > - }
> >
>
> Can you explain why the update of foll->writable to false (previously
> *writable = false) is omitted here?
>
> In the caller where the struct kvm_follow_pfn is initialized, e.g.
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
> the upper caller.
>
> As it is one of the outputs, it would be better to either initialize it in
> the caller or update it in __kvm_follow_pfn(). Otherwise
> __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random stack data to
> the caller via bool *writable. That doesn't sound nice.

Entries omitted from a designated initializer are initialized to zero, so
.writable does get initialized in all of the patches in this series.
That said, you're right that explicitly setting it to false is a good
idea, in case someone someday adds a caller that doesn't use an
initializer when declaring its kvm_follow_pfn.
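
For reference, a minimal standalone illustration of that C rule (the
struct below is just a stand-in for kvm_follow_pfn, not the real thing):

#include <stdio.h>
#include <stdbool.h>

struct follow_pfn_example {
	unsigned int flags;	/* input */
	bool atomic;		/* input */
	bool writable;		/* output, deliberately not named below */
};

int main(void)
{
	/*
	 * C11 6.7.9p19/p21: members not named in a designated initializer
	 * are initialized as if the object had static storage duration,
	 * i.e. to zero/false.
	 */
	struct follow_pfn_example foll = {
		.flags = 0x1,
		.atomic = true,
	};

	printf("writable = %d\n", foll.writable);	/* prints 0 */
	return 0;
}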

-David

2023-07-05 09:47:31

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
>
> > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > * The slow path to get the pfn of the specified host virtual address,
> > * 1 indicates success, -errno is returned if error is detected.
> > */
> > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > {
> > - unsigned int flags = FOLL_HWPOISON;
> > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > struct page *page;
> > int npages;
> >
> > might_sleep();
> >
> > - if (writable)
> > - *writable = write_fault;
> > -
> > - if (write_fault)
> > - flags |= FOLL_WRITE;
> > - if (async)
> > - flags |= FOLL_NOWAIT;
> > - if (interruptible)
> > - flags |= FOLL_INTERRUPTIBLE;
> > -
> > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > if (npages != 1)
> > return npages;
> >
> > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > +
> > /* map read fault as writable if possible */
> > - if (unlikely(!write_fault) && writable) {
> > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
>
> I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.

The two statements are logically equivalent, although I guess using
!(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
verbose.
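
Spelled out, with the two relevant lines from the patch annotated:

	/* Immediately before the check, the patch sets: */
	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;

	/*
	 * The other half of the condition is foll->allow_write_mapping.
	 * When allow_write_mapping is false, neither spelling of the check
	 * can be true.  When it is true, the assignment above reduces to
	 * foll->writable == (foll->flags & FOLL_WRITE), so
	 *
	 *   !foll->writable  <==>  !(foll->flags & FOLL_WRITE)
	 *
	 * and the two spellings behave identically.
	 */
	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
		/* map read fault as writable if possible */
	}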

> > struct page *wpage;
> >
> > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > - *writable = true;
> > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > + foll->writable = true;
> > put_page(page);
> > page = wpage;
> > }
> > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > return get_page_unless_zero(page);
> > }
> >
> ...
>
> > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > + bool atomic, bool interruptible, bool *async,
> > + bool write_fault, bool *writable, hva_t *hva)
> > +{
> > + kvm_pfn_t pfn;
> > + struct kvm_follow_pfn foll = {
> > + .slot = slot,
> > + .gfn = gfn,
> > + .flags = 0,
> > + .atomic = atomic,
> > + .allow_write_mapping = !!writable,
> > + };
> > +
> > + if (write_fault)
> > + foll.flags |= FOLL_WRITE;
> > + if (async)
> > + foll.flags |= FOLL_NOWAIT;
> > + if (interruptible)
> > + foll.flags |= FOLL_INTERRUPTIBLE;
> > +
> > + pfn = __kvm_follow_pfn(&foll);
> > + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
>
> Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
> if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> seems unnecessary.

There are cases where the fault does not fall within a vma, or where
the target vma's flags don't support the fault's access permissions.
In those cases, continuing to try to resolve the fault won't cause
problems per se, but it's wasteful and a bit confusing. Having
hva_to_pfn detect whether it may be possible to resolve the fault
asynchronously, and return KVM_PFN_ERR_NEEDS_IO if so, seems like a
good idea. It also matches what the existing code does.
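
For illustration, a hypothetical async-capable caller ends up looking
roughly like this (the function name is made up; the logic mirrors what
__gfn_to_pfn_memslot does in this patch):

/*
 * Hypothetical caller: ask __kvm_follow_pfn() not to block on I/O and
 * translate KVM_PFN_ERR_NEEDS_IO back into the old *async convention.
 */
static kvm_pfn_t example_faultin_pfn(struct kvm_follow_pfn *foll, bool *async)
{
	kvm_pfn_t pfn;

	foll->flags |= FOLL_NOWAIT;	/* don't wait for I/O */
	pfn = __kvm_follow_pfn(foll);

	if (pfn == KVM_PFN_ERR_NEEDS_IO) {
		/*
		 * The hva is backed by a vma whose flags are compatible
		 * with the fault, so an asynchronous retry can succeed;
		 * otherwise __kvm_follow_pfn() already returned
		 * KVM_PFN_ERR_FAULT and we give up immediately.
		 */
		*async = true;
		pfn = KVM_PFN_ERR_FAULT;
	}
	return pfn;
}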

-David

2023-07-05 10:47:05

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.

I guess you mean the tail pages of the higher order non-compound pages?
As for the head page, its refcount is said to be one only coincidentally [*],
so it should not be considered refcounted either. IIUC, the refcount of this
head page will be incremented and then decremented again shortly in
hva_to_pfn_remapped(), so this may not be a problem(?). But treating the
head page differently, as a refcounted one (e.g., to set the A/D flags),
is weird.

Or maybe I missed some context, e.g., can the head page be allocated to
the guest at all?


>
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.

Also, maybe we should mention this only works on x86-64.

>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> arch/x86/kvm/mmu/spte.c | 4 ++-
> arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> 6 files changed, 62 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c

...

> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> bool host_writable = !fault || fault->map_writable;
> bool prefetch = !fault || fault->prefetch;
> bool write_fault = fault && fault->write;
> + bool is_refcounted = !fault || fault->is_refcounted_page;

Just wondering: what if a non-refcounted page is prefetched? Or is that even
possible in practice?

...
>
> @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> */
> static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> {
> - bool host_writable;
> + bool host_writable, is_refcounted;
> gpa_t first_pte_gpa;
> u64 *sptep, spte;
> struct kvm_memory_slot *slot;
> @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> sptep = &sp->spt[i];
> spte = *sptep;
> host_writable = spte & shadow_host_writable_mask;
> + // TODO: is this correct?
> + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> make_spte(vcpu, sp, slot, pte_access, gfn,
> spte_to_pfn(spte), spte, true, false,
> - host_writable, &spte);
> + host_writable, is_refcounted, &spte);

Could we restrict that a non-refcounted page shall not be used as shadow page?

[*] https://lore.kernel.org/all/[email protected]/

B.R.
Yu

2023-07-05 10:47:27

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> memory into the guest that is backed by un-refcounted struct pages - for
> example, higher order non-compound pages allocated by the amdgpu driver
> via ttm_pool_alloc_page.
>
> The bulk of this change is tracking the is_refcounted_page flag so that
> non-refcounted pages don't trigger page_count() == 0 warnings. This is
> done by storing the flag in an unused bit in the sptes.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> arch/x86/kvm/mmu/mmu_internal.h | 1 +
> arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> arch/x86/kvm/mmu/spte.c | 4 ++-
> arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> 6 files changed, 62 insertions(+), 30 deletions(-)
>
> diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> index e44ab512c3a1..b1607e314497 100644
> --- a/arch/x86/kvm/mmu/mmu.c
> +++ b/arch/x86/kvm/mmu/mmu.c
> @@ -553,12 +553,14 @@ static bool mmu_spte_update(u64 *sptep, u64 new_spte)
>
> if (is_accessed_spte(old_spte) && !is_accessed_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_accessed(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
> + kvm_set_page_accessed(pfn_to_page(spte_to_pfn(old_spte)));
> }
>
> if (is_dirty_spte(old_spte) && !is_dirty_spte(new_spte)) {
> flush = true;
> - kvm_set_pfn_dirty(spte_to_pfn(old_spte));
> + if (is_refcounted_page_pte(old_spte))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(old_spte)));
> }
>
> return flush;
> @@ -596,14 +598,18 @@ static u64 mmu_spte_clear_track_bits(struct kvm *kvm, u64 *sptep)
> * before they are reclaimed. Sanity check that, if the pfn is backed
> * by a refcounted page, the refcount is elevated.
> */
> - page = kvm_pfn_to_refcounted_page(pfn);
> - WARN_ON(page && !page_count(page));
> + if (is_refcounted_page_pte(old_spte)) {
> + page = kvm_pfn_to_refcounted_page(pfn);
> + WARN_ON(!page || !page_count(page));
> + }
>
> - if (is_accessed_spte(old_spte))
> - kvm_set_pfn_accessed(pfn);
> + if (is_refcounted_page_pte(old_spte)) {
> + if (is_accessed_spte(old_spte))
> + kvm_set_page_accessed(pfn_to_page(pfn));
>
> - if (is_dirty_spte(old_spte))
> - kvm_set_pfn_dirty(pfn);
> + if (is_dirty_spte(old_spte))
> + kvm_set_page_dirty(pfn_to_page(pfn));
> + }
>
> return old_spte;
> }
> @@ -639,8 +645,8 @@ static bool mmu_spte_age(u64 *sptep)
> * Capture the dirty status of the page, so that it doesn't get
> * lost when the SPTE is marked for access tracking.
> */
> - if (is_writable_pte(spte))
> - kvm_set_pfn_dirty(spte_to_pfn(spte));
> + if (is_writable_pte(spte) && is_refcounted_page_pte(spte))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(spte)));
>
> spte = mark_spte_for_access_track(spte);
> mmu_spte_update_no_track(sptep, spte);
> @@ -1278,8 +1284,8 @@ static bool spte_wrprot_for_clear_dirty(u64 *sptep)
> {
> bool was_writable = test_and_clear_bit(PT_WRITABLE_SHIFT,
> (unsigned long *)sptep);
> - if (was_writable && !spte_ad_enabled(*sptep))
> - kvm_set_pfn_dirty(spte_to_pfn(*sptep));
> + if (was_writable && !spte_ad_enabled(*sptep) && is_refcounted_page_pte(*sptep))
> + kvm_set_page_dirty(pfn_to_page(spte_to_pfn(*sptep)));
>
> return was_writable;
> }
> @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> bool host_writable = !fault || fault->map_writable;
> bool prefetch = !fault || fault->prefetch;
> bool write_fault = fault && fault->write;
> + bool is_refcounted = !fault || fault->is_refcounted_page;
>
> pgprintk("%s: spte %llx write_fault %d gfn %llx\n", __func__,
> *sptep, write_fault, gfn);
> @@ -2969,7 +2976,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> }
>
> wrprot = make_spte(vcpu, sp, slot, pte_access, gfn, pfn, *sptep, prefetch,
> - true, host_writable, &spte);
> + true, host_writable, is_refcounted, &spte);
>
> if (*sptep == spte) {
> ret = RET_PF_SPURIOUS;
> @@ -4299,8 +4306,9 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = fault->gfn,
> - .flags = FOLL_GET | (fault->write ? FOLL_WRITE : 0),
> + .flags = fault->write ? FOLL_WRITE : 0,
> .allow_write_mapping = true,
> + .guarded_by_mmu_notifier = true,
> };
>
> /*
> @@ -4317,6 +4325,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> fault->slot = NULL;
> fault->pfn = KVM_PFN_NOSLOT;
> fault->map_writable = false;
> + fault->is_refcounted_page = false;
> return RET_PF_CONTINUE;
> }
> /*
> @@ -4366,6 +4375,7 @@ static int __kvm_faultin_pfn(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> success:
> fault->hva = foll.hva;
> fault->map_writable = foll.writable;
> + fault->is_refcounted_page = foll.is_refcounted_page;
> return RET_PF_CONTINUE;
> }
>
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> out_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> return r;
> }
>
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);

Yet kvm_release_pfn() can still be triggered for kvm_vcpu_map'ed gfns.
What if the guest uses a non-refcounted page (e.g., as a vmcs12)? Although I
believe this is not going to happen in the real world...

B.R.
Yu

2023-07-05 11:42:18

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > * The slow path to get the pfn of the specified host virtual address,
> > > * 1 indicates success, -errno is returned if error is detected.
> > > */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > - unsigned int flags = FOLL_HWPOISON;
> > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > struct page *page;
> > > int npages;
> > >
> > > might_sleep();
> > >
> > > - if (writable)
> > > - *writable = write_fault;
> > > -
> > > - if (write_fault)
> > > - flags |= FOLL_WRITE;
> > > - if (async)
> > > - flags |= FOLL_NOWAIT;
> > > - if (interruptible)
> > > - flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > if (npages != 1)
> > > return npages;
> > >
> > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > > /* map read fault as writable if possible */
> > > - if (unlikely(!write_fault) && writable) {
> > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> >
> > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
>
> The two statements are logically equivalent, although I guess using
> !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> verbose.

Well, as the comment says, we want to try to map a read fault as writable
whenever possible. And __gfn_to_pfn_memslot() will only set FOLL_WRITE
for write faults. So I guess using !foll->writable will not allow this.
Did I miss anything?

> > > +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > + bool atomic, bool interruptible, bool *async,
> > > + bool write_fault, bool *writable, hva_t *hva)
> > > +{
> > > + kvm_pfn_t pfn;
> > > + struct kvm_follow_pfn foll = {
> > > + .slot = slot,
> > > + .gfn = gfn,
> > > + .flags = 0,
> > > + .atomic = atomic,
> > > + .allow_write_mapping = !!writable,
> > > + };
> > > +
> > > + if (write_fault)
> > > + foll.flags |= FOLL_WRITE;
> > > + if (async)
> > > + foll.flags |= FOLL_NOWAIT;
> > > + if (interruptible)
> > > + foll.flags |= FOLL_INTERRUPTIBLE;
> > > +
> > > + pfn = __kvm_follow_pfn(&foll);
> > > + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> >
> > Could we just use KVM_PFN_ERR_FAULT and foll.flags here? I.e.,
> > if (pfn == KVM_PFN_ERR_FAULT && (foll.flags & FOLL_NOWAIT))?
> > Setting pfn to KVM_PFN_ERR_NEEDS_IO just to indicate an async fault
> > seems unnecessary.
>
> There are cases where the fault does not fall within a vma, or where
> the target vma's flags don't support the fault's access permissions.
> In those cases, continuing to try to resolve the fault won't cause
> problems per se, but it's wasteful and a bit confusing. Having
> hva_to_pfn detect whether it may be possible to resolve the fault
> asynchronously, and return KVM_PFN_ERR_NEEDS_IO if so, seems like a
> good idea. It also matches what the existing code does.

Got it. Sounds reasonable. And thanks! :)

B.R.
Yu

2023-07-05 12:12:46

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
>
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 10 ++++++
> virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> virt/kvm/pfncache.c | 2 +-
> 3 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
> struct kvm_follow_pfn {
> const struct kvm_memory_slot *slot;
> gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> bool atomic;
> /* Allow a read fault to create a writeable mapping. */
> bool allow_write_mapping;
> + /*
> + * Usage of the returned pfn will be guared by a mmu notifier. Must
> + * be true if FOLL_GET is not set.
> + */
> + bool guarded_by_mmu_notifier;

And how is that enforced? Is there any place that checks the invalidate seq?

B.R.
Yu

2023-07-05 13:31:06

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, 4 Jul 2023 16:50:48 +0900
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> callers to resolve a gfn when the associated pfn has a valid struct page
> that isn't being actively refcounted (e.g. tail pages of non-compound
> higher order pages). For a caller to safely omit FOLL_GET, all usages of
> the returned pfn must be guarded by a mmu notifier.
>
> This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> is set when the returned pfn has an associated struct page with a valid
> refcount. Callers that don't pass FOLL_GET should remember this value
> and use it to avoid places like kvm_is_ad_tracked_page that assume a
> non-zero refcount.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 10 ++++++
> virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> virt/kvm/pfncache.c | 2 +-
> 3 files changed, 47 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index ef2763c2b12e..a45308c7d2d9 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +void kvm_set_page_accessed(struct page *page);
> +void kvm_set_page_dirty(struct page *page);
> +
> struct kvm_follow_pfn {
> const struct kvm_memory_slot *slot;
> gfn_t gfn;
> @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> bool atomic;
> /* Allow a read fault to create a writeable mapping. */
> bool allow_write_mapping;
> + /*
> + * Usage of the returned pfn will be guared by a mmu notifier. Must
^guarded
> + * be true if FOLL_GET is not set.
> + */
> + bool guarded_by_mmu_notifier;
>
It seems nothing sets guarded_by_mmu_notifier in this patch. Is
guarded_by_mmu_notifier always equal to !(foll->flags & FOLL_GET) and set by
the caller of __kvm_follow_pfn()?

If yes, do we have to use FOLL_GET to resolve a GFN associated with a tail
page? It seems gup can tolerate gup_flags without FOLL_GET, but that is more
like a temporary workaround. I don't think it is a good idea to play tricks
with a temporary workaround; it feels like we are abusing its tolerance.

Is a flag like guarded_by_mmu_notifier (perhaps with a better name) enough to
indicate a tail page?
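
For context, I assume the guard that callers omitting FOLL_GET are expected
to implement looks roughly like the usual KVM retry pattern, something like
this (helper names taken from the existing fault paths, not from this
series):

static int example_map_unreferenced_pfn(struct kvm *kvm,
					struct kvm_follow_pfn *foll)
{
	unsigned long mmu_seq;
	kvm_pfn_t pfn;

retry:
	mmu_seq = kvm->mmu_invalidate_seq;
	smp_rmb();

	pfn = __kvm_follow_pfn(foll);	/* no FOLL_GET: no reference taken */
	if (is_error_noslot_pfn(pfn))
		return -EFAULT;

	write_lock(&kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(kvm, mmu_seq, foll->hva)) {
		/* An invalidation raced with the lookup; pfn may be stale. */
		write_unlock(&kvm->mmu_lock);
		goto retry;
	}
	/* Safe to install pfn into the stage-2/TDP page tables here. */
	write_unlock(&kvm->mmu_lock);
	return 0;
}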

> /* Outputs of __kvm_follow_pfn */
> hva_t hva;
> bool writable;
> + /* True if the returned pfn is for a page with a valid refcount. */
> + bool is_refcounted_page;
> };
>
> kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index b13f22861d2f..0f7b41f220b6 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2502,6 +2502,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> foll->writable = foll->allow_write_mapping;
> + foll->is_refcounted_page = true;
> + if (!(foll->flags & FOLL_GET))
> + put_page(page[0]);
> return true;
> }
>
> @@ -2525,6 +2528,7 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> return npages;
>
> foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> + foll->is_refcounted_page = true;
>
> /* map read fault as writable if possible */
> if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> @@ -2537,6 +2541,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> }
> }
> *pfn = page_to_pfn(page);
> + if (!(foll->flags & FOLL_GET))
> + put_page(page);
> return npages;
> }
>
> @@ -2551,16 +2557,6 @@ static bool vma_is_valid(struct vm_area_struct *vma, bool write_fault)
> return true;
> }
>
> -static int kvm_try_get_pfn(kvm_pfn_t pfn)
> -{
> - struct page *page = kvm_pfn_to_refcounted_page(pfn);
> -
> - if (!page)
> - return 1;
> -
> - return get_page_unless_zero(page);
> -}
> -
> static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> kvm_pfn_t *p_pfn)
> {
> @@ -2568,6 +2564,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> pte_t *ptep;
> spinlock_t *ptl;
> bool write_fault = foll->flags & FOLL_WRITE;
> + struct page *page;
> int r;
>
> r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> @@ -2599,24 +2596,27 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
> pfn = pte_pfn(*ptep);
>
> /*
> - * Get a reference here because callers of *hva_to_pfn* and
> - * *gfn_to_pfn* ultimately call kvm_release_pfn_clean on the
> - * returned pfn. This is only needed if the VMA has VM_MIXEDMAP
> - * set, but the kvm_try_get_pfn/kvm_release_pfn_clean pair will
> - * simply do nothing for reserved pfns.
> - *
> - * Whoever called remap_pfn_range is also going to call e.g.
> - * unmap_mapping_range before the underlying pages are freed,
> - * causing a call to our MMU notifier.
> + * Now deal with reference counting. If kvm_pfn_to_refcounted_page
> + * returns NULL, then there's no refcount to worry about.
> *
> - * Certain IO or PFNMAP mappings can be backed with valid
> - * struct pages, but be allocated without refcounting e.g.,
> - * tail pages of non-compound higher order allocations, which
> - * would then underflow the refcount when the caller does the
> - * required put_page. Don't allow those pages here.
> + * Otherwise, certain IO or PFNMAP mappings can be backed with valid
> + * struct pages but be allocated without refcounting e.g., tail pages of
> + * non-compound higher order allocations. If FOLL_GET is set and we
> + * increment such a refcount, then when that pfn is eventually passed to
> + * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
> + * freed. Therefore don't allow those pages here when FOLL_GET is set.
> */
> - if (!kvm_try_get_pfn(pfn))
> + page = kvm_pfn_to_refcounted_page(pfn);
> + if (!page)
> + goto out;
> +
> + if (get_page_unless_zero(page)) {
> + foll->is_refcounted_page = true;
> + if (!(foll->flags & FOLL_GET))
> + put_page(page);
> + } else if (foll->flags & FOLL_GET) {
> r = -EFAULT;
> + }
>
> out:
> pte_unmap_unlock(ptep, ptl);
> @@ -2693,6 +2693,9 @@ kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
>
> kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> + if (WARN_ON_ONCE(!(foll->flags & FOLL_GET) && !foll->guarded_by_mmu_notifier))
> + return KVM_PFN_ERR_FAULT;
> +
> foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> foll->flags & FOLL_WRITE);
>
> @@ -2717,7 +2720,7 @@ kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = 0,
> + .flags = FOLL_GET,
> .atomic = atomic,
> .allow_write_mapping = !!writable,
> };
> @@ -2749,7 +2752,7 @@ kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> struct kvm_follow_pfn foll = {
> .slot = gfn_to_memslot(kvm, gfn),
> .gfn = gfn,
> - .flags = write_fault ? FOLL_WRITE : 0,
> + .flags = FOLL_GET | (write_fault ? FOLL_WRITE : 0),
> .allow_write_mapping = !!writable,
> };
> pfn = __kvm_follow_pfn(&foll);
> @@ -2764,7 +2767,7 @@ kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = FOLL_WRITE,
> + .flags = FOLL_GET | FOLL_WRITE,
> };
> return __kvm_follow_pfn(&foll);
> }
> @@ -2775,7 +2778,7 @@ kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gf
> struct kvm_follow_pfn foll = {
> .slot = slot,
> .gfn = gfn,
> - .flags = FOLL_WRITE,
> + .flags = FOLL_GET | FOLL_WRITE,
> .atomic = true,
> };
> return __kvm_follow_pfn(&foll);
> @@ -2930,17 +2933,19 @@ static bool kvm_is_ad_tracked_page(struct page *page)
> return !PageReserved(page);
> }
>
> -static void kvm_set_page_dirty(struct page *page)
> +void kvm_set_page_dirty(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> SetPageDirty(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_dirty);
>
> -static void kvm_set_page_accessed(struct page *page)
> +void kvm_set_page_accessed(struct page *page)
> {
> if (kvm_is_ad_tracked_page(page))
> mark_page_accessed(page);
> }
> +EXPORT_SYMBOL_GPL(kvm_set_page_accessed);
>
> void kvm_release_page_clean(struct page *page)
> {
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index e3fefa753a51..87caafce3dd0 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -147,7 +147,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> struct kvm_follow_pfn foll = {
> .slot = gpc->memslot,
> .gfn = gpa_to_gfn(gpc->gpa),
> - .flags = FOLL_WRITE,
> + .flags = FOLL_WRITE | FOLL_GET,
> .hva = gpc->uhva,
> };
>


2023-07-05 14:21:34

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > */
> > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > {
> > - bool host_writable;
> > + bool host_writable, is_refcounted;
> > gpa_t first_pte_gpa;
> > u64 *sptep, spte;
> > struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > sptep = &sp->spt[i];
> > spte = *sptep;
> > host_writable = spte & shadow_host_writable_mask;
> > + // TODO: is this correct?
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > make_spte(vcpu, sp, slot, pte_access, gfn,
> > spte_to_pfn(spte), spte, true, false,
> > - host_writable, &spte);
> > + host_writable, is_refcounted, &spte);
>
> Could we restrict that a non-refcounted page shall not be used as shadow page?

Oh, sorry. It's not about the shadow page; it's about the guest page being
mapped as non-refcounted. Silly me...

B.R.
Yu

2023-07-06 01:56:53

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, Jul 04, 2023 at 04:50:47PM +0900,
David Stevens <[email protected]> wrote:

> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
>
> /*
> * error pfns indicate that the gfn is in slot but faild to
> @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> void kvm_release_page_clean(struct page *page);
> void kvm_release_page_dirty(struct page *page);
>
> +struct kvm_follow_pfn {
> + const struct kvm_memory_slot *slot;
> + gfn_t gfn;
> + unsigned int flags;
> + bool atomic;
> + /* Allow a read fault to create a writeable mapping. */
> + bool allow_write_mapping;

Maybe make the input arguments const?


> +
> + /* Outputs of __kvm_follow_pfn */
> + hva_t hva;
> + bool writable;
> +};
> +
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> +
> kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable);
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index 371bd783ff2b..b13f22861d2f 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> * true indicates success, otherwise false is returned. It's also the
> * only part that runs if we can in atomic context.
> */
> -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *pfn)
> +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> struct page *page[1];
> + bool write_fault = foll->flags & FOLL_WRITE;
>
> /*
> * Fast pin a writable pfn only if it is a write fault request
> * or the caller allows to map a writable pfn for a read fault
> * request.
> */
> - if (!(write_fault || writable))
> + if (!(write_fault || foll->allow_write_mapping))
> return false;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> *pfn = page_to_pfn(page[0]);
> -
> - if (writable)
> - *writable = true;
> + foll->writable = foll->allow_write_mapping;
> return true;
> }
>
> @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> * The slow path to get the pfn of the specified host virtual address,
> * 1 indicates success, -errno is returned if error is detected.
> */
> -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> {
> - unsigned int flags = FOLL_HWPOISON;
> + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;

Although adding FOLL_GET doesn't affect the behavior of
get_user_pages_unlocked(), I wondered how this affects the next change;
it would be better to mention it in the commit message.
The get_user_pages_*() calls in hva_to_pfn_{fast, slow} imply FOLL_GET,
but __kvm_follow_pfn() doesn't imply FOLL_GET.
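
In other words, the next patch keeps FOLL_GET for the gup call itself and
then drops that reference when the __kvm_follow_pfn() caller didn't ask for
it, roughly:

	/* Condensed from hva_to_pfn_slow() as of the next patch: */
	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;
	/* ... writable handling elided ... */
	*pfn = page_to_pfn(page);
	/*
	 * gup always took a reference above; drop it again if the
	 * __kvm_follow_pfn() caller did not ask for FOLL_GET and is
	 * relying on the mmu notifier instead.
	 */
	if (!(foll->flags & FOLL_GET))
		put_page(page);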


> struct page *page;
> int npages;
>
> might_sleep();
>
> - if (writable)
> - *writable = write_fault;
> -
> - if (write_fault)
> - flags |= FOLL_WRITE;
> - if (async)
> - flags |= FOLL_NOWAIT;
> - if (interruptible)
> - flags |= FOLL_INTERRUPTIBLE;
> -
> - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> if (npages != 1)
> return npages;
>
> + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> +
> /* map read fault as writable if possible */
> - if (unlikely(!write_fault) && writable) {
> + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> struct page *wpage;
>
> - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> - *writable = true;
> + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> + foll->writable = true;
> put_page(page);
> page = wpage;
> }
> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)
> {
> kvm_pfn_t pfn;
> pte_t *ptep;
> spinlock_t *ptl;
> + bool write_fault = foll->flags & FOLL_WRITE;
> int r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r) {
> /*
> * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> * not call the fault handler, so do it here.
> */
> bool unlocked = false;
> - r = fixup_user_fault(current->mm, addr,
> + r = fixup_user_fault(current->mm, foll->hva,
> (write_fault ? FAULT_FLAG_WRITE : 0),
> &unlocked);
> if (unlocked)
> @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> if (r)
> return r;
>
> - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> if (r)
> return r;
> }
> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> pfn = KVM_PFN_ERR_HWPOISON;
> goto exit;
> }
>
> retry:
> - vma = vma_lookup(current->mm, addr);
> + vma = vma_lookup(current->mm, foll->hva);
>
> if (vma == NULL)
> pfn = KVM_PFN_ERR_FAULT;
> else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> + r = hva_to_pfn_remapped(vma, foll, &pfn);
> if (r == -EAGAIN)
> goto retry;
> if (r < 0)
> pfn = KVM_PFN_ERR_FAULT;
> } else {
> - if (async && vma_is_valid(vma, write_fault))
> - *async = true;
> - pfn = KVM_PFN_ERR_FAULT;
> + if ((foll->flags & FOLL_NOWAIT) &&
> + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> + pfn = KVM_PFN_ERR_NEEDS_IO;
> + else
> + pfn = KVM_PFN_ERR_FAULT;
> }
> exit:
> mmap_read_unlock(current->mm);
> return pfn;
> }
>
> -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> - bool atomic, bool interruptible, bool *async,
> - bool write_fault, bool *writable, hva_t *hva)
> +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> {
> - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> -
> - if (hva)
> - *hva = addr;
> + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> + foll->flags & FOLL_WRITE);
>
> - if (addr == KVM_HVA_ERR_RO_BAD) {
> - if (writable)
> - *writable = false;
> + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> return KVM_PFN_ERR_RO_FAULT;
> - }
>
> - if (kvm_is_error_hva(addr)) {
> - if (writable)
> - *writable = false;
> + if (kvm_is_error_hva(foll->hva))
> return KVM_PFN_NOSLOT;
> - }
>
> - /* Do not map writable pfn in the readonly memslot. */
> - if (writable && memslot_is_readonly(slot)) {
> - *writable = false;
> - writable = NULL;
> - }
> + if (memslot_is_readonly(foll->slot))
> + foll->allow_write_mapping = false;
> +
> + return hva_to_pfn(foll);
> +}
> +EXPORT_SYMBOL_GPL(__kvm_follow_pfn);
>
> - return hva_to_pfn(addr, atomic, interruptible, async, write_fault,
> - writable);
> +kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> + bool atomic, bool interruptible, bool *async,
> + bool write_fault, bool *writable, hva_t *hva)
> +{
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = 0,
> + .atomic = atomic,
> + .allow_write_mapping = !!writable,
> + };
> +
> + if (write_fault)
> + foll.flags |= FOLL_WRITE;
> + if (async)
> + foll.flags |= FOLL_NOWAIT;
> + if (interruptible)
> + foll.flags |= FOLL_INTERRUPTIBLE;
> +
> + pfn = __kvm_follow_pfn(&foll);
> + if (pfn == KVM_PFN_ERR_NEEDS_IO) {
> + *async = true;
> + pfn = KVM_PFN_ERR_FAULT;
> + }
> + if (hva)
> + *hva = foll.hva;
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(__gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> bool *writable)
> {
> - return __gfn_to_pfn_memslot(gfn_to_memslot(kvm, gfn), gfn, false, false,
> - NULL, write_fault, writable, NULL);
> + kvm_pfn_t pfn;
> + struct kvm_follow_pfn foll = {
> + .slot = gfn_to_memslot(kvm, gfn),
> + .gfn = gfn,
> + .flags = write_fault ? FOLL_WRITE : 0,
> + .allow_write_mapping = !!writable,
> + };
> + pfn = __kvm_follow_pfn(&foll);
> + if (writable)
> + *writable = foll.writable;
> + return pfn;
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_prot);
>
> kvm_pfn_t gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, false, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot);
>
> kvm_pfn_t gfn_to_pfn_memslot_atomic(const struct kvm_memory_slot *slot, gfn_t gfn)
> {
> - return __gfn_to_pfn_memslot(slot, gfn, true, false, NULL, true,
> - NULL, NULL);
> + struct kvm_follow_pfn foll = {
> + .slot = slot,
> + .gfn = gfn,
> + .flags = FOLL_WRITE,
> + .atomic = true,
> + };
> + return __kvm_follow_pfn(&foll);
> }
> EXPORT_SYMBOL_GPL(gfn_to_pfn_memslot_atomic);
>
> diff --git a/virt/kvm/kvm_mm.h b/virt/kvm/kvm_mm.h
> index 180f1a09e6ba..ed896aee5396 100644
> --- a/virt/kvm/kvm_mm.h
> +++ b/virt/kvm/kvm_mm.h
> @@ -20,8 +20,7 @@
> #define KVM_MMU_UNLOCK(kvm) spin_unlock(&(kvm)->mmu_lock)
> #endif /* KVM_HAVE_MMU_RWLOCK */
>
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable);
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll);
>
> #ifdef CONFIG_HAVE_KVM_PFNCACHE
> void gfn_to_pfn_cache_invalidate_start(struct kvm *kvm,
> diff --git a/virt/kvm/pfncache.c b/virt/kvm/pfncache.c
> index 2d6aba677830..e3fefa753a51 100644
> --- a/virt/kvm/pfncache.c
> +++ b/virt/kvm/pfncache.c
> @@ -144,6 +144,12 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> kvm_pfn_t new_pfn = KVM_PFN_ERR_FAULT;
> void *new_khva = NULL;
> unsigned long mmu_seq;
> + struct kvm_follow_pfn foll = {
> + .slot = gpc->memslot,
> + .gfn = gpa_to_gfn(gpc->gpa),
> + .flags = FOLL_WRITE,
> + .hva = gpc->uhva,
> + };
>
> lockdep_assert_held(&gpc->refresh_lock);
>
> @@ -183,7 +189,7 @@ static kvm_pfn_t hva_to_pfn_retry(struct gfn_to_pfn_cache *gpc)
> }
>
> /* We always request a writeable mapping */
> - new_pfn = hva_to_pfn(gpc->uhva, false, false, NULL, true, NULL);
> + new_pfn = hva_to_pfn(&foll);
> if (is_error_noslot_pfn(new_pfn))
> goto out_error;
>
> --
> 2.41.0.255.g8b1d071c50-goog
>

--
Isaku Yamahata <[email protected]>

2023-07-06 02:15:02

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900,
David Stevens <[email protected]> wrote:

> diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> index cf2c6426a6fc..46c681dc45e6 100644
> --- a/arch/x86/kvm/mmu/spte.c
> +++ b/arch/x86/kvm/mmu/spte.c
> @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> const struct kvm_memory_slot *slot,
> unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> u64 old_spte, bool prefetch, bool can_unsync,
> - bool host_writable, u64 *new_spte)
> + bool host_writable, bool is_refcounted, u64 *new_spte)
> {
> int level = sp->role.level;
> u64 spte = SPTE_MMU_PRESENT_MASK;
> @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
>
> if (level > PG_LEVEL_4K)
> spte |= PT_PAGE_SIZE_MASK;
> + else if (is_refcounted)
> + spte |= SPTE_MMU_PAGE_REFCOUNTED;

Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have
FOLL_GET? or can we set the bit for large page?


>
> if (shadow_memtype_mask)
> spte |= static_call(kvm_x86_get_mt_mask)(vcpu, gfn,

--
Isaku Yamahata <[email protected]>

2023-07-06 05:03:22

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > memory into the guest that is backed by un-refcounted struct pages - for
> > example, higher order non-compound pages allocated by the amdgpu driver
> > via ttm_pool_alloc_page.
>
> I guess you mean the tail pages of the higher order non-compound pages?
> And as to the head page, it is said to be set to one coincidentally[*],
> and shall not be considered as refcounted. IIUC, refcount of this head
> page will be increased and decreased soon in hva_to_pfn_remapped(), so
> this may not be a problem(?). But treating this head page differently,
> as a refcounted one(e.g., to set the A/D flags), is weired.
>
> Or maybe I missed some context, e.g., can the head page be allocted to
> guest at all?

Yes, this is to allow mapping the tail pages of higher order
non-compound pages - I should have been more precise in my wording.
The head pages can already be mapped into the guest.

Treating the head and tail pages the same would require changing how
KVM behaves in a situation it supports today (rather than just adding
support for an unsupported situation). Currently, without this series,
KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
guest. When that happens, KVM sets the A/D flags. I'm not sure whether
that's actually valid behavior, nor do I know whether anyone actually
cares about it. But it's what KVM does today, and I would shy away
from modifying that behavior without good reason.

> >
> > The bulk of this change is tracking the is_refcounted_page flag so that
> > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > done by storing the flag in an unused bit in the sptes.
>
> Also, maybe we should mention this only works on x86-64.
>
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > arch/x86/kvm/mmu/spte.c | 4 ++-
> > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > 6 files changed, 62 insertions(+), 30 deletions(-)
> >
> > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > index e44ab512c3a1..b1607e314497 100644
> > --- a/arch/x86/kvm/mmu/mmu.c
> > +++ b/arch/x86/kvm/mmu/mmu.c
>
> ...
>
> > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > bool host_writable = !fault || fault->map_writable;
> > bool prefetch = !fault || fault->prefetch;
> > bool write_fault = fault && fault->write;
> > + bool is_refcounted = !fault || fault->is_refcounted_page;
>
> Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> practice?

Prefetching is still done via gfn_to_page_many_atomic, which sets
FOLL_GET. That's fixable, but it's not something this series currently
does.
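
For reference, the prefetch path in question looks roughly like this
(paraphrased from direct_pte_prefetch_many(), not the exact kernel code):

	struct page *pages[PTE_PREFETCH_NUM];
	int ret;

	/*
	 * gfn_to_page_many_atomic() takes a reference on each page it
	 * returns, so only refcounted pages can reach the prefetch path
	 * today.
	 */
	ret = gfn_to_page_many_atomic(slot, gfn, pages, end - start);
	if (ret <= 0)
		return -1;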

> ...
> >
> > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > */
> > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > {
> > - bool host_writable;
> > + bool host_writable, is_refcounted;
> > gpa_t first_pte_gpa;
> > u64 *sptep, spte;
> > struct kvm_memory_slot *slot;
> > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > sptep = &sp->spt[i];
> > spte = *sptep;
> > host_writable = spte & shadow_host_writable_mask;
> > + // TODO: is this correct?
> > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > make_spte(vcpu, sp, slot, pte_access, gfn,
> > spte_to_pfn(spte), spte, true, false,
> > - host_writable, &spte);
> > + host_writable, is_refcounted, &spte);
>
> Could we restrict that a non-refcounted page shall not be used as shadow page?

I'm not very familiar with the shadow mmu, so my response might not
make sense. But do you mean not allowing non-refcounted pages as the
guest page tables shadowed by a kvm_mmu_page? It would probably be
possible to do that, and I doubt anyone would care about the
restriction. But as far as I can tell, the guest page table is only
accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
pages just fine.

-David

2023-07-06 05:36:00

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 6, 2023 at 11:10 AM Isaku Yamahata <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900,
> David Stevens <[email protected]> wrote:
>
> > diff --git a/arch/x86/kvm/mmu/spte.c b/arch/x86/kvm/mmu/spte.c
> > index cf2c6426a6fc..46c681dc45e6 100644
> > --- a/arch/x86/kvm/mmu/spte.c
> > +++ b/arch/x86/kvm/mmu/spte.c
> > @@ -138,7 +138,7 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> > const struct kvm_memory_slot *slot,
> > unsigned int pte_access, gfn_t gfn, kvm_pfn_t pfn,
> > u64 old_spte, bool prefetch, bool can_unsync,
> > - bool host_writable, u64 *new_spte)
> > + bool host_writable, bool is_refcounted, u64 *new_spte)
> > {
> > int level = sp->role.level;
> > u64 spte = SPTE_MMU_PRESENT_MASK;
> > @@ -188,6 +188,8 @@ bool make_spte(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp,
> >
> > if (level > PG_LEVEL_4K)
> > spte |= PT_PAGE_SIZE_MASK;
> > + else if (is_refcounted)
> > + spte |= SPTE_MMU_PAGE_REFCOUNTED;
>
> Is REFCOUNTED for 4K page only? What guarantees that large page doesn't have
> FOLL_GET? or can we set the bit for large page?

Oh, you're right, it should apply to >4K pages as well. This was based
on stale thinking from earlier versions of this series.

-David

2023-07-06 05:36:36

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
>
> On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > >
> > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > * The slow path to get the pfn of the specified host virtual address,
> > > > * 1 indicates success, -errno is returned if error is detected.
> > > > */
> > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > {
> > > > - unsigned int flags = FOLL_HWPOISON;
> > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > struct page *page;
> > > > int npages;
> > > >
> > > > might_sleep();
> > > >
> > > > - if (writable)
> > > > - *writable = write_fault;
> > > > -
> > > > - if (write_fault)
> > > > - flags |= FOLL_WRITE;
> > > > - if (async)
> > > > - flags |= FOLL_NOWAIT;
> > > > - if (interruptible)
> > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > -
> > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > if (npages != 1)
> > > > return npages;
> > > >
> > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > +
> > > > /* map read fault as writable if possible */
> > > > - if (unlikely(!write_fault) && writable) {
> > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > >
> > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> >
> > The two statements are logically equivalent, although I guess using
> > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > verbose.
>
> Well, as the comment says, we wanna try to map the read fault as writable
> whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> for write faults. So I guess using !foll->writable will not allow this.
> Did I miss anything?

We just set the foll->writable out parameter to be equal to
((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
!(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
b.
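
Or, restating the same thing as code (illustration only):

	bool a = foll->flags & FOLL_WRITE;	/* this is a write fault */
	bool b = foll->allow_write_mapping;

	foll->writable = a && b;
	/*
	 * So the later "map read fault as writable" check,
	 * (!foll->writable && b), reduces to (!a && b): a read fault whose
	 * caller allows upgrading to a writable mapping.
	 */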

-David

2023-07-06 06:13:32

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 6, 2023 at 10:34 AM Isaku Yamahata <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:47PM +0900,
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > where possible, combines the boolean arguments into a single flags
> > argument.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 16 ++++
> > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > virt/kvm/kvm_mm.h | 3 +-
> > virt/kvm/pfncache.c | 8 +-
> > 4 files changed, 122 insertions(+), 76 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index 9d3ac7720da9..ef2763c2b12e 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -97,6 +97,7 @@
> > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> >
> > /*
> > * error pfns indicate that the gfn is in slot but faild to
> > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +struct kvm_follow_pfn {
> > + const struct kvm_memory_slot *slot;
> > + gfn_t gfn;
> > + unsigned int flags;
> > + bool atomic;
> > + /* Allow a read fault to create a writeable mapping. */
> > + bool allow_write_mapping;
>
> Maybe, make them const for input arguments?

Unfortunately using const isn't straightforward as long as the kernel
continues to use -Wdeclaration-after-statement. If these fields were
const, then they would need to be specified in the initializer when
declaring the variable, but that's not necessarily always possible.
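
For example (an illustrative snippet, not from the series), a const member
would have to be known at the point of declaration:

	struct kvm_follow_pfn foll = {
		.slot = slot,
		/* if .gfn were const, its value would have to go here */
	};

	/* ... code that computes the gfn ... */

	foll.gfn = gfn;	/* error: assignment of read-only member 'gfn' */

and with -Wdeclaration-after-statement the declaration can't simply be
moved down past that computation.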

-David

2023-07-06 06:46:44

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Wed, Jul 5, 2023 at 8:56 PM Yu Zhang <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:48PM +0900, David Stevens wrote:
> > From: David Stevens <[email protected]>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 10 ++++++
> > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > virt/kvm/pfncache.c | 2 +-
> > 3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> > struct kvm_follow_pfn {
> > const struct kvm_memory_slot *slot;
> > gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > bool atomic;
> > /* Allow a read fault to create a writeable mapping. */
> > bool allow_write_mapping;
> > + /*
> > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> > + * be true if FOLL_GET is not set.
> > + */
> > + bool guarded_by_mmu_notifier;
>
> And how? Any place to check the invalidate seq?

kvm_follow_pfn can't meaningfully validate the seq number, since the
mmu notifier locking is handled by the caller. This is more of a
sanity check that the API is being used properly, as proposed here
[1]. I did deviate from the proposal with a bool instead of some type
of integer, since the exact value of mmu_seq wouldn't be useful.

[1] https://lore.kernel.org/all/[email protected]/#t
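
To make the division of labor concrete, the expected calling pattern is
roughly the following sketch (borrowing the existing x86 fault path's
names; details elided):

	mmu_seq = vcpu->kvm->mmu_invalidate_seq;
	smp_rmb();

	foll.guarded_by_mmu_notifier = true;	/* caller promises the check below */
	pfn = __kvm_follow_pfn(&foll);

	write_lock(&vcpu->kvm->mmu_lock);
	if (mmu_invalidate_retry_hva(vcpu->kvm, mmu_seq, foll.hva))
		goto retry;	/* raced with an invalidation */
	/* ... install the spte while holding mmu_lock ... */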

-David

2023-07-06 07:13:28

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
>
> On Tue, 4 Jul 2023 16:50:48 +0900
> David Stevens <[email protected]> wrote:
>
> > From: David Stevens <[email protected]>
> >
> > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > callers to resolve a gfn when the associated pfn has a valid struct page
> > that isn't being actively refcounted (e.g. tail pages of non-compound
> > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > the returned pfn must be guarded by a mmu notifier.
> >
> > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > is set when the returned pfn has an associated struct page with a valid
> > refcount. Callers that don't pass FOLL_GET should remember this value
> > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > non-zero refcount.
> >
> > Signed-off-by: David Stevens <[email protected]>
> > ---
> > include/linux/kvm_host.h | 10 ++++++
> > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > virt/kvm/pfncache.c | 2 +-
> > 3 files changed, 47 insertions(+), 32 deletions(-)
> >
> > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > index ef2763c2b12e..a45308c7d2d9 100644
> > --- a/include/linux/kvm_host.h
> > +++ b/include/linux/kvm_host.h
> > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > void kvm_release_page_clean(struct page *page);
> > void kvm_release_page_dirty(struct page *page);
> >
> > +void kvm_set_page_accessed(struct page *page);
> > +void kvm_set_page_dirty(struct page *page);
> > +
> > struct kvm_follow_pfn {
> > const struct kvm_memory_slot *slot;
> > gfn_t gfn;
> > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > bool atomic;
> > /* Allow a read fault to create a writeable mapping. */
> > bool allow_write_mapping;
> > + /*
> > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> ^guarded
> > + * be true if FOLL_GET is not set.
> > + */
> > + bool guarded_by_mmu_notifier;
> >
> It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> caller of __kvm_follow_pfn()?

Yes, this is the case.

> If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> temporary solution. I don't think it is a good idea to play tricks with
> a temporary solution, more like we are abusing the toleration.

I'm not sure I understand what you're getting at. This series never
calls gup without FOLL_GET.

This series aims to provide kvm_follow_pfn as a unified API on top of
gup+follow_pte. Since one of the major clients of this API uses an mmu
notifier, it makes sense to support returning a pfn without taking a
reference. And we indeed need to do that for certain types of memory.

> Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> indicate a tail page?

What do you mean by "indicate a tail page"? Do you mean indicate
that the returned pfn refers to a non-refcounted page? That's specified
by is_refcounted_page.

-David

2023-07-06 07:39:50

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023 at 01:52:08PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one(e.g., to set the A/D flags), is weired.
> >
> > Or maybe I missed some context, e.g., can the head page be allocted to
> > guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
>
> Treating the head and tail pages would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.

I know the A/D status of the refcounted, VM_PFNMAP|VM_IO backed pages
will be recorded. And I have no idea if this is a necessary requirement
either.

But it feels awkward to see the head and tail pages of non-compound
allocations being treated inconsistently. After all, the head page just
happens to have a refcount of 1; it is not a real refcounted page.

So I would suggest at least mentioning this difference in behavior in
the commit message. :)

> > >
> > > @@ -883,7 +884,7 @@ static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, struct kvm_mmu *mmu,
> > > */
> > > static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int i)
> > > {
> > > - bool host_writable;
> > > + bool host_writable, is_refcounted;
> > > gpa_t first_pte_gpa;
> > > u64 *sptep, spte;
> > > struct kvm_memory_slot *slot;
> > > @@ -940,10 +941,12 @@ static int FNAME(sync_spte)(struct kvm_vcpu *vcpu, struct kvm_mmu_page *sp, int
> > > sptep = &sp->spt[i];
> > > spte = *sptep;
> > > host_writable = spte & shadow_host_writable_mask;
> > > + // TODO: is this correct?
> > > + is_refcounted = spte & SPTE_MMU_PAGE_REFCOUNTED;
> > > slot = kvm_vcpu_gfn_to_memslot(vcpu, gfn);
> > > make_spte(vcpu, sp, slot, pte_access, gfn,
> > > spte_to_pfn(spte), spte, true, false,
> > > - host_writable, &spte);
> > > + host_writable, is_refcounted, &spte);
> >
> > Could we restrict that a non-refcounted page shall not be used as shadow page?
>
> I'm not very familiar with the shadow mmu, so my response might not
> make sense. But do you mean not allowing non-refcoutned pages as the
> guest page tables shadowed by a kvm_mmu_page? It would probably be
> possible to do that, and I doubt anyone would care about the
> restriction. But as far as I can tell, the guest page table is only
> accessed via kvm_vcpu_read_guest_atomic, which handles non-refcounted
> pages just fine.

Sorry, my brain just got baked... Pls just ignore this question :)

B.R.
Yu

2023-07-06 15:18:43

by Yu Zhang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
> >
> > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > > >
> > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > > * The slow path to get the pfn of the specified host virtual address,
> > > > > * 1 indicates success, -errno is returned if error is detected.
> > > > > */
> > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > > {
> > > > > - unsigned int flags = FOLL_HWPOISON;
> > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > > struct page *page;
> > > > > int npages;
> > > > >
> > > > > might_sleep();
> > > > >
> > > > > - if (writable)
> > > > > - *writable = write_fault;
> > > > > -
> > > > > - if (write_fault)
> > > > > - flags |= FOLL_WRITE;
> > > > > - if (async)
> > > > > - flags |= FOLL_NOWAIT;
> > > > > - if (interruptible)
> > > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > > -
> > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > > if (npages != 1)
> > > > > return npages;
> > > > >
> > > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > +
> > > > > /* map read fault as writable if possible */
> > > > > - if (unlikely(!write_fault) && writable) {
> > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > >
> > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > >
> > > The two statements are logically equivalent, although I guess using
> > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > verbose.
> >
> > Well, as the comment says, we wanna try to map the read fault as writable
> > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > for write faults. So I guess using !foll->writable will not allow this.
> > Did I miss anything?
>
> We just set the foll->writable out parameter to be equal to
> ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> b.

Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
my head while calculating... Thanks! :)

B.R.
Yu

2023-07-06 16:29:28

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023 at 01:52:08PM +0900,
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one(e.g., to set the A/D flags), is weired.
> >
> > Or maybe I missed some context, e.g., can the head page be allocted to
> > guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.
>
> Treating the head and tail pages would require changing how KVM
> behaves in a situation it supports today (rather than just adding
> support for an unsupported situation). Currently, without this series,
> KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> that's actually valid behavior, nor do I know whether anyone actually
> cares about it. But it's what KVM does today, and I would shy away
> from modifying that behavior without good reason.
>
> > >
> > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > done by storing the flag in an unused bit in the sptes.
> >
> > Also, maybe we should mention this only works on x86-64.
> >
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > > arch/x86/kvm/mmu/spte.c | 4 ++-
> > > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > > 6 files changed, 62 insertions(+), 30 deletions(-)
> > >
> > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > index e44ab512c3a1..b1607e314497 100644
> > > --- a/arch/x86/kvm/mmu/mmu.c
> > > +++ b/arch/x86/kvm/mmu/mmu.c
> >
> > ...
> >
> > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > bool host_writable = !fault || fault->map_writable;
> > > bool prefetch = !fault || fault->prefetch;
> > > bool write_fault = fault && fault->write;
> > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> >
> > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > practice?
>
> Prefetching is still done via gfn_to_page_many_atomic, which sets
> FOLL_GET. That's fixable, but it's not something this series currently
> does.

So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
whether the corresponding page is ref-countable or not, right?

Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
the issue, though.

Thanks,
--
Isaku Yamahata <[email protected]>

2023-07-07 02:48:24

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Fri, Jul 7, 2023 at 12:58 AM Isaku Yamahata <[email protected]> wrote:
>
> On Thu, Jul 06, 2023 at 01:52:08PM +0900,
> David Stevens <[email protected]> wrote:
>
> > On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> > >
> > > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > > From: David Stevens <[email protected]>
> > > >
> > > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > > memory into the guest that is backed by un-refcounted struct pages - for
> > > > example, higher order non-compound pages allocated by the amdgpu driver
> > > > via ttm_pool_alloc_page.
> > >
> > > I guess you mean the tail pages of the higher order non-compound pages?
> > > And as to the head page, it is said to be set to one coincidentally[*],
> > > and shall not be considered as refcounted. IIUC, refcount of this head
> > > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > > this may not be a problem(?). But treating this head page differently,
> > > as a refcounted one(e.g., to set the A/D flags), is weired.
> > >
> > > Or maybe I missed some context, e.g., can the head page be allocted to
> > > guest at all?
> >
> > Yes, this is to allow mapping the tail pages of higher order
> > non-compound pages - I should have been more precise in my wording.
> > The head pages can already be mapped into the guest.
> >
> > Treating the head and tail pages would require changing how KVM
> > behaves in a situation it supports today (rather than just adding
> > support for an unsupported situation). Currently, without this series,
> > KVM can map VM_PFNMAP|VM_IO memory backed by refcounted pages into the
> > guest. When that happens, KVM sets the A/D flags. I'm not sure whether
> > that's actually valid behavior, nor do I know whether anyone actually
> > cares about it. But it's what KVM does today, and I would shy away
> > from modifying that behavior without good reason.
> >
> > > >
> > > > The bulk of this change is tracking the is_refcounted_page flag so that
> > > > non-refcounted pages don't trigger page_count() == 0 warnings. This is
> > > > done by storing the flag in an unused bit in the sptes.
> > >
> > > Also, maybe we should mention this only works on x86-64.
> > >
> > > >
> > > > Signed-off-by: David Stevens <[email protected]>
> > > > ---
> > > > arch/x86/kvm/mmu/mmu.c | 44 +++++++++++++++++++++------------
> > > > arch/x86/kvm/mmu/mmu_internal.h | 1 +
> > > > arch/x86/kvm/mmu/paging_tmpl.h | 9 ++++---
> > > > arch/x86/kvm/mmu/spte.c | 4 ++-
> > > > arch/x86/kvm/mmu/spte.h | 12 ++++++++-
> > > > arch/x86/kvm/mmu/tdp_mmu.c | 22 ++++++++++-------
> > > > 6 files changed, 62 insertions(+), 30 deletions(-)
> > > >
> > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > index e44ab512c3a1..b1607e314497 100644
> > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > >
> > > ...
> > >
> > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > bool host_writable = !fault || fault->map_writable;
> > > > bool prefetch = !fault || fault->prefetch;
> > > > bool write_fault = fault && fault->write;
> > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > >
> > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > practice?
> >
> > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > FOLL_GET. That's fixable, but it's not something this series currently
> > does.
>
> So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> whether the corresponding page is ref-countable or not, right?
>
> Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> the issue, though.
>

direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
fault parameter, so is_refcounted will evaluate to true. So the spte's
refcounted bit will get set in that case.

-David

2023-07-10 17:02:57

by Isaku Yamahata

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Fri, Jul 07, 2023 at 10:35:02AM +0900,
David Stevens <[email protected]> wrote:

> > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > >
> > > > ...
> > > >
> > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > > bool host_writable = !fault || fault->map_writable;
> > > > > bool prefetch = !fault || fault->prefetch;
> > > > > bool write_fault = fault && fault->write;
> > > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > > >
> > > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > > practice?
> > >
> > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > FOLL_GET. That's fixable, but it's not something this series currently
> > > does.
> >
> > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > whether the corresponding page is ref-countable or not, right?
> >
> > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > the issue, though.
> >
>
> direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> fault parameter, so is_refcounted will evaluate to true. So the spte's
> refcounted bit will get set in that case.

Oops, my bad. My point is the "unconditionally" part. Is the bit always
set, even for non-refcountable pages? Or are non-refcountable pages
simply not prefetched?
--
Isaku Yamahata <[email protected]>

2023-07-11 03:27:24

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 11, 2023 at 1:34 AM Isaku Yamahata <[email protected]> wrote:
>
> On Fri, Jul 07, 2023 at 10:35:02AM +0900,
> David Stevens <[email protected]> wrote:
>
> > > > > > diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
> > > > > > index e44ab512c3a1..b1607e314497 100644
> > > > > > --- a/arch/x86/kvm/mmu/mmu.c
> > > > > > +++ b/arch/x86/kvm/mmu/mmu.c
> > > > >
> > > > > ...
> > > > >
> > > > > > @@ -2937,6 +2943,7 @@ static int mmu_set_spte(struct kvm_vcpu *vcpu, struct kvm_memory_slot *slot,
> > > > > > bool host_writable = !fault || fault->map_writable;
> > > > > > bool prefetch = !fault || fault->prefetch;
> > > > > > bool write_fault = fault && fault->write;
> > > > > > + bool is_refcounted = !fault || fault->is_refcounted_page;
> > > > >
> > > > > Just wonder, what if a non-refcounted page is prefetched? Or is it possible in
> > > > > practice?
> > > >
> > > > Prefetching is still done via gfn_to_page_many_atomic, which sets
> > > > FOLL_GET. That's fixable, but it's not something this series currently
> > > > does.
> > >
> > > So if we prefetch a page, REFCOUNTED bit is cleared unconditionally with this
> > > hunk. kvm_set_page_{dirty, accessed} won't be called as expected for prefetched
> > > spte. If I read the patch correctly, REFCOUNTED bit in SPTE should represent
> > > whether the corresponding page is ref-countable or not, right?
> > >
> > > Because direct_pte_prefetch_many() is for legacy KVM MMU and FNAME(prefetch_pte)
> > > is shadow paging, we need to test it with legacy KVM MMU or shadow paging to hit
> > > the issue, though.
> > >
> >
> > direct_pte_prefetch_many and prefetch_gpte both pass NULL for the
> > fault parameter, so is_refcounted will evaluate to true. So the spte's
> > refcounted bit will get set in that case.
>
> Oops, my bad. My point is "unconditionally". Is the bit always set for
> non-refcountable pages? Or non-refcountable pages are not prefeched?

The bit is never set for non-refcounted pages, and is always set for
refcounted pages. The current series never prefetches non-refcounted
pages, since it continues to use the gfn_to_page_many_atomic API.

-David

2023-07-11 17:46:54

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Wed, 5 Jul 2023 18:08:17 +0900
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 5:47 PM Zhi Wang <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 16:50:47 +0900
> > David Stevens <[email protected]> wrote:
> >
> > > From: David Stevens <[email protected]>
> > >
> > > Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> > > __kvm_follow_pfn refactors the old API's arguments into a struct and,
> > > where possible, combines the boolean arguments into a single flags
> > > argument.
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 16 ++++
> > > virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> > > virt/kvm/kvm_mm.h | 3 +-
> > > virt/kvm/pfncache.c | 8 +-
> > > 4 files changed, 122 insertions(+), 76 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index 9d3ac7720da9..ef2763c2b12e 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -97,6 +97,7 @@
> > > #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> > > #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> > > #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> > > +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)
> > >
> > > /*
> > > * error pfns indicate that the gfn is in slot but faild to
> > > @@ -1156,6 +1157,21 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > > void kvm_release_page_clean(struct page *page);
> > > void kvm_release_page_dirty(struct page *page);
> > >
> > > +struct kvm_follow_pfn {
> > > + const struct kvm_memory_slot *slot;
> > > + gfn_t gfn;
> > > + unsigned int flags;
> > > + bool atomic;
> > > + /* Allow a read fault to create a writeable mapping. */
> > > + bool allow_write_mapping;
> > > +
> > > + /* Outputs of __kvm_follow_pfn */
> > > + hva_t hva;
> > > + bool writable;
> > > +};
> > > +
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll);
> > > +
> > > kvm_pfn_t gfn_to_pfn(struct kvm *kvm, gfn_t gfn);
> > > kvm_pfn_t gfn_to_pfn_prot(struct kvm *kvm, gfn_t gfn, bool write_fault,
> > > bool *writable);
> > > diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> > > index 371bd783ff2b..b13f22861d2f 100644
> > > --- a/virt/kvm/kvm_main.c
> > > +++ b/virt/kvm/kvm_main.c
> > > @@ -2486,24 +2486,22 @@ static inline int check_user_page_hwpoison(unsigned long addr)
> > > * true indicates success, otherwise false is returned. It's also the
> > > * only part that runs if we can in atomic context.
> > > */
> > > -static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > - bool *writable, kvm_pfn_t *pfn)
> > > +static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > struct page *page[1];
> > > + bool write_fault = foll->flags & FOLL_WRITE;
> > >
> > > /*
> > > * Fast pin a writable pfn only if it is a write fault request
> > > * or the caller allows to map a writable pfn for a read fault
> > > * request.
> > > */
> > > - if (!(write_fault || writable))
> > > + if (!(write_fault || foll->allow_write_mapping))
> > > return false;
> > >
> > > - if (get_user_page_fast_only(addr, FOLL_WRITE, page)) {
> > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
> > > *pfn = page_to_pfn(page[0]);
> > > -
> > > - if (writable)
> > > - *writable = true;
> > > + foll->writable = foll->allow_write_mapping;
> > > return true;
> > > }
> > >
> > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > * The slow path to get the pfn of the specified host virtual address,
> > > * 1 indicates success, -errno is returned if error is detected.
> > > */
> > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > {
> > > - unsigned int flags = FOLL_HWPOISON;
> > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > struct page *page;
> > > int npages;
> > >
> > > might_sleep();
> > >
> > > - if (writable)
> > > - *writable = write_fault;
> > > -
> > > - if (write_fault)
> > > - flags |= FOLL_WRITE;
> > > - if (async)
> > > - flags |= FOLL_NOWAIT;
> > > - if (interruptible)
> > > - flags |= FOLL_INTERRUPTIBLE;
> > > -
> > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > if (npages != 1)
> > > return npages;
> > >
> > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > +
> > > /* map read fault as writable if possible */
> > > - if (unlikely(!write_fault) && writable) {
> > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > struct page *wpage;
> > >
> > > - if (get_user_page_fast_only(addr, FOLL_WRITE, &wpage)) {
> > > - *writable = true;
> > > + if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
> > > + foll->writable = true;
> > > put_page(page);
> > > page = wpage;
> > > }
> > > @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> > > return get_page_unless_zero(page);
> > > }
> > >
> > > -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > - unsigned long addr, bool write_fault,
> > > - bool *writable, kvm_pfn_t *p_pfn)
> > > +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> > > + kvm_pfn_t *p_pfn)
> > > {
> > > kvm_pfn_t pfn;
> > > pte_t *ptep;
> > > spinlock_t *ptl;
> > > + bool write_fault = foll->flags & FOLL_WRITE;
> > > int r;
> > >
> > > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > > if (r) {
> > > /*
> > > * get_user_pages fails for VM_IO and VM_PFNMAP vmas and does
> > > * not call the fault handler, so do it here.
> > > */
> > > bool unlocked = false;
> > > - r = fixup_user_fault(current->mm, addr,
> > > + r = fixup_user_fault(current->mm, foll->hva,
> > > (write_fault ? FAULT_FLAG_WRITE : 0),
> > > &unlocked);
> > > if (unlocked)
> > > @@ -2596,7 +2585,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > if (r)
> > > return r;
> > >
> > > - r = follow_pte(vma->vm_mm, addr, &ptep, &ptl);
> > > + r = follow_pte(vma->vm_mm, foll->hva, &ptep, &ptl);
> > > if (r)
> > > return r;
> > > }
> > > @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > goto out;
> > > }
> > >
> > > - if (writable)
> > > - *writable = pte_write(*ptep);
> > > + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;
> > > pfn = pte_pfn(*ptep);
> > >
> > > /*
> > > @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> > > * 2): @write_fault = false && @writable, @writable will tell the caller
> > > * whether the mapping is writable.
> > > */
> > > -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > > - bool *async, bool write_fault, bool *writable)
> > > +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > struct vm_area_struct *vma;
> > > kvm_pfn_t pfn;
> > > int npages, r;
> > >
> > > /* we can do it either atomically or asynchronously, not both */
> > > - BUG_ON(atomic && async);
> > > + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
> > >
> > > - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> > > + if (hva_to_pfn_fast(foll, &pfn))
> > > return pfn;
> > >
> > > - if (atomic)
> > > + if (foll->atomic)
> > > return KVM_PFN_ERR_FAULT;
> > >
> > > - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> > > - writable, &pfn);
> > > + npages = hva_to_pfn_slow(foll, &pfn);
> > > if (npages == 1)
> > > return pfn;
> > > if (npages == -EINTR)
> > > @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> > >
> > > mmap_read_lock(current->mm);
> > > if (npages == -EHWPOISON ||
> > > - (!async && check_user_page_hwpoison(addr))) {
> > > + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
> > > pfn = KVM_PFN_ERR_HWPOISON;
> > > goto exit;
> > > }
> > >
> > > retry:
> > > - vma = vma_lookup(current->mm, addr);
> > > + vma = vma_lookup(current->mm, foll->hva);
> > >
> > > if (vma == NULL)
> > > pfn = KVM_PFN_ERR_FAULT;
> > > else if (vma->vm_flags & (VM_IO | VM_PFNMAP)) {
> > > - r = hva_to_pfn_remapped(vma, addr, write_fault, writable, &pfn);
> > > + r = hva_to_pfn_remapped(vma, foll, &pfn);
> > > if (r == -EAGAIN)
> > > goto retry;
> > > if (r < 0)
> > > pfn = KVM_PFN_ERR_FAULT;
> > > } else {
> > > - if (async && vma_is_valid(vma, write_fault))
> > > - *async = true;
> > > - pfn = KVM_PFN_ERR_FAULT;
> > > + if ((foll->flags & FOLL_NOWAIT) &&
> > > + vma_is_valid(vma, foll->flags & FOLL_WRITE))
> > > + pfn = KVM_PFN_ERR_NEEDS_IO;
> > > + else
> > > + pfn = KVM_PFN_ERR_FAULT;
> > > }
> > > exit:
> > > mmap_read_unlock(current->mm);
> > > return pfn;
> > > }
> > >
> > > -kvm_pfn_t __gfn_to_pfn_memslot(const struct kvm_memory_slot *slot, gfn_t gfn,
> > > - bool atomic, bool interruptible, bool *async,
> > > - bool write_fault, bool *writable, hva_t *hva)
> > > +kvm_pfn_t __kvm_follow_pfn(struct kvm_follow_pfn *foll)
> > > {
> > > - unsigned long addr = __gfn_to_hva_many(slot, gfn, NULL, write_fault);
> > > -
> > > - if (hva)
> > > - *hva = addr;
> > > + foll->hva = __gfn_to_hva_many(foll->slot, foll->gfn, NULL,
> > > + foll->flags & FOLL_WRITE);
> > >
> > > - if (addr == KVM_HVA_ERR_RO_BAD) {
> > > - if (writable)
> > > - *writable = false;
> > > + if (foll->hva == KVM_HVA_ERR_RO_BAD)
> > > return KVM_PFN_ERR_RO_FAULT;
> > > - }
> > >
> >
> > Can you explain why updating foll->writable = false (previously *writeable
> > = false) is omitted here?
> >
> > In the caller where the struct kvm_follow_pfn is initialized, e.g.
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot(), .writable is not initialized.
> > IIUC, they expect __kvm_follow_pfn() to update it and return .writable to
> > upper caller.
> >
> > As the one of the output, it would be better to initalize it either in the
> > caller or update it in __kvm_follow_pfn(). Or
> > __gfn_to_pfn_memslot()/gfn_to_pfn_prot() will return random data in the
> > stack to the caller via bool *writable. It doesn't sound nice.
>
> Entries omitted from an initializer are initialized to zero, so
> .writable does get initialized in all of the patches in this series
> via designated initializers. Although you're right that explicitly
> setting it to false is a good idea, in case someday someone adds a
> caller that doesn't use an initializer when declaring its
> kvm_follow_pfn.
>

Nice trick and nice to know that. :) Agreed on improving readability and
preventing a risk from the caller.
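
For completeness, a minimal illustration of the zero-initialization being
relied on here (hypothetical caller, not from the series):

	struct kvm_follow_pfn foll = {
		.slot = slot,
		.gfn = gfn,
		.flags = FOLL_WRITE,
	};
	/*
	 * Members not named in the initializer (.writable, .hva, ...) are
	 * zero-initialized, so foll.writable reliably starts out false.
	 */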

> -David


2023-07-11 18:10:38

by Zhi Wang

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Thu, 6 Jul 2023 15:49:39 +0900
David Stevens <[email protected]> wrote:

> On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
> >
> > On Tue, 4 Jul 2023 16:50:48 +0900
> > David Stevens <[email protected]> wrote:
> >
> > > From: David Stevens <[email protected]>
> > >
> > > Make it so that __kvm_follow_pfn does not imply FOLL_GET. This allows
> > > callers to resolve a gfn when the associated pfn has a valid struct page
> > > that isn't being actively refcounted (e.g. tail pages of non-compound
> > > higher order pages). For a caller to safely omit FOLL_GET, all usages of
> > > the returned pfn must be guarded by a mmu notifier.
> > >
> > > This also adds a is_refcounted_page out parameter to kvm_follow_pfn that
> > > is set when the returned pfn has an associated struct page with a valid
> > > refcount. Callers that don't pass FOLL_GET should remember this value
> > > and use it to avoid places like kvm_is_ad_tracked_page that assume a
> > > non-zero refcount.
> > >
> > > Signed-off-by: David Stevens <[email protected]>
> > > ---
> > > include/linux/kvm_host.h | 10 ++++++
> > > virt/kvm/kvm_main.c | 67 +++++++++++++++++++++-------------------
> > > virt/kvm/pfncache.c | 2 +-
> > > 3 files changed, 47 insertions(+), 32 deletions(-)
> > >
> > > diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> > > index ef2763c2b12e..a45308c7d2d9 100644
> > > --- a/include/linux/kvm_host.h
> > > +++ b/include/linux/kvm_host.h
> > > @@ -1157,6 +1157,9 @@ unsigned long gfn_to_hva_memslot_prot(struct kvm_memory_slot *slot, gfn_t gfn,
> > > void kvm_release_page_clean(struct page *page);
> > > void kvm_release_page_dirty(struct page *page);
> > >
> > > +void kvm_set_page_accessed(struct page *page);
> > > +void kvm_set_page_dirty(struct page *page);
> > > +
> > > struct kvm_follow_pfn {
> > > const struct kvm_memory_slot *slot;
> > > gfn_t gfn;
> > > @@ -1164,10 +1167,17 @@ struct kvm_follow_pfn {
> > > bool atomic;
> > > /* Allow a read fault to create a writeable mapping. */
> > > bool allow_write_mapping;
> > > + /*
> > > + * Usage of the returned pfn will be guared by a mmu notifier. Must
> > ^guarded
> > > + * be true if FOLL_GET is not set.
> > > + */
> > > + bool guarded_by_mmu_notifier;
> > >
> > It seems no one sets the guraded_by_mmu_notifier in this patch. Is
> > guarded_by_mmu_notifier always equal to !foll->FOLL_GET and set by the
> > caller of __kvm_follow_pfn()?
>
> Yes, this is the case.
>
> > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > temporary solution. I don't think it is a good idea to play tricks with
> > a temporary solution, more like we are abusing the toleration.
>
> I'm not sure I understand what you're getting at. This series never
> calls gup without FOLL_GET.
>
> This series aims to provide kvm_follow_pfn as a unified API on top of
> gup+follow_pte. Since one of the major clients of this API uses an mmu
> notifier, it makes sense to support returning a pfn without taking a
> reference. And we indeed need to do that for certain types of memory.
>

I don't have a problem with taking a pfn without taking a ref. I am
questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
taking a pfn without a ref is a good idea, when there is another flag
that actually expresses it.

I can understand that using FOLL_XXX in kvm_follow_pfn saves some
translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
requirements of GUP in a code path that is going to call GUP is
reasonable.

But using FOLL_XXX for purposes that are not related to the GUP call
really feels off. Those flags can change in the future because of GUP
requirements, and then people have to figure out what is actually
happening with FOLL_GET here, since it is not actually tied to GUP calls.


> > Is a flag like guarded_by_mmu_notifier (perhaps a better name) enough to
> > indicate a tail page?
>
> What do you mean by to indicate a tail page? Do you mean to indicate
> that the returned pfn refers to non-refcounted page? That's specified
> by is_refcounted_page.
>

I figured out the reason why I got confused.

+ * Otherwise, certain IO or PFNMAP mappings can be backed with valid
+ * struct pages but be allocated without refcounting e.g., tail pages of
+ * non-compound higher order allocations. If FOLL_GET is set and we
+ * increment such a refcount, then when that pfn is eventually passed to
+ * kvm_release_pfn_clean, its refcount would hit zero and be incorrectly
+ * freed. Therefore don't allow those pages here when FOLL_GET is set.
*/

The above statements only explain the wrong behavior, they don't explain the
expected behavior. It would be better to explain that for the
mmu-notifier-guarded case (!FOLL_GET), we put back the reference taken by GUP
(a possible wording is sketched below). The FOLL_GET handling really confused
me.

-	if (!kvm_try_get_pfn(pfn))
+	page = kvm_pfn_to_refcounted_page(pfn);
+	if (!page)
+		goto out;
+
+	if (get_page_unless_zero(page)) {
+		foll->is_refcounted_page = true;
+		if (!(foll->flags & FOLL_GET))
+			put_page(page);
+	} else if (foll->flags & FOLL_GET) {
 		r = -EFAULT;
+	}

> -David
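
As an illustration only (a sketch of wording, not the actual patch text), the
kind of clarified comment being asked for might read something like:

	/*
	 * Certain IO or PFNMAP mappings can be backed by struct pages that
	 * are not refcounted (e.g. tail pages of non-compound higher order
	 * allocations).  If get_page_unless_zero() fails, the page is one of
	 * those, and it may only be returned to callers that did not ask for
	 * a reference (!FOLL_GET) and that keep the pfn alive via an mmu
	 * notifier.  If it succeeds and the caller did not ask for a
	 * reference (!FOLL_GET), the reference just taken is dropped again
	 * immediately, since the caller relies on its mmu notifier instead.
	 */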


2023-07-11 22:58:55

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 3/8] KVM: Make __kvm_follow_pfn not imply FOLL_GET

On Tue, Jul 11, 2023, Zhi Wang wrote:
> On Thu, 6 Jul 2023 15:49:39 +0900
> David Stevens <[email protected]> wrote:
>
> > On Wed, Jul 5, 2023 at 10:19 PM Zhi Wang <[email protected]> wrote:
> > >
> > > On Tue, 4 Jul 2023 16:50:48 +0900
> > > David Stevens <[email protected]> wrote:
> > > If yes, do we have to use FOLL_GET to resolve GFN associated with a tail page?
> > > It seems gup can tolerate gup_flags without FOLL_GET, but it is more like a
> > > temporary solution. I don't think it is a good idea to play tricks with
> > > a temporary solution, more like we are abusing the toleration.
> >
> > I'm not sure I understand what you're getting at. This series never
> > calls gup without FOLL_GET.
> >
> > This series aims to provide kvm_follow_pfn as a unified API on top of
> > gup+follow_pte. Since one of the major clients of this API uses an mmu
> > notifier, it makes sense to support returning a pfn without taking a
> > reference. And we indeed need to do that for certain types of memory.
> >
>
> I don't have a problem with taking a pfn without taking a ref. I am
> questioning whether using !FOLL_GET in struct kvm_follow_pfn to indicate
> taking a pfn without a ref is a good idea, when there is another flag that
> actually expresses it.
>
> I can understand that using FOLL_XXX in kvm_follow_pfn saves some
> translation between struct kvm_follow_pfn.{write, async, xxxx} and GUP
> flags. However, FOLL_XXX is for GUP. Using FOLL_XXX to reflect the
> requirements of GUP in the code path that is going to call GUP is
> reasonable.
>
> But using FOLL_XXX for purposes that are not related to the GUP call really
> feels off.

I agree, assuming you're talking specifically about the logic in hva_to_pfn_remapped()
that handles non-refcounted pages, i.e. this

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (foll->flags & FOLL_GET) {
		r = -EFAULT;
	}

should be

	if (get_page_unless_zero(page)) {
		foll->is_refcounted_page = true;
		if (!(foll->flags & FOLL_GET))
			put_page(page);
	} else if (!foll->guarded_by_mmu_notifier)
		r = -EFAULT;

because it's not the desire to grab a reference that makes getting non-refcounted
pfns "safe", it's whether or not the caller is plugged into the MMU notifiers.

Though that highlights that checking guarded_by_mmu_notifier should be done for
*all* non-refcounted pfns, not just non-refcounted struct page memory.

As for the other usage of FOLL_GET in this series (using it to conditionally do
put_page()), IMO that's very much related to the GUP call. Invoking put_page()
is a hack to workaround the fact that GUP doesn't provide a way to get the pfn
without grabbing a reference to the page. In an ideal world, KVM would NOT pass
FOLL_GET to the various GUP helpers, i.e. FOLL_GET would be passed as-is and KVM
wouldn't "need" to kinda sorta overload FOLL_GET to manually drop the reference.

I do think it's worth providing a helper to consolidate and document that hacky
code, e.g. add a kvm_follow_refcounted_pfn() helper.

All in all, I think the below (completely untested) is what we want?

David (and others), I am planning on doing a full review of this series "soon",
but it will likely be a few weeks until that happens. I jumped in on this
specific thread because this caught my eye and I really don't want to throw out
*all* of the FOLL_GET usage.

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5b5afd70f239..90d424990e0a 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -2481,6 +2481,25 @@ static inline int check_user_page_hwpoison(unsigned long addr)
 	return rc == -EHWPOISON;
 }

+static kvm_pfn_t kvm_follow_refcounted_pfn(struct kvm_follow_pfn *foll,
+					   struct page *page)
+{
+	kvm_pfn_t pfn = page_to_pfn(page);
+
+	foll->is_refcounted_page = true;
+
+	/*
+	 * FIXME: Ideally, KVM wouldn't pass FOLL_GET to gup() when the caller
+	 * doesn't want to grab a reference, but gup() doesn't support getting
+	 * just the pfn, i.e. FOLL_GET is effectively mandatory. If that ever
+	 * changes, drop this and simply don't pass FOLL_GET to gup().
+	 */
+	if (!(foll->flags & FOLL_GET))
+		put_page(page);
+
+	return pfn;
+}
+
 /*
  * The fast path to get the writable pfn which will be stored in @pfn,
  * true indicates success, otherwise false is returned. It's also the
@@ -2500,11 +2519,9 @@ static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return false;

 	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
-		*pfn = page_to_pfn(page[0]);
 		foll->writable = foll->allow_write_mapping;
-		foll->is_refcounted_page = true;
-		if (!(foll->flags & FOLL_GET))
-			put_page(page[0]);
+
+		*pfn = kvm_follow_refcounted_pfn(foll, page[0]);
 		return true;
 	}

@@ -2528,7 +2545,6 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 		return npages;

 	foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
-	foll->is_refcounted_page = true;

 	/* map read fault as writable if possible */
 	if (unlikely(!foll->writable) && foll->allow_write_mapping) {
@@ -2540,9 +2556,8 @@ static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
 			page = wpage;
 		}
 	}
-	*pfn = page_to_pfn(page);
-	if (!(foll->flags & FOLL_GET))
-		put_page(page);
+
+	*pfn = kvm_follow_refcounted_pfn(foll, page);
 	return npages;
 }

@@ -2610,17 +2625,16 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn
 	if (!page)
 		goto out;

-	if (get_page_unless_zero(page)) {
-		foll->is_refcounted_page = true;
-		if (!(foll->flags & FOLL_GET))
-			put_page(page);
-	} else if (foll->flags & FOLL_GET) {
-		r = -EFAULT;
-	}
-
+	if (get_page_unless_zero(page))
+		WARN_ON_ONCE(kvm_follow_refcounted_pfn(foll, page) != pfn);
 out:
 	pte_unmap_unlock(ptep, ptl);
-	*p_pfn = pfn;
+
+	if (!foll->is_refcounted_page && !foll->guarded_by_mmu_notifier &&
+	    !allow_unsafe_mappings)
+		r = -EFAULT;
+	else
+		*p_pfn = pfn;

 	return r;
 }


2023-07-19 06:47:27

by Yan Zhao

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
>
> out_unlock:
> write_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
For a refcounted page, now that KVM puts its ref early in kvm_faultin_pfn(),
should this kvm_set_page_accessed() be placed before unlocking mmu_lock?

Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
with mmu_lock held for write) and releases the page, and those two steps
happen after the page_count() check in kvm_set_page_accessed() but before
mark_page_accessed(), the latter may mark as accessed a page that has already
been freed or that no longer belongs to the current process.

Is it true?

> return r;
> }
>
> @@ -4529,7 +4540,8 @@ static int kvm_tdp_mmu_page_fault(struct kvm_vcpu *vcpu,
>
> out_unlock:
> read_unlock(&vcpu->kvm->mmu_lock);
> - kvm_release_pfn_clean(fault->pfn);
> + if (fault->is_refcounted_page)
> + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> return r;
> }
Ditto.
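
For reference, the reordering being suggested here would look roughly like the
following (an untested sketch against the quoted hunks, assuming the
fault->is_refcounted_page field from this series):

out_unlock:
	/* mark the page accessed while mmu_lock is still held */
	if (fault->is_refcounted_page)
		kvm_set_page_accessed(pfn_to_page(fault->pfn));
	write_unlock(&vcpu->kvm->mmu_lock);
	return r;

and analogously with read_unlock() in the kvm_tdp_mmu_page_fault() hunk.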

2023-07-19 07:39:33

by David Stevens

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Wed, Jul 19, 2023 at 3:35 PM Yan Zhao <[email protected]> wrote:
>
> On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > @@ -4451,7 +4461,8 @@ static int direct_page_fault(struct kvm_vcpu *vcpu, struct kvm_page_fault *fault
> >
> > out_unlock:
> > write_unlock(&vcpu->kvm->mmu_lock);
> > - kvm_release_pfn_clean(fault->pfn);
> > + if (fault->is_refcounted_page)
> > + kvm_set_page_accessed(pfn_to_page(fault->pfn));
> For a refcounted page, now that KVM puts its ref early in kvm_faultin_pfn(),
> should this kvm_set_page_accessed() be placed before unlocking mmu_lock?
>
> Otherwise, if the user unmaps a region (which triggers kvm_unmap_gfn_range()
> with mmu_lock held for write) and releases the page, and those two steps
> happen after the page_count() check in kvm_set_page_accessed() but before
> mark_page_accessed(), the latter may mark as accessed a page that has already
> been freed or that no longer belongs to the current process.
>
> Is it true?

Yes, good catch. During some testing last week, I actually found this
bug thanks to the WARN_ON the first patch in this series added to
kvm_is_ad_tracked_page. I'll fix it in the next revision, after Sean
gets a chance to comment on the series.

Thanks,
David

2023-08-04 23:28:01

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> Introduce __kvm_follow_pfn, which will replace __gfn_to_pfn_memslot.
> __kvm_follow_pfn refactors the old API's arguments into a struct and,
> where possible, combines the boolean arguments into a single flags
> argument.
>
> Signed-off-by: David Stevens <[email protected]>
> ---
> include/linux/kvm_host.h | 16 ++++
> virt/kvm/kvm_main.c | 171 ++++++++++++++++++++++-----------------
> virt/kvm/kvm_mm.h | 3 +-
> virt/kvm/pfncache.c | 8 +-
> 4 files changed, 122 insertions(+), 76 deletions(-)
>
> diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
> index 9d3ac7720da9..ef2763c2b12e 100644
> --- a/include/linux/kvm_host.h
> +++ b/include/linux/kvm_host.h
> @@ -97,6 +97,7 @@
> #define KVM_PFN_ERR_HWPOISON (KVM_PFN_ERR_MASK + 1)
> #define KVM_PFN_ERR_RO_FAULT (KVM_PFN_ERR_MASK + 2)
> #define KVM_PFN_ERR_SIGPENDING (KVM_PFN_ERR_MASK + 3)
> +#define KVM_PFN_ERR_NEEDS_IO (KVM_PFN_ERR_MASK + 4)

Hmm, ideally KVM_PFN_ERR_NEEDS_IO would be introduced in a separate prep patch,
e.g. by changing "bool *async" to "bool no_wait". At a glance, I can't tell if
that's feasible though, so consider it more of a "wish" than a request.
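
For illustration, the kind of prep change being wished for might look roughly
like the following (a hypothetical sketch, not part of this series), using
hva_to_pfn() as an example:

-kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
-		     bool *async, bool write_fault, bool *writable);
+kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
+		     bool no_wait, bool write_fault, bool *writable);

Callers that used to pass a non-NULL async pointer and check whether it was
set would instead pass no_wait = true and check for a KVM_PFN_ERR_NEEDS_IO
return value.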

> @@ -2572,23 +2561,23 @@ static int kvm_try_get_pfn(kvm_pfn_t pfn)
> return get_page_unless_zero(page);
> }
>
> -static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> - unsigned long addr, bool write_fault,
> - bool *writable, kvm_pfn_t *p_pfn)
> +static int hva_to_pfn_remapped(struct vm_area_struct *vma, struct kvm_follow_pfn *foll,
> + kvm_pfn_t *p_pfn)

Please wrap. KVM still honors the 80 char soft limit unless there's a reason not
to, and in this case it's already wrapping

static int hva_to_pfn_remapped(struct vm_area_struct *vma,
			       struct kvm_follow_pfn *foll, kvm_pfn_t *p_pfn)

> @@ -2606,8 +2595,7 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> goto out;
> }
>
> - if (writable)
> - *writable = pte_write(*ptep);
> + foll->writable = pte_write(*ptep) && foll->allow_write_mapping;

Similar to feedback in my other response, don't condition this on try_map_writable,
i.e. just do:

foll->writable = pte_write(...);

> pfn = pte_pfn(*ptep);
>
> /*
> @@ -2652,24 +2640,22 @@ static int hva_to_pfn_remapped(struct vm_area_struct *vma,
> * 2): @write_fault = false && @writable, @writable will tell the caller
> * whether the mapping is writable.
> */
> -kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
> - bool *async, bool write_fault, bool *writable)
> +kvm_pfn_t hva_to_pfn(struct kvm_follow_pfn *foll)
> {
> struct vm_area_struct *vma;
> kvm_pfn_t pfn;
> int npages, r;
>
> /* we can do it either atomically or asynchronously, not both */
> - BUG_ON(atomic && async);
> + BUG_ON(foll->atomic && (foll->flags & FOLL_NOWAIT));
>
> - if (hva_to_pfn_fast(addr, write_fault, writable, &pfn))
> + if (hva_to_pfn_fast(foll, &pfn))
> return pfn;
>
> - if (atomic)
> + if (foll->atomic)
> return KVM_PFN_ERR_FAULT;
>
> - npages = hva_to_pfn_slow(addr, async, write_fault, interruptible,
> - writable, &pfn);
> + npages = hva_to_pfn_slow(foll, &pfn);
> if (npages == 1)
> return pfn;
> if (npages == -EINTR)
> @@ -2677,83 +2663,122 @@ kvm_pfn_t hva_to_pfn(unsigned long addr, bool atomic, bool interruptible,
>
> mmap_read_lock(current->mm);
> if (npages == -EHWPOISON ||
> - (!async && check_user_page_hwpoison(addr))) {
> + (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {

Opportunistically align the indentation, as an added bonus that makes the line
length a few chars shorter, i.e.

	if (npages == -EHWPOISON ||
	    (!(foll->flags & FOLL_NOWAIT) && check_user_page_hwpoison(foll->hva))) {
		pfn = KVM_PFN_ERR_HWPOISON;
		goto exit;
	}

2023-08-04 23:35:21

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 0/8] KVM: allow mapping non-refcounted pages

On Tue, Jul 04, 2023, David Stevens wrote:
> From: David Stevens <[email protected]>
>
> This patch series adds support for mapping VM_IO and VM_PFNMAP memory
> that is backed by struct pages that aren't currently being refcounted
> (e.g. tail pages of non-compound higher order allocations) into the
> guest.

Apologies for the slow review; I'm done with feedback for this version.

FWIW, it's probably a bit too late to catch 6.6, especially since we need acks from
ARM and PPC, but 6.7 should be very doable unless someone outright objects.

Thanks for being persistent!

2023-08-04 23:46:46

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 2/8] KVM: Introduce __kvm_follow_pfn function

On Thu, Jul 06, 2023, Yu Zhang wrote:
> On Thu, Jul 06, 2023 at 02:29:24PM +0900, David Stevens wrote:
> > On Wed, Jul 5, 2023 at 7:53 PM Yu Zhang <[email protected]> wrote:
> > >
> > > On Wed, Jul 05, 2023 at 06:22:59PM +0900, David Stevens wrote:
> > > > On Wed, Jul 5, 2023 at 12:10 PM Yu Zhang <[email protected]> wrote:
> > > > >
> > > > > > @@ -2514,35 +2512,26 @@ static bool hva_to_pfn_fast(unsigned long addr, bool write_fault,
> > > > > > * The slow path to get the pfn of the specified host virtual address,
> > > > > > * 1 indicates success, -errno is returned if error is detected.
> > > > > > */
> > > > > > -static int hva_to_pfn_slow(unsigned long addr, bool *async, bool write_fault,
> > > > > > - bool interruptible, bool *writable, kvm_pfn_t *pfn)
> > > > > > +static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
> > > > > > {
> > > > > > - unsigned int flags = FOLL_HWPOISON;
> > > > > > + unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
> > > > > > struct page *page;
> > > > > > int npages;
> > > > > >
> > > > > > might_sleep();
> > > > > >
> > > > > > - if (writable)
> > > > > > - *writable = write_fault;
> > > > > > -
> > > > > > - if (write_fault)
> > > > > > - flags |= FOLL_WRITE;
> > > > > > - if (async)
> > > > > > - flags |= FOLL_NOWAIT;
> > > > > > - if (interruptible)
> > > > > > - flags |= FOLL_INTERRUPTIBLE;
> > > > > > -
> > > > > > - npages = get_user_pages_unlocked(addr, 1, &page, flags);
> > > > > > + npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
> > > > > > if (npages != 1)
> > > > > > return npages;
> > > > > >
> > > > > > + foll->writable = (foll->flags & FOLL_WRITE) && foll->allow_write_mapping;
> > > > > > +
> > > > > > /* map read fault as writable if possible */
> > > > > > - if (unlikely(!write_fault) && writable) {
> > > > > > + if (unlikely(!foll->writable) && foll->allow_write_mapping) {
> > > > >
> > > > > I guess !foll->writable should be !(foll->flags & FOLL_WRITE) here.
> > > >
> > > > The two statements are logically equivalent, although I guess using
> > > > !(foll->flags & FOLL_WRITE) may be a little clearer, if a little more
> > > > verbose.
> > >
> > > Well, as the comment says, we wanna try to map the read fault as writable
> > > whenever possible. And __gfn_to_pfn_memslot() will only set the FOLL_WRITE
> > > for write faults. So I guess using !foll->writable will not allow this.
> > > Did I miss anything?
> >
> > We just set the foll->writable out parameter to be equal to
> > ((foll->flags & FOLL_WRITE) && foll->allow_write_mapping). Taking a =
> > foll->flags & FOLL_WRITE and b = foll->allow_write_mapping, we have
> > !(a && b) && b -> (!a || !b) && b -> (!a && b) || (!b && b) -> !a &&
> > b.
>
> Ouch, my bad again... I typed "!foll->writable", but missed the "!" in
> my head while calculating... Thanks! :)

The code is funky and confusing though. Specifically, FOLL_WRITE without
allow_write_mapping is nonsensical, and yields the even more nonsensical output
of a successful FOLL_WRITE with foll->writable==%false.

It "works" because callers only consume foll->writable when foll->allow_write_mapping
is true, but relying on that is ugly and completely unnecessary. Similarly, the
"allow" terminology is misleading. FOLL_WRITE *always* allows writable mappings.

This wasn't as much of a problem in the previous code because the lower levels took
the pointer, i.e. avoided the "allow" terminology entirely.

So we should either keep that behavior, i.e. replace "bool allow_write_mapping"
with "bool *writable", or rename allow_write_mapping to something like
opportunistically_map_writable, and then unconditionally set foll->writable
whenever KVM obtains a writable mapping, i.e. regardless of whether the original
fault was a read or a write.

My vote is for the latter. If opportunistically_map_writable is too verbose,
try_map_writable would be another option. Hmm, I'll make "try_map_writable" my
official vote.

Ah, and I also vote to use an if-elif instead of unconditionally setting foll->writable.
That makes the relationship between FOLL_WRITE and try_map_writable a bit more
obvious IMO. E.g.

static bool hva_to_pfn_fast(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	struct page *page[1];

	/*
	 * Fast pin a writable pfn only if it is a write fault request
	 * or the caller allows to map a writable pfn for a read fault
	 * request.
	 */
	if (!((foll->flags & FOLL_WRITE) || foll->try_map_writable))
		return false;

	if (get_user_page_fast_only(foll->hva, FOLL_WRITE, page)) {
		*pfn = page_to_pfn(page[0]);
		foll->writable = true;
		return true;
	}

	return false;
}

/*
 * The slow path to get the pfn of the specified host virtual address,
 * 1 indicates success, -errno is returned if error is detected.
 */
static int hva_to_pfn_slow(struct kvm_follow_pfn *foll, kvm_pfn_t *pfn)
{
	unsigned int flags = FOLL_HWPOISON | FOLL_GET | foll->flags;
	struct page *page;
	int npages;

	might_sleep();

	npages = get_user_pages_unlocked(foll->hva, 1, &page, flags);
	if (npages != 1)
		return npages;

	if (foll->flags & FOLL_WRITE) {
		foll->writable = true;
	} else if (foll->try_map_writable) {
		struct page *wpage;

		/* map read fault as writable if possible */
		if (get_user_page_fast_only(foll->hva, FOLL_WRITE, &wpage)) {
			foll->writable = true;
			put_page(page);
			page = wpage;
		}
	}
	*pfn = page_to_pfn(page);
	return npages;
}


2023-08-05 01:15:00

by Sean Christopherson

[permalink] [raw]
Subject: Re: [PATCH v7 5/8] KVM: x86/mmu: Don't pass FOLL_GET to __kvm_follow_pfn

On Thu, Jul 06, 2023, David Stevens wrote:
> On Wed, Jul 5, 2023 at 7:17 PM Yu Zhang <[email protected]> wrote:
> >
> > On Tue, Jul 04, 2023 at 04:50:50PM +0900, David Stevens wrote:
> > > From: David Stevens <[email protected]>
> > >
> > > Stop passing FOLL_GET to __kvm_follow_pfn. This allows the host to map
> > > memory into the guest that is backed by un-refcounted struct pages - for
> > > example, higher order non-compound pages allocated by the amdgpu driver
> > > via ttm_pool_alloc_page.
> >
> > I guess you mean the tail pages of the higher order non-compound pages?
> > And as to the head page, it is said to be set to one coincidentally[*],
> > and shall not be considered as refcounted. IIUC, refcount of this head
> > page will be increased and decreased soon in hva_to_pfn_remapped(), so
> > this may not be a problem(?). But treating this head page differently,
> > as a refcounted one (e.g., to set the A/D flags), is weird.
> >
> > Or maybe I missed some context, e.g., can the head page be allocated to
> > the guest at all?
>
> Yes, this is to allow mapping the tail pages of higher order
> non-compound pages - I should have been more precise in my wording.
> The head pages can already be mapped into the guest.

Recording for posterity (or to make an incorrect statement and get corrected),
because I recently had a conversation about the head page not actually being
refcounted. (I can't remember with whom I had the conversation, but I'm pretty
sure it wasn't an imaginary friend).

Even though whatever allocates the page doesn't explicitly refcount the head
page, __free_pages() will still do the right thing and keep the head page
around until its last reference is put. And my understanding is that even
though it's a "head" page, it's not a PG_head page, i.e. not a compound page,
and so is treated as an order-0 page when KVM invokes put_page().

void __free_pages(struct page *page, unsigned int order)
{
	/* get PageHead before we drop reference */
	int head = PageHead(page);

	if (put_page_testzero(page))		<=== will evaluate false if KVM holds a ref
		free_the_page(page, order);
	else if (!head)				<=== will be false for non-compound pages
		while (order-- > 0)
			free_the_page(page + (1 << order), order);
}
EXPORT_SYMBOL(__free_pages);