From: Sean Christopherson
Date: Thu, 14 Mar 2024 09:17:13 -0700
Subject: Re: [PATCH v11 0/8] KVM: allow mapping non-refcounted pages
To: David Stevens
Cc: Paolo Bonzini, Yu Zhang, Isaku Yamahata, Zhi Wang, Maxim Levitsky,
    kvmarm@lists.linux.dev, linux-kernel@vger.kernel.org, kvm@vger.kernel.org
References: <0b109bc4-ee4c-4f13-996f-b89fbee09c0b@amd.com>
    <9e604f99-5b63-44d7-8476-00859dae1dc4@amd.com>
    <93df19f9-6dab-41fc-bbcd-b108e52ff50b@amd.com>

-Christ{oph,ian} to avoid creating more noise...

On Thu, Mar 14, 2024, David Stevens wrote:
> Because of that, the specific type of pfns that don't work right now
> are pfn_valid() && !PG_Reserved && !page_ref_count() - what I called
> the non-refcounted pages in a bad choice of words. If that's correct,
> then perhaps this series should go a little bit further in modifying
> hva_to_pfn_remapped, but it isn't fundamentally wrong.

Loosely related to all of this, I have a mildly ambitious idea. Well, one
mildly ambitious idea, and one crazy ambitious idea. Crazy ambitious idea
first...

Something we (GCE side of Google) have been eyeballing is adding support
for huge VM_PFNMAP memory, e.g. for mapping large amounts of device
(a.k.a. GPU) memory into guests using hugepages. One of the hiccups is
that follow_pte() doesn't play nice with hugepages, at all, e.g. it even
has a "VM_BUG_ON(pmd_trans_huge(*pmd))". Teaching follow_pte() to play
nice with hugepages is probably doable, but making sure all existing
users are aware, maybe not so much.

My first (half baked, crazy ambitious) idea is to move away from
follow_pte() and get_user_page_fast_only() for mmu_notifier-aware
lookups, i.e. lookups that don't need to grab references, and replace
them with a new converged API that locklessly walks host userspace page
tables and grabs the hugepage size along the way, e.g. so that arch code
wouldn't have to do a second walk of the page tables just to get the
hugepage size.
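To give that a concrete shape, here is a rough sketch of what such a
converged, mm-owned API could look like. Everything below is hypothetical
(the struct, the function name, and the parameters are invented purely
for illustration; nothing like this exists in the kernel today):

  /*
   * HYPOTHETICAL, mm-owned API: locklessly walk the userspace page
   * tables for @hva and return the PFN, the mapping size, and write
   * permission -- without taking a reference to the page.  Only safe
   * for callers that are integrated with mmu_notifiers, i.e. that will
   * drop the mapping when notified of invalidation.
   */
  struct lockless_pfn_info {
          unsigned long pfn;      /* PFN backing @hva */
          unsigned int order;     /* mapping order: 0 = 4KiB, 9 = 2MiB, ... */
          bool writable;          /* mapping permits writes */
  };

  int mm_follow_pfn_lockless(struct mm_struct *mm, unsigned long hva,
                             bool write_fault,
                             struct lockless_pfn_info *info);

The point of returning the order from the same call is that a single walk
would satisfy both the PFN lookup and the hugepage-size lookup that arch
code currently does separately.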
In other words, for the common case (mmu_notifier integration, no
reference needed), route hva_to_pfn_fast() into the new API and walk the
userspace page tables (probably only for write faults, to avoid CoW
complications) before doing anything else.

Uses of hva_to_pfn() that need to get a reference to the struct page
couldn't be converted, e.g. when stuffing physical addresses into the
VMCS for nested virtualization. But for everything else, grabbing a
reference is a non-goal, i.e. actually "getting" a user page is wasted
effort and actively gets in the way.

I was initially hoping we could go super simple and use something like
x86's host_pfn_mapping_level(), but there are too many edge cases in
gup() that need to be respected, e.g. to avoid mapping memfd_secret
pages into KVM guests. I.e. the API would need to be a formal mm-owned
thing, not some homebrewed KVM implementation.

I can't tell if the payoff would be big enough to justify the effort
involved, i.e. having a single unified API for grabbing PFNs from the
primary MMU might just be a pie-in-the-sky type idea.

My second, less ambitious idea: the previously linked LWN[*] article
about the writeback issues reminded me of something that has bugged me
for a long time. IIUC, getting a writable mapping from the primary MMU
marks the page/folio dirty, and that page/folio stays dirty until the
data is written back and the mapping is made read-only. And because KVM
is tapped into the mmu_notifiers, KVM will be notified *before* the
RW=>RO conversion completes, i.e. before the page/folio is marked clean.

I _think_ that means that calling kvm_set_page_dirty() when zapping a
SPTE (or dropping any mmu_notifier-aware mapping) is completely
unnecessary. If that is the case, _and_ we can weasel our way out of
calling kvm_set_page_accessed() too, then with FOLL_GET plumbed into
hva_to_pfn(), we can:

 - Drop kvm_{set,release}_pfn_{accessed,dirty}(), because all callers of
   hva_to_pfn() that aren't tied into mmu_notifiers, i.e. aren't
   guaranteed to drop mappings before the page/folio is cleaned, will
   *know* that they hold a refcounted struct page.

 - Skip "KVM: x86/mmu: Track if sptes refer to refcounted pages"
   entirely, because KVM never needs to know if a SPTE points at a
   refcounted page.

In other words, double down on immediately doing put_page() after gup()
if FOLL_GET isn't specified (sketched below, after the footnote), and
naturally make all KVM MMUs compatible with pfn_valid() PFNs that are
acquired by follow_pte().

I suspect we can simply mark pages as accessed when a page is retrieved
from the primary MMU, as marking a page accessed when it's *removed*
from the guest is rather nonsensical. E.g. if a page is mapped into the
guest for a long time and it gets swapped out, marking the page accessed
when KVM drops its SPTEs in response to the swap adds no value. And
through the mmu_notifiers, KVM already plays nice with setups that use
idle page tracking to make reclaim decisions.

[*] https://lwn.net/Articles/930667
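For illustration, the "immediately put_page() after gup()" idiom above
in minimal form. This is a sketch, not KVM's actual hva_to_pfn_fast();
the helper name and the want_ref parameter are made up, while
get_user_page_fast_only(), page_to_pfn(), and put_page() are existing
kernel APIs:

  /*
   * Sketch: grab the page, record the PFN, and immediately drop the
   * reference if the caller didn't ask to keep one.  Only safe when
   * the caller is integrated with mmu_notifiers and thus unmaps the
   * PFN before the primary MMU frees (or cleans) the page.
   */
  static bool sketch_hva_to_pfn(unsigned long hva, bool want_ref,
                                unsigned long *pfn)
  {
          struct page *page;

          /* Lockless fast path; bails if the page isn't present+writable. */
          if (!get_user_page_fast_only(hva, FOLL_WRITE, &page))
                  return false;

          *pfn = page_to_pfn(page);

          /* Drop the reference gup just took if the caller doesn't want it. */
          if (!want_ref)
                  put_page(page);
          return true;
  }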