From: Sean Christopherson
To: David Hildenbrand
Cc: David Matlack, Paolo Bonzini, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, David Stevens, Matthew Wilcox
Subject: Re: [RFC PATCH 0/4] KVM: x86/mmu: Rework marking folios dirty/accessed
Date: Tue, 2 Apr 2024 17:17:30 -0700
References: <20240320005024.3216282-1-seanjc@google.com>
	<4d04b010-98f3-4eae-b320-a7dd6104b0bf@redhat.com>

On Tue, Apr 02, 2024, David Hildenbrand wrote:
> On 02.04.24 19:38, David Matlack wrote:
> > On Wed, Mar 20, 2024 at 5:56 AM David Hildenbrand wrote:
> > > Secondary page tables are different to ordinary GUP, and KVM ends up
> > > using GUP to some degree to simulate HW access; regarding NUMA-hinting,
> > > KVM already revealed itself to be very different from all other GUP
> > > users. [1]
> > >
> > > And I recall that at some point I raised that we might want to have a
> > > dedicated interface for these "mmu-notifier"-based page table
> > > synchronization mechanisms.
> > >
> > > But KVM ends up setting folio dirty/access flags itself, like other GUP
> > > users. I do wonder if secondary page tables should be messing with folio
> > > flags *at all*, and if there would be ways to do it differently using
> > > PTEs.
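For reference, "setting folio dirty/access flags itself" ultimately means
calling the generic core-MM folio helpers. A minimal sketch of that pattern
(the wrapper function and its name are made up for illustration; this is not
KVM's actual code):

#include <linux/mm.h>
#include <linux/swap.h>

/*
 * Illustrative only: the gist of what a GUP-style user does after touching
 * memory through its own mapping, i.e. poke the folio flags directly,
 * because the primary MMU PTE never saw the access.
 */
static void secondary_mmu_mark_folio(struct folio *folio, bool wrote)
{
	/* Accessed: keep the folio from looking idle to reclaim/aging. */
	folio_mark_accessed(folio);

	/* Dirty: make reclaim write the data back instead of dropping it. */
	if (wrote)
		folio_mark_dirty(folio);
}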
> > >
> > > We make sure to synchronize the secondary page tables to the process
> > > page tables using MMU notifiers: when we write-protect/unmap a PTE, we
> > > write-protect/unmap the SPTE. Yet, we handle accessed/dirty completely
> > > differently.
> >
> > Accessed bits have the test/clear young MMU-notifiers. But I agree
> > there aren't any notifiers for dirty tracking.
>
> Yes, and I am questioning if the "test" part should exist -- or if having an
> SPTE in the secondary MMU should require the access bit to be set (derived
> from the primary MMU). (again, my explanation about fake HW page table
> walkers)
>
> There might be a good reason to do it like that nowadays, so I'm only
> raising it as something I was wondering. Likely, frequent clearing of the
> access bit would result in many PTEs in the secondary MMU getting
> invalidated, requiring a new GUP-fast lookup where we would set the access
> bit in the primary MMU PTE. But I'm not an expert on the implications with
> MMU notifiers and access bit clearing.

Ya. KVM already does something similar to this for dirty logging, where SPTEs
are write-protected, i.e. generate faults, to track dirty status. But KVM x86
has a lot of code to mitigate the resulting pain, e.g. a lockless fast-path to
make the SPTE writable and propagate the dirty status to bitmaps, and userspace
is heavily/carefully optimized to balance harvesting/clearing dirty status
against guest activity.

In general, assumptions that core MM makes about the cost of PTE modifications
tend to fall apart for KVM. Much of that comes down to mmu_notifier
invalidations being an imperfect boundary between the primary MMU and secondary
MMUs. E.g. any prot changes (NUMA balancing in particular) can have a
disastrous impact on KVM guests, because those changes invalidate secondary
MMUs at PMD granularity and effectively force KVM to do a remote TLB flush for
every PMD that is affected, whereas the primary MMU is able to batch TLB
flushes until the entire VMA is processed.

So yeah, forcing KVM to zap and re-fault SPTEs in order to do page aging would
be comically slow for KVM guests without a crazy amount of optimization.

> > Are there any cases where the primary MMU transfers the PTE dirty bit
> > to the folio _other_ than zapping (which already has an MMU-notifier
> > to KVM)? If not, then there might not be any reason to add a new
> > notifier. Instead the contract should just be that secondary MMUs must
> > also transfer their dirty bits to folios in sync with (or before) the
> > primary MMU zapping its PTE.
>
> Grepping for pte_mkclean(), there might be some cases. Many cases use MMU
> notifiers, because they either clear the PTE or also remove write
> permissions.
>
> But there are madvise_free_pte_range() and
> clean_record_shared_mapping_range()...->clean_record_pte(), which might only
> clear the dirty bit without clearing/changing permissions and consequently
> not call MMU notifiers.

Heh, I was staring at these earlier today. They are all wrapped with
mmu_notifier_invalidate_range_{start,end}(), and KVM's hyper-aggressive
response to mmu_notifier invalidations ensures all KVM SPTEs get dropped.
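Roughly, the pattern in question looks like the sketch below. It is not a
copy of madvise_free_pte_range(), it assumes the current six-argument
mmu_notifier_range_init() signature, and the function name is made up for
illustration.

#include <linux/mm.h>
#include <linux/mmu_notifier.h>

/*
 * Simplified sketch: even a walk that only cleans PTEs is bracketed by
 * MMU-notifier invalidations, so a secondary MMU that zaps its mappings
 * from its invalidate_range_start() hook never keeps a stale writable view
 * across the dirty-bit clearing.
 */
static void clean_ptes_in_range(struct vm_area_struct *vma,
				unsigned long start, unsigned long end)
{
	struct mmu_notifier_range range;

	mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma->vm_mm,
				start, end);
	mmu_notifier_invalidate_range_start(&range);

	/* ... walk the page tables, pte_mkclean() + set_pte_at() ... */

	mmu_notifier_invalidate_range_end(&range);
}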
> Getting a writable PTE without the dirty bit set should be possible.
>
> So I am questioning whether having a writable PTE in the secondary MMU with
> a clean PTE in the primary MMU should be valid to exist. It can exist today,
> and I am not sure if that's the right approach.

I came to the same conclusion (that it can exist today). Without (sane)
FOLL_TOUCH, KVM could retrieve the new PTE (W=1,D=0) after an mmu_notifier
invalidation, and thus end up with a writable SPTE that isn't dirty in core MM.

For MADV_FREE, I don't see how KVM's current behavior of marking the folio
dirty at zap time changes anything. Ah, try_to_unmap_one() checks
folio_test_dirty() *after* invoking mmu_notifier_invalidate_range_start(),
which ensures that KVM's dirty status is propagated to the folio, and thus
try_to_unmap_one() keeps the folio.

Aha! But try_to_unmap_one() also checks that refcount==mapcount+1, i.e. it will
also keep the folio if it has been GUP'd. And __remove_mapping() explicitly
states that it needs to play nice with a GUP'd page being marked dirty before
the reference is dropped.

	 * Must be careful with the order of the tests. When someone has
	 * a ref to the folio, it may be possible that they dirty it then
	 * drop the reference. So if the dirty flag is tested before the
	 * refcount here, then the following race may occur:

So while it's totally possible for KVM to get a W=1,D=0 PTE, if I'm reading the
code correctly it's safe/legal so long as KVM either (a) marks the folio dirty
while holding a reference or (b) marks the folio dirty before returning from
its mmu_notifier_invalidate_range_start() hook, *AND* obviously if KVM drops
its mappings in response to mmu_notifier_invalidate_range_start().

> > > I once had the following idea, but I am not sure about all implications,
> > > just wanted to raise it because it matches the topic here:
> > >
> > > Secondary page tables kind-of behave like "HW" access. If there is a
> > > write access, we would expect the original PTE to become dirty, not the
> > > mapped folio.
> >
> > Propagating SPTE dirty bits to folios indirectly via the primary MMU
> > PTEs won't work for guest_memfd where there is no primary MMU PTE. In
> > order to avoid having two different ways to propagate SPTE dirty bits,
> > KVM should probably be responsible for updating the folio directly.
>
> But who really cares about access/dirty bits for guest_memfd?
>
> guest_memfd already wants to disable/bypass all of core-MM, so different
> rules are to be expected. This discussion is all about integration with
> core-MM that relies on correct dirty bits, which does not really apply to
> guest_memfd.

+1, this is one of many reasons I don't want to support swap/reclaim for
guest_memfd. The instant guest_memfd gets involved with anything LRU-related,
the interactions and complexity skyrocket.
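Returning to the W=1,D=0 discussion above, ordering (a), i.e. mark the folio
dirty while still holding a reference, boils down to the sketch below. It uses
the generic folio API and an illustrative function name; it is not KVM's
actual release path.

#include <linux/mm.h>

/*
 * Illustrative sketch of ordering (a): as long as the folio is marked dirty
 * *before* the last reference is dropped, __remove_mapping() (which checks
 * the refcount before the dirty flag) won't discard the data.
 */
static void drop_spte_and_release(struct folio *folio, bool spte_was_writable)
{
	if (spte_was_writable)
		folio_mark_dirty(folio);	/* while we still hold a ref */

	folio_put(folio);			/* now the reference can go */
}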