2021-12-13 11:25:30

by Paolo Bonzini

Subject: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
tables, which flushes the accessed and dirty bits out to the Linux
"struct page"s. Missing some of the roots has catastrophic effects,
because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
removed and any PTEs left behind might become dangling by the time
kvm_arch_destroy_vm tears down the roots for good.

Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is
invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
in place but will later be zapped during kvm_arch_destroy_vm.

To fix this, ensure that kvm_tdp_mmu_zap_all goes over all
roots, including the invalid ones. The easiest way to do so is for
kvm_tdp_mmu_zap_all to do the same as kvm_mmu_zap_all_fast: invalidate
all roots, and then zap the invalid roots. The only difference is that
there is no need to go through tdp_mmu_zap_spte_atomic.

Paolo

Paolo Bonzini (2):
KVM: x86: allow kvm_tdp_mmu_zap_invalidated_roots with write-locked
mmu_lock
KVM: x86: zap invalid roots in kvm_tdp_mmu_zap_all

arch/x86/kvm/mmu/mmu.c | 2 +-
arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++------------------
arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
3 files changed, 24 insertions(+), 22 deletions(-)

--
2.31.1



2021-12-13 11:25:34

by Paolo Bonzini

Subject: [PATCH 2/2] KVM: x86: zap invalid roots in kvm_tdp_mmu_zap_all

kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
tables, which flushes the accessed and dirty bits out to the Linux
"struct page"s. Missing some of the roots has catastrophic effects,
because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
removed and any PTEs left behind might become dangling by the time
kvm_arch_destroy_vm tears down the roots for good.

Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is
invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
in place but will later be zapped during kvm_arch_destroy_vm.

To fix this, ensure that kvm_tdp_mmu_zap_all goes over all roots,
including the invalid ones. The easiest way to do so is for
kvm_tdp_mmu_zap_all to do the same as kvm_mmu_zap_all_fast: invalidate
all roots, and then zap the invalid roots. However, there is no need
to go through tdp_mmu_zap_spte_atomic because there are no running vCPUs.

Fixes: b7cccd397f31 ("KVM: x86/mmu: Fast invalidation for TDP MMU")
Cc: [email protected]
Reported-by: Ignat Korchagin <[email protected]>
Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/x86/kvm/mmu/tdp_mmu.c | 25 +++++++++++++------------
1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/x86/kvm/mmu/tdp_mmu.c b/arch/x86/kvm/mmu/tdp_mmu.c
index f2dd5c97bbc2..ce3fafb6c9a7 100644
--- a/arch/x86/kvm/mmu/tdp_mmu.c
+++ b/arch/x86/kvm/mmu/tdp_mmu.c
@@ -779,18 +779,6 @@ bool __kvm_tdp_mmu_zap_gfn_range(struct kvm *kvm, int as_id, gfn_t start,
return flush;
}

-void kvm_tdp_mmu_zap_all(struct kvm *kvm)
-{
- bool flush = false;
- int i;
-
- for (i = 0; i < KVM_ADDRESS_SPACE_NUM; i++)
- flush = kvm_tdp_mmu_zap_gfn_range(kvm, i, 0, -1ull, flush);
-
- if (flush)
- kvm_flush_remote_tlbs(kvm);
-}
-
static struct kvm_mmu_page *next_invalidated_root(struct kvm *kvm,
struct kvm_mmu_page *prev_root)
{
@@ -888,6 +876,19 @@ void kvm_tdp_mmu_invalidate_all_roots(struct kvm *kvm)
root->role.invalid = true;
}

+void kvm_tdp_mmu_zap_all(struct kvm *kvm)
+{
+ /*
+ * We need to zap all roots, including already-invalid ones. The
+ * easiest way is to ensure there are only invalid roots, which we
+ * then, for efficiency, zap while mmu_lock is held exclusively.
+ * Since the MMU notifier is being torn down, contention on the
+ * mmu_lock is not an issue.
+ */
+ kvm_tdp_mmu_invalidate_all_roots(kvm);
+ kvm_tdp_mmu_zap_invalidated_roots(kvm, false);
+}
+
/*
* Installs a last-level SPTE to handle a TDP page fault.
* (NPT/EPT violation/misconfiguration)
--
2.31.1


2021-12-13 13:44:10

by Ignat Korchagin

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

Unfortunately, this patchset does not fix the original issue reported in [1].

Still got the following stacktrace on the 5th run of gvisor:

[ 65.777066][ T2987] ------------[ cut here ]------------
[ 65.788832][ T2987] WARNING: CPU: 2 PID: 2987 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:173 kvm_is_zone_device_pfn.part.0+0x9e/0xd0
[ 65.813145][ T2987] Modules linked in:
[ 65.821414][ T2987] CPU: 2 PID: 2987 Comm: exe Not tainted 5.16.0-rc4+ #23
[ 65.835836][ T2987] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
[ 65.854804][ T2987] RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0
[ 65.867500][ T2987] Code: 00 00 00 00 fc ff df 48 c1 ea 03 0f b6 14 02 48 89 e8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 0f 8b 43 34 85 c0 74 03 5b 5d c3 <0f> 0b 5b 5d c3 48 89 ef e8 d5 36 9e 00 eb e7 e8 de 36 9e 00 eb 9b
[ 65.909924][ T2987] RSP: 0018:ffff888113e47288 EFLAGS: 00010246
[ 65.923944][ T2987] RAX: 0000000000000000 RBX: ffffea0004969880 RCX: ffffffff9087289e
[ 65.942453][ T2987] RDX: 0000000000000000 RSI: 0000000000000004 RDI: ffffea00049698b4
[ 65.960703][ T2987] RBP: ffffea00049698b4 R08: 0000000000000000 R09: ffffea00049698b7
[ 65.978929][ T2987] R10: fffff9400092d316 R11: 0000000008000000 R12: ffff88827ffda000
[ 65.996858][ T2987] R13: 0600000125a62b77 R14: 0000000000000001 R15: 0000000000000001
[ 66.014646][ T2987] FS:  0000000000000000(0000) GS:ffff88822d300000(0000) knlGS:0000000000000000
[ 66.035733][ T2987] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 66.050672][ T2987] CR2: 00007f40eae98008 CR3: 000000026c63e005 CR4: 0000000000172ee0
[ 66.068551][ T2987] Call Trace:
[ 66.076686][ T2987] <TASK>
[ 66.084543][ T2987] kvm_set_pfn_dirty+0x120/0x1d0
[ 66.095423][ T2987] __handle_changed_spte+0x9e7/0xed0
[ 66.107841][ T2987] ? alloc_tdp_mmu_page+0x470/0x470
[ 66.121447][ T2987] __handle_changed_spte+0x6d2/0xed0
[ 66.135648][ T2987] ? alloc_tdp_mmu_page+0x470/0x470
[ 66.148376][ T2987] __handle_changed_spte+0x6d2/0xed0
[ 66.160600][ T2987] ? alloc_tdp_mmu_page+0x470/0x470
[ 66.172627][ T2987] __handle_changed_spte+0x6d2/0xed0
[ 66.184815][ T2987] ? alloc_tdp_mmu_page+0x470/0x470
[ 66.196732][ T2987] ? lock_release+0x700/0x700
[ 66.208483][ T2987] __tdp_mmu_set_spte+0x18c/0x9d0
[ 66.223852][ T2987] ? tdp_iter_next+0x205/0x640
[ 66.235493][ T2987] ? tdp_iter_start+0x26d/0x3f0
[ 66.246130][ T2987] zap_gfn_range+0x8b5/0x990
[ 66.256505][ T2987] ? zap_collapsible_spte_range+0x800/0x800
[ 66.269919][ T2987] ? lock_release+0x3b7/0x700
[ 66.279630][ T2987] ? kvm_tdp_mmu_put_root+0x1b6/0x2d0
[ 66.292582][ T2987] ? rwlock_bug.part.0+0x90/0x90
[ 66.303611][ T2987] kvm_tdp_mmu_put_root+0x1d1/0x2d0
[ 66.315943][ T2987] mmu_free_root_page+0x219/0x2c0
[ 66.327581][ T2987] ? ept_invlpg+0x740/0x740
[ 66.337448][ T2987] ? kvm_vcpu_write_tsc_offset+0xfd/0x370
[ 66.350101][ T2987] kvm_mmu_free_roots+0x275/0x490
[ 66.361490][ T2987] ? mmu_free_root_page+0x2c0/0x2c0
[ 66.373656][ T2987] ? do_raw_spin_unlock+0x54/0x220
[ 66.385622][ T2987] ? _raw_spin_unlock+0x29/0x40
[ 66.396843][ T2987] kvm_mmu_unload+0x1c/0xa0
[ 66.407419][ T2987] kvm_arch_destroy_vm+0x1fe/0x5e0
[ 66.419178][ T2987] ? mmu_notifier_unregister+0x276/0x330
[ 66.431782][ T2987] kvm_put_kvm+0x3f9/0xa70
[ 66.442694][ T2987] kvm_vcpu_release+0x4e/0x70
[ 66.453070][ T2987] __fput+0x204/0x8d0
[ 66.462673][ T2987] task_work_run+0xce/0x170
[ 66.473106][ T2987] do_exit+0xa37/0x23e0
[ 66.482760][ T2987] ? static_obj+0x61/0xc0
[ 66.492643][ T2987] ? lock_release+0x3b7/0x700
[ 66.503069][ T2987] ? mm_update_next_owner+0x6d0/0x6d0
[ 66.514934][ T2987] ? lock_downgrade+0x6d0/0x6d0
[ 66.527491][ T2987] ? do_raw_spin_lock+0x12b/0x270
[ 66.538734][ T2987] ? rwlock_bug.part.0+0x90/0x90
[ 66.549815][ T2987] do_group_exit+0xec/0x2a0
[ 66.560548][ T2987] get_signal+0x3e8/0x1f50
[ 66.570826][ T2987] arch_do_signal_or_restart+0x244/0x1820
[ 66.583863][ T2987] ? migrate_enable+0x1d6/0x240
[ 66.594782][ T2987] ? do_futex+0x229/0x340
[ 66.604801][ T2987] ? get_sigframe_size+0x10/0x10
[ 66.616460][ T2987] ? __seccomp_filter+0x19d/0xd90
[ 66.627524][ T2987] ? __x64_sys_futex+0x181/0x420
[ 66.640458][ T2987] ? do_futex+0x340/0x340
[ 66.650659][ T2987] exit_to_user_mode_prepare+0x12c/0x1c0
[ 66.663419][ T2987] syscall_exit_to_user_mode+0x19/0x50
[ 66.675965][ T2987] do_syscall_64+0x4d/0x90
[ 66.685580][ T2987] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 66.699743][ T2987] RIP: 0033:0x474703
[ 66.708120][ T2987] Code: Unable to access opcode bytes at RIP 0x4746d9.
[ 66.721675][ T2987] RSP: 002b:000000c0002f3d28 EFLAGS: 00000286 ORIG_RAX: 00000000000000ca
[ 66.737633][ T2987] RAX: fffffffffffffe00 RBX: 000000c000308c00 RCX: 0000000000474703
[ 66.752382][ T2987] RDX: 0000000000000000 RSI: 0000000000000080 RDI: 000000c000308d50
[ 66.767496][ T2987] RBP: 000000c0002f3d70 R08: 0000000000000000 R09: 0000000000000000
[ 66.782643][ T2987] R10: 0000000000000000 R11: 0000000000000286 R12: 0000000000000000
[ 66.797659][ T2987] R13: 0000000000000000 R14: ffffffffffffffff R15: 000000c00027b6c0
[ 66.813540][ T2987] </TASK>
[ 66.819823][ T2987] irq event stamp: 17803
[ 66.827863][ T2987] hardirqs last enabled at (17813): [<ffffffff90d26642>] __up_console_sem+0x52/0x60
[ 66.845094][ T2987] hardirqs last disabled at (17822): [<ffffffff90d26627>] __up_console_sem+0x37/0x60
[ 66.862747][ T2987] softirqs last enabled at (17702): [<ffffffff90bd9903>] __irq_exit_rcu+0x113/0x170
[ 66.880146][ T2987] softirqs last disabled at (17697): [<ffffffff90bd9903>] __irq_exit_rcu+0x113/0x170
[ 66.897521][ T2987] ---[ end trace 552e9049bda0ba46 ]---
[ 442.873226][ C1] perf: interrupt took too long (4761 > 2500), lowering kernel.perf_event_max_sample_rate to 42000

The only difference I noticed is the presence of __tdp_mmu_set_spte
between zap_gfn_range and __handle_changed_spte, which is absent from
the original stacktrace.

[1]: https://marc.info/?l=kvm&m=163822397323141&w=2

On Mon, Dec 13, 2021 at 11:25 AM Paolo Bonzini <[email protected]> wrote:
>
> kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
> tables, which flushes the accessed and dirty bits out to the Linux
> "struct page"s. Missing some of the roots has catastrophic effects,
> because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
> removed and any PTEs left behind might become dangling by the time
> kvm_arch_destroy_vm tears down the roots for good.
>
> Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
> visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
> uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is
> invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
> in place but will later be zapped during kvm_arch_destroy_vm.
>
> To fix this, ensure that kvm_tdp_mmu_zap_all goes over all
> roots, including the invalid ones. The easiest way to do so is for
> kvm_tdp_mmu_zap_all to do the same as kvm_mmu_zap_all_fast: invalidate
> all roots, and then zap the invalid roots. The only difference is that
> there is no need to go through tdp_mmu_zap_spte_atomic.
>
> Paolo
>
> Paolo Bonzini (2):
> KVM: x86: allow kvm_tdp_mmu_zap_invalidated_roots with write-locked
> mmu_lock
> KVM: x86: zap invalid roots in kvm_tdp_mmu_zap_all
>
> arch/x86/kvm/mmu/mmu.c | 2 +-
> arch/x86/kvm/mmu/tdp_mmu.c | 42 ++++++++++++++++++++------------------
> arch/x86/kvm/mmu/tdp_mmu.h | 2 +-
> 3 files changed, 24 insertions(+), 22 deletions(-)
>
> --
> 2.31.1
>

2021-12-13 15:07:08

by Paolo Bonzini

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

On 12/13/21 14:43, Ignat Korchagin wrote:
> The only difference I noticed is the presence of __tdp_mmu_set_spte
> between zap_gfn_range and __handle_changed_spte, which is absent from
> the original stacktrace.

That's just a difference in inlining decisions, so it doesn't really matter.

Let's see if Sean has some more ideas or finds something obviously wrong
in my patch.

Paolo

2021-12-13 16:36:08

by Sean Christopherson

Subject: Re: [PATCH 2/2] KVM: x86: zap invalid roots in kvm_tdp_mmu_zap_all

On Mon, Dec 13, 2021, Paolo Bonzini wrote:
> kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
> tables, which flushes the accessed and dirty bits out to the Linux
> "struct page"s. Missing some of the roots has catastrophic effects,
> because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
> removed and any PTEs left behind might become dangling by the time
> kvm_arch_destroy_vm tears down the roots for good.
>
> Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
> visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
> uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is
> invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
> in place but will later be zapped during kvm_arch_destroy_vm.

As stated in the bug report thread[*], it should be impossible for the MMU
notifier to be unregistered while kvm_mmu_zap_all_fast() is running.

I do believe there's a race between set_nx_huge_pages() and kvm_mmu_notifier_release(),
but that would result in the use-after-free kvm_set_pfn_dirty() tracing back to
set_nx_huge_pages(), not kvm_destroy_vm(). And for that, I would much prefer we
elevate mm->users while changing the NX hugepage setting.

diff --git a/arch/x86/kvm/mmu/mmu.c b/arch/x86/kvm/mmu/mmu.c
index 8f0035517450..985df4db8192 100644
--- a/arch/x86/kvm/mmu/mmu.c
+++ b/arch/x86/kvm/mmu/mmu.c
@@ -6092,10 +6092,15 @@ static int set_nx_huge_pages(const char *val, const struct kernel_param *kp)
mutex_lock(&kvm_lock);

list_for_each_entry(kvm, &vm_list, vm_list) {
+ if (!mmget_not_zero(kvm->mm))
+ continue;
+
mutex_lock(&kvm->slots_lock);
kvm_mmu_zap_all_fast(kvm);
mutex_unlock(&kvm->slots_lock);

+ mmput_async(kvm->mm);
+
wake_up_process(kvm->arch.nx_lpage_recovery_thread);
}
mutex_unlock(&kvm_lock);

[*] https://lore.kernel.org/all/[email protected]/

2021-12-13 16:47:40

by Sean Christopherson

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

On Mon, Dec 13, 2021, Ignat Korchagin wrote:
> Unfortunately, this patchset does not fix the original issue reported in [1].

Can you provide your kernel config? And any other version/config info that might
be relevant, e.g. anything in gvisor or runsc?

2021-12-13 18:28:26

by Paolo Bonzini

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

On 12/13/21 17:47, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Ignat Korchagin wrote:
>> Unfortunately, this patchset does not fix the original issue reported in [1].
> Can you provide your kernel config? And any other version/config info that might
> be relevant, e.g. anything in gvisor or runsc?

Also, I couldn't even get gvisor to run; it cannot start gofer and fails
with

starting container: starting root container: urpc method "containerManager.StartRoot" failed: EOF

(I only tried CentOS 9 and Fedora, both with cgroupv1 enabled).

Paolo

2021-12-13 19:44:17

by Sean Christopherson

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

On Mon, Dec 13, 2021, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Ignat Korchagin wrote:
> > Unfortunately, this patchset does not fix the original issue reported in [1].
>
> Can you provide your kernel config? And any other version/config info that might
> be relevant, e.g. anything in gvisor or runsc?

Scratch that, I've reproduced this, with luck I'll have a root cause by end of day.

2021-12-13 20:18:34

by Ignat Korchagin

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

Just for the reference, here is my repro environment:

* kernel config:
https://gist.githubusercontent.com/ignatk/3a5457b8641d636963a2a4f14ccc854f/raw/e9b76b66454e4a3c0f7e395b1792b32ef053a541/gistfile1.txt
Kernel compiled from kvm/master. The config was processed with
mod2yesconfig, but it also reproduces when many things are built as
modules; I just didn't want to bother with installing modules in the
target VM.

* host: Debian Bullseye with qemu version: QEMU emulator version 6.1.0
(Debian 1:6.1+dfsg-6~bpo11+1)

* qemu commandline:
qemu-system-x86_64 -nographic -cpu host \
  -enable-kvm \
  -machine q35 \
  -smp 8 \
  -m 8G \
  -drive if=pflash,format=raw,readonly=on,file=/usr/share/OVMF/OVMF_CODE.fd \
  -drive if=pflash,format=raw,file=/usr/share/OVMF/OVMF_VARS.fd \
  -drive file=/work/rootfs.img,format=qcow2 \
  -nic user,model=virtio-net-pci,hostfwd=tcp::22-:22 \
  -kernel vmlinuz \
  -append "console=ttyS0 root=/dev/sda rw systemd.unified_cgroup_hierarchy=0"

* rootfs.img is barebones standard Debian Bullseye installation

* to install gvisor I just run the following in the VM (blindly
copypasted from https://gvisor.dev/docs/user_guide/install/):

(
set -e
ARCH=$(uname -m)
URL=https://storage.googleapis.com/gvisor/releases/release/latest/${ARCH}
wget ${URL}/runsc ${URL}/runsc.sha512 \
${URL}/containerd-shim-runsc-v1 ${URL}/containerd-shim-runsc-v1.sha512
sha512sum -c runsc.sha512 \
-c containerd-shim-runsc-v1.sha512
rm -f *.sha512
chmod a+rx runsc containerd-shim-runsc-v1
sudo mv runsc containerd-shim-runsc-v1 /usr/local/bin
)

* to reproduce, just run "sudo runsc --platform=kvm --network=none do
echo ok" several times

Regards,
Ignat

On Mon, Dec 13, 2021 at 7:44 PM Sean Christopherson <[email protected]> wrote:
>
> On Mon, Dec 13, 2021, Sean Christopherson wrote:
> > On Mon, Dec 13, 2021, Ignat Korchagin wrote:
> > > Unfortunately, this patchset does not fix the original issue reported in [1].
> >
> > Can you provide your kernel config? And any other version/config info that might
> > be relevant, e.g. anything in gvisor or runsc?
>
> Scratch that, I've reproduced this, with luck I'll have a root cause by end of day.

2021-12-14 03:12:40

by Sean Christopherson

Subject: Re: [PATCH 0/2] KVM: x86: Fix dangling page reference in TDP MMU

On Mon, Dec 13, 2021, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Sean Christopherson wrote:
> > On Mon, Dec 13, 2021, Ignat Korchagin wrote:
> > > Unfortunately, this patchset does not fix the original issue reported in [1].
> >
> > Can you provide your kernel config? And any other version/config info that might
> > be relevant, e.g. anything in gvisor or runsc?
>
> Scratch that, I've reproduced this, with luck I'll have a root cause by end of day.

Ok, the root cause is comically simple compared to all the theories we came up with.
If tdp_mmu_iter_cond_resched() drops mmu_lock and restarts the iterator, the
"continue" in the caller triggers tdp_iter_next(). tdp_iter_next() does what it's
told and advances the iterator. Because all users call tdp_mmu_iter_cond_resched()
at the very beginning of the loop, this has the effect of skipping the current SPTE.

E.g. in the "zap all" case, where iter->level == iter->min_level == iter->root_level,
we effectively end up with code like this, which is obviously wrong once the
complexity of traversing a tree is simplified down to walking an array of SPTEs.

gfn_t end = tdp_mmu_max_gfn_host();
gfn_t start = 0;
gfn_t last;

for (i = last = start; i < end; i += 8, last = i) {
if (cond_resched()) {
i = last;
continue;
}

sp = &root->spt[i];
zap(sp);
}

Patch incoming...

2021-12-14 19:45:45

by Sean Christopherson

Subject: Re: [PATCH 2/2] KVM: x86: zap invalid roots in kvm_tdp_mmu_zap_all

On Mon, Dec 13, 2021, Sean Christopherson wrote:
> On Mon, Dec 13, 2021, Paolo Bonzini wrote:
> > kvm_tdp_mmu_zap_all is intended to visit all roots and zap their page
> > tables, which flushes the accessed and dirty bits out to the Linux
> > "struct page"s. Missing some of the roots has catastrophic effects,
> > because kvm_tdp_mmu_zap_all is called when the MMU notifier is being
> > removed and any PTEs left behind might become dangling by the time
> > kvm_arch_destroy_vm tears down the roots for good.
> >
> > Unfortunately that is exactly what kvm_tdp_mmu_zap_all is doing: it
> > visits all roots via for_each_tdp_mmu_root_yield_safe, which in turn
> > uses kvm_tdp_mmu_get_root to skip invalid roots. If the current root is
> > invalid at the time of kvm_tdp_mmu_zap_all, its page tables will remain
> > in place but will later be zapped during kvm_arch_destroy_vm.
>
> As stated in the bug report thread[*], it should be impossible for the MMU
> notifier to be unregistered while kvm_mmu_zap_all_fast() is running.
>
> I do believe there's a race between set_nx_huge_pages() and kvm_mmu_notifier_release(),
> but that would result in the use-after-free kvm_set_pfn_dirty() tracing back to
> set_nx_huge_pages(), not kvm_destroy_vm(). And for that, I would much prefer we
> elevate mm->users while changing the NX hugepage setting.

Mwhahaha, race confirmed with a bit of hacking to force the issue. I'll get a
patch out.