2021-11-30 08:27:16

by Tian, Kevin

[permalink] [raw]
Subject: Q. about KVM and CPU hotplug

Hi, Paolo/Thomas,

I'm curious about the consequence if KVM fails to initialize a
hotplugged CPU.

Looking at the code KVM has been added to the CPU hotplug state
machine:

r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
kvm_starting_cpu, kvm_dying_cpu);

static int kvm_starting_cpu(unsigned int cpu)
{
raw_spin_lock(&kvm_count_lock);
if (kvm_usage_count)
hardware_enable_nolock(NULL);
raw_spin_unlock(&kvm_count_lock);
return 0;
}

kvm_starting_cpu() always return success as the callbacks in the
STARTING section are not allowed to fail.

However hardware_enable_nolock() may fail for various reasons:

static void hardware_enable_nolock(void *junk)
{
int cpu = raw_smp_processor_id();
int r;

if (cpumask_test_cpu(cpu, cpus_hardware_enabled))
return;

cpumask_set_cpu(cpu, cpus_hardware_enabled);

r = kvm_arch_hardware_enable();

if (r) {
cpumask_clear_cpu(cpu, cpus_hardware_enabled);
atomic_inc(&hardware_enable_failed);
pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
}
}

Upon error hardware_enable_failed is incremented. However this variable
is checked only in hardware_enable_all() called when the 1st VM is called.

This implies that KVM may be left in a state where it doesn't know a CPU
not ready to host VMX operations.

Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
to Qemu at some point or may it lead to undefined behavior? And is there
any method to prevent vCPU thread from being scheduled to the CPU?

We found this open when considering TDX and CPU hotplug.

By design the current generation of TDX doesn't support CPU hotplug.
Only boot-time CPUs can be initialized for TDX (and must be done en
masse in one breath). Attempting to do seamcalls on a hotplugged CPU
simply fails, thus it potentially affects any trusted domain in case its
vCPUs are scheduled to the plugged CPU.

There is a puzzle whether we should just document such restriction or
need more proactive measure (e.g. to prevent such case happen). Since
it's similar to above situation where KVM fails to init on a hotplugged
CPU, we'd like to seek your suggestion first.

Thanks
Kevin


2021-11-30 09:28:53

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Q. about KVM and CPU hotplug

On 11/30/21 09:27, Tian, Kevin wrote:
> r = kvm_arch_hardware_enable();
>
> if (r) {
> cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> atomic_inc(&hardware_enable_failed);
> pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
> }
> }
>
> Upon error hardware_enable_failed is incremented. However this variable
> is checked only in hardware_enable_all() called when the 1st VM is called.
>
> This implies that KVM may be left in a state where it doesn't know a CPU
> not ready to host VMX operations.
>
> Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> to Qemu at some point or may it lead to undefined behavior? And is there
> any method to prevent vCPU thread from being scheduled to the CPU?

It should fail the first vmptrld instruction. It will result in a few
WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX this
should be a pretty bad firmware bug, and it has never been reported.
KVM did find some undocumented errata but not this one!

I don't think there's any fix other than pinning userspace. The WARNs
can be eliminated by calling KVM_BUG_ON in the sched_in notifier, plus
checking if the VM is bugged before entering the guest or doing a
VMREAD/VMWRITE (usually the check is done only in a ioctl). But some
refactoring is probably needed to make the code more robust in general.

Paolo

> By design the current generation of TDX doesn't support CPU hotplug.
> Only boot-time CPUs can be initialized for TDX (and must be done en
> masse in one breath). Attempting to do seamcalls on a hotplugged CPU
> simply fails, thus it potentially affects any trusted domain in case its
> vCPUs are scheduled to the plugged CPU.

2021-11-30 14:05:56

by Thomas Gleixner

[permalink] [raw]
Subject: Re: Q. about KVM and CPU hotplug

On Tue, Nov 30 2021 at 10:28, Paolo Bonzini wrote:

> On 11/30/21 09:27, Tian, Kevin wrote:
>> r = kvm_arch_hardware_enable();
>>
>> if (r) {
>> cpumask_clear_cpu(cpu, cpus_hardware_enabled);
>> atomic_inc(&hardware_enable_failed);
>> pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
>> }
>> }
>>
>> Upon error hardware_enable_failed is incremented. However this variable
>> is checked only in hardware_enable_all() called when the 1st VM is called.
>>
>> This implies that KVM may be left in a state where it doesn't know a CPU
>> not ready to host VMX operations.
>>
>> Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
>> KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
>> to Qemu at some point or may it lead to undefined behavior? And is there
>> any method to prevent vCPU thread from being scheduled to the CPU?
>
> It should fail the first vmptrld instruction. It will result in a few
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX this
> should be a pretty bad firmware bug, and it has never been reported.
> KVM did find some undocumented errata but not this one!
>
> I don't think there's any fix other than pinning userspace. The WARNs
> can be eliminated by calling KVM_BUG_ON in the sched_in notifier, plus
> checking if the VM is bugged before entering the guest or doing a
> VMREAD/VMWRITE (usually the check is done only in a ioctl). But some
> refactoring is probably needed to make the code more robust in general.

Why is this hotplug callback in the CPU starting section to begin with?

If you stick it into the online section which runs on the hotplugged CPU
in thread context:

CPUHP_AP_ONLINE_IDLE,

--> CPUHP_AP_KVM_STARTING,

CPUHP_AP_SCHED_WAIT_EMPTY,

then it is allowed to fail and it still works in the right way.

When onlining a CPU then there cannot be any vCPU task run on the
CPU at that point.

When offlining a CPU then it's guaranteed that all user tasks and
non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU
task around either.

Thanks,

tglx


2021-11-30 16:27:32

by Paolo Bonzini

[permalink] [raw]
Subject: Re: Q. about KVM and CPU hotplug

On 11/30/21 15:05, Thomas Gleixner wrote:
> Why is this hotplug callback in the CPU starting section to begin with?

Just because the old notifier implementation used CPU_STARTING - in fact
the commit messages say that CPU_STARTING was added partly *for* KVM
(commit e545a6140b69, "kernel/cpu.c: create a CPU_STARTING cpu_chain
notifier", 2008-09-08).

> If you stick it into the online section which runs on the hotplugged CPU
> in thread context:
>
> CPUHP_AP_ONLINE_IDLE,
>
> --> CPUHP_AP_KVM_STARTING,
>
> CPUHP_AP_SCHED_WAIT_EMPTY,
>
> then it is allowed to fail and it still works in the right way.

Yes, moving it to the online section should be fine; it wouldn't solve
the TDX problem however. Failure would rollback the hotplug and forbid
hotplug altogether when TDX is loaded, which is not acceptable.

Paolo

> When onlining a CPU then there cannot be any vCPU task run on the
> CPU at that point.
>
> When offlining a CPU then it's guaranteed that all user tasks and
> non-pinned kernel tasks have left the CPU, i.e. there cannot be a vCPU
> task around either.


2021-11-30 20:02:53

by Sean Christopherson

[permalink] [raw]
Subject: Re: Q. about KVM and CPU hotplug

On Tue, Nov 30, 2021, Paolo Bonzini wrote:
> On 11/30/21 09:27, Tian, Kevin wrote:
> > r = kvm_arch_hardware_enable();
> >
> > if (r) {
> > cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> > atomic_inc(&hardware_enable_failed);
> > pr_info("kvm: enabling virtualization on CPU%d failed\n", cpu);
> > }
> > }
> >
> > Upon error hardware_enable_failed is incremented. However this variable
> > is checked only in hardware_enable_all() called when the 1st VM is called.
> >
> > This implies that KVM may be left in a state where it doesn't know a CPU
> > not ready to host VMX operations.
> >
> > Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> > KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> > to Qemu at some point or may it lead to undefined behavior? And is there
> > any method to prevent vCPU thread from being scheduled to the CPU?
>
> It should fail the first vmptrld instruction. It will result in a few
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX this
> should be a pretty bad firmware bug, and it has never been reported. KVM did
> find some undocumented errata but not this one!

Heh, writing MSR_TEST_CTRL on some CPUs, e.g. Haswell, magically disables VMX.
Not exactly CPU hotplug, but we got close :-) But yeah, if enabling VMX fails,
something in the CPU is badly mangled.

009bce1df0bb ("x86/split_lock: Don't write MSR_TEST_CTRL on CPUs that aren't whitelisted")

2021-12-01 06:59:38

by Tian, Kevin

[permalink] [raw]
Subject: RE: Q. about KVM and CPU hotplug

> From: Paolo Bonzini <[email protected]>
> Sent: Tuesday, November 30, 2021 5:29 PM
>
> On 11/30/21 09:27, Tian, Kevin wrote:
> > r = kvm_arch_hardware_enable();
> >
> > if (r) {
> > cpumask_clear_cpu(cpu, cpus_hardware_enabled);
> > atomic_inc(&hardware_enable_failed);
> > pr_info("kvm: enabling virtualization on CPU%d
> failed\n", cpu);
> > }
> > }
> >
> > Upon error hardware_enable_failed is incremented. However this variable
> > is checked only in hardware_enable_all() called when the 1st VM is called.
> >
> > This implies that KVM may be left in a state where it doesn't know a CPU
> > not ready to host VMX operations.
> >
> > Then I'm curious what will happen if a vCPU is scheduled to this CPU. Does
> > KVM indirectly catch it (e.g. vmenter fail) and return a deterministic error
> > to Qemu at some point or may it lead to undefined behavior? And is there
> > any method to prevent vCPU thread from being scheduled to the CPU?
>
> It should fail the first vmptrld instruction. It will result in a few
> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX this
> should be a pretty bad firmware bug, and it has never been reported.
> KVM did find some undocumented errata but not this one!
>

or it may be caused by incompatible CPU capabilities, which is currently
missing a check in kvm_starting_cpu(). So far the compatibility check is
done only once before registering cpu hotplug state machine:

for_each_online_cpu(cpu) {
smp_call_function_single(cpu, check_processor_compat, &c, 1);
if (r < 0)
goto out_free_2;
}

r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
kvm_starting_cpu, kvm_dying_cpu);

Thanks
Kevin

2021-12-01 07:18:27

by Tian, Kevin

[permalink] [raw]
Subject: RE: Q. about KVM and CPU hotplug

> From: Paolo Bonzini <[email protected]>
> Sent: Wednesday, December 1, 2021 12:27 AM
>
> On 11/30/21 15:05, Thomas Gleixner wrote:
> > Why is this hotplug callback in the CPU starting section to begin with?
>
> Just because the old notifier implementation used CPU_STARTING - in fact
> the commit messages say that CPU_STARTING was added partly *for* KVM
> (commit e545a6140b69, "kernel/cpu.c: create a CPU_STARTING cpu_chain
> notifier", 2008-09-08).
>
> > If you stick it into the online section which runs on the hotplugged CPU
> > in thread context:
> >
> > CPUHP_AP_ONLINE_IDLE,
> >
> > --> CPUHP_AP_KVM_STARTING,
> >
> > CPUHP_AP_SCHED_WAIT_EMPTY,
> >
> > then it is allowed to fail and it still works in the right way.
>
> Yes, moving it to the online section should be fine; it wouldn't solve
> the TDX problem however. Failure would rollback the hotplug and forbid
> hotplug altogether when TDX is loaded, which is not acceptable.
>

Fail hotplug just because TDX is loaded is not acceptable.

But fail hotplug when a trusted domain using TDX is active imo makes
sense. It's similar philosophy to VMX which, with above change, will
fail hotplug when kvm_usage_count is non-zero (implying a VM is
active) but VMX initialization fails on this CPU. We can add similar
tdx_usage_count to mark active TDX users and forbid hotplug
when this variable is non-zero.

In general I think it's an acceptable policy to fail an operation if it
breaks active existing usages... ????

Thanks
Kevin

2021-12-01 10:31:03

by Thomas Gleixner

[permalink] [raw]
Subject: RE: Q. about KVM and CPU hotplug

On Wed, Dec 01 2021 at 06:59, Kevin Tian wrote:
>> From: Paolo Bonzini <[email protected]>
>> It should fail the first vmptrld instruction. It will result in a few
>> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX this
>> should be a pretty bad firmware bug, and it has never been reported.
>> KVM did find some undocumented errata but not this one!
>>
>
> or it may be caused by incompatible CPU capabilities, which is currently
> missing a check in kvm_starting_cpu(). So far the compatibility check is
> done only once before registering cpu hotplug state machine:
>
> for_each_online_cpu(cpu) {
> smp_call_function_single(cpu, check_processor_compat, &c, 1);
> if (r < 0)
> goto out_free_2;
> }
>
> r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> kvm_starting_cpu, kvm_dying_cpu);

Duh. This is silly _and_ broken.

Using for_each_inline_cpu() without holding cpus_read_lock() is racy
against concurrent hotplug. But even if the locking is added then
nothing prevents a CPU from being plugged _after_ the lock is dropped.

The right solution is to move the hotplug state into the threaded
section as I pointed out and do:

r = cpuhp_setup_state(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
kvm_starting_cpu, kvm_dying_cpu);

which will do the right thing automatically. Checking for compatibility
would just be part of the kvm_starting_cpu() callback.

Thanks,

tglx

2021-12-04 03:57:16

by Tian, Kevin

[permalink] [raw]
Subject: RE: Q. about KVM and CPU hotplug

> From: Thomas Gleixner <[email protected]>
> Sent: Wednesday, December 1, 2021 6:31 PM
>
> On Wed, Dec 01 2021 at 06:59, Kevin Tian wrote:
> >> From: Paolo Bonzini <[email protected]>
> >> It should fail the first vmptrld instruction. It will result in a few
> >> WARN_ONCE and pr_warn_ratelimited (see vmx_insn_failed). For VMX
> this
> >> should be a pretty bad firmware bug, and it has never been reported.
> >> KVM did find some undocumented errata but not this one!
> >>
> >
> > or it may be caused by incompatible CPU capabilities, which is currently
> > missing a check in kvm_starting_cpu(). So far the compatibility check is
> > done only once before registering cpu hotplug state machine:
> >
> > for_each_online_cpu(cpu) {
> > smp_call_function_single(cpu, check_processor_compat, &c, 1);
> > if (r < 0)
> > goto out_free_2;
> > }
> >
> > r = cpuhp_setup_state_nocalls(CPUHP_AP_KVM_STARTING,
> "kvm/cpu:starting",
> > kvm_starting_cpu, kvm_dying_cpu);
>
> Duh. This is silly _and_ broken.
>
> Using for_each_inline_cpu() without holding cpus_read_lock() is racy
> against concurrent hotplug. But even if the locking is added then
> nothing prevents a CPU from being plugged _after_ the lock is dropped.
>
> The right solution is to move the hotplug state into the threaded
> section as I pointed out and do:
>
> r = cpuhp_setup_state(CPUHP_AP_KVM_STARTING, "kvm/cpu:starting",
> kvm_starting_cpu, kvm_dying_cpu);
>
> which will do the right thing automatically. Checking for compatibility
> would just be part of the kvm_starting_cpu() callback.
>

Yes, this sounds the right thing to do. We'll work on a fix.

And as said in another reply to Paolo, future TDX compatibility check
will also be added to this callback.

Thanks
Kevin