There are two workflow paths that access the same address
simultaneously, creating a potential data race in kvm_vcpu_on_spin. This
occurs when one workflow reads kvm->last_boosted_vcpu while another
parallel path writes to it.
KCSAN produces the following output when enabled.
BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
value changed: 0x00000012 -> 0x00000000
Given that both operations occur simultaneously without any locking
mechanisms in place, let's ensure atomicity to prevent possible data
corruption. We'll achieve this by employing READ_ONCE() for the reading
operation and WRITE_ONCE() for the writing operation.
Signed-off-by: Breno Leitao <[email protected]>
---
virt/kvm/kvm_main.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff0a20565f90..9768307d5e6c 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -4066,12 +4066,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 {
 	struct kvm *kvm = me->kvm;
 	struct kvm_vcpu *vcpu;
-	int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
+	int last_boosted_vcpu;
 	unsigned long i;
 	int yielded = 0;
 	int try = 3;
 	int pass;
 
+	last_boosted_vcpu = READ_ONCE(me->kvm->last_boosted_vcpu);
 	kvm_vcpu_set_in_spin_loop(me, true);
 	/*
 	 * We boost the priority of a VCPU that is runnable but not
@@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
 
 		yielded = kvm_vcpu_yield_to(vcpu);
 		if (yielded > 0) {
-			kvm->last_boosted_vcpu = i;
+			WRITE_ONCE(kvm->last_boosted_vcpu, i);
 			break;
 		} else if (yielded < 0) {
 			try--;
--
2.43.0
On Thu, May 09, 2024, Breno Leitao wrote:
> There are two workflow paths that access the same address
> simultaneously, creating a potential data race in kvm_vcpu_on_spin. This
> occurs when one workflow reads kvm->last_boosted_vcpu while another
> parallel path writes to it.
>
> KCSAN produces the following output when enabled.
>
> BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
>
> write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
> kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
> handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> __x64_sys_ioctl (fs/ioctl.c:890)
> x64_sys_call (arch/x86/entry/syscall_64.c:33)
> do_syscall_64 (arch/x86/entry/common.c:?)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
>
> read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
> kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
> handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
> vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
> vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
> kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
> kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
> __se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
> __x64_sys_ioctl (fs/ioctl.c:890)
> x64_sys_call (arch/x86/entry/syscall_64.c:33)
> do_syscall_64 (arch/x86/entry/common.c:?)
> entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
>
> value changed: 0x00000012 -> 0x00000000
>
> Given that both operations occur simultaneously without any locking
> mechanisms in place, let's ensure atomicity to prevent possible data
> corruption. We'll achieve this by employing READ_ONCE() for the reading
> operation and WRITE_ONCE() for the writing operation.
Please state changelogs as commands, e.g.
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure ...
And I think it's worth calling out that corruption is _extremely_ unlikely to
happen in practice. It would require the compiler to generate truly awful code,
and it would require a VM with >256 vCPUs.
That said, I do think this should be sent to stable kernels, as it's (very, very)
theoretically possible to generate an out-of-bounds access, and this seems like a
super safe fix. How about this?
---
KVM: Fix a data race on last_boosted_vcpu in kvm_vcpu_on_spin()
Use {READ,WRITE}_ONCE() to access kvm->last_boosted_vcpu to ensure the
loads and stores are atomic. In the extremely unlikely scenario the
compiler tears the stores, it's theoretically possible for KVM to attempt
to get a vCPU using an out-of-bounds index, e.g. if the write is split
into multiple 8-bit stores, and is paired with a 32-bit load on a VM with
257 vCPUs:
  CPU0                              CPU1
  last_boosted_vcpu = 0xff;
                                    (last_boosted_vcpu = 0x100)
                                    last_boosted_vcpu[15:8] = 0x01;
  i = (last_boosted_vcpu = 0x1ff)
                                    last_boosted_vcpu[7:0] = 0x00;
  vcpu = kvm->vcpu_array[0x1ff];
As detected by KCSAN:
BUG: KCSAN: data-race in kvm_vcpu_on_spin [kvm] / kvm_vcpu_on_spin [kvm]
write to 0xffffc90025a92344 of 4 bytes by task 4340 on cpu 16:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4112) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
read to 0xffffc90025a92344 of 4 bytes by task 4342 on cpu 4:
kvm_vcpu_on_spin (arch/x86/kvm/../../../virt/kvm/kvm_main.c:4069) kvm
handle_pause (arch/x86/kvm/vmx/vmx.c:5929) kvm_intel
vmx_handle_exit (arch/x86/kvm/vmx/vmx.c:? arch/x86/kvm/vmx/vmx.c:6606) kvm_intel
vcpu_run (arch/x86/kvm/x86.c:11107 arch/x86/kvm/x86.c:11211) kvm
kvm_arch_vcpu_ioctl_run (arch/x86/kvm/x86.c:?) kvm
kvm_vcpu_ioctl (arch/x86/kvm/../../../virt/kvm/kvm_main.c:?) kvm
__se_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:904 fs/ioctl.c:890)
__x64_sys_ioctl (fs/ioctl.c:890)
x64_sys_call (arch/x86/entry/syscall_64.c:33)
do_syscall_64 (arch/x86/entry/common.c:?)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:130)
value changed: 0x00000012 -> 0x00000000
Fixes: 217ece6129f2 ("KVM: use yield_to instead of sleep in kvm_vcpu_on_spin")
Cc: [email protected]
---
> Signed-off-by: Breno Leitao <[email protected]>
> ---
> virt/kvm/kvm_main.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
> index ff0a20565f90..9768307d5e6c 100644
> --- a/virt/kvm/kvm_main.c
> +++ b/virt/kvm/kvm_main.c
> @@ -4066,12 +4066,13 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> {
> struct kvm *kvm = me->kvm;
> struct kvm_vcpu *vcpu;
> - int last_boosted_vcpu = me->kvm->last_boosted_vcpu;
> + int last_boosted_vcpu;
> unsigned long i;
> int yielded = 0;
> int try = 3;
> int pass;
>
> + last_boosted_vcpu = READ_ONCE(me->kvm->last_boosted_vcpu);
Nit, this could opportunistically use "kvm" without the silly me->kvm.
> kvm_vcpu_set_in_spin_loop(me, true);
> /*
> * We boost the priority of a VCPU that is runnable but not
> @@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
>
> yielded = kvm_vcpu_yield_to(vcpu);
> if (yielded > 0) {
> - kvm->last_boosted_vcpu = i;
> + WRITE_ONCE(kvm->last_boosted_vcpu, i);
> break;
> } else if (yielded < 0) {
> try--;
Side topic #1: am I the only one that finds these loops unnecessarily hard to
read? Unless I'm misreading the code, it's really just an indirect way of looping
over all vCPUs, starting at last_boosted_vcpu+1 and then wrapping around.
IMO, reworking it to be like this is more straightforward:
	int nr_vcpus, start, i, idx, yielded;
	struct kvm *kvm = me->kvm;
	struct kvm_vcpu *vcpu;
	int try = 3;

	nr_vcpus = atomic_read(&kvm->online_vcpus);
	if (nr_vcpus < 2)
		return;

	/* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
	smp_rmb();

	kvm_vcpu_set_in_spin_loop(me, true);

	start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
	for (i = 0; i < nr_vcpus; i++) {
		idx = (start + i) % nr_vcpus;
		if (idx == me->vcpu_idx)
			continue;

		vcpu = xa_load(&kvm->vcpu_array, idx);
		if (!READ_ONCE(vcpu->ready))
			continue;
		if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
			continue;

		/*
		 * Treat the target vCPU as being in-kernel if it has a pending
		 * interrupt, as the vCPU trying to yield may be spinning
		 * waiting on IPI delivery, i.e. the target vCPU is in-kernel
		 * for the purposes of directed yield.
		 */
		if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
		    !kvm_arch_dy_has_pending_interrupt(vcpu) &&
		    !kvm_arch_vcpu_preempted_in_kernel(vcpu))
			continue;

		if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
			continue;

		yielded = kvm_vcpu_yield_to(vcpu);
		if (yielded > 0) {
			WRITE_ONCE(kvm->last_boosted_vcpu, idx);
			break;
		} else if (yielded < 0 && !--try) {
			break;
		}
	}

	kvm_vcpu_set_in_spin_loop(me, false);

	/* Ensure vcpu is not eligible during next spinloop */
	kvm_vcpu_set_dy_eligible(me, false);
Side topic #2: intercepting PAUSE on x86 when there's only a single vCPU in the
VM is silly. I don't know if it's worth the complexity, but we could defer
enabling PLE exiting until a second vCPU is created, e.g. via a new request.
Hmm, but x86 at least already has KVM_X86_DISABLE_EXITS_PAUSE, so this could be
more easily handled in userspace, e.g. by disabling PAUSE exiting if userspace
knows it's creating a single-vCPU VM.
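For illustration only, a rough userspace sketch of that last approach (untested;
the helper name is made up, and it assumes KVM_CAP_X86_DISABLE_EXITS is supported
and is enabled on the VM fd before any vCPUs are created):

	#include <linux/kvm.h>
	#include <sys/ioctl.h>

	/* Hypothetical helper: ask KVM not to intercept PAUSE for this VM. */
	static int disable_pause_exits(int vm_fd)
	{
		struct kvm_enable_cap cap = {
			.cap = KVM_CAP_X86_DISABLE_EXITS,
			.args[0] = KVM_X86_DISABLE_EXITS_PAUSE,
		};

		/* Returns 0 on success, -1 with errno set on failure. */
		return ioctl(vm_fd, KVM_ENABLE_CAP, &cap);
	}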
Hello Sean,
On Thu, May 09, 2024 at 09:42:48AM -0700, Sean Christopherson wrote:
> On Thu, May 09, 2024, Breno Leitao wrote:
> > kvm_vcpu_set_in_spin_loop(me, true);
> > /*
> > * We boost the priority of a VCPU that is runnable but not
> > @@ -4109,7 +4110,7 @@ void kvm_vcpu_on_spin(struct kvm_vcpu *me, bool yield_to_kernel_mode)
> >
> > yielded = kvm_vcpu_yield_to(vcpu);
> > if (yielded > 0) {
> > - kvm->last_boosted_vcpu = i;
> > + WRITE_ONCE(kvm->last_boosted_vcpu, i);
> > break;
> > } else if (yielded < 0) {
> > try--;
>
> Side topic #1: am I the only one that finds these loops unnecessarily hard to
> read?
No. :-)
In fact, when I skimmed over the code, I thought that maybe the code was
not covering the vCPUs before last_boosted_vcpu in the array.
Now that I have looked at it carefully, I see the code uses `pass` to track
whether the iteration has already gone past last_boosted_vcpu in the index.
> Unless I'm misreading the code, it's really just an indirect way of looping
> over all vCPUs, starting at last_boosted_vcpu+1 and then wrapping around.
>
> IMO, reworking it to be like this is more straightforward:
>
> int nr_vcpus, start, i, idx, yielded;
> struct kvm *kvm = me->kvm;
> struct kvm_vcpu *vcpu;
> int try = 3;
>
> nr_vcpus = atomic_read(&kvm->online_vcpus);
> if (nr_vcpus < 2)
> return;
>
> /* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> smp_rmb();
Why do you need this now? Isn't the RCU read lock in xa_load() enough?
> kvm_vcpu_set_in_spin_loop(me, true);
>
> start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> for (i = 0; i < nr_vcpus; i++) {
Why do you need to start at the last boosted vcpu? I.e., why not
start at 0 and skip me->vcpu_idx and kvm->last_boosted_vcpu?
> idx = (start + i) % nr_vcpus;
> if (idx == me->vcpu_idx)
> continue;
>
> vcpu = xa_load(&kvm->vcpu_array, idx);
> if (!READ_ONCE(vcpu->ready))
> continue;
> if (kvm_vcpu_is_blocking(vcpu) && !vcpu_dy_runnable(vcpu))
> continue;
>
> /*
> * Treat the target vCPU as being in-kernel if it has a pending
> * interrupt, as the vCPU trying to yield may be spinning
> * waiting on IPI delivery, i.e. the target vCPU is in-kernel
> * for the purposes of directed yield.
> */
> if (READ_ONCE(vcpu->preempted) && yield_to_kernel_mode &&
> !kvm_arch_dy_has_pending_interrupt(vcpu) &&
> !kvm_arch_vcpu_preempted_in_kernel(vcpu))
> continue;
>
> if (!kvm_vcpu_eligible_for_directed_yield(vcpu))
> continue;
>
> yielded = kvm_vcpu_yield_to(vcpu);
> if (yielded > 0) {
> WRITE_ONCE(kvm->last_boosted_vcpu, idx);
> break;
> } else if (yielded < 0 && !--try) {
> break;
> }
> }
>
> kvm_vcpu_set_in_spin_loop(me, false);
>
> /* Ensure vcpu is not eligible during next spinloop */
> kvm_vcpu_set_dy_eligible(me, false);
I didn't test it, but I reviewed it, and it seems sane and way easier
to read. I agree this code is easier to follow, speaking as someone with
little KVM background.
On Fri, May 10, 2024, Breno Leitao wrote:
> > IMO, reworking it to be like this is more straightforward:
> >
> > int nr_vcpus, start, i, idx, yielded;
> > struct kvm *kvm = me->kvm;
> > struct kvm_vcpu *vcpu;
> > int try = 3;
> >
> > nr_vcpus = atomic_read(&kvm->online_vcpus);
> > if (nr_vcpus < 2)
> > return;
> >
> > /* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> > smp_rmb();
>
> Why do you need this now? Isn't the RCU read lock in xa_load() enough?
No. RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
rcu_read_lock() may be a literal nop. There may be a _compiler_ barrier, but
smp_rmb() requires more than a compiler barrier on many architectures.
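For illustration, the pairing I'm referring to is roughly the following pattern
(a simplified sketch of the idea, not the exact KVM code):

	/* Creation side (conceptually kvm_vm_ioctl_create_vcpu()). */
	xa_store(&kvm->vcpu_array, vcpu->vcpu_idx, vcpu, GFP_KERNEL);
	smp_wmb();				/* publish the vCPU entry... */
	atomic_inc(&kvm->online_vcpus);		/* ...before bumping the count */

	/* Reader side (e.g. the proposed kvm_vcpu_on_spin() loop). */
	nr_vcpus = atomic_read(&kvm->online_vcpus);
	smp_rmb();				/* pairs with the smp_wmb() above */
	vcpu = xa_load(&kvm->vcpu_array, idx);	/* non-NULL for idx < nr_vcpus */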
And just as importantly, KVM shouldn't be relying on the inner details of other
code without a hard guarantee of that behavior. E.g. KVM does rely on
srcu_read_unlock() to provide a full memory barrier, but KVM does so through an
"official" API, smp_mb__after_srcu_read_unlock().
> > kvm_vcpu_set_in_spin_loop(me, true);
> >
> > start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> > for (i = 0; i < nr_vcpus; i++) {
>
> Why do you need to start at the last boosted vcpu? I.e., why not
> start at 0 and skip me->vcpu_idx and kvm->last_boosted_vcpu?
To provide round-robin style yielding in order to (hopefully) yield to the vCPU
that is holding a spinlock (or some other asset that is causing a vCPU to spin
in kernel mode).
E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted while
holding a spinlock, and all vCPUs are contending for said spinlock, then starting
at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0 yielding
back to vCPU1, indefinitely.
Starting at the last boosted vCPU instead results in vCPU0 yielding to vCPU1,
vCPU1 yielding to vCPU2, and vCPU2 yielding to vCPU3, thus getting back to the
vCPU that holds the spinlock soon-ish.
I'm sure more sophisticated/performant approaches are possible, but they would
likely be more complex, require persistent state (a.k.a. memory usage), and/or
need knowledge of the workload being run.
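To make the round-robin concrete with a made-up walk-through: with 4 vCPUs and
last_boosted_vcpu == 2, the proposed loop visits the indices 3, 0, 1, 2:

	/* Made-up walk-through of the proposed index computation. */
	int nr_vcpus = 4;
	int start = 2 + 1;			/* last_boosted_vcpu == 2 */
	int i, idx;

	for (i = 0; i < nr_vcpus; i++) {
		idx = (start + i) % nr_vcpus;	/* visits 3, 0, 1, 2 */
		/* skip "me", then try to yield to the vCPU at idx */
	}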
On Fri, May 10, 2024 at 07:39:14AM -0700, Sean Christopherson wrote:
> On Fri, May 10, 2024, Breno Leitao wrote:
> > > IMO, reworking it to be like this is more straightforward:
> > >
> > > int nr_vcpus, start, i, idx, yielded;
> > > struct kvm *kvm = me->kvm;
> > > struct kvm_vcpu *vcpu;
> > > int try = 3;
> > >
> > > nr_vcpus = atomic_read(&kvm->online_vcpus);
> > > if (nr_vcpus < 2)
> > > return;
> > >
> > > /* Pairs with the smp_wmb() in kvm_vm_ioctl_create_vcpu(). */
> > > smp_rmb();
> >
> > Why do you need this now? Isn't the RCU read lock in xa_load() enough?
>
> No. RCU read lock doesn't suffice, because on kernels without PREEMPT_COUNT
> rcu_read_lock() may be a literal nop. There may be a _compiler_ barrier, but
> smp_rmb() requires more than a compiler barrier on many architectures.
Makes sense. In fact, it seems worthwhile to have an explicit barrier between
the xarray modify operations and reading/storing online_vcpus.
> > > kvm_vcpu_set_in_spin_loop(me, true);
> > >
> > > start = READ_ONCE(kvm->last_boosted_vcpu) + 1;
> > > for (i = 0; i < nr_vcpus; i++) {
> >
> > Why do you need to start at the last boosted vcpu? I.e., why not
> > start at 0 and skip me->vcpu_idx and kvm->last_boosted_vcpu?
>
> To provide round-robin style yielding in order to (hopefully) yield to the vCPU
> that is holding a spinlock (or some other asset that is causing a vCPU to spin
> in kernel mode).
>
> E.g. if there are 4 vCPUs all running on a single CPU, vCPU3 gets preempted while
> holding a spinlock, and all vCPUs are contending for said spinlock, then starting
> at vCPU0 every time would result in vCPU1 yielding to vCPU0, and vCPU0 yielding
> back to vCPU1, indefinitely.
Makes sense, this would always privilege vCPU 0 at the expense of the last
vCPU. 100% clear. Thanks!