This series is a continuation of Sean's "[PATCH 0/2] VM: Fix a benign race
in kicking vCPUs" work and v2 of my "KVM: Optimize
kvm_make_vcpus_request_mask() a bit"/"KVM: x86: Fix stack-out-of-bounds
memory access from ioapic_write_indirect()" patchset.
Changes since v1:
- Drop the inappropriately added 'likely' from kvm_make_vcpus_request_mask() [Sean]
- Keep get_cpu()/put_cpu() and pass a 'current_cpu' parameter to
kvm_make_vcpu_request() as a minor optimization [Sean]
From Sean:
"Fix benign races when kicking vCPUs where the task doing the kicking can
consume a stale vcpu->cpu. The races are benign because of the
implications of task migration with respect to interrupts and being in
guest mode, but IMO they're worth fixing if only as an excuse to
document the flows.
Patch 2 is a tangentially related cleanup to prevent future me from
trying to get rid of the NULL check on the cpumask parameters, which
_looks_ like it can't ever be NULL, but has a subtle edge case due to the
way CONFIG_CPUMASK_OFFSTACK=y handles cpumasks."
Patch 3 is a minor optimization of kvm_make_vcpus_request_mask() for big
guests.
Patch 4 fixes a real problem: ioapic_write_indirect() makes an
out-of-bounds access to stack memory.
Sean Christopherson (2):
KVM: Clean up benign vcpu->cpu data races when kicking vCPUs
KVM: Guard cpumask NULL check with CONFIG_CPUMASK_OFFSTACK
Vitaly Kuznetsov (2):
KVM: Optimize kvm_make_vcpus_request_mask() a bit
KVM: x86: Fix stack-out-of-bounds memory access from
ioapic_write_indirect()
arch/x86/kvm/ioapic.c | 10 +++---
virt/kvm/kvm_main.c | 83 ++++++++++++++++++++++++++++++++-----------
2 files changed, 68 insertions(+), 25 deletions(-)
--
2.31.1
KASAN reports the following issue:
BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
Call Trace:
dump_stack+0xa5/0xe6
print_address_description.constprop.0+0x18/0x130
? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
__kasan_report.cold+0x7f/0x114
? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
kasan_report+0x38/0x50
kasan_check_range+0xf5/0x1d0
kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
? kvm_arch_exit+0x110/0x110 [kvm]
? sched_clock+0x5/0x10
ioapic_write_indirect+0x59f/0x9e0 [kvm]
? static_obj+0xc0/0xc0
? __lock_acquired+0x1d2/0x8c0
? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
The problem appears to be that 'vcpu_bitmap' is allocated as a single long
on the stack when it should really be KVM_MAX_VCPUS bits long. We also seem
to clear only the lower 16 bits of it with bitmap_zero() for no particular
reason (my guess is that the 'bitmap' and 'vcpu_bitmap' variables in
kvm_bitmap_or_dest_vcpus() caused the confusion: while the former is indeed
16 bits long, the latter should accommodate all possible vCPUs).
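As a sanity check of the sizing arithmetic, here is a small userspace sketch (a standalone re-derivation of BITS_TO_LONGS(), not the kernel header): a single unsigned long only covers 64 vCPU indices on a 64-bit kernel, while the fixed declaration reserves one bit per possible vCPU, rounded up to whole longs.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* Standalone mirror of the kernel's BITS_TO_LONGS() macro, for
 * illustration only: it rounds a bit count up to whole longs. */
#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* A single 'unsigned long' is one word of the bitmap: it can only track
 * vCPU indices 0..BITS_PER_LONG-1, hence the out-of-bounds write for
 * larger vCPU ids with the unfixed single-long declaration. */
static size_t longs_needed(size_t nr_vcpus)
{
	return BITS_TO_LONGS(nr_vcpus);
}
```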
Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
Reported-by: Dr. David Alan Gilbert <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
arch/x86/kvm/ioapic.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
index ff005fe738a4..92cd4b02e9ba 100644
--- a/arch/x86/kvm/ioapic.c
+++ b/arch/x86/kvm/ioapic.c
@@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
unsigned index;
bool mask_before, mask_after;
union kvm_ioapic_redirect_entry *e;
- unsigned long vcpu_bitmap;
+ unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
int old_remote_irr, old_delivery_status, old_dest_id, old_dest_mode;
switch (ioapic->ioregsel) {
@@ -384,9 +384,9 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
irq.shorthand = APIC_DEST_NOSHORT;
irq.dest_id = e->fields.dest_id;
irq.msi_redir_hint = false;
- bitmap_zero(&vcpu_bitmap, 16);
+ bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
- &vcpu_bitmap);
+ vcpu_bitmap);
if (old_dest_mode != e->fields.dest_mode ||
old_dest_id != e->fields.dest_id) {
/*
@@ -399,10 +399,10 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
kvm_lapic_irq_dest_mode(
!!e->fields.dest_mode);
kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
- &vcpu_bitmap);
+ vcpu_bitmap);
}
kvm_make_scan_ioapic_request_mask(ioapic->kvm,
- &vcpu_bitmap);
+ vcpu_bitmap);
} else {
kvm_make_scan_ioapic_request(ioapic->kvm);
}
--
2.31.1
Iterating over set bits in 'vcpu_bitmap' should be faster than going
through all vCPUs, especially when just a few bits are set.
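A userspace sketch of the idea (this mimics, but is not, the kernel's for_each_set_bit(); the function name and signature here are invented for illustration): skipping directly from one set bit to the next costs roughly one count-trailing-zeros per set bit, instead of one test per vCPU.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))

/* Walk only the set bits of 'bitmap' (nbits wide), writing their indices
 * to 'out'. Work scales with the number of set bits, not with nbits.
 * __builtin_ctzl() is the GCC/Clang count-trailing-zeros builtin. */
static size_t collect_set_bits(const unsigned long *bitmap, size_t nbits,
			       size_t *out, size_t out_len)
{
	size_t n = 0;

	for (size_t word = 0; word * BITS_PER_LONG < nbits; word++) {
		unsigned long w = bitmap[word];

		while (w) {
			size_t bit = word * BITS_PER_LONG +
				     (size_t)__builtin_ctzl(w);

			if (bit >= nbits)
				break;
			if (n < out_len)
				out[n] = bit;
			n++;
			w &= w - 1; /* clear the lowest set bit */
		}
	}
	return n;
}
```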
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
virt/kvm/kvm_main.c | 83 ++++++++++++++++++++++++++-------------------
1 file changed, 49 insertions(+), 34 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 82c5280dd5ce..e9c2ce2d273f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -257,49 +257,64 @@ static inline bool kvm_kick_many_cpus(const struct cpumask *cpus, bool wait)
return true;
}
+static void kvm_make_vcpu_request(struct kvm *kvm, struct kvm_vcpu *vcpu,
+ unsigned int req, cpumask_var_t tmp,
+ int current_cpu)
+{
+ int cpu = vcpu->cpu;
+
+ kvm_make_request(req, vcpu);
+
+ if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
+ return;
+
+ /*
+ * tmp can be NULL if cpumasks are allocated off stack, as allocation of
+ * the mask is deliberately not fatal and is handled by falling back to
+ * kicking all online CPUs.
+ */
+ if (IS_ENABLED(CONFIG_CPUMASK_OFFSTACK) && !tmp)
+ return;
+
+ /*
+ * Note, the vCPU could get migrated to a different pCPU at any point
+ * after kvm_request_needs_ipi(), which could result in sending an IPI
+ * to the previous pCPU. But, that's ok because the purpose of the IPI
+ * is to ensure the vCPU returns to OUTSIDE_GUEST_MODE, which is
+ * satisfied if the vCPU migrates. Entering READING_SHADOW_PAGE_TABLES
+ * after this point is also ok, as the requirement is only that KVM wait
+ * for vCPUs that were reading SPTEs _before_ any changes were
+ * finalized. See kvm_vcpu_kick() for more details on handling requests.
+ */
+ if (kvm_request_needs_ipi(vcpu, req)) {
+ cpu = READ_ONCE(vcpu->cpu);
+ if (cpu != -1 && cpu != current_cpu)
+ __cpumask_set_cpu(cpu, tmp);
+ }
+}
+
bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
struct kvm_vcpu *except,
unsigned long *vcpu_bitmap, cpumask_var_t tmp)
{
- int i, cpu, me;
+ int i, me;
struct kvm_vcpu *vcpu;
bool called;
me = get_cpu();
- kvm_for_each_vcpu(i, vcpu, kvm) {
- if ((vcpu_bitmap && !test_bit(i, vcpu_bitmap)) ||
- vcpu == except)
- continue;
-
- kvm_make_request(req, vcpu);
-
- if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
- continue;
-
- /*
- * tmp can be NULL if cpumasks are allocated off stack, as
- * allocation of the mask is deliberately not fatal and is
- * handled by falling back to kicking all online CPUs.
- */
- if (IS_ENABLED(CONFIG_CPUMASK_OFFSTACK) && !tmp)
- continue;
-
- /*
- * Note, the vCPU could get migrated to a different pCPU at any
- * point after kvm_request_needs_ipi(), which could result in
- * sending an IPI to the previous pCPU. But, that's ok because
- * the purpose of the IPI is to ensure the vCPU returns to
- * OUTSIDE_GUEST_MODE, which is satisfied if the vCPU migrates.
- * Entering READING_SHADOW_PAGE_TABLES after this point is also
- * ok, as the requirement is only that KVM wait for vCPUs that
- * were reading SPTEs _before_ any changes were finalized. See
- * kvm_vcpu_kick() for more details on handling requests.
- */
- if (kvm_request_needs_ipi(vcpu, req)) {
- cpu = READ_ONCE(vcpu->cpu);
- if (cpu != -1 && cpu != me)
- __cpumask_set_cpu(cpu, tmp);
+ if (vcpu_bitmap) {
+ for_each_set_bit(i, vcpu_bitmap, KVM_MAX_VCPUS) {
+ vcpu = kvm_get_vcpu(kvm, i);
+ if (!vcpu || vcpu == except)
+ continue;
+ kvm_make_vcpu_request(kvm, vcpu, req, tmp, me);
+ }
+ } else {
+ kvm_for_each_vcpu(i, vcpu, kvm) {
+ if (vcpu == except)
+ continue;
+ kvm_make_vcpu_request(kvm, vcpu, req, tmp, me);
}
}
--
2.31.1
From: Sean Christopherson <[email protected]>
Fix a benign data race reported by syzbot+KCSAN[*] by ensuring vcpu->cpu
is read exactly once, and by ensuring the vCPU is booted from guest mode
if kvm_arch_vcpu_should_kick() returns true. Fix a similar race in
kvm_make_vcpus_request_mask() by ensuring the vCPU is interrupted if
kvm_request_needs_ipi() returns true.
Reading vcpu->cpu before vcpu->mode (via kvm_arch_vcpu_should_kick() or
kvm_request_needs_ipi()) means the target vCPU could get migrated (change
vcpu->cpu) and enter !OUTSIDE_GUEST_MODE between reading vcpu->cpu and
reading vcpu->mode. If that happens, the kick/IPI will be sent to the
old pCPU, not the new pCPU that is now running the vCPU or reading SPTEs.
Although failing to kick the vCPU is not exactly ideal, practically
speaking it cannot cause a functional issue unless there is also a bug in
the caller, and any such bug would exist regardless of kvm_vcpu_kick()'s
behavior.
The purpose of sending an IPI is purely to get a vCPU into the host (or
out of reading SPTEs) so that the vCPU can recognize a change in state,
e.g. a KVM_REQ_* request. If the vCPU's handling of the state change is
required for correctness, KVM must ensure either the vCPU sees the change
before entering the guest, or that the sender sees the vCPU as running in
guest mode. All architectures handle this by (a) sending the request
before calling kvm_vcpu_kick() and (b) checking for requests _after_
setting vcpu->mode.
x86's READING_SHADOW_PAGE_TABLES has similar requirements; KVM needs to
ensure it kicks and waits for vCPUs that started reading SPTEs _before_
MMU changes were finalized, but any vCPU that starts reading after MMU
changes were finalized will see the new state and can continue on
uninterrupted.
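The (a)/(b) pairing can be modeled with a toy C11-atomics example (the names and types below are invented for illustration; KVM's real code uses its own primitives and barriers). Because each side publishes its own write before reading the other side's state, at least one of them is guaranteed to observe the change.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Toy model of the ordering described above; not KVM's actual code. */
enum toy_mode { TOY_OUTSIDE_GUEST_MODE, TOY_IN_GUEST_MODE };

struct toy_vcpu {
	_Atomic int mode;
	_Atomic unsigned requests;
};

/* Sender side: publish the request first, then decide whether a kick
 * (IPI) is needed based on the vCPU's mode. */
static bool sender_make_request_and_check(struct toy_vcpu *v, unsigned req)
{
	atomic_fetch_or(&v->requests, req);	/* (a) request before kick */
	return atomic_load(&v->mode) == TOY_IN_GUEST_MODE;
}

/* vCPU side: publish guest mode first, then re-check pending requests
 * before actually entering the guest. */
static bool vcpu_enter_and_check(struct toy_vcpu *v, unsigned req)
{
	atomic_store(&v->mode, TOY_IN_GUEST_MODE); /* (b) mode before check */
	return (atomic_load(&v->requests) & req) != 0;
}
```

In the single-threaded case this shows the hand-off: a request made while the vCPU is outside guest mode needs no kick, but is guaranteed to be seen when the vCPU next sets its mode and re-checks.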
For uses of kvm_vcpu_kick() that are not paired with a KVM_REQ_*, e.g.
x86's kvm_arch_sync_dirty_log(), the order of the kick must not be relied
upon for functional correctness, e.g. in the dirty log case, userspace
cannot assume it has a 100% complete log if vCPUs are still running.
All that said, eliminate the benign race since the cost of doing so is an
"extra" atomic cmpxchg() in the case where the target vCPU is loaded by
the current pCPU or is not loaded at all. I.e. the kick will be skipped
due to kvm_vcpu_exiting_guest_mode() seeing a compatible vcpu->mode as
opposed to the kick being skipped because of the cpu checks.
Keep the "cpu != me" checks even though they appear useless/impossible at
first glance. x86 processes guest IPI writes in a fast path that runs in
IN_GUEST_MODE, i.e. can call kvm_vcpu_kick() from IN_GUEST_MODE. And
calling kvm_vm_bugged()->kvm_make_vcpus_request_mask() from IN_GUEST or
READING_SHADOW_PAGE_TABLES is perfectly reasonable.
Note, a race with the cpu_online() check in kvm_vcpu_kick() likely
persists, e.g. the vCPU could exit guest mode and get offlined between
the cpu_online() check and the sending of smp_send_reschedule(). But,
the online check appears to exist only to avoid a WARN in x86's
native_smp_send_reschedule() that fires if the target CPU is not online.
The reschedule WARN exists because CPU offlining takes the CPU out of the
scheduling pool, i.e. the WARN is intended to detect the case where the
kernel attempts to schedule a task on an offline CPU. The actual sending
of the IPI is a non-issue as at worst it will simply be dropped on the
floor. In other words, KVM's usurping of the reschedule IPI could
theoretically trigger a WARN if the stars align, but there will be no
loss of functionality.
[*] https://syzkaller.appspot.com/bug?extid=cd4154e502f43f10808a
Cc: Venkatesh Srinivas <[email protected]>
Cc: Vitaly Kuznetsov <[email protected]>
Fixes: 97222cc83163 ("KVM: Emulate local APIC in kernel")
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
virt/kvm/kvm_main.c | 36 ++++++++++++++++++++++++++++--------
1 file changed, 28 insertions(+), 8 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 3e67c93ca403..786b914db98f 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -273,14 +273,26 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
continue;
kvm_make_request(req, vcpu);
- cpu = vcpu->cpu;
if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
continue;
- if (tmp != NULL && cpu != -1 && cpu != me &&
- kvm_request_needs_ipi(vcpu, req))
- __cpumask_set_cpu(cpu, tmp);
+ /*
+ * Note, the vCPU could get migrated to a different pCPU at any
+ * point after kvm_request_needs_ipi(), which could result in
+ * sending an IPI to the previous pCPU. But, that's ok because
+ * the purpose of the IPI is to ensure the vCPU returns to
+ * OUTSIDE_GUEST_MODE, which is satisfied if the vCPU migrates.
+ * Entering READING_SHADOW_PAGE_TABLES after this point is also
+ * ok, as the requirement is only that KVM wait for vCPUs that
+ * were reading SPTEs _before_ any changes were finalized. See
+ * kvm_vcpu_kick() for more details on handling requests.
+ */
+ if (tmp != NULL && kvm_request_needs_ipi(vcpu, req)) {
+ cpu = READ_ONCE(vcpu->cpu);
+ if (cpu != -1 && cpu != me)
+ __cpumask_set_cpu(cpu, tmp);
+ }
}
called = kvm_kick_many_cpus(tmp, !!(req & KVM_REQUEST_WAIT));
@@ -3309,16 +3321,24 @@ EXPORT_SYMBOL_GPL(kvm_vcpu_wake_up);
*/
void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
{
- int me;
- int cpu = vcpu->cpu;
+ int me, cpu;
if (kvm_vcpu_wake_up(vcpu))
return;
+ /*
+ * Note, the vCPU could get migrated to a different pCPU at any point
+ * after kvm_arch_vcpu_should_kick(), which could result in sending an
+ * IPI to the previous pCPU. But, that's ok because the purpose of the
+ * IPI is to force the vCPU to leave IN_GUEST_MODE, and migrating the
+ * vCPU also requires it to leave IN_GUEST_MODE.
+ */
me = get_cpu();
- if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
- if (kvm_arch_vcpu_should_kick(vcpu))
+ if (kvm_arch_vcpu_should_kick(vcpu)) {
+ cpu = READ_ONCE(vcpu->cpu);
+ if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
smp_send_reschedule(cpu);
+ }
put_cpu();
}
EXPORT_SYMBOL_GPL(kvm_vcpu_kick);
--
2.31.1
From: Sean Christopherson <[email protected]>
Check for a NULL cpumask_var_t when kicking multiple vCPUs if and only if
cpumasks are configured to be allocated off-stack. This is a meaningless
optimization, e.g. avoids a TEST+Jcc and TEST+CMOV on x86, but more
importantly helps document that the NULL check is necessary even though
all callers pass in a local variable.
No functional change intended.
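The subtle edge case can be sketched in plain C (hypothetical stand-in types; the real definitions live in the kernel's cpumask header): with CONFIG_CPUMASK_OFFSTACK=y the variable is a pointer that a failed allocation leaves NULL, whereas without it the variable is a one-element array, which decays to a valid, never-NULL pointer at the call site.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two shapes of cpumask_var_t. */
struct toy_cpumask { unsigned long bits[4]; };

/* CONFIG_CPUMASK_OFFSTACK=y: a pointer; allocation can fail. */
typedef struct toy_cpumask *toy_mask_offstack;

/* CONFIG_CPUMASK_OFFSTACK=n: a one-element array; a local variable of
 * this type decays to a valid pointer when passed to a callee. */
typedef struct toy_cpumask toy_mask_onstack[1];

/* The callee sees a pointer either way; only in the off-stack
 * configuration can that pointer actually be NULL. */
static bool mask_is_usable(const struct toy_cpumask *mask)
{
	return mask != NULL;
}
```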
Signed-off-by: Sean Christopherson <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
virt/kvm/kvm_main.c | 12 ++++++++++--
1 file changed, 10 insertions(+), 2 deletions(-)
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 786b914db98f..82c5280dd5ce 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -247,7 +247,7 @@ static void ack_flush(void *_completed)
static inline bool kvm_kick_many_cpus(const struct cpumask *cpus, bool wait)
{
- if (unlikely(!cpus))
+ if (IS_ENABLED(CONFIG_CPUMASK_OFFSTACK) && unlikely(!cpus))
cpus = cpu_online_mask;
if (cpumask_empty(cpus))
@@ -277,6 +277,14 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
if (!(req & KVM_REQUEST_NO_WAKEUP) && kvm_vcpu_wake_up(vcpu))
continue;
+ /*
+ * tmp can be NULL if cpumasks are allocated off stack, as
+ * allocation of the mask is deliberately not fatal and is
+ * handled by falling back to kicking all online CPUs.
+ */
+ if (IS_ENABLED(CONFIG_CPUMASK_OFFSTACK) && !tmp)
+ continue;
+
/*
* Note, the vCPU could get migrated to a different pCPU at any
* point after kvm_request_needs_ipi(), which could result in
@@ -288,7 +296,7 @@ bool kvm_make_vcpus_request_mask(struct kvm *kvm, unsigned int req,
* were reading SPTEs _before_ any changes were finalized. See
* kvm_vcpu_kick() for more details on handling requests.
*/
- if (tmp != NULL && kvm_request_needs_ipi(vcpu, req)) {
+ if (kvm_request_needs_ipi(vcpu, req)) {
cpu = READ_ONCE(vcpu->cpu);
if (cpu != -1 && cpu != me)
__cpumask_set_cpu(cpu, tmp);
--
2.31.1
On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
> KASAN reports the following issue:
>
> BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
>
> CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
> Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
> Call Trace:
> dump_stack+0xa5/0xe6
> print_address_description.constprop.0+0x18/0x130
> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> __kasan_report.cold+0x7f/0x114
> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> kasan_report+0x38/0x50
> kasan_check_range+0xf5/0x1d0
> kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
> ? kvm_arch_exit+0x110/0x110 [kvm]
> ? sched_clock+0x5/0x10
> ioapic_write_indirect+0x59f/0x9e0 [kvm]
> ? static_obj+0xc0/0xc0
> ? __lock_acquired+0x1d2/0x8c0
> ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
>
> The problem appears to be that 'vcpu_bitmap' is allocated as a single long
> on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
> the lower 16 bits of it with bitmap_zero() for no particular reason (my
> guess would be that 'bitmap' and 'vcpu_bitmap' variables in
> kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
> 16-bit long, the later should accommodate all possible vCPUs).
>
> Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
> Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
> Reported-by: Dr. David Alan Gilbert <[email protected]>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> arch/x86/kvm/ioapic.c | 10 +++++-----
> 1 file changed, 5 insertions(+), 5 deletions(-)
>
> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> index ff005fe738a4..92cd4b02e9ba 100644
> --- a/arch/x86/kvm/ioapic.c
> +++ b/arch/x86/kvm/ioapic.c
> @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> unsigned index;
> bool mask_before, mask_after;
> union kvm_ioapic_redirect_entry *e;
> - unsigned long vcpu_bitmap;
> + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
stack? This might hit us back when we increase KVM_MAX_VCPUS to
a few thousand VCPUs (I was planning to submit a patch for that
soon).
> int old_remote_irr, old_delivery_status, old_dest_id, old_dest_mode;
>
> switch (ioapic->ioregsel) {
> @@ -384,9 +384,9 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> irq.shorthand = APIC_DEST_NOSHORT;
> irq.dest_id = e->fields.dest_id;
> irq.msi_redir_hint = false;
> - bitmap_zero(&vcpu_bitmap, 16);
> + bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
> - &vcpu_bitmap);
> + vcpu_bitmap);
> if (old_dest_mode != e->fields.dest_mode ||
> old_dest_id != e->fields.dest_id) {
> /*
> @@ -399,10 +399,10 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> kvm_lapic_irq_dest_mode(
> !!e->fields.dest_mode);
> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
> - &vcpu_bitmap);
> + vcpu_bitmap);
> }
> kvm_make_scan_ioapic_request_mask(ioapic->kvm,
> - &vcpu_bitmap);
> + vcpu_bitmap);
> } else {
> kvm_make_scan_ioapic_request(ioapic->kvm);
> }
> --
> 2.31.1
>
--
Eduardo
Eduardo Habkost <[email protected]> writes:
> On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
>> KASAN reports the following issue:
>>
>> BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
>>
>> CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
>> Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
>> Call Trace:
>> dump_stack+0xa5/0xe6
>> print_address_description.constprop.0+0x18/0x130
>> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> __kasan_report.cold+0x7f/0x114
>> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> kasan_report+0x38/0x50
>> kasan_check_range+0xf5/0x1d0
>> kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
>> ? kvm_arch_exit+0x110/0x110 [kvm]
>> ? sched_clock+0x5/0x10
>> ioapic_write_indirect+0x59f/0x9e0 [kvm]
>> ? static_obj+0xc0/0xc0
>> ? __lock_acquired+0x1d2/0x8c0
>> ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
>>
>> The problem appears to be that 'vcpu_bitmap' is allocated as a single long
>> on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
>> the lower 16 bits of it with bitmap_zero() for no particular reason (my
>> guess would be that 'bitmap' and 'vcpu_bitmap' variables in
>> kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
>> 16-bit long, the later should accommodate all possible vCPUs).
>>
>> Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
>> Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
>> Reported-by: Dr. David Alan Gilbert <[email protected]>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> arch/x86/kvm/ioapic.c | 10 +++++-----
>> 1 file changed, 5 insertions(+), 5 deletions(-)
>>
>> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> index ff005fe738a4..92cd4b02e9ba 100644
>> --- a/arch/x86/kvm/ioapic.c
>> +++ b/arch/x86/kvm/ioapic.c
>> @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>> unsigned index;
>> bool mask_before, mask_after;
>> union kvm_ioapic_redirect_entry *e;
>> - unsigned long vcpu_bitmap;
>> + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>
> Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
> stack? This might hit us back when we increase KVM_MAX_VCPUS to
> a few thousand VCPUs (I was planning to submit a patch for that
> soon).
What's the short- or mid-term target?
Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
the target much higher than that, we will need to either switch to
dynamic allocation or e.g. use pre-allocated per-CPU variables and make
this a preempt-disabled region. I, however, would like to understand if
the problem with allocating this from stack is real or not first.
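For concreteness, the bits-versus-bytes arithmetic can be checked with a small helper (a standalone sketch; BITS_TO_LONGS is re-derived here rather than taken from the kernel headers):

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

#define BITS_PER_LONG (CHAR_BIT * sizeof(unsigned long))
#define BITS_TO_LONGS(nr) (((nr) + BITS_PER_LONG - 1) / BITS_PER_LONG)

/* Stack footprint of a one-bit-per-vCPU bitmap, rounded up to longs. */
static size_t vcpu_bitmap_bytes(size_t nr_vcpus)
{
	return BITS_TO_LONGS(nr_vcpus) * sizeof(unsigned long);
}
```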
>
>
>> int old_remote_irr, old_delivery_status, old_dest_id, old_dest_mode;
>>
>> switch (ioapic->ioregsel) {
>> @@ -384,9 +384,9 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>> irq.shorthand = APIC_DEST_NOSHORT;
>> irq.dest_id = e->fields.dest_id;
>> irq.msi_redir_hint = false;
>> - bitmap_zero(&vcpu_bitmap, 16);
>> + bitmap_zero(vcpu_bitmap, KVM_MAX_VCPUS);
>> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
>> - &vcpu_bitmap);
>> + vcpu_bitmap);
>> if (old_dest_mode != e->fields.dest_mode ||
>> old_dest_id != e->fields.dest_id) {
>> /*
>> @@ -399,10 +399,10 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>> kvm_lapic_irq_dest_mode(
>> !!e->fields.dest_mode);
>> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq,
>> - &vcpu_bitmap);
>> + vcpu_bitmap);
>> }
>> kvm_make_scan_ioapic_request_mask(ioapic->kvm,
>> - &vcpu_bitmap);
>> + vcpu_bitmap);
>> } else {
>> kvm_make_scan_ioapic_request(ioapic->kvm);
>> }
>> --
>> 2.31.1
>>
--
Vitaly
On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
>
> Eduardo Habkost <[email protected]> writes:
>
> > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
> >> KASAN reports the following issue:
> >>
> >> BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> >> Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
> >>
> >> CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
> >> Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
> >> Call Trace:
> >> dump_stack+0xa5/0xe6
> >> print_address_description.constprop.0+0x18/0x130
> >> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> >> __kasan_report.cold+0x7f/0x114
> >> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> >> kasan_report+0x38/0x50
> >> kasan_check_range+0xf5/0x1d0
> >> kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> >> kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
> >> ? kvm_arch_exit+0x110/0x110 [kvm]
> >> ? sched_clock+0x5/0x10
> >> ioapic_write_indirect+0x59f/0x9e0 [kvm]
> >> ? static_obj+0xc0/0xc0
> >> ? __lock_acquired+0x1d2/0x8c0
> >> ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
> >>
> >> The problem appears to be that 'vcpu_bitmap' is allocated as a single long
> >> on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
> >> the lower 16 bits of it with bitmap_zero() for no particular reason (my
> >> guess would be that 'bitmap' and 'vcpu_bitmap' variables in
> >> kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
> >> 16-bit long, the later should accommodate all possible vCPUs).
> >>
> >> Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
> >> Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
> >> Reported-by: Dr. David Alan Gilbert <[email protected]>
> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> >> ---
> >> arch/x86/kvm/ioapic.c | 10 +++++-----
> >> 1 file changed, 5 insertions(+), 5 deletions(-)
> >>
> >> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> >> index ff005fe738a4..92cd4b02e9ba 100644
> >> --- a/arch/x86/kvm/ioapic.c
> >> +++ b/arch/x86/kvm/ioapic.c
> >> @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> >> unsigned index;
> >> bool mask_before, mask_after;
> >> union kvm_ioapic_redirect_entry *e;
> >> - unsigned long vcpu_bitmap;
> >> + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> >
> > Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
> > stack? This might hit us back when we increase KVM_MAX_VCPUS to
> > a few thousand VCPUs (I was planning to submit a patch for that
> > soon).
>
> What's the short- or mid-term target?
Short term target is 2048 (which was already tested). Mid-term target
(not tested yet) is 4096, maybe 8192.
>
> Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
> that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
> the target much higher than that, we will need to either switch to
> dynamic allocation or e.g. use pre-allocated per-CPU variables and make
> this a preempt-disabled region. I, however, would like to understand if
> the problem with allocating this from stack is real or not first.
Is 256 bytes too much here, or would that be OK?
--
Eduardo
Eduardo Habkost <[email protected]> writes:
> On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
>>
>> Eduardo Habkost <[email protected]> writes:
>>
>> > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
>> >> KASAN reports the following issue:
>> >>
>> >> BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> >> Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
>> >>
>> >> CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
>> >> Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
>> >> Call Trace:
>> >> dump_stack+0xa5/0xe6
>> >> print_address_description.constprop.0+0x18/0x130
>> >> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> >> __kasan_report.cold+0x7f/0x114
>> >> ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> >> kasan_report+0x38/0x50
>> >> kasan_check_range+0xf5/0x1d0
>> >> kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
>> >> kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
>> >> ? kvm_arch_exit+0x110/0x110 [kvm]
>> >> ? sched_clock+0x5/0x10
>> >> ioapic_write_indirect+0x59f/0x9e0 [kvm]
>> >> ? static_obj+0xc0/0xc0
>> >> ? __lock_acquired+0x1d2/0x8c0
>> >> ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
>> >>
>> >> The problem appears to be that 'vcpu_bitmap' is allocated as a single long
>> >> on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
>> >> the lower 16 bits of it with bitmap_zero() for no particular reason (my
>> >> guess would be that 'bitmap' and 'vcpu_bitmap' variables in
>> >> kvm_bitmap_or_dest_vcpus() caused the confusion: while the later is indeed
>> >> 16-bit long, the later should accommodate all possible vCPUs).
>> >>
>> >> Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
>> >> Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
>> >> Reported-by: Dr. David Alan Gilbert <[email protected]>
>> >> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> >> ---
>> >> arch/x86/kvm/ioapic.c | 10 +++++-----
>> >> 1 file changed, 5 insertions(+), 5 deletions(-)
>> >>
>> >> diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> >> index ff005fe738a4..92cd4b02e9ba 100644
>> >> --- a/arch/x86/kvm/ioapic.c
>> >> +++ b/arch/x86/kvm/ioapic.c
>> >> @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>> >> unsigned index;
>> >> bool mask_before, mask_after;
>> >> union kvm_ioapic_redirect_entry *e;
>> >> - unsigned long vcpu_bitmap;
>> >> + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>> >
>> > Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
>> > stack? This might hit us back when we increase KVM_MAX_VCPUS to
>> > a few thousand VCPUs (I was planning to submit a patch for that
>> > soon).
>>
>> What's the short- or mid-term target?
>
> Short term target is 2048 (which was already tested). Mid-term target
> (not tested yet) is 4096, maybe 8192.
>
>>
>> Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
>> that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
>> the target much higher than that, we will need to either switch to
>> dynamic allocation or e.g. use pre-allocated per-CPU variables and make
>> this a preempt-disabled region. I, however, would like to understand if
>> the problem with allocating this from stack is real or not first.
>
> Is 256 bytes too much here, or would that be OK?
>
AFAIR, on x86_64 the stack size (both regular and irq) is 16k; eating 256
bytes of it is probably OK. I'd start worrying when we go to 1024 (8k
vCPUs) and above (but this is subjective of course).
--
Vitaly
On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> Eduardo Habkost <[email protected]> writes:
>
> > On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
> > > Eduardo Habkost <[email protected]> writes:
> > >
> > > > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
> > > > > KASAN reports the following issue:
> > > > >
> > > > > BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
> > > > >
> > > > > CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
> > > > > Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
> > > > > Call Trace:
> > > > > dump_stack+0xa5/0xe6
> > > > > print_address_description.constprop.0+0x18/0x130
> > > > > ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > __kasan_report.cold+0x7f/0x114
> > > > > ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > kasan_report+0x38/0x50
> > > > > kasan_check_range+0xf5/0x1d0
> > > > > kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
> > > > > ? kvm_arch_exit+0x110/0x110 [kvm]
> > > > > ? sched_clock+0x5/0x10
> > > > > ioapic_write_indirect+0x59f/0x9e0 [kvm]
> > > > > ? static_obj+0xc0/0xc0
> > > > > ? __lock_acquired+0x1d2/0x8c0
> > > > > ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
> > > > >
> > > > > The problem appears to be that 'vcpu_bitmap' is allocated as a single long
> > > > > on stack and it should really be KVM_MAX_VCPUS long. We also seem to clear
> > > > > the lower 16 bits of it with bitmap_zero() for no particular reason (my
> > > > > guess would be that 'bitmap' and 'vcpu_bitmap' variables in
> > > > > kvm_bitmap_or_dest_vcpus() caused the confusion: while the former is indeed
> > > > > 16 bits long, the latter should accommodate all possible vCPUs).
> > > > >
> > > > > Fixes: 7ee30bc132c6 ("KVM: x86: deliver KVM IOAPIC scan request to target vCPUs")
> > > > > Fixes: 9a2ae9f6b6bb ("KVM: x86: Zero the IOAPIC scan request dest vCPUs bitmap")
> > > > > Reported-by: Dr. David Alan Gilbert <[email protected]>
> > > > > Signed-off-by: Vitaly Kuznetsov <[email protected]>
> > > > > ---
> > > > > arch/x86/kvm/ioapic.c | 10 +++++-----
> > > > > 1 file changed, 5 insertions(+), 5 deletions(-)
> > > > >
> > > > > diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> > > > > index ff005fe738a4..92cd4b02e9ba 100644
> > > > > --- a/arch/x86/kvm/ioapic.c
> > > > > +++ b/arch/x86/kvm/ioapic.c
> > > > > @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> > > > > unsigned index;
> > > > > bool mask_before, mask_after;
> > > > > union kvm_ioapic_redirect_entry *e;
> > > > > - unsigned long vcpu_bitmap;
> > > > > + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
> > > >
> > > > Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
> > > > stack? This might hit us back when we increase KVM_MAX_VCPUS to
> > > > a few thousand VCPUs (I was planning to submit a patch for that
> > > > soon).
> > >
> > > What's the short- or mid-term target?
> >
> > Short term target is 2048 (which was already tested). Mid-term target
> > (not tested yet) is 4096, maybe 8192.
> >
> > > Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
> > > that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
> > > the target is much higher than that, we will need to either switch to
> > > dynamic allocation or e.g. use pre-allocated per-CPU variables and make
> > > this a preempt-disabled region. I, however, would like to understand if
> > > the problem with allocating this from stack is real or not first.
> >
> > Is 256 bytes too much here, or would that be OK?
> >
>
> AFAIR, on x86_64 stack size (both regular and irq) is 16k, eating 256
> bytes of it is probably OK. I'd start worrying when we go to 1024 (8k
> vCPUs) and above (but this is subjective of course).
>
Hi,
Not a classical review but,
I did some digital archaeology with this one, trying to understand what is going on:
I think that the 16-bit vcpu bitmap is due to the fact that the IOAPIC spec states that
it can address up to 16 CPUs in physical destination mode.
In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
which KVM hardcodes, it is also only possible to address 8 CPUs.
However(!) in flat cluster mode, the logical apic id is split in two.
We have 16 clusters and each has 4 CPUs, so it is possible to address 64 CPUs,
and unlike the logical ID, KVM does honour the cluster ID,
thus one can assign, say, cluster ID 0 to any vCPU.
Let's look at ioapic_write_indirect.
It does:
-> bitmap_zero(&vcpu_bitmap, 16);
-> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
-> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
since we pass all 8 bits of the destination even when it is physical.
Let's examine kvm_bitmap_or_dest_vcpus:
-> It calls kvm_apic_map_get_dest_lapic, which
-> for physical destinations just sets the bitmap, which can overflow
if we pass it an 8-bit destination (which basically includes reserved bits + a 4-bit destination).
-> For a logical APIC ID, it seems to truncate the result to 16 bits, which isn't correct as I explained
above, but should not overflow the result.
-> If the call to kvm_apic_map_get_dest_lapic fails, it goes over all vCPUs and tries to match the destination.
This can overflow as well.
I also don't like that ioapic_write_indirect calls kvm_bitmap_or_dest_vcpus twice,
the second time with 'old_dest_id'.
I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
function changing them.
I think only the guest can change them, so maybe the code deals with the guest changing them
while the code is running from a different vcpu?
The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
Nitesh Narayan Lal, maybe you remember something about it?
Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
I haven't dug into them, but these don't seem to be IOAPIC related and I think
can overwrite the stack as well.
The whole thing seems a bit busted IMHO.
On the topic of enlarging these bitmaps to cover all vCPUs.
I also share the worry of having the whole bitmap on kernel stack for very large number of vcpus.
Maybe we need to abstract and use a bitmap for a sane number of vcpus,
and otherwise use a 'kmalloc'ed buffer?
Also in theory large bitmaps might affect performance a bit.
My 0.2 cents.
Best regards,
Maxim Levitsky
On Tue, Aug 24, 2021, Maxim Levitsky wrote:
> On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> > Eduardo Habkost <[email protected]> writes:
> >
> > > On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
> > > > Eduardo Habkost <[email protected]> writes:
> > > >
> > > > > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
> > > > > > diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
> > > > > > index ff005fe738a4..92cd4b02e9ba 100644
> > > > > > --- a/arch/x86/kvm/ioapic.c
> > > > > > +++ b/arch/x86/kvm/ioapic.c
> > > > > > @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
> > > > > > unsigned index;
> > > > > > bool mask_before, mask_after;
> > > > > > union kvm_ioapic_redirect_entry *e;
> > > > > > - unsigned long vcpu_bitmap;
> > > > > > + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
The preferred pattern is:
DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
> > > > >
> > > > > Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
> > > > > stack? This might hit us back when we increase KVM_MAX_VCPUS to
> > > > > a few thousand VCPUs (I was planning to submit a patch for that
> > > > > soon).
> > > >
> > > > What's the short- or mid-term target?
> > >
> > > Short term target is 2048 (which was already tested). Mid-term target
> > > (not tested yet) is 4096, maybe 8192.
> > >
> > > > Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
> > > > that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
> > > > the target is much higher than that, we will need to either switch to
> > > > dynamic allocation or e.g. use pre-allocated per-CPU variables and make
> > > > this a preempt-disabled region. I, however, would like to understand if
> > > > the problem with allocating this from stack is real or not first.
> > >
> > > Is 256 bytes too much here, or would that be OK?
> > >
> >
> > AFAIR, on x86_64 stack size (both regular and irq) is 16k, eating 256
Don't forget i386! :-)
> > bytes of it is probably OK. I'd start worrying when we go to 1024 (8k
> > vCPUs) and above (but this is subjective of course).
256 is fine, 1024 would indeed be problematic, e.g. CONFIG_FRAME_WARN defaults to
1024 on 32-bit kernels. That's not a hard limit per se, but ideally KVM will stay
warn-free on all flavors of x86.
> On the topic of enlarging these bitmaps to cover all vCPUs.
>
> I also share the worry of having the whole bitmap on kernel stack for very
> large number of vcpus.
> Maybe we need to abstract and use a bitmap for a sane number of vcpus,
> and use otherwise a 'kmalloc'ed buffer?
That's a future problem. More specifically, it's the problem of whoever wants to
push KVM_MAX_VCPUS > ~2048. There are a lot of ways to solve the problem, e.g.
this I/O APIC code runs under a spinlock so a dedicated bitmap in struct kvm_ioapic
could be used to avoid a large stack allocation.
> Also in theory large bitmaps might affect performance a bit.
Maybe. The only possible degradation for small VMs, i.e. VMs that don't need the
full bitmap, is if the compiler puts other variables below the bitmap and causes
sub-optimal cache line usage. But I suspect/hope the compiler is smart enough to
use GPRs and/or organize the local variables on the stack so that doesn't happen.
Maxim Levitsky <[email protected]> writes:
> On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
...
>
> Not a classical review but,
> I did some digital archaeology with this one, trying to understand what is going on:
>
>
> I think that 16 bit vcpu bitmap is due to the fact that IOAPIC spec states that
> it can address up to 16 cpus in physical destination mode.
>
> In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
> which KVM hardcodes, it is also only possible to address 8 CPUs.
>
> However(!) in flat cluster mode, the logical apic id is split in two.
> We have 16 clusters and each have 4 CPUs, so it is possible to address 64 CPUs,
> and unlike the logical ID, the KVM does honour cluster ID,
> thus one can stick say cluster ID 0 to any vCPU.
>
>
> Let's look at ioapic_write_indirect.
> It does:
>
> -> bitmap_zero(&vcpu_bitmap, 16);
> -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
> -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
>
>
> When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
> since we pass all 8 bit of the destination even when it is physical.
>
>
> Lets examine the kvm_bitmap_or_dest_vcpus:
>
> -> It calls the kvm_apic_map_get_dest_lapic which
>
> -> for physical destinations, it just sets the bitmap, which can overflow
> if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
>
>
> -> For logical apic ID, it seems to truncate the result to 16 bit, which isn't correct as I explained
> above, but should not overflow the result.
>
>
> -> If call to kvm_apic_map_get_dest_lapic fails, it goes over all vcpus and tries to match the destination
> This can overflow as well.
>
>
> I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
> and second time with 'old_dest_id'
>
> I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
> function changing them.
> I think only the guest can change them, so maybe the code deals with the guest changing them
> while the code is running from a different vcpu?
>
> The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
> Nitesh Narayan Lal, maybe you remember something about it?
>
Before posting this patch I've contacted Nitesh privately, he's
currently on vacation but will take a look when he gets back.
>
> Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
>
> It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
> and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
>
> I haven't dug into them, but these don't seem to be IOAPIC related and I think
> can overwrite the stack as well.
I'm no expert in this code but when writing the patch I somehow
convinced myself that a single unsigned long is always enough. I think
that for cluster mode 'bitmap' needs 64-bits (and it is *not* a
vcpu_bitmap, we need to convert). I may be completely wrong of course
but in any case this is a different issue. In ioapic_write_indirect() we
have 'vcpu_bitmap' which should certainly be longer than 64 bits.
--
Vitaly
Sean Christopherson <[email protected]> writes:
> On Tue, Aug 24, 2021, Maxim Levitsky wrote:
>> On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
>> > Eduardo Habkost <[email protected]> writes:
>> >
>> > > On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
>> > > > Eduardo Habkost <[email protected]> writes:
>> > > >
>> > > > > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
>> > > > > > diff --git a/arch/x86/kvm/ioapic.c b/arch/x86/kvm/ioapic.c
>> > > > > > index ff005fe738a4..92cd4b02e9ba 100644
>> > > > > > --- a/arch/x86/kvm/ioapic.c
>> > > > > > +++ b/arch/x86/kvm/ioapic.c
>> > > > > > @@ -319,7 +319,7 @@ static void ioapic_write_indirect(struct kvm_ioapic *ioapic, u32 val)
>> > > > > > unsigned index;
>> > > > > > bool mask_before, mask_after;
>> > > > > > union kvm_ioapic_redirect_entry *e;
>> > > > > > - unsigned long vcpu_bitmap;
>> > > > > > + unsigned long vcpu_bitmap[BITS_TO_LONGS(KVM_MAX_VCPUS)];
>
> The preferred pattern is:
>
> DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS);
>
Yes, thanks!
>> > > > >
>> > > > > Is there a way to avoid this KVM_MAX_VCPUS-sized variable on the
>> > > > > stack? This might hit us back when we increase KVM_MAX_VCPUS to
>> > > > > a few thousand VCPUs (I was planning to submit a patch for that
>> > > > > soon).
>> > > >
>> > > > What's the short- or mid-term target?
>> > >
>> > > Short term target is 2048 (which was already tested). Mid-term target
>> > > (not tested yet) is 4096, maybe 8192.
>> > >
>> > > > Note, we're allocating KVM_MAX_VCPUS bits (not bytes!) here, this means
>> > > > that for e.g. 2048 vCPUs we need 256 bytes of the stack only. In case
>> > > > the target is much higher than that, we will need to either switch to
>> > > > dynamic allocation or e.g. use pre-allocated per-CPU variables and make
>> > > > this a preempt-disabled region. I, however, would like to understand if
>> > > > the problem with allocating this from stack is real or not first.
>> > >
>> > > Is 256 bytes too much here, or would that be OK?
>> > >
>> >
>> > AFAIR, on x86_64 stack size (both regular and irq) is 16k, eating 256
>
> Don't forget i386! :-)
>
I'm not forgetting, I'm deliberately ignoring its existence :-)
Whoever tries to raise KVM_MAX_VCPUS from '288' may limit the change to
x86_64; I seriously doubt 32-bit users want to run guests with thousands
of CPUs.
>> > bytes of it is probably OK. I'd start worrying when we go to 1024 (8k
>> > vCPUs) and above (but this is subjective of course).
>
> 256 is fine, 1024 would indeed be problematic, e.g. CONFIG_FRAME_WARN defaults to
> 1024 on 32-bit kernels. That's not a hard limit per se, but ideally KVM will stay
> warn-free on all flavors of x86.
Thanks for the CONFIG_FRAME_WARN pointer, I said '1024' off the top of my
head, but it seems the number wasn't random after all :)
>
>> On the topic of enlarging these bitmaps to cover all vCPUs.
>>
>> I also share the worry of having the whole bitmap on kernel stack for very
>> large number of vcpus.
>> Maybe we need to abstract and use a bitmap for a sane number of vcpus,
>> and use otherwise a 'kmalloc'ed buffer?
>
> That's a future problem. More specifically, it's the problem of whoever wants to
> push KVM_MAX_VCPUS > ~2048. There are a lot of ways to solve the problem, e.g.
> this I/O APIC code runs under a spinlock so a dedicated bitmap in struct kvm_ioapic
> could be used to avoid a large stack allocation.
+1
>
>> Also in theory large bitmaps might affect performance a bit.
>
> Maybe. The only possible degradation for small VMs, i.e. VMs that don't need the
> full bitmap, is if the compiler puts other variables below the bitmap and causes
> sub-optimal cache line usage. But I suspect/hope the compiler is smart enough to
> use GPRs and/or organize the local variables on the stack so that doesn't happen.
>
--
Vitaly
On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> ...
> > Not a classical review but,
> > I did some digital archaeology with this one, trying to understand what is going on:
> >
> >
> > I think that 16 bit vcpu bitmap is due to the fact that IOAPIC spec states that
> > it can address up to 16 cpus in physical destination mode.
> >
> > In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
> > which KVM hardcodes, it is also only possible to address 8 CPUs.
> >
> > However(!) in flat cluster mode, the logical apic id is split in two.
> > We have 16 clusters and each have 4 CPUs, so it is possible to address 64 CPUs,
> > and unlike the logical ID, the KVM does honour cluster ID,
> > thus one can stick say cluster ID 0 to any vCPU.
> >
> >
> > Let's look at ioapic_write_indirect.
> > It does:
> >
> > -> bitmap_zero(&vcpu_bitmap, 16);
> > -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
> > -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
> >
> >
> > When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
> > since we pass all 8 bit of the destination even when it is physical.
> >
> >
> > Lets examine the kvm_bitmap_or_dest_vcpus:
> >
> > -> It calls the kvm_apic_map_get_dest_lapic which
> >
> > -> for physical destinations, it just sets the bitmap, which can overflow
> > if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
> >
> >
> > -> For logical apic ID, it seems to truncate the result to 16 bit, which isn't correct as I explained
> > above, but should not overflow the result.
> >
> >
> > -> If call to kvm_apic_map_get_dest_lapic fails, it goes over all vcpus and tries to match the destination
> > This can overflow as well.
> >
> >
> > I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
> > and second time with 'old_dest_id'
> >
> > I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
> > function changing them.
> > I think only the guest can change them, so maybe the code deals with the guest changing them
> > while the code is running from a different vcpu?
> >
> > The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
> > Nitesh Narayan Lal, maybe you remember something about it?
> >
>
> Before posting this patch I've contacted Nitesh privately, he's
> currently on vacation but will take a look when he gets back.
>
> > Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
> >
> > It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
> > and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
> >
> > I haven't dug into them, but these don't seem to be IOAPIC related and I think
> > can overwrite the stack as well.
>
> I'm no expert in this code but when writing the patch I somehow
> convinced myself that a single unsigned long is always enough. I think
> that for cluster mode 'bitmap' needs 64-bits (and it is *not* a
> vcpu_bitmap, we need to convert). I may be completely wrong of course
> but in any case this is a different issue. In ioapic_write_indirect() we
> have 'vcpu_bitmap' which should certainly be longer than 64 bits.
The code which I mentioned in 'other callers' is, as far as I can see, not IOAPIC related.
For regular local APIC all bets are off, any vCPU and apic ID are possible
(xapic I think limits apic id to 255 but x2apic doesn't).
I strongly suspect that this code can overflow as well.
Best regards,
Maxim Levitsky
>
Maxim Levitsky <[email protected]> writes:
> On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
>> Maxim Levitsky <[email protected]> writes:
>>
>> > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
>> ...
>> > Not a classical review but,
>> > I did some digital archaeology with this one, trying to understand what is going on:
>> >
>> >
>> > I think that 16 bit vcpu bitmap is due to the fact that IOAPIC spec states that
>> > it can address up to 16 cpus in physical destination mode.
>> >
>> > In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
>> > which KVM hardcodes, it is also only possible to address 8 CPUs.
>> >
>> > However(!) in flat cluster mode, the logical apic id is split in two.
>> > We have 16 clusters and each have 4 CPUs, so it is possible to address 64 CPUs,
>> > and unlike the logical ID, the KVM does honour cluster ID,
>> > thus one can stick say cluster ID 0 to any vCPU.
>> >
>> >
>> > Let's look at ioapic_write_indirect.
>> > It does:
>> >
>> > -> bitmap_zero(&vcpu_bitmap, 16);
>> > -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
>> > -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
>> >
>> >
>> > When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
>> > since we pass all 8 bit of the destination even when it is physical.
>> >
>> >
>> > Lets examine the kvm_bitmap_or_dest_vcpus:
>> >
>> > -> It calls the kvm_apic_map_get_dest_lapic which
>> >
>> > -> for physical destinations, it just sets the bitmap, which can overflow
>> > if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
>> >
>> >
>> > -> For logical apic ID, it seems to truncate the result to 16 bit, which isn't correct as I explained
>> > above, but should not overflow the result.
>> >
>> >
>> > -> If call to kvm_apic_map_get_dest_lapic fails, it goes over all vcpus and tries to match the destination
>> > This can overflow as well.
>> >
>> >
>> > I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
>> > and second time with 'old_dest_id'
>> >
>> > I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
>> > function changing them.
>> > I think only the guest can change them, so maybe the code deals with the guest changing them
>> > while the code is running from a different vcpu?
>> >
>> > The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
>> > Nitesh Narayan Lal, maybe you remember something about it?
>> >
>>
>> Before posting this patch I've contacted Nitesh privately, he's
>> currently on vacation but will take a look when he gets back.
>>
>> > Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
>> >
>> > It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
>> > and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
>> >
>> > I haven't dug into them, but these don't seem to be IOAPIC related and I think
>> > can overwrite the stack as well.
>>
>> I'm no expert in this code but when writing the patch I somehow
>> convinced myself that a single unsigned long is always enough. I think
>> that for cluster mode 'bitmap' needs 64-bits (and it is *not* a
>> vcpu_bitmap, we need to convert). I may be completely wrong of course
>> but in any case this is a different issue. In ioapic_write_indirect() we
>> have 'vcpu_bitmap' which should certainly be longer than 64 bits.
>
>
> This code which I mentioned in 'other callers' as far as I see is not IOAPIC related.
> For regular local APIC all bets are off, any vCPU and apic ID are possible
> (xapic I think limits apic id to 255 but x2apic doesn't).
>
> I strongly suspect that this code can overflow as well.
I've probably missed something but I don't see how
kvm_apic_map_get_dest_lapic() can set bits above 64 in 'bitmap'. If it
can, then we have a problem indeed.
--
Vitaly
On Wed, 2021-08-25 at 11:43 +0200, Vitaly Kuznetsov wrote:
> Maxim Levitsky <[email protected]> writes:
>
> > On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
> > > Maxim Levitsky <[email protected]> writes:
> > >
> > > > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> > > ...
> > > > Not a classical review but,
> > > > I did some digital archaeology with this one, trying to understand what is going on:
> > > >
> > > >
> > > > I think that 16 bit vcpu bitmap is due to the fact that IOAPIC spec states that
> > > > it can address up to 16 cpus in physical destination mode.
> > > >
> > > > In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
> > > > which KVM hardcodes, it is also only possible to address 8 CPUs.
> > > >
> > > > However(!) in flat cluster mode, the logical apic id is split in two.
> > > > We have 16 clusters and each have 4 CPUs, so it is possible to address 64 CPUs,
> > > > and unlike the logical ID, the KVM does honour cluster ID,
> > > > thus one can stick say cluster ID 0 to any vCPU.
> > > >
> > > >
> > > > Let's look at ioapic_write_indirect.
> > > > It does:
> > > >
> > > > -> bitmap_zero(&vcpu_bitmap, 16);
> > > > -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
> > > > -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
> > > >
> > > >
> > > > When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
> > > > since we pass all 8 bit of the destination even when it is physical.
> > > >
> > > >
> > > > Lets examine the kvm_bitmap_or_dest_vcpus:
> > > >
> > > > -> It calls the kvm_apic_map_get_dest_lapic which
> > > >
> > > > -> for physical destinations, it just sets the bitmap, which can overflow
> > > > if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
> > > >
> > > >
> > > > -> For logical apic ID, it seems to truncate the result to 16 bit, which isn't correct as I explained
> > > > above, but should not overflow the result.
> > > >
> > > >
> > > > -> If call to kvm_apic_map_get_dest_lapic fails, it goes over all vcpus and tries to match the destination
> > > > This can overflow as well.
> > > >
> > > >
> > > > I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
> > > > and second time with 'old_dest_id'
> > > >
> > > > I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
> > > > function changing them.
> > > > I think only the guest can change them, so maybe the code deals with the guest changing them
> > > > while the code is running from a different vcpu?
> > > >
> > > > The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
> > > > Nitesh Narayan Lal, maybe you remember something about it?
> > > >
> > >
> > > Before posting this patch I've contacted Nitesh privately, he's
> > > currently on vacation but will take a look when he gets back.
> > >
> > > > Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
> > > >
> > > > It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
> > > > and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
> > > >
> > > > I haven't dug into them, but these don't seem to be IOAPIC related and I think
> > > > can overwrite the stack as well.
> > >
> > > I'm no expert in this code but when writing the patch I somehow
> > > convinced myself that a single unsigned long is always enough. I think
> > > that for cluster mode 'bitmap' needs 64-bits (and it is *not* a
> > > vcpu_bitmap, we need to convert). I may be completely wrong of course
> > > but in any case this is a different issue. In ioapic_write_indirect() we
> > > have 'vcpu_bitmap' which should certainly be longer than 64 bits.
> >
> > This code which I mentioned in 'other callers' as far as I see is not IOAPIC related.
> > For regular local APIC all bets are off, any vCPU and apic ID are possible
> > (xapic I think limits apic id to 255 but x2apic doesn't).
> >
> > I strongly suspect that this code can overflow as well.
>
> I've probably missed something but I don't see how
> kvm_apic_map_get_dest_lapic() can set bits above 64 in 'bitmap'. If it
> can, then we have a problem indeed.
It was me who missed that '*bitmap = 1' is harmless, since this function returns
the destination local APIC, and a bitmap of vCPUs to work on starting from
this local APIC. So it's OK.
Your patch looks OK to me now.
You could also zero the upper 4 bits of the destination
APIC ID, though, since the IOAPIC spec says that they are reserved.
(But I won't be surprised if that restriction was later lifted; I am reading
the original IOAPIC spec, for the good old physical chip.)
But that isn't a serious issue.
Reviewed-by: Maxim Levitsky <[email protected]>
Best regards,
Maxim Levitsky
>
On Wed, Aug 25, 2021 at 5:43 AM Vitaly Kuznetsov <[email protected]> wrote:
>
> Maxim Levitsky <[email protected]> writes:
>
> > On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
> >> Maxim Levitsky <[email protected]> writes:
> >>
> >> > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> >> ...
> >> > Not a classical review but,
> >> > I did some digital archaeology with this one, trying to understand what is going on:
> >> >
> >> >
> >> > I think that 16 bit vcpu bitmap is due to the fact that IOAPIC spec states that
> >> > it can address up to 16 cpus in physical destination mode.
> >> >
> >> > In logical destination mode, assuming flat addressing and that logical id = 1 << physical id
> >> > which KVM hardcodes, it is also only possible to address 8 CPUs.
> >> >
> >> > However(!) in flat cluster mode, the logical apic id is split in two.
> >> > We have 16 clusters and each have 4 CPUs, so it is possible to address 64 CPUs,
> >> > and unlike the logical ID, the KVM does honour cluster ID,
> >> > thus one can stick say cluster ID 0 to any vCPU.
> >> >
> >> >
> >> > Let's look at ioapic_write_indirect.
> >> > It does:
> >> >
> >> > -> bitmap_zero(&vcpu_bitmap, 16);
> >> > -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
> >> > -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
> >> >
> >> >
> >> > When we call kvm_bitmap_or_dest_vcpus, we can already overflow the bitmap,
> >> > since we pass all 8 bit of the destination even when it is physical.
> >> >
> >> >
> >> > Lets examine the kvm_bitmap_or_dest_vcpus:
> >> >
> >> > -> It calls the kvm_apic_map_get_dest_lapic which
> >> >
> >> > -> for physical destinations, it just sets the bitmap, which can overflow
> >> > if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
> >> >
> >> >
> >> > -> For logical apic ID, it seems to truncate the result to 16 bit, which isn't correct as I explained
> >> > above, but should not overflow the result.
> >> >
> >> >
> >> > -> If call to kvm_apic_map_get_dest_lapic fails, it goes over all vcpus and tries to match the destination
> >> > This can overflow as well.
> >> >
> >> >
> >> > I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
> >> > and second time with 'old_dest_id'
> >> >
> >> > I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
> >> > function changing them.
> >> > I think only the guest can change them, so maybe the code deals with the guest changing them
> >> > while the code is running from a different vcpu?
> >> >
> >> > The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
> >> > Nitesh Narayan Lal, maybe you remember something about it?
> >> >
> >>
> >> Before posting this patch I've contacted Nitesh privately, he's
> >> currently on vacation but will take a look when he gets back.
> >>
> >> > Also I worry a lot about other callers of kvm_apic_map_get_dest_lapic
> >> >
> >> > It is also called from kvm_irq_delivery_to_apic_fast, and from kvm_intr_is_single_vcpu_fast
> >> > and both seem to also use 'unsigned long' for bitmap, and then only use 16 bits of it.
> >> >
> >> > I haven't dug into them, but these don't seem to be IOAPIC related and I think
> >> > can overwrite the stack as well.
> >>
> >> I'm no expert in this code but when writing the patch I somehow
> >> convinced myself that a single unsigned long is always enough. I think
> >> that for cluster mode 'bitmap' needs 64-bits (and it is *not* a
> >> vcpu_bitmap, we need to convert). I may be completely wrong of course
> >> but in any case this is a different issue. In ioapic_write_indirect() we
> >> have 'vcpu_bitmap' which should certainly be longer than 64 bits.
> >
> >
> > This code which I mentioned in 'other callers' as far as I see is not IOAPIC related.
> > For regular local APIC all bets are off, any vCPU and apic ID are possible
> > (xapic I think limits apic id to 255 but x2apic doesn't).
> >
> > I strongly suspect that this code can overflow as well.
>
> I've probably missed something but I don't see how
> kvm_apic_map_get_dest_lapic() can set bits above 64 in 'bitmap'. If it
> can, then we have a problem indeed.
It would be nice if the compiler took care of validating bitmap sizes
for us. Shouldn't we make the function prototypes explicit about the
bitmap sizes they expect?
I believe some `typedef DECLARE_BITMAP(...)` or `typedef struct {
DECLARE_BITMAP(...) } ...` declarations would be very useful here.
--
Eduardo
Eduardo Habkost <[email protected]> writes:
> On Wed, Aug 25, 2021 at 5:43 AM Vitaly Kuznetsov <[email protected]> wrote:
>>
>> Maxim Levitsky <[email protected]> writes:
>>
>> > On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
>> >> Maxim Levitsky <[email protected]> writes:
>> >>
>> >> > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
>> >> ...
>> >> > Not a classical review but,
>> >> > I did some digital archaeology with this one, trying to understand what is going on:
>> >> >
>> >> >
>> >> > I think that the 16-bit vcpu bitmap is due to the fact that the IOAPIC spec states
>> >> > it can address up to 16 CPUs in physical destination mode.
>> >> >
>> >> > In logical destination mode, assuming flat addressing and that logical id = 1 << physical id,
>> >> > which KVM hardcodes, it is also only possible to address 8 CPUs.
>> >> >
>> >> > However(!) in flat cluster mode, the logical apic id is split in two.
>> >> > We have 16 clusters, each with 4 CPUs, so it is possible to address 64 CPUs,
>> >> > and unlike the logical ID, KVM does honour the cluster ID,
>> >> > thus one can stick, say, cluster ID 0 to any vCPU.
>> >> >
>> >> >
>> >> > Let's look at ioapic_write_indirect.
>> >> > It does:
>> >> >
>> >> > -> bitmap_zero(&vcpu_bitmap, 16);
>> >> > -> kvm_bitmap_or_dest_vcpus(ioapic->kvm, &irq, &vcpu_bitmap);
>> >> > -> kvm_make_scan_ioapic_request_mask(ioapic->kvm, &vcpu_bitmap); // use of the above bitmap
>> >> > [...]
>>
>> I've probably missed something but I don't see how
>> kvm_apic_map_get_dest_lapic() can set bits above 64 in 'bitmap'. If it
>> can, then we have a problem indeed.
>
> It would be nice if the compiler took care of validating bitmap sizes
> for us. Shouldn't we make the function prototypes explicit about the
> bitmap sizes they expect?
>
> I believe some `typedef DECLARE_BITMAP(...)` or `typedef struct {
> DECLARE_BITMAP(...) } ...` declarations would be very useful here.
The fundamental problem here is that a bitmap in Linux has the 'unsigned
long *' type; it's supposed to be accompanied by an 'int len' parameter,
but that's not always the case.
In KVM, we usually use 'vcpu_bitmap' (or 'dest_vcpu_bitmap') and these
are 'KVM_MAX_VCPUS' bits long. The plain 'bitmap' or 'mask' case is a bit
more complicated. E.g. kvm_apic_map_get_logical_dest() uses 'u16 *mask',
which means that only 16 bits in the destination are supposed to be
set. kvm_apic_map_get_dest_lapic() uses 'unsigned long *bitmap' - go
figure.
We could probably have used a declaration like you suggest to e.g. create
incompatible 'bitmap16', 'bitmap64', ... types and make the compiler do
the checking, but I'm slightly hesitant to introduce such helpers to KVM
and not the whole kernel. Alternatively, we could just encode the
length in the parameter's name, e.g.
@@ -918,7 +918,7 @@ static bool kvm_apic_is_broadcast_dest(struct kvm *kvm, struct kvm_lapic **src,
static inline bool kvm_apic_map_get_dest_lapic(struct kvm *kvm,
struct kvm_lapic **src, struct kvm_lapic_irq *irq,
struct kvm_apic_map *map, struct kvm_lapic ***dst,
- unsigned long *bitmap)
+ unsigned long *bitmap64)
{
int i, lowest;
--
Vitaly
On Thu, Aug 26, 2021 at 02:40:53PM +0200, Vitaly Kuznetsov wrote:
> Eduardo Habkost <[email protected]> writes:
>
> > On Wed, Aug 25, 2021 at 5:43 AM Vitaly Kuznetsov <[email protected]> wrote:
> >>
> >> Maxim Levitsky <[email protected]> writes:
> >>
> >> > On Wed, 2021-08-25 at 10:21 +0200, Vitaly Kuznetsov wrote:
> >> >> Maxim Levitsky <[email protected]> writes:
> >> >>
> >> >> > On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> [...]
> @@ -918,7 +918,7 @@ static bool kvm_apic_is_broadcast_dest(struct kvm *kvm, struct kvm_lapic **src,
> static inline bool kvm_apic_map_get_dest_lapic(struct kvm *kvm,
> struct kvm_lapic **src, struct kvm_lapic_irq *irq,
> struct kvm_apic_map *map, struct kvm_lapic ***dst,
> - unsigned long *bitmap)
> + unsigned long *bitmap64)
You can communicate the expected bitmap size to the compiler
without typedefs if using DECLARE_BITMAP inside the function
parameter list is acceptable coding style (is it?).
For example, the following would have allowed the compiler to
catch the bug you are fixing:
Signed-off-by: Eduardo Habkost <[email protected]>
---
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index d7c25d0c1354..e8c64747121a 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -236,7 +236,7 @@ bool kvm_apic_pending_eoi(struct kvm_vcpu *vcpu, int vector);
void kvm_wait_lapic_expire(struct kvm_vcpu *vcpu);
void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
- unsigned long *vcpu_bitmap);
+ DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS));
bool kvm_intr_is_single_vcpu_fast(struct kvm *kvm, struct kvm_lapic_irq *irq,
struct kvm_vcpu **dest_vcpu);
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 76fb00921203..1df113894cba 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1166,7 +1166,7 @@ static int __apic_accept_irq(struct kvm_lapic *apic, int delivery_mode,
* each available vcpu to identify the same.
*/
void kvm_bitmap_or_dest_vcpus(struct kvm *kvm, struct kvm_lapic_irq *irq,
- unsigned long *vcpu_bitmap)
+ DECLARE_BITMAP(vcpu_bitmap, KVM_MAX_VCPUS))
{
struct kvm_lapic **dest_vcpu = NULL;
struct kvm_lapic *src = NULL;
--
Eduardo
On Thu, Aug 26, 2021, Eduardo Habkost wrote:
> > [...]
>
> You can communicate the expected bitmap size to the compiler
> without typedefs if using DECLARE_BITMAP inside the function
> parameter list is acceptable coding style (is it?).
>
> For example, the following would have allowed the compiler to
> catch the bug you are fixing:
>
> [...]
Sadly, that would not have actually caught the bug. In C++, an array param does
indeed have a fixed size, but in C an array param is nothing more than syntactic
sugar that is demoted to a plain ol' pointer. E.g. gcc-10 and clang-11 both
happily compile with "DECLARE_BITMAP(vcpu_bitmap, 0)" and the original single
"unsigned long vcpu_bitmap". Maybe there are gcc extensions to enforce array
sizes? But if there are, they are not (yet) enabled for kernel builds.
On Thu, Aug 26, 2021 at 06:01:15PM +0000, Sean Christopherson wrote:
> On Thu, Aug 26, 2021, Eduardo Habkost wrote:
> > [...]
>
> Sadly, that would not have actually caught the bug. In C++, an array param does
> indeed have a fixed size, but in C an array param is nothing more than syntatic
> sugar that is demoted to a plain ol' pointer. E.g. gcc-10 and clang-11 both
> happily compile with "DECLARE_BITMAP(vcpu_bitmap, 0)" and the original single
> "unsigned long vcpu_bitmap". Maybe there are gcc extensions to enforce array
> sizes? But if there are, they are not (yet) enabled for kernel builds.
The compiler wouldn't have caught it today only because Linux is
compiled with `-Wno-stringop-overflow`. I have some hope that
the warning will eventually be enabled, as indicated in the
commit message of commit 5a76021c2eff ("gcc-10: disable
'stringop-overflow' warning for now").
Even if the warning isn't enabled, the bitmap size declaration
would be a hint for humans reading the code.
--
Eduardo
I'm re-reading this, and:
On Tue, Aug 24, 2021 at 07:07:58PM +0300, Maxim Levitsky wrote:
[...]
> [...]
> -> for physical destinations, it just sets the bitmap, which can overflow
> if we pass it 8 bit destination (which basically includes reserved bits + 4 bit destination).
How exactly do you think kvm_apic_map_get_dest_lapic() can
overflow? It never writes beyond `bitmap[0]`, as far as I can
see.
> [...]
--
Eduardo
On Tue, Aug 24, 2021 at 12:08 PM Maxim Levitsky <[email protected]> wrote:
>
> On Tue, 2021-08-24 at 16:42 +0200, Vitaly Kuznetsov wrote:
> > Eduardo Habkost <[email protected]> writes:
> >
> > > On Tue, Aug 24, 2021 at 3:13 AM Vitaly Kuznetsov <[email protected]> wrote:
> > > > Eduardo Habkost <[email protected]> writes:
> > > >
> > > > > On Mon, Aug 23, 2021 at 04:30:28PM +0200, Vitaly Kuznetsov wrote:
> > > > > > KASAN reports the following issue:
> > > > > >
> > > > > > BUG: KASAN: stack-out-of-bounds in kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > > Read of size 8 at addr ffffc9001364f638 by task qemu-kvm/4798
> > > > > >
> > > > > > CPU: 0 PID: 4798 Comm: qemu-kvm Tainted: G X --------- ---
> > > > > > Hardware name: AMD Corporation DAYTONA_X/DAYTONA_X, BIOS RYM0081C 07/13/2020
> > > > > > Call Trace:
> > > > > > dump_stack+0xa5/0xe6
> > > > > > print_address_description.constprop.0+0x18/0x130
> > > > > > ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > > __kasan_report.cold+0x7f/0x114
> > > > > > ? kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > > kasan_report+0x38/0x50
> > > > > > kasan_check_range+0xf5/0x1d0
> > > > > > kvm_make_vcpus_request_mask+0x174/0x440 [kvm]
> > > > > > kvm_make_scan_ioapic_request_mask+0x84/0xc0 [kvm]
> > > > > > ? kvm_arch_exit+0x110/0x110 [kvm]
> > > > > > ? sched_clock+0x5/0x10
> > > > > > ioapic_write_indirect+0x59f/0x9e0 [kvm]
> > > > > > ? static_obj+0xc0/0xc0
> > > > > > ? __lock_acquired+0x1d2/0x8c0
> > > > > > ? kvm_ioapic_eoi_inject_work+0x120/0x120 [kvm]
> > > > > >
[...]
>
>
> I also don't like that ioapic_write_indirect calls the kvm_bitmap_or_dest_vcpus twice,
> and second time with 'old_dest_id'
>
> I am not 100% sure why old_dest_id/old_dest_mode are needed as I don't see anything in the
> function changing them.
> I think only the guest can change them, so maybe the code deals with the guest changing them
> while the code is running from a different vcpu?
>
> The commit that introduced this code is 7ee30bc132c683d06a6d9e360e39e483e3990708
> Nitesh Narayan Lal, maybe you remember something about it?
>
Apologies for the delay in responding, I just got back from my PTO and am
still clearing my inbox. Since you have reviewed this patch, the only open
question is the one above, so I will try to answer it. Please let me know
in case I missed anything.
IIRC the IOAPIC can be reconfigured while the previous interrupt is pending
or still being processed. In that situation, ioapic_handled_vectors may go
out of sync, as it only records the most recently written configuration.
Since with this commit we stopped generating requests for all vCPUs, we
need this chunk of code to keep ioapic_handled_vectors in sync.
Having said that, perhaps there is a better way of handling this(?).
--
Thanks
Nitesh