2012-06-28 06:07:31

by Tomoki Sekiyama

Subject: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

Hello,

This RFC patch series provides a facility to dedicate CPUs to KVM guests
and to let the guests handle interrupts from passed-through PCI devices
directly (without a VM exit and relay by the host).

With this feature, we can improve the throughput and response time of the
device, and reduce the host's CPU usage, by cutting the overhead of interrupt
handling. This is good for applications that use devices with very high
throughput or very frequent interrupts (e.g. a 10GbE NIC).
CPU-intensive high-performance applications and real-time applications also
benefit from the CPU isolation feature, which reduces VM exits and
scheduling delay.

The current implementation is still just a PoC and has many limitations, but
it is submitted as an RFC. Any comments are appreciated.

* Overview
Intel and AMD CPUs have a feature that lets guests handle interrupts without
a VM Exit. However, because VM Exit cannot be switched on a per-IRQ-vector
basis, interrupts destined for both the host and the guest would be routed
to the guest.

To avoid mixing host and guest interrupts, this patch series cuts some CPUs
off from the host and dedicates them to the guests. In addition, the IRQ
affinity of the passed-through devices is set to the guest CPUs only.

For IPIs from the host to the guest, we use NMIs, which are the only
interrupts that have a separate VM Exit control.

* Benefits
This feature brings the benefits of virtualization to areas where high
performance and low latency are required, such as HPC and trading. It is
also useful for consolidation in large-scale systems with many CPU cores and
PCI devices that are passed through or use SR-IOV.
In the future, it might be used to keep guests running even if the host
crashes (but that would need additional features such as memory isolation).

* Limitations
The current implementation is experimental, unstable, and has many limitations.
- SMP guests don't work correctly
- Only Linux guests are supported
- Only Intel VT-x is supported
- Only MSI and MSI-X pass-through; no ISA interrupt support
- Non-passed-through PCI devices (including virtio) are slower
- Kernel-space PIT emulation does not work
- Needs a lot of cleanups

* How to test
- Create a guest VM with 1 CPU and some PCI passthrough devices (which
support MSI/MSI-X).
Having no VGA display is preferable...
- Apply the patch at the end of this mail to qemu-kvm.
(This patch is just for simple testing; the CPU ID dedicated to the
guest is hard-coded.)
- Run the guest once to ensure the PCI passthrough works correctly.
- Make the specified CPU offline.
# echo 0 > /sys/devices/system/cpu/cpu3/online
- Launch qemu-kvm with the -no-kvm-pit option.
The offlined CPU is booted as a slave CPU and the guest runs on that CPU.

* Performance Example
Tested on a Xeon W3520 with a 10Gb NIC (ixgbe 82599EB), using SR-IOV to share
the device between the host and a guest. With this NIC, we measured the
communication performance (throughput, latency, CPU usage) between the host
and the guest.

                 w/ direct interrupt handling   w/o direct interrupt handling
Throughput(*1)   11.4 Gbits/sec                 8.91 Gbits/sec
Latency   (*2)   0.054 ms                       0.069 ms

*1) measured with `iperf -s' on the host and `iperf -c' on the guest.
*2) average `ping' RTT from the host to the guest

CPU Usage (top output)
- w/direct interrupts handling
Tasks: 200 total, 1 running, 199 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.0%sy, 0.0%ni, 41.1%id, 0.0%wa, 0.0%hi, 58.9%si, 0.0%st
Cpu1 : 0.0%us, 55.3%sy, 0.0%ni, 44.7%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Cpu2 : 0.0%us, 0.0%sy, 0.0%ni, 99.0%id, 0.7%wa, 0.3%hi, 0.0%si, 0.0%st
Mem: 6152492k total, 1921728k used, 4230764k free, 52544k buffers
Swap: 8159228k total, 0k used, 8159228k free, 890964k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
32307 root 0 -20 165m 1088 772 S 56.5 0.0 1:33.03 iperf
1777 root 20 0 0 0 0 S 0.3 0.0 0:00.01 kworker/2:0
2121 sekiyama 20 0 15260 1372 1008 R 0.3 0.0 0:00.12 top
28792 qemu 20 0 820m 532m 8808 S 0.3 8.9 0:06.10 qemu-kvm.custom
1 root 20 0 37536 4684 2016 S 0.0 0.1 0:05.61 systemd

- w/o direct interrupts handling
Tasks: 193 total, 1 running, 192 sleeping, 0 stopped, 0 zombie
Cpu0 : 0.0%us, 0.7%sy, 0.0%ni, 22.2%id, 0.0%wa, 0.3%hi, 76.8%si, 0.0%st
Cpu1 : 0.0%us, 0.0%sy, 0.0%ni, 98.3%id, 0.0%wa, 1.7%hi, 0.0%si, 0.0%st
Cpu2 : 0.3%us, 74.7%sy, 0.0%ni, 23.0%id, 0.0%wa, 2.0%hi, 0.0%si, 0.0%st
Cpu3 : 94.7%us, 4.6%sy, 0.0%ni, 0.0%id, 0.0%wa, 0.7%hi, 0.0%si, 0.0%st
Mem: 6152492k total, 1586520k used, 4565972k free, 47832k buffers
Swap: 8159228k total, 0k used, 8159228k free, 644460k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
1747 qemu 20 0 844m 530m 8808 S 99.2 8.8 0:23.85 qemu-kvm.custom
1929 root 0 -20 165m 1080 772 S 70.9 0.0 0:09.96 iperf
1804 root -51 0 0 0 0 S 3.0 0.0 0:00.45 irq/74-kvm:0000
1803 root -51 0 0 0 0 S 2.6 0.0 0:00.40 irq/73-kvm:0000
1833 sekiyama 20 0 15260 1372 1004 R 0.3 0.0 0:00.13 top

With direct interrupt handling, guest execution does not show up in top,
since the dedicated CPU is offlined from the host, and the CPU usage of the
interrupt-relay kernel threads (irq/*-kvm:0000) is reduced.

* Patch to qemu-kvm for testing

diff -u -r qemu-kvm-0.15.1/qemu-kvm-x86.c qemu-kvm-0.15.1-test/qemu-kvm-x86.c
--- qemu-kvm-0.15.1/qemu-kvm-x86.c 2011-10-19 22:54:48.000000000 +0900
+++ qemu-kvm-0.15.1-test/qemu-kvm-x86.c 2012-06-25 21:21:15.141557256 +0900
@@ -139,12 +139,28 @@
return kvm_vcpu_ioctl(env, KVM_TPR_ACCESS_REPORTING, &tac);
}
+static int kvm_set_slave_cpu(CPUState *env)
+{
+ int r, slave = 3;
+
+ r = kvm_ioctl(env->kvm_state, KVM_CHECK_EXTENSION, KVM_CAP_SLAVE_CPU);
+ if (r <= 0) {
+ return -ENOSYS;
+ }
+ r = kvm_vcpu_ioctl(env, KVM_SET_SLAVE_CPU, slave);
+ if (r < 0)
+ perror("kvm_set_slave_cpu");
+ return r;
+}
+
static int _kvm_arch_init_vcpu(CPUState *env)
{
kvm_arch_reset_vcpu(env);
kvm_enable_tpr_access_reporting(env);
+ kvm_set_slave_cpu(env);
+
return kvm_update_ioport_access(env);
}


---

Tomoki Sekiyama (18):
x86: request TLB flush to slave CPU using NMI
KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs
KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received
KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER
KVM: Directly handle interrupts by guests without VM EXIT on slave CPUs
x86/apic: IRQ vector remapping on slave for slave CPUs
x86/apic: Enable external interrupt routing to slave CPUs
KVM: no exiting from guest when slave CPU halted
KVM: proxy slab operations for slave CPUs on online CPUs
KVM: Go back to online CPU on VM exit by external interrupt
KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl
KVM: handle page faults occured in slave CPUs on online CPUs
KVM: Add facility to run guests on slave CPUs
KVM: Enable/Disable virtualization on slave CPUs are activated/dying
KVM: Replace local_irq_disable/enable with local_irq_save/restore
x86: Support hrtimer on slave CPUs
x86: Add a facility to use offlined CPUs as slave CPUs
x86: Split memory hotplug function from cpu_up() as cpu_memory_up()


arch/x86/Kconfig | 10 +
arch/x86/include/asm/apic.h | 4
arch/x86/include/asm/cpu.h | 14 +
arch/x86/include/asm/irq.h | 15 +
arch/x86/include/asm/kvm_host.h | 56 +++++
arch/x86/include/asm/mmu.h | 7 +
arch/x86/include/asm/vmx.h | 3
arch/x86/kernel/apic/apic_flat_64.c | 2
arch/x86/kernel/apic/io_apic.c | 89 ++++++-
arch/x86/kernel/apic/x2apic_cluster.c | 6
arch/x86/kernel/apic/x2apic_phys.c | 2
arch/x86/kernel/cpu/common.c | 3
arch/x86/kernel/smp.c | 2
arch/x86/kernel/smpboot.c | 188 +++++++++++++++
arch/x86/kvm/irq.c | 136 +++++++++++
arch/x86/kvm/lapic.c | 6
arch/x86/kvm/mmu.c | 83 +++++--
arch/x86/kvm/mmu.h | 4
arch/x86/kvm/trace.h | 1
arch/x86/kvm/vmx.c | 74 ++++++
arch/x86/kvm/x86.c | 407 +++++++++++++++++++++++++++++++--
arch/x86/mm/gup.c | 7 -
arch/x86/mm/tlb.c | 63 +++++
drivers/iommu/intel_irq_remapping.c | 10 +
include/linux/cpu.h | 9 +
include/linux/cpumask.h | 26 ++
include/linux/kvm.h | 4
include/linux/kvm_host.h | 2
kernel/cpu.c | 83 +++++--
kernel/hrtimer.c | 22 ++
kernel/irq/manage.c | 4
kernel/irq/migration.c | 2
kernel/irq/proc.c | 2
kernel/smp.c | 9 -
virt/kvm/assigned-dev.c | 8 +
virt/kvm/async_pf.c | 17 +
virt/kvm/kvm_main.c | 40 +++
37 files changed, 1296 insertions(+), 124 deletions(-)


Thanks,
--
Tomoki Sekiyama <[email protected]>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory


2012-06-28 06:07:37

by Tomoki Sekiyama

Subject: [RFC PATCH 01/18] x86: Split memory hotplug function from cpu_up() as cpu_memory_up()

Split the memory hotplug part out of cpu_up() into cpu_memory_up(), which
will be used to assign a memory area to offlined CPUs in a following patch
in this series.

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

include/linux/cpu.h | 9 +++++++++
kernel/cpu.c | 46 +++++++++++++++++++++++++++-------------------
2 files changed, 36 insertions(+), 19 deletions(-)

diff --git a/include/linux/cpu.h b/include/linux/cpu.h
index 2e9b9eb..14d8e41 100644
--- a/include/linux/cpu.h
+++ b/include/linux/cpu.h
@@ -145,6 +145,15 @@ void notify_cpu_starting(unsigned int cpu);
extern void cpu_maps_update_begin(void);
extern void cpu_maps_update_done(void);

+#ifdef CONFIG_MEMORY_HOTPLUG
+extern int cpu_memory_up(unsigned int cpu);
+#else
+static inline int cpu_memory_up(unsigned int cpu)
+{
+ return 0;
+}
+#endif
+
#else /* CONFIG_SMP */

#define cpu_notifier(fn, pri) do { (void)(fn); } while (0)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index a4eb522..db3e33c 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -384,11 +384,6 @@ int __cpuinit cpu_up(unsigned int cpu)
{
int err = 0;

-#ifdef CONFIG_MEMORY_HOTPLUG
- int nid;
- pg_data_t *pgdat;
-#endif
-
if (!cpu_possible(cpu)) {
printk(KERN_ERR "can't online cpu %d because it is not "
"configured as may-hotadd at boot time\n", cpu);
@@ -399,7 +394,32 @@ int __cpuinit cpu_up(unsigned int cpu)
return -EINVAL;
}

+ err = cpu_memory_up(cpu);
+ if (err)
+ return err;
+
+ cpu_maps_update_begin();
+
+ if (cpu_hotplug_disabled) {
+ err = -EBUSY;
+ goto out;
+ }
+
+ err = _cpu_up(cpu, 0);
+
+out:
+ cpu_maps_update_done();
+ return err;
+}
+EXPORT_SYMBOL_GPL(cpu_up);
+
#ifdef CONFIG_MEMORY_HOTPLUG
+int __cpuinit cpu_memory_up(unsigned int cpu)
+{
+ int err;
+ int nid;
+ pg_data_t *pgdat;
+
nid = cpu_to_node(cpu);
if (!node_online(nid)) {
err = mem_online_node(nid);
@@ -419,22 +439,10 @@ int __cpuinit cpu_up(unsigned int cpu)
build_all_zonelists(NULL);
mutex_unlock(&zonelists_mutex);
}
-#endif

- cpu_maps_update_begin();
-
- if (cpu_hotplug_disabled) {
- err = -EBUSY;
- goto out;
- }
-
- err = _cpu_up(cpu, 0);
-
-out:
- cpu_maps_update_done();
- return err;
+ return 0;
}
-EXPORT_SYMBOL_GPL(cpu_up);
+#endif

#ifdef CONFIG_PM_SLEEP_SMP
static cpumask_var_t frozen_cpus;

2012-06-28 06:07:43

by Tomoki Sekiyama

Subject: [RFC PATCH 02/18] x86: Add a facility to use offlined CPUs as slave CPUs

Add a facility for using offlined CPUs as slave CPUs. Slave CPUs are
specialized to exclusively run functions specified by online CPUs and do
not run user processes.

To use this feature, build the kernel with CONFIG_SLAVE_CPU=y.

A slave CPU is launched by calling slave_cpu_up() while the CPU is offline.
Once launched, the slave CPU waits for IPIs in the idle thread context.
Users of a slave CPU can run a specific kernel function on it by sending an
IPI via smp_call_function_single().

When slave_cpu_down() is called, the slave CPU goes back to the offline state.

The cpumask `cpu_slave_mask' is provided to track which CPUs are slaves.
In addition, `cpu_online_or_slave_mask' is provided for convenience in
APIC handling, etc.
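
For illustration, here is a minimal sketch (not part of this patch) of how a
user of the facility might dispatch work to a slave CPU; my_slave_work() and
my_slave_example() are hypothetical names, while slave_cpu_up(),
slave_cpu_down() and smp_call_function_single() are the interfaces described
above:

#include <linux/kernel.h>
#include <linux/smp.h>
#include <asm/cpu.h>

/* Hypothetical payload; runs on the slave CPU in IPI context. */
static void my_slave_work(void *info)
{
        pr_info("running on slave CPU %d\n", smp_processor_id());
}

/* Boot an offlined CPU as a slave, run the function on it, stop it again. */
static int my_slave_example(unsigned int cpu)
{
        int ret;

        ret = slave_cpu_up(cpu);        /* cpu must be offline beforehand */
        if (ret)
                return ret;

        /* The slave waits for IPIs in idle; dispatch the function to it. */
        smp_call_function_single(cpu, my_slave_work, NULL, 1);

        return slave_cpu_down(cpu);     /* put the CPU back offline */
}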

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/Kconfig | 10 +++
arch/x86/include/asm/cpu.h | 9 ++
arch/x86/kernel/cpu/common.c | 3 +
arch/x86/kernel/smpboot.c | 152 ++++++++++++++++++++++++++++++++++++++++--
include/linux/cpumask.h | 26 +++++++
kernel/cpu.c | 37 ++++++++++
kernel/smp.c | 8 +-
7 files changed, 232 insertions(+), 13 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index c70684f..e17f49b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1676,6 +1676,16 @@ config HOTPLUG_CPU
automatically on SMP systems. )
Say N if you want to disable CPU hotplug.

+config SLAVE_CPU
+ bool "Support for slave CPUs (EXPERIMENTAL)"
+ depends on EXPERIMENTAL && HOTPLUG_CPU
+ ---help---
+ Say Y here to allow use some of CPUs as slave processors.
+ Slave CPUs are controlled from another CPU and do some tasks
+ and cannot run user processes. Slave processors can be
+ specified through /sys/devices/system/cpu.
+ Say N if you want to disable slave CPU support.
+
config COMPAT_VDSO
def_bool y
prompt "Compat VDSO support"
diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index 4564c8e..a6dc43d 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -30,6 +30,15 @@ extern int arch_register_cpu(int num);
extern void arch_unregister_cpu(int);
#endif

+#ifdef CONFIG_SLAVE_CPU
+#define CPU_SLAVE_UP_PREPARE 0xff00
+#define CPU_SLAVE_UP 0xff01
+#define CPU_SLAVE_DEAD 0xff02
+
+extern int slave_cpu_up(unsigned int cpu);
+extern int slave_cpu_down(unsigned int cpu);
+#endif
+
DECLARE_PER_CPU(int, cpu_state);

int mwait_usable(const struct cpuinfo_x86 *);
diff --git a/arch/x86/kernel/cpu/common.c b/arch/x86/kernel/cpu/common.c
index 6b9333b..abca8a6 100644
--- a/arch/x86/kernel/cpu/common.c
+++ b/arch/x86/kernel/cpu/common.c
@@ -882,7 +882,8 @@ static void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
}

/* Init Machine Check Exception if available. */
- mcheck_cpu_init(c);
+ if (per_cpu(cpu_state, smp_processor_id()) != CPU_SLAVE_UP_PREPARE)
+ mcheck_cpu_init(c);

select_idle_routine(c);

diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index 7bd8a08..edeebad 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -51,6 +51,9 @@
#include <linux/stackprotector.h>
#include <linux/gfp.h>
#include <linux/cpuidle.h>
+#include <linux/clockchips.h>
+#include <linux/tick.h>
+#include "../kernel/smpboot.h"

#include <asm/acpi.h>
#include <asm/desc.h>
@@ -126,7 +129,7 @@ atomic_t init_deasserted;
* Report back to the Boot Processor.
* Running on AP.
*/
-static void __cpuinit smp_callin(void)
+static void __cpuinit smp_callin(int notify_starting)
{
int cpuid, phys_id;
unsigned long timeout;
@@ -218,7 +221,8 @@ static void __cpuinit smp_callin(void)
set_cpu_sibling_map(raw_smp_processor_id());
wmb();

- notify_cpu_starting(cpuid);
+ if (notify_starting)
+ notify_cpu_starting(cpuid);

/*
* Allow the master to continue.
@@ -239,7 +243,7 @@ notrace static void __cpuinit start_secondary(void *unused)
cpu_init();
x86_cpuinit.early_percpu_clock_init();
preempt_disable();
- smp_callin();
+ smp_callin(1);

#ifdef CONFIG_X86_32
/* switch away from the initial page table */
@@ -286,6 +290,68 @@ notrace static void __cpuinit start_secondary(void *unused)
cpu_idle();
}

+#ifdef CONFIG_SLAVE_CPU
+/*
+ * Activate cpu as a slave processor.
+ * The cpu is used to run specified function using smp_call_function
+ * from online processors.
+ * Note that this doesn't mark the cpu online.
+ */
+notrace static void __cpuinit start_slave_cpu(void *unused)
+{
+ int cpu;
+
+ /*
+ * Don't put *anything* before cpu_init(), SMP booting is too
+ * fragile that we want to limit the things done here to the
+ * most necessary things.
+ */
+ cpu_init();
+ preempt_disable();
+ smp_callin(0);
+
+#ifdef CONFIG_X86_32
+ /* switch away from the initial page table */
+ load_cr3(swapper_pg_dir);
+ __flush_tlb_all();
+#endif
+
+ /* otherwise gcc will move up smp_processor_id before the cpu_init */
+ barrier();
+ /*
+ * Check TSC synchronization with the BP:
+ */
+ check_tsc_sync_target();
+
+ x86_platform.nmi_init();
+
+ /* enable local interrupts */
+ local_irq_enable();
+
+ cpu = smp_processor_id();
+
+ /* to prevent fake stack check failure */
+ boot_init_stack_canary();
+
+ /* announce slave CPU started */
+ pr_info("Slave CPU %d is up\n", cpu);
+ per_cpu(cpu_state, cpu) = CPU_SLAVE_UP;
+ set_cpu_slave(cpu, true);
+ wmb();
+
+ /* wait for smp_call_function interrupt */
+ tick_nohz_idle_enter();
+ while (per_cpu(cpu_state, cpu) == CPU_SLAVE_UP)
+ native_safe_halt();
+
+ /* now stop this CPU again */
+ pr_info("Slave CPU %d is going down ...\n", cpu);
+ native_cpu_disable();
+ set_cpu_slave(cpu, false);
+ native_play_dead();
+}
+#endif
+
/*
* The bootstrap kernel entry code has set these up. Save them for
* a given CPU
@@ -664,7 +730,8 @@ static void __cpuinit announce_cpu(int cpu, int apicid)
* Returns zero if CPU booted OK, else error code from
* ->wakeup_secondary_cpu.
*/
-static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
+static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle,
+ int slave)
{
volatile u32 *trampoline_status =
(volatile u32 *) __va(real_mode_header->trampoline_status);
@@ -692,6 +759,10 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
#endif
early_gdt_descr.address = (unsigned long)get_cpu_gdt_table(cpu);
initial_code = (unsigned long)start_secondary;
+#ifdef CONFIG_SLAVE_CPU
+ if (unlikely(slave))
+ initial_code = (unsigned long)start_slave_cpu;
+#endif
stack_start = idle->thread.sp;

/* So we see what's up */
@@ -793,7 +864,8 @@ static int __cpuinit do_boot_cpu(int apicid, int cpu, struct task_struct *idle)
return boot_error;
}

-int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+static int __cpuinit __native_cpu_up(unsigned int cpu,
+ struct task_struct *tidle, int slave)
{
int apicid = apic->cpu_present_to_apicid(cpu);
unsigned long flags;
@@ -824,9 +896,10 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
*/
mtrr_save_state();

- per_cpu(cpu_state, cpu) = CPU_UP_PREPARE;
+ per_cpu(cpu_state, cpu) = slave ? CPU_SLAVE_UP_PREPARE
+ : CPU_UP_PREPARE;

- err = do_boot_cpu(apicid, cpu, tidle);
+ err = do_boot_cpu(apicid, cpu, tidle, slave);
if (err) {
pr_debug("do_boot_cpu failed %d\n", err);
return -EIO;
@@ -840,7 +913,7 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
check_tsc_sync_source(cpu);
local_irq_restore(flags);

- while (!cpu_online(cpu)) {
+ while (!cpu_online(cpu) && !cpu_slave(cpu)) {
cpu_relax();
touch_nmi_watchdog();
}
@@ -848,6 +921,69 @@ int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
return 0;
}

+int __cpuinit native_cpu_up(unsigned int cpu, struct task_struct *tidle)
+{
+ return __native_cpu_up(cpu, tidle, 0);
+}
+
+#ifdef CONFIG_SLAVE_CPU
+
+/* boot CPU as a slave processor */
+int __cpuinit slave_cpu_up(unsigned int cpu)
+{
+ int ret;
+ struct task_struct *idle;
+
+ if (!cpu_possible(cpu)) {
+ pr_err("can't start slave cpu %d because it is not "
+ "configured as may-hotadd at boot time\n", cpu);
+ return -EINVAL;
+ }
+ if (cpu_online(cpu) || !cpu_present(cpu))
+ return -EINVAL;
+
+ ret = cpu_memory_up(cpu);
+ if (ret)
+ return ret;
+
+ cpu_maps_update_begin();
+
+ idle = idle_thread_get(cpu);
+ if (IS_ERR(idle))
+ return PTR_ERR(idle);
+
+ ret = __native_cpu_up(cpu, idle, 1);
+
+ cpu_maps_update_done();
+
+ return ret;
+}
+EXPORT_SYMBOL(slave_cpu_up);
+
+static void __slave_cpu_down(void *dummy)
+{
+ __this_cpu_write(cpu_state, CPU_DYING);
+}
+
+int slave_cpu_down(unsigned int cpu)
+{
+ if (!cpu_slave(cpu))
+ return -EINVAL;
+
+ smp_call_function_single(cpu, __slave_cpu_down, NULL, 1);
+ native_cpu_die(cpu);
+
+ if (cpu_slave(cpu)) {
+ pr_err("failed to stop slave cpu %d\n", cpu);
+ return -EBUSY;
+ }
+
+ return 0;
+}
+EXPORT_SYMBOL(slave_cpu_down);
+
+#endif
+
/**
* arch_disable_smp_support() - disables SMP support for x86 at runtime
*/
diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index a2c819d..20c4a25 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -81,6 +81,19 @@ extern const struct cpumask *const cpu_online_mask;
extern const struct cpumask *const cpu_present_mask;
extern const struct cpumask *const cpu_active_mask;

+/*
+ * cpu_online_or_slave_mask represents cpu_online_mask | cpu_slave_mask.
+ * This mask indicates the cpu can hanlde irq.
+ * if CONFIG_SLAVE_CPU is not defined, cpu_slave is defined as 0,
+ * and cpu_online_or_slave_mask is equals to cpu_online_mask.
+ */
+#ifdef CONFIG_SLAVE_CPU
+extern const struct cpumask *const cpu_slave_mask;
+extern const struct cpumask *const cpu_online_or_slave_mask;
+#else
+#define cpu_online_or_slave_mask cpu_online_mask
+#endif
+
#if NR_CPUS > 1
#define num_online_cpus() cpumask_weight(cpu_online_mask)
#define num_possible_cpus() cpumask_weight(cpu_possible_mask)
@@ -101,6 +114,12 @@ extern const struct cpumask *const cpu_active_mask;
#define cpu_active(cpu) ((cpu) == 0)
#endif

+#if defined(CONFIG_SLAVE_CPU) && NR_CPUS > 1
+#define cpu_slave(cpu) cpumask_test_cpu((cpu), cpu_slave_mask)
+#else
+#define cpu_slave(cpu) 0
+#endif
+
/* verify cpu argument to cpumask_* operators */
static inline unsigned int cpumask_check(unsigned int cpu)
{
@@ -705,6 +724,13 @@ void init_cpu_present(const struct cpumask *src);
void init_cpu_possible(const struct cpumask *src);
void init_cpu_online(const struct cpumask *src);

+#ifdef CONFIG_SLAVE_CPU
+#define for_each_slave_cpu(cpu) for_each_cpu((cpu), cpu_slave_mask)
+void set_cpu_slave(unsigned int cpu, bool slave);
+#else
+#define for_each_slave_cpu(cpu) while (0)
+#endif
+
/**
* to_cpumask - convert an NR_CPUS bitmap to a struct cpumask *
* @bitmap: the bitmap
diff --git a/kernel/cpu.c b/kernel/cpu.c
index db3e33c..bd623ca 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -343,7 +343,7 @@ static int __cpuinit _cpu_up(unsigned int cpu, int tasks_frozen)
unsigned long mod = tasks_frozen ? CPU_TASKS_FROZEN : 0;
struct task_struct *idle;

- if (cpu_online(cpu) || !cpu_present(cpu))
+ if (cpu_online(cpu) || cpu_slave(cpu) || !cpu_present(cpu))
return -EINVAL;

cpu_hotplug_begin();
@@ -685,6 +685,17 @@ static DECLARE_BITMAP(cpu_active_bits, CONFIG_NR_CPUS) __read_mostly;
const struct cpumask *const cpu_active_mask = to_cpumask(cpu_active_bits);
EXPORT_SYMBOL(cpu_active_mask);

+#ifdef CONFIG_SLAVE_CPU
+static DECLARE_BITMAP(cpu_slave_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_slave_mask = to_cpumask(cpu_slave_bits);
+EXPORT_SYMBOL(cpu_slave_mask);
+
+static DECLARE_BITMAP(cpu_online_or_slave_bits, CONFIG_NR_CPUS) __read_mostly;
+const struct cpumask *const cpu_online_or_slave_mask =
+ to_cpumask(cpu_online_or_slave_bits);
+EXPORT_SYMBOL(cpu_online_or_slave_mask);
+#endif
+
void set_cpu_possible(unsigned int cpu, bool possible)
{
if (possible)
@@ -707,6 +718,13 @@ void set_cpu_online(unsigned int cpu, bool online)
cpumask_set_cpu(cpu, to_cpumask(cpu_online_bits));
else
cpumask_clear_cpu(cpu, to_cpumask(cpu_online_bits));
+
+#ifdef CONFIG_SLAVE_CPU
+ if (online)
+ cpumask_set_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+ else
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+#endif
}

void set_cpu_active(unsigned int cpu, bool active)
@@ -717,6 +735,19 @@ void set_cpu_active(unsigned int cpu, bool active)
cpumask_clear_cpu(cpu, to_cpumask(cpu_active_bits));
}

+#ifdef CONFIG_SLAVE_CPU
+void set_cpu_slave(unsigned int cpu, bool slave)
+{
+ if (slave) {
+ cpumask_set_cpu(cpu, to_cpumask(cpu_slave_bits));
+ cpumask_set_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+ } else {
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_slave_bits));
+ cpumask_clear_cpu(cpu, to_cpumask(cpu_online_or_slave_bits));
+ }
+}
+#endif
+
void init_cpu_present(const struct cpumask *src)
{
cpumask_copy(to_cpumask(cpu_present_bits), src);
@@ -730,4 +761,8 @@ void init_cpu_possible(const struct cpumask *src)
void init_cpu_online(const struct cpumask *src)
{
cpumask_copy(to_cpumask(cpu_online_bits), src);
+
+#ifdef CONFIG_SLAVE_CPU
+ cpumask_copy(to_cpumask(cpu_online_or_slave_bits), src);
+#endif
}
diff --git a/kernel/smp.c b/kernel/smp.c
index d0ae5b2..6e42573 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -177,7 +177,7 @@ void generic_smp_call_function_interrupt(void)
/*
* Shouldn't receive this interrupt on a cpu that is not yet online.
*/
- WARN_ON_ONCE(!cpu_online(cpu));
+ WARN_ON_ONCE(!cpu_online(cpu) && !cpu_slave(cpu));

/*
* Ensure entry is visible on call_function_queue after we have
@@ -257,7 +257,8 @@ void generic_smp_call_function_single_interrupt(void)
/*
* Shouldn't receive this interrupt on a cpu that is not yet online.
*/
- WARN_ON_ONCE(!cpu_online(smp_processor_id()));
+ WARN_ON_ONCE(!cpu_online(smp_processor_id()) &&
+ !cpu_slave(smp_processor_id()));

raw_spin_lock(&q->lock);
list_replace_init(&q->list, &list);
@@ -326,7 +327,8 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
func(info);
local_irq_restore(flags);
} else {
- if ((unsigned)cpu < nr_cpu_ids && cpu_online(cpu)) {
+ if ((unsigned)cpu < nr_cpu_ids &&
+ (cpu_online(cpu) || cpu_slave(cpu))) {
struct call_single_data *data = &d;

if (!wait)

2012-06-28 06:07:49

by Tomoki Sekiyama

Subject: [RFC PATCH 03/18] x86: Support hrtimer on slave CPUs

Add a facility to use hrtimers on slave CPUs.

To initialize hrtimers when slave CPUs are activated, and to shut them down
when slave CPUs are stopped, this patch adds a slave CPU notifier chain,
which calls registered callbacks when slave CPUs are brought up, are dying,
and are dead.

The registered callbacks are called with CPU_SLAVE_UP when a slave CPU
becomes active. When the slave CPU is stopped, callbacks are called with
CPU_SLAVE_DYING on the slave CPU itself, and with CPU_SLAVE_DEAD on an
online CPU.
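
As a minimal illustration (not part of this patch), a subsystem could hook
these events as follows; my_slave_notify(), my_slave_nb and
my_slave_watch_init() are hypothetical names, while the CPU_SLAVE_*
constants and register_slave_cpu_notifier() are the interfaces introduced
by this series:

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/notifier.h>
#include <asm/cpu.h>

static int my_slave_notify(struct notifier_block *nb,
                           unsigned long action, void *hcpu)
{
        int cpu = (long)hcpu;

        switch (action) {
        case CPU_SLAVE_UP:      /* called on the slave CPU when it comes up */
                pr_info("slave CPU %d is up\n", cpu);
                break;
        case CPU_SLAVE_DYING:   /* called on the slave CPU while it stops */
                pr_info("slave CPU %d is dying\n", cpu);
                break;
        case CPU_SLAVE_DEAD:    /* called on an online CPU after it stopped */
                pr_info("slave CPU %d is dead\n", cpu);
                break;
        }
        return NOTIFY_OK;
}

static struct notifier_block my_slave_nb = {
        .notifier_call = my_slave_notify,
};

static int __init my_slave_watch_init(void)
{
        return register_slave_cpu_notifier(&my_slave_nb);
}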

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/cpu.h | 5 +++++
arch/x86/kernel/smpboot.c | 36 ++++++++++++++++++++++++++++++++++++
kernel/hrtimer.c | 22 ++++++++++++++++++++++
3 files changed, 63 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/cpu.h b/arch/x86/include/asm/cpu.h
index a6dc43d..808b084 100644
--- a/arch/x86/include/asm/cpu.h
+++ b/arch/x86/include/asm/cpu.h
@@ -31,9 +31,14 @@ extern void arch_unregister_cpu(int);
#endif

#ifdef CONFIG_SLAVE_CPU
+int register_slave_cpu_notifier(struct notifier_block *nb);
+void unregister_slave_cpu_notifier(struct notifier_block *nb);
+
+/* CPU notifier constants for slave processors */
#define CPU_SLAVE_UP_PREPARE 0xff00
#define CPU_SLAVE_UP 0xff01
#define CPU_SLAVE_DEAD 0xff02
+#define CPU_SLAVE_DYING 0xff03

extern int slave_cpu_up(unsigned int cpu);
extern int slave_cpu_down(unsigned int cpu);
diff --git a/arch/x86/kernel/smpboot.c b/arch/x86/kernel/smpboot.c
index edeebad..527c4ca 100644
--- a/arch/x86/kernel/smpboot.c
+++ b/arch/x86/kernel/smpboot.c
@@ -125,6 +125,36 @@ EXPORT_PER_CPU_SYMBOL(cpu_info);

atomic_t init_deasserted;

+static void __ref remove_cpu_from_maps(int cpu);
+
+
+#ifdef CONFIG_SLAVE_CPU
+/* Notify slave cpu up and down */
+static RAW_NOTIFIER_HEAD(slave_cpu_chain);
+
+int register_slave_cpu_notifier(struct notifier_block *nb)
+{
+ return raw_notifier_chain_register(&slave_cpu_chain, nb);
+}
+EXPORT_SYMBOL(register_slave_cpu_notifier);
+
+void unregister_slave_cpu_notifier(struct notifier_block *nb)
+{
+ raw_notifier_chain_unregister(&slave_cpu_chain, nb);
+}
+EXPORT_SYMBOL(unregister_slave_cpu_notifier);
+
+static int slave_cpu_notify(unsigned long val, int cpu)
+{
+ int ret;
+
+ ret = __raw_notifier_call_chain(&slave_cpu_chain, val,
+ (void *)(long)cpu, -1, NULL);
+
+ return notifier_to_errno(ret);
+}
+#endif
+
/*
* Report back to the Boot Processor.
* Running on AP.
@@ -307,6 +337,7 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
* most necessary things.
*/
cpu_init();
+ x86_cpuinit.early_percpu_clock_init();
preempt_disable();
smp_callin(0);

@@ -333,10 +364,13 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
/* to prevent fake stack check failure */
boot_init_stack_canary();

+ x86_cpuinit.setup_percpu_clockev();
+
/* announce slave CPU started */
pr_info("Slave CPU %d is up\n", cpu);
per_cpu(cpu_state, cpu) = CPU_SLAVE_UP;
set_cpu_slave(cpu, true);
+ slave_cpu_notify(CPU_SLAVE_UP, cpu);
wmb();

/* wait for smp_call_function interrupt */
@@ -348,6 +382,7 @@ notrace static void __cpuinit start_slave_cpu(void *unused)
pr_info("Slave CPU %d is going down ...\n", cpu);
native_cpu_disable();
set_cpu_slave(cpu, false);
+ slave_cpu_notify(CPU_SLAVE_DYING, cpu);
native_play_dead();
}
#endif
@@ -978,6 +1013,7 @@ int slave_cpu_down(unsigned int cpu)
return -EBUSY;
}

+ slave_cpu_notify(CPU_SLAVE_DEAD, cpu);
return 0;
}
EXPORT_SYMBOL(slave_cpu_down);
diff --git a/kernel/hrtimer.c b/kernel/hrtimer.c
index ae34bf5..0d94142 100644
--- a/kernel/hrtimer.c
+++ b/kernel/hrtimer.c
@@ -48,6 +48,10 @@

#include <asm/uaccess.h>

+#ifdef CONFIG_SLAVE_CPU
+#include <asm/cpu.h>
+#endif
+
#include <trace/events/timer.h>

/*
@@ -1706,16 +1710,25 @@ static int __cpuinit hrtimer_cpu_notify(struct notifier_block *self,

case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_UP:
+#endif
init_hrtimers_cpu(scpu);
break;

#ifdef CONFIG_HOTPLUG_CPU
case CPU_DYING:
case CPU_DYING_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_DYING:
+#endif
clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DYING, &scpu);
break;
case CPU_DEAD:
case CPU_DEAD_FROZEN:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_DEAD:
+#endif
{
clockevents_notify(CLOCK_EVT_NOTIFY_CPU_DEAD, &scpu);
migrate_hrtimers(scpu);
@@ -1734,11 +1747,20 @@ static struct notifier_block __cpuinitdata hrtimers_nb = {
.notifier_call = hrtimer_cpu_notify,
};

+#ifdef CONFIG_SLAVE_CPU
+static struct notifier_block __cpuinitdata hrtimers_slave_nb = {
+ .notifier_call = hrtimer_cpu_notify,
+};
+#endif
+
void __init hrtimers_init(void)
{
hrtimer_cpu_notify(&hrtimers_nb, (unsigned long)CPU_UP_PREPARE,
(void *)(long)smp_processor_id());
register_cpu_notifier(&hrtimers_nb);
+#ifdef CONFIG_SLAVE_CPU
+ register_slave_cpu_notifier(&hrtimers_slave_nb);
+#endif
#ifdef CONFIG_HIGH_RES_TIMERS
open_softirq(HRTIMER_SOFTIRQ, run_hrtimer_softirq);
#endif

2012-06-28 06:07:55

by Tomoki Sekiyama

Subject: [RFC PATCH 04/18] KVM: Replace local_irq_disable/enable with local_irq_save/restore

Replace local_irq_disable/enable with local_irq_save/restore in the paths
that are executed on slave CPUs. This is required because IRQs are kept
disabled while the guest is running on the slave CPUs, so these paths must
not unconditionally re-enable them.
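
The idiom itself, as a small illustrative sketch (touch_percpu_state() is a
hypothetical caller, not code from this series):

#include <linux/irqflags.h>

/*
 * On an online CPU this behaves like plain disable/enable.  On a slave CPU
 * the caller may already have IRQs disabled (e.g. while a guest is running),
 * and local_irq_restore() puts back exactly that state instead of
 * unconditionally re-enabling interrupts.
 */
static void touch_percpu_state(void)
{
        unsigned long flags;

        local_irq_save(flags);          /* disable IRQs, remember prior state */
        /* ... manipulate per-CPU state here ... */
        local_irq_restore(flags);       /* restore the caller's IRQ state */
}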

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/kvm/mmu.c | 20 ++++++++++++--------
arch/x86/kvm/vmx.c | 5 +++--
arch/x86/kvm/x86.c | 7 ++++---
arch/x86/mm/gup.c | 7 ++++---
4 files changed, 23 insertions(+), 16 deletions(-)

diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index be3cea4..6139e1d 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -549,13 +549,14 @@ static u64 mmu_spte_get_lockless(u64 *sptep)
return __get_spte_lockless(sptep);
}

-static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
+static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu,
+ unsigned long *flags)
{
/*
* Prevent page table teardown by making any free-er wait during
* kvm_flush_remote_tlbs() IPI to all active vcpus.
*/
- local_irq_disable();
+ local_irq_save(*flags);
vcpu->mode = READING_SHADOW_PAGE_TABLES;
/*
* Make sure a following spte read is not reordered ahead of the write
@@ -564,7 +565,8 @@ static void walk_shadow_page_lockless_begin(struct kvm_vcpu *vcpu)
smp_mb();
}

-static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
+static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu,
+ unsigned long *flags)
{
/*
* Make sure the write to vcpu->mode is not reordered in front of
@@ -573,7 +575,7 @@ static void walk_shadow_page_lockless_end(struct kvm_vcpu *vcpu)
*/
smp_mb();
vcpu->mode = OUTSIDE_GUEST_MODE;
- local_irq_enable();
+ local_irq_restore(*flags);
}

static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,
@@ -2959,12 +2961,13 @@ static u64 walk_shadow_page_get_mmio_spte(struct kvm_vcpu *vcpu, u64 addr)
{
struct kvm_shadow_walk_iterator iterator;
u64 spte = 0ull;
+ unsigned long flags;

- walk_shadow_page_lockless_begin(vcpu);
+ walk_shadow_page_lockless_begin(vcpu, &flags);
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte)
if (!is_shadow_present_pte(spte))
break;
- walk_shadow_page_lockless_end(vcpu);
+ walk_shadow_page_lockless_end(vcpu, &flags);

return spte;
}
@@ -4043,15 +4046,16 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4])
struct kvm_shadow_walk_iterator iterator;
u64 spte;
int nr_sptes = 0;
+ unsigned long flags;

- walk_shadow_page_lockless_begin(vcpu);
+ walk_shadow_page_lockless_begin(vcpu, &flags);
for_each_shadow_entry_lockless(vcpu, addr, iterator, spte) {
sptes[iterator.level-1] = spte;
nr_sptes++;
if (!is_shadow_present_pte(spte))
break;
}
- walk_shadow_page_lockless_end(vcpu);
+ walk_shadow_page_lockless_end(vcpu, &flags);

return nr_sptes;
}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 32eb588..6ea77e4 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1516,12 +1516,13 @@ static void vmx_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
if (vmx->loaded_vmcs->cpu != cpu) {
struct desc_ptr *gdt = &__get_cpu_var(host_gdt);
unsigned long sysenter_esp;
+ unsigned long flags;

kvm_make_request(KVM_REQ_TLB_FLUSH, vcpu);
- local_irq_disable();
+ local_irq_save(flags);
list_add(&vmx->loaded_vmcs->loaded_vmcss_on_cpu_link,
&per_cpu(loaded_vmcss_on_cpu, cpu));
- local_irq_enable();
+ local_irq_restore(flags);

/*
* Linux uses per-cpu TSS and GDT, so set these when switching
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index be6d549..4a69c66 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5229,6 +5229,7 @@ static void process_nmi(struct kvm_vcpu *vcpu)
static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
{
int r;
+ unsigned long flags;
bool req_int_win = !irqchip_in_kernel(vcpu->kvm) &&
vcpu->run->request_interrupt_window;
bool req_immediate_exit = 0;
@@ -5314,13 +5315,13 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
*/
smp_mb();

- local_irq_disable();
+ local_irq_save(flags);

if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
|| need_resched() || signal_pending(current)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
- local_irq_enable();
+ local_irq_restore(flags);
preempt_enable();
kvm_x86_ops->cancel_injection(vcpu);
r = 1;
@@ -5359,7 +5360,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)

vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
- local_irq_enable();
+ local_irq_restore(flags);

++vcpu->stat.exits;

diff --git a/arch/x86/mm/gup.c b/arch/x86/mm/gup.c
index dd74e46..6679525 100644
--- a/arch/x86/mm/gup.c
+++ b/arch/x86/mm/gup.c
@@ -315,6 +315,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct mm_struct *mm = current->mm;
unsigned long addr, len, end;
unsigned long next;
+ unsigned long flags;
pgd_t *pgdp;
int nr = 0;

@@ -349,7 +350,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
* (which we do on x86, with the above PAE exception), we can follow the
* address down to the the page and take a ref on it.
*/
- local_irq_disable();
+ local_irq_save(flags);
pgdp = pgd_offset(mm, addr);
do {
pgd_t pgd = *pgdp;
@@ -360,7 +361,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
if (!gup_pud_range(pgd, addr, next, write, pages, &nr))
goto slow;
} while (pgdp++, addr = next, addr != end);
- local_irq_enable();
+ local_irq_restore(flags);

VM_BUG_ON(nr != (end - start) >> PAGE_SHIFT);
return nr;
@@ -369,7 +370,7 @@ int get_user_pages_fast(unsigned long start, int nr_pages, int write,
int ret;

slow:
- local_irq_enable();
+ local_irq_restore(flags);
slow_irqon:
/* Try to get the remaining pages with get_user_pages */
start += nr << PAGE_SHIFT;

2012-06-28 06:07:59

by Tomoki Sekiyama

Subject: [RFC PATCH 05/18] KVM: Enable/Disable virtualization on slave CPUs are activated/dying

Enable virtualization when slave CPUs are activated, and disable it when
they are dying, using the slave CPU notifier call chain.

On x86, the TSC kHz must also be initialized via tsc_khz_changed when new
slave CPUs are activated.

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/kvm/x86.c | 20 ++++++++++++++++++++
virt/kvm/kvm_main.c | 38 ++++++++++++++++++++++++++++++++++++--
2 files changed, 56 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4a69c66..9bb2f8f2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -61,6 +61,7 @@
#include <asm/xcr.h>
#include <asm/pvclock.h>
#include <asm/div64.h>
+#include <asm/cpu.h>

#define MAX_IO_MSRS 256
#define KVM_MAX_MCE_BANKS 32
@@ -4769,9 +4770,15 @@ static int kvmclock_cpu_notifier(struct notifier_block *nfb,
switch (action) {
case CPU_ONLINE:
case CPU_DOWN_FAILED:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_UP:
+#endif
smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
break;
case CPU_DOWN_PREPARE:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_DYING:
+#endif
smp_call_function_single(cpu, tsc_bad, NULL, 1);
break;
}
@@ -4783,12 +4790,20 @@ static struct notifier_block kvmclock_cpu_notifier_block = {
.priority = -INT_MAX
};

+static struct notifier_block kvmclock_slave_cpu_notifier_block = {
+ .notifier_call = kvmclock_cpu_notifier,
+ .priority = -INT_MAX
+};
+
static void kvm_timer_init(void)
{
int cpu;

max_tsc_khz = tsc_khz;
register_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+#ifdef CONFIG_SLAVE_CPU
+ register_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
+#endif
if (!boot_cpu_has(X86_FEATURE_CONSTANT_TSC)) {
#ifdef CONFIG_CPU_FREQ
struct cpufreq_policy policy;
@@ -4805,6 +4820,8 @@ static void kvm_timer_init(void)
pr_debug("kvm: max_tsc_khz = %ld\n", max_tsc_khz);
for_each_online_cpu(cpu)
smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
+ for_each_slave_cpu(cpu)
+ smp_call_function_single(cpu, tsc_khz_changed, NULL, 1);
}

static DEFINE_PER_CPU(struct kvm_vcpu *, current_vcpu);
@@ -4930,6 +4947,9 @@ void kvm_arch_exit(void)
cpufreq_unregister_notifier(&kvmclock_cpufreq_notifier_block,
CPUFREQ_TRANSITION_NOTIFIER);
unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
+#ifdef CONFIG_SLAVE_CPU
+ unregister_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
+#endif
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
}
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 7e14068..f5890f0 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -54,6 +54,9 @@
#include <asm/io.h>
#include <asm/uaccess.h>
#include <asm/pgtable.h>
+#ifdef CONFIG_X86
+#include <asm/cpu.h>
+#endif

#include "coalesced_mmio.h"
#include "async_pf.h"
@@ -2323,11 +2326,17 @@ static void hardware_disable(void *junk)

static void hardware_disable_all_nolock(void)
{
+ int cpu;
+
BUG_ON(!kvm_usage_count);

kvm_usage_count--;
- if (!kvm_usage_count)
+ if (!kvm_usage_count) {
on_each_cpu(hardware_disable_nolock, NULL, 1);
+ for_each_slave_cpu(cpu)
+ smp_call_function_single(cpu, hardware_disable_nolock,
+ NULL, 1);
+ }
}

static void hardware_disable_all(void)
@@ -2340,6 +2349,7 @@ static void hardware_disable_all(void)
static int hardware_enable_all(void)
{
int r = 0;
+ int cpu;

raw_spin_lock(&kvm_lock);

@@ -2347,6 +2357,9 @@ static int hardware_enable_all(void)
if (kvm_usage_count == 1) {
atomic_set(&hardware_enable_failed, 0);
on_each_cpu(hardware_enable_nolock, NULL, 1);
+ for_each_slave_cpu(cpu)
+ smp_call_function_single(cpu, hardware_enable_nolock,
+ NULL, 1);

if (atomic_read(&hardware_enable_failed)) {
hardware_disable_all_nolock();
@@ -2370,11 +2383,17 @@ static int kvm_cpu_hotplug(struct notifier_block *notifier, unsigned long val,
val &= ~CPU_TASKS_FROZEN;
switch (val) {
case CPU_DYING:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_DYING:
+#endif
printk(KERN_INFO "kvm: disabling virtualization on CPU%d\n",
cpu);
hardware_disable(NULL);
break;
case CPU_STARTING:
+#ifdef CONFIG_SLAVE_CPU
+ case CPU_SLAVE_UP:
+#endif
printk(KERN_INFO "kvm: enabling virtualization on CPU%d\n",
cpu);
hardware_enable(NULL);
@@ -2592,6 +2611,12 @@ static struct notifier_block kvm_cpu_notifier = {
.notifier_call = kvm_cpu_hotplug,
};

+#ifdef CONFIG_SLAVE_CPU
+static struct notifier_block kvm_slave_cpu_notifier = {
+ .notifier_call = kvm_cpu_hotplug,
+};
+#endif
+
static int vm_stat_get(void *_offset, u64 *val)
{
unsigned offset = (long)_offset;
@@ -2755,7 +2780,7 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
if (r < 0)
goto out_free_0a;

- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, cpu_online_or_slave_mask) {
smp_call_function_single(cpu,
kvm_arch_check_processor_compat,
&r, 1);
@@ -2766,6 +2791,9 @@ int kvm_init(void *opaque, unsigned vcpu_size, unsigned vcpu_align,
r = register_cpu_notifier(&kvm_cpu_notifier);
if (r)
goto out_free_2;
+#ifdef CONFIG_SLAVE_CPU
+ register_slave_cpu_notifier(&kvm_slave_cpu_notifier);
+#endif
register_reboot_notifier(&kvm_reboot_notifier);

/* A kmem cache lets us meet the alignment requirements of fx_save. */
@@ -2813,6 +2841,9 @@ out_free:
kmem_cache_destroy(kvm_vcpu_cache);
out_free_3:
unregister_reboot_notifier(&kvm_reboot_notifier);
+#ifdef CONFIG_SLAVE_CPU
+ unregister_slave_cpu_notifier(&kvm_slave_cpu_notifier);
+#endif
unregister_cpu_notifier(&kvm_cpu_notifier);
out_free_2:
out_free_1:
@@ -2840,6 +2871,9 @@ void kvm_exit(void)
kvm_async_pf_deinit();
unregister_syscore_ops(&kvm_syscore_ops);
unregister_reboot_notifier(&kvm_reboot_notifier);
+#ifdef CONFIG_SLAVE_CPU
+ unregister_slave_cpu_notifier(&kvm_slave_cpu_notifier);
+#endif
unregister_cpu_notifier(&kvm_cpu_notifier);
on_each_cpu(hardware_disable_nolock, NULL, 1);
kvm_arch_hardware_unsetup();

2012-06-28 06:08:07

by Tomoki Sekiyama

Subject: [RFC PATCH 06/18] KVM: Add facility to run guests on slave CPUs

Add a path that migrates execution of vcpu_enter_guest to a slave CPU when
vcpu->arch.slave_cpu is set.

After moving to the slave CPU, execution goes back to an online CPU when the
guest exits for a reason that cannot be handled by the slave CPU alone
(e.g. handling async page faults).

On migration, kvm_arch_vcpu_put_migrate is used to avoid sending an IPI to
clear the loaded VMCS on the old CPU; instead, the VMCS is cleared
immediately.

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 9 ++
arch/x86/kernel/smp.c | 2
arch/x86/kvm/vmx.c | 10 ++
arch/x86/kvm/x86.c | 190 ++++++++++++++++++++++++++++++++++-----
include/linux/kvm_host.h | 1
kernel/smp.c | 1
virt/kvm/async_pf.c | 9 +-
virt/kvm/kvm_main.c | 3 -
8 files changed, 196 insertions(+), 29 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index db7c1f2..4291954 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -346,6 +346,14 @@ struct kvm_vcpu_arch {
u64 ia32_misc_enable_msr;
bool tpr_access_reporting;

+#ifdef CONFIG_SLAVE_CPU
+ /* slave cpu dedicated to this vcpu */
+ int slave_cpu;
+#endif
+
+ /* user process tied to each vcpu */
+ struct task_struct *task;
+
/*
* Paging state of the vcpu
*
@@ -604,6 +612,7 @@ struct kvm_x86_ops {
void (*prepare_guest_switch)(struct kvm_vcpu *vcpu);
void (*vcpu_load)(struct kvm_vcpu *vcpu, int cpu);
void (*vcpu_put)(struct kvm_vcpu *vcpu);
+ void (*vcpu_put_migrate)(struct kvm_vcpu *vcpu);

void (*set_guest_debug)(struct kvm_vcpu *vcpu,
struct kvm_guest_debug *dbg);
diff --git a/arch/x86/kernel/smp.c b/arch/x86/kernel/smp.c
index 48d2b7d..a58dead 100644
--- a/arch/x86/kernel/smp.c
+++ b/arch/x86/kernel/smp.c
@@ -119,7 +119,7 @@ static bool smp_no_nmi_ipi = false;
*/
static void native_smp_send_reschedule(int cpu)
{
- if (unlikely(cpu_is_offline(cpu))) {
+ if (unlikely(cpu_is_offline(cpu) && !cpu_slave(cpu))) {
WARN_ON(1);
return;
}
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 6ea77e4..9ee2d9a 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1547,6 +1547,13 @@ static void vmx_vcpu_put(struct kvm_vcpu *vcpu)
}
}

+static void vmx_vcpu_put_migrate(struct kvm_vcpu *vcpu)
+{
+ vmx_vcpu_put(vcpu);
+ __loaded_vmcs_clear(to_vmx(vcpu)->loaded_vmcs);
+ vcpu->cpu = -1;
+}
+
static void vmx_fpu_activate(struct kvm_vcpu *vcpu)
{
ulong cr0;
@@ -4928,7 +4935,7 @@ static int handle_invalid_guest_state(struct kvm_vcpu *vcpu)
if (err != EMULATE_DONE)
return 0;

- if (signal_pending(current))
+ if (signal_pending(vcpu->arch.task))
goto out;
if (need_resched())
schedule();
@@ -7144,6 +7151,7 @@ static struct kvm_x86_ops vmx_x86_ops = {
.prepare_guest_switch = vmx_save_host_state,
.vcpu_load = vmx_vcpu_load,
.vcpu_put = vmx_vcpu_put,
+ .vcpu_put_migrate = vmx_vcpu_put_migrate,

.set_guest_debug = set_guest_debug,
.get_msr = vmx_get_msr,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 9bb2f8f2..ecd474a 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -46,6 +46,7 @@
#include <linux/uaccess.h>
#include <linux/hash.h>
#include <linux/pci.h>
+#include <linux/mmu_context.h>
#include <trace/events/kvm.h>

#define CREATE_TRACE_POINTS
@@ -62,6 +63,7 @@
#include <asm/pvclock.h>
#include <asm/div64.h>
#include <asm/cpu.h>
+#include <asm/mmu.h>

#define MAX_IO_MSRS 256
#define KVM_MAX_MCE_BANKS 32
@@ -1633,6 +1635,11 @@ int kvm_set_msr_common(struct kvm_vcpu *vcpu, u32 msr, u64 data)
if (unlikely(!sched_info_on()))
return 1;

+#ifdef CONFIG_SLAVE_CPU
+ if (vcpu->arch.slave_cpu)
+ break;
+#endif
+
if (data & KVM_STEAL_RESERVED_MASK)
return 1;

@@ -2319,6 +2326,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
vcpu->arch.last_host_tsc = native_read_tsc();
}

+void kvm_arch_vcpu_put_migrate(struct kvm_vcpu *vcpu)
+{
+ kvm_x86_ops->vcpu_put_migrate(vcpu);
+ kvm_put_guest_fpu(vcpu);
+ vcpu->arch.last_host_tsc = native_read_tsc();
+}
+
static int kvm_vcpu_ioctl_get_lapic(struct kvm_vcpu *vcpu,
struct kvm_lapic_state *s)
{
@@ -5246,7 +5260,46 @@ static void process_nmi(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
}

-static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
+enum vcpu_enter_guest_slave_retval {
+ EXIT_TO_USER = 0,
+ LOOP_ONLINE, /* vcpu_post_run is done in online cpu */
+ LOOP_SLAVE, /* vcpu_post_run is done in slave cpu */
+ LOOP_APF, /* must handle async_pf in online cpu */
+ LOOP_RETRY, /* may in hlt state */
+};
+
+static int vcpu_post_run(struct kvm_vcpu *vcpu, struct task_struct *task,
+ int *can_complete_async_pf)
+{
+ int r = LOOP_ONLINE;
+
+ clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
+ if (kvm_cpu_has_pending_timer(vcpu))
+ kvm_inject_pending_timer_irqs(vcpu);
+
+ if (dm_request_for_irq_injection(vcpu)) {
+ r = -EINTR;
+ vcpu->run->exit_reason = KVM_EXIT_INTR;
+ ++vcpu->stat.request_irq_exits;
+ }
+
+ if (can_complete_async_pf) {
+ *can_complete_async_pf = kvm_can_complete_async_pf(vcpu);
+ if (r == LOOP_ONLINE)
+ r = *can_complete_async_pf ? LOOP_APF : LOOP_SLAVE;
+ } else
+ kvm_check_async_pf_completion(vcpu);
+
+ if (signal_pending(task)) {
+ r = -EINTR;
+ vcpu->run->exit_reason = KVM_EXIT_INTR;
+ ++vcpu->stat.signal_exits;
+ }
+
+ return r;
+}
+
+static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct task_struct *task)
{
int r;
unsigned long flags;
@@ -5338,7 +5391,7 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu)
local_irq_save(flags);

if (vcpu->mode == EXITING_GUEST_MODE || vcpu->requests
- || need_resched() || signal_pending(current)) {
+ || need_resched() || signal_pending(task)) {
vcpu->mode = OUTSIDE_GUEST_MODE;
smp_wmb();
local_irq_restore(flags);
@@ -5416,10 +5469,97 @@ out:
return r;
}

+#ifdef CONFIG_SLAVE_CPU
+
+struct __vcpu_enter_guest_args {
+ struct kvm_vcpu *vcpu;
+ struct task_struct *task;
+ struct completion wait;
+ int ret, apf_pending;
+};
+
+static void __vcpu_enter_guest_slave(void *_arg)
+{
+ struct __vcpu_enter_guest_args *arg = _arg;
+ struct kvm_vcpu *vcpu = arg->vcpu;
+ int r = LOOP_SLAVE;
+ int cpu = smp_processor_id();
+
+ vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ use_mm(arg->task->mm);
+ kvm_arch_vcpu_load(vcpu, cpu);
+
+ while (r == LOOP_SLAVE) {
+ r = vcpu_enter_guest(vcpu, arg->task);
+
+ if (unlikely(!irqs_disabled())) {
+ pr_err("irq is enabled on slave vcpu_etner_guest! - forcely disable\n");
+ local_irq_disable();
+ }
+
+ if (r <= 0)
+ break;
+
+ /* determine if slave cpu can handle the exit alone */
+ r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);
+
+ if (arg->ret == LOOP_SLAVE &&
+ (vcpu->arch.mp_state != KVM_MP_STATE_RUNNABLE ||
+ vcpu->arch.apf.halted)) {
+ arg->ret = LOOP_RETRY;
+ }
+ }
+
+ kvm_arch_vcpu_put_migrate(vcpu);
+ unuse_mm(arg->task->mm);
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);

-static int __vcpu_run(struct kvm_vcpu *vcpu)
+ arg->ret = r;
+ complete(&arg->wait);
+}
+
+static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
+ struct task_struct *task, int *apf_pending)
{
+ struct __vcpu_enter_guest_args arg = {vcpu, task};
+ struct call_single_data csd = {.func = __vcpu_enter_guest_slave,
+ .info = &arg, .flags = 0};
+ int slave = vcpu->arch.slave_cpu;
int r;
+
+ BUG_ON((unsigned)slave >= nr_cpu_ids || !cpu_slave(slave));
+
+ preempt_disable();
+ preempt_notifier_unregister(&vcpu->preempt_notifier);
+ kvm_arch_vcpu_put_migrate(vcpu);
+ preempt_enable();
+
+ srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
+ init_completion(&arg.wait);
+ __smp_call_function_single(slave, &csd, 0);
+ r = wait_for_completion_interruptible(&arg.wait);
+ if (r) {
+ /* interrupted: kick and wait VM on the slave cpu */
+ kvm_vcpu_kick(vcpu);
+ wait_for_completion(&arg.wait);
+ }
+ vcpu->srcu_idx = srcu_read_lock(&vcpu->kvm->srcu);
+
+ preempt_notifier_register(&vcpu->preempt_notifier);
+ kvm_arch_vcpu_load(vcpu, smp_processor_id());
+
+ r = arg.ret;
+ *apf_pending = arg.apf_pending;
+
+ return r;
+}
+
+#endif /* CONFIG_SLAVE_CPU */
+
+static int __vcpu_run(struct kvm_vcpu *vcpu, struct task_struct *task)
+{
+ int r, apf_pending = 0;
struct kvm *kvm = vcpu->kvm;

if (unlikely(vcpu->arch.mp_state == KVM_MP_STATE_SIPI_RECEIVED)) {
@@ -5438,9 +5578,18 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
r = 1;
while (r > 0) {
if (vcpu->arch.mp_state == KVM_MP_STATE_RUNNABLE &&
- !vcpu->arch.apf.halted)
- r = vcpu_enter_guest(vcpu);
- else {
+ !vcpu->arch.apf.halted) {
+#ifdef CONFIG_SLAVE_CPU
+ apf_pending = 0;
+ if (vcpu->arch.slave_cpu >= 0) {
+ r = vcpu_enter_guest_slave(vcpu, task,
+ &apf_pending);
+ if (r == LOOP_RETRY)
+ continue;
+ } else
+#endif
+ r = vcpu_enter_guest(vcpu, task);
+ } else {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
kvm_vcpu_block(vcpu);
vcpu->srcu_idx = srcu_read_lock(&kvm->srcu);
@@ -5461,26 +5610,15 @@ static int __vcpu_run(struct kvm_vcpu *vcpu)
}
}

+ if (apf_pending)
+ kvm_check_async_pf_completion(vcpu);
+
if (r <= 0)
break;

- clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
- if (kvm_cpu_has_pending_timer(vcpu))
- kvm_inject_pending_timer_irqs(vcpu);
+ if (r == LOOP_ONLINE)
+ r = vcpu_post_run(vcpu, task, NULL);

- if (dm_request_for_irq_injection(vcpu)) {
- r = -EINTR;
- vcpu->run->exit_reason = KVM_EXIT_INTR;
- ++vcpu->stat.request_irq_exits;
- }
-
- kvm_check_async_pf_completion(vcpu);
-
- if (signal_pending(current)) {
- r = -EINTR;
- vcpu->run->exit_reason = KVM_EXIT_INTR;
- ++vcpu->stat.signal_exits;
- }
if (need_resched()) {
srcu_read_unlock(&kvm->srcu, vcpu->srcu_idx);
kvm_resched(vcpu);
@@ -5582,8 +5720,7 @@ int kvm_arch_vcpu_ioctl_run(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
if (r <= 0)
goto out;

- r = __vcpu_run(vcpu);
-
+ r = __vcpu_run(vcpu, current);
out:
post_kvm_run_save(vcpu);
if (vcpu->sigset_active)
@@ -6022,6 +6159,7 @@ int kvm_arch_vcpu_setup(struct kvm_vcpu *vcpu)
r = kvm_arch_vcpu_reset(vcpu);
if (r == 0)
r = kvm_mmu_setup(vcpu);
+ vcpu->arch.task = current;
vcpu_put(vcpu);

return r;
@@ -6204,6 +6342,10 @@ int kvm_arch_vcpu_init(struct kvm_vcpu *vcpu)

kvm_set_tsc_khz(vcpu, max_tsc_khz);

+#ifdef CONFIG_SLAVE_CPU
+ vcpu->arch.slave_cpu = -1;
+#endif
+
r = kvm_mmu_create(vcpu);
if (r < 0)
goto fail_free_pio_data;
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c446435..c44a7be 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -119,6 +119,7 @@ struct kvm_async_pf {
};

void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu);
+int kvm_can_complete_async_pf(struct kvm_vcpu *vcpu);
void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu);
int kvm_setup_async_pf(struct kvm_vcpu *vcpu, gva_t gva, gfn_t gfn,
struct kvm_arch_async_pf *arch);
diff --git a/kernel/smp.c b/kernel/smp.c
index 6e42573..081d700 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -431,6 +431,7 @@ void __smp_call_function_single(int cpu, struct call_single_data *data,
}
put_cpu();
}
+EXPORT_SYMBOL(__smp_call_function_single);

/**
* smp_call_function_many(): Run a function on a set of other CPUs.
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index 74268b4..feb5e76 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -120,12 +120,17 @@ void kvm_clear_async_pf_completion_queue(struct kvm_vcpu *vcpu)
vcpu->async_pf.queued = 0;
}

+int kvm_can_complete_async_pf(struct kvm_vcpu *vcpu)
+{
+ return !list_empty_careful(&vcpu->async_pf.done) &&
+ kvm_arch_can_inject_async_page_present(vcpu);
+}
+
void kvm_check_async_pf_completion(struct kvm_vcpu *vcpu)
{
struct kvm_async_pf *work;

- while (!list_empty_careful(&vcpu->async_pf.done) &&
- kvm_arch_can_inject_async_page_present(vcpu)) {
+ while (kvm_can_complete_async_pf(vcpu)) {
spin_lock(&vcpu->async_pf.lock);
work = list_first_entry(&vcpu->async_pf.done, typeof(*work),
link);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index f5890f0..ff8b418 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1531,7 +1531,8 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
}

me = get_cpu();
- if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
+ if (cpu != me && (unsigned)cpu < nr_cpu_ids &&
+ (cpu_online(cpu) || cpu_slave(cpu)))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
put_cpu();

2012-06-28 06:08:19

by Tomoki Sekiyama

Subject: [RFC PATCH 08/18] KVM: Add KVM_GET_SLAVE_CPU and KVM_SET_SLAVE_CPU to vCPU ioctl

Add an interface to set/get the slave CPU dedicated to a vCPU.

By calling ioctl with KVM_GET_SLAVE_CPU, users can get the slave CPU id for
the vCPU; -1 is returned if no slave CPU is set.

By calling ioctl with KVM_SET_SLAVE_CPU, users can dedicate the specified
slave CPU to the vCPU. The CPU must be offlined before calling the ioctl.
The CPU is activated as a slave CPU for the vCPU when a valid id is set.
The slave CPU is freed and offlined again by setting -1 as the slave CPU id.

Whether getting/setting slave CPUs is supported by KVM can be determined by
checking KVM_CAP_SLAVE_CPU.
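
For illustration, a minimal userspace sketch (not part of this patch) that
checks the capability and dedicates a slave CPU to an already-created vCPU;
set_slave_cpu() is a hypothetical helper, and kvm_fd / vcpu_fd are assumed
to be valid file descriptors obtained through the usual KVM setup:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Check the capability and dedicate the given slave CPU to a vCPU. */
static int set_slave_cpu(int kvm_fd, int vcpu_fd, int slave)
{
        int r;

        r = ioctl(kvm_fd, KVM_CHECK_EXTENSION, KVM_CAP_SLAVE_CPU);
        if (r <= 0)
                return -1;              /* slave CPU support not available */

        r = ioctl(vcpu_fd, KVM_SET_SLAVE_CPU, slave);
        if (r < 0)
                perror("KVM_SET_SLAVE_CPU");

        /* KVM_GET_SLAVE_CPU returns the current slave CPU id, or -1. */
        printf("slave cpu is now %d\n", ioctl(vcpu_fd, KVM_GET_SLAVE_CPU));
        return r;
}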

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 2 +
arch/x86/kvm/vmx.c | 7 +++++
arch/x86/kvm/x86.c | 54 +++++++++++++++++++++++++++++++++++++++
include/linux/kvm.h | 4 +++
4 files changed, 67 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 80f7b3b..6ae43ef 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -704,6 +704,8 @@ struct kvm_x86_ops {
int (*check_intercept)(struct kvm_vcpu *vcpu,
struct x86_instruction_info *info,
enum x86_intercept_stage stage);
+
+ void (*set_slave_mode)(struct kvm_vcpu *vcpu, bool slave);
};

struct kvm_arch_async_pf {
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 9ee2d9a..42655a5 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7134,6 +7134,11 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,
return X86EMUL_CONTINUE;
}

+static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
+{
+ /* Nothing */
+}
+
static struct kvm_x86_ops vmx_x86_ops = {
.cpu_has_kvm_support = cpu_has_kvm_support,
.disabled_by_bios = vmx_disabled_by_bios,
@@ -7224,6 +7229,8 @@ static struct kvm_x86_ops vmx_x86_ops = {
.set_tdp_cr3 = vmx_set_cr3,

.check_intercept = vmx_check_intercept,
+
+ .set_slave_mode = vmx_set_slave_mode,
};

static int __init vmx_init(void)
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 7b9f2a5..4216d55 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2156,6 +2156,9 @@ int kvm_dev_ioctl_check_extension(long ext)
case KVM_CAP_GET_TSC_KHZ:
case KVM_CAP_PCI_2_3:
case KVM_CAP_KVMCLOCK_CTRL:
+#ifdef CONFIG_SLAVE_CPU
+ case KVM_CAP_SLAVE_CPU:
+#endif
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -2630,6 +2633,44 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)
return 0;
}

+#ifdef CONFIG_SLAVE_CPU
+
+static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
+ int slave, int set_slave_mode)
+{
+ int old = vcpu->arch.slave_cpu;
+ int r = -EINVAL;
+
+ if (slave >= nr_cpu_ids || cpu_online(slave))
+ goto out;
+ if (slave >= 0 && slave != old && cpu_slave(slave))
+ goto out; /* new slave cpu must be offlined */
+
+ if (old >= 0 && slave != old) {
+ BUG_ON(!cpu_slave(old));
+ r = slave_cpu_down(vcpu->arch.slave_cpu);
+ if (r) {
+ pr_err("kvm: slave_cpu_down %d failed\n", old);
+ goto out;
+ }
+ }
+
+ if (slave >= 0) {
+ r = slave_cpu_up(slave);
+ if (r)
+ goto out;
+ BUG_ON(!cpu_slave(slave));
+ }
+
+ vcpu->arch.slave_cpu = slave;
+ if (set_slave_mode && kvm_x86_ops->set_slave_mode)
+ kvm_x86_ops->set_slave_mode(vcpu, slave >= 0);
+out:
+ return r;
+}
+
+#endif
+
long kvm_arch_vcpu_ioctl(struct file *filp,
unsigned int ioctl, unsigned long arg)
{
@@ -2910,6 +2951,16 @@ long kvm_arch_vcpu_ioctl(struct file *filp,
r = kvm_set_guest_paused(vcpu);
goto out;
}
+#ifdef CONFIG_SLAVE_CPU
+ case KVM_SET_SLAVE_CPU: {
+ r = kvm_arch_vcpu_ioctl_set_slave_cpu(vcpu, (int)arg, 1);
+ goto out;
+ }
+ case KVM_GET_SLAVE_CPU: {
+ r = vcpu->arch.slave_cpu;
+ goto out;
+ }
+#endif
default:
r = -EINVAL;
}
@@ -6144,6 +6195,9 @@ void kvm_put_guest_fpu(struct kvm_vcpu *vcpu)
void kvm_arch_vcpu_free(struct kvm_vcpu *vcpu)
{
kvmclock_reset(vcpu);
+#ifdef CONFIG_SLAVE_CPU
+ kvm_arch_vcpu_ioctl_set_slave_cpu(vcpu, -1, 0);
+#endif

free_cpumask_var(vcpu->arch.wbinvd_dirty_mask);
fx_free(vcpu);
diff --git a/include/linux/kvm.h b/include/linux/kvm.h
index 09f2b3a..65607e5 100644
--- a/include/linux/kvm.h
+++ b/include/linux/kvm.h
@@ -617,6 +617,7 @@ struct kvm_ppc_smmu_info {
#define KVM_CAP_SIGNAL_MSI 77
#define KVM_CAP_PPC_GET_SMMU_INFO 78
#define KVM_CAP_S390_COW 79
+#define KVM_CAP_SLAVE_CPU 80

#ifdef KVM_CAP_IRQ_ROUTING

@@ -901,6 +902,9 @@ struct kvm_s390_ucas_mapping {
#define KVM_SET_ONE_REG _IOW(KVMIO, 0xac, struct kvm_one_reg)
/* VM is being stopped by host */
#define KVM_KVMCLOCK_CTRL _IO(KVMIO, 0xad)
+/* Available with KVM_CAP_SLAVE_CPU */
+#define KVM_GET_SLAVE_CPU _IO(KVMIO, 0xae)
+#define KVM_SET_SLAVE_CPU _IO(KVMIO, 0xaf)

#define KVM_DEV_ASSIGN_ENABLE_IOMMU (1 << 0)
#define KVM_DEV_ASSIGN_PCI_2_3 (1 << 1)

2012-06-28 06:08:32

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 09/18] KVM: Go back to online CPU on VM exit by external interrupt

If a slave CPU receives an interrupt while running a guest, the current
implementation must go back to an online CPU once to handle the interrupt.

This behavior will be replaced by a later patch, which introduces a
mechanism for direct interrupt handling by the guest.

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/vmx.c | 1 +
arch/x86/kvm/x86.c | 6 ++++++
3 files changed, 8 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6ae43ef..f8fd1a1 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -350,6 +350,7 @@ struct kvm_vcpu_arch {
int sipi_vector;
u64 ia32_misc_enable_msr;
bool tpr_access_reporting;
+ bool interrupted;

#ifdef CONFIG_SLAVE_CPU
/* slave cpu dedicated to this vcpu */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 42655a5..e24392c 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4328,6 +4328,7 @@ static int handle_exception(struct kvm_vcpu *vcpu)

static int handle_external_interrupt(struct kvm_vcpu *vcpu)
{
+ vcpu->arch.interrupted = true;
++vcpu->stat.irq_exits;
return 1;
}
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 4216d55..1558ec2 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5553,6 +5553,12 @@ static void __vcpu_enter_guest_slave(void *_arg)
break;

/* determine if slave cpu can handle the exit alone */
+ if (vcpu->arch.interrupted) {
+ vcpu->arch.interrupted = false;
+ arg->ret = LOOP_ONLINE;
+ break;
+ }
+
r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);

if (arg->ret == LOOP_SLAVE &&

2012-06-28 06:08:36

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 11/18] KVM: no exiting from guest when slave CPU halted

Avoid exiting from a guest on a slave CPU even when the HLT instruction
is executed. Since the slave CPU is dedicated to a vCPU, exiting on HLT
is not required, and avoiding the VM exit improves the guest's performance.

This is a partial revert of

10166744b80a ("KVM: VMX: remove yield_on_hlt")

Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/kvm/vmx.c | 22 +++++++++++++++++++++-
1 files changed, 21 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index e24392c..f0c6532 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -1688,6 +1688,17 @@ static void skip_emulated_instruction(struct kvm_vcpu *vcpu)
vmx_set_interrupt_shadow(vcpu, 0);
}

+static void vmx_clear_hlt(struct kvm_vcpu *vcpu)
+{
+ /* Ensure that we clear the HLT state in the VMCS. We don't need to
+ * explicitly skip the instruction because if the HLT state is set, then
+ * the instruction is already executing and RIP has already been
+ * advanced. */
+ if (vcpu->arch.slave_cpu >= 0 &&
+ vmcs_read32(GUEST_ACTIVITY_STATE) == GUEST_ACTIVITY_HLT)
+ vmcs_write32(GUEST_ACTIVITY_STATE, GUEST_ACTIVITY_ACTIVE);
+}
+
/*
* KVM wants to inject page-faults which it got to the guest. This function
* checks whether in a nested guest, we need to inject them to L1 or L2.
@@ -1740,6 +1751,7 @@ static void vmx_queue_exception(struct kvm_vcpu *vcpu, unsigned nr,
intr_info |= INTR_TYPE_HARD_EXCEPTION;

vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr_info);
+ vmx_clear_hlt(vcpu);
}

static bool vmx_rdtscp_supported(void)
@@ -4045,6 +4057,7 @@ static void vmx_inject_irq(struct kvm_vcpu *vcpu)
} else
intr |= INTR_TYPE_EXT_INTR;
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD, intr);
+ vmx_clear_hlt(vcpu);
}

static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
@@ -4076,6 +4089,7 @@ static void vmx_inject_nmi(struct kvm_vcpu *vcpu)
}
vmcs_write32(VM_ENTRY_INTR_INFO_FIELD,
INTR_TYPE_NMI_INTR | INTR_INFO_VALID_MASK | NMI_VECTOR);
+ vmx_clear_hlt(vcpu);
}

static int vmx_nmi_allowed(struct kvm_vcpu *vcpu)
@@ -7137,7 +7151,13 @@ static int vmx_check_intercept(struct kvm_vcpu *vcpu,

static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
{
- /* Nothing */
+ if (slave) {
+ vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
+ CPU_BASED_HLT_EXITING);
+ } else {
+ vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
+ CPU_BASED_HLT_EXITING);
+ }
}

static struct kvm_x86_ops vmx_x86_ops = {

2012-06-28 06:08:43

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 13/18] x86/apic: IRQ vector remapping on slave for slave CPUs

Add a facility to use an IRQ vector on slave CPUs that is different from
the one used on online CPUs.

When an alternative vector for an IRQ is registered with
remap_slave_vector_irq() and the IRQ affinity is set to slave CPUs only,
the device is configured to use the alternative vector.

The current patch only supports MSI and Intel CPUs with the IOMMU interrupt remapper.

This is intended to be used for routing interrupts directly to a KVM guest
running on slave CPUs, which do not cause VM exits on external
interrupts.
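
A rough usage sketch of the new interface follows (not part of the patch;
the IRQ number and vectors are made up, and slave_mask is assumed to
contain slave CPUs only, as remap_slave_vector_irq() requires):

  #include <linux/kernel.h>
  #include <linux/cpumask.h>
  #include <asm/irq.h>

  /* Sketch: let host IRQ 40 use guest vector 0x41 on the CPUs in slave_mask. */
  static void example_remap(const struct cpumask *slave_mask)
  {
          unsigned int irq = 40;          /* made-up host IRQ */
          u8 host_vec = 0x61;             /* made-up vector used on online CPUs */
          u8 guest_vec = 0x41;            /* made-up vector expected by the guest */

          remap_slave_vector_irq(irq, guest_vec, slave_mask);

          /* When the MSI message/IRTE for this IRQ is programmed and the
           * affinity mask contains no online CPUs, the remapped vector is
           * picked up in place of the host one: */
          pr_info("programmed vector: %u\n",
                  get_remapped_slave_vector(host_vec, irq, slave_mask));

          /* Tear the mapping down again: */
          revert_slave_vector_irq(irq, slave_mask);
  }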

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/irq.h | 15 ++++++++
arch/x86/kernel/apic/io_apic.c | 68 ++++++++++++++++++++++++++++++++++-
drivers/iommu/intel_irq_remapping.c | 8 +++-
3 files changed, 88 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/irq.h b/arch/x86/include/asm/irq.h
index ba870bb..84756f7 100644
--- a/arch/x86/include/asm/irq.h
+++ b/arch/x86/include/asm/irq.h
@@ -41,4 +41,19 @@ extern int vector_used_by_percpu_irq(unsigned int vector);

extern void init_ISA_irqs(void);

+#ifdef CONFIG_SLAVE_CPU
+extern void remap_slave_vector_irq(int irq, int vector,
+ const struct cpumask *mask);
+extern void revert_slave_vector_irq(int irq, const struct cpumask *mask);
+extern u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+ const struct cpumask *mask);
+#else
+static inline u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+ const struct cpumask *mask)
+{
+ return vector;
+}
+#endif
+
+
#endif /* _ASM_X86_IRQ_H */
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 91b3905..916dbf5 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1257,6 +1257,69 @@ void __setup_vector_irq(int cpu)
raw_spin_unlock(&vector_lock);
}

+#ifdef CONFIG_SLAVE_CPU
+
+/* vector table remapped on slave cpus, indexed by IRQ */
+static DEFINE_PER_CPU(u8[NR_IRQS], slave_vector_remap_tbl) = {
+ [0 ... NR_IRQS - 1] = 0,
+};
+
+void remap_slave_vector_irq(int irq, int vector, const struct cpumask *mask)
+{
+ int cpu;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ for_each_cpu(cpu, mask) {
+ BUG_ON(!cpu_slave(cpu));
+ per_cpu(slave_vector_remap_tbl, cpu)[irq] = vector;
+ per_cpu(vector_irq, cpu)[vector] = irq;
+ }
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+}
+EXPORT_SYMBOL_GPL(remap_slave_vector_irq);
+
+void revert_slave_vector_irq(int irq, const struct cpumask *mask)
+{
+ int cpu;
+ u8 vector;
+ unsigned long flags;
+
+ raw_spin_lock_irqsave(&vector_lock, flags);
+ for_each_cpu(cpu, mask) {
+ BUG_ON(!cpu_slave(cpu));
+ vector = per_cpu(slave_vector_remap_tbl, cpu)[irq];
+ if (vector) {
+ per_cpu(vector_irq, cpu)[vector] = -1;
+ per_cpu(slave_vector_remap_tbl, cpu)[irq] = 0;
+ }
+ }
+ raw_spin_unlock_irqrestore(&vector_lock, flags);
+}
+EXPORT_SYMBOL_GPL(revert_slave_vector_irq);
+
+/* If all target CPUs are slaves, return the remapped vector */
+u8 get_remapped_slave_vector(u8 vector, unsigned int irq,
+ const struct cpumask *mask)
+{
+ u8 slave_vector;
+
+ if (vector < FIRST_EXTERNAL_VECTOR ||
+ cpumask_intersects(mask, cpu_online_mask))
+ return vector;
+
+ slave_vector = per_cpu(slave_vector_remap_tbl,
+ cpumask_first(mask))[irq];
+ if (slave_vector >= FIRST_EXTERNAL_VECTOR)
+ vector = slave_vector;
+
+ pr_info("slave vector remap: irq: %d => vector: %d\n", irq, vector);
+
+ return vector;
+}
+
+#endif
+
static struct irq_chip ioapic_chip;

#ifdef CONFIG_X86_32
@@ -3080,6 +3143,7 @@ static int
msi_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)
{
struct irq_cfg *cfg = data->chip_data;
+ int vector = cfg->vector;
struct msi_msg msg;
unsigned int dest;

@@ -3088,8 +3152,10 @@ msi_set_affinity(struct irq_data *data, const struct cpumask *mask, bool force)

__get_cached_msi_msg(data->msi_desc, &msg);

+ vector = get_remapped_slave_vector(vector, data->irq, mask);
+
msg.data &= ~MSI_DATA_VECTOR_MASK;
- msg.data |= MSI_DATA_VECTOR(cfg->vector);
+ msg.data |= MSI_DATA_VECTOR(vector);
msg.address_lo &= ~MSI_ADDR_DEST_ID_MASK;
msg.address_lo |= MSI_ADDR_DEST_ID(dest);

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 0045139..2c6f4d3 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -934,9 +934,14 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
if (assign_irq_vector(irq, cfg, mask))
return -EBUSY;

+ /* Set affinity to either online cpus only or slave cpus only */
+ cpumask_and(data->affinity, mask, cpu_online_mask);
+ if (unlikely(cpumask_empty(data->affinity)))
+ cpumask_copy(data->affinity, mask);
+
dest = apic->cpu_mask_to_apicid_and(cfg->domain, mask);

- irte.vector = cfg->vector;
+ irte.vector = get_remapped_slave_vector(cfg->vector, irq, mask);
irte.dest_id = IRTE_DEST(dest);

/*
@@ -953,7 +958,6 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
if (cfg->move_in_progress)
send_cleanup_vector(cfg);

- cpumask_copy(data->affinity, mask);
return 0;
}
#endif

2012-06-28 06:09:06

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 15/18] KVM: vmx: Add definitions PIN_BASED_PREEMPTION_TIMER

Add some definitions needed to use PIN_BASED_PREEMPTION_TIMER.

When PIN_BASED_PREEMPTION_TIMER is enabled, the guest will exit
with reason=EXIT_REASON_PREEMPTION_TIMER when the counter specified in
VMX_PREEMPTION_TIMER_VALUE reaches 0.
This patch also adds a dummy handler for EXIT_REASON_PREEMPTION_TIMER,
which simply resumes VM execution.

These are currently intended only to be used to avoid entering the
guest on a slave CPU when vmx_prevent_run(vcpu, 1) is called.
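
For illustration, a minimal sketch of how these definitions can force an
immediate exit around VM entry (roughly what the later vmx_prevent_run()
patch does; vmcs_set_bits()/vmcs_clear_bits()/vmcs_write32() are the
existing VMCS accessors in vmx.c):

  /* Arm the preemption timer with a zero count: the next VM entry will
   * exit immediately with EXIT_REASON_PREEMPTION_TIMER. */
  vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_PREEMPTION_TIMER);
  vmcs_write32(VMX_PREEMPTION_TIMER_VALUE, 0);

  /* ... VMLAUNCH/VMRESUME exits at once and reaches handle_preemption_timer() ... */

  /* Disarm it to let the guest run normally again: */
  vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL, PIN_BASED_PREEMPTION_TIMER);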

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/vmx.h | 3 +++
arch/x86/kvm/trace.h | 1 +
arch/x86/kvm/vmx.c | 7 +++++++
3 files changed, 11 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/vmx.h b/arch/x86/include/asm/vmx.h
index 31f180c..5ff43b1 100644
--- a/arch/x86/include/asm/vmx.h
+++ b/arch/x86/include/asm/vmx.h
@@ -65,6 +65,7 @@
#define PIN_BASED_EXT_INTR_MASK 0x00000001
#define PIN_BASED_NMI_EXITING 0x00000008
#define PIN_BASED_VIRTUAL_NMIS 0x00000020
+#define PIN_BASED_PREEMPTION_TIMER 0x00000040

#define VM_EXIT_SAVE_DEBUG_CONTROLS 0x00000002
#define VM_EXIT_HOST_ADDR_SPACE_SIZE 0x00000200
@@ -195,6 +196,7 @@ enum vmcs_field {
GUEST_INTERRUPTIBILITY_INFO = 0x00004824,
GUEST_ACTIVITY_STATE = 0X00004826,
GUEST_SYSENTER_CS = 0x0000482A,
+ VMX_PREEMPTION_TIMER_VALUE = 0x0000482E,
HOST_IA32_SYSENTER_CS = 0x00004c00,
CR0_GUEST_HOST_MASK = 0x00006000,
CR4_GUEST_HOST_MASK = 0x00006002,
@@ -279,6 +281,7 @@ enum vmcs_field {
#define EXIT_REASON_APIC_ACCESS 44
#define EXIT_REASON_EPT_VIOLATION 48
#define EXIT_REASON_EPT_MISCONFIG 49
+#define EXIT_REASON_PREEMPTION_TIMER 52
#define EXIT_REASON_WBINVD 54
#define EXIT_REASON_XSETBV 55

diff --git a/arch/x86/kvm/trace.h b/arch/x86/kvm/trace.h
index 911d264..28b4df8 100644
--- a/arch/x86/kvm/trace.h
+++ b/arch/x86/kvm/trace.h
@@ -218,6 +218,7 @@ TRACE_EVENT(kvm_apic,
{ EXIT_REASON_APIC_ACCESS, "APIC_ACCESS" }, \
{ EXIT_REASON_EPT_VIOLATION, "EPT_VIOLATION" }, \
{ EXIT_REASON_EPT_MISCONFIG, "EPT_MISCONFIG" }, \
+ { EXIT_REASON_PREEMPTION_TIMER, "PREEMPTION_TIMER" }, \
{ EXIT_REASON_WBINVD, "WBINVD" }

#define SVM_EXIT_REASONS \
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 3aea448..2c987d1 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4347,6 +4347,12 @@ static int handle_external_interrupt(struct kvm_vcpu *vcpu)
return 1;
}

+static int handle_preemption_timer(struct kvm_vcpu *vcpu)
+{
+ /* Nothing */
+ return 1;
+}
+
static int handle_triple_fault(struct kvm_vcpu *vcpu)
{
vcpu->run->exit_reason = KVM_EXIT_SHUTDOWN;
@@ -5645,6 +5651,7 @@ static int (*kvm_vmx_exit_handlers[])(struct kvm_vcpu *vcpu) = {
[EXIT_REASON_VMON] = handle_vmon,
[EXIT_REASON_TPR_BELOW_THRESHOLD] = handle_tpr_below_threshold,
[EXIT_REASON_APIC_ACCESS] = handle_apic_access,
+ [EXIT_REASON_PREEMPTION_TIMER] = handle_preemption_timer,
[EXIT_REASON_WBINVD] = handle_wbinvd,
[EXIT_REASON_XSETBV] = handle_xsetbv,
[EXIT_REASON_TASK_SWITCH] = handle_task_switch,

2012-06-28 06:08:48

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 12/18] x86/apic: Enable external interrupt routing to slave CPUs

Enable the APIC to handle interrupts on slave CPUs, and enable interrupt
routing to slave CPUs by setting IRQ affinity.

As slave CPUs running a KVM guest handle external interrupts directly in
the vCPUs, the guest's vector/IRQ mapping is different from the host's.
That requires interrupts to be routed to either online CPUs or slave CPUs.

With this patch, if the specified affinity contains online CPUs, the
affinity settings are applied to online CPUs only. If every specified CPU
is a slave, the IRQ is routed to the slave CPUs.
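
For example, assuming CPUs 0-2 are online and CPU 3 is a slave, the rule
implemented by the affinity setters below works out as in this sketch:

  /* requested mask {1,3} -> effective affinity {1}   (online CPUs win)
   * requested mask {3}   -> effective affinity {3}   (slave-only mask)  */
  cpumask_and(data->affinity, mask, cpu_online_mask);
  if (cpumask_empty(data->affinity))
          cpumask_copy(data->affinity, mask);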

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/apic.h | 4 ++--
arch/x86/kernel/apic/apic_flat_64.c | 2 +-
arch/x86/kernel/apic/io_apic.c | 21 ++++++++++++---------
arch/x86/kernel/apic/x2apic_cluster.c | 6 +++---
arch/x86/kernel/apic/x2apic_phys.c | 2 +-
drivers/iommu/intel_irq_remapping.c | 2 +-
kernel/irq/manage.c | 4 ++--
kernel/irq/migration.c | 2 +-
kernel/irq/proc.c | 2 +-
9 files changed, 24 insertions(+), 21 deletions(-)

diff --git a/arch/x86/include/asm/apic.h b/arch/x86/include/asm/apic.h
index eaff479..706c4cd 100644
--- a/arch/x86/include/asm/apic.h
+++ b/arch/x86/include/asm/apic.h
@@ -531,7 +531,7 @@ extern void generic_bigsmp_probe(void);
static inline const struct cpumask *default_target_cpus(void)
{
#ifdef CONFIG_SMP
- return cpu_online_mask;
+ return cpu_online_or_slave_mask;
#else
return cpumask_of(0);
#endif
@@ -598,7 +598,7 @@ default_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
{
unsigned long mask1 = cpumask_bits(cpumask)[0];
unsigned long mask2 = cpumask_bits(andmask)[0];
- unsigned long mask3 = cpumask_bits(cpu_online_mask)[0];
+ unsigned long mask3 = cpumask_bits(cpu_online_or_slave_mask)[0];

return (unsigned int)(mask1 & mask2 & mask3);
}
diff --git a/arch/x86/kernel/apic/apic_flat_64.c b/arch/x86/kernel/apic/apic_flat_64.c
index 0e881c4..e0836e9 100644
--- a/arch/x86/kernel/apic/apic_flat_64.c
+++ b/arch/x86/kernel/apic/apic_flat_64.c
@@ -38,7 +38,7 @@ static int flat_acpi_madt_oem_check(char *oem_id, char *oem_table_id)

static const struct cpumask *flat_target_cpus(void)
{
- return cpu_online_mask;
+ return cpu_online_or_slave_mask;
}

static void flat_vector_allocation_domain(int cpu, struct cpumask *retmask)
diff --git a/arch/x86/kernel/apic/io_apic.c b/arch/x86/kernel/apic/io_apic.c
index 5f0ff59..91b3905 100644
--- a/arch/x86/kernel/apic/io_apic.c
+++ b/arch/x86/kernel/apic/io_apic.c
@@ -1125,7 +1125,7 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)

old_vector = cfg->vector;
if (old_vector) {
- cpumask_and(tmp_mask, mask, cpu_online_mask);
+ cpumask_and(tmp_mask, mask, cpu_online_or_slave_mask);
cpumask_and(tmp_mask, cfg->domain, tmp_mask);
if (!cpumask_empty(tmp_mask)) {
free_cpumask_var(tmp_mask);
@@ -1135,7 +1135,7 @@ __assign_irq_vector(int irq, struct irq_cfg *cfg, const struct cpumask *mask)

/* Only try and allocate irqs on cpus that are present */
err = -ENOSPC;
- for_each_cpu_and(cpu, mask, cpu_online_mask) {
+ for_each_cpu_and(cpu, mask, cpu_online_or_slave_mask) {
int new_cpu;
int vector, offset;

@@ -1156,7 +1156,7 @@ next:
if (test_bit(vector, used_vectors))
goto next;

- for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
+ for_each_cpu_and(new_cpu, tmp_mask, cpu_online_or_slave_mask)
if (per_cpu(vector_irq, new_cpu)[vector] != -1)
goto next;
/* Found one! */
@@ -1166,7 +1166,7 @@ next:
cfg->move_in_progress = 1;
cpumask_copy(cfg->old_domain, cfg->domain);
}
- for_each_cpu_and(new_cpu, tmp_mask, cpu_online_mask)
+ for_each_cpu_and(new_cpu, tmp_mask, cpu_online_or_slave_mask)
per_cpu(vector_irq, new_cpu)[vector] = irq;
cfg->vector = vector;
cpumask_copy(cfg->domain, tmp_mask);
@@ -2200,10 +2200,10 @@ void send_cleanup_vector(struct irq_cfg *cfg)

if (unlikely(!alloc_cpumask_var(&cleanup_mask, GFP_ATOMIC))) {
unsigned int i;
- for_each_cpu_and(i, cfg->old_domain, cpu_online_mask)
+ for_each_cpu_and(i, cfg->old_domain, cpu_online_or_slave_mask)
apic->send_IPI_mask(cpumask_of(i), IRQ_MOVE_CLEANUP_VECTOR);
} else {
- cpumask_and(cleanup_mask, cfg->old_domain, cpu_online_mask);
+ cpumask_and(cleanup_mask, cfg->old_domain, cpu_online_or_slave_mask);
apic->send_IPI_mask(cleanup_mask, IRQ_MOVE_CLEANUP_VECTOR);
free_cpumask_var(cleanup_mask);
}
@@ -2244,15 +2244,18 @@ int __ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
{
struct irq_cfg *cfg = data->chip_data;

- if (!cpumask_intersects(mask, cpu_online_mask))
+ if (!cpumask_intersects(mask, cpu_online_or_slave_mask))
return -1;

if (assign_irq_vector(data->irq, data->chip_data, mask))
return -1;

- cpumask_copy(data->affinity, mask);
+ /* Set affinity to either online cpus only or slave cpus only */
+ cpumask_and(data->affinity, mask, cpu_online_mask);
+ if (unlikely(cpumask_empty(data->affinity)))
+ cpumask_copy(data->affinity, mask);

- *dest_id = apic->cpu_mask_to_apicid_and(mask, cfg->domain);
+ *dest_id = apic->cpu_mask_to_apicid_and(data->affinity, cfg->domain);
return 0;
}

diff --git a/arch/x86/kernel/apic/x2apic_cluster.c b/arch/x86/kernel/apic/x2apic_cluster.c
index ff35cff..0a4d8af 100644
--- a/arch/x86/kernel/apic/x2apic_cluster.c
+++ b/arch/x86/kernel/apic/x2apic_cluster.c
@@ -121,7 +121,7 @@ x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
* May as well be the first.
*/
for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
+ if (cpumask_test_cpu(cpu, cpu_online_or_slave_mask))
break;
}

@@ -136,7 +136,7 @@ static void init_x2apic_ldr(void)
per_cpu(x86_cpu_to_logical_apicid, this_cpu) = apic_read(APIC_LDR);

__cpu_set(this_cpu, per_cpu(cpus_in_cluster, this_cpu));
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, cpu_online_or_slave_mask) {
if (x2apic_cluster(this_cpu) != x2apic_cluster(cpu))
continue;
__cpu_set(this_cpu, per_cpu(cpus_in_cluster, cpu));
@@ -168,7 +168,7 @@ update_clusterinfo(struct notifier_block *nfb, unsigned long action, void *hcpu)
case CPU_UP_CANCELED:
case CPU_UP_CANCELED_FROZEN:
case CPU_DEAD:
- for_each_online_cpu(cpu) {
+ for_each_cpu(cpu, cpu_online_or_slave_mask) {
if (x2apic_cluster(this_cpu) != x2apic_cluster(cpu))
continue;
__cpu_clear(this_cpu, per_cpu(cpus_in_cluster, cpu));
diff --git a/arch/x86/kernel/apic/x2apic_phys.c b/arch/x86/kernel/apic/x2apic_phys.c
index c17e982..d4ac67e 100644
--- a/arch/x86/kernel/apic/x2apic_phys.c
+++ b/arch/x86/kernel/apic/x2apic_phys.c
@@ -101,7 +101,7 @@ x2apic_cpu_mask_to_apicid_and(const struct cpumask *cpumask,
* May as well be the first.
*/
for_each_cpu_and(cpu, cpumask, andmask) {
- if (cpumask_test_cpu(cpu, cpu_online_mask))
+ if (cpumask_test_cpu(cpu, cpu_online_or_slave_mask))
break;
}

diff --git a/drivers/iommu/intel_irq_remapping.c b/drivers/iommu/intel_irq_remapping.c
index 6d34706..0045139 100644
--- a/drivers/iommu/intel_irq_remapping.c
+++ b/drivers/iommu/intel_irq_remapping.c
@@ -925,7 +925,7 @@ intel_ioapic_set_affinity(struct irq_data *data, const struct cpumask *mask,
unsigned int dest, irq = data->irq;
struct irte irte;

- if (!cpumask_intersects(mask, cpu_online_mask))
+ if (!cpumask_intersects(mask, cpu_online_or_slave_mask))
return -EINVAL;

if (get_irte(irq, &irte))
diff --git a/kernel/irq/manage.c b/kernel/irq/manage.c
index 8c54823..73e5fd8 100644
--- a/kernel/irq/manage.c
+++ b/kernel/irq/manage.c
@@ -308,13 +308,13 @@ setup_affinity(unsigned int irq, struct irq_desc *desc, struct cpumask *mask)
*/
if (irqd_has_set(&desc->irq_data, IRQD_AFFINITY_SET)) {
if (cpumask_intersects(desc->irq_data.affinity,
- cpu_online_mask))
+ cpu_online_or_slave_mask))
set = desc->irq_data.affinity;
else
irqd_clear(&desc->irq_data, IRQD_AFFINITY_SET);
}

- cpumask_and(mask, cpu_online_mask, set);
+ cpumask_and(mask, cpu_online_or_slave_mask, set);
if (node != NUMA_NO_NODE) {
const struct cpumask *nodemask = cpumask_of_node(node);

diff --git a/kernel/irq/migration.c b/kernel/irq/migration.c
index ca3f4aa..6e3aaa9 100644
--- a/kernel/irq/migration.c
+++ b/kernel/irq/migration.c
@@ -42,7 +42,7 @@ void irq_move_masked_irq(struct irq_data *idata)
* For correct operation this depends on the caller
* masking the irqs.
*/
- if (cpumask_any_and(desc->pending_mask, cpu_online_mask) < nr_cpu_ids)
+ if (cpumask_any_and(desc->pending_mask, cpu_online_or_slave_mask) < nr_cpu_ids)
irq_do_set_affinity(&desc->irq_data, desc->pending_mask, false);

cpumask_clear(desc->pending_mask);
diff --git a/kernel/irq/proc.c b/kernel/irq/proc.c
index 4bd4faa..76bd7b2 100644
--- a/kernel/irq/proc.c
+++ b/kernel/irq/proc.c
@@ -103,7 +103,7 @@ static ssize_t write_irq_affinity(int type, struct file *file,
* way to make the system unusable accidentally :-) At least
* one online CPU still has to be targeted.
*/
- if (!cpumask_intersects(new_value, cpu_online_mask)) {
+ if (!cpumask_intersects(new_value, cpu_online_or_slave_mask)) {
/* Special case for empty set - allow the architecture
code to set default SMP affinity. */
err = irq_select_affinity_usr(irq, new_value) ? -EINVAL : count;

2012-06-28 06:09:48

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 18/18] x86: request TLB flush to slave CPU using NMI

For slave CPUs, it is inappropriate to request a TLB flush using an IPI,
because the IPI may be delivered to a KVM guest when the slave CPU is
running the guest with direct interrupt routing.

Instead, the TLB flush request is registered in a per-cpu bitmask and an
NMI is sent to interrupt execution of the guest. The NMI handler then
checks the bitmask and handles the pending requests.
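
A sketch of how a slave-CPU user is expected to hook into the new
interface (this mirrors the KVM hunk below; the my_* names and the
per-cpu bitmask layout are made up for illustration):

  #include <linux/percpu.h>
  #include <linux/bitops.h>
  #include <asm/apic.h>
  #include <asm/mmu.h>

  /* Record the sender and kick the slave CPU with an NMI ... */
  static DEFINE_PER_CPU(unsigned long, my_tlbf_pending);

  static void my_tlbf_notify(unsigned int cpu, unsigned int sender)
  {
          set_bit(sender, &per_cpu(my_tlbf_pending, cpu));
          apic->send_IPI_mask(get_cpu_mask(cpu), NMI_VECTOR);
  }

  /* ... and replay the pending flushes from the slave CPU's NMI handler: */
  static void my_tlbf_drain(unsigned int cpu)
  {
          unsigned int sender;

          for_each_set_bit(sender, &per_cpu(my_tlbf_pending, cpu),
                           NUM_INVALIDATE_TLB_VECTORS)
                  if (test_and_clear_bit(sender, &per_cpu(my_tlbf_pending, cpu)))
                          __smp_invalidate_tlb(sender);
  }

  /* Registration, e.g. when the slave CPU is brought up:
   *         register_slave_tlbf_notifier(my_tlbf_notify, cpu);
   */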

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/mmu.h | 7 +++++
arch/x86/kvm/x86.c | 26 ++++++++++++++++++
arch/x86/mm/tlb.c | 63 ++++++++++++++++++++++++++++++++++++++++----
3 files changed, 90 insertions(+), 6 deletions(-)

diff --git a/arch/x86/include/asm/mmu.h b/arch/x86/include/asm/mmu.h
index 5f55e69..25af7f1 100644
--- a/arch/x86/include/asm/mmu.h
+++ b/arch/x86/include/asm/mmu.h
@@ -29,4 +29,11 @@ static inline void leave_mm(int cpu)
}
#endif

+#ifdef CONFIG_SLAVE_CPU
+typedef void(slave_tlbf_notifier_t)(unsigned int cpu, unsigned int sender);
+extern void register_slave_tlbf_notifier(slave_tlbf_notifier_t *f,
+ unsigned int cpu);
+extern void __smp_invalidate_tlb(unsigned int sender);
+#endif
+
#endif /* _ASM_X86_MMU_H */
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 90307f0..6ede776 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2636,6 +2636,7 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)

#ifdef CONFIG_SLAVE_CPU

+static void add_slave_tlbf_request(unsigned int cpu, unsigned int sender);
static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs);

static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
@@ -2663,6 +2664,7 @@ static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
if (r)
goto out;
BUG_ON(!cpu_slave(slave));
+ register_slave_tlbf_notifier(add_slave_tlbf_request, slave);
}

vcpu->arch.slave_cpu = slave;
@@ -5331,6 +5333,9 @@ static void process_nmi(struct kvm_vcpu *vcpu)
/* vcpu currently running on each slave CPU */
static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);

+/* bitmask to store TLB flush sender sent to each CPU */
+static DEFINE_PER_CPU(unsigned long, slave_tlb_flush_requests);
+
void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask)
{
int i;
@@ -5341,6 +5346,25 @@ void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask)
cpumask_set_cpu(vcpu->arch.slave_cpu, mask);
}

+static void add_slave_tlbf_request(unsigned int cpu, unsigned int sender)
+{
+ unsigned long *mask = &per_cpu(slave_tlb_flush_requests, cpu);
+
+ atomic_set_mask(1 << sender, mask);
+ apic->send_IPI_mask(get_cpu_mask(cpu), NMI_VECTOR);
+}
+
+static void handle_slave_tlb_flush_requests(int cpu)
+{
+ unsigned int sender;
+ unsigned long *mask = &per_cpu(slave_tlb_flush_requests, cpu);
+
+ for_each_set_bit(sender, mask, NUM_INVALIDATE_TLB_VECTORS) {
+ atomic_clear_mask(1 << sender, mask);
+ __smp_invalidate_tlb(sender);
+ }
+}
+
static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct kvm_vcpu *vcpu;
@@ -5349,6 +5373,8 @@ static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
if (!cpu_slave(cpu))
return NMI_DONE;

+ handle_slave_tlb_flush_requests(cpu);
+
/* if called from NMI handler after VM exit, no need to prevent run */
vcpu = __this_cpu_read(slave_vcpu);
if (!vcpu || vcpu->mode == OUTSIDE_GUEST_MODE || kvm_is_in_guest())
diff --git a/arch/x86/mm/tlb.c b/arch/x86/mm/tlb.c
index 5e57e11..c53bd43 100644
--- a/arch/x86/mm/tlb.c
+++ b/arch/x86/mm/tlb.c
@@ -12,6 +12,7 @@
#include <asm/cache.h>
#include <asm/apic.h>
#include <asm/uv/uv.h>
+#include <asm/cpu.h>

DEFINE_PER_CPU_SHARED_ALIGNED(struct tlb_state, cpu_tlbstate)
= { &init_mm, 0, };
@@ -55,6 +56,30 @@ static union smp_flush_state flush_state[NUM_INVALIDATE_TLB_VECTORS];

static DEFINE_PER_CPU_READ_MOSTLY(int, tlb_vector_offset);

+#ifdef CONFIG_SLAVE_CPU
+
+static DEFINE_PER_CPU(slave_tlbf_notifier_t *, slave_tlbf_notifier);
+
+void register_slave_tlbf_notifier(slave_tlbf_notifier_t *f, unsigned int cpu)
+{
+ per_cpu(slave_tlbf_notifier, cpu) = f;
+}
+EXPORT_SYMBOL(register_slave_tlbf_notifier);
+
+void request_slave_tlb_flush(const struct cpumask *mask, unsigned int sender)
+{
+ int cpu;
+ slave_tlbf_notifier_t *f;
+
+ for_each_cpu_and(cpu, mask, cpu_slave_mask) {
+ f = per_cpu(slave_tlbf_notifier, cpu);
+ if (f)
+ f(cpu, sender);
+ }
+}
+
+#endif
+
/*
* We cannot call mmdrop() because we are in interrupt context,
* instead update mm->cpu_vm_mask.
@@ -131,17 +156,22 @@ asmlinkage
#endif
void smp_invalidate_interrupt(struct pt_regs *regs)
{
- unsigned int cpu;
unsigned int sender;
- union smp_flush_state *f;

- cpu = smp_processor_id();
/*
* orig_rax contains the negated interrupt vector.
* Use that to determine where the sender put the data.
*/
sender = ~regs->orig_ax - INVALIDATE_TLB_VECTOR_START;
- f = &flush_state[sender];
+ __smp_invalidate_tlb(sender);
+ ack_APIC_irq();
+ inc_irq_stat(irq_tlb_count);
+}
+
+void __smp_invalidate_tlb(unsigned int sender)
+{
+ union smp_flush_state *f = &flush_state[sender];
+ unsigned int cpu = smp_processor_id();

if (!cpumask_test_cpu(cpu, to_cpumask(f->flush_cpumask)))
goto out;
@@ -163,13 +193,13 @@ void smp_invalidate_interrupt(struct pt_regs *regs)
} else
leave_mm(cpu);
}
+
out:
- ack_APIC_irq();
smp_mb__before_clear_bit();
cpumask_clear_cpu(cpu, to_cpumask(f->flush_cpumask));
smp_mb__after_clear_bit();
- inc_irq_stat(irq_tlb_count);
}
+EXPORT_SYMBOL_GPL(__smp_invalidate_tlb);

static void flush_tlb_others_ipi(const struct cpumask *cpumask,
struct mm_struct *mm, unsigned long va)
@@ -191,8 +221,29 @@ static void flush_tlb_others_ipi(const struct cpumask *cpumask,
* We have to send the IPI only to
* CPUs affected.
*/
+#ifdef CONFIG_SLAVE_CPU
+ cpumask_var_t ipi_mask;
+
+ request_slave_tlb_flush(to_cpumask(f->flush_cpumask), sender);
+
+ /* send IPI only to online CPUs */
+ if (!alloc_cpumask_var(&ipi_mask, GFP_KERNEL))
+ /* insufficient memory... send IPI to all CPUs */
+ apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
+ INVALIDATE_TLB_VECTOR_START + sender);
+ else {
+ cpumask_and(ipi_mask, to_cpumask(f->flush_cpumask),
+ cpu_online_mask);
+ request_slave_tlb_flush(to_cpumask(f->flush_cpumask),
+ sender);
+ apic->send_IPI_mask(ipi_mask,
+ INVALIDATE_TLB_VECTOR_START + sender);
+ free_cpumask_var(ipi_mask);
+ }
+#else
apic->send_IPI_mask(to_cpumask(f->flush_cpumask),
INVALIDATE_TLB_VECTOR_START + sender);
+#endif

while (!cpumask_empty(to_cpumask(f->flush_cpumask)))
cpu_relax();

2012-06-28 06:10:00

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 14/18] KVM: Directly handle interrupts by guests without VM EXIT on slave CPUs

Let guests handle interrupts on slave CPUs directly, without a VM exit.
This reduces the host CPU usage needed to transfer interrupts of assigned
PCI devices from the host to guests. It also avoids the cost of VM exits
and shortens the guests' response time to the interrupts.

When a slave CPU is dedicated to a vCPU, exiting on external interrupts is
disabled. Unfortunately, we can only enable/disable exits for external
interrupts as a whole (except NMIs) and cannot switch exits based on IRQ
number or vector. Thus, to avoid IPIs from online CPUs being delivered to
guests, this patch modifies kvm_vcpu_kick() to use NMIs for guests on
slave CPUs.
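
In outline, the kick path after this patch behaves as in the following
simplified sketch of the kvm_main.c change below:

  /* kvm_vcpu_kick(), simplified: */
  if (cpu_online(cpu))
          smp_send_reschedule(cpu);       /* reschedule IPI forces a VM exit */
  else if (cpu_slave(cpu))
          kvm_arch_vcpu_kick_slave(vcpu); /* sends an NMI; only NMIs still exit */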

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/kvm/vmx.c | 4 ++++
arch/x86/kvm/x86.c | 40 ++++++++++++++++++++++++++++++++++++++++
include/linux/kvm_host.h | 1 +
virt/kvm/kvm_main.c | 5 +++--
4 files changed, 48 insertions(+), 2 deletions(-)

diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index f0c6532..3aea448 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -7154,9 +7154,13 @@ static void vmx_set_slave_mode(struct kvm_vcpu *vcpu, bool slave)
if (slave) {
vmcs_clear_bits(CPU_BASED_VM_EXEC_CONTROL,
CPU_BASED_HLT_EXITING);
+ vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+ PIN_BASED_EXT_INTR_MASK);
} else {
vmcs_set_bits(CPU_BASED_VM_EXEC_CONTROL,
CPU_BASED_HLT_EXITING);
+ vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+ PIN_BASED_EXT_INTR_MASK);
}
}

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index df5eb05..2e414a1 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -63,6 +63,7 @@
#include <asm/pvclock.h>
#include <asm/div64.h>
#include <asm/cpu.h>
+#include <asm/nmi.h>
#include <asm/mmu.h>

#define MAX_IO_MSRS 256
@@ -2635,6 +2636,8 @@ static int kvm_set_guest_paused(struct kvm_vcpu *vcpu)

#ifdef CONFIG_SLAVE_CPU

+static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs);
+
static int kvm_arch_vcpu_ioctl_set_slave_cpu(struct kvm_vcpu *vcpu,
int slave, int set_slave_mode)
{
@@ -4998,6 +5001,11 @@ int kvm_arch_init(void *opaque)
if (cpu_has_xsave)
host_xcr0 = xgetbv(XCR_XFEATURE_ENABLED_MASK);

+#ifdef CONFIG_SLAVE_CPU
+ register_nmi_handler(NMI_LOCAL, kvm_arch_kicked_by_nmi, 0,
+ "kvm_kick");
+#endif
+
return 0;

out:
@@ -5014,6 +5022,7 @@ void kvm_arch_exit(void)
unregister_hotcpu_notifier(&kvmclock_cpu_notifier_block);
#ifdef CONFIG_SLAVE_CPU
unregister_slave_cpu_notifier(&kvmclock_slave_cpu_notifier_block);
+ unregister_nmi_handler(NMI_LOCAL, "kvm_kick");
#endif
kvm_x86_ops = NULL;
kvm_mmu_module_exit();
@@ -5311,6 +5320,28 @@ static void process_nmi(struct kvm_vcpu *vcpu)
kvm_make_request(KVM_REQ_EVENT, vcpu);
}

+#ifdef CONFIG_SLAVE_CPU
+/* vcpu currently running on each slave CPU */
+static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);
+
+static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
+{
+ struct kvm_vcpu *vcpu;
+ int cpu = smp_processor_id();
+
+ if (!cpu_slave(cpu))
+ return NMI_DONE;
+
+ /* if called from NMI handler after VM exit, no need to prevent run */
+ vcpu = __this_cpu_read(slave_vcpu);
+ if (!vcpu || vcpu->mode == OUTSIDE_GUEST_MODE || kvm_is_in_guest())
+ return NMI_HANDLED;
+
+ return NMI_HANDLED;
+}
+
+#endif
+
enum vcpu_enter_guest_slave_retval {
EXIT_TO_USER = 0,
LOOP_ONLINE, /* vcpu_post_run is done in online cpu */
@@ -5542,7 +5573,10 @@ static void __vcpu_enter_guest_slave(void *_arg)
kvm_arch_vcpu_load(vcpu, cpu);

while (r == LOOP_SLAVE) {
+ __this_cpu_write(slave_vcpu, vcpu);
+ smp_wmb();
r = vcpu_enter_guest(vcpu, arg->task);
+ __this_cpu_write(slave_vcpu, NULL);

if (unlikely(!irqs_disabled())) {
pr_err("irq is enabled on slave vcpu_etner_guest! - forcely disable\n");
@@ -6692,6 +6726,12 @@ int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;
}

+void kvm_arch_vcpu_kick_slave(struct kvm_vcpu *vcpu)
+{
+ if (kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE)
+ apic->send_IPI_mask(get_cpu_mask(vcpu->cpu), NMI_VECTOR);
+}
+
int kvm_arch_interrupt_allowed(struct kvm_vcpu *vcpu)
{
return kvm_x86_ops->interrupt_allowed(vcpu);
diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index c44a7be..9906908 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -533,6 +533,7 @@ void kvm_arch_hardware_unsetup(void);
void kvm_arch_check_processor_compat(void *rtn);
int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu);
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu);
+void kvm_arch_vcpu_kick_slave(struct kvm_vcpu *vcpu);

void kvm_free_physmem(struct kvm *kvm);

diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index ff8b418..6a989e9 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1531,10 +1531,11 @@ void kvm_vcpu_kick(struct kvm_vcpu *vcpu)
}

me = get_cpu();
- if (cpu != me && (unsigned)cpu < nr_cpu_ids &&
- (cpu_online(cpu) || cpu_slave(cpu)))
+ if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_online(cpu))
if (kvm_arch_vcpu_should_kick(vcpu))
smp_send_reschedule(cpu);
+ if (cpu != me && (unsigned)cpu < nr_cpu_ids && cpu_slave(cpu))
+ kvm_arch_vcpu_kick_slave(vcpu);
put_cpu();
}
#endif /* !CONFIG_S390 */

2012-06-28 06:11:19

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 16/18] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received

Since NMIs cannot be disabled around VM entry, there is a race between
receiving an NMI to kick a guest and entering the guest on slave CPUs. If
the NMI is received just before entering the VM, then after the NMI
handler returns, the CPU continues entering the guest and the effect of
the NMI is lost.

This patch adds kvm_arch_vcpu_prevent_run(), which causes a VM exit right
after VM entry. The NMI handler uses this to ensure that execution of the
guest is cancelled after the NMI.
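
The window being closed can be pictured as follows (a sketch; the actual
calls are in the hunks below and in the earlier slave-CPU patches):

  /*   slave CPU                              kicking (online) CPU
   *   ---------                              --------------------
   *   __this_cpu_write(slave_vcpu, vcpu)
   *                                          kvm_vcpu_kick() -> sends NMI
   *   NMI handler:
   *     kvm_arch_vcpu_prevent_run(vcpu, 1)
   *   VMRESUME
   *     -> exits at once (preemption timer), so the kick is not lost
   */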

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 5 +++++
arch/x86/kvm/vmx.c | 22 +++++++++++++++++++++-
arch/x86/kvm/x86.c | 29 +++++++++++++++++++++++++++++
3 files changed, 55 insertions(+), 1 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 6745057..3d5028f 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -421,6 +421,8 @@ struct kvm_vcpu_arch {
void *insn;
int insn_len;
} page_fault;
+
+ bool prevent_run;
#endif

int halt_request; /* real mode on Intel only */
@@ -668,6 +670,7 @@ struct kvm_x86_ops {

void (*run)(struct kvm_vcpu *vcpu);
int (*handle_exit)(struct kvm_vcpu *vcpu);
+ void (*prevent_run)(struct kvm_vcpu *vcpu, int prevent);
void (*skip_emulated_instruction)(struct kvm_vcpu *vcpu);
void (*set_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
u32 (*get_interrupt_shadow)(struct kvm_vcpu *vcpu, int mask);
@@ -999,4 +1002,6 @@ int kvm_pmu_read_pmc(struct kvm_vcpu *vcpu, unsigned pmc, u64 *data);
void kvm_handle_pmu_event(struct kvm_vcpu *vcpu);
void kvm_deliver_pmi(struct kvm_vcpu *vcpu);

+int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu);
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c
index 2c987d1..4d0d547 100644
--- a/arch/x86/kvm/vmx.c
+++ b/arch/x86/kvm/vmx.c
@@ -4349,7 +4349,7 @@ static int handle_external_interrupt(struct kvm_vcpu *vcpu)

static int handle_preemption_timer(struct kvm_vcpu *vcpu)
{
- /* Nothing */
+ kvm_arch_vcpu_run_prevented(vcpu);
return 1;
}

@@ -5929,6 +5929,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
}

if (exit_reason & VMX_EXIT_REASONS_FAILED_VMENTRY) {
+ if (vcpu->arch.prevent_run)
+ return kvm_arch_vcpu_run_prevented(vcpu);
vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
vcpu->run->fail_entry.hardware_entry_failure_reason
= exit_reason;
@@ -5936,6 +5938,8 @@ static int vmx_handle_exit(struct kvm_vcpu *vcpu)
}

if (unlikely(vmx->fail)) {
+ if (vcpu->arch.prevent_run)
+ return kvm_arch_vcpu_run_prevented(vcpu);
vcpu->run->exit_reason = KVM_EXIT_FAIL_ENTRY;
vcpu->run->fail_entry.hardware_entry_failure_reason
= vmcs_read32(VM_INSTRUCTION_ERROR);
@@ -6337,6 +6341,21 @@ static void __noclone vmx_vcpu_run(struct kvm_vcpu *vcpu)
#undef R
#undef Q

+/*
+ * Make VMRESUME fail using the preemption timer with a timer value of 0.
+ * On processors that don't support the preemption timer, VMRESUME will fail
+ * with an internal error.
+ */
+static void vmx_prevent_run(struct kvm_vcpu *vcpu, int prevent)
+{
+ if (prevent)
+ vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
+ PIN_BASED_PREEMPTION_TIMER);
+ else
+ vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
+ PIN_BASED_PREEMPTION_TIMER);
+}
+
static void vmx_free_vcpu(struct kvm_vcpu *vcpu)
{
struct vcpu_vmx *vmx = to_vmx(vcpu);
@@ -7220,6 +7239,7 @@ static struct kvm_x86_ops vmx_x86_ops = {

.run = vmx_vcpu_run,
.handle_exit = vmx_handle_exit,
+ .prevent_run = vmx_prevent_run,
.skip_emulated_instruction = skip_emulated_instruction,
.set_interrupt_shadow = vmx_set_interrupt_shadow,
.get_interrupt_shadow = vmx_get_interrupt_shadow,
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2e414a1..cae8025 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -4961,6 +4961,13 @@ static void kvm_set_mmio_spte_mask(void)
kvm_mmu_set_mmio_spte_mask(mask);
}

+static int kvm_arch_vcpu_prevent_run(struct kvm_vcpu *vcpu, int prevent)
+{
+ vcpu->arch.prevent_run = prevent;
+ kvm_x86_ops->prevent_run(vcpu, prevent);
+ return 1;
+}
+
int kvm_arch_init(void *opaque)
{
int r;
@@ -5337,6 +5344,11 @@ static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
if (!vcpu || vcpu->mode == OUTSIDE_GUEST_MODE || kvm_is_in_guest())
return NMI_HANDLED;

+ /*
+ * We may be about to enter the guest. To prevent the entry,
+ * mark the vCPU to exit as soon as possible.
+ */
+ kvm_arch_vcpu_prevent_run(vcpu, 1);
return NMI_HANDLED;
}

@@ -5573,6 +5585,14 @@ static void __vcpu_enter_guest_slave(void *_arg)
kvm_arch_vcpu_load(vcpu, cpu);

while (r == LOOP_SLAVE) {
+ /*
+ * After slave_vcpu is set, the guest may receive an NMI when
+ * the vCPU is kicked in kvm_vcpu_kick(). On receiving the NMI,
+ * the guest exits with vcpu->arch.interrupted = true, and we
+ * must go back to the online CPUs. Even if the NMI arrives
+ * before entering the guest, kvm_arch_vcpu_prevent_run()
+ * makes the guest exit as soon as it is entered.
+ */
__this_cpu_write(slave_vcpu, vcpu);
smp_wmb();
r = vcpu_enter_guest(vcpu, arg->task);
@@ -5607,6 +5627,7 @@ static void __vcpu_enter_guest_slave(void *_arg)
}
}

+ kvm_arch_vcpu_prevent_run(vcpu, 0);
kvm_arch_vcpu_put_migrate(vcpu);
unuse_mm(arg->task->mm);
srcu_read_unlock(&vcpu->kvm->srcu, vcpu->srcu_idx);
@@ -6721,6 +6742,14 @@ int kvm_arch_vcpu_runnable(struct kvm_vcpu *vcpu)
kvm_cpu_has_interrupt(vcpu));
}

+int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu)
+{
+ kvm_x86_ops->prevent_run(vcpu, 0);
+ vcpu->arch.interrupted = true;
+ return 1;
+}
+EXPORT_SYMBOL_GPL(kvm_arch_vcpu_run_prevented);
+
int kvm_arch_vcpu_should_kick(struct kvm_vcpu *vcpu)
{
return kvm_vcpu_exiting_guest_mode(vcpu) == IN_GUEST_MODE;

2012-06-28 06:11:43

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 17/18] KVM: route assigned devices' MSI/MSI-X directly to guests on slave CPUs

When a PCI device is assigned to a guest running on slave CPUs, this
routes the device's MSI/MSI-X interrupts directly to the guest.

Because the guest uses a different interrupt vector from the host,
vector remapping is required. This is safe because slave CPUs only handle
interrupts for the assigned guest.

The slave CPU may receive interrupts for the guest while the guest is not
running. In that case, the host IRQ handler is invoked and the interrupt
is transferred to the guest as a virtual IRQ.

If the guest receives a direct interrupt from the device, an EOI to the
physical APIC is required. To handle this, if the guest issues an EOI when
there are no in-service interrupts in the virtual APIC, a physical EOI is
issued.
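
As a worked example of the routing-entry lookup used below (the MSI data
value 0x4041 is made up):

  /* For a guest MSI routing entry with msi.data == 0x4041:
   *
   *     vec = (0x4041 & MSI_DATA_VECTOR_MASK) >> MSI_DATA_VECTOR_SHIFT
   *         = (0x4041 & 0xff) >> 0
   *         = 0x41
   *
   * 0x41 is the vector the guest programmed for this MSI; it is installed
   * on the slave CPUs via remap_slave_vector_irq() and written into the
   * physical MSI/IRTE in place of the host vector.
   */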

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 19 +++++
arch/x86/kvm/irq.c | 136 +++++++++++++++++++++++++++++++++++++++
arch/x86/kvm/lapic.c | 6 +-
arch/x86/kvm/x86.c | 10 +++
virt/kvm/assigned-dev.c | 8 ++
5 files changed, 177 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 3d5028f..3561626 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -1004,4 +1004,23 @@ void kvm_deliver_pmi(struct kvm_vcpu *vcpu);

int kvm_arch_vcpu_run_prevented(struct kvm_vcpu *vcpu);

+#ifdef CONFIG_SLAVE_CPU
+void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask);
+
+struct kvm_assigned_dev_kernel;
+extern void assign_slave_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev);
+extern void deassign_slave_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev);
+extern void assign_slave_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev);
+extern void deassign_slave_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev);
+#else
+#define assign_slave_msi(kvm, assigned_dev)
+#define deassign_slave_msi(kvm, assigned_dev)
+#define assign_slave_msix(kvm, assigned_dev)
+#define deassign_slave_msix(kvm, assigned_dev)
+#endif
+
#endif /* _ASM_X86_KVM_HOST_H */
diff --git a/arch/x86/kvm/irq.c b/arch/x86/kvm/irq.c
index 7e06ba1..128431a 100644
--- a/arch/x86/kvm/irq.c
+++ b/arch/x86/kvm/irq.c
@@ -22,6 +22,8 @@

#include <linux/module.h>
#include <linux/kvm_host.h>
+#include <linux/pci.h>
+#include <asm/msidef.h>

#include "irq.h"
#include "i8254.h"
@@ -94,3 +96,137 @@ void __kvm_migrate_timers(struct kvm_vcpu *vcpu)
__kvm_migrate_apic_timer(vcpu);
__kvm_migrate_pit_timer(vcpu);
}
+
+
+#ifdef CONFIG_SLAVE_CPU
+
+static int kvm_lookup_msi_routing_entry(struct kvm *kvm, int irq)
+{
+ int vec = -1;
+ struct kvm_irq_routing_table *irq_rt;
+ struct kvm_kernel_irq_routing_entry *e;
+ struct hlist_node *n;
+
+ rcu_read_lock();
+ irq_rt = rcu_dereference(kvm->irq_routing);
+ if (irq < irq_rt->nr_rt_entries)
+ hlist_for_each_entry(e, n, &irq_rt->map[irq], link)
+ if (e->type == KVM_IRQ_ROUTING_MSI)
+ vec = (e->msi.data & MSI_DATA_VECTOR_MASK)
+ >> MSI_DATA_VECTOR_SHIFT;
+ rcu_read_unlock();
+
+ return vec;
+}
+
+void assign_slave_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ int irq = assigned_dev->guest_irq;
+ int host_irq = assigned_dev->host_irq;
+ struct irq_data *data = irq_get_irq_data(host_irq);
+ int vec = kvm_lookup_msi_routing_entry(kvm, irq);
+ cpumask_var_t slave_mask;
+ char buffer[16];
+
+ BUG_ON(!data);
+
+ if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+ pr_err("assign slave MSI failed: no memory\n");
+ return;
+ }
+ kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+ bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+ pr_info("assigned_device slave msi: irq:%d host:%d vec:%d mask:%s\n",
+ irq, host_irq, vec, buffer);
+
+ remap_slave_vector_irq(host_irq, vec, slave_mask);
+ data->chip->irq_set_affinity(data, slave_mask, 1);
+
+ free_cpumask_var(slave_mask);
+}
+
+void deassign_slave_msi(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ int host_irq = assigned_dev->host_irq;
+ cpumask_var_t slave_mask;
+ char buffer[16];
+
+ if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+ pr_err("deassign slave MSI failed: no memory\n");
+ return;
+ }
+ kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+ bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+ pr_info("deassigned_device slave msi: host:%d mask:%s\n",
+ host_irq, buffer);
+
+ revert_slave_vector_irq(host_irq, slave_mask);
+
+ free_cpumask_var(slave_mask);
+}
+
+void assign_slave_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ int i;
+
+ for (i = 0; i < assigned_dev->entries_nr; i++) {
+ int irq = assigned_dev->guest_msix_entries[i].vector;
+ int host_irq = assigned_dev->host_msix_entries[i].vector;
+ struct irq_data *data = irq_get_irq_data(host_irq);
+ int vec = kvm_lookup_msi_routing_entry(kvm, irq);
+ cpumask_var_t slave_mask;
+ char buffer[16];
+
+ pr_info("assign_slave_msix: %d %d %x\n", irq, host_irq, vec);
+ BUG_ON(!data);
+
+ if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+ pr_err("assign slave MSI-X failed: no memory\n");
+ return;
+ }
+ kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+ bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits,
+ 32);
+ pr_info("assigned_device slave msi-x: irq:%d host:%d vec:%d mask:%s\n",
+ irq, host_irq, vec, buffer);
+
+ remap_slave_vector_irq(host_irq, vec, slave_mask);
+ data->chip->irq_set_affinity(data, slave_mask, 1);
+
+ free_cpumask_var(slave_mask);
+ }
+}
+
+void deassign_slave_msix(struct kvm *kvm,
+ struct kvm_assigned_dev_kernel *assigned_dev)
+{
+ int i;
+
+ for (i = 0; i < assigned_dev->entries_nr; i++) {
+ int host_irq = assigned_dev->host_msix_entries[i].vector;
+ cpumask_var_t slave_mask;
+ char buffer[16];
+
+ if (!zalloc_cpumask_var(&slave_mask, GFP_KERNEL)) {
+ pr_err("deassign slave MSI failed: no memory\n");
+ return;
+ }
+ kvm_get_slave_cpu_mask(kvm, slave_mask);
+
+ bitmap_scnprintf(buffer, sizeof(buffer), cpu_slave_mask->bits, 32);
+ pr_info("deassigned_device slave msi: host:%d mask:%s\n",
+ host_irq, buffer);
+
+ revert_slave_vector_irq(host_irq, slave_mask);
+
+ free_cpumask_var(slave_mask);
+ }
+}
+
+#endif
diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index 93c1574..1a53a5b 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -489,8 +489,12 @@ static void apic_set_eoi(struct kvm_lapic *apic)
* Not every write EOI will has corresponding ISR,
* one example is when Kernel check timer on setup_IO_APIC
*/
- if (vector == -1)
+ if (vector == -1) {
+ /* On slave cpu, it can be EOI for a direct interrupt */
+ if (cpu_slave(smp_processor_id()))
+ ack_APIC_irq();
return;
+ }

apic_clear_vector(vector, apic->regs + APIC_ISR);
apic_update_ppr(apic);
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index cae8025..90307f0 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5331,6 +5331,16 @@ static void process_nmi(struct kvm_vcpu *vcpu)
/* vcpu currently running on each slave CPU */
static DEFINE_PER_CPU(struct kvm_vcpu *, slave_vcpu);

+void kvm_get_slave_cpu_mask(struct kvm *kvm, struct cpumask *mask)
+{
+ int i;
+ struct kvm_vcpu *vcpu;
+
+ kvm_for_each_vcpu(i, vcpu, kvm)
+ if (vcpu->arch.slave_cpu >= 0)
+ cpumask_set_cpu(vcpu->arch.slave_cpu, mask);
+}
+
static int kvm_arch_kicked_by_nmi(unsigned int cmd, struct pt_regs *regs)
{
struct kvm_vcpu *vcpu;
diff --git a/virt/kvm/assigned-dev.c b/virt/kvm/assigned-dev.c
index ae910f4..265b336 100644
--- a/virt/kvm/assigned-dev.c
+++ b/virt/kvm/assigned-dev.c
@@ -225,8 +225,12 @@ static void deassign_host_irq(struct kvm *kvm,

free_irq(assigned_dev->host_irq, assigned_dev);

- if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI)
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSI) {
pci_disable_msi(assigned_dev->dev);
+ deassign_slave_msi(kvm, assigned_dev);
+ }
+ if (assigned_dev->irq_requested_type & KVM_DEV_IRQ_HOST_MSIX)
+ deassign_slave_msix(kvm, assigned_dev);
}

assigned_dev->irq_requested_type &= ~(KVM_DEV_IRQ_HOST_MASK);
@@ -406,6 +410,7 @@ static int assigned_device_enable_guest_msi(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
+ assign_slave_msi(kvm, dev);
return 0;
}
#endif
@@ -417,6 +422,7 @@ static int assigned_device_enable_guest_msix(struct kvm *kvm,
{
dev->guest_irq = irq->guest_irq;
dev->ack_notifier.gsi = -1;
+ assign_slave_msix(kvm, dev);
return 0;
}
#endif

2012-06-28 06:11:56

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 10/18] KVM: proxy slab operations for slave CPUs on online CPUs

Add some fix-ups that proxy slab operations for the guest on online CPUs,
in order to avoid touching the slab on slave CPUs, where some slab
functions are not activated.

Currently, the slab may be touched on slave CPUs in the following three
cases. For each case, the fix-up below is introduced:

* kvm_mmu_commit_zap_page
With this patch, instead of committing the zapped pages, the pages are
added to the invalid_mmu_pages list and a KVM_REQ_COMMIT_ZAP_PAGE request
is made. The pages are then freed on online CPUs after execution of the
vCPU thread is resumed (a sketch of this deferral pattern follows the
list).

* mmu_topup_memory_caches
Preallocate caches for mmu operations in vcpu_enter_guest_slave,
which is done by online CPUs before entering guests.

* kvm_async_pf_wakeup_all
If this function is called on a slave CPU, it makes a KVM_REQ_WAKEUP_APF
request. The request is handled by calling kvm_async_pf_wakeup_all on an
online CPU.
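
Here is a sketch of the deferral pattern for the first case; where the
request is consumed is an assumption (presumably the vCPU run path on an
online CPU, in the x86.c part of this patch):

  /* On a slave CPU, only queue the pages and raise a request (sketch): */
  if (cpu_slave(smp_processor_id())) {
          list_splice_init(invalid_list, &kvm->arch.invalid_mmu_pages);
          kvm_make_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu);
          return;
  }

  /* Later, on an online CPU, before the vCPU thread re-enters the guest: */
  if (kvm_check_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu))
          kvm_mmu_commit_zap_page_late(vcpu);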

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 5 ++++
arch/x86/kvm/mmu.c | 50 ++++++++++++++++++++++++++++-----------
arch/x86/kvm/mmu.h | 4 +++
arch/x86/kvm/x86.c | 15 ++++++++++++
virt/kvm/async_pf.c | 8 ++++++
5 files changed, 68 insertions(+), 14 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index f8fd1a1..6745057 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -69,6 +69,8 @@
#ifdef CONFIG_SLAVE_CPU
/* Requests to handle VM exit on online cpu */
#define KVM_REQ_HANDLE_PF 32
+#define KVM_REQ_COMMIT_ZAP_PAGE 33
+#define KVM_REQ_WAKEUP_APF 34
#endif

/* KVM Hugepage definitions for x86 */
@@ -529,6 +531,9 @@ struct kvm_arch {
* Hash table of struct kvm_mmu_page.
*/
struct list_head active_mmu_pages;
+#ifdef CONFIG_SLAVE_CPU
+ struct list_head invalid_mmu_pages;
+#endif
struct list_head assigned_dev_head;
struct iommu_domain *iommu_domain;
int iommu_flags;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index a179ebc..9ceafa5 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -585,6 +585,10 @@ static int mmu_topup_memory_cache(struct kvm_mmu_memory_cache *cache,

if (cache->nobjs >= min)
return 0;
+#ifdef CONFIG_SLAVE_CPU
+ if (cpu_slave(smp_processor_id()))
+ return -ENOMEM;
+#endif
while (cache->nobjs < ARRAY_SIZE(cache->objects)) {
obj = kmem_cache_zalloc(base_cache, GFP_KERNEL);
if (!obj)
@@ -628,7 +632,7 @@ static void mmu_free_memory_cache_page(struct kvm_mmu_memory_cache *mc)
free_page((unsigned long)mc->objects[--mc->nobjs]);
}

-static int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu)
{
int r;

@@ -1545,7 +1549,7 @@ static void kvm_unlink_unsync_page(struct kvm *kvm, struct kvm_mmu_page *sp)

static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
struct list_head *invalid_list);
-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
struct list_head *invalid_list);

#define for_each_gfn_sp(kvm, sp, gfn, pos) \
@@ -1588,7 +1592,7 @@ static int kvm_sync_page_transient(struct kvm_vcpu *vcpu,

ret = __kvm_sync_page(vcpu, sp, &invalid_list, false);
if (ret)
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);

return ret;
}
@@ -1628,7 +1632,7 @@ static void kvm_sync_pages(struct kvm_vcpu *vcpu, gfn_t gfn)
flush = true;
}

- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
if (flush)
kvm_mmu_flush_tlb(vcpu);
}
@@ -1715,7 +1719,7 @@ static void mmu_sync_children(struct kvm_vcpu *vcpu,
kvm_sync_page(vcpu, sp, &invalid_list);
mmu_pages_clear_parents(&parents);
}
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
cond_resched_lock(&vcpu->kvm->mmu_lock);
kvm_mmu_pages_init(parent, &parents, &pages);
}
@@ -2001,7 +2005,7 @@ static int kvm_mmu_prepare_zap_page(struct kvm *kvm, struct kvm_mmu_page *sp,
return ret;
}

-static void kvm_mmu_commit_zap_page(struct kvm *kvm,
+static void kvm_mmu_commit_zap_page(struct kvm *kvm, struct kvm_vcpu *vcpu,
struct list_head *invalid_list)
{
struct kvm_mmu_page *sp;
@@ -2015,6 +2019,16 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
*/
smp_mb();

+#ifdef CONFIG_SLAVE_CPU
+ if (cpu_slave(smp_processor_id())) {
+ /* Avoid touching kmem_cache on slave cpu */
+ list_splice_init(invalid_list, &kvm->arch.invalid_mmu_pages);
+ if (vcpu)
+ kvm_make_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu);
+ return;
+ }
+#endif
+
/*
* Wait for all vcpus to exit guest mode and/or lockless shadow
* page table walks.
@@ -2029,6 +2043,14 @@ static void kvm_mmu_commit_zap_page(struct kvm *kvm,
} while (!list_empty(invalid_list));
}

+#ifdef CONFIG_SLAVE_CPU
+void kvm_mmu_commit_zap_page_late(struct kvm_vcpu *vcpu)
+{
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu,
+ &vcpu->kvm->arch.invalid_mmu_pages);
+}
+#endif
+
/*
* Changing the number of mmu pages allocated to the vm
* Note: if goal_nr_mmu_pages is too small, you will get dead lock
@@ -2051,7 +2073,7 @@ void kvm_mmu_change_mmu_pages(struct kvm *kvm, unsigned int goal_nr_mmu_pages)
struct kvm_mmu_page, link);
kvm_mmu_prepare_zap_page(kvm, page, &invalid_list);
}
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
goal_nr_mmu_pages = kvm->arch.n_used_mmu_pages;
}

@@ -2074,7 +2096,7 @@ int kvm_mmu_unprotect_page(struct kvm *kvm, gfn_t gfn)
r = 1;
kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list);
}
- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
spin_unlock(&kvm->mmu_lock);

return r;
@@ -2702,7 +2724,7 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
--sp->root_count;
if (!sp->root_count && sp->role.invalid) {
kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
}
vcpu->arch.mmu.root_hpa = INVALID_PAGE;
spin_unlock(&vcpu->kvm->mmu_lock);
@@ -2721,7 +2743,7 @@ static void mmu_free_roots(struct kvm_vcpu *vcpu)
}
vcpu->arch.mmu.pae_root[i] = INVALID_PAGE;
}
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
spin_unlock(&vcpu->kvm->mmu_lock);
vcpu->arch.mmu.root_hpa = INVALID_PAGE;
}
@@ -3734,7 +3756,7 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
}
}
mmu_pte_write_flush_tlb(vcpu, zap_page, remote_flush, local_flush);
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
kvm_mmu_audit(vcpu, AUDIT_POST_PTE_WRITE);
spin_unlock(&vcpu->kvm->mmu_lock);
}
@@ -3768,7 +3790,7 @@ void __kvm_mmu_free_some_pages(struct kvm_vcpu *vcpu)
kvm_mmu_prepare_zap_page(vcpu->kvm, sp, &invalid_list);
++vcpu->kvm->stat.mmu_recycled;
}
- kvm_mmu_commit_zap_page(vcpu->kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(vcpu->kvm, vcpu, &invalid_list);
}

static bool is_mmio_page_fault(struct kvm_vcpu *vcpu, gva_t addr)
@@ -3941,7 +3963,7 @@ restart:
if (kvm_mmu_prepare_zap_page(kvm, sp, &invalid_list))
goto restart;

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
spin_unlock(&kvm->mmu_lock);
}

@@ -3980,7 +4002,7 @@ static int mmu_shrink(struct shrinker *shrink, struct shrink_control *sc)
}
nr_to_scan--;

- kvm_mmu_commit_zap_page(kvm, &invalid_list);
+ kvm_mmu_commit_zap_page(kvm, NULL, &invalid_list);
spin_unlock(&kvm->mmu_lock);
srcu_read_unlock(&kvm->srcu, idx);
}
diff --git a/arch/x86/kvm/mmu.h b/arch/x86/kvm/mmu.h
index e374db9..32efc36 100644
--- a/arch/x86/kvm/mmu.h
+++ b/arch/x86/kvm/mmu.h
@@ -52,6 +52,10 @@ int kvm_mmu_get_spte_hierarchy(struct kvm_vcpu *vcpu, u64 addr, u64 sptes[4]);
void kvm_mmu_set_mmio_spte_mask(u64 mmio_mask);
int handle_mmio_page_fault_common(struct kvm_vcpu *vcpu, u64 addr, bool direct);
int kvm_init_shadow_mmu(struct kvm_vcpu *vcpu, struct kvm_mmu *context);
+int mmu_topup_memory_caches(struct kvm_vcpu *vcpu);
+#ifdef CONFIG_SLAVE_CPU
+void kvm_mmu_commit_zap_page_late(struct kvm_vcpu *vcpu);
+#endif

static inline unsigned int kvm_mmu_available_pages(struct kvm *kvm)
{
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1558ec2..df5eb05 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5558,6 +5558,11 @@ static void __vcpu_enter_guest_slave(void *_arg)
arg->ret = LOOP_ONLINE;
break;
}
+ if (test_bit(KVM_REQ_COMMIT_ZAP_PAGE, &vcpu->requests) ||
+ test_bit(KVM_REQ_WAKEUP_APF, &vcpu->requests)) {
+ r = LOOP_ONLINE;
+ break;
+ }

r = vcpu_post_run(vcpu, arg->task, &arg->apf_pending);

@@ -5587,6 +5592,9 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,

BUG_ON((unsigned)slave >= nr_cpu_ids || !cpu_slave(slave));

+ /* Refill memory caches here to avoid calling slab on slave cpu */
+ mmu_topup_memory_caches(vcpu);
+
preempt_disable();
preempt_notifier_unregister(&vcpu->preempt_notifier);
kvm_arch_vcpu_put_migrate(vcpu);
@@ -5618,6 +5626,10 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
vcpu->arch.page_fault.insn,
vcpu->arch.page_fault.insn_len);
}
+ if (kvm_check_request(KVM_REQ_WAKEUP_APF, vcpu))
+ kvm_async_pf_wakeup_all(vcpu);
+ if (kvm_check_request(KVM_REQ_COMMIT_ZAP_PAGE, vcpu))
+ kvm_mmu_commit_zap_page_late(vcpu);

return r;
}
@@ -6472,6 +6484,9 @@ int kvm_arch_init_vm(struct kvm *kvm, unsigned long type)
return -EINVAL;

INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+#ifdef CONFIG_SLAVE_CPU
+ INIT_LIST_HEAD(&kvm->arch.invalid_mmu_pages);
+#endif
INIT_LIST_HEAD(&kvm->arch.assigned_dev_head);

/* Reserve bit 0 of irq_sources_bitmap for userspace irq source */
diff --git a/virt/kvm/async_pf.c b/virt/kvm/async_pf.c
index feb5e76..8314225 100644
--- a/virt/kvm/async_pf.c
+++ b/virt/kvm/async_pf.c
@@ -204,6 +204,14 @@ int kvm_async_pf_wakeup_all(struct kvm_vcpu *vcpu)
if (!list_empty_careful(&vcpu->async_pf.done))
return 0;

+#ifdef CONFIG_SLAVE_CPU
+ if (cpu_slave(smp_processor_id())) {
+ /* Redo on online cpu to avoid kmem_cache_alloc on slave cpu */
+ kvm_make_request(KVM_REQ_WAKEUP_APF, vcpu);
+ return -ENOMEM;
+ }
+#endif
+
work = kmem_cache_zalloc(async_pf_cache, GFP_ATOMIC);
if (!work)
return -ENOMEM;

2012-06-28 06:12:59

by Tomoki Sekiyama

[permalink] [raw]
Subject: [RFC PATCH 07/18] KVM: handle page faults occurred in slave CPUs on online CPUs

Page faults that occur while the guest is running on a slave CPU cannot
be handled on that CPU, because the slave CPU runs in idle process
context.

With this patch, a page fault that happens on a slave CPU is recorded in
struct kvm_access_fault and notified to an online CPU, where it is
handled after the user process for the guest is resumed.
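
A condensed sketch of the notify-and-replay flow, pieced together from
the diff below (illustrative only; error handling omitted):

/* Slave CPU, in kvm_mmu_page_fault(): record the fault and bail out. */
        vcpu->arch.page_fault.cr2        = cr2;
        vcpu->arch.page_fault.error_code = error_code;
        vcpu->arch.page_fault.insn       = insn;
        vcpu->arch.page_fault.insn_len   = insn_len;
        kvm_make_request(KVM_REQ_HANDLE_PF, vcpu);
        return -EFAULT;         /* forces a return to the online CPU */

/* Online CPU, in vcpu_enter_guest_slave(): replay the fault in
 * user-process context, where sleeping and guest memory access are OK. */
        if (r == -EFAULT && kvm_check_request(KVM_REQ_HANDLE_PF, vcpu))
                r = kvm_mmu_page_fault(vcpu,
                                       vcpu->arch.page_fault.cr2,
                                       vcpu->arch.page_fault.error_code,
                                       vcpu->arch.page_fault.insn,
                                       vcpu->arch.page_fault.insn_len);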

Signed-off-by: Tomoki Sekiyama <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Marcelo Tosatti <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: "H. Peter Anvin" <[email protected]>
---

arch/x86/include/asm/kvm_host.h | 15 +++++++++++++++
arch/x86/kvm/mmu.c | 13 +++++++++++++
arch/x86/kvm/x86.c | 10 ++++++++++
3 files changed, 38 insertions(+), 0 deletions(-)

diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 4291954..80f7b3b 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -66,6 +66,11 @@

#define UNMAPPED_GVA (~(gpa_t)0)

+#ifdef CONFIG_SLAVE_CPU
+/* Requests to handle VM exit on online cpu */
+#define KVM_REQ_HANDLE_PF 32
+#endif
+
/* KVM Hugepage definitions for x86 */
#define KVM_NR_PAGE_SIZES 3
#define KVM_HPAGE_GFN_SHIFT(x) (((x) - 1) * 9)
@@ -405,6 +410,16 @@ struct kvm_vcpu_arch {
u8 nr;
} interrupt;

+#ifdef CONFIG_SLAVE_CPU
+ /* used for recording page fault on offline CPU */
+ struct kvm_access_fault {
+ gva_t cr2;
+ u32 error_code;
+ void *insn;
+ int insn_len;
+ } page_fault;
+#endif
+
int halt_request; /* real mode on Intel only */

int cpuid_nent;
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 6139e1d..a179ebc 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -3785,6 +3785,19 @@ int kvm_mmu_page_fault(struct kvm_vcpu *vcpu, gva_t cr2, u32 error_code,
int r, emulation_type = EMULTYPE_RETRY;
enum emulation_result er;

+#ifdef CONFIG_SLAVE_CPU
+ if (cpu_slave(smp_processor_id())) {
+ /* Page fault must be handled on user-process context. */
+ r = -EFAULT;
+ vcpu->arch.page_fault.cr2 = cr2;
+ vcpu->arch.page_fault.error_code = error_code;
+ vcpu->arch.page_fault.insn = insn;
+ vcpu->arch.page_fault.insn_len = insn_len;
+ kvm_make_request(KVM_REQ_HANDLE_PF, vcpu);
+ goto out;
+ }
+#endif
+
r = vcpu->arch.mmu.page_fault(vcpu, cr2, error_code, false);
if (r < 0)
goto out;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index ecd474a..7b9f2a5 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -5552,6 +5552,16 @@ static int vcpu_enter_guest_slave(struct kvm_vcpu *vcpu,
r = arg.ret;
*apf_pending = arg.apf_pending;

+ if (r == -EFAULT && kvm_check_request(KVM_REQ_HANDLE_PF, vcpu)) {
+ pr_debug("handling page fault request @%p\n",
+ (void *)vcpu->arch.page_fault.cr2);
+ r = kvm_mmu_page_fault(vcpu,
+ vcpu->arch.page_fault.cr2,
+ vcpu->arch.page_fault.error_code,
+ vcpu->arch.page_fault.insn,
+ vcpu->arch.page_fault.insn_len);
+ }
+
return r;
}


2012-06-28 16:39:13

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 18/18] x86: request TLB flush to slave CPU using NMI

On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
> For slave CPUs, it is inappropriate to request a TLB flush using an IPI,
> because the IPI may be delivered to a KVM guest while the slave CPU is
> running the guest with direct interrupt routing.
>
> Instead, a TLB flush request is registered in a per-cpu bitmask and an NMI
> is sent to interrupt execution of the guest. The NMI handler then checks
> the bitmask and handles the requests.


Currently x86's get_user_pages_fast() depends on TLB flushes being held
up by local_irq_disable(). With this patch, this is no longer true and
get_user_pages_fast() can race with page table freeing. There are
patches from Peter Zijlstra to remove this dependency though. NMIs are
still slow and fragile when compared to normal interrupts, so this patch
is somewhat problematic.
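
For context, a rough sketch of the dependency described above
(simplified; gup_walk() is a hypothetical stand-in for the lockless
page-table walk, not the real mm/gup.c code):

static int gup_fast_sketch(unsigned long start, int nr_pages,
                           struct page **pages)
{
        unsigned long flags;
        int nr;

        local_irq_save(flags);   /* blocks delivery of the TLB-flush IPI... */
        nr = gup_walk(start, nr_pages, pages);  /* hypothetical lockless walk */
        local_irq_restore(flags);       /* ...so a remote CPU cannot complete
                                           a flush and free these page tables
                                           while the walk is in progress */
        return nr;
}

Requesting the flush through an NMI bypasses local_irq_save(), which is
why this guarantee no longer holds on slave CPUs.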


--
error compiling committee.c: too many arguments to function

2012-06-28 16:49:15

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 16/18] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received

On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
> Since NMIs cannot be disabled around VM enter, there is a race between
> receiving an NMI to kick a guest and entering the guest on a slave CPU.
> If the NMI is received just before entering the VM, then after the NMI
> handler returns, entry into the guest continues and the effect of the
> NMI is lost.
>
> This patch adds kvm_arch_vcpu_prevent_run(), which causes a VM exit right
> after VM entry. The NMI handler uses this to ensure that execution of the
> guest is cancelled after the NMI.
>
>
> +/*
> + * Make VMRESUME fail using preemption timer with timer value = 0.
> + * On processors that doesn't support preemption timer, VMRESUME will fail
> + * by internal error.
> + */
> +static void vmx_prevent_run(struct kvm_vcpu *vcpu, int prevent)
> +{
> + if (prevent)
> + vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
> + PIN_BASED_PREEMPTION_TIMER);
> + else
> + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
> + PIN_BASED_PREEMPTION_TIMER);
> +}

This may interrupt another RMW sequence, which will then overwrite the
control. So it needs to be called only if inside the entry sequence
(otherwise can just set a KVM_REQ_IMMEDIATE_EXIT in vcpu->requests).
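
To make the lost update concrete, a hypothetical interleaving (sketch
only; the interrupted update to PIN_BASED_VIRTUAL_NMIS is just an
example of "another RMW sequence"):

static void some_pin_based_update(void)        /* the interrupted code path */
{
        u32 ctl = vmcs_read32(PIN_BASED_VM_EXEC_CONTROL);     /* (1) read   */
        /* --- NMI lands here: vmx_prevent_run() sets
         *     PIN_BASED_PREEMPTION_TIMER in the VMCS ---                   */
        ctl |= PIN_BASED_VIRTUAL_NMIS;                        /* (2) modify */
        vmcs_write32(PIN_BASED_VM_EXEC_CONTROL, ctl);         /* (3) write
                back: the preemption-timer bit set by the NMI is overwritten */
}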

--
error compiling committee.c: too many arguments to function

2012-06-28 16:58:32

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
> Hello,
>
> This RFC patch series provides facility to dedicate CPUs to KVM guests
> and enable the guests to handle interrupts from passed-through PCI devices
> directly (without VM exit and relay by the host).
>
> With this feature, we can improve throughput and response time of the device
> and the host's CPU usage by reducing the overhead of interrupt handling.
> This is good for the application using very high throughput/frequent
> interrupt device (e.g. 10GbE NIC).
> CPU-intensive high performance applications and real-time applicatoins
> also gets benefit from CPU isolation feature, which reduces VM exit and
> scheduling delay.
>
> Current implementation is still just PoC and have many limitations, but
> submitted for RFC. Any comments are appreciated.
>
> * Overview
> Intel and AMD CPUs have a feature to handle interrupts by guests without
> VM Exit. However, because it cannot switch VM Exit based on IRQ vectors,
> interrupts to both the host and the guest will be routed to guests.
>
> To avoid mixture of host and guest interrupts, in this patch, some of CPUs
> are cut off from the host and dedicated to the guests. In addition, IRQ
> affinity of the passed-through devices are set to the guest CPUs only.
>
> For IPI from the host to the guest, we use NMIs, that is an only interrupts
> having another VM Exit flag.
>
> * Benefits
> This feature provides benefits of virtualization to areas where high
> performance and low latency are required, such as HPC and trading,
> and so on. It also useful for consolidation in large scale systems with
> many CPU cores and PCI devices passed-through or with SR-IOV.
> For the future, it may be used to keep the guests running even if the host
> is crashed (but that would need additional features like memory isolation).
>
> * Limitations
> Current implementation is experimental, unstable, and has a lot of limitations.
> - SMP guests don't work correctly
> - Only Linux guest is supported
> - Only Intel VT-x is supported
> - Only MSI and MSI-X pass-through; no ISA interrupts support
> - Non passed-through PCI devices (including virtio) are slower
> - Kernel space PIT emulation does not work
> - Needs a lot of cleanups
>

This is both impressive and scary. What is the target scenario here?
Partitioning? I don't see this working for generic consolidation.


--
error compiling committee.c: too many arguments to function

2012-06-28 17:02:42

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] KVM: Add facility to run guests on slave CPUs

On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
> Add path to migrate execution of vcpu_enter_guest to a slave CPU when
> vcpu->arch.slave_cpu is set.
>
> After moving to the slave CPU, execution goes back to an online CPU when
> the guest exits for reasons that cannot be handled by the slave CPU alone
> (e.g. handling async page faults).

What about, say, instruction emulation? It may need to touch guest
memory, which cannot be done from interrupt disabled context.

> +
> +static int vcpu_post_run(struct kvm_vcpu *vcpu, struct task_struct *task,
> + int *can_complete_async_pf)
> +{
> + int r = LOOP_ONLINE;
> +
> + clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
> + if (kvm_cpu_has_pending_timer(vcpu))
> + kvm_inject_pending_timer_irqs(vcpu);
> +
> + if (dm_request_for_irq_injection(vcpu)) {
> + r = -EINTR;
> + vcpu->run->exit_reason = KVM_EXIT_INTR;
> + ++vcpu->stat.request_irq_exits;
> + }
> +
> + if (can_complete_async_pf) {
> + *can_complete_async_pf = kvm_can_complete_async_pf(vcpu);
> + if (r == LOOP_ONLINE)
> + r = *can_complete_async_pf ? LOOP_APF : LOOP_SLAVE;
> + } else
> + kvm_check_async_pf_completion(vcpu);
> +
> + if (signal_pending(task)) {
> + r = -EINTR;
> + vcpu->run->exit_reason = KVM_EXIT_INTR;
> + ++vcpu->stat.signal_exits;
> + }

Isn't this racy? The signal can come right after this.
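
The window looks roughly like this (illustrative sketch; the helper name
is hypothetical):

static int post_run_signal_check_sketch(struct kvm_vcpu *vcpu,
                                        struct task_struct *task)
{
        if (signal_pending(task))       /* (1) no signal pending yet        */
                return -EINTR;
        /* (2) a signal delivered here is not re-checked...                 */
        return LOOP_SLAVE;              /* (3) ...so the vcpu re-enters the
                                           guest on the slave CPU and the
                                           signal waits for a later exit    */
}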

--
error compiling committee.c: too many arguments to function

2012-06-28 17:26:34

by Jan Kiszka

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

On 2012-06-28 18:58, Avi Kivity wrote:
> On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
>> Hello,
>>
>> This RFC patch series provides facility to dedicate CPUs to KVM guests
>> and enable the guests to handle interrupts from passed-through PCI devices
>> directly (without VM exit and relay by the host).
>>
>> With this feature, we can improve throughput and response time of the device
>> and the host's CPU usage by reducing the overhead of interrupt handling.
>> This is good for the application using very high throughput/frequent
>> interrupt device (e.g. 10GbE NIC).
>> CPU-intensive high performance applications and real-time applicatoins
>> also gets benefit from CPU isolation feature, which reduces VM exit and
>> scheduling delay.
>>
>> Current implementation is still just PoC and have many limitations, but
>> submitted for RFC. Any comments are appreciated.
>>
>> * Overview
>> Intel and AMD CPUs have a feature to handle interrupts by guests without
>> VM Exit. However, because it cannot switch VM Exit based on IRQ vectors,
>> interrupts to both the host and the guest will be routed to guests.
>>
>> To avoid mixture of host and guest interrupts, in this patch, some of CPUs
>> are cut off from the host and dedicated to the guests. In addition, IRQ
>> affinity of the passed-through devices are set to the guest CPUs only.
>>
>> For IPI from the host to the guest, we use NMIs, that is an only interrupts
>> having another VM Exit flag.
>>
>> * Benefits
>> This feature provides benefits of virtualization to areas where high
>> performance and low latency are required, such as HPC and trading,
>> and so on. It also useful for consolidation in large scale systems with
>> many CPU cores and PCI devices passed-through or with SR-IOV.
>> For the future, it may be used to keep the guests running even if the host
>> is crashed (but that would need additional features like memory isolation).
>>
>> * Limitations
>> Current implementation is experimental, unstable, and has a lot of limitations.
>> - SMP guests don't work correctly
>> - Only Linux guest is supported
>> - Only Intel VT-x is supported
>> - Only MSI and MSI-X pass-through; no ISA interrupts support
>> - Non passed-through PCI devices (including virtio) are slower
>> - Kernel space PIT emulation does not work
>> - Needs a lot of cleanups
>>
>
> This is both impressive and scary. What is the target scenario here?
> Partitioning? I don't see this working for generic consolidation.
>

From my POV, partitioning - including hard realtime partitions - would
provide some use cases. But, as far as I saw, there are still major
restrictions in this approach, e.g. that you can't return to userspace
on the slave core. Or even execute the in-kernel device models on that core.

I think we need something based on the no-hz work in the long run, i.e.
the ability to run a single VCPU thread of the userland hypervisor on a
single core with zero rescheduling and unrelated interruptions - as far
as the guest load scenario allows this (we have some here).

Well, and we need proper hardware support for direct IRQ injection on x86...

Jan

--
Siemens AG, Corporate Technology, CT T DE IT 1
Corporate Competence Center Embedded Linux

2012-06-28 17:35:01

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

On 06/28/2012 08:26 PM, Jan Kiszka wrote:
>>
>> This is both impressive and scary. What is the target scenario here?
>> Partitioning? I don't see this working for generic consolidation.
>>
>
> From my POV, partitioning - including hard realtime partitions - would
> provide some use cases. But, as far as I saw, there are still major
> restrictions in this approach, e.g. that you can't return to userspace
> on the slave core. Or even execute the in-kernel device models on that core.
>
> I think we need something based on the no-hz work in the long run, i.e.
> the ability to run a single VCPU thread of the userland hypervisor on a
> single core with zero rescheduling and unrelated interruptions - as far
> as the guest load scenario allows this (we have some here).

What we can do perhaps is switch from direct mode to indirect mode on
exit. Instead of running with interrupts disabled, enable interrupts
and make sure they are forwarded to the guest on the next entry.

> Well, and we need proper hardware support for direct IRQ injection on x86...

Hardware support always helps, but it always seems to come after the
software support is in place and needs to be supported forever.

--
error compiling committee.c: too many arguments to function

2012-06-29 09:25:40

by Tomoki Sekiyama

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

Hi, thanks for your comments.

On 2012/06/29 2:34, Avi Kivity wrote:
> On 06/28/2012 08:26 PM, Jan Kiszka wrote:
>>> This is both impressive and scary. What is the target scenario here?
>>> Partitioning? I don't see this working for generic consolidation.
>>
>> From my POV, partitioning - including hard realtime partitions - would
>> provide some use cases. But, as far as I saw, there are still major
>> restrictions in this approach, e.g. that you can't return to userspace
>> on the slave core. Or even execute the in-kernel device models on that core.

Exactly, this is for partitioning that requires bare-metal performance
with low latency and realtime behavior. I think it is also useful for
workloads like HPC with MPI, which are CPU-intensive and need low latency.

>> I think we need something based on the no-hz work in the long run, i.e.
>> the ability to run a single VCPU thread of the userland hypervisor on a
>> single core with zero rescheduling and unrelated interruptions - as far
>> as the guest load scenario allows this (we have some here).

With the no-hz approach, we can ease some problems such as accessing
userspace memory from interrupt-disabled context. Still, we need IRQ
vector remapping or something like para-virtualized vector assignment
for IRQs to reduce VM exits on interrupt handling.

> What we can do perhaps is switch from direct mode to indirect mode on
> exit. Instead of running with interrupts disabled, enable interrupts
> and make sure they are forwarded to the guest on the next entry.

Already with the current implementation, the slave host kernel receives
interrupts while guest execution is stopped for handling events like
qemu device emulation on online CPUs. In that case, the interrupts are
forwarded to the guest as vIRQs.
I will reconsider enabling interrupts on the slave CPU for accessing
userspace memory and so on.

>> Well, and we need proper hardware support for direct IRQ injection on x86...

I really hope this feature ...

> Hardware support always helps, but it always seems to come after the
> software support is in place and needs to be supported forever.

Thanks,
--
Tomoki Sekiyama <[email protected]>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory

2012-06-29 09:26:00

by Tomoki Sekiyama

[permalink] [raw]
Subject: Re: [RFC PATCH 06/18] KVM: Add facility to run guests on slave CPUs

On 2012/06/29 2:02, Avi Kivity wrote:
> On 06/28/2012 09:07 AM, Tomoki Sekiyama wrote:
>> Add path to migrate execution of vcpu_enter_guest to a slave CPU when
>> vcpu->arch.slave_cpu is set.
>>
>> After moving to the slave CPU, execution goes back to an online CPU when
>> the guest exits for reasons that cannot be handled by the slave CPU alone
>> (e.g. handling async page faults).
>
> What about, say, instruction emulation? It may need to touch guest
> memory, which cannot be done from interrupt disabled context.

Hmm, it seems difficult to resolve this in interrupt-disabled context.

In a partitioning scenario, it might be possible to give up execution
if the memory is not pinned down, but I'm not sure that is acceptable.

It looks better to make the slave core interruptible and able to sleep.

>> +
>> +static int vcpu_post_run(struct kvm_vcpu *vcpu, struct task_struct *task,
>> + int *can_complete_async_pf)
>> +{
>> + int r = LOOP_ONLINE;
>> +
>> + clear_bit(KVM_REQ_PENDING_TIMER, &vcpu->requests);
>> + if (kvm_cpu_has_pending_timer(vcpu))
>> + kvm_inject_pending_timer_irqs(vcpu);
>> +
>> + if (dm_request_for_irq_injection(vcpu)) {
>> + r = -EINTR;
>> + vcpu->run->exit_reason = KVM_EXIT_INTR;
>> + ++vcpu->stat.request_irq_exits;
>> + }
>> +
>> + if (can_complete_async_pf) {
>> + *can_complete_async_pf = kvm_can_complete_async_pf(vcpu);
>> + if (r == LOOP_ONLINE)
>> + r = *can_complete_async_pf ? LOOP_APF : LOOP_SLAVE;
>> + } else
>> + kvm_check_async_pf_completion(vcpu);
>> +
>> + if (signal_pending(task)) {
>> + r = -EINTR;
>> + vcpu->run->exit_reason = KVM_EXIT_INTR;
>> + ++vcpu->stat.signal_exits;
>> + }
>
> Isn't this racy? The signal can come right after this.

Oops, this is indeed racy here.
However, it is resolved once the later patch [RFC PATCH 16/18] is applied.
I will reorder the patches.

A signal will wake up the vCPU user thread (sleeping in
vcpu_enter_guest_slave > wait_for_completion_interruptible on an online CPU)
and the thread kicks the vcpu (by NMI). Then, kvm_arch_vcpu_prevent_run is
called in the NMI handler to make the VM entry fail again.

(But kvm_arch_vcpu_prevent_run still has another problem, as you replied.)

Thanks,
--
Tomoki Sekiyama <[email protected]>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory

2012-06-29 09:26:16

by Tomoki Sekiyama

[permalink] [raw]
Subject: Re: [RFC PATCH 16/18] KVM: add kvm_arch_vcpu_prevent_run to prevent VM ENTER when NMI is received

On 2012/06/29 1:48, Avi Kivity wrote:
> On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
>> Since NMIs cannot be disabled around VM enter, there is a race between
>> receiving an NMI to kick a guest and entering the guest on a slave CPU.
>> If the NMI is received just before entering the VM, then after the NMI
>> handler returns, entry into the guest continues and the effect of the
>> NMI is lost.
>>
>> This patch adds kvm_arch_vcpu_prevent_run(), which causes a VM exit right
>> after VM entry. The NMI handler uses this to ensure that execution of the
>> guest is cancelled after the NMI.
>>
>>
>> +/*
>> + * Make VMRESUME fail using preemption timer with timer value = 0.
>> + * On processors that doesn't support preemption timer, VMRESUME will fail
>> + * by internal error.
>> + */
>> +static void vmx_prevent_run(struct kvm_vcpu *vcpu, int prevent)
>> +{
>> + if (prevent)
>> + vmcs_set_bits(PIN_BASED_VM_EXEC_CONTROL,
>> + PIN_BASED_PREEMPTION_TIMER);
>> + else
>> + vmcs_clear_bits(PIN_BASED_VM_EXEC_CONTROL,
>> + PIN_BASED_PREEMPTION_TIMER);
>> +}
>
> This may interrupt another RMW sequence, which will then overwrite the
> control. So it needs to be called only if inside the entry sequence
> (otherwise can just set a KVM_REQ_IMMEDIATE_EXIT in vcpu->requests).
>

I agree. I will add a check for whether we are inside the entry sequence.

Thanks,
--
Tomoki Sekiyama <[email protected]>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory

2012-06-29 09:26:33

by Tomoki Sekiyama

[permalink] [raw]
Subject: Re: [RFC PATCH 18/18] x86: request TLB flush to slave CPU using NMI

On 2012/06/29 1:38, Avi Kivity wrote:
> On 06/28/2012 09:08 AM, Tomoki Sekiyama wrote:
>> For slave CPUs, it is inappropriate to request a TLB flush using an IPI,
>> because the IPI may be delivered to a KVM guest while the slave CPU is
>> running the guest with direct interrupt routing.
>>
>> Instead, a TLB flush request is registered in a per-cpu bitmask and an NMI
>> is sent to interrupt execution of the guest. The NMI handler then checks
>> the bitmask and handles the requests.
>
>
> Currently x86's get_user_pages_fast() depends on TLB flushes being held
> up by local_irq_disable(). With this patch, this is no longer true and
> get_user_pages_fast() can race with page table freeing. There are
> patches from Peter Zijlstra to remove this dependency though.

Thank you for the information. I will check his patches.

> NMIs are
> still slow and fragile when compared to normal interrupts, so this patch
> is somewhat problematic.

OK, always sending NMIs is indeed problematic. I should check the
slave core's state and send an NMI only when the slave guest is
running and an NMI is really needed.

Thanks,
--
Tomoki Sekiyama <[email protected]>
Linux Technology Center
Hitachi, Ltd., Yokohama Research Laboratory

2012-06-29 14:57:01

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC PATCH 00/18] KVM: x86: CPU isolation and direct interrupts handling by guests

On 06/29/2012 12:25 PM, Tomoki Sekiyama wrote:
> Hi, thanks for your comments.
>
> On 2012/06/29 2:34, Avi Kivity wrote:
> > On 06/28/2012 08:26 PM, Jan Kiszka wrote:
> >>> This is both impressive and scary. What is the target scenario here?
> >>> Partitioning? I don't see this working for generic consolidation.
> >>
> >> From my POV, partitioning - including hard realtime partitions - would
> >> provide some use cases. But, as far as I saw, there are still major
> >> restrictions in this approach, e.g. that you can't return to userspace
> >> on the slave core. Or even execute the in-kernel device models on that core.
>
> Exactly, this is for partitioning that requires bare-metal performance
> with low latency and realtime behavior.

It's hard for me to evaluate how large that segment is. Since the
patchset is so intrusive, it needs a large potential user base to
justify it, or a large reduction in complexity, or both.

> I think it is also useful for
> workloads like HPC with MPI, which are CPU-intensive and need low latency.

I keep hearing about people virtualizing these types of workloads, but I
haven't yet understood why.


--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.