This patch series implements a paravirt-spinlock implementation for KVM guests,
based heavily on Xen's implementation. I tried to refactor Xen's spinlock
implementation to make it common for both Xen and KVM - but found that
a few differences between Xen and KVM (Xen has the ability to block on a
particular event/irq, for example) _and_ the fact that the guest kernel
can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
CONFIG_KVM_GUEST can both be turned on) make the "common" code an eyesore.
There would have to be:

	if (xen) {
		...
	} else if (kvm) {
		..
	}

or possibly:

	alternative(NOP, some_xen_specific_call, ....)

type of code in the common implementation.

For the time being, I have made this KVM-specific only. At some point in the
future, I hope this can be made common between Xen and KVM.
More background and results for this patch below:
What is the Problem being solved?
=================================
A guest operating system can be preempted by the hypervisor at any arbitrary point.
There is no mechanism (that I know of) by which the guest OS can disable preemption
for certain periods of time. One noticeable effect of this is with regard to
locking. Let's say one virtual cpu of a guest (VCPUA) grabs a spinlock and is
preempted by the hypervisor before it can relinquish the lock. The time-of-preemption
(when the virtual cpu is off the cpu) can be quite large. In that period, if
another of the guest OS's virtual cpus (VCPUB) tries grabbing the same lock, it could
end up spin-waiting a _long_ time, burning cycles unnecessarily. To add to the
woes, VCPUB may actually be waiting for VCPUA to yield before it can run on
the same (physical) cpu. This is termed the "lock-holder preemption" (LHP)
problem. Its effect can be quite serious. For example,
http://lkml.org/lkml/2010/4/11/108 reported an 80% performance degradation
attributed to the LHP problem.
Solutions
=========
There are several solutions to this problem.
a. Realize that a lock-holder could have been preempted, and avoid spin-waiting
   too long. Instead, yield cycles (to the lock-holder, perhaps). This is a
   common solution adopted by most paravirtualized guests (Xen, s390, powerpc).
   A rough sketch of this approach follows the list below.

b. Avoid preempting a lock-holder while it is holding a (spin-) lock.

   In this scheme, the guest OS can hint (by setting some flag in memory shared
   with the hypervisor) whenever it is holding a lock, and the hypervisor can
   defer preempting the guest vcpu while the lock is held. With this scheme, we
   should never have a lock-acquiring vcpu spin on a preempted vcpu to release
   its lock. If it ever spins, it's because somebody *currently running* is
   holding the lock - and hence it won't have to spin-wait too long. IOW we are
   proactively trying to prevent the LHP problem from occurring in the first
   place. This should improve job turnaround time for some workloads. [1] has
   some results based on this approach.
c. Share the run-status of vcpus with guests. This could be used to optimize
   routines like mutex_spin_on_owner().

   The hypervisor could share the run-status of vcpus in guest kernel memory.
   This would allow us to optimize routines like mutex_spin_on_owner() - we
   don't spin-wait if we realize that the target vcpu has been preempted.
a) and c) are about dealing with the LHP problem, while b) is about preventing
the problem from happening.
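As a rough illustration of approach a), here is a minimal sketch (this is not the
code in this series; hypervisor_yield() is a hypothetical stand-in for whatever
yield or block primitive the hypervisor provides - a hypercall in the KVM case):

/*
 * Spin for a bounded number of iterations, then give up the cpu so a
 * possibly-preempted lock holder gets a chance to run.
 */
#define SPIN_THRESHOLD	(1 << 10)

static void pv_spin_lock(struct arch_spinlock *lock)
{
	for (;;) {
		unsigned count = SPIN_THRESHOLD;

		while (count--) {
			if (arch_spin_trylock(lock))
				return;		/* got the lock */
			cpu_relax();		/* rep;nop */
		}
		hypervisor_yield();		/* hypothetical: give up our timeslice */
	}
}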
This patch series follows approach a). It is based against the v2.6.35-rc4
kernel for both guest and host.

I have patches for b) and c) as well - I want to send them after more thorough
experimentation with various workloads.
Results
=======
Machine : IBM x3650 with 2 Dual-core Intel Xeon (5160) CPUs and 4972MB RAM
Kernel for host/guest : 2.6.35-rc4
Test :
Spawn a single guest under KVM with 4 VCPUs, 3092MB memory, virtio disk.
Guest runs kernel compile benchmark as:
	time -p make -s -j20 bzImage
for 3 times in a loop.
This is repeated under various over-commitment scenarios and
vcpu/pcpu pinning configurations.
Overcommit scenarios are :
1x : only guest is running
2x : cpu hogs are started such that (hogs + guest vcpu count)/pcpu = 2
3x : cpu hogs are started such that (hogs + guest vcpu count)/pcpu = 3
4x : cpu hogs are started such that (hogs + guest vcpu count)/pcpu = 4
VCPU/PCPU pinning scenarios:
A : Each vcpu of the guest is pinned to a separate pcpu
B : No pinning. A vcpu could run on any pcpu.
C : The 4 VCPUs of the guest are pinned to run inside a single dual-core CPU
    (cpus 2,3 in this case)
Scenario A:
            W/o patch            W/ patch             Difference
            Avg (std-dev)        Avg (std-dev)
    1x:     273.270  (1.051)     251.613  (1.155)       7.9%
    2x:     541.343 (57.317)     443.400  (2.832)      18.1%
    3x:     819.160  (9.218)     642.393  (2.646)      21.6%
    4x:    1020.493 (34.743)     839.327  (0.658)      17.8%

Scenario B:
    1x:     276.947  (0.644)     248.397  (1.563)      10.3%
    2x:     500.723  (5.694)     438.957  (6.112)      12.33%
    3x:     725.687  (5.267)     641.317  (3.902)      11.62%
    4x:     973.910 (21.712)     836.853  (2.754)      14.07%

Scenario C:
    1x:     564.610 (12.055)     420.510  (2.598)      25.52%
    2x:     750.867  (5.139)     618.570  (2.914)      17.61%
    3x:     950.150 (13.496)     813.803  (1.872)      14.35%
    4x:    1125.027  (5.337)    1007.630  (5.707)      10.43%

(Times are in seconds; lower is better.)
IMO this is a good improvement with the patchset applied.
References
==========
1. http://l4ka.org/publications/2004/Towards-Scalable-Multiprocessor-Virtual-Machin
Debugfs support for reading an array of u32-type integers. This is a rework of
code that already exists for Xen.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
---
arch/x86/xen/debugfs.c | 104 ----------------------------------------
arch/x86/xen/debugfs.h | 4 -
arch/x86/xen/mmu.c | 2
arch/x86/xen/multicalls.c | 6 +-
arch/x86/xen/spinlock.c | 6 +-
fs/debugfs/file.c | 117 ++++++++++++++++++++++++++++++++++++++++++++++
include/linux/debugfs.h | 11 ++++
7 files changed, 135 insertions(+), 115 deletions(-)
Index: current/arch/x86/xen/debugfs.c
===================================================================
--- current.orig/arch/x86/xen/debugfs.c
+++ current/arch/x86/xen/debugfs.c
@@ -18,107 +18,3 @@ struct dentry * __init xen_init_debugfs(
return d_xen_debug;
}
-
-struct array_data
-{
- void *array;
- unsigned elements;
-};
-
-static int u32_array_open(struct inode *inode, struct file *file)
-{
- file->private_data = NULL;
- return nonseekable_open(inode, file);
-}
-
-static size_t format_array(char *buf, size_t bufsize, const char *fmt,
- u32 *array, unsigned array_size)
-{
- size_t ret = 0;
- unsigned i;
-
- for(i = 0; i < array_size; i++) {
- size_t len;
-
- len = snprintf(buf, bufsize, fmt, array[i]);
- len++; /* ' ' or '\n' */
- ret += len;
-
- if (buf) {
- buf += len;
- bufsize -= len;
- buf[-1] = (i == array_size-1) ? '\n' : ' ';
- }
- }
-
- ret++; /* \0 */
- if (buf)
- *buf = '\0';
-
- return ret;
-}
-
-static char *format_array_alloc(const char *fmt, u32 *array, unsigned array_size)
-{
- size_t len = format_array(NULL, 0, fmt, array, array_size);
- char *ret;
-
- ret = kmalloc(len, GFP_KERNEL);
- if (ret == NULL)
- return NULL;
-
- format_array(ret, len, fmt, array, array_size);
- return ret;
-}
-
-static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
- loff_t *ppos)
-{
- struct inode *inode = file->f_path.dentry->d_inode;
- struct array_data *data = inode->i_private;
- size_t size;
-
- if (*ppos == 0) {
- if (file->private_data) {
- kfree(file->private_data);
- file->private_data = NULL;
- }
-
- file->private_data = format_array_alloc("%u", data->array, data->elements);
- }
-
- size = 0;
- if (file->private_data)
- size = strlen(file->private_data);
-
- return simple_read_from_buffer(buf, len, ppos, file->private_data, size);
-}
-
-static int xen_array_release(struct inode *inode, struct file *file)
-{
- kfree(file->private_data);
-
- return 0;
-}
-
-static const struct file_operations u32_array_fops = {
- .owner = THIS_MODULE,
- .open = u32_array_open,
- .release= xen_array_release,
- .read = u32_array_read,
-};
-
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
- struct dentry *parent,
- u32 *array, unsigned elements)
-{
- struct array_data *data = kmalloc(sizeof(*data), GFP_KERNEL);
-
- if (data == NULL)
- return NULL;
-
- data->array = array;
- data->elements = elements;
-
- return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
-}
Index: current/arch/x86/xen/debugfs.h
===================================================================
--- current.orig/arch/x86/xen/debugfs.h
+++ current/arch/x86/xen/debugfs.h
@@ -3,8 +3,4 @@
struct dentry * __init xen_init_debugfs(void);
-struct dentry *xen_debugfs_create_u32_array(const char *name, mode_t mode,
- struct dentry *parent,
- u32 *array, unsigned elements);
-
#endif /* _XEN_DEBUGFS_H */
Index: current/arch/x86/xen/mmu.c
===================================================================
--- current.orig/arch/x86/xen/mmu.c
+++ current/arch/x86/xen/mmu.c
@@ -1983,7 +1983,7 @@ static int __init xen_mmu_debugfs(void)
debugfs_create_u32("mmu_update", 0444, d_mmu_debug, &mmu_stats.mmu_update);
debugfs_create_u32("mmu_update_extended", 0444, d_mmu_debug,
&mmu_stats.mmu_update_extended);
- xen_debugfs_create_u32_array("mmu_update_histo", 0444, d_mmu_debug,
+ debugfs_create_u32_array("mmu_update_histo", 0444, d_mmu_debug,
mmu_stats.mmu_update_histo, 20);
debugfs_create_u32("set_pte_at", 0444, d_mmu_debug, &mmu_stats.set_pte_at);
Index: current/arch/x86/xen/multicalls.c
===================================================================
--- current.orig/arch/x86/xen/multicalls.c
+++ current/arch/x86/xen/multicalls.c
@@ -269,11 +269,11 @@ static int __init xen_mc_debugfs(void)
debugfs_create_u32("hypercalls", 0444, d_mc_debug, &mc_stats.hypercalls);
debugfs_create_u32("arg_total", 0444, d_mc_debug, &mc_stats.arg_total);
- xen_debugfs_create_u32_array("batch_histo", 0444, d_mc_debug,
+ debugfs_create_u32_array("batch_histo", 0444, d_mc_debug,
mc_stats.histo, MC_BATCH);
- xen_debugfs_create_u32_array("hypercall_histo", 0444, d_mc_debug,
+ debugfs_create_u32_array("hypercall_histo", 0444, d_mc_debug,
mc_stats.histo_hypercalls, NHYPERCALLS);
- xen_debugfs_create_u32_array("flush_reasons", 0444, d_mc_debug,
+ debugfs_create_u32_array("flush_reasons", 0444, d_mc_debug,
mc_stats.flush, FL_N_REASONS);
return 0;
Index: current/arch/x86/xen/spinlock.c
===================================================================
--- current.orig/arch/x86/xen/spinlock.c
+++ current/arch/x86/xen/spinlock.c
@@ -423,11 +423,11 @@ static int __init xen_spinlock_debugfs(v
debugfs_create_u64("time_total", 0444, d_spin_debug,
&spinlock_stats.time_total);
- xen_debugfs_create_u32_array("histo_total", 0444, d_spin_debug,
+ debugfs_create_u32_array("histo_total", 0444, d_spin_debug,
spinlock_stats.histo_spin_total, HISTO_BUCKETS + 1);
- xen_debugfs_create_u32_array("histo_spinning", 0444, d_spin_debug,
+ debugfs_create_u32_array("histo_spinning", 0444, d_spin_debug,
spinlock_stats.histo_spin_spinning, HISTO_BUCKETS + 1);
- xen_debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+ debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
return 0;
Index: current/fs/debugfs/file.c
===================================================================
--- current.orig/fs/debugfs/file.c
+++ current/fs/debugfs/file.c
@@ -18,6 +18,7 @@
#include <linux/pagemap.h>
#include <linux/namei.h>
#include <linux/debugfs.h>
+#include <linux/slab.h>
static ssize_t default_read_file(struct file *file, char __user *buf,
size_t count, loff_t *ppos)
@@ -531,3 +532,119 @@ struct dentry *debugfs_create_blob(const
return debugfs_create_file(name, mode, parent, blob, &fops_blob);
}
EXPORT_SYMBOL_GPL(debugfs_create_blob);
+
+struct array_data {
+ void *array;
+ unsigned elements;
+};
+
+static int u32_array_open(struct inode *inode, struct file *file)
+{
+ file->private_data = NULL;
+ return nonseekable_open(inode, file);
+}
+
+static size_t format_array(char *buf, size_t bufsize, const char *fmt,
+ u32 *array, unsigned array_size)
+{
+ size_t ret = 0;
+ unsigned i;
+
+ for (i = 0; i < array_size; i++) {
+ size_t len;
+
+ len = snprintf(buf, bufsize, fmt, array[i]);
+ len++; /* ' ' or '\n' */
+ ret += len;
+
+ if (buf) {
+ buf += len;
+ bufsize -= len;
+ buf[-1] = (i == array_size-1) ? '\n' : ' ';
+ }
+ }
+
+ ret++; /* \0 */
+ if (buf)
+ *buf = '\0';
+
+ return ret;
+}
+
+static char *
+format_array_alloc(const char *fmt, u32 *array, unsigned array_size)
+{
+ size_t len = format_array(NULL, 0, fmt, array, array_size);
+ char *ret;
+
+ ret = kmalloc(len, GFP_KERNEL);
+ if (ret == NULL)
+ return NULL;
+
+ format_array(ret, len, fmt, array, array_size);
+ return ret;
+}
+
+static ssize_t u32_array_read(struct file *file, char __user *buf, size_t len,
+ loff_t *ppos)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ struct array_data *data = inode->i_private;
+ size_t size;
+
+ if (*ppos == 0) {
+ kfree(file->private_data);
+ file->private_data = format_array_alloc("%u", data->array,
+ data->elements);
+ }
+
+ size = 0;
+ if (file->private_data)
+ size = strlen(file->private_data);
+
+ return simple_read_from_buffer(buf, len, ppos,
+ file->private_data, size);
+}
+
+static int u32_array_release(struct inode *inode, struct file *file)
+{
+ kfree(file->private_data);
+
+ return 0;
+}
+
+static const struct file_operations u32_array_fops = {
+ .owner = THIS_MODULE,
+ .open = u32_array_open,
+ .release = u32_array_release,
+ .read = u32_array_read,
+};
+
+/**
+ * debugfs_create_u32_array - create a debugfs file that is used to read an array of unsigned 32-bit values.
+ * @name: a pointer to a string containing the name of the file to create.
+ * @mode: the permission that the file should have
+ * @parent: a pointer to the parent dentry for this file. This should be a
+ * directory dentry if set. If this parameter is %NULL, then the
+ * file will be created in the root of the debugfs filesystem.
+ * @array: array of integers of type u32
+ * @elements: number of elements in the array
+ */
+struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
+ struct dentry *parent,
+ u32 *array, unsigned elements)
+{
+	struct array_data *data;
+	/* u32 array can't be written or executed! */
+	if (mode & (S_IWUGO | S_IXUGO))
+		return ERR_PTR(-EINVAL);
+
+	data = kmalloc(sizeof(*data), GFP_KERNEL);
+	if (data == NULL)
+		return NULL;
+
+ data->array = array;
+ data->elements = elements;
+
+ return debugfs_create_file(name, mode, parent, data, &u32_array_fops);
+}
Index: current/include/linux/debugfs.h
===================================================================
--- current.orig/include/linux/debugfs.h
+++ current/include/linux/debugfs.h
@@ -74,6 +74,10 @@ struct dentry *debugfs_create_blob(const
struct dentry *parent,
struct debugfs_blob_wrapper *blob);
+struct dentry *debugfs_create_u32_array(const char *name, mode_t mode,
+ struct dentry *parent,
+ u32 *array, unsigned elements);
+
bool debugfs_initialized(void);
#else
@@ -188,6 +192,13 @@ static inline struct dentry *debugfs_cre
return ERR_PTR(-ENODEV);
}
+static inline struct dentry *debugfs_create_u32_array(const char *name,
+					mode_t mode, struct dentry *parent,
+					u32 *array, unsigned elements)
+{
+ return ERR_PTR(-ENODEV);
+}
+
static inline bool debugfs_initialized(void)
{
return false;
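For reference, a usage example for the new interface (not part of the patch;
my_counters and "my_stats" are made-up names):

#include <linux/debugfs.h>
#include <linux/err.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/types.h>

static u32 my_counters[16];	/* counters updated elsewhere */

static int __init my_stats_debugfs_init(void)
{
	struct dentry *dir = debugfs_create_dir("my_stats", NULL);

	if (IS_ERR_OR_NULL(dir))
		return -ENODEV;

	/* exposes all 16 values as a single space-separated, read-only file */
	debugfs_create_u32_array("counters", 0444, dir,
				 my_counters, ARRAY_SIZE(my_counters));
	return 0;
}
fs_initcall(my_stats_debugfs_init);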
Add KVM hypercall for yielding vcpu timeslice.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
---
arch/x86/include/asm/kvm_para.h | 1 +
arch/x86/kvm/x86.c | 7 ++++++-
include/linux/kvm.h | 1 +
include/linux/kvm_para.h | 1 +
4 files changed, 9 insertions(+), 1 deletion(-)
Index: current/arch/x86/include/asm/kvm_para.h
===================================================================
--- current.orig/arch/x86/include/asm/kvm_para.h
+++ current/arch/x86/include/asm/kvm_para.h
@@ -16,6 +16,7 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_YIELD 4
/* This indicates that the new set of kvmclock msrs
* are available. The use of 0x11 and 0x12 is deprecated
*/
Index: current/arch/x86/kvm/x86.c
===================================================================
--- current.orig/arch/x86/kvm/x86.c
+++ current/arch/x86/kvm/x86.c
@@ -1618,6 +1618,7 @@ int kvm_dev_ioctl_check_extension(long e
case KVM_CAP_PCI_SEGMENT:
case KVM_CAP_DEBUGREGS:
case KVM_CAP_X86_ROBUST_SINGLESTEP:
+ case KVM_CAP_YIELD_HYPERCALL:
r = 1;
break;
case KVM_CAP_COALESCED_MMIO:
@@ -1993,7 +1994,8 @@ static void do_cpuid_ent(struct kvm_cpui
entry->eax = (1 << KVM_FEATURE_CLOCKSOURCE) |
(1 << KVM_FEATURE_NOP_IO_DELAY) |
(1 << KVM_FEATURE_CLOCKSOURCE2) |
- (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
+ (1 << KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
+ (1 << KVM_FEATURE_YIELD);
entry->ebx = 0;
entry->ecx = 0;
entry->edx = 0;
@@ -4245,6 +4247,9 @@ int kvm_emulate_hypercall(struct kvm_vcp
case KVM_HC_MMU_OP:
r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2), &ret);
break;
+ case KVM_HC_YIELD:
+ ret = 0;
+ yield();
default:
ret = -KVM_ENOSYS;
break;
Index: current/include/linux/kvm.h
===================================================================
--- current.orig/include/linux/kvm.h
+++ current/include/linux/kvm.h
@@ -524,6 +524,7 @@ struct kvm_enable_cap {
#define KVM_CAP_PPC_OSI 52
#define KVM_CAP_PPC_UNSET_IRQ 53
#define KVM_CAP_ENABLE_CAP 54
+#define KVM_CAP_YIELD_HYPERCALL 55
#ifdef KVM_CAP_IRQ_ROUTING
Index: current/include/linux/kvm_para.h
===================================================================
--- current.orig/include/linux/kvm_para.h
+++ current/include/linux/kvm_para.h
@@ -17,6 +17,7 @@
#define KVM_HC_VAPIC_POLL_IRQ 1
#define KVM_HC_MMU_OP 2
+#define KVM_HC_YIELD 3
/*
* hypercalls use architecture specific
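For illustration, a minimal guest-side sketch of issuing the new hypercall (the
actual guest wiring is in the next patch; this only assumes the existing
kvm_para helpers):

#include <linux/kvm_para.h>
#include <asm/processor.h>

/* Yield to the host if it understands KVM_HC_YIELD, else just spin politely. */
static void yield_to_host(void)
{
	if (kvm_para_available() && kvm_para_has_feature(KVM_FEATURE_YIELD))
		kvm_hypercall0(KVM_HC_YIELD);
	else
		cpu_relax();
}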
Paravirtual spinlock implementation for KVM guests, based heavily on Xen guest's
spinlock implementation.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
---
arch/x86/Kconfig | 8 +
arch/x86/kernel/head64.c | 3
arch/x86/kernel/kvm.c | 293 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/kvm_para.h | 8 +
4 files changed, 312 insertions(+)
Index: current/arch/x86/Kconfig
===================================================================
--- current.orig/arch/x86/Kconfig
+++ current/arch/x86/Kconfig
@@ -551,6 +551,14 @@ config KVM_GUEST
This option enables various optimizations for running under the KVM
hypervisor.
+config KVM_DEBUG_FS
+ bool "Enable debug information to be collected for KVM guests"
+ default n
+ depends on KVM_GUEST && EXPERIMENTAL
+ ---help---
+	  This option collects various debug information from KVM guests and
+	  displays it in the guest kernel's debugfs.
+
source "arch/x86/lguest/Kconfig"
config PARAVIRT
Index: current/arch/x86/kernel/head64.c
===================================================================
--- current.orig/arch/x86/kernel/head64.c
+++ current/arch/x86/kernel/head64.c
@@ -12,6 +12,7 @@
#include <linux/percpu.h>
#include <linux/start_kernel.h>
#include <linux/io.h>
+#include <linux/kvm_para.h>
#include <asm/processor.h>
#include <asm/proto.h>
@@ -113,6 +114,8 @@ void __init x86_64_start_reservations(ch
reserve_ebda_region();
+ kvm_guest_early_init();
+
/*
* At this point everything still needed from the boot loader
* or BIOS or kernel text should be early reserved or marked not
Index: current/arch/x86/kernel/kvm.c
===================================================================
--- current.orig/arch/x86/kernel/kvm.c
+++ current/arch/x86/kernel/kvm.c
@@ -27,6 +27,8 @@
#include <linux/mm.h>
#include <linux/highmem.h>
#include <linux/hardirq.h>
+#include <linux/debugfs.h>
+#include <linux/sched.h>
#include <asm/timer.h>
#define MMU_QUEUE_SIZE 1024
@@ -238,3 +240,294 @@ void __init kvm_guest_init(void)
paravirt_ops_setup();
}
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+
+#ifdef CONFIG_KVM_DEBUG_FS
+
+static struct spinlock_stats
+{
+ u64 taken;
+ u32 taken_slow;
+
+ u64 released;
+
+#define HISTO_BUCKETS 30
+ u32 histo_spin_total[HISTO_BUCKETS+1];
+ u32 histo_spin_spinning[HISTO_BUCKETS+1];
+ u32 histo_spin_blocked[HISTO_BUCKETS+1];
+
+ u64 time_total;
+ u64 time_spinning;
+ u64 time_blocked;
+} spinlock_stats;
+
+static u8 zero_stats;
+
+static unsigned lock_timeout = 1 << 10;
+#define TIMEOUT lock_timeout
+
+static inline void check_zero(void)
+{
+ if (unlikely(zero_stats)) {
+ memset(&spinlock_stats, 0, sizeof(spinlock_stats));
+ zero_stats = 0;
+ }
+}
+
+#define ADD_STATS(elem, val) \
+ do { check_zero(); spinlock_stats.elem += (val); } while (0)
+
+static inline u64 spin_time_start(void)
+{
+ return sched_clock();
+}
+
+static void __spin_time_accum(u64 delta, u32 *array)
+{
+ unsigned index = ilog2(delta);
+
+ check_zero();
+
+ if (index < HISTO_BUCKETS)
+ array[index]++;
+ else
+ array[HISTO_BUCKETS]++;
+}
+
+static inline void spin_time_accum_spinning(u64 start)
+{
+ u32 delta = sched_clock() - start;
+
+ __spin_time_accum(delta, spinlock_stats.histo_spin_spinning);
+ spinlock_stats.time_spinning += delta;
+}
+
+static inline void spin_time_accum_total(u64 start)
+{
+ u32 delta = sched_clock() - start;
+
+ __spin_time_accum(delta, spinlock_stats.histo_spin_total);
+ spinlock_stats.time_total += delta;
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+ u32 delta = sched_clock() - start;
+
+ __spin_time_accum(delta, spinlock_stats.histo_spin_blocked);
+ spinlock_stats.time_blocked += delta;
+}
+
+static struct dentry *d_spin_debug;
+
+static int __init kvm_spinlock_debugfs(void)
+{
+ struct dentry *d_parent;
+
+ d_parent = debugfs_create_dir("kvm", NULL);
+ if (IS_ERR(d_parent)) {
+ printk(KERN_WARNING "Could not create \"kvm\" directory in "\
+ "debugfs (errno = %li)\n", PTR_ERR(d_parent));
+ return PTR_ERR(d_parent);
+ }
+
+ d_spin_debug = debugfs_create_dir("spinlocks", d_parent);
+
+ debugfs_create_u8("zero_stats", 0644, d_spin_debug, &zero_stats);
+
+ debugfs_create_u32("timeout", 0644, d_spin_debug, &lock_timeout);
+
+ debugfs_create_u64("taken", 0444, d_spin_debug, &spinlock_stats.taken);
+ debugfs_create_u32("taken_slow", 0444, d_spin_debug,
+ &spinlock_stats.taken_slow);
+
+ debugfs_create_u64("released", 0444, d_spin_debug,
+ &spinlock_stats.released);
+
+ debugfs_create_u64("time_spinning", 0444, d_spin_debug,
+ &spinlock_stats.time_spinning);
+ debugfs_create_u64("time_blocked", 0444, d_spin_debug,
+ &spinlock_stats.time_blocked);
+ debugfs_create_u64("time_total", 0444, d_spin_debug,
+ &spinlock_stats.time_total);
+
+ debugfs_create_u32_array("histo_total", 0444, d_spin_debug,
+ spinlock_stats.histo_spin_total, HISTO_BUCKETS + 1);
+ debugfs_create_u32_array("histo_spinning", 0444, d_spin_debug,
+ spinlock_stats.histo_spin_spinning, HISTO_BUCKETS + 1);
+ debugfs_create_u32_array("histo_blocked", 0444, d_spin_debug,
+ spinlock_stats.histo_spin_blocked, HISTO_BUCKETS + 1);
+
+ return 0;
+}
+fs_initcall(kvm_spinlock_debugfs);
+
+#else /* CONFIG_KVM_DEBUG_FS */
+
+#define TIMEOUT (1 << 10)
+#define ADD_STATS(elem, val) do { (void)(val); } while (0)
+
+static inline u64 spin_time_start(void)
+{
+ return 0;
+}
+
+static inline void spin_time_accum_total(u64 start)
+{
+}
+
+static inline void spin_time_accum_spinning(u64 start)
+{
+}
+
+static inline void spin_time_accum_blocked(u64 start)
+{
+}
+
+#endif /* CONFIG_KVM_DEBUG_FS */
+
+struct kvm_spinlock {
+ unsigned char lock; /* 0 -> free; 1 -> locked */
+ unsigned short spinners; /* count of waiting cpus */
+};
+
+/*
+ * Mark a cpu as interested in a lock: atomically bump the count of
+ * cpus spin-waiting on this lock.
+ */
+static inline void spinning_lock(struct kvm_spinlock *pl)
+{
+ asm(LOCK_PREFIX " incw %0"
+ : "+m" (pl->spinners) : : "memory");
+}
+
+/*
+ * Mark a cpu as no longer interested in a lock: atomically drop the
+ * count of cpus spin-waiting on this lock.
+ */
+static inline void unspinning_lock(struct kvm_spinlock *pl)
+{
+ asm(LOCK_PREFIX " decw %0"
+ : "+m" (pl->spinners) : : "memory");
+}
+
+static int kvm_spin_is_locked(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+
+ return sl->lock != 0;
+}
+
+static int kvm_spin_is_contended(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+
+ /* Not strictly true; this is only the count of contended
+ lock-takers entering the slow path. */
+ return sl->spinners != 0;
+}
+
+static int kvm_spin_trylock(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+ u8 old = 1;
+
+ asm("xchgb %b0,%1"
+ : "+q" (old), "+m" (sl->lock) : : "memory");
+
+ return old == 0;
+}
+
+static noinline int kvm_spin_lock_slow(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+ u64 start;
+
+ ADD_STATS(taken_slow, 1);
+
+ /* announce we're spinning */
+ spinning_lock(sl);
+
+ start = spin_time_start();
+ kvm_hypercall0(KVM_HC_YIELD);
+ spin_time_accum_blocked(start);
+
+ unspinning_lock(sl);
+
+ return 0;
+}
+
+static inline void __kvm_spin_lock(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+ unsigned timeout;
+ u8 oldval;
+ u64 start_spin;
+
+ ADD_STATS(taken, 1);
+
+ start_spin = spin_time_start();
+
+ do {
+ u64 start_spin_fast = spin_time_start();
+
+ timeout = TIMEOUT;
+
+ asm("1: xchgb %1,%0\n"
+ " testb %1,%1\n"
+ " jz 3f\n"
+ "2: rep;nop\n"
+ " cmpb $0,%0\n"
+ " je 1b\n"
+ " dec %2\n"
+ " jnz 2b\n"
+ "3:\n"
+ : "+m" (sl->lock), "=q" (oldval), "+r" (timeout)
+ : "1" (1)
+ : "memory");
+
+ spin_time_accum_spinning(start_spin_fast);
+
+ } while (unlikely(oldval != 0 &&
+ (TIMEOUT == ~0 || !kvm_spin_lock_slow(lock))));
+
+ spin_time_accum_total(start_spin);
+}
+
+static void kvm_spin_lock(struct arch_spinlock *lock)
+{
+ __kvm_spin_lock(lock);
+}
+
+static void kvm_spin_lock_flags(struct arch_spinlock *lock, unsigned long flags)
+{
+ __kvm_spin_lock(lock);
+}
+
+static void kvm_spin_unlock(struct arch_spinlock *lock)
+{
+ struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
+
+ ADD_STATS(released, 1);
+
+ smp_wmb(); /* make sure no writes get moved after unlock */
+ sl->lock = 0; /* release lock */
+}
+
+void __init kvm_guest_early_init(void)
+{
+ if (!kvm_para_available())
+ return;
+
+ if (!kvm_para_has_feature(KVM_FEATURE_YIELD))
+ return;
+
+ pv_lock_ops.spin_is_locked = kvm_spin_is_locked;
+ pv_lock_ops.spin_is_contended = kvm_spin_is_contended;
+ pv_lock_ops.spin_lock = kvm_spin_lock;
+ pv_lock_ops.spin_lock_flags = kvm_spin_lock_flags;
+ pv_lock_ops.spin_trylock = kvm_spin_trylock;
+ pv_lock_ops.spin_unlock = kvm_spin_unlock;
+}
+#endif /* CONFIG_PARAVIRT_SPINLOCKS */
Index: current/include/linux/kvm_para.h
===================================================================
--- current.orig/include/linux/kvm_para.h
+++ current/include/linux/kvm_para.h
@@ -27,8 +27,16 @@
#ifdef __KERNEL__
#ifdef CONFIG_KVM_GUEST
void __init kvm_guest_init(void);
+
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+void __init kvm_guest_early_init(void);
+#else
+#define kvm_guest_early_init() do { } while (0)
+#endif
+
#else
#define kvm_guest_init() do { } while (0)
+#define kvm_guest_early_init() do { } while (0)
#endif
static inline int kvm_para_has_feature(unsigned int feature)
Add yield hypercall support in Qemu.
Signed-off-by: Srivatsa Vaddagiri <[email protected]>
---
kvm/include/linux/kvm.h | 1 +
kvm/include/x86/asm/kvm_para.h | 1 +
target-i386/kvm.c | 3 +++
3 files changed, 5 insertions(+)
Index: qemu-kvm/kvm/include/linux/kvm.h
===================================================================
--- qemu-kvm.orig/kvm/include/linux/kvm.h
+++ qemu-kvm/kvm/include/linux/kvm.h
@@ -499,6 +499,7 @@ struct kvm_ioeventfd {
#define KVM_CAP_PPC_SEGSTATE 43
#define KVM_CAP_PCI_SEGMENT 47
+#define KVM_CAP_YIELD_HYPERCALL 55
#ifdef KVM_CAP_IRQ_ROUTING
Index: qemu-kvm/kvm/include/x86/asm/kvm_para.h
===================================================================
--- qemu-kvm.orig/kvm/include/x86/asm/kvm_para.h
+++ qemu-kvm/kvm/include/x86/asm/kvm_para.h
@@ -15,6 +15,7 @@
#define KVM_FEATURE_CLOCKSOURCE 0
#define KVM_FEATURE_NOP_IO_DELAY 1
#define KVM_FEATURE_MMU_OP 2
+#define KVM_FEATURE_YIELD 4
#define MSR_KVM_WALL_CLOCK 0x11
#define MSR_KVM_SYSTEM_TIME 0x12
Index: qemu-kvm/target-i386/kvm.c
===================================================================
--- qemu-kvm.orig/target-i386/kvm.c
+++ qemu-kvm/target-i386/kvm.c
@@ -147,6 +147,9 @@ struct kvm_para_features {
#ifdef KVM_CAP_PV_MMU
{ KVM_CAP_PV_MMU, KVM_FEATURE_MMU_OP },
#endif
+#ifdef KVM_CAP_YIELD_HYPERCALL
+ { KVM_CAP_YIELD_HYPERCALL, KVM_FEATURE_YIELD },
+#endif
{ -1, -1 }
};
On 07/25/2010 11:11 PM, Srivatsa Vaddagiri wrote:
> This patch-series implements paravirt-spinlock implementation for KVM guests,
> based heavily on Xen's implementation.
Excellent! I think this problem has been ignored for too long.
> I tried to refactor Xen's spinlock
> implementation to make it common for both Xen and KVM - but found that
> few differences between Xen and KVM (Xen has the ability to block on a
> particular event/irq for example) _and_ the fact that the guest kernel
> can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
> CONFIG_KVM_GUEST can both be turned on) makes the "common" code a eye-sore.
> There will have to be:
>
> if (xen) {
> ...
> } else if (kvm) {
> ..
> }
>
> or possibly:
>
> alternative(NOP, some_xen_specific_call, ....)
>
> type of code in the common implementation.
No, that doesn't look like a good approach.  It suggests the apparent
commonality isn't really there.
> For the time-being, I have made this KVM-specific only. At somepoint in future,
> I hope this can be made common between Xen/KVM.
Did you see the patch series I posted a couple of weeks ago to revamp pv
spinlocks? Specifically, I dropped the notion of pv spinlocks in which
the entire spinlock implementation is replaced, and added pv ticketlocks
where the ticketlock algorithm is always used for the fastpath, but it
calls out to pvop calls for the slowpath (a long spin, or unlocking a
lock with waiters). It significantly reduces the amount of
hypervisor-specific code.
You can see the current patches in
git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
xen/pvticketlock-git
> More background and results for this patch below:
>
> What is the Problem being solved?
> =================================
>
> Guest operating system can be preempted by hypervisor at any arbitrary point.
> There is no mechanism (that I know of) where guest OS can disable preemption for
> certain periods of time. One noticeable effect of this is with regard to
> locking. Lets say one virtual-cpu of a guest (VCPUA) grabs a spinlock and before
> it could relinquish the lock is preempted by hypervisor. The time-of-preemption
> (when the virtual cpu is off the cpu) can be quite large. In that period, if
> another of guest OS's virtual cpu (VCPUB) tries grabbing the same lock, it could
> end up spin-waiting a _long_ time, burning cycles unnecessarily. To add to the
> woes, VCPUB may actually be waiting for VCPUA to yield before it can run on
> the same (physical) cpu. This is termed as the "lock-holder preemption" (LHP)
> problem. The effect of it can be quite serious. For ex:
> http://lkml.org/lkml/2010/4/11/108 reported 80% performance degradation because
> of an issue attributed to LHP problem.
That's not actually the real problem. It's *a* problem, but
insignificant compared to the ticketlock-specific "next-in-line vcpu
scheduler bunfight" problem - lock holder preemption is a misnomer.
Fortunately the solutions to both are (nearly) the same.
See Thomas Friebel's talk "Prevent Guests from Spinning Around"
(http://www.xen.org/files/xensummitboston08/LHP.pdf).
> Solutions
> =========
>
> There are several solutions to this problem.
>
> a. Realize that a lock-holder could have been preempted, and avoid spin-waiting
> too long. Instead, yield cycles (to the lock-holder perhaps). This is a
> common solution adopted by most paravirtualized-guests (Xen, s390, powerpc).
>
> b. Avoid preempting a lock-holder while its holding a (spin-) lock.
>
> In this scheme, guest OS can hint (set some flag in memory shared with
> hypervisor) whenever its holding a lock and hypervisor could defer preempting
> the guest vcpu when its holding a lock. With this scheme, we should never
> have a lock-acquiring vcpu spin on a preempted vcpu to release its lock. If
> ever it spins, its because somebody *currently running* is holding the lock -
> and hence it won't have to spin-wait too long. IOW we are pro-actively
> trying to prevent the LHP problem from occuring in the first place. This
> should improve job turnaround time for some workloads. [1] has some
> results based on this approach.
This doesn't actually help the problem mentioned above, because it's not
a problem with the lock holder getting preempted, but what happens once
the lock has been released.
> c. Share run-status of vcpu with guests. This could be used to optimize
> routines like mutex_spin_on_owner().
>
> Hypervisor could share run-status of vcpus in guest kernel memory. This would
> allow us to optimize routines like mutex_spin_on_owner() - we don't spin-wait
> if we relaize that the target vcpu has been preempted.
>
> a) and c) are about dealing with the LHP problem, while b) is about preventing
> the problem from happening.
>
> Scenario A:
>
> W/o patch W/ Patch Difference
> Avg (std-dev) Avg. (std-dev)
>
> 1: 273.270 (1.051) 251.613 (1.155) 7.9%
> 2: 541.343 (57.317) 443.400 (2.832) 18.1%
> 3: 819.160 (9.218) 642.393 (2.646) 21.6%
> 4: 1020.493 (34.743) 839.327 (0.658) 17.8%
>
>
> Scenario B:
>
> 1: 276.947 (0.644) 248.397 (1.563) 10.3%
> 2: 500.723 (5.694) 438.957 (6.112) 12.33%
> 3: 725.687 (5.267) 641.317 (3.902) 11.62%
> 4: 973.910 (21.712) 836.853 (2.754) 14.07%
>
> Scenario C:
>
> 1: 564.610 (12.055) 420.510 (2.598) 25.52%
> 2: 750.867 (5.139) 618.570 (2.914) 17.61%
> 3: 950.150 (13.496) 813.803 (1.872) 14.35%
> 4: 1125.027 (5.337) 1007.63 (5.707) 10.43%
>
>
> IMO this is good improvement with the patchset applied.
Those are consistent with the results I get too.
J
On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
> Add KVM hypercall for yielding vcpu timeslice.
Can you do a directed yield?
J
> Signed-off-by: Srivatsa Vaddagiri<[email protected]>
>
> ---
> arch/x86/include/asm/kvm_para.h | 1 +
> arch/x86/kvm/x86.c | 7 ++++++-
> include/linux/kvm.h | 1 +
> include/linux/kvm_para.h | 1 +
> 4 files changed, 9 insertions(+), 1 deletion(-)
>
> Index: current/arch/x86/include/asm/kvm_para.h
> ===================================================================
> --- current.orig/arch/x86/include/asm/kvm_para.h
> +++ current/arch/x86/include/asm/kvm_para.h
> @@ -16,6 +16,7 @@
> #define KVM_FEATURE_CLOCKSOURCE 0
> #define KVM_FEATURE_NOP_IO_DELAY 1
> #define KVM_FEATURE_MMU_OP 2
> +#define KVM_FEATURE_YIELD 4
> /* This indicates that the new set of kvmclock msrs
> * are available. The use of 0x11 and 0x12 is deprecated
> */
> Index: current/arch/x86/kvm/x86.c
> ===================================================================
> --- current.orig/arch/x86/kvm/x86.c
> +++ current/arch/x86/kvm/x86.c
> @@ -1618,6 +1618,7 @@ int kvm_dev_ioctl_check_extension(long e
> case KVM_CAP_PCI_SEGMENT:
> case KVM_CAP_DEBUGREGS:
> case KVM_CAP_X86_ROBUST_SINGLESTEP:
> + case KVM_CAP_YIELD_HYPERCALL:
> r = 1;
> break;
> case KVM_CAP_COALESCED_MMIO:
> @@ -1993,7 +1994,8 @@ static void do_cpuid_ent(struct kvm_cpui
> entry->eax = (1<< KVM_FEATURE_CLOCKSOURCE) |
> (1<< KVM_FEATURE_NOP_IO_DELAY) |
> (1<< KVM_FEATURE_CLOCKSOURCE2) |
> - (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT);
> + (1<< KVM_FEATURE_CLOCKSOURCE_STABLE_BIT) |
> + (1<< KVM_FEATURE_YIELD);
> entry->ebx = 0;
> entry->ecx = 0;
> entry->edx = 0;
> @@ -4245,6 +4247,9 @@ int kvm_emulate_hypercall(struct kvm_vcp
> case KVM_HC_MMU_OP:
> r = kvm_pv_mmu_op(vcpu, a0, hc_gpa(vcpu, a1, a2),&ret);
> break;
> + case KVM_HC_YIELD:
> + ret = 0;
> + yield();
> default:
> ret = -KVM_ENOSYS;
> break;
> Index: current/include/linux/kvm.h
> ===================================================================
> --- current.orig/include/linux/kvm.h
> +++ current/include/linux/kvm.h
> @@ -524,6 +524,7 @@ struct kvm_enable_cap {
> #define KVM_CAP_PPC_OSI 52
> #define KVM_CAP_PPC_UNSET_IRQ 53
> #define KVM_CAP_ENABLE_CAP 54
> +#define KVM_CAP_YIELD_HYPERCALL 55
>
> #ifdef KVM_CAP_IRQ_ROUTING
>
> Index: current/include/linux/kvm_para.h
> ===================================================================
> --- current.orig/include/linux/kvm_para.h
> +++ current/include/linux/kvm_para.h
> @@ -17,6 +17,7 @@
>
> #define KVM_HC_VAPIC_POLL_IRQ 1
> #define KVM_HC_MMU_OP 2
> +#define KVM_HC_YIELD 3
>
> /*
> * hypercalls use architecture specific
>
On Mon, Jul 26, 2010 at 10:18:58AM -0700, Jeremy Fitzhardinge wrote:
> >I tried to refactor Xen's spinlock
> >implementation to make it common for both Xen and KVM - but found that
> >few differences between Xen and KVM (Xen has the ability to block on a
> >particular event/irq for example) _and_ the fact that the guest kernel
> >can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
> >CONFIG_KVM_GUEST can both be turned on) makes the "common" code a eye-sore.
> >There will have to be:
> >
> > if (xen) {
> > ...
> > } else if (kvm) {
> > ..
> > }
> >
> >or possibly:
> >
> > alternative(NOP, some_xen_specific_call, ....)
> >
> >type of code in the common implementation.
>
> No, that doesn't look like a good approach. It suggests the
> apparently commonality isn't really there.
>
> >For the time-being, I have made this KVM-specific only. At somepoint in future,
> >I hope this can be made common between Xen/KVM.
>
> Did you see the patch series I posted a couple of weeks ago to
> revamp pv spinlocks? Specifically, I dropped the notion of pv
> spinlocks in which the entire spinlock implementation is replaced,
> and added pv ticketlocks where the ticketlock algorithm is always
> used for the fastpath, but it calls out to pvop calls for the
> slowpath (a long spin, or unlocking a lock with waiters). It
> significantly reduces the amount of hypervisor-specific code.
Hmmm interesting - I will go thr' it in detail.
> You can see the current patches in
>
> git://git.kernel.org/pub/scm/linux/kernel/git/jeremy/xen.git
> xen/pvticketlock-git
[snip]
> That's not actually the real problem. It's *a* problem, but
> insignificant compared to the ticketlock-specific "next-in-line vcpu
> scheduler bunfight" problem - lock holder preemption is a misnomer.
> Fortunately the solutions to both are (nearly) the same.
>
> See Thomas Friebel's talk "Prevent Guests from Spinning Around
> (http://www.xen.org/files/xensummitboston08/LHP.pdf).
Yes, I had seen Thomas's slides reporting a huge degradation in performance with
ticket spinlocks.
> >b. Avoid preempting a lock-holder while its holding a (spin-) lock.
> >
> > In this scheme, guest OS can hint (set some flag in memory shared with
> > hypervisor) whenever its holding a lock and hypervisor could defer preempting
> > the guest vcpu when its holding a lock. With this scheme, we should never
> > have a lock-acquiring vcpu spin on a preempted vcpu to release its lock. If
> > ever it spins, its because somebody *currently running* is holding the lock -
> > and hence it won't have to spin-wait too long. IOW we are pro-actively
> > trying to prevent the LHP problem from occuring in the first place. This
> > should improve job turnaround time for some workloads. [1] has some
> > results based on this approach.
>
> This doesn't actually help the problem mentioned above, because it's
> not a problem with the lock holder getting preempted, but what
> happens once the lock has been released.
Good point. I agree that the latter problem needs more attention, given a
ticket-type implementation of spinlocks. Have you considered possible solutions
for unmodified guests, which have similar ticket-type lock implementations?
Not sure if that's important enough to investigate solutions like gang
scheduling ..
- vatsa
On Mon, Jul 26, 2010 at 10:19:41AM -0700, Jeremy Fitzhardinge wrote:
> On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
> >Add KVM hypercall for yielding vcpu timeslice.
>
> Can you do a directed yield?
We don't have that support yet in the Linux scheduler. Also, I feel it would be
more useful when the target vcpu and the yielding vcpu are on the same physical
cpu, rather than when they are on separate cpus. With the latter, yielding (or
donating) a timeslice does not ensure that the target vcpu runs immediately,
and I suspect fairness issues need to be tackled as well (a large number of
waiters shouldn't boost a lock-holder's timeslice so much that it gets a
larger share).
- vatsa
On Mon, Jul 26, 2010 at 11:41:50AM +0530, Srivatsa Vaddagiri wrote:
> This patch-series implements paravirt-spinlock implementation for KVM guests,
> based heavily on Xen's implementation. I tried to refactor Xen's spinlock
> implementation to make it common for both Xen and KVM - but found that
> few differences between Xen and KVM (Xen has the ability to block on a
> particular event/irq for example) _and_ the fact that the guest kernel
> can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
> CONFIG_KVM_GUEST can both be turned on) makes the "common" code a eye-sore.
> There will have to be:
>
> if (xen) {
> ...
> } else if (kvm) {
> ..
> }
>
> or possibly:
>
> alternative(NOP, some_xen_specific_call, ....)
Why not utilize the pvops path?
On Wed, Jul 28, 2010 at 06:10:59PM -0400, Konrad Rzeszutek Wilk wrote:
> On Mon, Jul 26, 2010 at 11:41:50AM +0530, Srivatsa Vaddagiri wrote:
> > This patch-series implements paravirt-spinlock implementation for KVM guests,
> > based heavily on Xen's implementation. I tried to refactor Xen's spinlock
> > implementation to make it common for both Xen and KVM - but found that
> > few differences between Xen and KVM (Xen has the ability to block on a
> > particular event/irq for example) _and_ the fact that the guest kernel
> > can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
> > CONFIG_KVM_GUEST can both be turned on) makes the "common" code a eye-sore.
> > There will have to be:
> >
> > if (xen) {
> > ...
> > } else if (kvm) {
> > ..
> > }
> >
> > or possibly:
> >
> > alternative(NOP, some_xen_specific_call, ....)
>
> Why not utilize the pvops path?
Duh. You did use it. Sorry about the noise.
On 07/26/2010 08:19 PM, Jeremy Fitzhardinge wrote:
> On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
>> Add KVM hypercall for yielding vcpu timeslice.
>
> Can you do a directed yield?
>
A problem with directed yield is figuring out who to yield to. One idea
is to look for a random vcpu that is not running and donate some runtime
to it. In the best case, it's the lock holder and we cause it to start
running.  Middle case, it's not the lock holder, but we lose enough
runtime to stop running, so at least we don't waste cpu.  Worst case, we
continue running not having woken the lock holder. Spin again, yield
again hoping to find the right vcpu.
--
error compiling committee.c: too many arguments to function
On 07/28/2010 05:55 PM, Srivatsa Vaddagiri wrote:
> On Mon, Jul 26, 2010 at 10:19:41AM -0700, Jeremy Fitzhardinge wrote:
>> On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
>>> Add KVM hypercall for yielding vcpu timeslice.
>> Can you do a directed yield?
> We don't have that support yet in Linux scheduler.
If you think it's useful, it would be good to design it into the
interface, and fall back to ordinary yield if the host doesn't support it.
A big advantage of directed yield vs yield is that you conserve
resources within a VM; a simple yield will cause the guest to drop its
share of cpu to other guest.
Made up example:
- 2 vcpu guest with 10% contention
- lock hold time 10us every 100us
- timeslice 1ms
Ideally this guest can consume 190% cpu (sleeping whenever there is
contention). But if we yield when we detect contention, then we sleep
for 1ms, and utilization drops to around 100%-150% (a vcpu will usually
fall asleep within a few 100us periods).
> Also I feel it would be more
> useful when the target vcpu and yielding vcpu are on the same physical cpu,
> rather than when they are on separate cpus. With latter, yielding (or
> donating) timeslice need not ensure that target vcpu runs immediately
Donate at least the amount needed to wake up the other vcpu; we can
calculate it during wakeup.
> and also I suspect fairness issues needs to be tackled as well (large number of
> waiters shouldn't boot a lock-holders time slice too much that it gets a
> larger share).
I feel ordinary yield suffers from fairness problems a lot more.
--
error compiling committee.c: too many arguments to function
On 07/26/2010 09:15 AM, Srivatsa Vaddagiri wrote:
> Paravirtual spinlock implementation for KVM guests, based heavily on Xen guest's
> spinlock implementation.
>
>
> +
> +static struct spinlock_stats
> +{
> + u64 taken;
> + u32 taken_slow;
> +
> + u64 released;
> +
> +#define HISTO_BUCKETS 30
> + u32 histo_spin_total[HISTO_BUCKETS+1];
> + u32 histo_spin_spinning[HISTO_BUCKETS+1];
> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
> +
> + u64 time_total;
> + u64 time_spinning;
> + u64 time_blocked;
> +} spinlock_stats;
Could these be replaced by tracepoints when starting to spin/stopping
spinning etc? Then userspace can reconstruct the histogram as well as
see which locks are involved and what call paths.
> +struct kvm_spinlock {
> + unsigned char lock; /* 0 -> free; 1 -> locked */
> + unsigned short spinners; /* count of waiting cpus */
> +};
> +
> +/*
> + * Mark a cpu as interested in a lock. Returns the CPU's previous
> + * lock of interest, in case we got preempted by an interrupt.
> + */
> +static inline void spinning_lock(struct kvm_spinlock *pl)
> +{
> + asm(LOCK_PREFIX " incw %0"
> + : "+m" (pl->spinners) : : "memory");
> +}
> +
> +/*
> + * Mark a cpu as no longer interested in a lock. Restores previous
> + * lock of interest (NULL for none).
> + */
> +static inline void unspinning_lock(struct kvm_spinlock *pl)
> +{
> + asm(LOCK_PREFIX " decw %0"
> + : "+m" (pl->spinners) : : "memory");
> +}
> +
> +static int kvm_spin_is_locked(struct arch_spinlock *lock)
> +{
> + struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
> +
> + return sl->lock != 0;
> +}
> +
> +static int kvm_spin_is_contended(struct arch_spinlock *lock)
> +{
> + struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
> +
> + /* Not strictly true; this is only the count of contended
> + lock-takers entering the slow path. */
> + return sl->spinners != 0;
> +}
> +
> +static int kvm_spin_trylock(struct arch_spinlock *lock)
> +{
> + struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
> + u8 old = 1;
> +
> + asm("xchgb %b0,%1"
> + : "+q" (old), "+m" (sl->lock) : : "memory");
> +
> + return old == 0;
> +}
> +
> +static noinline int kvm_spin_lock_slow(struct arch_spinlock *lock)
> +{
> + struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
> + u64 start;
> +
> + ADD_STATS(taken_slow, 1);
> +
> + /* announce we're spinning */
> + spinning_lock(sl);
> +
> + start = spin_time_start();
> + kvm_hypercall0(KVM_HC_YIELD);
Oh. This isn't really a yield since we expect to be woken up? It's
more of a sleep.
We already have a sleep hypercall, it's called HLT. If we can use it,
the thing can work on older hosts. It's tricky though:
- if interrupts were enabled before we started spinning, sleep with
interrupts enabled. This also allows the spinner to switch to another
process if some completion comes along so it's a good idea anyway. Wake
up sends an IPI.
- if not, we need to use NMI to wake up. This is somewhat icky since
there's no atomic "enable NMI and sleep" instruction, so we have to
handle the case of the wake up arriving before HLT (can be done by
examining RIP and seeing if it's in the critical section).
--
error compiling committee.c: too many arguments to function
On 07/26/2010 09:11 AM, Srivatsa Vaddagiri wrote:
> This patch-series implements paravirt-spinlock implementation for KVM guests,
> based heavily on Xen's implementation. I tried to refactor Xen's spinlock
> implementation to make it common for both Xen and KVM - but found that
> few differences between Xen and KVM (Xen has the ability to block on a
> particular event/irq for example) _and_ the fact that the guest kernel
> can be compiled to support both Xen and KVM hypervisors (CONFIG_XEN and
> CONFIG_KVM_GUEST can both be turned on) makes the "common" code a eye-sore.
> There will have to be:
>
> if (xen) {
> ...
> } else if (kvm) {
> ..
> }
>
> or possibly:
>
> alternative(NOP, some_xen_specific_call, ....)
>
> type of code in the common implementation.
I do think things are pretty common. If that is the only issue, you can
use a plain function vector, no?
--
error compiling committee.c: too many arguments to function
On 07/26/2010 09:15 AM, Srivatsa Vaddagiri wrote:
> Paravirtual spinlock implementation for KVM guests, based heavily on Xen guest's
> spinlock implementation.
>
> +static void kvm_spin_unlock(struct arch_spinlock *lock)
> +{
> + struct kvm_spinlock *sl = (struct kvm_spinlock *)lock;
> +
> + ADD_STATS(released, 1);
> +
> + smp_wmb(); /* make sure no writes get moved after unlock */
> + sl->lock = 0; /* release lock */
> +}
Wait, no wakeup?
So it is a yield, not a sleep. I'm worried it could seriously impact
fairness when one non-contending guest (or non-pv) is overcommitted
together with a spin-yield guest.
--
error compiling committee.c: too many arguments to function
* Avi Kivity <[email protected]> [2010-08-02 03:33]:
> On 07/26/2010 08:19 PM, Jeremy Fitzhardinge wrote:
> > On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
> >>Add KVM hypercall for yielding vcpu timeslice.
> >
> >Can you do a directed yield?
> >
>
> A problem with directed yield is figuring out who to yield to. One idea
> is to look for a random vcpu that is not running and donate some runtime
> to it. In the best case, it's the lock holder and we cause it to start
> running. Middle case it's not the lock holder, but we lose enough
> runtime to stop running, so at least we don't waste cpu. Worst case we
> continue running not having woken the lock holder. Spin again, yield
> again hoping to find the right vcpu.
It's been quite some time, but we played with directed yielding for Xen [1],
and we were looking to model the POWER directed yield (H_CONFER), where
the lock-holding vcpu is indicated in the spinlock.  When acquiring
the lock, record the vcpu id.  When another vcpu attempts to acquire the
lock and can't, it can yield its time to the lock holder.
1. http://lists.xensource.com/archives/html/xen-devel/2005-05/msg00776.html
--
Ryan Harper
Software Engineer; Linux Technology Center
IBM Corp., Austin, Tx
[email protected]
On 08/02/2010 05:42 PM, Ryan Harper wrote:
>
>> A problem with directed yield is figuring out who to yield to. One idea
>> is to look for a random vcpu that is not running and donate some runtime
>> to it. In the best case, it's the lock holder and we cause it to start
>> running. Middle case it's not the lock holder, but we lose enough
>> runtime to stop running, so at least we don't waste cpu. Worst case we
>> continue running not having woken the lock holder. Spin again, yield
>> again hoping to find the right vcpu.
> It's been quite some time, but played with directed yielding for Xen[1]
> and we were looking to model the POWER directed yield (H_CONFER) where
> the lock holding vcpu was indiciated in the spinlock. When acquiring
> the lock, record the vcpu id. When another vcpu attempts to acquire the
> lock if it can't it can yield its time to the lock holder.
No reason why we can't have something similar.
We can take the lock and set the owner atomically:
LOCK_PREFIX "cmpxchg %1, %0"
: "=m"(lock) : "r"(raw_smp_processor_id() | SPIN_LOCK_BIAS),
"a"((u16)0) : "memory"
--
error compiling committee.c: too many arguments to function
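Spelled out as a C-level sketch (hypothetical names, not from the posted patches;
the generic cmpxchg() is used instead of inline asm):

#include <linux/compiler.h>	/* ACCESS_ONCE */
#include <linux/smp.h>		/* raw_smp_processor_id */
#include <linux/types.h>
#include <asm/cmpxchg.h>

/* bit 0 is the "held" flag, the upper bits record the owning vcpu id */
struct owner_spinlock {
	u16 val;		/* 0 -> free; (vcpu << 1) | 1 -> held */
};

static inline int owner_spin_trylock(struct owner_spinlock *l)
{
	u16 locked = (raw_smp_processor_id() << 1) | 1;

	return cmpxchg(&l->val, 0, locked) == 0;
}

static inline int owner_spin_owner(struct owner_spinlock *l)
{
	u16 v = ACCESS_ONCE(l->val);

	return (v & 1) ? (v >> 1) : -1;	/* -1 -> not currently held */
}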
On 08/02/2010 01:32 AM, Avi Kivity wrote:
> On 07/26/2010 08:19 PM, Jeremy Fitzhardinge wrote:
>> On 07/25/2010 11:14 PM, Srivatsa Vaddagiri wrote:
>>> Add KVM hypercall for yielding vcpu timeslice.
>>
>> Can you do a directed yield?
>>
>
> A problem with directed yield is figuring out who to yield to. One
> idea is to look for a random vcpu that is not running and donate some
> runtime to it. In the best case, it's the lock holder and we cause it
> to start running. Middle case it's not the lock holder, but we lose
> enough runtime to stop running, so at least we don't waste cpu. Worst
> case we continue running not having woken the lock holder. Spin
> again, yield again hoping to find the right vcpu.
That can help with lockholder preemption, but on unlock you need to wake
up exactly the right vcpu - the next in the ticket queue - in order to
avoid burning masses of cpu. If each cpu records what lock it is
spinning on and what its ticket is in a percpu variable, then the
unlocker can search for the next person to kick.
J
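A rough sketch of the per-cpu bookkeeping described above (hypothetical names,
not from the posted patches; kvm_kick_cpu() stands in for whatever wakeup
primitive the host provides):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/spinlock_types.h>

struct lock_waiting {
	struct arch_spinlock *lock;	/* lock this cpu is spinning on */
	unsigned int ticket;		/* the ticket it is waiting for */
};
static DEFINE_PER_CPU(struct lock_waiting, lock_waiting);

extern void kvm_kick_cpu(int cpu);	/* hypothetical wakeup hypercall */

/* slowpath entry: advertise what we are waiting for, then block until kicked */
static void spin_lock_slowpath(struct arch_spinlock *lock, unsigned int ticket)
{
	struct lock_waiting *w = &__get_cpu_var(lock_waiting);

	w->ticket = ticket;
	w->lock = lock;

	/* ... HLT or a yield/sleep hypercall goes here ... */

	w->lock = NULL;
}

/* unlock side: find whoever is waiting for the next ticket and kick it awake */
static void spin_unlock_kick(struct arch_spinlock *lock, unsigned int next)
{
	int cpu;

	for_each_online_cpu(cpu) {
		struct lock_waiting *w = &per_cpu(lock_waiting, cpu);

		if (w->lock == lock && w->ticket == next) {
			kvm_kick_cpu(cpu);
			break;
		}
	}
}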
On 08/02/2010 01:48 AM, Avi Kivity wrote:
> On 07/26/2010 09:15 AM, Srivatsa Vaddagiri wrote:
>> Paravirtual spinlock implementation for KVM guests, based heavily on
>> Xen guest's
>> spinlock implementation.
>>
>>
>> +
>> +static struct spinlock_stats
>> +{
>> + u64 taken;
>> + u32 taken_slow;
>> +
>> + u64 released;
>> +
>> +#define HISTO_BUCKETS 30
>> + u32 histo_spin_total[HISTO_BUCKETS+1];
>> + u32 histo_spin_spinning[HISTO_BUCKETS+1];
>> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
>> +
>> + u64 time_total;
>> + u64 time_spinning;
>> + u64 time_blocked;
>> +} spinlock_stats;
>
> Could these be replaced by tracepoints when starting to spin/stopping
> spinning etc? Then userspace can reconstruct the histogram as well as
> see which locks are involved and what call paths.
Unfortunately not; the tracing code uses spinlocks.
(TBH I haven't actually tried, but I did give the code an eyeball to
this end.)
J
On Mon, Aug 02, 2010 at 11:40:23AM +0300, Avi Kivity wrote:
> >>Can you do a directed yield?
> >We don't have that support yet in Linux scheduler.
>
> If you think it's useful, it would be good to design it into the
> interface, and fall back to ordinary yield if the host doesn't
> support it.
>
> A big advantage of directed yield vs yield is that you conserve
> resources within a VM; a simple yield will cause the guest to drop
> its share of cpu to other guest.
Hmm .. I see the possibility of modifying yield to reclaim its "lost" timeslice
when it is scheduled next as well. Basically, remember what timeslice we have
given up and add that as a "bonus" when it runs next. That would keep the
dynamics of yield donation/reclaim local to the (physical) cpu and IMHO is less
complex than dealing with directed yield between tasks located on different
physical cpus. Wouldn't that also address the fairness issue with yield that you
are pointing at?
- vatsa
On Tue, Aug 03, 2010 at 10:46:59AM +0530, Srivatsa Vaddagiri wrote:
> On Mon, Aug 02, 2010 at 11:40:23AM +0300, Avi Kivity wrote:
> > >>Can you do a directed yield?
> > >We don't have that support yet in Linux scheduler.
> >
> > If you think it's useful, it would be good to design it into the
> > interface, and fall back to ordinary yield if the host doesn't
> > support it.
> >
> > A big advantage of directed yield vs yield is that you conserve
> > resources within a VM; a simple yield will cause the guest to drop
> > its share of cpu to other guest.
>
> Hmm .. I see possibility of modifying yield to reclaim its "lost" timeslice when
> its scheduled next as well. Basically remember what timeslice we have given
> up and add that as its "bonus" when it runs next. That would keep the dynamics
> of yield donation/reclaim local to the (physical) cpu and IMHO is less complex
> than dealing with directed yield between tasks located across different physical
> cpus. That would also address the fairness issue with yield you are pointing at?
Basically, with directed yield, we need to deal with these issues:

- Timeslice inflation of the target (lock-holder) vcpu affecting the fair-time
  of other guests' vcpus.
- Intra-VM fairness - different vcpus could get different fair-time, depending
  on how much of a lock-holder/spinner a vcpu is.

By simply educating yield to reclaim its lost share, I feel we can avoid these
complexities and get most of the benefit of yield-on-contention.

CCing other scheduler experts for their opinion on directed yield.
- vatsa
On 08/02/2010 06:20 PM, Jeremy Fitzhardinge wrote:
> On 08/02/2010 01:48 AM, Avi Kivity wrote:
>> On 07/26/2010 09:15 AM, Srivatsa Vaddagiri wrote:
>>> Paravirtual spinlock implementation for KVM guests, based heavily on
>>> Xen guest's
>>> spinlock implementation.
>>>
>>>
>>> +
>>> +static struct spinlock_stats
>>> +{
>>> + u64 taken;
>>> + u32 taken_slow;
>>> +
>>> + u64 released;
>>> +
>>> +#define HISTO_BUCKETS 30
>>> + u32 histo_spin_total[HISTO_BUCKETS+1];
>>> + u32 histo_spin_spinning[HISTO_BUCKETS+1];
>>> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
>>> +
>>> + u64 time_total;
>>> + u64 time_spinning;
>>> + u64 time_blocked;
>>> +} spinlock_stats;
>>
>> Could these be replaced by tracepoints when starting to spin/stopping
>> spinning etc? Then userspace can reconstruct the histogram as well
>> as see which locks are involved and what call paths.
>
> Unfortunately not; the tracing code uses spinlocks.
>
> (TBH I haven't actually tried, but I did give the code an eyeball to
> this end.)
Hm.  The tracing code already uses a specialized lock (arch_spinlock_t);
perhaps we can make this lock avoid the tracing?

It's really sad, btw: there are all those nice lockless ring buffers and
then a spinlock for ftrace_vbprintk(), instead of a per-cpu buffer.
--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
On 08/02/2010 11:59 PM, Avi Kivity wrote:
> On 08/02/2010 06:20 PM, Jeremy Fitzhardinge wrote:
>> On 08/02/2010 01:48 AM, Avi Kivity wrote:
>>> On 07/26/2010 09:15 AM, Srivatsa Vaddagiri wrote:
>>>> Paravirtual spinlock implementation for KVM guests, based heavily
>>>> on Xen guest's
>>>> spinlock implementation.
>>>>
>>>>
>>>> +
>>>> +static struct spinlock_stats
>>>> +{
>>>> + u64 taken;
>>>> + u32 taken_slow;
>>>> +
>>>> + u64 released;
>>>> +
>>>> +#define HISTO_BUCKETS 30
>>>> + u32 histo_spin_total[HISTO_BUCKETS+1];
>>>> + u32 histo_spin_spinning[HISTO_BUCKETS+1];
>>>> + u32 histo_spin_blocked[HISTO_BUCKETS+1];
>>>> +
>>>> + u64 time_total;
>>>> + u64 time_spinning;
>>>> + u64 time_blocked;
>>>> +} spinlock_stats;
>>>
>>> Could these be replaced by tracepoints when starting to
>>> spin/stopping spinning etc? Then userspace can reconstruct the
>>> histogram as well as see which locks are involved and what call paths.
>>
>> Unfortunately not; the tracing code uses spinlocks.
>>
>> (TBH I haven't actually tried, but I did give the code an eyeball to
>> this end.)
>
> Hm. The tracing code already uses a specialized lock
> (arch_spinlock_t), perhaps we can make this lock avoid the tracing?
That's not really a specialized lock; that's just the naked
architecture-provided spinlock implementation, without all the lockdep,
etc, etc stuff layered on top. All these changes are at a lower level,
so giving tracing its own type of spinlock amounts to making the
architectures provide two complete spinlock implementations. We could
make tracing use, for example, an rwlock so long as we promise not to
put tracing in the rwlock implementation - but that's hardly elegant.
> It's really sad, btw, there's all those nice lockless ring buffers and
> then a spinlock for ftrace_vbprintk(), instead of a per-cpu buffer.
Sad indeed.
J