2016-11-02 05:13:06

by Pan Xinhui

Subject: [PATCH v7 00/11] implement vcpu preempted check

change from v6:
fix typos and remove unnecessary comments.
change from v5:
split x86/kvm patch into guest/host parts.
introduce kvm_write_guest_offset_cached.
fix some typos.
rebase patch onto 4.9.2
change from v4:
split x86 kvm vcpu preempted check into two patches.
add documentation patch.
add x86 vcpu preempted check patch under xen
add s390 vcpu preempted check patch
change from v3:
add x86 vcpu preempted check patch
change from v2:
no code change, fix typos, update some comments
change from v1:
a simpler definition of the default vcpu_is_preempted
skip machine type check on ppc, and add config. remove dedicated macro.
add one patch to drop overload of rwsem_spin_on_owner and mutex_spin_on_owner.
add more comments
Thanks to Boqun and Peter for their suggestions.

This patch set aims to fix lock holder preemption issues.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
2.49% sched-messaging [kernel.vmlinux] [k] system_call

We introduce the interface bool vcpu_is_preempted(int cpu) and use it in some spin
loops of osq_lock, rwsem_spin_on_owner and mutex_spin_on_owner.
These spin_on_owner variants also caused RCU stalls before this patch set was applied.
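
For illustration, below is a minimal, self-contained sketch (user-space C, not kernel code; keep_spinning() and its arguments are invented for the example) of the spin-break pattern these patches add to the owner-spinning paths:

#include <stdbool.h>
#include <stdio.h>

/* Generic fallback: architectures without support resolve this to false. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

/*
 * Mirrors the condition added to mutex/rwsem_spin_on_owner and osq_lock:
 * stop spinning once the owner is off-cpu, a reschedule is needed, or the
 * owner's vCPU has been preempted by the hypervisor.
 */
static bool keep_spinning(bool owner_on_cpu, bool resched_needed, int owner_cpu)
{
	return owner_on_cpu && !resched_needed && !vcpu_is_preempted(owner_cpu);
}

int main(void)
{
	/* With the generic fallback, preemption never triggers an early bail. */
	printf("keep spinning: %d\n", keep_spinning(true, false, 1));
	return 0;
}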

We have also observed some performance improvements in unix benchmark tests.

PPC test result:
1 copy - 0.94%
2 copy - 7.17%
4 copy - 11.9%
8 copy - 3.04%
16 copy - 15.11%

details below:
Without patch:

1 copy - File Write 4096 bufsize 8000 maxblocks 2188223.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1804433.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1237257.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1032658.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 768000.0 KBps (30.1 s, 1 samples)

With patch:

1 copy - File Write 4096 bufsize 8000 maxblocks 2209189.0 KBps (30.0 s, 1 samples)
2 copy - File Write 4096 bufsize 8000 maxblocks 1943816.0 KBps (30.0 s, 1 samples)
4 copy - File Write 4096 bufsize 8000 maxblocks 1405591.0 KBps (30.0 s, 1 samples)
8 copy - File Write 4096 bufsize 8000 maxblocks 1065080.0 KBps (30.0 s, 1 samples)
16 copy - File Write 4096 bufsize 8000 maxblocks 904762.0 KBps (30.0 s, 1 samples)

X86 test result:
test-case | after-patch | before-patch
Execl Throughput | 18307.9 lps | 11701.6 lps
File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps
File Copy 256 bufsize 500 maxblocks | 367555.6 KBps | 222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps
Pipe Throughput | 11872208.7 lps | 11855628.9 lps
Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps
Process Creation | 29881.2 lps | 28572.8 lps
Shell Scripts (1 concurrent) | 23224.3 lpm | 22607.4 lpm
Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm
System Call Overhead | 10385653.0 lps | 10419979.0 lps


Christian Borntraeger (1):
s390/spinlock: Provide vcpu_is_preempted

Juergen Gross (1):
x86, xen: support vcpu preempted check

Pan Xinhui (9):
kernel/sched: introduce vcpu preempted check interface
locking/osq: Drop the overload of osq_lock()
kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner
powerpc/spinlock: support vcpu preempted check
x86, paravirt: Add interface to support kvm/xen vcpu preempted check
KVM: Introduce kvm_write_guest_offset_cached
x86, kvm/x86.c: support vcpu preempted check
x86, kernel/kvm.c: support vcpu preempted check
Documentation: virtual: kvm: Support vcpu preempted check

Documentation/virtual/kvm/msr.txt | 9 ++++++++-
arch/powerpc/include/asm/spinlock.h | 8 ++++++++
arch/s390/include/asm/spinlock.h | 8 ++++++++
arch/s390/kernel/smp.c | 9 +++++++--
arch/s390/lib/spinlock.c | 25 ++++++++-----------------
arch/x86/include/asm/paravirt_types.h | 2 ++
arch/x86/include/asm/spinlock.h | 8 ++++++++
arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
arch/x86/kernel/kvm.c | 12 ++++++++++++
arch/x86/kernel/paravirt-spinlocks.c | 6 ++++++
arch/x86/kvm/x86.c | 16 ++++++++++++++++
arch/x86/xen/spinlock.c | 3 ++-
include/linux/kvm_host.h | 2 ++
include/linux/sched.h | 12 ++++++++++++
kernel/locking/mutex.c | 13 +++++++++++--
kernel/locking/osq_lock.c | 8 +++++++-
kernel/locking/rwsem-xadd.c | 14 +++++++++++---
virt/kvm/kvm_main.c | 20 ++++++++++++++------
18 files changed, 145 insertions(+), 34 deletions(-)

--
2.4.11


2016-11-02 05:13:36

by Pan Xinhui

Subject: [PATCH v7 01/11] kernel/sched: introduce vcpu preempted check interface

This patch provides support for fixing the lock holder preemption issue.

Kernel users can call bool vcpu_is_preempted(int cpu) to detect whether a
vcpu is preempted or not.

The default implementation is a macro defined to false, so the compiler can
optimize it out if the architecture does not support such a vcpu preempted check.
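
(For context, an architecture opts in by defining the macro to its own name so the generic fallback is skipped; here is a user-space sketch of that pattern, with arch_preempted_hint() invented for the example:)

#include <stdbool.h>
#include <stdio.h>

/* Hypothetical arch hook; stands in for e.g. a real hypervisor hint. */
static bool arch_preempted_hint(int cpu) { (void)cpu; return false; }

/* An arch header would do this, so the generic "#ifndef" below never fires. */
#define vcpu_is_preempted vcpu_is_preempted
static inline bool vcpu_is_preempted(int cpu)
{
	return arch_preempted_hint(cpu);
}

/* Generic fallback from this patch, compiled out here. */
#ifndef vcpu_is_preempted
#define vcpu_is_preempted(cpu) false
#endif

int main(void)
{
	printf("%d\n", vcpu_is_preempted(0));
	return 0;
}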

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Tested-by: Juergen Gross <[email protected]>
---
include/linux/sched.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 348f51b..44c1ce7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3506,6 +3506,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

#endif /* CONFIG_SMP */

+/*
+ * In order to deal with a various lock holder preemption issues provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+#define vcpu_is_preempted(cpu) false
+#endif
+
extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
extern long sched_getaffinity(pid_t pid, struct cpumask *mask);

--
2.4.11

2016-11-02 05:13:48

by Pan Xinhui

Subject: [PATCH v7 03/11] kernel/locking: Drop the overload of {mutex,rwsem}_spin_on_owner

An over-committed guest with more vCPUs than pCPUs shows heavy overhead in
the two spin_on_owner functions. This is caused by the lock holder preemption
issue.

The kernel has an interface, bool vcpu_is_preempted(int cpu), to see if a vCPU
is currently running or not. So break the spin loops when it returns true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
4.12% sched-messaging [kernel.vmlinux] [k] system_call
3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

after patch:
9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
5.28% sched-messaging [unknown] [H] 0xc0000000000768e0
4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.02% sched-messaging [kernel.vmlinux] [k] system_call
2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Tested-by: Juergen Gross <[email protected]>
---
kernel/locking/mutex.c | 13 +++++++++++--
kernel/locking/rwsem-xadd.c | 14 +++++++++++---
2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index a70b90d..24face6 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -236,7 +236,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
barrier();

- if (!owner->on_cpu || need_resched()) {
+ /*
+ * Use vcpu_is_preempted to detect lock holder preemption issue.
+ */
+ if (!owner->on_cpu || need_resched() ||
+ vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -261,8 +265,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)

rcu_read_lock();
owner = READ_ONCE(lock->owner);
+
+ /*
+ * As lock holder preemption issue, we both skip spinning if task is not
+ * on cpu or its cpu is preempted
+ */
if (owner)
- retval = owner->on_cpu;
+ retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();
/*
* if lock->owner is not set, the mutex owner may have just acquired
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 2337b4b..b664ce1 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
goto done;
}

- ret = owner->on_cpu;
+ /*
+ * As lock holder preemption issue, we both skip spinning if task is not
+ * on cpu or its cpu is preempted
+ */
+ ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
*/
barrier();

- /* abort spinning when need_resched or owner is not running */
- if (!owner->on_cpu || need_resched()) {
+ /*
+ * abort spinning when need_resched or owner is not running or
+ * owner's cpu is preempted.
+ */
+ if (!owner->on_cpu || need_resched() ||
+ vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}
--
2.4.11

2016-11-02 05:13:54

by Pan Xinhui

Subject: [PATCH v7 04/11] powerpc/spinlock: support vcpu preempted check

This is to fix some lock holder preemption issues. Some lock implementations
do a spin loop before acquiring the lock itself. The kernel now has an
interface, bool vcpu_is_preempted(int cpu), which takes a cpu as parameter and
returns true if that cpu is preempted, so the kernel can break out of such
spin loops based on its return value.

As the kernel already uses this interface, let's support it here.

Only pSeries needs to support it, but PowerNV is built into the same kernel
image as pSeries, so we must return false when running on PowerNV. Since
lppaca->yield_count stays zero on PowerNV, we can simply skip the machine
type check.
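
As a small, purely illustrative check of that reasoning (user-space C; the real code reads the big-endian lppaca field, as the diff below shows), the low bit of yield_count is what signals preemption:

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

/* An odd yield_count is taken to mean the vCPU is currently preempted;
 * on PowerNV it stays zero, so this always reports "not preempted" there. */
static bool yield_count_preempted(uint32_t yield_count)
{
	return (yield_count & 1) != 0;
}

int main(void)
{
	printf("%d\n", yield_count_preempted(0)); /* PowerNV case: prints 0 */
	printf("%d\n", yield_count_preempted(5)); /* odd: preempted, prints 1 */
	return 0;
}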

Suggested-by: Boqun Feng <[email protected]>
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
---
arch/powerpc/include/asm/spinlock.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
#define SYNC_IO
#endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
return lock.slock == 0;
--
2.4.11

2016-11-02 05:13:41

by Pan Xinhui

Subject: [PATCH v7 02/11] locking/osq: Drop the overload of osq_lock()

An over-committed guest with more vCPUs than pCPUs shows heavy overhead in
osq_lock().

This is because vCPU A holds the osq lock and yields out, while vCPU B waits
for the per-cpu node->locked to be set. IOW, vCPU B waits for vCPU A to run
and unlock the osq lock.

The kernel has an interface bool vcpu_is_preempted(int cpu) to detect if a
vCPU is currently running or not. So break the spin loops when it returns
true.

test case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
2.49% sched-messaging [kernel.vmlinux] [k] system_call

after patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
4.12% sched-messaging [kernel.vmlinux] [k] system_call
3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

Suggested-by: Boqun Feng <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Tested-by: Juergen Gross <[email protected]>
---
kernel/locking/osq_lock.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 05a3785..091f97f 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+ return node->cpu - 1;
+}
+
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,9 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
* If we need to reschedule bail... so we can block.
+ * Use vcpu_is_preempted to detect lock holder preemption issue.
*/
- if (need_resched())
+ if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;

cpu_relax_lowlatency();
--
2.4.11

2016-11-02 05:13:59

by Pan Xinhui

Subject: [PATCH v7 05/11] s390/spinlock: Provide vcpu_is_preempted

From: Christian Borntraeger <[email protected]>

This implements the s390 backend for commit
"kernel/sched: introduce vcpu preempted check interface"
by reworking the existing smp_vcpu_scheduled into
arch_vcpu_is_preempted. We can then also get rid of the
local cpu_is_preempted function by moving the
CIF_ENABLED_WAIT test into arch_vcpu_is_preempted.

Signed-off-by: Christian Borntraeger <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
---
arch/s390/include/asm/spinlock.h | 8 ++++++++
arch/s390/kernel/smp.c | 9 +++++++--
arch/s390/lib/spinlock.c | 25 ++++++++-----------------
3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new)
return __sync_bool_compare_and_swap(lock, old, new);
}

+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
/*
* Simple spin lock operations. There are two variants, one clears IRQ's
* on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
return -1;
}

-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
{
- return pcpu_running(pcpu_devices + cpu);
+ if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+ return false;
+ if (pcpu_running(pcpu_devices + cpu))
+ return false;
+ return true;
}
+EXPORT_SYMBOL(arch_vcpu_is_preempted);

void smp_yield_cpu(int cpu)
{
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old)
asm(".insn rsy,0xeb0000000022,%0,0,%1" : : "d" (old), "Q" (*lock));
}

-static inline int cpu_is_preempted(int cpu)
-{
- if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
- return 0;
- if (smp_vcpu_scheduled(cpu))
- return 0;
- return 1;
-}
-
void arch_spin_lock_wait(arch_spinlock_t *lp)
{
unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
continue;
}
/* First iteration: check if the lock owner is running. */
- if (first_diag && cpu_is_preempted(~owner)) {
+ if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
* yield the CPU unconditionally. For LPAR rely on the
* sense running status.
*/
- if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+ if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
continue;
}
/* Check if the lock owner is running. */
- if (first_diag && cpu_is_preempted(~owner)) {
+ if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
* yield the CPU unconditionally. For LPAR rely on the
* sense running status.
*/
- if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+ if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -241,7 +232,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -285,7 +276,7 @@ void arch_lock_relax(unsigned int cpu)
{
if (!cpu)
return;
- if (MACHINE_IS_LPAR && !cpu_is_preempted(~cpu))
+ if (MACHINE_IS_LPAR && !arch_vcpu_is_preempted(~cpu))
return;
smp_yield_cpu(~cpu);
}
--
2.4.11

2016-11-02 05:14:04

by Pan Xinhui

Subject: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

This is to fix some lock holder preemption issues. Some lock implementations
do a spin loop before acquiring the lock itself. The kernel now has an
interface, bool vcpu_is_preempted(int cpu), which takes a cpu as parameter and
returns true if that cpu is preempted, so the kernel can break out of such
spin loops based on its return value.

As the kernel already uses this interface, let's support it here.

To handle both kvm and xen, add vcpu_is_preempted into struct
pv_lock_ops.

Then kvm or xen can provide their own implementation to support
vcpu_is_preempted.

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 2 ++
arch/x86/include/asm/spinlock.h | 8 ++++++++
arch/x86/kernel/paravirt-spinlocks.c | 6 ++++++
3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {

void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
+
+ bool (*vcpu_is_preempted)(int cpu);
};

/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
#include <asm/qspinlock.h>

/*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
}

+static bool native_vcpu_is_preempted(int cpu)
+{
+ return 0;
+}
+
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
+ .vcpu_is_preempted = native_vcpu_is_preempted,
#endif /* SMP */
};
EXPORT_SYMBOL(pv_lock_ops);
--
2.4.11

2016-11-02 05:14:10

by Pan Xinhui

Subject: [PATCH v7 07/11] KVM: Introduce kvm_write_guest_offset_cached

It allows us to partially update some status or field of a struct.

We can also save one kvm_read_guest_cached call if we just update one field
of the struct regardless of its current value.
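
A toy, user-space sketch of the partial-update idea (write_guest_offset() below is a stand-in invented for the example, not the KVM API; the struct layout follows patches 08 and 11):

#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

struct kvm_steal_time {
	uint64_t steal;
	uint32_t version;
	uint32_t flags;
	uint8_t  preempted;
	uint8_t  u8_pad[3];
	uint32_t pad[11];
};

/* Stand-in for kvm_write_guest_offset_cached(): write 'len' bytes of 'data'
 * at 'offset' into the cached guest mapping, leaving everything else alone. */
static void write_guest_offset(void *guest, const void *data, size_t offset, size_t len)
{
	memcpy((char *)guest + offset, data, len);
}

int main(void)
{
	struct kvm_steal_time guest = { .steal = 12345 };
	uint8_t preempted = 1;

	write_guest_offset(&guest, &preempted,
			   offsetof(struct kvm_steal_time, preempted),
			   sizeof(preempted));
	/* Only the preempted byte changed; no read-back of the rest was needed. */
	printf("steal=%llu preempted=%u\n",
	       (unsigned long long)guest.steal, guest.preempted);
	return 0;
}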

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 20 ++++++++++++++------
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
unsigned long len);
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, int offset, unsigned long len);
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
gpa_t gpa, unsigned long len);
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 2907b7b..95308ee 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);

-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
- void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, int offset, unsigned long len)
{
struct kvm_memslots *slots = kvm_memslots(kvm);
int r;
+ gpa_t gpa = ghc->gpa + offset;

- BUG_ON(len > ghc->len);
+ BUG_ON(len + offset > ghc->len);

if (slots->generation != ghc->generation)
kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);

if (unlikely(!ghc->memslot))
- return kvm_write_guest(kvm, ghc->gpa, data, len);
+ return kvm_write_guest(kvm, gpa, data, len);

if (kvm_is_error_hva(ghc->hva))
return -EFAULT;

- r = __copy_to_user((void __user *)ghc->hva, data, len);
+ r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
if (r)
return -EFAULT;
- mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+ mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);

return 0;
}
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, unsigned long len)
+{
+ return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
EXPORT_SYMBOL_GPL(kvm_write_guest_cached);

int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
--
2.4.11

2016-11-02 05:14:26

by Pan Xinhui

Subject: [PATCH v7 09/11] x86, kernel/kvm.c: support vcpu preempted check

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

struct kvm_steal_time::preempted indicates whether a vcpu is running or
not, as of commit ("x86, kvm/x86.c: support vcpu preempted check").

unix benchmark result:
host: kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case | after-patch | before-patch
Execl Throughput | 18307.9 lps | 11701.6 lps
File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps
File Copy 256 bufsize 500 maxblocks | 367555.6 KBps | 222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps
Pipe Throughput | 11872208.7 lps | 11855628.9 lps
Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps
Process Creation | 29881.2 lps | 28572.8 lps
Shell Scripts (1 concurrent) | 23224.3 lpm | 22607.4 lpm
Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm
System Call Overhead | 10385653.0 lps | 10419979.0 lps

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/kernel/kvm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

+static bool kvm_vcpu_is_preempted(int cpu)
+{
+ struct kvm_steal_time *src;
+
+ src = &per_cpu(steal_time, cpu);
+
+ return !!src->preempted;
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
--
2.4.11

2016-11-02 05:14:34

by Pan Xinhui

Subject: [PATCH v7 11/11] Documentation: virtual: kvm: Support vcpu preempted check

Commit ("x86, kvm: support vcpu preempted check") add one field "__u8
preempted" into struct kvm_steal_time. This field tells if one vcpu is
running or not.

It is zero if 1) some old KVM deos not support this filed. 2) the vcpu
is not preempted. Other values means the vcpu has been preempted.

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Radim Krčmář <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..ab2ab76 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
}

whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.

+ preempted: indicate the VCPU who owns this struct is running or
+ not. Non-zero values mean the VCPU has been preempted. Zero
+ means the VCPU is not preempted. NOTE, it is always zero if the
+ the hypervisor doesn't support this field.
+
MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled. Bit 1 is reserved and must be zero. When PV end of
--
2.4.11

2016-11-02 05:14:17

by Pan Xinhui

Subject: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

Use one field of struct kvm_steal_time, ::preempted, to indicate whether
a vcpu is running or not.

Signed-off-by: Pan Xinhui <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
arch/x86/kvm/x86.c | 16 ++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
};

#define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index e375235..f06e115 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2057,6 +2057,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;

+ vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1; /* first time write, random junk */

@@ -2810,8 +2812,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
}

+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+ if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+ return;
+
+ vcpu->arch.st.steal.preempted = 1;
+
+ kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+ &vcpu->arch.st.steal.preempted,
+ offsetof(struct kvm_steal_time, preempted),
+ sizeof(vcpu->arch.st.steal.preempted));
+}
+
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();
--
2.4.11

2016-11-02 05:14:56

by Pan Xinhui

Subject: [PATCH v7 10/11] x86, xen: support vcpu preempted check

From: Juergen Gross <[email protected]>

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vcpus
than physical cpus in the system) as doing busy waits for preempted
vcpus will hurt system performance far worse than early yielding.

A quick test (4 vcpus on 1 physical cpu doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
---
arch/x86/xen/spinlock.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
}

-
/*
* Our init of PV spinlocks is split in two init functions due to us
* using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
+
+ pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
}

/*
--
2.4.11

2016-11-15 15:48:06

by Peter Zijlstra

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

On Wed, Nov 02, 2016 at 05:08:33AM -0400, Pan Xinhui wrote:
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 0f400c0..38c3bb7 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -310,6 +310,8 @@ struct pv_lock_ops {
>
> void (*wait)(u8 *ptr, u8 val);
> void (*kick)(int cpu);
> +
> + bool (*vcpu_is_preempted)(int cpu);
> };

So that ends up with a full function call in the native case. I did
something like the below on top, completely untested, not been near a
compiler etc..

It doesn't get rid of the branch, but at least it avoids the function
call, and hardware should have no trouble predicting a constant
condition.

Also, it looks like you end up not setting vcpu_is_preempted when KVM
doesn't support steal clock, which would end up in an instant NULL
deref. Fixed that too.

---
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -673,6 +673,11 @@ static __always_inline void pv_kick(int
PVOP_VCALL1(pv_lock_ops.kick, cpu);
}

+static __always_inline void pv_vcpu_is_prempted(int cpu)
+{
+ PVOP_VCALLEE1(pv_lock_ops.vcpu_is_preempted, cpu);
+}
+
#endif /* SMP && PARAVIRT_SPINLOCKS */

#ifdef CONFIG_X86_32
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -309,7 +309,7 @@ struct pv_lock_ops {
void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);

- bool (*vcpu_is_preempted)(int cpu);
+ struct paravirt_callee_save vcpu_is_preempted;
};

/* This contains all the paravirt structures: we get a convenient
--- a/arch/x86/include/asm/qspinlock.h
+++ b/arch/x86/include/asm/qspinlock.h
@@ -32,6 +32,12 @@ static inline void queued_spin_unlock(st
{
pv_queued_spin_unlock(lock);
}
+
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return pv_vcpu_is_preempted(cpu);
+}
#else
static inline void queued_spin_unlock(struct qspinlock *lock)
{
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,14 +26,6 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

-#ifdef CONFIG_PARAVIRT_SPINLOCKS
-#define vcpu_is_preempted vcpu_is_preempted
-static inline bool vcpu_is_preempted(int cpu)
-{
- return pv_lock_ops.vcpu_is_preempted(cpu);
-}
-#endif
-
#include <asm/qspinlock.h>

/*
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,15 +415,6 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

-static bool kvm_vcpu_is_preempted(int cpu)
-{
- struct kvm_steal_time *src;
-
- src = &per_cpu(steal_time, cpu);
-
- return !!src->preempted;
-}
-
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -480,9 +471,6 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
-#ifdef CONFIG_PARAVIRT_SPINLOCKS
- pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
-#endif
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
@@ -604,6 +592,14 @@ static void kvm_wait(u8 *ptr, u8 val)
local_irq_restore(flags);
}

+static bool __kvm_vcpu_is_preempted(int cpu)
+{
+ struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
+
+ return !!src->preempted;
+}
+PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
+
/*
* Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
*/
@@ -620,6 +616,12 @@ void __init kvm_spinlock_init(void)
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = kvm_wait;
pv_lock_ops.kick = kvm_kick_cpu;
+ pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted);
+
+ if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
+ pv_lock_ops.vcpu_is_preempted =
+ PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
+ }
}

static __init int kvm_spinlock_init_jump(void)
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -12,7 +12,6 @@ __visible void __native_queued_spin_unlo
{
native_queued_spin_unlock(lock);
}
-
PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock);

bool pv_is_native_spin_unlock(void)
@@ -21,9 +20,16 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
}

-static bool native_vcpu_is_preempted(int cpu)
+__visible bool __native_vcpu_is_preempted(int cpu)
{
- return 0;
+ return false;
+}
+PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
+
+bool pv_is_native_vcpu_is_preempted(void)
+{
+ return pv_lock_ops.queued_spin_unlock.func ==
+ __raw_callee_save__native_vcpu_is_preempted;
}

struct pv_lock_ops pv_lock_ops = {
@@ -32,7 +38,7 @@ struct pv_lock_ops pv_lock_ops = {
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
- .vcpu_is_preempted = native_vcpu_is_preempted,
+ .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted),
#endif /* SMP */
};
EXPORT_SYMBOL(pv_lock_ops);
--- a/arch/x86/kernel/paravirt_patch_32.c
+++ b/arch/x86/kernel/paravirt_patch_32.c
@@ -11,6 +11,7 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %c

#if defined(CONFIG_PARAVIRT_SPINLOCKS)
DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%eax)");
+DEF_NATIVE(pv_lock_ops, vcpu_is_preempted, "movl $0, %eax");
#endif

unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
@@ -26,6 +27,7 @@ unsigned paravirt_patch_ident_64(void *i
}

extern bool pv_is_native_spin_unlock(void);
+extern bool pv_is_native_vcpu_is_preempted(void);

unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
@@ -54,6 +56,12 @@ unsigned native_patch(u8 type, u16 clobb
end = end_pv_lock_ops_queued_spin_unlock;
goto patch_site;
}
+ case PARAVIRT_PATCH(pv_lock_ops.vcpu_is_preempted):
+ if (pv_is_native_vcpu_is_preempted()) {
+ start = start_pv_lock_ops_vcpu_is_preempted;
+ end = end_pv_lock_ops_vcpu_is_preempted;
+ goto patch_site;
+ }
#endif

default:
--- a/arch/x86/kernel/paravirt_patch_64.c
+++ b/arch/x86/kernel/paravirt_patch_64.c
@@ -20,6 +20,7 @@ DEF_NATIVE(, mov64, "mov %rdi, %rax");

#if defined(CONFIG_PARAVIRT_SPINLOCKS)
DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%rdi)");
+DEF_NATIVE(pv_lock_ops, vcpu_is_preempted, "movl $0, rax");
#endif

unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
@@ -35,6 +36,7 @@ unsigned paravirt_patch_ident_64(void *i
}

extern bool pv_is_native_spin_unlock(void);
+extern bool pv_is_native_vcpu_is_preempted(void);

unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
unsigned long addr, unsigned len)
@@ -66,6 +68,12 @@ unsigned native_patch(u8 type, u16 clobb
end = end_pv_lock_ops_queued_spin_unlock;
goto patch_site;
}
+ case PARAVIRT_PATCH(pv_lock_ops.vcpu_is_preempted):
+ if (pv_is_native_vcpu_is_preempted()) {
+ start = start_pv_lock_ops_vcpu_is_preempted;
+ end = end_pv_lock_ops_vcpu_is_preempted;
+ goto patch_site;
+ }
#endif

default:
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,6 +114,8 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
}

+PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen);
+
/*
* Our init of PV spinlocks is split in two init functions due to us
* using paravirt patching and jump labels patching and having to do
@@ -136,8 +138,7 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
-
- pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
+ pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
}

/*

2016-11-16 04:19:10

by Pan Xinhui

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check



On 2016/11/15 23:47, Peter Zijlstra wrote:
> On Wed, Nov 02, 2016 at 05:08:33AM -0400, Pan Xinhui wrote:
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 0f400c0..38c3bb7 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -310,6 +310,8 @@ struct pv_lock_ops {
>>
>> void (*wait)(u8 *ptr, u8 val);
>> void (*kick)(int cpu);
>> +
>> + bool (*vcpu_is_preempted)(int cpu);
>> };
>
> So that ends up with a full function call in the native case. I did
> something like the below on top, completely untested, not been near a
> compiler etc..
>
Hi, Peter.
I think we can avoid a function call in a simpler way. How about below

static inline bool vcpu_is_preempted(int cpu)
{
/* only set in pv case*/
if (pv_lock_ops.vcpu_is_preempted)
return pv_lock_ops.vcpu_is_preempted(cpu);
return false;
}


> It doesn't get rid of the branch, but at least it avoids the function
> call, and hardware should have no trouble predicting a constant
> condition.
>
> Also, it looks like you end up not setting vcpu_is_preempted when KVM
> doesn't support steal clock, which would end up in an instant NULL
> deref. Fixed that too.
>
Maybe not true. There is .vcpu_is_preempted = native_vcpu_is_preempted when we define pv_lock_ops.

Your patch is a good example for anyone who wants to add a native/pv function. :)

thanks
xinhui

> ---
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -673,6 +673,11 @@ static __always_inline void pv_kick(int
> PVOP_VCALL1(pv_lock_ops.kick, cpu);
> }
>
> +static __always_inline void pv_vcpu_is_prempted(int cpu)
> +{
> + PVOP_VCALLEE1(pv_lock_ops.vcpu_is_preempted, cpu);
> +}
> +
> #endif /* SMP && PARAVIRT_SPINLOCKS */
>
> #ifdef CONFIG_X86_32
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -309,7 +309,7 @@ struct pv_lock_ops {
> void (*wait)(u8 *ptr, u8 val);
> void (*kick)(int cpu);
>
> - bool (*vcpu_is_preempted)(int cpu);
> + struct paravirt_callee_save vcpu_is_preempted;
> };
>
> /* This contains all the paravirt structures: we get a convenient
> --- a/arch/x86/include/asm/qspinlock.h
> +++ b/arch/x86/include/asm/qspinlock.h
> @@ -32,6 +32,12 @@ static inline void queued_spin_unlock(st
> {
> pv_queued_spin_unlock(lock);
> }
> +
> +#define vcpu_is_preempted vcpu_is_preempted
> +static inline bool vcpu_is_preempted(int cpu)
> +{
> + return pv_vcpu_is_preempted(cpu);
> +}
> #else
> static inline void queued_spin_unlock(struct qspinlock *lock)
> {
> --- a/arch/x86/include/asm/spinlock.h
> +++ b/arch/x86/include/asm/spinlock.h
> @@ -26,14 +26,6 @@
> extern struct static_key paravirt_ticketlocks_enabled;
> static __always_inline bool static_key_false(struct static_key *key);
>
> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
> -#define vcpu_is_preempted vcpu_is_preempted
> -static inline bool vcpu_is_preempted(int cpu)
> -{
> - return pv_lock_ops.vcpu_is_preempted(cpu);
> -}
> -#endif
> -
> #include <asm/qspinlock.h>
>
> /*
> --- a/arch/x86/kernel/kvm.c
> +++ b/arch/x86/kernel/kvm.c
> @@ -415,15 +415,6 @@ void kvm_disable_steal_time(void)
> wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
> }
>
> -static bool kvm_vcpu_is_preempted(int cpu)
> -{
> - struct kvm_steal_time *src;
> -
> - src = &per_cpu(steal_time, cpu);
> -
> - return !!src->preempted;
> -}
> -
> #ifdef CONFIG_SMP
> static void __init kvm_smp_prepare_boot_cpu(void)
> {
> @@ -480,9 +471,6 @@ void __init kvm_guest_init(void)
> if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> has_steal_clock = 1;
> pv_time_ops.steal_clock = kvm_steal_clock;
> -#ifdef CONFIG_PARAVIRT_SPINLOCKS
> - pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
> -#endif
> }
>
> if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
> @@ -604,6 +592,14 @@ static void kvm_wait(u8 *ptr, u8 val)
> local_irq_restore(flags);
> }
>
> +static bool __kvm_vcpu_is_preempted(int cpu)
> +{
> + struct kvm_steal_time *src = &per_cpu(steal_time, cpu);
> +
> + return !!src->preempted;
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(__kvm_vcpu_is_preempted);
> +
> /*
> * Setup pv_lock_ops to exploit KVM_FEATURE_PV_UNHALT if present.
> */
> @@ -620,6 +616,12 @@ void __init kvm_spinlock_init(void)
> pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
> pv_lock_ops.wait = kvm_wait;
> pv_lock_ops.kick = kvm_kick_cpu;
> + pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted);
> +
> + if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
> + pv_lock_ops.vcpu_is_preempted =
> + PV_CALLEE_SAVE(__kvm_vcpu_is_preempted);
> + }
> }
>
> static __init int kvm_spinlock_init_jump(void)
> --- a/arch/x86/kernel/paravirt-spinlocks.c
> +++ b/arch/x86/kernel/paravirt-spinlocks.c
> @@ -12,7 +12,6 @@ __visible void __native_queued_spin_unlo
> {
> native_queued_spin_unlock(lock);
> }
> -
> PV_CALLEE_SAVE_REGS_THUNK(__native_queued_spin_unlock);
>
> bool pv_is_native_spin_unlock(void)
> @@ -21,9 +20,16 @@ bool pv_is_native_spin_unlock(void)
> __raw_callee_save___native_queued_spin_unlock;
> }
>
> -static bool native_vcpu_is_preempted(int cpu)
> +__visible bool __native_vcpu_is_preempted(int cpu)
> {
> - return 0;
> + return false;
> +}
> +PV_CALLEE_SAVE_REGS_THUNK(__native_vcpu_is_preempted);
> +
> +bool pv_is_native_vcpu_is_preempted(void)
> +{
> + return pv_lock_ops.queued_spin_unlock.func ==
> + __raw_callee_save__native_vcpu_is_preempted;
> }
>
copy-paste issue...

> struct pv_lock_ops pv_lock_ops = {
> @@ -32,7 +38,7 @@ struct pv_lock_ops pv_lock_ops = {
> .queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
> .wait = paravirt_nop,
> .kick = paravirt_nop,
> - .vcpu_is_preempted = native_vcpu_is_preempted,
> + .vcpu_is_preempted = PV_CALLEE_SAVE(__native_vcpu_is_preempted),
> #endif /* SMP */
> };
> EXPORT_SYMBOL(pv_lock_ops);
> --- a/arch/x86/kernel/paravirt_patch_32.c
> +++ b/arch/x86/kernel/paravirt_patch_32.c
> @@ -11,6 +11,7 @@ DEF_NATIVE(pv_mmu_ops, read_cr3, "mov %c
>
> #if defined(CONFIG_PARAVIRT_SPINLOCKS)
> DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%eax)");
> +DEF_NATIVE(pv_lock_ops, vcpu_is_preempted, "movl $0, %eax");
> #endif
>
> unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
> @@ -26,6 +27,7 @@ unsigned paravirt_patch_ident_64(void *i
> }
>
> extern bool pv_is_native_spin_unlock(void);
> +extern bool pv_is_native_vcpu_is_preempted(void);
>
> unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
> unsigned long addr, unsigned len)
> @@ -54,6 +56,12 @@ unsigned native_patch(u8 type, u16 clobb
> end = end_pv_lock_ops_queued_spin_unlock;
> goto patch_site;
> }
> + case PARAVIRT_PATCH(pv_lock_ops.vcpu_is_preempted):
> + if (pv_is_native_vcpu_is_preempted()) {
> + start = start_pv_lock_ops_vcpu_is_preempted;
> + end = end_pv_lock_ops_vcpu_is_preempted;
> + goto patch_site;
> + }
> #endif
>
> default:
> --- a/arch/x86/kernel/paravirt_patch_64.c
> +++ b/arch/x86/kernel/paravirt_patch_64.c
> @@ -20,6 +20,7 @@ DEF_NATIVE(, mov64, "mov %rdi, %rax");
>
> #if defined(CONFIG_PARAVIRT_SPINLOCKS)
> DEF_NATIVE(pv_lock_ops, queued_spin_unlock, "movb $0, (%rdi)");
> +DEF_NATIVE(pv_lock_ops, vcpu_is_preempted, "movl $0, rax");
> #endif
>
> unsigned paravirt_patch_ident_32(void *insnbuf, unsigned len)
> @@ -35,6 +36,7 @@ unsigned paravirt_patch_ident_64(void *i
> }
>
> extern bool pv_is_native_spin_unlock(void);
> +extern bool pv_is_native_vcpu_is_preempted(void);
>
> unsigned native_patch(u8 type, u16 clobbers, void *ibuf,
> unsigned long addr, unsigned len)
> @@ -66,6 +68,12 @@ unsigned native_patch(u8 type, u16 clobb
> end = end_pv_lock_ops_queued_spin_unlock;
> goto patch_site;
> }
> + case PARAVIRT_PATCH(pv_lock_ops.vcpu_is_preempted):
> + if (pv_is_native_vcpu_is_preempted()) {
> + start = start_pv_lock_ops_vcpu_is_preempted;
> + end = end_pv_lock_ops_vcpu_is_preempted;
> + goto patch_site;
> + }
> #endif
>
> default:
> --- a/arch/x86/xen/spinlock.c
> +++ b/arch/x86/xen/spinlock.c
> @@ -114,6 +114,8 @@ void xen_uninit_lock_cpu(int cpu)
> per_cpu(irq_name, cpu) = NULL;
> }
>
> +PV_CALLEE_SAVE_REGS_THUNK(xen_vcpu_stolen);
> +
> /*
> * Our init of PV spinlocks is split in two init functions due to us
> * using paravirt patching and jump labels patching and having to do
> @@ -136,8 +138,7 @@ void __init xen_init_spinlocks(void)
> pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
> pv_lock_ops.wait = xen_qlock_wait;
> pv_lock_ops.kick = xen_qlock_kick;
> -
> - pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
> + pv_lock_ops.vcpu_is_preempted = PV_CALLEE_SAVE(xen_vcpu_stolen);
> }
>
> /*
>

2016-11-16 10:24:32

by Peter Zijlstra

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

On Wed, Nov 16, 2016 at 12:19:09PM +0800, Pan Xinhui wrote:
> Hi, Peter.
> I think we can avoid a function call in a simpler way. How about below
>
> static inline bool vcpu_is_preempted(int cpu)
> {
> /* only set in pv case*/
> if (pv_lock_ops.vcpu_is_preempted)
> return pv_lock_ops.vcpu_is_preempted(cpu);
> return false;
> }

That is still more expensive. It needs to do an actual load and makes it
hard to predict the branch, you'd have to actually wait for the load to
complete etc.

Also, it generates more code.

Paravirt muck should strive to be as cheap as possible when ran on
native hardware.

2016-11-16 11:29:58

by Christian Borntraeger

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

On 11/16/2016 11:23 AM, Peter Zijlstra wrote:
> On Wed, Nov 16, 2016 at 12:19:09PM +0800, Pan Xinhui wrote:
>> Hi, Peter.
>> I think we can avoid a function call in a simpler way. How about below
>>
>> static inline bool vcpu_is_preempted(int cpu)
>> {
>> /* only set in pv case*/
>> if (pv_lock_ops.vcpu_is_preempted)
>> return pv_lock_ops.vcpu_is_preempted(cpu);
>> return false;
>> }
>
> That is still more expensive. It needs to do an actual load and makes it
> hard to predict the branch, you'd have to actually wait for the load to
> complete etc.

Out of curiosity, why is that hard to predict?
On s390 the branch prediction runs asynchronously ahead of the downstream
pipeline (e.g. search for "IBM z Systems Processor Optimization Primer" page 11).
given enough capacity, I would assume that modern x86 processors would do the same
and be able to predict this as soon as it becomes hot (and otherwise you would
not notice the branch miss anyway). Is x86 behaving differently here?

> Also, it generates more code.
>
> Paravirt muck should strive to be as cheap as possible when ran on
> native hardware.

As I am interested in this series from the s390 point of view, this is
the only thing that blocks this series?

Is there a chance to add a static key around the paravirt ops somehow?
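
(Editorial aside: a hypothetical sketch of what such a static-key guard could look like, not taken from any posted patch; DEFINE_STATIC_KEY_FALSE/static_branch_unlikely are the existing jump-label helpers, and some kvm/xen init code would have to enable the key.)

/* Hypothetical sketch only: native kernels would see a patched-out branch
 * plus "return false", with no load of pv_lock_ops and no indirect call. */
DEFINE_STATIC_KEY_FALSE(vcpu_preempt_key);

static inline bool vcpu_is_preempted(int cpu)
{
	if (static_branch_unlikely(&vcpu_preempt_key))
		return pv_lock_ops.vcpu_is_preempted(cpu);
	return false;
}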

2016-11-16 11:43:48

by Peter Zijlstra

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check

On Wed, Nov 16, 2016 at 12:29:44PM +0100, Christian Borntraeger wrote:
> On 11/16/2016 11:23 AM, Peter Zijlstra wrote:
> > On Wed, Nov 16, 2016 at 12:19:09PM +0800, Pan Xinhui wrote:
> >> Hi, Peter.
> >> I think we can avoid a function call in a simpler way. How about below
> >>
> >> static inline bool vcpu_is_preempted(int cpu)
> >> {
> >> /* only set in pv case*/
> >> if (pv_lock_ops.vcpu_is_preempted)
> >> return pv_lock_ops.vcpu_is_preempted(cpu);
> >> return false;
> >> }
> >
> > That is still more expensive. It needs to do an actual load and makes it
> > hard to predict the branch, you'd have to actually wait for the load to
> > complete etc.
>
> Out of curiosity, why is that hard to predict?
> On s390 the branch prediction runs asynchronously ahead of the downstream
> pipeline (e.g. search for "IBM z Systems Processor Optimization Primer" page 11).
> given enough capacity, I would assume that modern x86 processors would do the same
> and be able to predict this is as soon as it becomes hot (and otherwise you would
> not notice the branch miss anyway). Is x86 behaving differently here?

Not sure how exactly it works, but it seems to me that an immediate
assignment to the value you're going to compare would leave very little
doubt.

Then again, maybe cores aren't that smart and only look at the
hysterical btb for prediction.

> > Also, it generates more code.
> >
> > Paravirt muck should strive to be as cheap as possible when ran on
> > native hardware.
>
> As I am interested in this series from the s390 point of view, this is
> the only thing that block this series?

Ingo was rewriting the changelog, other than that, no, I can do this on
top. Just spotted this because Ingo and me talked it over.

> Is there a chance to add a static key around the paravirt ops somehow?

More code generation still, replacing the call with an immediate
assignment to the return register is the shortest possible option I
think.

2016-11-17 05:16:50

by Pan Xinhui

Subject: Re: [PATCH v7 06/11] x86, paravirt: Add interface to support kvm/xen vcpu preempted check



On 2016/11/16 18:23, Peter Zijlstra wrote:
> On Wed, Nov 16, 2016 at 12:19:09PM +0800, Pan Xinhui wrote:
>> Hi, Peter.
>> I think we can avoid a function call in a simpler way. How about below
>>
>> static inline bool vcpu_is_preempted(int cpu)
>> {
>> /* only set in pv case*/
>> if (pv_lock_ops.vcpu_is_preempted)
>> return pv_lock_ops.vcpu_is_preempted(cpu);
>> return false;
>> }
>
> That is still more expensive. It needs to do an actual load and makes it
> hard to predict the branch, you'd have to actually wait for the load to
> complete etc.
>
Yes, one more load in the native case. I think this is acceptable as vcpu_is_preempted is not a critical function.

However, if we use pv_callee_save_regs_thunk, more unnecessary registers might be saved/restored in the pv case;
that will introduce a little overhead.

But I think I am okay with your idea. I can make another patch based on this patchset with your Suggested-by.

thanks
xinhui

> Also, it generates more code.
>
> Paravirt muck should strive to be as cheap as possible when ran on
> native hardware.
>

Subject: [tip:locking/core] sched/core: Introduce the vcpu_is_preempted(cpu) interface

Commit-ID: d9345c65eb7930ac6755cf593ee7686f4029ccf4
Gitweb: http://git.kernel.org/tip/d9345c65eb7930ac6755cf593ee7686f4029ccf4
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:28 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:05 +0100

sched/core: Introduce the vcpu_is_preempted(cpu) interface

This patch is the first step to add support to improve lock holder
preemption behaviour.

vcpu_is_preempted(cpu) does the obvious thing: it tells us whether a
vCPU is preempted or not.

Defaults to false on architectures that don't support it.

Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Tested-by: Juergen Gross <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[ Translated the changelog to English. ]
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/sched.h | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dc37cbe..37261af 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3510,6 +3510,18 @@ static inline void set_task_cpu(struct task_struct *p, unsigned int cpu)

#endif /* CONFIG_SMP */

+/*
+ * In order to reduce various lock holder preemption latencies provide an
+ * interface to see if a vCPU is currently running or not.
+ *
+ * This allows us to terminate optimistic spin loops and block, analogous to
+ * the native optimistic spin heuristic of testing if the lock owner task is
+ * running or not.
+ */
+#ifndef vcpu_is_preempted
+# define vcpu_is_preempted(cpu) false
+#endif
+
extern long sched_setaffinity(pid_t pid, const struct cpumask *new_mask);
extern long sched_getaffinity(pid_t pid, struct cpumask *mask);


Subject: [tip:locking/core] locking/core, powerpc: Implement vcpu_is_preempted(cpu)

Commit-ID: 41946c86876ea6a3e8857182356e6d76dbfe7fb6
Gitweb: http://git.kernel.org/tip/41946c86876ea6a3e8857182356e6d76dbfe7fb6
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:31 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:06 +0100

locking/core, powerpc: Implement vcpu_is_preempted(cpu)

Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu)
function on pSeries. We do not support PowerNV.

All this can be achieved by using lppaca->yield_count, which is zero on PowerNV.

Suggested-by: Boqun Feng <[email protected]>
Suggested-by: Peter Zijlstra (Intel) <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/powerpc/include/asm/spinlock.h | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/arch/powerpc/include/asm/spinlock.h b/arch/powerpc/include/asm/spinlock.h
index fa37fe9..8c1b913 100644
--- a/arch/powerpc/include/asm/spinlock.h
+++ b/arch/powerpc/include/asm/spinlock.h
@@ -52,6 +52,14 @@
#define SYNC_IO
#endif

+#ifdef CONFIG_PPC_PSERIES
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return !!(be32_to_cpu(lppaca_of(cpu).yield_count) & 1);
+}
+#endif
+
static __always_inline int arch_spin_value_unlocked(arch_spinlock_t lock)
{
return lock.slock == 0;

Subject: [tip:locking/core] locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)

Commit-ID: 760928c0dafc7d0faf0c0248e28e16d4c8dc7ad6
Gitweb: http://git.kernel.org/tip/760928c0dafc7d0faf0c0248e28e16d4c8dc7ad6
Author: Christian Borntraeger <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:32 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:06 +0100

locking/spinlocks, s390: Implement vcpu_is_preempted(cpu)

This implements the s390 version for vcpu_is_preempted(cpu),
by reworking the existing smp_vcpu_scheduled() function into
arch_vcpu_is_preempted().

We can then also get rid of the local cpu_is_preempted()
function by moving the CIF_ENABLED_WAIT test into
arch_vcpu_is_preempted().

Signed-off-by: Christian Borntraeger <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Heiko Carstens <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/s390/include/asm/spinlock.h | 8 ++++++++
arch/s390/kernel/smp.c | 9 +++++++--
arch/s390/lib/spinlock.c | 25 ++++++++-----------------
3 files changed, 23 insertions(+), 19 deletions(-)

diff --git a/arch/s390/include/asm/spinlock.h b/arch/s390/include/asm/spinlock.h
index 7e9e09f..7ecd890 100644
--- a/arch/s390/include/asm/spinlock.h
+++ b/arch/s390/include/asm/spinlock.h
@@ -23,6 +23,14 @@ _raw_compare_and_swap(unsigned int *lock, unsigned int old, unsigned int new)
return __sync_bool_compare_and_swap(lock, old, new);
}

+#ifndef CONFIG_SMP
+static inline bool arch_vcpu_is_preempted(int cpu) { return false; }
+#else
+bool arch_vcpu_is_preempted(int cpu);
+#endif
+
+#define vcpu_is_preempted arch_vcpu_is_preempted
+
/*
* Simple spin lock operations. There are two variants, one clears IRQ's
* on the local processor, one does not.
diff --git a/arch/s390/kernel/smp.c b/arch/s390/kernel/smp.c
index 35531fe..b988ed1 100644
--- a/arch/s390/kernel/smp.c
+++ b/arch/s390/kernel/smp.c
@@ -368,10 +368,15 @@ int smp_find_processor_id(u16 address)
return -1;
}

-int smp_vcpu_scheduled(int cpu)
+bool arch_vcpu_is_preempted(int cpu)
{
- return pcpu_running(pcpu_devices + cpu);
+ if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
+ return false;
+ if (pcpu_running(pcpu_devices + cpu))
+ return false;
+ return true;
}
+EXPORT_SYMBOL(arch_vcpu_is_preempted);

void smp_yield_cpu(int cpu)
{
diff --git a/arch/s390/lib/spinlock.c b/arch/s390/lib/spinlock.c
index e5f50a7..e48a48e 100644
--- a/arch/s390/lib/spinlock.c
+++ b/arch/s390/lib/spinlock.c
@@ -37,15 +37,6 @@ static inline void _raw_compare_and_delay(unsigned int *lock, unsigned int old)
asm(".insn rsy,0xeb0000000022,%0,0,%1" : : "d" (old), "Q" (*lock));
}

-static inline int cpu_is_preempted(int cpu)
-{
- if (test_cpu_flag_of(CIF_ENABLED_WAIT, cpu))
- return 0;
- if (smp_vcpu_scheduled(cpu))
- return 0;
- return 1;
-}
-
void arch_spin_lock_wait(arch_spinlock_t *lp)
{
unsigned int cpu = SPINLOCK_LOCKVAL;
@@ -62,7 +53,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
continue;
}
/* First iteration: check if the lock owner is running. */
- if (first_diag && cpu_is_preempted(~owner)) {
+ if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -81,7 +72,7 @@ void arch_spin_lock_wait(arch_spinlock_t *lp)
* yield the CPU unconditionally. For LPAR rely on the
* sense running status.
*/
- if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+ if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -108,7 +99,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
continue;
}
/* Check if the lock owner is running. */
- if (first_diag && cpu_is_preempted(~owner)) {
+ if (first_diag && arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
continue;
@@ -127,7 +118,7 @@ void arch_spin_lock_wait_flags(arch_spinlock_t *lp, unsigned long flags)
* yield the CPU unconditionally. For LPAR rely on the
* sense running status.
*/
- if (!MACHINE_IS_LPAR || cpu_is_preempted(~owner)) {
+ if (!MACHINE_IS_LPAR || arch_vcpu_is_preempted(~owner)) {
smp_yield_cpu(~owner);
first_diag = 0;
}
@@ -165,7 +156,7 @@ void _raw_read_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -211,7 +202,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw, unsigned int prev)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -241,7 +232,7 @@ void _raw_write_lock_wait(arch_rwlock_t *rw)
owner = 0;
while (1) {
if (count-- <= 0) {
- if (owner && cpu_is_preempted(~owner))
+ if (owner && arch_vcpu_is_preempted(~owner))
smp_yield_cpu(~owner);
count = spin_retry;
}
@@ -285,7 +276,7 @@ void arch_lock_relax(unsigned int cpu)
{
if (!cpu)
return;
- if (MACHINE_IS_LPAR && !cpu_is_preempted(~cpu))
+ if (MACHINE_IS_LPAR && !arch_vcpu_is_preempted(~cpu))
return;
smp_yield_cpu(~cpu);
}

Subject: [tip:locking/core] locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests

Commit-ID: 446f3dc8cc0af59259c6c8b898726fae7ed2c055
Gitweb: http://git.kernel.org/tip/446f3dc8cc0af59259c6c8b898726fae7ed2c055
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:33 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:07 +0100

locking/core, x86/paravirt: Implement vcpu_is_preempted(cpu) for KVM and Xen guests

Optimize spinlock and mutex busy-loops by providing a vcpu_is_preempted(cpu)
function on KVM and Xen platforms.

Extend the pv_lock_ops interface accordingly and implement the callbacks
on KVM and Xen.

Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
[ Translated to English. ]
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/asm/paravirt_types.h | 2 ++
arch/x86/include/asm/spinlock.h | 8 ++++++++
arch/x86/kernel/paravirt-spinlocks.c | 6 ++++++
3 files changed, 16 insertions(+)

diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 0f400c0..38c3bb7 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -310,6 +310,8 @@ struct pv_lock_ops {

void (*wait)(u8 *ptr, u8 val);
void (*kick)(int cpu);
+
+ bool (*vcpu_is_preempted)(int cpu);
};

/* This contains all the paravirt structures: we get a convenient
diff --git a/arch/x86/include/asm/spinlock.h b/arch/x86/include/asm/spinlock.h
index 921bea7..0526f59 100644
--- a/arch/x86/include/asm/spinlock.h
+++ b/arch/x86/include/asm/spinlock.h
@@ -26,6 +26,14 @@
extern struct static_key paravirt_ticketlocks_enabled;
static __always_inline bool static_key_false(struct static_key *key);

+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+#define vcpu_is_preempted vcpu_is_preempted
+static inline bool vcpu_is_preempted(int cpu)
+{
+ return pv_lock_ops.vcpu_is_preempted(cpu);
+}
+#endif
+
#include <asm/qspinlock.h>

/*
diff --git a/arch/x86/kernel/paravirt-spinlocks.c b/arch/x86/kernel/paravirt-spinlocks.c
index 2c55a00..2f204dd 100644
--- a/arch/x86/kernel/paravirt-spinlocks.c
+++ b/arch/x86/kernel/paravirt-spinlocks.c
@@ -21,12 +21,18 @@ bool pv_is_native_spin_unlock(void)
__raw_callee_save___native_queued_spin_unlock;
}

+static bool native_vcpu_is_preempted(int cpu)
+{
+ return 0;
+}
+
struct pv_lock_ops pv_lock_ops = {
#ifdef CONFIG_SMP
.queued_spin_lock_slowpath = native_queued_spin_lock_slowpath,
.queued_spin_unlock = PV_CALLEE_SAVE(__native_queued_spin_unlock),
.wait = paravirt_nop,
.kick = paravirt_nop,
+ .vcpu_is_preempted = native_vcpu_is_preempted,
#endif /* SMP */
};
EXPORT_SYMBOL(pv_lock_ops);
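
For illustration only, a hypervisor backend overrides the new hook at init time along these lines (hypothetical names; the real KVM and Xen hunks follow in later patches):

	/*
	 * Sketch with assumed names: report preemption from some per-CPU
	 * state the hypervisor shares with the guest.
	 */
	static DEFINE_PER_CPU(u8, my_hv_preempt_state);

	static bool my_hv_vcpu_is_preempted(int cpu)
	{
		return READ_ONCE(per_cpu(my_hv_preempt_state, cpu)) != 0;
	}

	static void __init my_hv_init_spinlocks(void)
	{
		pv_lock_ops.vcpu_is_preempted = my_hv_vcpu_is_preempted;
	}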

Subject: [tip:locking/core] kvm: Introduce kvm_write_guest_offset_cached()

Commit-ID: 4ec6e863625625a54f527464ab91ce1a1cb16c42
Gitweb: http://git.kernel.org/tip/4ec6e863625625a54f527464ab91ce1a1cb16c42
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:34 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:07 +0100

kvm: Introduce kvm_write_guest_offset_cached()

This allows us to partially update some status or field of a structure.

We can also save a kvm_read_guest_cached() call if we just update one
field of the struct regardless of its current value.

Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Typo fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/kvm_host.h | 2 ++
virt/kvm/kvm_main.c | 20 ++++++++++++++------
2 files changed, 16 insertions(+), 6 deletions(-)

diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h
index 01c0b9c..6f00237 100644
--- a/include/linux/kvm_host.h
+++ b/include/linux/kvm_host.h
@@ -645,6 +645,8 @@ int kvm_write_guest(struct kvm *kvm, gpa_t gpa, const void *data,
unsigned long len);
int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
void *data, unsigned long len);
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, int offset, unsigned long len);
int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
gpa_t gpa, unsigned long len);
int kvm_clear_guest_page(struct kvm *kvm, gfn_t gfn, int offset, int len);
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 5c36034..2f38ce5 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -1972,30 +1972,38 @@ int kvm_gfn_to_hva_cache_init(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
}
EXPORT_SYMBOL_GPL(kvm_gfn_to_hva_cache_init);

-int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
- void *data, unsigned long len)
+int kvm_write_guest_offset_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, int offset, unsigned long len)
{
struct kvm_memslots *slots = kvm_memslots(kvm);
int r;
+ gpa_t gpa = ghc->gpa + offset;

- BUG_ON(len > ghc->len);
+ BUG_ON(len + offset > ghc->len);

if (slots->generation != ghc->generation)
kvm_gfn_to_hva_cache_init(kvm, ghc, ghc->gpa, ghc->len);

if (unlikely(!ghc->memslot))
- return kvm_write_guest(kvm, ghc->gpa, data, len);
+ return kvm_write_guest(kvm, gpa, data, len);

if (kvm_is_error_hva(ghc->hva))
return -EFAULT;

- r = __copy_to_user((void __user *)ghc->hva, data, len);
+ r = __copy_to_user((void __user *)ghc->hva + offset, data, len);
if (r)
return -EFAULT;
- mark_page_dirty_in_slot(ghc->memslot, ghc->gpa >> PAGE_SHIFT);
+ mark_page_dirty_in_slot(ghc->memslot, gpa >> PAGE_SHIFT);

return 0;
}
+EXPORT_SYMBOL_GPL(kvm_write_guest_offset_cached);
+
+int kvm_write_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
+ void *data, unsigned long len)
+{
+ return kvm_write_guest_offset_cached(kvm, ghc, data, 0, len);
+}
EXPORT_SYMBOL_GPL(kvm_write_guest_cached);

int kvm_read_guest_cached(struct kvm *kvm, struct gfn_to_hva_cache *ghc,
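
A typical caller (hedged sketch; the real user is the vCPU preemption patch later in this series) updates a single field of a cached guest structure like so; 'struct guest_shared', 'shadow' and 'ghc' are hypothetical:

	/* Write only 'some_field' of a guest structure whose location was
	 * previously registered in the gfn_to_hva_cache 'ghc'. */
	shadow.some_field = 1;
	kvm_write_guest_offset_cached(kvm, &ghc, &shadow.some_field,
				      offsetof(struct guest_shared, some_field),
				      sizeof(shadow.some_field));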

Subject: [tip:locking/core] x86/kvm: Support the vCPU preemption check

Commit-ID: 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb
Gitweb: http://git.kernel.org/tip/0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:35 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:08 +0100

x86/kvm: Support the vCPU preemption check

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vCPUs
than physical CPUs in the system) as doing busy waits for preempted
vCPUs will hurt system performance far worse than early yielding.

Use struct kvm_steal_time::preempted to indicate whether a vCPU
is running or not.

Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Typo fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
arch/x86/kvm/x86.c | 16 ++++++++++++++++
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/uapi/asm/kvm_para.h b/arch/x86/include/uapi/asm/kvm_para.h
index 94dc8ca..1421a65 100644
--- a/arch/x86/include/uapi/asm/kvm_para.h
+++ b/arch/x86/include/uapi/asm/kvm_para.h
@@ -45,7 +45,9 @@ struct kvm_steal_time {
__u64 steal;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
};

#define KVM_STEAL_ALIGNMENT_BITS 5
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 04c5d96..59c2d6f 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2071,6 +2071,8 @@ static void record_steal_time(struct kvm_vcpu *vcpu)
&vcpu->arch.st.steal, sizeof(struct kvm_steal_time))))
return;

+ vcpu->arch.st.steal.preempted = 0;
+
if (vcpu->arch.st.steal.version & 1)
vcpu->arch.st.steal.version += 1; /* first time write, random junk */

@@ -2826,8 +2828,22 @@ void kvm_arch_vcpu_load(struct kvm_vcpu *vcpu, int cpu)
kvm_make_request(KVM_REQ_STEAL_UPDATE, vcpu);
}

+static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
+{
+ if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
+ return;
+
+ vcpu->arch.st.steal.preempted = 1;
+
+ kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
+ &vcpu->arch.st.steal.preempted,
+ offsetof(struct kvm_steal_time, preempted),
+ sizeof(vcpu->arch.st.steal.preempted));
+}
+
void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ kvm_steal_time_set_preempted(vcpu);
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();

Subject: [tip:locking/core] x86/kvm: Support the vCPU preemption check

Commit-ID: 1885aa7041c9e801e5d5b093b9dad38937ca37f6
Gitweb: http://git.kernel.org/tip/1885aa7041c9e801e5d5b093b9dad38937ca37f6
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:36 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:08 +0100

x86/kvm: Support the vCPU preemption check

Support the vcpu_is_preempted() functionality under KVM. This will
enhance lock performance on overcommitted hosts (more runnable vCPUs
than physical CPUs in the system) as doing busy waits for preempted
vCPUs will hurt system performance far worse than early yielding.

struct kvm_steal_time::preempted indicates whether a vCPU is running or
not, following commit "x86, kvm/x86.c: support vCPU preempted check".

unix benchmark result:
host: kernel 4.8.1, i5-4570, 4 cpus
guest: kernel 4.8.1, 8 vcpus

test-case after-patch before-patch
Execl Throughput | 18307.9 lps | 11701.6 lps
File Copy 1024 bufsize 2000 maxblocks | 1352407.3 KBps | 790418.9 KBps
File Copy 256 bufsize 500 maxblocks | 367555.6 KBps | 222867.7 KBps
File Copy 4096 bufsize 8000 maxblocks | 3675649.7 KBps | 1780614.4 KBps
Pipe Throughput | 11872208.7 lps | 11855628.9 lps
Pipe-based Context Switching | 1495126.5 lps | 1490533.9 lps
Process Creation | 29881.2 lps | 28572.8 lps
Shell Scripts (1 concurrent) | 23224.3 lpm | 22607.4 lpm
Shell Scripts (8 concurrent) | 3531.4 lpm | 3211.9 lpm
System Call Overhead | 10385653.0 lps | 10419979.0 lps

Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/kernel/kvm.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c
index edbbfc8..0b48dd2 100644
--- a/arch/x86/kernel/kvm.c
+++ b/arch/x86/kernel/kvm.c
@@ -415,6 +415,15 @@ void kvm_disable_steal_time(void)
wrmsr(MSR_KVM_STEAL_TIME, 0, 0);
}

+static bool kvm_vcpu_is_preempted(int cpu)
+{
+ struct kvm_steal_time *src;
+
+ src = &per_cpu(steal_time, cpu);
+
+ return !!src->preempted;
+}
+
#ifdef CONFIG_SMP
static void __init kvm_smp_prepare_boot_cpu(void)
{
@@ -471,6 +480,9 @@ void __init kvm_guest_init(void)
if (kvm_para_has_feature(KVM_FEATURE_STEAL_TIME)) {
has_steal_clock = 1;
pv_time_ops.steal_clock = kvm_steal_clock;
+#ifdef CONFIG_PARAVIRT_SPINLOCKS
+ pv_lock_ops.vcpu_is_preempted = kvm_vcpu_is_preempted;
+#endif
}

if (kvm_para_has_feature(KVM_FEATURE_PV_EOI))
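
Putting the guest-side pieces together (illustrative only, no new code here):

	/*
	 * Resulting call chain in a KVM guest:
	 *   vcpu_is_preempted(cpu)
	 *     -> pv_lock_ops.vcpu_is_preempted(cpu)   (asm/spinlock.h)
	 *     -> kvm_vcpu_is_preempted(cpu)           (this patch)
	 *     -> per_cpu(steal_time, cpu).preempted   (set by the host in
	 *        kvm_arch_vcpu_put(), cleared again in record_steal_time())
	 */
	static bool lock_holder_vcpu_preempted(int cpu)
	{
		return vcpu_is_preempted(cpu);
	}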

Subject: [tip:locking/core] x86/xen: Support the vCPU preemption check

Commit-ID: de7689cf8f3820b34088da3b22ce1a548dda2fc5
Gitweb: http://git.kernel.org/tip/de7689cf8f3820b34088da3b22ce1a548dda2fc5
Author: Juergen Gross <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:37 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:09 +0100

x86/xen: Support the vCPU preemption check

Support the vcpu_is_preempted() functionality under Xen. This will
enhance lock performance on overcommitted hosts (more runnable vCPUs
than physical CPUs in the system) as doing busy waits for preempted
vCPUs will hurt system performance far worse than early yielding.

A quick test (4 vCPUs on 1 physical CPU doing a parallel build job
with "make -j 8") reduced system time by about 5% with this patch.

Signed-off-by: Juergen Gross <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
arch/x86/xen/spinlock.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/arch/x86/xen/spinlock.c b/arch/x86/xen/spinlock.c
index 3d6e006..74756bb 100644
--- a/arch/x86/xen/spinlock.c
+++ b/arch/x86/xen/spinlock.c
@@ -114,7 +114,6 @@ void xen_uninit_lock_cpu(int cpu)
per_cpu(irq_name, cpu) = NULL;
}

-
/*
* Our init of PV spinlocks is split in two init functions due to us
* using paravirt patching and jump labels patching and having to do
@@ -137,6 +136,8 @@ void __init xen_init_spinlocks(void)
pv_lock_ops.queued_spin_unlock = PV_CALLEE_SAVE(__pv_queued_spin_unlock);
pv_lock_ops.wait = xen_qlock_wait;
pv_lock_ops.kick = xen_qlock_kick;
+
+ pv_lock_ops.vcpu_is_preempted = xen_vcpu_stolen;
}

/*

Subject: [tip:locking/core] Documentation/virtual/kvm: Support the vCPU preemption check

Commit-ID: 3dd3e0ce7989b645eee0174b17f5095e187c7f28
Gitweb: http://git.kernel.org/tip/3dd3e0ce7989b645eee0174b17f5095e187c7f28
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:38 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:09 +0100

Documentation/virtual/kvm: Support the vCPU preemption check

Commit ("x86/kvm: support vCPU preemption check") added a new
struct kvm_steal_time::preempted field. This field tells us if
a vCPU is running or not.

It is zero if an older KVM does not support this field or if the vCPU
is not preempted. Other values mean the vCPU has been preempted.

Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Radim Krčmář <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Various typo fixes. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
Documentation/virtual/kvm/msr.txt | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/Documentation/virtual/kvm/msr.txt b/Documentation/virtual/kvm/msr.txt
index 2a71c8f..0a9ea51 100644
--- a/Documentation/virtual/kvm/msr.txt
+++ b/Documentation/virtual/kvm/msr.txt
@@ -208,7 +208,9 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
__u64 steal;
__u32 version;
__u32 flags;
- __u32 pad[12];
+ __u8 preempted;
+ __u8 u8_pad[3];
+ __u32 pad[11];
}

whose data will be filled in by the hypervisor periodically. Only one
@@ -232,6 +234,11 @@ MSR_KVM_STEAL_TIME: 0x4b564d03
nanoseconds. Time during which the vcpu is idle, will not be
reported as steal time.

+ preempted: indicate the vCPU who owns this struct is running or
+ not. Non-zero values mean the vCPU has been preempted. Zero
+ means the vCPU is not preempted. NOTE, it is always zero if the
+ the hypervisor doesn't support this field.
+
MSR_KVM_EOI_EN: 0x4b564d04
data: Bit 0 is 1 when PV end of interrupt is enabled on the vcpu; 0
when disabled. Bit 1 is reserved and must be zero. When PV end of

Subject: [tip:locking/core] locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()

Commit-ID: 5aff60a191e579ae00ae5ca6ce16c13b687bc8a3
Gitweb: http://git.kernel.org/tip/5aff60a191e579ae00ae5ca6ce16c13b687bc8a3
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:29 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:10 +0100

locking/osq: Break out of spin-wait busy waiting loop for a preempted vCPU in osq_lock()

An over-committed guest with more vCPUs than pCPUs has a heavy overload
in osq_lock().

This is because if vCPU-A holds the osq lock and yields out, vCPU-B ends
up waiting for per_cpu node->locked to be set. IOW, vCPU-B waits for
vCPU-A to run and unlock the osq lock.

Use the new vcpu_is_preempted(cpu) interface to detect if a vCPU is
currently running or not, and break out of the spin-loop if so.

test case:

$ perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
2.49% sched-messaging [kernel.vmlinux] [k] system_call

after patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
4.12% sched-messaging [kernel.vmlinux] [k] system_call
3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

Suggested-by: Boqun Feng <[email protected]>
Tested-by: Juergen Gross <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
[ Translated to English. ]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/osq_lock.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/kernel/locking/osq_lock.c b/kernel/locking/osq_lock.c
index 4ea2710..a316794 100644
--- a/kernel/locking/osq_lock.c
+++ b/kernel/locking/osq_lock.c
@@ -21,6 +21,11 @@ static inline int encode_cpu(int cpu_nr)
return cpu_nr + 1;
}

+static inline int node_cpu(struct optimistic_spin_node *node)
+{
+ return node->cpu - 1;
+}
+
static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
{
int cpu_nr = encoded_cpu_val - 1;
@@ -118,8 +123,10 @@ bool osq_lock(struct optimistic_spin_queue *lock)
while (!READ_ONCE(node->locked)) {
/*
* If we need to reschedule bail... so we can block.
+ * Use vcpu_is_preempted() to avoid waiting for a preempted
+ * lock holder:
*/
- if (need_resched())
+ if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
goto unqueue;

cpu_relax();

Subject: [tip:locking/core] locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted

Commit-ID: 05ffc951392df57edecc2519327b169210c3df75
Gitweb: http://git.kernel.org/tip/05ffc951392df57edecc2519327b169210c3df75
Author: Pan Xinhui <[email protected]>
AuthorDate: Wed, 2 Nov 2016 05:08:30 -0400
Committer: Ingo Molnar <[email protected]>
CommitDate: Tue, 22 Nov 2016 12:48:10 +0100

locking/mutex: Break out of expensive busy-loop on {mutex,rwsem}_spin_on_owner() when owner vCPU is preempted

An over-committed guest with more vCPUs than pCPUs has a heavy overload
in the two spin_on_owner() loops. This is caused by the lock holder
preemption issue.

Break out of the loop if the vCPU is preempted: if vcpu_is_preempted(cpu)
is true.

test-case:
perf record -a perf bench sched messaging -g 400 -p && perf report

before patch:
20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
4.12% sched-messaging [kernel.vmlinux] [k] system_call
3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

after patch:
9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
5.28% sched-messaging [unknown] [H] 0xc0000000000768e0
4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
3.02% sched-messaging [kernel.vmlinux] [k] system_call
2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task

Tested-by: Juergen Gross <[email protected]>
Signed-off-by: Pan Xinhui <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Acked-by: Christian Borntraeger <[email protected]>
Acked-by: Paolo Bonzini <[email protected]>
Cc: [email protected]
Cc: Linus Torvalds <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/locking/mutex.c | 13 +++++++++++--
kernel/locking/rwsem-xadd.c | 14 +++++++++++---
2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/kernel/locking/mutex.c b/kernel/locking/mutex.c
index c073168..9b34961 100644
--- a/kernel/locking/mutex.c
+++ b/kernel/locking/mutex.c
@@ -364,7 +364,11 @@ bool mutex_spin_on_owner(struct mutex *lock, struct task_struct *owner)
*/
barrier();

- if (!owner->on_cpu || need_resched()) {
+ /*
+ * Use vcpu_is_preempted to detect lock holder preemption issue.
+ */
+ if (!owner->on_cpu || need_resched() ||
+ vcpu_is_preempted(task_cpu(owner))) {
ret = false;
break;
}
@@ -389,8 +393,13 @@ static inline int mutex_can_spin_on_owner(struct mutex *lock)

rcu_read_lock();
owner = __mutex_owner(lock);
+
+ /*
+ * As lock holder preemption issue, we both skip spinning if task is not
+ * on cpu or its cpu is preempted
+ */
if (owner)
- retval = owner->on_cpu;
+ retval = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
rcu_read_unlock();

/*
diff --git a/kernel/locking/rwsem-xadd.c b/kernel/locking/rwsem-xadd.c
index 263e744..6315060 100644
--- a/kernel/locking/rwsem-xadd.c
+++ b/kernel/locking/rwsem-xadd.c
@@ -336,7 +336,11 @@ static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
goto done;
}

- ret = owner->on_cpu;
+ /*
+ * As lock holder preemption issue, we both skip spinning if task is not
+ * on cpu or its cpu is preempted
+ */
+ ret = owner->on_cpu && !vcpu_is_preempted(task_cpu(owner));
done:
rcu_read_unlock();
return ret;
@@ -362,8 +366,12 @@ static noinline bool rwsem_spin_on_owner(struct rw_semaphore *sem)
*/
barrier();

- /* abort spinning when need_resched or owner is not running */
- if (!owner->on_cpu || need_resched()) {
+ /*
+ * abort spinning when need_resched or owner is not running or
+ * owner's cpu is preempted.
+ */
+ if (!owner->on_cpu || need_resched() ||
+ vcpu_is_preempted(task_cpu(owner))) {
rcu_read_unlock();
return false;
}

2016-12-19 11:42:47

by Andrea Arcangeli

[permalink] [raw]
Subject: Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

Hello,

On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
> Support the vcpu_is_preempted() functionality under KVM. This will
> enhance lock performance on overcommitted hosts (more runnable vcpus
> than physical cpus in the system) as doing busy waits for preempted
> vcpus will hurt system performance far worse than early yielding.
>
> Use one field of struct kvm_steal_time ::preempted to indicate that if
> one vcpu is running or not.
>
> Signed-off-by: Pan Xinhui <[email protected]>
> Acked-by: Paolo Bonzini <[email protected]>
> ---
> arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
> arch/x86/kvm/x86.c | 16 ++++++++++++++++
> 2 files changed, 19 insertions(+), 1 deletion(-)
>
[..]
> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
> +{
> + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
> + return;
> +
> + vcpu->arch.st.steal.preempted = 1;
> +
> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
> + &vcpu->arch.st.steal.preempted,
> + offsetof(struct kvm_steal_time, preempted),
> + sizeof(vcpu->arch.st.steal.preempted));
> +}
> +
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + kvm_steal_time_set_preempted(vcpu);
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> vcpu->arch.last_host_tsc = rdtsc();

You can't call kvm_steal_time_set_preempted() in atomic context (neither
in the sched_out notifier nor in vcpu_put() after preempt_disable()).
__copy_to_user() in kvm_write_guest_offset_cached() schedules and locks
up the host.

kvm->srcu (or kvm->slots_lock) is also not taken and
kvm_write_guest_offset_cached needs to call kvm_memslots which
requires it.

This, I think, is why postcopy live migration locks up with current
upstream; it doesn't seem related to userfaultfd at all (initially I
suspected the vmf conversion, but it wasn't that), and in theory it
can happen with heavy swapping or page migration too.

It's just that the page is written so frequently that it's unlikely to be
swapped out. The page being written so frequently also means it's very
likely to be found re-dirtied when postcopy starts, which pretty much
guarantees a userfault will trigger a scheduling event in
kvm_steal_time_set_preempted() on the destination. So the odds of
reproducing this with swapping vs. postcopy live migration are opposite.

For now I applied the two patches below, but this will just skip the
write and only prevent the host instability, as nobody checks the
retval of __copy_to_user (what happens to the guest after the write is
skipped is less clear and should be investigated, but at least the
host will survive, and not all guests will care about this flag being
updated). For this to be fully safe the preempted information should
be just a hint and not fundamental to the correct functioning of the
guest pv spinlock code.

This bug was introduced in commit
0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.

From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Sat, 17 Dec 2016 18:43:52 +0100
Subject: [PATCH 1/2] kvm: fix schedule in atomic in
kvm_steal_time_set_preempted()

kvm_steal_time_set_preempted() isn't disabling page faults before
calling __copy_to_user(), and the kernel debug checks notice.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/kvm/x86.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1f0d238..2dabaeb 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ /*
+ * Disable page faults because we're in atomic context here.
+ * kvm_write_guest_offset_cached() would call might_fault()
+ * that relies on pagefault_disable() to tell if there's a
+ * bug. NOTE: the write to guest memory may not go through if
+ * during postcopy live migration or if there's heavy guest
+ * paging.
+ */
+ pagefault_disable();
kvm_steal_time_set_preempted(vcpu);
+ pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);
vcpu->arch.last_host_tsc = rdtsc();


From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
From: Andrea Arcangeli <[email protected]>
Date: Sat, 17 Dec 2016 19:13:32 +0100
Subject: [PATCH 2/2] kvm: take srcu lock around kvm_steal_time_set_preempted()

kvm_memslots() will be called by kvm_write_guest_offset_cached() so
take the srcu lock.

Signed-off-by: Andrea Arcangeli <[email protected]>
---
arch/x86/kvm/x86.c | 7 +++++++
1 file changed, 7 insertions(+)

diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 2dabaeb..02e6ab4 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -2844,6 +2844,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)

void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
{
+ int idx;
/*
* Disable page faults because we're in atomic context here.
* kvm_write_guest_offset_cached() would call might_fault()
@@ -2853,7 +2854,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
* paging.
*/
pagefault_disable();
+ /*
+ * kvm_memslots() will be called by
+ * kvm_write_guest_offset_cached() so take the srcu lock.
+ */
+ idx = srcu_read_lock(&vcpu->kvm->srcu);
kvm_steal_time_set_preempted(vcpu);
+ srcu_read_unlock(&vcpu->kvm->srcu, idx);
pagefault_enable();
kvm_x86_ops->vcpu_put(vcpu);
kvm_put_guest_fpu(vcpu);


2016-12-19 13:57:12

by Pan Xinhui

[permalink] [raw]
Subject: Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check

Hi Andrea,
thanks for your reply. :)

On 2016/12/19 19:42, Andrea Arcangeli wrote:
> Hello,
>
> On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
>> Support the vcpu_is_preempted() functionality under KVM. This will
>> enhance lock performance on overcommitted hosts (more runnable vcpus
>> than physical cpus in the system) as doing busy waits for preempted
>> vcpus will hurt system performance far worse than early yielding.
>>
>> Use one field of struct kvm_steal_time ::preempted to indicate that if
>> one vcpu is running or not.
>>
>> Signed-off-by: Pan Xinhui <[email protected]>
>> Acked-by: Paolo Bonzini <[email protected]>
>> ---
>> arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
>> arch/x86/kvm/x86.c | 16 ++++++++++++++++
>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>
> [..]
>> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>> +{
>> + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
>> + return;
>> +
>> + vcpu->arch.st.steal.preempted = 1;
>> +
>> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
>> + &vcpu->arch.st.steal.preempted,
>> + offsetof(struct kvm_steal_time, preempted),
>> + sizeof(vcpu->arch.st.steal.preempted));
>> +}
>> +
>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> {
>> + kvm_steal_time_set_preempted(vcpu);
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>> vcpu->arch.last_host_tsc = rdtsc();
>
> You can't call kvm_steal_time_set_preempted in atomic context (neither
> in sched_out notifier nor in vcpu_put() after
> preempt_disable)). __copy_to_user in kvm_write_guest_offset_cached
> schedules and locks up the host.
>
Yes, you are right! :) We are aware of these problems.
I am going to introduce something like kvm_write_guest_XXX_atomic and use it instead of kvm_write_guest_offset_cached().
Within pagefault_disable()/enable() we cannot call __copy_to_user(), I think.
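
For concreteness, one possible shape of such an atomic helper (purely a sketch under assumptions, not a patch) would avoid sleeping by bailing out on every slow path, since the preempted flag is only a hint anyway:

	/* Hypothetical kvm_write_guest_offset_cached_atomic(): never sleeps,
	 * gives up instead of faulting pages in or refreshing a stale cache. */
	static int kvm_write_guest_offset_cached_atomic(struct kvm *kvm,
			struct gfn_to_hva_cache *ghc, void *data,
			int offset, unsigned long len)
	{
		struct kvm_memslots *slots = kvm_memslots(kvm);

		if (offset + len > ghc->len)
			return -EINVAL;
		if (slots->generation != ghc->generation)
			return -EAGAIN;	/* cache is stale, skip the write */
		if (!ghc->memslot || kvm_is_error_hva(ghc->hva))
			return -EFAULT;
		/* may fail under postcopy/swap; callers treat it as best effort */
		if (__copy_to_user_inatomic((void __user *)ghc->hva + offset,
					    data, len))
			return -EFAULT;
		mark_page_dirty_in_slot(ghc->memslot,
					(ghc->gpa + offset) >> PAGE_SHIFT);
		return 0;
	}

(The kvm->srcu requirement you point out below would still apply around the kvm_memslots() call.)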

> kvm->srcu (or kvm->slots_lock) is also not taken and
> kvm_write_guest_offset_cached needs to call kvm_memslots which
> requires it.
>
Let me check the details later. Thanks for pointing it out.

> This I think is why postcopy live migration locks up with current
> upstream, and it doesn't seem related to userfaultfd at all (initially
> I suspected the vmf conversion but it wasn't that) and in theory it
> can happen with heavy swapping or page migration too.
>
> Just the page is written so frequently it's unlikely to be swapped
> out. The page being written so frequently also means it's very likely
> found as re-dirtied when postcopy starts and that pretty much
> guarantees an userfault will trigger a scheduling event in
> kvm_steal_time_set_preempted in destination. There are opposite
> probabilities of reproducing this with swapping vs postcopy live
> migration.
>

Good analysis. :)

> For now I applied the below two patches, but this just will skip the
> write and only prevent the host instability as nobody checks the
> retval of __copy_to_user (what happens to guest after the write is
> skipped is not as clear and should be investigated, but at least the
> host will survive and not all guests will care about this flag being
> updated). For this to be fully safe the preempted information should
> be just an hint and not fundamental for correct functionality of the
> guest pv spinlock code.
>
> This bug was introduced in commit
> 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.
>
> From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <[email protected]>
> Date: Sat, 17 Dec 2016 18:43:52 +0100
> Subject: [PATCH 1/2] kvm: fix schedule in atomic in
> kvm_steal_time_set_preempted()
>
> kvm_steal_time_set_preempted() isn't disabling the pagefaults before
> calling __copy_to_user and the kernel debug notices.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/kvm/x86.c | 10 ++++++++++
> 1 file changed, 10 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 1f0d238..2dabaeb 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + /*
> + * Disable page faults because we're in atomic context here.
> + * kvm_write_guest_offset_cached() would call might_fault()
> + * that relies on pagefault_disable() to tell if there's a
> + * bug. NOTE: the write to guest memory may not go through if
> + * during postcopy live migration or if there's heavy guest
> + * paging.
> + */
> + pagefault_disable();
> kvm_steal_time_set_preempted(vcpu);
> + pagefault_enable();
Can we just add this?
I think it is better to modify kvm_steal_time_set_preempted() so that it runs correctly in atomic context.

thanks
xinhui

> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
> vcpu->arch.last_host_tsc = rdtsc();
>
>
> From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
> From: Andrea Arcangeli <[email protected]>
> Date: Sat, 17 Dec 2016 19:13:32 +0100
> Subject: [PATCH 2/2] kvm: take srcu lock around kvm_steal_time_set_preempted()
>
> kvm_memslots() will be called by kvm_write_guest_offset_cached() so
> take the srcu lock.
>
> Signed-off-by: Andrea Arcangeli <[email protected]>
> ---
> arch/x86/kvm/x86.c | 7 +++++++
> 1 file changed, 7 insertions(+)
>
> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
> index 2dabaeb..02e6ab4 100644
> --- a/arch/x86/kvm/x86.c
> +++ b/arch/x86/kvm/x86.c
> @@ -2844,6 +2844,7 @@ static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>
> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> {
> + int idx;
> /*
> * Disable page faults because we're in atomic context here.
> * kvm_write_guest_offset_cached() would call might_fault()
> @@ -2853,7 +2854,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
> * paging.
> */
> pagefault_disable();
> + /*
> + * kvm_memslots() will be called by
> + * kvm_write_guest_offset_cached() so take the srcu lock.
> + */
> + idx = srcu_read_lock(&vcpu->kvm->srcu);
> kvm_steal_time_set_preempted(vcpu);
> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
> pagefault_enable();
> kvm_x86_ops->vcpu_put(vcpu);
> kvm_put_guest_fpu(vcpu);
>
>

2016-12-19 14:39:37

by Paolo Bonzini

[permalink] [raw]
Subject: Re: [PATCH v7 08/11] x86, kvm/x86.c: support vcpu preempted check



On 19/12/2016 14:56, Pan Xinhui wrote:
> hi, Andrea
> thanks for your reply. :)
>
> On 2016/12/19 19:42, Andrea Arcangeli wrote:
>> Hello,
>>
>> On Wed, Nov 02, 2016 at 05:08:35AM -0400, Pan Xinhui wrote:
>>> Support the vcpu_is_preempted() functionality under KVM. This will
>>> enhance lock performance on overcommitted hosts (more runnable vcpus
>>> than physical cpus in the system) as doing busy waits for preempted
>>> vcpus will hurt system performance far worse than early yielding.
>>>
>>> Use one field of struct kvm_steal_time ::preempted to indicate that if
>>> one vcpu is running or not.
>>>
>>> Signed-off-by: Pan Xinhui <[email protected]>
>>> Acked-by: Paolo Bonzini <[email protected]>
>>> ---
>>> arch/x86/include/uapi/asm/kvm_para.h | 4 +++-
>>> arch/x86/kvm/x86.c | 16 ++++++++++++++++
>>> 2 files changed, 19 insertions(+), 1 deletion(-)
>>>
>> [..]
>>> +static void kvm_steal_time_set_preempted(struct kvm_vcpu *vcpu)
>>> +{
>>> + if (!(vcpu->arch.st.msr_val & KVM_MSR_ENABLED))
>>> + return;
>>> +
>>> + vcpu->arch.st.steal.preempted = 1;
>>> +
>>> + kvm_write_guest_offset_cached(vcpu->kvm, &vcpu->arch.st.stime,
>>> + &vcpu->arch.st.steal.preempted,
>>> + offsetof(struct kvm_steal_time, preempted),
>>> + sizeof(vcpu->arch.st.steal.preempted));
>>> +}
>>> +
>>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>>> {
>>> + kvm_steal_time_set_preempted(vcpu);
>>> kvm_x86_ops->vcpu_put(vcpu);
>>> kvm_put_guest_fpu(vcpu);
>>> vcpu->arch.last_host_tsc = rdtsc();
>>
>> You can't call kvm_steal_time_set_preempted in atomic context (neither
>> in sched_out notifier nor in vcpu_put() after
>> preempt_disable)). __copy_to_user in kvm_write_guest_offset_cached
>> schedules and locks up the host.
>>
> yes, you are right! :) we have known the problems.
> I am going to introduce something like kvm_write_guest_XXX_atomic and
> use them instead of kvm_write_guest_offset_cached.
> within pagefault_disable()/enable(), we can not call __copy_to_user I
> think.

Since I have already botched the revert and need to resend the fix, I'm
going to apply Andrea's patches. The preempted flag is only advisory
anyway.

Thanks,

Paolo

>> kvm->srcu (or kvm->slots_lock) is also not taken and
>> kvm_write_guest_offset_cached needs to call kvm_memslots which
>> requires it.
>>
> let me check the details later. thanks for pointing it out.
>
>> This I think is why postcopy live migration locks up with current
>> upstream, and it doesn't seem related to userfaultfd at all (initially
>> I suspected the vmf conversion but it wasn't that) and in theory it
>> can happen with heavy swapping or page migration too.
>>
>> Just the page is written so frequently it's unlikely to be swapped
>> out. The page being written so frequently also means it's very likely
>> found as re-dirtied when postcopy starts and that pretty much
>> guarantees an userfault will trigger a scheduling event in
>> kvm_steal_time_set_preempted in destination. There are opposite
>> probabilities of reproducing this with swapping vs postcopy live
>> migration.
>>
>
> Good analyze. :)
>
>> For now I applied the below two patches, but this just will skip the
>> write and only prevent the host instability as nobody checks the
>> retval of __copy_to_user (what happens to guest after the write is
>> skipped is not as clear and should be investigated, but at least the
>> host will survive and not all guests will care about this flag being
>> updated). For this to be fully safe the preempted information should
>> be just an hint and not fundamental for correct functionality of the
>> guest pv spinlock code.
>>
>> This bug was introduced in commit
>> 0b9f6c4615c993d2b552e0d2bd1ade49b56e5beb in v4.9-rc7.
>>
>> From 458897fd44aa9b91459a006caa4051a7d1628a23 Mon Sep 17 00:00:00 2001
>> From: Andrea Arcangeli <[email protected]>
>> Date: Sat, 17 Dec 2016 18:43:52 +0100
>> Subject: [PATCH 1/2] kvm: fix schedule in atomic in
>> kvm_steal_time_set_preempted()
>>
>> kvm_steal_time_set_preempted() isn't disabling the pagefaults before
>> calling __copy_to_user and the kernel debug notices.
>>
>> Signed-off-by: Andrea Arcangeli <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 10 ++++++++++
>> 1 file changed, 10 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 1f0d238..2dabaeb 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2844,7 +2844,17 @@ static void kvm_steal_time_set_preempted(struct
>> kvm_vcpu *vcpu)
>>
>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> {
>> + /*
>> + * Disable page faults because we're in atomic context here.
>> + * kvm_write_guest_offset_cached() would call might_fault()
>> + * that relies on pagefault_disable() to tell if there's a
>> + * bug. NOTE: the write to guest memory may not go through if
>> + * during postcopy live migration or if there's heavy guest
>> + * paging.
>> + */
>> + pagefault_disable();
>> kvm_steal_time_set_preempted(vcpu);
>> + pagefault_enable();
> can we just add this?
> I think it is better to modify kvm_steal_time_set_preempted() and let it
> run correctly in atomic context.
>
> thanks
> xinhui
>
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>> vcpu->arch.last_host_tsc = rdtsc();
>>
>>
>> From 2845eba22ac74c5e313e3b590f9dac33e1b3cfef Mon Sep 17 00:00:00 2001
>> From: Andrea Arcangeli <[email protected]>
>> Date: Sat, 17 Dec 2016 19:13:32 +0100
>> Subject: [PATCH 2/2] kvm: take srcu lock around
>> kvm_steal_time_set_preempted()
>>
>> kvm_memslots() will be called by kvm_write_guest_offset_cached() so
>> take the srcu lock.
>>
>> Signed-off-by: Andrea Arcangeli <[email protected]>
>> ---
>> arch/x86/kvm/x86.c | 7 +++++++
>> 1 file changed, 7 insertions(+)
>>
>> diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
>> index 2dabaeb..02e6ab4 100644
>> --- a/arch/x86/kvm/x86.c
>> +++ b/arch/x86/kvm/x86.c
>> @@ -2844,6 +2844,7 @@ static void kvm_steal_time_set_preempted(struct
>> kvm_vcpu *vcpu)
>>
>> void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> {
>> + int idx;
>> /*
>> * Disable page faults because we're in atomic context here.
>> * kvm_write_guest_offset_cached() would call might_fault()
>> @@ -2853,7 +2854,13 @@ void kvm_arch_vcpu_put(struct kvm_vcpu *vcpu)
>> * paging.
>> */
>> pagefault_disable();
>> + /*
>> + * kvm_memslots() will be called by
>> + * kvm_write_guest_offset_cached() so take the srcu lock.
>> + */
>> + idx = srcu_read_lock(&vcpu->kvm->srcu);
>> kvm_steal_time_set_preempted(vcpu);
>> + srcu_read_unlock(&vcpu->kvm->srcu, idx);
>> pagefault_enable();
>> kvm_x86_ops->vcpu_put(vcpu);
>> kvm_put_guest_fpu(vcpu);
>>
>>
>